Methodology - A progressive alignment approach to template extraction

4.4 A progressive alignment approach to template extraction

4.4.2 Methodology

An overview of the proposed framework is depicted in Figure 4.15. Given the note-level representations of N singing performances with a common melodic template, we first normalise all transcriptions to a relative time scale. Subsequently, we establish an alignment order based on pair-wise alignment scores. After initialising a weighted directed graph structure with the information contained in the first transcription, we progressively construct the model by aligning the remaining transcriptions according to the alignment order. The graph is updated after each step. Using a modification of Dijkstra’s algorithm [47], we estimate the melodic template as the average heaviest path through the graph. Comparing the edge weights of this path to the sum of all edge weights in the model, we derive a measure for melodic stability among the performances. By computing the alignment score of an unlabelled performance transcription with the resulting model, or alternatively the extracted template, we can estimate its similarity to the set of transcriptions from which the model was constructed. Below, all stages are described in detail.

Melody representation

The proposed method operates on the note-level transcriptions described in Section 4.2. In both versions of the transcriptions, AT and MT, each note is described by its onset time, pitch (MIDI pitch quantised to the equal-tempered scale) and duration. In order to facilitate pair-wise and, later on, progressive alignment of note sequences, we first convert each note transcription into a point-set representation $x^k = (x^k_1, \ldots, x^k_{M_k})$, where $M_k$ denotes the number of notes of the $k$th sequence and each point $x^k_i = (t^k_i, p^k_i)$ is defined by its relative onset time $t^k_i$ and pitch value $p^k_i$. An example of the resulting melody representation is shown in Figure 4.16. The relative onset time $t^k_i \in [0, 1]$ is computed as the relative position between the onset of the first and the offset of the last sung note, discarding silences between notes.

Fig. 4.15 System overview.
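The conversion to a point set could be sketched as follows. This is a minimal illustration and not the implementation used in this work: the `(onset, duration, pitch)` tuple layout and the function name `to_point_set` are assumptions, and "discarding silences" is read here as measuring time on an axis made up of the sung note durations only.

```python
from typing import List, Tuple

# Assumed note layout: (onset in seconds, duration in seconds, MIDI pitch).
Note = Tuple[float, float, int]


def to_point_set(notes: List[Note]) -> List[Tuple[float, int]]:
    """Map notes to (relative onset, pitch) points on a silence-free time axis."""
    ordered = sorted(notes, key=lambda note: note[0])
    total = sum(duration for _, duration, _ in ordered)  # total sung time
    if total <= 0:
        raise ValueError("notes must have a positive total duration")
    points = []
    elapsed = 0.0  # sung time before the current note, silences excluded
    for _, duration, pitch in ordered:
        points.append((elapsed / total, pitch))
        elapsed += duration
    return points


# Three notes with a 0.5 s silence after the first one.
melody = [(35.0, 1.0, 60), (36.5, 0.5, 62), (37.0, 2.5, 60)]
print(to_point_set(melody))  # [(0.0, 60), (0.25, 62), (0.375, 60)]
```

Under this reading, the silence between the first two notes does not shift the relative onset of the second note.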

Key normalisation and pair-wise alignment

In order to align two melodies, we again employ the Needleman-Wunsch (NW) algorithm [143]. The method and its application to melodic sequence alignment have been described in detail in Section 3.2. Here, we use a gap penalty of g = 1 together with the following scoring function:

$$\gamma(x^k_i, x^l_j) = \begin{cases} 1 - |t^k_i - t^l_j| & \text{if } p^k_i = p^l_j \\ -1 & \text{if } p^k_i \neq p^l_j \end{cases} \tag{4.4}$$

In other words, if both notes are equal in pitch, they are considered a match and a non-negative score $\gamma(x^k_i, x^l_j) \in [0, 1]$ is assigned based on their onset proximity. Otherwise, the pair is considered a mismatch and a score of −1 is assigned, so that a mismatch costs as much as a gap.
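To make the role of the scoring function concrete, the following sketch implements a textbook global alignment with the scores of Equation 4.4. It is an illustration under stated assumptions, not the code used in this work: the names `gamma` and `needleman_wunsch` are hypothetical, the gap penalty g = 1 is assumed to be subtracted once per gap, and ties in the trace-back are resolved in favour of the diagonal step.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, int]  # (relative onset in [0, 1], MIDI pitch)
GAP_PENALTY = 1.0  # g = 1, assumed to be subtracted for every gap


def gamma(a: Point, b: Point) -> float:
    """Scoring function of Equation 4.4."""
    if a[1] == b[1]:
        return 1.0 - abs(a[0] - b[0])
    return -1.0


def needleman_wunsch(x: List[Point], y: List[Point]):
    """Return (score, path); path holds index pairs, None marks a gap."""
    n, m = len(x), len(y)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = -GAP_PENALTY * i
    for j in range(1, m + 1):
        d[0][j] = -GAP_PENALTY * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = max(
                d[i - 1][j - 1] + gamma(x[i - 1], y[j - 1]),  # pair both notes
                d[i - 1][j] - GAP_PENALTY,                    # gap in y
                d[i][j - 1] - GAP_PENALTY,                    # gap in x
            )
    # Trace back from the bottom-right cell to recover the alignment path.
    path: List[Tuple[Optional[int], Optional[int]]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + gamma(x[i - 1], y[j - 1]):
            path.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or d[i][j] == d[i - 1][j] - GAP_PENALTY):
            path.append((i - 1, None))
            i -= 1
        else:
            path.append((None, j - 1))
            j -= 1
    path.reverse()
    return d[n][m], path


x = [(0.0, 60), (0.5, 62), (0.75, 64)]
y = [(0.0, 60), (0.6, 64)]
score, path = needleman_wunsch(x, y)
print(round(score, 2), path)
```

In this toy example the two outer notes are matched and the middle note of `x` is aligned to a gap.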

[Figure 4.16, two panels. Top: manually corrected MIDI transcription, MIDI pitch against time in seconds. Bottom: temporally normalised melody representation, MIDI pitch against relative time.]

Fig. 4.16 Top: Manually corrected note transcription of a Fandango de Valverde sung by El Raya; bottom: corresponding temporally normalised point set representation.

The alignment of two sequences, $x^k$ and $x^l$, yields an alignment score $d_{\mathrm{align}}(x^k, x^l)$ and an alignment path $P_{\mathrm{align}}(x^k, x^l)$. Since versions of the same melody may be performed in different keys, we furthermore employ the key normalisation procedure described in Section 2.2 prior to each pair-wise alignment.

Establishment of the alignment order

In the progressive alignment procedure to be described, the alignment order is of major importance. While the first sequences to be aligned strongly shape the model, sequences aligned later have a weaker influence. We therefore aim to align typical or common performances before aligning outliers. To this end, we establish an alignment order A based on pair-wise alignment scores.

We construct the score matrix $S$ holding pair-wise alignment scores between all melody representations, where

$$S(k, l) = d_{\mathrm{align}}(x^k, x^l). \tag{4.5}$$

Since the alignment scores are length-dependent, we normalise each pair-wise score by dividing it by the average length of the two sequences involved:

$$S(k, l) := \frac{S(k, l)}{0.5\,(M_k + M_l)}. \tag{4.6}$$

We now establish the alignment order $A$ by sorting the sequences by their average similarity

$$\bar{S}_k = \frac{1}{N} \sum_{\substack{n=1 \\ n \neq k}}^{N} S(k, n) \tag{4.7}$$

to all other sequences in descending order:

$$A = \{x^1, x^2, \ldots, x^N \mid \bar{S}_k \geq \bar{S}_{k+1};\ k = 1, \ldots, N - 1\}. \tag{4.8}$$

In this way, melodies with a high average similarity to all other melodies appear early in the alignment order, and melodies which are dissimilar to other melodies are aligned last.

The complexity of establishing an alignment order of $N$ melodies containing at most $M$ notes each amounts to $O(N^2 M^2) + O(N \log N)$. The first term originates from the pair-wise alignments, whereas the second corresponds to the cost of the sorting procedure.
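The ordering step itself is short once the pair-wise scores are available. The sketch below assumes a precomputed matrix of raw alignment scores and a list of sequence lengths; the function name `alignment_order` and the 0-based indexing are illustrative choices, not part of the method description.

```python
from typing import List


def alignment_order(scores: List[List[float]], lengths: List[int]) -> List[int]:
    """Return sequence indices sorted by descending average normalised score."""
    n = len(scores)
    # Length normalisation: divide by the average length of both sequences.
    normalised = [
        [scores[k][l] / (0.5 * (lengths[k] + lengths[l])) for l in range(n)]
        for k in range(n)
    ]
    # Equation 4.7: sum over all other sequences, divided by N.
    mean_similarity = [
        sum(normalised[k][l] for l in range(n) if l != k) / n for k in range(n)
    ]
    return sorted(range(n), key=lambda k: mean_similarity[k], reverse=True)


# Toy scores for three sequences of four notes each (diagonal is ignored).
raw_scores = [
    [4.0, 3.0, 1.0],
    [3.0, 4.0, 2.5],
    [1.0, 2.5, 4.0],
]
print(alignment_order(raw_scores, [4, 4, 4]))  # [1, 0, 2]
```

Here the second sequence is, on average, the most similar to the others and is therefore used to initialise the model.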

Model initialisation

As a next step, we initialise a weighted directed graph $G$ with vertices $V$ and edges $E$ based on $x^1$, the first transcription according to the alignment order. For each note $x^1_i$ in this transcription, we create a vertex $v_i \in V$ and, for each note transition from $i$ to $i + 1$, we add an edge $e_{i,i+1} \in E$ connecting $v_i$ to $v_{i+1}$ and set its weight $w_{i,i+1} \in W$ to 1. Furthermore, we assign a score $q^G_i \in Q$ to each vertex $v_i$, which we again initialise with $q^G_i = 1$, and maintain the pitch values $p^G_i \in P$ and onset times $t^G_i \in T$ associated with the note from which the vertex was created. We add two additional vertices, $v_S$ and $v_E$, which act as fictitious start and end points of the melodic sequence. The edge $e_{S,1}$ connects $v_S$ to the vertex representing the first note and $e_{M_1,E}$ connects the vertex representing the last note to $v_E$. The weights of both edges are initialised with $w_{S,1} = w_{M_1,E} = 1$. We denote the number of vertices contained in $G$ by $|V|$. An example of the resulting structure $G(V, E, W, Q, P, T)$ is shown in Figure 4.17. Note that $W$ is an edge property and each $w_{i,i+1} \in W$ is associated with the edge $e_{i,i+1} \in E$. Similarly, $Q$, $P$ and $T$ are vertex properties, where elements $q_i \in Q$, $p_i \in P$ and $t_i \in T$ are associated with the vertex $v_i \in V$. An algorithmic description of the model initialisation step is provided in Algorithm 4.1.
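One possible in-memory layout for the initial model is sketched below. The class `MelodyGraph`, the labels `START` and `END` for $v_S$ and $v_E$, and the use of dictionaries keyed by 0-based vertex index are assumptions made for illustration; any structure that stores $W$ per edge and $Q$, $P$, $T$ per vertex would serve equally well.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, List, Tuple

START, END = "S", "E"  # hypothetical labels for the fictitious vertices


@dataclass
class MelodyGraph:
    score: Dict[Hashable, int] = field(default_factory=dict)    # Q, per vertex
    pitch: Dict[Hashable, int] = field(default_factory=dict)    # P, per vertex
    onset: Dict[Hashable, float] = field(default_factory=dict)  # T, per vertex
    weight: Dict[Tuple[Hashable, Hashable], int] = field(default_factory=dict)  # W, per edge


def init_graph(points: List[Tuple[float, int]]) -> MelodyGraph:
    """Build the initial model from the first point set in the alignment order."""
    if not points:
        raise ValueError("the first transcription must contain at least one note")
    graph = MelodyGraph()
    for i, (onset, pitch) in enumerate(points):
        graph.score[i] = 1
        graph.pitch[i] = pitch
        graph.onset[i] = onset
    for i in range(len(points) - 1):
        graph.weight[(i, i + 1)] = 1  # one edge per note transition
    graph.weight[(START, 0)] = 1
    graph.weight[(len(points) - 1, END)] = 1
    return graph


model = init_graph([(0.0, 60), (0.5, 62), (0.75, 60)])
print(sorted(model.weight, key=str))
```

The edge set is represented implicitly by the keys of `weight`, so an edge exists exactly when it has a weight.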

Progressive alignment and model update

Once the model is initialised, we successively align the remaining sequences according to the order $A$ and update the model after each step. First, let $x^G$ be the sequence of notes corresponding to the vertices in graph $G$, where the order of notes is determined by their associated onset times. We align a new sequence $x^k$ to $x^G$ using the method described in Section 4.4.2 with two minor modifications. First, for $x^G$, we replace the duration weights of the pitch histogram required during the key normalisation process with the vertex scores. This is necessary because, after several updates, $x^G$ contains notes which originate from different recordings. Therefore, the absolute durations are not directly comparable. Using vertex scores is a suitable choice, since more importance will be given to characteristic notes which have previously formed part of alignment paths.

Fig. 4.17 Initialisation of the graph model: Melody representation (left) and derived initial graph structure (right).

Algorithm 4.1 Initialisation of the graph model G(V, E, W, Q, P, T) based on x^1.

    G(V, E, W, Q, P, T) ← empty graph structure
    M_1 ← length of x^1
    for i = 1 to M_1 do            # add vertices and their properties
        add v_i to V
        add q_i = 1 to Q
        add p^G_i = p^1_i to P
        add t^G_i = t^1_i to T
    for i = 1 to M_1 − 1 do        # add edges and their weights
        add e_{i,i+1} to E
        add w_{i,i+1} = 1 to W
    return G(V, E, W, Q, P, T)

Secondly, in a similar manner, we modify the scoring function γ of the NW algorithm to

$$\gamma(x^k_i, x^G_j) = \begin{cases} q^G_j \, (1 - |t^k_i - t^G_j|) & \text{if } p^k_i = p^G_j \\ -1 & \text{if } p^k_i \neq p^G_j \end{cases} \tag{4.9}$$

where $q^G_j$ is the score and $t^G_j$ the onset associated with the $j$th vertex. In this way, an alignment with vertices which have already accumulated a high score will be favoured. After tracing the optimal alignment path, we can extract the sequence of matches. We denote by $P_{\mathrm{match}} = \{P_k, P_G\}$ the sequence of matches, where $P_k$ holds the note indices of $x^k$ which have been matched along the optimal path with corresponding elements of $x^G$, held in $P_G$. Note that gaps or local similarities with a negative score are not considered. Consequently, $P_{\mathrm{match}}$ is a subsequence of $P_{\mathrm{align}}$, where element pairs containing gaps are omitted.

In order to update the model, we first increase the score of all vertices contained in $P_G$ by 1. In this way, the score of a vertex reflects how many times the corresponding note has formed part of a matching path. Then, for each note $x^k_u$ in $x^k$ with $u \notin P_k$, that is, each note which has not been matched along the optimal path, we add a new vertex $v_{\mathrm{new}}$ to the model with pitch $p^k_u$ and score 1. We compute the onset time $t^G_{\mathrm{new}}$ of the new vertex by linear interpolation (Figure 4.18): Let $t^k_{\mathrm{prev}}$ with $\mathrm{prev} \in P_k$ be the onset of the last note preceding $x^k_u$ which has been matched to a vertex in the model and, similarly, let $t^k_{\mathrm{post}}$ with $\mathrm{post} \in P_k$ be the onset of the first note following $x^k_u$ which has been matched to the model. Furthermore, let $t^G_{\mathrm{prev}}$ and $t^G_{\mathrm{post}}$ be the onset times associated with the vertices matched with $t^k_{\mathrm{prev}}$ and $t^k_{\mathrm{post}}$, respectively. The onset time of the newly inserted vertex is then interpolated as

$$t^G_{\mathrm{new}} = t^G_{\mathrm{prev}} + \frac{t^k_u - t^k_{\mathrm{prev}}}{t^k_{\mathrm{post}} - t^k_{\mathrm{prev}}} \cdot (t^G_{\mathrm{post}} - t^G_{\mathrm{prev}}). \tag{4.10}$$

If $x^k_u$ lies before the first matched note in $P_k$, then

$$t^G_{\mathrm{new}} = t^G_{\mathrm{post}} - |t^k_u - t^k_{\mathrm{post}}| \tag{4.11}$$

and, analogously, if $x^k_u$ is located after the last matched note in $P_k$, $t^G_{\mathrm{new}}$ is given by

$$t^G_{\mathrm{new}} = t^G_{\mathrm{prev}} + |t^k_u - t^k_{\mathrm{prev}}|. \tag{4.12}$$

As a last step, we update the edge weights: For each note transition $(i, i + 1)$, $i \in [1, M_k - 1]$, in the sequence $x^k$, we increase the weight of the edge connecting the corresponding vertices (which have either been matched to a note or been newly created) by 1. This also applies to the edge connecting $v_S$ to the first note as well as the edge connecting the last note in $x^k$ to $v_E$.
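The three onset interpolation cases of Equations 4.10 to 4.12 could be sketched as follows. The function name `interpolate_onset` and the representation of the matches as a list of `(t_k, t_G)` onset pairs sorted by `t_k` are assumptions for illustration; the sketch also assumes that note onsets within a sequence are strictly increasing.

```python
from typing import List, Tuple


def interpolate_onset(t_u: float, matches: List[Tuple[float, float]]) -> float:
    """Model onset for an unmatched note with relative onset t_u.

    matches holds (t_k, t_G) onset pairs of matched notes, sorted by t_k.
    """
    before = [pair for pair in matches if pair[0] < t_u]
    after = [pair for pair in matches if pair[0] > t_u]
    if not before and not after:
        raise ValueError("no matched note on either side of t_u")
    if before and after:
        # Equation 4.10: linear interpolation between the enclosing matches.
        (tk_prev, tg_prev), (tk_post, tg_post) = before[-1], after[0]
        ratio = (t_u - tk_prev) / (tk_post - tk_prev)
        return tg_prev + ratio * (tg_post - tg_prev)
    if after:
        # Equation 4.11: the note lies before the first matched note.
        tk_post, tg_post = after[0]
        return tg_post - abs(t_u - tk_post)
    # Equation 4.12: the note lies after the last matched note.
    tk_prev, tg_prev = before[-1]
    return tg_prev + abs(t_u - tk_prev)


matched_onsets = [(0.2, 0.25), (0.6, 0.75)]
for onset in (0.1, 0.4, 0.9):
    print(onset, round(interpolate_onset(onset, matched_onsets), 3))
```

In the outer two cases the distance to the nearest matched note is carried over unchanged, so the new onset is not rescaled to the model's time axis.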

Given that the computational complexity of each pair-wise alignment of two sequences of length $M$ is $O(M^2)$ and that, at its last stage, the model has at most $O(NM)$ vertices, the maximum cost for updating the model amounts to $O(NM^2)$.

The process of updating the graph model is depicted in Figure 4.19 and pseudo-code is provided in Algorithm 4.2. This step is repeated until all sequences in A have been processed. The model allows dynamic updating and can be further extended with additional labelled sequences.
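The score and edge-weight bookkeeping of a single update can be sketched compactly once every note of the new sequence has been assigned a model vertex. The names below are hypothetical: `vertex_path` lists, in note order, the vertex each note of the new sequence was matched to or newly created as, `matched` corresponds to $P_G$, and `START` and `END` stand for $v_S$ and $v_E$. New vertices are assumed to have been inserted with score 1 beforehand.

```python
from typing import Dict, Hashable, List, Tuple

START, END = "S", "E"  # hypothetical labels for v_S and v_E


def update_scores_and_weights(
    vertex_path: List[Hashable],
    matched: List[Hashable],
    score: Dict[Hashable, int],
    weight: Dict[Tuple[Hashable, Hashable], int],
) -> None:
    """Apply one model update in place."""
    for vertex in matched:
        score[vertex] += 1  # vertex was part of the matching path once more
    full_path = [START] + list(vertex_path) + [END]
    for u, v in zip(full_path, full_path[1:]):
        # Create the edge with weight 1, or reinforce an existing edge.
        weight[(u, v)] = weight.get((u, v), 0) + 1


# Model built from a three-note melody; vertex 3 was newly created.
score = {0: 1, 1: 1, 2: 1, 3: 1}
weight = {(START, 0): 1, (0, 1): 1, (1, 2): 1, (2, END): 1}
update_scores_and_weights([0, 3, 2], matched=[0, 2], score=score, weight=weight)
print(score)
print(weight)
```

After this update, the transitions through the new vertex exist with weight 1, while the edges from the start vertex and to the end vertex have been reinforced.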

In the following, we describe how we use the resulting model to (a) extract a note sequence which approximates the template, (b) derive a measure for melodic stability, and (c) compute the joint melodic similarity between an unlabelled sequence and the union of melodies contained in the model.

Algorithm 4.2 Updating the graph model G(V, E, W, Q, P, T) with x^k.

    x^G ← CONVERT_TO_NOTE_SEQUENCE(G)
    P_align ← COMPUTE_ALIGNMENT_PATH(x^k, x^G)
    P_match(P_k, P_G) ← empty sequence
    for i = 1 to length(P_align) do
        if gap ∉ P_align(i) then
            append P_align(i) to P_match
    for i ∈ P_G do                              # increase score of matched notes
        q_i = q_i + 1
    for i = 1 to M_k do                         # add new vertices for unmatched notes
        if i ∉ P_k then
            add v_new to V
            add q_new = 1 to Q
            add p^G_new = p^k_i to P
            add t^G_new = INTERPOLATE_ONSET(t^k_i, P_align) to T
            P_align(i, 2) = index(v_new)        # replace gaps with index of new vertex
    for i = 1 to M_k − 1 do                     # add edges and update weights
        sNode = P_align(i, 2)
        eNode = P_align(i + 1, 2)
        if e_{sNode,eNode} ∈ E then             # edge already exists
            w_{sNode,eNode} = w_{sNode,eNode} + 1
        else
            add e_{sNode,eNode} to E            # create new edge
            add w_{sNode,eNode} = 1 to W
    if e_{S,P_align(1,2)} ∈ E then              # create or update edge from start node
        w_{S,P_align(1,2)} = w_{S,P_align(1,2)} + 1
    else
        add e_{S,P_align(1,2)} to E             # create new edge
        add w_{S,P_align(1,2)} = 1 to W
    if e_{P_align(end,2),E} ∈ E then            # create or update edge to end node
        w_{P_align(end,2),E} = w_{P_align(end,2),E} + 1
    else
        add e_{P_align(end,2),E} to E           # create new edge
        add w_{P_align(end,2),E} = 1 to W

Fig. 4.18 Interpolation of onset time $t^G_{\mathrm{new}}$: (a) $t^k_u$ is located between two matched note onsets; (b) $t^k_u$ is located before the first matched onset; (c) $t^k_u$ is located after the last matched onset.

Template extraction

From the graph model G, we now aim to extract a note sequence xT which approximates the underlying melodic template on which all N performances are based. Ideally, the template will reflect the melodic essence which the performances have in common and should therefore contain note successions which can be found in most transcriptions. Ornaments and segments which vary among performances, on the other hand, should be discarded. To this end, we compute the average heaviest path B through G, connecting the start node v_S to the end node v_E, that is, the path from v_S to v_E with the highest average edge weight among all possible paths. In this way, we capture frequent notes and note transitions, while omitting ornamentation. We formulate the task of finding the average heaviest path as a variant of the well-studied shortest path problem [1] and solve it with a modification of Dijkstra’s algorithm [47], as shown in Algorithm 4.3. Note that this approach is valid in the given scenario, since G is a directed acyclic graph (DAG) and the solution can therefore be obtained by progressively extending the solutions of subproblems.

By construction, the graph has a linear ordering, and we process all vertices one by one, from left to right. Starting at $v_S$, we successively parse all vertices until reaching $v_E$ and maintain the average heaviest path leading to each vertex. When visiting vertex $v_i$, we examine each of its unvisited neighbours. We compute the average weight of the concatenation of the best path leading to $v_i$ and the edge connecting $v_i$ to its neighbour $v_{n,i}$. If this value exceeds the average weight of the best path stored for reaching $v_{n,i}$, we replace the best path accordingly. After parsing all neighbours, $v_i$ is marked as visited and we proceed to parsing its strongest connected neighbour. This process is repeated until $v_E$ has been marked as visited, and the best path stored for $v_E$ is returned as the average heaviest path $B$ from $v_S$ to $v_E$. An example is shown in Figure 4.20. Finally, we define the template $x^T$ as the sequence of notes associated with the vertices through which $B$ passes, excluding $v_S$ and $v_E$.

Fig. 4.19 Model update: (a) Melody representation, alignment to the model and match sequence Pk; (b) model before update and match sequence Pm; (c) model after update.

Fig. 4.20 Average heaviest path (black) through a graph G. Circle sizes and line widths are proportional to vertex scores and edge weights, respectively. Vertices and edges constituting the template are marked black, others grey.

Algorithm 4.3 Algorithm for finding the average heaviest path from v_S to v_E in graph G(V, E).

    for all v ∈ V do
        bestPath[v] ← empty array
    U ← V
    curr ← v_S
    while v_E ∈ U do
        for all v_n ∈ neighbours[curr] ∩ U do
            avWeight ← sum(w(e) ∈ bestPath[curr]) + w(e_{curr,v_n})
            avWeight ← avWeight / (length(bestPath[curr]) + 1)
            if avWeight > sum(w(e) ∈ bestPath[v_n]) / length(bestPath[v_n]) then
                bestPath[v_n] ← append e(curr, v_n) to bestPath[curr]
        remove curr from U
        curr ← neighbours[curr] with largest weight connecting to curr
    return bestPath[v_E]
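For illustration, the relaxation rule described above can be written as a single pass over the vertices. This is a sketch under stated assumptions rather than a transcription of Algorithm 4.3: vertices are visited in a given topological order (for example sorted by onset, with the start vertex first and the end vertex last) instead of hopping to the strongest neighbour, a vertex without a stored path accepts the first candidate, and the function name `average_heaviest_path` is hypothetical. As in the description, only one candidate path is kept per vertex.

```python
from typing import Dict, Hashable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def average_heaviest_path(order: List[Hashable], weight: Dict[Edge, float]) -> List[Hashable]:
    """Return the vertex path from order[0] to order[-1] kept by the relaxation.

    order must be a topological order of all vertices of the DAG.
    """
    start, end = order[0], order[-1]
    successors: Dict[Hashable, List[Hashable]] = {v: [] for v in order}
    for u, v in weight:
        successors[u].append(v)
    # best[v] = (sum of edge weights, vertex path) of the path stored for v
    best: Dict[Hashable, Tuple[float, List[Hashable]]] = {start: (0.0, [start])}
    for u in order:
        if u not in best:
            continue  # u cannot be reached from the start vertex
        total_u, path_u = best[u]
        for v in successors[u]:
            total = total_u + weight[(u, v)]
            average = total / len(path_u)  # the extended path has len(path_u) edges
            if v in best:
                total_v, path_v = best[v]
                if average <= total_v / (len(path_v) - 1):
                    continue  # stored path has an equal or higher average
            best[v] = (total, path_u + [v])
    if end not in best:
        raise ValueError("end vertex is not reachable from the start vertex")
    return best[end][1]


# Main line S-0-1-E sung three times, detour 0-2-1 sung once.
weights = {("S", 0): 3, (0, 1): 3, (1, "E"): 3, (0, 2): 1, (2, 1): 1}
print(average_heaviest_path(["S", 0, 2, 1, "E"], weights))  # ['S', 0, 1, 'E']
```

The detour through vertex 2, which plays the role of an ornament in this toy graph, is left out of the returned path.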

The asymptotic complexity of extracting the average heaviest path on acyclic graphs is $O(|V| + |E|)$. In our case, this would yield a complexity of $O(NM)$, since each melody adds at most $M$ vertices and $2M$ edges to the model. However, in practice we expect a large number of notes to be matched with existing model vertices, resulting in significantly fewer vertices and edges being added at each step. Assuming the number of newly added vertices and edges per step to be constant results in an expected linear runtime in practice.

Quantifying melodic stability

Based on the graph model $G$ and the average heaviest path $B$, we now define a measure for melodic stability among the analysed performances. To this end, we analyse the weights of the edges forming $B$ with respect to the sum of all edge weights in $G$. If the performances are very similar to each other, $B$ will concentrate a large percentage of the total edge weight contained in $G$. The more the performed melodies vary, the weaker the edges of $B$ will be. Consequently, we compute the melodic stability $r$ as the fraction of the total edge weight in $G$ that is contained in $B$:

$$r = \frac{\sum_{e \in B} w(e)}{\sum_{e \in E} w(e)}. \tag{4.13}$$
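Equation 4.13 translates directly into a few lines. In the sketch below, the path is given as a list of vertices including the start and end vertices, and edge weights are stored in a dictionary keyed by vertex pairs; both conventions, and the name `melodic_stability`, are assumptions for illustration.

```python
from typing import Dict, Hashable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def melodic_stability(path: List[Hashable], weight: Dict[Edge, float]) -> float:
    """Fraction of the total edge weight that lies on the given path."""
    total_weight = sum(weight.values())
    if total_weight <= 0:
        raise ValueError("the graph must contain edges with positive weight")
    path_weight = sum(weight[(u, v)] for u, v in zip(path, path[1:]))
    return path_weight / total_weight


# Main line S-0-1-E with weight 3 per edge, one detour 0-2-1 with weight 1 per edge.
weights = {("S", 0): 3, (0, 1): 3, (1, "E"): 3, (0, 2): 1, (2, 1): 1}
print(round(melodic_stability(["S", 0, 1, "E"], weights), 3))
```

In this toy graph the path carries 9 of the 11 weight units, so $r = 9/11$; a value of 1 would mean that all performances follow the template exactly.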

Evaluating a transcription against a model

In this work, we explore two strategies to estimate the similarity between a query melody $x^q$ and a set of $K$ performance transcriptions. The first option is to align $x^q$ to the sequence representing the graph model, $x^G$, using the NW algorithm with the modified scoring function from Equation 4.9, and to estimate the similarity as the resulting score $\gamma(x^G, x^q)$ of the optimal alignment path. In order to perform an alignment to the model, the graph is represented as a sequence, where the order of vertices is determined by their associated onset times $t^G_i$. Given that different models may be constructed from datasets with a varying number of recordings and intra-group similarities, they may exhibit significant differences with respect to the sum of accumulated score values. We therefore normalise the alignment score by dividing by the
