Aligning wave files using audio fingerprints

Local audio file alignment

7.2 Aligning wave files using audio fingerprints

To solve the problem highlighted in section 7.1 we employ a simple audio fingerprinting technique to realign local copies of audio files, ensuring sample-accurate alignment to the original audio used to produce annotations. Using this technique it is possible for other researchers to legally obtain perfectly aligned audio files for use with the annotated data sets, while only a very small amount of metadata (i.e. the fingerprint information) needs to be provided by the owner of the original audio files.

7.2.1 Audio fingerprinting

Audio fingerprinting is often used in content-based retrieval of digital music. In this application, a particular audio file can be found in a large database of files when given a small segment of that same audio as a search query [CBKH05, KAH⁺02, HK02, RHG08]. For this reason, fingerprints must be produced for whole libraries of audio and must be robust to amplitude variation, noise, time-stretching and other distortions. The techniques used for this kind of application usually involve time-frequency analysis in order to produce small fingerprints describing the features of the audio data.

Unlike the usual applications of audio fingerprinting, there is no need to search a large database for a given fingerprint in order to solve our problem. We already know which file the fingerprint belongs to; we are simply interested in aligning the local file relative to the position of the fingerprint data in the original audio. For this reason, we need a fingerprinting technique that does not distort time information so that we may achieve sample-accurate alignment.

7.2.2 Fingerprint technique

The technique that we have developed uses the sign of the first backward difference to generate fingerprint data. This simple algorithm produces a 1 or a 0 for each sample of the audio signal depending on the sign of the difference between the current audio sample and the previous sample. For a discrete time signal $x$, the backward difference $\nabla_n$ at time point $n$ is given by:

$$\nabla_n = x_n - x_{n-1} \tag{7.1}$$

Figure 7.1: Signal $x_n$ (amplitude against time) with corresponding sign of first backward difference $\nabla_n$ and fingerprint function $\rho_n(x)$. $\rho_n(x)$ is invariant to different amplitude scalings of $x_n$.

We may describe the sign of the first backward difference, function $\rho_n(x)$, mathematically in the following way:

$$\rho_n(x) = \begin{cases} 1 & \text{if } \nabla_n \geq 0 \\ 0 & \text{if } \nabla_n < 0 \end{cases} \tag{7.2}$$

Figure 7.1 shows that $\rho_n(x)$ is a binary function that tells us in which direction the amplitude of the signal is moving. We can see that while the signal rises in amplitude the function produces a string of ones and while it falls, it produces zeros. A transition from 1 to 0 or from 0 to 1 denotes a change in direction. No attempt is made to track the absolute signal amplitude, so $\rho_n(x)$ is invariant to different amplitude scalings of the original signal.
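As a minimal sketch of equations (7.1) and (7.2), assuming the audio is available as a plain sequence of numeric samples, the fingerprint function could be computed as follows. The name `sign_fingerprint` is hypothetical. The backward difference is undefined for the first sample, so the sketch assumes a difference of zero there, which makes the first bit 1; this convention is not taken from the text.

```python
from typing import List, Sequence


def sign_fingerprint(x: Sequence[float]) -> List[int]:
    """Return one bit per sample: 1 if the backward difference is >= 0, else 0.

    Assumption (not from the text): the difference at n = 0 is taken as 0,
    so the first bit is always 1.
    """
    bits = []
    for n in range(len(x)):
        diff = 0.0 if n == 0 else x[n] - x[n - 1]  # equation (7.1)
        bits.append(1 if diff >= 0 else 0)          # equation (7.2)
    return bits


signal = [0.0, 0.5, 1.0, 0.5, 0.0, 0.25]
quieter = [0.1 * s for s in signal]  # same signal with a smaller positive gain

print(sign_fingerprint(signal))
print(sign_fingerprint(signal) == sign_fingerprint(quieter))
```

The second print compares the fingerprints of the two signals; scaling by a positive gain leaves the sign of every difference unchanged, which is the amplitude invariance described above.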

Figure 7.2: Fingerprint arrangement in original audio $x_n$.

7.2.3 Alignment of local audio files using fingerprints

To align local audio files with the originals we use two fingerprints for each song. The first fingerprint, $\rho^{(1)}_m$, is used to align the local audio; the second, $\rho^{(2)}_m$, is then used to check the alignment. Figure 7.2 shows the arrangement of these two fingerprints in a song file.

If $x_n$ is our original audio signal of length $N$, we define the alignment fingerprint function $\rho^{(1)}_m$ for a particular song as:

$$\rho^{(1)}_m = \rho_{n_A+m}(x) \quad \text{for } 0 \leq m < M \tag{7.3}$$

where $n_A$ is the index of $x_n$ at which the alignment fingerprint starts (i.e. an offset from $x_0$) and $M$ is the length of the fingerprint. We define a check fingerprint for the song, function $\rho^{(2)}_m$, as:

$$\rho^{(2)}_m = \rho_{N-n_C-M+m}(x) \quad \text{for } 0 \leq m < M \tag{7.4}$$

where $n_C$ is the distance of fingerprint $\rho^{(2)}_m$ from the end of the song in samples (i.e. its offset from $x_N$).
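Equations (7.3) and (7.4) amount to taking two slices of the fingerprint function. The sketch below assumes the bits of $\rho_n(x)$ are already held in a list; the function name `extract_fingerprints`, its parameter names and the example values are hypothetical.

```python
from typing import List, Sequence, Tuple


def extract_fingerprints(
    rho: Sequence[int], n_a: int, n_c: int, length: int
) -> Tuple[List[int], List[int]]:
    """Slice the alignment and check fingerprints out of the bit sequence rho."""
    total = len(rho)  # N
    if n_a < 0 or n_c < 0 or n_a + length > total or total - n_c - length < 0:
        raise ValueError("fingerprint does not fit inside the signal")
    align = list(rho[n_a:n_a + length])        # indices n_A + m, eq. (7.3)
    start = total - n_c - length
    check = list(rho[start:start + length])    # indices N - n_C - M + m, eq. (7.4)
    return align, check


# Hand-written bit sequence standing in for the fingerprint of an original file.
rho_x = [1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0]
align_fp, check_fp = extract_fingerprints(rho_x, n_a=4, n_c=2, length=6)
print(align_fp, check_fp)
```

Only these two short bit lists, together with $n_A$, $n_C$ and $N$, would need to be distributed; the audio itself is not required.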

We may now define a fingerprint correlation function $C_n(x)$ to give a measure of how well a fingerprint matches segments of $\rho_n(x)$:

$$C_n(x) = \sum_{m=0}^{M-1} \left| \rho_{n+m}(x) - \rho^{(1)}_m \right| \tag{7.5}$$

for $0 \leq n < N - M$ and $0 \leq m < M$. If $M$ is set to be large enough to avoid false matches then we may assume that the following is true:

$$C_n(x) \begin{cases} = 0 & \text{if } n = n_A \\ \neq 0 & \text{otherwise} \end{cases} \tag{7.6}$$

That is to say, where $C_n(x)$ evaluates to zero the fingerprint exactly matches that section of $\rho_n(x)$, and this should only happen once across the original audio signal, at the point where $n$ is equal to $n_A$.
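Because every bit is 0 or 1, each term of the sum in (7.5) is 1 for a mismatch and 0 for a match, so $C_n(x)$ counts the mismatched bits at offset $n$. A minimal sketch of this computation follows; the function name `fingerprint_distance` and the hand-written bit sequence are hypothetical, and the range of $n$ follows the one stated with (7.5).

```python
from typing import List, Sequence


def fingerprint_distance(rho: Sequence[int], fp: Sequence[int]) -> List[int]:
    """C_n for 0 <= n < len(rho) - len(fp): mismatched bits at each offset n."""
    m_len = len(fp)
    return [
        sum(abs(rho[n + m] - fp[m]) for m in range(m_len))
        for n in range(len(rho) - m_len)
    ]


rho_x = [1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0]
n_a, m_len = 4, 6
fingerprint = rho_x[n_a:n_a + m_len]

c_x = fingerprint_distance(rho_x, fingerprint)
exact_matches = [n for n, value in enumerate(c_x) if value == 0]
print(exact_matches)
```

In this toy sequence the only exact match is at the offset the fingerprint was taken from. With real audio, $M$ would need to be much larger than 6 for the assumption in (7.6) to be reasonable.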

Figure 7.3: Aligning a local audio file (original track $x$ of length $N$, local track $y$ of length $L$) using fingerprint $\rho^{(1)}_m$ for alignment and $\rho^{(2)}_m$ to check. In this example, $n_D$ samples will be removed from the start of $y_n$ and $n_E$ samples will be appended at the end to align it with original signal $x_n$. After alignment using $\rho^{(1)}_m$, if $\rho^{(2)}_m$ matches such that $n_2 = n_C$ then we know the process was successful.

Given this assumption, we may now use $\rho^{(1)}_m$ to align a local audio file.

If $y_n$ is a local audio signal of length $L$, we can find its fingerprint function $\rho_n(y)$ then calculate the correlation function:

$$C_n(y) = \sum_{m=0}^{M-1} \left| \rho_{n+m}(y) - \rho^{(1)}_m \right| \tag{7.7}$$

for $0 \leq n < L - M$ and $0 \leq m < M$. The minimum value of $C_n(y)$ marks the section of $\rho_n(y)$ that best matches the alignment fingerprint $\rho^{(1)}_m$. It should be noted that this value may be non-zero due to noise or distortion such as artefacts caused by compression. As long as local signal $y_n$ is not too heavily distorted, the minimum value of $C_n(y)$ will still mark the best match. We will call the index of this value $n_1$, which can be expressed as:

$$n_1 = \arg\min_n C_n(y) \tag{7.8}$$

Once $n_1$ is determined, we must find the number of samples, $n_D$, that must be inserted or removed at the start of local audio $y_n$:

$$n_D = n_A - n_1 \tag{7.9}$$
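The search in (7.7) to (7.9) can be sketched as below. Everything about the test data is an assumption made for illustration: the "original" is pseudo-random noise rather than music, and the hypothetical local copy has seven extra leading samples and half the gain. The helper names are also hypothetical, and the first fingerprint bit is assumed to be 1 because the backward difference is undefined there.

```python
import random
from typing import List, Sequence


def sign_fingerprint(x: Sequence[float]) -> List[int]:
    """1 where x[n] - x[n-1] >= 0, else 0; the first bit is assumed to be 1."""
    if not x:
        return []
    return [1] + [1 if x[n] - x[n - 1] >= 0 else 0 for n in range(1, len(x))]


def fingerprint_distance(rho: Sequence[int], fp: Sequence[int]) -> List[int]:
    """Mismatched bits between fp and rho at each offset, as in eq. (7.7)."""
    m_len = len(fp)
    return [
        sum(abs(rho[n + m] - fp[m]) for m in range(m_len))
        for n in range(len(rho) - m_len)
    ]


rng = random.Random(0)
original = [rng.uniform(-1.0, 1.0) for _ in range(400)]  # stand-in for x_n
n_a, m_len = 100, 64
align_fp = sign_fingerprint(original)[n_a:n_a + m_len]

# Hypothetical local copy: 7 extra samples at the start and half the gain.
local = [0.0] * 7 + [0.5 * s for s in original]

c_y = fingerprint_distance(sign_fingerprint(local), align_fp)
n_1 = min(range(len(c_y)), key=c_y.__getitem__)  # eq. (7.8), first minimum
n_d = n_a - n_1                                   # eq. (7.9)
print(n_1, n_d)
```

With the sign convention of (7.9), a negative $n_D$ here corresponds to samples that must be removed from the start of the local file, and a positive value to samples that must be inserted.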

To align the local audio with the original audio $x_n$ we alter the start of $y_n$ in the following way:

Figure 7.3 illustrates the relationship between $n_D$, $n_A$ and $n_1$ at the start of a local audio signal $y_n$. In this example $n_D$ samples will be removed at the start in order to align $y_n$ with original signal $x_n$.

Once $y_n$ has been aligned to the start of the original audio $x_n$, we must then ensure the local audio is the same length as $x_n$. This can be done by calculating a value $n_E$, the number of samples that must be appended to or removed from the end of $y_n$:

$$n_E = N - L + n_D \tag{7.11}$$

The example in figure 7.3 shows that $n_E$ samples will be appended to $y_n$ in order to alter it to be length $N$.
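The two adjustments can be sketched as a single function. This is an illustration under stated assumptions rather than the software described here: the name `align_to_original` is hypothetical, inserted samples are assumed to be zeros (silence), and a positive `n_d` is assumed to mean insertion at the start while a negative one means removal. The end adjustment is computed from the length of the shifted signal, so the sketch does not depend on the sign convention chosen for $n_D$.

```python
from typing import List, Sequence


def align_to_original(y: Sequence[float], n_d: int, n_total: int) -> List[float]:
    """Shift the start of y by n_d samples, then force the length to n_total.

    Assumed convention: n_d > 0 inserts n_d zeros at the start,
    n_d < 0 removes -n_d samples from the start.
    """
    if n_d >= 0:
        shifted = [0.0] * n_d + list(y)
    else:
        shifted = list(y[-n_d:])
    tail = n_total - len(shifted)  # samples still missing (+) or surplus (-)
    if tail >= 0:
        return shifted + [0.0] * tail
    return shifted[:n_total]


original = [float(i) for i in range(1, 11)]  # stand-in for x_n, N = 10

# Local copy with 2 extra leading samples and the last 4 samples missing.
late_start = [9.0, 9.0] + original[:6]
print(align_to_original(late_start, n_d=-2, n_total=len(original)))

# Local copy missing its first 3 samples and carrying 2 extra at the end.
early_start = original[3:] + [5.0, 5.0]
print(align_to_original(early_start, n_d=3, n_total=len(original)))
```

In both cases the surviving samples end up at the same indices as in the original, and the result has length $N$.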

7.2.4 Checking the alignment

With the alignment complete, let $\hat{y}_n$ be the newly aligned version of $y_n$. We may now check that the alignment was successful by confirming that fingerprint $\rho^{(2)}_m$ matches $\hat{y}_n$ at the correct position relative to the end of the signal. To do this we calculate the correlation function $C_n(\hat{y})$ thus:

$$C_n(\hat{y}) = \sum_{m=0}^{M-1} \left| \rho_{N-n+m}(\hat{y}) - \rho^{(2)}_m \right| \tag{7.12}$$

for $M \leq n < N$ and $0 \leq m < M$. Then we find the best match for fingerprint $\rho^{(2)}_m$, $n_2$, from the correlation function:

$$n_2 = \arg\min_n C_n(\hat{y}) \tag{7.13}$$

If $n_2$ and $n_C$ are the same value (as is the case shown in figure 7.3) then the fingerprint matches at the same point as it would in the original audio signal $x_n$ and we know that $\hat{y}_n$ is correctly aligned:

$$n_2 - n_C \begin{cases} = 0 & \text{if } \hat{y}_n \text{ is correctly aligned} \\ \neq 0 & \text{otherwise} \end{cases} \tag{7.14}$$

If the alignment fails, the software notifies the user that it was unable to align the local audio file to the original.
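One way to sketch the check is shown below. It is one reading of the procedure, not the software itself: instead of the index $n$ of (7.12), the sketch searches over start positions and compares the best one with $N - n_C - M$, the start of the check fingerprint given by (7.4). The function names and the pseudo-random test signal are hypothetical.

```python
import random
from typing import List, Sequence


def sign_fingerprint(x: Sequence[float]) -> List[int]:
    """1 where x[n] - x[n-1] >= 0, else 0; the first bit is assumed to be 1."""
    if not x:
        return []
    return [1] + [1 if x[n] - x[n - 1] >= 0 else 0 for n in range(1, len(x))]


def check_alignment(aligned: Sequence[float], check_fp: Sequence[int], n_c: int) -> bool:
    """True if check_fp matches best at N - n_C - M in the aligned signal."""
    rho = sign_fingerprint(aligned)
    m_len = len(check_fp)
    # Mismatch count for every possible start position of the fingerprint.
    dist = [
        sum(abs(rho[s + m] - check_fp[m]) for m in range(m_len))
        for s in range(len(rho) - m_len + 1)
    ]
    best_start = min(range(len(dist)), key=dist.__getitem__)
    return best_start == len(aligned) - n_c - m_len


rng = random.Random(1)
original = [rng.uniform(-1.0, 1.0) for _ in range(400)]  # stand-in for x_n
n_c, m_len = 50, 64
start = len(original) - n_c - m_len
check_fp = sign_fingerprint(original)[start:start + m_len]

well_aligned = [0.5 * s for s in original]   # correct position, lower gain
misaligned = original[3:] + [0.0] * 3        # everything 3 samples too early

print(check_alignment(well_aligned, check_fp, n_c))
print(check_alignment(misaligned, check_fp, n_c))
```

A caller could use the boolean result to decide whether to report a failed alignment to the user.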

In document Towards automatic extraction of harmony information from music signals (Page 183-188)