Harmonic Mixing: Key & Beat Detection Algorithms
M.Eng Individual Project - Final Report
Christopher Roebuck
Project Supervisor: Iain Phillips
1. Abstract
Harmonic mixing is the art of mixing together two songs based on their key. In order for a DJ to perform harmonic mixing of two songs, their key and tempo must be known in advance. The aim of this project is to automate the process of detecting the key and tempo of a song, so that a DJ can select two songs which will ‘sound good’ when mixed together.
The result will be a program which, given a song, can detect its key and tempo automatically and enable the user to mix two songs together based on these features. This document outlines the research into various key and beat detection algorithms and the design, implementation and evaluation of such a program.
2. Acknowledgements
I would like to thank my supervisor, Iain Phillips, for proposing the project in the first place and taking the time to meet me regularly throughout the course of the project.
I would also like to thank Christopher Harte for sending me his paper on a Quantised Chromagram, and Kyogu Lee for responding to my emails about the Harmonic Product Spectrum.
Thanks also to Peter Littlewood, Rachel Lau and Tiana Kordbacheh for creating chord samples on which to test my algorithm.
3. Contents
1. Abstract
2. Acknowledgements
3. Contents
4. Table of Figures
5. Introduction
5.1 Motivation for this project
5.2 Major Objectives
5.3 Deeper Objectives
5.4 Report Layout
6. Background
6.1 History of DJ Mixing
6.2 Illustration of Beat Mixing
6.3 Key Detection Algorithms
6.3.1 Musical key extraction from audio
6.3.2 Chord Segmentation and Recognition using EM-Trained Hidden Markov Models
6.3.3 Automatic Chord Recognition from Audio Using Enhanced Pitch Class Profile
6.3.4 A Robust Predominant-F0 Estimation Method for Real-Time Detection of Melody and Bass Lines in CD Recording
6.3.5 A computational model of harmonic chord recognition
6.4 Beat Detection Algorithms
6.4.1 Tempo and Beat Analysis of Acoustic Musical Signals
6.4.2 Analysis of the Meter of Acoustic Musical Signals
6.4.3 Audio Analysis using the Discrete Wavelet Transform
6.4.4 Statistical streaming beat detection
6.5 Similar Projects / Software
6.5.1 Traktor DJ Studio by Native Instruments
6.5.2 Rapid Evolution 2
6.5.3 Mixed in Key
6.5.4 MixMeister
7. Design
7.1 System Architecture
7.2 Key Detection Algorithm Design Rationale
7.3 Beat Detection Algorithm Design Rationale
8. Implementation
8.1 System Implementation
8.2 Detecting the Key
8.3 Detecting the Beats
8.5 Automatic Beat Matching
8.6 Generating and animating the waveforms
9. Testing
9.1 Parameter Testing – Key Detection Algorithm
9.1.1 Bass threshold frequency
9.1.2 Choice of FFT window length
9.1.3 Harmonic Product Spectrum
9.1.4 Weighting System
9.1.5 Time in between overlapping frames
9.1.6 Downsampling
9.2 Parameter Evaluation – Beat Detection Algorithm
9.2.1 Size of Instant Energy
9.2.2 Size of Average Energy
9.2.3 Beat Interval
9.2.4 Low Pass Filtering
10. Evaluation
10.1 Quantitative Evaluation
10.1.1 Key Detection Accuracy Test with Dance Music
10.1.2 Key Detection Accuracy Test with Classical Music
10.1.3 Beat Detection Accuracy Test
10.1.4 Performance Evaluation
10.2 Qualitative Evaluation
10.2.1 Automatic Beat Matching
10.2.2 Graphical User Interface
10.2.3 Pitch Shift and Time Stretching Functions
10.2.4 Overall Evaluation
11. Conclusion
11.1 Appraisal
11.2 Further Work
12. Bibliography
13. Appendix
Appendix A: Introduction to Digital Signal Processing
Appendix B: Specification
Aims of the project
Core Specification
Extended Specification
Appendix C: User Guide
Loading a track into a deck
4. Table of Figures
Figure 1: Crossfader in the left position
Figure 2: Beats, Bars and Loops
Figure 3: Tracks in sync but not in phase
Figure 4: Train wreck mix
Figure 5: Tracks in sync and in phase
Figure 6: Crossfader in central position
Figure 7: Crossfader in right hand position
Figure 8: Circle of Fifths and Camelot Easymix System
Figure 9: Flow diagram of the algorithm from Sheh et al. (9)
Figure 10: PCP vector of a C major triad
Figure 11: Pitch Class Profile of A minor triad
Figure 12: Harmonic Product Spectrum
Figure 13: Comparison of PCP and EPCP vectors from Lee (11)
Figure 14: Flow diagram of Goto’s algorithm (12)
Figure 15: Overview of Scheirer's Algorithm (14)
Figure 16: Waveform showing Tatum, Tactus and Measure
Figure 17: Overview of algorithm from Klapuri et al. (15)
Figure 18: Block diagram of algorithm from Tzanetakis et al. (16)
Figure 19: Traktor DJ Studio
Figure 20: Rapid Evolution 2
Figure 21: Mixed in Key
Figure 22: MixMeister
Figure 23: Overview of System Architecture
Figure 24: System Overview
Figure 25: Key Detection Algorithm Flow Chart
Figure 26: Output from the STFT
Figure 27: Chroma Vector of C Major chord and its correlation with key templates
Figure 28: Overlapping of waveform images
Figure 29: Illustration of the Harmonic Product Spectrum taken from (30)
Figure 30: Chroma Vector showing close correlation between many different key templates
Figure 31: F minor is detected correctly with the weighting system enabled
Figure 32: C minor is detected without the weighting system enabled
Figure 33: Too many beats detected with 50ms beat interval
Figure 34: Beats being detected correctly with beat interval of 350ms
Figure 35: Sound energy variations detected as beats in silent areas of Quivver – Space Manoeuvres
Figure 36: The spacing between these detected beats is closer, leading to higher BPM calculation
Figure 37: Sampling of a signal for 4-bit PCM
Figure 38: How FMOD stores audio data
Figure 39: The Main Screen
Figure 40: Loading Sasha - Magnetic North into Deck A
Figure 41: The Deck Control
Figure 42: Key Detection progress/results
Figure 43: Crossfader in left hand position
Figure 44: Crossfader in central position
5. Introduction
This section sets out the aims and motivation for the project and introduces some of the concepts which will be discussed in greater detail further in the report.
5.1 Motivation for this project
Beat mixing (or beat-matching) is a process employed by DJs to transition between two songs by changing the tempo of a new track to match that of the currently playing track, perfectly aligning the beats of one track with the beats of the other, then mixing or cross-fading between the two so that there is no pause between songs. This is used to keep the flow of the music constant for the pleasure of the listener, both through appreciation of the quality of the mix between records and the lack of time between tracks played back to back providing more variety in the melody and rhythm to dance to. Today's DJ software has simplified the task of beat mixing greatly; however, very few notable forays have addressed the idea of harmonic mixing.
Two tracks can be beat-mixed together perfectly and still sound ‘off’. This is likely to be because the two tracks are out of tune with each other and their harmonic elements are in incompatible keys causing the melodies to clash. Harmonic mixing sets out to address this problem.
Harmonic mixing is the natural evolution of beat mixing: mixing in compatible keys. It is the idea that the currently playing song should only be beat-mixed with another song of compatible key, which makes the transition between the two songs sound pleasurable to the listener. This can give the DJ more creative freedom when performing a mix: they no longer have to rely on large segments of regular beats to make a transition between two songs, and can start to overlay melody sections which are harmonically compatible with each other.
People with perfect pitch may find it easy to identify the key of a song (usually after years of practice), but there seems to be no automatic process, analogous to beat detection algorithms, that would save DJs from manually finding the key of every song in a collection of 1000+ tracks. Even once the keys are known, two songs with compatible keys will not necessarily match, since changing the tempo of a song to achieve the same speed also changes its key. For example, an increase or decrease in tempo of about 6%, measured in beats per minute (BPM), changes the key by one semitone, say from C minor to D-flat minor.
Time-stretching algorithms are therefore essential to lock the key of the track and allow the BPM of the track to be altered independently of pitch/key. Pitch-shifting algorithms can change the pitch/key of the track without affecting the BPM.
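The 6% figure follows from equal temperament, where one semitone corresponds to a frequency ratio of 2^(1/12). The short Python sketch below illustrates the arithmetic; the function names are hypothetical and it assumes the tempo is changed by simply speeding up or slowing down playback, with no time-stretching applied.

```python
import math


def tempo_ratio_for_semitones(semitones: float) -> float:
    """Playback speed ratio that shifts the pitch by the given number of semitones."""
    return 2.0 ** (semitones / 12.0)


def semitone_shift(original_bpm: float, new_bpm: float) -> float:
    """Pitch change in semitones caused by changing tempo without time-stretching."""
    return 12.0 * math.log2(new_bpm / original_bpm)


if __name__ == "__main__":
    # Percentage tempo increase that raises the pitch by exactly one semitone.
    print((tempo_ratio_for_semitones(1) - 1.0) * 100.0)
    # Pitch change, in semitones, when a 120 BPM track is sped up to 130 BPM.
    print(semitone_shift(120.0, 130.0))
```

The exact figure for one semitone is just under 6%, which is why the rounded value is normally quoted.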
5.2 Major Objectives
The primary aim of this project is to design, implement and optimise a key detection algorithm which can work on polyphonic real-world audio. No key detection algorithm thus far can claim to be 100% accurate and as such there have been many different attempts at solving the problem with greater accuracy, each with their own strengths and weaknesses.
As the finished program will be aimed mainly at DJs, the key detection algorithm should be able to accurately extract the key from various types of dance music. The main problem associated with this genre of music is that there is a lot of emphasis on the bass line and bass drum, which may make it
The other two major problems, the detection of beats and the calculation of an appropriate BPM value, can be considered solved. There has been much research into ways of detecting the BPM of a piece of music, and the main challenge is to find a suitable algorithm which can detect beats and calculate a BPM value in the shortest amount of time while maintaining an acceptable level of accuracy.
As this project aims to help DJs perform harmonic mixing, I will also attempt to implement an automatic real-time beat mixing algorithm, which will enable the DJ to beat-mix two tracks together based on their tempo and the position of their beats. The success of this feature will rely heavily on the accuracy of the beat detection algorithm described above.
5.3 Deeper Objectives
There are deeper objectives to this project than simply providing a DJ with an automatic key and tempo detection tool.
This project aims to show that academic and state of the art music analysis techniques can be applied to real world problems in an efficient and reliable manner. Part of this will be to show that disparate areas of research can be combined together successfully.
Finally the project aims to be more than just a research study of feasibility. The result of successful completion will be an application of sufficient reliability and quality that it can be released to, and used by, untrained computer users. This report only lightly touches on this facet of the project, as creating usable polished applications is a reasonably well solved problem, and the least interesting area of this project.
5.4 Report Layout
• The remainder of the report begins with a brief history of DJ mixing and illustrates the concepts of beat mixing and harmonic mixing in Chapter 6 (Background). We then discuss the main literature on beat detection and BPM calculation, along with selected literature on key extraction from music. Finally we compare the strengths and weaknesses of the state of the art to the aims of this project.
• Chapter 7 (Design) gives a brief overview of the overall system design and the rationale behind the design of the algorithms.
• A detailed description of the interesting or problematic areas of project implementation is given in Chapter 8 (Implementation). Trivial and/or uninteresting areas of the project are not mentioned and can be considered to have been implemented successfully.
• The tests performed to determine the optimal values for the parameters of the algorithms are stated in Chapter 9 (Testing).
• A quantitative and qualitative evaluation of the final product is made in Chapter 10 (Evaluation). Analysis of any anomalous results is given.
• The report concludes with Chapter 11 (Conclusion) which covers the strengths and weaknesses
6. Background
Fully understanding this project requires a basic understanding of what a DJ does behind the turntables. First, a brief history of DJ mixing explains its evolution and the advancements in technology and music culture which brought us to where we are today. A more in-depth look at the ‘science’ of beat and harmonic mixing follows, explaining in detail the concept of beat mixing and the extra constraints that harmonic mixing imposes.
A discussion of the most applicable literature for detecting the beats and extracting the key from a song is given followed by an overview of software projects of a similar nature. An Introduction to Digital Signal Processing explaining some of the techniques used in the literature is given in the Appendix.
6.1 History of DJ Mixing
The art of DJ mixing has come a long way since its early appearances. In general, its journey can be said to have passed through four basic stages. Before there was any mixing or blending together of songs, there was the Non-Mixing Jukebox DJ(1). Working with just one deck (or turntable), this DJ's primary skill was to entertain an audience whilst playing requested music, usually at a wedding or some other celebration.
The first dimension of mixing (Basic Fades)(1) occurred as DJs replaced bands as the primary form of club entertainment. The DJ, now working with two decks and a mixer, would fade a new song over the end of the currently playing song, usually with calamitous results. As neither the beats nor the keys were in sync, the overlays would sound like train-wrecks. A train-wreck describes two tracks playing at the same time with their beats unsynchronised, i.e. when your tracks cross, your train will crash(2). To the audience, this sounds like incoherent beats occurring at odd times and making no musical sense.
The second dimension of mixing (Cutting and Scratching)(1) coincided with the appearance of rap as a distinct vocal form. High torque turntables now allowed DJs to ‘cut’ by inserting short musical sections from a second source and to ‘scratch’ by rapidly and rhythmically repeating single beats from a second source, usually by manipulating a vinyl record by hand as it played.
Technological improvements brought about the 3rd Dimension of Mixing (Beat Mixing)(1). By now turntables had accurate speed stability thanks to the arrival of the direct-drive turntable motor, as opposed to older belt-driven turntable motors which over time would wear out, causing records to turn unevenly and affecting tempo. Technics(3) introduced the SL-1200 turntable in 1972 and by 1984 had added features suited to the needs of DJs wanting to beat-mix(4). These features included pitch control, which allowed the DJ to adjust the speed of tracks to match one another. The fact that pressing the start button immediately started the turntable at the desired speed gave the DJ more confidence in starting a new track at exactly the right point and at the correct speed. The Technics SL-1200 turntable also allowed the vinyl to be spun backwards for the first time, so that a DJ could carefully and precisely cue the starting position of the track to fall exactly on the onset of some desired beat.
Alongside the advances in the equipment on which DJs played their sets (live performances) came advances in the technology of dance music production. Most dance music used electronic drums, which locked in a consistent tempo indefinitely. The speed stability of both the music and the turntables allowed DJs to overlay long segments of different records, as long as they could be synchronised, and beat mixing was born.
However, the limitation of beat mixing was that if the desired segments of both songs contained melodies, the result was usually unpleasant because of clashing keys. Thus the DJ either used trial and error to find songs with compatible keys or had to rely on there being a beat-only intro and outro section on each song. This is the reason why most dance music has two to three minutes of continual beats at the beginning and end of the song, to enable the DJ enough time to beat match tracks together.
Harmonic mixing brings the fourth dimension, harmony, to DJ mixing technique: different melodies may only be played simultaneously if they have compatible keys. The gradual shift away from analogue vinyl records to digital audio formats such as CDs and MP3s, combined with the development of time stretching algorithms, made harmonic mixing possible. A time stretching algorithm locks in a certain key whilst allowing tempo to be altered independently. More and more DJs nowadays are letting computers do the beat mixing for them and focusing their attention on being more artistic and creative with their mixes.
6.2 Illustration of Beat Mixing
1. The DJ starts their set with a song, which we shall call Song A. Song A has a tempo of 130BPM. Whilst Song A is playing, the DJ decides which song to play next, which we will call Song B. Song B has a lower tempo than Song A, at 120BPM. The crossfader (part of the mixer) allows multiple audio outputs to be blended together into one output. At the start the crossfader will be in the left position as shown in Figure 1 below, so that only the output from Deck A will be heard by the audience.
2. As the DJ listens to Song B through their headphones, they detect that its tempo is slower than Song A. The DJ increases the tempo of Song B to 130BPM to match that of Song A. Nearly all modern dance music is written in the 4/4 common time signature, i.e. 4 beats to every bar. A typical dance track contains a series of loops of n bars where n is a power of 2, usually 4, 8, 16 or 32. Assume that both Song A and Song B contain a series of 4 bar loops with 4 beats to every bar as illustrated in Figure 2:
Figure 2: Beats, Bars and Loops (a 4 bar loop of 16 beats shown against time, marking 1 beat, 1 bar, 1 loop, the downbeat for this loop and the downbeat for the next loop)
Figure 1: Crossfader in the left position (Deck A: Song A currently playing; Deck B: Song B loaded, audible only to the DJ through headphones; only the output from Deck A is audible to the audience)
The downbeat is the first beat of a loop and is usually signified by an extra sound or accent, such as a cymbal crash. The DJ finds a downbeat towards the beginning of Song B (the first beat of Song B is normally used) and pauses Song B just before the onset of that downbeat. Song B is now cued and ready.
3. The DJ now waits for a downbeat to occur in Song A after the main melody has played out. Song B is started at the exact same time as the onset of the downbeat in Song A. Song B is still only audible through the DJ's headphones. The DJ ensures that the two tracks are in time and in phase. To be in time, the beats of the two tracks must occur at the same time; to be in phase, the downbeats of each track must occur at the same time. Below is an example where the two tracks are in time but not in phase:
4. If the two tracks have different BPMs they will eventually go out of time and out of phase, as the beats of each track drift further and further apart. The following diagram shows this scenario, with Song B being the slower of the two tracks. This is a train-wreck mix:
When the two tracks are in phase and have the same BPM they should be aligned like this:
Figure 3: Tracks in sync but not in phase (beat timelines of Song A and Song B against time)
Figure 4: Train wreck mix (beat timelines of Song A and Song B against time)
Figure 5: Tracks in sync and in phase (beat timelines of Song A and Song B against time)
5. Once the tracks are in time and in phase, the DJ fades in the output from Deck B by sliding the crossfader to the middle. Both tracks are now audible and are being mixed.
6. Finally, after an arbitrary amount of time (or when Song A ends), the crossfader is moved all the way to the right so that only Song B is audible. Song A is taken off the deck and the DJ chooses another song, say Song C, to replace Song A on Deck A. Song B will then be mixed into Song C. Thus we are now back at stage 1, and the cycle continues.
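The drift described in step 4 can be quantified with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and it assumes both tracks hold a perfectly constant tempo and start on the same beat.

```python
def beat_interval(bpm: float) -> float:
    """Time between consecutive beats, in seconds."""
    return 60.0 / bpm


def drift_after(bpm_a: float, bpm_b: float, num_beats: int) -> float:
    """Seconds by which track B lags track A after num_beats beats.

    A positive value means B is the slower track; zero means the tracks
    stay in time.
    """
    return num_beats * (beat_interval(bpm_b) - beat_interval(bpm_a))


if __name__ == "__main__":
    # Song A at 130 BPM against Song B left at 120 BPM, over one 16 beat loop.
    print(drift_after(130.0, 120.0, 16))
    # Once Song B has been sped up to 130 BPM there is no drift.
    print(drift_after(130.0, 130.0, 16))
```

Over a single 16 beat loop the unmatched tracks already drift apart by more than one full beat of Song B, which is why the tempo must be matched before the tracks are started together.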
Harmonic mixing adds constraints to the selection of the next track in stage one of the above cycle. The next track must be in a compatible key to the currently playing track. The circle of fifths illustrates relationships between compatible keys and is used by composers for correct sounding chord progressions(5). Any song is compatible with another song of the same key, its perfect fourth, fifth or relative major/minor.
Using the circle of fifths, a song in C Major is compatible with another song of C Major, a song in F Major, a song in G Major or a song in A Minor. To make this easier to use, Camelot Sound came up with the ‘Easymix’ system where each key is assigned a keycode, 1-12A for Minor keys and 1-12B for Major keys(6). Using the Easymix chart, a song with keycode 1A (A-Flat Minor) can be mixed together with another song of keycode 1A, 2A (E-Flat Minor), 12A (D-Flat Minor) or 1B (B Major).
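The compatibility rule stated above is simple to express in code. The following Python sketch is an illustration only (the function name is hypothetical); it assumes keycodes are written as a number from 1 to 12 followed by the letter A or B, and returns the same keycode, its two neighbours on the wheel, and the relative major/minor.

```python
from typing import List


def compatible_keycodes(keycode: str) -> List[str]:
    """Return the Easymix keycodes that mix harmonically with the given one."""
    number, letter = int(keycode[:-1]), keycode[-1].upper()
    if not 1 <= number <= 12 or letter not in ("A", "B"):
        raise ValueError(f"not a valid keycode: {keycode!r}")
    one_up = number % 12 + 1          # next position on the wheel (12 wraps to 1)
    one_down = (number - 2) % 12 + 1  # previous position on the wheel (1 wraps to 12)
    other_letter = "B" if letter == "A" else "A"  # relative major/minor
    return [
        f"{number}{letter}",
        f"{one_up}{letter}",
        f"{one_down}{letter}",
        f"{number}{other_letter}",
    ]


if __name__ == "__main__":
    print(compatible_keycodes("1A"))  # A-Flat Minor
    print(compatible_keycodes("8B"))  # C Major
```

For 8B (C Major) this yields 9B, 7B and 8A, which are G Major, F Major and A Minor, matching the circle of fifths example.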
However, altering the tempo of a track by +/- 6% will alter its key by a semitone (it will shift its keycode by 7 steps). A song in E-Flat Minor (keycode 2A) becomes an E Minor (keycode 9A) song with a 6% increase.
Figure 6: Crossfader in central position (Deck A: Song A currently playing; Deck B: Song B currently playing; output from both decks is equally audible)
Figure 7: Crossfader in right hand position (Deck A: Song C loaded; Deck B: Song B currently playing; only the output from Deck B is audible)
Assume Song A in step 1 has a key of C Major (8B) with 130BPM and Song B has a key of F Major (7B) with 120BPM. The songs are compatible if played at their original tempo, but Song B has to be increased to 130 BPM to match the tempo of Song A. This is an 8.33% increase in tempo, which without time-stretching raises the pitch of Song B by a little over a semitone, so its key moves up to approximately F-Sharp Major (2B), which is incompatible with Song A. Time-stretching algorithms solve this problem: the tempo of Song B can be increased to match Song A while Song B’s key remains at F Major, which is harmonically compatible with Song A.
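The keycode arithmetic in this example can be checked with a small Python sketch. The function names are hypothetical, and the sketch assumes the tempo change is applied without time-stretching and that the resulting pitch shift is rounded to the nearest whole semitone, each semitone moving the keycode 7 steps around the wheel.

```python
import math


def shift_keycode(keycode: str, semitones: int) -> str:
    """Move a keycode around the wheel by 7 steps per semitone of pitch shift."""
    number, letter = int(keycode[:-1]), keycode[-1]
    return f"{(number - 1 + 7 * semitones) % 12 + 1}{letter}"


def keycode_after_tempo_change(keycode: str, original_bpm: float, new_bpm: float) -> str:
    """Keycode after changing tempo without time-stretching (nearest semitone)."""
    semitones = round(12.0 * math.log2(new_bpm / original_bpm))
    return shift_keycode(keycode, semitones)


if __name__ == "__main__":
    print(shift_keycode("2A", 1))                            # E-Flat Minor up a semitone
    print(keycode_after_tempo_change("7B", 120.0, 130.0))    # Song B sped up to match Song A
```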
Figure 8: Circle of Fifths and Camelot Easymix System
6.3 Key Detection Algorithms
The extraction of key from audio is not new, but not often reported in literature. Many algorithms that are found in literature only work on symbolic data (e.g. MIDI or notated music) where the notes of the incoming signal are already known. For this project, the algorithm will need to work directly on incoming audio data with no prior knowledge of the notes which make up the song. Various different methods are used, varying from heavy use of spectral analysis, to statistical modelling, to modelling inner-hair cell dynamics. The algorithms presented below give a flavour of the research currently on-going into this challenging problem.
6.3.1 Musical key extraction from audio
A key extraction algorithm that works directly on raw audio data is presented by Pauws(7). Its implementation is based on models of human auditory perception and music cognition. It is relatively straightforward and has minimal computing requirements.
For each 100 millisecond section of the signal, it first down-samples the audio content to around 10kHz, which reduces significantly the computing cost and also cuts off any frequencies above 5kHz. It is assumed that these high frequencies will not contribute to the pitches in the lower frequency ranges. The ‘remaining’ samples in a frame are multiplied by a Hamming window, zero-padded, and the amplitude spectrum is calculated from a 1024-point FFT.
A 12 dimension chroma vector (chromagram) is then calculated from the frequency spectrum, which converts the frequencies in the spectrum into the 12 musical notes, e.g. for pitch class C, this comes down to the six spectral regions centred around the pitch frequencies for C1 (32.7 Hz), C2 (65.4 Hz), C3 (130.8Hz), C4 (261.6 Hz), C5 (523.3 Hz) and C6 (1046.5 Hz). The chroma vector is normalised to show the relative ratios of each musical note in the frequency spectrum.
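As a rough illustration of the mapping from spectrum to chroma vector, the Python sketch below assigns each FFT bin to its nearest equal-tempered pitch class and sums the magnitudes. It is not Pauws' implementation: the function name, the `f_min` cut-off, the A4 = 440 Hz tuning reference and the nearest-semitone binning (rather than spectral regions centred on each pitch) are all assumptions made for the example.

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def chroma_vector(frame, sample_rate, fft_size=1024, f_min=30.0):
    """Normalised 12-bin chroma vector for one frame of audio samples.

    The frame is Hamming-windowed and zero-padded to fft_size, so it is
    assumed to be no longer than fft_size samples.
    """
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hamming(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed, n=fft_size))
    freqs = np.fft.rfftfreq(fft_size, d=1.0 / sample_rate)

    # Ignore DC and very low bins, where a pitch is not meaningful.
    valid = freqs >= f_min
    # Nearest MIDI note number for each bin (A4 = 440 Hz = MIDI note 69).
    midi = np.round(69 + 12 * np.log2(freqs[valid] / 440.0)).astype(int)

    chroma = np.zeros(12)
    # MIDI note numbers are multiples of 12 at every C, so note % 12 is the pitch class.
    np.add.at(chroma, midi % 12, spectrum[valid])

    total = chroma.sum()
    return chroma / total if total > 0 else chroma


if __name__ == "__main__":
    sample_rate = 10000                       # down-sampled rate from the text
    t = np.arange(1000) / sample_rate         # one 100 ms frame
    frame = np.sin(2 * np.pi * 440.0 * t)     # a pure A4 tone
    print(NOTE_NAMES[int(np.argmax(chroma_vector(frame, sample_rate)))])
```

With a 1024-point FFT at 10kHz the bins are almost 10 Hz apart, so the lowest octaves are resolved only coarsely; the sketch makes no attempt to correct for this.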
Eventually there will be a chroma vector for each 100 millisecond section of the song. These are correlated with Krumhansl’s key profiles(8), and the key profile that has maximum correlation over all the computed chroma vectors is taken as the most likely key.
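The correlation step might be sketched in Python as follows. The profile values are the commonly quoted Krumhansl and Kessler probe-tone ratings, which are assumed here to be the profiles meant by reference (8); summing the per-frame correlations to pick an overall key is also an assumption, as the text does not specify how the frames are combined.

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Commonly quoted Krumhansl-Kessler key profiles, tonic first (assumed values).
KK_MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
KK_MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17])


def key_correlations(chroma):
    """Pearson correlation of one chroma vector with all 24 key profiles."""
    scores = {}
    for tonic in range(12):
        for mode, profile in (("major", KK_MAJOR), ("minor", KK_MINOR)):
            # Rotate the profile so that its tonic weight sits on this pitch class.
            template = np.roll(profile, tonic)
            scores[f"{NOTE_NAMES[tonic]} {mode}"] = np.corrcoef(chroma, template)[0, 1]
    return scores


def most_likely_key(chroma_vectors):
    """Key whose profile has the highest total correlation over all frames."""
    totals = {}
    for chroma in chroma_vectors:
        for key, r in key_correlations(chroma).items():
            totals[key] = totals.get(key, 0.0) + r
    return max(totals, key=totals.get)
```

Each chroma vector is expected to be ordered C, C#, ..., B and must not be constant, since the correlation is undefined for a flat vector.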
An evaluation with 237 CD recordings of classical piano sonatas indicated a classification accuracy of 75.1%. If the exact, relative, dominant, sub-dominant and parallel keys are all counted as similar keys, the accuracy rises to 94.1%.
The algorithm is quite basic. Whilst it has the benefit of being fast, it suffers from its use of the FFT which, although useful for finding the frequency spectrum of a stationary signal such as a chord played constantly, is not suited to extracting the frequencies of a non-stationary signal where the major frequencies change rapidly, as in any normal song. The method of scoring the most likely key could also be improved by weighting the maximum key relative to how close it was to the next most likely key. This way, if two or more keys correlate highly for a single chromagram, the resulting winner is penalised by giving it a low weighting, as it only just correlated higher than some other key. If one key dominates the correlation it is rewarded with a larger weighting and is therefore more likely to be the maximum key overall.
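One possible reading of this weighting idea is sketched below in Python. It is an assumption about how such a scheme could work rather than a description of any published method: each frame votes for its best-correlating key, and the weight of that vote is the margin between the best and second-best correlations. The function name and the example scores are invented for illustration.

```python
def weighted_key_vote(frame_scores):
    """Pick the overall key from per-frame correlation scores.

    frame_scores is a list of dicts, one per frame, mapping a key name to its
    correlation for that frame. Each frame must contain at least two keys.
    """
    totals = {}
    for scores in frame_scores:
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        (best_key, best), (_, runner_up) = ranked[0], ranked[1]
        # A narrow win contributes little; a dominant win contributes a lot.
        totals[best_key] = totals.get(best_key, 0.0) + (best - runner_up)
    return max(totals, key=totals.get)


if __name__ == "__main__":
    frames = [
        {"C major": 0.90, "A minor": 0.89, "G major": 0.20},  # C major only just wins
        {"C major": 0.91, "A minor": 0.90, "G major": 0.10},  # C major only just wins
        {"A minor": 0.95, "C major": 0.50, "G major": 0.10},  # A minor wins clearly
    ]
    print(weighted_key_vote(frames))
```

In this made-up example a simple count of winning frames would favour C major by two frames to one, whereas the margin weighting favours A minor because its single win is far more decisive.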
6.3.2 Chord Segmentation and Recognition using EM-Trained Hidden Markov Models
Sheh et al.(9) describe a method of recognising the major chords in a piece of music using pitch class profiles and Hidden Markov Models (HMMs) trained using the Expectation Maximisation (EM) algorithm. The pitch class profile (PCP) was first proposed by Fujishima (10) and is the same idea as the ‘chroma vector’ of the above algorithm, in which the Fourier transform intensities are mapped to the twelve semitone pitch classes corresponding to musical notes.
First the input signal is transformed to the frequency domain using the short-time Fourier transform (STFT). The STFT has the advantage over the FFT of being able to determine frequency changes over time rather than simply just taking a snapshot of frequencies in a certain time span. Thus the STFT is more suited to frequency analysis of non stationary signals.
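The idea of tracking frequency content over time can be illustrated with a minimal STFT written in Python with NumPy. This is a sketch under stated assumptions, not the method of Sheh et al.: the function name, the Hann window, and the frame and hop sizes are all chosen for the example, and the signal is assumed to be at least one frame long.

```python
import numpy as np


def stft_magnitudes(signal, frame_size=1024, hop_size=512):
    """Magnitude spectra of overlapping, Hann-windowed frames.

    Returns an array of shape (num_frames, frame_size // 2 + 1), one row per
    time frame.
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_size)
    num_frames = 1 + (len(signal) - frame_size) // hop_size
    frames = np.stack([
        signal[i * hop_size: i * hop_size + frame_size] * window
        for i in range(num_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))


if __name__ == "__main__":
    sample_rate = 8000
    t = np.arange(sample_rate // 2) / sample_rate
    # Half a second of 440 Hz followed by half a second of 880 Hz.
    signal = np.concatenate([np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 880 * t)])
    spectra = stft_magnitudes(signal)
    freqs = np.fft.rfftfreq(1024, d=1.0 / sample_rate)
    # The dominant frequency differs between the first and last frames.
    print(freqs[np.argmax(spectra[0])], freqs[np.argmax(spectra[-1])])
```

A single FFT over the whole signal would show both tones but not when each occurred; the per-frame spectra keep that timing information.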
The STFT is mapped to the Pitch Class Profile (PCP) features, which traditionally consist of 12-dimensional vectors, with each dimension corresponding to the intensity of a semitone class (chroma). The procedure collapses pure tones of the same pitch class, independent of octave, to the same PCP bin. The PCP vectors are normalised to show the intensities of each pitch class relative to one another. Pre-determined PCP vectors are used as features to train an HMM with one state for each chord distinguished by the system. The EM algorithm calculates the mean and variance vector values and the transition probabilities for each chord HMM. The Viterbi algorithm is then used to either forcibly align or recognise these labels. The PCP vector corresponding to the chord which aligned most closely with the PCP vectors computed from the song is chosen as the most likely key.
This algorithm performs well, but coding a hidden Markov model and the algorithms required to train it would be too time consuming for this project. Comparable results can be achieved using much simpler template matching techniques. The algorithm is also computationally expensive, and as such only a short segment of a song is used for key detection. One major advantage of this approach is the use of the STFT to analyse the frequencies and map them to the PCP / chroma vector. This is much more accurate than a single FFT, and this part of the algorithm can be used as part of a different key detection algorithm.
6.3.3 Automatic Chord Recognition from Audio Using Enhanced Pitch Class Profile
This algorithm (11) sets out to improve on other key detection algorithms which use a chromagram/PCP as the feature vector to identify chords. Some use a template matching algorithm to correlate the PCP with pre-determined PCP vectors for the 24 chords; others use a probabilistic model such as HMMs. The problem with the PCP in the template matching algorithm is that the templates which the PCP is matched against are binary, i.e. since a C major triad comprises three notes at C (root), E (third), and G (fifth), the template for a C major triad is [1,0,0,0,1,0,0,1,0,0,0,0] where the bins are labelled [C,C#,D,D#,E,F,F#,G,G#,A,A#,B]. However, the PCP from real world recordings will never be exactly binary, because acoustic instruments produce overtones as well as fundamental tones. The PCP / chroma vector of a C major triad played on a piano is shown in Figure 10.
In Figure 10, even though the strongest peaks are found at C, E, and G, we can see that the chroma vector has nonzero intensities at all 12 pitch classes due to the overtones generated by the chord tones. This noisy chroma vector may confuse recognition systems with binary type templates, especially if two chords share one or more notes, such as a major triad and its parallel or relative minor: a C major triad and a C minor triad share two notes, C and G, and a C major triad and an A minor triad have notes C and E in common.
Figure 11 shows an A minor triad and its correlation with the 24 keys. The A minor triad correlates highest with a C major chord and is identified incorrectly as C major. This is due to the fact that the intensity of the G in the A minor triad, which is not a chord tone, is greater than that of the A, which is a chord tone.
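The binary template matching described above, and the kind of confusion shown in Figure 11, can be reproduced with a short Python sketch. The function names are hypothetical and the noisy PCP values in the usage example are invented for illustration; they are not measurements from the report or from Lee's paper.

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def triad_templates():
    """Binary PCP templates for the 12 major and 12 minor triads."""
    templates = {}
    for root in range(12):
        for mode, third in (("major", 4), ("minor", 3)):
            template = np.zeros(12)
            # Root, third (4 or 3 semitones up) and perfect fifth (7 semitones up).
            template[[root, (root + third) % 12, (root + 7) % 12]] = 1.0
            templates[f"{NOTE_NAMES[root]} {mode}"] = template
    return templates


def recognise_chord(pcp):
    """Name of the triad whose binary template correlates best with the PCP."""
    scores = {name: np.corrcoef(pcp, template)[0, 1]
              for name, template in triad_templates().items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Invented PCP of an A minor triad in which an overtone at G is stronger than A.
    pcp = np.full(12, 0.1)
    pcp[NOTE_NAMES.index("C")] = 1.0
    pcp[NOTE_NAMES.index("E")] = 0.9
    pcp[NOTE_NAMES.index("G")] = 0.6
    pcp[NOTE_NAMES.index("A")] = 0.5
    print(recognise_chord(pcp))
```

Because every template contains exactly three ones, the ranking depends only on the summed intensity at the three template notes, so a strong non-chord tone at G is enough to tip this example towards C major.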
To overcome this problem, Lee has suggested taking the harmonic product spectrum (HPS) of the frequency spectrum, before computing the Enhanced Pitch Class Profile (EPCP) from the HPS. The algorithm for computing the HPS is very simple and is based on the harmonicity of the signal. Since most acoustic instruments and human voice produce a sound that has harmonics at the integer multiples of its fundamental frequency, decimating the original magnitude spectrum by an integer number will also yield a peak at its fundamental frequency. This should in theory eliminate the overtones which are produced and amplify the pure tones, leading to a more binary type EPCP. The following figure demonstrates the HPS and how it amplifies the main peak from the FFT whilst reducing the number of overtone frequencies.
Figure 11: Pitch Class Profile of A minor triad
In Figure 13 below the EPCP vector from the above example (A minor), and its correlation with the 24 major/minor triad templates are shown. Overlaid are the conventional PCP vector and its correlation in dotted lines for comparison. We can clearly see from the figure that non-chord tones are suppressed enough to emphasize the chord tones only, which are A, C, and E in this example. This removes the ambiguity between its relative major triad, and the resulting correlation identifies the chord correctly as A minor.
This technique seems useful and could be used to improve any key detection algorithm which uses PCP/chroma vectors. My one concern is its behaviour with dance music, which contains intense low frequencies, such as the bass line and bass drum, that may not be in key. These frequencies could be amplified instead of the melodic frequencies which I intend to amplify, skewing the results in favour of the key of the bass line rather than the key of the melody.
6.3.4
A Robust Predominant-F0 Estimation Method for Real-Time Detection of Melody and Bass Lines in CD Recording
Goto(12) describes a method, called PreFEst (Predominant-F0 Estimation Method), which can detect the melody and bass lines in complex real-world audio signals. F0 is shorthand for the fundamental frequency of a sound, which determines its perceived pitch.
PreFEst obtains traces of the melody and bass lines under the following assumptions:
• The melody and bass sounds have a harmonic structure. It does not matter whether the F0’s own frequency component is present.
• The melody line has the most predominant harmonic structure in middle and high frequency regions and the bass line has the most predominant harmonic structure in a low frequency region.
• The melody and bass lines tend to have temporally continuous traces.
The diagram below shows an overview of the PreFEst. It first calculates instantaneous frequencies by using multi-rate signal processing techniques and extracts candidate frequency components on the basis of an instantaneous-frequency-related measure.
The PreFEst basically estimates the F0 which is supported by predominant harmonic frequency components within an intentionally limited frequency range; by using two band-pass filters it limits the frequency range to middle and high regions for the melody line and low region for the bass line.
It then forms a probability density function (PDF) of the F0, which represents the relative dominance of every possible harmonic structure. To form this F0’s PDF, it regards each set of the filtered frequency components as a weighted mixture of all possible harmonic-structure tone models and estimates their weights that can be interpreted as the F0’s PDF: the maximum-weight model corresponds to the most predominant harmonic structure. This estimation is carried out by using the Expectation Maximisation algorithm, which is an iterative technique for computing maximum likelihood estimates from incomplete data.
Finally, multiple agents track the temporal trajectories of salient promising peaks in the F0’s PDF and the output F0 is determined on the basis of the most dominant and stable trajectory.
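The weight estimation step described above can be sketched in Python as Expectation Maximisation over a mixture of fixed tone models. This is a heavily simplified illustration, not Goto's implementation: the tone models here are fixed toy distributions over a handful of frequency bins, and the function name and data are assumptions.

```python
def em_mixture_weights(observed, tone_models, iterations=200):
    """Estimate the weights of fixed tone models by Expectation Maximisation.

    observed:    non-negative values over frequency bins (normalised below)
    tone_models: one list per candidate F0, over the same bins, summing to 1
    """
    total = sum(observed)
    observed = [value / total for value in observed]
    num_models = len(tone_models)
    weights = [1.0 / num_models] * num_models  # start from a uniform guess
    for _ in range(iterations):
        new_weights = [0.0] * num_models
        for x, p_x in enumerate(observed):
            mix = sum(w * m[x] for w, m in zip(weights, tone_models))
            if p_x == 0.0 or mix == 0.0:
                continue
            for j in range(num_models):
                # E-step: share of bin x explained by model j.
                # M-step: accumulate that share, weighted by the observed energy.
                new_weights[j] += p_x * weights[j] * tone_models[j][x] / mix
        weights = new_weights
    return weights


# Two toy tone models (assumed data): harmonics of bin 2 and harmonics of bin 3.
model_a = [0.0, 0.0, 0.5, 0.0, 0.3, 0.0, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0]
model_b = [0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.3, 0.0, 0.0, 0.2, 0.0, 0.0]
observed = [0.8 * a + 0.2 * b for a, b in zip(model_a, model_b)]

weights = em_mixture_weights(observed, [model_a, model_b])
predominant = max(range(len(weights)), key=lambda j: weights[j])
```

The estimated weights play the role of the F0's PDF: the model with the largest weight corresponds to the most predominant harmonic structure.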
6.3.5
A computational model of harmonic chord recognition
Walsh et al.(13) investigate the perception of harmonic chords by peripheral auditory processes and
auditory grouping. The frequency selectivity of the auditory system is modelled using a bank of overlapping band-pass filters and a model of inner hair cell dynamics. By computing intervals between different classes of pitch, the model achieves considerable success in recognizing major, minor, dominant seventh, diminished and augmented chords.
Part of the algorithm relies on an existing computational model of mechanical to neural transduction based on the hair cell-auditory-nerve fibre synapse. The output excitation function in response to an acoustic stimulus is a stream of spike events precisely located in time. The model describes the
production, movement and dissipation of a transmitter substance in the region of the hair cell-auditory-nerve fibre synapse.
It is probably not feasible to implement the algorithm described in this paper; the aim here is just to demonstrate the wide variety of methods and theories that researchers are trying to apply to the problem of extracting the key from polyphonic audio signals.
6.4
Beat Detection Algorithms
There are many different approaches to detecting the beats in a song. An overview of each algorithm is given below along with a brief discussion of its accuracy and applicability. The reader is encouraged to read each of the papers in full for more detail; the aim here is to give a brief introduction to the many ways in which beat detection can be performed. All diagrams within this section come from their respective papers.
6.4.1
Tempo and Beat Analysis of Acoustic Musical Signals
Scheirer’s paper(14) is one of the most frequently referenced papers on beat detection. The paper details
the implementation of a fast, close to real time, beat detection system for music of many genres.
The algorithm works by first dividing the music into six different frequency bands using a filterbank. This filterbank can be constructed by combining a low-pass and high-pass filter with many band-pass filters in between.
The envelope of each frequency band is then calculated. The envelope is a highly smoothed representation of the positive values in a waveform.
The differential of each of the six envelopes is then calculated; it is highest where the slope of the envelope is steepest. The peaks of the differentials would give a good estimate of the beats in the music, but the algorithm in the paper uses a different method.
Each differential is passed to a bank of comb filter resonators. In each bank of resonators, one of the comb filters will phase lock with the signal, where the resonant frequency of the filter matches the periodic modulation of the differential.
The outputs of all of the comb filters are examined to see which ones have phase locked, and this information is tabulated for each frequency band.
Summing this data across the frequency bands gives a tempo (BPM) estimate for the music. Referring back to the peak points in the comb filters allows the exact occurrence of each beat to be determined. The beat detection strategy used in this paper has demonstrated high accuracy and has been implemented many times by different parties. It can cope with a wide variety of music genres and fits the requirements of this project. The speed of the algorithm may also be beneficial to this project.
The algorithm is very complex and may be time consuming to implement. Because it processes the music as a stream, it cannot take advantage of analysing the whole track at once. This means that while the accuracy may be good enough to tap along to the music in real time, it may not determine the BPM accurately enough for this project.
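The comb filter stage can be illustrated with a minimal Python sketch. It is not Scheirer's implementation: it uses a single band rather than six, and the filter form, the gain of 0.9, the tempo range and the envelope rate are all assumptions chosen for illustration.

```python
def comb_filter_energy(signal, delay, gain=0.9):
    """Energy of a feedback comb filter y[t] = (1 - gain) * x[t] + gain * y[t - delay]."""
    output = [0.0] * len(signal)
    energy = 0.0
    for t, x in enumerate(signal):
        feedback = output[t - delay] if t >= delay else 0.0
        output[t] = (1.0 - gain) * x + gain * feedback
        energy += output[t] ** 2
    return energy


def estimate_tempo(onset_signal, envelope_rate_hz, min_bpm=80, max_bpm=180):
    """Return the BPM of the comb filter that resonates most strongly."""
    shortest = int(round(60.0 * envelope_rate_hz / max_bpm))
    longest = int(round(60.0 * envelope_rate_hz / min_bpm))
    best_delay = max(range(shortest, longest + 1),
                     key=lambda delay: comb_filter_energy(onset_signal, delay))
    return 60.0 * envelope_rate_hz / best_delay


# Toy input (assumed data): an envelope differential sampled at 100 Hz with
# an onset every 50 samples, i.e. every half second.
onsets = [1.0 if t % 50 == 0 else 0.0 for t in range(2000)]
tempo = estimate_tempo(onsets, envelope_rate_hz=100.0)
```

The tempo range is deliberately restricted to less than one octave here, because a filter whose delay is twice the true beat period also resonates with the signal.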
6.4.2
Analysis of the Meter of Acoustic Musical Signals
Klapuri et al(15) describe a method which analyses the meter of acoustic musical signals at the tactus,
tatum, and measure pulse levels illustrated in Figure 16. The target signals are not limited to any particular music type but all the main Western genres, including classical music, are represented in the validation database.
Figure 16: Waveform showing Tatum, Tactus and Measure
An overview of the method is shown below. For the time-frequency analysis part, a technique is proposed which aims at measuring the degree of accentuation in a music signal.
Figure 17: Overview of algorithm from Klapuri et al (15)
Feature extraction for estimating the pulse periods and phases is performed using comb filter resonators very similar to those used by Scheirer in the above paper. This is followed by a probabilistic model where the period-lengths of the tactus, tatum, and measure pulses are jointly estimated and temporal continuity of the estimates is modelled. At each time instant, the periods of the pulses are estimated first and act as inputs to the phase model. The probabilistic models encode prior musical knowledge and lead to a more reliable and temporally stable meter tracking.
An important aspect of this algorithm lies in the feature list creation block: the differentials of the loudness in 36 frequency sub-bands are combined into 4 ‘accent bands’, measuring the ‘degree of musical accentuation as a function of time’.
The goal in this procedure is to account for subtle energy changes that might occur in narrow frequency sub-bands (e.g. harmonic or melodic changes) as well as wide-band energy changes (e.g. drum
occurrences).
The algorithm presented in this paper seems to output some good results across a wide variety of musical genres. However, due to the complexity of the many different parts which make up the algorithm it is a bit beyond the scope of the simple beat detection which this project aims to achieve. With the
assumption that the system is designed only to work with music containing a prominent, distinguishable beat, implementing this algorithm would be like over-engineering the project and would use up valuable time ensuring that everything was working properly.
6.4.3
Audio Analysis using the Discrete Wavelet Transform
Tzanetakis et al (16) describe an algorithm based on the DWT that is capable of automatically extracting
beat information from real world musical signals with arbitrary timbral and polyphonic complexity. The beat detection algorithm is based on detecting the most salient periodicities of the signal. The signal is first decomposed into a number of octave frequency bands using the DWT. After that the time domain amplitude envelope of each band is extracted separately. This is achieved by low pass filtering each band, applying full wave rectification and down-sampling. The envelopes of each band are then summed together and an autocorrelation function is computed. The peaks of the autocorrelation function correspond to the various periodicities of the signal’s envelope.
The first five peaks of the autocorrelation function are detected and their corresponding periodicities in BPM are calculated and added in a histogram. This process is repeated by iterating over the signal. The periodicity corresponding to the most prominent peak of the final histogram is the estimated tempo in BPM of the audio file. A block diagram of the beat detection algorithm is shown below.
Figure 18: Block diagram of algorithm from Tzanetakis et al (16)
Key: WT: Wavelet Transform, LPF: Low Pass Filter, FWR: Full wave rectification, ↓: Downsampling, Norm: Normalisation, ACR: Autocorrelation, PKP: Peak Picking, Hist: Histogram
To evaluate the algorithm’s performance it was compared to the BPM detected manually by tapping the mouse with the music. The average time difference between the taps was used as the manual beat estimate. Twenty files containing a variety of music styles were used to evaluate the algorithm (5 Hip-Hop, 3 Rock, 6 Jazz, 1 Blues, 3 Classical, 2 Ethnic). For most of the files the prominent beat was detected clearly (13/20) (i.e. the beat corresponded to the highest peak of the histogram). For 5/20 files the beat was detected as a histogram peak but it was not the highest, and for 2/20 no peak corresponding to the beat was found. In the pieces that the beat was not detected there was no dominant periodicity (these pieces were either classical music or jazz). In such cases humans rely on higher level information like grouping, melody and harmonic progression to perceive the primary beat from the interplay of multiple periodicities.
This algorithm differs from the others in that it uses the discrete wavelet transform to decompose the incoming signal into separate frequency bands. Whether this improves the beat detection is unclear. The algorithm performs well on music containing a constant beat, which is fine for this project; however, it may also be too time consuming to implement.
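The autocorrelation and histogram stages of this approach can be sketched in Python as follows. The sketch skips the wavelet decomposition and envelope extraction entirely and starts from a ready-made envelope; it also keeps only the single strongest autocorrelation peak per window rather than the first five. The function names, window size and tempo range are assumptions.

```python
from collections import Counter


def autocorrelation(values, lag):
    """Unnormalised autocorrelation of a sequence at one lag."""
    return sum(values[t] * values[t - lag] for t in range(lag, len(values)))


def window_tempo(window, envelope_rate_hz, min_bpm, max_bpm):
    """BPM (rounded) of the strongest periodicity inside one window."""
    shortest = int(round(60.0 * envelope_rate_hz / max_bpm))
    longest = int(round(60.0 * envelope_rate_hz / min_bpm))
    best_lag = max(range(shortest, longest + 1),
                   key=lambda lag: autocorrelation(window, lag))
    return int(round(60.0 * envelope_rate_hz / best_lag))


def histogram_tempo(envelope, envelope_rate_hz, window_size,
                    min_bpm=80, max_bpm=180):
    """Collect one tempo vote per window and return the most common one."""
    histogram = Counter()
    for start in range(0, len(envelope) - window_size + 1, window_size):
        window = envelope[start:start + window_size]
        histogram[window_tempo(window, envelope_rate_hz, min_bpm, max_bpm)] += 1
    return histogram.most_common(1)[0][0]


# Toy envelope (assumed data) sampled at 100 Hz: a decaying pulse every 50 samples.
envelope = [max(0.0, 1.0 - (t % 50) / 10.0) for t in range(3000)]
tempo = histogram_tempo(envelope, envelope_rate_hz=100.0, window_size=500)
```

Voting across windows makes the estimate robust to a few windows in which the strongest periodicity is not the beat.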
6.4.4
Statistical streaming beat detection
The human listening system determines the rhythm of music by detecting a pseudo-periodic succession of beats. The signal intercepted by the ear carries a certain energy, which is converted into an electrical signal that the brain interprets. The more energy a sound carries, the louder it will seem. However, a sound will be heard as a beat only if its energy is much greater than the energy of the sound that preceded it. Therefore, if the ear intercepts a monotonous sound with occasional large energy peaks it will detect beats, whereas a continuous loud sound will not be perceived as containing any beats. This algorithm assumes that beats are large variations in sound energy.
Patin(17) presents a model whereby beats are detected by computing the average sound energy of the signal and comparing it to the instant sound energy. The instant energy is the energy contained in 1024 samples; at a sampling rate of 44.1kHz, 1024 samples represent about 23 milliseconds, which is pretty much 'instant'. The average energy should not be computed over the entire song, as some songs have both intense passages and calmer parts. The instant energy must be compared to the nearby average energy; for example, if a song has an intense ending, the energy contained in this ending should not influence the beat detection at the beginning.
We detect a beat only when the energy is greater than a local energy average. Thus we compute the average energy over, say, 44032 samples (43 blocks of 1024 samples), which is about 1 second; that is to say, we assume that the hearing system only remembers about 1 second of the song when detecting a beat. This 1 second window (44032 samples) is what we could call the human ear energy persistence model; it is a compromise between being too long, and so taking into account energies too far away, and being too short, and so becoming too close to the instant energy to make a valuable comparison.
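The comparison between instant and average energy can be sketched in Python as below. This is an illustrative sketch rather than Patin's code or the code of this project: the function name is hypothetical, and the threshold factor of 1.3 is an assumed value that does not come from this report.

```python
def detect_beats(samples, block_size=1024, history_blocks=43, threshold=1.3):
    """Return the indices of blocks whose energy exceeds the local average.

    The defaults correspond to 1024-sample blocks and a history of
    43 * 1024 = 44032 samples; the threshold factor is an assumption.
    """
    history = []      # energies of the most recent blocks
    beat_blocks = []
    for block_index in range(len(samples) // block_size):
        start = block_index * block_size
        instant = sum(s * s for s in samples[start:start + block_size])
        if len(history) == history_blocks:
            average = sum(history) / history_blocks
            if instant > threshold * average:
                beat_blocks.append(block_index)
            history.pop(0)  # forget the oldest block
        history.append(instant)
    return beat_blocks


# Toy signal (assumed data), using tiny blocks so the example stays short:
# a quiet background with two loud blocks at block indices 10 and 20.
quiet, loud, size = 0.1, 1.0, 4
signal = []
for block in range(30):
    signal.extend([loud if block in (10, 20) else quiet] * size)

beats = detect_beats(signal, block_size=size, history_blocks=8)
```

Note that no beat can be reported until the history buffer has filled, so the first second of a track is not analysed in this sketch.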
6.5
Similar Projects / Software
6.5.1
Traktor DJ Studio by Native Instruments
Traktor DJ Studio (18) is state of the art proprietary software enabling DJs to mix together up to four
different tracks at the same time. Traktor’s beat detection system enables two tracks to be automatically beat-synchronised and manages to detect the beats well in most tracks with a prominent, regular beat. However, it does not produce good results when used with music of other genres such as classical and rock.
Traktor offers a visualisation of the playing track and highlights the detected beats with visual beat markers. It has support for time-stretching of tracks and also basic tempo adjustment. Extra features of Traktor include a whole host of real time effects, such as reverb, delay, flange which can be applied, plus a selection of low-, mid- and high-pass filters. A file browser displays information about files for easy dragging and dropping of them onto the decks, and the program allows you to record and save your own mix as it happens, capturing any effects applied.
Traktor is commercial software aimed at the professional DJ. However, it is missing a couple of features which this project aims to include. Traktor does not have any key detection algorithm capable of extracting the key from a digital audio file. Pitch shifting, which adjusts the pitch of a track without altering its tempo, is an aim of this project but is also not present in Traktor.
6.5.2
Rapid Evolution 2
Rapid Evolution 2 (19) is free software which allows the user to import their music files and have them
analysed in order to detect the BPM and the key of the track. Based on the BPM and key extracted from an audio file, the system indicates which other songs would go well with the analysed song to produce a good harmonic mix. A unique element of Rapid Evolution is the availability of a virtual piano which can play the chord of the key detected in a song. This can be used to determine qualitatively how accurate the key detection of an audio file was, and would be a valuable feature in any program aimed at harmonic mixing. The program allows simultaneous playback of two files and has time-stretching functionality. Although this product strives to generate and display a lot of useful information to the harmonic mixing DJ, the graphical user interface is not the most intuitive. For example, it is not obvious what the difference is between some of the buttons, such as the ‘import’ and ‘add song(s)’ buttons, and a lot of the same controls and information are displayed in more than one area, making inefficient use of screen real estate and confusing the user. The program does not have an automatic beat matching algorithm, although this is planned for a future release.
6.5.3
Mixed in Key
Mixed in Key (20) is a small commercial application whose sole purpose is to analyse files, extract the key and BPM from them and store the information in the files' metadata. Mixed in Key uses Camelot’s easymix system to display the key as well as the formal musical notation. The software licenses a key-detection algorithm named tONaRT from zplane development(21) to detect the key from the audio file.
The application is geared towards batch processing of several files at once. The software does not provide any way of playing the song and as such it does not support features such as pitch-shifting and time-stretching.
6.5.4
MixMeister
MixMeister (22) DJ mixing software is commercial software which allows users to ‘design’ a mix rather
than create one in real time. With its unique timeline function it allows users to visualise the overlapping of two (or more) songs which they want to mix, enabling them to refine the mix so that for example the beats are perfectly aligned. It is much easier to create a perfect mix this way, as you have full control over the tempos of the tracks, and when they should both start and finish. The downside of this is that you would not be able to use MixMeister in a live situation, as it takes trial and error to align the songs perfectly. MixMeister is therefore aimed at people who want to create mixes for later use, such as creating their own mix CD.
MixMeister has seemingly accurate BPM and key detection, and uses the Camelot notation to display the detected keys as Camelot keycodes. On the whole it is a solid application which offers a unique approach to DJ mixing that would not be possible without the advances computers have brought to music analysis.
7.
Design
This section of the report gives a very brief description of the design and architecture of the project, followed by the reasons behind the design of the algorithms.
7.1
System Architecture
The system was designed with the user in mind. As such the system was based around the need for a responsive, intuitive user interface. This meant keeping the graphical user interface (GUI) separate from the sound processing and from the main algorithms. The result is that the system has a modular
architecture which can be broken down into three main areas: GUI, Core and Algorithms.
The GUI comprises those classes which the user interacts with, and which the system uses to feed information back to the user about the state of the system.
The core contains the functions which process the audio files when called upon by the user interacting with the interface.
The algorithms are separated from the core logic as they apply specific routines on an audio file. When running, these routines should not hamper the smooth running of the program. They should work in the background independently of the core logic.
For a more in-depth discussion on these areas see the implementation section.
Figure 23: Overview of System Architecture
The algorithms are separated into the key detection and beat detection algorithms. The rationale behind the design decisions for each algorithm is explained below.
7.2
Key Detection Algorithm Design Rationale
Any key detection algorithm inevitably involves conversion of the signal from the time domain into the frequency domain, using either the Fourier, Constant Q or Wavelet transforms.
Initially, I planned to write the entire algorithm in C# using the FFTW(23) (Fastest Fourier Transform in
the West) library, which as the name suggests claims to perform the FFT transformation in the shortest amount of time. However, I was getting peculiar frequency spectrums which showed high intensities at very high frequencies (i.e. greater than 20kHz). Additionally, I learned that the FFT was not the best
The transform I decided to use to convert the signal from the time to the frequency domain was the short time Fourier transform (STFT) which is essentially the FFT applied to small sections, or windows, of the signal at a time.
Eventually I chose to use Matlab to develop the majority of the key detection algorithm. Matlab is a matrix based programming language and has excellent support for digital signal processing. Matlab uses the FFTW library to perform Fourier transformations. Recent versions of Matlab include the ‘Builder for .NET’ tool which conveniently converts Matlab code into a C# compatible dynamic link library which can be interfaced from the rest of my project in the C# language.
To determine the key, the output from the STFT is mapped to a chroma vector. There are then two main techniques for matching the chroma vector to a key. Pattern matching techniques correlate the chroma vector against a series of pre-programmed key templates and record the highest correlating key.
Probabilistic models involve developing and training a hidden Markov model (HMM), and recording the template which best aligns itself with the chroma vector. Pattern matching techniques were chosen for the design of the algorithm because they have been shown to give similar results to probabilistic methods without the extra development time needed to program and train an HMM.
The speed constraints of the key detection algorithm are not as tight as for the beat detection algorithm, as once a song has its key detected, that information is stored in the song's ID3 tag and can be read by the program in future. Even so, it is still desirable for the process to take the shortest amount of time possible.
7.3
Beat Detection Algorithm Design Rationale
The beat detection algorithm is based on the method set out by Patin(17) in ‘Statistical streaming beat
detection’. This algorithm iteratively compares the instant energy of a piece of music with the average energy calculated over the past second. A beat is detected if this instant energy is significantly greater than the average energy. The concept is similar to the human hearing system in that when we listen to music, we only remember the past second or so of music.
We are designing the algorithm primarily to be used with dance music. It is assumed that this type of music will have a consistent tempo throughout. This assumption means that the algorithm is unlikely to give good results when applied to music without a consistent tempo.
It is also assumed that beats in this type of music are produced by a bass instrument such as a bass drum with low frequency. Because the algorithm does not convert the signal to the frequency domain, and works entirely in the time domain, the energies are based on the amplitude over the whole frequency spectrum. This means that a significant sound variation in the high frequencies could be detected as a beat just as much as one in the low frequencies. Applying a low pass filter to the signal should reduce the impact that high frequencies have on the detection of beats.
The required accuracy for the beat detection for this project is to be within +/- 1.5% of the actual BPM. Bearing in mind that the majority of time was devoted to developing the key detection algorithm to a high standard, the method chosen was dictated by the time constraints of the project. Nevertheless, the algorithm is reported to give good results with songs containing a dominant, consistent beat, so it is well suited to this project, which is intended for use with dance music.
Patin’s method does not explicitly suggest a way of calculating a BPM value from the beats detected. Finding the BPM is not as simple as counting the number of beats detected in a minute. A comb filter could be used: this is a special type of filter that resonates when the signal passed through it repeats at the filter's own period, and the period of the most strongly resonating filter can then be used to calculate the BPM. Due to time constraints a more basic method was used to calculate the BPM: the average interval between similarly spaced beats is found and converted to a BPM value.
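One possible reading of the interval-averaging method is sketched below in Python. The report does not specify how "similarly spaced" is decided, so the rule used here (intervals within 10% of the median interval) is an assumption, as is the function name.

```python
def bpm_from_beats(beat_times, tolerance=0.1):
    """Convert beat times (in seconds) to BPM.

    Only intervals within `tolerance` of the median interval are averaged,
    so that missed or spurious beats do not distort the result.
    """
    intervals = sorted(b - a for a, b in zip(beat_times, beat_times[1:]))
    if not intervals:
        raise ValueError("at least two beats are needed")
    median = intervals[len(intervals) // 2]
    similar = [i for i in intervals if abs(i - median) <= tolerance * median]
    return 60.0 / (sum(similar) / len(similar))


# Toy data (assumed): beats every 0.5 s, with the beat at 2.0 s missed.
beat_times = [0.0, 0.5, 1.0, 1.5, 2.5, 3.0, 3.5]
bpm = bpm_from_beats(beat_times)
```

Simply counting the beats in this example would under-estimate the tempo because of the missed beat, whereas discarding the outlying interval recovers the underlying 0.5 second spacing.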
The algorithm for detecting beats had to be accurate and fast at the same time, because each time a file is loaded into a deck, the program will detect its beats. Increasing the speed of the beat detection algorithm usually implies a trade off in the accuracy of the algorithm, so it was important to strike the right balance between speed and accuracy.
8.
Implementation
This section describes in detail the actual implementation of the system and algorithms plus any other interesting implementation areas of the project. This is not a complete account of all areas of the project; many small details are omitted and can be assumed to have been successfully implemented. This approach was taken in order to increase the readability of this report.
8.1
System Implementation
The system was implemented in the C# language with the FMOD sound processing library(24) in mind.
FMOD is an advanced platform independent front end to Microsoft’s DirectShow and DirectX APIs. It makes it much easier to develop a multimedia based program than using the APIs directly. It is aimed at the games industry and is used by many high profile game developers. FMOD is free for non-commercial use.
Figure 24 illustrates a simplified overview of the system showing the main classes and their relationships.
FMOD defines three main types which are used throughout the program:
• The System object initialises the FMOD engine, handles the creation and playing of Sound
objects and is used to set global parameters for the FMOD engine, such as changing the size and
type of buffers used by FMOD. There should only be one System object initialised throughout
the whole program, for efficiency reasons, and I decided to keep this object in the core class and let other objects access it if and when they need it. This is the intuition behind the centralised design.
• The Sound object holds information on the type of audio file loaded, i.e. its length in samples,
bytes and milliseconds, the number of channels (mono or stereo), and the bit-rate. It also reads the audio data in the sound file into a byte buffer, enabling custom operations and analysis to be performed on the raw audio data.
• The Channel object handles the parameters in which the sound is played, such as its volume,
playback rate (tempo), pitch and current position.
For each deck, the Core class contains a corresponding Sound and Channel object. The GUI classes fire events when certain actions are performed on them. These events are handled by the Core class, which calls the appropriate FMOD function on the Sound and Channel objects corresponding to that deck. For example, when the Play button is pressed on deck A, an event is fired and sent to the Core class. The Core can tell from the message passed that deck A fired the event, so it knows to call FMOD’s play function on the Sound object corresponding to deck A.
The GUI is made up of the following classes:
• Deck – encapsulates the behaviour of a turntable i.e. loading, playing and pausing of sounds as
well as controlling pitch and tempo. Each deck has a unique id which corresponds to the id of
the relevant FMOD Sound and Channel objects.
• WaveForm – the Deck class contains a WaveForm class, which presents a zoomed-in
animated visualisation of the currently loaded track. This visualisation contains beat markers which mark the precise location of where the beat detection algorithm detected a beat. The visualisation can be dragged forwards and backwards, mimicking the bi-directional rotation of a vinyl on a turntable. The waveform also displays a representation of the whole track, enabling the user to quickly skip to a certain position in the track.
• Mixer – blends the output from the two currently playing decks using the crossfader. Also adds
functionality to filter out high or low frequencies for each track.
• MusicBrowser – displays the music files supported by the program on the user’s computer,
and their corresponding metadata information, such as the BPM and Key that was detected by the program.
Both the algorithms run asynchronously in separate threads to the GUI and Core classes. This means that they run in the background and do not block the GUI thread. This enables the user, for example, to be playing a track in one deck, while at the same time loading a track in the other deck, whilst detecting the key of another track. Obviously, the more activities the user decides to perform simultaneously, the slower the performance of the system as the different threads all compete for CPU time.
The structure of both algorithms is the same. The ‘worker’ class sets off the main routine asynchronously, and receives progress updates from the main routine which allow it to update the relevant progress bars. The worker is notified when the main routine has completed, causing the ‘results’ class to return the relevant results from the algorithm. For beat detection, this is more than just the estimated BPM result. It also returns arrays containing the values to be drawn onto the waveform, and an array of the beat positions so that beat markers can be placed in the waveforms at the appropriate times.
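The worker pattern described above can be sketched in a language-neutral way using Python threads. This is not the project's C# code: the class, the callback names and the stand-in routine are all hypothetical, and a real GUI would also need to pass the callbacks back to the GUI thread before touching any controls.

```python
import threading


class Worker:
    """Runs a long analysis routine on a background thread (hypothetical sketch)."""

    def __init__(self, routine, on_progress, on_complete):
        self._routine = routine
        self._on_progress = on_progress
        self._on_complete = on_complete
        self._thread = None

    def start(self):
        # The caller returns immediately; the routine runs in the background.
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def wait(self):
        self._thread.join()

    def _run(self):
        results = self._routine(self._on_progress)
        self._on_complete(results)


def fake_beat_detection(report_progress):
    """Stand-in for the real routine: reports progress, then returns results."""
    for percent in (25, 50, 75, 100):
        report_progress(percent)
    return {"bpm": 120.0, "beat_positions": [0, 22050, 44100]}


progress_log = []  # would drive a progress bar in the real program
finished = []      # would hand the results to the waveform display
worker = Worker(fake_beat_detection, progress_log.append, finished.append)
worker.start()
worker.wait()
```

Passing the progress and completion handlers in as callbacks keeps the algorithm unaware of the GUI, which matches the separation between the GUI, Core and Algorithms described in the design section.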
8.2
Detecting the Key
The audio file is broken down into non-overlapping sections of approximately 5.5 seconds. The flow diagram shows the process which is applied to each section of the song before a key for the whole song is chosen.
Figure 25: Key Detection Algorithm Flow Chart
In order to save computation time, my approach starts by converting the audio section to mono and downsampling to 11025Hz. Converting to mono involves taking the average of every two consecutive samples in the signal (the left and right channel values), reducing the number of samples by a factor of two. Downsampling further reduces the number of samples in the audio stream whilst still conveying enough information to perform accurate key detection. A side-effect of downsampling the audio file is that frequency content above 5512.5Hz is not considered, as a consequence of the Nyquist sampling theorem. However, frequencies above this limit do not contribute much to the harmonic content of the song; for comparison, D# in octave 8 has a frequency of 4978.03Hz, which is already above the highest note of a standard piano (C8, 4186.01Hz).
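The pre-processing step can be sketched in Python as follows. The sketch assumes interleaved stereo input at 44100Hz, which this report does not state explicitly, and it uses a simple block average as a crude low-pass filter before decimation; the function names are hypothetical and a production implementation would use a proper anti-aliasing filter.

```python
def stereo_to_mono(interleaved):
    """Average each left/right pair of an interleaved stereo stream."""
    return [(interleaved[i] + interleaved[i + 1]) / 2.0
            for i in range(0, len(interleaved) - 1, 2)]


def downsample(samples, factor):
    """Reduce the sample rate by an integer factor.

    Each output sample is the mean of `factor` input samples, which acts
    as a crude low-pass filter before the decimation.
    """
    return [sum(samples[i:i + factor]) / factor
            for i in range(0, len(samples) - factor + 1, factor)]


SOURCE_RATE = 44100                  # assumed sample rate of the input file
FACTOR = 4
TARGET_RATE = SOURCE_RATE // FACTOR  # 11025 Hz
NYQUIST = TARGET_RATE / 2.0          # 5512.5 Hz

# Toy data (assumed): 8 stereo frames become 8 mono samples, then 2 samples.
stereo = [float(i) for i in range(16)]
mono = stereo_to_mono(stereo)
reduced = downsample(mono, FACTOR)
```

Together the two steps reduce the amount of data passed to the STFT by a factor of eight.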
After the pre-processing stage, the signal is passed to Matlab which performs an STFT of the signal using a Hamming window of length 8192 samples. This is approximately 0.74s, which is a relatively long analysis window in terms of musical harmony. Thus, to improve time resolution, successive frames are advanced by 1/8th of a window length (1024 samples), so that they overlap by 7/8ths of a window, giving a time resolution of 0.093s per frame. The STFT returns a spectrogram which shows a time-frequency plot, enabling you to see the intensities of frequencies at different time slices throughout the section of the song. Figure 26 shows a spectrogram of a C major chord played on the piano. You can see the most intense frequencies at around the 250 – 1500Hz range, and how the intensities gradually decay as time increases.
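The framing arithmetic behind these figures can be checked with a short Python sketch. It only computes the window, the frame positions and their durations, not the Fourier transform itself; the section length of exactly 5.5 seconds and the helper names are assumptions.

```python
import math


def hamming_window(length):
    """Symmetric Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def frame_starts(num_samples, window_length, hop_length):
    """Start index of every complete frame in a signal."""
    return list(range(0, num_samples - window_length + 1, hop_length))


SAMPLE_RATE = 11025
WINDOW_LENGTH = 8192
HOP_LENGTH = WINDOW_LENGTH // 8               # frames advance by 1/8 window

window_seconds = WINDOW_LENGTH / SAMPLE_RATE  # about 0.74 s
hop_seconds = HOP_LENGTH / SAMPLE_RATE        # about 0.093 s

section_samples = int(5.5 * SAMPLE_RATE)      # assumed section length
starts = frame_starts(section_samples, WINDOW_LENGTH, HOP_LENGTH)
window = hamming_window(WINDOW_LENGTH)
```

Each frame would be multiplied element by element with the window before the FFT is applied to it.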
The next stage is to scan through the output from the STFT and map the frequencies in Hz to pitch classes or musical notes. The result will be a chroma vector, also called a Pitch Class Profile (PCP) or chromagram, which traditionally consists of a 12-dimensional vector, with each dimension corresponding to the intensity of a semitone class (chroma). The procedure collapses pure tones of the same pitch class, independent of octave, to the same chroma vector bin; for complex tones, the harmonics also fall into particular, related bins. Frequency to pitch mapping is achieved using the logarithmic characteristic of the equal temperament scale. STFT bins are mapped to chroma vector bins according to:
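$$p(k) = \left[ 12 \log_2\left( \frac{k}{N} \cdot \frac{f_s}{f_{ref}} \right) \right] \bmod 12$$
Equation 1
where $k$ is the STFT bin index and $f_{ref}$ is the reference frequency corresponding to the first index in the chroma vector (index 0). I chose $f_{ref}$ = 440 Hz, which is the frequency of pitch class A. $f_s$ is the sampling rate (11025 Hz) and $N$ is the size of the FFT in samples (8192).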
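To check the mapping on a few known notes, the sketch below evaluates Equation 1 in Python. It assumes the square brackets denote rounding to the nearest integer, and the helper names `pitch_class` and `bin_for` are hypothetical.

```python
import math

FS = 11025        # sampling rate (Hz)
N = 8192          # FFT size in samples
F_REF = 440.0     # reference frequency: pitch class A at index 0
NOTE_NAMES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]

def pitch_class(k):
    """Map STFT bin index k (k >= 1) to a chroma index 0..11 using Equation 1."""
    freq_ratio = (k / N) * (FS / F_REF)
    # Rounding to the nearest semitone is an assumed reading of the brackets.
    return round(12 * math.log2(freq_ratio)) % 12

def bin_for(freq_hz):
    """Index of the STFT bin whose centre frequency is closest to freq_hz."""
    return round(freq_hz * N / FS)

# The notes of a C major chord around middle C, plus the reference A.
notes = {hz: NOTE_NAMES[pitch_class(bin_for(hz))]
         for hz in (261.63, 329.63, 392.00, 440.00)}
```

With these parameters the four frequencies map to C, E, G and A respectively, which matches the template labelling used in the next step.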
For each time slice, we calculate the value of each chroma vector element by summing the magnitudes of all frequency bins that correspond to a particular pitch class, i.e. for $p = 0, 1, \ldots, 11$,
$$PCP(p) = \sum_{k:\, p(k) = p} |X(k)|$$
Equation 2
where $X(k)$ is the STFT value in bin $k$ of the current time slice.
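A minimal Python/NumPy sketch of this summation for a single time slice is given below. The function name `chroma_vector` and the synthetic frame, which has energy in only three bins, are assumptions made for illustration. Bin 0 (DC) is skipped because its logarithm is undefined.

```python
import numpy as np

FS, N, F_REF = 11025, 8192, 440.0

def chroma_vector(magnitudes):
    """Sum the magnitudes of one STFT frame into 12 pitch-class bins (Equation 2)."""
    k = np.arange(1, len(magnitudes))                 # skip bin 0 (DC)
    p = np.round(12 * np.log2(k * FS / (N * F_REF))).astype(int) % 12
    chroma = np.zeros(12)
    np.add.at(chroma, p, magnitudes[1:])              # accumulate per pitch class
    return chroma

# Synthetic frame with unit magnitude at the bins nearest C, E and G.
frame = np.zeros(N // 2 + 1)
for hz in (261.63, 329.63, 392.00):
    frame[round(hz * N / FS)] = 1.0
chroma = chroma_vector(frame)
```

With the labelling [A,A#,B,C,C#,D,D#,E,F,F#,G,G#], the resulting vector has non-zero entries only at the C, E and G positions.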
Once we have our normalised chroma vector, we need to match it against pre-defined templates representing the 24 possible keys (12 major, 12 minor). These templates are also 12-dimensional, with each bin representing a pitch class. They are binary, i.e. each bin is either 1 or 0. A C major chord consists of the notes C (root), E (third) and G (fifth); therefore, the template for the key of C major is [0,0,0,1,0,0,0,1,0,0,1,0], where the labelling of the template is [A,A#,B,C,C#,D,D#,E,F,F#,G,G#]. A G major chord consists of the notes G, B and D, and so its template is [0,0,1,0,0,1,0,0,0,0,1,0]. As can be seen from these examples, every template for the major keys is just a shifted version of the others. The templates for the minor keys are the same as those for the major keys but with the third shifted by one to the left. The template for a C minor chord (C,D#,G) is therefore [0,0,0,1,0,0,1,0,0,0,1,0], and the other minor keys are just shifted versions of this template. Templates for augmented, diminished, or 7th chords can be defined in a similar way. We will deal only with detecting major and minor keys here, as the Camelot easymix system does not recognise any other modes. Figure 27 shows the chroma vector of a C major chord played on the piano and its correlation with the 24 key templates.
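Since every template is a shifted triad, all 24 can be generated rather than written out by hand. The Python sketch below is one way to do this; the names `triad_template` and `KEY_TEMPLATES` are hypothetical and chosen only for the example.

```python
NOTE_NAMES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]

def triad_template(root, minor=False):
    """Binary 12-bin template with the root, third and fifth set to 1."""
    template = [0] * 12
    third = 3 if minor else 4          # minor third is one semitone lower
    for interval in (0, third, 7):     # root, third, perfect fifth
        template[(root + interval) % 12] = 1
    return template

# Build all 24 templates, keyed by names such as "C major" or "A# minor".
KEY_TEMPLATES = {}
for root, name in enumerate(NOTE_NAMES):
    KEY_TEMPLATES[name + " major"] = triad_template(root)
    KEY_TEMPLATES[name + " minor"] = triad_template(root, minor=True)
```

The entries for C major, G major and C minor reproduce the three templates quoted in the text.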
We now perform correlation of the computed chroma vector with each of the 24 key templates and get a correlation coefficient for each of the 24 keys. The correlation coefficient is calculated using:
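$$r = \frac{\sum_m \sum_n (A_{mn} - \bar{A})(B_{mn} - \bar{B})}{\sqrt{\left( \sum_m \sum_n (A_{mn} - \bar{A})^2 \right) \left( \sum_m \sum_n (B_{mn} - \bar{B})^2 \right)}}$$
Equation 3
where $A$ and $B$ are matrices of size m x n and $\bar{A}$ and $\bar{B}$ are their mean values; in our case these will simply be vectors of size 12.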
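For 12-element vectors, Equation 3 reduces to the ordinary correlation coefficient. The following Python/NumPy sketch computes it directly from the formula. The function name `correlation` is hypothetical, and the chroma values are made up to resemble a C major chord rather than measured from audio.

```python
import numpy as np

def correlation(a, b):
    """Correlation coefficient of two equally sized arrays (Equation 3)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    da, db = a - a.mean(), b - b.mean()          # deviations from the means
    return (da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum())

c_major = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0]
c_minor = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0]

# Made-up chroma vector with most of its energy on C, E and G.
chroma = [0.1, 0.0, 0.1, 0.9, 0.0, 0.2, 0.0, 0.8, 0.1, 0.0, 0.7, 0.0]
r_major = correlation(chroma, c_major)
r_minor = correlation(chroma, c_minor)
```

For this example `r_major` exceeds `r_minor`, so the C major template is the better match. Two templates that share two of their three notes, such as C major and C minor, correlate at exactly 5/9 with each other, which is why closely related keys can be hard to separate.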
We assign a weighting to the key with the highest correlation, equal to the difference between its coefficient and the second-highest correlation coefficient. For the weighting to be fair, we first normalise the correlation coefficients so that the highest value becomes 1. The weighting penalises the highest-correlated key when the chroma vector also correlates closely with other keys and the difference between them is minute, since the true key could then be one of those other keys. It rewards the highest-correlated key when its coefficient is by far the largest.
When we have reached the end of the song we will have several weightings, one for each 5.5 second segment of the song. To find the most likely key, we simply sum the weightings for each key and the key with the highest value at the end is selected as the most likely key.
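The weighting and the final vote can be sketched in Python as follows. This is one possible reading of the description: normalisation is taken to mean dividing by the highest coefficient, which is assumed to be positive. The function names and the correlation values are invented for the example.

```python
from collections import defaultdict

def segment_vote(correlations):
    """Return (best_key, weighting) for one segment from a {key: coefficient} dict."""
    ranked = sorted(correlations.items(), key=lambda kv: kv[1], reverse=True)
    (best_key, best), (_, second) = ranked[0], ranked[1]
    # Normalise so the top coefficient is 1, then weight by the gap to the runner-up.
    return best_key, 1.0 - second / best

def most_likely_key(segments):
    """Sum the weightings per key over all segments and pick the largest total."""
    totals = defaultdict(float)
    for correlations in segments:
        key, weight = segment_vote(correlations)
        totals[key] += weight
    return max(totals, key=totals.get)

# Invented coefficients for three segments (only three keys shown for brevity).
segments = [
    {"C major": 0.9, "A minor": 0.45, "G major": 0.3},   # clear win for C major
    {"C major": 0.6, "A minor": 0.8, "G major": 0.2},    # narrower win for A minor
    {"C major": 0.5, "A minor": 0.4, "G major": 0.1},    # narrow win for C major
]
detected = most_likely_key(segments)
```

In this example C major accumulates a total weighting of about 0.7 against 0.25 for A minor, so C major is selected.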
The detected key is then stored in the ID3 tag of the song so that in future it can be read directly, without having to repeat the whole process described above. The ID3 tag is written using the library Ultra ID3 Lib (25).
8.3
Detecting the Beats
The basic intuition behind the beat detection algorithm is to find sections of the music where the instant energy in the signal is greater than some scaling of the average energy of the signal over the previous approximate second of music. The assumption made is that the instant energy in a signal will be much greater on the beat than between beats. This assumption is reasonable for songs with heavy down beats and little mid and high frequency “noise”.
The audio file is first split into manageable sections. The file is split up simply because reading the whole of it in as one big chunk of data requires a lot of memory to hold the large buffer containing the samples. It also creates a bottleneck for the entire system, as reading the entire song takes up the majority of the CPU usage at that particular moment.
The audio data is first converted to mono as in the key detection algorithm but is not downsampled. We then iteratively apply the following process to the signal. First we calculate the instant energy, $e$, which is the energy contained in 512 samples. A block of 512 samples is chosen because, at 44100 samples per second, it lasts approximately one hundredth of a second (512 / 44100 ≈ 11.6 ms), which is pretty much instant. The instant energy is calculated using the following formula, where X is the signal.
$$e = \sum_{i=0}^{511} X[i]^2$$
Equation 4
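As an illustration of Equation 4, the Python/NumPy sketch below computes the instant energy of each consecutive 512-sample block, taking energy to be the sum of squared samples. The function name `instant_energies` and the synthetic signal are assumptions. Trailing samples that do not fill a whole block are simply dropped here.

```python
import numpy as np

BLOCK = 512   # samples per instant-energy reading (about 11.6 ms at 44100 Hz)

def instant_energies(mono):
    """Energy (sum of squared samples) of each consecutive 512-sample block."""
    usable = len(mono) - len(mono) % BLOCK          # drop the incomplete tail
    blocks = np.asarray(mono[:usable], dtype=float).reshape(-1, BLOCK)
    return (blocks ** 2).sum(axis=1)

# Synthetic signal: a quiet block, a loud block, then 100 leftover samples.
signal = np.concatenate([np.full(BLOCK, 0.1), np.full(BLOCK, 0.5), np.zeros(100)])
energies = instant_energies(signal)
```

The loud block yields an energy 25 times that of the quiet one (128 against 5.12), which is the kind of jump the beat detector looks for.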
We then need to calculate the average energy. This is not calculated on the entire song, since a song may have both an intense passage and a calm part. The average energy is calculated on the last 44032 samples, which is just short of a second. 44032 samples are chosen instead of 44100 because it is then more convenient to calculate the average energy by simply summing the past 86 instant energy readings (as 86 x 512 = 44032) and taking the average of them. We illustrate the calculation of the average energy, $\langle E \rangle$, in Equation 5, where E is a history buffer of length 86 containing the past 86 instant energy readings.
$$\langle E \rangle = \frac{1}{86} \sum_{i=0}^{85} E[i]$$
Equation 5
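The history buffer can be sketched in Python with a fixed-length queue that discards the oldest reading automatically. The class name `EnergyHistory` is hypothetical. Dividing by the number of readings held so far, rather than always by 86, is an assumption about how to handle the first second of audio, before the buffer has filled.

```python
from collections import deque

HISTORY = 86   # 86 readings x 512 samples = 44032 samples, just under a second

class EnergyHistory:
    """Rolling buffer of the most recent instant-energy readings."""

    def __init__(self):
        self.buffer = deque(maxlen=HISTORY)   # oldest reading drops off when full

    def push(self, energy):
        self.buffer.append(energy)

    def average(self):
        """Mean of the stored readings (Equation 5 once the buffer is full)."""
        return sum(self.buffer) / len(self.buffer)

# Push 100 made-up readings 0.0 .. 99.0; only the last 86 (14.0 .. 99.0) remain.
history = EnergyHistory()
for reading in range(100):
    history.push(float(reading))
avg = history.average()
```

Here `avg` is 56.5, the mean of the 86 retained readings.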
Next we compare the current instant energy to the average energy over the past second multiplied by some constant C. To get the value of C we first compute the variance of the past 86 instant energies:
$$V = \frac{1}{86} \sum_{i=0}^{85} \left( E[i] - \langle E \rangle \right)^2$$