Appendix A: Introduction to Digital Signal Processing
Digital Signal Processing (DSP) is the process of manipulating a signal digitally either to analyse it or to create various possible effects at the output.
Music can be thought of as a real world analogue signal. For music to be processed digitally by a computer it must be converted from a continuous analogue signal to a digital signal made up of a sequence of discrete samples. This conversion is called sampling.
The analogue signal that we want to convert to a digital representation is sampled many times per second. Each sample is simply a binary encoding of the amplitude at that sampling instant. The sampling frequency is the number of samples obtained per second, measured in hertz (Hz). If we sample at too low a rate, we could miss changes in amplitude that occur between the taking of samples, and we could also mistake higher frequency signals for lower ones. This is called aliasing.
To prevent aliasing and to reconstruct the original signal exactly, we must sample at a rate which is more than double the highest frequency in the original signal that we want to capture. This result is the Nyquist-Shannon sampling theorem, and half the sampling rate is known as the Nyquist frequency. Most digital audio is recorded at 44,100 Hz (44.1 kHz), so the Nyquist frequency for this sample rate is 22,050 Hz, which is approximately the highest frequency perceivable by the human ear.
PCM or Pulse code modulation is the most common method of encoding analogue audio signals in digital form. The diagram below shows the sampling of a signal for 4-bit PCM.
Figure 37: Sampling of a signal for 4-bit PCM
The following diagram displays how the FMOD library (24) stores raw PCM audio data in memory buffers and differentiates between samples, bytes and milliseconds. In this format it can be seen that a left-right pair is called a sample.
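The sampling ideas in this appendix (aliasing, 4-bit PCM encoding, and the relation between samples, bytes and milliseconds) can be illustrated with a minimal Python sketch. All function names are hypothetical, and the 16-bit stereo buffer format is an assumption made for the example rather than something taken from the FMOD documentation.

```python
import math

def alias_frequency(f_hz, sample_rate_hz):
    """Frequency (Hz) at which a tone of f_hz appears after sampling."""
    # Fold the tone into the range 0 .. sample_rate / 2.
    f = f_hz % sample_rate_hz
    return min(f, sample_rate_hz - f)

def sample_cosine(f_hz, sample_rate_hz, count):
    """Take `count` samples of a unit-amplitude cosine at f_hz."""
    return [math.cos(2 * math.pi * f_hz * n / sample_rate_hz)
            for n in range(count)]

def quantise(value, bits):
    """Encode an amplitude in -1.0 .. 1.0 as an unsigned PCM level."""
    levels = 2 ** bits
    level = int((value + 1.0) / 2.0 * (levels - 1) + 0.5)  # round to nearest
    return max(0, min(levels - 1, level))

def samples_to_bytes(samples, channels=2, bits=16):
    """Buffer size in bytes; one sample covers all channels (assumed 16-bit stereo)."""
    return samples * channels * bits // 8

def samples_to_ms(samples, sample_rate_hz=44100):
    """Duration in milliseconds of a number of samples."""
    return samples * 1000.0 / sample_rate_hz
```

Sampled at 44,100 Hz, a 30,000 Hz cosine produces the same sample values as a 14,100 Hz cosine, which is the aliasing effect described above. Under the assumed format, one second of audio (44,100 samples) occupies 176,400 bytes.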
Once we have access to the audio data in digital format, we can apply many functions to manipulate it.
Filtering is a common process used to either select or suppress certain frequency ranges of the signal. A low-pass filter removes the frequency components of a signal above a certain bound, allowing the low frequencies to pass through it normally. In music, high frequencies equate to the treble and low frequencies equate to the bass. Therefore a low-pass filter would return only the bass elements of a song.
A high-pass filter returns a signal containing only treble frequencies above a certain bound. A band-pass filter returns a signal containing frequencies between a specified lower and upper bound.
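As a rough illustration of these filter types, the sketch below uses a one-pole smoothing filter. This is a swapped-in, very simple design chosen for brevity: it attenuates high frequencies gradually rather than removing everything above a sharp bound. The function names and the `alpha` parameter are hypothetical.

```python
def low_pass(signal, alpha):
    """One-pole low-pass: each output moves a fraction alpha towards the input."""
    out = []
    y = 0.0
    for x in signal:
        y += alpha * (x - y)   # small alpha = heavier smoothing = lower cutoff
        out.append(y)
    return out

def high_pass(signal, alpha):
    """High-pass as the residual: the input minus its low-passed version."""
    smoothed = low_pass(signal, alpha)
    return [x - s for x, s in zip(signal, smoothed)]

def band_pass(signal, alpha_low, alpha_high):
    """Band-pass sketch: discard the lows, then smooth away the highs."""
    return low_pass(high_pass(signal, alpha_low), alpha_high)
```

A constant (zero-frequency) input passes through `low_pass` almost unchanged and is almost entirely removed by `high_pass`, while a rapidly alternating input behaves the other way round.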
Filtering is made possible through a technique called convolution. Convolution combines the original signal with a second signal (for a filter, its impulse response): the second signal is reversed and shifted along the first, and at each shift the point-by-point products are summed to give the amount of overlap between the two signals. It can be seen as a very general weighted moving average. Computed directly, this process is computationally expensive, as for every output sample each point in the shifted signal has to be multiplied by the corresponding point in the original signal.
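A minimal sketch of direct convolution follows, assuming both signals are plain Python lists; the function name is hypothetical. The nested loop makes the cost visible: the number of multiplications is the product of the two lengths.

```python
def convolve(signal, kernel):
    """Direct (full) convolution of two finite sequences."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            # Each input point contributes to the outputs it overlaps with.
            out[i + j] += s * k
    return out

# A three-point moving average is convolution with equal weights.
moving_average_kernel = [1.0 / 3.0] * 3
smoothed = convolve([3.0, 3.0, 3.0, 3.0], moving_average_kernel)
```

With equal weights the kernel acts as a crude low-pass filter: away from the edges, each output is the average of three neighbouring inputs.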
Correlation is another function which shows how similar two signals are, and for how long they remain similar when one is shifted with respect to the other. The idea is the same as convolution, except that the second signal is not reversed, just shifted by a certain lag. Correlating a signal with itself is called auto-correlation and can be used to extract a periodic signal from noise.
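The following sketch computes correlation for a range of non-negative lags. It is an unnormalised sum of products, and the function name and `max_lag` parameter are hypothetical.

```python
def cross_correlate(a, b, max_lag):
    """Sum of a[n + lag] * b[n] for each lag in 0 .. max_lag."""
    result = []
    for lag in range(max_lag + 1):
        overlap = min(len(b), len(a) - lag)   # points where both signals exist
        result.append(sum(a[n + lag] * b[n] for n in range(overlap)))
    return result

# Where does the pattern [1, 2, 3] occur inside a longer signal?
scores = cross_correlate([0, 0, 1, 2, 3, 0], [1, 2, 3], 3)
best_lag = scores.index(max(scores))

# Auto-correlation of a pulse train that repeats every 4 samples.
pulses = [1, 0, 0, 0] * 4
auto = cross_correlate(pulses, pulses, 8)
```

The highest score marks the shift at which the two signals line up best. In the auto-correlation, peaks appear at multiples of the repetition period, which is the property that makes the technique relevant to finding a regular beat.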
Full wave rectification takes the absolute value of each sample, so that a signal can be treated on a purely positive amplitude scale. Half wave rectification is an example of a clipper, in which negative valued samples of a signal are blocked (set to zero), whilst positive valued samples are untouched.
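Both forms of rectification are one-line operations on a list of samples. The sketch below uses hypothetical function names and assumes samples are signed numbers centred on zero.

```python
def full_wave_rectify(signal):
    """Replace every sample by its absolute value."""
    return [abs(x) for x in signal]

def half_wave_rectify(signal):
    """Block negative samples (set them to zero); keep positive ones."""
    return [max(x, 0.0) for x in signal]
```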
One of the most important techniques used in DSP (and in almost every algorithm described below) is called the Fourier Transform. Jean Baptiste Joseph Fourier showed that any signal could be reconstructed by summing together many different sine waves with different frequencies, amplitudes and phases. The discrete Fourier transform (DFT) is a specific type of Fourier transform used in signal processing, which operates on a finite sequence of samples. A computationally efficient algorithm to calculate the DFT is called the fast Fourier transform (FFT). There are many different algorithms to calculate the FFT, the most popular being the Cooley-Tukey algorithm.
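For reference, the DFT can be written directly from its definition. The sketch below is the slow form that the FFT speeds up, with a hypothetical function name; the number of operations grows with the square of the input length.

```python
import cmath
import math

def dft(signal):
    """Discrete Fourier transform computed directly from the definition."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

# A cosine that completes exactly 3 cycles in 16 samples.
tone = [math.cos(2 * math.pi * 3 * t / 16) for t in range(16)]
spectrum = dft(tone)
magnitudes = [abs(value) for value in spectrum]
```

Here the only significant magnitudes fall in bin 3 and its mirror image, bin 13, showing that the transform has identified the single sine wave that makes up the signal.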
The FFT takes a block of samples in the time domain as input and returns a spectrum: the amplitude and phase of each frequency component that makes up that block. Many techniques in DSP operate in the frequency domain, so the FFT is a convenient way of converting a signal from the time domain to the frequency domain so that these operations can be carried out. An inverse FFT is used to convert the data back into the time domain.
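A minimal sketch of a recursive radix-2 FFT in the Cooley-Tukey style is given below, together with an inverse built from the forward transform. It assumes the input length is a power of two, and the function names are hypothetical; it is an illustration, not the implementation used by any particular library.

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])   # transform of the even-indexed samples
    odd = fft(x[1::2])    # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def ifft(spectrum):
    """Inverse FFT via conjugation: conj(fft(conj(X))) / N."""
    n = len(spectrum)
    transformed = fft([value.conjugate() for value in spectrum])
    return [value.conjugate() / n for value in transformed]
```

Applying `ifft` to the result of `fft` returns the original samples (up to rounding error), which is the time domain to frequency domain round trip described above.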
The short-time Fourier transform (STFT) is an adaptation of the Fourier transform that is more applicable to polyphonic, non-stationary signals such as music. Essentially, it applies the FFT to small sections of the signal at a time, in a process called windowing.
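The windowing idea can be sketched as follows. For brevity this example computes each frame's spectrum with a direct DFT rather than an FFT, and the Hann window, frame size and hop size are assumptions chosen for illustration; all names are hypothetical.

```python
import cmath
import math

def hann(n):
    """Hann window: tapers each frame smoothly to zero at its edges."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def magnitude_spectrum(frame):
    """Magnitudes of the non-negative frequency bins of one frame (direct DFT)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]

def stft(signal, frame_size, hop):
    """Split the signal into windowed frames and transform each one."""
    window = hann(frame_size)
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_size)]
        frames.append(magnitude_spectrum(frame))
    return frames

# A signal whose pitch changes half way through.
signal = [math.cos(2 * math.pi * 4 * n / 32) for n in range(64)]
signal += [math.cos(2 * math.pi * 10 * n / 32) for n in range(64, 128)]
frames = stft(signal, frame_size=32, hop=32)
peak_bins = [frame.index(max(frame)) for frame in frames]
```

A single transform of the whole signal would show both frequencies but not when each occurs. The frame-by-frame result shows the strongest bin moving from the first tone to the second, which is what makes the STFT suitable for non-stationary signals.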
Figure 38: How FMOD stores audio data
Wavelets and the Discrete Wavelet Transform (DWT) offer an alternative to the STFT that analyses the signal at several resolutions, rather than with a single fixed window size.
More can be read about the DWT in the paper described below: Audio Analysis using the Discrete Wavelet Transform.
More on DSP techniques in general can be found at the online introduction to DSP by Bores (27).
Appendix B: Specification
Aims of the project
The aim of this project is to analyse pieces of music to detect their tempo (beat detection) and their key (harmonic detection). The system will work by accepting up to two music files at a time from the user, analysing them, and returning a visualisation of the audio content with beat markers, and an indication of the key of the song. There should be intuitive controls to enable the user to play, pause, stop, alter the volume, pitch-shift and time-stretch the song. There should also be a function which can automatically mix together two songs based on the detected beats and keys of the songs.
The big idea of the project is to aid a DJ to perform a perfect beat and harmonic mix; to devise a program that will enable a DJ to mix together two songs which have a similar tempo, and a similar key, so that they ‘sound good’ together.
Core Specification
The system must be able to analyse music in the following common file formats: .wav (Microsoft Wave files), .mp3 (MPEG I/II Layer 3), .wma (Windows Media Audio format), .ogg (Ogg Vorbis format).
The system is only required to deal with music containing a prominent distinguishable beat. It is only expected to deal with music of a similar style to that which a DJ would be playing at a night club.
The system must be able to detect the beats from the music files accurately, and provide visualisations of those beats to the user as the music file is played. The system must be able to detect the key of the track accurately, and display the detected key to the user.
The system must be able to play two tracks at the same time, enabling the user to alter certain properties of each track independent of the other. The system should treat each track as a separate unit, rather like a physical deck or turntable in a DJ set-up. There should also be a mixer unit, which stands between the two tracks, and enables the user to mute or un-mute each track or to cross-fade one track into the other, so that they are both audible at the same time.
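The mixer's cross-fade can be sketched as a pair of gains applied to the two decks before their outputs are summed. The linear fade law and the 0.0 to 1.0 position scale below are assumptions made for illustration, and the function names are hypothetical.

```python
def crossfade_gains(position):
    """Gains for (deck A, deck B); 0.0 = fully A, 1.0 = fully B (assumed linear law)."""
    position = max(0.0, min(1.0, position))   # clamp the slider range
    return 1.0 - position, position

def mix(samples_a, samples_b, position):
    """Sum the two decks' samples after applying the crossfader gains."""
    gain_a, gain_b = crossfade_gains(position)
    return [a * gain_a + b * gain_b for a, b in zip(samples_a, samples_b)]
```

With the crossfader at 0.0 only deck A is audible, and at 0.5 both decks contribute equally. Other fade laws, such as constant-power curves, are a possible design alternative.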
The properties of each track that the user should be able to adjust are: its volume, its tempo (independent of pitch: time-stretching), its pitch (independent of tempo: pitch-shifting) and both its tempo and pitch together. By varying the tempo and pitch together, the user is also varying the track's key; for example, an adjustment of roughly ± 6% in the BPM rate would cause a change of one semitone in key. This adjustment in key should also be taken into account and updated based on the changes in tempo.
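The relationship between a combined tempo and pitch change and the resulting key shift follows from equal temperament, in which one semitone corresponds to a frequency ratio of 2^(1/12). A minimal sketch, with hypothetical function names:

```python
import math

def semitone_shift(speed_percent):
    """Key shift in semitones when tempo and pitch change together by speed_percent."""
    return 12 * math.log2(1 + speed_percent / 100)

def percent_for_semitones(semitones):
    """Speed change (percent) needed to shift the key by a number of semitones."""
    return (2 ** (semitones / 12) - 1) * 100
```

An increase of about 5.95% raises the key by exactly one semitone, while a decrease of about 5.61% lowers it by one, so the 6% figure is a convenient approximation in both directions.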
The system should be able to return some sort of visualisation of the currently playing tracks to the user.
A plot of the waveform of the currently playing track along with visual beat-markers that mark out the start of a new beat would be very useful to a user looking to beat-match together two songs.
An oscilloscope or VU meter outlining the amplitude or volume of each track would aid the user to
The BPM and key detected by the program should be stored in a tag within the music file, such as ID3v2 format (for MP3 files). This then makes it possible for the program to indicate which other tracks would be suitable to be mixed with a selected track. The system can indicate tracks with a compatible key to one that is selected based on Camelot’s easymix system as described in the background section. Additionally the system should be able to read and display common tags or metadata from files loaded such as the artist and title of a song.
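The compatibility rule can be sketched as follows. It is based on the example given in the user guide of this document (keycode 4A mixes with 3A, 4A, 5A and 4B), and it assumes that the numbers run from 1 to 12 and wrap around like a clock face. The function name is hypothetical.

```python
def compatible_keycodes(keycode):
    """Keycodes that mix harmonically with `keycode`, e.g. '4A'."""
    number, letter = int(keycode[:-1]), keycode[-1].upper()
    if not 1 <= number <= 12 or letter not in ("A", "B"):
        raise ValueError("expected a keycode such as '4A' or '11B'")
    down = (number - 2) % 12 + 1   # one step back; 1 wraps to 12 (assumed)
    up = number % 12 + 1           # one step forward; 12 wraps to 1 (assumed)
    other = "B" if letter == "A" else "A"
    return [f"{down}{letter}", f"{number}{letter}", f"{up}{letter}", f"{number}{other}"]
```

For example, `compatible_keycodes("4A")` returns `["3A", "4A", "5A", "4B"]`, and the result for `"10A"` includes `"11A"`, matching the pairing used in the user guide.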
The system should be able to automatically beat-match together two tracks. First it detects the beats and key of each track. Then it time stretches or adjusts the speeds of each track to the same speed and a compatible key. Finally by overlaying one track on top of another at the beginning of a beat, the two tracks should be beat-matched (the beats of each track should fall at precisely the same time), making a seamless harmonic mix.
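The tempo-matching step reduces to simple arithmetic on the two detected BPM values. The sketch below uses hypothetical function names and takes its example figures from the user guide in this document (139.68 BPM and 136.01 BPM).

```python
def sync_adjustment(playing_bpm, other_bpm):
    """Change needed to bring the non-playing track to the playing track's tempo."""
    ratio = playing_bpm / other_bpm
    return {
        "ratio": ratio,                          # time-stretch factor
        "bpm_change": playing_bpm - other_bpm,   # absolute change in BPM
        "percent": (ratio - 1) * 100,            # relative change in percent
    }

def beat_period_ms(bpm):
    """Time between consecutive beats in milliseconds."""
    return 60000.0 / bpm

adjustment = sync_adjustment(139.68, 136.01)
```

For these values the change is 3.67 BPM, or about 2.7%. Once both tracks share a tempo their beat periods are equal, so starting the second track on a beat of the first keeps every later beat aligned.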
The project is to be written in C# .NET, making use of Windows Forms for the graphical user interface. The FMOD sound system (24) will be the main library used. This library contains various functions that will enable the raw sound data to be extracted from the audio files and used in the algorithms. It includes a pitch-shifting algorithm which can also be used to do time stretching. This algorithm is based on code developed by Bernsee (28).
The graphical user interface should be clear and intuitive. Using sliders the user should be easily able to adjust each property of the track. Buttons with intuitive icons should make it clear to the user what their function is.
Extended Specification
If time permits, there are various extensions to the project that could be implemented to improve the overall functionality of the system.
Improvements to the beat detection algorithm could be investigated in order to enable the system to detect beats in a wider range of musical styles, such as rock. The addition of cue- and loop-points gives the user more control over the mix. Cue-points enable the user to start playback of a file from a defined point, such as the time of the first beat. Loop-points enable the user to repeat certain regions of the song, possibly to extend the length of a mix.
Various effects could be added to the mixer unit, such as reverb, echo and flange. Low-pass and high-pass filters enable the user to filter out the sounds of one track which may not fit well with that of another, e.g. a low-pass filter could be used to eliminate an irregular high-pitched hi-hat sound.
The addition of real-time recording of mixes, including applied effects, would enable the user to look back over the mix to decide if two songs sound good together and to make pre-recorded mixing a possibility. Finally, extending the application with an integrated file browser displaying information about the user's song library would make it easier for the user to load new songs and generate playlists.
Appendix C: User Guide
Upon loading the program, you are presented with the main screen. This consists of two decks, Deck A and Deck B, a Mixer containing volume controls and crossfader, and the music browser, which will display information on the tracks in your music library.
Figure 39: The Main Screen (showing Deck A, the Mixer, Deck B and the Music Browser)
Loading a track into a deck
A track can be loaded in one of three different ways. You can load a track into a deck by dragging and dropping an appropriate file from Windows Explorer onto one of the decks. You can also click the eject button on the appropriate deck; this opens a file browser dialog where you can navigate to a music folder and load one of your songs.
The alternative method is to use the built-in music browser to navigate your computer for music folders.
Once a folder is selected, the music browser will display information about the songs in the folder such as the artist, title and duration, and the BPM and key if they have previously been detected. You can sort the tracks in ascending or descending order of any of these categories by clicking the column header of the appropriate category.
Select a track from the music browser and click the ‘Load in Deck A’ button to load the track into Deck A; the same applies for Deck B. You can also right-click on the track and select the same option from the drop-down menu.
Once the song is loaded the deck will display the waveforms of the track; one which covers the whole track, and the zoomed-in view which displays 6 seconds of the audio at a time. The deck displays information on the track such as its BPM and Key. The deck now allows you to play and pause the track and to adjust its pitch and tempo.
Figure 40: Loading Sasha - Magnetic North into Deck A
Figure 41: The Deck Control
Detecting the Key of a track
Follow the same process as described above for loading a track, however instead of selecting ‘Load into Deck A/B’, choose the ‘Detect Key’ option.
The program will then begin to detect the key of the selected track. Once again, a progress bar indicates the progress of the key detection process.
Once the key detection process has finished, a table showing the results will pop up, and the status bar will display the key and keycode which will be written to the ID3 tag of the audio file.
In Figure 42 the key detected is clearly C major.
Figure 42: Key Detection progress/results
Mixing two tracks
First load tracks into Deck A and Deck B. These tracks should be selected based on their detected BPM and key. A track with keycode 4A can be harmonically mixed with any track with keycode 3A, 4A, 5A or 4B. The BPMs of the two tracks should not differ by much (+/- 10 BPM max); trying to mix tracks that differ largely in their BPMs will not usually sound good. In Figure 43, the two tracks are X-Cabs – Neuro 99 in deck A, with a keycode of 10A and BPM of 139.68, which is going to be mixed into Xstasia – Sweetness which has a compatible keycode of 11A and BPM of 136.01.
1. Make sure the crossfader is all the way to the left hand side and start the track in deck A.
2. Enable the sync button on the track in deck B. The sync button will sync the non-playing track to the tempo of the playing track. In Figure 41, you can see the tempo of Xstasia in deck B has been automatically increased by 3.67 BPM (2.7%) to match the BPM of 139.68 of X-Cabs in deck A.
3. Now cue up the track in deck B to its first downbeat, as described in the ‘Illustration of beat mixing’ section of the background. This can be achieved by dragging the waveform until the first beat marker is found, and aligning the beat marker with the centre of the waveform.
Figure 43: Crossfader in left hand position
4. When the playing track reaches a downbeat press play on deck B. The track will only start playing when the next beat marker of song A passes the centre of the waveform.
5. Now both tracks are playing but because the crossfader is in the left hand position, only deck A’s output is audible. Drag the crossfader to the central position so that both tracks can now be heard, as in Figure 44. If the beats are in sync the mixed output will sound like one song still. If the beats are not matched, the beats of the two songs will clash at irregular intervals and the overall output will not make