One can understand a chord as a functional harmonic relationship between multiple pitches separated by certain intervals, so identifying a chord requires knowing two things: the combination of pitches being played and the functional relationship implied by that combination. For this, a system must be capable of discerning pitches in a signal. A pitch is defined in terms of the amplitudes of the frequencies being sounded in a signal. Identifying a pitch depends on a process known as fundamental frequency, or F0, estimation. When a listener perceives a pitch, it is the F0 that defines the “note” that the pitch is sounding [43]. Logically, then, one can estimate pitches and the chords they create based on the TFRs previously outlined, and indeed [44] has demonstrated a
method of doing so using the CQT. As shown, a strength of the CQT is that it
preserves the harmonic relationships between frequencies as an easily recognizable pattern. The harmonic relationships between frequencies that belong to the same pitch are defined according to the overtone series [42], and as pitch changes the ratio between these frequencies does not change. This means that a fundamental frequency can be defined as a frequency over which the overtone series of frequencies can be identified. In situations where only one pitch is sounding, this is a simple procedure; however, when applied to chord identification, where multiple pitches are sounding at once, interference between frequencies can cause strange behavior in the frequency-amplitude spectrum. [44] demonstrates that a component that should appear on the frequency-amplitude spectrum of the CQT at the frequency f1 will not appear if another component exists on the spectrum at the frequency f2 according to the relationship (Ex. 12 [44]). In practical terms, out of the 57 chords in the Western tradition that can be defined as a combination of up to 4 pitches, 14 consist of a combination of pitches that includes two frequencies with this relationship. This makes these chords impossible to identify with complete accuracy even given a completely accurate TFR.

Figure 24. The chroma features of an audio sample. The frequency features are filtered by pitch class, and a greater value in the heat map represents a greater summed amplitude.
$$\frac{Q}{Q+1}\, f_2 \;\le\; f_1 \;\le\; \frac{Q+1}{Q}\, f_2 \qquad \text{(Ex. 12 [44])}$$
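The relationship in Ex. 12 can be checked numerically for any pair of frequencies. The following is a minimal sketch rather than the procedure of [44]: the function names are hypothetical, and the conversion from bins per octave to Q assumes the common constant-Q definition with geometrically spaced bins, which the text above does not specify.

```python
def q_from_bins_per_octave(bins_per_octave):
    """Quality factor Q, ASSUMING geometrically spaced bins (Q = f / delta_f)."""
    return 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)


def masking_interval(f2, q):
    """Lower and upper bound (Hz) on f1 given by Ex. 12 for a component at f2."""
    return (q / (q + 1.0)) * f2, ((q + 1.0) / q) * f2


def is_masked(f1, f2, q):
    """True if f1 lies inside the interval around f2 described by Ex. 12."""
    low, high = masking_interval(f2, q)
    return low <= f1 <= high


def conflicting_pairs(freqs, q):
    """All pairs of frequencies in a chord that satisfy Ex. 12."""
    return [
        (a, b)
        for i, a in enumerate(freqs)
        for b in freqs[i + 1:]
        if is_masked(a, b, q)
    ]


# Illustrative values only: Q = 34 is roughly what 24 bins per octave gives.
print(masking_interval(440.0, 34.0))
print(conflicting_pairs([440.0, 445.0, 660.0], 34.0))
```

Because the bounds are reciprocal ratios of each other, the test is symmetric: if f1 falls inside the interval around f2, then f2 also falls inside the interval around f1.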
While [44] explores how one might compensate for this effect, most
contemporary research is built upon an alternative representation that is designed to provide an estimation of likely pitches over a signal in combination with a pattern recognition model that can take these estimations as input and return the likely chords as output. [45] identifies three components common to chord identification systems under this paradigm: chroma feature extraction, filtering, and pattern matching. Chroma feature extraction refers to the generation of a type of this specialized TFR known as a pitch class
profile or chromagram. This TFR was first proposed in [46] and is comparable in some ways to the MFCC. In both cases, the frequency-amplitude values determined by each frame of a TFR of a signal are passed through specialized filter bins, reducing the complexity of the representation to frequency features organized by some concept of relevance. In the case of the MFCC, these filter bins sit along the frequency spectrum with their distances determined by the mel scale. It thus organizes frequency-amplitude by perceptual frequency distance. The pitch class profile organizes frequencies with another kind of filter, that of pitch. Windows in the frequency domain divide frequencies by the pitches that a fundamental in that range would signify. For instance, the range of frequencies around 440 Hz is filtered into the pitch class “A.” These filters are octave-agnostic, meaning that all frequencies that would resolve to an A of any octave are filtered into the same pitch class of “A.” The sum of the amplitudes of the frequencies that fall within each pitch class filter in each frame of the STFT determines the value for that pitch class. Given that there are 12 pitch classes in the Western classical tradition (one for each note on the chromatic scale of an octave), the values for each frame of the STFT are thus reduced from a full frequency representation to a vector of 12 dimensions, in which each value represents the sum of amplitudes in the respective pitch class. In other words, the pitch class profile approximates the relative strength of each pitch at every frame of the signal, regardless of octave. Alternative PCPs could be devised using vectors of greater or fewer dimensions corresponding to the organization of pitches used in other musical traditions. This TFR is considered foundational in the field of chord-sequence-based retrieval [47]–[50].

As with other TFRs, the duration of the frame has significant effects on the resolution of the represented frequency-amplitudes. This is what makes the filtering step necessary before pattern matching can begin [45]. On one hand, the frame must be short enough to fit within the expected rate of chord change in order to capture that change. On the other hand, the shorter the frame, the more susceptible the frequency-amplitude representation is to noise. Commonly, researchers will pass a chromagram with frames of short duration through a low-pass filter that minimizes frequency-amplitude values that do not persist over a significant number of frames [46][48][51].
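The two steps described above, folding STFT frames into 12 pitch classes and then smoothing over time, can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the implementation of [46]: the function names, the A4 = 440 Hz equal-tempered tuning, the 55 to 2000 Hz analysis range, the frame and hop sizes, and the use of a moving average as the low-pass filter are all choices made here for the example.

```python
import numpy as np


def stft_magnitudes(signal, n_fft=4096, hop=1024):
    """Magnitude STFT with a Hann window; shape (n_fft // 2 + 1, n_frames).
    Assumes len(signal) >= n_fft."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop: i * hop + n_fft] * window for i in range(n_frames)],
        axis=1,
    )
    return np.abs(np.fft.rfft(frames, axis=0))


def chromagram(magnitudes, sample_rate, n_fft, f_min=55.0, f_max=2000.0, a4=440.0):
    """Sum STFT magnitudes into 12 octave-agnostic pitch classes (0 = C, 9 = A)."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    keep = (freqs >= f_min) & (freqs <= f_max)
    # Nearest equal-tempered note number (A4 = 69), folded onto one octave.
    midi = 69 + 12 * np.log2(freqs[keep] / a4)
    classes = np.round(midi).astype(int) % 12
    chroma = np.zeros((12, magnitudes.shape[1]))
    np.add.at(chroma, classes, magnitudes[keep])  # accumulates repeated classes
    return chroma


def smooth_over_time(chroma, width=9):
    """Moving-average low-pass across frames (assumed filter choice).
    Assumes the number of frames is at least `width`."""
    kernel = np.ones(width) / width
    return np.stack([np.convolve(row, kernel, mode="same") for row in chroma])


# Usage: one second of a 440 Hz sine should concentrate in pitch class A.
sr = 22050
tone = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
chroma = smooth_over_time(chromagram(stft_magnitudes(tone), sr, 4096))
print(np.argmax(chroma, axis=0))
```

A longer smoothing window suppresses more transient energy but also blurs chord boundaries, which is the trade-off the paragraph above describes.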
The next step is to take these pitch estimations and formulate some function that can use them to identify chord structures. The method originally proposed in [46] took the form of a simple nearest-neighbor calculation in the vector space defined by the pitch classes; however, [51] notes that this method has only been found to work well for synthetic sound and not for real, often more chaotic polyphonic recordings. The first method to find success with live sound was proposed in [50], using hidden Markov models (HMMs) trained with an Expectation-Maximization (EM) algorithm. An HMM is a statistical model in which the hidden state of a sequence of data, drawn from some finite set of states, is predicted based on observations in that data. HMMs are commonly used in the analysis of speech signals, in which the state to be determined is the phoneme being pronounced. In [50], a vocabulary of chords forms the finite set of states. The probability of each state is defined as a single Gaussian distribution in N-dimensional space, where N matches the dimensions of the pitch class vector. The probability is then adjusted
(trained) based on the performance according to an EM algorithm defined as (Ex. 13 [50]), where E is the expectation of the log-probability of the observed features X and the unknown chord labels Q under the probability parameters Θ. The expectation is taken over the chord labels as estimated under the previous parameters Θold, and the current parameters Θ are chosen such that the value of log P(X, Q|Θ), summed over the estimated labels, is maximized. The Gaussian model is initialized with random parameters and is then tuned by this EM training process. As [50] notes, the original model parameters could be estimated directly only if the delineation between states (the boundaries between chords) was known beforehand. [48] attempted to approximate these boundaries by introducing high-level rhythmic information into the pitch class representation with some success, outperforming the original work done by [50]. Once a sequence of chord states is determined, sequence alignment allows the similarity between chord sequences to be quantified. This method has achieved marked success in cover-song and contrafact identification, for which instrumentation, timbre, duration, tempo, key, and potentially other qualities like melody are expected to vary, but harmonic progression is likely to remain largely the same [47].
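The comparison of chord sequences by alignment could be sketched as follows. This is a generic Smith-Waterman local alignment, not the specific procedure used in [47]. The (root, quality) chord encoding, the scoring values, and the brute-force search over all 12 transpositions to obtain key invariance are illustrative assumptions.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score between two chord sequences."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


def transpose(seq, semitones):
    """Shift every chord root by a number of semitones, keeping the quality."""
    return [((root + semitones) % 12, quality) for root, quality in seq]


def key_invariant_similarity(a, b, match=2):
    """Best alignment score over all 12 transpositions of b, scaled to [0, 1]."""
    if not a or not b:
        return 0.0
    best = max(
        local_alignment_score(a, transpose(b, k), match=match) for k in range(12)
    )
    return best / (match * min(len(a), len(b)))


# Usage: the same I-vi-IV-V progression in C major and in G major.
in_c = [(0, "maj"), (9, "min"), (5, "maj"), (7, "maj")]
in_g = [(7, "maj"), (4, "min"), (0, "maj"), (2, "maj")]
print(key_invariant_similarity(in_c, in_g))
```

Local rather than global alignment is used here so that a shared progression can be found even when the two recordings differ in length or structure.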
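$$E[\log P(X, Q \mid \Theta)] = \sum_{Q} P(Q \mid X, \Theta_{\text{old}}) \, \log\big(P(X \mid Q, \Theta)\, P(Q \mid \Theta)\big) \qquad \text{(Ex. 13 [50])}$$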
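To make the HMM formulation more concrete, the sketch below decodes a chord sequence from pitch class vectors with the Viterbi algorithm. It covers decoding only: the EM training of Ex. 13 is omitted, and the hand-set parameters (binary triad templates as Gaussian means, one shared variance, and a uniform transition matrix with a high self-transition probability) are assumptions for illustration rather than the randomly initialized, trained model of [50]. All function names are hypothetical.

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def triad_templates():
    """Names and 12-d binary templates for the 24 major and minor triads."""
    names, templates = [], []
    for root in range(12):
        for quality, third in (("maj", 4), ("min", 3)):
            t = np.zeros(12)
            t[[root, (root + third) % 12, (root + 7) % 12]] = 1.0
            names.append(NOTE_NAMES[root] + ":" + quality)
            templates.append(t)
    return names, np.array(templates)


def log_gaussian(frames, means, var=0.1):
    """Log-density of each frame under each state's isotropic Gaussian.
    frames: (n_frames, 12), means: (n_states, 12) -> (n_frames, n_states)."""
    diff = frames[:, None, :] - means[None, :, :]
    dims = means.shape[1]
    return -0.5 * (np.sum(diff ** 2, axis=2) / var + dims * np.log(2 * np.pi * var))


def viterbi(log_emit, log_trans, log_init):
    """Most likely state path given log emission, transition, and initial scores."""
    n_frames, n_states = log_emit.shape
    score = log_init + log_emit[0]
    back = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + log_trans  # cand[i, j]: state i at t-1, j at t
        back[t] = np.argmax(cand, axis=0)
        score = np.max(cand, axis=0) + log_emit[t]
    path = [int(np.argmax(score))]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]


def decode_chords(chroma, self_prob=0.9):
    """chroma: (n_frames, 12) pitch class vectors -> list of chord names."""
    names, means = triad_templates()
    n = len(names)
    trans = np.full((n, n), (1.0 - self_prob) / (n - 1))
    np.fill_diagonal(trans, self_prob)
    # Scale each frame to a peak of 1 so it is comparable with the templates.
    scaled = chroma / np.maximum(chroma.max(axis=1, keepdims=True), 1e-12)
    init = np.full(n, 1.0 / n)
    path = viterbi(log_gaussian(scaled, means), np.log(trans), np.log(init))
    return [names[i] for i in path]
```

With the transition term removed, the decoder reduces to choosing the nearest template for every frame independently, so the self-transition probability is what discourages spurious chord changes between adjacent frames.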