4.4.1 Feature-based modules
Auditory modules. Based on existing research, modules can be created to process pitch distance, loudness difference, onset synchrony, spatial differences (if relevant) and timbre distance, among others. In each case, a relevant and meaningful viewpoint must be developed in order for IDyOM to learn its patterns in music and make predictions. Such viewpoints already exist for pitch, interval, scale degree, inter-onset interval and duration, which are among the most commonly used in existing research (for a full list, see Chapter 3). Harmonic, timbral and
spatial viewpoints are yet to be implemented, though work on harmonic viewpoints is ongoing in the Music Cognition Lab.
Musical modules. As presented in Chapter 2, parsing musical scenes is influenced by a number of features that play a particular role in the musical auditory scene: harmony, phrase boundaries, repetition and similarity. Harmony strengthens integration across a piece of music overall, with the most consonant harmonies (Conklin & Bergeron, 2010) resulting in the strongest integration. A harmony-processing module can be implemented to reflect this by developing an appropriate harmonic viewpoint to learn and predict harmonic patterns.
Phrase boundaries are also a helpful guide for predicting streaming structure, as phrases are typically confined to a single streaming structure throughout. Some models use this to their advantage, easily preventing voice-crossing within segments by analysing segments rather than individual pitches (Chew & Wu, 2004; Ishigaki et al., 2011; Madsen & Widmer, 2006; Rizo et al., 2006). IDyOM already performs real-time phrase boundary detection; this information could therefore be harnessed in its own module to inform streaming decisions, for example by biasing the decision towards inertia – in other words, keeping the same streaming structure – between phrase boundaries and allowing more flexibility to change streaming structure at phrase boundaries. This could be implemented by outputting a lower IC for the streaming structure that matches the immediately preceding context between phrase boundaries, and outputting a uniform IC across all potential streaming structures at phrase boundaries.
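The inertia bias described above can be sketched as follows. This is a minimal illustration under stated assumptions, not part of IDyOM: the function name `phrase_boundary_ic`, the string labels standing in for candidate streaming structures and the `inertia_prob` parameter are all hypothetical. Between phrase boundaries the structure matching the preceding context receives most of the probability mass, and hence a lower IC; at a boundary every candidate receives the same IC.

```python
import math


def phrase_boundary_ic(candidates, previous, at_boundary, inertia_prob=0.7):
    """Return a hypothetical IC (in bits) for each candidate streaming structure.

    candidates   -- hashable labels for the possible streaming structures
    previous     -- label of the structure used in the preceding context
    at_boundary  -- True if a phrase boundary has just been detected
    inertia_prob -- assumed probability of keeping the previous structure
                    between boundaries (illustrative value, not from IDyOM)
    """
    n = len(candidates)
    if n == 0:
        return {}
    # At a phrase boundary (or when there is nothing to be inert towards),
    # every structure is equally likely, so IC is uniform: log2(n) bits.
    if at_boundary or n == 1 or previous not in candidates:
        return {c: math.log2(n) for c in candidates}
    # Between boundaries, favour the previous structure and share the
    # remaining probability mass equally among the alternatives.
    other_prob = (1.0 - inertia_prob) / (n - 1)
    return {
        c: -math.log2(inertia_prob if c == previous else other_prob)
        for c in candidates
    }


structures = ["one stream", "two streams", "three streams"]
within_phrase = phrase_boundary_ic(structures, "two streams", at_boundary=False)
at_phrase_end = phrase_boundary_ic(structures, "two streams", at_boundary=True)
```

For the bias to act as inertia, `inertia_prob` must lie between 1/n and 1, where n is the number of candidate structures; how strong the bias should be is an open empirical question.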
Repetition and similarity are closely related (see Section 2.2), and inform auditory streaming in a similar way: sounds that are repeated or similar are more likely to originate from the same source. Repetition is inherently modelled in IDyOM through the short-term model (STM), where repeated sequences result in low IC as they become increasingly predictable. However, as we
have seen in the context of music, the source can be the overall piece of music, performed by either a soloist or an ensemble, or the melody in a piece of music, played by one instrument or a group of instruments. This complicates repetition and similarity in music substantially, as the similarity of phrase variations played by the same instrument must be compared with that of exact phrase repetitions played by two different instruments: which is more similar? Is an exact phrase repetition played by another instrument part of the same perceptual stream, or is it considered a separate, perhaps temporary stream? Further research is required to answer these questions and guide the implementation of modules incorporating repetition and similarity information into this integrated streaming model.
4.4.2 Including attention
Though most auditory and musical aspects of auditory streaming described above can be modelled relatively easily, modelling attention remains elusive and timbre remains a challenging aspect of music to understand. The difficulties of studying and modelling attention have previously been summarized in Section 2.3. An attention module in an integrated framework for auditory streaming would modulate the relative importance, or weight, of all other modules informing the final streaming structure decision for any given musical phrase. This is equivalent to determining the relative salience of all features informing auditory streaming. This would again be done using information content, where the relative mean IC proportions of feature categories (i.e. pitch, time, harmony) for each stream built up in the model so far are translated into corresponding weights accorded to the output of feature-based modules. For example, if the mean IC of pitch interval for all existing streams is 1.2 times the mean IC of inter-onset interval for all existing streams, the IC output of all pitch-related feature-based modules – auditory and musical – making predictions about the streaming structure of the next slice will be weighted 1.2 times higher than that of time-related feature-based modules
for those same streaming structures. The work presented in Chapter 6 of this thesis will explore this hypothesis.
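The weighting scheme proposed above can be illustrated with a short sketch. It assumes that weights are simply the mean IC of each feature category normalised to sum to one, which preserves the ratios between categories; the function names `attention_weights` and `weighted_ic`, the category labels and the IC values are hypothetical and chosen to reproduce the 1.2 ratio of the example.

```python
from statistics import mean


def attention_weights(ic_by_category):
    """Turn the mean IC of each feature category into weights that sum to 1."""
    means = {cat: mean(values) for cat, values in ic_by_category.items()}
    total = sum(means.values())
    if total == 0:
        # Assumption: with no information at all, weight categories equally.
        return {cat: 1.0 / len(means) for cat in means}
    return {cat: m / total for cat, m in means.items()}


def weighted_ic(module_ic, weights):
    """Combine per-category module IC for one candidate streaming structure."""
    return sum(weights[cat] * ic for cat, ic in module_ic.items())


# Hypothetical IC values (bits) pooled over all existing streams.
observed = {
    "pitch": [2.4, 3.6],  # mean IC of 3.0
    "time": [2.0, 3.0],   # mean IC of 2.5
}
weights = attention_weights(observed)
# Pitch-related output is weighted 3.0 / 2.5 = 1.2 times higher than time.
ratio = weights["pitch"] / weights["time"]
# Weighted IC of a candidate structure for which both modules output 1 bit.
combined = weighted_ic({"pitch": 1.0, "time": 1.0}, weights)
```

Normalisation is a design choice made here for convenience: only the ratios between weights are specified by the hypothesis, so any rescaling that preserves them would serve equally well.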
4.4.3 Timbre
Commonly defined as the aspects of sound that are not pitch, duration or loudness, a straightforward definition of timbre still eludes researchers. Attempts to break the concept down result either in many components with limited psychological interpretation, in the case of MIR, or in two main temporal and spectral components that still fail to fully explain timbre, in the case of music cognition (Alluri & Toiviainen, 2010; Caclin, McAdams, Smith, & Winsberg, 2005; Lakatos, 2000; McAdams et al., 1995). While determining the distance between pairs of timbres in a two-dimensional space and associating this with an integration or segregation threshold is possible (Sauvé, Stewart, & Pearce, 2014), musical timbre presents two particular challenges: 1) timbre changes as a function of pitch; and 2) music contains timbral blends, as a result of different instruments playing simultaneously in the case of ensemble music. The first challenge is particularly relevant to solo instrumental music, where timbral differences between voices (e.g. piano) or across a piece of music (e.g. flute) are much more subtle than in ensemble music. How important is timbre as a streaming cue in these situations? Furthermore, timbre changes differently as a function of pitch for different instruments, making the creation of an accurate model of solo timbre very complex to begin with; what about when instruments combine? The challenge of instrumental blend has begun to be investigated for pairs of instruments (Kendall & Carterette, 1991; Sandell, 1995). While these studies provide a much-needed start towards understanding timbral blend, this line of research is still in its infancy, and much work is needed to develop timbral understanding for the blend of not only two instruments, but many, in order to model timbral perception in large ensembles such as orchestras.
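For the simple case mentioned at the start of this section, a fixed pair of timbres in a two-dimensional space, the distance-and-threshold idea can be sketched as below. The coordinates, the axis interpretation, the threshold values and the function names are hypothetical, and the sketch deliberately ignores both challenges discussed above (pitch-dependent timbre and timbral blend).

```python
import math


def timbre_distance(a, b):
    """Euclidean distance between two points in a 2-D timbre space."""
    return math.dist(a, b)


def is_segregated(a, b, threshold):
    """True if two timbres lie further apart than the segregation threshold."""
    return timbre_distance(a, b) > threshold


# Hypothetical coordinates, e.g. (spectral dimension, temporal dimension).
timbre_a = (0.0, 0.0)
timbre_b = (3.0, 4.0)
distance = timbre_distance(timbre_a, timbre_b)
```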
It is clear that many questions remain to be answered in order to better understand timbre perception, and I would argue that this issue is the most important to address in order to better model musical ASA. Existing research has established timbre as a relevant streaming cue (Bregman & Pinker, 1978; Handel & Erickson, 2004; Iverson, 1995; Marozeau et al., 2013; Sauvé, Stewart, & Pearce, 2014; Singh & Bregman, 1997) and the art of orchestration is concerned with choosing the most appropriate instruments for achieving the desired auditory scene. The same is true in composition, where timbre is one of the most important cues for segregating the melody from the accompaniment, or creating one unified sound mass. Therefore, if timbre perception can be appropriately modelled, so can musical ASA; the challenge in the context of this framework is to create a meaningful viewpoint that can do so.
4.4.4 Including musical training
A further challenge of modelling musical auditory scene analysis is the influence of individual differences on perception. In some cases, these even exceed musical training group definitions (Dean et al., 2014a). Research in musical training suggests that it is prolonged and focused exposure to music that causes differences in perception between musicians and non-musicians (Fujioka et al., 2004, 2006; Habib & Besson, 2009; Micheyl et al., 2006), but only for domain-specific (i.e., musical) tasks (Bigand & Poulin-Charronnat, 2006; Carey et al., 2015). Those who receive Western musical training are also more likely to be exposed to and understand Western classical music, whereas non-musicians may have less exposure to this genre (though exposure is unlikely to be zero) and more to popular genres such as rock, pop or dance. Musicians will likely have been exposed to popular genres as well; it is difficult to avoid music in everyday life and individuals have probably been exposed to more genres than they realize.
Therefore, it is the relative contribution of knowledge of each genre that will help define a listener’s background.
IDyOM has been shown to differentiate between jazz and classical music when trained on each of these specific genres of music (Hansen et al., 2016) by generating lower information content for music in the same genre it was trained on (see also van der Weij, Pearce, & Honing, 2017). This can theoretically be extended to simulate the musical backgrounds of individual listeners or groups of listeners, including different cultures, as IDyOM’s long-term model can be trained on any chosen pieces of music (currently restricted to monophonic pieces). The choice of training material then becomes crucial as it influences the information content output of the core model, which informs melody selection. As melody selection is based on the highest average information content relative to other voices, the absolute IC value does not matter (i.e., IC for a lesser-known genre will be higher for the entire piece of music, not just one voice); however, it is possible that highly idiomatic instrument-specific figures (e.g., violin arpeggiations) become well-learned and receive low information content, potentially low enough to push a melody into a non-melodic position according to the melody extraction module. This possible limitation should be explored in future research.
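The argument that only relative IC matters, and the limitation that follows, can be made concrete with a small sketch. The function name `select_melody`, the voice labels and all IC values are hypothetical, and the assumption that an unfamiliar genre raises the IC of every voice by the same factor is a simplification made for illustration only.

```python
from statistics import mean


def select_melody(ic_by_voice):
    """Return the voice with the highest mean IC."""
    return max(ic_by_voice, key=lambda voice: mean(ic_by_voice[voice]))


# Hypothetical per-note IC (bits) from a model trained on a familiar genre.
familiar = {
    "violin": [3.1, 2.8, 3.4],
    "viola": [1.9, 2.2, 2.0],
    "cello": [1.2, 1.5, 1.1],
}
# Assumed effect of a less familiar genre: IC rises for every voice alike,
# so the ranking of voices, and therefore the selected melody, is unchanged.
unfamiliar = {v: [ic * 1.5 for ic in ics] for v, ics in familiar.items()}

# The limitation: a well-learned idiomatic figure in the melody voice
# receives low IC and drops below an accompanying voice.
idiomatic = dict(familiar, violin=[0.6, 0.5, 0.7])

melody_familiar = select_melody(familiar)
melody_unfamiliar = select_melody(unfamiliar)
melody_idiomatic = select_melody(idiomatic)
```

In this example the violin is selected for both the familiar and the unfamiliar training condition, whereas the idiomatic case selects the viola even though the violin still carries the melody.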