2.1 A note on segmental and suprasegmental aspects of speech
2.1.3 Suprasegmentals
Segmental features of speech are produced and perceived in a different time frame
from other phonetic features, referred to as ‘suprasegmentals’. The term itself reveals that these features have to do with vocal effects that extend over more than a single
segment (Crystal, 2008). When people talk to a baby or an animal, they often use a
combination of suprasegmental features, for example changes in voice quality or a
higher pitch register (Ogden, 2009). This suggests that these features have a role that
extends beyond the lexical aspect of speech. Although these
features do not convey literal meaning per se, their extralinguistic meanings may
contribute a great deal to communication. It is therefore worth examining these
features individually in order to appreciate how they are described acoustically, how
they are perceived auditorily and, in later sections, how they can relate to acoustic
streams of music.
Prosody, another term for suprasegmentals, encompasses timing, frequency,
amplitude, pausing, and voice quality. These are variables that mark all parts of a
spoken utterance and, according to the existing literature, contribute to the formation
of three main types of realisations: linguistic, pragmatic, and emotional prosody. In
general, researchers use the term ‘prosody’ either in an abstract sense, without
reference to any specific prosodic component, or to examine closely the above
features of timing, frequency, amplitude, etc. (Cutler et al., 1997).
Loudness as a suprasegmental feature can contribute to the disambiguation of meaning
(in the case of heteronyms, for example the word ‘contrast’, which can be either a
noun or a verb) and to the communication of emotions (Skandera and Burleigh, 2005).
Loudness can be relative in terms of perception. That is, other factors, co-occurring
with loudness, can affect how loudness is perceived. For example, the pitch of an
Length is also used to describe speech signals at the suprasegmental level. It refers to
the physical duration of a sound or an utterance and is normally distinguished from
duration, which pertains to the time devoted to the articulation of a sound or a syllable
at the segmental level (Crystal, 2008). Tempo refers to the speech rate of an utterance,
and differences at this level function differently from segmental
manipulations of duration. Importantly, differences in tempo do not
produce differences in meaning equivalent to those that duration can bring
about at the word level (Lehiste, 1970).
Another element taken into consideration in prosodic investigations of
speech is ‘voice quality’. In simple terms, voice quality is the difference in
‘colour’ that one perceives among different voices; it resembles the difference
one perceives between two identical notes played at equal loudness by two
different instruments (Skandera and Burleigh, 2005). In this sense it can
alternatively be called ‘timbre’. A breathy voice and a harsh voice can be
perceived as different even when they have the same fundamental frequency and
loudness, just as the timbres of a piano and a violin are perceived as different.
In a broader definition, however, voice quality refers to a series of
features: a speaker’s rate of speech, pitch height, loudness and timbre
(Crystal, 2008), rather than timbre exclusively. In this second definition, the term
seems to encompass all the features that a listener has at their disposal in order to
identify a speaker.
Pitch is an important auditory feature which has received a lot of scientific interest
and has also been the focus of research in many studies in the comparative
feature of frequency. The frequency of the vibration of the vocal folds determines the
auditory result that one perceives (Skandera and Burleigh, 2005). That is, a fast
vibration of the vocal folds results in a higher pitch, whereas a slower vibration
results in a lower pitch. Most sounds that surround us are complex sounds;
pure tones differ from them in
that they contain only one frequency (Griffiths et al., 1999). The sound
resulting from the vibration of the vocal folds is an example of a complex sound with
many associated harmonics (Lehiste, 1970). Pitch and fundamental frequency do not
relate in a linear way. Fundamental frequency is defined as the frequency of a
periodic (regularly repeating) sound that corresponds to the lowest mode of vibration,
and harmonics as whole-number multiples of this frequency (Zatorre et al., 2007).
Ogden (2009) explains that the relationship between fundamental frequency and its
percept, pitch, is logarithmic in nature. More specifically, if this relationship were
linear, then the perceived difference between 100 Hz and 200 Hz would be equal to the
perceived difference between 200 Hz and 300 Hz. Rather, fundamental frequency and
perceived pitch relate in a proportional fashion. That is, the difference between
100 Hz and 200 Hz is perceived as similar to that between 200 Hz and 400 Hz, because
the two pairs of stimuli stand in the same proportion: 1:2 in both cases.
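The proportional relationship described above can be illustrated with a short numerical sketch. It is only an illustration under stated assumptions: the function names are hypothetical, and the semitone (12 times the base-2 logarithm of the frequency ratio) is used here as one common logarithmic unit, not as a measure prescribed by the sources cited.

```python
import math


def harmonics(f0_hz, count):
    """Return the first `count` harmonics of a fundamental frequency.

    Harmonics are whole-number multiples of the fundamental; the first
    element is the fundamental itself.
    """
    return [n * f0_hz for n in range(1, count + 1)]


def interval_in_semitones(f_low_hz, f_high_hz):
    """Distance between two frequencies on a logarithmic scale.

    Assumption: the semitone is used as the logarithmic unit.
    """
    return 12 * math.log2(f_high_hz / f_low_hz)


print(harmonics(100, 4))                          # [100, 200, 300, 400]
print(interval_in_semitones(100, 200))            # 12.0
print(interval_in_semitones(200, 400))            # 12.0
print(round(interval_in_semitones(200, 300), 2))  # 7.02
```

On this scale the pairs 100–200 Hz and 200–400 Hz are the same distance apart, whereas 200–300 Hz is a smaller step, even though each of the first and third pairs spans 100 Hz.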
In simple auditory terms, pitch can be defined as the type of sensation scaled from
‘low’ to ‘high’ (Crystal, 2008). As also noted in 1.3, there is a distinction between absolute and relative pitch. Relative pitch requires the listener to abstract intervallic
relationships (De Cheveigne, 2005). That is, the listener has a point of reference at
their disposal and they base their judgement on the relationship between two notes
rather than the identification of a note out of melodic context. Absolute pitch or
of reference, that is, to name an isolated note. The use of pitch in prosodic functions
is discussed later in this chapter.
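The idea that relative pitch rests on intervallic relationships rather than on the identity of individual notes can be sketched numerically. The example below is a minimal illustration: the melody, its frequencies, and the function name are hypothetical, and the semitone is assumed as the interval unit.

```python
import math


def intervals_in_semitones(frequencies_hz):
    """Return the successive intervals of a note sequence, in semitones."""
    return [
        round(12 * math.log2(later / earlier), 2)
        for earlier, later in zip(frequencies_hz, frequencies_hz[1:])
    ]


# Hypothetical four-note melody and the same melody shifted upwards by
# multiplying every frequency by the same factor.
melody = [220.0, 247.5, 275.0, 220.0]
transposed = [frequency * 1.5 for frequency in melody]

# The absolute frequencies differ ...
print(melody == transposed)  # False
# ... but the sequence of intervals is the same.
print(intervals_in_semitones(melody) == intervals_in_semitones(transposed))  # True
```

A listener relying on relative pitch would, on this view, treat the two sequences as the same melody, whereas naming any single note in isolation requires absolute frequency information.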