Auditory masking - Introduction to Digital Signal Processing

Imagine being in a quiet room watching a tennis match on TV between Federer and Nadal at a dB level that is just perfect. For some reason or another, your roommate suddenly feels the urge to engage in some very important vacuum cleaning project. All of a sudden, the sound coming from the TV that was so crisp, clear, and at just the right dB level is overwhelmed by the mechanical sound produced by the 15-year-old vacuum cleaner. You cannot hear anything: the sound from the tennis match is masked by the sound from the vacuum cleaner. This scenario, where one sound source overwhelms another, is referred to as masking.

There are basically two types of masking: simultaneous masking and non-simultaneous masking. Simultaneous masking refers to a situation in which two sound events, the masker (vacuum cleaner) and the maskee (tennis game), occur at the same time. You can probably imagine that the threshold for hearing the maskee is lower without the masker than when the masker is present. That is, in a very quiet room without a masker you will not need to turn up the volume on the TV much, whereas in a room with lots of noise the signal-to-noise ratio (SNR, see Section 5.1) will have to be improved by increasing the level of the television program. What is interesting is that this threshold of hearing of the maskee is a function not only of the maskee’s and masker’s intensity levels but also of their frequencies. Figure 3.4 shows a general plot depicting the behavior of the masking threshold (adapted from Gelfand 2004), where the masker frequency is kept constant at fmasker Hz and the maskee frequency is varied along the frequency axis (intensity levels are kept constant for both maskee and masker). What Fig. 3.4 tells us is that the further the maskee (TV) frequency is shifted away from the masker frequency fmasker, the less effect the masker has in overwhelming the maskee sound source. When the maskee frequency is equal to fmasker, the most noticeable masking effect takes place: you will need to really crank up the volume of your TV set if you want to hear it while the vacuum cleaner is running, especially when both occupy the same frequency range. However, if the vacuum cleaner’s frequency range (fmasker) is much lower

September 25, 2009 13:32 spi-b673 9in x 6in b673-ch01

12 Introduction to Digital Signal Processing

Fig. 3.4. General characteristics of masking threshold.

than the TV sound, you will not need to turn the volume up much with your remote control. For example, if the vacuum cleaner is running in another room separated by a wall (the wall acts as a filter; filtering is discussed in Chapter 7), much of its high-frequency content will vanish (and its intensity level will drop as well), and your TV experience, from the sonic point of view, will not be affected much.

Unlike simultaneous masking, non-simultaneous masking (also known as temporal masking) refers to masking that takes place when the masker and maskee are out of synchrony and do not occur at the same time. Within temporal masking we distinguish post-masking and pre-masking. Post-masking intuitively makes sense if you consider the following scenario. Let’s say you are crossing a busy street in the heart of Seoul, minding your own business (too much), without noticing that the pedestrian light has just turned red. A guy sitting in a huge truck gets all mad and hits his 120 dB horn, which just blows you away (luckily without damaging your ears!). Even though the honk lasted only one second, you are not able to hear anything that occurs for another 200 milliseconds or so. This is called post-masking. Pre-masking is a bit more interesting. Let’s consider the same situation where you are crossing the street as before. Everything is the same, including the post-masking effects, that is, the sounds that are masked out after the loud honk. But that’s not the end of the story. It so happens that some sounds that occur before the masker, if they happen close enough to the start time of the masker’s acoustic event (the honk), will also not be heard. In other words, acoustic events that occur shortly before the masker will also be erased without a trace; this is called pre-masking. Pre-masking is also a function of the intensity and frequency of both the masker

Fig. 3.5. Masking characteristics.

and maskee. A summary of the three types of masking is shown in Fig. 3.5 (after Zwicker 1999).

A group of audio compression algorithms that exploit the limitations and psychoacoustic tendencies of our hearing system are referred to as perceptual codecs. In these types of codecs (coder-decoders), a great example being MP3, masking characteristics play an important role in reducing the amount of data needed to represent a given audio signal, thus improving download and upload performance for our listening pleasure.

4 Sampling: The Art of Being Discrete

Up until now we have been emphasizing various limitations of our hearing system, which brings us to the concept of sampling and the digital representation of analog signals. Simply put, sampling is a process of analog-to-digital conversion, carried out by devices called ADCs (analog-to-digital converters), whereby a continuous analog signal is encoded into a discrete and limited version of the original analog signal. Think of motion pictures: when we watch a movie in a cinema, we perceive continuous and smooth change in moving objects, people, and anything else that is in motion. This is, however, an illusion, as in reality a set number of pictures known as frames (24, 25, and 30 frames per second are common) are projected on the silver screen. Like our hearing system, our eyes have a finite sampling capacity or sampling rate, and information at a rate higher than this sampling rate is not perceived. There is thus generally little need to have a movie camera take 100 snapshots per second, nor is there a need to project each frame for a duration shorter than 1/24th, 1/25th, or 1/30th of a second, as we will not be able to tell the difference. That is, 50 or 100 frames per second, or equivalently 1/50th or 1/100th of a second for each frame, becomes redundant and more or less imperceptible. Our perceptual organs


and brains have limited ability in capturing and registering time-critical events. If we want to slow down movies, however, that is a totally different story, as more frames per second will render a smoother motion picture experience.

The same concept applies when sampling sound. As we have learned previously, we only hear up to approximately 20 kHz. Furthermore, we cannot hear every infinitely small subtlety in dynamic change either. When we say that we only hear up to 20 kHz, we mean that if a sine wave oscillates at more than 20,000 cycles per second, we are not able to perceive it. Our limitations in hearing are actually a blessing in disguise, at least when viewed from a sheer number-crunching or processing perspective, be it by a machine or by our brains. In other words, the time between samples need not be infinitely small (infinity is problematic on computers) but only small enough to convince our ears and brain that what we have sampled (the digitized version of the sound) is equivalent to the original analog version (the non-digitized version of the sound). To get an idea of what this all means, let’s look at Fig. 4.1.

Although the plot looks very smooth and perhaps even continuous, this is not the case. The actual sampled version in reality looks more like Fig. 4.2 (here we have zoomed into half of the sine wave only). Upon closer inspection we see that it is not smooth at all and is rather made up of

Fig. 4.2. About half of a 1 Hz sine wave sampled at fs = 200 Hz.

discrete points outlining the sine wave. It is somewhat analogous to the pixel resolution of your computer monitor. When viewed from afar (whatever afar means), the digital picture you took last year at the top of the mountain looks smooth and continuous. But viewed with your nose up against the monitor and zoomed in, you will inevitably see the artifacts of discreteness in the form of pixel resolution.

In this example, we actually used 200 discrete samples to represent one full cycle of the sine wave, where the 200 samples correspond to exactly 1 second of duration. Since the sine wave makes one full revolution in one second, it must be a 1 Hz signal by definition. We say that the sampling rate, commonly denoted as fs, for this system is fs = 200 Hz or 200 samples per second. We can also deduce that each time unit or grid step is 1/fs = 1/200th of a second (in this example). This time unit is referred to as the sampling period and is usually denoted as T. Hence, we have the following relationship between the sampling rate fs (Hz) and T (sec):

fs = 1/T (4.1)
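The reciprocal relationship between sampling rate and sampling period can be checked numerically. The following is a minimal Python sketch (the function names are hypothetical, chosen only for illustration) that converts in both directions using the 200 Hz rate from the text.

```python
def sampling_period(fs_hz: float) -> float:
    """Return the sampling period T (seconds) for a sampling rate fs (Hz)."""
    if fs_hz <= 0:
        raise ValueError("sampling rate must be positive")
    return 1.0 / fs_hz


def sampling_rate(period_s: float) -> float:
    """Return the sampling rate fs (Hz) for a sampling period T (seconds)."""
    if period_s <= 0:
        raise ValueError("sampling period must be positive")
    return 1.0 / period_s


T = sampling_period(200.0)  # fs = 200 Hz, the rate used in the text
fs = sampling_rate(T)       # converting back recovers the rate
```

With fs = 200 Hz, T comes out as 0.005 seconds, matching the 1/200th of a second grid step described above.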


Fig. 4.3. Zoomed-in portion of sine wave and sampling period T.

This is further illustrated in Fig. 4.3 (a quarter-cycle plot of the same sine wave), where we notice that the spacing between samples is equal to T, 1/200th of a second. Figure 4.4 shows the same half sine wave we have been using, but sampled at a quarter of the previous sampling frequency, fs = 50 Hz. It clearly shows that the number of samples representing the half sine wave has been reduced to 1/4 of the original number.
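The effect of the lower sampling rate on the number of samples can be verified with a few lines of Python. This is only a sketch: the helper name is hypothetical, and it assumes the half cycle lasts exactly 0.5 seconds (half of a 1 Hz sine wave), whereas the figures show "about" half.

```python
def sample_count(duration_s: float, fs_hz: float) -> int:
    """Number of samples spanning duration_s at rate fs_hz.

    Assumes duration_s * fs_hz is (numerically) a whole number.
    """
    return int(round(duration_s * fs_hz))


half_cycle_s = 0.5  # assumed: exactly half of one cycle of a 1 Hz sine wave
count_200 = sample_count(half_cycle_s, 200.0)  # samples at fs = 200 Hz
count_50 = sample_count(half_cycle_s, 50.0)    # samples at fs = 50 Hz
ratio = count_50 / count_200                   # fraction of samples that remain
```

Under these assumptions the half cycle is covered by 100 samples at 200 Hz and 25 samples at 50 Hz, so one quarter of the samples remain.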

We may also notice at this point that time is now discrete: with a sampling frequency of 200 Hz we have T = 1/200 = 0.005 seconds or 5 milliseconds (ms), which means that we cannot, for example, have any data at 2.5 milliseconds or at any other time that is not an integer multiple of T, as shown below (if we start at t = 0).

t = n · T (4.3)
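To make the discrete time grid concrete, here is a small Python sketch. The helper names and the numerical tolerance are assumptions for illustration; it maps a sample index to its time and tests whether a given time coincides with a sample instant.

```python
def time_of_sample(n: int, fs_hz: float) -> float:
    """Time in seconds of sample index n, i.e. t = n * T with T = 1 / fs."""
    return n / fs_hz


def is_on_grid(t_s: float, fs_hz: float, tol: float = 1e-9) -> bool:
    """True if t_s is (within tol) an integer multiple of T = 1 / fs."""
    index = t_s * fs_hz  # fractional sample index
    return abs(index - round(index)) < tol


fs = 200.0                                 # sampling rate from the text
on_grid_5ms = is_on_grid(0.005, fs)        # 5 ms is exactly one period T
on_grid_2_5ms = is_on_grid(0.0025, fs)     # 2.5 ms falls between two samples
```

At fs = 200 Hz, 5 ms lies on the grid (n = 1), while 2.5 ms corresponds to a fractional index of 0.5 and therefore has no sample.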

Here n is the integer time index. We are now ready to express each time index or sample of the sine wave (or any digital signal for that matter) in the following manner:

y(t) = sin(2 · pi · f · t) (4.4)

Fig. 4.4. Approximately half the duration of a 1 Hz sine wave sampled at fs = 50 Hz.

and replacing t with Eq. (4.3) we have:

y[n · T ] = sin(2 · pi · f · n · T ) (4.5)

Again, n is an integer denoting the time index. For instance, the 0th sample in our example in Fig. 4.5 is:

y[n · T ] = y[0 · T ] = y[0] = 0 (4.6)

The 7th sample of our sine wave (n = 6 since we started counting from 0) is:

y[n · T ] = y[6 · T ] = 0.2 (4.7)
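These sample values can be reproduced with a short Python sketch. It assumes the running example of f = 1 Hz and fs = 200 Hz; Fig. 4.5 is not reproduced here, so those parameters are an assumption, and the function name is hypothetical.

```python
import math


def sine_sample(n: int, f_hz: float, fs_hz: float) -> float:
    """Evaluate y[n] = sin(2 * pi * f * n * T) with T = 1 / fs."""
    return math.sin(2.0 * math.pi * f_hz * n / fs_hz)


y0 = sine_sample(0, 1.0, 200.0)  # 0th sample (n = 0)
y6 = sine_sample(6, 1.0, 200.0)  # 7th sample (n = 6)
```

Under these assumptions y0 is exactly 0 and y6 is roughly 0.187, which agrees with the value 0.2 quoted above when rounded to one decimal place.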

Note that for analog signals we use ( ) (parentheses) and for digital signals we use [ ] (square brackets) to clearly differentiate continuous and discrete signals. The representation of each time “location”, which is now discrete, can be done for every sample since we know the sampling rate fs and hence the period T. The sampling period T is customarily omitted for notational convenience, and y[n · T ] becomes y[n]. Going back to the MATLAB® code that we started off the chapter with, we can now see how the continuous-time sine wave (t) becomes the discrete-time sine wave (n · T).


In other words:

y(t) = sin(2 · pi · f · t) (4.8)

setting t = n · T we get

y(t)|t=nT = y[nT ] = sin(2 · pi · f · n · T ) (4.9)

remembering that T = 1/fs the MATLAB® code becomes

y[n] = sin(2 · pi · f · n · T ) = sin(2 · pi · f · n / fs)
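The chapter's MATLAB® listing is not reproduced in this section, so as a stand-in here is a minimal Python sketch of the same discrete-time expression. The function name and the choice of a plain list are assumptions made for illustration. It generates one second of the 1 Hz sine wave at fs = 200 Hz.

```python
import math


def sine_sequence(f_hz: float, fs_hz: float, num_samples: int) -> list:
    """Return y[n] = sin(2 * pi * f * n / fs) for n = 0 .. num_samples - 1."""
    return [
        math.sin(2.0 * math.pi * f_hz * n / fs_hz)
        for n in range(num_samples)
    ]


# One full cycle of a 1 Hz sine wave sampled at fs = 200 Hz (200 samples).
y = sine_sequence(1.0, 200.0, 200)
peak = y[50]     # a quarter of the way through the cycle
trough = y[150]  # three quarters of the way through the cycle
```

The sequence starts at zero, reaches its maximum a quarter of the way through the cycle (n = 50), and its minimum at n = 150, as expected for one full cycle spread over 200 samples.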

In document Introduction to Digital Signal Processing - Computer Musically speaking by Hong Park.pdf (Page 32-39)