Spatial Audio Juhan Nam

(1)

GCT535: Sound Technology for Multimedia

Spatial Audio

(2)

Outlines

● Localization using HRTF

(3)

Hearing Sound in Space

● We have different auditory experiences depending on where we listen to

sound

(4)

Hearing Sound in Space

● Sound Localization

○ Hear sound as a point source in 3-D space

○ Possible with two ears and human body structure

○ Panning (stereo), 3-D Sound

● Room Effect

○ Hear sound and its myriads of reflections

○ Determined by the room type: size, structure and materials

(5)

Sound Localization

● Perception of Directions

○ A sound source arrives in each of two ears with differences in time and level

○ We call them ITD (inter-aural Time Difference) and IID (Inter-aural Intensity

Difference)

○ IID is used as a main cue of direction above about 1.5 kHz

● Only the level and time differences are enough?

R L

ITD

(6)

Head-Related Transfer Function (HRTF)

● A filter that characterizes how a sound arrives in the outer end of ear

canal from the source

○ Measured from human body or dummy head

○ Determined by the refection on head, pinnae, torso or other body parts

○ Function of azimuth (horizontal angle), elevation (vertical angle) and

frequency 𝐻!(𝜔, ∅, 𝜃) 𝐻_"(𝜔, ∅, 𝜃) R L Source: https://developers.google.com/vr/concepts/spatial-audio

(7)

HRTF Measurement

Fall 2007 Music 220D Report Kolar 3

Subject/Receiver Alignment: subject wearing mic attachment/strain relief and

mirror-alignment headgear positioned on rotating stool (noted during 2nd session that midpoint of rotation not fixed because of seat asymmetry); flashlight suspended above subject to reflect light off top-of-head mirror onto azimuth markings on walls for sighted alignment by subject (fig. 3); line-of-sight verification by subject for speaker elevation alignment

figure 3: receiver mic attachment & alignment headgear, flashlight-mirror-marker positioning system

Equipment/Signal Chain: AuSIM HeadZap speaker (fig. 2 & 3) powered by Samson

Servo 120 amplifier, AuSIM Binaural Probe Microphone Set (AuPMC002, Sennheiser KE4-211-2 omni electret capsules, fig. 4) into Grace 101 preamplifier, RME Fireface 800 converter, Digital Performer 5.12 recording software on MacBook Pro running OS 10.4.11.

figure 3: AuSIM Binaural Probe Microphone specifications and response per product literature Measured Head-Related Impulse Responses Front Back Back Left Right

(8)

HRTF Measurement

Front Back Back Left Right Magnitude response of the HRIRs

(9)

Rendering Sound in 3-D Space

● Binaural Synthesis

○ 3D sound effects

○ Typically run with headphones

○ Applications: Game, VR

● Methods

○ Convolution with HRTFs: the IRs are typically several hundreds sample long

○ Modeling HRTFs with biquad filters (e.g. Prony’s method)

● Issues

○ Standardization: dummy head

○ Individualization Input Left output Right output ℎ_!(𝑡, ∅, 𝜃) ℎ"(𝑡, ∅, 𝜃)

(10)

Head-Related IR Datasets

● ARI

○ 85 subjects and 1550 source positions

○ https://www.kfs.oeaw.ac.at/index.php?option=com_content&view=article&i

d=608&Itemid=857&lang=en

● CIPIC

○ http://interface.cipic.ucdavis.edu/sound/hrtf.html

● IRCAM

(11)

HRTF Demos

● Virtual Barber Shop Hair Cut

○ https://www.youtube.com/watch?v=8IXm6SuUigI

● OpenAL Example

○ https://www.youtube.com/watch?v=tY9DhuEe1WY

● Google Chrome Omnitone

○ http://googlechrome.github.io/omnitone

● My Research Work

(12)

Stereo Panning

● Amplitude Panning in Loudspeakers

○ The gains for each speaker for monophonic source

■ 𝑥_! 𝑡 = 𝑔_!𝑥 𝑡 , 𝑖 = 1, 2 (left, right)

○ The tangent law (Bennett et. al.)

■ _"#$%"#$% ! = &_"'&_# &_"(&_# ○ Power preservation ■ 𝑔₎* + 𝑔_** = 1

BASIC SPATIAL EFFECTS 143

When the source is very near to the listener, there are also some binaural effects which are used in distance perception [DM98]. With close sources the magnitude of ILD is higher and appears at lower frequencies than a far source with the same direction, which creates the perception of a nearby source.

5.3 Basic spatial effects for stereophonic loudspeaker and headphone playback

The most common loudspeaker layout is the two-channel setup, called the standard stereophonic setup. It was widely taken into use after the development of single-groove 45◦_/45◦ _two-channel

records in the late 1950s. Two loudspeakers are positioned in front of the listener, separated by 60◦ _{from the listener’s viewpoint, as presented in Figure 5.2. The setup of two loudspeakers is}

very common, though quite often the setup is not as shown in the figure. In domestic use, or in car audio typically the listener is not situated in the centre, but the loudspeakers are located in different directions and distances from him than in Figure 5.2. However, even then two-channel reproduction is preferred from monophonic presentation in most cases. On the other hand, in some cases the listener can be assumed to be in the best listening position, as in computer audio.

q₀ = 30° q Virtual source Best listening area

Figure 5.2 Standard stereophonic listening configuration.

This section deals with the spatial effects obtainable with such two-channel reproduction and simple processing. Two types of effects are presented, the creation of point-like sources, and the creation of spatially spread sources. More advanced methods for loudspeaker reproduction with HRTF processing are discussed in Section 5.4.5, which provide some more degrees of freedom, but unfortunately also introduce some limitations in listening position and listening room acoustics.

5.3.1 Amplitude panning in loudspeakers

Amplitude panning is the most frequently used virtual-source-positioning technique. In it a sound signal is applied to loudspeakers with different amplitudes, which can be formulated as

(13)

Reverberation

● Acoustic phenomenon when a sound source is played in a room

○ Thousands of echoes are reflected against wall, ceiling and floors

○ The patterns are determined by the volume and geometry of the room and

materials on the surfaces

○ We can recognize the geometry and composition of the room from the

sound

○ They provide different (often better) feelings of the sound

Sound Source

Listener Direct sound

(14)

Room Impulse Response

● Room reverberation is characterized by its

impulse response (IR)

● The room IR is composed of three parts

○ Direct path

○ Early reflections: convey a sense of the

room geometry and size

○ Late-field reverberation: high echo density

like noise, determined by room size and materials

● RT60

Signal Processing Techniques for Digital Audio Effects 70

Reverberation and Linear Time-Invariant Systems

0 10 20 30 40 50 60 70 80 90 100 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1

CCRMA Lobby Impulse Response

time - milliseconds

response amplitude

direct path

early reflections

late-field reverberation

Figure 61: CCRMA lobby response to transient signal

• The arrival times and nature of reflected source signals are sen-sitive to the details of the environment geometry and materials. • As a result, reverberation may seem unmanageably complex. • Fortunately, reverberation has two properties which allow its

analysis and synthesis without having to know the details: – linearity, and

– time invariance.

• Here, we explore reverberation by studying impulse responses of enclosed spaces.

(15)

Room Impulse Response

(16)

Artificial Reverberation

● Convolution reverb

○ Measure the impulse response of a room

○ Convolve input with the measured IR

● Mechanical reverb

○ Use metal plate and spring

○ EMT140 Plate Reverb: https://www.youtube.com/watch?v=HEmJpxCvp9M

● Delay-based reverb

○ Early reflections: feed-forward delayline

○ Late-field reverb: allpass/comb filter, feedback delay networks (FDN)

(17)

Delay-based Reverb

● Schroeder model

○ Cascade of allpass-comb filters

○ Mutually prime number for delay lengths

(18)

Delay-based Reverb

● Feedback Delay Networks

○ Mixing matrix creates “good spreading” of delayed outputs

■ Chosen to be orthonormal (unitary matrix)

○ The lengths of delaylines are chosen to be mutually prime number

○ Should generate a white noise in lossless mode

○ T60 is controlled by the loop gains

Feedback Delay Networks

+ x(n) Z-M1 Z-M2 Z-M3 + a11 a12 a13 a₁₁a₁₂a₁₃ a₁₁a₁₂a₁₃ y(n)

(19)

Measuring Impulse Response

● Measurement Model

○ Assume the system as linear time-invariant

○ Use a test signal and the output to derive the impulse response

○ Using a sine sweep: based on the convolution theorem

Signal Processing Techniques for Digital Audio Effects 74

Impulse Response Measurement

Measurement Model:

Figure 63: Measurement Configuration

- Room presumed LTI.

- Why not just crank an impulse out the speaker, record the result

and declare victory?

s(t)

LTI system

r(t)

test

sequence measured response

n(t)

h(t)

measurement noise

Figure 64: Measurement Model

- Noise assumed additive, unrelated to the test signal. (How will it

be related to the system?)

𝑟 𝑡 = 𝛿 𝑡 ∗ ℎ 𝑡 + 𝑛 𝑡 𝑟 𝑡 → *ℎ 𝑡

*ℎ 𝑡 = FFT')_{ FFT 𝑟 𝑡

(20)

Measuring HRTFs

20 𝑠(𝑡) 𝑟!(𝑡) 𝑟_"(𝑡) ,ℎ!(𝑡) ,ℎ"(𝑡)

Fall 2007 Music 220D Report Kolar 3

Subject/Receiver Alignment: subject wearing mic attachment/strain relief and

mirror-alignment headgear positioned on rotating stool (noted during 2nd_{session that midpoint of}

rotation not fixed because of seat asymmetry); flashlight suspended above subject to reflect light off top-of-head mirror onto azimuth markings on walls for sighted alignment by subject (fig. 3); line-of-sight verification by subject for speaker elevation alignment

figure 3: receiver mic attachment & alignment headgear, flashlight-mirror-marker positioning system

Equipment/Signal Chain: AuSIM HeadZap speaker (fig. 2 & 3) powered by Samson

Servo 120 amplifier, AuSIM Binaural Probe Microphone Set (AuPMC002, Sennheiser KE4-211-2 omni electret capsules, fig. 4) into Grace 101 preamplifier, RME Fireface 800 converter, Digital Performer 5.12 recording software on MacBook Pro running OS 10.4.11.

(21)

Measuring Room IRs

Music 318, Winter 2007, Impulse Response Measurement 13

0 500 1000 1500 -0.5 0 0.5 sine sweep, s(t) amplitude frequency - kHz

sine sweep spectrogram

0 200 400 600 800 1000 0 5 10 0 500 1000 1500 2000 -1 -0.5 0 0.5

1 sine sweep response, r(t)

time - milliseconds

amplitude

time - milliseconds

frequency - kHz

sine sweep response spectrogram

0 500 1000 1500 2000 0 5 10 0 100 200 300 400 500 600 700 800 900 1000 -0.04 -0.02 0 0.02 0.04 0.06

0.08 measured impulse response

time - milliseconds

amplitude

CCRMA Lobby Measurment Example

s(t)

r(t)

ˆ

h (t)

(22)

Room IR datasets

● Open AIR

○ http://www.openairlib.net/

● Aachen Impulse Response Database

○

(23)

Convolution by FFT

● Direct convolution in time domain is computationally expensive

○ It has O(n2_{) in complexity}

○ Especially, when the length of IR is long (e.g. room IR)

● FFT Convolution

○ Using Convolution Theorem: 𝑥 𝑛 ∗ ℎ 𝑛 ↔ 𝑌 𝑘 = 𝐻 𝑘 𝑋(𝑘)

○ FFT has O(nlog2n) and convolution in frequency O(n) in complexity

● However, the issue is latency when the IR is long. How can we implement

it in real-time reverb?

○ Solution: partitioned convolution

(24)

References

● Reverberation using Feedback Delay Network

○ https://ccrma.stanford.edu/~jos/pasp/FDN_Reverberation.html