Foreword p. xiii

Preface p. xvii

Contributors p. xix

Acronyms p. xxi

Fundamentals of Computational Auditory Scene Analysis p. 1

Human Auditory Scene Analysis p. 2

Structure and Function of the Auditory System p. 2

Perceptual Organization of Simple Stimuli p. 4

Perceptual Segregation of Speech from Other Sounds p. 5

Perceptual Mechanisms p. 8

Computational Auditory Scene Analysis (CASA) p. 11

What Is CASA? p. 11

What Is the Goal of CASA? p. 12

Why CASA? p. 13

Basics of CASA Systems p. 14

System Architecture p. 14

Cochleagram p. 15

Correlogram p. 19

Cross-Correlogram p. 21

Time-Frequency Masks p. 22

Resynthesis p. 23

CASA Evaluation p. 25

Evaluation Criteria p. 25

Corpora p. 26

Other Sound Separation Approaches p. 28

A Brief History of CASA (Prior to 2000) p. 30

Monaural CASA Systems p. 30

Binaural CASA Systems p. 34

Neural CASA Models p. 35

Conclusions p. 36

Acknowledgments p. 36

References p. 37

Multiple F0 Estimation p. 45

Introduction p. 45

Signal Models p. 46

Single-Voice F0 Estimation p. 47

Spectral Approach p. 48

Temporal Approach p. 50

Spectrotemporal Approach p. 53

Multiple-Voice F0 Estimation p. 55

Spectral Approach p. 56

Temporal Approach p. 57

Spectrotemporal Approach p. 59

Issues p. 61

Spectral Resolution p. 61

Temporal Resolution p. 62

Spectrotemporal Resolution p. 63

Other Sources of Information p. 64

Temporal and Spectral Continuity p. 64

Instrument Models p. 65

Learning-Based Techniques p. 67

Estimating the Number of Sources p. 68

Evaluation p. 69

Application Scenarios p. 70

Conclusion p. 71

Acknowledgments p. 72

References p. 72

Feature-Based Speech Segregation p. 81

Introduction p. 81

Feature Extraction p. 83

Pitch Detection p. 83

Onset and Offset Detection p. 83

Amplitude Modulation Extraction p. 85

Frequency Modulation Detection p. 88

Auditory Segmentation p. 90

What Is the Goal of Auditory Segmentation? p. 90

Segmentation Based on Cross-Channel Correlation and Temporal Continuity p. 92

Segmentation Based on Onset and Offset Analysis p. 93

Simultaneous Grouping p. 97

Voiced Speech Segregation p. 97

Unvoiced Speech Segregation p. 102

Sequential Grouping p. 106

Spectrum-Based Sequential Grouping p. 108

Pitch-Based Sequential Grouping p. 108

Model-Based Sequential Grouping p. 109

Discussion p. 110

References p. 111

Model-Based Scene Analysis p. 115

Introduction p. 115

Source Separation as Inference p. 115

Aspects of Model-Based Systems p. 125

Constraints: Types and Representations p. 126

Fitting Models p. 130

Generating Output p. 136

Discussion p. 139

Unknown Interference p. 139

Ambiguity and Adaptation p. 140

Relations to Other Separation Approaches p. 141

Conclusions p. 143

References p. 143

Binaural Sound Localization p. 147

Introduction p. 147

Physical and Physiological Mechanisms Underlying Auditory Localization p. 148

Physical Cues p. 148

Physiological Estimation of ITD and IID p. 150

Spatial Perception of Single Sources p. 152

Sensitivity to Differences in Interaural Time and Intensity p. 152

Lateralization of Single Sources p. 152

Localization of Single Sources p. 153

The Precedence Effect p. 154

Spatial Perception of Multiple Sources p. 155

Localization of Multiple Sources p. 155

Binaural Signal Detection p. 156

Models of Binaural Perception p. 158

Classical Models of Binaural Hearing p. 158

Cross-Correlation-Based Models of Binaural Interaction p. 160

Some Extensions to Cross-Correlation-Based Binaural Models p. 164

Multisource Sound Localization p. 168

Estimating Source Azimuth from Interaural Cross-Correlation p. 169

Methods for Resolving Azimuth Ambiguity p. 172

Localization of Moving Sources p. 175

General Discussion p. 175

References p. 178

Localization-Based Grouping p. 187

Introduction p. 187

Classical Beamforming Techniques p. 188

Fixed Beamforming Techniques p. 188

Adaptive Beamforming Techniques p. 189

Independent Component Analysis Techniques p. 190

Location-Based Grouping Using Interaural Time Difference Cue p. 191

Location-Based Grouping Using Interaural Intensity Difference Cue p. 199

Location-Based Grouping Using Multiple Binaural Cues p. 200

Discussion and Conclusions p. 202

References p. 203

Reverberation p. 209

Introduction p. 209

Effects of Reverberation on Listeners p. 211

Speech Perception p. 211

Sound Localization p. 213

Source Separation and Signal Detection p. 215

Distance Perception p. 219

Auditory Spatial Impression p. 219

Effects of Reverberation on Machines p. 220

Mechanisms Underlying Robustness to Reverberation in Human Listeners p. 224

The Role of Slow Temporal Modulations in Speech Perception p. 224

The Binaural Advantage p. 225

The Precedence Effect p. 226

Perceptual Compensation for Spectral Envelope Distortion p. 228

Reverberation-Robust Acoustic Processing p. 229

Dereverberation p. 229

Reverberation-Robust Acoustic Features p. 233

Reverberation Masking p. 235

CASA and Reverberation p. 237

Systems Based on Directional Filtering p. 237

CASA for Robust ASR in Reverberant Conditions p. 239

Systems that Use Multiple Cues p. 241

References p. 244

Analysis of Musical Audio Signals p. 251

Introduction p. 251

Music Scene Description p. 252

Music Scene Descriptions p. 253

Difficulties Associated with Musical Audio Signals p. 255

Estimating Melody and Bass Lines p. 256

PreFEst-front-end: Forming the Observed Probability Density Functions p. 258

PreFEst-core: Estimating the F0's Probability Density Function p. 258

PreFEst-back-end: Sequential F0 Tracking by Multiple-Agent Architecture p. 262

Estimating Beat Structure p. 267

Estimating Period and Phase p. 268

Dealing with Ambiguity p. 270

Using Musical Knowledge p. 271

Estimating Chorus Sections and Repeated Sections p. 275

Extracting Acoustic Features and Calculating Their Similarity p. 278

Finding Repeated Sections p. 281

Grouping Repeated Sections p. 282

Detecting Modulated Repetition p. 284

Selecting Chorus Sections p. 285

Other Methods p. 285

Importance p. 286

Evaluation Issues p. 287

Future Directions p. 288

References p. 289

Robust Automatic Speech Recognition p. 297

Introduction p. 297

ASA and Speech Perception in Humans p. 299

Speech Perception and Simultaneous Grouping p. 299

Speech Perception and Sequential Grouping p. 302

Speech Schemas p. 306

Challenges to the ASA Account of Speech Perception p. 309

Interim Summary p. 310

Speech Recognition by Machine p. 311

The Statistical Basis of ASR p. 311

Traditional Approaches to Robust ASR p. 313

CASA-Driven Approaches to ASR p. 315

Primitive CASA and ASR p. 316

Speech and Time-Frequency Masking p. 316

The Missing-Data Approach to ASR p. 318

Marginalization-Based Missing-Data ASR Systems p. 321

Imputation-Based Missing-Data Solutions p. 325

Estimating the Missing-Data Mask p. 328

Difficulties with the Missing-Data Approach p. 330

Model-Based CASA and ASR p. 333

The Speech Fragment Decoding Framework p. 334

Coupling Source Segregation and Recognition p. 337

Concluding Remarks p. 343

Neural and Perceptual Modeling p. 351

Introduction p. 351

The Neural Basis of Auditory Grouping p. 352

Theoretical Solutions to the Binding Problem p. 352

Empirical Results on Binding and ASA p. 353

Models of Individual Neurons p. 354

Relaxation Oscillators p. 354

Spike Oscillators p. 355

A Model of a Specific Auditory Neuron p. 357

Models of Specific Perceptual Phenomena p. 359

Perceptual Streaming of Tone Sequences p. 359

Perceptual Segregation of Concurrent Vowels with Different F0s p. 367

The Oscillatory Correlation Framework for CASA p. 372

Speech Segregation Based on Oscillatory Correlation p. 372

Schema-Driven Grouping p. 376

Discussion p. 378

Temporal or Spatial Coding of Auditory Grouping p. 379

Physiological Support for Neural Time Delays p. 379

Convergence of Psychological, Physiological, and Computational Approaches p. 380

Neural Models as a Framework for CASA p. 380

The Role of Attention p. 381

Schema-Based Organization p. 381

References p. 381

Index p. 389