Description

Book Synopsis
With the growing use of automatic speech recognition (ASR) in everyday life, the ability to solve problems in recorded speech is critical for engineers and researchers developing ASR technologies. The only resource of its kind, this book presents a comprehensive survey of state-of-the-art techniques used to improve the robustness of ASR systems.

Table of Contents

List of Contributors

Acknowledgments

1 Introduction
Tuomas Virtanen, Rita Singh, Bhiksha Raj

1.1 Scope of the Book

1.2 Outline

1.3 Notation

Part One FOUNDATIONS

2 The Basics of Automatic Speech Recognition
Rita Singh, Bhiksha Raj, Tuomas Virtanen

2.1 Introduction

2.2 Speech Recognition Viewed as Bayes Classification

2.3 Hidden Markov Models

2.3.1 Computing Probabilities with HMMs

2.3.2 Determining the State Sequence

2.3.3 Learning HMM Parameters

2.3.4 Additional Issues Relating to Speech Recognition Systems

2.4 HMM-Based Speech Recognition

2.4.1 Representing the Signal

2.4.2 The HMM for a Word Sequence

2.4.3 Searching through all Word Sequences

References

3 The Problem of Robustness in Automatic Speech Recognition
Bhiksha Raj, Tuomas Virtanen, Rita Singh

3.1 Errors in Bayes Classification

3.1.1 Type 1 Condition: Mismatch Error

3.1.2 Type 2 Condition: Increased Bayes Error

3.2 Bayes Classification and ASR

3.2.1 All We Have is a Model: A Type 1 Condition

3.2.2 Intrinsic Interferences—Signal Components that are Unrelated to the Message: A Type 2 Condition

3.2.3 External Interferences—The Data are Noisy: Type 1 and Type 2 Conditions

3.3 External Influences on Speech Recordings

3.3.1 Signal Capture

3.3.2 Additive Corruptions

3.3.3 Reverberation

3.3.4 A Simplified Model of Signal Capture

3.4 The Effect of External Influences on Recognition

3.5 Improving Recognition under Adverse Conditions

3.5.1 Handling the Model Mismatch Error

3.5.2 Dealing with Intrinsic Variations in the Data

3.5.3 Dealing with Extrinsic Variations

References

Part Two SIGNAL ENHANCEMENT

4 Voice Activity Detection, Noise Estimation, and Adaptive Filters for Acoustic Signal Enhancement
Rainer Martin, Dorothea Kolossa

4.1 Introduction

4.2 Signal Analysis and Synthesis

4.2.1 DFT-Based Analysis Synthesis with Perfect Reconstruction

4.2.2 Probability Distributions for Speech and Noise DFT Coefficients

4.3 Voice Activity Detection

4.3.1 VAD Design Principles

4.3.2 Evaluation of VAD Performance

4.3.3 Evaluation in the Context of ASR

4.4 Noise Power Spectrum Estimation

4.4.1 Smoothing Techniques

4.4.2 Histogram and GMM Noise Estimation Methods

4.4.3 Minimum Statistics Noise Power Estimation

4.4.4 MMSE Noise Power Estimation

4.4.5 Estimation of the A Priori Signal-to-Noise Ratio

4.5 Adaptive Filters for Signal Enhancement

4.5.1 Spectral Subtraction

4.5.2 Nonlinear Spectral Subtraction

4.5.3 Wiener Filtering

4.5.4 The ETSI Advanced Front End

4.5.5 Nonlinear MMSE Estimators

4.6 ASR Performance

4.7 Conclusions

References

5 Extraction of Speech from Mixture Signals
Paris Smaragdis

5.1 The Problem with Mixtures

5.2 Multichannel Mixtures

5.2.1 Basic Problem Formulation

5.2.2 Convolutive Mixtures

5.3 Single-Channel Mixtures

5.3.1 Problem Formulation

5.3.2 Learning Sound Models

5.3.3 Separation by Spectrogram Factorization

5.3.4 Dealing with Unknown Sounds

5.4 Variations and Extensions

5.5 Conclusions

References

6 Microphone Arrays
John McDonough, Kenichi Kumatani

6.1 Speaker Tracking

6.2 Conventional Microphone Arrays

6.3 Conventional Adaptive Beamforming Algorithms

6.3.1 Minimum Variance Distortionless Response Beamformer

6.3.2 Noise Field Models

6.3.3 Subband Analysis and Synthesis

6.3.4 Beamforming Performance Criteria

6.3.5 Generalized Sidelobe Canceller Implementation

6.3.6 Recursive Implementation of the GSC

6.3.7 Other Conventional GSC Beamformers

6.3.8 Beamforming based on Higher Order Statistics

6.3.9 Online Implementation

6.3.10 Speech-Recognition Experiments

6.4 Spherical Microphone Arrays

6.5 Spherical Adaptive Algorithms

6.6 Comparative Studies

6.7 Comparison of Linear and Spherical Arrays for DSR

6.8 Conclusions and Further Reading

References

Part Three FEATURE ENHANCEMENT

7 From Signals to Speech Features by Digital Signal Processing
Matthias Wölfel

7.1 Introduction

7.1.1 About this Chapter

7.2 The Speech Signal

7.3 Spectral Processing

7.3.1 Windowing

7.3.2 Power Spectrum

7.3.3 Spectral Envelopes

7.3.4 LP Envelope

7.3.5 MVDR Envelope

7.3.6 Warping the Frequency Axis

7.3.7 Warped LP Envelope

7.3.8 Warped MVDR Envelope

7.3.9 Comparison of Spectral Estimates

7.3.10 The Spectrogram

7.4 Cepstral Processing

7.4.1 Definition and Calculation of Cepstral Coefficients

7.4.2 Characteristics of Cepstral Sequences

7.5 Influence of Distortions on Different Speech Features

7.5.1 Objective Functions

7.5.2 Robustness against Noise

7.5.3 Robustness against Echo and Reverberation

7.5.4 Robustness against Changes in Fundamental Frequency

7.6 Summary and Further Reading

References

8 Features Based on Auditory Physiology and Perception
Richard M. Stern, Nelson Morgan

8.1 Introduction

8.2 Some Attributes of Auditory Physiology and Perception

8.2.1 Peripheral Processing

8.2.2 Processing at more Central Levels

8.2.3 Psychoacoustical Correlates of Physiological Observations

8.2.4 The Impact of Auditory Processing on Conventional Feature Extraction

8.2.5 Summary

8.3 “Classic” Auditory Representations

8.4 Current Trends in Auditory Feature Analysis

8.5 Summary

Acknowledgments

References

9 Feature Compensation
Jasha Droppo

9.1 Life in an Ideal World

9.1.1 Noise Robustness Tasks

9.1.2 Probabilistic Feature Enhancement

9.1.3 Gaussian Mixture Models

9.2 MMSE-SPLICE

9.2.1 Parameter Estimation

9.2.2 Results

9.3 Discriminative SPLICE

9.3.1 The MMI Objective Function

9.3.2 Training the Front-End Parameters

9.3.3 The Rprop Algorithm

9.3.4 Results

9.4 Model-Based Feature Enhancement

9.4.1 The Additive Noise-Mixing Equation

9.4.2 The Joint Probability Model

9.4.3 Vector Taylor Series Approximation

9.4.4 Estimating Clean Speech

9.4.5 Results

9.5 Switching Linear Dynamic System

9.6 Conclusion

References

10 Reverberant Speech Recognition
Reinhold Haeb-Umbach, Alexander Krueger

10.1 Introduction

10.2 The Effect of Reverberation

10.2.1 What is Reverberation?

10.2.2 The Relationship between Clean and Reverberant Speech Features

10.2.3 The Effect of Reverberation on ASR Performance

10.3 Approaches to Reverberant Speech Recognition

10.3.1 Signal-Based Techniques

10.3.2 Front-End Techniques

10.3.3 Back-End Techniques

10.3.4 Concluding Remarks

10.4 Feature Domain Model of the Acoustic Impulse Response

10.5 Bayesian Feature Enhancement

10.5.1 Basic Approach

10.5.2 Measurement Update

10.5.3 Time Update

10.5.4 Inference

10.6 Experimental Results

10.6.1 Databases

10.6.2 Overview of the Tested Methods

10.6.3 Recognition Results on Reverberant Speech

10.6.4 Recognition Results on Noisy Reverberant Speech

10.7 Conclusions

Acknowledgment

References

Part Four MODEL ENHANCEMENT

11 Adaptation and Discriminative Training of Acoustic Models
Yannick Estève, Paul Deléglise

11.1 Introduction

11.1.1 Acoustic Models

11.1.2 Maximum Likelihood Estimation

11.2 Acoustic Model Adaptation and Noise Robustness

11.2.1 Static (or Offline) Adaptation

11.2.2 Dynamic (or Online) Adaptation

11.3 Maximum A Posteriori Reestimation

11.4 Maximum Likelihood Linear Regression

11.4.1 Class Regression Tree

11.4.2 Constrained Maximum Likelihood Linear Regression

11.4.3 CMLLR Implementation

11.4.4 Speaker Adaptive Training

11.5 Discriminative Training

11.5.1 MMI Discriminative Training Criterion

11.5.2 MPE Discriminative Training Criterion

11.5.3 I-smoothing

11.5.4 MPE Implementation

11.6 Conclusion

References

12 Factorial Models for Noise Robust Speech Recognition
John R. Hershey, Steven J. Rennie, Jonathan Le Roux

12.1 Introduction

12.2 The Model-Based Approach

12.3 Signal Feature Domains

12.4 Interaction Models

12.4.1 Exact Interaction Model

12.4.2 Max Model

12.4.3 Log-Sum Model

12.4.4 Mel Interaction Model

12.5 Inference Methods

12.5.1 Max Model Inference

12.5.2 Parallel Model Combination

12.5.3 Vector Taylor Series Approaches

12.5.4 SNR-Dependent Approaches

12.6 Efficient Likelihood Evaluation in Factorial Models

12.6.1 Efficient Inference using the Max Model

12.6.2 Efficient Vector-Taylor Series Approaches

12.6.3 Band Quantization

12.7 Current Directions

12.7.1 Dynamic Noise Models for Robust ASR

12.7.2 Multi-Talker Speech Recognition using Graphical Models

12.7.3 Noise Robust ASR using Non-Negative Basis Representations

References

13 Acoustic Model Training for Robust Speech Recognition
Michael L. Seltzer

13.1 Introduction

13.2 Traditional Training Methods for Robust Speech Recognition

13.3 A Brief Overview of Speaker Adaptive Training

13.4 Feature-Space Noise Adaptive Training

13.4.1 Experiments using fNAT

13.5 Model-Space Noise Adaptive Training

13.6 Noise Adaptive Training using VTS Adaptation

13.6.1 Vector Taylor Series HMM Adaptation

13.6.2 Updating the Acoustic Model Parameters

13.6.3 Updating the Environmental Parameters

13.6.4 Implementation Details

13.6.5 Experiments using NAT

13.7 Discussion

13.7.1 Comparison of Training Algorithms

13.7.2 Comparison to Speaker Adaptive Training

13.7.3 Related Adaptive Training Methods

13.8 Conclusion

References

Part Five COMPENSATION FOR INFORMATION LOSS

14 Missing-Data Techniques: Recognition with Incomplete Spectrograms
Jon Barker

14.1 Introduction

14.2 Classification with Incomplete Data

14.2.1 A Simple Missing Data Scenario

14.2.2 Missing Data Theory

14.2.3 Validity of the MAR Assumption

14.2.4 Marginalising Acoustic Models

14.3 Energetic Masking

14.3.1 The Max Approximation

14.3.2 Bounded Marginalisation

14.3.3 Missing Data ASR in the Cepstral Domain

14.3.4 Missing Data ASR with Dynamic Features

14.4 Meta-Missing Data: Dealing with Mask Uncertainty

14.4.1 Missing Data with Soft Masks

14.4.2 Sub-band Combination Approaches

14.4.3 Speech Fragment Decoding

14.5 Some Perspectives on Performance

References

15 Missing-Data Techniques: Feature Reconstruction
Jort Florent Gemmeke, Ulpu Remes

15.1 Introduction

15.2 Missing-Data Techniques

15.3 Correlation-Based Imputation

15.3.1 Fundamentals

15.3.2 Implementation

15.4 Cluster-Based Imputation

15.4.1 Fundamentals

15.4.2 Implementation

15.4.3 Advances

15.5 Class-Conditioned Imputation

15.5.1 Fundamentals

15.5.2 Implementation

15.5.3 Advances

15.6 Sparse Imputation

15.6.1 Fundamentals

15.6.2 Implementation

15.6.3 Advances

15.7 Other Feature-Reconstruction Methods

15.7.1 Parametric Approaches

15.7.2 Nonparametric Approaches

15.8 Experimental Results

15.8.1 Feature-Reconstruction Methods

15.8.2 Comparison with Other Methods

15.8.3 Advances

15.8.4 Combination with Other Methods

15.9 Discussion and Conclusion

Acknowledgments

References

16 Computational Auditory Scene Analysis and Automatic Speech Recognition
Arun Narayanan, DeLiang Wang

16.1 Introduction

16.2 Auditory Scene Analysis

16.3 Computational Auditory Scene Analysis

16.3.1 Ideal Binary Mask

16.3.2 Typical CASA Architecture

16.4 CASA Strategies

16.4.1 IBM Estimation Based on Local SNR Estimates

16.4.2 IBM Estimation using ASA Cues

16.4.3 IBM Estimation as Binary Classification

16.4.4 Binaural Mask Estimation Strategies

16.5 Integrating CASA with ASR

16.5.1 Uncertainty Transform Model

16.6 Concluding Remarks

Acknowledgment

References

17 Uncertainty Decoding
Hank Liao

17.1 Introduction

17.2 Observation Uncertainty

17.3 Uncertainty Decoding

17.4 Feature-Based Uncertainty Decoding

17.4.1 SPLICE with Uncertainty

17.4.2 Front-End Joint Uncertainty Decoding

17.4.3 Issues with Feature-Based Uncertainty Decoding

17.5 Model-Based Joint Uncertainty Decoding

17.5.1 Parameter Estimation

17.5.2 Comparisons with Other Methods

17.6 Noisy CMLLR

17.7 Uncertainty and Adaptive Training

17.7.1 Gradient-Based Methods

17.7.2 Factor Analysis Approaches

17.8 In Combination with Other Techniques

17.9 Conclusions

References

Index

Techniques for Noise Robustness in Automatic

    Product form

    £91.76

    Includes FREE delivery

    RRP £101.95 – you save £10.19 (9%)

    A Hardback by Tuomas Virtanen, Rita Singh, Bhiksha Raj

    1 in stock

      Publisher: John Wiley & Sons Inc
      Publication Date: 30/10/2012
      ISBN13: 9781119970880
      ISBN10: 1119970881

