Description

Book Synopsis
With automatic speech recognition (ASR) increasingly used in everyday life, engineers and researchers developing ASR technologies need to be able to handle noise and other problems in recorded speech. The only resource of its kind, this book presents a comprehensive survey of state-of-the-art techniques used to improve the robustness of ASR systems.

Table of Contents

List of Contributors xv

Acknowledgments xvii

1 Introduction 1
Tuomas Virtanen, Rita Singh, Bhiksha Raj

1.1 Scope of the Book 1

1.2 Outline 2

1.3 Notation 4

Part One FOUNDATIONS

2 The Basics of Automatic Speech Recognition 9
Rita Singh, Bhiksha Raj, Tuomas Virtanen

2.1 Introduction 9

2.2 Speech Recognition Viewed as Bayes Classification 10

2.3 Hidden Markov Models 11

2.3.1 Computing Probabilities with HMMs 12

2.3.2 Determining the State Sequence 17

2.3.3 Learning HMM Parameters 19

2.3.4 Additional Issues Relating to Speech Recognition Systems 20

2.4 HMM-Based Speech Recognition 24

2.4.1 Representing the Signal 24

2.4.2 The HMM for a Word Sequence 25

2.4.3 Searching through all Word Sequences 26

References 29

3 The Problem of Robustness in Automatic Speech Recognition 31
Bhiksha Raj, Tuomas Virtanen, Rita Singh

3.1 Errors in Bayes Classification 31

3.1.1 Type 1 Condition: Mismatch Error 33

3.1.2 Type 2 Condition: Increased Bayes Error 34

3.2 Bayes Classification and ASR 35

3.2.1 All We Have is a Model: A Type 1 Condition 35

3.2.2 Intrinsic Interferences—Signal Components that are Unrelated to the Message: A Type 2 Condition 36

3.2.3 External Interferences—The Data are Noisy: Type 1 and Type 2 Conditions 36

3.3 External Influences on Speech Recordings 36

3.3.1 Signal Capture 37

3.3.2 Additive Corruptions 41

3.3.3 Reverberation 42

3.3.4 A Simplified Model of Signal Capture 43

3.4 The Effect of External Influences on Recognition 44

3.5 Improving Recognition under Adverse Conditions 46

3.5.1 Handling the Model Mismatch Error 46

3.5.2 Dealing with Intrinsic Variations in the Data 47

3.5.3 Dealing with Extrinsic Variations 47

References 50

Part Two SIGNAL ENHANCEMENT

4 Voice Activity Detection, Noise Estimation, and Adaptive Filters for Acoustic Signal Enhancement 53
Rainer Martin, Dorothea Kolossa

4.1 Introduction 53

4.2 Signal Analysis and Synthesis 55

4.2.1 DFT-Based Analysis-Synthesis with Perfect Reconstruction 55

4.2.2 Probability Distributions for Speech and Noise DFT Coefficients 57

4.3 Voice Activity Detection 58

4.3.1 VAD Design Principles 58

4.3.2 Evaluation of VAD Performance 62

4.3.3 Evaluation in the Context of ASR 62

4.4 Noise Power Spectrum Estimation 65

4.4.1 Smoothing Techniques 65

4.4.2 Histogram and GMM Noise Estimation Methods 67

4.4.3 Minimum Statistics Noise Power Estimation 67

4.4.4 MMSE Noise Power Estimation 68

4.4.5 Estimation of the A Priori Signal-to-Noise Ratio 69

4.5 Adaptive Filters for Signal Enhancement 71

4.5.1 Spectral Subtraction 71

4.5.2 Nonlinear Spectral Subtraction 73

4.5.3 Wiener Filtering 74

4.5.4 The ETSI Advanced Front End 75

4.5.5 Nonlinear MMSE Estimators 75

4.6 ASR Performance 80

4.7 Conclusions 81

References 82

5 Extraction of Speech from Mixture Signals 87
Paris Smaragdis

5.1 The Problem with Mixtures 87

5.2 Multichannel Mixtures 88

5.2.1 Basic Problem Formulation 88

5.2.2 Convolutive Mixtures 92

5.3 Single-Channel Mixtures 98

5.3.1 Problem Formulation 98

5.3.2 Learning Sound Models 100

5.3.3 Separation by Spectrogram Factorization 101

5.3.4 Dealing with Unknown Sounds 105

5.4 Variations and Extensions 107

5.5 Conclusions 107

References 107

6 Microphone Arrays 109
John McDonough, Kenichi Kumatani

6.1 Speaker Tracking 110

6.2 Conventional Microphone Arrays 113

6.3 Conventional Adaptive Beamforming Algorithms 120

6.3.1 Minimum Variance Distortionless Response Beamformer 120

6.3.2 Noise Field Models 122

6.3.3 Subband Analysis and Synthesis 123

6.3.4 Beamforming Performance Criteria 126

6.3.5 Generalized Sidelobe Canceller Implementation 129

6.3.6 Recursive Implementation of the GSC 130

6.3.7 Other Conventional GSC Beamformers 131

6.3.8 Beamforming based on Higher Order Statistics 132

6.3.9 Online Implementation 136

6.3.10 Speech-Recognition Experiments 140

6.4 Spherical Microphone Arrays 142

6.5 Spherical Adaptive Algorithms 148

6.6 Comparative Studies 149

6.7 Comparison of Linear and Spherical Arrays for DSR 152

6.8 Conclusions and Further Reading 154

References 155

Part Three FEATURE ENHANCEMENT

7 From Signals to Speech Features by Digital Signal Processing 161
Matthias Wölfel

7.1 Introduction 161

7.1.1 About this Chapter 162

7.2 The Speech Signal 162

7.3 Spectral Processing 163

7.3.1 Windowing 163

7.3.2 Power Spectrum 165

7.3.3 Spectral Envelopes 166

7.3.4 LP Envelope 166

7.3.5 MVDR Envelope 169

7.3.6 Warping the Frequency Axis 171

7.3.7 Warped LP Envelope 175

7.3.8 Warped MVDR Envelope 176

7.3.9 Comparison of Spectral Estimates 177

7.3.10 The Spectrogram 179

7.4 Cepstral Processing 179

7.4.1 Definition and Calculation of Cepstral Coefficients 180

7.4.2 Characteristics of Cepstral Sequences 181

7.5 Influence of Distortions on Different Speech Features 182

7.5.1 Objective Functions 182

7.5.2 Robustness against Noise 185

7.5.3 Robustness against Echo and Reverberation 187

7.5.4 Robustness against Changes in Fundamental Frequency 189

7.6 Summary and Further Reading 191

References 191

8 Features Based on Auditory Physiology and Perception 193
Richard M. Stern, Nelson Morgan

8.1 Introduction 193

8.2 Some Attributes of Auditory Physiology and Perception 194

8.2.1 Peripheral Processing 194

8.2.2 Processing at more Central Levels 200

8.2.3 Psychoacoustical Correlates of Physiological Observations 202

8.2.4 The Impact of Auditory Processing on Conventional Feature Extraction 206

8.2.5 Summary 208

8.3 “Classic” Auditory Representations 208

8.4 Current Trends in Auditory Feature Analysis 213

8.5 Summary 221

Acknowledgments 222

References 222

9 Feature Compensation 229
Jasha Droppo

9.1 Life in an Ideal World 229

9.1.1 Noise Robustness Tasks 229

9.1.2 Probabilistic Feature Enhancement 230

9.1.3 Gaussian Mixture Models 231

9.2 MMSE-SPLICE 232

9.2.1 Parameter Estimation 233

9.2.2 Results 236

9.3 Discriminative SPLICE 237

9.3.1 The MMI Objective Function 238

9.3.2 Training the Front-End Parameters 239

9.3.3 The Rprop Algorithm 240

9.3.4 Results 241

9.4 Model-Based Feature Enhancement 242

9.4.1 The Additive Noise-Mixing Equation 243

9.4.2 The Joint Probability Model 244

9.4.3 Vector Taylor Series Approximation 246

9.4.4 Estimating Clean Speech 247

9.4.5 Results 247

9.5 Switching Linear Dynamic System 248

9.6 Conclusion 249

References 249

10 Reverberant Speech Recognition 251
Reinhold Haeb-Umbach, Alexander Krueger

10.1 Introduction 251

10.2 The Effect of Reverberation 252

10.2.1 What is Reverberation? 252

10.2.2 The Relationship between Clean and Reverberant Speech Features 254

10.2.3 The Effect of Reverberation on ASR Performance 258

10.3 Approaches to Reverberant Speech Recognition 258

10.3.1 Signal-Based Techniques 259

10.3.2 Front-End Techniques 260

10.3.3 Back-End Techniques 262

10.3.4 Concluding Remarks 265

10.4 Feature Domain Model of the Acoustic Impulse Response 265

10.5 Bayesian Feature Enhancement 267

10.5.1 Basic Approach 268

10.5.2 Measurement Update 269

10.5.3 Time Update 270

10.5.4 Inference 271

10.6 Experimental Results 272

10.6.1 Databases 272

10.6.2 Overview of the Tested Methods 273

10.6.3 Recognition Results on Reverberant Speech 274

10.6.4 Recognition Results on Noisy Reverberant Speech 276

10.7 Conclusions 277

Acknowledgment 278

References 278

Part Four MODEL ENHANCEMENT

11 Adaptation and Discriminative Training of Acoustic Models 285
Yannick Estève, Paul Deléglise

11.1 Introduction 285

11.1.1 Acoustic Models 286

11.1.2 Maximum Likelihood Estimation 287

11.2 Acoustic Model Adaptation and Noise Robustness 288

11.2.1 Static (or Offline) Adaptation 289

11.2.2 Dynamic (or Online) Adaptation 289

11.3 Maximum A Posteriori Reestimation 290

11.4 Maximum Likelihood Linear Regression 293

11.4.1 Class Regression Tree 294

11.4.2 Constrained Maximum Likelihood Linear Regression 297

11.4.3 CMLLR Implementation 297

11.4.4 Speaker Adaptive Training 298

11.5 Discriminative Training 299

11.5.1 MMI Discriminative Training Criterion 301

11.5.2 MPE Discriminative Training Criterion 302

11.5.3 I-smoothing 303

11.5.4 MPE Implementation 304

11.6 Conclusion 307

References 308

12 Factorial Models for Noise Robust Speech Recognition 311
John R. Hershey, Steven J. Rennie, Jonathan Le Roux

12.1 Introduction 311

12.2 The Model-Based Approach 313

12.3 Signal Feature Domains 314

12.4 Interaction Models 317

12.4.1 Exact Interaction Model 318

12.4.2 Max Model 320

12.4.3 Log-Sum Model 321

12.4.4 Mel Interaction Model 321

12.5 Inference Methods 322

12.5.1 Max Model Inference 322

12.5.2 Parallel Model Combination 324

12.5.3 Vector Taylor Series Approaches 326

12.5.4 SNR-Dependent Approaches 331

12.6 Efficient Likelihood Evaluation in Factorial Models 332

12.6.1 Efficient Inference using the Max Model 332

12.6.2 Efficient Vector-Taylor Series Approaches 334

12.6.3 Band Quantization 335

12.7 Current Directions 337

12.7.1 Dynamic Noise Models for Robust ASR 338

12.7.2 Multi-Talker Speech Recognition using Graphical Models 339

12.7.3 Noise Robust ASR using Non-Negative Basis Representations 340

References 341

13 Acoustic Model Training for Robust Speech Recognition 347
Michael L. Seltzer

13.1 Introduction 347

13.2 Traditional Training Methods for Robust Speech Recognition 348

13.3 A Brief Overview of Speaker Adaptive Training 349

13.4 Feature-Space Noise Adaptive Training 351

13.4.1 Experiments using fNAT 352

13.5 Model-Space Noise Adaptive Training 353

13.6 Noise Adaptive Training using VTS Adaptation 355

13.6.1 Vector Taylor Series HMM Adaptation 355

13.6.2 Updating the Acoustic Model Parameters 357

13.6.3 Updating the Environmental Parameters 360

13.6.4 Implementation Details 360

13.6.5 Experiments using NAT 361

13.7 Discussion 364

13.7.1 Comparison of Training Algorithms 364

13.7.2 Comparison to Speaker Adaptive Training 364

13.7.3 Related Adaptive Training Methods 365

13.8 Conclusion 366

References 366

Part Five COMPENSATION FOR INFORMATION LOSS

14 Missing-Data Techniques: Recognition with Incomplete Spectrograms 371
Jon Barker

14.1 Introduction 371

14.2 Classification with Incomplete Data 373

14.2.1 A Simple Missing Data Scenario 374

14.2.2 Missing Data Theory 376

14.2.3 Validity of the MAR Assumption 378

14.2.4 Marginalising Acoustic Models 379

14.3 Energetic Masking 381

14.3.1 The Max Approximation 381

14.3.2 Bounded Marginalisation 382

14.3.3 Missing Data ASR in the Cepstral Domain 384

14.3.4 Missing Data ASR with Dynamic Features 386

14.4 Meta-Missing Data: Dealing with Mask Uncertainty 388

14.4.1 Missing Data with Soft Masks 388

14.4.2 Sub-band Combination Approaches 391

14.4.3 Speech Fragment Decoding 393

14.5 Some Perspectives on Performance 395

References 396

15 Missing-Data Techniques: Feature Reconstruction 399
Jort Florent Gemmeke, Ulpu Remes

15.1 Introduction 399

15.2 Missing-Data Techniques 401

15.3 Correlation-Based Imputation 402

15.3.1 Fundamentals 402

15.3.2 Implementation 404

15.4 Cluster-Based Imputation 406

15.4.1 Fundamentals 406

15.4.2 Implementation 408

15.4.3 Advances 409

15.5 Class-Conditioned Imputation 411

15.5.1 Fundamentals 411

15.5.2 Implementation 412

15.5.3 Advances 413

15.6 Sparse Imputation 414

15.6.1 Fundamentals 414

15.6.2 Implementation 416

15.6.3 Advances 418

15.7 Other Feature-Reconstruction Methods 420

15.7.1 Parametric Approaches 420

15.7.2 Nonparametric Approaches 421

15.8 Experimental Results 421

15.8.1 Feature-Reconstruction Methods 422

15.8.2 Comparison with Other Methods 424

15.8.3 Advances 426

15.8.4 Combination with Other Methods 427

15.9 Discussion and Conclusion 428

Acknowledgments 429

References 430

16 Computational Auditory Scene Analysis and Automatic Speech Recognition 433
Arun Narayanan, DeLiang Wang

16.1 Introduction 433

16.2 Auditory Scene Analysis 434

16.3 Computational Auditory Scene Analysis 435

16.3.1 Ideal Binary Mask 435

16.3.2 Typical CASA Architecture 438

16.4 CASA Strategies 440

16.4.1 IBM Estimation Based on Local SNR Estimates 440

16.4.2 IBM Estimation using ASA Cues 442

16.4.3 IBM Estimation as Binary Classification 448

16.4.4 Binaural Mask Estimation Strategies 451

16.5 Integrating CASA with ASR 452

16.5.1 Uncertainty Transform Model 454

16.6 Concluding Remarks 458

Acknowledgment 458

References 458

17 Uncertainty Decoding 463
Hank Liao

17.1 Introduction 463

17.2 Observation Uncertainty 465

17.3 Uncertainty Decoding 466

17.4 Feature-Based Uncertainty Decoding 468

17.4.1 SPLICE with Uncertainty 470

17.4.2 Front-End Joint Uncertainty Decoding 471

17.4.3 Issues with Feature-Based Uncertainty Decoding 472

17.5 Model-Based Joint Uncertainty Decoding 473

17.5.1 Parameter Estimation 475

17.5.2 Comparisons with Other Methods 476

17.6 Noisy CMLLR 477

17.7 Uncertainty and Adaptive Training 480

17.7.1 Gradient-Based Methods 481

17.7.2 Factor Analysis Approaches 482

17.8 In Combination with Other Techniques 483

17.9 Conclusions 484

References 485

Index 487

Techniques for Noise Robustness in Automatic

£91.76

Includes FREE delivery

RRP £101.95 – you save £10.19 (9%)

A Hardback by Tuomas Virtanen, Rita Singh, Bhiksha Raj

    Publisher: John Wiley & Sons Inc
    Publication Date: 30/10/2012
    ISBN13: 9781119970880, 978-1119970880
    ISBN10: 1119970881
