Synchrony-Based Feature Extraction for Robust Automatic Speech Recognition

authors

  • de la Calle Silos, Fernando
  • Stern, Richard M.

publication date

  • August 2017

start page

  • 1158

end page

  • 1162

issue

  • 8

volume

  • 24

International Standard Serial Number (ISSN)

  • 1070-9908

Electronic International Standard Serial Number (EISSN)

  • 1558-2361

abstract

  • This letter discusses the application of models of temporal patterns of auditory-nerve firings to enhance the robustness of automatic speech recognition systems. Most conventional feature extraction schemes (such as mel-frequency cepstral coefficients and perceptual linear prediction coefficients) are based on short-time energy in each frequency band, and the temporal patterns of auditory-nerve activity are discarded. We compare the impact on speech recognition accuracy of several feature extraction schemes based on the putative synchrony of auditory-nerve activity, including a modified version of the generalized synchrony detector proposed by Seneff and a modified version of the averaged localized synchrony response proposed by Young and Sachs. Experiments using multiple standard speech databases show that features based on auditory-nerve synchrony can indeed improve speech recognition accuracy in the presence of additive noise. Recognition accuracy obtained with the synchrony-based features increases further if some form of noise removal is applied to the signal before the synchrony measure is estimated. For this purpose, the noise suppression that is part of power-normalized cepstral coefficient (PNCC) feature extraction is more effective than conventional spectral subtraction.
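  • To make the idea concrete, a Seneff-style generalized synchrony detector can be sketched as a delay-and-compare measure computed per filter-bank channel: each channel's output is compared with a copy of itself delayed by one characteristic period, and strong phase locking makes the delayed sum large relative to the delayed difference. The sketch below is illustrative only, not the authors' implementation; the Butterworth band-pass filters (standing in for an auditory filter bank), the frame sizes, and the `tanh` saturation are all assumptions.

```python
import numpy as np
from scipy.signal import butter, lfilter

def gsd_feature(signal, fs, center_freqs, frame_len=0.025, hop=0.010):
    """Minimal sketch of a Seneff-style generalized synchrony detector (GSD).

    For each band-pass channel, synchrony at the channel's characteristic
    frequency is estimated by comparing the channel output with a copy
    delayed by one characteristic period.
    """
    frame = int(frame_len * fs)
    step = int(hop * fs)
    n_frames = 1 + (len(signal) - frame) // step
    feats = np.zeros((n_frames, len(center_freqs)))
    for j, cf in enumerate(center_freqs):
        # Simple band-pass filter as a stand-in for an auditory filter bank.
        lo = 0.8 * cf / (fs / 2)
        hi = min(1.2 * cf / (fs / 2), 0.99)
        b, a = butter(2, [lo, hi], btype="band")
        y = lfilter(b, a, signal)
        T = int(round(fs / cf))  # one characteristic period, in samples
        y_del = np.concatenate([np.zeros(T), y[:-T]])
        for i in range(n_frames):
            s = slice(i * step, i * step + frame)
            num = np.mean(np.abs(y[s] + y_del[s]))
            den = np.mean(np.abs(y[s] - y_del[s])) + 1e-12
            feats[i, j] = np.tanh(num / den)  # soft saturation of the ratio
    return feats
```

  • For a pure 500-Hz tone, the 500-Hz channel output repeats almost exactly after one period, so the delayed sum dominates and the synchrony score saturates near 1, while an off-frequency channel scores much lower; additive broadband noise lowers the ratio, which is why noise removal before estimating the synchrony measure helps.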

keywords

  • auditory modeling; auditory synchrony; feature extraction; physiological modeling; robust speech recognition; auditory-nerve fibers