On combining acoustic and modulation spectrograms in an attention LSTM-based system for speech intelligibility level classification

authors

GALLARDO ANTOLIN, ASCENSION
MONTERO, JUAN MANUEL

published in

NEUROCOMPUTING Journal

publication date

October 2021

start page

49

end page

60

volume

456

Digital Object Identifier (DOI)

https://doi.org/10.1016/j.neucom.2021.05.065

full text

http://hdl.handle.net/10016/33704

International Standard Serial Number (ISSN)

0925-2312

Electronic International Standard Serial Number (EISSN)

1872-8286

abstract

Speech intelligibility can be affected by multiple factors, such as noisy environments, channel distortions
or physiological issues. In this work, we deal with the problem of automatic prediction of the speech
intelligibility level in this latter case. Starting from our previous work, a non-intrusive system based
on LSTM networks with an attention mechanism designed for this task, we present two main contributions.
In the first one, we propose the use of per-frame modulation spectrograms as input features, instead of
compact representations derived from them that discard important temporal information. In the second
one, two different strategies for the combination of per-frame acoustic log-mel and modulation spectrograms
into the LSTM framework are explored: decision-level (late) fusion and utterance-level
Weighted-Pooling (WP) fusion. The proposed models are evaluated with the UA-Speech database, which
contains dysarthric speech with different degrees of severity. On the one hand, results show that attentional
LSTM networks are able to adequately model modulation spectrogram sequences, producing
classification rates similar to those obtained with log-mel spectrograms. On the other hand, both
combination strategies, late and WP fusion, outperform the single-feature systems, suggesting that
per-frame log-mel and modulation spectrograms carry complementary information for the task of speech
intelligibility prediction that can be effectively exploited by the LSTM-based architectures. The system
with the WP fusion strategy and Attention-Pooling achieves the best results.
