Advancing Emotional Voice Conversion: Transforming Fundamental Frequency and Mel-Cepstral Coefficients Using Cycle Consistent Adversarial Networks with Two-Step Adversarial Loss and Patch-Based Discriminators
Articles
Electronic International Standard Serial Number (EISSN)
2192-1962
abstract
The aim of emotional voice conversion (EVC) is to alter the emotional content of spoken utterances without compromising the speaker's identity or linguistic content. Many EVC frameworks rely on scarce parallel data recorded by actors. This paper proposes a novel framework for EVC that leverages non-parallel data through cycle-consistent adversarial networks (CycleGANs). CycleGANs learn to transform input data between domains using a cycle loss that regularizes training by ensuring that inputs mapped to the other domain and back again match the original inputs, in both directions. Despite their use in various voice conversion tasks, CycleGANs often produce audio with degraded quality, largely due to the oversmoothing of speech features. To address this issue, we devised two distinct CycleGAN-based methods within the proposed framework: the first incorporates a two-step adversarial loss, while the second extends the first with patch-based (PatchGAN) discriminators. Prior research has demonstrated that these techniques alleviate the oversmoothing of the spectrum and better capture dynamic spectral variations. In this work, we apply these enhancements to transform not only the spectrum but also the fundamental frequency (F0), a speech feature that is strongly related to intonation and the expression of emotion. Objective evaluation of the proposed methods shows improvements over the baseline in both Mel-cepstrum distortion and root-mean-square error, as well as in the Pearson correlation coefficient of the F0 transformation. Furthermore, subjective evaluations using the mean opinion score (MOS) and similarity MOS indicate that our model outperforms the baseline model in terms of naturalness and similarity to the target emotion.
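The cycle loss mentioned in the abstract can be illustrated with a minimal sketch. It assumes an L1 reconstruction penalty, as in the standard CycleGAN formulation; the abstract does not state the exact form used in the paper. The function names, the toy linear "generators", the emotion labels, and the feature dimensions are all hypothetical stand-ins for the neural networks and speech features a real system would use.

```python
import numpy as np

def cycle_consistency_loss(x, y, g_xy, g_yx):
    """L1 cycle loss: mapping to the other domain and back should recover the input."""
    forward = np.mean(np.abs(g_yx(g_xy(x)) - x))   # X -> Y -> X
    backward = np.mean(np.abs(g_xy(g_yx(y)) - y))  # Y -> X -> Y
    return forward + backward

# Hypothetical toy generators; real generators are trained neural networks.
def to_target(features):
    return 1.2 * features + 0.5

def to_source(features):
    return (features - 0.5) / 1.2

rng = np.random.default_rng(0)
# Assumed shapes: 4 frames x 24 Mel-cepstral coefficients per emotion domain.
neutral = rng.normal(size=(4, 24))
angry = rng.normal(size=(4, 24))

# Generators that are exact inverses reconstruct their inputs almost perfectly.
loss_exact = cycle_consistency_loss(neutral, angry, to_target, to_source)
# Replacing one generator with the identity breaks the cycle and raises the loss.
loss_mismatch = cycle_consistency_loss(neutral, angry, to_target, lambda f: f)
```

In this sketch the loss is near zero only when the two mappings invert each other, which is the property the regularizer encourages during training on non-parallel data.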
Classification
subjects
Computer Science
keywords
emotional voice conversion; non-parallel voice conversion; CycleGAN; two-step adversarial loss; PatchGAN; fundamental frequency