Advancing Emotional Voice Conversion: Transforming Fundamental Frequency and Mel-Cepstral Coefficients Using Cycle Consistent Adversarial Networks with Two-Step Adversarial Loss and Patch-Based Discriminators

authors

published in

Human-centric Computing and Information Sciences Journal

publication date

June 2025

start page

1

end page

19

volume

15

Digital Object Identifier (DOI)

https://doi.org/10.22967/hcis.2025.15.034

Electronic International Standard Serial Number (EISSN)

2192-1962

abstract

The aim of emotional voice conversion (EVC) is to alter the emotional content of spoken utterances without compromising the speaker's identity or linguistic content. Many EVC frameworks rely on scarce parallel data recorded by actors. This paper proposes a novel framework for EVC that leverages non-parallel data through cycle consistent adversarial networks (CycleGANs). CycleGANs learn to transform input data between domains using a cycle loss that regularizes training by ensuring that the reconstructed inputs match the original inputs in both domains. Despite their use in various voice conversion tasks, CycleGANs often produce audio with degraded quality, largely due to the oversmoothing of speech features. To address this issue, we devised two distinct CycleGAN-based methods within the aforementioned framework: the first method incorporates a two-step adversarial loss, while the second extends the first with patch-based (PatchGAN) discriminators. Prior research has shown that these techniques alleviate oversmoothing of the spectrum and better capture dynamic spectral variations. In this work, we incorporate these enhancements to transform not only the spectrum but also the fundamental frequency (F0), a speech feature that is strongly related to intonation and the expression of emotion. The objective evaluation of the proposed methods shows improvements over the baseline in both Mel-cepstrum distortion and root-mean-square error, as well as in the Pearson correlation coefficient of the F0 transformation. Furthermore, subjective evaluations using the mean opinion score (MOS) and similarity MOS indicate that our model outperforms the baseline model in terms of naturalness and similarity to the target emotion.
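
The abstract names three objective measures without giving their formulas. The sketch below shows how such measures are commonly computed; it is an illustration under stated assumptions, not the authors' evaluation code. It assumes that the reference and converted feature sequences are already time-aligned frame by frame (for example with dynamic time warping), that the Mel-cepstral arrays exclude the 0th (energy) coefficient, and that unvoiced frames are marked by an F0 of 0. The function names are hypothetical.

```python
import numpy as np


def mel_cepstral_distortion(ref_mcep, conv_mcep):
    """Mean Mel-cepstral distortion in dB over aligned frames.

    Both inputs are (frames, dims) arrays. Assumption: the frames are
    already aligned and the 0th coefficient has been dropped.
    """
    diff = np.asarray(ref_mcep, dtype=float) - np.asarray(conv_mcep, dtype=float)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))


def _voiced_pairs(ref_f0, conv_f0):
    """Keep only frames that are voiced (F0 > 0) in both sequences."""
    ref = np.asarray(ref_f0, dtype=float)
    conv = np.asarray(conv_f0, dtype=float)
    voiced = (ref > 0) & (conv > 0)
    return ref[voiced], conv[voiced]


def f0_rmse(ref_f0, conv_f0):
    """Root-mean-square error of F0 (in the input unit) over voiced frames."""
    ref, conv = _voiced_pairs(ref_f0, conv_f0)
    return float(np.sqrt(np.mean((ref - conv) ** 2)))


def f0_pearson(ref_f0, conv_f0):
    """Pearson correlation coefficient of F0 over voiced frames."""
    ref, conv = _voiced_pairs(ref_f0, conv_f0)
    return float(np.corrcoef(ref, conv)[0, 1])


if __name__ == "__main__":
    # Toy contours in Hz; the second frame is unvoiced in both sequences.
    ref = [100.0, 0.0, 120.0, 140.0]
    conv = [110.0, 0.0, 130.0, 150.0]
    print(f0_rmse(ref, conv), f0_pearson(ref, conv))
```

Lower Mel-cepstral distortion and F0 error indicate a closer match to the target, while a Pearson coefficient nearer to 1 indicates that the converted F0 contour follows the shape of the target contour.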
