We present an unsupervised non-parallel many-to-many voice conversion (VC) method using a generative adversarial network (GAN) called StarGAN v2. Using a combination of adversarial source classifier loss and perceptual loss, our model significantly outperforms previous VC models. Although our model is trained on only 20 English speakers, it generalizes to a variety of voice conversion tasks, such as any-to-many, cross-lingual, and singing conversion. Using a style encoder, our framework can also convert plain reading speech into stylistic speech, such as emotional and falsetto speech. Subjective and objective evaluation experiments on a non-parallel many-to-many voice conversion task revealed that our model produces natural-sounding voices, close to the sound quality of state-of-the-art text-to-speech (TTS) based voice conversion methods, without the need for text labels. Moreover, our model is fully convolutional and, paired with a faster-than-real-time vocoder such as Parallel WaveGAN, can perform real-time voice conversion.
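The generator objective above combines a standard adversarial loss, an adversarial source classifier loss (pushing the classifier to label the converted speech as the target speaker), and an ASR-feature perceptual loss (matching ASR hidden features of source and converted speech). The following is a minimal NumPy sketch of how such terms might be combined; the loss weights, feature shapes, and helper names are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def cross_entropy(logits, target_idx):
    """Softmax cross-entropy for a single example."""
    z = logits - logits.max()                      # stabilize the softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target_idx]

def adversarial_loss(d_out):
    """Non-saturating generator GAN loss: -log D(G(x))."""
    return -np.log(d_out + 1e-8)

def source_classifier_loss(cls_logits, y_trg):
    """Adversarial source classifier term: the generator is trained so the
    classifier labels the converted speech as the TARGET speaker."""
    return cross_entropy(cls_logits, y_trg)

def asr_perceptual_loss(feat_src, feat_cvt):
    """L1 distance between ASR hidden features of source and converted speech,
    encouraging the linguistic content to survive conversion."""
    return np.mean(np.abs(feat_src - feat_cvt))

def generator_objective(d_out, cls_logits, y_trg, feat_src, feat_cvt,
                        lam_advcls=0.5, lam_asr=10.0):
    """Weighted sum of the three terms (weights are illustrative)."""
    return (adversarial_loss(d_out)
            + lam_advcls * source_classifier_loss(cls_logits, y_trg)
            + lam_asr * asr_perceptual_loss(feat_src, feat_cvt))

# Illustrative values only: random logits and ASR feature maps.
rng = np.random.default_rng(0)
loss = generator_objective(
    d_out=0.7,                         # discriminator output on converted speech
    cls_logits=rng.normal(size=20),    # source classifier logits (20 speakers)
    y_trg=3,                           # target speaker index
    feat_src=rng.normal(size=(80, 64)),
    feat_cvt=rng.normal(size=(80, 64)),
)
```

The full training objective in the paper includes further terms (style reconstruction, style diversification, cycle consistency, F0 consistency); this sketch covers only the two losses highlighted above plus the base GAN loss.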
VCTK Dataset
All of the following audios are converted using a single model trained on 20 speakers from the VCTK dataset. For a fair comparison to the baseline models, all audios are downsampled to 16 kHz. We demonstrate four types of conversion schemes: many-to-many, any-to-many, cross-lingual, and singing conversion.
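Since VCTK audio is originally recorded at 48 kHz, downsampling to 16 kHz is one simple preprocessing step. A minimal sketch using SciPy's polyphase resampler (the function name and pipeline here are illustrative; the page's actual preprocessing is not specified):

```python
import numpy as np
from scipy.signal import resample_poly

def downsample_to_16k(wav: np.ndarray, orig_sr: int) -> np.ndarray:
    """Polyphase resampling to 16 kHz with built-in anti-aliasing."""
    g = np.gcd(orig_sr, 16000)
    return resample_poly(wav, 16000 // g, orig_sr // g)

# One second of a 440 Hz tone at 48 kHz -> 16000 samples at 16 kHz.
t = np.arange(48000) / 48000
y = downsample_to_16k(np.sin(2 * np.pi * 440 * t), 48000)
```

`resample_poly` applies a low-pass filter before decimation, so content below the new Nyquist frequency (8 kHz) is preserved.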
All utterances are partially or completely unseen during training, and the results are uncurated (NOT cherry-picked) unless otherwise specified.
For more audio samples, please refer to the survey used for our MOS evaluation here. You may need to randomly select some answers before proceeding to the next page.
Many-to-Many Conversion
The converted AUTO-VC samples are taken directly from the survey above. Because we use different source utterances for different models in the survey, to prevent raters from identifying which sample is the ground truth, the audio clips shown below are converted from the sources used for AUTO-VC, which differ from those shown in the survey.
Female to Female
Sample 1 (p229 → p236)
Sample 2 (p231 → p230)
Source
Target
AUTO-VC
StarGANv2-VC
Female to Male
Sample 1 (p225 → p259)
Sample 2 (p244 → p243)
Source
Target
AUTO-VC
StarGANv2-VC
Male to Female
Sample 1 (p226 → p233)
Sample 2 (p232 → p236)
Source
Target
AUTO-VC
StarGANv2-VC
Male to Male
Sample 1 (p243 → p254)
Sample 2 (p259 → p273)
Source
Target
AUTO-VC
StarGANv2-VC
Any-to-Many Conversion
Our model can also convert from speakers unseen during training. One sample is shown for each case of any-to-many conversion.
Female to Female
p280 → p228
Source
Target
Converted
Female to Male
p267 → p227
Source
Target
Converted
Male to Female
p286 → p244
Source
Target
Converted
Male to Male
p287 → p273
Source
Target
Converted
Cross-lingual Conversion
We show that our model can convert from source speech in languages and from speakers unseen during training, even though the model is trained only on English data with an English ASR perceptual loss. We use Korean, Japanese, and Mandarin as example source languages.
Korean
Korean male → p244
Source
Target
Converted
Japanese
Japanese male → p228
Source
Target
Converted
Mandarin
Mandarin female → p254
Source
Target
Converted
Singing Conversion
Lastly, we show that our model can perform singing conversion even though no singing samples are seen during training. Because both the conversion model and the vocoder are trained only on speech data, there are some artifacts that resemble speech patterns. We compare our results with Polyak et al.; their audio samples are taken directly from their online supplement page. We use the same source utterances from the NUS-48 singing dataset and target speakers from the selected 20 speakers.
VKOW → p259
MCUR → p233
Source
Target
Polyak et al.
StarGANv2-VC
ESD Dataset
To demonstrate the ability to convert into stylistic speech, we train another model on 10 English speakers from the Emotional Speech Dataset (ESD). Our model can convert a neutral reading into emotional speech. We also demonstrate conversion from emotional speech to emotional speech, which shows that our model can be applied to movie dubbing given proper source input. All samples are at 16 kHz.
Emotional to Emotional
Female to Male
Male to Female
Source
Reference
Converted
Neutral to Emotional
Female to Male
Male to Female
Source
Reference (neutral)
Converted (neutral)
Reference (emotional)
Converted (emotional)
JVS Dataset
The JVS dataset is a multi-speaker Japanese speech dataset that contains both regular and falsetto speech. We train a model with 130 regular speech utterances and 10 falsetto speech utterances from 10 randomly selected speakers. Our model can convert regular speech into both regular and falsetto voices. We also show that our model can perform cross-lingual conversion with English source speakers from the VCTK dataset, despite being trained only on a Japanese corpus. All samples are at 24 kHz.
Female to Male (JVS 084 → JVS 006)
Source
Reference
Converted Speech
(regular speech)
(falsetto speech)
Male to Female (JVS 099 → JVS 010)
Source
Reference
Converted Speech
(regular speech)
(falsetto speech)
Cross-lingual Conversion
Female to Male
Male to Female
Source
Target
Converted
Ablation Study
We present two samples from the ablation study, covering the conditions described in Table 2 of our paper, on the VCTK dataset.