StarGANv2-VC

We present an unsupervised, non-parallel, many-to-many voice conversion (VC) method based on a generative adversarial network (GAN) called StarGAN v2. Using a combination of an adversarial source classifier loss and a perceptual loss, our model significantly outperforms previous VC models. Although our model is trained on only 20 English speakers, it generalizes to a variety of voice conversion tasks, such as any-to-many, cross-lingual, and singing conversion. Using a style encoder, our framework can also convert plain reading speech into stylistic speech, such as emotional and falsetto speech. Subjective and objective evaluation experiments on a non-parallel many-to-many voice conversion task show that our model produces natural-sounding voices, close in quality to state-of-the-art text-to-speech (TTS) based voice conversion methods, without the need for text labels. Moreover, our model is fully convolutional and, with a faster-than-real-time vocoder such as Parallel WaveGAN, can perform real-time voice conversion.
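As a rough illustration of how the adversarial source classifier loss and the ASR-based perceptual loss enter training, the sketch below combines them into a single generator objective. All module names, weights, and signatures here are placeholders chosen for illustration, not the actual implementation.

# Minimal sketch of how the generator objective combines the adversarial loss,
# the adversarial source classifier loss, and the ASR perceptual loss.
# All modules and weights are placeholders, not the official implementation.
import torch
import torch.nn.functional as F

def generator_losses(real_mel, fake_mel, trg_speaker,
                     discriminator, source_classifier, asr_model,
                     lambda_advcls=0.5, lambda_asr=10.0):
    # Adversarial loss: the converted mel-spectrogram should fool the
    # discriminator for the target speaker's domain.
    d_out = discriminator(fake_mel, trg_speaker)
    loss_adv = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))

    # Adversarial source classifier loss: the converted sample should be
    # classified as the target speaker, suppressing source-speaker traces.
    cls_out = source_classifier(fake_mel)
    loss_advcls = F.cross_entropy(cls_out, trg_speaker)

    # Perceptual (speech consistency) loss: linguistic features extracted by
    # a pre-trained ASR network should be preserved after conversion.
    with torch.no_grad():
        real_feat = asr_model(real_mel)
    loss_asr = F.l1_loss(asr_model(fake_mel), real_feat)

    return loss_adv + lambda_advcls * loss_advcls + lambda_asr * loss_asr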


VCTK Dataset

All of the following audio samples are converted with a single model trained on 20 speakers from the VCTK dataset. For a fair comparison with the baseline models, all audio is downsampled to 16 kHz. We demonstrate four conversion schemes: many-to-many, any-to-many, cross-lingual, and singing conversion.
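For reference, the downsampling step mentioned above can be reproduced with a couple of lines of librosa and soundfile; the file paths here are placeholders.

# Downsample an audio clip to 16 kHz for comparison with the baselines
# (file paths are placeholders).
import librosa
import soundfile as sf

audio, sr = librosa.load("sample_24k.wav", sr=16000)  # resampled on load
sf.write("sample_16k.wav", audio, sr)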

All utterances are partially or completely unseen during training, and the results are uncurated (NOT cherry-picked) unless otherwise specified.

For more audio samples, please see the survey we used for MOS evaluation here. You may have to randomly select some answers before proceeding to the next page.


Many-to-Many Conversion

The converted samples from AUTO-VC are taken directly from the survey above. In the survey, we use different source utterances for different models so that raters cannot identify which sample is the ground truth; the audio clips shown below are therefore converted from the AUTO-VC sources, which differ from those shown in the survey.

Female to Female

  Sample 1 (p229 → p236) Sample 2 (p231 → p230)
Source
Target
AUTO-VC
StarGANv2-VC

Female to Male

  Sample 1 (p225 → p259) Sample 2 (p244 → p243)
Source
Target
AUTO-VC
StarGANv2-VC

Male to Female

  Sample 1 (p226 → p233) Sample 2 (p232 → p236)
Source
Target
AUTO-VC
StarGANv2-VC

Male to Male

  Sample 1 (p243 → p254) Sample 2 (p259 → p273)
Source
Target
AUTO-VC
StarGANv2-VC

Any-to-Many Conversion

Our model can also convert from speakers unseen during training. One sample is shown for each case of any-to-many conversion.
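Any-to-many conversion works because the source speaker is never labeled at inference time: the generator only needs a style code extracted from a reference utterance of one of the seen target speakers. The toy sketch below illustrates that flow; the tiny placeholder modules stand in for the actual style encoder and generator described in the paper.

# Toy illustration of any-to-many inference: only a style code from a seen
# target speaker's reference utterance is needed; the source speaker can be
# unseen. These modules are placeholders, not the real architectures.
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):                  # placeholder
    def __init__(self, n_mels=80, style_dim=64):
        super().__init__()
        self.proj = nn.Linear(n_mels, style_dim)
    def forward(self, mel):                     # mel: (batch, frames, n_mels)
        return self.proj(mel).mean(dim=1)       # time-averaged style code

class Generator(nn.Module):                     # placeholder
    def __init__(self, n_mels=80, style_dim=64):
        super().__init__()
        self.proj = nn.Linear(n_mels + style_dim, n_mels)
    def forward(self, mel, style):
        style = style.unsqueeze(1).expand(-1, mel.size(1), -1)
        return self.proj(torch.cat([mel, style], dim=-1))

style_encoder, generator = StyleEncoder(), Generator()
source_mel = torch.randn(1, 200, 80)            # any speaker, seen or unseen
reference_mel = torch.randn(1, 180, 80)         # reference from a seen target speaker
style = style_encoder(reference_mel)
converted_mel = generator(source_mel, style)    # then synthesized by the vocoder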

Female to Female

  p280 → p228
Source
Target
Converted

Female to Male

  p267 → p227
Source
Target
Converted

Male to Female

  p286 → p244
Source
Target
Converted

Male to Male

  p287 → p273
Source
Target
Converted

Cross-lingual Conversion

We show that our model can convert speech in other languages from unseen source speakers, even though it is trained only on English data with an English ASR perceptual loss. We use Korean, Japanese, and Mandarin as example languages.

Korean

  Korean male → p244
Source
Target
Converted

Japanese

  Japanese male → p228
Source
Target
Converted

Mandarin

  Mandarin female → p254
Source
Target
Converted

Singing Conversion

Lastly, we show that our model can perform singing conversion even though no singing samples are seen during training. Because both the conversion model and the vocoder are trained only on speech data, there are some artifacts that resemble speech patterns. We compare our results with Polyak et al.; their audio samples are taken directly from their online supplement page. We use the same sources from the NUS-48 singing dataset and target speakers available among the selected 20 speakers.

  VKOW → p259 MCUR → p233
Source
Target
Polyak et al.
StarGANv2-VC

ESD Dataset

To demonstrate the ability to convert into stylistic speech, we train another model with 10 English speakers from the Emotional Speech Dataset (ESD). Our model can convert a neutral reading into emotional speech. We also demonstrate conversion from emotional speech to emotional speech, which shows that our model can be applied to movie dubbing given proper source input. All samples are at 16 kHz.

Emotional to Emotional

  Female to Male Male to Female
Source
Reference
Converted

Neutral to Emotional

  Female to Male Male to Female
Source
Reference (neutral)
Converted (neutral)
Reference (emotional)
Converted (emotional)

JVS Dataset

The JVS dataset is a multi-speaker Japanese speech dataset that contains both regular and falsetto speech. We train a model with 130 regular speech utterances and 10 falsetto speech utterances from 10 randomly selected speakers. Our model can convert regular speech into both regular and falsetto voices. We also show that our model can perform cross-lingual conversion with English source speakers from the VCTK dataset, even though it is trained only on a Japanese corpus. All samples are at 24 kHz.

Female to Male (JVS 084 → JVS 006)

Source Reference Converted Speech
(regular speech)
(falsetto speech)

Male to Female (JVS 099 → JVS 010)

Source Reference Converted Speech
(regular speech)
(falsetto speech)

Cross-lingual Conversion

  Female to Male Male to Female
Source
Target
Converted

Ablation Study

We present two samples from the ablation study under the conditions described in Table 2 of our paper, using the VCTK dataset.

  Sample 1 (p233 → p259) Sample 2 (p273 → p244)
Source
Target
Full StarGANv2-VC
No F0 Consistency Loss
No Speech Consistency Loss
No Norm Consistency Loss
No Adversarial Source Classifier Loss
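
For context, the ablated terms above are individual components of the full generator objective, which can be written schematically as below; the StarGAN v2 style reconstruction, diversification, and cycle consistency terms are kept in every condition, and the exact weights λ are hyperparameters reported in the paper.

% Schematic form of the full generator objective (weights \lambda are
% hyperparameters reported in the paper).
\begin{equation*}
\mathcal{L}_{G} \;=\;
\mathcal{L}_{\mathrm{adv}}
+ \lambda_{\mathrm{advcls}}\,\mathcal{L}_{\mathrm{advcls}}
+ \lambda_{\mathrm{sty}}\,\mathcal{L}_{\mathrm{sty}}
- \lambda_{\mathrm{ds}}\,\mathcal{L}_{\mathrm{ds}}
+ \lambda_{\mathrm{f0}}\,\mathcal{L}_{\mathrm{f0}}
+ \lambda_{\mathrm{asr}}\,\mathcal{L}_{\mathrm{asr}}
+ \lambda_{\mathrm{norm}}\,\mathcal{L}_{\mathrm{norm}}
+ \lambda_{\mathrm{cyc}}\,\mathcal{L}_{\mathrm{cyc}}.
\end{equation*}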