Delightful TTS implementation #1715
Thanks for opening the issue! Pre-training the model will take me roughly one more week. Afterward, I will refactor the code and get the project into a usable state, and then implement this in coqui, so it will probably take 6+ weeks.

Some info about the model: it's based on DelightfulTTS with some modifications. Many components of the model weren't fully explained in the paper. In particular, the authors didn't go into detail about the phoneme- and utterance-level prosody encoders or the hyperparameters used, so those implementations were heavily influenced by Comprehensive-Transformer-TTS. The model also uses a different scheme to provide both language and speaker embeddings; the scheme from DelightfulTTS may have worked for the Blizzard Challenge but didn't when using more speakers/languages.

For the G2P model, I used DeepPhonemizer, which implements Transformer-based grapheme-to-phoneme conversion, and increased the parameter count to ~23M. A single G2P model is trained on the global phone set of Montreal Forced Aligner in the following languages:
Also, I increased the parameter count of DelightfulTTS to ~120M; otherwise, it would underfit the dataset. The dataset is ~20% material from public datasets like LibriTTS (100h and 360h splits) and VCTK and ~80% data crawled by me. If you want to see some statistics about the dataset, you can click here.

The purpose of the model is to be fine-tuned on smaller datasets. It should provide a way to create TTS models in languages with limited data. It can also be used to code-switch: since the model was pre-trained on English, German, French, Spanish, Russian, and Polish voices, you can fine-tune it on an English voice and then make it speak the other languages.

The parameter count may seem intimidating for a TTS model, but it can be fine-tuned without a problem on 6 GB of VRAM using gradient accumulation. Also, since the architecture is FastSpeech-based and avoids autoregression, both training and inference are relatively quick and stable. UnivNet is used as the vocoder, but any vocoder that shares its STFT configuration should do the job.

In the future, I will further increase the size of the dataset, especially for the languages that contain no data yet. I also plan to further increase the size of the model, since the current model still underfits the dataset. Then I will try to create a smaller model using knowledge distillation.

I really hope the model turns out fine. I will probably fine-tune it tomorrow on an English single-speaker dataset to check how well it speaks the other languages, even though it hasn't fully finished pre-training, since I'm always impatient :) If you are interested in the progress, check out VoiceSmith (still a WIP), which provides a GUI to fine-tune the multilingual model and preprocess multilingual TTS datasets.
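To illustrate the gradient-accumulation point above, here is a minimal, framework-free sketch of the idea: gradients are averaged over several small micro-batches before a single optimizer step, so only one micro-batch has to fit in memory at a time. The toy regression problem and all sizes are hypothetical stand-ins for a real TTS training step.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regression problem standing in for a TTS training step.
X = rng.normal(size=(32, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
lr, accum_steps = 0.1, 4  # 4 micro-batches of 8 samples each

def grad(w, Xb, yb):
    """Mean-squared-error gradient for one micro-batch."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Accumulate micro-batch gradients, then apply ONE optimizer step --
# this is how a large model can be fine-tuned in limited VRAM.
g = np.zeros_like(w)
for Xb, yb in zip(np.split(X, accum_steps), np.split(y, accum_steps)):
    g += grad(w, Xb, yb) / accum_steps
w_accum = w - lr * g

# Equivalent full-batch step for comparison.
w_full = w - lr * grad(w, X, y)
assert np.allclose(w_accum, w_full)
```

With equally sized micro-batches, the accumulated update is mathematically identical to the full-batch update, which is why accumulation trades compute time for memory at no cost in model quality.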
I fine-tuned the model on the voices of Twilight Sparkle (~6000 samples, My Little Pony) and Demoman (~500 samples, Team Fortress 2). There is definitely still a lot of work to be done, but I think it shows that it's possible to pre-train a model on a bunch of languages and then make it speak languages not seen in the fine-tuning dataset.

Original:

- English (in pre-training dataset, in fine-tuning dataset):
- German (in pre-training dataset, not in fine-tuning dataset):
- French (in pre-training dataset, not in fine-tuning dataset):
- Spanish (in pre-training dataset, not in fine-tuning dataset):
- Russian (in pre-training dataset, not in fine-tuning dataset):
- Polish (in pre-training dataset, not in fine-tuning dataset):

The languages below should not work, since the model has seen them in neither pre-training nor fine-tuning, but I will include them anyway.

- Bulgarian (not in pre-training dataset, not in fine-tuning dataset):
- Czech (not in pre-training dataset, not in fine-tuning dataset):
- Croatian (not in pre-training dataset, not in fine-tuning dataset):
- European Portuguese (not in pre-training dataset, not in fine-tuning dataset):
- Swedish (not in pre-training dataset, not in fine-tuning dataset):
- Thai (not in pre-training dataset, not in fine-tuning dataset):
- Turkish (not in pre-training dataset, not in fine-tuning dataset):
- Ukrainian (not in pre-training dataset, not in fine-tuning dataset):
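The comments above mention that a custom scheme provides both speaker and language embeddings, without detailing it. One common way such conditioning is done is to add broadcast speaker and language embeddings to every frame of the encoder output; here is a minimal numpy sketch of that pattern. All sizes, table names, and the `condition` helper are hypothetical, not the fork's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- the issue does not state the real hyperparameters.
n_speakers, n_languages, d_model, seq_len = 4, 6, 8, 5

# Learned lookup tables (randomly initialized here for illustration).
speaker_table = rng.normal(size=(n_speakers, d_model))
language_table = rng.normal(size=(n_languages, d_model))

def condition(encoder_out, speaker_id, language_id):
    """Add broadcast speaker and language embeddings to every frame.

    One common conditioning scheme; the actual DelightfulTTS fork may
    combine the embeddings differently (e.g. concatenation + projection).
    """
    spk = speaker_table[speaker_id]     # (d_model,)
    lang = language_table[language_id]  # (d_model,)
    return encoder_out + spk + lang     # broadcast over the time axis

encoder_out = rng.normal(size=(seq_len, d_model))
conditioned = condition(encoder_out, speaker_id=2, language_id=1)
assert conditioned.shape == (seq_len, d_model)
```

Because speaker and language are separate tables, the decoder can be asked for unseen combinations at inference time, which is what makes the code-switching demos above possible.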
I also noticed that the multilingual G2P setup and the unusual phone set (the Montreal Forced Aligner phone set) will probably make it a pain to implement this in coqui. It's probably better to implement the English-only version, which was trained on ARPABET; I will develop that one alongside the multilingual one anyway. I have a Colab for inference of that one here.
The samples above are impressive. The TR samples sound like a German speaker speaking Turkish :) Can't we just use espeak for G2P? What are the benefits of using a neural G2P model?
Just noticed coqui already has support for multiple languages, that is nice. It doesn't really matter which G2P model we use; we just need a way to extract the phoneme durations, for example with a forced aligner. I see you implemented FastSpeech and FastPitch. How did you extract phoneme durations? Or did you depart from the original implementations and use unsupervised durations (since you implemented https://arxiv.org/pdf/2108.10447.pdf)?
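To make the forced-aligner route concrete, here is a minimal sketch of turning aligner time intervals into per-phoneme frame durations, the supervision signal a FastSpeech-style duration predictor needs. The interval values are made up, and a real pipeline would parse an MFA TextGrid instead of a hard-coded list; the sample rate and hop length are plausible UnivNet-style STFT settings, not confirmed values.

```python
# Hypothetical interval data; a real pipeline would parse an MFA TextGrid.
intervals = [("HH", 0.00, 0.06), ("AH", 0.06, 0.18), ("L", 0.18, 0.25),
             ("OW", 0.25, 0.47)]
sample_rate, hop_length = 22050, 256  # assumed STFT settings

def intervals_to_durations(intervals, sample_rate, hop_length):
    """Map aligner time intervals to per-phoneme frame counts.

    Rounds each *boundary* (not each length) to the nearest frame so the
    durations always sum to the utterance's total frame count.
    """
    frames_per_sec = sample_rate / hop_length
    bounds = [round(start * frames_per_sec) for _, start, _ in intervals]
    bounds.append(round(intervals[-1][2] * frames_per_sec))
    return [b - a for a, b in zip(bounds, bounds[1:])]

durations = intervals_to_durations(intervals, sample_rate, hop_length)
# Durations sum exactly to the total number of mel frames.
assert sum(durations) == round(0.47 * sample_rate / hop_length)
```

Rounding boundaries rather than lengths matters: rounding each phoneme length independently can make the durations disagree with the mel spectrogram length by a few frames, which breaks training.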
We learn durations unsupervised in different ways, one of which is that paper. It is called …
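For reference, the hard-duration extraction step in the linked alignment paper boils down to a monotonic alignment search over a soft frame-to-phoneme alignment matrix. Below is a minimal numpy sketch of that dynamic program, as an illustration of the idea rather than Coqui's actual implementation; the random input matrix is purely synthetic.

```python
import numpy as np

def monotonic_alignment_search(log_probs):
    """Extract the best monotonic frame->phoneme alignment.

    log_probs[t, s]: log-likelihood of frame t aligning to phoneme s.
    Each frame either stays on the current phoneme or advances by one.
    Returns per-phoneme durations (requires at least as many frames as
    phonemes). Illustrative only; not Coqui's actual code.
    """
    T, S = log_probs.shape
    cost = np.full((T, S), -np.inf)
    cost[0, 0] = log_probs[0, 0]
    for t in range(1, T):
        for s in range(S):
            stay = cost[t - 1, s]
            move = cost[t - 1, s - 1] if s > 0 else -np.inf
            cost[t, s] = log_probs[t, s] + max(stay, move)
    # Backtrack: at each frame, decide whether the path advanced a phoneme.
    durations = np.zeros(S, dtype=int)
    s = S - 1
    for t in range(T - 1, -1, -1):
        durations[s] += 1
        if t > 0 and s > 0 and cost[t - 1, s - 1] >= cost[t - 1, s]:
            s -= 1
    return durations

rng = np.random.default_rng(2)
log_probs = np.log(rng.dirichlet(np.ones(3), size=10))  # 10 frames, 3 phonemes
durs = monotonic_alignment_search(log_probs)
assert durs.sum() == 10 and (durs > 0).all()
```

The monotonicity constraint (each frame stays or moves one phoneme forward) is what guarantees every phoneme receives at least one frame and the durations sum to the frame count, so no separate forced aligner is needed.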
Hello, guys. I saw so many languages there, but no Chinese. Is there any plan to support Chinese?
Czech is understandable.
👑 @loganhart420 will continue implementing this.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also check out our discussion channels.
Is it available now?
I'm currently training the pretrained models; the PR to follow along is here: #2095
Still WIP by @loganhart420
Can someone please summarize the differences between VoiceSmith by @dunky11 and the current coqui-ai TTS repo?
Polish is awesome!
You can clone the branch right now and train from scratch. I've got working LJSpeech and VCTK models, so it should work for single- and multi-speaker datasets.
How can I clone my own voice (Spanish) with this tool? Please share some steps. Thanks!
Will there be a pre-trained model available for it (for fine-tuning)?
Does Delightful TTS support Mandarin now?
Can you share the pre-trained models?
Yeah, sharing some of the models would be highly appreciated :)
Is this PR for DelightfulTTS 1 or DelightfulTTS 2 (https://arxiv.org/abs/2207.04646)?
It's 1.
Any possibility of changing it to 2? It requires only small changes but gives a better MOS score than 1, according to the paper.
The PR is still open, though.
Hi Tim, is there any chance you could share that pre-trained model in multiple languages? Thank you!!
Paper: https://arxiv.org/abs/2110.12612
👑 @loganhart420 is going to do the heavy lifting!!!
We can discuss here how we want to go about it.