Delightful TTS implementation #1715
Thanks for opening the issue! Pre-training the model will take me roughly one more week. Afterward, I will refactor the code and get the project into a usable state, and then implement this in coqui, so it will probably take 6+ weeks.

Some info about the model: it's based on DelightfulTTS with some modifications. Many components of the model weren't fully explained in the paper. In particular, the authors didn't go into detail about the phoneme- and utterance-level prosody encoders or the hyperparameters used, so those implementations were heavily influenced by Comprehensive-Transformer-TTS. The model also uses a different scheme to provide both language and speaker embeddings; the scheme from DelightfulTTS may have worked for the Blizzard Challenge but didn't when using more speakers/languages.

For the G2P model, I used DeepPhonemizer, which implements Transformer-based grapheme-to-phoneme conversion, and increased the parameter count to ~23M. A single G2P model is trained on the global phone set of Montreal Forced Aligner in the following languages:
Also, I increased the parameter count of DelightfulTTS to ~120M; otherwise, it would underfit the dataset. The dataset is ~20% material from public datasets like LibriTTS (100h and 360h splits) and VCTK and ~80% data crawled by me. If you want to see some statistics about the dataset, you can click here.

The purpose of the model is to be fine-tuned on smaller datasets. It should provide a way to create TTS models in languages with limited data. It can also be used to code-switch: since the model was pre-trained on English, German, French, Spanish, Russian, and Polish voices, you can fine-tune it on an English voice and then make it speak the other languages.

The parameter count may seem intimidating for a TTS model, but it can be fine-tuned without a problem on 6 GB of VRAM using gradient accumulation. Also, since the architecture is FastSpeech-based and avoids autoregression, both training and inference are relatively quick and stable. UnivNet is used as the vocoder, but any vocoder that shares its STFT configuration should do the job.

In the future, I will further increase the size of the dataset, especially for the languages that contain no data yet. I also plan to further increase the size of the model, since the current model still underfits the dataset. Then I will try to create a smaller model using knowledge distillation.

I really hope the model turns out fine. I will probably fine-tune it tomorrow on an English single-speaker dataset to check how well it speaks the other languages, even though it hasn't fully finished pre-training, since I'm always impatient :) If you are interested in the progress, check out VoiceSmith (still a WIP), which provides a GUI to fine-tune the multilingual model and preprocess multilingual TTS datasets.
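To illustrate the gradient-accumulation point above, here is a minimal, framework-free sketch of the idea: gradients are averaged over several small micro-batches before a single optimizer step, so only one micro-batch has to fit in memory at a time. The toy regression problem and all sizes are hypothetical stand-ins for a real TTS training step.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regression problem standing in for a TTS training step.
X = rng.normal(size=(32, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
lr, accum_steps = 0.1, 4  # 4 micro-batches of 8 samples each

def grad(w, Xb, yb):
    """Mean-squared-error gradient for one micro-batch."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Accumulate micro-batch gradients, then apply ONE optimizer step --
# this is how a large model can be fine-tuned in limited VRAM.
g = np.zeros_like(w)
for Xb, yb in zip(np.split(X, accum_steps), np.split(y, accum_steps)):
    g += grad(w, Xb, yb) / accum_steps
w_accum = w - lr * g

# Equivalent full-batch step for comparison.
w_full = w - lr * grad(w, X, y)
assert np.allclose(w_accum, w_full)
```

With equally sized micro-batches, the accumulated update is mathematically identical to the full-batch update, which is why accumulation trades compute time for memory at no cost in model quality.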
I fine-tuned the model on the voices of Twilight Sparkle (~6000 samples, My Little Pony) and Demoman (~500 samples, Team Fortress 2). There is definitely still a lot of work to be done, but I think it shows that it's possible to pre-train a model on a bunch of languages and then make it speak languages not seen in the fine-tuning dataset.

Original:

- English (in pre-training dataset, in fine-tuning dataset):
- German (in pre-training dataset, not in fine-tuning dataset):
- French (in pre-training dataset, not in fine-tuning dataset):
- Spanish (in pre-training dataset, not in fine-tuning dataset):
- Russian (in pre-training dataset, not in fine-tuning dataset):
- Polish (in pre-training dataset, not in fine-tuning dataset):

The languages below should not work, since the model has seen them in neither pre-training nor fine-tuning, but I will include them anyway.

- Bulgarian (not in pre-training dataset, not in fine-tuning dataset):
- Czech (not in pre-training dataset, not in fine-tuning dataset):
- Croatian (not in pre-training dataset, not in fine-tuning dataset):
- European Portuguese (not in pre-training dataset, not in fine-tuning dataset):
- Swedish (not in pre-training dataset, not in fine-tuning dataset):
- Thai (not in pre-training dataset, not in fine-tuning dataset):
- Turkish (not in pre-training dataset, not in fine-tuning dataset):
- Ukrainian (not in pre-training dataset, not in fine-tuning dataset):
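The comments above mention that a custom scheme provides both speaker and language embeddings, without detailing it. One common way such conditioning is done is to add broadcast speaker and language embeddings to every frame of the encoder output; here is a minimal numpy sketch of that pattern. All sizes, table names, and the `condition` helper are hypothetical, not the fork's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- the issue does not state the real hyperparameters.
n_speakers, n_languages, d_model, seq_len = 4, 6, 8, 5

# Learned lookup tables (randomly initialized here for illustration).
speaker_table = rng.normal(size=(n_speakers, d_model))
language_table = rng.normal(size=(n_languages, d_model))

def condition(encoder_out, speaker_id, language_id):
    """Add broadcast speaker and language embeddings to every frame.

    One common conditioning scheme; the actual DelightfulTTS fork may
    combine the embeddings differently (e.g. concatenation + projection).
    """
    spk = speaker_table[speaker_id]     # (d_model,)
    lang = language_table[language_id]  # (d_model,)
    return encoder_out + spk + lang     # broadcast over the time axis

encoder_out = rng.normal(size=(seq_len, d_model))
conditioned = condition(encoder_out, speaker_id=2, language_id=1)
assert conditioned.shape == (seq_len, d_model)
```

Because speaker and language are separate tables, the decoder can be asked for unseen combinations at inference time, which is what makes the code-switching demos above possible.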
I also noticed that the multilingual G2P setup and the unusual phone set (the Montreal Forced Aligner phone set) will probably make it a pain to implement this in coqui. It's probably better to implement the English-only version, which was trained on ARPABET; I will develop that one alongside the multilingual one anyway. I have a Colab for inference of that one here.
The samples above are impressive. The TR samples sound like a German speaker speaking Turkish :) Can't we just use espeak for G2P? What are the benefits of using a neural G2P model?
Just noticed coqui already has support for multiple languages, that is nice. It doesn't really matter which G2P model we use; we just need a way to extract the phoneme durations, for example with a forced aligner. I see you implemented FastSpeech and FastPitch. How did you extract phoneme durations? Or did you depart from the original implementations and use unsupervised durations (since you implemented https://arxiv.org/pdf/2108.10447.pdf)?
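To make the forced-aligner route concrete, here is a minimal sketch of turning aligner time intervals into per-phoneme frame durations, the supervision signal a FastSpeech-style duration predictor needs. The interval values are made up, and a real pipeline would parse an MFA TextGrid instead of a hard-coded list; the sample rate and hop length are plausible UnivNet-style STFT settings, not confirmed values.

```python
# Hypothetical interval data; a real pipeline would parse an MFA TextGrid.
intervals = [("HH", 0.00, 0.06), ("AH", 0.06, 0.18), ("L", 0.18, 0.25),
             ("OW", 0.25, 0.47)]
sample_rate, hop_length = 22050, 256  # assumed STFT settings

def intervals_to_durations(intervals, sample_rate, hop_length):
    """Map aligner time intervals to per-phoneme frame counts.

    Rounds each *boundary* (not each length) to the nearest frame so the
    durations always sum to the utterance's total frame count.
    """
    frames_per_sec = sample_rate / hop_length
    bounds = [round(start * frames_per_sec) for _, start, _ in intervals]
    bounds.append(round(intervals[-1][2] * frames_per_sec))
    return [b - a for a, b in zip(bounds, bounds[1:])]

durations = intervals_to_durations(intervals, sample_rate, hop_length)
# Durations sum exactly to the total number of mel frames.
assert sum(durations) == round(0.47 * sample_rate / hop_length)
```

Rounding boundaries rather than lengths matters: rounding each phoneme length independently can make the durations disagree with the mel spectrogram length by a few frames, which breaks training.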
We learn durations unsupervised in different ways, one of which is that paper. It is called …
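For reference, the hard-duration extraction step in the linked alignment paper boils down to a monotonic alignment search over a soft frame-to-phoneme alignment matrix. Below is a minimal numpy sketch of that dynamic program, as an illustration of the idea rather than Coqui's actual implementation; the random input matrix is purely synthetic.

```python
import numpy as np

def monotonic_alignment_search(log_probs):
    """Extract the best monotonic frame->phoneme alignment.

    log_probs[t, s]: log-likelihood of frame t aligning to phoneme s.
    Each frame either stays on the current phoneme or advances by one.
    Returns per-phoneme durations (requires at least as many frames as
    phonemes). Illustrative only; not Coqui's actual code.
    """
    T, S = log_probs.shape
    cost = np.full((T, S), -np.inf)
    cost[0, 0] = log_probs[0, 0]
    for t in range(1, T):
        for s in range(S):
            stay = cost[t - 1, s]
            move = cost[t - 1, s - 1] if s > 0 else -np.inf
            cost[t, s] = log_probs[t, s] + max(stay, move)
    # Backtrack: at each frame, decide whether the path advanced a phoneme.
    durations = np.zeros(S, dtype=int)
    s = S - 1
    for t in range(T - 1, -1, -1):
        durations[s] += 1
        if t > 0 and s > 0 and cost[t - 1, s - 1] >= cost[t - 1, s]:
            s -= 1
    return durations

rng = np.random.default_rng(2)
log_probs = np.log(rng.dirichlet(np.ones(3), size=10))  # 10 frames, 3 phonemes
durs = monotonic_alignment_search(log_probs)
assert durs.sum() == 10 and (durs > 0).all()
```

The monotonicity constraint (each frame stays or moves one phoneme forward) is what guarantees every phoneme receives at least one frame and the durations sum to the frame count, so no separate forced aligner is needed.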
Hello, guys. I saw so many languages there, but no Chinese. Is there any plan to support Chinese?
Czech is understandable.
👑 @loganhart420 will continue implementing this.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also check out our discussion channels.
Is it available now?
I'm currently training the pretrained models; the PR to follow along is here: #2095
Still WIP by @loganhart420
Can someone please summarize the differences between VoiceSmith by @dunky11 and the current coqui-ai TTS repo?
Polish is awesome!
You can clone the branch right now and train from scratch. I've got working LJSpeech and VCTK models, so it should work for single- and multi-speaker datasets.
How can I clone my own voice (Spanish) with this tool? Please share some steps. Thanks!
Will there be a pre-trained model available for it (for fine-tuning)?
Does Delightful TTS support Mandarin now?
Can you share the pre-trained models?
Yeah, sharing some of the models would be highly appreciated :)
Is this PR for DelightfulTTS 1 or DelightfulTTS 2 (https://arxiv.org/abs/2207.04646)?
It's 1.
Any possibility of changing it to 2? It requires only small changes but gives a better MOS score than 1, according to the paper.
The PR is still open, though.
Hi Tim, is there any chance you could share that pre-trained model in multiple languages? Thank you!!
Paper: https://arxiv.org/abs/2110.12612
👑 @loganhart420 is going to do the heavy lifting!!!
We can discuss here how we want to go about it.