Made a tutorial for how to train a model with VITS V6 #1074
Replies: 13 comments 25 replies
-
Hey @LordApplesause, I'm having the following issue while running the Training TTS Colab.
Here the name of my dataset folder is "kumar". I have changed the dataset folder path in train_vits.py like this:
The original notebook contains the same lines as:
Can you help me with what could be the issue? Thanks
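For reference, a dataset-path change in a Coqui VITS recipe typically looks like the sketch below. The formatter, metadata file name, and the /content/kumar path are assumptions for illustration, not the exact lines from the asker's notebook:

```python
import os

from TTS.tts.configs.shared_configs import BaseDatasetConfig

# Assumed layout: /content/kumar/ contains wavs/ and metadata.csv
dataset_config = BaseDatasetConfig(
    formatter="ljspeech",              # older releases call this argument `name`
    meta_file_train="metadata.csv",
    path=os.path.join("/content", "kumar"),
)
```

The config is then passed into the training config's `datasets` list, so a wrong `path` here is the usual cause of "file not found" errors at startup.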
-
Hi @LordApplesause, very good idea to write and share an up-to-date tutorial. I will provide feedback when I am done reading it. For now, could you explain why we need to create a dataset (with 500 clips, as you wrote)? I tested the TTS Colab, and with just a minute of target-speaker clips the results are already stunning. What will creating a larger dataset bring? Looking forward to reading your reply
-
@LordApplesause Thank you for your effort. Is it possible for you to upload the tutorial elsewhere? Medium.com is not accessible from my place.
-
Hi, which model are you using as a starting point in your tutorial? Is it the one from Exp1 in the paper? Will it work to generate a French voice? Thanks
-
@LordApplesause Hello, can you share your custom voice dataset so I can reproduce your results? It will let me know if I'm doing everything right.
-
Is the tutorial still working? I am getting ModuleNotFoundError: No module named 'TTS.trainer'
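For anyone landing here later: newer Coqui TTS releases moved the Trainer out into a standalone `trainer` package, so the old `TTS.trainer` import path no longer exists. A small standard-library probe (names here are mine, not from the tutorial) can tell you which layout your environment has:

```python
import importlib.util


def trainer_import_path():
    """Return the module path Coqui's Trainer should be imported from, or None."""
    if importlib.util.find_spec("trainer") is not None:
        # Newer releases: `from trainer import Trainer, TrainerArgs`
        return "trainer"
    if importlib.util.find_spec("TTS") is not None:
        # Older releases bundled it: `from TTS.trainer import Trainer`
        return "TTS.trainer"
    return None


print(trainer_import_path())
```

If it prints "trainer", update the notebook's import accordingly; if it prints None, the packages aren't installed at all.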
-
@LordApplesause can you please reach me on Gitter, Element, or email. We prepared a small 🎁 for you.
-
@rioharper Thank you so much for writing the tutorial. I had a couple of problems using it. First, the SNR binary file could not be executed (I'm using WSL2). Second, I can't find the code for synthesis in the notebooks you provide (do end-to-end models use vocoders?). TensorBoard also says there is no info to be read in the log directory, which is not correct!
-
E2E models have their own vocoder built in, so you don't have to specify anything! All hail the awesome VITS!
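To make that concrete: with VITS, synthesizing from a trained checkpoint needs no vocoder arguments at all. A minimal sketch using Coqui's Synthesizer helper, where the checkpoint and config paths are placeholders you would point at your own training run:

```python
from TTS.utils.synthesizer import Synthesizer

# Placeholder paths: point these at the files your training run produced
synthesizer = Synthesizer(
    tts_checkpoint="/path/to/best_model.pth",
    tts_config_path="/path/to/config.json",
)

# No vocoder_checkpoint / vocoder_config needed: VITS decodes straight to audio
wav = synthesizer.tts("This is a test sentence.")
synthesizer.save_wav(wav, "output.wav")
```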
-
Any clue on how to run WADASNR locally? My WSL refuses to. It just acts as if it doesn't exist.
-
Hi man, I was following your tutorial to train my own model. First of all, I must say congratulations on the job; it is not only really helpful but also entertaining. Now I have a few questions:
I'm running my code locally and recording all the modifications I had to make to the Colab notebooks for them to work on something like the WSL of a PC. If you want, as soon as I have it done, I can share it somehow; maybe I can create a repo and send you the link or something. Once again, thanks for such an amazing tutorial. Greetings from Argentina!
-
The article mentions that to continue, you just change resume_from to your latest checkpoint, but the training notebook doesn't seem to have any resume_from in it. Is it training from scratch rather than fine-tuning? There is a spot where it downloads a checkpoint, but I don't see it used.
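If it helps while the article gets updated: Coqui's trainer distinguishes resuming a run from fine-tuning off a checkpoint via TrainerArgs (exposed on the CLI as --continue_path and --restore_path, to the best of my knowledge). A sketch with placeholder paths:

```python
from trainer import TrainerArgs

# Resume an interrupted run in place (keeps optimizer state and step count):
args = TrainerArgs(continue_path="/path/to/previous/run_dir")

# ...or fine-tune from a specific checkpoint into a fresh run directory:
# args = TrainerArgs(restore_path="/path/to/checkpoint.pth")
```

So a downloaded checkpoint is only actually used if it is handed to one of these two options; otherwise training starts from scratch.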
-
NOTE: This is a fairly old tutorial, so the training process it describes may no longer work with current VITS releases. I will get around to updating the tutorial soon, but for now refer to the Coqui docs for more info!
I've been using this tech for a while now, so I thought I should make some kind of contribution. This article details how to make a dataset, configure training values, and generate audio, all on Google Colab.
If anyone has any input to make this article more detailed and helpful, please let me know, and I'll make sure to implement it!
Article