Fine-tuning a VITS model is very slow on an A100 #4053
Hello everyone, I'm trying to fine-tune a VITS model with a dataset of approximately 450 audio samples, totaling about 30 minutes of voice data (in LJSpeech format, each clip between 3 and 10 seconds). I am using an A100 GPU on Google Colab Pro.

**Issue / Observations**

- Batch size: 128
- The full configuration is below.

**Attempts to Fix the Issue**

**What I'm Looking For**

An optimal configuration for `batch_size`, learning rate, or any other parameters to take full advantage of an A100.
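One thing worth sanity-checking before tuning anything else: with only 450 clips, a batch size of 128 gives very few optimizer steps per epoch, so per-epoch overhead (dataloader startup, evaluation, checkpointing) can dominate wall-clock time rather than GPU compute. A minimal sketch of the arithmetic, using the numbers from the question (the `drop_last` behavior is an assumption about the dataloader, not taken from the VITS trainer):

```python
def steps_per_epoch(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of optimizer steps per epoch for a given batch size."""
    if drop_last:
        # partial final batch is discarded
        return num_samples // batch_size
    # partial final batch is kept: ceiling division
    return -(-num_samples // batch_size)

# With 450 samples:
print(steps_per_epoch(450, 128))  # 4 steps per epoch
print(steps_per_epoch(450, 32))   # 15 steps per epoch
```

In other words, at batch size 128 each "epoch" is only about 4 gradient updates, so epochs fly by in terms of learning while still paying fixed per-epoch costs every time.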
Replies: 1 comment 2 replies
Note that in the linked discussion they are talking about steps, not epochs, so it's not really comparable. For convergence you need to listen to the audio rather than counting steps or watching losses.