RuntimeError when trying to use accelerate finetuning #285

Closed
dtischencko opened this issue Sep 25, 2024 · 1 comment
dtischencko commented Sep 25, 2024

I get an error when I try to run the finetuning script on multiple GPUs:

CUDA_VISIBLE_DEVICES="1,2,3" accelerate launch --mixed_precision=fp16 --num_processes=3 train_finetune_accelerate.py --config_path ./Configs/config_ft.yml

Error

Traceback (most recent call last):
  File "/home/raid/dtishencko/git/goblin/train/StyleTTS2/train_finetune_accelerate.py", line 284, in main
    ppgs, s2s_pred, s2s_attn = model.text_aligner(mels, mask, texts)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
    raise RuntimeError("module must have its parameters and buffers "
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:2
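
This error is raised by torch.nn.DataParallel, which requires every parameter and buffer of the wrapped module to sit on device_ids[0] when forward() is called. A minimal sketch, unrelated to StyleTTS2 and with illustrative device indices, that trips the same check:

import torch
import torch.nn as nn

# DataParallel scatters from device_ids[0] (cuda:0 here); the check in
# data_parallel.py fails because the parameters live on cuda:2 instead.
module = nn.Linear(4, 4).to("cuda:2")
dp = nn.DataParallel(module, device_ids=[0, 1, 2])
x = torch.randn(8, 4, device="cuda:0")
dp(x)  # RuntimeError: module must have its parameters and buffers on device cuda:0 ...

The traceback above shows the same mismatch on one of the worker processes: the module's weights ended up on cuda:2 while the DataParallel wrapper still expects cuda:0.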

CUDA

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

Config

ASR_config: Utils/ASR/config.yml
ASR_path: Utils/ASR/epoch_00080.pth
F0_path: Utils/JDC/bst.t7
PLBERT_dir: Utils/PLBERT/
batch_size: 6
data_params:
  OOD_data: Data/OOD_texts.txt
  min_length: 50
  root_path: Data/wavs
  train_data: Data/train_list.txt
  val_data: Data/val_list.txt
device: cuda
epochs: 8
load_only_params: true
log_dir: Models/mymodel
log_interval: 10
loss_params:
  diff_epoch: 2
  joint_epoch: 4
  lambda_F0: 1.0
  lambda_ce: 20.0
  lambda_diff: 1.0
  lambda_dur: 1.0
  lambda_gen: 1.0
  lambda_mel: 5.0
  lambda_mono: 1.0
  lambda_norm: 1.0
  lambda_s2s: 1.0
  lambda_slm: 1.0
  lambda_sty: 1.0
max_len: 1200
model_params:
  decoder:
    resblock_dilation_sizes:
    - - 1
      - 3
      - 5
    - - 1
      - 3
      - 5
    - - 1
      - 3
      - 5
    resblock_kernel_sizes:
    - 3
    - 7
    - 11
    type: hifigan
    upsample_initial_channel: 512
    upsample_kernel_sizes:
    - 20
    - 10
    - 6
    - 4
    upsample_rates:
    - 10
    - 5
    - 3
    - 2
  diffusion:
    dist:
      estimate_sigma_data: true
      mean: -3.0
      sigma_data: 0.2
      std: 1.0
    embedding_mask_proba: 0.1
    transformer:
      head_features: 64
      multiplier: 2
      num_heads: 8
      num_layers: 3
  dim_in: 64
  dropout: 0.2
  hidden_dim: 512
  max_conv_dim: 512
  max_dur: 50
  multispeaker: false
  n_layer: 3
  n_mels: 80
  n_token: 178
  slm:
    hidden: 768
    initial_channel: 64
    model: microsoft/wavlm-base-plus
    nlayers: 13
    sr: 16000
  style_dim: 128
optimizer_params:
  bert_lr: 1.0e-05
  ft_lr: 0.0001
  lr: 0.0001
preprocess_params:
  spect_params:
    hop_length: 300
    n_fft: 2048
    win_length: 1200
  sr: 24000
pretrained_model: Models/LibriTTS/epochs_2nd_00020.pth
save_freq: 1
second_stage_load_pretrained: true
slmadv_params:
  batch_percentage: 1
  iter: 10
  max_len: 1200
  min_len: 400
  scale: 0.01
  sig: 1.5
  thresh: 5

I have tried various combinations of batch_size and number of GPUs; the error is the same.

@martinambrus

Finetuning and 2nd stage training are not supported on multiple GPUs - see issue #7.
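
Until multi-GPU support exists, a single-GPU run sidesteps the DataParallel device check. Adapting the command from the report above (one visible device, one process, flags otherwise unchanged):

CUDA_VISIBLE_DEVICES="1" accelerate launch --mixed_precision=fp16 --num_processes=1 train_finetune_accelerate.py --config_path ./Configs/config_ft.yml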
