The optimizer is not receiving the FSDP model parameters. #3209

Open
2 of 4 tasks
eljandoubi opened this issue Nov 1, 2024 · 8 comments · May be fixed by #3213

Comments

@eljandoubi
Contributor

eljandoubi commented Nov 1, 2024

System Info

- `Accelerate` version: 1.0.1
- Platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
- `accelerate` bash location: /home/a/anaconda3/envs/trans/bin/accelerate
- Python version: 3.12.7
- Numpy version: 2.1.2
- PyTorch version (GPU?): 2.5.0+cu124 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 15.46 GB
- GPU type: NVIDIA GeForce RTX 3070 Laptop GPU
- `Accelerate` default config:
	Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

See the code in the attached PDF: fsdp_acc.pdf

Expected behavior

The optimizer receives the FSDP-wrapped model's parameters.
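
For readers without access to the PDF, here is a minimal, hypothetical sketch (not the original repro) of how one might check whether the optimizer ends up holding the FSDP-wrapped model's parameters after `accelerator.prepare`. The toy `nn.Linear` model, the hyperparameters, and the assumption that it is launched via `accelerate launch` with a multi-GPU FSDP config are all placeholders:

```python
# Hypothetical sketch, not the code from fsdp_acc.pdf: create the optimizer
# before prepare(), then check whether it still points at the model's
# (now FSDP-wrapped) parameters. Meant to be run via `accelerate launch`
# with a multi-GPU FSDP config.
import torch
from accelerate import Accelerator

accelerator = Accelerator()

model = torch.nn.Linear(1024, 1024)                         # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# prepare() replaces the model's parameters with FSDP flat parameters;
# the question in this issue is whether the optimizer is updated to match.
model, optimizer = accelerator.prepare(model, optimizer)

model_param_ids = {id(p) for p in model.parameters()}
optim_param_ids = {id(p) for g in optimizer.param_groups for p in g["params"]}
accelerator.print("optimizer params are the model's params:",
                  len(optim_param_ids) > 0 and optim_param_ids <= model_param_ids)
```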

@weixiong-ur

Does it mean that the optimizer actually does not receive any parameters to update, so the model parameters won't be updated in the training loop?

@eljandoubi
Contributor Author

I guess.

@eljandoubi
Contributor Author

@weixiong-ur have you had the same result as me?

@muellerzr
Collaborator

A few things:

  • I'm confused by this repro. You only have a single GPU, no (per `accelerate env`)? We don't support non-multi-GPU FSDP.
  • And if FSDP truly were broken in such a way, it would be a much larger problem, which we know it isn't.

Can you give a non-Jupyter-notebook-based repro? If you do share a Jupyter notebook, please share it in a gist with outputs rather than as a PDF; I'm hesitant to open PDFs for security reasons, and it's difficult to copy/paste from them.

@BenjaminBossan
Member

Not sure if it's related, but users reported an error in PEFT that points in a similar direction. Note that the error is not caused by PEFT, as I could reproduce it without PEFT. From the error message, it seems like the params passed to the optimizer are not consistent with the model parameters. I could resolve the error by downgrading to:

  • trl==0.11.0
  • tokenizers>=0.19,<0.20
  • transformers==4.44.2
  • accelerate==0.33.0

@eljandoubi Maybe you could check if these versions resolve your issue.
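
If it helps, a quick, hypothetical way to confirm which versions are actually installed in the failing environment (package names taken from the list above):

```python
# Print the installed versions of the packages pinned above.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("trl", "tokenizers", "transformers", "accelerate"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is not installed")
```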

@eljandoubi
Contributor Author

eljandoubi commented Nov 21, 2024

@muellerzr
You can use my branch of the transformers repository [here] to train a model like Donut with FSDP wrapping based on layer size. It will print the number of parameters before and after applying FSDP wrapping.
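
For context, a rough, hypothetical sketch of the kind of before/after count that branch prints; the linked branch is the authoritative repro, and the toy model and the size-based wrapping policy assumed here are placeholders:

```python
# Hypothetical sketch of the before/after parameter count described above;
# the linked transformers branch is the authoritative repro. Meant to be run
# via `accelerate launch` with an FSDP config that wraps layers by size.
import torch
from accelerate import Accelerator


def n_params(module: torch.nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())


accelerator = Accelerator()
model = torch.nn.Sequential(*[torch.nn.Linear(512, 512) for _ in range(8)])  # placeholder model

accelerator.print("params before prepare:", n_params(model))
model = accelerator.prepare(model)
# With FSDP, parameters are flattened and sharded across ranks, so the count
# reported here per rank can differ from the unwrapped total.
accelerator.print("params after prepare:", n_params(model))
```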

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@BenjaminBossan
Member

Can you check if huggingface/transformers#35212 has solved the issue? If not, could you check whether additionally switching off flash attention helps?
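
For reference, a hypothetical sketch of one way to switch off flash attention for that test; the model id below is only a placeholder for the Donut checkpoint used in the original repro, and `attn_implementation` is the standard transformers `from_pretrained` argument:

```python
# Hypothetical sketch: load the model with eager attention instead of
# flash attention / SDPA. The model id is a placeholder for the Donut
# checkpoint used in the original repro.
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained(
    "naver-clova-ix/donut-base",      # placeholder model id
    attn_implementation="eager",      # avoid flash attention / SDPA kernels
)
```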
