Bug description
I'm encountering the error "FileNotFoundError: The provided path is not a valid DeepSpeed checkpoint" while trying to resume training using PyTorch Lightning with strategy='deepspeed_stage_2'. My training script saves only a .ckpt file, but DeepSpeed seems to require additional files for restoring checkpoints.
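For anyone triaging this: as far as I understand, Lightning's DeepSpeed strategy normally writes each checkpoint as a directory (often still named `*.ckpt`) that contains a `checkpoint` subfolder with the sharded model and optimizer states, rather than as a single file. The sketch below is a stdlib-only heuristic for telling the two cases apart before passing a path to `ckpt_path`. The helper names are hypothetical, and the directory layout is an assumption that may differ between versions.

```python
from pathlib import Path


def looks_like_deepspeed_checkpoint(path: str) -> bool:
    """Heuristic check: a directory that contains a `checkpoint` subfolder.

    Assumption: this mirrors the layout Lightning's DeepSpeed strategy
    usually produces; it is not an official validation routine.
    """
    p = Path(path)
    return p.is_dir() and (p / "checkpoint").is_dir()


def describe_checkpoint(path: str) -> str:
    """Classify a path as a DeepSpeed-style directory, a single file, or neither."""
    if looks_like_deepspeed_checkpoint(path):
        return "deepspeed-directory"
    if Path(path).is_file():
        return "single-file"
    return "missing-or-unknown"
```

If `describe_checkpoint` reports `single-file` for the path being resumed from, that would be consistent with the error above: the file was probably not written by the DeepSpeed strategy, or only part of the checkpoint directory was kept.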
What version are you seeing the problem on?
v2.4
How to reproduce the bug
Error messages and logs
Environment
- CUDA:
    - GPU:
        - NVIDIA A100 80GB PCIe
        - NVIDIA A100 80GB PCIe
        - NVIDIA A100 80GB PCIe
        - NVIDIA A100 80GB PCIe
    - available: True
    - version: 12.1
- Lightning:
    - lightning-utilities: 0.9.0
    - pytorch-lightning: 2.3.0
    - torch: 2.4.0
    - torchaudio: 2.3.0
    - torchmetrics: 1.4.0.post0
    - torchvision: 0.19.0
- Packages:
    - annotated-types: 0.7.0
    - antlr4-python3-runtime: 4.9.3
    - asttokens: 2.4.1
    - autocommand: 2.2.2
    - backports.tarfile: 1.2.0
    - brotli: 1.0.9
    - cachetools: 5.5.0
    - certifi: 2024.7.4
    - charset-normalizer: 3.3.2
    - click: 8.1.7
    - comm: 0.2.2
    - debugpy: 1.6.7
    - decorator: 5.1.1
    - deepspeed: 0.15.0
    - docker-pycreds: 0.4.0
    - exceptiongroup: 1.2.2
    - executing: 2.1.0
    - filelock: 3.13.1
    - fsspec: 2024.3.1
    - gitdb: 4.0.11
    - gitpython: 3.1.43
    - gmpy2: 2.1.2
    - hjson: 3.1.0
    - idna: 3.7
    - importlib-metadata: 8.4.0
    - importlib-resources: 6.4.0
    - inflect: 7.3.1
    - ipykernel: 6.29.5
    - ipython: 8.27.0
    - jaraco.context: 5.3.0
    - jaraco.functools: 4.0.1
    - jaraco.text: 3.12.1
    - jedi: 0.19.1
    - jinja2: 3.1.4
    - jupyter-client: 8.6.2
    - jupyter-core: 5.7.2
    - lightning-utilities: 0.9.0
    - markupsafe: 2.1.3
    - matplotlib-inline: 0.1.7
    - mkl-fft: 1.3.8
    - mkl-random: 1.2.4
    - mkl-service: 2.4.0
    - more-itertools: 10.3.0
    - mpmath: 1.3.0
    - nest-asyncio: 1.6.0
    - networkx: 3.3
    - ninja: 1.11.1.1
    - numpy: 1.26.4
    - nvidia-cublas-cu12: 12.1.3.1
    - nvidia-cuda-cupti-cu12: 12.1.105
    - nvidia-cuda-nvrtc-cu12: 12.1.105
    - nvidia-cuda-runtime-cu12: 12.1.105
    - nvidia-cudnn-cu12: 9.1.0.70
    - nvidia-cufft-cu12: 11.0.2.54
    - nvidia-curand-cu12: 10.3.2.106
    - nvidia-cusolver-cu12: 11.4.5.107
    - nvidia-cusparse-cu12: 12.1.0.106
    - nvidia-ml-py: 12.535.161
    - nvidia-nccl-cu12: 2.20.5
    - nvidia-nvjitlink-cu12: 12.6.20
    - nvidia-nvtx-cu12: 12.1.105
    - nvitop: 1.3.2
    - omegaconf: 2.3.0
    - ordered-set: 4.1.0
    - packaging: 24.1
    - pandas: 2.2.2
    - parso: 0.8.4
    - pexpect: 4.9.0
    - pickleshare: 0.7.5
    - pillow: 10.4.0
    - pip: 24.2
    - platformdirs: 4.2.2
    - prompt-toolkit: 3.0.47
    - protobuf: 5.27.3
    - psutil: 6.0.0
    - ptyprocess: 0.7.0
    - pure-eval: 0.2.3
    - py-cpuinfo: 9.0.0
    - pydantic: 2.8.2
    - pydantic-core: 2.20.1
    - pygments: 2.18.0
    - pysocks: 1.7.1
    - python-dateutil: 2.9.0
    - pytorch-lightning: 2.3.0
    - pytz: 2024.2
    - pyyaml: 6.0.1
    - pyzmq: 25.1.2
    - requests: 2.32.3
    - sentry-sdk: 2.13.0
    - setproctitle: 1.3.3
    - setuptools: 72.1.0
    - six: 1.16.0
    - smmap: 5.0.1
    - stack-data: 0.6.2
    - sympy: 1.12
    - termcolor: 2.4.0
    - tomli: 2.0.1
    - torch: 2.4.0
    - torchaudio: 2.3.0
    - torchmetrics: 1.4.0.post0
    - torchvision: 0.19.0
    - tornado: 6.4.1
    - tqdm: 4.66.4
    - traitlets: 5.14.3
    - triton: 3.0.0
    - typeguard: 4.3.0
    - typing-extensions: 4.11.0
    - tzdata: 2024.1
    - urllib3: 2.2.2
    - wandb: 0.17.7
    - wcwidth: 0.2.13
    - wheel: 0.43.0
    - xformers: 0.0.27.post2
    - zipp: 3.20.1
- System:
    - OS: Linux
    - architecture:
        - 64bit
        - ELF
    - processor: x86_64
    - python: 3.11.9
    - release: 5.15.0-87-generic
    - version: #97-Ubuntu SMP Mon Oct 2 21:09:21 UTC 2023
More info
No response