"FileNotFoundError: The provided path is not a valid DeepSpeed checkpoint" when using strategy='deepspeed_stage_2' #20453

Open
ShiweiWu98 opened this issue Nov 26, 2024 · 0 comments
Labels
bug (Something isn't working), needs triage (Waiting to be triaged by maintainers), ver: 2.4.x

Bug description

I'm encountering the error "FileNotFoundError: The provided path is not a valid DeepSpeed checkpoint" while trying to resume training using PyTorch Lightning with strategy='deepspeed_stage_2'. My training script saves only a .ckpt file, but DeepSpeed seems to require additional files for restoring checkpoints.
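
To make the distinction above concrete, here is a minimal sketch of a helper that reports whether a resume path is a single file or a directory that looks like a DeepSpeed checkpoint. The name `describe_checkpoint` is hypothetical, and the directory layout it checks for (a `latest` tag file next to a subdirectory holding a `*model_states.pt` file) is an assumption about what DeepSpeed typically writes, not something confirmed in this report; it is a heuristic, not the validation Lightning itself performs.

```python
from pathlib import Path


def describe_checkpoint(path):
    """Classify a checkpoint path by its on-disk shape (heuristic only)."""
    p = Path(path)
    if not p.exists():
        return "missing"
    if p.is_file():
        # A single file, e.g. a plain .ckpt written without DeepSpeed sharding.
        return "single-file"
    # Assumed DeepSpeed layout: <dir>/latest plus <dir>/<tag>/*model_states.pt
    has_latest = (p / "latest").is_file()
    has_states = any(p.glob("*/*model_states.pt"))
    if has_latest and has_states:
        return "deepspeed-directory"
    return "other-directory"
```

Calling `describe_checkpoint(args.resume_ckpt)` before `trainer.fit` would show which of these cases the path passed as `ckpt_path` falls into.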

What version are you seeing the problem on?

v2.4

How to reproduce the bug

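import os

import pytorch_lightning as PL
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger, WandbLogger

# setup() and SSLFLArch are project-specific and defined elsewhere.

def do_train(args):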
    "train one stage"
    cfg = setup(args)
    cfg.exp_name = os.path.basename(args.output_dir)
    if args.use_wandb:
        logger = WandbLogger(project='demo',
                             offline=True,
                             name=cfg.exp_name,
                             resume="allow" if args.resume_ckpt else None,
                             entity="team")
    else:
        logger = TensorBoardLogger("./tb_logs",name="demo",version=cfg.exp_name)
    model = SSLFLArch(cfg)
    ckpt_cb = ModelCheckpoint(dirpath=f'./ckpts/{cfg.exp_name}',
                              filename='{epoch:d}',
                              every_n_epochs=10,
                              save_top_k=-1)
    callbacks = [ckpt_cb]
    trainer = PL.Trainer(max_epochs=cfg.optim["epochs"],
                         callbacks=callbacks,
                         logger=logger,
                         enable_model_summary=False,
                         precision=16 if cfg.compute_precision.grad_scaler else 32,
                         log_every_n_steps=10,
                         accelerator='gpu',
                         devices=[0, 1, 2, 3],
                         strategy='deepspeed_stage_2',
                         )
    trainer.fit(model,
                ckpt_path=args.resume_ckpt,
                )

Error messages and logs

"FileNotFoundError: The provided path is not a valid DeepSpeed checkpoint"

Environment

  • CUDA:
    - GPU:
    - NVIDIA A100 80GB PCIe
    - NVIDIA A100 80GB PCIe
    - NVIDIA A100 80GB PCIe
    - NVIDIA A100 80GB PCIe
    - available: True
    - version: 12.1
  • Lightning:
    - lightning-utilities: 0.9.0
    - pytorch-lightning: 2.3.0
    - torch: 2.4.0
    - torchaudio: 2.3.0
    - torchmetrics: 1.4.0.post0
    - torchvision: 0.19.0
  • Packages:
    - annotated-types: 0.7.0
    - antlr4-python3-runtime: 4.9.3
    - asttokens: 2.4.1
    - autocommand: 2.2.2
    - backports.tarfile: 1.2.0
    - brotli: 1.0.9
    - cachetools: 5.5.0
    - certifi: 2024.7.4
    - charset-normalizer: 3.3.2
    - click: 8.1.7
    - comm: 0.2.2
    - debugpy: 1.6.7
    - decorator: 5.1.1
    - deepspeed: 0.15.0
    - docker-pycreds: 0.4.0
    - exceptiongroup: 1.2.2
    - executing: 2.1.0
    - filelock: 3.13.1
    - fsspec: 2024.3.1
    - gitdb: 4.0.11
    - gitpython: 3.1.43
    - gmpy2: 2.1.2
    - hjson: 3.1.0
    - idna: 3.7
    - importlib-metadata: 8.4.0
    - importlib-resources: 6.4.0
    - inflect: 7.3.1
    - ipykernel: 6.29.5
    - ipython: 8.27.0
    - jaraco.context: 5.3.0
    - jaraco.functools: 4.0.1
    - jaraco.text: 3.12.1
    - jedi: 0.19.1
    - jinja2: 3.1.4
    - jupyter-client: 8.6.2
    - jupyter-core: 5.7.2
    - lightning-utilities: 0.9.0
    - markupsafe: 2.1.3
    - matplotlib-inline: 0.1.7
    - mkl-fft: 1.3.8
    - mkl-random: 1.2.4
    - mkl-service: 2.4.0
    - more-itertools: 10.3.0
    - mpmath: 1.3.0
    - nest-asyncio: 1.6.0
    - networkx: 3.3
    - ninja: 1.11.1.1
    - numpy: 1.26.4
    - nvidia-cublas-cu12: 12.1.3.1
    - nvidia-cuda-cupti-cu12: 12.1.105
    - nvidia-cuda-nvrtc-cu12: 12.1.105
    - nvidia-cuda-runtime-cu12: 12.1.105
    - nvidia-cudnn-cu12: 9.1.0.70
    - nvidia-cufft-cu12: 11.0.2.54
    - nvidia-curand-cu12: 10.3.2.106
    - nvidia-cusolver-cu12: 11.4.5.107
    - nvidia-cusparse-cu12: 12.1.0.106
    - nvidia-ml-py: 12.535.161
    - nvidia-nccl-cu12: 2.20.5
    - nvidia-nvjitlink-cu12: 12.6.20
    - nvidia-nvtx-cu12: 12.1.105
    - nvitop: 1.3.2
    - omegaconf: 2.3.0
    - ordered-set: 4.1.0
    - packaging: 24.1
    - pandas: 2.2.2
    - parso: 0.8.4
    - pexpect: 4.9.0
    - pickleshare: 0.7.5
    - pillow: 10.4.0
    - pip: 24.2
    - platformdirs: 4.2.2
    - prompt-toolkit: 3.0.47
    - protobuf: 5.27.3
    - psutil: 6.0.0
    - ptyprocess: 0.7.0
    - pure-eval: 0.2.3
    - py-cpuinfo: 9.0.0
    - pydantic: 2.8.2
    - pydantic-core: 2.20.1
    - pygments: 2.18.0
    - pysocks: 1.7.1
    - python-dateutil: 2.9.0
    - pytorch-lightning: 2.3.0
    - pytz: 2024.2
    - pyyaml: 6.0.1
    - pyzmq: 25.1.2
    - requests: 2.32.3
    - sentry-sdk: 2.13.0
    - setproctitle: 1.3.3
    - setuptools: 72.1.0
    - six: 1.16.0
    - smmap: 5.0.1
    - stack-data: 0.6.2
    - sympy: 1.12
    - termcolor: 2.4.0
    - tomli: 2.0.1
    - torch: 2.4.0
    - torchaudio: 2.3.0
    - torchmetrics: 1.4.0.post0
    - torchvision: 0.19.0
    - tornado: 6.4.1
    - tqdm: 4.66.4
    - traitlets: 5.14.3
    - triton: 3.0.0
    - typeguard: 4.3.0
    - typing-extensions: 4.11.0
    - tzdata: 2024.1
    - urllib3: 2.2.2
    - wandb: 0.17.7
    - wcwidth: 0.2.13
    - wheel: 0.43.0
    - xformers: 0.0.27.post2
    - zipp: 3.20.1
  • System:
    - OS: Linux
    - architecture:
    - 64bit
    - ELF
    - processor: x86_64
    - python: 3.11.9
    - release: 5.15.0-87-generic
    - version: #97-Ubuntu SMP Mon Oct 2 21:09:21 UTC 2023

More info

No response

ShiweiWu98 added the bug and needs triage labels on Nov 26, 2024