v0.32.0: Profilers, new hooks, speedups, and more!

muellerzr released this 03 Jul 17:44
· 154 commits to main since this release

Core

  • Utilize shard saving from the huggingface_hub rather than our own implementation (#2795)
  • Refactor logging to use logger in dispatch_model (#2855)
  • The Accelerator.step number is now restored when using save_state and load_state (#2765)
  • A new profiler has been added, allowing users to collect performance metrics during model training and inference, including detailed analysis of execution time and memory consumption. The resulting traces can then be viewed in Chrome's tracing tool (#2883)
  • Reduced import times for import accelerate and other major core imports by 68%; importing should now take only slightly longer than import torch (#2845)
  • Fixed a bug in get_backend and added a clear_device_cache utility (#2857)

Distributed Data Parallelism

  • Introduce DDP communication hooks to have more flexibility in how gradients are communicated across workers, overriding the standard allreduce. (#2841)
  • Make log_line_prefix_template optional in the notebook_launcher (#2888)

FSDP

  • If the output directory doesn't exist when using accelerate merge-weights, one will be automatically created (#2854)
  • When merging weights, the default is now .safetensors (#2853)

XPU

  • Migrate to PyTorch's native XPU backend on torch>=2.4 (#2825)
  • Add @require_triton test decorator and enable test_dynamo to work on XPU (#2878)
  • Fixed load_state_dict not working on XPU and refined the XPU safetensors version check (#2879)

XLA

  • Added support for XLA Dynamo backends for both training and inference (#2892)

Examples

  • Added a new multi-cpu SLURM example using accelerate launch (#2902)

Full Changelog: v0.31.0...v0.32.0