v0.24.0: Improved Reproducibility, Bug fixes, and other Small Improvements
Improved Reproducibility
One critical issue with Accelerate was that training runs using an iterable dataset would differ from run to run, no matter what seeds were set. v0.24.0 introduces the `dataloader.set_epoch()` function on all Accelerate `DataLoader`s: if the underlying dataset (or sampler) supports setting the epoch for reproducibility, it will do so. This mirrors the implementation that already exists in transformers. To use:
```python
dataloader = accelerator.prepare(dataloader)
# Say we want to resume at epoch/iteration 2
dataloader.set_epoch(2)
```
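In a full training loop this would typically be called once per epoch, so that a resumed run reshuffles the data the same way the original run did. A minimal sketch, assuming `model`, `dataloader`, and `num_epochs` are defined elsewhere:

```python
from accelerate import Accelerator

accelerator = Accelerator()
model, dataloader = accelerator.prepare(model, dataloader)

starting_epoch = 2  # hypothetical resume point
for epoch in range(starting_epoch, num_epochs):
    # Propagates the epoch to the underlying dataset/sampler if it supports it,
    # keeping shuffling reproducible across resumed runs
    dataloader.set_epoch(epoch)
    for batch in dataloader:
        ...
```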
For more information see this PR; the docs will be updated with more details on this API in a subsequent release.
Documentation
- The quick tour docs have gotten a complete makeover thanks to @MKhalusova. Take a look here
- We also now have documentation on how to perform multi-node training; see the launch docs
Internal structure
- Shared file systems are now supported under `save` and `save_state` via the `ProjectConfiguration` dataclass. See #1953 for more info.
- FSDP can now be used for `bfloat16` mixed precision via `torch.autocast` (see the sketch after this list)
- `all_gather_into_tensor` is now used as the main gather operation, reducing memory in the case of big tensors
- Specifying `drop_last=True` will now properly have the desired effect when performing `Accelerator().gather_for_metrics()` (see the sketch after this list)
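For the FSDP bullet above, a minimal sketch of how bf16 mixed precision might be enabled, assuming `model` and `optimizer` are defined elsewhere (the plugin is left at its defaults here and would normally be configured for your setup):

```python
from accelerate import Accelerator, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin()
accelerator = Accelerator(mixed_precision="bf16", fsdp_plugin=fsdp_plugin)

# Objects prepared through this accelerator run their forward passes under
# torch.autocast with bfloat16 when FSDP is enabled
model, optimizer = accelerator.prepare(model, optimizer)
```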
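And for the `drop_last` bullet, a small self-contained sketch (the dataset size and batch size are arbitrary, chosen only so the final batch is incomplete):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

# 103 samples with batch_size=16 and drop_last=True: the incomplete final batch is dropped
dataset = TensorDataset(torch.arange(103, dtype=torch.float32))
dataloader = accelerator.prepare(DataLoader(dataset, batch_size=16, drop_last=True))

gathered = []
for (batch,) in dataloader:
    # gather_for_metrics collects results across processes and now takes
    # drop_last into account when trimming duplicated samples
    gathered.append(accelerator.gather_for_metrics(batch))
results = torch.cat(gathered)
```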
What's Changed
- Update big_modeling.md by @kli-casia in #1976
- Fix model copy after `dispatch_model` by @austinapatel in #1971
- FIX: Automatic checkpoint path inference issue by @BenjaminBossan in #1989
- Fix skip first batch for deepspeed example by @SumanthRH in #2001
- [docs] Quick tour refactor by @MKhalusova in #2008
- Add basic documentation for multi node training by @SumanthRH in #1988
- update torch_dynamo backends by @SunMarc in #1992
- Sync states for xpu fsdp by @abhilash1910 in #2005
- update fsdp docs by @pacman100 in #2026
- Enable shared file system with `save` and `save_state` via `ProjectConfiguration` by @muellerzr in #1953
- Fix save on each node by @muellerzr in #2036
- Allow FSDP to use with `torch.autocast` for bfloat16 mixed precision by @brcps12 in #2033
- Fix DeepSpeed version to <0.11 by @BenjaminBossan in #2043
- Unpin deepspeed by @muellerzr in #2044
- Reduce memory by using `all_gather_into_tensor` by @muellerzr in #1968
- Safely end training even if trackers weren't initialized by @Ben-Epstein in #1994
- Fix integration CI by @muellerzr in #2047
- Make fsdp ram efficient loading optional by @pacman100 in #2037
- Let drop_last modify `gather_for_metrics` by @muellerzr in #2048
- fix docstring by @zhangsibo1129 in #2053
- Fix stalebot by @muellerzr in #2052
- Add space to docs by @muellerzr in #2055
- Fix the error when the "train_batch_size" is absent in DeepSpeed config by @LZHgrla in #2060
- remove unused constants by @statelesshz in #2045
- fix: remove useless token by @rtrompier in #2069
- DOC: Fix broken link to designing a device map by @BenjaminBossan in #2073
- Let iterable dataset shard have a length if implemented by @muellerzr in #2066
- Allow for samplers to be seedable and reproducable by @muellerzr in #2057
- Fix docstring typo by @qgallouedec in #2072
- Warn when kernel version is too low on Linux by @BenjaminBossan in #2077
New Contributors
- @kli-casia made their first contribution in #1976
- @MKhalusova made their first contribution in #2008
- @brcps12 made their first contribution in #2033
- @Ben-Epstein made their first contribution in #1994
- @zhangsibo1129 made their first contribution in #2053
- @LZHgrla made their first contribution in #2060
- @rtrompier made their first contribution in #2069
- @qgallouedec made their first contribution in #2072
Full Changelog: v0.23.0...v0.24.0