v0.24.0: Improved Reproducibility, Bug fixes, and other Small Improvements
Improved Reproducibility
One critical issue with Accelerate was that training runs using an iterable dataset would differ from run to run, no matter what seeds were set. v0.24.0 introduces the `dataloader.set_epoch()` function on all Accelerate `DataLoader`s: if the underlying dataset (or sampler) supports setting the epoch for reproducibility, it will do so. This mirrors the implementation that already exists in transformers. To use:
```python
dataloader = accelerator.prepare(dataloader)
# Say we want to resume at epoch/iteration 2
dataloader.set_epoch(2)
```
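In a full training loop this would typically be called once per epoch, so that a resumed run reshuffles the data the same way the original run did. A minimal sketch, assuming `model`, `dataloader`, and `num_epochs` are defined elsewhere:

```python
from accelerate import Accelerator

accelerator = Accelerator()
model, dataloader = accelerator.prepare(model, dataloader)

starting_epoch = 2  # hypothetical resume point
for epoch in range(starting_epoch, num_epochs):
    # Propagates the epoch to the underlying dataset/sampler if it supports it,
    # keeping shuffling reproducible across resumed runs
    dataloader.set_epoch(epoch)
    for batch in dataloader:
        ...
```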
For more information see this PR; the docs will be updated with more details on this API in a subsequent release.
Documentation
- The quick tour docs have gotten a complete makeover thanks to @MKhalusova. Take a look here
- We also now have documentation on how to perform multi-node training; see the launch docs
Internal structure
- Shared file systems are now supported under `save` and `save_state` via the `ProjectConfiguration` dataclass. See #1953 for more info.
- FSDP can now be used for `bfloat16` mixed precision via `torch.autocast` (see the sketch after this list)
- `all_gather_into_tensor` is now used as the main gather operation, reducing memory in the case of big tensors
- Specifying `drop_last=True` will now properly have the desired effect when performing `Accelerator().gather_for_metrics()` (see the sketch after this list)
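For the FSDP bullet above, a minimal sketch of how bf16 mixed precision might be enabled, assuming `model` and `optimizer` are defined elsewhere (the plugin is left at its defaults here and would normally be configured for your setup):

```python
from accelerate import Accelerator, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin()
accelerator = Accelerator(mixed_precision="bf16", fsdp_plugin=fsdp_plugin)

# Objects prepared through this accelerator run their forward passes under
# torch.autocast with bfloat16 when FSDP is enabled
model, optimizer = accelerator.prepare(model, optimizer)
```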
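And for the `drop_last` bullet, a small self-contained sketch (the dataset size and batch size are arbitrary, chosen only so the final batch is incomplete):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

# 103 samples with batch_size=16 and drop_last=True: the incomplete final batch is dropped
dataset = TensorDataset(torch.arange(103, dtype=torch.float32))
dataloader = accelerator.prepare(DataLoader(dataset, batch_size=16, drop_last=True))

gathered = []
for (batch,) in dataloader:
    # gather_for_metrics collects results across processes and now takes
    # drop_last into account when trimming duplicated samples
    gathered.append(accelerator.gather_for_metrics(batch))
results = torch.cat(gathered)
```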
What's Changed
- Update big_modeling.md by @kli-casia in #1976
- Fix model copy after `dispatch_model` by @austinapatel in #1971
- FIX: Automatic checkpoint path inference issue by @BenjaminBossan in #1989
- Fix skip first batch for deepspeed example by @SumanthRH in #2001
- [docs] Quick tour refactor by @MKhalusova in #2008
- Add basic documentation for multi node training by @SumanthRH in #1988
- update torch_dynamo backends by @SunMarc in #1992
- Sync states for xpu fsdp by @abhilash1910 in #2005
- update fsdp docs by @pacman100 in #2026
- Enable shared file system with `save` and `save_state` via `ProjectConfiguration` by @muellerzr in #1953
- Fix save on each node by @muellerzr in #2036
- Allow FSDP to use with `torch.autocast` for bfloat16 mixed precision by @brcps12 in #2033
- Fix DeepSpeed version to <0.11 by @BenjaminBossan in #2043
- Unpin deepspeed by @muellerzr in #2044
- Reduce memory by using `all_gather_into_tensor` by @muellerzr in #1968
- Safely end training even if trackers weren't initialized by @Ben-Epstein in #1994
- Fix integration CI by @muellerzr in #2047
- Make fsdp ram efficient loading optional by @pacman100 in #2037
- Let drop_last modify `gather_for_metrics` by @muellerzr in #2048
- fix docstring by @zhangsibo1129 in #2053
- Fix stalebot by @muellerzr in #2052
- Add space to docs by @muellerzr in #2055
- Fix the error when the "train_batch_size" is absent in DeepSpeed config by @LZHgrla in #2060
- remove unused constants by @statelesshz in #2045
- fix: remove useless token by @rtrompier in #2069
- DOC: Fix broken link to designing a device map by @BenjaminBossan in #2073
- Let iterable dataset shard have a length if implemented by @muellerzr in #2066
- Allow for samplers to be seedable and reproducable by @muellerzr in #2057
- Fix docstring typo by @qgallouedec in #2072
- Warn when kernel version is too low on Linux by @BenjaminBossan in #2077
New Contributors
- @kli-casia made their first contribution in #1976
- @MKhalusova made their first contribution in #2008
- @brcps12 made their first contribution in #2033
- @Ben-Epstein made their first contribution in #1994
- @zhangsibo1129 made their first contribution in #2053
- @LZHgrla made their first contribution in #2060
- @rtrompier made their first contribution in #2069
- @qgallouedec made their first contribution in #2072
Full Changelog: v0.23.0...v0.24.0