Releases: huggingface/accelerate
v1.2.1: Patchfix
- fix: add `max_memory` to `_init_infer_auto_device_map`'s return statement in #3279 by @Nech-C
- fix `load_state_dict` for npu in #3211 by @statelesshz
Full Changelog: v1.2.0...v1.2.1
v1.2.0: Bug Squashing & Fixes across the board
Core
- enable `find_executable_batch_size` on XPU by @faaany in #3236 (see the sketch after this list)
- Use `numpy._core` instead of `numpy.core` by @qgallouedec in #3247
- Add warnings and fallback for unassigned devices in infer_auto_device_map by @Nech-C in #3066
- Allow for full dynamo config passed to Accelerator by @muellerzr in #3251
- [WIP] FEAT Decorator to purge accelerate env vars by @BenjaminBossan in #3252
- [`data_loader`] Optionally also propagate set_epoch to batch sampler by @tomaarsen in #3246
- use XPU instead of GPU in the `accelerate config` prompt text by @faaany in #3268
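A minimal sketch of how the utility is typically used; `build_dataloader` and `train_one_epoch` are hypothetical stand-ins for your own code:

```python
from accelerate.utils import find_executable_batch_size

# On an out-of-memory error the decorator retries, halving the batch size;
# the decorated function's first argument must be the batch size, which the
# decorator supplies on each attempt.
@find_executable_batch_size(starting_batch_size=128)
def train(batch_size):
    dataloader = build_dataloader(batch_size)  # hypothetical helper
    train_one_epoch(dataloader)                # hypothetical helper

train()  # call with no arguments; batch_size is injected
```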
Big Modeling
- Fix `align_module_device`, ensure only cpu tensors for `get_state_dict_offloaded_model` by @kylesayrs in #3217
- Remove hook for bnb 4-bit by @SunMarc in #3223
- [docs] add instruction to install bnb on non-cuda devices by @faaany in #3227
- Take care of case when "_tied_weights_keys" is not an attribute by @fabianlim in #3226
- Update deferring_execution.md by @max-yue in #3262
- Revert default behavior of `get_state_dict_from_offload` by @kylesayrs in #3253
- Fix: Resolve #3060, `preload_module_classes` is lost for nested modules by @wejoncy in #3248
DeepSpeed
- Select the DeepSpeedCPUOptimizer based on the original optimizer class by @eljandoubi in #3255
- support for wrapped schedulefree optimizer when using deepspeed by @winglian in #3266
Documentation
- Replaced set/check breakpoint with set/check trigger in the troubleshooting documentation by @relh in #3259
- Fixed multiple typos for Tutorials and Guides docs by @henryhmko in #3274
New Contributors
- @winglian made their first contribution in #3266
- @max-yue made their first contribution in #3262
- @as12138 made their first contribution in #3261
- @relh made their first contribution in #3259
- @wejoncy made their first contribution in #3248
- @henryhmko made their first contribution in #3274
Full Changelog
- Fix `align_module_device`, ensure only cpu tensors for `get_state_dict_offloaded_model` by @kylesayrs in #3217
- remove hook for bnb 4-bit by @SunMarc in #3223
- enable `find_executable_batch_size` on XPU by @faaany in #3236
- take care of case when "_tied_weights_keys" is not an attribute by @fabianlim in #3226
- [docs] update code in tracking documentation by @faaany in #3235
- Add warnings and fallback for unassigned devices in infer_auto_device_map by @Nech-C in #3066
- [`data_loader`] Optionally also propagate set_epoch to batch sampler by @tomaarsen in #3246
- [docs] add instruction to install bnb on non-cuda devices by @faaany in #3227
- Use `numpy._core` instead of `numpy.core` by @qgallouedec in #3247
- Allow for full dynamo config passed to Accelerator by @muellerzr in #3251
- [WIP] FEAT Decorator to purge accelerate env vars by @BenjaminBossan in #3252
- use XPU instead of GPU in the `accelerate config` prompt text by @faaany in #3268
- support for wrapped schedulefree optimizer when using deepspeed by @winglian in #3266
- Update deferring_execution.md by @max-yue in #3262
- Fix: Resolve #3257 by @as12138 in #3261
- Replaced set/check breakpoint with set/check trigger in the troubleshooting documentation by @relh in #3259
- Select the DeepSpeedCPUOptimizer based on the original optimizer class by @eljandoubi in #3255
- Revert default behavior of `get_state_dict_from_offload` by @kylesayrs in #3253
- Fix: Resolve #3060, `preload_module_classes` is lost for nested modules by @wejoncy in #3248
- [docs] update set-seed by @faaany in #3228
- [docs] fix typo by @faaany in #3221
- [docs] use real path for `checkpoint` by @faaany in #3220
- Fixed multiple typos for Tutorials and Guides docs by @henryhmko in #3274
Code Diff
Release diff: v1.1.1...v1.2.0
v1.1.0: Python 3.9 minimum, torch dynamo deepspeed support, and bug fixes
Internals:
- Allow for a `data_seed` argument in #3150
- Trigger `weights_only=True` by default for all compatible objects when checkpointing and saving with `torch.save` in #3036
- Handle negative values for `dim` input in `pad_across_processes` in #3114 (see the sketch after this list)
- Enable cpu bnb distributed lora finetune in #3159
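A minimal sketch of the negative-`dim` fix, assuming the script is launched across multiple processes (e.g. with `accelerate launch`):

```python
import torch
from accelerate import Accelerator
from accelerate.utils import pad_across_processes

accelerator = Accelerator()
# Each process holds a tensor whose last dimension differs; `dim=-1` now
# correctly refers to that last dimension when padding to a common shape.
local = torch.ones(2, 3 + accelerator.process_index)
padded = pad_across_processes(local, dim=-1, pad_index=0)
```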
DeepSpeed
- Support torch dynamo for deepspeed>=0.14.4 in #3069
Megatron
- update Megatron-LM plugin code to version 0.8.0 or higher in #3174
Big Model Inference
- New `has_offloaded_params` utility added in #3188 (see the sketch below)
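A minimal sketch of the utility, assuming it is importable from `accelerate.utils`: it reports whether a module's parameters were offloaded by Accelerate's big model inference hooks.

```python
import torch
from accelerate.utils import has_offloaded_params  # assumed import location

# A plain module has no offload hooks attached, so this reports False;
# modules dispatched with offloading would report True.
layer = torch.nn.Linear(4, 4)
print(has_offloaded_params(layer))  # False
```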
Examples
- Florence2 distributed inference example in #3123
Full Changelog
- Handle negative values for `dim` input in `pad_across_processes` by @mariusarvinte in #3114
- Fixup DS issue with weakref by @muellerzr in #3143
- Refactor scaler to util by @muellerzr in #3142
- DS fix, continued by @muellerzr in #3145
- Florence2 distributed inference example by @hlky in #3123
- POC: Allow for a `data_seed` by @muellerzr in #3150
- Adding multi gpu speech generation by @dame-cell in #3149
- support torch dynamo for deepspeed>=0.14.4 by @oraluben in #3069
- Fixup Zero3 + `save_model` by @muellerzr in #3146
- Trigger `weights_only=True` by default for all compatible objects by @muellerzr in #3036
- Remove broken dynamo test by @oraluben in #3155
- fix version check bug in `get_xpu_available_memory` by @faaany in #3165
- enable cpu bnb distributed lora finetune by @jiqing-feng in #3159
- [Utils] `has_offloaded_params` by @kylesayrs in #3188
- fix bnb by @eljandoubi in #3186
- [docs] update neptune API by @faaany in #3181
- docs: fix a wrong word in comment in src/accelerate/accelerate.py:1255 by @Rebornix-zero in #3183
- [docs] use nn.module instead of tensor as model by @faaany in #3157
- Fix typo by @kylesayrs in #3191
- MLU devices: Checks if mlu is available via a cndev-based check which won't trigger the drivers and leave mlu by @huismiling in #3187
- update Megatron-LM plugin code to version 0.8.0 or higher by @eljandoubi in #3174
- 🚨 🚨 🚨 Goodbye Python 3.8! 🚨 🚨 🚨 by @muellerzr in #3194
- Update transformers.deepspeed references from transformers 4.46.0 release by @loadams in #3196
- eliminate dead code by @statelesshz in #3198
- take `torch.nn.Module` model into account when moving to device by @faaany in #3167
- [docs] add xpu part and fix bug in `torchrun` by @faaany in #3166
- Models With Tied Weights Need Re-Tieing After FSDP Param Init by @fabianlim in #3154
- add the missing xpu for local sgd by @faaany in #3163
- typo fix in big_modeling.py by @a-r-r-o-w in #3207
- [Utils] `align_module_device` by @kylesayrs in #3204
New Contributors
- @mariusarvinte made their first contribution in #3114
- @hlky made their first contribution in #3123
- @dame-cell made their first contribution in #3149
- @kylesayrs made their first contribution in #3188
- @eljandoubi made their first contribution in #3186
- @Rebornix-zero made their first contribution in #3183
- @loadams made their first contribution in #3196
Full Changelog: v1.0.1...v1.1.0
v1.0.1: Bugfix
Bugfixes
- Fixes an issue where the `auto` values were no longer being parsed when using deepspeed
- Fixes a broken test in the deepspeed tests related to the `auto` values
Full Changelog: v1.0.0...v1.0.1
Accelerate 1.0.0 is here!
🚀 Accelerate 1.0 🚀
With accelerate 1.0, we are officially stating that the core parts of the API are now "stable" and ready for whatever the future of distributed training and PyTorch holds. With these release notes, we will focus first on the major breaking changes, so you can get your code fixed, followed by what is new specifically between 0.34.0 and 1.0.
To read more, check out our official blog here
Migration assistance
- Passing in `dispatch_batches`, `split_batches`, `even_batches`, and `use_seedable_sampler` to the `Accelerator()` should now be handled by creating an `accelerate.utils.DataLoaderConfiguration()` and passing this to the `Accelerator()` instead (`Accelerator(dataloader_config=DataLoaderConfiguration(...))`) (a combined sketch follows this list)
- `Accelerator().use_fp16` and `AcceleratorState().use_fp16` have been removed; this should be replaced by checking `accelerator.mixed_precision == "fp16"`
- `Accelerator().autocast()` no longer accepts a `cache_enabled` argument. Instead, an `AutocastKwargs()` instance should be used, which handles this flag (among others), passing it to the `Accelerator` (`Accelerator(kwargs_handlers=[AutocastKwargs(cache_enabled=True)])`)
- `accelerate.utils.is_tpu_available` should be replaced with `accelerate.utils.is_torch_xla_available`
- `accelerate.utils.modeling.shard_checkpoint` should be replaced with `split_torch_state_dict_into_shards` from the `huggingface_hub` library
- `accelerate.tqdm.tqdm()` no longer accepts `True`/`False` as the first argument; instead, `main_process_only` should be passed in as a named argument
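A minimal sketch pulling these migrations together (the configuration values are placeholders):

```python
from accelerate import Accelerator
from accelerate.utils import AutocastKwargs, DataLoaderConfiguration

# Dataloader-related arguments now live on DataLoaderConfiguration
dataloader_config = DataLoaderConfiguration(split_batches=True, even_batches=True)

accelerator = Accelerator(
    dataloader_config=dataloader_config,
    # Replaces Accelerator().autocast(cache_enabled=True)
    kwargs_handlers=[AutocastKwargs(cache_enabled=True)],
)

# Replaces the removed Accelerator().use_fp16 property
is_fp16 = accelerator.mixed_precision == "fp16"
```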
Multiple Model DeepSpeed Support
After many requests, we finally have multiple-model DeepSpeed support in Accelerate (though it is still quite early). Read the full tutorial here; essentially:
When using multiple models, a DeepSpeed plugin should be created for each model (and as a result, a separate config). A few examples are below:
Knowledge distillation
(Where we train only one model (ZeRO-2) and use another for inference (ZeRO-3))
```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

zero2_plugin = DeepSpeedPlugin(hf_ds_config="zero2_config.json")
zero3_plugin = DeepSpeedPlugin(hf_ds_config="zero3_config.json")

deepspeed_plugins = {"student": zero2_plugin, "teacher": zero3_plugin}
accelerator = Accelerator(deepspeed_plugins=deepspeed_plugins)
```
To select which plugin should be used at a given time (i.e. when calling `prepare`), we call `accelerator.state.select_deepspeed_plugin("name")`, where the first plugin is active by default:
```python
accelerator.state.select_deepspeed_plugin("student")
student_model, optimizer, scheduler = ...
student_model, optimizer, scheduler, train_dataloader = accelerator.prepare(
    student_model, optimizer, scheduler, train_dataloader
)

accelerator.state.select_deepspeed_plugin("teacher")  # This will automatically enable zero init
teacher_model = AutoModel.from_pretrained(...)
teacher_model = accelerator.prepare(teacher_model)
```
Multiple disjoint models
For disjoint models, separate accelerators should be used for each model, and their own `.backward()` should be called later:
```python
for batch in dl:
    outputs1 = first_model(**batch)
    first_accelerator.backward(outputs1.loss)
    first_optimizer.step()
    first_scheduler.step()
    first_optimizer.zero_grad()

    outputs2 = second_model(**batch)
    second_accelerator.backward(outputs2.loss)
    second_optimizer.step()
    second_scheduler.step()
    second_optimizer.zero_grad()
```
FP8
We've enabled MS-AMP support up to FSDP. At this time we are not going forward with implementing FSDP support with MS-AMP, due to design issues between the two libraries that prevent them from interoperating easily.
FSDP
- Fixed FSDP auto_wrap using characters instead of full str for layers
- Re-enable setting state dict type manually
Big Modeling
- Removed cpu restriction for bnb training
What's Changed
- Fix FSDP auto_wrap using characters instead of full str for layers by @muellerzr in #3075
- Allow DataLoaderAdapter subclasses to be pickled by implementing `__reduce__` by @byi8220 in #3074
- Fix three typos in src/accelerate/data_loader.py by @xiabingquan in #3082
- Re-enable setting state dict type by @muellerzr in #3084
- Support sequential cpu offloading with torchao quantized tensors by @a-r-r-o-w in #3085
- fix bug in `_get_named_modules` by @faaany in #3052
- use the correct available memory API for XPU by @faaany in #3076
- fix `skip_keys` usage in forward hooks by @152334H in #3088
- Update README.md to include distributed image generation gist by @sayakpaul in #3077
- MAINT: Upgrade ruff to v0.6.4 by @BenjaminBossan in #3095
- Revert "Enable Unwrapping for Model State Dicts (FSDP)" by @SunMarc in #3096
- MS-AMP support (w/o FSDP) by @muellerzr in #3093
- [docs] DataLoaderConfiguration docstring by @stevhliu in #3103
- MAINT: Permission for GH token in stale.yml by @BenjaminBossan in #3102
- [docs] Doc sprint by @stevhliu in #3099
- Update image ref for docs by @muellerzr in #3105
- No more t5 by @muellerzr in #3107
- [docs] More docstrings by @stevhliu in #3108
- 🚨🚨🚨 The Great Deprecation 🚨🚨🚨 by @muellerzr in #3098
- POC: multiple model/configuration DeepSpeed support by @muellerzr in #3097
- Fixup test_sync w/ deprecated stuff by @muellerzr in #3109
- Switch to XLA instead of TPU by @SunMarc in #3118
- [tests] skip pippy tests for XPU by @faaany in #3119
- Fixup multiple model DS tests by @muellerzr in #3131
- remove cpu restriction for bnb training by @jiqing-feng in #3062
- fix deprecated `torch.cuda.amp.GradScaler` FutureWarning for pytorch 2.4+ by @Mon-ius in #3132
- 🐛 [HotFix] Handle Profiler Activities Based on PyTorch Version by @yhna940 in #3136
- only move model to device when model is in cpu and target device is xpu by @faaany in #3133
- fix tip brackets typo by @davanstrien in #3129
- typo of "scalar" instead of "scaler" by @tonyzhaozh in #3116
- MNT Permission for PRs for GH token in stale.yml by @BenjaminBossan in #3112
New Contributors
- @xiabingquan made their first contribution in #3082
- @a-r-r-o-w made their first contribution in #3085
- @152334H made their first contribution in #3088
- @sayakpaul made their first contribution in #3077
- @Mon-ius made their first contribution in #3132
- @davanstrien made their first contribution in #3129
- @tonyzhaozh made their first contribution in #3116
Full Changelog: v0.34.2...v1.0.0
v0.34.1 Patchfix
Bug fixes
- Fixes an issue where processed `DataLoaders` could no longer be pickled in #3074 thanks to @byi8220
- Fixes an issue when using FSDP where `default_transformers_cls_names_to_wrap` would separate `_no_split_modules` by characters instead of keeping it as a list of layer names in #3075
Full Changelog: v0.34.0...v0.34.1
v0.34.0: StatefulDataLoader Support, FP8 Improvements, and PyTorch Updates!
Dependency Changes
- Updated Safetensors Requirement: The library now requires `safetensors` version 0.4.3.
- Added support for Numpy 2.0: The library now fully supports `numpy` 2.0.0
Core
New Script Behavior Changes
- Process Group Management: PyTorch now requires users to destroy process groups after training. The `accelerate` library will handle this automatically with `accelerator.end_training()`, or you can do it manually using `PartialState().destroy_process_group()` (a short sketch follows this list).
- MLU Device Support: Added support for saving and loading RNG states on MLU devices by @huismiling
- NPU Support: Corrected backend and distributed settings when using `transfer_to_npu`, ensuring better performance and compatibility.
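A short sketch of the two teardown options described above:

```python
from accelerate import Accelerator, PartialState

accelerator = Accelerator()
# ... train ...
accelerator.end_training()  # also destroys the process group for you

# Or, when managing teardown manually:
# PartialState().destroy_process_group()
```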
DataLoader Enhancements
- Stateful DataLoader: We are excited to announce that early support has been added for the `StatefulDataLoader` from `torchdata`, allowing better handling of data loading states. Enable it by passing `use_stateful_dataloader=True` to the `DataLoaderConfiguration`, and when calling `load_state()` the `DataLoader` will automatically be resumed from its last step, no more having to iterate through passed batches (a short sketch follows this list).
- Decoupled Data Loader Preparation: The `prepare_data_loader()` function is now independent of the `Accelerator`, giving you more flexibility towards which API levels you would like to use.
- XLA Compatibility: Added support for skipping initial batches when using XLA.
- Improved State Management: Bug fixes and enhancements for saving/loading `DataLoader` states, ensuring smoother training sessions.
- Epoch Setting: Introduced the `set_epoch` function for `MpDeviceLoaderWrapper`.
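A minimal sketch of enabling the stateful dataloader, assuming `torchdata` (with `StatefulDataLoader`) is installed:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from accelerate.utils import DataLoaderConfiguration

dataloader_config = DataLoaderConfiguration(use_stateful_dataloader=True)
accelerator = Accelerator(dataloader_config=dataloader_config)

dataset = TensorDataset(torch.randn(64, 4))
dataloader = accelerator.prepare(DataLoader(dataset, batch_size=8))

# ... train, checkpointing with accelerator.save_state("ckpt") ...
# Calling accelerator.load_state("ckpt") later resumes the dataloader
# from its last step rather than replaying already-seen batches.
```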
FP8 Training Improvements
- Enhanced FP8 Training: Fully Sharded Data Parallelism (FSDP) and DeepSpeed support now work seamlessly with `TransformerEngine` FP8 training, including better defaults for the quantized FP8 weights.
- Integration baseline: We've added a new suite of examples and benchmarks to ensure that our `TransformerEngine` integration works exactly as intended. These scripts run one half using 🤗 Accelerate's integration, the other with raw `TransformerEngine`, providing users with a nice example of what we do under the hood with accelerate, and a good sanity check to make sure nothing breaks down over time. Find them here
- Import Fixes: Resolved issues with import checks for the Transformer Engine that had downstream issues.
- FP8 Docker Images: We've added new docker images for `TransformerEngine` and `accelerate` as well. Use `docker pull huggingface/accelerate@gpu-fp8-transformerengine` to quickly get an environment going.
`torchpippy` no more, long live `torch.distributed.pipelining`
- With the latest PyTorch release, `torchpippy` is now fully integrated into torch core, and as a result we are exclusively supporting the PyTorch implementation from now on
- There are breaking examples and changes that come from this shift. Namely:
  - Tracing of inputs is done with the shape each GPU will see, rather than the size of the total batch. So for 2 GPUs, one should pass in an input of `[1, n, n]` rather than `[2, n, n]` as before.
  - We no longer support Encoder/Decoder models. PyTorch tracing for `pipelining` no longer supports encoder/decoder models, so the `t5` example has been removed.
  - Computer vision model support currently does not work: there are some tracing issues regarding resnets we are actively looking into.
- If either of these changes is too breaking, we recommend pinning your accelerate version. If the encoder/decoder model support is actively blocking your inference using pippy, please open an issue and let us know. We can look towards potentially adding back the old `torchpippy` support if needed.
Fully Sharded Data Parallelism (FSDP)
- Environment Flexibility: Environment variables are now fully optional for FSDP, simplifying configuration. You can now fully create a `FullyShardedDataParallelPlugin` yourself manually with no need for environment patching:
```python
from accelerate import FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(...)
```
- FSDP RAM efficient loading: Added a utility to enable RAM-efficient model loading (by setting the proper environment variable). This is generally needed if not using `accelerate launch` and you need to ensure the env variables are set up properly for model loading:
```python
from accelerate.utils import enable_fsdp_ram_efficient_loading, disable_fsdp_ram_efficient_loading

enable_fsdp_ram_efficient_loading()
```
- Model State Dict Management: Enhanced support for unwrapping model state dicts in FSDP, making it easier to manage distributed models.
New Examples
- Configuration and Models: Improved configuration handling and introduced a configuration zoo for easier experimentation. You can learn more here. This was largely inspired by the `axolotl` library, so very big kudos to their wonderful work
- FSDP + SLURM Example: Added a minimal configuration example for running jobs with SLURM and using FSDP
Bug Fixes
- Fix bug of clip_grad_norm_ for xla fsdp by @hanwen-sun in #2941
- Explicit check for `step` when loading the state by @muellerzr in #2992
- Fix `find_tied_params` for models with shared layers by @qubvel in #2986
- clear memory after offload by @SunMarc in #2994
- fix default value for rank size in cpu threads_per_process assignment logic by @rbrugaro in #3009
- Fix batch_sampler maybe None error by @candlewill in #3025
- Do not import `transformer_engine` on import by @oraluben in #3056
- Fix torchvision to be compatible with torch version in CI by @SunMarc in #2982
- Fix gated test by @muellerzr in #2993
- Fix typo on warning str: "on the meta device device" -> "on the meta device" by @HeAndres in #2997
- Fix deepspeed tests by @muellerzr in #3003
- Fix torch version check by @muellerzr in #3024
- Fix fp8 benchmark on single GPU by @muellerzr in #3032
- Fix typo in comment by @zmoki688 in #3045
- Speed up tests by shaving off subprocess when not needed by @muellerzr in #3042
- Remove `skip_first_batches` support for StatefulDataloader and fix all the tests by @muellerzr in #3068
New Contributors
- @byi8220 made their first contribution in #2957
- @alex-jw-brooks made their first contribution in #2959
- @XciD made their first contribution in #2981
- @hanwen-sun made their first contribution in #2941
- @HeAndres made their first contribution in #2997
- @yitongh made their first contribution in #2966
- @qubvel made their first contribution in #2986
- @rbrugaro made their first contribution in #3009
- @candlewill made their first contribution in #3025
- @siddk made their first contribution in #3047
- @oraluben made their first contribution in #3056
- @tmm1 made their first contribution in #3055
- @zmoki688 made their first contribution in #3045
Full Changelog:
- Require safetensors>=0.4.3 by @byi8220 in #2957
- Fix torchvision to be compatible with torch version in CI by @SunMarc in #2982
- Enable Unwrapping for Model State Dicts (FSDP) by @alex-jw-brooks in #2959
- chore: Update runs-on configuration for CI workflows by @XciD in #2981
- add MLU devices for rng state saving and loading. by @huismiling in #2940
- remove .md to allow proper linking by @nbroad1881 in #2977
- Fix bug of clip_grad_norm_ for xla fsdp by @hanwen-sun in #2941
- Fix gated test by @muellerzr in #2993
- Explicit check for `step` when loading the state by @muellerzr in #2992
- Fix typo on warning str: "on the meta device device" -> "on the meta device" by @HeAndres in #2997
- Support skip_first_batches for XLA by @yitongh in #2966
- clear memory aft...
v0.33.0: MUSA backend support and bugfixes
MUSA backend support and bugfixes
Small release this month, with key focuses on some added support for backends and bugs:
- Support MUSA (Moore Threads GPU) backend in accelerate by @fmo-mt in #2917
- Allow multiple process per device by @cifkao in #2916
- Add `torch.float8_e4m3fn` format `dtype_byte_size` by @SunMarc in #2945
- Properly handle Params4bit in set_module_tensor_to_device by @matthewdouglas in #2934
What's Changed
- [tests] fix bug in torch_device by @faaany in #2909
- Fix slowdown on init with `device_map="auto"` by @muellerzr in #2914
- fix: bug where `multi_gpu` was being set and warning being printed even with `num_processes=1` by @HarikrishnanBalagopal in #2921
- Better error when a bad directory is given for weight merging by @muellerzr in #2852
- add xpu device check before moving tensor directly to xpu device by @faaany in #2928
- Add huggingface_hub version to setup.py by @nullquant in #2932
- Correct loading of models with shared tensors when using accelerator.load_state() by @jkuntzer in #2875
- Hotfix PyTorch Version Installation in CI Workflow for Minimum Version Matrix by @yhna940 in #2889
- Fix import test by @muellerzr in #2931
- Consider pynvml available when installed through the nvidia-ml-py distribution by @matthewdouglas in #2936
- Improve test reliability for Accelerator.free_memory() by @matthewdouglas in #2935
- delete CCL env var setting by @Liangliang-Ma in #2927
- feat(ci): add `pip` caching in CI by @SauravMaheshkar in #2952
New Contributors
- @HarikrishnanBalagopal made their first contribution in #2921
- @fmo-mt made their first contribution in #2917
- @nullquant made their first contribution in #2932
- @cifkao made their first contribution in #2916
- @jkuntzer made their first contribution in #2875
- @matthewdouglas made their first contribution in #2936
- @Liangliang-Ma made their first contribution in #2927
- @SauravMaheshkar made their first contribution in #2952
Full Changelog: v0.32.1...v0.33.0
v0.32.0: Profilers, new hooks, speedups, and more!
Core
- Utilize shard saving from the `huggingface_hub` rather than our own implementation (#2795)
- Refactor logging to use logger in `dispatch_model` (#2855)
- The `Accelerator.step` number is now restored when using `save_state` and `load_state` (#2765)
- A new profiler has been added allowing users to collect performance metrics during model training and inference, including detailed analysis of execution time and memory consumption. The resulting traces can then be viewed in Chrome's tracing tool. Read more about it here (#2883) (a short sketch follows this list)
- Reduced import times for doing `import accelerate` and any other major core import by 68%; it should now be only slightly longer than doing `import torch` (#2845)
- Fixed a bug in `get_backend` and added a `clear_device_cache` utility (#2857)
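A minimal sketch of the profiler, assuming the `ProfileKwargs` handler API added in #2883:

```python
import torch
from accelerate import Accelerator
from accelerate.utils import ProfileKwargs

# Collect CPU activity (add "cuda" on GPU machines) and record input shapes
profile_kwargs = ProfileKwargs(activities=["cpu"], record_shapes=True)
accelerator = Accelerator(kwargs_handlers=[profile_kwargs])

model = accelerator.prepare(torch.nn.Linear(8, 8))
with accelerator.profile() as prof:
    model(torch.randn(2, 8))

# Summarize the collected stats; traces can also be exported for Chrome's tracing tool
print(prof.key_averages().table(sort_by="cpu_time_total"))
```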
Distributed Data Parallelism
- Introduce DDP communication hooks to have more flexibility in how gradients are communicated across workers, overriding the standard `allreduce` (#2841) (see the sketch after this list)
- Make `log_line_prefix_template` optional for the `notebook_launcher` (#2888)
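A minimal sketch, assuming the kwargs-handler API for communication hooks (here compressing gradients to fp16 during the `allreduce`):

```python
from accelerate import Accelerator, DistributedDataParallelKwargs
from accelerate.utils import DDPCommunicationHookType

ddp_kwargs = DistributedDataParallelKwargs(comm_hook=DDPCommunicationHookType.FP16)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])
# model = accelerator.prepare(model)  # the hook is registered during prepare
```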
FSDP
- If the output directory doesn't exist when using `accelerate merge-weights`, one will be automatically created (#2854)
- When merging weights, the default is now `.safetensors` (#2853)
XPU
- Migrate to pytorch's native XPU backend on `torch>=2.4` (#2825)
- Add `@require_triton` test decorator and enable `test_dynamo` work on xpu (#2878)
- Fixed `load_state_dict` not working on `xpu` and refine xpu `safetensors` version check (#2879)
XLA
- Added support for XLA Dynamo backends for both training and inference (#2892)
Examples
- Added a new multi-cpu SLURM example using `accelerate launch` (#2902)
Full Changelog
- Use shard saving from huggingface_hub by @SunMarc in #2795
- doc: fix link by @imba-tjd in #2844
- Revert "Slight rename" by @SunMarc in #2850
- remove warning hook added during dispatch_model by @SunMarc in #2843
- Remove underlines between badges by @novialriptide in #2851
- Auto create dir when merging FSDP weights by @helloworld1 in #2854
- Add DDP Communication Hooks by @yhna940 in #2841
- Refactor logging to use logger in `dispatch_model` by @panjd123 in #2855
- xpu: support xpu backend from stock pytorch (>=2.4) by @dvrogozh in #2825
- Drop torch re-imports in npu and mlu paths by @dvrogozh in #2856
- Default FSDP weights merge to safetensors by @helloworld1 in #2853
- [tests] fix bug in `test_tracking.ClearMLTest` by @faaany in #2863
- [tests] use `torch_device` instead of `0` for device check by @faaany in #2861
- [tests] skip bnb-related tests instead of failing on xpu by @faaany in #2860
- Potentially fix tests by @muellerzr in #2862
- [tests] enable XPU backend for `test_zero3_integration` by @faaany in #2864
- Support saving and loading of step while saving and loading state by @bipinKrishnan in #2765
- Add Profiler Support for Performance Analysis by @yhna940 in #2883
- Speed up imports and add a CI by @muellerzr in #2845
- Make `log_line_prefix_template` Optional in Elastic Launcher for Backward Compatibility by @yhna940 in #2888
- Add XLA Dynamo backends for training and inference by @johnsutor in #2892
- Added a MultiCPU SLURM example using Accelerate Launch and MPIRun by @okhleif-IL in #2902
- make more cuda-only tests device-agnostic by @faaany in #2876
- fix mlu device longTensor bugs by @huismiling in #2887
- add `require_triton` and enable `test_dynamo` work on xpu by @faaany in #2878
- fix `load_state_dict` for xpu and refine xpu safetensor version check by @faaany in #2879
- Fix get_backend bug and add clear_device_cache function by @NurmaU in #2857
New Contributors
- @McPatate made their first contribution in #2836
- @imba-tjd made their first contribution in #2844
- @novialriptide made their first contribution in #2851
- @panjd123 made their first contribution in #2855
- @dvrogozh made their first contribution in #2825
- @johnsutor made their first contribution in #2892
- @okhleif-IL made their first contribution in #2902
- @NurmaU made their first contribution in #2857
Full Changelog: v0.31.0...v0.32.0
v0.31.0: Better support for sharded state dict with FSDP and Bugfixes
Core
- Set `timeout` default to PyTorch defaults based on backend by @muellerzr in #2758
- fix duplicate elements in split_between_processes by @hkunzhe in #2781
- Add Elastic Launch Support to `notebook_launcher` by @yhna940 in #2788
- Fix Wrong use of sync_gradients used to implement sync_each_batch by @fabianlim in #2790
FSDP
- Introduce shard-merging util for FSDP by @muellerzr in #2772
- Enable sharded state dict + offload to cpu resume by @muellerzr in #2762
- Enable config for fsdp activation checkpointing by @helloworld1 in #2779
Megatron
- Upgrade huggingface's megatron to nvidia's megatron when use MegatronLMPlugin by @zhangsheng377 in #2501
What's Changed
- Add feature to allow redirecting std streams into log files when using torchrun as the launcher. by @lyuwen in #2740
- Update modeling.py by adding try-catch section to skip the unavailable devices by @MeVeryHandsome in #2681
- Fixed the problem of incorrect conditional judgment statement when configuring enable_cpu_affinity by @statelesshz in #2748
- Fix stacklevel in `logging` to log the actual user call site (instead of the call site inside the logger wrapper) of log functions by @luowyang in #2730
- LOMO / FIX: Support multiple optimizers by @younesbelkada in #2745
- Fix max_memory assignment by @SunMarc in #2751
- Fix duplicate environment variable check in multi-cpu condition by @yhna940 in #2752
- Simplify CLI args validation and ensure CLI args take precedence over config file. by @Iain-S in #2757
- Fix sagemaker config by @muellerzr in #2753
- fix cpu omp num threads set by @jiqing-feng in #2755
- Revert "Simplify CLI args validation and ensure CLI args take precedence over config file." by @muellerzr in #2763
- Enable sharded cpu resume by @muellerzr in #2762
- Sets default to PyTorch defaults based on backend by @muellerzr in #2758
- optimize get_module_leaves speed by @BBuf in #2756
- fix minor typo by @TemryL in #2767
- Fix small edge case in get_module_leaves by @SunMarc in #2774
- Skip deepspeed test by @SunMarc in #2776
- Enable config for fsdp activation checkpointing by @helloworld1 in #2779
- Add arg from CLI to fix failing test by @muellerzr in #2783
- Skip tied weights disk offload test by @SunMarc in #2782
- Introduce shard-merging util for FSDP by @muellerzr in #2772
- FIX / FSDP : Guard fsdp utils for earlier PyTorch versions by @younesbelkada in #2794
- Upgrade huggingface's megatron to nvidia's megatron when use MegatronLMPlugin by @zhangsheng377 in #2501
- Fixup CLI test by @muellerzr in #2796
- fix duplicate elements in split_between_processes by @hkunzhe in #2781
- Add Elastic Launch Support to `notebook_launcher` by @yhna940 in #2788
- Fix Wrong use of sync_gradients used to implement sync_each_batch by @fabianlim in #2790
- Fix type in accelerator.py by @qgallouedec in #2800
- fix comet ml test by @SunMarc in #2804
- New template by @muellerzr in #2808
- Fix access error for torch.mps when using torch==1.13.1 on macOS by @SunMarc in #2806
- 4-bit quantization meta device bias loading bug by @SunMarc in #2805
- State dictionary retrieval from offloaded modules by @blbadger in #2619
- add cuda dep for a test by @SunMarc in #2820
- Remove out-dated xpu device check code in `get_balanced_memory` by @faaany in #2826
- Fix DeepSpeed config validation error by changing `stage3_prefetch_bucket_size` value to an integer by @adk9 in #2814
- Improve test speeds by up to 30% in multi-gpu settings by @muellerzr in #2830
- monitor-interval, take 2 by @muellerzr in #2833
- Optimize the megatron plugin by @zhangsheng377 in #2822
- fix fstr format by @Jintao-Huang in #2810
New Contributors
- @lyuwen made their first contribution in #2740
- @MeVeryHandsome made their first contribution in #2681
- @luowyang made their first contribution in #2730
- @Iain-S made their first contribution in #2757
- @BBuf made their first contribution in #2756
- @TemryL made their first contribution in #2767
- @helloworld1 made their first contribution in #2779
- @hkunzhe made their first contribution in #2781
- @adk9 made their first contribution in #2814
- @Jintao-Huang made their first contribution in #2810
Full Changelog: v0.30.1...v0.31.0