Releases: huggingface/accelerate
v0.10.0: DeepSpeed integration revamp and TPU speedup
This release adds two major new features: the DeepSpeed integration has been revamped to match the one in Transformers Trainer, with multiple new options unlocked, and the TPU integration has been sped up.
This version also officially stops supporting Python 3.6 and requires Python 3.7+.
DeepSpeed integration revamp
Users can now specify a DeepSpeed config file when they want to use DeepSpeed, which unlocks many new options. More details in the new documentation.
- Migrate HFDeepSpeedConfig from trfrs to accelerate by @pacman100 in #432
- DeepSpeed Revamp by @pacman100 in #405
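As an illustration of what the revamp unlocks, here is a minimal, hedged sketch of pointing Accelerate at your own DeepSpeed config file. The `hf_ds_config` keyword is an assumption based on the migrated `HFDeepSpeedConfig` above, so check the new documentation for the exact API:

```python
from accelerate import Accelerator, DeepSpeedPlugin

# "ds_config.json" is a placeholder path to a hand-written DeepSpeed config
# file; the `hf_ds_config` keyword is assumed, not confirmed by this release note.
deepspeed_plugin = DeepSpeedPlugin(hf_ds_config="ds_config.json")
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)
```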
TPU speedup
If you're using TPUs we have sped up the dataloaders and models quite a bit, on top of a few bug fixes.
- Revamp TPU internals to be more efficient + enable mixed precision types by @muellerzr in #441
What's new?
- Fix docstring by @muellerzr in #447
- Add psutil as dependency by @sgugger in #445
- fix fsdp torch version dependency by @pacman100 in #437
- Create Gradient Accumulation Example by @muellerzr in #431
- init by @muellerzr in #429
- Introduce `no_sync` context wrapper + clean up some more warnings for DDP by @muellerzr in #428
- updating tests to resolve runner failures wrt deepspeed revamp by @pacman100 in #427
- Fix secrets in Docker workflow by @muellerzr in #426
- Introduce a Dependency Checker to trigger new Docker Builds on main by @muellerzr in #424
- Enable slow tests nightly by @muellerzr in #421
- Push out python 3.6 + fix all tests related to the upgrade by @muellerzr in #420
- Speedup main CI by @muellerzr in #419
- Switch to evaluate for metrics by @sgugger in #417
- Create an issue template for Accelerate by @muellerzr in #415
- Introduce post-merge runners by @muellerzr in #416
- Fix debug_launcher issues by @muellerzr in #413
- Use main egg by @muellerzr in #414
- Introduce nightly runners by @muellerzr in #410
- Update requirements to pin tensorboard and include psutil by @muellerzr in #408
- Fix CUDA examples tests by @muellerzr in #407
- Move datasets and transformers to under func by @muellerzr in #411
- Fix CUDA Dockerfile by @muellerzr in #409
- Hotfix all failing GPU tests by @muellerzr in #401
- improve metrics logged in examples by @pacman100 in #399
- Refactor offload_state_dict and fix in offload_weight by @sgugger in #398
- Refactor version checking into a utility by @muellerzr in #395
- Include fastai in frameworks by @muellerzr in #396
- Add packaging to requirements by @muellerzr in #394
- Better dispatch for submodules by @sgugger in #392
- Build Docker Images nightly by @muellerzr in #391
- Small bugfix for the stalebot workflow by @muellerzr in #390
- Introduce stalebot by @muellerzr in #387
- Create Dockerfiles for Accelerate by @muellerzr in #377
- Mix precision -> Mixed precision by @muellerzr in #388
- Fix OneCycle step length when in multiprocess by @muellerzr in #385
v0.9.0: Refactor utils to use in Transformers
This release offers no significant new API; it is just needed to have access to some utils in Transformers.
- Handle deprecation errors in launch by @muellerzr in #360
- Update launchers.py by @tmabraham in #363
- fix tracking by @pacman100 in #361
- Remove tensor call by @muellerzr in #365
- Add a utility for writing a barebones config file by @muellerzr in #371
- fix deepspeed model saving by @pacman100 in #370
- deepspeed save model temp fix by @pacman100 in #374
- Refactor tests to use accelerate launch by @muellerzr in #373
- fix zero stage-1 by @pacman100 in #378
- fix shuffling for ShufflerIterDataPipe instances by @loubnabnl in #376
- Better check for deepspeed availability by @sgugger in #379
- Refactor some parts in utils by @sgugger in #380
v0.8.0: Big model inference
Big model inference
To handle very large models, new functionality has been added in Accelerate:
- a context manager to initialize empty models
- a function to load a sharded checkpoint directly on the right devices
- a set of custom hooks that allow execution of a model split on different devices, as well as CPU or disk offload
- a magic method that auto-determines a device map for a given model, filling the GPUs first, then the available RAM, before using disk offload as a last resort
- a function that wraps the last three blocks in one simple call (`load_checkpoint_and_dispatch`)
See more in the documentation.
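Putting the pieces together, a minimal sketch (the model class, config, and checkpoint path are hypothetical placeholders):

```python
from accelerate import init_empty_weights, load_checkpoint_and_dispatch

# Instantiate the model skeleton without allocating real weights.
with init_empty_weights():
    model = MyLargeModel(config)  # hypothetical model class and config

# Load a (possibly sharded) checkpoint directly onto the right devices,
# letting Accelerate infer a device map: GPUs first, then CPU RAM,
# then disk offload as a last resort.
model = load_checkpoint_and_dispatch(model, "path/to/checkpoint", device_map="auto")
```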
What's new
- Create peak_memory_uasge_tracker.py by @pacman100 in #336
- Fixed a typo to enable running accelerate correctly by @Idodox in #339
- Introduce multiprocess logger by @muellerzr in #337
- Refactor utils into its own module by @muellerzr in #340
- Improve num_processes question in CLI by @muellerzr in #343
- Handle Manual Wrapping in FSDP. Minor fix of fsdp example. by @pacman100 in #342
- Better prompt for number of training devices by @muellerzr in #344
- Fix prompt for num_processes by @pacman100 in #347
- Fix sample calculation in examples by @muellerzr in #352
- Fixing metric eval in distributed setup by @pacman100 in #355
- DeepSpeed and FSDP plugin support through script by @pacman100 in #356
v0.7.1: Patch release
v0.7.0: Logging API, FSDP, batch size finder and examples revamp
Logging API
Use any of your favorite logging libraries (TensorBoard, Wandb, CometML...) with just a few lines of code inside your training scripts with Accelerate. All details are in the documentation.
- Add logging capabilities by @muellerzr in #293
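A minimal sketch of the pattern (the backend name and the `training_step` helper are placeholders; see the documentation for the supported trackers):

```python
from accelerate import Accelerator

accelerator = Accelerator(log_with="tensorboard")  # or "wandb", "comet_ml", ...
accelerator.init_trackers("my_experiment")

for step, batch in enumerate(dataloader):
    loss = training_step(batch)  # hypothetical training step
    accelerator.log({"train_loss": loss.item()}, step=step)

accelerator.end_training()
```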
Support for FSDP (Fully Sharded Data Parallel)
PyTorch recently released a new model wrapper for sharded DDP training called FSDP. This release adds support for it (note that it doesn't work with mixed precision yet). See all caveats in the documentation.
- PyTorch FSDP Feature Incorporation by @pacman100 in #321
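A hedged sketch of enabling it programmatically; in practice the plugin is usually configured through `accelerate config`, and the plugin name and import path used here should be checked against the documentation:

```python
from accelerate import Accelerator, FullyShardedDataParallelPlugin

# Default-constructed plugin; remember mixed precision is not supported yet.
fsdp_plugin = FullyShardedDataParallelPlugin()
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
model = accelerator.prepare(model)  # the model gets wrapped in FSDP here
```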
Batch size finder
Say goodbye to CUDA OOM errors with the new `find_executable_batch_size` decorator. Just decorate your training function and pick a starting batch size, then let Accelerate do the rest.
- Add a memory-aware decorator for CUDA OOM avoidance by @muellerzr in #324
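A minimal sketch of the decorator in use (`build_dataloader` is a hypothetical helper):

```python
from accelerate.utils import find_executable_batch_size

@find_executable_batch_size(starting_batch_size=128)
def training_loop(batch_size):
    # On a CUDA OOM, the decorator frees memory, halves `batch_size`
    # and calls this function again with the smaller value.
    dataloader = build_dataloader(batch_size)  # hypothetical helper
    for batch in dataloader:
        ...  # regular training step

# Call without arguments: the decorator injects the batch size.
training_loop()
```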
Examples revamp
The Accelerate examples are now split in two: in the base folder you will find simple NLP and computer vision examples, as well as complete versions incorporating all features. You can also browse the examples in the `by_feature` subfolder, which shows exactly what code to add for each given feature (checkpointing, tracking, cross-validation, etc.).
- Refactor Examples by Feature by @muellerzr in #312
What's Changed
- Document save/load state by @muellerzr in #290
- Refactor precisions to its own enum by @muellerzr in #292
- Load model and optimizer states on CPU to avoid OOMs by @sgugger in #299
- Fix example for datasets v2 by @sgugger in #298
- Leave default as None in `mixed_precision` for launch command by @sgugger in #300
- Pass `lr_scheduler` to `Accelerator.prepare` by @sgugger in #301
- Create new TestCase classes and clean up W&B tests by @muellerzr in #304
- Have custom trackers work with the API by @muellerzr in #305
- Write tests for comet_ml by @muellerzr in #306
- Fix training in DeepSpeed by @sgugger in #308
- Update example scripts by @muellerzr in #307
- Use --no_local_rank for DeepSpeed launch by @sgugger in #309
- Fix Accelerate CLI CPU option + small fix for W&B tests by @muellerzr in #311
- Fix DataLoader sharding for deepspeed in accelerate by @m3rlin45 in #315
- Create a testing framework for example scripts and fix current ones by @muellerzr in #313
- Refactor Tracker logic and write guards for logging_dir by @muellerzr in #316
- Create Cross-Validation example by @muellerzr in #317
- Create alias for Accelerator.free_memory by @muellerzr in #318
- fix typo in docs of accelerate tracking by @loubnabnl in #320
- Update examples to show how to deal with extra validation copies by @muellerzr in #319
- Fixup all checkpointing examples by @muellerzr in #323
- Introduce reduce operator by @muellerzr in #326
New Contributors
- @m3rlin45 made their first contribution in #315
- @loubnabnl made their first contribution in #320
- @pacman100 made their first contribution in #321
Full Changelog: v0.6.0...v0.7.0
v0.6.2: Fix launcher with mixed precision
Since v0.6.0, the launcher had been ignoring the mixed precision attribute of the config. This patch fixes that.
v0.6.1: Hot fix
Patches an issue with mixed precision (see #286)
v0.6.0: Checkpointing and bfloat16 support
This release adds support for bfloat16 mixed precision training (requires PyTorch >= 1.10) and a brand-new checkpoint utility to help with resuming interrupted training runs. We also get a completely revamped documentation frontend.
Checkpoints
Save the current state of all your objects (models, optimizers, RNG states) with `accelerator.save_state(path_to_checkpoint)` and reload everything by calling `accelerator.load_state(path_to_checkpoint)`.
- Add in checkpointing capability by @muellerzr in #255
- Implementation of saving and loading custom states by @muellerzr in #270
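A minimal sketch of the round trip (the prepared objects and checkpoint path are placeholders):

```python
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

# Save model, optimizer and RNG states in one call...
accelerator.save_state("checkpoints/step_500")
# ...and later reload everything to resume the interrupted run.
accelerator.load_state("checkpoints/step_500")
```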
BFloat16 support
Accelerate now supports bfloat16 mixed precision training. As a result, the old `--fp16` argument has been deprecated and replaced by the more generic `--mixed_precision`.
- Add bfloat16 support #243 by @ikergarcia1996 in #247
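The same choice can be made programmatically; a hedged sketch, assuming the `mixed_precision` keyword on `Accelerator` mirrors the CLI argument (requires PyTorch >= 1.10 and bfloat16-capable hardware):

```python
from accelerate import Accelerator

# Expected values for mixed_precision are "no", "fp16" and "bf16".
accelerator = Accelerator(mixed_precision="bf16")
```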
New env subcommand
You can now type `accelerate env` to get a copy-pastable summary of your environment and default configuration. Very convenient when opening a new issue!
New doc frontend
The documentation has been switched to the new Hugging Face frontend, like Transformers and Datasets.
What's Changed
- Fix send_to_device with non-tensor data by @sgugger in #177
- Handle UserDict in all utils by @sgugger in #179
- Use collections.abc.Mapping to handle both the dict and the UserDict types by @mariosasko in #180
- fix: use `store_true` on argparse in nlp example by @monologg in #183
- Update README.md by @TevenLeScao in #187
- Add signature check for `set_to_none` in Optimizer.zero_grad by @sgugger in #189
- fix typo in code snippet by @MrZilinXiao in #199
- Add high-level API reference to README by @Chris-hughes10 in #204
- fix rng_types in accelerator by @s-kumano in #206
- Pass along drop_last in DispatchDataLoader by @sgugger in #212
- Rename state to avoid name conflicts with pytorch's Optimizer class. by @yuxinyuan in #224
- Fix lr scheduler num samples by @sgugger in #227
- Add customization point for init_process_group kwargs by @sgugger in #228
- Fix typo in installation docs by @jaketae in #234
- make deepspeed optimizer match parameters of passed optimizer by @jmhessel in #246
- Upgrade black to version ~=22.0 by @LysandreJik in #250
- add support of gather_object by @ZhiyuanChen in #238
- Add launch flags --module and --no_python (#256) by @parameter-concern in #258
- Accelerate + Animus/Catalyst = 🚀 by @Scitator in #249
- Add `debug_launcher` by @sgugger in #259
- enhance compatibility of honor type by @ZhiyuanChen in #241
- Add a flag to use CPU only in the config by @sgugger in #263
- Basic fixes for DeepSpeed by @sgugger in #264
- Ability to set the seed with randomness from inside Accelerate by @muellerzr in #266
- Don't use dispatch_batches when torch is < 1.8.0 by @sgugger in #269
- Make accelerated model with AMP possible to pickle by @BenjaminBossan in #274
- Contributing guide by @LysandreJik in #254
- replace texts and link (master -> main) by @johnnv1 in #282
- Use workflow from doc-builder by @sgugger in #275
- Pass along execution info to the exit of autocast by @sgugger in #284
New Contributors
- @mariosasko made their first contribution in #180
- @monologg made their first contribution in #183
- @TevenLeScao made their first contribution in #187
- @MrZilinXiao made their first contribution in #199
- @Chris-hughes10 made their first contribution in #204
- @s-kumano made their first contribution in #206
- @yuxinyuan made their first contribution in #224
- @jaketae made their first contribution in #234
- @jmhessel made their first contribution in #246
- @ikergarcia1996 made their first contribution in #247
- @ZhiyuanChen made their first contribution in #238
- @parameter-concern made their first contribution in #258
- @Scitator made their first contribution in #249
- @muellerzr made their first contribution in #255
- @BenjaminBossan made their first contribution in #274
- @johnnv1 made their first contribution in #280
Full Changelog: v0.5.1...v0.6.0
v0.5.1: Patch release
v0.5.0: Dispatch batches from main DataLoader
This release introduces support for iterating through a `DataLoader` only on the main process, which then dispatches the batches to all processes.
Dispatch batches from main DataLoader
The motivation behind this comes from dataset streaming, which introduces two difficulties:
- there might be timeouts for some elements of the dataset, which may differ across the launched processes, so it's impossible to ensure the data is iterated through the same way on each process
- when using an `IterableDataset`, each process goes through the whole dataset and thus applies the preprocessing to all elements, which can slow down training
This new feature is activated by default for every `IterableDataset`.
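A hedged sketch of opting in explicitly for a map-style dataset, assuming the `dispatch_batches` flag on `Accelerator` controls this behavior:

```python
from accelerate import Accelerator

# dispatch_batches=True: only the main process iterates the DataLoader,
# then slices of each batch are sent to the other processes.
accelerator = Accelerator(dispatch_batches=True)
dataloader = accelerator.prepare(dataloader)
```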
Various fixes
- fix fp16 covert back to fp32 for issue: unsupported operand type(s) for /: 'dict' and 'int' #149 (@Doragd)
- [Docs] Machine config is yaml not json #151 (@patrickvonplaten)
- Fix gather for 0d tensor #152 (@sgugger)
- [DeepSpeed] allow untested optimizers deepspeed #150 (@patrickvonplaten)
- Raise errors instead of warnings with better tests #170 (@sgugger)