v4.34: Mistral, Persimmon, Prompt templating, Flash Attention 2, Tokenizer refactor
New models
Mistral
Mistral-7B-v0.1 is a decoder-based LM with the following architectural choices:
- Sliding Window Attention - Trained with 8k context length and fixed cache size, with a theoretical attention span of 128K tokens
- GQA (Grouped Query Attention) - allowing faster inference and lower cache size.
- Byte-fallback BPE tokenizer - ensures that characters are never mapped to out-of-vocabulary tokens.
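The model can be used through the standard auto classes. A minimal sketch (the prompt and generation settings are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# generate a short continuation
inputs = tokenizer("My favourite condiment is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```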
Persimmon
The authors introduced Persimmon-8B, a decoder model based on the classic transformers architecture, with query and key normalization. Persimmon-8B is a fully permissively licensed model with approximately 8 billion parameters, released under the Apache license. Some of the key attributes of Persimmon-8B are long context size (16K), performance, and capabilities for multimodal extensions.
- [Persimmon] Add support for persimmon by @ArthurZucker in #26042
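As a quick sketch, Persimmon loads through the same auto classes (the `adept/persimmon-8b-base` checkpoint id is assumed here for illustration):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# checkpoint id assumed for illustration
tokenizer = AutoTokenizer.from_pretrained("adept/persimmon-8b-base")
model = AutoModelForCausalLM.from_pretrained("adept/persimmon-8b-base")
```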
BROS
BROS stands for BERT Relying On Spatiality. It is an encoder-only Transformer model that takes a sequence of tokens and their bounding boxes as inputs and outputs a sequence of hidden states. BROS encodes relative spatial information instead of using absolute spatial information.
- Add BROS by @jinhopark8345 in #23190
ViTMatte
ViTMatte leverages plain Vision Transformers for the task of image matting, which is the process of accurately estimating the foreground object in images and videos.
- Add ViTMatte by @NielsRogge in #25843
Nougat
Nougat uses the same architecture as Donut, meaning an image Transformer encoder and an autoregressive text Transformer decoder to translate scientific PDFs to markdown, enabling easier access to them.
- Add Nougat by @NielsRogge and @molbap in #25942
Prompt templating
We've added a new template feature for chat models. This allows the formatting that a chat model was trained with to be saved with the model, ensuring that users can exactly reproduce that formatting when they want to fine-tune the model or use it for inference. For more information, see our template documentation.
- Overhaul Conversation class and prompt templating by @Rocketknight1 in #25323
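As a minimal sketch, a saved template can be applied with `apply_chat_template` (the checkpoint id and messages below are placeholders):

```python
from transformers import AutoTokenizer

# any chat model checkpoint works here; this id is a placeholder
tokenizer = AutoTokenizer.from_pretrained("your-org/your-chat-model")

messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "Doing great, thanks!"},
    {"role": "user", "content": "Can you summarize our chat?"},
]

# renders the conversation with the exact formatting the model was trained on
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```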
🚨🚨 Tokenizer refactor
- [Tokenizer] attempt to fix add_token issues by @ArthurZucker in #23909
- Nit-added-tokens by @ArthurZucker in #26538 adds some fixes to #23909
🚨 Workflow Changes 🚨:
These are not breaking changes per se but rather bugfixes. However, we understand that this may result in some workflow changes, so we highlight them below.
- `unique_no_split_tokens` attribute removed and no longer used in the internal logic
- `sanitize_special_tokens()` follows a deprecation cycle and does nothing
- All attributes in `SPECIAL_TOKENS_ATTRIBUTES` are stored as `AddedToken` objects, not strings.
- Loading a slow tokenizer from a fast one, or a fast from a slow, will no longer raise an error if the added tokens don't have the correct index. This is because they will always be added following the order of `added_tokens`, but mistakes in the saved vocabulary will be corrected if there are any (and there are a lot in old-format tokenizers).
- The length of a tokenizer is now `max(set(self.get_vocab().keys()))`, accounting for holes in the vocab. `vocab_size` no longer takes the added vocab into account for most tokenizers (as it should not). Mostly breaking for T5.
- Adding a token using `tokenizer.add_tokens([AddedToken("hey", rstrip=False, normalized=True)])` now takes the `rstrip`, `lstrip`, and `normalized` information into account.
- `added_tokens_decoder` holds `AddedToken` objects, not strings.
- `add_tokens()` for both fast and slow tokenizers will always update a token that is already part of the vocab, allowing for custom stripping.
- Initializing a tokenizer from scratch will now add missing special tokens to the vocab.
- Stripping is not always done for special tokens! 🚨 Only if the `AddedToken` has `lstrip=True` and `rstrip=True`.
- `fairseq_ids_to_tokens` attribute removed for Barthez (was not used)
➕ Most visible features:
- printing a tokenizer now shows `tokenizer.added_tokens_decoder` for both fast and slow tokenizers. Moreover, additional tokens that were already part of the initial vocab are also found there.
- faster `from_pretrained`, faster `add_tokens` because special and non-special tokens can be mixed together and the trie is not always rebuilt.
- faster encode/decode with a caching mechanism for `added_tokens_decoder/encoder`.
- information is fully saved in the `tokenizer_config.json`
For any issues relating to this, make sure to open a new issue and ping @ArthurZucker.
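A minimal sketch of the new `AddedToken` handling described above (the token string and flags are illustrative):

```python
from transformers import AddedToken, AutoTokenizer

# any checkpoint works here; bert-base-uncased is used for illustration
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# rstrip/lstrip/normalized are now taken into account, for fast and slow tokenizers alike
tokenizer.add_tokens([AddedToken("hey", rstrip=False, normalized=True)])

# printing the tokenizer now shows added_tokens_decoder, which holds AddedToken objects
print(tokenizer)
print(tokenizer.added_tokens_decoder)
```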
Flash Attention 2
FA2 support has been added to transformers for the most popular architectures (Llama, Mistral, Falcon), with additional architectures actively being contributed in this issue (#26350). Simply pass `use_flash_attention_2=True` when calling `from_pretrained`.
In the future, PyTorch will support Flash Attention 2 through `torch.scaled_dot_product_attention`; users will then be able to benefit from both implementations of Flash Attention 2 (transformers core, and transformers + SDPA) with simple changes (`model.to_bettertransformer()` and force-dispatching the SDPA kernel to FA-2 in the case of SDPA).
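A minimal sketch of enabling it at load time (assumes a CUDA device, half-precision weights, and the flash-attn package installed):

```python
import torch
from transformers import AutoModelForCausalLM

# Flash Attention 2 requires fp16/bf16 weights and a supported GPU
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.float16,
    use_flash_attention_2=True,
).to("cuda")
```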
- [core] Integrate Flash attention 2 in most used models by @younesbelkada in #25598
For our future plans regarding integrating F.sdpa from PyTorch in core transformers, see here: #26557
Lazy import structure
Support for lazy loading of integration libraries has been added. This drastically speeds up importing `transformers` and related objects from the library.
Example before this change:
2023-09-11 11:07:52.010179: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
python3 -c "from transformers import CLIPTextModel" 3.31s user 3.06s system 220% cpu 2.893 total
After this change:
python3 -c "from transformers import CLIPTextModel" 1.70s user 1.49s system 220% cpu 1.447 total
- [Core] Add lazy import structure to imports by @patrickvonplaten in #26090
Bugfixes and improvements
- Fix typo by @susnato in #25966
- Fix Detr CI by @ydshieh in #25972
- Fix `test_load_img_url_timeout` by @ydshieh in #25976
- nn.Identity is not required to be compatible with PyTorch < 1.1.0 as the minimum PyTorch version we currently support is 1.10.0 by @statelesshz in #25974
- Add `Pop2Piano` space demo. by @susnato in #25975
- fix typo by @kai01ai in #25981
- Use main in conversion script by @ydshieh in #25973
- [doc] Always call it Agents for consistency by @julien-c in #25958
- Update RAG README.md with correct path to examples/seq2seq by @tleyden in #25953
- Update training_args.py to remove the runtime error by @sahel-sh in #25920
- Trainer: delegate default generation values to `generation_config` by @gante in #25987
- Show failed tests on CircleCI layout in a better way by @ydshieh in #25895
- Patch with accelerate xpu by @abhilash1910 in #25714
- PegasusX add _no_split_modules by @andreeahedes in #25933
- Add TFDebertaV2ForMultipleChoice by @raghavanone in #25932
- deepspeed resume from ckpt fixes and adding support for deepspeed optimizer and HF scheduler by @pacman100 in #25863
- [Wav2Vec2 Conformer] Fix inference float16 by @sanchit-gandhi in #25985
- Add LLaMA resources by @eenzeenee in #25859
- [CI] Fix red CI and ERROR failed should show by @ArthurZucker in #25995
- [VITS] tokenizer integration test: fix revision did not exist by @ArthurZucker in #25996
- Fix Mega chunking error when using decoder-only model by @tanaymeh in #25765
- save space when converting hf model to megatron model. by @flower-with-safe in #25950
- Update README.md by @NinoRisteski in #26003
- Falcon: fix revision propagation by @LysandreJik in #26006
- TF-OPT attention mask fixes by @Rocketknight1 in #25238
- Fix small typo README.md by @zspo in #25934
- 🌐 [i18n-KO] Translated `llm_tutorial.md` to Korean by @harheem in #25791
- Remove Falcon from undocumented list by @Rocketknight1 in #26008
- modify context length for GPTQ + version bump by @SunMarc in #25899
- Fix err with FSDP by @muellerzr in #25991
- fix _resize_token_embeddings will set lm head size to 0 when enabled deepspeed zero3 by @kai01ai in #26024
- Fix CircleCI config by @ydshieh in #26023
- Add `tgs` speed metrics by @CokeDong in #25858
- [VITS] Fix nightly tests by @sanchit-gandhi in #25986
- Added HerBERT to README.md by @Muskan011 in #26020
- Fix vilt config docstring parameter to match value in init by @raghavanone in #26017
- Punctuation fix by @kwonmha in #26025
- Try to fix training Loss inconsistent after resume from old checkpoint by @dumpmemory in #25872
- Fix Dropout Implementation in Graphormer by @alexanderkrauck in #24817
- Update missing docs on `activation_dropout` and fix DropOut docs for SEW-D by @gau-nernst in #26031
- Skip warning if tracing with dynamo by @angelayi in #25581
- 🌐 [i18n-KO] Translated `llama.md` to Korean by @harheem in #26044
- [CodeLlamaTokenizerFast] Fix `set_infilling_processor` to properly reset by @ArthurZucker in #26041
- [CITests] skip failing tests until #26054 is merged by @ArthurZucker in #26063
- only main process should call _save on deepspeed zero3 by @zjjMaiMai in #25959
- docs: update link huggingface map by @pphuc25 in #26077
- docs: add space to docs by @pphuc25 in #26067
- [core] Import tensorflow inside relevant methods in `trainer_utils` by @younesbelkada in #26106
- Generate: legacy mode is only triggered when `generation_config` is untouched by @gante in #25962
- Update logits_process.py docstrings by @larekrow in #25971
- Fix ExponentialDecayLengthPenalty negative logits issue by @pokjay in #25594
- 🌐 [i18n-KO] Translated `llama2.md` to Korean by @mjk0618 in #26047
- [docs] Updates to TTS task guide with regards to the new TTS pipeline by @MKhalusova in #26095
- 🌐 [i18n-KO] Translated `contributing.md` to Korean by @mjk0618 in #25877
- enable optuna multi-objectives feature by @sywangyi in #25969
- chore: correct update_step and correct gradient_accumulation_steps by @pphuc25 in #26068
- Text2text pipeline: don't parameterize from the config by @gante in #26118
- Fix `MarianTokenizer` to remove metaspace character in `decode` by @tanaymeh in #26091
- safeguard torch distributed check by @pacman100 in #26056
- fix the deepspeed tests by @pacman100 in #26021
- Fix AutoTokenizer docstring typo by @amyeroberts in #26117
- [core] fix 4bit `num_parameters` by @younesbelkada in #26132
- Add missing space in generation/utils.py by @jbochi in #26121
- Update spectrogram and waveform model mapping for TTS/A pipeline by @Vaibhavs10 in #26114
- [RWKV] Final fix RWKV 4bit by @younesbelkada in #26134
- docs: feat: add llama2 notebook resources from OSSCA community by @junejae in #26076
- Generate: ignore warning when `generation_config.max_length` is set to `None` by @gante in #26147
- Fix `test_finetune_bert2bert` by @ydshieh in #25984
- Falcon: batched generation by @gante in #26137
- Fix `beam_scores` shape when token scores shape changes after `logits_processor` by @BakerBunker in #25980
- Update training_args.py - addition of self.distributed_state when using XPU by @Serizao in #25999
- [docs] last hidden state vs hidden_states[-1] by @MKhalusova in #26142
- Flex xpu bug fix by @abhilash1910 in #26135
- Add missing Maskformer dataclass decorator, add dataclass check in ModelOutput for subclasses by @rachthree in #25638
- Fix eval accumulation when `accelerate` > 0.20.3 by @sam-scale in #26060
- [Whisper Tokenizer] Encode timestamps by @sanchit-gandhi in #26054
- [PEFT] Fix PEFT + gradient checkpointing by @younesbelkada in #25846
- [MusicGen] Add streamer to generate by @sanchit-gandhi in #25320
- Fix beam search when using model parallel by @pfldy2850 in #24969
- [MusicGen] Add sampling rate to config by @sanchit-gandhi in #26136
- [Whisper] Fix word-level timestamps for audio < 30 seconds by @xenova in #25607
- [BLIP-2] Improve conversion script by @NielsRogge in #24854
- IDEFICS: allow interpolation of vision's pos embeddings by @leot13 in #26029
- [TTA Pipeline] Test MusicGen and VITS by @sanchit-gandhi in #26146
- Tweaks to Chat Templates docs by @Rocketknight1 in #26168
- [Whisper] Check length of prompt + max new tokens by @sanchit-gandhi in #26164
- Update notebook.py to support multi eval datasets by @matrix1001 in #25796
- Fix pad to multiple of by @ArthurZucker in #25732
- [docs] IDEFICS guide and task guides restructure by @MKhalusova in #26035
- [PEFT] Allow PEFT model dict to be loaded by @patrickvonplaten in #25721
- No doctest for `convert_bros_to_pytorch.py` by @ydshieh in #26212
- Remove `utils/documentation_tests.txt` by @ydshieh in #26213
- moved `ctrl` to `Salesforce/ctrl` by @julien-c in #26183
- Fix ConversationalPipeline tests by @Rocketknight1 in #26217
- [FSMT] Fix non-shared weights by @LysandreJik in #26187
- refactor decay_parameters production into its own function by @shijie-wu in #26152
- refactor: change default block_size in block size > max position embeddings by @pphuc25 in #26069
- [Wav2Vec2-Conf / LLaMA] Style fix by @sanchit-gandhi in #26188
- [Persimmon] Style fix by @sanchit-gandhi in #26228
- [Check] Fix config docstring by @sanchit-gandhi in #26222
- 🌐 [i18n-KO] Translated `whisper.md` to Korean by @nuatmochoi in #26002
- Create the return value on device to avoid unnecessary copying from CPU by @mksit in #26151
- [AutoBackbone] Add test by @NielsRogge in #26094
- Update README.md by @NinoRisteski in #26198
- Update add_new_pipeline.md by @NinoRisteski in #26197
- [docs] Fix model reference in zero shot image classification example by @Aleksandar1932 in #26206
- Fix the gitlab user mention in issue templates to the correct user by @muellerzr in #26237
- Fix some docstring in image processors by @ydshieh in #26235
- Fix gated repo tests by @Wauplin in #26257
- Fix `Error` not captured in PR doctesting by @ydshieh in #26215
- DeepSpeed ZeRO-3 handling when resizing embedding layers by @pacman100 in #26259
- [FIX] resize_token_embeddings by @passaglia in #26102
- FSDP tests and checkpointing fixes by @pacman100 in #26180
- fix name error when accelerate is not available by @pacman100 in #26278
- Update bros checkpoint by @jinhopark8345 in #26277
- Integrate AMD GPU in CI/CD environment by @mfuntowicz in #26007
- Rewrite for custom code warning messages by @Rocketknight1 in #26291
- fix deepspeed available detection by @fxmarty in #26252
- add bbox input validation by @jinhopark8345 in #26294
- include changes from llama by @ArthurZucker in #26260
- [Trainer] Refactor trainer + bnb logic by @younesbelkada in #26248
- add custom RMSNorm to `ALL_LAYERNORM_LAYERS` by @shijie-wu in #26227
- Keep relevant weights in fp32 when `model._keep_in_fp32_modules` is set even when `accelerate` is not installed by @fxmarty in #26225
- Fix FSMT weight sharing by @LysandreJik in #26292
- update hf hub dependency to be compatible with the new tokenizers by @ArthurZucker in #26301
- Porting the torchaudio kaldi fbank implementation to audio_utils by @ylacombe in #26182
- More error message fixup, plus some linebreaks! by @Rocketknight1 in #26296
- [QUICK FIX LINK] Update trainer.py by @SoyGema in #26293
- Use CircleCI `store_test_results` by @ydshieh in #26223
- Fix doctest CI by @ydshieh in #26324
- [doc] fixed indices in obj detection example by @MKhalusova in #26343
- [TTA Pipeline] Fix MusicGen test by @sanchit-gandhi in #26348
- Add image to image pipeline by @LeviVasconcelos in #25393
- feat: adding num_proc to load_dataset by @pphuc25 in #26326
- Fixed unclosed p tags by @HanSeokhyeon in #26240
- Update add_new_model.md by @NinoRisteski in #26365
- Fix MusicGen logging error by @osanseviero in #26370
- [docs] removed MaskFormerSwin and TimmBackbone from the table on index.md by @MKhalusova in #26347
- Update tiny model information and pipeline tests by @ydshieh in #26285
- Add Russian localization for README by @qweme32 in #26208
- 🌐 [i18n-KO] Translated `audio_classification.mdx` to Korean by @gabrielwithappy in #26200
- [ViTMatte] Add resources by @NielsRogge in #26317
- Deleted duplicate sentence by @titi-devv in #26394
- added support for gradient checkpointing in ESM models by @sanjeevk-os in #26386
- Fix DeepSpeed issue with Idefics by @HugoLaurencon in #26393
- Add torch `RMSProp` optimizer by @natolambert in #26425
- Fix padding for IDEFICS by @shauray8 in #26396
- Update semantic_segmentation.md by @zekaouinoureddine in #26419
- Fixing tokenizer when `transformers` is installed without `tokenizers` by @urialon in #26236
- [FA / tests] Add use_cache tests for FA models by @younesbelkada in #26415
- add bf16 mixed precision support for NPU by @statelesshz in #26163
- [PEFT] Fix PEFT multi adapters support by @younesbelkada in #26407
- Fix failing doctest by @LysandreJik in #26450
- Update `runs-on` in workflow files by @ydshieh in #26435
- [i18n-DE] Complete first toc chapter by @flozi00 in #26311
- 🌐 [i18n-KO] Translated `debugging.md` to Korean by @wonhyeongseo in #26246
- 🌐 [i18n-KO] Translated `perf_train_gpu_many.md` to Korean by @wonhyeongseo in #26244
- optimize VRAM for calculating pos_bias in LayoutLM v2, v3 by @NormXU in #26139
- Fix `cos_sin` device issue in Falcon model by @ydshieh in #26448
- docs: change assert to raise and some small docs by @pphuc25 in #26232
- change mention of decoder_input_ids to input_ids and same with decode_inputs_embeds by @tmabraham in #26406
- [VITS] Fix speaker_embed device mismatch by @fakhirali in #26115
- [PEFT] introducing `adapter_kwargs` for loading adapters from a different Hub location (`subfolder`, `revision`) than the base model by @younesbelkada in #26270
- Do not warn about unexpected decoder weights when loading T5EncoderModel and LongT5EncoderModel by @fleonce in #26211
- fix_mbart_tied_weights by @SunMarc in #26422
- Esm checkpointing by @Amelie-Schreiber in #26454
- [Whisper Tokenizer] Make decoding faster after adding timestamps by @sanchit-gandhi in #26299
- [docs] Update offline mode docs by @stevhliu in #26478
- [docs] navigation improvement between text gen pipelines and text gen params by @MKhalusova in #26477
- Skip 2 failing persimmon pipeline tests for now by @ydshieh in #26485
- Avoid all-zero attention mask used in testing by @ydshieh in #26469
- [Flax Examples] Seq2Seq ASR Fine-Tuning Script by @sanchit-gandhi in #21764
- [ASR Pipe] Improve docs and error messages by @sanchit-gandhi in #26476
- Revert falcon exception by @LysandreJik in #26472
- Fix num_heads in _upad_input by @fs4r in #26490
- Fix requests connection error during modelcard creation by @jphme in #26518
- Fix issue of canine forward requiring input_ids anyway by @marcmk6 in #26290
- Fix broken link to video classification task by @HelgeS in #26487
- [PEFT] Pass token when calling `find_adapter_config` by @younesbelkada in #26488
- [core / auto] Fix bnb test with code revision + bug with code revision by @younesbelkada in #26431
- Fix model integration ci by @ArthurZucker in #26322
- [PEFT] Protect `adapter_kwargs` check by @younesbelkada in #26537
- Remove-warns by @ArthurZucker in #26483
- [Doctest] Add configuration_roformer.py by @Adithya4720 in #26530
- Code-llama-nit by @ArthurZucker in #26300
- add build_inputs_with_special_tokens to LlamaFast by @ArthurZucker in #26297
- 🌐 [i18n-KO] Translated `tokenizer_summary.md` to Korean by @wonhyeongseo in #26243
- [i18n-DE] contribute chapter by @flozi00 in #26481
- [RFC, Logging] Change warning to info by @patrickvonplaten in #26545
- Add tokenizer kwargs to fill mask pipeline. by @nmcahill in #26234
- [Wav2Vec2 and Co] Update init tests for PT 2.1 by @sanchit-gandhi in #26494
- [AMD] Add initial version for run_tests_multi_gpu by @mfuntowicz in #26346
- [Doctest] Add `configuration_encoder_decoder.py` by @SrijanSahaySrivastava in #26519
- [InternLM] Add support for InternLM by @Rocketknight1 in #26302
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @jinhopark8345
  - Add BROS (#23190)
  - Update bros checkpoint (#26277)
  - add bbox input validation (#26294)
- @qweme32
  - Add Russian localization for README (#26208)
- @Bam4d
  - [Mistral] Mistral-7B-v0.1 support (#26447)
- @flozi00
  - [i18n-DE] Complete first toc chapter (#26311)
  - [i18n-DE] contribute chapter (#26481)
- @wonhyeongseo
  - 🌐 [i18n-KO] Translated debugging.md to Korean (#26246)
  - 🌐 [i18n-KO] Translated perf_train_gpu_many.md to Korean (#26244)
  - 🌐 [i18n-KO] Translated tokenizer_summary.md to Korean (#26243)