Release v4.34: Mistral, Persimmon, Prompt templating, Flash Attention 2, Tokenizer refactor · huggingface/transformers

New models

Mistral

Mistral-7B-v0.1 is a decoder-based LM with the following architectural choices:

Sliding Window Attention - Trained with 8k context length and fixed cache size, with a theoretical attention span of 128K tokens
GQA (Grouped Query Attention) - allowing faster inference and lower cache size.
Byte-fallback BPE tokenizer - ensures that characters are never mapped to out-of-vocabulary tokens.

[Mistral] Mistral-7B-v0.1 support by @Bam4d in #26447

Persimmon

The authors introduced Persimmon-8B, a decoder model based on the classic transformers architecture, with query and key normalization. Persimmon-8B is a fully permissively licensed model with approximately 8 billion parameters, released under the Apache license. Some of the key attributes of Persimmon-8B are long context size (16K), performance, and capabilities for multimodal extensions.

[Persimmon] Add support for persimmon by @ArthurZucker in #26042

BROS

BROS stands for BERT Relying On Spatiality. It is an encoder-only Transformer model that takes a sequence of tokens and their bounding boxes as inputs and outputs a sequence of hidden states. BROS encode relative spatial information instead of using absolute spatial information.

Add BROS by @jinhopark8345 in #23190

ViTMatte

ViTMatte leverages plain Vision Transformers for the task of image matting, which is the process of accurately estimating the foreground object in images and videos.

Add ViTMatte by @NielsRogge in #25843

Nougat

Nougat uses the same architecture as Donut, meaning an image Transformer encoder and an autoregressive text Transformer decoder to translate scientific PDFs to markdown, enabling easier access to them.

Add Nougat by @NielsRogge and @molbap in #25942

Prompt templating

We've added a new template feature for chat models. This allows the formatting that a chat model was trained with to be saved with the model, ensuring that users can exactly reproduce that formatting when they want to fine-tune the model or use it for inference. For more information, see our template documentation.

Overhaul Conversation class and prompt templating by @Rocketknight1 in #25323

🚨🚨 Tokenizer refactor

[Tokenizer] attemp to fix add_token issues by @ArthurZucker in #23909
Nit-added-tokens by @ArthurZucker in #26538 adds some fix to #23909 .

🚨Workflow Changes 🚨:

These are not breaking changes per se but rather bugfixes. However, we understand that this may result in some workflow changes so we highlight them below.

unique_no_split_tokens attribute removed and not used in the internal logic
sanitize_special_tokens() follows a deprecation cycle and does nothing
All attributes in SPECIAL_TOKENS_ATTRIBUTES are stored as AddedTokens and no strings.
loading a slow from a fast or a fast from a slow will no longer raise and error if the tokens added don't have the correct index. This is because they will always be added following the order of the added_tokens but will correct mistakes in the saved vocabulary if there are any. (And there are a lot in old format tokenizers)
the length of a tokenizer is now max(set(self.get_vocab().keys())) accounting for holes in the vocab. The vocab_size no longer takes into account the added vocab for most of the tokenizers (as it should not). Mostly breaking for T5
Adding a token using tokenizer.add_tokens([AddedToken("hey", rstrip=False, normalized=True)]) now takes into account rstrip, lstrip, normalized information.
added_tokens_decoder holds AddedToken, not strings.
add_tokens() for both fast and slow will always be updated if the token is already part of the vocab, allowing for custom stripping.
initializing a tokenizer form scratch will now add missing special tokens to the vocab.
stripping is not always done for special tokens! 🚨 Only if the AddedToken has lstrip=True and rstrip=True
fairseq_ids_to_tokens attribute removed for Barthez (was not used)

➕ Most visible features:

printing a tokenizer now shows tokenizer.added_tokens_decoder for both fast and slow tokenizers. Moreover, additional tokens that were already part of the initial vocab are also found there.
faster from_pretrained, faster add_tokens because special and non special can be mixed together and the trie is not always rebuilt.
faster encode/decode with caching mechanism for added_tokens_decoder/encoder.
information is fully saved in the tokenizer_config.json

For any issues relating to this, make sure to open a new issue and ping @ArthurZucker.

Flash Attention 2

FA2 support added to transformers for most popular architectures (llama, mistral, falcon) architectures actively being contributed in this issue (#26350). Simply pass use_flash_attention_2=True when calling from_pretrained

In the future, PyTorch will support Flash Attention 2 through torch.scaled_dot_product_attention, users would be able to benefit from both (transformers core & transformers + SDPA) implementations of Flash Attention-2 with simple changes (model.to_bettertransformer() and force-dispatch the SDPA kernel to FA-2 in the case of SDPA)

[core ] Integrate Flash attention 2 in most used models by @younesbelkada in #25598

For our future plans regarding integrating F.sdpa from PyTorch in core transformers, see here: #26557

Lazy import structure

Support for lazy loading integration libraries has been added. This will drastically speed up importing transformers and related object from the library.

Example before this change:

2023-09-11 11:07:52.010179: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
python3 -c "from transformers import CLIPTextModel"  3.31s user 3.06s system 220% cpu 2.893 total

After this change:

python3 -c "from transformers import CLIPTextModel"  1.70s user 1.49s system 220% cpu 1.447 total

[Core] Add lazy import structure to imports by @patrickvonplaten in #26090

Bugfixes and improvements

Fix typo by @susnato in #25966
Fix Detr CI by @ydshieh in #25972
Fix test_load_img_url_timeout by @ydshieh in #25976
nn.Identity is not required to be compatible with PyTorch < 1.1.0 as the minimum PyTorch version we currently support is 1.10.0 by @statelesshz in #25974
Add Pop2Piano space demo. by @susnato in #25975
fix typo by @kai01ai in #25981
Use main in conversion script by @ydshieh in #25973
[doc] Always call it Agents for consistency by @julien-c in #25958
Update RAG README.md with correct path to examples/seq2seq by @tleyden in #25953
Update training_args.py to remove the runtime error by @sahel-sh in #25920
Trainer: delegate default generation values to generation_config by @gante in #25987
Show failed tests on CircleCI layout in a better way by @ydshieh in #25895
Patch with accelerate xpu by @abhilash1910 in #25714
PegasusX add _no_split_modules by @andreeahedes in #25933
Add TFDebertaV2ForMultipleChoice by @raghavanone in #25932
deepspeed resume from ckpt fixes and adding support for deepspeed optimizer and HF scheduler by @pacman100 in #25863
[Wav2Vec2 Conformer] Fix inference float16 by @sanchit-gandhi in #25985
Add LLaMA resources by @eenzeenee in #25859
[CI] Fix red CI and ERROR failed should show by @ArthurZucker in #25995
[VITS] tokenizer integration test: fix revision did not exist by @ArthurZucker in #25996
Fix Mega chunking error when using decoder-only model by @tanaymeh in #25765
save space when converting hf model to megatron model. by @flower-with-safe in #25950
Update README.md by @NinoRisteski in #26003
Falcon: fix revision propagation by @LysandreJik in #26006
TF-OPT attention mask fixes by @Rocketknight1 in #25238
Fix small typo README.md by @zspo in #25934
🌐[i18n-KO] Translated llm_tutorial.md to Korean by @harheem in #25791
Remove Falcon from undocumented list by @Rocketknight1 in #26008
modify context length for GPTQ + version bump by @SunMarc in #25899
Fix err with FSDP by @muellerzr in #25991
fix _resize_token_embeddings will set lm head size to 0 when enabled deepspeed zero3 by @kai01ai in #26024
Fix CircleCI config by @ydshieh in #26023
Add tgs speed metrics by @CokeDong in #25858
[VITS] Fix nightly tests by @sanchit-gandhi in #25986
Added HerBERT to README.md by @Muskan011 in #26020
Fix vilt config docstring parameter to match value in init by @raghavanone in #26017
Punctuation fix by @kwonmha in #26025
Try to fix training Loss inconsistent after resume from old checkpoint by @dumpmemory in #25872
Fix Dropout Implementation in Graphormer by @alexanderkrauck in #24817
Update missing docs on activation_dropout and fix DropOut docs for SEW-D by @gau-nernst in #26031
Skip warning if tracing with dynamo by @angelayi in #25581
🌐 [i18n-KO] Translated llama.md to Korean by @harheem in #26044
[CodeLlamaTokenizerFast] Fix fix set_infilling_processor to properly reset by @ArthurZucker in #26041
[CITests] skip failing tests until #26054 is merged by @ArthurZucker in #26063
only main process should call _save on deepspeed zero3 by @zjjMaiMai in #25959
docs: update link huggingface map by @pphuc25 in #26077
docs: add space to docs by @pphuc25 in #26067
[core] Import tensorflow inside relevant methods in trainer_utils by @younesbelkada in #26106
Generate: legacy mode is only triggered when generation_config is untouched by @gante in #25962
Update logits_process.py docstrings by @larekrow in #25971
Fix ExponentialDecayLengthPenalty negative logits issue by @pokjay in #25594
🌐 [i18n-KO] Translated llama2.md to Korean by @mjk0618 in #26047
[docs] Updates to TTS task guide with regards to the new TTS pipeline by @MKhalusova in #26095
🌐 [i18n-KO] Translated contributing.md to Korean by @mjk0618 in #25877
enable optuna multi-objectives feature by @sywangyi in #25969
chore: correct update_step and correct gradient_accumulation_steps by @pphuc25 in #26068
Text2text pipeline: don't parameterize from the config by @gante in #26118
Fix MarianTokenizer to remove metaspace character in decode by @tanaymeh in #26091
safeguard torch distributed check by @pacman100 in #26056
fix the deepspeed tests by @pacman100 in #26021
Fix AutoTokenizer docstring typo by @amyeroberts in #26117
[core] fix 4bit num_parameters by @younesbelkada in #26132
Add missing space in generation/utils.py by @jbochi in #26121
Update spectrogram and waveform model mapping for TTS/A pipeline by @Vaibhavs10 in #26114
[RWKV] Final fix RWMV 4bit by @younesbelkada in #26134
docs: feat: add llama2 notebook resources from OSSCA community by @junejae in #26076
Generate: ignore warning when generation_config.max_length is set to None by @gante in #26147
Fix test_finetune_bert2bert by @ydshieh in #25984
Falcon: batched generation by @gante in #26137
Fix beam_scores shape when token scores shape changes after logits_processor by @BakerBunker in #25980
Update training_args.py - addition of self.distributed_state when using XPU by @Serizao in #25999
[docs] last hidden state vs hidden_states[-1] by @MKhalusova in #26142
Flex xpu bug fix by @abhilash1910 in #26135
Add missing Maskformer dataclass decorator, add dataclass check in ModelOutput for subclasses by @rachthree in #25638
Fix eval accumulation when accelerate > 0.20.3 by @sam-scale in #26060
[Whisper Tokenizer] Encode timestamps by @sanchit-gandhi in #26054
[PEFT] Fix PEFT + gradient checkpointing by @younesbelkada in #25846
[MusicGen] Add streamer to generate by @sanchit-gandhi in #25320
Fix beam search when using model parallel by @pfldy2850 in #24969
[MusicGen] Add sampling rate to config by @sanchit-gandhi in #26136
[Whisper] Fix word-level timestamps for audio < 30 seconds by @xenova in #25607
[BLIP-2] Improve conversion script by @NielsRogge in #24854
IDEFICS: allow interpolation of vision's pos embeddings by @leot13 in #26029
[TTA Pipeline] Test MusicGen and VITS by @sanchit-gandhi in #26146
Tweaks to Chat Templates docs by @Rocketknight1 in #26168
[Whisper] Check length of prompt + max new tokens by @sanchit-gandhi in #26164
Update notebook.py to support multi eval datasets by @matrix1001 in #25796
Fix pad to multiple of by @ArthurZucker in #25732
[docs] IDEFICS guide and task guides restructure by @MKhalusova in #26035
[PEFT] Allow PEFT model dict to be loaded by @patrickvonplaten in #25721
No doctest for convert_bros_to_pytorch.py by @ydshieh in #26212
Remove utils/documentation_tests.txt by @ydshieh in #26213
moved ctrl to Salesforce/ctrl by @julien-c in #26183
Fix ConversationalPipeline tests by @Rocketknight1 in #26217
[FSMT] Fix non-shared weights by @LysandreJik in #26187
refactor decay_parameters production into its own function by @shijie-wu in #26152
refactor: change default block_size in block size > max position embeddings by @pphuc25 in #26069
[Wav2Vec2-Conf / LLaMA] Style fix by @sanchit-gandhi in #26188
[Permisson] Style fix by @sanchit-gandhi in #26228
[Check] Fix config docstring by @sanchit-gandhi in #26222
🌐 [i18n-KO] Translated whisper.md to Korean by @nuatmochoi in #26002
Create the return value on device to avoid unnecessary copying from CPU by @mksit in #26151
[AutoBackbone] Add test by @NielsRogge in #26094
Update README.md by @NinoRisteski in #26198
Update add_new_pipeline.md by @NinoRisteski in #26197
[docs] Fix model reference in zero shot image classification example by @Aleksandar1932 in #26206
Fix the gitlab user mention in issue templates to the correct user by @muellerz in #26237
Fix some docstring in image processors by @ydshieh in #26235
Fix gated repo tests by @Wauplin in #26257
Fix Error not captured in PR doctesting by @ydshieh in #26215
DeepSpeed ZeRO-3 handling when resizing embedding layers by @pacman100 in #26259
[FIX] resize_token_embeddings by @passaglia in #26102
FSDP tests and checkpointing fixes by @pacman100 in #26180
fix name error when accelerate is not available by @pacman100 in #26278
Update bros checkpoint by @jinhopark8345 in #26277
Integrate AMD GPU in CI/CD environment by @mfuntowicz in #26007
Rewrite for custom code warning messages by @Rocketknight1 in #26291
fix deepspeed available detection by @fxmarty in #26252
add bbox input validation by @jinhopark8345 in #26294
include changes from llama by @ArthurZucker in #26260
[Trainer] Refactor trainer + bnb logic by @younesbelkada in #26248
add custom RMSNorm to ALL_LAYERNORM_LAYERS by @shijie-wu in #26227
Keep relevant weights in fp32 when model._keep_in_fp32_modules is set even when accelerate is not installed by @fxmarty in #26225
Fix FSMT weight sharing by @LysandreJik in #26292
update hf hub dependency to be compatible with the new tokenizers by @ArthurZucker in #26301
Porting the torchaudio kaldi fbank implementation to audio_utils by @ylacombe in #26182
More error message fixup, plus some linebreaks! by @Rocketknight1 in #26296
[QUICK FIX LINK] Update trainer.py by @SoyGema in #26293
Use CircleCI store_test_results by @ydshieh in #26223
Fix doctest CI by @ydshieh in #26324
[doc] fixed indices in obj detection example by @MKhalusova in #26343
[TTA Pipeline] Fix MusicGen test by @sanchit-gandhi in #26348
Add image to image pipeline by @LeviVasconcelos in #25393
feat: adding num_proc to load_dataset by @pphuc25 in #26326
Fixed unclosed p tags by @HanSeokhyeon in #26240
Update add_new_model.md by @NinoRisteski in #26365
Fix MusicGen logging error by @osanseviero in #26370
[docs] removed MaskFormerSwin and TimmBackbone from the table on index.md by @MKhalusova in #26347
Update tiny model information and pipeline tests by @ydshieh in #26285
Add Russian localization for README by @qweme32 in #26208
🌐 [i18n-KO] Translated audio_classification.mdx to Korean by @gabrielwithappy in #26200
[ViTMatte] Add resources by @NielsRogge in #26317
Deleted duplicate sentence by @titi-devv in #26394
added support for gradient checkpointing in ESM models by @sanjeevk-os in #26386
Fix DeepSpeed issue with Idefics by @HugoLaurencon in #26393
Add torch RMSProp optimizer by @natolambert in #26425
Fix padding for IDEFICS by @shauray8 in #26396
Update semantic_segmentation.md by @zekaouinoureddine in #26419
Fixing tokenizer when transformers is installed without tokenizers by @urialon in #26236
[FA / tests] Add use_cache tests for FA models by @younesbelkada in #26415
add bf16 mixed precision support for NPU by @statelesshz in #26163
[PEFT] Fix PEFT multi adapters support by @younesbelkada in #26407
Fix failing doctest by @LysandreJik in #26450
Update runs-on in workflow files by @ydshieh in #26435
[i18n-DE] Complete first toc chapter by @flozi00 in #26311
🌐 [i18n-KO] Translated debugging.md to Korean by @wonhyeongseo in #26246
🌐 [i18n-KO] Translated perf_train_gpu_many.md to Korean by @wonhyeongseo in #26244
optimize VRAM for calculating pos_bias in LayoutLM v2, v3 by @NormXU in #26139
Fix cos_sin device issue in Falcon model by @ydshieh in #26448
docs: change assert to raise and some small docs by @pphuc25 in #26232
change mention of decoder_input_ids to input_ids and same with decode_inputs_embeds by @tmabraham in #26406
[VITS] Fix speaker_embed device mismatch by @fakhirali in #26115
[PEFT] introducing adapter_kwargs for loading adapters from different Hub location (subfolder, revision) than the base model by @younesbelkada in #26270
Do not warn about unexpected decoder weights when loading T5EncoderModel and LongT5EncoderModel by @fleonce in #26211
fix_mbart_tied_weights by @SunMarc in #26422
Esm checkpointing by @Amelie-Schreiber in #26454
[Whisper Tokenizer] Make decoding faster after adding timestamps by @sanchit-gandhi in #26299
[docs] Update offline mode docs by @stevhliu in #26478
[docs] navigation improvement between text gen pipelines and text gen params by @MKhalusova in #26477
Skip 2 failing persimmon pipeline tests for now by @ydshieh in #26485
Avoid all-zeor attnetion mask used in testing by @ydshieh in #26469
[Flax Examples] Seq2Seq ASR Fine-Tuning Script by @sanchit-gandhi in #21764
[ASR Pipe] Improve docs and error messages by @sanchit-gandhi in #26476
Revert falcon exception by @LysandreJik in #26472
Fix num_heads in _upad_input by @fs4r in #26490
Fix requests connection error during modelcard creation by @jphme in #26518
Fix issue of canine forward requiring input_ids anyway by @marcmk6 in #26290
Fix broken link to video classification task by @HelgeS in #26487
[PEFT] Pass token when calling find_adapter_config by @younesbelkada in #26488
[core/ auto ] Fix bnb test with code revision + bug with code revision by @younesbelkada in #26431
Fix model integration ci by @ArthurZucker in #26322
[PEFT] Protect adapter_kwargs check by @younesbelkada in #26537
Remove-warns by @ArthurZucker in #26483
[Doctest] Add configuration_roformer.py by @Adithya4720 in #26530
Code-llama-nit by @ArthurZucker in #26300
add build_inputs_with_special_tokens to LlamaFast by @ArthurZucker in #26297
🌐 [i18n-KO] Translated tokenizer_summary.md to Korean by @wonhyeongseo in #26243
[i18n-DE] contribute chapter by @flozi00 in #26481
[RFC, Logging] Change warning to info by @patrickvonplaten in #26545
Add tokenizer kwargs to fill mask pipeline. by @nmcahill in #26234
[Wav2Vec2 and Co] Update init tests for PT 2.1 by @sanchit-gandhi in #26494
[AMD] Add initial version for run_tests_multi_gpu by @mfuntowicz in #26346
[Doctest] Add configuration_encoder_decoder.py by @SrijanSahaySrivastava in #26519
[InternLM] Add support for InternLM by @Rocketknight1 in #26302

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@jinhopark8345
- Add BROS (#23190)
- Update bros checkpoint (#26277)
- add bbox input validation (#26294)
@qweme32
- Add Russian localization for README (#26208)
@Bam4d
- [Mistral] Mistral-7B-v0.1 support (#26447)
@flozi00
- [i18n-DE] Complete first toc chapter (#26311)
- [i18n-DE] contribute chapter (#26481)
@wonhyeongseo
- 🌐 [i18n-KO] Translated debugging.md to Korean (#26246)
- 🌐 [i18n-KO] Translated perf_train_gpu_many.md to Korean (#26244)
- 🌐 [i18n-KO] Translated tokenizer_summary.md to Korean (#26243)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v4.34: Mistral, Persimmon, Prompt templating, Flash Attention 2, Tokenizer refactor