
Safetensors serialization by default, DistilWhisper, Fuyu, Kosmos-2, SeamlessM4T, Owl-v2

@LysandreJik LysandreJik released this 02 Nov 17:00
· 3278 commits to main since this release

New models

Distil-Whisper

Distil-Whisper is a distilled version of Whisper that is 6 times faster, 49% smaller, and performs within 1% word error rate (WER) on out-of-distribution data. It was proposed in the paper Robust Knowledge Distillation via Large-Scale Pseudo Labelling.

Distil-Whisper copies the entire encoder from Whisper, meaning it retains Whisper's robustness to different audio conditions. It copies only 2 decoder layers, which significantly reduces the time taken to auto-regressively generate text tokens.

Distil-Whisper is MIT licensed and directly available in the Transformers library with chunked long-form inference, Flash Attention 2 support, and Speculative Decoding. For details on using the model, refer to the following instructions.
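As an example, here is a minimal sketch of transcription through the pipeline API (the distil-whisper/distil-large-v2 checkpoint name and the audio path are assumptions):

from transformers import pipeline

# Assumed checkpoint name; pick the Distil-Whisper variant you need
asr = pipeline("automatic-speech-recognition", model="distil-whisper/distil-large-v2")

# chunk_length_s enables chunked long-form inference on audio longer than 30 seconds
result = asr("sample.mp3", chunk_length_s=15)
print(result["text"])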

Joint work from @sanchit-gandhi, @patrickvonplaten and @srush.

Fuyu


The Fuyu model was created by ADEPT, and authored by Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar.

The authors introduced Fuyu-8B, a decoder-only multimodal model based on the classic transformers architecture, with query and key normalization. A linear encoder is added to create multimodal embeddings from image inputs.

By treating image tokens like text tokens and using a special image-newline character, the model knows when an image line ends. Image positional embeddings are removed. This avoids the need for different training phases for various image resolutions. With 8 billion parameters and licensed under CC-BY-NC, Fuyu-8B is notable for its ability to handle both text and images, its impressive context size of 16K, and its overall performance.
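As a rough sketch of how this looks in practice (the prompt and image path below are illustrative):

from PIL import Image
from transformers import FuyuProcessor, FuyuForCausalLM

processor = FuyuProcessor.from_pretrained("adept/fuyu-8b")
model = FuyuForCausalLM.from_pretrained("adept/fuyu-8b")

image = Image.open("bus.png")  # assumed local image path
inputs = processor(text="Generate a coco-style caption.\n", images=image, return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=16)
# Decode only the newly generated tokens, not the prompt
caption = processor.batch_decode(generated_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
print(caption)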

Joint work from @molbap, @pcuenca, @amyeroberts, @ArthurZucker

SeamlessM4T


The SeamlessM4T model was proposed in SeamlessM4T — Massively Multilingual & Multimodal Machine Translation by the Seamless Communication team from Meta AI.

SeamlessM4T is a collection of models designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.

SeamlessM4T enables multiple tasks without relying on separate models:

  • Speech-to-speech translation (S2ST)
  • Speech-to-text translation (S2TT)
  • Text-to-speech translation (T2ST)
  • Text-to-text translation (T2TT)
  • Automatic speech recognition (ASR)

SeamlessM4TModel can perform all the above tasks, but each task also has its own dedicated sub-model.
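For instance, a minimal sketch of text-to-speech translation with the single SeamlessM4TModel (the facebook/hf-seamless-m4t-medium checkpoint name is an assumption):

from transformers import AutoProcessor, SeamlessM4TModel

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")

# T2ST: English text in, Russian speech out
text_inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
audio_array = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()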

Kosmos-2

The KOSMOS-2 model was proposed in Kosmos-2: Grounding Multimodal Large Language Models to the World by Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei.

KOSMOS-2 is a Transformer-based causal language model and is trained using the next-word prediction task on a web-scale dataset of grounded image-text pairs GRIT. The spatial coordinates of the bounding boxes in the dataset are converted to a sequence of location tokens, which are appended to their respective entity text spans (for example, a snowman followed by <patch_index_0044><patch_index_0863>). The data format is similar to “hyperlinks” that connect the object regions in an image to their text span in the corresponding caption.
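As a rough sketch of grounded captioning (the prompt and image path are illustrative):

from PIL import Image
from transformers import AutoProcessor, Kosmos2ForConditionalGeneration

processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")
model = Kosmos2ForConditionalGeneration.from_pretrained("microsoft/kosmos-2-patch14-224")

image = Image.open("snowman.png")  # assumed local image path
inputs = processor(text="<grounding> An image of", images=image, return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=64)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Post-processing splits the caption from the grounded entities and their bounding boxes
caption, entities = processor.post_process_generation(generated_text)
print(caption, entities)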

Owl-v2

OWLv2 was proposed in Scaling Open-Vocabulary Object Detection by Matthias Minderer, Alexey Gritsenko, Neil Houlsby. OWLv2 scales up OWL-ViT using self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. This results in large gains over the previous state-of-the-art for zero-shot object detection.
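For example, a minimal sketch using the zero-shot object detection pipeline (the checkpoint name, image path, and labels are assumptions):

from transformers import pipeline

detector = pipeline("zero-shot-object-detection", model="google/owlv2-base-patch16-ensemble")
predictions = detector("street.jpg", candidate_labels=["car", "bicycle", "traffic light"])
for prediction in predictions:
    print(prediction["label"], round(prediction["score"], 3), prediction["box"])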

🚨🚨🚨 Safetensors by default for torch serialization 🚨🚨🚨

Version v4.35.0 makes safetensors serialization the default. This is a significant change aimed at making users of the Hugging Face Hub, transformers, and any downstream library leveraging it safer.

The safetensors library is a safe serialization framework for machine learning tensors. It has been audited and will become the default serialization framework for several organizations (Hugging Face, EleutherAI, Stability AI).

Safetensors has been the default loading mechanism since v4.30.0, so model.safetensors files were already loaded in preference to pytorch_model.bin when present in a repository.

With v4.35.0, any call to save_pretrained for torch models will now save a safetensors file. This safetensors file is in the PyTorch format, but can be loaded in TensorFlow and Flax models alike.

⚠️ If you run into any issues with this, please let us know ASAP in the issues so that we may help you. Namely, the following errors may indicate something is up:

  • Loading a safetensors file and unexpectedly getting a warning about missing weights
  • Obtaining completely wrong/random results at inference after loading a pretrained model that you have saved in safetensors

If you wish to continue saving files in the .bin format, you can do so by specifying safe_serialization=False in all your save_pretrained calls.
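For example:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Default in v4.35.0: writes model.safetensors
model.save_pretrained("my-model")

# Opt out and keep the legacy pytorch_model.bin format
model.save_pretrained("my-model-bin", safe_serialization=False)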

Chat templates

Chat templates have been expanded with the addition of the add_generation_prompt argument to apply_chat_template(). This has also enabled us to rework the ConversationalPipeline class to use chat templates. Any model with a chat template is now automatically usable through ConversationalPipeline.
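For instance, a minimal sketch (the checkpoint name and message are illustrative; any model with a chat template works):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [{"role": "user", "content": "How many helicopters can a human eat in one sitting?"}]

# add_generation_prompt appends the tokens that cue the model to start its reply
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)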

Guides

Two new guides on LLMs were added to the library:

Quantization

Exllama-v2 integration

Exllama-v2 provides better GPTQ kernels for higher throughput and lower latency with GPTQ models. The original code can be found here.

You will need the latest versions of optimum and auto-gptq. Read more about the integration here.
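A minimal sketch of enabling the ExLlamaV2 kernels (the GPTQ checkpoint name is an assumption):

from transformers import AutoModelForCausalLM, GPTQConfig

# exllama_config selects the kernel version; {"version": 2} enables the ExLlamaV2 kernels
gptq_config = GPTQConfig(bits=4, exllama_config={"version": 2})
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GPTQ",  # assumed GPTQ-quantized checkpoint
    device_map="auto",
    quantization_config=gptq_config,
)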

AWQ integration

AWQ is a new and popular quantization scheme, already used in various libraries such as TGI and vLLM, and known to be faster than GPTQ according to some benchmarks. The original code can be found here, and you can read more about the original paper here.


We support AWQ inference with the original kernels as well as with the kernels provided through the autoawq package, which you can install with pip install autoawq.
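For example, a minimal sketch of loading an AWQ-quantized checkpoint (the model id is an assumption; the quantization config is read from the checkpoint itself):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/zephyr-7B-alpha-AWQ"  # assumed AWQ-quantized checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")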

We also provide an example script showing how to push quantized weights to the Hub in the original repository.

Read more about the benchmarks and the integration here.

GPTQ on CPU!

You can now run GPTQ models on CPU using the latest version of auto-gptq, thanks to @vivekkhandelwal1!

Attention mask refactor

We refactored the attention mask logic for major models in transformers. For instance, we removed the padding_mask argument, which was ambiguous for some users.

Flash Attention 2 for more models + quantization fine-tuning bug fix

GPT-BigCode (StarCoder), Whisper, BART, and MBART now support FA2! Use it by simply passing use_flash_attention_2=True to from_pretrained. Some bugfixes with respect to mixed-precision training with FA2 have also been addressed.
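For example, enabling FA2 at load time looks like this (the checkpoint name is an assumption; FA2 requires a supported GPU and half precision):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoderbase-1b",  # assumed checkpoint
    torch_dtype=torch.float16,
    use_flash_attention_2=True,
)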

A bug affecting fine-tuning with FA2 in bfloat16 was also fixed. You should now be able to smoothly fine-tune FA2 models in bfloat16 on top of quantized base models.

NEFTune

NEFTune is a new technique to boost supervised fine-tuning performance by adding random noise to the embedding vectors. Read more about it in the original paper here.


We propose a very simple API for users to benefit from this technique: simply pass a valid neftune_noise_alpha parameter to TrainingArguments.
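For instance, a minimal sketch (the alpha value is illustrative):

from transformers import TrainingArguments

# neftune_noise_alpha enables NEFTune noise injection during training
args = TrainingArguments(output_dir="out", neftune_noise_alpha=5.0)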

Read more about the API here.

Gradient checkpointing refactor

We have refactored the gradient checkpointing API so that users can pass keyword arguments supported by torch.utils.checkpoint.checkpoint directly through gradient_checkpointing_kwargs when calling gradient_checkpointing_enable(), e.g.:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})

gradient_checkpointing_kwargs is also supported with Trainer through TrainingArguments.
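For instance, a minimal sketch of the Trainer path:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)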

The refactor should be fully backward compatible with previous behaviour. Power users can still use the gradient_checkpointing attribute on a model's submodules to control the activation/deactivation of gradient checkpointing.

Breaking changes

  • 🚨🚨🚨 [Quantization] Store the original dtype in the config as a private attribute 🚨🚨🚨 by @younesbelkada in #26761
  • 🚨🚨 Generate: change order of ops in beam sample to avoid nans by @gante in #26843
  • 🚨🚨 Raise error when no speaker embeddings in speecht5._generate_speech by @ylacombe in #26418

Bugfixes and improvements

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @jungnerd
    • 🌐 [i18n-KO] Translated semantic_segmentation.md to Korean (#26515)
  • @statelesshz
    • Extend Trainer to enable Ascend NPU to use the fused Adamw optimizer when training (#26194)
    • remove SharedDDP as it is deprecated (#25702)
    • remove the obsolete code related to fairscale FSDP (#26651)
    • make tests of pytorch_example device agnostic (#27081)
    • Device agnostic trainer testing (#27131)
    • deprecate function get_default_device in tools/base.py (#26774)
    • device agnostic pipelines testing (#27129)
    • device agnostic models testing (#27146)
    • device agnostic fsdp testing (#27120)
    • Reproducible checkpoint for npu (#27208)
  • @sgugger
  • @yyLeaves
    • add zh translation for installation (#26084)
    • [i18n-ZH] Translated fast_tokenizers.md to Chinese (#26910)
    • 🌐 [i18n-ZH] Translate multilingual into Chinese (#26935)
    • 🌐 [i18n-ZH] Translate create_a_model.md into Chinese (#27026)
    • 🌐 [i18n-ZH] Translate custom_models.md into Chinese (#27065)
    • 🌐 [i18n-ZH] Translate serialization.md into Chinese (#27076)
    • 🌐 [i18n-ZH] Translate tflite.md into Chinese (#27134)
  • @sinking-point
    • In assisted decoding, pass model_kwargs to model's forward call (fix prepare_input_for_generation in all models) (#25242)
  • @rajveer43
    • add japanese documentation (#26138)
    • Translating en/internal folder docs to Japanese 🇯🇵 (#26747)
    • Refactor code part in documentation translated to japanese (#26900)
    • Translating en/main_classes folder docs to Japanese 🇯🇵 (#26894)
  • @alvarorichard
    • translation brazilian portuguese (#26769)
  • @hakunamatata1997
    • Added Telugu [te] translations (#26828)
    • Added Telugu [te] translation for README.md in main (#27077)
  • @jiaqiw09
    • Translate pipeline_tutorial.md to chinese (#26954)
    • translate preprocessing.md to Chinese (#26955)
    • translate transformers_agents.md to Chinese (#27046)
    • translate traning.md to chinese (#27122)
    • Translate task summary to chinese (#27180)
  • @neggles
    • Add TensorFlow implementation of ConvNeXTv2 (#25558)