Patch v3.0.1: Better backward compatibility for tokenizers

@thomwolf thomwolf released this 03 Jul 15:37

Better backward-compatibility for tokenizers following v3.0.0 refactoring

Version v3.0.0 included a refactoring of the tokenizers' backend to allow a simpler and more flexible user-facing API.

This refactoring was conducted with a particular focus on keeping backward compatibility for the v2.X encoding, truncation and padding API, but it still led to two breaking changes that could have been avoided.

This patch aims to restore backward compatibility with the following updates:

  • the prepare_for_model method is publicly exposed again for both slow and fast tokenizers, with an API compatible with both the v2.X truncation/padding API and the recommended v3.0 API.
  • the truncation strategy defaults to longest_first again, instead of only_first.
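
To illustrate why the default truncation strategy matters, here is a minimal pure-Python sketch of how the two strategies differ on a pair of sequences. It is a simplified illustration and not the library's implementation: the helper `truncate_pair` and its signature are hypothetical, and the tie-breaking rule (trim the second sequence when both have equal length) is an assumption made for this example.

```python
from typing import List, Tuple


def truncate_pair(
    first: List[int],
    second: List[int],
    max_length: int,
    strategy: str = "longest_first",
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: trim a pair of token-id lists to max_length total."""
    if max_length < 0:
        raise ValueError("max_length must be non-negative")
    first, second = list(first), list(second)  # work on copies
    excess = len(first) + len(second) - max_length
    if excess <= 0:
        return first, second  # already short enough

    if strategy == "only_first":
        # All removed tokens come from the first sequence.
        if excess > len(first):
            raise ValueError("first sequence is too short to absorb the truncation")
        return first[: len(first) - excess], second

    if strategy == "longest_first":
        # Remove one token at a time from whichever sequence is currently longer.
        # Assumption for this sketch: on a tie, trim the second sequence.
        for _ in range(excess):
            if len(first) > len(second):
                first.pop()
            else:
                second.pop()
        return first, second

    raise ValueError(f"unknown strategy: {strategy}")


short, long_ = [1, 2], [3, 4, 5, 6, 7, 8]
balanced = truncate_pair(short, long_, max_length=6, strategy="longest_first")
first_only = truncate_pair(short, long_, max_length=6, strategy="only_first")
print(balanced)    # ([1, 2], [3, 4, 5, 6])
print(first_only)  # ([], [3, 4, 5, 6, 7, 8])
```

With a short first sequence and a long second one, trimming only the first sequence discards it entirely, while trimming the longest sequence first keeps tokens from both. Code that relied on the v2.X behaviour without naming a strategy is affected by which of the two is the default.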

Bug fixes and improvements: