Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tokenizer suggestions #63

Open
davidkoski opened this issue Mar 14, 2024 · 2 comments
Open

tokenizer suggestions #63

davidkoski opened this issue Mar 14, 2024 · 2 comments

Comments

@davidkoski
Copy link
Contributor

I have a couple of suggestions for the tokenizer API -- things that I have needed to work around here: https://github.com/ml-explore/mlx-swift-examples/blob/main/Libraries/LLM/Tokenizer.swift

  • add eosToken / eosTokenId to the Tokenizer protocol

    • this is needed to know when to stop producing tokens
    • Tokenizer already has unknownToken
    • I don't know if any of the other special tokens should be exposed, e.g. bosToken
  • have a way to add to TokenizerModel/knownTokenizers or otherwise handle unknown tokenizers

    • right now it would probably be sufficient to map to "PreTrainedTokenizer": BPETokenizer.self
    • but in the future this might need to be more flexible
    • TokenizerModel is internal as are the various classes like BPETokenizer
    • in my workaround I mapped string -> string, e.g. "Qwen2Tokenizer": "PreTrainedTokenizer", which is perhaps the right level -- not exposing too much of the implementation
    • anyway, some kind of API to allow registration of overrides like this or perhaps just "PreTrainedTokenizer" as a fallback for now

If these fit in with the vision for the tokenizer API, please consider them!

Thanks

davidkoski added a commit to ml-explore/mlx-swift-examples that referenced this issue Mar 14, 2024
- swift-tokenizers is getting a lot of updates and fixes, let's track main for now
- remove some workarounds that are no longer needed

- huggingface/swift-transformers#63
@pcuenca
Copy link
Member

pcuenca commented Mar 14, 2024

  • Expose eos: +1, I saw your use in mlx-swift and it makes sense. I'll open a PR soon.
  • Also thought about the fallback tokenizer method but thought it was better to explicitly fail on unsupported tokenizers. A mechanism for the user to register overrides does look interesting!

davidkoski added a commit to ml-explore/mlx-swift-examples that referenced this issue Mar 14, 2024
* switch swift-tokenizers to main, remove some workarounds

- swift-tokenizers is getting a lot of updates and fixes, let's track main for now
- remove some workarounds that are no longer needed

- huggingface/swift-transformers#63
@pcuenca
Copy link
Member

pcuenca commented Mar 15, 2024

Opened #70 and ml-explore/mlx-swift-examples#28 to address the first topic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants