-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix crashes in PreTrainedTokenizer and PreTokenizer with Gemma 2 2B #111
Conversation
Hi @DePasqualeOrg, thanks! Yes, I was quite liberal in the use of force-unwrapping because the library was meant to be experimental and known to be incomplete, and I preferred to crash to signal areas that need improvement. In this case, I'd rather find the cause for the crash instead of ignoring unrecognized token ids. Let me run a few tests and get back to you on this! |
Thanks, @pcuenca. I think there will be more developers like me who start using this library in production. Just from my perspective, I'd like to avoid crashes as much as possible for my users. Maybe some of these force unwraps could be replaced with an error message to the console where the value is unexpectedly nil? |
Yes, Do you happen to have a reproduction for this particular problem with Gemma 2 2B? I've been testing it a bit and actually found a couple of tokenization edge cases that failed, in the sense that the encoded tokens are not exactly identical to the ones generated by the Python/Rust "fast" implementation. But I couldn't find anything like an unexpected token id. It'd be really helpful if we could have a way to reproduce to have the problem fixed. |
For whatever reason, I can't reproduce this now, but I did get the above-mentioned crash twice last night when generating longer outputs with Gemma 2 2B. |
9a088d5
to
22b2a2d
Compare
Now with this fix and Gemma 2 2B I'm getting a crash here in PreTokenizer.swift: start = index(startIndex, offsetBy: match.range.upperBound)
It happens when generating output with a previous chat history. Unfortunately it's not easy for me to share a reproduction, because my app is proprietary, and the current mlx-swift-examples doesn't include this functionality. |
fe529ed
to
089ca75
Compare
The latest commit fixes the second crash for me on Gemma 2 2B. |
089ca75
to
4a5b050
Compare
Possibly related to these changes (although I can't say for sure, because without them it crashes): Sometimes Gemma 2 2B returns an empty response (often when the prompt is not a complete sentence), or repeats a single character in a response. |
Thanks a lot for spending so much time on this! I've been investigating in parallel, and there are some Unicode complications in the Gemma tokenizer. Different strings such as "à" /* 0x61 0x300 / and "à" / 0xe0 */ are represented by different token ids, but Swift dictionaries with String keys consider both equal. This can explain the nil crashes you found, as there are a few entries missing from the vocab. I'm trying to prepare a PR to address these issues, and then we can test if it solves your crashes (I think it should), and maybe also the other issues as well (I doubt it). |
4a5b050
to
9a096df
Compare
9a096df
to
2081e31
Compare
Hi @DePasqualeOrg I just merged the fixes PR, sorry for the delay. If you rebase your changes, we can merge this one too. I opened #116 for the remaining edge cases, I hope to find some time to work on a workaround soon. |
2081e31
to
3bc30ae
Compare
Thank you! I've rebased this branch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the rebase did not work properly, could you please verify? 🙏
func testGemmaAddedTokens() async throws { | ||
let tokenizer = try await AutoTokenizer.from(pretrained: "pcuenq/gemma-tokenizer") | ||
let inputIds = tokenizer("This\n\nis\na\ntest.") | ||
XCTAssertEqual(inputIds, [2, 1596, 109, 502, 108, 235250, 108, 2195, 235265]) | ||
|
||
let decoded = tokenizer.decode(tokens: inputIds) | ||
XCTAssertEqual(decoded, "<bos>This\n\nis\na\ntest.") | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func testGemmaAddedTokens() async throws { | |
let tokenizer = try await AutoTokenizer.from(pretrained: "pcuenq/gemma-tokenizer") | |
let inputIds = tokenizer("This\n\nis\na\ntest.") | |
XCTAssertEqual(inputIds, [2, 1596, 109, 502, 108, 235250, 108, 2195, 235265]) | |
let decoded = tokenizer.decode(tokens: inputIds) | |
XCTAssertEqual(decoded, "<bos>This\n\nis\na\ntest.") | |
} |
This test is already present in the same file.
public let fuseUnknownTokens: Bool | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was deleted by mistake
|
||
fuseUnknownTokens = tokenizerConfig.fuseUnk?.boolValue ?? false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3bc30ae
to
c1e8414
Compare
Sorry, I think I've fixed it now. |
Thanks! Launching a CI run. |
Merging, thanks for your patience @DePasqualeOrg! |
When running Gemma 2 2B, while the model is generating a longer output, swift-transformers crashes at this point in
PreTrainedTokenizer
:with this error:
This library uses a lot of force unwrapping, which in general is not a best practice and can cause crashes. I've only changed a few instances of force unwrapping here in the tokenizers. Please check if these changes would have any adverse consequences.