
Fix crashes in PreTrainedTokenizer and PreTokenizer with Gemma 2 2B #111

Merged
2 commits merged into huggingface:main on Aug 19, 2024

Conversation

@DePasqualeOrg (Contributor)

While generating a longer output with Gemma 2 2B, swift-transformers crashes at this point in PreTrainedTokenizer:

let tokenStrings = tokens.map { model.convertIdToToken($0)! }

with this error:

Thread 1: Fatal error: Unexpectedly found nil while unwrapping an Optional value

This library uses a lot of force unwrapping, which is generally not best practice and can cause crashes. I've only changed a few instances of force unwrapping here in the tokenizers; please check whether these changes would have any adverse consequences.
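For comparison, one non-trapping alternative at that call site is to skip ids that have no vocabulary entry (a sketch of the general pattern, not necessarily the change made in this PR):

let tokenStrings = tokens.compactMap { model.convertIdToToken($0) }  // drops unrecognized ids instead of crashing

The trade-off is that unrecognized ids are then silently ignored, which is exactly the behavior discussed below.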

@pcuenca (Member) commented on Aug 1, 2024

Hi @DePasqualeOrg, thanks! Yes, I was quite liberal in the use of force-unwrapping because the library was meant to be experimental and known to be incomplete, and I preferred to crash to signal areas that need improvement. In this case, I'd rather find the cause for the crash instead of ignoring unrecognized token ids. Let me run a few tests and get back to you on this!

@DePasqualeOrg (Contributor, Author)

Thanks, @pcuenca. I think there will be more developers like me who start using this library in production. Just from my perspective, I'd like to avoid crashes as much as possible for my users. Maybe some of these force unwraps could be replaced with an error message to the console where the value is unexpectedly nil?
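As an illustration of that suggestion (a sketch only, not the diff in this PR, assuming tokens is [Int] and convertIdToToken returns String?), the force unwrap above could become a guard that logs and substitutes a placeholder:

let tokenStrings: [String] = tokens.map { id in
    guard let token = model.convertIdToToken(id) else {
        // Log instead of trapping when the vocabulary has no entry for this id
        print("PreTrainedTokenizer: no vocabulary entry for token id \(id)")
        return ""  // or a model-specific unknown-token string
    }
    return token
}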

@pcuenca (Member) commented on Aug 1, 2024

Yes, Tokenizers in particular is used in several production use cases, and error handling should certainly be better. I'm aiming to reach a feature-complete tokenizers implementation soon.

Do you happen to have a reproduction for this particular problem with Gemma 2 2B? I've been testing it a bit and actually found a couple of tokenization edge cases that failed, in the sense that the encoded tokens are not exactly identical to the ones generated by the Python/Rust "fast" implementation. But I couldn't find anything like an unexpected token id. It would be really helpful to have a way to reproduce the problem so we can get it fixed.

@DePasqualeOrg (Contributor, Author) commented on Aug 1, 2024

For whatever reason, I can't reproduce this now, but I did get the above-mentioned crash twice last night when generating longer outputs with Gemma 2 2B.

@DePasqualeOrg force-pushed the fix-tokenizer-crash branch 2 times, most recently from 9a088d5 to 22b2a2d on August 1, 2024 at 19:01
@DePasqualeOrg (Contributor, Author) commented on Aug 1, 2024

Now with this fix and Gemma 2 2B I'm getting a crash here in PreTokenizer.swift:

start = index(startIndex, offsetBy: match.range.upperBound)

with this error:

Thread 1: Fatal error: String index is out of bounds

It happens when generating output with a previous chat history. Unfortunately it's not easy for me to share a reproduction, because my app is proprietary, and the current mlx-swift-examples doesn't include this functionality.
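For reference, a bounds-checked version of that offset (a sketch written as if inside the same String extension as the original line, not necessarily what this PR does) clamps to endIndex instead of trapping:

// index(_:offsetBy:limitedBy:) returns nil instead of trapping when the
// offset would move past endIndex, so we can clamp rather than crash.
if let next = index(startIndex, offsetBy: match.range.upperBound, limitedBy: endIndex) {
    start = next
} else {
    start = endIndex
}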

@DePasqualeOrg (Contributor, Author) commented on Aug 1, 2024

The latest commit fixes the second crash for me on Gemma 2 2B.

@DePasqualeOrg changed the title from "Fix crash in PreTrainedTokenizer" to "Fix crashes in PreTrainedTokenizer and PreTokenizer with Gemma 2 2B" on Aug 1, 2024
@DePasqualeOrg (Contributor, Author)

Possibly related to these changes (although I can't say for sure, because without them it crashes): Sometimes Gemma 2 2B returns an empty response (often when the prompt is not a complete sentence), or repeats a single character in a response.

@pcuenca (Member) commented on Aug 2, 2024

Thanks a lot for spending so much time on this!

I've been investigating in parallel, and there are some Unicode complications in the Gemma tokenizer. Different strings such as "à" /* 0x61 0x300 */ and "à" /* 0xe0 */ are represented by different token ids, but Swift dictionaries with String keys consider both equal. This can explain the nil crashes you found, as a few entries end up missing from the vocab. I'm preparing a PR to address these issues; then we can test whether it solves your crashes (I think it should), and maybe the other issues too (I doubt it).
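A small self-contained snippet (illustration only, not code from the library, with made-up token ids) showing the behavior described above:

let decomposed = "a\u{0300}"   // "à" as U+0061 followed by U+0300 (combining grave accent)
let precomposed = "\u{00E0}"   // "à" as the single precomposed scalar U+00E0

print(decomposed == precomposed)          // true: String equality uses canonical equivalence
print(decomposed.unicodeScalars.count)    // 2
print(precomposed.unicodeScalars.count)   // 1

var vocab: [String: Int] = [decomposed: 100]
vocab[precomposed] = 200                  // same dictionary key under canonical equivalence: overwrites
print(vocab.count)                        // 1: one of the two token ids is lost

A vocab loaded into [String: Int] can therefore silently merge canonically equivalent entries, which could leave the reverse id-to-token mapping without an entry for one of the ids and is consistent with convertIdToToken returning nil.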

@pcuenca (Member) commented on Aug 19, 2024

Hi @DePasqualeOrg, I just merged the fixes PR; sorry for the delay. If you rebase your changes, we can merge this one too. I opened #116 for the remaining edge cases; I hope to find some time to work on a workaround soon.

@DePasqualeOrg (Contributor, Author)

Thank you! I've rebased this branch.

@pcuenca (Member) left a review:

I think the rebase did not work properly, could you please verify? 🙏

Comment on lines 31 to 39
func testGemmaAddedTokens() async throws {
    let tokenizer = try await AutoTokenizer.from(pretrained: "pcuenq/gemma-tokenizer")
    let inputIds = tokenizer("This\n\nis\na\ntest.")
    XCTAssertEqual(inputIds, [2, 1596, 109, 502, 108, 235250, 108, 2195, 235265])

    let decoded = tokenizer.decode(tokens: inputIds)
    XCTAssertEqual(decoded, "<bos>This\n\nis\na\ntest.")
}

@pcuenca (Member):

Suggested change
func testGemmaAddedTokens() async throws {
    let tokenizer = try await AutoTokenizer.from(pretrained: "pcuenq/gemma-tokenizer")
    let inputIds = tokenizer("This\n\nis\na\ntest.")
    XCTAssertEqual(inputIds, [2, 1596, 109, 502, 108, 235250, 108, 2195, 235265])
    let decoded = tokenizer.decode(tokens: inputIds)
    XCTAssertEqual(decoded, "<bos>This\n\nis\na\ntest.")
}

This test is already present in the same file.

Comment on lines 48 to 49
public let fuseUnknownTokens: Bool

@pcuenca (Member):

I think this was deleted by mistake

Comment on lines 80 to 81

fuseUnknownTokens = tokenizerConfig.fuseUnk?.boolValue ?? false
@pcuenca (Member):

Same, these lines were part of #117 (I started that thread from the branch of #113, which may have caused some confusion, sorry!)

@DePasqualeOrg (Contributor, Author)

Sorry, I think I've fixed it now.

@pcuenca (Member) commented on Aug 19, 2024

Thanks! Launching a CI run.

@pcuenca (Member) commented on Aug 19, 2024

Merging, thanks for your patience @DePasqualeOrg!

@pcuenca merged commit c088078 into huggingface:main on Aug 19, 2024 (1 check passed)