
Fix crashes in PreTrainedTokenizer and PreTokenizer with Gemma 2 2B #111

Merged
2 commits merged into huggingface:main on Aug 19, 2024

Conversation

@DePasqualeOrg (Contributor)

While generating a longer output with Gemma 2 2B, swift-transformers crashes at this point in PreTrainedTokenizer:

let tokenStrings = tokens.map { model.convertIdToToken($0)! }

with this error:

Thread 1: Fatal error: Unexpectedly found nil while unwrapping an Optional value

This library uses a lot of force unwrapping, which is generally not best practice and can cause crashes. I've only changed a few instances of force unwrapping here in the tokenizers; please check whether these changes would have any adverse consequences.
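For comparison, one non-trapping alternative at that call site is to skip ids that have no vocabulary entry (a sketch of the general pattern, not necessarily the change made in this PR):

let tokenStrings = tokens.compactMap { model.convertIdToToken($0) }  // drops unrecognized ids instead of crashing

The trade-off is that unrecognized ids are then silently ignored, which is exactly the behavior discussed below.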

@pcuenca (Member) commented on Aug 1, 2024

Hi @DePasqualeOrg, thanks! Yes, I was quite liberal in the use of force-unwrapping because the library was meant to be experimental and known to be incomplete, and I preferred to crash to signal areas that need improvement. In this case, I'd rather find the cause for the crash instead of ignoring unrecognized token ids. Let me run a few tests and get back to you on this!

@DePasqualeOrg (Contributor, Author)

Thanks, @pcuenca. I think there will be more developers like me who start using this library in production. Just from my perspective, I'd like to avoid crashes as much as possible for my users. Maybe some of these force unwraps could be replaced with an error message to the console where the value is unexpectedly nil?
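As an illustration of that suggestion (a sketch only, not the diff in this PR, assuming tokens is [Int] and convertIdToToken returns String?), the force unwrap above could become a guard that logs and substitutes a placeholder:

let tokenStrings: [String] = tokens.map { id in
    guard let token = model.convertIdToToken(id) else {
        // Log instead of trapping when the vocabulary has no entry for this id
        print("PreTrainedTokenizer: no vocabulary entry for token id \(id)")
        return ""  // or a model-specific unknown-token string
    }
    return token
}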

@pcuenca (Member) commented on Aug 1, 2024

Yes, Tokenizers in particular is used in several production use cases, and error handling should certainly be better. I'm aiming to reach a feature-complete tokenizers implementation soon.

Do you happen to have a reproduction for this particular problem with Gemma 2 2B? I've been testing it a bit and actually found a couple of tokenization edge cases that failed, in the sense that the encoded tokens are not exactly identical to the ones generated by the Python/Rust "fast" implementation. But I couldn't find anything like an unexpected token id. It would be really helpful to have a way to reproduce the problem so we can get it fixed.

@DePasqualeOrg (Contributor, Author) commented on Aug 1, 2024

For whatever reason, I can't reproduce this now, but I did get the above-mentioned crash twice last night when generating longer outputs with Gemma 2 2B.

@DePasqualeOrg force-pushed the fix-tokenizer-crash branch 2 times, most recently from 9a088d5 to 22b2a2d on August 1, 2024 at 19:01
@DePasqualeOrg (Contributor, Author) commented on Aug 1, 2024

Now with this fix and Gemma 2 2B I'm getting a crash here in PreTokenizer.swift:

start = index(startIndex, offsetBy: match.range.upperBound)

with this error:

Thread 1: Fatal error: String index is out of bounds

It happens when generating output with a previous chat history. Unfortunately it's not easy for me to share a reproduction, because my app is proprietary, and the current mlx-swift-examples doesn't include this functionality.
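For reference, a bounds-checked version of that offset (a sketch written as if inside the same String extension as the original line, not necessarily what this PR does) clamps to endIndex instead of trapping:

// index(_:offsetBy:limitedBy:) returns nil instead of trapping when the
// offset would move past endIndex, so we can clamp rather than crash.
if let next = index(startIndex, offsetBy: match.range.upperBound, limitedBy: endIndex) {
    start = next
} else {
    start = endIndex
}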

@DePasqualeOrg (Contributor, Author) commented on Aug 1, 2024

The latest commit fixes the second crash for me on Gemma 2 2B.

@DePasqualeOrg changed the title from "Fix crash in PreTrainedTokenizer" to "Fix crashes in PreTrainedTokenizer and PreTokenizer with Gemma 2 2B" on Aug 1, 2024
@DePasqualeOrg (Contributor, Author)

Possibly related to these changes (although I can't say for sure, because without them it crashes): Sometimes Gemma 2 2B returns an empty response (often when the prompt is not a complete sentence), or repeats a single character in a response.

@pcuenca (Member) commented on Aug 2, 2024

Thanks a lot for spending so much time on this!

I've been investigating in parallel, and there are some Unicode complications in the Gemma tokenizer. Different strings such as "à" /* 0x61 0x300 */ and "à" /* 0xe0 */ are represented by different token ids, but Swift dictionaries with String keys consider both equal. This can explain the nil crashes you found, as a few entries end up missing from the vocab. I'm preparing a PR to address these issues; then we can test whether it solves your crashes (I think it should), and maybe the other issues too (I doubt it).
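A small self-contained snippet (illustration only, not code from the library, with made-up token ids) showing the behavior described above:

let decomposed = "a\u{0300}"   // "à" as U+0061 followed by U+0300 (combining grave accent)
let precomposed = "\u{00E0}"   // "à" as the single precomposed scalar U+00E0

print(decomposed == precomposed)          // true: String equality uses canonical equivalence
print(decomposed.unicodeScalars.count)    // 2
print(precomposed.unicodeScalars.count)   // 1

var vocab: [String: Int] = [decomposed: 100]
vocab[precomposed] = 200                  // same dictionary key under canonical equivalence: overwrites
print(vocab.count)                        // 1: one of the two token ids is lost

A vocab loaded into [String: Int] can therefore silently merge canonically equivalent entries, which could leave the reverse id-to-token mapping without an entry for one of the ids and is consistent with convertIdToToken returning nil.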

@pcuenca (Member) commented on Aug 19, 2024

Hi @DePasqualeOrg, I just merged the fixes PR; sorry for the delay. If you rebase your changes, we can merge this one too. I opened #116 for the remaining edge cases; I hope to find some time to work on a workaround soon.

@DePasqualeOrg (Contributor, Author)

Thank you! I've rebased this branch.

@pcuenca (Member) left a review:

I think the rebase did not work properly, could you please verify? 🙏

Comment on lines 31 to 39
func testGemmaAddedTokens() async throws {
    let tokenizer = try await AutoTokenizer.from(pretrained: "pcuenq/gemma-tokenizer")
    let inputIds = tokenizer("This\n\nis\na\ntest.")
    XCTAssertEqual(inputIds, [2, 1596, 109, 502, 108, 235250, 108, 2195, 235265])

    let decoded = tokenizer.decode(tokens: inputIds)
    XCTAssertEqual(decoded, "<bos>This\n\nis\na\ntest.")
}

@pcuenca (Member):

Suggested change
func testGemmaAddedTokens() async throws {
    let tokenizer = try await AutoTokenizer.from(pretrained: "pcuenq/gemma-tokenizer")
    let inputIds = tokenizer("This\n\nis\na\ntest.")
    XCTAssertEqual(inputIds, [2, 1596, 109, 502, 108, 235250, 108, 2195, 235265])
    let decoded = tokenizer.decode(tokens: inputIds)
    XCTAssertEqual(decoded, "<bos>This\n\nis\na\ntest.")
}

This test is already present in the same file.

Comment on lines 48 to 49
public let fuseUnknownTokens: Bool

@pcuenca (Member):

I think this was deleted by mistake

Comment on lines 80 to 81

fuseUnknownTokens = tokenizerConfig.fuseUnk?.boolValue ?? false
@pcuenca (Member):

Same, these lines were part of #117 (I started that thread from the branch of #113, which may have caused some confusion, sorry!)

@DePasqualeOrg (Contributor, Author)

Sorry, I think I've fixed it now.

@pcuenca (Member) commented on Aug 19, 2024

Thanks! Launching a CI run.

@pcuenca (Member) commented on Aug 19, 2024

Merging, thanks for your patience @DePasqualeOrg!

@pcuenca merged commit c088078 into huggingface:main on Aug 19, 2024 (1 check passed)