SplitPreTokenizer with invert true returning array with empty string #55

Open
davidkoski opened this issue Mar 6, 2024 · 1 comment

@davidkoski
Contributor

Seen using commit 03d86ac. See also: preternatural-explore/mlx-swift-chat#8

The tokenizer for stabilityai/stablelm-2-zephyr-1_6b has a configuration like this:

  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": { 
          "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\r\n]*|\\s*[\r\n]+|\\s+(?!\\S)|\\s+"
        },
        "behavior": "Removed",
        "invert": true
      },

which ends up here:

class SplitPreTokenizer: PreTokenizer {
...
    func preTokenize(text: String) -> [String] {
        guard let pattern = pattern else { return [text] }
        return pattern.split(text, invert: invert)
    }

Given the input string "Why did the chicken cross the road? ", it returns an array containing a single empty string:

(lldb) p pattern.split(text, invert: true)
([String]) 1 value {
  [0] = ""
}

I observed that with invert set to false, it returns something that looks reasonable to me:

(lldb) p pattern.split(text, invert: false)
([String]) 10 values {
  [0] = "Why"
  [1] = " did"
  [2] = " the"
  [3] = " chicken"
  [4] = " cross"
  [5] = " the"
  [6] = " road"
  [7] = "?"
  [8] = " "
  [9] = ""
}

I am not sure what the behavior is supposed to be here. I wonder if the behavior of invert might be ... inverted? I think the configuration is correct because the Python tokenizer behaves correctly.
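To make the expected behavior concrete, here is a minimal sketch of how I understand "Removed" combined with invert. It assumes that with invert true the regex matches are the pieces to keep and the gaps between them are dropped, and that with invert false the matches act as delimiters and are dropped. The function `splitRemoved` is a hypothetical helper written for this illustration using `NSRegularExpression`; it is not the library's `split(_:invert:)`.

```swift
import Foundation

// Hypothetical helper for illustration only, not part of swift-transformers.
// Assumed semantics of behavior "Removed":
//   invert == true  -> keep the regex matches, drop the gaps between them
//   invert == false -> treat the matches as delimiters, keep only the gaps
func splitRemoved(_ text: String, pattern: String, invert: Bool) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [text] }
    let ns = text as NSString
    var pieces: [String] = []
    var cursor = 0  // end of the previous match, in UTF-16 units

    for match in regex.matches(in: text, range: NSRange(location: 0, length: ns.length)) {
        let r = match.range
        if invert {
            // Keep the matched text itself.
            if r.length > 0 { pieces.append(ns.substring(with: r)) }
        } else if r.location > cursor {
            // Keep the unmatched text that precedes this match.
            pieces.append(ns.substring(with: NSRange(location: cursor, length: r.location - cursor)))
        }
        cursor = r.location + r.length
    }
    // Keep any unmatched tail when the matches are the delimiters.
    if !invert && cursor < ns.length {
        pieces.append(ns.substring(from: cursor))
    }
    return pieces
}

// The regex from the tokenizer configuration above, written as a raw string.
let pattern = #"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"#
let input = "Why did the chicken cross the road? "

let kept = splitRemoved(input, pattern: pattern, invert: true)
let gaps = splitRemoved(input, pattern: pattern, invert: false)
print(kept)
print(gaps)
```

Under these assumptions, the invert: true call keeps the nine matched pieces ("Why", " did", ..., "?", " "), and the invert: false call keeps nothing because the matches cover the whole input. That is roughly the opposite of what I see from `pattern.split` in the lldb output above, which is why I suspect the flag is being interpreted the wrong way around.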

@pcuenca
Member

pcuenca commented Mar 9, 2024

Thanks for the report @davidkoski, I'll take a look!
