-
Notifications
You must be signed in to change notification settings - Fork 88
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Supporting more BERT-like models #89
Conversation
BERT and RoBERTa config files often come with a different config structure: string+int instead of int+string.
Another open problem that I've recently discovered is the way strings are compared in Swift. By default, the language uses UTF8-aware normalization techniques when comparing strings. This is great for some applications, but horrible for tokenization, especially with multilingual models. I've solved that by introducing a struct LiteralString: Hashable {
let value: String
static func ==(lhs: LiteralString, rhs: LiteralString) -> Bool {
return lhs.value.compare(rhs.value, options: .literal) == .orderedSame
}
func hash(into hasher: inout Hasher) {
hasher.combine(value)
}
} I believe it should be applicable in other places as well. Let me know what you think, @pcuenca 🤗 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot! Amazing contribution, this is in very good shape so I think we can merge soon.
@@ -23,7 +23,7 @@ extension Decoder { | |||
|
|||
enum DecoderType: String { | |||
case Sequence | |||
// case WordPiece | |||
case WordPiece |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
case TemplateProcessing | ||
case Template | ||
case ByteLevel | ||
case RobertaProcessing | ||
case Bert | ||
case Roberta | ||
|
||
static let BertProcessing = "Bert" | ||
static let RobertaProcessing = "Roberta" | ||
static let TemplateProcessing = "Template" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is worth doing, I find the aliases and the suffix removal distracting for little benefit. My original approach was to simply use the same names that appear in the json, so someone reading both could easily match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's very hard to use JSON as the ground truth, as they cone in all shapes and sizes.
Shorter name was needed for our models to work, but I've added the static variables for backward-compatibility, to avoid breaking the library for other users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say the json names used in transformers should be quite stable right now. Why do your models require shorter names? (just curious)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess no model "requires" specific names, and JSONs can always be changed, but it generally results in a snowballing set of changes that have to be applied on every platform...
In UForm I have identical tests that run for the same models on the same data across all 3 languages across ONNX, PyTorch, and CoreML. You can check them here.
If a certain behavior is standard in the more popular ports of the library (Python and JS), I assume Hugging Face may want to provide the same behavior here to encourage adoption. A lot of people would probably appreciate the portability 🤗
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we could keep the empty lines empty that'd be awesome. Otherwise no big deal, we can address all those issues in the style PR.
"BertTokenizer" : BertTokenizer.self, | ||
"CodeGenTokenizer" : CodeGenTokenizer.self, | ||
"CodeLlamaTokenizer" : CodeLlamaTokenizer.self, | ||
"FalconTokenizer" : FalconTokenizer.self, | ||
"GemmaTokenizer" : GemmaTokenizer.self, | ||
"GPT2Tokenizer" : GPT2Tokenizer.self, | ||
"LlamaTokenizer" : LlamaTokenizer.self, | ||
"T5Tokenizer" : T5Tokenizer.self, | ||
"WhisperTokenizer" : WhisperTokenizer.self, | ||
"CohereTokenizer" : CohereTokenizer.self, | ||
"PreTrainedTokenizer": BPETokenizer.self | ||
"Bert" : BertTokenizer.self, | ||
"CodeGen" : CodeGenTokenizer.self, | ||
"CodeLlama" : CodeLlamaTokenizer.self, | ||
"Falcon" : FalconTokenizer.self, | ||
"Gemma" : GemmaTokenizer.self, | ||
"GPT2" : GPT2Tokenizer.self, | ||
"Llama" : LlamaTokenizer.self, | ||
"Unigram" : UnigramTokenizer.self, | ||
"T5" : T5Tokenizer.self, | ||
"Whisper" : WhisperTokenizer.self, | ||
"Cohere" : CohereTokenizer.self, | ||
"PreTrained": BPETokenizer.self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, I think I'd rather keep the same names if possible. If I search in the project for "PreTrainedTokenizer" I'd like to see this entry.
if let tokenizerClass = TokenizerModel.knownTokenizers[tokenizerName] { | ||
return try tokenizerClass.init(tokenizerConfig: tokenizerConfig, tokenizerData: tokenizerData, addedTokens: addedTokens) | ||
} else { | ||
// If the direct lookup fails, perform a case-insensitive scan over the keys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! This is where we may want to drop the Tokenizer
suffix, in my opinion.
// Immediately mapping to `Float` values will result in exception, | ||
// when precision loss is detected. So let's convert to `Double` first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome
tokensToIds = Dictionary(uniqueKeysWithValues: vocab.map { $0.token }.enumerated().map { (LiteralString(value: $1), $0) }) | ||
bosTokenId = tokensToIds[LiteralString(value: bosToken!)] // May be nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe worth considering a String
extension that returns a LiteralString
, for readability? (Just an idea)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe 🤔
Any chance of this being merged in soon? I'm trying to use a BERT model and this PR would be a huge help :) |
I was trying out these changes since have DistilBERT model I wanted to test, but getting thrown Is there a chance support for DistilBERT is also added? |
Superseded by #137 |
Hi HF team!
I am extending our UForm repository of multimodal models to support Swift and mobile deployments, and along that way I've noticed that several classes for a broad range of BERT-like models are not yet supported by
swift-transformers
. So I've added aWordPieceDecoder
class and aliases forBertPreTokenizer
andBertProcessing
.Moreover, are you are well aware
config.json
andtokenizer.json
come in all shapes and sizes. So I've added fallback mechanisms to handle different tuple order in vocabulary listings.The current
main-dev
branch of UForm is already using this functionality from my fork. I am looking into integrating moreHub
functionality next. Please let me know what you think about this PR 🤗