Meaning of 'misaligned tokens' #12247

For a text like `I can't see you-know-who`, there are multiple possible tokenizations, for example `['I', 'ca', "n't", 'see', 'you', '-', 'know', '-', 'who']` or `['I', "can't", 'see', 'you-', 'know-', 'who']` or `['I', "can't", 'see', 'you-know-who']`.

A misaligned tokenization is when the tokens predicted by the tokenizer (in this case `spacy.blank("ak")`) aren't the same as the tokens in the training data.

The exact behavior depends on internal details of each component, but in general spaCy pipeline components will ignore the annotation on misaligned tokens while training, since the model can't know how the annotation on the training tokens should be mapped onto the predicted tokens.
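One way to see which tokens line up is to compare character offsets: a predicted token can only carry over gold annotation if its character span matches a gold token exactly. The sketch below is a hypothetical illustration of that idea, not spaCy's actual alignment code (spaCy uses `spacy.training.Alignment` internally, which also handles many-to-one mappings):

```python
def char_spans(text, tokens):
    """Map each token to its (start, end) character offsets in the text."""
    spans, i = [], 0
    for tok in tokens:
        start = text.index(tok, i)  # find the token at or after position i
        spans.append((start, start + len(tok)))
        i = start + len(tok)
    return spans

def aligned_tokens(text, gold, predicted):
    """Predicted tokens whose character spans exactly match a gold token."""
    gold_spans = set(char_spans(text, gold))
    return [tok for tok, span in zip(predicted, char_spans(text, predicted))
            if span in gold_spans]

text = "I can't see you-know-who"
gold = ["I", "ca", "n't", "see", "you", "-", "know", "-", "who"]
predicted = ["I", "can't", "see", "you-know-who"]
print(aligned_tokens(text, gold, predicted))  # → ['I', 'see']
```

Here only `I` and `see` align one-to-one; annotation on the other gold tokens (`ca`, `n't`, `you`, `-`, `know`, `who`) has no exact counterpart in the predicted tokenization and would be skipped during training.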
