Meaning of 'misaligned tokens' #12247
-
Hello, When I ran the
While I have seen discussions on the board about misaligned tokens in reference to entity detection or matching different tokenization schemes, I believe my pipeline does not include either of those things: From config file:
I was wondering if someone could explain what the misalignment error message means in this context? |
Beta Was this translation helpful? Give feedback.
Replies: 1 comment
-
For a text like Misaligned tokenization is when the tokens predicted by the tokenizer (in this case It depends a bit on internal details for each component, in general spacy pipeline components will ignore the annotation on misaligned tokens while training, since the model can't know how the annotation on the training tokens should be converted for use with the predicted tokens. |
Beta Was this translation helpful? Give feedback.
For a text like
I can't see you-know-who
there are multiple possible tokenizations, for example['I', 'ca', "n't", 'see', 'you', '-', 'know', '-', 'who']
or['I', "can't", 'see', 'you-', 'know-', 'who']
or['I', "can't", 'see', 'you-know-who']
.Misaligned tokenization is when the tokens predicted by the tokenizer (in this case
spacy.blank("ak")
) aren't the same as the tokens in the training data.It depends a bit on internal details for each component, in general spacy pipeline components will ignore the annotation on misaligned tokens while training, since the model can't know how the annotation on the training tokens should be converted for use with the predicted tokens.