Meaning of 'misaligned tokens' #12247

For a text like `I can't see you-know-who`, there are multiple possible tokenizations, for example `['I', 'ca', "n't", 'see', 'you', '-', 'know', '-', 'who']` or `['I', "can't", 'see', 'you-', 'know-', 'who']` or `['I', "can't", 'see', 'you-know-who']`.

A misaligned tokenization is when the tokens predicted by the tokenizer (in this case `spacy.blank("ak")`) aren't the same as the tokens in the training data.

The exact behavior depends on internal details of each component, but in general spaCy pipeline components will ignore the annotation on misaligned tokens while training, since the model can't know how the annotation on the training tokens should be mapped onto the predicted tokens.
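One way to see which tokens line up is to compare character offsets: a predicted token can only carry over gold annotation if its character span matches a gold token exactly. The sketch below is a hypothetical illustration of that idea, not spaCy's actual alignment code (spaCy uses `spacy.training.Alignment` internally, which also handles many-to-one mappings):

```python
def char_spans(text, tokens):
    """Map each token to its (start, end) character offsets in the text."""
    spans, i = [], 0
    for tok in tokens:
        start = text.index(tok, i)  # find the token at or after position i
        spans.append((start, start + len(tok)))
        i = start + len(tok)
    return spans

def aligned_tokens(text, gold, predicted):
    """Predicted tokens whose character spans exactly match a gold token."""
    gold_spans = set(char_spans(text, gold))
    return [tok for tok, span in zip(predicted, char_spans(text, predicted))
            if span in gold_spans]

text = "I can't see you-know-who"
gold = ["I", "ca", "n't", "see", "you", "-", "know", "-", "who"]
predicted = ["I", "can't", "see", "you-know-who"]
print(aligned_tokens(text, gold, predicted))  # → ['I', 'see']
```

Here only `I` and `see` align one-to-one; annotation on the other gold tokens (`ca`, `n't`, `you`, `-`, `know`, `who`) has no exact counterpart in the predicted tokenization and would be skipped during training.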
