ValueError: [E949] Unable to align tokens for the predicted and reference docs. #13529

ykyogoku · 2024-06-17T14:48:55Z

Hello!

I am trying to train a POS tagger for Tibetan using a custom tokenizer based on Botok, but I repeatedly run into the following error:

ValueError: [E949] Unable to align tokens for the predicted and reference docs. 
It is only possible to align the docs when both texts are the same except for whitespace and capitalization. 
The predicted tokens start with: ['ཕུར་བུ་ ', 'དོན་', 'གྲུབ ཤ', 'ཤ ', 'འོ་ ', 'ཡིའུ་ ', 'ཚའེ་ ', 'བཅས་ ', 'ཀྱིས་ ', 'སྤེལ་རེས་ ']. 
The reference tokens start with: ['ཕུར་བུ་', 'དོན་གྲུབ', 'ཤ', 'འོ་', 'ཡིའུ་', 'ཚའེ་', 'བཅས་', 'ཀྱིས་', 'སྤེལ་རེས་', 'གཏམ་བཤད་'].

Another instance of the error looks as follows:

ValueError: [E949] Unable to align tokens for the predicted and reference docs. 
It is only possible to align the docs when both texts are the same except for whitespace and capitalization. 
The predicted tokens start with: ['པར་ངོས་ ', 'དང་པོ ', 'ར་ ', 'མཐུད ས ', 'གོ་ལ ', 'འི་ ', 'སྣེ་ ', 'གསུམ་པ་ ', 'ལོངས་སྤྱོད་ ', 'བྱེད ']. 
The reference tokens start with: ['པར་ངོས་', 'དང་པོ', 'ར་', 'མཐུད', 'ས', 'འི', 'གོ་ལ', 'འི་', 'སྣེ་', 'གསུམ་པ་'].

I understand what this error message means: the tokenization of the prediction does not correspond to that of the reference.
In the first example, ཤ is repeated in the prediction, while in the second example, འི་ is missing from the prediction. There are many more errors like these; I picked these two because the mismatch is visible in them. My question is where exactly the prediction is made in the training code, and how to fix the mismatch. The reference contains neither repeated nor missing characters.
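The condition stated in the error message can be checked outside of spaCy. The following is only a minimal sketch of that condition, not spaCy's actual implementation: the helper names `normalize` and `first_mismatch` are made up for illustration, and the token lists are the first few tokens copied from the first error above.

```python
from typing import List, Optional


def normalize(tokens: List[str]) -> str:
    """Join the tokens, drop all whitespace, and lowercase the result."""
    return "".join("".join(tokens).split()).lower()


def first_mismatch(predicted: List[str], reference: List[str]) -> Optional[int]:
    """Return the character index where the normalized texts diverge.

    Returns None when the texts are equal except for whitespace and
    capitalization, which is the condition named in the E949 message.
    """
    a, b = normalize(predicted), normalize(reference)
    if a == b:
        return None
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One text is a prefix of the other: they diverge where the shorter ends.
    return min(len(a), len(b))


# First few tokens from the first error message (truncated for illustration).
predicted = ['ཕུར་བུ་ ', 'དོན་', 'གྲུབ ཤ', 'ཤ ']
reference = ['ཕུར་བུ་', 'དོན་གྲུབ', 'ཤ']

idx = first_mismatch(predicted, reference)
print("alignable" if idx is None else f"texts diverge at character {idx}")
```

Because different token boundaries are fine as long as the underlying characters agree, a check like this separates harmless segmentation differences from cases where characters are duplicated or dropped.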

The following is the code for the custom tokenizer.

import spacy
from spacy.tokens import Doc
from botok import WordTokenizer
from botok.config import Config
import pickle

nlp = spacy.blank("xx")

class BoTokTokenizer:
    def __init__(self, nlp):
        config = Config(dialect_name="custom")
        self.wt = WordTokenizer(config=config)
        self.vocab = nlp.vocab

    def __call__(self, text):
        tokens = self.wt.tokenize(text, split_affixes=True)
        words = [token.text for token in tokens]
        # Marks every token as followed by a single space, regardless of the input text.
        spaces = [True] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)

    def to_bytes(self):
        return pickle.dumps(self.__dict__)

    def from_bytes(self, data):
        self.__dict__.update(pickle.loads(data))

    def to_disk(self, path, **kwargs):
        with open(path, 'wb') as file_:
            file_.write(self.to_bytes())

    def from_disk(self, path, **kwargs):
        with open(path, 'rb') as file_:
            self.from_bytes(file_.read())

@spacy.registry.tokenizers("botok_tokenizer")
def create_botok_tokenizer():
    def create_tokenizer(nlp):
        return BoTokTokenizer(nlp)

    return create_tokenizer
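In spaCy, a `Doc` built from `words` and `spaces` has as its text each word followed by a single space wherever the flag is true. As a hedged sketch of how those two lists could be derived from the input text instead of hard-coding `True`, the helper below (a hypothetical `words_and_spaces`, not part of spaCy or Botok) strips surrounding whitespace from each token and looks the token up in the original text. It assumes the token texts occur in order in the input, and it raises an error when a token cannot be found, which would surface duplicated tokens such as the repeated ཤ above.

```python
from typing import List, Tuple


def words_and_spaces(text: str, token_texts: List[str]) -> Tuple[List[str], List[bool]]:
    """Derive (words, spaces) by locating each token in the original text."""
    words: List[str] = []
    spaces: List[bool] = []
    pos = 0
    for raw in token_texts:
        word = raw.strip()
        if not word:
            continue  # skip tokens that consist only of whitespace
        start = text.find(word, pos)
        if start < 0:
            raise ValueError(f"token {word!r} not found in text after offset {pos}")
        end = start + len(word)
        words.append(word)
        # True only if the character right after the token is a plain space.
        spaces.append(text[end:end + 1] == " ")
        pos = end
    return words, spaces


words, spaces = words_and_spaces("ab cd", ["ab ", "cd"])
# Rebuild the text the same way a Doc would from words and spaces.
rebuilt = "".join(w + (" " if s else "") for w, s in zip(words, spaces))
print(words, spaces, repr(rebuilt))
```

Whether this matches what Botok returns for a given input is untested here; the sketch only illustrates the relationship between the input text, `words`, and `spaces`.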

[What I have tried so far to fix this error]

I removed the entries where the error occurs from the reference (i.e., the training dataset). Since the mismatch itself is often invisible, I removed 200 lines from the reference. However, I keep getting the same error for another entry.
I removed from the reference all entries whose tokenization does not correspond to the output of Botok in my local setup, which is the same as in the code above. Nevertheless, I still get the same error.
I reduced the size of the training and dev datasets as much as possible; with the reduced data the error does not occur.
I tried to trace the error back through the code with a debugger, but could not identify where exactly the prediction is made.
I found a similar discussion here, but it did not help me prevent the same error from occurring again and again.
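The per-entry check described in the attempts above could be sketched as follows, under the assumption that the predicted doc is obtained by running the tokenizer over the text of each reference entry. `find_unalignable` and `toy_tokenize` are hypothetical names; `toy_tokenize` is a stand-in that deliberately duplicates a character and is not Botok.

```python
from typing import Callable, List, Tuple


def strip_ws_lower(text: str) -> str:
    """Remove all whitespace and lowercase, mirroring the E949 condition."""
    return "".join(text.split()).lower()


def find_unalignable(
    entries: List[List[str]],
    tokenize: Callable[[str], List[str]],
) -> List[Tuple[int, List[str]]]:
    """Return (index, predicted tokens) for entries whose text changes when re-tokenized."""
    bad = []
    for idx, ref_tokens in enumerate(entries):
        text = "".join(ref_tokens)      # assumed reference text
        pred_tokens = tokenize(text)    # what the tokenizer produces for it
        if strip_ws_lower("".join(pred_tokens)) != strip_ws_lower(text):
            bad.append((idx, pred_tokens))
    return bad


def toy_tokenize(text: str) -> List[str]:
    """Toy tokenizer: one token per character, with every 'x' emitted twice."""
    out = []
    for ch in text:
        out.append(ch)
        if ch == "x":
            out.append(ch)
    return out


entries = [["ab", "c"], ["ax", "b"], ["d ", " e"]]
for idx, pred in find_unalignable(entries, toy_tokenize):
    print(f"entry {idx}: predicted tokens {pred}")
```

Reporting the index together with the predicted tokens makes it possible to inspect each failing entry individually instead of removing lines in bulk.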

Replies: 0 comments
