SpanRuler on Retokenized tokens links back to original token text, not the token text with a split (space) introduced. #13404

dextde · 2024-03-30T02:31:21Z

Hi,

I add a custom splitter component as the first pipeline stage. It correctly splits the single token into two tokens.
I then match the two split tokens using a SpanRuler. Note that the SpanRuler matches a pattern of two separate tokens (i.e. pattern=['abc', 'efg']) and correctly matches nothing if the pattern is the original single token (pattern='abcefg').

Problem: when I print the text of the span produced by the SpanRuler, it is the original single token's text, not the text of the two retokenized tokens (i.e. with a space in between).

Note that the custom retokenizer respects spaCy's non-destructive tokenization.

import spacy
from spacy.language import Language

@Language.component('splitter')
def splitter(doc):
    with doc.retokenize() as retokenizer:
        retokenizer.split(doc[0], ['abc', 'efg'], heads=[doc[0], doc[0]])
    return doc

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('splitter', first=True)
sp_ruler = nlp.add_pipe('span_ruler')
sp_ruler.add_patterns([{'label': 'testing', 'pattern': [{'TEXT': 'abc'}, {'TEXT': 'efg'}]}])

doc = nlp('abcefg')

Actual Output:

print([(tok.text, i) for i, tok in enumerate(doc)])
print([(type(span), span.text, span.label_) for span in doc.spans["ruler"]])
print(len(doc.spans['ruler']))

> [('abc', 0), ('efg', 1)]
> [(<class 'spacy.tokens.span.Span'>, 'abcefg', 'testing')]
> 1

Expected output:

print([(tok.text, i) for i, tok in enumerate(doc)])
print([(type(span), span.text, span.label_) for span in doc.spans["ruler"]])
print(len(doc.spans['ruler']))

> [('abc', 0), ('efg', 1)]
> [(<class 'spacy.tokens.span.Span'>, 'abc efg', 'testing')]  # notice the space in the text, expected due to custom re-tokenization
> 1
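
For reference, here is a minimal sketch of what seems to be going on, assuming that span.text is assembled from each token's text plus its trailing whitespace, and that the split leaves the first subtoken with no trailing whitespace. It uses spacy.blank('en') instead of en_core_web_sm so it runs without a model download, and the component name 'splitter_demo' is made up for this example. Joining the token texts by hand gives the spaced form.

```python
import spacy
from spacy.language import Language


# Hypothetical component name, chosen to avoid clashing with 'splitter' above.
@Language.component("splitter_demo")
def splitter_demo(doc):
    with doc.retokenize() as retokenizer:
        # The subtoken texts must concatenate to the original token text.
        retokenizer.split(doc[0], ["abc", "efg"], heads=[doc[0], doc[0]])
    return doc


# Assumption: a blank English pipeline is enough to reproduce the behaviour.
nlp = spacy.blank("en")
nlp.add_pipe("splitter_demo", first=True)
ruler = nlp.add_pipe("span_ruler")
ruler.add_patterns(
    [{"label": "testing", "pattern": [{"TEXT": "abc"}, {"TEXT": "efg"}]}]
)

doc = nlp("abcefg")
span = doc.spans["ruler"][0]

# The first subtoken has no trailing whitespace, so the span text has no space.
print(repr(doc[0].whitespace_))  # ''
print(span.text)                 # abcefg

# Joining the token texts explicitly yields the spaced form.
joined = " ".join(tok.text for tok in span)
print(joined)                    # abc efg
```

So the span does cover both tokens; only the rendered text lacks the space, because the original string 'abcefg' never contained one.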

Thanks for any help.

Replies: 0 comments