SpanRuler on Retokenized tokens links back to original token text, not the token text with a split (space) introduced. #13404
Unanswered
dextde asked this question in Help: Coding & Implementations
Replies: 0 comments
Hi,
I add a custom tokenizer `splitter` as the first stage. It correctly splits the single token into two tokens. I then detect the two split tokens using a `SpanRuler`. Note that the `SpanRuler` matches a pattern of two separate tokens (i.e. `pattern=['abc', 'efg']`) and correctly detects nothing if the pattern is the original single token (`pattern='abcefg'`).

Problem: when I print the span text from the `SpanRuler`, the text is that of the single original token, not the text of the two retokenized tokens (i.e. with a space in between).
Note that the custom retokenizer respects spaCy's non-destructive tokenization.
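Since the original code is not shown here, the following is only a minimal sketch of the setup described above, assuming a blank English pipeline and a spaCy version that ships the `span_ruler` component. The component name `split_abcefg`, the label `DEMO`, and the sample sentence are hypothetical. It splits the token with `Retokenizer.split`, matches the two resulting tokens with a `SpanRuler`, and compares `span.text` with the individual token texts.

```python
import spacy
from spacy.language import Language


@Language.component("split_abcefg")  # hypothetical component name
def split_abcefg(doc):
    # Queue the split; it is applied when the context manager exits.
    with doc.retokenize() as retokenizer:
        for token in doc:
            if token.text == "abcefg":
                # The new orths must concatenate to the original token text.
                # Heads: "abc" attaches to "efg", and "efg" heads itself.
                retokenizer.split(
                    token, ["abc", "efg"], heads=[(token, 1), (token, 1)]
                )
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("split_abcefg")
ruler = nlp.add_pipe("span_ruler")  # stores matches in doc.spans["ruler"] by default
ruler.add_patterns(
    [{"label": "DEMO", "pattern": [{"ORTH": "abc"}, {"ORTH": "efg"}]}]
)

doc = nlp("see abcefg here")  # hypothetical sample text
span = doc.spans["ruler"][0]

tokens = [t.text for t in span]  # the two retokenized tokens
joined = " ".join(tokens)        # token texts joined with an explicit space

print(tokens)     # ['abc', 'efg']
print(span.text)  # abcefg
print(joined)     # abc efg
```

In this sketch `span.text` is rebuilt from each token's text plus its trailing whitespace, and the split does not add whitespace between the new tokens, so the span text matches the original string. If the spaced form is what is wanted, joining the token texts explicitly (as in `joined`) is one way to get it.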
Actual output:
Expected output:
Thanks for any help.