Spacy NER model extracting entire sentence instead of stopping at correct end index #12627

YasmineMh · 2023-05-12T08:39:25Z

YasmineMh
May 12, 2023

Hey, I have been using spacy version 3.4.1 to train multiple NER models on different types of data such as dates, text, and amounts. However, I have noticed that sometimes the model is not stopping at the correct end index and instead is extracting all the remaining text.

This is an example:
paragraph : <some text> date <some text>
annotation : date
prediction : date <some text>

I'm using "spacy-transformers.TransformerModel.v3" and "nlpaueb/legal-bert-small-uncased" from hugging face in my config file.

I checked the annotated data to be sure that I don't have long sentences annotated like this one.

I'm not sure if this behavior is normal, or if there are any hyperparameters that could fix it.

Thank you!

bdura · 2023-05-12T12:14:51Z

bdura
May 12, 2023

Hello @YasmineMh, that's strange. Would you have a minimal example to give (if it's not confidential), so we can explore this in more depth?

One reason that pops to mind and could explain this behavior is the interaction between spaCy and HuggingFace tokenizers (see here for a development on the problem). Is a warning issued about the size of the tokenized example?

4 replies

YasmineMh May 14, 2023
Author

Hey @bdura,
Thank you for your response!

I have prepared two examples for you to better understand the issue, (the examples are from different models)

1st Example:
paragraph : XXX Loan Number: XXX Property Name: XXX MULTIFAMILY LOAN AND SECURITY AGREEMENT - SENIORS HOUSING (Revised 9-30-2019) SUMMARY OF LOAN TERMS The following information in this Summary of Loan Terms (" Summary ") is incorporated into and deemed part of this Multifamily Loan and Security Agreement (" Loan Agreement "). Parties, Effective Date, Loan Amount Borrower(s): XXX, a Delaware limited liability company Lender: XXX, a national banking association Effective Date: 02/10/2020 Loan Amount: $24,357,000.00 Property Manager [See Section 6.09(d)] SNR 24 XXX MANAGEMENT
annotation : 02/10/2020
prediction : 02/10/2020 Loan Amount: $24,357,000.00 Property Manager [See Section 6.09(d)] SNR 24 XXX MANAGEMENT

2nd Example:
paragraph : The Company hereby offers to continue to employ the Executive, and the Executive hereby accepts such continued employment, for an initial term (the "Initial Term") commencing on October 1, 2008 and continuing through December 31, 2011, unless sooner terminated in accordance with the provisions of Section 4 or Section 5; with such employment to continue for successive one-year periods in accordance with the terms of this Agreement (subject to termination as aforesaid) unless either party notifies the other party of non-renewal in writing prior to 90 days before the expiration of the initial term and each annual renewal, as applicable (the Initial Term, together with any such extensions of employment hereunder, shall hereinafter be referred to as the "Term").
annotation : to continue
prediction : to continue for successive one-year periods in accordance with the terms of this Agreement (subject to termination as aforesaid) unless either party notifies the other party of non-renewal in writing prior to 90 days before the expiration of the initial term and each annual renewal, as applicable (the Initial Term, together with any such extensions of employment hereunder, shall hereinafter be referred to as the "Term").

I am unsure whether the warning should be issued during training or inference, as I did not receive any warning during inference. To investigate this further, I will retrain the model and check the logs.

Furthermore, I have discovered that some models were trained using spacy version 3.5.2 (on Colab). This could potentially be the cause of the issue as well.

bdura May 15, 2023

Thanks for the samples. Ok, the shape of your data and the fact that no warning was emitted tend to invalidate that theory... Unless you find out that there's a significant proportion of training data that is ill-defined.

YasmineMh May 16, 2023
Author

Hey, thanks for replying back! As I mentioned in my issue, I have checked the data, and it appears to be fine. I have also verified the length of the annotated data, which seems acceptable when compared to the incorrectly predicted data. Could you please clarify what you mean by "ill-defined"?

bdura May 16, 2023

Sure, I meant examples that would go over the 512 token limit, since it could make the model learn on corrupt examples. However, from what you've described so far I don't think that could be the source of the issue.

DuyguA · 2023-05-13T19:58:11Z

DuyguA
May 13, 2023

Reason might be the model being uncased and your data is not cased. According to my experience of building NER models, lack of casing is a huge issue in general. Do you know a Transformer based NER built upon to this model? How about using a cased model?

6 replies

DuyguA May 14, 2023

I say give a try to a cased model definitely. How about this one: https://huggingface.co/casehold/legalbert

YasmineMh May 15, 2023
Author

Yeah, I'll do. Thanks!
I haven't tested casehold/legalbert model yet, but it appears to be an uncased model.

bdura May 15, 2023

The lowercasing is usually performed on the tokenizer side, so I'm not sure it's the source of the issue.

@YasmineMh, how many training examples do you have? Could you try training with a non-transformer-based pipeline, and see if you run into the same issues? You can use spaCy's quickstart for an example configuration.

YasmineMh May 16, 2023
Author

Each model has its own set of training examples. One model has around 3k annotated entities, while the other has approximately 11k annotated entities.
I have already attempted to rerun some models using the same pre-trained model. It appears that for certain entities, the error is no longer present, but for others, there are still examples exhibiting this issue, albeit not in the same paragraphs as before.
The problem does not occur frequently in the models, and initially, I considered it to be a normal behavior. However, it seems suspicious that the end index consistently returns the end of the paragraph for those cases.

rmitsch Jun 6, 2023
Maintainer

If this error is still relevant to you: could you run spacy debug data and post the results?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spacy NER model extracting entire sentence instead of stopping at correct end index #12627

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 10 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Spacy NER model extracting entire sentence instead of stopping at correct end index #12627

YasmineMh May 12, 2023

Replies: 2 comments · 10 replies

bdura May 12, 2023

YasmineMh May 14, 2023 Author

bdura May 15, 2023

YasmineMh May 16, 2023 Author

bdura May 16, 2023

DuyguA May 13, 2023

DuyguA May 14, 2023

YasmineMh May 15, 2023 Author

bdura May 15, 2023

YasmineMh May 16, 2023 Author

rmitsch Jun 6, 2023 Maintainer

YasmineMh
May 12, 2023

Replies: 2 comments 10 replies

bdura
May 12, 2023

YasmineMh May 14, 2023
Author

YasmineMh May 16, 2023
Author

DuyguA
May 13, 2023

YasmineMh May 15, 2023
Author

YasmineMh May 16, 2023
Author

rmitsch Jun 6, 2023
Maintainer