KeyError on Simple Training Loop #6485
-
How to reproduce the behaviour

As an exercise, I wanted to compare simple scikit-learn results with spaCy 3.0. The sentiment use-case is a bit silly, but I've got a dataset that looks like:

```python
train_data = [
    (' I`d have responded, if I were going', 'neutral'),
    (' Sooo SAD I will miss you here in San Diego!!!', 'negative'),
    ('my boss is bullying me...', 'negative')
]
```

Given the dataset, I figured I'd try out the training loop explained here:

```python
import random
import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")
optimizer = nlp.initialize()
for itn in range(5):
    random.shuffle(train_data)
    for x, c in train_data:
        doc = nlp.make_doc(x)
        example = Example.from_dict(doc, {"cats": {"sentiment": c}})
        print(example)
        nlp.update([example], sgd=optimizer)
```

The print statement is able to give me:
But then the code gives me a `KeyError` that seems to come from inside thinc.
Info about spaCy
-
There are a few issues with your code:

- Calling `nlp.initialize()` re-initializes the weights of the loaded pipeline; to build on the pretrained components of `en_core_web_sm`, you want `nlp.resume_training()` instead.
- Your gold-standard data contains (only) annotations for the textcat, but the loaded pipeline has no "textcat" component, so `nlp.update` also runs components like the tagger on examples that have no annotations for them.

This will solve your KeyError. With spaCy 3, we really recommend using the new config system to train your custom pipelines. The config can source components from existing models if you want to build on top of the pretrained weights of `en_core_web_sm`.

That said, the tagger really shouldn't crash with such an ugly error if it only gets `cats` annotations.
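Concretely, a minimal sketch of a loop with both fixes applied (assumes spaCy v3.x; a blank pipeline and illustrative label names are used here so the snippet stands alone):

```python
import random
import spacy
from spacy.training import Example

# Two fixes versus the original snippet: a "textcat" component actually
# exists in the pipeline, and "cats" maps every label to a float score
# instead of holding a single string.
nlp = spacy.blank("en")  # swap in spacy.load("en_core_web_sm") if needed
textcat = nlp.add_pipe("textcat")
labels = ("positive", "negative", "neutral")
for label in labels:
    textcat.add_label(label)

train_data = [
    ("I`d have responded, if I were going", "neutral"),
    ("Sooo SAD I will miss you here in San Diego!!!", "negative"),
    ("my boss is bullying me...", "negative"),
]

examples = []
for text, gold_label in train_data:
    doc = nlp.make_doc(text)
    # One-hot cats dict: the gold label scores 1.0, all others 0.0
    cats = {label: float(label == gold_label) for label in labels}
    examples.append(Example.from_dict(doc, {"cats": cats}))

optimizer = nlp.initialize(get_examples=lambda: examples)
for itn in range(5):
    random.shuffle(examples)
    losses = nlp.update(examples, sgd=optimizer)
```

Because the "textcat" component expects mutually exclusive classes, the one-hot dict per example satisfies its validation during `initialize`.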
-
Have a look at this example v3 project that contains a workflow & config file for a binary text classifier with exclusive classes: https://github.com/explosion/projects/tree/v3/tutorials/textcat_docs_issues
-
@svlandeg thanks for the clear reply! Yeah, I should admit that my use-case is a bit ... different. I want to benchmark a lot of models/approaches (including a model made at Rasa), so for this particular use-case I've got a strong preference to make all the components scikit-learn compatible. Otherwise it's harder to automate the logging, and I might be stuck handling the stats manually, which feels too error-prone. If I had a specific problem to work on, I'd certainly prefer the command line. I'll try to get it to work with your comments and report back, so that anybody who is googling this issue can find a fix. @svlandeg would you prefer it if I open an issue for a better error message?
-
No that's fine, I've added the UX issue to my personal list, so that'll get fixed some day; it's just low priority right now ;-) Let me know if you encounter other trouble though!
-
@svlandeg I've given it another go, but now I think I'm hitting another thinc error.

```python
import random
import spacy
from spacy.training import Example
from spacy.pipeline.textcat import DEFAULT_TEXTCAT_MODEL

nlp = spacy.load("en_core_web_sm")

# Add standard textcat to the pipeline
config = {
    "threshold": 0.5,
    "model": DEFAULT_TEXTCAT_MODEL,
}
nlp.add_pipe("textcat", config=config)
```

The "textcat" should now be added, which I can confirm via:

```python
nlp.pipe_names
# ['tok2vec',
#  'tagger',
#  'parser',
#  'ner',
#  'attribute_ruler',
#  'lemmatizer',
#  'textcat']
```

Given that the pipeline looks good, I'll continue with the data and the training loop.

```python
# Very limited set of Training Data
train_data = [
    (' I`d have responded, if I were going', 'neutral'),
    (' Sooo SAD I will miss you here in San Diego!!!', 'negative'),
    ('my boss is bullying me...', 'negative')
]

# Put examples in the correct format
examples = []
for x, c in train_data:
    doc = nlp.make_doc(x)
    example = Example.from_dict(doc, {"cats": {"sentiment": c}})
    examples.append(example)

# Run the optimiser
optimizer = nlp.resume_training()
with nlp.select_pipes(enable="textcat"):
    for itn in range(5):
        random.shuffle(examples)
        nlp.update(examples, sgd=optimizer)
```

This however gives a big traceback.
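[Editor's note: my guess, not confirmed in the thread, at the missing pieces in the loop above: the freshly added `textcat` has no labels and no initialized weights, and the `cats` values are still strings. A sketch of one way to repair it (using `spacy.blank` here so the snippet stands alone; the original uses `en_core_web_sm`):]

```python
import random
import spacy
from spacy.training import Example

nlp = spacy.blank("en")  # stands in for spacy.load("en_core_web_sm") here
textcat = nlp.add_pipe("textcat")
labels = ("positive", "negative", "neutral")
for label in labels:
    textcat.add_label(label)  # the new component needs labels before training

train_data = [
    ("I`d have responded, if I were going", "neutral"),
    ("Sooo SAD I will miss you here in San Diego!!!", "negative"),
    ("my boss is bullying me...", "negative"),
]

# "cats" must map each label to a float, not hold a single string
examples = [
    Example.from_dict(
        nlp.make_doc(text),
        {"cats": {label: float(label == gold) for label in labels}},
    )
    for text, gold in train_data
]

# resume_training() leaves existing weights alone, but the brand-new
# textcat still needs its own initialization step:
optimizer = nlp.resume_training()
textcat.initialize(lambda: examples, nlp=nlp)

with nlp.select_pipes(enable="textcat"):
    for itn in range(5):
        random.shuffle(examples)
        losses = nlp.update(examples, sgd=optimizer)
```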
-
Another observation: this runs without raising an error.

```python
import random
import spacy
from spacy.training import Example
from spacy.pipeline.textcat import DEFAULT_TEXTCAT_MODEL

nlp = spacy.load("en_core_web_sm")

# Add standard textcat to the pipeline
config = {
    "threshold": 0.5,
    "model": DEFAULT_TEXTCAT_MODEL,
}
# I AM COMMENTING OUT THE LINE BELOW
# THAT MEANS NO TEXTCAT IS ADDED
# nlp.add_pipe("textcat", config=config)

# Very limited set of Training Data
train_data = [
    (' I`d have responded, if I were going', 'neutral'),
    (' Sooo SAD I will miss you here in San Diego!!!', 'negative'),
    ('my boss is bullying me...', 'negative')
]

# Put examples in the correct format
examples = []
for x, c in train_data:
    doc = nlp.make_doc(x)
    example = Example.from_dict(doc, {"cats": {"sentiment": c}})
    examples.append(example)

# Run the optimiser
optimizer = nlp.resume_training()
with nlp.select_pipes(enable="textcat"):
    for itn in range(5):
        random.shuffle(examples)
        nlp.update(examples, sgd=optimizer)
```

This runs without error, but part of me is wondering why there's no error/warning being raised. It feels like a warning would be appropriate here.
-
It looks like your code boils down to:

I initially said:

But you have to consider the components you want to train. In your case, if I understand correctly, you're actually not interested in resuming training of any of the pretrained components of the pipeline, but you do want to train the new `textcat`. That means you do need to make sure …

I still really recommend using the config system and the
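[Editor's note: to follow the config-based route recommended here, the toy data above would first be serialized to `.spacy` files. A hedged sketch (assumes spaCy v3.x; the file name is illustrative):]

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
labels = ("positive", "negative", "neutral")
train_data = [
    ("I`d have responded, if I were going", "neutral"),
    ("Sooo SAD I will miss you here in San Diego!!!", "negative"),
    ("my boss is bullying me...", "negative"),
]

# DocBin serializes doc.cats, so the gold labels travel with the docs.
db = DocBin()
for text, gold in train_data:
    doc = nlp.make_doc(text)
    doc.cats = {label: float(label == gold) for label in labels}
    db.add(doc)
db.to_disk("train.spacy")  # referenced from the config as paths.train
```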
-
@svlandeg I have used the above suggestion and re-written the code. But I get the following error:
Full Trace
Code
-
@polm Installing `spacy-lookups-data` has resolved the issue. But when I used
Code