KeyError on Simple Training Loop #6485

koaning · 2020-12-02T19:30:48Z

koaning
Dec 2, 2020

How to reproduce the behaviour

As an exercise, I wanted to compare simple scikit-learn results with spaCy 3.0. The sentiment use-case is a bit silly but I've got a dataset that looks like;

train_data = [
 (' I`d have responded, if I were going', 'neutral'),
 (' Sooo SAD I will miss you here in San Diego!!!', 'negative'),
 ('my boss is bullying me...', 'negative')
]

Given the dataset, I figured I'd try out the training loop explained below here.

import random
import spacy 
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")

optimizer = nlp.initialize()
for itn in range(5):
    random.shuffle(train_data)
    for x, c in train_data:
        doc = nlp.make_doc(x)
        example = Example.from_dict(doc, {"cats": {"sentiment": c}})
        print(example)
        nlp.update([example], sgd=optimizer)

The print statement is able to give me;

{'doc_annotation': {'cats': {'sentiment': 'negative'}, 'entities': ['O', 'O', 'O', 'O', 'O', 'O'], 'links': {}}, 'token_annotation': {'ORTH': ['my', 'boss', 'is', 'bullying', 'me', '...'], 'SPACY': [True, True, True, True, False, False], 'TAG': ['', '', '', '', '', ''], 'LEMMA': ['', '', '', '', '', ''], 'POS': ['', '', '', '', '', ''], 'MORPH': ['', '', '', '', '', ''], 'HEAD': [0, 1, 2, 3, 4, 5], 'DEP': ['', '', '', '', '', ''], 'SENT_START': [1, 0, 0, 0, 0, 0]}}

But then the code gives me a key-error that seems internal to thinc.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-32-7e71f2f37f86> in <module>
      8         example = Example.from_dict(doc, {"cats": {"sentiment": c}})
      9         print(example)
---> 10         nlp.update([example], sgd=optimizer)

~/Development/tokenwiser/venv/lib/python3.7/site-packages/spacy/language.py in update(self, examples, _, drop, sgd, losses, component_cfg, exclude)
   1093             if name in exclude or not hasattr(proc, "update"):
   1094                 continue
-> 1095             proc.update(examples, sgd=None, losses=losses, **component_cfg[name])
   1096         if sgd not in (None, False):
   1097             for name, proc in self.pipeline:

~/Development/tokenwiser/venv/lib/python3.7/site-packages/spacy/pipeline/tagger.pyx in spacy.pipeline.tagger.Tagger.update()

~/Development/tokenwiser/venv/lib/python3.7/site-packages/spacy/pipeline/tagger.pyx in spacy.pipeline.tagger.Tagger.get_loss()

~/Development/tokenwiser/venv/lib/python3.7/site-packages/thinc/loss.py in __call__(self, guesses, truths)
    145         self, guesses: List[Floats2d], truths: List[Union[Ints1d, Floats2d]]
    146     ) -> Tuple[List[Floats2d], float]:
--> 147         grads = self.get_grad(guesses, truths)
    148         loss = self._get_loss_from_grad(grads)
    149         return grads, loss

~/Development/tokenwiser/venv/lib/python3.7/site-packages/thinc/loss.py in get_grad(self, guesses, truths)
    158         d_scores = []
    159         for yh, y in zip(guesses, truths):
--> 160             d_yh = self.cc.get_grad(yh, y)
    161             if self.normalize:
    162                 d_yh /= n

~/Development/tokenwiser/venv/lib/python3.7/site-packages/thinc/loss.py in get_grad(self, guesses, truths)
     92 
     93     def get_grad(self, guesses: Floats2d, truths: IntsOrFloatsOrStrs) -> Floats2d:
---> 94         target, mask = self.convert_truths(truths, guesses)
     95         if guesses.shape != target.shape:  # pragma: no cover
     96             err = f"Cannot calculate CategoricalCrossentropy loss: mismatched shapes: {guesses.shape} vs {target.shape}."

~/Development/tokenwiser/venv/lib/python3.7/site-packages/thinc/loss.py in convert_truths(self, truths, guesses)
     75                         truths[i] = self.names[0]
     76                         missing.append(i)
---> 77                 truths = [self._name_to_i[name] for name in truths]
     78             truths = xp.asarray(truths, dtype="i")
     79         else:

~/Development/tokenwiser/venv/lib/python3.7/site-packages/thinc/loss.py in <listcomp>(.0)
     75                         truths[i] = self.names[0]
     76                         missing.append(i)
---> 77                 truths = [self._name_to_i[name] for name in truths]
     78             truths = xp.asarray(truths, dtype="i")
     79         else:

KeyError: ''

Info about spaCy

spaCy version: 3.0.0rc2
Platform: Linux-5.4.0-7642-generic-x86_64-with-Pop-20.04-focal
Python version: 3.7.9
Pipelines: en_core_web_sm (3.0.0a0)

Answered by svlandeg

Dec 6, 2020

It looks like your code boils down to:

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textcat", config=config)
optimizer = nlp.resume_training()
with nlp.select_pipes(enable="textcat"):
    ....

I inially said

Calling nlp.initialize() basically wipes the internal weights of all the components in the pretrained model. Instead, you want to call resume_training if you want to further fine-tune an existing model.

But you have to consider the components you want to train. In your case, if I understand correctly, you're actually not interested in resuming training of any of the pretrained components of the pipeline, but you do want to train the new textcat. That means you do need to make s…

View full answer

svlandeg · 2020-12-02T21:16:58Z

svlandeg
Dec 2, 2020
Maintainer

There are a few issues with your code:

nlp = spacy.load("en_core_web_sm")
...
optimizer = nlp.initialize()
...

Calling nlp.initialize() basically wipes the internal weights of all the components in the pretrained model. Instead, you want to call resume_training if you want to further fine-tune an existing model.

example = Example.from_dict(doc, {"cats": {"sentiment": c}})

Your gold-standard data contains (only) annotations for the textcat, but there is no textcat component in your pipeline. You should add one, and then you can call a context manager to only activate the textcat for training:

with nlp.select_pipes(enable="textcat"):
    for ...

This will solve your KeyError.

With spaCy 3, we really recommend using the new config system to train your custom pipelines. The config can source components from existing models if you want to build on top of the pretrained weights of en_core_web_sm. Once you've experimented a bit with using the config, you'll see that it's actually much easier to work with than implementing your own custom training loop and taking care of all the details correctly ;-)

That said - the tagger really shouldn't crash with such an ugly error if it only gets '' input ;-)

0 replies

svlandeg · 2020-12-02T21:18:49Z

svlandeg
Dec 2, 2020
Maintainer

Have a look at this example v3 project that contains a workflow & config file for a binary text classifier with exclusive classes: https://github.com/explosion/projects/tree/v3/tutorials/textcat_docs_issues

0 replies

koaning · 2020-12-03T12:37:21Z

koaning
Dec 3, 2020
Author

@svlandeg thanks for the clear reply!

Yeah, I should admit that my use-case is a bit ... different. I want to benchmark a lot of models/approaches (including a model made at Rasa) so for this particular use-case, I've got a strong preference to make all the components scikit-learn compatible. Otherwise, it's harder to automate the logging and I might be stuck handling the stats manually, which feels too error-prone. If I had a specific problem to be working on I'd certainly prefer the command-line.

I'll try to get it to work with your comments and report back such that anybody who is googling this issue can find a fix.

@svlandeg would you prefer it if I add an issue for a better error message?

0 replies

svlandeg · 2020-12-03T12:39:35Z

svlandeg
Dec 3, 2020
Maintainer

No that's fine, I've added the UX issue to my personal list so that'll get fixed some day, it's just low priority right now ;-)

Let me know if you encounter other trouble though!

0 replies

koaning · 2020-12-06T20:34:27Z

koaning
Dec 6, 2020
Author

@svlandeg I've given it another go, but now I think I'm hitting another thinc error.

import random

import spacy
from spacy.training import Example
from spacy.pipeline.textcat import DEFAULT_TEXTCAT_MODEL

nlp = spacy.load("en_core_web_sm")

# Add standard textcat to the pipeline
config = {
   "threshold": 0.5,
   "model": DEFAULT_TEXTCAT_MODEL,
}
nlp.add_pipe("textcat", config=config)

The "textcat" should now be added and I can confirm via;

nlp.pipe_names
# ['tok2vec',
#  'tagger',
#  'parser',
#  'ner',
#  'attribute_ruler',
#  'lemmatizer',
#  'textcat']

Given that the pipeline looks good, I'll continue with the data and the training loop.

# Very limited set of Training Data
train_data = [
 (' I`d have responded, if I were going', 'neutral'),
 (' Sooo SAD I will miss you here in San Diego!!!', 'negative'),
 ('my boss is bullying me...', 'negative')
]


# Put examples in the correct format
examples = []
for x, c in train_data:
    doc = nlp.make_doc(x)
    example = Example.from_dict(doc, {"cats": {"sentiment": c}})
    examples.append(example)
    

# Run the optimiser 
optimizer = nlp.resume_training()
with nlp.select_pipes(enable="textcat"):
    for itn in range(5):
        random.shuffle(examples)
        nlp.update(examples, sgd=optimizer)

This however gives a big traceback.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-3b663b4e2efb> in <module>
     36     for itn in range(5):
     37         random.shuffle(examples)
---> 38         nlp.update(examples, sgd=optimizer)

~/Development/tokenwiser/venv/lib/python3.7/site-packages/spacy/language.py in update(self, examples, _, drop, sgd, losses, component_cfg, exclude)
   1093             if name in exclude or not hasattr(proc, "update"):
   1094                 continue
-> 1095             proc.update(examples, sgd=None, losses=losses, **component_cfg[name])
   1096         if sgd not in (None, False):
   1097             for name, proc in self.pipeline:

~/Development/tokenwiser/venv/lib/python3.7/site-packages/spacy/pipeline/textcat.py in update(self, examples, drop, set_annotations, sgd, losses)
    219             return losses
    220         set_dropout_rate(self.model, drop)
--> 221         scores, bp_scores = self.model.begin_update([eg.predicted for eg in examples])
    222         loss, d_scores = self.get_loss(examples, scores)
    223         bp_scores(d_scores)

~/Development/tokenwiser/venv/lib/python3.7/site-packages/thinc/model.py in begin_update(self, X)
    304         and returns the gradient with respect to the input.
    305         """
--> 306         return self._func(self, X, is_train=True)
    307 
    308     def predict(self, X: InT) -> OutT:

~/Development/tokenwiser/venv/lib/python3.7/site-packages/thinc/layers/chain.py in forward(model, X, is_train)
     52     callbacks = []
     53     for layer in model.layers:
---> 54         Y, inc_layer_grad = layer(X, is_train=is_train)
     55         callbacks.append(inc_layer_grad)
     56         X = Y

~/Development/tokenwiser/venv/lib/python3.7/site-packages/thinc/model.py in __call__(self, X, is_train)
    286         """Call the model's `forward` function, returning the output and a
    287         callback to compute the gradients via backpropagation."""
--> 288         return self._func(self, X, is_train=is_train)
    289 
    290     def initialize(self, X: Optional[InT] = None, Y: Optional[OutT] = None) -> "Model":

~/Development/tokenwiser/venv/lib/python3.7/site-packages/thinc/layers/concatenate.py in forward(model, X, is_train)
     42 
     43 def forward(model: Model[InT, OutT], X: InT, is_train: bool) -> Tuple[OutT, Callable]:
---> 44     Ys, callbacks = zip(*[layer(X, is_train=is_train) for layer in model.layers])
     45     if isinstance(Ys[0], list):
     46         return _list_forward(model, X, Ys, callbacks, is_train)  # type: ignore

~/Development/tokenwiser/venv/lib/python3.7/site-packages/thinc/layers/concatenate.py in <listcomp>(.0)
     42 
     43 def forward(model: Model[InT, OutT], X: InT, is_train: bool) -> Tuple[OutT, Callable]:
---> 44     Ys, callbacks = zip(*[layer(X, is_train=is_train) for layer in model.layers])
     45     if isinstance(Ys[0], list):
     46         return _list_forward(model, X, Ys, callbacks, is_train)  # type: ignore

~/Development/tokenwiser/venv/lib/python3.7/site-packages/thinc/model.py in __call__(self, X, is_train)
    286         """Call the model's `forward` function, returning the output and a
    287         callback to compute the gradients via backpropagation."""
--> 288         return self._func(self, X, is_train=is_train)
    289 
    290     def initialize(self, X: Optional[InT] = None, Y: Optional[OutT] = None) -> "Model":

~/Development/tokenwiser/venv/lib/python3.7/site-packages/thinc/layers/chain.py in forward(model, X, is_train)
     52     callbacks = []
     53     for layer in model.layers:
---> 54         Y, inc_layer_grad = layer(X, is_train=is_train)
     55         callbacks.append(inc_layer_grad)
     56         X = Y

~/Development/tokenwiser/venv/lib/python3.7/site-packages/thinc/model.py in __call__(self, X, is_train)
    286         """Call the model's `forward` function, returning the output and a
    287         callback to compute the gradients via backpropagation."""
--> 288         return self._func(self, X, is_train=is_train)
    289 
    290     def initialize(self, X: Optional[InT] = None, Y: Optional[OutT] = None) -> "Model":

~/Development/tokenwiser/venv/lib/python3.7/site-packages/thinc/layers/with_cpu.py in forward(model, X, is_train)
     24 
     25 def forward(model: Model, X: Any, is_train: bool) -> Tuple[Any, Callable]:
---> 26     cpu_outputs, backprop = model.layers[0].begin_update(_to_cpu(X))
     27     gpu_outputs = _to_device(model.ops, cpu_outputs)
     28 

~/Development/tokenwiser/venv/lib/python3.7/site-packages/thinc/model.py in begin_update(self, X)
    304         and returns the gradient with respect to the input.
    305         """
--> 306         return self._func(self, X, is_train=True)
    307 
    308     def predict(self, X: InT) -> OutT:

~/Development/tokenwiser/venv/lib/python3.7/site-packages/thinc/layers/chain.py in forward(model, X, is_train)
     52     callbacks = []
     53     for layer in model.layers:
---> 54         Y, inc_layer_grad = layer(X, is_train=is_train)
     55         callbacks.append(inc_layer_grad)
     56         X = Y

~/Development/tokenwiser/venv/lib/python3.7/site-packages/thinc/model.py in __call__(self, X, is_train)
    286         """Call the model's `forward` function, returning the output and a
    287         callback to compute the gradients via backpropagation."""
--> 288         return self._func(self, X, is_train=is_train)
    289 
    290     def initialize(self, X: Optional[InT] = None, Y: Optional[OutT] = None) -> "Model":

~/Development/tokenwiser/venv/lib/python3.7/site-packages/thinc/layers/sparselinear.pyx in thinc.layers.sparselinear.forward()

~/Development/tokenwiser/venv/lib/python3.7/site-packages/thinc/layers/sparselinear.pyx in thinc.layers.sparselinear._begin_cpu_update()

~/Development/tokenwiser/venv/lib/python3.7/site-packages/thinc/model.py in get_dim(self, name)
    173         if value is None:
    174             err = f"Cannot get dimension '{name}' for model '{self.name}': value unset"
--> 175             raise ValueError(err)
    176         else:
    177             return value

ValueError: Cannot get dimension 'nO' for model 'sparse_linear': value unset

0 replies

koaning · 2020-12-06T20:37:35Z

koaning
Dec 6, 2020
Author

Another observation, this runs without raising an error.

import random

import spacy
from spacy.training import Example
from spacy.pipeline.textcat import DEFAULT_TEXTCAT_MODEL

nlp = spacy.load("en_core_web_sm")

# Add standard textcat to the pipeline
config = {
   "threshold": 0.5,
   "model": DEFAULT_TEXTCAT_MODEL,
}
# I AM COMMENTING OUT THE LINES BELOW
# THAT MEANS NO TEXTCAT IS ADDED 
# nlp.add_pipe("textcat", config=config)


# Very limited set of Training Data
train_data = [
 (' I`d have responded, if I were going', 'neutral'),
 (' Sooo SAD I will miss you here in San Diego!!!', 'negative'),
 ('my boss is bullying me...', 'negative')
]


# Put examples in the correct format
examples = []
for x, c in train_data:
    doc = nlp.make_doc(x)
    example = Example.from_dict(doc, {"cats": {"sentiment": c}})
    examples.append(example)
    

# Run the optimiser 
optimizer = nlp.resume_training()
with nlp.select_pipes(enable="textcat"):
    for itn in range(5):
        random.shuffle(examples)
        nlp.update(examples, sgd=optimizer)

This runs without error, but part of me is wondering why there's no error/warning being raised. It feels like nlp.select_pipes(enable="textcat") should take notice that there is no textcat.

0 replies

svlandeg · 2020-12-06T23:16:23Z

svlandeg
Dec 6, 2020
Maintainer

It looks like your code boils down to:

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textcat", config=config)
optimizer = nlp.resume_training()
with nlp.select_pipes(enable="textcat"):
    ....

I inially said

Calling nlp.initialize() basically wipes the internal weights of all the components in the pretrained model. Instead, you want to call resume_training if you want to further fine-tune an existing model.

But you have to consider the components you want to train. In your case, if I understand correctly, you're actually not interested in resuming training of any of the pretrained components of the pipeline, but you do want to train the new textcat. That means you do need to make sure that the textcat is properly initialized, e.g. by calling nlp.initialize with the examples (containing the gold labels that will determine the output dimension nO). Just make sure you do this AFTER you selected the textcat and disabled the other components:

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textcat", config=config)
with nlp.select_pipes(enable="textcat"):
    optimizer = nlp.initialize(lambda: examples)
    ...

I still really recommend using the config system and the spacy train command for all of this though! You'll be much less likely to run into issues like these.

0 replies

kjsr7 · 2021-11-20T03:52:08Z

kjsr7
Nov 20, 2021

@svlandeg I have used the above suggestion and have re-written the code. By I get the following error:

ValueError: [E955] Can't find table(s) lexeme_norm for language 'en' in spacy-lookups-data. Make sure you have the package installed or provide your own lookup tables if no default lookups are available for your language.

Full Trace

ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_28648/4218670242.py in <module>
     33 
     34 with nlp.select_pipes(enable="textcat"):
---> 35     optimizer = nlp.begin_training()
     36     for itn in range(5):
     37         random.shuffle(examples)

~\AppData\Local\Programs\Python\Python38\lib\site-packages\spacy\language.py in begin_training(self, get_examples, sgd)
   1191     ) -> Optimizer:
   1192         warnings.warn(Warnings.W089, DeprecationWarning)
-> 1193         return self.initialize(get_examples, sgd=sgd)
   1194 
   1195     def initialize(

~\AppData\Local\Programs\Python\Python38\lib\site-packages\spacy\language.py in initialize(self, get_examples, sgd)
   1223         config = self.config.interpolate()
   1224         # These are the settings provided in the [initialize] block in the config
-> 1225         I = registry.resolve(config["initialize"], schema=ConfigSchemaInit)
   1226         before_init = I["before_init"]
   1227         if before_init is not None:

~\AppData\Local\Programs\Python\Python38\lib\site-packages\thinc\config.py in resolve(cls, config, schema, overrides, validate)
    727         validate: bool = True,
    728     ) -> Dict[str, Any]:
--> 729         resolved, _ = cls._make(
    730             config, schema=schema, overrides=overrides, validate=validate, resolve=True
    731         )

~\AppData\Local\Programs\Python\Python38\lib\site-packages\thinc\config.py in _make(cls, config, schema, overrides, resolve, validate)
    776         if not is_interpolated:
    777             config = Config(orig_config).interpolate()
--> 778         filled, _, resolved = cls._fill(
    779             config, schema, validate=validate, overrides=overrides, resolve=resolve
    780         )

~\AppData\Local\Programs\Python\Python38\lib\site-packages\thinc\config.py in _fill(cls, config, schema, validate, resolve, parent, overrides)
    848                     # We don't want to try/except this and raise our own error
    849                     # here, because we want the traceback if the function fails.
--> 850                     getter_result = getter(*args, **kwargs)
    851                 else:
    852                     # We're not resolving and calling the function, so replace

~\AppData\Local\Programs\Python\Python38\lib\site-packages\spacy\language.py in load_lookups_data(lang, tables)
     95 def load_lookups_data(lang, tables):
     96     util.logger.debug(f"Loading lookups from spacy-lookups-data: {tables}")
---> 97     lookups = load_lookups(lang=lang, tables=tables)
     98     return lookups
     99 

~\AppData\Local\Programs\Python\Python38\lib\site-packages\spacy\lookups.py in load_lookups(lang, tables, strict)
     30     if lang not in registry.lookups:
     31         if strict and len(tables) > 0:
---> 32             raise ValueError(Errors.E955.format(table=", ".join(tables), lang=lang))
     33         return lookups
     34     data = registry.lookups.get(lang)

ValueError: [E955] Can't find table(s) lexeme_norm for language 'en' in spacy-lookups-data. Make sure you have the package installed or provide your own lookup tables if no default lookups are available for your language.

Code


import spacy
from spacy.training import Example
from spacy.pipeline.textcat import DEFAULT_SINGLE_TEXTCAT_MODEL

nlp = spacy.load("en_core_web_sm")

# Add standard textcat to the pipeline
config = {
   "threshold": 0.5,
   "model": DEFAULT_SINGLE_TEXTCAT_MODEL,
}

nlp.add_pipe("textcat", config=config)


# Very limited set of Training Data
train_data = [
 (' I`d have responded, if I were going', 'neutral'),
 (' Sooo SAD I will miss you here in San Diego!!!', 'negative'),
 ('my boss is bullying me...', 'negative')
]


# Put examples in the correct format
examples = []
for x, c in train_data:
    doc = nlp.make_doc(x)
    example = Example.from_dict(doc, {"cats": {"sentiment": c}})
    examples.append(example)
    

with nlp.select_pipes(enable="textcat"):
    optimizer = nlp.initialize(lambda: examples)
    for itn in range(5):
        random.shuffle(examples)
        nlp.update(examples, sgd=optimizer)

2 replies

polm Nov 20, 2021

Did you try installing the package spacy-lookups-data?

Note that this Discussion is quite old, and since it happened v3 of spaCy has come out of development and been officially released. As noted upthread, in v3 we really don't recommend writing your own training loop.

wuye251 Jun 6, 2024

I want to know how i can use it with online train with config? I am novice in use spacy. So i'm use spacy like above code, because it's easser than config. And I don't konw how to use it by online train model(by http api can train/upload/load model)
:)

kjsr7 · 2021-11-20T16:04:31Z

kjsr7
Nov 20, 2021

@polm Installing spacy-lookups-data has resolved the issue. But when I have used en_core_web_lg instead of en_core_web_sm, the following error has arisen.

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
~\AppData\Local\Programs\Python\Python38\lib\site-packages\spacy\language.py in initialize(self, get_examples, sgd)
   1229         try:
-> 1230             init_vocab(
   1231                 self, data=I["vocab_data"], lookups=I["lookups"], vectors=I["vectors"]

~\AppData\Local\Programs\Python\Python38\lib\site-packages\spacy\training\initialize.py in init_vocab(nlp, data, lookups, vectors)
    120     if vectors is not None:
--> 121         load_vectors_into_model(nlp, vectors)
    122         logger.info(f"Added vectors: {vectors}")

~\AppData\Local\Programs\Python\Python38\lib\site-packages\spacy\training\initialize.py in load_vectors_into_model(nlp, name, add_strings)
    130     try:
--> 131         vectors_nlp = load_model(name)
    132     except ConfigValidationError as e:

~\AppData\Local\Programs\Python\Python38\lib\site-packages\spacy\util.py in load_model(name, vocab, disable, exclude, config)
    330         raise IOError(Errors.E941.format(name=name, full=OLD_MODEL_SHORTCUTS[name]))
--> 331     raise IOError(Errors.E050.format(name=name))
    332 

OSError: [E050] Can't find model 'corpus/en_vectors'. It doesn't seem to be a Python package or a valid path to a data directory.

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_19524/607413307.py in <module>
     26 
     27 with nlp.select_pipes(enable="textcat"):
---> 28     optimizer = nlp.initialize(lambda: examples)
     29     for itn in range(500):
     30         losses = {}

~\AppData\Local\Programs\Python\Python38\lib\site-packages\spacy\language.py in initialize(self, get_examples, sgd)
   1232             )
   1233         except IOError:
-> 1234             raise IOError(Errors.E884.format(vectors=I["vectors"]))
   1235         if self.vocab.vectors.data.shape[1] >= 1:
   1236             ops = get_current_ops()

OSError: [E884] The pipeline could not be initialized because the vectors could not be found at 'corpus/en_vectors'. If your pipeline was already initialized/trained before, call 'resume_training' instead of 'initialize', or initialize only the components that are new.

Code

import random

import spacy
from spacy.training import Example
from spacy.pipeline.textcat import DEFAULT_SINGLE_TEXTCAT_MODEL
# !pip install spacy-lookups-data

nlp = spacy.load("en_core_web_lg")

# Add standard textcat to the pipeline
config = {
   "threshold": 0.5,
   "model": DEFAULT_SINGLE_TEXTCAT_MODEL,
}

nlp.add_pipe("textcat", config=config)


# Put examples in the correct format
examples = []
for row in train_df.itertuples():
    doc = nlp.make_doc(row.title)
    example = Example.from_dict(doc, {"cats": {row.label: True}})
    examples.append(example)
    

with nlp.select_pipes(enable="textcat"):
    optimizer = nlp.initialize(lambda: examples)
    for itn in range(500):
        losses = {}
        random.shuffle(examples)
        nlp.update(examples, sgd=optimizer, losses=losses)
        print(losses)

1 reply

polm Nov 22, 2021

Like the error says, the vectors can't be found. If you train a model with vectors, then the vectors must be present when you use it, or when you resume training.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

KeyError on Simple Training Loop #6485

{{title}}

Replies: 9 comments 3 replies

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

KeyError on Simple Training Loop #6485

koaning Dec 2, 2020

How to reproduce the behaviour

Info about spaCy

Replies: 9 comments · 3 replies

svlandeg Dec 2, 2020 Maintainer

svlandeg Dec 2, 2020 Maintainer

koaning Dec 3, 2020 Author

svlandeg Dec 3, 2020 Maintainer

koaning Dec 6, 2020 Author

koaning Dec 6, 2020 Author

svlandeg Dec 6, 2020 Maintainer

kjsr7 Nov 20, 2021

polm Nov 20, 2021

wuye251 Jun 6, 2024

kjsr7 Nov 20, 2021

polm Nov 22, 2021

koaning
Dec 2, 2020

Replies: 9 comments 3 replies

svlandeg
Dec 2, 2020
Maintainer

svlandeg
Dec 2, 2020
Maintainer

koaning
Dec 3, 2020
Author

svlandeg
Dec 3, 2020
Maintainer

koaning
Dec 6, 2020
Author

koaning
Dec 6, 2020
Author

svlandeg
Dec 6, 2020
Maintainer

kjsr7
Nov 20, 2021

kjsr7
Nov 20, 2021