Training NER Model on Large Dataset using CPU #13399
Replies: 1 comment
-
Hi! Sorry to hear you're running into these memory issues. You said that streaming the data helps, but that the memory usage is still quite high and also inconsistent. I wonder whether the 54 GB peaks specifically are caused by the evaluation code. This runs on the dev set when a training epoch is finished, so that the command line can show the p/r/F numbers on the dev set. It could be that this particular code path should be further optimized for streaming data. To test this hypothesis, could you run training with the same training set, providing the labels and streaming the data, but reduce the dev set to something really small? In theory, with the same seed, you would see the same training loss scores, as those only depend on the training data, but the reported accuracy will be different. This really only influences the final choice of the "best" model, as well as determining when to stop training. This experiment would help us identify where the culprit in the code is.
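For example, a quick way to carve a small dev sample out of your existing `.spacy` files could look like the sketch below (the `corpus/dev` and `corpus/dev_small` paths, the `en` language and the 1,000-doc cutoff are placeholders, adjust them to your setup):

```python
from pathlib import Path

import spacy
from spacy.tokens import DocBin

DEV_DIR = Path("corpus/dev")                   # directory with the existing dev .spacy files
OUT_PATH = Path("corpus/dev_small/dev.spacy")  # where to write the reduced dev set
MAX_DOCS = 1000                                # keep only a handful of docs for the experiment

nlp = spacy.blank("en")  # only the shared vocab is needed to deserialize the docs

small = DocBin()
for spacy_file in sorted(DEV_DIR.glob("*.spacy")):
    for doc in DocBin().from_disk(spacy_file).get_docs(nlp.vocab):
        small.add(doc)
        if len(small) >= MAX_DOCS:
            break
    if len(small) >= MAX_DOCS:
        break

OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
small.to_disk(OUT_PATH)
```

You could then point `paths.dev` at the reduced file (e.g. `--paths.dev corpus/dev_small/dev.spacy` on the command line) and rerun training. If the 54 GB peaks disappear, the evaluation path is the likely culprit.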
-
spaCy version: 3.7.4
The task I'm researching is extracting toponyms from addresses. I'm training the model on CPU for accuracy, because the model inference app will run in a k8s environment without a GPU.
I have 8.5 million entries in my dataset. The train set is ~572 MB (710 files) and the dev set is ~172 MB (242 files); these are the sizes of the .spacy files on disk. Each .spacy file contains at most 10,000 entries, with an average file size of about 800 KB. However, when I try to train the model, the Python process eats too much memory and the OOM killer kills it. I've read some of the suggested approaches in the links below:
#8456
#8157
#12551
Providing the labels in advance and streaming the data DOES help to a degree, but the memory usage is still too high. Training on a machine with 64 GB of RAM uses more than 50 GB for the Python process. This is quite unstable, because if any other process requests memory, the OS just kills the Python process.
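To be concrete, what I mean by this is roughly the documented setup: labels generated up front with `python -m spacy init labels config.cfg corpus/labels` and the train corpus streamed rather than loaded into memory, i.e. something like this in the config (paths simplified):

```ini
[training]
# -1 streams the train corpus instead of loading it into memory (no shuffling)
max_epochs = -1

[initialize.components.ner.labels]
# labels generated in advance with: python -m spacy init labels config.cfg corpus/labels
@readers = "spacy.read_labels.v1"
path = "corpus/labels/ner.json"
```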
When I was looking at the running processes, I saw memory usage of 8 GB, 18 GB and 54 GB at different times. I thought streaming meant that not all of the data is loaded into RAM, but the 54 GB peak seems a little suspicious.
Is there anything else I can do about the memory usage? Would it help to split the train and dev datasets into smaller batches, train the model on one batch, and then resume training on the next one (or would that lead to forgetting)?
Also, is there any way of making the training use multiple cores? I understand that the training process uses a single core, but it is really slow on this amount of data. Is it possible to train multiple models on different CPUs at the same time and merge them somehow after the process is done? Or is there any other way to speed up training without allocating too much memory?
Thanks for your help in advance!