Add a lock to avoid mongo excessively computing random numbers #1601
When ingesting annotation elements, we call insert_many in batches split across a thread pool. Some of the work (such as computing bounding boxes) parallelizes decently. The actual mongo insert_many call also runs in parallel, but doing so is slower than serializing the calls with a lock.
The specific problem is that mongo assigns ids to the new element documents using the bson package's ObjectId class. This class generates a new random number as part of the id whenever it is called from a different thread pid. The bulk of the insert time ends up being spent generating random numbers rather than doing anything else. With a lock around the insert, the pid-change code is triggered only once per batch.
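As a rough illustration of the pattern described above, here is a minimal sketch, not the code in this PR. The names `FakeCollection`, `prepare`, `insert_batch`, and `insert_elements` are hypothetical, and `FakeCollection` is an in-memory stand-in for a pymongo collection so the example runs without a server. The per-element preparation stays parallel; only the insert_many call is serialized.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeCollection:
    """Hypothetical in-memory stand-in for a pymongo collection."""

    def __init__(self):
        self.docs = []
        self.active = 0        # insert_many calls currently running
        self.max_active = 0    # highest concurrency observed
        self._guard = threading.Lock()

    def insert_many(self, docs):
        with self._guard:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
        self.docs.extend(docs)
        with self._guard:
            self.active -= 1


# One module-level lock shared by every worker thread (an assumption of
# this sketch; the real code may scope the lock differently).
insert_lock = threading.Lock()


def prepare(element):
    """Work that parallelizes well, e.g. computing a bounding box."""
    xs = [p[0] for p in element['points']]
    ys = [p[1] for p in element['points']]
    bbox = {'lowx': min(xs), 'lowy': min(ys), 'highx': max(xs), 'highy': max(ys)}
    return dict(element, bbox=bbox)


def insert_batch(collection, batch):
    docs = [prepare(element) for element in batch]  # runs concurrently
    with insert_lock:                               # inserts run one at a time
        collection.insert_many(docs)
    return len(docs)


def insert_elements(collection, elements, batch_size=100, workers=4):
    batches = [elements[i:i + batch_size]
               for i in range(0, len(elements), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda b: insert_batch(collection, b), batches))
```

With a real collection, the only change would be passing it in place of `FakeCollection`; the batch size and worker count here are arbitrary placeholders.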
One test showed a large annotation insert time dropping from 395 seconds to 140 seconds (about 2.8x faster), with CPU utilization dropping from 100% across all cores to around 4 cores in use. A longer run of a mixed set of annotations (some large, some small) dropped from 10:36 to 4:57 (about 2.1x faster).