Add a lock to avoid mongo excessively computing random numbers #1601

Merged
merged 1 commit into from
Aug 8, 2024

Conversation

manthey
Member

@manthey manthey commented Aug 8, 2024

When ingesting annotation elements, we use insert_many in batches split across a thread pool. Some of the work (such as computing bounding boxes) parallelizes decently. The actual mongo insert_many call does parallelize, but running it concurrently is slower than serializing it behind a lock.

The specific problem is that mongo assigns new ids to the new element documents using the bson package's ObjectId class. This class generates a new random number as part of the id whenever it is called from a different thread pid than the previous call. The bulk of the insert time ends up being spent generating random numbers rather than doing anything else. With a lock in place, the pid-change code is only triggered once per batch.
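
As a rough illustration of the pattern described above (not the code in this PR), the sketch below keeps the bounding-box work parallel across a thread pool and serializes only the insert_many call behind a lock. `FakeCollection`, `compute_bbox`, `insert_batch`, `insert_elements`, and the element layout are all hypothetical names invented for this example; `FakeCollection` stands in for a pymongo collection so the sketch runs without a database.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeCollection:
    """Hypothetical stand-in for a pymongo collection (assumption)."""

    def __init__(self):
        self.docs = []

    def insert_many(self, docs):
        self.docs.extend(docs)


# One lock shared by all worker threads; only the insert is serialized.
insert_lock = threading.Lock()


def compute_bbox(element):
    # Hypothetical element layout: a dict with a list of (x, y) points.
    xs = [p[0] for p in element['points']]
    ys = [p[1] for p in element['points']]
    return {'lowx': min(xs), 'lowy': min(ys),
            'highx': max(xs), 'highy': max(ys)}


def insert_batch(collection, batch):
    # This part runs concurrently across the thread pool.
    docs = [{'element': el, 'bbox': compute_bbox(el)} for el in batch]
    # Only one thread at a time performs the insert.
    with insert_lock:
        collection.insert_many(docs)
    return len(docs)


def insert_elements(collection, elements, batch_size=100, workers=4):
    # Split the elements into batches and hand each batch to a worker.
    batches = [elements[i:i + batch_size]
               for i in range(0, len(elements), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda b: insert_batch(collection, b), batches))
```

The lock is held only around the insert, so the per-batch preparation can still overlap between threads. With a real collection, the same `with insert_lock:` block would wrap the actual insert_many call.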

One test showed a large annotation insert time dropping from 395 seconds to 140 seconds, with CPU utilization dropping from 100% across all cores to around 4 cores in use. A longer run of a mixed set of annotations (some large, some small) dropped from 10:36 to 4:57.

@manthey manthey force-pushed the mongo-insert-many-lock branch from 0a7294b to 848869c Compare August 8, 2024 18:54
@manthey manthey merged commit 64c3104 into master Aug 8, 2024
16 checks passed
@manthey manthey deleted the mongo-insert-many-lock branch August 8, 2024 19:27