Tango concurrency and locking issues #182
Comments
Hey team! I'm interested in working on this issue and I'd like to clarify my understanding of the code. (These questions might or might not be related to the actual issue raised above.)
Thank you so much in advance! :-)
Yes, I agree that this looks like a bug. From the Python docs, this would raise a
This is a good catch. The logging statement should case on whether
It seems like the preceding if condition first checks whether this is a job that should be run on AWS EC2 instances.
I agree that the default naming
I am not the author of the code, so I can't speak to the original design, but I do agree that there are a lot of redundant lookups. We could change getNextPendingJob to return both the id and the job, so getNextPendingJobReuse does not have to make another Redis query just to find the id again. A sketch of this is below.
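Here is a minimal sketch of that change, assuming a queue-level lock and an id-to-job mapping; the class and attribute names below are illustrative stand-ins, not Tango's actual ones.

```python
import threading

class JobQueueSketch:
    """Hypothetical, simplified stand-in for Tango's job queue, only to show
    getNextPendingJob returning (id, job) so callers skip the second lookup."""

    def __init__(self):
        self.queueLock = threading.Lock()
        self.liveJobs = {}  # assumed mapping: job id -> job object

    def getNextPendingJob(self):
        # Return the id together with the job instead of just one of them.
        with self.queueLock:
            for job_id, job in self.liveJobs.items():
                if not getattr(job, "assigned", False):
                    return job_id, job
        return None, None

    def getNextPendingJobReuse(self):
        # The reuse path now gets both values in one call, so it no longer
        # needs an extra Redis query just to rediscover the id.
        job_id, job = self.getNextPendingJob()
        if job is None:
            return None, None
        # ... VM-reuse logic would continue here with job_id and job ...
        return job_id, job
```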
Based on this supervisord config from the old one-click installation, it appears that there can be multiple server.py instances running at once, each able to add jobs to the queue. So instead of multiple consumers we actually have multiple producers. In all honesty, though, we can simply require that there be a single producer, because this will never be the bottleneck; a sketch of the underlying race is below.
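As a rough illustration of why multiple producers are a concurrency concern at all, here is a hedged sketch (not Tango code; the Redis key name is made up) contrasting a racy read-modify-write id assignment with an atomic INCR:

```python
import redis

r = redis.Redis()

def assign_id_racy():
    # Non-atomic read-modify-write: two server.py producers can read the same
    # value and hand out duplicate job ids.
    current = int(r.get("tango:next_job_id") or 0)
    r.set("tango:next_job_id", current + 1)
    return current + 1

def assign_id_atomic():
    # INCR is atomic on the Redis side, so concurrent producers always
    # receive distinct ids.
    return r.incr("tango:next_job_id")
```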
This is a good idea, and within each queue we can enforce data structure consistency as well (see the sketch below).
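One way to enforce that kind of per-queue consistency is sketched here with redis-py's optimistic WATCH/MULTI transaction; the queue key name is hypothetical and this is not Tango's actual queue code.

```python
import redis

r = redis.Redis()

def claim_next_job(queue_key="tango:pending"):
    # Optimistically watch the queue so two consumers cannot both claim the
    # same head element; retry if the queue changed before we committed.
    with r.pipeline() as pipe:
        while True:
            try:
                pipe.watch(queue_key)
                head = pipe.lrange(queue_key, 0, 0)  # peek at the head
                if not head:
                    pipe.unwatch()
                    return None
                pipe.multi()
                pipe.lpop(queue_key)
                pipe.execute()
                return head[0]
            except redis.WatchError:
                continue  # someone else modified the queue; try again
```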
Do let me know if you have more questions!
#148 is also a great place to look for things that have been fixed by PDL which we could also use.
Hi Fan, thank you for the explanation!
Thanks a lot!
If a job is already assigned to a VM, you do not need to find another VM for it. You also do not necessarily need to create a new VM for it in allocVM if REUSE_VMS is set to true; a rough sketch of this flow is below.
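The control flow being described might look roughly like this sketch; everything other than the REUSE_VMS flag and the allocVM behaviour it mimics is an invented stand-in for illustration.

```python
from typing import Optional

REUSE_VMS = True  # the config flag referred to above

class VM:
    def __init__(self, name: str):
        self.name = name

class Job:
    def __init__(self, job_id: int, vm: Optional[VM] = None):
        self.id = job_id
        self.vm = vm  # VM already assigned to this job, if any

def assign_vm(job: Job, free_pool: list, create_new_vm) -> VM:
    # 1. A job that already has a VM keeps it; no need to find another one.
    if job.vm is not None:
        return job.vm
    # 2. With REUSE_VMS enabled, prefer a free existing VM over a new one.
    if REUSE_VMS and free_pool:
        job.vm = free_pool.pop()
        return job.vm
    # 3. Only otherwise create a fresh VM (create_new_vm is a hypothetical
    #    callback standing in for the creation path in allocVM).
    job.vm = create_new_vm(job)
    return job.vm
```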
Sorry if I was unclear previously, but I meant to say that the
This only happens if MAX_JOBID is reached and the counter wraps back to 1 while the previous VM is still running the old job (sketched below). Possible, but extremely unlikely. You can also just make MAX_JOBID very large to be safe, but right now it is already 1000.
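A simplified sketch of the wrap-around behaviour being described, not the exact Tango counter code:

```python
MAX_JOBID = 1000  # the current value mentioned above

def next_job_id(current_id: int, live_job_ids: set) -> int:
    # Ids wrap back to 1 after MAX_JOBID, so a clash is only possible if a
    # job holding the wrapped-around id is still live. Assumes fewer than
    # MAX_JOBID live jobs at any time.
    candidate = current_id + 1 if current_id < MAX_JOBID else 1
    while candidate in live_job_ids:
        candidate = candidate + 1 if candidate < MAX_JOBID else 1
    return candidate
```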
Could you point me to which part of the codebase you are referring to?
Unfortunately, I believe so, but there is no real reason to run multiple servers since it will never be a bottleneck.
What do you mean by reset? If you are asking if they are ever restarted, the answer is no.
Definitely, and good luck!
There are potential concurrency issues in parts of the codebase that result in execution traces which cause Tango to behave anomalously.
See:
#138 (this was replicable until very recently, but we used a workaround to fix it rather than resolving the root cause)