Is there a way to share context among threads, if not why? #306
Comments
I just learned about the GIL. So we have to use multiprocessing, and I can only dive into MPI to solve my problem if I want to stick with Python?
This also came up recently in #305 (comment). PyCUDA currently assumes that each context can only be active in a single thread. It appears that this was true up until CUDA 4, but this restriction was then lifted. I would welcome a PR that removes this restriction. It might be as simple as deleting the check for uniqueness of activation.
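For reference, this is roughly what sharing one context between threads would look like at the driver-API level. A minimal, untested sketch: with today's PyCUDA the push in a second thread may still trip the activation check mentioned above.

```python
import threading
import pycuda.driver as cuda

cuda.init()
ctx = cuda.Device(0).make_context()   # created (and current) in the main thread
ctx.pop()                             # release it so worker threads can claim it

def worker(nbytes):
    ctx.push()                        # make the shared context current in this thread
    try:
        buf = cuda.mem_alloc(nbytes)  # per-thread work against the same context
        buf.free()
    finally:
        ctx.pop()                     # leave the context free for other threads

threads = [threading.Thread(target=worker, args=(1024,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

ctx.detach()                          # final cleanup from the main thread
```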
Yes, I also saw it. I may switch to polygraphy instead. I don't know much about CUDA wrappers and I chose PyCUDA only because the official TensorRT example used it, but the test code in TensorRT uses polygraphy instead. It seems like polygraphy hides all the details about contexts; hope that it can work.
The NVIDIA guys told me that TensorRT inference releases the GIL. That's good news; if the new feature were added, it would be useful in this case.
How come this got closed? The question you raised is a real concern to my mind, and I wouldn't be opposed to the issue staying open.
Okay, just one quick question: I found that PyCUDA is a lot quicker than polygraphy when doing memcpy. Do you know the reason?
PyCUDA isn't doing anything special with memcpy; it just calls the corresponding CUDA function. For an additional speed boost, you can use "page-locked" memory (on the host side).
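For what it's worth, a minimal sketch of what the page-locked-memory suggestion looks like in PyCUDA; the sizes and dtypes here are just placeholders:

```python
import numpy as np
import pycuda.autoinit            # creates a context on device 0
import pycuda.driver as cuda

n = 1 << 20
host_buf = cuda.pagelocked_empty(n, dtype=np.float32)  # pinned host memory
host_buf[:] = np.random.rand(n).astype(np.float32)

dev_buf = cuda.mem_alloc(host_buf.nbytes)
stream = cuda.Stream()

cuda.memcpy_htod_async(dev_buf, host_buf, stream)       # H->D copy on the stream
cuda.memcpy_dtoh_async(host_buf, dev_buf, stream)       # D->H copy back
stream.synchronize()
```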
OK, I'll try to read the source code myself...
@menglin0320 I'm in a similar situation to yours, and I have a solution.
Would fixing this also make it possible for CUDA objects to be safely garbage collected in threads where the context is not current?
Basically, I want to achieve concurrent work with multithreading; my current inference code is PyCUDA + TensorRT.
Why I want to do so
I'm trying to optimize the inference throughput for a model with dynamic input. The size difference between samples can be quite significant, so I want to avoid padding but still do something similar to batching: run several samples concurrently with the same engine (roughly like the sketch below). The inference time will still be bottlenecked by the biggest sample in the batch, but a lot of FLOPs are saved, and it avoids the possible performance drop from padding too much.
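Roughly, the sketch I have in mind; infer() is just a hypothetical placeholder for running the engine on one sample, and the threads only overlap to the extent that this call releases the GIL during GPU work (as mentioned above for TensorRT):

```python
from concurrent.futures import ThreadPoolExecutor

def infer(sample):
    # placeholder: run the engine on one variably-sized sample, return its output
    raise NotImplementedError

def run_batch(samples, max_workers=4):
    # submit samples of very different sizes without padding them to a common shape
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(infer, samples))
```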
My current understanding of the problem
From what I understood, if work items are in different CUDA contexts there is no real parallel execution; instead it is just better scheduling. Also, one process can only have one CUDA context, but threads can share a context. That may not be true for PyCUDA, so I need to check. But I didn't find anything yet about how to share one context among threads.
I found the official example for using multithreading with PyCUDA (link); each thread calls Device.make_context(). There's not much difference between multithreading and multiprocessing then: if each thread owns its own context, there is no real concurrent work.
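For reference, a condensed paraphrase of that per-thread-context pattern (not the exact example code; it assumes a single GPU, device 0):

```python
import threading
import numpy as np
import pycuda.driver as cuda

cuda.init()

class GPUThread(threading.Thread):
    def __init__(self, device_id):
        super().__init__()
        self.device_id = device_id

    def run(self):
        ctx = cuda.Device(self.device_id).make_context()  # per-thread context
        try:
            data = np.arange(1024, dtype=np.float32)
            dev = cuda.mem_alloc(data.nbytes)
            cuda.memcpy_htod(dev, data)                    # work inside this thread's context
        finally:
            ctx.pop()                                      # tear the context down again

threads = [GPUThread(0) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```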
My question:
I just wonder if my understanding of contexts is right, and whether there is a way to share a context between different threads. I feel it should be possible; if it is not possible with PyCUDA, can anyone briefly explain why?