How to limit number of threads per group in algorithms? (2024.11.18.) #1936
Comments
Pinging @fwyzard, @AuroraPerego and @ivorobts for info.
Note though that I'm 99% sure at this point that we're running into a oneAPI bug here. 🤔 Since the kernel(s) run by However we do absolutely have a couple of kernels in our code that we run with only 64 threads per block/group at the moment. I think the SYCL runtime is picking up the register needs of one of those kernels, when it decides that it cannot run the sorting kernel. 😕 At first I thought that this would happen because we didn't specify a custom type for some of our So, in this case limiting the number of threads for the sorting would mainly be needed to work around a oneAPI bug in my understanding. But the ability to set limits on the number of threads could still be a good thing to have. 🤔
Hi, @krasznaa. I see that you use oneAPI 2024.2, which comes with oneDPL 2022.6. Have you tried oneDPL 2022.7, supplied as part of oneAPI 2025.0? The issue you faced should already be fixed in the latest release: #1626. If the suggestion above does not work for you, you can try
It will call the merge-sort algorithm instead of radix-sort. Based on my empirical findings, it should perform better than radix-sort for ~100'000 elements or fewer (depending on the GPU, though). Perhaps it will suit you better. As for configuring the size of work-groups, we have started implementing a more low-level API named kernel templates. This feature is experimental and still evolving. Currently, it has a sorting algorithm running on Intel GPUs only, but we are considering adding a more generic one. I think a more conservative and faster approach for us to fix the issue is to introspect kernels: #1938. It does not require passing external parameters. Attila, which device do you use? It would be helpful for reproducing the issue. I am going to delve into it.
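For illustration, the kernel-introspection idea mentioned above can be sketched in plain C++, without any SYCL or oneDPL dependency. This is only a sketch under assumptions: `LaunchConfig` and `clamp_launch` are hypothetical names, not oneDPL API, and `kernel_max_group_size` stands in for the per-kernel limit that a runtime query would report (in SYCL 2020 that would be the `info::kernel_device_specific::work_group_size` descriptor). The point is just that the launch uses the smaller of the preferred group size and what the kernel can actually support, with the number of groups rounded up accordingly.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical launch description; not a oneDPL or SYCL type.
struct LaunchConfig {
    std::size_t group_size;  // work-items (threads) per work-group
    std::size_t num_groups;  // number of work-groups to launch
};

// Hypothetical helper: cap the preferred work-group size by the limit
// reported for the kernel, then compute how many groups cover the input.
inline LaunchConfig clamp_launch(std::size_t n_elements,
                                 std::size_t preferred_group_size,
                                 std::size_t kernel_max_group_size) {
    // Never exceed what the kernel can run with; never go below 1.
    const std::size_t group_size = std::max<std::size_t>(
        1, std::min(preferred_group_size, kernel_max_group_size));
    // Round up so that every element is covered by some work-group.
    const std::size_t num_groups =
        (n_elements + group_size - 1) / group_size;
    return {group_size, num_groups};
}
```

For example, with 1000 elements, a preferred group size of 256 and a kernel limit of 64, `clamp_launch(1000, 256, 64)` would give 16 groups of 64 work-items each.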
This is excellent news! I'll try it out soon. Upgrading our code to be compatible with oneAPI 2025.0.0 will take a bit more effort, but luckily our build is already not using oneDPL from oneAPI directly. 😉 https://github.com/acts-project/traccc/blob/main/extern/dpl/CMakeLists.txt#L21 So I'll see what happens when I upgrade our build to the latest oneDPL version. 👍
I had to go with "option B" for now. 🤔 Which did make our code run as it's supposed to. 🥳 (At least as much as the unit tests claim.) For some reason oneAPI 2024.2.1 really doesn't want to collaborate with oneDPL 2022.7.0. At least not in our build. 😦 When trying to use it, I get:
At first I thought that passing oneDPL 2022.7.0 headers to my compilation with
The tests are currently done on NVIDIA GPUs, because we'll need to upgrade to oneAPI 2025.0.0 to get this latest part of our code working on Intel ones. (It's a longer story. You can find a taste of it in: acts-project/algebra-plugins#136)
The linkage error from #1936 (comment) was fixed in #1849. @timmiesmith, could you make sure that #1849 is part of the oneDPL 2022.7 patch release? It seems it did not make it into the initial oneDPL 2022.7 release.
I've submitted #1947 to pull #1849 into the patch release branch.
We are finally starting to use oneDPL in earnest in the traccc project, now that #1060 is not an issue anymore.
Now I ran into a different, pretty interesting issue. (In the sense that I didn't see such an issue before...) During a unit test, I get this sort of a failure:
We indeed use some types in our code that are very register hungry. This is an issue that we're actively working on. But to my surprise, this failure didn't come from one of our own kernels, but from this oneDPL operation:
This is pretty surprising, since the data types being worked on by this algorithm launch are pretty simple... 😕 So I'm quite surprised that with any launch parameters we would run into this sort of an issue.
But it's definitely a possibility that we may need to put limits on the launch parameters that oneDPL could use. The execution policy received by the functions would seem like the perfect place to store such limits in. 🤔 But I don't see an option, at least in oneapi::dpl::execution::device_policy, to specify such launch limits.
Is there a way to tell the algorithms to not launch more than N threads per block/group?
Cheers,
Attila
P.S. In case it may be useful, this is how this test of mine gets to the problem: