Solving the Traveling Salesman Problem (TSP) with brute force and parallelism just for teaching and illustrative purpose.
The code here is mostly adapted from the very nice talk by David Olsen, Faster Code Through Parallelism on CPUs and GPUs at CppCon 2019.
I've just slightly changed the way the permutations are computed which turns out to be slightly faster than the original.
See also companion slides by D. Olsen (Nvidia) at GTC2019 s9770-c++17-parallel-algorithms-for-nvidia-gpus-with-pgi-c++.pdf
Timings reported here are measured on my own desktop.
Hardware :
- CPU Intel(R) Core(TM) i5-9400F @ 2.90 GHz - 6 cores
- GPU Nvidia GeForce RTX 2060
Software:
- Ubuntu 20.04
- Nvidia hpc sdk version 20.11 with cuda 11.2
- g++ 9.3
Number of cities | time (seconds) |
---|---|
10 | 0.064 |
11 | 0.66 |
12 | 8.48 |
13 | 130.7 |
14 | 1997.9 |
Number of cities | time (seconds) |
---|---|
10 | 0.044 |
11 | 0.152 |
12 | 1.64 |
13 | 24.2 |
14 | 368.3 |
We need an OpenMP 4.5 capable compiler:
Just follow the steps mentioned here:
- how-to-build-and-run-your-modern-parallel-code-in-c-17-and-openmp-4-5-library-on-nvidia-gpus
- Clang_with_OpenMP_Offloading_to_NVIDIA_GPUs
Performance are quite slow (compared to PGI/OpenAcc or Kokkos::CUDA below) by a factor of x3 when N >= 12.
Number of cities | time (seconds) |
---|---|
10 | 0.077 |
11 | 0.104 |
12 | 0.39 |
13 | 3.56 |
14 | 59.5 |
I tried three compile toolchains:
- clang 10.0 with cuda 10.1
- clang 11.0 with cuda 11.2
- clang 12.0 with cuda 11.3
These results are strange, much too slow (maybe related to my host configuration ?). I also a strong performance drop when using nvhpc 20.11 versus 20.5 (??).
Number of cities | time (seconds) |
---|---|
10 | 0.030 |
11 | 0.25 |
12 | 7.21 |
13 | XX |
14 | XX |
Performance looks good (similar to Kokkos::CUDA below).
Number of cities | time (seconds) |
---|---|
10 | 0.107 |
11 | 0.128 |
12 | 0.25 |
13 | 1.51 |
14 | 20.5 |
To build, you just need to edit kokkos/Makefile
, and change the first line by modifying variable KOKKOS_PATH
to the full path where you cloned kokkos sources.
Just run make
in subdir kokkos
.
Timings are similar to OpenMP pragma.
Number of cities | time (seconds) |
---|---|
10 | 0.015 |
11 | 0.152 |
12 | 1.84 |
13 | 24.5 |
14 | 376.6 |
Just run make KOKKOS_DEVICES=Cuda
in subdir kokkos
.
Timings are very similar to Openacc (GPU).
Number of cities | time (seconds) |
---|---|
10 | 0.01 |
11 | 0.01 |
12 | 0.106 |
13 | 1.41 |
14 | 22.4 |
Again, performance obtained here is odd; it's much slower than OpenMP with g++, but should be similar....
Number of cities | time (seconds) |
---|---|
10 | 0.027 |
11 | 0.128 |
12 | 5.10 |
13 | 114.0 |
14 | TODO |
Similar performance as OpenAcc and Kokkos/Cuda.
Number of cities | time (seconds) |
---|---|
10 | 0.007 |
11 | 0.016 |
12 | 0.15 |
13 | 1.40 |
14 | 21.3 |
There are a lot of SYCL implementation available, here we tried
- Intel OneAPI DPC++ (linux) for x86 multi-core
- Intel LLVM (linux) for Nvidia GPU (see https://github.com/intel/llvm)
module load compiler/2021.1.1
Number of cities | time (seconds) |
---|---|
10 | 1.29 |
11 | 1.40 |
12 | 2.62 |
13 | 21.3 |
14 | TODO |
for large values of N, we retrieve the expected results (OpenMP, Kokkos::OpenMP, ...).
module load llvm/12-intel-sycl-cuda
These results should be optimized by changing the number of block of threads and the block size (à la CUDA).
Performance are correct, maybe a bit slow for N=14.
Number of cities | time (seconds) |
---|---|
10 | 0.001 |
11 | 0.008 |
12 | 0.11 |
13 | 1.86 |
14 | 37.4 |