Last Reviewed: Seth T. 2022-10-19
Quality of Life
1. Verify kernel_info.maxThreadsPerBlock <= TPB_DEFAULT in BIT select loop.
2. Add info on what GPU kernels are available.
3. Copy `s_bit` over to the GPU in chunks of 2^20 to reduce GPU memory allocation.
4. Improve peak memory reporting during GPU runs.
See https://stackoverflow.com/questions/11631191/why-does-the-cuda-runtime-reserve-80-gib-virtual
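
A minimal host-side sketch of item 3, assuming `s_bit` is stored as one byte per bit and that each chunk can be consumed independently. `copy_in_chunks`, `CHUNK_ELEMS`, `chunk_fn`, and `count_ones` are hypothetical names, not taken from the code base, and `memcpy` stands in for the device copy (`cudaMemcpy` in real code):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical chunk size: 2^20 entries, as suggested in item 3. */
#define CHUNK_ELEMS ((size_t)1 << 20)

/* Callback that consumes one chunk; real code would launch the kernel here. */
typedef void (*chunk_fn)(const char *chunk, size_t len, void *ctx);

/* Walk s_bits in chunks of at most CHUNK_ELEMS entries, staging each chunk
   in a fixed-size buffer so that only CHUNK_ELEMS bytes are ever allocated.
   Returns the number of chunks processed. */
size_t copy_in_chunks(const char *s_bits, size_t total,
                      char *staging, chunk_fn consume, void *ctx)
{
    size_t chunks = 0;
    assert(s_bits != NULL && staging != NULL && consume != NULL);
    for (size_t off = 0; off < total; off += CHUNK_ELEMS) {
        size_t len = total - off;
        if (len > CHUNK_ELEMS)
            len = CHUNK_ELEMS;
        memcpy(staging, s_bits + off, len);   /* stand-in for cudaMemcpy */
        consume(staging, len, ctx);
        chunks++;
    }
    return chunks;
}

/* Example consumer: counts the non-zero entries it has seen. */
void count_ones(const char *chunk, size_t len, void *ctx)
{
    size_t *ones = ctx;
    for (size_t i = 0; i < len; i++)
        *ones += (chunk[i] != 0);
}
```

With this shape the device buffer stays at 2^20 entries regardless of B1, at the cost of one copy per chunk.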
Big improvements
* Automatic tuning for TPB, gpucurves
gpu_throughput_test could be automated in the code and run when B1 > THRESHOLD
Getting the wrong gpucurves / TPB pair can cost a factor of 2x to 4x in performance.
* Flag to sleep between kernel calls
This reduces performance but can help the responsiveness of the computer.
Even 5ms of pause (~7% performance) made the system seem much more stable.
* Some benchmark (gpu_throughput_test.sh?) to prevent performance regressions.
Ideally it would record <git commit, GPU, registers, performance at BITS/CURVES>
* Consider what changes would be needed to run multiple numbers at a time.
`set_p_2p`, `findfactor`, and `process_results` seem easy to change
`np0` would have to become an array.
`n_log2` would become `max_n_log2`.
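
The automatic tuning idea above could be sketched as a small search over candidate (TPB, gpucurves) pairs. This is only an illustration: `gpu_config`, `bench_fn`, `pick_best_config`, and `fake_bench` are hypothetical names, and the benchmark callback is assumed to time a short kernel run and return a throughput figure.

```c
#include <assert.h>
#include <stddef.h>

/* One candidate launch configuration (hypothetical struct). */
typedef struct {
    int tpb;      /* threads per block */
    int curves;   /* value that would be passed as gpucurves */
} gpu_config;

/* Benchmark callback: returns measured throughput for one configuration.
   Real code would time a short kernel run here. */
typedef double (*bench_fn)(const gpu_config *cfg, void *ctx);

/* Return the index of the candidate with the highest measured throughput. */
size_t pick_best_config(const gpu_config *cands, size_t n,
                        bench_fn bench, void *ctx)
{
    size_t best = 0;
    double best_rate;
    assert(cands != NULL && n > 0 && bench != NULL);
    best_rate = bench(&cands[0], ctx);
    for (size_t i = 1; i < n; i++) {
        double rate = bench(&cands[i], ctx);
        if (rate > best_rate) {
            best_rate = rate;
            best = i;
        }
    }
    return best;
}

/* Fake benchmark for demonstration only: pretends throughput peaks at
   TPB = 128 and scales with the number of curves. */
double fake_bench(const gpu_config *cfg, void *ctx)
{
    int d = cfg->tpb - 128;
    (void)ctx;
    if (d < 0)
        d = -d;
    return (double)cfg->curves / (1.0 + d);
}
```

Running the search only when B1 > THRESHOLD, as suggested above, would keep the tuning cost small relative to the run it tunes.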
Testing
1. Improve carry bit testing (see overflow test in check_gpuecm.sage)
2. Ask a C++ person and verify that the GPU and CPU both use the same endianness.
This could affect `to_mpz`, `from_mpz`, `allocate_and_set_s_bits`, and `set_p_2p`.
This is possibly handled by the `endian` parameter of `mpz_export`.
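
For the endianness question, a host-side check along these lines might be a starting point. It does not use GMP: it only detects the host byte order and compares the raw memory layout of 32-bit limbs against an explicit least-significant-byte-first serialization. All function names are hypothetical. For reference, in `mpz_export` the `endian` argument selects the byte order within each word (1 for most significant first, -1 for least significant first, 0 for native), while `order` selects the word order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the host stores the least significant byte first. */
int host_is_little_endian(void)
{
    const uint32_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Serialize 32-bit limbs (least significant limb first) into bytes,
   least significant byte first, independent of the host byte order. */
void limbs_to_le_bytes(const uint32_t *limbs, size_t n, unsigned char *out)
{
    assert(limbs != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        for (int b = 0; b < 4; b++)
            out[4 * i + b] = (unsigned char)(limbs[i] >> (8 * b));
    }
}

/* Returns 1 if the raw in-memory layout of limbs already matches the
   explicit little-endian serialization, i.e. a plain byte copy of the
   limbs would produce little-endian data. scratch needs 4 * n bytes. */
int raw_layout_is_le(const uint32_t *limbs, size_t n, unsigned char *scratch)
{
    limbs_to_le_bytes(limbs, n, scratch);
    return memcmp(limbs, scratch, 4 * n) == 0;
}
```

If the functions listed above only ever move whole 32-bit words between host and device, a check like `raw_layout_is_le` on known test limbs would show whether a byte-level copy is safe on the build host.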
Several things have been tried to improve performance that did not pan out. They are recorded
here so we don't forget.
1. Branchless
if(carry) { cgbn_sub(r, r, modulus) }
can be replaced with
cgbn_sub(r, r, zero_or_modulus[carry]);
In theory this should help the different threads all stay aligned; in practice it didn't.
2. Reducing number of reserved carry bits
CARRY_BITS can be reduced from 6 to 2(?) by checking the carry bit in addition to overflow
after cgbn_add. This increases the size of the numbers that can be run in cgbn_kernel.
There's some performance penalty for this so the code is uncommitted, but this is a big
improvement for numbers of 506-510 and 1018-1022 bits.
3. Removing the bn_t variables CB,DA,AA,BB,k,dK
Only 2 (or possibly 3) temporary variables are needed. This change didn't impact performance
and hurt readability, so it was backed out. It's always possible that it could reduce register usage.
4. Add fast squaring to CGBN, see https://github.com/NVlabs/CGBN/issues/19
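
The branchless idea from item 1 can be illustrated in plain C on a single 64-bit word. This is an analogue only: the real code operates on CGBN multi-limb values with `cgbn_sub`, and `reduce_branch` / `reduce_branchless` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Branching form: subtract the modulus only when a carry occurred. */
uint64_t reduce_branch(uint64_t r, uint64_t modulus, int carry)
{
    assert(carry == 0 || carry == 1);
    if (carry)
        r -= modulus;
    return r;
}

/* Branchless form: index a two-entry table {0, modulus} with the carry,
   so every thread executes the same subtraction. */
uint64_t reduce_branchless(uint64_t r, uint64_t modulus, int carry)
{
    const uint64_t zero_or_modulus[2] = { 0, modulus };
    assert(carry == 0 || carry == 1);
    return r - zero_or_modulus[carry];
}
```

Both forms return the same value for every input; the only difference is whether the subtraction is conditional, which is the property item 1 hoped would reduce thread divergence.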