optimizations.txt
Small list of variants encountered along the optimization path:
1) Improving ILP (instruction-level parallelism) by running two
   keys at a time in each group.
   - Didn't work. The extra register pressure made it slower
     than the simpler one-key version. It also uses more
     memory, which hurt on the Tesla.
2) Using shared memory for exchanging key-parts during
   salsa and for reading/writing. Warp shuffle is substantially
   faster. The shared-memory read/write functions are still
   in the git history, but I've removed them from the code.
3) NOTDONE / NOTNEEDED
   Inside salsa_xor_core:
   There is a minor possible optimization that saves a register
   assignment, but it only matters if the compiler isn't being smart
   enough. Probably not worth doing: unroll the inner loop once
   (peel off its first iteration) so that it reads b[i] instead of
   x[i] where appropriate, eliminating the x[i] = b[i] assignment
   above it. Hopefully nvcc is doing this already.
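   To illustrate item 3, here is a host-side C sketch, not the actual
   kernel code. It assumes salsa_xor_core follows the usual scrypt
   Salsa20/8 pattern (b ^= bx, copy b into x, four double-rounds on x,
   then b += x). The names salsa_xor_core_copy and salsa_xor_core_peeled
   and the helper functions are made up for this note. In the peeled
   variant the first column round reads from b and writes to x, so the
   x[i] = b[i] copy is no longer needed.

```c
#include <stdint.h>

#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))

/* One Salsa20 quarter-round that reads not-yet-written words from s[]
   and writes results to x[]. With s == x this is the ordinary in-place
   quarter-round. */
static void quarter(uint32_t *x, const uint32_t *s, int a, int b, int c, int d)
{
    x[b] = s[b] ^ ROTL32(s[a] + s[d], 7);
    x[c] = s[c] ^ ROTL32(x[b] + s[a], 9);
    x[d] = s[d] ^ ROTL32(x[c] + x[b], 13);
    x[a] = s[a] ^ ROTL32(x[d] + x[c], 18);
}

/* Column round: the four quarter-rounds together write all 16 words of x. */
static void column_round(uint32_t *x, const uint32_t *s)
{
    quarter(x, s, 0, 4, 8, 12);
    quarter(x, s, 5, 9, 13, 1);
    quarter(x, s, 10, 14, 2, 6);
    quarter(x, s, 15, 3, 7, 11);
}

/* Row round: always operates in place on x. */
static void row_round(uint32_t *x)
{
    quarter(x, x, 0, 1, 2, 3);
    quarter(x, x, 5, 6, 7, 4);
    quarter(x, x, 10, 11, 8, 9);
    quarter(x, x, 15, 12, 13, 14);
}

/* Baseline (assumed shape): explicit x[i] = b[i] copy before the rounds. */
void salsa_xor_core_copy(uint32_t b[16], const uint32_t bx[16])
{
    uint32_t x[16];
    for (int i = 0; i < 16; i++) {
        b[i] ^= bx[i];
        x[i] = b[i];
    }
    for (int j = 0; j < 4; j++) {
        column_round(x, x);
        row_round(x);
    }
    for (int i = 0; i < 16; i++)
        b[i] += x[i];
}

/* Peeled: the first column round takes its inputs from b, so x never
   needs to be initialized with a copy of b. */
void salsa_xor_core_peeled(uint32_t b[16], const uint32_t bx[16])
{
    uint32_t x[16];
    for (int i = 0; i < 16; i++)
        b[i] ^= bx[i];
    column_round(x, b);              /* first iteration, reading from b */
    row_round(x);
    for (int j = 1; j < 4; j++) {    /* remaining three double-rounds */
        column_round(x, x);
        row_round(x);
    }
    for (int i = 0; i < 16; i++)
        b[i] += x[i];
}
```

   Both variants should give identical results for the same inputs.
   Whether the peeled form saves anything in practice depends on
   whether nvcc already eliminates the copy on its own.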
4) Reduce register pressure in scrypt_hash_finish_kernel
   (NOTDONE: only 1% of runtime is spent here.)
   This function reads ostate and tstate before doing any processing,
   but it could defer the read of ostate until after most of the
   PBKDF cleanup. That would let it use (at least) eight fewer
   registers. Untested, but on the radar.
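   The reordering in item 4 can be sketched in host-side C as below.
   This is a toy illustration only: toy_compress is a stand-in mixing
   function, not SHA-256, and finish_eager / finish_deferred are
   hypothetical names rather than the real kernel. It also assumes that
   tstate and ostate are eight 32-bit words each, which would account
   for the "eight fewer registers" estimate. The only difference between
   the two versions is where ostate is loaded, and so how long it stays
   live.

```c
#include <stdint.h>

/* Stand-in for a hash compression step. NOT real SHA-256. */
static void toy_compress(uint32_t state[8], const uint32_t block[8])
{
    for (int i = 0; i < 8; i++)
        state[i] = (state[i] ^ block[i]) * 2654435761u + block[(i + 1) & 7];
}

/* Eager: both states are loaded up front, so all 16 words are live
   while the inner step runs. */
void finish_eager(const uint32_t g_tstate[8], const uint32_t g_ostate[8],
                  const uint32_t data[8], uint32_t out[8])
{
    uint32_t tstate[8], ostate[8];
    for (int i = 0; i < 8; i++) {
        tstate[i] = g_tstate[i];
        ostate[i] = g_ostate[i];
    }
    toy_compress(tstate, data);      /* inner step: ostate is idle here */
    toy_compress(ostate, tstate);    /* outer step */
    for (int i = 0; i < 8; i++)
        out[i] = ostate[i];
}

/* Deferred: ostate is loaded only once the inner step has finished,
   so it is not live during that step. */
void finish_deferred(const uint32_t g_tstate[8], const uint32_t g_ostate[8],
                     const uint32_t data[8], uint32_t out[8])
{
    uint32_t tstate[8];
    for (int i = 0; i < 8; i++)
        tstate[i] = g_tstate[i];
    toy_compress(tstate, data);      /* inner step */

    uint32_t ostate[8];
    for (int i = 0; i < 8; i++)
        ostate[i] = g_ostate[i];     /* load deferred to here */
    toy_compress(ostate, tstate);    /* outer step */
    for (int i = 0; i < 8; i++)
        out[i] = ostate[i];
}
```

   The two functions compute the same result. A host C compiler may
   well move the loads around by itself, so this only shows the intended
   live ranges; any real register saving would have to be checked in
   the nvcc output.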