Added support for GPU acceleration (CUDA) on recovery file creation. #176

Open
RisaKirisu wants to merge 27 commits into master

Conversation

RisaKirisu

A few months ago I used this software to create my backup, but I felt it was very slow. At the same time, I was learning CUDA. So over the past few months, I learned Reed-Solomon encoding, read the paper this project is based on, and studied the project's code. Then I wrote a CUDA compute routine for recovery file creation. In my tests, the CUDA routine on my RTX 2080 Ti is around 4x faster overall than the OpenMP routine on my Ryzen 5800X (8 cores, 16 threads).

Since Autotools is a pain to add CUDA compilation support to, I wrote a new build script using CMake. Users can compile with the CUDA part enabled by passing -DENABLE_CUDA=ON to cmake. When the CUDA option is not enabled, the CMake script should produce the same program as the Automake build.

I tested the CUDA routine by producing recovery files for random input files with both the CUDA version and the latest release of par2cmdline, then diffing the two sets of recovery files. There is no difference.

Lastly, I'm a student and this is the first open source project I've participated in. The code might not be perfect, but I'll try to fix any bugs or coding style issues that come up, and I'll do my best to learn.

500M input test (options are c -r30 -b4057)
[benchmark screenshot]

20G input test (options are c -q -r30 -b4057)
[benchmark screenshot]

@animetosho
Contributor

This looks quite interesting!

Not commenting on your code or pull request, but as an aside: if you're looking for faster PAR2, it's worth checking out MultiPar and ParPar.
Both employ SIMD to be several times faster than par2cmdline, and the former includes an OpenCL implementation. I've also got an OpenCL implementation in ParPar, but it isn't enabled (I haven't been able to get great performance out of a GPU, unfortunately).

I don't maintain par2cmdline (so again, not a comment on your pull request), but as a general suggestion, you may wish to consult with projects before implementing a large change - it could feel disheartening if it's ultimately not accepted after all. Also consider breaking it into smaller pull requests if possible, to be more palatable for maintainers to review.

Appreciate the contribution nonetheless!

@RisaKirisu
Author

RisaKirisu commented Nov 10, 2022

@animetosho
Thank you for the comment. I checked out the projects you linked to; the benchmark page you created in particular contains a lot of helpful information. Indeed, GF multiplication is the fundamental part of the program. The CUDA routine I implemented uses log and antilog tables for multiplication, and I was not aware of the shuffle and XOR algorithms you mentioned. It seems they could bring a significant performance improvement; I'd like to study them and see whether they can be implemented on a GPU. Right now I expect about half of the log/antilog table lookups to hit in the GPU's L1 cache (because the factors don't change), but there is still a lot of random access outside of L1, so there are definitely improvements to be made.
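
For illustration, the core of my routine is essentially the following (a minimal sketch, not the actual code from this PR; the table and kernel names are made up):

```cuda
#include <cuda_runtime.h>
#include <stdint.h>

// Log/antilog tables for PAR2's GF(2^16) (generator polynomial 0x1100B),
// assumed to be built on the host and copied in with cudaMemcpyToSymbol.
__device__ uint16_t d_log[65536];  // d_log[x] = discrete log of x (x != 0)
__device__ uint16_t d_exp[65535];  // d_exp[e] = generator^e

__device__ uint16_t gf16_mul(uint16_t a, uint16_t b)
{
    if (a == 0 || b == 0) return 0;             // zero has no logarithm
    uint32_t e = (uint32_t)d_log[a] + d_log[b];
    if (e >= 65535) e -= 65535;                 // exponents are mod 2^16 - 1
    return d_exp[e];
}

// acc ^= factor * input, one 16-bit word per thread.
__global__ void gf16_muladd(uint16_t *acc, const uint16_t *input,
                            uint16_t factor, size_t words)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < words)
        acc[i] ^= gf16_mul(input[i], factor);
}
```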

When I started, I didn't know if I would be able to get this done, so perhaps I didn't have enough confidence and thus didn't consult with the project before I started. I will definitely keep that in mind.

@animetosho
Contributor

animetosho commented Nov 11, 2022

I wrote a summary of techniques for CPU here.

There's a paper for what I call "shuffle" here. This may be workable on a GPU, but I haven't tried, as OpenCL doesn't support warp shuffles (which shouldn't be a limitation in CUDA).
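
For reference, the in-register 16-entry table lookup that shuffle-style algorithms need does map directly onto CUDA's warp shuffle. A hypothetical building block (untested, just to illustrate the idea):

```cuda
#include <stdint.h>

// Hypothetical building block: a 16-entry table lookup done entirely in
// registers via a warp shuffle. Lanes 0-15 each hold one table entry in
// `tval`; `idx` is a 4-bit index into that table.
__device__ uint32_t shfl_lookup16(uint32_t tval, uint32_t idx)
{
    // Read `tval` from the lane whose number equals `idx`. No shared or
    // global memory is touched, which is what PSHUFB-style GF kernels want.
    return __shfl_sync(0xFFFFFFFFu, tval, (int)idx);
}
```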

"XOR" is a technique I discovered - it somewhat resembles a Cauchy layout in GF-Complete. I wrote up about it here, but don't think it's workable on a GPU as it requires JIT for each multiplication. Perhaps you could try splitting the multiplication in half, to avoid the need for JIT, but performance might suffer a lot.

GPUs lack instructions for more elaborate techniques that can be done on a CPU (like polynomial-multiply on ARM or GF affine on x86), so you may find that basic algorithms work the best here.

I've played around with the log/exp technique. There are actually a few tweaks that can be done with it:

  • avoid exponentiating the factor, i.e. keep it in logarithm form - this saves a log table lookup when doing the multiply
  • when copying the input to local memory, do the logarithm lookup there - this removes the other log lookup during multiplication (which also means the log table doesn't need to consume cache)
  • with the above, the inner loop should mostly be an antilog lookup + xor accumulate (see the sketch below)
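
Put together, the per-word work could look something like this (a rough sketch under the assumptions above, with made-up names; 0xFFFF serves as a sentinel for zero input words, since valid logarithms only go up to 65534):

```cuda
#include <stdint.h>

__device__ uint16_t d_exp[65535];  // antilog table: d_exp[e] = generator^e

// Inner loop after the tweaks above. log_in[] was filled with the log of
// each input word while staging the input (0xFFFF marks a zero word), and
// log_f is the factor kept in logarithm form.
__global__ void gf16_muladd_logged(uint16_t *acc, const uint16_t *log_in,
                                   uint16_t log_f, size_t words)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= words || log_in[i] == 0xFFFF) return;  // 0 * factor = 0
    uint32_t e = (uint32_t)log_in[i] + log_f;
    if (e >= 65535) e -= 65535;
    acc[i] ^= d_exp[e];    // just one antilog lookup + xor per word
}
```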

The antilog table consumes 128KB, which may be too large for cache. I looked at a split exponentiation technique, which reduces the table size to 16.25KB (which should fit in cache) at the expense of more operations. Results seem to be mixed across the GPUs I've tested.
Splitting the table relies on 2^(a+b) = 2^a * 2^b and on finding a fast way to multiply by 2^b. Essentially I do a lookup on the top 13 bits (i.e. 2^a), then use a second lookup to handle multiplying by the bottom 3 bits.
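
An illustrative sketch of that split (not necessarily the exact table layout): writing the exponent as e = 8*hi + lo, the main table holds 2^(8*hi), and a tiny second table folds the bits that overflow past bit 15 back into the field.

```cuda
#include <stdint.h>

// e = 8*hi + lo (hi = top 13 bits, lo = bottom 3 bits), so
// 2^e = 2^(8*hi) * x^lo, where multiplying by x^lo is a shift + reduction.
__device__ uint16_t d_exp_hi[8192]; // d_exp_hi[hi] = 2^(8*hi)         (16 KB)
__device__ uint16_t d_fold[128];    // d_fold[o]   = o * x^16 mod p(x) (256 B)

__device__ uint16_t gf16_exp_split(uint32_t e)  // e in [0, 65535)
{
    uint32_t t = (uint32_t)d_exp_hi[e >> 3] << (e & 7); // at most 23 bits
    return (uint16_t)((t & 0xFFFF) ^ d_fold[t >> 16]);  // reduce mod p(x)
}
```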

I also looked at halving the log table to 64KB, but it doesn't seem to be beneficial. Exponentiation is trivial to split, but it doesn't work as nicely for log.

From my testing, however, it seems the classic low/high split lookup often works best on a GPU. This is the same algorithm implemented in par2cmdline and in MultiPar's OpenCL code.
A warp shuffle based implementation might still be interesting to see.
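
For completeness, the low/high split relies on x = (x & 0xFF) ^ ((x >> 8) << 8) and on multiplication distributing over XOR, so a fixed factor f needs just two 256-entry tables (1KB, small enough for shared memory). A sketch of such a kernel (illustrative only, names made up):

```cuda
#include <stdint.h>

// Classic low/high split: for a fixed factor f,
//     f * x = lo[x & 0xFF] ^ hi[x >> 8]
// where lo[j] = f * j and hi[j] = f * (j << 8), built on the host.
__global__ void gf16_muladd_lohi(uint16_t *acc, const uint16_t *input,
                                 const uint16_t *lo, const uint16_t *hi,
                                 size_t words)
{
    __shared__ uint16_t s_lo[256], s_hi[256];
    for (int j = threadIdx.x; j < 256; j += blockDim.x) {
        s_lo[j] = lo[j];   // stage both tables in shared memory
        s_hi[j] = hi[j];
    }
    __syncthreads();

    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < words) {
        uint16_t x = input[i];
        acc[i] ^= s_lo[x & 0xFF] ^ s_hi[x >> 8];
    }
}
```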

> When I started, I didn't know if I would be able to get this done, so perhaps I didn't have enough confidence and thus didn't consult with the project before I started.

That's fair, though once you've done some work, there's little harm in asking. If you're still unsure, you could take a look at existing issues to get a sense of project activity, the types of changes that get accepted, etc. At the end of the day, if you hope for your changes to be accepted, you'll need to approach the project maintainers at some point, so sooner might be better than later.
I mostly point this out because par2cmdline hasn't historically been a performance-oriented implementation. I can't speak for the maintainers here, but it's something worth considering when submitting vendor-specific performance optimisations like this.

Hope that was helpful.

@BlackIkeEagle
Member

Thanks for being patient. I like the idea of CUDA acceleration, but I currently lack the hardware to test CUDA and also the time to go through such a huge MR.

Hence I have also updated the README to note that I'm looking for someone to take over the project.

@RisaKirisu
Author

@BlackIkeEagle I'm sorry that I didn't discuss this in an issue before making such a large addition. What would be a proper next step for this?

I later thought of many possible optimizations to this initial CUDA implementation, mostly inspired by animetosho's earlier replies. However, I've been too busy with graduation and job hunting to implement and test most of them. Now that I'm starting to have more free time, I will resume working on this.
