Added support for GPU acceleration (CUDA) on recovery file creation. #176
base: master
Conversation
…t hash is computed incorrectly. It works now!
…rrectly. Now everything works.
This looks quite interesting! Not commenting on your code or pull request, but as an aside, if you're looking for faster PAR2, it's worth checking out MultiPar and ParPar. I don't maintain par2cmdline (so again, not a comment on your pull request), but as a general suggestion, you may wish to consult with projects before implementing a large change - it could feel disheartening if it's ultimately not accepted after all. Also consider breaking it into smaller pull requests if possible, to be more palatable for maintainers to review. Appreciate the contribution nonetheless!
@animetosho When I started, I didn't know if I would be able to get this done, so perhaps I didn't have enough confidence, and thus didn't consult with the project before starting. I will definitely keep that in mind.
I wrote a summary of techniques for CPU here. There's a paper for what I call "shuffle" here. This may be workable on a GPU, but I haven't tried, as OpenCL doesn't support warp shuffles (which shouldn't be a limitation in CUDA). "XOR" is a technique I discovered - it somewhat resembles a Cauchy layout in GF-Complete. I wrote it up here, but I don't think it's workable on a GPU, as it requires JIT for each multiplication. Perhaps you could try splitting the multiplication in half, to avoid the need for JIT, but performance might suffer a lot. GPUs lack instructions for the more elaborate techniques that can be done on a CPU (like polynomial-multiply on ARM or GF affine on x86), so you may find that basic algorithms work best here. I've played around with the log/exp technique. There are actually a few tweaks that can be done with it:
The antilog table consumes 128KB, which may be too large for cache. I looked at a split exponentiation technique, which reduces table size to 16.25KB (which should fit in cache), at the expense of more operations. Results seem to be mixed across GPUs I've tested. I also looked at halving the log table to 64KB, but it doesn't seem to be beneficial. Exponentiation is trivial to split, but it doesn't work as nicely for log. However, it seems that the classic low/high split lookup often works best on GPU, from my testing. This is the same algorithm implemented in par2cmdline and MultiPar's OpenCL code.
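For reference, the low/high split lookup described above can be sketched as follows. This is a plain C++ host-side illustration (the names and table layout are my own, not taken from par2cmdline or this PR), assuming PAR2's GF(2^16) with reduction polynomial 0x1100B:

```cpp
#include <cstdint>

// Reduction polynomial for PAR2's GF(2^16): x^16 + x^12 + x^3 + x + 1.
static const uint32_t POLY = 0x1100B;

// Bitwise "schoolbook" carry-less multiply with reduction. Slow but
// obviously correct; used only to build the lookup tables.
static uint16_t gf_mul_slow(uint16_t a, uint16_t b) {
    uint32_t acc = 0, aa = a;
    for (uint16_t bb = b; bb; bb >>= 1) {
        if (bb & 1) acc ^= aa;       // add (XOR) a shifted copy of a
        aa <<= 1;
        if (aa & 0x10000) aa ^= POLY; // reduce back into 16 bits
    }
    return (uint16_t)acc;
}

// Low/high split lookup: for one fixed factor f, precompute f*x for
// every possible low byte and every possible high byte of the other
// operand. Each table is 256 x 2 bytes, so both together are 1KB.
struct SplitMul {
    uint16_t lo[256], hi[256];
    explicit SplitMul(uint16_t f) {
        for (int x = 0; x < 256; x++) {
            lo[x] = gf_mul_slow(f, (uint16_t)x);
            hi[x] = gf_mul_slow(f, (uint16_t)(x << 8));
        }
    }
    // f*a = f*(low byte of a) ^ f*(high byte of a << 8), since GF
    // addition is XOR and multiplication distributes over it.
    uint16_t mul(uint16_t a) const {
        return lo[a & 0xFF] ^ hi[a >> 8];
    }
};
```

The appeal on GPU is the footprint: per factor this needs 1KB of tables rather than the 128KB antilog table, at the cost of rebuilding (or re-fetching) tables when the factor changes.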
That's fair, though once you've done some work, there's little harm in asking. If you're still unsure, you could take a look at existing issues to get a sense of project activity, the types of changes that get accepted, etc. At the end of the day, if you hope for your changes to be accepted, you'll need to approach the project maintainers at some point, so it might be better sooner rather than later. Hope that was helpful.
Thanks for being patient. I like the idea of CUDA acceleration. I currently lack the hardware to test CUDA, and also the time to go through such a huge MR. Hence I have also updated the README to note that I'm looking for someone to take over the project.
@BlackIkeEagle I'm sorry that I didn't discuss this in an issue before doing it and making a large addition. So what would be a proper next step for this? I later thought of many possible optimizations to this initial implementation of the CUDA routine, mostly inspired by earlier replies from animetosho. However, I've been too busy with graduation and job hunting, so I haven't gotten to implement and test much of them. Now that I'm starting to have more free time, I will resume working on this.
A few months ago I used this software to create my backup, but I felt it was very slow. At the same time, I was learning CUDA. So over the past few months, I learned Reed-Solomon encoding, read the paper that this project is based on, and studied the code in this project. Then I wrote a CUDA compute routine for recovery file creation. In my test, the overall run time of the CUDA routine on my RTX 2080 Ti is around 4X faster than the OpenMP routine on my Ryzen 5800X (8 cores, 16 threads).
Since Autotools is rather painful to add CUDA compilation support to, I wrote a new build script using CMake. Users can compile with the CUDA part enabled by passing ENABLE_CUDA=ON to CMake. When the CUDA option is not enabled, the CMake script should produce the same program as the Automake script.
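An optional-CUDA setup in CMake typically looks something like the sketch below. The file and target names here are illustrative only, not taken from this PR; only the ENABLE_CUDA option name comes from the description above.

```cmake
cmake_minimum_required(VERSION 3.18)
project(par2cmdline CXX)

# Option name from the PR description; OFF keeps the CPU-only build.
option(ENABLE_CUDA "Build the CUDA recovery-computation routine" OFF)

# Source list is illustrative, not the PR's actual file layout.
add_executable(par2 src/par2cmdline.cpp)

if(ENABLE_CUDA)
  enable_language(CUDA)
  target_sources(par2 PRIVATE src/gpu_rs.cu)        # hypothetical CUDA source
  target_compile_definitions(par2 PRIVATE ENABLE_CUDA)
endif()
```

With a layout like this, `cmake -DENABLE_CUDA=ON ..` enables the CUDA path, while a plain `cmake ..` builds the same CPU-only program.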
I tested the CUDA routine by producing recovery files for random input files with both the CUDA version and the latest release of par2cmdline, and then diffing the recovery files produced by each. There was no difference.
Lastly, I'm a student and this is the first open source project I've participated in. The code might not be perfect, but I'll try to fix any bugs or coding-style issues that come up, and I'll do my best to learn.
500M input test (options are c -r30 -b4057)
20G input test (options are c -q -r30 -b4057)