Why ClPy is slower than CuPy even if on the same machine? #153
Reported in #163.
This article (in Japanese) could help us: https://qiita.com/shu65/items/42914bd2ad01d1e323da
@vorj I hear you're trying this.
@LWisteria (1 epoch, 500 iterations)
Environment:
The result above shows the 10 functions with the longest "total time"s.
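For reference, a minimal cProfile sketch of how such a top-10 "total time" list can be produced. The `train()` entry point is a placeholder for whichever Chainer example was actually run:

```python
import cProfile
import pstats

def train():
    # Placeholder for the actual benchmark, e.g. a Chainer example's
    # training loop run for 1 epoch / 500 iterations.
    pass

# Profile the call and dump the raw stats to a file.
cProfile.run("train()", "bench.pstats")

# Print the 10 functions with the longest "tottime", i.e. time spent
# in the function itself, excluding its callees.
stats = pstats.Stats("bench.pstats")
stats.sort_stats("tottime").print_stats(10)
```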
It seems that, at least in our NVIDIA GPU environments, we cannot use nvvp's OpenCL profiling functions.
@ybsh I found two profilers for AMD GPUs. I haven't tried them yet, but they seem to support OpenCL kernel profiling. If you don't need to profile on NVIDIA GPUs, you can try those out.
If we have no way to profile on NVIDIA GPUs, you'll have to use other profiling tools for Python and Cython themselves.
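As a vendor-tool-independent fallback, OpenCL itself can time individual kernels through event profiling. A minimal PyOpenCL sketch (not ClPy code; the kernel is a trivial placeholder):

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
# Enable per-command timestamps on the queue.
queue = cl.CommandQueue(
    ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

prog = cl.Program(ctx, """
__kernel void scale(__global float *x) { x[get_global_id(0)] *= 2.0f; }
""").build()

x = np.arange(1 << 20, dtype=np.float32)
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                hostbuf=x)

evt = prog.scale(queue, x.shape, None, buf)
evt.wait()
# Event timestamps are in nanoseconds.
print("kernel time: %.6f ms" % ((evt.profile.end - evt.profile.start) * 1e-6))
```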
Thank you again for your help, @vorj.
I'm not soooo good at OpenCL profiling itself. Why don't you just try it?
Before going into graphics/GPGPU tools, I have profiled CuPy and ClPy.
Two observations:
I think I should try samples with more conspicuous performance gaps, such as word2vec (CuPy: 192s, ClPy: 28s).
@LWisteria I ran Chainer's (CuPy [s] / ClPy [s])
@ybsh I don't know why, but I think you don't need to know the reason to solve this issue. Do you?
@LWisteria
@ybsh Now you've confirmed that ClPy is slower than CuPy, even with different configs/settings from the last report. So please keep doing your work with your configs/settings!
Profiled:
CuPy execution time: 5.791 s
ClPy execution time: 25.878 s
This shows a great performance difference in ...
The top 8 entries show the effective call stack.
You can use Graphviz to make a call tree from cProfile results.
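For instance, with the gprof2dot package (one common pstats-to-dot converter; using it here is an assumption, and the script name is a placeholder) the cProfile output can be rendered as a call-tree image:

```python
import subprocess

# 1) Profile the benchmark script into a pstats file.
subprocess.run(["python", "-m", "cProfile", "-o", "bench.pstats",
                "train_example.py"], check=True)

# 2) Convert the pstats file to Graphviz dot source and render a PNG.
#    Requires `pip install gprof2dot` and Graphviz's `dot` on PATH.
dot_src = subprocess.run(["gprof2dot", "-f", "pstats", "bench.pstats"],
                         check=True, capture_output=True).stdout
subprocess.run(["dot", "-Tpng", "-o", "calltree.png"],
               input=dot_src, check=True)
```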
Code location: Line 2153 in 3a97570
Now I'm moving on to Cython-level profiling.
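For Cython-level profiling, Cython modules need to be compiled with the `profile` directive so that cProfile can see their functions. A build-time sketch (the `.pyx` glob is hypothetical, not ClPy's actual layout):

```python
# setup.py sketch: rebuild the extension modules with profiling hooks
# so cProfile/pstats can attribute time to Cython functions.
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize(
        "clpy/**/*.pyx",                        # hypothetical module glob
        compiler_directives={"profile": True},  # insert profiling hooks
    ),
)
```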
Since it seems the code of ...
This means there are still other overheads elsewhere. By inserting Python's time measurements, I broke the call down into an argument-passing phase and a kernel-launch phase. The elapsed time of the kernel-launch phase is quite reasonable, while that of the argument-passing phase is not self-evident and needs further investigation, all the more because that part contains a lot of small Python-level operations.
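A sketch of the kind of instrumentation meant here: a small helper that accumulates wall-clock time per named phase, to be wrapped around the argument-passing and kernel-launch code (the phase names are illustrative, not ClPy identifiers):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

phase_totals = defaultdict(float)

@contextmanager
def phase(name):
    # Accumulate wall-clock time spent inside each named phase.
    t0 = time.perf_counter()
    try:
        yield
    finally:
        phase_totals[name] += time.perf_counter() - t0

# Usage (illustrative):
#   with phase("set args"):
#       ...  # pass arguments to the kernel
#   with phase("launch"):
#       ...  # enqueue the kernel; include queue.finish() here if the
#            # launch should be measured through to completion
# print(dict(phase_totals))
```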
Is the queue.finish necessary? That could be a difference from CuPy.
@LWisteria As far as I remember, I have inserted synchronizations for CuPy for a fair comparison (which I will check later).
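The point of those synchronizations: CuPy launches kernels asynchronously, so an unsynchronized timer measures only the launch, not the kernel itself. A minimal illustration using standard CuPy API (not the exact diff used in this thread):

```python
import time
import cupy as cp

x = cp.random.rand(4096, 4096, dtype=cp.float32)

start = time.perf_counter()
y = x @ x                        # enqueued asynchronously; returns at once
cp.cuda.Device().synchronize()   # wait for the GPU, or the timing is bogus
print("matmul: %.3f s" % (time.perf_counter() - start))
```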
Exactly. As @LWisteria mentioned, ClPy currently fails to work on a server of ours.
@y1r I hear you'll try this issue, so I've assigned you. Feel free to make comments/reports on this issue in English and/or Japanese.
@y1r You can refer to this branch for additional synchronizations.
This might be helpful, though I haven't tried it:
@y1r To add synchronizations to CuPy for comparison, you can check diff_cupy.txt in this post: #153 (comment)
I've investigated the performance of ...
I got the following elapsed times [sec] by running ...
I suggest two optimizations:
What do you think? @LWisteria @ybsh
@LWisteria See #153 (comment). I used a different number of iterations for the benchmark, so comparing my result with his is meaningless.
@y1r Thanks, I had overlooked that.
That means blocks 2 and 5 account for about 10% (= 1.6 s) of ClPy's time. That's remarkable.
@y1r @ybsh @LWisteria How about investigating the Global Interpreter Lock (GIL)?
Using nogil is a good idea. This is also on my ToDo list.
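A self-contained illustration of why this matters (plain Python/NumPy, not ClPy code): a pure-Python loop holds the GIL, so two threads serialize, while a NumPy matmul releases the GIL inside BLAS and the threads actually run in parallel; releasing the GIL in ClPy's Cython code (`with nogil:`) targets the same effect:

```python
import threading
import time
import numpy as np

def py_work():
    s = 0
    for i in range(10_000_000):   # pure Python: holds the GIL throughout
        s += i

def np_work():
    a = np.random.rand(1500, 1500)
    a @ a                         # BLAS call: releases the GIL

def run_in_threads(fn, n=2):
    threads = [threading.Thread(target=fn) for _ in range(n)]
    t0 = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - t0

print("pure Python, 2 threads: %.3f s" % run_in_threads(py_work))
print("NumPy matmul, 2 threads: %.3f s" % run_in_threads(np_work))
```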
@y1r and I decided to split our (performance-improvement) work.
@LWisteria @nsakabe-fixstars @vorj
@ybsh Yes, please create new issues for your (our) convenience!
I've optimized code blocks 4 and 5 (merging 4 into 5) by accessing the buffer protocol directly.
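The idea, sketched in plain Python (the actual change lives in ClPy's Cython code): going through the buffer protocol gives zero-copy access to an array's memory, avoiding intermediate conversions when passing kernel arguments:

```python
import numpy as np

a = np.arange(8, dtype=np.float32)

view = memoryview(a)        # buffer protocol: a zero-copy view of a's memory
raw = view.cast("B")        # reinterpret as raw bytes, still no copy

copied = a.tobytes()        # by contrast, this allocates and copies

print(raw.nbytes, len(copied))   # both 32 bytes, but only tobytes() copied
```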
According to the performance report, ClPy is slower than CuPy on both TITAN V and GeForce 1060.
We need to know the reason (and improve it if we can).