
Why ClPy is slower than CuPy even if on the same machine? #153

Open
LWisteria opened this issue Dec 18, 2018 · 116 comments

@LWisteria
Member

According to the performance report, ClPy is slower than CuPy on both TITAN V and GeForce 1060.

We need to know the reason (and improve if we can).

@LWisteria LWisteria added the investigation (Measure performance, investigate other's materials, papers) label Dec 18, 2018
@LWisteria LWisteria added this to the v2.1.0beta2 milestone Dec 18, 2018
@LWisteria
Member Author

#163 reported that clWaitForEvents was one of the problems.

@LWisteria
Member Author

This article (in Japanese) could help us https://qiita.com/shu65/items/42914bd2ad01d1e323da

@LWisteria LWisteria added the performance (Performance improvement and related) label and removed the investigation (Measure performance, investigate other's materials, papers) label Feb 13, 2019
@LWisteria
Member Author

@vorj I hear you're trying this.

@LWisteria LWisteria modified the milestones: v2.1.0rc1, v2.1.0rc2 Mar 27, 2019
@ybsh ybsh self-assigned this Jul 31, 2019
@ybsh
Collaborator

ybsh commented Aug 2, 2019

@LWisteria
I ran Chainer's (3.3.0) example code train_mnist.py and profiled trainer.run() with cProfile. All I could tell from the result was that the function spends most of its time waiting for I/O completion. Is there any way to further track down the bottleneck?
https://github.com/chainer/chainer/blob/a01c33f515ec23d33eab6e2eff6aa5d4e99f414d/examples/mnist/train_mnist.py#L122

(1 epoch, 500 iterations)

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      145    0.621    0.004    0.621    0.004 {method 'poll' of 'select.poll' objects}
      199    0.238    0.001    0.238    0.001 {built-in method posix.read}
       27    0.220    0.008    0.220    0.008 {built-in method _posixsubprocess.fork_exec}
     4000    0.214    0.000    0.214    0.000 {method 'dot' of 'clpy.core.core.ndarray' objects}
     3000    0.165    0.000    0.244    0.000 (MY HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/optimizers/adam.py:68(update_core_gpu)
     2000    0.157    0.000    0.247    0.000 (MY HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/reporter.py:266(add)
     3500    0.140    0.000    0.356    0.000 (MY HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/clpy-2.1.0rc1-py3.6-linux-x86_64.egg/clpy/core/fusion.py:714(__call__)
     2500    0.117    0.000    0.263    0.000 {method 'sum' of 'clpy.core.core.ndarray' objects}
     8500    0.090    0.000    1.962    0.000 (MY HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/function_node.py:198(apply)
        3    0.080    0.027    0.080    0.027 {method 'normal' of 'mtrand.RandomState' objects}

Environment:
OS: Linux version 4.15.0-54-generic (buildd@lgw01-amd64-039) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10))
GPU: NVIDIA Corporation GV100 [TITAN V]
Python 3.6.5
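
A minimal sketch of how this kind of measurement can be done (assuming trainer is the Trainer object built in train_mnist.py; the exact invocation I used may differ):

    import cProfile
    import pstats

    profiler = cProfile.Profile()
    profiler.enable()
    trainer.run()  # the Chainer Trainer built by train_mnist.py
    profiler.disable()

    # Print the 10 entries with the largest internal time (tottime).
    pstats.Stats(profiler).sort_stats('tottime').print_stats(10)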

@ybsh
Collaborator

ybsh commented Aug 2, 2019

The result above shows the 10 functions with the longest "total time" (tottime) and does not cover all the functions called.

@ybsh
Collaborator

ybsh commented Aug 5, 2019

It seems that, at least, we cannot use nvvp's OpenCL profiling functions in our NVIDIA GPU environments.
A post on the NVIDIA forum states that OpenCL profiling is no longer supported in versions >= 8.
https://devtalk.nvidia.com/default/topic/1030699/visual-profiler-and-nvprof/opencl-profiling-using-nvvp-command-line-profiler/post/5245061/#5245061
(@vorj pointed this out. Thanks.)

@vorj

vorj commented Aug 5, 2019

@ybsh I found 2 profilers for AMD GPUs. I haven't tried them yet, but they seem to support OpenCL kernel profiling. If you don't need to profile on NVIDIA GPUs, you can try them out.
Radeon GPU Profiler
Radeon Compute Profiler

@LWisteria
Member Author

@vorj @ybsh We need an NVIDIA GPU because we need to know why ClPy is slower than CuPy.

@LWisteria
Member Author

If we have no way to profile on NVIDIA GPUs, you must use other profiling tools for Python and Cython itself.

@ybsh
Collaborator

ybsh commented Aug 5, 2019

Thank you again for your help @vorj ,
I came across two "performance tools" linked from NVIDIA's website.
Do these look like what we are looking for? Both seem to "support NVIDIA GPUs and OpenCL".
TAU
Vampir
Since I'm new to both graphics and GPU profilers, I would really appreciate your opinions. @LWisteria @vorj

@LWisteria
Member Author

I'm not soooo good at OpenCL profiling itself. Why don't you just try it?

@ybsh
Collaborator

ybsh commented Aug 7, 2019

Before going into graphics/GPGPU tools, I have profiled train_mnist.py for CuPy/ClPy (again on titanv, with cProfile), this time with the same number of epochs and iterations as the performance report.

CuPy

  • total execution time
real 52.27
user 51.63
sys 1.81
  • top 20 tottimes
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    72000    2.432    0.000    2.931    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/optimizers/adam.py:68(update_core_gpu)
       80    2.189    0.027    2.189    0.027 {built-in method matplotlib._png.write_png}
   218000    2.171    0.000   20.920    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/function_node.py:198(apply)
    70006    2.156    0.000    2.298    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/cupy/core/fusion.py:693(__call__)
   102000    2.138    0.000    2.138    0.000 {method 'dot' of 'cupy.core.core.ndarray' objects}
    52080    2.047    0.000    2.232    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/reporter.py:266(add)
    50000    1.558    0.000    1.614    0.000 {method 'sum' of 'cupy.core.core.ndarray' objects}
    42003    1.335    0.000    2.295    0.000 {built-in method cupy.core.core.array}
    12000    1.305    0.000   11.798    0.001 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/variable.py:891(_backward_main)
    33340    1.226    0.000    1.226    0.000 {built-in method numpy.concatenate}
    14000    1.086    0.000    4.311    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/functions/loss/softmax_cross_entropy.py:111(forward_gpu)
    42000    1.008    0.000    2.259    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/functions/connection/linear.py:30(forward)
    24000    0.832    0.000    0.833    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/functions/activation/relu.py:76(forward_gpu)
   372001    0.822    0.000    2.815    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/variable.py:447(__init__)
    14000    0.783    0.000    0.783    0.000 {built-in method cupy.cudnn.get_handle}
    12000    0.637    0.000    0.683    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/functions/loss/softmax_cross_entropy.py:206(backward_gpu)
   253111    0.625    0.000    0.864    0.000 {built-in method numpy.array}
   399999    0.596    0.000    1.090    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/variable.py:160(__init__)
4550436/4550431    0.582    0.000    0.768    0.000 {built-in method builtins.isinstance}
  1202000    0.581    0.000    1.295    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/datasets/tuple_dataset.py:41(__getitem__)

ClPy

  • total execution time
real 77.34
user 75.85
sys 2.42
  • top 20 tottimes
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   102000    4.765    0.000    4.765    0.000 {method 'dot' of 'clpy.core.core.ndarray' objects}
    72000    4.505    0.000    9.484    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/optimizers/adam.py:68(update_core_gpu)
    52080    3.873    0.000    4.493    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/reporter.py:266(add)
    98000    3.491    0.000    5.770    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/clpy-2.1.0rc1-py3.6-linux-x86_64.egg/clpy/core/fusion.py:714(__call__)
    64000    2.670    0.000    4.281    0.000 {method 'sum' of 'clpy.core.core.ndarray' objects}
   752502    2.434    0.000    4.534    0.000 {built-in method numpy.array}
999976/983678    2.380    0.000    3.336    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
       80    2.179    0.027    2.179    0.027 {built-in method matplotlib._png.write_png}
   218000    2.168    0.000   34.324    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/function_node.py:198(apply)
   582836    2.100    0.000    2.100    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/numpy/core/_internal.py:262(__array_interface__)
    42003    1.684    0.000    2.333    0.000 {built-in method clpy.core.core.array}
    14000    1.488    0.000    9.758    0.001 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/functions/loss/softmax_cross_entropy.py:111(forward_gpu)
    42000    1.446    0.000    3.957    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/functions/connection/linear.py:30(forward)
    12000    1.390    0.000   17.243    0.001 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/variable.py:891(_backward_main)
   582836    1.303    0.000    8.662    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/numpy/core/_internal.py:274(_get_void_ptr)
   582836    1.218    0.000    1.218    0.000 (HOMEDIR)/.pyenv/versions/3.6.5/lib/python3.6/ctypes/__init__.py:484(cast)
   582836    1.105    0.000    1.105    0.000 {method 'from_buffer' of '_ctypes.PyCArrayType' objects}
    12000    1.083    0.000    2.061    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/functions/loss/softmax_cross_entropy.py:206(backward_gpu)
    24000    1.021    0.000    1.070    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/functions/activation/relu.py:76(forward_gpu)
   372001    0.881    0.000    2.972    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/variable.py:447(__init__)

2 observations:

  • The main 'bottleneck' I reported in the previous result turns out to be less important at larger workload sizes.
  • The two runs share many common codepaths, which are overall more performant in CuPy. We cannot ascribe the performance gap to just a few codepaths.

@ybsh
Collaborator

ybsh commented Aug 7, 2019

I think I should try samples with more conspicuous performance gaps, such as word2vec (CuPy: 28 s, ClPy: 192 s).

@ybsh
Collaborator

ybsh commented Aug 7, 2019

@LWisteria I ran Chainer's train_word2vec.py on titanv with the default epochs/iterations (i.e. the same as the performance report), but the execution times were much longer than the reported figures even without cProfile enabled.
Do you think I have misconfigured/misinterpreted something?

(CuPy [s] / ClPy [s])

  • Reported: 27.84 / 192.12
  • My try: 94.62 / 503.71

@LWisteria
Member Author

@ybsh I don't know why, but I think you don't need to know the reason to solve this issue. Do you?

@ybsh
Collaborator

ybsh commented Aug 7, 2019

@LWisteria
Maybe not right now, but maybe yes in the long run.
I thought it would be good to find hidden configuration differences (if there are any) to ensure reproducibility.
At least the benchmark itself is working, so I am currently working on tracking down the bottleneck.

@LWisteria
Member Author

@ybsh You have now confirmed the fact that "ClPy is slower than CuPy", even with configs/settings other than those in the last report. So please keep on working with your configs/settings!

@ybsh
Collaborator

ybsh commented Aug 7, 2019

Profiled train_word2vec.py with cProfile on titanv, epoch = 1.

CuPy

Execution time: 5.791 s

  • top 10 tottime
  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1004    4.133    0.004    4.156    0.004 {method 'get' of 'cupy.core.core.ndarray' objects}
     1004    0.086    0.000    4.297    0.004 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/links/loss/hierarchical_softmax.py:166(forward_gpu)
        1    0.085    0.085    0.085    0.085 {built-in method cupy.cuda.nvrtc.compileProgram}
     6880    0.078    0.000    4.886    0.001 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/function_node.py:198(apply)
     1860    0.078    0.000    0.096    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/optimizers/adam.py:68(update_core_gpu)
     2008    0.075    0.000    0.142    0.000 {built-in method cupy.core.core.array}
     1004    0.068    0.000    0.167    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/functions/connection/embed_id.py:27(forward)
     1005    0.051    0.000    0.063    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/reporter.py:266(add)
      930    0.048    0.000    0.364    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/variable.py:891(_backward_main)
     2008    0.041    0.000    0.041    0.000 {method 'take' of 'numpy.ndarray' objects}

ClPy

Execution time: 25.878 s

  • top 10 tottime
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2008   21.344    0.011   21.388    0.011 {built-in method clpy.core.core.array}
     1004    1.696    0.002    1.711    0.002 {method 'get' of 'clpy.core.core.ndarray' objects}
       74    0.363    0.005    0.363    0.005 {method 'poll' of 'select.poll' objects}
     3724    0.185    0.000    0.360    0.000 {method 'fill' of 'clpy.core.core.ndarray' objects}
     1004    0.140    0.000    2.106    0.002 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/links/loss/hierarchical_softmax.py:166(forward_gpu)
     1860    0.133    0.000    0.297    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/optimizers/adam.py:68(update_core_gpu)
     1005    0.100    0.000    0.230    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/reporter.py:266(add)
      102    0.097    0.001    0.097    0.001 {built-in method posix.read}
     1004    0.096    0.000    0.191    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/functions/connection/embed_id.py:27(forward)
    25013    0.092    0.000    0.172    0.000 {built-in method numpy.array}
    23005    0.080    0.000    0.080    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/numpy/core/_internal.py:262(__array_interface__)

This shows a great performance difference in {built-in method clpy.core.core.array}. Where is this called from?

  • top 10 cumtime (runtime including callees)
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.011    0.011   25.878   25.878 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/trainer.py:258(run)
      930    0.008    0.000   25.020    0.027 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/updater.py:213(update)
      930    0.004    0.000   25.012    0.027 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/updater.py:226(update_core)
     1004    0.002    0.000   21.411    0.021 <string>:171(convert)
     2008    0.006    0.000   21.409    0.011 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/backends/cuda.py:235(to_gpu)
     2008    0.006    0.000   21.397    0.011 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/backends/cuda.py:278(_array_to_gpu)
     2008    0.001    0.000   21.389    0.011 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/clpy-2.1.0rc1-py3.6-linux-x86_64.egg/clpy/creation/from_data.py:44(asarray)
     2008   21.344    0.011   21.388    0.011 {built-in method clpy.core.core.array}
      930    0.016    0.000    3.837    0.004 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/optimizer.py:519(update)
     6880    0.076    0.000    2.917    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/function_node.py:198(apply)

The top 8 show the effective call stack.

@LWisteria
Member Author

Where is this called from?

You can use graphviz to make a call tree from the cProfile results.
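
For example, the gprof2dot package (an external tool, untested here) can turn a saved pstats dump into a Graphviz graph. Alternatively, pstats alone can answer "where is this called from?"; a minimal sketch, assuming the profile was saved to a file (the file name below is hypothetical):

    import pstats

    # e.g. saved beforehand with: python -m cProfile -o word2vec_clpy.pstats train_word2vec.py <args>
    stats = pstats.Stats('word2vec_clpy.pstats')
    # List the callers of the built-in clpy.core.core.array entry.
    stats.print_callers('clpy.core.core.array')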

@ybsh
Collaborator

ybsh commented Aug 7, 2019

Code location:

cpdef ndarray array(obj, dtype=None, bint copy=True, str order='K',

Now I'm moving on to Cython-level profiling.

@ybsh
Collaborator

ybsh commented Dec 6, 2019

As you can see, _launch is causing a notable difference (7.5 s/16 s).
In the following call graphs, corresponding parts are circled with red lines:

  • CuPy: [call graph image: cupy_153-for-profile_epoch20_profil_0]
  • ClPy: [call graph image: clpy_153-for-profile_epoch20_profil_0]

@ybsh
Collaborator

ybsh commented Dec 6, 2019

Since it seems to be the code of _launch itself, rather than its callees, that is taking long in ClPy, I am moving on to line-level performance analysis of this function.

@ybsh ybsh closed this as completed Dec 6, 2019
@ybsh ybsh reopened this Dec 6, 2019
@ybsh
Collaborator

ybsh commented Dec 6, 2019

This means there are still other overheads in _launch() besides the NumPy interfaces removed last month.

By inserting Python's time.perf_counter() into _launch() and measuring the elapsed time of each part,
I found the following:

The elapsed time of the kernel launch phase is quite reasonable, while that of the argument passing phase is not self-evident and needs further investigation, all the more because this part contains a lot of ndarray operations.
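
A sketch of the instrumentation style used here (illustrative only; the phase names are made up and the real _launch() is more involved):

    import time

    timings = {'argument_passing': 0.0, 'kernel_launch': 0.0}

    def _launch_instrumented(args, kernel, queue):
        t0 = time.perf_counter()
        # ... argument-passing phase: convert and set each kernel argument ...
        t1 = time.perf_counter()
        # ... kernel-launch phase: enqueue the kernel on the command queue ...
        t2 = time.perf_counter()
        timings['argument_passing'] += t1 - t0
        timings['kernel_launch'] += t2 - t1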

@LWisteria
Member Author

Is the queue.finish necessary? That could be a difference from CuPy.

@ybsh ybsh mentioned this issue Feb 6, 2020
@ybsh
Collaborator

ybsh commented Feb 6, 2020

@LWisteria As far as I remember, I have inserted synchronizations for CuPy for a fair comparison (which I will check later).
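
For reference, a minimal illustration of the kind of host-side synchronization meant here on the CuPy side (not necessarily the exact call or placement used in my branch):

    import cupy

    # Block the host until all work queued on the current GPU has finished,
    # so that host-side timings include kernel execution time.
    cupy.cuda.Device().synchronize()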

@LWisteria
Member Author

@ybsh Before you restart this issue, you must solve #261 ...

@ybsh
Collaborator

ybsh commented Feb 6, 2020

Exactly. As @LWisteria mentioned, ClPy currently fails to work on one of our servers, titanv. I will try to create a separate, older CUDA environment with nvidia-docker.

@LWisteria LWisteria assigned y1r and unassigned vorj Feb 21, 2020
@LWisteria
Member Author

@y1r I hear you'll try this issue, so I've assigned it to you. Feel free to make comments/reports on this issue in English and/or Japanese.

@ybsh
Collaborator

ybsh commented Feb 21, 2020

@y1r You can refer to this branch for additional synchronizations.
https://github.com/fixstars/clpy/commits/153-for-profile

@ybsh
Collaborator

ybsh commented Feb 21, 2020

This might be helpful, though I haven't tried it:
https://stackoverflow.com/questions/28301931/how-to-profile-cython-functions-line-by-line
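
If I read that post correctly (untried here), the gist is to rebuild the extension with line tracing enabled and then attach line_profiler; roughly:

    # At the top of the .pyx file to be profiled (Cython directive comments):
    # cython: linetrace=True
    # cython: binding=True
    # distutils: define_macros=CYTHON_TRACE_NOGIL=1

    # After rebuilding, from Python:
    import line_profiler
    import clpy.backend.function as function_module

    profiler = line_profiler.LineProfiler()
    # The target must be visible from Python (def/cpdef); otherwise profile
    # its Python-visible caller instead.
    profiler.add_function(function_module._launch)
    profiler.enable_by_count()
    # ... run the workload, e.g. one epoch of train_mnist.py ...
    profiler.disable_by_count()
    profiler.print_stats()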

@ybsh
Collaborator

ybsh commented Feb 21, 2020

@y1r To add synchronizations to CuPy for comparison, you can check diff_cupy.txt of this post #153 (comment)

@y1r
Collaborator

y1r commented Feb 26, 2020

I've investigated the performance of clpy.backend.function._launch by adding some instrumentation code:

  • insert time.perf_counter() calls around each code block after the type checking
  • blocks
    • code block 1:
      buffer_object = a.data.buf.get()
    • code block 2:
      ndim = len(a.strides)
      for d in range(ndim):
          if a.strides[d] % a.itemsize != 0:
              raise ValueError("Stride of dim {0} = {1},"
                               " but item size is {2}"
                               .format(d, a.strides[d], a.itemsize))
          arrayInfo.shape_and_index[d] = a.shape[d]
          arrayInfo.shape_and_index[d + ndim] = a.strides[d]
      arrayInfo.offset = a.data.cl_mem_offset()
      arrayInfo.size = a.size
    • code block 3:
      for d in range(a.ndim):
          indexer.shape_and_index[d] = a.shape[d]
      ptr = <size_t>&indexer
      indexer.size = a.size
      size = a.get_size()
    • code block 4:
      if isinstance(a, clpy.core.core.Size_t):
          if clpy.backend.opencl.utility.typeof_size() \
                  == 'uint':
              a = numpy.uint32(a.val)
          elif clpy.backend.opencl.utility.typeof_size() \
                  == 'ulong':
              a = numpy.uint64(a.val)
          else:
              raise "api_sizeof_size is illegal"
      elif isinstance(a, int):
          a = numpy.int_(a)
      elif isinstance(a, float):
          a = numpy.float_(a)
      elif isinstance(a, bool):
          a = numpy.bool_(a)
    • code block 5:
      if core.numpy_scalar_type_set():
          ptr = <size_t>a.__array_interface__["data"][0]
          size = a.nbytes
      else:
          raise TypeError('Unsupported type %s' % type(a))
    • code block 6:
      cdef size_t gws[3]
      for i in range(global_dim):
          gws[i] = global_work_size[i]
      cdef size_t lws[3]
      cdef size_t* lws_ptr
      if local_dim > 0:
          for i in range(local_dim):
              lws[i] = local_work_size[i]
          lws_ptr = &lws[0]
      else:
          lws_ptr = <size_t*>NULL

I got the following elapsed times [s] by running train_mnist.py on titanv with 100 iterations:

  • code block 1: 0.006035
  • code block 2: 0.028756
  • code block 3: 0.006422
  • code block 4: 0.005220
  • code block 5: 0.023301
  • code block 6: 0.000841

I suggest two optimizations:

  • for code block 2 (but currently I am not sure how to optimize this)
  • for code blocks 4 and 5, use an immediate value for SetKernelArg's argument (see the sketch below)

What do you think? @LWisteria @ybsh
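
To illustrate the second suggestion above (a hedged sketch; the variable names are illustrative, not ClPy's actual code): instead of wrapping a Python scalar in a NumPy scalar and then recovering its pointer through __array_interface__ and ctypes (code block 5), the scalar could be packed into raw bytes once and handed to clSetKernelArg directly as an immediate value:

    import struct

    n = 1024                           # an int kernel argument (illustrative)
    arg_bytes = struct.pack('i', n)    # 4 bytes of a C int, ready to be handed
                                       # to SetKernelArg by pointer and size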

@LWisteria
Member Author

As @ybsh reported, blocks 2 and 5 must be among the reasons for ClPy's slowness compared to CuPy. I agree with @y1r .

@nsakabe-fixstars what do you think?

@LWisteria
Member Author

@ybsh By the way, where can we see the difference in train_mnist.py time between the current ClPy and the base CuPy?
@y1r Can you point to it if you know?

@y1r
Collaborator

y1r commented Feb 26, 2020

@LWisteria See #153 (comment) . I used a different number of iterations for the benchmark, so comparing my result with his is not meaningful.

@LWisteria
Member Author

LWisteria commented Feb 26, 2020

@y1r Thanks, I had overlooked that.

difference (7.5 s/16 s)

That means blocks 2 and 5 account for about 10% = 1.6 s of ClPy's time. That's remarkable.

@ghost

ghost commented Feb 27, 2020

@y1r @ybsh @LWisteria
I agree that _launch's overhead (blocks 2 and 5) is what we should shrink.

How about looking into the Global Interpreter Lock (GIL)?
https://github.com/kzky/languages/tree/master/cython/nogil
If we use only C objects, we can release the GIL with a nogil: section.

@cython.boundscheck(False) can be helpful.
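
A minimal Cython sketch of the idea (illustrative only, not ClPy's actual code; the function and argument names are made up):

    cimport cython

    @cython.boundscheck(False)
    @cython.wraparound(False)
    cdef void fill_shape_and_strides(Py_ssize_t[::1] shape,
                                     Py_ssize_t[::1] strides,
                                     size_t[::1] shape_and_index):
        # shape_and_index must have room for 2 * ndim entries.
        cdef Py_ssize_t d, ndim = shape.shape[0]
        with nogil:  # only C-level operations inside, so the GIL can be released
            for d in range(ndim):
                shape_and_index[d] = <size_t>shape[d]
                shape_and_index[d + ndim] = <size_t>strides[d]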

@y1r
Collaborator

y1r commented Feb 27, 2020

Using nogil is a good idea. It is also on my ToDo list.

@ybsh ybsh mentioned this issue Mar 2, 2020
@ybsh
Collaborator

ybsh commented Mar 5, 2020

@y1r and I decided to split our (performance improvement) work.
#153 (comment)
I'm starting to work on code block 2, and @y1r is working on code blocks 4 and 5.

@ybsh
Collaborator

ybsh commented Mar 5, 2020

@LWisteria @nsakabe-fixstars @vorj
Would it be better if we made a separate issue for each code block?

@ghost

ghost commented Mar 5, 2020

@ybsh Yes, please create new issues for your(our) convenience!

@y1r
Collaborator

y1r commented Mar 5, 2020

I've optimized code blocks 4 and 5 (merged 4 into 5) by accessing the buffer protocol directly.
I'll create a PR soon.

0 0.005388
1 0.024878
2 0.005229
3 0.000000
4 0.003777
5 0.000785
