
Why ClPy is slower than CuPy even if on the same machine? #153

Open
LWisteria opened this issue Dec 18, 2018 · 116 comments

@LWisteria
Member

According to the performance report, ClPy is slower than CuPy on both TITAN V and GeForce 1060.

We need to know the reason (and improve if we can).

@LWisteria LWisteria added the investigation (Measure performance, investigate other's materials, papers) label Dec 18, 2018
@LWisteria LWisteria added this to the v2.1.0beta2 milestone Dec 18, 2018
@LWisteria
Member Author

#163 reported that clWaitForEvents was one of the problems.

@LWisteria
Member Author

This article (in Japanese) could help us https://qiita.com/shu65/items/42914bd2ad01d1e323da

@LWisteria LWisteria added the performance (Performance improvement and related) label and removed the investigation (Measure performance, investigate other's materials, papers) label Feb 13, 2019
@LWisteria
Member Author

@vorj I hear you're trying this.

@LWisteria LWisteria modified the milestones: v2.1.0rc1, v2.1.0rc2 Mar 27, 2019
@ybsh ybsh self-assigned this Jul 31, 2019
@ybsh
Collaborator

ybsh commented Aug 2, 2019

@LWisteria
I ran Chainer's (3.3.0) example code train_mnist.py and profiled trainer.run() with cProfile. All I could tell from the result was that the function spends most of its time waiting for I/O completion. Is there any way to further track down the bottleneck?
https://github.com/chainer/chainer/blob/a01c33f515ec23d33eab6e2eff6aa5d4e99f414d/examples/mnist/train_mnist.py#L122

(1 epoch, 500 iterations)

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      145    0.621    0.004    0.621    0.004 {method 'poll' of 'select.poll' objects}
      199    0.238    0.001    0.238    0.001 {built-in method posix.read}
       27    0.220    0.008    0.220    0.008 {built-in method _posixsubprocess.fork_exec}
     4000    0.214    0.000    0.214    0.000 {method 'dot' of 'clpy.core.core.ndarray' objects}
     3000    0.165    0.000    0.244    0.000 (MY HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/optimizers/adam.py:68(update_core_gpu)
     2000    0.157    0.000    0.247    0.000 (MY HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/reporter.py:266(add)
     3500    0.140    0.000    0.356    0.000 (MY HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/clpy-2.1.0rc1-py3.6-linux-x86_64.egg/clpy/core/fusion.py:714(__call__)
     2500    0.117    0.000    0.263    0.000 {method 'sum' of 'clpy.core.core.ndarray' objects}
     8500    0.090    0.000    1.962    0.000 (MY HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/function_node.py:198(apply)
        3    0.080    0.027    0.080    0.027 {method 'normal' of 'mtrand.RandomState' objects}

Environment:
OS: Linux version 4.15.0-54-generic (buildd@lgw01-amd64-039) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10))
GPU: NVIDIA Corporation GV100 [TITAN V]
Python 3.6.5
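
A minimal sketch of how this kind of measurement can be done (assuming trainer is the Trainer object built in train_mnist.py; the exact invocation I used may differ):

    import cProfile
    import pstats

    profiler = cProfile.Profile()
    profiler.enable()
    trainer.run()  # the Chainer Trainer built by train_mnist.py
    profiler.disable()

    # Print the 10 entries with the largest internal time (tottime).
    pstats.Stats(profiler).sort_stats('tottime').print_stats(10)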

@ybsh
Collaborator

ybsh commented Aug 2, 2019

The result above shows the 10 functions with the longest "total time" (tottime) and does not cover all the functions called.

@ybsh
Collaborator

ybsh commented Aug 5, 2019

It seems that, at least, we cannot use nvvp's OpenCL profiling functions in our NVIDIA GPU environments.
A post on the NVIDIA forum states that OpenCL profiling is no longer supported in versions >= 8.
https://devtalk.nvidia.com/default/topic/1030699/visual-profiler-and-nvprof/opencl-profiling-using-nvvp-command-line-profiler/post/5245061/#5245061
(@vorj pointed this out. Thanks.)

@vorj

vorj commented Aug 5, 2019

@ybsh I found 2 profilers for AMD GPUs. I haven't tried them yet, but they seem to support OpenCL kernel profiling. If you don't need to profile on NVIDIA GPUs, you can try them out.
Radeon GPU Profiler
Radeon Compute Profiler

@LWisteria
Member Author

@vorj @ybsh We need an NVIDIA GPU because we need to know why ClPy is slower than CuPy.

@LWisteria
Member Author

If we have no way to profile on NVIDIA GPUs, you must use other profiling tools for Python and Cython itself.

@ybsh
Collaborator

ybsh commented Aug 5, 2019

Thank you again for your help @vorj ,
I came across two "performance tools" linked from NVIDIA's website.
Do these look like what we are looking for? Both seem to "support NVIDIA GPUs and OpenCL".
TAU
Vampir
Since I'm new to both graphics and GPU profilers, I would really appreciate your opinions. @LWisteria @vorj

@LWisteria
Member Author

I'm not soooo good at OpenCL profiling itself. Why don't you just try it?

@ybsh
Collaborator

ybsh commented Aug 7, 2019

Before going into graphics/GPGPU tools, I have profiled train_mnist.py for CuPy/ClPy (again on titanv, with cProfile), this time with the same number of epochs and iterations as the performance report.

CuPy

  • total execution time
real 52.27
user 51.63
sys 1.81
  • top 20 tottimes
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    72000    2.432    0.000    2.931    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/optimizers/adam.py:68(update_core_gpu)
       80    2.189    0.027    2.189    0.027 {built-in method matplotlib._png.write_png}
   218000    2.171    0.000   20.920    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/function_node.py:198(apply)
    70006    2.156    0.000    2.298    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/cupy/core/fusion.py:693(__call__)
   102000    2.138    0.000    2.138    0.000 {method 'dot' of 'cupy.core.core.ndarray' objects}
    52080    2.047    0.000    2.232    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/reporter.py:266(add)
    50000    1.558    0.000    1.614    0.000 {method 'sum' of 'cupy.core.core.ndarray' objects}
    42003    1.335    0.000    2.295    0.000 {built-in method cupy.core.core.array}
    12000    1.305    0.000   11.798    0.001 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/variable.py:891(_backward_main)
    33340    1.226    0.000    1.226    0.000 {built-in method numpy.concatenate}
    14000    1.086    0.000    4.311    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/functions/loss/softmax_cross_entropy.py:111(forward_gpu)
    42000    1.008    0.000    2.259    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/functions/connection/linear.py:30(forward)
    24000    0.832    0.000    0.833    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/functions/activation/relu.py:76(forward_gpu)
   372001    0.822    0.000    2.815    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/variable.py:447(__init__)
    14000    0.783    0.000    0.783    0.000 {built-in method cupy.cudnn.get_handle}
    12000    0.637    0.000    0.683    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/functions/loss/softmax_cross_entropy.py:206(backward_gpu)
   253111    0.625    0.000    0.864    0.000 {built-in method numpy.array}
   399999    0.596    0.000    1.090    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/variable.py:160(__init__)
4550436/4550431    0.582    0.000    0.768    0.000 {built-in method builtins.isinstance}
  1202000    0.581    0.000    1.295    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/datasets/tuple_dataset.py:41(__getitem__)

ClPy

  • total execution time
real 77.34
user 75.85
sys 2.42
  • top 20 tottimes
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   102000    4.765    0.000    4.765    0.000 {method 'dot' of 'clpy.core.core.ndarray' objects}
    72000    4.505    0.000    9.484    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/optimizers/adam.py:68(update_core_gpu)
    52080    3.873    0.000    4.493    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/reporter.py:266(add)
    98000    3.491    0.000    5.770    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/clpy-2.1.0rc1-py3.6-linux-x86_64.egg/clpy/core/fusion.py:714(__call__)
    64000    2.670    0.000    4.281    0.000 {method 'sum' of 'clpy.core.core.ndarray' objects}
   752502    2.434    0.000    4.534    0.000 {built-in method numpy.array}
999976/983678    2.380    0.000    3.336    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
       80    2.179    0.027    2.179    0.027 {built-in method matplotlib._png.write_png}
   218000    2.168    0.000   34.324    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/function_node.py:198(apply)
   582836    2.100    0.000    2.100    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/numpy/core/_internal.py:262(__array_interface__)
    42003    1.684    0.000    2.333    0.000 {built-in method clpy.core.core.array}
    14000    1.488    0.000    9.758    0.001 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/functions/loss/softmax_cross_entropy.py:111(forward_gpu)
    42000    1.446    0.000    3.957    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/functions/connection/linear.py:30(forward)
    12000    1.390    0.000   17.243    0.001 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/variable.py:891(_backward_main)
   582836    1.303    0.000    8.662    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/numpy/core/_internal.py:274(_get_void_ptr)
   582836    1.218    0.000    1.218    0.000 (HOMEDIR)/.pyenv/versions/3.6.5/lib/python3.6/ctypes/__init__.py:484(cast)
   582836    1.105    0.000    1.105    0.000 {method 'from_buffer' of '_ctypes.PyCArrayType' objects}
    12000    1.083    0.000    2.061    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/functions/loss/softmax_cross_entropy.py:206(backward_gpu)
    24000    1.021    0.000    1.070    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/functions/activation/relu.py:76(forward_gpu)
   372001    0.881    0.000    2.972    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/variable.py:447(__init__)

2 observations:

  • The main 'bottleneck' I reported in the previous result turns out to be less important at larger workload sizes.
  • The two runs share many common codepaths, which are overall more performant in CuPy. We cannot ascribe the performance gap to just a few codepaths.

@ybsh
Collaborator

ybsh commented Aug 7, 2019

I think I should try samples with more conspicuous performance gaps, such as word2vec (CuPy: 28 s, ClPy: 192 s).

@ybsh
Collaborator

ybsh commented Aug 7, 2019

@LWisteria I ran Chainer's train_word2vec.py on titanv with the default epochs/iterations (i.e. the same as the performance report), but the execution times were much longer than the reported figures even without cProfile enabled.
Do you think I have misconfigured/misinterpreted something?

(CuPy [s] / ClPy [s])

  • Reported: 27.84 / 192.12
  • My try: 94.62 / 503.71

@LWisteria
Member Author

@ybsh I don't know why, but I think you don't need to know the reason to solve this issue. Do you?

@ybsh
Collaborator

ybsh commented Aug 7, 2019

@LWisteria
Maybe not right now, but maybe yes in the long run.
I thought it would be good to find hidden configuration differences (if there are any) to ensure reproducibility.
At least the benchmark itself is working, so I am currently working on tracking down the bottleneck.

@LWisteria
Member Author

@ybsh You have now confirmed the fact that "ClPy is slower than CuPy", even with configs/settings other than those in the last report. So please keep on working with your configs/settings!

@ybsh
Collaborator

ybsh commented Aug 7, 2019

Profiled train_word2vec.py with cProfile on titanv, epoch = 1.

CuPy

Execution time: 5.791 s

  • top 10 tottime
  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1004    4.133    0.004    4.156    0.004 {method 'get' of 'cupy.core.core.ndarray' objects}
     1004    0.086    0.000    4.297    0.004 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/links/loss/hierarchical_softmax.py:166(forward_gpu)
        1    0.085    0.085    0.085    0.085 {built-in method cupy.cuda.nvrtc.compileProgram}
     6880    0.078    0.000    4.886    0.001 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/function_node.py:198(apply)
     1860    0.078    0.000    0.096    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/optimizers/adam.py:68(update_core_gpu)
     2008    0.075    0.000    0.142    0.000 {built-in method cupy.core.core.array}
     1004    0.068    0.000    0.167    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/functions/connection/embed_id.py:27(forward)
     1005    0.051    0.000    0.063    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/reporter.py:266(add)
      930    0.048    0.000    0.364    0.000 (HOMEDIR)/.pyenv/versions/cupy/lib/python3.6/site-packages/chainer/variable.py:891(_backward_main)
     2008    0.041    0.000    0.041    0.000 {method 'take' of 'numpy.ndarray' objects}

ClPy

Execution time: 25.878 s

  • top 10 tottime
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2008   21.344    0.011   21.388    0.011 {built-in method clpy.core.core.array}
     1004    1.696    0.002    1.711    0.002 {method 'get' of 'clpy.core.core.ndarray' objects}
       74    0.363    0.005    0.363    0.005 {method 'poll' of 'select.poll' objects}
     3724    0.185    0.000    0.360    0.000 {method 'fill' of 'clpy.core.core.ndarray' objects}
     1004    0.140    0.000    2.106    0.002 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/links/loss/hierarchical_softmax.py:166(forward_gpu)
     1860    0.133    0.000    0.297    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/optimizers/adam.py:68(update_core_gpu)
     1005    0.100    0.000    0.230    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/reporter.py:266(add)
      102    0.097    0.001    0.097    0.001 {built-in method posix.read}
     1004    0.096    0.000    0.191    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/functions/connection/embed_id.py:27(forward)
    25013    0.092    0.000    0.172    0.000 {built-in method numpy.array}
    23005    0.080    0.000    0.080    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/numpy/core/_internal.py:262(__array_interface__)

This shows a great performance difference in {built-in method clpy.core.core.array}. Where is this called from?

  • top 10 cumtime (runtime including callees)
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.011    0.011   25.878   25.878 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/trainer.py:258(run)
      930    0.008    0.000   25.020    0.027 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/updater.py:213(update)
      930    0.004    0.000   25.012    0.027 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/training/updater.py:226(update_core)
     1004    0.002    0.000   21.411    0.021 <string>:171(convert)
     2008    0.006    0.000   21.409    0.011 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/backends/cuda.py:235(to_gpu)
     2008    0.006    0.000   21.397    0.011 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/backends/cuda.py:278(_array_to_gpu)
     2008    0.001    0.000   21.389    0.011 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/clpy-2.1.0rc1-py3.6-linux-x86_64.egg/clpy/creation/from_data.py:44(asarray)
     2008   21.344    0.011   21.388    0.011 {built-in method clpy.core.core.array}
      930    0.016    0.000    3.837    0.004 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/optimizer.py:519(update)
     6880    0.076    0.000    2.917    0.000 (HOMEDIR)/.pyenv/versions/clpy/lib/python3.6/site-packages/chainer/function_node.py:198(apply)

The top 8 show the effective call stack.

@LWisteria
Member Author

Where is this called from?

You can use graphviz to make a call tree from the cProfile results.
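
For example, the gprof2dot package (an external tool, untested here) can turn a saved pstats dump into a Graphviz graph. Alternatively, pstats alone can answer "where is this called from?"; a minimal sketch, assuming the profile was saved to a file (the file name below is hypothetical):

    import pstats

    # e.g. saved beforehand with: python -m cProfile -o word2vec_clpy.pstats train_word2vec.py <args>
    stats = pstats.Stats('word2vec_clpy.pstats')
    # List the callers of the built-in clpy.core.core.array entry.
    stats.print_callers('clpy.core.core.array')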

@ybsh
Collaborator

ybsh commented Aug 7, 2019

Code location:

cpdef ndarray array(obj, dtype=None, bint copy=True, str order='K',

Now I'm moving on to Cython-level profiling.

@ybsh
Collaborator

ybsh commented Dec 6, 2019

As you can see, _launch is causing a notable difference (7.5 s/16 s).
In the following call graphs, corresponding parts are circled with red lines:

  • CuPy: [call graph image: cupy_153-for-profile_epoch20_profil_0]
  • ClPy: [call graph image: clpy_153-for-profile_epoch20_profil_0]

@ybsh
Collaborator

ybsh commented Dec 6, 2019

Since it seems to be the code of _launch itself, rather than its callees, that is taking long in ClPy, I am moving on to line-level performance analysis of this function.

@ybsh ybsh closed this as completed Dec 6, 2019
@ybsh ybsh reopened this Dec 6, 2019
@ybsh
Collaborator

ybsh commented Dec 6, 2019

This means there are still other overheads in _launch() besides the NumPy interfaces removed last month.

By inserting Python's time.perf_counter() into _launch() and measuring the elapsed time of each part,
I found the following:

The elapsed time of the kernel launch phase is quite reasonable, while that of the argument passing phase is not self-evident and needs further investigation, all the more because this part contains a lot of ndarray operations.
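
A sketch of the instrumentation style used here (illustrative only; the phase names are made up and the real _launch() is more involved):

    import time

    timings = {'argument_passing': 0.0, 'kernel_launch': 0.0}

    def _launch_instrumented(args, kernel, queue):
        t0 = time.perf_counter()
        # ... argument-passing phase: convert and set each kernel argument ...
        t1 = time.perf_counter()
        # ... kernel-launch phase: enqueue the kernel on the command queue ...
        t2 = time.perf_counter()
        timings['argument_passing'] += t1 - t0
        timings['kernel_launch'] += t2 - t1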

@LWisteria
Member Author

Is the queue.finish necessary? That could be a difference from CuPy.

@ybsh ybsh mentioned this issue Feb 6, 2020
@ybsh
Collaborator

ybsh commented Feb 6, 2020

@LWisteria As far as I remember, I have inserted synchronizations for CuPy for a fair comparison (which I will check later).
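
For reference, a minimal illustration of the kind of host-side synchronization meant here on the CuPy side (not necessarily the exact call or placement used in my branch):

    import cupy

    # Block the host until all work queued on the current GPU has finished,
    # so that host-side timings include kernel execution time.
    cupy.cuda.Device().synchronize()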

@LWisteria
Member Author

@ybsh Before you restart this issue, you must solve #261 ...

@ybsh
Collaborator

ybsh commented Feb 6, 2020

Exactly. As @LWisteria mentioned, ClPy currently fails to work on one of our servers, titanv. I will try to create a separate, older CUDA environment with nvidia-docker.

@LWisteria LWisteria assigned y1r and unassigned vorj Feb 21, 2020
@LWisteria
Member Author

@y1r I hear you'll try this issue, so I've assigned it to you. Feel free to make comments/reports on this issue in English and/or Japanese.

@ybsh
Collaborator

ybsh commented Feb 21, 2020

@y1r You can refer to this branch for additional synchronizations.
https://github.com/fixstars/clpy/commits/153-for-profile

@ybsh
Collaborator

ybsh commented Feb 21, 2020

This might be helpful, though I haven't tried it:
https://stackoverflow.com/questions/28301931/how-to-profile-cython-functions-line-by-line
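
If I read that post correctly (untried here), the gist is to rebuild the extension with line tracing enabled and then attach line_profiler; roughly:

    # At the top of the .pyx file to be profiled (Cython directive comments):
    # cython: linetrace=True
    # cython: binding=True
    # distutils: define_macros=CYTHON_TRACE_NOGIL=1

    # After rebuilding, from Python:
    import line_profiler
    import clpy.backend.function as function_module

    profiler = line_profiler.LineProfiler()
    # The target must be visible from Python (def/cpdef); otherwise profile
    # its Python-visible caller instead.
    profiler.add_function(function_module._launch)
    profiler.enable_by_count()
    # ... run the workload, e.g. one epoch of train_mnist.py ...
    profiler.disable_by_count()
    profiler.print_stats()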

@ybsh
Collaborator

ybsh commented Feb 21, 2020

@y1r To add synchronizations to CuPy for comparison, you can check diff_cupy.txt of this post #153 (comment)

@y1r
Collaborator

y1r commented Feb 26, 2020

I've investigated the performance of clpy.backend.function._launch by adding some instrumentation code:

  • insert time.perf_counter() calls around each code block after the type checking
  • blocks
    • code block 1:
      buffer_object = a.data.buf.get()
    • code block 2:
      ndim = len(a.strides)
      for d in range(ndim):
          if a.strides[d] % a.itemsize != 0:
              raise ValueError("Stride of dim {0} = {1},"
                               " but item size is {2}"
                               .format(d, a.strides[d], a.itemsize))
          arrayInfo.shape_and_index[d] = a.shape[d]
          arrayInfo.shape_and_index[d + ndim] = a.strides[d]
      arrayInfo.offset = a.data.cl_mem_offset()
      arrayInfo.size = a.size
    • code block 3:
      for d in range(a.ndim):
          indexer.shape_and_index[d] = a.shape[d]
      ptr = <size_t>&indexer
      indexer.size = a.size
      size = a.get_size()
    • code block 4:
      if isinstance(a, clpy.core.core.Size_t):
          if clpy.backend.opencl.utility.typeof_size() \
                  == 'uint':
              a = numpy.uint32(a.val)
          elif clpy.backend.opencl.utility.typeof_size() \
                  == 'ulong':
              a = numpy.uint64(a.val)
          else:
              raise "api_sizeof_size is illegal"
      elif isinstance(a, int):
          a = numpy.int_(a)
      elif isinstance(a, float):
          a = numpy.float_(a)
      elif isinstance(a, bool):
          a = numpy.bool_(a)
    • code block 5:
      if core.numpy_scalar_type_set():
          ptr = <size_t>a.__array_interface__["data"][0]
          size = a.nbytes
      else:
          raise TypeError('Unsupported type %s' % type(a))
    • code block 6:
      cdef size_t gws[3]
      for i in range(global_dim):
          gws[i] = global_work_size[i]
      cdef size_t lws[3]
      cdef size_t* lws_ptr
      if local_dim > 0:
          for i in range(local_dim):
              lws[i] = local_work_size[i]
          lws_ptr = &lws[0]
      else:
          lws_ptr = <size_t*>NULL

I got the following elapsed times [s] by running train_mnist.py on titanv with 100 iterations:

  • code block 1: 0.006035
  • code block 2: 0.028756
  • code block 3: 0.006422
  • code block 4: 0.005220
  • code block 5: 0.023301
  • code block 6: 0.000841

I suggest two optimizations:

  • for code block 2 (but currently I am not sure how to optimize this)
  • for code blocks 4 and 5, use an immediate value for SetKernelArg's argument (see the sketch below)

What do you think? @LWisteria @ybsh
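
To illustrate the second suggestion above (a hedged sketch; the variable names are illustrative, not ClPy's actual code): instead of wrapping a Python scalar in a NumPy scalar and then recovering its pointer through __array_interface__ and ctypes (code block 5), the scalar could be packed into raw bytes once and handed to clSetKernelArg directly as an immediate value:

    import struct

    n = 1024                           # an int kernel argument (illustrative)
    arg_bytes = struct.pack('i', n)    # 4 bytes of a C int, ready to be handed
                                       # to SetKernelArg by pointer and size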

@LWisteria
Member Author

As @ybsh reported, blocks 2 and 5 must be among the reasons for ClPy's slowness compared to CuPy. I agree with @y1r .

@nsakabe-fixstars what do you think?

@LWisteria
Member Author

@ybsh By the way, where can we see the difference in train_mnist.py time between the current ClPy and the base CuPy?
@y1r Can you point to it if you know?

@y1r
Collaborator

y1r commented Feb 26, 2020

@LWisteria See #153 (comment) . I used a different number of iterations for the benchmark, so comparing my result with his is not meaningful.

@LWisteria
Member Author

LWisteria commented Feb 26, 2020

@y1r Thanks, I had overlooked that.

difference (7.5 s/16 s)

That means blocks 2 and 5 account for about 10% = 1.6 s of ClPy's time. That's remarkable.

@ghost

ghost commented Feb 27, 2020

@y1r @ybsh @LWisteria
I agree that _launch's overhead (blocks 2 and 5) is what we should shrink.

How about looking into the Global Interpreter Lock (GIL)?
https://github.com/kzky/languages/tree/master/cython/nogil
If we use only C objects, we can release the GIL with a nogil: section.

@cython.boundscheck(False) can be helpful.
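
A minimal Cython sketch of the idea (illustrative only, not ClPy's actual code; the function and argument names are made up):

    cimport cython

    @cython.boundscheck(False)
    @cython.wraparound(False)
    cdef void fill_shape_and_strides(Py_ssize_t[::1] shape,
                                     Py_ssize_t[::1] strides,
                                     size_t[::1] shape_and_index):
        # shape_and_index must have room for 2 * ndim entries.
        cdef Py_ssize_t d, ndim = shape.shape[0]
        with nogil:  # only C-level operations inside, so the GIL can be released
            for d in range(ndim):
                shape_and_index[d] = <size_t>shape[d]
                shape_and_index[d + ndim] = <size_t>strides[d]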

@y1r
Collaborator

y1r commented Feb 27, 2020

Using nogil is a good idea. It is also on my ToDo list.

@ybsh ybsh mentioned this issue Mar 2, 2020
@ybsh
Collaborator

ybsh commented Mar 5, 2020

@y1r and I decided to split our (performance improvement) work.
#153 (comment)
I'm starting to work on code block 2, and @y1r is working on code blocks 4 and 5.

@ybsh
Collaborator

ybsh commented Mar 5, 2020

@LWisteria @nsakabe-fixstars @vorj
Would it be better if we made a separate issue for each code block?

@ghost

ghost commented Mar 5, 2020

@ybsh Yes, please create new issues for your(our) convenience!

@y1r
Collaborator

y1r commented Mar 5, 2020

I've optimized code blocks 4 and 5 (merged 4 into 5) by accessing the buffer protocol directly.
I'll create a PR soon.

0 0.005388
1 0.024878
2 0.005229
3 0.000000
4 0.003777
5 0.000785
