This document answers some frequently asked questions. For other information about CLBlast, see the main README.
There are two ways to perform GEMM implemented in CLBlast:
- Direct GEMM: computing GEMM using a single generic kernel that handles all cases (e.g. all kinds of matrix sizes).
- Indirect GEMM: computing GEMM using multiple kernels: the main GEMM kernel and a few pre-processing and post-processing kernels. The main kernel makes several assumptions (e.g. sizes need to be multiples of 32), which the other kernels make sure are satisfied; a sketch of this padding follows below. The main kernel is often faster than the generic kernel of the direct approach, but the cost of the pre-processing and post-processing kernels can sometimes be high for small sizes or on particular devices.
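To make the multiples-of-32 assumption concrete, here is a minimal sketch in plain C++ of the rounding-up that the pre-processing step performs. It is illustrative only and not part of the CLBlast API; the function name and the value 32 are assumptions for the example.

```cpp
#include <cstddef>
#include <cstdio>

// Illustrative only: round a matrix dimension up to the next multiple that the
// main indirect kernel assumes (e.g. 32). The pre-processing kernels copy the
// data into such a padded layout; the post-processing kernels copy it back.
std::size_t CeilToMultiple(std::size_t value, std::size_t multiple) {
  return ((value + multiple - 1) / multiple) * multiple;
}

int main() {
  std::printf("m = 100 becomes %zu\n", CeilToMultiple(100, 32));  // prints 128
  return 0;
}
```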
The GEMM routine tuner determines from which m/n/k sizes onwards the indirect approach becomes favorable over the direct approach. Typically, the direct approach is faster for small matrices.
For the indirect GEMM kernel (see above) there are two implementations: an older approach (GEMMK=0) and a newer kernel with 2D register tiling and support for shuffling (GEMMK=1). On most devices the older approach is still the fastest, but some devices benefit more from the newer kernel. The regular GEMM kernel tuner explores both kernels and selects the fastest one.
The regular GEMM tuner tunes the indirect kernel (see above), covering the GEMMK=0 kernel first (stages 1/4 and 2/4) followed by the GEMMK=1 variant (stages 3/4 and 4/4). In both cases, a fixed set of likely-to-be-good parameters is first explored fully (stages 1/4 and 3/4), followed by a random selection of parameters from a much larger search space (stages 2/4 and 4/4). In the end, the library only uses the single best kernel configuration found across all 4 stages.
The direct GEMM tuner runs in 2 stages: as above, it first explores a small set of parameters exhaustively, followed by a random selection from a larger search space.
By design, the indirect version of the GEMM kernel might allocate some temporary memory on your device, which might be an issue in some scenarios. However, there are a few things you can do to avoid this:
- Use the override-parameters functionality to set the switching point between the direct and indirect kernels much further, so that the direct kernel (which needs no temporary buffer) is used; an example can be found in one of the tests, and a sketch follows after this list. This might affect the performance of the GEMM routine.
- Query the required buffer size, allocate the buffer yourself, and pass that to GEMM (see the second sketch after this list). That way you are in control and can, for example, make sure it is only allocated once.
- Make sure no temporary buffer is required at all. That is, make sure the buffer sizes are already a multiple of the amount of work done per work-group (e.g. 32, 64 or 128 at most, depending on the tuned values for your device; you can query them if wanted), and make sure the matrices are pre-transposed as needed. The temp-buffer-size query function (`GemmTempBufferSize`) and its implementation can help you figure out whether you are there yet.
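As a sketch of the first option, assuming the `OverrideParameters` entry point from `clblast.h`: the kernel name "GemmRoutine" and the parameter name "XGEMM_MIN_INDIRECT_SIZE" are assumptions based on the routine tuner's output, so verify them against the override-parameters test in the CLBlast sources.

```cpp
#include <clblast.h>

// A sketch, not a drop-in recipe: raise the direct/indirect switching point so
// that the direct kernel (no temporary buffer) is used for all m/n/k up to,
// say, 2048. The kernel name "GemmRoutine" and the parameter name
// "XGEMM_MIN_INDIRECT_SIZE" are assumptions; check the override-parameters
// test in the CLBlast sources for the exact usage.
void RaiseGemmSwitchingPoint(cl_device_id device) {
  const auto status = clblast::OverrideParameters(
      device, "GemmRoutine", clblast::Precision::kSingle,
      {{"XGEMM_MIN_INDIRECT_SIZE", 2048}});
  if (status != clblast::StatusCode::kSuccess) {
    // Handle the error, e.g. an unknown device or kernel name
  }
}
```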
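And a sketch of the second and third options combined, assuming the `GemmTempBufferSize` function and the optional `temp_buffer` argument of `Gemm` from the `clblast.h` C++ API; the layout, precision, and leading dimensions here are illustrative.

```cpp
#include <clblast.h>

// A sketch assuming the C++ API in clblast.h: query the temporary buffer size
// for one specific set of GEMM arguments, allocate the buffer once yourself,
// and pass it to Gemm via its optional last argument. A reported size of 0
// means these arguments need no temporary buffer at all (the third option).
void GemmWithOwnTempBuffer(cl_context context, cl_command_queue queue,
                           cl_mem a, cl_mem b, cl_mem c,
                           const size_t m, const size_t n, const size_t k) {
  const auto layout = clblast::Layout::kRowMajor;
  const auto no_trans = clblast::Transpose::kNo;
  const size_t a_ld = k, b_ld = n, c_ld = n;  // row-major, non-transposed

  size_t temp_size = 0;
  clblast::GemmTempBufferSize<float>(layout, no_trans, no_trans, m, n, k,
                                     0, a_ld, 0, b_ld, 0, c_ld,
                                     &queue, temp_size);

  cl_mem temp_buffer = nullptr;
  if (temp_size > 0) {  // only needed when the indirect kernel is selected
    temp_buffer = clCreateBuffer(context, CL_MEM_READ_WRITE, temp_size,
                                 nullptr, nullptr);
  }

  clblast::Gemm<float>(layout, no_trans, no_trans, m, n, k,
                       1.0f, a, 0, a_ld, b, 0, b_ld, 0.0f, c, 0, c_ld,
                       &queue, nullptr, temp_buffer);

  clFinish(queue);  // wait before releasing the temporary buffer
  if (temp_buffer != nullptr) { clReleaseBuffer(temp_buffer); }
}
```

In a real application you would typically cache the allocated buffer and re-use it across GEMM calls with the same arguments, which is exactly the control the second option gives you.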
The tuners explore many different kernel parameters, sometimes quite extreme ones, seeking the bounds of the hardware or resulting in very large binaries. Depending on your device and OpenCL implementation, it may well be that failures occur. However, the tuner automatically detects incorrect results or failed kernels and skips them. Only if the number of failures is very large might something be wrong in the CLBlast code; in that case, it can be reported as an issue.