
GPU implementation of hamming distance #541

Open · wants to merge 111 commits into main

Conversation

felixpetschko (Collaborator)

Hamming distance implementation with numba.cuda for GPU support.
This is built on top of the changes in #512 (Hamming distance implementation with Numba).
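
For readers unfamiliar with the approach, here is a minimal sketch of a pairwise Hamming distance kernel in numba.cuda. This is illustrative only and not the kernel from this PR; the uint8 encoding, the toy sequences, and the launch configuration are assumptions, and it needs a CUDA-capable GPU to run.

```python
import numpy as np
from numba import cuda


@cuda.jit
def hamming_kernel(seqs, out):
    # One GPU thread per (i, j) pair of sequences.
    i, j = cuda.grid(2)
    n, length = seqs.shape
    if i < n and j < n:
        d = 0
        for k in range(length):
            if seqs[i, k] != seqs[j, k]:
                d += 1
        out[i, j] = d


# Toy CDR3-like sequences, pre-padded to equal length and encoded as uint8.
seqs = np.array(
    [list(b"CASSLGTDTQYF"), list(b"CASSLGADTQYF"), list(b"CASSPGQDTQYF")],
    dtype=np.uint8,
)
out = np.zeros((len(seqs), len(seqs)), dtype=np.int32)

threads = (16, 16)
blocks = tuple((len(seqs) + t - 1) // t for t in threads)
hamming_kernel[blocks, threads](seqs, out)
print(out)  # symmetric matrix of pairwise Hamming distances
```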

felixpetschko and others added 30 commits April 29, 2024 13:28
grst (Collaborator) commented Nov 4, 2024

Hi @felixpetschko, what's the status here? Do you need anything from me or Severin?

I've seen you switched to CuPy; could you elaborate on how that compares to the Numba implementation?

felixpetschko (Collaborator, Author)

Hi @grst! My implementation here is mostly done. On my laptop, for 1 million cells, the ir_dist function with the hamming metric is currently around 10x faster than the new fast numba CPU implementation (45 vs. 480 seconds), and probably >100x faster than the original CPU implementation. I think this is also the maximum speedup I would aim for right now: besides the hamming GPU kernel, ir_dist still has some sequential parts that run on a single CPU core, and the upstream processing for reading and preparing the data already takes longer anyway, so further optimization of the hamming kernel wouldn't be very effective.
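
For context, the timing above refers to calls along these lines. This is a hedged usage sketch only: the example dataset and cutoff are illustrative (not the 1-million-cell benchmark), and parameter names follow the public scirpy API as I understand it.

```python
import time

import scirpy as ir

# Small example dataset shipped with scirpy (not the 1M-cell benchmark).
adata = ir.datasets.wu2020_3k()

start = time.perf_counter()
ir.pp.ir_dist(adata, metric="hamming", cutoff=2)
print(f"ir_dist with hamming took {time.perf_counter() - start:.1f} s")
```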

My plan is to prepare a pull request that is ready for review over the next few days.

The reasons for switching to CuPy were the following:

- Numba CUDA only exposes a limited subset of CUDA's features, and you never really know how the Numba code is mapped to those features internally. CuPy lets you write the CUDA kernels directly in C++ and, as far as I have seen, offers all CUDA features.
- Most online resources about GPU programming cover CUDA kernels written in C++, while numba.cuda is very niche. Other programmers who look at this code in the future will most likely only know C++ CUDA kernels (if they have done GPU programming at all).
- When doing GPU programming I use profiling tools to find out what is actually happening on the hardware and what the compiler did, so the additional abstraction layer that Numba introduces can actually be a nightmare.
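
For illustration, here is a minimal sketch of the CuPy approach, not the actual kernel from this PR: a pairwise Hamming distance kernel written as C++ CUDA code and compiled via cupy.RawKernel, assuming sequences are pre-padded to equal length and encoded as uint8.

```python
import cupy as cp

hamming_src = r"""
extern "C" __global__
void hamming(const unsigned char* seqs, int n, int length, int* out) {
    // One thread per (i, j) pair; out is an n x n row-major matrix.
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && j < n) {
        int d = 0;
        for (int k = 0; k < length; ++k) {
            d += seqs[i * length + k] != seqs[j * length + k];
        }
        out[i * n + j] = d;
    }
}
"""
hamming_kernel = cp.RawKernel(hamming_src, "hamming")

# Toy CDR3-like sequences, equal length, encoded as uint8 on the GPU.
seqs = cp.asarray(
    [list(b"CASSLGTDTQYF"), list(b"CASSLGADTQYF"), list(b"CASSPGQDTQYF")],
    dtype=cp.uint8,
)
n, length = seqs.shape
out = cp.zeros((n, n), dtype=cp.int32)

threads = (16, 16)
blocks = ((n + threads[0] - 1) // threads[0], (n + threads[1] - 1) // threads[1])
hamming_kernel(blocks, threads, (seqs, cp.int32(n), cp.int32(length), out))
print(out)  # symmetric matrix of pairwise Hamming distances
```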

grst (Collaborator) commented Nov 5, 2024

Makes sense, thanks!
I'm still not sure whether this should be merged into scirpy or into rapids-singlecell. @Intron7, I'll also bring up our long-term strategy here at the next core dev meeting.

Intron7 (Member) commented Nov 8, 2024

Hey @felixpetschko, can you send me a larger dataset to test this? I have some ideas and want to see if they work.

grst (Collaborator) commented Nov 16, 2024

> I'm still not sure whether this should be merged into scirpy or into rapids-singlecell. @Intron7, I'll also bring up our long-term strategy here at the next core dev meeting.

@felixpetschko, the outcome of this discussion was that the function stays here, and we'll set up a GPU CI for scirpy. @ilan-gold or @flying-sheep can help with that once this PR is ready.

grst changed the title from "Hamming distance implementation with numba.cuda (GPU)" to "GPU implementation of hamming distance" on Nov 16, 2024