Implement engine="thread" in ChunkRecordingExecutor #3526

Open
samuelgarcia wants to merge 13 commits into main

Conversation

samuelgarcia
Member

@samuelgarcia commented Nov 8, 2024

@h-mayorquin @zm711 @alejoe91
I have wanted to do this for a long time: implement a thread engine for ChunkRecordingExecutor.

Now we can do :

# this uses threads
job_kwargs = dict(chunk_duration="1s", pool_engine="thread", n_jobs=8, progress_bar=True)
# this uses processes
job_kwargs = dict(chunk_duration="1s", pool_engine="process", n_jobs=8, progress_bar=True)
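
These job_kwargs can then be passed to any chunk-based operation. For example (illustrative usage, assuming the usual spikeinterface entry points):

import spikeinterface.full as si

# illustrative: any function backed by ChunkRecordingExecutor accepts these job_kwargs
recording, _ = si.generate_ground_truth_recording(durations=[10.0])
recording_saved = recording.save(folder="preprocessed", **job_kwargs)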

Maybe this will help Windows users a lot and will avoid using "spawn" for multiprocessing (which is too slow at startup).
My bet is that threads will spend their time in IO (reading data from disk) and in computing libs that release the GIL, so this should be good enough for parallel computing. Let's see.
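
As a rough mental model (a simplified sketch, not the actual ChunkRecordingExecutor code), the pool_engine switch boils down to choosing which executor runs the chunk workers:

import multiprocessing
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def make_executor(pool_engine, n_jobs, mp_context=None):
    # simplified: the real executor also handles per-worker initialization and the progress bar
    if pool_engine == "thread":
        return ThreadPoolExecutor(max_workers=n_jobs)
    elif pool_engine == "process":
        ctx = multiprocessing.get_context(mp_context) if mp_context is not None else None
        return ProcessPoolExecutor(max_workers=n_jobs, mp_context=ctx)
    else:
        raise ValueError(f"unknown pool_engine: {pool_engine}")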

Here is a very first version as a proof of concept (without any real test).

TODO:

  • add the thread engine
  • add a worker_index mechanism in job_tools.py (this will be useful for multi-GPU)
  • rename max_threads_per_process to max_threads_per_worker. See note.
  • discuss renaming n_jobs to n_worker or num_worker or max_worker (Alessio will be unhappy)
  • make a get_best_job_kwargs() that will be platform dependent.

When using the thread engine we will need to disambiguate the concept of a worker (which can be a thread) from max_threads_per_worker, which relates to the hidden threads used by computing libs like numpy/scipy/blas/lapack/sklearn.
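
For example, with the naming proposed here, something like this would mean 4 workers (each worker being a thread), with each worker limiting numpy/BLAS to 2 internal threads:

job_kwargs = dict(chunk_duration="1s", pool_engine="thread", n_jobs=4, max_threads_per_worker=2)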

EDIT:

  • on Linux, threads are a bit slower than processes with fork, but not by much.

Important notes:
ProcessPoolExecutor is hard coded in several places (principal_components, split, merge, ...);
we should handle those too, but that will be a separate PR, because those parallelizations are not on the recording axis.
Note that the tricky part shared between the loop, thread and process engines is initializing variables into a private dict per worker. This is done in ChunkRecordingExecutor, but it should be propagated to other use cases that do not need recording slices.
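
A minimal sketch of that per-worker initialization trick for the process engine (based on the pattern visible in the review snippets further down, not the exact implementation; the thread engine needs a per-thread variant of the same idea):

from concurrent.futures import ProcessPoolExecutor

# private per worker: each worker process gets its own copy of the module globals
_worker_ctx = None


def _init_worker(func, worker_ctx):
    global _worker_ctx
    _worker_ctx = dict(worker_ctx)
    _worker_ctx["func"] = func


def _run_one_chunk(chunk):
    # every worker reads its own private context, nothing is shared between processes
    return _worker_ctx["func"](chunk, _worker_ctx)


# hypothetical wiring, not the actual ChunkRecordingExecutor API:
# executor = ProcessPoolExecutor(max_workers=4, initializer=_init_worker, initargs=(my_func, ctx))
# results = list(executor.map(_run_one_chunk, chunks))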

@alejoe91 added the core (Changes to core module) and concurrency (Related to parallel processing) labels Nov 11, 2024
@samuelgarcia
Member Author

After a long fight this is now working.
The thread pool was making the progress_bar freeze randomly.
I found a simple trick, but it took a long time.

@samuelgarcia
Member Author

@zm711: if you have time, could you run some tests on Windows?
I will also do it.

@zm711
Collaborator

zm711 commented Nov 20, 2024

@zm711: if you have time, could you run some tests on Windows?

Yep. Will do once tests pass :)

@samuelgarcia marked this pull request as ready for review November 20, 2024 13:35
@samuelgarcia
Member Author

@alejoe91 @zm711 @chrishalcrow @h-mayorquin: ready for review.

Also, we should discuss changing n_jobs to something else.
"job" is ambiguous; "worker" is better.
This naming comes from joblib, but I think the semantics are unclear.
A "job" is more like something pushed to slurm.
In my mind, job=task, but NOT the number of people working through the task queue (they are workers). No?

@alejoe91
Member

The failing tests are really weird. The arrays that are reported as not equal look exactly the same from their values.

Collaborator

@zm711 left a comment


A few tweaks as well.

I haven't looked deeply, but since Windows and Mac have a different mp context at baseline, do we think that could be contributing, or is this just a floating point issue?

@zm711
Collaborator

zm711 commented Nov 20, 2024

Script in case people want to know:

import spikeinterface.full as si
rec, sorting = si.generate_ground_truth_recording()
analyzer = si.create_sorting_analyzer(sorting, rec)
analyzer.compute(['random_spikes', 'waveforms'])
%timeit analyzer.compute(['principal_components'])
# or
%timeit analyzer.compute(['principal_components'], n_jobs=2)

Okay, with default settings on this branch using generate_ground_truth_recording:

n_jobs=1
67 ms ± 4.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
n_jobs=2
47.9 s ± 971 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# explicitly setting pool_engine='thread'
48.4 s ± 1.23 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Compared to going back to main:

n_jobs=1
78.3 ms ± 27.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
n_jobs=2
48.6 s ± 890 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I'm thinking this might be deeper than our code. Let me know if you want a different type of test, Sam. For real data with n_jobs we often have PCA take 1-2 hours, so I'm not currently testing with real data. Do you want me to play around with the n_jobs number and threads?

@samuelgarcia
Member Author

Hi Zach and Alessio. Thanks for testing. PCA is not the best parallelization we have;
for me, detect_peaks or save() are better illustrations. But this is OK.
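
For example, something along these lines (a sketch, assuming the usual detect_peaks entry point, with rec from the script above):

from spikeinterface.sortingcomponents.peak_detection import detect_peaks

job_kwargs = dict(chunk_duration="1s", pool_engine="thread", n_jobs=8, progress_bar=False)
%timeit detect_peaks(rec, method="by_channel", **job_kwargs)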

Yes, these failing tests are super strange because it is a floating point issue which is not the same with multiprocessing and threads. This is hard to figure out.

As a side note: I modified the waveforms_tools test to estimate templates with both thread and process, and they do not give the same results. I need to set decimals=4 for the almost-equal check, which is very easy to pass. I did not expect this. Maybe we have something deep in the parallelization that can change the result.
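
Concretely, the comparison only passes at this loose tolerance (templates_thread and templates_process are illustrative names for the two results):

import numpy as np

np.testing.assert_array_almost_equal(templates_thread, templates_process, decimal=4)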
Let's discuss this on a call.

@zm711
Collaborator

zm711 commented Nov 21, 2024

I can test save at some point this week on Windows.

@samuelgarcia
Member Author

Tests are passing now.

@zm711
Collaborator

zm711 commented Nov 22, 2024

I forgot... easy testing of save is not really possible on Windows due to the inability to overwrite files still open in the process, so I can't use timeit. I will need to try to write a script. I can try detect_peaks though.
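
One way around the overwrite problem might be a manual timing loop that writes each repetition to a fresh folder (rough sketch; rec is the generated test recording):

import time
from pathlib import Path

job_kwargs = dict(chunk_duration="1s", pool_engine="thread", n_jobs=4, progress_bar=False)

times = []
for i in range(5):
    # a fresh folder per run, so nothing that is still open needs to be overwritten
    folder = Path("bench_save") / f"run_{i}"
    t0 = time.perf_counter()
    rec.save(folder=folder, **job_kwargs)
    times.append(time.perf_counter() - t0)
print(f"best {min(times):.2f}s, mean {sum(times) / len(times):.2f}s")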

@zm711
Collaborator

zm711 commented Nov 22, 2024

Okay, detect_peaks:

On main:

n_jobs=1
43.6 ms ± 335 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
n_jobs=2
3.86 s ± 361 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With this PR:

n_jobs=2
164 ms ± 3.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
n_jobs=2, and explicitly setting pool_engine='thread'
162 ms ± 515 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So an improvement but still worse than not using multiprocessing.

@zm711
Collaborator

zm711 commented Nov 22, 2024

Now trying a longer simulated recording to see if, with a more realistic length, we see a big difference.

The object is:

GroundTruthRecording: 4 channels - 25.0kHz - 1 segments - 250,000,000 samples
                      10,000.00s (2.78 hours) - float32 dtype - 3.73 GiB
n_jobs=1
43.6 s ± 329 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
n_jobs=4, pool_engine='thread'
1min 21s ± 821 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@zm711
Collaborator

zm711 commented Nov 22, 2024

Now checking what happens if we have to deal with more channels:

GroundTruthRecording: 64 channels - 25.0kHz - 1 segments - 100,000,000 samples
                      4,000.00s (1.11 hours) - float32 dtype - 23.84 GiB
n_jobs=1
3min 51s ± 2.79 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
n_jobs=4, pool_engine='thread'
1min 45s ± 564 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So we finally found the spot where the threading helps. As channel count increases we even get a benefit from n_jobs>1 with 'thread'. Cool. I guess we could set something up to try to find these kinds of conditions, to help people choose when to use multiprocessing for some of these steps on Windows (& maybe Mac, which I haven't tested yet).

@samuelgarcia
Member Author

Salut Zach,
thanks again for your time.

These results are a bit disappointing... I would expect a better speedup.
How many cores do you have on this Windows machine?
I will also do some tests.

@zm711
Collaborator

zm711 commented Nov 22, 2024

Salut Sam,
I have 8 cores on this computer (it's a few years old at this point). Switching to n_jobs=8 with pool_engine='thread' looks like it will take 1 min 14 seconds to run detect_peaks on a 64 channel, 1 hour simulated recording. So the benefit doesn't seem to be scaling. Do you want me to run some tests changing the number of threads too?

Collaborator

@zm711 left a comment


Just a few more comments that can be cleaned up now that we have the big push for this PR done.

I still think we should find a way either in docs or maybe some sort of CI to help determine when users of specific OSs should use n_jobs>1. It seems like different functionality will depend on this in different ways (e.g. detect_peaks cares about channels), so this thought isn't fully formed yet.

EDIT: Just ran quick short tests (so I might be missing time effects), but it seems like anything over 8 channels is improved by threading. Again maxing out at a roughly 2.1x improvement, but that's still something.
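
Something like the following sweep could locate that crossover (hypothetical sketch; durations, channel counts and job settings would need tuning):

import time
import spikeinterface.full as si
from spikeinterface.sortingcomponents.peak_detection import detect_peaks

for num_channels in (4, 8, 16, 32, 64):
    rec, _ = si.generate_ground_truth_recording(durations=[60.0], num_channels=num_channels)
    for job_kwargs in (dict(n_jobs=1), dict(n_jobs=4, pool_engine="thread")):
        t0 = time.perf_counter()
        detect_peaks(rec, method="by_channel", progress_bar=False, **job_kwargs)
        print(num_channels, job_kwargs, f"{time.perf_counter() - t0:.1f}s")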


class WorkerFuncWrapper:
"""
small wraper that handle:

Suggested change
small wraper that handle:
small wrapper that handles:

# so they are not share in the same process
global _worker_ctx
global _func
# the tricks is : this variable are global per worker (so not shared in the same process)

Suggested change
# the tricks is : this variable are global per worker (so not shared in the same process)
# the trick is : this variable is global per worker (so not shared in the same process)

@@ -57,12 +59,62 @@
"chunk_duration",
)

def get_best_job_kwargs():
"""
Given best possible job_kwargs for the platform.

Suggested change
Given best possible job_kwargs for the platform.
Gives best possible job_kwargs for the platform.
Currently this function is from developer experience, but may be adapted in the future.

pool_engine = "process"
mp_context = "fork"

# this is totally empiricat but this is a good start

Suggested change
# this is totally empiricat but this is a good start
# this is totally empirical but this is a good start


# this is totally empiricat but this is a good start
if n_cpu <= 16:
# for small n_cpu lets make many process

Suggested change
# for small n_cpu lets make many process
# for small n_cpu let's make many process

n_jobs = n_cpu
max_threads_per_worker = 1
else:
# lets have less process with more thread each

Suggested change
# lets have less process with more thread each
# let's have fewer processes with more threads each


else: # windows and mac
# on windows and macos the fork is forbidden and process+spwan is super slow at startup
# so lets go to threads

Suggested change
# so lets go to threads
# so let's go to threads
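
For readability, here is a rough consolidation of what these hunks describe, with the suggested wording applied. It is a reconstructed sketch, not the actual source: the platform check and the values in the fewer-processes and Windows/Mac branches are not visible in the hunks above and are purely illustrative placeholders.

import os
import platform


def get_best_job_kwargs():
    """
    Gives best possible job_kwargs for the platform.
    Currently this function is from developer experience, but may be adapted in the future.
    """
    n_cpu = os.cpu_count()

    if platform.system() == "Linux":  # illustrative platform check
        pool_engine = "process"
        mp_context = "fork"

        # this is totally empirical but this is a good start
        if n_cpu <= 16:
            # for small n_cpu let's make many processes
            n_jobs = n_cpu
            max_threads_per_worker = 1
        else:
            # let's have fewer processes with more threads each
            # (placeholder values, not taken from the hunks above)
            n_jobs = n_cpu // 2
            max_threads_per_worker = 2
    else:  # windows and mac
        # on windows and macos the fork is forbidden and process+spawn is super slow at startup
        # so let's go to threads (placeholder values below)
        pool_engine = "thread"
        mp_context = None
        n_jobs = n_cpu
        max_threads_per_worker = 1

    return dict(
        pool_engine=pool_engine,
        mp_context=mp_context,
        n_jobs=n_jobs,
        max_threads_per_worker=max_threads_per_worker,
    )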
