Implement engine="thread" in ChunkRecordingExecutor #3526
Conversation
After a long fight this is now working.
@zm711 : if you have time, could you run some tests on Windows?
Yep. Will do once tests pass :)
@alejoe91 @zm711 @chrishalcrow @h-mayorquin : ready for review. Also we should discuss changing
The failing tests are really weird. The arrays that are not equal look exactly the same from the values.
A few tweaks as well.
I haven't looked deeply, but since Windows and Mac have a different mp context at baseline, do we think that could be contributing, or is this just floating point issues?
Script in case people want to know:

```python
import spikeinterface.full as si

sorting, rec = si.generate_ground_truth_recording()
analyzer = si.create_sorting_analyzer(rec, sorting)
analyzer.compute(['random_spikes', 'waveforms'])

%timeit analyzer.compute(['principal_components'])
# or
%timeit analyzer.compute(['principal_components'], n_jobs=2)
```

Okay with default settings on this branch using the
compared to going back to main
I'm thinking this might be deeper than our code. Let me know if you want a different type of test, Sam. For real data with n_jobs we often have PCA take 1-2 hours, so I'm not currently testing with real data. Do we want me to play around with the n_jobs number and threads?
Hi Zach and Alessio. Thanks for testing. PCA is not the best parallelisation we have. Yes, these failing tests are super strange, because it is a floating point issue which is not the same with multiprocessing or thread. This is hard to figure out. As a side note: I modified the waveforms_tools that estimate templates with both thread and process, and they do not give the same results. I needed to set decimals=4 for almost_equal, which is very easy to pass. I did not expect this. Maybe we have something deep in the parallelisation that can change the result.
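For what it's worth, accumulation order alone can produce that kind of mismatch. A tiny illustration, unrelated to the PR code itself:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype("float32")

# same values, different accumulation order: float32 rounding makes the two
# results differ slightly, which is enough to break exact array comparisons
s_whole = x.sum()
s_chunked = np.float32(sum(chunk.sum() for chunk in np.array_split(x, 7)))
print(s_whole, s_chunked, s_whole == s_chunked)  # typically not exactly equal
```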
I can test
Tests are passing now.
I forgot.... easy testing of
Okay, detect_peaks on main
with this PR
So an improvement, but still worse than not using multiprocessing.
Now trying a longer simulated recording to see if, with a more realistic length, we see a big difference. The object is:
GroundTruthRecording: 4 channels - 25.0kHz - 1 segments - 250,000,000 samples
10,000.00s (2.78 hours) - float32 dtype - 3.73 GiB
Now checking what happens when we have to deal with more channels:
GroundTruthRecording: 64 channels - 25.0kHz - 1 segments - 100,000,000 samples
4,000.00s (1.11 hours) - float32 dtype - 23.84 GiB
So we finally found the spot where the threading helps. As channel count increases we even get a benefit from n_jobs>1 and using
Salut Zach, these results are a bit disappointing... I would expect a better speedup.
Salut Sam, |
Just a few more comments that can be cleaned up now that we have the big push for this PR done.
I still think we should find a way, either in docs or maybe some sort of CI, to help determine when users of specific OSs should use n_jobs>1. It seems like different functionality will depend on this in different ways (e.g. detect_peaks cares about channel count), so this thought isn't fully formed yet.
EDIT: Just ran quick short tests (so I might be missing time effects), but it seems like anything over 8 channels is improved by threading. Again maxing out at a 2.1x-ish improvement, but that's still something.
```
class WorkerFuncWrapper:
    """
    small wraper that handle:
```
Suggested change:

```diff
-    small wraper that handle:
+    small wrapper that handles:
```
```
# so they are not share in the same process
global _worker_ctx
global _func
# the tricks is : this variable are global per worker (so not shared in the same process)
```
Suggested change:

```diff
-# the tricks is : this variable are global per worker (so not shared in the same process)
+# the trick is : this variable is global per worker (so not shared in the same process)
```
```
@@ -57,12 +59,62 @@
    "chunk_duration",
)


def get_best_job_kwargs():
    """
    Given best possible job_kwargs for the platform.
```
Suggested change:

```diff
-    Given best possible job_kwargs for the platform.
+    Gives best possible job_kwargs for the platform.
+    Currently this function is from developer experience, but may be adapted in the future.
```
pool_engine = "process" | ||
mp_context = "fork" | ||
|
||
# this is totally empiricat but this is a good start |
Suggested change:

```diff
-# this is totally empiricat but this is a good start
+# this is totally empirical but this is a good start
```
```
# this is totally empiricat but this is a good start
if n_cpu <= 16:
    # for small n_cpu lets make many process
```
Suggested change:

```diff
-# for small n_cpu lets make many process
+# for small n_cpu let's make many process
```
```
    n_jobs = n_cpu
    max_threads_per_worker = 1
else:
    # lets have less process with more thread each
```
Suggested change:

```diff
-# lets have less process with more thread each
+# let's have fewer processes with more threads each
```
```
else:  # windows and mac
    # on windows and macos the fork is forbidden and process+spwan is super slow at startup
    # so lets go to threads
```
Suggested change:

```diff
-# so lets go to threads
+# so let's go to threads
```
@h-mayorquin @zm711 @alejoe91
I have wanted to do this for a long time: implement threads for ChunkRecordingExecutor.
Now we can do:
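A sketch of the kind of call this enables (the job kwarg names follow the diff above and are still under discussion in this PR, so treat them as provisional):

```python
import spikeinterface.full as si

recording, sorting = si.generate_ground_truth_recording()
analyzer = si.create_sorting_analyzer(sorting, recording)

# provisional job kwargs for the new thread engine; exact names may change
analyzer.compute(['random_spikes', 'waveforms'],
                 pool_engine="thread", n_jobs=4, max_threads_per_worker=1)
```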
Maybe this will help Windows users a lot and will avoid using "spawn" for multiprocessing (which is too slow at startup).
My bet is that threads will do IO (reading data from disk) and computation outside the GIL (inside computing libs), so it should be good enough for parallel computing. Let's see.
Here is a very first draft as a proof of concept (without any real tests).
TODO:

- rename `max_threads_per_process` to `max_threads_per_worker`. See note.
- `n_worker` or `num_worker` or `max_worker` (Alessio will be unhappy)
- `get_best_job_kwargs()` that will be platform dependent.

Note: when using the thread engine we will need to disambiguate the concept of worker (which can be a thread) from `max_threads_per_worker`, which relates to the hidden threads used by computing libs like numpy/scipy/blas/lapack/sklearn (see the sketch below).
EDIT:

Important notes:

- many `ProcessPoolExecutor` are hard coded in several places (principal_components, split, merge...). We should also do this for them, but that will be a separate PR, because these parallelisations are not on the recording axis.
- the tricky part shared between loop, thread and process is initializing variables into a private dict per worker. This is done in `ChunkRecordingExecutor`, but it should be propagated to other use cases that do not need recording slices (a generic sketch of this pattern is below).
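For readers unfamiliar with that trick, here is a generic sketch of the per-worker initialization pattern (not the actual ChunkRecordingExecutor code): each worker process gets its own copy of a module-level context, filled once by the pool initializer.

```python
from concurrent.futures import ProcessPoolExecutor

# global per worker: once the pool has started, each worker process holds its
# own copy of this variable, so it is not shared between processes
_worker_ctx = None


def _init_worker(offset):
    # runs once in each worker process; expensive setup goes here
    global _worker_ctx
    _worker_ctx = {"offset": offset}


def _process_chunk(chunk):
    # uses the context prepared by the initializer for this worker
    return [x + _worker_ctx["offset"] for x in chunk]


if __name__ == "__main__":
    chunks = [[1, 2], [3, 4], [5, 6]]
    with ProcessPoolExecutor(max_workers=2, initializer=_init_worker, initargs=(10,)) as pool:
        results = list(pool.map(_process_chunk, chunks))
    print(results)  # [[11, 12], [13, 14], [15, 16]]
```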