InvalidStateError while estimating sparsity #3486

Open
JeffreyBoucher opened this issue Oct 18, 2024 · 15 comments
Labels
concurrency Related to parallel processing

Comments

@JeffreyBoucher

Hello all!

I've been trying to run Kilosort 3 on a concatenated Neuropixels 2 dataset. Lately I've been running into an issue with create_sorting_analyzer while it is estimating sparsity: the code gets about 70-80% of the way through (never at the same point), then I get an "InvalidStateError" exception. I guess this means that some aspect of my data doesn't work well with the sparsity-estimating algorithm, but I have no guess what that would be.

Here is an example of the exception:

Exception in thread Thread-2:
Traceback (most recent call last):
File "/home/sjjgjbo/.conda/envs/neurovis_try2/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/home/sjjgjbo/.conda/envs/neurovis_try2/lib/python3.9/concurrent/futures/process.py", line 323, in run
self.terminate_broken(cause)
File "/home/sjjgjbo/.conda/envs/neurovis_try2/lib/python3.9/concurrent/futures/process.py", line 458, in terminate_broken
work_item.future.set_exception(bpe)
File "/home/sjjgjbo/.conda/envs/neurovis_try2/lib/python3.9/concurrent/futures/_base.py", line 549, in set_exception
raise InvalidStateError('{}: {!r}'.format(self._state, self))
concurrent.futures._base.InvalidStateError: CANCELLED: <Future at 0x2aaf7896ea00 state=cancelled>

Additionally, here are the inputs into create_sorting_analyzer:

we = si.create_sorting_analyzer(
    recording=rec,
    sorting=sorting,
    folder=outDir / 'sortings_folder',
    format="binary_folder",
    sparse=True,
)

Please help me if you can; I would be very grateful, as this has been confounding. Let me know if I can offer any additional information that would help!

Thanks,

Jeff Boucher

@zm711
Collaborator

zm711 commented Oct 18, 2024

I don't think any of us are using 3.9 at this point, but this is good to know: our test suite runs on 3.9 and doesn't have a problem, so I'm not sure. I think we need to ping @alejoe91 and @samuelgarcia to take a look at this.

What n_jobs are you using? Could you try n_jobs=1 to confirm whether it is a multiprocessing issue?
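
For reference, a minimal sketch of the single-process run, assuming the same si, rec, sorting, and outDir objects as in the snippet above (set_global_job_kwargs applies to the sparsity estimation and any later parallel steps):

# Sketch: force everything single-process so no ProcessPoolExecutor is involved.
si.set_global_job_kwargs(n_jobs=1, progress_bar=True)

we = si.create_sorting_analyzer(
    recording=rec,
    sorting=sorting,
    folder=outDir / 'sortings_folder_njobs1',  # hypothetical new folder name, to avoid clobbering the old one
    format="binary_folder",
    sparse=True,
)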

@JeffreyBoucher
Author

Hello!

Thanks for the response! I am using n_jobs = -1, which should be 10 cores on the cluster I'm using. I'll try with n_jobs = 1 and get back to you!

@zm711
Collaborator

zm711 commented Oct 20, 2024

Thanks, let us know how it goes with n_jobs=1. Sometimes there can be issues with how a server shares resources, so we need to troubleshoot 3 things:

  1. python 3.9 issue
  2. multiprocessing issue
  3. spikeinterface + server issue

@JeffreyBoucher
Author

Hello!

Setting n_jobs = 1 indeed let me get through the sparsity estimation without error! Naturally this takes much longer, though.

Since you have implied I might be able to solve my problem by updating from Python 3.9, I'll maybe give that a shot next. It's been a while since I chose my version, but I think having 3.9 isn't critical at this stage of the pipeline.

Thanks for your help! I'll let you know if changing versions doesn't solve it for me; let me know if you need any more information from me.

Jeffrey Boucher

@zm711
Collaborator

zm711 commented Oct 21, 2024

Yeah, it would be great if you could test Python 3.10 or 3.11; there have been some improvements in multiprocessing at the Python level. If updating Python works, that tells us 3.9 might not be as well supported as we thought for our multiprocessing. If 3.10/3.11/3.12 doesn't work either, then it might be a problem in our multiprocessing itself.

@zm711 added the concurrency (Related to parallel processing) label Oct 21, 2024
@JeffreyBoucher
Author

Hello!

Unfortunately, I still got the same error with python 3.11!

estimate_sparsity: 70%|███████ | 7983/11378 [3:11:25<36:45, 1.54it/s]
estimate_sparsity: 70%|███████ | 7990/11378 [3:11:40<1:21:16, 1.44s/it]Exception in thread Thread-2:

Traceback (most recent call last):
File "/home/sjjgjbo/.conda/envs/spikesortEnv/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
self.run()
File "/home/sjjgjbo/.conda/envs/spikesortEnv/lib/python3.11/concurrent/futures/process.py", line 347, in run
self.terminate_broken(cause)
File "/home/sjjgjbo/.conda/envs/spikesortEnv/lib/python3.11/concurrent/futures/process.py", line 499, in terminate_broken
work_item.future.set_exception(bpe)
File "/home/sjjgjbo/.conda/envs/spikesortEnv/lib/python3.11/concurrent/futures/_base.py", line 559, in set_exception
raise InvalidStateError('{}: {!r}'.format(self._state, self))
concurrent.futures._base.InvalidStateError: CANCELLED: <Future at 0x2b409bed9d90 state=cancelled>

Any advice on what to try next? Anything you might also want to look at?

Thanks!

Jeff Boucher

@zm711
Collaborator

zm711 commented Oct 25, 2024

Thanks for that info! A few more background questions then:

What OS are you using (looks like Linux maybe; which flavor)? Is this on a server or locally? If on a server, what is the local OS you are using to communicate with the server?

Could you do a conda list or pip list of the version numbers for the packages in the environment?

Could you give us the stats on your recording object? If you just type recording into your terminal, the repr should tell us the file size, dtype, and number of samples.
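
For example, a quick sketch of how those stats can be pulled, assuming the recording object is named rec as in the snippet above (all of these are standard recording methods):

print(rec)                                              # the repr: channels, sampling rate, segments, samples, size
print(rec.get_num_channels(), rec.get_sampling_frequency())
print(rec.get_num_samples(), rec.get_dtype())
print(rec.get_total_duration() / 3600, 'hours')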

@JeffreyBoucher
Author

Hello!

I am indeed using Linux. Here is the output of cat /etc/os-release:

"
NAME="Red Hat Enterprise Linux Server"
VERSION="7.8 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.8"
PRETTY_NAME="Red Hat Enterprise Linux"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.8:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.8
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.8"

"

This is on a server; it's a cluster organized by the university I work at (Myriad at UCL). Because of this, when I run the spike sorter I am interfacing with a job submission scheduler. My local OS is Linux as well, namely Ubuntu 22.04.

Here is the output of the conda list:

"
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
asciitree 0.3.3 pypi_0 pypi
blas 1.0 mkl
bzip2 1.0.8 h5eee18b_6
ca-certificates 2024.9.24 h06a4308_0
contourpy 1.3.0 pypi_0 pypi
cuda-python 12.6.0 pypi_0 pypi
cudatoolkit 10.1.243 h6bb024c_0
cudnn 7.6.5 cuda10.1_0
cycler 0.12.1 pypi_0 pypi
distinctipy 1.3.4 pypi_0 pypi
fasteners 0.19 pypi_0 pypi
fastrlock 0.5 pypi_0 pypi
filelock 3.13.1 pypi_0 pypi
fonttools 4.54.1 pypi_0 pypi
fsspec 2024.6.1 pypi_0 pypi
gmp 6.2.1 h295c915_3
gmpy2 2.1.2 pypi_0 pypi
h5py 3.12.1 pypi_0 pypi
intel-openmp 2023.1.0 hdb19cb5_46306
jinja2 3.1.4 pypi_0 pypi
joblib 1.4.2 pypi_0 pypi
kiwisolver 1.4.7 pypi_0 pypi
ld_impl_linux-64 2.38 h1181459_1
libabseil 20240116.2 cxx17_h6a678d5_0
libffi 3.4.4 h6a678d5_1
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libprotobuf 4.25.3 he621ea3_0
libstdcxx-ng 11.2.0 h1234567_1
libuuid 1.41.5 h5eee18b_0
llvmlite 0.43.0 pypi_0 pypi
markupsafe 2.1.3 pypi_0 pypi
matplotlib 3.9.2 pypi_0 pypi
mkl 2023.1.0 h213fc3f_46344
mkl-fft 1.3.8 pypi_0 pypi
mkl-random 1.2.4 pypi_0 pypi
mkl-service 2.4.0 pypi_0 pypi
mkl_fft 1.3.8 py311h5eee18b_0
mkl_random 1.2.4 py311hdb19cb5_0
mpc 1.1.0 h10f8cd9_1
mpfr 4.0.2 hb69a4c5_1
mpmath 1.3.0 pypi_0 pypi
mtscomp 1.0.2 pypi_0 pypi
nccl 2.8.3.1 hcaf9a05_0
ncurses 6.4 h6a678d5_0
neo 0.13.4 pypi_0 pypi
networkx 3.3 pypi_0 pypi
numba 0.60.0 pypi_0 pypi
numcodecs 0.13.1 pypi_0 pypi
numpy 1.26.4 pypi_0 pypi
numpy-base 1.26.4 py311hf175353_0
openssl 3.0.15 h5eee18b_0
packaging 24.1 pypi_0 pypi
pandas 2.2.3 pypi_0 pypi
pillow 11.0.0 pypi_0 pypi
pip 24.2 pypi_0 pypi
probeinterface 0.2.24 pypi_0 pypi
pyparsing 3.2.0 pypi_0 pypi
python 3.11.10 he870216_0
python-dateutil 2.9.0.post0 pypi_0 pypi
pytorch 2.3.0 cpu_py311h6fe12db_1
pytz 2024.2 pypi_0 pypi
quantities 0.16.1 pypi_0 pypi
readline 8.2 h5eee18b_0
scikit-learn 1.5.2 pypi_0 pypi
scipy 1.14.1 pypi_0 pypi
setuptools 72.1.0 pypi_0 pypi
six 1.16.0 pypi_0 pypi
spikeinterface 0.101.2 pypi_0 pypi
sqlite 3.45.3 h5eee18b_0
sympy 1.13.2 pypi_0 pypi
tbb 2021.8.0 hdb19cb5_0
threadpoolctl 3.5.0 pypi_0 pypi
tk 8.6.14 h39e8969_0
torch 2.3.0 pypi_0 pypi
tqdm 4.66.4 pypi_0 pypi
typing-extensions 4.11.0 pypi_0 pypi
typing_extensions 4.11.0 py311h06a4308_0
tzdata 2024.2 pypi_0 pypi
wheel 0.43.0 pypi_0 pypi
xz 5.4.6 h5eee18b_1
zarr 2.17.2 pypi_0 pypi
zlib 1.2.13 h5eee18b_1

"

I'll get you the recording stats momentarily...

@JeffreyBoucher
Author

The recording I am currently working with outputs:

"
ConcatenateSegmentRecording: 384 channels - 30.0kHz - 1 segments - 341,317,801 samples
11,377.26s (3.16 hours) - int16 dtype - 244.13 GiB
"

It's a set of concatenated recordings taken over a period of about a week and a half.

Thanks for your help!

Jeff Boucher

@zm711
Collaborator

zm711 commented Oct 25, 2024

Could you try running just one of the recordings and see if that works with n_jobs > 1? I vaguely remember that we had a problem with certain concatenations, so I would like to test this.

@h-mayorquin do you remember this too? That giant concatenations were causing problems with multiprocessing?

The issue with this is that the best way for us to fix it is to have the data to reproduce it with, but sharing ~250 GB is a non-trivial thing :)

Maybe @samuelgarcia or @alejoe91 also have opinions about why multiprocessing is failing with concatenation (and they both use Linux!).
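
A minimal sketch of that test, using hypothetical rec_single / sorting_single objects for one un-concatenated session (names made up for illustration), with multiprocessing re-enabled via the global job kwargs:

si.set_global_job_kwargs(n_jobs=-1, progress_bar=True)

analyzer_single = si.create_sorting_analyzer(
    recording=rec_single,                         # hypothetical single-session recording
    sorting=sorting_single,                       # hypothetical sorting for that session
    folder=outDir / 'single_session_analyzer',    # hypothetical output folder
    format="binary_folder",
    sparse=True,
)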

@samuelgarcia
Member

Hi.
Are you running the script through SLURM?
In my lab, SLURM kills jobs because the way it counts memory is wrong: with shared memory, every process's memory is counted cumulatively, which overflows the SLURM limits even though the real machine limits are fine.
Could you test with fewer processes and more threads?
n_jobs=6, max_threads_per_process=8 for instance?
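
A sketch of what that could look like via the global job kwargs (the kwarg name max_threads_per_process is assumed from spikeinterface 0.101.x; check the job_kwargs documentation for your version):

# Sketch: fewer worker processes, more threads per worker, so the scheduler's
# per-job memory accounting sees fewer process copies of the shared buffers.
si.set_global_job_kwargs(
    n_jobs=6,
    max_threads_per_process=8,   # assumed kwarg name; threads used inside each worker
    progress_bar=True,
)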

@JeffreyBoucher
Author

JeffreyBoucher commented Oct 28, 2024

Hello!

I'll run a single session dataset overnight tonight.

We are not using SLURM; the cluster seems to be using "SGE 8.1.9", which stands for "Sun Grid Engine". I don't know if there would be a similar problem with it; I'll try the single-session dataset first.

@JeffreyBoucher
Author

Hello!

In fact, I ran into a bug which I think is on my end, so I'm going to de-prioritize this for a bit. Since I was able to get things working by turning off parallel processing, I want to get that started on my real dataset, but afterward I'll get back to this (within a week).

Thanks for your help!

Jeff Boucher

@samuelgarcia
Member

Maybe SGE is killing your job because it is using too much RAM. Could you increase the memory request when submitting the job?
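
Besides raising the memory request in the submission script, the per-worker RAM can also be reduced from the SpikeInterface side by shrinking the chunk each worker processes, e.g. (a sketch; the values are arbitrary examples, not recommendations):

si.set_global_job_kwargs(
    n_jobs=10,
    chunk_duration='500ms',      # smaller chunks -> lower peak RAM per worker
    # alternatively: chunk_memory='200M' or total_memory='2G'
)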

@JeffreyBoucher
Author

Hello!

Parallel processing worked fine for a single session; for that and other reasons, I think that the suggestion to request more RAM for my jobs is a good one. I'll try it!

Thanks,

Jeff
