Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run items using SLURM; save batches with platform-independent paths #298

Merged
merged 20 commits into from
Jul 16, 2024

Conversation

ethanbb
Copy link
Collaborator

@ethanbb ethanbb commented Jun 7, 2024

This implements the "slurm" backend via _run_slurm. Supports passing a partition or list of partitions on which to run the jobs; more options (such as memory allocation) could be added if we think they're important.

For controlling number of CPUs/processes per job, right now I'm using the MESMERIZE_N_PROCESSES environment variable to indicate how many processes to use per job, which matches the behavior of the "subprocess" backend. However, it might be a good idea to instead divide this number by the number of jobs that are running in parallel, to avoid needlessly over-parallelizing each job. I'm doing this in my own code to set the environment variable and it works fine, but it may make sense to automate it.

This also adds a dependency on filelock to lock the batch file when updating it to avoid race conditions. I used the SoftFileLock because the regular FileLock wasn't working for me on a NTFS remote drive (from Linux). As I understand it, this basically just creates a lock file when acquiring and deletes it when releasing, which also wouldn't be hard to implement ourselves if you prefer not to add a dependency.

@ethanbb ethanbb marked this pull request as draft June 7, 2024 16:15
@ethanbb
Copy link
Collaborator Author

ethanbb commented Jun 7, 2024

Changed to draft because I'm actually still refactoring things as part of the cross-platform saving fix

@ethanbb
Copy link
Collaborator Author

ethanbb commented Jun 11, 2024

tox-dev/filelock#331 is currently a blocker... could probably sidestep but I'm hoping they fix it quickly (for now)

@ethanbb ethanbb marked this pull request as ready for review June 13, 2024 22:45
@ethanbb ethanbb changed the title Run items using SLURM Run items using SLURM; save batches with platform-independent paths Jun 13, 2024
@ethanbb
Copy link
Collaborator Author

ethanbb commented Jun 13, 2024

This now also fixes #209 - sorry for the lack of clean separation between them. Locking the batch file while writing was involved in both sets of changes.

Copy link
Collaborator

@kushalkolar kushalkolar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Mostly LGTM, have a few suggestions

mesmerize_core/algorithms/cnmf.py Show resolved Hide resolved
mesmerize_core/batch_utils.py Outdated Show resolved Hide resolved
mesmerize_core/batch_utils.py Outdated Show resolved Hide resolved
mesmerize_core/batch_utils.py Outdated Show resolved Hide resolved
mesmerize_core/batch_utils.py Outdated Show resolved Hide resolved
mesmerize_core/batch_utils.py Outdated Show resolved Hide resolved
mesmerize_core/caiman_extensions/common.py Outdated Show resolved Hide resolved
mesmerize_core/caiman_extensions/common.py Outdated Show resolved Hide resolved
mesmerize_core/caiman_extensions/common.py Outdated Show resolved Hide resolved
mesmerize_core/caiman_extensions/common.py Show resolved Hide resolved
@ethanbb
Copy link
Collaborator Author

ethanbb commented Jun 16, 2024

I made some significant changes to put the saving functionality in the caiman dataframe extension and streamline locking the file where necessary - please take another look when you can!

@ethanbb
Copy link
Collaborator Author

ethanbb commented Jun 16, 2024

Hmm checks seem to be failing at the install step - not sure if it's due to these changes or something else is going on.

mesmerize_core/caiman_extensions/_utils.py Show resolved Hide resolved
mesmerize_core/caiman_extensions/common.py Outdated Show resolved Hide resolved

Parameters
----------
index: int, str or UUID
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put type annotation in function signature too

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I don't know what the best practice is with this, but isn't it correct given that the code of this function accepts only ints? On the other hand, because of decoration by @_index_parser, CaimanDataFrameExtensions.update_item accepts ints, strs, and UUIDs. However, I agree it is confusing; most people probably look to function signatures rather than docstrings to learn how to call things.

mesmerize_core/caiman_extensions/common.py Outdated Show resolved Hide resolved
mesmerize_core/caiman_extensions/common.py Show resolved Hide resolved
mesmerize_core/caiman_extensions/common.py Outdated Show resolved Hide resolved
mesmerize_core/caiman_extensions/common.py Show resolved Hide resolved
@kushalkolar
Copy link
Collaborator

have a few comments, sorry been busy with a big fastplotlib release. You can always always ping me with @ if I forget

@ethanbb
Copy link
Collaborator Author

ethanbb commented Jun 19, 2024

No worries, I've been working on other things as well.

os.remove(bak)
except:
except BaseException as err:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes BaseException, however here I believe it's correct (at least matches previous behavior); just need to put it explicitly in order to get err so it can be forwarded.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

except (Exception, KeyboardInterrupt)

@ethanbb
Copy link
Collaborator Author

ethanbb commented Jun 19, 2024

@kushalkolar done making changes for now; see remaining threads for questions I still have

os.remove(bak)
except:
except BaseException as err:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

except (Exception, KeyboardInterrupt)

@kushalkolar kushalkolar merged commit 306f702 into nel-lab:master Jul 16, 2024
0 of 5 checks passed
@ethanbb ethanbb deleted the slurm branch July 16, 2024 18:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants