Instructions: Using Singularity on the HPC
Singularity is a container software specifically designed for clusters. Application containers allow us to package software into a portable, shareable image. The ability to create a static image with all of the dependencies necessary for a given package or workflow allows us to control the environment in which we test, debug, and execute our code. For scientists, this is extremely useful.
Consider an experiment acquiring neuroimaging data over a long period of time. Given the amount of time it takes to process these data (e.g., Freesurfer alone normally takes ~12 hours), it only makes sense to process new subjects as they are acquired. However, even on HPCs, software packages are likely to be updated more than once within the lifespan of the project. This is a problem because changes to the software over the lifespan of the experiment will necessarily introduce time-related confounds into the processed data. Two common solutions are to either 1) process all of the data after acquisition is complete or 2) use project-specific environments on the HPC to specify versions of individual software packages when running processing workflows. The former approach is inefficient (although it does prevent data peeking) in that it may cause substantial delays in analyzing the data after acquisition is complete, while the latter is not exactly secure, as changes on the HPC or unsupervised changes to the environment by lab members can affect results without users' knowledge. Container software like Singularity addresses the weaknesses in both of these approaches.
BIDS Apps are processing and analysis pipelines for neuroimaging data specifically designed to work on datasets organized in BIDS format. These pipelines are able to run on any datasets organized according to this convention (assuming they contain the requisite data, of course). Combined with application container software like Docker or Singularity, this means that the same pipeline will return the same results on the same dataset, no matter where or when you run it!
Moreover, because the majority of these pipelines have been developed by methodologists and have been evaluated in associated publications (e.g., Esteban et al., 2017; Craddock et al., 2013), they are likely to be of higher quality and better validated than pipelines developed in-lab (typically based on some in-lab dataset). Using independently-developed pipelines also reduces the ability and incentive of researchers to leverage the analytic flexibility inherent to neuroimaging data in order to p-hack (whether intentionally or not) their pipelines to produce the most appealing results in their data.
- SSH onto a login node (`ssh [username]@hpclogin01.fiu.edu`).
- SSH from the login node to the data transfer node (`ssh u03`).
  - If anyone figures out a way to SSH directly onto the data transfer node, please update these instructions accordingly.
- Load the Singularity module: `module load singularity-3.5.3`
- Build the Singularity image from a Docker image: `singularity build [image name] docker://[docker_user|org]/[container]:[version tag]`
  - E.g., `singularity build poldracklab_fmriprep_1.5.0rc1.sif docker://poldracklab/fmriprep:1.5.0rc1`
- Copy your data to `/scratch`. Your Singularity image can only access `/scratch` and your home directory.
- Write a SLURM job file.
  - Must use CentOS7 nodes for processing.
  - Consider including the `--cleanenv` argument to keep environment variables from the host from messing up the variables in the image.
  - To use the entry-point script (e.g., a BIDS App), use `singularity run`.
    - An example job file for processing data with a BIDS App (a rough sketch is also included after this list).
  - To run a script that's not the entry-point (essentially using the image as an environment), use `singularity exec`.
    - An example sub file for using a Singularity image as an environment. Not yet figured out, but more information is available here.
- Submit said job (e.g., `sbatch job_file.sh`).
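The example job file linked above is the canonical reference; as a rough sketch only (the partition name, paths, resource requests, and fMRIPrep arguments below are assumptions, not values specific to the FIU HPC), a job file for running a BIDS App with `singularity run` might look something like this:

```bash
#!/bin/bash
# Hypothetical SLURM job file for running a BIDS App (here, fMRIPrep) with singularity run.
# Partition name, paths, and the participant label are placeholders -- adjust for your cluster.
#SBATCH --job-name=fmriprep_sub-01
#SBATCH --partition=centos7        # assumption: CentOS7 nodes are requested via a partition name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32gb
#SBATCH --time=24:00:00
#SBATCH --output=/scratch/[username]/logs/fmriprep_sub-01_%j.out

module load singularity-3.5.3

# --cleanenv keeps host environment variables from leaking into the container.
singularity run --cleanenv \
    /scratch/[username]/poldracklab_fmriprep_1.5.0rc1.sif \
    /scratch/[username]/bids_dataset \
    /scratch/[username]/derivatives \
    participant --participant-label 01 \
    --fs-license-file /scratch/[username]/license.txt \
    --nthreads 8
```

If you instead use the image as an environment, the call would swap `singularity run` for `singularity exec` followed by your own command, e.g. (hypothetical script path): `singularity exec --cleanenv /scratch/[username]/my_image.sif python /scratch/[username]/my_script.py`.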
When processing multiple participants at once, it is generally more efficient to submit a separate job for each participant. You can do this by hand (i.e., writing separate, participant-specific jobs and submitting them individually) or you can write a script that loads in a formattable job template, loops through participants, formats the template into a participant-specific job, and then submits that job for each participant.
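As a rough sketch of that scripted approach (the template file name, the `SUBJECT_ID` placeholder token, and the participant list are assumptions):

```bash
#!/bin/bash
# Hypothetical wrapper: fill a job template for each participant and submit it.
# Assumes job_template.sh contains the placeholder string SUBJECT_ID.
for subj in 01 02 03; do
    sed "s/SUBJECT_ID/${subj}/g" job_template.sh > "job_sub-${subj}.sh"
    sbatch "job_sub-${subj}.sh"
done
```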
However, if you are processing more than a handful of participants, you may wish to limit the number of jobs running at a given time to keep some cores open on your queue for the rest of your lab. This requires a more complicated version of the wrangler script, which must now be submitted as a job itself and which follows this logic (a rough sketch is included after the list):
- Have a template job where the subject ID is a formattable string.
- In the wrangler job, index participants.
- Loop through participants.
  - Within the above loop, have a while loop with two components:
    - Check the number of jobs under your username. You can do this by writing out your jobs to a file and then checking the number of rows (e.g., `squeue -u [username] > temp_job_list.txt`).
    - If the number of jobs is at or above your limit, wait some amount of time before continuing the while loop.
  - After the while loop, load the template job, format it with the subject ID, and submit the resulting participant-specific job script.
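A minimal bash sketch of this logic, assuming a `job_template.sh` with a `SUBJECT_ID` placeholder, a hard-coded participant list, and a job limit of 10 (all assumptions; the example linked in the NOTE below is the actual version):

```bash
#!/bin/bash
# Hypothetical job wrangler: submit one job per participant, but never hold
# more than MAX_JOBS in the queue at once. Submit this script itself with sbatch.
MAX_JOBS=10            # assumption: adjust to leave cores free for your lab
participants=(01 02 03 04 05)

for subj in "${participants[@]}"; do
    # Wait while the number of jobs under your username is at or above the limit.
    while true; do
        squeue -u [username] > temp_job_list.txt
        n_jobs=$(($(wc -l < temp_job_list.txt) - 1))   # subtract the header row
        if [ "${n_jobs}" -lt "${MAX_JOBS}" ]; then
            break
        fi
        sleep 300    # wait five minutes before checking again
    done

    # Format the template with the subject ID and submit the resulting job.
    sed "s/SUBJECT_ID/${subj}/g" job_template.sh > "job_sub-${subj}.sh"
    sbatch "job_sub-${subj}.sh"
done
```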
NOTE: Here is an example of a job wrangler and a formattable job script.
NOTE: If you have a processing pipeline that randomly fails (like fMRIPrep sometimes does), then you may want a more complicated version of the above that will resubmit jobs for failed participants. This involves checking which specific jobs are currently running (instead of just the number of jobs) and comparing that list to the list of participants and an index of completed participants (via a check for some final output file).
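One hedged way to implement that check (the completion-marker path and the job-name convention here are assumptions) is to skip any participant whose final output already exists or whose job is still queued or running:

```bash
# Hypothetical completion/running check inside the participant loop.
# Assumes jobs are named fmriprep_sub-${subj} and that a report HTML marks completion.
if [ -f "/scratch/[username]/derivatives/fmriprep/sub-${subj}.html" ]; then
    continue    # already finished; skip
fi
if squeue -u [username] --format="%j" | grep -q "fmriprep_sub-${subj}"; then
    continue    # job currently queued or running; skip
fi
# Otherwise, (re)submit the participant-specific job as above.
```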