Instructions: Using Singularity on the HPC
Singularity is container software designed specifically for clusters. Application containers allow us to package software into a portable, shareable image. The ability to create a static image with all of the dependencies necessary for a given package or workflow lets us control the environment in which we test, debug, and execute our code. For scientists, this is extremely useful.
Consider an experiment acquiring neuroimaging data over a long period of time. Given how long it takes to process these data (e.g., FreeSurfer alone normally takes ~12 hours), it only makes sense to process new subjects as they are acquired. However, even on HPCs, software packages are likely to be updated more than once within the lifespan of the project, and changes to the software over that span will introduce time-related confounds into the processed data. Two common solutions are to either 1) process all of the data only after acquisition is complete or 2) use project-specific environments on the HPC to pin the versions of individual software packages used in processing workflows. The former approach is inefficient (although it does prevent data peeking), in that it can substantially delay analysis after acquisition ends, while the latter is not exactly secure, as changes on the HPC or unsupervised changes to the environment by lab members can affect results without users' knowledge. Container software like Singularity addresses the weaknesses of both approaches.
BIDS Apps are processing and analysis pipelines for neuroimaging data specifically designed to work on datasets organized in BIDS format. These pipelines are able to run on any datasets organized according to this convention (assuming they contain the requisite data, of course). Combined with application container software like Docker or Singularity, this means that the same pipeline will return the same results on the same dataset, no matter where or when you run it!
Moreover, because the majority of these pipelines have been developed by methodologists and evaluated in associated publications (e.g., Esteban et al., 2017; Craddock et al., 2013), they are likely to be of higher quality and better validated than pipelines developed in-lab (typically against a single in-house dataset). Using independently developed pipelines also reduces researchers' ability and incentive to exploit the analytic flexibility inherent to neuroimaging data in order to p-hack (whether intentionally or not) their pipelines toward the most appealing results.
- SSH onto a login node (`ssh [username]@hpclogin01.fiu.edu`).
- SSH from the login node to the data transfer node (`ssh u03`).
  - If anyone figures out a way to SSH directly onto the data transfer node, please update these instructions accordingly.
- `module load singularity-3.5.3`
- `singularity build [image name] docker://[docker_user|org]/[container]:[version tag]`
  - E.g., `singularity build poldracklab_fmriprep_1.5.0rc1.sif docker://poldracklab/fmriprep:1.5.0rc1`
- Copy your data to `/scratch`. Your Singularity image can only access `/scratch` and your home directory. (A sketch of the full build-and-copy session is shown below.)
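Putting these setup steps together, a session might look like the following sketch. The username, the `/scratch/[username]` layout, and the dataset name are placeholders, not prescribed paths.

```bash
# From your local machine: connect to a login node, then hop to the data transfer node
ssh [username]@hpclogin01.fiu.edu
ssh u03

# On the data transfer node: load Singularity and build the image from Docker Hub
module load singularity-3.5.3
singularity build poldracklab_fmriprep_1.5.0rc1.sif docker://poldracklab/fmriprep:1.5.0rc1

# Copy the image and your BIDS dataset to /scratch so jobs can access them
# (the /scratch/[username] subdirectory is just an example)
cp poldracklab_fmriprep_1.5.0rc1.sif /scratch/[username]/
cp -r ~/my_bids_dataset /scratch/[username]/my_bids_dataset
```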
- Write a SLURM job file (a sketch is shown after this list).
  - You must use CentOS7 nodes for processing.
  - Consider including the `--cleanenv` argument to keep environment variables from the host from interfering with the variables in the image.
  - To use the entry-point script (e.g., a BIDS App), use `singularity run`.
    - An example job file for processing data with a BIDS App.
  - To run a script that is not the entry-point (essentially using the image as an environment), use `singularity exec`.
    - An example sub file for using a Singularity image as an environment. Not yet figured out, but more information is available here.
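A minimal sketch of such a job file, assuming an fMRIPrep image on `/scratch` and using `singularity run` to call the BIDS App entry point. The partition name, resource requests, and paths are placeholders; adapt them to your cluster and pipeline.

```bash
#!/bin/bash
#SBATCH --job-name=fmriprep
#SBATCH --time=48:00:00          # walltime; adjust to your pipeline
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32gb
#SBATCH --partition=centos7      # placeholder; use your cluster's CentOS7 partition

module load singularity-3.5.3

# Run the BIDS App's entry-point script with singularity run.
# --cleanenv keeps host environment variables out of the container.
singularity run --cleanenv \
    /scratch/[username]/poldracklab_fmriprep_1.5.0rc1.sif \
    /scratch/[username]/my_bids_dataset \
    /scratch/[username]/derivatives \
    participant

# To use the image as an environment instead, swap in singularity exec, e.g.:
# singularity exec --cleanenv /scratch/[username]/image.sif python my_script.py
```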
- Submit said job. E.g., `sbatch job_file.sh`.
When you are processing many participants, it is generally a good idea to run one job per participant rather than looping through the participants sequentially in a single job. See this example job file. To run that job, you will want to use the `--array` option, like so:

`sbatch --array=1-100%5 slurm_singularity_array.sbatch`

In the above call, submitting the script to SLURM creates a management job, which in turn loops through the values 1 to 100 and feeds each value into the sbatch script as a variable named `SLURM_ARRAY_TASK_ID`. You can access that variable within the sbatch script to select a given row of a subject list file (e.g., the BIDS participants.tsv), so that each job runs your command on just one subject at a time. The management job will keep only 5 subject-specific jobs running at a time (that is what the `%5` does), so you don't block up your lab's queue. A sketch of such an array script is shown below.
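As a sketch of what a script like `slurm_singularity_array.sbatch` might contain: the paths, image name, partition, and the use of fMRIPrep's `--participant-label` flag are all assumptions for illustration, not a prescription.

```bash
#!/bin/bash
#SBATCH --job-name=fmriprep_array
#SBATCH --time=48:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32gb
#SBATCH --partition=centos7      # placeholder; use your cluster's CentOS7 partition

module load singularity-3.5.3

# Placeholder paths on /scratch
BIDS_DIR=/scratch/[username]/my_bids_dataset
OUT_DIR=/scratch/[username]/derivatives
IMAGE=/scratch/[username]/poldracklab_fmriprep_1.5.0rc1.sif

# participants.tsv has a header row, so participant N sits on row N + 1.
# Take the first (tab-separated) column and strip the "sub-" prefix.
subject=$(sed -n "$(( SLURM_ARRAY_TASK_ID + 1 ))p" "${BIDS_DIR}/participants.tsv" \
          | cut -f1 | sed 's/^sub-//')

# Process just this one participant
singularity run --cleanenv "${IMAGE}" \
    "${BIDS_DIR}" "${OUT_DIR}" participant \
    --participant-label "${subject}"
```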