Part 4: Basic BOLD Preprocessing
So far, we have run one thing at a time using an idev session. For most processing, however, we can run much more efficiently using the compute nodes on Lonestar. Make sure to read the Lonestar User Guide to learn about the cluster and how to use it properly.
We'll first run prep_bold_run.sh on each functional run to do basic preprocessing of the BOLD data, including motion correction, brain extraction, and quality assurance. For (almost) every script in FAT, you can get basic information about how to run it by just entering the name of the script:
prep_bold_run.sh
displays:
Usage: prep_bold_run.sh [-k] bold_dir
Required:
bold_dir
Path to directory with a raw functional timeseries image named
bold.nii.gz. Outputs will be placed in this directory.
Options:
-k
Keep intermediate files.
This is a little different from the scripts we used earlier. It's written in Bash, not Python, and does not have a --dry-run option. In the usage information, square brackets ([]) indicate optional inputs. Inputs that start with "-" are sometimes called "flags", and are used to set options when running scripts. Here, there is one optional flag. If "-k" is included in the command, intermediate files will be kept. If it is omitted, intermediate files will be deleted to save space. Most of the scripts in the preprocessing pipeline have an option to keep or delete intermediate files; this helps keep disk usage in your WORK directory from hitting the quota (initially 1 TB, though it's possible to request more).
The last part of the usage line is bold_dir. This indicates that you should provide the path to a directory with a raw functional timeseries.
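To make the usage line more concrete, here is a minimal sketch of how a shell script could handle an optional -k flag followed by a required directory. This is only an illustration, not the actual code in prep_bold_run.sh; the function name parse_args and the variables keep and bold_dir are made up for the example, and /path/to/BOLD/prex_1 is a placeholder path.

```shell
# Hypothetical illustration only; NOT the real prep_bold_run.sh.
parse_args() {
    keep=0      # default: delete intermediate files
    OPTIND=1    # reset so the function can be called more than once
    while getopts "k" opt; do
        case $opt in
            k) keep=1 ;;
            *) echo "Usage: prep_bold_run.sh [-k] bold_dir" >&2; return 1 ;;
        esac
    done
    # Drop the flags that were parsed, leaving the positional inputs.
    shift $((OPTIND - 1))
    if [ $# -lt 1 ]; then
        echo "Usage: prep_bold_run.sh [-k] bold_dir" >&2
        return 1
    fi
    bold_dir=$1
    echo "bold_dir=$bold_dir keep=$keep"
}

parse_args -k /path/to/BOLD/prex_1   # flag given: keep intermediate files
parse_args /path/to/BOLD/prex_1      # flag omitted: intermediate files deleted
```

Flags come before the required input, which is why the usage line lists [-k] ahead of bold_dir.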
We have two subjects, each of which has two functional runs (in a real case, there would be much more data). Ideally, we want all of these scans to be processed in parallel on one or more compute nodes on Lonestar. To submit a job to do this, you need an ssh login session on one of the regular login nodes on Lonestar (submitting jobs from the virtual login does not work). In the Terminal, open another window or tab, and type:
ssh -Y $username@ls5.tacc.utexas.edu
replacing $username with your TACC username.
There are two steps to running a parallel job on Lonestar:
- Create a job command file, with one command on each line
- Tell the cluster's job scheduler how to run the commands, and submit
First, create a simple job command file. You can use any text editor. A simple way to do this is using nano. Create a file called prep_bold_commands.sh with the following lines:
prep_bold_run.sh $STUDYDIR/bender_03/BOLD/prex_1
prep_bold_run.sh $STUDYDIR/bender_03/BOLD/study_1
prep_bold_run.sh $STUDYDIR/bender_04/BOLD/prex_1
prep_bold_run.sh $STUDYDIR/bender_04/BOLD/study_1
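With more subjects and runs, typing each line by hand gets tedious. As a rough sketch, the same file could be generated with a loop instead. The variable names subjects and runs are just for this example.

```shell
# Subject and run IDs from this tutorial; edit for your own study.
subjects="bender_03 bender_04"
runs="prex_1 study_1"

# Start with an empty command file.
: > prep_bold_commands.sh

for subj in $subjects; do
    for run in $runs; do
        # Single quotes keep $STUDYDIR literal in the file,
        # so it is expanded later, when the command is run.
        printf 'prep_bold_run.sh $STUDYDIR/%s/BOLD/%s\n' "$subj" "$run" >> prep_bold_commands.sh
    done
done

# Check the result before submitting anything.
cat prep_bold_commands.sh
```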
Every line of this file can be run in parallel, since prep_bold_run.sh only makes changes within a given BOLD directory, so we can run all the commands at once without issue. Next, we need to tell the cluster's scheduler how to run these commands:
launch -s prep_bold_commands.sh -N 1 -n 4 -a 6 -r 02:00:00 -A ANTS -p development
There are a few things to unpack here.
- -N indicates the total number of nodes to use for the job. Each standard compute node has 24 cores and 64 GB of RAM. Here, we just need 1 node, since the memory usage for each command is modest.
- -n indicates the total number of tasks for the job. This is the number of commands that will be run at a time. Here, we can run all 4 simultaneously. We could also set this to something like "-n 2"; that would run two commands at a time, and would take about twice as long to run. This can be useful if you're getting out-of-memory errors; since only two commands run at a time, memory usage will be cut about in half.
- -a is the number of cores that ANTS should use. Many scripts in the toolbox use ANTS, and ANTS supports processing using multiple cores. Here, we indicate that each task should have access to 6 cores; this will allow ANTS to run much faster. This is also important because ANTS will sometimes crash if you run multiple processes on a single node without setting a limit, since every process tries to use all the cores on the node. A good rule of thumb is to set -a to 24/(n/N), that is, 24 cores divided by the number of tasks per node, so that you will use all the cores on each node. Here, that gives 24/(4/1) = 6.
- -r is the maximum run time that a job can take. Here, we've set that to 2 hours. If the commands aren't finished running when the time limit is up, the job will be killed. It's a good idea to keep this short enough that the job won't have to wait as long in the queue (there is a preference for shorter jobs to run sooner), but long enough that the commands will definitely have time to finish.
- -A is the "allocation" to charge the job to. The lab has an allocation for running jobs on Lonestar called ANTS. You need to be added to that allocation before you can submit jobs under it.
- -p is the "partition" of the cluster to submit to. For most purposes, there are two main partitions: normal and development. The normal partition is for most regular jobs. The development partition is designated for testing things out. Here, we use the development partition. It has some restrictions; see the Lonestar User Guide for details.
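The rule of thumb for -a can be worked out with shell arithmetic. This is a small sketch using the node specs given above (24 cores and 64 GB of RAM per standard node); the variable names are made up for the example. The memory figure is only a rough guide, since it assumes the tasks on a node split its memory evenly.

```shell
cores_per_node=24     # cores on a standard compute node
mem_per_node_gb=64    # RAM on a standard compute node
N=1                   # number of nodes (-N)
n=4                   # total number of tasks (-n)

# Integer division: this assumes n is a multiple of N.
tasks_per_node=$(( n / N ))
cores_per_task=$(( cores_per_node / tasks_per_node ))
mem_per_task_gb=$(( mem_per_node_gb / tasks_per_node ))

echo "-a $cores_per_task (about $mem_per_task_gb GB of RAM per task)"
```

With the values above, this prints "-a 6 (about 16 GB of RAM per task)". Changing n to 2 gives 12 cores and about 32 GB per task.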
Run launch -h to see all the available options for running jobs.
The toolbox includes some utilities to make submitting jobs quick. See Running Scripts for details. Here, we'll use rlaunch, which runs some command for a set of subjects, each of which has multiple functional runs.
First, define a couple of environment variables with lists of subjects and runs to process:
SUBJIDS=bender_03:bender_04
RUNIDS=prex_1:study_1
These are colon-separated lists of subjects to process and runs to process. These lists can be put in your $HOME/.bashrc file so they'll always be defined, and can be updated to reflect new participants being added or participants being excluded from analysis.
Also define a BATCHDIR variable, which indicates where job information and outputs should be saved:
export BATCHDIR=$WORK/preproc/batch/launchscripts
mkdir -p $BATCHDIR
First, we'll use the -t (test) option to display the list of commands without actually running anything:
rlaunch -t "prep_bold_run.sh $STUDYDIR/{s}/BOLD/{r}" $SUBJIDS $RUNIDS
The first part gives the command to run (it must be in quotes so the spaces are handled correctly). In the command string, {s} will be replaced with a subject ID, and {r} will be replaced with a run ID. Next come the list of subjects and the list of runs. This will print out all the commands to be run, one for each combination of subject and run:
prep_bold_run.sh /work/03206/mortonne/lonestar/preproc/bender_03/BOLD/prex_1
prep_bold_run.sh /work/03206/mortonne/lonestar/preproc/bender_03/BOLD/study_1
prep_bold_run.sh /work/03206/mortonne/lonestar/preproc/bender_04/BOLD/prex_1
prep_bold_run.sh /work/03206/mortonne/lonestar/preproc/bender_04/BOLD/study_1
That looks right, so now we can define and submit a job in one line at the terminal. Remove the -t option and add any launch options at the end, to actually submit the commands:
rlaunch "prep_bold_run.sh $STUDYDIR/{s}/BOLD/{r}" $SUBJIDS $RUNIDS -N 1 -n 4 -a 6 -r 02:00:00 -p development
This might seem like overkill for 2 subjects with 2 runs each, but being able to construct commands automatically like this is useful for quickly submitting jobs to process many subjects and runs.
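To see what the {s} and {r} substitution involves, here is a rough sketch of how a command template could be expanded over two colon-separated lists. This is not how rlaunch itself is implemented; the function expand_commands is hypothetical, and /path/to/study is a placeholder used only if STUDYDIR is not already set.

```shell
# Hypothetical helper for illustration; NOT the toolbox's rlaunch.
# Usage: expand_commands TEMPLATE SUBJIDS RUNIDS
expand_commands() {
    template=$1
    # Turn each colon-separated list into space-separated words.
    for s in $(echo "$2" | tr ':' ' '); do
        for r in $(echo "$3" | tr ':' ' '); do
            # Replace every {s} and {r} in the template.
            # Assumes the IDs contain no "/" or "&" characters.
            echo "$template" | sed -e "s/{s}/$s/g" -e "s/{r}/$r/g"
        done
    done
}

# Placeholder study directory, used only when STUDYDIR is undefined.
STUDYDIR=${STUDYDIR:-/path/to/study}

expand_commands "prep_bold_run.sh $STUDYDIR/{s}/BOLD/{r}" bender_03:bender_04 prex_1:study_1
```

This prints one command per subject and run combination, in the same order as the rlaunch -t output shown earlier.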