Home
Setting up CellProfiler jobs to run on eddie3.
cptools2 automatically creates commands for eddie3 array jobs, and csv files suitable for the CellProfiler LoadData module.
Make sure you're on a worker node, then load python >= 3.5 with module load python.
Go to the cptools2 location (/exports/igmm/eddie/Drug-Discovery/tools/cptools2).
When you're within the cptools2 directory you will see a file called setup.py.
Install with python setup.py install --user.
This creates an entry point, so you should be able to use the cptools2 command from the command line without worrying about python or finding where the cptools2 code is located.
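If the cptools2 command is not found after installing, one possible cause is that the per-user script directory is not on your PATH. The sketch below checks for that. It assumes a --user install places entry points in ~/.local/bin, which is the usual location on Linux but has not been confirmed for eddie3; on_path is a hypothetical helper, not part of cptools2.

```shell
#!/bin/bash
# on_path DIR: succeed if DIR is one of the entries in $PATH.
# Hypothetical helper, not part of cptools2.
on_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Assumption: --user installs put entry points in ~/.local/bin
user_bin="$HOME/.local/bin"
if on_path "$user_bin"; then
    echo "$user_bin is on PATH"
else
    echo "$user_bin is not on PATH; add it with: export PATH=\"$user_bin:\$PATH\""
fi
```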
The cellprofiler pipelines need to be set up in a certain way to be used with cptools2.
The pipeline should start with a LoadData module, which takes the image information in the form of a csv file generated by cptools2.
It's easier to create the pipeline using the normal drag-and-drop interface in cellprofiler to load the images and extract metadata from the file paths, then change it to use the LoadData module at the end.
Channel names in the cellprofiler pipeline need to be W followed by a number, so W1, W2, etc. NB: capital 'W'.
The pipelines should end with an ExportToSpreadsheet module, with the location set to default.
It's recommended to combine object-level data into a single spreadsheet containing all objects, i.e. a single csv file for both nuclei and cell bodies. This means each job produces two spreadsheets: objects (normally called DATA.csv) and Image.csv for image-level data.
cptools2 uses a config file which details:
- The ImageXpress experiment to analyse
- If certain plates should be included/excluded
- How many imagesets each job should analyse
- The CellProfiler pipeline to use
- Where to save the results
- Where to save the submission commands
An example of a config file:
experiment: /path/to/ImageExpress/experiment
chunk: 96
pipeline: /path/to/cellprofiler/pipeline.cppipe
location: /path/to/output/location
commands location: /where/to/store/commands
More details on config file options
To create the commands and LoadData csv files, first make sure you're on a staging node with access to datastore, then run:
cptools2 config.yaml
where config.yaml is your configuration file.
This should create the staging, analysis, and destaging commands in the commands location. It also creates a LoadData csv file for each job in the location directory, as well as the SGE submission scripts and a final bash script that submits the submission scripts in the correct order.
cptools2 automatically creates everything you need to submit the job to the cluster. However, you might need to alter these scripts for more memory, or to batch submit jobs if you run over the 10,000 task limit.
After running cptools2 on the config file, 3 files are saved in the commands location (staging.txt, cp_commands.txt and destaging.txt). These 3 files contain one command per line, and will be run as three concurrent array jobs on the cluster.
cptools2 creates three default submission scripts (staging_script.sh, analysis_script.sh, destaging_script.sh) which are saved in the commands location directory along with the three files of commands. These are templates, and may need to be altered (e.g. to increase the run-time limit for long-running jobs).
The jobs are dependent on one another, so the analysis task will only start running once the corresponding staging task has finished. This uses the -hold_jid_ad flag on SGE. It's therefore important to give your jobs names so they can run in the correct order.
The -t flag sets which tasks to run. To run all the jobs in your command list, set this from 1 to the number of lines in the command lists (they should all have the same number of lines). In this example staging.txt, cp_commands.txt and destaging.txt each have 288 lines, with one line per command to run.
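Since all three array jobs share the same -t range, it can be worth confirming that the command files really do have the same number of lines before editing the scripts. A minimal sketch of such a check follows; task_range is a hypothetical helper, not part of cptools2.

```shell
#!/bin/bash
# task_range FILE...: print the "-t" range (1-N) shared by the given
# command files, or return 1 if their line counts differ.
# Hypothetical helper, not part of cptools2.
task_range() {
    local expected="" count file
    for file in "$@"; do
        # awk counts the last line even if it has no trailing newline
        count=$(awk 'END {print NR}' "$file")
        if [ -z "$expected" ]; then
            expected=$count
        elif [ "$count" -ne "$expected" ]; then
            echo "line count mismatch: $file has $count, expected $expected" >&2
            return 1
        fi
    done
    echo "1-$expected"
}

# Example usage, run from the commands location:
# task_range staging.txt cp_commands.txt destaging.txt
```

For the example in this page, where each file has 288 lines, this would print 1-288.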
The staging jobs simply copy the images over from datastore to a cluster storage location. This needs to be run on a staging node.
If needed, the work can be split into smaller sub-jobs by giving each one a different task range with the -t flag. You can always run these sub-jobs sequentially using the -hold_jid flag on the previous sub-job's destaging name.
Another option is to decrease the priority of the staging jobs. This can be altered with the -p flag; the lowest priority you can set is -1023.
You can also use the -tc flag on staging jobs to limit the number of concurrently running staging jobs, e.g. #$ -tc 5 means a maximum of 5 staging jobs will run at a time.
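When splitting a long command list into sub-jobs, each sub-job needs its own -t range. The sketch below prints those ranges for a given total and batch size; batch_ranges is a hypothetical helper, not part of cptools2, and the batch size of 100 is only an illustration.

```shell
#!/bin/bash
# batch_ranges TOTAL SIZE: print one "-t" range per line, covering
# tasks 1..TOTAL in chunks of at most SIZE tasks.
# Hypothetical helper, not part of cptools2.
batch_ranges() {
    local total=$1 size=$2 start=1 end
    while [ "$start" -le "$total" ]; do
        end=$((start + size - 1))
        if [ "$end" -gt "$total" ]; then
            end=$total
        fi
        echo "$start-$end"
        start=$((end + 1))
    done
}

# Example: 288 tasks in sub-jobs of at most 100 tasks each
batch_ranges 288 100
# prints 1-100, 101-200 and 201-288, one per line
```

Because SGE_TASK_ID takes its values from the -t range, a sub-job submitted with -t 101-200 runs lines 101 to 200 of the command file, so the command files themselves do not need to be split.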
#!/bin/bash
#$ -N stage_study
#$ -q staging
#$ -j y
#$ -l h_vmem=0.5G
#$ -l h_rt=02:00:00
#$ -o /exports/eddie/scratch/$USER/study/logs/staging
#$ -t 1-288
SEEDFILE=~/study/commands/staging.txt
SEED=$(awk "NR==$SGE_TASK_ID" $SEEDFILE)
$SEED
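The last three lines of the script are what turn one array job into many different commands: SGE sets SGE_TASK_ID for each task, awk prints the matching line of the command file, and the shell runs it. To see what a given task would run without submitting anything, you can set the variable by hand. A small sketch, using a throwaway file with placeholder commands in place of staging.txt:

```shell
#!/bin/bash
# Build a small stand-in for staging.txt (the echo commands are placeholders)
SEEDFILE=$(mktemp)
printf '%s\n' 'echo copy plate 1' 'echo copy plate 2' 'echo copy plate 3' > "$SEEDFILE"

# SGE normally sets this for each task; here we pretend to be task 2
SGE_TASK_ID=2

# Pick out line number SGE_TASK_ID, as the submission scripts do
SEED=$(awk "NR==$SGE_TASK_ID" "$SEEDFILE")
echo "task $SGE_TASK_ID would run: $SEED"

rm -f "$SEEDFILE"
```

Note the double == in the awk expression: a single = would assign to NR rather than compare, and awk would then print every line of the file.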
The analysis script calls each line of the cp_commands.txt file as a separate job. This runs cellprofiler on a batch of images and saves the csv output.
Cellprofiler is not actually installed on the cluster, but instead runs within a virtual environment. This means you have to set up a virtual environment for each user, and the source ... command in the analysis script has to point to the correct virtual environment for that user.
Depending on the size of the images or the analysis, you may have to adjust the memory requirements. In this example it's set very high (2 cores with 12GB each, so 24GB of RAM in total). You can set this lower and use a single core, which means more of your jobs will run at once. If you set it too low, though, some jobs will fail with MemoryErrors, which will appear in the log, and there won't be any csv file in the output location.
The -l h_rt flag is the run-time limit of the job. This can be lowered once you know how long the jobs will take.
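To find which tasks ran out of memory, you can search the analysis logs for MemoryError. The sketch below lists the affected task IDs. It assumes the logs use SGE's default array-job naming (job name, then .o and the job ID, then a dot and the task ID); failed_tasks is a hypothetical helper, not part of cptools2.

```shell
#!/bin/bash
# failed_tasks LOGDIR: list the task IDs whose log mentions MemoryError.
# Assumes log files are named <job name>.o<job id>.<task id>.
# Hypothetical helper, not part of cptools2.
failed_tasks() {
    local logdir=$1 file
    grep -l "MemoryError" "$logdir"/* 2>/dev/null | while read -r file; do
        # the task ID is whatever follows the last dot in the file name
        echo "${file##*.}"
    done | sort -n
}

# Example usage:
# failed_tasks /exports/eddie/scratch/$USER/study/logs/analysis
```

The listed tasks could then be resubmitted with a larger h_vmem by passing each ID to the -t flag.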
#!/bin/bash
#$ -N analyse_study
#$ -hold_jid_ad stage_study
#$ -pe sharedmem 2
#$ -l h_vmem=12G
#$ -l h_rt=48:00:00
#$ -j y
#$ -o /exports/eddie/scratch/$USER/study/logs/analysis
#$ -t 1-288
# allow modules to be loaded
. /etc/profile.d/modules.sh
module load igmm/apps/hdf5/1.8.16
module load igmm/apps/python/2.7.10
module load igmm/apps/jdk/1.8.0_66
module load igmm/libs/libpng/1.6.18
# activate the cellprofiler virtualenvironment
source /exports/igmm/eddie/Drug-Discovery/virtualenv-1.10/myVE/bin/activate
SEEDFILE=~/study/commands/cp_commands.txt
SEED=$(awk "NR==$SGE_TASK_ID" $SEEDFILE)
$SEED
Destaging removes the image data that was copied in from datastore.
#!/bin/bash
#$ -N destage_study
#$ -l h_vmem=0.5G
#$ -l h_rt=01:00:00
#$ -hold_jid analyse_study
#$ -j y
#$ -o /exports/eddie/scratch/$USER/study/logs/destaging
#$ -t 1-288
SEEDFILE=~/study/commands/destaging.txt
SEED=$(awk "NR==$SGE_TASK_ID" $SEEDFILE)
$SEED
As the jobs are dependent on one another they have to be submitted in the correct order. Using qsub, submit in the following order:
- staging
- analysis
- destaging
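The submission order can be sketched as a short loop over the three default script names. This is only an illustration, not the submission script that cptools2 generates. The SUBMIT variable is an assumption of this sketch: it defaults to a dry run that prints the qsub commands, so nothing is submitted unless you set SUBMIT=qsub on the cluster.

```shell
#!/bin/bash
# Submit the three array jobs in dependency order.
# SUBMIT defaults to a dry run that only prints the commands;
# set SUBMIT=qsub on the cluster to really submit.
SUBMIT=${SUBMIT:-echo qsub}

submit_all() {
    local script
    for script in staging_script.sh analysis_script.sh destaging_script.sh; do
        # stop at the first failed submission so later jobs are not orphaned
        $SUBMIT "$script" || return 1
    done
}

submit_all
```

Run it from the commands location so the script names resolve, and remember that the staging script needs a staging node.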