A dockerised version of the SeqWare workflow derived from Sanger's Cancer Genome Project core somatic calling pipeline. This is a cleaned-up version as used in the ICGC/TCGA PanCancer project. See http://pancancer.info for more information. This workflow has been designed to work with SeqWare 1.1.1 VMs set up by Bindle 2.0.
Email Brian if you have questions.
- Keiran Raine [email protected]
- Brian O'Connor [email protected]
- Denis Yuen [email protected]
This tool has been validated as a CWL v1.0 CommandLineTool.
Versions that we tested with are the following:
- avro (1.8.1)
- cwl-runner (1.0)
- cwl-upgrader (0.1.1)
- cwltool (1.0.20160712154127)
- schema-salad (1.14.20160708181155)
- setuptools (25.1.6)
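To confirm the tool validates in your own environment (assuming cwltool is installed as above), a quick check is:
cwltool --validate Dockstore.cwl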
Future work will include testing as a CWL draft-3 tool for backwards compatibility.
Successful testing was completed with the following command (note that the input must be downloaded from https://s3-eu-west-1.amazonaws.com/wtsi-pancancer/testdata/HCC1143_ds.tar and placed at the corresponding path referenced in test1.json; additionally, the output location must exist and be writable by the executing user).
dockstore tool launch --entry Dockstore.cwl --local-entry --json test1.json
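A rough sketch of the prerequisite steps (the destination of the unpacked data must match whatever test1.json references, and the output path below is only an example):
# download and unpack the test data referenced by test1.json
wget https://s3-eu-west-1.amazonaws.com/wtsi-pancancer/testdata/HCC1143_ds.tar
tar -xf HCC1143_ds.tar
# ensure the output location named in test1.json exists and is writable (example path only)
mkdir -p /tmp/cgp-somatic-output && chmod u+w /tmp/cgp-somatic-output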
Warning: execution can consume a substantial amount of disk space and take upwards of 24 hours.
Note that test2.json also works but will take a more substantial amount of time to execute (on the order of 6 days on an 8-core host with 58GB of RAM).
Also note that the tabix server is no longer required (reference data is built into the workflow).
This is a work in progress; some of the information below has yet to be updated for the simplified Docker process.
This workflow assumes you are working on a Docker image of pancancer/seqware_whitestar_pancancer:1.1.2.
Again, this is a bit complex, but in the future we will distribute pre-created VMs as OVA files or AMIs that should make this process go much more smoothly.
This workflow recommends:
- 16-32 cores (make sure you set the coresAddressable variable correctly in the ini file)
- 4.5GB per core, so ideally 72GB+ for 16 cores and 144GB+ for 32 cores; on Amazon we recommend r3.8xlarge or r3.4xlarge
- 1TB of local disk space
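A quick way to check whether a host meets these recommendations, using standard Linux tools (nothing workflow-specific):
# report core count, total memory and free disk on the current volume
nproc
free -g
df -h .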
This workflow requires an HTTP server to serve tabix-indexed data files; the location is controlled by the tabixSrvUri variable in the ini file.
The directions above at our PanCancer SOP docs site will build a base SeqWare 1.1.0 box. Over time we will add these packages to the base configuration, but for now you will want to install the following on your SeqWare 1.1.0 VM once Bindle 2.0 has created it.
The following are the packages needed for Ubuntu 12.04:
# added due to some issues during setup of the new HVM host
sudo apt-get autoclean
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install g++ pkg-config dh-autoreconf
sudo apt-get install libncurses5-dev
sudo apt-get install libgd2-xpm-dev
sudo apt-get install libboost-all-dev
sudo apt-get install libpstreams-dev
sudo apt-get install libglib2.0-dev
sudo apt-get install zlib1g-dev
# for the vcf-uploader
sudo apt-get install libxml-dom-perl libxml-xpath-perl libjson-perl libxml-libxml-perl time libdata-uuid-libuuid-perl libcarp-always-perl libipc-system-simple-perl libfile-tail-perl
These packages are needed for execution, even when pulling from Artifactory:
sudo apt-get install r-base r-base-core r-cran-rcolorbrewer
All of the above on one line to make it easy to cut and paste into a terminal:
sudo apt-get autoclean; sudo apt-get update; sudo apt-get -y upgrade; sudo apt-get -y install libxml-dom-perl libxml-xpath-perl libjson-perl libxml-libxml-perl zlib1g-dev libglib2.0-dev libpstreams-dev libboost-all-dev libgd2-xpm-dev libncurses5-dev g++ pkg-config dh-autoreconf r-base r-base-core r-cran-rcolorbrewer time libdata-uuid-libuuid-perl libcarp-always-perl libipc-system-simple-perl
Note: on BioNimbus I ran into an issue with r-cran-rcolorbrewer not being up to date with R 3.x. See http://stackoverflow.com/questions/16503554/r-3-0-0-update-has-left-loads-of-2-x-packages-incompatible
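If the packaged r-cran-rcolorbrewer turns out to be incompatible with R 3.x, one possible workaround (our assumption; see the linked thread for the full discussion) is to install the package directly from CRAN instead:
sudo R -e 'install.packages("RColorBrewer", repos="https://cran.r-project.org")'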
You will also need Duck (the Cyberduck CLI) to transfer data to an SFTP server.
See https://trac.cyberduck.io/wiki/help/en/howto/cli
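A hypothetical transfer with the Cyberduck CLI might look like the following (the hostname, remote path, username and file are all placeholders):
duck --username myuser --upload sftp://sftp.example.org/incoming/ somatic_calls.vcf.gz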
If you plan on using Synapse uploads see the setup instructions at https://github.com/ICGC-TCGA-PanCancer/vcf-uploader
Specifically, this tool requires:
sudo apt-get install python-dev python-pip
sudo pip install synapseclient
sudo pip install python-dateutil
sudo pip install elasticsearch
sudo pip install xmltodict
sudo pip install pysftp
sudo pip install paramiko
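As a quick smoke test (purely a hypothetical check, not part of the workflow) you can confirm the Python modules import cleanly:
python -c "import synapseclient, dateutil, elasticsearch, xmltodict, pysftp, paramiko; print('Synapse upload dependencies OK')"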
There are settings files that the workflow will attempt to create for you given the parameters you pass in.
If you plan on uploading to S3 you will need the Amazon command line tools. Install them using:
sudo apt-get install python-pip
sudo pip install awscli
The workflow will set up your credential files when called.
Further details can be found at the following:
- https://aws.amazon.com/cli/
- http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
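If you want to verify the CLI manually before a run (the bucket and key below are placeholders), something like the following should work:
aws configure
aws s3 cp output.vcf.gz s3://my-example-bucket/uploads/output.vcf.gz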
You need a tabix HTTP server serving up controlled-access data that's used in the variant calling pipeline. We can't share this publicly, so email the authors for information about this component.
Now that you have a VM built and system-level dependencies installed, it's time to build the workflow. Clone the project with git and cd into SeqWare-CGP-SomaticCore. Now build with Maven:
mvn clean install
You should see the output in the target directory, specifically Workflow_Bundle_CgpCnIndelSnvStr_1.0-SNAPSHOT_SeqWare_1.1.0-alpha.5.
In order to test the proper functioning of the workflow with test data, create an ini file (the contents of this will depend on your workflow). For testing purposes, you will require the following ini; note that the IP address for the tabix server will appear in your environment variables as PANCANCER_TABIX_SERVER_PORT_80_TCP_ADDR:
testMode=true
tabixSrvUri=http://172.17.0.13/
Run the workflow sequentially (inside the container) with:
seqware bundle launch --dir /workflow --no-metadata --ini workflow.ini
That will run the integrated tests.
You can manually run other tests using the --ini parameter to pass custom ini files with your test settings in them. In a production system, these ini files would be created by a "decider" that schedules workflows to many VMs in parallel but that is beyond the scope of this doc. See the PanCancer SOP docs for more info: https://github.com/ICGC-TCGA-PanCancer/pancancer-info/tree/develop/docs
NOTE: to download "real" data from a GNOS repository you need to provide a valid GNOS .pem key file, see the ini in the bundle for more information.
Once you build this on a SeqWare VM the next thing you'll want to do is actually run some test data. The workflow can operate in two modes: first, a local test mode and, second, an online mode where large data is downloaded from GNOS. The latter takes about 30 hours to run, whereas the former takes about 1.5 hours on 8 cores. Take a look at CgpSomaticCore.ini and make decisions about the following variables:
# number of cores to use, you want 4.5G/core or more
coresAddressable=32
# the server that has various tabix-indexed files on it, see above, update with your URL
tabixSrvUri=http://10.89.9.50/
# in non local test mode make sure this path is valid
pemFile=/mnt/seqware-build/ebi.pem
# not "true" means the data will be downloaded using AliquotIDs
testMode=false
# these are residing on the EBI GNOS repo
tumourAliquotIds=f393bb07-270c-2c93-e040-11ac0d484533
tumourAnalysisIds=ef26d046-e88a-4f21-a232-16ccb43637f2
tumourBams=7723a85b59ebce340fe43fc1df504b35.bam
controlAnalysisId=1b9215ab-3634-4108-9db7-7e63139ef7e9
controlBam=8f957ddae66343269cb9b854c02eee2f.bam
Once you've modified (or copied and modified) the .ini file you can launch the workflow with something like:
seqware bundle launch --dir <path>/target/Workflow_Bundle_CgpCnIndelSnvStr_... --ini <path_to_modified_ini>
This will launch the workflow and print out some key debugging information that will help you if things go wrong.
The following values should be set to the maximum that the executing host can safely use (taking into account the OS and other services running); a sketch for deriving them is shown after this list:
- coresAddressable: the maximum number of cores that will be available to the workflow.
- memHostMbAvailable: the total memory (in MB) available to the workflow.
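A rough sketch for deriving sensible values, assuming you leave a couple of cores and ~4GB of RAM for the OS and other services (the headroom figures are our assumption, not a project recommendation):
# example only: reserve 2 cores and 4096MB for the host itself
echo "coresAddressable=$(( $(nproc) - 2 ))"
echo "memHostMbAvailable=$(( $(free -m | awk '/^Mem:/{print $2}') - 4096 ))"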
There are a few steps that don't handle cores and memory in the normal SeqWare way. This allows the per-process overhead to be mostly negated and enables more efficient parallel processing. The processes which use this approach are:
CaVEMan_mstep
CaVEMan_estep
BRASS_assemble
cgpPindel_pindel
These processes can theoretically use all of the cores and memory available to the host. If you decide that these steps generically need a change to memory, please modify the appropriate mem*PerThread value and not the generic coresAddressable or memHostMbAvailable. The workflow will assess the available memory and sacrifice cores for memory if required.
As these steps will be running multiple threads (internally) each can share/donate memory to those running at the same time and so memory failures are reduced.
For the processes where core and memory utilisation is managed by the underlying Perl code rather than by SeqWare/SGE directly (other than the total available), the logs are written to the log area for the specific algorithm. This is done so that the output from multiple threads isn't mixed on a single stream.
It is easy to see where things got up to with something like:
ls -ltrh seqware-results/0/<ALG>/tmp*/logs/*.err | tail -n 10
where <ALG> can be one of:
- brass
- caveman
- pindel
There are also *.out files.
In these cases, do not modify the *-qsub.opts file but instead reduce the parallel threads by 2 as follows:
Process | Script stub | Modify
---|---|---
cgpPindel_pindel | cgpPindel_pindel_115.sh | -l and -c
BRASS_assemble | BRASS_assemble_148.sh | -l and -c
CaVEMan_mstep | CaVEMan_mstep_239.sh | -l and -t
CaVEMan_estep | CaVEMan_estep_241.sh | -l and -t
(The numeric component of 'script stub' was correct at the time of writing; it may drift.)
The Dockerfile adds the following to the pancancer/seqware_whitestar_pancancer image. This is provided in case you need to start from scratch.
You need to build and install the following in this order:
- PCAP v1.12.1
- cgpVcf v1.3.1
- alleleCount v2.1.2
- cgpNgsQc v1.0.3
- ascatNgs v1.5.2
- cgpPindel v1.5.4
- cgpCaVEManPostProcessing v1.5.0
- cgpCaVEManWrapper v1.9.0
- BRASS v4.0.12
- VAGrENT v2.1.2
- grass v1.1.6
There is a script included in this package that automates this process; just run:
workflow/bin/setup.sh
This creates workflow/bin/opt, which should then be packaged (tar.gz) and deployed to Artifactory.
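A minimal sketch of that packaging step (the archive name and Artifactory URL are placeholders; adjust to your own repository layout):
tar -czf cgp-somatic-opt.tar.gz -C workflow/bin opt
# upload to Artifactory, e.g. with an HTTP PUT (URL and credentials are placeholders)
curl -u user:apikey -T cgp-somatic-opt.tar.gz https://artifactory.example.org/artifactory/generic-local/cgp-somatic-opt.tar.gz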
If you find there is a problem following an upgrade, please refer to this script.
If you launch our VM on a machine with more or fewer processors, or more or less memory, you can tell SGE to use them with the following commands:
sudo qconf -aattr queue slots "[master=`nproc`]" main.q
sudo qconf -mattr queue load_thresholds "np_load_avg=`nproc`" main.q
sudo qconf -rattr exechost complex_values h_vmem=`free -b|grep Mem | cut -d" " -f5` master
You shouldn't really need to change these but if you're changing instance types/flavors then you might.
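To confirm SGE has picked up the new values, the standard Grid Engine query commands can be used, for example:
qconf -sq main.q | grep slots
qhost -F h_vmem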
This is fixed on the HVM AMI, but if you want to call oozie commands directly then you might need to define the following environment variable:
export OOZIE_URL=http://master:11000/oozie
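With OOZIE_URL exported, the standard Oozie CLI can be used to inspect runs, for example (the job id below is a placeholder):
oozie jobs | head
oozie job -info 0000001-140101000000000-oozie-oozi-W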