- Data
  - Import the scores into Hail
  - Add annotations for MAPS
  - Group variants and calculate the number of singletons (see the sketch below)
- Analysis
  - Download grouped variants
  - Calculate MAPS for each score-group combination
  - Visualise the results
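
The grouping step above boils down to counting variants and singletons per group in Hail. A minimal sketch, assuming a Hail Table with `score_bin`, `context` and `AC` fields (the field names and file paths are assumptions for illustration, not this pipeline's actual schema):

```python
import hail as hl

hl.init()

# Hypothetical annotated variant table; field names are assumptions.
ht = hl.read_table("gs://<bucket_name>/annotated_variants.ht")

# Group variants by score bin and mutational context, then count variants
# and singletons (allele count == 1) in each group.
grouped = ht.group_by(ht.score_bin, ht.context).aggregate(
    variant_count=hl.agg.count(),
    singleton_count=hl.agg.count_where(ht.AC == 1),
)
grouped.export("gs://<bucket_name>/grouped_variants.tsv")
```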
Tested with Hail version 0.2.107 and Snakemake 7.32
## `data/`

`data/` contains Hail and Snakemake code that requires execution in Google Cloud and saves files to a Google Storage (GS) bucket.
1. Create a new cluster:

   ```
   hailctl dataproc start <cluster_name> --packages snakemake --requester-pays-allow-buckets gnomad-public-requester-pays --project <project_name> --bucket <bucket_name> --region <region> --num-workers <N> --image-version=2.0.27-debian10
   ```

2. Connect to the cluster:

   ```
   gcloud beta compute ssh <user_name>@<cluster_name>-m --project "<project_name>"
   ```

3. `git clone` this repository and navigate to `data/`.

4. Run the pipeline:

   ```
   snakemake --cores all --configfile config.yaml --config gcp_rootdir="<bucket_name>/some_directory/"
   ```
Alternatively, in Step 4 you can submit the pipeline as a job. Create `job.py` containing the following:
```python
import snakemake

# Invoke the Snakemake CLI entry point with the same arguments
# as the interactive command in Step 4.
snakemake.main(
    [
        "--snakefile",
        "/path/to/Snakefile",
        "--cores",
        "all",
        "--configfile",
        "/path/to/config.yaml",
        "--config",
        'gcp_rootdir="<bucket_name>/some_directory/"',
    ]
)
```
Submit the script with `hailctl dataproc submit <cluster_name> job.py`.
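
In both cases, the value passed via `--config gcp_rootdir=...` is exposed inside the Snakefile through Snakemake's `config` dictionary. A standalone illustration with a stand-in dict (the output file name and the `gs://` prefixing are assumptions about this pipeline):

```python
# Snakemake makes --config values available as the `config` dict inside a Snakefile.
# Stand-in dict for illustration; the file name and gs:// prefix are assumptions.
config = {"gcp_rootdir": "<bucket_name>/some_directory/"}
grouped_variants_path = "gs://" + config["gcp_rootdir"] + "grouped_variants.tsv"
print(grouped_variants_path)  # gs://<bucket_name>/some_directory/grouped_variants.tsv
```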
## `analysis/`

`analysis/` contains scripts that calculate and visualise MAPS scores using files created in `data/`.
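
MAPS is the mutability-adjusted proportion of singletons: the observed proportion of singleton variants in a group minus the proportion expected from the group's mutability, with the expectation typically calibrated on a neutral reference set such as synonymous variants. A minimal sketch of the per-group calculation, assuming a pandas data frame with `singleton_count`, `expected_singletons` and `variant_count` columns (the column names and numbers are placeholders, not outputs of this pipeline):

```python
import pandas as pd


def maps(df: pd.DataFrame) -> pd.Series:
    """Mutability-adjusted proportion of singletons per group of variants."""
    return (df["singleton_count"] - df["expected_singletons"]) / df["variant_count"]


# Placeholder numbers purely for illustration.
groups = pd.DataFrame(
    {
        "group": ["A", "B"],
        "variant_count": [10_000, 50_000],
        "singleton_count": [6_000, 24_000],
        "expected_singletons": [5_500, 25_000],
    }
)
groups["maps"] = maps(groups)
print(groups[["group", "maps"]])
```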
1. `git clone` the code for CAPS (https://github.com/VCCRI/CAPS) into the same root directory as `MAPS_for_splicing/`.

2. Navigate to `MAPS_for_splicing/analysis/`.

3. Run the pipeline with either

   ```
   snakemake --cores all --config gcp="True" gcp_rootdir="<bucket_name>/some_directory/"
   ```

   (slower; will download the GS files each time) or

   ```
   snakemake --cores all --config gcp="False"
   ```

   (faster; will re-use the contents of `files/`; if necessary, create the directory and download the required files from GS into it first, e.g. as sketched below).
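
For the `gcp="False"` mode, one way to populate `files/` is `gsutil cp`; an equivalent Python sketch using the `google-cloud-storage` client is below (the bucket name and prefix are placeholders, and the exact set of required files depends on your `data/` run):

```python
from pathlib import Path

from google.cloud import storage

# Placeholder bucket and prefix; use the gcp_rootdir values from your data/ run.
BUCKET = "<bucket_name>"
PREFIX = "some_directory/"

files_dir = Path("files")
files_dir.mkdir(exist_ok=True)

client = storage.Client()
for blob in client.list_blobs(BUCKET, prefix=PREFIX):
    if blob.name.endswith("/"):  # skip "directory" placeholder objects
        continue
    destination = files_dir / Path(blob.name).name
    print(f"Downloading gs://{BUCKET}/{blob.name} -> {destination}")
    blob.download_to_filename(str(destination))
```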