Snakemake Tutorial

Joshua L. Major-Mincer Last updated: 08/14/23

WARNING

Please ensure that you are on a compute or interactive node if you are testing this on the CHPC, NOT on a login node!

Quickstart

# Get snakemake tutorial. 
git clone [email protected]:quinlan-lab/Snakemake_Tutorial.git
cd Snakemake_Tutorial 

# Downloaded necessary files. 
wget https://snakemake-tutorial.s3.us-west-1.amazonaws.com/data.tar.gz -O data.tar.gz

wget https://snakemake-tutorial.s3.us-west-1.amazonaws.com/reference.tar.gz -O reference.tar.gz

# Uncompress files. 
tar -zxvf data.tar.gz
tar -zxvf reference.tar.gz

# Remove the archives if desired. 
rm data.tar.gz 
rm reference.tar.gz 

# Create snakemake environment using conda. 
conda env create -f envs/snakemake_env.yaml
conda activate snakemake_env

Description

This repository serves as an introduction to Snakemake for the members of the Quinlan lab. Each subdirectory is a module meant to provide information and practice with concepts that can help with the development and portability of Snakemake workflows. The concepts covered here are meant to provide an introductory lesson to how Snakemake workflows are generally structured, function, and 'tips-n-tricks' to make the most out of reproducible workflows. Each module uses the same input files, those that are downloaded and extracted into data/ and reference/. As such, if you want to run the modules' Snakefiles to practice, download the data.tar.gz and reference.tar.gz files before proceeding!

There may be more to follow, but for now, advanced concepts such as rule output aggregation, parallel SLURM job execution, etc. will not be included. If you would like any such concepts to be covered, or have any suggestions, please let me know!

Modules

Each module serves a purpose to introduce one or a few new concepts to build upon the previous ones, ordered numerically. Each module contains a Snakefile(s), and the inputs referenced are data/ and reference/. Each comes with a README, providing a description on the new concepts introduced, as well as the command-line to run the pipeline.

Module 1 - Basics: Shows the bare-bones skeleton of Snakemake workflows and how to get a minimally-function workflow running.
Module 2 - Muliptple Rules: Illustrates the chaining of inputs-outputs of many rules to make a longer, more comprehensive workflow.
Module 3 - Workflow Decorations: Some rules may need specific resources, such as CPU cores or RAM limits. This module shows how to define these.
Module 4 - Branching: An intermediate output might need to have two separate operations performed on it. In this instance, it is useful to know how to "branch" the workflow into deviating paths.
Module 5 - Config: Configurations are .yaml files that provide flexibility to the workflow using defined dictionary options.
Module 6 - Conda or Modules: Snakemake provides the ability to define software environments to run the rule with. In this module, we exemplify the usage of either conda environments or installed modules on a high performance computing environment (i.e. module load tool/1.0.0)
Module 7 - ScriptsAndOOP: More complicated rules can be put into Python or R scripts, which is covered here. Additionally, illustrated is the Object Orientation of Snakemake and how to leverage this to make your workflow more legible.
Module 8 - Full Pipeline: This workflow ties together all of these concepts, with a few minor changes that can clean up and make your workflows neater and more legible.

Acknowledgement

1000 Genomes Acknowledgement for deep coverage of the 2504 phase 3 genomes (or subset thereof):

The following cell lines/DNA samples were obtained from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research: [NA06984, NA06985, NA06986, NA06989, NA06994, NA07000, NA07037, NA07048, NA07051, NA07056, NA07347, NA07357, NA10847, NA10851, NA11829, NA11830, NA11831, NA11832, NA11840, NA11843, NA11881, NA11892, NA11893, NA11894, NA11918, NA11919, NA11920, NA11930. NA11931, NA11932, NA11933, NA11992, NA11994, NA11995, NA12003, NA12004, NA12005, NA12006, NA12043, NA12044, NA12045, NA12046, NA12058, NA12144, NA12154, NA12155, NA12156, NA12234, NA12249, NA12272, NA12273, NA12275, NA12282, NA12283, NA12286, NA12287, NA12340, NA12341, NA12342, NA12347, NA12348, NA12383, NA12399, NA12400, NA12413,, NA12414, NA12489, NA12546, NA12716, NA12717, NA12718, NA12748, NA12749, NA12750, NA12751, NA12760, NA12761, NA12762, NA12763, NA12775, NA12776, NA12777, NA12778, NA12812, NA12813, NA12814, NA12815, NA12827, NA12828, NA12829, NA12830, NA12842, NA12843, NA12872, NA12873, NA12874, NA12878, NA12889, NA12890]. These data were generated at the New York Genome Center with funds provided by NHGRI Grant 3UM1HG008901-03S1.

1000 Genomes Acknowledgement for deep coverage of the extended 3202 genomes (or subset thereof):

The following cell lines/DNA samples were obtained from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research: [NA06984, NA06985, NA06986, NA06989, NA06991, NA06993, NA06994, NA06995, NA06997, NA07000, NA07014, NA07019, NA07022, NA07029, NA07031, NA07034, NA07037, NA07045, NA07048, NA07051, NA07055, NA07056, NA07340, NA07345, NA07346, NA07347, NA07348, NA07349, NA07357, NA07435, NA10830, NA10831, NA10835, NA10836, NA10837, NA10838, NA10839, NA10840, NA10842, NA10843, NA10845, NA10846, NA10847, NA10850, NA10851, NA10852, NA10853, NA10854, NA10855, NA10856, NA10857, NA10859, NA10860, NA10861, NA10863, NA10864, NA10865, NA11829, NA11830, NA11831, NA11832, NA11839, NA11840, NA11843, NA11881, NA11882, NA11891, NA11892, NA11893, NA11894, NA11917, NA11918, NA11919, NA11920, NA11930, NA11931, NA11932, NA11933, NA11992, NA11993, NA11994, NA11995, NA12003, NA12004, NA12005, NA12006, NA12043, NA12044, NA12045, NA12046, NA12056, NA12057, NA12058, NA12144, NA12145, NA12146, NA12154, NA12155, NA12156, NA12234, NA12239, NA12248, NA12249, NA12264, NA12272, NA12273, NA12274, NA12275, NA12282, NA12283, NA12286, NA12287, NA12329, NA12335, NA12336, NA12340, NA12341, NA12342, NA12343, NA12344, NA12347, NA12348, NA12375, NA12376, NA12383, NA12386, NA12399, NA12400, NA12413, NA12414, NA12485, NA12489, NA12546, NA12707, NA12708, NA12716, NA12717, NA12718, NA12739, NA12740, NA12748, NA12749, NA12750, NA12751, NA12752, NA12753, NA12760, NA12761, NA12762, NA12763, NA12766, NA12767, NA12775, NA12776, NA12777, NA12778, NA12801, NA12802, NA12812, NA12813, NA12814, NA12815, NA12817, NA12818, NA12827, NA12828, NA12829, NA12830, NA12832, NA12842, NA12843, NA12864, NA12865, NA12872, NA12873, NA12874, NA12875, NA12877, NA12878, NA12889, NA12890, NA12891, NA12892]. These data were generated at the New York Genome Center with funds provided by NHGRI Grants 3UM1HG008901-03S1 and 3UM1HG008901-04S2.

Pending publication of our manuscript, please cite our biorxiv preprint (Byrska-Bishop et al.) as the marker paper for these data: https://www.biorxiv.org/content/10.1101/2021.02.06.430068v1 or doi: https://doi.org/10.1101/2021.02.06.430068.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
Module1-Basics		Module1-Basics
Module2-MultipleRules		Module2-MultipleRules
Module3-WorkflowDecorators		Module3-WorkflowDecorators
Module4-Branching		Module4-Branching
Module5-Config		Module5-Config
Module6-CondaOrModules		Module6-CondaOrModules
Module7-ScriptsAndOOP		Module7-ScriptsAndOOP
Module8-FullPipeline		Module8-FullPipeline
envs		envs
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Snakemake Tutorial

WARNING

Quickstart

Description

Modules

Acknowledgement

About

Releases

Packages

Languages

quinlan-lab/Snakemake_Tutorial

Folders and files

Latest commit

History

Repository files navigation

Snakemake Tutorial

WARNING

Quickstart

Description

Modules

Acknowledgement

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages