
## Dependencies

- Required:
- `snakemake` (version 6.0+)
- `biscuit` (version 1.2.0+)
- `samtools` (version 1.12+)
- `htslib` (version 1.12+)
- `samblaster` (version 0.1.26+)
- bedtools
- pigz
- GNU parallel
- FastQC
- MultiQC
- Python 3.7+ with pandas, numpy, matplotlib, and seaborn
- Optional:
- TrimGalore! (required for trimming adaptors)
- R with tidyverse, ggplot, patchwork, and viridis (required for plotting methylation controls)
- Bismark (required when running fastq_screen)
- fastq_screen (required when running fastq_screen)
- preseq (required for finding library complexity, version 3.1.2+ must be compiled with htslib enabled)
The following dependencies are downloaded when running with `--use-conda`; otherwise, you must have them available in your PATH (a quick check is sketched after the table).
| Package | Conda Version Downloaded | Notes |
|:---------------|:------------------------:|:------|
| `snakemake` | 7.0+ | Needed before running pipeline |
| `biscuit` | 1.2.0 | |
| `htslib` | 1.17 | |
| `samtools` | 1.17 | |
| `dupsifter` | 1.2.0 | |
| `parallel` | 20230322 | |
| `bedtools` | 2.30.0 | |
| `preseq` | 3.2.0 | Must be compiled with htslib enabled |
| `fastqc` | 0.12.1 | |
| `trim_galore` | 0.6.10 | |
| `fastq_screen` | 0.15.3 | Only required if running `fastq_screen` |
| `bismark` | 0.24.0 | Only required if running `fastq_screen` |
| `pigz` | 2.6 | |
| `python` | 3.11.3 | |
| `pandas` | 2.0.0 | |
| `numpy` | 1.24.2 | |
| `matplotlib` | 3.7.1 | |
| `seaborn` | 0.12.2 | |
| `multiqc` | 1.14 | |
| `R` | 4.2.3 | |
| `tidyverse` | 2.0.0 | Only required for plotting methylation controls |
| `ggplot2` | 3.4.2 | Only required for plotting methylation controls |
| `patchwork` | 1.1.2 | Only required for plotting methylation controls |
| `viridislite` | 0.4.1 | Only required for plotting methylation controls |
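
If you are not running with `--use-conda`, a quick sanity check for the command-line tools in the table might look like the following sketch (it assumes the executable names match the package names, which holds for the tools listed here):

```bash
# Report any core command-line dependency that cannot be found in PATH
for tool in snakemake biscuit samtools dupsifter parallel bedtools preseq fastqc pigz multiqc; do
    command -v "$tool" > /dev/null || echo "missing from PATH: $tool"
done
```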

Two things of note: 1) when running with `--use-conda`, it is easiest to install `mamba` using `conda`, and 2) it is
preferable to install `snakemake` using `conda`, rather than using a module, as issues can arise with the snakemake
module.
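
If you need to install `snakemake` yourself, a minimal sketch of one way to do it with `conda`/`mamba` (the environment name `snakemake` is an arbitrary choice):

```bash
# Install mamba into the base conda environment for faster dependency solving
conda install -n base -c conda-forge mamba

# Create a dedicated environment containing snakemake
mamba create -n snakemake -c conda-forge -c bioconda 'snakemake>=7.0'

# Activate the environment before running the pipeline
conda activate snakemake
snakemake --version
```
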
The following components are listed roughly in the order they run, although the actual order may differ depending on
the exact dependencies needed.
- [default off] Generate asset files used during QC-related rules
- [default off] Modify and index reference genome to include methylation controls (lambda phage and pUC19)
- [default off] Trim FASTQ files
- [default off] Run Fastq Screen in bisulfite mode
- Run FastQC on raw FASTQ files
- Alignment, duplicate marking, and indexing of input data (biscuitSifter pipeline)
- Samtools flagstat of input data
- Methylation information extraction (BED Format)
- Merge C and G beta values in CpG dinucleotide context
- [default off] SNP and epiBED extraction
- [default off] Run Preseq on aligned BAM
- MultiQC with BISCUIT QC modules specifically for methylation data
- [default off] Generate plots of the observed / expected coverage ratio for different genomic features
- [default off] Generate percentage of covered CpGs and CpG island coverage figures
- [default off] Find coverage uniformity across genome
- [default off] Plot percentage of genome covered
- [default off] QC methylated and unmethylated controls
- [default off] Find binned average methylation
- [default off] Find binned methylation centered on provided regions

Many options can be specified in `config/config.yaml`. Alternatively, the commands in the Snakefile can be modified to
meet different needs.
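
Individual settings can also be overridden on the command line with Snakemake's `--config` flag; the key below is only a hypothetical placeholder for an option defined in `config/config.yaml`:

```bash
# Dry run with one config value overridden at the command line
# ('run_fastq_screen=True' is a placeholder; use a key that actually exists in config/config.yaml)
snakemake -npr --use-conda --cores 1 --config run_fastq_screen=True
```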

## Running the Workflow

For ease of reference, `config/config.yaml` is referred to throughout as the file that defines the configuration for
your pipeline run. That said, you can copy it to another file and use that copy instead, either with
`snakemake --configfile /my/new/config.yaml` or by changing the `CONFIG_FILE` variable in the SLURM submit script.
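
As a concrete sketch of that approach (the copied file name is arbitrary):

```bash
# Keep the shipped config untouched and work from a copy
cp config/config.yaml config/my_run.yaml

# After editing config/my_run.yaml, point snakemake at it for a dry run
snakemake --configfile config/my_run.yaml --use-conda --cores 1 -npr
```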

- [Clone the repo](https://github.com/huishenlab/Biscuit_Snakemake_Workflow/tree/master)
    - SSH: `git clone git@github.com:huishenlab/Biscuit_Snakemake_Workflow.git`
    - HTTPS: `git clone https://github.com/huishenlab/Biscuit_Snakemake_Workflow.git`
- Place *gzipped* FASTQ files into `raw_data/`. Alternatively, you can specify the location of your *gzipped* FASTQ
  files in `config/config.yaml`.
- Replace the example `config/samples.tsv` with your own sample sheet containing:
    - One row for each sample
    - Three columns for each row (separated by a tab):
        - `sample`: name of the sample, used throughout processing
        - `fq1`: name of the R1 file for `sample` in your raw data directory (multiple FASTQs can be given as a
          comma-separated list)
        - `fq2`: name of the R2 file for `sample` in your raw data directory (multiple FASTQs can be given as a
          comma-separated list)
        - Any other columns are ignored
    - Note, you can either edit `config/samples.tsv` in place or specify the path to your sample sheet in
      `config/config.yaml`. If you create your own sample sheet, make sure to include the header line as seen in the
      example file (a minimal example is sketched after this list).
- Modify `config/config.yaml` to specify the appropriate:
    - Reference genome
    - BISCUIT index
    - BISCUIT QC assets (see [Quality Control]({{ site.baseurl }}{% link docs/alignment/QC.md %}) for details)
    - Environmental module locations, if you use modules on your system. By default, the pipeline uses `conda`/`mamba`
      to download the required packages. Note, if a listed module is not available, snakemake gives a warning but will
      run successfully *as long as the required executables are in PATH*.
    - Optional workflow components to toggle on (change from False to True)
    - Other run parameters
- Modify the SLURM submit script as needed (e.g., set `CONFIG_FILE` to your new config file).
- Submit the workflow to an HPC using something similar to `bin/run_snakemake_workflow.slurm` (e.g.,
  `sbatch bin/run_snakemake_workflow.slurm`). This script works for a SLURM queue system; a PBS/Torque version is
  available in a previous release on GitHub for those who need it.
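
Putting the steps above together, a minimal end-to-end sketch; the sample name and FASTQ file names are hypothetical placeholders, and the sample sheet header assumes the `sample`/`fq1`/`fq2` columns described above:

```bash
# Write a minimal tab-separated sample sheet (placeholder sample and file names)
printf 'sample\tfq1\tfq2\n'                                   >  config/samples.tsv
printf 'sampleA\tsampleA_R1.fastq.gz\tsampleA_R2.fastq.gz\n'  >> config/samples.tsv

# Dry run to confirm the workflow builds as expected
snakemake -npr --use-conda --cores 1

# Submit to a SLURM cluster
sbatch bin/run_snakemake_workflow.slurm
```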

## After Workflow Completion

To run the test dataset, copy the ten `.fq.gz` files in `bin/working_example_dataset` into `raw_data/` and use the
example `bin/samples.tsv` file. This set of files should be mapped to the human genome.
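
A sketch of staging the test dataset, assuming the default `raw_data/` input directory and the example sample sheet shipped in `bin/`:

```bash
# Copy the example FASTQ files into the default input directory
cp bin/working_example_dataset/*.fq.gz raw_data/

# Use the example sample sheet that matches these files
cp bin/samples.tsv config/samples.tsv
```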

## Useful Commands
For more information on Snakemake: https://snakemake.readthedocs.io/en/stable/

- Perform a dry run of the commands that will be run by snakemake: `snakemake -npr`
- Unlock the pipeline after a manually aborted run: `snakemake --unlock --cores 1`
- Create a workflow diagram of your run: `snakemake --dag | dot -Tpng > my_dag.png`
- Snakemake can also be run on the command line: `snakemake --use-conda --cores 1`
