
## Dependencies

- Required:
- `snakemake` (version 6.0+)
- `biscuit` (version 1.2.0+)
- `samtools` (version 1.12+)
- `htslib` (version 1.12+)
- `samblaster` (version 0.1.26+)
- bedtools
- pigz
- GNU parallel
- FastQC
- MultiQC
- Python 3.7+ with pandas, numpy, matplotlib, and seaborn
- Optional:
- TrimGalore! (required for trimming adaptors)
- R with tidyverse, ggplot, patchwork, and viridis (required for plotting methylation controls)
- Bismark (required when running fastq_screen)
- fastq_screen (required when running fastq_screen)
- preseq (required for finding library complexity, version 3.1.2+ must be compiled with htslib enabled)
The following dependencies are downloaded when running with `--use-conda`; otherwise, you must have them available in your PATH (a quick check is sketched after the table).
| Package | Conda Version Downloaded | Notes |
|:---------------|:------------------------:|:------|
| `snakemake` | 7.0+ | Needed before running pipeline |
| `biscuit` | 1.2.0 | |
| `htslib` | 1.17 | |
| `samtools` | 1.17 | |
| `dupsifter` | 1.2.0 | |
| `parallel` | 20230322 | |
| `bedtools` | 2.30.0 | |
| `preseq` | 3.2.0 | Must be compiled with htslib enabled |
| `fastqc` | 0.12.1 | |
| `trim_galore` | 0.6.10 | |
| `fastq_screen` | 0.15.3 | Only required if running `fastq_screen` |
| `bismark` | 0.24.0 | Only required if running `fastq_screen` |
| `pigz` | 2.6 | |
| `python` | 3.11.3 | |
| `pandas` | 2.0.0 | |
| `numpy` | 1.24.2 | |
| `matplotlib` | 3.7.1 | |
| `seaborn` | 0.12.2 | |
| `multiqc` | 1.14 | |
| `R` | 4.2.3 | |
| `tidyverse` | 2.0.0 | Only required for plotting methylation controls |
| `ggplot2` | 3.4.2 | Only required for plotting methylation controls |
| `patchwork` | 1.1.2 | Only required for plotting methylation controls |
| `viridislite` | 0.4.1 | Only required for plotting methylation controls |
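
If you are not running with `--use-conda`, a quick sanity check for the command-line tools in the table might look like the following sketch (it assumes the executable names match the package names, which holds for the tools listed here):

```bash
# Report any core command-line dependency that cannot be found in PATH
for tool in snakemake biscuit samtools dupsifter parallel bedtools preseq fastqc pigz multiqc; do
    command -v "$tool" > /dev/null || echo "missing from PATH: $tool"
done
```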

Two things of note: 1) when running with `--use-conda`, it is easiest to install `mamba` using `conda`, and 2) it is
preferable to install `snakemake` using `conda`, rather than using a module, as issues can arise with the snakemake
module.
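
If you need to install `snakemake` yourself, a minimal sketch of one way to do it with `conda`/`mamba` (the environment name `snakemake` is an arbitrary choice):

```bash
# Install mamba into the base conda environment for faster dependency solving
conda install -n base -c conda-forge mamba

# Create a dedicated environment containing snakemake
mamba create -n snakemake -c conda-forge -c bioconda 'snakemake>=7.0'

# Activate the environment before running the pipeline
conda activate snakemake
snakemake --version
```
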
The following components are listed roughly in the order they run, although the actual order may differ depending on
the exact dependencies needed.
- [default off] Generate asset files used during QC-related rules
- [default off] Modify and index reference genome to include methylation controls (lambda phage and pUC19)
- [default off] Trim FASTQ files
- [default off] Run Fastq Screen in bisulfite mode
- Run FastQC on raw FASTQ files
- Alignment, duplicate marking, and indexing of input data (biscuitSifter pipeline)
- Samtools flagstat of input data
- Methylation information extraction (BED Format)
- Merge C and G beta values in CpG dinucleotide context
- [default off] SNP and epiBED extraction
- [default off] Run Preseq on aligned BAM
- MultiQC with BISCUIT QC modules specifically for methylation data
- [default off] Generate plots of the observed / expected coverage ratio for different genomic features
- [default off] Generate percentage of covered CpGs and CpG island coverage figures
- [default off] Find coverage uniformity across genome
- [default off] Plot percentage of genome covered
- [default off] QC methylated and unmethylated controls
- [default off] Find binned average methylation
- [default off] Find binned methylation centered on provided regions

Many options can be specified in `config/config.yaml`. Alternatively, the commands in the Snakefile can be modified to
meet different needs.
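
Individual settings can also be overridden on the command line with Snakemake's `--config` flag; the key below is only a hypothetical placeholder for an option defined in `config/config.yaml`:

```bash
# Dry run with one config value overridden at the command line
# ('run_fastq_screen=True' is a placeholder; use a key that actually exists in config/config.yaml)
snakemake -npr --use-conda --cores 1 --config run_fastq_screen=True
```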

## Running the Workflow

For ease of reference, `config/config.yaml` is referred to throughout as the file that defines the configuration for
your pipeline run. That said, you can copy it to another file and use that copy instead, either with
`snakemake --configfile /my/new/config.yaml` or by changing the `CONFIG_FILE` variable in the SLURM submit script.
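
As a concrete sketch of that approach (the copied file name is arbitrary):

```bash
# Keep the shipped config untouched and work from a copy
cp config/config.yaml config/my_run.yaml

# After editing config/my_run.yaml, point snakemake at it for a dry run
snakemake --configfile config/my_run.yaml --use-conda --cores 1 -npr
```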

- [Clone the repo](https://github.com/huishenlab/Biscuit_Snakemake_Workflow/tree/master)
    - SSH: `git clone git@github.com:huishenlab/Biscuit_Snakemake_Workflow.git`
    - HTTPS: `git clone https://github.com/huishenlab/Biscuit_Snakemake_Workflow.git`
- Place *gzipped* FASTQ files into `raw_data/`. Alternatively, you can specify the location of your *gzipped* FASTQ
  files in `config/config.yaml`.
- Replace the example `config/samples.tsv` with your own sample sheet containing:
    - One row for each sample
    - Three columns for each row (separated by a tab):
        - `sample`: name of the sample, used throughout processing
        - `fq1`: name of the R1 file for `sample` in your raw data directory (multiple FASTQs can be given as a
          comma-separated list)
        - `fq2`: name of the R2 file for `sample` in your raw data directory (multiple FASTQs can be given as a
          comma-separated list)
        - Any other columns are ignored
    - Note, you can either edit `config/samples.tsv` in place or specify the path to your sample sheet in
      `config/config.yaml`. If you create your own sample sheet, make sure to include the header line as seen in the
      example file (a minimal example is sketched after this list).
- Modify `config/config.yaml` to specify the appropriate:
    - Reference genome
    - BISCUIT index
    - BISCUIT QC assets (see [Quality Control]({{ site.baseurl }}{% link docs/alignment/QC.md %}) for details)
    - Environmental module locations, if you use modules on your system. By default, the pipeline uses `conda`/`mamba`
      to download the required packages. Note, if a listed module is not available, snakemake gives a warning but will
      run successfully *as long as the required executables are in PATH*.
    - Optional workflow components to toggle on (change from False to True)
    - Other run parameters
- Modify the SLURM submit script as needed (e.g., set `CONFIG_FILE` to your new config file).
- Submit the workflow to an HPC using something similar to `bin/run_snakemake_workflow.slurm` (e.g.,
  `sbatch bin/run_snakemake_workflow.slurm`). This script works for a SLURM queue system; a PBS/Torque version is
  available in a previous release on GitHub for those who need it.
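
Putting the steps above together, a minimal end-to-end sketch; the sample name and FASTQ file names are hypothetical placeholders, and the sample sheet header assumes the `sample`/`fq1`/`fq2` columns described above:

```bash
# Write a minimal tab-separated sample sheet (placeholder sample and file names)
printf 'sample\tfq1\tfq2\n'                                   >  config/samples.tsv
printf 'sampleA\tsampleA_R1.fastq.gz\tsampleA_R2.fastq.gz\n'  >> config/samples.tsv

# Dry run to confirm the workflow builds as expected
snakemake -npr --use-conda --cores 1

# Submit to a SLURM cluster
sbatch bin/run_snakemake_workflow.slurm
```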

## After Workflow Completion

To run the test dataset, copy the ten `.fq.gz` files in `bin/working_example_dataset` into `raw_data/` and use the
example `bin/samples.tsv` file. This set of files should be mapped to the human genome.
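
A sketch of staging the test dataset, assuming the default `raw_data/` input directory and the example sample sheet shipped in `bin/`:

```bash
# Copy the example FASTQ files into the default input directory
cp bin/working_example_dataset/*.fq.gz raw_data/

# Use the example sample sheet that matches these files
cp bin/samples.tsv config/samples.tsv
```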

## Useful Commands
For more information on Snakemake: https://snakemake.readthedocs.io/en/stable/

- Perform a dry run of the commands that will be run by snakemake: `snakemake -npr`
- Unlock the pipeline after a manually aborted run: `snakemake --unlock --cores 1`
- Create a workflow diagram of your run: `snakemake --dag | dot -Tpng > my_dag.png`
- Snakemake can also be run on the command line: `snakemake --use-conda --cores 1`
