Add new ChIP-Seq WF that handles replicates and controls

wm75 · Oct 24, 2024 · 598e5bb
1 parent 0a04aff
commit 598e5bb
Showing 7 changed files with 3,858 additions and 0 deletions.
diff --git a/workflows/epigenetics/chipseq-pe-with-replicates-controls/.dockstore.yml b/workflows/epigenetics/chipseq-pe-with-replicates-controls/.dockstore.yml
@@ -0,0 +1,11 @@
+version: 1.2
+workflows:
+- name: main
+  subclass: Galaxy
+  publish: true
+  primaryDescriptorPath: /chipseq-pe-with-replicates-controls.ga
+  testParameterFiles:
+  - /chipseq-pe-with-replicates-controls-tests.yml
+  authors:
+  - name: Wolfgang Maier
+    orcid: 0000-0002-9464-6640
diff --git a/workflows/epigenetics/chipseq-pe-with-replicates-controls/.workflowhub.yml b/workflows/epigenetics/chipseq-pe-with-replicates-controls/.workflowhub.yml
@@ -0,0 +1,5 @@
+version: '0.1'
+registries:
+- url: https://workflowhub.eu
+  project: iwc
+  workflow: chipseq-pe-with-replicates-controls/main
diff --git a/workflows/epigenetics/chipseq-pe-with-replicates-controls/CHANGELOG.md b/workflows/epigenetics/chipseq-pe-with-replicates-controls/CHANGELOG.md
@@ -0,0 +1,5 @@
+# Changelog
+
+## [0.1] 2024-10-22
+
+Initial release
diff --git a/workflows/epigenetics/chipseq-pe-with-replicates-controls/README.md b/workflows/epigenetics/chipseq-pe-with-replicates-controls/README.md
@@ -0,0 +1,55 @@
+# Quality control, mapping and peak identification for ChIP-Seq replicates with controls
+
+This workflow analyzes batches of ChIP-Seq samples with controls and replicates, from paired-end sequenced reads to called peaks.
+
+It uses:
+- fastp for pre-processing of sequenced reads
+- bowtie2 for mapping
+- MACS2 for peak calling
+- deeptools for cross-sample correlation and averaging
+
+The workflow provides quality control at the level of sequenced reads, mapping results and called peaks, and visualizes the correlation between samples.
+
+## Input datasets
+
+- Sequencing data: this must be provided as a single list collection of paired fastq datasets of all samples.
+- Sample sheet: this is expected to be a 4-column tabular dataset that describes samples, their association with each other and with conditions and replicates.
+
+  The first column of the file must list all samples with their names matching the element names in the Sequencing data collection. Samples can be listed in any order.
+  The second column specifies the experimental condition that each sample represents. There is no formal restriction on this column, but values should be kept short for readable reports.
+  The third column specifies the replicate that each sample belongs to. There is no formal restriction on replicate identifiers, but they should be kept as short as possible. At least two replicates are required per condition, but different conditions can have different numbers of replicates.
+  The fourth column must provide the name of the sample that serves as the control for the sample described on each line. Different samples can be associated with the same control sample.
+  Control samples must also be listed on their own lines, just like regular samples, but must use `.` or `-` as the value of the fourth column. The value of the third column (replicate ID) is ignored for control sample lines, so it may also be set to `.` or `-`.
+
+  Here's an example sample sheet:
+
+  SRR5680995    input       -       -
+  SRR5680996    H3K4me3     rep1    SRR5680995
+  SRR5680997    H3K27me3    rep1    SRR5680995
+  SRR5681007    H3K27me3    rep2    SRR5681005
+  SRR5681006    H3K4me3     rep2    SRR5681005
+  SRR5680998    CTCF        rep1    SRR5680995
+  SRR5681008    CTCF        rep2    SRR5681005
+  SRR5681005    input       -       -
+
+  This declares an experimental design with three conditions - H3K4me3, H3K27me3 and CTCF - with two replicates per condition and one input control per replicate. The control sample SRR5680995 is declared as the shared control for all samples from replicate rep1, SRR5681005 as the control for all samples from replicate rep2.
+
+## Input parameters
+
+- Reference genome: set this to the reference genome of your organism of interest; used at the read mapping step
+- Sequencing adapter - forward (optional)
+- Sequencing adapter - reverse (optional)
+- Effective genome size: this is used by MACS2 and may be entered manually (indications are provided for heavily used genomes).
+- Average size of sequenced fragments: used for deeptools-based QC
+
+## Outputs
+
+- MultiQC analysis reports
+- Sample fingerprints
+- Between-samples correlation plot
+- Clustered heatmap of peaks across samples
+- Peak regions called by MACS2
+- Positions of summits of MACS2-called peaks
+- Peaks per replicate
+- Peaks averaged across replicates
+
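The sample sheet rules described in the README above can be checked mechanically before a run. The following is a minimal Python sketch and is not part of this commit: the function name `validate_sample_sheet`, the `CONTROL_MARKERS` constant and the error messages are hypothetical. It assumes that "at least two replicates per condition" means at least two distinct replicate identifiers among the non-control lines of a condition.

```python
from collections import defaultdict

# Values that mark a line as a control sample in column 4 (per the README).
CONTROL_MARKERS = {".", "-"}


def validate_sample_sheet(text, collection_names=None):
    """Return a list of problems found in a 4-column tab-separated sample sheet.

    collection_names, if given, is the set of element names in the
    sequencing data collection; every listed sample must appear in it.
    """
    problems = []
    samples = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) != 4:
            problems.append(f"line {lineno}: expected 4 columns, found {len(fields)}")
            continue
        name, condition, replicate, control = fields
        if name in samples:
            problems.append(f"line {lineno}: duplicate sample {name}")
        samples[name] = (condition, replicate, control)

    # Control samples are the lines whose fourth column is "." or "-".
    controls = {n for n, (_, _, c) in samples.items() if c in CONTROL_MARKERS}

    replicates = defaultdict(set)
    for name, (condition, replicate, control) in samples.items():
        if name in controls:
            continue  # the replicate column is ignored for control lines
        replicates[condition].add(replicate)
        if control not in controls:
            problems.append(f"{name}: control {control} is not listed as a control sample")

    for condition, reps in replicates.items():
        if len(reps) < 2:
            problems.append(f"{condition}: needs at least two replicates, found {len(reps)}")

    if collection_names is not None:
        for name in samples:
            if name not in collection_names:
                problems.append(f"{name}: no matching element in the sequencing data collection")
    return problems


# The example sheet from the README, rebuilt with explicit tab separators.
EXAMPLE = "\n".join("\t".join(row) for row in [
    ("SRR5680995", "input", "-", "-"),
    ("SRR5680996", "H3K4me3", "rep1", "SRR5680995"),
    ("SRR5680997", "H3K27me3", "rep1", "SRR5680995"),
    ("SRR5681007", "H3K27me3", "rep2", "SRR5681005"),
    ("SRR5681006", "H3K4me3", "rep2", "SRR5681005"),
    ("SRR5680998", "CTCF", "rep1", "SRR5680995"),
    ("SRR5681008", "CTCF", "rep2", "SRR5681005"),
    ("SRR5681005", "input", "-", "-"),
])

if __name__ == "__main__":
    print(validate_sample_sheet(EXAMPLE) or "sample sheet looks consistent")
```

Because samples are collected first and checked afterwards, a control may be listed after the samples that refer to it, which matches the README's statement that samples can be listed in any order.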
diff --git a/...enetics/chipseq-pe-with-replicates-controls/chipseq-pe-with-replicates-controls-tests.yml b/...enetics/chipseq-pe-with-replicates-controls/chipseq-pe-with-replicates-controls-tests.yml
@@ -0,0 +1,86 @@
+- doc: Test outline for ChIPseq-PE-with-replicates-controls workflow
+  job:
+    Sample sheet:
+      class: File
+      path: test-data/test_sample_sheet.tsv
+      filetype: tabular
+    Sequencing data:
+      class: Collection
+      collection_type: list:paired
+      elements:
+      - class: Collection
+        type: paired
+        identifier: SRR5204807
+        elements:
+        - identifier: forward
+          class: File
+          location: https://github.com/nf-core/test-datasets/raw/refs/heads/chipseq/testdata/SRR5204807_Spt5-ChIP_IP1_SacCer_ChIP-Seq_ss100k_R1.fastq.gz
+          filetype: fastqsanger.gz
+        - identifier: reverse
+          class: File
+          location: https://github.com/nf-core/test-datasets/raw/refs/heads/chipseq/testdata/SRR5204807_Spt5-ChIP_IP1_SacCer_ChIP-Seq_ss100k_R2.fastq.gz
+          filetype: fastqsanger.gz
+      - class: Collection
+        type: paired
+        identifier: SRR5204808
+        elements:
+        - identifier: forward
+          class: File
+          location: https://github.com/nf-core/test-datasets/raw/refs/heads/chipseq/testdata/SRR5204808_Spt5-ChIP_IP2_SacCer_ChIP-Seq_ss100k_R1.fastq.gz
+          filetype: fastqsanger.gz
+        - identifier: reverse
+          class: File
+          location: https://github.com/nf-core/test-datasets/raw/refs/heads/chipseq/testdata/SRR5204808_Spt5-ChIP_IP2_SacCer_ChIP-Seq_ss100k_R2.fastq.gz
+          filetype: fastqsanger.gz
+      - class: Collection
+        type: paired
+        identifier: SRR5204809
+        elements:
+        - identifier: forward
+          class: File
+          location: https://github.com/nf-core/test-datasets/raw/refs/heads/chipseq/testdata/SRR5204809_Spt5-ChIP_Input1_SacCer_ChIP-Seq_ss100k_R1.fastq.gz
+          filetype: fastqsanger.gz
+        - identifier: reverse
+          class: File
+          location: https://github.com/nf-core/test-datasets/raw/refs/heads/chipseq/testdata/SRR5204809_Spt5-ChIP_Input1_SacCer_ChIP-Seq_ss100k_R2.fastq.gz
+          filetype: fastqsanger.gz
+      - class: Collection
+        type: paired
+        identifier: SRR5204810
+        elements:
+        - identifier: forward
+          class: File
+          location: https://github.com/nf-core/test-datasets/raw/refs/heads/chipseq/testdata/SRR5204810_Spt5-ChIP_Input2_SacCer_ChIP-Seq_ss100k_R1.fastq.gz
+          filetype: fastqsanger.gz
+        - identifier: reverse
+          class: File
+          location: https://github.com/nf-core/test-datasets/raw/refs/heads/chipseq/testdata/SRR5204810_Spt5-ChIP_Input2_SacCer_ChIP-Seq_ss100k_R2.fastq.gz
+          filetype: fastqsanger.gz
+    Reference genome: "sacCer3"
+    Effective genome size: 12000000
+    Average size of sequenced fragments: 200
+  outputs:
+    multiqc_stats:
+      asserts:
+        has_n_lines:
+          n: 5
+        has_text_matching:
+          expression: "SRR5204807_Spt5_rep1\t163.0\t0.0\t0.0\t844.+"
+    macs2_report:
+      element_tests:
+        SRR5204807_Spt5_rep1:
+          asserts:
+            - that: "has_text"
+              text: "# name = SRR5204807_Spt5_rep1"
+            - that: "has_text"
+              text: "# fragment size is determined as 163 bps"
+            - that: "has_text"
+              text: "# fragments after filtering in treatment: 86394"
+    mapping_stats:
+      element_tests:
+        SRR5204807_Spt5_rep1:
+          asserts:
+            - that: "has_text"
+              text: "3067 (3.14%) aligned concordantly 0 times"
+            - that: "has_text"
+              text: "80795 (82.60%) aligned concordantly exactly 1 time"
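The `has_text_matching` assertion on `multiqc_stats` takes a regular expression, so the dots in `163.0` and `0.0` match any character, not only a literal dot. The sketch below illustrates this with Python's `re.search`, on the assumption that Galaxy evaluates the expression as a Python regular expression search. The helper `line_matches` and all sample rows are hypothetical and are not taken from real workflow output.

```python
import re

# Pattern copied from the test file; inside the YAML double-quoted string,
# "\t" is a tab character, exactly as in this Python string literal.
EXPRESSION = "SRR5204807_Spt5_rep1\t163.0\t0.0\t0.0\t844.+"


def line_matches(expression, text):
    """Return True if the expression matches anywhere in text."""
    return re.search(expression, text) is not None


# Hypothetical rows, for illustration only.
matching_row = "SRR5204807_Spt5_rep1\t163.0\t0.0\t0.0\t844.0\t12.5"
other_sample = "SRR5204808_Spt5_rep2\t163.0\t0.0\t0.0\t844.0"
loose_match = "SRR5204807_Spt5_rep1\t163x0\t0.0\t0.0\t8449"

if __name__ == "__main__":
    for row in (matching_row, other_sample, loose_match):
        print(repr(row), line_matches(EXPRESSION, row))
```

The third row matches even though `163x0` is not a number, because an unescaped `.` is a wildcard. If a stricter check is wanted, the dots can be escaped in the expression.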