A input JSON to run the full pipeline, including building gemBS
indexes, will look like this. In this example we have two biological replicates, the first with two libraries and the second with one library, with each library sequenced paired-ended:
{
"wgbs.fastqs": [
[
[
"rep1_lib1_R1.fastq.gz",
"rep1_lib1_R2.fastq.gz"
],
[
"rep1_lib2_R1.fastq.gz",
"rep1_lib2_R2.fastq.gz"
]
],
[
[
"rep2_lib1_R1.fastq.gz",
"rep2_lib1_R2.fastq.gz"
]
]
],
"wgbs.reference": "hg38.fasta.gz",
"wgbs.extra_reference": "lambda.fasta.gz",
"wgbs.underconversion_sequence_name": "chrL"
}
To process RRBS data, add "wgbs.pipeline_type": "rrbs"
to your input JSON.
If you plan on running the pipeline multiple times it is recommended to create the gemBS
index file once, then use that as input to the runs instead of the raw sequence fasta
files. To build the references, your input should look like the following:
{
"wgbs.extra_reference": "lambda.fasta.gz",
"wgbs.index_only": true,
"wgbs.reference": "hg38.fasta.gz",
"wgbs.underconversion_sequence_name": "chrL"
}
This will generate a combined contig sizes for the reference and extra reference as well as the gemBS
index file. Then, to use the built references, specfiy indexed_contig_sizes
and indexed_reference
in the input JSON, like so:
{
"wgbs.fastqs": [
[
[
"rep1_lib1_R1.fastq.gz",
"rep1_lib1_R2.fastq.gz"
]
]
],
"wgbs.indexed_contig_sizes": "my.contig.sizes",
"wgbs.indexed_reference": "indexes.tar.gz",
"wgbs.reference": "hg38.fasta.gz",
"wgbs.extra_reference": "lambda.fasta.gz",
"wgbs.underconversion_sequence_name": "chrL"
}
wgbs.fastqs
is a 3-D array of fastq
file URIs (e.g. local file paths, Google Cloud Storage URIs, HTTP links) indexed by [replicate][library][read pair]
. Replicates are processed independently. Libraries are merged together within each replicate to produce a single bam file and set of downstream methylation files for each replicate. Note that mixed paired and single ended libraries are not allowed.
wgbs.reference
is a fasta
file corresponding to the genome of interest, e.g. https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/ for GRCh38 (human).
wgbs.extra_reference
is a fasta
file containing extra control sequences used for QC purposes. For example, lambda DNA, which is unmethylated by nature, is commonly added to WGBS libraries as a control to verify that unmethylated cytosines are converted to uracil at a satisfactory rate (> 99% per ENCODE standards). As such, for a WGBS experiment using lambda control, a fasta
containing the sequence of lambda genome would be used, e.g. https://www.encodeproject.org/files/lambda.fa/
wgbs.underconversion_sequence_name
is a string corresponding to the name of the sequence in the wgbs.extra_reference
fasta
that contains the control sequence to be used for calculating the bisulfite conversion rate. For lambda, this would typically be chrL
.
wgbs.indexed_reference
is a .tar.gz
archive containing the following files produced from gemBS index
during this pipeline's index generation step. Here ${prefix}
is the basename of the fasta file used to produce it, without the extension:
${prefix}.gemBS.contig_md5
${prefix}.gemBS.ref.gzi
${prefix}.BS.gem
${prefix}.gemBS.ref
${prefix}.contig.sizes
${prefix}.gemBS.ref.fai
wgbs.indexed_contig_sizes
is a chromosome (chrom) sizes-style file, which is simply a two-column tsv
where the rows contain the chromosome name and size, in that order. This file is produced by this pipeline during index generation from the input reference fasta
files.
wgbs.map_sort_threads
is the number of threads to use for samtools sort
when running gemBS
map
command
wgbs.map_sort_memory
is the max memory to use per thread for samtools sort
when running gemBS
map
command. It is a number followed by a K, M, or G suffix to indicate the units, e.g. 1G
for 1 gigabyte or 800M
for 800 megabytes.
Every task also has customizable resources (number of cpus, RAM, and disk size). There
are too many to list individually here, but they all take the same form in the input
JSON file: "wgbs.[TASK_NAME]_[RESOURCE]": [INTEGER]
, where RESOURCE
is one of
num_cpus
, ram_gb
, and disk_size_gb
. As an example, if you need to increase RAM to
80 GB for the map
task then in the input JSON you would put the following line:
"wgbs.map_ram_gb": 80
The pipeline produces more outputs than listed here, only the most useful ones are listed here.
bam
: aBAM
file containing mapping results for the replicate, with all libraries for that replicate merged together.csi
: aBAM
index incsi
format for the aboveBAM
, useful for visualization and subsetting, among other purposes. See specifications for details.
average_coverage_qc
: aJSON
file containing the average coverage of its input BAM file andsamtools stats
quality metrics. Average coverage is defined as(aligned read counts * read length ) / total genome size
, where read length and aligned read counts are taken from thesamtools stats
output andtotal genome size
is the sum of chromosome sizes in theindex_contig_sizes
file.
Unless otherwise specified, bigBed
files are in ENCODE bedMethyl
(bed9+2
) format, see ENCODE standards for details. One of each of these files is produced per replicate.
plus_strand_bw
: abigWig
file containing the methylation percentage signal on the plus strandminus_strand_bw
: abigWig
file containing the methylation percentage signal on the minus strandchg_bb
: abigBed
file containing the methylation at reference CHG siteschh_bb
: abigBed
file containing the methylation at reference CHH sitescpg_bb
: abigBed
file containing the methylation at reference CpG siteschg_bed
: abed
file containing the methylation at reference CHG siteschh_bed
: abed
file containing the methylation at reference CHH sitescpg_bed
: abed
file containing the methylation at reference CpG sitescpg_txt
: agemBS
-stylebed
file containing the methylation at genotyped CpG sites (i.e. sites that could confidently be called bygemBS
as CpGs). For details on this format, see the gemBS documentation
coverage_bigwig
: abigWig
file containing the coverage genome-wide, one file per replicate.
bed_pearson_correlation_qc
: aJSON
file containing the Pearson correlation between replicates of methylation percentage at reference CpG sites with 10 or more reads. Note that this file is only produced when there are exactly two replicates.
portal_map_qc_json
: aJSON
file containing mapping quality metrics for the replicate. These values are scraped out of thegemBS
HTML QC reports. See the ENCODE schema for details on which metrics are provided. Further descriptions of the metrics can be found in the gemBS documentationmap_qc_insert_size_plot_png
: aPNG
file containing a plot of insert size distribution across readsmap_qc_mapq_plot_png
: aPNG
file containing a plot of mapping quality (MAPQ) distribution across reads