Feature Aggregate Depth Utility
Most currently available quantification tools for transcriptomic analyses were designed using human data sets, in which full-length transcript sequences are available. In most prokaryotic systems, full-length transcript sequences are unavailable, so prokaryotic transcriptomic analyses are instead performed using coding sequences. In contrast to eukaryotes, prokaryotes contain polycistronic transcripts, and when genes are quantified based on coding sequences instead of transcript sequences, the result is an increased abundance of ambiguous multi-gene fragments, especially between genes in operons. Here we describe FADU, a quantification tool designed specifically for prokaryotic RNA-Seq analyses, with the purpose of quantifying operonic genes while minimizing the pitfalls associated with improperly assigning fragment counts from ambiguous transcripts.
- Shaun Adkins ([email protected])
- Matthew Chung ([email protected])
v1.9
Julia - v1.7 or later
NOTE: Package installation instructions can be found at https://docs.julialang.org/en/v1.4/stdlib/Pkg/index.html#Getting-Started-1
- ArgParse
- Logging
- BGZFStreams.jl - v0.3.2
- GenomicFeatures.jl - v2.1.0
- GFF3 - v0.2.3
- Indexes.jl - v0.1.3
- StructArrays.jl - v0.6.17
- XAM.jl - v0.3.1
- BED.jl - v0.3.0
FADU v1.7 and earlier will not work with Julia v1.4.2 and have different package requirements. If using one of those versions, please use the recommended version of Julia and the recommended packages listed in the README of that particular release.
FADU is supported for both the Linux and Mac OSX operating systems, and has been tested on the following operating system versions:
- Linux - CentOS 7
- Mac OSX - High Sierra (v10.13.6)
- MacOS (M1 architecture) - Monterey (v12.4; Julia v1.7.3 was used for this test)
- A single BAM file of reads (sorted by position)
- A GFF3-formatted annotation file
The output is a tab-delimited file whose name is the BAM input file's basename, ending in '.counts.txt'. For example, if the name of the BAM file is 'sample123.bam', the output file name is 'sample123.counts.txt'.
The output file has the following fields:
- Name of the feature (featureID)
- Number of bases that do not overlap with other features (uniq_len)
- Number of BAM records that aligned to this feature's non-overlapping bases (num_alignments)
  - Aligned reads that are not part of a fragment count as 0.5; fragments count as 1
- Fractionalized feature counts (counts)
  - For each aligned fragment, the depth is (the overlap between the fragment's coordinates and the feature's coordinates that do not overlap with other features) divided by the length of the fragment
  - Aligned reads that are not part of a fragment have their depth contribution halved
- TPM count in scientific notation (tpm)
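The counting scheme above can be sketched as follows. This is a minimal Python illustration, not FADU's actual Julia code: the function names are ours, and the TPM formula shown is the standard length-normalized one using uniq_len as the feature length, which is an assumption about FADU's exact normalization.

```python
def fragment_contribution(frag_start, frag_end, uniq_intervals, is_singleton=False):
    """Fractional depth for one alignment: overlap with the feature's
    non-overlapping (unique) bases divided by the fragment length.
    Singleton reads contribute half as much as a proper pair."""
    frag_len = frag_end - frag_start
    overlap = sum(
        max(0, min(frag_end, u_end) - max(frag_start, u_start))
        for u_start, u_end in uniq_intervals
    )
    depth = overlap / frag_len
    return depth / 2 if is_singleton else depth

def tpm(counts, uniq_lens):
    """Standard TPM: length-normalized rates scaled to sum to one million."""
    rates = [c / l for c, l in zip(counts, uniq_lens)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

# A 200 bp fragment overlapping 150 of a feature's unique bases:
print(fragment_contribution(100, 300, [(150, 400)]))                     # 0.75
# The same alignment as a singleton read contributes half:
print(fragment_contribution(100, 300, [(150, 400)], is_singleton=True))  # 0.375
```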
First, cd to the directory of your choice to install FADU:
git clone https://github.com/IGS/FADU.git
cd FADU
For this example we will create a directory called "fadu_output" to write output into. This example uses a BAM alignment file of subsampled Wolbachia reads. This example should take about 60 seconds to run.
mkdir ./fadu_output
julia ./fadu.jl -M -g "./test_data/GCF_000008385.1_ASM838v1_genomic.gff" -b "./test_data/SRR5192555.100000x.sortedbyposition.bam" -o "./fadu_output" -s "reverse" -f "CDS" -a "ID"
The resulting output file should be called "SRR5192555.100000x.sortedbyposition.counts.txt" and located in the "fadu_output" directory.
Here are the counts for the first 10 features:
featureID uniq_len num_alignments counts tpm
cds0 1017 4.5 2.81 1755.54
cds1 1194 3.5 2.48 1318.88
cds10 1161 0.0 0.00 0.00
cds100 591 0.0 0.00 0.00
cds1000 741 8.0 7.46 6399.49
cds1001 850 5.0 4.45 3331.56
cds1002 829 1.0 0.48 366.19
cds1003 1167 0.0 0.00 0.00
cds1004 1164 0.0 0.00 0.00
cds1005 816 0.0 0.00 0.00
usage: fadu.jl -b /path/to/file.bam -g /path/to/annotation.gff3
-o /path/to/output/dir/ [--no_output_header]
[-x /path/to/regions.bed] [-s STRANDED]
[-f FEATURE_TYPE] [-a ATTRIBUTE_TYPE] [-p]
[-m MAX_FRAGMENT_SIZE] [-M] [-e EM_ITER] [--version]
[-h]
Generate counts of reads that map to non-overlapping portions of genes
optional arguments:
-b, --bam_file /path/to/file.bam
Path to BAM file (SAM is not supported).
-g, --gff3_file /path/to/annotation.gff3
Path to GFF3-formatted annotation file (GTF is
not supported).
-o, --output_dir /path/to/output/dir/
Directory to write the output.
--no_output_header If enabled, do not write the header line to
the output file.
-x, --exclude_regions_file /path/to/regions.bed
Path to BED file containing regions to exclude
from analysis. Any alignments that overlap
these regions will be ignored.
-s, --stranded STRANDED
Indicate if BAM reads are from a
strand-specific assay. Choose between 'yes',
'no', or 'reverse'. (default: "no")
-f, --feature_type FEATURE_TYPE
Which GFF3/GTF feature type (column 3) to
obtain readcount statistics for.
Case-sensitive. (default: "gene")
-a, --attribute_type ATTRIBUTE_TYPE
Which GFF3/GTF attribute type (column 9) to
obtain readcount statistics for.
Case-sensitive. (default: "ID")
-p, --keep_only_proper_pairs
If enabled, keep only properly paired reads
when performing calculations.
-m, --max_fragment_size MAX_FRAGMENT_SIZE
If the fragment size of properly-paired reads
exceeds this value, process pair as single
reads instead of as a fragment. Setting this
value to 0 will make every fragment pair be
processed as two individual reads. If
--keep_only_proper_pairs is enabled, then any
fragment exceeding this value will be
discarded. (type: Int64, default: 1000)
-M, --remove_multimapped
If enabled, remove any reads or fragments that
are mapped to multiple regions of the genome,
indicated by the 'NH' attribute being greater
than 1. Otherwise, use EM algorithm to
quantify reads after all other reads are
counted.
-e, --em_iterations EM_ITER
Number of iterations to perform EM algorithm
on multimapped read depth. Only applies if
--remove_multimapped flag is not passed (is
disabled). (type: Int64, default: 1)
--log_level LEVEL Set the log level. Options are: DEBUG, INFO,
WARNING, ERROR, CRITICAL. (default: "INFO")
--log_file /path/to/logfile.log
Path to log file. If not specified, log
messages will be printed to STDERR.
--version show version information and exit
-h, --help show this help message and exit
FADU currently does not support eukaryotic samples.
A safe way to pass values to the options is to enclose any string arguments (such as file paths or strandedness values) in quotation marks, ensuring Julia's argument parser reads the parameter as an option value rather than as a command-type argument.
At this time, only fragment depth is supported. Calculations using read depth may be implemented in the future.
Perhaps the easiest approach is to extract the 'featureID' and 'counts' columns and structure them like the output of htseq-count. From there, follow the DESeq2 manual's instructions for using htseq-count output as input.
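For example, a small Python script along these lines could strip a FADU counts file down to the two-column format htseq-count produces (the column names match FADU's output header; the file paths are placeholders):

```python
import csv

def fadu_to_htseq(fadu_counts_file, out_file):
    """Reduce a FADU counts file to the two-column, tab-delimited
    featureID/count layout that htseq-count produces."""
    with open(fadu_counts_file) as fin, open(out_file, "w", newline="") as fout:
        reader = csv.DictReader(fin, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t", lineterminator="\n")
        for row in reader:
            writer.writerow([row["featureID"], row["counts"]])

# fadu_to_htseq("sample123.counts.txt", "sample123.htseq.txt")
```

Note that FADU's counts are fractional, while DESeq2 expects integer counts, so you may want to round the counts before import.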
My dataset consists of only single-end reads, and the assigned counts only appear to be half of what was expected. What gives?
In the case of singletons, FADU gives them half the weight of a paired-end fragment in the final counts. One reason FADU does this is that when we treated single-end and properly paired reads with equal weight, we observed overcounting for certain genes or features.
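The 'num_alignments' column weights the two cases like this (an illustrative sketch; the function name is ours, not FADU's):

```python
def num_alignments(n_fragments, n_singletons):
    """Proper pairs (fragments) count as 1 each; singleton reads count as 0.5."""
    return n_fragments * 1.0 + n_singletons * 0.5

# Four fragments plus one singleton:
print(num_alignments(4, 1))  # 4.5
```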
Yes! Currently there is no Docker image for FADU on Docker Hub, since its free tier removes any image that has been inactive for 6+ months. Instead, you can build the Docker image yourself as follows (this assumes you are in the FADU root directory):
docker build -t fadu .
This Docker image uses Julia 1.7 and installs the various packages required to run FADU. Once the image has been built you can create and run a FADU container by running the following:
docker run -v $PWD:/opt/FADU fadu <fadu_options>
This command assumes that your input files are in your current working directory and that your output will be written to a path relative to your current working directory as well.
If you have a question or suggestion about FADU, feel free to e-mail either of the authors listed above, or create an issue on the GitHub repository.