Skip to content

Commit

Permalink
Merge pull request #88 from kbseah/master
Browse files Browse the repository at this point in the history
Improve compare script, update docs
  • Loading branch information
HRGV authored Jun 27, 2019
2 parents 36009cc + f3c8ced commit 24dc562
Show file tree
Hide file tree
Showing 4 changed files with 114 additions and 20 deletions.
39 changes: 39 additions & 0 deletions docs/FAQ.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,18 @@ The extracted SSU rRNA reads will be mapped back to the full-length sequences to

You'll also get output parameters that will help you evaluate quality of your sequences (see below)

# Conda gets stuck on "Solving environment" when I try to install phyloFlash - help!

In some cases, `conda install` or `conda create` can hang on the "Solving environment" step. This appears to be because of ambiguities in dependency specifications in packages on different channels (see this [issue](https://github.com/conda/conda/issues/8197) on GitHub). Setting the `channel_priority` to `strict` asks Conda to always pick the higher-priority channel first when installing packages. This requires conda version to be 4.6 and above.

Run the following command before installing phyloFlash with Conda. It will modify the `.condarc` configuration file of the current user by default (usually `~/.condarc`).

```bash
conda config --set channel_priority strict
```

Add either the `--system` or `--env` flags to modify the system or environment config respectively. See `conda config --help` and `conda config --describe` for more information.

# How does phyloFlash show library quality?

- Insert size histogram - should be more or less unimodal. Multiple peaks may be caused by contamination from other sequencing libraries.
Expand Down Expand Up @@ -64,6 +76,33 @@ EMIRGE in particular can be tricky to install because of its own software depend

The taxonomic affiliation of the SSU rRNA reads is taken from the reference sequence to which it maps in the SILVA database. From v3.2b1 onwards, the taxonomy reported is the last-common-ancestor consensus of the top hits, i.e. if a read has more than one best-scoring hit, it will report the lowest taxonomic level which they have in common. In previous versions, only the single best hit was retained, but this would result in overly-preecise taxonomic assignments especially for divergent sequences. To replicate the old behavior, use the `--tophit` option.

# How can I use phyloFlash in a Snakemake pipeline?

[Snakemake](https://snakemake.readthedocs.io/en/stable/) is a workflow engine for bioinformatics, which allows users to document workflows and automatically detect dependencies that have to be updated. The concept is similar to [GNU Make](https://www.gnu.org/software/make/).

Steps in a Snakemake pipeline are called "rules", and each rule definition specifies an input file pattern, an output file pattern, and the command that produces the output from the input. The input/output file patterns can be defined e.g. by a list or regular expressions, so this allows plenty of flexibility.

There are a few possible stumbling blocks when incorporating phyloFlash into a Snakemake pipeline:
* Snakemake takes files as input/output arguments, whereas the `-lib` option of phyloFlash is just the file prefix.
* phyloFlash produces output in the current working folder, whereas a user might want to put all output from a given rule into a single folder. However phyloFlash will not accept a path prefix (with `/` character) in the `-lib` option.

Here is an example Snakemake rule for phyloFlash, assuming that you have a [configuration file](https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html) listing all the sample names under `samples` and the path to phyloFlash database under `phyloFlash_db`.

```python
rule phyloFlash:
input:
"reads/{sample}.fastq.gz"
output:
expand("phyloFlash/{sample}.phyloFlash.tar.gz",sample=config['samples'])
threads: 16
log: "logs/{sample}_phyloFlash.log"
conda: "envs/phyloflash.yml" # Specify a Conda environment containing phyloFlash
params:
dbhome=config['phyloFlash_db']
shell:
"phyloFlash.pl -lib {wildcards.sample} -dbhome {params.dbhome} -read1 {input} -interleaved -CPUs {threads} -almosteverything; mv {wildcards.sample}.phyloFlash.* phyloFlash/"
```

# How do I get help if something doesn't work?

First check if you have all the dependencies installed properly with `phyloFlash.pl -check_env`.
Expand Down
8 changes: 4 additions & 4 deletions docs/utilities.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ order: 4

In addition to the main `phyloFlash.pl` tool, the phyloFlash folder contains a number of other tools for working with phyloFlash and analyzing the results.

To see usage instructions for any phyloFlash tool, simply run the script without arguments (short synopsis), or with the options `--help` (short help) or `--man` (full detailed manual page).
To see usage instructions for any phyloFlash tool, simply run the script without arguments (short synopsis), or with the options `--help` (short help) or `--man` (full detailed manual page).

## 1. Database setup

Expand All @@ -15,7 +15,7 @@ To see usage instructions for any phyloFlash tool, simply run the script without
phyloFlash_makedb.pl --remote

# use local copies
phyloFlash makedb.pl --silva_file path/to/silva_db --univec_file path/to/univec_db
phyloFlash_makedb.pl --silva_file path/to/silva_db --univec_file path/to/univec_db
```

Set up phyloFlash databases. See [Installation](install.html).
Expand Down Expand Up @@ -92,7 +92,7 @@ Given the run accession number, download read data from the [European Nucleotide

Proxy server users should specify the address to the `--http_proxy` option.

Note that only RUN accession numbers are valid. For an explanation of the different types of accession numbers used by ENA, see their [metadata documentation](http://ena-docs.readthedocs.io/en/latest/meta_01.html#metadata-model). Run accessions typically have the prefix "ERR", "SRR", or "DRR".
Note that only RUN accession numbers are valid. For an explanation of the different types of accession numbers used by ENA, see their [metadata documentation](http://ena-docs.readthedocs.io/en/latest/meta_01.html#metadata-model). Run accessions typically have the prefix "ERR", "SRR", or "DRR".

## 4. Genome binning

Expand All @@ -106,7 +106,7 @@ phyloFlash_fastgFishing.pl --fasta [Fasta] --fastg [Fastg] --paths [Paths] --out

From (meta)genome assembly graph in Fastg format, predict SSU rRNA sequences, extract contig clusters with total length > cutoff (default 100 kbp), and match them to phyloFlash SSU rRNA-targeted assembly results.

This is intended to aid binning of microbial genomes from metagenomes. Each cluster in a Fastg graph is likely to originate from a single genome, and so represents a putative genome bin.
This is intended to aid binning of microbial genomes from metagenomes. Each cluster in a Fastg graph is likely to originate from a single genome, and so represents a putative genome bin.

If only a Fasta and Fastg file are specified, predict SSU rRNA sequences in the Fasta contigs and report all contig clusters containing an SSU rRNA with total cluster length above the cutoff.

Expand Down
17 changes: 13 additions & 4 deletions phyloFlash_barplot.R
Original file line number Diff line number Diff line change
Expand Up @@ -34,9 +34,9 @@ makepalette <- function (n, brewer.name='Set3', othercolor='grey') {
# Define palette for taxa, using RColorBrewer Set3 as default for n colors <= 12
# otherwise attempt to "stretcH" the palette with colorRampPalette
# othercolor is the default color for "Other" taxa

# Account for different maximum n for different preset palettes
max.n <- switch(brewer.name,
max.n <- switch(brewer.name,
"Set3" = 12,
"Accent" = 8,
"Dark2" = 8,
Expand All @@ -46,7 +46,7 @@ makepalette <- function (n, brewer.name='Set3', othercolor='grey') {
"Set1" = 9,
"Set2" = 8
)

if (n <= max.n ) {
out.palette <- brewer.pal(n, name=brewer.name)
out.palette <- c(out.palette,othercolor)
Expand Down Expand Up @@ -105,6 +105,15 @@ options <- list(
action="store_true",
default=FALSE,
help="Plot raw counts rather than proportions"
),
make_option(
c("-w","--scaleplotwidth"),
type="numeric",
action="store",
default=1,
help="Change the plot width by this scaling factor (e.g. 2 makes it twice
as wide). Allows adjustment when bars are hidden because the
legend labels are too long."
)
);

Expand Down Expand Up @@ -179,7 +188,7 @@ dd.rename.barplot <- (ggplot(dd.rename, aes(sample,prop))
outname <- conf$options$out[1]
# Adjust width of plot
num.samples <- length(levels(dd.rename$sample))
width <- 360 + 80 * num.samples
width <- 360 + conf$options$scaleplotwidth[1] * 80 * num.samples
height <- 480

# Choose which output format by output prefix (adapted from heatmap script)
Expand Down
70 changes: 58 additions & 12 deletions phyloFlash_compare.pl
Original file line number Diff line number Diff line change
Expand Up @@ -89,6 +89,7 @@ =head1 DESCRIPTION
my $barplot_palette = 'Set3';
my $barplot_subset;
my $barplot_rawval;
my $barplot_scaleplotwidth;
my $out_prefix = 'test.phyloFlash_compare';
my $outfmt = "pdf";
my $keeptmp;
Expand Down Expand Up @@ -157,9 +158,9 @@ =head2 ANALYSIS OPTIONS
=item --task I<STRING>
Type of analysis to be run. Options: "heatmap", "barplot", "matrix", or a
recognizable substring thereof. Supply more than one option as comma-separated
list.
Type of analysis to be run. Options: "heatmap", "barplot", "matrix", "ntu_table"
or a recognizable substring thereof. Supply more than one option as comma-
separated list.
Default: None
Expand Down Expand Up @@ -230,6 +231,13 @@ =head2 ARGUMENTS FOR BARPLOT
Default: False
=item --barplot_scaleplotwidth
Numeric: Change plot width by this scaling factor (e.g. 2 makes it twice as wide).
Allows adjustment when bars are hidden because the legend labels are too long.
Default: 1
=back
=head2 ARGUMENTS FOR HEATMAP
Expand Down Expand Up @@ -303,6 +311,16 @@ =head1 OUTPUT
Output filename: I<[PREFIX].matrix.tsv>
=item "ntu_table"
Outputs a tab-separated table of NTU counts per sample. This is the raw data
used to draw the barplot and may be useful for users who wish to plot the data
themselves with some other program.
Output columns are taxon string, sample, counts.
Output filename: I<[PREFIX].ntu_table.tsv>
=back
=head2 HELP MESSAGES
Expand Down Expand Up @@ -337,6 +355,7 @@ =head2 HELP MESSAGES
"barplot_palette=s" => \$barplot_palette,
"barplot_subset=s" => \$barplot_subset,
"barplot_rawval" => \$barplot_rawval,
"barplot_scaleplotwidth=f" => \$barplot_scaleplotwidth,
"cluster-samples=s" => \$heatmap_clustersamples,
"cluster-taxa=s" => \$heatmap_clustertaxa,
"long-taxnames" => \$heatmap_longtaxnames,
Expand All @@ -359,7 +378,7 @@ =head2 HELP MESSAGES
## Catch exceptions ############################################################

if (!defined $task_opt) {
pod2usage ("ERROR: Please specify tasks [barplot, heatmap, matrix] to option --task");
pod2usage ("ERROR: Please specify tasks [barplot, heatmap, matrix, ntu_table] to option --task");
pod2usage(-verbose=>0,-exit=>1);
}
if ($taxlevel > 7) {
Expand Down Expand Up @@ -530,6 +549,20 @@ =head2 HELP MESSAGES

## Perform plotting ############################################################

if (defined $task_hash{'ntu_table'}) {
## NTU table ###############################################################
msg ("Summarizing NTUs per sample at taxonomic level $taxlevel");
my $out_aref = abundance_hash_to_array(\%ntuhash, "\t");
my $outfile_name = "$out_prefix.ntu_table.tsv";
my $ntu_table_fh;
open ($ntu_table_fh, ">", $outfile_name) or die ("Cannot open file $outfile_name for writing");
foreach my $line (@$out_aref) {
print $ntu_table_fh $line."\n";
}
close ($ntu_table_fh);
msg ("NTU abundance table written to file: $outfile_name");
}

if (defined $task_hash{'barplot'} ) {
## Barplot #################################################################
msg ("Plotting barplot using script: $barplot_script");
Expand All @@ -553,7 +586,9 @@ =head2 HELP MESSAGES
if (defined $barplot_rawval) {
push @barplot_args, "--rawval";
}

if (defined $barplot_scaleplotwidth) {
push @barplot_args, "--scaleplotwidth=$barplot_scaleplotwidth";
}

my $barplot_cmd = join " ", ('Rscript', $barplot_script, @barplot_args);
msg ("Plotting barplot: $barplot_cmd");
Expand Down Expand Up @@ -736,10 +771,19 @@ sub parse_task_options {
$href) = @_;
my @task_arr = split /,/, $task_string;
foreach my $task (@task_arr) {
$href->{'heatmap'} ++ if (index ('heatmap',$task) != -1);
$href->{'barplot'} ++ if (index ('barplot',$task) != -1);
$href->{'matrix'} ++ if (index ('matrix',$task) != -1);
$href->{'test'} ++ if (index ('test', $task) != -1);
if (index ('heatmap',$task) != -1) {
$href->{'heatmap'} ++;
} elsif (index ('barplot',$task) != -1) {
$href->{'barplot'} ++;
} elsif (index ('matrix',$task) != -1) {
$href->{'matrix'} ++;
} elsif (index ('ntu_table', $task) != -1) {
$href->{'ntu_table'} ++ ;
} elsif (index ('test', $task) != -1) {
$href->{'test'} ++;
} else {
msg ("ERROR: Unrecognized argument $task passed to option --task ; Valid arguments are: heatmap, barplot, matrix, ntu_table");
}
}
}

Expand Down Expand Up @@ -822,18 +866,20 @@ sub ntu_csv_arr_to_hash {
}

sub abundance_hash_to_array {
my ($href) = @_;
my ($href, $delim) = @_;
if (!defined $delim) {
$delim = ",";
}
my @arr;
foreach my $taxon (sort keys %$href) {
foreach my $sample (sort keys %{$href->{$taxon}}) {
my $count = $href->{$taxon}{$sample};
push @arr, join ",", ($taxon, $sample, $count);
push @arr, join $delim, ($taxon, $sample, $count);
}
}
return \@arr;
}


sub encode_persample_counts_on_tree {
my ($href, # Output hash tree
$href_in, # Input hash for a given sample
Expand Down

0 comments on commit 24dc562

Please sign in to comment.