Merge pull request #88 from kbseah/master

Improve compare script, update docs
HRGV · Jun 27, 2019 · 24dc562 · 24dc562
2 parents 36009cc + f3c8ced
commit 24dc562
Show file tree

Hide file tree

Showing 4 changed files with 114 additions and 20 deletions.
diff --git a/docs/FAQ.md b/docs/FAQ.md
@@ -31,6 +31,18 @@ The extracted SSU rRNA reads will be mapped back to the full-length sequences to
 
 You'll also get output parameters that will help you evaluate quality of your sequences (see below)
 
+# Conda gets stuck on "Solving environment" when I try to install phyloFlash - help!
+
+In some cases, `conda install` or `conda create` can hang on the "Solving environment" step. This appears to be because of ambiguities in dependency specifications in packages on different channels (see this [issue](https://github.com/conda/conda/issues/8197) on GitHub). Setting the `channel_priority` to `strict` asks Conda to always pick the higher-priority channel first when installing packages. This requires conda version to be 4.6 and above.
+
+Run the following command before installing phyloFlash with Conda. It will modify the `.condarc` configuration file of the current user by default (usually `~/.condarc`).
+
+```bash
+conda config --set channel_priority strict
+```
+
+Add either the `--system` or `--env` flags to modify the system or environment config respectively. See `conda config --help` and `conda config --describe` for more information.
+
 # How does phyloFlash show library quality?
 
  - Insert size histogram - should be more or less unimodal. Multiple peaks may be caused by contamination from other sequencing libraries.
@@ -64,6 +76,33 @@ EMIRGE in particular can be tricky to install because of its own software depend
 
 The taxonomic affiliation of the SSU rRNA reads is taken from the reference sequence to which it maps in the SILVA database. From v3.2b1 onwards, the taxonomy reported is the last-common-ancestor consensus of the top hits, i.e. if a read has more than one best-scoring hit, it will report the lowest taxonomic level which they have in common. In previous versions, only the single best hit was retained, but this would result in overly-preecise taxonomic assignments especially for divergent sequences. To replicate the old behavior, use the `--tophit` option.
 
+# How can I use phyloFlash in a Snakemake pipeline?
+
+[Snakemake](https://snakemake.readthedocs.io/en/stable/) is a workflow engine for bioinformatics, which allows users to document workflows and automatically detect dependencies that have to be updated. The concept is similar to [GNU Make](https://www.gnu.org/software/make/).
+
+Steps in a Snakemake pipeline are called "rules", and each rule definition specifies an input file pattern, an output file pattern, and the command that produces the output from the input. The input/output file patterns can be defined e.g. by a list or regular expressions, so this allows plenty of flexibility.
+
+There are a few possible stumbling blocks when incorporating phyloFlash into a Snakemake pipeline:
+ * Snakemake takes files as input/output arguments, whereas the `-lib` option of phyloFlash is just the file prefix.
+ * phyloFlash produces output in the current working folder, whereas a user might want to put all output from a given rule into a single folder. However phyloFlash will not accept a path prefix (with `/` character) in the `-lib` option.
+
+Here is an example Snakemake rule for phyloFlash, assuming that you have a [configuration file](https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html) listing all the sample names under `samples` and the path to phyloFlash database under `phyloFlash_db`.
+
+```python
+rule phyloFlash:
+    input:
+        "reads/{sample}.fastq.gz"
+    output:
+        expand("phyloFlash/{sample}.phyloFlash.tar.gz",sample=config['samples'])
+    threads: 16
+    log: "logs/{sample}_phyloFlash.log"
+    conda: "envs/phyloflash.yml" # Specify a Conda environment containing phyloFlash
+    params:
+        dbhome=config['phyloFlash_db']
+    shell:
+        "phyloFlash.pl -lib {wildcards.sample} -dbhome {params.dbhome} -read1 {input} -interleaved -CPUs {threads} -almosteverything; mv {wildcards.sample}.phyloFlash.* phyloFlash/"
+```
+
 # How do I get help if something doesn't work?
 
 First check if you have all the dependencies installed properly with `phyloFlash.pl -check_env`.

diff --git a/docs/utilities.md b/docs/utilities.md
@@ -6,7 +6,7 @@ order: 4
 
 In addition to the main `phyloFlash.pl` tool, the phyloFlash folder contains a number of other tools for working with phyloFlash and analyzing the results.
 
-To see usage instructions for any phyloFlash tool, simply run the script without arguments (short synopsis), or with the options `--help` (short help) or `--man` (full detailed manual page). 
+To see usage instructions for any phyloFlash tool, simply run the script without arguments (short synopsis), or with the options `--help` (short help) or `--man` (full detailed manual page).
 
 ## 1. Database setup
 
@@ -15,7 +15,7 @@ To see usage instructions for any phyloFlash tool, simply run the script without
 phyloFlash_makedb.pl --remote
 
 # use local copies
-phyloFlash makedb.pl --silva_file path/to/silva_db --univec_file path/to/univec_db
+phyloFlash_makedb.pl --silva_file path/to/silva_db --univec_file path/to/univec_db
 ```
 
 Set up phyloFlash databases. See [Installation](install.html).
@@ -92,7 +92,7 @@ Given the run accession number, download read data from the [European Nucleotide
 
 Proxy server users should specify the address to the `--http_proxy` option.
 
-Note that only RUN accession numbers are valid. For an explanation of the different types of accession numbers used by ENA, see their [metadata documentation](http://ena-docs.readthedocs.io/en/latest/meta_01.html#metadata-model). Run accessions typically have the prefix "ERR", "SRR", or "DRR". 
+Note that only RUN accession numbers are valid. For an explanation of the different types of accession numbers used by ENA, see their [metadata documentation](http://ena-docs.readthedocs.io/en/latest/meta_01.html#metadata-model). Run accessions typically have the prefix "ERR", "SRR", or "DRR".
 
 ## 4. Genome binning
 
@@ -106,7 +106,7 @@ phyloFlash_fastgFishing.pl --fasta [Fasta] --fastg [Fastg] --paths [Paths] --out
 
 From (meta)genome assembly graph in Fastg format, predict SSU rRNA sequences, extract contig clusters with total length > cutoff (default 100 kbp), and match them to phyloFlash SSU rRNA-targeted assembly results.
 
-This is intended to aid binning of microbial genomes from metagenomes. Each cluster in a Fastg graph is likely to originate from a single genome, and so represents a putative genome bin. 
+This is intended to aid binning of microbial genomes from metagenomes. Each cluster in a Fastg graph is likely to originate from a single genome, and so represents a putative genome bin.
 
 If only a Fasta and Fastg file are specified, predict SSU rRNA sequences in the Fasta contigs and report all contig clusters containing an SSU rRNA with total cluster length above the cutoff.
 

diff --git a/phyloFlash_barplot.R b/phyloFlash_barplot.R
@@ -34,9 +34,9 @@ makepalette <- function (n, brewer.name='Set3', othercolor='grey') {
   # Define palette for taxa, using RColorBrewer Set3 as default for n colors <= 12
   # otherwise attempt to "stretcH" the palette with colorRampPalette
   # othercolor is the default color for "Other" taxa
-  
+
   # Account for different maximum n for different preset palettes
-  max.n <- switch(brewer.name, 
+  max.n <- switch(brewer.name,
                   "Set3" = 12,
                   "Accent" = 8,
                   "Dark2" = 8,
@@ -46,7 +46,7 @@ makepalette <- function (n, brewer.name='Set3', othercolor='grey') {
                   "Set1" = 9,
                   "Set2" = 8
                   )
-  
+
   if (n <= max.n ) {
     out.palette <- brewer.pal(n, name=brewer.name)
     out.palette <- c(out.palette,othercolor)
@@ -105,6 +105,15 @@ options <- list(
     action="store_true",
     default=FALSE,
     help="Plot raw counts rather than proportions"
+  ),
+  make_option(
+    c("-w","--scaleplotwidth"),
+    type="numeric",
+    action="store",
+    default=1,
+    help="Change the plot width by this scaling factor (e.g. 2 makes it twice
+                  as wide). Allows adjustment when bars are hidden because the
+                  legend labels are too long."
   )
 );
 
@@ -179,7 +188,7 @@ dd.rename.barplot <- (ggplot(dd.rename, aes(sample,prop))
 outname <- conf$options$out[1]
 # Adjust width of plot
 num.samples <- length(levels(dd.rename$sample))
-width <- 360 + 80 * num.samples
+width <- 360 + conf$options$scaleplotwidth[1] * 80 * num.samples
 height <- 480
 
 # Choose which output format by output prefix (adapted from heatmap script)

diff --git a/phyloFlash_compare.pl b/phyloFlash_compare.pl
@@ -89,6 +89,7 @@ =head1 DESCRIPTION
 my $barplot_palette = 'Set3';
 my $barplot_subset;
 my $barplot_rawval;
+my $barplot_scaleplotwidth;
 my $out_prefix = 'test.phyloFlash_compare';
 my $outfmt = "pdf";
 my $keeptmp;
@@ -157,9 +158,9 @@ =head2 ANALYSIS OPTIONS
 
 =item --task I<STRING>
 
-Type of analysis to be run. Options: "heatmap", "barplot", "matrix", or a
-recognizable substring thereof. Supply more than one option as comma-separated
-list.
+Type of analysis to be run. Options: "heatmap", "barplot", "matrix", "ntu_table"
+or a recognizable substring thereof. Supply more than one option as comma-
+separated list.
 
 Default: None
 
@@ -230,6 +231,13 @@ =head2 ARGUMENTS FOR BARPLOT
 
 Default: False
 
+=item --barplot_scaleplotwidth
+
+Numeric: Change plot width by this scaling factor (e.g. 2 makes it twice as wide).
+Allows adjustment when bars are hidden because the legend labels are too long.
+
+Default: 1
+
 =back
 
 =head2 ARGUMENTS FOR HEATMAP
@@ -303,6 +311,16 @@ =head1 OUTPUT
 
 Output filename: I<[PREFIX].matrix.tsv>
 
+=item "ntu_table"
+
+Outputs a tab-separated table of NTU counts per sample. This is the raw data
+used to draw the barplot and may be useful for users who wish to plot the data
+themselves with some other program.
+
+Output columns are taxon string, sample, counts.
+
+Output filename: I<[PREFIX].ntu_table.tsv>
+
 =back
 
 =head2 HELP MESSAGES
@@ -337,6 +355,7 @@ =head2 HELP MESSAGES
             "barplot_palette=s" => \$barplot_palette,
             "barplot_subset=s" => \$barplot_subset,
             "barplot_rawval" => \$barplot_rawval,
+            "barplot_scaleplotwidth=f" => \$barplot_scaleplotwidth,
             "cluster-samples=s" => \$heatmap_clustersamples,
             "cluster-taxa=s" => \$heatmap_clustertaxa,
             "long-taxnames" => \$heatmap_longtaxnames,
@@ -359,7 +378,7 @@ =head2 HELP MESSAGES
 ## Catch exceptions ############################################################
 
 if (!defined $task_opt) {
-    pod2usage ("ERROR: Please specify tasks [barplot, heatmap, matrix] to option --task");
+    pod2usage ("ERROR: Please specify tasks [barplot, heatmap, matrix, ntu_table] to option --task");
     pod2usage(-verbose=>0,-exit=>1);
 }
 if ($taxlevel > 7) {
@@ -530,6 +549,20 @@ =head2 HELP MESSAGES
 
 ## Perform plotting ############################################################
 
+if (defined $task_hash{'ntu_table'}) {
+    ## NTU table ###############################################################
+    msg ("Summarizing NTUs per sample at taxonomic level $taxlevel");
+    my $out_aref = abundance_hash_to_array(\%ntuhash, "\t");
+    my $outfile_name = "$out_prefix.ntu_table.tsv";
+    my $ntu_table_fh;
+    open ($ntu_table_fh, ">", $outfile_name) or die ("Cannot open file $outfile_name for writing");
+    foreach my $line (@$out_aref) {
+        print $ntu_table_fh $line."\n";
+    }
+    close ($ntu_table_fh);
+    msg ("NTU abundance table written to file: $outfile_name");
+}
+
 if (defined $task_hash{'barplot'} ) {
     ## Barplot #################################################################
     msg ("Plotting barplot using script: $barplot_script");
@@ -553,7 +586,9 @@ =head2 HELP MESSAGES
     if (defined $barplot_rawval) {
         push @barplot_args, "--rawval";
     }
-
+    if (defined $barplot_scaleplotwidth) {
+        push @barplot_args, "--scaleplotwidth=$barplot_scaleplotwidth";
+    }
 
     my $barplot_cmd = join " ", ('Rscript', $barplot_script, @barplot_args);
     msg ("Plotting barplot: $barplot_cmd");
@@ -736,10 +771,19 @@ sub parse_task_options {
         $href) = @_;
     my @task_arr = split /,/, $task_string;
     foreach my $task (@task_arr) {
-        $href->{'heatmap'} ++ if (index ('heatmap',$task) != -1);
-        $href->{'barplot'} ++ if (index ('barplot',$task) != -1);
-        $href->{'matrix'} ++ if (index ('matrix',$task) != -1);
-        $href->{'test'} ++ if (index ('test', $task) != -1);
+        if (index ('heatmap',$task) != -1) {
+            $href->{'heatmap'} ++;
+        } elsif (index ('barplot',$task) != -1) {
+            $href->{'barplot'} ++;
+        } elsif (index ('matrix',$task) != -1) {
+            $href->{'matrix'} ++;
+        } elsif (index ('ntu_table', $task) != -1) {
+            $href->{'ntu_table'} ++ ;
+        } elsif (index ('test', $task) != -1) {
+            $href->{'test'} ++;
+        } else {
+            msg ("ERROR: Unrecognized argument $task passed to option --task ; Valid arguments are: heatmap, barplot, matrix, ntu_table");
+        }
     }
 }
 
@@ -822,18 +866,20 @@ sub ntu_csv_arr_to_hash {
 }
 
 sub abundance_hash_to_array {
-    my ($href) = @_;
+    my ($href, $delim) = @_;
+    if (!defined $delim) {
+      $delim = ",";
+    }
     my @arr;
     foreach my $taxon (sort keys %$href) {
         foreach my $sample (sort keys %{$href->{$taxon}}) {
             my $count = $href->{$taxon}{$sample};
-            push @arr, join ",", ($taxon, $sample, $count);
+            push @arr, join $delim, ($taxon, $sample, $count);
         }
     }
     return \@arr;
 }
 
-
 sub encode_persample_counts_on_tree {
     my ($href,          # Output hash tree
         $href_in,       # Input hash for a given sample