diff --git a/docs/output.md b/docs/output.md
index 18fe2b6d..4705d1e5 100644
--- a/docs/output.md
+++ b/docs/output.md
@@ -15,6 +15,9 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
 - [BlobDir](#blobdir) - Output files viewable on a [BlobToolKit viewer](https://github.com/blobtoolkit/blobtoolkit)
 - [Static plots](#static-plots) - Static versions of the BlobToolKit plots
 - [BUSCO](#busco) - BUSCO results
+- [Read alignments](#read-alignments) - Aligned reads (optional)
+- [Read coverage](#read-coverage) - Read coverage tracks
+- [Base content](#base-content) - _k_-mer statistics (for k ≤ 4)
 - [MultiQC](#multiqc) - Aggregate report describing results from the whole pipeline
 - [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
@@ -26,8 +29,8 @@ The files in the BlobDir dataset which is used to create the online interactive
 Output files
 
 - `blobtoolkit/`
-  - `/`
-    - `*.json.gz`: files generated from genome and alignment coverage statistics
+  - `/`
+    - `*.json.gz`: files generated from genome and alignment coverage statistics.
 
 More information about visualising the data in the [BlobToolKit repository](https://github.com/blobtoolkit/blobtoolkit/tree/main/src/viewer)
@@ -53,12 +56,56 @@ BUSCO results generated by the pipeline (all BUSCO lineages that match the claas
 Output files
 
-- `blobtoolkit/`
-  - `busco/`
-    - `*.batch_summary.txt`: BUSCO scores as tab-separated files (1 file per lineage).
-    - `*.fasta.txt`: BUSCO scores as formatted text (1 file per lineage).
-    - `*.json`: BUSCO scores as JSON (1 file per lineage).
-    - `*/`: all output BUSCO files, including the coordinate and sequence files of the annotated genes.
+- `busco/`
+  - `/`
+    - `short_summary.json`: BUSCO scores for that lineage as JSON.
+    - `short_summary.tsv`: BUSCO scores for that lineage as a tab-separated file.
+    - `short_summary.txt`: BUSCO scores for that lineage as formatted text.
+    - `full_table.tsv`: Coordinates of the annotated BUSCO genes as a tab-separated file.
+    - `missing_busco_list.tsv`: List of the BUSCO genes that could not be found.
+    - `*_busco_sequences.tar.gz`: Sequences of the annotated BUSCO genes. 1 _tar_ archive for each of the three annotation levels (`single_copy`, `multi_copy`, `fragmented`), with 1 file per gene.
+    - `hmmer_output.tar.gz`: Archive of the HMMER alignment scores.
+
+
+
+### Read alignments
+
+Read alignments in BAM format -- only if the pipeline is run with `--align true`.
+
+
+Output files
+
+- `read_mapping/`
+  - `/`
+    - `.bam`: alignments of that sample's reads in BAM format.
+
+
+
+### Read coverage
+
+Read coverage statistics as computed by the pipeline.
+These files are the raw data used to build the BlobDir.
+
+
+Output files
+
+- `read_mapping/`
+  - `/`
+    - `.coverage.1k.bed.gz`: bedGraph file with the coverage of that sample's alignments in 1 kbp windows.
+
+
+
+### Base content
+
+_k_-mer statistics.
+These files are the raw data used to build the BlobDir.
+
+
+Output files
+
+- `base_content/`
+  - `_*nuc_windows.tsv.gz`: Tab-separated files with the counts of every _k_-mer for k ≤ 4 in 1 kbp windows. The first three columns correspond to the coordinates (sequence name, start, end), followed by each _k_-mer.
+  - `_freq_windows.tsv.gz`: Tab-separated files with frequencies derived from the _k_-mer counts.
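
To illustrate the column layout documented for the `base_content/` window files (three coordinate columns followed by one column per _k_-mer), here is a minimal Python sketch of a reader. It is not part of the pipeline: the function name `read_window_counts` is hypothetical, and the sketch assumes that the first row of the file is a header naming every column, which the documentation above does not state explicitly.

```python
import csv
import gzip


def read_window_counts(path):
    """Yield (sequence, start, end, values) for each window of a gzipped TSV.

    Assumption (not confirmed by the documentation): the first row is a
    header naming every column. The first three columns are taken to be
    the window coordinates; every remaining column is one k-mer.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        kmer_names = header[3:]
        for row in reader:
            sequence, start, end = row[0], int(row[1]), int(row[2])
            # float() accepts both integer counts and fractional frequencies
            values = dict(zip(kmer_names, map(float, row[3:])))
            yield sequence, start, end, values
```

Parsing the values as floats lets the same reader handle both the count files and the frequency file; if only counts are needed, `int` could be used instead.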