n50 - Calculate N50

A fast and efficient tool for calculating N50 and other sequence statistics from FASTA and FASTQ files.
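For readers unfamiliar with the metric: N50 is the length L such that sequences of length L or longer contain at least half of all bases. The following is a minimal Python sketch of that standard definition, not the tool's C implementation; the function name `n50` and the handling of empty input are assumptions made for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence to the shortest, accumulating bases
    # until at least half of the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # assumption: report 0 when there are no sequences


# Total is 1500 bases; 500 + 400 = 900 is the first running sum >= 750.
print(n50([100, 200, 300, 400, 500]))  # prints 400
```

Sorting dominates the cost, so the sketch runs in O(n log n) time for n sequences.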

Features

  • Supports both FASTA and FASTQ formats
  • Optimized for raw FASTQ files (e.g. Nanopore, PacBio)
  • Handles gzipped input files

Installation

Prerequisites

  • GCC compiler
  • zlib library
  • pthread library

Compiling

To compile the program, use the following command:

make

or compile the binary directly:

gcc -o n50 src/n50.c -lz -lpthread -O3

Usage

./n50 [options] [filename]...

If no filename is provided, the program reads from standard input.

Options

  • --fasta or -a: Force FASTA input format
  • --fastq or -q: Force FASTQ input format
  • --header or -H: Print header in the output
  • --n50 or -n: Output only the N50 value
  • --version: Display version information
  • --help: Display help information
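The --fasta and --fastq options override whatever format the program would otherwise assume. This README does not describe how that default is chosen, so the sketch below only illustrates one common approach, as an assumption: check for the gzip magic bytes, then look at the first record character (`>` for FASTA, `@` for FASTQ). The function name `guess_format` is hypothetical.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def guess_format(data: bytes) -> str:
    """Guess 'FASTA' or 'FASTQ' from the full contents of a file.

    Assumption for illustration only: the real program may detect the
    format differently (for example from the file extension).
    """
    if data[:2] == GZIP_MAGIC:
        data = gzip.decompress(data)
    first = data.lstrip()[:1]  # first non-whitespace byte
    if first == b">":
        return "FASTA"
    if first == b"@":
        return "FASTQ"
    raise ValueError("unrecognized sequence format")


print(guess_format(b">seq1\nACGT\n"))
print(guess_format(gzip.compress(b"@read1\nACGT\n+\nIIII\n")))
```

When a guess like this could fail, for instance on an empty stream from stdin, forcing the format with --fasta or --fastq avoids the ambiguity.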

Examples

  1. Process a FASTA file:
./n50 input.fasta
  2. Process a gzipped FASTQ file:
./n50 input.fastq.gz
  3. Process a file with header output:
./n50 --header input.fasta
  4. Output only the N50 value:
./n50 --n50 input.fasta
  5. Process input from stdin:
cat input.fasta | ./n50 --fasta

Output

By default, the program outputs a tab-separated line with the following fields:

  1. Format (FASTA or FASTQ)
  2. Total sequence length
  3. Total number of sequences
  4. N50 value

When using the --header option, a header line is printed before the results.

When using the --n50 option, only the N50 value is printed.
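For downstream scripts, the default output can be split on tabs. The sketch below is a hedged example: it assumes the four fields appear in the order listed above, the names `N50Stats` and `parse_n50_output` are hypothetical, and the sample values are made up. Because the header text is not specified here, the parser simply skips any line whose numeric fields do not parse as integers.

```python
from typing import List, NamedTuple


class N50Stats(NamedTuple):
    fmt: str            # "FASTA" or "FASTQ"
    total_length: int   # total sequence length
    num_sequences: int  # total number of sequences
    n50: int            # N50 value


def parse_n50_output(text: str) -> List[N50Stats]:
    """Parse tab-separated n50 output, skipping blank and header lines."""
    results = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 4:
            continue  # blank or unexpected line
        fmt, total, count, n50 = fields
        # Assumption: a header line has non-numeric text in these columns.
        if not (total.isdigit() and count.isdigit() and n50.isdigit()):
            continue
        results.append(N50Stats(fmt, int(total), int(count), int(n50)))
    return results


# Illustrative values only, not real program output.
sample = "FASTQ\t1500\t5\t400\n"
print(parse_n50_output(sample)[0].n50)  # prints 400
```

With --n50 the output is a single number, so `int(text.strip())` is enough and no field splitting is needed.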

Performance

The program uses multi-threading to process large files efficiently. It automatically adjusts the number of threads based on the input size, up to a maximum of 8 threads.
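The exact rule for choosing the thread count is not documented here. As a rough illustration of scaling with input size under a cap, the sketch below uses one worker per fixed-size chunk of input. Only the limit of 8 comes from this README; `BYTES_PER_THREAD` and `pick_thread_count` are hypothetical and do not reflect the actual source code.

```python
MAX_THREADS = 8  # cap stated in this README
BYTES_PER_THREAD = 64 * 1024 * 1024  # hypothetical: one thread per 64 MiB


def pick_thread_count(input_size_bytes: int,
                      bytes_per_thread: int = BYTES_PER_THREAD,
                      max_threads: int = MAX_THREADS) -> int:
    """Return a thread count between 1 and max_threads for the given size."""
    # Ceiling division: how many chunks are needed to cover the input.
    wanted = -(-input_size_bytes // bytes_per_thread)
    return max(1, min(max_threads, wanted))


for size in (0, 100 * 1024 * 1024, 10 * 1024 ** 3):
    print(size, pick_thread_count(size))
```

Small inputs stay single-threaded under this rule, which avoids paying thread start-up costs when there is little work to split.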

Limitations

  • The maximum number of threads is currently set to 8. This can be adjusted by modifying the MAX_THREADS constant in the source code.
  • The initial capacity for storing sequence lengths is set to 1,000,000. For extremely large datasets, this value might need to be increased.