A fast and efficient tool for calculating N50 and other sequence statistics from FASTA and FASTQ files.
- Supports both FASTA and FASTQ formats
- Optimized for raw FASTQ files (Nanopore, PacBio)
- Handles gzipped input files
Building requires:
- GCC compiler
- zlib library
- pthread library
To compile the program, use the following command:
make
or compile the binary directly:
gcc -o n50 src/n50.c -lz -lpthread -O3
./n50 [options] [filename]...
If no filename is provided, the program reads from standard input.
- --fasta or -a: Force FASTA input format
- --fastq or -q: Force FASTQ input format
- --header or -H: Print header in the output
- --n50 or -n: Output only the N50 value
- --version and --help: Display version and help information, respectively
- Process a FASTA file:
./n50 input.fasta
- Process a gzipped FASTQ file:
./n50 input.fastq.gz
- Process a file with header output:
./n50 --header input.fasta
- Output only the N50 value:
./n50 --n50 input.fasta
- Process input from stdin:
cat input.fasta | ./n50 --fasta
By default, the program outputs a tab-separated line with the following fields:
- Format (FASTA or FASTQ)
- Total sequence length
- Total number of sequences
- N50 value
When using the --header option, a header line is printed before the results.
When using the --n50 option, only the N50 value is printed.
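For readers unfamiliar with the metric, N50 is commonly defined as the length L such that sequences of length L or longer together contain at least half of all bases. The sketch below illustrates that definition; it is not taken from src/n50.c, and the function name compute_n50 is an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator: orders uint64_t values from largest to smallest. */
static int cmp_desc(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x < y) - (x > y);
}

/* Hypothetical helper (not the n50 source): returns the N50 of
 * lengths[0..n-1], or 0 for an empty input. Sorts the array in place. */
uint64_t compute_n50(uint64_t *lengths, size_t n) {
    assert(lengths != NULL || n == 0);
    if (n == 0) return 0;

    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) total += lengths[i];

    qsort(lengths, n, sizeof lengths[0], cmp_desc);

    /* Walk from the longest sequence down until at least half of the
     * total bases are covered; that sequence's length is the N50. */
    uint64_t running = 0;
    for (size_t i = 0; i < n; i++) {
        running += lengths[i];
        if (running >= total - running) return lengths[i];
    }
    return 0; /* not reached for non-empty input */
}
```

For the lengths 2, 3, 4, 5 and 6 (20 bases in total), the two longest sequences cover 11 bases, which is more than half, so this sketch returns 5.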
The program uses multi-threading to process large files efficiently. It automatically adjusts the number of threads based on the input size, up to a maximum of 8 threads.
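How the thread count is derived from the input size is not described here. One simple way to scale workers with the amount of work is sketched below, assuming a fixed number of items per thread; the threshold MIN_ITEMS_PER_THREAD and the function choose_thread_count are hypothetical and do not come from the n50 source. Only the limit of 8 is taken from this document.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS 8               /* upper limit stated in this README */
#define MIN_ITEMS_PER_THREAD 100000 /* hypothetical threshold, not from n50 */

static_assert(MAX_THREADS >= 1, "at least one worker thread is required");

/* Hypothetical heuristic: one thread per MIN_ITEMS_PER_THREAD items
 * (rounded up), clamped to the range [1, MAX_THREADS]. */
int choose_thread_count(size_t n_items) {
    size_t wanted = n_items / MIN_ITEMS_PER_THREAD
                  + (n_items % MIN_ITEMS_PER_THREAD != 0);
    if (wanted < 1) wanted = 1;
    if (wanted > MAX_THREADS) wanted = MAX_THREADS;
    return (int)wanted;
}
```

With these assumed values, small inputs stay single-threaded, which avoids thread start-up cost when there is little to do, and anything above 700,000 items uses all 8 threads.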
- The maximum number of threads is currently set to 8. This can be adjusted by modifying the MAX_THREADS constant in the source code.
- The initial capacity for storing sequence lengths is set to 1,000,000. For extremely large datasets, this value might need to be increased.
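To illustrate what the initial capacity refers to, the sketch below shows a generic array of sequence lengths that starts at a chosen capacity and doubles when full. Whether n50 grows its buffer this way is not stated here; the type LengthArray and its functions are hypothetical names for this example only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical container for sequence lengths (not the n50 source). */
typedef struct {
    uint64_t *data;
    size_t size;      /* number of lengths stored */
    size_t capacity;  /* number of slots allocated */
} LengthArray;

/* Allocates `capacity` slots. Returns 0 on success, -1 on failure. */
int length_array_init(LengthArray *arr, size_t capacity) {
    assert(arr != NULL && capacity > 0);
    arr->data = malloc(capacity * sizeof *arr->data);
    arr->size = 0;
    arr->capacity = arr->data ? capacity : 0;
    return arr->data ? 0 : -1;
}

/* Appends one length, doubling the allocation when the array is full.
 * Returns 0 on success, -1 if memory could not be obtained. */
int length_array_push(LengthArray *arr, uint64_t value) {
    if (arr->size == arr->capacity) {
        size_t new_cap = arr->capacity ? arr->capacity * 2 : 1;
        uint64_t *grown = realloc(arr->data, new_cap * sizeof *grown);
        if (grown == NULL) return -1; /* old buffer is still valid */
        arr->data = grown;
        arr->capacity = new_cap;
    }
    arr->data[arr->size++] = value;
    return 0;
}

/* Releases the buffer and resets the container. */
void length_array_free(LengthArray *arr) {
    free(arr->data);
    arr->data = NULL;
    arr->size = 0;
    arr->capacity = 0;
}
```

In this sketch, a call such as length_array_init(&arr, 1000000) would mirror the initial capacity mentioned above, and doubling keeps the number of reallocations small as the dataset grows.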