chrom_mini_graph is a tool for generating and mapping reads onto a chromatic (coloured) minimizer pangenome graph.
- rust and associated tools such as cargo are required and assumed to be in PATH.
- cmake version >= 3.12 has to be in path due to dependency on --parallel command (see here).
git clone https://github.com/bluenote-1577/chrom_mini_graph
cd chrom_mini_graph
cargo build --release
./target/release/chrom_mini_graph generate test_refs/*
./target/release/chrom_mini_graph map -a -b test_bam.bam serialized_mini_graph.bin test_reads/hg_01243_pacbio_reads.fastq > output.txt
cargo build --release
first builds the chrom_mini_graph binary, which is found in the ./target/release/ directory.- The
chrom_mini_graph generate
command generates a coloured minimizer pangenome graph. - The
chrom_mini_graph map
command chains onto the output graph and produces an alignment. The-a
option outputs a BAM file with name specified by the-b
option.
- 6 reference 1M bp segments of chromosome 20 are provided in the test_ref folder.
- Simulated PacBio CLR reads for hg01243 are available in the test_reads folder.
chrom_mini_graph generate ref_1.fasta ref_2.fasta ... -o output_from_generate
to create a coloured minimizer pangenome graph for references ref_1.fasta, ref_2.fasta, etc. The output specified by the -o
option is used for the mapping step.
- The window size can be easily modified in the
src/bin/chrom_mini_graph.rs
file. The default value is 16. - Outputs a *.bin file to be used for mapping and other auxillary information; see below.
- Each fasta file can have multiple contigs. Each contig will be treated as its own reference genome.
The first reference used (i.e. ref_1.fasta
) serves as the backbone for the minimizer graph. Make sure that this first reference is the most contiguous contig.
A proof of concept read-to-graph chainer by chaining minimizers in the read onto the graph without knowledge of colour and then finding the best colours (reference genomes) for the chain.
chrom_mini_graph map -a output_from_generate.bin your_reads.fastq -b bam_name.bam > output.txt
outputs the bam file bam_name.bam
and directs stdout to a output.txt log.
The file best_genome_reads.txt is also output. The best_genome_reads.txt
shows the top 5 (or less) best candidate reference genomes for each read. The format is
>read_1
chrom_1 score_1
chrom_2 score_2
...
>read_2
chrom_1 score_1
chrom_2 score_2
...
More than 5 best candidates may be output due to secondary alignments and less than 5 may be output if the read is deemed unmappable to certain references.
Caveats:
- Only the best candidate genome is aligned to.
- No supplementary/secondary alignments are output in the bam file; MapQ is defaulted to 60.