GitHub - cvwang/SeqUMich: Index thousands of genomes to determine if they contain a given DNA mutation

SeqUMich
Authors: Sam Lu, Charles Wang, & Hui Jiang
University of Michigan

SeqUMich processes a file containing a set of query strings (e.g. a FASTA .fa file) and determines which of those strings (usually ~1000 bp each) are contained within a given sequencing reads file (e.g. a fastq file).

SeqUMich works in three steps:
1. Creating a binary file of the fastq file (if one has not already been specified)
2. Creating a hash table to index the fastq file
3. Aligning and scoring segments of the query string with reads
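
This README does not describe the hash table or the scoring scheme, so the following is only a generic seed-and-verify sketch of what steps 2 and 3 could look like, not SeqUMich's actual method. Every name here (`ReadIndex`, `build_index`, `count_supporting_reads`), the choice of keying each read by its first `k` bases, and the mismatch-count score are assumptions made for illustration.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical index type: k-mer -> ids of the reads that start with it.
using ReadIndex = std::unordered_map<std::string, std::vector<std::size_t>>;

// Step 2 (assumed form): key every read by its first k bases.
ReadIndex build_index(const std::vector<std::string>& reads, std::size_t k) {
    ReadIndex index;
    for (std::size_t i = 0; i < reads.size(); ++i) {
        if (reads[i].size() >= k) {
            index[reads[i].substr(0, k)].push_back(i);
        }
    }
    return index;
}

// Step 3 (assumed form): slide over the query, look up each k-mer, and
// verify the remainder of every candidate read base by base.
// Returns the number of (position, read) pairs within max_mismatches.
std::size_t count_supporting_reads(const std::string& query,
                                   const std::vector<std::string>& reads,
                                   const ReadIndex& index,
                                   std::size_t k,
                                   std::size_t max_mismatches) {
    std::size_t hits = 0;
    for (std::size_t pos = 0; pos + k <= query.size(); ++pos) {
        auto it = index.find(query.substr(pos, k));
        if (it == index.end()) continue;
        for (std::size_t read_id : it->second) {
            const std::string& read = reads[read_id];
            if (pos + read.size() > query.size()) continue;  // read would overhang
            std::size_t mismatches = 0;
            for (std::size_t j = k; j < read.size() && mismatches <= max_mismatches; ++j) {
                if (read[j] != query[pos + j]) ++mismatches;
            }
            if (mismatches <= max_mismatches) ++hits;
        }
    }
    return hits;
}
```

Keying on a short prefix keeps the table small and defers the expensive comparison to the few reads that share a seed with the query, which is the usual motivation for indexing reads before alignment.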

Binary File
Since most fastq files are far too large to hold in memory, we first convert the fastq file to a binary file that allows fast access later, during scoring and alignment. The binary file is generated sequentially and preserves the order of the reads in the original fastq file. The first char of each stored read gives the length of the read that follows. The bases after it are packed 4 bp per char (2 bits per bp, 8 bits per char). Each read occupies a fixed number of chars determined by the program’s maximum read length (default 100, but user-adjustable), e.g. 25 chars for a 100 bp read. Please refer to SeqUMich’s UROP poster for further clarification.
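
A minimal sketch of the 2-bit packing described above, under stated assumptions: the base-to-code mapping (A=0, C=1, G=2, T=3), the treatment of ambiguous bases such as N, the bit order within a char, and the layout of one length char followed by a separate fixed-size payload are all guesses, and the function names are hypothetical. The actual format is defined by binary_file2.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Assumed 2-bit codes; the real SeqUMich mapping may differ.
inline std::uint8_t base_code(char base) {
    switch (base) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default:            return 0;  // assumption: N and other symbols stored as A
    }
}

// Size in chars of one record: 1 length char + ceil(max_len / 4) payload chars.
inline std::size_t record_size(std::size_t max_len) {
    return 1 + (max_len + 3) / 4;
}

// Fixed-size records allow direct seeking to the i-th read in the file.
inline std::size_t record_offset(std::size_t read_index, std::size_t max_len) {
    return read_index * record_size(max_len);
}

// Packs one read into a zero-padded, fixed-size record.
std::vector<std::uint8_t> pack_read(const std::string& read, std::size_t max_len) {
    assert(max_len <= 255 && read.size() <= max_len);  // length must fit in one char
    std::vector<std::uint8_t> record(record_size(max_len), 0);
    record[0] = static_cast<std::uint8_t>(read.size());
    for (std::size_t i = 0; i < read.size(); ++i) {
        const unsigned shift = 2 * static_cast<unsigned>(i % 4);
        record[1 + i / 4] |= static_cast<std::uint8_t>(base_code(read[i]) << shift);
    }
    return record;
}

// Recovers the bases from a record produced by pack_read.
std::string unpack_read(const std::vector<std::uint8_t>& record) {
    static const char kBases[4] = {'A', 'C', 'G', 'T'};
    const std::size_t length = record[0];
    std::string read(length, 'A');
    for (std::size_t i = 0; i < length; ++i) {
        const unsigned shift = 2 * static_cast<unsigned>(i % 4);
        read[i] = kBases[(record[1 + i / 4] >> shift) & 3u];
    }
    return read;
}
```

Under these assumptions, a 100 bp maximum gives 25 payload chars per read, and because every record has the same size, the read at position i can be located by multiplying i by the record size instead of scanning the file.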

Hash Table

Alignment & Scoring

Sample command line

The following command builds a binary file (-i), then hashes the fastq file and performs scoring and alignment (hashing and alignment always run).

./main -r short.fastq -t refMrna.fa -i binary_reads_short.bin -m 100 -b 0 > output_short.txt

The following command does not build a binary file; instead it uses the provided binary file (-o) for read indexing. As always, the program still hashes the fastq file and performs scoring and alignment.

./main -r short.fastq -t refMrna.fa -o binary_reads_short.bin -m 100 -b 0 > output_short.txt
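
To make the difference between the two invocations concrete, here is a hedged sketch of how the flags could be interpreted. It is not taken from main.cpp: the `Options` struct and `parse_options` function are hypothetical, and the -m and -b values are kept as uninterpreted strings because this README does not say what they mean.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container for the flags shown in the sample commands.
struct Options {
    std::string reads_path;       // -r: fastq reads file
    std::string targets_path;     // -t: .fa file of query strings
    std::string binary_to_build;  // -i: binary file to create (empty if absent)
    std::string binary_to_reuse;  // -o: existing binary file to use (empty if absent)
    std::string m_value;          // -m: meaning not documented here
    std::string b_value;          // -b: meaning not documented here
};

// Parses "flag value" pairs; args excludes the program name.
Options parse_options(const std::vector<std::string>& args) {
    if (args.size() % 2 != 0) {
        throw std::invalid_argument("every flag needs a value");
    }
    Options opts;
    for (std::size_t i = 0; i < args.size(); i += 2) {
        const std::string& flag = args[i];
        const std::string& value = args[i + 1];
        if (flag == "-r")      opts.reads_path = value;
        else if (flag == "-t") opts.targets_path = value;
        else if (flag == "-i") opts.binary_to_build = value;
        else if (flag == "-o") opts.binary_to_reuse = value;
        else if (flag == "-m") opts.m_value = value;
        else if (flag == "-b") opts.b_value = value;
        else throw std::invalid_argument("unknown flag: " + flag);
    }
    // Assumption: building and reusing a binary file are mutually exclusive.
    if (!opts.binary_to_build.empty() && !opts.binary_to_reuse.empty()) {
        throw std::invalid_argument("use either -i or -o, not both");
    }
    return opts;
}
```

In this sketch the first sample command fills `binary_to_build` and leaves `binary_to_reuse` empty, while the second command does the reverse.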

Future improvements

This program still has much room for improvement, and several features still need to be implemented:
Saving the hash table of an indexed fastq file (clarification: essentially saving the prefix and suffix tables of our program).
Debugging out-of-bounds errors when dealing with large .fa files.
Removing redundant comparisons of the same reads and sections of a target string.
Improving the complexity of the hash function for indexing the read files.
Improving the user command-line interface.

For further information, contact:
Sam Lu: [email protected]
Charles Wang: [email protected]