logLevel |
String |
all, trace, debug, info, warn, error, fatal, off |
info |
Only the log levels error, warn, info and trace are used by Genestrip. |
all |
logProgressUpdateCycle |
long |
[0, 2147483647] |
1000000 |
Affects the log level trace: Defines the number of reads per fastq file after which information on the matching progress is logged. If less than 1, no progress information is logged. |
match, matchlr, filter |
threads |
int |
[-1, 64] |
-1 |
The number of consumer threads n when processing data for the goals match and filter, and also during the update phase of the db goal. There is always one additional thread that reads and uncompresses the corresponding fastq or fasta file (so there are n + 1 threads in total). When negative, the number of available processors - 1 is used as n. When 0, the corresponding goals run in single-threaded mode. |
db, match, matchlr, filter |
httpBaseURL |
String |
|
https://ftp.ncbi.nlm.nih.gov |
This base URL will be extended by /pub/taxonomy/ in order to download the taxonomy file taxdmp.zip and by /genomes/genbank for files from Genbank. |
db |
ftpBaseURL |
String |
|
ftp.ncbi.nih.gov |
|
db |
refseqHttpBaseURL |
String |
|
https://ftp.ncbi.nlm.nih.gov/refseq |
This mirror might be considered as an alternative. (No other mirror sites are known.) |
db |
refseqFTPBaseURL |
String |
|
ftp.ncbi.nih.gov |
|
db |
useHttp |
boolean |
|
true |
Use http(s) to download data from NCBI. If false, then Genestrip will use anonymous FTP instead (with login and password set to anonymous). |
db |
ignoreMissingFastas |
boolean |
|
false |
If true, then a download of files from NCBI will not stop if a file is missing on the server. |
db |
maxDownloadTries |
int |
[1, 1024] |
5 |
The number of download attempts for a file before giving up. |
db |
seqType |
nominal |
GENOMIC, RNA, M_RNA, ALL_RNA, ALL |
GENOMIC |
Which type of sequence files to include from the RefSeq. RNA files from the RefSeq end with rna.fna.gz, whereas genomes end with genomic.fna.gz. |
db |
rankCompletionDepth |
nominal |
superkingdom, kingdom, phylum, subphylum, superclass, class, subclass, superorder, order, suborder, superfamily, family, subfamily, clade, genus, subgenus, species group, species, varietas, subspecies, serogroup, biotype, strain, serotype, genotype, forma, forma specialis, isolate, no rank |
no rank |
The rank down to which tax ids from taxids.txt will be completed by descendants in the taxonomy tree (the set rank included). If not set, the completion will traverse down to the lowest possible levels of the taxonomy. Typical values could be genus, species or strain, but all values used for assigning ranks in the taxonomy are possible. |
db |
maxGenomesPerTaxid |
int |
[1, 2147483647] |
2147483647 |
The maximum number of genomes per tax id from the RefSeq to be included in the database. Note that this is an important parameter for controlling database size, because in some cases there are millions of genomic entries for a single tax id, such as 573 (and this does not even account for entries of its descendants). |
db |
completeGenomesOnly |
boolean |
|
false |
If true, then only genomic accessions with the prefixes AC, NC_, NZ_ will be considered when generating a database. Otherwise, all genomic accessions will be considered. See RefSeq accession numbers and molecule types for details. |
db |
refSeqLimitForGenbankAccess |
int |
[0, 2147483647] |
0 |
Determines whether Genestrip should try to look up genomic fasta files from Genbank if the number of corresponding reference genomes from the RefSeq is below the given limit for a requested tax id. E.g., refSeqLimitForGenbankAccess=1 would imply that Genbank is consulted if not a single reference genome is found in the RefSeq for a requested tax id. The default refSeqLimitForGenbankAccess=0 essentially deactivates this feature. In addition, Genbank access is also influenced by the keys fastaQualities and maxFromGenBank (see below). |
db |
maxFromGenBank |
int |
[-1, 2147483647] |
1 |
Determines the maximum number of fasta files used from Genbank per requested tax id. If the number of matching files exceeds maxFromGenBank, then the best ones according to fastaQualities will be retained so that this maximum is still respected. |
db |
fastaQualities |
list of nominals |
ADDITIONAL, COMPLETE_LATEST, COMPLETE, CHROMOSOME_LATEST, CHROMOSOME, SCAFFOLD_LATEST, SCAFFOLD, CONTIG_LATEST, CONTIG, LATEST, NONE |
COMPLETE_LATEST,CHROMOSOME_LATEST |
Determines the allowed quality levels of fasta files from Genbank. The values must be comma-separated. If a value is included in the list, then a fasta file for a requested tax id on that quality level will be included, otherwise not (while also respecting the conditions exerted via the keys refSeqLimitForGenbankAccess and maxFromGenBank). The quality levels are based on Genbank's Assembly Summary File (columns version_status and assembly_level). |
db |
kMerSize |
int |
[15, 64] |
31 |
The number of base pairs k for k-mers. Changes to this value do not affect the memory usage of the database. A value > 32 will cause collisions, i.e., it leads to false positives for the match goal. |
db |
maxDust |
int |
[-1, 2147483647] |
-1 |
When generating a database via the goal db, any low-complexity k-mer with too many repetitive sequences of base pairs may be omitted from storage. To do so, Genestrip employs a simple genetic dust-filter for k-mers: It assigns a dust value d to each k-mer, and if d > maxDust, then the k-mer will not be stored. Given a k-mer with n runs of repeating base pairs of lengths k(1), ..., k(n) with k(i) > 1, d = fib(k(1)) + ... + fib(k(n)), where fib(k(i)) is the Fibonacci number of k(i). E.g., for the 8-mer TTTCGGTC, we have n = 2 with k(1) = 3, k(2) = 2 and d = fib(3) + fib(2) = 2 + 1 = 3. For practical purposes, maxDust = 20 may be suitable. In this case, if 31-mers were generated uniformly at random, then about 0.2 % of them would be omitted. If maxDust = -1, then dust-filtering is inactive. |
db |
classifyReads |
boolean |
|
true |
Whether to do read classification in the style of Kraken and KrakenUniq. Matching is faster without read classification, and the columns kmers, unique kmers and max contig length in resulting CSV files are usually more conclusive anyway, in particular with respect to long reads. When read classification is off, the columns reads and kmers from reads will be 0 in resulting CSV files. |
match |
countUniqueKMers |
boolean |
|
true |
If true, the number of unique k-mers will be counted and reported. This requires less than 5% of additional main memory. |
match, matchlr |
writeFilteredFastq |
boolean |
|
false |
If true, then the goal match writes filtered fastq files in the same way that the goal filter does. |
match, matchlr |
writeKrakenStyleOut |
boolean |
|
false |
If true, Genestrip will write output files with suffix .out in the Kraken output format under <base dir>/projects/<project_name>/krakenout covering all reads with at least one matching k-mer. |
match, matchlr |
normalizedKMersFactor |
long |
[1, 9223372036854775807] |
1000000000 |
A factor used to compute normalized kmers at read analysis time. |
match, matchlr |
useBloomFilterForMatch |
boolean |
|
true |
If true, a bloom filter will be loaded and used during fastq file analysis (i.e. matching). Using the bloom filter tends to shorten matching time if most of the reads cannot be classified because they contain no k-mers from the database. Otherwise, using the bloom filter might increase matching time by up to 30%. It also requires more main memory. |
match, matchlr |
maxReadTaxErrorCount |
double |
[0.0, 1.7976931348623157E308] |
0.5 |
The absolute or relative maximum number of a read's k-mers that may be missing from the database for the read to still be classified. If the number is above maxReadTaxErrorCount, then the read will not be classified. Otherwise, the read will be classified in the same way as done by Kraken. If maxReadTaxErrorCount is >= 1, then it is interpreted as an absolute number of k-mers. Otherwise (i.e., if >= 0 and < 1), it is interpreted as the ratio between the k-mers not in the database and all k-mers of the read. |
match, matchlr |
maxKMerResCounts |
int |
[0, 65536] |
0 |
If > 0, the corresponding number of frequencies of the most frequent k-mers per tax id will be reported. |
match, matchlr |
writeDumpedFastq |
boolean |
|
false |
If true, then filter will also generate a fastq file dumped_... with all reads not written to the corresponding filtered fastq file. |
filter |
minPosCountFilter |
int |
[0, 1024] |
1 |
The minimum number of a read's k-mers to be found in the bloom index such that the read is added to the filtered fastq file. If minPosCountFilter=0, then posRatioFilter becomes effective. |
filter |
posRatioFilter |
double |
[0.0, 1.0] |
0.2 |
Only effective if minPosCountFilter=0: The minimum ratio of a read's k-mers to be found in the bloom index such that the read is added to the filtered fastq file. |
filter |
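
The dust value described for maxDust in the table above can be sketched in a few lines. This is a minimal Python illustration of the formula as stated there, not Genestrip's actual implementation. The function names `fib`, `dust_value` and `is_stored` are hypothetical. Two points are assumptions inferred from the TTTCGGTC example: the Fibonacci indexing fib(1) = fib(2) = 1, and the reading of "repeating base pairs" as runs of identical consecutive bases.

```python
from itertools import groupby


def fib(n: int) -> int:
    """Fibonacci number with fib(1) = fib(2) = 1 (indexing assumed from the example)."""
    a, b = 1, 1
    for _ in range(n - 2):
        a, b = b, a + b
    return b


def dust_value(kmer: str) -> int:
    """Sum fib(run length) over all runs of identical bases with length > 1."""
    run_lengths = (len(list(run)) for _, run in groupby(kmer))
    return sum(fib(length) for length in run_lengths if length > 1)


def is_stored(kmer: str, max_dust: int) -> bool:
    """Keep a k-mer if filtering is inactive (max_dust == -1) or d <= max_dust."""
    return max_dust == -1 or dust_value(kmer) <= max_dust


if __name__ == "__main__":
    # TTT has length 3 and GG has length 2, so d = fib(3) + fib(2) = 2 + 1.
    print(dust_value("TTTCGGTC"))
    # A 31-mer consisting of a single base is far above maxDust = 20.
    print(is_stored("A" * 31, 20))
```

Because Fibonacci numbers grow exponentially, a single long run dominates the dust value, whereas many short runs add up only slowly. That is consistent with the table's statement that only a small share of random 31-mers would exceed maxDust = 20.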