ConfigParams.md

| Name | Type | Value Range | Default | Description | For Goals |
| --- | --- | --- | --- | --- | --- |
| logLevel | String | all, trace, debug, info, warn, error, fatal, off | info | Only the log levels error, warn, info and trace are used by Genestrip. | all |
| logProgressUpdateCycle | long | [0, 2147483647] | 1000000 | Affects the log level trace: Defines after how many reads per fastq file information on the matching progress is logged. If less than 1, then no progress information is logged. | match, matchlr, filter |
| threads | int | [-1, 64] | -1 | The number of consumer threads n when processing data with respect to the goals match and filter, and also during the update phase of the db goal. There is always one additional thread that reads and uncompresses a corresponding fastq or fasta file (so it is n + 1 threads in total). When negative, the number of available processors - 1 is used as n. When 0, the corresponding goals run in single-threaded mode. | db, match, matchlr, filter |
| httpBaseURL | String | | https://ftp.ncbi.nlm.nih.gov | This base URL will be extended by /pub/taxonomy/ in order to download the taxonomy file taxdmp.zip and by /genomes/genbank for files from Genbank. | db |
| ftpBaseURL | String | | ftp.ncbi.nih.gov | | db |
| refseqHttpBaseURL | String | | https://ftp.ncbi.nlm.nih.gov/refseq | This mirror might be considered as an alternative. (No other mirror sites are known.) | db |
| refseqFTPBaseURL | String | | ftp.ncbi.nih.gov | | db |
| useHttp | boolean | | true | Use http(s) to download data from NCBI. If false, then Genestrip will do anonymous FTP instead (with login and password set to anonymous). | db |
| ignoreMissingFastas | boolean | | false | If true, then a download of files from NCBI will not stop in case a file is missing on the server. | db |
| maxDownloadTries | int | [1, 1024] | 5 | The number of download attempts for a file before giving up. | db |
| seqType | nominal | GENOMIC, RNA, M_RNA, ALL_RNA, ALL | GENOMIC | Which type of sequence files to include from the RefSeq. RNA files from the RefSeq end with rna.fna.gz, whereas genomes end with genomic.fna.gz. | db |
| rankCompletionDepth | nominal | superkingdom, kingdom, phylum, subphylum, superclass, class, subclass, superorder, order, suborder, superfamily, family, subfamily, clade, genus, subgenus, species group, species, varietas, subspecies, serogroup, biotype, strain, serotype, genotype, forma, forma specialis, isolate, no rank | no rank | The rank up to which tax ids from taxids.txt will be completed by descendants of the taxonomy tree (the set rank included). If not set, the completion will traverse down to the lowest possible levels of the taxonomy. Typical values could be genus, species or strain, but all values used for assigning ranks in the taxonomy are possible. | db |
| maxGenomesPerTaxid | int | [1, 2147483647] | 2147483647 | The maximum number of genomes per tax id from the RefSeq to be included in the database. Note that this is an important parameter to control database size, because in some cases, there are millions of genomic entries for a tax id such as for 573 (which does not even account for entries of its descendants). | db |
| completeGenomesOnly | boolean | | false | If true, then only genomic accessions with the prefixes AC, NC_, NZ_ will be considered when generating a database. Otherwise, all genomic accessions will be considered. See RefSeq accession numbers and molecule types for details. | db |
| refSeqLimitForGenbankAccess | int | [0, 2147483647] | 0 | Determines whether Genestrip should try to look up genomic fasta files from Genbank if the number of corresponding reference genomes from the RefSeq is below the given limit for a requested tax id. E.g. refSeqLimitForGenbankAccess=1 would imply that Genbank is consulted if not a single reference genome is found in the RefSeq for a requested tax id. The default refSeqLimitForGenbankAccess=0 essentially deactivates this feature. In addition, Genbank access is also influenced by the keys fastaQualities and maxFromGenBank (see below). | db |
| maxFromGenBank | int | [-1, 2147483647] | 1 | Determines the maximum number of fasta files used from Genbank per requested tax id. If the corresponding number of matching files exceeds maxFromGenBank, then the best ones according to fastaQualities will be retained to still match this maximum. | db |
| fastaQualities | list of nominals | ADDITIONAL, COMPLETE_LATEST, COMPLETE, CHROMOSOME_LATEST, CHROMOSOME, SCAFFOLD_LATEST, SCAFFOLD, CONTIG_LATEST, CONTIG, LATEST, NONE | COMPLETE_LATEST,CHROMOSOME_LATEST | Determines the allowed quality levels of fasta files from Genbank. The values must be comma-separated. If a corresponding value is included in the list, then a fasta file for a requested tax id on that quality level will be included, otherwise not (while also respecting the conditions exerted via the keys refSeqLimitForGenbankAccess and maxFromGenBank). The quality levels are based on Genbank's Assembly Summary File (columns version_status and assembly_level). | db |
| kMerSize | int | [15, 64] | 31 | The number of base pairs k for k-mers. Changes to this value do not affect the memory usage of the database. A value > 32 will cause collisions, i.e. lead to false positives for the match goal. | db |
| maxDust | int | [-1, 2147483647] | -1 | When generating a database via the goal db, any low-complexity k-mer with too many repetitive sequences of base pairs may be omitted from storage. To do so, Genestrip employs a simple genetic dust-filter for k-mers: It assigns a dust value d to each k-mer, and if d > maxDust, then the k-mer will not be stored. Given a k-mer with n runs of repeating base pairs of repeat lengths k(1), ..., k(n) with k(i) > 1, then d = fib(k(1)) + ... + fib(k(n)), where fib(k(i)) is the Fibonacci number of k(i). E.g., for the 8-mer TTTCGGTC, we have n = 2 with k(1) = 3, k(2) = 2 and d = fib(3) + fib(2) = 2 + 1 = 3. For practical purposes, maxDust = 20 may be suitable. In this case, if 31-mers were generated uniformly at random, then about 0.2 % of them would be omitted. If maxDust = -1, then dust-filtering is inactive. | db |
| classifyReads | boolean | | true | Whether to do read classification in the style of Kraken and KrakenUniq. Matching is faster without read classification, and the columns kmers, unique kmers and max contig length in resulting CSV files are usually more conclusive anyway, in particular with respect to long reads. When read classification is off, the columns reads and kmers from reads will be 0 in resulting CSV files. | match |
| countUniqueKMers | boolean | | true | If true, the number of unique k-mers will be counted and reported. This requires less than 5% of additional main memory. | match, matchlr |
| writeFilteredFastq | boolean | | false | If true, then the goal match writes filtered fastq files in the same way that the goal filter does. | match, matchlr |
| writeKrakenStyleOut | boolean | | false | If true, Genestrip will write output files with suffix .out in the Kraken output format under `<base dir>/projects/<project_name>/krakenout` covering all reads with at least one matching k-mer. | match, matchlr |
| normalizedKMersFactor | long | [1, 9223372036854775807] | 1000000000 | A factor used to compute normalized kmers at read analysis time. | match, matchlr |
| useBloomFilterForMatch | boolean | | true | If true, a bloom filter will be loaded and used during fastq file analysis (i.e. matching). Using the bloom filter tends to shorten matching time if most of the reads cannot be classified because they contain no k-mers from the database. Otherwise, using the bloom filter might increase matching time by up to 30%. It also requires more main memory. | match, matchlr |
| maxReadTaxErrorCount | double | [0.0, 1.7976931348623157E308] | 0.5 | The absolute or relative maximum number of k-mers that do not have to be in the database for a read to be classified. If the number is above maxReadTaxErrorCount, then the read will not be classified. Otherwise the read will be classified in the same way as done by Kraken. If maxReadTaxErrorCount is >= 1, then it is interpreted as an absolute number of k-mers. Otherwise (and so, if >= 0 and < 1), it is interpreted as the ratio between the k-mers not in the database and all k-mers of the read. | match, matchlr |
| maxKMerResCounts | int | [0, 65536] | 0 | If > 0, the corresponding number of frequencies of the most frequent k-mers per tax id will be reported. | match, matchlr |
| writeDumpedFastq | boolean | | false | If true, then filter will also generate a fastq file dumped_... with all reads not written to the corresponding filtered fastq file. | filter |
| minPosCountFilter | int | [0, 1024] | 1 | The minimum number of a read's k-mers to be found in the bloom index such that the read is added to the filtered fastq file. If minPosCountFilter=0, then posRatioFilter becomes effective. | filter |
| posRatioFilter | double | [0.0, 1.0] | 0.2 | Only effective if minPosCountFilter=0: The minimum ratio of a read's k-mers to be found in the bloom index such that the read is added to the filtered fastq file. | filter |
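
The dust value described for maxDust can be illustrated with a small Python sketch. It is not Genestrip's implementation. The function names `fib`, `dust_value` and `is_stored` are hypothetical, and the sketch assumes the convention fib(1) = fib(2) = 1, which matches the worked example d = fib(3) + fib(2) = 2 + 1 = 3.

```python
from itertools import groupby


def fib(n: int) -> int:
    """Fibonacci number with fib(1) = fib(2) = 1 (assumed convention)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def dust_value(kmer: str) -> int:
    """Sum fib(run length) over all runs of identical bases longer than 1."""
    run_lengths = (len(list(group)) for _, group in groupby(kmer))
    return sum(fib(length) for length in run_lengths if length > 1)


def is_stored(kmer: str, max_dust: int) -> bool:
    """Hypothetical helper: a k-mer is dropped only if d > maxDust.

    maxDust = -1 switches the filter off, as stated in the table.
    """
    if max_dust == -1:
        return True
    return dust_value(kmer) <= max_dust


# The 8-mer from the table: runs TTT (3) and GG (2) count, single bases do not.
print(dust_value("TTTCGGTC"))  # 3
```

With maxDust = 20, the example 8-mer would be kept, whereas a k-mer with one long homopolymer run quickly exceeds the limit because the Fibonacci numbers grow exponentially with run length.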
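
The two interpretations of maxReadTaxErrorCount (absolute for values >= 1, relative otherwise) can be sketched as follows. This only covers the threshold test that decides whether a read is eligible for classification, not the Kraken-style classification itself. The function name `passes_error_threshold` and its arguments are hypothetical, and treating a read without any k-mers as not eligible is an assumption.

```python
def passes_error_threshold(kmers_not_in_db: int,
                           total_kmers: int,
                           max_read_tax_error_count: float = 0.5) -> bool:
    """Return True if the read may still be classified (sketch only)."""
    if total_kmers <= 0:
        # Assumption: a read without k-mers cannot be classified.
        return False
    if max_read_tax_error_count >= 1:
        # Absolute interpretation: number of k-mers missing from the database.
        errors = float(kmers_not_in_db)
    else:
        # Relative interpretation: missing k-mers divided by all k-mers of the read.
        errors = kmers_not_in_db / total_kmers
    # The read is rejected only if the value is above the configured maximum.
    return errors <= max_read_tax_error_count


# Default 0.5: up to half of a read's k-mers may be missing from the database.
print(passes_error_threshold(60, 120))        # True
print(passes_error_threshold(61, 120))        # False
# Value 3: at most three k-mers may be missing, regardless of read length.
print(passes_error_threshold(4, 120, 3.0))    # False
```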
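
How minPosCountFilter and posRatioFilter interact for the goal filter can be sketched in a few lines of Python. The function name `keep_read` and its arguments are hypothetical, and the handling of a read without k-mers is an assumption. The sketch only mirrors the rule from the table: the count is used when minPosCountFilter > 0, otherwise the ratio is used.

```python
def keep_read(kmers_in_bloom_index: int,
              total_kmers: int,
              min_pos_count_filter: int = 1,
              pos_ratio_filter: float = 0.2) -> bool:
    """Decide whether a read goes into the filtered fastq file (sketch only)."""
    if min_pos_count_filter > 0:
        # Count-based rule: enough of the read's k-mers were found in the index.
        return kmers_in_bloom_index >= min_pos_count_filter
    if total_kmers <= 0:
        # Assumption: a read without k-mers is never kept.
        return False
    # Ratio-based rule, only effective if minPosCountFilter = 0.
    return kmers_in_bloom_index / total_kmers >= pos_ratio_filter


print(keep_read(1, 100))                           # True: one hit suffices by default
print(keep_read(19, 100, min_pos_count_filter=0))  # False: 0.19 < 0.2
print(keep_read(20, 100, min_pos_count_filter=0))  # True: 0.2 >= 0.2
```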
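
The rule for deriving the number of consumer threads n from the threads parameter can be sketched as below. The function name `consumer_threads` is hypothetical, and `os.cpu_count()` merely stands in for whatever Genestrip uses to determine the number of available processors. Clamping the result at 0 on a single-processor machine is an assumption.

```python
import os
from typing import Optional


def consumer_threads(threads: int = -1,
                     available_processors: Optional[int] = None) -> int:
    """Return n, the number of consumer threads (0 means single-threaded mode)."""
    if not -1 <= threads <= 64:
        raise ValueError("threads must be in the range [-1, 64]")
    if available_processors is None:
        available_processors = os.cpu_count() or 1
    if threads < 0:
        # Negative value: use the number of available processors minus one.
        return max(available_processors - 1, 0)
    return threads


print(consumer_threads(-1, available_processors=8))  # 7
print(consumer_threads(4, available_processors=8))   # 4
print(consumer_threads(0, available_processors=8))   # 0
```

For n > 0, the table states that one additional thread reads and uncompresses the input file, so n + 1 threads are active in total.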