This class comprises static methods used for reading and writing files.
@@ -113,50 +113,58 @@
Main class of MUSIAL (MUlti Sample varIant AnaLysis), a tool to calculate SNV, gene, and whole genome alignments,
together with other relevant statistics based on vcf files.
Initializes a Feature object with the specified parameters and adds it to the features list.
@@ -401,8 +363,7 @@
addFeature
name - String; The internal name to use for the feature.
matchKey - String; The key of the attribute in the specified .gff format reference annotation to match the feature from.
matchValue - String; The value of the attribute in the specified .gff format reference annotation to match the feature from.
-
asCds - Boolean; Whether to consider feature as cds, independent of provided structure.
-
pdbFile - File; Optional object pointing to a .pdb format file yielding a protein structure derived for the (gene) feature.
+
coding - Boolean; Whether to consider feature as cds, independent of provided structure.
annotations - HashMap of String key/value pairs; feature meta information.
Throws:
MusialException - If the initialization of the Feature fails; if the specified .gff reference annotation or .pdb protein file cannot be read; if the specified feature is not found or is parsed multiple times from the reference annotation.
ConcurrentSkipListMap of String, File pairs. Each key specifies one feature on the
passed reference `FASTA` to be analyzed. Each assigned value points to an optional `PDB` file per feature and
induces the analysis variants allocated to the protein structure; values are expected to be `null` or `""` if
- variant to protein structure allocation shall not be run.
+ variantInformation to protein structure allocation shall not be run.
- Variants are extracted wrt. the query versus the target sequence and formatted as p@r@a, where p is the variant
+ Variants are extracted wrt. the query versus the target sequence and formatted as p@r@a, where p is the variantInformation
position, r is the reference content and a is the content of the single alternate allele.
Parameters:
@@ -737,10 +737,10 @@
inferProteoform
musialStorage.
- The computed variants maps keys are formatted as `x+y`, where x is the 1-based indexed position in the
- reference sequence at which the variant occurs and y is the 1-based indexed number of inserted positions after
+ reference sequence at which the variantInformation occurs and y is the 1-based indexed number of inserted positions after
this position.
- The computed variants maps values are the respective single letter code amino-acid contents of the variants
- including the content at the variant position.
+ including the content at the variantInformation position.
Parameters:
musialStorage - VariantsDictionary instance whose content is used to infer proteoform information.
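The `x+y` key format described above can be handled with a small helper. The following is a minimal sketch only: the class `PositionKeySketch` and its members are hypothetical and not part of MUSIAL, and it assumes that `y = 0` denotes the reference position itself while `y > 0` denotes inserted positions after it.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Hypothetical helper, not part of MUSIAL: parses and orders keys of the form "x+y".
final class PositionKeySketch {

    // Returns {x, y}: x is the 1-based reference position, y the index of the inserted position after x.
    // Assumption: y = 0 refers to the reference position itself.
    static int[] parse(String key) {
        String[] parts = key.split("\\+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected format x+y, got: " + key);
        }
        return new int[] { Integer.parseInt(parts[0]), Integer.parseInt(parts[1]) };
    }

    // Orders keys by reference position first and by insertion index second.
    static final Comparator<String> ORDER =
        Comparator.<String>comparingInt(k -> parse(k)[0]).thenComparingInt(k -> parse(k)[1]);

    public static void main(String[] args) {
        TreeMap<String, String> variants = new TreeMap<>(ORDER);
        variants.put("10+0", "A");
        variants.put("2+1", "K");
        variants.put("2+0", "M");
        System.out.println(variants.keySet()); // prints [2+0, 2+1, 10+0]
    }
}
```

A plain lexicographic ordering would place `10+0` before `2+0`, which is why the sketch compares the parsed integers instead of the raw strings.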
@@ -749,7 +749,7 @@
inferProteoform
sId - String specifying the SampleEntry.name of the sample of
which the proteoform should be inferred.
Returns:
-
ConcurrentSkipListMap mapping positions to variant contents.
+
ConcurrentSkipListMap mapping positions to variantInformation contents.
Throws:
MusialBioException - If any translation procedure of nucleotide sequences fails.
"\n\n build Build a local .json database (MUSIAL storage) from variant calls.\n view_features View annotated features from a built MUSIAL storage in a tabular format.\n view_samples View annotated samples from a built MUSIAL storage in a tabular format.\n view_variants View annotated variants from a built MUSIAL storage in a tabular format.\n export_table Export variants from a built MUSIAL storage into a matrix-like .tsv file.\n export_sequence Generate sequences in .fasta format from a built MUSIAL storage.\n\n\n"
+
"\n\n build Build a local database/MUSIAL storage (brotli compressed binary JSON format) from variant calls.\n view_features View annotated features from a built MUSIAL storage in a tabular format.\n view_samples View annotated samples from a built MUSIAL storage in a tabular format.\n view_variants View annotated variants from a built MUSIAL storage in a tabular format.\n export_table Export variants from a built MUSIAL storage into a matrix-like .tsv file.\n export_sequence Generate sequences in .fasta format from a built MUSIAL storage.\n\n\n"
Container to store representation of a reference sequence location that is subject to analysis.
This may be the full genome, a single gene, contigs or plasmids and chromosomes.
The type of this feature; Either coding or non_coding.
This value is used to distinguish Feature from FeatureCoding instances while parsing an existing MUSIAL storage JSON file.
Hierarchical map structure to store variants wrt. the nucleotide sequences. The first layer represents the
position on the chromosome. The second layer represents the variant content.
@@ -353,7 +326,7 @@
Constructor Details
Feature
public Feature(String name,
- String chromosome,
+ String contig,
int start,
int end,
String type)
@@ -362,8 +335,8 @@
Feature
Parameters:
name - String representing the internal name of the reference feature to analyze.
-
chromosome - String the name of the reference location (contig, chromosome, plasmid) the feature
- is located on.
+
contig - String the name of the reference location (contig, chromosome, plasmid) the feature
+ is located on.
start - Integer The 1-based indexed starting position of the feature on the reference.
end - Integer The 1-based indexed end position of the feature on the reference.
Throws:
@@ -383,7 +356,7 @@
Method Details
isCoding
public boolean isCoding()
-
Checks whether this instance of Feature is an instance of FeatureCoding, i.e., declared as a coding feature.
+
Return whether this instance of Feature is declared as a coding feature.
Hierarchical map structure to store variants wrt. proteoforms. The first layer represents the
position on the feature. The second layer represents the variant content.
Constructs a String yielding information about the following properties wrt. a single sample, separated by a `|` symbol:
- - If the variant was rejected, i.e. failed any filter criteria.
- - If the variant is primary, i.e. has the highest frequency in the case of a het. variant.
- - The quality of the variant call.
- - The frequency wrt. coverage of the variant call.
- - The total coverage at the variant site.
+ - If the variantInformation was rejected, i.e. failed any filter criteria.
+ - If the variantInformation is primary, i.e. has the highest frequency in the case of a het. variantInformation.
+ - The quality of the variantInformation call.
+ - The frequency wrt. coverage of the variantInformation call.
+ - The total coverage at the variantInformation site.
Parameters:
-
rejected - Boolean whether the variant was rejected.
-
primary - Boolean whether the variant is a primary variant.
-
quality - Double; The quality of the variant call.
-
frequency - Double; The frequency of the variant call wrt. coverage.
-
coverage - Double; The total coverage at the variant site.
+
rejected - Boolean whether the variantInformation was rejected.
+
primary - Boolean whether the variantInformation is a primary variantInformation.
+
quality - Double; The quality of the variantInformation call.
+
frequency - Double; The frequency of the variantInformation call wrt. coverage.
+
coverage - Double; The total coverage at the variantInformation site.
Returns:
String; The passed parameters separated by `|` symbols.
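As an illustration of the `|`-separated layout described above, here is a minimal sketch. The class name `CallInfoSketch` is hypothetical, and the way booleans and doubles are rendered (via `String.valueOf`) is an assumption; the actual MUSIAL formatting may differ.

```java
// Hypothetical sketch, not the MUSIAL implementation: joins call properties with the `|` symbol.
final class CallInfoSketch {

    // Builds "rejected|primary|quality|frequency|coverage" in the order listed in the description.
    static String build(boolean rejected, boolean primary, double quality, double frequency, double coverage) {
        return String.join(
            "|",
            String.valueOf(rejected),
            String.valueOf(primary),
            String.valueOf(quality),
            String.valueOf(frequency),
            String.valueOf(coverage)
        );
    }

    public static void main(String[] args) {
        System.out.println(build(false, true, 60.0, 0.98, 42.0)); // prints false|true|60.0|0.98|42.0
    }
}
```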
Third level: CALL;DP;ALT:AD,... with CALL as one of N (ambiguous), ? (no coverage), or the index of the alternative call. DP, AD as defined in VCF specification (samtools.github.io/hts-specs/VCFv4.2.pdf).
+
vcfFileReader
public transient htsjdk.variant.vcf.VCFFileReader vcfFileReader
-
VCFFileReader instance pointing to the vcf file of this sample.
+
VCFFileReader instance pointing to the vcf file of this sample. No permanent storage in MUSIAL session.
public final java.util.concurrent.ConcurrentSkipListMap<java.lang.Integer,java.util.concurrent.ConcurrentSkipListMap<java.lang.String,NucleotideVariantAnnotationEntry>> variants
Hierarchical map structure to store variants wrt. maintained features and samples. The first layer represents the
- position on the chromosome. The second layer represents the variant content.
+ position on the chromosome. The second layer represents the variantInformation content.
minCoverage - Double; Minimal read coverage to use for variant filtering.
-
minFrequency - Double; Minimal form frequency to use for hom. variant filtering.
-
minHet - Double; Minimal form frequency to use for het. variant filtering.
-
maxHet - Double; Maximal form frequency to use for het. variant filtering.
-
minQuality - Double; Minimal Phred scaled genotyping call quality to use for variant filtering.
+
minCoverage - Double; Minimal read coverage to use for variantInformation filtering.
+
minFrequency - Double; Minimal form frequency to use for hom. variantInformation filtering.
+
minHet - Double; Minimal form frequency to use for het. variantInformation filtering.
+
maxHet - Double; Maximal form frequency to use for het. variantInformation filtering.
+
minQuality - Double; Minimal Phred scaled genotyping call quality to use for variantInformation filtering.
chromosome - String; Name of the chromosome on which maintained FeatureEntrys are located;
Has to reflect the value in the used input .vcf, .fasta and .gff files.
Adds information about a variantInformation to variants.
Parameters:
-
featureId - String; The internal name of the feature for which this variant was detected.
+
featureId - String; The internal name of the feature for which this variantInformation was detected.
referencePosition - Integer; The position on the reference chromosome (1-based indexing).
-
variantContent - String; The alternate content of the variant; Single letter contents reflect
+
variantContent - String; The alternate content of the variantInformation; Single letter contents reflect
SNVs, multi letter contents reflect SVs.
-
referenceContent - String; The reference content of the variant; The length in relation to the
+
referenceContent - String; The reference content of the variantInformation; The length in relation to the
variantContent parameter indicates whether an SV is an insertion or deletion.
-
sampleId - String; The internal name of the sample of which the variant was called.
-
isPrimary - Boolean; Whether the variant call is a primary call, i.e. if it is the variant
+
sampleId - String; The internal name of the sample of which the variantInformation was called.
+
isPrimary - Boolean; Whether the variantInformation call is a primary call, i.e. if it is the variantInformation
with the highest frequency.
-
isRejected - Boolean; Whether the variant was rejected, i.e. if it failed any filtering
+
isRejected - Boolean; Whether the variantInformation was rejected, i.e. if it failed any filtering
criteria.
quality - Double; The Phred scaled GT quality of the call.
coverage - Double; The depth of coverage of the call.
IOException - Thrown if any input or output file is missing or cannot be generated (caused by any native Java method).
MusialException - If any method fails wrt. biological context, i.e. parsing of unknown symbols; if any method fails wrt. internal logic, i.e. assignment of proteins to genomes; if any input or output file is missing or cannot be generated.
Internal method to transfer an annotation value to a container map. This is used to supplement missing annotation values in any object under consideration with null values.
Internal method to construct a two-layer map structure used for export of tables and sequences.
-
- The first layer stores positions in the format x+y, where x represent true genomic positions and y represent insertion positions.
-
- The second layer stores sample names mapping to variant contents.
Updates the statistics related to genetic variants and allele frequencies based on the data stored in the provided MusialStorage.
+ This method iterates over features, samples, and proteoforms, calculates statistics such as substitution count, insertion count, deletion count,
+ ambiguous count, and allele frequencies, and updates the information accordingly.
Returns the enum constant of this class with the specified name.
+The string must match exactly an identifier used to declare an
+enum constant in this class. (Extraneous whitespace characters are
+not permitted.)
+
+
Parameters:
+
name - the name of the enum constant to be returned.
Main class of MUSIAL (MUlti Sample varIant AnaLysis), a tool to calculate SNV, gene, and whole genome alignments,
- together with other relevant statistics based on vcf files.
+
Collection of common property keys used for annotations.
Collection of common property keys used for annotations.
+
Main class of MUSIAL (MUlti Sample varIant AnaLysis), a tool to calculate SNV, gene, and whole genome alignments,
+ together with other relevant statistics based on vcf files.
- Variants are extracted wrt. the query versus the target sequence and formatted as p:r:a, where p is the variant
+ Variants are extracted wrt. the query versus the target sequence and formatted as p:r:a, where p is the variantInformation
position, r is the reference content and a is the single position alternate content.
IOException - If any error occurs while parsing the pdb file.
-
-
-
-
-
-
writeFasta
-
public static void writeFasta(File outputFile,
- ArrayList<htsjdk.samtools.util.Tuple<String,String>> sequences)
-
// TODO: Fix comment.
- Writes a fasta format file to the specified output file from a HashMap instance mapping sequences to
- lists of identifiers (used to construct the fasta entry headers).
-
- Each key of the passed map will be used to build one fasta entry. The header of the respective entry is
- constructed by joining all Strings of the value accessible via the (sequence) key with the `|` delimiter.
-
-
Parameters:
-
outputFile - File object pointing to the output fasta file.
Returns the enum constant of this class with the specified name.
+The string must match exactly an identifier used to declare an
+enum constant in this class. (Extraneous whitespace characters are
+not permitted.)
+
+
Parameters:
+
name - the name of the enum constant to be returned.
Returns an amino-acid representation of the passed `codon`.
+
+
Parameters:
+
codon - String representing the nucleotide codon to translate.
+
asAA3 - Boolean whether the codon should be translated to the amino-acid three-letter code or not.
+
includeTermination - Boolean whether a representation for translation termination (in the case of
+ three-letter code 'TER' and in the case of one-letter code 'Z') shall
+ be returned or not, i.e. an empty String is returned instead.
+
includeIncomplete - Boolean whether incomplete codons, i.e. with a length other than 3, should be
+ translated as incomplete amino-acids.
Translates a nucleotide sequence, split into codons of length three, into a single-letter amino acid sequence.
+
+
Parameters:
+
splitNucSequence - ArrayList<String> representing a nucleotide sequence that was split into codons
+ of length 3.
+
includeTermination - Boolean whether a representation for translation termination (in the case of
+ three-letter code 'TER' and in the case of one-letter code 'Z') shall
+ be returned or not, i.e. an empty String is returned instead.
+
includeIncomplete - Boolean whether an incomplete amino-acid should be added to the end if the
+ sequence contains an incomplete codon at the end.
+
Returns:
+
String representing the translated nucleotide sequence.
+
Throws:
+
MusialException - If any codon with a length different from three is detected.
Translates a nucleotide sequence into a single-letter amino acid sequence.
+
+
Parameters:
+
nucSequence - String representing a nucleotide sequence.
+
includeTermination - Boolean whether a representation for translation termination (in the case of
+ three-letter code 'TER' and in the case of one-letter code 'Z') shall
+ be returned or not, i.e. an empty String is returned instead.
+
includeIncomplete - Boolean whether an incomplete amino-acid should be added to the end if the
+ sequence contains an incomplete codon at the end.
+
asSense - Boolean whether the sequence shall be translated as sense or anti-sense.
+
Returns:
+
String representing the translated nucleotide sequence.
+
Throws:
+
MusialException - If any codon with a length different from three is detected.
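The translation step described above can be sketched as follows. This is an illustrative sketch, not the MUSIAL implementation: the class `TranslationSketch` is hypothetical, it assumes the standard genetic code, uses 'Z' for termination as stated in the parameter description, translates the sense strand only, and silently ignores a trailing incomplete codon.

```java
// Hypothetical sketch, not the MUSIAL implementation: translates a nucleotide sequence codon by codon.
final class TranslationSketch {

    // Standard genetic code in the usual TCAG ordering of first, second and third base; '*' marks stop codons.
    private static final String BASES = "TCAG";
    private static final String AMINO_ACIDS =
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    // Translates complete codons only; a trailing incomplete codon is ignored (assumption).
    static String translate(String nucSequence, boolean includeTermination) {
        String sequence = nucSequence.toUpperCase();
        StringBuilder protein = new StringBuilder();
        for (int i = 0; i + 3 <= sequence.length(); i += 3) {
            int first = BASES.indexOf(sequence.charAt(i));
            int second = BASES.indexOf(sequence.charAt(i + 1));
            int third = BASES.indexOf(sequence.charAt(i + 2));
            if (first < 0 || second < 0 || third < 0) {
                throw new IllegalArgumentException("Unknown symbol in codon at index " + i);
            }
            char aminoAcid = AMINO_ACIDS.charAt(16 * first + 4 * second + third);
            if (aminoAcid == '*') {
                // Termination is represented as 'Z' or skipped entirely.
                if (includeTermination) {
                    protein.append('Z');
                }
            } else {
                protein.append(aminoAcid);
            }
        }
        return protein.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("ATGGCCTAA", true));  // prints MAZ
        System.out.println(translate("ATGGCCTAA", false)); // prints MA
    }
}
```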
+
+
+
+
+
+
reverseComplement
+
public static String reverseComplement(String sequence)
+
Returns the reverse complement of the passed nucleotide sequence.
+
+
Parameters:
+
sequence - String representing a nucleotide sequence.
+
Returns:
+
String representing the reverse complement of the passed sequence.
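A reverse complement as described above can be sketched in a few lines. The class `ReverseComplementSketch` is hypothetical and not the MUSIAL code; mapping every symbol other than A, C, G and T to 'N' is an assumption made for this illustration.

```java
// Hypothetical sketch, not the MUSIAL implementation: reverse complement of a DNA sequence.
final class ReverseComplementSketch {

    // Walks the sequence from its end to its start and appends the complement of each base.
    static String reverseComplement(String sequence) {
        StringBuilder result = new StringBuilder(sequence.length());
        for (int i = sequence.length() - 1; i >= 0; i--) {
            result.append(complement(sequence.charAt(i)));
        }
        return result.toString();
    }

    // Complements a single base; unknown symbols are mapped to 'N' (assumption).
    static char complement(char base) {
        switch (Character.toUpperCase(base)) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            default: return 'N';
        }
    }

    public static void main(String[] args) {
        System.out.println(reverseComplement("ATGC")); // prints GCAT
    }
}
```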
+ - Uses a simple substitution matrix that scores matches with 1 and mismatches with -1.
+ - Uses a gap open and extension penalty of -2 and -1, respectively.
+
+
Parameters:
+
nucSeq1 - String representation of the first nucleotide sequence for alignment.
+
nucSeq2 - String representation of the second nucleotide sequence for alignment.
bandWidth - Integer specifying the band-width for banded alignment or null for non-banded alignment.
+
Returns:
+
Triplet storing the alignment score, the aligned first sequence and the aligned second sequence.
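To make the scoring scheme above concrete, the following sketch computes only the score of a global alignment with affine gap penalties (Gotoh-style dynamic programming). It is not the MUSIAL implementation: the class `AffineAlignmentSketch` is hypothetical, no banding or traceback is performed, and it assumes that a gap of length k costs the open penalty once plus k times the extension penalty.

```java
// Hypothetical sketch, not the MUSIAL implementation: score-only global alignment with affine gaps.
final class AffineAlignmentSketch {

    static final int MATCH = 1;
    static final int MISMATCH = -1;
    static final int GAP_OPEN = -2;
    static final int GAP_EXTEND = -1;
    // Large negative value standing in for "not reachable"; halved to avoid integer overflow.
    private static final int NEG = Integer.MIN_VALUE / 2;

    static int score(String nucSeq1, String nucSeq2) {
        int n = nucSeq1.length();
        int m = nucSeq2.length();
        int[][] match = new int[n + 1][m + 1]; // alignments ending in a (mis)match column
        int[][] gap2 = new int[n + 1][m + 1];  // alignments ending with a gap in nucSeq2
        int[][] gap1 = new int[n + 1][m + 1];  // alignments ending with a gap in nucSeq1
        gap2[0][0] = NEG;
        gap1[0][0] = NEG;
        for (int i = 1; i <= n; i++) {
            match[i][0] = NEG;
            gap1[i][0] = NEG;
            gap2[i][0] = GAP_OPEN + i * GAP_EXTEND;
        }
        for (int j = 1; j <= m; j++) {
            match[0][j] = NEG;
            gap2[0][j] = NEG;
            gap1[0][j] = GAP_OPEN + j * GAP_EXTEND;
        }
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int s = nucSeq1.charAt(i - 1) == nucSeq2.charAt(j - 1) ? MATCH : MISMATCH;
                match[i][j] = s + max3(match[i - 1][j - 1], gap2[i - 1][j - 1], gap1[i - 1][j - 1]);
                int open = GAP_OPEN + GAP_EXTEND;
                gap2[i][j] = max3(match[i - 1][j] + open, gap2[i - 1][j] + GAP_EXTEND, gap1[i - 1][j] + open);
                gap1[i][j] = max3(match[i][j - 1] + open, gap1[i][j - 1] + GAP_EXTEND, gap2[i][j - 1] + open);
            }
        }
        return max3(match[n][m], gap2[n][m], gap1[n][m]);
    }

    private static int max3(int a, int b, int c) {
        return Math.max(a, Math.max(b, c));
    }
}
```

Under the stated gap-cost assumption, aligning `ACGT` with itself scores 4, and aligning `ACGT` with `AGT` scores 0 (three matches and one gap of length one).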
+
+
+
+
+
+
getVariantsOfAlignedSequences
+
public static ArrayList<org.apache.commons.lang3.tuple.Triple<Integer,String,String>> getVariantsOfAlignedSequences(String targetSequence,
+ String querySequence)
+
Computes variants from two aligned sequences.
+
+ Variants are extracted wrt. the query versus the target sequence and formatted as p:r:a, where p is the variant
+ position, r is the reference content and a is the alternate content.
+
+
Parameters:
+
targetSequence - String representation of the target sequence (i.e. reference).
+
querySequence - String representation of the query sequence (i.e. the one with variants).
+
Returns:
+
ArrayList containing the derived variants; cf. the method description for format details.
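The extraction of variants from two aligned sequences can be sketched as a single pass over the alignment columns. This is an assumption-laden illustration, not the MUSIAL method: the class `AlignedVariantsSketch` is hypothetical, gaps are assumed to be written as `-`, each differing column is reported separately (adjacent gap columns are not merged), each variant is rendered as a string in the order position, reference content, alternate content, and an inserted column is reported at the last preceding target position.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the MUSIAL implementation: column-wise variant extraction from an alignment.
final class AlignedVariantsSketch {

    // Returns one "position:reference:alternate" entry per alignment column in which the sequences differ.
    static List<String> variants(String targetSequence, String querySequence) {
        if (targetSequence.length() != querySequence.length()) {
            throw new IllegalArgumentException("Aligned sequences must have equal length.");
        }
        List<String> result = new ArrayList<>();
        int position = 0; // 1-based position on the ungapped target sequence
        for (int i = 0; i < targetSequence.length(); i++) {
            char reference = targetSequence.charAt(i);
            char alternate = querySequence.charAt(i);
            if (reference != '-') {
                position++; // only non-gap target symbols advance the reference position
            }
            if (reference != alternate) {
                result.add(position + ":" + reference + ":" + alternate);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(variants("ACGT-A", "AC-TTA")); // prints [3:G:-, 4:-:T]
    }
}
```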