SiteGA implements the algorithm of Levitsky et al. (2007) for de novo motif search in a single ChIP-seq dataset (ChIP-seq peaks) (Tsukanov et al., 2022). SiteGA applies a genetic algorithm that evaluates motifs as patterns of frequencies of interdependent locally located dinucleotides (LPDs). This approach models short- and long-range interactions of nucleotide context within transcription factor binding sites (TFBS). Thus, the SiteGA approach is quite different from that of the traditional position weight matrix (PWM) model, which estimates sites by sums of additive contributions of nucleotide frequencies from all positions. The PWM approach ignores any dependencies between different positions within the TFBS.
For a given set of N foreground sequences (peaks), the fitness function F(X) = D(X) * E(X) * I(X) estimates the alignment X = {x(1), x(2), ..., x(N)} of N best predicted sites in the the SiteGA model. The first factor D(X) reflects dependencies of positions within an alignment through the linear discriminant analysis approach (Levitsky et al. 2007). The second factor E(X) implies the average enrichment of k-mers within the alignment of the predicted sites (Tsukanov et al., 2022). The program indexes all possible positions of sites in the foreground set with weights equal to the logarithms of the enrichment of their intrinsic k-mers when comparing foreground and background sequence sets. The third factor I(X) is the information content of the alignment X. This factor is computed by position-specific counts {q(i,j)} of dinucleotides within the alignment X (i and j are indices of position within the motif and dinucleotide type, 1 <= i <= L, L means motif length, 1 <= j <= 16). The formula takes into account the expected frequencies of dinucleotides {pexp(j)} as described in (Kulakovskiy et al., 2010). The default option of the algorithm takes the fitness function F(X) as D(X) * E(X) * I(X); optionally, the fitness function takes only the first two factors, F(X) = D(X) * E(X), this allows achieve the better performance for the minor part of tested data.
SiteGA source code is written in C++ language. To compile exetubables from the source code you need:
- In Linux system, C++ compiler, the minimal one is GCC compiler, more prominent is Clang under the management of Spack tool
- In Windows system any VC++ package, e.g. Microsoft Visual Studio Community
The bulk of input data are ChIP-seq peaks in FASTA format. To optimize computation time, the default limit of 3000 bp for the length of any peak is used, although the length of a peak is not restricted by the algorithm. This same reason requires at least a moderate number of peaks to be used in the analysis, usually ~1000 peaks are sufficient to obtain a motif.
Folder src contains SiteGA source code files that relate to individual pipeline modules and are described below.
Folder bin contains SiteGA source code files compiled by Clang under the management of Spack tool. It is recomended to use these binary files instead of those compiled by default g++ compiler due to computation speed. Just copy these binary files to the src folder and ensure that they are functional. For illustrative purpose only all scripts are started from binary files compiled with default g++ compiler, see build.sh.
The background set of sequences is required as a complement to the foreground set to select the values of two parameters of the SiteGA model: the motif length and the number of LPDs. The main purpose of the background set is to exclude artifact motifs related to a genome-specific sequence content bias, e.g. polyA, from the results of de novo motif search. To prepare genome-specific sets of background sequences, it is recommended to use the AntiNoise package (Raditsa et al., 2024).
andy0bsn5cell.cpp performs the bootsrap cross-validation test to select parameters of a model providing the best performance for given foreground and background sets. A model has parameters of the motif length and the range for the number of LPDs. The motif length, the minimal value for the number of LPDs and it step are the parameters of command line. Recommended motif lengths are from 8 to 12 nt. The genetic algorithm simultaneously tests multiple values of the number of LPDs. The internal parameters of genetic algorithm are the range of the number of LPDs and its increment, default range consists of sixteen numbers of LPDs, 18, 20, etc. up to 48. The bootsrap cross-validation test denotes the partitioning of the foreground set into subsets of training and control sequence sets, the former is used to train a model and the latter to measure its performance. The maximal partial area under ROC curve (pAUC) and the partial area under precision-recall curve (pAUCPR) are used to estimate the accuracy of a model for certain motif length and the number of LPDs. The term partial means that only the parts of the ROC or PR curve is impied for pAUC computation. In both curves the criterion of motif frequency (Expected Recognition Rate, ERR). For certain set of sequences ERR is the ratio of the number of predicted sites to the number of available positions for them. The receiver operating characteristic (ROC) curve means the dependence between True Positive Rate (TPR) and False Positive Rate (FPR). TPR/FPR (axes Y/X of the ROC curve) are defined as the fraction of sequences from the foreground set containing predicted sites and the frequency of predicted sites in the background set, respectively. For ROC curve FPR = Nb/Wb, here Nb - number of positions conyaining predicted sites in the background set, Wb - total number of checked positions for predicted sites in the background set. The PR curve is the dependence between Precision and Recall, the metrics of the PR curve is the partial area under the PR curve, pAUPRC. Recall = TPR = TP/(TP+FN), Precision = TPR/(TPR+FPR) = [TP/NF]/[TP/NF+FP/NB] = TP/[TP+FP*(NF/NF)], here TPR & FPR are fraction of all NF foreground & NB background sequences that are predicted, TP & FP denote counts of predicted sequences from foreground & background sets (Levitsky et al., 2024). The calculation of pAUPRC also implies the precision values are decreased by their expectation value of 0.5, see MetArea. The two-fold cross-validation procedure is performed, its two iterations mean odd/even peaks either for training and control subsets. The sizes of training and control subsets are approximately equal.
andy05cell.cpp trains a SiteGA model with selected by the bootstrap cross-validation procedure parameters of the motif length and the number of LPDs for given foreground and background sets. The resulting model is written in a LPD weight matrix file
sitega_thr_dist_mat.cpp creates table of thresholds for model's hits search in test sequences (Scan test sequences with a model module) based on the distribution of recognition score for the set of whole-genome promoter sequences selected for the respective species. Prepeared sets are stored in genomes folder. The threshold selection implies the estimation of Expected Recognition Rate (ERR) of a model for promoter sequences of whole genome. The dependence of the threshold from ERR is stored in a LPD weight matrix file containing a list of two columns: recognition thresholds in descending order and corresponding -Log10(ERR) values
andy1_mat.cpp scans test sequences with a constructed model and the selected threshold of a model.
andy1_mat_long.cpp scans whole genome as a set of full-sized chromosomes with a constructed model and the selected threshold of a model.
- In Linux system:
git clone https://github.com/parthian-sterlet/sitega
cd sitega/src
chmod a+x build.sh
./build.sh
- In Windiws system:
separate compilation of all source files in VC++
Scheme of modules functioning is given below
Modules Set parameters of a model through accuracy estimation and Train a model take the input data of the foreground and background sequence sets.
The module Set parameters of a model through accuracy estimation is required for functionality of the module Train a model and all subsequent modules since the bootstrap procedure correctly selects parameters of a model (see output data block ROC & PR curves, pAUC ROC & pAUPRC performance metrics)
Modules Set threshold for a model and Scan test seauences with a model require file with SiteGA model that should be prepared by the module Train a model
The module Set threshold for a model is required to select a correct threshold for Scan test sequences with a model module
Lists of command line arguments for all modules are described below
Use background_genome_mono.cpp from the github repositiory AntiNoise
- path to fasta files with sets of foreground and background sequences (the last symbol of path must be '/' and '\' for Linux and Windows OS, respectively)
- fasta file, set of foreground sequences
- fasta file, set of background sequences
- integer value, maximal length of one LPD (default value 6)
- integer value, minimal number of LPDs (LPDmin, default value in the range from 20 to 50). The computation time increases with the growth of the number of LPDs.
- integer value, step of the number of LPDs (default value 1, four numbers are {LPDmin, LPDmin + 1, LPDmin + 2, LPDmin + 3} since the default number of distinct numbers of LPDs is four).
- integer value, length of motif in nucleotides, the motif length from 8 to 12 nt is recommended. The computation time increases with the growth of motif length.
- double value, Expected Recognition Rate - maximum frequency of motif in the background set. The value is used to restrict the X axes in ROC and PR curves to define partial areas under curves.
- integer value, k-mer length to take into account the sequence bias between foreground and background sequences (default 6, i.e. hexamer frequencies are involved)
- integer value 1 (default) or 0, take into account or disregard the information content in the fitness function F(X). It is defined as D(X) * E(X) * I(X) or D(X) * E(X), respectively.
- path to output files (the last symbol of path must be '/' and '\' for Linux and Windows OS, respectively)
- integer value, maximal peak length (default value is 3000)
- output log file
- path to fasta files with sets of foreground and background sequences (the last symbol of path must be '/' and '\' for Linux and Windows OS, respectively)
- fasta file with set of foreground sequences
- fasta file with set of background sequences
- integer value, maximal length of one LPD (default value 6)
- integer value, length of motif (this value is selected in the bootstrap cross-valiation test, see above)
- integer value, size, the number of LPDs (a value is selected in the bootstrap cross-valiation test, see above)
- integer value, k-mer length to take into account the sequence bias between foreground and background sequences (default 6, i.e. hexamer frequencies are involved)
- integer value 1 (default) or 0, take into account or disregard the information content in the fitness function F(X). It is defined as D(X) * E(X) * I(X) or D(X) * E(X), respectively.
- path to output files (the last symbol of path must be '/' and '\' for Linux and Windows OS, respectively)
- integer value, maximal peak length (default value is 3000)
- output log file
- path to file_profile_fasta (see argument #3 below, the last symbol of path must be '/' and '\' for Linux and Windows OS, respectively)
- sitega_matrix_file = input file SiteGA model from Train a model module
- file_profile_fasta = input Whole-genome promoters set in fasta format (archive with example provides one for M. musculus)
- output file, text mode, Table Threshold_vs_ERR, table SiteGA model threshold vs. Expected Recognition Rate (ERR)
- output file, binary mode, SiteGA model as a matrix and Table Threshold_vs_ERR, table SiteGA model threshold vs. ERR)
- double pvalue_large = maximal ERR, default value 0.001, value below 0.01 is recomended
- double dpvalue = granulation value for ERR compaction in Table Threshold_vs_ERR, the default value 0.0000005 implies a moderate compaction
- test file in fasta format with its path
- input file LPD weight matrix file of SiteGA model from Train a model module
- site_description_mode = 0 or 1. the value 0 means default mode, the value 1 means computation of frequencies of all LPDs for all tested sequences (option is used for the train fasta file to describe a SiteGA model)
- ERR threshold = threshold for ERR of SiteGA model is used to select the SiteGA threshold according to input file Threshold vs ERR table from Set threshold for a model module
- input file Threshold vs ERR table for SiteGA model threshold selection by ERR threshold
- output file Profile with recognized hits
- path to chromosome sequences in Plain format, individual chromosomes are in separate files, they are prepared by splitting the whole-genome fasta file by the https://github.com/parthian-sterlet/antinoise/blob/main/src/fasta_muliplefiles.cpp and https://github.com/parthian-sterlet/antinoise/blob/main/src/fasta_to_plain0.cpp from the AntiNoise package.
- species and genome release (values hg38, mm10, rn6, zf11, dm6, and ce235; at10, gm21, zm73, and mp61; sc64 and sch294). The animals inludes human Homo sapiens hg38, mouse Mus musculus mm10, rat Rattus norvegicus Rnor_6.0, zebrafish Danio rerio GRCz11, fly Drosophila melanogaster dm6, and roundworm Caenorhabditis elegans WBcel235; the plants are arabidopsis Arabidopsis thaliana TAIR10, soybean Glycine max v2.1, maize Zea mays B73, and liverwort Marchantia polymorpha MpTak v6.1; the fungi are baker's yeast Saccharomyces cerevisiae R64-1-1 and fission yeast Schizosaccharomyces pombe ASM294v2.
- input file LPD weight matrix file of SiteGA model from Train a model module
- input file Threshold vs ERR table for SiteGA model threshold selection by ERR threshold
- double value, ERR threshold = threshold for ERR of SiteGA model is used to select the SiteGA threshold according to input file Table Threshold vs ERR from Set threshold for a model module
- output file Profile with recognized hits
These scripts implement various pipelines for Linux:
- bootstrap test for a model - Estimate accuracy for a model module
- training a model - Train a model module
- thresholds selection by ERRs - Set threshold for a model module
- sites recognition for a test fasta file with a model - Scan test seauences with a model module
- sites recognition for a whole genome with a model - Scan whole genome with a model module
background_genome.cpp from the AntiNoise package prepares the background set that may be used for training (Train a model module) or performance evaluation (Set parameters of a model through accuracy estimation module)
andy0bsn5cell.cpp several times construct distinct SiteGA models (parameter of command line 'number of iterations'), but each time it uses only a part of the foreground set for training, the rest (control) part of set is used to estimate FPR). The output files ROC curve (pAUC ROC) and PR curve (pAUPRC) for various values of parameters of a model allows selection of the one model among several ones. These several ones come from different numbers of LPDs and lengths L of a motif. Summary file lists pAUC ROC and pAUPRC values for various numbers of LPDs.
andy05cell.cpp constructs one sitega model, with the numbers of locally positioned dinucleotides (LPDs) assigned according the parameter of the command line, this value deduced from the bootstrap cross validation test (see the previous paragraph). The selected model respecting the maximal pAUC in the bootstrap cross validation test, this model is written in output file SiteGA model with .mat extension
sitega_thr_dist_mat.cpp computes the distribition of the recognition scores of SiteGA model, output file Table Threshold vs ERR represents two columns with respective thresholds and ERRs
andy1_mat.cpp takes ready sitega model and threshold from input file Table Threshold vs ERR and constructs the profile of hits for tested file in fasta format, main output file is Profile file I.e. after the header of each peak with first '>' symbol from zero to several lines respect to separate hits, for each hit a start position, score, strand and whole sequence are printed.
andy1_mat_long.cpp takes ready sitega model and threshold from input file Table Threshold vs ERR and constructs the profile of hits for a genome in plain format, main output file is Profile file I.e. after the header of each peak with first '>' symbol from zero to several lines respect to separate hits, for each hit a start position, score, strand and whole sequence are printed.