Most of these scripts are triggered by the Nextflow master script for the extraction of genomic features. Users are not required to interact with these scripts directly, but knowing what they do might help with troubleshooting or with non-standard use cases.
autodownsample_merged_tsv.py
Script for automatically triggering the downsampling of the TSV table that has been derived by merging bedgraph files
cluster_junctions_fisher_test.py
Script for counting cluster junctions between windows in the GDA output 'clusters.bed' file.
Fisher's test is used to determine whether some types of cluster junctions occur more rarely or more often than expected by chance
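The junction counting and the test can be sketched as follows. This is a minimal illustration, not the code of cluster_junctions_fisher_test.py: the function names, the (scaffold, cluster) input layout and the way a 2x2 table would be built from the counts are all assumptions. The Fisher test is written in pure Python so that the example has no dependencies.

```python
from collections import Counter
from math import comb


def count_junctions(windows):
    """Count junctions between consecutive windows on the same scaffold.

    windows: ordered list of (scaffold, cluster_label) tuples (hypothetical layout).
    Junctions are unordered, so (0, 1) and (1, 0) are counted together.
    """
    counts = Counter()
    for (scaff_a, clust_a), (scaff_b, clust_b) in zip(windows, windows[1:]):
        if scaff_a == scaff_b:  # do not count across scaffold boundaries
            counts[tuple(sorted((clust_a, clust_b)))] += 1
    return counts


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables that are at least as extreme as the observed one
    p_value = sum(
        prob(x) for x in range(lowest, highest + 1) if prob(x) <= p_obs * (1 + 1e-7)
    )
    return min(1.0, p_value)
```

One possible (assumed) table for a junction type x-y would put the x-y count in the top-left cell and the counts of junctions involving only x, only y, or neither in the remaining cells.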
convert_einverted_output_to_gff.py
Script for converting the output of EMBOSS einverted to GFF3 format
count_neighbouring_clusters_in_bed_file.py
Script for counting the neighbouring clusters of each cluster in the GDA output BED file
decomposition_gc_skew_repeats_sliding_window.py
Script for finding GC content, GC skew, stop codon frequency and the frequency of some repeat motifs in a FASTA file using a sliding window
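As an illustration of the sliding window idea, the sketch below computes GC content and GC skew, using the common definition (G - C) / (G + C). The function names and the window and step parameters are hypothetical, and the stop codon and repeat motif counts of the real script are left out.

```python
def gc_and_skew(seq):
    """Return (GC fraction, GC skew) for a sequence; Ns are ignored."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    acgt = g + c + seq.count("A") + seq.count("T")
    gc = (g + c) / acgt if acgt else 0.0
    skew = (g - c) / (g + c) if (g + c) else 0.0
    return gc, skew


def sliding_gc(seq, window, step):
    """Return (start, end, gc, skew) tuples with 0-based, end-exclusive coordinates."""
    results = []
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        chunk = seq[start:start + window]
        gc, skew = gc_and_skew(chunk)
        results.append((start, start + len(chunk), gc, skew))
    return results
```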
dustmasker_get_masked_seq_percentage_bedgraph.py
Script for finding the percentages of nucleotides that got masked by Dustmasker in a FASTA file, as a sliding window
extract_gene_stats.py
Script for extracting gene statistics (length, exon count, intron count, strand) from a GFF file
extract_gff_features.py
Script for extracting mRNA, pseudogene, tRNA and rRNA features from a GFF3 file to convert them into bedgraph format
filtered_ectopic_organellar_blast_hits_to_gff.py
Script for filtering the BLAST results to detect ectopic mitochondrial and apicoplast sequences. The script outputs the ectopic mitochondrion and apicoplast BLAST hits as GFF
gaps_to_bedgraph.py
Script for detecting Ns in a FASTA file using a sliding window and reporting them in bedgraph format
gc_skew_etc_to_bedgraph.py
Script for converting the output of decomposition_gc_skew_repeats_sliding_window.py to bedgraph format
gda_check_software_dependencies.py
Script for checking whether the software required by the GDA feature extraction pipeline is in the PATH
general_purpose_functions.py
File for general purpose functions that can be reused in many Python scripts
genome_decomp_pipeline_shared_functions.py
File for functions that are shared between scripts of the decomposition pipeline
gff_features_to_bedgraph.py
Script for converting GFF features to bedgraph format. Output (STDOUT): a bedgraph file with the fractions of regions that contain the query feature in fixed length chunks of scaffolds
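The fraction-per-chunk output described above could be computed roughly as follows. This is a sketch under assumptions: the function name, the arguments and the four-decimal formatting are hypothetical. GFF coordinates are 1-based and inclusive, while bedgraph coordinates are 0-based with an exclusive end.

```python
def feature_fraction_bedgraph(scaffold, scaffold_len, features, chunk_size):
    """Return bedgraph lines with the fraction of each chunk covered by features.

    features: list of (start, end) tuples in GFF coordinates (1-based, inclusive).
    """
    covered = bytearray(scaffold_len)  # one flag per base, 0 = not covered
    for start, end in features:
        lo = max(start, 1) - 1
        hi = min(end, scaffold_len)
        if hi > lo:
            covered[lo:hi] = b"\x01" * (hi - lo)  # overlapping features count once
    lines = []
    for chunk_start in range(0, scaffold_len, chunk_size):
        chunk_end = min(chunk_start + chunk_size, scaffold_len)
        fraction = sum(covered[chunk_start:chunk_end]) / (chunk_end - chunk_start)
        lines.append(f"{scaffold}\t{chunk_start}\t{chunk_end}\t{fraction:.4f}")
    return lines
```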
kmer_freq_sliding_window.py
Script for counting kmer frequencies in a FASTA file using a sliding window
Output: bedgraph files for the counts of kmers in every sliding window step across the scaffolds in the input FASTA file
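A minimal sketch of per-window kmer counting is shown below. It assumes non-overlapping windows and counts overlapping kmer occurrences on the forward strand only; the real script may differ on both points, and the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count all overlapping kmers of length k in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_bedgraph(scaffold, seq, kmer, window):
    """Return bedgraph lines with the count of one kmer in each window."""
    lines = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        count = kmer_counts(chunk, len(kmer))[kmer.upper()]
        lines.append(f"{scaffold}\t{start}\t{start + len(chunk)}\t{count}")
    return lines
```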
map_rna-seq_reads_and_get_coverage.py
Script for getting RNA-Seq read coverage for decomposition analysis of genomes
quick_test_feature_extraction_pipeline.py
Script for running a quick test of the genomic feature extraction pipeline of GDA with P. falciparum chromosome 1 as the input.
This script runs all the mandatory parts of the pipeline but skips most of the optional parts
run_blast_to_detect_ectopic_organellar_seq.py
Script for detection of ectopic organellar sequences using BLAST
run_dustmasker.py
Script for running DustMasker to detect low complexity regions in an assembly
run_einverted.py
Script for running einverted to detect inverted repeats
run_ltrharvest_and_ltrdigest.py
Script for running LTRharvest and LTRdigest
run_red_meshclust2.py
Script for running Red and MeShClust2 to detect repeat families
run_wgsim.py
Script for running wgsim to generate simulated reads, mapping these reads and finding their coverage
sam_to_sorted_indexed_bam.py
Script for converting a .sam file with mapped reads to a sorted and indexed .bam file
Argument1: path to .sam file
Argument2: number of threads
samtools_depth_to_bedgraph.py
Script for converting coverage data (based on SAMtools depth) to bedgraph format. Output (STDOUT): a bedgraph file with mean coverage of fixed length chunks of scaffolds
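The conversion can be sketched as below. The sketch assumes that the depth table has three tab-separated columns (scaffold, 1-based position, depth) and that every position is listed, including those with zero depth. The function name and the two-decimal formatting are hypothetical.

```python
def depth_to_bedgraph(depth_lines, chunk_size):
    """Return bedgraph lines with the mean depth of fixed length chunks."""
    # (scaffold, chunk index) -> [sum of depths, number of positions];
    # dicts keep insertion order, so the output follows the input order
    sums = {}
    for line in depth_lines:
        scaffold, pos, depth = line.split()
        key = (scaffold, (int(pos) - 1) // chunk_size)
        entry = sums.setdefault(key, [0, 0])
        entry[0] += int(depth)
        entry[1] += 1
    out = []
    for (scaffold, index), (total, n_positions) in sums.items():
        start = index * chunk_size  # bedgraph starts are 0-based
        out.append(f"{scaffold}\t{start}\t{start + n_positions}\t{total / n_positions:.2f}")
    return out
```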
shorten_fasta_headers.py
Script for shortening FASTA headers, by splitting the header and keeping only the first element
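For example, a header such as ">chr1 some description" would become ">chr1". A minimal sketch, assuming that the header is split on whitespace by default (the delimiter parameter and the function name are hypothetical):

```python
def shorten_fasta_headers(lines, delimiter=None):
    """Yield FASTA lines with each header cut down to its first element.

    delimiter=None splits on any whitespace.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split(delimiter)
            yield ">" + (fields[0] if fields else "")
        else:
            yield line  # sequence lines are passed through unchanged
```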
stats_per_gene_to_bedgraph.py
Script for converting the table of per-gene statistics to bedgraph format
validate_input_files.py
Script for validating the input files of the GDA feature extraction pipeline
validate_nextflow_config.py
Script for validating the nextflow.config file of GDA
validate_pipeline_run_folder.py
Script for validating the GDA pipeline run folder before running the pipeline
add_missing_ids_to_gff3.py
Script for processing a GFF file to add missing IDs. The input is a GFF that has been created by combining Augustus, Barrnap and tRNAscan-SE
combine_annotation_gff_files.py
Script for combining the output GFF3 files of Augustus, Barrnap and tRNAscan into one GFF3 file
convert_trnascan_bed_file_to_gff.py
Script for converting tRNAscan output BED file to GFF3 format
gda_annotate_genes.py
Master script for running gene annotation scripts for GDA
gff_to_transcripts_and_proteins.py
Script for extracting transcripts and protein sequences from a GFF3 and a genome assembly FASTA file
run_augustus.py
Script for running Augustus for genome annotation as multiple parallel jobs
This script uses snippets of code adapted from https://github.com/stephenrdoyle/generic_scripts/blob/master/random_workflows/run_augustus_split_by_contigs.sh
run_barrnap.py
Script for running Barrnap for detecting rRNAs
run_liftoff.py
Script for running Liftoff to transfer gene annotations
run_trnascan.py
Script for running tRNAscan-SE for detecting tRNAs
generate_orthomcl_gg_file_from_fasta.py
Script for generating gg_file for OrthoMCL from protein FASTA files
Argument1: path to a CSV file. First column: species identifiers (short names). Second column: names of FASTA files for each species (without folder path)
Argument2: path to folder with protein FASTA files
Output: gg_file for OrthoMCL
orthomcl_batch.py
Script for running OrthoMCL as batch
orthomcl_conservation.py
Script for converting OrthoMCL results into a table of paralog counts, ortholog counts and conservation ratio
remove_non_mrna_cds_features.py
Script for processing a GFF3 file to remove CDS features whose parent feature is something other than 'mRNA'
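The filtering can be sketched in two passes: first record the type of every feature that has an ID, then drop CDS lines that have no mRNA parent. This is an illustration with hypothetical function names, not the code of the script. It assumes standard nine-column GFF3 lines with ID and Parent attributes.

```python
def parse_attributes(field):
    """Parse the GFF3 attributes column into a dictionary."""
    return dict(pair.split("=", 1) for pair in field.strip().split(";") if "=" in pair)


def remove_non_mrna_cds(gff_lines):
    """Return GFF3 lines without the CDS features whose parent is not an mRNA."""
    lines = [line.rstrip("\n") for line in gff_lines]
    feature_types = {}  # feature ID -> feature type
    for line in lines:
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue
        feature_id = parse_attributes(cols[8]).get("ID")
        if feature_id:
            feature_types[feature_id] = cols[2]
    kept = []
    for line in lines:
        cols = line.split("\t")
        if not line.startswith("#") and len(cols) == 9 and cols[2] == "CDS":
            parents = parse_attributes(cols[8]).get("Parent", "").split(",")
            if not any(feature_types.get(parent) == "mRNA" for parent in parents):
                continue  # skip a CDS that has no mRNA parent
        kept.append(line)
    return kept
```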
run_orthomcl.py
Script for running OrthoMCL (including Diamond blastp for OrthoMCL)
condense_simple_repeat_sequences.py
Script for condensing a list of simple repeat sequences to remove redundant sequences.
For example: TAA and TTA are the same sequence, one is the reverse complement of the other. TAATAA repeat is the same as TAA repeat. AATAAT is the same as TAATAA but with a shifted starting point
Input: GFF with repeat locations from RepeatMasker, processed with process_repeatmasker_gffs.py to extract only simple repeat sequences
Output: simple repeats GFF with redundant sequences collapsed into one sequence
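The three kinds of redundancy in the example above (reverse complement, repeated unit, shifted starting point) can be removed by mapping every repeat to one canonical form. The sketch below is one way to do this and is not necessarily how the script does it. It reduces the sequence to its shortest repeating unit and then picks the alphabetically smallest rotation of that unit or of its reverse complement. The function names are hypothetical and the input is assumed to be a non-empty string of A, C, G and T.

```python
def minimal_unit(seq):
    """Return the shortest unit that gives seq when repeated (TAATAA -> TAA)."""
    for size in range(1, len(seq) + 1):
        if len(seq) % size == 0 and seq[:size] * (len(seq) // size) == seq:
            return seq[:size]
    return seq


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def canonical_repeat(seq):
    """Map equivalent simple repeat sequences to a single representative."""
    unit = minimal_unit(seq.upper())
    candidates = []
    for strand in (unit, reverse_complement(unit)):
        # All rotations, to cover shifted starting points
        candidates.extend(strand[i:] + strand[:i] for i in range(len(strand)))
    return min(candidates)
```

With this sketch, TAA, TTA, TAATAA and AATAAT all map to AAT.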
find_repeats_enriched_at_scaff_edges.R
Script for using RepeatModeler's repeat families output for detecting repeats that are enriched at scaffold edges
This is not a component of the main pipeline of GDA but can be used as an extra step to get more information out of the data
process_repeatmasker_gffs.py
Script for running the scripts that process RepeatMasker GFF files
reformat_repeatmasker_gff.py
Script for splitting the simple and complex repeat lines in RepeatMasker GFF output and reformatting the GFF so that it can be used as the input for the multiple_gff_features_to_bedgraph.py script
repeatmasker_gff_to_bedgraph.py
Script for converting RepeatMasker repeat coordinates from GFF to bedgraph
repeatmasker_simple_repeat_frequencies.py
Script for finding simple repeat frequencies in a GFF derived from the output of RepeatModeler + RepeatMasker
run_repeatmasker_repeatmodeler.py
Script for running RepeatMasker and RepeatModeler as a part of genome decomposition
sum_simple_or_complex_repeat_tracks.py
Script for making bedgraph tracks that are the sum of all simple repeat tracks or sum of all complex repeat tracks
run_trf.py
Script for running Tandem Repeats Finder as a part of genome decomposition
trf_repeat_density_sliding_window.py
Script for finding repeat density in a genome using a sliding window on a genome FASTA file where repeats have been masked with Tandem Repeats Finder
Output: tab separated table. Column1: scaffold name. Column2: chunk start coordinate in the scaffold (1-based). Column3: chunk end coordinate in the scaffold.
Column4: fraction of nucleotides in the chunk that were masked by Tandem Repeats Finder. Column5: True if the fraction of masked nucleotides exceeds a cutoff, False if not.
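The five output columns can be illustrated with the sketch below. It assumes that masked nucleotides appear as N in the masked FASTA file and that the chunks do not overlap. The function name and the parameters are hypothetical.

```python
def trf_density_rows(scaffold, masked_seq, chunk_size, cutoff, mask_char="N"):
    """Return (scaffold, start, end, masked fraction, exceeds cutoff) per chunk.

    Start and end are 1-based and inclusive, as in the table described above.
    """
    rows = []
    for offset in range(0, len(masked_seq), chunk_size):
        chunk = masked_seq[offset:offset + chunk_size]
        fraction = chunk.upper().count(mask_char) / len(chunk)
        rows.append((scaffold, offset + 1, offset + len(chunk), fraction, fraction > cutoff))
    return rows
```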
trf_repeat_density_to_bedgraph.py
Script for converting Tandem Repeats Finder repeat density values to bedgraph format
Argument1: path to an output file of trf_repeat_density_sliding_window.py
Output (STDOUT): input file converted to bedgraph format
trf_repeat_density_to_gff.py
Script for converting Tandem Repeats Finder repeat density values (that have been divided into repeat-rich and repeat-poor regions) into GFF or BED format