
gda_feature_extraction_pipeline_script_descriptions.md

Descriptions of scripts that are components of the genomic feature extraction pipeline

Most of these scripts are triggered by the Nextflow master script for the extraction of genomic features. Users are not required to interact with these scripts directly, but knowing what they do might help with troubleshooting or with non-standard use cases.

Main

autodownsample_merged_tsv.py
Script for automatically triggering the downsampling of the TSV table that has been derived by merging bedgraph files

cluster_junctions_fisher_test.py
Script for counting cluster junctions between windows in the GDA output 'clusters.bed' file.
Fisher's exact test is used to determine whether some types of cluster junctions occur more rarely or more often than expected by chance
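
As a rough illustration of the idea, the sketch below counts junctions between adjacent windows and computes a two-sided Fisher's exact test p-value for a 2x2 table using only the standard library. The function names, the treatment of junctions as unordered pairs and the layout of the contingency table are assumptions made for this example, not a description of what cluster_junctions_fisher_test.py does internally.

```python
from collections import Counter
from math import comb


def count_junctions(labels):
    """Count unordered junctions between adjacent windows of ONE scaffold.

    `labels` is the list of cluster labels of consecutive windows
    (hypothetical input; the real script reads 'clusters.bed').
    """
    return Counter(tuple(sorted(pair)) for pair in zip(labels, labels[1:]))


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables that are no more likely than the observed one
    p = sum(q for q in map(prob, range(lo, hi + 1)) if q <= observed * (1 + 1e-7))
    return min(p, 1.0)
```

A table for one junction type could, for instance, contrast the observed count of that junction against all other junctions, with the second row holding the counts expected from shuffled labels.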

convert_einverted_output_to_gff.py
Script for converting the output of EMBOSS einverted to GFF3 format

count_neighbouring_clusters_in_bed_file.py
Script for counting the neighbouring clusters of each cluster in a GDA output BED file

decomposition_gc_skew_repeats_sliding_window.py
Script for finding GC content, GC skew, stop codon frequency and the frequency of some repeat motifs in a FASTA file using a sliding window
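
A minimal sketch of the GC and GC skew part of such a calculation is shown below. It assumes non-overlapping windows and the common definition of GC skew as (G - C) / (G + C); the function name and the default window size are hypothetical, and the stop codon and repeat motif statistics are left out.

```python
def gc_and_skew_windows(seq, window=4):
    """Yield (start, end, gc_fraction, gc_skew) for fixed-length chunks.

    Coordinates are 0-based, end-exclusive. GC skew is set to 0.0 for
    chunks that contain no G or C (an assumption of this sketch).
    """
    seq = seq.upper()
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        gc_fraction = (g + c) / len(chunk)
        gc_skew = (g - c) / (g + c) if g + c else 0.0
        yield start, start + len(chunk), gc_fraction, gc_skew
```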

dustmasker_get_masked_seq_percentage_bedgraph.py
Script for finding the percentages of nucleotides that got masked by Dustmasker in a FASTA file, as a sliding window

extract_gene_stats.py
Script for extracting gene statistics (length, exon count, intron count, strand) from a GFF file

extract_gff_features.py
Script for extracting mRNA, pseudogene, tRNA and rRNA features from a GFF3 file and converting them to bedgraph format

filtered_ectopic_organellar_blast_hits_to_gff.py
Script for filtering the BLAST results to detect ectopic mitochondrial and apicoplast sequences. The script outputs the ectopic mitochondrion and apicoplast BLAST hits as GFF

gaps_to_bedgraph.py
Script for detecting Ns in a FASTA file with sliding window and reporting them in bedgraph format

gc_skew_etc_to_bedgraph.py
Script for converting the output of decomposition_gc_skew_repeats_sliding_window.py to bedgraph format

gda_check_software_dependencies.py
Script for checking if required software for the GDA feature extraction pipeline is in path

general_purpose_functions.py
General purpose functions
File for functions that can be reused in many Python scripts

genome_decomp_pipeline_shared_functions.py
File for functions that are shared between scripts of the decomposition pipeline

gff_features_to_bedgraph.py
Script for converting GFF features to bedgraph format. Output (STDOUT): a bedgraph file with the fractions of regions that contain the query feature in fixed length chunks of scaffolds
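
The "fraction of each chunk that contains the feature" can be pictured with the sketch below. It assumes 1-based inclusive GFF coordinates, counts overlapping features only once and returns 0-based, end-exclusive chunk coordinates as used in bedgraph; the function name and its arguments are hypothetical.

```python
def feature_fraction_per_chunk(scaffold_length, features, chunk=10):
    """Return (start, end, fraction_covered) for each fixed-length chunk.

    `features` is a list of (start, end) pairs in 1-based inclusive
    GFF coordinates on a single scaffold.
    """
    covered = [False] * scaffold_length
    for start, end in features:
        # Convert to 0-based indices and clip to the scaffold
        for i in range(max(start, 1) - 1, min(end, scaffold_length)):
            covered[i] = True
    rows = []
    for chunk_start in range(0, scaffold_length, chunk):
        piece = covered[chunk_start:chunk_start + chunk]
        rows.append((chunk_start, chunk_start + len(piece), sum(piece) / len(piece)))
    return rows
```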

kmer_freq_sliding_window.py
Script for counting kmer frequencies in a FASTA file using a sliding window
Output: bedgraph files for the counts of kmers in every sliding window step across the scaffolds in the input FASTA file
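
The counting step can be sketched as follows, assuming non-overlapping windows and ignoring kmers that span a window boundary. The function name and defaults are hypothetical; writing one bedgraph file per kmer from these counts is left out.

```python
from collections import Counter


def kmer_counts_per_window(seq, k=2, window=4):
    """Return a list of (start, end, Counter of kmers) per window.

    Coordinates are 0-based, end-exclusive.
    """
    seq = seq.upper()
    results = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        # Overlapping kmers inside the window; empty if the chunk is shorter than k
        counts = Counter(chunk[i:i + k] for i in range(len(chunk) - k + 1))
        results.append((start, start + len(chunk), counts))
    return results
```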

map_rna-seq_reads_and_get_coverage.py
Script for getting RNA-Seq read coverage for decomposition analysis of genomes

quick_test_feature_extraction_pipeline.py
Script for running a quick test of the genomic feature extraction pipeline of GDA with P. falciparum chromosome 1 as the input.
This script runs all the mandatory parts of the pipeline but skips most of the optional parts

run_blast_to_detect_ectopic_organellar_seq.py
Script for detection of ectopic organellar sequences using BLAST

run_dustmasker.py
Script for running DustMasker to detect low complexity regions in an assembly

run_einverted.py
Script for running einverted to detect inverted repeats

run_ltrharvest_and_ltrdigest.py
Script for running LTRharvest and LTRdigest

run_red_meshclust2.py
Script for running Red and MeShClust2 to detect repeat families

run_wgsim.py
Script for running wgsim to generate simulated reads, mapping these reads and finding their coverage

sam_to_sorted_indexed_bam.py
Script for converting a .sam file with mapped reads to a sorted and indexed .bam file
Argument1: path to .sam file
Argument2: number of threads

samtools_depth_to_bedgraph.py
Script for converting coverage data (based on SAMtools depth) to bedgraph format. Output (STDOUT): a bedgraph file with mean coverage of fixed length chunks of scaffolds
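
A minimal sketch of this conversion is given below. It relies on the three-column, tab-separated output of `samtools depth` (scaffold, 1-based position, depth) and assumes that every position is reported (as with `samtools depth -a`), so that positions within a chunk are contiguous. The function name and chunk size are hypothetical.

```python
from collections import defaultdict


def depth_to_bedgraph(depth_lines, chunk=3):
    """Convert `samtools depth` lines to bedgraph lines with mean coverage per chunk."""
    # (scaffold, chunk index) -> [sum of depths, number of positions seen]
    sums = defaultdict(lambda: [0, 0])
    for line in depth_lines:
        scaff, pos, depth = line.rstrip("\n").split("\t")
        key = (scaff, (int(pos) - 1) // chunk)
        sums[key][0] += int(depth)
        sums[key][1] += 1
    out = []
    for (scaff, idx), (total, n) in sums.items():
        start = idx * chunk  # bedgraph start is 0-based
        out.append(f"{scaff}\t{start}\t{start + n}\t{total / n:.2f}")
    return out
```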

shorten_fasta_headers.py
Script for shortening FASTA headers, by splitting the header and keeping only the first element
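
For illustration, the header shortening could look like the sketch below, assuming that headers are split on whitespace (the delimiter used by the actual script is not stated here). The function name is hypothetical.

```python
def shorten_fasta_headers(lines):
    """Yield FASTA lines with each header reduced to its first whitespace-separated element."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split()
            # Keep an empty header as-is rather than failing on it
            yield ">" + (fields[0] if fields else "")
        else:
            yield line
```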

stats_per_gene_to_bedgraph.py
Script for converting the table of per-gene statistics to bedgraph format

validate_input_files.py
Script for validating the input files of the GDA feature extraction pipeline

validate_nextflow_config.py
Script for validating the nextflow.config file of GDA

validate_pipeline_run_folder.py
Script for validating the GDA pipeline run folder before running the pipeline


Genome annotation

add_missing_ids_to_gff3.py
Script for processing a GFF file to add missing IDs. The input is a GFF that has been created by combining Augustus, Barrnap and tRNAscan-SE

combine_annotation_gff_files.py
Script for combining the output GFF3 files of Augustus, Barrnap and tRNAscan into one GFF3 file

convert_trnascan_bed_file_to_gff.py
Script for converting tRNAscan output BED file to GFF3 format

gda_annotate_genes.py
Master script for running gene annotation scripts for GDA

gff_to_transcripts_and_proteins.py
Script for extracting transcripts and protein sequences from a GFF3 and a genome assembly FASTA file

run_augustus.py
Script for running Augustus for genome annotation as multiple parallel jobs
This script uses snippets of code adapted from https://github.com/stephenrdoyle/generic_scripts/blob/master/random_workflows/run_augustus_split_by_contigs.sh

run_barrnap.py
Script for running Barrnap for detecting rRNAs

run_liftoff.py
Script for running Liftoff to transfer gene annotations

run_trnascan.py
Script for running tRNAscan-SE for detecting tRNAs


OrthoMCL

generate_orthomcl_gg_file_from_fasta.py
Script for generating gg_file for OrthoMCL from protein FASTA files
Argument1: path to a CSV file. First column: species identifiers (short names). Second column: names of FASTA files for each species (without folder path)
Argument2: path to folder with protein FASTA files
Output: gg_file for OrthoMCL

orthomcl_batch.py
Script for running OrthoMCL as batch

orthomcl_conservation.py
Script for converting OrthoMCL results into a table of paralog counts, ortholog counts and conservation ratio

remove_non_mrna_cds_features.py
Script for processing a GFF3 file to remove CDS features whose parent feature is something other than 'mRNA'
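
The filtering logic can be sketched in two passes over the GFF3 lines: first record the type of every feature that has an ID, then drop CDS lines whose parent is not an mRNA. Keeping a CDS when at least one of several comma-separated parents is an mRNA is an assumption of this sketch, and the function names are hypothetical.

```python
def parse_attributes(field):
    """Parse the 9th GFF3 column ('key=value;key=value') into a dict."""
    return dict(item.split("=", 1) for item in field.split(";") if "=" in item)


def remove_non_mrna_cds(gff_lines):
    """Return GFF3 lines without CDS features whose parent is not an mRNA."""
    rows = [line.rstrip("\n") for line in gff_lines]
    feature_type = {}
    # First pass: map feature ID -> feature type
    for row in rows:
        cols = row.split("\t")
        if row.startswith("#") or len(cols) != 9:
            continue
        attrs = parse_attributes(cols[8])
        if "ID" in attrs:
            feature_type[attrs["ID"]] = cols[2]
    kept = []
    # Second pass: skip CDS lines that have no mRNA parent
    for row in rows:
        cols = row.split("\t")
        if not row.startswith("#") and len(cols) == 9 and cols[2] == "CDS":
            parents = parse_attributes(cols[8]).get("Parent", "").split(",")
            if not any(feature_type.get(p) == "mRNA" for p in parents):
                continue
        kept.append(row)
    return kept
```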

run_orthomcl.py
Script for running OrthoMCL (including Diamond blastp for OrthoMCL)


RepeatModeler

condense_simple_repeat_sequences.py
Script for condensing a list of simple repeat sequences to remove redundant sequences.
For example: TAA and TTA are the same sequence, one is the reverse complement of the other. TAATAA repeat is the same as TAA repeat. AATAAT is the same as TAATAA but with a shifted starting point
Input: GFF with repeat locations from RepeatMasker, processed with process_repeatmasker_gffs.py to extract only simple repeat sequences
Output: simple repeats GFF with redundant sequences collapsed into one sequence
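
One way to treat such sequences as equivalent is to map each of them to a canonical form: reduce the sequence to its shortest repeating unit, then take the alphabetically smallest string among all rotations of that unit and of its reverse complement. The sketch below illustrates this approach; it is not necessarily how condense_simple_repeat_sequences.py is implemented, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def shortest_unit(seq):
    """Return the shortest unit whose repetition gives seq (TAATAA -> TAA)."""
    for n in range(1, len(seq) + 1):
        if len(seq) % n == 0 and seq[:n] * (len(seq) // n) == seq:
            return seq[:n]
    return seq


def canonical_repeat(seq):
    """Return a canonical form shared by all equivalent simple repeat sequences."""
    unit = shortest_unit(seq.upper())
    reverse_complement = unit.translate(_COMPLEMENT)[::-1]
    candidates = []
    for s in (unit, reverse_complement):
        # All rotations (shifted starting points) of the unit
        candidates.extend(s[i:] + s[:i] for i in range(len(s)))
    return min(candidates)
```

With this definition, TAA, TTA, TAATAA and AATAAT all map to the same canonical sequence, so they can be collapsed into one entry.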

find_repeats_enriched_at_scaff_edges.R
Script for using RepeatModeler's repeat families output for detecting repeats that are enriched at scaffold edges
This is not a component of the main pipeline of GDA but can be used as an extra step to get more information out of the data

process_repeatmasker_gffs.py
Script for running scripts that process RepeatMasker gff files

reformat_repeatmasker_gff.py
Script for splitting the simple and complex repeat lines in RepeatMasker GFF output and reformatting the GFF so that it can be used as the input for the multiple_gff_features_to_bedgraph.py script

repeatmasker_gff_to_bedgraph.py
Script for converting RepeatMasker repeat coordinates from GFF to bedgraph

repeatmasker_simple_repeat_frequencies.py
Script for finding simple repeat frequencies in a GFF derived from the output of RepeatModeler + RepeatMasker

run_repeatmasker_repeatmodeler.py
Script for running RepeatMasker and RepeatModeler as a part of genome decomposition

sum_simple_or_complex_repeat_tracks.py
Script for making bedgraph tracks that are the sum of all simple repeat tracks or sum of all complex repeat tracks


Tandem Repeats Finder

run_trf.py
Script for running Tandem Repeats Finder as a part of genome decomposition

trf_repeat_density_sliding_window.py
Script for finding repeat density in a genome using sliding window on genome FASTA file where repeats have been masked with Tandem Repeats Finder
Output: tab separated table. Column1: scaffold name. Column2: chunk start coordinate in the scaffold (1-based). Column3: chunk end coordinate in the scaffold.
Column4: fraction of nucleotides in the chunk that were masked by TandemRepeatsFinder. Column5: True if the fraction of masked nucleotides exceeds a cutoff, False if not.
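
The table described above could be produced along the lines of the following sketch. It assumes that masked nucleotides appear as N in the input sequences, that chunks do not overlap and that the end coordinate is inclusive; the function name, chunk size and cutoff are hypothetical.

```python
def trf_repeat_density(scaffolds, chunk=4, cutoff=0.5):
    """Return rows of (scaffold, start, end, masked_fraction, exceeds_cutoff).

    `scaffolds` maps scaffold names to masked sequences.
    Start and end are 1-based, end-inclusive.
    """
    rows = []
    for name, seq in scaffolds.items():
        for start in range(0, len(seq), chunk):
            piece = seq[start:start + chunk]
            fraction = piece.upper().count("N") / len(piece)
            rows.append((name, start + 1, start + len(piece), round(fraction, 3), fraction > cutoff))
    return rows
```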

trf_repeat_density_to_bedgraph.py
Script for converting Tandem Repeats Finder repeat density values to bedgraph format
Argument1: path to an output file of trf_repeat_density_sliding_window.py
Output (STDOUT): input file converted to bedgraph format
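
The main detail in this kind of conversion is the coordinate shift: the table uses 1-based start coordinates, while bedgraph uses 0-based starts with an exclusive end. A minimal sketch, assuming the five-column table layout described for trf_repeat_density_sliding_window.py and an inclusive end coordinate (the function name is hypothetical):

```python
def density_row_to_bedgraph(row):
    """Convert one tab-separated density table row to a bedgraph line."""
    scaffold, start, end, fraction = row.rstrip("\n").split("\t")[:4]
    # 1-based inclusive start -> 0-based start; the end value stays the same
    return f"{scaffold}\t{int(start) - 1}\t{end}\t{fraction}"
```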

trf_repeat_density_to_gff.py
Script for converting Tandem Repeats Finder repeat density values (that have been divided into repeat-rich and repeat-poor regions) into GFF or BED format