README

##########################

Extracting the Ramp Sequence from genes
ExtRamp.py
Created By: Logan Brase and Justin Miller
Email: braselogan@gmail.com or justin.miller@uky.edu

##########################

ExtRamp is a tool to extract the Ramp sequence from the beginning of genes.
It uses the tAI values (user provided) or codon proportions to determine the speed of translation
and the appropriate cut off point for the Ramp sequence.

##########################

ARGUMENT OPTIONS:

ExtRamp requires 1 user input at the runtime:
	-i input	INPUT fasta file containing the cds gene sequences of interest (.gz file extension required for gzipped files)
	
	
Optional arguments:
 -a tAI		        INPUT file in csv format, contains the tAI values. Two formats are accepted as shown in the two included examples (S.cerevisiae_tAI_Values and Ecoli_tAI_Values)
 -u rscu                INPUT fasta file used to compute relative synonymous codon usage and relative adaptiveness of codons
 -o ramp	       	OUTPUT fasta file to write the ramp sequences to
 -v verbose             Flag to print progress to standard error
 -l vals	       	OUTPUT file in csv format, the sequence speed values are written here if provided
 -p speeds              OUTPUT speeds file to write tAI/relative adaptiveness values for each position in the sequence from the codon after the start codon to the codon before the stop codon. Format: Header newline list of values
 -n noRamp              OUTPUT Text file to write the gene names that contained no ramp sequence.
 -z removedSequences    OUTPUT Write the header lines that are removed (e.g., sequence not long enough or not divisible by 3) to output file
 -x afterRamp           OUTPUT Fasta file containing gene sequences after the identified ramp sequence
 -t threads       	The number of threads used to run the program, default is 9
 -w window	        The number of codons in the ribosome window, default is 9 codons
 -s stdev               The number of standard deviations below the mean the cutoff value will be. Default is not used.
 -d stdevRampLength	The number of standard deviations in the lengths of the ramp sequences. Default is not used.
 -m middle              The type of statistic used to measure the middle (consensus) efficiency. Options are 'hmean','mean', 'gmean', and 'median'. Default is 'hmean' 
 -r rna                 Flag for RNA sequences. Default is DNA.
 -f determine_cutoff    Flag to determine outlier percentages for mean cutoff based on species FASTA file. Default: local minimum in first 8 percent of gene')
 -c cutoff              Cutoff for where the local minimum must occur in the gene for a ramp to be calculated. If --determine_cutoff (-f) is used, then this value may change. Is not used if standard deviations are set. Default:8
 -e determine_cutoff_percent Cutoff for determining percent of gene that is in an outlier region. Used in conjunction with -f. Default is true outliers. Other options include numbers from 0-99, which indicate the region of a box plot. For instance, 75 means the 75th quartile or above. Default: True Outliers
 -q seqLength           Minimum nucleotide sequence length. Default is 100 amino acids * 3 = 300 nucleotides
	
NOTE: Only the standard codon table is supported because many codon tables have ambigous codons that encode for more than one amino acid. Since we use the relative codon adaptiveness for each amino acid, we cannot account for ambiguous codons.

##########################

REQUIREMENTS:

ExtRamp.py uses Python version 3.5 in a Linux environment

Python Libraries:
1.  statistics
2.  gzip
3.  csv
4.  argparse
5.  sys
6.  tqdm (optional)
7.  multiprocessing
8.  numpy
9.  scipy
10. math
11. re


If any of those libraries is not currently in your Python Path, use the following command:
pip3 install --user [library_name]
to install the library to your path.


##########################
USAGE

With tAI file (RECOMMENDED):

If you do not have the tAI values, check the stAicalc at:
http://tau-tai.azurewebsites.net/
A large list of the species with known tAI values is present. You must select a valid fasta
file and then click submit for the tAI values to be printed at the bottom. The fasta file 
does not need to be from the correct species, but it won't print the values until a file 
is selected. The values can then be exported into a CSV File.
 
python ExtRamp.py -i path/to/SEQUENCES.fasta.gz -a path/to/tAI.csv -o path/to/OUTFILE.fasta 

Without tAI file:

python ExtRamp.py -i path/to/SEQUENCES.fasta.gz -o path/to/OUTFILE.fasta -v

After running the top command, these updates will be printed to standard error:
Reading Sequences...
Calculating Codon Speeds...
Calculating Sequence Speeds...

Consensus Codon Efficiency using hmean: [NUMBER]
Standard Deviation: [NUMBER]
Maximum Efficiency in Ramp Sequence [NUMBER]

Isolating Ramp Sequences...

[NUMBER] Ramp Sequences found out of [NUMBER] total sequences

##########################
EXAMPLE USAGE WITH TAI VALUES

Try running ExtRamp.py on the provided S_cerevisiae example files in the example_files folder.

python ExtRamp.py -i example_files/Saccharomyces_cerevisiae.gz -a example_files/S_cerevisiae_tAI_Values.csv -o outTest.fasta 

The output should match the S_cerevisiae_output.fasta file in the example_files folder (note: due to multithreading, the order of the sequences might vary)

##########################
EXAMPLE USAGE WITHOUT TAI VALUES

python ExtRamp.py -i example_files/Homo_sapiens.gz -o output.fasta 

For us, that command takes approximately six minutes of user time. Using 16 cores, it took approximately 30 seconds.

output.fasta should match example_files/Homo_sapiens_output.fasta (note: the order of the sequences might vary)

##########################

Thank you, and happy researching!