Merge remote-tracking branch 'tamu-origin/master'
Showing 308 changed files with 80,885 additions and 625 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
# Overview | ||
> input: multi-protein FASTA file | ||
> output: multi FASTA of positive candidates, plus a table summarizing the stats for each candidate: identifier, length of potential SAR, topology orientation, and % of hydrophilic residues. | ||
# Requirements | ||
* python 3.6+ | ||
* biopython | ||
* <s>pandas</s> --> thought I might use this, but by the time I was wrapping up I saw I wouldn't need it. | ||
* <s>numpy</s> | ||
|
||
# Outline | ||
1. Read in input multi fasta | ||
* multi fasta parsed by `biopython_parsing.py` | ||
2. Check SAR requirements | ||
* <s>Min peptide length check</s> --> Currently omitted | ||
* <s>Max peptide length check (user dictated)</s> --> Currently omitted | ||
* Hydrophobic residues (Ile, Leu, Val, Phe, Tyr, Trp, Met); SAR domains are also often rich in Gly, Ala, and/or Ser residues | ||
* <s>Option 1: FIWLVMYAGS</s> Using option #2 | ||
* __Option 2: FIWLVMYCATGS # add C and T --> This is what is being used__ | ||
* Lysines can be present in the hydrophobic stretch if within 3 residues of the domain boundary (lysine snorkeling) | ||
* <s>Currently, I'm not checking for lysine snorkels as the hydrophobic region present will still be caught.</s> | ||
* Snorkelers are found as follows: if a Lys is on the first or last index of the sequence range being inspected, check for hydrophobic residues between it and either the beginning or end of the sequence. | ||
* More refinement will be necessary to verify a "K_nonhydro_nonhydro_nonhydro_hydro..hydro_" pattern. | ||
* The argument against tuning it any further is that the current method would still catch the hydrophobic domains within a given range. | ||
* Topology check | ||
* N term (net positive charge) | ||
* C term catalytic domain | ||
3. Return candidates in multi fasta. | ||
4. Return candidates in multi gff3. | ||
5. Write statistics to output file in table format | ||
* <s>identifier :: length of peptide :: topology orientation :: %G and %A :: likely more later</s> | ||
* Reworked to write the following columns in tab-separated format: | ||
* ["Name","Protein Sequence","Protein Length","SAR Length","Putative SAR Sequence","SAR Start Location","[res%]","N-term Sequence","N-term net Charge"] | ||
|
||
# File Summaries | ||
* `SAR_functions.py` | ||
* Has the SAR class and accompanying methods | ||
* `SAR_finder.py` | ||
* The executed script by Galaxy | ||
* `biopython_parsing.py` | ||
* _might scale to a_ symlink for parsing bio-related files; otherwise it will only serve this experiment. | ||
* `file_operations.py` | ||
* _might scale to a_ symlink for operating on files and exporting them; otherwise it will only serve this experiment and the writing of its outputs. | ||
|
||
# Testing | ||
* Mu (mu-proteins.fa) for a true positive (TP) | ||
* Phage-21 for a TP | ||
* simple-proteins.fa includes a TP from Mu and a true negative (TN) from Mu. Used with Planemo. |
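The hydrophobic-stretch check in step 2 of the outline can be sketched roughly as below. This is a minimal illustration and not the code in `SAR_functions.py`: the function name `hydrophobic_windows` and the `search_limit` parameter are hypothetical, the 50-residue limit is borrowed from the tool's help text, and lysine snorkeling is deliberately not modeled.

```python
# Option 2 alphabet from the outline (hydrophobic set plus C and T).
HYDROPHOBIC = set("FIWLVMYCATGS")


def hydrophobic_windows(seq, sar_min=15, sar_max=20, search_limit=50):
    """Return (start, window) pairs for every fully hydrophobic window.

    Only the first `search_limit` residues are scanned, and every window
    size from `sar_min` to `sar_max` (inclusive) is tried.
    """
    region = seq[:search_limit].upper()
    hits = []
    for size in range(sar_min, sar_max + 1):
        for start in range(0, len(region) - size + 1):
            window = region[start:start + size]
            # Keep the window only if every residue is in the alphabet.
            if all(res in HYDROPHOBIC for res in window):
                hits.append((start, window))
    return hits


# Toy sequence: a 16-residue Ala run flanked by charged residues.
toy = "MKR" + "A" * 16 + "DDDD"
for start, window in hydrophobic_windows(toy):
    print(start, len(window), window)
```

Overlapping windows are reported separately here, so a real implementation would probably merge or shrink them before writing output.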
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
import sys | ||
import argparse | ||
import os | ||
import re | ||
from biopython_parsing import FASTA_parser | ||
from file_operations import fasta_from_SAR_dict, gff3_from_SAR_dict, tab_from_SAR_dict | ||
from SAR_functions import CheckSequence | ||
|
||
if __name__ == "__main__": | ||
    parser = argparse.ArgumentParser(description="SAR Finder") | ||
|
||
    parser.add_argument("fa",type=argparse.FileType("r"),help="organism's multi fasta file") | ||
|
||
    parser.add_argument("--min",type=int,default=20,help="minimum size of candidate peptide") | ||
|
||
    parser.add_argument("--max",type=int,default=200,help="maximum size of candidate peptide") | ||
|
||
    parser.add_argument("--sar_min",type=int,default=15,help="minimum size of candidate peptide TMD domain") | ||
|
||
    parser.add_argument("--sar_max",type=int,default=20,help="maximum size of candidate peptide TMD domain") | ||
|
||
    parser.add_argument("--out_fa",type=argparse.FileType("w"),help="multifasta output of candidate SAR proteins",default="candidate_SAR.fa") | ||
|
||
    parser.add_argument("--out_stat",type=argparse.FileType("w"),help="summary statistic file for candidate SAR proteins, tab separated",default="candidate_SAR_stats.tsv") | ||
|
||
    parser.add_argument("--out_gff3",type=argparse.FileType("w"),help="multigff3 file for candidate SAR proteins",default="candidate_SAR.gff3") | ||
|
||
    args = parser.parse_args() | ||
|
||
    fa_dict = FASTA_parser(fa=args.fa).multifasta_dict() | ||
|
||
    sars = {} | ||
|
||
    for protein_name, protein_data in fa_dict.items(): | ||
        sar = CheckSequence(protein_name, protein_data) | ||
        #sar.check_sizes(min=args.min,max=args.max) | ||
        hydros = sar.shrink_results(sar_min=args.sar_min, sar_max=args.sar_max) | ||
        sars.update(hydros) | ||
|
||
|
||
    gff3_from_SAR_dict(sars, args.out_gff3) | ||
    tab_from_SAR_dict(sars,args.out_stat,"SGAT",sar_min=args.sar_min, sar_max=args.sar_max) | ||
    fasta_from_SAR_dict(sars,args.out_fa) | ||
    #stat_file_from_SAR_dict(sars,args.out_stat,sar_min=args.sar_min,sar_max=args.sar_max) # fix this whenever ready. |
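The loop in this script expects `fa_dict` to map each protein name to its data. `biopython_parsing.py` is not shown in this part of the diff, and the README lists Biopython as the parser, so the sketch below is only a dependency-free stand-in. The function name is reused for readability, and the return shape (identifier mapped to a plain sequence string) is an assumption, not the confirmed behavior of `FASTA_parser.multifasta_dict()`.

```python
import io


def multifasta_dict(handle):
    """Map each record identifier to its sequence (assumed shape)."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the identifier.
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            # Sequence lines may be wrapped, so collect the pieces.
            records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}


example = io.StringIO(">gp1 hypothetical protein\nMKRA\nGGSA\n>gp2\nMDE\n")
print(multifasta_dict(example))
```

Any open text handle works here, including the one `argparse.FileType("r")` produces for the `fa` argument.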
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,65 @@ | ||
<tool id="edu.tamu.cpt.sar.sar_finder" name="SAR Finder" version="1.0"> | ||
<description>SAR Domain Finder</description> | ||
<macros> | ||
<import>macros.xml</import> | ||
</macros> | ||
<expand macro="requirements"> | ||
</expand> | ||
<command detect_errors="aggressive"><![CDATA[ | ||
python $__tool_directory__/SAR_finder.py | ||
$fa | ||
--sar_min $sar_min | ||
--sar_max $sar_max | ||
--out_fa $out_fa | ||
--out_gff3 $out_gff3 | ||
--out_stat $out_stat | ||
]]></command> | ||
<inputs> | ||
<param label="Multi FASTA File" name="fa" type="data" format="fasta" /> | ||
<param label="SAR domain minimal size" name="sar_min" type="integer" value="15" /> | ||
<param label="SAR domain maximum size" name="sar_max" type="integer" value="20" /> | ||
</inputs> | ||
<outputs> | ||
<data format="tabular" name="out_stat" label="candidate_SAR_stats.tsv"/> | ||
<data format="fasta" name="out_fa" label="candidate_SAR.fa"/> | ||
<data format="gff3" name="out_gff3" label="candidate_SAR.gff3"/> | ||
</outputs> | ||
<tests> | ||
<test> | ||
<param name="fa" value="simple-proteins.fa"/> | ||
<param name="sar_min" value="15"/> | ||
<param name="sar_max" value="20"/> | ||
<output name="out_stat" file="candidate_SAR_stats.tsv"/> | ||
<output name="out_fa" file="candidate_SAR.fa"/> | ||
<output name="out_gff3" file="candidate_SAR.gff3"/> | ||
</test> | ||
</tests> | ||
<help><![CDATA[ | ||
A tool that analyzes the first 50 residues of each protein sequence for a weakly hydrophobic domain, sometimes found in endolysins, called a Signal-Anchor-Release (SAR) domain. | ||
Definition: A SAR domain (also called a Signal-Arrest-Release domain) is an N-terminal, weakly hydrophobic transmembrane region rich in Gly/Ala and/or Ser residues, sometimes found in phage lysis proteins, including endolysins and holins. The SAR domain can be released from the membrane in a proton motive force-dependent manner. | ||
This tool finds proteins that contain a stretch (default 15-20 residues) of hydrophobic residues (Ile, Leu, Val, Phe, Tyr, Trp, Met, Gly, Ala, Ser) and calculates the % Gly/Ala/Ser/Thr residues in the hydrophobic stretch. The net charge on the N-terminal region is also displayed to aid in determining the SAR topology.[1] | ||
INPUT : Protein Multi FASTA | ||
OUTPUT : | ||
* Multi FASTA of candidate proteins that pass the SAR domain criteria | ||
* Text summary file containing each protein that passes the SAR domain criteria | ||
* Multi GFF3 | ||
]]></help> | ||
<citations> | ||
<citation type="doi">10.1016/bs.aivir.2018.09.003</citation> | ||
<citation type="bibtex"> | ||
@unpublished{galaxyTools, | ||
author = {C. Ross}, | ||
title = {CPT Galaxy Tools}, | ||
year = {2020-}, | ||
note = {https://github.com/tamu-cpt/galaxy-tools/} | ||
} | ||
</citation> | ||
</citations> | ||
</tool> |
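The two statistics named in the help text (percentage of Gly/Ala/Ser/Thr in the hydrophobic stretch, and net charge of the N-terminal region) could be computed along these lines. This is a hedged sketch: both function names are hypothetical, and the charge model (Lys and Arg as +1, Asp and Glu as -1, with His and the terminal groups ignored) is an assumption, since the tool's own rule is not shown here.

```python
def residue_percentages(stretch, residues="SGAT"):
    """Percentage of each listed residue within the stretch."""
    if not stretch:
        return {res: 0.0 for res in residues}
    return {
        res: round(100.0 * stretch.count(res) / len(stretch), 1)
        for res in residues
    }


def n_term_net_charge(n_term):
    """Net charge under the simple +1 (K, R) / -1 (D, E) assumption."""
    positive = sum(n_term.count(res) for res in "KR")
    negative = sum(n_term.count(res) for res in "DE")
    return positive - negative


print(residue_percentages("GGAASTLLLL"))
print(n_term_net_charge("MKRD"))
```

The `"SGAT"` default mirrors the string that `SAR_finder.py` passes to `tab_from_SAR_dict`.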