Skip to content
Jennifer Chang edited this page Nov 22, 2019 · 5 revisions

It is recommended to start with the Minimal Tutorial. The minimal tutorial walks through classifying the WGS nucleotide sequences for two strains, A/swine/Iowa/A02478541/2019 (H1N1) and A/swine/Minnesota/A01785989/2019 (H3N2).

Documentation

nn_classifier.R: reads in a tree, identifies clade (and other info) based on nearest neighbor for query sequences. Reference sequences are annotated with metadata information (clade, segment, state) split by the | character. (Dependencies: ape package)

The general format of the command is

RScript nn_classifier.R <treefile.tre> [metadata column number...]

For example, the following command reads in a PB2 tree containing a single unknown named "QUERY." The script identifies the nearest neighbor, and returns the name of the QUERY with the metadata extracted from the nearest neighbors defline at positions 3 and 1:

Rscript nn_classifier.R sample_data/pb2_sample.tre 3 1

# QUERY	A/swine/Minnesota/A01785592/2018	TRIG

It is assumed that the query sequence has less | than the first metadata column specified. This R script relies on the ape package.


pipeline.sh: creates the blast database from reference.fasta, identifies the segment (HA to NS) of the query fasta file and splits them out by segment, builds the segment trees, and runs nn_classifier.R to pull out classification of each segment. (Dependencies: NCBI blast, smof, R with ape, mafft, FastTree)

bash pipeline.sh <query.fasta>

For example

bash pipeline.sh sample_data/query.fasta

# ==== Final results in  query.fasta_Final_Output.txt
# alignment and tree files in the 'query.fasta_output' folder
# Tree files are listed below: 
# -rw-r--r--  1  macusers  15856 Jun  6 10:36 query.fasta_output/H1.tre
# -rw-r--r--  1  macusers  18508 Jun  6 10:36 query.fasta_output/H3.tre
# -rw-r--r--  1  macusers  10042 Jun  6 10:36 query.fasta_output/N1.tre
# -rw-r--r--  1  macusers  17472 Jun  6 10:36 query.fasta_output/N2.tre
# -rw-r--r--  1  macusers  14575 Jun  6 10:36 query.fasta_output/PB2.tre
# -rw-r--r--  1  macusers  14014 Jun  6 10:37 query.fasta_output/PB1.tre
# -rw-r--r--  1  macusers  15881 Jun  6 10:37 query.fasta_output/PA.tre
# -rw-r--r--  1  macusers  16242 Jun  6 10:37 query.fasta_output/NP.tre
# -rw-r--r--  1  macusers  17712 Jun  6 10:37 query.fasta_output/M.tre
# -rw-r--r--  1  macusers  13355 Jun  6 10:38 query.fasta_output/NS.tre

head query.fasta_Final_Output.txt

# QUERY_MH540411_A/swine/Iowa/A02169143/2018	H1	pdm-vaccine	1A.3.3.2-vaccine 
# QUERY_MH608109_A/swine/Minnesota/A01785575/2018	H1	gamma2-beta-like	1A.2-3-like 
# QUERY_MH936448_A/swine/Minnesota/A01785608/2018	H1	gamma2-beta-like	1A.2-3-like 
# QUERY_MK024151_A/swine/Minnesota/A01785613/2018	H1	gamma2-beta-like	1A.2-3-like 
# QUERY_MH802648_A/swine/Iowa/A02254795/2018	H1	gamma2-beta-like	1A.2-3-like 
# QUERY_MH608115_A/swine/Oklahoma/A01785571/2018	H1	gamma2-beta-like	1A.2-3-like 
# QUERY_MH561753_A/swine/Oklahoma/A01785573/2018	H1	gamma2-beta-like	1A.2-3-like 
# QUERY_MH551254_A/swine/South_Dakota/A02016893/2018	H1	beta	1A.2 
# QUERY_MH595472_A/swine/Illinois/A02170163/2018	H1	alpha	1A.1.1 
# QUERY_MH922878_A/swine/Ohio/18TOSU4522/2018	H1	alpha	1A.1.1 

mergeGC.R: will attempt to merge segments into gene constellations. (Dependencies: reshape2 package)

General format of the command is:

Rscript merge.R <query_Final_Output.txt> 

The output will be in Merge_Attempt.txt. For example:

bash pipeline.sh sample_data/query.fasta

# ==== Final results in  query.fasta_Final_Output.txt
# alignment and tree files in the 'query.fasta_output' folder
# Tree files are listed below: 
# -rw-r--r--  1 user  macusers  15648 Mar 15 09:45 query.fasta_output/H1.tre
# -rw-r--r--  1 user  macusers  16997 Mar 15 09:45 query.fasta_output/H3.tre
# -rw-r--r--  1 user  macusers  10040 Mar 15 09:46 query.fasta_output/N1.tre
# -rw-r--r--  1 user  macusers  17585 Mar 15 09:46 query.fasta_output/N2.tre
# -rw-r--r--  1 user  macusers  13674 Mar 15 09:46 query.fasta_output/PB2.tre
# -rw-r--r--  1 user  macusers  11184 Mar 15 09:46 query.fasta_output/PB1.tre
# -rw-r--r--  1 user  macusers  15076 Mar 15 09:46 query.fasta_output/PA.tre
# -rw-r--r--  1 user  macusers  16300 Mar 15 09:46 query.fasta_output/NP.tre
# -rw-r--r--  1 user  macusers  18773 Mar 15 09:46 query.fasta_output/M.tre
# -rw-r--r--  1 user  macusers  13251 Mar 15 09:46 query.fasta_output/NS.tre

cat query.fasta_Final_Output.txt |awk -F'/' 'OFS="\t"{print $4,$0}' | awk -F'\t' 'OFS="\t" {print $1,$3,$4,$5}' > query_output.txt

Rscript mergeGC.R query_output.txt
head -5 Merge_Attempt.txt | column -t

# V1          H1            H3                        N1  N2          PB2            PB1            PA                NP                       M                     NS
# 18TOSU3605  -             2010-human_like|3.2010.1  -   2002_NA_N2  TRIG_avOrigin  -              TRIG_avOrigin     TRIG_classicalOrigin     pdm_EurasianSwOrigin  TRIG_classicalOrigin
# 18TOSU3611  -             2010-human_like|3.2010.1  -   2002_NA_N2  TRIG_avOrigin  -              TRIG_avOrigin     TRIG_classicalOrigin     pdm_EurasianSwOrigin  TRIG_classicalOrigin
# 18TOSU3614  -             2010-human_like|3.2010.1  -   2002_NA_N2  TRIG_avOrigin  -              TRIG_avOrigin     TRIG_classicalOrigin     pdm_EurasianSwOrigin  TRIG_classicalOrigin
# 18TOSU4522  alpha|1A.1.1  -                         -   2002_NA_N2  TRIG_avOrigin  TRIG_huOrigin  pdm_TRIGavOrigin  pdm_TRIGclassicalOrigin  pdm_EurasianSwOrigin  TRIG_classicalOrigin
Clone this wiki locally