Understand The Yaml File
HAPPI GWAS needs a configuration file in YAML format to run its different tasks, such as generating a BLUP dataset, generating a BLUE dataset, or using GAPIT3 to run different models on your dataset. The YAML file lets users specify the input files, the output location, and the parameters used by those tasks. It is divided into six sections: Raw Data, BLUP or BLUE, GAPIT3, Haploview, Match Gene Start and Stop, and Output Directory. Here, we use Demo_GLM.yaml as an example.
#######################################################################
## Raw Data
#######################################################################
raw_data:
by_column:
- 1
- 2
start_column: 3
#######################################################################
## BLUP or BLUE
#######################################################################
BLUP: ./Demo/mdp_traits.txt
BLUP_by_column:
- 1
BLUP_start_column: 2
#######################################################################
## Gapit3
#######################################################################
GAPIT_kinship_matrix:
GAPIT_covariates:
GAPIT_hapmap:
GAPIT_genotype_data_numeric:
GAPIT_genotype_map_numeric:
GAPIT_hapmap_file_extension: hmp.txt
GAPIT_genotype_data_numeric_file_extension:
GAPIT_genotype_map_numeric_file_extension:
GAPIT_hapmap_filename: mdp_genotype_chr
GAPIT_genotype_data_numeric_filename:
GAPIT_genotype_map_numeric_filename:
GAPIT_genotype_file_path: ./Demo/
GAPIT_genotype_file_named_sequentially_from: 1
GAPIT_genotype_file_named_sequentially_to: 10
GAPIT_model:
- GLM
# - MLM
# - MLMM
# - SUPER
# - FarmCPU
GAPIT_SNP_MAF: 0.05
GAPIT_PCA_total: 3
GAPIT_Model_selection: TRUE
GAPIT_SNP_test: TRUE
GAPIT_file_output: TRUE
GAPIT_p_value_threshold:
GAPIT_p_value_fdr_threshold: 0.05
GAPIT_LD_number: 100000
#######################################################################
## Haploview
#######################################################################
Haploview_file_path: ./Demo/
Haploview_file_name: mdp_genotype_haploview_chr
Haploview_file_extension: txt
Haploview_file_named_sequentially_from: 1
Haploview_file_named_sequentially_to: 10
#######################################################################
## Match Gene Start and Stop
#######################################################################
GFF_file_path: ./Demo/
GFF_file_name: gene_chr
GFF_file_extension: gff.txt
GFF_file_named_sequentially_from: 1
GFF_file_named_sequentially_to: 10
#######################################################################
## Output Directory
#######################################################################
output: ../output/demo_output_GLM
Users who plan to generate a dataset with best linear unbiased prediction (BLUP) or best linear unbiased estimation (BLUE) must fill in the Raw Data section. First, set raw_data to the path of the raw data file. Next, list the index numbers of the variable columns under by_column. Then set start_column to the index number of the first trait column. If the generateBLUP argument is used, column 1 is treated as a random effect; if the generateBLUE argument is used, column 1 is treated as a fixed effect. The remaining columns in by_column are always treated as random effects, regardless of which of the two arguments is used.
#######################################################################
## Raw Data
#######################################################################
raw_data: /path/to/raw_data.txt
by_column:
- 1
- 2
start_column: 3
#######################################################################
## Raw Data
#######################################################################
raw_data: /path/to/raw_data.txt
by_column:
- 1
- 2
- 3
- 4
start_column: 5
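To illustrate how by_column and start_column divide a raw data table, here is a minimal Python sketch. It is not part of HAPPI GWAS: the helper name split_columns and the header names are hypothetical, and it assumes that the indices are 1-based and that every column from start_column onward is a trait.

```python
def split_columns(header, by_column, start_column):
    """Hypothetical helper: split a header into variable and trait columns.

    Assumes by_column and start_column are 1-based indices and that
    every column from start_column onward holds a trait.
    """
    variables = [header[i - 1] for i in by_column]
    traits = header[start_column - 1:]
    return variables, traits


# Hypothetical raw data header: two variable columns followed by two traits.
header = ["Line", "Location", "Height", "Yield"]
variables, traits = split_columns(header, by_column=[1, 2], start_column=3)
print("variable columns:", variables)
print("trait columns:", traits)
```

With the first example above (by_column 1 and 2, start_column 3), the first two columns are read as variables and the rest as traits.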
Users who plan to use the GAPIT argument to run models such as GLM, MLM, and FarmCPU must fill in the BLUP or BLUE section and leave the raw_data field in the Raw Data section empty. In this section, BLUP_by_column takes the index numbers of the variable columns, and BLUP_start_column takes the index number of the first trait column in the dataset. Refer to Demo_GLM.yaml for an example of this section.
In the GAPIT3 section, the values of GAPIT_kinship_matrix, GAPIT_covariates, GAPIT_hapmap, GAPIT_genotype_data_numeric, GAPIT_genotype_map_numeric, GAPIT_hapmap_file_extension, GAPIT_genotype_data_numeric_file_extension, GAPIT_genotype_map_numeric_file_extension, GAPIT_hapmap_filename, GAPIT_genotype_data_numeric_filename, GAPIT_genotype_map_numeric_filename, GAPIT_genotype_file_path, GAPIT_genotype_file_named_sequentially_from, and GAPIT_genotype_file_named_sequentially_to are passed directly to the GAPIT3 package. Which of these fields must be filled depends on the format of the reference files:
- Hapmap format, split into chunks (as in Demo_GLM.yaml, where the reference files used to run the GLM model are in this form): fill in GAPIT_hapmap_file_extension, GAPIT_hapmap_filename, GAPIT_genotype_file_path, GAPIT_genotype_file_named_sequentially_from, and GAPIT_genotype_file_named_sequentially_to.
- Hapmap format, not split into chunks: set GAPIT_hapmap to the path of the reference file.
- Numeric genotype data and genotype map, split into chunks: fill in GAPIT_genotype_data_numeric_file_extension, GAPIT_genotype_map_numeric_file_extension, GAPIT_genotype_data_numeric_filename, GAPIT_genotype_map_numeric_filename, GAPIT_genotype_file_path, GAPIT_genotype_file_named_sequentially_from, and GAPIT_genotype_file_named_sequentially_to.
- Numeric genotype data and genotype map, not split into chunks: fill in only GAPIT_genotype_data_numeric and GAPIT_genotype_map_numeric.
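To make the chunked-file fields more concrete, the following Python sketch lists the file names implied by a path, file name, extension, and numeric range. It is only an illustration: the helper expand_chunked_files is hypothetical, and it assumes that names are built as path + file name + number + "." + extension, which is how GAPIT names sequentially numbered genotype files.

```python
import os


def expand_chunked_files(file_path, file_name, file_extension, start, end):
    """Hypothetical helper: list the sequentially named files implied by the YAML fields.

    Assumes each name is built as <path><name><number>.<extension>.
    """
    return [
        os.path.join(file_path, f"{file_name}{i}.{file_extension}")
        for i in range(start, end + 1)
    ]


# Values taken from the GAPIT3 section of Demo_GLM.yaml.
files = expand_chunked_files("./Demo/", "mdp_genotype_chr", "hmp.txt", 1, 10)
for name in files:
    print(name)
```

A listing like this can be compared against the contents of the genotype directory before starting a run, to check that no chunk is missing.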
The value of GAPIT_model is also passed directly to GAPIT3. The only difference is that users do not specify a file path here; instead, they choose a model from the list by uncommenting it and leaving the other models commented out.
GAPIT_p_value_threshold and GAPIT_p_value_fdr_threshold filter significant SNPs by P-value or by FDR, respectively. To use a P-value cutoff, set GAPIT_p_value_threshold and leave GAPIT_p_value_fdr_threshold blank. To use an FDR cutoff, set GAPIT_p_value_fdr_threshold and leave GAPIT_p_value_threshold blank. To filter by a Bonferroni cutoff, divide the desired P-value threshold by the total number of SNPs and enter the result (the Bonferroni-corrected P-value threshold) under GAPIT_p_value_threshold.
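The Bonferroni calculation described above can be sketched in a few lines of Python. The significance level and SNP count below are hypothetical values chosen for illustration; substitute the numbers for your own dataset.

```python
def bonferroni_threshold(alpha, n_snps):
    """Return the Bonferroni-corrected P-value threshold, alpha / n_snps."""
    if n_snps <= 0:
        raise ValueError("n_snps must be a positive integer")
    return alpha / n_snps


# Hypothetical example: desired P-value threshold of 0.05 and 100,000 SNPs.
threshold = bonferroni_threshold(0.05, 100_000)

# Print in plain decimal notation, ready to paste into the YAML file.
print(f"GAPIT_p_value_threshold: {threshold:.10f}")
```

The printed value goes under GAPIT_p_value_threshold, with GAPIT_p_value_fdr_threshold left blank.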
Gallery of GWAS input parameters in GAPIT:
Parameter | Default | Option | Description |
---|---|---|---|
GAPIT_kinship_matrix | NULL | User | Kinship matrix |
GAPIT_covariates | NULL | User | Covariate Variables |
GAPIT_hapmap | NULL | User | Genotype data in Hapmap format |
GAPIT_genotype_data_numeric | NULL | User | Genotype data in numeric format |
GAPIT_genotype_map_numeric | NULL | User | Genotype map file for numeric format |
GAPIT_hapmap_file_extension | hmp.txt | User | File extension for Hapmap file |
GAPIT_genotype_data_numeric_file_extension | NULL | User | File extension for genotype data in numeric format |
GAPIT_genotype_map_numeric_file_extension | NULL | User | File extension for genotype map in numeric format |
GAPIT_hapmap_filename | NULL | User | File name for genotype data in Hapmap format |
GAPIT_genotype_data_numeric_filename | NULL | User | File name for genotype data in numeric format |
GAPIT_genotype_file_path | ./Demo/ | User | Path to genotype file |
GAPIT_genotype_file_named_sequentially_from | 1 | User | Starting number of sequentially named genotype files |
GAPIT_genotype_file_named_sequentially_to | 10 | User | Ending number of sequentially named genotype files |
GAPIT_model | MLM | GLM, MLM, MLMM, SUPER, FarmCPU | GWAS model |
GAPIT_SNP_MAF | 0.05 | >0 and <1 | Minor allele frequency to filter SNPs |
GAPIT_PCA_total | 0 | >0 | Number of PCs used as covariates |
GAPIT_Model_selection | TRUE | TRUE/FALSE | Forward model selection using the Bayesian information criterion (BIC) to determine the optimal number of PCs/covariates |
GAPIT_SNP_test | TRUE | TRUE/FALSE | Perform SNP testing |
GAPIT_file_output | TRUE | TRUE/FALSE | Provides automatic GAPIT output files |
GAPIT_p_value_threshold | NULL | >0 and <1 | P-value threshold used to filter significant SNPs |
GAPIT_p_value_fdr_threshold | 0.05 | >0 and <1 | FDR threshold used to filter significant SNPs |
GAPIT_LD_number | 100000 | >1 | Range (in bp) around significant SNP for LD analysis |
Users who plan to use the extractHaplotype argument, which depends on the GAPIT argument, must fill in every field in the Haploview section. The reference files used in this section must be split into chunks by chromosome.
Users who plan to use the searchGenes argument, which depends on the GAPIT argument, must fill in every field in the Match Gene Start and Stop section. The reference files used in this section must also be split into chunks by chromosome.
The Output Directory section specifies the directory where all output files are written while the tool runs. This section must be completed for HAPPI GWAS to run.
- Users are encouraged to read the user manual before using HAPPI GWAS. The user manual is more detailed and provides additional explanation and guidance.
- The developers of this project recommend using absolute paths in the YAML file. Relative paths will work but are discouraged.
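As a convenience for following the absolute-path recommendation, the Python sketch below converts a relative path into an absolute one. The helper to_absolute is hypothetical and not part of HAPPI GWAS; it assumes that base_dir, when given, is itself an absolute path, and it resolves against the current working directory otherwise.

```python
import os


def to_absolute(path, base_dir=None):
    """Hypothetical helper: resolve a possibly relative path against base_dir.

    base_dir is assumed to be absolute and defaults to the current
    working directory.
    """
    base_dir = base_dir or os.getcwd()
    return os.path.normpath(os.path.join(base_dir, path))


# Relative output path from Demo_GLM.yaml, resolved from the current directory.
absolute_output = to_absolute("../output/demo_output_GLM")
print(f"output: {absolute_output}")
```

The result depends on the directory from which the script is run, so run it from the directory that the relative path was written for.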