Understand The Yaml File

Yen On Chan edited this page May 16, 2020 · 2 revisions

An overview of the yaml file

HAPPI GWAS needs a configuration file, the yaml file, to run different sets of tasks, such as generating a BLUP dataset, generating a BLUE dataset, or using GAPIT3 to run different models on your dataset. The yaml file lets users specify the input and output files and the parameters used when running those tasks. It is divided into six sections: Raw Data, BLUP or BLUE, GAPIT3, Haploview, Match Gene Start and Stop, and Output Directory. Here, we use our Demo_GLM.yaml as an example.

#######################################################################
## Raw Data
#######################################################################
raw_data: 
by_column:
- 1
- 2
start_column: 3

#######################################################################
## BLUP or BLUE
#######################################################################
BLUP: ./Demo/mdp_traits.txt
BLUP_by_column:
- 1
BLUP_start_column: 2

#######################################################################
## Gapit3
#######################################################################
GAPIT_kinship_matrix: 
GAPIT_covariates: 
GAPIT_hapmap:
GAPIT_genotype_data_numeric:
GAPIT_genotype_map_numeric:
GAPIT_hapmap_file_extension: hmp.txt
GAPIT_genotype_data_numeric_file_extension: 
GAPIT_genotype_map_numeric_file_extension: 
GAPIT_hapmap_filename: mdp_genotype_chr
GAPIT_genotype_data_numeric_filename: 
GAPIT_genotype_map_numeric_filename: 
GAPIT_genotype_file_path: ./Demo/
GAPIT_genotype_file_named_sequentially_from: 1
GAPIT_genotype_file_named_sequentially_to: 10
GAPIT_model: 
 - GLM
# - MLM
# - MLMM
# - SUPER
# - FarmCPU
GAPIT_SNP_MAF: 0.05
GAPIT_PCA_total: 3
GAPIT_Model_selection: TRUE
GAPIT_SNP_test: TRUE
GAPIT_file_output: TRUE
GAPIT_p_value_threshold:
GAPIT_p_value_fdr_threshold: 0.05
GAPIT_LD_number: 100000

#######################################################################
## Haploview
#######################################################################
Haploview_file_path: ./Demo/
Haploview_file_name: mdp_genotype_haploview_chr
Haploview_file_extension: txt
Haploview_file_named_sequentially_from: 1
Haploview_file_named_sequentially_to: 10

#######################################################################
## Match Gene Start and Stop
#######################################################################
GFF_file_path: ./Demo/
GFF_file_name: gene_chr
GFF_file_extension: gff.txt
GFF_file_named_sequentially_from: 1
GFF_file_named_sequentially_to: 10

#######################################################################
## Output Directory
#######################################################################
output: ../output/demo_output_GLM

Raw Data

Users who plan to use best linear unbiased prediction (BLUP) or best linear unbiased estimation (BLUE) to generate a dataset have to fill in this section. First, set raw_data to the path of the raw data file. Next, list the index numbers of the variable columns under by_column. Finally, set start_column to the index number of the first trait column. If the generateBLUP argument is used, column 1 is treated as a random effect. Conversely, if the generateBLUE argument is used, column 1 is treated as a fixed effect. The remaining columns in by_column are always treated as random effects, regardless of whether the generateBLUP or generateBLUE argument is used.

Raw Data Section Example 1:
#######################################################################
## Raw Data
#######################################################################
raw_data: /path/to/raw_data.txt
by_column:
- 1
- 2
start_column: 3
Raw Data Section Example 2:
#######################################################################
## Raw Data
#######################################################################
raw_data: /path/to/raw_data.txt
by_column:
- 1
- 2
- 3
- 4
start_column: 5

BLUP or BLUE

If users plan to use the GAPIT argument to run models such as GLM, MLM, and FarmCPU, they must fill in this section and leave the raw_data field in the Raw Data section empty. BLUP takes the path of the dataset (./Demo/mdp_traits.txt in the demo), BLUP_by_column takes the index number of the variable column, and BLUP_start_column takes the index number of the first trait column in the dataset. Users can refer to Demo_GLM.yaml for this section.
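
As a minimal sketch of the rule above, the fragment below leaves raw_data empty and fills in the BLUP fields. The values are copied from Demo_GLM.yaml; the comments are our reading of the text, not output from the tool.

```yaml
# Raw Data section: raw_data stays empty when the GAPIT argument is used.
raw_data:
by_column:
- 1
- 2
start_column: 3

# BLUP or BLUE section: the dataset that GAPIT3 will be run on.
BLUP: ./Demo/mdp_traits.txt
BLUP_by_column:
- 1                    # index of the variable column
BLUP_start_column: 2   # index of the first trait column
```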

GAPIT3

In this section, the values of the following fields are passed directly to the GAPIT3 package: GAPIT_kinship_matrix, GAPIT_covariates, GAPIT_hapmap, GAPIT_genotype_data_numeric, GAPIT_genotype_map_numeric, GAPIT_hapmap_file_extension, GAPIT_genotype_data_numeric_file_extension, GAPIT_genotype_map_numeric_file_extension, GAPIT_hapmap_filename, GAPIT_genotype_data_numeric_filename, GAPIT_genotype_map_numeric_filename, GAPIT_genotype_file_path, GAPIT_genotype_file_named_sequentially_from, and GAPIT_genotype_file_named_sequentially_to.

In Demo_GLM.yaml, the reference files used to run the GLM model are in hapmap format and split into chunks. Therefore, GAPIT_hapmap_file_extension, GAPIT_hapmap_filename, GAPIT_genotype_file_path, GAPIT_genotype_file_named_sequentially_from, and GAPIT_genotype_file_named_sequentially_to are filled in. The other cases are:

  1. Hapmap format, not split into chunks: put the path of the reference file in GAPIT_hapmap.
  2. Numeric genotype data and genotype map, split into chunks: fill in GAPIT_genotype_data_numeric_file_extension, GAPIT_genotype_map_numeric_file_extension, GAPIT_genotype_data_numeric_filename, GAPIT_genotype_map_numeric_filename, GAPIT_genotype_file_path, GAPIT_genotype_file_named_sequentially_from, and GAPIT_genotype_file_named_sequentially_to.
  3. Numeric genotype data and genotype map, not split into chunks: fill in only GAPIT_genotype_data_numeric and GAPIT_genotype_map_numeric.
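
As a rough sketch of the case where the hapmap reference file is not split into chunks, the genotype fields might look like the fragment below. The path /path/to/genotype.hmp.txt is a placeholder, and leaving the chunk-related fields empty is an assumption based on the text above rather than documented behaviour.

```yaml
# Single hapmap file, not split into chunks (placeholder path).
GAPIT_hapmap: /path/to/genotype.hmp.txt
GAPIT_genotype_data_numeric:
GAPIT_genotype_map_numeric:
# Chunk-related fields are assumed to stay empty in this case.
GAPIT_hapmap_file_extension:
GAPIT_hapmap_filename:
GAPIT_genotype_file_path:
GAPIT_genotype_file_named_sequentially_from:
GAPIT_genotype_file_named_sequentially_to:
```

For comparison, with the chunked values in Demo_GLM.yaml the files read are presumably named by joining the path, file name, chunk number, and extension, for example ./Demo/mdp_genotype_chr1.hmp.txt through ./Demo/mdp_genotype_chr10.hmp.txt. That naming pattern is an assumption; check it against the files in the Demo folder.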

The value of GAPIT_model is passed directly into GAPIT3 as well. The only difference is that users do not specify a file path here; instead, they choose a model from the list by un-commenting the model they want and leaving the rest of the models commented.
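
For instance, switching the demo from GLM to MLM could look like the sketch below: the GLM line is commented out and the MLM line is un-commented, with the indentation kept as in Demo_GLM.yaml.

```yaml
GAPIT_model:
# - GLM
 - MLM
# - MLMM
# - SUPER
# - FarmCPU
```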

GAPIT_p_value_threshold and GAPIT_p_value_fdr_threshold filter significant SNPs by P-value or FDR, respectively. To use a P-value cutoff, put the threshold in GAPIT_p_value_threshold and leave GAPIT_p_value_fdr_threshold blank. To use an FDR cutoff, put the threshold in GAPIT_p_value_fdr_threshold and leave GAPIT_p_value_threshold blank. To filter by a Bonferroni cutoff, divide the desired P-value threshold by the total number of SNPs and put the result (the Bonferroni-corrected P-value threshold) in GAPIT_p_value_threshold.
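
The Bonferroni arithmetic can be illustrated with made-up numbers. The sketch below assumes a desired P-value threshold of 0.05 and a dataset of 100000 SNPs; both figures are hypothetical and should be replaced with your own.

```yaml
# Hypothetical inputs: desired P-value threshold 0.05, 100000 SNPs in total.
# Bonferroni-corrected threshold = 0.05 / 100000 = 0.0000005
GAPIT_p_value_threshold: 0.0000005
# Left blank because a P-value cutoff is used instead of an FDR cutoff.
GAPIT_p_value_fdr_threshold:
```

The threshold is written in plain decimal form here because some YAML parsers do not read exponent notation such as 5e-7 as a number.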

Gallery of GWAS input parameters in GAPIT:

| Parameter | Default | Option | Description |
|---|---|---|---|
| GAPIT_kinship_matrix | NULL | User | Kinship matrix |
| GAPIT_covariates | NULL | User | Covariate variables |
| GAPIT_hapmap | NULL | User | Genotype data in Hapmap format |
| GAPIT_genotype_data_numeric | NULL | User | Genotype data in numeric format |
| GAPIT_genotype_map_numeric | NULL | User | Genotype map file in numeric format |
| GAPIT_hapmap_file_extension | hmp.txt | User | File extension for Hapmap file |
| GAPIT_genotype_data_numeric_file_extension | NULL | User | File extension for genotype data in numeric format |
| GAPIT_genotype_map_numeric_file_extension | NULL | User | File extension for genotype map in numeric format |
| GAPIT_hapmap_filename | NULL | User | File name for genotype data in Hapmap format |
| GAPIT_genotype_data_numeric_filename | NULL | User | File name for genotype data in numeric format |
| GAPIT_genotype_file_path | ./Demo/ | User | Path to genotype file |
| GAPIT_genotype_file_named_sequentially_from | 1 | User | Starting number of sequentially named genotype files |
| GAPIT_genotype_file_named_sequentially_to | 10 | User | Ending number of sequentially named genotype files |
| GAPIT_model | MLM | GLM, MLM, MLMM, SUPER, FarmCPU | GWAS model |
| GAPIT_SNP_MAF | 0.05 | >0 and <1 | Minor allele frequency to filter SNPs |
| GAPIT_PCA_total | 0 | >0 | Number of PCs as covariates |
| GAPIT_Model_selection | TRUE | TRUE/FALSE | Forward model selection using the Bayesian information criterion (BIC) to determine the optimal PCs/covariates |
| GAPIT_SNP_test | TRUE | TRUE/FALSE | Perform SNP testing |
| GAPIT_file_output | TRUE | TRUE/FALSE | Provides automatic GAPIT output files |
| GAPIT_p_value_threshold | NULL | >0 and <1 | P-value threshold used to filter significant SNPs |
| GAPIT_p_value_fdr_threshold | 0.05 | >0 and <1 | FDR threshold used to filter significant SNPs |
| GAPIT_LD_number | 100000 | >1 | Range (in bp) around significant SNP for LD analysis |

Haploview

If users plan to use the extractHaplotype argument, which depends on the GAPIT argument, all the fields in this section need to be filled in. The reference files used in this section need to be split into chunks by chromosome.

Match Gene Start and Stop

If users plan to use the searchGenes argument, which depends on the GAPIT argument, all the fields in this section need to be filled in. The reference files used in this section need to be split into chunks by chromosome.

Output Directory

This section holds the output directory where all output files are written while the tool is running. It must be filled in for HAPPI GWAS to run.



Side Notes

  1. It is always a good idea to read the user manual. It is written in more detail and gives users more explanation and guidance.
  2. The developer of this project recommends using absolute paths in the yaml file. Relative paths are not ruled out, but users have to understand which directory each relative path is relative to.
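
To illustrate the second note, a few of the demo fields rewritten with absolute paths might look like the sketch below. The prefix /home/user/HAPPI_GWAS is a hypothetical install location, not a path defined by the tool; substitute the real one.

```yaml
# Hypothetical absolute paths; replace /home/user/HAPPI_GWAS with the actual location.
BLUP: /home/user/HAPPI_GWAS/Demo/mdp_traits.txt
GAPIT_genotype_file_path: /home/user/HAPPI_GWAS/Demo/
Haploview_file_path: /home/user/HAPPI_GWAS/Demo/
GFF_file_path: /home/user/HAPPI_GWAS/Demo/
output: /home/user/HAPPI_GWAS/output/demo_output_GLM
```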