Understand The Yaml File

An overview of the yaml file

HAPPI GWAS needs a configuration file in yaml format to run its different tasks, such as generating a BLUP dataset, generating a BLUE dataset, or using GAPIT3 to run different models on your dataset. The yaml file lets users specify the input files, the output location, and the parameters used when running those tasks. The yaml file is divided into six sections: Raw Data, BLUP or BLUE, GAPIT3, Haploview, Match Gene Start and Stop, and Output Directory. Here, we will use our Demo_GLM.yaml as an example.

#######################################################################
## Raw Data
#######################################################################
raw_data: 
by_column:
- 1
- 2
start_column: 3

#######################################################################
## BLUP or BLUE
#######################################################################
BLUP: ./Demo/mdp_traits.txt
BLUP_by_column:
- 1
BLUP_start_column: 2

#######################################################################
## Gapit3
#######################################################################
GAPIT_kinship_matrix: 
GAPIT_covariates: 
GAPIT_hapmap:
GAPIT_genotype_data_numeric:
GAPIT_genotype_map_numeric:
GAPIT_hapmap_file_extension: hmp.txt
GAPIT_genotype_data_numeric_file_extension: 
GAPIT_genotype_map_numeric_file_extension: 
GAPIT_hapmap_filename: mdp_genotype_chr
GAPIT_genotype_data_numeric_filename: 
GAPIT_genotype_map_numeric_filename: 
GAPIT_genotype_file_path: ./Demo/
GAPIT_genotype_file_named_sequentially_from: 1
GAPIT_genotype_file_named_sequentially_to: 10
GAPIT_model: 
 - GLM
# - MLM
# - MLMM
# - SUPER
# - FarmCPU
GAPIT_SNP_MAF: 0.05
GAPIT_PCA_total: 3
GAPIT_Model_selection: TRUE
GAPIT_SNP_test: TRUE
GAPIT_file_output: TRUE
GAPIT_p_value_threshold:
GAPIT_p_value_fdr_threshold: 0.05
GAPIT_LD_number: 100000

#######################################################################
## Haploview
#######################################################################
Haploview_file_path: ./Demo/
Haploview_file_name: mdp_genotype_haploview_chr
Haploview_file_extension: txt
Haploview_file_named_sequentially_from: 1
Haploview_file_named_sequentially_to: 10

#######################################################################
## Match Gene Start and Stop
#######################################################################
GFF_file_path: ./Demo/
GFF_file_name: gene_chr
GFF_file_extension: gff.txt
GFF_file_named_sequentially_from: 1
GFF_file_named_sequentially_to: 10

#######################################################################
## Output Directory
#######################################################################
output: ../output/demo_output_GLM

Raw Data

If users plan to use best linear unbiased prediction (BLUP) or best linear unbiased estimation (BLUE) to generate a dataset, they have to fill in this section. First, set raw_data to the path of the raw data file. Next, list the index numbers of the variable columns under by_column. Finally, set start_column to the index number of the first trait column. If the generateBLUP argument is used, column 1 is treated as a random effect. Conversely, if the generateBLUE argument is used, column 1 is treated as a fixed effect. The remaining columns in by_column are always treated as random effects, regardless of whether the generateBLUP or generateBLUE argument is used.

Raw Data Section Example 1:

#######################################################################
## Raw Data
#######################################################################
raw_data: /path/to/raw_data.txt
by_column:
- 1
- 2
start_column: 3

Raw Data Section Example 2:

#######################################################################
## Raw Data
#######################################################################
raw_data: /path/to/raw_data.txt
by_column:
- 1
- 2
- 3
- 4
start_column: 5

BLUP or BLUE

If users plan to use the GAPIT argument to run models such as GLM, MLM, and FarmCPU, they must fill in this section and leave the raw_data field in the Raw Data section empty. As in Demo_GLM.yaml, the BLUP field holds the path to the dataset. BLUP_by_column takes the index number of the variable column, and BLUP_start_column takes the index number of the first trait column in your dataset. Users can refer to Demo_GLM.yaml for this section.
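As a rough sketch of this setup, the fragment below leaves raw_data empty and fills in the BLUP fields. The file path is a placeholder, not a file shipped with HAPPI GWAS, and the column numbers assume a dataset whose first column is the variable column and whose traits begin in the second column.

```yaml
# Raw Data section: raw_data stays empty because a ready-made
# dataset is supplied in the BLUP field below.
raw_data:

# BLUP or BLUE section (placeholder path; replace with your own file).
BLUP: /path/to/blup_or_blue_dataset.txt
BLUP_by_column:
 - 1
BLUP_start_column: 2
```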

GAPIT3

In this section, the values of GAPIT_kinship_matrix, GAPIT_covariates, GAPIT_hapmap, GAPIT_genotype_data_numeric, GAPIT_genotype_map_numeric, GAPIT_hapmap_file_extension, GAPIT_genotype_data_numeric_file_extension, GAPIT_genotype_map_numeric_file_extension, GAPIT_hapmap_filename, GAPIT_genotype_data_numeric_filename, GAPIT_genotype_map_numeric_filename, GAPIT_genotype_file_path, GAPIT_genotype_file_named_sequentially_from, and GAPIT_genotype_file_named_sequentially_to are passed directly to the GAPIT3 package. Which of these fields need to be filled depends on the format of the reference files and on whether they are split into chunks. In Demo_GLM.yaml, the reference files used to run the GLM model are in hapmap format and split into chunks. Therefore, GAPIT_hapmap_file_extension, GAPIT_hapmap_filename, GAPIT_genotype_file_path, GAPIT_genotype_file_named_sequentially_from, and GAPIT_genotype_file_named_sequentially_to are filled. If users plan to use a reference file that is in hapmap format and not split into chunks, they need to put the path of the reference file in GAPIT_hapmap. If the reference files are numeric genotype data and a genotype map split into chunks, GAPIT_genotype_data_numeric_file_extension, GAPIT_genotype_map_numeric_file_extension, GAPIT_genotype_data_numeric_filename, GAPIT_genotype_map_numeric_filename, GAPIT_genotype_file_path, GAPIT_genotype_file_named_sequentially_from, and GAPIT_genotype_file_named_sequentially_to need to be filled. If the reference files are numeric genotype data and a genotype map not split into chunks, only GAPIT_genotype_data_numeric and GAPIT_genotype_map_numeric need to be filled.
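The three layouts not covered by the demo could be sketched as follows. These are illustrative fragments, not files from the demo: every path and file name is a placeholder, only one layout should be used in a given yaml file, and keys not shown are left blank. The comment about chunk file names assumes that chunks are named as file name, chunk number, a dot, then the extension, which is what the demo values suggest but is not stated in the text above.

```yaml
# Layout 1: a single hapmap file that is not split into chunks.
GAPIT_hapmap: /path/to/genotype.hmp.txt
---
# Layout 2: numeric genotype data and genotype map, each split into chunks.
# With these placeholder values the data chunks would be expected to be
# /path/to/chunks/genotype_numeric_chr1.txt ... genotype_numeric_chr10.txt
# (naming pattern is an assumption).
GAPIT_genotype_data_numeric_filename: genotype_numeric_chr
GAPIT_genotype_data_numeric_file_extension: txt
GAPIT_genotype_map_numeric_filename: genotype_map_chr
GAPIT_genotype_map_numeric_file_extension: txt
GAPIT_genotype_file_path: /path/to/chunks/
GAPIT_genotype_file_named_sequentially_from: 1
GAPIT_genotype_file_named_sequentially_to: 10
---
# Layout 3: numeric genotype data and genotype map as single, unsplit files.
GAPIT_genotype_data_numeric: /path/to/genotype_numeric.txt
GAPIT_genotype_map_numeric: /path/to/genotype_map.txt
```

The `---` lines only separate the three alternatives in this sketch; they are not meant to appear in a HAPPI GWAS yaml file.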

The value of GAPIT_model is passed into GAPIT3 directly as well. The only difference is that users do not need to specify a file path here; instead, they choose a model from the list by un-commenting the model they want and leaving the rest of the models commented.
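For instance, switching the demo from GLM to MLM would presumably look like the fragment below, with MLM un-commented and every other model commented out. This is a sketch based on the model list in Demo_GLM.yaml, not a tested configuration.

```yaml
GAPIT_model:
# - GLM
 - MLM
# - MLMM
# - SUPER
# - FarmCPU
```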

GAPIT_p_value_threshold and GAPIT_p_value_fdr_threshold control how significant SNPs are filtered, either by P-value or by FDR. To filter by P-value, put a user-defined P-value threshold under GAPIT_p_value_threshold and leave GAPIT_p_value_fdr_threshold blank. To filter by FDR, put a user-defined FDR threshold under GAPIT_p_value_fdr_threshold and leave GAPIT_p_value_threshold blank. To filter by a Bonferroni cutoff, take the desired P-value threshold and divide it by the total number of SNPs, then put this Bonferroni-corrected P-value threshold under GAPIT_p_value_threshold.
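As a worked example of the Bonferroni calculation, assume a desired P-value threshold of 0.05 and a dataset of 100,000 SNPs; both numbers are made up for illustration. The corrected threshold is 0.05 / 100000 = 0.0000005, which goes under GAPIT_p_value_threshold while the FDR field stays blank.

```yaml
# Bonferroni cutoff: 0.05 / 100000 SNPs = 0.0000005 (illustrative numbers).
GAPIT_p_value_threshold: 0.0000005
# Left blank because the P-value threshold is used instead of FDR.
GAPIT_p_value_fdr_threshold:
```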

Gallery of GWAS input parameters in GAPIT:

Parameter	Default	Option	Description
GAPIT_kinship_matrix	NULL	User	Kinship matrix
GAPIT_covariates	NULL	User	Covariate Variables
GAPIT_hapmap	NULL	User	Genotype data in Hapmap format
GAPIT_genotype_data_numeric	NULL	User	Genotype data in numeric format
GAPIT_genotype_map_numeric	NULL	User	Genotype map file for the numeric format
GAPIT_hapmap_file_extension	hmp.txt	User	File extension for Hapmap file
GAPIT_genotype_data_numeric_file_extension	NULL	User	File extension for genotype data in numeric format
GAPIT_genotype_map_numeric_file_extension	NULL	User	File extension for genotype map in numeric format
GAPIT_hapmap_filename	NULL	User	File name for genotype data in Hapmap format
GAPIT_genotype_data_numeric_filename	NULL	User	File name for genotype data in numeric format
GAPIT_genotype_map_numeric_filename	NULL	User	File name for genotype map in numeric format
GAPIT_genotype_file_path	./Demo/	User	Path to genotype file
GAPIT_genotype_file_named_sequentially_from	1	User	Starting number of sequentially named genotype files
GAPIT_genotype_file_named_sequentially_to	10	User	Ending number of sequentially named genotype files
GAPIT_model	MLM	GLM MLM MLMM SUPER FarmCPU	GWAS model
GAPIT_SNP_MAF	0.05	>0 and <1	Minor allele frequency to filter SNPs
GAPIT_PCA_total	0	>0	Number of PCs used as covariates
GAPIT_Model_selection	TRUE	TRUE/FALSE	Forward model selection is done using Bayesian information criterion (BIC) to determine optimal PC/Covariables.
GAPIT_SNP_test	TRUE	TRUE/FALSE	Perform SNP testing
GAPIT_file_output	TRUE	TRUE/FALSE	Provides automatic GAPIT output files
GAPIT_p_value_threshold	NULL	>0 and <1	P-value threshold used to filter significant SNPs
GAPIT_p_value_fdr_threshold	0.05	>0 and <1	FDR threshold used to filter significant SNPs
GAPIT_LD_number	100000	>1	Range (in bp) around significant SNP for LD analysis

Haploview

If users plan to use the extractHaplotype argument, which depends on the GAPIT argument, all the fields in this section need to be filled. The reference files used in this section need to be split into chunks by chromosome.

Match Gene Start and Stop

If users plan to use the searchGenes argument, which depends on the GAPIT argument, all the fields in this section need to be filled. The reference files used in this section need to be split into chunks by chromosome.

Output Directory

This section holds the output directory where all the output files are written while the tool is running. This section must be completed for HAPPI GWAS to run.

Side Notes

Users are encouraged to read the user manual before using HAPPI GWAS. The user manual is more detailed and provides additional explanation and guidance.
The developer of this project recommends using absolute paths in the yaml file. Relative paths will work, but are discouraged.
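
A sketch of what absolute paths might look like for the path fields used in Demo_GLM.yaml. The /home/user/HAPPI_GWAS prefix is a made-up location; substitute wherever the files actually live on your system.

```yaml
# Hypothetical absolute paths replacing the relative ./Demo/ and ../output/ paths.
BLUP: /home/user/HAPPI_GWAS/Demo/mdp_traits.txt
GAPIT_genotype_file_path: /home/user/HAPPI_GWAS/Demo/
Haploview_file_path: /home/user/HAPPI_GWAS/Demo/
GFF_file_path: /home/user/HAPPI_GWAS/Demo/
output: /home/user/HAPPI_GWAS/output/demo_output_GLM
```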
