Appendix 2. Creating MatrixTable and QC
Before getting started, make sure you have successfully built the environment and installed all dependencies for HEIG. The current tutorial is compatible with HEIG (v1.2.0).
This tutorial constructs a hail.MatrixTable from a bfile. To replicate the analysis, use the genotype data at input/genotype/g1000_eur_1ksnps.
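Before running the conversion, it can help to confirm that the bfile triplet is complete. The sketch below assumes the standard PLINK layout, in which one prefix is shared by a .bed, a .bim, and a .fam file; the helper name missing_bfile_parts is hypothetical and is not part of HEIG.

```python
from pathlib import Path


def missing_bfile_parts(prefix):
    """Return the extensions of the PLINK bfile triplet missing for a prefix."""
    # A PLINK binary fileset shares one prefix across .bed, .bim and .fam.
    # The extension is appended as a string so prefixes containing dots survive.
    return [
        ext
        for ext in (".bed", ".bim", ".fam")
        if not Path(f"{prefix}{ext}").is_file()
    ]


if __name__ == "__main__":
    missing = missing_bfile_parts("input/genotype/g1000_eur_1ksnps")
    print("missing files:", missing or "none")
```

An empty list means all three files were found under the given prefix.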
To convert the bfile into a hail.MatrixTable and perform QC:
heig.py \
--make-mt \
--out output/genotype/g1000_eur_1ksnps_qc \
--maf-min 0.05 \
--maf-max 0.3 \
--hwe 1e-06 \
--call-rate 0.95 \
--bfile input/genotype/g1000_eur_1ksnps \
--grch37 \
--variant-type snv \
--spark-conf input/misc/spark_config_small_mem.json
--make-mt (required): the flag for making a hail.MatrixTable.
--out (required): the prefix of output.
--maf-min (optional): the minimum MAF (>). SNPs with a MAF less than or equal to this value are excluded.
--maf-max (optional): the maximum MAF (<=). SNPs with a MAF greater than this value are excluded.
--hwe (optional): a threshold for the Hardy-Weinberg equilibrium (HWE) test (>=). SNPs with a p-value less than this value are excluded.
--call-rate (optional): a genotype call rate threshold, equivalent to 1 - missing rate (>=). SNPs with a call rate less than this value are excluded.
--bfile (required): the prefix of the bfile triplets.
--grch37 (optional): a flag indicating that the reference genome of the genotype data is GRCh37. Default is GRCh38.
--variant-type (optional): the variant type (case insensitive), which must be one of variant, snv, indel.
--spark-conf (required): a Spark configuration file in JSON format.
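The inclusive and exclusive bounds of the QC thresholds are easy to mix up, so here is a minimal sketch of how they combine for a single variant. It is not HEIG's implementation: the function passes_qc and its parameter names are hypothetical, and the defaults simply mirror the values used in the command above.

```python
def passes_qc(maf, hwe_p, call_rate,
              maf_min=0.05, maf_max=0.3, hwe=1e-6, min_call_rate=0.95):
    """Return True if a variant's summary statistics satisfy all four bounds."""
    return (
        maf > maf_min                    # --maf-min: strict lower bound
        and maf <= maf_max               # --maf-max: inclusive upper bound
        and hwe_p >= hwe                 # --hwe: keep p-values at or above the threshold
        and call_rate >= min_call_rate   # --call-rate: inclusive lower bound
    )


if __name__ == "__main__":
    # A MAF exactly at the minimum is excluded; one exactly at the maximum is kept.
    print(passes_qc(maf=0.05, hwe_p=0.5, call_rate=1.0))
    print(passes_qc(maf=0.3, hwe_p=0.5, call_rate=1.0))
```

Note that only the lower MAF bound is strict; the other three thresholds keep variants that sit exactly on the boundary.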
Additional options include:
--vcf
: a VCF
file, which must have SNP rsID included for Voxel‐level GWAS reconstruction and Heritability and (cross‐trait) genetic correlation analysis.
--chr-interval
: A segment of chromosome, e.g. 3:1000000,3:2000000
, representing from chromosome 3 bp 1000000 to chromosome 3 bp 2000000. Cross-chromosome is not allowed. And the end position must be greater than the start position.
--keep
, --remove
, --extract
, --exclude
for managing subjects (based on FID
and IID
) and variants (based on SNP rsID). Refer to Data management and input formats.
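The two constraints on --chr-interval (same chromosome, end greater than start) can be checked before launching a long job. The following is a rough sketch under the assumption that the value always has the form CHR:START,CHR:END as in the example; parse_chr_interval is a hypothetical helper, and HEIG's own parsing may differ.

```python
def parse_chr_interval(text):
    """Parse 'CHR:START,CHR:END' into (chromosome, start, end)."""
    # Tuple unpacking raises ValueError if there are not exactly two parts.
    start, end = text.split(",")
    start_chr, start_pos = start.split(":")
    end_chr, end_pos = end.split(":")
    start_pos, end_pos = int(start_pos), int(end_pos)

    if start_chr != end_chr:
        raise ValueError("the interval must stay on one chromosome")
    if end_pos <= start_pos:
        raise ValueError("the end position must be greater than the start")
    return start_chr, start_pos, end_pos


if __name__ == "__main__":
    print(parse_chr_interval("3:1000000,3:2000000"))
```

The chromosome is kept as a string so that non-numeric names such as X are not rejected.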
- output/genotype/g1000_eur_1ksnps_qc: a hail.MatrixTable of QCed genotype data.
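To sanity-check the result, the MatrixTable can be opened directly with Hail. This is only a sketch: it assumes Hail is installed in the HEIG environment and that the path passed in is exactly the directory HEIG wrote (the tutorial gives only the output prefix, so any suffix HEIG may append is not covered here). The function name load_qc_matrix_table is hypothetical.

```python
def load_qc_matrix_table(path):
    """Open a Hail MatrixTable and report its dimensions."""
    # Imported lazily so this module can be loaded without Hail installed.
    import hail as hl

    # Hail initializes itself with default settings on first use.
    mt = hl.read_matrix_table(path)

    # MatrixTable.count() returns (number of rows, number of columns),
    # i.e. (variants, samples) for genotype data.
    n_variants, n_samples = mt.count()
    print(f"{n_variants} variants x {n_samples} samples")
    return mt


# Example call (assumed path, adjust to the directory actually written):
# mt = load_qc_matrix_table("output/genotype/g1000_eur_1ksnps_qc")
```

Comparing the variant count before and after QC is a quick way to see how many SNPs the thresholds removed.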