
Appendix 2. Creating MatrixTable and QC

Zhiwen Owen Jiang edited this page Nov 25, 2024 · 2 revisions

Before getting started, make sure you have successfully built the environment and installed all dependencies for HEIG. This tutorial is compatible with HEIG (v1.2.0).

This tutorial constructs a hail.MatrixTable from a bfile (PLINK binary fileset). To replicate the analysis, use the genotype data at input/genotype/g1000_eur_1ksnps.

Doing analysis

To convert the bfile into a hail.MatrixTable and perform QC:

heig.py \
--make-mt \
--out output/genotype/g1000_eur_1ksnps_qc \
--maf-min 0.05 \
--maf-max 0.3 \
--hwe 1e-06 \
--call-rate 0.95 \
--bfile input/genotype/g1000_eur_1ksnps \
--grch37 \
--variant-type snv \
--spark-conf input/misc/spark_config_small_mem.json

--make-mt (required): the flag for making hail.MatrixTable.

--out (required): the prefix of output.

--maf-min (optional): the minimum MAF (>). SNPs with a MAF less than or equal to this value are excluded.

--maf-max (optional): the maximum MAF (<=). SNPs with a MAF greater than this value are excluded.

--hwe (optional): a threshold for the Hardy-Weinberg equilibrium (HWE) test (>=). SNPs with an HWE p-value less than this value are excluded.

--call-rate (optional): a genotype call rate threshold, equivalent to 1 - missing rate (>=). SNPs with a call rate less than this value are excluded.

--bfile (required): the prefix of the bfile triplet (.bed, .bim, and .fam).

--grch37 (optional): a flag indicating the reference genome of the genotype data is GRCh37. Default is GRCh38.

--variant-type (optional): the variant type (case insensitive), which must be one of variant, snv, or indel.

--spark-conf (required): a Spark configuration file in JSON format.
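The comparison rules for the four QC thresholds can be written out in a few lines of Python. This is a minimal sketch of the semantics as documented above, not HEIG's own implementation; the function name `passes_qc` and its argument names are hypothetical and exist only for illustration.

```python
from typing import Optional


def passes_qc(
    maf: float,
    hwe_pvalue: float,
    call_rate: float,
    maf_min: Optional[float] = None,
    maf_max: Optional[float] = None,
    hwe: Optional[float] = None,
    min_call_rate: Optional[float] = None,
) -> bool:
    """Return True if a variant survives every threshold that is set.

    Hypothetical helper mirroring the documented comparisons. A threshold
    left as None is skipped, since the corresponding flag is optional.
    """
    if maf_min is not None and not maf > maf_min:    # --maf-min is strict (>)
        return False
    if maf_max is not None and not maf <= maf_max:   # --maf-max is inclusive (<=)
        return False
    if hwe is not None and not hwe_pvalue >= hwe:    # --hwe keeps p-values >= threshold
        return False
    if min_call_rate is not None and not call_rate >= min_call_rate:
        return False                                 # --call-rate keeps rates >= threshold
    return True


# Thresholds taken from the example command above.
thresholds = dict(maf_min=0.05, maf_max=0.3, hwe=1e-06, min_call_rate=0.95)

# A MAF exactly at --maf-min is excluded; a MAF exactly at --maf-max is kept.
print(passes_qc(maf=0.05, hwe_pvalue=0.5, call_rate=0.99, **thresholds))  # False
print(passes_qc(maf=0.30, hwe_pvalue=0.5, call_rate=0.95, **thresholds))  # True
```

The boundary cases are the ones worth checking by hand: the lower MAF bound is exclusive, while the upper MAF bound, the HWE p-value, and the call rate are all inclusive.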

Notes

Additional options include:

--vcf: a VCF file. It must include SNP rsIDs for use in Voxel‐level GWAS reconstruction and Heritability and (cross‐trait) genetic correlation analysis.

--chr-interval: a segment of a chromosome, e.g. 3:1000000,3:2000000, which spans chromosome 3 from bp 1000000 to bp 2000000. The interval cannot span multiple chromosomes, and the end position must be greater than the start position.

--keep, --remove, --extract, and --exclude: options for managing subjects (based on FID and IID) and variants (based on SNP rsID). Refer to Data management and input formats.
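To make the two constraints on --chr-interval concrete, the following sketch parses an interval string and rejects invalid input. It illustrates the documented format only; `parse_chr_interval` is a hypothetical helper, and HEIG's own parsing and error messages may differ.

```python
from typing import Tuple


def parse_chr_interval(interval: str) -> Tuple[str, int, int]:
    """Parse 'CHR:START,CHR:END' into (chromosome, start, end).

    Hypothetical helper that enforces the two documented rules:
    both ends lie on the same chromosome, and end > start.
    """
    try:
        start_str, end_str = interval.strip().split(",")
        start_chr, start_pos = start_str.split(":")
        end_chr, end_pos = end_str.split(":")
        start, end = int(start_pos), int(end_pos)
    except ValueError:
        # Raised by a wrong number of fields or a non-integer position.
        raise ValueError(f"expected 'CHR:START,CHR:END', got {interval!r}") from None
    if start_chr != end_chr:
        raise ValueError("start and end must be on the same chromosome")
    if end <= start:
        raise ValueError("end position must be greater than start position")
    return start_chr, start, end


print(parse_chr_interval("3:1000000,3:2000000"))  # ('3', 1000000, 2000000)
```

The chromosome is kept as a string so that non-numeric names such as X would pass through unchanged.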

Output

  • output/genotype/g1000_eur_1ksnps_qc: a hail.MatrixTable of QCed genotype data.
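To sanity-check the result, the MatrixTable can be loaded directly with Hail. The sketch below assumes that Hail is installed in the HEIG environment and that the MatrixTable sits at exactly the output path listed above; `summarize_matrix_table` is a hypothetical helper, not part of HEIG.

```python
def summarize_matrix_table(path: str) -> dict:
    """Load a Hail MatrixTable and report its dimensions.

    Hypothetical helper; assumes `path` points to a MatrixTable on disk.
    """
    import hail as hl  # imported lazily so this sketch loads without Hail installed

    mt = hl.read_matrix_table(path)
    n_variants, n_samples = mt.count()  # (rows, columns) = (variants, samples)
    mt.describe()  # prints the global, row, column, and entry schemas
    return {"n_variants": n_variants, "n_samples": n_samples}


# Example call, using the output path from this tutorial:
# summarize_matrix_table("output/genotype/g1000_eur_1ksnps_qc")
```

Comparing the variant count with the number of SNPs in the input bfile shows how many variants the QC filters removed.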