PGS Background #2
alkaZeltser
started this conversation in
General
Replies: 1 comment
-
I'd like to suggest we use the more formal statistical learning terminology for the three steps:
-
What is a PGS?
Nomenclature
PGS stands for Polygenic Score. The scientific community uses many near-equivalent names for this score and has not yet reached a consensus on the most appropriate one. Polygenic index (PGI), polygenic risk score (PRS), genomic risk score (GRS), and polygenic hazard score (PHS) all describe essentially the same thing, with small nuances in some cases (for example, a polygenic hazard score is typically derived from a Cox proportional hazards model rather than from linear or logistic regression). Since the largest database of such published scores has opted for the "PGS" nomenclature, we have chosen to adopt PGS as well.
Purpose
A Polygenic Score is a measure of the aggregated effects of multiple variants (typically SNPs) across the genome on a specific trait. The trait can be a disease (such as a diagnosis of prostate cancer) or any other measurable phenotype (such as height). The score is interpreted as an estimate of genetic predisposition to the trait. In animal and plant agriculture, PGSs for desired traits (e.g. milk production, fruit size) are used to make breeding decisions that maintain or improve those traits in successive generations. In human medicine, a PGS can provide an estimate of an individual's risk of a specific disease, which doctors and patients can use to inform management decisions.
Calculation
PGS calculation has two distinct phases, and the standard nomenclature does not distinguish them well out of context. To "calculate" a PGS could mean either of the following:
1. Feature Selection and Model Training
In human genetics, a PGS is computed using summary statistics from a Genome-Wide Association Study (GWAS). A typical GWAS consists of millions of regressions, one per SNP, run in cohorts that typically number hundreds of thousands of people, with each SNP tested as a predictor of the trait of interest. The result is a set of summary statistics: for every measured SNP, an estimated effect on the trait in the form of a regression coefficient (beta, Odds Ratio, or Hazard Ratio, depending on the model and reporting decisions). The final set of PGS weights can simply be these effect sizes.

However, a variety of additional heuristics can be used to improve the accuracy or clinical utility of a PGS, such as LD adjustment and p-value thresholding. Most PGSs restrict their component SNPs to those significantly associated with the trait (p-value thresholding), resulting in scores with dozens to hundreds of component SNPs. Many studies have nevertheless shown that including all measured SNPs, regardless of significance, often improves PGS accuracy. Such PGSs have millions of component SNPs.
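As a rough illustration of p-value thresholding, the sketch below filters a handful of made-up summary statistics and keeps the surviving effect sizes as weights. The `SnpStat` record, the `select_weights` and `odds_ratio_to_beta` helpers, and all SNP identifiers and numbers are hypothetical and exist only for this example. The default threshold of 5e-8 is the conventional genome-wide significance level, and taking the natural log of an odds ratio so that weights are additive is a common convention that we assume here; neither is prescribed by the text above.

```python
import math
from dataclasses import dataclass


@dataclass
class SnpStat:
    """One row of (hypothetical) GWAS summary statistics."""
    snp_id: str
    effect_allele: str
    beta: float      # per-allele effect on an additive scale
    p_value: float   # association p-value from the GWAS regression


def odds_ratio_to_beta(odds_ratio: float) -> float:
    """Convert an odds ratio to a log-odds weight (assumed convention)."""
    return math.log(odds_ratio)


def select_weights(stats, p_threshold=5e-8):
    """Keep SNPs with p-value <= p_threshold and return {snp_id: weight}."""
    return {s.snp_id: s.beta for s in stats if s.p_value <= p_threshold}


# Made-up summary statistics, for illustration only.
stats = [
    SnpStat("rs_a", "A", 0.12, 1e-12),
    SnpStat("rs_b", "T", -0.05, 3e-3),
    SnpStat("rs_c", "G", odds_ratio_to_beta(1.25), 2e-9),
]

strict = select_weights(stats)                    # significant SNPs only
lenient = select_weights(stats, p_threshold=1.0)  # every measured SNP
print(sorted(strict), sorted(lenient))
```

Setting the threshold to 1.0 reproduces the "include all measured SNPs" strategy, so the same function covers both ends of the trade-off described above. LD adjustment is not sketched here because it requires a reference panel.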
2. Model Application
Once the PGS component SNP-set and their weights have been computed, the PGS can be applied to any individual with genotype data at the component SNP sites. The pieces of information needed are:
Genetic data usually comes from a sequencing or microarray experiment, and can be processed through a genotype imputation server. PGS component SNPs are then matched to the individual's genetic data. This is a matter of matching genomic coordinates between a PGS weight file and a genotype data file (typically VCF).

Next, the individual's genotype dosage must be determined at each matched site. Dosage is defined as the number of effect alleles that an individual has. In the standard situation, an individual inherits two copies of each site, one from each parent, so the number of effect alleles an individual can have is 0, 1, or 2.* Calculating the dosage is sometimes trickier, for example at a multiallelic site (more than 2 possible alleles) or when there is a potential strand flip (the likely cause when neither the effect allele nor the other allele reported for the PGS variant matches the alleles observed at the site).

Once matching and dosage are sorted out, the final step is to apply the standard weighted sum formula for a single individual $i$:

$$PGS_i = \sum_{m=1}^{M} \beta_m \times dosage_{i,m}$$

where $m$ is a PGS component variant out of a total of $M$ variants, $\beta_m$ represents the effect size weight of the $m^{th}$ variant, and $dosage_{i,m}$ is individual $i$'s effect-allele dosage at that variant.
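The weighted sum can be sketched in a few lines of Python. The function name `polygenic_score`, the dictionary-based inputs, and the example variant IDs and numbers are all hypothetical. Skipping variants that have no dosage and reporting them separately is one possible design choice that we assume here, not something the text above prescribes.

```python
def polygenic_score(weights, dosages):
    """Weighted sum of effect-allele dosages for one individual.

    weights: {variant_id: beta_m} for the M component variants.
    dosages: {variant_id: dosage} for the individual, each in [0, 2].
    Returns (score, missing), where `missing` lists component variants
    that had no dosage and were therefore left out of the sum.
    """
    score = 0.0
    missing = []
    for variant_id, beta in weights.items():
        if variant_id in dosages:
            score += beta * dosages[variant_id]
        else:
            missing.append(variant_id)
    return score, missing


# Made-up weights and dosages, for illustration only.
weights = {"v1": 0.5, "v2": -0.25, "v3": 1.0}
dosages = {"v1": 2, "v2": 1, "v3": 0}

score, missing = polygenic_score(weights, dosages)
print(score, missing)  # 0.5*2 + (-0.25)*1 + 1.0*0 = 0.75, nothing missing
```

Returning the list of unmatched variants makes it easy to see how much of the score was actually covered by the individual's genotype data.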
*Note: in imputed genotype data, the dosage can also be reported as a probabilistic estimate, taking any value between 0 and 2 e.g. 1.998.
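The dosage step, including a simple strand-flip check and the probabilistic dosage mentioned in the note, might look like the sketch below. It assumes a biallelic, single-base site and a genotype given as a pair of allele indices (0 for the reference allele, 1 for the alternate), as in a VCF `GT` field. The helper names and example alleles are hypothetical. Palindromic sites (A/T or C/G) are strand-ambiguous, so comparing alleles alone cannot confirm their strand; this sketch simply trusts a direct match.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def effect_allele_dosage(genotype, ref, alt, effect_allele, other_allele):
    """Count effect alleles (0, 1, or 2) at a biallelic single-base site.

    genotype: pair of allele indices, e.g. (0, 1); 0 = ref, 1 = alt.
    Returns None when the PGS alleles cannot be reconciled with the
    site, even after complementing them (a strand flip).
    """
    site = {ref, alt}
    if {effect_allele, other_allele} != site:
        # Neither orientation matches directly: try the opposite strand.
        flipped_effect = COMPLEMENT.get(effect_allele)
        flipped_other = COMPLEMENT.get(other_allele)
        if {flipped_effect, flipped_other} != site:
            return None
        effect_allele = flipped_effect
    alleles = (ref, alt)
    return sum(1 for index in genotype if alleles[index] == effect_allele)


def expected_alt_dosage(p_hom_ref, p_het, p_hom_alt):
    """Expected number of alternate alleles from genotype probabilities."""
    return p_het + 2.0 * p_hom_alt


print(effect_allele_dosage((0, 1), "A", "G", "G", "A"))  # effect allele is alt
print(effect_allele_dosage((0, 1), "A", "G", "C", "T"))  # strand-flipped PGS alleles
print(expected_alt_dosage(0.001, 0.0, 0.999))            # probabilistic dosage
```

If the effect allele is the reference allele, the expected effect-allele dosage is 2 minus the value returned by `expected_alt_dosage`. Multiallelic sites need extra handling that is not covered here.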
The PGS Catalog
The PGS Catalog is a database of PGSs, administered in collaboration with EMBL-EBI and the University of Cambridge, with funding from the NHGRI. From their website:
Any researcher may submit a PGS that they have developed to this database, and many do so as a matter of course. The Catalog has established standardized formats for PGS data, which are extensively documented. We have chosen to design our tool to operate on inputs standardized in the same way.
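To give a sense of what operating on standardized inputs could involve, here is a minimal sketch of a reader for a tab-delimited weight file with `#`-prefixed metadata lines. The layout and the column names (`rsID`, `effect_allele`, `other_allele`, `effect_weight`) are our assumptions modeled on our understanding of the Catalog's scoring files; the Catalog documentation is the authority on the actual format. The function `read_scoring_file` and the example content are hypothetical.

```python
import csv
import io


def read_scoring_file(handle):
    """Parse a tab-delimited weight file with '#key=value' header lines.

    Returns (metadata, variants): metadata is a dict built from the
    header lines, variants is a list of row dicts with the
    'effect_weight' column converted to float.
    """
    metadata = {}
    data_lines = []
    for line in handle:
        if line.startswith("#"):
            key, sep, value = line[1:].strip().partition("=")
            if sep:  # ignore '#' lines that are not key=value pairs
                metadata[key] = value
        elif line.strip():
            data_lines.append(line)

    variants = []
    for row in csv.DictReader(data_lines, delimiter="\t"):
        row["effect_weight"] = float(row["effect_weight"])
        variants.append(row)
    return metadata, variants


# Made-up file content, for illustration only.
example = (
    "#pgs_id=PGS_EXAMPLE\n"
    "rsID\teffect_allele\tother_allele\teffect_weight\n"
    "rs_a\tA\tG\t0.12\n"
    "rs_b\tT\tC\t-0.05\n"
)

metadata, variants = read_scoring_file(io.StringIO(example))
print(metadata, variants)
```

Keeping the metadata separate from the variant rows lets a tool check things like the score identifier before it starts matching variants.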