PGS Background #2

alkaZeltser · 2024-01-02T23:31:29Z

alkaZeltser
Jan 2, 2024
Maintainer

What is a PGS?

Nomenclature

PGS stands for Polygenic Score. The scientific community uses many equivalent names for this "score", and there is no consensus yet on the most appropriate one. Polygenic index (PGI), polygenic risk score (PRS), genomic risk score (GRS), and polygenic hazard score (PHS) all describe essentially the same thing, with small nuances in some cases (for example, a polygenic hazard score is typically derived from a Cox proportional hazards model rather than from linear or logistic regression). Since the largest database of such published scores has opted for the "PGS" nomenclature, we have chosen to adopt PGS as well.

Purpose

A polygenic score is a measure of the aggregated effects of multiple variants (typically SNPs) across the genome on a specific trait. The trait can be a disease (such as a prostate cancer diagnosis) or any other measurable phenotype (such as height). The score is interpreted as an estimate of genetic predisposition to the trait. In animal and plant agriculture, PGSs of desired traits (e.g. milk production, fruit size) are used to make breeding decisions that maintain or improve those traits in successive generations. In human medicine, a PGS can provide an estimate of an individual's risk of a specific disease, which doctors and patients can use to inform management decisions.

Calculation

PGS calculation has two distinct phases, and the standard nomenclature makes them hard to tell apart out of context. To "calculate" a PGS could mean either of the following:

1. Establish which variants to include in your PGS and compute their effect sizes with regard to your trait of interest in a training population. The output of this step is a list of effect size "weights" for each variant that composes your PGS. This process can be referred to as Model Training and Feature Selection.
2. Apply the effect size "weights" from step 1 back to your training dataset or to an external dataset to calculate a weighted sum "score" for each individual. The output of this step is a list of polygenic scores, one for each individual in your cohort. This step can be referred to as Model Application.

1. Feature Selection and Model Training

In human genetics, a PGS is computed using summary statistics from a Genome-Wide Association Study (GWAS). A typical GWAS consists of millions of regressions, one per SNP, run in cohorts of typically hundreds of thousands of people, with each SNP serving as a predictor for the trait of interest. The result is an estimated effect size for every measured SNP: its regression coefficient, reported as a beta, odds ratio, or hazard ratio depending on the model and reporting decisions. The final set of PGS weights can simply be these effect sizes. However, a variety of additional heuristics can be used to improve the accuracy or clinical utility of a PGS, such as linkage disequilibrium (LD) adjustment and p-value thresholding. Most PGSs restrict their component SNPs to those significantly associated with the trait (p-value thresholding), resulting in scores with dozens to hundreds of component SNPs. However, many studies have shown that including all measured SNPs, regardless of significance, often improves PGS accuracy. Such PGSs have millions of component SNPs.
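As a rough illustration of p-value thresholding, the sketch below filters a tiny set of summary statistics and turns odds ratios into additive weights. The SNP names, field names, and values are invented for the example, and taking the natural log of the odds ratio is a common convention that is assumed here rather than prescribed by the text above. The default threshold of 5e-8 is the conventional genome-wide significance level.

```python
import math

# Hypothetical GWAS summary statistics, one record per SNP.
# All names and numbers are made up for illustration.
summary_stats = [
    {"snp": "rsA", "odds_ratio": 1.20, "p_value": 3e-9},
    {"snp": "rsB", "odds_ratio": 0.95, "p_value": 0.40},
    {"snp": "rsC", "odds_ratio": 0.80, "p_value": 1e-8},
    {"snp": "rsD", "odds_ratio": 1.05, "p_value": 2e-3},
]


def select_weights(stats, p_threshold=5e-8):
    """Keep SNPs whose p-value is at or below the threshold.

    Returns a mapping {snp: log(odds ratio)} so that the weights
    can be summed across SNPs.
    """
    return {
        row["snp"]: math.log(row["odds_ratio"])
        for row in stats
        if row["p_value"] <= p_threshold
    }


# Thresholded score: only genome-wide significant SNPs contribute.
thresholded = select_weights(summary_stats)

# "Include everything" score: a threshold of 1.0 keeps every SNP.
all_snps = select_weights(summary_stats, p_threshold=1.0)

print(sorted(thresholded))
print(len(all_snps))
```

The same function covers both strategies described above: a strict threshold yields a short list of component SNPs, while a threshold of 1.0 keeps every measured SNP.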

2. Model Application

Once the PGS component SNP set and its weights have been computed, the PGS can be applied to any individual with genotype data at the component SNP sites. Besides the weights themselves, the pieces of information needed are:

- the genomic coordinates of each PGS component SNP
- the effect allele of each PGS component SNP
- the genotype of the individual at each component SNP site

Genetic data usually come from a sequencing or microarray experiment and can be processed through a genotype imputation server. PGS component SNPs are first matched to the individual's genetic data by matching genomic coordinates between a PGS weight file and a genotype data file (typically a VCF). Next, the individual's genotype dosage must be determined at each matched site. Dosage is defined as the number of effect alleles that an individual carries. In the standard case, an individual inherits two alleles at each site, one from each parent, so the number of effect alleles is 0, 1, or 2.* Some cases are trickier, for example a multiallelic site (more than two possible alleles) or a potential strand flip (the likely cause when neither the effect allele nor the other allele listed for the PGS variant matches the alleles observed at that site in the genotype data). Once matching and dosage are sorted out, the final step is to apply the standard weighted sum formula for a single individual $i$:

$$ PGS_i = \sum_{m=1}^{M} \left( \beta_m \times dosage_{im} \right) $$

where $m$ indexes the PGS component variants out of a total of $M$, $\beta_m$ is the effect size weight of the $m^{th}$ variant, and $dosage_{im}$ is the dosage of individual $i$ at that variant.

*Note: in imputed genotype data, the dosage can also be reported as a probabilistic estimate that takes any value between 0 and 2, e.g. 1.998.
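The matching, dosage, and weighted sum steps above can be sketched in a few lines. This is a minimal illustration under several assumptions: sites are keyed by a hypothetical `(chromosome, position)` tuple, genotypes are given as pairs of single-base alleles, and a strand flip is handled by complementing the observed alleles. All data are invented. Palindromic SNPs (A/T or C/G) cannot be resolved this way and would need separate handling.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def effect_allele_dosage(genotype, effect_allele, other_allele):
    """Count effect alleles in a genotype such as ("A", "G").

    Assumes single-base alleles. Returns None when the site cannot
    be matched, e.g. a third allele at a multiallelic site.
    """
    expected = {effect_allele, other_allele}
    if set(genotype) <= expected:
        return sum(1 for allele in genotype if allele == effect_allele)
    # Possible strand flip: retry with the complementary bases.
    flipped = tuple(COMPLEMENT[allele] for allele in genotype)
    if set(flipped) <= expected:
        return sum(1 for allele in flipped if allele == effect_allele)
    return None


def polygenic_score(weights, dosages):
    """Weighted sum of dosages over sites that were matched."""
    return sum(
        weights[site] * dosages[site]
        for site in weights
        if dosages.get(site) is not None
    )


# Hypothetical PGS weights and (effect allele, other allele) per site.
weights = {("1", 1000): 0.2, ("2", 5000): -0.1, ("3", 700): 0.05}
alleles = {("1", 1000): ("A", "G"), ("2", 5000): ("C", "T"), ("3", 700): ("G", "A")}

# Hypothetical genotype calls for one individual at the same coordinates.
# The second site is reported on the opposite strand.
genotypes = {("1", 1000): ("A", "A"), ("2", 5000): ("G", "A"), ("3", 700): ("A", "A")}

dosages = {
    site: effect_allele_dosage(genotypes[site], *alleles[site])
    for site in weights
    if site in genotypes
}
score = polygenic_score(weights, dosages)
print(dosages)
print(score)
```

Because `polygenic_score` only multiplies and sums, it works unchanged when the dosages are fractional values from imputed data.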

The PGS Catalog

The PGS Catalog is a database of PGSs, administered in collaboration with EMBL-EBI and the University of Cambridge, with funding from the NHGRI. From their website:

> The PGS Catalog is an open database of published polygenic scores (PGS). Each PGS in the Catalog is consistently annotated with relevant metadata; including scoring files (variants, effect alleles/weights), annotations of how the PGS was developed and applied, and evaluations of their predictive performance.

Any researcher may submit a PGS that they have developed to this database, and many do so as a matter of course. The Catalog has established standardized formats for PGS data, which are extensively documented. We have chosen to design our tool to operate on inputs standardized in the same way.
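To give a feel for what operating on standardized inputs might look like, here is a small sketch that reads a scoring-file-like text: header lines starting with `#` followed by a tab-separated table. The column names used (`chr_name`, `chr_position`, `effect_allele`, `other_allele`, `effect_weight`) reflect my understanding of the Catalog's scoring file format and should be checked against its documentation. The file content and the `read_scoring_file` helper are hypothetical.

```python
import csv
import io

# Invented content laid out like a scoring file. Identifiers and values
# are made up and do not correspond to any real score.
example_file = (
    "#pgs_id=PGS_EXAMPLE\n"
    "#trait_reported=Example trait\n"
    "rsID\tchr_name\tchr_position\teffect_allele\tother_allele\teffect_weight\n"
    "rs0000001\t1\t1000\tA\tG\t0.2\n"
    "rs0000002\t2\t5000\tC\tT\t-0.1\n"
)


def read_scoring_file(handle):
    """Split '#key=value' header lines from the tab-separated variant table."""
    metadata = {}
    data_lines = []
    for line in handle:
        if line.startswith("#"):
            key, sep, value = line.lstrip("#").strip().partition("=")
            if sep:  # ignore header lines that are not key=value pairs
                metadata[key] = value
        elif line.strip():
            data_lines.append(line)

    variants = []
    for row in csv.DictReader(data_lines, delimiter="\t"):
        variants.append(
            {
                "chrom": row["chr_name"],
                "pos": int(row["chr_position"]),
                "effect_allele": row["effect_allele"],
                "other_allele": row.get("other_allele"),
                "weight": float(row["effect_weight"]),
            }
        )
    return metadata, variants


metadata, variants = read_scoring_file(io.StringIO(example_file))
print(metadata["pgs_id"], len(variants))
```

Keeping the header metadata separate from the variant table makes it straightforward to carry details such as the score identifier alongside the weights.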

pboutros · 2024-01-03T01:09:27Z

pboutros
Jan 3, 2024
Maintainer

I'd like to suggest we use the more formal statistical learning terminology for the three steps:

1. Feature Selection (identifying the subset of independent variables one wishes to include in a model)
2. Model Fitting (creating a model that aggregates those features together to derive a score of some type)
3. Model Application (taking a vector of the independent variables with non-zero weights in the final model, and using the fitted model to create a score for that vector)
