diff --git a/docs/pipeline_technical.Rmd b/docs/pipeline_technical.Rmd
index bb1cbb5..7aca331 100644
--- a/docs/pipeline_technical.Rmd
+++ b/docs/pipeline_technical.Rmd
@@ -179,7 +179,7 @@ MegaPRS uses a range of priors (lasso, ridge, bolt, BayesR) for SNP effects, run

 #### PRS-CS

-PRS-CS, a Bayesian method using a continuous shrinkage prior, specifies a range of global shrinkage parameters (phi), generating multiple sets of genetic effects for polygenic scoring. Its 'auto' model estimates the optimal parameter directly from GWAS summary statistics, negating the need for an external dataset. In GenoPred, PRS-CS is run using the script [pgs_methods/prscs.R](https://github.com/opain/GenoPred/blob/master/Scripts/pgs_methods/prscs.R). GenoPred specifies four phi parameters (1e-6, 1e-4, 1e-2, 1) and the auto model. By default, GenoPred uses the PRS-CS provided 1KG-derived LD matrix data, matching the population of the GWAS sample. The user can select the UKB-derived LD matrix data to be used using the `prscs_ldref` parameter in the `configfile`. 1KG is used by default as PGS based on Yengo et al. sumstats performed significantly better in the OpenSNP target sample, when using the 1KG reference data (this may differ for other GWAS).
+PRS-CS, a Bayesian method using a continuous shrinkage prior, specifies a range of global shrinkage parameters (phi), generating multiple sets of genetic effects for polygenic scoring. Its 'auto' model estimates the optimal parameter directly from GWAS summary statistics, negating the need for an external dataset. In GenoPred, PRS-CS is run using the script [pgs_methods/prscs.R](https://github.com/opain/GenoPred/blob/master/Scripts/pgs_methods/prscs.R). By default, GenoPred specifies four phi parameters (1e-6, 1e-4, 1e-2, 1) and the auto model, but the user can modify this behaviour using the prscs_phi parameter in the configfile. By default, GenoPred uses the PRS-CS provided 1KG-derived LD matrix data, matching the population of the GWAS sample. The user can select the UKB-derived LD matrix data to be used using the `prscs_ldref` parameter in the `configfile`. 1KG is used by default as PGS based on Yengo et al. sumstats performed significantly better in the OpenSNP target sample, when using the 1KG reference data (this may differ for other GWAS).

 ***

@@ -229,7 +229,7 @@ Target genotype QC is performed using the [format_target.R](https://github.com/o

 ## Ancestry Inference

-Target samples then undergo ancestry inference, using the [Ancestry_identifier.R](https://github.com/opain/GenoPred/blob/master/Scripts/Ancestry_identifier/Ancestry_identifier.R) script, estimating the probability that each target individual matches each reference population (AFR = African, AMR = Admixed American, EAS = East Asian, EUR = European, CSA = Central and South Asian, MID = Middle Eastern). Population membership was predicted using a reference trained elastic net model consisting of the first six reference-projected genetic principal components. Principal components were defined in the reference dataset using variants present in the target dataset with a minor allele frequency >0.05, missingness <0.02 and Hardy-Weinberg p-value >1×10-6 (if target sample size <100, then only missingness threshold is applied in the target). LD pruning for independent variants is then performed in PLINK after removal of long-range LD regions (ref), using a window size of 1000, step size of 5, and r2 threshold of 0.2. The A multinomial elastic net model predicting super population membership in the reference is derived in using the glmnet R package, with model performance assessed using 5-fold cross validation. The reference-derived principal components are then projected into the target dataset, and the reference-derived elastic net model is used to predict population membership in target. By default, target individuals are assigned to a population if the predicted probability was >0.95, but the user can modify this threshold using the `ancestry_prob_thresh` parameter in the config file.
+Target samples then undergo ancestry inference, using the [Ancestry_identifier.R](https://github.com/opain/GenoPred/blob/master/Scripts/Ancestry_identifier/Ancestry_identifier.R) script, estimating the probability that each target individual matches each reference population (AFR = African, AMR = Admixed American, EAS = East Asian, EUR = European, CSA = Central and South Asian, MID = Middle Eastern). Population membership was predicted using a reference trained elastic net model consisting of the first six reference-projected genetic principal components. Principal components were defined in the reference dataset using variants present in the target dataset with a minor allele frequency >0.05, missingness <0.02 and Hardy-Weinberg p-value >1×10-6 (if target sample size <100, then only missingness threshold is applied in the target). LD pruning for independent variants is then performed in PLINK after removal of long-range LD regions (ref), using a window size of 1000, step size of 5, and r2 threshold of 0.2. The A multinomial elastic net model predicting super population membership in the reference is derived in using the glmnet R package, with model performance assessed using 5-fold cross validation. The reference-derived principal components are then projected into the target dataset, and the reference-derived elastic net model is used to predict population membership in target. By default, target individuals are assigned to a population if the predicted probability was >0.95, but the user can modify this threshold using the ancestry_prob_thresh parameter in the config file. If an individual does not have a predicted probability greater than the ancestry_prob_thresh parameter, then they will be excluded from downstream polygenic scoring. If the ancestry_prob_thresh parameter is low, then an individual may be assigned to multiple reference populations, and they will have polygenic scores that have been standardised according to each assigned reference population. In this case, the individual-level report created by GenoPred will present polygenic scores standardised according to the reference population with the highest predicted probability.

 ***

@@ -253,13 +253,13 @@ This step calculates scores in the target sample, based on scoring files from th

 ### Individual-level

-This step creates an .html report summarising the pipeline outputs for each individual in the target sample. It simply reads in pipeline outputs, and then tabulates and plots them. The only analysis it performs is the conversion of polygenic scores onto the absolute scale. It uses a [previously published method](https://pubmed.ncbi.nlm.nih.gov/34983942/). The estimate of the PGS R2 come from the lassosum pseudovalidation analysis, and the distribution in the general population is provided by the user in the prev, mean and sd columns of the gwas_list. Note: It does not convert PGS from externally derived polygenic scores onto the absolutes scale.
+This step creates an .html report summarising the pipeline outputs for each individual in the target sample. It simply reads in pipeline outputs, and then tabulates and plots them. The only analysis it performs is the conversion of polygenic scores onto the absolute scale. It uses a [previously published method](https://pubmed.ncbi.nlm.nih.gov/34983942/). The estimate of the PGS R2 come from the lassosum pseudovalidation analysis, and the distribution in the general population is provided by the user in the prev, mean and sd columns of the gwas_list. Note: It does not convert PGS from externally derived polygenic scores onto the absolutes scale. An example of the individual-level report derived using the test data can be found here.

 ***

 ### Sample-level

-This step creates an .html report summarising the pipeline outputs for each target sample. It simply reads in pipeline outputs, and then tabulates and plots them.
+This step creates an .html report summarising the pipeline outputs for each target sample. It simply reads in pipeline outputs, and then tabulates and plots them. An example of the sample-level report derived using the test data can be found here.

 ***

diff --git a/docs/pipeline_technical.html b/docs/pipeline_technical.html
index 046575e..d78d221 100644
--- a/docs/pipeline_technical.html
+++ b/docs/pipeline_technical.html
@@ -936,14 +936,16 @@

PRS-CS

 negating the need for an external dataset. In GenoPred, PRS-CS is run using the script pgs_methods/prscs.R.
-GenoPred specifies four phi parameters (1e-6, 1e-4, 1e-2, 1) and the
-auto model. By default, GenoPred uses the PRS-CS provided 1KG-derived LD
-matrix data, matching the population of the GWAS sample. The user can
-select the UKB-derived LD matrix data to be used using the
-prscs_ldref parameter in the configfile. 1KG
-is used by default as PGS based on Yengo et al. sumstats performed
-significantly better in the OpenSNP target sample, when using the 1KG
-reference data (this may differ for other GWAS).

+By default, GenoPred specifies four phi parameters (1e-6, 1e-4, 1e-2, 1)
+and the auto model, but the user can modify this behaviour using the
+prscs_phi parameter in the configfile. By default, GenoPred uses the
+PRS-CS provided 1KG-derived LD matrix data, matching the population of
+the GWAS sample. The user can select the UKB-derived LD matrix data to
+be used using the prscs_ldref parameter in the
+configfile. 1KG is used by default as PGS based on Yengo et
+al. sumstats performed significantly better in the OpenSNP target
+sample, when using the 1KG reference data (this may differ for other
+GWAS).


@@ -1081,8 +1083,16 @@

Ancestry Inference

 target dataset, and the reference-derived elastic net model is used to predict population membership in target. By default, target individuals are assigned to a population if the predicted probability was >0.95,
-but the user can modify this threshold using the
-ancestry_prob_thresh parameter in the config file.

+but the user can modify this threshold using the ancestry_prob_thresh
+parameter in the config file. If an individual does not have a predicted
+probability greater than the ancestry_prob_thresh parameter, then they
+will be excluded from downstream polygenic scoring. If the
+ancestry_prob_thresh parameter is low, then an individual may be
+assigned to multiple reference populations, and they will have polygenic
+scores that have been standardised according to each assigned reference
+population. In this case, the individual-level report created by
+GenoPred will present polygenic scores standardised according to the
+reference population with the highest predicted probability.


@@ -1144,14 +1154,18 @@

Individual-level

 pseudovalidation analysis, and the distribution in the general population is provided by the user in the prev, mean and sd columns of the gwas_list. Note: It does not convert PGS from externally derived
-polygenic scores onto the absolutes scale.

+polygenic scores onto the absolutes scale. An example of the
+individual-level report derived using the test data can be found
+here.


Sample-level

 This step creates an .html report summarising the pipeline outputs for each target sample. It simply reads in pipeline outputs, and then
-tabulates and plots them.

+tabulates and plots them. An example of the sample-level report derived
+using the test data can be found
+here.
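
Reviewer note on the Ancestry Inference hunk: the assignment rule added in this change (exclude individuals with no probability above `ancestry_prob_thresh`, allow multiple assignments when the threshold is low, report the highest-probability population) can be sketched as below. This is only an illustrative Python sketch of the rule as the documentation text describes it, not GenoPred's R implementation; the function names `assign_populations` and `report_population` and the example probabilities are hypothetical.

```python
from typing import Dict, List, Optional


def assign_populations(probs: Dict[str, float], thresh: float = 0.95) -> List[str]:
    """Return every population whose predicted probability exceeds thresh.

    The list is ordered from highest to lowest probability. An empty list
    means the individual would be excluded from downstream scoring.
    (Hypothetical helper, written from the documentation text only.)
    """
    assigned = [pop for pop, p in probs.items() if p > thresh]
    return sorted(assigned, key=lambda pop: probs[pop], reverse=True)


def report_population(probs: Dict[str, float], thresh: float = 0.95) -> Optional[str]:
    """Return the assigned population with the highest probability, or None."""
    assigned = assign_populations(probs, thresh)
    return assigned[0] if assigned else None


# Made-up probabilities for two individuals.
confident = {"EUR": 0.97, "AMR": 0.02, "CSA": 0.01}
ambiguous = {"EUR": 0.55, "AMR": 0.40, "AFR": 0.05}

print(assign_populations(confident))             # default threshold of 0.95
print(assign_populations(ambiguous))             # no population passes 0.95
print(assign_populations(ambiguous, thresh=0.3)) # low threshold, two assignments
print(report_population(ambiguous, thresh=0.3))  # population shown in the report
```

With the default threshold the second individual is assigned to no population; lowering the threshold to 0.3 assigns them to two, and the report would use the one with the higher probability.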