Rahul Sharma (Editor)

Polygenic score

A polygenic score, also called a polygenic risk score, genetic risk score, or genome-wide score, is a number computed from an individual's variation at multiple genetic loci, with each locus weighted by its estimated association with a trait (see regression analysis). It serves as the best prediction of the trait that can be made when taking into account variation in multiple genetic variants.
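
As a minimal sketch of this definition, a score can be computed as a weighted sum of allele counts. The SNP identifiers, weights, and genotype below are invented for illustration and do not come from any study; treating a missing genotype as zero copies is a simplifying assumption.

```python
# Hypothetical per-variant weights (effect per copy of the counted allele).
# The identifiers and values are invented for illustration only.
weights = {"snp_a": 0.12, "snp_b": -0.05, "snp_c": 0.30}


def polygenic_score(genotype, weights):
    """Weighted sum of allele counts (0, 1 or 2 copies) over the scored variants.

    Variants missing from the genotype are counted as 0 copies, which is a
    simplification made for this sketch.
    """
    return sum(w * genotype.get(snp, 0) for snp, w in weights.items())


person = {"snp_a": 2, "snp_b": 1, "snp_c": 0}
score = polygenic_score(person, weights)
print(round(score, 2))  # 0.19
```

Here the score is 2 × 0.12 + 1 × (−0.05) + 0 × 0.30 = 0.19. The number is only meaningful relative to the scores of other individuals computed with the same weights.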

Polygenic scores are widely employed in animal, plant, and behavioral genetics for prediction and for understanding genetic architectures. In a genome-wide association study (GWAS), a polygenic score that predicts substantially better than the genome-wide statistically significant hits alone indicates that the trait is affected by more variants than just those hits, and that larger sample sizes will yield more hits. A combination of low variance explained and high heritability, as measured by GCTA, twin studies, or other methods, indicates that a trait may be massively polygenic and affected by thousands of variants.

Once a polygenic score has been created that explains at least a few percent of variance, and so effectively captures genetic variants affecting the trait, it can be used to:

  • serve as a lower bound for testing whether heritability estimates may be biased;
  • measure the genetic overlap of traits (genetic correlation), which might indicate, for example, shared genetic bases for groups of mental disorders;
  • measure group differences in a trait such as height;
  • examine changes over time due to natural selection, indicative of a soft selective sweep, in a trait such as intelligence (where the change in frequency of each individual hit would be too small to detect, but the polygenic score as a whole declines);
  • serve in Mendelian randomization (assuming no pleiotropy with relevant traits);
  • detect and control for genetic confounds in outcomes (for example, the correlation of schizophrenia with poverty);
  • investigate gene–environment interactions.

Polygenic scores are widely used in animal and plant breeding (where the approach is usually termed genomic prediction) because of their practical value in breeding improved livestock and crops. Their use in human studies is increasing.

Estimating weights

Weights are usually estimated using some form of regression analysis. Because the number of genomic variants (usually SNPs) is usually larger than the sample size, OLS multiple regression cannot be used (the p > n problem). Instead, researchers use other methods, including regressing the trait on each variant one at a time (the usual approach in studies with human data). Because including SNPs with no real effect can weaken predictive power, several polygenic scores are often constructed from sets of SNPs selected at different p-value thresholds, such as all genome-wide statistically significant hits, all SNPs with p < 0.05, or all SNPs with p < 0.50, and the best-performing score is used for further analysis. Especially for highly polygenic traits, the best polygenic score will tend to use most or all SNPs.
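
The one-variant-at-a-time approach with threshold selection can be sketched as follows. Everything here is an assumption made for illustration: the data are simulated, the effect size (0.5 per allele), sample sizes, and thresholds are arbitrary, and p-values use a normal approximation. Real pipelines also account for correlation between nearby SNPs and evaluate the chosen score on data not used to pick the threshold; both steps are omitted.

```python
import math
import random


def marginal_fit(x, y):
    """Regress y on a single variant; return (slope, two-sided p-value).

    The p-value uses a normal approximation to the t statistic.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:  # variant does not vary in this sample
        return 0.0, 1.0
    beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    rss = sum((b - my - beta * (a - mx)) ** 2 for a, b in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    if se == 0:  # perfect fit
        return beta, 0.0
    return beta, math.erfc(abs(beta / se) / math.sqrt(2))


def correlation(u, v):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    if suu == 0 or svv == 0:
        return 0.0
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / math.sqrt(suu * svv)


def simulate(n, n_snps, n_causal, rng):
    """Toy data: the first n_causal SNPs each add 0.5 per allele, plus noise."""
    genotypes = [[rng.randint(0, 1) + rng.randint(0, 1) for _ in range(n_snps)]
                 for _ in range(n)]
    trait = [0.5 * sum(row[:n_causal]) + rng.gauss(0, 1) for row in genotypes]
    return genotypes, trait


rng = random.Random(0)
n_snps = 50
train_g, train_y = simulate(400, n_snps, 5, rng)
valid_g, valid_y = simulate(200, n_snps, 5, rng)

# One regression per SNP on the training sample.
fits = [marginal_fit([row[j] for row in train_g], train_y) for j in range(n_snps)]

# Build one score per threshold and keep the one that predicts best in validation.
thresholds = (5e-8, 0.05, 0.5, 1.0)
results = {}
for t in thresholds:
    chosen = [j for j, (_, p) in enumerate(fits) if p <= t]
    scores = [sum(fits[j][0] * row[j] for j in chosen) for row in valid_g]
    results[t] = correlation(scores, valid_y)

best_threshold = max(results, key=results.get)
print(best_threshold, round(results[best_threshold], 2))
```

A threshold of 1.0 keeps every SNP, which corresponds to the "most or all SNPs" case described above for highly polygenic traits.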

The standard GWAS regression can be improved on using penalized regression methods such as the LASSO or ridge regression. Penalized regression can be interpreted as placing informative priors on how many genetic variants are expected to affect a trait and on the distribution of their effect sizes. Bayesian counterparts exist for LASSO and ridge regression, and other priors have been suggested and used; these can perform better in some circumstances. A multi-dataset, multi-method study that compared 15 methods across four datasets found that minimum redundancy maximum relevance was the best-performing method and that variable selection methods tended to outperform other methods. Variable selection methods do not use all the genomic variants present in a dataset but attempt to select an optimal subset of variants. This leads to less overfitting but more bias (see bias-variance tradeoff).
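
To illustrate how a penalty makes joint estimation possible when there are more variants than individuals, here is a minimal sketch of the LASSO fitted by coordinate descent. It is not the procedure of any particular study: the simulated data, the penalty `lam`, and the iteration count are assumptions chosen for the example, and in practice the penalty would be tuned by cross-validation.

```python
import random


def soft_threshold(value, lam):
    """Shrink value toward zero by lam; values within [-lam, lam] become 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=50):
    """Minimise (1/2n)*||y - Xb||^2 + lam*||b||_1 on centred data."""
    n, p = len(X), len(X[0])
    # Centre every column and the outcome so that no intercept is needed.
    means = [sum(row[j] for row in X) / n for j in range(p)]
    cols = [[row[j] - means[j] for row in X] for j in range(p)]
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]  # residual for the current beta (all zeros)
    col_ss = [sum(v * v for v in col) / n for col in cols]
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0:  # variant does not vary: leave its weight at 0
                continue
            col = cols[j]
            # Covariance of variant j with the residual, adding back its own fit.
            rho = sum(c * r for c, r in zip(col, resid)) / n + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam) / col_ss[j]
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - delta * c for r, c in zip(resid, col)]
                beta[j] = new
    return beta


# Toy data with more variants (150) than individuals (100).
rng = random.Random(1)
n, p, n_causal = 100, 150, 3
X = [[rng.randint(0, 1) + rng.randint(0, 1) for _ in range(p)] for _ in range(n)]
y = [sum(row[:n_causal]) + rng.gauss(0, 0.5) for row in X]

beta = lasso_coordinate_descent(X, y, lam=0.15)
n_selected = sum(1 for b in beta if b != 0.0)
print(n_selected, [round(b, 2) for b in beta[:n_causal]])
```

The soft-thresholding step sets many weights exactly to zero, so the LASSO acts as a variable selection method in the sense described above, and the nonzero weights are shrunk toward zero, which is the source of the added bias. Ridge regression instead shrinks all weights without setting any exactly to zero.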

Predictive validity

The benefit of polygenic scores is that they can be used to predict outcomes before they occur. This has large practical benefits for animal breeding because it increases the precision of selection and allows for shorter generations, both of which speed up the response to selection. In humans, polygenic scores can be used to predict future disease susceptibility and for embryo selection.

Some accuracy values are given below for comparison. They are expressed as correlations; values reported in the source as variance explained have been converted.
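
The conversion is a square root: a score that explains a proportion R² of the variance in a trait correlates with it at r = √R². A minimal sketch follows; the function name is hypothetical and the input of 0.09 is an illustrative number rather than a figure from any source.

```python
import math


def variance_explained_to_r(r_squared):
    """Correlation implied by a proportion of variance explained (0 to 1).

    Returns the positive root, on the assumption that the score is oriented
    so that higher values predict higher trait values.
    """
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("variance explained must be between 0 and 1")
    return math.sqrt(r_squared)


print(round(variance_explained_to_r(0.09), 2))  # 0.3
```

Read in the other direction, a correlation of 0.30 corresponds to about 9% of variance explained.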

In humans

  • In 2016, r ≈ 0.30 for variation in educational attainment at age 16. This polygenic score was based on a GWAS using data from 293,000 people.
  • In 2016, r ≈ 0.31 for case/control status for first-episode psychosis.

In non-human animals

  • In 2016, r ≈ 0.30 for variation in milk fat%.
  • In 2014, r ≈ 0.18 to 0.46 for various measures of meat yield, carcass value etc.

In plants

  • In 2015, r ≈ 0.55 for total root length in maize (Zea mays L.).
  • In 2014, r ≈ 0.03 to 0.99 across four traits in barley.

References

    Polygenic score Wikipedia