An Le, BMSc '25
Assessing the Impact of Genotype Imputation on Polygenic Risk Score Prediction Accuracy Across Complex Phenotypes
Project supervised by Dr. Osvaldo Espin-Garcia and completed in 2025
Abstract
Objective: This study investigates how genotype imputation affects the predictive accuracy of polygenic risk scores (PRS) for complex metabolic traits, particularly in relation to trait polygenicity. The aim is to determine whether imputation improves PRS performance and to compare the efficacy of three PRS construction methods: Clumping and Thresholding (C+T), LassoSum, and LDpred2.
Background: Incomplete genotyping data can limit the effectiveness of PRS in predicting disease risk, especially for traits influenced by many genetic variants. Genotype imputation addresses this by inferring untyped SNPs using reference panels and linkage disequilibrium patterns. However, the benefit of imputation for PRS performance, especially across traits with varying degrees of genetic complexity, has not been comprehensively assessed.
Data and Methods: Genotype and phenotype data from the Northern Finland Birth Cohort 1966 (NFBC1966) were analyzed for four lipid-related traits: high-density lipoprotein (HDL), low-density lipoprotein (LDL), log-transformed triglycerides (logTG), and total cholesterol (TC). Genotyping was conducted using the Illumina HumanCNV370-Duo BeadChip, and rigorous quality control procedures were applied. Genotype imputation was performed with IMPUTE5 using the 1000 Genomes Project and HRC reference panels, preceded by pre-phasing with SHAPEIT5. PRS were computed using GWAS summary statistics from the Global Lipids Genetics Consortium and three methods: C+T, LassoSum, and LDpred2-auto. Performance was evaluated using R² and RMSE from linear regression models predicting normalized phenotype values.
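The evaluation step described above can be illustrated with a minimal sketch. It is not the project's actual pipeline: the function names (`polygenic_score`, `evaluate_prs`), the synthetic data, and the choice to regress on the PRS alone without covariates are all assumptions made for illustration. It shows a PRS computed as a weighted sum of allele dosages, followed by R² and RMSE from a simple linear regression of a normalized phenotype on that score.

```python
import numpy as np

def polygenic_score(dosages, weights):
    """Weighted sum of allele dosages (0, 1, 2): one score per individual."""
    return dosages @ weights

def evaluate_prs(prs, phenotype):
    """Fit phenotype ~ intercept + PRS by least squares; return (R^2, RMSE)."""
    X = np.column_stack([np.ones_like(prs), prs])
    coef, *_ = np.linalg.lstsq(X, phenotype, rcond=None)
    residuals = phenotype - X @ coef
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((phenotype - phenotype.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    rmse = float(np.sqrt(ss_res / len(phenotype)))
    return r2, rmse

# Synthetic toy data (hypothetical; not NFBC1966 data or GLGC weights).
rng = np.random.default_rng(0)
n_individuals, n_snps = 500, 50
dosages = rng.integers(0, 3, size=(n_individuals, n_snps)).astype(float)
weights = rng.normal(0.0, 0.1, size=n_snps)   # stand-in for per-SNP effect sizes
prs = polygenic_score(dosages, weights)

# Simulated phenotype = genetic signal + noise, then standardized.
phenotype = prs + rng.normal(0.0, 1.0, size=n_individuals)
phenotype = (phenotype - phenotype.mean()) / phenotype.std()

r2, rmse = evaluate_prs(prs, phenotype)
print(f"R^2 = {r2:.3f}, RMSE = {rmse:.3f}")
```

Because the phenotype is standardized to unit variance in this sketch, the squared RMSE equals 1 − R², so the two metrics carry the same information here; they would diverge if phenotypes were left on different scales.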
Results: Using the non-imputed dataset (~300,000 SNPs), LassoSum consistently achieved the highest variance explained, up to R² = 0.18 for LDL, followed by HDL and TC. LDpred2 performed moderately well, while C+T showed the weakest predictive performance across all traits. RMSE varied by phenotype, with HDL having the lowest error and TC the highest. Broader PRS distributions observed in LassoSum and LDpred2 suggest better capture of polygenic signals compared to the sparser C+T method. Traits like TC and logTG demonstrated lower predictability, likely reflecting stronger environmental influences and genetic heterogeneity.
Conclusion: LassoSum emerges as the most effective method for PRS construction under sparse genotyping conditions, with performance expected to improve further with imputed data. The study highlights the value of genotype imputation and LD-aware methods in enhancing PRS-based risk prediction, particularly for highly polygenic traits. Future analyses will extend this work by incorporating imputed genotype data and exploring additional methods such as PRS-CS.
About An
An Le, BMSc ’25, completed his Year 4 Research Project under the supervision of Dr. Osvaldo Espin-Garcia in the Department of Epidemiology and Biostatistics. His thesis explored the impact of genotype imputation on the accuracy of polygenic risk score prediction across metabolic traits. His research interests lie in genetic epidemiology, statistical modeling, and the use of translational genetics to improve predictive health analytics and promote precision medicine.