XGMix: Local-Ancestry Inference with Stacked XGBoost

October 14, 2022

Genomic medicine promises higher resolution for accurate diagnosis, personalized treatment, and identification of population-wide health burdens, all at rapidly decreasing cost (a genotype is now cheaper than an MRI, and the price is still dropping). The benefits of this emerging form of affordable, data-driven medicine will accrue predominantly to those populations whose genetic associations have been mapped, so it is of increasing concern that over 80% of genome-wide association studies (GWAS) have been conducted solely within individuals of European ancestry. The severe under-representation of the majority of the world’s populations in genetic association studies stems in part from an addressable algorithmic weakness: the lack of simple, accurate, and easily trained methods for identifying and annotating ancestry along the genome (local ancestry). Here we present such a method, XGMix, based on gradient boosted trees. It is accurate, simple to use, and fast to train, taking minutes on consumer-level laptops.
