IEEE/ACM transactions on computational biology and bioinformatics | 2021

Inherent Nonlinear Distribution of High-Dimensional Genotypic Data Identified as a Possible Source of Confounding Factors in Population Structure Analysis.

 

Abstract


It has become routine to detect and correct for population structure in genome-wide association analysis. A variety of methods have been proposed; in particular, methods based on spectral graph theory have shown superior performance. We discovered that the inherent nonlinear distribution of high-dimensional genotypic data is a possible source of confounding factors in population structure analysis, and possibly the underlying reason for the superiority of these spectral-based methods. We verified this hypothesis by validating a variation of Laplacian eigenanalysis, LAPMAP. The method could faithfully reveal the underlying population structures of the HapMap II and III data sets. The inferred top eigenvectors, together with minor eigenvectors, were used to segregate samples by their ancestries. We found that the top 3 eigenvectors differentiated the 4 populations in the phase II data set; the top 3 eigenvectors clustered the populations into 4 clusters, reflecting their continental origins. Nine populations were well recognized in the phase III data set. Next, we estimated admixture proportions for simulated individuals. The method showed comparable or better performance in capturing and correcting for modelled population structures. All experimental results showed that LAPMAP was robust, efficient, and scalable to genome-wide association studies.

Volume PP
DOI 10.1109/TCBB.2021.3069503
Language English
