PLoS Genet. 2006 Dec;2(12):e190.

doi: 10.1371/journal.pgen.0020190.

Population structure and eigenanalysis

Nick Patterson¹, Alkes L Price, David Reich

PMID: 17194218
PMCID: PMC1713260
DOI: 10.1371/journal.pgen.0020190

Nick Patterson et al. PLoS Genet. 2006 Dec.

Affiliation

¹ Broad Institute of Harvard and MIT, Cambridge, Massachusetts, United States of America.

Abstract

Current methods for inferring population structure from genetic data do not provide formal significance tests for population differentiation. We discuss an approach to studying population structure (principal components analysis) that was first applied to genetic data by Cavalli-Sforza and colleagues. We place the method on a solid statistical footing, using results from modern statistics to develop formal significance tests. We also uncover a general "phase change" phenomenon about the ability to detect structure in genetic data, which emerges from the statistical theory we use, and has an important implication for the ability to discover structure in genetic data: for a fixed but large dataset size, divergence between two populations (as measured, for example, by a statistic like FST) below a threshold is essentially undetectable, but a little above threshold, detection will be easy. This means that we can predict the dataset size needed to detect structure.

Conflict of interest statement

Competing interests. The authors have declared that no competing interests exist.

Figures

**Figure 1. The Tracy–Widom Density**
Conventional percentile points are: P = 0.05, x = 0.9794; P = 0.01, x = 2.0236; P = 0.001, x = 3.2730.
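As a small illustration of how these percentile points might be used, the sketch below maps an observed TW statistic to the strictest tabulated significance level it reaches. It assumes the quoted points are upper-tail critical values (the statistic is significant when it exceeds x); the function and constant names are hypothetical and not taken from the paper.

```python
# Percentile points quoted in the Figure 1 caption, as (P, critical x).
# Assumption: these are upper-tail points of the Tracy-Widom density.
TW_PERCENTILE_POINTS = [(0.05, 0.9794), (0.01, 2.0236), (0.001, 3.2730)]


def tw_significance_bracket(x):
    """Return the smallest tabulated P whose critical value x reaches.

    Returns None when x lies below the P = 0.05 point.
    """
    bracket = None
    # The table is ordered from the largest P to the smallest, so the
    # last critical value that x reaches gives the strictest level.
    for p, critical in TW_PERCENTILE_POINTS:
        if x >= critical:
            bracket = p
    return bracket
```

For example, a statistic of 2.5 exceeds the P = 0.01 point but not the P = 0.001 point, so the function returns 0.01.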

**Figure 2. Testing the Fit of the TW Distribution**
(A) We carried out 1,000 simulations of a panmictic population, where we have a sample size of m = 100 and n = 5,000 unlinked markers. We give a P–P plot of the TW statistic against the theoretical distribution; this shows the empirical cumulative distribution against the theoretical cumulative distribution for a given quantile. If the fit is good, we expect the plot will lie along the line y = x. Interest is primarily at the top right, corresponding to low p-values. (B) P–P plot corresponding to a sample size of m = 200 and n = 50,000 markers. The fit is again excellent, demonstrating the appropriateness of the Johnstone normalization.
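The P–P construction described in this caption can be sketched as follows. Because the Tracy–Widom cumulative distribution is not available in the Python standard library, the example uses a standard normal as a stand-in theoretical distribution; that substitution, and all function names, are assumptions made only for illustration.

```python
import random
from statistics import NormalDist


def pp_points(sample, theoretical_cdf):
    """Return (theoretical, empirical) cumulative probabilities for a P-P plot."""
    ordered = sorted(sample)
    n = len(ordered)
    points = []
    for rank, value in enumerate(ordered, start=1):
        empirical = rank / n                  # fraction of the sample <= value
        theoretical = theoretical_cdf(value)  # model probability of <= value
        points.append((theoretical, empirical))
    return points


def max_deviation(points):
    """Largest vertical distance of the P-P points from the line y = x."""
    return max(abs(theoretical - empirical) for theoretical, empirical in points)


# Stand-in data: draws from a standard normal, compared with its own CDF.
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(2000)]
points = pp_points(sample, NormalDist().cdf)
```

When the sample really follows the theoretical distribution, the points stay close to y = x; comparing the same sample with a shifted distribution such as `NormalDist(mu=1.0)` pulls the curve visibly away from the diagonal.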

**Figure 3. Testing the Fit of the Second Eigenvalue**
We generated genotype data in which the leading eigenvalue is overwhelmingly significant (*F_ST* = 0.01, m = 100, n = 5,000) with two equal-sized subpopulations. We show P–P plots for the TW statistic computed from the *second* eigenvalue. The fit at the high end is excellent.
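A rough sketch of this kind of experiment, for the leading eigenvalue only, is given below. Several ingredients are assumptions rather than details stated in this record: subpopulation allele frequencies are drawn from a Balding–Nichols (Beta) model, ancestral frequencies are uniform on [0.1, 0.9], the centering and scaling follow Johnstone's normalization for the largest eigenvalue, and the raw marker count n is used where an effective marker count might be preferable. The second-eigenvalue test shown in the figure is not reproduced.

```python
import numpy as np


def simulate_two_populations(m, n, fst, rng):
    """Simulate an m x n matrix of 0/1/2 genotypes for two equal-sized groups.

    Assumption: Balding-Nichols model with uniform ancestral frequencies.
    """
    ancestral = rng.uniform(0.1, 0.9, size=n)
    if fst <= 0.0:
        freqs = np.vstack([ancestral, ancestral])   # no differentiation
    else:
        a = ancestral * (1.0 - fst) / fst
        b = (1.0 - ancestral) * (1.0 - fst) / fst
        freqs = np.vstack([rng.beta(a, b), rng.beta(a, b)])
    half = m // 2
    labels = np.repeat([0, 1], [half, m - half])    # group of each sample
    return rng.binomial(2, freqs[labels])


def tw_statistic(genotypes):
    """Normalized leading-eigenvalue statistic for an m x n genotype matrix."""
    g = np.asarray(genotypes, dtype=float)
    p_hat = g.mean(axis=0) / 2.0
    keep = (p_hat > 0.0) & (p_hat < 1.0)            # drop monomorphic markers
    g, p_hat = g[:, keep], p_hat[keep]
    m, n = g.shape
    normalized = (g - 2.0 * p_hat) / np.sqrt(p_hat * (1.0 - p_hat))
    eigenvalues = np.linalg.eigvalsh(normalized @ normalized.T / n)
    m_eff = m - 1                                   # centering removes one dimension
    scaled = m_eff * eigenvalues[-1] / eigenvalues.sum()
    root_sum = np.sqrt(n - 1.0) + np.sqrt(m_eff)
    mu = root_sum ** 2 / n
    sigma = (root_sum / n) * (1.0 / np.sqrt(n - 1.0) + 1.0 / np.sqrt(m_eff)) ** (1.0 / 3.0)
    return (scaled - mu) / sigma
```

Calling `tw_statistic(simulate_two_populations(100, 5000, 0.01, np.random.default_rng(0)))` gives a statistic that can be compared with the percentile points in Figure 1; setting `fst` to 0 simulates a single panmictic population.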

**Figure 4. Three African Populations**
Plots of the first two eigenvectors for some African populations in the CEPH–HGDP dataset [30]. Yoruba and Bantu-speaking populations are genetically quite close and were grouped together. The Mandenka are a West African group speaking a language in the Mande family [15, p. 182]. The eigenanalysis fails to find structure in the Bantu populations, but separation between the Bantu and Mandenka with the second eigenvector is apparent.

**Figure 5. Three East Asian Populations**
Plots of the first two eigenvectors for a population from Thailand and Chinese and Japanese populations from the International Haplotype Map [32]. The Japanese population is clearly distinguished (though not by either eigenvector separately). The large dispersal of the Thai population, along a line where the Chinese are at an extreme, suggests some gene flow of a Chinese-related population into Thailand. Note the similarity to the simulated data of Figure 8.

**Figure 6. The BBP Phase Change**
We ran a series of simulations, varying the sample size m and number of markers n but keeping the product at mn = 2²⁰. Thus the predicted phase change threshold is *F_ST* = 2⁻¹⁰. We vary *F_ST* and plot the log p-value of the Tracy–Widom statistic. (We clipped −log₁₀ p at 20.) Note that below the threshold there is no statistical significance, while above threshold, we tend to get enormous significance.
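The arithmetic in this caption is consistent with a threshold of F_ST = 1/√(mn): with mn = 2²⁰, that gives 2⁻¹⁰. The sketch below takes that relation as a working assumption inferred from the caption's numbers, and the function names are hypothetical. It also inverts the relation to estimate how many markers would be needed for a given divergence and sample size.

```python
import math


def bbp_threshold_fst(m, n):
    """Predicted phase-change F_ST for m samples and n markers.

    Assumption: the threshold is 1 / sqrt(m * n), as the caption's example suggests.
    """
    return 1.0 / math.sqrt(m * n)


def markers_for_detection(fst, m):
    """Marker count n at which the given F_ST sits exactly at the threshold."""
    return 1.0 / (fst * fst * m)


# Different splits of mn = 2**20 all share the same predicted threshold.
for log2_m in (6, 8, 10):
    m = 2 ** log2_m
    n = 2 ** (20 - log2_m)
    print(m, n, bbp_threshold_fst(m, n))
```

Because only the product mn enters, halving the number of samples has to be offset by doubling the number of markers to keep the same threshold under this assumption.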

**Figure 7. Simulation of an Admixed Population**
We show a simple demography generating an admixed population. Populations *A*, *B*, and *D* trifurcated 100 generations ago, while population *C* is a recent admixture of *A* and *B*. Admixture weights for the proportion of population *A* in population *C* are Beta-distributed with parameters (3.5, 1.5). Effective population sizes are 10,000.
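The admixture step of this demography could be sketched roughly as follows. Each individual in population C draws an ancestry weight from Beta(3.5, 1.5), and each allele copy is then drawn at the weighted average of the two parental frequencies. Drawing allele copies independently at the mixed frequency is a simplifying assumption (it ignores ancestry blocks and linkage) and may differ from the simulation used in the paper; the function name and the parental frequencies are hypothetical.

```python
import random


def simulate_admixed_individual(freqs_a, freqs_b, rng, alpha=3.5, beta=1.5):
    """Return (weight, genotypes) for one individual of admixed population C.

    weight is the proportion of population A ancestry, drawn from Beta(alpha, beta).
    """
    weight = rng.betavariate(alpha, beta)
    genotypes = []
    for p_a, p_b in zip(freqs_a, freqs_b):
        # Assumption: both allele copies are drawn independently at the
        # individual's ancestry-weighted allele frequency.
        p = weight * p_a + (1.0 - weight) * p_b
        genotypes.append(sum(rng.random() < p for _ in range(2)))
    return weight, genotypes


rng = random.Random(0)
freqs_a = [0.9] * 50   # hypothetical allele frequencies in population A
freqs_b = [0.1] * 50   # hypothetical allele frequencies in population B
individuals = [simulate_admixed_individual(freqs_a, freqs_b, rng) for _ in range(2000)]
```

A Beta(3.5, 1.5) distribution has mean 3.5 / (3.5 + 1.5) = 0.7, so on average about 70% of an individual's ancestry comes from population A, with substantial variation between individuals.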

**Figure 8. A Plot of a Simulation Involving Admixture (See Main Text for Details)**
We plot the first two principal components. Population C is a recent admixture of two populations, B and a population not sampled. Note the large dispersion of population C along a line joining the two parental populations. Note the similarity of the simulated data to the real data of Figure 5.

**Figure 9. LD Correction with no LD Present**
P–P plots of the TW statistic, when no LD is present and after varying levels (k) of our LD correction. We first show this (A) for m = 500, n = 5,000, and then (B) for m = 200, n = 50,000. In both cases the LD correction makes little difference to the fit.

**Figure 10. LD Correction with Strong LD**
(A) Shows P–P plots of the TW statistic (m = 100, n = 5,000) with large blocks of complete LD. Uncorrected, the TW statistic is hopelessly poor, but after correction the fit is again good. Here, we show 1,000 runs with the same data size parameters as in Figure 2A, m = 100, n = 5,000, varying k, the number of columns used to “correct” for LD. The fit is adequate for any nonzero value of k. (B) Shows a similar analysis with m = 200, n = 50,000.

Cited by

Polygenic Indices (a.k.a. Polygenic Scores) in Social Science: A Guide for Interpretation and Evaluation.
Burt CH. Sociol Methodol. 2024 Aug;54(2):300-350. doi: 10.1177/00811750241236482. Epub 2024 Mar 21. PMID: 39091537 Free PMC article.
A map of canine sequence variation relative to a Greenland wolf outgroup.
Nguyen AK, Schall PZ, Kidd JM. Mamm Genome. 2024 Aug 1. doi: 10.1007/s00335-024-10056-1. Online ahead of print. PMID: 39088040
Investigating linguistic and genetic shifts in East Indian tribal groups.
Ahlawat B, Dewangan H, Pasupuleti N, Dwivedi A, Rajpal R, Pandey S, Kumar L, Thangaraj K, Rai N. Heliyon. 2024 Jul 9;10(14):e34354. doi: 10.1016/j.heliyon.2024.e34354. eCollection 2024 Jul 30. PMID: 39082022 Free PMC article.
Limited evidence of a shared genetic relationship between C-reactive protein levels and cognitive function in older UK adults of European ancestry.
Packer A, Corbett A, Arathimos R, Ballard C, Aarsland D, Hampshire A, Dima D, Creese B, Malanchini M, Powell TR. Front Dement. 2023 Aug 2;2:1093223. doi: 10.3389/frdem.2023.1093223. eCollection 2023. PMID: 39081969 Free PMC article.
Spatiotemporal fluctuations of population structure in the Americas revealed by a meta-analysis of the first decade of archaeogenomes.
Dos Santos ALC, Sullasi HSL, Gokcumen O, Lindo J, DeGiorgio M. Am J Biol Anthropol. 2023 Apr;180(4):703-714. doi: 10.1002/ajpa.24673. Epub 2022 Dec 4. PMID: 39081397 Free PMC article.

References

1. Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999;55:997–1004.
2. Menozzi P, Piazza A, Cavalli-Sforza L. Synthetic maps of human gene frequencies in Europeans. Science. 1978;201:786–792.
3. Cavalli-Sforza LL, Feldman MW. The application of molecular genetic approaches to the study of human evolution. Nat Genet. 2003;33(Supplement):266–275. Historical article.
4. Chakraborty R, Jin L. A unified approach to study hypervariable polymorphisms: Statistical considerations of determining relatedness and population distances. In: Pena S, Jeffreys A, Epplen J, Chakraborty R, editors. DNA fingerprinting, current state of the science. Basel: Birkhauser; 1993. pp. 153–175.
5. Shriver M, Mei R, Parra E, Sonpar V, Halder I, et al. Large-scale SNP analysis reveals clustered and continuous patterns of human genetic variation. Human Genomics. 2005;2:81–89.
