In population genetics, the allele frequency spectrum, sometimes called the site frequency spectrum, is the distribution of the allele frequencies of a given set of loci (often SNPs) in a population or sample. Because an allele frequency spectrum usually summarizes, or is compared with, sequenced samples from a population, it takes the form of a histogram whose size depends on the number of sequenced chromosomes. Each entry in the frequency spectrum records the total number of loci with the corresponding derived allele frequency. Loci contributing to the frequency spectrum are assumed to change in frequency independently of one another. Furthermore, loci are assumed to be biallelic (that is, with exactly two alleles present), although extensions for multiallelic frequency spectra exist.
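The histogram described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that each biallelic locus has already been reduced to the number of sampled chromosomes carrying the derived allele; the function name `site_frequency_spectrum` and the input data are hypothetical.

```python
import numpy as np

def site_frequency_spectrum(derived_counts, n_chromosomes):
    """Histogram of derived allele counts across biallelic loci.

    Entry i is the number of loci at which the derived allele is
    carried by exactly i of the n_chromosomes sampled chromosomes.
    """
    counts = np.asarray(derived_counts, dtype=int)
    if counts.size and (counts.min() < 0 or counts.max() > n_chromosomes):
        raise ValueError("derived allele counts must lie in [0, n_chromosomes]")
    # One bin for every possible count from 0 to n_chromosomes inclusive.
    return np.bincount(counts, minlength=n_chromosomes + 1)

# Hypothetical data: derived allele counts at 8 loci in 6 chromosomes.
derived_counts = [1, 1, 2, 1, 3, 5, 1, 2]
sfs = site_frequency_spectrum(derived_counts, n_chromosomes=6)
print(sfs)
```

The spectrum has one more entry than the number of sampled chromosomes; the first and last entries count loci that are not polymorphic in the sample and are often dropped.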
Many summary statistics of observed genetic variation are themselves summaries of the allele frequency spectrum, including estimates of
Example
The allele frequency spectrum from a sample of
The allele frequency spectrum can be written as the vector
Calculation
The expected allele frequency spectrum may be calculated using either a coalescent or diffusion approach. The demographic history of a population and natural selection affect allele frequency dynamics, and these effects are reflected in the shape of the allele frequency spectrum. For the simple case of selectively neutral alleles segregating in a population that has reached demographic equilibrium (that is, without recent population size changes or gene flow), the expected allele frequency spectrum $x = (x_1, \ldots, x_{n-1})$ for a sample of $n$ chromosomes is given by $x_i = \theta / i$,
where $\theta$ is the population-scaled mutation rate.
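The neutral equilibrium expectation, in which the expected number of loci with derived allele count i is θ/i, can be computed directly. The sketch below is illustrative only; the function name `expected_neutral_sfs` and the value of θ are assumptions chosen for the example.

```python
import numpy as np

def expected_neutral_sfs(n_chromosomes, theta):
    """Expected counts of polymorphic loci under neutrality at equilibrium.

    Returns entries for derived allele counts i = 1, ..., n_chromosomes - 1,
    each equal to theta / i.
    """
    i = np.arange(1, n_chromosomes)
    return theta / i

# Hypothetical parameter value: theta = 10 for a sample of 6 chromosomes.
expected = expected_neutral_sfs(6, theta=10.0)
print(expected)
```

Rare derived alleles dominate this expectation: the singleton class is twice as large as the doubleton class, three times as large as the tripleton class, and so on.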
Calculating the frequency spectrum from observed sequence data requires one to be able to distinguish the ancestral and derived (mutant) alleles, often by comparing to an outgroup sequence. For example, in human population genetic studies, the homologous chimpanzee reference sequence is typically used to estimate the ancestral allele. However, sometimes the ancestral allele cannot be determined, in which case the folded allele frequency spectrum may be calculated instead. The folded frequency spectrum stores the observed counts of the minor (rarer) allele frequencies. The folded spectrum can be calculated by binning together the $i$-th and $(n-i)$-th entries of the unfolded spectrum, where $n$ is the number of sampled chromosomes.
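Folding can be sketched as follows, assuming the unfolded spectrum is stored as an array with entries for derived allele counts 0 through n. The function name `fold_sfs` and the input spectrum are hypothetical.

```python
import numpy as np

def fold_sfs(unfolded):
    """Fold an unfolded spectrum with entries for counts 0..n.

    Entry j of the result is the number of loci whose minor allele
    is carried by j chromosomes, for j = 0..n // 2.
    """
    unfolded = np.asarray(unfolded)
    n = len(unfolded) - 1
    folded = np.zeros(n // 2 + 1, dtype=unfolded.dtype)
    for i, count in enumerate(unfolded):
        # Counts i and n - i describe the same minor allele count.
        folded[min(i, n - i)] += count
    return folded

# Hypothetical unfolded spectrum for n = 6 chromosomes.
unfolded = [0, 4, 2, 1, 0, 1, 0]
folded = fold_sfs(unfolded)
print(folded)
```

When n is even, the middle class (count n/2) has no partner and is carried over unchanged.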
Multi-population allele frequency spectrum
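The joint allele frequency spectrum (JAFS) is the joint distribution of allele frequencies across two or more related populations. The JAFS for $d$ populations, with $n_j$ chromosomes sampled from the $j$-th population, is a $d$-dimensional histogram in which each entry records the number of loci at which the derived allele is observed at the corresponding count in every population.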
Example
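Suppose we sequence diploid individuals from two populations, 4 individuals from population 1 and 2 individuals from population 2. The JAFS would be a 9 × 5 matrix: derived allele counts range from 0 to 8 among the 8 chromosomes sampled from population 1 and from 0 to 4 among the 4 chromosomes sampled from population 2.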
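For the two-population example of 4 and 2 diploid individuals (8 and 4 chromosomes), a joint spectrum can be tabulated as in the sketch below. The function name `joint_sfs` and the per-locus counts are hypothetical and serve only to illustrate the indexing.

```python
import numpy as np

def joint_sfs(counts_pop1, counts_pop2, n1, n2):
    """Two-population joint spectrum.

    Entry [i, j] is the number of loci with derived allele count i
    in population 1 (n1 chromosomes) and j in population 2 (n2 chromosomes).
    """
    c1 = np.asarray(counts_pop1, dtype=int)
    c2 = np.asarray(counts_pop2, dtype=int)
    jafs = np.zeros((n1 + 1, n2 + 1), dtype=int)
    # np.add.at accumulates correctly when the same cell is hit repeatedly.
    np.add.at(jafs, (c1, c2), 1)
    return jafs

# Hypothetical derived allele counts at 5 loci in each population.
counts_pop1 = [3, 0, 8, 3, 1]
counts_pop2 = [2, 1, 4, 2, 0]
jafs = joint_sfs(counts_pop1, counts_pop2, n1=8, n2=4)
print(jafs.shape)
print(jafs[3, 2])
```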
Applications
The shape of the allele frequency spectrum is sensitive to demography, such as population size changes, migration, and substructure, as well as natural selection. By comparing observed data summarized in a frequency spectrum to the expected frequency spectrum calculated under a given demographic and selection model, one can assess the goodness of fit of that model to the data, and use likelihood theory to estimate the best-fit parameters of the model.
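The likelihood comparison can be illustrated with a small sketch. It assumes, as one common choice, that the entries of the observed spectrum are independent Poisson counts whose means are the model expectations, and it uses the neutral equilibrium model (expected entry θ/i) as the candidate model. The function names, the observed counts, and the grid of θ values are all hypothetical.

```python
import math
import numpy as np

def poisson_log_likelihood(observed, expected):
    """Log-likelihood of observed counts, treating each entry as an
    independent Poisson variable with the given expected value."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    log_factorials = np.array([math.lgamma(k + 1.0) for k in observed])
    return float(np.sum(observed * np.log(expected) - expected - log_factorials))

def neutral_expectation(n_chromosomes, theta):
    """Expected neutral equilibrium spectrum for counts 1..n-1."""
    return theta / np.arange(1, n_chromosomes)

# Hypothetical observed spectrum for n = 6 (derived allele counts 1..5).
n = 6
observed = np.array([4, 2, 1, 0, 1])

# Grid search for the theta that maximizes the likelihood.
thetas = np.linspace(0.1, 10.0, 1000)
log_liks = [poisson_log_likelihood(observed, neutral_expectation(n, t))
            for t in thetas]
theta_hat = thetas[int(np.argmax(log_liks))]

# Under this model the maximum has a closed form: the number of
# polymorphic loci divided by the harmonic sum 1 + 1/2 + ... + 1/(n-1).
closed_form = observed.sum() / np.sum(1.0 / np.arange(1, n))
print(theta_hat, closed_form)
```

For models with demographic or selection parameters, the expected spectrum generally has no closed form and is obtained from coalescent or diffusion calculations, but the comparison of observed and expected entries proceeds in the same way.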
For example, suppose a population experienced a recent period of exponential growth and
This approach has been used to infer demographic and selection models for many species, including humans. For example, Marth et al. (2004) used the single population allele frequency spectra for a group of Africans, Europeans, and Asians to show that population bottlenecks have occurred in the Asian and European demographic histories, but not in the Africans. More recently, Gutenkunst et al. (2009) used the joint allele frequency spectrum for these same three populations to infer the time at which the populations diverged and the amount of subsequent ongoing migration between them (see out of Africa hypothesis). Additionally, these methods may be used to estimate patterns of selection from allele frequency data. For example, Boyko et al. (2008) inferred the distribution of fitness effects for newly arising mutations from human polymorphism data while controlling for the effects of non-equilibrium demography.