Fastsimcoal2

I am currently using fastsimcoal2 to model European and Asian demography.

A relatively recent development in population genetics is the use of maximum likelihood approaches to estimate demographic parameters from the site frequency spectrum (SFS). The SFS gives the number of SNPs observed at each frequency in a sample. The distribution of these frequencies is shaped by the demographic history of the population. For example, population expansion leads to long external branches on coalescent trees and consequently to an excess of low-frequency variants. Population contraction leads to long internal branches and a skew toward intermediate-frequency variants. Programs such as fastsimcoal2 (Excoffier et al. 2013) implement methods to estimate the likelihood of an observed SFS under a particular set of demographic parameters.
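To make the definition concrete, here is a minimal sketch of how an unfolded SFS could be tallied from phased haplotypes. The function name `site_frequency_spectrum` and the toy data are my own illustrations, not anything from fastsimcoal2, and the sketch assumes every site is biallelic with the derived allele coded as 1.

```python
from collections import Counter

def site_frequency_spectrum(haplotypes):
    """Unfolded SFS from equal-length 0/1 haplotypes (1 = derived allele).

    Entry i of the result is the number of sites at which the derived
    allele is carried by exactly i of the n sampled chromosomes.
    """
    n = len(haplotypes)
    n_sites = len(haplotypes[0])
    # Derived-allele count at each site, summed across chromosomes.
    counts = [sum(h[s] for h in haplotypes) for s in range(n_sites)]
    tally = Counter(counts)
    return [tally.get(i, 0) for i in range(n + 1)]

# Toy data: 4 chromosomes (rows) by 5 sites (columns).
haps = [
    [0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 1, 1],
]
sfs = site_frequency_spectrum(haps)
print(sfs)  # [1, 2, 1, 0, 1]
```

The first and last entries count sites that are monomorphic in the sample (derived allele absent or fixed); the entries in between are the polymorphic classes that carry the demographic signal.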

In fastsimcoal2, the user provides a template file describing the proposed model in terms of the parameters to be estimated. The program selects a set of parameter values at random (within ranges set by the user) and carries out coalescent simulations under the model in order to determine the composite likelihood of observing the given site frequency spectrum. For each set of parameter values, many coalescent trees are drawn. Using methods detailed in Nielsen (2000), fastsimcoal2 calculates the proportion of the total branch length of the trees that is ancestral to exactly ‘i’ of the sampled chromosomes. This proportion estimates the probability that a SNP appears in ‘i’ chromosomes in the sample. By accumulating branch lengths over Z simulated trees (at least 100,000) for each value of ‘i,’ fastsimcoal2 obtains an estimate that can be made arbitrarily precise by increasing Z, and that is then used in the composite likelihood calculation.
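The branch-length idea can be sketched in a few lines of Python. This is only an illustration under strong assumptions of my own: a single population of constant size (so the standard Kingman coalescent applies), no template file or demographic events, and a simplified multinomial composite log-likelihood that ignores constant terms. The names `expected_sfs` and `composite_log_likelihood` and the observed counts are hypothetical and are not part of fastsimcoal2.

```python
import math
import random

def expected_sfs(n, n_sims, rng):
    """Monte Carlo estimate of P(a SNP is carried by i of n chromosomes), i = 1..n-1.

    Assumes a constant-size population; time is in coalescent units.
    """
    totals = [0.0] * (n + 1)  # totals[i]: branch length ancestral to i samples
    for _ in range(n_sims):
        sizes = [1] * n  # number of sampled chromosomes below each lineage
        while len(sizes) > 1:
            k = len(sizes)
            # Waiting time to the next coalescence among k lineages.
            t = rng.expovariate(k * (k - 1) / 2)
            for s in sizes:
                totals[s] += t
            # Merge two lineages chosen uniformly at random.
            a, b = rng.sample(range(k), 2)
            merged = sizes[a] + sizes[b]
            sizes = [s for idx, s in enumerate(sizes) if idx not in (a, b)]
            sizes.append(merged)
    total_length = sum(totals[1:n])
    return [totals[i] / total_length for i in range(1, n)]

def composite_log_likelihood(observed, probs):
    """Multinomial log-likelihood of observed SNP counts, up to a constant."""
    return sum(m * math.log(p) for m, p in zip(observed, probs) if m > 0)

rng = random.Random(1)
probs = expected_sfs(n=5, n_sims=20000, rng=rng)
observed = [48, 24, 16, 12]  # made-up SNP counts for frequency classes 1..4
print(probs)
print(composite_log_likelihood(observed, probs))
```

For a constant-size population the estimated probabilities should approach the classical expectation, proportional to 1/i, which gives a quick sanity check on the simulation.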

Using Brent's algorithm, fastsimcoal2 then optimizes each parameter in turn, holding the others fixed, over repeated cycles (20-40 “ECM cycles,” for expectation-conditional maximization) to determine which parameter values maximize the likelihood of the observed SFS under the proposed model.
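The cycle structure can be sketched as follows. To keep the example dependency-free I have swapped in a plain golden-section search for Brent's algorithm, and the likelihood is a made-up smooth surface rather than a simulated composite likelihood, so this shows only the shape of the procedure (one parameter at a time, repeated over cycles). The names `conditional_maximization`, `toy_log_likelihood`, `N`, and `T` are hypothetical.

```python
import math

def golden_section_max(f, lo, hi, tol=1e-6):
    """Maximize a unimodal 1-D function on [lo, hi] (stand-in for Brent's algorithm)."""
    inv_phi = (math.sqrt(5) - 1) / 2
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    while hi - lo > tol:
        if f(c) > f(d):
            hi = d  # the maximum lies in [lo, d]
        else:
            lo = c  # the maximum lies in [c, hi]
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
    return (lo + hi) / 2

def conditional_maximization(log_lik, start, bounds, n_cycles=30):
    """Optimize each parameter in turn with the others held fixed, for n_cycles cycles."""
    params = dict(start)
    for _ in range(n_cycles):
        for name, (lo, hi) in bounds.items():
            def profile(value, name=name):
                trial = dict(params)
                trial[name] = value
                return log_lik(trial)
            params[name] = golden_section_max(profile, lo, hi)
    return params

def toy_log_likelihood(p):
    # Hypothetical surface with its peak at N = 3, T = 1; the cross term
    # makes the best value of one parameter depend on the other.
    x, y = p["N"] - 3.0, p["T"] - 1.0
    return -(x ** 2) - (y ** 2) - 0.5 * x * y

bounds = {"N": (0.0, 10.0), "T": (0.0, 5.0)}
best = conditional_maximization(toy_log_likelihood, {"N": 8.0, "T": 4.0}, bounds)
print(best)
```

Because the parameters interact, a single pass is not enough: each cycle moves the estimate closer to the joint maximum, which is why the optimization is repeated over many cycles.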

These simulations can be carried out on a 1D SFS (one population), or a joint SFS for 2 or more populations.
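For two populations, the joint SFS is a table rather than a vector: cell (i, j) counts the sites where the derived allele is carried by i chromosomes from the first population and j from the second. A minimal sketch, again with a hypothetical function name and toy data, assuming biallelic sites with the derived allele coded as 1:

```python
def joint_sfs(pop1, pop2):
    """2D SFS: table[i][j] = number of sites with derived-allele count i in pop1 and j in pop2."""
    n1, n2 = len(pop1), len(pop2)
    n_sites = len(pop1[0])
    table = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for s in range(n_sites):
        i = sum(h[s] for h in pop1)
        j = sum(h[s] for h in pop2)
        table[i][j] += 1
    return table

# Toy data: 2 chromosomes per population, 3 sites.
pop1 = [[1, 0, 1], [0, 0, 1]]
pop2 = [[1, 0, 0], [1, 1, 0]]
table = joint_sfs(pop1, pop2)
for row in table:
    print(row)
```

Sites private to one population fall along the first row or first column, while shared variants fill the interior of the table.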

My biggest tip is to be extremely careful with the formatting of the site frequency spectrum you use as input. The line above your SFS must say “1 observation” (and nothing else) for the file to be recognized as an SFS; otherwise, every likelihood value you get will be 0.000. This isn’t explicit in the manual, so beware. More tips to come!

Sources Cited:

Excoffier, L., Dupanloup, I., Huerta-Sánchez, E., Sousa, V. C., & Foll, M. (2013). Robust demographic inference from genomic and SNP data. PLoS Genetics, 9(10), e1003905.

Nielsen, R. (2000). Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics, 154(2), 931-942.