Release 2.0.1
1 Introduction
2 Installation
2.1 Installing with conda
2.2 Installing with pip
2.3 Troubleshooting
3 Parallelization
4 Tutorial
4.1 Constructing a demographic history
4.2 Plotting a demography
4.3 Reading and simulating data
4.4 Inference
4.5 Statistics of the SFS
4.6 Bootstrap confidence intervals
4.7 Other features
5 API Documentation
5.1 Demographic models
5.2 Plotting
5.3 Data
5.4 Statistics
CHAPTER 1
Introduction
momi (MOran Models for Inference) is a Python package that computes the expected sample frequency spectrum
(SFS), a statistic commonly used in population genetics, and uses it to fit demographic history.
The code is on GitHub, and a preprint describing the method can be found on bioRxiv.
momi Documentation, Release 2.0.1
CHAPTER 2
Installation
momi requires Python >= 3.5, and can be installed with conda or pip.
2.1 Installing with conda
3. Install:
Note that the order of the -c flags matters: it determines the priority of each channel when installing dependencies.
2.2 Installing with pip
The momi source distribution is provided on PyPI, and can be downloaded, built, and installed with pip.
First, ensure the following non-Python dependencies are installed with your favorite package manager (e.g. apt-get,
yum, brew, conda, etc.):
1. hdf5
2. gsl
3. (OSX only) OpenMP-enabled clang
• If using homebrew, do brew install llvm libomp.
• Or if using conda, do conda install llvm-openmp clang.
• You will also need to set the environment variable CC=/path/to/clang during installation.
Then do pip install momi.
On OSX, remember to set the environment variable:
CC=/path/to/clang pip install momi
If you installed the above dependencies using homebrew, this should be:
CC=$(brew --prefix llvm)/bin/clang pip install momi
Depending on your system, pip may have trouble installing some dependencies (such as numpy, msprime, pysam).
In this case, you should manually install these dependencies and try again.
See venv to install into a virtual environment.
2.3 Troubleshooting
This is usually caused by trying to import momi when in the top-level folder of the momi2 project. In this case,
Python will try to import the local, unbuilt copy of the momi subdirectory rather than the installed version.
To fix this, simply cd out of the top-level directory before importing momi.
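One way to confirm which copy of a package Python would pick up is to inspect its module spec. The sketch below uses only the standard library; the helper name module_origin is hypothetical and is not part of momi.

```python
import importlib.util


def module_origin(name):
    """Return the file Python would load `name` from, or None if it cannot be found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# If this prints a path inside the momi2 source checkout rather than
# site-packages, the local, unbuilt copy is shadowing the installed one.
print(module_origin("momi"))
```

Running this from the directory where the import fails, and again from another directory, should show whether the working directory is the cause.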
On macOS the system version of clang does not support OpenMP, which causes this error when building momi with
pip.
To solve this, make sure you have OpenMP-enabled LLVM/clang installed, and set the environment variable CC as
noted in the pip installation instructions above.
Note: it is NOT recommended to replace clang with gcc on macOS, as this can cause strange numerical errors when
used with Intel MKL; for example, see https://github.com/ContinuumIO/anaconda-issues/issues/8803
CHAPTER 3
Parallelization
momi will automatically use all available CPUs to perform computations in parallel. You can control the number of
threads by setting the environment variable OMP_NUM_THREADS.
To take full advantage of parallelization, make sure numpy is linked against a parallel BLAS
implementation such as MKL or OpenBLAS. Most packaged, precompiled versions of numpy, such as the one in
Anaconda Python, already take care of this.
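The thread count can also be set from within Python rather than from the shell. The sketch below assumes that the variable has to be set before momi (or numpy) is first imported, since OpenMP and BLAS runtimes typically read it once at load time; the value 4 is an arbitrary example.

```python
import os

# Set this before importing momi or numpy; threading runtimes usually
# read the variable only once, when they are first loaded.
os.environ["OMP_NUM_THREADS"] = "4"  # arbitrary example value

print(os.environ["OMP_NUM_THREADS"])
```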
CHAPTER 4
Tutorial
This is a tutorial for the momi package. You can run the ipython notebook that created this tutorial at
docs/tutorial.ipynb.
To get started, import the momi package:
Some momi operations can take a while to complete, so it is useful to turn on status monitoring messages to check that
everything is running normally. Here, we output logging messages to the file tutorial.log.
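A minimal sketch of such a logging setup is shown here, using only the standard library. It assumes that momi emits its status messages through Python's logging module, so that a handler on the root logger receives them; the message format is an arbitrary choice.

```python
import logging

# Write INFO-level (and more severe) messages to tutorial.log.
handler = logging.FileHandler("tutorial.log")
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s"))

root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)
root_logger.addHandler(handler)

root_logger.info("logging to tutorial.log is enabled")
```

In a fresh session, logging.basicConfig(level=logging.INFO, filename="tutorial.log") is a shorter way to get a similar result.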
4.1 Constructing a demographic history
Use DemographicModel to construct a demographic history. Below, we set the diploid effective size N_e=1.2e4,
the generation time gen_time=29 years per generation, and mutation rate muts_per_gen=1.25e-8 per base
per generation.
Use DemographicModel.add_leaf to add sampled populations. Below we add 3 populations: YRI, CHB, and NEA.
The archaic NEA population is sampled t=5e4 years ago. The YRI population has size N=1e5, while the CHB
population is initialized to have size N=1e5 and growth rate g=5e-4 per year (NEA starts at the default size
1.2e4).
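To make the units concrete, the sketch below converts the NEA sampling time into generations and evaluates an exponentially growing population size backward in time. It assumes the common convention N(t) = N_0 * exp(-g * t) for t years before the present; that convention is an assumption about how the growth rate is interpreted, and the helper name size_at is hypothetical.

```python
import math

GEN_TIME = 29.0   # years per generation, as in the tutorial
N_CHB = 1e5       # present-day CHB size
G_CHB = 5e-4      # CHB growth rate per year


def size_at(t_years, n0=N_CHB, g=G_CHB):
    """Size t_years before present under exponential growth (assumed convention)."""
    return n0 * math.exp(-g * t_years)


nea_sample_generations = 5e4 / GEN_TIME
print(f"NEA sampled about {nea_sample_generations:.0f} generations ago")
print(f"CHB size 10,000 years ago: {size_at(1e4):.0f}")
```

Under this convention a positive growth rate means the population was smaller in the past.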
Demographic events are added to the model by the methods DemographicModel.set_size and
DemographicModel.move_lineages. DemographicModel.set_size is used to change population size and growth rate, while
DemographicModel.move_lineages is used for population split and admixture events.
Note that events can involve other populations aside from the 3 sampled populations YRI, CHB, and NEA. Unsampled
populations are also known as “ghost populations”. In this example, CHB receives a small amount of admixture from
a population “GhostNea”, which splits off from NEA at an earlier date.
4.2 Plotting a demography
momi relies on matplotlib for plotting. In a notebook, first call %matplotlib inline to enable matplotlib, then
you can use DemographyPlot to create a plot of the demographic model.
fig = momi.DemographyPlot(
    model, ["YRI", "CHB", "GhostNea", "NEA"],
    figsize=(6,8),
    major_yticks=yticks,
    linthreshy=1e5, pulse_color_bounds=(0,.25))
Note that the user must specify the order of all populations (including ghost populations) along the x-axis.
The argument linthreshy is useful for visualizing demographic events at different scales. In our example, the split
time of NEA is far above the other events. Times below linthreshy are plotted on a linear scale, while times above
it are plotted on a log scale.
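The effect of such a threshold can be sketched with a simplified mapping from time to axis position. This only illustrates the idea of a linear region joined to a logarithmic one; it is not matplotlib's exact transform, and the function name symlog_position is hypothetical.

```python
import math


def symlog_position(t, linthresh):
    """Map a time t >= 0 to an axis position: linear below linthresh, log above."""
    if t <= linthresh:
        return t
    # Above the threshold, each factor of 10 takes up `linthresh` units of
    # axis, and the two pieces meet continuously at t == linthresh.
    return linthresh * (1.0 + math.log10(t / linthresh))


for t in [2.5e4, 5e4, 1e5, 5e5, 1e6]:
    print(t, symlog_position(t, linthresh=1e5))
```

Events older than the threshold are compressed, so a very old split no longer squeezes the recent events into a thin band at the bottom of the plot.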
4.3 Reading and simulating data
In this section we demonstrate how to read in data from a VCF file. We start by simulating a dataset so that we can
read it in later.
Use DemographicModel.simulate_vcf to simulate data (using msprime) and save the resulting dataset to a VCF file.
Below we simulate a dataset of diploid individuals, with 20 “chromosomes” of length 50Kb, with a recombination rate
of 1.25e-8.
We saved the datasets in tutorial_datasets/$chrom.vcf.gz. Accompanying tabix and bed files are also
created.
!cat tutorial_datasets/ind2pop.txt
NEA_0 NEA
YRI_0 YRI
YRI_1 YRI
CHB_0 CHB
CHB_1 CHB
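A file in this two-column layout can be produced and checked with a few lines of Python. The sketch below writes to a separate, hypothetical file name (ind2pop_example.txt) so as not to touch the tutorial data, and it assumes a tab as the column delimiter.

```python
from collections import defaultdict

# Individual -> population labels copied from the listing above.
ind2pop = {
    "NEA_0": "NEA",
    "YRI_0": "YRI",
    "YRI_1": "YRI",
    "CHB_0": "CHB",
    "CHB_1": "CHB",
}

# One "individual<TAB>population" pair per line (tab delimiter is an assumption).
with open("ind2pop_example.txt", "w") as f:
    for ind, pop in ind2pop.items():
        f.write(f"{ind}\t{pop}\n")

# Read the file back and group individuals by population as a sanity check.
pop2inds = defaultdict(list)
with open("ind2pop_example.txt") as f:
    for line in f:
        ind, pop = line.split()
        pop2inds[pop].append(ind)

print(dict(pop2inds))
```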
The next step is to compute the allele counts for each VCF separately. To do this, use the shell command python
-m momi.read_vcf $VCF $IND2POP $OUTFILE --bed $BED:
[10]: %%sh
for chrom in `seq 1 20`;
do
    python -m momi.read_vcf \
        tutorial_datasets/$chrom.vcf.gz tutorial_datasets/ind2pop.txt \
        tutorial_datasets/$chrom.snpAlleleCounts.gz \
        --bed tutorial_datasets/$chrom.bed
done
The --bed flag specifies a BED accessible regions file; only regions present in the BED file are read from the VCF.
The BED file also determines the length of the data in bases. If no BED file is specified, then all SNPs are read, and
the length of the data is set to unknown.
You should NOT use the same BED file across multiple VCF files, and should ensure your BED files do not contain
overlapping regions. Otherwise, regions will be double-counted when computing the length of the data. You can use
tabix to split a single BED file into multiple non-overlapping files.
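The double-counting problem can be illustrated with a small check over BED-style intervals. This is a standalone sketch, not part of momi; the function name and the example regions are hypothetical, and intervals are assumed to be half-open [start, end) as in the BED format.

```python
def total_length_and_overlaps(intervals):
    """intervals: list of (chrom, start, end) in half-open BED coordinates.

    Returns (bases summed over all intervals, list of overlapping pairs).
    """
    total = sum(end - start for _, start, end in intervals)
    overlaps = []
    prev = None
    for cur in sorted(intervals):
        if prev is not None and prev[0] == cur[0] and cur[1] < prev[2]:
            overlaps.append((prev, cur))
        # Keep whichever interval on this chromosome reaches furthest right.
        if prev is None or prev[0] != cur[0] or cur[2] > prev[2]:
            prev = cur
    return total, overlaps


regions = [("1", 0, 100), ("1", 50, 150), ("2", 0, 100)]
total, overlaps = total_length_and_overlaps(regions)
print(total)     # 300, although only 250 distinct bases are covered
print(overlaps)  # [(('1', 0, 100), ('1', 50, 150))]
```

A naive sum of interval lengths counts the 50 shared bases twice, which is the kind of inflation the warning above describes.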
By default, ancestral alleles are read from the INFO AA field (SNPs missing this field are skipped), but this behavior
can be changed by setting the flags --no_aa or --outgroup.
Use the --help flag to see more command line options, and see also the documentation for
SnpAlleleCounts.read_vcf, which provides the same functionality within Python.
Use python -m momi.extract_sfs $OUTFILE $NBLOCKS $COUNTS... from the command line to
combine the SFS across multiple files, and split the SFS into a number of equally sized blocks for jackknifing and
bootstrapping.
[11]: %%sh
python -m momi.extract_sfs tutorial_datasets/sfs.gz 100 tutorial_datasets/*.snpAlleleCounts.gz
Use the --help flag to see the command line options, and see also the documentation for
SnpAlleleCounts.concatenate and SnpAlleleCounts.extract_sfs, which provide the same functionality within Python.
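To show why the SFS is split into blocks, here is a generic delete-one-block jackknife for the standard error of a statistic. It is a textbook sketch with equally weighted blocks, not necessarily the scheme momi uses internally, and the per-block values are made-up numbers.

```python
import math
import statistics


def jackknife_se(block_values, estimator=statistics.mean):
    """Delete-one-block jackknife standard error of estimator(block_values)."""
    n = len(block_values)
    # Re-estimate the statistic with each block left out in turn.
    leave_one_out = [
        estimator(block_values[:i] + block_values[i + 1:]) for i in range(n)
    ]
    center = statistics.mean(leave_one_out)
    return math.sqrt((n - 1) / n * sum((x - center) ** 2 for x in leave_one_out))


# Hypothetical per-block values of some SFS statistic (made-up numbers).
blocks = [0.012, 0.010, 0.015, 0.011, 0.013, 0.009]
print(jackknife_se(blocks))
```

For the mean, this reduces to the usual standard error (sample standard deviation divided by the square root of the number of blocks), which is a convenient sanity check.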
4.4 Inference
In this section we will infer a demography for the data we simulated. We will start by fitting a sub-demography on
CHB and YRI, and then iteratively build on this model, by adding the NEA population and also additional parameters
and events.
We will start by fitting a simplified model without admixture. Use DemographicModel() to initialize it as before:
Note that muts_per_gen is optional, and can be omitted if unknown, but specifying it provides extra power to the
model.
Use DemographicModel.set_data to add data to the model for inference:
[14]: no_pulse_model.set_data(sfs)
Demographic events can be added as before. Parameters are specified by name (string), while constants are
specified as numbers (float).
Use DemographicModel.optimize to search for the MLE. It is a thin wrapper around scipy.optimize.minimize and
accepts similar arguments.
[17]: no_pulse_model.optimize(method="TNC")
/home/jack/miniconda3/envs/momi2-conda-nomkl/lib/python3.6/site-packages/autograd/numpy/numpy_vjps.py:444: FutureWarning: Using a non-tuple sequence for
status: 1
success: True
x: array([1.64805192e+01, 9.97783662e-04, 1.04562323e+05])
The default optimization method is method="TNC" (a truncated Newton algorithm). This is very accurate but can be
slow; for large models, method="L-BFGS-B" is a good choice.
We can print the inferred parameter values with DemographicModel.get_params:
[18]: no_pulse_model.get_params()
[18]: ParamsDict({'n_chb': 14368074.379920935, 'g_chb': 0.000997783661801449, 't_chb_yri': 114562.32318043888})
Now we add in the NEA population, along with a parameter for its split time t_anc. We use the keyword
lower_constraints to require that t_anc > t_chb_yri.
We search for the new MLE and plot the inferred demography:
[21]: no_pulse_model.optimize()
fig = momi.DemographyPlot(
    no_pulse_model, ["YRI", "CHB", "NEA"],
    figsize=(6,8), linthreshy=1e5,
    major_yticks=yticks)
Now we create a new DemographicModel, by copying the previous model and adding a NEA->CHB migration
arrow.
add_pulse_model.move_lineages(
    "CHB", "GhostNea", t="t_pulse", p="p_pulse")
add_pulse_model.add_time_param(
    "t_ghost", lower=5e4,
    lower_constraints=["t_pulse"], upper_constraints=["t_anc"])
add_pulse_model.move_lineages(
    "GhostNea", "NEA", t="t_ghost")
It turns out this model has local optima, so we demonstrate how to fit a few independent runs with different starting
parameters.
Use DemographicModel.set_params to set new parameter values to start the search from. If a parameter is not specified
and randomize=True, it is set to a random initial value.
[23]: results = []
n_runs = 3
for i in range(n_runs):
    print(f"Starting run {i+1} out of {n_runs}...")
    add_pulse_model.set_params(
        # parameters inherited from no_pulse_model are set to their previous values
        no_pulse_model.get_params(),
        # other parameters are set to random initial values
        randomize=True)
    results.append(add_pulse_model.optimize(options={"maxiter":200}))
# sort results according to log likelihood, pick the best (largest) one
best_result = sorted(results, key=lambda r: r.log_likelihood)[-1]
add_pulse_model.set_params(best_result.parameters)
best_result
Starting run 1 out of 3...
Starting run 2 out of 3...
Starting run 3 out of 3...
[23]: fun: 0.006390407169686503
jac: array([ 3.86428996e-08, -8.26898179e-02, 4.63629649e-13, -7.35845677e-14,
status: 1
success: True
x: array([ 1.62856876e+01, 1.00000000e-03, 9.03668931e+04, 4.24138405e+05,
4.5 Statistics of the SFS
Here we discuss how to compute statistics of the SFS, for evaluating the goodness-of-fit of our models, and for
estimating the mutation rate.
4.5.1 Goodness-of-fit
Use SfsModelFitStats to see how well various statistics of the SFS fit a model, via the block-jackknife.
Below we create an SfsModelFitStats to evaluate the goodness-of-fit of the no_pulse_model.
One important statistic is the f4 or “ABBA-BABA” statistic for detecting introgression (Patterson et al. 2012).
In the absence of admixture f4(YRI, CHB, NEA, AncestralAllele) should be 0, but for our dataset it will be negative
due to the NEA->CHB admixture.
Use SfsModelFitStats.f4 to compute f4 stats. For the no-pulse model, we see that f4(YRI, CHB, NEA,
AncestralAllele) is indeed negative.
print("Expected = {}".format(f4.expected))
print("Observed = {}".format(f4.observed))
print("SD = {}".format(f4.sd))
print("Z(Expected-Observed) = {}".format(f4.z_score))
Computing f4(YRI, CHB, NEA, AncestralAllele)
Expected = 6.938893903907228e-18
Observed = -0.003537103263299063
SD = 0.0017511821832936283
Z(Expected-Observed) = -2.0198373972983648
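The printed values are consistent with the Z-score being the residual (observed minus expected) divided by the jackknife standard deviation. This relationship is inferred from the numbers above rather than from momi's source, so treat the sketch below as a sanity check only.

```python
# Values copied from the output above.
expected = 6.938893903907228e-18
observed = -0.003537103263299063
sd = 0.0017511821832936283

# Residual in units of the jackknife standard deviation.
z_score = (observed - expected) / sd
print(z_score)
```

A magnitude of about 2 means the observed f4 lies roughly two standard deviations from the value expected under the no-pulse model.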
The related f2 and f3 statistics are also available via SfsModelFitStats.f2 and SfsModelFitStats.f3.
Another method for evaluating model fit is SfsModelFitStats.all_pairs_ibs, which computes the probability that two
random alleles are the same, for every pair of populations:
[27]: no_pulse_fit_stats.all_pairs_ibs()
[27]:   Pop1 Pop2  Expected  Observed          Z
      0  YRI  YRI  0.699637  0.706936   1.924912
      1  NEA  NEA  0.732753  0.720787  -1.388584
      2  CHB  NEA  0.545176  0.548234   0.668636
      3  CHB  YRI  0.694800  0.697731   0.651155
      4  NEA  YRI  0.545176  0.543831  -0.334112
      5  CHB  CHB  0.965634  0.964595  -0.231728
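The quantity itself can be illustrated at a single biallelic SNP. The sketch below is one plausible reading of "probability that two random alleles are the same"; how momi aggregates over SNPs is not shown here, and the function names and allele counts are hypothetical.

```python
def ibs_within(k, n):
    """P(two distinct sampled alleles match) at a SNP with k derived alleles among n."""
    return (k * (k - 1) + (n - k) * (n - k - 1)) / (n * (n - 1))


def ibs_between(k1, n1, k2, n2):
    """P(one allele drawn from each of two samples match) at a single SNP."""
    p1, p2 = k1 / n1, k2 / n2
    return p1 * p2 + (1 - p1) * (1 - p2)


# Hypothetical derived-allele counts at one SNP (not taken from the tutorial data).
print(ibs_within(1, 4))         # 0.5
print(ibs_between(1, 4, 2, 2))  # 0.25
```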
Finally, the method SfsModelFitStats.tensor_prod can be used to compute very general statistics of the SFS
(specifically, linear combinations of tensor-products of the SFS). See the documentation for more details.
Limitations of SfsModelFitStats
Note the SfsModelFitStats class above has some limitations. First, it computes goodness-of-fit for the SFS
without any missing data; all entries with missing samples are removed. For datasets with many individuals and
pervasive missingness, this can result in most or all of the data being removed.
In such cases you can restrict the SFS to a smaller number of samples; then all SNPs with at least that
many non-missing individuals will be used. For example,
will compute statistics for the SFS restricted to 2 samples per population.
The second limitation of SfsModelFitStats is that it ignores the mutation rate – it only fits the SFS normalized
to be a probability distribution. However, see the next subsection on how to evaluate the total number of mutations in
the data.
To evaluate the total number of mutations in the data, e.g. to fit the mutation rate, use the method
DemographicModel.fit_within_pop_diversity, which computes the within-population nucleotide diversity, i.e. the
heterozygosity of a random individual in that population assuming Hardy-Weinberg Equilibrium:
[29]: no_pulse_model.fit_within_pop_diversity()
[29]:   Pop    EstMutRate   JackknifeSD  JackknifeZscore
      0 CHB  1.283114e-08  1.624236e-09         0.203875
      1 YRI  1.215218e-08  1.571356e-10        -2.213517
      2 NEA  1.301250e-08  4.018888e-10         1.275228
This method returns a dataframe giving estimates for the mutation rate. Note that there is an estimate for each
population; these are non-independent estimates of the same value, just computed in different ways (by comparing
the expected and observed heterozygosity for each population separately). These estimates account for missingness in
the data, so the method can be used on datasets with large amounts of missingness.
Since we initialized our model with muts_per_gen=1.25e-8, the method also returns a Z-value for the residuals
of the estimated mutation rates.
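The Z-values in the table can be reproduced, up to rounding, as the difference between each estimate and the specified mutation rate, divided by the jackknife SD. This reading is inferred from the printed numbers, not taken from momi's source.

```python
MUTS_PER_GEN = 1.25e-8  # value the model was initialized with

# (EstMutRate, JackknifeSD) copied from the table above.
estimates = {
    "CHB": (1.283114e-08, 1.624236e-09),
    "YRI": (1.215218e-08, 1.571356e-10),
    "NEA": (1.301250e-08, 4.018888e-10),
}

z_scores = {
    pop: (est - MUTS_PER_GEN) / sd for pop, (est, sd) in estimates.items()
}
for pop, z in z_scores.items():
    print(f"{pop}: Z = {z:.3f}")
```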
4.6 Bootstrap confidence intervals
[30]: n_bootstraps = 5
# make copies of the original models to avoid changing them
no_pulse_copy = no_pulse_model.copy()
add_pulse_copy = add_pulse_model.copy()
bootstrap_results = []
for i in range(n_bootstraps):
    print(f"Fitting {i+1}-th bootstrap out of {n_bootstraps}")
    bootstrap_results.append(add_pulse_copy.get_params())
Fitting 1-th bootstrap out of 5
Fitting 2-th bootstrap out of 5
Fitting 3-th bootstrap out of 5
Fitting 4-th bootstrap out of 5
Fitting 5-th bootstrap out of 5
We can visualize the bootstrap results by overlaying them onto a single plot.
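Besides plotting, a list of bootstrap estimates can be summarized numerically with percentile intervals. The sketch below is generic Python, assuming each bootstrap result behaves like a dict of parameter values; the helper names and the numbers are hypothetical, and with only a handful of replicates the intervals are very rough.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile (0 <= q <= 100) of an already sorted list."""
    if len(sorted_values) == 1:
        return sorted_values[0]
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def bootstrap_intervals(param_dicts, level=95.0):
    """Percentile interval for each parameter across a list of parameter dicts."""
    tail = (100.0 - level) / 2.0
    intervals = {}
    for name in param_dicts[0]:
        values = sorted(d[name] for d in param_dicts)
        intervals[name] = (percentile(values, tail),
                           percentile(values, 100.0 - tail))
    return intervals


# Hypothetical bootstrap estimates of one parameter (made-up numbers).
fake_results = [{"t_pulse": t} for t in (2.1e4, 2.6e4, 2.4e4, 3.0e4, 2.2e4)]
print(bootstrap_intervals(fake_results, level=80.0))
```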
4.7 Other features
For large models, it can be useful to perform stochastic optimization: instead of computing the full likelihood at every
step, we use a random subset of SNPs at each step to estimate the likelihood gradient. This is especially useful for
rapidly searching for a reasonable starting point, from which full optimization can be performed.
DemographicModel.stochastic_optimize implements stochastic optimization with the ADAM algorithm. Setting
svrg_epoch=n makes the optimizer use the full likelihood every n steps, which can lead to better convergence (see SVRG).
The cell below performs 10 steps of stochastic optimization, using 1000 random SNPs per step, and computing the
full likelihood every 3 iterations.
[32]: add_pulse_copy.stochastic_optimize(
    snps_per_minibatch=1000, num_iters=10, svrg_epoch=3)
[32]: fun: 3.5684750877096234
jac: array([ 2.16057690e-06, -2.30524330e-03, 2.41528617e-07, -1.54878487e-07,
success: False
x: array([ 1.66675285e+01, -1.00000000e-03, 8.49349310e+04, 4.20940654e+05,
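The idea of taking gradient steps on random subsets of the data can be sketched on a toy problem. The code below runs plain Adam on minibatches to estimate the mean of some simulated numbers; it is not momi's likelihood, it omits the SVRG correction, and all names and settings are illustrative assumptions.

```python
import math
import random


def adam_minibatch(data, theta0, num_iters, batch_size, lr=0.05,
                   beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    """Minimize the mean of (x - theta)^2 over data with Adam on random minibatches."""
    rng = random.Random(seed)
    theta, m, v = theta0, 0.0, 0.0
    for step in range(1, num_iters + 1):
        batch = rng.sample(data, batch_size)
        # Gradient of the minibatch loss: a noisy estimate of the full gradient.
        grad = 2.0 * (theta - sum(batch) / batch_size)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad * grad
        m_hat = m / (1 - beta1 ** step)  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** step)  # bias-corrected second moment
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


data_rng = random.Random(1)
toy_data = [data_rng.gauss(3.0, 1.0) for _ in range(1000)]
estimate = adam_minibatch(toy_data, theta0=0.0, num_iters=500, batch_size=50)
print(estimate, sum(toy_data) / len(toy_data))
```

Each step touches only 50 of the 1000 data points, yet the estimate ends up close to the full-data mean, which is why this approach is useful for finding a starting point cheaply.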
CHAPTER 5
API Documentation
5.1 Demographic models
5.2 Plotting
5.3 Data
5.4 Statistics