
Haplotype Frequency Estimation: EM Algorithm



Haplotype Frequency Estimation via EM

Consider two loci. Assume the loci are bi-allelic for simplicity.
Locus 1 has alleles A, a
Locus 2 has alleles B, b
Genotype counts are observed at the two loci for N individuals
in the population.
homozygous at both loci: $n_{AABB}$, $n_{aaBB}$, $n_{AAbb}$, $n_{aabb}$
homozygous at exactly one locus: $n_{AABb}$, $n_{aaBb}$, $n_{AaBB}$, $n_{Aabb}$
heterozygous at both loci: $n_{AaBb}$
We would like to make some inference about the haplotypes.
In particular, we would like to estimate haplotype frequencies.
For each genotype combination for the two loci, what are the
possible haplotypes?
For which genotype configuration is it not possible to infer
phase?
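
One way to explore the two questions above is to enumerate, for every two-locus genotype, the unordered haplotype pairs that could have produced it. The sketch below is illustrative only; the function name `compatible_haplotype_pairs` and the string encoding of genotypes (for example `"Aa"`, `"Bb"`) are assumptions made for this example, not part of the original slides.

```python
from itertools import product


def compatible_haplotype_pairs(genotype1, genotype2):
    """Return the set of unordered haplotype pairs consistent with a
    two-locus genotype.

    genotype1 is a 2-character string for locus 1 (e.g. "Aa"),
    genotype2 is a 2-character string for locus 2 (e.g. "Bb").
    """
    pairs = set()
    # Try both ways of assigning the two alleles at each locus
    # to the two chromosomes.
    locus1_orders = [(genotype1[0], genotype1[1]), (genotype1[1], genotype1[0])]
    locus2_orders = [(genotype2[0], genotype2[1]), (genotype2[1], genotype2[0])]
    for (x1, x2), (y1, y2) in product(locus1_orders, locus2_orders):
        # Sort so that AB/ab and ab/AB count as the same unordered pair.
        pairs.add(tuple(sorted([x1 + y1, x2 + y2])))
    return pairs


for g1, g2 in product(["AA", "Aa", "aa"], ["BB", "Bb", "bb"]):
    pairs = compatible_haplotype_pairs(g1, g2)
    note = "phase ambiguous" if len(pairs) > 1 else "phase known"
    print(g1 + g2, sorted(pairs), note)
```

Only the genotype with more than one compatible pair has unresolved phase; every other genotype determines its haplotype pair uniquely.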


$n_{AaBb}$ is the sum of two haplotype-pair counts: $n_{AaBb} = n_{AB/ab} + n_{Ab/aB}$

$n_{AB/ab}$ and $n_{Ab/aB}$ are our missing data, since the phase of these
haplotype pairs cannot be resolved from the genotype data.
If phase were known for all haplotypes, we could easily write
a likelihood function for $n_{AB}$, $n_{Ab}$, $n_{aB}$, $n_{ab}$ in terms of the
haplotype frequencies $p_{AB}$, $p_{Ab}$, $p_{aB}$, $p_{ab}$.
We want to estimate $p_{AB}$, $p_{Ab}$, $p_{aB}$, $p_{ab}$. How can we do this?
We can use an EM algorithm.
Note that $\theta = (p_{AB}, p_{Ab}, p_{aB}, p_{ab})$.
What is the observed data? What is the complete data?


For the complete data case, what is $n_{AB}$?
What is $n_{Ab}$ for the complete data case?
What would the likelihood function of $n_{AB}$, $n_{Ab}$, $n_{aB}$, $n_{ab}$ be
for the complete data case?


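$n_{AB}$ for the complete data case is

$$n_{AB} = 2n_{AB/AB} + n_{AB/Ab} + n_{aB/AB} + n_{AB/ab}$$

The complete data likelihood is

$$L(n_{AB}, n_{Ab}, n_{aB}, n_{ab}) = \frac{(2N)!}{n_{AB}!\, n_{Ab}!\, n_{aB}!\, n_{ab}!}\; p_{AB}^{n_{AB}}\, p_{Ab}^{n_{Ab}}\, p_{aB}^{n_{aB}}\, p_{ab}^{n_{ab}}$$
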
The sample of N individuals contains 2N haplotypes.
The MLE for $p_{AB}$ is $\hat{p}_{AB} = \frac{n_{AB}}{2N}$.
We can do the same for $p_{Ab}$, $p_{aB}$, and $p_{ab}$.
We don't observe the complete data, but we can use an EM
algorithm.
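
As a rough illustration of the complete data case, the following sketch computes the multinomial log-likelihood and the MLE from phased haplotype counts. The function names and the example counts are hypothetical, chosen only to make the formulas concrete.

```python
import math


def complete_data_mle(counts):
    """MLE of haplotype frequencies from phased haplotype counts.

    counts maps haplotype labels ("AB", "Ab", "aB", "ab") to counts;
    the counts sum to 2N, the number of haplotypes in the sample.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no haplotypes observed")
    return {h: n / total for h, n in counts.items()}


def complete_data_log_likelihood(counts, freqs):
    """Multinomial log-likelihood of the haplotype counts given freqs."""
    total = sum(counts.values())
    # log of (2N)! / (n_AB! n_Ab! n_aB! n_ab!), using lgamma(n + 1) = log(n!)
    log_coef = math.lgamma(total + 1) - sum(
        math.lgamma(n + 1) for n in counts.values()
    )
    return log_coef + sum(
        n * math.log(freqs[h]) for h, n in counts.items() if n > 0
    )


# Hypothetical phased counts for N = 50 individuals (2N = 100 haplotypes).
example_counts = {"AB": 40, "Ab": 30, "aB": 20, "ab": 10}
mle = complete_data_mle(example_counts)
uniform = {h: 0.25 for h in example_counts}
print(mle)
print(complete_data_log_likelihood(example_counts, mle))
print(complete_data_log_likelihood(example_counts, uniform))
```

Evaluating the log-likelihood at the MLE and at some other frequency vector (here, a uniform one) is a quick sanity check that the closed-form estimate does at least as well.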


Let the observed data be

$$Y = (n_{AABB}, n_{aaBB}, n_{AAbb}, n_{aabb}, n_{AABb}, n_{aaBb}, n_{AaBB}, n_{Aabb}, n_{AaBb})$$

The E step of the algorithm involves the calculation of Q, the
expected complete data log-likelihood. We must obtain the
following conditional expectations:

$$n_{AB}^{0} = E[n_{AB} \mid Y, p_{AB}^{0}, p_{Ab}^{0}, p_{aB}^{0}, p_{ab}^{0}]$$
$$n_{Ab}^{0} = E[n_{Ab} \mid Y, p_{AB}^{0}, p_{Ab}^{0}, p_{aB}^{0}, p_{ab}^{0}]$$
$$n_{aB}^{0} = E[n_{aB} \mid Y, p_{AB}^{0}, p_{Ab}^{0}, p_{aB}^{0}, p_{ab}^{0}]$$
$$n_{ab}^{0} = E[n_{ab} \mid Y, p_{AB}^{0}, p_{Ab}^{0}, p_{aB}^{0}, p_{ab}^{0}]$$

What is $n_{AB}^{0}$?


$$n_{AB}^{0} = E[n_{AB} \mid Y, p_{AB}^{0}, p_{Ab}^{0}, p_{aB}^{0}, p_{ab}^{0}] = 2n_{AABB} + n_{AABb} + n_{AaBB} + E[n_{AB/ab} \mid Y, p_{AB}^{0}, p_{Ab}^{0}, p_{aB}^{0}, p_{ab}^{0}]$$

What is $E[n_{AB/ab} \mid Y, p_{AB}^{0}, p_{Ab}^{0}, p_{aB}^{0}, p_{ab}^{0}]$?


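$$E[n_{AB/ab} \mid Y, p_{AB}^{0}, p_{Ab}^{0}, p_{aB}^{0}, p_{ab}^{0}] = n_{AaBb}\, \frac{p_{AB}^{0}\, p_{ab}^{0}}{p_{AB}^{0}\, p_{ab}^{0} + p_{Ab}^{0}\, p_{aB}^{0}}$$

Thus

$$n_{AB}^{0} = 2n_{AABB} + n_{AABb} + n_{AaBB} + n_{AaBb}\, \frac{p_{AB}^{0}\, p_{ab}^{0}}{p_{AB}^{0}\, p_{ab}^{0} + p_{Ab}^{0}\, p_{aB}^{0}}$$

Similarly, we can calculate $n_{Ab}^{0}$, $n_{aB}^{0}$, and $n_{ab}^{0}$.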
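
For reference, the remaining three expected counts can be written out by the same counting argument. This is a worked sketch rather than part of the original slides: each haplotype is counted twice in the genotype homozygous for it, once in each single heterozygote that carries it, and once in the compatible phase of the double heterozygote. The weights for the double heterozygote assume that haplotypes pair at random, which is what the expression for $E[n_{AB/ab} \mid \cdot]$ above appears to rely on.

$$
\begin{aligned}
E[n_{Ab/aB} \mid Y, p_{AB}^{0}, p_{Ab}^{0}, p_{aB}^{0}, p_{ab}^{0}] &= n_{AaBb}\, \frac{p_{Ab}^{0}\, p_{aB}^{0}}{p_{AB}^{0}\, p_{ab}^{0} + p_{Ab}^{0}\, p_{aB}^{0}} \\
n_{Ab}^{0} &= 2n_{AAbb} + n_{AABb} + n_{Aabb} + E[n_{Ab/aB} \mid \cdot] \\
n_{aB}^{0} &= 2n_{aaBB} + n_{aaBb} + n_{AaBB} + E[n_{Ab/aB} \mid \cdot] \\
n_{ab}^{0} &= 2n_{aabb} + n_{aaBb} + n_{Aabb} + E[n_{AB/ab} \mid \cdot]
\end{aligned}
$$

As a check, the four expected counts sum to $2N$, because the two conditional expectations for the double heterozygote sum to $n_{AaBb}$.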


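The M step involves maximizing Q, the expected value of the
log-likelihood (obtained in the E step), with respect to
$p_{AB}$, $p_{Ab}$, $p_{aB}$, and $p_{ab}$.
The MLE is:

$$\hat{p}_{AB} = \frac{n_{AB}^{0}}{2N}, \qquad \hat{p}_{Ab} = \frac{n_{Ab}^{0}}{2N}, \qquad \hat{p}_{aB} = \frac{n_{aB}^{0}}{2N}, \qquad \hat{p}_{ab} = \frac{n_{ab}^{0}}{2N}$$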
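
A short derivation may help explain why the maximizer has this form. It is a sketch under the usual constraint that the four haplotype frequencies sum to 1; the constant $C$ and the Lagrange multiplier $\lambda$ are symbols introduced here for illustration. Because the complete data log-likelihood is linear in the haplotype counts, taking its conditional expectation replaces each count with its expected value:

$$
\begin{aligned}
Q &= C + n_{AB}^{0} \log p_{AB} + n_{Ab}^{0} \log p_{Ab} + n_{aB}^{0} \log p_{aB} + n_{ab}^{0} \log p_{ab}, \\
\frac{\partial}{\partial p_{h}} \Big[ Q + \lambda \big(1 - p_{AB} - p_{Ab} - p_{aB} - p_{ab}\big) \Big] &= \frac{n_{h}^{0}}{p_{h}} - \lambda = 0 \quad\Rightarrow\quad p_{h} = \frac{n_{h}^{0}}{\lambda}, \qquad h \in \{AB, Ab, aB, ab\}.
\end{aligned}
$$

Summing over the four haplotypes and using $n_{AB}^{0} + n_{Ab}^{0} + n_{aB}^{0} + n_{ab}^{0} = 2N$ gives $\lambda = 2N$, so $\hat{p}_{h} = n_{h}^{0} / (2N)$.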
The next step is to set $p_{AB}^{1} = \hat{p}_{AB}$, $p_{Ab}^{1} = \hat{p}_{Ab}$, $p_{aB}^{1} = \hat{p}_{aB}$,
$p_{ab}^{1} = \hat{p}_{ab}$.
Then return to the E step of the algorithm and compute Q
again.
Continue iterating between the E and the M step until the
parameters converge.
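
The full iteration can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `em_haplotype_frequencies`, the dictionary encoding of genotype counts, the uniform starting values, and the stopping rule (largest change in any frequency below `tol`) are all choices made for this example. The genotype counts at the end are hypothetical.

```python
GENOTYPES = ["AABB", "AABb", "AAbb", "AaBB", "AaBb", "Aabb", "aaBB", "aaBb", "aabb"]


def em_haplotype_frequencies(counts, p0=None, tol=1e-10, max_iter=1000):
    """Estimate two-locus haplotype frequencies from genotype counts via EM.

    counts maps genotype labels such as "AaBb" to observed counts
    (missing genotypes are treated as 0). p0 optionally gives starting
    frequencies for "AB", "Ab", "aB", "ab"; the default is uniform.
    """
    n = {g: counts.get(g, 0) for g in GENOTYPES}
    N = sum(n.values())
    if N == 0:
        raise ValueError("no individuals observed")
    p = dict(p0) if p0 else {"AB": 0.25, "Ab": 0.25, "aB": 0.25, "ab": 0.25}

    for _ in range(max_iter):
        # E step: split the double heterozygotes between the two phases.
        cis = p["AB"] * p["ab"]
        trans = p["Ab"] * p["aB"]
        weight = cis / (cis + trans) if cis + trans > 0 else 0.5
        e_cis = n["AaBb"] * weight      # expected count of AB/ab
        e_trans = n["AaBb"] - e_cis     # expected count of Ab/aB
        expected = {
            "AB": 2 * n["AABB"] + n["AABb"] + n["AaBB"] + e_cis,
            "Ab": 2 * n["AAbb"] + n["AABb"] + n["Aabb"] + e_trans,
            "aB": 2 * n["aaBB"] + n["aaBb"] + n["AaBB"] + e_trans,
            "ab": 2 * n["aabb"] + n["aaBb"] + n["Aabb"] + e_cis,
        }
        # M step: expected haplotype counts divided by 2N.
        new_p = {h: expected[h] / (2 * N) for h in expected}
        change = max(abs(new_p[h] - p[h]) for h in p)
        p = new_p
        if change < tol:
            break
    return p


# Hypothetical genotype counts for illustration.
example = {"AABB": 10, "AABb": 5, "AaBB": 4, "AaBb": 12, "Aabb": 3, "aaBb": 2, "aabb": 8}
print(em_haplotype_frequencies(example))
```

EM can converge to a local rather than a global maximum, so in practice it may be worth rerunning from several starting values `p0` and comparing the results.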
