
18.417 - Introduction to Computational Molecular Biology          Handout 9
Bonnie Berger                                                     October 7, 1999

Problem Set 3
Due Date: Thursday, October 21
Covers Material Through: Thursday, October 14

1. Phylogeny:
a) First we will show that no matter what data we have, if we are dealing with only three
OTUs (the species which supply our actual data points), we can construct a perfect fit to
the data. To this end, consider the following distance data on species A, B, C.

     A    B    C
A    0
B    x    0
C    y    z    0

Construct a phylogeny tree which fits the data perfectly. Clearly, the distances on the tree
that you turn in as a solution should be variables, i.e., (x + y - z)/2.
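As a numerical warm-up (a hint at the algebra, not the required derivation), one standard parametrization of the three-leaf star tree can be checked with exact rational arithmetic:

```python
from fractions import Fraction

# Pairwise distances for species A, B, C (arbitrary example values):
# x = d(A,B), y = d(A,C), z = d(B,C).
x, y, z = Fraction(3), Fraction(4), Fraction(5)

# Candidate edge lengths from each leaf to the single internal node
# (the standard three-point formulas; deriving them is the exercise).
a = (x + y - z) / 2  # edge to A
b = (x + z - y) / 2  # edge to B
c = (y + z - x) / 2  # edge to C

# The tree fits perfectly iff each leaf-to-leaf path sums to the data.
assert a + b == x and a + c == y and b + c == z
print(a, b, c)  # -> 1 2 3
```

Because every leaf-to-leaf path is the sum of exactly two edges, three equations in three unknowns always have a solution, which is the point of part a).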
b) Now we will prove that with as few as four data points, it is possible to have data which
does not fit any tree perfectly. Recall the following lemma.

Lemma 1 (4-point condition) A metric space O is additive iff given any 4 points in O,
we can label them i, j, k, l, such that

    d_ij + d_kl = d_ik + d_jl >= d_il + d_jk

Show that the following data defines a metric space (all the measured data obeys the triangle
inequality), but that metric space is not additive (this will allow us to conclude that there is
no phylogeny tree which fits the data perfectly). Hint: use the above lemma.

      A     B     C    D
A     0
B     1     0
C     1    .7     0
D    1.1    1    .5    0
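To see the lemma in action on this data, it is quick to tabulate the three pairwise sums that the 4-point condition compares. The check below uses exact rational arithmetic so no floating-point noise creeps in; it illustrates the condition but is not the written argument the problem asks for:

```python
from fractions import Fraction as F

# Distances from the table above.
d = {('A','B'): F(1), ('A','C'): F(1), ('A','D'): F(11, 10),
     ('B','C'): F(7, 10), ('B','D'): F(1), ('C','D'): F(1, 2)}

# The three ways to split {A,B,C,D} into two pairs of pairs.
sums = {
    'AB|CD': d[('A','B')] + d[('C','D')],        # 1 + 1/2      = 3/2
    'AC|BD': d[('A','C')] + d[('B','D')],        # 1 + 1        = 2
    'AD|BC': d[('A','D')] + d[('B','C')],        # 11/10 + 7/10 = 9/5
}
print(sums)

# Additivity (the 4-point condition) requires the two largest sums to be equal.
largest, second = sorted(sums.values(), reverse=True)[:2]
print("additive?", largest == second)  # the two largest differ, so: False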
c) Considering the data from part b), what phylogeny tree does UPGMA (the Unweighted
Pair Group Method with Arithmetic mean) lead to? Show your work.
d) In this problem we will use a great piece of software, matlab, to investigate this problem
more deeply.
Suppose we assume the topology which the UPGMA procedure puts out is correct. Then we
can represent our data as

    A       B
     \     /
    x \   / y
       \ /
        *
        |
        | z
        |
        *
       / \
    w /   \ v
     /     \
    C       D

where v, w, x, y, z are the distances of the edges they lie along. Consider writing this as a
vector e = [v w x y z].
Note then that we obtain the predicted distances p(.,.):

    x + z + v = p(A,D)
    y + z + w = p(B,C)
    x + z + w = p(A,C)
    y + z + v = p(B,D)
    x + y     = p(A,B)
    w + v     = p(C,D)

It is possible to write this succinctly as a matrix times a vector. Let our matrix be F (for
Fit). Then

        [ 1 0 1 0 1 ]
        [ 0 1 0 1 1 ]
    F = [ 0 1 1 0 1 ]                                      (1)
        [ 1 0 0 1 1 ]
        [ 0 0 1 1 0 ]
        [ 1 1 0 0 0 ]
yields the equation

          [ p(A,D) ]
          [ p(B,C) ]
    F e = [ p(A,C) ]                                       (2)
          [ p(B,D) ]
          [ p(A,B) ]
          [ p(C,D) ]
Using this formulation, we can easily improve the fit using the following procedure. Start
matlab. To start matlab on athena (all of you should have access to athena), type
add matlab; matlab
followed by return.
Now put the following commands in at the prompt
F = [1 0 1 0 1; 0 1 0 1 1; 0 1 1 0 1; 1 0 0 1 1; 0 0 1 1 0; 1 1 0 0 0]
You should see
F =
1 0 1 0 1
0 1 0 1 1
0 1 1 0 1
1 0 0 1 1
0 0 1 1 0
1 1 0 0 0
on your screen.
Now type in
e = [.25 .25 .608 .425 .175℄
followed by
e = e'
in order to get a column vector rather than a row vector. Now we have e = [v w x y z] for
the data we obtained through UPGMA.
Now type in
F * e
This is the vector of distances which the UPGMA method has produced. Call it r, for Result.
We now want to know how far off we are with respect to the goal. To this end, call the vector
of distances we would ideally like to see g (for Goal) and type in
r = F * e
followed by
g = [1.1 .7 1 1 1 .5℄; g = g'
Now we are going to optimize our answer with respect to the Cavalli-Sforza and Edwards
criterion. That is, given observed distances d_ij, we are going to choose predicted distances
P(i,j) in such a way as to minimize

    sum over i,j of (d_ij - P(i,j))^2

To see how far away our data is from perfection at the moment, we are going to evaluate our
current result, r, with the above criterion. To do this, type in
temp = g-r; temp = temp .* temp; sum(temp)
If you wish to see the operation as it progresses, just type in the above three commands
separately (without semicolons). You should find that we are currently off by .0517.
Now we are ready to ask if we can do better. To find out, let's type
newe = F \ g
backslash is a magical matlab command which does a lot of neat things - in our case, it is
computing the least-squares solution to the problem of solving (F * newe = g) where F and
g are over-constrained. Follow this by computing
newr = F * newe
temp = g-newr; temp = temp .* temp; sum(temp)
What is output by this last command? What is the ratio of our new solution to our old
solution, and what factor of improvement is this? So just some simple post-processing on our
UPGMA answer produced a significantly closer fit. Matlab is great.
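For readers without an athena account, the same computation can be reproduced in plain Python. This is only a stand-in for matlab's backslash: it solves the normal equations F'F e = F'g with a small hand-rolled Gaussian elimination, and the function names are ours:

```python
F = [[1, 0, 1, 0, 1],
     [0, 1, 0, 1, 1],
     [0, 1, 1, 0, 1],
     [1, 0, 0, 1, 1],
     [0, 0, 1, 1, 0],
     [1, 1, 0, 0, 0]]
e = [.25, .25, .608, .425, .175]   # UPGMA edge lengths [v w x y z]
g = [1.1, .7, 1, 1, 1, .5]         # observed distances (the Goal)

def matvec(M, v):
    return [sum(row[i] * v[i] for i in range(len(v))) for row in M]

def sse(u, v):                     # the Cavalli-Sforza / Edwards criterion
    return sum((a - b) ** 2 for a, b in zip(u, v))

def lstsq(M, y):
    """Least-squares solve of M x ~ y via the normal equations M'M x = M'y."""
    n = len(M[0])
    A = [[sum(M[r][i] * M[r][j] for r in range(len(M))) for j in range(n)]
         for i in range(n)]
    b = [sum(M[r][i] * y[r] for r in range(len(M))) for i in range(n)]
    for col in range(n):           # Gaussian elimination with partial pivoting
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for j in range(col, n):
                A[r][j] -= f * A[col][j]
            b[r] -= f * b[col]
    x = [0.0] * n                  # back substitution
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))) / A[i][i]
    return x

r = matvec(F, e)
print(round(sse(g, r), 4))              # close to the .0517 quoted above
newe = lstsq(F, g)
print(sse(g, matvec(F, newe)) < sse(g, r))  # -> True: the fit improved
```

The normal-equations route is numerically cruder than what matlab's backslash does internally (a QR factorization), but for a well-conditioned 6 x 5 system like this one it gives the same least-squares answer.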
2. More String Searching: Searching for one string in another string is one of the oldest
problems in computer science, and people are still continually discovering new and innovative
ways of solving it. The two most common justifications for beating this dead horse are "the
web" and "the genome." This problem will explore a state-of-the-art randomized algorithm
for counting the number of times one string appears inside another string. Note that this is
easier than getting an optimal alignment, and we are ignoring gaps. This will free us to make
several shortcuts which will drastically improve the time that our algorithm takes. Assume
we have one string of length n (the target) and we know that the other string will have length
k (the pattern). Our goal is to run an O(n) time algorithm on our target string, such that, no
matter what length-k string we are later given, we will be able to return how many occurrences
of the length-k string lie in the target string in time O(k).
To convince yourself that this is a worthwhile goal to investigate, spend a few minutes trying
to figure out how you would solve this problem. It's hard.
Since our algorithm is randomized, we will not always return the right answer. One of the
inputs to our algorithm will be a parameter p, and we will return the correct answer with
probability >= 1 - 1/p. In fact, we will be right all the time with probability >= 1 - n/p. This
will be acceptable because our running time in the preprocessing stage will just be O(n log p),
and so our running time will only increase by a factor of 20 if we want to make sure we make
mistakes less than one out of every million (= 2^20) times we run the algorithm.
One useful property of our algorithm will be that it never undercounts the number of times a
length-k substring appears, although it may overcount. In some applications, this is exactly
the kind of "failure" that is acceptable. Imagine that you have just sequenced a 1000 base
pair gene, and you want to pick a short (say, 20 base pair) subsequence to use as an STS.
Also imagine that the human genome project has already been finished, and is available to
you on disk. Say you ask the computer, "what about this particular STS, is this a uniquely
appearing length-20 subsequence of the entire genome?" If the computer says no, you just
pick another length-20 subsequence and try again. On the other hand, if the computer says
yes, you can be sure that the actual number of times your STS appears in the genome is
exactly 1.
The bird's eye view of our algorithm is as follows: we start by picking a random prime p of the
appropriate length. The appropriate length is determined by the user's desire for a tradeoff
between speed (the amount of time we spend initially) and confidence (how sure we want
to be that our final answer is correct). We represent all length-k subsequences as numbers
(every base is mapped to a number between 0 and 3) and construct a hash table which counts
the number of times each particular one appears. Hash tables are a ubiquitously useful item
in computer science. Our hash table will have size p log n, because every entry will count the
potentially very large (up to n) number of times that the subsequence appears.
We represent our hash table by a hash function, h(). h() will be defined such that h() takes
a subsequence as input and returns a location in the hash table. When we want to find out
how many times a particular pattern l appears, we will just look up h(l).
Define our target string to be s = s_0 s_1 s_2 ... s_{n-1}.
Let the first length-k substring be i_0 = s_0 s_1 s_2 ... s_{k-1}.
Then we run the following algorithm:

    initially set every entry of the hash table to be 0
    for j = 0 to n - k:
        1. increment h(i_j) by one
        2. compute i_{j+1} by taking i_j, subtracting off the highest letter,
           shifting everything up, and adding in the new low letter (while doing
           all computations modulo p):

               i_{j+1} <- (i_j - s_j * 4^{k-1}) * 4 + s_{j+k}   (mod p)
That's it. Step 1 just increments the bin a particular string is hashed into, noting that we
saw one, and step 2 quickly calculates the new value for a substring given the value of the
old substring. To examine step 2 in a little more detail, suppose we have one number written
out as a string of digits, and we want to convert it into another, shown below.

    a b c d ... x y  ->  b c d ... x y z

All step 2 says is: subtract a times whatever the value of its place in the number is, multiply
everything else up by 4 (since we are working base 4), and finally add in z. However, to keep
things fast, we do everything mod p (that is, throwing away the part of the number which is
greater than p). Your first question asks you to show that we get the same result whether we
perform the modulus at every step or we perform the modulus only at the end.
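Putting steps 1 and 2 together, the whole preprocessing pass and the O(k) query are compact enough to sketch in Python. The function names are ours, and for simplicity the sketch uses the raw value mod p directly as h() (the a + b*s refinement developed below is omitted):

```python
def build_table(target, k, p):
    """O(n) preprocessing: count every length-k window of target, hashed mod p."""
    code = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    s = [code[ch] for ch in target]
    top = pow(4, k - 1, p)              # value of the highest base-4 place
    h = 0
    for j in range(k):                  # hash of the first window, i_0
        h = (h * 4 + s[j]) % p
    table = {}
    for j in range(len(s) - k + 1):
        table[h] = table.get(h, 0) + 1                # step 1: increment h(i_j)
        if j + k < len(s):                            # step 2: roll to i_{j+1}
            h = ((h - s[j] * top) * 4 + s[j + k]) % p
    return table

def count_occurrences(table, pattern, p):
    """O(k) query: hash the pattern and look it up (may overcount, never under)."""
    code = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    h = 0
    for ch in pattern:
        h = (h * 4 + code[ch]) % p
    return table.get(h, 0)

p = 1000003                             # a prime far larger than 4^k here
table = build_table("ACGTACGT", 4, p)
print(count_occurrences(table, "ACGT", p))  # -> 2
print(count_occurrences(table, "CGTA", p))  # -> 1
print(count_occurrences(table, "TTTT", p))  # -> 0
```

With k = 4 every window value is below 4^4 = 256 < p, so no two distinct windows can collide and the counts above are exact; overcounting only becomes possible once 4^k exceeds p.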
a) show that [(a mod p) * (b mod p) + (c mod p)] mod p = (a * b + c) mod p
This exercise should have convinced you that we can do everything mod p and thus that we
can restrict our attention to comparatively small integers (since log p << n). Now we proceed
to develop our hash function. Suppose we pick two integers a and b between 1 and p - 1
independently at random. Take two strings (which we can view as numbers) s and t that are
not equal mod p.
b) What is the probability over choice of a and b that a + bs = a + bt mod p? You may use
that xy = 0 mod p implies that either x = 0 mod p or y = 0 mod p since p is a prime. Justify
your answer.
c) So the only thing we are really worried about is whether s and t are equal mod p. Suppose
s and t have length k in binary. Say we have access to distinct primes in the range [c, ..., 2c].
Show that the number of possible s, t pairs of length k which are equal mod all of the primes,
but not truly equal, is

    2^{2k} / c

You may use the Chinese Remainder Theorem, that for distinct primes p, q,
a = b mod pq if and only if a = b mod p and a = b mod q.
d) Now suppose that we have c > 2^{2k}, and we have 2c such primes (not just c). Then the
probability we choose a prime which distinguishes between two particular s and t is >= 1/2.
Prove this.
From the previous analysis, it is not much more work to prove formally that the chance of
success for the algorithm is what we claimed in the beginning, but we'll leave that for another
day.
e) Assume we choose p such that the chance of never having a failure is >= 1 - 1/n, that is,
p = n^2. Did the amount of space you needed to store the genome get smaller or larger? By
how much? Do not express your answer as a function of n and p - give numbers. Assume
that unused entries in the hash table take up no space. Can you still search the genome faster
than before (when you just had the genome as a single string)?
3. Gene Chips: The following problem is due to Pablo Tamayo.
Normal vs Renal Carcinoma Gene Expression Dataset
The following table shows numerical values for mRNA concentrations obtained using DNA-
microarrays from 6 normal and 6 renal carcinoma samples in human patients. The data was
obtained using the chips manufactured by this company:
http://www.affymetrix.com/technology/index.html
Based on the given data, answer the following three questions:
a) Which genes are the best "markers" to separate normal from carcinoma samples?
b) How would you classify new samples into normal and carcinomas using those markers?
c) If you were to make a clinical assay to make this classification and could only test for three
genes, which three would you choose? (explain)
The dataset shows 20 genes selected from a total of 6,817 in the chip. The first six columns
are readings from normal cells, and the second six columns are readings from cancerous cells.
This data is available in the table below, and also in a file located at
http://web.mit.edu/afs/athena.mit.edu/user/j/d/jdunagan/Public/18.417-pset3-genechip-dataset

12 L07648 42 175 50 29 59 154 1087 252 93 309 66 60
13 U12465 1410 2120 1009 1070 1481 1965 5734 1714 1038 1487 3819 4935
14 X59798 130 380 358 37 230 154 1050 592 1367 1872 1414 900
15 U39318 253 229 365 470 373 258 437 714 559 568 317 134
16 X52541 979 375 434 341 909 426 280 783 220 19 19 68
17 X56494 315 1091 62 8 967 483 3303 2525 2104 1719 2002 2215
18 Z30644 870 1452 1707 1745 1500 1243 315 596 622 532 350 425
19 U96915 124 332 21 95 151 145 652 257 51 74 60 0
20 D10995 182 142 718 737 199 155 231 436 469 645 129 157
21 X51435 237 149 262 629 160 133 144 343 96 196 232 173
22 X01677 4788 4480 2535 2470 7088 6925 4040 4899 1538 8773 4289 6837
23 D42039 78 134 284 641 143 163 98 193 421 412 249 116
24 J03827 643 606 250 870 764 604 1659 606 459 482 316 858
25 U43189 110 195 298 329 65 137 188 291 668 423 148 61
26 S45630 1043 2263 1069 1845 2255 1613 5213 1191 2702 1962 473 142
27 U37251 337 243 398 662 107 184 271 452 388 584 243 200
28 U85992 281 206 546 670 26 219 217 195 357 292 448 169
29 L22342 373 163 573 824 422 368 311 359 926 605 298 243
30 Z30425 276 169 596 612 344 400 132 575 415 292 393 93
31 U70671 579 146 821 1787 806 632 247 369 995 790 356 838

This should suggest to you two ways of solving the problem. One way is to simply look at
the data with your naked eye. Another way is to use matlab again! If you do choose to use
matlab, the following commands are meant to step you through the problem even if you have
no previous matlab experience.
First, copy the data into a file called just data.
For those of you who have never used unix before, one way to do this after downloading the
data set is
cp 18.417-pset3-genechip-dataset data
Then, inside matlab, type the following commands
load data
The entire matrix is now in the matlab variable data. To confirm this, type
data
Now we proceed to put the first 6 columns into a set of "normal" data points, and the
remaining columns into "cancerous" data points.
normal = data(:,1:6)
cancerous = data(:,7:12)
Now we extract the maximum and minimum along each row. Since matlab has a built-in
function to do this along each column, we tilt the matrix, extract the max, and tilt it back.
normal_max = max(normal')'
normal_min = min(normal')'
cancerous_max = max(cancerous')'
cancerous_min = min(cancerous')'
Now we get back to our original question: what genes are more highly expressed in the
cancerous cells than in normal cells? The ideal case (which occurs here, but often does not
occur) is that we can find some gene which is more highly expressed in every cancerous cell
than it is in any normal cell. To test for this hypothesis, we enter
normal_max < cancerous_min
The ones in the resulting column vector denote genes where the normal cell which expresses
the gene the most still expresses the gene less than the cancerous cell which expresses the
gene the least.
We can make a similar test to see if there are any genes which are "off" in the cancerous cell
when they should be "on."
normal_min > cancerous_max
At this point you should be able to match up ones from the last two matlab commands with
genes in the table above. Check by hand that we did indeed find good marker genes and
write up your answers to the above questions.
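The same max/min test is easy to replicate outside matlab. The sketch below runs it on four rows transcribed from the table above; the helper names are ours, and this only demonstrates the mechanics of the test, not the requested write-up:

```python
# gene id -> (6 normal readings, 6 carcinoma readings), copied from the table
rows = {
    'L07648': ([42, 175, 50, 29, 59, 154],          [1087, 252, 93, 309, 66, 60]),
    'X59798': ([130, 380, 358, 37, 230, 154],       [1050, 592, 1367, 1872, 1414, 900]),
    'X56494': ([315, 1091, 62, 8, 967, 483],        [3303, 2525, 2104, 1719, 2002, 2215]),
    'Z30644': ([870, 1452, 1707, 1745, 1500, 1243], [315, 596, 622, 532, 350, 425]),
}

# "On" in every carcinoma sample more than in any normal sample ...
up   = [g for g, (norm, carc) in rows.items() if max(norm) < min(carc)]
# ... or "off" in every carcinoma sample relative to every normal sample.
down = [g for g, (norm, carc) in rows.items() if min(norm) > max(carc)]

print("up in carcinoma:", up)      # X59798 and X56494 pass; L07648 does not
print("down in carcinoma:", down)  # Z30644 passes
```

Note that the test is deliberately strict: a single outlier reading in either group disqualifies a gene, which is exactly why only a handful of the 20 rows survive it.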
4. Sequencing: In this problem we will work out some things which Serafim glossed over in
his lecture. The answers to these questions are in a paper which Serafim would be happy
to provide you with, and reading that paper would be a great way to solve this problem.
Alternatively, you can work through all of this on your own, and that will also be educational.
If you have experience working with the exponential (or Poisson) distribution, this problem
should be easy.
The first thing we examine is the number of clones we have to sequence in the walking step.
Recall from lecture that we start with "islands" which we have sequenced and try to walk
over "oceans" of unsequenced DNA with as little excess sequencing as possible. Just as in
lecture, we will assume we only make unidirectional walking steps (i.e., we only extend our
islands to the right). We start by making the following definitions, which are the same as
those in lecture.

J = number of walking steps from one island to the next
E[J] = the expected value of J
D = the depth of the clone library
  = the average number of times a base is covered by a clone
ω = the mean ocean width
P_{D,ω} = the probability we hit an island on our next walking step
Assume every clone in the clone library has length 1 (they are all 1 clone unit long). The first
thing we will examine is the impact of the Exponential Oceans Assumption. Serafim Batzoglou
referred to this in passing in his lecture as an idealization which they have experimentally
verified - most of the data fits this assumption. Technically, this means that we assume that
the oceans obey the following formula

    Pr[for a random ocean O, O has length <= x_0] = ∫_0^{x_0} (1/ω) e^{-x/ω} dx
The first thing we'll do is some sanity checks about this definition. The first thing to check
is that with probability 1, all ocean lengths lie between 0 and +∞. This is just the following
integration

    ∫_0^{+∞} (1/ω) e^{-x/ω} dx = 1

Similarly, we now want to check that these oceans do indeed have mean length ω. This
amounts to showing that

    ∫_0^{+∞} (x/ω) e^{-x/ω} dx = ω
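Both sanity checks can also be confirmed numerically before doing the calculus. The sketch below approximates the two integrals with a simple midpoint rule, truncating the infinite upper limit at 50 times the mean (the value ω = 2 is an arbitrary choice for the check):

```python
import math

omega = 2.0                       # illustrative mean ocean width
def density(x):                   # (1/omega) * exp(-x/omega)
    return math.exp(-x / omega) / omega

def midpoint(f, a, b, n=200000):  # midpoint-rule quadrature on [a, b]
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

total = midpoint(density, 0.0, 50 * omega)                   # should be ~1
mean  = midpoint(lambda x: x * density(x), 0.0, 50 * omega)  # should be ~omega
print(total, mean)
```

The truncation error from stopping at 50ω is on the order of e^{-50}, far below the quadrature error, so both printed values agree with 1 and ω to several decimal places.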
The above integral is left as an exercise. Now consider the claim that, after I have advanced
some fixed t into a random ocean, if I have not already hit the end of the ocean, then I expect
to have to advance another ω to hit the end of the ocean. Serafim Batzoglou mentioned this in
his lecture when he said "we can cut off some fixed amount from the front of the exponential
distribution and get back the exponential distribution."
a) Prove the above statement.
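Before writing the proof, a quick Monte Carlo experiment can make the claim plausible: draw exponential ocean widths, keep only the oceans that extend past the fixed advance t, and look at how much length remains. The values of ω and t below are arbitrary illustrative choices:

```python
import random

random.seed(0)
omega, t = 1.0, 0.7               # mean ocean width and a fixed advance
samples = 200000

# Residual length beyond t, for oceans that did not end before t.
residuals = [x - t for x in (random.expovariate(1 / omega)
                             for _ in range(samples)) if x > t]

print(len(residuals))                   # about samples * e^(-t/omega) survive
print(sum(residuals) / len(residuals))  # about omega, as the claim predicts
```

Repeating the experiment with other values of t leaves the empirical residual mean near ω, which is exactly the memorylessness the proof should establish.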
Now, as an exercise, we will calculate E[J] in terms of P_{D,ω}.
E[J] can be computed recursively. Say we take one step - if we do close the gap, then we
are done (obviously). If we don't close the gap, then how much more work remains? This
is where the Exponential Oceans Assumption allows us to proceed. If we assume that the
oceans are distributed in this way, then after one step that doesn't close the gap, we are no
more likely to close the ocean on the next step than we were on the previous one.[1] This leads
us to the following equation

    E[J] = P_{D,ω} * 1 + (1 - P_{D,ω}) * (E[J] + 1)

The first term comes from the possibility that we do finish walking, and the second term
comes from the possibility that we don't finish walking. This leads to the solution

    E[J] = 1 / P_{D,ω}
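The solution E[J] = 1/P_{D,ω} is just the mean of a geometrically distributed number of trials, which is easy to check by simulation. In this sketch each walking step independently closes the gap with an illustrative probability P = 0.25, so the empirical mean should land near 1/P = 4:

```python
import random

random.seed(1)
P = 0.25                 # illustrative stand-in for P_{D,omega}
trials = 100000

def steps_to_close():
    """Count walking steps until one finally closes the gap."""
    j = 1
    while random.random() >= P:   # this step failed to close the gap
        j += 1
    return j

mean_j = sum(steps_to_close() for _ in range(trials)) / trials
print(mean_j)            # close to 1 / P = 4
```

The independence of successive steps in the simulation is precisely what the Exponential Oceans Assumption (via part a)) licenses.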
Now, suppose we have two libraries, one of depth D_1 and one of depth D_2. As above, all the
clones from the first library have length 1, but this time the clones from the second library
have length l < 1. We now propose the following walking step: if it is possible, we close the
gap with a clone from the smaller library.[2] If that is not an option, we try to close the gap
with a clone from the larger library. If neither of those is an option, then we take one walking
step with the clone from the larger library and start again. We now make the following
additional definitions to help us cope with this more complicated problem

    K = number of walking steps from one island to the next,
        weighted by the length of the walking step
    E[K] = the expected value of K
    P_{D,ω,1} = the probability we hit an island on our next walking step of length 1
    P_{D,ω,l} = the probability we hit an island on our next walking step of length l

b) Write out the equation for E[K] which this leads to and solve it. As a check, you should
find that your solution yields

    E[K] = (1 + P_{D_2,ω,l} * (l - 1)) / P_{D_1,ω,1}

[1] Although this last part sounds contradictory (after all, didn't we make some progress?), we
just proved that it is true in part a).
[2] Ignore for the moment how we could know that a smaller clone would close the gap without
sequencing it - this is not actually an issue.
