Problem Set 3
Due Date: Thursday, October 21
Covers Material Through: Thursday, October 14
1. Phylogeny:
a) First we will show that no matter what data we have, if we are dealing with only three
OTU's (the species which supply our actual data points), we can construct a perfect fit to
the data. To this end, consider the following distance data on species A, B, C.
    A   B   C
A   0
B   x   0
C   y   z   0
Construct a phylogeny tree which fits the data perfectly. Clearly, the distances on the tree
that you turn in as a solution should be variables, e.g., $(x + y - z)/2$.
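As a sanity check on any proposed answer, the following is a minimal sketch, assuming the tree is a star with one central node and one edge per species. The function name `three_leaf_tree` and the sample distances are illustrative, not part of the problem set.

```python
def three_leaf_tree(x, y, z):
    """Edge lengths (a, b, c) of a star tree on A, B, C.

    x = d(A,B), y = d(A,C), z = d(B,C), matching the table above.
    """
    a = (x + y - z) / 2  # edge from the central node to A
    b = (x + z - y) / 2  # edge from the central node to B
    c = (y + z - x) / 2  # edge from the central node to C
    return a, b, c


# Illustrative distances only (not data from the problem set).
a, b, c = three_leaf_tree(3.0, 4.0, 5.0)
# Path lengths through the central node should reproduce x, y, z.
print(a + b, a + c, b + c)
```

The edge lengths come out non-negative exactly when the three distances obey the triangle inequality.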
b) Now we will prove that with as few as four data points, it is possible to have data which
does not fit any tree perfectly. Recall the following lemma.
Show that the following data defines a metric space (all the measured data obeys the triangle
inequality), but that metric space is not additive (this will allow us to conclude that there is
no phylogeny tree which fits the data perfectly). Hint: use the above lemma.
    A     B     C     D
A   0
B   1     0
C   1     .7    0
D   1.1   1     .5    0
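Both properties can be checked mechanically. The sketch below assumes that the lemma referred to is the standard four-point condition (of the three pairwise sums over four points, the two largest must be equal); if the lemma in lecture was stated differently, only `is_additive` would need to change. All function names are illustrative.

```python
from itertools import combinations, permutations

# Distances copied from the table in part b).
D = {
    ("A", "B"): 1.0, ("A", "C"): 1.0, ("B", "C"): 0.7,
    ("A", "D"): 1.1, ("B", "D"): 1.0, ("C", "D"): 0.5,
}


def d(u, v):
    """Symmetric lookup into the lower-triangular table."""
    return 0.0 if u == v else D.get((u, v), D.get((v, u)))


def is_metric(points, eps=1e-9):
    """True if every ordered triple obeys the triangle inequality."""
    return all(d(u, w) <= d(u, v) + d(v, w) + eps
               for u, v, w in permutations(points, 3))


def pairing_sums(a, b, c, e):
    """The three ways to pair up four points, sorted ascending."""
    return sorted([d(a, b) + d(c, e), d(a, c) + d(b, e), d(a, e) + d(b, c)])


def is_additive(points, eps=1e-9):
    """Assumed four-point condition: the two largest sums must coincide."""
    return all(abs(s[2] - s[1]) <= eps
               for s in (pairing_sums(*q) for q in combinations(points, 4)))


points = ["A", "B", "C", "D"]
print(is_metric(points), is_additive(points))
```

A numerical check like this is not a proof, but it shows which inequality or equality a written argument has to address.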
c) Considering the data from part b), what phylogeny tree does UPGMA (the Unweighted
Pair Group Method with Arithmetic mean) lead to? Show your work.
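For checking hand calculations, here is a minimal sketch of UPGMA in Python. It assumes the usual formulation (repeatedly merge the two clusters with the smallest average leaf-to-leaf distance, placing the new node at half that distance); the function name `upgma` and the label format are illustrative choices, not part of the problem set.

```python
from itertools import combinations


def upgma(dist):
    """Run UPGMA on pairwise leaf distances.

    dist maps frozenset({u, v}) to the distance between leaves u and v.
    Returns a list of (cluster_label, node_height) pairs in merge order.
    """
    leaves = sorted({leaf for pair in dist for leaf in pair})
    clusters = {leaf: (leaf,) for leaf in leaves}  # label -> member leaves

    def average(c1, c2):
        # Unweighted mean over all leaf pairs straddling the two clusters.
        pairs = [(u, v) for u in clusters[c1] for v in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pr: average(*pr))
        height = average(c1, c2) / 2  # new node sits halfway between them
        label = "(" + c1 + "," + c2 + ")"
        clusters[label] = clusters.pop(c1) + clusters.pop(c2)
        merges.append((label, height))
    return merges


# Distances from the table in part b).
example = {
    frozenset("AB"): 1.0, frozenset("AC"): 1.0, frozenset("BC"): 0.7,
    frozenset("AD"): 1.1, frozenset("BD"): 1.0, frozenset("CD"): 0.5,
}
for label, height in upgma(example):
    print(label, round(height, 4))
```

The printed node heights can be compared against the edge lengths worked out by hand; ties between equally close clusters are broken arbitrarily here.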
d) In this problem we will use a great piece of software, matlab, to investigate this problem
more deeply.
Suppose we assume the topology which the UPGMA procedure puts out is correct. Then we can
represent our data as
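[Figure: unrooted tree. Leaves A and B attach to one internal node by edges x and y; leaves C
and D attach to the other internal node by edges w and v; edge z joins the two internal nodes.]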
where v, w, x, y, z are the distances of the edges they lie along. Consider writing this as a
vector $\vec{e} = [v\ w\ x\ y\ z]$.
Note then that we obtain the predicted distances $p(\cdot,\cdot)$:
x + z + v = p(A, D)
y + z + w = p(B, C)
x + z + w = p(A, C)
y + z + v = p(B, D)
x + y = p(A, B)
w + v = p(C, D)
It is possible to write this succinctly as a matrix times a vector. Let our matrix be F (for
Fit). Then
\[
F = \begin{pmatrix}
1 & 0 & 1 & 0 & 1 \\
0 & 1 & 0 & 1 & 1 \\
0 & 1 & 1 & 0 & 1 \\
1 & 0 & 0 & 1 & 1 \\
0 & 0 & 1 & 1 & 0 \\
1 & 1 & 0 & 0 & 0
\end{pmatrix}
\tag{1}
\]
yields the equation
\[
F\vec{e} = \begin{pmatrix}
p(A,D) \\ p(B,C) \\ p(A,C) \\ p(B,D) \\ p(A,B) \\ p(C,D)
\end{pmatrix}
\tag{2}
\]
Using this formulation, we can easily improve the fit using the following procedure. Start
matlab. To start matlab on athena (all of you should have access to athena), type
add matlab; matlab
followed by return.
Now put the following commands in at the prompt
F = [1 0 1 0 1; 0 1 0 1 1; 0 1 1 0 1; 1 0 0 1 1; 0 0 1 1 0; 1 1 0 0 0]
You should see
F =
1 0 1 0 1
0 1 0 1 1
0 1 1 0 1
1 0 0 1 1
0 0 1 1 0
1 1 0 0 0
on your screen.
Now type in
e = [.25 .25 .608 .425 .175]
followed by
e = e'
in order to get a column vector rather than a row vector. Now we have $\vec{e} = [v\ w\ x\ y\ z]$
for the data we obtained through UPGMA.
Now type in
F * e
This is the vector of distances which the UPGMA method has produced. Call it $\vec{r}$, for Result.
We now want to know how far off we are with respect to the goal. To this end, call the vector
of distances we would ideally like to see $\vec{g}$ (for Goal) and type in
r = F * e
followed by
g = [1.1 .7 1 1 1 .5]; g = g'
Now we are going to optimize our answer with respect to the Cavalli-Sforza and Edwards
criterion. That is, given observed distances $d_{ij}$, we are going to choose predicted distances
$P(i,j)$ in such a way as to minimize
\[ \sum_{i,j} \bigl(d_{ij} - P(i,j)\bigr)^2 \]
To see how far away our data is from perfection at the moment, we are going to evaluate our
current result, $\vec{r}$, with the above criterion. To do this, type in
temp = g-r; temp = temp .* temp; sum(temp)
If you wish to see the operation as it progresses, just type in the above three commands
separately (without semi-colons). You should find that we are currently off by .0517.
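For readers without access to matlab, the same check can be sketched in plain Python. The helper names `predict` and `squared_error` are illustrative; the numbers are the F, e and g given above.

```python
# Rows follow equation (2): p(A,D), p(B,C), p(A,C), p(B,D), p(A,B), p(C,D).
F = [
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 0, 1, 1, 0],
    [1, 1, 0, 0, 0],
]
e = [0.25, 0.25, 0.608, 0.425, 0.175]  # [v, w, x, y, z] from UPGMA
g = [1.1, 0.7, 1.0, 1.0, 1.0, 0.5]     # observed distances, same row order


def predict(F, e):
    """Matrix-vector product F * e: the predicted pairwise distances."""
    return [sum(f * x for f, x in zip(row, e)) for row in F]


def squared_error(g, r):
    """Cavalli-Sforza and Edwards criterion: sum of squared differences."""
    return sum((gi - ri) ** 2 for gi, ri in zip(g, r))


r = predict(F, e)
print(round(squared_error(g, r), 4))  # 0.0517, the value quoted above
```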
Now we are ready to ask if we can do better. To find out, let's type
newe = F \ g
Backslash is a magical matlab command which does a lot of neat things - in our case, it is
computing the least-squares solution to the problem of solving (F * newe = g) where F and
g are over-constrained. Follow this by computing
newr = F * newe
temp = g-newr; temp = temp .* temp; sum(temp)
What is output by this last command? What is the ratio of our new solution to our old
solution, and what factor of improvement is this? So just some simple post-processing on our
UPGMA answer produced a significantly closer fit. Matlab is great.
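To see what a least-squares solve involves, here is a minimal pure-Python sketch. It solves the normal equations $(F^T F)\,e = F^T g$ by Gauss-Jordan elimination, which is a swapped-in textbook technique and not necessarily what matlab's backslash does internally; it assumes F has full column rank. The function name `least_squares` is illustrative.

```python
F = [
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 0, 1, 1, 0],
    [1, 1, 0, 0, 0],
]
g = [1.1, 0.7, 1.0, 1.0, 1.0, 0.5]


def least_squares(F, g):
    """Minimize ||F e - g||^2 via the normal equations (F^T F) e = F^T g."""
    n = len(F[0])
    # Augmented matrix [F^T F | F^T g].
    A = [[sum(row[i] * row[j] for row in F) for j in range(n)]
         + [sum(row[i] * gi for row, gi in zip(F, g))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


newe = least_squares(F, g)
newr = [sum(f * x for f, x in zip(row, newe)) for row in F]
new_error = sum((gi - ri) ** 2 for gi, ri in zip(g, newr))
print(newe, new_error)
```

Comparing `new_error` with the .0517 obtained for the UPGMA edge lengths gives the ratio the question asks about.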
2. More String Searching: Searching for one string in another string is one of the oldest
problems in computer science, and people are still continually discovering new and innovative
ways of solving it. The two most common justifications for beating this dead horse are "the
web" and "the genome." This problem will explore a state-of-the-art randomized algorithm
for counting the number of times one string appears inside another string. Note that this is
easier than getting an optimal alignment, and we are ignoring gaps. This will free us to make
several shortcuts which will drastically improve the time that our algorithm takes. Assume
we have one string of length n (the target) and we know that the other string will have length
k (the pattern). Our goal is to run an O(n) time algorithm on our target string, such that, no
matter what length k string we are later given, we will be able to return how many occurrences
of the length k string lie in the target string in time O(k).
To convince yourself that this is a worthwhile goal to investigate, spend a few minutes trying
to figure out how you would solve this problem. It's hard.
Since our algorithm is randomized, we will not always return the right answer. One of the
inputs to our algorithm will be a parameter p, and we will return the correct answer with
probability $1 - 1/p$. In fact, we will be right all the time with probability $1 - n/p$. This
will be acceptable because our running time in the preprocessing stage will just be $O(n \log p)$,
and so our running time will only increase by a factor of 20 if we want to make sure we make
mistakes less than one out of every million ($\approx 2^{20}$) times we run the algorithm.
One useful property of our algorithm will be that it never undercounts the number of times a
length k substring appears, although it may overcount. In some applications, this is exactly
the kind of "failure" that is acceptable. Imagine that you have just sequenced a 1000 base
pair gene, and you want to pick a short (say, 20 base pair) subsequence to use as an STS.
Also imagine that the human genome project has already been finished, and is available to
you on disk. Say you ask the computer, "what about this particular STS, is this a uniquely
appearing length 20 subsequence of the entire genome?" If the computer says no, you just
pick another length 20 subsequence and try again. On the other hand, if the computer says
yes, you can be sure that the actual number of times your STS appears in the genome is
exactly 1.
The bird's eye view of our algorithm is as follows: we start by picking a random prime p of the
appropriate length. The appropriate length is determined by the user's desire for a tradeoff
between speed (the amount of time we spend initially) and confidence (how sure we want
to be that our final answer is correct). We represent all length k subsequences as numbers
(every base is mapped to a number between 0 and 3) and construct a hash table which counts
the number of times each particular one appears. Hash tables are a ubiquitously useful item
in computer science. Our hash table will have size $p \log n$, because every entry will count the
potentially very large (up to n) number of times that the subsequence appears.
We represent our hash table by a hash function, h(). h() will be defined such that h() takes
a subsequence as input and returns a location in the hash table. When we want to find out
how many times a particular pattern l appears, we will just look up h(l).
Define our target string to be $s = s_0 s_1 s_2 \ldots s_{n-1}$.
Let the first length k substring be $i_0 = s_0 s_1 s_2 \ldots s_{k-1}$.
Then we run the following algorithm,
initially set every entry of the hash table to be 0
for j = 0 to n - k
1. increment $h(i_j)$ by one
2. compute $i_{j+1}$ by taking $i_j$, subtracting off the highest letter, shifting everything up, and
adding in the new low letter (while doing all computations modulo p)
\[ i_{j+1} \equiv (i_j - s_j \cdot 4^{k-1}) \cdot 4 + s_{j+k} \pmod{p} \]
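The loop can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the base-to-digit mapping `CODE` is an arbitrary choice, the table is a `Counter` keyed directly by the fingerprint (rather than by a separate hash function h), and p is a fixed known prime instead of a randomly chosen one. The function names are hypothetical.

```python
from collections import Counter

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed base-to-digit mapping


def fingerprint(text, p):
    """Value of text read as a base-4 number, reduced mod p."""
    value = 0
    for ch in text:
        value = (value * 4 + CODE[ch]) % p
    return value


def build_table(target, k, p):
    """Count the fingerprints of every length-k window of target."""
    table = Counter()
    if len(target) < k:
        return table
    top = pow(4, k - 1, p)              # weight of the highest digit, mod p
    value = fingerprint(target[:k], p)  # i_0
    for j in range(len(target) - k + 1):
        table[value] += 1               # step 1: increment the bin
        if j + k < len(target):         # step 2: slide the window by one
            value = ((value - CODE[target[j]] * top) * 4
                     + CODE[target[j + k]]) % p
    return table


def count_occurrences(table, pattern, p):
    """Look up a length-k pattern; may overcount, never undercounts."""
    return table[fingerprint(pattern, p)]


P = 2**31 - 1  # a known (Mersenne) prime, fixed here for illustration
table = build_table("ACGTACGACG", 3, P)
print(count_occurrences(table, "ACG", P))
```

Building the table takes one pass over the target, and each later query costs only the O(k) work of fingerprinting the pattern.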
That's it. Step 1 just increments the bin a particular string is hashed into, noting that we
saw one, and step 2 quickly calculates the new value for a substring given the value of the
old substring. To examine step 2 in a little more detail, suppose we have one number written
out as a string of digits, and we want to convert it into another, shown below.
abc ... xy -> bcd ... xyz
All step 2 says is, subtract a times whatever the value of its place in the number is, multiply
everything else up by 4 (since we are working base 4) and finally add in z. However, to keep
things fast, we do everything mod p (that is, throwing away the part of the number which is
greater than p). Your first question asks you to show that we get the same result whether we
perform the modulus at every step or we perform the modulus only at the end.
a) Show that $[(a \bmod p) \cdot (b \bmod p) + (c \bmod p)] \bmod p = (a \cdot b + c) \bmod p$.
This exercise should have convinced you that we can do everything mod p and thus that we
can restrict our attention to comparatively small integers (since $\log p \ll n$). Now we proceed
to develop our hash function. Suppose we pick two integers a and b between 1 and $p - 1$
independently at random. Take two strings (which we can view as numbers) s and t that are
not equal mod p.
b) What is the probability over choice of a and b that a + bs = a + bt mod p? You may use
that xy = 0 mod p implies that either x = 0 mod p or y = 0 mod p since p is a prime. Justify
your answer.
c) So the only thing we are really worried about is whether s and t are equal mod p. Suppose
s and t have length k in binary. Say we have access to distinct primes in the range [ , ..., 2 ].
Show that the number of possible s, t pairs of length k which are equal mod all of the primes,
but not truly equal, is
$2^{2k}$
You may use the Chinese Remainder Theorem, that for distinct primes p, q,
a = b mod pq if and only if a = b mod p and a = b mod q.
d) Now suppose that we have > $2^{2k}$, and we have 2 such primes (not just ). Then the
probability we choose a prime which distinguishes between two particular s and t is 1/2.
Prove this.
From the previous analysis, it is not much more work to prove formally that the chance of
success for the algorithm is what we claimed in the beginning, but we'll leave that for another
day.
e) Assume we choose p such that the chance of never having a failure is $1 - 1/n$, that is,
$p = n^2$. Did the amount of space you needed to store the genome get smaller or larger? By
how much? Do not express your answer as a function of n and p - give numbers. Assume
that unused entries in the hash table take up no space. Can you still search the genome faster
than before (when you just had the genome as a single string)?
3. Gene Chips: The following problem is due to Pablo Tamayo.
Normal vs Renal Carcinoma Gene Expression Dataset
The following table shows numerical values for mRNA concentrations obtained using
DNA-microarrays from 6 normal and 6 renal carcinoma samples in human patients. The data was
obtained using the chips manufactured by this company
http://www.affymetrix.com/technology/index.html
Based on the given data, answer the following three questions:
a) Which genes are the best "markers" to separate normal from carcinoma samples?
b) How would you classify new samples into normal and carcinomas using those markers?
c) If you were to make a clinical assay to make this classification and could only test for three
genes which three would you choose? (explain)
The dataset shows 20 genes selected from a total of 6,817 in the chip. The first six columns
are readings from normal cells, and the second six columns are readings from cancerous cells.
This data is available in the table below, and also in a file located at
http://web.mit.edu/afs/athena.mit.edu/user/j/d/jdunagan/Public/18.417-pset3-genechip-dataset
This should suggest to you two ways of solving the problem. One way is to simply look at
the data with your naked eye. Another way is to use matlab again! If you do choose to use
matlab, the following commands are meant to step you through the problem even if you have
no previous matlab experience.
First, copy the data into a file called just data
For those of you who have never used unix before, one way to do this after downloading the
data set is
cp 18.417-pset3-genechip-dataset data
Then, inside matlab, type the following commands
load data
The entire matrix is now in the matlab variable data. To confirm this, type
data
Now we proceed to put the first 6 columns into a set of "normal" data points, and the
remaining columns into "cancerous" data points.
normal = data(:,1:6)
cancerous = data(:,7:12)
Now we extract the maximum and minimum along each row. Since matlab has a built in
function to do this along each column, we tilt the matrix, extract the max, and tilt it back.
normal_max = max(normal')'
normal_min = min(normal')'
cancerous_max = max(cancerous')'
cancerous_min = min(cancerous')'
Now we get back to our original question: what genes are more highly expressed in the
cancerous cells than in normal cells? The ideal case (which occurs here, but often does not
occur) is that we can find some gene which is more highly expressed in every cancerous cell
than it is in any normal cell. To test for this hypothesis, we enter
normal_max < cancerous_min
The ones in the resulting column vector denote genes where the normal cell which expresses
the gene the most still expresses the gene less than the cancerous cell which expresses the
gene the least.
We can make a similar test to see if there are any genes which are "off" in the cancerous cell
when they should be "on."
normal_min > cancerous_max
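The two tests can also be sketched in plain Python. The matrix below is a hypothetical toy example with three genes and three samples per class, invented purely for illustration; it is not the problem-set data, and the function name `marker_flags` is an assumption.

```python
# Hypothetical toy matrix (NOT the problem-set data): rows are genes,
# columns 0-2 are normal samples, columns 3-5 are cancerous samples.
data = [
    [10, 12, 11, 30, 28, 35],  # higher in every cancerous sample
    [50, 55, 60, 20, 18, 25],  # lower in every cancerous sample
    [15, 40, 22, 30, 18, 45],  # overlapping ranges: not a clean marker
]
N_NORMAL = 3


def marker_flags(data, n_normal):
    """Per-gene 0/1 flags mirroring the two matlab comparisons above."""
    up, down = [], []
    for row in data:
        normal, cancerous = row[:n_normal], row[n_normal:]
        up.append(int(max(normal) < min(cancerous)))    # "on" in cancer
        down.append(int(min(normal) > max(cancerous)))  # "off" in cancer
    return up, down


up, down = marker_flags(data, N_NORMAL)
print(up, down)
```

For the real dataset, the same function would be called with the 20 by 12 matrix and `n_normal = 6`.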
At this point you should be able to match up ones from the last two matlab commands with
genes in the table above. Check by hand that we did indeed find good marker genes and
write up your answers to the above questions.
4. Sequencing: In this problem we will work out some things which Serafim glossed over in
his lecture. The answers to these questions are in a paper which Serafim would be happy
to provide you with, and reading that paper would be a great way to solve this problem.
Alternatively, you can work through all of this on your own, and that will also be educational.
If you have experience working with the exponential (or Poisson) distribution, this problem
should be easy.
The first thing we examine is the number of clones we have to sequence in the walking step.
Recall from lecture that we start with "islands" which we have sequenced and try to walk
over "oceans" of unsequenced DNA with as little excess sequencing as possible. Just as in
lecture, we will assume we only make unidirectional walking steps (i.e., we only extend our
islands to the right). We start by making the following definitions, which are the same as
those in lecture.
\[ E[J] = P_{D,\omega} \cdot 1 + (1 - P_{D,\omega}) \cdot (E[J] + 1) \]
The first term comes from the possibility that we do finish walking, and the second term
comes from the possibility that we don't finish walking. This leads to the solution
\[ E[J] = \frac{1}{P_{D,\omega}} \]
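The algebra between the recurrence and its solution is short. A sketch, writing $P$ as shorthand for $P_{D,\omega}$ (the shorthand is introduced here only for readability):

```latex
\begin{align*}
E[J] &= P \cdot 1 + (1 - P)\,(E[J] + 1) \\
E[J] &= P + (1 - P)\,E[J] + (1 - P) \\
E[J] - (1 - P)\,E[J] &= 1 \\
P \, E[J] &= 1 \\
E[J] &= \frac{1}{P}
\end{align*}
```

This is the familiar mean of a geometric distribution: if each step succeeds independently with probability $P$, the expected number of steps up to and including the first success is $1/P$.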
Now, suppose we have two libraries, one of depth $D_1$ and one of depth $D_2$. As above, all the
clones from the first library have length 1, but this time the clones from the second library
have length l < 1. We now propose the following walking step: if it is possible, we close the
gap with a clone from the smaller library.[2] If that is not an option, we try to close the gap
with a clone from the larger library. If neither of those is an option, then we take one walking
step with the clone from the larger library and start again. We now make the following
additional definitions to help us cope with this more complicated problem
b) Write out the equation for E[K] which this leads to and solve it. As a check, you should
find that your solution yields
\[ E[K] = \frac{1 + P_{D_2,\omega,l}\,(l - 1)}{P_{D_1,\omega,1}} \]
[1] Although this last part sounds contradictory (after all, didn't we make some progress?), we just proved that it is