
This problem asks whether any 20-substring of nucleotides in a string of three billion matches any of the other 20-substrings, not one in particular. This is the Birthday Paradox problem in disguise: how many strangers need to be in a room together before the probability of a pair of them sharing a birthday is greater than 50%?

Here's how to find the number of possible unique 20-substrings. There are four options for the first base (A, C, G, or T), four options for the second base (again: A, C, G, or T), four options for the third base, ..., and four options for the 20th base. This problem assumes that each 20-substring is as likely to appear as any other possible 20-substring, and it further assumes that each 20-substring is independent of any other 20-substring. That is to say, just because there is a bunch of guanine nearby doesn't mean that the next 20-substring will have more or less guanine than usual. The multiplication counting principle allows us to conclude that the number of possible 20-substrings is $4 \times 4 \times \cdots \times 4$ (twenty factors) $= 4^{20} = 1{,}099{,}511{,}627{,}776$. I will now refer to this number as N.

In alphabetical order, all N of these possible 20-substrings can be listed as:

1) AAAAA AAAAA AAAAA AAAAA
2) AAAAA AAAAA AAAAA AAAAC
3) AAAAA AAAAA AAAAA AAAAG
4) AAAAA AAAAA AAAAA AAAAT
5) AAAAA AAAAA AAAAA AAACA
6) AAAAA AAAAA AAAAA AAACC
...
1,099,511,627,770) TTTTT TTTTT TTTTT TTTGC
1,099,511,627,771) TTTTT TTTTT TTTTT TTTGG
1,099,511,627,772) TTTTT TTTTT TTTTT TTTGT
1,099,511,627,773) TTTTT TTTTT TTTTT TTTTA
1,099,511,627,774) TTTTT TTTTT TTTTT TTTTC
1,099,511,627,775) TTTTT TTTTT TTTTT TTTTG
1,099,511,627,776) TTTTT TTTTT TTTTT TTTTT

As an aside, let P be the probability of something happening: say, P is the 70% chance that it will rain sometime next week. Then the probability of it not raining sometime next week (which I will write as P') is obviously 30%. Notice that those two percentages add up to 100% because P and P' are mutually exclusive and together cover every possibility. That is to say, it will either have rained sometime next week or it will not have, and there's no way that the weather may have done both. So, $P + P' = 100\%$.

Now, back on topic. If P is the probability that a pair of 20-substrings in the three billion are identical, then P' is the probability that they aren't identical. It may be simpler to calculate P' first and then calculate P, knowing that P + P' must sum to 100%. So, what's P', the probability that a pair of them don't match? To not match, the second 20-substring will have to be any of the 1,099,511,627,775 options that are different from the first one. Notice that that huge number is just (N-1). The probability of this happening is (N-1 nonmatching ways) ÷ (N possible ways), or $(N-1)/N$. This is very nearly 100%, which should intuitively make sense.
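The counting argument and the alphabetical listing above can be checked with a short Python sketch. The names `BASES`, `K` and `kth_substring` are made up for this illustration; the function treats a 1-based list position as a base-4 number whose digits are the letters A, C, G, T.

```python
from fractions import Fraction

BASES = "ACGT"          # alphabetical order, as in the listing
K = 20                  # substring length
N = len(BASES) ** K     # 4^20 possible 20-substrings


def kth_substring(index: int, k: int = K) -> str:
    """Return the 1-based index-th k-substring in alphabetical order."""
    if not 1 <= index <= len(BASES) ** k:
        raise ValueError("index out of range")
    value = index - 1               # zero-based position, read as base 4
    letters = []
    for _ in range(k):
        value, digit = divmod(value, len(BASES))
        letters.append(BASES[digit])
    return "".join(reversed(letters))  # most significant base first


# Probability that a second 20-substring differs from the first: (N-1)/N.
p_pair_differs = Fraction(N - 1, N)

print(N)
print(kth_substring(1), kth_substring(6), kth_substring(N))
print(float(p_pair_differs))
```

Using `Fraction` keeps (N-1)/N exact; converted to a float it is indistinguishable from 1 at ordinary display precision, which matches the "very nearly 100%" remark.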

What's the probability that a pile of three 20-substrings would all be different? Well, pick the first one. The first 20-substring put into the pile can be any of the N possible configurations, so the probability of picking a first 20-substring for the pile and having it not be like anything else in the (empty) pile is $N/N$, which is equal to 100%. That is what we would expect: the pile is empty, so there's a 100% chance that this first one won't match any of the nothingness already there, right? Right. As discussed before, the second one has a probability of $(N-1)/N$ of being different from the first. But the third one must be different from both of those first two, so, of the 1,099,511,627,776 different possible 20-substrings, 1,099,511,627,774 of them have not been represented yet. This is easier to write as N-2. The probability of the third one being different from the first two is $(N-2)/N$. According to the multiplication counting principle, the probability of the first 20-substring being an acceptable 20-substring AT THE SAME TIME AS the second 20-substring being different from the first AT THE SAME TIME AS the third one being different from the first two is:

$$P'_3 = \frac{N}{N} \cdot \frac{N-1}{N} \cdot \frac{N-2}{N}$$

Similar arguments can be made to find that the probability of randomly selecting four 20-substrings, putting them all in a pile, and finding that they are all unique equals

$$P'_4 = \frac{N}{N} \cdot \frac{N-1}{N} \cdot \frac{N-2}{N} \cdot \frac{N-3}{N},$$

and

$$P'_5 = \frac{N}{N} \cdot \frac{N-1}{N} \cdot \frac{N-2}{N} \cdot \frac{N-3}{N} \cdot \frac{N-4}{N},$$

and so on. Soon, it becomes easy to see that the only difference between these three equations has been the inclusion of a new term at the end of the probability's product.
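The term-by-term product can be sketched in Python for small cases. The function name `prob_all_unique` is invented for this illustration, and the sketch relies on the same assumptions as the text (every option equally likely, draws independent). It is tried on the classic birthday setting of 365 days rather than on N itself, since looping billions of times with exact fractions would be impractical.

```python
from fractions import Fraction


def prob_all_unique(n: int, x: int) -> Fraction:
    """Probability that x independent, uniform picks from n options all differ."""
    p = Fraction(1)
    for k in range(x):
        # The (k+1)-th pick must avoid the k items already in the pile.
        p *= Fraction(n - k, n)
    return p


# Birthday version: chance of at least one shared birthday is 1 - P'.
for people in (22, 23):
    print(people, float(1 - prob_all_unique(365, people)))
```

With 365 equally likely birthdays, the chance of a shared birthday first passes 50% at 23 people, which is the usual statement of the Birthday Paradox.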

In fact, we can find the probability of putting X randomly selected 20-substrings in a pile and finding that they are all unique! I will call this probability $P'_X$. Notice that when X is 5, there are five N's in the denominator of $P'_5$. When X is 4, there are four N's in the denominator of $P'_4$. In short, the denominator will end up having X N's all multiplied together, so the expression for $P'_X$ will have an $N^X$ in its denominator.

Recall what a factorial is for some number a:

$$a! = a \cdot (a-1) \cdot (a-2) \cdot (a-3) \cdot (a-4) \cdots 2 \cdot 1$$

If we only want the first b terms of the factorial of a, we can find this using

$$\frac{a!}{(a-b)!}.$$

Pretty neat. As for the numerators of the probability equations above, it turns out that we only wanted the first four terms of N! in $P'_4$, and the first five terms of N! in $P'_5$, and so on. This lets us conclude that $P'_X$ should have a numerator equal to $\frac{N!}{(N-X)!}$. Since we know both the numerator and denominator of $P'_X$, we can put them together for the expression:

$$P'_X = \frac{N!}{(N-X)! \, N^X}$$

Remember, we know that N is just a really large number, but we know what it is and we can plug it in whenever we want to.
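As a sanity check on the factorial shortcut, the closed form can be compared with the term-by-term product for small numbers. This is only a sketch: `first_terms`, `product_form` and `closed_form` are names made up here, and exact factorials are practical only for small n, not for N = 4^20.

```python
from fractions import Fraction
from math import factorial


def first_terms(a: int, b: int) -> int:
    """Product of the first b terms of a!, computed as a! / (a-b)!."""
    return factorial(a) // factorial(a - b)


def product_form(n: int, x: int) -> Fraction:
    """(n/n) * ((n-1)/n) * ... * ((n-x+1)/n), multiplied out term by term."""
    p = Fraction(1)
    for k in range(x):
        p *= Fraction(n - k, n)
    return p


def closed_form(n: int, x: int) -> Fraction:
    """n! / ((n-x)! * n^x), the compact expression for the same probability."""
    return Fraction(factorial(n), factorial(n - x) * n ** x)


print(first_terms(10, 3))                               # 10 * 9 * 8
print(product_form(365, 23) == closed_form(365, 23))    # the two forms agree
```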

But, what is X? How many 20-substrings are there in the 3,000,000,000 nucleotides? Well, one 20-substring is made up of the first twenty nucleotides. The second 20-substring is made up of nucleotides #2 through #21. The third 20-substring is made up of nucleotides #3 through #22. The fourth 20-substring is made up of nucleotides #4 through #23. The hundredth 20-substring is made up of nucleotides #100 through #119. The Wth 20-substring is made up of nucleotides #W through #W+19. The 2,999,999,980th 20-substring is made up of nucleotides #2,999,999,980 through #2,999,999,999. The 2,999,999,981st 20-substring is made up of nucleotides #2,999,999,981 through #3,000,000,000. So, there are 2,999,999,981 20-substrings that can be found in a strand of 3 billion nucleotides. And what's the probability that all of them, put into a pile, will be different? In this case, X = 2,999,999,981 and

$$P'_X = \frac{N!}{(N-X)! \, N^X} = \frac{1{,}099{,}511{,}627{,}776!}{(1{,}099{,}511{,}627{,}776 - 2{,}999{,}999{,}981)! \cdot 1{,}099{,}511{,}627{,}776^{\,2{,}999{,}999{,}981}}$$
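The window-counting argument above (the Wth 20-substring covers nucleotides #W through #W+19) amounts to "length minus 20, plus 1". A small Python sketch, with the helper names `count_windows` and `windows` invented for illustration, checks this on a toy sequence and on the full length:

```python
def count_windows(length: int, k: int) -> int:
    """Number of length-k windows in a sequence of the given length."""
    return max(length - k + 1, 0)


def windows(sequence: str, k: int):
    """Yield every overlapping length-k substring, left to right."""
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


toy = "ACGTAC"                            # toy example, not real genome data
print(list(windows(toy, 4)))              # three overlapping 4-substrings
print(count_windows(3_000_000_000, 20))   # the X used in the text
```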

Using some very powerful mathematical software, you can find that the probability of each of the 20-substrings being unique when compared to all the other substrings is an inconceivably small number: 0.000...0002%, with 1,779,876 zeros between the decimal point and the 2. That's almost two million zeros before the first significant figure! Soooo, what is the probability that there will be a repeat? Well, since we already talked about how P + P' = 100%, all we have to do is find the number which can be added to that incredibly tiny number to add up to 100%. P = 99.999...98%, where there are almost two million 9s before that eight comes around. Ummmm, this is about as certain as a probability gets!
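For readers without such software, the order of magnitude can be estimated in ordinary double precision by working with logarithms instead of the factorials themselves. The sketch below is an independent rough check, not the computation used above: it relies on Python's `math.lgamma` (the natural log of the gamma function, so `lgamma(n + 1)` is ln n!), and the function names are made up for illustration. It also shows the standard first-order birthday approximation, ln P' ≈ -X(X-1)/(2N), for comparison. Both should come out in the neighbourhood of minus 1.78 million decimal orders of magnitude, but rounding limits how many digits of the exponent can be trusted.

```python
import math

N = 4 ** 20                      # number of possible 20-substrings
X = 3_000_000_000 - 20 + 1       # number of 20-substrings in the strand


def log10_prob_all_unique(n: int, x: int) -> float:
    """log10 of n! / ((n-x)! * n^x), computed via log-gamma to avoid overflow."""
    ln_p = math.lgamma(n + 1) - math.lgamma(n - x + 1) - x * math.log(n)
    return ln_p / math.log(10)


def log10_birthday_approx(n: int, x: int) -> float:
    """First-order approximation: ln P' is roughly -x(x-1)/(2n)."""
    return -(x * (x - 1)) / (2 * n) / math.log(10)


print(log10_prob_all_unique(N, X))
print(log10_birthday_approx(N, X))
```

An exponent this negative means P' is far too small to represent as a float at all, which is why the code never forms P' directly and only reports its base-10 logarithm.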
