Beruflich Dokumente
Kultur Dokumente
Recently there have been several reports that certain [1], a "min-max" method was proposed to take into ac-
DNA sequences may display long-range power-law corre- count the "nucleotide heterogeneity. " A potential draw-
lations extending across more than 104 nucleotides [1—5]. back of this method is that it requires the investigator to
Moreover, it appears that coding sequences do not dis- judge how many local maxima and minima of a landscape
play long-range correlations [1,2]. How to interpret these to utilize in the analysis. Here we present an alternative
findings is unclear at present. It is known that DNA method — —
"detmnded fluctuation analysis" (DEA) that
nucleotides form a mosaic comprised of "patches" (ex- is independent of investigator input and permits the de-
cess of one type of nucleotide) [6—10]. The possibility tection of long-range correlations embedded in a patchy
that such patchiness is related to the appearance of long- landscape [such as Fig. 1(b)], and also avoids the spu-
range power-law correlations has been raised by Nee [9] rious detection of apparent long-range correlations that
and by Karlin and Brendel [10]. Here we address this are an artifact of patchiness [Fig. 1(a)].
possibility by systematically studying distinct forms of The DFA method comprises the following steps.
mosaic structure. (1) Divide the entire sequence of length N into N/E
A useful way of analyzing patchiness arising &om nonoverlapping boxes, each containing E nucleotides, and
the heterogeneous purine-pyrimidine content is the DNA define the "local trend" in each box (proportional to the
walk [1], defined as follows: A one-dimensional walker compositional bias in the box) to be the ordinate of a
dictated by the nucleotide sequence takes one step down linear least-squares fit for the DNA walk displacement in
when there is a purine [u(i) = — 1] and one step up when that box [14].
there is a pyrimidine [u(i) = 1]. The displacement of the "
{2) Define the "detrended walk, denoted by yt{n), as
walker after n steps, y(n), is defined as y(n):— P," i u(i), the difFerence between the original walk y(n) and the
and is displayed on a graph of y vs n as in Fig. 1 [11]. local trend. Calculate the variance about the detrended
Figure 1(a) shows a representative DNA walk for a walk for each box, and calculate the average of these
class of artificial "control" sequences generated by stitch- variances over all the boxes of size E, denoted Fdz(E) [15].
ing together subsequences that correspond to random To illustrate the DFA method, we show in Fig. 2 a
walks with different nucleotide composition (strand bias) 1000-nucleotide subsequence of the DNA walk of bac-
[12]. Figure 1(b) shows a representative DNA walk for teriophage A [GenBank name: LAMCG, 48502 bp (bp
a difFerent class of sequences, generated by a model with denotes base pair)]. Figure 2(a) shows the local trends
a well-defined long-range power-law correlation [13]; the when this subsequence is partitioned into boxes of size
heterogeneous structure of this control sequence arises E = 100 while Fig. 2(b) shows the local trends when the
directly from the long-range correlation itself. subsequence is partitioned into boxes of size E = 200. It is
We find apparent patchiness in real DNA sequences— apparent by visual inspection that the variance increases
both in the noncoding and coding regions. Figure 1(c) with the box size. The dependence of variance on box size
displays the E. coli K12 genomic fragment (composed of gives rise to the scaling properties of the Huctuations.
more than 80%%uo coding regions), while Fig. 1(d) shows the If only short-range correlations (or no correlations) ex-
human T-cell receptor alpha/delta locus (( 10% coding). ist in the nucleotide sequence, then the detrended DNA
The fundamental di8'erence between the two control walk must have the statistical properties of a random
sequences is that the first does not possess long-range walk (unbiased or biased) so Eg(E) giIz; however, if
power-law correlations, while the second does. Therefore there is long-range power-law correlation (i.e. , no char-
, ~ appropriate scaling analysis of the correlation proper- acteristic length scale), then Ed(t') / with n g li2
ties should be able to distinguish between them. In Ref. [16].
49 1685
1686 C.-K. PENG et al. 49
800 800 T
40Q 400
0 rftrt~t~,
-400
800 8 Q r
400 400
-400 -400—
-800 '
-800
0 20000 40000 60000 80000 100000 0 20000 40000 60000 80000 100000
n
FIG. 1. (s) DNA walk for a control sequence obtained by stitching together biased random walks; the characteristic length
for the patches is 2500. (b) DNA walk for a control sequence obtained from building s long-range correlation into s set of
100000 "nucleotides" which are correlated with power-law exponent a = 0.61 using the procedure of Ref. [13]. (c) DNA walk
for s genomic fragment containing mostly coding regions [E coli K12. genome, 0—2.4 min region, GenBsnk name: ECO110K,
111401 bp]. (d) DNA walk for s typical intron-containing chromosomal region of s comparable length (humsn T-cell receptor
alpha/delta locus, GenBank name: HUMTCRADCV, 97634 bp). Large subregions ("patches") of uniform overall slope ("strand
bias" ) re8ect the mosaic structure. To facilitate the comparison of subtle nuctustions, each landscape is plotted so that the
end point has the same vertical displacement as the starting point, i.e. , the overall bias has been removed.
-1940
Figure 3 shows the results of applying DFA to the DNA I
I
]
3.0
-1000
-2000
I s I i I
-1.0--- I
tinguishable Rom true power-law correlations observed ing of the manuscript. Partial support was provided to
in noncoding sequences which usually exhibit constant o. C.K.P. by NIH/NIMH, to A. L.G. by the G. Harold and
value () —
0.5) over several decades more than 3 decades Leila Y. Mathers Charitable Foundation, the National
in the case of the data displayed in Fig. 3. Heart, Lung and Blood institute and the National Aero-
nautics and Space Administration, to M. S. by the Amer-
We wish to thank C. DeLisi, F. Sciortino, and T. Vic- ican Heart Association, and to S.V.B., S.H. , and H. E.S.
sek for valuable discussions, and P. 3ensen, J. Kertesz, by the National Science Foundation and OKce of Naval
W. Li, M. Matsa, and S. M. Ossadnik for a critical read- Research.
[1] C. K. Peng, S. V. Buldyrev, A. L. Goldberger, S. Havlin, N into k subregions and assign a compositional bias
F. Sciortino, M. Simous, and H. E. Stanley, Nature (Lon- (purine-pyrimidine concentration) to each subregion from
don) 356, 168 (1992). a Gaussian distribution of width cr and mean 0.5. The se-
[2] W. Li and K. Kaneko, Europhys. Lett. 17, 655 (1992). quences thus generated are patchy by construction and
[3] C. L. Berthelsen, J. A. Glazier, and M. H. Skolnick, Phys. the patch size distribution displays a characteristic length
Rev. A 45, 8902 (1992). given by N/k In the case of ig. 1(a), we divided the
F.
[4] R. F. Voss, Phys. Rev. Lett. 68, 3805 (1992); 71, 1777 whole sequence (10 bp) into 40 subregions with length
(1993); S. V. Buldyrev, A. Goldberger, S. Havtin, C.-K. ranging from 68 bp to 11679 bp. The characteristic
Peng, F. Sciortino, M. Simons, and H. E. Stanley, ibid. length is 10 /40 = 2500 bp. The width of the bias dis-
71, 1776 (1993). tribution is o = 0.1. As expected, the landscape of the
[5] P. J. Munson. R. C. Taylor, and G. S. Michaels, Nature typical control sequence from this biased random model
(London) 360, 636 (1992). is highly heterogeneous.
[6] G. Bernardi, B. Olofsson, J. Filipski, M. Zerial, J. Sali- [13] To generate long-range correlated variables, we start with
nas, F. Cuny, M. Meunier-Rotival, and F. Rodier, Science random uncorrelated variables, taken from a uniform dis-
228, 953 (1985). tribution. We then form the Fourier transform and multi-
[7] G. A. Churchill, Bull. Math. Biol. 51, 79 (1989). ply by a power in q space. Finally, we Fourier transform
[8] J. W. Fickett, D. C. Torney, and D. R. Wolf, Genomics back to real space and thereby obtain long-range cor-
13, 1056 (1992). related variables with a correlation function given by a
[9] S. Nee, Nature (London) 357, 450 (1992). power law. This method was proposed in S. Havlin, R.
[10] S. Karlin and V. Brendel, Science 259, 677 (1993). B. Selinger, M. Schwartz, H. E. Stanley, and A. Bunde,
[11] Variants of this rule permit studying patchiness arising Phys. Rev. Lett. 61, 1438 (1988).
from other forms of nucleotide heterogeneity. For exam- [14] A more general procedure is to apply the same method
ple, adenine heterogeneity can be studied by forming a except, instead of incrementing the box. position by a
walk with 3 steps up for each adenine and 1 step down distance 8 between each measurement, to increment the
for a nonadenine. position by a distance f/S, where S is a sliding parameter
[12] To be precise, we randomly divide a sequence of length that can be chosen to give optimal statistics.
MOSAIC ORGANIZATION OF DNA NUCLEOTIDES
[15] The precise definition of the average variance on length the error of statistical fluctuations) if we evaluate the
( Q,
scale 8 is Fs (f):—~ y~ (n). exponent from 4 & 4.
[16] Note that if a 1/2, there exist long-range correlations [»1 We have tested the DFA method extensively on DNA
that re6ect an alternation of different nucleotide types, sequences from genomic and cDNA sequences which we
while if o.) 1/2 the long-range correlations reSect a per-
sistence of the same nucleotide type. For the ideal case of
and others had previously studied by other methods, as
well as correlated and uncorrelated control sequences. In
a random sequence, n = 0.5. However, since the length all cases, the DFA method gave results as convincing as
of the sequences is not infinite, we must take into ac- the examples in the text.
count the statistical Suctuations for a finite size sample. [13] S. V. Buldyrev, A. L. Goldberger, S. Havlin, C.-K. Peng,
A discussion of the effects of the finite length of DNA se- M. Simons, and H. E. Stanley, Phys. Rev. E 47, 4514
quences analyzed appears in C. K. Peng, S. V. Buldyrev, (1993). The DFA method has been successfully used to
A. L. Goldberger, S. Havtin, M. Simons, and H. E. Stan- study changes in fractal complexity with evolution in S.
ley, Phys. Rev. E 47, 3730 (1993). To measure the expo- V. Buldyrev, A. L. Goldberger, S. Havlin, C.-K. Peng, H.
nent a, we calculate the slope of log, o Fq(f) versus log, s 8 E. Stanley, and M. Simons, Biophys. 3. 85, 2675 (1993).
)
for E 4, since for very small 8 we expect deviation (in- [19] A. Yu. Grosberg, Y. Rabin, S. Havtin, and A. Nir,
Biofisika (Russia) 28, 1 (1993); Europhys. Lett. 23, 373
troduced by the detrending algorithm) from power-law
behavior. Numerical studies of various types of control (1993).
sequences show that this deviation is negligible (within