
Significance Tests for Patterns in Continuous Data
Richard J. Bolton and David J. Hand
Department of Mathematics, 180 Queens Gate, Imperial College, London SW7 2BZ
{r.bolton, d.j.hand}@ic.ac.uk
Abstract
In this paper we consider the question of uncertainty of detected patterns in data mining. In particular, we develop statistical tests for patterns found in continuous data, indicating the significance of these patterns in terms of the probability that they have occurred by chance. We examine the performance of these tests on patterns detected in several large data sets, including a data set describing the locations of earthquakes in California and another describing flow cytometry measurements on phytoplankton.
Keywords: Data mining, pattern detection, local structure, uncertainty, significance tests
1 Introduction
Data mining is the process of seeking structures in large data sets (Hand et al., 2000; Hand, Mannila and Smyth, 2001). Such structures have been divided usefully into the large and small scale. Large-scale structures, summarising global features of a data set, are often called models; model building has been the focus of statistical investigation over many decades. Small-scale structures, typically referring to structures that are local in the space of measured variables, are called patterns. Pattern detection as an area of formal investigation is a relatively recent development: unlike model building, which could be developed for hand application on small data sets, the search procedures implicit in pattern detection required the advent of the computer. Tools for pattern detection have been developed in parallel in several areas, including association rule analysis, genomic sequence analysis, clickstream analysis, and fraud detection. Such has been the rate of progress in these areas, driven by urgent commercial and scientific needs, that many pattern detection algorithms have been developed.
However, detecting a pattern is but the first step. A vital subsequent question, once a potential local structure has been flagged as of interest, is whether the pattern represents some real underlying aspect of the data generating mechanism, or is merely a feature of chance fluctuation in the data that happen to have been observed. It is this question that is the focus of the present paper. In particular, we describe significance tests for patterns. We do not pretend that our method provides the definitive answer to the question of uncertainty in pattern detection; however, we demonstrate a statistical approach that contributes a practical solution to this problem.
Patterns are variously described in the literature. Broadly speaking, for our purposes a pattern may be defined as a recurring configuration of features describing the objects under investigation. The recurrence may be in different objects (e.g. anomalous sequences of credit card transactions), or may be at different positions in the feature set describing a given object (e.g. a nucleotide sequence in DNA). We adopt a definition that subsumes these and other definitions. We define a pattern as a data vector identifying an anomalously high local density of objects (Hand, Blunt, Kelly and Adams, 2000; Adams, Hand, and Till, 2001). This definition can be applied to most data types, from temporal sequences to static databases, provided some metric measuring the similarity of objects can be defined.
The algorithm that we use to find local peaks of probability density is that described in Adams, Hand, and Till (2001), which we now call the PEAKER algorithm. This algorithm finds estimates of the probability density of the data at each of the objects, and then examines each object in turn to see if it lies at a local peak of the probability density function (pdf). To illustrate, take an arbitrary object P. A local region is defined about P, in terms of the L closest objects to P.
These L closest objects to P are examined to see if any of them has a higher estimated pdf than the estimate at P. If none do, that is, if the pdf estimate at P is larger than that at any of its L nearest neighbors, then object P lies at a local pdf peak or mode. The algorithm requires the specification of the method for estimating the probability density and the choice of L. In fact, of course, any monotonic transformation of the pdf estimate will do: we are not concerned with the absolute values, but only with their relative size. Taking advantage of this invariance, we use the reciprocal of the average Euclidean distance from P to its k nearest neighbors as our estimate. Adams, Hand, and Till (2001) describe some experiments on the choice of k and L. In the examples described below, we set k independently of L, but quite small relative to the size of the data set, as we do not want to miss any structure in low density regions of the data. However, any values other than the ones we use, or indeed any other choice of method of pdf estimation, could be used without affecting our method of significance testing, which does not depend on this choice.
Once we have detected these modes, we wish to know whether they are true patterns, or simply artefacts of random error in the data. That is, we want to attach a measure of statistical significance to a mode such that we can provide evidence that this mode is a pattern. To do this, we apply statistical tests under the null hypothesis that the data are locally uniform and identify departures from this hypothesis that indicate peakedness in the density. This tells us the probability that we would get such a configuration if the data were uniform, the alternative being that the data are peaked and thus contain a pattern. We describe three different test statistics, and discuss features of these tests and their implementation.
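The mode search described above can be sketched in a few lines of Python. This is only an illustrative reading of the description in this section, not the implementation of Adams, Hand, and Till (2001): the function name `peaker`, the brute-force distance computation and the assumption that no two objects coincide are ours.

```python
import numpy as np

def peaker(data, k, L):
    """Return indices of objects lying at local density peaks.

    data : (n, d) array of objects
    k    : number of neighbours used in the density estimate
    L    : number of neighbours defining the local region
    Assumes all objects are distinct (no zero distances between objects).
    """
    data = np.asarray(data, dtype=float)
    # Brute-force pairwise Euclidean distances (adequate for small n only).
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    order = np.argsort(dist, axis=1)  # column 0 is the object itself
    # Density proxy: reciprocal of the mean distance to the k nearest neighbours.
    knn_dist = np.take_along_axis(dist, order[:, 1:k + 1], axis=1)
    density = 1.0 / knn_dist.mean(axis=1)
    # A peak has a larger estimate than every one of its L nearest neighbours.
    neighbours = order[:, 1:L + 1]
    is_peak = density > density[neighbours].max(axis=1)
    return np.flatnonzero(is_peak)
```

For data sets of the size used later in the paper, a tree-based nearest neighbor search would be a more practical choice than the full distance matrix used here.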
2 Distributional Assumptions
PEAKER identifies local modes of probability density. The question we now want to answer is whether such a mode is likely to have arisen if the background distribution (in the region covered by the L nearest objects to the mode) is uniform. We thus define test statistics measuring departure from uniformity and derive their distributions. In particular, since we are interested in alternative hypotheses that favour leptokurtosis (peakedness), our test statistics are measures of leptokurtosis. The result of such a test will be a significance value indicating the probability of observing such an extreme measure of leptokurtosis if the underlying distributions were truly locally uniform. It will be immediately apparent that issues of multiple testing will complicate the conclusions. We discuss these issues below.
Since our basic mode finding algorithm is based on nearest neighbor methods, we also use such concepts to define our test statistics. Nearest neighbor methods have been used for clutter removal in spatial point processes (Byers and Raftery, 1998), but not as yet for assessing the significance of the features that they leave behind. In particular, when defining our tests, we summarize the distances of the L nearest objects to the mode using measures such as means and medians. The distributions of these can be evaluated under the null hypothesis of uniformity, and we can compute the probability of observing values as extreme as those we do observe (in fact, in the case of the mean and median, this translates into the probability of observing values as small as those we do observe). We give the details below.
First, however, it is useful to establish some distributional machinery. Suppose we have a continuous multivariate data set with dimensionality d (i.e. d variables). The volume, $V_R$, of a d-dimensional hypersphere of radius R is given by:
$$V_R = \frac{2\pi^{d/2} R^d}{d\,\Gamma(d/2)}$$
where $\Gamma(\cdot)$ is the Gamma function. We are ultimately interested in the distribution of the distances of objects from the center of this hypersphere when these objects are uniformly distributed within a hypersphere of radius R. However, this distribution is easier to derive if we first define a variable based on volumes. Draw an object from a uniform distribution over the hypersphere of radius R, and let V be the random variable defined as the volume of a hypersphere with the same center as the original hypersphere and radius equal to the distance from the center to the object. The probability density function of V can be written as:
$$f(v) = \frac{d\,\Gamma(d/2)}{2\pi^{d/2} R^d} = \frac{1}{V_R}; \quad 0 \le v \le V_R$$
We can relate the random variable V to the random variable X, defined as the distance from the center of the hypersphere, by the equation:
$$V = \frac{2\pi^{d/2} X^d}{d\,\Gamma(d/2)}$$
And we can write the probability density function of X as follows:
$$f(x) = f(v)\,\frac{dv}{dx} = \frac{d\,\Gamma(d/2)}{2\pi^{d/2} R^d} \cdot \frac{2d\,\pi^{d/2} x^{d-1}}{d\,\Gamma(d/2)} = \frac{d\,x^{d-1}}{R^d}; \quad 0 \le x \le R$$
We can also see that $Y_1 = X/R \sim \mathrm{Beta}(d, 1)$ and $Y_2 = Y_1^d = (X/R)^d \sim U(0,1)$, results that will come in useful when formulating significance tests. Remember that in these formulations R is a constant and not a random variable.
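These two distributional results can be checked empirically. The sketch below is our own illustration, not part of the original analysis: it draws points uniformly from a ball by rejection sampling from the enclosing cube (so that the check does not rely on the density of X derived above) and compares sample moments with those of Beta(d, 1) and U(0, 1). The helper name `uniform_in_ball`, the seed, and the choice d = 3, R = 2 are arbitrary assumptions.

```python
import numpy as np

def uniform_in_ball(n, d, R, rng):
    """Rejection-sample n points uniformly from the d-dimensional ball of radius R."""
    kept, total = [], 0
    while total < n:
        cand = rng.uniform(-R, R, size=(2 * n, d))          # uniform in the cube
        inside = cand[np.linalg.norm(cand, axis=1) <= R]    # keep points in the ball
        kept.append(inside)
        total += len(inside)
    return np.concatenate(kept)[:n]

rng = np.random.default_rng(0)   # arbitrary seed
d, R = 3, 2.0
x = np.linalg.norm(uniform_in_ball(50_000, d, R, rng), axis=1)
y1 = x / R        # should behave like Beta(d, 1), with mean d / (d + 1)
y2 = y1 ** d      # should behave like U(0, 1), with mean 1/2 and variance 1/12

print("mean of Y1:", y1.mean(), "expected:", d / (d + 1))
print("mean and variance of Y2:", y2.mean(), y2.var(), "expected:", 0.5, 1 / 12)
```

Rejection sampling becomes inefficient as d grows, because the ball occupies a vanishing fraction of the cube; it is used here only because it makes the check independent of the result being verified.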
3 Significance Tests
Suppose that we have a sample of n objects within a d-dimensional hypersphere of radius R; that is, we have n random variables, $X_1, \ldots, X_n$, representing the distances of the objects from the center of the hypersphere, from which we obtain the corresponding functions of X, $(Y_1)_1, \ldots, (Y_1)_n$ and $(Y_2)_1, \ldots, (Y_2)_n$, as described in the previous section. We now seek test statistics for departure from uniformity that highlight high density towards the center of the hypersphere.
3.1 Mean of Y1
The mean of $Y_1$ is calculated by the equation $\bar{Y}_1 = \frac{1}{n}\sum_{i=1}^{n} (Y_1)_i$; however, the distribution of $\bar{Y}_1$ is not easily obtained. Finding the distribution of a sum of Beta random variables is a hard problem, as the probability and moment generating functions cannot be characterized in closed form and so cannot be used to derive the distribution of $\bar{Y}_1$. We choose instead to use Central Limit Theorem results to characterize the distribution as approximately normal. Since a Beta(d, 1) variable has mean $d/(d+1)$ and variance $d/((d+2)(d+1)^2)$, we have
$$\bar{Y}_1 \overset{\mathrm{CLT}}{\sim} N\left(\frac{d}{d+1},\ \frac{d}{(d+2)(d+1)^2\,n}\right)$$
and we can test the standardized normal random variable
$$Z = \frac{\bar{Y}_1 - \frac{d}{d+1}}{\sqrt{\frac{d}{(d+2)(d+1)^2\,n}}}$$
against the lower tail probabilities of the standard Normal distribution, as a pattern will have an abundance of shorter distances and thus a small value of $\bar{Y}_1$ associated with it. The probability value (significance level) associated with this test is thus $P(Z \le z) = \Phi(z)$. When we talk of a significance level of α% we mean that, if the data were truly locally uniform, there would be less than an α% chance of observing a value of the statistic at least as extreme as the one observed. Note that the normal approximation appears to hold quite well, even for small values of n and large values of d.
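As a minimal sketch of how this test might be computed, assuming the distances of the n objects from the mode and the radius R are already available (the function name `mean_y1_test` and its arguments are illustrative, not taken from the paper):

```python
import math

def mean_y1_test(distances, R, d):
    """Approximate lower-tail test based on the mean of Y1 = X / R.

    Returns (z, p), where p = Phi(z) under the normal approximation.
    """
    n = len(distances)
    y1_bar = sum(x / R for x in distances) / n
    mu = d / (d + 1)                                        # mean of Beta(d, 1)
    sigma = math.sqrt(d / ((d + 2) * (d + 1) ** 2 * n))     # std. dev. of the mean
    z = (y1_bar - mu) / sigma
    # Phi(z) written with erfc, which keeps precision in the lower tail.
    p = 0.5 * math.erfc(-z / math.sqrt(2.0))
    return z, p
```

Small values of p indicate that the neighbors lie closer to the mode than local uniformity would suggest.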
3.2 Mean of Y2
We can obtain the probability density function of the mean of $Y_2$, $\bar{Y}_2 = \frac{1}{n}\sum_{i=1}^{n} (Y_2)_i$, in exact form (Mood, Graybill and Boes, 1974):
$$f_{\bar{Y}_2}(y) = \sum_{j=0}^{n-1} \frac{n}{(n-1)!}\left[(ny)^{n-1} - \binom{n}{1}(ny-1)^{n-1} + \ldots + (-1)^j \binom{n}{j}(ny-j)^{n-1}\right] I_{(j/n,\,(j+1)/n]}(y)$$
such that
$$F_{\bar{Y}_2}(y) = \sum_{j=0}^{n-1} \frac{1}{n!}\left[(ny)^{n} - \binom{n}{1}(ny-1)^{n} + \ldots + (-1)^j \binom{n}{j}(ny-j)^{n}\right] I_{(j/n,\,(j+1)/n]}(y)$$
Using this distributional result we can calculate lower tail probabilities for $\bar{Y}_2$. As with the mean of $Y_1$, a low value of $\bar{Y}_2$ indicates that the data are concentrated towards the center of the hypersphere, and the significance level associated with this test is $F_{\bar{Y}_2}(y)$.
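The cumulative distribution function above is an alternating sum, which can lose accuracy in floating point for larger n. The sketch below is one way to sidestep this, under our own assumption that exact rational arithmetic is acceptable; the function name `mean_uniform_cdf` is illustrative.

```python
from fractions import Fraction
from math import comb, factorial, floor

def mean_uniform_cdf(y, n):
    """P(mean of n independent U(0,1) variables <= y), via the exact formula.

    For y in (j/n, (j+1)/n] the sum runs over terms 0..j, which is floor(n*y)
    except at the right end point, where the extra term is zero anyway.
    """
    y = Fraction(y)
    if y <= 0:
        return 0.0
    if y >= 1:
        return 1.0
    ny = n * y
    total = sum((-1) ** j * comb(n, j) * (ny - j) ** n
                for j in range(floor(ny) + 1))
    return float(total / factorial(n))
```

For example, `mean_uniform_cdf(0.25, 2)` evaluates $\frac{1}{2!}(2 \times 0.25)^2 = 0.125$.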
3.3 Median of Y2
Standard results show us that if we have a random sample of n objects from a probability density function $f(\cdot)$, with cumulative distribution function $F(\cdot)$, and order statistics $X_{(1)} \le X_{(2)} \le \ldots \le X_{(n)}$, then the distribution of the a-th order statistic, $X_{(a)}$, is given by
$$f_{X_{(a)}}(x) = \frac{n!}{(a-1)!\,(n-a)!}\,[F(x)]^{a-1}\,[1-F(x)]^{n-a}\,f(x)$$
The median value is given by $X_{((n+1)/2)}$ if n is odd, or by $\frac{1}{2}\left(X_{(n/2)} + X_{((n/2)+1)}\right)$ if n is even. It follows that, for odd n, the median M of the random variable $Y_2$ has probability density function
$$f_M(m) = \frac{n!}{\left(\frac{n-1}{2}\right)!\,\left(\frac{n-1}{2}\right)!}\; m^{\frac{n-1}{2}}\,(1-m)^{\frac{n-1}{2}}; \quad 0 \le m \le 1$$
so that, in fact, $M \sim \mathrm{Beta}\left(\frac{n+1}{2}, \frac{n+1}{2}\right)$. For n even, the distributional result is more complicated, since the median is then the average of two (dependent) Beta-distributed order statistics. This distribution can be calculated, but for reasonably large n we suspect that there is very little difference in the forms for n odd and even.
Hence we have an exact distribution for M:
$$M = \mathrm{median}\left(\left(\frac{X_1}{R}\right)^d, \ldots, \left(\frac{X_n}{R}\right)^d\right) = \left(\frac{\mathrm{median}(X_1, \ldots, X_n)}{R}\right)^d \sim \mathrm{Beta}\left(\frac{n+1}{2}, \frac{n+1}{2}\right)$$
since R, d > 0. If there is a region of high density around our object point, then M will be smaller than we would expect if the data were uniformly distributed on the hypersphere. We can compare values of M with the lower tail probabilities of the Beta distribution for appropriate values of n.
Note: We replace n with (L-1) in all of these tests to obtain significance tests for modes detected by PEAKER with the radius, R, of the local region defined as the distance to the Lth nearest neighbor. The null hypothesis is valid for the (L-1) nearest neighbors, given our choice of R.
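For odd n the lower tail probability of the median test can be evaluated without special functions, because the event that the median of n uniform variables is at most m is the event that at least (n+1)/2 of them are at most m, which is a binomial tail. The sketch below uses this identity; the function name `median_y2_test` is illustrative, and the restriction to odd n is an assumption of the sketch (with n = L-1 and L even, n is odd).

```python
from math import comb

def median_y2_test(distances, R, d):
    """Lower-tail test based on the median of Y2 = (X / R)^d, for odd n.

    Returns (m, p), where p = P(M <= m) with M ~ Beta((n+1)/2, (n+1)/2).
    """
    n = len(distances)
    if n % 2 == 0:
        raise ValueError("this sketch covers odd n only")
    y2 = sorted((x / R) ** d for x in distances)
    m = y2[n // 2]                     # sample median for odd n
    h = (n + 1) // 2
    # Beta(h, h) CDF at m, written as a binomial upper tail.
    p = sum(comb(n, i) * m ** i * (1 - m) ** (n - i) for i in range(h, n + 1))
    return m, p
```

As a usage example, three objects all at distance 1 from the mode with R = 2 and d = 2 give m = 0.25 and p = 0.15625.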
4 Edge Effects
Situations in which the ranges of the variables have sharp cut-off values, or which have clusters of objects with sharp edges, can cause difficulties. Our tests are based on the null hypothesis that the modes being tested are at the center of hyperspheres containing objects generated from a uniform distribution. This null hypothesis led us to deduce the form for the distribution of the distance of objects from the central mode. Unfortunately, if the hypersphere is centered near a sharp edge, the distance of objects from its center no longer has the same distribution (the further objects are taken from a distorted distribution). In such situations, if we were to apply the significance tests to every object, there would be artificially inflated values of the k-nearest neighbor distance due to edge effects, something that is commonly seen with nearest neighbor methods in spatial statistics. This means that objects near the boundaries of a cluster or a region will have artificially deflated values of the significance statistics, and these boundary objects will appear more significant than they actually are. We can illustrate this with a data set containing 10,000 objects uniformly and independently distributed over $[-1,1]^5$. Figure 1 shows those objects significant at the 0.05% level after applying the $\bar{Y}_2$ statistic with L = 100 nearest neighbors.
Figure 1. Scatterplot matrix (variables 1 to 5) of objects significant at the 0.05% level
The concentrations of objects at the corners of the plot show that edge effects are particularly evident here. In spatial statistics, there are many approaches to coping with these edge effects. One solution is a toroidal wrap-around of the data such that no object is, in effect, on an edge. This is a practical solution but not so easy to justify in higher dimensions. We also note that an object on the edge of a cluster will display edge effects that will not be removed by a wrap-around of the whole data set. For example, in Figure 2 we have a data set in two dimensions with 10,000 objects uniformly distributed over four unit squares. Figure 3 shows the objects significant at the 1% level after the toroidal wrap-around has removed edge effects.
Figure 2. Full data set (axes x and y, each from 0.0 to 2.5)
Figure 3. Objects significant at 1% level (same axes)
The central edges have artificially deflated values of the Y2 statistic and thus appear more significant, whereas the outside edge effects have been (mostly) removed. This edge effect increases greatly with increasing dimension and increasing numbers of objects. We note that objects that are precisely on an edge or a corner are not necessarily affected by edge effects. In this case, the nearest neighbors all lie in a volume arc of the hypersphere centered at the object in question. Within this arc, the assumptions of the distribution of the radius still hold. However, an object a short distance from the edge will have nearest neighbors in a small radius sphere, but next nearest neighbors only on one side (see Figure 4), thus artificially reducing the mean nearest neighbor distance. This is why one approach to edge correction in spatial statistics is to examine only those objects within a guard area such that edge effects are not encountered.
Figure 4. Illustration of edge effect
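The toroidal wrap-around mentioned in this section amounts to measuring each coordinate difference the short way round a box whose opposite faces are identified. A minimal sketch follows, assuming the data lie in an axis-aligned box with known bounds; the function name `toroidal_distances` and its arguments are ours.

```python
import numpy as np

def toroidal_distances(points, centre, lower, upper):
    """Euclidean distances from centre to each point on a wrapped (toroidal) box.

    lower, upper : per-dimension bounds of the box containing the data.
    """
    points = np.asarray(points, dtype=float)
    span = np.asarray(upper, dtype=float) - np.asarray(lower, dtype=float)
    delta = np.abs(points - np.asarray(centre, dtype=float))
    delta = np.minimum(delta, span - delta)   # take the shorter way round each axis
    return np.sqrt((delta ** 2).sum(axis=1))
```

On the unit square, for instance, the points (0.05, 0.5) and (0.95, 0.5) are at wrapped distance 0.1 rather than 0.9. As noted above, this treats only the outer boundary of the data and does nothing for the edges of clusters inside it.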
5 Correction of Significance Values for Multiple Testing

We have already mentioned the fact that significance values will be affected by multiple testing due to the large number of peaks explored. However, the first correction we must make arises because any objects selected by PEAKER are by definition extreme within their local region; these objects have a higher density than their L nearest neighbors. Consequently, we can say that such an object is more likely to be a pattern (or peak) than these L nearest neighbors, and thus more likely to have more extreme values of the test statistic than these L neighbors. A high local density is implied by small nearest neighbor distances, especially as these tests are functions of the local density. The statistical tests that we have discussed will only provide us with correct significance values if we apply them to a randomly selected object, so we must correct for the extremeness of each mode.
In our tests we are interested in small values of the statistic and so we want the distribution of the minimum of L+1 realizations of the statistic. If $F(\cdot)$ is the CDF of the test statistic in question, then the CDF of the minimum of a sample of (L+1) of these statistics is given by:
$$F_{\min}(\cdot) = 1 - [1 - F(\cdot)]^{L+1}$$
Of course, the underlying assumption in this distributional result is that the objects are independent and identically distributed. Unfortunately, this assumption of independence will not necessarily be true as our tests are all functions of the other L objects in the sample, which will be more dependent the closer an object is to the peak object. $F_{\min}(\cdot)$ is thus a conservative estimate of the CDF. For example, we find that an object is a peak amongst its 50 nearest neighbors and we apply test $\bar{Y}_2$, finding that $F(\bar{Y}_2) = 0.0097$. This would suggest statistical significance at the 1% level; however, taking into account the multiple testing aspect, we have $F_{\min}(\bar{Y}_2) = 0.392$, a far less convincing result!
When we run the PEAKER algorithm for a certain value of L we will often identify more than one peak. We need to take into account all of these peaks when performing statistical tests; that is, we must account for the effects of multiple testing once again. This time, however, the correction depends on the number of peaks that PEAKER identifies. Suppose we identify P peaks; then the adjusted value of the CDF for the test statistic is given by
$$F_{adj}(\cdot) = 1 - [1 - F(\cdot)]^{(L+1)P}$$
For this result we assume that peaks are distributed independently of each other. If we set PEAKER to search through multiple values of L, we encounter more multiple testing problems that are harder to account for, as issues of independence are less clear. The consideration of how this variation of L affects the test statistics is continuing work.
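Both corrections have the same form and can be computed with one small function. The sketch below is illustrative (the name `adjusted_significance` is ours); it uses `log1p` and `expm1` so that very small unadjusted values, which matter most here, are not lost to rounding.

```python
import math

def adjusted_significance(F, L, n_peaks=1):
    """Return 1 - (1 - F)^((L + 1) * n_peaks).

    With n_peaks = 1 this is F_min; with n_peaks = P it is F_adj.
    """
    if F >= 1.0:
        return 1.0
    return -math.expm1((L + 1) * n_peaks * math.log1p(-F))

# Worked example from the text: F = 0.0097 with L = 50 gives about 0.392.
print(round(adjusted_significance(0.0097, 50), 3))
```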
6 Choice of Tests
The significance tests available to us are:
$\bar{Y}_1$, the mean of $(X/R)$,
$\bar{Y}_2$, the mean of $(X/R)^d$,
median($Y_2$), the median of $(X/R)^d$.
The last two tests can be compared with exact distributions, whereas the first only has an approximate distribution, albeit a very accurate one even for a relatively small number of nearest neighbors. We prefer the second and third tests and we apply these to the data sets in the examples below. The combination of the PEAKER algorithm with the statistical tests is referred to from here on as PEAKER+ (PEAKER-Plus).
7 Results
We ran PEAKER+ on two artificial and two real data sets. The first artificial data set, Art1, consists of 2,000 objects generated from a uniform distribution on $[0,1] \times [0,1]$. The second artificial data set, Art2 (Figure 5), consists of Art1 appended with 50 objects from a multivariate normal distribution:
$$\mathrm{MVN}\left(\begin{pmatrix} 0.3 \\ 0.3 \end{pmatrix},\ \begin{pmatrix} 0.025^2 & 0 \\ 0 & 0.025^2 \end{pmatrix}\right)$$
These added objects constitute a peak above the background noise of Art1, although it is not clear from the plot whether the peak is significantly different from the background noise. We set k = 20 and L = 50, 100 and 200 to represent small, medium and large peaks; note that the smallest value of L was set to equal the number of objects in the peak and the medium value was set at twice the number of objects in the peak. We also employed the toroidal wrap-around to remove edge effects.
Figure 5. Art2 data (both axes from 0.0 to 1.0).
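Data sets with the same structure as Art1 and Art2 can be generated as follows. This is a sketch under our own assumptions: the random seed is arbitrary, so the resulting objects are not the ones analysed in this paper and the numbers of peaks and significance values reported below should not be expected to reproduce exactly.

```python
import numpy as np

rng = np.random.default_rng(seed=1)             # arbitrary seed, not from the paper

# Art1: 2,000 objects uniform on the unit square.
art1 = rng.uniform(0.0, 1.0, size=(2000, 2))

# Art2: Art1 plus 50 objects from a bivariate normal centred at (0.3, 0.3)
# with standard deviation 0.025 in each direction and zero covariance.
cov = np.diag([0.025 ** 2, 0.025 ** 2])
cluster = rng.multivariate_normal([0.3, 0.3], cov, size=50)
art2 = np.vstack([art1, cluster])

print(art1.shape, art2.shape)
```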
L      # of Peaks   Min Fadj(mean(Y2))   Min Fadj(median(Y2))
50     29           0.999                0.981
100    15           1.000                1.000
200    11           1.000                1.000
Table 1. Results from running PEAKER+ on Art1.
L      # of Peaks   Min Fadj(mean(Y2))   Min Fadj(median(Y2))
50     28           0.999                0.978
100    14           0.0000376            0.000693
200    9            0.00754              0.285
Table 2. Results from running PEAKER+ on Art2.
The number of peaks found and the minimum value of $F_{adj}$ for the mean and median test statistics for data set Art1 are shown in Table 1. No peaks were identified as being significant at anything more extreme than the 98% level (the smallest adjusted value is 0.981), which is reassuring for a data set consisting entirely of random noise. The same experiment was run on Art2 and the results are shown in Table 2. Similar results to those for Art1 were obtained for L = 50, but the peak is recognized at L = 100 with a significance level of approximately 0.004% for the mean test and 0.07% for the median test. The peak is less significant at the L = 200 level, and the median test suggests that there is not enough evidence here to conclude that the peak is not a feature of random variation in the data. As we include more objects in the test region, so the peak becomes more likely to have arisen through some random effect in the data. The object identified as a peak with L = 100 had co-ordinates (0.300, 0.308) to 3 significant figures, showing that the pattern has been correctly identified. No other objects flagged as peaks in Art2 were found to be significantly different from a uniform distribution under the tests.
The Quakes data set consists of the longitude and latitude measurements of all 2,049 earthquakes with Richter intensity above 2.5 in California, USA recorded between 1962 and 1981. Patterns found here may indicate an unusually high density of earthquakes in a local region, as opposed to earthquakes that occur randomly in the region. Again, we set k = 20 but this time we varied L over smaller values than those we used in the artificial data sets in order to find peaks with smaller support. The number of peaks significant at the 5% level found by PEAKER+ for these parameter values on the Quakes data is shown in Table 3.
L      # of Peaks   # Fadj(mean(Y2)) < 0.05   # Fadj(median(Y2)) < 0.05
25     38           7                         10
50     19           10                        9
100    13           8                         6
Table 3. Results from running PEAKER+ on Quakes data.
The following plots show peaks significant at the 5% level for the $\bar{Y}_2$ statistic, together with their local regions for L = 25 (Figure 6) and L = 50 (Figure 7). The coloured triangles with white centers represent peaks significant at the 5% level. These peaks are surrounded by a coloured circle representing their local region as defined by the Lth nearest neighbor distance; the different sized local regions reflect the densities of objects in different areas. The coloured circles with white centers represent peaks found by PEAKER+ that are not significant at the 5% level; we have little evidence to suggest that these patterns are not simply the result of random fluctuation in the data. The value of L is an indication of the level of support for a pattern, so our choice of L reflects the pattern size if a peak is flagged as being significant.
In Figure 6, many peaks are flagged but only a few are significant. In some cases, peaks that look like they should be patterns are not significant (for example, the upper left peak). They are not significant for this level of L; however, this is not to say that they will not be significant at some other level of L. We see in Figure 7 that the upper left peak is now significant, whereas the three significant peaks on the left-hand side in Figure 6 are not significant for L = 50. By eye, we can sometimes find objects that look like they are not features of random variation, but by applying PEAKER+ we can find peaks and assess their significance for particular values of L, also identifying patterns that the naked eye may miss.
Figure 6. Quakes data set with significant peaks (longitude -123 to -120, latitude 36.0 to 38.5). k = 20, L = 25.
Figure 7. Quakes data set with significant peaks (longitude -123 to -120, latitude 36.0 to 38.5). k = 20, L = 50.
The Crpret1 data set is made up of five (standardized) measurements on each of 10,000 phytoplankton cells through analytical flow cytometry. These measurements are forward light scatter (FSC), 90° light scatter (SSC), depolarized/horizontally polarized light scatter (FL1), orange fluorescence (FL2) and red fluorescence (FL3) (see Collins and Krzanowski (2001) for more details). Patterns here may indicate small groups of phytoplankton with unusual characteristics. We set k = 20 and recorded results for L = 25, 50 and 100, to discover small patterns in the data. Results in Table 4 suggest that there are 2 significant peaks at the 5% level for L = 50 and that either one or both of these peaks are also significant for L = 25, depending on which test we use. No peaks are significant at the larger L = 100 level.
L      # of Peaks   # Fadj(mean(Y2)) < 0.05   # Fadj(median(Y2)) < 0.05
25     56           1                         2
50     35           2                         2
100    21           0                         0
Table 4. Results from running PEAKER+ on Crpret1 data.
Figure 8. Scatterplot matrix of Crpret1 data (variables FSC, SSC, FL1, FL2 and FL3).
A scatterplot matrix (Figure 8) appears to confirm the presence of these two patterns in the data. The dense cluster to the top of the FL3 axis (9909 objects) contains no small significant patterns within it, something that we have no way of knowing simply by looking at the plot.
8 Conclusion
The main parameter of interest in the statistical tests for PEAKER+ is the number of nearest neighbors, L, that define the local region. We have seen that patterns are significant for some values of L but not for others and we realize that it is tempting to vary L until a most significant value is reached. However, we reiterate that this action renders the significance tests meaningless in their current form, as we must once again account for multiple testing to obtain valid significance values. Increasing the number of multiple tests requires accurate tail probability estimates that are increasingly sensitive to error. We are currently investigating an approach that would allow such variation of L.
We observed differences in significance values between results for the mean and median tests over all data sets, especially as L increased. This is due to the different distributions of the statistics. The distribution of median($Y_2$) has a larger variance and heavier tails than the distribution of $\bar{Y}_2$. Estimates of the significance levels for peaks are thus generally more conservative for median($Y_2$) but the estimates themselves have greater variance. However, for large values of L an estimate of the significance value using median($Y_2$) would be more robust to edge effects.
In a setting where we are analysing spatial data, such as the Quakes data above, we may also be interested in patterns over certain geographical areas. A possibility here would be to define the local region by area rather than by support. However, since the density of objects often varies considerably over different areas, the significance tests will consider different numbers of objects for each peak. It is not clear whether we can account for multiple testing in the same way when we consider this inconsistency. If this can be overcome then such an approach can be useful for spatial data. However, we feel that varying the local region according to L, and thus the support of a peak, is more easily transferred to multivariate data and thus has wider application.
Of course, others have also developed statistical tests for patterns though, as we noted in the introduction, the field is a relatively new one. Most other work appears to be based on computing the probability of the occurrence of patterns when the features are categorical. Examples, in a genomic context, are given in Atteson (1998) and Tompa (1999). Our work generalizes this, permitting approximate matches and continuous data.
In summary, PEAKER+ is a data mining algorithm that provides an answer to the most important question we should ask ourselves whenever we discover patterns in data containing any element of uncertainty: are these patterns real?
Acknowledgements
We are grateful to Dr G Tarran of the Plymouth Marine Laboratory, UK, who generated the Crpret1 data used in this paper.
References
Adams, N.M., Hand, D.J., and Till, R.J. (2001) Mining for classes and patterns in behavioural data. To appear in Journal of Operational Research.
Atteson, K. (1998) Calculating the exact probability of language-like patterns in biomolecular sequences. In Proceedings of the 6th International Conference on Intelligent Systems for Molecular Biology, 17-24.
Byers, S. and Raftery, A.E. (1998) Nearest-neighbor clutter removal for estimating features in spatial point processes. Journal of the American Statistical Association, 93(442), 577-584.
Collins, G.S. and Krzanowski, W.J. (2001) Nonparametric discriminant analysis of phytoplankton species using data from analytical flow cytometry. Submitted for publication.
Hand, D.J., Blunt, G., Kelly, M.G., and Adams, N.M. (2000) Data mining for fun and profit (with discussion). Statistical Science, 15(2), 111-126.
Hand, D.J., Mannila, H., and Smyth, P. (2001) Principles of data mining. MIT Press.
Mood, A.M., Graybill, F.A., and Boes, D.C. (1974) Introduction to the theory of statistics (3rd Edition). McGraw-Hill, Singapore.
Tompa, M. (1999) An exact method for finding short motifs in sequences, with application to the ribosome binding site problem. In Proceedings of the 6th International Conference on Intelligent Systems for Molecular Biology, 262-271.
