Kernel Density Estimation Theory and Application in Discriminant Analysis
Abstract: Nowadays one can find a huge set of methods for estimating the density function of a random variable nonparametrically. Since the first version of the most elementary nonparametric density estimator (the histogram), researchers have produced a vast number of ideas, especially concerning the issue of choosing the bandwidth parameter in a kernel density estimator model. Beyond purely descriptive applications, the model seems quite suitable for use in discriminant analysis, where (multivariate) class densities are the basis for the assignment of a vector to a given class. This article gives insight into the most popular bandwidth parameter selectors as well as into the performance of the kernel density estimator as a classification method, compared to classical linear and quadratic discriminant analysis. Both direct estimation in a multivariate space and an application of the concept to marginal normalizations of the single variables are taken into consideration. The report points out the gap between theory and application.
1 Introduction
Since non-parametric smoothing methods provide an interesting alternative to classical parametric estimation methods, this paper is concerned with the genesis of kernel density estimation itself (for descriptive purposes) as well as with its application in discriminant analysis, where it serves as a competitor, in estimating class densities, to the corresponding model-based discrimination rules, linear and quadratic discriminant analysis (LDA, QDA). The starting point is the first description of the kernel density estimation concept by Rosenblatt (1956),
\hat f_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{h} \right) , \qquad (1)
where K(·) denotes the kernel function (satisfying some appropriate restrictions, e.g. \int K(u)\,du = 1 and K(u) \geq 0 for all u) and h is the so-called bandwidth parameter. Many proposals have been made about how to select the bandwidth appropriately. For the kernel function the most commonly used density is N(\mu, \sigma^2), but several other kernel densities with bounded support are in use as well. Section 2 gives an overview of the improvements in techniques and strategies for optimizing K and h with respect to different optimization criteria during the last 20 to 25 years. Results are available for both the univariate and the multivariate model. Besides, the issue of handling the data in high(er) dimensions, so as to keep the model competitive with LDA and QDA for discriminatory purposes, is treated. Here another optimization problem arises, as well as the so-called curse of dimensionality, which prevents the user from generalizing the concept easily without sufficient data. Finally, in Section 3 the reader will see hard numbers: the theoretical ideas are applied to different types of datasets by means of a detailed simulation study, in which altogether 14 estimators are used on 21 different datasets.
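As a concrete illustration, a minimal implementation of the estimator in (1) with a Gaussian kernel might look as follows (a sketch; the function names and the bandwidth value are mine, not the paper's):

```python
import numpy as np

def kde(x, sample, h):
    """Rosenblatt kernel density estimate (1) with a Gaussian kernel
    and bandwidth h, evaluated at the points in x."""
    u = (np.asarray(x, float)[..., None] - sample) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)  # kernel K = N(0, 1)
    return K.sum(axis=-1) / (len(sample) * h)

rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
grid = np.linspace(-8.0, 8.0, 4001)
fhat = kde(grid, sample, h=0.3)
```

By construction the estimate integrates to one whenever K does, which can be checked numerically on the grid above.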
the expectation

MISE(\hat f_h) = E \int \left\{ \hat f_h(x) - f(x) \right\}^2 dx \qquad (3)
into account. Jones (1991) provides an overview of which is better to use. Again, to circumvent difficulties, a Taylor approximation of (3) (called AMISE) is often used. Marron and Wand (1992) show that those step-by-step adaptations (AMISE instead of MISE, with p = 2 instead of p = 1 in (2) for reasons of easier calculation) can cause essential changes in the resulting estimated model parameters (in particular the optimal bandwidth). One has to be aware that the quality of the fit in the tails of the distribution decreases as p increases, so that p \to \infty stresses a good overall fit and disregards a bad fit in the tails. Besides, one can think of other criteria, such as comparing the number (and location) of the estimated modes to the number (and location) of the modes of the original density (see Park and Turlach, 1992), but in applications no useful calculations can be carried out without knowing the original density. Marron and Tsybakov (1995) go even further and suggest also including a kind of horizontal distance between the curves, for the sake of a more intuitive fit. Nevertheless, the only useful possibility to derive data-driven parameters automatically in applications is to handle a compact formula similar to (2). Focusing now on the AMISE, Wand and Jones (1995) derive the following formula, a decomposition into a bias and a variance term:
AMISE(\hat f_h) = \frac{1}{nh} R(K) + \frac{1}{4} h^4 \mu_2(K)^2 R(f'') , \qquad (4)

where \mu_2(K) = \int z^2 K(z)\, dz denotes the variance of the kernel and R(K) = \int K(x)^2\, dx.
After separating K as in (4), the Epanechnikov kernel (see, e.g., Silverman, 1986) minimizes the AMISE, but the choice of the kernel is not that crucial: even for the triangular and the uniform kernel, fewer than 10% additional data points are necessary to achieve the same AMISE. Small improvements are possible by dropping the restriction K(x) \geq 0 for all x, which leads to higher-order kernels (Wand and Jones, 1995); these can produce values \hat f_h(x) < 0 for some x, which are not adequate for the estimation of class densities in discriminant analysis. The more important bandwidth choice is carried out by minimizing (4) with respect to h, which leads to
h_{AMISE} = \left[ \frac{R(K)}{\mu_2(K)^2 R(f'')\, n} \right]^{1/5} . \qquad (5)
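Plugging a Gaussian kernel and a Gaussian reference density f = N(0, \sigma^2) into (5) can be verified numerically; under these assumptions the formula reproduces the well-known normal-reference bandwidth of roughly 1.06 \sigma n^{-1/5}. The constants are computed here by simple Riemann sums (a sketch, not a production implementation):

```python
import numpy as np

def gauss(x, s=1.0):
    return np.exp(-x**2 / (2.0 * s**2)) / (s * np.sqrt(2.0 * np.pi))

def gauss_dd(x, s=1.0):
    # second derivative of the N(0, s^2) density
    return gauss(x, s) * (x**2 - s**2) / s**4

x = np.linspace(-12.0, 12.0, 240001)
dx = x[1] - x[0]

R_K = np.sum(gauss(x)**2) * dx             # R(K)    = 1/(2 sqrt(pi))
mu2 = np.sum(x**2 * gauss(x)) * dx         # mu_2(K) = 1
sigma, n = 1.0, 1000
R_f2 = np.sum(gauss_dd(x, sigma)**2) * dx  # R(f'')  = 3/(8 sqrt(pi) sigma^5)

h_amise = (R_K / (mu2**2 * R_f2 * n)) ** 0.2
```

With these values, h_AMISE equals (4/(3n))^{1/5} \sigma, i.e. approximately 1.06 \sigma n^{-1/5}.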
Before considering ideas and techniques for a concrete estimation of h_AMISE (the unknown density f has to be substituted), we discuss two further ways of improving the concept. First, it is possible to smooth the density with different bandwidths, depending on a distance d_{j,k} of the data point x_j to its k-th nearest neighbor (Breiman et al., 1977):
\hat f_h(x) = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{h\, d_{j,k}} K\!\left( \frac{x - x_j}{h\, d_{j,k}} \right) , \qquad (6)
where the curve is smoothed more radically in sparse regions, which is useful for highly skewed distributions (see Sheather, 1992, p. 246, for an example of failure of the standard model). The problem of choosing the additional parameter k, which is again more complicated in high dimensions, means that this model is not fully data-driven and therefore not suitable for inexperienced users.

270 Austrian Journal of Statistics, Vol. 33 (2004), No. 3, 267–279

Another concept is the so-called transformed kernel
density estimator. Its aim is to transform the observations into new ones whose densities are easier to estimate (i.e., densities that can be estimated much more effectively for a given number of observations). Wand and Jones (1995) give a parametric concept for a transformation rule, which again requires at least one parameter to be estimated by the inexperienced user, while Ruppert and Cline (1994) provide a calculation-intensive non-parametric approach, which was used in the simulation study described below.
They use the fact that, if F and G are the cumulative distribution functions of the densities f and g, then Y = G^{-1}(F(X)) has density g. In principle any target distribution can be chosen, but the user will typically choose a normal distribution, because normalization is wanted. In application, t(x) = G^{-1}(\hat F_h(x)) is taken, where \hat F_h(x) is a (pilot) kernel estimate of F(x).
The issue of bandwidth selection in the univariate case is discussed in detail in the literature. Many proposals have been made for applying (5) in practice, starting with the use of a Gaussian kernel N(0, \sigma^2) for K and the replacement of f'' in (5) by the corresponding functional of N(0, \sigma^2). This rule of thumb or normal rule was suggested by Silverman (1986) and yields the AMISE-optimal bandwidth
h_{opt} \approx 1.06\, \sigma\, n^{-1/5} ,
which is easy to calculate and therefore often used. Here \sigma can be estimated by the sample standard deviation or by a more robust estimate such as the inter-quartile range R (with a corresponding adaptation of the coefficient). However, this rule often oversmooths the density. Another idea, which is well known in statistics, is the concept of cross-validation.
Bowman (1984) states the optimization problem by giving unbiased estimators for the minimization of the ISE(\hat f_h) itself; in Section 3 the corresponding formula for the multivariate case is given. This estimator is known as least-squares cross-validation. Biased cross-validation is similar, but minimizes MISE(\hat f_h). Both biased and least-squares cross-validation have the drawback of high variance of the estimators and sometimes the occurrence of more than one local minimum. Sheather (1992), and Marron in his discussion of Sheather (1992), give different recommendations as to which one is the best to take. Finally, the likelihood cross-validation method (Silverman, 1986) leads to problems if kernels with bounded support are used.
An interesting set of methods are the plug-in methods (see Sheather and Jones, 1991, or Park and Marron, 1990), which seem to represent the state of the art in the selection of the bandwidth parameter, because many simulation studies identify them as the best, or at least among the best, estimators with respect to an intuitive fit, the variance of the estimator, and the performance in estimating harder-to-estimate densities (Jones et al., 1996; Sheather, 1992; Park and Turlach, 1992; Cao et al., 1994). Common to all plug-in methods is that they include an estimate of the unknown density functional R(f'') in (5), which is obtained by a kernel estimate, but in general with a different kernel and smoothing parameter than those used for estimating f. This distinguishes them from the biased cross-validation concept. Besides, regarding the asymptotic analysis of the bandwidth selectors, the plug-in approach has a faster convergence rate for h than the cross-validation methods.
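By partial integration, R(f'') equals the functional \psi_4 = \int f''''(x) f(x)\,dx, which plug-in methods estimate by a kernel estimate with a pilot bandwidth. The following is a deliberately simplified one-stage sketch with a normal-scale pilot (the actual Sheather-Jones procedure is more refined; all names and the pilot choice are assumptions):

```python
import numpy as np

def phi4(u):
    # fourth derivative of the standard normal density
    return (u**4 - 6.0 * u**2 + 3.0) * np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def plugin_bandwidth(x):
    """One-stage plug-in bandwidth for a Gaussian kernel: estimate
    psi_4 = R(f'') with a pilot bandwidth g, then evaluate (5) with
    R(K) = 1/(2 sqrt(pi)) and mu_2(K) = 1."""
    n = len(x)
    s = x.std(ddof=1)
    g = 1.06 * s * n ** (-1.0 / 7.0)       # pilot bandwidth (normal-scale, assumed)
    u = (x[:, None] - x[None, :]) / g
    psi4 = phi4(u).sum() / (n**2 * g**5)   # kernel estimate of R(f'')
    R_K = 1.0 / (2.0 * np.sqrt(np.pi))
    return (R_K / (psi4 * n)) ** 0.2

rng = np.random.default_rng(1)
h = plugin_bandwidth(rng.normal(size=400))
```

For normal data the result should come out close to the normal-reference bandwidth, roughly 1.06 n^{-1/5} for n = 400.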
T. Ledl 271
\hat f_H(x) = \frac{1}{n} |H|^{-1/2} \sum_{i=1}^{n} K\!\left( H^{-1/2} (x - x_i) \right) ,
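A direct implementation of this multivariate estimator with a Gaussian kernel might read as follows (a sketch; H must be symmetric positive definite, and the spherical kernel only needs the quadratic form (x - x_i)^T H^{-1} (x - x_i)):

```python
import numpy as np

def kde_mv(x, data, H):
    """Multivariate kernel density estimate with kernel K = N(0, I)
    and bandwidth matrix H, evaluated at a single point x."""
    n, d = data.shape
    L = np.linalg.cholesky(np.linalg.inv(H))  # L L^T = H^{-1}
    u = (x[None, :] - data) @ L               # |u_i|^2 = (x-x_i)^T H^{-1} (x-x_i)
    K = np.exp(-0.5 * (u**2).sum(axis=1)) / (2.0 * np.pi) ** (d / 2.0)
    return K.sum() / (n * np.sqrt(np.linalg.det(H)))

rng = np.random.default_rng(2)
data = rng.normal(size=(2000, 2))
val = kde_mv(np.zeros(2), data, 0.25 * np.eye(2))
```

Because the Gaussian kernel is spherical, any matrix square root of H^{-1} gives the same estimate, so a Cholesky factor is a convenient choice.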
An important fact is that, in the discrimination context, the minimization of error rates in classification does not necessarily have the same aim as the minimization of the error criteria discussed in Section 2.1. Since the theoretical misclassification rate is an L1-based measure, a MISE-optimal bandwidth selector weights the fit in the tails of the distributions to a smaller extent. Yet in higher dimensions an increasing part of the data lies in the tails of the distribution. For that reason a good fit in the tails is demanded in classification tasks, and Ripley (1996) underlines that the estimation of differences of log-densities is crucial in classification. Hall and Wand (1988) treat at least the estimation of differences of densities (without the logarithm), although this aspect is ignored in virtually every other source. They use kernel functions taking negative values, and the multivariate generalization cannot be derived easily.
3.1 Preliminaries
The existing results on using kernel density estimation as a classification method go back to the late 1970s and provide a very limited view of the problem. Ness and Simpson (1976), Ness (1980), Habbema et al. (1978), and Remme et al. (1980) essentially treat uncorrelated multivariate normal distributions in high dimensions, in which the groups were separated by one variable only. From the viewpoint of the curse of dimensionality, improvements can then only happen accidentally, and the essential issue of applying the concept to non-normal data is not taken into account. Besides, the parameters were estimated with the likelihood cross-validation method, which is somewhat outdated and often results in oversmoothing. All studies used two classes, and there was no reason to change this in our simulation study, because the basic problem will probably be the same in a setting with more than two classes.
The first effect to study is the behavior of the model when the densities gradually move
away from the normal distribution. In addition, skewed and bimodal distributions are
interesting. Table 1 and Figure 1 show the univariate prototype distributions used and
exhibit different types of deviations from the normal case. Table 2 lists the construction principles for the datasets based on these univariate prototype distributions. Each dataset has dimension 1400 × 10 and consists of two classes, each with 700 observations, 600 for
estimating the class density and 100 for calculating the classification criteria. Table 1 and
Figure 1 show the distributions of class 1. For class 2 the parameters of the exponential
distribution change from λ = 1 to λ = 2 and the normals and bimodals are shifted by 0.5
to the right.
After this linking step, these ten datasets (1-10) have been transformed linearly by ten 10 × 10 matrices, which are the roots of ten self-constructed correlation matrices. Datasets 11-20 have been produced in exactly the same way as datasets 1-10, but population 1 and population 2 have been transformed by different transformation matrices. In Table 2 and Figures 2-5, the datasets having equal covariance matrices have 1 as their last digit, the others 2. For example, the dataset Bi42 consists originally of eight skewed distributions and two bimodals (whose bumps are strongly separated), and the transformation was carried out with unequal transformation matrices for the two groups. The 30 (10 + 2 × 10) correlation matrices have been produced by assuming a common factor in the ten variables, with a regression coefficient whose absolute value is uniformly distributed between 0.3 and 1. Finally, an insurance dataset having the same dimensions as the synthetic data was used (dataset 21).
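The construction principle described above can be sketched roughly as follows; the exact mix of prototype distributions per dataset is given in Table 2, so the composition below (five skewed and five normal variables) is purely illustrative, and all names are mine:

```python
import numpy as np

rng = np.random.default_rng(3)

def make_class(n, lam, shift):
    # illustrative mix of prototypes: five exponential(lam) and five
    # N(shift, 1) variables (the actual mix per dataset follows Table 2)
    skewed = rng.exponential(scale=1.0 / lam, size=(n, 5))
    normal = rng.normal(loc=shift, scale=1.0, size=(n, 5))
    return np.hstack([skewed, normal])

def factor_corr(d):
    # single-factor correlation matrix: corr = b b' + diag(1 - b^2),
    # with |b_j| uniform on (0.3, 1)
    b = rng.uniform(0.3, 1.0, d) * rng.choice([-1.0, 1.0], d)
    R = np.outer(b, b)
    np.fill_diagonal(R, 1.0)
    return R

X1 = make_class(700, lam=1.0, shift=0.0)
X2 = make_class(700, lam=2.0, shift=0.5)
L = np.linalg.cholesky(factor_corr(10))  # a root of the correlation matrix
X1t, X2t = X1 @ L.T, X2 @ L.T            # same root for both: "last digit 1" case
```

Using two different roots for the two populations would yield the unequal-covariance datasets (last digit 2).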
[Figure 1: The univariate prototype distributions of class 1 (the last panel is labeled "Bimodal - far").]
As mentioned above, the simulation study uses 14 estimators. First of all, the LDA and QDA serve as conservative competitors. The multivariate densities were estimated by the multivariate normal rule (Scott, 1992) and by the multivariate least-squares cross-validation (LSCV) selector (Bowman, 1984), which is given by the minimizer of LSCV(H) in (7):
LSCV(H) = \frac{1}{n-1} N(0, 2H) + \frac{n-2}{n(n-1)^2} \sum_{i \neq j} N(x_i - x_j, 2H) - \frac{2}{n(n-1)} \sum_{i \neq j} N(x_i - x_j, H) . \qquad (7)
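Writing N(x, \Sigma) for the N(0, \Sigma) density evaluated at x, a direct O(n^2) implementation of (7) might look as follows (a sketch; in the study a numerical minimization over H would follow):

```python
import numpy as np

def mvn0(x, S):
    # N(0, S) density evaluated at x
    d = len(x)
    q = x @ np.linalg.solve(S, x)
    return np.exp(-0.5 * q) / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(S))

def lscv(H, X):
    n, d = X.shape
    t1 = mvn0(np.zeros(d), 2.0 * H) / (n - 1)
    s2 = s3 = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                diff = X[i] - X[j]
                s2 += mvn0(diff, 2.0 * H)
                s3 += mvn0(diff, H)
    return t1 + (n - 2) / (n * (n - 1) ** 2) * s2 - 2.0 / (n * (n - 1)) * s3

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 2))
scores = [lscv(h2 * np.eye(2), X) for h2 in (0.01, 0.25, 25.0)]
```

Very small and very large bandwidth matrices both score worse than a moderate one, so the minimizer picks an intermediate H.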
For this reason, the original datasets have been projected classwise by a principal component analysis (PCA) onto subspaces of 2 to 5 dimensions, respectively, and the densities were estimated by both selectors, normal rule and LSCV, leading to estimators 3-10. In this way the information loss caused by the PCA and the accuracy loss of the kernel density estimators caused by additional dimensions can be weighed against each other.
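The classwise projection step can be sketched as follows (names are mine; each class obtains its own principal component basis from its training observations):

```python
import numpy as np

def pca_basis(X, k):
    """Return the mean of X and its first k principal directions."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def project(X, mu, Vk):
    return (X - mu) @ Vk.T

rng = np.random.default_rng(5)
X_class1 = rng.normal(size=(600, 10))
mu1, V1 = pca_basis(X_class1, k=3)
Z1 = project(X_class1, mu1, V1)  # 600 x 3 scores used for density estimation
```

The scores are centered and uncorrelated by construction, which suits a subsequent kernel density estimate in the lower-dimensional space.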
The use of marginal normalizations leads to estimators 11-14. They are constructed as described in Subsection 2.1 for each of the ten variables in each dataset. Here the univariate plug-in bandwidth selector of Sheather and Jones (1991) and the normal rule (Silverman, 1986) were applied to normalize the datasets; in a subsequent step the LDA and QDA were used on the transformed ten-dimensional distributions, resulting in the 2 × 2 = 4 last estimation procedures.
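One marginal normalization t(x) = G^{-1}(\hat F_h(x)), with G the standard normal cdf, can be sketched without external dependencies as follows (the bisection inversion of G is my own simplification; the study used the Sheather-Jones and normal-rule bandwidths for \hat F_h):

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def marginal_normalize(x, h):
    """Transform a univariate sample by t(x) = G^{-1}(F_hat(x)), where
    F_hat is a Gaussian-kernel cdf estimate with bandwidth h and G is
    the standard normal cdf (inverted by bisection)."""
    n = len(x)
    F = np.array([np.mean([norm_cdf((xi - xj) / h) for xj in x]) for xi in x])
    F = np.clip(F, 1e-6, 1.0 - 1e-6)  # keep the inverse finite
    lo, hi = np.full(n, -8.0), np.full(n, 8.0)
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        below = np.array([norm_cdf(m) for m in mid]) < F
        lo = np.where(below, mid, lo)
        hi = np.where(below, hi, mid)
    return 0.5 * (lo + hi)

rng = np.random.default_rng(6)
x = rng.exponential(size=300)
y = marginal_normalize(x, h=0.3)
```

The highly skewed exponential sample becomes approximately standard normal, after which LDA or QDA is applied to the transformed variables.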
BS = \frac{2}{n} \sum_{i=1}^{n} \left( \hat p(\text{Group 2} \mid x_i) - c_i \right)^2 ,
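A minimal implementation of this score, assuming c_i is the 0/1 indicator that observation i belongs to group 2 (the usual convention for a two-class Brier score), could be:

```python
import numpy as np

def brier(p_group2, c):
    """Two-class Brier score as above; p_group2 holds the estimated
    posterior probabilities p(Group 2 | x_i) and c is assumed to be
    the 0/1 indicator of membership in group 2."""
    p = np.asarray(p_group2, dtype=float)
    c = np.asarray(c, dtype=float)
    return 2.0 * np.mean((p - c) ** 2)
```

A perfect probabilistic classifier attains BS = 0, while assigning a constant 0.5 to every observation gives BS = 0.5.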
3.5 Results
One of the most important results is that the bandwidth selection procedure was not crucial. Just as the choice of the kernel is of minor importance for descriptive applications, the choice of the bandwidth parameter is of minor importance for discrimination purposes. The Sheather-Jones selector performed similarly to the normal rule in the univariate normalizations, and the LSCV selector resembled the normal rule in the multivariate setting, so using the normal rule, which saves a great deal of computation time, is completely sufficient.
[Figure 2: Misclassification rates of LDA and of the multivariate kernel estimators (among them LSCV(3) and LSCV(4)) for the datasets with equal covariance matrices (NN11, NN21, NN31, SkN11, SkN21, SkN31, Bi11, Bi21, Bi31, Bi41).]
[Figure 3: Misclassification rates of LDA and of LDA after marginal normalization (normal rule and Sheather-Jones selectors) for the datasets with equal covariance matrices (NN11, NN21, NN31, SkN11, SkN21, SkN31, Bi11, Bi21, Bi31, Bi41).]
The better performance of LDA compared to QDA in the case of equal covariance matrices was also evident after marginal normalization by the kernel method, and vice versa in the case of unequal covariance matrices. The most interesting result appeared in the comparison of the two main estimation techniques.
[Figure 4: Misclassification rates of QDA and of the multivariate kernel estimators (among them LSCV(3) and LSCV(4)) for the datasets with unequal covariance matrices (NN12, NN22, NN32, SkN12, SkN22, SkN32, Bi12, Bi22, Bi32, Bi42).]
[Figure 5: Misclassification rates of QDA and of QDA after marginal normalization for the datasets with unequal covariance matrices (NN12, NN22, NN32, SkN12, SkN22, SkN32, Bi12, Bi22, Bi32, Bi42).]
The error rates of the multivariate kernel estimators compared to the classical methods, LDA (in Figure 2, for equal correlation matrices) and QDA (in Figure 4, for unequal correlation matrices), are quite disappointing, and the euphoria of past simulation studies (e.g., Remme et al., 1980) is, from this point of view, not comprehensible. Direct density estimation in 2 to 5 dimensions performed poorly compared to the parametric counterparts (LDA and QDA). Among the non-parametric estimators, the projections to 3 and 4 dimensions performed better than those to 2 and 5, but all in all this trial failed.
The LDA is the best in all cases, and the kernel concepts perform quite badly on datasets for which the assumption of a multivariate normal distribution has to be rejected. This is really interesting, since the (normality-based) LDA should actually lose its advantage on those datasets. Figure 4 does not qualify the kernel-based LSCV method as a superior competitor to QDA in the case of non-equal covariance matrices, either. The univariate normalizations, which are, however, calculation-intensive, led to considerable improvements especially in the case of non-equal covariance matrices (Figure 5). For equal covariance matrices (see Figure 3) the effort of calculating kernel smoothers appears unnecessary, since (possible) improvements seem to happen accidentally.
The classification for the insurance dataset failed for all selectors, since the classes did
not differ much and some of the distributions were highly skewed.
References
Bowman, A. W. (1984). An alternative method of cross-validation for the smoothing of
density estimates. Biometrika, 71, 353-360.
Breiman, L., Meisel, W., and Purcell, E. (1977). Variable kernel estimates of multivariate
densities. Technometrics, 19, 135-144.
Cao, R., Cuevas, A., and Manteiga, W. (1994). A comparative study of several smoothing
methods in density estimation. Computational Statistics and Data Analysis, 17, 153-176.
Devroye, L., and Györfi, L. (1985). Nonparametric Density Estimation: The L1 View.
New York: John Wiley.
Habbema, J. D. F., Hermans, J., and Remme, J. (1978). Variable kernel density estimation
in discriminant analysis. In Proceedings in Computational Statistics (p. 178-185).
Physica Verlag Wien.
Hall, P., and Wand, M. P. (1988). On nonparametric discrimination using density differences. Biometrika, 75, 541-547.
Hand, D. J. (1997). Construction and Assessment of Classification Rules. Chichester: John Wiley & Sons.
Jones, M. C. (1991). The roles of ISE and MISE in density estimation. Statistics & Probability Letters, 12, 51-56.
Jones, M. C., Marron, J. S., and Sheather, S. J. (1996). A brief survey of bandwidth
selection for density estimation. Journal of the American Statistical Association,
91, 401-407.
Marron, J. S., and Tsybakov, A. B. (1995). Visual error criteria for qualitative smoothing.
Journal of the American Statistical Association, 90, 499-507.
Marron, J. S., and Wand, M. P. (1992). Exact mean integrated squared error. The Annals
of Statistics, 20, 712-736.
Ness, J. V. (1980). On the dominance of non-parametric Bayes rule discriminant algo-
rithms in high dimensions. Pattern Recognition, 12, 355-368.
Ness, J. W. V., and Simpson, C. (1976). On the effects of dimension in discriminant
analysis. Technometrics, 18, 175-187.
Author’s address:
Mag. Thomas Ledl
Department of Statistics and Decision Support Systems
University of Vienna
Universitätsstr. 5/9
A-1010 Vienna
Austria
E-mail: thomas.ledl@univie.ac.at
http://www.univie.ac.at/statistics/