
On Continuous Optimization Methods in Data Mining
Cluster Analysis, Classification and Regression
Provided for Decision Support and Other Applications

Tatiana Tchemisova, Başak Akteke-Öztürk, Gerhard Wilhelm Weber

Department of Mathematics, University of Aveiro, Portugal
Department of Scientific Computing, Institute of Applied Mathematics, Middle East Technical University, Ankara, Turkey
Departments of Scientific Computing and Financial Mathematics, Institute of Applied Mathematics, Middle East Technical University, Ankara, Turkey; and Faculty of Economics, Business and Law, University of Siegen, Germany
Abstract
Optimization is an area of mathematics that finds an optimum (minimum or maximum) of some function defined on a finite or infinite set. If the functions used in the problem formulation are continuous or piecewise continuous, we obtain a continuous optimization problem. When some practical task is formulated as an optimization problem, success in its solution depends mainly on the quality of the mathematical model and on the choice of an appropriate method for solving the model.
Data mining today can be considered as interdisciplinary research which employs applied mathematics and computational statistics. It treats data obtained in experiments, records, measurements, questionnaires, etc., and aims at modeling, prediction and decision support. In this paper we consider continuous optimization methods for the solution of some specific problems arising in data mining: clustering, classification and regression. The results discussed were recently obtained by the members of the EURO Working Group on Continuous Optimization (EUROPT) and are based on the theory and methods of continuous optimization.
Keywords: Data Mining, Continuous Optimization, Statistical Learning, Computational Statistics, Decision Support
1 Introduction
Generally, an optimization problem consists in the maximization or minimization of some function (the objective function) $f: S \to \mathbb{R}$ on a set $S \subseteq \mathbb{R}^n$. The feasible set $S$ can be either finite or infinite, and can be described with the help of a finite or infinite number of equalities and inequalities or in the form of some topological structure in $\mathbb{R}^n$. In case global maxima or minima (global extrema) are looked for, we have a global optimization problem; otherwise the problem is one of local optimization. When the function $f$ is continuous or piecewise continuous and the set $S$ is described with the help of functions (constraint functions) that have the same continuity property, we obtain a continuous optimization problem. The methods for the solution of a given optimization problem depend mainly on the properties of the objective function and the feasible set. Thus, when we are looking for extrema of a linear function on some polyhedral set, the methods of linear programming can be applied; when $f$ is a convex function and $S$ is a convex set, we apply methods of convex programming; if the feasible set $S$ is described with the help of an infinite number of equalities or inequalities, the methods of semi-infinite programming should be used, etc.
Today data mining can be considered as interdisciplinary research which employs applied mathematics and computational statistics. It treats databases obtained from measurements, from technical, social or financial records, etc. It is important to understand and analyze these data sets to make them beneficial, e.g., for the support of decisions in all areas of science, high tech, economy, development and daily life. In fact, data mining can be regarded as a mathematics-based preprocessing module within a decision support system. Indeed, decisions can be settled on each of the clusters or classes separately, before being unified later on; the model underlying the decision can be approximated appropriately and, by this, decision making is significantly facilitated.
In this paper, we discuss how the methods of continuous optimization can be used in three areas of data mining, namely, in clustering, classification and regression, all three considered as interrelated [25]. The results discussed were recently obtained by the members of EUROPT and are based on the theory and methods of continuous optimization.
2 Cluster analysis
Cluster analysis has many applications, including decision making and machine learning, information retrieval and medicine, image segmentation and pattern classification, etc. Furthermore, cluster analysis can be used as a stand-alone tool to gain insight into the distribution of data, to observe the characteristics of each cluster, and to focus on a particular set of clusters for further analysis. Alternatively, it may serve as a preprocessing step for other algorithms, such as classification and characterization, which then operate on the detected clusters [11]. Clustering problems have been studied by many authors and different algorithms have been developed for their solution [16, 17, 11, 12].
2.1 Clustering problems and clustering algorithms
Assume that we have been given a finite set $X$ of points in the $n$-dimensional space $\mathbb{R}^n$: $X = \{x^1, x^2, \ldots, x^M\}$, where $x^i \in \mathbb{R}^n$, $i = 1, 2, \ldots, M$.
Let us call the elements of the set $X$ patterns. The aim of cluster analysis is to partition $X$ into a finite number of clusters based on similarity between patterns. As a measure of similarity we can use any distance function; here, for the sake of simplicity, we consider the Euclidean distance $\| \cdot \|_2$. Given a number $q \in \mathbb{N}$, we are looking for $q$ subsets $C^i$, $i = 1, 2, \ldots, q$, such that the average distance between the elements in each subset is minimal and the following conditions are satisfied:
1. $C^i \neq \emptyset$, $i = 1, 2, \ldots, q$,
2. $X = \bigcup_{i=1}^{q} C^i$.
The sets $C^i$, $i = 1, 2, \ldots, q$, introduced above are called clusters, and the problem of determining the clusters is the clustering problem. When the clusters can overlap, the clustering problem is fuzzy. If we additionally require
3. $C^i \cap C^j = \emptyset$ if $i \neq j$; $i, j = 1, 2, \ldots, q$,
then we obtain a hard clustering problem. Let us assume that each cluster $C^i$, $i = 1, 2, \ldots, q$, can be identified by its center or centroid, defined as [5] $c^i := \frac{1}{|C^i|} \sum_{x \in C^i} x$, where $|C^i|$ denotes the cardinality of the cluster $C^i$. Then the clustering problem can be reduced to the following optimization problem, which is known as the minimum sum-of-squares clustering problem [6]:
$$\min \ \frac{1}{M} \sum_{i=1}^{q} \sum_{x \in C^i} \|c^i - x\|_2^2, \quad \text{such that } C = \{C^1, C^2, \ldots, C^q\} \in \bar{C}. \qquad (2.1)$$
Here, $\bar{C}$ is the set of all possible $q$-partitions of the set $X$, and $c = (c^1, c^2, \ldots, c^q) \in \mathbb{R}^{n \times q}$.
According to [17], clustering algorithms differ from each other depending on:
- the input data representation, e.g., pattern matrix or similarity matrix, or the data type, e.g., numerical, categorical, or special data structures such as rank data, strings, graphs, etc.;
- the output representation, e.g., a partition or a hierarchy of partitions;
- the optimization method chosen for the solution of the optimization model;
- the clustering direction, e.g., agglomerative or divisive.
In general, clustering algorithms are either partitioning or hierarchical. A partitioning method assumes that the number $q$ of clusters to be found is given and looks for the optimal partition based on the objective function. If the number $q$ also has to be determined during the run of the algorithm, a hierarchical algorithm can be applied. These algorithms partition the data set in a nested manner into clusters which are either disjoint or included one into another.
There exist two approaches to constructing hierarchical algorithms, the agglomerative and the divisive approach. An agglomerative method starts with an elementary partition in which each pattern forms a distinct (singleton) cluster and successively merges clusters together until a stopping criterion is satisfied. A divisive method begins with a single cluster formed by the whole set $X$ and performs splitting until a stopping criterion is fulfilled.
The main requirements that a clustering algorithm should satisfy are as follows: scalability, ability to deal with different types of attributes, discovery of clusters with arbitrary shape, minimal requirements for domain knowledge to determine input parameters, ability to deal with noise in the data, insensitivity to the order of input records, ability to handle high dimensionality, interpretability and usability.
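As an illustration of the agglomerative direction described above, the following minimal Python sketch builds a nested hierarchy with SciPy and then cuts it into $q = 2$ clusters. The sample data, the Ward merging criterion and the cut level are assumptions made only for this example; they are not taken from the paper.

```python
# A minimal sketch of agglomerative hierarchical clustering with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),      # pattern matrix: 40 points in R^2
               rng.normal(2, 0.3, (20, 2))])

Z = linkage(X, method="ward")                    # agglomerative merging (Ward criterion)
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the hierarchy into q = 2 clusters
print(labels)
```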
2.2 Optimization approach to clustering
The clustering problem (2.1) can be rewritten as a single minimization problem as follows:
$$\begin{aligned}
\min \ & \frac{1}{M} \sum_{j=1}^{M} \sum_{i=1}^{q} w_{ij} \, \|x^j - c^i\|_2^2, \\
\text{such that } & w_{ij} \in \{0, 1\}, \quad i = 1, 2, \ldots, q, \ j = 1, 2, \ldots, M, \\
& \sum_{i=1}^{q} w_{ij} = 1, \quad j = 1, 2, \ldots, M.
\end{aligned} \qquad (2.2)$$
The problem obtained is a mixed-integer programming problem. The centers of the clusters can be written as $c^i := \big(\sum_{j=1}^{M} w_{ij} x^j\big) \big/ \big(\sum_{j=1}^{M} w_{ij}\big)$. Here, $w_{ij}$ is the association weight of the pattern $x^j$ with cluster $i$, given by
$$w_{ij} = \begin{cases} 1, & \text{if pattern } j \text{ is allocated to cluster } i; \ i = 1, 2, \ldots, q; \ j = 1, 2, \ldots, M, \\ 0, & \text{otherwise.} \end{cases}$$
It can be shown that (2.2) is a global optimization problem with possibly many local minima [5]. In general, solving a global optimization problem is a difficult task. This makes it necessary to develop clustering algorithms which compute local minimizers of problem (2.2) separately. In [5, 2, 3], optimization techniques based on a nonsmooth optimization approach are suggested.
Finally, note that the clustering problems (2.1) and (2.2) can be reformulated as an unconstrained nonsmooth and nonconvex problem
$$\min f(c^1, c^2, \ldots, c^q), \quad (c^1, c^2, \ldots, c^q) \in \mathbb{R}^{n \times q}, \qquad (2.3)$$
where $f(c^1, c^2, \ldots, c^q) = \frac{1}{M} \sum_{i=1}^{M} \min_{j=1,2,\ldots,q} \|c^j - x^i\|_2^2$. Since the function $\varphi(y) = \|y - c\|_2^2$, $y \in \mathbb{R}^n$, is separable (as a sum of squares), the function $\psi(x^i) = \min_{j=1,2,\ldots,q} \|c^j - x^i\|_2^2$ is piecewise separable. It is proved in [4] that the function $f(c^1, c^2, \ldots, c^q)$ is piecewise separable as well. The special separable structure of this problem, together with its nonsmoothness, allows a corresponding analysis and specific numerical methods related to derivative-free optimization.
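A direct way to see the nonsmooth objective of (2.3) at work is to evaluate it numerically. The short NumPy sketch below does this for a given set of centers; the data and the centers are illustrative assumptions, and the code is not the nonsmooth optimization algorithm of [4, 5].

```python
import numpy as np

def clustering_objective(centers, X):
    """Nonsmooth objective of (2.3): mean over patterns of the squared
    Euclidean distance to the nearest cluster center."""
    # centers: (q, n) array, X: (M, n) array of patterns
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (M, q) squared distances
    return d2.min(axis=1).mean()                                   # min over centers, mean over patterns

# illustrative data: M = 6 patterns in R^2, q = 2 candidate centers
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [3.0, 3.0], [3.1, 2.9], [2.9, 3.2]])
centers = np.array([[0.1, 0.1], [3.0, 3.0]])
print(clustering_objective(centers, X))
```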
2.3 Cluster stability using minimal spanning trees
Cluster analysis is a popular area of data mining and of Operations Research; however, there are many open questions which await theoretical and practical treatment. Estimation of the appropriate number of clusters is a fundamental problem in cluster analysis. Many approaches to this problem exploit the within-cluster dispersion matrix (defined according to the pattern of a covariance matrix). The span of this matrix (its column space) usually decreases as the number of groups rises and may have a point at which it falls; such an "elbow" in the graph locates, in several known methods, the true number of clusters. Stability-based approaches to the cluster validation problem evaluate the variability of the partitions under repeated applications of a clustering algorithm on samples. Low variability is understood as high consistency of the results obtained, and the number of clusters that minimizes this variability (i.e., maximizes cluster stability) is accepted as an estimate of the true number of clusters.
In [24], a statistical method for the study of cluster stability is proposed. This method assesses the geometrical stability of a partition by drawing samples from the partition and estimating the clusters by means of each of the drawn samples. A pair of partitions is considered to be consistent if the obtained divisions match. The matching is measured via a minimal spanning tree (MST) constructed for each of the clusters, and the number of edges connecting points from different samples is calculated. Recall that a tree which reaches all vertices of a graph is called a spanning tree; among all the spanning trees of a weighted and connected graph, the one (or possibly more than one) with the least total weight is called a minimum spanning tree. MSTs are important for several reasons: they can be computed quickly and easily (by Prim's, Kruskal's or Dijkstra's algorithms, for example), they create a sparse subgraph which reflects some essence of the given graph, and they provide a way to identify clusters in point sets.
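To illustrate how an MST can be computed in practice, one can rely on SciPy's compressed-graph routines. This is only a generic sketch, not the stability procedure of [24]; the random point set is an assumption made for the example.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(1)
points = rng.random((10, 2))                 # a small point set in the plane
W = squareform(pdist(points))                # dense matrix of pairwise Euclidean distances
mst = minimum_spanning_tree(W)               # sparse matrix holding the MST edges
edges = np.transpose(mst.nonzero())          # edge list of the tree (n - 1 edges)
print(edges, mst.sum())                      # edges and total weight of the tree
```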
3 Classification and regression in statistical learning
Machine learning is the process of extracting rules and patterns from huge data sets of given training points in order to make predictions on new data samples. Common machine learning paradigms include supervised learning, unsupervised learning, semi-supervised learning, etc. Supervised learning is a method of learning where the examples in the training sample are labeled or classified, e.g., in the binary case by labels $\pm 1$, and the test points are classified with a rule derived from the learning stage. The input/output pairings reflect the functional relationship between input and output. Such a function, if it exists, is called a target or decision function. If the learning problem has binary outputs, then it is referred to as binary classification; when the outputs are real-valued, the problem is called regression [7].
3.1 Classification
Binary classification is frequently performed using linear classification methods. Let $f: X \to \mathbb{R}$ be a function defined on $X \subseteq \mathbb{R}^n$. Then $x \in X$, $x = (x_1, x_2, \ldots, x_n)^T$, is assigned to the positive class if $f(x) \geq 0$ and to the negative class otherwise. Here, $n$ is the dimension of the input space. Suppose that $f$ is affinely linear, i.e., it can be expressed in the form $f(x) = \langle \omega, x \rangle + b$, where $\omega = (\omega_1, \omega_2, \ldots, \omega_n)^T$ is a weight vector and $b$ is a so-called bias. In linear binary classification, the two classes are discriminated by a hyperplane defined by $\langle \omega, x \rangle + b = 0$. Consider examples $(x^i, y^i)$, $i = 1, 2, \ldots, l$, which may be the outcome of a preprocessing by clustering and correspond to a cluster in the input space, and a hyperplane $f(x) = 0$. If the functional margin $\gamma_i := y^i (\langle \omega, x^i \rangle + b)$ satisfies $\gamma_i > 0$, then the correct classification of $x^i$ is achieved. The distance between each of the points nearest to the hyperplane and the hyperplane itself is called the geometric margin.
For real-world problems, the data in the input space may not be linearly classifiable. Then, nonlinear classifiers are needed. Kernel methods use an approach which increases the flexibility of linear functions by applying a nonlinear mapping $\phi$ from the input space into a higher-dimensional vector space called the feature space. In these methods, the function $f$ has the form $f(x) = \langle \omega, \phi(x) \rangle + b$ and, in general, it is nonlinear.
Using optimality conditions and duality, the vector $\omega$ is represented here in the form $\omega = \sum_{i=1}^{l} \alpha_i y^i \phi(x^i)$. Substituting $\omega$ into $f(x)$, we get
$$f(x) = \sum_{i=1}^{l} \alpha_i y^i \langle \phi(x^i), \phi(x) \rangle + b = \sum_{i=1}^{l} \alpha_i y^i k(x^i, x) + b.$$
The value $k(x, z) := \langle \phi(x), \phi(z) \rangle$ is called the kernel. There are different direct methods for computing kernels. For example, the quadratic kernel is defined as $k(x, z) := (1 + \langle x, z \rangle)^2$; the Gaussian kernel looks as follows: $k(x, z) := \exp(-\|x - z\|_2^2 / \sigma^2)$.
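The kernels above are simple to evaluate directly. The following NumPy sketch computes the quadratic and Gaussian kernel matrices for a small data set; the data and the value of $\sigma$ are assumptions made for illustration.

```python
import numpy as np

def quadratic_kernel(X, Z):
    """k(x, z) = (1 + <x, z>)^2, evaluated for all pairs of rows of X and Z."""
    return (1.0 + X @ Z.T) ** 2

def gaussian_kernel(X, Z, sigma=1.0):
    """k(x, z) = exp(-||x - z||_2^2 / sigma^2), evaluated for all pairs."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma ** 2)

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
print(quadratic_kernel(X, X))
print(gaussian_kernel(X, X, sigma=0.5))
```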
One of the key ideas of Support Vector Machines (SVMs) is the introduction of a so-called feature space enabling the separation of classes which cannot be separated linearly in the input space. SVMs choose the linear classifier that maximizes the geometric margin of the training data. Since a re-scaling of $(\omega, b)$ preserves the classification, we can enforce $y^i (\langle \omega, x^i \rangle + b) \geq 1$. Then the functional margin is $1$ and the geometric margin is equal to $\frac{1}{\|\omega\|_2}$. Hence, to maximize the margin, it is necessary to minimize $\|\omega\|_2^2$. Thus we can formulate the following convex optimization problem that finds the optimal classifier:
$$\min_{\omega, b} \ \langle \omega, \omega \rangle, \quad \text{such that } y^i (\langle \omega, x^i \rangle + b) \geq 1, \ i = 1, 2, \ldots, l.$$
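As a hedged illustration of this primal problem (not code from the paper), a generic convex-optimization modeling tool such as CVXPY can solve small instances directly; the toy data below are linearly separable by construction.

```python
import numpy as np
import cvxpy as cp

# toy, linearly separable data with labels y_i in {-1, +1}
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -2.0], [-1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
constraints = [cp.multiply(y, X @ w + b) >= 1]          # y_i(<w, x_i> + b) >= 1
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
problem.solve()
print(w.value, b.value, 1.0 / np.linalg.norm(w.value))  # weights, bias, geometric margin
```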
Using the duality theory of optimization, we can instead solve the following dual problem:
$$\max_{\alpha} \ \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y^i y^j \alpha_i \alpha_j \, k(x^i, x^j),$$
$$\text{such that } \sum_{i=1}^{l} y^i \alpha_i = 0; \quad \alpha_i \geq 0, \ i = 1, 2, \ldots, l.$$
In practice, to solve complex classification problems, it is not enough to apply strictly perfect marginal classifiers without any error term, since they will not be applicable to real data. Therefore, some additional variables are introduced to allow the maximal margin criterion to be violated. Such classifiers are called soft margin classifiers. Let us consider the following problem:
$$\min_{\omega, \xi, b} \ \|\omega\|_2^2 + C_+ \sum_{i: y^i = +1} \xi_i + C_- \sum_{i: y^i = -1} \xi_i,$$
$$\text{such that } y^i (\langle \omega, x^i \rangle + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \ i = 1, 2, \ldots, l.$$
Here, $\xi_i$, $i = 1, 2, \ldots, l$, are so-called slack variables, $C_+$ is the regularization constant for positive examples, and $C_-$ is the regularization constant for negative examples. The use of different regularization parameters for positive and negative examples becomes important when there is a significant imbalance between the number of positive and negative training examples.
The dual problem of the latter optimization problem determines the classifier called the SVM:
$$\max_{\alpha} \ \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y^i y^j \alpha_i \alpha_j \, k(x^i, x^j),$$
$$\text{such that } \sum_{i=1}^{l} y^i \alpha_i = 0,$$
$$0 \leq \alpha_i \leq C_+ \text{ if } y^i = +1, \quad \text{and} \quad 0 \leq \alpha_i \leq C_- \text{ if } y^i = -1, \ i = 1, 2, \ldots, l.$$
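In practice one rarely solves this dual by hand; libraries such as scikit-learn implement it, and class-dependent penalties in the spirit of $C_+$ and $C_-$ can be obtained through per-class weights. The sketch below is a generic illustration under these assumptions, not the experiments of the paper; the imbalanced toy data are invented for the example.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# imbalanced toy data: many negative, few positive examples
X_neg = rng.normal(-1.0, 1.0, (80, 2))
X_pos = rng.normal(+1.5, 1.0, (20, 2))
X = np.vstack([X_neg, X_pos])
y = np.array([-1] * 80 + [+1] * 20)

# class_weight scales the penalty C per class, mimicking C_- and C_+
clf = SVC(kernel="rbf", C=1.0, class_weight={-1: 1.0, +1: 4.0})
clf.fit(X, y)
print(clf.predict([[1.0, 1.0], [-2.0, -2.0]]))
```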
3.1.1 Max-min separability
The problems of supervised data classification arise in many areas, including management science, medicine, chemistry, etc. The aim of supervised data classification is to establish rules for the classification of some observations, assuming that the classes of the data are known. To find these rules, known training subsets of the given classes are used. The problem of supervised data classification can be reduced to a number of set separation problems: for each class, the training points belonging to it have to be separated from the other training points using a certain, not necessarily linear, function. This problem is formulated in [4] as a nonsmooth optimization problem with a max-min objective function.
Let $A$ and $B$ be given disjoint sets containing $m$ and $p$ vectors from $\mathbb{R}^n$, respectively:
$$A = \{a^1, \ldots, a^m\}, \ a^i \in \mathbb{R}^n, \ i = 1, \ldots, m; \quad B = \{b^1, \ldots, b^p\}, \ b^j \in \mathbb{R}^n, \ j = 1, \ldots, p; \quad A \cap B = \emptyset.$$
Let $H = \{h^1, \ldots, h^l\}$ be a finite set of hyperplanes, where $h^j$ is given by $\langle x^j, z \rangle - y_j = 0$, $j = 1, 2, \ldots, l$, with $x^j \in \mathbb{R}^n$, $y_j \in \mathbb{R}$ ($\langle \cdot, \cdot \rangle$ denoting the scalar product). Let $J = \{1, 2, \ldots, l\}$. Consider any partition of $J$ in the form $J^r = \{J_1, \ldots, J_r\}$, where
$$J_k \neq \emptyset, \ k = 1, \ldots, r; \quad J_k \cap J_s = \emptyset \ \text{if } k \neq s; \quad \bigcup_{k=1}^{r} J_k = J.$$
Let $I = \{1, \ldots, r\}$. A particular partition $J^r$ of the set $J$ defines the following max-min type function:
$$\varphi(z) = \max_{i \in I} \min_{j \in J_i} \big( \langle x^j, z \rangle - y_j \big), \quad z \in \mathbb{R}^n. \qquad (3.1)$$
We say that the sets $A$ and $B$ are max-min separable if there exist a finite number of hyperplanes $H = \{h^1, \ldots, h^l\}$ and a partition $J^r = \{J_1, \ldots, J_r\}$ of the set $J$ such that
1. for all $i \in I$ and $a \in A$ we have $\min_{j \in J_i} \big( \langle x^j, a \rangle - y_j \big) < 0$;
2. for any $b \in B$ there exists at least one $i \in I$ such that $\min_{j \in J_i} \big( \langle x^j, b \rangle - y_j \big) > 0$.
It follows from the definition above that if the sets $A$ and $B$ are max-min separable, then $\varphi(a) < 0$ for any $a \in A$ and $\varphi(b) > 0$ for any $b \in B$, where the function $\varphi$ is defined by (3.1). Thus the sets $A$ and $B$ can be separated by a function represented as a max-min of linear functions. Therefore this kind of separability is called max-min separability.
The problem of max-min separability is reduced to the following optimization problem:
$$\min f(x, y) \ \text{ such that } (x, y) \in \mathbb{R}^{l \times n} \times \mathbb{R}^l, \qquad (3.2)$$
where the objective function $f$ has the form $f(x, y) = f_1(x, y) + f_2(x, y)$ with
$$f_1(x, y) = \frac{1}{m} \sum_{k=1}^{m} \max\Big[0, \ \max_{i \in I} \min_{j \in J_i} \big( \langle x^j, a^k \rangle - y_j + 1 \big)\Big],$$
$$f_2(x, y) = \frac{1}{p} \sum_{s=1}^{p} \max\Big[0, \ \min_{i \in I} \max_{j \in J_i} \big( -\langle x^j, b^s \rangle + y_j + 1 \big)\Big].$$
The functions $f_1$ and $f_2$ are piecewise linear; therefore the resulting function $f$ is piecewise linear and, consequently, piecewise separable. In [4] it is shown that even in very simple cases functions of this type may not be regular, and therefore the calculation of their subgradients is quite difficult.
A derivative-free algorithm for the minimization of max-min type functions is proposed in [4]. This algorithm is a modification of the discrete gradient method. The results of numerical experiments demonstrate that the algorithm is efficient for solving large-scale problems with up to 2000 variables.
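To make the structure of $f_1$ and $f_2$ concrete, the following NumPy sketch evaluates the error function $f = f_1 + f_2$ for a given set of hyperplanes and a given partition. The tiny data set and the single-block partition are assumptions made for illustration; the sketch is only an objective evaluation, not the discrete gradient algorithm of [4].

```python
import numpy as np

def maxmin_error(Xh, yh, partition, A, B):
    """f(x, y) = f1 + f2 from (3.2): averaged hinge-type errors of the
    max-min classifier given by hyperplanes <x_j, z> - y_j and a partition of J."""
    # Xh: (l, n) hyperplane normals, yh: (l,) offsets, partition: list of index lists
    g_A = A @ Xh.T - yh                                                     # <x_j, a_k> - y_j
    g_B = B @ Xh.T - yh                                                     # <x_j, b_s> - y_j
    phiA = np.max([g_A[:, J].min(axis=1) for J in partition], axis=0)       # max_i min_j
    phiB = np.min([(-g_B)[:, J].max(axis=1) for J in partition], axis=0)    # min_i max_j of -(...)
    f1 = np.maximum(0.0, phiA + 1.0).mean()
    f2 = np.maximum(0.0, phiB + 1.0).mean()
    return f1 + f2

A = np.array([[0.0, 0.0], [0.2, 0.1]])     # points of class A
B = np.array([[2.0, 2.0], [2.2, 1.8]])     # points of class B
Xh = np.array([[1.0, 1.0]])                # one hyperplane: <(1, 1), z> - 2 = 0
yh = np.array([2.0])
print(maxmin_error(Xh, yh, [[0]], A, B))   # 0.0 for this separable toy example
```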
3.2 Regression with elementary splines
Multivariate adaptive regression splines (MARS) is a method to estimate general functions of high-dimensional arguments given sparse data [9]; it allows emerging applications in many areas of science, economy and technology. MARS is an adaptive procedure because the selection of basis functions is data-based and specific to the problem at hand. It can also be interpreted as a continuous (however, still nonsmooth) version of the famous classification tool CART which, basically, yields classes in the form of parallelepipeds and uses discontinuous indicator functions for this. MARS is a nonparametric regression procedure that makes no specific assumption about the underlying functional relationship between the dependent and independent variables [8]. It has the special ability to estimate the contributions of the basis functions so that both the additive and the interactive effects of the predictors are allowed to determine the response variable. MARS is based on a modified recursive partitioning methodology, and its underlying model uses expansions in piecewise linear univariate basis functions (in any of the separate input variables) of the form
$$c^+(x, \tau) = [+(x - \tau)]_+, \quad c^-(x, \tau) = [-(x - \tau)]_+,$$
where $[q]_+ := \max\{0, q\}$ and $\tau$ is a univariate knot. Each such pair of functions, called a reflected pair, is piecewise linear, with a knot at the value $\tau$. Let us consider the following general model of the relation between input and response:
$$Y = f(X) + \varepsilon,$$
where $Y$ is a response variable, $X = (X_1, X_2, \ldots, X_p)^T$ is a vector of predictors and $\varepsilon$ is an additive stochastic component which is assumed to have zero mean and finite variance. The goal is to construct reflected pairs for each input $X_j$ $(j = 1, 2, \ldots, p)$ with $p$-dimensional knots $\tau_i = (\tau_{i,1}, \tau_{i,2}, \ldots, \tau_{i,p})^T$ at, or just nearby, the input data vectors $\tilde{x}_i = (\tilde{x}_{i,1}, \tilde{x}_{i,2}, \ldots, \tilde{x}_{i,p})^T$ of that input. After these preparations, our set of basis functions is
$$\wp := \big\{ (X_j - \tau)_+, \ (\tau - X_j)_+ \ \big| \ \tau \in \{\tilde{x}_{1,j}, \tilde{x}_{2,j}, \ldots, \tilde{x}_{N,j}\}, \ j = 1, 2, \ldots, p \big\}.$$
Thus, we can represent $f$ by a linear combination which is successively built up from this set, with the intercept $\theta_0$, in the form
$$Y = \theta_0 + \sum_{m=1}^{M} \theta_m \psi_m(X) + \varepsilon,$$
where the $\psi_m$ are basis functions from $\wp$ or products of two or more such functions, taken from a set of $M$ linearly independent basis elements, and $\theta_m$ is the unknown coefficient for the $m$th basis function or for the constant $1$ $(m = 0)$. A set of eligible knots $\tau_{i,j}$ is assigned separately for each input variable dimension and is chosen to approximately coincide with the input levels represented in the data. Interaction basis functions are created by multiplying an existing basis function with a truncated linear function involving a new variable. In this case, both the existing basis function and the newly created interaction basis function are used in the MARS approximation. Given the observations represented by the data $(\tilde{x}_i, y_i)$, $i = 1, 2, \ldots, N$, the form of the $m$th basis function is as follows:
$$\psi_m(x) = \prod_{j=1}^{K_m} \big[ s_{\kappa_j^m} \big( x_{\kappa_j^m} - \tau_{\kappa_j^m} \big) \big]_+ .$$
Here, $K_m$ is the number of truncated linear functions multiplied in the $m$th basis function, $x_{\kappa_j^m}$ is the input variable corresponding to the $j$th truncated linear function in the $m$th basis function, $\tau_{\kappa_j^m}$ is the knot value corresponding to the variable $x_{\kappa_j^m}$, and $s_{\kappa_j^m}$ is the sign $\pm 1$.
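A hedged sketch of how such products of truncated linear functions can be evaluated is given below; the variable indices, signs and knots are illustrative assumptions, and this is not Friedman's full MARS algorithm [9].

```python
import numpy as np

def hinge(x, tau, sign=+1.0):
    """Truncated linear function [sign * (x - tau)]_+, one half of a reflected pair."""
    return np.maximum(0.0, sign * (x - tau))

def basis_function(X, var_idx, knots, signs):
    """psi_m(x): product of K_m truncated linear functions, each acting on one
    input variable (columns of X selected by var_idx)."""
    value = np.ones(X.shape[0])
    for j, tau, s in zip(var_idx, knots, signs):
        value *= hinge(X[:, j], tau, s)
    return value

# illustrative two-factor interaction basis function: (x_0 - 0.5)_+ * (1.2 - x_1)_+
X = np.array([[0.7, 1.0], [0.4, 0.8], [1.0, 1.5]])
print(basis_function(X, var_idx=[0, 1], knots=[0.5, 1.2], signs=[+1.0, -1.0]))
```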
A lack-of-fit criterion is used to compare the possible basis functions. The search for new basis functions can be restricted to interactions of a maximum order. For example, if only up to two-factor interactions are permitted, then $K_m \leq 2$ would be required in the representation of $Y$. A fundamental drawback of recursive partitioning strategies, overcome by MARS, lies in the fact that recursive partitioning often results in a poor predictive ability for even low-order performance functions when new data are introduced. MARS overcomes such problems of recursive partitioning regression and thus increases accuracy.
MARS consists of two subalgorithms [9]:
Forward stepwise algorithm: Here, the forward stepwise search for basis functions starts with the constant basis function, the only one present initially. At each step, the split that minimizes some lack-of-fit criterion among all possible splits on each basis function is chosen. The process stops when a user-specified value $M_{\max}$ is reached. At the end of this process we have a large expression for $Y$. Since such a model typically overfits the data, a backward deletion procedure is applied.
Backward stepwise algorithm: Its purpose is to prevent overfitting by decreasing the complexity of the model without degrading the fit to the data. Therefore, the backward stepwise algorithm involves removing from the model, at each stage, the basis function whose removal causes the smallest increase in the residual squared error, producing an optimally estimated model $\hat{f}_\lambda$ with respect to each number of terms, called $\lambda$. We note that $\lambda$ expresses some complexity of our estimation. To estimate the optimal value of $\lambda$, a so-called generalized cross-validation can be employed [10].
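For orientation, a commonly used form of this criterion (the notation below is adapted to ours and its details are an assumption, not quoted from [9, 10]) penalizes the average residual sum of squares by the effective number of parameters $C(\lambda)$ of the model with complexity $\lambda$:
$$\mathrm{GCV}(\lambda) := \frac{\frac{1}{N} \sum_{i=1}^{N} \big( y_i - \hat{f}_\lambda(\tilde{x}_i) \big)^2}{\big( 1 - C(\lambda)/N \big)^2},$$
and the value of $\lambda$ minimizing this quotient is selected.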
As an alternative, [23] proposes penalty terms in addition to the least-squares estimation in order to control the lack of fit from the viewpoint of the complexity of the estimation. Referring to corresponding integration domains again, the penalized residual sum of squares is given by
$$\mathrm{PRSS} := \sum_{i=1}^{N} \Big( y_i - \theta_0 - \sum_{m=1}^{M} \theta_m \psi_m(\tilde{x}_i^m) - \sum_{m=M+1}^{M_{\max}} \theta_m \psi_m(\tilde{x}_i^m) \Big)^2 + \sum_{m=1}^{M_{\max}} \lambda_m \sum_{\substack{|\alpha| = 1, 2 \\ \alpha = (\alpha_1, \alpha_2)}} \ \sum_{\substack{r < s \\ r, s \in V(m)}} \int \theta_m^2 \big[ D^{\alpha}_{r,s} \psi_m(t^m) \big]^2 \, dt^m,$$
where $\tilde{x}_i = (\tilde{x}_{i,1}, \tilde{x}_{i,2}, \ldots, \tilde{x}_{i,p})^T$, $V(m) := \{ \kappa_j^m \mid j = 1, 2, \ldots, K_m \}$ is the variable set associated with the $m$th basis function $\psi_m$, and $t^m = (t_{m_1}, t_{m_2}, \ldots, t_{m_{K_m}})^T$ represents the vector of variables which contribute to the $m$th basis function $\psi_m$.
Furthermore, we refer to
$$D^{\alpha}_{r,s} \psi_m(t^m) := \frac{\partial^{|\alpha|} \psi_m}{\partial^{\alpha_1} t^m_r \, \partial^{\alpha_2} t^m_s}(t^m), \quad \alpha = (\alpha_1, \alpha_2), \ |\alpha| = \alpha_1 + \alpha_2, \ \text{where } \alpha_1, \alpha_2 \in \{0, 1\}.$$
Our optimization problem is based on the tradeoff between accuracy, i.e., a small sum of squared errors, and not too high a complexity [20, 21, 22]. This tradeoff is established through the penalty parameters. The second goal, a small complexity, encompasses two parts. In terms of famous inverse problems [1], the minimization of PRSS can be interpreted as a Tikhonov regularization [13, 14, 15]. Using a discretized form instead of the integration, we consider our optimization problem as a conic quadratic programming problem, which is a well-structured convex optimization model and can be solved by well-developed interior point methods [18, 19].
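A hedged sketch of this idea in simplified form: a penalized least-squares problem with a quadratic (Tikhonov-type) regularization term can be written and handed to a generic conic solver, e.g., via CVXPY. The design matrix, the penalty matrix L and the parameter lam below are assumptions made for illustration, not the actual discretization of PRSS used in [23].

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
N, M = 50, 8
Psi = rng.normal(size=(N, M))              # values of the basis functions at the data points
theta_true = rng.normal(size=M)
y = Psi @ theta_true + 0.1 * rng.normal(size=N)

L = np.eye(M)                              # simple Tikhonov penalty matrix (illustrative)
lam = 0.5                                  # tradeoff between accuracy and complexity

theta = cp.Variable(M)
objective = cp.Minimize(cp.sum_squares(y - Psi @ theta) + lam * cp.sum_squares(L @ theta))
problem = cp.Problem(objective)
problem.solve()                            # handled internally as a conic (second-order cone) program
print(theta.value)
```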
4 Conclusion
Data mining and its mathematical foundation are a key technology for the maintenance and development of our modern societies. This paper introduced the foundations and recent developments of three of its most extensive areas: clustering, classification and regression. The methods and results were obtained by the members of the EURO Working Group on Continuous Optimization (EUROPT) and its friends. We authors cordially thank all of them! This paper is considered and offered as an expression of friendship and fruitful collaboration between EUROPT and the EURO Working Group on Decision Support Systems, which is a host of the IFIP TC8/WG8.3 Working Conference in Toulouse.
References
[1] Aster, A., Borchers, B., and Thurber, C., Parameter Estimation and Inverse Problems, Academic Press, 2004.
[2] Bagirov, A.M., Rubinov, A.M., Soukhoroukova, N.V., and Yearwood, J., Unsupervised and supervised data classification via nonsmooth and global optimization, TOP 11, 1 (2003) 1-93.
[3] Bagirov, A.M., Rubinov, A.M., and Yearwood, J., Global optimization approach to classification, Optimization and Engineering 22 (2006) 65-74.
[4] Bagirov, A.M., and Ugon, J., Piecewise partially separable functions and a derivative-free algorithm for large scale nonsmooth optimization, Journal of Global Optimization 35 (2006) 163-195.
[5] Bagirov, A.M., and Yearwood, J., A new nonsmooth optimization algorithm for minimum sum-of-squares clustering problems, EJOR 170, 2 (2006) 578-596.
[6] Bock, H.H., Automatische Klassifikation, Vandenhoeck and Ruprecht, Göttingen (1974).
[7] Cristianini, N., and Shawe-Taylor, J., An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press (2000).
[8] Fox, J., Nonparametric Regression, Appendix to An R and S-Plus Companion to Applied Regression, Sage Publications, 2002.
[9] Friedman, J.H., Multivariate adaptive regression splines, The Annals of Statistics 19, 1 (1991) 1-67.
[10] Friedman, J.H., and Stuetzle, W., Projection pursuit regression, J. Amer. Stat. Assoc. 76 (1981) 817-823.
[11] Han, J., and Kamber, M., Data Mining: Concepts and Techniques, The Morgan Kaufmann Series in Data Management Systems, Jim Gray, Series Editor, Morgan Kaufmann Publishers (2000).
[12] Hansen, P., and Jaumard, B., Cluster analysis and mathematical programming, Mathematical Programming 79, 1-3 (1997) 191-215.
[13] Hastie, T., and Tibshirani, R., Generalized additive models, Statistical Science 1, 3 (1986) 297-310.
[14] Hastie, T., and Tibshirani, R., Generalized Additive Models, New York, Chapman and Hall, 1990.
[15] Hastie, T., Tibshirani, R., and Friedman, J., The Elements of Statistical Learning - Data Mining, Inference and Prediction, Springer Series in Statistics, 2001.
[16] Jain, A.K., Murty, M.N., and Flynn, P.J., Data clustering: a review, ACM Computing Surveys 31, 3 (1999) 264-323.
[17] Jain, A.K., Topchy, A., Law, M.H.C., and Buhmann, J.M., Landscape of clustering algorithms, in: Proc. IAPR International Conference on Pattern Recognition, Cambridge, UK (2004) 260-263.
[18] Nemirovski, A., Lectures on Modern Convex Optimization, Israel Institute of Technology (2002), http://iew3.technion.ac.il/Labs/Opt/opt/LN/Final.pdf.
[19] Nesterov, Y.E., and Nemirovskii, A.S., Interior Point Methods in Convex Programming, SIAM, 1993.
[20] Stone, C.J., Additive regression and other nonparametric models, The Annals of Statistics 13, 2 (1985) 689-705.
[21] Taylan, P., and Weber, G.-W., New approaches to regression in financial mathematics by additive models, Journal of Computational Technologies 12, 2 (2007) 3-22.
[22] Taylan, P., Weber, G.-W., and Beck, A., New approaches to regression by generalized additive models and continuous optimization for modern applications in finance, science and technology, in the special issue in honour of Prof. Dr. Alexander Rubinov of Optimization 56, 5-6 (2007) 1-24.
[23] Taylan, P., Weber, G.-W., and Yerlikaya, F., Continuous optimization applied in MARS for modern applications in finance, science and technology, preprint at Institute of Applied Mathematics, METU, submitted to the ISI Proceedings of 20th Mini-EURO Conference Continuous Optimization and Knowledge-Based Technologies, Neringa, Lithuania, May 20-23, 2007.
[24] Volkovich, Z.V., Barzily, Z., Akteke-Öztürk, B., and Weber, G.-W., Cluster stability using minimal spanning trees - a contribution to text and data mining, submitted to TOP.
[25] Weber, G.-W., Taylan, P., Özöğür, S., and Akteke-Öztürk, B., Statistical learning and optimization methods in data mining, in: Recent Advances in Statistics, eds.: Ayhan, H.Ö., and Batmaz, İ., Turkish Statistical Institute Press, Ankara, at the occasion of the Graduate Summer School on New Advances in Statistics (August 2007) 181-195.
