BY
ANDREI L. TURINSKY
M.S., Kharkiv National University, 1995
M.S., University of Illinois at Chicago, 1997
THESIS
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Mathematics
in the Graduate College of the
University of Illinois at Chicago, 2002
Chicago, Illinois
Copyright by
Andrei L. Turinsky
2002
ACKNOWLEDGMENTS
I owe a great debt of gratitude to my family who have been so patient and supportive
throughout my academic career. They are the main source of strength and inspiration for me,
to which I attribute all my past and future achievements. I'd also like to acknowledge AT&T
for its frequent discounts on international phone service, which helped us stay in touch during
my studies abroad.
I am very thankful to my thesis advisor Dr. Robert Grossman for the chance to write this
dissertation while working in his Laboratory for Advanced Computing. Professor Grossman's
directions and encouragement were essential to my progress. I have always had full confidence
in his judgment that comes from many years of professional experience, and merely observing
Dr. Grossman's approach to research work is quite illuminating.
I am happy to be associated with the Laboratory for Advanced Computing, a part of the
National Center for Data Mining at UIC. Few other places exist where one can gain as much
exposure to such a wide variety of novel application areas in data mining. Among the important
benefits, I regularly received financial support from LAC to attend data mining conferences,
which was a substantial part of my learning experience. In this regard I wish to thank Shirley
Connelly who was instrumental in securing the travel grants. Another best kept secret about
our lab is its great social atmosphere and stimulating discussions, much of which is also a result
of Shirley's involvement. I am grateful to Marco Mazzucco for proofreading several chapters
of this thesis and for a number of useful tips he gave me on the thesis defense procedure. He
also made sure that my computers ran smoothly. I'd like to thank Stuart Bailey who was
my technical mentor during my rst year at LAC. Cheryl Fernandes taught me several Java
programming techniques. Arvind Sethuraman provided some additional monetary incentive for
me to speed up my graduation process, which also helped. All other colleagues at the lab have
been very supportive of my efforts as well.
There were a number of people outside the Laboratory for Advanced Computing who helped
or influenced my academic progress. I learned quite a few research techniques from my first
advisor Professor Valery Korobov, a prominent scientist at the Kharkiv National University.
My former classmate and good friend Eugenia Vinogradskaya inspired me to move to the U.S.
and handled much of my admission process at the University of Illinois at Chicago. Professor Alexander Lipton was kind enough to support my application and later provided valuable
assistance in choosing my research area. Professor Floyd Hanson frequently acted as my unofficial advisor during my first years of studies. As a member of the thesis defense committee,
he proofread this dissertation and suggested several important improvements. I also wish to
thank other members of the committee Professors Bhaskar DasGupta, Jason Leigh and Charles
Tier for their useful comments. Professor Stanley Pliska gave me the opportunity to discuss
my academic career with him on a number of occasions, which was rather valuable. He also
made me learn LaTeX while working on an interesting project. Professor Yang Dai offered me
the gene expression microarray dataset that was used in the experiments in this thesis.
TABLE OF CONTENTS

CHAPTER

1       INTRODUCTION

2       INTERMEDIATE STRATEGIES
        2.1     Introduction
        2.2     Background and Related Work
        2.3     Computational Model
        2.3.1   Network Configuration
        2.3.2   Building Models
        2.4     The OPTDMP Method
        2.4.1   Strategies
        2.4.2   In-place and Centralized Strategies
        2.4.3   Cost Function
        2.4.4   Error
        2.4.5   Optimization
        2.5     Case Study: Nursery Data
        2.5.1   Data Preparation
        2.5.2   Error Function Estimation
        2.5.3   Optimization
        2.5.4   Cost
        2.5.5   In-place and Centralized Strategies
        2.5.6   Optimal Solution
        2.6     Conclusion

3
        3.1     Introduction
        3.2     Dual Strategies
        3.3     Dual Strategies for Ensemble Learning
        3.3.1   Simple Ensembles
        3.3.2   Boosted Ensembles
        3.4     Dual Strategies for Clustering
        3.4.1   Clustering Gene Expression Microarray Data
        3.4.2   Effect on the Cluster Tightness
        3.4.3   Identifying Similar Genes
        3.4.4   Effect on the Precision and Recall
        3.5     Conclusion

4

5
        5.1     Introduction
        5.1.1   Data Partitions
        5.1.2   Objective
        5.1.3   Related work
        5.2     Greedy Data Labeling
        5.2.1   Assumptions
        5.2.2   Optimization Problem
        5.2.3   Greedy Data Labeling: The Ideal Version
        5.2.4   Greedy Data Labeling: The Efficient Version
        5.2.5   Building the Meta-model
        5.3     Experiments
        5.3.1   Datasets
        5.3.2   Partitions
        5.3.3   Test Results
        5.3.4   Efficiency
        5.4     Conclusion

        CONCLUSION

LIST OF TABLES

LIST OF FIGURES
LIST OF ABBREVIATIONS

CART        Classification and Regression Trees
EM          Expectation-Maximization
GDL         Greedy Data Labeling
NB          Naive Bayesian
NN1
OPTDMP      OPTimal Data and Model Partitions
PMML        Predictive Model Markup Language
RandMix     Random Mixture
UCI ML      University of California at Irvine, Machine Learning (Repository)
SUMMARY
The objective of this thesis is to present the fundamental trade-off in distributed data mining, namely, a trade-off between the cost of communication and computation on one side
and the accuracy of the data mining results on the other side. There are two extreme approaches
to distributed data mining. One is to mine all data locally and combine the results, which is
the cheapest solution. Another is to collect all data at a central repository and mine it there,
which gives the most accurate results.
Chapter 1 is an introductory chapter that presents the fundamental trade-off between cost and accuracy in more detail.
Chapter 2 develops a mathematical framework for formulating this trade-off in rigorous terms and then finding intermediate strategies that balance cost and accuracy. We show that the problem may be reduced to one of constrained optimization. Using the known convexity of the learning curves as well as of the cost function, we demonstrate how the intermediate strategies may be found.
Chapter 3 presents a related dual problem in which cost constraints are fixed and accuracy
is maximized, which leads into the area of the quality of data partitions. We use experimental
applications with UCI Machine Learning data and gene expression microarray data to illustrate
the important aspects of this approach.
Chapter 4 develops a mathematical foundation for finding proper data partitions in the distributed environment presented in Chapter 3. We show that this problem may be formalized as a case of the Expectation-Maximization algorithm with constraints on the hidden variable. Chapter 5 introduces an algorithm of this type, called Greedy Data Labeling, which improves an initial data partition in the distributed model assignment problem.
CHAPTER 1
INTRODUCTION
TABLE I

TRADE-OFF BETWEEN COST AND ACCURACY FOR DIFFERENT DISTRIBUTED
DATA MINING STRATEGIES

DM Strategies    Cost        Accuracy
In-place         low         low
Intermediate     balanced    balanced
Centralized      high        high
Given the limited speed of the commodity internet and the large size of many data sets, mining the data in-place is the cheapest and quickest but often
the least accurate solution, while the centralized approach is more accurate but generally quite
expensive in terms of the time and other resources required.
There are a variety of intermediate strategies in which some data is moved and some data
is left in place, analyzed locally, and the resulting models are moved and combined. These
intermediate cases are becoming of practical significance with the explosion of fiber and the emergence of high performance networks. They represent a balance between sufficient accuracy of the data mining models and results on one hand and an acceptable level of cost on the
other hand. This is shown schematically in Table I.
In Chapter 2, we examine this intermediate case in the context in which high performance
networks are present and the cost function represents both computational and communication
costs. We reduce the problem to a convex programming problem so that standard techniques
can be applied. We illustrate our approach through the analysis of an example showing the
complexity and richness of this class of problems.
Today, the capability of the broadband communications infrastructure is doubling every 9-12 months, faster than the 18-month doubling of processor speeds (Moore's law). For example, a 155 Mb/s OC-3 link can move 10 Gigabytes of data in about 15 minutes. Given this
infrastructure and the growing importance of large distributed data sets, intermediate strategies between in-place and centralized strategies of the type described here should be of growing
interest.
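As a rough sanity check of the figures above, the idealized transfer time over such a link can be computed directly. The sketch below assumes the link sustains its nominal 155 Mb/s with no protocol or disk overhead; the raw line rate gives roughly nine minutes, consistent with the quoted 15 minutes once overhead is accounted for.

```python
# Back-of-the-envelope check of the OC-3 figure quoted above. Assumes an
# idealized link that sustains its full nominal rate with no overhead.

def transfer_minutes(gigabytes: float, megabits_per_sec: float) -> float:
    """Idealized time to move `gigabytes` of data over the given link."""
    bits = gigabytes * 8 * 1024 ** 3            # gigabytes -> bits
    seconds = bits / (megabits_per_sec * 10 ** 6)
    return seconds / 60

# 10 GB over a 155 Mb/s link: about 9.2 minutes at the raw line rate
print(round(transfer_minutes(10, 155), 1))
```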
Chapter 2 deals with building a mathematical framework for the analysis of the intermediate strategies primarily in the context of minimizing the cost of data transfer and processing
while maintaining a given level of accuracy for the outcome. In Chapter 3, we present a dual
optimization problem, namely, a problem of intermediate strategies that maximize accuracy
given cost constraints. We show that this problem leads directly into such important areas as
data quality, the impact of the initial data partition in distributed data mining on the accuracy
of the results, and the issue of data instance selection.
Two practical applications are chosen to illustrate these issues using existing methods of
data instance selection. In the first set of experiments, we examine the benefits of methods such
as boosting an ensemble of models when the ensemble is built on a distributed collection of
datasets. In the second set of experiments, we present the problem of clustering distributed bioinformatics data. Our goal is to identify important issues that arise in the dual intermediate
strategy approach, primarily the issue of the quality of distributed data partitions.
In Chapter 4, we develop a general mathematical treatment of the dual intermediate strategy
case and the issue of data partitions. We show that this problem allows a rigorous formulation
as a case of the Expectation-Maximization algorithm with constraints on the hidden variable. We prove a theorem that establishes how the constrained region in the hidden variable space affects the quality of the EM solutions. We also show how to use this theoretical framework to
build new families of EM-based algorithms for a distributed data mining environment.
In Chapter 5, we develop one such algorithm using an alternative motivation. We note that
one of the challenges in distributed data mining is to choose the best method of deployment of
a collection of predictive models built remotely. While methods such as ensemble learning use
the entire collection of models for classication, complex data may sometimes be best modeled
by a hierarchical system in which only one specialized model is deployed each time. We explore this scenario in Chapter 5, where we present the distributed model assignment problem and
suggest a method to address it.
Let there be k remote datasets. The distributed model assignment problem is a problem of
computing k local statistical models and an "assignment model", or meta-model. The quality
of the resulting system depends on the underlying data partition.
In Chapter 5, we introduce an algorithm called Greedy Data Labeling that improves the
initial data partition by selecting small portions of data for re-allocation between distributed
sites, so that when each model is built on its data subset, the resulting hierarchical system
has minimal error. We present experimental results showing that the model assignment approach may in certain situations be more natural than traditional ensemble learning, and that when
enhanced by GDL, it nearly always outperforms ensembles. Our technique is broadly related
to partition-based clustering algorithms and employs some ideas from boosting and simulated
annealing.
Although the GDL algorithm is developed independently of Chapter 4, it is of the same
type as the general family of constrained EM-like algorithms for which Chapter 4 provides a
theoretical foundation.
CHAPTER 2
INTERMEDIATE STRATEGIES
2.1 Introduction
The work presented in this chapter, to our knowledge, is the first attempt to identify a fundamental trade-off in distributed data mining; namely, the trade-off between the efficiency and cost-effectiveness of a distributed data mining application on one side, and the accuracy
and reliability of the resulting predictive system on the other side.
We provide evidence that the most efficient application may give unacceptably inaccurate predictive results, while the most accurate predictions may require an inefficient data processing strategy. We also explore a variety of intermediate strategies. Some of the findings presented
in this chapter have already been published in (Turinsky and Grossman, 2000) and to some
extent in (Grossman et al., 2000).
Because moving large data sets over the commodity internet can be very time consuming, a
common strategy today for mining geographically distributed data is to leave the data in place,
build local models, and combine the models at a central site. Call this an in-place strategy. At
the other extreme, when the amount of geographically distributed data is very small, the most
naive strategy is simply to move all the data to a central site and build a single model there.
Call this a centralized strategy.
Given geographically distributed data, we can either a) move data, b) move the results of
applying algorithms to data (models), or c) move the results of applying models to data (result
vectors). It is not uncommon for there to be a 10x-100x difference in the size of the data, model
and result vectors.
Consider a cost function that measures the total cost to produce a model and includes both
the communication and processing costs. As the size of the data grows and the speed of the
link connecting two sites decreases, an in-place strategy is, generally speaking, less expensive
but also less accurate. Conversely, the centralized strategy is generally more expensive but
also more accurate. Given a minimally acceptable accuracy, it is plausible that there is an
intermediate strategy which produces a model with this level of accuracy with the minimum
possible cost. Call these intermediate strategies.
We show that this is indeed the case and describe a method called OPTDMP (OPTimal Data and Model Partitions) for finding such a strategy. We also present an experimental case
study to show that such intermediate strategies occur rather naturally.
This chapter makes the following contributions:
1. We introduce the problem of computing intermediate strategies in distributed data mining
and point out that these types of strategies will become increasingly important with the
emergence of wide area high performance networks.
2. We provide a mathematical framework for analyzing intermediate distributed data mining
strategies.
3. We introduce a method called OPTDMP for finding intermediate strategies in the case
of a linear cost function.
4. We show with an example that intermediate strategies are interesting, even for the simple
case of linear cost functions.
Given the analysis of this chapter, it is straightforward to define versions of OPTDMP for a variety of other cost functions. Our point of view is to reduce finding intermediate strategies to a mathematical programming problem that minimizes a cost function subject to an error
constraint. The framework presented in the next sections holds for a wide range of cost functions
and mathematical programming algorithms. The purpose of this chapter is to introduce these
ideas with a simple cost function and a simple example.
2.2 Background and Related Work
Several systems for analysis of distributed data have been developed in recent years. These
include the JAM system developed by Stolfo et al (Stolfo et al., 1997), the Kensington system
developed by Guo et al (Guo et al., 1997), and BODHI developed by Kargupta et al (Kargupta
et al., 1997), (Kargupta et al., 1999). These systems differ in several ways. For example, JAM
uses meta-learning that combines several models by building a separate meta-model whose
inputs are the outputs of the collection of models and whose output is the desired outcome.
Kensington employs knowledge probing that considers learning from a black box viewpoint and
creates an overall model by examining the input and the output of each model, as well as the
desired output. The BODHI system employs so-called collective mining that relies in part on ideas from Fourier analysis to combine different models. In terms of data and model transfer, JAM,
Kensington and BODHI all use local learning.
A new system for distributed data mining called Papyrus is now being developed at the
National Center for Data Mining (Grossman et al., 2000). Among other features, it is designed
to support different data and model strategies, including local learning, centralized learning,
and a variety of intermediate strategies, that is, hybrid learning. Work is under way to develop
a methodology of choosing an information transfer strategy that is optimized for a particular
data mining task.
A variety of load balancing techniques have been utilized for a long time in parallel computing
applications. Load balancing is aimed at finding an optimal regime of moving data to the nodes of a supercomputer or, more recently, of a network of workstations. Zaki (Zaki et al., 1997) provides an example of a load balancing method that optimizes the efficiency of parallel
computation on a network of compute nodes. Other examples and motivating discussion can
be found in (Cheung, 1992) and (Grimshaw et al., 1994). These techniques, however, do not
directly target the specific issues that arise in distributed data mining, such as ways of combining
predictive models and the accuracy of the resulting predictive system.
An important topic in data mining is the study of the so-called learning curves. Essentially,
a learning curve shows the relationship between the size of a training dataset and the accuracy
of a predictive model built on that data. In general, exposing a model to more data reduces the
predictive error, although usually not to zero. See e.g. (Cortes et al., 1995). Learning curves
vary in shape depending on the quality of the data, type of models, and other factors. More
importantly, they all share several common features which we shall exploit. A detailed analysis
of learning curves is available in (Murata et al., 1993), (Haussler et al., 1996).
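A learning curve of the commonly cited shape error(x) = eps0 + c/x can be estimated from a handful of observed error rates by ordinary least squares on 1/x. The sketch below is only an illustration of this idea; the error measurements are synthetic, chosen to follow 0.05 + 20/x exactly.

```python
# Fitting the learning-curve shape error(x) = eps0 + c / x to observed error
# rates via the normal equations of least squares on the transformed 1/x.

def fit_learning_curve(sizes, errors):
    """Return (eps0, c) for the least-squares fit error = eps0 + c / size."""
    xs = [1.0 / s for s in sizes]
    n = len(xs)
    sx, sy = sum(xs), sum(errors)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, errors))
    c = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    eps0 = (sy - c * sx) / n
    return eps0, c

sizes = [100, 200, 400, 800, 1600]          # training set sizes
errors = [0.05 + 20.0 / s for s in sizes]   # synthetic error measurements
eps0, c = fit_learning_curve(sizes, errors)
print(round(eps0, 3), round(c, 1))          # recovers eps0 ~ 0.05, c ~ 20
```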
2.3 Computational Model
We use a computational model consisting of a collection of geographically distributed processors, each with dedicated memory and connected with a high performance network. We
assume that network access is substantially more expensive than a local memory access. Naturally, this assumption may not hold for very fast networks where remote access to memory
might in fact be faster than a local disk access. However, here we focus on a situation where
the processors are connected via a high speed broadband type network that adheres to quality
of service requirements.
2.3.1 Network Configuration
Formally, we assume that there are n different sites connected by a network. The cost of processing data at the ith node into a predictive model is β_i dollars per record, and the optimal cost of moving data from the ith to the jth node via the cheapest route between the two nodes is α_ij dollars per record.
per record. One of the nodes is the network root where the overall result will be computed.
2.3.2 Building Models
Our assumption is that at each node, a choice must be made: either ship raw data across
the network to another node for processing, or process data locally into a predictive model and
ship the model across the network. The same data, or a portion of it, may be used both locally
and for sharing with another processing node. Given this viewpoint, building models consists
of the following steps:
1. Re-distribute data across the network.
2. At every node, compute a predictive model.
3. Re-distribute all local predictive models to the root.
4. At the root, combine all models into a single predictive model.
Let D_i be the initial amount of data at the ith node. After the re-distribution in Step 1, this node accumulates D~_i of data. Let M_i be the size of the predictive model computed from D~_i in Step 2. It is later transferred to the root. We assume that when data is processed into a predictive model, its amount is compressed uniformly for each node with a coefficient λ, so that

M_j = λ D~_j    for all j.    (2.1)
2.4 The OPTDMP Method

In this section, we describe a method for finding OPTimal strategies for Data and Model Partitions, called OPTDMP.
2.4.1 Strategies

A strategy is a matrix

X = [x_ij],  i, j = 1, ..., n,    (2.2)

where x_ij is the amount of data D_i that is moved from the ith node to the jth node for processing. This portion of data contributes to D~_j, is processed, and later transferred to the root as a part of M_j. Note that

0 ≤ x_ij ≤ D_i.    (2.3)
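As a concrete illustration, a strategy and its box constraints can be represented directly. The node sizes and strategy entries below are hypothetical; note that rows need not sum to D_i, since the same data may be both kept locally and shared with another node (Section 2.3.2).

```python
# A strategy X = [x_ij] sketched as a nested list; all values are made up.

def is_valid_strategy(X, D):
    """Check the box constraints 0 <= x_ij <= D_i of Equation 2.3."""
    return all(0 <= x <= D[i] for i, row in enumerate(X) for x in row)

def accumulated_data(X):
    """D~_j = sum_i x_ij: the data gathered at node j after redistribution."""
    return [sum(row[j] for row in X) for j in range(len(X[0]))]

D = [100, 50, 80]            # initial records at each of three nodes
X = [[100, 20, 0],           # node 0 keeps its data and shares 20 records
     [0, 50, 0],             # node 1 processes everything in place
     [0, 0, 80]]             # node 2 processes everything in place

print(is_valid_strategy(X, D))   # True
print(accumulated_data(X))       # [100, 70, 80]
```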
2.4.3 Cost Function
The basic idea is to reduce the problem of finding optimal strategies for building models to a constrained optimization problem. The overall cost function is computed as

C(X) = Σ_ij c_ij x_ij.    (2.4)

The coefficients α_ij represent the cost of network communication between nodes i and j per unit volume of data, while the coefficients β_j represent the cost at node j per unit volume of data for an algorithm to process data and produce a statistical model. Recall that R is the root, so α_jR is the cost of moving data from node j to the root R. The first term represents the cost of moving data, the second term represents the cost to transform data into predictive models, while the third term represents the cost of moving the predictive models to the root so that they can be combined. For convenience, we define the coefficients

c_ij = α_ij + β_j + λ α_jR,    (2.5)

as indicated.
Note that the term representing the final step of combining the results at the root is not present in the cost function. This is due to the fact that regardless of the strategy X, the same amount of results will have to be processed at the root:

Σ_j M_j = λ Σ_j D~_j = λD,    (2.6)

where D is the total initial amount of all data and λ is the compression coefficient as above. Therefore, the term in question would be a constant and we may omit it without loss of generality. It may also be convenient to do so from a practical standpoint if the cost of combining models is negligible compared to other cost components. For example, combining several predictive models into a voting ensemble is a rather trivial operation compared to network transfer or model building.
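Combining local models by majority vote is indeed a small computation. A minimal sketch, in which the "models" are stand-ins (any callables mapping a record to a class label would do):

```python
# Majority-vote combination of local models; the models here are stand-ins.
from collections import Counter

def vote(models, record):
    """Majority vote of the local models on a single record."""
    predictions = [m(record) for m in models]
    return Counter(predictions).most_common(1)[0][0]

models = [lambda r: "yes", lambda r: "yes", lambda r: "no"]
print(vote(models, {"mileage": 120_000}))   # "yes"
```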
We assume a linear cost of data processing here, which will later lead to a linear function
optimization problem. Generally, more complex algorithms would lead to non-linear cost functions. In practice, many of the cost functions associated with nonlinear algorithms are convex,
in which case essentially the same approach works. The actual values of the coefficients may
be estimated based on the network throughputs, business infrastructure, particular algorithms
used, etc.
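With coefficient estimates in hand, the linear cost of Equations 2.4-2.5 can be evaluated directly. In the sketch below all coefficient values are hypothetical: alpha[i][j] plays the role of the per-record transfer cost from node i to node j, beta[j] the per-record processing cost at node j, lam the model compression coefficient, and root the index of the root node R.

```python
# Evaluating C(X) = sum_ij c_ij x_ij with c_ij = alpha_ij + beta_j + lam*alpha_jR.

def total_cost(X, alpha, beta, lam, root):
    """Total cost of a strategy X under the linear model of Eqs. 2.4-2.5."""
    n = len(X)
    total = 0.0
    for i in range(n):
        for j in range(n):
            c_ij = alpha[i][j] + beta[j] + lam * alpha[j][root]
            total += c_ij * X[i][j]
    return total

alpha = [[0.0, 1.0], [1.0, 0.0]]   # hypothetical symmetric transfer costs
beta = [2.0, 3.0]                  # hypothetical processing cost per record
X = [[10.0, 0.0], [0.0, 5.0]]      # an in-place strategy for two nodes
print(total_cost(X, alpha, beta, lam=0.1, root=0))   # 35.5
```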
Cost is different for different centralized strategies. Let C_0 be the best cost available under a centralized strategy. Also, let C_1 = C(X_1) be the cost of the in-place strategy.
The in-place strategy might not be the cheapest one. If the data mining algorithm requires a lot of resources at the data processing stage (e.g. due to high algorithmic complexity), the cost of deploying these resources may vary significantly from site to site and be considerable compared to the communication cost. It may then be more cost-effective to move the data
to cheaper processing sites than to process it locally. The cheapest policy will be found by
minimizing the cost function.
2.4.4 Error
We assume that there are two factors that introduce error. First, the loss of accuracy may
be due to the nature of the data and of the computational algorithm itself, regardless of the
strategy. Denote this error term ε_0. This type of error is covered in standard books on statistical
modeling, and, for example, can be estimated using a validation set.
Second, accuracy is generally lost when data is processed locally at several nodes instead of
moving it to one central node and processing it there. This follows from the observation that
any technique used to build a statistical model when the data is distributed can also be applied
when all the data is available at a single node; on the other hand, certain techniques available
when all the data is in one place are not available when the data is distributed, leading to
inferior quality of models built in the latter case. For example, even in ensemble learning when
several models are trained, it is often preferable to build them from a centralized dataset.
As an illustration, consider the case when the data is heterogeneous and differs from site to
site. If a collection of models is obtained from processing locally stored datasets in-place, each
of the resulting models would only be useful for classifying data of the same kind as the one
on which that model was built. If the particular kind of a new data record is not known, using
the collection of models as a voting ensemble would probably give inaccurate results since most
models would be bad predictors.
This situation is not at all unusual: many large datasets are partitioned into smaller subsets
in a way that makes them heterogeneous. For example, storing the dataset sorted by one or
several attributes is a common practice, and so is splitting it later sequentially into smaller
pieces. As a result, if e.g. a car dataset was previously sorted by mileage, different subsets would contain data on vehicles of different mileage ranges, yet this fact may be overlooked,
leading to biased local models.
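This effect is easy to reproduce. The sketch below generates a synthetic mileage attribute, stores it sorted, and splits it sequentially; each subset then covers its own mileage range, so any local statistic (here, the mean) is biased.

```python
# Sorting a dataset by one attribute and splitting it sequentially yields
# heterogeneous subsets. The data is entirely synthetic.
import random

random.seed(0)
mileage = sorted(random.uniform(0, 200_000) for _ in range(1000))

k = 4
size = len(mileage) // k
subsets = [mileage[i * size:(i + 1) * size] for i in range(k)]
means = [sum(s) / len(s) for s in subsets]

# Subset means increase strictly from the low-mileage slice to the high one,
# so a model trained on any single subset sees a biased sample.
print([round(m) for m in means])
```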
Sharing portions of data between processing sites will reduce the overall error of the ensemble
of models. Essentially, learning the data from each subset introduces a separate learning curve
for an individual model. The shapes of learning curves have been studied e.g. in (Murata et
al., 1993). The error function E(X) is essentially a superposition of several learning curves, one for each dimension in the X-space of strategies. Indeed, varying only one component, x_12, of a strategy X while fixing the rest of the components defines learning of the 1st data subset by the 2nd model, where the data from other subsets has been learned to the extent shown by
other components of X . Hence the error function surface E (X ) could be thought of as a surface
comprised of collections of learning curves parallel to each of the xij -axes in the X -space.
Knowing the type of the shapes of the individual learning curves, we make an important observation: the level contours of the error surface E(X), i.e. the contours {X : E(X) = const}, generate a family of convex sets in the space of strategies. See Figure 1. This insight will be
crucial for nding the optimal data allocation strategies.
We further observe that if all distributed data comes essentially from the same data distribution, then the error function will be constant on the linear sets {X : Σ_i x_ij = γ_j}, where the γ_j are some constants, since the only factor reducing error would be the amount of the processed data D~_j = Σ_i x_ij, but not its origins. In this case, the level contours of E(X) will be linear.

Figure 1. A 2-dimensional slice of level contours for a typical error function. The coordinates represent percentages of data coming from two remote nodes. The error takes its highest value at (0,0), the in-place strategy, and its lowest value at (100%,100%), a centralized strategy. The cost typically decreases in the opposite direction, which allows for an optimal intermediate strategy. The error function was obtained from mining the UCI ML Nursery dataset.
If, as shown in (Murata et al., 1993), the shape of the error curves is of the type ε_0 + ε_1/x, where x is the amount of training data and ε_0, ε_1 are constants, then

E(X) = ε_0 + ε_1 / Σ_i x_ij.    (2.7)
However, if the data is heterogeneous, we will probably see level contours with a more
pronounced convexity due to a faster error decrease along paths in X -space that represent a
mix of data from various sources, and a slower decrease along paths parallel to coordinate axes.
Our experiments confirm this observation.
An example of an error function for model j may be

E_j(X) = b_j0 + 1 / (b_0j + Σ_{i=1..k} b_ij x_ij)^{p_j},    (2.8)

where the b_ij are parameters to be estimated, k is the number of data subsets, and the power p_j ∈ (0, 1] defines the degree of convexity. Depending on how the local models are combined,
the overall error function E(X), or an upper bound for E(X), can then be determined. For example, if the models are used as an averaging ensemble, (Krogh and Sollich, 1997) shows that the disagreement between the models always causes the error of the ensemble to be lower than the average error of the individual models, which leads to the following error bound:

E(X) ≤ (1/k) Σ_j E_j(X).    (2.9)
Also, bagging, boosting and other ensemble learning techniques can be used to improve the accuracy of the combined model and may lead to a more complex mathematical structure of E(X).
Various other forms of E(X) satisfying the convexity property are possible and could be used without substantial changes to the algorithm. Perhaps the best practical way to define E(X) is either to rely on previously acquired experience with a particular type of data mining problem or to use a sample of data for a simulation that will determine the parameters of the error function. We shall demonstrate this approach in the next section.
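As an illustration of the simulation-based route, the sketch below fits the parameters of a learning-curve model of the same general shape as Equation 2.8 to simulated error measurements. The functional form, the "true" parameter values, and the crude grid search are all assumptions made for the sketch:

```python
import numpy as np

# Hypothetical learning-curve model of the same general shape as
# Equation 2.8: E(x) = b0 + 1 / (c0 + c1 * x**p).
def error_model(x, b0, c0, c1, p):
    return b0 + 1.0 / (c0 + c1 * x ** p)

# Simulated "measurements": in practice these would come from running
# the mining algorithm on samples of the data; here the true parameters
# are known only to the demo.
true_params = (0.03, 5.0, 20.0, 0.5)
xs = np.linspace(0.0, 1.0, 11)
es = error_model(xs, *true_params)

# A crude grid search stands in for a proper least-squares fit.
best, best_sse = None, np.inf
for b0 in np.linspace(0.0, 0.06, 7):
    for c0 in np.linspace(1.0, 9.0, 9):
        for c1 in np.linspace(5.0, 35.0, 7):
            for p in np.linspace(0.25, 1.0, 7):
                sse = float(np.sum((error_model(xs, b0, c0, c1, p) - es) ** 2))
                if sse < best_sse:
                    best, best_sse = (b0, c0, c1, p), sse

print("fitted parameters:", best)
```

Any standard nonlinear least-squares routine would do the same job more efficiently; the grid keeps the sketch self-contained.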
Once the parameters are known, the minimum and maximum values ε_0, ε_1 of the error function could then be estimated to find the range

\epsilon_0 \le E(X) \le \epsilon_1. \qquad (2.10)

It follows that with this model, the error takes its minimal value ε_0 when all data is available at a single processing node (centralized strategies X_0), and its largest possible value ε_1 when the data is not shared at all (in-place strategy X_1).
It should also be mentioned that the data mining error is a random variable that differs between experiments, so the above approach based on learning curves deals with the average error for a certain type of experiment, or an upper bound for such error, but not necessarily the actual error rate of any particular experiment.
2.4.5 Optimization
Our task now translates into finding a strategy X = [x_ij] that is a solution of the following optimization problem:
\begin{cases} \min_{[x_{ij}]}\; C(X) = \sum_{i,j} c_{ij}\, x_{ij} \\ 0 \le x_{ij} \le D_i \\ E(X) \le \epsilon_{max} \end{cases} \qquad (2.11)
where ε_max is the maximum error level allowed, the vector D is given by the initial data distribution, and c_ij is defined by Equation 2.5. The optimal solution is a strategy X* that gives the least cost C* = C(X*) among all sufficiently accurate strategies.

The first two lines in (Equation 2.11) define a linear programming problem with a convex bounded polyhedron domain B (which is a multidimensional rectangle in X-space). It can easily be solved, and its solution X̄ gives the best cost attainable in the absence of accuracy restrictions (Chvatal, 1983). As was mentioned, X̄ may differ from the in-place strategy. Geometrically, X̄ is the "lowest" vertex of the polyhedron, where the direction is determined by the level sets of the linear cost function. If X̄ satisfies the accuracy requirement, then the optimal strategy X* = X̄; otherwise we continue our search.
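Because B here is a box, the "lowest" vertex can be read off coordinate-wise rather than by running a general LP solver. A minimal sketch (the cost matrix and data amounts are made-up illustrations):

```python
import numpy as np

# Minimize C(X) = sum_ij c_ij * x_ij over the box 0 <= x_ij <= D_i.
# Because the feasible region is a box, the "lowest" vertex is found
# coordinate-wise: the lower bound where the cost coefficient is
# positive, the upper bound where it is negative.
def lowest_vertex(c, D):
    c = np.asarray(c, dtype=float)
    upper = np.broadcast_to(np.asarray(D, dtype=float)[:, None], c.shape)
    return np.where(c > 0, 0.0, upper)

c = np.array([[1.0, 2.0], [2.0, 1.0]])   # made-up unit costs c_ij
D = np.array([100.0, 80.0])              # made-up initial data amounts
X_bar = lowest_vertex(c, D)
print(X_bar, float(np.sum(c * X_bar)))
```

With all-positive unit costs the lowest vertex is the no-transfer strategy; it is then the accuracy constraint that forces some data to move.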
Since the error function E(X) has convex level contours, the third equation of the optimization problem is an additional convex constraint. It follows that if we define the sets B_ε = {X ∈ B : E(X) ≤ ε}, where the accuracy threshold parameter ε ∈ [ε_0, ε_1], then the collection {B_ε} forms a nested family of subsets of B.
As a side note, we have not come across evidence, either theoretical or experimental, that the direction of convexity of the level contours {X : E(X) = const} may be reversed. Such a phenomenon might occur, for example, when p_j > 1 in the error function (Equation 2.8), or in functions of similar shape. However, if this were to happen, it would result in the following modification of our optimization problem. The sets A_ε = {X ∈ B : E(X) ≥ ε} now become convex, and the sets {B_ε} defined above become their complements. (Consider e.g. functions similar to 1/(x² + y²), whose level contours are circles with centers at the origin.) With the error tolerance level set at ε_max, the optimal solution X* is still the lowest point of B_{ε_max}, but in this case it can easily be shown by a convexity argument that such a point must be on the intersection of the error level surface {X : E(X) = ε_max} with one of the edges of the polyhedron B. The suggested procedure of finding X* is then as follows: first find the lowest point X̄ of B using linear programming techniques, and then go "up" (in the direction of increasing cost) along each edge until it intersects the error level surface E(X) = ε_max, which occurs when the lowest boundary of the set B_{ε_max} of acceptable strategies is reached. It is then a simple task to choose X* as the lowest of the intersection points. Notice that if the linear restrictions are as in (Equation 2.11), and hence B is merely a multidimensional box, the geometry of the procedure is much simplified.
This approach is simple yet effective and, as shown in the following example, produces a non-trivial optimal cost solution.
Note that the decision of which instances of the data to move is not covered by this method,
just the fraction of the data to move. In the simplest case, we would assume that the local data
subsets are homogeneous and move random samples of local data.
2.5 Case Study
We tested our approach on experiments with several datasets from the UCI Machine Learning Repository (Murphy and Aha, 1993). The results were similar for most datasets, although some cases exhibited virtually no improvement in accuracy in centralized processing compared to the in-place processing. Coincidentally, these were the datasets that produced a fairly high classification error even in centralized learning, that is, under the most favorable conditions. We tend to believe that the reason for the lack of improvement was a higher intrinsic noise of the dataset and a high degree of homogeneity of the data.
The following illustrative example with the Nursery dataset demonstrates our approach. The dataset contains 12,960 data points, each with 8 independent attributes and a class label.
2.5.1 Data Preparation
Assume that there are k = 3 sites that contain distributed, possibly heterogeneous data. Our task is to determine how much data should be exchanged before the data processing begins so that the accuracy stays within acceptable limits. After processing, the resulting models are collected at site 1 for aggregation into an ensemble (thus the root node R = 1). To simulate such an environment, we took the Nursery database and split it sequentially into three (equal) parts. Originally, the data stored at the UCI repository was sorted, which is common. Therefore, the sequential partition resulted in three heterogeneous subsets. We also withheld a certain portion of the data for a validation set. As a model type, we chose C4.5 classification trees. C4.5 is a state-of-the-art decision tree algorithm (Quinlan, 1993). We built the decision tree models using a freeware data mining package called Weka (Witten and Frank, 1999). The resulting trees were combined into a voting ensemble, so that for a new data instance, all models would make their predictions and then a majority vote would be taken to produce the final prediction of the ensemble.
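As a sketch of the voting step (the models below are trivial stand-ins for the Weka-built C4.5 trees; two of the Nursery class labels are used for illustration):

```python
from collections import Counter

# Minimal majority vote over a collection of classifiers. Ties go to
# the label encountered first, a hypothetical tie-break not specified
# in the text.
def vote(models, instance):
    labels = [m(instance) for m in models]
    return Counter(labels).most_common(1)[0][0]

# Trivial stand-in "models"; in the thesis these are C4.5 trees.
m1 = lambda x: "priority" if x["health"] == "priority" else "not_recom"
m2 = lambda x: "priority"
m3 = lambda x: "not_recom"

print(vote([m1, m2, m3], {"health": "priority"}))   # majority: "priority"
```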
For simplicity of visualization, we decided that the strategy components x_ij will indicate percentages of data shared between nodes i and j, and that all data stored locally shall be used, along with the portions of data coming from other nodes, for building the local models. Hence x_ii = 1 for all i. The problem is then to determine the other 6 components of the strategy matrix
X = \begin{pmatrix} 1 & x_{12} & x_{13} \\ x_{21} & 1 & x_{23} \\ x_{31} & x_{32} & 1 \end{pmatrix}, \qquad x_{ij} \in [0, 1] \qquad (2.12)

2.5.2 Error Estimation
To estimate the error of building C4.5 trees on a given distributed collection of data, we considered each local model separately. For example, model 1 was built using the entire data subset D_1, the x_21 portion of subset D_2, and the x_31 portion of subset D_3. With the type of error functions (Equation 2.8) discussed above, the error of model 1 is essentially a function of two variables x_21 and x_31, given parametrically as:
E_1(X) = b_{10} + \frac{1}{b_{01} + b_{21}\, x_{21}^{p_1} + b_{31}\, x_{31}^{p_1}} \qquad (2.13)

TABLE II
MISCLASSIFICATION ERROR RATE (%) FOR MODEL 1, TABULATED FOR 0 ≤ x_21, x_31 ≤ 1

        0     .25   .50   .75    1
0      17.3  12.8  13.6  13.2  12.8
.25    13.3   6.0   5.1   4.9   4.5
.50    13.0   5.6   4.8   4.5   4.1
.75    12.5   5.2   4.2   4.1   3.7
1      11.9   5.0   4.1   3.7   3.4
(Since x_11 = 1, we can combine b_01 + b_11 x_11^{p_1} into a single constant b_01.) To estimate the coefficients, we tabulated the values of E_1(X) on the square (x_21, x_31) ∈ [0, 1]² by moving appropriate amounts of data from nodes 2 and 3 to node 1, as indicated by a pair (x_21, x_31), building a C4.5 tree model, and estimating its accuracy on a validation set. The results of 5-fold cross-validation were then averaged. Some of the tabulated error values for the three models are presented in Table II-Table IV, where rows and columns correspond to values of the varying x_ij components.
Some minor fluctuations notwithstanding, the error functions exhibit the behavior that we expected. It is easy to see that sharing data is important, as even a modest amount of sharing gives a rapid improvement in accuracy. That constitutes further evidence that a
TABLE III
MISCLASSIFICATION ERROR RATE (%) FOR MODEL 2, TABULATED FOR 0 ≤ x_12, x_32 ≤ 1

        0     .25   .50   .75    1
0      13.4   9.5   8.7   8.3   8.6
.25    10.7   6.2   5.3   5.1   4.6
.50    11.9   5.2   4.4   4.1   3.8
.75    11.8   5.1   4.3   3.9   3.6
1      11.9   5.0   4.1   3.7   3.4

TABLE IV
MISCLASSIFICATION ERROR RATE (%) FOR MODEL 3, TABULATED FOR 0 ≤ x_13, x_23 ≤ 1

        0     .25   .50   .75    1
0      20.1  14.4  10.3   9.9   8.6
.25    14.4   5.6   5.3   4.9   4.6
.50    13.2   4.9   4.5   4.1   3.8
.75    13.0   4.6   4.3   3.9   3.6
1      12.8   4.5   4.1   3.7   3.4
pure in-place strategy is likely to be inferior in quality. On the other hand, sharing relatively
small amounts of data is still cheap, which gives hope that intermediate strategies will indeed
represent a good balance between cost and accuracy. The level contours of E1 (X ) were also
shown in Figure 1 in the previous section.
The values were then used to fit the parameters of the error functions (Equation 2.8). We generated least-squares estimates for the parameters with the MATLAB package. There are numerous other interpolation and curve-fitting techniques available. The error function formulas that we obtained were
(2.14)

2.5.3 Optimization
Once both the cost function and the error function are known, the optimization problem (Equation 2.11) can be solved using standard techniques. In our illustrative example, a slight modification will make it possible to get an analytical solution to (Equation 2.11). Namely, we shall require that each local model satisfy the property E_j(X) ≤ ε_max, j = 1, 2, 3. This simplification will allow us to break (Equation 2.11) into three smaller optimization problems:
\begin{cases} \min\; c_{21}\, x_{21} + c_{31}\, x_{31} \\ 0 \le x_{ij} \le 1 \\ E_1(X) = b_{10} + \dfrac{1}{b_{01} + b_{21}\, x_{21}^{p_1} + b_{31}\, x_{31}^{p_1}} \le \epsilon_{max} \end{cases} \qquad (2.15)

\begin{cases} \min\; c_{12}\, x_{12} + c_{32}\, x_{32} \\ 0 \le x_{ij} \le 1 \\ E_2(X) = b_{20} + \dfrac{1}{b_{02} + b_{12}\, x_{12}^{p_2} + b_{32}\, x_{32}^{p_2}} \le \epsilon_{max} \end{cases} \qquad (2.16)

\begin{cases} \min\; c_{13}\, x_{13} + c_{23}\, x_{23} \\ 0 \le x_{ij} \le 1 \\ E_3(X) = b_{30} + \dfrac{1}{b_{03} + b_{13}\, x_{13}^{p_3} + b_{23}\, x_{23}^{p_3}} \le \epsilon_{max} \end{cases} \qquad (2.17)
Leaving the constraint 0 ≤ x_ij ≤ 1 aside for a while and using standard optimization techniques such as the Kuhn-Tucker theorem, it is possible to show, after some algebraic manipulations, that an optimization problem

\begin{cases} \min\; c_{ij}\, x_{ij} + c_{kj}\, x_{kj} \\ E_j(X) = b_{j0} + \dfrac{1}{b_{0j} + b_{ij}\, x_{ij}^{p_j} + b_{kj}\, x_{kj}^{p_j}} \le \epsilon_{max} \end{cases} \qquad (2.18)
has a solution

x_{ij}^{*} = \left[ \frac{\left( \dfrac{1}{\epsilon_{max} - b_{j0}} - b_{0j} \right) \left( b_{ij}\, c_{kj}^{p_j} \right)^{\frac{1}{1-p_j}}}{b_{ij} \left[ \left( b_{ij}\, c_{kj}^{p_j} \right)^{\frac{1}{1-p_j}} + \left( b_{kj}\, c_{ij}^{p_j} \right)^{\frac{1}{1-p_j}} \right]} \right]^{\frac{1}{p_j}} \qquad (2.19)
An optimal solution that satisfies 0 ≤ x_ij ≤ 1 will then either be given by the formula (Equation 2.19) or be one of the points of the intersection of the curve E_j(X) = ε_max and the boundaries of the square [0, 1]², which are easy to locate.
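Whether or not the closed form is used, each subproblem is also easy to solve numerically. The sketch below minimizes the transfer cost over a fine grid under the error constraint; all coefficient values are illustrative placeholders, not the fitted values from the experiments:

```python
import numpy as np

# Minimize the transfer cost c21*x21 + c31*x31 for model 1 subject to
# E1(X) <= eps_max on [0,1]^2, by brute force over a fine grid. All
# coefficients are made-up placeholders.
b10, b01, b21, b31, p1 = 0.02, 2.0, 10.0, 8.0, 0.5
c21, c31 = 2.0, 2.0
eps_max = 0.08

def E1(x21, x31):
    return b10 + 1.0 / (b01 + b21 * x21 ** p1 + b31 * x31 ** p1)

g = np.linspace(0.0, 1.0, 201)
x21g, x31g = np.meshgrid(g, g)
cost = c21 * x21g + c31 * x31g
cost_masked = np.where(E1(x21g, x31g) <= eps_max, cost, np.inf)
idx = np.unravel_index(np.argmin(cost_masked), cost.shape)
x_opt = (x21g[idx], x31g[idx])
print("x* ~", x_opt, " cost:", cost_masked[idx], " E1:", E1(*x_opt))
```

As expected, the numerical minimum sits on the error level curve E_1(X) = ε_max, well below the cost of full sharing.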
2.5.4 Cost
Whereas the error function is defined by the data processing algorithms and the quality of the data, the cost function depends on factors like the network infrastructure, the hardware and software systems used, etc. In our example, we have some liberty in choosing the cost function.

Generally, when the strategy components x_ij represent percentages of data instead of the actual amounts x_ij D_i that are shared, the cost of data processing must be represented by a matrix of β_ij values instead of a vector β_j. However, because we have had equal initial amounts of data at each local node, this modification is not necessary and a vector (β_1, β_2, β_3) will suffice. Also, the communication cost between each pair of nodes is symmetrical, so that we only need to know the values (α_12, α_13, α_23).

We assume here that the cost of shipping the resulting three decision tree models to the root for combining them into an ensemble is negligible, because the size of a decision tree model is typically much smaller than the size of the data on which it was built. (Otherwise, some trivial modifications have to be made in what follows.) Moreover, as was mentioned before, aggregating the three models at the root will cost the same regardless of strategy, hence the optimization is unaffected. Therefore, we let this aggregation cost be zero. Then
C = [c_{ij}] = \begin{pmatrix} \beta_1 & \alpha_{12} + \beta_2 & \alpha_{13} + \beta_3 \\ \alpha_{12} + \beta_1 & \beta_2 & \alpha_{23} + \beta_3 \\ \alpha_{13} + \beta_1 & \alpha_{23} + \beta_2 & \beta_3 \end{pmatrix} \qquad (2.20)
Depending on which node is chosen as a root, there are three centralized strategies of moving
all data to one node and building a single model:
X_0(1) = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \end{pmatrix}, \quad X_0(2) = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \end{pmatrix}, \quad X_0(3) = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix} \qquad (2.21)
The best centralized cost C_0 is then the smallest of the three values

C(X_0(1)) = \alpha_{12} + \alpha_{13} + 3\beta_1
C(X_0(2)) = \alpha_{12} + \alpha_{23} + 3\beta_2
C(X_0(3)) = \alpha_{13} + \alpha_{23} + 3\beta_3 \qquad (2.22)
The in-place strategy, i.e. a strategy of no data sharing, is
X_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \qquad (2.23)
which has a cost C_1 = β_1 + β_2 + β_3 but results in high errors (based on the tabulated values):
E_1(X_1) = 17.3\%, \qquad E_2(X_1) = 13.4\%, \qquad E_3(X_1) = 20.1\% \qquad (2.24)
Hence the in-place strategy is not acceptable. But depending on the relative values of the communication costs α_ij and the data processing costs β_j, a centralized strategy may become too expensive, and we can trade some of its accuracy for a noticeable improvement in cost.
2.5.6 Optimal Solution
Let us set the accuracy threshold at ε_max = 8%. We shall now explore the effects that different combinations of α_ij and β_j have on the solution. The actual values of the cost coefficients below have no specific meaning by themselves, as it is their relative proportions to each other that matter. Note that all optimal solutions X* presented below satisfy the accuracy requirement, and so what we are interested in is how much savings they provide over the best centralized strategy.
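The cost arithmetic in the cases below can be reproduced in a few lines. The sketch assumes a decomposition in which moving a unit of data from node i to node j costs a transfer charge α_ij plus the processing charge β_j at the destination, while the diagonal entries are pure local processing; with uniform unit costs it recovers the Case #1 figures:

```python
import numpy as np

# Cost bookkeeping: c_ij = alpha_ij + beta_j off the diagonal (a
# hypothetical decomposition used here for illustration), c_ii = beta_i.
def cost_matrix(alpha, beta):
    C = alpha + beta[None, :]
    np.fill_diagonal(C, beta)
    return C

alpha = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
beta = np.ones(3)
C = cost_matrix(alpha, beta)

X0 = np.zeros((3, 3)); X0[:, 0] = 1.0       # centralize at node 1
X1 = np.eye(3)                              # in-place strategy
X_star = np.array([[1, .016, .081],
                   [.073, 1, .114],
                   [.075, .038, 1]])        # strategy reported in (2.26)
cost = lambda X: float(np.sum(C * X))
print(cost(X0), cost(X1), cost(X_star))     # 5.0, 3.0, ~3.794
```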
Case #1: Communication and processing cost are uniform and of similar magnitude:

\alpha_{12} = \alpha_{13} = \alpha_{23} = 1, \quad \beta_1 = \beta_2 = \beta_3 = 1 \;\Longrightarrow\; C = \begin{pmatrix} 1 & 2 & 2 \\ 2 & 1 & 2 \\ 2 & 2 & 1 \end{pmatrix}, \quad C_0 = 5, \; C_1 = 3 \qquad (2.25)
The in-place strategy thus gives 40% savings over the centralized strategies, which indicates
that there is likely to be a well-balanced intermediate strategy. Equation 2.19 produces the
following optimal intermediate strategy:
X^{*} = \begin{pmatrix} 1 & .016 & .081 \\ .073 & 1 & .114 \\ .075 & .038 & 1 \end{pmatrix}, \qquad C^{*} = 3.79 \qquad (2.26)
which satisfies the accuracy constraints yet is 24.1% cheaper than the best of the centralized approaches.
Case #2: Communication and processing cost are uniform, communication is more expensive:

\alpha_{12} = \alpha_{13} = \alpha_{23} = 10, \quad \beta_1 = \beta_2 = \beta_3 = 1 \;\Longrightarrow\; C = \begin{pmatrix} 1 & 11 & 11 \\ 11 & 1 & 11 \\ 11 & 11 & 1 \end{pmatrix}, \quad C_0 = 23, \; C_1 = 3 \qquad (2.27)
The in-place strategy thus provides an impressive 87% savings over the centralized strategies. (Equation 2.19) gives

X^{*} = \begin{pmatrix} 1 & .016 & .081 \\ .073 & 1 & .114 \\ .075 & .038 & 1 \end{pmatrix}, \qquad C^{*} = 7.36 \qquad (2.28)
Note that this optimal strategy X* is the same as in the previous example. This is due to the fact that the relative magnitudes of the non-diagonal coefficients of the matrix C with respect to each other stayed constant, in which case Equation 2.19 gives the same solution. What differs is the amount of cost, which went down due to the reduction of expensive (and unnecessary) data transfer.
Case #3: Communication and processing cost are uniform, data processing is expensive:

\alpha_{12} = \alpha_{13} = \alpha_{23} = 1, \quad \beta_1 = \beta_2 = \beta_3 = 10 \;\Longrightarrow\; C = \begin{pmatrix} 10 & 11 & 11 \\ 11 & 10 & 11 \\ 11 & 11 & 10 \end{pmatrix}, \quad C_0 = 32, \; C_1 = 30 \qquad (2.29)
Equation 2.19 again gives the same intermediate strategy

X = \begin{pmatrix} 1 & .016 & .081 \\ .073 & 1 & .114 \\ .075 & .038 & 1 \end{pmatrix}, \qquad C(X) = 34.37 \qquad (2.30)
but it is no longer optimal! It appears that cheap communication allows us to move all data to
a centralized location and build a single very accurate model there. This is both cheaper and
better than building three local models, each on only a portion of data. The optimal strategy
is then any of the three centralized strategies, as they all cost the same. Note also that the
centralized strategies do not satisfy our initial assumption xii = 1. Hence they are not covered
by Equation 2.19 and must be examined separately. This explains why Equation 2.19 was not
able to produce the optimal solution.
Case #4: Communication cost varies, data processing is cheap:

\alpha_{12} = 10, \; \alpha_{13} = 5, \; \alpha_{23} = 1, \quad \beta_1 = \beta_2 = \beta_3 = 1 \;\Longrightarrow\; C = \begin{pmatrix} 1 & 11 & 6 \\ 11 & 1 & 2 \\ 6 & 2 & 1 \end{pmatrix}, \quad C_0 = 9, \; C_1 = 3 \qquad (2.31)
Due to the significant reduction in the amount of expensive communication between nodes 1 and 2, Equation 2.19 gives an optimal intermediate strategy

X^{*} = \begin{pmatrix} 1 & .004 & .036 \\ .051 & 1 & .190 \\ .105 & .065 & 1 \end{pmatrix}, \qquad C^{*} = 4.96 \qquad (2.32)
Case #5: Communication and processing cost vary:

\alpha_{12} = 1, \; \alpha_{13} = 1, \; \alpha_{23} = 10, \quad \beta_1 = 10, \; \beta_2 = 10, \; \beta_3 = 1 \;\Longrightarrow\; C = \begin{pmatrix} 10 & 11 & 2 \\ 11 & 10 & 11 \\ 11 & 20 & 1 \end{pmatrix}, \quad C_0 = 14, \; C_1 = 21 \qquad (2.33)
The best centralized strategy is X0 (3). Equation 2.19 gives an intermediate strategy
X = \begin{pmatrix} 1 & .026 & .246 \\ .073 & 1 & .044 \\ .075 & .031 & 1 \end{pmatrix}, \qquad C(X) = 24.5 \qquad (2.34)
which is more expensive than the centralized strategy X_0(3) of moving all data to node 3 and building a single model there. Interestingly, even the in-place strategy, which is usually the most cost-effective if accuracy is not an issue, is more expensive than X_0(3) in this case! The explanation is simple: local data processing at nodes 1 and 2 would be too costly, and moving all data to node 3 is a better option. Thus X* = X_0(3).
Numerous other cases are readily available for investigation using this framework. Some possible modifications may include:

1. To require that the communication cost coefficients α_ij satisfy triangle inequalities. This may better represent the option of moving data between nodes via a third node. We haven't done so just to show the different possibilities and flexibility of using our framework.
2. To impose additional (linear) constraints on x_ij due to network topology. E.g. if the only route between nodes 1 and 3 is through node 2, we may want to make use of all the data passing between nodes 1 and 3 in building local model 2. In this case,

x_{12} \ge x_{13}, \qquad x_{32} \ge x_{31} \qquad (2.35)

which would require some changes in the optimization procedure. The general approach would still remain the same.
These examples show that once the structure of the learning process - i.e. of the error
function - is known, non-trivial intermediate strategies occur naturally and are often superior.
2.6 Conclusion
In this chapter, we have introduced a new framework and methodology for distributed data mining. It allows us to choose a cost-optimal balance between local computation and node-to-node communication and data transfer. We show that this framework effectively bridges two simple approaches to distributed data mining which are common today: one that computes all data locally (in-place mining) and one that moves all data to a single processing node (centralized mining). We call these intermediate strategies.

The framework reduces the problem of finding intermediate strategies to a mathematical programming problem which minimizes a cost function incorporating both communication and processing terms, subject to an error constraint. We show by example that this problem is interesting even for linear cost functions. Finally, we introduce a method OPTDMP for finding intermediate strategies.
CHAPTER 3
3.1 Introduction
In the previous chapter, the problem of finding the intermediate strategies for distributed data mining was introduced. Intermediate strategies suggest a balance between the accuracy of centralized data processing and the cost savings of in-place processing. It was shown that, given a general structure of the error function and the cost of each stage of data processing, the problem may be posed as that of minimizing a cost function while satisfying the accuracy condition. In this case, a linear function is minimized over a convex feasible set.
A problem dual to the one described above is that of minimizing the error while satisfying the cost constraints:

\begin{cases} \min_{[x_{ij}]}\; E(X) \\ 0 \le x_{ij} \le D_i \\ C(X) = \sum_{i,j} c_{ij}\, x_{ij} \le \gamma \end{cases} \qquad (3.1)

If the previous setting is followed, this corresponds to minimizing a convex function over a feasible set defined by linear inequalities, which is a well-known problem with a wide range of methods available to solve it. See (Lewis and Borwein, 2000).
However, there is an essential difference between the two problems. When the cost is minimized, the primary question is how much data to transfer. Varying the amount of data traffic is what affects the cost function the most. Note that for the optimal intermediate strategies, the equation E(X) = ε_max is satisfied.

On the other hand, when the error function E(X) is minimized, it follows from the nature of E(X) that there are typically no local minima within the interior of the feasible region and, therefore, the optimal solution is also on the boundary. That is, C(X) = γ. Because the value of γ is strongly related to the volume of data traffic, the issue is no longer how much data to transfer, because given the value of γ, there is little room for variation. If so, the next most important issue remaining is which data to choose for transfer. Choosing different data records generally results in different accuracy of the predictive models built on that data and, therefore, the problem of minimizing the error E(X) given cost constraints is more naturally related to the problem of selecting "proper" data rather than to that of selecting the right amount.
3.2 Dual Strategies
In what follows, it is assumed that the cost of data transfer between data sites dominates the
cost of local data processing and aggregation of the results. This assumption is quite reasonable
from the practical standpoint, given that it is data delivery and cleaning, not data processing,
that is currently the most time-consuming and costly part of the entire data mining process.
Under this assumption, a strategy acquires cost only at the data transfer stage, thus

C(X) = \sum_{i \ne j} \alpha_{ij}\, x_{ij}, \qquad 1 \le i, j \le k \qquad (3.2)

where, as before, α_ij is the unit cost of data transfer between nodes i and j. Also, α_ii = 0.
Instead of solving the general minimization problem, we now consider a simplification in which the cost is split evenly between all k(k-1) terms in Equation 3.2. In other words, consider

\begin{cases} \min_{[x_{ij}]}\; E(X) \\ 0 \le x_{ij} \le D_i \\ \alpha_{ij}\, x_{ij} = \dfrac{\gamma}{k(k-1)}, \quad i \ne j \end{cases} \qquad (3.3)
It follows from the observation above that the optimal strategy X* would satisfy the cost condition exactly, in which case

x_{ij}^{*} = \begin{cases} \dfrac{\gamma}{\alpha_{ij}\, k(k-1)}, & i \ne j \\ 0, & i = j \end{cases} \qquad (3.4)

Hence the amount of data transfer between each pair of nodes is known immediately. The question that is now more pertinent is: which data to choose for this transfer?
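Equation 3.4 is straightforward to evaluate. A small sketch with made-up unit costs, checking that every transfer term carries an equal share of the budget:

```python
import numpy as np

# Equation 3.4: with the budget gamma split evenly over the k(k-1)
# transfer terms, every pair's transfer amount follows immediately.
def dual_strategy(alpha, gamma):
    k = alpha.shape[0]
    with np.errstate(divide="ignore"):
        X = gamma / (alpha * k * (k - 1))
    np.fill_diagonal(X, 0.0)          # x_ii = 0: nothing is moved in place
    return X

alpha = np.array([[0., 2., 4.],       # made-up unit transfer costs
                  [2., 0., 1.],
                  [4., 1., 0.]])
gamma = 12.0
X = dual_strategy(alpha, gamma)
# Each term alpha_ij * x_ij equals gamma / (k(k-1)) = 12/6 = 2.
print(alpha * X)
```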
3.3
In this part of the study, we conducted experiments to explore and compare several possible ways to build ensembles of local models in the presence of constraints on data traffic.
The idea of choosing particular data instances to feed into the learning algorithm has been used in several successful supervised learning methods, primarily in the context of boosting (Dietterich, 2000), (Freund, 1995). In a traditional boosting scenario, a single predictive model is repeatedly trained on data sampled from the same dataset but with probability weights that change over time. Namely, a data instance has a larger probability of being chosen if the model in its current iteration makes an error predicting the class value of the instance. If the class attribute is numeric, the size of the error affects the probability as well. This approach ensures that parts of the original data distribution where the model still makes errors will be better exposed to the learning algorithm at the next iteration, whereas parts that are already well learned are not exposed needlessly.
There is a natural connection between boosting and the ensemble methods or, more generally, random forests and other random collections of predictive models. See (Breiman, 1999). Our motivation is to explore the potential of a boosting-like approach in our distributed data mining scenario.
To build the ensembles, we used the same Nursery dataset from the UCI Data Repository
(Murphy and Aha, 1993) as in the Study Case in the earlier chapter. The data was partitioned
into k = 3 distributed subsets. Again, we used the Weka data mining package (Witten and
Frank, 1999).
For the initial partition, we were interested in modeling the situations where data comes from either a single source or different sources. Because we used real data and couldn't control its source, the training set was initially split either homogeneously or heterogeneously, in the following fashion.

In a homogeneous initial split, we assumed that there is no essential difference between the underlying data distributions at each of the k distributed sites, so the dataset was divided into k equal parts randomly.
In a heterogeneous initial split, we used the following procedure: build a single C4.5 decision tree on the entire dataset and broadly mimic the way
TABLE V
MISCLASSIFICATION ERROR RATES FOR ENSEMBLES BUILT ON NURSERY DATA

Data transfer (%)   Same data source      Different data sources
                    Simple   Boosted      Simple   Boosted
0                    5.2      5.2          37.6     37.6
10                   5.2      5.5           7.7      7.2
20                   5.3      5.6           6.7      5.6
30                   5.2      5.5           6.0      5.5
40                   5.2      5.5           5.7      5.5
50                   5.2      5.4           5.4      5.3
60                   5.3      5.6           5.4      5.4
70                   5.3      5.3           5.3      5.3
it splits the data. Decision trees have an important property of creating data splits that are
easily interpretable. In our experiments, we sorted the data by its most informative attribute
- the one at the tree root node - and then split it sequentially into k equal subsets.
We conducted three separate runs for each type of experiment and each type of initial partition. A single run used five-fold cross-validation to test the accuracy of the resulting collection of locally built models. The test results were then averaged over these 15 cross-validation folds. The accuracy of the collection was tested for a range of values of γ, the parameter that sets the constraints on data traffic. In this case, γ ∈ [0, 1] was defined as the largest percentage of data that can be moved between local data subsets, or colors.
We describe these experiments below. The test results are presented in Table V.
3.3.1 Simple Ensembles
In this first type of experiment, the initial partition of data into k colors was updated by a random exchange of data. Namely, γ percent of each local dataset was chosen randomly, then divided equally into k-1 parts, and each part was moved to one of the other k-1 colors. Once the data was collected at the receiving site, a C4.5 decision tree model was built on the resulting dataset. The collection of k locally built models was then tested as a voting ensemble, where the test set was defined by the current cross-validation fold. This experiment provides a base case against which we measure the benefits of other methods of data selection.

It is worth mentioning that because the process of selecting the portion of data to be transferred in this experiment was completely randomized, we made three independent random selections for each value of γ in each of the cross-validation folds. The test results were averaged over these three runs, and then averaged further over the 15 cross-validation folds as described above.
It appears that when the initial partition of data is random, i.e. when the initial colors represent the same data distribution, the Simple Ensemble method produces ensembles of practically the same accuracy (5.2% misclassification error) regardless of the allowable data transfer. This is not unexpected: indeed, if the initial partition of data represents k replicas of the same data distribution, then subsequent random exchanges of data between colors preserve these local distributions.
However, when the initial partition represents k distinct data distributions, we observe a dramatic improvement in the accuracy of the ensemble as the amount of data traffic increases. The misclassification error rate of the in-place strategy (γ = 0, no data traffic) is more than seven times that of the strategy that allows a thorough mixing of the k distinct distributions. This is due to the fact that initially, each model of the ensemble is not able to learn any of the other k-1 data distributions and hence makes intolerably frequent errors on the test set. Taking a vote of such a poor collection of predictors gives an error rate of as much as 37.6%. However, as the value of the data traffic increases, the local models start learning the other data distributions and thus become better and better predictors for the whole dataset, with the error rate of the ensemble dropping to as low as 5.3%.

We also observe that an adequate degree of mixture is achieved at around γ = 60%, where the accuracy is already as good as that of the ensemble built in the case of a common initial distribution. Note that the value γ = (k-1)/k (about 67% for k = 3) corresponds to a complete mixture of the original data distributions, i.e. a mixture where each color contains 1/k of the data from each of the original distributions.
3.3.2 Boosted Ensembles
In our next experiment, the data was no longer chosen for transfer randomly. Instead, we adopted boosting as our method for choosing the data instances to be transferred. In the distributed environment, this results in the following iterative procedure:

For iterations i = 1, 2, ...:
- Build all local models.
- Set the data transfer constraint γ_i for the current iteration.
- For each pair of colors (fromColor, toColor):
  - Choose γ_i/(k-1) percent of the data instances in the fromColor that are misclassified by the model of the toColor.
  - Move the chosen data instances to the toColor.

Use the resulting k models as an ensemble that makes its prediction by a majority vote.

The data traffic is thus broken into several batches, one for each iteration. To ensure the stability of the algorithm and the overall data transfer constraint, the following requirements should be met:

\gamma_1 \ge \gamma_2 \ge \dots \ge 0, \qquad \sum_i \gamma_i \le \gamma. \qquad (3.5)
Perhaps it is for a similar reason that in the case of a common initial data distribution, boosted ensembles were consistently outperformed by simple ensembles, although only
marginally.
3.4
Another important class of data mining methods that is based on selecting particular data instances is clustering and mixture problems (Everitt, 1974), (Mitchell, 1997). There is a conceptual difference between voting ensembles of predictive models and clustering methods. Every model in the ensemble is expected to learn the entire global data distribution. On the other hand, each cluster in a clustering model represents only a portion of the global distribution that, in general, does not overlap the portions represented by other clusters. In terms of selecting and relocating data instances between clusters, clustering can be viewed as a technique opposite to boosting, in the following sense: in boosting, a model is presented with data on which it makes the largest errors, whereas in clustering, a cluster generally receives data that fits it better than other clusters.
Our focus is on partition-based clustering algorithms as the most appropriate type of clustering method for the problem of partitioning data across k distributed sites. In a distributed environment, this translates into treating each of the k sites as a separate cluster. The constraints on moving data given by Equation 3.4 must then be injected into the clustering algorithm of our choice.

We adopt the k-means clustering algorithm for our tests, which is a well-known partition-based clustering algorithm (Everitt, 1974). The restrictions on the amount of data transfer
make it impractical to update the cluster centroids too often, as is done in the standard version of k-means. Indeed, if the goal is to keep the data locally, then the strategy is to exchange only the clustering results, i.e. the centroids, between the sites but do all the updates locally. However, this would require keeping all locally stored copies of the centroids up to date. Note that each cluster may comprise data from all data sites. If the algorithm updates the centroids whenever a new data point is added to a cluster, given that this data point may come from any of the local sites, there needs to be a prohibitively frequent exchange of the centroid copies. The resulting traffic between sites would be equivalent to, or worse than, the traffic created when the entire dataset is centralized in the first place.
A natural way to overcome this obstacle is to update the cluster centroids by moving data in batch and to exchange the centroids between the k sites only infrequently. Also, there must be a mechanism for stopping the clustering process before the amount of data traffic exceeds ε. We therefore deployed an iterative k-means procedure similar to the one for boosting:
For iterations i = 1, 2, ...:

- Update the centroids for each color, the k distributed centroids comprising the k-means clustering model.

- Set the data transfer constraint ε_i for the current iteration.

- For each pair of colors (fromColor, toColor): choose up to ε_i/k data instances of the fromColor that are nearest to the centroid of the toColor. Move the chosen data instances to the toColor.
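The iterative procedure above can be sketched in code. This is a minimal single-process simulation of the distributed setting; the rule for selecting instances to move (those that fit the destination centroid better than their own, taken in order of proximity) and all function and parameter names are assumptions for illustration, not the thesis's exact specification:

```python
import numpy as np

def constrained_kmeans(subsets, iters=10, eps=0.1):
    """Distributed k-means with a cap on per-iteration data transfer.

    subsets -- list of k non-empty arrays, one per site (color).
    eps     -- fraction of the total data allowed to move per iteration.
    """
    k = len(subsets)
    n_total = sum(len(s) for s in subsets)
    for _ in range(iters):
        # Only the k centroids are exchanged between sites, never raw data.
        centroids = [s.mean(axis=0) for s in subsets]
        quota = max(1, int(eps * n_total / k))  # budget per color pair
        for src in range(k):
            for dst in range(k):
                if src == dst or len(subsets[src]) <= quota:
                    continue
                # Manhattan distances to own vs. destination centroid.
                d_src = np.abs(subsets[src] - centroids[src]).sum(axis=1)
                d_dst = np.abs(subsets[src] - centroids[dst]).sum(axis=1)
                cand = np.where(d_dst < d_src)[0]   # fit destination better
                move = cand[np.argsort(d_dst[cand])[:quota]]
                if move.size == 0:
                    continue
                keep = np.setdiff1d(np.arange(len(subsets[src])), move)
                subsets[dst] = np.vstack([subsets[dst], subsets[src][move]])
                subsets[src] = subsets[src][keep]
    return subsets
```

Centroids are recomputed only once per iteration (batch updates), which models the infrequent centroid exchange discussed above.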
3.4.1
In this experiment, we illustrate some aspects of the dual strategies for clustering. Our dataset for this test contains gene expression microarray data. Gene expression data has recently become a very powerful tool for the analysis of individual genes and their interactions, as well as for the discovery of natural gene clusters and groups; see (Tibshirani et al., 1999), (Gerstein and Jansen, 2000). In a typical microarray dataset, rows correspond to different individual genes and columns correspond to particular experiments, types of environment, or some other external stimuli. The numerical values within the table are called gene expression values and represent the activity level of each gene in response to each stimulus. Because similar genes react to the same stimulus in a similar way, gene expression data is frequently used to cluster genes or otherwise draw conclusions about gene similarity.
We took a gene expression microarray dataset described in (Zhang et al., 2001). The data contains the expression values of 2000 individual genes (rows) on 62 different tissues (columns), of which 22 were normal tissues and 40 were colon cancer tissues. We treat each row as a separate data instance and each column as a data attribute. Our objective is to demonstrate the effect of data transfer constraints on the quality of clustering.
In this experiment, we set k = 10 and made r = 100 independent runs. In each run, the data was initially partitioned into 10 data subsets randomly. Then, for a range of different values of the allowable data transfer ε, we ran the distributed k-means clustering algorithm. We chose values of ε that roughly follow a logarithmic scale. Values of ε may exceed 100% because the data is allowed to be transferred between local data subsets multiple times before a stable clustering
TABLE VI
ANALYSIS OF THE GENE EXPRESSION MICROARRAY DATA

Allowable data   Actual data    Residual   Precision   Recall
transfer (%)     transfer (%)   error      (%)         (%)
0                0              346.09     N/A         N/A
10               11.1           342.31     46.2        14.1
30               23.2           332.81     53.6        28.5
100              61.0           268.87     74.3        52.6
300              113.2          208.39     72.3        76.4
1000             147.8          174.81     88.3        86.8
3000             147.3          164.56     95.5        92.6
10000            157.8          162.22     100.0       100.0
configuration is reached, giving a large overall data transfer. The Weka data mining package (Witten and Frank, 1999) was used.

It is known that for high-dimensional data, low Lp norms such as the Manhattan norm give better distance metrics than the usual Euclidean one (Aggarwal et al., 2001). Therefore, the Manhattan distance metric was used for clustering. The obtained clusters of genes were then examined. The results of the tests are presented in Table VI and are discussed below.
3.4.2
We first observe the effect of increasing the allowable data traffic on the tightness of the obtained clusters. Because all attributes of the data are numerical, we can treat each data record as a point in a 62-dimensional space. We define the residual error as the average distance per dimension between a data record and the nearest of the k = 10 centroids. The error
rates presented in the table are the result of averaging over the r = 100 independent trials. The residual error is as high as 346.09 in the initial random distribution but, as expected, subsequent clustering increases the tightness of each cluster, and the average residual error drops to as low as 162.22 when ε = 10000%, i.e. when each data record is potentially moved around 100 times on average. There is no significant reduction in error beyond this value of ε. In fact, the residual errors begin to stabilize somewhere around ε = 3000%.
We also see that the actual data traffic does not increase indefinitely as the constraints are relaxed. Instead, it stabilizes around a value of 150-160%, despite a much larger value for the allowable data transfer. This is due to the fact that for each dataset and task, there is an optimal amount of data transfer that is required to reach a stable clustering configuration. Beyond this intrinsic value, further data transfer is unnecessary or may even be detrimental.
3.4.3
Our main goal is to use the results of clustering to find similar genes, which is an important problem in bioinformatics. Assume a gene GeneX of interest is chosen among the 2000 genes in the data. Note that because clustering methods like k-means depend on the initial conditions, it is unrealistic to expect that the same clustering configuration will result from each run of the algorithm. We therefore wish to identify all the genes that consistently appear in the same cluster with GeneX.
Let GeneY be another gene from the gene pool. Let P̂ be the observed frequency with which GeneY falls in the same cluster with GeneX, and let p be the true frequency. We shall call GeneX and GeneY similar if we can reject the statistical null hypothesis
H_0 : p = p_0   (3.6)

where p_0 = 1/k is the frequency of two data records falling into the same cluster at random, given k clusters.
We shall develop a level-.05 test for H_0. First note that we are dealing with a binomial distribution for which a positive outcome is: "Both GeneX and GeneY are in the same cluster". It is known that when the number of trials r ≥ 100, a binomial random variable may be closely approximated by a Normal Gaussian distribution, and hence the empirical frequency P̂ of a binomial distribution may also be approximated by a Normal Gaussian distribution (Pugachev, 1984). We then use the Neyman-Pearson Lemma to construct a critical region of size .05 for the test. The general type of the region is well known in the case of a Normal random variable: because P̂ has mean p and variance p(1 − p)/r, the critical region of size .05 for H_0 is defined by the condition
(P̂ − p_0) / √(p_0(1 − p_0)/r) ≥ z_.95 = 1.645   (3.7)
where z_.95 is the 95th percentile of a standard Normal distribution. See (Hoel et al., 1971).
Hence, to identify genes that consistently fall in the same cluster with GeneX, we compute the empirical frequencies P̂ for all genes in the pool and take those genes for which Equation 3.7 is satisfied. This gives a stable GeneX-cluster.
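The test of Equation 3.7 is straightforward to apply in code. The sketch below assumes we have already counted, for each candidate gene, the number of runs (out of r) in which it shared a cluster with GeneX; the function name and interface are illustrative:

```python
import math

def stable_cluster(cooccur_counts, r, k, z95=1.645):
    """Return indices of genes that co-cluster with GeneX significantly
    more often than the chance rate p0 = 1/k (one-sided level-.05 test).

    cooccur_counts -- for each candidate gene, the number of runs (out of r)
    in which it fell in the same cluster as GeneX.
    """
    p0 = 1.0 / k
    se = math.sqrt(p0 * (1 - p0) / r)     # standard error under H0
    stable = []
    for gene, count in enumerate(cooccur_counts):
        p_hat = count / r
        if (p_hat - p0) / se >= z95:       # reject H0: call the gene similar
            stable.append(gene)
    return stable
```

With r = 100 and k = 10 this rejects H_0 for any gene observed together with GeneX in roughly 15 or more of the 100 runs.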
3.4.4
The experiment described above was performed for each value of ε. However, we know that the quality of clustering is best when the restrictions on data transfer are relaxed the most, which corresponds to the value ε = 10000% in this experiment. Therefore, we treat the GeneX-cluster found with ε = 10000% as the true cluster and the GeneX-clusters found with lower values of ε as approximations to this true cluster. This allows us to measure the effect of the data transfer constraints on the ability of the clustering method to recover the true GeneX-cluster. We are interested in both the precision and the recall, defined as usual:
precision = |Appr ∩ True| / |Appr|,   recall = |Appr ∩ True| / |True|

where Appr and True are the approximate and the true clusters respectively and |·| indicates the number of elements in the set.
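For concreteness, the two measures can be computed from the cluster membership sets directly (a minimal sketch; names are illustrative):

```python
def precision_recall(appr, true):
    """Precision and recall of an approximate cluster against the true one."""
    appr, true = set(appr), set(true)
    hit = len(appr & true)                       # |Appr ∩ True|
    precision = hit / len(appr) if appr else 0.0
    recall = hit / len(true) if true else 0.0
    return precision, recall
```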
In our experiments, we randomly selected five different genes for the role of GeneX. This gave five different collections of GeneX-clusters, where each collection is made of clusters found under different data transfer constraints ε. The values of precision and recall were averaged over the five collections and are presented in Table VI. The last row represents the true GeneX-clusters, hence both precision and recall values are 100%. The first row represents the initial distribution before clustering, hence neither precision nor recall values were collected.
We observe that when the allowable data traffic is small, it is the value of the recall that suffers the most. Only a small percentage of the genes that are similar to GeneX are identified as such by the clustering method. However, the precision is considerably better than the recall, indicating that the clustering does not pick wrong genes as often as it misses the correct ones. This situation is due to the fact that for low values of data transfer, the clustering is able to produce only a small GeneX-cluster, hence missing many correct answers. Indeed, when the data transfer is low, the clustering does not have enough time to diverge from a random initial distribution. Therefore, the clustering process is so far from completion that at the end of it many genes appear in virtually random combinations with other genes, and the next run of the clustering method would rearrange these combinations completely. As a result, only a few gene pairs ever have a chance to be identified as consistently similar because the hypothesis H_0 is seldom rejected.
This also serves as a warning against trusting any individual run of a clustering method when the exact associations between data instances are of primary interest. This danger is aggravated when data transfer constraints are present. Instead, a method combining several runs must be devised. The hypothesis testing procedure described above is an example of such a method.
As the constraints on data transfer are relaxed, we observe a rapid increase in the value
of the recall that eventually reaches the value of the precision. This is due to the fact that as
the quality of clustering results improves, pairs of similar genes fall into the same cluster more
often. Therefore, the hypothesis H_0 of their similarity being merely random is rejected more frequently. As a result, the GeneX-cluster returned by the clustering method contains more genes and hence fewer correct answers are missed, which boosts the recall.
3.5 Conclusion
This chapter demonstrated that in the distributed environment, the quality of the data and of the data partition becomes one of the dominant factors that affect the accuracy of the resulting data mining systems. Moreover, it is possible to improve the quality of the data partitions by using initial data reallocation. In this case, however, there must be a mechanism for selecting the right data instances for the transfer. It appears that previously developed methods such as boosting and clustering may be used to develop new methods of marking the data instances, which, depending on the data mining task, may lead to a significant improvement in the quality of the results.
More generally, the clustering and mixture model environment may also be used as a basis for a theoretical foundation that formalizes the problem of finding the right data partition across the k distributed sites before the data mining phase begins. In the next chapter, we develop such a formal framework using the Expectation-Maximization approach. Specifically, we show that the problem of dual intermediate strategies, and hence the problem of finding data partitions, may be solved by a constrained EM algorithm.
CHAPTER 4
4.1
The previous chapter presents the problem of finding the optimal allocation of data to k distributed data sites. The strategy of transferring data aims at minimizing the overall error of the resulting predictive system. This setting is closely related to the mixture of models problem, in which the data is comprised of data drawn from k distinct distributions but the true membership of each data instance is not known in advance. The information about the true data membership thus serves as hidden data attributes, as opposed to the observable attributes. The goal is to recover the parameters of each distribution in the mixture or, equivalently, to separate the data into k groups corresponding to their true membership. Many well-known problems can be treated in a mixture of models context. For example, k-means clustering is essentially a problem of recovering k Normal distributions from a dataset.
One of the most powerful tools used in a mixture of models scenario is the EM algorithm (Dempster et al., 1977). In short, the algorithm alternates between two steps:
- Expectation: estimate the probability distribution of the hidden variables using the current estimates for the model parameters.

- Maximization: find the maximum likelihood estimates of the model parameters using the current estimates for the distribution of the hidden variables.
It has been shown (see (Dellaert, 2002)) that at each iteration, the EM algorithm finds a tight lower bound for the true log-likelihood function, and subsequent maximizations of this bound result in eventual convergence of the algorithm to a local maximum of the true log-likelihood.
Let X represent the observable part of the data, Z represent the hidden variable, and Θ represent the parameters of the models in the mixture. The goal is to find the maximum likelihood estimate for Θ:

Θ* = arg max_Θ log P(X|Θ) = arg max_Θ log Σ_Z P(X, Z|Θ)   (4.1)

where P(X|Θ) is the probability of observing the data X if the true mixture is given by Θ. P(X, Z|Θ) is defined similarly.
4.2
Now consider the case where data transfer constraints between the models, or colors, are present. Values of the hidden variable Z correspond to different ways of allocating the data to the k colors. Restricting the amount of data transfer between colored data subsets effectively reduces the number of possible data colorings Z reachable from the current data coloring.
Let the distance ||Z_1 − Z_2|| in the Z-space be defined as the least amount of data transfer between the k colors needed to reach a color configuration Z_2 from a color configuration Z_1. Let t be the number of the current iteration, with Z_t and Θ_t as the current estimates for the hidden variable and the model parameters.

Assume that the amount of data transferred at the current iteration is constrained by a parameter ε_t. Then Z_{t+1} must fall in the ε_t-neighborhood of Z_t defined as

N_t = {Z : ||Z − Z_t|| ≤ ε_t}.   (4.2)

The per-iteration constraints accumulate to

Σ_t ε_t ≤ ε   (4.3)

where ε represents the constraint on the total amount of data transferred between the colors by the algorithm. If Z_0 is the value of the hidden variable that corresponds to the initial allocation of data at t = 0, then the feasible values of Z are restricted to the ε-neighborhood of Z_0 throughout the execution of the algorithm. Therefore, the original log-likelihood optimization
problem from Equation 4.1 is now assumed to be replaced by the following constrained log-likelihood optimization problem:

Θ* = arg max_Θ log P(X|Θ) ≈ arg max_Θ log Σ_{||Z − Z_0|| ≤ ε} P(X, Z|Θ).   (4.4)
The corresponding optimization problem for the t-th iteration of the constrained Expectation-Maximization algorithm is then

Θ_{t+1} = arg max_Θ log P(X|Θ) ≈ arg max_Θ log Σ_{Z ∈ N_t} P(X, Z|Θ).   (4.5)

4.3
Consider the t-th iteration of the constrained EM algorithm. Possible values of Z must now be restricted to the ε_t-neighborhood of Z_t. Therefore, the lower bound for the true log-likelihood function is derived in terms of a truncated probability distribution of Z with N_t as its support. Let π_t(Z) be a probability distribution of Z over the Z-space of hidden variables. Then, after normalization, the corresponding truncated distribution of Z with support N_t is defined in terms of the conditional distribution:
π_t^N(Z) = P(Z | Z ∈ N_t) = π_t(Z)/m(N_t) if Z ∈ N_t, and 0 otherwise   (4.6)
where m(N_t) is the probability measure of the ε_t-neighborhood of Z_t under the original untruncated probability distribution:

m(N_t) = P(Z ∈ N_t) = Σ_{Z ∈ N_t} π_t(Z).   (4.7)
Following a procedure similar to the one in (Dellaert, 2002), this conditional distribution is used to define the lower bound which the constrained EM algorithm maximizes:
B(Θ, Θ_t) = Σ_Z π_t^N(Z) log [P(X, Z|Θ) / π_t^N(Z)]   (4.8)

≤ log Σ_Z π_t^N(Z) [P(X, Z|Θ) / π_t^N(Z)] ≤ log P(X|Θ).   (4.9)
The EM approach can be interpreted as a sequential maximization of the lower bound B(Θ, Θ_t).
4.4
It is shown in (Dellaert, 2002) that in the absence of the data transfer constraints, maximizing the expression B(Θ_t, Θ_t) with respect to the probability distribution π_t(Z) gives the optimal distribution

π_t(Z) = P(Z|X, Θ_t)   (4.10)

that results in the best possible lower bound for the log-likelihood, namely, a locally tight lower bound that touches the objective log-likelihood function:

B(Θ_t, Θ_t) = log P(X|Θ_t).   (4.11)
When the data transfer constraints are present, Equation 4.6 gives the corresponding optimal truncated distribution in the ε_t-neighborhood of Z_t:

π_t^N(Z) = P(Z|X, Θ_t)/m(N_t) if Z ∈ N_t, and 0 otherwise; that is, π_t^N(Z) = P(Z|X, Θ_t, N_t)   (4.12)

where in this case, m(N_t) = P(Z ∈ N_t|X, Θ_t). Alternatively, the same expression may be obtained by a direct maximization as in (Dellaert, 2002), where N_t now plays the role of the Z-space.
It follows that

B(Θ_t, Θ_t) = Σ_{Z ∈ N_t} π_t^N(Z) log [P(X, Z|Θ_t) / π_t^N(Z)] = Σ_{Z ∈ N_t} [P(Z|X, Θ_t)/m(N_t)] log [P(X|Θ_t) m(N_t)]   (4.13)

= log P(X|Θ_t) + log m(N_t).   (4.14)
Comparing Equation 4.11 and Equation 4.14 demonstrates the qualitative effect of the data transfer restrictions on the lower bound B(Θ, Θ_t). Note that by the nature of a probability measure, m(N_t) ∈ [0, 1] and so log m(N_t) ≤ 0. Therefore, whereas the lower bound for the true log-likelihood is tight in the unconstrained case, in the constrained case it falls short of log P(X|Θ_t) by the amount |log m(N_t)|.

Also note that when converted back from the logarithmic to the usual scale, Equation 4.14 gives the corresponding lower bound for the true likelihood function:
gives the corresponding lower bound for the true likelihood function:
e^{B(Θ_t, Θ_t)} = P(X|Θ_t) m(N_t) ≤ P(X|Θ_t)   (4.15)

where, as before, m(N_t) = P(Z ∈ N_t|X, Θ_t) ∈ [0, 1]. This shows that the locally tight lower bound for the likelihood function of the unconstrained problem is reduced by a factor of m(N_t) in the constrained problem. The results in Equation 4.14 and Equation 4.15 express in quantitative terms the intrinsic sub-optimality of the constrained EM algorithm compared to the usual unconstrained version.
Naturally, relaxing the restrictions on data transfer improves the quality of the solution found by the constrained EM algorithm at each iteration. Indeed, increasing the value of the constraint parameter ε_t corresponds to an expansion of the neighborhood N_t around the hidden variable value Z_t in the Z-space, and so its probability measure m(N_t) also increases. Removing the restrictions altogether corresponds to ε_t = ∞ and the allowable hidden variable neighborhood N_t covering the entire Z-space, which gives m(N_t) = 1 and log m(N_t) = 0. In this case, Equation 4.11 follows directly from either Equation 4.14 or Equation 4.16.
Also note that Equation 4.14 may be rewritten as

B(Θ_t, Θ_t) = log (P(X|Θ_t) P(N_t|X, Θ_t)) = log P(X, N_t|Θ_t)   (4.16)

which further reveals the nature of the lower bound function B(Θ, Θ_t) in the constrained case.
These results are summarized in the following

Theorem 4.4.1. Let Equation 4.8 define a family of lower bounds B(Θ, Θ_t) for the log-likelihood function log P(X|Θ_t), where N_t is the ε_t-neighborhood of the current data allocation Z_t and ε_t is the maximum allowable amount of data transfer at the t-th iteration. Then the best lower bound is achieved with the distribution

π_t^N(Z) = P(Z|X, Θ_t, N_t)   (4.17)

and equals

B(Θ_t, Θ_t) = log P(X, N_t|Θ_t) = log P(X|Θ_t) + log m(N_t) ≤ log P(X|Θ_t)   (4.18)

where m(N_t) = P(N_t|X, Θ_t). The corresponding lower bound for the likelihood function reaches only as high as P(X|Θ_t) m(N_t), whereas a tight lower bound of P(X|Θ_t) is achieved in the unconstrained case.
4.5
As was mentioned above, the essence of the EM approach lies in a sequential maximization of the lower bound B(Θ, Θ_t). Similar to the derivation in (Dellaert, 2002), Equation 4.8 and Equation 4.12 give

B(Θ, Θ_t) = E[ log ( P(X, Z|Θ) / (P(Z|X, Θ_t)/m(N_t)) ) | X, Θ_t, N_t ]   (4.19)

= E[ log P(X, Z|Θ) | X, Θ_t, N_t ] + terms independent of Θ = Q_t(Θ) + const.   (4.20)
Therefore, the general EM algorithm with constraints on the amount of data transfer can be described as follows:

- Given the initial distribution of the data across the k colors and the maximum amount of allowable data transfer, identify the initial value of the hidden variable Z_0 and the sequence {ε_t}.

- For each iteration t:

  - E-step: identify Z_t and N_t, and estimate the distribution π_t^N(Z) = P(Z|X, Θ_t, N_t) of the hidden variable, which defines Q_t(Θ).

  - M-step: find the data transfer strategy that results in building local models that maximize Q_t(Θ), i.e. set Θ_{t+1} = arg max_Θ Q_t(Θ).
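The loop above can be written as a generic driver. The `e_step`/`m_step` callbacks and the `eps_schedule` interface are hypothetical placeholders for the application-specific steps, not part of the thesis:

```python
def constrained_em(x, theta0, z0, eps_schedule, e_step, m_step, iters=20):
    """Skeleton of the constrained EM loop (illustrative interface).

    e_step(x, theta, z, eps) -> soft memberships restricted to the
        eps-neighborhood N_t of the current allocation z.
    m_step(x, w) -> new parameters maximizing Q_t(theta), together with
        the new allocation z implied by the weights.
    """
    theta, z = theta0, z0
    for t in range(iters):
        eps_t = eps_schedule(t)
        w = e_step(x, theta, z, eps_t)   # E-step: pi_t^N(Z) = P(Z|X, theta, N_t)
        theta, z = m_step(x, w)          # M-step: constrained data transfer
    return theta, z
```

Any concrete instantiation supplies the two callbacks; the driver itself only enforces the per-iteration schedule {ε_t}.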
4.6 Implementation Details
In this section, the constrained EM approach will be demonstrated for the case where there are k distributed data sites D_1, ..., D_k and a predictive model f_j is built on each dataset, where 1 ≤ j ≤ k. Let the class attribute be numeric so that all f_j are regression models. In general, regression models are predictive models that, given an unlabeled data instance x, return a probability distribution of its numeric class value or, presuming the other parameters known, the mean of such a distribution. Typically, a Normal Gaussian distribution is used, with its mean serving as the predicted value. The variance reflects the accuracy of such a prediction. There is a wide variety of types of regression models, including linear, generalized linear, and nonlinear regression models (Bates and Watts, 1988), CART decision trees (Breiman et al., 1984), clustering models (Everitt, 1974), etc.
For simplicity of demonstration, assume that all Normal distributions returned by f_1, ..., f_k have the same variance σ². For a given unlabeled data instance x, the prediction f_j(x) is then the mean of the Normal distribution of the class attribute returned by the j-th model.

The derivation below follows the steps of the derivation of k-means clustering with an unconstrained EM approach presented in (Mitchell, 1997).
Let there be a total of n data instances {x_i}, i = 1, ..., n, in the combined dataset ∪_{j=1}^k D_j. First note that the hidden variable Z that contains the data membership information may be represented by an n-by-k matrix of values

z_ij = 1 if x_i ∈ D_j, and 0 otherwise.   (4.21)
In order to derive the expression for Q_t(Θ) defined in Equation 4.20 for the E-step, notice as in (Mitchell, 1997) that only one of the z_ij is non-zero for a given i. Also, let f_j(·|Θ) be the predicted mean of the Normal distribution returned by the j-th model when the parameters of the model correspond to the given value of Θ. The log-likelihood function may then be written as

log P(X, Z|Θ) = log Π_{i=1}^n P(x_i, z_i1, ..., z_ik|Θ)

= Σ_{i=1}^n log [ (1/√(2πσ²)) e^{−(1/(2σ²)) Σ_{j=1}^k z_ij (x_i − f_j(x_i|Θ))²} ]

= n log (1/√(2πσ²)) − (1/(2σ²)) Σ_{i=1}^n Σ_{j=1}^k z_ij (x_i − f_j(x_i|Θ))².   (4.22)
Let ω_ij^t = E[z_ij | X, Θ_t, N_t]. It follows from Equation 4.21 that {ω_ij^t} is a collection of numbers in [0, 1] independent of Θ. Then the function Q_t(Θ) maximized in the M-step is defined as

Q_t(Θ) = n log (1/√(2πσ²)) − (1/(2σ²)) Σ_{ij} ω_ij^t (x_i − f_j(x_i|Θ))².   (4.23)
Therefore, the M-step of the constrained EM algorithm is equivalent to the following global weighted least squares minimization problem:

Θ_{t+1} = arg min_Θ Σ_{j=1}^k Σ_{i=1}^n ω_ij^t (x_i − f_j(x_i|Θ))².   (4.24)
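As an illustration of Equation 4.24, the sketch below solves the per-model weighted least squares problems for linear base-models with numpy. It uses the standard regression residual y_i − f_j(x_i) for a separate target vector y; the function name and interface are assumptions, not the thesis's implementation:

```python
import numpy as np

def weighted_ls_models(X, y, W):
    """One weighted least-squares linear model per site.

    X -- (n, d) feature matrix, y -- (n,) targets,
    W -- (n, k) weights w_ij from the E-step.
    Returns a list of k coefficient vectors (slope(s), then intercept).
    """
    n, k = W.shape
    A = np.hstack([X, np.ones((n, 1))])       # add intercept column
    models = []
    for j in range(k):
        w = W[:, j]
        # min_theta sum_i w_ij * (y_i - A_i theta)^2 via sqrt-weighting.
        sw = np.sqrt(w)
        theta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
        models.append(theta)
    return models
```

Multiplying each row of the design matrix and target by √w_ij turns the weighted problem into an ordinary least squares problem, which `lstsq` solves directly.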
Note that in a distributed environment, Θ = <Θ_1, ..., Θ_k>, where each Θ_j corresponds to a predictive model f_j = f_j(·|Θ_j) built separately on D_j. Hence, Equation 4.24 results in the following system of k weighted least squares minimization problems:
Θ_j^{t+1} = arg min_{Θ_j} Σ_{i=1}^n ω_ij^t (x_i − f_j(x_i|Θ_j))²,   1 ≤ j ≤ k.   (4.25)

4.7
Note that the system of optimization problems in Equation 4.25 is the same for both the regular EM algorithm and the EM algorithm with constraints on data transfer. The constraints only affect the weights ω_ij. In fact, there is a deep connection between the weighting schema {ω_ij} and the strategy of data transfer between the distributed data sites D_1, ..., D_k.

As was stated, in a distributed environment, the objective is to find the optimal allocation of data instances to the k data sites. At every step of the algorithm, each of the predictive models f_j is built only on the data from D_j. Consequently, only the terms that correspond to D_j will be available in the sum in Equation 4.25 for each j in any practical implementation of the algorithm. Therefore, Equation 4.25 must be interpreted as a stochastic optimization rather than a deterministic optimization. The weights are then interpreted as

ω_ij^t = P(x_i is allocated to D_j | X, Θ_t, N_t).   (4.26)
Let x ∝ f_j indicate that a data instance x was generated by a Normal distribution with mean μ = f_j(x) and variance σ², where f_j is one of the k predictive models. Assume for now that there are no restrictions on data transfer, i.e. that ε_t = ∞ and N_t covers the whole Z-space.

In this case, if for some data instance x_i ∝ f_j, then x_i must belong to D_j. Similar to (Mitchell, 1997),

ω_ij^t|∞ = E[z_ij | X, Θ_t, ε_t = ∞] = P(x_i ∝ f_j | X, Θ_t) = e^{−(x_i − f_j(x_i|Θ_t))²/(2σ²)} / Σ_{c=1}^k e^{−(x_i − f_c(x_i|Θ_t))²/(2σ²)}.   (4.27)

The stochastic data transfer strategy that corresponds directly to Equation 4.25 is then to allocate each data instance x_i to the dataset D_j with probability ω_ij^t|∞ and build least squares models f_1, ..., f_k on the resulting datasets. On average, this strategy results in the following volume of data traffic:

V_∞^t = Σ_i P(x_i is moved | X, Θ_t, ε_t = ∞) = Σ_i (1 − ω_{ic_i}^t|∞) = |D| − Σ_i ω_{ic_i}^t|∞   (4.28)

where c_i = j iff x_i ∈ D_j, i.e. c_i is the current membership of x_i, and |D| is the total amount of data in all k datasets. The sum Σ_i ω_{ic_i}^t|∞ represents the average amount of data that stays in place.
To investigate the case when the constraints are present, the nature of the neighborhood N_t must first be established. As before, let Z_t be the current allocation of data and let Z ∈ N_t if no more than ε_t amount of data transfer is needed to reach data allocation Z from Z_t.

There are several possible ways to enforce the constraints on the data traffic. A natural way to do so is to adjust the probabilities ω_ij^t to make it uniformly "easier" for each data instance x_i to stay in its current data subset D_{c_i}. This is done by multiplying the staying probability ω_{ic_i}^t|∞ by a factor ν, the appropriate value of which is to be found later. The other k − 1 probabilities must then be normalized, effectively making it uniformly "harder" for the data to be moved to other datasets. The weights are then expressed as

ω_{ic_i}^t = ν ω_{ic_i}^t|∞,   ω_ij^t = [(1 − ν ω_{ic_i}^t|∞) / (1 − ω_{ic_i}^t|∞)] ω_ij^t|∞,   j ≠ c_i   (4.29)

where an appropriate value of ν must be found depending on the desired amount of data transfer. In this constrained case, an argument similar to Equation 4.28 gives

ε_t = |D| − ν Σ_i ω_{ic_i}^t|∞   (4.30)
and hence

ν = (|D| − ε_t) / Σ_i ω_{ic_i}^t|∞ = (|D| − ε_t) / (|D| − V_∞^t).   (4.31)
This, together with Equation 4.27 and Equation 4.29, gives the formula for the weights. The data instances shall then be allocated according to these probabilities, which on average will result in the volume ε_t of data traffic. Other stochastic schemata that correspond to Equation 4.25 are possible as well.

As expected, ν ≥ 1 whenever ε_t ≤ V_∞^t, i.e. whenever the data traffic is indeed restricted by the imposed constraints. Also, it may seem that the calculation of ν, and hence of the weights ω_ij^t, involves using all data instances, which would require access to the entire centralized dataset. In reality, however, only the locally built models f_j and some aggregate statistics need be exchanged between the k distributed sites.
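The weight adjustment of Equations 4.29-4.31 can be sketched as follows, assuming each staying probability is strictly less than 1 and that ε_t is large enough that ν times each staying probability remains below 1; all names are illustrative:

```python
import numpy as np

def adjust_weights(W_inf, current, eps_t):
    """Rescale unconstrained EM weights so expected data traffic is eps_t.

    W_inf   -- (n, k) unconstrained membership probabilities (Eq. 4.27).
    current -- length-n array, current site c_i of each instance.
    eps_t   -- allowed expected number of moved instances this iteration.
    """
    n, k = W_inf.shape
    stay = W_inf[np.arange(n), current]      # w_{i c_i} | inf, assumed < 1
    V_inf = n - stay.sum()                   # unconstrained traffic (4.28)
    nu = (n - eps_t) / (n - V_inf)           # factor nu (4.31)
    # Renormalize the "move" probabilities (Eq. 4.29) ...
    W = W_inf * ((1 - nu * stay) / (1 - stay))[:, None]
    # ... and boost the probability of staying put.
    W[np.arange(n), current] = nu * stay
    return W
```

Each row still sums to 1, and the expected number of moved instances, Σ_i (1 − ν ω_{ic_i}|∞), equals ε_t by construction.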
4.8 Conclusion
This chapter shows that the problem of dual intermediate strategies and the problem of finding data partitions may be solved by a constrained EM algorithm. It also shows how the general theoretical framework of a constrained EM algorithm may be applied to develop a family of algorithms that search for the optimal partitions of data across the distributed sites. A variety of different versions and modifications are possible within this framework.

In the next chapter, we introduce a scenario where it is beneficial to build a hierarchical model assignment system on a distributed collection of data. We also develop an algorithm called Greedy Data Labeling that is used to enhance the quality of the data partition so that a near-optimal model assignment system is produced. We show that a model assignment system enhanced by the GDL algorithm nearly always outperforms such methods as voting ensembles.
The GDL algorithm belongs to the more general family of algorithms that was introduced in
the current chapter through the constrained EM approach.
CHAPTER 5
5.1 Introduction
In traditional data mining, given a learning algorithm F and a dataset D, a predictive model f : x → y is built to capture the properties of the underlying data distribution and to predict the class value of unlabeled data instances.
In distributed data mining, there are several datasets that are geographically distributed, with data potentially coming from different underlying distributions. This requires a different approach. Chapter 2 addressed the issue of the fundamental trade-off in distributed data mining: namely, there are two extremes, to combine all data at a central site and build a single model there, which typically gives a more accurate result, or to mine the data in place, which is cheaper. In the latter scenario, several predictive models are built locally. Generally, there is also a so-called meta-model, an overstructure that regulates the deployment of the collection of local models f_1, ..., f_k, which we call base-models. The meta-model does not have to be of the same type F as the base-models.
There are several well-known ways to deploy a collection of models, whether built from several data subsets or from a combined dataset. One is the voting/averaging ensemble, where either an average or a majority vote of the individual models' predictions is used as the overall prediction of the system (Dietterich, 2000). In this case, the meta-model g0 is a simple averaging function. Ensemble learning has become one of the most popular methods due to both its simplicity and effectiveness. A more complex technique is the so-called meta-learning (Chan et al., 1995), where a meta-model is separately trained on the outputs of the base-models. When new unlabeled data is presented, the individual base-models make predictions, after which the meta-model reads their scores and makes the overall prediction. Both ensemble learning and meta-learning are bottom-up methods: all base-models are deployed for scoring and the meta-model combines their outputs into the final result.
A model assignment problem is the problem of building a top-down hierarchical predictive system. When new unlabeled data is presented, the meta-model is deployed first. Based on its output, the data is forwarded to one of the base-models, the one that handles that type of data better than the others. The output of the chosen base-model becomes the overall prediction.
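A top-down model assignment system then scores an instance as follows (a minimal sketch; names are illustrative):

```python
def assign_and_predict(x, meta_model, base_models):
    """Top-down model assignment: the meta-model routes the instance to a
    single specialized base-model, whose output is the final prediction."""
    color = meta_model(x)            # choose a base-model index (0..k-1)
    return base_models[color](x)     # only the chosen model is invoked
```

Unlike a voting ensemble, only one of the k base-models is ever evaluated per instance.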
The model assignment problem arises naturally from the observation that data may come from heterogeneous sources, i.e. from different underlying distributions. In this setting, it may be desirable to have a collection of specialized models, each corresponding to a different data distribution. Consider an application where there is a separate model predicting the risk of heart disease for each age group. When a new patient arrives, the meta-model will invoke the
predictive model that corresponds to that patient's age. There is no need to use any of the
other base-models.
In this chapter, we explore the scenario where a model assignment system is built on a distributed collection of data subsets, i.e. where the base-models are created by applying the chosen learning algorithm F to each of the local data subsets D_1, ..., D_k. It follows from the above argument that the model assignment approach would probably work best when the initial partition of the data across the k local sets reflects the differences in the underlying data types. This is also the scenario of the previous chapter.
5.1.1 Data Partitions
Each partition of a dataset D into subsets D_1, ..., D_k can be thought of as a way to color the combined dataset D with k different colors, after which a base-model is built on each color and a meta-model is trained to assign colors to unlabeled data instances. In general, our point of view is that, given a particular data mining task, there is an optimal partition of a dataset D across the k sites or, equivalently, an optimal coloring with k different colors, that results in the best model assignment system. However, the initial partition D_1, ..., D_k of the data may not be the best and hence needs to be optimized before the base-models are built on the local datasets.

In this chapter, we introduce a simple and efficient method for improving the quality of data partitions, called Greedy Data Labeling (GDL). It chooses individual data instances for relocation between the distributed data subsets D_1, ..., D_k so that the resulting subsets become, in a certain sense, more homogeneous within themselves. This gives an overall partition that better reflects the inner structure of the data distribution.
5.1.2 Objective
Our goal is to show that in a number of cases, a top-down model assignment system may
be preferable to traditional bottom-up methods such as ensemble learning. We also want to
demonstrate that the quality of the predictive system in both cases may depend significantly
on the initial distribution of data across multiple sites. We present experimental evidence
that the initial data distribution for a model assignment system may be significantly improved
by re-allocating small amounts of data between sites using the Greedy Data Labeling method,
and that a model assignment system enhanced by GDL outperforms a voting ensemble in most
cases.
In this chapter, as in the previous ones, we shall talk about local data subsets, data colors, or
base-models interchangeably, since every subset can be thought of as having a unique color, and
the corresponding base-model built on that set is uniquely defined by the choice of the data mining
algorithm F (up to a randomization, if applicable). We also note that since our interest is in
exploring the quality of data partitions, we do not address the case where some data instances
from one subset are added to another subset, thus becoming available in both subsets. In our
setting, each data instance belongs to only one subset (color) at a time.
5.1.3 Related Work
The GDL method is broadly related to clustering (Everitt, 1974), (Bradley et al., 1998).
Traditional clustering algorithms such as k-means use geometric proximity as a measure of
"tightness", whereas GDL uses a more general criterion based on the choice of the learning
algorithm F. We shall explain this point later. GDL also employs an idea related to boosting,
a method of weighted re-sampling of data (Freund, 1995). In boosting, previously misclassified
data instances are selected by the learning algorithm more frequently, which gives the predictive
model more chances to learn that portion of data.
Bagging (Breiman, 1996) is one of the main methods for building a collection of classifiers.
Ideally in bagging, each classifier is trained on a different subset of data drawn from the same
distribution. This is one of the scenarios examined in this chapter for both the ensemble and
model assignment approaches. In practice, the data subsets for bagging are usually created by
re-sampling the original dataset with replacement and hence may share some data instances.
A system where classifiers may specialize and/or abstain from voting is considered in (Freund
et al., 1997), although not in a distributed context. Several theoretical error bounds are
provided there.
There are various techniques to recover individual data distributions from a mixture of
distributions, the EM algorithm being one of the most popular (Dempster et al., 1977), (Jordan
et al., 1994). This connection is explored in detail in Chapter 4.
In (Chipman et al., 1999), CART decision trees are used to segment the data set into
subsets such that further submodels may be built on each subset for hierarchical model
selection. Their approach employs ideas from Bayesian analysis and Markov chain Monte
Carlo methods and performs a stochastic search in the space of the appropriate decision trees.
A competitive machine learning algorithm similar in part to GDL was independently
developed in bioinformatics: (Obradovic, 2002) uses it to automatically partition a set of available
disordered proteins into subsets with similar properties.
Conceptual clustering is a methodology for building a hierarchy of classes based on the
content of their knowledge objects (Stepp et al., 1986), (Michalski et al., 1983). A classification
scheme is produced according to how well objects fit descriptive concepts, not according to
simple similarity measures. One example is the COBWEB algorithm, which builds a clustering
tree where each node is a cluster and can be split into subclusters as children (Fisher, 1987).
For other related results, see (Fayyad et al., 1998), (Bradley et al., 1998), and (Zhang et al.,
1996).
Finally, there are several approaches to reducing algorithm complexity by randomization.
Some general ideas are discussed in (Ben-David et al., 1994) and (Gomez et al., 1998). In
the context of clustering problems, randomization frequently takes the form of using several
subsamples of the data to find the best initialization for clusters. Using subsampling, a
considerable speedup may be achieved, which is of crucial importance for non-linear algorithms
applied to large data sets. See (Rocke et al., 2000).
5.2

5.2.1 Assumptions
Let the data subsets D_1, ..., D_k be distributed over a network, each represented by a
different color. If combined, they comprise a single dataset D. However, we assume that it is
relatively expensive or otherwise undesirable to combine all data at a central location, while
processing data locally is relatively cheap.
A model assignment system is created as follows. A base-model f_i is built on each local data
subset D_i using a learning algorithm F. Then a sample of data from each color is taken and used
to train a meta-model that can predict the colors. Whenever a new data instance is presented
for classification, the meta-model will assign it a color and forward it to the appropriate
base-model for scoring.
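The prediction path just described can be sketched in a few lines. The following is a minimal illustration of the top-down scoring scheme, not an implementation from the thesis; the class name is ours, and `meta_model` and `base_models` stand in for any trained classifiers exposing a `predict` method.

```python
class ModelAssignmentSystem:
    """Top-down scoring: the meta-model picks one specialist base-model."""

    def __init__(self, meta_model, base_models):
        self.meta_model = meta_model    # predicts a color i in {0, ..., k-1}
        self.base_models = base_models  # base_models[i] was trained on D_i

    def predict(self, x):
        color = self.meta_model.predict(x)         # assign a color to x
        return self.base_models[color].predict(x)  # only that model scores x
```

Note the contrast with a voting ensemble: only a single base-model is evaluated per instance, which is what makes specialization pay off.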
The initial partition of data across the k sets may be far from optimal, hence not sharing any
data at all between local sites will likely produce an inferior model assignment system. Cost
and accuracy can be balanced by improving the distribution of the data but still mining it
in place afterward. The key is to share only a small number of data instances.
5.2.2 Optimization Problem
First, we define native colors. Given a data partition {D_i}_{i=1}^{k}, the color ν is native for a data
instance (x, y) ∈ D if

    ν = arg min_{1≤i≤k} ||y − f_i(x)||    (5.1)

i.e. if the base-model f_ν(x) built on the subset D_ν predicts the true class of that instance better
than the base-models of the other colors. Here, ||y − f_i(x)|| is the appropriately defined prediction error.
Note that several models may produce equally accurate predictions, hence native colors may
not be unique. In case of a tie, which is typical in classification problems, the native color is
chosen randomly among the candidates.
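As an illustration, the native-color rule of Equation 5.1, including the random tie-break, might be sketched as follows; the function name and the default 0/1 error are our own choices, not from the text.

```python
import random

def native_color(x, y, base_models, error=lambda y, yhat: float(y != yhat)):
    """Return the native color of the instance (x, y): the index of the
    base-model with the smallest prediction error, ties broken at random."""
    errors = [error(y, f(x)) for f in base_models]
    best = min(errors)
    candidates = [i for i, e in enumerate(errors) if e == best]
    return random.choice(candidates)  # a unique native color is the common case
```

Here the base-models are plain callables; any prediction function will do.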
The model assignment problem translates into finding the optimal partition that gives a
solution to the following error minimization problem:

    min Σ_{i=1}^{k} Σ_{(x,y)∈D_i} ||y − f_i(x)||    (5.2)
where the minimum is taken over all possible data partitions. This is a combinatorial
optimization problem that searches for the best possible k-coloring of the dataset D among k^{|D|}
possible colorings, where |D| is the dataset size. We observe that, as a necessary condition of
optimality, all data instances must belong to their native colors. Indeed, if an instance (x, y)
belongs to a current color D_c that is not native, moving it to its native color D_ν would decrease
the corresponding term in the sum in Equation 5.2 by

    Δ(x, y) = ||y − f_c(x)|| − ||y − f_ν(x)||    (5.3)

The core idea behind the Greedy Data Labeling method is to move data instances to their
native colors in a way that gives a greedy solution to Equation 5.2. Finding an instance with
maximal Δ(x, y) and moving it to its native color is equivalent to making the largest single-term
reduction in Equation 5.2 and hence the largest step in the direction of the steepest descent.
To prevent possible instability in greedy methods, the size of the allowable greedy step
is usually reduced in each subsequent iteration. Because in our case the process of moving
data instances is discrete, the same effect may be achieved by reducing the probability of data
relocation. We shall show this later in the context of simulated annealing.
However, when data instances are moved between subsets, the base-models of the affected
subsets must be re-built to reflect the new partition. The terms in the sum in Equation 5.2 are
thus interdependent, which complicates the greedy nature of the steepest descent.
Note also that if a perfect meta-model were available for the current data partition, the
resulting hierarchical model assignment system for the dataset D would have the predictive
function

    h(x) = f_i(x)  when (x, y) ∈ D_i    (5.4)

and its total error on D would be

    Σ_{(x,y)∈D} ||y − h(x)||    (5.5)

We also wish to point out the inherent similarity between Equation 5.2 and Equation 5.5
on one side and Equation 4.24 and Equation 4.25 on the other side.
5.2.3
Compute the error i = jjy fi(x)jj on this instance for each of the k base-models.
80
Identify the native model f that gives the least error. If not unique, choose
randomly among the candidates.
Compute the largest possible error reduction = c for that data instance,
where c is its current color.
{ Select the data instance with max . Move it to its native subset.
{ Update the base-models of the subsets that were aected by the move.
Note that to implement this algorithm, there must be a copy of every base-model available
at each of the k distributed sites {D_i}. This is easy to achieve, e.g. by exchanging the models
between sites after each model update in Predictive Model Markup Language (PMML) format,
which is now the emerging industry standard for encoding statistical models. See (Grossman
et al., 1999). Models in PMML format are typically much smaller than the data on which they
are built, so the additional network traffic overhead is negligible.
As was mentioned before, there are noticeable similarities between the GDL method and
some unsupervised clustering techniques. In both cases, instances are moved between data
subsets to improve a certain measure of "tightness" within subsets. In clustering, this measure
is usually related to the geometric proximity of the data. For example, the k-means clustering
algorithm is initially seeded with k centroid vectors. Each new data instance is placed into the
cluster with the closest centroid, after which the centroid is recomputed. The distance metric
is best defined by the covariance matrix of the data distribution in that cluster. In GDL,
the algorithm is seeded by the initial data partition across k sites. Each new data instance is
placed into the subset with the least prediction error, as given by the predictive model computed
on the data from that subset. After the placement, the model is recomputed. The "tightness"
criterion in Equation 5.2 is more general and is defined by the chosen type F of the base-model
algorithm.
5.2.4

There are several potential problems with the ideal version of GDL. First, as each data
instance shift affects the data partition, re-building base-models too often is impractical.
Secondly, depending on the initial conditions, the process may reach a local minimum different
from a globally optimal partition. Also, since the function minimized in Equation 5.2 and
Equation 5.5 is essentially the empirical error on the training set, overfitting may potentially
become an issue.
Below is a more efficient version of the algorithm. Initialize a batch size parameter T; then,
do the following until all data points are in their native subsets:

- For every data instance (x, y) in D:
  - Compute the prediction error δ_i = ||y − f_i(x)|| on this instance for each of the k
    base-models.
  - Identify the native model f_ν that gives the least error. If it is not unique, choose
    randomly among the candidates.
  - Compute the greedy step Δ = δ_c − δ_ν for that data instance, where c is its current
    color.
  - If the data instance is not in its native subset (Δ > 0), mark it as a candidate.
- Use a randomization technique to select up to T candidates and move them to their
  native subsets.
- Update all base-models.
- Reduce the batch size parameter T before the next iteration.
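The batched procedure above can be sketched as follows. This is a deterministic illustration under simplifying assumptions of our own: ties are broken toward the lowest index rather than at random, and the randomized candidate selection is replaced by simply taking the T largest greedy steps; `fit` stands for the learning algorithm F.

```python
def greedy_data_labeling(subsets, fit, error, T):
    """Batched GDL sketch: move up to T instances with the largest greedy
    steps to their native colors, re-fit all base-models, halve T, repeat."""
    models = [fit(D) for D in subsets]
    while T >= 1:
        stay = [[] for _ in subsets]   # instances that keep their color
        move = [[] for _ in subsets]   # instances arriving at a new color
        candidates = []
        for c, D in enumerate(subsets):
            for item in D:
                x, y = item
                errs = [error(y, f(x)) for f in models]
                nu = errs.index(min(errs))  # native color (first minimizer)
                candidates.append((errs[c] - errs[nu], c, nu, item))
        candidates.sort(key=lambda m: -m[0])  # largest greedy steps first
        budget = int(T)
        for delta, c, nu, item in candidates:
            if delta > 0 and budget > 0:
                move[nu].append(item)         # relocate to the native subset
                budget -= 1
            else:
                stay[c].append(item)
        if not any(move):
            break                             # every instance is native
        subsets = [stay[i] + move[i] for i in range(len(subsets))]
        models = [fit(D) for D in subsets]    # update all base-models
        T /= 2                                # shrink the batch
    return subsets, models
```

With a majority-class predictor as `fit` and 0/1 error, the loop separates mixed subsets into label-homogeneous ones after a couple of model updates, matching the "few updates before convergence" behavior described below.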
This improved version is efficient enough to allow a practical implementation of the GDL
algorithm, as it requires only a few updates of the base-models before convergence.
There are several randomization schemata available for selecting data instances that allow
the algorithm to escape local minima. The most notable is the so-called simulated annealing
(Laarhoven et al., 1987), a standard technique used to avoid local minima in combinatorial
optimization by permitting sub-optimal steps in a greedy descent. Although annealing makes each
individual run of the algorithm longer, the alternative is to make multiple runs of the algorithm from
different initial conditions without annealing and then choose the best of the obtained results,
which could be quite expensive. Overall, a randomization method like simulated annealing
is often preferable. When implemented in GDL, it selects a data instance with probability
p = e^{−1/((Δ+1)t)}, where t is a control parameter tied to the batch size T. The smaller the value of
1/((Δ+1)t), the closer p is to one, so the algorithm can accept sub-optimal moves
with non-zero probability and escape local minima better at the expense of a larger number of
iterations. See also (Ben-David et al., 1994).
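For concreteness, the annealed selection step might look like the sketch below. The probability form p = e^{−1/((Δ+1)t)} follows the reconstruction above and should be treated as an assumption, as should the function name.

```python
import math
import random

def annealed_select(candidates, T, t):
    """Keep each candidate move with probability p = exp(-1/((delta + 1) * t));
    stop once T moves have been accepted. Larger greedy steps (delta) and a
    larger control parameter t both raise the acceptance probability."""
    chosen = []
    for delta, move in candidates:
        p = math.exp(-1.0 / ((delta + 1.0) * t))
        if random.random() < p:
            chosen.append(move)
        if len(chosen) == T:
            break
    return chosen
```

Shrinking t along with the batch size T makes acceptance rarer over time, which is the discrete analogue of reducing the greedy step size.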
Also, we have not noticed any problems related to overfitting in our experiments.
5.2.5

Once the data partition {D_i} is found by the GDL algorithm, we take a sample of the data
instances from all subsets, replace their class values with their colors, and train a meta-model
on the resulting sample. The meta-model thus learns to classify data into the k colors defined
by the current data partition.
An important point is that we only take a sample of the data for building the meta-model. We
call such a sample a meta-sample. Using the whole dataset to train the meta-model would require
moving all data to a centralized location, which defeats the purpose of processing the data locally.
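A sketch of the meta-sample construction: each sampled instance keeps its attributes but has its class value replaced by its color index, so the meta-model learns the partition rather than the original classes. The helper name and the per-subset sampling scheme are illustrative assumptions of ours.

```python
import random

def build_meta_sample(subsets, fraction):
    """Sample `fraction` of each colored subset, relabeled with the color index."""
    sample = []
    for color, D in enumerate(subsets):
        n = max(1, round(fraction * len(D)))  # at least one instance per color
        for x, _y in random.sample(D, n):     # class value _y is discarded
            sample.append((x, color))
    return sample
```

Any classifier trained on this relabeled sample then plays the role of the meta-model.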
5.3 Experiments

5.3.1 Datasets
The goal of our experiments was two-fold. First, we wanted to show that a model assignment
approach may in certain situations outperform traditional bottom-up hierarchical techniques
such as ensemble learning that use all base-models for prediction. Secondly, we wanted to
explore the quality of data partitions and to demonstrate the superiority of the partitions
found by GDL over other possible partitions in the model assignment context.
In our preliminary trials we discovered an interesting feature of the GDL method: it
consistently gives superior partitions on datasets with discrete attributes, whereas its advantage
on data with continuous attributes is considerably less pronounced. We are still investigating
the reasons behind this phenomenon. Our understanding is that because the space over which
the combinatorial search is performed is greatly reduced when attributes are discrete, the GDL
optimization task is simplified substantially. In the future, we shall explore the effectiveness
of GDL applied in conjunction with various attribute discretization techniques. In this
chapter, however, we restrict our experiments to purely discrete data.

TABLE VII
DESCRIPTION OF THE UCI DATASETS

Dataset Name     Num of Instances  Num of Attributes  Num of Classes
Balance Scale          625                 4                 3
Tic-Tac-Toe            958                 9                 2
Car Evaluation        1728                 6                 4
Chess                 3196                36                 2
Nursery              12960                 8                 5
We selected several datasets from the UCI Machine Learning Repository that satisfy three
criteria: (a) all attributes are discrete, (b) the number of data instances is at least 500, (c) the
domain is sufficiently complex. The latter criterion, for example, made us reject datasets for
which the initial partition gave less than 1% classification error (so that hardly any improvement
was possible) or for which the non-zero class occurrence is extremely rare (so that a base-model
of type F built on the data always predicts zero).
The datasets that we used for the tests are described in Table VII, with the number of
instances, attributes, and class values.
Although these experimental datasets are easy to analyze in a centralized fashion and are
not massive enough to pose any cost-related problem, our current objective is to test the quality
of the model assignment approach and the GDL partitions in different scenarios rather than to
estimate the data analysis cost. For such a task, the UCI data is quite adequate.
5.3.2 Partitions
To examine the quality of data partitions produced by the GDL method, we compared three
types of partitions. The first is the initial partition, given by the way the data was split across
the k sites at the beginning.
into k parts. However, the latter approach may give an optimistic bias in the experiments in
which the meta-model is chosen to be of the C4.5 type, since all it needs is to replicate the
original k-leaf tree.
Once the initial data partition of either the same-source type or the different-sources type
was obtained, we used it as a starting point to create the corresponding RandMix and GDL
partitions. In the RandMix approach, we updated the initial partition by making each subset
D_i exchange a fixed percentage of its data (the data mixing parameter) with the other subsets,
divided equally among the receiving subsets.
For GDL, we initialized the batch size parameter T to half the mixing percentage of |D|, for
each of the same mixing values as in RandMix. Because GDL cuts T in half after each iteration,
this choice ensures that the total data traffic does not exceed 1/2 + 1/4 + ... < 1 times the
mixing percentage. Our goal was to demonstrate that with GDL, we get superior colorings while
moving less data than with RandMix. We also used the same set of data transfer parameters
to create a boosted data partition used for building a boosted ensemble as described in Chapter 3.
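The geometric-series bound on GDL's total data traffic is easy to check numerically; the following small sketch (our own, with an illustrative mixing value) sums the halving per-iteration caps.

```python
def gdl_traffic_bound(mix_percent, iterations=50):
    """Total data (in percent of |D|) that GDL may move when the first batch
    caps traffic at mix/2 percent and each later batch halves the cap:
    mix/2 + mix/4 + ... < mix."""
    return sum(mix_percent / 2 ** i for i in range(1, iterations + 1))
```

The partial sums approach the mixing percentage from below but never reach it, which is the guarantee used above.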
5.3.3 Test Results
Each partition was tested by 5-fold cross-validation in the following manner. A validation set
was withheld in each fold. After partitioning the data, a base-model was built on each subset.
A sample of the chosen meta-sample percentage was then taken from each color to build a
meta-model. The accuracy of the resulting model assignment system was tested on the validation
set. Also, for the initial and random-mixture partitions, we combined all locally built base-models
into a voting ensemble and tested the ensemble's accuracy on the validation set. We also built a
boosted ensemble, as described in Chapter 3, to compare a boosted data partition to a random
partition. The goal was to investigate whether the base-models work better as an ensemble or
as a model assignment system for each data partition type.
In our experiments, we chose a simplified interpretation of the simulated annealing. With
zero-or-one classification error (match or no match), Δ(x, y) also takes values 0 or 1 (native or
non-native current color), in which case p = e^{−1/(2t)} for every non-native instance and p = 0
for every native one, so the candidates are effectively selected at random.
For building and testing predictive models, we again used the Weka freeware data mining
package (Witten and Frank, 1999). We chose C4.5 decision trees as the type F of the base-models
and experimented with three different types of meta-models: C4.5 trees, naive Bayesian models,
and one-nearest-neighbor models.
We used the following parameter values:
- The number of different colors was k = 3 or k = 10.
- The meta-sample percentage was 100, 50, 25, and 10%. We were interested in the
  sensitivity of GDL to decreasing the meta-sample size.
- The data mixing parameter was 10, 20, 40, 60, 80, and 90%. For example, with k = 10
  colors, a value of 90% results in each subset retaining 10% of its original data and acquiring
  10% of the data from each of the other 9 subsets, hence all 10 subsets having essentially
  the same data distribution regardless of the initial partition. For the same number of
  colors, a value of 10% means retaining as much as 90% of the data at each color and
  exchanging only 10% for data from the other 9 colors, about 1.1% from each.
To ensure robust test results, 5-fold cross-validation trials were run for each partition, and
three different samples were collected for each value of the meta-sample size in each trial.
The results were then averaged. For the RandMix, Boosted, and GDL partitions, they were
further averaged over all values of the data mixing parameter.
We discovered that of the three meta-model types used with C4.5 decision tree base-models,
the nearest neighbor model is the best one. In Tables VIII through XII, we compare a voting
ensemble of the locally built C4.5 base-models (ENS) to a model assignment system with the
same collection of base-models and a nearest neighbor meta-model (NN1). The Boost/GDL
lines correspond to boosted ensembles and GDL-enhanced model assignment systems. The
tables show the classification errors on each UCI dataset for both initial partition splits
(same-source and different-sources), for different numbers of models and partition types, and for
different meta-sample sizes in the case of model assignment. Table XIII shows the average over
all datasets.
We make several observations:
For both same-source and different-sources data partitions, the accuracy of the predictive
system drops as the number k of distributed sites increases. This confirms our earlier suggestion
that combining data into fewer subsets results in better predictive systems, which is true for
both the ensemble learning and the model assignment methods. Also, the quality of the model
TABLE VIII
BALANCE SCALE DATASET: ENSEMBLE ERRORS VS. MODEL ASSIGNMENT ERRORS

Data    k   Partition   ENS    NN1 Meta-sample Size:
Source      Method             100%   50%   25%   10%
Same    3   Initial     28.3   30.3  31.8  32.5  33.1
            RandMix     28.8   31.6  32.5  33.4  33.8
            Boost/GDL   30.8   25.4  27.5  29.2  30.8
Same   10   Initial     24.3   33.7  34.9  35.9  36.9
            RandMix     24.2   32.4  33.8  34.8  35.8
            Boost/GDL   24.4   26.7  29.0  31.3  33.6
Differ  3   Initial     29.9   29.6  31.2  33.5  34.2
            RandMix     28.4   30.1  31.5  33.1  33.9
            Boost/GDL   29.9   26.7  27.6  29.8  32.1
Differ 10   Initial     31.9   29.3  30.6  33.3  34.9
            RandMix     26.4   30.6  32.6  34.2  35.5
            Boost/GDL   26.9   26.4  27.3  29.9  31.4
assignment system drops as the size of the meta-sample decreases. This shows that, given the
same collection of base-models, care must be taken to build a sufficiently accurate meta-model.
In the homogeneous data case (same-source initial partition), we observe that ensembles
built on the initial data partitions are, in general, superior to model assignment systems built on
the same partitions, even with a 100% meta-sample size. Exchanging random samples of data
between subsets gives no consistent advantage or disadvantage to either ensembles or model
assignment, perhaps because mixing data that is already homogeneous makes no
essential difference in the underlying data partition. Ensembles, therefore, remain preferable
in RandMix partitions. However, model assignment systems enhanced by GDL consistently
TABLE IX
TIC-TAC-TOE DATASET: ENSEMBLE ERRORS VS. MODEL ASSIGNMENT ERRORS

Data    k   Partition   ENS    NN1 Meta-sample Size:
Source      Method             100%   50%   25%   10%
Same    3   Initial     19.5   19.4  20.3  21.6  22.3
            RandMix     19.4   19.0  20.3  21.2  22.1
            Boost/GDL   19.4   13.9  16.1  18.4  20.5
Same   10   Initial     25.9   21.8  23.9  25.9  27.6
            RandMix     26.4   22.2  24.7  26.8  28.5
            Boost/GDL   25.8   12.1  16.1  20.1  24.2
Differ  3   Initial     31.5    9.6  11.1  15.6  23.0
            RandMix     24.5   14.3  16.6  19.5  23.0
            Boost/GDL   24.6    6.9   9.2  13.8  22.0
Differ 10   Initial     37.2   12.6  15.7  20.7  27.2
            RandMix     28.3   16.6  20.4  24.2  27.4
            Boost/GDL   30.5    9.8  13.8  19.1  26.3
outperform ensembles, sometimes even with meta-samples as small as 10%, although this effect
usually diminishes as the meta-sample size decreases.
In the case of heterogeneous data (different-sources initial partition), we note how poorly
ensembles perform on the initial partitions. Their accuracy can be improved dramatically by
exchanging random samples of data between subsets. This is due to the effect of creating a
more homogeneous data distribution across the k sites, an environment in which ensembles are
preferable. In effect, ensembles are "penalized" if their base-models are over-specialized. On
the other hand, model assignment systems appear to be far superior to ensembles even with
small meta-sample sizes and without using GDL. With GDL, their accuracy improves even
TABLE X
CAR EVALUATION DATASET: ENSEMBLE ERRORS VS. MODEL ASSIGNMENT ERRORS

Data    k   Partition   ENS    NN1 Meta-sample Size:
Source      Method             100%   50%   25%   10%
Same    3   Initial     13.2   13.6  14.5  15.0  15.4
            RandMix     13.1   13.6  14.1  14.6  15.0
            Boost/GDL   13.5   11.4  12.8  13.8  14.7
Same   10   Initial     18.1   18.8  19.5  20.0  20.7
            RandMix     18.0   18.4  19.5  20.0  20.8
            Boost/GDL   18.0   13.1  15.9  17.8  19.6
Differ  3   Initial     27.7    8.2   8.3   9.9  11.1
            RandMix     16.9   10.9  11.5  12.8  13.6
            Boost/GDL   15.7    8.3   8.1  10.0  11.3
Differ 10   Initial     27.9   10.1  12.1  15.6  18.1
            RandMix     20.1   14.0  16.2  18.4  20.0
            Boost/GDL   19.7    8.9  10.9  14.9  17.7
further. This is the effect that we expected, since model assignment systems are "rewarded"
if their base-models over-specialize, and GDL is designed to make this effect even more
pronounced.
We note that despite some successes, boosting did not result on average in a significant
drop of the ensemble error even in the case of different-sources data partitions, and it was slightly
detrimental in the case of a same-source partition, the latter fact already observed in Chapter 3.
It appears that the RandMix strategy for ensembles is generally quite adequate. We shall
therefore use it as a benchmark in what follows.
TABLE XI
CHESS DATASET: ENSEMBLE ERRORS VS. MODEL ASSIGNMENT ERRORS

Data    k   Partition   ENS    NN1 Meta-sample Size:
Source      Method             100%   50%   25%   10%
Same    3   Initial      1.3    1.7   1.6   1.6   1.6
            RandMix      1.3    1.6   1.6   1.7   1.8
            Boost/GDL    1.1    1.2   1.2   1.4   1.6
Same   10   Initial      3.3    3.9   4.1   4.3   4.6
            RandMix      3.4    3.7   3.9   4.1   4.4
            Boost/GDL    3.4    3.0   3.6   3.7   4.3
Differ  3   Initial     14.5    2.8   4.0   5.1   6.9
            RandMix      2.1    1.8   2.0   2.2   2.5
            Boost/GDL    1.3    2.6   3.9   5.1   6.9
Differ 10   Initial     20.8    6.0   8.7  11.2  15.0
            RandMix      5.2    4.0   4.8   5.7   6.7
            Boost/GDL    4.8    5.8   8.6  11.2  15.1
5.3.4 Efficiency
We also observed that the choice of the meta-model algorithm affects the performance
significantly. It follows from the previous remarks that ensembles generally work better when built
on a RandMix partition, while model assignment systems are almost always greatly improved
by using a GDL partition. In Table XIV, we compare ensemble systems to model assignment
systems when both are used under optimal conditions: ensembles are built on RandMix
partitions and model assignment systems are built on GDL partitions with a meta-sample size
of 100%. We use all three meta-model types: nearest neighbor (NN1), C4.5 tree (C45), and
naive Bayesian model (NB). Table XIV presents the classification errors for each case. We also
TABLE XII
NURSERY DATASET: ENSEMBLE ERRORS VS. MODEL ASSIGNMENT ERRORS

Data    k   Partition   ENS    NN1 Meta-sample Size:
Source      Method             100%   50%   25%   10%
Same    3   Initial      5.2    5.5   5.8   5.9   6.0
            RandMix      5.2    5.6   5.9   6.0   6.2
            Boost/GDL    5.4    4.6   5.1   5.5   5.9
Same   10   Initial      8.6    8.3   8.7   9.0   9.4
            RandMix      8.6    8.2   8.7   9.0   9.3
            Boost/GDL    8.8    5.8   6.8   7.8   8.5
Differ  3   Initial     38.5    3.5   3.5   5.3   7.9
            RandMix      6.4    4.5   4.7   5.0   5.3
            Boost/GDL    6.3    3.5   3.5   5.3   8.1
Differ 10   Initial     34.6    3.5   5.2  10.9  14.9
            RandMix      9.3    5.4   6.4   7.8   8.8
            Boost/GDL    9.2    3.2   4.9  10.5  14.7
show the average percentage of data moved by GDL and the average number of GDL iterations
in each case.
Nearest neighbor meta-models appear to be the best type of meta-model, followed by
C4.5 trees. Naive Bayesian models are clearly undesirable.
We see that in terms of data traffic, GDL moves only a small percentage of the data between
subsets. Furthermore, we observed that the amount of data actually moved typically
stays about the same even if the batch size T is increased to allow more potential data traffic.
This is due to the fact that only a small portion of the data needs to be relocated. Note also
that much less data is moved when the initial partition already represents different sources.
TABLE XIII
AVERAGE OVER ALL DATASETS: ENSEMBLE ERRORS VS. MODEL ASSIGNMENT ERRORS

Data    k   Partition   ENS    NN1 Meta-sample Size:
Source      Method             100%   50%   25%   10%
Same    3   Initial     13.5   14.1  14.8  15.3  15.7
            RandMix     13.6   14.3  14.9  15.4  15.8
            Boost/GDL   14.0   11.3  12.5  13.7  14.7
Same   10   Initial     16.0   17.3  18.2  19.0  19.8
            RandMix     16.1   17.0  18.1  18.9  19.8
            Boost/GDL   16.8   12.1  14.3  16.1  18.0
Differ  3   Initial     28.4   10.7  11.6  13.9  16.6
            RandMix     15.7   12.3  13.3  14.5  15.7
            Boost/GDL   15.5    9.6  10.5  12.8  16.1
Differ 10   Initial     30.5   12.3  14.5  18.3  22.0
            RandMix     17.9   14.1  16.1  18.1  19.7
            Boost/GDL   18.2   10.8  13.1  17.1  21.0
TABLE XIV
ERROR RATES OF AN ENSEMBLE OF C4.5 TREES BUILT ON A RANDMIX PARTITION
VS. MODEL ASSIGNMENT SYSTEMS BUILT ON THE GDL PARTITION, WITH C4.5
BASE-MODELS AND DIFFERENT TYPES OF META-MODELS.

Dataset  Initial   k   ENS    Meta-model type     Actual    GDL
         source              NN1    C45    NB    transfer  iterations
Balance  Same      3   28.8   25.4  29.4  32.2    13.7      4.0
         Same     10   24.2   26.7  30.5  31.0    17.3      4.7
         Differ    3   28.4   26.7  28.7  27.6    12.3      3.9
         Differ   10   26.4   26.4  28.7  27.5    13.9      4.8
T-T-T    Same      3   19.4   13.9  19.1  22.2    10.5      5.0
         Same     10   26.4   12.1  24.0  27.7    15.6      5.5
         Differ    3   24.5    6.9  13.9  27.4     4.1      3.7
         Differ   10   28.3    9.8  14.8  24.6     4.0      4.0
Car      Same      3   13.1   11.4  12.7  14.1     5.9      3.9
         Same     10   18.0   13.1  15.5  20.7    11.2      5.2
         Differ    3   16.9    8.3   9.3   9.4     3.3      3.5
         Differ   10   20.1    8.9   9.7  12.3     4.9      4.7
Chess    Same      3    1.3    1.2   1.3   1.4     0.6      2.4
         Same     10    3.4    3.0   2.3   3.3     2.1      3.9
         Differ    3    2.1    2.6   1.8   4.5     0.8      2.3
         Differ   10    5.2    5.8   2.0   6.8     0.4      2.6
Nursery  Same      3    5.2    4.6   5.1   6.4     2.7      3.7
         Same     10    8.6    5.8   6.3   9.8     5.6      5.1
         Differ    3    6.4    3.5   3.4   3.5     0.8      2.8
         Differ   10    9.3    3.2   3.4  15.2     1.7      5.5
Average  Same          14.9   11.7  14.6  16.9     8.5      4.3
         Differ        16.8   10.2  11.6  15.9     4.6      3.8
5.4 Conclusion
In this chapter, we compared several scenarios of distributed data mining where the data is
distributed across k sites and combining it in one central location is undesirable. It appears
that in the case of a homogeneous data distribution, hierarchical systems such as ensembles
that use all base-models may be preferable to model assignment systems that have specialized
base-models. However, the latter may be superior in cases where the distributed data partition
reflects the differences in the underlying data distributions.
We also introduced a method called Greedy Data Labeling (GDL) that enhances the data
partition in the model assignment setting. GDL is broadly related to clustering and allows
us to exchange small portions of data between the distributed subsets before the learning
algorithms are applied. The data is chosen in a special way. The resulting model assignment
systems outperform traditional ensembles in both homogeneous and heterogeneous data
environments, even when the ensembles are also enhanced by an appropriate data exchange
strategy, namely, a strategy of exchanging either random or boosted subsets of data between
the sites before local models are built. GDL is inherently similar to the family of constrained
EM algorithms introduced in Chapter 4.
Future work may include developing more advanced versions of the GDL algorithm based on
different data weighting schemata and randomization mechanisms. Another direction is to allow
data instances to belong to multiple colors and to build base-models on overlapping partitions.
CHAPTER 6
CONCLUSION
This dissertation is, to our knowledge, the first work that addresses the fundamental
trade-off between cost and accuracy in distributed data mining. There exists a substantial body
of research on improving data mining accuracy in the case when the entire dataset is immediately
available to the learning algorithm. Separately, a number of methods in distributed data
mining are based on local data processing, which we call an in-place strategy. However, as we
showed in this work, the lack of full information about the other parts of the dataset may cause
an unacceptable loss of accuracy in the in-place case. On the other hand, combining all data at
the same central location usually gives the most accurate results but may be too expensive to
implement due to the cost of network traffic, data aggregation, and algorithm complexity.
Our point of view is that there must be an intermediate strategy that balances cost and
accuracy by moving only a portion of the data across the network. The exact nature of such a
strategy may depend on a variety of factors. This calls for developing a formal framework that
can serve as a base for developing new techniques for finding data transfer strategies.
In this dissertation, we showed that finding the intermediate strategies may be formulated
as a mathematical optimization problem. There are two options available for balancing cost
and accuracy: either to set a tolerance for the accuracy and then minimize the cost, or to set
a tolerance for the cost and then optimize the accuracy.
We demonstrated that in the former case, the problem becomes a problem of convex
optimization and hence standard methods are available to solve it. We also showed how this
approach may be used for analysis of a number of interesting situations where intermediate
strategies occur naturally.
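As an illustration of why convexity helps, the following toy sketch (our own construction; the accuracy curve, constants, and function names are invented, not taken from the dissertation) solves the cost-minimization variant for a single site by bisection, which is valid here because the toy accuracy curve is increasing and concave in the transferred fraction:

```python
import math

def accuracy(x, a_inplace=0.80, a_central=0.95, k=5.0):
    # Toy concave accuracy curve: moving a fraction x of the data toward the
    # center interpolates between in-place and fully centralized accuracy.
    return a_inplace + (a_central - a_inplace) * (1.0 - math.exp(-k * x))

def min_cost_strategy(tol, cost_per_unit=1.0, eps=1e-9):
    """Smallest transfer fraction x in [0, 1] with accuracy(x) >= tol.

    Because accuracy() is increasing and concave, bisection on x solves the
    convex program  min cost_per_unit * x  s.t.  accuracy(x) >= tol.
    """
    if accuracy(1.0) < tol:
        return None  # even full centralization cannot meet the tolerance
    lo, hi = 0.0, 1.0
    while hi - lo > eps:
        mid = 0.5 * (lo + hi)
        if accuracy(mid) >= tol:
            hi = mid  # mid already meets the tolerance; search lower
        else:
            lo = mid
    return hi, cost_per_unit * hi
```

Here min_cost_strategy(0.90) returns the smallest transfer fraction that meets the 90% accuracy tolerance together with its cost, and returns None when even moving all the data cannot meet the tolerance.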
In the latter case, i.e., when the cost is restricted and the accuracy is optimized, the
problem of finding a balanced data transfer strategy leads into the area of the quality of
data partitions and mixtures of models, which in itself is a large and well-established
area of statistics and data mining. Pursuing this direction, we have developed another
mathematical framework that formalizes the bridge between the trade-off in distributed
data mining and the mixture-of-models scenario. This framework is based on the classic
Expectation-Maximization approach. We showed how cost constraints may be translated into
restrictions in the hidden-variable space, from which interesting likelihood bounds may be
found. We also demonstrated how to apply this framework in particular situations.
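To make the idea of restricting the hidden-variable space concrete, here is a minimal sketch (ours, not the dissertation's actual framework): a two-component one-dimensional Gaussian-mixture EM in which each point's responsibilities are zeroed outside an allowed set of components, a toy stand-in for cost constraints that forbid certain data-to-model assignments:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def constrained_em(data, allowed, iters=50):
    """EM for a two-component 1-D Gaussian mixture where each point's hidden
    variable is restricted to the subset allowed[i] of components
    (illustrative only; not the thesis's exact formulation)."""
    mu = [min(data), max(data)]
    sigma = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibilities, zeroed outside each point's allowed set.
        resp = []
        for x, ok in zip(data, allowed):
            w = [pi[j] * normal_pdf(x, mu[j], sigma[j]) if j in ok else 0.0
                 for j in range(2)]
            s = sum(w) or 1.0
            resp.append([wj / s for wj in w])
        # M-step: standard weighted parameter updates.
        for j in range(2):
            nj = sum(r[j] for r in resp) or 1e-12
            mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, data)) / nj
            sigma[j] = max(math.sqrt(var), 1e-3)
            pi[j] = nj / len(data)
    return mu, sigma, pi
```

With allowed = [{0, 1}] * n this reduces to ordinary EM; a singleton allowed set per point forces a hard assignment, mimicking a constraint that a data instance may only be explained by its own site's model.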
There may be other important issues related to the way the data is partitioned. We
explored them in our experimental studies with data from the UCI Machine Learning
Repository, as well as with gene expression microarray data.
This work also attempts to identify the best way to use a collection of data mining models
built in a distributed environment. In particular, the usual bottom-up approach where all
models participate in making predictions, such as ensemble learning, is compared to a top-down
approach of a model assignment system where individual models specialize. We demonstrated
that a model assignment system is frequently preferable as long as it is built on a "good"
partition of data across distributed sites. We introduced an algorithm called Greedy Data
Labeling that allows us to find such data partitions. Although developed for a different
purpose, this algorithm is related to the general family of constrained EM algorithms that
was formalized earlier.
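The actual Greedy Data Labeling algorithm is defined in Chapter 4; purely as an illustration of the greedy-relabeling idea behind it, a minimal sketch with invented names (essentially a one-dimensional k-means-style relabeling loop, not the thesis's algorithm) might look like:

```python
def greedy_relabel(data, labels, n_sites, max_passes=20):
    """Greedy relabeling sketch in the spirit of Greedy Data Labeling
    (names and details invented): repeatedly move each point to the site
    whose current per-site model fits it best, until labels stabilize."""
    labels = list(labels)
    for _ in range(max_passes):
        # Fit a trivial per-site "model": the centroid of that site's points.
        centroids = []
        for s in range(n_sites):
            pts = [x for x, l in zip(data, labels) if l == s]
            centroids.append(sum(pts) / len(pts) if pts else float("inf"))
        changed = False
        for i, x in enumerate(data):
            # Greedily reassign the point to the best-fitting site.
            best = min(range(n_sites), key=lambda s: abs(x - centroids[s]))
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels
```

The point of the sketch is only the control flow: base models are refit, labels are greedily reassigned, and the loop stops when the partition is stable.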
Overall, because the question of the trade-off between cost and accuracy in distributed
data mining is new, one of the objectives of this dissertation was to survey the issues
that arise and to serve as a base for further research in this direction. Therefore, we
tried to develop a formal mathematical treatment of the problem. At the same time, this
work contains a number of ad hoc techniques and heuristics that allow alternative
interpretations and in-depth extensions along the same general lines. We hope that a good
balance between rigor and flexibility was reached, which will facilitate further research
and development.
CITED LITERATURE
Aggarwal, C.C., Hinneburg, A. and Keim, D.A.: On the surprising behavior of distance metrics
in high dimensional spaces. Proc. International Conference on Database Theory 420-434,
2001.
Bates, D. M. and Watts, D. G.: Nonlinear regression analysis and its applications. Wiley, 1988.
Ben-David, S., Borodin, A., Karp, R., Tardos, G. and Wigderson, A.: On the power of randomization in on-line algorithms. Algorithmica 11(1):2-14, 1994.
Bradley, P.S., Fayyad, U. and Reina, C.: Scaling clustering algorithms to large databases.
Proc. 4th Int. Conf. Knowledge Discovery and Data Mining (KDD-98) 9-15, Menlo Park,
CA, AAAI Press, 1998.
Breiman, L.: Bagging predictors. Machine Learning, 24(2):123-140, 1996.
Breiman, L.: Random forests, random features. Technical report, University of California,
Berkeley, 1999.
Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J.: Classification and Regression
Trees, Wadsworth and Brooks, 1984.
Chan, P.K. and Stolfo, S.J.: Learning arbiter and combiner trees from partitioned data
for scaling machine learning. Proc. 1st Int. Conf. Knowledge Discovery and Data Mining
(KDD-95) 39-44, Menlo Park, CA, AAAI Press. 1995.
Chvatal, V.: Linear Programming. Freeman and Co., 1983.
Cheung, A., and Reeves, A.: High performance computing on a cluster of workstations. Proc. 1st International Symposium on High Performance Distributed Computing
152-160, September 1992.
Chipman, H., George, E. and McCulloch, R.: Segmentation via tree models. Presentation
at the Chicago Chapter of the American Statistical Association, Jun 22 1999, avail. at
http://gsbrem.uchicago.edu/talks/.
Cormen, T., Leiserson, C., and Rivest, R.: Introduction to Algorithms. MIT Electrical Engineering and Computer Science Series, 1990.
Cortes, C., Jackel, L.D., and Chiang, W.-P.: Predicting failures of telecommunication paths:
limits on learning machine accuracy imposed by data quality. Proc. Intl. Workshop on
Applications of Neural Networks to Telecommunications 2, Stockholm, 1995.
Dellaert, F.: The Expectation Maximization Algorithm, February 2002,
Avail. at www.cc.gatech.edu/dellaert/em-paper.pdf.
Dempster, A.P., Laird, N.M., and Rubin, D.B.: Maximum likelihood from incomplete data via
the EM algorithm. Journal of the Royal Statistical Society B 39:1-38, 1977.
Dietterich, T.G.: An experimental comparison of three methods for constructing ensembles of
decision trees: bagging, boosting, and randomization. Machine Learning 40(2):139-157,
2000.
Dietterich, T.G.: Machine learning research: four current directions. AI Magazine 18:97-136,
1997.
Everitt, B.: Cluster Analysis. Wiley, New York, NY, 1974.
Fayyad, U., Reina, C. and Bradley, P.S.: Initialization of iterative refinement clustering
algorithms. Proc. 4th Int. Conf. Knowledge Discovery and Data Mining (KDD-98) 194-198,
Menlo Park, CA, AAAI Press, 1998.
Fisher, D.: Improving inference through conceptual clustering. Proc. 1987 AAAI Conf. 461-465,
Seattle, WA, 1987.
Freund, Y.: Boosting a weak learning algorithm by majority. Information and Computation
121(2):256-285, 1995.
Freund, Y. and Schapire, R.E.: A decision-theoretic generalization of on-line learning and an
application to boosting. Journal of Computer and System Sciences 55(1):119-139, 1997.
Freund, Y., Schapire, R.E., Singer, Y. and Warmuth, M.K.: Using and combining predictors
that specialize. Proc. of the 29th Annual ACM Symposium on the Theory of Computing
pp. 334-343, 1997.
Gerstein M. and Jansen R.: The current excitement in bioinformatics, analysis of whole-genome
expression data: How does it relate to protein structure and function? Current Opinion
in Structural Biology 10(5):574-584, 2000.
Gomes, C., Selman, B. and Kautz, H.: Boosting combinatorial search through randomization.
Proc. 15th Nat. Conf. on Artificial Intelligence 431-437. AAAI Press/MIT Press, 1998.
Grimshaw, A.S., Weissman, J.B., West, E.A., and Loyot, E.C.: Metasystems: An Approach Combining Parallel Processing and Heterogeneous Distributed Computing Systems,
Journal of Parallel and Distributed Computing 21(3):257-270, 1994.
Grossman, R.L., Bailey, S., Ramu, A., Malhi, B., Hallstrom, P., Pulleyn, I. and Qin, X.:
The management and mining of multiple predictive models using the Predictive Modeling
Markup Language (PMML), Information and System Technology 41:589-595, 1999.
Grossman, R.L., Bailey, S., Ramu, A., Malhi, B. and Turinsky, A.: The preliminary design of
Papyrus: a system for high performance, distributed data mining over clusters. In: Advances
in Distributed and Parallel Knowledge Discovery, eds. H. Kargupta and P. Chan, pp. 259-275,
AAAI Press/MIT Press, Menlo Park, California, 2000.
Guo, Y., Rueger, S.M., Sutiwaraphun, J., and Forbes-Millott, J.: Meta-learning for parallel
data mining. Proc. 7th Parallel Computing Workshop 1-2, 1997.
Haussler, D., Kearns, M., Seung, H. and Tishby, N.: Rigorous learning curve bounds from
statistical mechanics, Machine Learning 25:195-236, 1996.
Hoel, P.G., Port, S.C. and Stone, C.J.: Introduction to Statistical Theory, Houghton
Mifflin, 1971.
Jordan, M.I., and Jacobs, R.A.: Hierarchical mixtures of experts and the EM algorithm.
Neural Computation 6(2):181-214, 1994.
Kargupta, H., Hamzaoglu, I. and Stafford, B.: Scalable, distributed data mining using an agent
based architecture. In: Proc. Third International Conference on the Knowledge Discovery
and Data Mining, eds. D. Heckerman, H. Mannila, D. Pregibon, and R. Uthurusamy, pp.
211-214, AAAI Press, Menlo Park, CA, 1997.
Kargupta, H., Johnson, E., Sanseverino, E.R., Park, B.-H., Silvestre, L.D., and Hershberger, D.:
Scalable data mining from vertically partitioned feature space using collective mining and
gene expression based genetic algorithms, KDD Workshop on Distributed Data Mining,
1998.
Krogh, A. and Sollich, P.: Statistical mechanics of ensemble learning, Physical Review E
55:811-825, 1997.
Lewis, A.S. and Borwein, J.M.: Convex Analysis and Nonlinear Optimization : Theory and
Examples. Cms Advanced Books in Mathematics, Springer Verlag, 2000.
Michalski, R.S. and Stepp, R.: Automated construction of classifications: conceptual
clustering versus numerical taxonomy. IEEE Trans. Pattern Analysis and Machine Intelligence
5:396-410, 1983.
Mitchell, T.: Machine Learning, WCB/McGraw-Hill, 1997.
Murata, N., Yoshizawa, S. and Amari, S.: Learning curves, model selection and complexity of
neural networks. In: Advances in Neural Information Processing Systems, eds. S.J. Hanson,
J.D. Cowan, and C.L. Giles, 5:607-614, San Mateo, CA, 1993.
Murphy, P. M. and Aha, D. W.: UCI repository of machine learning databases. University of
California, Department of Information and Computer Science, Irvine, CA, 1993.
Avail. at http://www.ics.uci.edu/mlearn/MLRepository.html.
Obradovic, Z.: Commonness, complexity, flavors and function of intrinsic protein disorder:
a bioinformatics study. Presentation at IPAM Workshop on the Mathematical Challenges in
Scientific Data Mining, 2002. Avail. at www.ist.temple.edu/zoran/bioinformatics.html.
Pugachev, V.S.: Probability Theory and Mathematical Statistics for Engineers, Pergamon
Press, 1984.
Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA,
1993.
Raftery, A.E., Madigan, D., and Hoeting, J.A.: Bayesian model averaging for linear regression
models. Journal of the American Statistical Association 92:179-191, 1996.
Rocke, D. and Dai, J.: Sampling and subsampling for cluster analysis in data mining with
applications to Sky Survey. 2002.
Avail. at http://handel.cipic.ucdavis.edu/dmrocke/preprints.html.
Skillicorn, D.: Parallel Data Mining. CASCON'98, Toronto, December 1998.
Stepp, R.E. and Michalski, R.S.: Conceptual clustering: Inventing goal-oriented
classifications of structured objects. Machine Learning: An Artificial Intelligence
Approach vol. II. Morgan Kaufmann, San Mateo, CA, 1986.
Stolfo, S., Prodromidis, A.L., and Chan, P.K.: JAM: Java agents for meta-learning over
distributed databases, Proc. Third International Conference on Knowledge Discovery and
Data Mining, AAAI Press, Menlo Park, California, 1997.
Tibshirani, R., Hastie, T., Eisen, M., Ross, D., Botstein, D. and Brown, P.: Clustering methods
for the analysis of DNA microarray data. Technical report, Department of Health Research
and Policy, Stanford University, 1999.
Turinsky, A.L. and Grossman, R.L.: A Framework for Finding Distributed Data Mining
Strategies That Are Intermediate Between Centralized Strategies and In-Place Strategies.
Presentation at the KDD 2000 Workshop on Distributed Data Mining, Boston, USA, 2000.
Avail. at http://citeseer.nj.nec.com/turinsky00framework.html.
Van Laarhoven, P.J.M. and Aarts, E.H.L.: Simulated Annealing: Theory and Applications.
D.Reidel, Norwell, MA, 1987.
Witten, I.H. and Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques
with Java Implementations. Morgan Kaufmann, San Mateo, CA, October 1999.
WEKA software avail. at http://www.cs.waikato.ac.nz/ml/weka/.
Wolpert, D.: Stacked generalization. Neural Networks 5:241-259, 1992.
Xu, L. and Jordan, M.I.: EM learning on a generalized nite mixture model for combining
multiple classiers. Proc. World Congress on Neural Networks. Hillsdale, NJ, Erlbaum,
1993.
Zaki, M., Li, W., and Parthasarathy, S.: Customizing dynamic load balancing for a network of workstations. Journal of Parallel and Distributed Computing: Special Issue on
Performance Evaluation, Scheduling, and Fault Tolerance, June 1997.
Zhang, H., Yu, C.Y., Singer, B., and Xiong, M. Recursive partitioning for tumor classication
with gene expression microarray data. Proc. Natl. Acad. Sci. USA 98:6730-6735, 2001.
Zhang, T., Ramakrishnan, R. and Livny, M.: BIRCH: an efficient data clustering method
for very large databases. Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data pp.
103-114, Montreal, Canada, 1996.
VITA
Research Assistant, Laboratory for Advanced Computing, National Center for Data
Mining, University of Illinois at Chicago, Chicago, Illinois, USA, 1998-2002
Teaching Assistant, Department of Mathematics, Statistics, and Computer Science,
University of Illinois at Chicago, Chicago, Illinois, USA, 1995-1998
Teaching Intern, Kharkiv National University Lyceum, Kharkiv, Ukraine, 1995
Programmer / Application Developer, Civil Engineering Research and Development
Institute, Kharkiv, Ukraine, 1992-1995
HONORS AND AWARDS:
Master of Science Degree with Honors, Kharkiv National University, Kharkiv, Ukraine,
1995
Scholarship, Fund of Student and Educational Initiatives, Kharkiv National University,
Kharkiv, Ukraine, 1993
"Outstanding Runner-Up" Team Award, Net Challenge Event, SuperComputing-2000
conference, Dallas, Texas, USA, 2000
"High Performance Communication" Team Award, High Performance Computing
Challenge, SuperComputing-1999 conference, Portland, Oregon, USA, 1999
"Most Innovative of the Show" Team Award, High Performance Computing Challenge,
SuperComputing-1998 conference, Orlando, Florida, USA, 1998
PUBLICATIONS AND PRESENTATIONS:
Pliska, S.R., Turinsky, A.: Solution Manual for the Exercises, Introduction to
Mathematical Finance: Discrete Time Models by S.R. Pliska, Blackwell Publishers, 1997.
Turinsky, A. and Grossman, R.L.: Greedy Data Labeling and the Model Assignment
Problem for Scientific Data Sets. Poster presentation at the IPAM Workshop on
Mathematical Challenges in Scientific Data Mining, Los Angeles, USA, 2002.
Turinsky, A. and Grossman, R.L.: A Framework for Finding Distributed Data Mining
Strategies That Are Intermediate Between Centralized Strategies and In-Place Strategies.
Presentation at the KDD 2000 Workshop on Distributed Data Mining, Boston, USA, 2000.
Avail. at http://citeseer.nj.nec.com/turinsky00framework.html.
Grossman, R.L., Bailey, S., Ramu, A., Malhi, B., Sivakumar, H., Turinsky, A.: Papyrus: A System for Data Mining over Local and Wide Area Clusters and Super-Clusters,
Proceedings of Supercomputing 1999, IEEE, 1999.
Grossman, R.L., Bailey, S., Ramu, A., Malhi, B. and Turinsky, A.: The preliminary
design of Papyrus: a system for high performance, distributed data mining over clusters.
In: Advances in Distributed and Parallel Knowledge Discovery, eds. H. Kargupta and
P. Chan, pp. 259-275, AAAI Press/MIT Press, Menlo Park, California, 2000.