
BALANCING COST AND ACCURACY

IN DISTRIBUTED DATA MINING

BY
ANDREI L. TURINSKY
M.S., Kharkiv National University, 1995
M.S., University of Illinois at Chicago, 1997

THESIS
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Mathematics
in the Graduate College of the
University of Illinois at Chicago, 2002
Chicago, Illinois

Copyright by
Andrei L. Turinsky
2002

ACKNOWLEDGMENTS

I owe a great debt of gratitude to my family who have been so patient and supportive
throughout my academic career. They are the main source of strength and inspiration for me,
to which I attribute all my past and future achievements. I'd also like to acknowledge AT&T
for its frequent discounts on international phone service, which helped us stay in touch during
my studies abroad.
I am very thankful to my thesis advisor Dr. Robert Grossman for the chance to write this
dissertation while working in his Laboratory for Advanced Computing. Professor Grossman's
directions and encouragement were essential in my progress. I have always had full confidence
in his judgment that comes from many years of professional experience, and merely observing
Dr. Grossman's approach to research work is quite illuminating.
I am happy to be associated with the Laboratory for Advanced Computing, a part of the
National Center for Data Mining at UIC. Few other places exist where one can gain as much
exposure to such a wide variety of novel application areas in data mining. Among the important
benefits, I regularly received financial support from LAC to attend data mining conferences,
which was a substantial part of my learning experience. In this regard I wish to thank Shirley
Connelly who was instrumental in securing the travel grants. Another best kept secret about
our lab is its great social atmosphere and stimulating discussions, much of which is also a result
of Shirley's involvement. I am grateful to Marco Mazzucco for proofreading several chapters
of this thesis and for a number of useful tips he gave me on the thesis defense procedure. He
also made sure that my computers ran smoothly. I'd like to thank Stuart Bailey who was
my technical mentor during my first year at LAC. Cheryl Fernandes taught me several Java
programming techniques. Arvind Sethuraman provided some additional monetary incentive for
me to speed up my graduation process, which also helped. All other colleagues at the lab have
been very supportive of my efforts as well.
There were a number of people outside the Laboratory for Advanced Computing who helped
or influenced my academic progress. I learned quite a few research techniques from my first
advisor Professor Valery Korobov, a prominent scientist at the Kharkiv National University.
My former classmate and good friend Eugenia Vinogradskaya inspired me to move to the U.S.
and handled much of my admission process at the University of Illinois at Chicago. Professor Alexander Lipton was kind enough to support my application and later provided valuable
assistance in choosing my research area. Professor Floyd Hanson frequently acted as my unofficial advisor during my first years of studies. As a member of the thesis defense committee,
he proofread this dissertation and suggested several important improvements. I also wish to
thank other members of the committee Professors Bhaskar DasGupta, Jason Leigh and Charles
Tier for their useful comments. Professor Stanley Pliska gave me the opportunity to discuss
my academic career with him on a number of occasions, which was rather valuable. He also
made me learn LaTeX while working on an interesting project. Professor Yang Dai offered me
the gene expression microarray dataset that was used in the experiments in this thesis.

In addition, I used the MATLAB and Weka software packages in my computations. Weka (Witten and Frank, 1999) is a comprehensive freeware data mining library that saved me at least a year of coding and debugging. MATLAB is a trademark of The MathWorks, Inc.
ALT

TABLE OF CONTENTS

CHAPTER                                                                  PAGE

1    INTRODUCTION ....................................................      1

2    INTERMEDIATE STRATEGIES .........................................      6
     2.1    Introduction .............................................      6
     2.2    Background and Related Work ..............................      8
     2.3    Computational Model ......................................     10
     2.3.1  Network Configuration ....................................     11
     2.3.2  Building Models ..........................................     11
     2.4    The OPTDMP Method ........................................     12
     2.4.1  Strategies ...............................................     12
     2.4.2  In-place and Centralized Strategies ......................     13
     2.4.3  Cost Function ............................................     14
     2.4.4  Error ....................................................     16
     2.4.5  Optimization .............................................     20
     2.5    Case Study: Nursery Data .................................     23
     2.5.1  Data Preparation .........................................     23
     2.5.2  Error Function Estimation ................................     24
     2.5.3  Optimization .............................................     27
     2.5.4  Cost .....................................................     29
     2.5.5  In-place and Centralized Strategies ......................     30
     2.5.6  Optimal Solution .........................................     31
     2.6    Conclusion ...............................................     36

3    STRATEGIES FOR A DUAL OPTIMIZATION PROBLEM ......................     38
     3.1    Introduction .............................................     38
     3.2    Dual Strategies ..........................................     39
     3.3    Dual Strategies for Ensemble Learning ....................     40
     3.3.1  Simple Ensembles .........................................     43
     3.3.2  Boosted Ensembles ........................................     44
     3.4    Dual Strategies for Clustering ...........................     46
     3.4.1  Clustering Gene Expression Microarray Data ...............     48
     3.4.2  Effect on the Cluster Tightness ..........................     49
     3.4.3  Identifying Similar Genes ................................     50
     3.4.4  Effect on the Precision and Recall .......................     52
     3.5    Conclusion ...............................................     54

4    EXPECTATION-MAXIMIZATION ALGORITHM WITH DATA TRANSFER
     CONSTRAINTS .....................................................     55
     4.1    Mixture of Models and the Expectation-Maximization Approach    55
     4.2    Expectation-Maximization with Constraints on Data Transfer     56
     4.3    Lower Bound for the Log-likelihood .......................     58
     4.4    Lower Bound Sub-optimality ...............................     59
     4.5    General Expectation-Maximization Algorithm with Constraints    62
     4.6    Implementation Details ...................................     64
     4.7    Weighting Schema and Data Transfer in the Maximization Step    66
     4.8    Conclusion ...............................................     69

5    GREEDY DATA LABELING AND THE DISTRIBUTED MODEL ASSIGNMENT
     PROBLEM .........................................................     71
     5.1    Introduction .............................................     71
     5.1.1  Data Partitions ..........................................     73
     5.1.2  Objective ................................................     74
     5.1.3  Related work .............................................     74
     5.2    Greedy Data Labeling .....................................     76
     5.2.1  Assumptions ..............................................     76
     5.2.2  Optimization Problem .....................................     77
     5.2.3  Greedy Data Labeling: The Ideal Version ..................     79
     5.2.4  Greedy Data Labeling: The Efficient Version ..............     81
     5.2.5  Building the Meta-model ..................................     83
     5.3    Experiments ..............................................     83
     5.3.1  Datasets .................................................     83
     5.3.2  Partitions ...............................................     85
     5.3.3  Test Results .............................................     86
     5.3.4  Efficiency ...............................................     92
     5.4    Conclusion ...............................................     96

CONCLUSION ...........................................................     97

CITED LITERATURE .....................................................    100

VITA .................................................................    105

LIST OF TABLES

TABLE                                                                    PAGE

I      TRADE-OFF BETWEEN COST AND ACCURACY FOR DIFFERENT
       DISTRIBUTED DATA MINING STRATEGIES ............................      2

II     MISCLASSIFICATION ERROR RATE (%) FOR MODEL 1, TABULATED
       FOR 0 ≤ x21, x31 ≤ 1 ..........................................     25

III    MISCLASSIFICATION ERROR RATE (%) FOR MODEL 2, TABULATED
       FOR 0 ≤ x12, x32 ≤ 1 ..........................................     26

IV     MISCLASSIFICATION ERROR RATE (%) FOR MODEL 3, TABULATED
       FOR 0 ≤ x13, x23 ≤ 1 ..........................................     26

V      MISCLASSIFICATION ERROR RATES FOR ENSEMBLES BUILT ON
       NURSERY DATA ..................................................     42

VI     ANALYSIS OF THE GENE EXPRESSION MICROARRAY DATA ...............     49

VII    DESCRIPTION OF THE UCI DATASETS ...............................     84

VIII   BALANCE SCALE DATASET: ENSEMBLE ERRORS VS. MODEL
       ASSIGNMENT ERRORS .............................................     89

IX     TIC-TAC-TOE DATASET: ENSEMBLE ERRORS VS. MODEL
       ASSIGNMENT ERRORS .............................................     90

X      CAR EVALUATION DATASET: ENSEMBLE ERRORS VS. MODEL
       ASSIGNMENT ERRORS .............................................     91

XI     CHESS DATASET: ENSEMBLE ERRORS VS. MODEL ASSIGNMENT
       ERRORS ........................................................     92

XII    NURSERY DATASET: ENSEMBLE ERRORS VS. MODEL ASSIGNMENT
       ERRORS ........................................................     93

XIII   AVERAGE OVER ALL DATASETS: ENSEMBLE ERRORS VS. MODEL
       ASSIGNMENT ERRORS .............................................     94

XIV    ERROR RATES OF AN ENSEMBLE OF C4.5 TREES BUILT ON A
       RANDMIX PARTITION VS. MODEL ASSIGNMENT SYSTEMS BUILT ON
       THE GDL PARTITION WITH C4.5 BASE-MODELS AND DIFFERENT
       TYPES OF META-MODELS ..........................................     95

LIST OF FIGURES

FIGURE                                                                   PAGE

1      A 2-dimensional slice of level contours for a typical error function.
       The coordinates represent percentages of data coming from two remote
       nodes. The error takes its highest value at (0,0) - the in-place
       strategy - and its lowest value at (100%,100%) - a centralized
       strategy. The cost typically decreases in the opposite direction,
       which allows for an optimal intermediate strategy. The error function
       was obtained from mining the UCI ML Nursery dataset. ..........     18

LIST OF ABBREVIATIONS

CART        Classification and Regression Tree

EM          Expectation-Maximization

GDL         Greedy Data Labeling

NB          Naive Bayesian

NN1         Nearest Neighbor with One Neighbor

OPTDMP      Optimal Data and Model Partition

PMML        Predictive Model Markup Language

RandMix     Random Mixture

UCI ML      University of California at Irvine Machine Learning

SUMMARY

The objective of this thesis is to present the fundamental trade-off in distributed data mining, namely, a trade-off between the cost of communication and computation on one side and the accuracy of the data mining results on the other side. There are two extreme approaches to distributed data mining. One is to mine all data locally and combine the results, which is the cheapest solution. Another is to collect all data at a central repository and mine it there, which gives the most accurate results.
Chapter 1 is an introductory chapter that presents the fundamental trade-off between cost and accuracy in more detail.
Chapter 2 develops a mathematical framework for formulating this trade-off in rigorous terms and then finding intermediate strategies that balance cost and accuracy. We show that the problem may be reduced to that of a constrained optimization. Using the known convexity of the learning curves as well as of the cost function, we demonstrate how the intermediate strategies may be found.
Chapter 3 presents a related dual problem in which cost constraints are fixed and accuracy is maximized, which leads into the area of the quality of data partitions. We use experimental applications with UCI Machine Learning data and gene expression microarray data to illustrate the important aspects of this approach.
Chapter 4 develops a mathematical foundation for finding proper data partitions in the distributed environment presented in Chapter 3. We show that this problem may be formalized as an Expectation-Maximization method with constraints on the hidden variable. We also show how a new family of algorithms may be developed using the constrained EM approach.
Chapter 5 presents a model assignment problem and develops an algorithm called Greedy Data Labeling to address this problem. The GDL algorithm allows the creation of hierarchical model assignment systems that generally outperform voting ensembles. Although developed using an alternative motivation, GDL is one of the algorithms for which the constrained EM framework of the previous chapter provides a theoretical foundation.

CHAPTER 1

INTRODUCTION

In a traditional data mining scenario, a learning algorithm is applied to a single, although


possibly large, set of data. A single predictive model is built so that it is able to capture the
properties of the underlying data distribution and predict the class value of unlabeled data
instances.
Nowadays, distributed data mining is emerging as a fundamental computational problem.
In this scenario, there are several datasets that are geographically distributed and, furthermore,
may come from either a single data distribution or from different distributions. This calls for
new techniques of manipulating the data before the learning algorithm is applied.
At one extreme, all data can be moved to a central site and a single model built. Although
having all the data available typically gives the most accurate predictive models, this approach
may be too expensive, for several reasons. First, it increases the network traffic dramatically and, in addition, may result in problems related to data delivery, cleaning, and aggregation. Secondly, the data processing algorithm complexity may become an issue. For example, if there are k local sites containing n data instances each and the algorithm complexity is quadratic, then it costs O(kn^2) to apply it separately to k individual data subsets but O(k^2 n^2) if all data is combined in a single set of kn instances.
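To make the comparison explicit, the two costs differ by exactly a factor of k:

    k \cdot O(n^2) = O(k n^2), \qquad O\big((kn)^2\big) = O(k^2 n^2), \qquad \frac{k^2 n^2}{k n^2} = k,

so with, say, k = 10 sites the centralized computation alone is roughly ten times as expensive as all the local computations together, before any communication cost is counted.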
At the other extreme, a common approach with distributed data mining is to build separate models at geographically distributed sites and then combine the models. With the commodity internet and large data sets, mining the data in-place is the cheapest and quickest but often the least accurate solution, while the centralized approach is more accurate but generally quite expensive in terms of the time and other resources required.
There are a variety of intermediate strategies in which some data is moved and some data is left in place, analyzed locally, and the resulting models are moved and combined. These intermediate cases are becoming of practical significance with the explosion of fiber and the emergence of high performance networks. They represent a balance between a sufficient accuracy of the data mining models and results on one hand and an acceptable level of cost on the other hand. This is shown schematically in Table I.

TABLE I
TRADE-OFF BETWEEN COST AND ACCURACY FOR DIFFERENT DISTRIBUTED DATA MINING STRATEGIES

DM Strategies    Cost        Accuracy
In-place         low         low
Intermediate     balanced    balanced
Centralized      high        high
In Chapter 2, we examine this intermediate case in the context in which high performance
networks are present and the cost function represents both computational and communication
costs. We reduce the problem to a convex programming problem so that standard techniques
can be applied. We illustrate our approach through the analysis of an example showing the
complexity and richness of this class of problems.

Today, the capability of the broadband communications infrastructure is doubling every
9-12 months, faster than the 18 month doubling of processor speeds (Moore's law). For example, a 155 Mb/s OC-3 link can move 10 Gigabytes of data in about 15 minutes. Given this
infrastructure and the growing importance of large distributed data sets, intermediate strategies between in-place and centralized strategies of the type described here should be of growing
interest.
Chapter 2 deals with building a mathematical framework for the analysis of the intermediate strategies primarily in the context of minimizing the cost of data transfer and processing
while maintaining a given level of accuracy for the outcome. In Chapter 3, we present a dual
optimization problem, namely, a problem of intermediate strategies that maximize accuracy
given cost constraints. We show that this problem leads directly into such important areas as
data quality, the impact of the initial data partition in distributed data mining on the accuracy
of the results, and the issue of data instance selection.
Two practical applications are chosen to illustrate these issues using existing methods of
data instance selection. In the first set of experiments, we examine the benefits of such methods as boosting an ensemble of models when the ensemble is built on a distributed collection of datasets. In the second set of experiments, we present the problem of clustering distributed bioinformatics data. Our goal is to identify important issues that arise in the dual intermediate
strategy approach, primarily the issue of the quality of distributed data partitions.
In Chapter 4, we develop a general mathematical treatment of the dual intermediate strategy case and the issue of data partitions. We show that this problem allows a rigorous formulation as a case of the Expectation-Maximization algorithm with constraints on the hidden variable. We prove a theorem that establishes how the constrained region in the hidden variable space affects the quality of the EM solutions. We also show how to use this theoretical framework to build new families of EM-based algorithms for a distributed data mining environment.
In Chapter 5, we develop one such algorithm using an alternative motivation. We note that
one of the challenges in distributed data mining is to choose the best method of deployment of
a collection of predictive models built remotely. While methods such as ensemble learning use
the entire collection of models for classification, complex data may sometimes be best modeled by a hierarchical system in which only one specialized model is deployed each time. We explore this scenario in Chapter 5, where we present the distributed model assignment problem and
suggest a method to address it.
Let there be k remote datasets. The distributed model assignment problem is a problem of
computing k local statistical models and an "assignment model", or meta-model. The quality
of the resulting system depends on the underlying data partition.
In Chapter 5, we introduce an algorithm called Greedy Data Labeling that improves the
initial data partition by selecting small portions of data for re-allocation between distributed
sites, so that when each model is built on its data subset, the resulting hierarchical system
has minimal error. We present experimental results showing that the model assignment approach
may in certain situations be more natural than traditional ensemble learning, and that when
enhanced by GDL, it nearly always outperforms ensembles. Our technique is broadly related

to partition-based clustering algorithms and employs some ideas from boosting and simulated
annealing.
Although the GDL algorithm is developed independently from Chapter 4, it is of the same
type as the general family of constrained EM-like algorithms for which Chapter 4 provides a
theoretical foundation.

CHAPTER 2

INTERMEDIATE STRATEGIES

2.1  Introduction

The work presented in this chapter, to our knowledge, is the first attempt to identify a fundamental trade-off in distributed data mining; namely, the trade-off between the efficiency and cost-effectiveness of a distributed data mining application on one side, and the accuracy and reliability of the resulting predictive system on the other side.
We provide evidence that the most efficient application may give unacceptably inaccurate predictive results, while the most accurate predictions may require an inefficient data processing strategy. We also explore a variety of intermediate strategies. Some of the findings presented
in this chapter have already been published in (Turinsky and Grossman, 2000) and to some
extent in (Grossman et al., 2000).
Because moving large data sets over the commodity internet can be very time consuming, a
common strategy today for mining geographically distributed data is to leave the data in place,
build local models, and combine the models at a central site. Call this an in-place strategy. At
the other extreme, when the amount of geographically distributed data is very small, the most
naive strategy is simply to move all the data to a central site and build a single model there.
Call this a centralized strategy.

Given geographically distributed data, we can either a) move data, b) move the results of applying algorithms to data (models), or c) move the results of applying models to data (result vectors). It is not uncommon for there to be a 10x-100x difference in the size of the data, model and result vectors.
Consider a cost function that measures the total cost to produce a model and includes both
the communication and processing costs. As the size of the data grows and the speed of the
link connecting two sites decreases, an in-place strategy is, generally speaking, less expensive
but also less accurate. Conversely, the centralized strategy is generally more expensive but
also more accurate. Given a minimally acceptable accuracy, it is plausible that there is an
intermediate strategy which produces a model with this level of accuracy with the minimum
possible cost. Call these intermediate strategies.
We show that this is indeed the case and describe a method called the OPTDMP (OPTimal
Data and Model Partitions) for finding such a strategy. We also present an experimental case
study to show that such intermediate strategies occur rather naturally.
This chapter makes the following contributions:
1. We introduce the problem of computing intermediate strategies in distributed data mining
and point out that these types of strategies will become more and more important with the
emergence of wide area high performance networks.
2. We provide a mathematical framework for analyzing intermediate distributed data mining
strategies.

3. We introduce a method called OPTDMP for finding intermediate strategies in the case of a linear cost function.
4. We show with an example that intermediate strategies are interesting, even for the simple case of linear cost functions.
Given the analysis of the chapter, it is straightforward to define versions of OPTDMP for a variety of other cost functions. Our point of view is to reduce finding intermediate strategies
to a mathematical programming problem which minimizes a cost function subject to an error
constraint. The framework presented in the next sections holds for a wide range of cost functions
and mathematical programming algorithms. The purpose of this chapter is to introduce these
ideas with a simple cost function and a simple example.
2.2  Background and Related Work

As mentioned above, a common approach to distributed data mining is centralized learning,


where all the data is moved to a single central location for analysis and predictive modeling.
Another common approach is local learning, where models are built locally at each site, and
then moved to a common site where they are combined.
Ensemble learning (Dietterich, 1997) is often used as a means of combining models built at
geographically distributed sites. Methods for combining models in an ensemble include voting
schemata (Dietterich, 1997), meta-learning (Stolfo et al., 1997), knowledge probing (Guo et al.,
1997), Bayesian model averaging and model selection (Raftery et al., 1996), stacking (Wolpert,
1992), mixture of experts (Xu and Jordan, 1993), etc.

Several systems for analysis of distributed data have been developed in recent years. These
include the JAM system developed by Stolfo et al (Stolfo et al., 1997), the Kensington system
developed by Guo et al (Guo et al., 1997), and BODHI developed by Kargupta et al (Kargupta
et al., 1997), (Kargupta et al., 1999). These systems di er in several ways. For example, JAM
uses meta-learning that combines several models by building a separate meta-model whose
inputs are the outputs of the collection of models and whose output is the desired outcome.
Kensington employs knowledge probing that considers learning from a black box viewpoint and
creates an overall model by examining the input and the output of each model, as well as the
desired output. BODHI system employs so-called collective mining that relies in part on ideas
from Fourier analysis to combine different models. In terms of data and model transfer, JAM,
Kensington and BODHI all use local learning.
A new system for distributed data mining called Papyrus is now being developed at the
National Center for Data Mining (Grossman et al., 2000). Among other features, it is designed
to support different data and model strategies, including local learning, centralized learning,
and a variety of intermediate strategies, that is, hybrid learning. Work is under way to develop
a methodology of choosing an information transfer strategy that is optimized for a particular
data mining task.
A variety of load balancing techniques have been utilized for a long time in parallel computing
applications. Load balancing is aimed at finding an optimal regime of moving data to the nodes of a supercomputer or, more recently, of a network of workstations. Zaki (Zaki et al., 1997) provides an example of a load balancing method that optimizes the efficiency of parallel

computation on a network of compute nodes. Other examples and motivating discussion can be found in (Cheung, 1992) and (Grimshaw et al., 1994). These techniques, however, do not directly target the specific issues that arise in distributed data mining, such as ways of combining
predictive models and the accuracy of the resulting predictive system.
An important topic in data mining is the study of the so-called learning curves. Essentially,
a learning curve shows the relationship between the size of a training dataset and the accuracy
of a predictive model built on that data. In general, exposing a model to more data reduces the
predictive error, although usually not to zero. See e.g. (Cortes et al., 1995). Learning curves
vary in shape depending on the quality of the data, type of models, and other factors. More
importantly, they all share several common features which we shall exploit. A detailed analysis
of learning curves is available in (Murata et al., 1993), (Haussler et al., 1996).
2.3  Computational Model

We use a computational model consisting of a collection of geographically distributed processors, each with dedicated memory and connected with a high performance network. We
assume that network access is substantially more expensive than a local memory access. Naturally, this assumption may not hold for very fast networks where remote access to memory
might in fact be faster than a local disk access. However, here we focus on a situation where
the processors are connected via a high speed broadband type network that adheres to quality
of service requirements.

2.3.1  Network Configuration

Formally, we assume that there are n different sites connected by a network. The cost of processing data at the ith node into a predictive model is μ_i dollars per record, and the optimal cost of moving data from the ith to the jth node via the cheapest route between the two nodes is λ_ij dollars per record. One of the nodes is the network root where the overall result will be computed.
2.3.2  Building Models

Our assumption is that at each node, a choice must be made: either ship raw data across
the network to another node for processing, or process data locally into a predictive model and
ship the model across the network. The same data, or a portion of it, may be used both locally and shared with another processing node. Given this viewpoint, building models consists
of the following steps:
1. Re-distribute data across the network.
2. At every node, compute a predictive model.
3. Re-distribute all local predictive models to the root.
4. At the root, combine all models into a single predictive model.
Let D_i be the initial amount of data at the ith node. After the re-distribution in Step 1, this node accumulates D̃_i of data. Let M_i be the size of the predictive model computed from D̃_i in Step 2. It is later transferred to the root. We assume that when data is processed into a predictive model, its amount is compressed uniformly for each node with a coefficient γ, so that

    M_j = γ D̃_j    for all j.        (2.1)

Let the Rth node be chosen as a root, where 1 ≤ R ≤ k.


Our objective is to determine the procedure of data transfers over the network that minimizes
the overall cost of building models.
More generally, the same approach will work with more complicated strategies. For example,
we could move data, clean the data, produce models, move the models, and combine the models.
In this case, the strategy, the cost function, and error function would be more complicated, but
of the same general form.
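The following is a minimal sketch of Steps 1-4 above (not code from the thesis); the helpers train_model and combine_models are hypothetical placeholders for whatever learning algorithm and model-combination scheme are used, and the sketch assumes x_ij counts records moved from node i to node j.

    import numpy as np

    def run_strategy(data, X, root, train_model, combine_models):
        # data[i] : records initially stored at node i (a NumPy array)
        # X[i][j] : number of records moved from node i to node j (Step 1)
        k = len(data)
        accumulated = [[] for _ in range(k)]
        for i in range(k):
            for j in range(k):
                # Which particular records are moved is not specified by the method;
                # here we simply take the first x_ij records of node i.
                accumulated[j].append(data[i][: int(X[i][j])])
        # Step 2: at every node, compute a predictive model on the accumulated data.
        models = [train_model(np.concatenate(parts)) for parts in accumulated]
        # Steps 3-4: ship all local models to the root and combine them there.
        return combine_models(models, root)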
2.4  The OPTDMP Method

In this section, we describe a method for finding OPTimal strategies for Data and Model Partitions, called OPTDMP.

2.4.1  Strategies

A strategy X is a matrix of numbers

    X = [x_ij],  i, j = 1, ..., n        (2.2)

where x_ij is the amount of the data D_i that is moved from the ith node to the jth node for processing. This portion of data contributes to D̃_j, is processed, and later transferred to the root as a part of M_j.
Note that

    0 ≤ x_ij ≤ D_i.        (2.3)
Also, network topology may present some additional constraints.


Alternatively, a component xij may represent not the amount but the percentage of data
to be shared between nodes. With the initial data distribution known, the two approaches are
equivalent.
2.4.2  In-place and Centralized Strategies

We define a centralized or naive strategy to be a strategy X0 = X0(R) of moving data from all other nodes to the root R in Step 1 for further computation. For such a strategy, x_iR = D_i for each i and the rest of the x_ij are zero. Choosing different roots yields k different centralized strategies.
We also define an in-place strategy to be a strategy X1 of processing all data locally. For such a strategy, x_ii = D_i for each i and the rest of the x_ij are zero.
Intermediate strategies then represent a balance between the fully centralized strategies X0 and the fully distributed strategy X1.

2.4.3  Cost Function

The basic idea is to reduce the problem of finding optimal strategies for building models to a constrained optimization problem. The overall cost function for a strategy X is easily computed as

    C(X) = Σ_ij ( λ_ij x_ij + μ_j x_ij + γ λ_jR x_ij ) = Σ_ij c_ij x_ij        (2.4)

The coefficients λ_ij represent the cost of network communication between nodes i and j per unit volume of data, while the coefficients μ_j represent the cost at node j per unit volume of data for an algorithm to process data and produce a statistical model. Recall that R is the root, so λ_jR is the cost of moving data from node j to the root R. The first term represents the cost of moving data, the second term represents the cost to transform data into predictive models, while the third term represents the cost of moving the predictive models to the root so that they can be combined. For convenience, we define the coefficients

    c_ij = λ_ij + μ_j + γ λ_jR,        (2.5)

as indicated.
Note that the term representing the final step of combining the results at the root is not present in the cost function. This is due to the fact that regardless of the strategy X, the same amount of results will have to be processed at the root:

    Σ_j M_j = γ Σ_j D̃_j = γ D        (2.6)

where D is the total initial amount of all data and γ is the compression coefficient as above. Therefore, the term in question would be a constant and we may omit it without loss of generality. It may also be convenient to do so from a practical standpoint if the cost of combining models is negligible compared to the other cost components. For example, combining several predictive models into a voting ensemble is a rather trivial operation compared to network transfer or model building.
We assume a linear cost of data processing here, which will later lead to a linear function optimization problem. Generally, more complex algorithms would lead to non-linear cost functions. In practice, many of the cost functions associated with nonlinear algorithms are convex, in which case essentially the same approach works. The actual values of the coefficients may be estimated based on the network throughputs, business infrastructure, particular algorithms used, etc.
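As an illustration of Equation 2.4 and Equation 2.5, here is a minimal sketch (not from the thesis) that evaluates the cost of a strategy, assuming the user supplies the communication costs λ_ij, the processing costs μ_j and the compression coefficient γ:

    import numpy as np

    def cost_coefficients(lam, mu, gamma, root):
        # c_ij = lambda_ij + mu_j + gamma * lambda_jR   (Equation 2.5)
        n = len(mu)
        c = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                c[i, j] = lam[i, j] + mu[j] + gamma * lam[j, root]
        return c

    def total_cost(X, lam, mu, gamma, root):
        # C(X) = sum_ij c_ij * x_ij   (Equation 2.4)
        return float(np.sum(cost_coefficients(lam, mu, gamma, root) * X))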
Cost is different for different centralized strategies. Let C0 be the best cost available under a centralized strategy. Also, let C1 = C(X1) be the cost of the in-place strategy.
The in-place strategy might not be the cheapest one. If the data mining algorithm requires a lot of resources at the data processing stage (e.g. due to a large algorithm complexity), the cost of deploying these resources may vary significantly from site to site and be considerable compared to the communication cost. It may then be more cost-effective to move the data to cheaper processing sites than to process it locally. The cheapest policy will be found by minimizing the cost function.
2.4.4  Error

We assume that there are two factors that introduce error. First, the loss of accuracy may be due to the nature of the data and of the computational algorithm itself, regardless of the strategy. Denote this error term ε_0. This type of error is covered in standard books on statistical modeling and, for example, can be estimated using a validation set.
Second, accuracy is generally lost when data is processed locally at several nodes instead of
moving it to one central node and processing it there. This follows from the observation that
any technique used to build a statistical model when the data is distributed can also be applied

when all the data is available at a single node; on the other hand, certain techniques available
when all the data is in one place are not available when the data is distributed, leading to
inferior quality of models built in the latter case. For example, even in ensemble learning when
several models are trained, it is often preferable to build them from a centralized dataset.
As an illustration, consider the case when the data is heterogeneous and differs from site to
site. If a collection of models is obtained from processing locally stored datasets in-place, each
of the resulting models would only be useful for classifying data of the same kind as the one
on which that model was built. If the particular kind of a new data record is not known, using
the collection of models as a voting ensemble would probably give inaccurate results since most
models would be bad predictors.

This situation is not at all unusual: many large datasets are partitioned into smaller subsets in a way that makes them heterogeneous. For example, storing the dataset sorted by one or several attributes is a common practice, and so is splitting it later sequentially into smaller pieces. As a result, if e.g. a car dataset was previously sorted by mileage, different subsets would contain data on vehicles of different mileage ranges, yet this fact may be overlooked,
leading to biased local models.
Sharing portions of data between processing sites will reduce the overall error of the ensemble of models. Essentially, learning the data from each subset introduces a separate learning curve for an individual model. The shapes of learning curves have been studied e.g. in (Murata et al., 1993). The error function E(X) is essentially a superposition of several learning curves, one for each dimension in the X-space of strategies. Indeed, varying only one component x12 of a strategy X and fixing the rest of the components defines learning of the 1st data subset by the 2nd model, where the data from other subsets has been learned to the extent indicated by the other components of X. Hence the error function surface E(X) can be thought of as a surface comprised of collections of learning curves parallel to each of the x_ij-axes in the X-space.
Knowing the type of the shapes of the individual learning curves, we make an important observation: the level contours of the error surface E(X), i.e. contours {X : E(X) = const}, generate a family of convex sets in the space of strategies. See Figure 1. This insight will be crucial for finding the optimal data allocation strategies.
We further observe that if all distributed data comes essentially from the same data distribution, then the error function will be constant on the linear sets {X : Σ_i x_ij = δ_j}, where the δ_j are some constants, since the only factor reducing error would be the amount of processed data D̃_j = Σ_i x_ij but not its origins. In this case, the level contours of E(X) will be linear.

Figure 1. A 2-dimensional slice of level contours for a typical error function. The coordinates represent percentages of data coming from two remote nodes. The error takes its highest value at (0,0) - the in-place strategy - and its lowest value at (100%,100%) - a centralized strategy. The cost typically decreases in the opposite direction, which allows for an optimal intermediate strategy. The error function was obtained from mining the UCI ML Nursery dataset.
If, as shown in (Murata et al., 1993), the shape of error curves is of the type ε_0 + ε_1/x, where x is the amount of training data and ε_0, ε_1 are certain constants, the error rate of an individual model j would be of the type

    E_j = ε_0 + ε_1 / D̃_j = ε_0 + ε_1 / (Σ_i x_ij).        (2.7)

However, if the data is heterogeneous, we will probably see level contours with a more pronounced convexity due to a faster error decrease along paths in X-space that represent a mix of data from various sources, and a slower decrease along paths parallel to the coordinate axes. Our experiments confirm this observation.
An example of an error function for model j may be

    E_j(X) = b_j0 + 1 / ( b_0j + Σ_{i=1..k} b_ij x_ij^p_j )        (2.8)

where the b_ij are parameters to be estimated, k is the number of data subsets, and the power p_j ∈ (0, 1] defines the degree of convexity. Depending on how the local models are combined, the overall error function E(X), or the upper bound for E(X), can then be determined. E.g. if the models are used as an averaging ensemble, (Krogh and Sollich, 1997) shows that the disagreement between the models always causes the error of the ensemble to be lower than the average error of the individual models, which leads to the following error bound:

    E(X) ≤ (1/k) Σ_j E_j(X).        (2.9)

Also, bagging, boosting and other ensemble learning techniques can be used to improve the accuracy of the combined model and may lead to a more complex mathematical structure of E(X).
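A small sketch (not from the thesis) of the error model of Equation 2.8 and the averaging-ensemble bound of Equation 2.9; the way the per-model parameters are packed is an assumption of this sketch:

    import numpy as np

    def model_error(X, j, b_j0, b_0j, b, p_j):
        # E_j(X) = b_j0 + 1 / (b_0j + sum_i b[i] * x_ij^p_j)   (Equation 2.8)
        x_col = np.asarray(X)[:, j]
        return b_j0 + 1.0 / (b_0j + np.sum(np.asarray(b) * x_col ** p_j))

    def ensemble_error_bound(X, params):
        # Upper bound (1/k) * sum_j E_j(X) from Equation 2.9 for an averaging ensemble.
        k = np.asarray(X).shape[1]
        return sum(model_error(X, j, *params[j]) for j in range(k)) / k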
Various other forms of E(X) satisfying the convexity property are possible and could be used without substantial changes to the algorithm. Perhaps the best practical way to define E(X) is either to rely on previously acquired experience with a particular type of data mining problem or to use a sample of data for a simulation that will determine the parameters of the error function. We shall demonstrate this approach in the next section.
Once the parameters are known, the minimum and maximum values ε_0, ε_1 of the error function can then be estimated to find the range

    ε_0 ≤ E(X) ≤ ε_1.        (2.10)

It follows that with this model, the error takes its minimal value ε_0 when all data is available at a single processing node (centralized strategies X0), and its largest possible value ε_1 when the data is not shared at all (in-place strategy X1).
It should also be mentioned that the data mining error is a random variable that differs between experiments, so the above approach based on learning curves deals with the average error for a certain type of experiment, or an upper bound for such error, but not necessarily the actual error rate of any particular experiment.
2.4.5  Optimization

Our task now translates into finding a strategy X = [x_ij] that is a solution of the following optimization problem:

    min_[x_ij]  C(X) = Σ_ij c_ij x_ij
    subject to  0 ≤ x_ij ≤ D_i                  (2.11)
                E(X) ≤ ε_max

where ε_max is the maximum error level allowed, the vector D is given by the initial data distribution, and c_ij is defined by Equation 2.5. The optimal solution is a strategy X* that gives the least cost C* = C(X*) among all sufficiently accurate strategies.
The first two lines in (Equation 2.11) define a linear programming problem with a convex bounded polyhedron domain B (which is a multidimensional rectangle in X-space). It can easily be solved, and its solution X̄ gives the best cost attainable in the absence of accuracy restrictions (Chavtal, 1983). As was mentioned, X̄ may differ from the in-place strategy. Geometrically, X̄ is the "lowest" vertex of the polyhedron, where the direction is determined by the level of the linear cost function. If X̄ satisfies the accuracy requirement, then the optimal strategy X* = X̄, otherwise we continue our search.
Since the error function E(X) has convex level contours, the third equation of the optimization problem is an additional convex constraint. It follows that if we define the sets B_ε = {X ∈ B : E(X) ≤ ε}, where the accuracy threshold parameter ε ∈ [ε_0, ε_1], then the collection {B_ε} is an expanding family of convex sets within B. Therefore, finding an optimal intermediate strategy reduces to minimizing a linear function C(X) on a convex set B_{ε_max}, which is a well-studied type of optimization problem and can be solved using standard techniques (Lewis and Borwein, 2000).
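In practice the problem (Equation 2.11) can be handed to any standard convex solver. Below is a minimal sketch (not the thesis's implementation) using SciPy's SLSQP method, assuming a cost matrix c, initial data amounts D, a callable error function and the threshold ε_max:

    import numpy as np
    from scipy.optimize import minimize

    def optimal_strategy(c, D, error_fn, eps_max):
        # Minimize C(X) = sum_ij c_ij x_ij  subject to  0 <= x_ij <= D_i  and  E(X) <= eps_max.
        n = c.shape[0]
        c_flat = c.ravel()
        bounds = [(0.0, D[i]) for i in range(n) for _ in range(n)]   # row i of X is bounded by D_i
        cons = [{"type": "ineq", "fun": lambda x: eps_max - error_fn(x.reshape(n, n))}]
        x0 = np.repeat(np.asarray(D, dtype=float), n) / n            # start from an even split
        res = minimize(lambda x: c_flat @ x, x0, method="SLSQP",
                       bounds=bounds, constraints=cons)
        return res.x.reshape(n, n), res.fun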

As a side note, we have not come across evidence, either theoretical or experimental, that the direction of convexity of the level contours {X : E(X) = const} may be reversed. Such a phenomenon might occur for example when p_j > 1 in the error function (Equation 2.8), or in functions of similar shape. However, if this were to happen, it would result in the following modification of our optimization problem. The sets A_ε = {X ∈ B : E(X) ≥ ε} now become convex and the sets {B_ε} defined above become their complements. (Consider e.g. functions similar to 1/(x^2 + y^2) whose level contours are circles with centers at the origin.) With the error tolerance level set at ε_max, the optimal solution X* is still the lowest point of B_{ε_max}, but in this case, it can easily be shown by a convexity argument that such a point must lie on the intersection of the error level surface {X : E(X) = ε_max} with one of the edges of the polyhedron B. The suggested procedure for finding X* is then as follows: first find the lowest point X̄ of B using linear programming techniques and then go "up" (in the direction of increasing cost) along each edge until it intersects the error level surface E(X) = ε_max, which occurs when the lowest boundary of the set B_{ε_max} of acceptable strategies is reached. It is then a simple task to choose X* as the lowest of the intersection points. Notice that if the linear restrictions are as in (Equation 2.11) and hence B is merely a multidimensional box, the geometry of the procedure is much simplified.
This approach is simple yet effective and, as shown in the following example, produces a
non-trivial optimal cost solution.

Note that the decision of which instances of the data to move is not covered by this method,
just the fraction of the data to move. In the simplest case, we would assume that the local data
subsets are homogeneous and move random samples of local data.
2.5  Case Study: Nursery Data

We tested our approach on experiments with several datasets from the UCI Machine Learning Repository (Murphy and Aha, 1993). The results were similar for most datasets, although some cases exhibited virtually no improvement in accuracy in centralized processing compared to the in-place processing. Coincidentally, these were the datasets that produced a fairly high classification error even in centralized learning, that is, under the most favorable conditions. We tend to believe that the reason for the lack of improvement was a higher intrinsic noise of the dataset and a high degree of homogeneity of the data.
The following illustrative example with the Nursery dataset demonstrates our approach. The dataset contains 12960 data points, each with 8 independent attributes and a class label.
2.5.1  Data Preparation

Assume that there are k = 3 sites that contain distributed, possibly heterogeneous data.
Our task is to determine how much data should be exchanged before the data processing begins
so that the accuracy stays within acceptable limits. After processing, the resulting models are
collected at site 1 for aggregation into an ensemble (thus the root node R = 1). To simulate
such an environment, we took the Nursery database and split it sequentially into three (equal)
parts. Originally, the data stored at the UCI repository was sorted, which is common. Therefore,

sequential partition resulted in three heterogeneous subsets. We also withheld a certain portion of the data for a validation set. As a model type, we chose C4.5 classification trees. C4.5 is
a state-of-the-art decision tree algorithm (Quinlan, 1993). We built the decision tree models
using a freeware data mining package called Weka (Witten and Frank, 1999). The resulting
trees were combined into a voting ensemble so that for a new data instance, all models would
make their predictions and then a majority vote would be taken to produce the final prediction
of the ensemble.
For simplicity of visualization, we decided that the strategy components x_ij will indicate percentages of data shared between nodes i and j, and that all data stored locally shall be used, along with portions of data coming from other nodes, for building local models. Hence x_ii = 1 for all i. The problem is then to determine the other 6 components of the strategy matrix

    X = [ 1    x12  x13 ]
        [ x21  1    x23 ],        x_ij ∈ [0, 1]        (2.12)
        [ x31  x32  1   ]

2.5.2  Error Function Estimation

To estimate the error of building C4.5 trees on a given distributed collection of data, we
considered each local model separately. For example, model 1 was built using the entire data
subset D1 , x21 portion of subset D2 , and x31 portion of subset D3. With the type of error
functions (Equation 2.8) discussed above, the error of model 1 is essentially a function of two
variables x21 and x31 given parametrically as:

    E1(X) = b10 + 1 / ( b01 + b21 x21^p1 + b31 x31^p1 )        (2.13)

(Since x11 = 1, we can combine b01 + b11 x11^p1 into a single constant b01.) To estimate the coefficients, we tabulated the values of E1(X) on the square (x21, x31) ∈ [0, 1]^2 by moving appropriate amounts of data from nodes 2 and 3 to node 1 as indicated by a pair (x21, x31), building a C4.5 tree model, and estimating its accuracy on a validation set. The results of 5-fold cross-validation were then averaged. Some of the tabulated error values for the three models are presented in Table II–Table IV, where rows and columns correspond to values of the varying x_ij components.

TABLE II
MISCLASSIFICATION ERROR RATE (%) FOR MODEL 1, TABULATED FOR 0 ≤ x21, x31 ≤ 1

         0     .25   .50   .75    1
 0     17.3   12.8  13.6  13.2  12.8
.25    13.3    6.0   5.1   4.9   4.5
.50    13.0    5.6   4.8   4.5   4.1
.75    12.5    5.2   4.2   4.1   3.7
 1     11.9    5.0   4.1   3.7   3.4

TABLE III
MISCLASSIFICATION ERROR RATE (%) FOR MODEL 2, TABULATED FOR 0 ≤ x12, x32 ≤ 1

         0     .25   .50   .75    1
 0     13.4    9.5   8.7   8.3   8.6
.25    10.7    6.2   5.3   5.1   4.6
.50    11.9    5.2   4.4   4.1   3.8
.75    11.8    5.1   4.3   3.9   3.6
 1     11.9    5.0   4.1   3.7   3.4

TABLE IV
MISCLASSIFICATION ERROR RATE (%) FOR MODEL 3, TABULATED FOR 0 ≤ x13, x23 ≤ 1

         0     .25   .50   .75    1
 0     20.1   14.4  10.3   9.9   8.6
.25    14.4    5.6   5.3   4.9   4.6
.50    13.2    4.9   4.5   4.1   3.8
.75    13.0    4.6   4.3   3.9   3.6
 1     12.8    4.5   4.1   3.7   3.4

Some minor fluctuations notwithstanding, the error functions exhibit the behavior that we have expected. It is easy to see that sharing data is important, as even a modest amount of sharing gives a rapid improvement in accuracy. That constitutes further evidence that a
pure in-place strategy is likely to be inferior in quality. On the other hand, sharing relatively
small amounts of data is still cheap, which gives hope that intermediate strategies will indeed
represent a good balance between cost and accuracy. The level contours of E1 (X ) were also
shown in Figure 1 in the previous section.
The values were then used to fit the parameters of the error functions (Equation 2.8). We generated least-squares estimates for the parameters with the MATLAB package. There are numerous other interpolation and curve fitting techniques available. The error function formulas that we obtained were

    E1(X) = -0.5531 + 1 / ( 1.3042 + 0.1924 (x21)^0.1335 + 0.1973 (x31)^0.1335 )

    E2(X) = -0.4861 + 1 / ( 1.5662 + 0.1063 (x12)^0.1441 + 0.2269 (x32)^0.1441 )        (2.14)

    E3(X) = -0.4332 + 1 / ( 1.5317 + 0.2671 (x13)^0.1717 + 0.3536 (x23)^0.1717 )
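The thesis produced these fits with MATLAB; a comparable fit can be sketched with SciPy's curve_fit on the tabulated grid for model 1. The starting values, the bounds, the assignment of table rows and columns to x21 and x31, and the treatment of error rates as fractions are all assumptions of this sketch:

    import numpy as np
    from scipy.optimize import curve_fit

    # Tabulated misclassification rates for model 1 (Table II), as fractions.
    grid = np.array([0.0, 0.25, 0.50, 0.75, 1.0])
    E1 = np.array([[17.3, 12.8, 13.6, 13.2, 12.8],
                   [13.3,  6.0,  5.1,  4.9,  4.5],
                   [13.0,  5.6,  4.8,  4.5,  4.1],
                   [12.5,  5.2,  4.2,  4.1,  3.7],
                   [11.9,  5.0,  4.1,  3.7,  3.4]]) / 100.0

    def err_model(xy, b10, b01, b21, b31, p1):
        # E1(X) = b10 + 1 / (b01 + b21*x21^p1 + b31*x31^p1), the form of Equation 2.8.
        x21, x31 = xy
        return b10 + 1.0 / (b01 + b21 * x21**p1 + b31 * x31**p1)

    x21_grid, x31_grid = np.meshgrid(grid, grid)
    params, _ = curve_fit(err_model, (x21_grid.ravel(), x31_grid.ravel()), E1.ravel(),
                          p0=[-0.5, 1.0, 0.2, 0.2, 0.2],
                          bounds=([-1.0, 0.0, 0.0, 0.0, 0.01], [1.0, 10.0, 10.0, 10.0, 1.0]))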

2.5.3  Optimization

Once both the cost function and the error function are known, the optimization problem (Equation 2.11) can be solved using standard techniques. In our illustrative example, a slight modification will make it possible to get an analytical solution to (Equation 2.11). Namely, we shall require that each local model satisfy the property E_j(X) ≤ ε_max, j = 1, 2, 3. This simplification will allow us to break (Equation 2.11) into three smaller optimization problems:

    min_[x_ij]  C(X) = c21 x21 + c31 x31
    subject to  0 ≤ x_ij ≤ 1                                                    (2.15)
                E1(X) = b10 + 1 / ( b01 + b21 x21^p1 + b31 x31^p1 ) ≤ ε_max

    min_[x_ij]  C(X) = c12 x12 + c32 x32
    subject to  0 ≤ x_ij ≤ 1                                                    (2.16)
                E2(X) = b20 + 1 / ( b02 + b12 x12^p2 + b32 x32^p2 ) ≤ ε_max

    min_[x_ij]  C(X) = c13 x13 + c23 x23
    subject to  0 ≤ x_ij ≤ 1                                                    (2.17)
                E3(X) = b30 + 1 / ( b03 + b13 x13^p3 + b23 x23^p3 ) ≤ ε_max

Leaving the constraint 0 ≤ x_ij ≤ 1 aside for a while and using standard optimization techniques such as the Kuhn-Tucker theorem, it is possible to show after some algebraic manipulations that an optimization problem

    min_X  C(X) = c_ij x_ij + c_kj x_kj
    s.t.   E_j(X) = b_j0 + 1 / ( b_0j + b_ij x_ij^p_j + b_kj x_kj^p_j ) ≤ ε_max        (2.18)

has a solution

    x_ij = (b_ij c_kj)^(1/(1-p_j)) · [ ( 1/(ε_max - b_j0) - b_0j ) /
           ( (b_ij c_kj^p_j)^(1/(1-p_j)) + (b_kj c_ij^p_j)^(1/(1-p_j)) ) ]^(1/p_j)        (2.19)

An optimal solution that satisfies 0 ≤ x_ij ≤ 1 will then either be given by the formula (Equation 2.19) or be one of the points of the intersection of the curve E_j(X) = ε_max and the boundaries of the square [0, 1]^2, which are easy to locate.
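A direct transcription of Equation 2.19 (a sketch using the notation above; the clipping to [0, 1] and the boundary search described in the text are omitted):

    def optimal_share(b_ij, b_kj, b_j0, b_0j, c_ij, c_kj, p_j, eps_max):
        # x_ij from Equation 2.19 for the two-variable sub-problem (2.18).
        s = 1.0 / (eps_max - b_j0) - b_0j          # required value of b_ij*x_ij^p + b_kj*x_kj^p
        q = 1.0 / (1.0 - p_j)
        denom = (b_ij * c_kj**p_j)**q + (b_kj * c_ij**p_j)**q
        return (b_ij * c_kj)**q * (s / denom)**(1.0 / p_j)

    # For example, model 1 in Case #1 below (c21 = c31 = 2, eps_max = 0.08) gives
    # optimal_share(0.1924, 0.1973, -0.5531, 1.3042, 2.0, 2.0, 0.1335, 0.08) ~ 0.073.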
2.5.4  Cost

Whereas the error function is defined by the data processing algorithms and the quality of the data, the cost function depends on factors like the network infrastructure, the hardware and software systems used, etc. In our example, we have some liberty in choosing the cost function.
Generally, when the strategy components x_ij represent percentages of data instead of the actual amounts x_ij D_i that are shared, the cost of data processing must be represented by a matrix of μ_ij values instead of a vector μ_j. However, because we have had equal initial amounts of data at each local node, this modification is not necessary and a vector (μ_1, μ_2, μ_3) will suffice. Also, the communication cost between each pair of nodes is symmetrical, so that we only need to know the values (λ_12, λ_13, λ_23).
We assume here that the cost of shipping the resulting three decision tree models to the root for combining them into an ensemble is negligible because the size of a decision tree model is typically much smaller than the size of the data on which it was built. (Otherwise, some trivial modifications have to be made in what follows.) Moreover, as was mentioned before, aggregating the three models at the root will cost the same regardless of strategy, hence the optimization is unaffected. Therefore, we let γ = 0. Then

    [c_ij] = [ λ_ij + μ_j + γ λ_j1 ] = [ μ_1          λ_12 + μ_2   λ_13 + μ_3 ]
                                       [ λ_12 + μ_1   μ_2          λ_23 + μ_3 ]        (2.20)
                                       [ λ_13 + μ_1   λ_23 + μ_2   μ_3        ]

2.5.5  In-place and Centralized Strategies

Depending on which node is chosen as a root, there are three centralized strategies of moving
all data to one node and building a single model:
    X0(1) = [ 1 0 0 ]        X0(2) = [ 0 1 0 ]        X0(3) = [ 0 0 1 ]
            [ 1 0 0 ]                [ 0 1 0 ]                [ 0 0 1 ]        (2.21)
            [ 1 0 0 ]                [ 0 1 0 ]                [ 0 0 1 ]

The best centralized cost C0 is then the smallest of the three values

    C(X0(1)) = λ_12 + λ_13 + 3μ_1
    C(X0(2)) = λ_12 + λ_23 + 3μ_2        (2.22)
    C(X0(3)) = λ_13 + λ_23 + 3μ_3

and the centralized error is E(X0) = ε_0 = 3.4% (from the tabulation).


Note that having moved all data to one place, we can also build an ensemble of models
there, which may improve the accuracy even further but result in an additional cost due to
increased complexity of the data processing part.

The in-place strategy, i.e. a strategy of no data sharing, is

    X1 = [ 1 0 0 ]
         [ 0 1 0 ]        (2.23)
         [ 0 0 1 ],

which has a cost C1 = μ_1 + μ_2 + μ_3 but results in high errors (based on the tabulated values):
    E1(X1) = 17.3%
    E2(X1) = 13.4%        (2.24)
    E3(X1) = 20.1%

Hence the in-place strategy is not acceptable. But depending on the relative values of the communication costs λ_ij and data processing costs μ_j, a centralized strategy may become too expensive, and we can trade some of its accuracy for a noticeable improvement in cost.
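For the case study, the cost matrix of Equation 2.20 and the benchmark costs C0 and C1 can be computed directly. A minimal sketch under the λ, μ notation above, with γ = 0 and equal data amounts, so that shares are percentages:

    import numpy as np

    def case_costs(lam, mu):
        # c_ij = lambda_ij + mu_j with gamma = 0   (Equation 2.20)
        c = np.asarray(lam, dtype=float) + np.asarray(mu, dtype=float)[None, :]
        C0 = min(float(np.sum(c[:, r])) for r in range(c.shape[0]))   # best centralized cost (Equation 2.22)
        C1 = float(np.trace(c))                                       # in-place cost mu_1 + ... + mu_n
        return c, C0, C1

    # Case #1 below: lam off-diagonal 1, mu = (1, 1, 1) reproduces C0 = 5 and C1 = 3.
    lam = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
    c, C0, C1 = case_costs(lam, [1, 1, 1])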
2.5.6  Optimal Solution

Let us set the accuracy threshold at ε_max = 8%. We shall now explore the effects that different combinations of λ_ij and μ_j have on the solution. The actual values of the cost coefficients below have no specific meaning by themselves, as it is their relative proportions to each other that matter. Note that all optimal solutions X* presented below satisfy the accuracy requirement, and so what we are interested in is how much savings they provide over the best centralized strategy.
Case #1: Communication and processing costs are uniform and of similar magnitude:

    λ_12 = 1, λ_13 = 1, λ_23 = 1
    μ_1 = 1,  μ_2 = 1,  μ_3 = 1

    =>   C = [ 1 2 2 ]
             [ 2 1 2 ],      C0 = 5,   C1 = 3        (2.25)
             [ 2 2 1 ]

The in-place strategy thus gives 40% savings over the centralized strategies, which indicates
that there is likely to be a well-balanced intermediate strategy. Equation 2.19 produces the
following optimal intermediate strategy:
2

X =

6
6
6
6
6
6
6
4

1 :016 :081
:073

1 :114

:075 :038

3
7
7
7
7
7;
7
7
5

C  = 3:79

24.1% savings over C0

(2.26)

which satis es accuracy constraints yet is 24:1% cheaper than the best of centralized approaches.
Case #2: Communication and processing costs are uniform, communication is more expensive:

    λ_12 = 10, λ_13 = 10, λ_23 = 10
    μ_1 = 1,   μ_2 = 1,   μ_3 = 1

    =>   C = [ 1  11 11 ]
             [ 11 1  11 ],      C0 = 23,   C1 = 3        (2.27)
             [ 11 11 1  ]

The in-place strategy thus provides an impressive 87% savings over the centralized strategies. (Equation 2.19) gives
\[
X^* =
\begin{bmatrix}
1 & .016 & .081 \\
.073 & 1 & .114 \\
.075 & .038 & 1
\end{bmatrix},
\qquad
C^* = 7.36 \quad (68\% \text{ savings over } C_0)
\tag{2.28}
\]
Note that this optimal strategy $X^*$ is the same as in the previous example. This is due to the fact that the relative magnitudes of the non-diagonal coefficients of the matrix $C$ with respect to each other stayed constant, in which case Equation 2.19 gives the same solution. What differs is the cost relative to the centralized strategy, which went down because the expensive (and unnecessary) data transfer is largely avoided.
Case #3: Communication and processing cost are uniform, data processing is expensive:
\[
\begin{cases}
\lambda_{12} = 1, \quad \lambda_{13} = 1, \quad \lambda_{23} = 1 \\
\beta_1 = 10, \quad \beta_2 = 10, \quad \beta_3 = 10
\end{cases}
\;\Longrightarrow\;
C =
\begin{bmatrix}
10 & 11 & 11 \\
11 & 10 & 11 \\
11 & 11 & 10
\end{bmatrix},
\qquad
\begin{cases}
C_0 = 32 \\
C_1 = 30
\end{cases}
\tag{2.29}
\]

Again, Equation 2.19 gives the same intermediate strategy


\[
X =
\begin{bmatrix}
1 & .016 & .081 \\
.073 & 1 & .114 \\
.075 & .038 & 1
\end{bmatrix},
\qquad
C(X) = 34.37
\tag{2.30}
\]

but it is no longer optimal! It appears that cheap communication allows us to move all data to a centralized location and build a single very accurate model there. This is both cheaper and better than building three local models, each on only a portion of data. The optimal strategy is then any of the three centralized strategies, as they all cost the same. Note also that the centralized strategies do not satisfy our initial assumption $x_{ii} = 1$. Hence they are not covered by Equation 2.19 and must be examined separately. This explains why Equation 2.19 was not able to produce the optimal solution.
Case #4: Communication cost varies, data processing is cheap:
\[
\begin{cases}
\lambda_{12} = 10, \quad \lambda_{13} = 5, \quad \lambda_{23} = 1 \\
\beta_1 = 1, \quad \beta_2 = 1, \quad \beta_3 = 1
\end{cases}
\;\Longrightarrow\;
C =
\begin{bmatrix}
1 & 11 & 6 \\
11 & 1 & 2 \\
6 & 2 & 1
\end{bmatrix},
\qquad
\begin{cases}
C_0 = 9 \\
C_1 = 3
\end{cases}
\tag{2.31}
\]
Due to the significant reduction in the amount of expensive communication between nodes 1 and 2, Equation 2.19 gives an optimal intermediate strategy
\[
X^* =
\begin{bmatrix}
1 & .004 & .036 \\
.051 & 1 & .190 \\
.105 & .065 & 1
\end{bmatrix},
\qquad
C^* = 4.96 \quad (45\% \text{ savings over } C_0)
\tag{2.32}
\]
Case #5: Communication and processing cost vary:
\[
\begin{cases}
\lambda_{12} = 1, \quad \lambda_{13} = 1, \quad \lambda_{23} = 10 \\
\beta_1 = 10, \quad \beta_2 = 10, \quad \beta_3 = 1
\end{cases}
\;\Longrightarrow\;
C =
\begin{bmatrix}
10 & 11 & 2 \\
11 & 10 & 11 \\
11 & 20 & 1
\end{bmatrix},
\qquad
\begin{cases}
C_0 = 14 \\
C_1 = 21
\end{cases}
\tag{2.33}
\]

The best centralized strategy is $X_0(3)$. Equation 2.19 gives an intermediate strategy
\[
X =
\begin{bmatrix}
1 & .026 & .246 \\
.073 & 1 & .044 \\
.075 & .031 & 1
\end{bmatrix},
\qquad
C(X) = 24.5
\tag{2.34}
\]

which is more expensive than the centralized strategy $X_0(3)$ of moving all data to node 3 and building a single model there. Interestingly, even the in-place strategy, which usually is the most cost-effective if accuracy is not an issue, is more expensive than $X_0(3)$ in this case! The explanation is simple: local data processing at nodes 1 and 2 would be too costly, and moving all data to node 3 is a better option. Thus $X^* = X_0(3)$.
Numerous other cases are readily available for investigation using this framework. Some possible modifications include:
1. To require that the communication cost coefficients $\lambda_{ij}$ satisfy the triangle inequalities. This may better represent the option of moving data between nodes via a third node. We haven't done so here simply to show the different possibilities and the flexibility of our framework.
2. To impose additional (linear) constraints on the $x_{ij}$ due to network topology. For example, if the only route between nodes 1 and 3 is through node 2, we may want to make use of all the data passing between nodes 1 and 3 in building local model 2. In this case,
\[
x_{12} \ge x_{13}, \qquad x_{32} \ge x_{31},
\tag{2.35}
\]
which would require some changes in the optimization procedure. The general approach would still remain the same.
These examples show that once the structure of the learning process - i.e. of the error
function - is known, non-trivial intermediate strategies occur naturally and are often superior.
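To make the shape of the resulting optimization concrete, the sketch below casts the search for an intermediate strategy as a linear program in the entries $x_{ij}$, assuming unit-sized local datasets and assuming that the accuracy requirement has already been translated into linear constraints. Here a purely hypothetical stand-in constraint is used (each node must receive at least a fraction $\tau$ of the total data); it is not the error function of this chapter, and the code does not reproduce Equation 2.19, it only indicates how such a problem can be handed to an off-the-shelf solver.

```python
import numpy as np
from scipy.optimize import linprog

def intermediate_strategy(C, tau=0.4):
    """Minimize sum_ij c_ij * x_ij subject to 0 <= x_ij <= 1, x_ii = 1,
    and a stand-in linear accuracy constraint: every node j must receive
    a total amount of data of at least tau * k."""
    k = C.shape[0]
    c = C.flatten()                               # objective coefficients, row-major
    A_ub = np.zeros((k, k * k))                   # -sum_i x_ij <= -tau * k, for each j
    for j in range(k):
        for i in range(k):
            A_ub[j, i * k + j] = -1.0
    b_ub = -tau * k * np.ones(k)
    bounds = [(1.0, 1.0) if i == j else (0.0, 1.0)    # keep local data in place
              for i in range(k) for j in range(k)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x.reshape(k, k), res.fun

C = np.array([[1, 2, 2], [2, 1, 2], [2, 2, 1]])   # the Case #1 cost matrix
X, cost = intermediate_strategy(C, tau=0.4)
print(np.round(X, 3), cost)
```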
2.6  Conclusion

In this chapter, we have introduced a new framework and methodology for distributed data mining. It allows us to choose a cost-optimal balance between local computation and node-to-node communication and data transfer. We show that this framework effectively bridges the two simple approaches to distributed data mining which are common today: one that mines all data locally (in-place mining) and one that moves all data to a single processing node (centralized mining). We call the strategies that lie between these two extremes intermediate strategies.
The framework reduces the problem of finding intermediate strategies to a mathematical programming problem which minimizes a cost function, incorporating both communication and processing terms, subject to an error constraint. We show by example that this problem is interesting even for linear cost functions. Finally, we introduce a method, OPTDMP, for finding intermediate strategies.

CHAPTER 3

STRATEGIES FOR A DUAL OPTIMIZATION PROBLEM

3.1  Introduction

In the previous chapter, the problem of finding intermediate strategies for distributed data mining was introduced. Intermediate strategies suggest a balance between the accuracy of centralized data processing and the cost savings of in-place processing. It was shown that, given a general structure of the error function and the cost of each stage of data processing, the problem may be posed as that of minimizing a cost function while satisfying an accuracy condition. In this case, a linear function is minimized over a convex feasible set.
A problem dual to the one described above is that of minimizing the error while satisfying
the cost constraints:
\[
\begin{aligned}
& \min_{[x_{ij}]} \; E(X) \\
& 0 \le x_{ij} \le D_i \\
& C(X) = \sum_{i,j} c_{ij} x_{ij} \le \delta
\end{aligned}
\tag{3.1}
\]
If the previous setting is followed, this corresponds to minimizing a convex function over a feasible set defined by linear inequalities, which is a well-known problem with a wide range of methods available to solve it. See (Lewis and Borwein, 2000).
However, there is an essential difference between the two problems. When the cost is minimized, the primary question is how much data to transfer. Varying the amount of data traffic is what affects the cost function the most. Note that for the optimal intermediate strategies, the equation $E(X) = \varepsilon_{\max}$ is satisfied.
On the other hand, when the error function $E(X)$ is minimized, it follows from the nature of $E(X)$ that there are typically no local minima within the interior of the feasible region and, therefore, the optimal solution is also on the boundary. That is, $C(X^*) = \delta$. Because the value of $\delta$ is strongly related to the volume of data traffic, the issue is no longer how much data to transfer, because given the value of $\delta$ there is little room for variation. If so, the next most important issue remaining is which data to choose for transfer. Choosing different data records generally results in different accuracy of the predictive models built on that data and, therefore, the problem of minimizing the error $E(X)$ given cost constraints is more naturally related to the problem of selecting "proper" data rather than to that of selecting the right amount.
3.2  Dual Strategies

In what follows, it is assumed that the cost of data transfer between data sites dominates the
cost of local data processing and aggregation of the results. This assumption is quite reasonable
from the practical standpoint, given that it is data delivery and cleaning, not data processing,
that is currently the most time-consuming and costly part of the entire data mining process.
Under this assumption, a strategy acquires cost only at the data transfer stage, thus
\[
C(X) = \sum_{i \neq j} \lambda_{ij} x_{ij} \le \delta, \qquad 1 \le i, j \le k,
\tag{3.2}
\]
where, as before, $\lambda_{ij}$ is the unit cost of data transfer between nodes $i$ and $j$. Also, $\lambda_{ii} = 0$.

Instead of solving the general minimization problem, we now consider a simplification when the cost is split evenly between all $k(k-1)$ terms in Equation 3.2. In other words, consider
\[
\begin{aligned}
& \min_{[x_{ij}]} \; E(X) \\
& 0 \le x_{ij} \le D_i \\
& \lambda_{ij} x_{ij} \le \frac{\delta}{k(k-1)}, \quad i \ne j.
\end{aligned}
\tag{3.3}
\]
It follows from the observation above that the optimal strategy $X^*$ would satisfy the cost condition exactly, in which case
\[
x_{ij}^* =
\begin{cases}
\dfrac{\delta}{\lambda_{ij}\, k(k-1)}, & i \ne j \\[1ex]
0, & i = j.
\end{cases}
\tag{3.4}
\]

Hence the amount of data transfer between each pair of nodes is known immediately. The
question that is now more pertinent is: Which data to choose for this transfer?
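A minimal sketch of the even-split allocation of Equation 3.4 is shown below: given the unit transfer costs $\lambda_{ij}$ and the total budget $\delta$, the amount shipped between each ordered pair of nodes follows directly (the cost values used in the example are illustrative only).

```python
import numpy as np

def even_split_transfer(lam, delta):
    """x*_ij = delta / (lam_ij * k * (k - 1)) for i != j, and 0 on the diagonal
    (Equation 3.4): every ordered pair consumes an equal share of the budget."""
    lam = np.asarray(lam, dtype=float)
    k = lam.shape[0]
    x = np.zeros_like(lam)
    off = ~np.eye(k, dtype=bool)
    x[off] = delta / (lam[off] * k * (k - 1))
    return x

lam = [[1, 2, 5], [2, 1, 4], [5, 4, 1]]   # illustrative unit transfer costs
print(np.round(even_split_transfer(lam, delta=6.0), 3))
```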
3.3  Dual Strategies for Ensemble Learning

In this part of the study, we conducted experiments to explore and compare several possible ways to build ensembles of local models in the presence of constraints on data traffic.
The idea of choosing particular data instances to feed into the learning algorithm has been used in several successful supervised learning methods, primarily in the context of boosting (Dietterich, 2000), (Freund, 1995). In a traditional boosting scenario, a single predictive model is repeatedly trained on data sampled from the same dataset but with probability weights that change over time. Namely, a data instance has a larger probability of being chosen if the model in its current iteration makes an error predicting the class value of the instance. If the class attribute is numeric, the size of the error affects the probability as well. This approach ensures that parts of the original data distribution where the model still makes errors will be better exposed to the learning algorithm at the next iteration, whereas parts that are already well learned are not exposed needlessly.
There is a natural connection between boosting and ensemble methods or, more generally, random forests and other random collections of predictive models. See (Breiman, 1999). Our motivation is to explore the potential of a boosting-like approach in our distributed data mining scenario.
To build the ensembles, we used the same Nursery dataset from the UCI Data Repository (Murphy and Aha, 1993) as in the case study in the earlier chapter. The data was partitioned into k = 3 distributed subsets. Again, we used the Weka data mining package (Witten and Frank, 1999).
For the initial partition, we were interested in modeling the situations where data comes from either a single source or from different sources. Because we used real data and couldn't control its source, the training set was initially split either homogeneously or heterogeneously in the following fashion.
In a homogeneous initial split, we assumed that there is no essential difference between the underlying data distributions at each of the k distributed sites, so the dataset was divided into k equal parts randomly. To model a heterogeneous initial partition, we suggested the following procedure: build a single C4.5 decision tree on the entire dataset and broadly mimic the way it splits the data.
TABLE V

MISCLASSIFICATION ERROR RATES FOR ENSEMBLES BUILT ON NURSERY DATA

Data transfer (%)   Same data source        Different data sources
                    Simple     Boosted      Simple     Boosted
 0                  5.2        5.2          37.6       37.6
10                  5.2        5.5           7.7        7.2
20                  5.3        5.6           6.7        5.6
30                  5.2        5.5           6.0        5.5
40                  5.2        5.5           5.7        5.5
50                  5.2        5.4           5.4        5.3
60                  5.3        5.6           5.4        5.4
70                  5.3        5.3           5.3        5.3
Decision trees have an important property of creating data splits that are easily interpretable. In our experiments, we sorted the data by its most informative attribute - the one at the tree root node - and then split it sequentially into k equal subsets.
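This partitioning step can be mimicked with a few lines of array code, sketched below under the assumption that the most informative attribute (the decision-tree root) has already been identified and is passed in as a numeric column index; the function names are illustrative rather than part of the experimental code.

```python
import numpy as np

def heterogeneous_split(X, root_attribute, k=3):
    """Sort the data by its most informative attribute (the decision-tree root)
    and split it sequentially into k roughly equal subsets."""
    order = np.argsort(X[:, root_attribute], kind="stable")
    return np.array_split(X[order], k)

def homogeneous_split(X, k=3, seed=0):
    """Random split into k equal parts: every subset sees the same distribution."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(X), k)
```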
We conducted three separate runs for each type of experiment and each type of initial partition. A single run used five-fold cross-validation to test the accuracy of the resulting collection of locally built models. The test results were then averaged over these 15 cross-validation folds. The accuracy of the collection was tested for a range of values of $\delta$, the parameter that sets the constraints on data traffic. In this case, $\delta \in [0, 1]$ was defined as the largest fraction of data that can be moved between local data subsets, or colors.
We describe these experiments below. The test results are presented in Table V.

3.3.1  Simple Ensembles

In this first type of experiment, the initial partition of data into k colors was updated by a random exchange of data. Namely, $\delta$ percent of each local dataset was chosen randomly, then divided equally into $k - 1$ parts, and each part was moved to one of the other $k - 1$ colors. Once data was collected at the receiving site, a C4.5 decision tree model was built on the resulting dataset. The collection of k locally built models was then tested as a voting ensemble, where the test set was defined by the current cross-validation fold. This experiment provides a base case against which we measure the benefits of other methods of data selection.
It is worth mentioning that because the process of selecting the portion of data to be transferred in this experiment was completely randomized, we made three independent random selections for each value of $\delta$ in each of the cross-validation folds. The test results were averaged over these three runs, and then averaged further over the 15 cross-validation folds as described above.
It appears that when the initial partition of data is random, i.e. when the initial colors represent the same data distribution, the Simple Ensemble method produces ensembles of practically the same accuracy (5.2% misclassification error) regardless of the allowable data transfer. This is not unexpected: indeed, if the initial partition of data represents k replicas of the same data distribution, then subsequent random exchanges of data between colors preserve these local distributions.
However, when the initial partition represents k distinct data distributions, we observe a dramatic improvement in the accuracy of the ensemble as the amount of data traffic increases. The misclassification error rate of the in-place strategy ($\delta = 0$, no data traffic) is more than seven times that of the strategy that allows a thorough mixing of the k distinct distributions. This is due to the fact that initially, each model of the ensemble is not able to learn any of the other $k - 1$ data distributions and hence makes intolerably frequent errors on the test set. Taking a vote of such a poor collection of predictors gives an error rate of as much as 37.6%. However, as the value $\delta$ of the data traffic increases, the local models start learning the other data distributions and thus become better and better predictors for the whole dataset, with the error rate of the ensemble dropping to as low as 5.3%.
We also observe that an adequate degree of mixture is achieved at around $\delta = 60\%$, where the accuracy is already as good as that of the ensemble built in the case of a common initial distribution. Note that the value $\delta = 2/3 = 66.7\%$ corresponds to a uniform mixture of the original data distributions, i.e. a mixture where each color contains $1/k$ of the data from each of the original distributions.
3.3.2  Boosted Ensembles

In our next experiment, the data was no longer chosen for transfer randomly. Instead, we adopted boosting as our method for choosing the data instances to be transferred. In the distributed environment, this results in the following iterative procedure (a sketch follows Equation 3.5):

- For iterations i = 1, 2, ...:
  - Build all local models.
  - Set the data transfer constraint $\delta_i$ for the current iteration.
  - For each pair of colors (fromColor, toColor):
    - Choose $\delta_i/(k-1)$ percent of the data instances in the fromColor that are misclassified by the model of the toColor.
    - Move the chosen data instances to the toColor.
- Use the resulting k models as an ensemble that makes its prediction by a majority vote.
The data traffic is thus broken into several batches, one for each iteration. To ensure the stability of the algorithm and the overall data transfer constraint, the following requirements should be met:
\[
1 \ge \delta_1 \ge \delta_2 \ge \dots \ge 0, \qquad \sum_i \delta_i \le \delta.
\tag{3.5}
\]
In our implementation, $\delta_i = \delta/2^i$, which satisfies Equation 3.5.
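A sketch of this exchange loop under the above assumptions is given below. The helpers `build_model` and `misclassified` stand in for the local C4.5 learner and its error check and are hypothetical; $\delta$ is passed as a fraction, and only the flow of data between colors follows the procedure and the schedule $\delta_i = \delta/2^i$ described above.

```python
def boosted_exchange(subsets, build_model, misclassified, delta, iterations=3):
    """Iteratively move misclassified instances toward the model that errs on them.

    subsets       : list of k lists of labeled instances (the colors)
    build_model   : callable, subset -> predictive model (e.g. a C4.5 tree)
    misclassified : callable, (model, instances) -> instances the model gets wrong
    delta         : total fraction of data allowed to move (0..1)
    """
    k = len(subsets)
    for i in range(1, iterations + 1):
        models = [build_model(s) for s in subsets]      # build all local models
        delta_i = delta / (2 ** i)                      # per-iteration budget
        for src in range(k):
            for dst in range(k):
                if src == dst:
                    continue
                quota = int(delta_i / (k - 1) * len(subsets[src]))
                # instances of the source color misclassified by the destination model
                chosen = misclassified(models[dst], subsets[src])[:quota]
                for inst in chosen:
                    subsets[src].remove(inst)
                    subsets[dst].append(inst)
    return [build_model(s) for s in subsets]            # the final ensemble members
```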


The test results show that boosting gave an improvement over simple ensembles in the case of a heterogeneous initial data partition, i.e. different initial distributions. This improvement is most noticeable for small or medium amounts of allowable data transfer. For example, exchanging only 20% of data chosen by boosting results in an ensemble just as accurate (5.6% error rate), or even slightly more accurate, than that obtained by exchanging twice as much data chosen randomly (5.7% error rate). As the value of $\delta$ increases, however, the benefits of boosting gradually disappear. The reason is that when the allowable traffic is large, the simple ensemble approach guarantees a well-balanced mixture of the initial local data distributions and hence not much is left for improvement.
Perhaps it is for a similar reason that in the case of a common initial data distribution, boosted ensembles were consistently outperformed by simple ensembles, although only marginally.
3.4  Dual Strategies for Clustering

Another important class of data mining methods that is based on selecting particular data instances is clustering and mixture problems (Everitt, 1974), (Mitchell, 1997). There is a conceptual difference between voting ensembles of predictive models and clustering methods. Every model in the ensemble is expected to learn the entire global data distribution. On the other hand, each cluster in a clustering model represents only a portion of the global distribution that, in general, does not overlap the portions represented by other clusters. In terms of selecting and relocating data instances between clusters, clustering can be viewed as a technique opposite to boosting, in the following sense: in boosting, a model is presented with data on which it makes the largest errors, whereas in clustering, a cluster generally receives data that fits it better than other clusters.
Our focus is on partition-based clustering algorithms as the most appropriate type of clustering methods for the problem of partitioning data across k distributed sites. In a distributed environment, this translates into treating each of the k sites as a separate cluster. The constraints on moving data given by Equation 3.4 must then be injected into the clustering algorithm of our choice.
We adopt the k-means clustering algorithm for our tests, which is a well-known partition-based clustering algorithm (Everitt, 1974). The restrictions on the amount of data transfer

make it impractical to update the cluster centroids too often, as is done in the standard version of k-means. Indeed, if the goal is to keep the data locally, then the strategy is to exchange only the clustering results, i.e. the centroids, between the sites but do all the updates locally. However, this would require keeping all locally stored copies of the centroids up-to-date. Note that each cluster may comprise data from all data sites. If the algorithm updates the centroids whenever a new data point is added to the cluster, given that this data point may come from any of the local sites, there needs to be a prohibitively frequent exchange of the centroid copies. The resulting traffic between sites would be equivalent to, or worse than, the traffic created when the entire dataset is centralized in the first place.
A natural way to overcome this obstacle is to update the cluster centroids by moving data in batches and to exchange the centroids between the k sites only infrequently. Also, there must be a mechanism for stopping the clustering process before the amount of data traffic exceeds $\delta$. We therefore deployed an iterative k-means procedure similar to the one for boosting (a sketch follows the list):

- For iterations i = 1, 2, ...:
  - Update the centroids for each color, the k distributed centroids comprising the k-means clustering model.
  - Set the data transfer constraint $\delta_i$ for the current iteration.
  - For each pair of colors (fromColor, toColor):
    - Choose $\delta_i/k$ percent of the data instances in the fromColor that are closer to the centroid of the toColor. Move the chosen data instances to the toColor.
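One possible rendering of a single iteration of this batch variant is sketched below. It assumes each color is a plain NumPy array and uses Manhattan distances as in the experiments; the per-iteration budget $\delta_i$ is supplied by the caller, and the code is an illustration rather than the Weka-based implementation used in the tests.

```python
import numpy as np

def constrained_kmeans_step(colors, delta_i):
    """One iteration of distributed k-means with a per-iteration transfer budget.

    colors  : list of k NumPy arrays (one per site), shape (n_i, d)
    delta_i : fraction of each site's data allowed to leave it this iteration
    """
    k = len(colors)
    centroids = np.stack([c.mean(axis=0) for c in colors])     # local centroid update
    new_colors = [[] for _ in range(k)]
    for src in range(k):
        data = colors[src]
        # Manhattan distances to all centroids, shape (n_src, k)
        dist = np.abs(data[:, None, :] - centroids[None, :, :]).sum(axis=2)
        stay = np.ones(len(data), dtype=bool)
        for dst in range(k):
            if dst == src:
                continue
            quota = int(delta_i / k * len(data))
            gain = dist[:, src] - dist[:, dst]                  # > 0: closer to dst
            order = np.argsort(-gain)
            chosen = [i for i in order if stay[i] and gain[i] > 0][:quota]
            if chosen:
                new_colors[dst].append(data[chosen])
                stay[chosen] = False
        new_colors[src].append(data[stay])
    return [np.vstack(parts) for parts in new_colors]
```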

3.4.1  Clustering Gene Expression Microarray Data

In this experiment, we illustrate some aspects of the dual strategies for clustering. Our dataset for this test contains gene expression microarray data. Gene expression data has recently become a very powerful tool for the analysis of individual genes and their interactions, as well as the discovery of natural gene clusters and groups. See (Tibshirani et al., 1999), (Gerstein and Jansen, 2000). In a typical microarray dataset, rows correspond to different individual genes and columns correspond to particular experiments, types of environment, or some other external stimuli. The numerical values within the table are called gene expression values and represent the activity level of each gene in response to each stimulus. Because similar genes react to the same stimulus in a similar way, gene expression data is frequently used to cluster genes or otherwise draw conclusions about gene similarity.
We took a gene expression microarray dataset described in (Zhang et al., 2001). The data contains the expression values of 2000 individual genes (rows) on 62 different tissues (columns), of which 22 were normal tissues and 40 were colon cancer tissues. We treat each row as a separate data instance and each column as a data attribute. Our objective is to demonstrate the effect of data transfer constraints on the quality of clustering.
In this experiment, we set k = 10 and made r = 100 independent runs. In each run, the data was initially partitioned into 10 data subsets randomly. Then, for a range of different values of the allowable data transfer $\delta$, we ran the distributed k-means clustering algorithm. We chose values of $\delta$ that roughly follow a logarithmic scale. Values of $\delta$ may exceed 100% because the data is allowed to be transferred between local data subsets multiple times before a stable clustering configuration is reached, giving a large overall data transfer. The Weka data mining package (Witten and Frank, 1999) was used.
TABLE VI

ANALYSIS OF THE GENE EXPRESSION MICROARRAY DATA

Allowable data    Actual data     Residual    Precision   Recall
transfer (%)      transfer (%)    error       (%)         (%)
     0                0           346.09      N/A         N/A
    10               11.1         342.31      46.2        14.1
    30               23.2         332.81      53.6        28.5
   100               61.0         268.87      74.3        52.6
   300              113.2         208.39      72.3        76.4
  1000              147.8         174.81      88.3        86.8
  3000              147.3         164.56      95.5        92.6
 10000              157.8         162.22     100.0       100.0
It is known that for high-dimensional data, low $L_p$ norms such as the Manhattan norm give better distance metrics than the usual Euclidean one (Aggarwal et al., 2001). Therefore, the Manhattan distance metric was used for clustering. The obtained clusters of genes were then examined. The results of the tests are presented in Table VI and are discussed below.

3.4.2  Effect on the Cluster Tightness

We first observe the effect of increasing the allowable data traffic on the tightness of the obtained clusters. Because all attributes of the data are numerical, we can treat each data record as a point in a 62-dimensional space. We define the residual error as the average distance per dimension between a data record and the nearest of the k = 10 centroids. The error rates presented in the table are the result of averaging over the r = 100 independent trials. The residual error is as high as 346.09 in the initial random distribution but, as expected, subsequent clustering increases the tightness of each cluster and the average residual error drops to as low as 162.22 when $\delta = 10000\%$, i.e. when each data record is potentially moved around 100 times on average. There is no significant reduction in error beyond this value of $\delta$. In fact, the residual errors begin to stabilize somewhere around $\delta = 3000\%$.
We also see that the actual data traffic does not increase indefinitely as the constraints are relaxed. Instead, it stabilizes around a value of 150-160%, despite a much larger value for the allowable data transfer. This is due to the fact that for each dataset and task, there is an optimal amount of data transfer that is required to reach a stable clustering configuration. Beyond this intrinsic value, further data transfer is unnecessary or may even be detrimental.
3.4.3  Identifying Similar Genes

Our main goal is to use the results of clustering to find similar genes, which is an important problem in bioinformatics. Assume a gene GeneX of interest is chosen among the 2000 genes in the data. Note that because clustering methods like k-means depend on the initial conditions, it is unrealistic to expect that the same clustering configuration will result from each run of the algorithm. We therefore wish to identify all the genes that consistently appear in the same cluster with GeneX.
Let GeneY be another gene from the gene pool. Let $\hat{P}$ be the observed frequency with which GeneY falls in the same cluster with GeneX and let $p$ be the true frequency. We shall call GeneX and GeneY similar if we can reject a statistical null hypothesis
$H_0$: GeneX and GeneY appear in the same cluster only randomly.

We formalize this null hypothesis and the alternative hypothesis as follows:
\[
H_0: p \le p_0, \qquad H_1: p > p_0
\tag{3.6}
\]
where $p_0 = 1/k$ is the frequency of two data records falling into the same cluster at random, given $k$ clusters.
We shall develop a level .05 test for $H_0$. First note that we are dealing with a binomial distribution for which a positive outcome is: "Both GeneX and GeneY are in the same cluster". It is known that when the number of trials $r \ge 100$, a binomial random variable may be closely approximated by a Normal Gaussian distribution, and hence the empirical frequency $\hat{P}$ of a binomial distribution may also be approximated by a Normal Gaussian distribution (Pugachev, 1984). We then use the Neyman-Pearson Lemma to construct a critical region of size .05 for the test. The general type of the region is well known in the case of a Normal random variable: because $\hat{P}$ has mean $p$ and variance $p(1-p)/r$, the critical region of size .05 for $H_0$ is defined by the condition
\[
\frac{\hat{P} - p_0}{\sqrt{p_0 (1 - p_0)/r}} \ge z_{.95} = 1.645
\tag{3.7}
\]
where $z_{.95}$ is the 95th percentile of the standard Normal distribution. See (Hoel et al., 1971).
Hence, to identify genes that consistently fall in the same cluster with GeneX, we compute the empirical frequencies $\hat{P}$ for all genes in the pool and take those genes for which Equation 3.7 is satisfied. This gives a stable GeneX-cluster.
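The test of Equation 3.7 is straightforward to apply once the co-clustering frequencies have been counted over the r runs. The sketch below returns the stable GeneX-cluster for a matrix of cluster labels, one row per run; the layout of `labels` is an assumption made for the example.

```python
import numpy as np

def stable_cluster(labels, gene_x, k=10, z_crit=1.645):
    """Indices of genes that co-cluster with gene_x significantly more often than
    the chance frequency p0 = 1/k (one-sided level-.05 test, Equation 3.7).

    labels : array of shape (r, n_genes); labels[t, g] is the cluster of gene g in run t
    """
    r = labels.shape[0]
    p0 = 1.0 / k
    together = (labels == labels[:, [gene_x]])      # (r, n_genes) co-membership indicator
    p_hat = together.mean(axis=0)                   # observed co-clustering frequencies
    z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / r)
    similar = np.where(z >= z_crit)[0]
    return similar[similar != gene_x]

# Example with purely random labels (no real structure): the cluster should be small.
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=(100, 2000))
print(len(stable_cluster(labels, gene_x=0)))
```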
3.4.4  Effect on the Precision and Recall

The experiment described above was performed for each value of $\delta$. However, we know that the quality of clustering is best when the restrictions on data transfer are relaxed the most, which corresponds to the largest value $\delta = 10000\%$ in this experiment. Therefore, we treat the GeneX-cluster found at this value as the true cluster and GeneX-clusters found with lower values of $\delta$ as approximations to this true cluster. Our objective is to investigate the effect of the data transfer constraints on the ability of the clustering method to recover the true GeneX-cluster. We are interested in both the precision and the recall, defined as usual:
\[
\text{precision} = \frac{|Appr \cap True|}{|Appr|}, \qquad
\text{recall} = \frac{|Appr \cap True|}{|True|}
\]
where $Appr$ and $True$ are the approximate and the true clusters respectively, and $|\cdot|$ indicates the number of elements in the set.
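For completeness, the two set-based measures can be computed directly from the approximate and true GeneX-clusters, as in the small example below.

```python
def precision_recall(approx, true):
    """Precision and recall of an approximate gene cluster against the true one."""
    approx, true = set(approx), set(true)
    hits = len(approx & true)
    return hits / len(approx), hits / len(true)

print(precision_recall({1, 2, 3, 4}, {2, 3, 5}))   # (0.5, 0.666...)
```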
In our experiments, we randomly selected five different genes for the role of GeneX. This gave five different collections of GeneX-clusters, where each collection is made of clusters found under different data transfer constraints $\delta$. The values of precision and recall were averaged over the five collections and are presented in Table VI. The last row represents the true GeneX-clusters, hence both precision and recall values are 100%. The first row represents the initial distribution before clustering, hence neither precision nor recall values were collected.
We observe that when the allowable data traffic is small, it is the value of the recall that suffers the most. Only a small percentage of genes that are similar to GeneX are identified as such by the clustering method. However, the precision is considerably better than the recall, indicating that the clustering does not pick wrong genes as often as it misses the correct ones. This situation is due to the fact that for low values of data transfer, the clustering is able to produce only a small GeneX-cluster, hence missing many correct answers. Indeed, when the data transfer is low, the clustering does not have enough time to diverge from the random initial distribution. Therefore, the clustering process is so far from completion that at the end of it many genes appear in virtually random combinations with other genes, and the next run of the clustering method would rearrange these combinations completely. As a result, only a few gene pairs ever have a chance to be identified as consistently similar, because the hypothesis $H_0$ is seldom rejected.
This also serves as a warning against trusting any individual run of a clustering method when the exact associations between data instances are of primary interest. This danger is aggravated when the data transfer constraints are present. Instead, a method combining several runs must be devised. The hypothesis testing procedure described above is an example of such a method.
As the constraints on data transfer are relaxed, we observe a rapid increase in the value
of the recall that eventually reaches the value of the precision. This is due to the fact that as
the quality of clustering results improves, pairs of similar genes fall into the same cluster more

often. Therefore, the hypothesis $H_0$ of their similarity being merely random is rejected more frequently. As a result, the GeneX-cluster returned by the clustering method contains more
genes and hence fewer correct answers are missed, which boosts the recall.
3.5  Conclusion

This chapter demonstrated that in the distributed environment, the quality of the data and of the data partition becomes one of the dominant factors that affect the accuracy of the resulting data mining systems. Moreover, it is possible to improve the quality of the data partitions by using initial data reallocation. In this case, however, there must be a mechanism for selecting the right data instances for the transfer. It appears that previously developed methods such as boosting and clustering may be used to develop new methods of marking the data instances, which, depending on the data mining task, may lead to a significant improvement in the quality of the results.
More generally, the clustering and mixture model setting may also be used as a basis for a theoretical foundation that formalizes the problem of finding the right data partition across the k distributed sites before the data mining phase begins. In our next chapter, we develop such a formal framework using the Expectation-Maximization approach. Specifically, we show that the problem of dual intermediate strategies, and hence the problem of finding data partitions, may be solved by a constrained EM algorithm.

CHAPTER 4

EXPECTATION-MAXIMIZATION ALGORITHM WITH DATA TRANSFER CONSTRAINTS

4.1  Mixture of Models and the Expectation-Maximization Approach

The previous chapter presents the problem of finding the optimal allocation of data to k distributed data sites. The strategy of transferring data aims at minimizing the overall predictive error while satisfying restrictions on data transfer.
In the absence of data transfer constraints, the task of allocating data instances to several disjoint groups is related to the so-called mixture of models problem, where the dataset is comprised of data drawn from k distinct distributions but the true membership of each data instance is not known in advance. The information about true data membership thus serves as hidden data attributes, as opposed to the observable attributes. The goal is to recover the parameters of each distribution of the mixture or, equivalently, to separate the data into k groups corresponding to their true membership. Many well-known problems can be treated in a mixture of models context. For example, k-means clustering is essentially a problem of recovering k Normal distributions from a dataset.
One of the most powerful tools used in a mixture of models scenario is the EM algorithm
(Dempster et al., 1977). In short, the algorithm alternates between two steps:

- Expectation: estimates the probability distribution of the hidden variables using the current estimates for the model parameters.
- Maximization: finds the maximum likelihood estimates of the model parameters using the current estimates for the distribution of the hidden variables.

It has been shown (see (Dellaert, 2002)) that at each iteration, the EM algorithm finds a tight lower bound for the true log-likelihood function, and subsequent maximizations of this bound result in eventual convergence of the algorithm to a local maximum of the true log-likelihood.
Let $X$ represent the observable part of the data, $Z$ represent the hidden variable, and $\Theta$ represent the parameters of the models in the mixture. The goal is to find the maximum likelihood estimate for $\Theta$:
\[
\Theta^* = \arg\max_{\Theta} \log P(X \mid \Theta) = \arg\max_{\Theta} \log \sum_{Z} P(X, Z \mid \Theta)
\tag{4.1}
\]
where $P(X \mid \Theta)$ is the probability of observing data $X$ if the true mixture is given by $\Theta$. $P(X, Z \mid \Theta)$ is defined similarly.

4.2  Expectation-Maximization with Constraints on Data Transfer

Now consider the case where the data transfer constraints between models, or colors, are present. Values of the hidden variable $Z$ correspond to different ways of allocating data to the k colors. Restricting the amount of data transfer between colored data subsets effectively reduces the number of possible data colorings $Z$ reachable from the current data coloring.
Let the distance $||Z_1 - Z_2||$ in the $Z$-space be defined as the least amount of data transfer between the k colors needed to reach a color configuration $Z_2$ from a color configuration $Z_1$. Let $t$ be the number of the current iteration, with $Z_t$ and $\Theta_t$ as the current estimates for the hidden variable and the model parameters.
Assume that the amount of data transferred at the current iteration is constrained by a parameter $\delta_t$. Then $Z_{t+1}$ must fall in the $\delta_t$-neighborhood of $Z_t$ defined as
\[
N_t = \{ Z : ||Z - Z_t|| \le \delta_t \}.
\tag{4.2}
\]

It must be noted that the sequence $\{\delta_t\}$ must satisfy
\[
\sum_t \delta_t \le \delta
\tag{4.3}
\]
where $\delta$ represents the constraint on the total amount of data transferred between the colors by the algorithm. If $Z_0$ is the value of the hidden variable that corresponds to the initial allocation of data at $t = 0$, then the feasible values of $Z$ are restricted to the $\delta$-neighborhood of $Z_0$ throughout the execution of the algorithm. Therefore, the original log-likelihood optimization problem from Equation 4.1 is now assumed to be replaced by the following constrained log-likelihood optimization problem:
\[
\Theta^* = \arg\max_{\Theta} \log P(X \mid \Theta) \approx \arg\max_{\Theta} \log \sum_{||Z - Z_0|| \le \delta} P(X, Z \mid \Theta).
\tag{4.4}
\]
The corresponding optimization problem for the $t$-th iteration of the constrained Expectation-Maximization algorithm is then
\[
\Theta^* = \arg\max_{\Theta} \log P(X \mid \Theta) \approx \arg\max_{\Theta} \log \sum_{Z \in N_t} P(X, Z \mid \Theta).
\tag{4.5}
\]

4.3  Lower Bound for the Log-likelihood

Consider the $t$-th iteration of the constrained EM algorithm. Possible values of $Z$ must now be restricted to the $\delta_t$-neighborhood of $Z_t$. Therefore, the lower bound for the true log-likelihood function is derived in terms of a truncated probability distribution of $Z$ with $N_t$ as its support.
Let $\varphi_t(Z)$ be a probability distribution of $Z$ over the $Z$-space of hidden variables. Then, after normalization, the corresponding truncated distribution of $Z$ with support $N_t$ is defined in terms of the conditional distribution:
\[
\varphi_t^N(Z) = P(Z \mid Z \in N_t) =
\begin{cases}
\varphi_t(Z)/m(N_t), & Z \in N_t \\
0, & \text{otherwise}
\end{cases}
\tag{4.6}
\]
where $m(N_t)$ is the probability measure of the $\delta_t$-neighborhood of $Z_t$ under the original untruncated probability distribution:
\[
m(N_t) = P(Z \in N_t) = \sum_{Z \in N_t} \varphi_t(Z).
\tag{4.7}
\]

Following a procedure similar to the one in (Dellaert, 2002), this conditional distribution is used to define the lower bound which the constrained EM algorithm maximizes:
\[
B(\Theta; \varphi_t) = \sum_{Z} \varphi_t^N(Z) \log \frac{P(X, Z \mid \Theta)}{\varphi_t^N(Z)}
\tag{4.8}
\]
and hence, from Jensen's inequality and Equation 4.5,
\[
B(\Theta; \varphi_t) \le \log \sum_{Z} \varphi_t^N(Z) \, \frac{P(X, Z \mid \Theta)}{\varphi_t^N(Z)} = \log P(X \mid \Theta).
\tag{4.9}
\]

The EM approach can be interpreted as a sequential maximization of the lower bound $B(\Theta; \varphi_t)$.

4.4  Lower Bound Sub-optimality

It is shown in (Dellaert, 2002) that in the absence of the data transfer constraints, maximizing the expression $B(\Theta_t; \varphi_t)$ with respect to the probability distribution $\varphi_t(Z)$ gives the optimal distribution
\[
\varphi_t(Z) = P(Z \mid X, \Theta_t)
\tag{4.10}
\]
that results in the best possible lower bound for the log-likelihood, namely, a locally tight lower bound that touches the objective log-likelihood function:
\[
B(\Theta_t; \varphi_t) = \log P(X \mid \Theta_t).
\tag{4.11}
\]

When the data transfer constraints are present, Equation 4.6 gives the corresponding optimal truncated distribution in the $\delta_t$-neighborhood of $Z_t$:
\[
\varphi_t^N(Z) =
\begin{cases}
P(Z \mid X, \Theta_t)/m(N_t), & Z \in N_t \\
0, & \text{otherwise}
\end{cases}
\;=\; P(Z \mid X, \Theta_t, N_t)
\tag{4.12}
\]
where in this case, $m(N_t) = P(Z \in N_t \mid X, \Theta_t)$. Alternatively, the same expression may be obtained by a direct maximization as in (Dellaert, 2002), where $N_t$ now plays the role of the $Z$-space.

It follows that
\[
\begin{aligned}
B(\Theta_t; \varphi_t) &= \sum_{Z} \varphi_t^N(Z) \log \frac{P(X, Z \mid \Theta_t)}{\varphi_t^N(Z)} \\
&= \sum_{Z \in N_t} \frac{P(Z \mid X, \Theta_t)}{m(N_t)} \log \frac{P(X, Z \mid \Theta_t)}{P(Z \mid X, \Theta_t)/m(N_t)} \\
&= \sum_{Z} P(Z \mid X, \Theta_t, N_t) \left( \log P(X \mid \Theta_t) + \log m(N_t) \right) \\
&= E\left[ \log P(X \mid \Theta_t) + \log m(N_t) \mid X, \Theta_t, N_t \right].
\end{aligned}
\tag{4.13}
\]
Because $m(N_t)$ is merely a constant and $\log P(X \mid \Theta_t)$ is fully defined by the values $X, \Theta_t$, the conditional expectation can be dropped, which gives
\[
B(\Theta_t; \varphi_t) = \log P(X \mid \Theta_t) + \log m(N_t) \le \log P(X \mid \Theta_t).
\tag{4.14}
\]
Comparing Equation 4.11 and Equation 4.14 demonstrates the qualitative effect of the data transfer restrictions on the lower bound $B(\Theta; \varphi_t)$. Note that by the nature of a probability measure, $m(N_t) \in [0, 1]$ and so $\log m(N_t) \le 0$. Therefore, whereas the lower bound for the true log-likelihood function is locally tight in the unconstrained problem, it is relaxed by as much as $-\log m(N_t)$, a value that is directly related to the allowable data traffic.

Also note that when converted back from the logarithmic to the usual scale, Equation 4.14 gives the corresponding lower bound for the true likelihood function:
\[
e^{B(\Theta_t; \varphi_t)} = P(X \mid \Theta_t)\, m(N_t) \le P(X \mid \Theta_t)
\tag{4.15}
\]
where, as before, $m(N_t) = P(Z \in N_t \mid X, \Theta_t) \in [0, 1]$. This shows that the locally tight lower bound for the likelihood function of the unconstrained problem is reduced by a factor of $m(N_t)$ in the constrained problem. Equation 4.14 and Equation 4.15 express in quantitative terms the intrinsic sub-optimality of the constrained EM algorithm compared to the usual unconstrained version.
Naturally, relaxing the restrictions on data transfer improves the quality of the solution found by the constrained EM algorithm at each iteration. Indeed, increasing the value of the constraint parameter $\delta_t$ corresponds to the expansion of the neighborhood $N_t$ around the hidden variable value $Z_t$ in the $Z$-space, and so its probability measure $m(N_t)$ also increases. Removing the restrictions altogether corresponds to $\delta_t = \infty$ and the allowable hidden variable neighborhood $N_t$ covering the entire $Z$-space, which gives $m(N_t) = 1$ and $\log m(N_t) = 0$. In this case, Equation 4.11 follows directly from either Equation 4.14 or Equation 4.16.
Also note that Equation 4.14 may be rewritten as
\[
B(\Theta_t; \varphi_t) = \log \left( P(X \mid \Theta_t)\, P(N_t \mid X, \Theta_t) \right) = \log P(X, N_t \mid \Theta_t)
\tag{4.16}
\]
which further reveals the nature of the lower bound function $B(\Theta; \varphi_t)$ in the constrained case.

These results are summarized in the following theorem.

Theorem 4.4.1. Let Equation 4.8 define a family of lower bounds $B(\Theta; \varphi_t)$ for the log-likelihood function $\log P(X \mid \Theta_t)$ at the $t$-th iteration of the constrained EM algorithm. Let $\delta_t$ be the maximum allowable amount of data transfer at the $t$-th iteration. Then the best lower bound is achieved with the distribution
\[
\varphi_t^N(Z) = P(Z \mid X, \Theta_t, N_t)
\tag{4.17}
\]
but is nevertheless sub-optimal, in the following sense:
\[
B(\Theta_t; \varphi_t) = \log P(X, N_t \mid \Theta_t) = \log P(X \mid \Theta_t) + \log m(N_t) \le \log P(X \mid \Theta_t)
\tag{4.18}
\]
where $m(N_t) = P(N_t \mid X, \Theta_t)$. The corresponding lower bound for the likelihood function reaches only as high as $P(X \mid \Theta_t)\, m(N_t)$, whereas a tight lower bound of $P(X \mid \Theta_t)$ is achieved in the unconstrained case.

4.5  General Expectation-Maximization Algorithm with Constraints

As was mentioned above, the essence of the EM approach lies in a sequential maximization of the lower bound $B(\Theta; \varphi_t)$. Similar to the derivation in (Dellaert, 2002), Equation 4.8 and Equation 4.12 give
\[
\begin{aligned}
B(\Theta; \varphi_t) &= E\left[ \log \frac{P(X, Z \mid \Theta)}{P(Z \mid X, \Theta_t)/m(N_t)} \,\Big|\, X, \Theta_t, N_t \right] \\
&= E\left[ \log P(X, Z \mid \Theta) \mid X, \Theta_t, N_t \right] - E\left[ \log \frac{P(Z \mid X, \Theta_t)}{m(N_t)} \,\Big|\, X, \Theta_t, N_t \right] \\
&= Q_t(\Theta) - E\left[ \log \frac{P(Z \mid X, \Theta_t)}{m(N_t)} \,\Big|\, X, \Theta_t, N_t \right].
\end{aligned}
\tag{4.19}
\]

Because the second term in the difference does not depend on $\Theta$, the maximization problem becomes
\[
\Theta_{t+1} = \arg\max_{\Theta} Q_t(\Theta) = \arg\max_{\Theta} E\left[ \log P(X, Z \mid \Theta) \mid X, \Theta_t, N_t \right].
\tag{4.20}
\]
Therefore, the general EM algorithm with constraints on the amount of data transfer can be described as follows (a generic sketch of this loop follows the description):

- Given the initial distribution of data across the k colors and the maximum amount $\delta$ of allowable data transfer, identify the initial value of the hidden variable $Z_0$ and the sequence $\{\delta_t\}$.
- For each iteration t:
  - E-step: identify $Z_t$ and $N_t$, estimate the distribution $\varphi_t^N(Z) = P(Z \mid X, \Theta_t, N_t)$ of the hidden variable $Z$, and compute the function
    $Q_t(\Theta) = E\left[ \log P(X, Z \mid \Theta) \mid X, \Theta_t, N_t \right]$.
  - M-step: find the data transfer strategy that results in building local models that correspond to the maximum likelihood estimates
    $\Theta_{t+1} = \arg\max_{\Theta} Q_t(\Theta)$.
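The loop structure of this constrained EM algorithm can be written down generically; everything problem-specific (how the truncated posterior over $N_t$ is estimated and how $Q_t$ is maximized) is hidden behind two callables, which are placeholders rather than part of the thesis framework.

```python
def constrained_em(Z0, theta0, deltas, e_step, m_step):
    """Generic constrained EM loop.

    Z0, theta0 : initial data allocation (hidden variable) and model parameters
    deltas     : per-iteration transfer budgets, with sum(deltas) <= total budget
    e_step     : callable (Z_t, theta_t, delta_t) -> truncated posterior over N_t
    m_step     : callable (posterior) -> (Z_{t+1}, theta_{t+1}), i.e. the data
                 transfer and the refit of the local models maximizing Q_t
    """
    Z, theta = Z0, theta0
    for delta_t in deltas:
        posterior = e_step(Z, theta, delta_t)    # E-step restricted to N_t
        Z, theta = m_step(posterior)             # M-step: move data, rebuild models
    return Z, theta
```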
4.6  Implementation Details

In this section, the constrained EM approach will now be demonstrated for the case when there are k distributed data sites $D_1, \ldots, D_k$ and a predictive model $f_j$ is built on each dataset, where $0 < j \le k$. Let the class attribute be numeric, so that all $f_j$ are regression models.
In general, regression models are predictive models that, given an unlabeled data instance $x$, return a probability distribution of its numeric class value or, presuming the other parameters are known, the mean of such a distribution. Typically, a Normal Gaussian distribution is used, with its mean used as the predicted value. The variance reflects the accuracy of such a prediction. There is a wide variety of types of regression models, including linear, generalized linear and nonlinear regression models (Bates and Watts, 1988), CART decision trees (Breiman et al., 1984), clustering models (Everitt, 1974), etc.
For simplicity of demonstration, assume that all Normal distributions returned by $f_1, \ldots, f_k$ have the same variance $\sigma^2$. For a given unlabeled data instance $x$, the prediction $f_j(x)$ is then the mean of the Normal distribution of the class attribute returned by the $j$-th model.
The derivation below follows the steps of the derivation of k-means clustering with an unconstrained EM approach presented in (Mitchell, 1997).

Let there be a total of n data instances $\{x_i\}_{i=1}^{n}$ in the combined dataset $\cup_{j=1}^{k} D_j$. First note that the hidden variable $Z$ that contains the data membership information may be represented by an n-by-k matrix of values:
\[
z_{ij} =
\begin{cases}
1, & x_i \in D_j \\
0, & \text{otherwise.}
\end{cases}
\tag{4.21}
\]

In order to derive the expression for $Q_t(\Theta)$ defined in Equation 4.20 for the E-step, notice, as in (Mitchell, 1997), that only one of the $z_{ij}$ is non-zero for a given $i$. Also, let $f_j(\cdot \mid \Theta)$ be the predicted mean of the Normal distribution returned by the $j$-th model when the parameters of the model correspond to the given value of $\Theta$. The log-likelihood function may then be written as
\[
\begin{aligned}
\log P(X, Z \mid \Theta) &= \log \prod_{i=1}^{n} P(x_i, z_{i1}, \ldots, z_{ik} \mid \Theta) \\
&= \log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2\sigma^2} \sum_{j=1}^{k} z_{ij} \left( x_i - f_j(x_i \mid \Theta) \right)^2} \\
&= n \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \sum_{j=1}^{k} z_{ij} \left( x_i - f_j(x_i \mid \Theta) \right)^2.
\end{aligned}
\tag{4.22}
\]

Let $\omega_{ij}^t = E[z_{ij} \mid X, \Theta_t, N_t]$. It follows from Equation 4.21 that $\{\omega_{ij}^t\}$ is a collection of numbers on $[0, 1]$ independent of $\Theta$. Then the function $Q_t(\Theta)$ maximized in the M-step is defined as
\[
Q_t(\Theta) = E\left[ \log P(X, Z \mid \Theta) \mid X, \Theta_t, N_t \right]
= n \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2} \sum_{i,j} \omega_{ij}^t \left( x_i - f_j(x_i \mid \Theta) \right)^2.
\tag{4.23}
\]
Therefore, the M-step of the constrained EM algorithm is equivalent to the following global weighted least squares minimization problem:
\[
\Theta_{t+1} = \arg\min_{\Theta} \sum_{j=1}^{k} \sum_{i=1}^{n} \omega_{ij}^t \left( x_i - f_j(x_i \mid \Theta) \right)^2.
\tag{4.24}
\]

Note that in a distributed environment, $\Theta = \langle \Theta_1, \ldots, \Theta_k \rangle$, where each $\Theta_j$ corresponds to a predictive model $f_j = f_j(\cdot \mid \Theta_j)$ built separately on $D_j$. Hence Equation 4.24 results in the following system of k weighted least squares minimization problems:
\[
\Theta_j^{t+1} = \arg\min_{\Theta_j} \sum_{i=1}^{n} \omega_{ij}^t \left( x_i - f_j(x_i \mid \Theta_j) \right)^2, \qquad 1 \le j \le k.
\tag{4.25}
\]

4.7  Weighting Schema and Data Transfer in the Maximization Step

Note that the system of optimization problems in Equation 4.25 is the same for both the regular EM algorithm and the EM algorithm with constraints on data transfer. The constraints only affect the weights $\omega_{ij}$. In fact, there is a deep connection between the weighting schema $\{\omega_{ij}\}$ and the strategy of data transfer between the distributed data sites $D_1, \ldots, D_k$.
As was stated, in a distributed environment, the objective is to find the optimal allocation of data instances to the k data sites. At every step of the algorithm, each of the predictive models $f_j$ is built only on data from $D_j$. Consequently, only the terms that correspond to $D_j$ will be available in the sum in Equation 4.25 for each $j$ in any practical implementation of the algorithm. Therefore, Equation 4.25 must be interpreted as a stochastic optimization rather than a deterministic optimization. The weights are then interpreted as
\[
\omega_{ij}^t = P\left( x_i \text{ is to be used in building } f_j(\cdot \mid \Theta_j^{t+1}) \right).
\tag{4.26}
\]
Let $x \sim f_j$ indicate that a data instance $x$ was generated by a Normal distribution with mean $\mu = f_j(x)$ and variance $\sigma^2$, where $f_j$ is one of the k predictive models. Assume for now that there are no restrictions on data transfer, i.e. that $\delta_t = \infty$ and $N_t$ covers the whole $Z$-space. In this case, if for some data instance $x_i \sim f_j$, then $x_i$ must belong to $D_j$. Similar to (Mitchell, 1997), this together with Equation 4.21 gives:
\[
\omega_{ij}^t \big|_{\infty} = \omega_{ij}^t \big|_{\delta_t = \infty}
= E[z_{ij} \mid X, \Theta_t, \delta_t = \infty]
= P(x_i \sim f_j \mid X, \Theta_t)
= \frac{e^{-(x_i - f_j(x_i \mid \Theta_t))^2 / 2\sigma^2}}{\sum_{s=1}^{k} e^{-(x_i - f_s(x_i \mid \Theta_t))^2 / 2\sigma^2}}.
\tag{4.27}
\]

The stochastic data transfer strategy that corresponds directly to Equation 4.25 is then to allocate each data instance $x_i$ to the dataset $D_j$ with probability $\omega_{ij}^t|_{\infty}$ and build least squares models $f_1, \ldots, f_k$ on the resulting datasets. On average, this strategy results in the following volume of data traffic:
\[
V_{\infty}^t = E[\text{amount of data moved} \mid X, \Theta_t, \delta_t = \infty]
= \sum_i 1 \cdot P(x_i \text{ is moved} \mid X, \Theta_t, \delta_t = \infty)
= \sum_i \left( 1 - \omega_{i c_i}^t \big|_{\infty} \right)
= |D| - \sum_i \omega_{i c_i}^t \big|_{\infty}
\tag{4.28}
\]

where $c_i = j$ iff $x_i \in D_j$, i.e. $c_i$ is the current membership of $x_i$, and $|D|$ is the total amount of data in all k datasets. The sum $\sum_i \omega_{i c_i}^t|_{\infty}$ represents the average amount of data that stays in place.
To investigate the case when the constraints are present, the nature of the neighborhood $N_t$ must first be established. As before, let $Z_t$ be the current allocation of data and let $Z \in N_t$ if there needs to be no more than $\delta_t$ amount of data transfer to reach data allocation $Z$ from $Z_t$.
There are several possible ways to enforce the constraints on the data traffic. A natural way to do so is to adjust the probabilities $\omega_{ij}^t$ to make it uniformly "easier" for each data instance $x_i$ to stay in its current data subset $D_{c_i}$. This can be achieved by increasing each $\omega_{i c_i}^t$ by a factor $\nu$, the appropriate value of which is to be found later. The other $k - 1$ probabilities must then be normalized, effectively making it uniformly "harder" for the data to be moved to other datasets. The weights are then expressed as
\[
\omega_{i c_i}^t = \nu\, \omega_{i c_i}^t \big|_{\infty}, \qquad
\omega_{ij}^t = \frac{1 - \nu\, \omega_{i c_i}^t|_{\infty}}{1 - \omega_{i c_i}^t|_{\infty}} \; \omega_{ij}^t \big|_{\infty}, \quad j \ne c_i,
\tag{4.29}
\]
where an appropriate value of $\nu$ must be found depending on the desired amount of data transfer. In this constrained case, an argument similar to Equation 4.28 gives
\[
\delta_t = |D| - \nu \sum_i \omega_{i c_i}^t \big|_{\infty}
\tag{4.30}
\]
from where an equation for $\nu$ follows:

\[
\nu = \frac{|D| - \delta_t}{\sum_i \omega_{i c_i}^t|_{\infty}} = \frac{|D| - \delta_t}{|D| - V_{\infty}^t}.
\tag{4.31}
\]
This together with Equation 4.27 and Equation 4.29 gives the formula for the weights. The data instances shall then be allocated according to these probabilities, which on average will result in the volume $\delta_t$ of data traffic. Other stochastic schemata that correspond to Equation 4.25 are possible as well.
As expected, $\nu \ge 1$ whenever $\delta_t \le V_{\infty}^t$, i.e. whenever the data traffic is indeed restricted by imposing constraints. Also, it may seem that the calculation of $\nu$, and hence of the weights $\omega_{ij}^t$, involves using all data instances, which would require access to the entire centralized data set. In reality, however, only the locally built models $f_i$ and some aggregate statistics need be exchanged between the k distributed sites.
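Putting Equations 4.27-4.31 together, the weight computation and the stochastic reallocation can be sketched as follows. The array names are assumptions made for the example (`preds[i, j]` stands for the local prediction $f_j(x_i \mid \Theta_t)$, and `y` holds the numeric class values, which the thesis writes directly in terms of $x_i$); $\delta_t$ is passed as a fraction of $|D|$, and the clipping of $\nu$ is a numerical safeguard of the sketch, not part of the derivation above.

```python
import numpy as np

def transfer_weights(y, preds, current, delta_t, sigma2=1.0):
    """Adjusted membership probabilities omega_ij under a transfer budget delta_t.

    y       : (n,) numeric class values
    preds   : (n, k) predictions f_j(x_i | Theta_t) of each local model
    current : (n,) index c_i of the site currently holding each instance
    """
    n, k = preds.shape
    resid2 = (y[:, None] - preds) ** 2
    w_inf = np.exp(-resid2 / (2 * sigma2))
    w_inf /= w_inf.sum(axis=1, keepdims=True)           # Equation 4.27
    stay_inf = w_inf[np.arange(n), current]
    V_inf = n - stay_inf.sum()                          # Equation 4.28 with |D| = n
    nu = (n - delta_t * n) / (n - V_inf)                # Equation 4.31 (delta_t a fraction)
    nu = min(nu, 1.0 / stay_inf.max())                  # safeguard: keep probabilities valid
    w = w_inf * (1 - nu * stay_inf[:, None]) / (1 - stay_inf[:, None])
    w[np.arange(n), current] = nu * stay_inf            # Equation 4.29
    return w

def reallocate(w, seed=0):
    """Sample a destination site for every instance from its adjusted weights."""
    rng = np.random.default_rng(seed)
    return np.array([rng.choice(len(row), p=row / row.sum()) for row in w])
```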
4.8  Conclusion

This chapter shows that the problem of dual intermediate strategies, and hence the problem of finding data partitions, may be solved by a constrained EM algorithm. It also shows how the general theoretical framework of a constrained EM algorithm may be applied to develop a family of algorithms that search for the optimal partitions of data across the distributed sites. There are a variety of different versions and modifications possible within this framework.
In the next chapter, we introduce a scenario where it is beneficial to build a hierarchical model assignment system on a distributed collection of data. We also develop an algorithm called Greedy Data Labeling that is used to enhance the quality of the data partition so that a near-optimal model assignment system is produced. We show that a model assignment system enhanced by the GDL algorithm nearly always outperforms such methods as voting ensembles. The GDL algorithm belongs to the more general family of algorithms that was introduced in the current chapter through the constrained EM approach.

CHAPTER 5

GREEDY DATA LABELING AND THE DISTRIBUTED MODEL ASSIGNMENT PROBLEM

5.1  Introduction

In traditional data mining, a learning algorithm F is applied to a single, although possibly large, set of data D. Data instances are thought to be of the type $(x, y)$, where $x$ is a vector of data attributes and $y$ is a class label. The choice of the learning algorithm essentially defines a parametric family of functions, called predictive models or classifiers. Given the algorithm F and the dataset D, a predictive model $f : x \to y$ is built to capture the properties of the underlying data distribution and predict the class value of unlabeled data instances.
In distributed data mining, there are several datasets that are geographically distributed, with data potentially coming from different underlying distributions. This requires a different approach. Chapter 2 addressed the fundamental trade-off in distributed data mining: there are two extremes, namely to combine all data at a central site and build a single model there, which typically gives a more accurate result, or to mine data in-place, which is cheaper. In the latter scenario, several predictive models are built locally. Generally, there is also a so-called meta-model, an overstructure that regulates the deployment of the collection of local models $f_1, \ldots, f_k$, which we call base-models. The meta-model does not have to be of the same type F as the base-models.

There are several well-known ways to deploy a collection of models, whether built from several data subsets or from a combined dataset. One is the voting/averaging ensemble, where either an average or a majority vote of the individual models' predictions is used as the overall prediction of the system (Dietterich, 2000). In this case, the meta-model $g_0$ is a simple averaging function. Ensemble learning has become one of the most popular methods due to both its simplicity and effectiveness. A more complex technique is the so-called meta-learning (Chan et al., 1995), where a meta-model is separately trained on the outputs of the base-models. When new unlabeled data is presented, the individual base-models make predictions, after which the meta-model reads their scores and makes the overall prediction. Both ensemble learning and meta-learning are bottom-up methods: all base-models are deployed for scoring and the meta-model combines their outputs into the final result.
A model assignment problem is a problem of building a top-down hierarchical predictive system. When new unlabeled data is presented, the meta-model is deployed first. Based on its output, the data is forwarded to one of the base-models - the one that handles that type of data better than others. The output of the chosen base-model becomes the overall prediction.
The model assignment problem arises naturally from the observation that data may come from heterogeneous sources, i.e. from different underlying distributions. In this setting, it may be desirable to have a collection of specialized models, each corresponding to a different data distribution. Consider an application where there is a separate model predicting the risk of heart disease for each age group. When a new patient arrives, the meta-model will invoke the
predictive model that corresponds to that patient's age. There is no need to use any of the
other base-models.
In this chapter, we explore the scenario when a model assignment system is built on a
distributed collection of data subsets, i.e. when base-models are created by applying the chosen
learning algorithm F to each of the local data subsets $D_1, \ldots, D_k$. It follows from the above argument that the model assignment approach would probably work best when the initial partition of data across the k local sets reflects the differences in the underlying data types.
This also is the scenario in the previous chapter.
5.1.1  Data Partitions

Each partition of a dataset D into subsets $D_1, \ldots, D_k$ can be thought of as a way to color the combined dataset D with k different colors, after which a base-model is built on each color and a meta-model is trained to assign colors to unlabeled data instances. In general, our point of view is that, given a particular data mining task, there is an optimal partition of a dataset D across k sites or, equivalently, an optimal coloring with k different colors, that results in the best model assignment system. However, the initial partition $D_1, \ldots, D_k$ of the data may not be the best and hence needs to be optimized before the base-models are built on the local datasets.
In this chapter, we introduce a simple and efficient method for improving the quality of data partitions, called Greedy Data Labeling (GDL). It allows us to choose individual data instances for relocation between the distributed data subsets $D_1, \ldots, D_k$ so that the resulting subsets become, in a certain sense, more homogeneous within themselves. This gives an overall partition that better reflects the inner structure of the data distribution.

5.1.2  Objective

Our goal is to show that in a number of cases, a top-down model assignment system may be preferable to traditional bottom-up methods such as ensemble learning. We also want to demonstrate that the quality of the predictive system in both cases may depend significantly on the initial distribution of data across multiple sites. We present experimental evidence that the initial data distribution for a model assignment system may be significantly improved by re-allocating small amounts of data between sites using the Greedy Data Labeling method, and that a model assignment system enhanced by GDL outperforms a voting ensemble in most cases.
In this chapter, as in the previous ones, we shall talk about local data subsets, data colors, or base-models interchangeably, since every subset can be thought to have a unique color and the corresponding base-model built on that set is uniquely defined by the choice of the data mining algorithm F (up to a randomization, if applicable). We also note that since our interest is in exploring the quality of data partitions, we do not address the case where some data instances from one subset are added to another subset, thus becoming available in both subsets. In our setting, each data instance belongs to only one subset (color) at a time.
5.1.3

Related work

The GDL method is broadly related to clustering (Everitt, 1974), (Bradley et al., 1998).
Traditional clustering algorithms such as k-means use geometric proximity as a measure of
\tightness", whereas GDL uses a more general criterion based on the choice of the learning
algorithm F . We shall explain this point later. GDL also employs an idea related to boosting,

75
a method of weighted re-sampling of data (Freund, 1995). In boosting, previously misclassi ed
data instances are selected by the learning algorithm more frequently, which gives the predictive
model more chances to learn that portion of data.
Bagging (Breiman, 1996) is one of the main methods for building a collection of classifiers. Ideally in bagging, each classifier is trained on a different subset of data drawn from the same distribution. This is one of the scenarios examined in this chapter for both ensemble and model assignment approaches. In practice, the data subsets for bagging are usually created by re-sampling the original dataset with replacement and hence may share some data instances.

A system where classifiers may specialize and/or abstain from voting is considered in (Freund et al., 1997), although not in a distributed context. Several theoretical error bounds are provided there.
There are various techniques to recover individual data distributions from a mixture of distributions, the EM algorithm being one of the most popular (Dempster et al., 1977), (Jordan et al., 1994). This connection is explored in detail in Chapter 4.

In (Chipman et al., 1999), CART decision trees are used to achieve a segmentation of the data set into subsets such that further submodels may be built on each subset for a hierarchical model selection. Their approach employs ideas from Bayesian analysis and Markov chain Monte Carlo methods and performs a stochastic search in the space of the appropriate decision trees.

A competitive machine learning algorithm similar in part to GDL was independently developed in bioinformatics: (Obradovic, 2002) uses it to automatically partition a set of available disordered proteins into subsets with similar properties.

Conceptual clustering is a methodology of building a hierarchy of classes based on the content of their knowledge objects (Stepp et al., 1986), (Michalski et al., 1983). A classification scheme is produced according to how well objects fit descriptive concepts, not according to simple similarity measures. One example is the COBWEB algorithm that builds a clustering tree where each node is a cluster and can be split into subclusters as children (Fisher, 1987). For other related results, see (Fayyad et al., 1998), (Bradley et al., 1998), and (Zhang et al., 1996).

Finally, there are several approaches to reducing the algorithm complexity by randomization. Some general ideas are discussed in (Ben-David et al., 1994) and (Gomes et al., 1998). In the context of clustering problems, randomization frequently takes the form of using several subsamples of the data sample to find the best initialization for clusters. Using subsampling, a considerable speedup may be achieved, which is of crucial importance for non-linear algorithms applied to large data sets. See (Rocke et al., 2000).
5.2 Greedy Data Labeling

5.2.1 Assumptions

Let the data subsets D1, ..., Dk be distributed over a network, each represented by a different color. If combined, they comprise a single dataset D. However, we assume that it is relatively expensive or otherwise undesirable to combine all data at a central location, while processing data locally is relatively cheap.

A model assignment system is created as follows. A base-model fi is built on each local data subset Di using a learning algorithm F. Then a sample of data from each color is taken and used to train a meta-model that can predict the colors. Whenever a new data instance is presented for classification, the meta-model will assign it a color and forward it to the appropriate base-model for scoring.

The initial partition of data across k sets may be far from optimal, hence not sharing any data at all between local sites will likely produce an inferior model assignment system. Cost and accuracy can be balanced by improving the distribution of data but still mining the data in-place afterward. The key is to share only a small number of data instances.
5.2.2 Optimization Problem

First, we define native colors. Given a data partition {Di}, i = 1, ..., k, the color μ is native for a data instance (x, y) ∈ D if

    μ = arg min_{1 ≤ i ≤ k} ||y - fi(x)||                                    (5.1)

i.e., if the base-model fμ(x) built on the subset Dμ predicts the true class of that instance better than the base-models of the other colors. Here, ||y - fi(x)|| is the appropriately defined prediction error. Note that several models may produce equally accurate predictions, hence native colors may not be unique. In case of a tie, which is typical in classification problems, the native color is chosen randomly among the candidates.
The model assignment problem translates into finding the optimal partition that gives a solution to the following error minimization problem:

    min  Σ_{i=1}^{k}  Σ_{(x,y) ∈ Di}  ||y - fi(x)||                          (5.2)

where the minimum is taken over all possible data partitions. This is a combinatorial optimization problem that searches for the best possible k-coloring of the dataset D among k^|D| possible colorings, where |D| is the dataset size. We observe that, as a necessary condition of optimality, all data instances must belong to their native colors. Indeed, if an instance (x, y) belongs to a current color Dc that is not native, moving it to its native color Dμ would decrease the corresponding term in the sum in Equation 5.2 by

    Δ(x, y) = ||y - fc(x)|| - ||y - fμ(x)||.                                 (5.3)

The core idea behind the Greedy Data Labeling method is to move data instances to their native colors in a way that gives a greedy solution to Equation 5.2. Finding an instance with the maximum Δ(x, y) and moving it to its native color is equivalent to making the largest single-term reduction in Equation 5.2 and hence the largest step in the direction of steepest descent.
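
For concreteness, the following is a minimal sketch in Python of computing the native color and the greedy step Δ(x, y) for a single instance; it is illustrative only and is not the thesis implementation (which used Weka). It assumes that each base-model exposes a predict method and that ||y - fi(x)|| is the 0/1 classification error.

    import random

    def prediction_error(model, x, y):
        # 0/1 classification error ||y - fi(x)||; any other prediction error could be substituted
        return 0.0 if model.predict(x) == y else 1.0

    def native_color_and_step(models, x, y, current_color):
        # errors of all k base-models on the instance (x, y)
        errors = [prediction_error(f, x, y) for f in models]
        best = min(errors)
        # ties are broken randomly, as in Equation 5.1
        native = random.choice([i for i, e in enumerate(errors) if e == best])
        # greedy step: error under the current color minus error under the native color (Equation 5.3)
        delta = errors[current_color] - errors[native]
        return native, delta
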
To prevent possible instability in greedy methods, the size of the allowable greedy step is usually reduced in each subsequent iteration. Because in our case the process of moving data instances is discrete, the same effect may be achieved by reducing the probability of data relocation. We shall show this later in the context of simulated annealing.

However, when data instances are moved between subsets, the base-models of the affected subsets must be re-built to reflect the new partition. The terms in the sum in Equation 5.2 are thus interdependent, which complicates the greedy nature of the steepest descent.
Note also that if a perfect meta-model were available for the current data partition, the resulting hierarchical model assignment system for the dataset D would have the predictive function

    h(x) = fi(x)   when (x, y) ∈ Di                                          (5.4)

allowing us to rewrite Equation 5.2 as an empirical error minimization problem in a standard form:

    min  Σ_{(x,y) ∈ D}  ||y - h(x)||.                                        (5.5)

We also wish to point out the inherent similarity between Equation 5.2 and Equation 5.5 on one side and Equation 4.24 and Equation 4.25 on the other side.
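
Once a meta-model has been trained (see Section 5.2.5), the predictive function of the assembled system can be sketched as follows; the method name predict_color and the list base_models are illustrative assumptions, not part of the original design.

    def h(x, meta_model, base_models):
        # the meta-model assigns a color to the unlabeled instance...
        color = meta_model.predict_color(x)
        # ...and only the base-model of that color is used for scoring, as in Equation 5.4
        return base_models[color].predict(x)
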
5.2.3 Greedy Data Labeling: The Ideal Version

The ideal GDL schema works as follows (a code sketch is given after the list):

- Do until all data points are in their native subsets:
  - For every data instance (x, y) in D:
    - Compute the error εi = ||y - fi(x)|| on this instance for each of the k base-models.
    - Identify the native model fμ that gives the least error. If it is not unique, choose randomly among the candidates.
    - Compute the largest possible error reduction Δ = εc - εμ for that data instance, where c is its current color.
  - Select the data instance with the maximum Δ. Move it to its native subset.
  - Update the base-models of the subsets that were affected by the move.
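
A minimal Python sketch of this ideal schema follows, under the same illustrative assumptions as before (the helper prediction_error defined earlier is reused, and learn stands for the learning algorithm F); for brevity, ties are broken by taking the first minimizing index rather than randomly.

    def ideal_gdl(subsets, learn):
        # subsets: list of k lists of (x, y) instances; learn maps a training set to a base-model
        models = [learn(S) for S in subsets]
        while True:
            best = None  # (delta, instance, current color, native color)
            for c, S in enumerate(subsets):
                for (x, y) in S:
                    errors = [prediction_error(f, x, y) for f in models]
                    native = min(range(len(models)), key=errors.__getitem__)
                    delta = errors[c] - errors[native]
                    if delta > 0 and (best is None or delta > best[0]):
                        best = (delta, (x, y), c, native)
            if best is None:                         # every instance is already in its native subset
                return subsets, models
            _, inst, c, native = best
            subsets[c].remove(inst)                  # largest single-term reduction in Equation 5.2
            subsets[native].append(inst)
            models[c] = learn(subsets[c])            # only the two affected base-models are re-built
            models[native] = learn(subsets[native])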

Note that to implement this algorithm, there must be a copy of every base-model available at each of the k distributed sites {Di}. This is easy to achieve, e.g. by exchanging the models between sites after each model update in the Predictive Model Markup Language (PMML) format, which is now the emerging industry standard for encoding statistical models. See (Grossman et al., 1999). Models in PMML format are typically much smaller than the data on which they are built, so the additional network traffic overhead is negligible.
As was mentioned before, there are noticeable similarities between the GDL method and some unsupervised clustering techniques. In both cases, instances are moved between data subsets to improve a certain measure of "tightness" within subsets. In clustering, this measure is usually related to the geometric proximity of the data. For example, the k-means clustering algorithm is initially seeded with k centroid vectors. Each new data instance is placed into the cluster with the closest centroid, after which the centroid is recomputed. The distance metric is best defined by the covariance matrix of the data distribution in that cluster. In GDL, the algorithm is seeded by the initial data partition across k sites. Each new data instance is placed into the subset with the least prediction error, as given by the predictive model computed on the data from that subset. After the placement, the model is recomputed. The "tightness" criterion in Equation 5.2 is more general and is defined by the chosen type F of the base-model algorithm.
5.2.4 Greedy Data Labeling: The Efficient Version

There are several potential problems with the ideal version of GDL. First, as each data instance shift affects the data partition, re-building base-models too often is impractical. Secondly, depending on the initial conditions, the process may reach a local minimum different from a globally optimal partition. Also, since the function minimized in Equation 5.2 and Equation 5.5 is essentially the empirical error on the training set, overfitting may potentially become an issue.
Below is a more efficient version of the algorithm (a code sketch follows the list):

- Initialize a batch size parameter T.
- Do until all data points are in their native subsets:
  - For every data instance (x, y) in D:
    - Compute the prediction error εi = ||y - fi(x)|| on this instance for each of the k base-models.
    - Identify the native model fμ that gives the least error. If it is not unique, choose randomly among the candidates.
    - Compute the greedy step Δ = εc - εμ for that data instance, where c is its current color.
    - If the data instance is not in its native subset (Δ > 0), mark it as a candidate.
  - Use a randomization technique to select up to T candidates and move them to their native subsets.
  - Update all base-models.
  - Reduce the batch size parameter T before the next iteration.
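
A Python sketch of this batched version is given below, again reusing the prediction_error helper; the function select_candidates stands for the randomization step discussed next, and the halving of T mirrors the schedule used in the experiments of Section 5.3 (these names and choices are illustrative assumptions).

    def efficient_gdl(subsets, learn, T, select_candidates):
        models = [learn(S) for S in subsets]
        while True:
            candidates = []                          # (delta, instance, current color, native color)
            for c, S in enumerate(subsets):
                for (x, y) in S:
                    errors = [prediction_error(f, x, y) for f in models]
                    native = min(range(len(models)), key=errors.__getitem__)
                    delta = errors[c] - errors[native]
                    if delta > 0:                    # not in its native subset: mark as a candidate
                        candidates.append((delta, (x, y), c, native))
            if not candidates:
                return subsets, models
            for delta, inst, c, native in select_candidates(candidates, T):
                subsets[c].remove(inst)
                subsets[native].append(inst)
            models = [learn(S) for S in subsets]     # update all base-models once per batch
            T = max(1, T // 2)                       # reduce the batch size before the next iteration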

This improved version is efficient enough to allow a practical implementation of the GDL algorithm, as it requires only a few updates of the base-models before convergence.
There are several randomization schemata available for selecting data instances that allow the algorithm to escape local minima. The most notable is the so-called simulated annealing (Laarhoven et al., 1987), a standard technique used to avoid local minima in combinatorial optimization by permitting sub-optimal steps in the greedy descent. Although annealing makes each individual run of the algorithm longer, the alternative is to make multiple runs of the algorithm from different initial conditions without annealing and then choose the best of the obtained results, which could be quite expensive. Overall, a randomization method like simulated annealing is often preferable. When implemented in GDL, it selects a data instance with probability p = e^{-1/(Δt)}, where t is a control parameter tied to the batch size T. The smaller the value of Δ, the less likely it is to select the data instance. Other randomization techniques may be used readily. For example, simulated annealing with p = e^{-1/((Δ+1)t)} would allow instances with Δ = 0 to be selected with non-zero probability and would escape local minima better, at the expense of a larger number of iterations. See also (Ben-David et al., 1994).
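
A sketch of such an annealing-based selection, implementing the probability p = e^{-1/(Δt)} given above, is shown below; the cap at T moved instances and the default value of the control parameter t are illustrative.

    import math
    import random

    def select_candidates(candidates, T, t=1.0):
        # keep each candidate with probability p = exp(-1 / (delta * t)); smaller delta is less likely to be selected
        selected = []
        for delta, inst, c, native in candidates:
            p = 0.0 if delta == 0 else math.exp(-1.0 / (delta * t))
            if random.random() < p:
                selected.append((delta, inst, c, native))
        # respect the batch size: move at most T of the selected instances
        random.shuffle(selected)
        return selected[:T]
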
Also, we have not noticed any problems related to overfitting in our experiments.

5.2.5 Building the Meta-model

Once the data partition {Di} is found by the GDL algorithm, we take a sample of the data instances from all subsets, replace their class values with their colors, and train a meta-model on the resulting sample. The meta-model thus learns to classify data into k colors as defined by the current data partition.

An important point is that we only take a sample of data for building the meta-model. We call such a sample a meta-sample. Using the whole dataset to train a model would require moving all data to a centralized location, which defeats the purpose of processing the data locally.
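
A sketch of this step follows, assuming a generic train_classifier routine (e.g. any Weka-style learner) and a sampling fraction sigma between 0 and 1 corresponding to the meta-sample percentage used later in the experiments; these names are illustrative.

    import random

    def build_meta_model(subsets, sigma, train_classifier):
        # take a sigma-fraction sample from each color and relabel it with the color index
        meta_sample = []
        for color, S in enumerate(subsets):
            if not S:
                continue
            n = max(1, int(sigma * len(S)))
            for (x, _y) in random.sample(S, n):
                meta_sample.append((x, color))       # the class value is replaced by the color
        # the meta-model learns to predict colors, i.e. which base-model should score an instance
        return train_classifier(meta_sample)
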
5.3 Experiments

5.3.1 Datasets

The goal of our experiments was two-fold. First, we wanted to show that a model assignment approach may in certain situations outperform traditional bottom-up hierarchical techniques such as ensemble learning that use all base-models for prediction. Secondly, we wanted to explore the quality of data partitions and to demonstrate the superiority of the partitions found by GDL over other possible partitions in the model assignment context.

In our preliminary trials, we discovered an interesting feature of the GDL method: it consistently gives superior partitions on datasets with discrete attributes, whereas its advantage on data with continuous attributes is considerably less pronounced. We are still investigating the reasons behind this phenomenon. Our understanding is that because the space over which the combinatorial search is performed is greatly reduced when attributes are discrete, the GDL optimization task is simplified substantially. In the future, we shall explore the effectiveness of the GDL applied in conjunction with various attribute discretization techniques. In this chapter, however, we restrict our experiments to purely discrete data.

TABLE VII
DESCRIPTION OF THE UCI DATASETS

Dataset Name     Num of Instances   Num of Attributes   Num of Classes
Balance Scale           625                 4                  3
Tic-Tac-Toe             958                 9                  2
Car Evaluation         1728                 6                  4
Chess                  3196                36                  2
Nursery               12960                 8                  5

We selected several datasets from the UCI Machine Learning Repository that satisfy three criteria: (a) all attributes are discrete, (b) the number of data instances is at least 500, and (c) the domain is sufficiently complex. The latter criterion, for example, made us reject datasets for which the initial partition gave less than 1% classification error (so that hardly any improvement was possible) or for which the non-zero class occurrence is extremely rare (so that a base-model of type F built on the data always predicts zero).

The datasets that we used for the tests are described in Table VII, with the number of instances, attributes, and class values.
Although these experimental datasets are easy to analyze in a centralized fashion and are not massive enough to pose any cost-related problems, our current objective is to test the quality of the model assignment approach and the GDL partitions in different scenarios rather than to estimate the data analysis cost. For such a task, the UCI data is quite adequate.
5.3.2 Partitions

To examine the quality of data partitions produced by the GDL method, we compared three types of partitions:

- Initial Partition (Initial): given by the way the data was split across k sites at the beginning.
- Random Mixture (RandMix): after the initial partition, random samples of a certain fixed size are exchanged between subsets.
- Greedy Data Labeling (GDL): after the initial partition, samples of data selected by the GDL are exchanged between subsets.

We now describe the three types of data partitions in detail.
For the initial partition, we were interested in modeling the situations where the data comes from either a single source or from different sources. Our interest was in how effective the model assignment approach and GDL are for homogeneous and heterogeneous data. Because we used real data and could not control its source, the training set was initially split either homogeneously or heterogeneously in the same way as described in Chapter 3 for the Nursery data, i.e. by building a C4.5 decision tree to identify the most informative attribute and using it to split the data. The resulting partition is sufficiently heterogeneous but does not follow the tree model exactly. Another option is to prune the tree until it has only k leaves and use it to separate the data into k parts. However, the latter approach may give an optimistic bias in the experiments in which the meta-model is chosen to be of the C4.5 type, since all it needs to do is replicate the original k-leaf tree.
Once the initial data partition of either the same-source type or the different-sources type was obtained, we used it as a starting point to create the corresponding RandMix and GDL partitions. In the RandMix approach, we updated the initial partition by making each subset Di exchange a total of δ percent of its data with the other subsets, divided equally among the receiving subsets.
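
A sketch of this RandMix exchange is shown below (assuming k ≥ 2), with delta denoting the data mixing percentage; the round-robin routing is one simple way to divide the outgoing data equally among the other k - 1 subsets, and the names are illustrative.

    import random

    def randmix(subsets, delta):
        k = len(subsets)
        outgoing = [[] for _ in range(k)]
        for i, S in enumerate(subsets):
            n = int(len(S) * delta / 100.0)          # delta percent of the subset leaves it
            random.shuffle(S)
            chunk, keep = S[:n], S[n:]
            subsets[i] = keep
            # route the outgoing instances to the other k - 1 subsets in round-robin order
            for j, inst in enumerate(chunk):
                receiver = (i + 1 + j % (k - 1)) % k
                outgoing[receiver].append(inst)
        for i in range(k):
            subsets[i].extend(outgoing[i])
        return subsets
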
For the GDL, we initialized the batch size parameter T to δ/2 percent of |D| for each value of δ, using the same values of δ as in RandMix. Because GDL reduces T by half after each iteration, this choice ensures that the total data traffic does not exceed δ/2 + δ/4 + ... < δ percent. Our goal was to demonstrate that with GDL, we get superior colorings while moving less data than with RandMix. We also used the same set of data transfer parameters to create a boosted data partition used for building a boosted ensemble as described in Chapter 3.
5.3.3 Test Results

Each partition was tested by 5-fold cross-validation in the following manner. A validation set was withheld in each fold. After partitioning the data, a base-model was built on each subset. A σ percent sample of each color was then taken to build a meta-model. The accuracy of the resulting model assignment system was tested on the validation set. Also, for the initial and random mixture partitions, we combined all locally built base-models into a voting ensemble and tested the ensemble's accuracy on the validation set. We also built a boosted ensemble, as described in Chapter 3, to compare a boosted data partition to a random partition. The goal was to investigate whether the base-models work better as an ensemble or as a model assignment system for each data partition type.
In our experiments, we chose a simplified interpretation of the simulated annealing. With a zero-or-one classification error (match or no match), Δ(x, y) also takes values 0 or 1 (native or non-native current color), in which case p = e^{-1/t} uniformly for all candidate instances and p = 0 otherwise. With an appropriate value of t, this is equivalent to selecting T candidates randomly.
For building and testing predictive models, we again used the Weka freeware data mining package (Witten and Frank, 1999). We chose C4.5 decision trees as the type F of base-models and experimented with three different types of meta-models: C4.5 trees, naive Bayesian models, and one-nearest-neighbor models.
We used the following parameter values:

- The number of different colors was k = 3 or k = 10.
- The meta-sample percentage σ = 100, 50, 25, and 10%. We were interested in the sensitivity of GDL to decreasing the meta-sample size.
- The data mixing parameter δ = 10, 20, 40, 60, 80, and 90%. For example, with k = 10 colors, δ = 90% results in each subset retaining 10% of its original data and acquiring 10% of the data from each of the other 9 subsets, hence all 10 subsets having essentially the same data distribution regardless of the initial partition. For the same number of colors, δ = 10% means retaining as much as 90% of the data at each color and exchanging only 10% for data from the other 9 colors, about 1.1% from each.
To ensure robust test results, 5-fold cross-validation trials were run for each partition, and three different samples were collected for each value of the meta-sample size σ in each trial. The results were then averaged. For the RandMix, Boosted, and GDL partitions, they were further averaged over all values of the data mixing parameter δ.

We discovered that of the three meta-model types used with C4.5 decision tree base-models, the nearest neighbor model is the best one. In Table VIII through Table XII, we compare a voting ensemble of the locally built C4.5 base-models (ENS) to a model assignment system with the same collection of base-models and a nearest neighbor meta-model (NN1). The Boost/GDL lines correspond to boosted ensembles and GDL-enhanced model assignment systems. The tables show the classification errors on each UCI dataset for both initial partition splits (same-source and different-sources), for different numbers of models and partition types, and for different meta-sample sizes in the case of model assignment. Table XIII shows the average over all datasets.
We make several observations.

For both same-source and different-sources data partitions, the accuracy of the predictive system drops as the number k of distributed sites increases. This confirms our earlier suggestion that combining data into fewer subsets results in better predictive systems, which is true for both the ensemble learning and the model assignment methods. Also, the quality of the model assignment system drops as the size σ of the meta-sample decreases. This shows that, given the same collection of base-models, care must be taken to build a sufficiently accurate meta-model.

TABLE VIII
BALANCE SCALE DATASET: ENSEMBLE ERRORS VS. MODEL ASSIGNMENT ERRORS

Data      k   Partition    ENS     NN1 Meta-sample Size:
Source        Method               100%    50%    25%    10%
Same      3   Initial      28.3    30.3   31.8   32.5   33.1
              RandMix      28.8    31.6   32.5   33.4   33.8
              Boost/GDL    30.8    25.4   27.5   29.2   30.8
Same      10  Initial      24.3    33.7   34.9   35.9   36.9
              RandMix      24.2    32.4   33.8   34.8   35.8
              Boost/GDL    24.4    26.7   29.0   31.3   33.6
Differ    3   Initial      29.9    29.6   31.2   33.5   34.2
              RandMix      28.4    30.1   31.5   33.1   33.9
              Boost/GDL    29.9    26.7   27.6   29.8   32.1
Differ    10  Initial      31.9    29.3   30.6   33.3   34.9
              RandMix      26.4    30.6   32.6   34.2   35.5
              Boost/GDL    26.9    26.4   27.3   29.9   31.4

In the homogeneous data case (same-source initial partition), we observe that ensembles built on the initial data partitions are, in general, superior to model assignment systems built on the same partitions, even with a 100% meta-sample size. Exchanging random samples of data between subsets gives no consistent advantage or disadvantage to either ensembles or model assignment, perhaps due to the fact that mixing data that is already homogeneous makes no essential difference in the underlying data partition. Ensembles, therefore, remain preferable in RandMix partitions. However, model assignment systems enhanced by GDL consistently outperform ensembles, sometimes even with meta-samples as small as 10%, although this effect usually diminishes as the meta-sample size decreases.

TABLE IX
TIC-TAC-TOE DATASET: ENSEMBLE ERRORS VS. MODEL ASSIGNMENT ERRORS

Data      k   Partition    ENS     NN1 Meta-sample Size:
Source        Method               100%    50%    25%    10%
Same      3   Initial      19.5    19.4   20.3   21.6   22.3
              RandMix      19.4    19.0   20.3   21.2   22.1
              Boost/GDL    19.4    13.9   16.1   18.4   20.5
Same      10  Initial      25.9    21.8   23.9   25.9   27.6
              RandMix      26.4    22.2   24.7   26.8   28.5
              Boost/GDL    25.8    12.1   16.1   20.1   24.2
Differ    3   Initial      31.5     9.6   11.1   15.6   23.0
              RandMix      24.5    14.3   16.6   19.5   23.0
              Boost/GDL    24.6     6.9    9.2   13.8   22.0
Differ    10  Initial      37.2    12.6   15.7   20.7   27.2
              RandMix      28.3    16.6   20.4   24.2   27.4
              Boost/GDL    30.5     9.8   13.8   19.1   26.3

In the case of the heterogeneous data (different-sources initial partition), we note how poorly ensembles perform on the initial partitions. Their accuracy can be improved dramatically by exchanging random samples of data between subsets. This is due to the effect of creating a more homogeneous data distribution across the k sites, an environment in which ensembles are preferable. In effect, ensembles are "penalized" if their base-models are over-specialized. On the other hand, model assignment systems appear to be far superior to ensembles even with small meta-sample sizes and without using GDL. With GDL, their accuracy improves even further. This is the effect that we expected, since model assignment systems are "rewarded" if their base-models over-specialize, and the GDL is designed to make this effect even more pronounced.

TABLE X
CAR EVALUATION DATASET: ENSEMBLE ERRORS VS. MODEL ASSIGNMENT ERRORS

Data      k   Partition    ENS     NN1 Meta-sample Size:
Source        Method               100%    50%    25%    10%
Same      3   Initial      13.2    13.6   14.5   15.0   15.4
              RandMix      13.1    13.6   14.1   14.6   15.0
              Boost/GDL    13.5    11.4   12.8   13.8   14.7
Same      10  Initial      18.1    18.8   19.5   20.0   20.7
              RandMix      18.0    18.4   19.5   20.0   20.8
              Boost/GDL    18.0    13.1   15.9   17.8   19.6
Differ    3   Initial      27.7     8.2    8.3    9.9   11.1
              RandMix      16.9    10.9   11.5   12.8   13.6
              Boost/GDL    15.7     8.3    8.1   10.0   11.3
Differ    10  Initial      27.9    10.1   12.1   15.6   18.1
              RandMix      20.1    14.0   16.2   18.4   20.0
              Boost/GDL    19.7     8.9   10.9   14.9   17.7

We note that despite some successes, boosting did not result on average in a significant drop of the ensemble error even in the case of different-sources data partitions, and it was slightly detrimental in the case of a same-source partition, the latter fact already observed in Chapter 3. It appears that the RandMix strategy for ensembles is generally quite adequate. We shall therefore use it as a benchmark in what follows.

TABLE XI
CHESS DATASET: ENSEMBLE ERRORS VS. MODEL ASSIGNMENT ERRORS

Data      k   Partition    ENS     NN1 Meta-sample Size:
Source        Method               100%    50%    25%    10%
Same      3   Initial       1.3     1.7    1.6    1.6    1.6
              RandMix       1.3     1.6    1.6    1.7    1.8
              Boost/GDL     1.1     1.2    1.2    1.4    1.6
Same      10  Initial       3.3     3.9    4.1    4.3    4.6
              RandMix       3.4     3.7    3.9    4.1    4.4
              Boost/GDL     3.4     3.0    3.6    3.7    4.3
Differ    3   Initial      14.5     2.8    4.0    5.1    6.9
              RandMix       2.1     1.8    2.0    2.2    2.5
              Boost/GDL     1.3     2.6    3.9    5.1    6.9
Differ    10  Initial      20.8     6.0    8.7   11.2   15.0
              RandMix       5.2     4.0    4.8    5.7    6.7
              Boost/GDL     4.8     5.8    8.6   11.2   15.1

5.3.4 Efficiency

We also observed that the choice of the meta-model algorithm affects the performance significantly. It follows from the previous remarks that ensembles generally work better when built on a RandMix partition, while model assignment systems are almost always greatly improved by using a GDL partition. In Table XIV, we compare ensemble systems to model assignment systems when both are used under optimal conditions: the ensembles are built on RandMix partitions and the model assignment systems are built on GDL partitions with meta-sample size σ = 100%. We use all three meta-model types: nearest neighbor (NN1), C4.5 tree (C45), and naive Bayesian model (NB). Table XIV presents the classification errors for each case. We also show the average percentage of data moved by GDL and the average number of GDL iterations in each case.

TABLE XII
NURSERY DATASET: ENSEMBLE ERRORS VS. MODEL ASSIGNMENT ERRORS

Data      k   Partition    ENS     NN1 Meta-sample Size:
Source        Method               100%    50%    25%    10%
Same      3   Initial       5.2     5.5    5.8    5.9    6.0
              RandMix       5.2     5.6    5.9    6.0    6.2
              Boost/GDL     5.4     4.6    5.1    5.5    5.9
Same      10  Initial       8.6     8.3    8.7    9.0    9.4
              RandMix       8.6     8.2    8.7    9.0    9.3
              Boost/GDL     8.8     5.8    6.8    7.8    8.5
Differ    3   Initial      38.5     3.5    3.5    5.3    7.9
              RandMix       6.4     4.5    4.7    5.0    5.3
              Boost/GDL     6.3     3.5    3.5    5.3    8.1
Differ    10  Initial      34.6     3.5    5.2   10.9   14.9
              RandMix       9.3     5.4    6.4    7.8    8.8
              Boost/GDL     9.2     3.2    4.9   10.5   14.7

Nearest neighbor meta-models appear to be the best type of meta-models, followed by C4.5 trees. Naive Bayesian models are clearly undesirable.

We see that in terms of data traffic, GDL moves only a small percentage of data between subsets. Furthermore, it was observed that the amount of data that is actually moved typically stays about the same even if the batch size T is increased to allow more potential data traffic. This is due to the fact that only a small portion of the data needs to be relocated. Note also that there is much less data moved when the initial partition already represents different sources (4.6% vs. 8.5%). In comparison, a RandMix approach requires a fixed percentage δ of data to be moved, where δ could be large and, more importantly, is not dictated by the internal data structure.

TABLE XIII
AVERAGE OVER ALL DATASETS: ENSEMBLE ERRORS VS. MODEL ASSIGNMENT ERRORS

Data      k   Partition    ENS     NN1 Meta-sample Size:
Source        Method               100%    50%    25%    10%
Same      3   Initial      13.5    14.1   14.8   15.3   15.7
              RandMix      13.6    14.3   14.9   15.4   15.8
              Boost/GDL    14.0    11.3   12.5   13.7   14.7
Same      10  Initial      16.0    17.3   18.2   19.0   19.8
              RandMix      16.1    17.0   18.1   18.9   19.8
              Boost/GDL    16.8    12.1   14.3   16.1   18.0
Differ    3   Initial      28.4    10.7   11.6   13.9   16.6
              RandMix      15.7    12.3   13.3   14.5   15.7
              Boost/GDL    15.5     9.6   10.5   12.8   16.1
Differ    10  Initial      30.5    12.3   14.5   18.3   22.0
              RandMix      17.9    14.1   16.1   18.1   19.7
              Boost/GDL    18.2    10.8   13.1   17.1   21.0

In terms of efficiency, we observe that an average of about four GDL iterations is required. This translates into increasing the amount of local data processing four times, compared to the initial or RandMix approaches that require building the local base-models only once. However, given our assumptions that local data processing is cheap while data transfer and aggregation are expensive, this is acceptable.


TABLE XIV
ERROR RATES OF AN ENSEMBLE OF C4.5 TREES BUILT ON A RANDMIX PARTITION VS. MODEL ASSIGNMENT SYSTEMS BUILT ON THE GDL PARTITION WITH C4.5 BASE-MODELS AND DIFFERENT TYPES OF META-MODELS

Dataset   Initial  k    ENS    Meta-model type         Actual    GDL
          source               NN1    C45    NB        transfer  iterations
Balance   Same     3    28.8   25.4   29.4   32.2      13.7      4.0
          Same     10   24.2   26.7   30.5   31.0      17.3      4.7
          Differ   3    28.4   26.7   28.7   27.6      12.3      3.9
          Differ   10   26.4   26.4   28.7   27.5      13.9      4.8
T-T-T     Same     3    19.4   13.9   19.1   22.2      10.5      5.0
          Same     10   26.4   12.1   24.0   27.7      15.6      5.5
          Differ   3    24.5    6.9   13.9   27.4       4.1      3.7
          Differ   10   28.3    9.8   14.8   24.6       4.0      4.0
Car       Same     3    13.1   11.4   12.7   14.1       5.9      3.9
          Same     10   18.0   13.1   15.5   20.7      11.2      5.2
          Differ   3    16.9    8.3    9.3    9.4       3.3      3.5
          Differ   10   20.1    8.9    9.7   12.3       4.9      4.7
Chess     Same     3     1.3    1.2    1.3    1.4       0.6      2.4
          Same     10    3.4    3.0    2.3    3.3       2.1      3.9
          Differ   3     2.1    2.6    1.8    4.5       0.8      2.3
          Differ   10    5.2    5.8    2.0    6.8       0.4      2.6
Nursery   Same     3     5.2    4.6    5.1    6.4       2.7      3.7
          Same     10    8.6    5.8    6.3    9.8       5.6      5.1
          Differ   3     6.4    3.5    3.4    3.5       0.8      2.8
          Differ   10    9.3    3.2    3.4   15.2       1.7      5.5
Average   Same          14.9   11.7   14.6   16.9       8.5      4.3
          Differ        16.8   10.2   11.6   15.9       4.6      3.8

5.4 Conclusion

In this chapter, we compared several scenarios of distributed data mining in which data is distributed across k sites and combining it in one central location is undesirable. It appears that in the case of a homogeneous data distribution, hierarchical systems such as ensembles that use all base-models may be preferable to model assignment systems that have specialized base-models. However, the latter may be superior in cases where the distributed data partition reflects differences in the underlying data distributions.

We also introduced a method called Greedy Data Labeling (GDL) that enhances the data partition in the model assignment setting. GDL is broadly related to clustering and allows us to exchange small portions of data, chosen in a special way, between the distributed subsets before the learning algorithms are applied. The resulting model assignment systems outperform traditional ensembles in both homogeneous and heterogeneous data environments, even when the ensembles are also enhanced by an appropriate data exchange strategy, namely, a strategy of exchanging either random or boosted subsets of data between the sites before the local models are built. GDL is inherently similar to the family of constrained EM algorithms introduced in Chapter 4.

Future work may include developing more advanced versions of the GDL algorithm based on different data weighting schemata and randomization mechanisms. Another direction is to allow data instances to belong to multiple colors and to build base-models on overlapping partitions.

CHAPTER 6

CONCLUSION

This dissertation is, to our knowledge, the first work that addresses the fundamental trade-off between cost and accuracy in distributed data mining. There exists a substantial body of research on improving data mining accuracy in the case when the entire dataset is immediately available to the learning algorithm. Separately, a number of methods in distributed data mining are based on local data processing, which we call an in-place strategy. However, as we showed in this work, lack of full information about the other parts of the dataset may cause an unacceptable loss of accuracy in the in-place case. On the other hand, combining all data at the same central location usually gives the most accurate results but may be too expensive to implement due to the cost of network traffic, data aggregation, and algorithm complexity.

Our point of view is that there must be an intermediate strategy that balances cost and accuracy by moving only a portion of the data across the network. The exact nature of such a strategy may depend on a variety of factors. This calls for developing a formal framework that can serve as a base for developing new techniques of finding data transfer strategies.

In this dissertation, we showed that finding the intermediate strategies may be formulated as a mathematical optimization problem. There are two options available for balancing cost and accuracy: either to set a tolerance for the accuracy and then minimize the cost, or to set a tolerance for the cost and then optimize the accuracy.

We demonstrated that in the former case, the problem becomes a problem of convex optimization and hence standard methods are available to solve it. We also showed how this approach may be used for the analysis of a number of interesting situations where intermediate strategies occur naturally.

In the latter case, i.e., when the cost is restricted and the accuracy is optimized, the problem of finding a balanced data transfer strategy leads into the area of the quality of data partitions and mixtures of models, which in itself is a large and well-established area of statistics and data mining. Pursuing this direction, we have developed another mathematical framework that formalizes the bridge between the trade-off in distributed data mining and the mixture-of-models scenario. This framework is based on the classic Expectation-Maximization approach. We showed how cost constraints may be translated into restrictions in the hidden variable space, from which interesting likelihood bounds may be found. We also demonstrated how to apply this framework in particular situations.
There may be other important issues related to the ways the data is partitioned. We explored them in our experimental studies with data from the UCI Machine Learning Repository as well as with gene expression microarray data.

This work also attempts to identify the best way to use a collection of data mining models built in a distributed environment. In particular, the usual bottom-up approach where all models participate in making predictions, such as ensemble learning, is compared to a top-down approach of a model assignment system where individual models specialize. We demonstrated that a model assignment system is frequently preferable as long as it is built on a "good" partition of data across distributed sites. We introduced an algorithm called Greedy Data Labeling that allows us to find such data partitions. Although developed for a different purpose, this algorithm is related to the general family of constrained EM algorithms that was formalized earlier.
Overall, because the question of the trade-off between cost and accuracy in distributed data mining is new, one of the objectives of this dissertation was to survey the issues that arise and to serve as a base for further research in this direction. Therefore, we tried to develop a formal mathematical treatment of the problem. At the same time, this work contains a number of ad hoc techniques and heuristics that allow alternative interpretations and in-depth extensions along the same general lines. We hope that a good balance between rigor and flexibility was reached, which will facilitate further research and development.

CITED LITERATURE

Aggarwal, C.C, Hinneburg, A. and Keim, D.A.: On the surprising behavior of distance metrics
in high dimensional spaces. Proc. International Conference on Database Theory 420-434,
2001.
Bates, D. M. and Watts, D. G.: Nonlinear regression analysis and its applications. Wiley, 1988.
Ben-David, S., Borodin, A., Karp, R., Tardos, G. and Wigderson, A.: On the power of randomization in on-line algorithms. Algorithmica 11(1):2-14, 1994.
Bradley, P.S., Fayyad, U. and Reina, C.: Scaling clustering algorithms to large databases.
Proc. 4th Int. Conf. Knowledge Discovery and Data Mining (KDD-98) 9-15, Menlo Park,
CA, AAAI Press, 1998.
Breiman, L.: Bagging predictors. Machine Learning, 24(2):123-140, 1996.
Breiman, L.: Random forests, random features. Technical report, University of California,
Berkeley, 1999.
Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J.: Classification and Regression
Trees, Wadsworth and Brooks, 1984.
Chan, P.K. and Stolfo, S.J.: Learning arbiter and combiner trees from partitioned data
for scaling machine learning. Proc. 1st Int. Conf. Knowledge Discovery and Data Mining
(KDD-95) 39-44, Menlo Park, CA, AAAI Press. 1995.
Chvatal, V.: Linear Programming. Freeman and Co., 1983.
Cheung, A., and Reeves, A.: High performance computing on a cluster of workstations. Proc. 1st International Symposium on High Performance Distributed Computing
152-160, September 1992.
Chipman, H., George, E. and McCulloch, R.: Segmentation via tree models. Presentation
at the Chicago Chapter of the American Statistical Association, Jun 22 1999, avail. at
http://gsbrem.uchicago.edu/talks/.
Cormen, T., Leiserson, C., and Rivest, R.: Introduction to Algorithms. MIT Electrical Engineering and Computer Science Series, 1990.
Cortes, C., Jackel, L.D., and Chiang, W.-P.: Predicting failures of telecommunication paths:
limits on learning machine accuracy imposed by data quality. Proc. Intl. Workshop on
Applications of Neural Networks to Telecommunications 2, Stockholm, 1995.

Dellaert, F.: The Expectation Maximization Algorithm, February 2002,
Avail. at www.cc.gatech.edu/dellaert/em-paper.pdf.
Dempster, A.P., Laird, N.M., and Rubin, D.B.: Maximum likelihood from incomplete data via
the EM algorithm. Journal of the Royal Statistical Society B 39:1-38. 1977.
Dietterich, T.G.: An experimental comparison of three methods for constructing ensembles of
decision trees: bagging, boosting, and randomization. Machine Learning 40(2):139-157,
2000.
Dietterich, T.G.: Machine learning research: four current directions. AI Magazine 18:97-136,
1997.
Everitt, B.: Cluster Analysis. Wiley, New York, NY, 1974.
Fayyad, U., Reina, C. and Bradley, P.S.: Initialization of iterative refinement clustering algorithms. Proc. 4th Int. Conf. Knowledge Discovery and Data Mining (KDD-98) 194-198,
Menlo Park, CA, AAAI Press, 1998.
Fisher, D.: Improving inference through conceptual clustering. Proc. 1987 AAAI Conf. 461-465,
Seattle, WA, 1987.
Freund, Y.: Boosting a weak learning algorithm by majority. Information and Computation
121(2):256-285, 1995.
Freund, Y. and Schapire, R.E.: A decision-theoretic generalization of on-line learning and an
application to boosting. Journal of Computer and System Sciences 55(1):119-139, 1997.
Freund, Y., Schapire, R.E., Singer, Y. and Warmuth, M.K.: Using and combining predictors
that specialize. Proc. of the 29th Annual ACM Symposium on the Theory of Computing
pp. 334-343, 1997.
Gerstein M. and Jansen R.: The current excitement in bioinformatics, analysis of whole-genome
expression data: How does it relate to protein structure and function? Current Opinion
in Structural Biology 10(5):574-584, 2000.
Gomes, C., Selman, B. and Kautz, H.: Boosting combinatorial search through randomization.
Proc. 15th Nat. Conf. on Artificial Intelligence 431-437. AAAI Press/MIT Press, 1998.
Grimshaw, A.S., Weissman, J.B., West, E.A., and Loyot, E.C.: Metasystems: An Approach Combining Parallel Processing and Heterogeneous Distributed Computing Systems,
Journal of Parallel and Distributed Computing 21(3):257-270, 1994.
Grossman, R.L., Bailey, S., Ramu, A., Malhi, B., Hallstrom, P., Pulleyn, I. and Qin, X.:
The management and mining of multiple predictive models using the Predictive Modeling
Markup Language (PMML), Information and System Technology 41:589-595, 1999.

Grossman, R.L, Bailey, S., Ramu, A., Malhi, B. and Turinsky, A.: The preliminary design of Papyrus: a system for high performance, distributed data mining over clusters. In: Advances
in Distributed and Parallel Knowledge Discovery, eds. H. Kargupta and P. Chan, pp. 259-275, AAAI Press/MIT Press, Menlo Park, California, 2000.
Guo, Y., Rueger, S.M., Sutiwaraphun, J., and Forbes-Millott, J.: Meta-learning for parallel data
mining. Proc. 7th Parallel Computing Workshop 1-2, 1997.
Haussler, D., Kearns, M., Seung, H. and Tishby, N.: Rigorous learning curve bounds from
statistical mechanics, Machine Learning 25:195-236, 1996.
Hoel, P.G., Port, S.C. and Stone, C.J.: Introduction to Statistical Theory, Houghton Mifflin,
1971.
Jordan, M.I. and Jacobs, R.A.: Hierarchical mixtures of experts and the EM algorithm. Neural
Computation 6(2):181-214, 1994.
Kargupta, H., Hamzaoglu, I. and Stafford, B.: Scalable, distributed data mining using an agent
based architecture. In: Proc. Third International Conference on the Knowledge Discovery
and Data Mining, eds. D. Heckerman, H. Mannila, D. Pregibon, and R. Uthurusamy, pp.
211-214, AAAI Press, Menlo Park, CA, 1997.
Kargupta, H., Johnson, E., Sanseverino, E.R., Park, B.-H., Silvestre, L.D., and Hershberger, D.:
Scalable data mining from vertically partitioned feature space using collective mining and
gene expression based genetic algorithms, KDD Workshop on Distributed Data Mining,
1998.
Krogh, A. and Sollich, P.: Statistical mechanics of ensemble learning, Physical Review E
55:811-825, 1997.
Lewis, A.S. and Borwein, J.M.: Convex Analysis and Nonlinear Optimization: Theory and
Examples. CMS Advanced Books in Mathematics, Springer-Verlag, 2000.
Michalski, R.S. and Stepp, R.: Automated construction of classifications: conceptual clustering versus numerical taxonomy. IEEE Trans. Pattern Analysis and Machine Intelligence
5:396-410, 1983.
Mitchell, T.: Machine Learning, WCB/McGraw-Hill, 1997.
Murata, N., Yoshizawa, S. and Amari, S.: Learning curves, model selection and complexity of
neural networks. In: Advances in Neural Information Processing Systems, eds. S.J. Hanson, J.D. Cowan, and C.L. Giles, 5:607-614, San Mateo, CA, 1993.
Murphy, P. M. and Aha, D. W.: UCI repository of machine learning databases. University of
California, Department of Information and Computer Science, Irvine, CA, 1993.
Avail. at http://www.ics.uci.edu/mlearn/MLRepository.html.

Obradovic, Z.: Commonness, complexity, flavors and function of intrinsic protein disorder: a
bioinformatics study. Presentation at IPAM Workshop on the Mathematical Challenges in
Scientific Data Mining, 2002. Avail. at www.ist.temple.edu/zoran/bioinformatics.html.
Pugachev, V.S.: Probability Theory and Mathematical Statistics for Engineers, Pergamon
Press, 1984.
Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA,
1993.
Raftery, A.E., Madigan, D., and Hoeting, J.A.: Bayesian model averaging for linear regression
models. Journal of the American Statistical Association 92:179-191, 1996.
Rocke, D. and Dai, J.: Sampling and subsampling for cluster analysis in data mining with
applications to Sky Survey. 2002.
Avail. at http://handel.cipic.ucdavis.edu/dmrocke/preprints.html.
Skillicorn, D.: Parallel Data Mining. CASCON'98, Toronto, December 1998.
Stepp, R.E. and Michalski, R.S.: Conceptual clustering: Inventing goal oriented classifications
of structured objects. Machine Learning: An Artificial Intelligence Approach vol. II. Morgan Kaufmann, San Mateo, CA, 1986.
Stolfo, S., Prodromidis, A.L., and Chan, P.K.: JAM: Java agents for meta-learning over
distributed databases, Proc. Third International Conference on Knowledge Discovery and
Data Mining, AAAI Press, Menlo Park, California, 1997.
Tibshirani, R., Hastie, T., Eisen, M., Ross, D., Botstein, D. and Brown, P.: Clustering methods
for the analysis of DNA microarray data. Technical report, Department of Health Research
and Policy, Stanford University, 1999.
Turinsky, A.L. and Grossman, R.L.: A Framework for Finding Distributed Data Mining
Strategies That Are Intermediate Between Centralized Strategies and In-Place Strategies.
Presentation at the KDD 2000 Workshop on Distributed Data Mining, Boston, USA,
2000. Avail. at http://citeseer.nj.nec.com/turinsky00framework.html.
Van Laarhoven, P.J.M. and Aarts, E.H.L.: Simulated Annealing: Theory and Applications.
D.Reidel, Norwell, MA, 1987.
Witten, I.H. and Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques
with Java Implementations. Morgan Kaufmann, San Mateo, CA, October 1999.
WEKA software avail. at http://www.cs.waikato.ac.nz/ml/weka/.
Wolpert, D.: Stacked generalization. Neural Networks 5:241-259, 1992.

Xu, L. and Jordan, M.I.: EM learning on a generalized finite mixture model for combining
multiple classi ers. Proc. World Congress on Neural Networks. Hillsdale, NJ, Erlbaum,
1993.
Zaki, M., Li, W., and Parthasarathy, S.: Customizing dynamic load balancing for a network of workstations. Journal of Parallel and Distributed Computing: Special Issue on
Performance Evaluation, Scheduling, and Fault Tolerance, June 1997.
Zhang, H., Yu, C.Y., Singer, B., and Xiong, M.: Recursive partitioning for tumor classification
with gene expression microarray data. Proc. Natl. Acad. Sci. USA 98:6730-6735, 2001.
Zhang, T., Ramakrishnan, R. and Livny, M.: BIRCH: an efficient data clustering method
for very large databases. Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data pp.
103-114, Montreal, Canada, 1996.

VITA

NAME: Andrei L. Turinsky


EDUCATION:

- Ph.D., Mathematical Computer Science, University of Illinois at Chicago, Chicago, Illinois, USA, 2002
- M.S., Mathematical Computer Science, University of Illinois at Chicago, Chicago, Illinois, USA, 1997
- M.S., Applied Mathematics, Kharkiv National University, Kharkiv, Ukraine, 1995
EXPERIENCE:

- Research Assistant, Laboratory for Advanced Computing, National Center for Data Mining, University of Illinois at Chicago, Chicago, Illinois, USA, 1998-2002
- Teaching Assistant, Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago, Illinois, USA, 1995-1998
- Teaching Intern, Kharkiv National University Lyceum, Kharkiv, Ukraine, 1995
- Programmer / Application Developer, Civil Engineering Research and Development Institute, Kharkiv, Ukraine, 1992-1995
HONORS AND AWARDS:

- Master of Science Degree with Honors, Kharkiv National University, Kharkiv, Ukraine, 1995
- Scholarship, Fund of Student and Educational Initiatives, Kharkiv National University, Kharkiv, Ukraine, 1993
- "Outstanding Runner-Up" Team Award, Net Challenge Event, SuperComputing-2000 conference, Dallas, Texas, USA, 2000
- "High Performance Communication" Team Award, High Performance Computing Challenge, SuperComputing-1999 conference, Portland, Oregon, USA, 1999
- "Most Innovative of the Show" Team Award, High Performance Computing Challenge, SuperComputing-1998 conference, Orlando, Florida, USA, 1998

PUBLICATIONS AND PRESENTATIONS:

- Pliska, S.R., Turinsky, A.: Solution Manual for the Exercises, Introduction to Mathematical Finance: Discrete Time Models by S.R. Pliska, Blackwell Publishers, 1997.
- Turinsky, A. and Grossman, R.L.: Greedy Data Labeling and the Model Assignment Problem for Scientific Data Sets. Poster presentation at the IPAM Workshop on Mathematical Challenges in Scientific Data Mining, Los Angeles, USA, 2002.
- Turinsky, A. and Grossman, R.L.: A Framework for Finding Distributed Data Mining Strategies That Are Intermediate Between Centralized Strategies and In-Place Strategies. Presentation at the KDD 2000 Workshop on Distributed Data Mining, Boston, USA, 2000. Avail. at http://citeseer.nj.nec.com/turinsky00framework.html.
- Grossman, R.L., Bailey, S., Ramu, A., Malhi, B., Sivakumar, H., Turinsky, A.: Papyrus: A System for Data Mining over Local and Wide Area Clusters and Super-Clusters, Proceedings of Supercomputing 1999, IEEE, 1999.
- Grossman, R.L., Bailey, S., Ramu, A., Malhi, B. and Turinsky, A.: The preliminary design of Papyrus: a system for high performance, distributed data mining over clusters. In: Advances in Distributed and Parallel Knowledge Discovery, eds. H. Kargupta and P. Chan, pp. 259-275, AAAI Press/MIT Press, Menlo Park, California, 2000.
