
Articles

Challenges and Opportunities in Applied Machine Learning

Carla E. Brodley, Umaa Rebbapragada, Kevin Small, Byron C. Wallace

Machine-learning research is often conducted in vitro, divorced from motivating practical applications. A researcher might develop a new method for the general task of classification, then assess its utility by comparing its performance (for example, accuracy or AUC) to that of existing classification models on publicly available data sets. In terms of advancing machine learning as an academic discipline, this approach has thus far proven quite fruitful. However, it is our view that the most interesting open problems in machine learning are those that arise during its application to real-world problems. We illustrate this point by reviewing two of our interdisciplinary collaborations, both of which have posed unique machine-learning problems, providing fertile ground for novel research.

What do citation screening for evidence-based medicine and generating land-cover maps of the Earth have in common? Both are real-world problems for which we have applied machine-learning techniques to assist human experts, and in each case doing so has motivated the development of novel machine-learning methods. Our research group works closely with domain experts from other disciplines to solve practical problems. For many tasks, off-the-shelf methods work wonderfully, and when asked for a collaboration we simply point the domain experts to the myriad available open-source machine-learning and data-mining tools. In other cases, however, we discover that the task presents unique challenges that render traditional machine-learning methods inadequate, necessitating the development of novel techniques.

In this article we describe two of our collaborative efforts, both of which have addressed the same question: if you do not initially obtain the classification performance that you are looking for, what is the reason and what can you do about it? The point we would like to make is that much of the research in machine learning focuses on improvements to existing techniques, as measured over benchmark data sets. However, in our experience, when applying machine learning, we have found that the choice of learning algorithm (for example, logistic regression, SVM, decision tree, and so on) usually has only a small effect on performance, and for most real-world problems it's easy to try some large subset of methods (for example, using cross-validation) until one finds the best method for the task at hand. In our view, the truly interesting questions arise when none of the available machine-learning algorithms performs adequately.


In such cases, one needs to address three questions: (1) why is the performance poor? (2) can it be improved? and (3) if so, how? Over the past 15 years, our research group has worked on over 14 domains (Draper, Brodley, and Utgoff 1994; Moss et al. 1997; Brodley and Smyth 1997; Brodley et al. 1999; Shyu et al. 1999; Lane and Brodley 1999; Kapadia, Fortes, and Brodley 1999; Friedl, Brodley, and Stahler 1999; Stough and Brodley 2001; MacArthur, Brodley, and Broderick 2002; Dy et al. 2003; Early, Brodley, and Rosenberg 2003; Aisen et al. 2003; Pusara and Brodley 2004; Fern, Brodley, and Friedl 2005; Lomasky et al. 2007; Preston et al. 2010; Rebbapragada et al. 2009; 2008b; 2008a), with an emphasis on effectively collaborating with domain experts. In the majority of these, poor performance was due not to the choice of learning algorithm, but rather was a problem with the training data. In particular, we find that poor classifier performance is usually attributable to one or more of the following three problems. First, insufficient training data was collected — either the training data set is too small to learn a generalizable model or the data are a skewed sample that does not reflect the true underlying population distribution. Second, the data are noisy; either the feature values have random or systematic noise, or the labels provided by the domain expert are noisy. Third, the features describing the data are not sufficient for making the desired discriminations. For example, a person's shoe size will not help in diagnosing what type of pulmonary disease he or she has.1

The unifying theme of this work is that application-driven research begets novel machine-learning methods. The reason for this is twofold: first, in the process of fielding machine learning, one discovers the boundaries and limitations of existing methods, spurring new research. Second, when one is actively working with domain experts, it is natural to ask how one might better exploit their time and expertise to improve the performance of the machine-learning system. This has motivated our research into interactive protocols for acquiring more training data (for example, active learning and variants), cleaning existing labeled data, and exploiting additional expert annotation beyond instance labels (that is, alternative forms of supervision).

Active Learning without Simplifying Assumptions

When confronted with inadequate classifier performance, perhaps the most natural strategy for improving performance is to acquire more labeled training data. Unfortunately, this requires that the domain expert spend valuable time manually categorizing instances (for example, biomedical citations or land-surface images) into their respective classes. It is critical to judiciously use expert time and minimize the expert's overall effort.

In an ongoing collaboration that investigates methods to semiautomate biomedical citation screening for systematic reviews, we have developed novel strategies that better exploit experts by using their annotation time wisely and providing a framework that incorporates their domain knowledge into the underlying machine-learning model. In this section we review this collaboration, illustrating how working closely with domain experts can motivate new directions in machine-learning research by exposing inadequacies in conventional techniques.

Systematic Reviews

Over the past few decades, evidence-based medicine (EBM) (Sackett et al. 1996) has become increasingly important in guiding health-care best practices. Systematic reviews, in which a specific clinical question is addressed by an exhaustive assessment of pertinent published scientific research, are a core component of EBM. Citation screening is the process in which expert reviewers (clinical researchers, who are typically doctors) evaluate biomedical document abstracts to assess whether they are germane to the review at hand (that is, whether they meet the prespecified clinical criteria). This is an expensive and laborious, but critical, step in conducting systematic reviews. A systematic review typically begins with a PubMed2 query to retrieve candidate abstracts. The team of clinicians must then identify the relevant subset of this retrieved set. The next step is to review the full text of all "relevant" articles to select those that ultimately will be included in the systematic review. Figure 1 illustrates the systematic review process. Shown in log scale are three piles containing the total number of papers in PubMed in 2011 (left), potentially relevant papers (middle), and papers deemed relevant and requiring full-text screening (right). Here we focus on the process between the middle and right piles, known as citation screening.

The torrential volume of published biomedical literature means reviewers must evaluate many thousands of abstracts for a given systematic review. Indeed, for one such review concerning disability in infants and children, researchers at the Tufts Evidence-Based Practice Center screened more than 30,000 citations. This process is performed for every systematic review. The number of citations examined per review is increasing as EBM becomes more integral to the identification of best practices. Figure 2 illustrates the magnitude of this task. In general, an experienced reviewer can screen (that is, label) an abstract in approximately 30 seconds. For the aforementioned review in which more than 30,000 citations were screened, this translates to more than 250 (tedious) hours of uninterrupted work. Moreover, because doctors usually perform this task, citation screening is incredibly expensive.


Figure 1. The Systematic Review Screening Process. Shown, on a log scale of the number of abstracts: all of PubMed (roughly 23 million), citations retrieved with the search (roughly 5,000), and citations deemed relevant (roughly 100).

Addressing the Limitations of Off-the-Shelf Machine Learning

In collaboration with the evidence-based practice center, our group is developing machine-learning methods to semiautomate citation screening. We begin by framing the task as a binary text classification problem; each citation is either relevant or irrelevant to the current systematic review. The aim then is to induce a classifier capable of making this distinction automatically, allowing the reviewers to safely ignore citations screened out (that is, labeled irrelevant) by the classifier. The combination of expensive annotation time and the requirement that a new classifier must be learned for each systematic review has motivated our interest in the active learning framework (Settles 2009), which attempts to mitigate the amount of labeled data required to induce a sufficiently performing classifier by training the model interactively. The intuition is that by selecting examples cleverly, rather than at random, less labeled data will be required to induce a good model, thereby saving the expert valuable time.

Active learning has been shown to work quite well empirically for in vitro settings (for example, McCallum and Nigam [1998], Tong and Koller [2000]). In particular, uncertainty sampling, in which at each step in active learning the unlabeled instance about whose predicted label the model is least confident is selected for labeling, has become an immensely popular technique. We were therefore disappointed when uncertainty sampling failed to outperform random sampling for our systematic review data sets with respect to the recall-focused evaluation metric of interest for this application.

This last statement raises an important point: when we evaluated uncertainty sampling as if these were "benchmark" data sets about which we knew nothing aside from the feature vectors and their labels, the results using active learning were quite good in terms of accuracy. Figure 3 illustrates this point. Shown are two comparisons of querying strategies: random sampling, in which the instances to be labeled and used as training data are selected randomly, and the active learning strategy known as SIMPLE (Tong and Koller 2000), an uncertainty-sampling technique. The results shown are averages taken over 10 runs. As a base learner for both querying strategies, we used support vector machines (SVMs).
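To make the two querying strategies concrete, the sketch below shows a common way to implement SVM-based uncertainty sampling in the spirit of SIMPLE: at each round, the unlabeled instance closest to the current decision boundary is sent to the expert. This is a minimal illustration (using scikit-learn as an assumed toolkit), not the code behind the experiments reported here; the pool, labels, and oracle are placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC

def uncertainty_sampling(X_pool, labeled_idx, labels, n_queries, oracle):
    """Greedy SVM uncertainty sampling in the spirit of SIMPLE (a sketch).

    X_pool      : (n, d) feature matrix for the whole citation pool
    labeled_idx : indices that already have labels
    labels      : dict mapping index -> 0/1 label (1 = relevant)
    oracle      : callable(index) -> 0/1, stands in for the human expert
    """
    labeled_idx = list(labeled_idx)
    for _ in range(n_queries):
        clf = LinearSVC(C=1.0)
        clf.fit(X_pool[labeled_idx], [labels[i] for i in labeled_idx])

        unlabeled = [i for i in range(len(X_pool)) if i not in labels]
        if not unlabeled:
            break
        # Distance to the separating hyperplane is the (un)certainty proxy:
        # the smallest margin marks the instance the model is least sure about.
        margins = np.abs(clf.decision_function(X_pool[unlabeled]))
        query = unlabeled[int(np.argmin(margins))]

        labels[query] = oracle(query)  # ask the expert for this label
        labeled_idx.append(query)
    return labeled_idx, labels
```

Random sampling corresponds to replacing the argmin step with a uniform draw from the unlabeled pool.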


Figure 2. Abstracts to Be Screened. These stacks of paper (printed abstracts) represent only a portion of the roughly 33,000 abstracts screened for a systematic review concerning disability in infants and children. This task required approximately 250 hours of effort by researchers at Tufts Medical Center.

On the left side of the figure, we have plotted overall accuracy (assessed over a hold-out set) against the number of labels provided by the domain expert. Note that, as one might hope, the active learning curve dominates that of random sampling. One might conclude that active learning ought then to be used for this task, as it produces a higher accuracy classifier using fewer labels (that is, less human effort). However, consider the right side of the figure, which substitutes sensitivity for accuracy as the metric of performance. Sensitivity is the proportion of positive instances (in our case, relevant citations) that were correctly classified by the model.

In the citation screening task, we care more about sensitivity than we do about the standard metric of accuracy for two reasons. First, of the thousands of citations retrieved through an initial search, only about 5 percent, on average, are relevant. This is a phenomenon known as class imbalance. Second, for a given systematic review, it is imperative that all relevant published literature be included, or else the scientific validity of the review may be compromised. In the nomenclature of classification, this means we have highly asymmetric costs. In particular, false negatives (relevant citations the classifier designates as irrelevant) are much costlier than false positives (irrelevant citations the classifier designates as relevant); false negatives may jeopardize the integrity of the entire review while false positives incur additional labor for the reviewers (in the form of reviewing the article in full text to determine its relevance).

Although it has previously been recognized that class imbalance and cost-sensitive classification are important for real-world problems (Drummond, Holte, et al. 2005; Provost 2000; He and Garcia 2009), many new studies still use accuracy or AUC to evaluate new learning methods.
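The divergence between the two panels of figure 3 is easy to reproduce with hypothetical counts. The short example below (invented numbers, not our data) scores a classifier on a pool in which only about 5 percent of the citations are relevant: missing half of the relevant citations barely dents accuracy but is catastrophic for sensitivity.

```python
def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def sensitivity(tp, fn):
    # Proportion of truly relevant citations that the classifier kept.
    return tp / (tp + fn)

# Hypothetical screening result: 1,000 citations, 50 of them relevant.
tp, fn = 25, 25      # half of the relevant citations are missed
tn, fp = 930, 20     # most irrelevant citations are correctly excluded

print(accuracy(tp, fp, tn, fn))   # 0.955 -- looks excellent
print(sensitivity(tp, fn))        # 0.5   -- unacceptable for a systematic review
```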


Figure 3. The Performance of Two Different Querying Strategies on a Systematic Review Data Set. (1) Random sampling, in which labels are requested for instances at random, and (2) the popular active learning technique SIMPLE (Tong and Koller 2000), wherein the learner requests labels for instances about whose label it is most uncertain. On the left, learners are compared as to accuracy (that is, total proportion of correctly classified instances); on the right, sensitivity (accuracy with respect to the minority class of relevant citations). Both panels plot the metric (0.4 to 1.0) against the number of labeled examples (0 to 1,000). Due to the imbalance in the data set, the two metrics tell very different stories about the relative performance of the respective querying strategies.

It is thus important to reiterate our point: without understanding the data one is working with, it is easy to draw misleading conclusions by considering the wrong metric. It is impossible to conclude that one learning algorithm is better than another without reference to a classification task with associated misclassification costs.

Fortunately, researchers are beginning to acknowledge the shortcomings of existing learning techniques when they are deployed for real-world tasks. For example, Attenberg and Provost's (2010) recent call-to-arms paper addresses class imbalance, nonuniform misclassification costs, and several other factors that can lead to poor performance of fielded active learning. In our own work, the observation that traditional uncertainty sampling performed poorly in our application led to two developments. First, in light of the above discussion regarding appropriate evaluation metrics, we adapted a method from the field of medical decision making that has been used for diagnostic test assessment (Vickers and Elkin 2006) to elicit a metric from the domain expert reflecting the implicit relative costs they assign to false negatives and false positives. This metric can then be used to compare classifiers in a task-specific way. Second, we developed a novel active learning algorithm that works well under class imbalance by exploiting domain knowledge in the form of labeled terms to guide the active learning process (Wallace et al. 2010).

The strategy of exploiting labeled terms was motivated by our interaction with the reviewers during an ongoing systematic review. When initially experimenting with uncertainty sampling, we would show reviewers citations for which the model was least certain. The reviewers would often remark that the model was being confused by certain things. For example, in one case the model was consistently confounded by published trials that were otherwise relevant to the review, but in which the subjects were mice rather than humans. Unfortunately, there was no way for them to impart this information directly to the model. We therefore developed an algorithm that exploits labeled terms (for example, words like mice that indicate a document containing them is likely to be irrelevant) to guide the active learning process. Figure 4 shows the labeled terms for a systematic review of proton beam radiation therapy.

In particular, we build a simple, intuitive model over the labeled phrases in tandem with a linear-kernel support vector machine (Vapnik 1995) over a standard bag-of-words (BOW) representation of the corpus.3 For the former, we use an odds-ratio based on term counts (that is, the ratio of positive to negative terms in a document). Specifically, suppose we have a set of positive features (that is, n-grams indicative of relevance), PF, and a set of negative features, NF. Then, given a document d to classify, we can generate a crude prediction model for d being relevant, stated as:


\[
\frac{\sum_{w^{+} \in PF} I_d(w^{+}) + 1}{\sum_{w^{-} \in NF} I_d(w^{-}) + 1}
\]

where I_d(w) is an indicator function that is 1 if w is in d and 0 otherwise. Note that we add pseudo-counts to both the negative and positive sums, to avoid division by zero. The value of this ratio relative to 1 gives a class prediction and the magnitude of the ratio gives a confidence.4 For example, if d contains 10 times as many positive terms as it does negative terms, the class prediction is + and a proxy for our confidence is 10 (note that because we're using an odds-ratio, the confidence can be an arbitrarily large scalar).
arbitrarily large scalar). setting in which all annotators must posses a req-
Our active learning strategy then makes use of uisite minimum aptitude for annotating instances,
both this simple, semantic or white-box labeled precluding the use of low-cost, untrained annota-
terms model and the black-box SVM model during tors through crowd sourcing.
active learning using a framework known as “co- A similar problem was explored by Donmez and
testing” (Muslea, Minton, and Knoblock 2006), Carbonell (2008). They proposed a decision-theo-
which works as follows. Suppose we have two retic approach to the task, wherein the instance to
models, M1 and M2. Define contention points as be labeled and the expert to do the labeling is
those unlabeled examples about whose labels M1 picked at each step in active learning in order to
and M2 disagree. Now request the label for one of maximize local utility (that is, the estimated value
these points. Because at least one of these models of acquiring a label for the selected instance nor-
is incorrect, such points are likely highly informa- malized by the cost of the chosen expert). This
tive. In our particular case, these points are those strategy is intuitively appealing, but has a few
about whose labels the black-box SVM model dis- drawbacks for our domain. In particular, we found
agrees with what the domain expert has explicitly it quite difficult to produce a reasonable value of
communicated to us. Intuitively, it is desirable to information for each unlabeled instance. More
teach the SVM to make classifications that tend to problematically, we found that, on our data, this
agree with the information the expert has provid- greedy strategy tended repeatedly to select cheap-
ed. This approach improves the performance of the er labelers, accruing a noisy training set. This in
system substantially; experts have to label fewer turn gave rise to comparatively inaccurate models.
citations to identify the relevant ones than they do We had no way to take into account the negative
using the uncertainty sampling approach. The key utility of incorrectly labeled instances, particularly
point here is that actually working with the because predicting which instances novices would
domain experts motivated a new method that ulti- mislabel proved a difficult task (see Wallace et al.
mately reduced workload. Had we been working [2011] for a more detailed discussion).
with only benchmark data sets, we would not have While conducting our experiments with real


While conducting our experiments with real experts, we noticed that novice labelers (reviewers) appeared capable of gauging the reliability of the labels they provided (that is, they knew when they knew). Motivated by this observation, we set out to exploit the metacognitive abilities of the "novice" experts. Specifically, in addition to simply labeling an example, we ask experts to indicate if they are uncertain about their classification. If they are, we pass the example on to a more experienced expert (in our domain, as presumably in most, cost monotonically increases with expertise). In Wallace et al. (2011) we demonstrated that this approach outperforms all other strategies proposed, in terms of cost versus classifier performance. Moreover, we presented empirical evidence that novice labelers are indeed conscious of which examples they are likely to mislabel, and we also argued that automatically identifying difficult instances before labeling is not feasible. Thus we are relying on relatively cheap human computation rather than automated methods, as described in Rebbapragada (2010).
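A schematic version of this labeling protocol is sketched below, under the simplifying assumptions that a single senior expert backs up each novice and that labelers report an explicit uncertainty flag; the names and cost bookkeeping are illustrative, not the experimental setup of Wallace et al. (2011).

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class Labeler:
    name: str
    cost_per_label: float
    # Returns (label, is_uncertain); stands in for a human annotator.
    annotate: Callable[[str], Tuple[int, bool]]

def route_instance(doc: str, novice: Labeler, senior: Labeler):
    """Ask the cheap labeler first; escalate only self-reported hard cases."""
    label, uncertain = novice.annotate(doc)
    cost = novice.cost_per_label
    if uncertain:
        # The novice "knows when they don't know": pay for the senior expert.
        label, _ = senior.annotate(doc)
        cost += senior.cost_per_label
    return label, cost
```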
Summary

In summary, while developing a fielded machine-learning system for semiautomating citation screening for systematic reviews we ran into problems when applying off-the-shelf machine-learning technologies. These issues spurred new research questions that we addressed by working closely with domain experts. Thus far, the Tufts Evidence-Based Practice Center has used our machine-learning methods prospectively to prioritize work when conducting two systematic reviews (that is, we used the model to select the documents that the experts should screen, and also to inform them when they had likely identified all of the relevant citations). At this point in both of the aforementioned reviews, the remaining citations (which the model labeled as irrelevant) were screened by a less costly human (that is, not a doctor). In both cases, our model was correct in its designation of the remaining citations as irrelevant. As we further validate our model empirically, our goal is for experts to trust its decisions enough to eliminate the human screening of irrelevant citations.

We are working on an online tool for citation screening that integrates our methods (active learning, labeling task allocation, and the automatic exclusion of irrelevant citations). We are currently conducting a larger scale empirical evaluation and plan to make our system available to evidence-based practice centers outside of Tufts in Summer 2012.

Cleaning Up after Noisy Labelers

In the preceding section we considered the strategy of collecting additional training data to improve classifier performance. We now turn our attention to a second common cause of poor model generalization accuracy: noisy data. In particular, labeling errors in training data negatively affect classifier accuracy. Experts make errors because of inadequate time spent per example, annotator fatigue, inconsistent labeling criteria, poor data quality, ambiguity among classes, and shifts in decision criteria. For example, when using multispectral satellite imagery data to create maps of global land cover, there are two potential sources of labeling errors: poor image quality and ambiguity between multimodal classes (Friedl et al. 2010). The "Open Shrubland" class in the International Geosphere-Biosphere Programme (IGBP) is bimodal. Both high-latitude tundra and desert cacti qualify for this label. However, high-latitude tundra is spectrally similar to the "Grassland" class and thus "Grassland" and "Open Shrubland" are often confused. In this domain, the experts can consult auxiliary data sources where possible to make a high confidence decision, but this incurs significant additional cost; many of the original labels were assigned prior to the availability of high-quality data sources (for example, Google Earth), whose review could resolve previously ambiguous cases.

In the case of labeling medical abstracts for systematic reviews, as mentioned earlier, experts spend an average of 30 seconds per abstract, often skimming the text to make a decision. To understand why labeling errors are made in this domain, we examined the error rate of a particular expert on a systematic review investigating proton beam cancer therapy. When we asked the expert to perform a second review of certain examples, he discovered that had he read some abstracts more thoroughly, he would have assigned them a different label. This expert admitted feeling fatigue and boredom during the original labeling process, which caused him to relax his criteria on assigning articles to the relevant class. He also noted that because his decision criteria evolved over time, he would have assigned different labels to earlier examples had he gone back and relabeled them. Our physician in this case was an expert, thus demonstrating that even experienced labelers are not immune from making errors in this domain.

Involving Experts to Correct Label Errors

Existing fully automated approaches for mitigating label noise generally look to improve the resulting classifier by avoiding overfitting to the noise in the data. Operationally, such methods often have the goal of removing label noise as a preprocessing step for the training data prior to applying the supervised machine-learning method. Preprocessing methods attempt to detect noisy examples in a single pass of the training data set. Methods differ in how they handle suspected mislabelings.


Some discard suspected examples (Zeng and Martinez 2003; Valizadegan and Tan 2007; Gamberger, Lavrac, and Dzeroski 1996; Gamberger, Lavrac, and Groselj 1999; Brodley and Friedl 1996; 1999; Venkataraman et al. 2004; Verbaeten 2002; Verbaeten and Assche 2003; Zhu, Wu, and Chen 2003), a few attempt to correct them (Zeng and Martinez 2001; Wagstaff et al. 2010), and a few perform a hybrid of the two actions (Lallich, Muhlenbach, and Zighed 2002; Muhlenbach, Lallich, and Zighed 2004; Rebbapragada and Brodley 2007). The primary drawback of "single pass" approaches is that they introduce Type I and II errors. They may eliminate or flip the labels on clean examples, while leaving truly mislabeled examples unchanged.

For both the land-cover and the citation-screening domains, after recognizing that we have labeling errors, we asked the question: how can we involve our expert in this process? Our idea was to leverage the strength of fully automated techniques but outsource the task of relabeling to a domain expert. Our approach, called active label correction (ALC), assumes the availability of additional data sources that the original annotator or a second annotator can consult to improve their decision. Fortunately, this is a realistic scenario, even in domains where expert time comes at a premium. For example, in the domain of citation screening, one can retrieve the full text if the abstract has not provided sufficient information to choose a label, and for land-cover classification, there are different procedures for labeling data, some of which take considerably more time and effort than others. For such domains, if the expert is presented a small, focused set of examples that are likely mislabeled, he or she can justify spending the additional time necessary to ensure those examples are properly labeled.

By highlighting the difficult cases for rereview, ALC ensures quality labels for examples that are ambiguous, and potentially more informative for classification. Note that this scenario is different than the one described above in multiple-expert active learning, in which one has several experts and a single pool of unlabeled data. Here we have one expert, either the original or a more qualified alternate, who is prepared to spend the necessary time analyzing select instances that were already labeled using a cheaper procedure suspected of introducing errors.

Cleaning Up the Land-Cover Training Data

In Rebbapragada (2010) we explore many instantiations of ALC, run several experiments on real-world data both within scenarios for which we do and do not know the amount of noise, and apply it to a database of known land-cover types used to train a global land-cover classifier provided by the Department of Geography and Environment at Boston University. The source of the data is the system for terrestrial ecosystem parameterization (STEP) database (Muchoney et al. 1999), which is a collection of polygons drawn over known land-cover types. The database was created from data collected by the moderate resolution imaging spectroradiometer (MODIS). Each observation in this database measures surface reflectance from seven spectral bands, one land-surface temperature band, and one vegetation index (a transformation of the spectral bands that is sensitive to the amount of vegetation present in the pixel). The data has been aggregated by month and includes the mean, maximum, and minimum surface reflectance measurements for one year. The number of MODIS pixels in a site ranges from 1 to 70; each site is given a single class label. The size of the site depends on the landscape homogeneity and is constrained to have an area between 2 and 25 square kilometers.

The labeling scheme is determined by the International Geosphere-Biosphere Programme (Muchoney et al. 1999), which is the consensus standard for the Earth-science modeling community. The IGBP legend contains 17 land-cover classes. Ideally, the land-cover classes should be mutually exclusive and distinct with respect to both space and time. By these criteria, the IGBP scheme is not ideal because it tries to characterize land cover, land use, and surface hydrology, which may overlap in certain areas (in the next section we discuss efforts to rethink this classification scheme for the geography community using machine-learning methods). The version of STEP used to produce the MODIS Collection 5 Land Cover product was initially labeled at Boston University during 1998–2007 using various information sources such as Landsat imagery. Since then new information sources have become available, such as high-resolution Google Earth imagery, independently produced maps of agriculture, MODIS-derived vegetation indices, and ecoregion maps (Friedl et al. 2010). These new sources of information make it an ideal candidate for ALC.

We partnered with our domain experts to evaluate ALC's ability to find all instances that might be mislabeled. We ran six rounds of ALC; for each round we asked the domain expert to examine the 10 sites for which our method had the lowest confidence on their assigned class labels as computed using the calibrated probabilities from an SVM.5 At each round, a new SVM was computed including the (possibly) corrected training data of the last round. Thus our expert examined 60 unique sites in the evaluation.

Our expert analyzed each of the selected sites and provided his personal primary and secondary labeling for each site, the confidence on the composition of the site itself, and some detailed comments. Prior to our involvement, the domain experts had revisited some of the labels on this data set. Thus, some sites had two labels: the "original" label and a "corrected" label. Out of the 60 that our method chose for rereview (using the original labels), 19 were sites that the geographers had already identified as potentially mislabeled. Of those 19, our expert agreed with 12 labels in the "corrected" data. Of the remaining 7, he decided to relabel 4 of them. For 2, although he disagreed with the "corrected" label, he was not confident of the true label. On the last he determined that the "original" label was more accurate. Thus in 18 of 19 cases, the expert confirmed that those examples were truly mislabeled. Of the 41 sites that had consistent labels between original and corrected, the expert verified that 16 of these sites were truly mislabeled. Of the other 25 sites, he did make adjustments to 3 in which he found other errors and wanted to revisit later, either for deletion or to regroup the site with other observations. Thus, for this data set ALC was successful in isolating the difficult cases that require more human attention. Informally, the expert told us that "[ALC] is both useful and important to the maintenance of the database. (1) It identifies sites that are very heterogeneous and mixtures of more than one land-cover class. (2) It identifies sites that are mislabeled or of poor quality. I would like to see this analysis on some of the other STEP labels besides the IGBP class since the IGBP classification has a number of problems with class ambiguity."

Summary

In summary, when trying automatically to classify regions of Earth's surface, we needed to address the issue that the available training data had labeling errors. Through discussions with our domain experts we ascertained that more costly, labor-intensive processes were available and that they were willing to reexamine some labels to correct for errors that, if left untouched, would decrease the ultimate performance of an automated classification scheme. Indeed, experiments illustrated that this was a better use of their time than labeling new data with the less costly procedure (Rebbapragada 2010). Because we were working closely with our domain experts, we developed a new way in which to use human expertise, which we named active label correction. We plan to apply these ideas more extensively to both the citation screening and the land-cover classification domains.

Rethinking Class Definitions

In working with the geographers on trying to correct label noise in their training data, we sat back and brainstormed about the causes of the errors. Even with the ability to use auxiliary information sources, for some sites it is impossible to come up with a single correct label despite significant additional effort. Our conjecture was that the features were not sufficient to make the class discriminations required in the IGBP classification scheme. For this domain, our ability to gather more data is limited by the physical properties of MODIS, thus the avenue of collecting additional data in the form of more features is closed. Indeed, for any real-world domain, before applying a supervised learning method, we must identify the set of classes whose distinction might be useful for the domain expert, and then determine whether these classifications can actually be distinguished by the available data.

Where do class definitions originate? For many data sets, they are provided by human experts as the categories or concepts that people find useful. For others, one can apply clustering methods to find automatically the homogeneous groups in the data. Both approaches have drawbacks. The categories and concepts that people find useful may not be supported by the features (that is, the features may be inadequate for making the class distinctions of interest). Applying clustering to find the homogeneous groups in the data may find a class structure that is not of use to the human. For example, applying clustering to high-resolution CT scans of the lung (Aisen et al. 2003) in order to find the classes of pulmonary diseases may group together in one cluster two pulmonary diseases that have radically different treatments or, conversely, it may split one disease class into multiple clusters that have no meaning with respect to diagnosis or treatment of the disease.

Combining Clustering and Human Expertise

To address this issue, we developed a method for redefining class definitions that leverages both the class definitions found useful by the human experts and the structure found in the data through clustering. Our approach is based on the observation that for supervised training data sets, significant effort went into defining the class labels of the training data, but that these distinctions may not be supported by the features. Our method, named class-level penalized probabilistic clustering (CPPC) (Preston et al. 2010), is designed to find the natural number of clusters in a data set, when given constraints in the form of class labels. We require a domain expert to specify an L × L matrix of pairwise probabilistic constraints, where L is the number of classes in the data. This matrix makes use of the idea that the expert will have different levels of confidence for each class definition, and preferences as to which class labels may be similar.


Thus, each element of the matrix defines the expert's belief that a pair of classes should be kept separate or should be merged. Elements on the diagonal reflect the expert's opinion of whether a particular class is multimodal. Using the instances' class labels and the probabilities in the matrix provides us with pairwise probabilistic instance-level constraints.
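The sketch below illustrates one way such a class-level matrix can induce instance-level constraints, reading entry (a, b) as the expert's probability that classes a and b should be kept separate (so diagonal entries encode multimodality beliefs, and one minus the entry acts as a must-link weight). This encoding is an assumption made for illustration; Preston et al. (2010) give the actual CPPC formulation, and note 6 explains how CPPC avoids materializing all O(N²) instance pairs.

```python
import itertools
import numpy as np

def instance_constraints(labels, class_matrix):
    """Expand an L x L class-level matrix into pairwise instance constraints.

    labels       : length-n array of class ids in {0, ..., L-1}
    class_matrix : (L, L) array; entry (a, b) is the expert's probability
                   that classes a and b should be kept separate.
    Yields (i, j, p_cannot_link) for every pair of instances; 1 - p is the
    corresponding must-link weight.
    """
    for i, j in itertools.combinations(range(len(labels)), 2):
        p_separate = class_matrix[labels[i], labels[j]]
        yield i, j, float(p_separate)

# Tiny illustration with three hypothetical classes.
M = np.array([[0.2, 0.9, 0.5],   # diagonal: belief that the class is multimodal
              [0.9, 0.1, 0.8],
              [0.5, 0.8, 0.3]])
labels = np.array([0, 0, 1, 2])
print(list(instance_constraints(labels, M)))
```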
Our framework for discovering the natural clustering of the data given this prior knowledge uses a method based primarily on the penalized probabilistic clustering (PPC) algorithm (Lu and Leen 2007). Penalized probabilistic clustering is a constraint-based clustering algorithm, in which Lu and Leen defined a generative model to capture the constraints of the domain and the clustering objective. The model allows the user to provide probabilistic must- and cannot-link constraints; a must-link constraint between a pair of instances tells PPC to try to put the instances in the same cluster and a cannot-link constraint indicates they should not be in the same cluster. The probability helps the algorithm determine how much weight to place on the constraint. To make the constraint-based clustering tractable they use a mean field approximation (Lu and Leen 2007).6 In Preston et al. (2010) we provide the details of this work and a framework to evaluate a class clustering model using a modification of the Bayesian information criterion that measures the fit of a clustering solution to both the data and the specified constraints while adding a term to penalize for model complexity.

Redefining the Land-Cover Classification Scheme

Before we applied CPPC to the land-cover data set, the geographers speculated that (1) large, complex classes such as agriculture may contain several distinguishable subclasses and (2) geographic regions classified as one of the mixed classes, such as "mixed forest," should most likely be merged with either of its two subclasses: "deciduous broadleaf forest" or "evergreen broadleaf forest." In the latter case, some sensors cannot distinguish well between these types of forest. To evaluate CPPC's ability to uncover meaningful class structure, we presented the geographers with two maps built using a land-cover data set for North America. One map was generated using CPPC, the other with expectation maximization (EM) clustering. They were not told which was the new solution. They determined that the map generated using the CPPC clusters defined far more meaningful class structure than the map generated from the EM clusters, and further that the map was better than the map generated by applying supervised learning to the original labeled data. The results showed that CPPC finds areas of confusion where classes should be merged or where separation within a class provides additional useful information, such as corn and wheat clusters within the agriculture class (see figure 5 for an illustrative example of the results).

Based on these results, our geographers are interested in using this tool to redesign the land-cover hierarchy. By observing the results of the constraints, one can better assess what types of actual divisions exist in the data and whether or not our desired class descriptions are feasible distinctions (that is, can actually be distinguished by the features).

Summary

Our work in redefining the labeling scheme for land-cover classification arises from the observation that even when the experts tried to correct for labeling errors in their data set, it was not possible to create an accurate classifier from the data. The reason is that the features do not support the classes defined by the IGBP classification scheme. This led to research on how to redefine this scheme using the best of both worlds: the work already done on examining each training data point to come up with a land-cover label and the ability of unsupervised learning to uncover structure in the data. Thus our ideas resulted from working on an application for which off-the-shelf machine-learning methods failed to give acceptable performance and by considering novel ways in which to use our domain expertise. We have begun to work with two groups of doctors to apply some of these same ideas to develop a classification scheme for grouping clinical trials and a classification scheme for chronic obstructive pulmonary disease (COPD).

Conclusions

A primary objective in our research is to consider how we can best utilize domain expertise to improve the performance of machine-learning and data-mining algorithms. This article describes two such collaborations, which led to methods for utilizing human input in the form of imparting additional domain knowledge (that is, terms indicative of class membership), correction of labeling errors, and the use of training labels and expert knowledge to guide the redefinition of class boundaries.


Figure 5. Maps Using Original IGBP Classification (left), CPPC Clustering (middle), and EM Clustering (right).
The top row shows a map tile centered on the agricultural oasis just south of the Salton Sea in southern California. It represents a region
with very sparse desert shrub cover (darker gray) and some agriculture (light gray) and denser forest/grassland (medium gray). The original
IGBP classification contains a barren or sparsely vegetated class that is similar (spectrally) to the open shrubland class and is preserved in
the clustering with constraints but not the EM clustering. In this case, CPPC preserves the class structure of interest. The middle row shows
an area centered on Montreal. It represents a transition between southern hardwoods (deciduous broadleaf) and boreal conifers (evergreen
needleleaf) forests. It is also an area of transition between forest and the agricultural region along the river. The original IGBP classification
has two “mixture” classes that mostly disappear after clustering: mixed forest and agriculture/natural mosaic (mixture of trees and agricul-
ture). In this case CPPC correctly decides to merge classes. The bottom row shows an area on the border of North and South Dakota in the
west with Minnesota to the east. It is almost completely covered with agriculture with a small patch of natural prairie to the southwest of
the image. The original IGBP classification has one single agriculture class (light gray). As we see with the two clustering results there are
actually three distinct agricultural signals within this area, which indicate different types of crops, different harvesting times, or the pres-
ence/absence of irrigation. In this case, CPPC is able to find subclasses within a single class long thought to be multimodal.


Some of our previous research efforts led to the ideas of human-guided feature selection for unsupervised learning (Dy and Brodley 2000) as applied to content-based image retrieval for diagnosing high-resolution CT images of the lung; determining what experiments should be run next when the process of generating the data cannot be separated from the process of labeling the data (Lomasky et al. 2007) for the generation of training data for an artificial nose; and user-feedback methods for anomaly detection of astrophysics data to find anomalous sky objects.

Clearly theory and algorithm development are both crucial for the advancement of machine learning, but we would argue strongly that being involved in an application can highlight shortcomings of existing methodologies and lead to new insights into previously unaddressed research issues. To conclude, we are motivated by the belief that contact with real problems, real data, real experts, and real users can generate the creative friction that leads to new directions in machine learning.

Acknowledgments

We would like to thank our domain experts Mark Friedl and Damien Sulla-Menashe from the Department of Geography and Environment at Boston University, and Tom Trikalinos from Tufts Medical Center at Tufts University for their willingness to spend countless hours with us. We would like to thank Roni Khardon and Dan Preston for their work in helping to develop the CPPC algorithm. We would like to thank Roni Khardon, Foster Provost, and Kiri Wagstaff for reading and commenting on this work. Our research was supported by NSF IIS-0803409, IIS-0713259, and R01HS018494. The research was carried out in part at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. US Government sponsorship is acknowledged.

Notes

1. The exception is cystic fibrosis, which affects mostly the young.

2. PubMed is a repository of published biomedical papers. Sometimes other databases are used in addition to, or instead of, PubMed.

3. Bag-of-words is a feature-space encoding for textual data in which documents are mapped to vectors whose entries are functions over word counts. The simplest such representation is binary BOW, in which each column represents a particular word and the value of this column for a given document is set to 1 or 0, indicating that the word is present or not, respectively.

4. In order to ensure that the magnitude is symmetric in the respective directions, one may either flip the ratio so that the numerator is always larger than the denominator, or one may take the log of the ratio.

5. The SVM used a polynomial kernel (order 2) where the C parameter was set to 1.0.

6. A straightforward implementation of the PPC algorithm using mean field approximation (Jaakkola 2000) results in O(N²) computational complexity due to the number of constraints (that is, one constraint for each pair of instances), where N is the total number of instances. In CPPC, we reduce this time complexity to O(NL) by taking advantage of the repetitions in the set of instance pairwise constraints that are induced from the class pairwise constraints. In general, L << N, as the number of classes is generally much smaller than the number of instances.

References

Aisen, A. M.; Broderick, L. S.; Winer-Muram, H.; Brodley, C. E.; Kak, A. C.; Pavlopoulou, C.; Dy, J.; and Marchiori, A. 2003. Automated Storage and Retrieval of Medical Images to Assist Diagnosis: Implementation and Preliminary Assessment. Radiology 228(1): 265–270.

Attenberg, J., and Provost, F. 2010. Why Label When You Can Search? Alternatives to Active Learning for Applying Human Resources to Build Classification Models Under Extreme Class Imbalance. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 423–432. New York: Association for Computing Machinery.

Attenberg, J.; Melville, P.; and Provost, F. 2010. A Unified Approach to Active Dual Supervision for Labeling Features and Examples. In Proceedings of the 2010 European Conference on Machine Learning and Knowledge Discovery in Databases, 40–55. Berlin: Springer.

Brodley, C. E., and Friedl, M. A. 1996. Identifying and Eliminating Mislabeled Training Instances. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, 799–805. Palo Alto, CA: AAAI Press.

Brodley, C. E., and Friedl, M. A. 1999. Identifying Mislabeled Training Data. Journal of Artificial Intelligence Research 11: 131–167.

Brodley, C. E., and Smyth, P. J. 1997. Applying Classification Algorithms in Practice. Statistics and Computing 7(1): 45–56.

Brodley, C. E.; Kak, A. C.; Dy, J. G.; Shyu, C. R.; Aisen, A.; and Broderick, L. 1999. Content-Based Retrieval from Medical Image Databases: A Synergy of Human Interaction, Machine Learning, and Computer Vision. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, 760–767. Palo Alto, CA: AAAI Press.

Donmez, P., and Carbonell, J. G. 2008. Proactive Learning: Cost-Sensitive Active Learning with Multiple Imperfect Oracles. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, 619–628. New York: Association for Computing Machinery.

Draper, B. A.; Brodley, C. E.; and Utgoff, P. E. 1994. Goal-Directed Classification Using Linear Machine Decision Trees. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(9): 888–893.

Drummond, C.; Holte, R.; et al. 2005. Severe Class Imbalance: Why Better Algorithms Aren't the Answer. In Proceedings of the 16th European Conference on Machine Learning and Knowledge Discovery in Databases, 539–546. Berlin: Springer.

Dy, J., and Brodley, C. E. 2000. Visualization and Interactive Feature Selection for Unsupervised Data. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 360–364. New York: Association for Computing Machinery.

Dy, J.; Brodley, C.; Kak, A.; Pavlopoulou, C.; Aisen, A.; and Broderick, L. 2003. The Customized-Queries Approach to Image Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(3): 373–378.

Early, J. P.; Brodley, C. E.; and Rosenberg, K. 2003. Behavioral Authentication of Server Flows. In Proceedings of the 19th Annual IEEE Computer Security Applications Conference, 49–55. Piscataway, NJ: Institute of Electrical and Electronics Engineers.

Fern, X. Z.; Brodley, C. E.; and Friedl, M. A. 2005. Correlation Clustering for Learning Mixtures of Canonical Correlation Models. In Proceedings of the Fifth SIAM International Conference on Data Mining, 439–448. Philadelphia, PA: Society for Industrial and Applied Mathematics.

Friedl, M. A.; Brodley, C. E.; and Stahler, A. 1999. Maximizing Land Cover Classification Accuracies Produced by Decision Trees at Continental to Global Scales. IEEE Transactions on Geoscience and Remote Sensing 37(2): 969–977.

Friedl, M. A.; Sulla-Menashe, D.; Tan, B.; Schneider, A.; Ramankutty, N.; Sibley, A.; and Huang, X. 2010. MODIS Collection 5 Global Land Cover: Algorithm Refinements and Characterization of New Datasets. Remote Sensing of Environment 114(1): 168–182.

Gamberger, D.; Lavrac, N.; and Dzeroski, S. 1996. Noise Elimination in Inductive Concept Learning: A Case Study in Medical Diagnosis. In Algorithmic Learning Theory: Seventh International Workshop, Lecture Notes in Computer Science 1160, 199–212. Berlin: Springer.

Gamberger, D.; Lavrac, N.; and Groselj, C. 1999. Experiments with Noise Filtering in a Medical Domain. In Proceedings of the Sixteenth International Conference on Machine Learning, 143–151. San Francisco: Morgan Kaufmann Publishers.

He, H., and Garcia, E. 2009. Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering 21(9): 1263–1284.

Jaakkola, T. S. 2000. Tutorial on Variational Approximation Methods. In Advanced Mean Field Methods: Theory and Practice, 129–159. Cambridge, MA: The MIT Press.

Kapadia, N.; Fortes, J.; and Brodley, C. 1999. Predictive Application-Performance Modeling in a Computational Grid Environment. In Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing (HPDC 99), 47–54. Piscataway, NJ: Institute of Electrical and Electronics Engineers.

Lallich, S.; Muhlenbach, F.; and Zighed, D. A. 2002. Improving Classification by Removing or Relabeling Mislabeled Instances. In Proceedings of the Thirteenth International Symposium on the Foundations of Intelligent Systems, 5–15. Berlin: Springer-Verlag.

Lane, T., and Brodley, C. E. 1999. Temporal Sequence Learning and Data Reduction for Anomaly Detection. ACM Transactions on Computer Security 2(3): 295–331.

Lomasky, R.; Brodley, C.; Aerneke, M.; Walt, D.; and Friedl, M. 2007. Active Class Selection. In Proceedings of the Eighteenth European Conference on Machine Learning and Knowledge Discovery in Databases, 640–647. Berlin: Springer.

Lu, Z., and Leen, T. K. 2007. Penalized Probabilistic Clustering. Neural Computation 19(6): 1528–1567.

MacArthur, S.; Brodley, C. E.; and Broderick, L. 2002. Interactive Content-Based Image Retrieval Using Relevance Feedback. Computer Vision and Image Understanding 88(3): 55–75.

McCallum, A., and Nigam, K. 1998. Employing EM and Pool-Based Active Learning for Text Classification. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML), 350–358. San Francisco: Morgan Kaufmann.

Moss, E.; Utgoff, P.; Cavazos, J.; Precup, D.; Stefanovic, D.; Brodley, C.; and Scheeff, D. T. 1997. Learning to Schedule Straight-Line Code. In Advances in Neural Information Processing Systems 9, 929–935. Cambridge, MA: The MIT Press.

Muchoney, D.; Strahler, A.; Hodges, J.; and LoCastro, J. 1999. The IGBP Discover Confidence Sites and the System for Terrestrial Ecosystem Parameterization: Tools for Validating Global Land Cover Data. Photogrammetric Engineering and Remote Sensing 65(9): 1061–1067.

Muhlenbach, F.; Lallich, S.; and Zighed, D. A. 2004. Identifying and Handling Mislabelled Instances. Journal of Intelligent Information Systems 22(1): 89–109.

Muslea, I.; Minton, S.; and Knoblock, C. A. 2006. Active Learning with Multiple Views. Journal of Artificial Intelligence Research 27: 203–233.

Preston, D.; Brodley, C. E.; Khardon, R.; Sulla-Menashe, D.; and Friedl, M. 2010. Redefining Class Definitions Using Constraint-Based Clustering. In Proceedings of the Sixteenth ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 823–832. New York: Association for Computing Machinery.

Provost, F. 2000. Machine Learning from Imbalanced Data Sets 101. In Learning from Imbalanced Data Sets: Papers from the AAAI Workshop. AAAI Technical Report WS-00-05. Palo Alto, CA: AAAI Press.

Pusara, M., and Brodley, C. E. 2004. User Reauthentication Via Mouse Movements. In Proceedings of the 2004 ACM Workshop on Visualization and Data Mining for Computer Security. New York: Association for Computing Machinery.

Rebbapragada, U. 2010. Strategic Targeting of Outliers for Expert Review. Ph.D. Dissertation, Tufts University, Boston, MA.

Rebbapragada, U., and Brodley, C. E. 2007. Class Noise Mitigation Through Instance Weighting. In Proceedings of the 18th European Conference on Machine Learning (ECML) and the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Lecture Notes in Computer Science Volume 4701, 708–715. Berlin: Springer.

Rebbapragada, U.; Lomasky, R.; Brodley, C. E.; and Friedl, M. 2008a. Generating High-Quality Training Data for Automated Land-Cover Mapping. In Proceedings of the 2008 IEEE International Geoscience and Remote Science Symposium, 546–548. Piscataway, NJ: Institute of Electrical and Electronics Engineers.

Rebbapragada, U.; Mandrake, L.; Wagstaff, K.; Gleeson, D.; Castaño, R.; Chien, S.; and Brodley, C. E. 2009. Improving Onboard Analysis of Hyperion Images by Filtering Mislabeled Training Data Examples. In Proceedings of the 2009 IEEE Aerospace Conference. Piscataway, NJ: Institute of Electrical and Electronics Engineers.

Rebbapragada, U.; Protopapas, P.; Brodley, C. E.; and Alcock, C. 2008b. Finding Anomalous Periodic Time Series: An Application to Catalogs of Periodic Variable Stars. Machine Learning 74(3): 281.

Sackett, D.; Rosenberg, W.; Gray, J.; Haynes, R.; and Richardson, W. 1996. Evidence Based Medicine: What It Is and What It Isn't. BMJ 312(7023): 71.

Settles, B. 2009. Active Learning Literature Survey. Technical Report 1648, Department of Computer Science, University of Wisconsin–Madison, Madison, WI.

Sheng, V. S.; Provost, F.; and Ipeirotis, P. G. 2008. Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 614–622. New York: Association for Computing Machinery.
Shyu, C.; Brodley, C. E.; Kak, A.; Kosaka, A.; Aisen, A.; and Broderick, L. 1999. Assert: A Physician-in-the-Loop Content-Based Image Retrieval System for HRCT Image Databases. Computer Vision and Image Understanding 75(1–2): 111–132.
Stough, T., and Brodley, C. E. 2001. Focusing Attention on Objects of Interest Using Multiple Matched Filters. IEEE Transactions on Image Processing 10(3): 419–426.
Tong, S., and Koller, D. 2000. Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research 2: 999–1006.
Valizadegan, H., and Tan, P.-N. 2007. Kernel-Based Detection of Mislabeled Training Examples. In Proceedings of the Seventh SIAM International Conference on Data Mining, 309–319. Philadelphia, PA: Society for Industrial and Applied Mathematics.
Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Berlin: Springer.
Venkataraman, S.; Metaxas, D.; Fradkin, D.; Kulikowski, C.; and Muchnik, I. 2004. Distinguishing Mislabeled Data from Correctly Labeled Data in Classifier Design. In Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence, 668–672. Piscataway, NJ: Institute of Electrical and Electronics Engineers.
Verbaeten, S. 2002. Identifying Mislabeled Training Examples in ILP Classification Problems. In Proceedings of the Twelfth Belgian-Dutch Conference on Machine Learning. Utrecht, The Netherlands: University of Utrecht, Faculty of Mathematics and Computer Science.
Verbaeten, S., and Assche, A. V. 2003. Ensemble Methods for Noise Elimination in Classification Problems. In Multiple Classifier Systems, 4th International Workshop, Volume 4. Berlin: Springer.
Vickers, A. J., and Elkin, E. B. 2006. Decision Curve Analysis: A Novel Method for Evaluating Prediction Models. Medical Decision Making 26(6): 565–574.
Wagstaff, K.; Kocurek, M.; Mazzoni, D.; and Tang, B. 2010. Progressive Refinement for Support Vector Machines. Data Mining and Knowledge Discovery 20(1): 53–69.
Wallace, B. C.; Small, K.; Brodley, C. E.; and Trikalinos, T. A. 2010. Active Learning for Biomedical Citation Screening. In Proceedings of the Sixteenth ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 173–182. New York: Association for Computing Machinery.
Wallace, B. C.; Small, K.; Brodley, C. E.; and Trikalinos, T. A. 2011. Who Should Label What? Instance Allocation in Multiple Expert Active Learning. In Proceedings of the Eleventh SIAM International Conference on Data Mining (SDM), 176–187. Philadelphia, PA: Society for Industrial and Applied Mathematics.
Zeng, X., and Martinez, T. 2001. An Algorithm for Correcting Mislabeled Data. Intelligent Data Analysis 5(1): 491–502.
Zeng, X., and Martinez, T. R. 2003. A Noise Filtering Method Using Neural Networks. In Proceedings of the 2003 IEEE International Workshop of Soft Computing Techniques in Instrumentation, Measurement, and Related Applications. Piscataway, NJ: Institute of Electrical and Electronics Engineers.
Zhu, X.; Wu, X.; and Chen, S. 2003. Eliminating Class Noise in Large Datasets. In Proceedings of the Twentieth International Conference on Machine Learning, 920–927. Palo Alto, CA: AAAI Press.

Carla E. Brodley is a professor and chair of the Department of Computer Science at Tufts University. She received her PhD in computer science from the University of Massachusetts at Amherst in 1994. From 1994 to 2004, she was on the faculty of the School of Electrical Engineering at Purdue University. She joined the faculty at Tufts in 2004 and became chair in September 2010. Professor Brodley’s research interests include machine learning, knowledge discovery in databases, health IT, and personalized medicine. She has worked in the areas of intrusion detection, anomaly detection, classifier formation, unsupervised learning, and applications of machine learning to remote sensing, computer security, neuroscience, digital libraries, astrophysics, content-based image retrieval of medical images, computational biology, chemistry, evidence-based medicine, and personalized medicine. In 2001 she served as program cochair for ICML and in 2004 as general chair. She is on the editorial boards of JMLR, Machine Learning, and DMKD. She is a member of the AAAI Council and a board member of IMLS and of the CRA Board of Directors.

Umaa Rebbapragada received her B.A. in mathematics from the University of California, Berkeley, and her M.S. and Ph.D. in computer science from Tufts University. She currently works in the Machine Learning and Instrument Autonomy group at the Jet Propulsion Laboratory of the California Institute of Technology. Her research interests are machine learning and data mining, with a specific focus on anomaly detection and supervised learning in the presence of noisy labels and imperfect experts. She has applied her research to problems in astrophysics, land-cover classification, and earthquake damage assessment.

Kevin Small is a research scientist within the Institute for Clinical Research and Health Policy Studies at Tufts Medical Center. He received his Ph.D. in computer science from the University of Illinois at Urbana-Champaign in 2009. His research interests are in machine learning, with an emphasis on scenarios in which a domain expert interacts with the system during learning and on methods that exploit known interdependencies between multiple learned classifiers (such as structured learning). He has applied his research primarily to information extraction and related natural language tasks, with a recent focus on health informatics domains.

Byron Wallace received his B.S. in computer science from the University of Massachusetts, Amherst, and his M.S. in computer science from Tufts University. He is currently a Ph.D. candidate in computer science at Tufts and is also on the research staff of the Institute for Clinical Research and Health Policy Studies at Tufts Medical Center. His research interests are in machine learning, with a focus on its applications in health informatics.
research interests are in machine learning with a focus
Zeng, X., and Martinez, T. R. 2003. A Noise Filtering on its applications in health informatics.
Method Using Neural Networks. In Proceedings of the 2003
IEEE International Workshop of Soft Computing Techniques
