
10 Challenging Problems in Data Mining Research

prepared for ICDM 2005

1. Developing a Unifying Theory of Data Mining

 The current state of the art of data-mining research is too "ad hoc"
 techniques are designed for individual problems
 no unifying theory
 Unifying research is needed
 exploration vs. explanation
 Long-standing theoretical issues
 How to avoid spurious correlations?
 Deep research
 Knowledge discovery of hidden causes?
 Similar to the discovery of Newton's laws?
An example (from tutorial slides by Andrew Moore):

 VC dimension: if you have a learning algorithm in one hand and a dataset in the other, to what extent can you decide whether the learning algorithm is in danger of overfitting or underfitting?
 a formal analysis of the fascinating question of how overfitting can happen
 estimating how well an algorithm will perform on future data, based solely on its training-set error
 a property (the VC dimension) of the learning algorithm; the VC dimension thus gives an alternative to cross-validation, called Structural Risk Minimization (SRM), for choosing classifiers
 CV, SRM, AIC, and BIC
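The idea behind SRM, choosing model complexity by minimizing a penalized training error rather than the raw training error, can be sketched numerically. The sketch below is illustrative only: it uses invented linear data, and Akaike's FPE penalty stands in for a true VC-dimension bound, which is much harder to compute.

```python
import random

def polyfit(xs, ys, d):
    """Least-squares polynomial fit of degree d via the normal equations."""
    p = d + 1
    M = [[sum(x ** (i + j) for x in xs) for j in range(p)] for i in range(p)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(p)]
    for col in range(p):                      # Gaussian elimination with pivoting
        piv = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, p):
            f = M[r][col] / M[col][col]
            for c in range(col, p):
                M[r][c] -= f * M[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * p
    for i in reversed(range(p)):              # back substitution
        coef[i] = (b[i] - sum(M[i][j] * coef[j] for j in range(i + 1, p))) / M[i][i]
    return coef

def train_error(coef, xs, ys):
    """Mean squared error on the training set."""
    return sum((y - sum(c * x ** k for k, c in enumerate(coef))) ** 2
               for x, y in zip(xs, ys)) / len(xs)

random.seed(0)
n = 30
xs = [i / 10 for i in range(n)]
ys = [1.0 + 2.0 * x + random.gauss(0.0, 0.5) for x in xs]   # true model is linear

train_mse, fpe = [], []
for d in range(6):
    m = train_error(polyfit(xs, ys, d), xs, ys)
    train_mse.append(m)
    fpe.append(m * (n + d + 1) / (n - d - 1))   # Akaike's FPE complexity penalty

print("degree picked by raw training error:", min(range(6), key=lambda d: train_mse[d]))
print("degree picked by penalized risk (FPE):", min(range(6), key=lambda d: fpe[d]))
```

Training error can only decrease as the degree grows, so it always favors the most complex model; the penalized criterion, like SRM, trades empirical risk against model capacity.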

2. Scaling Up for High Dimensional Data and High Speed Streams

 Scaling up is needed
 ultra-high dimensional classification problems (millions or billions of
features, e.g., bio data)
 Ultra-high speed data streams
 Streams
 continuous, online process
 e.g. how to monitor network packets for intruders?
 concept drift and environment drift?
 RFID network and sensor network data
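A minimal sketch of stream monitoring under concept drift: a single-pass test that compares each sliding-window mean against a baseline frozen from the start of the stream. The synthetic stream, whose mean shifts partway through, and the threshold are illustrative assumptions, not a production drift detector.

```python
import random
from collections import deque

def drift_monitor(stream, train=100, window=50, threshold=5.0):
    """Estimate a baseline from the first `train` points, then alarm whenever
    a sliding-window mean departs from it by more than `threshold` standard
    errors.  One pass, constant memory: the shape a stream algorithm needs."""
    baseline = stream[:train]
    mu = sum(baseline) / train
    sd = (sum((v - mu) ** 2 for v in baseline) / train) ** 0.5
    se = max(sd / window ** 0.5, 1e-12)       # std. error of a window mean
    recent = deque(maxlen=window)
    alarms = []
    for t, x in enumerate(stream[train:], start=train):
        recent.append(x)
        if len(recent) == window and abs(sum(recent) / window - mu) > threshold * se:
            alarms.append(t)
    return alarms

random.seed(1)
calm = [random.gauss(0.0, 1.0) for _ in range(200)]
shifted = [random.gauss(5.0, 1.0) for _ in range(100)]   # concept drifts at t = 200
alarms = drift_monitor(calm + shifted)
print("first alarm at t =", alarms[0])
```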

3. Sequential and Time Series Data

 How to efficiently and accurately cluster, classify, and predict the trends?
 Time-series data used for predictions are contaminated by noise
 How to make accurate short-term and long-term predictions?
 Signal-processing techniques introduce lags in the filtered data, which reduces accuracy
 Key issues: source selection, domain knowledge in rules, and optimization methods
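The lag introduced by filtering can be seen directly with a trailing moving average, the simplest causal filter; the noise-free ramp input and window size below are illustrative, chosen because they make the lag exact.

```python
def trailing_mean(xs, k):
    """Trailing (causal) k-point moving average: uses only past values,
    as an online prediction filter must."""
    return [sum(xs[i - k + 1:i + 1]) / k for i in range(k - 1, len(xs))]

# On a ramp, the filter output at time t equals the input at time t - (k-1)/2:
# the smoothed series lags the signal by half the window length.
k = 9
ramp = [float(t) for t in range(100)]
smoothed = trailing_mean(ramp, k)          # smoothed[i] covers times i .. i+k-1
lag = (k - 1) // 2
print("output for time 8:", smoothed[0], "equals input at time", 8 - lag)
```

For a forecaster this lag is a direct accuracy loss: the filtered value used for prediction describes the signal half a window ago, not now.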
4. Mining Complex Knowledge from Complex Data

 Mining graphs
 Data that are not i.i.d. (independent and identically distributed)
 many objects are not independent of each other, and are not of a
single type.
 mine the rich structure of relations among objects,
 E.g.: interlinked Web pages, social networks, metabolic networks in
the cell
 Integration of data mining and knowledge inference
 The biggest gap: systems are unable to relate the results of mining to the real-world decisions they affect; all they can do is hand the results back to the user.
 More research on interestingness of knowledge

[Figure: linked bibliographic objects (papers, titles, authors, citations, conference names)]

5. Data Mining in a Network Setting

 Community and Social Networks


 Linked data between emails, Web pages, blogs, citations, sequences
and people
 Static and dynamic structural behavior
 Mining in and for Computer Networks
 detect anomalies (e.g., sudden traffic spikes due to Denial-of-Service (DoS) attacks)
 Need to handle 10-gigabit Ethernet links: (a) detect, (b) trace back, (c) drop packets
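A sketch of the anomaly-detection idea on per-interval packet counts, using an exponentially weighted baseline with a deviation band. The traffic trace, burst, and thresholds are invented for illustration; a real 10-gigabit deployment would need line-rate engineering far beyond this.

```python
def spike_alarm(counts, alpha=0.1, k=5.0):
    """Flag counts that exceed an EWMA baseline by more than k times an EWMA
    of absolute deviations.  Flagged points are not folded back into the
    baseline, so a sustained attack keeps alarming instead of being learned."""
    mean, dev = float(counts[0]), 1.0
    alarms = []
    for t, c in enumerate(counts[1:], start=1):
        if abs(c - mean) > k * dev:
            alarms.append(t)                        # spike detected
        else:
            mean = (1 - alpha) * mean + alpha * c   # update baseline
            dev = (1 - alpha) * dev + alpha * abs(c - mean)
    return alarms

traffic = [100 + (t % 7) for t in range(300)]       # steady background load
traffic[150:160] = [5000] * 10                      # simulated DoS burst
alarms = spike_alarm(traffic)
print("alarm times:", alarms)
```

This covers only the "detect" step; tracing back and dropping packets are separate problems.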

6. Distributed Data Mining and Mining Multi-agent Data

 Need to correlate the data seen at the various probes (such as in a sensor network)
 Adversary data mining: adversaries deliberately manipulate the data to sabotage the mining algorithms (e.g., make them produce false negatives)
 Game theory may be needed for help

[Figure: a matching-pennies game between Player 1 (the miner) and an adversary; each picks heads (H) or tails (T), with payoff outcomes such as (1, -1)]
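The game-theoretic angle can be made concrete with the classic matching-pennies game: against a deliberate adversary, the miner's best strategy is randomized. The grid search below is an illustrative sketch, not an efficient equilibrium solver, and the payoff convention (row wins on a match) is an assumption.

```python
# Matching pennies: the miner (row) wins +1 when the choices match, -1 otherwise.
payoff = {("H", "H"): 1, ("H", "T"): -1, ("T", "H"): -1, ("T", "T"): 1}

def row_value(p_heads, q_heads):
    """Expected payoff to the miner playing H with prob. p against an
    adversary playing H with prob. q."""
    return sum(payoff[(r, c)]
               * (p_heads if r == "H" else 1 - p_heads)
               * (q_heads if c == "H" else 1 - q_heads)
               for r in "HT" for c in "HT")

# Maximin: pick p that maximizes the worst case over the adversary's q.
grid = [i / 100 for i in range(101)]
best_p = max(grid, key=lambda p: min(row_value(p, q) for q in grid))
print("maximin P(H) for the miner:", best_p)
```

Any deterministic mining rule (p of 0 or 1) can be exploited for a guaranteed loss; randomizing evenly neutralizes the adversary, which is the intuition behind bringing game theory into adversarial data mining.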

7. Data Mining for Biological and Environmental Problems

 New problems raise new questions


 Large scale problems especially so
 Biological data mining, such as HIV vaccine design
 DNA, chemical properties, 3D structures, and functional properties need to be fused
 Environmental data mining
 Mining for solving the energy crisis

8. Data-mining-Process Related Problems

 How to automate the mining process?


 the composition of data mining operations
 Data cleaning, with logging capabilities
 Visualization and mining automation

 Need a methodology to help users avoid common data-mining mistakes


 What is a canonical set of data mining operations?

Sampling → Feature Selection → Mining → …
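Composing a canonical set of operations can be sketched as a pipeline of dataset-to-dataset functions, which is the shape an automated mining process needs. The three operations and the toy churn dataset below are illustrative assumptions.

```python
import random

def sample(rows, n, seed=0):
    """Sampling: draw a reproducible random subset."""
    random.seed(seed)
    return random.sample(rows, min(n, len(rows)))

def select_features(rows, keep):
    """Feature selection: project each record onto the kept attributes."""
    return [{k: r[k] for k in keep} for r in rows]

def mine(rows, target):
    """A toy 'mining' step: the majority value of the target attribute."""
    counts = {}
    for r in rows:
        counts[r[target]] = counts.get(r[target], 0) + 1
    return max(counts, key=counts.get)

def run_pipeline(rows, steps):
    """Compose operations left to right; each step feeds the next."""
    out = rows
    for step in steps:
        out = step(out)
    return out

data = [{"age": 20 + i % 40, "income": 1000 * (i % 5), "churn": i % 3 == 0}
        for i in range(1000)]
result = run_pipeline(data, [
    lambda d: sample(d, 200),
    lambda d: select_features(d, ["age", "churn"]),
    lambda d: mine(d, "churn"),
])
print("majority churn value in the sample:", result)
```

With a uniform step interface, logging and cleaning operations slot into the same chain, which is what automation of the process requires.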

9. Security, Privacy and Data Integrity

 How to ensure users' privacy while their data are being mined?
 How to do data mining for the protection of security and privacy?
 Knowledge integrity assessment
 Data are intentionally modified from their original version, in order to misinform the recipients or for privacy and security reasons
 Development of measures to evaluate the knowledge integrity of a collection of data, and of the knowledge and patterns extracted from it

10. Dealing with Non-static, Unbalanced and Cost-sensitive Data

 The UCI datasets are small and not highly unbalanced
 Real-world data are large (e.g., 10^5 features), and the useful (positive) class often accounts for less than 1% of the examples
 There is much information on costs and benefits, but no overall model of profit and loss
 Data may evolve, with a bias introduced by sampling
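The mismatch between accuracy and profit on unbalanced data can be illustrated with synthetic classifier scores; the 1% positive rate, the score distributions, and the cost figures (50 units per caught positive, 1 per false alarm) are all invented assumptions.

```python
import random

random.seed(2)
# Scores for 9,900 negatives and 100 rare positives; positives score higher.
data = ([(random.gauss(0.2, 0.1), 0) for _ in range(9900)] +
        [(random.gauss(0.7, 0.1), 1) for _ in range(100)])

def accuracy(threshold):
    """Fraction of examples classified correctly at this score threshold."""
    return sum((s >= threshold) == bool(y) for s, y in data) / len(data)

def profit(threshold, gain_tp=50.0, cost_fp=1.0):
    """Assumed economics: each caught positive is worth 50, each false alarm -1."""
    return sum(gain_tp if y else -cost_fp for s, y in data if s >= threshold)

print("accuracy of 'predict nobody positive':", accuracy(10.0))
best = max([i / 100 for i in range(101)], key=profit)
print("profit-maximising threshold:", best, "profit:", round(profit(best), 1))
```

Predicting nobody positive already scores 99% accuracy yet earns nothing, while a profit-driven threshold accepts thousands of "errors" on the majority class to capture the rare positives.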
1.5 Data mining problems/issues

Data mining systems rely on databases to supply the raw data for input, and this raises problems in that databases tend to be dynamic, incomplete, noisy, and large. Other problems arise from the adequacy and relevance of the stored information.

1.5.1 Limited Information

A database is often designed for purposes other than data mining, and sometimes the properties or attributes that would simplify the learning task are neither present nor obtainable from the real world. Inconclusive data cause problems: if some attributes essential to knowledge about the application domain are absent from the data, it may be impossible to discover significant knowledge about that domain. For example, one cannot diagnose malaria from a patient database if the database does not contain the patients' red blood cell counts.

1.5.2 Noise and missing values

Databases are usually contaminated by errors, so it cannot be assumed that the data they contain are entirely correct. Attributes that rely on subjective or measurement judgements can give rise to errors, such that some examples may even be misclassified. Errors in either the values of attributes or the class information are known as noise. Where possible, it is desirable to eliminate noise from the classification information, as it affects the overall accuracy of the generated rules.

Missing data can be treated by discovery systems in a number of ways, such as:

• simply disregard missing values


• omit the corresponding records
• infer missing values from known values
• treat missing data as a special value to be included additionally in the attribute
domain
• or average over the missing values using Bayesian techniques.
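Three of the treatments listed above can be sketched directly; rows are dicts with `None` marking a missing value, and the patient records are invented for illustration.

```python
def handle_missing(rows, attr, strategy):
    """Apply one missing-value treatment to attribute `attr`."""
    present = [r[attr] for r in rows if r[attr] is not None]
    if strategy == "omit":                    # drop incomplete records
        return [r for r in rows if r[attr] is not None]
    if strategy == "mean":                    # infer from known values
        fill = sum(present) / len(present)
        return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]
    if strategy == "special":                 # extend the attribute domain
        return [{**r, attr: "MISSING" if r[attr] is None else r[attr]} for r in rows]
    raise ValueError(strategy)

patients = [{"id": 1, "rbc": 4.7}, {"id": 2, "rbc": None}, {"id": 3, "rbc": 5.3}]
print("omit:", handle_missing(patients, "rbc", "omit"))
print("mean-filled value:", handle_missing(patients, "rbc", "mean")[1]["rbc"])
```

Which treatment is appropriate depends on whether the fact that a value is missing is itself informative; the "special value" strategy preserves that information, the others discard it.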
Noisy data, in the sense of imprecision, are characteristic of all data collection and typically fit a regular statistical distribution such as the Gaussian, whereas wrong values are data-entry errors. Statistical methods can treat problems of noisy data and separate out the different types of noise.
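Separating ordinary Gaussian measurement noise from data-entry errors can be sketched with a robust scale estimate (median and MAD), which the wrong values cannot distort the way they distort a mean and standard deviation. The temperature readings, the two typos, and the 5-sigma cutoff are illustrative assumptions.

```python
import random

def split_noise(values, z=5.0):
    """Keep values within z robust standard deviations of the median as
    ordinary Gaussian noise; flag the rest as likely entry errors."""
    med = sorted(values)[len(values) // 2]
    mad = sorted(abs(v - med) for v in values)[len(values) // 2]
    scale = 1.4826 * max(mad, 1e-12)    # MAD -> std. dev. for Gaussian data
    ok = [v for v in values if abs(v - med) <= z * scale]
    bad = [v for v in values if abs(v - med) > z * scale]
    return ok, bad

random.seed(3)
readings = [random.gauss(37.0, 0.3) for _ in range(500)] + [370.0, 3.7]  # two typos
ok, bad = split_noise(readings)
print("flagged as entry errors:", sorted(bad))
```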

1.5.3 Uncertainty

Uncertainty refers to the severity of the error and the degree of noise in the data. Data
precision is an important consideration in a discovery system.

1.5.4 Size, updates, and irrelevant fields

Databases tend to be large and dynamic, in that their contents change as information is added, modified, or removed. From the data mining perspective, the problem is how to ensure that the rules are up to date and consistent with the most current information. The learning system also has to be time-sensitive, since some data values vary over time and the discovery system is affected by the timeliness of the data.

Another issue is the relevance or irrelevance of the fields in the database to the current focus of discovery. For example, post codes are fundamental to any study trying to establish a geographical connection to an item of interest, such as the sales of a product.

Workshop Description
Motivation
Early work in predictive data mining did not address the complex circumstances in which models are built and
applied. It was assumed that a fixed amount of training data were available and only simple objectives, namely
predictive accuracy, were considered. Over time, it became clear that these assumptions were unrealistic and that
the economic utility of acquiring training data, building a model, and applying the model had to be considered.
The machine learning and data mining communities responded with research on active learning, which focused on
methods for cost-effective acquisition of information for the training data, and research on cost-sensitive learning,
which considered the costs and benefits associated with using the learned knowledge and how these costs and
benefits should be factored into the data mining process.

All the different stages of the data mining process are affected by economic utility. In the data acquisition phase
we have to consider the costs of obtaining training data, such as the cost of labelling additional examples or
acquiring new feature values. In applying the data mining algorithm, we have to consider the running time of the
algorithm and the costs and benefits associated with cleaning the data, transforming the data and constructing
new features. Economic utility also impacts the assessment of the decisions made based on the learned
knowledge. Simple assessment measures like predictive accuracy have given way to more complex economic
measures, including measures of profitability. These considerations can in turn impact policies for model induction.
The latter topic has received more attention in the context of cost-sensitive learning.
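The shift from predictive accuracy to economic measures described above can be illustrated with a cost matrix over outcomes. The two classifiers, the class proportions, and the cost figures (a missed bad case costs 500, a needless review 10) are invented assumptions; the point is that equal accuracies can hide very different utilities.

```python
# cost[(actual, predicted)]: assumed economics of each outcome.
cost = {("good", "good"): 0.0, ("good", "bad"): 10.0,
        ("bad", "good"): 500.0, ("bad", "bad"): 0.0}

def total_cost(actual, predicted):
    """Sum the economic cost of every decision."""
    return sum(cost[(a, p)] for a, p in zip(actual, predicted))

def accuracy(actual, predicted):
    """Fraction of decisions that match the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

actual = ["good"] * 95 + ["bad"] * 5
lazy = ["good"] * 100                      # ignores the rare 'bad' class
careful = ["good"] * 90 + ["bad"] * 10     # 5 false alarms, zero misses

for name, pred in [("lazy", lazy), ("careful", careful)]:
    print(name, "accuracy:", accuracy(actual, pred),
          "cost:", total_cost(actual, pred))
```

Both classifiers reach 95% accuracy, but the lazy one incurs fifty times the cost; only the economic measure distinguishes them.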

Goals
Almost all work that considers the impact of economic utility on data mining focuses exclusively on one of the
stages in the data mining process. Thus, economic factors have been studied in isolation, without much attention
to how they interact. This workshop will begin to remedy this deficiency by bringing together researchers who
currently consider different economic aspects in data mining, and by promoting an examination of the impact of
economic utility throughout the entire data mining process. This workshop will attempt to encourage the field to
go beyond what has been accomplished individually in the areas of active learning and cost-sensitive learning
(although both of these areas are within the scope of this workshop). In addition, existing research which has
addressed the role of economic utility in data mining has focused on predictive data mining tasks. This workshop
will begin to explore methods for incorporating economic utility considerations into both predictive and descriptive
data mining tasks.

This workshop will be geared toward researchers with an interest in how economic factors affect data mining
(e.g., researchers in cost-sensitive learning and evaluation and active learning) and practitioners who have real-
world experience with how these factors influence data mining. Attendance is not limited to the paper authors and
we strongly encourage interested researchers from related areas to attend the workshop. This will be a full-day
workshop and will include invited talks, paper presentations, short position statements and two panel discussions.
Workshop Topics
• Types of economic factors in data mining
o What economic factors arise in the context of data mining and to what stage of the data mining
process do they apply?
o What assessment metrics are used in response to these economic factors?
o Can the use of economic utility help address previously studied problems in data mining, such as
the problems of learning rare classes and learning from skewed distributions?
• Algorithms
o Utility-based approaches for information acquisition, data preprocessing, mining and knowledge
application. This includes work in active learning/sampling and cost-sensitive learning.
o This workshop will also address how predictive and descriptive data mining tasks such as
predictive modeling, clustering and link analysis can be adapted to incorporate economic utility.
• Consideration of economic utility throughout the data mining process
o Work towards a comprehensive framework for incorporating economic utility to benefit the entire
data mining process. This work includes utility-based data mining techniques which take into
account the dependencies between different phases of the data mining process to maximize the
utility of more than a single phase. For example, methods for acquiring training data which take
into account the costs of errors in addition to the cost of training data; or methods for the
extraction of predictive patterns which take into account the cost of test features necessary at
prediction time.
• Applications
o What existing data mining applications have taken economic utility into account?
o What methods do these applications use to take economic utility into consideration?

o How do economic utility and the methods for dealing with it vary according to the specific
problem addressed (e.g., by industry)?
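As a concrete instance of cost-effective information acquisition, the sketch below implements pool-based uncertainty sampling on a one-dimensional concept: each label purchase goes to the point the current model is least sure about. The pool, the hidden labelling rule, the seeding with the two extreme points, and the midpoint classifier are all illustrative assumptions.

```python
import random

def fit_threshold(labeled):
    """1-D classifier: the midpoint between the two class means."""
    pos = [x for x, y in labeled if y]
    neg = [x for x, y in labeled if not y]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def uncertainty_sampling(pool, oracle, budget):
    # Seed with the extreme points so both classes are represented
    # (assumes a monotone 1-D concept, an illustrative simplification).
    lo, hi = min(pool), max(pool)
    labeled = [(lo, oracle(lo)), (hi, oracle(hi))]
    unlabeled = [x for x in pool if x != lo and x != hi]
    for _ in range(budget):
        t = fit_threshold(labeled)
        x = min(unlabeled, key=lambda v: abs(v - t))   # most uncertain point
        unlabeled.remove(x)
        labeled.append((x, oracle(x)))                 # pay for exactly one label
    return fit_threshold(labeled)

random.seed(4)
pool = [random.uniform(0.0, 1.0) for _ in range(1000)]
oracle = lambda x: x >= 0.42          # hidden labelling rule (assumed)
t = uncertainty_sampling(pool, oracle, budget=20)
print("threshold learned from 22 labels:", round(t, 3))
```

The queries cluster around the decision boundary, so a good model emerges from a handful of purchased labels instead of labelling the whole pool, which is the economic argument for active learning.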
