
Data Mining and Knowledge

Discovery
Introduction
Data is produced at a phenomenal rate
Our ability to store has grown
Users expect more sophisticated information
How?

UNCOVER HIDDEN INFORMATION
DATA MINING
Why Data Mining?
The Explosive Growth of Data
Data collection and data availability
Automated data collection tools, database systems, Web, computerized
society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks,
Science: Remote sensing, bioinformatics, scientific simulation,
Society and everyone: news, digital cameras,
We are drowning in data, but starving for knowledge!
Necessity is the mother of invention: data mining is the automated analysis of massive
data sets
Why Data Mining
Credit ratings/targeted marketing:
Given a database of 100,000 names, which persons are the least likely
to default on their credit cards?
Identify likely responders to sales promotions
Fraud detection
Which types of transactions are likely to be fraudulent, given the
demographics and transactional history of a particular customer?
Customer relationship management:
Which of my customers are likely to be the most loyal, and which are
most likely to leave for a competitor?

Data Mining helps extract such
information
Data mining
Process of semi-automatically analyzing large
databases to find patterns that are:
valid: hold on new data with some certainty
novel: non-obvious to the system
useful: should be possible to act on the item
understandable: humans should be able to
interpret the pattern
Also known as Knowledge Discovery in
Databases (KDD)

Why Mine Data? Commercial Viewpoint
Lots of data is being collected
and warehoused
Web data, e-commerce
purchases at department/
grocery stores
Bank/Credit Card
transactions

Computers have become cheaper and more powerful
Competitive Pressure is Strong
Provide better, customized services for an edge (e.g. in Customer
Relationship Management)
Why Mine Data? Scientific Viewpoint
Data collected and stored at
enormous speeds (GB/hour)
remote sensors on a satellite
telescopes scanning the skies
microarrays generating gene
expression data
scientific simulations
generating terabytes of data
Traditional techniques infeasible for raw data
Data mining may help scientists
in classifying and segmenting data
in Hypothesis Formation
What Is Data Mining?
Data mining (knowledge discovery in databases):
Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) information or patterns from
data in large databases

Alternative names and their inside stories:
Data mining: a misnomer?
Knowledge discovery(mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, business
intelligence, etc.
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amount of data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information
harvesting, business intelligence, etc.
Watch out: Is everything data mining?
Simple search and query processing
(Deductive) expert systems
Data Mining
The non-trivial extraction of novel, implicit, and actionable knowledge
from large datasets.
Extremely large datasets
Discovery of the non-obvious
Useful knowledge that can improve processes
Cannot be done manually
Technology to enable data exploration, data analysis, and data
visualization of very large databases at a high level of abstraction,
without a specific hypothesis in mind.
Sophisticated data search capability that uses statistical algorithms to
discover patterns and correlations in data.
Data Mining (cont.)
Data Mining is a step of Knowledge Discovery in
Databases (KDD) Process
Data Warehousing
Data Selection
Data Preprocessing
Data Transformation
Data Mining
Interpretation/Evaluation
Data Mining is sometimes itself referred to as KDD; the terms DM
and KDD tend to be used as synonyms
Major Issues in Data Warehousing and
Mining
Mining methodology and user interaction
Mining different kinds of knowledge in databases
Interactive mining of knowledge at multiple levels of abstraction
Incorporation of background knowledge
Data mining query languages and ad-hoc data mining
Expression and visualization of data mining results
Handling noise and incomplete data
Pattern evaluation: the interestingness problem
Performance and scalability
Efficiency and scalability of data mining algorithms
Parallel, distributed and incremental mining methods

Major Issues in Data Warehousing and
Mining
Issues relating to the diversity of data types
Handling relational and complex types of data
Mining information from heterogeneous databases and global
information systems (WWW)
Issues related to applications and social impacts
Application of discovered knowledge
Domain-specific data mining tools
Intelligent query answering
Process control and decision making
Integration of the discovered knowledge with existing knowledge: A
knowledge fusion problem
Protection of data security, integrity, and privacy

Examples: What is (not) Data Mining?
What is not Data
Mining?
Look up phone
number in phone
directory

Query a Web
search engine for
information about
Amazon
What is Data Mining?

Certain names are more
prevalent in certain US locations
(O'Brien, O'Rourke, O'Reilly in
Boston area)
Group together similar
documents returned by search
engine according to their context
(e.g. Amazon rainforest,
Amazon.com)
Extraction of Knowledge from
Data
4 Phases of Data Mining
Data Preparation
Identify the main data sets to be used by the data
mining operation (usually the data warehouse)
Data Analysis and Classification
Study the data to identify common data
characteristics or patterns
Data groupings, classifications, clusters, sequences
Data dependencies, links, or relationships
Data patterns, trends, deviation
4 Phases of Data Mining
Knowledge Acquisition
Uses the Results of the Data Analysis and Classification phase
Data mining tool selects the appropriate modeling or knowledge-
acquisition algorithms
Neural Networks
Decision Trees
Rules Induction
Genetic algorithms
Memory-Based Reasoning
Prognosis
Predict Future Behavior
Forecast Business Outcomes
65% of customers who did not use a particular credit card in the last 6
months are 88% likely to cancel the account.
3 Steps Data Mining Process
Stage 1: Exploration. This stage usually starts with data
preparation which may involve cleaning data, data
transformations, selecting subsets of records
Stage 2: Model building and validation. This stage involves
considering various models and choosing the best one based
on their predictive performance
Stage 3: Deployment. That final stage involves using the
model selected as best in the previous stage and applying it to
new data in order to generate predictions or estimates of the
expected outcome
Some of the tools used for data
mining are:
Artificial neural networks - Non-linear predictive models that
learn through training and resemble biological neural
networks in structure.
Decision trees - Tree-shaped structures that represent sets of
decisions. These decisions generate rules for the classification
of a dataset.
Rule induction - The extraction of useful if-then rules from
data based on statistical significance.
Genetic algorithms - Optimization techniques based on the
concepts of genetic combination, mutation, and natural
selection.
Nearest neighbor - A classification technique that classifies
each record based on the records most similar to it in an
historical database.
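As a minimal sketch of the nearest-neighbor idea described above: each new record is labeled by majority vote among the most similar records in a historical database. The record layout, the use of Euclidean distance as the similarity measure, the value k = 3, and the sample data are all assumptions made for illustration.

```python
from collections import Counter
import math

def knn_classify(history, query, k=3):
    """Classify `query` by majority vote among its k most similar historical records.

    `history` is a list of (features, label) pairs. Similarity is taken to be
    Euclidean distance, which is an assumption made for this sketch.
    """
    nearest = sorted(history, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical historical database: (age, income in $1000s) -> outcome.
history = [
    ((25, 30), "lapsed"),
    ((27, 32), "lapsed"),
    ((30, 35), "lapsed"),
    ((50, 90), "renewed"),
    ((55, 95), "renewed"),
    ((60, 100), "renewed"),
]

print(knn_classify(history, (28, 33)))
print(knn_classify(history, (58, 97)))
```

In practice the features would usually be scaled first, since an unscaled attribute with a large range dominates the distance.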

Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structure data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia database
Text databases
The World-Wide Web
Basic Data Mining Tasks
(contd)
Summarization maps data into subsets with
associated simple descriptions.
Characterization
Generalization
Link Analysis uncovers relationships among data.
Affinity Analysis
Association Rules
Sequential Analysis determines sequential patterns.
Data Mining and Business Intelligence
[Figure: pyramid of layers; potential to support business decisions increases toward the top]
Layers, top to bottom:
Making Decisions
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: OLAP, MDA; Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA
Data Mining: On What Kind of Data?
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
Object-oriented and object-relational databases
Spatial databases
Time-series data and temporal data
Text databases and multimedia databases
Heterogeneous and legacy databases
WWW
Data Mining: Confluence of Multiple
Disciplines
[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Information Science, Machine Learning, Visualization, and Other Disciplines]
Data Mining vs. Statistical Analysis
Statistical Analysis:
Ill-suited for Nominal and Structured Data Types
Completely data driven - incorporation of domain knowledge not possible
Interpretation of results is difficult and daunting
Requires expert user guidance

Data Mining:
Large Data sets
Efficiency of Algorithms is important
Scalability of Algorithms is important
Real World Data
Lots of Missing Values
Pre-existing data - not user generated
Data not static - prone to updates
Efficient methods for data retrieval available for use
Data Mining vs. DBMS
Example DBMS Reports
Last month's sales for each service type
Sales per service grouped by customer sex or age bracket
List of customers who lapsed their policy

Questions answered using Data Mining
What characteristics do customers that lapse their policy
have in common and how do they differ from customers
who renew their policy?
Which motor insurance policy holders would be potential
customers for my House Content Insurance policy?

Data Mining and Data Warehousing
Data Warehouse: a centralized data repository which can be
queried for business benefit.
Data Warehousing makes it possible to
extract archived operational data
overcome inconsistencies between different legacy data formats
integrate data throughout an enterprise, regardless of location,
format, or communication requirements
incorporate additional or expert information
OLAP: On-line Analytical Processing
Multi-Dimensional Data Model (Data Cube)
Operations:
Roll-up
Drill-down
Slice and dice
Rotate
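The OLAP operations listed above can be sketched on a tiny data cube. This is only an illustration: the dimension names, the sales figures, and the helper functions `roll_up` and `slice_cube` are hypothetical, and a real OLAP engine would work on precomputed aggregates rather than a Python dictionary.

```python
from collections import defaultdict

# Hypothetical fact table: (region, product, quarter) -> sales amount.
cube = {
    ("East", "Books", "Q1"): 100,
    ("East", "Books", "Q2"): 120,
    ("East", "Music", "Q1"): 80,
    ("West", "Books", "Q1"): 90,
    ("West", "Music", "Q2"): 60,
}
DIMS = ("region", "product", "quarter")

def roll_up(cube, keep):
    """Aggregate away every dimension not listed in `keep`, summing the measure."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in cube.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)

def slice_cube(cube, dim, value):
    """Fix one dimension to a single value and keep only the matching cells."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}

by_region = roll_up(cube, ["region"])           # roll-up to region totals
q1_only = slice_cube(cube, "quarter", "Q1")     # slice on quarter = Q1
q1_by_product = roll_up(q1_only, ["product"])   # then summarize the slice
print(by_region)
print(q1_by_product)
```

Drill-down is the reverse direction: calling `roll_up(cube, ["region", "product"])` instead of `roll_up(cube, ["region"])` returns the finer-grained cells behind each regional total.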
Major Issues in Data Mining
Mining methodology
Mining different kinds of knowledge from diverse data types, e.g., bio, stream,
Web
Performance: efficiency, effectiveness, and scalability
Pattern evaluation: the interestingness problem
Incorporation of background knowledge
Handling noise and incomplete data
Parallel, distributed and incremental mining methods
Integration of the discovered knowledge with existing one: knowledge fusion
User interaction
Data mining query languages and ad-hoc mining
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
Domain-specific data mining & invisible data mining
Protection of data security, integrity, and privacy
What makes data mining
possible?
Advances in the following areas are making
data mining deployable:
data warehousing
better and more data (i.e., operational,
behavioral, and demographic)
the emergence of easily deployed data mining
tools and
the advent of new data mining techniques.
-- Gartner Group
Data Mining Motivation
Changes in the Business Environment
Customers becoming more demanding
Markets are saturated
Databases today are huge:
More than 1,000,000 entities/records/rows
From 10 to 10,000 fields/attributes/variables
Gigabytes and terabytes
Databases are growing at an unprecedented rate
Decisions must be made rapidly
Decisions must be made with maximum knowledge
ADVANTAGES OF DATA
MINING
Marketing/Retailing: Data mining can aid direct
marketers by providing them with useful and
accurate trends about their customers'
purchasing behavior.
Banking/Crediting: Data mining can assist
financial institutions in areas such as credit
reporting and loan information.
ADVANTAGES OF DATA
MINING Cont
Law enforcement: Data mining can aid law enforcers
in identifying criminal suspects as well as
apprehending these criminals by examining trends in
location, crime type, habit, and other patterns of
behaviors.
Researchers: Data mining can assist researchers by
speeding up their data analyzing process; thus,
allowing them more time to work on other
projects.
DISADVANTAGES OF DATA
MINING
Privacy Issues: For example, according to
the Washington Post, in 1998 CVS sold its
patients' prescription purchases to a different
company.
American Express also sold its customers'
credit card purchases to another company.
DISADVANTAGES OF DATA
MINING Cont
Security issues: Although companies have a lot of personal
information about us available online, they do not have
sufficient security systems in place to protect that
information.
Misuse of information: Some companies will answer
your phone call based on your purchase history. If you have spent
a lot of money or bought
a lot of products from one company, your call will be answered
very quickly. So you should not assume that your call is
being answered in the order in which it was received.
The key in business is to know something that nobody
else knows.
Aristotle Onassis





To understand is to perceive patterns.
Sir Isaiah Berlin
Data Mining Motivation

Vertical integration:
Mining on the web
Web log analysis for site design:
what are popular pages,
what links are hard to find.
Electronic stores sales enhancements:
recommendations, advertisement:
Collaborative filtering: Net Perceptions, Wisewire
Inventory control: what was a shopper looking for
and could not find.

Data Mining
The process of extracting valid, previously
unknown, comprehensible, and actionable
information from large databases and
using it to make crucial business decisions
(Simoudis, 1996).

Involves analysis of data and use of
software techniques for finding hidden
and unexpected patterns and relationships
in sets of data.

Data Mining
Reveals information that is hidden and
unexpected, as there is little value in finding patterns
and relationships that are already intuitive.

Patterns and relationships are identified by
examining the underlying rules and features in
the data.

Tends to work from the data up, and
accurate results normally require large
volumes of data to deliver reliable
conclusions.

Data Mining
Starts by developing an optimal
representation of structure of sample data,
during which time knowledge is acquired
and extended to larger sets of data.

Data mining can provide huge paybacks for
companies who have made a significant
investment in data warehousing.

Relatively new technology, but
already used in a number of industries.

Data Mining Operations
Four main operations include:
Predictive modeling.
Database segmentation.
Link analysis.
Deviation detection.

There are recognized associations between
the applications and the corresponding
operations.
e.g. Direct marketing strategies use database
segmentation.

Data Mining and Data
Warehousing
A major challenge in exploiting data mining is
identifying suitable data to mine.

Data mining requires single, separate,
clean, integrated, and self-consistent
source of data.

Data Mining and Data
Warehousing
A data warehouse is well equipped for
providing data for mining.

Data quality and consistency is a
prerequisite for mining to ensure the
accuracy of the predictive models. Data
warehouses are populated with clean,
consistent data.
Data Mining and Data
Warehousing
Advantageous to mine data from multiple
sources to discover as many
interrelationships as possible. Data
warehouses contain data from a number
of sources.

Selecting relevant subsets of records and
fields for data mining requires query
capabilities of the data warehouse.
Data Mining and Data
Warehousing
Results of a data mining study are useful if
there is some way to further investigate
the uncovered patterns. Data warehouses
provide capability to go back to the data
source.
Ex: Time Series Analysis
Example: Stock Market
Predict future values
Determine similar patterns over time
Classify behavior
Data Mining vs. KDD
Knowledge Discovery in Databases (KDD):
process of finding useful information and
patterns in data.
Data Mining: Use of algorithms to extract the
information and patterns derived by the KDD
process.
Knowledge Discovery Process
Data mining: the core of the
knowledge discovery
process.
[Figure labels: Databases; Data Cleaning; Data Integration; Preprocessed Data; Selection; Data transformations; Task-relevant Data; Data Mining; Knowledge Interpretation]
KDD Process Ex: Web Log
Selection:
Select log data (dates and locations) to use
Preprocessing:
Remove identifying URLs
Remove error logs
Transformation:
Sessionize (sort and group)
Data Mining:
Identify and count patterns
Construct data structure
Interpretation/Evaluation:
Identify and display frequently accessed sequences.
Potential User Applications:
Cache prediction
Personalization
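The web log steps above can be sketched end to end. This is a minimal illustration under stated assumptions: the log record layout, the 30-minute inactivity gap used to split sessions, and the choice of consecutive page pairs as the "pattern" to count are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical log records: (user, timestamp in seconds, url, HTTP status).
log = [
    ("u1", 0, "/home", 200),
    ("u1", 30, "/products", 200),
    ("u1", 60, "/cart", 200),
    ("u2", 10, "/home", 200),
    ("u2", 40, "/products", 200),
    ("u2", 45, "/missing", 404),
    ("u1", 5000, "/home", 200),
    ("u1", 5030, "/products", 200),
]

SESSION_GAP = 1800  # assumed: 30 minutes of inactivity starts a new session

def sessionize(records):
    """Preprocess (drop error entries), then sort and group hits into sessions."""
    by_user = defaultdict(list)
    for user, ts, url, status in records:
        if status < 400:  # preprocessing: remove error log entries
            by_user[user].append((ts, url))
    sessions = []
    for hits in by_user.values():
        hits.sort()  # transformation: order each user's hits by time
        current, last_ts = [], None
        for ts, url in hits:
            if last_ts is not None and ts - last_ts > SESSION_GAP:
                sessions.append(current)
                current = []
            current.append(url)
            last_ts = ts
        sessions.append(current)
    return sessions

def count_pairs(sessions):
    """Data mining step: count consecutive page pairs across all sessions."""
    counts = Counter()
    for pages in sessions:
        counts.update(zip(pages, pages[1:]))
    return counts

sessions = sessionize(log)
pairs = count_pairs(sessions)
print(pairs.most_common(1))  # interpretation: most frequently accessed sequence
```

A cache-prediction application could use the most frequent pairs to prefetch the likely next page.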

Data Mining Development
Similarity Measures
Hierarchical Clustering
IR Systems
Imprecise Queries
Textual Data
Web Search Engines

Bayes Theorem
Regression Analysis
EM Algorithm
K-Means Clustering
Time Series Analysis
Neural Networks
Decision Tree Algorithms
Algorithm Design Techniques
Algorithm Analysis
Data Structures
Relational Data Model
SQL
Association Rule Algorithms
Data Warehousing
Scalability Techniques

HIGH PERFORMANCE
DATA MINING
KDD Issues
Human Interaction
Overfitting
Outliers
Interpretation
Visualization
Large Datasets
High Dimensionality
KDD Issues (contd)
Multimedia Data
Missing Data
Irrelevant Data
Noisy Data
Changing Data
Integration
Application
Social Implications of DM
Privacy
Profiling
Unauthorized use

Data Mining Metrics
Usefulness
Return on Investment (ROI)
Accuracy
Space/Time
Database Perspective on Data
Mining
Scalability
Real World Data
Updates
Ease of Use
Information Retrieval
Information Retrieval (IR): retrieving desired information
from textual data.
Library Science
Digital Libraries
Web Search Engines
Traditionally keyword based
Sample query:
Find all documents about data mining.

DM: Similarity measures;
Mine text/Web data.

Information Retrieval (contd)
Similarity: measure of how close a query is
to a document.
Documents which are close enough are
retrieved.
Metrics:
$\text{Precision} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Retrieved}|}$
$\text{Recall} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Relevant}|}$
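A minimal sketch of the two metrics above, computed over sets of document identifiers. The document ids are made up, and returning 0.0 when a set is empty is a convention assumed here, not something the slides specify.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two sets of document ids."""
    hits = len(relevant & retrieved)  # |Relevant and Retrieved|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: 3 documents retrieved, 2 of them relevant,
# out of 4 relevant documents in the collection.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}
p, r = precision_recall(relevant, retrieved)
print(p, r)
```

Here precision is 2/3 and recall is 2/4: retrieving more documents can raise recall but tends to lower precision.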
IR Query Result Measures and
Classification
IR Classification
The KDD process
Problem formulation
Data collection
subset data: sampling might hurt if highly skewed data
feature selection: principal component analysis, heuristic search
Pre-processing: cleaning
name/address cleaning, different meanings (annual, yearly),
duplicate removal, supplying missing values
Transformation:
map complex objects e.g. time series data to features e.g. frequency
Choosing mining task and mining method:
Result evaluation and Visualization:
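The pre-processing items above (name cleaning, reconciling different terms with the same meaning, duplicate removal, supplying missing values) can be sketched as follows. The field names, the synonym table, and filling missing values with the mean are assumptions chosen for illustration. Other imputation strategies are equally possible.

```python
# Hypothetical raw records; None marks a missing value.
raw = [
    {"name": " Alice Smith ", "billing": "annual", "spend": 120.0},
    {"name": "alice smith", "billing": "yearly", "spend": 120.0},
    {"name": "Bob Jones", "billing": "monthly", "spend": None},
    {"name": "Carol Diaz", "billing": "yearly", "spend": 80.0},
]

SYNONYMS = {"yearly": "annual"}  # assumed mapping of equivalent terms

def clean(records):
    """Normalize names and terms, drop duplicates, fill missing spend with the mean."""
    seen, out = set(), []
    for rec in records:
        name = " ".join(rec["name"].split()).title()  # trim and fix casing
        billing = SYNONYMS.get(rec["billing"], rec["billing"])
        key = (name, billing)
        if key in seen:  # duplicate removal
            continue
        seen.add(key)
        out.append({"name": name, "billing": billing, "spend": rec["spend"]})
    known = [r["spend"] for r in out if r["spend"] is not None]
    mean = sum(known) / len(known) if known else 0.0
    for r in out:
        if r["spend"] is None:  # supply missing values
            r["spend"] = mean
    return out

cleaned = clean(raw)
for row in cleaned:
    print(row)
```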

Knowledge discovery is an iterative process
Relationship with other fields
Overlaps with machine learning, statistics, artificial
intelligence, databases, visualization but more stress
on
scalability of number of features and instances
stress on algorithms and architectures whereas
foundations of methods and formulations provided by
statistics and machine learning.
automation for handling large, heterogeneous data

OLAP Mining integration
OLAP (On Line Analytical Processing)
Fast interactive exploration of multidim.
aggregates.
Heavy reliance on manual operations for analysis:
Tedious and error-prone on large multidimensional
data
Ideal platform for vertical integration of mining but
needs to be interactive instead of batch.
State of art in mining OLAP integration
Decision trees [Information discovery, Cognos]
find factors influencing high profits
Clustering [Pilot software]
segment customers to define hierarchy on that dimension
Time series analysis: [Seagate's Holos]
Query for various shapes along time: eg. spikes, outliers
Multi-level Associations [Han et al.]
find association between members of dimensions
Sarawagi [VLDB2000]
Trends leading to Data Flood
More data is generated:
Bank, telecom, other
business transactions ...
Scientific Data: astronomy,
biology, etc
Web, text, and e-commerce
More data is captured:
Storage technology faster
and cheaper
DBMS capable of handling
bigger DB
Examples
Europe's Very Long Baseline Interferometry
(VLBI) has 16 telescopes, each of which
produces 1 Gigabit/second of astronomical
data over a 25-day observation session
storage and analysis a big problem
Walmart reported to have 24 Tera-byte DB
AT&T handles billions of calls per day
data cannot be stored -- analysis is done on the fly

Growth Trends
Moore's law
Computer Speed doubles every 18
months
Storage law
total storage doubles every 9 months
Consequence
very little data will ever be looked at
by a human
Knowledge Discovery is NEEDED to
make sense and use of data.
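The consequence follows from simple arithmetic on the two doubling periods quoted above. The three-year horizon in this sketch is an arbitrary choice for illustration.

```python
def growth_factor(months, doubling_period):
    """Multiplicative growth after `months` given a fixed doubling period."""
    return 2 ** (months / doubling_period)

years = 3
speed = growth_factor(12 * years, 18)   # Moore's law: doubles every 18 months
storage = growth_factor(12 * years, 9)  # storage law: doubles every 9 months
print(speed, storage, storage / speed)
```

Over three years computer speed grows 4x while stored data grows 16x, so the amount of data per unit of processing power quadruples.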
Knowledge Discovery Definition
Knowledge Discovery in Data is the
non-trivial process of identifying
valid
novel
potentially useful
and ultimately understandable patterns in
data.
from Advances in Knowledge Discovery and Data
Mining, Fayyad, Piatetsky-Shapiro, Smyth, and
Uthurusamy, (Chapter 1), AAAI/MIT Press 1996
Related Fields

[Figure: Data Mining and Knowledge Discovery shown in relation to Statistics, Machine Learning, Databases, and Visualization]
Knowledge Discovery Process
[Figure labels: Data Warehouse; Raw Data; Target Data; Transformed Data; Patterns and Rules; Knowledge; Integration; Interpretation & Evaluation; Understanding]
Data Mining Tasks:
Classification
Learn a method for predicting the instance class from pre-
labeled (classified) instances
Many approaches:
Statistics,
Decision Trees, Neural
Networks,
...
Classification: Linear Regression
Linear Regression
$w_0 + w_1 x + w_2 y \geq 0$
Regression computes $w_i$ from
data to minimize squared
error to fit the data
Not flexible enough
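A minimal sketch of the linear classifier above: the weights are fitted by least squares and a point is assigned to the positive class when w0 + w1*x + w2*y >= 0. Encoding the two classes as targets +1 and -1, the sample points, and solving the normal equations by Gauss-Jordan elimination are all assumptions of this sketch. It also assumes the points are not all collinear, so the system has a unique solution.

```python
def fit_least_squares(points, targets):
    """Solve the normal equations (X^T X) w = X^T t for w = [w0, w1, w2]."""
    rows = [[1.0, x, y] for x, y in points]
    n = 3
    # Build the augmented matrix [X^T X | X^T t].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, targets))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]

def classify(w, x, y):
    """Apply the decision rule w0 + w1*x + w2*y >= 0."""
    return 1 if w[0] + w[1] * x + w[2] * y >= 0 else -1

# Hypothetical training data: one negative point near the origin, three positive.
points = [(0, 0), (2, 0), (0, 2), (2, 2)]
targets = [-1, 1, 1, 1]
w = fit_least_squares(points, targets)
print(w)
print([classify(w, x, y) for x, y in points])
```

The fitted boundary is a single straight line, which is why the approach is not flexible enough for regions like the ones a decision tree or neural net can carve out.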
Classification: Decision Trees
if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue
[Figure: X-Y plot with split points at X = 2, X = 5, and Y = 3]
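The rule list above translates directly into code. The function name and the test points are illustrative. The thresholds and colors are the ones given in the rules.

```python
def classify(x, y):
    """Decision tree from the rules above: each test splits on one attribute."""
    if x > 5:
        return "blue"
    elif y > 3:
        return "blue"
    elif x > 2:
        return "green"
    else:
        return "blue"

# One sample point for each of the four leaves.
for point in [(6, 0), (1, 4), (3, 1), (1, 1)]:
    print(point, classify(*point))
```

Only points with 2 < X <= 5 and Y <= 3 are labeled green, so the tree isolates an axis-aligned rectangle that a single linear boundary could not.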
Classification: Neural Nets
Can select more complex
regions
Can be more accurate
Also can overfit the data
find patterns in random noise
Data Mining Tasks: Clustering
Find natural grouping of instances
given un-labeled data
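One common way to find such groupings is k-means clustering, sketched here as an example technique (the slide itself does not name an algorithm). The sample points, the explicitly supplied starting centers, and the iteration cap are assumptions for illustration.

```python
import math

def k_means(points, centers, max_iter=100):
    """Alternate between assigning points to the nearest center and
    moving each center to the mean of its assigned points."""
    for _ in range(max_iter):
        labels = [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return labels, centers

# Hypothetical unlabeled data with two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels, centers = k_means(points, [(1, 1), (9, 8)])
print(labels)
print(centers)
```

The result depends on the starting centers, so practical implementations usually run several random initializations and keep the best one.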
Major Data Mining Tasks
Classification: predicting an item class
Clustering: finding clusters in data
Associations: e.g. A & B & C occur frequently
Visualization: to facilitate human discovery
Estimation: predicting a continuous value
Deviation Detection: finding changes
Link Analysis: finding relationships
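For the associations task, "A & B & C occur frequently" can be made concrete by counting how often an itemset appears across transactions (its support). The transactions and the 60% threshold below are made up, and this brute-force enumeration is only a sketch; it is not an efficient association rule algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"A", "B", "C"},
    {"A", "B", "C", "D"},
    {"A", "B"},
    {"B", "C"},
    {"A", "B", "C"},
]

def frequent_itemsets(transactions, size, min_support):
    """Return itemsets of the given size whose support meets the threshold."""
    counts = Counter()
    for t in transactions:
        counts.update(combinations(sorted(t), size))
    n = len(transactions)
    return {items: c / n for items, c in counts.items() if c / n >= min_support}

triples = frequent_itemsets(transactions, 3, 0.6)
print(triples)
```

Here A, B, and C occur together in 3 of 5 transactions, so the itemset passes the 60% support threshold while every triple involving D does not.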

Case Study: Search Engines
Early search engines mainly used keywords on
a page and were subject to manipulation
Google's success is due to its algorithm, which
mainly uses links to the page
Google founders Sergey Brin and Larry Page
were students at Stanford doing research in
databases and data mining in 1998, which led
to Google

Case Study:
Direct Marketing and CRM
Most major direct marketing companies are
using modeling and data mining
Most financial companies are using customer
modeling
Modeling is easier than changing customer
behaviour
Some successes
Verizon Wireless reduced churn rate from 2% to
1.5%

Case Study:
Security and Fraud Detection
Credit Card Fraud Detection
Money laundering
FAIS (US Treasury)
Securities Fraud
NASDAQ Sonar system
Phone fraud
AT&T, Bell Atlantic, British
Telecom/MCI
Bio-terrorism detection at Salt Lake
Olympics 2002
Data Mining and Terrorism:
Controversy in the News
TIA: Terrorism (formerly Total) Information
Awareness Program
DARPA program closed by Congress
some functions transferred to intelligence
agencies
CAPPS II: screen all airline passengers
controversial

Invasion of Privacy or Defensive Shield?
Criticism of analytic approach
to Threat Detection:
Data Mining will
invade privacy
generate millions of false positives

But can it be effective?
Can Data Mining and Statistics be
Effective for Threat Detection?
Criticism: Databases have 5% errors, so analyzing 100
million suspects will generate 5 million false positives
Reality: Analytical models correlate many items of
information to reduce false positives.
Example: Identify one biased coin from 1,000.
After one throw of each coin, we cannot tell which coin is biased
After 30 throws, one biased coin will stand out with high
probability.
Can identify 19 biased coins out of 100 million with
sufficient number of throws
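The coin argument can be checked with binomial tail probabilities. This is a sketch under stated assumptions: the slides do not say how biased the coin is or what decision rule is used, so the heads probability of 0.9 and the cutoff of 25 heads in 30 throws are hypothetical values chosen for illustration.

```python
from math import comb

def tail_prob(n, k, p):
    """P(at least k heads in n throws) for a coin with heads probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

N_COINS = 1000
P_BIASED = 0.9   # assumed bias; the slides do not give a value
THRESHOLD = 25   # assumed cutoff: flag a coin showing >= 25 heads in 30 throws

for throws, cutoff in [(1, 1), (30, THRESHOLD)]:
    # Expected number of fair coins wrongly flagged, and chance of flagging the biased one.
    false_alarms = (N_COINS - 1) * tail_prob(throws, cutoff, 0.5)
    detect = tail_prob(throws, cutoff, P_BIASED)
    print(throws, round(false_alarms, 3), round(detect, 3))
```

With a single throw about half of the fair coins look exactly like the biased one. With 30 throws the expected number of fair coins passing the cutoff drops below one, which is the sense in which combining many observations reduces false positives.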
Another Approach: Link Analysis
Can Find Unusual Patterns in the Network Structure
Analytic technology can be effective
Combining multiple models and link analysis
can reduce false positives
Today there are millions of false positives with
manual analysis
Data Mining is just one additional tool to help
analysts
Analytic Technology has the potential to
reduce the current high rate of false positives

Data Mining with Privacy
Data Mining looks for patterns, not people!
Technical solutions can limit privacy invasion
Replacing sensitive personal data with anonymous IDs
Give randomized outputs
Multi-party computation over distributed data

Bayardo & Srikant, Technological Solutions for
Protecting Privacy, IEEE Computer, Sep 2003

The Hype Curve for Data Mining and
Knowledge Discovery
[Figure: Expectations and Performance plotted over time (1990, 1998, 2000, 2002); phases: rising expectations, over-inflated expectations, disappointment, growing acceptance and mainstreaming]
Knowledge Discovery & Data
Mining
process of extracting previously unknown,
valid, and actionable (understandable)
information from large databases
Data mining is a step in the KDD process of
applying data analysis and discovery
algorithms

Machine learning, pattern recognition,
statistics, databases, data visualization.
Traditional techniques may be inadequate
for large data
Why Mine Data?
Huge amounts of data being collected and
warehoused
Walmart records 20 millions per day
health care transactions: multi-gigabyte databases
Mobil Oil: geological data of over 100 terabytes
Affordable computing
Competitive pressure
gain an edge by providing improved, customized
services
information as a product in its own right
Knowledge discovery in databases (KDD) is
the non-trivial process of identifying valid,
potentially useful and ultimately
understandable patterns in data
[Figure labels: Operational Databases; Clean, Collect, Summarize; Data Warehouse; Data Preparation; Training Data; Data Mining; Model, Patterns; Verification, Evaluation]
