
OPIM 5671: Data Mining and Business Intelligence
Session 1: Introduction and Data Mining Overview

1
What is Data Mining
Knowledge discovery from data
Extraction of interesting patterns or knowledge (non-trivial,
implicit, understandable, previously unknown, and useful) from
huge amounts of data
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
Watch out: Is everything data mining? No. The following are not:
Simple search and query processing
(Deductive) expert systems

2
Business Intelligence
A broad concept that refers to the processes,
technologies, and tools needed to turn data into
information, information into knowledge, and
knowledge into plans that drive profitable
business actions (from David Loshin, Business
Intelligence: The Savvy Manager's Guide, 2003)
Encompasses data warehousing, reporting,
OLAP, performance metrics and
benchmarking, business analytic tools, and
content/knowledge management, etc.

3
Data Mining vs. Statistics
Statistics
User driven
An underlying theory exists about certain
relationships in the data
Data is often collected for specific purpose
Use statistical methods to test the theory and/or
hypotheses
Data Mining
Data driven, data are often observational and
collected for some other purposes
Often no pre-existing theory
Use statistics, machine learning, and other
techniques to examine data and uncover unknown
relationships

4
Key Assumptions Necessary for DM

Past behavior is a good predictor of future behavior
Data are available for use
Data contain what you want to predict

5
Need for Data Mining and Analytics
The Explosive Growth of Data: from terabytes to petabytes
e.g., Wal-Mart: 20 million transactions/day, 10-terabyte database;
Blockbuster: 36 million households
Data collection and data availability
Automated data collection tools, database systems, Web, Internet of
Things
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, etc.
Science: remote sensing, bioinformatics, scientific simulation, etc.
Society and everyone: social networks, social media, news, etc.
We are drowning in data, but starving for knowledge!

6
KDD Process
[Figure: KDD process flow, from the bottom of the diagram to the top]
Databases -> Data Cleaning and Data Integration -> Data Warehouse ->
Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation ->
Knowledge

7
KDD Process: Several Key Steps
Learning the application domain, relevant prior knowledge, and
goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation
Find useful features, dimensionality/variable reduction, invariant
representation
Choosing functions of data mining
summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
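
The steps above can be sketched end to end on toy data. This is only a minimal illustration, not a prescribed implementation: the two record sources, the field names (`id`, `basket`), the helper function names, and the `min_support` threshold are all hypothetical. Association mining by pair counting is picked here simply as one of the listed mining functions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy records from two sources (made up for illustration).
store_db = [
    {"id": 1, "basket": ["bread", "milk"]},
    {"id": 2, "basket": ["bread", "butter", "milk"]},
    {"id": 3, "basket": None},  # a record with missing data
]
web_db = [
    {"id": 4, "basket": ["butter", "milk"]},
    {"id": 5, "basket": ["bread", "milk"]},
]

def integrate(*sources):
    """Data integration: merge records from several sources into one list."""
    return [rec for src in sources for rec in src]

def clean(records):
    """Data cleaning: drop records whose basket is missing or empty."""
    return [r for r in records if r["basket"]]

def select(records):
    """Selection: keep only the task-relevant field, as sorted unique items."""
    return [sorted(set(r["basket"])) for r in records]

def mine(baskets):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(basket, 2))
    return counts

def evaluate(counts, n_baskets, min_support):
    """Pattern evaluation: keep pairs whose support reaches the threshold."""
    return {
        pair: count / n_baskets
        for pair, count in counts.items()
        if count / n_baskets >= min_support
    }

baskets = select(clean(integrate(store_db, web_db)))
patterns = evaluate(mine(baskets), len(baskets), min_support=0.5)
print(patterns)
```

On this toy input, the pairs bread/milk and butter/milk survive the evaluation step, while bread/butter is filtered out as too rare. Cleaning comes before mining so that the incomplete record never reaches the pattern counts.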
8
Data Mining Pyramid View
[Figure: pyramid; potential to support business decisions increases toward the top]
Layers from top to bottom, with the role shown beside each where recoverable:
Decision Making (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses (DBA)
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems

9
Data Mining Tasks
Descriptive
Find human-interpretable patterns that describe
the data
Predictive
Use some variables to predict unknown or future
values of other variables
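
The difference between the two task types can be shown on a small made-up table. The sketch below assumes hypothetical customer records with `segment`, `visits`, and `spend` fields. The descriptive part summarizes spend per segment; the predictive part fits an ordinary least-squares line (one possible predictive technique among many) to estimate spend from visits.

```python
from collections import defaultdict

# Hypothetical customer records: (segment, visits, spend). Values are made up.
customers = [
    ("retail", 1, 12.0),
    ("retail", 2, 20.0),
    ("retail", 3, 31.0),
    ("business", 4, 41.0),
    ("business", 5, 49.0),
    ("business", 6, 60.0),
]

def mean_spend_by_segment(rows):
    """Descriptive: a human-interpretable summary of the data we have."""
    groups = defaultdict(list)
    for segment, _visits, spend in rows:
        groups[segment].append(spend)
    return {segment: sum(v) / len(v) for segment, v in groups.items()}

def fit_line(xs, ys):
    """Predictive: least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

summary = mean_spend_by_segment(customers)
intercept, slope = fit_line(
    [visits for _, visits, _ in customers],
    [spend for _, _, spend in customers],
)
predicted_spend = intercept + slope * 7  # a visit count not in the data

print(summary)
print(round(predicted_spend, 1))
```

The summary describes the records that are already there, while the fitted line is used to estimate a value for a case that has not been observed.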

10
Core Concepts
Types of Data:
Numeric
Continuous: ratio and interval
Discrete
Need for binning
Categorical: ordered and unordered
Binary
Overfitting and Generalization
Regularization: Penalty for model
complexity
Curse of Dimensionality
Loss Functions
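
Three of these concepts (binning, loss functions, and regularization) fit into a few lines of code. This is a minimal sketch under stated assumptions: equal-width binning is only one of several binning schemes, the loss shown is mean squared error for a one-parameter model y = w * x, and the penalty is an L2 (ridge-style) term `lam * w**2`. All function names and data values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Binning: map numeric values to bin indices 0..n_bins-1.

    Assumes the values are not all identical (the bin width would be zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def squared_loss(w, xs, ys):
    """Loss function: mean squared error of the model y = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def penalized_loss(w, xs, ys, lam):
    """Regularization: the loss plus a penalty that grows with |w|."""
    return squared_loss(w, xs, ys) + lam * w ** 2

def ridge_slope(xs, ys, lam):
    """The w that minimizes penalized_loss (closed form for this 1-D model)."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + n * lam)

ages = [18, 22, 35, 47, 51, 64]       # made-up numeric variable
age_bins = equal_width_bins(ages, 3)  # three equal-width age groups

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # made-up data with y = 2x
w_plain = ridge_slope(xs, ys, lam=0.0)      # no penalty
w_ridge = ridge_slope(xs, ys, lam=1.0)      # penalty shrinks the coefficient

print(age_bins)
print(w_plain, w_ridge)
```

With the penalty switched on, the fitted coefficient is pulled toward zero. That is the sense in which regularization trades some fit on the training data for a simpler model.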

11
Regression with Overfit
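
Overfitting in regression can be illustrated with a short sketch. Two models are fitted to the same six made-up training points, which are assumed to come from the line y = 2x plus alternating noise of +1 and -1. One model is a least-squares straight line; the other is a degree-5 polynomial that passes through every training point exactly (built by Lagrange interpolation). The held-out points are assumed to lie on the noise-free line. All names and values are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares straight line; returns a prediction function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return lambda x: mean_y + slope * (x - mean_x)

def fit_interpolant(xs, ys):
    """Polynomial through every training point (Lagrange form)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            basis = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    basis *= (x - xj) / (xi - xj)
            total += yi * basis
        return total
    return predict

def mse(model, xs, ys):
    """Mean squared error of a model on a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up training data: y = 2x with noise +1, -1, +1, -1, +1, -1.
train_x = [0, 1, 2, 3, 4, 5]
train_y = [1, 1, 5, 5, 9, 9]
# Held-out points, assumed to lie on the noise-free line y = 2x.
test_x = [0.5, 2.5, 4.5]
test_y = [1.0, 5.0, 9.0]

line = fit_line(train_x, train_y)
wiggly = fit_interpolant(train_x, train_y)

print("line:   train", mse(line, train_x, train_y), "test", mse(line, test_x, test_y))
print("wiggly: train", mse(wiggly, train_x, train_y), "test", mse(wiggly, test_x, test_y))
```

The interpolating polynomial has zero training error because it reproduces the noise exactly, yet its error on the held-out points is far larger than that of the straight line. A model that fits the training data perfectly can still generalize poorly.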

12
Typical Characteristics of Data
Standard data format --- table:
Row=observation unit, Column=variable
Opportunistic (often by-product of
transactions)
Not from designed experiments
Often has outliers, missing data, incomplete
data, etc.
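
A small sketch of what such a table looks like in code, along with two routine checks. The rows, the column names (`customer`, `amount`), and the outlier rule (more than five median absolute deviations from the median) are assumptions chosen for illustration; other rules are equally common.

```python
from statistics import median

# Hypothetical transaction table: each dict is a row (observation unit),
# each key is a column (variable). None marks a missing value.
table = [
    {"customer": "a", "amount": 10},
    {"customer": "b", "amount": 12},
    {"customer": "c", "amount": None},
    {"customer": "d", "amount": 11},
    {"customer": "e", "amount": 13},
    {"customer": "f", "amount": 12},
    {"customer": "g", "amount": 95},
]

def count_missing(rows, column):
    """Number of rows where the given column has no value."""
    return sum(1 for row in rows if row[column] is None)

def flag_outliers(rows, column, k=5):
    """Values more than k median absolute deviations away from the median."""
    values = [row[column] for row in rows if row[column] is not None]
    center = median(values)
    mad = median(abs(v - center) for v in values)
    return [v for v in values if abs(v - center) > k * mad]

n_missing = count_missing(table, "amount")
outliers = flag_outliers(table, "amount")
print(n_missing, outliers)
```

The median-based rule is used here instead of mean and standard deviation because a single extreme value shifts the median far less than it shifts the mean.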

13
Challenges in Data Mining
Mining methodology
Handling noise and incomplete data
Performance: efficiency, effectiveness, and scalability
Pattern evaluation: the interestingness problem
Incorporation of background knowledge
Integration of the discovered knowledge with existing knowledge:
knowledge fusion
User interaction
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
Domain-specific data mining & invisible data mining
Protection of data security, integrity, and privacy

14
