
DATA MINING

KDD (Knowledge Discovery in Databases)

10/15/08 Sudarshan 1
Definition
Knowledge Discovery in Databases (KDD) is
about finding new and useful information that
is not obvious.
KDD is sometimes called Data Mining.
However, Data Mining is only one part of the
process.
No one has analyzed all the steps of KDD.

Why analyze the KDD process?
Understand this complex process better
Create better KDD tools
Reduce the time and effort that KDD requires

Data Mining
An attempt at knowledge discovery
Searching for patterns and structure in a
sea of data
Uses techniques from many disciplines,
such as statistical analysis and machine
learning
 These techniques are not our main interest

Definition (Cont.)
Data mining is the exploration and analysis of large
quantities of data in order to discover valid, novel,
potentially useful, and ultimately understandable patterns
in data.

Valid: The patterns hold in general.

Novel: We did not know the pattern beforehand.

Useful: We can devise actions from the patterns.

Understandable: We can interpret and comprehend the patterns.
Why Use Data Mining Today?
Human analysis skills are inadequate:
• Volume and dimensionality of the
data
• High data growth rate
Availability of:
• Data
• Storage
• Computational power
• Expertise
Motivation: “Necessity is the Mother of
Invention”
Data explosion problem:
 Automated data collection tools and mature database
technology lead to tremendous amounts of data stored in
databases, data warehouses and other information
repositories.
We are drowning in data, but starving for
knowledge!
Data mining (knowledge discovery in databases):
• Extraction of interesting knowledge (rules, regularities,
patterns, constraints) from data in large databases.
We are Data Rich but
Information Poor

Databases are too big

Data Mining can help discover knowledge

Terrorbytes
Why Data Mining? -- Potential
Applications
Database analysis and decision support
 Market analysis
 Corporate analysis
 Fraud detection

Other Applications:
 Intelligent query answering
 Prediction and scheduling

Applications
Banking: loan/credit card approval
• predict good customers based on old customers
Customer relationship management:
• identify those who are likely to leave for a competitor
Targeted marketing:
• identify likely responders to promotions
Fraud detection: telecommunications, financial transactions
• identify fraudulent events from an online stream of events
Manufacturing and production:
• automatically adjust knobs when process parameters change
Applications (continued)
Medicine: disease outcome, effectiveness of treatments
• analyze patient disease history: find relationships between diseases
Molecular/Pharmaceutical: identify new drugs
Scientific data analysis:
• identify new galaxies by searching for sub-clusters
Web site/store design and promotion:
• find affinity of visitors to pages and modify layout

Preprocessing and Mining

Data Mining: A KDD Process
• Data mining: the core of the knowledge discovery process.
[Figure: KDD pipeline. Databases → Data Cleaning, Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Steps of a KDD Process
Learning the application domain:
• relevant prior knowledge and goals of the application
Creating a target data set
Data cleaning and preprocessing
Data reduction and projection:
• find useful features, dimensionality/variable reduction,
invariant representation
Choosing the data mining function:
• summarization, classification, regression, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Interpretation:
• visualization, transformation, removing redundant patterns,
etc.
Use of discovered knowledge
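
The steps above can be read as a chain of small transformations. The following is only a minimal sketch: the raw rows, the helper names (select_target, clean, mine, interpret) and the 0.5 threshold are all hypothetical, and simple item counting stands in for the actual mining step.

```python
from collections import Counter

# Hypothetical raw data: (tid, cid, item) rows with inconsistent
# capitalization and one missing item, standing in for a real database.
RAW_ROWS = [
    (1, "c1", "milk"), (1, "c1", "Sugar"),
    (2, "c2", "Milk"), (2, "c2", None),
    (3, "c1", "sugar"), (3, "c1", "milk"), (3, "c1", "pen"),
]


def select_target(rows):
    """Creating a target data set: keep only the columns of interest."""
    return [(tid, item) for tid, _cid, item in rows]


def clean(rows):
    """Data cleaning: drop missing items and normalize capitalization."""
    return [(tid, item.strip().lower()) for tid, item in rows if item]


def mine(rows):
    """Data mining (stand-in): count the transactions each item occurs in."""
    return Counter(item for _tid, item in set(rows))


def interpret(counts, n_transactions, threshold=0.5):
    """Interpretation: keep items found in more than `threshold` of transactions."""
    return {
        item: count / n_transactions
        for item, count in counts.items()
        if count / n_transactions > threshold
    }


def run_pipeline(rows):
    target = clean(select_target(rows))
    n_transactions = len({tid for tid, _item in target})
    return interpret(mine(target), n_transactions)


if __name__ == "__main__":
    print(run_pipeline(RAW_ROWS))
```

In a real project each function would be far more involved; the point is only that every step consumes the output of the previous one.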
What is (not) Data Mining?
● What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”
● What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
Why is data mining important?
Rapid computerization of businesses produces
huge amounts of data.
How can we make the best use of this data?
A growing realization: knowledge discovered
from data can be used for competitive
advantage.

Why is data mining necessary?
Make use of your data assets.
There is a big gap between stored data and
knowledge, and the transition won’t occur
automatically.
Many interesting things you want to find
cannot be found using database queries:
“Find me people likely to buy my products.”
“Who is likely to respond to my promotion?”

Data Mining Tasks
Prediction Methods
• Use some variables to predict unknown or
future values of other variables.

Description Methods
• Find human-interpretable patterns that
describe the data.

Main data mining tasks
Classification:
mining patterns that can classify future data into
known classes.
Association rule mining
mining any rule of the form X → Y, where X and Y
are sets of data items.
Clustering
identifying a set of similarity groups in the data
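
To make the rule form X → Y listed above concrete, the sketch below computes the two measures usually attached to such a rule: support (the fraction of transactions containing every item of X and Y) and confidence (that support divided by the support of X). The baskets are the four transactions from the purchase table later in these slides, with item names lower-cased; the function names are illustrative.

```python
# Baskets from the purchase table in these slides (tid -> set of items),
# with item names lower-cased.
BASKETS = {
    204: {"sugar", "milk", "cheese", "juice"},
    205: {"sugar", "milk", "pen"},
    206: {"milk", "sugar", "juice"},
    207: {"milk", "juice", "apples", "rice"},
}


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in baskets.values() if itemset <= items)
    return hits / len(baskets)


def confidence(x, y, baskets):
    """Confidence of the rule X -> Y: support(X and Y together) / support(X)."""
    supp_x = support(x, baskets)
    if supp_x == 0:
        raise ValueError("rule antecedent never occurs in the data")
    return support(set(x) | set(y), baskets) / supp_x


if __name__ == "__main__":
    print(confidence({"sugar"}, {"milk"}, BASKETS))
    print(confidence({"milk"}, {"sugar"}, BASKETS))
```

Note that the rule is directional: {sugar} → {milk} and {milk} → {sugar} share the same support but need not have the same confidence.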

Main data mining tasks (cont …)
Sequential pattern mining:
A sequential rule A → B says that event A
will be immediately followed by event B
with a certain confidence.
Deviation detection:
discovering the most significant changes in
data
Data visualization: using graphical
methods to show patterns in data.
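
As an illustration of the sequential rule A → B above, the following sketch estimates its confidence from a single event sequence. It assumes one plausible definition, namely the number of times A is immediately followed by B divided by the number of times A has any successor; the event log and the function name are hypothetical.

```python
# Hypothetical event log for illustration only.
EVENTS = ["login", "search", "buy", "login", "search", "logout", "login", "buy"]


def sequential_confidence(events, a, b):
    """Estimate the confidence of 'a is immediately followed by b'.

    Assumed definition: (# of times a directly precedes b) /
    (# of times a is followed by any event). Returns None if a never
    has a successor in the sequence.
    """
    total = 0
    followed = 0
    for current, nxt in zip(events, events[1:]):
        if current == a:
            total += 1
            if nxt == b:
                followed += 1
    if total == 0:
        return None
    return followed / total


if __name__ == "__main__":
    print(sequential_confidence(EVENTS, "search", "buy"))
```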

Counting Co-occurrences
A market basket is a collection of items purchased by a
customer in a single customer transaction.
A customer transaction consists of the items purchased from
the store in a single visit.
Consider the following “purchase relation”:
the tuples are grouped by transaction. All tuples in a
group have the same transaction id (tid) and customer id (cid), and together
they describe a customer transaction, which involves the purchase of one or more
items. The table contains redundancy (cid and date are repeated on every row
of a transaction), which can be removed by decomposing the purchase relation.

Tid  Cid  Date     Item    Qty (packets)
204  c1   4/1/05   sugar   2
204  c1   4/1/05   milk    1
204  c1   4/1/05   cheese  1
204  c1   4/1/05   juice   2

205  c2   5/1/05   sugar   1
205  c2   5/1/05   milk    3
205  c2   5/1/05   pen     4

206  c3   7/1/05   milk    2
206  c3   7/1/05   sugar   2
206  c3   7/1/05   juice   2

207  c2   10/1/05  milk    3
207  c2   10/1/05  juice   6
207  c2   10/1/05  apples  5
207  c2   10/1/05  rice    4
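
The redundancy noted earlier (cid and date repeated on every row of a transaction) can be removed by splitting the relation on tid. A minimal sketch, assuming the rows of the table above are held as Python tuples with item names and customer ids lower-cased; the function names decompose and rejoin are illustrative.

```python
# Rows of the purchase relation: (tid, cid, date, item, qty).
PURCHASES = [
    (204, "c1", "4/1/05", "sugar", 2),
    (204, "c1", "4/1/05", "milk", 1),
    (204, "c1", "4/1/05", "cheese", 1),
    (204, "c1", "4/1/05", "juice", 2),
    (205, "c2", "5/1/05", "sugar", 1),
    (205, "c2", "5/1/05", "milk", 3),
    (205, "c2", "5/1/05", "pen", 4),
    (206, "c3", "7/1/05", "milk", 2),
    (206, "c3", "7/1/05", "sugar", 2),
    (206, "c3", "7/1/05", "juice", 2),
    (207, "c2", "10/1/05", "milk", 3),
    (207, "c2", "10/1/05", "juice", 6),
    (207, "c2", "10/1/05", "apples", 5),
    (207, "c2", "10/1/05", "rice", 4),
]


def decompose(rows):
    """Split into one (tid, cid, date) row per transaction and (tid, item, qty) rows."""
    headers = {}
    lines = []
    for tid, cid, date, item, qty in rows:
        headers.setdefault(tid, (tid, cid, date))  # stored once per tid
        lines.append((tid, item, qty))
    return list(headers.values()), lines


def rejoin(headers, lines):
    """Join the two parts on tid to recover the original rows."""
    by_tid = {tid: (cid, date) for tid, cid, date in headers}
    return [(tid, *by_tid[tid], item, qty) for tid, item, qty in lines]


if __name__ == "__main__":
    headers, lines = decompose(PURCHASES)
    print(len(headers), "transactions,", len(lines), "purchased items")
```

Because rejoining the two parts on tid reproduces the original rows, nothing is lost by the split.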
Frequent Itemsets
The purchase relation can be decomposed into two relations:
• Purchase1 (tid, cid, date)
• Purchase2 (tid, item, qty)

To find the frequently purchased items in the store, the original
purchase table is considered. This table is created during the cleaning
step of the KDD process and is easy to handle when applying data mining
tools.
We can make the following observations from the purchase table:
75% of transactions contain milk and sugar together
50% of transactions contain sugar and juice together
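
Observations of this kind can be checked by grouping the table rows into baskets and counting how often each pair of items occurs together. A minimal sketch, assuming the (tid, item) columns of the purchase table with item names lower-cased; the function names are illustrative.

```python
from collections import Counter, defaultdict
from itertools import combinations

# (tid, item) pairs from the purchase table, item names lower-cased.
ROWS = [
    (204, "sugar"), (204, "milk"), (204, "cheese"), (204, "juice"),
    (205, "sugar"), (205, "milk"), (205, "pen"),
    (206, "milk"), (206, "sugar"), (206, "juice"),
    (207, "milk"), (207, "juice"), (207, "apples"), (207, "rice"),
]


def build_baskets(rows):
    """Group items by transaction id: tid -> set of items."""
    baskets = defaultdict(set)
    for tid, item in rows:
        baskets[tid].add(item)
    return dict(baskets)


def pair_support(baskets):
    """Fraction of baskets in which each (alphabetically ordered) item pair co-occurs."""
    counts = Counter()
    for items in baskets.values():
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: count / n for pair, count in counts.items()}


if __name__ == "__main__":
    supp = pair_support(build_baskets(ROWS))
    print("milk and sugar:", supp[("milk", "sugar")])
    print("sugar and juice:", supp[("juice", "sugar")])
```

Pairs are stored in alphabetical order so that (milk, sugar) and (sugar, milk) are counted as the same pair.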

The following terminology is used to develop an algorithm
for finding frequently purchased items:
• A set of items is called an itemset.
• The support of an itemset is the fraction of transactions in
the database that contain all the items in the itemset.
• For example, {sugar, milk} has 75% support in the purchase table. We
thus conclude that sugar and milk are frequently
purchased together. On the other hand, sugar and rice
are never purchased together.
• The user can specify a minimum support (minsup) and find
all itemsets whose support is above minsup. These itemsets
may be singleton sets.
• If the user specifies a minimum support of 70%, the frequent
itemsets are {milk}, {sugar}, {juice}, {milk, sugar}, and {milk, juice}.
Algorithm to identify frequent itemsets
For each item
    check if it is a frequent itemset
    (appears in > minsup of the transactions)
K = 1
Repeat
    for each new frequent itemset Ik with K items,
        generate all itemsets Ik+1 with K+1 items such that Ik ⊂ Ik+1
    scan all transactions once and check
        if the generated (K+1)-itemsets are frequent
    K := K + 1
Until no new frequent itemsets are identified.
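
The level-wise search above can be sketched in Python as follows. This is one possible reading of the pseudocode, not a definitive implementation: it uses the baskets from the purchase table (item names lower-cased), the strict "> minsup" test from the slide, and, as an added assumption, extends itemsets only with items that are themselves frequent. That shortcut is safe because every subset of a frequent itemset must also be frequent, and it is the idea behind the approach commonly known as Apriori.

```python
# Baskets from the purchase table, item names lower-cased.
BASKETS = [
    {"sugar", "milk", "cheese", "juice"},
    {"sugar", "milk", "pen"},
    {"milk", "sugar", "juice"},
    {"milk", "juice", "apples", "rice"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


def frequent_itemsets(baskets, minsup):
    """Return {itemset: support} for all itemsets with support > minsup."""
    # Level 1: check every single item.
    current = {}
    for item in sorted(set().union(*baskets)):
        single = frozenset([item])
        s = support(single, baskets)
        if s > minsup:
            current[single] = s
    result = dict(current)
    frequent_items = sorted(item for single in current for item in single)

    # Level K+1: extend each new frequent K-itemset by one frequent item.
    while current:
        candidates = {
            itemset | {item}
            for itemset in current
            for item in frequent_items
            if item not in itemset
        }
        current = {}
        for candidate in candidates:
            s = support(candidate, baskets)  # one pass over the baskets
            if s > minsup:
                current[candidate] = s
        result.update(current)
    return result


if __name__ == "__main__":
    for itemset, s in frequent_itemsets(BASKETS, 0.70).items():
        print(sorted(itemset), s)
```

The loop stops as soon as a level produces no new frequent itemsets, which mirrors the "Until" condition of the pseudocode.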
All the best for your
test.
