Data Mini

DATA MINING
KDD (Knowledge Discovery in Databases)
10/15/08 Sudarshan
Definition
Knowledge Discovery in Databases (KDD) is about finding new and useful information that is not obvious.
KDD is sometimes called Data Mining. However, Data Mining is only one part of the process.
No one has analyzed all the steps to KDD
Why analyze the KDD process?
Understand this complex process better
Build better KDD tools
Lowering the time and effort towards a KDD
Data Mining
An attempt at knowledge discovery
Searching for patterns and structure in a sea of data
Uses techniques from many disciplines, such as statistical analysis and machine learning
• These techniques are not our main interest
Definition (Cont.)
Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
Valid: The patterns hold in general.
Novel: We did not know the pattern beforehand.
Useful: We can devise actions from the patterns.
Understandable: We can interpret and comprehend the patterns.
Why Use Data Mining Today?
Human analysis skills are inadequate:
• Volume and dimensionality of the data
• High data growth rate
Availability of:
• Data
• Storage
• Computational power
• Expertise
Motivation: “Necessity is the Mother of Invention”
Data explosion problem:
• Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.
We are drowning in data, but starving for knowledge!
Data mining (knowledge discovery in databases):
• Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases.
We are Data Rich but Information Poor
Databases are too big
Data Mining can help discover knowledge
Terrorbytes
Why Data Mining? -- Potential Applications
Database analysis and decision support
• Market analysis
• Corporate analysis
• Fraud detection
Other Applications:
• Intelligent query answering
• Prediction and scheduling
Applications
Banking: loan/credit card approval
• predict good customers based on old customers
Customer relationship management:
• identify those who are likely to leave for a competitor
Targeted marketing:
• identify likely responders to promotions
Fraud detection: telecommunications, financial transactions
• identify fraudulent events in an online stream of events
Manufacturing and production:
• automatically adjust knobs when process parameters change
Applications (continued)
Medicine: disease outcome, effectiveness of treatments
• analyze patient disease history: find relationships between diseases
Molecular/Pharmaceutical: identify new drugs
Scientific data analysis:
• identify new galaxies by searching for sub-clusters
Web site/store design and promotion:
• find affinity of visitors to pages and modify layout
Preprocessing and Mining
Data Mining: A KDD Process
• Data mining: the core of the knowledge discovery process.
[Figure: KDD process flow. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Steps of a KDD Process
Learning the application domain:
• relevant prior knowledge and goals of the application
Creating a target data set:
Data cleaning and preprocessing:
Data reduction and projection:
• find useful features, dimensionality/variable reduction, invariant representation
Choosing the functions of data mining:
• summarization, classification, regression, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Interpretation:
• visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
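The steps above can be read as a chain of small functions, each feeding the next. The following is only a minimal sketch under stated assumptions: the records, field names and function names are invented for illustration, summarization (average spend per region) stands in for the mining step, and dropping incomplete records is just one possible cleaning policy.

```python
from statistics import mean

# Hypothetical raw records; the field names are illustrative assumptions.
raw = [
    {"customer": "c1", "region": "north", "age": 34, "spend": 120.0},
    {"customer": "c2", "region": "north", "age": None, "spend": 80.0},
    {"customer": "c3", "region": "south", "age": 51, "spend": 200.0},
    {"customer": "c4", "region": "south", "age": 45, "spend": None},
    {"customer": "c5", "region": "south", "age": 29, "spend": 100.0},
]

def create_target_set(records):
    """Creating a target data set: keep only the fields relevant to the goal."""
    return [{"region": r["region"], "spend": r["spend"]} for r in records]

def clean(records):
    """Data cleaning: drop records that still have a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def mine_summary(records):
    """Data mining (summarization): average spend per region."""
    groups = {}
    for r in records:
        groups.setdefault(r["region"], []).append(r["spend"])
    return {region: mean(values) for region, values in groups.items()}

def interpret(summary):
    """Interpretation: rank regions by average spend, highest first."""
    return sorted(summary.items(), key=lambda kv: kv[1], reverse=True)

result = interpret(mine_summary(clean(create_target_set(raw))))
print(result)  # [('south', 150.0), ('north', 100.0)]
```

Record c2 survives cleaning because its only missing field (age) was already dropped when the target data set was created, which is one reason the order of the steps matters.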
What is (not) Data Mining?
● What is not Data Mining?
– Look up phone number in phone directory
– Query a Web search engine for information about “Amazon”
● What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com)
Why is data mining important?
Rapid computerization of businesses produces huge amounts of data
How to make the best use of data?
A growing realization: knowledge discovered from data can be used for competitive advantage.
Why is data mining necessary?
Make use of your data assets
There is a big gap from stored data to knowledge, and the transition won’t occur automatically.
Many interesting things you want to find cannot be found using database queries:
“Find me people likely to buy my products”
“Who is likely to respond to my promotion?”
Data Mining Tasks
Prediction Methods
• Use some variables to predict unknown or future values of other variables.
Description Methods
• Find human-interpretable patterns that describe the data.
Main data mining tasks
Classification:
mining patterns that can classify future data into known classes.
Association rule mining:
mining any rule of the form X → Y, where X and Y are sets of data items.
Clustering:
identifying a set of similarity groups in the data.
Main data mining tasks (cont …)
Sequential pattern mining:
A sequential rule A → B says that event A will be immediately followed by event B with a certain confidence.
Deviation detection:
discovering the most significant changes in data.
Data visualization:
using graphical methods to show patterns in data.
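One way to make the confidence of a sequential rule A → B concrete is to measure, over an observed event sequence, the fraction of occurrences of A that are immediately followed by B. The sketch below assumes that reading of "confidence"; the function name and the event log are hypothetical and not taken from the slides.

```python
def sequential_confidence(events, a, b):
    """Fraction of occurrences of `a` that are immediately followed by `b`."""
    # Collect the event that comes right after each occurrence of `a`.
    followers = [nxt for cur, nxt in zip(events, events[1:]) if cur == a]
    # An occurrence of `a` at the very end has no follower and counts as a miss.
    total = events.count(a)
    return sum(f == b for f in followers) / total if total else 0.0

# Hypothetical event log, invented for illustration.
log = ["login", "search", "buy", "login", "search", "logout", "login", "buy"]

print(sequential_confidence(log, "search", "buy"))    # 0.5
print(sequential_confidence(log, "login", "search"))  # 2 of 3 logins
```

A rule would then be reported only if this value reaches a threshold chosen by the analyst.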
Counting co-occurrences
A market basket is a collection of items purchased by a customer in a single customer transaction.
A customer transaction consists of purchasing items from the store in a single visit.
Consider the following “purchase relation”:
The tuples are sorted into groups by transaction. All tuples in a group have the same transaction id (tid), and hence the same customer id (cid), and together describe a customer transaction, which involves the purchase of one or more items. There is redundancy in the table: cid and date are repeated for every item in a transaction. It can be removed by decomposing the purchase relation.
Tid  Cid  Date     Item    Qty (packets)
204  c1   4/1/05   sugar   2
204  c1   4/1/05   milk    1
204  c1   4/1/05   cheese  1
204  c1   4/1/05   juice   2
205  c2   5/1/05   sugar   1
205  c2   5/1/05   milk    3
205  c2   5/1/05   pen     4
206  c3   7/1/05   milk    2
206  c3   7/1/05   sugar   2
206  c3   7/1/05   juice   2
207  c2   10/1/05  milk    3
207  c2   10/1/05  juice   6
207  c2   10/1/05  apples  5
207  c2   10/1/05  rice    4
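Counting co-occurrences on this table can be sketched as follows: group the tuples into one basket per tid, then count how many baskets contain each pair of items. This is an illustrative sketch, not the method of the slides; the variable names are assumptions, and item names are lowercased so that "Milk" and "milk" count as the same item.

```python
from collections import Counter
from itertools import combinations

# (tid, item) pairs from the purchase relation, with item names lowercased.
purchases = [
    (204, "sugar"), (204, "milk"), (204, "cheese"), (204, "juice"),
    (205, "sugar"), (205, "milk"), (205, "pen"),
    (206, "milk"), (206, "sugar"), (206, "juice"),
    (207, "milk"), (207, "juice"), (207, "apples"), (207, "rice"),
]

# Group the tuples into one basket (set of items) per transaction.
baskets = {}
for tid, item in purchases:
    baskets.setdefault(tid, set()).add(item)

# Count how many baskets contain each unordered pair of items.
# Sorting the items gives every pair one canonical (alphabetical) form.
pair_counts = Counter()
for items in baskets.values():
    pair_counts.update(combinations(sorted(items), 2))

print(pair_counts[("milk", "sugar")])   # 3 of the 4 transactions
print(pair_counts[("rice", "sugar")])   # 0, never bought together
```

Dividing a pair's count by the number of baskets gives the fraction of transactions in which the two items occur together.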
Frequent Itemsets
The purchase relation can be decomposed into 2 relations:
• Purchase1(tid, cid, date)
• Purchase2(tid, item, qty)
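A minimal sketch of this decomposition, using the rows of transactions 204 and 205 as plain tuples. The variable names are assumptions made for illustration; the final check shows that joining the two relations on tid gives back the original rows, so nothing is lost.

```python
# Rows of the purchase relation: (tid, cid, date, item, qty).
purchase = [
    (204, "c1", "4/1/05", "sugar", 2),
    (204, "c1", "4/1/05", "milk", 1),
    (204, "c1", "4/1/05", "cheese", 1),
    (204, "c1", "4/1/05", "juice", 2),
    (205, "c2", "5/1/05", "sugar", 1),
    (205, "c2", "5/1/05", "milk", 3),
    (205, "c2", "5/1/05", "pen", 4),
]

# Purchase1(tid, cid, date): one row per transaction, so cid and date are stored once.
purchase1 = sorted({(tid, cid, date) for tid, cid, date, _, _ in purchase})

# Purchase2(tid, item, qty): one row per purchased item.
purchase2 = [(tid, item, qty) for tid, _, _, item, qty in purchase]

# Joining the two relations on tid reconstructs the original rows.
header = {tid: (cid, date) for tid, cid, date in purchase1}
rejoined = [(tid, *header[tid], item, qty) for tid, item, qty in purchase2]

print(len(purchase1), len(purchase2))  # 2 7
print(rejoined == purchase)            # True
```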
To find the items frequently purchased from the store, the original purchase table is considered. This table is created at the cleaning step of the KDD process and is easy to handle when applying data mining tools.
We can make the following observations from the purchase table:
75% of transactions contain purchases of milk and sugar together
50% of transactions contain purchases of sugar and juice together
The following terminology is used to develop an algorithm for finding frequently purchased items:
• A set of items is called an itemset.
• The support of an itemset is the fraction of transactions in the database that contain all the items in the itemset.
• For example, {sugar, milk} has 75% support in the purchase table. We thus conclude that sugar and milk are frequently purchased together. On the other hand, sugar and rice are never purchased together.
• The user can specify a minimum support (minsup) and find all itemsets whose support is above minsup. These itemsets may be singleton sets.
• With a user-specified minimum support of 70%, the frequent itemsets are {milk}, {sugar}, {juice}, {milk, sugar} and {milk, juice}.
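The support definition can be checked directly against the purchase table. The sketch below assumes each transaction is represented as a set of lowercased item names keyed by tid; the function name `support` is illustrative.

```python
# Baskets from the purchase table, one set of items per transaction id.
baskets = {
    204: {"sugar", "milk", "cheese", "juice"},
    205: {"sugar", "milk", "pen"},
    206: {"milk", "sugar", "juice"},
    207: {"milk", "juice", "apples", "rice"},
}

def support(itemset, baskets):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= items for items in baskets.values()) / len(baskets)

print(support({"sugar", "milk"}, baskets))  # 0.75
print(support({"sugar", "rice"}, baskets))  # 0.0
```

An itemset is then called frequent when its support is above the chosen minsup.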
Algorithm to identify frequent itemsets
For each item,
    check if it is a frequent itemset (appears in > minsup of the transactions)
k = 1
Repeat
    for each new frequent itemset Ik with k items,
        generate all itemsets Ik+1 with k+1 items such that Ik ⊂ Ik+1
    scan all transactions once and check
        if the generated (k+1)-itemsets are frequent
    k := k + 1
Until no new frequent itemsets are identified.
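The level-wise search above can be sketched in Python as follows. This is a plain reading of the pseudocode under stated assumptions, not a tuned implementation: the function names are illustrative, transactions are sets of lowercased item names from the purchase table, and candidates are generated by extending each newly found frequent itemset by one item, without any further pruning.

```python
def count_support(candidates, baskets):
    """Scan all transactions once and count each candidate itemset."""
    counts = dict.fromkeys(candidates, 0)
    for basket in baskets:
        for candidate in candidates:
            if candidate <= basket:
                counts[candidate] += 1
    return counts

def frequent_itemsets(baskets, minsup):
    """Return all itemsets contained in more than `minsup` of the baskets."""
    n = len(baskets)
    items = sorted(set().union(*baskets))
    candidates = {frozenset([i]) for i in items}  # level k = 1
    frequent = []
    while candidates:  # stops once a level yields no new frequent itemsets
        counts = count_support(candidates, baskets)
        level = [c for c in candidates if counts[c] / n > minsup]
        frequent.extend(level)
        # Generate (k+1)-itemsets that extend a new frequent k-itemset by one item.
        candidates = {c | {i} for c in level for i in items if i not in c}
    return frequent

baskets = [
    {"sugar", "milk", "cheese", "juice"},
    {"sugar", "milk", "pen"},
    {"milk", "sugar", "juice"},
    {"milk", "juice", "apples", "rice"},
]

found = frequent_itemsets(baskets, minsup=0.70)
for itemset in sorted(found, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset))
```

On the purchase table with minsup = 0.70 this finds {juice}, {milk}, {sugar}, {juice, milk} and {milk, sugar}. No three-item set is frequent, so the loop stops after the third scan.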
All the best for your test.
