Beruflich Dokumente
Kultur Dokumente
Lecture 2
Introduction to Data Mining
by
Tan, Steinbach, Kumar
Tan,Steinbach, Kumar
4/18/2004
Regression [Predictive]
Deviation
Tan,Steinbach, Kumar
Detection [Predictive]
4/18/2004
Classification: Definition
predict the outcome, the algorithm processes a training set containing a set of attributes and
the respective outcome, usually called goal or prediction attribute. The algorithm tries to
discover relationships between the attributes that would make it possible to predict the
outcome. Next the algorithm is given a data set not seen before, called prediction set, which
contains the same set of attributes, except for the prediction attribute not yet known. The
algorithm analyses the input and produces a prediction. The prediction accuracy defines
how good the algorithm is
Tan,Steinbach, Kumar
4/18/2004
Classification Example
Features
Classes
Training
Set
Test
Set
Tan,Steinbach, Kumar
Learn
Classifier
Model
4/18/2004
Classification: Application 1
Direct Marketing
Goal: Reduce cost of mailing by targeting a set of consumers likely to
buy a new cell-phone product
:..
Approach:
Type of business, where they stay, how much they earn, etc.
Use this information as input attributes to learn a classifier model.
4/18/2004
Classification: Application 2
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions
..
Approach:
Tan,Steinbach, Kumar
4/18/2004
Classification: Application 3
Customer Attrition/Churn:
Goal: To predict whether a customer is likely to be lost
to a competitor.
:.
Approach:
Use
.
How often the customer calls, where he calls, what time-of-the day he
calls most, his financial status, marital status , etc.
Label
4/18/2004
Clustering Definition
( )
Tan,Steinbach, Kumar
4/18/2004
Euclidian Distance ( )
Y axis
Y axis
7
(7,4)
4
7
(7,4)
(2,2)
b
X axis
X axis
4/18/2004
Illustrating Clustering
Intracluster
Intraclusterdistances
distances
are
areminimized
minimized
Tan,Steinbach, Kumar
Intercluster
Interclusterdistances
distances
are
aremaximized
maximized
4/18/2004
10
Clustering: Application 2
Document Clustering:
Goal: To find groups of documents that are similar to each other
based on the important terms appearing in them. :
.
Tan,Steinbach, Kumar
4/18/2004
11
Tan,Steinbach, Kumar
Financial
Total
Articles
555
Correctly
Placed
364
Foreign
341
260
National
273
36
Metro
943
746
Sports
738
573
Entertainment
354
278
4/18/2004
12
{Milk}
{Milk}-->
-->{Coke}
{Coke}
{Tissues,
{Tissues,Milk}
Milk}-->
-->{Tea}
{Tea}
Tan,Steinbach, Kumar
4/18/2004
13
Tan,Steinbach, Kumar
4/18/2004
14
If a customer buys diaper and milk, then he is very likely to buy beer.
!
Tan,Steinbach, Kumar
4/18/2004
15
Given is a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different
events. , ,
Rules are formed by first disovering patterns. Event occurrences in the patterns are governed by timing constraints. ,
(A B)
(A B)
<= xg
(C)
(D E)
(C) (D E)
>ng
<= ws
<= ms
Tan,Steinbach, Kumar
4/18/2004
16
First
buy computer, then CD-ROM, and then digital camera, within 3 months.
Athletic
Apparel Store: :
Tan,Steinbach, Kumar
4/18/2004
17
Regression
Examples:
Predicting sales amounts of new product based on advertising
expenditure.
Predicting wind velocities as a function of temperature, humidity, air
pressure, etc , , .
Time series prediction of stock market indices
.
Tan,Steinbach, Kumar
4/18/2004
18
Deviation/Anomaly Detection/
Typical network traffic at University level may reach over 100 million connections per day
Tan,Steinbach, Kumar
4/18/2004
19