
How Data Mining Works

How exactly can data mining tell you important things that you didn't know, or predict what is going to happen next? The technique used to perform these feats in data mining is called modeling.

Modeling is simply the act of building a model in one situation where you know the answer and then applying it to another situation where you don't.
Computers are loaded up with lots of information about a variety of situations where the answer is known, and the data mining software on the computer must run through that data and distill the characteristics of the data that should go into the model.
Once the model is built, it can then be used in similar situations where you don't know the answer.
Some results of Data Mining
Forecasting what may happen in the future.
Classifying people or things into groups by recognizing patterns.
Clustering people or things into groups based on their attributes.
Sequencing what events are likely to lead to later events.
Example
For example, say that you are the director of marketing for a telecommunications company and you'd like to acquire some new long-distance phone customers. You could:
1) randomly mail the coupon to the general population, or
2) use the business experience stored in your database to build a model, then mail only to the right targets.
Points to Remember
Data mining is a tool, not a magic box.
Data mining will not automatically discover solutions without guidance.
To ensure meaningful results, it's vital that you understand your data.
Data mining is a user-centric, interactive process which leverages analytic technologies and computing power.
Data mining's central quest: find true patterns and avoid overfitting (finding random patterns by searching too many possibilities).
Classification and Regression
Databases are rich with hidden information that can be used to make intelligent business decisions.
Classification and regression are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends.
Classification is used to predict or classify a categorical response variable, e.g., to predict the type of an Iris flower (Setosa, Virginica, Versicolor).
Regression is used to predict a quantitative response variable, e.g., the average income of a household.
Statistical learning plays a key role in many areas of science, finance, industry, and many other applications.
Here are some examples of learning problems:
Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack. The prediction is to be based on demographic, diet, and clinical measurements for that patient.
Predict the price of a stock 6 months from now, on the basis of company performance measures and economic data.
Identify which loan applicants will be profitable customers for a bank.
Identify the numbers in a handwritten ZIP code, from a digitized image.
Estimate the amount of glucose in the blood of a diabetic person, from the infrared absorption spectrum of that person's blood.
Steps of Classification and Regression models
Step 1: In the first step, a model is built describing a predetermined set of data classes (supervised learning).
Step 2: In the second step, the predictive accuracy of the model is estimated.
Step 3: If the accuracy of the model is considered acceptable, the model can be used to classify future data for which the class label is unknown.
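As a hedged illustration of these three steps, here is a minimal sketch using scikit-learn's decision tree on the Iris data; the 70/30 split, the choice of classifier, and the sample measurements are illustrative assumptions, not part of the slides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: build the model from data whose classes are known (supervised learning).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: estimate the predictive accuracy on held-out labeled data.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"estimated accuracy: {accuracy:.2f}")

# Step 3: if the accuracy is acceptable, classify new data whose label is unknown.
new_flower = [[5.1, 3.5, 1.4, 0.2]]  # hypothetical measurements
print("predicted class:", model.predict(new_flower))
```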
2. ASSOCIATION RULES
What Is Association Rule Mining?
Association rule mining is finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
Applications:
basket data analysis, cross-marketing, catalog design, clustering, classification, etc.
Rule form: Body → Head [support, confidence].
Association rule mining examples:
buys(x, diapers) → buys(x, beers) [0.5%, 60%]
major(x, CS) ∧ takes(x, DB) → grade(x, A) [1%, 75%]

Association Rule Mining Problem:
Given: (1) a database of transactions, and (2) each transaction is a list of items (purchased by a customer in a visit).
Find: all rules that correlate the presence of one set of items with that of another set of items.
E.g., 98% of people who purchase tires and auto accessories also get automotive services done.
Support and confidence
That is:
support, s: the probability that a transaction contains A ∪ B,
s = P(A ∪ B);
confidence, c: the conditional probability that a transaction containing A also contains B,
c = P(B | A).
Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong.
Frequent itemsets
A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset.
An itemset satisfies minimum support if its occurrence frequency is greater than or equal to the product of min_sup and the total number of transactions in D. The number of transactions required for the itemset to satisfy minimum support is referred to as the minimum support count.
If an itemset satisfies minimum support, then it is a frequent itemset. The set of frequent k-itemsets is commonly denoted by Lk.
Example 2.1

Transaction-ID   Items_bought
-------------------------------------------
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

With minimum support = 50% and minimum confidence = 50%, we have:
A → C (50%, 66.6%)
C → A (50%, 100%)
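These numbers can be checked directly. The sketch below (plain Python sets; the helper names `support` and `confidence` are ours, not from any library) reproduces Example 2.1:

```python
transactions = [
    {"A", "B", "C"},   # TID 2000
    {"A", "C"},        # TID 1000
    {"A", "D"},        # TID 4000
    {"B", "E", "F"},   # TID 5000
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(body, head):
    """P(head | body) = support(body ∪ head) / support(body)."""
    return support(body | head) / support(body)

print(support({"A", "C"}))        # 0.5    -> 50%
print(confidence({"A"}, {"C"}))   # 0.666  -> 66.6%
print(confidence({"C"}, {"A"}))   # 1.0    -> 100%
```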
How to mine association rules from large databases?
Association rule mining is a two-step process:
1. Find all frequent itemsets (the sets of items that have minimum support).
Every subset of a frequent itemset must also be a frequent itemset (the Apriori principle); i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets), as sketched below.
2. Generate strong association rules from the frequent itemsets.
The overall performance of mining association rules is determined by the first step.
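Here is a compact sketch of that first step, level-wise candidate generation in the Apriori style; the function and parameter names (`apriori`, `min_sup`) are illustrative, not from a library:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return all itemsets whose support count >= min_sup * |D|."""
    min_count = min_sup * len(transactions)
    items = {item for t in transactions for item in t}
    frequent = {}                              # frozenset -> support count
    level = [frozenset([i]) for i in items]    # candidate 1-itemsets
    while level:
        # Count each candidate in one pass over the database.
        counts = {c: sum(c <= t for t in transactions) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(current)
        # Generate (k+1)-candidates; the Apriori principle lets us keep
        # only those whose k-subsets are all frequent.
        next_level = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == len(a) + 1 and all(
                frozenset(s) in current for s in combinations(union, len(a))
            ):
                next_level.add(union)
        level = list(next_level)
    return frequent
```

Run on the four transactions of Example 2.1 with min_sup = 0.5, this yields the frequent itemsets {A}, {B}, {C}, and {A, C}, matching the example.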

3. CLASSIFICATION
Classification is the process of learning a model that describes different classes of data. The classes are predetermined.
Example: In a banking application, customers who apply for a credit card may be classified as a good risk, a fair risk, or a poor risk. Because the classes are known in advance, this type of activity is also called supervised learning.
Once the model is built, it can be used to classify new data.
The first step, learning the model, is accomplished by using a training set of data that has already been classified. Each record in the training data contains an attribute, called the class label, that indicates which class the record belongs to. The model that is produced is usually in the form of a decision tree or a set of rules.
Some of the important issues with regard to the model and the algorithm that produces it include:
the model's ability to predict the correct class of new data,
the computational cost associated with the algorithm,
the scalability of the algorithm.
Let's examine the approach where the model is in the form of a decision tree.
A decision tree is simply a graphical representation of the description of each class, or in other words, a representation of the classification rules.
Example 3.1
Suppose that we have a database of customers on the AllElectronics mailing list. The database describes attributes of the customers, such as their name, age, income, occupation, and credit rating. The customers can be classified as to whether or not they have purchased a computer at AllElectronics.
Suppose that new customers are added to the database and that you would like to notify these customers of an upcoming computer sale. Sending promotional literature to every new customer in the database can be quite costly. A more cost-efficient method would be to target only those new customers who are likely to purchase a new computer. A classification model can be constructed and used for this purpose.
Figure 2 shows a decision tree for the concept buys_computer, indicating whether or not a customer at AllElectronics is likely to purchase a computer.
[Figure: A decision tree for the concept buys_computer, indicating whether or not a customer at AllElectronics is likely to purchase a computer. Each internal node represents a test on an attribute; each leaf node represents a class.]
Decision Trees
For example, consider the widely referenced Iris data classification problem introduced by Fisher (1936). The purpose of the analysis is to learn how one can discriminate between the three types of flowers, based on the four measures of width and length of petals and sepals.
A classification tree will determine a set of logical if-then conditions (instead of linear equations) for predicting or classifying cases.
Advantages of tree methods
Simplicity of results. In most cases, the interpretation of results summarized in a tree is very simple. This simplicity is useful not only for rapid classification of new observations; tree methods often yield a much simpler model for explaining why observations are classified or predicted in a particular manner. For example, when analyzing business problems, it is much easier to present a few simple if-then statements to management than some elaborate equations.

Tree methods are nonparametric and nonlinear. The final results of using tree methods for classification or regression can be summarized in a series of logical if-then conditions. Therefore, there is no implicit assumption that the underlying relationships between the predictor variables and the dependent variable are linear, follow some specific non-linear link function, or are even monotonic in nature.
General Classification and Regression Trees
The STATISTICA General Classification and Regression Trees module (GC&RT) will build classification and regression trees for predicting continuous dependent variables (regression) and categorical dependent variables (classification). The program supports the classic C&RT algorithm and includes various methods for pruning and cross-validation, as well as the powerful v-fold cross-validation method.
Classification and Regression Trees (C&RT)
In the most general terms, the purpose of the analyses via tree-building algorithms is to determine a set of if-then logical (split) conditions that permit accurate prediction or classification of cases.
Classification Trees
The example data file Irisdat.sta reports the lengths and widths of sepals and petals of three types of irises (Setosa, Versicol, and Virginic). The purpose of the analysis is to learn how one can discriminate between the three types of flowers based on the four measures of width and length of petals and sepals.
Discriminant function analysis will estimate several linear combinations of predictor variables for computing classification scores (or probabilities) that allow the user to determine the predicted classification for each observation. A classification tree, by contrast, will determine a set of logical if-then conditions (instead of linear equations) for predicting or classifying cases, as the sketch below illustrates.
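A minimal sketch of those if-then conditions, assuming scikit-learn stands in for the STATISTICA module (Irisdat.sta itself is a STATISTICA file; sklearn's copy of Fisher's data substitutes for it, and max_depth=2 is an assumed simplification):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Print the learned logical split conditions instead of linear equations.
print(export_text(tree, feature_names=list(iris.feature_names)))
```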

Regression Trees
The general approach of deriving predictions from a few simple if-then conditions can be applied to regression problems as well. Example 1 is based on the data file Poverty.sta, which contains 1960 and 1970 Census figures for a random selection of 30 counties. The research question (for that example) was to determine the correlates of poverty, that is, the variables that best predict the percent of families below the poverty line in a county.
Extracting Classification Rules from Trees
Represent the knowledge in the form of IF-THEN rules:
One rule is created for each path from the root to a leaf.
Each attribute-value pair along a path forms a conjunction.
The leaf node holds the class prediction.
Rules are easier for humans to understand.
Example
IF age <= 30 AND student = no THEN buys_computer = no
IF age <= 30 AND student = yes THEN buys_computer = yes
IF age = 31..40 THEN buys_computer = yes
IF age > 40 AND credit_rating = excellent THEN buys_computer = no
IF age > 40 AND credit_rating = fair THEN buys_computer = yes
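Transcribed directly as code, the five rules above become a simple function (a sketch; the numeric age boundaries and string categories follow the slide):

```python
def buys_computer(age, student, credit_rating):
    if age <= 30 and student == "no":
        return "no"
    if age <= 30 and student == "yes":
        return "yes"
    if 31 <= age <= 40:
        return "yes"
    if age > 40 and credit_rating == "excellent":
        return "no"
    if age > 40 and credit_rating == "fair":
        return "yes"

print(buys_computer(35, "no", "fair"))  # -> "yes" (the age = 31..40 rule fires)
```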
Neural Networks and Classification
A neural network is a technique derived from AI that uses generalized approximation and provides an iterative method to carry it out. ANNs use the curve-fitting approach to infer a function from a set of samples.
This technique provides a learning approach; it is driven by a training sample that is used for the initial inference and learning. With this kind of learning method, responses to new inputs may be interpolated from the known samples. This interpolation depends on the model developed by the learning method.
Artificial Intelligence for Data Mining
Neural networks are useful for data mining and decision-support applications.
People are good at generalizing from experience.
Computers excel at following explicit instructions over and over.
Neural networks bridge this gap by modeling, on a computer, the neural behavior of human brains.
Neural Network Characteristics
Neural networks are useful for pattern recognition or data classification, through a learning process.
Neural networks simulate biological systems, where learning involves adjustments to the synaptic connections between neurons.
Anatomy of a Neural Network
Neural networks map a set of input nodes to a set of output nodes.
The number of inputs/outputs is variable.
The network itself is composed of an arbitrary number of nodes with an arbitrary topology.
Biological Background
A neuron is a many-inputs / one-output unit.
The output can be excited or not excited.
Incoming signals from other neurons determine whether the neuron shall excite ("fire").
The output is subject to attenuation in the synapses, which are the junction parts of the neuron.
Basics of a Node
A node is an element which performs a function.
[Figure: a node receiving weighted connections and producing an output.]
Perceptron Training
A perceptron is a single-unit network.
Training adjusts the weights based on how well the current weights match an objective, as the sketch below shows.
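A minimal training sketch: the classic perceptron rule (adjust each weight by error times input). The AND data, learning rate, and epoch count are assumptions for illustration.

```python
import numpy as np

# Hypothetical linearly separable training data: logical AND.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)
b = 0.0
lr = 0.1  # learning rate (assumed)

for epoch in range(20):
    for xi, target in zip(X, y):
        output = 1 if xi @ w + b > 0 else 0
        error = target - output
        # Adjust weights by how far the current output misses the objective.
        w += lr * error * xi
        b += lr * error

print(w, b)  # weights that separate AND after training
```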
Neural Network Learning
Networks learn from experience: examples / training data.
The strength of a connection between neurons is stored as a weight value for that specific connection.
Learning the solution to a problem = changing the connection weights.
Neural Network Learning
A continuous learning process:
Evaluate output.
Adapt weights.
Take new inputs.
Learning causes a stable state of the weights.
Learning Performance
Supervised:
needs to be trained ahead of time with lots of data.
Unsupervised:
unsupervised networks adapt to the input, with applications in clustering and reducing dimensionality;
learning may be very slow;
no help from the outside: no training data, no information available on the desired output;
learning by doing, used to pick out structure in the input:
Clustering
Compression
Topologies: Back-Propagated Networks
Inputs are put through a hidden layer before the output layer.
All nodes are connected between layers.
BP Network: Supervised Training
The desired output of the training examples is known.
Error = difference between actual and desired output.
Change each weight relative to the error size.
Calculate the output-layer error, then propagate it back to the previous layer.
Hidden weights are updated.
Performance improves. (A bare-bones sketch of this loop follows.)
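The sketch below implements exactly these steps with NumPy on XOR targets: compute the output-layer error, propagate it back through one hidden layer, update both weight matrices. The network size, learning rate, and iteration count are assumptions; with an unlucky initialization it may need more iterations or a different seed.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)     # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)     # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5                                          # assumed learning rate

for _ in range(10000):
    h = sigmoid(X @ W1 + b1)                      # hidden activations
    out = sigmoid(h @ W2 + b2)                    # actual output
    # Error = difference between actual and desired output, scaled by
    # the sigmoid derivative at the output layer.
    delta_out = (y - out) * out * (1 - out)
    # Propagate the error back to the previous (hidden) layer.
    delta_hid = (delta_out @ W2.T) * h * (1 - h)
    # Hidden and input weights updated relative to error size.
    W2 += lr * h.T @ delta_out;  b2 += lr * delta_out.sum(axis=0)
    W1 += lr * X.T @ delta_hid;  b1 += lr * delta_hid.sum(axis=0)

print(out.round(2))  # should approach the XOR targets 0, 1, 1, 0
```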

Neural Network Topology Characteristics
A set of inputs.
A set of hidden nodes.
A set of outputs.
Increasing the number of nodes makes the network more difficult to train.
Applications of Neural Networks
Prediction: weather, stocks, disease.
Classification: financial risk assessment, image processing.
Data association: text recognition (OCR).
Data conceptualization: customer purchasing habits.
Filtering: normalizing telephone signals (removing static).
ANNs and Classification
ANNs can be classified into two categories: supervised and unsupervised networks. Adaptive methods that attempt to reduce the output error are supervised learning methods, whereas those that develop internal representations without sample outputs are called unsupervised learning methods.
ANNs can learn from information on a specific problem. They perform well on classification tasks and are therefore useful in data mining.
[Figure: Information processing at a neuron in an ANN.]
Machine Learning Algorithms
STATISTICA Machine Learning provides a number of advanced statistical methods for handling regression and classification tasks with multiple dependent and independent variables. These methods include:
Support Vector Machines (SVM) (for regression and classification)
Naive Bayes (for classification)
K-Nearest Neighbors (KNN) (for regression and classification)
Support Vector Machines
STATISTICA Support Vector Machine (SVM) is primarily a classifier method that performs classification tasks by constructing hyperplanes in a multidimensional space that separate cases of different class labels. STATISTICA SVM supports both regression and classification tasks and can handle multiple continuous and categorical variables.
To construct an optimal hyperplane, SVM employs an iterative training algorithm, which is used to minimize an error function. According to the form of the error function, SVM models can be classified into four distinct groups:
Classification SVM Type 1 (also known as C-SVM classification)
Classification SVM Type 2 (also known as nu-SVM classification)
Regression SVM Type 1 (also known as epsilon-SVM regression)
Regression SVM Type 2 (also known as nu-SVM regression)
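The same four variants are available in scikit-learn, which can stand in for the STATISTICA module in a sketch (the hyperparameter values shown are illustrative defaults, not prescriptions):

```python
from sklearn.svm import SVC, NuSVC, SVR, NuSVR

models = {
    "Classification SVM Type 1 (C-SVM)":   SVC(C=1.0),
    "Classification SVM Type 2 (nu-SVM)":  NuSVC(nu=0.5),
    "Regression SVM Type 1 (epsilon-SVM)": SVR(C=1.0, epsilon=0.1),
    "Regression SVM Type 2 (nu-SVM)":      NuSVR(C=1.0, nu=0.5),
}
# Each model is trained the same way: model.fit(X, y) runs the iterative
# algorithm that minimizes the corresponding error function.
```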
Naive Bayes Classification
Bayesian classifiers are statistical classifiers which can predict class membership probabilities, such as the probability that a given sample belongs to a particular class.
Bayesian classification is based on Bayes' theorem.
Bayesian classifiers also show high accuracy and speed when applied to large data sets.
Bayes' Theorem
Let X be a data sample whose class label is unknown. Let H be some hypothesis, such as that the data sample X belongs to a specified class C. For a classification problem, we want to determine P(H|X), the probability that the hypothesis H holds given the observed data sample X.
P(H|X) is called the posterior probability.
Suppose the world of data samples consists of fruits, described by their color and shape. Suppose X is red and round, and that H is the hypothesis that X is an apple. Then P(H|X) reflects our confidence that X is an apple given that we have seen that X is red and round.
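A sketch of Bayes' theorem on the fruit example. All three input probabilities are hypothetical numbers chosen only to make the formula concrete:

```python
p_apple = 0.20           # P(H): prior probability that a fruit is an apple (hypothetical)
p_rr_given_apple = 0.90  # P(X|H): an apple is red and round (hypothetical)
p_rr = 0.30              # P(X): any fruit is red and round (hypothetical)

# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
p_apple_given_rr = p_rr_given_apple * p_apple / p_rr
print(p_apple_given_rr)  # 0.6: our confidence that X is an apple
```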
K-Nearest Neighbors
STATISTICA K-Nearest Neighbors (KNN) is a memory-based model defined by a set of objects known as examples for which the outcomes are known (i.e., the examples are labeled).
The independent and dependent variables can be either continuous or categorical. For continuous dependent variables the task is regression; otherwise it is classification. Thus, STATISTICA KNN can handle both regression and classification tasks.
Given a new case of predictor values (the query point), we would like to estimate the outcome based on the KNN examples. STATISTICA KNN achieves this by finding the K examples that are closest in distance to the query point; hence the name K-Nearest Neighbors. For regression problems, KNN predictions are based on averaging the outcomes of the K nearest neighbors; for classification problems, majority voting is used.
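Both tasks in one sketch, with scikit-learn standing in for the STATISTICA module; the one-dimensional data, labels, and K=3 are assumptions for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

# Classification: majority vote among the K nearest examples.
labels = np.array([0, 0, 0, 1, 1, 1])
clf = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
print(clf.predict([[2.5]]))   # -> [0]

# Regression: average the outcomes of the K nearest examples.
values = np.array([1.1, 2.1, 2.9, 10.2, 11.1, 11.8])
reg = KNeighborsRegressor(n_neighbors=3).fit(X, values)
print(reg.predict([[2.5]]))   # -> mean of 1.1, 2.1, 2.9, about 2.03
```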
Association Rules
The goal of association rules is to detect relationships or associations among a large set of data items. It is an important data mining model studied extensively by the database and data mining community.
Assume all data are categorical.
Association rules were initially used for market basket analysis, to find how items purchased by customers are related.
The discovery of such association rules can help people develop marketing strategies by gaining insight into which items are frequently purchased together by customers.
Transaction data: supermarket data
Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
...
tn: {biscuit, eggs, milk}
Concepts:
An item: an item/article in a basket.
I: the set of all items sold in the store.
A transaction: items purchased in a basket; it may have a TID (transaction ID).
A transactional dataset: a set of transactions.
Rule strength measures
Support: The rule holds with support sup in T (the transaction data set) if sup% of transactions contain X ∪ Y.
sup = Pr(X ∪ Y) = count(X ∪ Y) / total count.
Confidence: The rule holds in T with confidence conf if conf% of transactions that contain X also contain Y.
conf = Pr(Y | X) = support(X ∪ Y) / support(X).
An association rule is a pattern that states that when X occurs, Y occurs with a certain probability.
An Example
Transaction data:
t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes
Assume minsup = 30% and minconf = 80%.
An example frequent itemset: {Chicken, Clothes, Milk} [sup = 3/7].
Association rules from the itemset:
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]
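The rule-generation step (step 2 of association mining) can be sketched for this itemset directly: enumerate every split of {Chicken, Clothes, Milk} into body and head, and keep the rules whose confidence clears minconf = 80% (the transactions are t1-t7 as listed above):

```python
from itertools import combinations

T = [{"Beef", "Chicken", "Milk"}, {"Beef", "Cheese"}, {"Cheese", "Boots"},
     {"Beef", "Chicken", "Cheese"},
     {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
     {"Chicken", "Clothes", "Milk"}, {"Chicken", "Milk", "Clothes"}]

def sup(s):
    """Fraction of transactions containing all items of s."""
    return sum(s <= t for t in T) / len(T)

itemset = frozenset({"Chicken", "Clothes", "Milk"})
for k in range(1, len(itemset)):
    for body in combinations(itemset, k):
        body = frozenset(body)
        head = itemset - body
        conf = sup(itemset) / sup(body)
        if conf >= 0.8:
            print(set(body), "->", set(head),
                  f"[sup = {sup(itemset):.2f}, conf = {conf:.2f}]")
```

Besides the two rules on the slide, this enumeration also surfaces Clothes, Milk → Chicken [sup = 3/7, conf = 3/3], which clears the same thresholds.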
Cluster Analysis
Cluster analysis is the process of grouping data into classes or clusters so that objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters.
Clustering is an example of unsupervised learning, where the learning does not rely on predefined classes and class-labeled training examples. For this reason, clustering is a form of learning by observation rather than learning by example.
Areas of Application
Market research: clustering can help marketers discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns.
Biology: biologists can use clustering to discover distinct groups of species depending on some useful parameters.
5. CLUSTERING
What is Cluster Analysis?
Cluster: a collection of data objects that are
similar to one another within the same cluster,
dissimilar to the objects in other clusters.
Cluster analysis: grouping a set of data objects into clusters.
Clustering is unsupervised learning: no predefined classes, no class-labeled training samples.
Typical applications:
as a stand-alone tool to get insight into data distribution,
as a preprocessing step for other algorithms.
General Applications of Clustering
Pattern recognition
Spatial data analysis:
create thematic maps in GIS by clustering feature spaces,
detect spatial clusters and explain them in spatial data mining.
Image processing
Economic science (especially market research)
World Wide Web:
document classification,
clustering Weblog data to discover groups of similar access patterns.
Examples of Clustering Applications
Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.
Land use: identification of areas of similar land use in an earth observation database.
Insurance: identifying groups of motor insurance policy holders with a high average claim cost.
City planning: identifying groups of houses according to their house type, value, and geographical location.
Earthquake studies: observed earthquake epicenters should be clustered along continent faults.
Partitioning Algorithms: Basic Concept
Partitioning method: construct a partition of a database D of n objects into a set of k clusters.
Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion:
- Global optimum: exhaustively enumerate all partitions.
- Heuristic methods: the k-means and k-medoids algorithms.
k-means (MacQueen, 1967): each cluster is represented by the center of the cluster.
k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw, 1987): each cluster is represented by one of the objects in the cluster.
The K-Means Clustering Method
Input: a database D of m records r1, r2, ..., rm, and a desired number of clusters k.
Output: a set of k clusters that minimizes the squared-error criterion.
Given k, the k-means algorithm is implemented in four steps:
Step 1: Randomly choose k records as the initial cluster centers.
Step 2: Assign each record ri to the cluster such that the distance between ri and the cluster centroid (mean) is the smallest among the k clusters.
Step 3: Recalculate the centroid (mean) of each cluster based on the records assigned to the cluster.
Step 4: Go back to Step 2; stop when no more new assignments are made.
The algorithm begins by randomly choosing k records to represent the centroids (means) m1, m2, ..., mk of the clusters C1, C2, ..., Ck. All the records are placed in a given cluster based on the distance between the record and the cluster mean. If the distance between mi and record rj is the smallest among all cluster means, then the record is placed in cluster Ci.
Once all records have been placed in a cluster, the mean for each cluster is recomputed.
Then the process repeats, by examining each record again and placing it in the cluster whose mean is closest. Several iterations may be needed, but the algorithm will converge, although it may terminate at a local optimum. A minimal implementation sketch follows.
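A direct sketch of the four steps with NumPy (Euclidean distance; the toy data, k, and the iteration cap are placeholders):

```python
import numpy as np

def k_means(records, k, iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose k records as the initial centroids.
    centroids = records[rng.choice(len(records), k, replace=False)]
    for _ in range(iterations):
        # Step 2: assign each record to the cluster with the nearest centroid.
        dists = np.linalg.norm(records[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Step 3: recalculate each centroid as the mean of its records.
        new_centroids = np.array(
            [records[assign == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the assignments (hence centroids) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign

data = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
print(k_means(data, k=2))  # two tight groups, one centroid per group
```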
[Figure: Clustering of a set of objects based on the k-means method.]
Hierarchical Clustering
A hierarchical clustering method works by grouping data objects into a tree of clusters. In general, there are two types of hierarchical clustering methods:
Agglomerative hierarchical clustering: this bottom-up strategy starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied. Most hierarchical clustering methods belong to this category; they differ only in their definition of intercluster similarity.
Divisive hierarchical clustering: this top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until a certain termination condition is satisfied, such as a desired number of clusters being obtained or the distance between the two closest clusters rising above a certain threshold distance.
[Figure: Agglomerative and divisive hierarchical clustering on data objects {a, b, c, d, e}.]
Hierarchical Clustering
In DIANA, all of the objects are used to form one initial cluster. The cluster is split according to some principle, such as the maximum Euclidean distance between the closest neighboring objects in the cluster. The cluster-splitting process repeats until, eventually, each new cluster contains only a single object.
In general, divisive methods are more computationally expensive and tend to be less widely used than agglomerative methods.
There are a variety of methods for defining the intercluster distance D(Ck, Ch); however, local pairwise distance measures (i.e., between pairs of clusters) are especially suited to hierarchical methods, as in the sketch below.
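An agglomerative (bottom-up) sketch with SciPy; single linkage uses the minimum pairwise distance as the intercluster measure D(Ck, Ch), and the points and cut threshold are assumptions for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0.0], [0.4], [1.0], [5.0], [5.5], [6.1]])  # objects a..f

# Merge atomic clusters step by step until one cluster remains.
Z = linkage(points, method="single")

# Cut the tree where the intercluster distance exceeds a threshold.
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)  # two groups: the points near 0 and the points near 5-6
```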

7. POTENTIAL APPLICATIONS OF DM
Database analysis and decision support
Market analysis and management:
target marketing, customer relationship management, market basket analysis, cross-selling, market segmentation.
Risk analysis and management:
forecasting, customer retention, improved underwriting, quality control, competitive analysis.
Fraud detection and management
Other applications:
text mining (newsgroups, email, documents) and Web analysis,
intelligent query answering.
Market Analysis and Management
Where are the data sources for analysis?
Credit card transactions, discount coupons, customer complaint calls, plus (public) lifestyle studies.
Target marketing:
find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc.
Determine customer purchasing patterns over time:
e.g., conversion of a single to a joint bank account (marriage, etc.).
Cross-market analysis:
associations/correlations between product sales,
prediction based on the association information.
Customer profiling:
data mining can tell you what types of customers buy what products (clustering or classification).
Identifying customer requirements:
identify the best products for different customers,
use prediction to find what factors will attract new customers.
Provision of summary information:
various multidimensional summary reports,
statistical summary information (data central tendency and variation).
Fraud Detection and Management
Applications:
widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
Approach:
use historical data to build models of fraudulent behavior, and use data mining to help identify similar instances.
Examples:
auto insurance: detect groups of people who stage accidents to collect on insurance;
money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network);
medical insurance: detect professional patients and rings of doctors and rings of references.
Some representative data mining tools
Oracle (Oracle Data Mining): classification, prediction, regression, clustering, association, feature selection, feature extraction, anomaly detection.
Weka (http://www.cs.waikato.ac.nz/ml/weka), University of Waikato, New Zealand: written in Java; platforms: Linux, Windows, Macintosh.
Acknosoft (Kate): decision trees, case-based reasoning.
DBMiner Technology (DBMiner): OLAP analysis, associations, classification, clustering.
IBM (Intelligent Miner): classification, association rules, predictive models.
NCR (Management Discovery Tool): association rules.
SAS (Enterprise Miner): decision trees, association rules, neural networks, regression, clustering.
Silicon Graphics (MineSet): decision trees, association rules.
