
COMP527:

Data Mining
COMP527: Data Mining

Dr Robert Sanderson
(azaroth@liv.ac.uk)

Dept. of Computer Science
University of Liverpool
2008

These are the full course notes, though not quite complete. You should come to the lectures anyway. Really.

Introduction to the Course January 18, 2008 Slide 1


COMP527:
Data Mining
COMP527: Data Mining

Introduction to the Course
Introduction to Data Mining
Introduction to Text Mining
General Data Mining Issues
Data Warehousing
Classification: Challenges, Basics
Classification: Rules
Classification: Trees
Classification: Trees 2
Classification: Bayes
Classification: Neural Networks
Classification: SVM
Classification: Evaluation
Classification: Evaluation 2
Regression, Prediction
Input Preprocessing
Attribute Selection
Association Rule Mining
ARM: A Priori and Data Structures
ARM: Improvements
ARM: Advanced Techniques
Clustering: Challenges, Basics
Clustering: Agglomerative/Divisive
Clustering: Advanced Algorithms
Hybrid Approaches
Graph Mining, Web Mining
Text Mining: Challenges, Basics
Text Mining: Text-as-Data
Text Mining: Text-as-Language
Revision for Exam

Introduction to the Course January 18, 2008 Slide 2


COMP527:
Data Mining
Today's Topics

Me, You: Introductions
Lectures
Tutorials
References
Course Summary
Assessment
Something Fun*

* Or at least more fun, hopefully

Introduction to the Course January 18, 2008 Slide 3


COMP527:
Data Mining
Introductions

Dr. Robert Sanderson

Office:    1.04,  Ashton Building
Extension:    54252   [external: 795 4252]
Email:   azaroth@liv.ac.uk
Web:     http://www.csc.liv.ac.uk/~azaroth/
Hours:  10:00 to 18:00, not Thursday
Email for a time, or show up at any time knowing that 
I might not be there.
Where's your accent from:  New Zealand

Introduction to the Course January 18, 2008 Slide 4


COMP527:
Data Mining
Not Me!

So you went to Waikato?    

Your PhD is in Data Mining? 
... Computer Science? 
... Science? Math? Engineering? 

You at least write Java?
... C++?

What sort of CS Lecturer are you?!

Introduction to the Course January 18, 2008 Slide 5


COMP527:
Data Mining
Me!

Went to University of Canterbury (NZ, not Kent)
... But I do know Ian Witten quite well.

PhD is in French/History
... But focused on Computing in the Humanities/Informatics

Python!

Information Science:  Information Retrieval, Data Mining, Text 
Mining, XML, Databases, Interoperability, Grid Processing, 
Digital Preservation ...

Introduction to the Course January 18, 2008 Slide 6


COMP527:
Data Mining
You!

...

Introduction to the Course January 18, 2008 Slide 7


COMP527:
Data Mining
Lectures

Lecture Slots:
Monday:  10-11am    Here
Tuesday: 10-11am    Here
Friday:  2-3pm      Here

Course requirement:    30 hours of lectures
Semester Timetable:
  8 weeks of classes, 3 weeks Easter break, 4 weeks of classes.

Dates:
21st January to 11th of March    (Rob @ conference on 14th)
7th April to 21st April   (But may run to 25th?)
    
Introduction to the Course January 18, 2008 Slide 8
COMP527:
Data Mining
Tutorials/Lab Sessions

Location:
Lab 6, Tuesdays 3-4pm
(just before departmental seminar)

Aims:
Provide time for practical experience
Answer any questions from lectures/reading
Informal self-assessment exercises

Software:
Data mining 'workbench' software WEKA installed on Windows
image. May be available under Linux. Freely downloadable from
University of Waikato:

http://www.cs.waikato.ac.nz/ml/weka/

Introduction to the Course January 18, 2008 Slide 9


COMP527:
Data Mining
Course Web Sites

Departmental Home Page:

http://www.csc.liv.ac.uk/teaching/modules/newmscs2/comp527.html

Lecture Notes, Assignments, Exercises:

http://www.csc.liv.ac.uk/~azaroth/courses/current/comp527/

Introduction to the Course January 18, 2008 Slide 10


COMP527:
Data Mining
Reference Texts

Witten, Ian and Eibe Frank, Data Mining: Practical Machine Learning Tools and 
Techniques, Second Edition, Morgan Kaufmann, 2005

Dunham, Margaret H, Data Mining: Introductory and Advanced Topics, Prentice 
Hall, 2003

Introduction to the Course January 18, 2008 Slide 11


COMP527:
Data Mining
Frequently Used Resources

– Han and Kamber, Data Mining: Concepts and Techniques, Second 
Edition, Morgan Kaufmann, 2006
– Berry, Browne, Lecture Notes in Data Mining, World Scientific, 2006
– Berry and Linoff, Data Mining Techniques, Second Edition, Wiley, 2004
– Zhang, Association Rule Mining, Springer, 2002
– Konchady, Text Mining Application Programming, Thomson, 2006
– Weiss et al., Text Mining: Predictive Methods for Analyzing 
Unstructured Information, Springer, 2005
– Inmon, Building the Data Warehouse, Wiley, 1993

– KDD    (http://www.kdd2007.com)
– PAKDD   (http://lamda.nju.edu.cn/conf/PAKDD07/)
– PKDD  (http://www.ecmlpkdd2008.org/)

Introduction to the Course January 18, 2008 Slide 12


COMP527:
Data Mining
Frequently Used Websites

– CiteSeer:            http://citeseer.ist.psu.edu/
– KDNuggets:        http://www.kdnuggets.com/
– UCI Repository:  http://kdd.ics.uci.edu/

(plus follow link to Machine Learning Archive)
– Wikipedia: http://en.wikipedia.org/wiki/Data_mining
– MathWorld: http://mathworld.wolfram.com/
– Google Scholar: http://scholar.google.com/
– NaCTeM: http://www.nactem.ac.uk/

Introduction to the Course January 18, 2008 Slide 13


COMP527:
Data Mining
Course Summary

● Introduction, Basics: 4 lectures


● Data Warehousing: 1 lecture
● Classification: 10 lectures
● Input Preprocessing: 2 lectures
● Association Rule Mining: 4 lectures
● Clustering: 3 lectures
● Hybrid Approaches: 1 lecture
● Graph Mining: 1 lecture
● Text Mining: 3 lectures
● Revision: 1 lecture

Total: 30 lectures

Introduction to the Course January 18, 2008 Slide 14


COMP527:
Data Mining
Assessment

● 75% End of Year Exam:


● 2 ½ hours

● Short Answer and/or Essays

● Choose 4 of 5 sections

● 25% Continuous Assessment:


● 12% Assignment 1 (Due 2008-03-10 16:00:00)
● 13% Assignment 2 (Due 2008-04-25 16:00:00)

● Self assessment exercises


● Weekly (or as desired) during tutorial session

Introduction to the Course January 18, 2008 Slide 15


COMP527:
Data Mining
And Now...

... what you've all been waiting for ...

Something Fun! *

* (Or more fun than the rest of the lecture at least, your mileage may
vary, opinions expressed herein bla bla bla)

Introduction to the Course January 18, 2008 Slide 16


COMP527:
Data Mining
“Nomic Mao”

The Rules:

– Each player is dealt 7 cards by the dealer
– The first person to have no cards in hand wins
– Every turn, each player discards a card
– Play starts with the person to the left of the dealer and proceeds 
to the left
– The dealer and then the winner of each round makes a secret 
rule
– If you break a rule, you receive a penalty from the rule's creator
– The penalty is: You must draw one card

Introduction to the Course January 18, 2008 Slide 17


COMP527:
Data Mining
Advanced Rules

– Later rules may overturn earlier rules, either completely or in part
– Each rule may only change one aspect of the game play
– Penalty conditions for breaking rules include:

Illegal card played    (eg black on red)

Procedural error       (eg playing out of turn)

Incorrect penalty      (eg when a later rule enables a play)
– Each rule is numbered (eg: Procedural error under Rule 3)
– When taking a penalty for playing out of turn, or for discarding 
multiple cards, you must first return the game to the state it was in 
before the offending play, and then the penalty is incurred.

Introduction to the Course January 18, 2008 Slide 18


COMP527:
Data Mining
COMP527: Data Mining


Introduction to Data Mining January 18, 2008 Slide 19


COMP527:
Data Mining
Today's Topics

● What is Data Mining?


● Definitions

● Views on the Process

● Basic Functions

● Why would you do this?


● Motivations

● Applications

● WEKA: Waikato Environment for Knowledge Analysis


(And a cute little bird!)

Introduction to Data Mining January 18, 2008 Slide 20


COMP527:
Data Mining
What is Data Mining?

Some Definitions:
– “The nontrivial extraction of implicit, previously unknown, and
potentially useful information from data” (Piatetsky-Shapiro)
– "...the automated or convenient extraction of patterns
representing knowledge implicitly stored or captured in large
databases, data warehouses, the Web, ... or data
streams." (Han, pg xxi)
– “...the process of discovering patterns in data. The process
must be automatic or (more usually) semiautomatic. The
patterns discovered must be meaningful...” (Witten, pg 5)
– “...finding hidden information in a database.” (Dunham, pg 3)
– “...the process of employing one or more computer learning
techniques to automatically analyse and extract knowledge
from data contained within a database.” (Roiger, pg 4)

Introduction to Data Mining January 18, 2008 Slide 21


COMP527:
Data Mining
What is Data Mining?

Keywords from each definition:


– “The nontrivial extraction of implicit, previously unknown, and
potentially useful information from data” (Piatetsky-Shapiro)
– "...the automated or convenient extraction of patterns
representing knowledge implicitly stored or captured in large
databases, data warehouses, the Web, ... or data
streams." (Han, pg xxi)
– “...the process of discovering patterns in data. The process
must be automatic or (more usually) semiautomatic. The
patterns discovered must be meaningful...” (Witten, pg 5)
– “...finding hidden information in a database.” (Dunham, pg 3)
– “...the process of employing one or more computer learning
techniques to automatically analyze and extract knowledge
from data contained within a database.” (Roiger, pg 4)

Introduction to Data Mining January 18, 2008 Slide 22


COMP527:
Data Mining
KDD: Knowledge Discovery in Databases

Many texts treat KDD and Data Mining as the same process,
but it is also possible to think of Data Mining as the
discovery part of KDD.

Dunham:
KDD is the process of finding useful information and
patterns in data.
Data Mining is the use of algorithms to extract information
and patterns derived by the KDD process.

For this course, we will discuss the entire process (KDD) but
focus mostly on the algorithms used for discovery.

Introduction to Data Mining January 18, 2008 Slide 23


COMP527:
Data Mining
Piatetsky-Shapiro View

Initial Data --(Selection)--> Target Data --(Preprocessing)--> Preprocessed Data
--(Transformation)--> Transformed Data --(Data Mining)--> Data Model
--(Interpretation)--> Knowledge

(As tweaked by Dunham)

Introduction to Data Mining January 18, 2008 Slide 24


COMP527:
Data Mining
CRISP-DM View

Introduction to Data Mining January 18, 2008 Slide 25


COMP527:
Data Mining
Data Mining Functions

All Data Mining functions can be thought of as attempting to find a
model to fit the data.
Each function needs Criteria to prefer one model over another.
Each function needs a technique to Compare the data.

Two types of model:
– Predictive models predict unknown values based on known data
– Descriptive models identify patterns in data

Each type has several sub-categories, each of which has many
algorithms. We won't have time to look at ALL of them in detail.

Introduction to Data Mining January 18, 2008 Slide 26


COMP527:
Data Mining
Data Mining Functions

Predictive (Supervised Learning):
– Classification: Maps data into predefined classes
– Regression: Maps data into a function
– Prediction: Predict future data states
– Time Series Analysis: Analyze data over time

Descriptive (Unsupervised Learning):
– Clustering: Find groups of similar items
– Association Rules: Find relationships between items
– Characterisation: Derive representative information
– Sequence Discovery: Find sequential patterns

Introduction to Data Mining January 18, 2008 Slide 27


COMP527:
Data Mining
Classification

The aim of classification is to create a model that can predict the
'type' or some category for a data instance that doesn't have one.

Two phases:
1. Given labelled data instances, learn a model for how to predict
the class label for them. (Training)
2. Given an unlabelled, unseen instance, use the model to predict
the class label. (Prediction)

Some algorithms predict only a binary split (yes/no), some can
predict 1 of N classes, some give probabilities for each of N classes.

Introduction to Data Mining January 18, 2008 Slide 28


COMP527:
Data Mining
Clustering

The aim of clustering is similar to classification, but without
predefined classes.

Clustering attempts to find clusters of data instances which are more
similar to each other than to instances outside of the cluster.

Unsupervised Learning: learning by observation, rather than by example.

Some algorithms must be told how many clusters to find, others try to
find an 'appropriate' number of clusters.

Introduction to Data Mining January 18, 2008 Slide 29


COMP527:
Data Mining
Association Rule Mining

The aim of association rule mining is to find patterns that occur in
the data set frequently enough to be interesting. Hence the association
or correlation of data attributes within instances, rather than between
instances.

These correlations are then expressed as rules – if X and Y appear in
an instance, then Z also appears. (A small illustrative sketch follows below.)

Most algorithms are extensions of a single base algorithm known as
'A Priori', however a few others also exist.
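
To make the idea concrete, here is a minimal Python sketch (not from the slides) that measures how often a made-up rule {X, Y} -> {Z} holds over some made-up transactions; 'support' and 'confidence' are the standard measures we will meet again in the ARM lectures.

    # Minimal sketch: how often does the rule {X, Y} -> {Z} hold?
    # The transactions and item names are made up for illustration.
    transactions = [
        {"X", "Y", "Z"},
        {"X", "Y"},
        {"X", "Z"},
        {"X", "Y", "Z", "W"},
    ]
    antecedent, consequent = {"X", "Y"}, {"Z"}

    with_antecedent = [t for t in transactions if antecedent <= t]
    with_both = [t for t in with_antecedent if consequent <= t]

    support = len(with_both) / len(transactions)        # whole rule appears in 50% of transactions
    confidence = len(with_both) / len(with_antecedent)  # Z follows X and Y about 67% of the time
    print(support, confidence)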

Introduction to Data Mining January 18, 2008 Slide 30


COMP527:
Data Mining
Why?

That all sounds ... complicated. Why should I learn about Data Mining?

What's wrong with just a relational database? Why would I want to go
through these extra [complicated] steps?

Isn't it expensive? It sounds like it takes a lot of skill, programming,
computational time and storage space. Where's the benefit?

Data Mining isn't just a cute academic exercise, it has very profitable
real world uses. Practically all large companies and many governments
perform data mining as part of their planning and analysis.

Introduction to Data Mining January 18, 2008 Slide 31


COMP527:
Data Mining
The Data Explosion

The rate of data creation is accelerating each year. In 2003, UC
Berkeley estimated that the previous year generated 5 exabytes of data,
of which 92% was stored on electronically accessible media.
Mega < Giga < Tera < Peta < Exa ... All the data in all the books in
the US Library of Congress is ~136 Terabytes. So 37,000 new Libraries
of Congress in 2002.

VLBI Telescopes produce 16 Gigabytes of data every second.
Each engine of each plane of each company produces ~1 Gigabyte of data
every trans-atlantic length journey.
Google searches 18 billion+ accessible web pages.

Introduction to Data Mining January 18, 2008 Slide 32


COMP527:
Data Mining
Data Explosion Implications

As the amount of data increases, the proportion of information decreases.

As more and more data is generated automatically, we need to find
automatic solutions to turn those stored raw results into information.

Companies need to turn stored data into profit ... otherwise why are
they storing it?

Let's look at some real world examples.

Introduction to Data Mining January 18, 2008 Slide 33


COMP527:
Data Mining
Classification

The data generated by airplane engines can be used to determine when
they need to be serviced. By discovering the patterns that are
indicative of problems, companies can service working engines less
often (increasing profit) and discover faults before they materialise
(increasing safety).

Loan companies can “give you results in minutes” by classifying you
into a good credit risk or a bad risk, based on your personal
information and a large supply of previous, similar customers.

Cell phone companies can classify customers into those likely to leave,
and hence need enticement, and those that are likely to stay regardless.
Introduction to Data Mining January 18, 2008 Slide 34
COMP527:
Data Mining
Clustering

Discover previously unknown groups of customers/items. By finding
clusters of customers, companies can then determine how best to handle
that particular cluster.

For example, this could be used for targeted advertising, special
offers, transferring information gathered by association rule mining to
other members of the cluster, and so forth.

The concept of 'Similarity' is often used for determining other items
that you might be interested in, eg 'More Like This' links.

Introduction to Data Mining January 18, 2008 Slide 35


COMP527:
Data Mining
Association Rule Mining

By finding association rules from shopping baskets, supermarkets can
use this information for many things, including:
– Product placement in the store
– What to put on sale
– What to create as 'joint special offers'
– What to offer the customer in terms of coupons
– What to advertise together

It shouldn't be surprising that your Tesco coupons are for things that
you sometimes buy, rather than things you always or never buy.
Wal-Mart in the US records every transaction at every store
-- petabytes of information to sift through. (TeraData)

Introduction to Data Mining January 18, 2008 Slide 36


COMP527:
Data Mining
Data/Information/Knowledge/Wisdom

Note well that data mining applications have no wisdom. They cannot
apply the knowledge that they discover appropriately.

For example, a data mining application may tell you that there is a
correlation between buying music magazines and beer, but it doesn't
tell you how to use that knowledge. Should you put the two close
together to reinforce the tendency, or should you put them far apart
as people will buy them anyway and thus stay in the store longer?

Data mining can help managers plan strategies for a company, it does
not give them the strategies.

Introduction to Data Mining January 18, 2008 Slide 37


COMP527:
Data Mining
WEKA

Introduction to Data Mining January 18, 2008 Slide 38


COMP527:
Data Mining
WEKA

Introduction to Data Mining January 18, 2008 Slide 39


COMP527:
Data Mining
Further Reading

● Witten Chapter 1
● Dunham Chapter 1
● Han Chapter 1; Sections 6.1, 7.1
● Berry & Linoff Chapters 1,2

● http://en.wikipedia.org/wiki/Data_mining
and linked pages

Introduction to Data Mining January 18, 2008 Slide 40


COMP527:
Data Mining
COMP527: Data Mining


Introduction to Text Mining January 18, 2008 Slide 41


COMP527:
Data Mining
Today's Topics

Information Retrieval (IR)
What is IR?
Typical IR Process

Data Mining on Text

Text Mining
What is Text Mining?
Typical Text Mining Process

Applications

Introduction to Text Mining January 18, 2008 Slide 42


COMP527:
Data Mining
What is Information Retrieval?

IR is concerned with retrieving textual records, not data items like
relational databases, nor (specifically) with finding patterns like
data mining.

Examples:
SQL: Find rows where the text column LIKE “%information retrieval%”

DM: Find a model in order to classify document topics.

IR: Find documents with text that contains the words Information
adjacent to Retrieval, Protocol or SRW, but not Google.

Introduction to Text Mining January 18, 2008 Slide 43


COMP527:
Data Mining
What is Information Retrieval?

– IR focuses on finding the most appropriate or relevant records for
the user's request.

The supremacy of Google can be attributed primarily to its PageRank
algorithm for ranking web pages in order of relevance to the user's
query. $741.79 (on 2007-11-06, up from $471.80 on 2006-11-03) a share
says this topic is important to understand!

– IR also focuses on finding these records as quickly as possible.

Not only does Google find relevant pages, it finds them Fast, for many
thousands (maybe millions?) of concurrent users.

Introduction to Text Mining January 18, 2008 Slide 44


COMP527:
Data Mining
IR = Google??

So is “Google” the answer to the question of “Information Retrieval”?

No! Google has a good answer for how to search the web, but there are
many more sources of data, and many more interesting questions.

Many other examples, including:
Library catalogues
XML searching
Distributed searching
Query languages

Introduction to Text Mining January 18, 2008 Slide 45


COMP527:
Data Mining
IR Processes: Discovery

[Diagram: the discovery loop linking the User, their Information Need,
the Query, the Search Engine and the returned Information.]

Research topics exist for each box and arrow!

Introduction to Text Mining January 18, 2008 Slide 46


COMP527:
Data Mining
IR Processes: Ingestion

[Diagram: the ingestion pipeline linking Documents, Target Documents,
Preprocessed Documents and Records feeding into the Search Engine.]

Compare to the KDD process we looked at last time!

Introduction to Text Mining January 18, 2008 Slide 47


COMP527:
Data Mining
Document Indexing

What information do we need to store?

Query: Documents containing Information and Retrieval but not Protocol
Need to find which documents contain which words.

Could perform this query using a document/term matrix:

             Term1  Term2  Term3  Term4  Term5  ...  TermN
Document1      1      0      0      1      0          1
Document2      1      1      0      1      0          0
Document3      0      1      1      0      1          0
Document4      1      0      1      1      1          0
Document5      0      1      1      1      0          0
...
DocumentN      1      0      0      0      1          1

Introduction to Text Mining January 18, 2008 Slide 48


COMP527:
Data Mining
Document Indexing

Also useful to know is the frequency of the term in the document.
Each row in the matrix is a vector, and useful for data mining
functions as the document has been reduced to a series of numbers
rather than words.

Our new matrix might look like:

             Term1  Term2  Term3  Term4  Term5  ...  TermN
Document1      2      0      0      4      0          9
Document2      3      5      0      1      0          0
Document3      0      2      6      0      1          0
Document4      1      0      3      1      2          0
Document5      0      4      1      2      0          0
...
DocumentN      1      0      0      0      3          1
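
As a concrete illustration, here is a minimal Python sketch (not from the slides) that builds such a term-frequency matrix; the three tiny example documents are made up.

    from collections import Counter

    # Made-up example documents; in practice these would come from a collection.
    documents = [
        "information retrieval and data mining",
        "data mining finds patterns in data",
        "retrieval of documents by query",
    ]

    # Tokenise and count term frequencies per document.
    counts = [Counter(doc.split()) for doc in documents]

    # The vocabulary defines the columns of the document/term matrix.
    vocabulary = sorted(set(term for c in counts for term in c))

    # Each row is a document vector of term frequencies.
    matrix = [[c[term] for term in vocabulary] for c in counts]

    print(vocabulary)
    for row in matrix:
        print(row)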

Introduction to Text Mining January 18, 2008 Slide 49


COMP527:
Data Mining
Evaluation

Common evaluation for IR relevance ranking: Precision and Recall

Precision: Number Relevant and Retrieved / Number Retrieved
Recall: Number Relevant and Retrieved / Number Relevant
F Score: (recall * precision) / ((recall + precision) / 2)

Ideal situation is all and only relevant documents retrieved.
Also used in Data Mining evaluation.
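
A minimal Python sketch (not from the slides) of these three measures for a single query; the sets of relevant and retrieved documents are made up.

    # Precision, recall and F score for one query, using made-up document ids.
    relevant = {"d1", "d2", "d3", "d4"}
    retrieved = {"d2", "d3", "d5"}

    hits = relevant & retrieved              # relevant and retrieved
    precision = len(hits) / len(retrieved)
    recall = len(hits) / len(relevant)

    # Harmonic mean, as in the formula above.
    f_score = (precision * recall) / ((precision + recall) / 2)

    print(f"P={precision:.2f} R={recall:.2f} F={f_score:.2f}")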

Introduction to Text Mining January 18, 2008 Slide 50


COMP527:
Data Mining
Topics of Interest

Format Processing:  Extraction of text from different file formats
Indexing:  Efficient extraction/storage of terms from text 
Query Languages:  Formulation of queries against those indexes
Protocols: Transporting queries from client to server
Relevance Ranking: Determining the relevance of a document to the 
user's query
Metasearch: Cross­searching multiple document sets with the same query
GridIR: Using the grid (or other massively parallel infrastructure) to 
perform IR processes
Multimedia IR:  IR techniques on multimedia objects, compound digital 
objects... 

Introduction to Text Mining January 18, 2008 Slide 51


COMP527:
Data Mining
Data Mining on Text

All of the Data Mining functions can be applied to textual data, using
term as the attribute and frequency as the value.

Classification:
Classify a text into subjects, genres, quality, reading age, ...

Clustering:
Cluster together similar texts

Association Rule Mining:
Find words that frequently appear together
Find texts that are frequently cited together

Key challenge is the very large number of terms (eg the number of
different words across all documents)

Introduction to Text Mining January 18, 2008 Slide 52


COMP527:
Data Mining
Text Mining

So, we've looked at Data Mining and IR... What's Text Mining then?

Good question.  No canonical definition yet, but a similar definition for 
Data Mining could be applied:

The non­trivial extraction of previously unknown, interesting facts from 
an (invariably large) collection of texts.

So it sounds like a combination of IR and Data Mining, but actually the 
process involves many other steps too.  Before we look at what actually 
happens, let's look at why it's different...

Introduction to Text Mining January 18, 2008 Slide 53


COMP527:
Data Mining
Text Mining vs Data Mining

Data Mining finds a model for the data based on the attributes of the 
items.  The only attributes of text are the words that make up the text.
As we looked at for IR, this creates a very sparse matrix.

Even if we create that matrix, what sort of patterns could we find:
– Classification: We could classify texts into pre-defined classes (eg spam / not spam)
– Association Rule Mining: Finding frequent sets of words (eg if 'computer' appears 3+ times, then 'data' appears at least once)
– Clustering: Finding groups of similar documents (IR?)

None of these fit our definition of Text Mining.

Introduction to Text Mining January 18, 2008 Slide 54


COMP527:
Data Mining
Text Mining vs IR

Information Retrieval finds documents that match the user's query.

Even if we matched at a sentence level rather than document, all we do is 
retrieve matching sentences, we're not discovering anything new.

The relevance ranking is important, but it still just matches information 
we already knew... it just orders it appropriately.  

IR (typically) treats a document as a big bag of words... but doesn't care 
about the meaning of the words, just if they exist in the document.

IR doesn't fit our definition of Text Mining either.

Introduction to Text Mining January 18, 2008 Slide 55


COMP527:
Data Mining
Text Mining Process

How would one find previously unknown facts from a bunch of text?

– Need to understand the meaning of the text!

Part of speech of words

Subject/Verb/Object/Preposition/Indirect Object
– Need to determine that two entities are the same entity.
– Need to find correlations of the same entity.
– Form logical chains: Milk contains Magnesium. Magnesium stimulates receptor activity. Inactive receptors cause Headaches --> Milk is good for Headaches. (fictional example!)

Introduction to Text Mining January 18, 2008 Slide 56


COMP527:
Data Mining
Part of Speech Tagging

First we need to tag the text with the parts of speech for each word.

eg:
Rob/noun teaches/verb the/article course/noun

How could we do this?  By learning a model for the language!  Essentially 
a data mining classification problem ­­ should the system classify the 
word as a noun, a verb, an adjective, etc.

Lots of different tags, often based on a set called the Penn Treebank.
(NN = Noun, VB = Verb, JJ = Adjective, RB = Adverb, etc)
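
As one way to try this in practice, here is a minimal sketch (not from the slides) using the NLTK library; the tokeniser and tagger models may need to be downloaded separately before it will run.

    # Minimal NLTK part-of-speech tagging sketch.
    import nltk

    sentence = "Rob teaches the course"
    tokens = nltk.word_tokenize(sentence)

    # Returns (word, tag) pairs using Penn Treebank style tags, eg NN, VBZ, DT.
    print(nltk.pos_tag(tokens))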

Introduction to Text Mining January 18, 2008 Slide 57


COMP527:
Data Mining
Deep Parsing

Now we need to discover the phrases and parts of each clause.

Rob/noun teaches/verb the/article course/noun
(Subject: Rob  Verb:teaches (Object: the+course))

The phrase sections are often expressed as trees:
(TOP
  (S
    (NP (DT This) (JJ crazy) (NN sentence))
    (VP (VBD amused)
      (NP (NNP Rob))
      (PP (IN for)
        (NP (DT a) (JJ few) (NNS minutes))))))

Introduction to Text Mining January 18, 2008 Slide 58


COMP527:
Data Mining
Entity Recognition

Once we've parsed the text for linguistic structure, we need to identify the 
real world objects referred to.

Rob teaches the course

Rob:  Me  (Sanderson, Robert D.  b.1976-07-20 Rangiora/New Zealand)
the course: Comp527 2006/2007, University of Liverpool, UK

This is typically done via lookups in very large thesauri or 'ontologies', 
specific to the domain being processed (eg medical, historical, current 
events, etc.)

Introduction to Text Mining January 18, 2008 Slide 59


COMP527:
Data Mining
Fact Extraction

There will normally be a lot more text to parse:

Rob Sanderson, a lecturer at the University of Liverpool, teaches a 
masters level course on data mining (Comp527)

Rob is a lecturer
Rob is at the University of Liverpool
Rob teaches a course
The course is called Comp527
The course is masters level
The course is about data mining

Introduction to Text Mining January 18, 2008 Slide 60


COMP527:
Data Mining
Correlation

Rob Sanderson, a lecturer at the University of Liverpool, teaches a 
masters level course on data mining (Comp527)

Data mining is about finding models to describe data sets.

--> The University of Liverpool has a course about finding models to
describe data sets.

(Not very interesting or novel in this case, but that's the process)

Introduction to Text Mining January 18, 2008 Slide 61


COMP527:
Data Mining
Applications

Search engines of all types are based on IR.
But where would you use text mining?

Most research so far is on medical data sets ... because this is the most 
profitable!  If you could correlate facts to find a cure for cancer, you 
would be very VERY rich!  So ... lots of people are trying to do just 
that for various values of 'cancer'.

Also because of the wide availability of ontologies and datasets, in 
particular abstracts for medical journal articles (PubMed/Medline)

Introduction to Text Mining January 18, 2008 Slide 62


COMP527:
Data Mining
Applications

More application areas:

News feeds
Terrorism detection
Social sciences analysis
Historical text analysis
Corpus linguistics
'Net Nanny' filters
etc.

Introduction to Text Mining January 18, 2008 Slide 63


COMP527:
Data Mining
Further Reading


Weiss et al  Chapter 1 (and 2 if you're interested)

Baeza-Yates, Modern Information Retrieval, Chapter 1

Jackson and Moulinier, Natural Language Processing for Online 
Applications, Chapter 1 

http://www.jisc.ac.uk/publications/publications/pub_textmining.aspx

http://people.ischool.berkeley.edu/~hearst/text-mining.html

Introduction to Text Mining January 18, 2008 Slide 64


COMP527:
Data Mining
COMP527: Data Mining


General Data Mining Issues January 18, 2008 Slide 65


COMP527:
Data Mining
Today's Topics

Machine Learning?
Input to Data Mining Algorithms
Data types
Missing values
Noisy values
Inconsistent values
Redundant values
Number of values
Over-fitting / Under-fitting
Scalability
Human Interaction
Ethical Data Mining

General Data Mining Issues January 18, 2008 Slide 66


COMP527:
Data Mining
Machine Learning

What do we mean by 'learning' when applied to machines?

– Not just committing to memory (= storage)
– Can't require consciousness
– Learn facts (data), or processes (algorithms)?

“Things learn when they change their behaviour in a way that makes
them perform better” (Witten)

– Ties to future performance, not the act itself
– But things change behaviour for reasons other than 'learning'
– Can a machine have the Intent to perform better?

General Data Mining Issues January 18, 2008 Slide 67


COMP527:
Data Mining
Inputs

The aim of data mining is to learn a model for the data. This could be
called a concept of the data, so our outcome will be a concept
description.

Eg, the task is to classify emails as spam/not spam. The concept to
learn is the concept of 'what is spam?'

Input comes as instances.
Eg, the individual emails.
Instances have attributes.
Eg sender, date, recipient, words in text

General Data Mining Issues January 18, 2008 Slide 68


COMP527:
Data Mining
Inputs

Use attributes to determine what about an instance means that it
should be classified as a particular class. == Learning!

Obvious input structure: Table of instances (rows) and attributes (columns)

IRIS DATA   Sepal Length  Sepal Width  Petal Length  Petal Width
Flower1         5.1           3.5          1.4           0.2
Flower2         4.9           3.0          1.4           0.2
Flower3         4.7           3.2          1.3           0.2
FlowerN         5.0           3.6          1.4           0.2

General Data Mining Issues January 18, 2008 Slide 69


COMP527:
Data Mining
WEKA's ARFF Format

@relation Iris

@attribute sepal_length numeric
@attribute sepal_width numeric
@attribute petal_length numeric
@attribute petal_width numeric

@data
5.1, 3.5, 1.4, 0.2
4.9, 3.0, 1.4, 0.2
4.7, 3.2, 1.3, 0.2
5.0, 3.6, 1.4, 0.2
...

But what about non numeric data?

General Data Mining Issues January 18, 2008 Slide 70


COMP527:
Data Mining
Data Types

Nominal: Prespecified, finite number of values
eg: {cat, fish, dog, squirrel}
Includes boolean {true, false} and all enumerations.

Ordinal: Orderable, but no concept of distance
eg: hot > warm > cool > cold
Domain specific ordering, but no notion of how much hotter warm is
compared to cool.

General Data Mining Issues January 18, 2008 Slide 71


COMP527:
Data Mining
Data Types

Interval: Ordered, fixed unit
eg: 1990 < 1995 < 2000 < 2005
Difference between values makes sense (1995 is 5 years after 1990)
Sum does not make sense (1990 + 1995 = year 3985??)

Ratio: Ordered, fixed unit, relative to a zero point
eg: 1m, 2m, 3m, 5m
Difference makes sense (3m is 1m greater than 2m)
Sum makes sense (1m + 2m = 3m)

General Data Mining Issues January 18, 2008 Slide 72


COMP527:
Data Mining
ARFF Data Types

Nominal:
@attribute name {option1, option2, ... optionN}

Numeric:
@attribute name numeric -- real values

Other:
@attribute name string -- text fields
@attribute name date -- date fields (ISO-8601 format)

General Data Mining Issues January 18, 2008 Slide 73


COMP527:
Data Mining
Data Issues: Missing Values

The following issues will come up over and over again, but different
algorithms have different requirements.

What happens if we don't know the value for a particular attribute in
an instance?
For example, the data was never stored, was lost, or was not able to
be represented.

Maybe that data was important!
ARFF records missing values with a ? in the table

How should we process missing values?

General Data Mining Issues January 18, 2008 Slide 74


COMP527:
Data Mining
Missing Values

Possible 'solutions' for dealing with missing values:

– Ignore the instance completely. (eg class missing in training data set)
  Not a very useful solution if it is in the test data to be classified!
– Fill in values by hand
  Could be very slow, and likely to be impossible
– Global 'missingValue' constant
  Possible for enumerations, but what about numeric data?
– Replace with attribute mean
– Replace with class's attribute mean
– Train a new classifier to predict the missing value!
– Just leave as missing and require the algorithm to apply an
  appropriate technique

(A small sketch of the attribute-mean option follows below.)
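
A minimal Python sketch (not from the slides) of the 'replace with attribute mean' option; the attribute values, and the use of None to mark a missing value, are made up for illustration.

    # Replace missing values (None) in one numeric attribute with the attribute mean.
    ages = [23, None, 31, 45, None, 38]          # made-up attribute values

    known = [v for v in ages if v is not None]
    mean = sum(known) / len(known)

    filled = [v if v is not None else mean for v in ages]
    print(filled)   # missing entries become 34.25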

General Data Mining Issues January 18, 2008 Slide 75


COMP527:
Data Mining
Noisy Values

By 'noisy data' we mean random errors scattered in the data.
For example, due to inaccurate recording, or data corruption.

Some noise will be very obvious:
– data has incorrect type (string in numeric attribute)
– data does not match enumeration (maybe in yes/no field)
– data is very dissimilar to all other entries (10 in an attr otherwise 0..1)

Some incorrect values won't be obvious at all. Eg typing 0.52 at data
entry instead of 0.25.

General Data Mining Issues January 18, 2008 Slide 76


COMP527:
Data Mining
Noisy Values

Some possible solutions:

– Manual inspection and removal
– Use clustering on the data to find instances or attributes that lie
  outside the main body (outliers) and remove them
– Use regression to determine a function, then remove those that lie
  far from the predicted value
– Ignore all values that occur below a certain frequency threshold
– Apply a smoothing function over known-to-be-noisy data

If noise is removed, we can apply missing value techniques to it. If it
is not removed, it may adversely affect the accuracy of the model.

General Data Mining Issues January 18, 2008 Slide 77


COMP527:
Data Mining
Inconsistent Values

Some values may be recorded in different ways.
For example 'coke', 'coca cola', 'coca-cola', 'Coca Cola' etc etc

In this case, the data should be normalised to a single form.
Can be treated as a special case of noise.

Some values may be recorded inaccurately on purpose!
Email address: r.d.nospam.sanderson@...

Spike in early census data for births on 11/11/1911. Had to put in some
value, so defaulted to 1s everywhere. Ooops!
(Possibly urban legend?)

General Data Mining Issues January 18, 2008 Slide 78


COMP527:
Data Mining
Redundant Values

Just because the base data includes an attribute doesn't make it worth
giving to the data mining task.

For example, denormalise a typical commercial database and you might have:
ProductId, ProductName, ProductPrice, SupplierId, SupplierAddress...

SupplierAddress is dependent on SupplierId (remember SQL normalisation
rules?) so they will always appear together.
A 100% confidence, 100% support association rule is not very interesting!

General Data Mining Issues January 18, 2008 Slide 79


COMP527:
Data Mining
Number of Attributes

Is there any harm in putting in redundant values? Yes for association
rule mining, and ... yes for other data mining tasks too.

Can treat text as thousands of numeric attributes: term/frequency from
our inverted indexes.

But not all of those terms are useful for determining (for example) if
an email is spam. 'the' does not contribute to spam detection.

The number of attributes in the table will affect the time it takes the
data mining process to run. It is often the case that we want to run it
many times, so getting rid of unnecessary attributes is important.

General Data Mining Issues January 18, 2008 Slide 80


COMP527:
Data Mining
Number of Attributes/Values

Called 'dimensionality reduction'.

We'll look at techniques for this later in the course, but some
simplistic versions:

– Apply upper and lower thresholds of frequency
– Noise removal functions
– Remove redundant attributes
– Remove attributes below a threshold of contribution to classification
  (Eg if an attribute is evenly distributed, it adds no knowledge)

General Data Mining Issues January 18, 2008 Slide 81


COMP527:
Data Mining
Over-Fitting / Under-Fitting

Learning a concept must stop at the appropriate time.

For example, we could express the concept of 'Is Spam?' as a list of
spam emails. Any email identical to those is spam.
Accuracy: 0% on new data, 100% on training data.

Ooops! This is called Over-Fitting. The concept has been tailored too
closely to the training data.

Story: the US Military trained a neural network to distinguish tanks
vs rocks. It would shoot the US tanks they trained it on very
consistently and never shot any rocks ... or enemy tanks.
[probably fiction, but amusing]

General Data Mining Issues January 18, 2008 Slide 82


COMP527:
Data Mining
Over-Fitting / Under-Fitting

Extreme case of over-fitting:

The algorithm tries to learn a set of rules to determine the class.

Rule1: attr1=val1/1 and attr2=val2/1 and attr3=val3/1 = class1
Rule2: attr1=val1/2 and attr2=val2/2 and attr3=val3/2 = class2

Urgh. One rule for each instance is useless.

Need to prevent the learning from becoming too specific to the training
set, but also don't want it to be too broad. Complicated!

General Data Mining Issues January 18, 2008 Slide 83


COMP527:
Data Mining
Over-Fitting / Under-Fitting

Extreme case of under-fitting:

Always pick the most frequent class, ignore the data completely.

Eg: if one class makes up 99% of the data, then a 'classifier' that
always picks this class will be correct 99% of the time!

But probably the aim of the exercise is to determine the 1%, not the
99%... making it accurate 0% of the time when you need it.

General Data Mining Issues January 18, 2008 Slide 84


COMP527:
Data Mining
Scalability

We may be able to reduce the number of attributes, but most of the time
we're not interested in small 'toy' databases, but huge ones.

When there are millions of instances, and thousands of attributes,
that's a LOT of data to try to find a model for.

Very important that data mining algorithms scale well.
– Can't keep all data in memory
– Might not be able to keep all results in memory either
– Might have access to distributed processing?
– Might be able to train on a sample of the data?

General Data Mining Issues January 18, 2008 Slide 85


COMP527:
Data Mining
Human Interaction

Problem Exists Between Keyboard And Chair.

– Data Mining experts are probably not experts in the domain of the
data. Need to work together to find out what is needed, and formulate
queries
– Need to work together to interpret and evaluate results
– Visualisation of results may be problematic
– Integrating into the normal workflow may be problematic
– How to apply the results appropriately may not be clear
(eg Barbie + Chocolate?)

General Data Mining Issues January 18, 2008 Slide 86


COMP527:
Data Mining
Ethical Data Mining

Just because we can doesn't mean we should.

Should we include marital status, gender, race, religion or other
attributes about a person in a data mining experiment? Discrimination?
But sometimes those attributes are appropriate and important ...
medical diagnosis, for example.

What about attributes that are dependent on 'sensitive' attributes?
Neighbourhoods have different average incomes... are we discriminating
against the poor by using location?

Privacy issues? Data Mining across time? Government sponsored data mining?

General Data Mining Issues January 18, 2008 Slide 87


COMP527:
Data Mining
Further Reading

● Witten, Chapters 1,2


● Dunham Sections 1.3-1.5
● Han Sections 1.9, 11.4

General Data Mining Issues January 18, 2008 Slide 88


COMP527:
Data Mining
COMP527: Data Mining


Data Warehousing January 18, 2008 Slide 89


COMP527:
Data Mining
Today's Topics

Data Warehouses
Data Cubes
Warehouse Schemas
OLAP
Materialisation

Data Warehousing January 18, 2008 Slide 90


COMP527:
Data Mining
What is a Data Warehouse?

Most common definition:
“A data warehouse is a subject-oriented, integrated, time-variant and
nonvolatile collection of data in support of management's
decision-making process.” - W. H. Inmon

– Corporate focused, assumes a lot of data, and typically sales related
– Data for a “Decision Support System” or “Management Support System”
– 1996 survey: Return on Investment of 400+%

Data Warehousing: the process of constructing (and using) a data warehouse

Data Warehousing January 18, 2008 Slide 91


COMP527:
Data Mining
Data Warehouse

– Subject-oriented:
  ● Focused on important subjects, not transactions
  ● Concise view with only useful data for decision making

– Integrated:
  ● Constructed from multiple, heterogeneous data sources. Normally
    distributed relational databases, not necessarily with the same schema.
  ● Cleaning, pre-processing techniques applied for missing data, noisy
    data, inconsistent data (sounds familiar, I hope)

Data Warehousing January 18, 2008 Slide 92


COMP527:
Data Mining
Data Warehouse

– Time-variant:
  ● Has different values for the same fields over time.
  ● An operational database only has the current value. The Data
    Warehouse offers historical values.

– Nonvolatile:
  ● Physically separate store
  ● Updates not online, but in offline batch mode only
  ● Read only access required, so no concurrency issues

Data Warehousing January 18, 2008 Slide 93


COMP527:
Data Mining
Data Warehouse

Data Warehouses are distinct from:

● Distributed DB: Integrated via wrappers/mediators. Far too slow, and
  semantic integration is much more complicated. Integration is done
  before loading, not at run time.

● Operational DB: Only records the current value, with lots of extra
  non-useful information such as HR. Different schemas/models, access
  patterns, users and functions, even though the data is derived from
  an operational db.

Data Warehousing January 18, 2008 Slide 94


COMP527:
Data Mining
OLAP vs OLTP

OLAP: Online Analytical Processing (Data Warehouse)
OLTP: Online Transaction Processing (Traditional DBMS)

OLAP data is typically: historical, consolidated, and multi-dimensional
(eg: product, time, location).
Involves lots of full database scans, across terabytes or more of data.
Typically aggregation and summarisation functions.

Distinctly different uses to OLTP on the operational database.

Data Warehousing January 18, 2008 Slide 95


COMP527:
Data Mining
Data Cubes

Data is normally Multi-Dimensional, and can be thought of as a cube.

Often: 3 dimensions of time, location and product.

No need to have just 3 dimensions -- could have one for cars with make,
colour, price, location, and time, for example.

[Image courtesy of IBM OLAP Miner documentation]

Data Warehousing January 18, 2008 Slide 96


COMP527:
Data Mining
Data Cubes

– Can construct many 'cuboids' from the full cube by excluding dimensions.
– In an N dimensional data cube, the cuboid with N dimensions is the
  'base cuboid'. A 0 dimensional cuboid (other than non existent!) is
  called the 'apex cuboid'.
– Can think of this as a lattice of cuboids...

(Following lattice courtesy of Han & Kamber)

Data Warehousing January 18, 2008 Slide 97


COMP527:
Data Mining
Lattice of Cuboids

0-D (apex) cuboid:  all
1-D cuboids:        time; item; location; supplier
2-D cuboids:        (time,item); (time,location); (item,location);
                    (location,supplier); (time,supplier); (item,supplier)
3-D cuboids:        (time,item,location); (time,location,supplier);
                    (time,item,supplier); (item,location,supplier)
4-D (base) cuboid:  (time, item, location, supplier)

Data Warehousing January 18, 2008 Slide 98


COMP527:
Data Mining
Multi-dimensional Units

Each dimension can also be thought of in terms of different units.
– Time: decade, year, quarter, month, day, hour (and week, which isn't
  strictly hierarchical with the others!)
– Location: continent, country, state, city, store
– Product: electronics, computer, laptop, dell, inspiron

This is called a “Star-Net” model in data warehousing, and allows for
various operations on the dimensions and the resulting cuboids.

Data Warehousing January 18, 2008 Slide 99


COMP527:
Data Mining
Star-Net Model

[Star-Net diagram: radial dimension axes for Customer Orders, Shipping
Method, Time, Product, Geography, Promotion and Organization, each
marked with its units, eg CONTRACTS / ORDER, AIR-EXPRESS / TRUCK,
ANNUALY / QTRLY / DAILY, PRODUCT ITEM / PRODUCT GROUP / PRODUCT LINE,
DISTRICT / REGION / COUNTRY, SALES PERSON / DISTRICT / DIVISION.]

Data Warehousing January 18, 2008 Slide 100


COMP527:
Data Mining
Data Cube Operations

– Roll Up: Summarise data by climbing up a hierarchy.
  Eg. From monthly to quarterly, from Liverpool to England
– Drill Down: Opposite of Roll Up
  Eg. From computer to laptop, from £100-999 to £100-199
– Slice: Remove a dimension by setting a value for it
  Eg. location/product where time is Q1,2007
– Dice: Restrict the cube by setting values for multiple dimensions
  Eg. Q1,Q2 / North American cities / 3 products sub cube
– Pivot: Rotate the cube (mostly for visualisation)

(A small sketch of roll-up and slice follows below.)
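
A minimal sketch (not from the slides) of roll-up and slice using the pandas library; the tiny sales table and its column names are made up.

    import pandas as pd

    # Made-up sales "fact table" with time, location and product dimensions.
    sales = pd.DataFrame({
        "quarter": ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
        "city":    ["Liverpool", "London", "Liverpool", "London", "Liverpool", "London"],
        "product": ["laptop", "laptop", "phone", "phone", "phone", "laptop"],
        "units":   [10, 7, 4, 6, 3, 8],
    })

    # Roll up: summarise from (quarter, city, product) to (quarter, product).
    rolled_up = sales.groupby(["quarter", "product"])["units"].sum()

    # Slice: fix one dimension (time = Q1) and keep the rest.
    q1_slice = sales[sales["quarter"] == "Q1"]

    print(rolled_up)
    print(q1_slice)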

Data Warehousing January 18, 2008 Slide 101


COMP527:
Data Mining
Data Cube Schemas

– Star Schema: Single fact table in the middle, with a connected set of
  dimension tables (hence a star)
– Snowflake Schema: Some of the dimension tables further refined into
  smaller dimension tables (hence looks like a snowflake)
– Fact Constellation: Multiple fact tables can share dimension tables
  (hence looks like a collection of star schemas; also called Galaxy Schema)

Data Warehousing January 18, 2008 Slide 102


COMP527:
Data Mining
Star Schema

[Star schema diagram: a Sales Fact Table (time_key, item_key,
location_key, with units_sold as the measure) linked to a Time
Dimension (time_key, day, day_of_week, month, quarter, year), an Item
Dimension (item_key, name, brand, type, supplier_type) and a Location
Dimension (location_key, street, city, state, country, continent).]

Data Warehousing January 18, 2008 Slide 103


COMP527:
Data Mining
Snowflake Schema

[Snowflake schema diagram: the same Sales Fact Table, but the Item
Dimension is refined to hold a supplier_key, and the Location Dimension
holds a city_key pointing to a separate City Dimension (city_key, city,
state, country).]

Data Warehousing January 18, 2008 Slide 104


COMP527:
Data Mining
Fact Constellation

[Fact constellation diagram: the Sales Fact Table plus a second
Shipping Table (time_key, item_key, from_key, location_key,
units_shipped) sharing the Time, Item and Location dimension tables.]

Data Warehousing January 18, 2008 Slide 105


COMP527:
Data Mining
OLAP Server Types

ROLAP: Relational OLAP
● Uses a relational DBMS to store and manage the warehouse data
● Optimised for non-traditional access patterns
● Lots of research into RDBMS to make use of!

MOLAP: Multidimensional OLAP
● Sparse array based storage engine
● Fast access to precomputed data

HOLAP: Hybrid OLAP
● Mixture of both MOLAP and ROLAP

Data Warehousing January 18, 2008 Slide 106


COMP527:
Data Mining
Data Warehouse Architecture

[Data warehouse architecture diagram (courtesy of Han & Kamber):
Data Sources (operational DBs and other sources) feed an Extract /
Transform / Load / Refresh process, with a monitor, integrator and
metadata, into the Data Warehouse and Data Marts (Data Storage); an
OLAP Server (OLAP Engine) then serves Front-End Tools for analysis,
querying, reporting and data mining.]

Data Warehousing January 18, 2008 Slide 107


COMP527:
Data Mining
Materialisation

In order to compute OLAP queries efficiently, we need to materialise
some of the cuboids from the data.
● None: Very slow, as the entire cube must be computed at run time
● Full: Very fast, but requires a LOT of storage space and time to
  compute all possible cuboids
● Partial: But which ones to materialise? Called an 'iceberg cube', as
  it is only partially materialised and the rest is "below water".
  Many cells in a cuboid will be empty, so only materialise sections
  that contain more values than a minimum threshold.

Data Warehousing January 18, 2008 Slide 108


COMP527:
Data Mining
Further Reading


Han, Chapters 3,4

Dunham Sections 2.1, 2.6, 2.7

Berry and Linoff, Chapter 15

Inmon, Building the Data Warehouse

Inmon, Managing the Data Warehouse


http://en.wikipedia.org/wiki/Data_warehouse
and subsequent links 

Data Warehousing January 18, 2008 Slide 109


COMP527:
Data Mining
COMP527: Data Mining


Classification: Challenges, Basics January 18, 2008 Slide 110


COMP527:
Data Mining
Today's Topics

Classification

Basic Algorithms:

KNN
Perceptron
Winnow

Classification: Challenges, Basics January 18, 2008 Slide 111


COMP527:
Data Mining
Classification

Main Idea: Learn the concept of what it means to be part of a named
class of instances.

Called Supervised Learning, as it learns by example from data which is
already classified correctly.
Often called the Class Label attribute, hence it learns from Labeled data.

Two main phases:
● Training: Learn the classification model from labeled data
● Prediction: Use the pre-built model to classify new instances

Classification: Challenges, Basics January 18, 2008 Slide 112


COMP527:
Data Mining
Classification Accuracy

We need to use previously unseen instances to test a classifier.
● Over-fitting is the main problem. Classifiers will often learn too
  specific a model, and testing on data that was used in training
  would reinforce this problem.
● Need to split the data set into training and testing.

Revised phases for accuracy estimation:
● Split the data set into distinct Training and Testing sets
● Build the classifier with the Training set
● Assess accuracy with the Testing set – normally expressed as %

(A small sketch of this train/test split follows below.)
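
As one concrete way to see these phases (not from the slides), here is a minimal sketch using the scikit-learn library and the Iris data seen earlier; any classifier could stand in for the k-nearest-neighbour one used here.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)

    # Split into distinct training and testing sets.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Build the classifier with the training set only.
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_train, y_train)

    # Assess accuracy on the unseen testing set, expressed as a percentage.
    predictions = model.predict(X_test)
    print(f"Accuracy: {100 * accuracy_score(y_test, predictions):.1f}%")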

Classification: Challenges, Basics January 18, 2008 Slide 113


COMP527:
Data Mining
Comparing Methods

Accuracy
● Percent of instances classified correctly.

Speed
● Computational cost of both learning model and

predicting classes
Robustness
● Ability to cope with noisy or missing data

Scalability
● Ability to cope with very large amounts of data

Interpretability
● Is the model understandable to a human, or otherwise

useful?

Classification: Challenges, Basics January 18, 2008 Slide 114


COMP527:
Data Mining
Classification vs Prediction

Classification predicts a class label from a given finite set.
● The label is a nominal attribute, so unordered and enumerable
● Some algorithms predict probability for more than one label
● Sometimes called a categorical attribute (eg H&K)

Prediction predicts a number instead of a label.
● Ordered and infinite set of possible outcomes
● Also often called Regression or Numeric Prediction
● Often viewed as a function

Classification: Challenges, Basics January 18, 2008 Slide 115


COMP527:
Data Mining
Eager vs Lazy Learners

Eager Learner: Constructs the model when it receives the training data.
Builds a model likely to be very different in structure to the data.

Lazy Learner: Doesn't construct a model when training, only when
classifying new instances.
Does only enough work to ensure that data can be compared later.
Sometimes called instance-based learners.

Most classifiers are Eager, but there's an important Lazy classifier
called 'KNN' – K Nearest Neighbour

Classification: Challenges, Basics January 18, 2008 Slide 116


COMP527:
Data Mining
But First...

Which group, left or right, for these two flowers?

(Experiment reported on in Cognitive Science, 2002)

Classification: Challenges, Basics January 18, 2008 Slide 117


COMP527:
Data Mining
Resemblance

People classify things by finding other items that are similar which
have already been classified.
For example: Is a new species a bird? Does it have the same attributes
as lots of other birds? If so, then it's probably a bird too.

A combination of rote memorisation and the notion of 'resembles'.

Although kiwis can't fly like most other birds, they resemble birds
more than they resemble other types of animals.

So the problem is to find which instances most closely resemble the
instance to be classified.

Classification: Challenges, Basics January 18, 2008 Slide 118


COMP527:
Data Mining
KNN: Distance Measures

Distance (or similarity) between instances is easy if the data


is numeric.

Typically use Euclidean distance:

d = √( (x1i-x1j)² + (x2i-x2j)² + ... )

Also Manhattan / City Block distance:

d = |x1i-x1j| + |x2i-x2j| + ...

However we should normalise all of the values to the same


scale first. Otherwise income will overpower age, for
example.
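
For concreteness, a small Python sketch of the two measures (my own
illustration, not from the course), with a simple min-max normalisation
applied first; the ages and incomes are made-up example values:

import math

def normalise(data):
    # rescale every column of a list of numeric vectors to 0..1
    cols = list(zip(*data))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in data]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

data = normalise([[25, 20000], [40, 60000], [33, 31000]])   # [age, income]
print(euclidean(data[0], data[1]), manhattan(data[0], data[1]))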

Classification: Challenges, Basics January 18, 2008 Slide 119


COMP527:
Data Mining
KNN: Non Numeric Distance

For nominal attributes, we can only compare whether the


value is the same or not. Equally, this can be done by
dividing enumerations into many boolean attributes.

Might be able to convert to attributes between which


distance can be determined by some function. Eg colour,
temperature.

Text can be treated as one attribute per word, with the


frequency as the value, normalised to 0..1, and preferably
with very high frequency words ignored (eg the, a, as,
is...)

Classification: Challenges, Basics January 18, 2008 Slide 120


COMP527:
Data Mining
KNN: Classification

Classification process is then straight forward:

– Find the k closest instances to the test instance


– Predict the most common class among those instances
– Or predict the mean, for numeric prediction

What value to use for k?


● Depends on dataset size. Large datasets need a higher k,
whereas a high k on a small dataset risks crossing class
boundaries
● Calculate accuracy on test set for increasing value of k,

and use a hill climbing algorithm to find the best.


● Typically use an odd number to help avoid ties
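
A minimal Python sketch of the prediction step described above, assuming
the instances are already-normalised numeric vectors paired with a label:

import math
from collections import Counter

def knn_predict(train, x, k=5):
    dist = lambda a, b: math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    neighbours = sorted(train, key=lambda inst: dist(inst[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]            # majority class of the k

train = [([0.1, 0.2], 'blue'), ([0.9, 0.8], 'red'), ([0.2, 0.1], 'blue')]
print(knn_predict(train, [0.15, 0.18], k=3))     # -> 'blue'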

Classification: Challenges, Basics January 18, 2008 Slide 121


COMP527:
Data Mining
KNN: Classification

5-NN – find 5 closest to black point, 3 blue and 2 red, so


predict blue

Classification: Challenges, Basics January 18, 2008 Slide 122


COMP527:
Data Mining
KNN: Classification

Classification can be very slow to find the k nearest


instances.
– In a trivial implementation, it could take |D| comparisons.
– Using indexing it can easily be improved.
– Also easy to parallelise, as each comparison is completely
independent of the others.

Can remove instances from the data set that do not help, for
example a tight cluster of 1000 instances of the same
class is unnecessary for k<50

Can also use advanced data structures to improve the


speed of classification, by storing the instance information
appropriately.

Classification: Challenges, Basics January 18, 2008 Slide 123


COMP527:
Data Mining
KNN: kD-Trees

KD-Tree is a binary tree that divides the input space with a


plane, then splits each such partition recursively. Each
split is made parallel to an axis and through an instance.

Typical strategy is to find the point closest to the mean in


the current partition and split through it, along a different
axis to the previous split. (Actually on the axis with the
greatest variance)

Then to search, descend the tree to the leaf partition that


contains the test instance. Search only that partition, then
if an edge is closer than any of the k closest instances,
search the parent partitions as well.
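
A compact Python sketch of the idea (an illustration only: it splits at the
median point on alternating axes rather than the closest-to-mean /
greatest-variance strategy above, and it finds just the single nearest
neighbour):

import math

def build(points, depth=0):
    if not points:
        return None
    axis = depth % len(points[0])                 # cycle through the axes
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                        # split through the median point
    return {'point': points[mid], 'axis': axis,
            'left': build(points[:mid], depth + 1),
            'right': build(points[mid + 1:], depth + 1)}

def nearest(node, target, best=None):
    if node is None:
        return best
    d = math.dist(node['point'], target)
    if best is None or d < best[1]:
        best = (node['point'], d)
    diff = target[node['axis']] - node['point'][node['axis']]
    near, far = (node['left'], node['right']) if diff < 0 else (node['right'], node['left'])
    best = nearest(near, target, best)            # search the instance's own partition
    if abs(diff) < best[1]:                       # edge closer than best: check sibling
        best = nearest(far, target, best)
    return best

tree = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest(tree, (6, 5)))                      # -> ((5, 4), ...)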

Classification: Challenges, Basics January 18, 2008 Slide 124


COMP527:
Data Mining
KNN: kD-Trees

First split at instance


(7,4) then again at
(6,7) which divides
the search space
into more easily
searchable sections.

Classification: Challenges, Basics January 18, 2008 Slide 125


COMP527:
Data Mining
KNN: kD-Trees

Then to classify the star, descend


into the section with both star and
the black instance.
But note that the instance in the
other section is closer, so we still
must check the adjacent area.

Note that the shaded area is the


black node's sibling and hence
cannot contain closer points.

Classification: Challenges, Basics January 18, 2008 Slide 126


COMP527:
Data Mining
Perceptron and Winnow

Two very simple eager methods: Perceptron and Winnow.


They both use the idea of a single neuron that fires when
given the right stimuli. (We'll look at this idea again later
under Neural Networks)

First thing to keep in mind is that the input to the


perceptron must be a vector of numbers.
Secondly, that it can only answer a 2 class problem – either
the neuron fires (class 1) or it doesn't (class 2).

Classification: Challenges, Basics January 18, 2008 Slide 127


COMP527:
Data Mining
Perceptron and Winnow

The square boxes are inputs, the w lines are weights and
the circle is the perceptron. The learning problem is to
find the correct weights to apply to the attributes.

The bias is an extra input fixed
at the value 1; its weight is
learnt in the same way as the
other weights, so that the
perceptron can simply check
whether the weighted sum is > 0
to decide if it should fire.

Classification: Challenges, Basics January 18, 2008 Slide 128


COMP527:
Data Mining
Perceptron

For each attribute, we have an input node. Then there is


one output node to which all of them connect, with a
weight on each connection.
We can then multiply weight by value, and add them all
up...
w0a0 + w1a1 + ... + wnan

Make it an equation equal to 0 and it's the equation for a


hyperplane. So essentially we are learning the
hyperplane that separates the two classes. Then
classification is just checking which side of the plane the
instance falls on.
But how do we learn the weights?

Classification: Challenges, Basics January 18, 2008 Slide 129


COMP527:
Data Mining
Perceptron

Remember that instances are a set of numeric attributes (a


vector). We can also treat the weights on the connections
as a vector. We only want to classify between two classes.
So:

weightVector = [0, ..., 0]
while classificationFailed,
    for each training instance I,
        if not classify(I) == I.class,
            if I.class == class1:
                weightVector += I
            else:
                weightVector -= I

No complicated higher math here!
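
The same algorithm as runnable Python (a sketch: classes are taken to be
+1 / -1, and the bias input of 1 from the previous slide is prepended to
every instance):

def train_perceptron(instances, classes, max_epochs=100):
    w = [0.0] * (len(instances[0]) + 1)           # +1 for the bias weight
    for _ in range(max_epochs):
        mistakes = 0
        for x, c in zip(instances, classes):
            x = [1.0] + list(x)                   # bias input fixed at 1
            fired = sum(wi * xi for wi, xi in zip(w, x)) > 0
            if fired != (c == 1):                 # misclassified
                mistakes += 1
                sign = 1 if c == 1 else -1
                w = [wi + sign * xi for wi, xi in zip(w, x)]
        if mistakes == 0:                         # separable data: we're done
            break
    return w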

Classification: Challenges, Basics January 18, 2008 Slide 130


COMP527:
Data Mining
Winnow

Winnow only updates when it finds a misclassified instance, and uses


multiplication to do the update rather than addition. It only works
when the attribute values are also binary. (1 or 0)

delta = (user defined)
while classificationFailed,
    for each instance I,
        if classify(I) != I.class,
            if I.class == class1,
                for each attribute ai in I,
                    if ai == 1, wi *= delta
            else,
                for each attribute ai in I,
                    if ai == 1, wi /= delta
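
And a Winnow sketch in Python (attributes are 0/1; the firing threshold,
commonly n/2, is an assumption since the pseudocode leaves classify()
unspecified):

def train_winnow(instances, classes, delta=2.0, max_epochs=100):
    n = len(instances[0])
    w = [1.0] * n                                 # weights start at 1
    threshold = n / 2.0                           # assumed firing threshold
    for _ in range(max_epochs):
        mistakes = 0
        for x, c in zip(instances, classes):      # c is True for class1
            fired = sum(wi * xi for wi, xi in zip(w, x)) > threshold
            if fired != c:
                mistakes += 1
                for i, xi in enumerate(x):
                    if xi == 1:
                        w[i] = w[i] * delta if c else w[i] / delta
        if mistakes == 0:
            break
    return w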

Classification: Challenges, Basics January 18, 2008 Slide 131


COMP527:
Data Mining
Further Reading

● Witten, Section 3.8, and pp 124-136


● Han, Sections 6.1, 6.9
● Dunham Sections 4.1-4.3
● Berry and Linoff, Chapter 8
● Berry and Browne, Chapter 6
● Devijver and Kittler, Pattern Recognition: A Statistical Approach,
Chapter 3
● For KNN and Perceptron, Wikipedia, as always :)

Classification: Challenges, Basics January 18, 2008 Slide 132


COMP527:
Data Mining
COMP527: Data Mining

Introduction to the Course Input Preprocessing
Introduction to Data Mining Attribute Selection
Introduction to Text Mining Association Rule Mining
General Data Mining Issues ARM: A Priori and Data Structures
Data Warehousing ARM: Improvements
Classification: Challenges, Basics ARM: Advanced Techniques
Classification: Rules Clustering: Challenges, Basics
Classification: Trees Clustering: Improvements
Classification: Trees 2 Clustering: Advanced Algorithms
Classification: Bayes Hybrid Approaches
Classification: Neural Networks Graph Mining, Web Mining
Classification: SVM Text Mining: Challenges, Basics
Classification: Evaluation Text Mining: Text­as­Data
Classification: Evaluation 2 Text Mining: Text­as­Language
Regression, Prediction Revision for Exam

Classification: Rules January 18, 2008 Slide 133


COMP527:
Data Mining
Today's Topics

Introduction
Rule Sets vs Rule Lists
Constructing Rules-based Classifiers
1R
PRISM
Reduced Error Pruning
RIPPER
Rules with Exceptions

Classification: Rules January 18, 2008 Slide 134


COMP527:
Data Mining
Rules-Based Classifiers

Idea:  Learn a set of rules from the data. Apply those rules to 
determine the class of the new instance.

For example:
R1. If blood-type=Warm and lay-eggs=True then Bird
R2. If blood-type=Cold and flies=False then Reptile
R3. If blood-type=Warm and lay-eggs=False then Mammal

Hawk: flies=True, blood-type=Warm, lay-eggs=True,


class=???

R1 is True, so the classifier predicts that Hawk = Bird.


Yay!

Classification: Rules January 18, 2008 Slide 135


COMP527:
Data Mining
Rules-Based Classifiers

A rule r covers an instance x if the attributes of the instance satisfy 
the condition of the rule.

The coverage of a rule is the percentage of records that


satisfy the condition.

The accuracy of a rule is the percentage of covered records


that satisfy the condition and the conclusion.

For example, a rule might cover 10/50 records (coverage


20%) of which 8 are correct (accuracy 80%).

Classification: Rules January 18, 2008 Slide 136


COMP527:
Data Mining
Rule Set vs Rule List

Rules can either be grouped as a set or an ordered list.

Set:
The rules make independent predictions.
Every record is covered by 0..1 rules (hopefully 1!)

R1. If flies=True and lays-eggs=True and lives-in-


water=False then Bird
R2. If flies=False and lives-in-water=True and lays-
eggs=True then Fish
R3. If blood-type=Warm and lays-eggs=False then Mammal
R4. If blood-type=Cold and lays-eggs=True then Reptile

Doesn’t matter which order we evaluate these rules.

Classification: Rules January 18, 2008 Slide 137


COMP527:
Data Mining
Rule Set vs Rule List

List:  
The rules make dependent predictions.
Every record is covered by 0..* rules (hopefully 1..*!)

R1. If flies=True and lays-eggs=True then Bird


R2. If blood-type=Warm and lays-eggs=False then Mammal
R3. If lives-in-water=True then Fish
R4. If lays-eggs=True then Reptile

Does matter which order we evaluate these rules.

If all records are covered by at least one rule, then rule set
or list is considered Exhaustive.

Classification: Rules January 18, 2008 Slide 138


COMP527:
Data Mining
Constructing Rules-Based Classifiers

Covering approach:  At each stage, a rule is found that covers 
some instances.

“Separate and Conquer” -- Choose rule that identifies many


instances, separate them out, repeat.

But first a very very simple classifier called “1R”.


1R because the rules all test one particular attribute.

Classification: Rules January 18, 2008 Slide 139


COMP527:
Data Mining
1R Classifier

Idea: Construct one rule for each attribute/value combination 
predicting the most common class for that combination.
Example Data:
Outlook Temperature Humidity Windy Play?
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no

Classification: Rules January 18, 2008 Slide 140


COMP527:
Data Mining
1R Classifier

Rules generated:
Attribute Rules Errors Total Errors
Outlook sunny » no 2/5 4/14
overcast » yes 0/4
rainy » yes 2/5
Temperature hot » no 2/4 (random) 5/14
mild » yes 2/6
cool » yes 1/4
Humidity high » no 3/7 4/14
normal » yes 1/7
Windy false » yes 2/8 5/14
true » no 3/6 (random)

Now choose the attribute with the fewest errors. Randomly


decide on Outlook. So 1R will simply use the outlook
attribute to predict the class for new instances.

Classification: Rules January 18, 2008 Slide 141


COMP527:
Data Mining
1R Algorithm

foreach attribute,
    foreach value of that attribute,
        find class distribution for attr/value
        conc = most frequent class
        make rule: attribute=value -> conc
    calculate error rate of ruleset
select ruleset with lowest error rate

Almost not worth wasting a slide on, really!
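
Still, here it is as a Python sketch (instances are rows of nominal values,
with the class in a given column):

from collections import Counter, defaultdict

def one_r(instances, class_index):
    best = None
    for a in range(len(instances[0])):
        if a == class_index:
            continue
        dist = defaultdict(Counter)               # value -> class counts
        for row in instances:
            dist[row[a]][row[class_index]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in dist.items()}
        errors = sum(sum(c.values()) - c[rules[v]] for v, c in dist.items())
        if best is None or errors < best[0]:
            best = (errors, a, rules)
    return best                                   # (total errors, attribute, value -> class)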

Classification: Rules January 18, 2008 Slide 142


COMP527:
Data Mining
PRISM Classifier

Instead of always looking to the full data set, after


constructing each rule we could remove the instances
that the rule covers before looking for a new rule.

Start with a high coverage rule, then increase its accuracy


by adding more conditions to it.

Want to maximise the accuracy of each rule: maximise the


ratio of positive instances/covered instances.

Finished adding conditions when p/t = 1, or no more


instances to look at

Classification: Rules January 18, 2008 Slide 143


COMP527:
Data Mining
PRISM Classifier

Following Witten (pg 6, 108+)


If X then recommendation=hard
Find highest coverage ratio condition for X:

Age = Young                 2/8
Age = Pre­presbyopic        1/8
Age = Presbyopic            1/8
Prescription = Myope        3/12
Prescription = Hypermetrope 1/12
Astigmatism = no            0/12
Astigmatism = yes           4/12
Tear­Production = Reduced   0/12
Tear­Production = Normal    4/12

Select astigmatism = yes


(arbitrarily over Tear-Production = Normal)

Classification: Rules January 18, 2008 Slide 144


COMP527:
Data Mining
PRISM Classifier

This covers:
Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young  Myope Yes Reduced None
Young Myope Yes Normal Hard
Young Hypermetrope Yes Reduced None
Young Hypermetrope Yes Normal hard
Pre­presbyopic Myope Yes Reduced None
Pre­presbyopic Myope Yes Normal Hard
Pre­presbyopic  Hypermetrope Yes Reduced None
Pre­presbyopic Hypermetrope Yes Normal None
Presbyopic Myope Yes Reduced None
Presbyopic Myope Yes Normal Hard
Presbyopic Hypermetrope Yes Reduced None
Presbyopic Hypermetrope Yes Normal None

Now need to add another condition to make it more accurate.

If astigmatism = yes and X then recommendation = hard

Classification: Rules January 18, 2008 Slide 145


COMP527:
Data Mining
PRISM Classifier

Best condition is Tear-Production = normal (4/6)


New rule: astigmatism=yes and tear-production = normal

But still some inaccuracy...

Age=Young (2/2) or Prescription = Myope (3/3) both have


100% ratio in remaining instances. Choose the greater
coverage.

If astigmatism = yes and tear-production = normal and


prescription = myope then recommendation = hard
Repeat the process, removing the instances covered by this
rule.
Then repeat for all classes.

Classification: Rules January 18, 2008 Slide 146


COMP527:
Data Mining
PRISM Classifier

Try with the other example data set.  If X then play=yes
Outlook Temperature Humidity Windy Play?
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no

Outlook=overcast is (4/4). Already perfect, so remove the
instances it covers and look again.

Classification: Rules January 18, 2008 Slide 147


COMP527:
Data Mining
PRISM Classifier

With reduced dataset,  if X then play=yes
Outlook Temperature Humidity Windy Play?
sunny hot high false no
sunny hot high true no
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
rainy mild high true no

Coverage ratios for play=yes on this reduced set:
sunny 2/5    rainy 3/5    hot 0/2     mild 3/5    cool 2/3
high 1/5     normal 4/5   false 4/6   true 1/4

Select humidity=normal (4/5) and look for another rule, as it is
not perfect.

Classification: Rules January 18, 2008 Slide 148


COMP527:
Data Mining
PRISM Classifier

If humidity=normal and X then play=yes
Outlook Temperature Humidity Windy Play?
rainy cool normal false yes
rainy cool normal true no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes

If we could use 'and-not' we could have:
and-not (temperature=cool and windy=true)
But instead:
rainy (2/3), sunny (2/2), cool (2/3), mild (2/2), false(3/3),
true (1/2)
So we select windy=false to maximise t and add that to the
rule.

Classification: Rules January 18, 2008 Slide 149


COMP527:
Data Mining
PRISM Algorithm

for each class C
    initialise E to the complete instance set
    while E contains instances with class C
        create empty rule R: if X then C
        until R is perfect (or no more attributes)
            for each attribute A not in R, and each value v,
                consider adding A=v to R
            select A and v to maximise accuracy p/t
            add A=v to R
        remove instances covered by R from E
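
A Python sketch of the same covering loop (instances are assumed to be
dicts of attribute -> value plus a 'class' key; ties and safeguards are
kept to a minimum):

def prism(instances, target_class):
    rules, E = [], list(instances)
    while any(i['class'] == target_class for i in E):
        rule, covered = {}, list(E)
        while any(i['class'] != target_class for i in covered):
            best = None
            for a in covered[0]:
                if a == 'class' or a in rule:
                    continue
                for v in set(i[a] for i in covered):
                    subset = [i for i in covered if i[a] == v]
                    p = sum(i['class'] == target_class for i in subset)
                    score = (p / len(subset), p)  # maximise p/t, then coverage
                    if best is None or score > best[0]:
                        best = (score, a, v)
            if best is None:                      # no attributes left to add
                break
            rule[best[1]] = best[2]
            covered = [i for i in covered if i[best[1]] == best[2]]
        rules.append(rule)
        E = [i for i in E if not all(i[a] == v for a, v in rule.items())]
    return rules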

Classification: Rules January 18, 2008 Slide 150


COMP527:
Data Mining
Issues with PRISM

Over-fitting. As we saw, the rule covered 4/5 and we gave up a
positive instance just to get rid of the one negative instance.
But with more examples it might have been 199/200, and we might
need to give up 40 positives to remove that one negative...
that's crazy.

Measure 2: Information Gain


p * (log(p/t) - log (P/T))
p/t as before (positive over total)
P and T are positive and total before new condition.

Emphasises large number of positive examples. Use this in


PRISM in place of maximising p/t.

Classification: Rules January 18, 2008 Slide 151


COMP527:
Data Mining
Over-Fitting Avoidance

We could split the training set into a growing set and a


pruning set.
Grow rules out using the first, and then try to cut the rules
back with the pruning set.

Two strategies:

Reduced-error pruning: Build full rule set then prune


rules

Incremental reduced-error pruning: Simplify rules as


built
Can re-split the data after each rule. Let's look at this one.

Classification: Rules January 18, 2008 Slide 152


COMP527:
Data Mining
Incremental Reduced Error Pruning

initialise E to instance set
until E is empty:
    split E into Grow and Prune (ratio 2:1)
    for each class C in Grow
        generate best rule for C
        using Prune:
            calc worth(R) and worth(R - finalCondition)
            while worth(R-) > worth(R), prune rule
    from rules for different classes, select largest worth(R)
    remove instances covered by rule

Classification: Rules January 18, 2008 Slide 153


COMP527:
Data Mining
Rule Worth?

How can we generate the worth of the rules? (Witten 203)

– (p + (N - n)) / T
● (true positives + true negatives) / total number of
instances
● = (positivesCovered + (totalNegatives - negativesCovered)) /
totalInstances
● p=2000, t=3000 --> (1000 + N) / T
● p=1000, t=1001 --> (999 + N) / T
(so it prefers the first, much less accurate, rule)

– p/t
● Same problem in reverse: prefers p=1, t=1 over p=1000, t=1001

– Is there a simple but intuitive measure of worth?

Classification: Rules January 18, 2008 Slide 154


COMP527:
Data Mining
Issue with Grow/Prune Splitting

Say we have 1000 examples, and we split 2:1 for train/test


(666,334), then 2:1 for grow/prune (444,222) ... we're
building our rules on less than half of our data!

Depending on the dataset, classes may be absent from the


training set, or the distributions may be very wrong, or
any number of other statistical problems with random
sampling to this degree.

Ameliorated in Incremental as re-split often. But might still


want to perform the algorithm several times and pick the
best.

Classification: Rules January 18, 2008 Slide 155


COMP527:
Data Mining
RIPPER

Rules-based classifier from industry.

If 2 classes, then learn rules for one and default the other
If more than 2 classes, start with smallest until you have 2.

Information Gain to grow rules


Measure for pruning: (p - n) / (p + n)  (positive/negative
examples covered in pruning set)

Uses 'Description Length' metric -- Ockham's Razor says


that the simplest solution is the best, so here the simplest
rule set is the best. (Not going into how to calculate this)

Classification: Rules January 18, 2008 Slide 156


COMP527:
Data Mining
RIPPER Algorithm

Repeated Incremental Pruning to Produce Error Reduction

split E into Grow/Prune
BUILD:
    repeat until no examples, or DL of ruleset > minDL(rulesets)+64, or error > 50%
        GROW:  add conditions until rule is 100% by IG
        PRUNE: prune last to first while worth metric W increases
for each rule R, for each class C:
    split E into Grow/Prune
    remove all instances from Prune covered by other rules
    GROW and PRUNE two competing rules:
        R1 is a new rule built from scratch
        R2 is generated by adding conditions to R
        prune using worth metric A on the reduced dataset
    replace R by R, R1 or R2 with smallest DL
    if uncovered instances of C, return to BUILD to make more rules
calculate DL for ruleset and ruleset with each rule omitted, delete any rule
that increases the DL
remove instances covered by the rules generated

DL = Description Length, Metric W = (p+1)/(t+2), Metric A = (p+N-n)/T

Classification: Rules January 18, 2008 Slide 157


COMP527:
Data Mining
Rules with Exceptions

If we get more data after a ruleset has been generated, it might be 
useful to add exceptions to rules.

If X then class1 unless Y then class2

Consider our humidity rule:
if humidity=normal then play=yes 
unless temperature=cool and windy=true then play = no

Exceptions developed with the Induct system, called 'ripple­down 
rules'

Classification: Rules January 18, 2008 Slide 158


COMP527:
Data Mining
Further Reading


Witten, Sections 3.3, 3.5, 3.6, 4.1, 4.4

Dunham Section 4.6

Han, Section 6.5

Berry and Browne, Chapter 8

Classification: Rules January 18, 2008 Slide 159


COMP527:
Data Mining
COMP527: Data Mining

Introduction to the Course Input Preprocessing
Introduction to Data Mining Attribute Selection
Introduction to Text Mining Association Rule Mining
General Data Mining Issues ARM: A Priori and Data Structures
Data Warehousing ARM: Improvements
Classification: Challenges, Basics ARM: Advanced Techniques
Classification: Rules Clustering: Challenges, Basics
Classification: Trees Clustering: Improvements
Classification: Trees 2 Clustering: Advanced Algorithms
Classification: Bayes Hybrid Approaches
Classification: Neural Networks Graph Mining, Web Mining
Classification: SVM Text Mining: Challenges, Basics
Classification: Evaluation Text Mining: Text­as­Data
Classification: Evaluation 2 Text Mining: Text­as­Language
Regression, Prediction Revision for Exam

Classification: Trees January 18, 2008 Slide 160


COMP527:
Data Mining
Today's Topics

Trees
Tree Learning Algorithm
Attribute Splitting Decisions
Random
'Purity Count'
Entropy (aka ID3)
Information Gain Ratio

Classification: Trees January 18, 2008 Slide 161


COMP527:
Data Mining
Trees

Anything can be made better by storing it in a tree structure! (Not really!)

Instead of having lists or sets of rules, why not have a tree


of rules? Then there's no problem with order, or repeating
the same test over and over again in different conjunctive
rules.

So each node in the tree is an attribute test, the branches


from that node are the different outcomes.

Instead of 'separate and conquer', Decision Trees are the


more typical 'divide and conquer' approach. Once the
tree is built, new instances can be tested by simply
stepping through each test.

Classification: Trees January 18, 2008 Slide 162


COMP527:
Data Mining
Example Data Again

Here's our example data again:
Outlook Temperature Humidity Windy Play?
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no

How to construct a tree from it, instead of rules?

Classification: Trees January 18, 2008 Slide 163


COMP527:
Data Mining
Tree Learning Algorithm

Trivial Tree Learner:

create empty tree T
select attribute A
create branches in T for each value v of A
for each branch,
    recurse with instances where A=v
    add tree as branch node

Most interesting part of this algorithm is line 2, the attribute 
selection.  Let's start with a Random selection, then look at how it 
might be improved.

Classification: Trees January 18, 2008 Slide 164


COMP527:
Data Mining
T

Random method:  Let's pick 'windy'

Windy
  false: 6 yes, 2 no
  true:  3 yes, 3 no

Need to split again, looking at only the 8 and 6 instances respectively.
For windy=false, we'll randomly select outlook:
sunny: no, no, yes    | overcast: yes, yes    | rainy: yes, yes, yes

As all instances of overcast and rainy are yes, they stop, sunny continues.

Classification: Trees January 18, 2008 Slide 165


COMP527:
Data Mining
Attribute Selection

As we may have thousands of attributes and/or values to test, we 
want to construct small decision trees.  Think back to RIPPER's 
description length ... the smallest decision tree will have the 
smallest description length.  So how can we reduce the number 
of nodes in the tree?

We want all paths through the tree to be as short as


possible. Nodes with one class stop a path, so we want
those to appear early in the tree, otherwise they'll occur
in multiple branches.

Think back: the first rule we generated was


outlook=overcast because it was pure.

Classification: Trees January 18, 2008 Slide 166


COMP527:
Data Mining
Attribute Selection: Purity

'Purity' count:

Outlook
  sunny:    2 yes, 3 no
  overcast: 4 yes, 0 no  (pure)
  rainy:    3 yes, 2 no

Select attribute that has the most 'pure' nodes, randomise equal 
counts.
Still mediocre. Most data sets won't have pure nodes for several 
levels. Need a measure of the purity instead of the simple count.

Classification: Trees January 18, 2008 Slide 167


COMP527:
Data Mining
Attribute Selection: Entropy

For each test:
Maximal purity:  All values are the same
Minimal purity:  Equal number of each value

Find a scale between maximal and minimal, and then merge across all of the 
attribute tests.

One function that calculates this is the Entropy function:
entropy(p1, p2, ..., pn)
= -p1*log(p1) - p2*log(p2) - ... - pn*log(pn)

p1 ... pn are the number of instances of each class, expressed as a fraction of 
the total number of instances at that point in the tree. log is base 2.

Classification: Trees January 18, 2008 Slide 168


COMP527:
Data Mining
Attribute Selection: Entropy

entropy(p1, p2, ..., pn)
= -p1*log(p1) - p2*log(p2) - ... - pn*log(pn)

This is to calculate one test.     For outlook there are three tests:
sunny:  info(2,3)  
= ­2/5 log(2/5)  ­3/5 log(3/5) 
= 0.5287 + 0.4421
= 0.971

overcast:     info(4,0) = ­(4/4*log(4/4)) +  ­(0*log(0))

Ohoh!  log(0) is undefined.  But note that we're multiplying it by 0, so what ever it 
is the final result will be 0. 

Classification: Trees January 18, 2008 Slide 169


COMP527:
Data Mining
Attribute Selection: Entropy

sunny:   info(2,3) = 0.971
overcast:  info(4,0) = 0.0
rainy:  info(3,2) = 0.971

But we have 14 instances to divide down those paths...
So the total for outlook is:
(5/14 * 0.971) + (4/14 * 0.0) + (5/14 * 0.971) = 0.693

Now to calculate the gain, we work out the entropy for the top node 
and subtract the entropy for outlook:
info(9,5) = 0.940 
gain(outlook) = 0.940 ­ 0.693 = 0.247
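
The same calculation as a Python sketch (log base 2, skipping zero counts
so that log(0) never arises):

import math

def info(*counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def gain(parent_counts, split_counts):
    total = sum(parent_counts)
    after = sum(sum(c) / total * info(*c) for c in split_counts)
    return info(*parent_counts) - after

print(info(9, 5))                                 # 0.940 for the whole set
print(gain((9, 5), [(2, 3), (4, 0), (3, 2)]))     # 0.247 for outlook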

Classification: Trees January 18, 2008 Slide 170


COMP527:
Data Mining
Attribute Selection: Entropy

Now to calculate the gain for all of the attributes:
gain(outlook) = 0.247
gain(humidity) = 0.152
gain(windy) = 0.048
gain(temperature) = 0.029

And select the maximum ... which is outlook.
This is (also!) called information gain.  The total is the information, 
measured in 'bits'.
Equally we could select the minimum amount of information 
needed ­­ the minimum description length issue in RIPPER.

Let's do the next level, where outlook=sunny.

Classification: Trees January 18, 2008 Slide 171


COMP527:
Data Mining
Attribute Selection: Entropy

Now to calculate the gain for all of the attributes:
Outlook Temperature Humidity Windy Play?
sunny hot high false no
sunny hot high true no
sunny mild high false no
sunny cool normal false yes
sunny mild normal true yes

Temp:    hot   info(0,2)  mild info(1,1)   cool info(1,0)
Humidity: high  info(0,3)  normal info(2,0)
Windy:    false info(1,2)  true info(1,1)

Don't even need to do the math.  Humidity is the obvious choice as 
it predicts all 5 instances correctly.  Thus the information will be 
0, and the gain will be maximal. 

Classification: Trees January 18, 2008 Slide 172


COMP527:
Data Mining
Attribute Selection: Entropy

Now our tree looks like:

Outlook
  sunny:    Humidity
              normal: yes
              high:   no
  overcast: yes
  rainy:    ?

This algorithm is called ID3, developed by Quinlan.

Classification: Trees January 18, 2008 Slide 173


COMP527:
Data Mining
Entropy: Issues

Nasty side effect of Entropy:  It prefers attributes with a large 
number of branches.
Eg, if there was an 'identifier' attribute with a unique value, this 
would uniquely determine the class, but be useless for 
classification. (over­fitting!)

Eg: info(0,1) info(0,1) info(1,0) ...

Doesn't need to be unique.  If we assign 1 to the first two instances, 
2 to the next two and so forth, we still get a 'better' split.

Classification: Trees January 18, 2008 Slide 174


COMP527:
Data Mining
Entropy: Issues

Half­Identifier 'attribute':
info(0,2) info(2,0) info(1,1) info(1,1) info(2,0) info(2,0) 
info(1,1) 
= 0  0  0.5  0.5  0  0  0.5

2/14 down each route, so: 
= 0*2/14 + 0*2/14 + 0.5*2/14 + 0.5*2/14 + ...
= 3 * (2/14 * 0.5)
= 3/14 
= 0.214
Gain is: 
0.940 ­ 0.214 = 0.726

Remember that the gain for Outlook was only 0.247!
Urgh.  Once more we run into over­fitting. 

Classification: Trees January 18, 2008 Slide 175


COMP527:
Data Mining
Gain Ratio

Solution:  Use a gain ratio.  Calculate the entropy disregarding 
classes for all of the daughter nodes:

eg  info(2,2,2,2,2,2,2) for half­identifier 
and  info(5,4,5) for outlook

identifier = ­1/14 * log(1/14) * 14 = 3.807
half­identifier = ­1/7 * log(1/7) * 7 = 2.807
outlook = 1.577

Ratios:
identifier = 0.940 / 3.807 = 0.247
half­identifier = 0.726 / 2.807 = 0.259
outlook = 0.247 / 1.577 = 0.157

Classification: Trees January 18, 2008 Slide 176


COMP527:
Data Mining
Gain Ratio

Close to success:  Picks half­identifier (only accurate in 4/7 
branches) over identifier (accurate in all 14 branches)!

half­identifier = 0.259
identifier = 0.247
outlook = 0.157
humidity = 0.152
windy = 0.049
temperature = 0.019

Humidity is now also very close to outlook, whereas before they 
were separated.

Classification: Trees January 18, 2008 Slide 177


COMP527:
Data Mining
Gain Ratio

We can simply check for identifier like attributes and ignore them.  
Actually, they should be removed from the data before the data 
mining begins.

However the ratio can also over-compensate.  It might pick an 
attribute just because its split entropy is low.  Note how close humidity 
and outlook became... maybe that's not such a good thing?

Possible Fix:  First generate the information gain.  Throw away any 
attributes with less than the average. Then compare using the 
ratio.

Classification: Trees January 18, 2008 Slide 178


COMP527:
Data Mining
Alternative: Gini

An alternative method to Information Gain is called the Gini Index

The total for node D is:
gini(D) = 1 - sum(p1², p2², ..., pn²)
Where p1..pn are the frequency ratios of class 1..n in D.

So the Gini Index for the entire set:
= 1 - ((9/14)² + (5/14)²)
= 1 - (0.413 + 0.127)
= 0.459

Classification: Trees January 18, 2008 Slide 179


COMP527:
Data Mining
Gini

The gini value of a split of D into subsets is:

Split(D) = N1/N gini(D1) + N2/N gini(D2) + ... + Nn/N gini(Dn)

Where Ni is the size of subset Di, and N is the size of D.

eg:   Outlook splits into 5,4,5:
split   = 5/14 gini(sunny) + 4/14 gini(overcast)
          + 5/14 gini(rainy)
sunny   = 1 - sum((2/5)², (3/5)²) = 1 - 0.52 = 0.48
overcast= 1 - sum((4/4)², (0/4)²) = 0.0
rainy   = sunny
split   = (5/14 * 0.48) * 2
        = 0.343
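
The same in Python, reproducing the corrected figures above:

def gini(*counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(split_counts):
    total = sum(sum(c) for c in split_counts)
    return sum(sum(c) / total * gini(*c) for c in split_counts)

print(gini(9, 5))                                 # 0.459 for the whole set
print(gini_split([(2, 3), (4, 0), (3, 2)]))       # 0.343 for splitting on outlook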

Classification: Trees January 18, 2008 Slide 180


COMP527:
Data Mining
Gini

The attribute that generates the smallest gini split value is chosen 
to split the node on.

(Left as an exercise for you to do!)

Gini is used in CART (Classification and Regression Trees), IBM's 
IntelligentMiner system, SPRINT (Scalable PaRallelizable 
INduction of decision Trees).  It comes from an Italian statistician 
who used it to measure income inequality.

Classification: Trees January 18, 2008 Slide 181


COMP527:
Data Mining
Decision Tree Issues

The various problems that a good DT builder needs to address:

– Ordering of Attribute Splits
As seen, we need to build the tree picking the best attribute to split on first.
– Numeric/Missing Data
Dividing numeric data is more complicated. How?
– Tree Structure
A balanced tree with the fewest levels is preferable.
– Stopping Criteria
Like with rules, we need to stop adding nodes at some point. When?
– Pruning
It may be beneficial to prune the tree once created? Or incrementally?

Classification: Trees January 18, 2008 Slide 182


COMP527:
Data Mining
Further Reading


Introductory statistical text books

Witten, 3.2, 4.3

Dunham, 4.4

Han, 6.3

Berry and Browne, Chapter 4

Berry and Linoff, Chapter 6

Classification: Trees January 18, 2008 Slide 183


COMP527:
Data Mining
COMP527: Data Mining

Introduction to the Course Input Preprocessing
Introduction to Data Mining Attribute Selection
Introduction to Text Mining Association Rule Mining
General Data Mining Issues ARM: A Priori and Data Structures
Data Warehousing ARM: Improvements
Classification: Challenges, Basics ARM: Advanced Techniques
Classification: Rules Clustering: Challenges, Basics
Classification: Trees Clustering: Improvements
Classification: Trees 2 Clustering: Advanced Algorithms
Classification: Bayes Hybrid Approaches
Classification: Neural Networks Graph Mining, Web Mining
Classification: SVM Text Mining: Challenges, Basics
Classification: Evaluation Text Mining: Text­as­Data
Classification: Evaluation 2 Text Mining: Text­as­Language
Regression, Prediction Revision for Exam

Classification: Trees 2 January 18, 2008 Slide 184


COMP527:
Data Mining
Today's Topics

Numeric Data
Missing Values
Pruning
Pre- vs Post-Pruning
Chi-squared Test
Sub-tree Replacement
Sub-tree Raising
C4.5's error estimation
From Trees to Rules

Classification: Trees 2 January 18, 2008 Slide 185


COMP527:
Data Mining
Numeric Attributes

The temperature attribute for the weather data is actually a set of 
Fahrenheit values between 64 and 85:
64 65 68 69 70 71 72 75 80 81 83 85
yes no yes yes yes no no, yes yes, yes no yes yes no

While we could have a 12 way branching node, this wouldn't


catch training data that had, for example, a temperature
of 66. We've seen the issues of over-fitting with many
branches already.
Need to have some split points. Eg temperature > = 71

Could have one node with many branches, or one or more


nodes with two branches each.

Classification: Trees 2 January 18, 2008 Slide 186


COMP527:
Data Mining
T

Assuming one split, where should it be?
64 65 68 69 70 71 72 75 80 81 83 85
yes no yes yes yes no no, yes yes, yes no yes yes no

Common to split between two values, so 11 possible points.


(Why?)
We can simply test the information gain for each split.

< 71.5 is 4 yes, 2 no. > 71.5 is 5 yes, 3 no. So:

info([4,2],[5,3]) = 6/14 * info([4,2]) + 8/14 * info([5,3]) 
= 0.939

Then calculate it for all of the other split points and take the best. 
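
A Python sketch of that search (assuming a two-class yes/no problem and
splitting at the midpoint between adjacent distinct values):

import math

def info(*counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def best_split(values, classes):
    pairs = sorted(zip(values, classes))
    yes, total = classes.count('yes'), len(classes)
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue                              # can't split between equal values
        point = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [c for v, c in pairs if v < point]
        right = [c for v, c in pairs if v >= point]
        after = (len(left) / total * info(left.count('yes'), len(left) - left.count('yes'))
                 + len(right) / total * info(right.count('yes'), len(right) - right.count('yes')))
        g = info(yes, total - yes) - after
        if best is None or g > best[1]:
            best = (point, g)
    return best                                   # (split point, information gain)

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play  = ['yes','no','yes','yes','yes','no','no','yes','yes','yes','no','yes','yes','no']
print(best_split(temps, play))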

Classification: Trees 2 January 18, 2008 Slide 187


COMP527:
Data Mining
Numeric Attributes

Once the best split has been found, continue as normal. Almost.

Just because it has been split once doesn't mean that it


couldn't benefit from being split again. You might want to
test the same attribute again in the same path. For
example:
1 2 3 4 5 6 7 8 9 10  11 12
A A A B B B A A A B B B
X Y X Y X Y X Y X Y X Y

For a/b, might split at 6.5, then again at 3.5 and 9.5
But splitting for x/y will eventually lead to 1=x, 2=y, 3=x ...
Over-fitting.
Also: Many binary splits on an attribute make the tree hard
to read.

Classification: Trees 2 January 18, 2008 Slide 188


COMP527:
Data Mining
Numeric Attributes

Isn't a multi­way split better?  Yes, but harder to accomplish. How 
many splits? Where?
1 2 3 4 5 6 7 8 9 10  11 12
A A A B B B A A A B B B
X Y X Y X Y X Y X Y X Y

We really want to find a function to test the data with.
For X/Y we want to test:   value % 2
For A/B we want to test:   (value - 1) / 3 % 2   (integer division)

Complicated.  We'll look at regression trees later.

Algorithm papers:   http://citeseer.ist.psu.edu/context/412349/0
Classification: Trees 2 January 18, 2008 Slide 189
COMP527:
Data Mining
Numeric Attributes

Sounds like a lot of computation for attributes with a wide range of 
data.  ... Yes.

Second computational problem:  If we test (for example) windy first 
and then test temperature, the possible values will be different 
because not all instances have made it to that node.  So we'll 
need to re­sort everything every node.

Not quite. The order doesn't change because instances are left out. 
 We can sort once and cross out instances that we don't have.

Classification: Trees 2 January 18, 2008 Slide 190


COMP527:
Data Mining
Numeric Attributes

Eg:
64 65 68 69 70 71 72 72 75 75 80 81 83 85
7 6 5 9 4 14 8 12 10 11 2 13 3 1

Just because we don't have instances 1,3,4,5,8,9,10 and 13 
doesn't mean that the order of the others has changed. 
It will still be: 7, 6, 14, 12, 11, 2

Disadvantage:  Need to store this information for each numeric 
attribute for every instance.   If the numeric attributes are used 
further down the tree, it may be cheaper to do it only on the 
subsets.

Classification: Trees 2 January 18, 2008 Slide 191


COMP527:
Data Mining
Numeric Attributes

We don't have this problem with nominal attributes.  We could transform the 
numeric attribute into nominal before the data mining stage?

If we can discretize it before data mining, surely we can do it during it as well 
using the same techniques?  And wouldn't it be faster, as you're only going to 
be dealing with a subset of the data, not all of it?  Yes, but it might be over­
fitting!

Solutions:

Pre­discretize to nominal attribute (will look at later)

Many binary splits

One multi­branch split

Classification: Trees 2 January 18, 2008 Slide 192


COMP527:
Data Mining
Missing Values

What happens when an instance is missing the value for an 
attribute? Already discussed some possibilities for filling in the 
value.
While Training, may be possible to just ignore it. But we need a 
solution for a Test instance with a missing value.

Idea:   Send the instance down all of the branches and combine the 
results?

Need to record in the tree the 'popularity' of each branch.  Eg how 
many instances went down it.

Classification: Trees 2 January 18, 2008 Slide 193


COMP527:
Data Mining
Missing Values

For example: Split the 14 instances by Windy ... 8 go down the 
false branch, 6 down the true branch. So when we get a test 
instance without windy, we send it down both branches, weighting 
the false result by 8/14 and the true result by 6/14.

Instead of ending up with a single class, we might end up with 4/7 
votes for one and 3/7 votes for another. Or they might both end 
up the same.

Classification: Trees 2 January 18, 2008 Slide 194


COMP527:
Data Mining
Pre- vs Post- Pruning

In the same way as generating rule sets, we need to prune trees to 
avoid over­fitting.

Pre­Pruning:  Stop before reaching the bottom of the tree path.
But it might stop too early.  For example when a combination of two 
attributes is important, but neither by themselves is significant.

Post­Pruning:  Generate the entire tree and then remove some 
branches.
More time consuming, but more likely to help classification 
accuracy.

Classification: Trees 2 January 18, 2008 Slide 195


COMP527:
Data Mining
Pre-Pruning

How to determine when to stop growing?

Statistical Significance:  
Stop growing the tree when there is no statistically significant 
association between any attribute and the class at a particular 
node

Popular test:  chi­squared
chi2 = sum( (O­E)2 / E ) 
O = observed data, E = expected values based on hypothesis.

Classification: Trees 2 January 18, 2008 Slide 196


COMP527:
Data Mining
Pre-Pruning

chi2 = sum( (O­E)2 / E ) 

Example (from Dunham, 55): 5 schools have the same test.


Total score is 375, individual results are: 50, 93, 67, 78
and 87. Is this distribution significant, or was it just luck?
Average is 75.
(50-75)²/75 + (93-75)²/75 + (67-75)²/75 + (78-75)²/75 + (87-75)²/75
= 15.55

This distribution is significant.   
ID3 only allowed significant attributes to be selected by Information 
Gain.

Classification: Trees 2 January 18, 2008 Slide 197


COMP527:
Data Mining
Post-Pruning

Two possible options for post­pruning:

Sub­tree Replacement:  Select a sub­tree and replace with a 
single leaf.

Sub­tree Raising: Select a sub­tree and raise it to replace a higher 
tree.  More complicated, harder to tell if worth­while in practice.

Need to split the training data into a Grow/Prune division again.  Grow 
the entire tree from the Grow set, then prune it using the Prune 
set. But it has the same problems as with rules­based systems. 

Classification: Trees 2 January 18, 2008 Slide 198


COMP527:
Data Mining
Sub-Tree Replacement

Replace left sub­tree with 'bad' leaf node

(Witten fig 1.3)

Classification: Trees 2 January 18, 2008 Slide 199


COMP527:
Data Mining
Sub-Tree Raising

Raise sub-tree C to B
(Witten fig 6.1)

But now need to reclassify instances that 
would have gone to 4 & 5.

Classification: Trees 2 January 18, 2008 Slide 200


COMP527:
Data Mining
Post-Pruning

Can we estimate the error of the tree without a Pruning set?  Can 
we estimate it based on the training set that it has just been 
grown from?

C4.5 uses a “shaky” statistical method that works well in


practice (as shown by C4.5's accuracy and subsequent
popularity)

Method: Derive a confidence interval and calculate


pessimistic error rate. If the combined training error for a
set of branches is higher than the parent, prune them.

Classification: Trees 2 January 18, 2008 Slide 201


COMP527:
Data Mining
Post-Pruning

– Error estimate for subtree is weighted sum of error estimates for 
all its leaves
– Error estimate for a node:

  e = ( f + z²/2N + z·√( f/N - f²/N + z²/4N² ) ) / ( 1 + z²/N )

– If c = 25% (default for C4.5) then z = 0.69 (from normal


distribution)
– f is the error on the training data
– N is the number of instances covered by the leaf

(These slides thanks to the official publisher slide sets, not my strong point!)

Classification: Trees 2 January 18, 2008 Slide 202


COMP527:
Data Mining
Post-Pruning

f = 5/14
e = 0.46

e < 0.51
so prune!

Combined using ratios 6:2:6


f=0.33 f=0.5 f=0.33
e=0.47 e=0.72 e=0.47 gives 0.51
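
The formula and the worked example above as a Python sketch (z = 0.69,
i.e. c = 25%):

import math

def pessimistic_error(f, N, z=0.69):
    return ((f + z * z / (2 * N)
             + z * math.sqrt(f / N - f * f / N + z * z / (4 * N * N)))
            / (1 + z * z / N))

parent = pessimistic_error(5 / 14, 14)                    # ~0.46
leaves = (6 * pessimistic_error(2 / 6, 6)                 # f=0.33 -> ~0.47
          + 2 * pessimistic_error(1 / 2, 2)               # f=0.5  -> ~0.72
          + 6 * pessimistic_error(2 / 6, 6)) / 14         # combined 6:2:6 -> ~0.51
print(parent, leaves)      # parent estimate is lower, so prune the subtree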

Classification: Trees 2 January 18, 2008 Slide 203


COMP527:
Data Mining
From Trees to Rules

Now we've built a tree, it might be desirable to re­express it as a list 
of rules.
Simple Method: Generate a rule by conjunction of tests in
each path through the tree.
Eg:
if temp > 71.5 and ... and windy = false then play=yes
if temp > 71.5 and ... and windy = true  then play=no 

But these rules are more complicated than necessary.


Instead we could use the pruning method of C4.5 to prune
rules as well as trees.

Classification: Trees 2 January 18, 2008 Slide 204


COMP527:
Data Mining
From Trees to Rules

for each rule,
e = error rate of rule
e' = error rate of rule ­ finalCondition
if e' < e, 
rule = rule ­ finalCondition
recurse
remove duplicate rules

Expensive:  Need to re­evaluate entire training set for every 
condition!
Might create duplicate rules if all of the final conditions from a path 
are removed.

Classification: Trees 2 January 18, 2008 Slide 205


COMP527:
Data Mining
Further Reading

As previous:

Witten, 3.2, 4.3 PLUS 6.1

Dunham, 4.4

Han, 6.3

Berry and Browne, Chapter 4

Berry and Linoff, Chapter 6

Classification: Trees 2 January 18, 2008 Slide 206


COMP527:
Data Mining
COMP527: Data Mining

Introduction to the Course Input Preprocessing
Introduction to Data Mining Attribute Selection
Introduction to Text Mining Association Rule Mining
General Data Mining Issues ARM: A Priori and Data Structures
Data Warehousing ARM: Improvements
Classification: Challenges, Basics ARM: Advanced Techniques
Classification: Rules Clustering: Challenges, Basics
Classification: Trees Clustering: Improvements
Classification: Trees 2 Clustering: Advanced Algorithms
Classification: Bayes Hybrid Approaches
Classification: Neural Networks Graph Mining, Web Mining
Classification: SVM Text Mining: Challenges, Basics
Classification: Evaluation Text Mining: Text­as­Data
Classification: Evaluation 2 Text Mining: Text­as­Language
Regression, Prediction Revision for Exam

Classification: Bayes January 18, 2008 Slide 207


COMP527:
Data Mining
Today's Topics

Statistical Modeling
Bayes Rule
Naïve Bayes
Fixes to Naïve Bayes
Document classification
Bayesian Networks
Structure
Learning

Classification: Bayes January 18, 2008 Slide 208


COMP527:
Data Mining
Bayes Rule

The probability of hypothesis H, given evidence E:

Pr[H|E] = Pr[E|H]*Pr[H] / Pr[E]

Pr[H] = A Priori probability of H (before evidence seen)
Pr[H|E] = A Posteriori probability of H (after evidence seen) 

We want to use this in a classification system,  so our goal is to find 
the most probable hypothesis (class) given the evidence (test 
instance).

Classification: Bayes January 18, 2008 Slide 209


COMP527:
Data Mining
Example

Meningitis causes a stiff neck 50% of the time.
Meningitis occurs 1/50,000, stiff necks occur 1/20.

Probability of Meningitis, given that the patient has a stiff


neck:

Pr[H|E] = Pr[E|H]*Pr[H] / Pr[E]
Pr[M|SN] = Pr[SN|M]*Pr[M]/Pr[SN]
         = 0.5 * 1/50000 / 1/20
         = 0.0002

Classification: Bayes January 18, 2008 Slide 210


COMP527:
Data Mining
Bayes Rule

Our evidence E is made up of different attributes A[1..n], so:

Pr[H|E] = Pr[A1|H]*Pr[A2|H]...Pr[An|H]*Pr[H]/Pr[E]

So we need to work out the probability of the individual attributes 
per class. Easy...

Outlook=Sunny appears twice for yes out of 9 yes instances.
We can work these out for all of our training instances...

Classification: Bayes January 18, 2008 Slide 211


COMP527:
Data Mining
Weather Probabilities

Outlook Temperature Humidity Windy Play


Yes No Yes No Yes No Yes No Yes No
Sunny 2 3 Hot 2 2 High 3 4 False 6 2 9 5
Overcast 4 0 Mild 4 2 Normal 6 1 True 3 3
Rainy 3 2 Cool 3 1
Sunny 2/9 3/5 Hot 2/9 2/5 High 3/9 4/5 False 6/9 2/5 9/14 5/14
Overcast 4/9 0/5 Mild 4/9 2/5 Normal 6/9 1/5 True 3/9 3/5
Rainy 3/9 2/5 Cool 3/9 1/5

Given a test instance (sunny, cool, high, true) 
play=yes: 2/9 * 3/9 * 3/9 * 3/9 * 9/14 = 0.0053
play=no:  3/5 * 1/5 * 4/5 * 3/5 * 5/14 = 0.0206

So we'd predict play=no for that particular instance.
Classification: Bayes January 18, 2008 Slide 212
COMP527:
Data Mining
Weather Probabilities

play=yes: 2/9 * 3/9 * 3/9 * 3/9 * 9/14 = 0.0053
play=no:  3/5 * 1/5 * 4/5 * 3/5 * 5/14 = 0.0206

This is the likelihood, not the probability. We need to normalise these.

Prob(yes) = 0.0053 / (0.0053 + 0.0206) = 20.5%

This is when the Pr[E] denominator disappears from Bayes's rule.
Nice. Surely there's more to it than this... ?
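
The whole calculation as a Python sketch, using the raw counts from the
table above (no Laplace correction yet):

counts = {
    'yes': {'outlook=sunny': 2, 'temp=cool': 3, 'humidity=high': 3,
            'windy=true': 3, 'total': 9},
    'no':  {'outlook=sunny': 3, 'temp=cool': 1, 'humidity=high': 4,
            'windy=true': 3, 'total': 5},
}

def likelihood(cls, attrs, n_instances=14):
    c = counts[cls]
    value = c['total'] / n_instances              # the prior Pr[H]
    for a in attrs:
        value *= c[a] / c['total']                # Pr[attribute value | H]
    return value

attrs = ['outlook=sunny', 'temp=cool', 'humidity=high', 'windy=true']
l_yes, l_no = likelihood('yes', attrs), likelihood('no', attrs)   # 0.0053, 0.0206
print(l_yes / (l_yes + l_no))                     # ~0.205, i.e. 20.5% for play=yes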

Classification: Bayes January 18, 2008 Slide 213


COMP527:
Data Mining
Naïve Bayes

Issue:  It's only valid to multiply probabilities when the events are 
independent of each other.  It is “naïve” to assume independence 
between attributes in datasets, hence the name.

Eg: The probability of Liverpool winning a football match is not 
independent of the probabilities for each member of the team 
scoring a goal.

But even given that, Naïve Bayes is still very effective in practice, 
especially if we can eliminate redundant attributes before 
processing. 

Classification: Bayes January 18, 2008 Slide 214


COMP527:
Data Mining
Naïve Bayes

Issue:  If an attribute value does not co­occur with a class value, then the 
probability generated for it will be 0.

Eg:  Given outlook=overcast, the probability of play=no is 0/5.  The other 
attributes will be ignored as the final result will be multiplied by 0.

This is bad for our 4 attribute set, but horrific for (say) a 1000 attribute set. 
 You can easily imagine a case where the likelihood for all classes is 0.

Eg:  'Viagra' is always spam, 'data mining' is never spam.  An email with 
both will be 0 for spam=yes and 0 for spam=no ... probability will be 
undefined ... uh oh! 

Classification: Bayes January 18, 2008 Slide 215


COMP527:
Data Mining
Laplace Estimator

The trivial solution is of course to mess with the probabilities such that 
you never have 0s.   We add 1 to the numerator and 3 to the 
denominator to compensate.

So we end up with 1/8 instead of 0/5.

No reason to use 3, could use 2 and 6.  No reason to split equally... we 
could add weight to some attributes by giving them a larger share:
(a+3)/(na+6) * (b+2)/(nb+6) * (c+1)/(nc+6)

However, how to assign these is unclear.
For reasonable training sets, simply initialise counts to 1 rather than 0. 

Classification: Bayes January 18, 2008 Slide 216


COMP527:
Data Mining
Missing Values

Naïve Bayes deals well with missing values:

Training:  Ignore the instance for the attribute/class combination, 
but we can still use it for the known attributes.

Classification:  Ignore the attribute in the calculation as the 
difference will be normalised during the final step anyway.

Classification: Bayes January 18, 2008 Slide 217


COMP527:
Data Mining
Numeric Values

Naïve Bayes does not deal well with numeric values without some help.
The probability of it being exactly 65 degrees is zero.

We could discretize the attribute, but instead we'll calculate the mean and 
standard deviation and use a density function to predict the probability.

mean:  sum(values) / count(values)
variance:  sum(square(value - mean)) / (count(values) - 1)
standard deviation:  square root of variance
 

Mean for temperature is 73, Std. Deviation is 6.2

Classification: Bayes January 18, 2008 Slide 218


COMP527:
Data Mining
Numeric Values

Density function:

f(x) = 1/(√(2π)·σ) * e^( -(x-µ)² / 2σ² )

Unless you've a math background, just plug the numbers in...
At which point we get a likelihood of 0.034
Then we continue with this number as before.

This assumes a reasonably normal distribution. Often not the case.
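
As a Python sketch (mean 73 and standard deviation 6.2 from above; 66 is
an assumed test temperature, which reproduces the 0.034 likelihood):

import math

def gaussian(x, mu, sigma):
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)
            * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)))

print(gaussian(66, 73, 6.2))    # ~0.034, used in place of a count ratio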

Classification: Bayes January 18, 2008 Slide 219


COMP527:
Data Mining
Document Classification

The Bayesian model is often used to classify documents as it deals 
well with a huge number of attributes simultaneously.  (eg 
boolean occurrence of words within the text)
But we may know how many times the word occurs.

This leads to Multinomial Naive Bayes.

Assumptions:  
1. Probability of a word occurring in a document is independent 
of its location within the document.  
2. The document length is not related to the class.

Classification: Bayes January 18, 2008 Slide 220


COMP527:
Data Mining
Document Classification

Pr[E|H] = N! * product( p^n / n! )

N = number of words in the document
p = relative frequency of the word in documents of class H
n = number of occurrences of the word in the document

So, if A has 75% and B has 25% frequency in class H:

Pr[“A A A”|H] = 3! * 0.75³/3! * 0.25⁰/0!
              = 27/64
              = 0.422

Pr[“A A A B B”|H] = 5! * 0.75³/3! * 0.25²/2!
                  = 0.264

Classification: Bayes January 18, 2008 Slide 221


COMP527:
Data Mining
Document Classification

Pr[E|H] = N! * product( p^n / n! )

We don't need to work out all the factorials, as they'll normalise out 
at the end.

We still end up with insanely small numbers, as vocabularies are 
much much larger than 2 words.  Instead we can sum the 
logarithms of the probabilities instead of multiplying them. 
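
A sketch of the multinomial likelihood in log space (the factorials are
dropped because they normalise out when the classes are compared;
zero-probability words would still need the Laplace fix from earlier):

import math
from collections import Counter

def log_likelihood(words, word_probs):
    counts = Counter(words)
    return sum(n * math.log(word_probs[w]) for w, n in counts.items())

probs_H = {'A': 0.75, 'B': 0.25}
print(log_likelihood(['A', 'A', 'A', 'B', 'B'], probs_H))  # log(0.75^3 * 0.25^2)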

Classification: Bayes January 18, 2008 Slide 222


COMP527:
Data Mining
Bayesian Networks

Back to the attribute independence assumption.  Can we get rid of 
it?
Yes, with a Bayesian Network.

Each attribute has a node in a Directed Acyclic Graph.

Each node has a table that lists, for each combination of values of 
the nodes with edges pointing at it, the probabilities of its own attribute's values.

Examples will be hopefully enlightening...

Classification: Bayes January 18, 2008 Slide 223


COMP527:
Data Mining
Simple Network

play
yes    no
.633  .367

outlook windy
play| sunny overcast rainy play| false true
yes | .238   .429    .333   yes | .350  .650
no  | .538   .077    .385  no  | .583  .417

temperature humidity
play| hot  mild cold play| high normal
yes | .238 .429 .333 yes | .350 .650  
no  | .385 .385 .231 no  | .750 .250  

Classification: Bayes January 18, 2008 Slide 224


COMP527:
Data Mining
Less Simple Network

play
yes    no
.633  .367 windy
play outlook false true
yes sunny    .500 .500
Outlook yes overcast .500 .500
play sunny overcast rainy yes rainy    .125 .875
yes  .238   .429    .333   no  sunny    .375 .625
no   .538   .077    .385  no  overcast .500 .500
no  rainy    .833 .167

temperature humidity
play outlook  hot  mild cold play temp  high normal
yes sunny    .238 .429 .333 yes  hot   .500 .500  
yes overcast .385 .385 .231 yes  mild  .500 .500  
yes rainy    .111 .556 .333 yes  cool  .125 .875  
no  sunny    .556 .333 .111 no   hot   .833 .167  
no  overcast .333 .333 .333 no   mild  .833 .167  
no  rainy    .143 .429 .429 no   cool  .250 .750  

Classification: Bayes January 18, 2008 Slide 225


COMP527:
Data Mining
Bayesian Networks

To use the network, simply step through each node and multiply the 
results in the table together for the instance's attributes' values.
Or, more likely, sum the logarithms as with the multinomial case.

Then, as before, normalise them to sum to 1.

This works because the links between the nodes determine the 
probability distribution at the node.

Using it seems straightforward. So all that remains is to find out the best 
network structure to use.  Given a large number of attributes, there's a 
LARGE number of possible networks...

Classification: Bayes January 18, 2008 Slide 226


COMP527:
Data Mining
Training Bayesian Networks

We need two components:

– Evaluate a network based on the data
As always we need to find a system that measures the 
'goodness' without overfitting 
(overfitting in this case = too many edges)
We need a penalty for the complexity of the network.

– Search through the space of possible networks
As we know the nodes, we need to find where the edges in the 
graph are.  Which nodes connect to which other nodes?

Classification: Bayes January 18, 2008 Slide 227


COMP527:
Data Mining
Training Bayesian Networks

Following the Minimum Description Length ideal, networks with lots 
of edges will be more complex, and hence likely to over­fit.
We could add a penalty for each cell in the nodes' tables.

AIC:   ­LL +K
MDL:   ­LL + K/2 log(N)

LL is total log­likelihood of the network and training set.  eg Sum of 
log of probabilities for each instance in the data set.
K is the number of cells in tables, minus the number of cells in the 
last row (which can be calculated, by 1­ sum of other cells in row)
N is the number of instances in the data.

Classification: Bayes January 18, 2008 Slide 228


COMP527:
Data Mining
Network Training: K2

K2:
for each node,
    for each previous node,
        add that node as a parent, calculate worth
    continue when worth doesn't improve
(Use MDL or AIC to determine worth)

The results of K2 depend on initial order selected to process the 
nodes in.
Run it several times with different orders and select the best.
Can help to ensure that the class attribute is first and links to all 
nodes (not a requirement)

Classification: Bayes January 18, 2008 Slide 229


COMP527:
Data Mining
Other Structures

TAN:  Tree Augmented Naive Bayes.

Class attribute is only parent for each node in Naive Bayes. Start 
here and consider adding a second parent to each node.

Bayesian Multinet:
Build a separate network for each class and combine the values.

Classification: Bayes January 18, 2008 Slide 230


COMP527:
Data Mining
Further Reading


Witten 4.2, 6.7

Han 6.4

Dunham 4.2

Devijver and Kittler, Pattern Recognition: A Statistical Approach, 
Chapter 2

Berry and Browne, Chapter 2 

Classification: Bayes January 18, 2008 Slide 231


COMP527:
Data Mining
COMP527: Data Mining

Introduction to the Course Input Preprocessing
Introduction to Data Mining Attribute Selection
Introduction to Text Mining Association Rule Mining
General Data Mining Issues ARM: A Priori and Data Structures
Data Warehousing ARM: Improvements
Classification: Challenges, Basics ARM: Advanced Techniques
Classification: Rules Clustering: Challenges, Basics
Classification: Trees Clustering: Improvements
Classification: Trees 2 Clustering: Advanced Algorithms
Classification: Bayes Hybrid Approaches
Classification: Neural Networks Graph Mining, Web Mining
Classification: SVM Text Mining: Challenges, Basics
Classification: Evaluation Text Mining: Text­as­Data
Classification: Evaluation 2 Text Mining: Text­as­Language
Regression, Prediction Revision for Exam

Classification: Neural Networks January 18, 2008 Slide 232


COMP527:
Data Mining
Today's Topics

Introduction to Neural Networks
Issues
Training
Kohonen Self-Organising Maps
Radial Basis Function Networks

Classification: Neural Networks January 18, 2008 Slide 233


COMP527:
Data Mining
Neural Networks

How do animals learn (including humans)?  Perhaps we can 
simulate that for learning simple patterns?

How does the brain work (simplistically)?  The brain has lots of 
neurons which either fire or not and are linked together in a huge 
three dimensional structure.  It receives input from many neurons 
and sends its output to many neurons.  The input comes from 
external connected sensors such as eyes.

So ... can we model an artificial network of neurons to solve just 
one task (a classification problem)?

Classification: Neural Networks January 18, 2008 Slide 234


COMP527:
Data Mining
Neural Networks

We need some inputs, then some neurons connected together and 
then some outputs.  Our inputs are the attributes from the data 
set.  Our outputs are (typically) the classes.  We can have 
connections from all of the inputs to the neurons, then from the 
neurons to the outputs.

Then we just need to train the neurons to react to the values in the 
attributes in the proper way such that the output layer gives us 
the classification.  (Which is of course the complicated part, just 
like animals learning)

Classification: Neural Networks January 18, 2008 Slide 235


COMP527:
Data Mining
Perceptrons?

Sounds like the idea of a Perceptron?  Yes, it is.

A Neural Network is kind of like a lot of Perceptrons bundled 
together, but there are some significant differences:

– The training methods for Perceptron/Winnow aren't used
– The activation function is often not binary, but a continuous function
– Neural Networks can handle multiple classes, not just 2
– There are multiple layers of nodes, normally 3

Classification: Neural Networks January 18, 2008 Slide 236


COMP527:
Data Mining
Simple Neural Network

[Diagram: input neurons Attr1 and Attr2 feed hidden-layer nodes N1 and N2, 
which feed a single output neuron for the Class. The class here is true or 
false, so we can use just one on/off output neuron.]
Classification: Neural Networks January 18, 2008 Slide 237


COMP527:
Data Mining
Neural Networks

That sounds like a regression problem? Learning a function...
Actually we use the same activation function in all nodes,
and apply a weight to each link. Each node can also have
a constant to add to the incoming data called a bias.

[Diagram: the same 2-2-1 network with weights W1,1, W2,1, W1,2, W2,2 on the 
input-to-hidden links and W1,C, W2,C on the hidden-to-output links.]

Classification: Neural Networks January 18, 2008 Slide 238


COMP527:
Data Mining
Neural Networks

[Diagram: the weighted 2-2-1 network from the previous slide, with weights 
W1,1, W2,1, W1,2, W2,2 into the hidden layer and W1,C, W2,C into the output.]

Node N1 does:    fN1(A1*W1,1 + A2*W2,1 + CN1)
Node Class does: fClass( fN1(A1*W1,1 + A2*W2,1 + CN1)*W1,C +
                         fN2(A1*W1,2 + A2*W2,2 + CN2)*W2,C + CClass )
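As a concrete illustration, the same computation in Python for the small 2-2-1 
network pictured above, assuming a logistic activation function in every node; 
the weight and bias names simply mirror the labels on the slide and are 
illustrative only.

import math

def sigmoid(s):
    """Logistic activation: 1 / (1 + e^(-s))."""
    return 1.0 / (1.0 + math.exp(-s))

def feed_forward(a1, a2, w, bias):
    """Propagate two attribute values through the 2-2-1 network above."""
    n1 = sigmoid(a1 * w['11'] + a2 * w['21'] + bias['N1'])
    n2 = sigmoid(a1 * w['12'] + a2 * w['22'] + bias['N2'])
    return sigmoid(n1 * w['1C'] + n2 * w['2C'] + bias['Class'])

# example call with arbitrary weights:
# feed_forward(0.5, 1.0,
#              {'11': 0.2, '21': -0.4, '12': 0.7, '22': 0.1, '1C': 0.5, '2C': -0.3},
#              {'N1': 0.0, 'N2': 0.0, 'Class': 0.1})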

Classification: Neural Networks January 18, 2008 Slide 239


COMP527:
Data Mining
Neural Network Issues

Issues with constructing a neural network classifier:
– Attributes as source nodes
Need to be numeric for weighting
– Number of hidden layers
Not necessarily just one layer, could have multiple
– Number of nodes per hidden layer
Complicated. Too many and it will over-fit; too few and it won't
learn properly
– Number of output neurons
One per class, or perhaps a bit-based combination (eg 101 = class 5
with 3 outputs)
– Interconnections
Nodes might not connect to all nodes in the next layer, and might
connect backwards
– Weights, Constants, Activation function to use
– Learning Technique to adjust weights

Classification: Neural Networks January 18, 2008 Slide 240


COMP527:
Data Mining
Neural Network Benefits

Bleh! Why use a NN rather than a Decision Tree then?
– More robust -- can use all attributes at once, without splitting
numeric attributes or turning them into nominal ones
– Performance can be improved by further training later
– More robust in noisy environments, as it can't go down the wrong path
so easily

Other points to consider:

– Difficult to understand for non-experts (one can read a decision tree directly)
– Generating rules from a NN is not easy
– The learning phase may never converge for a given structure

Classification: Neural Networks January 18, 2008 Slide 241


COMP527:
Data Mining
Back to Issues

In order to apply a weight to an attribute, that attribute needs to be 
numeric.  6.5 * “fish” is not meaningful.  But assigning numbers 
to the different values is also not meaningful.  “squirrel” (2) is not 
1 greater than “fish” (1)
Could divide the nominal attribute into many boolean 
attributes with values of 0 or 1, as in the sketch below.
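For instance, a hypothetical 'animal' attribute with values fish/squirrel/bird 
could be encoded as three 0/1 attributes; a tiny Python sketch:

def one_hot(value, possible_values):
    """Turn one nominal value into a list of boolean (0/1) attributes."""
    return [1 if value == v else 0 for v in possible_values]

# one_hot('fish', ['fish', 'squirrel', 'bird'])  ->  [1, 0, 0]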

The number of hidden layers is typically 1, but it is quite 
possible to construct a network with more than that.
More than 2 hidden layers becomes very time consuming 
to train, as there will be many more inter-connections.

Classification: Neural Networks January 18, 2008 Slide 242


COMP527:
Data Mining
Neural Network Issues

The number of nodes in the hidden layer is highly debated, but no 
good rule has been discovered to date.  Depends on the 
structure, activation function etc.

The number of output neurons is typically one per class, but 
the value output by each neuron might not be 0/1. It could 
be 0.2 in each of 2 nodes (the network thinks it's equally 
likely to be either, and likely to be neither of them). In a 
two-class set, 0.2 in a single neuron would mean that it 
predicts 0.8 for the other class, if 1.0 is the maximum.

Classification: Neural Networks January 18, 2008 Slide 243


COMP527:
Data Mining
Neural Network Issues

The simplest structure is all nodes connect to all nodes in the next 
highest layer, but this is not necessarily the case.  

Two other possibilities:

● Nodes may connect to only some of the next layer
● Nodes may also be allowed to connect to previous layers

Of course the most important consideration is how to teach 
the network the desired outputs. We're going to assume 
the simplest case for all of the previous issues from now on.

Classification: Neural Networks January 18, 2008 Slide 244


COMP527:
Data Mining
Activation Functions

Before we look at training, we should look at common activation 
functions.  The function could be anything, but typically are one 
of the following:

– Threshold: Neuron fires if the input is greater than a given
value. Output is 1 if it fires, otherwise 0.
– Linear: Neuron always fires with a value linear in the input
– Linear Threshold: Combines the above two
– Sigmoid: An S-shaped curve, eg the logistic function (between 0 and 1):
f(S) = 1 / (1 + e^(-cS))
– Hyperbolic Tangent: Variation on the Sigmoid:
f(S) = (1 - e^(-cS)) / (1 + e^(-cS))
– Gaussian: Bell-shaped curve between 0 and 1
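These are easy to express directly; a small Python sketch of the functions 
listed above, where the constants c, the threshold t and sigma are illustrative 
parameters rather than values from the slides:

import math

def threshold(s, t=0.0):
    """Fires (1) if the input exceeds the threshold, otherwise 0."""
    return 1.0 if s > t else 0.0

def logistic(s, c=1.0):
    """Sigmoid: 1 / (1 + e^(-cS))."""
    return 1.0 / (1.0 + math.exp(-c * s))

def hyperbolic_tangent(s, c=1.0):
    """(1 - e^(-cS)) / (1 + e^(-cS)), a rescaled tanh."""
    return (1.0 - math.exp(-c * s)) / (1.0 + math.exp(-c * s))

def gaussian(s, sigma=1.0):
    """Bell-shaped curve between 0 and 1, peaking at s = 0."""
    return math.exp(-(s * s) / (2.0 * sigma * sigma))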

Classification: Neural Networks January 18, 2008 Slide 245


COMP527:
Data Mining
Training

The network classifies by propagating the values forwards through 
the network (feed-forward) and applying the activation function at 
each step.

The most common learning method is the reverse, called 
back-propagation. We feed an instance forwards through 
the network, calculate how badly it did, and then step 
backwards through the network modifying each node's weights to be 
more accurate.

Repeat for each instance in the training set until the 
network stabilises, an acceptable error rate is reached, or 
you give up and try a different network structure.

Classification: Neural Networks January 18, 2008 Slide 246


COMP527:
Data Mining
Training

We know the expected output at the final layer (the class) so we 
can work out the error of the output from the nodes that connect 
to it.  A typical measure is the mean squared error (MSE):
(yi - di)² / 2
Where for node i, yi is the output and di is the desired output.

This could be repeated for all nodes in the network and summed to 
find the total error for a given instance.  The goal is then to 
minimise that error across all instances of the training set.

Classification: Neural Networks January 18, 2008 Slide 247


COMP527:
Data Mining
Training

The Hebb rule (historical interest only):
    delta(wij) = c * xij * yj
The Delta rule:
    delta(wij) = c * xij * (dj - yj)
For node j, input node i, output y, desired output d and constant c.  
The constant is typically 1 / number of training instances.

So for back propagation, we can step backwards through the 
network after passing an instance through it and modify each 
weight using the delta rule.  ... Almost. 

Classification: Neural Networks January 18, 2008 Slide 248


COMP527:
Data Mining
Training

Remember that we want to minimise the MSE.  We can use Gradient 
Descent to do this.  With a sigmoid function:

for each node i in outputNodes:
    for each node j in inputs to i:
        delta = c * (di - yi) * yj * (1 - yi) * yi
        wji += delta
for each node j in hiddenLayer:
    for each node k in inputs to j:
        outputDelta = 0
        for each node m in outputs from j:
            outputDelta += (dm - ym) * wjm * ym * (1 - ym)
        delta = c * yk * ((1 - yj²) / 2) * outputDelta
        wkj += delta
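A runnable Python rendering of the loops above for a single training instance, 
assuming a logistic activation in both layers (so the derivative is y(1 - y) 
throughout, slightly different from the hidden-layer term on the slide); the 
weight layout and learning rate c are illustrative assumptions.

import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def backprop_one_instance(x, d, w_hidden, w_out, c=0.1):
    """One feed-forward / back-propagation step for a one-hidden-layer network.

    x: attribute values; d: desired outputs (one per output node).
    w_hidden[j]: weights from each input to hidden node j, plus a bias (last entry).
    w_out[k]:    weights from each hidden node to output node k, plus a bias.
    """
    # feed forward
    h = [sigmoid(sum(w * xi for w, xi in zip(wj[:-1], x)) + wj[-1])
         for wj in w_hidden]
    y = [sigmoid(sum(w * hj for w, hj in zip(wk[:-1], h)) + wk[-1])
         for wk in w_out]

    # error terms (computed before any weights are changed)
    out_delta = [(dk - yk) * yk * (1 - yk) for dk, yk in zip(d, y)]
    hid_delta = [h[j] * (1 - h[j]) *
                 sum(out_delta[k] * w_out[k][j] for k in range(len(w_out)))
                 for j in range(len(h))]

    # weight updates: output layer, then hidden layer
    for k, wk in enumerate(w_out):
        for j in range(len(h)):
            wk[j] += c * out_delta[k] * h[j]
        wk[-1] += c * out_delta[k]                 # bias
    for j, wj in enumerate(w_hidden):
        for i in range(len(x)):
            wj[i] += c * hid_delta[j] * x[i]
        wj[-1] += c * hid_delta[j]                 # bias
    return y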

Classification: Neural Networks January 18, 2008 Slide 249


COMP527:
Data Mining
Training

Whuh?!  What's going on there??

Skipping all the math, it finds the 
gradient of the error curve.  To
minimise, we want the gradient to be
zero, so it takes lots of derivatives and
stuff...

If you like the math, read Witten ~230
and Dunham ~112.

For the rest of us... we'll smile and nod and skip ahead... 

Classification: Neural Networks January 18, 2008 Slide 250


COMP527:
Data Mining
Self-Organising Maps

To go back to our original premise, perhaps there's something else 
we can learn from our neurons that didn't just implode from the 
previous math.
Some things those neurons might tell us:

– Firing neurons impact other close neurons
– Neurons that are far apart inhibit each other
– Neurons have specific non-overlapping tasks

In a Kohonen Self Organising Map, the nodes in the hidden layer 
are put into a two dimensional grid so that we have some 
measure of distance between neurons.

Classification: Neural Networks January 18, 2008 Slide 251


COMP527:
Data Mining
Self-Organising Maps

The nodes compete against each other to be the best for a 
particular attribute/instance.  In training, once the best node has 
been determined and had its connection weights modified, the 
nearby nodes also have their weights modified.  The 
neighbourhood of a node can decrease over time, proportional to 
the amount it has 'learnt'.


Classification: Neural Networks January 18, 2008 Slide 252


COMP527:
Data Mining
Radial Basis Function Networks

An RBF network has the standard three layers of nodes.  
The hidden layer has a Gaussian activation function.
The output layer has a Linear or Sigmoidal activation
function.

The important part is the Gaussian activation function for 
the hidden layer. The output is maximal for some value 
and decreases the further away from that value. Each 
node represents a particular point in the input space, so 
the output is how far away from this point the instance is.

Classification: Neural Networks January 18, 2008 Slide 253


COMP527:
Data Mining
Radial Basis Function Networks

Instead of having a fixed activation function for each hidden node, 
the RBF nodes also learn their maximal value and how fast 
the output should drop off away from this value.

These centers and widths can be learnt independently of the 
connection weights.  Typically this is done by clustering.

Classification: Neural Networks January 18, 2008 Slide 254


COMP527:
Data Mining
Further Reading


Witten 6.3

Han 6.6

Dunham 4.5

Berry and Linoff, Chapter 7

Pal and Mitra, Pattern Recognition Algorithms for Data Mining, 
Chapter 7

Classification: Neural Networks January 18, 2008 Slide 255


COMP527:
Data Mining
COMP527: Data Mining

Introduction to the Course Input Preprocessing
Introduction to Data Mining Attribute Selection
Introduction to Text Mining Association Rule Mining
General Data Mining Issues ARM: A Priori and Data Structures
Data Warehousing ARM: Improvements
Classification: Challenges, Basics ARM: Advanced Techniques
Classification: Rules Clustering: Challenges, Basics
Classification: Trees Clustering: Improvements
Classification: Trees 2 Clustering: Advanced Algorithms
Classification: Bayes Hybrid Approaches
Classification: Neural Networks Graph Mining, Web Mining
Classification: SVM Text Mining: Challenges, Basics
Classification: Evaluation Text Mining: Text­as­Data
Classification: Evaluation 2 Text Mining: Text­as­Language
Regression, Prediction Revision for Exam

Classification: SVM January 18, 2008 Slide 256


COMP527:
Data Mining
Today's Topics

Linear vs Nonlinear Classifiers
Support Vectors
Non Linearly Separable Datasets

Classification: SVM January 18, 2008 Slide 257


COMP527:
Data Mining
Dimensionality of Data Sets

Imagine a data set with two numeric attributes ... you could plot the 
instances on a graph.
Imagine a data set with three numeric attributes (eg
h,w,d) ... you could plot it in three dimensional space.

Now don't try to imagine a data set with N attributes, but we could 
treat each attribute as a dimension. A classifier needs to 
find the boundaries of the N-dimensional space in which 
the instances of a particular class reside.

To visualise, we'll use just 2 dimensions, as they handily fit on a slide.

(Ideas for many of these slides thanks to others, esp Barbara Rosario)

Classification: SVM January 18, 2008 Slide 258


COMP527:
Data Mining
Linearly Separable Data

All the instances can be correctly classified by a single linear 
decision boundary.

Classification: SVM January 18, 2008 Slide 259


COMP527:
Data Mining
Non-linearly Separable Data

Not all instances can be correctly classified by a linear decision 
boundary.

Classification: SVM January 18, 2008 Slide 260


COMP527:
Data Mining
Non-linearly Separable Data

But they can be separated by a non-linear boundary.

Classification: SVM January 18, 2008 Slide 261


COMP527:
Data Mining
Non-Separable Data

Random Noise

Classification: SVM January 18, 2008 Slide 262


COMP527:
Data Mining
Linearly Separable Data

Many possible decision boundaries... which is best?

Classification: SVM January 18, 2008 Slide 263


COMP527:
Data Mining
Maximum Margin Hyperplane

The Maximum Margin Hyperplane (MMH) is the boundary with the largest 
distance (margin) between the two classes.

Classification: SVM January 18, 2008 Slide 264


COMP527:
Data Mining
MMH

With some slack to allow for somewhat noisy data

Classification: SVM January 18, 2008 Slide 265


COMP527:
Data Mining
Support Vectors

Find the convex hull of each class. Find the shortest line that can 
connect the two hulls. Then halfway along, at 90 degrees is the 
MMH.   

Instances that are the closest to the MMH are the Support Vectors.

There will always be at least one from each class, but there 
might be more if the hull has a section parallel to the MMH.

Classification: SVM January 18, 2008 Slide 266


COMP527:
Data Mining
Support Vectors

Once we've found the support vectors, we don't care about the 
other instances any more.  The MMH is still the same with just 
these instances.

That's not a vector, that's a smiley face!
That's not a hyperplane, it's a dotted line!

In 2-d, yes. But the same applies in
N-dimensional space where an instance
is a vector like [1,6,3,10,7,14,23] and 
the dividing plane is a 7 dimensional 
monstrosity.

Classification: SVM January 18, 2008 Slide 267


COMP527:
Data Mining
Mathsy Slide

Vector Norm: |X| = √(x1² + x2² + ... + xn²)
Dot Product: X ∙ Y = |X||Y|cosθ
MMH: x = b + ∑ αi yi a(i)∙a

yi is the class value of training instance a(i): 1 or -1
a(i) is a support vector, a is the current instance being compared

Calculating b and αi is a constrained quadratic optimization 
problem.
Constrained Quadratic Optimization: Far Too Complicated!

Classification: SVM January 18, 2008 Slide 268


COMP527:
Data Mining
Non Linearly Separable Data

Most of the time, classes will not be linearly separable. For 
example:

But what if we could transform the data set such that the
curve was actually a straight line. Then we could find the
MMH, and use the same transformation on new instances
to compare apples with apples.

Classification: SVM January 18, 2008 Slide 269


COMP527:
Data Mining
Non Linearly Separable Data

This involves mapping each instance into a higher-dimensional 
space, where the previous curve is now a straight line.  Eg from 
a quadratic curve into a space of polynomial terms, as pictured above.

This could be very expensive, but it turns out that you can 
do some of the work before the mapping (the dot product).
Classification: SVM January 18, 2008 Slide 270
COMP527:
Data Mining
Non-Linear Data

So we need some function Ф that will map our data into a different 
set of dimensions where there's a linear division.  Then we can 
construct a linear classifier using this set of dimensions.
Eg:  a 3D input vector (x, y, z) could be mapped to 6D space (Z) by:
(x, y, z, x², xy, xz)
Decision hyperplane is now linear in this space.  Solve and then 
substitute back so the linear hyperplane in this space 
corresponds to a second order polynomial in the original space.

But doing this for all instances would be very very expensive...

Classification: SVM January 18, 2008 Slide 271


COMP527:
Data Mining
Kernel Methods

There's another math trick we can use.  It turns out that you don't 
need to map the instances and then take the dot product in the 
higher-dimensional space.
So instead of: Ф(x) ∙ Ф(y)
We can do:     Ф(x∙y)
Avoiding a lot of expense.

So all of our calculations happen in the original input space, 
not the higher dimensions. These functions are called 
Kernel Methods or Kernel Functions.

Classification: SVM January 18, 2008 Slide 272


COMP527:
Data Mining
Kernel Methods

Polynomial Kernel:     (x∙y)^n
Gaussian Radial Basis Function Kernel:   e^(-|x-y|² / 2σ²)
Sigmoid Kernel:   tanh(k x∙y - δ)

The Radial Basis Function Kernel and Sigmoid Kernel are the same 
as the neural network activation functions we looked at last time.
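A short Python sketch of the three kernels, computed entirely in the original 
input space; n, sigma, k and delta are illustrative parameter choices:

import math

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def polynomial_kernel(x, y, n=2):
    """(x.y)^n"""
    return dot(x, y) ** n

def rbf_kernel(x, y, sigma=1.0):
    """e^(-|x-y|^2 / 2*sigma^2)"""
    dist2 = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-dist2 / (2.0 * sigma * sigma))

def sigmoid_kernel(x, y, k=1.0, delta=0.0):
    """tanh(k*x.y - delta)"""
    return math.tanh(k * dot(x, y) - delta)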

Classification: SVM January 18, 2008 Slide 273


COMP527:
Data Mining
XOR Problem

The simplest non-linearly separable problem is XOR.   There is no 
hyperplane to distinguish the classes in the normal space, but 
there is in a different space:

Classification: SVM January 18, 2008 Slide 274


COMP527:
Data Mining
Noisy Data

We need some slack to allow for noise in the data preventing the 
classes from being separable.
Introduce another parameter C that determines the
maximum effect any single instance can have on the
decision boundary.
If there are 10 bad instances and 1000 good instances, we 
don't want the bad instances to prevent finding the MMH.
If, by removing an instance, the boundary would move a 
lot, that instance could be noise. (Still a constrained 
quadratic optimization problem ... apparently)

Classification: SVM January 18, 2008 Slide 275


COMP527:
Data Mining
Sparse Data

If the data has lots of 0 values, then these can be ignored when 
computing the dot products.  Eg:  0 squared adds nothing to the 
normalised vector.

This speeds up the processing for sparse data sets as the 
system can iterate through only the non-zero values.

This makes SVM very useful for text classification, where the 
attributes are the frequency of each word in a document 
(eg most words will appear 0 times).

Classification: SVM January 18, 2008 Slide 276


COMP527:
Data Mining
Issues with SVM

– Training and using SVMs with many (100,000s+) support vectors 
can be very slow.
– Determining the best kernel and user-configurable
parameters is typically done by trial and error.
– It can only predict two classes (1 vs -1)
Can learn a model for each of N classes vs all of the other
instances, but this means building lots of models, which is
very very slow.

Classification: SVM January 18, 2008 Slide 277


COMP527:
Data Mining
Further Reading


Witten, 6.3

Han, 6.7

Pal and Mitra, Chapter 4

Classification: SVM January 18, 2008 Slide 278


COMP527:
Data Mining
COMP527: Data Mining

Introduction to the Course Input Preprocessing
Introduction to Data Mining Attribute Selection
Introduction to Text Mining Association Rule Mining
General Data Mining Issues ARM: A Priori and Data Structures
Data Warehousing ARM: Improvements
Classification: Challenges, Basics ARM: Advanced Techniques
Classification: Rules Clustering: Challenges, Basics
Classification: Trees Clustering: Improvements
Classification: Trees 2 Clustering: Advanced Algorithms
Classification: Bayes Hybrid Approaches
Classification: Neural Networks Graph Mining, Web Mining
Classification: SVM Text Mining: Challenges, Basics
Classification: Evaluation Text Mining: Text­as­Data
Classification: Evaluation 2 Text Mining: Text­as­Language
Regression, Prediction Revision for Exam

Classification: Evaluation January 18, 2008 Slide 279


COMP527:
Data Mining
Today's Topics

Evaluation
Samples
Cross Validation
Bootstrap
Confidence of Accuracy

Classification: Evaluation January 18, 2008 Slide 280


COMP527:
Data Mining
Evaluation

We need some way to quantitatively evaluate the results of data 
mining.

– Just how accurate is the classification?
– How accurate can we expect a classifier to be?
– If we can't evaluate the classifier, how can it be improved?
– Can different types of classifier be evaluated in the same way?
– What are useful criteria for such a comparison?
– How can we evaluate clusters or association rules?

There are lots of issues to do with evaluation.

Classification: Evaluation January 18, 2008 Slide 281


COMP527:
Data Mining
Evaluation

Assuming classification, the basic evaluation is how many correct 
predictions it makes as opposed to incorrect predictions.

Can't test on data used for training the classifier and get an
accurate result. The result is "hopelessly
optimistic" (Witten).

Eg: Due to over-fitting, a classifier might get 100% accuracy 
on the data it was trained from and 0% accuracy on other 
data. This is called the resubstitution error rate -- the 
error rate when you substitute the data back into the 
classifier generated from it.

So we need some new, but labeled data to test on.

Classification: Evaluation January 18, 2008 Slide 282


COMP527:
Data Mining
Validation

Most of the time we do not have enough data to have a lot for 
training and a lot for testing, though sometimes this is possible 
(eg sales data)

Some systems have two phases of training: an initial 
learning period and then fine tuning. For example, the 
Growing and Pruning sets for building trees.
It's important not to test on the validation set either.

Note that this reduces the amount of data that you can
actually train on by a significant amount.

Classification: Evaluation January 18, 2008 Slide 283


COMP527:
Data Mining
Numeric Data, Multiple Classes

Further issues to consider:

– Some classifiers produce probabilities for one or more classes.
We need some way to handle the probabilities – for a 
classifier to be partly correct. Also for multi-class 
problems (eg an instance has 2 or more classes) we need 
some 'cost' function for getting an accurate subset of the classes.
– Regression/Numeric Prediction produces a numeric value.
We need statistical tests to determine how accurate this is
rather than true/false for nominal classes.

Classification: Evaluation January 18, 2008 Slide 284


COMP527:
Data Mining
Hold Out Method

Obvious answer:  Keep part of the data set aside for testing 
purposes and use the rest to train the classifier.
Then use the test set to evaluate the resulting classifier in
terms of accuracy.
Accuracy: Number of correctly classified instances / total
number of instances to classify.

Ratio is often 2/3rds training, 1/3rd test.
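A minimal Python sketch of the hold-out method, assuming instances are 
(attributes, label) pairs and that 'classify' is a callable returning a 
predicted label (both assumptions, not from the slides):

import random

def holdout_split(instances, train_fraction=2/3, seed=1):
    """Randomly split the data set into training and test sets (2/3 : 1/3)."""
    data = list(instances)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]

def accuracy(classify, test_set):
    """Correctly classified test instances / total number of test instances."""
    correct = sum(1 for attrs, label in test_set if classify(attrs) == label)
    return correct / len(test_set)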

How should we select the instances for each section?

Classification: Evaluation January 18, 2008 Slide 285


COMP527:
Data Mining
Samples

Easy:  Randomly select instances.

Data could be very unbalanced: eg 99% one class, 1% the other class.
Then random sampling is likely to not draw any of the 1% class.

Stratified: Group the instances by class and then select a 
proportionate number from each class.

Balanced: Randomly select a desired amount of minority-class 
instances, and then add the same number from the majority class.

Classification: Evaluation January 18, 2008 Slide 286


COMP527:
Data Mining
Samples

Stratified:  Group the instances by class and then select a 
proportionate number from each class.

Classification: Evaluation January 18, 2008 Slide 287


COMP527:
Data Mining
Samples

Balanced:  Randomly select a desired amount of minority class 
instances, and then add the same number from the majority 
class. 

Classification: Evaluation January 18, 2008 Slide 288


COMP527:
Data Mining
Small Data Sets

For small data sets, removing some as a test set and still having a 
representative set to train from is hard.  Solutions?

Repeat the process multiple times, select a different test 
set. Then find the error from each, and average across all 
of the iterations.

Of course there's no reason to do this only for small data sets!

Different test sets might still overlap, which might give a 
biased estimate of the accuracy (eg if it randomly selects 
good records multiple times).
Can we prevent this?

Classification: Evaluation January 18, 2008 Slide 289


COMP527:
Data Mining
Cross Validation

Split the dataset up into k parts, then use each part in turn as the 
test set and the others as the training set.

If the data set is also stratified, we can have stratified cross-
validation, rather than perhaps ending up with a non-
representative sample in one or more parts.

Common values for k are 3 (eg hold out) and 10.
Hence: stratified 10-fold cross-validation.

Again, the error values are averaged after the k iterations.
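A sketch of stratified k-fold cross-validation in the same style; 
build_classifier(training_set) is an assumed caller-supplied function 
returning a callable classifier, and instances are (attributes, label) pairs.

import random
from collections import defaultdict

def stratified_folds(instances, k=10, seed=1):
    """Split the instances into k folds, preserving the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for inst in instances:
        by_class[inst[1]].append(inst)          # group by label
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for i, inst in enumerate(members):
            folds[i % k].append(inst)           # deal out round-robin
    return folds

def cross_validate(build_classifier, instances, k=10):
    """Average error rate over the k train/test iterations."""
    folds = stratified_folds(instances, k)
    errors = []
    for i, test in enumerate(folds):
        train = [inst for j, fold in enumerate(folds) if j != i for inst in fold]
        classify = build_classifier(train)
        wrong = sum(1 for attrs, label in test if classify(attrs) != label)
        errors.append(wrong / len(test))
    return sum(errors) / k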

Classification: Evaluation January 18, 2008 Slide 290


COMP527:
Data Mining
Cross Validation

Why 10?  Extensive testing shows it to be a good middle ground -- 
not too much processing, not too random.

Cross-validation is used extensively in the data mining 
literature. It's the simplest and easiest to understand 
evaluation technique, while having good accuracy.

There are other similar evaluation techniques, however ...

Classification: Evaluation January 18, 2008 Slide 291


COMP527:
Data Mining
Leave One Out

Select one instance and train on all others.  Then see if the 
instance is correctly classified.  Repeat and find the percentage 
of accurate results.

Eg: N-fold cross-validation, where N is the number of 
instances in the data set.

Attractive:
● If 10 is good, surely N is better :)
● No random sampling problems
● Trains with the most amount of data

Classification: Evaluation January 18, 2008 Slide 292


COMP527:
Data Mining
Leave One Out

Disadvantages:  
● Computationally expensive, builds N models!
● Guarantees a non-stratified, non-balanced sample.

Worst case: the class distribution is exactly 50/50 and the 
data is so complicated that the classifier simply picks the most 
common class in its training set.
-- It will then always pick the wrong class for the held-out instance.

Classification: Evaluation January 18, 2008 Slide 293


COMP527:
Data Mining
Bootstrap

Until now, the sampling has been without replacement (eg each 
instance occurs once, either in training or test set).
However we could put back an instance to be drawn again --
sampling with replacement.

This results in the 0.632 bootstrap evaluation technique.

Draw a training set from the data set with replacement such 
that the number of instances in both is the same, then use 
the instances which are not in the training set as the test set.
(Eg some instances will appear more than once in the training set.)

Statistically, the likelihood of an instance not being picked is 
(1 - 1/n)ⁿ ≈ e⁻¹ ≈ 0.368, hence the name (1 - 0.368 = 0.632).
Classification: Evaluation January 18, 2008 Slide 294
COMP527:
Data Mining
Bootstrap

Eg:  Have a dataset of 1000 instances.
We sample with replacement 1000 times – eg we randomly
select an instance from all 1000 instances 1000 times.

This should leave us with approximately 368 instances that 
have not been selected. We remove these and use them 
for the test set.

The error rate will be pessimistic – we are only training on 63% of the 
data, with some repeated instances. We compensate by 
combining it with the optimistic error rate from resubstitution:

error rate: 0.632 * error-on-test + 0.368 * error-on-training
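A sketch of one round of the 0.632 bootstrap under the same assumptions as 
the earlier sketches ((attributes, label) instances and a caller-supplied 
build_classifier function):

import random

def bootstrap_632(build_classifier, instances, seed=1):
    """One round of 0.632 bootstrap evaluation."""
    rng = random.Random(seed)
    n = len(instances)
    picked = [rng.randrange(n) for _ in range(n)]        # sample with replacement
    train = [instances[i] for i in picked]
    chosen = set(picked)
    test = [instances[i] for i in range(n) if i not in chosen]

    classify = build_classifier(train)
    def error(data):
        return sum(1 for attrs, label in data if classify(attrs) != label) / len(data)

    return 0.632 * error(test) + 0.368 * error(train)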

Classification: Evaluation January 18, 2008 Slide 295


COMP527:
Data Mining
Confidence of Accuracy

What about the size of the test set?  More test instances should 
make us more confident that the accuracy predicted is close to 
the true accuracy.
Eg getting 75% on 10,000 samples is more likely to be close to 
the true accuracy than 75% on 10.

A series of events that succeed or fail is a Bernoulli process, 
eg coin tosses. We can count S successes from N trials, 
and then compute S/N ... but what does that tell us about the true 
accuracy rate?

Statistics can then tell us the range within which the true 
accuracy rate should fall. Eg: 750/1000 is very likely to 
be between 73.2% and 76.7%.
(Witten 147 to 149 has the full maths!)

Classification: Evaluation January 18, 2008 Slide 296


COMP527:
Data Mining
Confidence of Accuracy

We might wish to compare two classifiers of different types.  We could 
compare the accuracy from 10-fold cross-validation, but there's another 
method:  Student's T-Test

Method:
– Perform ten-fold cross-validation (TCV) 10 times – eg 10 times TCV = 100 models
– Perform the same repeated TCV with the second classifier
– This gives us x1..x10 for the first, and y1..y10 for the second
– Find the mean of the 10 cross-validation runs for each
– Find the difference between the two means

We want to know if the difference is statistically significant.

Classification: Evaluation January 18, 2008 Slide 297


COMP527:
Data Mining
Student's T-Test

We then find 't' by:

    t = d / √(σ² / k)

Where d is the mean of the differences between the paired results, k is the 
number of times the cross-validation was performed, and 
σ² is the variance of those differences 
(variance = sum of squared differences between the mean and each value, 
divided by k-1).

Then look up the value in the table for k-1 degrees of freedom.
(more tables! But printed in Witten pg 155)

If t is greater than z on the table, then the difference is statistically 
significant.
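A small sketch of that computation, where xs and ys are the k paired 
cross-validation results for the two classifiers:

import math

def paired_t(xs, ys):
    """t statistic for paired results x1..xk and y1..yk.

    Compare the result against the t-table with k-1 degrees of freedom.
    """
    k = len(xs)
    diffs = [x - y for x, y in zip(xs, ys)]
    d_mean = sum(diffs) / k
    variance = sum((d - d_mean) ** 2 for d in diffs) / (k - 1)
    return d_mean / math.sqrt(variance / k)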
Classification: Evaluation January 18, 2008 Slide 298
COMP527:
Data Mining
Further Reading


Introductory statistical text books, again

Witten, 5.1­5.4

Han 6.2, 6.12, 6.13

Berry and Browne, 1.4

Devijver and Kittler, Chapter 10

Classification: Evaluation January 18, 2008 Slide 299


COMP527:
Data Mining
COMP527: Data Mining

Introduction to the Course Input Preprocessing
Introduction to Data Mining Attribute Selection
Introduction to Text Mining Association Rule Mining
General Data Mining Issues ARM: A Priori and Data Structures
Data Warehousing ARM: Improvements
Classification: Challenges, Basics ARM: Advanced Techniques
Classification: Rules Clustering: Challenges, Basics
Classification: Trees Clustering: Improvements
Classification: Trees 2 Clustering: Advanced Algorithms
Classification: Bayes Hybrid Approaches
Classification: Neural Networks Graph Mining, Web Mining
Classification: SVM Text Mining: Challenges, Basics
Classification: Evaluation Text Mining: Text­as­Data
Classification: Evaluation 2 Text Mining: Text­as­Language
Regression, Prediction Revision for Exam

Classification: Evaluation 2 January 18, 2008 Slide 300


COMP527:
Data Mining
Today's Topics

Confusion Matrix
Costs
Lift Curves
ROC Curves
Numeric Prediction

Classification: Evaluation 2 January 18, 2008 Slide 301


COMP527:
Data Mining
Confusion Matrix

The 'Confusion Matrix':

                 Actual Yes         Actual No
  Predict Yes:   True Positive      False Positive
  Predict No:    False Negative     True Negative

We want to ensure that True Positive and True Negative are 
as high as possible. The same holds with more than two classes: 
you want the diagonal from top left to bottom right to be 
high, and the others to be low.

(Think of the output from WEKA for example)

Classification: Evaluation 2 January 18, 2008 Slide 302


COMP527:
Data Mining
Kappa Statistic

But what about random luck?  An accuracy of 50% against 1000 
classes is obviously better than against 2 classes.

We can derive the Kappa statistic from the confusion matrix 
for a classifier and an artificial confusion matrix with the 
classes divided in proportion to the overall distribution.

Classification: Evaluation 2 January 18, 2008 Slide 303


COMP527:
Data Mining
Kappa Statistic

Sum the diagonal in the expected by chance matrix.  (82)
Sum the diagonal in the classifier's matrix (140)
Subtract expected from classifier. (58)
Subtract expected from total instances (200 – 82 = 118)
Divide and express as percentage: (58 / 118 = 49%)
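The same steps in Python for a square confusion matrix given as a list of rows 
(rows = actual class, columns = predicted class); the closing comment simply 
repeats the slide's worked figures:

def kappa(confusion):
    """Kappa statistic from a confusion matrix."""
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(size))
    # expected-by-chance diagonal: row total * column total / grand total
    expected = sum(sum(confusion[i]) * sum(row[i] for row in confusion) / total
                   for i in range(size))
    return (observed - expected) / (total - expected)

# With the slide's figures (diagonal of 140 observed vs 82 expected by chance,
# over 200 instances): (140 - 82) / (200 - 82) = 0.49, ie 49%.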

Classification: Evaluation 2 January 18, 2008 Slide 304


COMP527:
Data Mining
Cost

For some situations, it's a lot worse to have a false negative than a 
false positive.

Example: Better to have all true positives, no false 
negatives and some misclassifications if the application is 
detection of (insert un-favourite nasty medical condition).

Example 2: If there's a very skewed ratio of classes (eg 99% 
class A, 1% class B) then you want to tell the system that 
getting 99% accuracy by always predicting A is not good 
enough. The cost of getting it wrong for class B needs to 
be higher than the value of getting it right for class A.

Classification: Evaluation 2 January 18, 2008 Slide 305


COMP527:
Data Mining
Cost

Another example application:  Mass mailed advertising.

If it costs 40 pence to send out a letter, you want to 
maximize the number of letters sent to people who will 
buy, and minimize the number of letters sent out to those 
that won't.

So the Confusion Matrix:

               Predict Yes      Predict No
  Actual Yes   Profit - 40p     Potential profit not used
  Actual No    -40p             Saved money

Classification: Evaluation 2 January 18, 2008 Slide 306


COMP527:
Data Mining
Cost

Can use a cost matrix to determine the cost of errors of a classifier.
Default Cost Matrix:
      A  B  C
  A   0  1  1
  B   1  0  1
  C   1  1  0

But we might wish to change those values for different scenarios.
Then, when evaluating, we sum the values in the cells rather 
than just count up the errors, and use the model with the least 
total cost.

This is only useful for evaluation, not for training a 
cost-sensitive classifier.

Classification: Evaluation 2 January 18, 2008 Slide 307


COMP527:
Data Mining
Training with Costs

We can artificially inflate a 2-class training set with duplicates of the 
preferred class.  Then an error-minimising classifier will attempt 
to reduce the errors on the inflated number.

Eg: Duplicate each 'false' instance 9 more times. The classifier is then 
biased against predicting 'no' wrongly, as doing so costs 10 errors 
instead of 1.

Then evaluate against the correct proportion of instances.

Some classification algorithms also allow instances to be 
weighted directly, rather than duplicating them.

Classification: Evaluation 2 January 18, 2008 Slide 308


COMP527:
Data Mining
Probabilities

Some classifiers give a probability rather than a definite yes/no (eg 
Bayesian techniques)

These must be taken into account when determining cost.
Eg: A 51% probability for the correct class is not that much better 
than a 51% probability for an incorrect class.

We have some extra tricks that we can use to evaluate 
probabilities...

Classification: Evaluation 2 January 18, 2008 Slide 309


COMP527:
Data Mining
Quadratic Loss

Quadratic Loss Function:
    ∑j (pj - aj)²
Where the sum is over the j classes for a single instance: aj is 1 for the 
correct class and 0 for the others, and pj is the probability assigned to 
that class.

Then sum the loss over all test instances for a classifier.

You could then find the mean across different cross-
validation folds... at which point you have the mean squared error.

Classification: Evaluation 2 January 18, 2008 Slide 310


COMP527:
Data Mining
Quadratic Loss

∑j (pj - aj)²
Example:
In a 5-class problem, an instance might have predicted probabilities:
(0.5, 0.2, 0.05, 0.15, 0.1)
When the first class is the true class:
(1, 0, 0, 0, 0)
= (0.5 - 1)² + 0.2² + 0.05² + 0.15² + 0.1²
= .25 + .04 + .0025 + .0225 + .01
= 0.325

(and then summed for all instances, and the mean taken
across CV folds)

Classification: Evaluation 2 January 18, 2008 Slide 311


COMP527:
Data Mining
Information Loss

The opposite of information gain, we can use the same function as 
a cost.
    -E1 log(p1) - E2 log(p2) - ...
Where Ej is the true probability of class j and pj is the predicted probability.
If there is only one true class, then the only term that matters is 
the correct class, as the rest will be multiplied by 0.
Note that if you assign a 0 probability to the true class, you 
get an infinite error! (Don't Do That Then)

Classification: Evaluation 2 January 18, 2008 Slide 312


COMP527:
Data Mining
Precision/Recall

Information Retrieval uses the same confusion matrix:

Recall: relevant and retrieved / total relevant
Precision: relevant and retrieved / total retrieved

eg 10 relevant, of which 6 are retrieved = 60% recall
100 retrieved, with all 10 relevant = 10% precision

The best result is all relevant documents retrieved, and no 
irrelevant documents retrieved.
False Positive: Document retrieved but not relevant
False Negative: Relevant, but not retrieved
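In terms of the confusion-matrix counts, a tiny sketch:

def precision_recall(tp, fp, fn):
    """Precision and recall from true/false positive and false negative counts."""
    precision = tp / (tp + fp)   # relevant and retrieved / total retrieved
    recall = tp / (tp + fn)      # relevant and retrieved / total relevant
    return precision, recall

# slide example: 100 documents retrieved, all 10 relevant ones among them
# precision_recall(tp=10, fp=90, fn=0)  ->  (0.1, 1.0), ie 10% precision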

Classification: Evaluation 2 January 18, 2008 Slide 313


COMP527:
Data Mining
Lift Charts

To go back to the directed advertising example... A data mining tool 
might predict that, given a sample of 100,000 recipients, 400 will 
buy (0.4%).  Given 400,000, then it predicts that 800 will buy 
(0.2%).

In order to work out where the ideal point is, we need to 
include information about the cost of sending an 
advertisement vs the profit gained from someone that 
responds (eg will 300,000 extra ads be worth 400 extra people?).

This can be graphed, hence a lift chart...

Classification: Evaluation 2 January 18, 2008 Slide 314


COMP527:
Data Mining
Lift Charts

The lift is what is gained from the baseline to the black line, as 
determined by the classification engine (or a Cumulative Gains chart).

This can be accomplished by ranking instances by the 
highest probabilities first.

Classification: Evaluation 2 January 18, 2008 Slide 315


COMP527:
Data Mining
ROC Curves

From signal processing:  Receiver Operating Characteristic.

Tradeoff between hit rate and false alarm rate when trying 
to find real data in a noisy channel.

Plot true positives vertically, and false positives horizontally.
As with Lift charts, the place to be is the top left.

Generate a list of instances ordered by predicted probability, noting whether 
the classifier correctly classifies each one. Then for each true positive take 
a step up, and for each false positive take a step to the right.

Eg...
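For example, a sketch that traces the steps from a list of instances ranked 
most-probable first, each entry recording whether the instance really is 
positive:

def roc_points(ranked_is_positive):
    """Trace the ROC curve: a step up per true positive, a step right per
    false positive, walking down the ranking."""
    points, tp, fp = [(0, 0)], 0, 0
    for is_positive in ranked_is_positive:
        if is_positive:
            tp += 1      # true positive: step up
        else:
            fp += 1      # false positive: step right
        points.append((fp, tp))
    return points

# roc_points([True, True, False, True, False])
# -> [(0, 0), (0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]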

Classification: Evaluation 2 January 18, 2008 Slide 316


COMP527:
Data Mining
ROC Curves

We can generate a smoother curve by the use of cross-validation: 
generate a curve for each fold, and then average them.

Classification: Evaluation 2 January 18, 2008 Slide 317


COMP527:
Data Mining
ROC Curves

We can also plot two curves on the same chart, each generated 
from different classifiers.  This lets us see at which point it's 
better to use one classifier rather than the other.
By using both A and B classifiers with appropriate
weightings, it's possible to get at points in between the
two peaks.

Classification: Evaluation 2 January 18, 2008 Slide 318


COMP527:
Data Mining
Numeric Prediction

The most common is the Mean Squared Error, which we have seen before 
(subtract the prediction from the actual value, square it, average).
Also the Mean Absolute Error – don't square it, just average the 
magnitude of each error.

But if there is a great difference between the numbers to be 
predicted, we might want to use a relative error. Eg 50 out 
in a prediction of 500, vs 0.2 out in a prediction of 2: the 
same magnitude, relatively speaking.

So, we have the Relative Squared Error.

Classification: Evaluation 2 January 18, 2008 Slide 319


COMP527:
Data Mining
Numeric Prediction

Classification: Evaluation 2 January 18, 2008 Slide 320


COMP527:
Data Mining
Further Reading


Witten, Chapter 5

Han, 6.15

Classification: Evaluation 2 January 18, 2008 Slide 321


COMP527:
Data Mining
COMP527: Data Mining

Introduction to the Course Input Preprocessing
Introduction to Data Mining Attribute Selection
Introduction to Text Mining Association Rule Mining
General Data Mining Issues ARM: A Priori and Data Structures
Data Warehousing ARM: Improvements
Classification: Challenges, Basics ARM: Advanced Techniques
Classification: Rules Clustering: Challenges, Basics
Classification: Trees Clustering: Improvements
Classification: Trees 2 Clustering: Advanced Algorithms
Classification: Bayes Hybrid Approaches
Classification: Neural Networks Graph Mining, Web Mining
Classification: SVM Text Mining: Challenges, Basics
Classification: Evaluation Text Mining: Text­as­Data
Classification: Evaluation 2 Text Mining: Text­as­Language
Regression, Prediction Revision for Exam

Regression, Prediction January 18, 2008 Slide 322


COMP527:
Data Mining
Today's Topics

Prediction / Regression
Linear Regression
Logistic Regression
Support Vector Regression
Regression Trees

Regression, Prediction January 18, 2008 Slide 323


COMP527:
Data Mining
Prediction

Classification tries to determine which class an instance belongs to, 
based on known classes for instances by generating a model 
and applying it to new instances.  The model generated can be in 
many forms (rules, tree, graph, vectors...). The output is the 
class which the new instance is predicted to be part of.

So the class for classification is a nominal attribute.

What if it was numeric, with no enumerated set of values?

Then our problem is one of prediction rather than classification.

Regression, Prediction January 18, 2008 Slide 324


COMP527:
Data Mining
Prediction / Regression

Regression takes data and finds a formula for it.  As with SVM, the 
formula can be the model used for classification.  This might 
learn the formula for the probability of a particular class from 0..1 
and then return the most likely class.

It can also be used for predicting/estimating values of a 
numeric attribute, simply by applying the learnt formula to the data.

At the end of the lecture we'll look at regression trees, which 
combine decision trees and regression.

Regression, Prediction January 18, 2008 Slide 325


COMP527:
Data Mining
Prediction / Regression

For example, instead of determining that the weather will be 'hot' 
'warm', 'cool' or 'cold', we may need to be able to say with some 
degree of accuracy that it will be 25 degrees or 7.5 degrees, 
even if 7.5 never appeared in the temperature attribute for the 
training data.

Or the stress on a structure under various conditions, the 
number of seconds a boxer might last in the ring, the 
number of goals a team would score over a season, or 
any other numeric value that you might want to try to predict.

Regression, Prediction January 18, 2008 Slide 326


COMP527:
Data Mining
Linear Regression

Express the 'class' as a linear combination of the attributes with 
determined weights. eg:

x = w0 + w1a1 + w2a2 + ... + wnan
Where w is a weight, and a is an attribute.
The predicted value for instance i then is found by putting the attribute 
values for i into the appropriate a slots.

So we need to learn the weights that minimize the error between actual 
value and predicted value across the training set.
(Sounds like Perceptron, right?)

Regression, Prediction January 18, 2008 Slide 327


COMP527:
Data Mining
Linear Regression

To determine the weights, we try to minimize the sum of the squared error 
across all the instances:
    ∑i (xi - ∑k wk aik)²
Where xi is the actual value for instance i and the second 
term is the predicted value from applying all k weights to the 
k attribute values of instance i.

To do so we can use the method described in Dunham, ~pg 85.

(Which I'm not going to try and explain!)

Regression, Prediction January 18, 2008 Slide 328


COMP527:
Data Mining
Linear Regression

Simple case:  Method of Least Squares

    w = ∑(xi - avg(x))(yi - avg(y)) / ∑(xi - avg(x))²

solves the simple case of y = b + wx

And then we find b by:
    b = avg(y) – w * avg(x)
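The same calculation in Python for paired (x, y) values:

def least_squares(xs, ys):
    """Fit y = b + w*x by the method of least squares."""
    n = len(xs)
    x_avg, y_avg = sum(xs) / n, sum(ys) / n
    w = (sum((x - x_avg) * (y - y_avg) for x, y in zip(xs, ys)) /
         sum((x - x_avg) ** 2 for x in xs))
    b = y_avg - w * x_avg
    return b, w

# least_squares([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])  ->  roughly (0.15, 1.94)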

Regression, Prediction January 18, 2008 Slide 329


COMP527:
Data Mining
Non-Linear Regression

We could apply a function to each attribute instead of just 
multiplying by a weight.

For example:
x = c + f1(a1) + f2(a2) + ... + fn(an)
Where f is some function (eg square, log, square root, modulo 6, 
etc)

Of course determining the appropriate function is a problem!

Regression, Prediction January 18, 2008 Slide 330


COMP527:
Data Mining
Logistic Regression

Instead of fitting the data to a straight line, we can try to fit it to a 
logistic curve (a flat S shape).
This curve gives values between 0 and 1, and hence can be used 
for probability.

We won't go into how to work 
out the coefficients, but the 
result is the same as the linear 
case:

    x = c + w1a1 + w2a2 + ... + wnan

Regression, Prediction January 18, 2008 Slide 331


COMP527:
Data Mining
Support Vector Regression

We looked at the maximum margin hyperplane, which involved 
learning a hyperplane to distinguish two classes.  Could we learn 
a prediction hyperplane in the same way?
That would allow the use of kernel functions for the non-linear case. 

Goal is to find a function that has at most E deviation in prediction 
from the training set, while being as flat as possible.  This 
creates a tube of width 2E around the function.  Points that do 
not fall within the tube are support vectors.

Regression, Prediction January 18, 2008 Slide 332


COMP527:
Data Mining
Support Vector Regression

By also trying to flatten the function, bad choices for E can be 
problematic.

If E is too big and encloses all the points, then the function will 
simply find the mean.  If E is 0, then all instances are support 
vectors. Too small and there will be too many support vectors, 
too large and the function will be too flat to be useful.

We can replace the dot product in the regression equation with a 
kernel function to perform non­linear support vector regression:
x = b + ∑αia(i)∙a

Regression, Prediction January 18, 2008 Slide 333


COMP527:
Data Mining
Regression and Model Trees

The problem with linear regression is that most data sets are not linear.
The problem with non-linear regression is that it's even more 
complicated!

Enter Regression Trees and Model Trees.

Idea: Use a Tree structure (divide and conquer) to split up the instances 
such that we can more accurately apply a linear model to only the 
instances that reach the end node.

So branches are normal decision tree tests, but instead of a class value 
at the node, we have some way to predict or specify the value.

Regression, Prediction January 18, 2008 Slide 334


COMP527:
Data Mining
Regression vs Model Trees

Regression Trees:  The leaf nodes have the average value of the 
instances to reach it.

Model Trees:  The leaf nodes have a (linear) regression model to 
predict the value of the instances that reach it.
So a regression tree is a constant value model tree.

Issues to consider:
– Building
– Pruning / Smoothing

Regression, Prediction January 18, 2008 Slide 335


COMP527:
Data Mining
Building Trees

We know that we need to construct a tree, with a linear model at 
each node and an attribute split at non leaf nodes.

To split, we need to determine which attribute to split on, and where 
to split it.  (Remember that all attributes are numeric)

Witten (p245) proposes Standard Deviation Reduction -- treating 
the std dev of the class values as a measure of the error at the 
node and maximising the reduction in that value for each split.
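A sketch of that measure, taking the numeric class values at a node and the 
values falling on each side of a candidate split:

import statistics

def sdr(class_values, left, right):
    """Standard Deviation Reduction for a candidate split."""
    def sd(values):
        return statistics.pstdev(values) if values else 0.0
    n = len(class_values)
    return sd(class_values) - (len(left) / n * sd(left) +
                               len(right) / n * sd(right))

# choose the attribute and split position with the maximum SDR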
 

Regression, Prediction January 18, 2008 Slide 336


COMP527:
Data Mining
Smoothing

It turns out that the value predicted at the bottom of the tree is generally 
too coarse, probably because it was built against only a small subset of 
the data.  
We can fine tune the value by building a linear model at each node along 
with the regular split and then send the value from the leaf back up the 
path to the root of the tree, combining it with the values at each step.

p' = (np + kq) / (n + k)

p' is prediction to be passed up. p is prediction passed to this node.
q is the value predicted at this node. n is the number of instances that 
reach the node below. k is a constant.

Regression, Prediction January 18, 2008 Slide 337


COMP527:
Data Mining
Pruning

Pruning can also be accomplished using the models built at each 
node.

We can estimate the error at each node using the model built, by 
taking the actual error on the instances that reach the node and 
multiplying by (n+v)/(n-v), where n is the number of instances that 
reach the node and v is the number of parameters in the linear model 
for the node.
We do this multiplication to avoid under-estimating the error on new 
data, rather than the data it was trained against.

If the estimated error is lower at the parent, the leaf node can be 
dropped.

Regression, Prediction January 18, 2008 Slide 338


COMP527:
Data Mining
Building Algorithm

MakeTree(instances):
    SD = sd(instances)   // standard deviation of the class values
    root = new Node(instances)
    split(root)
    prune(root)

split(node):
    if len(node) < 4 or sd(node) < 0.05 * SD:
        node.type = LEAF
    else:
        node.type = INTERIOR
        foreach attribute a:
            foreach possibleSplitPosition s in a:
                calculateSDR(a, s)
        splitNode(node, maximumSDR)
        split(node.left)
        split(node.right)

Regression, Prediction January 18, 2008 Slide 339


COMP527:
Data Mining
Pruning Algorithm

prune(node):
    if node.type == INTERIOR:
        prune(node.left)
        prune(node.right)
        node.model = new linearRegression(node)
        if subTreeError(node) > error(node):
            node.type = LEAF

subTreeError(node):
    if node.type == INTERIOR:
        return (len(left) * subTreeError(left) +
                len(right) * subTreeError(right)) / len(node)
    else:
        return error(node)

Regression, Prediction January 18, 2008 Slide 340


COMP527:
Data Mining
Specific Algorithms

Some regression/model trees:

CHAID (Chi­Squared Automatic Interaction Detector). 1980. 
Can be used for either continuous or nominal classes.

CART (Classification And Regression Tree).  1984.  
Entropy or Gini to choose attribute, binary split for selected 
attribute.

M5  Quinlan's model tree inducer (of C4.5 fame). 1992.

Regression, Prediction January 18, 2008 Slide 341


COMP527:
Data Mining
Further Reading


Introductory statistical text books, still!

Witten, 3.7, 4.6, 6.5

Dunham, 3.2, 4.2

Han, 6.11

Regression, Prediction January 18, 2008 Slide 342
