Data Mining
COMP527: Data Mining
Dr Robert Sanderson
(azaroth@liv.ac.uk)
Dept. of Computer Science
University of Liverpool
2008
These are the full course notes, but not quite complete. You should come to the lectures anyway. Really.
Introduction to the Course Input Preprocessing
Introduction to Data Mining Attribute Selection
Introduction to Text Mining Association Rule Mining
General Data Mining Issues ARM: A Priori and Data Structures
Data Warehousing ARM: Improvements
Classification: Challenges, Basics ARM: Advanced Techniques
Classification: Rules Clustering: Challenges, Basics
Classification: Trees Clustering: Agglomerative/Divisive
Classification: Trees 2 Clustering: Advanced Algorithms
Classification: Bayes Hybrid Approaches
Classification: Neural Networks Graph Mining, Web Mining
Classification: SVM Text Mining: Challenges, Basics
Classification: Evaluation Text Mining: TextasData
Classification: Evaluation 2 Text Mining: TextasLanguage
Regression, Prediction Revision for Exam
Me, You: Introductions
Lectures
Tutorials
References
Course Summary
Assessment
Something Fun*
* Or at least more fun, hopefully
Dr. Robert Sanderson
Office: 1.04, Ashton Building
Extension: 54252 [external: 795 4252]
Email: azaroth@liv.ac.uk
Web: http://www.csc.liv.ac.uk/~azaroth/
Hours: 10:00 to 18:00, not Thursday
Email for a time, or show up at any time knowing that
I might not be there.
Where's your accent from: New Zealand
So you went to Waikato?
Your PhD is in Data Mining?
... Computer Science?
... Science? Math? Engineering?
You at least write Java?
... C++?
What sort of CS Lecturer are you?!
Went to University of Canterbury (NZ, not Kent)
... But I do know Ian Witten quite well.
PhD is in French/History
... But focused on Computing in the Humanities/Informatics
Python!
Information Science: Information Retrieval, Data Mining, Text
Mining, XML, Databases, Interoperability, Grid Processing,
Digital Preservation ...
...
Lecture Slots:
Monday: 10-11am, Here
Tuesday: 10-11am, Here
Friday: 2-3pm, Here
Course requirement: 30 hours of lectures
Semester Timetable:
8 weeks class, 3 weeks Easter, 4 weeks class.
Dates:
21st January to 11th of March (Rob @ conference on 14th)
7th April to 21st April (But may run to 25th?)
Tutorials/Lab Sessions
Location:
Lab 6, Tuesdays 3-4pm
(just before departmental seminar)
Aims:
Provide time for practical experience
Answer any questions from lectures/reading
Informal self-assessment exercises
Software:
Data mining 'workbench' software WEKA installed on Windows
image. May be available under Linux. Freely downloadable from
University of Waikato:
http://www.cs.waikato.ac.nz/ml/weka/
http://www.csc.liv.ac.uk/teaching/modules/newmscs2/comp527.html
http://www.csc.liv.ac.uk/~azaroth/courses/current/comp527/
– Witten, Ian and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, Second Edition, Morgan Kaufmann, 2005
– Dunham, Margaret H., Data Mining: Introductory and Advanced Topics, Prentice Hall, 2003
– Han and Kamber, Data Mining: Concepts and Techniques, Second Edition, Morgan Kaufmann, 2006
– Berry and Browne, Lecture Notes in Data Mining, World Scientific, 2006
– Berry and Linoff, Data Mining Techniques, Second Edition, Wiley, 2004
– Zhang, Association Rule Mining, Springer, 2002
– Konchady, Text Mining Application Programming, Thomson, 2006
– Weiss et al., Text Mining: Predictive Methods for Analyzing Unstructured Information, Springer, 2005
– Inmon, Building the Data Warehouse, Wiley, 1993
– KDD (http://www.kdd2007.com)
– PAKDD (http://lamda.nju.edu.cn/conf/PAKDD07/)
– PKDD (http://www.ecmlpkdd2008.org/)
– CiteSeer: http://citeseer.ist.psu.edu/
– KDNuggets: http://www.kdnuggets.com/
– UCI Repository: http://kdd.ics.uci.edu/
(plus follow link to Machine Learning Archive)
– Wikipedia: http://en.wikipedia.org/wiki/Data_mining
– MathWorld: http://mathworld.wolfram.com/
– Google Scholar: http://scholar.google.com/
– NaCTeM: http://www.nactem.ac.uk/
Total: 30 lectures
● Choose 4 of 5 sections
Something Fun! *
* (Or more fun than the rest of the lecture at least, your mileage may
vary, opinions expressed herein bla bla bla)
The Rules:
– Each player is dealt 7 cards by the dealer
– The first person to have no cards in hand wins
– Every turn, each player discards a card
– Play starts with the person to the left of the dealer and proceeds
to the left
– The dealer and then the winner of each round makes a secret
rule
– If you break a rule, you receive a penalty from the rule's creator
– The penalty is: You must draw one card
– Later rules may overturn earlier rules, either completely or in part
– Each rule may only change one aspect of the game play
– Penalty conditions for breaking rules include:
● Illegal card played (eg black on red)
● Procedural error (eg playing out of turn)
● Incorrect penalty (eg when a later rule enables a play)
– Each rule is numbered (eg: Procedural error under Rule 3)
– When taking a penalty for playing out of turn or for discarding multiple cards, you must return the game to the state it was in before the offending play, and then the penalty is incurred.
● Basic Functions
● Applications
Some Definitions:
– “The nontrivial extraction of implicit, previously unknown, and
potentially useful information from data” (Piatetsky-Shapiro)
– "...the automated or convenient extraction of patterns
representing knowledge implicitly stored or captured in large
databases, data warehouses, the Web, ... or data
streams." (Han, pg xxi)
– “...the process of discovering patterns in data. The process
must be automatic or (more usually) semiautomatic. The
patterns discovered must be meaningful...” (Witten, pg 5)
– “...finding hidden information in a database.” (Dunham, pg 3)
– “...the process of employing one or more computer learning
techniques to automatically analyse and extract knowledge
from data contained within a database.” (Roiger, pg 4)
Many texts treat KDD and Data Mining as the same process,
but it is also possible to think of Data Mining as the
discovery part of KDD.
Dunham:
KDD is the process of finding useful information and
patterns in data.
Data Mining is the use of algorithms to extract information
and patterns derived by the KDD process.
For this course, we will discuss the entire process (KDD) but
focus mostly on the algorithms used for discovery.
The KDD process (as tweaked by Dunham):
Initial Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Transformation) → Transformed Data → (Data Mining) → Data Model → (Interpretation) → Knowledge
Data Mining techniques divide into Supervised Learning and Unsupervised Learning.
Two phases:
1. Given labelled data instances, learn model for how
to predict the class label for them. (Training)
2. Given an unlabelled, unseen instance, use the
model to predict the class label. (Prediction)
● Witten Chapter 1
● Dunham Chapter 1
● Han Chapter 1; Sections 6.1, 7.1
● Berry & Linoff Chapters 1,2
● http://en.wikipedia.org/wiki/Data_mining
and linked pages
Information Retrieval (IR)
What is IR?
Typical IR Process
Text Mining
What is Text Mining?
Typical Text Mining Process
Applications
Examples:
SQL: Find rows where the text column LIKE “%information
retrieval%”
Not only does Google find relevant pages, it finds them Fast,
for many thousands (maybe millions?) of concurrent
users.
No! Google has a good answer for how to search the web,
but there are many more sources of data, and many more
interesting questions.
(Diagram: the typical IR process. A user with an information need formulates a query; the search engine matches the query against preprocessed records of the documents; the matching target documents are returned to the user.)
Format Processing: Extraction of text from different file formats
Indexing: Efficient extraction/storage of terms from text
Query Languages: Formulation of queries against those indexes
Protocols: Transporting queries from client to server
Relevance Ranking: Determining the relevance of a document to the
user's query
Metasearch: Cross-searching multiple document sets with the same query
GridIR: Using the grid (or other massively parallel infrastructure) to
perform IR processes
Multimedia IR: IR techniques on multimedia objects, compound digital
objects...
All of the Data Mining functions can be applied to textual data, using
term as the attribute and frequency as the value.
Classification:
Classify a text into subjects, genres, quality, reading age, ...
Clustering:
Cluster together similar texts
Key challenge is the very large number of terms (eg the number of
different words across all documents)
So, we've looked at Data Mining and IR... What's Text Mining then?
Good question. No canonical definition yet, but a similar definition for
Data Mining could be applied:
The nontrivial extraction of previously unknown, interesting facts from
an (invariably large) collection of texts.
So it sounds like a combination of IR and Data Mining, but actually the
process involves many other steps too. Before we look at what actually
happens, let's look at why it's different...
Data Mining finds a model for the data based on the attributes of the
items. The only attributes of text are the words that make up the text.
As we looked at for IR, this creates a very sparse matrix.
Even if we create that matrix, what sort of patterns could we find:
– Classification: We could classify texts into predefined classes
(eg spam / not spam)
– Association Rule Mining: Finding frequent sets of words.
(eg if 'computer' appears 3+ times, then 'data' appears at least once)
– Clustering: Finding groups of similar documents (IR?)
None of these fit our definition of Text Mining.
Information Retrieval finds documents that match the user's query.
Even if we matched at a sentence level rather than document, all we do is
retrieve matching sentences, we're not discovering anything new.
The relevance ranking is important, but it still just matches information
we already knew... it just orders it appropriately.
IR (typically) treats a document as a big bag of words... but doesn't care
about the meaning of the words, just if they exist in the document.
IR doesn't fit our definition of Text Mining either.
How would one find previously unknown facts from a bunch of text?
– Need to understand the meaning of the text!
●
Part of speech of words
●
Subject/Verb/Object/Preposition/Indirect Object
– Need to determine that two entities are the same entity.
– Need to find correlations of the same entity.
– Form logical chains: Milk contains Magnesium. Magnesium stimulates receptor activity. Inactive receptors cause Headaches → Milk is good for Headaches. (fictional example!)
First we need to tag the text with the parts of speech for each word.
eg:
Rob/noun teaches/verb the/article course/noun
How could we do this? By learning a model for the language! Essentially a data mining classification problem: should the system classify the word as a noun, a verb, an adjective, etc.?
Lots of different tags, often based on a set called the Penn Treebank.
(NN = Noun, VB = Verb, JJ = Adjective, RB = Adverb, etc)
Now we need to discover the phrases and parts of each clause.
Rob/noun teaches/verb the/article course/noun
(Subject: Rob Verb:teaches (Object: the+course))
The phrase sections are often expressed as trees:
( TOP
  ( S
    ( NP ( DT This ) ( JJ crazy ) ( NN sentence ) )
    ( VP ( VBD amused )
      ( NP ( NNP Rob ) )
      ( PP ( IN for )
        ( NP ( DT a ) ( JJ few ) ( NNS minutes ) ) ) ) ) )
Once we've parsed the text for linguistic structure, we need to identify the
real world objects referred to.
Rob teaches the course
Rob: Me (Sanderson, Robert D., b. 1976-07-20, Rangiora/New Zealand)
the course: Comp527 2006/2007, University of Liverpool, UK
This is typically done via lookups in very large thesauri or 'ontologies',
specific to the domain being processed (eg medical, historical, current
events, etc.)
There will normally be a lot more text to parse:
Rob Sanderson, a lecturer at the University of Liverpool, teaches a
masters level course on data mining (Comp527)
Rob is a lecturer
Rob is at the University of Liverpool
Rob teaches a course
The course is called Comp527
The course is masters level
The course is about data mining
Rob Sanderson, a lecturer at the University of Liverpool, teaches a
masters level course on data mining (Comp527)
Data mining is about finding models to describe data sets.
→ The University of Liverpool has a course about finding models to describe data sets.
(Not very interesting or novel in this case, but that's the process)
Search engines of all types are based on IR.
But where would you use text mining?
Most research so far is on medical data sets ... because this is the most
profitable! If you could correlate facts to find a cure for cancer, you
would be very VERY rich! So ... lots of people are trying to do just
that for various values of 'cancer'.
Also because of the wide availability of ontologies and datasets, in
particular abstracts for medical journal articles (PubMed/Medline)
More application areas:
News feeds
Terrorism detection
Social sciences analysis
Historical text analysis
Corpus linguistics
'Net Nanny' filters
etc.
● Weiss et al, Chapter 1 (and 2 if you're interested)
● Baeza-Yates, Modern Information Retrieval, Chapter 1
● Jackson and Moulinier, Natural Language Processing for Online Applications, Chapter 1
● http://www.jisc.ac.uk/publications/publications/pub_textmining.aspx
● http://people.ischool.berkeley.edu/~hearst/textmining.html
Machine Learning?
Input to Data Mining Algorithms
Data types
Missing values
Noisy values
Inconsistent values
Redundant values
Number of values
Over-fitting / Under-fitting
Scalability
Human Interaction
Ethical Data Mining
The aim of data mining is to learn a model for the data. This
could be called a concept of the data, so our outcome will
be a concept description.
@relation Iris
@attribute sepal_length numeric
@attribute sepal_width numeric
@attribute petal_length numeric
@attribute petal_width numeric
@data
5.1, 3.5, 1.4, 0.2
4.9, 3.0, 1.4, 0.2
4.7, 3.2, 1.3, 0.2
5.0, 3.6, 1.4, 0.2
...
Nominal:
@attribute name {option1, option2, ... optionN}
Numeric:
@attribute name numeric -- real values
Other:
@attribute name string -- text fields
@attribute name date -- date fields (ISO-8601 format)
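As a rough illustration of the format, a minimal Python sketch that reads the header and data sections of a simple ARFF file (read_arff is a made-up name; WEKA parses ARFF natively, including quoting and sparse formats that this sketch ignores):

def read_arff(path):
    attributes, data, in_data = [], [], False
    for line in open(path):
        line = line.strip()
        if not line or line.startswith('%'):
            continue                       # skip blank lines and comments
        if line.lower().startswith('@attribute'):
            attributes.append(line.split()[1])    # the attribute name
        elif line.lower().startswith('@data'):
            in_data = True                 # everything after @data is instances
        elif in_data:
            data.append([v.strip() for v in line.split(',')])
    return attributes, data

# attrs, rows = read_arff('iris.arff')
# attrs -> ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']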
The following issues will come up over and over again, but
different algorithms have different requirements.
But not all of those terms are useful for determining (for
example) if an email is spam. 'the' does not contribute to
spam detection.
Data Warehouses
Data Cubes
Warehouse Schemas
OLAP
Materialisation
Most common definition:
“A data warehouse is a subject-oriented, integrated,
time-variant and nonvolatile collection of data in
support of management's decision-making process.” -
W. H. Inmon
– Subject-oriented:
● Focused on important subjects, not transactions
– Integrated:
● Constructed from multiple, heterogeneous data
– Time-variant:
● Has different values for the same fields over time.
– Nonvolatile:
● Physically separate store
Data Warehouses are distinct from:
OLAP: Online Analytical Processing (Data Warehouse)
OLTP: Online Transaction Processing (Traditional DBMS)
Data is normally Multi-Dimensional, and can be thought of as a cube.
Image courtesy of IBM OLAP Miner documentation
The lattice of cuboids for dimensions time, item, location, supplier:
all: the 0-D (apex) cuboid
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time,item,location,supplier
Each dimension can also be thought of in terms of different units.
– Time: decade, year, quarter, month, day, hour (and
week, which isn't strictly hierarchical with the others!)
– Location: continent, country, state, city, store
– Product: electronics, computer, laptop, dell, inspiron
(Diagram: a fact constellation with fact tables ORDER and CONTRACTS sharing dimension hierarchies: Customer; Shipping Method (AIR-EXPRESS, TRUCK); Time (ANNUALLY, QTRLY, DAILY); Product (PRODUCT LINE, PRODUCT GROUP, PRODUCT ITEM); Geography (COUNTRY, REGION, DISTRICT); Organization (DIVISION, DISTRICT, SALES PERSON); Promotion.)
– Star Schema: Single fact table in the middle, with connected set
of dimension tables
(Hence a star)
– Snowflake Schema: Some of the dimension tables are further refined into smaller dimension tables
(Hence looks like a snowflake)
– Fact Constellation: Multiple fact tables can share
dimension tables
(Hence looks like a collection of star schemas. Also called
Galaxy Schema)
Star schema example:
Sales Fact Table: time_key, item_key, location_key, units_sold (the measure)
Time Dimension: time_key, day, day_of_week, month, quarter, year
Item Dimension: item_key, name, brand, type, supplier_type
Location Dimension: location_key, street, city, state, country, continent

Snowflake schema example: as above, but the Item Dimension holds a supplier_key (refined into a separate Supplier Dimension), and the Location Dimension holds only location_key, street, city_key, with a separate City Dimension: city_key, city, state, country.
ROLAP: Relational OLAP
● Uses relational DBMS to store and manage the warehouse
data
● Optimised for non-traditional access patterns
(Diagram: the typical warehouse architecture. Operational DBs and other sources are Extracted, Transformed, Loaded and Refreshed, via a Monitor & Integrator with associated Metadata, into the Data Warehouse and Data Marts; an OLAP Server then serves Query, Reports, Analysis and Data Mining front ends.)
In order to compute OLAP queries efficiently, we need to materialise some of the cuboids from the data.
● None: Very slow, as the entire cube must be computed at run time
● Full: Very fast, but requires a LOT of storage space (and time to precompute)
● Partial: Materialise only the most useful cuboids, as a compromise between the two
● Han, Chapters 3, 4
● Dunham, Sections 2.1, 2.6, 2.7
● Berry and Linoff, Chapter 15
● Inmon, Building the Data Warehouse
● Inmon, Managing the Data Warehouse
● http://en.wikipedia.org/wiki/Data_warehouse and subsequent links
Classification
Basic Algorithms:
KNN
Perceptron
Winnow
Accuracy
● Percent of instances classified correctly (as a %)
Speed
● Computational cost of both learning model and
predicting classes
Robustness
● Ability to cope with noisy or missing data
Scalability
● Ability to cope with very large amounts of data
Interpretability
● Is the model understandable to a human, or otherwise
useful?
Enumerable class labels
● Some algorithms predict a probability for more than one label
● Sometimes called a categorical attribute (eg Han & Kamber)
Although kiwis can't fly like most other birds, they resemble
birds more than they resemble other types of animals.
Can remove instances from the data set that do not help, for
example a tight cluster of 1000 instances of the same
class is unnecessary for k<50
The square boxes are inputs, the w lines are weights and
the circle is the perceptron. The learning problem is to
find the correct weights to apply to the attributes.
weightVector = [0, ..., 0]
while classificationFailed,
    for each training instance I,
        if not classify(I) == I.class,
            if I.class == class1:
                weightVector += I
            else:
                weightVector -= I
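A runnable Python version of the same idea, as a sketch (labels coded +1/-1 and an appended bias input are illustrative assumptions, not from the notes):

def train_perceptron(instances, max_epochs=100):
    # instances: list of (attribute_vector, cls) with cls in {+1, -1}
    n = len(instances[0][0])
    w = [0.0] * (n + 1)                  # last weight acts as the bias
    for _ in range(max_epochs):
        failed = False
        for x, cls in instances:
            xb = list(x) + [1.0]         # append the bias input
            pred = 1 if sum(wi * xi for wi, xi in zip(w, xb)) >= 0 else -1
            if pred != cls:
                failed = True
                # add the instance for class1 (+1), subtract for the other
                w = [wi + cls * xi for wi, xi in zip(w, xb)]
        if not failed:
            break
    return w

# w = train_perceptron([([2.0, 1.0], 1), ([0.0, 3.0], -1)])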
delta = (user defined)
while classificationFailed,
    for each instance I,
        if classify(I) != I.class,
            if I.class == class1,
                for each attribute ai in I,
                    if ai == 1, wi *= delta
            else,
                for each attribute ai in I,
                    if ai == 1, wi /= delta
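And a matching Python sketch of Winnow, assuming binary (0/1) attributes; the threshold of 1 and the default delta are illustrative:

def train_winnow(instances, delta=2.0, threshold=1.0, max_epochs=100):
    # instances: list of (binary_attribute_vector, cls) with cls in {True, False}
    n = len(instances[0][0])
    w = [1.0] * n                        # multiplicative updates start at 1
    for _ in range(max_epochs):
        failed = False
        for x, cls in instances:
            pred = sum(wi * xi for wi, xi in zip(w, x)) > threshold
            if pred != cls:
                failed = True
                for i, ai in enumerate(x):
                    if ai == 1:          # promote or demote active attributes
                        w[i] = w[i] * delta if cls else w[i] / delta
        if not failed:
            break
    return w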
Introduction
Rule Sets vs Rule Lists
Constructing Rules-based Classifiers
1R
PRISM
Reduced Error Pruning
RIPPER
Rules with Exceptions
Idea: Learn a set of rules from the data. Apply those rules to
determine the class of the new instance.
For example:
R1. If blood-type=Warm and lay-eggs=True then Bird
R2. If blood-type=Cold and flies=False then Reptile
R3. If blood-type=Warm and lay-eggs=False then Mammal
A rule r covers an instance x if the attributes of the instance satisfy
the condition of the rule.
Rules can either be grouped as a set or an ordered list.
Set:
The rules make independent predictions.
Every record is covered by 0..1 rules (hopefully 1!)
List:
The rules make dependent predictions.
Every record is covered by 0..* rules (hopefully 1..*!)
If all records are covered by at least one rule, then rule set
or list is considered Exhaustive.
Covering approach: At each stage, a rule is found that covers
some instances.
Idea: Construct one rule for each attribute/value combination
predicting the most common class for that combination.
Example Data:
Outlook Temperature Humidity Windy Play?
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
Rules generated:
Attribute Rules Errors Total Errors
Outlook sunny » no 2/5 4/14
overcast » yes 0/4
rainy » yes 2/5
Temperature hot » no 2/4 (random) 5/14
mild » yes 2/6
cool » yes 1/4
Humidity high » no 3/7 4/14
normal » yes 1/7
Windy false » yes 2/8 5/14
true » no 3/6 (random)
foreach attribute,
foreach value of that attribute,
find class distribution for attr/value
conc = most frequent class
make rule: attribute=value -> conc
calculate error rate of ruleset
select ruleset with lowest error rate
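A compact Python rendering of 1R, assuming the data as a list of dicts with a 'class' key (an illustrative representation, not WEKA's):

from collections import Counter, defaultdict

def one_r(instances, attributes):
    best = None
    for attr in attributes:
        # class distribution for each value of this attribute
        dist = defaultdict(Counter)
        for inst in instances:
            dist[inst[attr]][inst['class']] += 1
        # rule per value: predict the most frequent class
        rules = {v: c.most_common(1)[0][0] for v, c in dist.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in dist.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best   # (attribute, {value: class}, total errors)

# one_r(weather, ['outlook', 'temperature', 'humidity', 'windy'])
# -> ('outlook', {...}, 4), matching the 4/14 total above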
Age = Young 2/8
Age = Prepresbyopic 1/8
Age = Presbyopic 1/8
Prescription = Myope 3/12
Prescription = Hypermetrope 1/12
Astigmatism = no 0/12
Astigmatism = yes 4/12
TearProduction = Reduced 0/12
TearProduction = Normal 4/12
This covers:

Age            Prescription   Astigmatism  Tear production  Lenses
Young          Myope          Yes          Reduced          None
Young          Myope          Yes          Normal           Hard
Young          Hypermetrope   Yes          Reduced          None
Young          Hypermetrope   Yes          Normal           Hard
Prepresbyopic  Myope          Yes          Reduced          None
Prepresbyopic  Myope          Yes          Normal           Hard
Prepresbyopic  Hypermetrope   Yes          Reduced          None
Prepresbyopic  Hypermetrope   Yes          Normal           None
Presbyopic     Myope          Yes          Reduced          None
Presbyopic     Myope          Yes          Normal           Hard
Presbyopic     Hypermetrope   Yes          Reduced          None
Presbyopic     Hypermetrope   Yes          Normal           None
Try with the other example data set. If X then play=yes

Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no

Outlook=overcast is (4/4): already perfect. Remove the covered instances and look again.
With the reduced dataset, if X then play=yes

Outlook  Temperature  Humidity  Windy  Play?
sunny    hot          high      false  no
sunny    hot          high      true   no
rainy    mild         high      false  yes
rainy    cool         normal    false  yes
rainy    cool         normal    true   no
sunny    mild         high      false  no
sunny    cool         normal    false  yes
rainy    mild         normal    false  yes
sunny    mild         normal    true   yes
rainy    mild         high      true   no

Candidate conditions: sunny (2/5), rainy (3/5), hot (0/2), mild (3/5), cool (2/3), high (1/5), normal (4/5), false (4/6), true (1/4)

Select humidity=normal (4/5) and look for another condition, as the rule is not yet perfect.
If humidity=normal and X then play=yes

Outlook  Temperature  Humidity  Windy  Play?
rainy    cool         normal    false  yes
rainy    cool         normal    true   no
sunny    cool         normal    false  yes
rainy    mild         normal    false  yes
sunny    mild         normal    true   yes

If we could use 'and-not' we could have:
and-not (temperature=cool and windy=true)
But instead: rainy (2/3), sunny (2/2), cool (2/3), mild (2/2), false (3/3), true (1/2)
So we select windy=false (3/3) to maximise t among the perfectly accurate conditions, and add that to the rule.
for each class C,
    initialise E to the complete instance set
    while E contains instances with class C,
        create empty rule R: if ? then C
        until R is perfect (or no more attributes),
            for each attribute A not in R, and each value v,
                consider adding A=v to R
            select the A=v that maximises accuracy p/t
            add A=v to R
        remove instances covered by R from E
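A sketch of that covering loop in Python, under the same dict-per-instance assumption as before; ties are broken by whichever condition is found first:

def prism_for_class(instances, attributes, cls):
    # PRISM's covering loop for a single class, per the pseudocode above
    E, rules = list(instances), []
    while any(i['class'] == cls for i in E):
        covered, conds = list(E), {}
        # refine the rule until it is perfect or no attributes remain
        while (any(i['class'] != cls for i in covered)
               and len(conds) < len(attributes)):
            best = None
            for a in attributes:
                if a in conds:
                    continue
                for v in set(i[a] for i in covered):
                    sub = [i for i in covered if i[a] == v]
                    p, t = sum(i['class'] == cls for i in sub), len(sub)
                    if best is None or p / t > best[0]:
                        best = (p / t, a, v, sub)
            _, a, v, covered = best
            conds[a] = v
        rules.append(conds)
        # remove the instances covered by the finished rule
        E = [i for i in E if not all(i[a] == v for a, v in conds.items())]
    return rules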
Two strategies:

initialise E to instance set
until E is empty:
    split E into Grow and Prune (ratio 2:1)
    for each class C in Grow,
        generate best rule for C
    using Prune:
        calc worth(R) and worth(R minus its final condition)
        while worth(R minus final condition) > worth(R), prune the rule
    from the rules for the different classes, select the largest worth(R)
    remove instances covered by that rule
– (p + (N - n)) / T
● (true positives + true negatives) / total number of instances
● = (positives covered + total negatives - negatives covered) / total instances
● eg: p=2000, t=3000 → n=1000, so (1000 + N) / T
If 2 classes, then learn rules for one and make the other the default.
If more than 2 classes, start with the smallest class and repeat until only 2 remain.
Repeated Incremental Pruning to Produce Error Reduction

BUILD:
split E into Grow/Prune
repeat until no examples remain, or DL of ruleset > minDL(rulesets) + 64, or error > 50%:
    GROW: add conditions (chosen by information gain) until the rule is 100% accurate
    PRUNE: prune conditions, last to first, while worth metric W increases

OPTIMIZE:
for each rule R, for each class C:
    split E into Grow/Prune
    remove all instances from Prune covered by other rules
    GROW and PRUNE two competing rules:
        R1 is a new rule, built from scratch
        R2 is generated by adding conditions to R
        prune using worth metric A on the reduced dataset
    replace R by whichever of R, R1, R2 has the smallest DL
if uncovered instances of C remain, return to BUILD to make more rules
calculate DL for the ruleset, and for the ruleset with each rule in turn omitted; delete any rule that increases the DL
remove instances covered by the rules generated

DL = Description Length. Metric W = (p+1)/(t+2). Metric A = (p + N - n)/T.
If we get more data after a ruleset has been generated, it might be
useful to add exceptions to rules.
If X then class1 unless Y then class2
Consider our humidity rule:
if humidity=normal then play=yes
unless temperature=cool and windy=true then play = no
Exceptions were developed with the Induct system, and are called 'ripple-down rules'.
● Witten, Sections 3.3, 3.5, 3.6, 4.1, 4.4
● Dunham, Section 4.6
● Han, Section 6.5
● Berry and Browne, Chapter 8
Trees
Tree Learning Algorithm
Attribute Splitting Decisions
Random
'Purity Count'
Entropy (aka ID3)
Information Gain Ratio
Anything can be made better by storing it in a tree structure! (Not really!)
Here's our example data again:
Outlook Temperature Humidity Windy Play?
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
How to construct a tree from it, instead of rules?
Trivial Tree Learner:
create empty tree T
select attribute A
create branches in T for each value v of A
for each branch,
recurse with instances where A=v
add tree as branch node
Most interesting part of this algorithm is line 2, the attribute
selection. Let's start with a Random selection, then look at how it
might be improved.
Random method: Let's pick 'windy'
Windy
false true
6 yes 3 yes
2 no 3 no
Need to split again, looking at only the 8 and 6 instances respectively.
For windy=false, we'll randomly select outlook:
sunny: no, no, yes | overcast: yes, yes | rainy: yes, yes, yes
As all instances of overcast and rainy are yes, they stop, sunny continues.
As we may have thousands of attributes and/or values to test, we
want to construct small decision trees. Think back to RIPPER's
description length ... the smallest decision tree will have the
smallest description length. So how can we reduce the number
of nodes in the tree?
'Purity' count:
Outlook
  sunny: 2 yes, 3 no
  overcast: 4 yes
  rainy: 3 yes, 2 no
Select the attribute that has the most 'pure' nodes, randomising equal counts.
Still mediocre. Most data sets won't have pure nodes for several
levels. Need a measure of the purity instead of the simple count.
For each test:
Maximal purity: All values are the same
Minimal purity: Equal number of each value
Find a scale between maximal and minimal, and then merge across all of the
attribute tests.
One function that calculates this is the Entropy function:
entropy(p1, p2, ..., pn) = -p1*log(p1) - p2*log(p2) - ... - pn*log(pn)
p1 ... pn are the number of instances of each class, expressed as a fraction of the total number of instances at that point in the tree. log is base 2.
This is to calculate one test. For outlook there are three tests:
sunny: info(2,3)
= -2/5 log(2/5) - 3/5 log(3/5)
= 0.5287 + 0.4421
= 0.971
overcast: info(4,0) = -(4/4 * log(4/4)) - (0 * log(0))
Uh-oh! log(0) is undefined. But note that we're multiplying it by 0, so whatever it is, the final result will be 0.
sunny: info(2,3) = 0.971
overcast: info(4,0) = 0.0
rainy: info(3,2) = 0.971
But we have 14 instances to divide down those paths...
So the total for outlook is:
(5/14 * 0.971) + (4/14 * 0.0) + (5/14 * 0.971) = 0.693
Now to calculate the gain, we work out the entropy for the top node and subtract the entropy for outlook:
info(9,5) = 0.940
gain(outlook) = 0.940 - 0.693 = 0.247
Now to calculate the gain for all of the attributes:
gain(outlook) = 0.247
gain(humidity) = 0.152
gain(windy) = 0.048
gain(temperature) = 0.029
And select the maximum ... which is outlook.
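The same calculation in Python, as a checkable sketch (the weather data as a list of dicts with a 'class' key is an assumed representation):

from math import log2
from collections import Counter, defaultdict

def info(counts):
    # entropy of a class distribution, eg info([2, 3]) -> 0.971
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gain(instances, attr):
    # information gain: info at the node minus weighted info after the split
    node = info(list(Counter(i['class'] for i in instances).values()))
    split = defaultdict(Counter)
    for i in instances:
        split[i[attr]][i['class']] += 1
    after = sum(sum(c.values()) / len(instances) * info(list(c.values()))
                for c in split.values())
    return node - after

# gain(weather, 'outlook') -> 0.247, matching the calculation above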
This is (also!) called information gain. The total is the information,
measured in 'bits'.
Equally we could select the minimum amount of information needed: the minimum description length issue from RIPPER again.
Let's do the next level, where outlook=sunny.
Now to calculate the gain for all of the attributes:
Outlook Temperature Humidity Windy Play?
sunny hot high false no
sunny hot high true no
sunny mild high false no
sunny cool normal false yes
sunny mild normal true yes
Temp: hot info(0,2) mild info(1,1) cool info(1,0)
Humidity: high info(0,3) normal info(2,0)
Windy: false info(1,2) true info(1,1)
Don't even need to do the math. Humidity is the obvious choice as
it predicts all 5 instances correctly. Thus the information will be
0, and the gain will be maximal.
Now our tree looks like:
Outlook
  sunny → Humidity
      normal → yes
      high → no
  overcast → yes
  rainy → ?
This algorithm is called ID3, developed by Quinlan.
Nasty side effect of Entropy: It prefers attributes with a large
number of branches.
Eg, if there was an 'identifier' attribute with a unique value, this
would uniquely determine the class, but be useless for
classification. (overfitting!)
Eg: info(0,1) info(0,1) info(1,0) ...
Doesn't need to be unique. If we assign 1 to the first two instances, 2 to the next two and so forth, we still get a 'better' split.
Half-identifier 'attribute':
info(0,2) info(2,0) info(1,1) info(1,1) info(2,0) info(2,0) info(1,1)
= 0, 0, 0.5, 0.5, 0, 0, 0.5
2/14 down each route, so:
= 0*2/14 + 0*2/14 + 0.5*2/14 + 0.5*2/14 + ...
= 3 * (2/14 * 0.5)
= 3/14
= 0.214
Gain is:
0.940 - 0.214 = 0.726
Remember that the gain for Outlook was only 0.247!
Urgh. Once more we run into overfitting.
Solution: Use a gain ratio. Calculate the entropy disregarding
classes for all of the daughter nodes:
eg info(2,2,2,2,2,2,2) for half-identifier, and info(5,4,5) for outlook
identifier = -(1/14 * log(1/14)) * 14 = 3.807
half-identifier = -(1/7 * log(1/7)) * 7 = 2.807
outlook = 1.577
Ratios:
identifier = 0.940 / 3.807 = 0.247
half-identifier = 0.726 / 2.807 = 0.259
outlook = 0.247 / 1.577 = 0.157
Close to success: Picks half-identifier (only accurate in 4/7 branches) over identifier (accurate in all 14 branches)!
half-identifier = 0.259
identifier = 0.247
outlook = 0.157
humidity = 0.152
windy = 0.049
temperature = 0.019
Humidity is now also very close to outlook, whereas before they
were separated.
We can simply check for identifier like attributes and ignore them.
Actually, they should be removed from the data before the data
mining begins.
However the ratio can also overcompensate. It might pick an attribute because its entropy is low. Note how close humidity and outlook became... maybe that's not such a good thing?
Possible Fix: First generate the information gain. Throw away any
attributes with less than the average. Then compare using the
ratio.
An alternative method to Information Gain is called the Gini Index.
The total for node D is:
gini(D) = 1 - sum(p1^2, p2^2, ..., pn^2)
Where p1..pn are the frequency ratios of classes 1..n in D.
So the Gini Index for the entire set:
= 1 - ((9/14)^2 + (5/14)^2)
= 1 - (0.413 + 0.127)
= 0.459
The gini value of a split of D into subsets is:
split(D) = N1/N gini(D1) + N2/N gini(D2) + ... + Nn/N gini(Dn)
Where Ni is the size of subset Di, and N is the size of D.
eg: Outlook splits into 5,4,5:
split = 5/14 gini(sunny) + 4/14 gini(overcast) + 5/14 gini(rainy)
sunny = 1 - ((2/5)^2 + (3/5)^2) = 1 - 0.52 = 0.48
overcast = 1 - ((4/4)^2 + (0/4)^2) = 0.0
rainy = sunny
split = (5/14 * 0.48) * 2
= 0.343
The attribute that generates the smallest gini split value is chosen
to split the node on.
(Left as an exercise for you to do!)
Gini is used in CART (Classification and Regression Trees), IBM's IntelligentMiner system, and SPRINT (Scalable PaRallelizable INduction of decision Trees). It comes from the Italian statistician Corrado Gini, who used it to measure income inequality.
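For checking the exercise, a small Python sketch of the gini and split calculations (same dict-per-instance assumption as earlier):

from collections import Counter, defaultdict

def gini(counts):
    # gini(D) = 1 - sum of squared class frequency ratios
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(instances, attr):
    # weighted gini over the subsets produced by splitting on attr
    split = defaultdict(Counter)
    for i in instances:
        split[i[attr]][i['class']] += 1
    return sum(sum(c.values()) / len(instances) * gini(list(c.values()))
               for c in split.values())

# gini([9, 5]) -> 0.459, as above; try gini_split on the other attributes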
The various problems that a good DT builder needs to address:
– Ordering of Attribute Splits
As seen, we need to build the tree picking the best attribute to split on first.
– Numeric/Missing Data
Dividing numeric data is more complicated. How?
– Tree Structure
A balanced tree with the fewest levels is preferable.
– Stopping Criteria
Like with rules, we need to stop adding nodes at some point. When?
– Pruning
It may be beneficial to prune the tree once created? Or incrementally?
● Introductory statistical text books
● Witten, 3.2, 4.3
● Dunham, 4.4
● Han, 6.3
● Berry and Browne, Chapter 4
● Berry and Linoff, Chapter 6
Numeric Data
Missing Values
Pruning
Pre- vs Post-Pruning
Chi-squared Test
Sub-tree Replacement
Sub-tree Raising
C4.5's error estimation
From Trees to Rules
The temperature attribute for the weather data is actually a set of
Fahrenheit values between 64 and 85:
64 65 68 69 70 71 72 75 80 81 83 85
yes no yes yes yes no no, yes yes, yes no yes yes no
Assuming one split, where should it be?
64 65 68 69 70 71 72 75 80 81 83 85
yes no yes yes yes no no, yes yes, yes no yes yes no
info([4,2],[5,3]) = 6/14 * info([4,2]) + 8/14 * info([5,3])
= 0.939
Then calculate it for all of the other split points and take the best.
Once the best split has been found, continue as normal. Almost.
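As a sketch, the split-point search in Python: sort once, then evaluate the weighted info at each boundary between differing adjacent values (the midpoint split-point choice is one common convention):

from math import log2
from collections import Counter

def info(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def best_numeric_split(values, classes):
    # returns (weighted info, split point) for the best binary split
    pairs = sorted(zip(values, classes))
    n, best = len(pairs), None
    for k in range(1, n):
        if pairs[k][0] == pairs[k - 1][0]:
            continue                 # cannot split between equal values
        left = Counter(c for _, c in pairs[:k]).values()
        right = Counter(c for _, c in pairs[k:]).values()
        w = k / n * info(list(left)) + (n - k) / n * info(list(right))
        if best is None or w < best[0]:
            best = (w, (pairs[k - 1][0] + pairs[k][0]) / 2)
    return best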
For a/b, might split at 6.5, then again at 3.5 and 9.5
But splitting for x/y will eventually lead to 1=x, 2=y, 3=x ...
Over-fitting.
Also: Many binary splits on an attribute make the tree hard
to read.
Isn't a multiway split better? Yes, but harder to accomplish. How
many splits? Where?
1 2 3 4 5 6 7 8 9 10 11 12
A A A B B B A A A B B B
X Y X Y X Y X Y X Y X Y
We really want to find a function to test the data with.
For X/Y we want to test: value % 2
For A/B we want to test: (value - 1) / 3 % 2, with integer division
Complicated. We'll look at regression trees later.
Algorithm papers: http://citeseer.ist.psu.edu/context/412349/0
Numeric Attributes
Sounds like a lot of computation for attributes with a wide range of
data. ... Yes.
Second computational problem: If we test (for example) windy first and then test temperature, the possible values will be different because not all instances have made it to that node. So we'll need to re-sort everything at every node?
Not quite. The order doesn't change because instances are left out.
We can sort once and cross out instances that we don't have.
Eg:
64 65 68 69 70 71 72 72 75 75 80 81 83 85
7 6 5 9 4 14 8 12 10 11 2 13 3 1
Just because we don't have instances 1,3,4,5,8,9,10 and 13
doesn't mean that the order of the others has changed.
It will still be: 7, 6, 14, 12, 11, 2
Disadvantage: Need to store this information for each numeric
attribute for every instance. If the numeric attributes are used
further down the tree, it may be cheaper to do it only on the
subsets.
We don't have this problem with nominal attributes. We could transform the
numeric attribute into nominal before the data mining stage?
If we can discretize it before data mining, surely we can do it during it as well
using the same techniques? And wouldn't it be faster, as you're only going to
be dealing with a subset of the data, not all of it? Yes, but it might be over-fitting!
Solutions:
●
Prediscretize to nominal attribute (will look at later)
●
Many binary splits
●
One multibranch split
What happens when an instance is missing the value for an
attribute? Already discussed some possibilities for filling in the
value.
While Training, may be possible to just ignore it. But we need a
solution for a Test instance with a missing value.
Idea: Send the instance down all of the branches and combine the
results?
Need to record in the tree the 'popularity' of each branch. Eg how
many instances went down it.
For example: Split the 14 instances by Windy ... 8 go down the
false branch, 6 down the true branch. So when we get an
instance without windy, the classification for false happened 8/14
times and the classification for true 6/14.
Instead of ending up with a single class, we might end up with 4/7
votes for one and 3/7 votes for another. Or they might both end
up the same.
In the same way as generating rule sets, we need to prune trees to
avoid overfitting.
Pre-pruning: Stop before reaching the bottom of the tree path.
But it might stop too early, for example when a combination of two attributes is important but neither is significant by itself.
Post-pruning: Generate the entire tree and then remove some branches.
More time consuming, but more likely to help classification accuracy.
How to determine when to stop growing?
Statistical Significance:
Stop growing the tree when there is no statistically significant
association between any attribute and the class at a particular
node
Popular test: chi-squared
chi^2 = sum( (O - E)^2 / E )
O = observed data, E = expected values based on the hypothesis.
This distribution is significant.
ID3 only allowed significant attributes to be selected by Information Gain.
Two possible options for post-pruning:
Subtree Replacement: Select a subtree and replace it with a single leaf.
Subtree Raising: Select a subtree and raise it to replace a higher tree. More complicated, and harder to tell if worthwhile in practice.
Need to split the training data into a Grow/Prune division again. Grow the entire tree from the Grow set, then prune it using the Prune set. But this has the same problems as with rules-based systems.
Replace left subtree with 'bad' leaf node
(Witten fig 1.3)
Raise subtree C to B
(Witten fig 6.1)
But now need to reclassify instances that
would have gone to 4 & 5.
Can we estimate the error of the tree without a Pruning set? Can
we estimate it based on the training set that it has just been
grown from?
– Error estimate for a subtree is the weighted sum of the error estimates for all its leaves
– Error estimate for a node:
e = ( f + z^2/2N + z * sqrt( f/N - f^2/N + z^2/4N^2 ) ) / ( 1 + z^2/N )
where f is the observed error rate at the node, N is the number of instances it covers, and z is the confidence limit (C4.5's default confidence of 25% gives z = 0.69).
(These slides thanks to the official publisher slide sets, not my strong point!)
f = 5/14, giving e = 0.46
The combined estimate from the leaves is 0.51; e < 0.51, so prune!
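The estimate as a small Python sketch, reproducing the example (f = 5/14, N = 14):

from math import sqrt

def pessimistic_error(f, N, z=0.69):
    # z = 0.69 corresponds to C4.5's default 25% confidence level
    num = f + z * z / (2 * N) + z * sqrt(f / N - f * f / N + z * z / (4 * N * N))
    return num / (1 + z * z / N)

# pessimistic_error(5 / 14, 14) -> ~0.46, as in the example above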
Now we've built a tree, it might be desirable to re-express it as a list of rules.
Simple Method: Generate a rule by conjunction of tests in
each path through the tree.
Eg:
if temp > 71.5 and ... and windy = false then play=yes
if temp > 71.5 and ... and windy = true then play=no
for each rule,
    e = error rate of rule
    e' = error rate of rule minus its final condition
    if e' < e,
        rule = rule minus its final condition
        recurse
remove duplicate rules

Expensive: need to re-evaluate the entire training set for every condition!
Might create duplicate rules if all of the final conditions from a path are removed.
As previous:
● Witten, 3.2, 4.3 PLUS 6.1
● Dunham, 4.4
● Han, 6.3
● Berry and Browne, Chapter 4
● Berry and Linoff, Chapter 6
Statistical Modeling
Bayes Rule
Naïve Bayes
Fixes to Naïve Bayes
Document classification
Bayesian Networks
Structure
Learning
The probability of hypothesis H, given evidence E:
Pr[H|E] = Pr[E|H] * Pr[H] / Pr[E]
Pr[H] = a priori probability of H (before evidence seen)
Pr[H|E] = a posteriori probability of H (after evidence seen)
We want to use this in a classification system, so our goal is to find
the most probable hypothesis (class) given the evidence (test
instance).
Meningitis causes a stiff neck 50% of the time.
Meningitis occurs 1/50,000, stiff necks occur 1/20.
Pr[H|E] = Pr[E|H] * Pr[H] / Pr[E]
Pr[M|SN] = Pr[SN|M] * Pr[M] / Pr[SN]
= (0.5 * 1/50000) / (1/20)
= 0.0002
Our evidence E is made up of different attributes A[1..n], so:
Pr[H|E] = Pr[A1|H]*Pr[A2|H]...Pr[An|H]*Pr[H]/Pr[E]
So we need to work out the probability of the individual attributes
per class. Easy...
Outlook=Sunny appears twice for yes out of 9 yes instances.
We can work these out for all of our training instances...
Given a test instance (sunny, cool, high, true):
play=yes: 2/9 * 3/9 * 3/9 * 3/9 * 9/14 = 0.0053
play=no: 3/5 * 1/5 * 4/5 * 3/5 * 5/14 = 0.0206
So we'd predict play=no for that particular instance.
This is the likelihood, not the probability. We need to normalise these:
Prob(yes) = 0.0053 / (0.0053 + 0.0206) = 20.5%
This is where the Pr[E] denominator disappears from Bayes's rule.
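The whole calculation in a few lines of Python, for checking (the counts come straight from the weather data):

likelihood_yes = 2/9 * 3/9 * 3/9 * 3/9 * 9/14     # ~0.0053
likelihood_no  = 3/5 * 1/5 * 4/5 * 3/5 * 5/14     # ~0.0206
prob_yes = likelihood_yes / (likelihood_yes + likelihood_no)   # ~0.205
prob_no  = likelihood_no  / (likelihood_yes + likelihood_no)   # ~0.795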
Nice. Surely there's more to it than this... ?
Issue: It's only valid to multiply probabilities when the events are
independent of each other. It is “naïve” to assume independence
between attributes in datasets, hence the name.
Eg: The probability of Liverpool winning a football match is not
independent of the probabilities for each member of the team
scoring a goal.
But even given that, Naïve Bayes is still very effective in practice,
especially if we can eliminate redundant attributes before
processing.
Issue: If an attribute value does not co-occur with a class value, then the probability generated for it will be 0.
Eg: Given outlook=overcast, the probability of play=no is 0/5. The other
attributes will be ignored as the final result will be multiplied by 0.
This is bad for our 4 attribute set, but horrific for (say) a 1000 attribute set.
You can easily imagine a case where the likelihood for all classes is 0.
Eg: 'Viagra' is always spam, 'data mining' is never spam. An email with
both will be 0 for spam=yes and 0 for spam=no ... probability will be
undefined ... uh oh!
The trivial solution is of course to mess with the probabilities such that you never have 0s. We add 1 to the numerator and 3 to the denominator (one per value of the attribute) to compensate.
So we end up with 1/8 instead of 0/5.
No reason to use 3; we could use 2 and 6. No reason to split equally... we could add weight to some attributes by giving them a larger share:
(a+3)/(n+6) * (b+2)/(n+6) * (c+1)/(n+6)
However, how to assign these is unclear.
For reasonable training sets, simply initialise counts to 1 rather than 0.
Naïve Bayes deals well with missing values:
Training: Ignore the instance for the attribute/class combination,
but we can still use it for the known attributes.
Classification: Ignore the attribute in the calculation as the
difference will be normalised during the final step anyway.
Naïve Bayes does not deal well with numeric values without some help.
The probability of it being exactly 65 degrees is zero.
We could discretize the attribute, but instead we'll calculate the mean and standard deviation and use a density function to predict the probability.
mean: sum(values) / count(values)
variance: sum(square(value - mean)) / (count(values) - 1)
standard deviation: square root of the variance
Mean for temperature is 73, Std. Deviation is 6.2
Density function:
f(x) = ( 1 / (sqrt(2*pi) * sigma) ) * e^( -(x - mu)^2 / (2 * sigma^2) )
Unless you've a math background, just plug the numbers in... at which point we get a likelihood of 0.034. Then we continue with this number as before.
This assumes a reasonably normal distribution. Often not the case.
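The density function in Python; plugging in the mean and standard deviation above reproduces the 0.034 (the test temperature of 66 is an assumed example value):

from math import sqrt, pi, exp

def density(x, mu, sigma):
    # Gaussian probability density, as in the formula above
    return 1 / (sqrt(2 * pi) * sigma) * exp(-(x - mu) ** 2 / (2 * sigma ** 2))

# density(66, 73, 6.2) -> ~0.034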
The Bayesian model is often used to classify documents as it deals
well with a huge number of attributes simultaneously. (eg
boolean occurrence of words within the text)
But we may know how many times the word occurs.
This leads to Multinomial Naive Bayes.
Assumptions:
1. Probability of a word occurring in a document is independent
of its location within the document.
2. The document length is not related to the class.
Pr[E|H] = N! * product( p_i^n_i / n_i! )
So, if A has 75% and B has 25% frequency in class H:
Pr["A A A"|H] = 3! * (0.75^3 / 3!) * (0.25^0 / 0!)
= 27/64
= 0.422
Pr["A A A B B"|H] = 5! * (0.75^3 / 3!) * (0.25^2 / 2!)
= 0.264
Pr[E|H] = N! * product( p_i^n_i / n_i! )
We don't need to work out all the factorials, as they'll normalise out at the end.
We still end up with insanely small numbers, as vocabularies are much much larger than 2 words. Instead we can sum the logarithms of the probabilities rather than multiplying them.
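A sketch of that log-space version in Python (lgamma(n+1) is log(n!), so the factorials can be kept; word probabilities are assumed to be non-zero):

from math import lgamma, log, exp

def log_multinomial(word_probs, counts):
    # log of N! * product(p_i^n_i / n_i!)
    N = sum(counts)
    ll = lgamma(N + 1)
    for p, n in zip(word_probs, counts):
        ll += n * log(p) - lgamma(n + 1)
    return ll

# exp(log_multinomial([0.75, 0.25], [3, 0])) -> 0.422
# exp(log_multinomial([0.75, 0.25], [3, 2])) -> 0.264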
Back to the attribute independence assumption. Can we get rid of
it?
Yes, with a Bayesian Network.
Each attribute has a node in a Directed Acyclic Graph.
Each node has a table of all attributes with edges pointing at the
node linked against the probabilities for the attribute's values.
Examples will be hopefully enlightening...
play: yes .633, no .367

outlook (parent: play)
play | sunny overcast rainy
yes  | .238  .429     .333
no   | .538  .077     .385

windy (parent: play)
play | false true
yes  | .350  .650
no   | .583  .417

temperature (parent: play)
play | hot  mild cold
yes  | .238 .429 .333
no   | .385 .385 .231

humidity (parent: play)
play | high normal
yes  | .350 .650
no   | .750 .250
play: yes .633, no .367

outlook (parent: play)
play | sunny overcast rainy
yes  | .238  .429     .333
no   | .538  .077     .385

windy (parents: play, outlook)
play outlook  | false true
yes  sunny    | .500  .500
yes  overcast | .500  .500
yes  rainy    | .125  .875
no   sunny    | .375  .625
no   overcast | .500  .500
no   rainy    | .833  .167

temperature (parents: play, outlook)
play outlook  | hot  mild cold
yes  sunny    | .238 .429 .333
yes  overcast | .385 .385 .231
yes  rainy    | .111 .556 .333
no   sunny    | .556 .333 .111
no   overcast | .333 .333 .333
no   rainy    | .143 .429 .429

humidity (parents: play, temperature)
play temp | high normal
yes  hot  | .500 .500
yes  mild | .500 .500
yes  cool | .125 .875
no   hot  | .833 .167
no   mild | .833 .167
no   cool | .250 .750
To use the network, simply step through each node and multiply the
results in the table together for the instance's attributes' values.
Or, more likely, sum the logarithms as with the multinomial case.
Then, as before, normalise them to sum to 1.
This works because the links between the nodes determine the
probability distribution at the node.
Using it seems straightforward. So all that remains is to find out the best
network structure to use. Given a large number of attributes, there's a
LARGE number of possible networks...
We need two components:
– Evaluate a network based on the data
As always we need to find a system that measures the
'goodness' without overfitting
(overfitting in this case = too many edges)
We need a penalty for the complexity of the network.
– Search through the space of possible networks
As we know the nodes, we need to find where the edges in the
graph are. Which nodes connect to which other nodes?
Following the Minimum Description Length ideal, networks with lots of edges will be more complex, and hence likely to overfit. We could add a penalty for each cell in the nodes' tables.
AIC: -LL + K
MDL: -LL + (K/2) log(N)
LL is the total log-likelihood of the network and training set, eg the sum of the logs of the probabilities for each instance in the data set.
K is the number of cells in the tables, minus the number of cells in the last row (which can be calculated, as 1 - the sum of the other cells in the row).
N is the number of instances in the data.
K2:
for each node,
    for each previous node,
        add it as a parent, calculate worth
    stop when the worth doesn't improve
(Use MDL or AIC to determine worth)
The results of K2 depend on initial order selected to process the
nodes in.
Run it several times with different orders and select the best.
Can help to ensure that the class attribute is first and links to all
nodes (not a requirement)
TAN: Tree Augmented Naive Bayes.
Class attribute is only parent for each node in Naive Bayes. Start
here and consider adding a second parent to each node.
Bayesian Multinet:
Build a separate network for each class and combine the values.
● Witten 4.2, 6.7
● Han 6.4
● Dunham 4.2
● Devijver and Kittler, Pattern Recognition: A Statistical Approach, Chapter 2
● Berry and Browne, Chapter 2
Introduction to Neural Networks
Issues
Training
Kohonen Self-Organising Maps
Radial Basis Function Networks
How do animals learn (including humans)? Perhaps we can
simulate that for learning simple patterns?
How does the brain work (simplistically)? The brain has lots of
neurons which either fire or not and are linked together in a huge
three dimensional structure. It receives input from many neurons
and sends its output to many neurons. The input comes from
external connected sensors such as eyes.
So ... can we model an artificial network of neurons to solve just
one task (a classification problem)?
We need some inputs, then some neurons connected together and
then some outputs. Our inputs are the attributes from the data
set. Our outputs are (typically) the classes. We can have
connections from all of the inputs to the neurons, then from the
neurons to the outputs.
Then we just need to train the neurons to react to the values in the
attributes in the proper way such that the output layer gives us
the classification. (Which is of course the complicated part, just
like animals learning)
Sounds like the idea of a Perceptron? Yes, it is.
N1
Attr1
Class
Attr2
Output Neuron(s)
That sounds like a regression problem? Learning a function...
Actually we use the same activation function in all nodes,
and apply a weight to each link. Each node can also have
a constant to add to the incoming data called a bias.
(Diagram: Attr1 and Attr2 connect to hidden nodes N1 and N2 with weights W1,1, W2,1, W1,2, W2,2; N1 and N2 connect to the Class output node with weights W1,C and W2,C.)

Node N1 does: fN1(A1*W1,1 + A2*W2,1 + CN1)
Node Class does: fClass( fN1(A1*W1,1 + A2*W2,1 + CN1)*W1,C + fN2(A1*W1,2 + A2*W2,2 + CN2)*W2,C + CClass )
Issues with constructing a neural network classifier:
– Attributes as source nodes
Need to be numeric for weighting
– Number of hidden layers
Not necessarily just one layer, could have multiple
– Number of nodes per hidden layer
Complicated. Too many and it will over-fit; too few and it won't
learn properly
– Number of output neurons
One per class, or perhaps a bit-based combination (eg outputs 101 =
class 5, using 3 output neurons)
– Interconnections
Node might not connect to all in next layer, might connect
backwards
– Weights, Constants, Activation function to use
– Learning Technique to adjust weights
Bleh! Why use a NN rather than a Decision Tree then?
– More robust -- can use all attributes at once, without splitting
numeric attributes or turning them into nominal ones
– Can keep improving its performance with further training
– More robust in noisy environments, as it can't go down the wrong
path so easily
In order to apply a weight to an attribute, that attribute needs to be
numeric: 6.5 * “fish” is not meaningful. But assigning numbers
to the different values is also not meaningful: “squirrel” (2) is not
1 greater than “fish” (1).
Instead, we could divide the nominal attribute into several boolean
attributes with values of 0 or 1, one per value (a sketch follows).
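A minimal sketch of that encoding (the attribute domain here is invented):

def one_hot(value, domain):
    # map one nominal value to a list of 0/1 boolean attributes
    return [1 if value == v else 0 for v in domain]

domain = ["fish", "squirrel", "bird"]
print(one_hot("squirrel", domain))   # [0, 1, 0]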
The number of nodes in the hidden layer is highly debated, but no
good rule has been discovered to date. Depends on the
structure, activation function etc.
The simplest structure is for all nodes to connect to all nodes in the
next highest layer, but this is not necessarily the case.
Before we look at training, we should look at common activation
functions. The function could be anything, but is typically one of a
few standard choices, such as the step, linear, or sigmoid functions.
The network classifies by propagating the values forwards through
the network (feedforward) and applying the activation function at
each step.
We know the expected output at the final layer (the class) so we
can work out the error of the output from the nodes that connect
to it. A typical measure is the mean squared error (MSE):
(yi − di)² / 2
Where for node i, y is the output and d is the desired output.
This could be repeated for all nodes in the network and summed to
find the total error for a given instance. The goal is then to
minimise that error across all instances of the training set.
The Hebb rule: (historical interest only)
Δwij = c·xij·yj
The Delta rule:
Δwij = c·xij·(dj − yj)
For node j, input node i, output y, desired output d and constant c.
The constant is typically 1/(number of training instances).
So for back propagation, we can step backwards through the
network after passing an instance through it and modify each
weight using the delta rule. ... Almost.
Remember that we want to minimise the MSE. We can use Gradient
Descent to do this. With a sigmoid function:
for each node i in outputNodes:
    for each node j in inputs to i:
        delta = c · (di − yi) · yi · (1 − yi) · yj
        wji += delta
for each node j in hiddenLayer:
    for each node k in inputs to j:
        outputDelta = 0
        for each node m in outputs from j:
            outputDelta += (dm − ym) · wjm · ym · (1 − ym)
        delta = c · yk · ((1 − yj²) / 2) · outputDelta
        wkj += delta
Whuh?! What's going on there??
Skipping all the math, it finds the
gradient of the error curve. To
minimise, we want the gradient to be
zero, so it takes lots of derivatives and
stuff...
If you like the math, read Witten ~230
and Dunham ~112.
For the rest of us... we'll smile and nod and skip ahead...
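Or, for the pragmatists, here is a minimal runnable sketch of a single update step on a 2-2-1 network (all weights, biases and the training instance are invented; note it uses the standard sigmoid derivative y·(1 − y) in both layers, rather than the bipolar form above):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

w_h = [[0.1, 0.2], [0.3, 0.4]]   # w_h[i][j]: input i -> hidden node j
b_h = [0.1, 0.1]                 # hidden biases
w_o = [0.5, 0.6]                 # hidden node j -> output
b_o = 0.1                        # output bias
c = 0.5                          # learning rate
attrs, d = [1.0, 0.0], 1.0       # one instance and its desired output

# feed forward
h = [sigmoid(sum(attrs[i] * w_h[i][j] for i in range(2)) + b_h[j])
     for j in range(2)]
y = sigmoid(sum(h[j] * w_o[j] for j in range(2)) + b_o)

# back propagate: compute all deltas with the *old* weights first
d_out = (d - y) * y * (1.0 - y)
d_hid = [d_out * w_o[j] * h[j] * (1.0 - h[j]) for j in range(2)]

# then apply the weight and bias updates
b_o += c * d_out
for j in range(2):
    w_o[j] += c * d_out * h[j]
    b_h[j] += c * d_hid[j]
    for i in range(2):
        w_h[i][j] += c * d_hid[j] * attrs[i]

print("output before the update:", y)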
To go back to our original premise, perhaps there's something else
we can learn from our neurons that didn't just implode from the
previous math.
Some things those neurons might tell us:
– Firing neurons affect other nearby neurons
– Neurons that are far apart inhibit each other
– Neurons have specific, non-overlapping tasks
In a Kohonen Self Organising Map, the nodes in the hidden layer
are put into a two dimensional grid so that we have some
measure of distance between neurons.
The nodes compete against each other to be the best for a
particular attribute/instance. In training, once the best node has
been determined and had its connection weights modified, the
nearby nodes also have their weights modified. The
neighbourhood of a node can decrease over time, proportional to
the amount it has 'learnt'.
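A minimal sketch of one SOM training step (grid size, learning rate and neighbourhood radius are invented for the example):

import random

GRID, DIM = 5, 3        # a 5x5 grid of nodes, 3 attributes per instance
random.seed(1)
weights = [[[random.random() for _ in range(DIM)]
            for _ in range(GRID)] for _ in range(GRID)]

def train_step(instance, rate=0.5, radius=1):
    # find the best matching node (smallest squared distance to the instance)
    best = min(((r, c) for r in range(GRID) for c in range(GRID)),
               key=lambda rc: sum((weights[rc[0]][rc[1]][k] - instance[k]) ** 2
                                  for k in range(DIM)))
    # move the winner and its grid neighbours towards the instance
    for r in range(GRID):
        for c in range(GRID):
            if abs(r - best[0]) <= radius and abs(c - best[1]) <= radius:
                for k in range(DIM):
                    weights[r][c][k] += rate * (instance[k] - weights[r][c][k])

train_step([0.9, 0.1, 0.4])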
An RBF network has the standard three layers of nodes.
The hidden layer has a Gaussian activation function.
The output layer has a Linear or Sigmoidal activation
function.
Instead of having a fixed activation function for each hidden node,
the RBF nodes also learn their centre (the input giving the maximal
value) and how fast the output should drop off away from this value.
These centers and widths can be learnt independently of the
connection weights. Typically this is done by clustering.
● Witten 6.3
● Han 6.6
● Dunham 4.5
● Berry and Linoff, Chapter 7
● Pal and Mitra, Pattern Recognition Algorithms for Data Mining, Chapter 7
Linear vs Nonlinear Classifiers
Support Vectors
Non Linearly Separable Datasets
Imagine a data set with two numeric attributes ... you could plot the
instances on a graph.
Imagine a data set with three numeric attributes (eg
h,w,d) ... you could plot it in three dimensional space.
(Ideas for many of these slides thanks to others, esp Barbara Rosario)
Sometimes all the instances can be correctly classified by a single
linear decision boundary.
Sometimes not all instances can be correctly classified by a linear
decision boundary, but they can be separated by a non-linear
boundary.
Random Noise
Many possible decision boundaries... which is best?
The Maximum Margin Hyperplane (MMH) is the boundary with the largest
distance between the two classes,
with some slack to allow for somewhat noisy data.
Find the convex hull of each class. Find the shortest line that can
connect the two hulls. The MMH is then halfway along that line, at
90 degrees to it.
Once we've found the support vectors, we don't care about the
other instances any more. The MMH is still the same with just
these instances.
That's not a vector, that's a smiley face!
That's not a hyperplane, it's a dotted line!
In 2D, yes. But the same applies in
N-dimensional space, where an instance
is a vector like [1,6,3,10,7,14,23] and
the dividing hyperplane is a six-dimensional
monstrosity in that seven-dimensional space.
Vector Norm: |X| = √(x1² + x2² + ... + xn²)
Dot Product: X ∙ Y = |X||Y|cosθ
MMH: x = b + ∑ αi yi a(i)∙a
Most of the time, classes will not be linearly separable. For
example:
But what if we could transform the data set such that the
curve was actually a straight line. Then we could find the
MMH, and use the same transformation on new instances
to compare apples with apples.
This involves mapping each instance into a higher dimensional
space, where the previous curve is now a straight line -- eg from a
quadratic curve into a space with polynomial dimensions, as above.
This could be very expensive, but it turns out that you can
do some of the work before the mapping (the dot product)
Non-Linear Data
So we need some function Ф that will map our data into a different
set of dimensions where there's a linear division. Then we can
construct a linear classifier using this set of dimensions.
Eg: a 3D input vector (x, y, z) could be mapped to 6D space (Z) by:
(x, y, z, x², xy, xz)
Decision hyperplane is now linear in this space. Solve and then
substitute back so the linear hyperplane in this space
corresponds to a second order polynomial in the original space.
But doing this for all instances would be very very expensive...
There's another math trick we can use. It turns out that you don't
need to map the instances and then take the dot product: the result
can be computed directly from the original vectors.
So instead of: Ф(x) ∙ Ф(y)
We can do: K(x∙y)
Avoiding a lot of expense.
Polynomial Kernel: (x∙y)ⁿ
Gaussian Radial Basis Function Kernel: e^(−|x−y|² / 2σ²)
Sigmoid Kernel: tanh(κ·x∙y − δ)
The Radial Basis Function Kernel and Sigmoid Kernel are the same
as the neural network activation functions we looked at last time.
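Those three kernels as a minimal sketch (the n, σ, κ and δ defaults are invented):

import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def poly_kernel(x, y, n=2):
    return dot(x, y) ** n

def rbf_kernel(x, y, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, delta=0.0):
    return math.tanh(kappa * dot(x, y) - delta)

print(poly_kernel([1.0, 2.0], [3.0, 4.0]))   # (1*3 + 2*4)^2 = 121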
The simplest non-linearly separable problem is XOR. There is no
hyperplane to distinguish the classes in the original space, but
there is in a transformed space:
We need some slack to allow for noise in the data preventing the
classes from being separable.
Introduce another parameter C that determines the
maximum effect any single instance can have on the
decision boundary.
If there are 10 bad instances and 1000 good instances, we
don't want the bad instances to prevent finding the MMH.
If by removing an instance, the boundary would move a
lot, that instance could be noise. (Still a constrained
quadratic optimization problem ... apparently)
If the data has lots of 0 values, then these can be ignored when
computing the dot products. Eg: 0 squared adds nothing to the
normalised vector.
This makes SVM very useful for text classification where the
attributes are the frequency of the word in a document.
(eg most words will appear 0 times)
– Training and using SVMs with many (100,000s+) support vectors
can be very slow.
– Determining the best kernel and user configurable
parameters is typically by trial and error.
– It can only predict two classes (1 vs -1)
Can learn a model for each of N classes vs all of the other
instances, but this means building lots of models, which is
very very slow.
● Witten, 6.3
● Han, 6.7
● Pal and Mitra, Chapter 4
Evaluation
Samples
Cross Validation
Bootstrap
Confidence of Accuracy
We need some way to quantitatively evaluate the results of data
mining.
Assuming classification, the basic evaluation is how many correct
predictions it makes as opposed to incorrect predictions.
Can't test on data used for training the classifier and get an
accurate result. The result is "hopelessly
optimistic" (Witten).
Obvious answer: keep part of the data set aside for testing
purposes and use the rest to train the classifier.
Then use the test set to evaluate the resulting classifier in
terms of accuracy.
Accuracy: number of correctly classified instances / total
number of instances to classify.
Most of the time we do not have enough data to have a lot for
training and a lot for testing, though sometimes this is possible
(eg sales data).
Note that holding out a test set reduces the amount of data that you
can actually train on by a significant amount.
Further issues to consider: how do we select the test instances?
Easy: Randomly select instances.
Stratified: Group the instances by class and then select a
proportionate number from each class.
Balanced: Randomly select a desired amount of minority class
instances, and then add the same number from the majority
class.
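A minimal sketch of the stratified split (assuming each instance carries its class as its last field):

import random
from collections import defaultdict

def stratified_split(instances, test_fraction=0.3):
    # group by class, then take the same fraction of each group for testing
    by_class = defaultdict(list)
    for inst in instances:
        by_class[inst[-1]].append(inst)
    train, test = [], []
    for group in by_class.values():
        random.shuffle(group)
        cut = int(len(group) * test_fraction)
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test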
For small data sets, removing some as a test set and still having a
representative set to train from is hard. Solutions?
Split the dataset up into k parts, then use each part in turn as the
test set and the others as the training set: k-fold cross validation,
with k = 10 the usual choice.
Why 10? Extensive testing shows it to be a good middle ground:
not too much processing, not too random.
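The k-fold loop as a sketch (train and evaluate are stand-ins for whatever classifier is under test):

def cross_validate(instances, k, train, evaluate):
    # deal the instances into k folds; each fold takes a turn as the test set
    folds = [instances[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        training = [inst for j, fold in enumerate(folds) if j != i
                    for inst in fold]
        model = train(training)
        scores.append(evaluate(model, test))
    return sum(scores) / k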
Select one instance and train on all others. Then see if the
instance is correctly classified. Repeat and find the percentage
of accurate results.
Attractive:
● If 10 is good, surely N is better :)
Disadvantages:
● Computationally expensive, builds N models!
Until now, the sampling has been without replacement (eg each
instance occurs once, either in training or test set).
However we could put back an instance to be drawn again --
sampling with replacement.
Eg: Have a dataset of 1000 instances.
We sample with replacement 1000 times – eg we randomly
select an instance from all 1000 instances 1000 times.
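Sampling with replacement as a sketch. Roughly 1/e ≈ 36.8% of the instances are never drawn, and those unused instances can serve as the test set:

import random

data = list(range(1000))                        # stand-in data set
training = [random.choice(data) for _ in data]  # 1000 draws, with replacement
drawn = set(training)
test = [d for d in data if d not in drawn]      # the ~36.8% never drawn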
What about the size of the test set? More test instances should
make us more confident that the accuracy predicted is close to
the true accuracy.
Eg getting 75% on 10,000 samples is more likely closer to
the accuracy than 75% on 10.
Statistics can then tell us the range within which the true
accuracy rate should fall. Eg: with 750/1000 correct, the true
accuracy is very likely (at 80% confidence) to lie between
73.2% and 76.7%.
(Witten 147 to 149 has the full maths!)
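A sketch of that calculation using the Wilson score interval, which reproduces the quoted range (z = 1.28 corresponds to 80% confidence; treat the choice of interval as an assumption matching those numbers):

import math

def confidence_interval(successes, n, z=1.28):   # z = 1.28 ~ 80% confidence
    f = successes / n
    centre = f + z * z / (2 * n)
    spread = z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

print(confidence_interval(750, 1000))   # ~(0.732, 0.767)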
We might wish to compare two classifiers of different types. We could
compare the accuracy of 10-fold cross validation, but there's another
method: Student's t-test.
Method:
– Perform ten-fold cross validation (TCV) 10 times – eg 10 x TCV =
100 models
– Perform the same repeated TCV with the second classifier
– This gives us x1..x10 for the first, and y1..y10 for the
second.
– Find the mean of the 10 cross-validation runs for each.
– Find the differences di = xi − yi and their mean.
We then find 't' by:
t = mean(d) / √(variance(d) / k), with k = 10 runs
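That calculation as a sketch (the accuracy values are invented; compare the result against a t-distribution with k − 1 degrees of freedom):

import math

def paired_t(xs, ys):
    # paired t statistic over k matched cross-validation results
    k = len(xs)
    diffs = [x - y for x, y in zip(xs, ys)]
    mean_d = sum(diffs) / k
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (k - 1)
    return mean_d / math.sqrt(var_d / k)

xs = [0.81, 0.79, 0.83, 0.80, 0.78, 0.82, 0.80, 0.79, 0.81, 0.80]
ys = [0.78, 0.77, 0.80, 0.79, 0.76, 0.79, 0.78, 0.77, 0.79, 0.78]
print(paired_t(xs, ys))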
● Introductory statistical text books, again
● Witten, 5.1–5.4
● Han 6.2, 6.12, 6.13
● Berry and Browne, 1.4
● Devijver and Kittler, Chapter 10
Confusion Matrix
Costs
Lift Curves
ROC Curves
Numeric Prediction
The 'Confusion Matrix':
                Actual Yes        Actual No
Predict Yes:    True Positive     False Positive
Predict No:     False Negative    True Negative
But what about random luck? An accuracy of 50% against 1000
classes is obviously better than 50% against 2 classes. The Kappa
statistic corrects for chance agreement:
Sum the diagonal in the expected-by-chance matrix (82).
Sum the diagonal in the classifier's matrix (140).
Subtract expected from classifier (140 − 82 = 58).
Subtract expected from total instances (200 − 82 = 118).
Divide and express as a percentage (58 / 118 = 49%).
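The same arithmetic as a sketch (using the example numbers above):

def kappa(observed_correct, expected_correct, total):
    # agreement beyond chance / maximum possible agreement beyond chance
    return (observed_correct - expected_correct) / (total - expected_correct)

print(kappa(140, 82, 200))   # 0.4915... ~ 49%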
For some situations, it's a lot worse to have a false negative than a
false positive.
Another example application: Mass mailed advertising.
Can use a cost matrix to determine the cost of errors of a classifier.
Default Cost Matrix:
A B C
A 0 1 1
B 1 0 1
C 1 1 0
Can artificially inflate a two-class training set with duplicates of
the preferred class. Then an error minimising classifier will attempt
to reduce the errors on the inflated class.
Some classifiers give a probability rather than a definite yes/no (eg
Bayesian techniques)
Quadratic Loss Function:
∑j (pj − aj)²
Where the sum is over the probabilities of each of the j classes
for a single instance; aj is 1 for the correct class and 0 for the
others, and pj is the probability assigned to that class.
Then sum the loss over all test instances for the classifier.
Example:
In a 5-class problem, an instance might have:
(0.5, 0.2, 0.05, 0.15, 0.1)
When you want the first class:
(1, 0, 0, 0, 0)
= (0.5 − 1)² + 0.2² + 0.05² + 0.15² + 0.1²
= 0.25 + 0.04 + 0.0025 + 0.0225 + 0.01
= 0.325
(and then summed for all instances, and the mean taken
across CV folds)
The informational loss function is the flip side of information gain;
we can use the same function as a cost:
−E1·log(p1) − E2·log(p2) − ...
Where Ej is the true probability of class j and pj is the predicted
probability.
Since only one class is correct for an instance, only the term for
the correct class matters; the rest are multiplied by 0.
Note that if you assign a 0 probability to the true class, you
get an infinite error! (Don't Do That Then)
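Both loss functions as sketches, using the probabilities from the example above (log base 2, so the informational loss is in bits):

import math

def quadratic_loss(probs, true_index):
    return sum((p - (1 if j == true_index else 0)) ** 2
               for j, p in enumerate(probs))

def informational_loss(probs, true_index):
    # infinite if the true class was given probability 0 -- don't do that
    return -math.log(probs[true_index], 2)

probs = [0.5, 0.2, 0.05, 0.15, 0.1]
print(quadratic_loss(probs, 0))       # 0.325
print(informational_loss(probs, 0))   # 1.0 bit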
Information Retrieval uses the same confusion matrix, defining
precision (TP / (TP + FP)) and recall (TP / (TP + FN)).
To go back to the directed advertising example... A data mining tool
might predict that, given a sample of 100,000 recipients, 400 will
buy (0.4%). Given 400,000, then it predicts that 800 will buy
(0.2%).
The lift is what is gained from the baseline (random selection)
to the curve produced by the classification engine, as plotted
on a lift chart (or a Cumulative Gains chart).
From signal processing: Receiver Operating Characteristic.
The tradeoff between hit rate and false alarm rate when trying
to find real data in a noisy channel.
We can also plot two curves on the same chart, each generated
from different classifiers. This lets us see at which point it's
better to use one classifier rather than the other.
By using both A and B classifiers with appropriate
weightings, it's possible to get at points in between the
two peaks.
Most common is Mean Squared Error, which we have seen before
(subtract prediction from actual, square it, average).
Also Mean Absolute Error – don't square it, just average the
magnitude of each error.
● Witten, Chapter 5
● Han, 6.15
Prediction / Regression
Linear Regression
Logistic Regression
Support Vector Regression
Regression Trees
Classification tries to determine which class an instance belongs to,
based on known classes for instances by generating a model
and applying it to new instances. The model generated can be in
many forms (rules, tree, graph, vectors...). The output is the
class which the new instance is predicted to be part of.
Regression takes data and finds a formula for it. As with SVM, the
formula can be the model used for classification. This might
learn the formula for the probability of a particular class from 0..1
and then return the most likely class.
For example, instead of determining that the weather will be 'hot'
'warm', 'cool' or 'cold', we may need to be able to say with some
degree of accuracy that it will be 25 degrees or 7.5 degrees,
even if 7.5 never appeared in the temperature attribute for the
training data.
Express the 'class' as a linear combination of the attributes with
determined weights. eg:
x = w0 + w1a1 + w2a2 + ... + wnan
Where w is a weight, and a is an attribute.
The predicted value for instance i then is found by putting the attribute
values for i into the appropriate a slots.
So we need to learn the weights that minimize the error between actual
value and predicted value across the training set.
(Sounds like Perceptron, right?)
To determine the weights, we try to minimize the sum of the squared
error across all the instances:
∑i (xi − ∑k wk·aik)²
Where xi is the actual value for instance i and the inner sum is the
predicted value, applying all k weights to the k attribute values of
instance i.
Simple case: Method of Least Squares
w = ∑(xi − avg(x))(yi − avg(y)) / ∑(xi − avg(x))²
solves the simple case of y = b + wx, with b = avg(y) − w·avg(x).
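The method of least squares as a sketch (the data values are invented):

def least_squares(xs, ys):
    # fit y = b + w*x by minimising the summed squared error
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - w * mx
    return b, w

xs, ys = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]
print(least_squares(xs, ys))   # ~(0.15, 1.94)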
We could apply a function to each attribute instead of just
multiplying by a weight.
For example:
x = c + f1(a1) + f2(a2) + ... + fn(an)
Where f is some function (eg square, log, square root, modulo 6,
etc)
Of course determining the appropriate function is a problem!
Instead of fitting the data to a straight line, we can try to fit it to a
logistic curve (a flat S shape).
This curve gives values between 0 and 1, and hence can be used
for probability.
We won't go into how to work
out the coefficients, but the
result has the same form as the linear
case:
x = c + w1·a1 + w2·a2 + ... + wn·an
(with the logistic function mapping x into the 0..1 range)
We looked at the maximum margin hyperplane, which involved
learning a hyperplane to distinguish two classes. Could we learn
a prediction hyperplane in the same way?
That would allow the use of kernel functions for the nonlinear case.
Goal is to find a function that has at most E deviation in prediction
from the training set, while being as flat as possible. This
creates a tube of width 2E around the function. Points that do
not fall within the tube are support vectors.
Because we are also trying to flatten the function, bad choices for E
are problematic.
If E is 0, then all instances are support vectors; too small and
there will be too many support vectors. If E is too big and encloses
all the points, then the function will simply find the mean -- too
flat to be useful.
We can replace the dot product in the regression equation with a
kernel function to perform nonlinear support vector regression:
x = b + ∑ αi·a(i)∙a
The problem with linear regression is that most data sets are not linear.
The problem with nonlinear regression is that it's even more
complicated!
Enter Regression Trees and Model Trees.
Idea: Use a Tree structure (divide and conquer) to split up the instances
such that we can more accurately apply a linear model to only the
instances that reach the end node.
So branches are normal decision tree tests, but instead of a class value
at the node, we have some way to predict or specify the value.
Regression Trees: The leaf nodes have the average value of the
instances to reach it.
Model Trees: The leaf nodes have a (linear) regression model to
predict the value of the instances that reach it.
So a regression tree is a constant value model tree.
Issues to consider:
– Building
– Pruning / Smoothing
We know that we need to construct a tree, with a linear model at
each node and an attribute split at non leaf nodes.
To split, we need to determine which attribute to split on, and where
to split it. (Remember that all attributes are numeric)
Witten (p245) proposes Standard Deviation Reduction: treat the
standard deviation of the class values as a measure of the error at
the node, and maximise the reduction in that value for each split (a
sketch follows).
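Standard deviation reduction for one candidate split, as a sketch (values invented; a clean split of small class values from large ones gives a large SDR):

import math

def sd(values):
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def sdr(values, left, right):
    # reduction in class-value standard deviation achieved by the split
    n = len(values)
    return sd(values) - (len(left) / n) * sd(left) - (len(right) / n) * sd(right)

values = [12.0, 14.5, 13.0, 30.0, 32.5, 31.0]
print(sdr(values, values[:3], values[3:]))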
It turns out that the value predicted at the bottom of the tree is generally
too coarse, probably because it was built against only a small subset of
the data.
We can fine tune the value by building a linear model at each node along
with the regular split and then send the value from the leaf back up the
path to the root of the tree, combining it with the values at each step.
p' = (np + kq) / (n + k)
p' is prediction to be passed up. p is prediction passed to this node.
q is the value predicted at this node. n is the number of instances that
reach the node below. k is a constant.
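Worked through once as a sketch (all the numbers are invented; k = 15 is sometimes quoted as a default smoothing constant, but treat that as an assumption):

def smooth(p, q, n, k=15):
    # blend the prediction p passed up from below with this node's model q
    return (n * p + k * q) / (n + k)

print(smooth(p=10.2, q=11.0, n=20))   # (20*10.2 + 15*11.0) / 35 = 10.54...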
Pruning can also be accomplished using the models built at each
node.
We can estimate the error at each node using the model built, by
taking the actual error on the test set and multiplying by
(n + v) / (n − v), where n is the number of instances that reach the
node and v is the number of parameters in the linear model for the
node.
We do this multiplication to avoid underestimating the error on new
data, rather than the data it was trained against.
If the estimated error is lower at the parent, the leaf node can be
dropped.
MakeTree(instances)
    SD = sd(instances)   // standard deviation of all class values
    root = new Node(instances)
    split(root)
    prune(root)

split(node)
    if len(node) < 4 or sd(node) < 0.05 * SD:
        node.type = LEAF
    else:
        node.type = INTERIOR
        foreach attribute a:
            foreach possibleSplitPosition s in a:
                calculateSDR(a, s)
        splitNode(node, maximumSDR)   // split with the maximum SDR
        split(node.left)
        split(node.right)

prune(node)
    if node.type == INTERIOR:
        prune(node.left)
        prune(node.right)
        node.model = new linearRegression(node)
        if subTreeError(node) > error(node):
            node.type = LEAF

subTreeError(node)
    if node.type == INTERIOR:
        return (len(node.left) * subTreeError(node.left) +
                len(node.right) * subTreeError(node.right)) / len(node)
    else:
        return error(node)
Some regression/model trees:
CHAID (Chi-Squared Automatic Interaction Detector), 1980.
Can be used for either continuous or nominal classes.
CART (Classification And Regression Tree), 1984.
Entropy or Gini to choose the attribute; binary split on the
selected attribute.
M5, Quinlan's model tree inducer (of C4.5 fame), 1992.
● Introductory statistical text books, still!
● Witten, 3.7, 4.6, 6.5
● Dunham, 3.2, 4.2
● Han, 6.11