Data Mining
COMP527: Data Mining
Dr Robert Sanderson
(azaroth@liv.ac.uk)
Dept. of Computer Science
University of Liverpool
2008
These are the full course notes, but not quite complete. You should come to the lectures anyway. Really.
Introduction to the Course Input Preprocessing
Introduction to Data Mining Attribute Selection
Introduction to Text Mining Association Rule Mining
General Data Mining Issues ARM: A Priori and Data Structures
Data Warehousing ARM: Improvements
Classification: Challenges, Basics ARM: Advanced Techniques
Classification: Rules Clustering: Challenges, Basics
Classification: Trees Clustering: Agglomerative/Divisive
Classification: Trees 2 Clustering: Advanced Algorithms
Classification: Bayes Hybrid Approaches
Classification: Neural Networks Graph Mining, Web Mining
Classification: SVM Text Mining: Challenges, Basics
Classification: Evaluation Text Mining: TextasData
Classification: Evaluation 2 Text Mining: TextasLanguage
Regression, Prediction Revision for Exam
Me, You: Introductions
Lectures
Tutorials
References
Course Summary
Assessment
Something Fun*
* Or at least more fun, hopefully
Dr. Robert Sanderson
Office: 1.04, Ashton Building
Extension: 54252 [external: 795 4252]
Email: azaroth@liv.ac.uk
Web: http://www.csc.liv.ac.uk/~azaroth/
Hours: 10:00 to 18:00, not Thursday
Email for a time, or show up at any time knowing that
I might not be there.
Where's your accent from: New Zealand
So you went to Waikato?
Your PhD is in Data Mining?
... Computer Science?
... Science? Math? Engineering?
You at least write Java?
... C++?
What sort of CS Lecturer are you?!
Went to University of Canterbury (NZ, not Kent)
... But I do know Ian Witten quite well.
PhD is in French/History
... But focused on Computing in the Humanities/Informatics
Python!
Information Science: Information Retrieval, Data Mining, Text
Mining, XML, Databases, Interoperability, Grid Processing,
Digital Preservation ...
...
Lecture Slots:
Monday: 10-11am, Here
Tuesday: 10-11am, Here
Friday: 2-3pm, Here
Course requirement: 30 hours of lectures
Semester Timetable:
8 weeks class, 3 weeks Easter, 4 weeks class.
Dates:
21st January to 11th of March (Rob @ conference on 14th)
7th April to 21st April (But may run to 25th?)
Tutorials/Lab Sessions
Location:
Lab 6, Tuesdays 3-4pm
(just before departmental seminar)
Aims:
Provide time for practical experience
Answer any questions from lectures/reading
Informal self-assessment exercises
Software:
Data mining 'workbench' software WEKA installed on Windows
image. May be available under Linux. Freely downloadable from
University of Waikato:
http://www.cs.waikato.ac.nz/ml/weka/
http://www.csc.liv.ac.uk/teaching/modules/newmscs2/comp527.html
http://www.csc.liv.ac.uk/~azaroth/courses/current/comp527/
– Witten, Ian and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, Second Edition, Morgan Kaufmann, 2005
– Dunham, Margaret H., Data Mining: Introductory and Advanced Topics, Prentice Hall, 2003
– Han and Kamber, Data Mining: Concepts and Techniques, Second Edition, Morgan Kaufmann, 2006
– Berry and Browne, Lecture Notes in Data Mining, World Scientific, 2006
– Berry and Linoff, Data Mining Techniques, Second Edition, Wiley, 2004
– Zhang, Association Rule Mining, Springer, 2002
– Konchady, Text Mining Application Programming, Thomson, 2006
– Weiss et al., Text Mining: Predictive Methods for Analyzing Unstructured Information, Springer, 2005
– Inmon, Building the Data Warehouse, Wiley, 1993
– KDD (http://www.kdd2007.com)
– PAKDD (http://lamda.nju.edu.cn/conf/PAKDD07/)
– PKDD (http://www.ecmlpkdd2008.org/)
– CiteSeer: http://citeseer.ist.psu.edu/
– KDNuggets: http://www.kdnuggets.com/
– UCI Repository: http://kdd.ics.uci.edu/
(plus follow link to Machine Learning Archive)
– Wikipedia: http://en.wikipedia.org/wiki/Data_mining
– MathWorld: http://mathworld.wolfram.com/
– Google Scholar: http://scholar.google.com/
– NaCTeM: http://www.nactem.ac.uk/
Total: 30 lectures
● Choose 4 of 5 sections
Something Fun! *
* (Or more fun than the rest of the lecture at least, your mileage may
vary, opinions expressed herein bla bla bla)
The Rules:
– Each player is dealt 7 cards by the dealer
– The first person to have no cards in hand wins
– Every turn, each player discards a card
– Play starts with the person to the left of the dealer and proceeds
to the left
– The dealer and then the winner of each round makes a secret
rule
– If you break a rule, you receive a penalty from the rule's creator
– The penalty is: You must draw one card
– Later rules may overturn earlier rules, either completely or in part
– Each rule may only change one aspect of the game play
– Penalty conditions for breaking rules include:
● Illegal card played (eg black on red)
● Procedural error (eg playing out of turn)
● Incorrect penalty (eg when a later rule enables a play)
– Each rule is numbered (eg: Procedural error under Rule 3)
– When taking a penalty for playing out of turn or for discarding multiple cards, you must return the game to the state it was in before the offending play, and then the penalty is incurred.
● Basic Functions
● Applications
Some Definitions:
– “The nontrivial extraction of implicit, previously unknown, and
potentially useful information from data” (Piatetsky-Shapiro)
– "...the automated or convenient extraction of patterns
representing knowledge implicitly stored or captured in large
databases, data warehouses, the Web, ... or data
streams." (Han, pg xxi)
– “...the process of discovering patterns in data. The process
must be automatic or (more usually) semiautomatic. The
patterns discovered must be meaningful...” (Witten, pg 5)
– “...finding hidden information in a database.” (Dunham, pg 3)
– “...the process of employing one or more computer learning
techniques to automatically analyse and extract knowledge
from data contained within a database.” (Roiger, pg 4)
Many texts treat KDD and Data Mining as the same process,
but it is also possible to think of Data Mining as the
discovery part of KDD.
Dunham:
KDD is the process of finding useful information and
patterns in data.
Data Mining is the use of algorithms to extract information
and patterns derived by the KDD process.
For this course, we will discuss the entire process (KDD) but
focus mostly on the algorithms used for discovery.
The KDD process (as tweaked by Dunham):
Initial Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Transformation) → Transformed Data → (Data Mining) → Data Model → (Interpretation) → Knowledge
Data Mining techniques divide into Supervised Learning and Unsupervised Learning.
Two phases:
1. Given labelled data instances, learn model for how
to predict the class label for them. (Training)
2. Given an unlabelled, unseen instance, use the
model to predict the class label. (Prediction)
● Witten Chapter 1
● Dunham Chapter 1
● Han Chapter 1; Sections 6.1, 7.1
● Berry & Linoff Chapters 1,2
● http://en.wikipedia.org/wiki/Data_mining
and linked pages
Information Retrieval (IR)
What is IR?
Typical IR Process
Text Mining
What is Text Mining?
Typical Text Mining Process
Applications
Examples:
SQL: Find rows where the text column LIKE “%information
retrieval%”
Not only does Google find relevant pages, it finds them Fast,
for many thousands (maybe millions?) of concurrent
users.
No! Google has a good answer for how to search the web,
but there are many more sources of data, and many more
interesting questions.
(Diagram: the typical IR process. A user with an information need formulates a query; the search engine matches the query against preprocessed records of the documents; the matching target documents are returned to the user.)
Format Processing: Extraction of text from different file formats
Indexing: Efficient extraction/storage of terms from text
Query Languages: Formulation of queries against those indexes
Protocols: Transporting queries from client to server
Relevance Ranking: Determining the relevance of a document to the
user's query
Metasearch: Cross-searching multiple document sets with the same query
GridIR: Using the grid (or other massively parallel infrastructure) to
perform IR processes
Multimedia IR: IR techniques on multimedia objects, compound digital
objects...
All of the Data Mining functions can be applied to textual data, using
term as the attribute and frequency as the value.
Classification:
Classify a text into subjects, genres, quality, reading age, ...
Clustering:
Cluster together similar texts
Key challenge is the very large number of terms (eg the number of
different words across all documents)
So, we've looked at Data Mining and IR... What's Text Mining then?
Good question. No canonical definition yet, but a similar definition for
Data Mining could be applied:
The nontrivial extraction of previously unknown, interesting facts from
an (invariably large) collection of texts.
So it sounds like a combination of IR and Data Mining, but actually the
process involves many other steps too. Before we look at what actually
happens, let's look at why it's different...
Data Mining finds a model for the data based on the attributes of the
items. The only attributes of text are the words that make up the text.
As we looked at for IR, this creates a very sparse matrix.
Even if we create that matrix, what sort of patterns could we find:
– Classification: We could classify texts into predefined classes
(eg spam / not spam)
– Association Rule Mining: Finding frequent sets of words.
(eg if 'computer' appears 3+ times, then 'data' appears at least once)
– Clustering: Finding groups of similar documents (IR?)
None of these fit our definition of Text Mining.
Information Retrieval finds documents that match the user's query.
Even if we matched at a sentence level rather than document, all we do is
retrieve matching sentences, we're not discovering anything new.
The relevance ranking is important, but it still just matches information
we already knew... it just orders it appropriately.
IR (typically) treats a document as a big bag of words... but doesn't care
about the meaning of the words, just if they exist in the document.
IR doesn't fit our definition of Text Mining either.
How would one find previously unknown facts from a bunch of text?
– Need to understand the meaning of the text!
●
Part of speech of words
●
Subject/Verb/Object/Preposition/Indirect Object
– Need to determine that two entities are the same entity.
– Need to find correlations of the same entity.
– Form logical chains: Milk contains Magnesium. Magnesium stimulates receptor activity. Inactive receptors cause Headaches → Milk is good for Headaches. (fictional example!)
First we need to tag the text with the parts of speech for each word.
eg:
Rob/noun teaches/verb the/article course/noun
How could we do this? By learning a model for the language! Essentially a data mining classification problem: should the system classify the word as a noun, a verb, an adjective, etc.?
Lots of different tags, often based on a set called the Penn Treebank.
(NN = Noun, VB = Verb, JJ = Adjective, RB = Adverb, etc)
Now we need to discover the phrases and parts of each clause.
Rob/noun teaches/verb the/article course/noun
(Subject: Rob Verb:teaches (Object: the+course))
The phrase sections are often expressed as trees:
( TOP
  ( S
    ( NP ( DT This ) ( JJ crazy ) ( NN sentence ) )
    ( VP ( VBD amused )
      ( NP ( NNP Rob ) )
      ( PP ( IN for )
        ( NP ( DT a ) ( JJ few ) ( NNS minutes ) ) ) ) ) )
Once we've parsed the text for linguistic structure, we need to identify the
real world objects referred to.
Rob teaches the course
Rob: Me (Sanderson, Robert D., b. 1976-07-20, Rangiora/New Zealand)
the course: Comp527 2006/2007, University of Liverpool, UK
This is typically done via lookups in very large thesauri or 'ontologies',
specific to the domain being processed (eg medical, historical, current
events, etc.)
There will normally be a lot more text to parse:
Rob Sanderson, a lecturer at the University of Liverpool, teaches a
masters level course on data mining (Comp527)
Rob is a lecturer
Rob is at the University of Liverpool
Rob teaches a course
The course is called Comp527
The course is masters level
The course is about data mining
Rob Sanderson, a lecturer at the University of Liverpool, teaches a
masters level course on data mining (Comp527)
Data mining is about finding models to describe data sets.
→ The University of Liverpool has a course about finding models to describe data sets.
(Not very interesting or novel in this case, but that's the process)
Search engines of all types are based on IR.
But where would you use text mining?
Most research so far is on medical data sets ... because this is the most
profitable! If you could correlate facts to find a cure for cancer, you
would be very VERY rich! So ... lots of people are trying to do just
that for various values of 'cancer'.
Also because of the wide availability of ontologies and datasets, in
particular abstracts for medical journal articles (PubMed/Medline)
More application areas:
News feeds
Terrorism detection
Social sciences analysis
Historical text analysis
Corpus linguistics
'Net Nanny' filters
etc.
● Weiss et al, Chapter 1 (and 2 if you're interested)
● Baeza-Yates, Modern Information Retrieval, Chapter 1
● Jackson and Moulinier, Natural Language Processing for Online Applications, Chapter 1
● http://www.jisc.ac.uk/publications/publications/pub_textmining.aspx
● http://people.ischool.berkeley.edu/~hearst/textmining.html
Machine Learning?
Input to Data Mining Algorithms
Data types
Missing values
Noisy values
Inconsistent values
Redundant values
Number of values
Over-fitting / Under-fitting
Scalability
Human Interaction
Ethical Data Mining
The aim of data mining is to learn a model for the data. This
could be called a concept of the data, so our outcome will
be a concept description.
@relation Iris
@attribute sepal_length numeric
@attribute sepal_width numeric
@attribute petal_length numeric
@attribute petal_width numeric
@data
5.1, 3.5, 1.4, 0.2
4.9, 3.0, 1.4, 0.2
4.7, 3.2, 1.3, 0.2
5.0, 3.6, 1.4, 0.2
...
Nominal:
@attribute name {option1, option2, ... optionN}
Numeric:
@attribute name numeric -- real values
Other:
@attribute name string -- text fields
@attribute name date -- date fields (ISO-8601 format)
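As a rough illustration of the format, a minimal Python sketch that reads the header and data sections of a simple ARFF file (read_arff is a made-up name; WEKA parses ARFF natively, including quoting and sparse formats that this sketch ignores):

def read_arff(path):
    attributes, data, in_data = [], [], False
    for line in open(path):
        line = line.strip()
        if not line or line.startswith('%'):
            continue                       # skip blank lines and comments
        if line.lower().startswith('@attribute'):
            attributes.append(line.split()[1])    # the attribute name
        elif line.lower().startswith('@data'):
            in_data = True                 # everything after @data is instances
        elif in_data:
            data.append([v.strip() for v in line.split(',')])
    return attributes, data

# attrs, rows = read_arff('iris.arff')
# attrs -> ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']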
The following issues will come up over and over again, but
different algorithms have different requirements.
But not all of those terms are useful for determining (for
example) if an email is spam. 'the' does not contribute to
spam detection.
Data Warehouses
Data Cubes
Warehouse Schemas
OLAP
Materialisation
Most common definition:
“A data warehouse is a subject-oriented, integrated,
time-variant and nonvolatile collection of data in
support of management's decision-making process.” -
W. H. Inmon
– Subject-oriented:
● Focused on important subjects, not transactions
– Integrated:
● Constructed from multiple, heterogeneous data
– Time-variant:
● Has different values for the same fields over time.
– Nonvolatile:
● Physically separate store
Data Warehouses are distinct from:
OLAP: Online Analytical Processing (Data Warehouse)
OLTP: Online Transaction Processing (Traditional DBMS)
Data is normally Multi-Dimensional, and can be thought of as a cube.
Image courtesy of IBM OLAP Miner documentation
The lattice of cuboids for dimensions time, item, location, supplier:
all: the 0-D (apex) cuboid
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time,item,location,supplier
Each dimension can also be thought of in terms of different units.
– Time: decade, year, quarter, month, day, hour (and
week, which isn't strictly hierarchical with the others!)
– Location: continent, country, state, city, store
– Product: electronics, computer, laptop, dell, inspiron
(Diagram: a fact constellation with fact tables ORDER and CONTRACTS sharing dimension hierarchies: Customer; Shipping Method (AIR-EXPRESS, TRUCK); Time (ANNUALLY, QTRLY, DAILY); Product (PRODUCT LINE, PRODUCT GROUP, PRODUCT ITEM); Geography (COUNTRY, REGION, DISTRICT); Organization (DIVISION, DISTRICT, SALES PERSON); Promotion.)
– Star Schema: Single fact table in the middle, with connected set
of dimension tables
(Hence a star)
– Snowflake Schema: Some of the dimension tables are further refined into smaller dimension tables
(Hence looks like a snowflake)
– Fact Constellation: Multiple fact tables can share
dimension tables
(Hence looks like a collection of star schemas. Also called
Galaxy Schema)
Star schema example:
Sales Fact Table: time_key, item_key, location_key, units_sold (the measure)
Time Dimension: time_key, day, day_of_week, month, quarter, year
Item Dimension: item_key, name, brand, type, supplier_type
Location Dimension: location_key, street, city, state, country, continent

Snowflake schema example: as above, but the Item Dimension holds a supplier_key (refined into a separate Supplier Dimension), and the Location Dimension holds only location_key, street, city_key, with a separate City Dimension: city_key, city, state, country.
ROLAP: Relational OLAP
● Uses relational DBMS to store and manage the warehouse
data
● Optimised for non-traditional access patterns
(Diagram: the typical warehouse architecture. Operational DBs and other sources are Extracted, Transformed, Loaded and Refreshed, via a Monitor & Integrator with associated Metadata, into the Data Warehouse and Data Marts; an OLAP Server then serves Query, Reports, Analysis and Data Mining front ends.)
In order to compute OLAP queries efficiently, we need to materialise some of the cuboids from the data.
● None: Very slow, as the entire cube must be computed at run time
● Full: Very fast, but requires a LOT of storage space (and time to precompute)
● Partial: Materialise only the most useful cuboids, as a compromise between the two
● Han, Chapters 3, 4
● Dunham, Sections 2.1, 2.6, 2.7
● Berry and Linoff, Chapter 15
● Inmon, Building the Data Warehouse
● Inmon, Managing the Data Warehouse
● http://en.wikipedia.org/wiki/Data_warehouse and subsequent links
Classification
Basic Algorithms:
KNN
Perceptron
Winnow
Accuracy
● Percent of instances classified correctly (as a %)
Speed
● Computational cost of both learning model and
predicting classes
Robustness
● Ability to cope with noisy or missing data
Scalability
● Ability to cope with very large amounts of data
Interpretability
● Is the model understandable to a human, or otherwise
useful?
Enumerable class labels
● Some algorithms predict a probability for more than one label
● Sometimes called a categorical attribute (eg Han & Kamber)
Although kiwis can't fly like most other birds, they resemble
birds more than they resemble other types of animals.
Can remove instances from the data set that do not help, for
example a tight cluster of 1000 instances of the same
class is unnecessary for k<50
The square boxes are inputs, the w lines are weights and
the circle is the perceptron. The learning problem is to
find the correct weights to apply to the attributes.
weightVector = [0, ..., 0]
while classificationFailed,
    for each training instance I,
        if not classify(I) == I.class,
            if I.class == class1:
                weightVector += I
            else:
                weightVector -= I
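A runnable Python version of the same idea, as a sketch (labels coded +1/-1 and an appended bias input are illustrative assumptions, not from the notes):

def train_perceptron(instances, max_epochs=100):
    # instances: list of (attribute_vector, cls) with cls in {+1, -1}
    n = len(instances[0][0])
    w = [0.0] * (n + 1)                  # last weight acts as the bias
    for _ in range(max_epochs):
        failed = False
        for x, cls in instances:
            xb = list(x) + [1.0]         # append the bias input
            pred = 1 if sum(wi * xi for wi, xi in zip(w, xb)) >= 0 else -1
            if pred != cls:
                failed = True
                # add the instance for class1 (+1), subtract for the other
                w = [wi + cls * xi for wi, xi in zip(w, xb)]
        if not failed:
            break
    return w

# w = train_perceptron([([2.0, 1.0], 1), ([0.0, 3.0], -1)])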
delta = (user defined)
while classificationFailed,
    for each instance I,
        if classify(I) != I.class,
            if I.class == class1,
                for each attribute ai in I,
                    if ai == 1, wi *= delta
            else,
                for each attribute ai in I,
                    if ai == 1, wi /= delta
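And a matching Python sketch of Winnow, assuming binary (0/1) attributes; the threshold of 1 and the default delta are illustrative:

def train_winnow(instances, delta=2.0, threshold=1.0, max_epochs=100):
    # instances: list of (binary_attribute_vector, cls) with cls in {True, False}
    n = len(instances[0][0])
    w = [1.0] * n                        # multiplicative updates start at 1
    for _ in range(max_epochs):
        failed = False
        for x, cls in instances:
            pred = sum(wi * xi for wi, xi in zip(w, x)) > threshold
            if pred != cls:
                failed = True
                for i, ai in enumerate(x):
                    if ai == 1:          # promote or demote active attributes
                        w[i] = w[i] * delta if cls else w[i] / delta
        if not failed:
            break
    return w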
Introduction
Rule Sets vs Rule Lists
Constructing Rules-based Classifiers
1R
PRISM
Reduced Error Pruning
RIPPER
Rules with Exceptions
Idea: Learn a set of rules from the data. Apply those rules to
determine the class of the new instance.
For example:
R1. If blood-type=Warm and lay-eggs=True then Bird
R2. If blood-type=Cold and flies=False then Reptile
R3. If blood-type=Warm and lay-eggs=False then Mammal
A rule r covers an instance x if the attributes of the instance satisfy
the condition of the rule.
Rules can either be grouped as a set or an ordered list.
Set:
The rules make independent predictions.
Every record is covered by 0..1 rules (hopefully 1!)
List:
The rules make dependent predictions.
Every record is covered by 0..* rules (hopefully 1..*!)
If all records are covered by at least one rule, then rule set
or list is considered Exhaustive.
Covering approach: At each stage, a rule is found that covers
some instances.
Idea: Construct one rule for each attribute/value combination
predicting the most common class for that combination.
Example Data:
Outlook Temperature Humidity Windy Play?
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
Rules generated:
Attribute Rules Errors Total Errors
Outlook sunny » no 2/5 4/14
overcast » yes 0/4
rainy » yes 2/5
Temperature hot » no 2/4 (random) 5/14
mild » yes 2/6
cool » yes 1/4
Humidity high » no 3/7 4/14
normal » yes 1/7
Windy false » yes 2/8 5/14
true » no 3/6 (random)
foreach attribute,
foreach value of that attribute,
find class distribution for attr/value
conc = most frequent class
make rule: attribute=value -> conc
calculate error rate of ruleset
select ruleset with lowest error rate
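A compact Python rendering of 1R, assuming the data as a list of dicts with a 'class' key (an illustrative representation, not WEKA's):

from collections import Counter, defaultdict

def one_r(instances, attributes):
    best = None
    for attr in attributes:
        # class distribution for each value of this attribute
        dist = defaultdict(Counter)
        for inst in instances:
            dist[inst[attr]][inst['class']] += 1
        # rule per value: predict the most frequent class
        rules = {v: c.most_common(1)[0][0] for v, c in dist.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in dist.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best   # (attribute, {value: class}, total errors)

# one_r(weather, ['outlook', 'temperature', 'humidity', 'windy'])
# -> ('outlook', {...}, 4), matching the 4/14 total above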
Age = Young 2/8
Age = Prepresbyopic 1/8
Age = Presbyopic 1/8
Prescription = Myope 3/12
Prescription = Hypermetrope 1/12
Astigmatism = no 0/12
Astigmatism = yes 4/12
TearProduction = Reduced 0/12
TearProduction = Normal 4/12
This covers:

Age            Prescription   Astigmatism  Tear production  Lenses
Young          Myope          Yes          Reduced          None
Young          Myope          Yes          Normal           Hard
Young          Hypermetrope   Yes          Reduced          None
Young          Hypermetrope   Yes          Normal           Hard
Prepresbyopic  Myope          Yes          Reduced          None
Prepresbyopic  Myope          Yes          Normal           Hard
Prepresbyopic  Hypermetrope   Yes          Reduced          None
Prepresbyopic  Hypermetrope   Yes          Normal           None
Presbyopic     Myope          Yes          Reduced          None
Presbyopic     Myope          Yes          Normal           Hard
Presbyopic     Hypermetrope   Yes          Reduced          None
Presbyopic     Hypermetrope   Yes          Normal           None
Try with the other example data set. If X then play=yes

Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no

Outlook=overcast is (4/4): already perfect. Remove the covered instances and look again.
With the reduced dataset, if X then play=yes

Outlook  Temperature  Humidity  Windy  Play?
sunny    hot          high      false  no
sunny    hot          high      true   no
rainy    mild         high      false  yes
rainy    cool         normal    false  yes
rainy    cool         normal    true   no
sunny    mild         high      false  no
sunny    cool         normal    false  yes
rainy    mild         normal    false  yes
sunny    mild         normal    true   yes
rainy    mild         high      true   no

Candidate conditions: sunny (2/5), rainy (3/5), hot (0/2), mild (3/5), cool (2/3), high (1/5), normal (4/5), false (4/6), true (1/4)

Select humidity=normal (4/5) and look for another condition, as the rule is not yet perfect.
If humidity=normal and X then play=yes

Outlook  Temperature  Humidity  Windy  Play?
rainy    cool         normal    false  yes
rainy    cool         normal    true   no
sunny    cool         normal    false  yes
rainy    mild         normal    false  yes
sunny    mild         normal    true   yes

If we could use 'and-not' we could have:
and-not (temperature=cool and windy=true)
But instead: rainy (2/3), sunny (2/2), cool (2/3), mild (2/2), false (3/3), true (1/2)
So we select windy=false (3/3) to maximise t among the perfectly accurate conditions, and add that to the rule.
for each class C,
    initialise E to the complete instance set
    while E contains instances with class C,
        create empty rule R: if ? then C
        until R is perfect (or no more attributes),
            for each attribute A not in R, and each value v,
                consider adding A=v to R
            select the A=v that maximises accuracy p/t
            add A=v to R
        remove instances covered by R from E
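A sketch of that covering loop in Python, under the same dict-per-instance assumption as before; ties are broken by whichever condition is found first:

def prism_for_class(instances, attributes, cls):
    # PRISM's covering loop for a single class, per the pseudocode above
    E, rules = list(instances), []
    while any(i['class'] == cls for i in E):
        covered, conds = list(E), {}
        # refine the rule until it is perfect or no attributes remain
        while (any(i['class'] != cls for i in covered)
               and len(conds) < len(attributes)):
            best = None
            for a in attributes:
                if a in conds:
                    continue
                for v in set(i[a] for i in covered):
                    sub = [i for i in covered if i[a] == v]
                    p, t = sum(i['class'] == cls for i in sub), len(sub)
                    if best is None or p / t > best[0]:
                        best = (p / t, a, v, sub)
            _, a, v, covered = best
            conds[a] = v
        rules.append(conds)
        # remove the instances covered by the finished rule
        E = [i for i in E if not all(i[a] == v for a, v in conds.items())]
    return rules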
Two strategies:

initialise E to instance set
until E is empty:
    split E into Grow and Prune (ratio 2:1)
    for each class C in Grow,
        generate best rule for C
    using Prune:
        calc worth(R) and worth(R minus its final condition)
        while worth(R minus final condition) > worth(R), prune the rule
    from the rules for the different classes, select the largest worth(R)
    remove instances covered by that rule
– (p + (N - n)) / T
● (true positives + true negatives) / total number of instances
● = (positives covered + total negatives - negatives covered) / total instances
● eg: p=2000, t=3000 → n=1000, so (1000 + N) / T
If 2 classes, then learn rules for one and make the other the default.
If more than 2 classes, start with the smallest class and repeat until only 2 remain.
Repeated Incremental Pruning to Produce Error Reduction

BUILD:
split E into Grow/Prune
repeat until no examples remain, or DL of ruleset > minDL(rulesets) + 64, or error > 50%:
    GROW: add conditions (chosen by information gain) until the rule is 100% accurate
    PRUNE: prune conditions, last to first, while worth metric W increases

OPTIMIZE:
for each rule R, for each class C:
    split E into Grow/Prune
    remove all instances from Prune covered by other rules
    GROW and PRUNE two competing rules:
        R1 is a new rule, built from scratch
        R2 is generated by adding conditions to R
        prune using worth metric A on the reduced dataset
    replace R by whichever of R, R1, R2 has the smallest DL
if uncovered instances of C remain, return to BUILD to make more rules
calculate DL for the ruleset, and for the ruleset with each rule in turn omitted; delete any rule that increases the DL
remove instances covered by the rules generated

DL = Description Length. Metric W = (p+1)/(t+2). Metric A = (p + N - n)/T.
If we get more data after a ruleset has been generated, it might be
useful to add exceptions to rules.
If X then class1 unless Y then class2
Consider our humidity rule:
if humidity=normal then play=yes
unless temperature=cool and windy=true then play = no
Exceptions were developed with the Induct system, and are called 'ripple-down rules'.
● Witten, Sections 3.3, 3.5, 3.6, 4.1, 4.4
● Dunham, Section 4.6
● Han, Section 6.5
● Berry and Browne, Chapter 8
Trees
Tree Learning Algorithm
Attribute Splitting Decisions
Random
'Purity Count'
Entropy (aka ID3)
Information Gain Ratio
Anything can be made better by storing it in a tree structure! (Not really!)
Here's our example data again:
Outlook Temperature Humidity Windy Play?
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
How to construct a tree from it, instead of rules?
Trivial Tree Learner:
create empty tree T
select attribute A
create branches in T for each value v of A
for each branch,
recurse with instances where A=v
add tree as branch node
Most interesting part of this algorithm is line 2, the attribute
selection. Let's start with a Random selection, then look at how it
might be improved.
Random method: Let's pick 'windy'
Windy
false true
6 yes 3 yes
2 no 3 no
Need to split again, looking at only the 8 and 6 instances respectively.
For windy=false, we'll randomly select outlook:
sunny: no, no, yes | overcast: yes, yes | rainy: yes, yes, yes
As all instances of overcast and rainy are yes, they stop, sunny continues.
As we may have thousands of attributes and/or values to test, we
want to construct small decision trees. Think back to RIPPER's
description length ... the smallest decision tree will have the
smallest description length. So how can we reduce the number
of nodes in the tree?
'Purity' count:
Outlook
  sunny: 2 yes, 3 no
  overcast: 4 yes
  rainy: 3 yes, 2 no
Select the attribute that has the most 'pure' nodes, randomising equal counts.
Still mediocre. Most data sets won't have pure nodes for several
levels. Need a measure of the purity instead of the simple count.
For each test:
Maximal purity: All values are the same
Minimal purity: Equal number of each value
Find a scale between maximal and minimal, and then merge across all of the
attribute tests.
One function that calculates this is the Entropy function:
entropy(p1, p2, ..., pn) = -p1*log(p1) - p2*log(p2) - ... - pn*log(pn)
p1 ... pn are the number of instances of each class, expressed as a fraction of the total number of instances at that point in the tree. log is base 2.
This is to calculate one test. For outlook there are three tests:
sunny: info(2,3)
= -2/5 log(2/5) - 3/5 log(3/5)
= 0.5287 + 0.4421
= 0.971
overcast: info(4,0) = -(4/4 * log(4/4)) - (0 * log(0))
Uh-oh! log(0) is undefined. But note that we're multiplying it by 0, so whatever it is, the final result will be 0.
sunny: info(2,3) = 0.971
overcast: info(4,0) = 0.0
rainy: info(3,2) = 0.971
But we have 14 instances to divide down those paths...
So the total for outlook is:
(5/14 * 0.971) + (4/14 * 0.0) + (5/14 * 0.971) = 0.693
Now to calculate the gain, we work out the entropy for the top node and subtract the entropy for outlook:
info(9,5) = 0.940
gain(outlook) = 0.940 - 0.693 = 0.247
Now to calculate the gain for all of the attributes:
gain(outlook) = 0.247
gain(humidity) = 0.152
gain(windy) = 0.048
gain(temperature) = 0.029
And select the maximum ... which is outlook.
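The same calculation in Python, as a checkable sketch (the weather data as a list of dicts with a 'class' key is an assumed representation):

from math import log2
from collections import Counter, defaultdict

def info(counts):
    # entropy of a class distribution, eg info([2, 3]) -> 0.971
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gain(instances, attr):
    # information gain: info at the node minus weighted info after the split
    node = info(list(Counter(i['class'] for i in instances).values()))
    split = defaultdict(Counter)
    for i in instances:
        split[i[attr]][i['class']] += 1
    after = sum(sum(c.values()) / len(instances) * info(list(c.values()))
                for c in split.values())
    return node - after

# gain(weather, 'outlook') -> 0.247, matching the calculation above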
This is (also!) called information gain. The total is the information,
measured in 'bits'.
Equally we could select the minimum amount of information needed: the minimum description length issue from RIPPER again.
Let's do the next level, where outlook=sunny.
Now to calculate the gain for all of the attributes:
Outlook Temperature Humidity Windy Play?
sunny hot high false no
sunny hot high true no
sunny mild high false no
sunny cool normal false yes
sunny mild normal true yes
Temp: hot info(0,2) mild info(1,1) cool info(1,0)
Humidity: high info(0,3) normal info(2,0)
Windy: false info(1,2) true info(1,1)
Don't even need to do the math. Humidity is the obvious choice as
it predicts all 5 instances correctly. Thus the information will be
0, and the gain will be maximal.
Now our tree looks like:
Outlook
  sunny → Humidity
      normal → yes
      high → no
  overcast → yes
  rainy → ?
This algorithm is called ID3, developed by Quinlan.
Nasty side effect of Entropy: It prefers attributes with a large
number of branches.
Eg, if there was an 'identifier' attribute with a unique value, this
would uniquely determine the class, but be useless for
classification. (overfitting!)
Eg: info(0,1) info(0,1) info(1,0) ...
Doesn't need to be unique. If we assign 1 to the first two instances, 2 to the next two and so forth, we still get a 'better' split.
Half-identifier 'attribute':
info(0,2) info(2,0) info(1,1) info(1,1) info(2,0) info(2,0) info(1,1)
= 0, 0, 0.5, 0.5, 0, 0, 0.5
2/14 down each route, so:
= 0*2/14 + 0*2/14 + 0.5*2/14 + 0.5*2/14 + ...
= 3 * (2/14 * 0.5)
= 3/14
= 0.214
Gain is:
0.940 - 0.214 = 0.726
Remember that the gain for Outlook was only 0.247!
Urgh. Once more we run into overfitting.
Solution: Use a gain ratio. Calculate the entropy disregarding
classes for all of the daughter nodes:
eg info(2,2,2,2,2,2,2) for half-identifier, and info(5,4,5) for outlook
identifier = -(1/14 * log(1/14)) * 14 = 3.807
half-identifier = -(1/7 * log(1/7)) * 7 = 2.807
outlook = 1.577
Ratios:
identifier = 0.940 / 3.807 = 0.247
half-identifier = 0.726 / 2.807 = 0.259
outlook = 0.247 / 1.577 = 0.157
Close to success: Picks half-identifier (only accurate in 4/7 branches) over identifier (accurate in all 14 branches)!
half-identifier = 0.259
identifier = 0.247
outlook = 0.157
humidity = 0.152
windy = 0.049
temperature = 0.019
Humidity is now also very close to outlook, whereas before they
were separated.
We can simply check for identifier like attributes and ignore them.
Actually, they should be removed from the data before the data
mining begins.
However the ratio can also overcompensate. It might pick an attribute because its entropy is low. Note how close humidity and outlook became... maybe that's not such a good thing?
Possible Fix: First generate the information gain. Throw away any
attributes with less than the average. Then compare using the
ratio.
An alternative method to Information Gain is called the Gini Index.
The total for node D is:
gini(D) = 1 - sum(p1^2, p2^2, ..., pn^2)
Where p1..pn are the frequency ratios of classes 1..n in D.
So the Gini Index for the entire set:
= 1 - ((9/14)^2 + (5/14)^2)
= 1 - (0.413 + 0.127)
= 0.459
The gini value of a split of D into subsets is:
split(D) = N1/N gini(D1) + N2/N gini(D2) + ... + Nn/N gini(Dn)
Where Ni is the size of subset Di, and N is the size of D.
eg: Outlook splits into 5,4,5:
split = 5/14 gini(sunny) + 4/14 gini(overcast) + 5/14 gini(rainy)
sunny = 1 - ((2/5)^2 + (3/5)^2) = 1 - 0.52 = 0.48
overcast = 1 - ((4/4)^2 + (0/4)^2) = 0.0
rainy = sunny
split = (5/14 * 0.48) * 2
= 0.343
The attribute that generates the smallest gini split value is chosen
to split the node on.
(Left as an exercise for you to do!)
Gini is used in CART (Classification and Regression Trees), IBM's IntelligentMiner system, and SPRINT (Scalable PaRallelizable INduction of decision Trees). It comes from the Italian statistician Corrado Gini, who used it to measure income inequality.
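For checking the exercise, a small Python sketch of the gini and split calculations (same dict-per-instance assumption as earlier):

from collections import Counter, defaultdict

def gini(counts):
    # gini(D) = 1 - sum of squared class frequency ratios
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(instances, attr):
    # weighted gini over the subsets produced by splitting on attr
    split = defaultdict(Counter)
    for i in instances:
        split[i[attr]][i['class']] += 1
    return sum(sum(c.values()) / len(instances) * gini(list(c.values()))
               for c in split.values())

# gini([9, 5]) -> 0.459, as above; try gini_split on the other attributes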
The various problems that a good DT builder needs to address:
– Ordering of Attribute Splits
As seen, we need to build the tree picking the best attribute to split on first.
– Numeric/Missing Data
Dividing numeric data is more complicated. How?
– Tree Structure
A balanced tree with the fewest levels is preferable.
– Stopping Criteria
Like with rules, we need to stop adding nodes at some point. When?
– Pruning
It may be beneficial to prune the tree once created? Or incrementally?
● Introductory statistical text books
● Witten, 3.2, 4.3
● Dunham, 4.4
● Han, 6.3
● Berry and Browne, Chapter 4
● Berry and Linoff, Chapter 6
Numeric Data
Missing Values
Pruning
Pre- vs Post-Pruning
Chi-squared Test
Sub-tree Replacement
Sub-tree Raising
C4.5's error estimation
From Trees to Rules
The temperature attribute for the weather data is actually a set of
Fahrenheit values between 64 and 85:
64 65 68 69 70 71 72 75 80 81 83 85
yes no yes yes yes no no, yes yes, yes no yes yes no
Assuming one split, where should it be?
64 65 68 69 70 71 72 75 80 81 83 85
yes no yes yes yes no no, yes yes, yes no yes yes no
info([4,2],[5,3]) = 6/14 * info([4,2]) + 8/14 * info([5,3])
= 0.939
Then calculate it for all of the other split points and take the best.
Once the best split has been found, continue as normal. Almost.
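As a sketch, the split-point search in Python: sort once, then evaluate the weighted info at each boundary between differing adjacent values (the midpoint split-point choice is one common convention):

from math import log2
from collections import Counter

def info(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def best_numeric_split(values, classes):
    # returns (weighted info, split point) for the best binary split
    pairs = sorted(zip(values, classes))
    n, best = len(pairs), None
    for k in range(1, n):
        if pairs[k][0] == pairs[k - 1][0]:
            continue                 # cannot split between equal values
        left = Counter(c for _, c in pairs[:k]).values()
        right = Counter(c for _, c in pairs[k:]).values()
        w = k / n * info(list(left)) + (n - k) / n * info(list(right))
        if best is None or w < best[0]:
            best = (w, (pairs[k - 1][0] + pairs[k][0]) / 2)
    return best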
For a/b, might split at 6.5, then again at 3.5 and 9.5
But splitting for x/y will eventually lead to 1=x, 2=y, 3=x ...
Over-fitting.
Also: Many binary splits on an attribute make the tree hard
to read.
Isn't a multiway split better? Yes, but harder to accomplish. How
many splits? Where?
1 2 3 4 5 6 7 8 9 10 11 12
A A A B B B A A A B B B
X Y X Y X Y X Y X Y X Y
We really want to find a function to test the data with.
For X/Y we want to test: value % 2
For A/B we want to test: (value - 1) / 3 % 2, with integer division
Complicated. We'll look at regression trees later.
Algorithm papers: http://citeseer.ist.psu.edu/context/412349/0
Numeric Attributes
Sounds like a lot of computation for attributes with a wide range of
data. ... Yes.
Second computational problem: If we test (for example) windy first and then test temperature, the possible values will be different because not all instances have made it to that node. So we'll need to re-sort everything at every node?
Not quite. The order doesn't change because instances are left out.
We can sort once and cross out instances that we don't have.
Eg:
64 65 68 69 70 71 72 72 75 75 80 81 83 85
7 6 5 9 4 14 8 12 10 11 2 13 3 1
Just because we don't have instances 1,3,4,5,8,9,10 and 13
doesn't mean that the order of the others has changed.
It will still be: 7, 6, 14, 12, 11, 2
Disadvantage: Need to store this information for each numeric
attribute for every instance. If the numeric attributes are used
further down the tree, it may be cheaper to do it only on the
subsets.
We don't have this problem with nominal attributes. We could transform the
numeric attribute into nominal before the data mining stage?
If we can discretize it before data mining, surely we can do it during it as well
using the same techniques? And wouldn't it be faster, as you're only going to
be dealing with a subset of the data, not all of it? Yes, but it might be over-fitting!
Solutions:
●
Prediscretize to nominal attribute (will look at later)
●
Many binary splits
●
One multibranch split
What happens when an instance is missing the value for an
attribute? Already discussed some possibilities for filling in the
value.
While Training, may be possible to just ignore it. But we need a
solution for a Test instance with a missing value.
Idea: Send the instance down all of the branches and combine the
results?
Need to record in the tree the 'popularity' of each branch. Eg how
many instances went down it.
For example: Split the 14 instances by Windy ... 8 go down the
false branch, 6 down the true branch. So when we get an
instance without windy, the classification for false happened 8/14
times and the classification for true 6/14.
Instead of ending up with a single class, we might end up with 4/7
votes for one and 3/7 votes for another. Or they might both end
up the same.
In the same way as generating rule sets, we need to prune trees to
avoid overfitting.
Pre-pruning: Stop before reaching the bottom of the tree path.
But it might stop too early, for example when a combination of two attributes is important but neither is significant by itself.
Post-pruning: Generate the entire tree and then remove some branches.
More time consuming, but more likely to help classification accuracy.
How to determine when to stop growing?
Statistical Significance:
Stop growing the tree when there is no statistically significant
association between any attribute and the class at a particular
node
Popular test: chi-squared
chi^2 = sum( (O - E)^2 / E )
O = observed data, E = expected values based on the hypothesis.
This distribution is significant.
ID3 only allowed significant attributes to be selected by Information Gain.
Two possible options for post-pruning:
Subtree Replacement: Select a subtree and replace it with a single leaf.
Subtree Raising: Select a subtree and raise it to replace a higher tree. More complicated, and harder to tell if worthwhile in practice.
Need to split the training data into a Grow/Prune division again. Grow the entire tree from the Grow set, then prune it using the Prune set. But this has the same problems as with rules-based systems.
Replace left subtree with 'bad' leaf node
(Witten fig 1.3)
Raise subtree C to B
(Witten fig 6.1)
But now need to reclassify instances that
would have gone to 4 & 5.
Can we estimate the error of the tree without a Pruning set? Can
we estimate it based on the training set that it has just been
grown from?
– Error estimate for a subtree is the weighted sum of the error estimates for all its leaves
– Error estimate for a node:
e = ( f + z^2/2N + z * sqrt( f/N - f^2/N + z^2/4N^2 ) ) / ( 1 + z^2/N )
where f is the observed error rate at the node, N is the number of instances it covers, and z is the confidence limit (C4.5's default confidence of 25% gives z = 0.69).
(These slides thanks to the official publisher slide sets, not my strong point!)
f = 5/14, giving e = 0.46
The combined estimate from the leaves is 0.51; e < 0.51, so prune!
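The estimate as a small Python sketch, reproducing the example (f = 5/14, N = 14):

from math import sqrt

def pessimistic_error(f, N, z=0.69):
    # z = 0.69 corresponds to C4.5's default 25% confidence level
    num = f + z * z / (2 * N) + z * sqrt(f / N - f * f / N + z * z / (4 * N * N))
    return num / (1 + z * z / N)

# pessimistic_error(5 / 14, 14) -> ~0.46, as in the example above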
Now we've built a tree, it might be desirable to re-express it as a list of rules.
Simple Method: Generate a rule by conjunction of tests in
each path through the tree.
Eg:
if temp > 71.5 and ... and windy = false then play=yes
if temp > 71.5 and ... and windy = true then play=no
for each rule,
    e = error rate of rule
    e' = error rate of rule minus its final condition
    if e' < e,
        rule = rule minus its final condition
        recurse
remove duplicate rules

Expensive: need to re-evaluate the entire training set for every condition!
Might create duplicate rules if all of the final conditions from a path are removed.
As previous:
● Witten, 3.2, 4.3 PLUS 6.1
● Dunham, 4.4
● Han, 6.3
● Berry and Browne, Chapter 4
● Berry and Linoff, Chapter 6
Statistical Modeling
Bayes Rule
Naïve Bayes
Fixes to Naïve Bayes
Document classification
Bayesian Networks
Structure
Learning
The probability of hypothesis H, given evidence E:
Pr[H|E] = Pr[E|H] * Pr[H] / Pr[E]
Pr[H] = a priori probability of H (before evidence seen)
Pr[H|E] = a posteriori probability of H (after evidence seen)
We want to use this in a classification system, so our goal is to find
the most probable hypothesis (class) given the evidence (test
instance).
Meningitis causes a stiff neck 50% of the time.
Meningitis occurs 1/50,000, stiff necks occur 1/20.
Pr[H|E] = Pr[E|H] * Pr[H] / Pr[E]
Pr[M|SN] = Pr[SN|M] * Pr[M] / Pr[SN]
= (0.5 * 1/50000) / (1/20)
= 0.0002
Our evidence E is made up of different attributes A[1..n], so:
Pr[H|E] = Pr[A1|H]*Pr[A2|H]...Pr[An|H]*Pr[H]/Pr[E]
So we need to work out the probability of the individual attributes
per class. Easy...
Outlook=Sunny appears twice for yes out of 9 yes instances.
We can work these out for all of our training instances...
Given a test instance (sunny, cool, high, true):
play=yes: 2/9 * 3/9 * 3/9 * 3/9 * 9/14 = 0.0053
play=no: 3/5 * 1/5 * 4/5 * 3/5 * 5/14 = 0.0206
So we'd predict play=no for that particular instance.
This is the likelihood, not the probability. We need to normalise these:
Prob(yes) = 0.0053 / (0.0053 + 0.0206) = 20.5%
This is where the Pr[E] denominator disappears from Bayes's rule.
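The whole calculation in a few lines of Python, for checking (the counts come straight from the weather data):

likelihood_yes = 2/9 * 3/9 * 3/9 * 3/9 * 9/14     # ~0.0053
likelihood_no  = 3/5 * 1/5 * 4/5 * 3/5 * 5/14     # ~0.0206
prob_yes = likelihood_yes / (likelihood_yes + likelihood_no)   # ~0.205
prob_no  = likelihood_no  / (likelihood_yes + likelihood_no)   # ~0.795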
Nice. Surely there's more to it than this... ?
Issue: It's only valid to multiply probabilities when the events are
independent of each other. It is “naïve” to assume independence
between attributes in datasets, hence the name.
Eg: The probability of Liverpool winning a football match is not
independent of the probabilities for each member of the team
scoring a goal.
But even given that, Naïve Bayes is still very effective in practice,
especially if we can eliminate redundant attributes before
processing.
Issue: If an attribute value does not co-occur with a class value, then the probability generated for it will be 0.
Eg: Given outlook=overcast, the probability of play=no is 0/5. The other
attributes will be ignored as the final result will be multiplied by 0.
This is bad for our 4 attribute set, but horrific for (say) a 1000 attribute set.
You can easily imagine a case where the likelihood for all classes is 0.
Eg: 'Viagra' is always spam, 'data mining' is never spam. An email with
both will be 0 for spam=yes and 0 for spam=no ... probability will be
undefined ... uh oh!
The trivial solution is of course to mess with the probabilities such that you never have 0s. We add 1 to the numerator and 3 to the denominator (one per value of the attribute) to compensate.
So we end up with 1/8 instead of 0/5.
No reason to use 3; we could use 2 and 6. No reason to split equally... we could add weight to some attributes by giving them a larger share:
(a+3)/(n+6) * (b+2)/(n+6) * (c+1)/(n+6)
However, how to assign these is unclear.
For reasonable training sets, simply initialise counts to 1 rather than 0.
Naïve Bayes deals well with missing values:
Training: Ignore the instance for the attribute/class combination,
but we can still use it for the known attributes.
Classification: Ignore the attribute in the calculation as the
difference will be normalised during the final step anyway.
Naïve Bayes does not deal well with numeric values without some help.
The probability of it being exactly 65 degrees is zero.
We could discretize the attribute, but instead we'll calculate the mean and standard deviation and use a density function to predict the probability.
mean: sum(values) / count(values)
variance: sum(square(value - mean)) / (count(values) - 1)
standard deviation: square root of the variance
Mean for temperature is 73, Std. Deviation is 6.2
Density function:
f(x) = ( 1 / (sqrt(2*pi) * sigma) ) * e^( -(x - mu)^2 / (2 * sigma^2) )
Unless you've a math background, just plug the numbers in... at which point we get a likelihood of 0.034. Then we continue with this number as before.
This assumes a reasonably normal distribution. Often not the case.
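The density function in Python; plugging in the mean and standard deviation above reproduces the 0.034 (the test temperature of 66 is an assumed example value):

from math import sqrt, pi, exp

def density(x, mu, sigma):
    # Gaussian probability density, as in the formula above
    return 1 / (sqrt(2 * pi) * sigma) * exp(-(x - mu) ** 2 / (2 * sigma ** 2))

# density(66, 73, 6.2) -> ~0.034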
The Bayesian model is often used to classify documents as it deals
well with a huge number of attributes simultaneously. (eg
boolean occurrence of words within the text)
But we may know how many times the word occurs.
This leads to Multinomial Naive Bayes.
Assumptions:
1. Probability of a word occurring in a document is independent
of its location within the document.
2. The document length is not related to the class.
Pr[E|H] = N! * product( p_i^n_i / n_i! )
So, if A has 75% and B has 25% frequency in class H:
Pr["A A A"|H] = 3! * (0.75^3 / 3!) * (0.25^0 / 0!)
= 27/64
= 0.422
Pr["A A A B B"|H] = 5! * (0.75^3 / 3!) * (0.25^2 / 2!)
= 0.264
Pr[E|H] = N! * product( p_i^n_i / n_i! )
We don't need to work out all the factorials, as they'll normalise out at the end.
We still end up with insanely small numbers, as vocabularies are much much larger than 2 words. Instead we can sum the logarithms of the probabilities rather than multiplying them.
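A sketch of that log-space version in Python (lgamma(n+1) is log(n!), so the factorials can be kept; word probabilities are assumed to be non-zero):

from math import lgamma, log, exp

def log_multinomial(word_probs, counts):
    # log of N! * product(p_i^n_i / n_i!)
    N = sum(counts)
    ll = lgamma(N + 1)
    for p, n in zip(word_probs, counts):
        ll += n * log(p) - lgamma(n + 1)
    return ll

# exp(log_multinomial([0.75, 0.25], [3, 0])) -> 0.422
# exp(log_multinomial([0.75, 0.25], [3, 2])) -> 0.264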
Back to the attribute independence assumption. Can we get rid of
it?
Yes, with a Bayesian Network.
Each attribute has a node in a Directed Acyclic Graph.
Each node has a table of all attributes with edges pointing at the
node linked against the probabilities for the attribute's values.
Examples will be hopefully enlightening...
play: yes .633, no .367

outlook (parent: play)
play | sunny overcast rainy
yes  | .238  .429     .333
no   | .538  .077     .385

windy (parent: play)
play | false true
yes  | .350  .650
no   | .583  .417

temperature (parent: play)
play | hot  mild cold
yes  | .238 .429 .333
no   | .385 .385 .231

humidity (parent: play)
play | high normal
yes  | .350 .650
no   | .750 .250
play: yes .633, no .367

outlook (parent: play)
play | sunny overcast rainy
yes  | .238  .429     .333
no   | .538  .077     .385

windy (parents: play, outlook)
play outlook  | false true
yes  sunny    | .500  .500
yes  overcast | .500  .500
yes  rainy    | .125  .875
no   sunny    | .375  .625
no   overcast | .500  .500
no   rainy    | .833  .167

temperature (parents: play, outlook)
play outlook  | hot  mild cold
yes  sunny    | .238 .429 .333
yes  overcast | .385 .385 .231
yes  rainy    | .111 .556 .333
no   sunny    | .556 .333 .111
no   overcast | .333 .333 .333
no   rainy    | .143 .429 .429

humidity (parents: play, temperature)
play temp | high normal
yes  hot  | .500 .500
yes  mild | .500 .500
yes  cool | .125 .875
no   hot  | .833 .167
no   mild | .833 .167
no   cool | .250 .750
To use the network, simply step through each node and multiply the
results in the table together for the instance's attributes' values.
Or, more likely, sum the logarithms as with the multinomial case.
Then, as before, normalise them to sum to 1.
This works because the links between the nodes determine the
probability distribution at the node.
Using it seems straightforward. So all that remains is to find out the best
network structure to use. Given a large number of attributes, there's a
LARGE number of possible networks...
We need two components:
– Evaluate a network based on the data
As always we need to find a system that measures the
'goodness' without overfitting
(overfitting in this case = too many edges)
We need a penalty for the complexity of the network.
– Search through the space of possible networks
As we know the nodes, we need to find where the edges in the
graph are. Which nodes connect to which other nodes?
Following the Minimum Description Length ideal, networks with lots of edges will be more complex, and hence likely to overfit. We could add a penalty for each cell in the nodes' tables.
AIC: -LL + K
MDL: -LL + (K/2) log(N)
LL is the total log-likelihood of the network and training set, eg the sum of the logs of the probabilities for each instance in the data set.
K is the number of cells in the tables, minus the number of cells in the last row (which can be calculated, as 1 - the sum of the other cells in the row).
N is the number of instances in the data.
K2:
for each node,
    for each previous node,
        add it as a parent, calculate worth
    stop when the worth doesn't improve
(Use MDL or AIC to determine worth)
The results of K2 depend on initial order selected to process the
nodes in.
Run it several times with different orders and select the best.
Can help to ensure that the class attribute is first and links to all
nodes (not a requirement)
TAN: Tree Augmented Naive Bayes.
Class attribute is only parent for each node in Naive Bayes. Start
here and consider adding a second parent to each node.
Bayesian Multinet:
Build a separate network for each class and combine the values.
● Witten 4.2, 6.7
● Han 6.4
● Dunham 4.2
● Devijver and Kittler, Pattern Recognition: A Statistical Approach, Chapter 2
● Berry and Browne, Chapter 2
Introduction to Neural Networks
Issues
Training
Kohonen Self-Organising Maps
Radial Basis Function Networks
How do animals learn (including humans)? Perhaps we can
simulate that for learning simple patterns?
How does the brain work (simplistically)? The brain has lots of
neurons which either fire or not and are linked together in a huge
three dimensional structure. It receives input from many neurons
and sends its output to many neurons. The input comes from
external connected sensors such as eyes.
So ... can we model an artificial network of neurons to solve just
one task (a classification problem)?
We need some inputs, then some neurons connected together and
then some outputs. Our inputs are the attributes from the data
set. Our outputs are (typically) the classes. We can have
connections from all of the inputs to the neurons, then from the
neurons to the outputs.
Then we just need to train the neurons to react to the values in the
attributes in the proper way such that the output layer gives us
the classification. (Which is of course the complicated part, just
like animals learning)
Sounds like the idea of a Perceptron? Yes, it is.
N1
Attr1
Class
Attr2
Output Neuron(s)
That sounds like a regression problem? Learning a function...
Actually we use the same activation function in all nodes,
and apply a weight to each link. Each node can also have
a constant to add to the incoming data called a bias.
(Diagram: Attr1 and Attr2 connect to hidden nodes N1 and N2 with weights W1,1, W2,1, W1,2, W2,2; N1 and N2 connect to the Class output node with weights W1,C and W2,C.)

Node N1 does: fN1(A1*W1,1 + A2*W2,1 + CN1)
Node Class does: fClass( fN1(A1*W1,1 + A2*W2,1 + CN1)*W1,C + fN2(A1*W1,2 + A2*W2,2 + CN2)*W2,C + CClass )
Issues with constructing a neural network classifier:
– Attributes as source nodes
Need to be numeric for weighting
– Number of hidden layers
Not necessarily just one layer, could have multiple
– Number of nodes per hidden layer
Complicated. Too many and it will over-fit; too few and it won't
learn properly
– Number of output neurons
One per class, or perhaps a bit-based combination (eg outputs 101 =
class 5, using 3 output neurons)
– Interconnections
Node might not connect to all in next layer, might connect
backwards
– Weights, Constants, Activation function to use
– Learning Technique to adjust weights
Bleh! Why use a NN rather than a Decision Tree then?
– More robust -- can use all attributes at once, without splitting
numeric attributes or turning them into nominal ones
– Can keep improving its performance with further training
– More robust in noisy environments, as it can't go down the wrong
path so easily
In order to apply a weight to an attribute, that attribute needs to be
numeric: 6.5 * “fish” is not meaningful. But assigning numbers
to the different values is also not meaningful: “squirrel” (2) is not
1 greater than “fish” (1).
Instead, we could divide the nominal attribute into several boolean
attributes with values of 0 or 1, one per value (a sketch follows).
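A minimal sketch of that encoding (the attribute domain here is invented):

def one_hot(value, domain):
    # map one nominal value to a list of 0/1 boolean attributes
    return [1 if value == v else 0 for v in domain]

domain = ["fish", "squirrel", "bird"]
print(one_hot("squirrel", domain))   # [0, 1, 0]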
The number of nodes in the hidden layer is highly debated, but no
good rule has been discovered to date. Depends on the
structure, activation function etc.
The simplest structure is for all nodes to connect to all nodes in the
next highest layer, but this is not necessarily the case.
Before we look at training, we should look at common activation
functions. The function could be anything, but is typically one of a
few standard choices, such as the step, linear, or sigmoid functions.
The network classifies by propagating the values forwards through
the network (feedforward) and applying the activation function at
each step.
We know the expected output at the final layer (the class) so we
can work out the error of the output from the nodes that connect
to it. A typical measure is the mean squared error (MSE):
(yi − di)² / 2
Where for node i, y is the output and d is the desired output.
This could be repeated for all nodes in the network and summed to
find the total error for a given instance. The goal is then to
minimise that error across all instances of the training set.
The Hebb rule: (historical interest only)
Δwij = c·xij·yj
The Delta rule:
Δwij = c·xij·(dj − yj)
For node j, input node i, output y, desired output d and constant c.
The constant is typically 1/(number of training instances).
So for back propagation, we can step backwards through the
network after passing an instance through it and modify each
weight using the delta rule. ... Almost.
Remember that we want to minimise the MSE. We can use Gradient
Descent to do this. With a sigmoid function:
for each node i in outputNodes:
    for each node j in inputs to i:
        delta = c · (di − yi) · yi · (1 − yi) · yj
        wji += delta
for each node j in hiddenLayer:
    for each node k in inputs to j:
        outputDelta = 0
        for each node m in outputs from j:
            outputDelta += (dm − ym) · wjm · ym · (1 − ym)
        delta = c · yk · ((1 − yj²) / 2) · outputDelta
        wkj += delta
Whuh?! What's going on there??
Skipping all the math, it finds the
gradient of the error curve. To
minimise, we want the gradient to be
zero, so it takes lots of derivatives and
stuff...
If you like the math, read Witten ~230
and Dunham ~112.
For the rest of us... we'll smile and nod and skip ahead...
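Or, for the pragmatists, here is a minimal runnable sketch of a single update step on a 2-2-1 network (all weights, biases and the training instance are invented; note it uses the standard sigmoid derivative y·(1 − y) in both layers, rather than the bipolar form above):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

w_h = [[0.1, 0.2], [0.3, 0.4]]   # w_h[i][j]: input i -> hidden node j
b_h = [0.1, 0.1]                 # hidden biases
w_o = [0.5, 0.6]                 # hidden node j -> output
b_o = 0.1                        # output bias
c = 0.5                          # learning rate
attrs, d = [1.0, 0.0], 1.0       # one instance and its desired output

# feed forward
h = [sigmoid(sum(attrs[i] * w_h[i][j] for i in range(2)) + b_h[j])
     for j in range(2)]
y = sigmoid(sum(h[j] * w_o[j] for j in range(2)) + b_o)

# back propagate: compute all deltas with the *old* weights first
d_out = (d - y) * y * (1.0 - y)
d_hid = [d_out * w_o[j] * h[j] * (1.0 - h[j]) for j in range(2)]

# then apply the weight and bias updates
b_o += c * d_out
for j in range(2):
    w_o[j] += c * d_out * h[j]
    b_h[j] += c * d_hid[j]
    for i in range(2):
        w_h[i][j] += c * d_hid[j] * attrs[i]

print("output before the update:", y)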
To go back to our original premise, perhaps there's something else
we can learn from our neurons that didn't just implode from the
previous math.
Some things those neurons might tell us:
– Firing neurons affect other nearby neurons
– Neurons that are far apart inhibit each other
– Neurons have specific, non-overlapping tasks
In a Kohonen Self Organising Map, the nodes in the hidden layer
are put into a two dimensional grid so that we have some
measure of distance between neurons.
The nodes compete against each other to be the best for a
particular attribute/instance. In training, once the best node has
been determined and had its connection weights modified, the
nearby nodes also have their weights modified. The
neighbourhood of a node can decrease over time, proportional to
the amount it has 'learnt'.
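A minimal sketch of one SOM training step (grid size, learning rate and neighbourhood radius are invented for the example):

import random

GRID, DIM = 5, 3        # a 5x5 grid of nodes, 3 attributes per instance
random.seed(1)
weights = [[[random.random() for _ in range(DIM)]
            for _ in range(GRID)] for _ in range(GRID)]

def train_step(instance, rate=0.5, radius=1):
    # find the best matching node (smallest squared distance to the instance)
    best = min(((r, c) for r in range(GRID) for c in range(GRID)),
               key=lambda rc: sum((weights[rc[0]][rc[1]][k] - instance[k]) ** 2
                                  for k in range(DIM)))
    # move the winner and its grid neighbours towards the instance
    for r in range(GRID):
        for c in range(GRID):
            if abs(r - best[0]) <= radius and abs(c - best[1]) <= radius:
                for k in range(DIM):
                    weights[r][c][k] += rate * (instance[k] - weights[r][c][k])

train_step([0.9, 0.1, 0.4])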
An RBF network has the standard three layers of nodes.
The hidden layer has a Gaussian activation function.
The output layer has a Linear or Sigmoidal activation
function.
Instead of having a fixed activation function for each hidden node,
the RBF nodes also learn their centre (the input giving the maximal
value) and how fast the output should drop off away from this value.
These centers and widths can be learnt independently of the
connection weights. Typically this is done by clustering.
● Witten 6.3
● Han 6.6
● Dunham 4.5
● Berry and Linoff, Chapter 7
● Pal and Mitra, Pattern Recognition Algorithms for Data Mining, Chapter 7
Linear vs Nonlinear Classifiers
Support Vectors
Non Linearly Separable Datasets
Imagine a data set with two numeric attributes ... you could plot the
instances on a graph.
Imagine a data set with three numeric attributes (eg
h,w,d) ... you could plot it in three dimensional space.
(Ideas for many of these slides thanks to others, esp Barbara Rosario)
Sometimes all the instances can be correctly classified by a single
linear decision boundary.
Sometimes not all instances can be correctly classified by a linear
decision boundary, but they can be separated by a non-linear
boundary.
Random Noise
Many possible decision boundaries... which is best?
The Maximum Margin Hyperplane (MMH) is the boundary with the largest
distance between the two classes,
with some slack to allow for somewhat noisy data.
Find the convex hull of each class. Find the shortest line that can
connect the two hulls. The MMH is then halfway along that line, at
90 degrees to it.
Once we've found the support vectors, we don't care about the
other instances any more. The MMH is still the same with just
these instances.
That's not a vector, that's a smiley face!
That's not a hyperplane, it's a dotted line!
In 2D, yes. But the same applies in
N-dimensional space, where an instance
is a vector like [1,6,3,10,7,14,23] and
the dividing hyperplane is a six-dimensional
monstrosity in that seven-dimensional space.
Vector Norm: |X| = √(x1² + x2² + ... + xn²)
Dot Product: X ∙ Y = |X||Y|cosθ
MMH: x = b + ∑ αi yi a(i)∙a
Most of the time, classes will not be linearly separable. For
example:
But what if we could transform the data set such that the
curve was actually a straight line. Then we could find the
MMH, and use the same transformation on new instances
to compare apples with apples.
This involves mapping each instance into a higher dimensional
space, where the previous curve is now a straight line -- eg from a
quadratic curve into a space with polynomial dimensions, as above.
This could be very expensive, but it turns out that you can
do some of the work before the mapping (the dot product)
Non-Linear Data
So we need some function Ф that will map our data into a different
set of dimensions where there's a linear division. Then we can
construct a linear classifier using this set of dimensions.
Eg: a 3D input vector (x, y, z) could be mapped to 6D space (Z) by:
(x, y, z, x², xy, xz)
Decision hyperplane is now linear in this space. Solve and then
substitute back so the linear hyperplane in this space
corresponds to a second order polynomial in the original space.
But doing this for all instances would be very very expensive...
There's another math trick we can use. It turns out that you don't
need to map the instances and then take the dot product: the result
can be computed directly from the original vectors.
So instead of: Ф(x) ∙ Ф(y)
We can do: K(x∙y)
Avoiding a lot of expense.
Polynomial Kernel: (x∙y)ⁿ
Gaussian Radial Basis Function Kernel: e^(−|x−y|² / 2σ²)
Sigmoid Kernel: tanh(κ·x∙y − δ)
The Radial Basis Function Kernel and Sigmoid Kernel are the same
as the neural network activation functions we looked at last time.
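Those three kernels as a minimal sketch (the n, σ, κ and δ defaults are invented):

import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def poly_kernel(x, y, n=2):
    return dot(x, y) ** n

def rbf_kernel(x, y, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, delta=0.0):
    return math.tanh(kappa * dot(x, y) - delta)

print(poly_kernel([1.0, 2.0], [3.0, 4.0]))   # (1*3 + 2*4)^2 = 121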
The simplest non-linearly separable problem is XOR. There is no
hyperplane to distinguish the classes in the original space, but
there is in a transformed space:
We need some slack to allow for noise in the data preventing the
classes from being separable.
Introduce another parameter C that determines the
maximum effect any single instance can have on the
decision boundary.
If there are 10 bad instances and 1000 good instances, we
don't want the bad instances to prevent finding the MMH.
If by removing an instance, the boundary would move a
lot, that instance could be noise. (Still a constrained
quadratic optimization problem ... apparently)
If the data has lots of 0 values, then these can be ignored when
computing the dot products. Eg: 0 squared adds nothing to the
normalised vector.
This makes SVM very useful for text classification where the
attributes are the frequency of the word in a document.
(eg most words will appear 0 times)
– Training and using SVMs with many (100,000s+) support vectors
can be very slow.
– Determining the best kernel and user configurable
parameters is typically by trial and error.
– It can only predict two classes (1 vs -1)
Can learn a model for each of N classes vs all of the other
instances, but this means building lots of models, which is
very very slow.
● Witten, 6.3
● Han, 6.7
● Pal and Mitra, Chapter 4
Evaluation
Samples
Cross Validation
Bootstrap
Confidence of Accuracy
We need some way to quantitatively evaluate the results of data
mining.
Assuming classification, the basic evaluation is how many correct
predictions it makes as opposed to incorrect predictions.
Can't test on data used for training the classifier and get an
accurate result. The result is "hopelessly
optimistic" (Witten).
Obvious answer: keep part of the data set aside for testing
purposes and use the rest to train the classifier.
Then use the test set to evaluate the resulting classifier in
terms of accuracy.
Accuracy: number of correctly classified instances / total
number of instances to classify.
Most of the time we do not have enough data to have a lot for
training and a lot for testing, though sometimes this is possible
(eg sales data).
Note that holding out a test set reduces the amount of data that you
can actually train on by a significant amount.
Further issues to consider: how do we select the test instances?
Easy: Randomly select instances.
Stratified: Group the instances by class and then select a
proportionate number from each class.
Balanced: Randomly select a desired amount of minority class
instances, and then add the same number from the majority
class.
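A minimal sketch of the stratified split (assuming each instance carries its class as its last field):

import random
from collections import defaultdict

def stratified_split(instances, test_fraction=0.3):
    # group by class, then take the same fraction of each group for testing
    by_class = defaultdict(list)
    for inst in instances:
        by_class[inst[-1]].append(inst)
    train, test = [], []
    for group in by_class.values():
        random.shuffle(group)
        cut = int(len(group) * test_fraction)
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test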
For small data sets, removing some as a test set and still having a
representative set to train from is hard. Solutions?
Split the dataset up into k parts, then use each part in turn as the
test set and the others as the training set: k-fold cross validation,
with k = 10 the usual choice.
Why 10? Extensive testing shows it to be a good middle ground:
not too much processing, not too random.
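The k-fold loop as a sketch (train and evaluate are stand-ins for whatever classifier is under test):

def cross_validate(instances, k, train, evaluate):
    # deal the instances into k folds; each fold takes a turn as the test set
    folds = [instances[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        training = [inst for j, fold in enumerate(folds) if j != i
                    for inst in fold]
        model = train(training)
        scores.append(evaluate(model, test))
    return sum(scores) / k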
Select one instance and train on all others. Then see if the
instance is correctly classified. Repeat and find the percentage
of accurate results.
Attractive:
● If 10 is good, surely N is better :)
Disadvantages:
● Computationally expensive, builds N models!
Until now, the sampling has been without replacement (eg each
instance occurs once, either in training or test set).
However we could put back an instance to be drawn again --
sampling with replacement.
Eg: Have a dataset of 1000 instances.
We sample with replacement 1000 times – eg we randomly
select an instance from all 1000 instances 1000 times.
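Sampling with replacement as a sketch. Roughly 1/e ≈ 36.8% of the instances are never drawn, and those unused instances can serve as the test set:

import random

data = list(range(1000))                        # stand-in data set
training = [random.choice(data) for _ in data]  # 1000 draws, with replacement
drawn = set(training)
test = [d for d in data if d not in drawn]      # the ~36.8% never drawn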
What about the size of the test set? More test instances should
make us more confident that the accuracy predicted is close to
the true accuracy.
Eg getting 75% on 10,000 samples is more likely closer to
the accuracy than 75% on 10.
Statistics can then tell us the range within which the true
accuracy rate should fall. Eg: with 750/1000 correct, the true
accuracy is very likely (at 80% confidence) to lie between
73.2% and 76.7%.
(Witten 147 to 149 has the full maths!)
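A sketch of that calculation using the Wilson score interval, which reproduces the quoted range (z = 1.28 corresponds to 80% confidence; treat the choice of interval as an assumption matching those numbers):

import math

def confidence_interval(successes, n, z=1.28):   # z = 1.28 ~ 80% confidence
    f = successes / n
    centre = f + z * z / (2 * n)
    spread = z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

print(confidence_interval(750, 1000))   # ~(0.732, 0.767)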
We might wish to compare two classifiers of different types. We could
compare the accuracy of 10-fold cross validation, but there's another
method: Student's t-test.
Method:
– Perform ten-fold cross validation (TCV) 10 times – eg 10 x TCV =
100 models
– Perform the same repeated TCV with the second classifier
– This gives us x1..x10 for the first, and y1..y10 for the
second.
– Find the mean of the 10 cross-validation runs for each.
– Find the differences di = xi − yi and their mean.
We then find 't' by:
t = mean(d) / √(variance(d) / k), with k = 10 runs
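That calculation as a sketch (the accuracy values are invented; compare the result against a t-distribution with k − 1 degrees of freedom):

import math

def paired_t(xs, ys):
    # paired t statistic over k matched cross-validation results
    k = len(xs)
    diffs = [x - y for x, y in zip(xs, ys)]
    mean_d = sum(diffs) / k
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (k - 1)
    return mean_d / math.sqrt(var_d / k)

xs = [0.81, 0.79, 0.83, 0.80, 0.78, 0.82, 0.80, 0.79, 0.81, 0.80]
ys = [0.78, 0.77, 0.80, 0.79, 0.76, 0.79, 0.78, 0.77, 0.79, 0.78]
print(paired_t(xs, ys))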
● Introductory statistical text books, again
● Witten, 5.1–5.4
● Han 6.2, 6.12, 6.13
● Berry and Browne, 1.4
● Devijver and Kittler, Chapter 10
Confusion Matrix
Costs
Lift Curves
ROC Curves
Numeric Prediction
The 'Confusion Matrix':
                Actual Yes        Actual No
Predict Yes:    True Positive     False Positive
Predict No:     False Negative    True Negative
But what about random luck? An accuracy of 50% against 1000
classes is obviously better than 50% against 2 classes. The Kappa
statistic corrects for chance agreement:
Sum the diagonal in the expected-by-chance matrix (82).
Sum the diagonal in the classifier's matrix (140).
Subtract expected from classifier (140 − 82 = 58).
Subtract expected from total instances (200 − 82 = 118).
Divide and express as a percentage (58 / 118 = 49%).
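The same arithmetic as a sketch (using the example numbers above):

def kappa(observed_correct, expected_correct, total):
    # agreement beyond chance / maximum possible agreement beyond chance
    return (observed_correct - expected_correct) / (total - expected_correct)

print(kappa(140, 82, 200))   # 0.4915... ~ 49%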
For some situations, it's a lot worse to have a false negative than a
false positive.
Another example application: Mass mailed advertising.
Can use a cost matrix to determine the cost of errors of a classifier.
Default Cost Matrix:
A B C
A 0 1 1
B 1 0 1
C 1 1 0
Can artificially inflate a two-class training set with duplicates of
the preferred class. Then an error minimising classifier will attempt
to reduce the errors on the inflated class.
Some classifiers give a probability rather than a definite yes/no (eg
Bayesian techniques)
Quadratic Loss Function:
∑j (pj − aj)²
Where the sum is over the probabilities of each of the j classes
for a single instance; aj is 1 for the correct class and 0 for the
others, and pj is the probability assigned to that class.
Then sum the loss over all test instances for the classifier.
Example:
In a 5-class problem, an instance might have:
(0.5, 0.2, 0.05, 0.15, 0.1)
When you want the first class:
(1, 0, 0, 0, 0)
= (0.5 − 1)² + 0.2² + 0.05² + 0.15² + 0.1²
= 0.25 + 0.04 + 0.0025 + 0.0225 + 0.01
= 0.325
(and then summed for all instances, and the mean taken
across CV folds)
The informational loss function is the flip side of information gain;
we can use the same function as a cost:
−E1·log(p1) − E2·log(p2) − ...
Where Ej is the true probability of class j and pj is the predicted
probability.
Since only one class is correct for an instance, only the term for
the correct class matters; the rest are multiplied by 0.
Note that if you assign a 0 probability to the true class, you
get an infinite error! (Don't Do That Then)
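Both loss functions as sketches, using the probabilities from the example above (log base 2, so the informational loss is in bits):

import math

def quadratic_loss(probs, true_index):
    return sum((p - (1 if j == true_index else 0)) ** 2
               for j, p in enumerate(probs))

def informational_loss(probs, true_index):
    # infinite if the true class was given probability 0 -- don't do that
    return -math.log(probs[true_index], 2)

probs = [0.5, 0.2, 0.05, 0.15, 0.1]
print(quadratic_loss(probs, 0))       # 0.325
print(informational_loss(probs, 0))   # 1.0 bit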
Information Retrieval uses the same confusion matrix, defining
precision (TP / (TP + FP)) and recall (TP / (TP + FN)).
To go back to the directed advertising example... A data mining tool
might predict that, given a sample of 100,000 recipients, 400 will
buy (0.4%). Given 400,000, then it predicts that 800 will buy
(0.2%).
The lift is what is gained from the baseline (random selection)
to the curve produced by the classification engine, as plotted
on a lift chart (or a Cumulative Gains chart).
From signal processing: Receiver Operating Characteristic.
The tradeoff between hit rate and false alarm rate when trying
to find real data in a noisy channel.
We can also plot two curves on the same chart, each generated
from different classifiers. This lets us see at which point it's
better to use one classifier rather than the other.
By using both A and B classifiers with appropriate
weightings, it's possible to get at points in between the
two peaks.
Most common is Mean Squared Error, which we have seen before
(subtract prediction from actual, square it, average).
Also Mean Absolute Error – don't square it, just average the
magnitude of each error.
● Witten, Chapter 5
● Han, 6.15
Prediction / Regression
Linear Regression
Logistic Regression
Support Vector Regression
Regression Trees
Classification tries to determine which class an instance belongs to,
based on known classes for instances by generating a model
and applying it to new instances. The model generated can be in
many forms (rules, tree, graph, vectors...). The output is the
class which the new instance is predicted to be part of.
Regression takes data and finds a formula for it. As with SVM, the
formula can be the model used for classification. This might
learn the formula for the probability of a particular class from 0..1
and then return the most likely class.
For example, instead of determining that the weather will be 'hot'
'warm', 'cool' or 'cold', we may need to be able to say with some
degree of accuracy that it will be 25 degrees or 7.5 degrees,
even if 7.5 never appeared in the temperature attribute for the
training data.
Express the 'class' as a linear combination of the attributes with
determined weights. eg:
x = w0 + w1a1 + w2a2 + ... + wnan
Where w is a weight, and a is an attribute.
The predicted value for instance i then is found by putting the attribute
values for i into the appropriate a slots.
So we need to learn the weights that minimize the error between actual
value and predicted value across the training set.
(Sounds like Perceptron, right?)
To determine the weights, we try to minimize the sum of the squared
error across all the instances:
∑i (xi − ∑k wk·aik)²
Where xi is the actual value for instance i and the inner sum is the
predicted value, applying all k weights to the k attribute values of
instance i.
Simple case: Method of Least Squares
w = ∑(xi − avg(x))(yi − avg(y)) / ∑(xi − avg(x))²
solves the simple case of y = b + wx, with b = avg(y) − w·avg(x).
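The method of least squares as a sketch (the data values are invented):

def least_squares(xs, ys):
    # fit y = b + w*x by minimising the summed squared error
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - w * mx
    return b, w

xs, ys = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]
print(least_squares(xs, ys))   # ~(0.15, 1.94)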
We could apply a function to each attribute instead of just
multiplying by a weight.
For example:
x = c + f1(a1) + f2(a2) + ... + fn(an)
Where f is some function (eg square, log, square root, modulo 6,
etc)
Of course determining the appropriate function is a problem!
Instead of fitting the data to a straight line, we can try to fit it to a
logistic curve (a flat S shape).
This curve gives values between 0 and 1, and hence can be used
for probability.
We won't go into how to work
out the coefficients, but the
result has the same form as the linear
case:
x = c + w1·a1 + w2·a2 + ... + wn·an
(with the logistic function mapping x into the 0..1 range)
We looked at the maximum margin hyperplane, which involved
learning a hyperplane to distinguish two classes. Could we learn
a prediction hyperplane in the same way?
That would allow the use of kernel functions for the nonlinear case.
Goal is to find a function that has at most E deviation in prediction
from the training set, while being as flat as possible. This
creates a tube of width 2E around the function. Points that do
not fall within the tube are support vectors.
Because we are also trying to flatten the function, bad choices for E
are problematic.
If E is 0, then all instances are support vectors; too small and
there will be too many support vectors. If E is too big and encloses
all the points, then the function will simply find the mean -- too
flat to be useful.
We can replace the dot product in the regression equation with a
kernel function to perform nonlinear support vector regression:
x = b + ∑ αi·a(i)∙a
The problem with linear regression is that most data sets are not linear.
The problem with nonlinear regression is that it's even more
complicated!
Enter Regression Trees and Model Trees.
Idea: Use a Tree structure (divide and conquer) to split up the instances
such that we can more accurately apply a linear model to only the
instances that reach the end node.
So branches are normal decision tree tests, but instead of a class value
at the node, we have some way to predict or specify the value.
Regression Trees: The leaf nodes have the average value of the
instances to reach it.
Model Trees: The leaf nodes have a (linear) regression model to
predict the value of the instances that reach it.
So a regression tree is a constant value model tree.
Issues to consider:
– Building
– Pruning / Smoothing
We know that we need to construct a tree, with a linear model at
each node and an attribute split at non leaf nodes.
To split, we need to determine which attribute to split on, and where
to split it. (Remember that all attributes are numeric)
Witten (p245) proposes Standard Deviation Reduction: treat the
standard deviation of the class values as a measure of the error at
the node, and maximise the reduction in that value for each split (a
sketch follows).
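Standard deviation reduction for one candidate split, as a sketch (values invented; a clean split of small class values from large ones gives a large SDR):

import math

def sd(values):
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def sdr(values, left, right):
    # reduction in class-value standard deviation achieved by the split
    n = len(values)
    return sd(values) - (len(left) / n) * sd(left) - (len(right) / n) * sd(right)

values = [12.0, 14.5, 13.0, 30.0, 32.5, 31.0]
print(sdr(values, values[:3], values[3:]))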
It turns out that the value predicted at the bottom of the tree is generally
too coarse, probably because it was built against only a small subset of
the data.
We can fine tune the value by building a linear model at each node along
with the regular split and then send the value from the leaf back up the
path to the root of the tree, combining it with the values at each step.
p' = (np + kq) / (n + k)
p' is prediction to be passed up. p is prediction passed to this node.
q is the value predicted at this node. n is the number of instances that
reach the node below. k is a constant.
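Worked through once as a sketch (all the numbers are invented; k = 15 is sometimes quoted as a default smoothing constant, but treat that as an assumption):

def smooth(p, q, n, k=15):
    # blend the prediction p passed up from below with this node's model q
    return (n * p + k * q) / (n + k)

print(smooth(p=10.2, q=11.0, n=20))   # (20*10.2 + 15*11.0) / 35 = 10.54...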
Pruning can also be accomplished using the models built at each
node.
We can estimate the error at each node using the model built, by
taking the actual error on the test set and multiplying by
(n + v) / (n − v), where n is the number of instances that reach the
node and v is the number of parameters in the linear model for the
node.
We do this multiplication to avoid underestimating the error on new
data, rather than the data it was trained against.
If the estimated error is lower at the parent, the leaf node can be
dropped.
MakeTree(instances)
    SD = sd(instances)   // standard deviation of all class values
    root = new Node(instances)
    split(root)
    prune(root)

split(node)
    if len(node) < 4 or sd(node) < 0.05 * SD:
        node.type = LEAF
    else:
        node.type = INTERIOR
        foreach attribute a:
            foreach possibleSplitPosition s in a:
                calculateSDR(a, s)
        splitNode(node, maximumSDR)   // split with the maximum SDR
        split(node.left)
        split(node.right)

prune(node)
    if node.type == INTERIOR:
        prune(node.left)
        prune(node.right)
        node.model = new linearRegression(node)
        if subTreeError(node) > error(node):
            node.type = LEAF

subTreeError(node)
    if node.type == INTERIOR:
        return (len(node.left) * subTreeError(node.left) +
                len(node.right) * subTreeError(node.right)) / len(node)
    else:
        return error(node)
Some regression/model trees:
CHAID (Chi-Squared Automatic Interaction Detector), 1980.
Can be used for either continuous or nominal classes.
CART (Classification And Regression Tree), 1984.
Entropy or Gini to choose the attribute; binary split on the
selected attribute.
M5, Quinlan's model tree inducer (of C4.5 fame), 1992.
● Introductory statistical text books, still!
● Witten, 3.7, 4.6, 6.5
● Dunham, 3.2, 4.2
● Han, 6.11