Chapter 4
Classification: Basic Concepts
Classification
• A form of data analysis that extracts a model (classifier) to
predict class labels
– class labels are categorical (discrete or nominal)
– builds the model from the training set and the values (class
labels) of a classifying attribute, then uses it to classify new data
• Numeric Prediction
– models continuous-valued functions, i.e., predicts
unknown or missing values
• Typical applications
– Credit/loan approval: loan application is “safe” or “risky”
– Medical diagnosis: tumor is “cancerous” or “benign”
– Fraud detection: transaction is “fraudulent”
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: The training data is accompanied by labels
indicating the class of the observations
– New data is classified based on the training set
Testing data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Unseen data: (Jeff, Professor, 4). Tenured?
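To make the testing step concrete, the following minimal sketch scores a classifier on the four test tuples above and then labels the unseen tuple. The rule inside `predict_tenured` is a hypothetical stand-in for a learned model (this section does not derive one), and the function names are illustrative.

```python
# Test tuples from the table above: (name, rank, years, tenured)
test_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

def predict_tenured(rank, years):
    """Hypothetical learned rule, assumed for illustration only."""
    return "yes" if rank == "Professor" or years > 6 else "no"

def accuracy(model, data):
    """Fraction of test tuples whose predicted label matches the true label."""
    correct = sum(1 for _, rank, years, label in data
                  if model(rank, years) == label)
    return correct / len(data)

test_accuracy = accuracy(predict_tenured, test_data)

# Classify the unseen tuple (Jeff, Professor, 4)
unseen_prediction = predict_tenured("Professor", 4)
```

Under this assumed rule, three of the four test tuples are classified correctly (Merlisa is misclassified), so `test_accuracy` is 0.75, and the unseen tuple is labeled "yes".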
Decision Tree Induction
• Decision tree: a flowchart-like tree structure where each
– internal node (nonleaf node): a test on an attribute,
– branch: an outcome of the test, and
– leaf node (or terminal node): class label
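The structure described above can be sketched as a small data type: internal nodes hold the attribute being tested, branches map test outcomes to subtrees, and leaves hold class labels. The class names `Node` and `Leaf`, the attribute `senior`, and the hand-built tree are all hypothetical; the tree is written by hand, not induced from data.

```python
from dataclasses import dataclass, field
from typing import Dict, Union

@dataclass
class Leaf:
    label: str  # class label held by a leaf (terminal) node

@dataclass
class Node:
    attribute: str  # attribute tested at this internal node
    # maps each outcome of the test to the subtree for that branch
    branches: Dict[str, Union["Node", Leaf]] = field(default_factory=dict)

def classify(tree, record):
    """Follow the branch matching each test outcome until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.branches[record[tree.attribute]]
    return tree.label

# Hypothetical tree, hand-built for illustration (not induced from data).
tree = Node("rank", {
    "Professor": Leaf("yes"),
    "Associate Prof": Leaf("no"),
    "Assistant Prof": Node("senior", {"yes": Leaf("yes"), "no": Leaf("no")}),
})

label = classify(tree, {"rank": "Assistant Prof", "senior": "no"})
```

Classifying a record is a single walk from the root to a leaf, so its cost is bounded by the depth of the tree.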
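Information Gain
• Info_A(D) is the expected information required to classify
a tuple from D after attribute A partitions D into D_1, ..., D_v:
$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j)$$
where Info(D) is the entropy of the class labels in D
– The smaller the expected information (still) required,
the greater the purity of the partitions
• Information gained by branching on attribute A:
$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$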
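As a rough illustration, information gain can be computed as below. The sketch assumes the standard definitions Info(D) = -Σ p_i log2(p_i), Info_A(D) as the size-weighted entropy of the partitions induced by A, and Gain(A) = Info(D) - Info_A(D). The function names and the toy data are made up for this example.

```python
from collections import Counter
from math import log2

def info(labels):
    """Info(D): entropy of the class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_after_split(values, labels):
    """Info_A(D): entropy of each partition D_j, weighted by |D_j| / |D|."""
    n = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    return sum(len(part) / n * info(part) for part in partitions.values())

def gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(labels) - info_after_split(values, labels)

# Toy data, made up for illustration.
tenured = ["no", "no", "yes", "yes"]
rank = ["Assistant", "Assistant", "Professor", "Professor"]  # separates the classes
office = ["east", "west", "east", "west"]                    # unrelated to the class

gain_rank = gain(rank, tenured)
gain_office = gain(office, tenured)
```

Here `rank` yields pure partitions, so its gain equals the full entropy of the labels (1 bit), while `office` leaves each partition as mixed as the original data and gains nothing.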
• GainRatio(A) = Gain(A)/SplitInfo(A)
• The attribute with the maximum gain ratio is
selected as the splitting attribute
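A minimal sketch of selecting the splitting attribute by gain ratio follows. It assumes the usual definition of split information, SplitInfo_A(D) = -Σ (|D_j|/|D|) log2(|D_j|/|D|), which is the entropy of the attribute's own value distribution. The function names, the zero-split guard, and the toy columns are assumptions made for illustration.

```python
from collections import Counter
from math import log2

def entropy(items):
    """Entropy (in bits) of the empirical distribution of items."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())

def gain_ratio(values, labels):
    """GainRatio(A) = Gain(A) / SplitInfo(A)."""
    n = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    info_a = sum(len(p) / n * entropy(p) for p in partitions.values())
    gain = entropy(labels) - info_a
    split_info = entropy(values)  # entropy of the partition sizes
    # Assumed guard: an attribute with a single value cannot split D.
    return gain / split_info if split_info > 0 else 0.0

# Toy data, made up for illustration.
labels = ["no", "no", "yes", "yes"]
candidates = {
    "rank": ["Assistant", "Assistant", "Professor", "Professor"],
    "id": ["a", "b", "c", "d"],  # unique per tuple
}

best = max(candidates, key=lambda a: gain_ratio(candidates[a], labels))
```

Both attributes have the same information gain on this data, but the unique-valued `id` has a larger split information, so its gain ratio is halved and `rank` is selected.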
Gini Index
• Used in CART, IBM IntelligentMiner
• Gini index is defined as
$$\mathrm{gini}(D) = 1 - \sum_{j=1}^{m} p_j^2$$
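The Gini index can be computed directly from class counts. The sketch below assumes p_j is the relative frequency of class j in D and m is the number of classes; the function name and sample labels are illustrative.

```python
from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum over classes of p_j squared."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

pure = gini(["yes", "yes", "yes", "yes"])  # a single class
mixed = gini(["yes", "yes", "no", "no"])   # two equally frequent classes
```

A pure set has a Gini index of 0, and two equally frequent classes give 0.5, the maximum for a two-class problem.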