MACHINE LEARNING
A Few Quotes
"A breakthrough in machine learning would be worth ten Microsofts" (Bill Gates)
Definitions
Machine learning investigates the mechanisms by which knowledge is acquired through experience
Model
A model of learning is fundamental in any machine
learning application:
who is learning (a computer program)
what is learned (a domain)
from what the learner is learning (the information source)
Traditional Programming
Data + Program → Computer → Output
Machine Learning
Data + Output → Computer → Program
Magic?
No, more like gardening
Seeds = Algorithms
Nutrients = Data
Gardener = You
Plants = Programs
Sample Applications
Web search
Computational biology
Finance
E-commerce
Space exploration
Robotics
Information extraction
Social networks
Debugging
[Your favorite area]
ML in a Nutshell
Tens of thousands of machine learning algorithms
Hundreds of new ones every year
Every machine learning algorithm has three components:
Representation
Evaluation
Optimization
Representation
Decision trees
Sets of rules / Logic programs
Instances
Graphical models (Bayes/Markov nets)
Neural networks
Support vector machines
Model ensembles
Etc.
Evaluation
Accuracy
Precision and recall
Squared error
Likelihood
Posterior probability
Cost / Utility
Margin
Entropy
K-L divergence
Etc.
Optimization
Combinatorial optimization
E.g.: Greedy search
Convex optimization
E.g.: Gradient descent (sketched below)
Constrained optimization
E.g.: Linear programming
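To make the optimization component concrete, here is a minimal sketch of gradient descent on a toy quadratic loss; the loss function, learning rate, and step count are illustrative choices, not part of the slides.

    # Minimize f(w) = (w - 3)^2; its gradient is f'(w) = 2 * (w - 3)
    def gradient_descent(lr=0.1, steps=100):
        w = 0.0                     # arbitrary starting point
        for _ in range(steps):
            grad = 2.0 * (w - 3.0)  # gradient of the loss at w
            w -= lr * grad          # step against the gradient
        return w

    print(gradient_descent())  # converges toward the minimizer w = 3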
Data Preparation
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains a reduced representation in volume but produces the same or similar analytical results
Data discretization
Part of data reduction, with particular importance for numerical data
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Data Cleaning
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
How to handle missing data?
Use a global constant to fill in the missing value: e.g., "unknown", a new class?!
Use the attribute mean to fill in the missing value (sketched below)
Use the most probable value to fill in the missing value: inference-based methods such as a Bayesian formula or a decision tree
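A minimal sketch of mean imputation using pandas (assumed available); the column name and values are invented:

    import pandas as pd

    df = pd.DataFrame({"income": [52.0, None, 61.5, None, 48.0]})  # toy data

    # Fill missing values with the attribute mean
    df["income"] = df["income"].fillna(df["income"].mean())

    # Alternative: a global constant marking a new "unknown" class
    # df["income"] = df["income"].fillna("unknown")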
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming conventions
Other data problems that require data cleaning
duplicate records
incomplete data
inconsistent data
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Data Integration
Data integration:
combines data from multiple sources into a coherent store
Schema integration
integrate metadata from different sources
Entity identification problem: identify real-world entities
from multiple data sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts
for the same real world entity, attribute values from
different sources are different
possible reasons: different representations, different
scales, e.g., metric vs. British units
Handling redundancy
Redundant data often occur when integrating multiple databases
The same attribute may have different names in different databases
Careful integration of data from multiple sources may help reduce or avoid redundancies and inconsistencies
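As a sketch of the entity identification step, pandas (assumed available) can join two sources whose key attribute carries different names; all table and column names here are invented:

    import pandas as pd

    # Source A calls the key "cust-id"; source B calls it "cust-#"
    a = pd.DataFrame({"cust-id": [1, 2, 3], "income": [50.0, 62.5, 48.0]})
    b = pd.DataFrame({"cust-#": [2, 3, 4], "region": ["N", "S", "E"]})

    # Integrate both sources into one coherent store on the shared key
    merged = pd.merge(a, b, left_on="cust-id", right_on="cust-#", how="outer")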
Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified
range
min-max normalization
z-score normalization
normalization by decimal scaling
Data Transformation:
Normalization
min-max normalization
v' = \frac{v - \min_A}{\max_A - \min_A}(new\_max_A - new\_min_A) + new\_min_A
z-score normalization
v' = \frac{v - mean_A}{stand\_dev_A}
normalization by decimal scaling
v' = \frac{v}{10^j}, where j is the smallest integer such that \max(|v'|) < 1
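A minimal numpy sketch of the three normalization formulas above; the attribute values are invented:

    import numpy as np

    v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # toy attribute values

    # min-max normalization into [new_min_A, new_max_A]
    new_min, new_max = 0.0, 1.0
    v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

    # z-score normalization
    v_zscore = (v - v.mean()) / v.std()

    # decimal scaling: smallest j such that max(|v'|) < 1
    j = 0
    while np.abs(v / 10**j).max() >= 1:
        j += 1
    v_decimal = v / 10**j  # here j = 4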
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Data Reduction
Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results for the task
Dimensionality Reduction
Feature selection (i.e., attribute subset selection):
Select a minimum set of features such that the probability distribution of the classes given those features is as close as possible to the original distribution given all features
Sampling
Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
Sampling
[Figure: Raw Data reduced by simple random sampling without replacement (SRSWOR) and with replacement (SRSWR)]
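A minimal numpy sketch of the two sampling schemes in the figure; the data and sample size are invented:

    import numpy as np

    rng = np.random.default_rng(42)
    raw = np.arange(100)  # stand-in for the raw data

    srswor = rng.choice(raw, size=10, replace=False)  # simple random sample without replacement
    srswr = rng.choice(raw, size=10, replace=True)    # simple random sample with replacement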
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Discretization
Three types of attributes:
Nominal: values from an unordered set
Ordinal: values from an ordered set
Continuous: real numbers
Discretization:
divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical
attributes.
Reduce data size by discretization
Prepare for further analysis
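A minimal sketch of equal-width discretization with numpy; the attribute values and bin count are invented:

    import numpy as np

    age = np.array([13, 15, 16, 19, 20, 21, 22, 25, 30, 35, 40, 45, 52, 70])

    # Divide the attribute range into 3 equal-width intervals
    edges = np.linspace(age.min(), age.max(), num=4)  # 3 bins need 4 edges
    codes = np.digitize(age, edges[1:-1])             # interval index (0..2) per value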
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Summary
Data preparation is a big issue for both warehousing
and mining
Data preparation includes
Data cleaning and data integration
Data reduction and feature selection
Discretization
Many methods have been developed, but this is still an active
area of research
Types of Learning
Supervised (inductive) learning
Training data includes desired outputs
Unsupervised learning
Training data does not include desired outputs
Semi-supervised learning
Training data includes a few desired outputs
Reinforcement learning
Rewards from sequence of actions
Data Collection
http://archive.ics.uci.edu/ml/
Breast Cancer Wisconsin (Original) Data Set
#   Attribute                      Domain
--  -----------------------------  --------------------------
1.  Sample code number             id number
2.  Clump Thickness                1 - 10
3.  Uniformity of Cell Size        1 - 10
4.  Uniformity of Cell Shape       1 - 10
5.  Marginal Adhesion              1 - 10
6.  Single Epithelial Cell Size    1 - 10
7.  Bare Nuclei                    1 - 10
8.  Bland Chromatin                1 - 10
9.  Normal Nucleoli                1 - 10
10. Mitoses                        1 - 10
11. Class                          2 = benign, 4 = malignant
1016277,6,8,8,1,3,4,3,7,1,2
1017023,4,1,1,3,2,1,3,1,1,2
1017122,8,10,10,8,7,10,9,7,1,4
1018099,1,1,1,1,2,10,3,1,1,2
1018561,2,1,2,1,2,1,3,1,1,2
1033078,2,1,1,1,2,1,1,1,5,2
1033078,4,2,1,1,2,1,2,1,1,2
1035283,1,1,1,1,1,1,3,1,1,2
1036172,2,1,1,1,2,1,2,1,1,2
1041801,5,3,3,3,2,3,4,4,1,4
1043999,1,1,1,1,2,3,3,1,1,2
1044572,8,7,5,10,7,9,5,5,4,4
1047630,7,4,6,4,6,1,4,3,1,4
1048672,4,1,1,1,2,1,2,1,1,2
1049815,4,1,1,1,2,1,3,1,1,2
1050670,10,7,7,6,4,10,4,1,2,4
1050718,6,1,1,1,2,1,3,1,1,2
Class distribution: Benign: 458 (65.5%) Malignant: 241 (34.5%)
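A sketch of loading this data set with pandas; the URL below is the conventional UCI location for the file, and the short column names are invented, so verify both before relying on them:

    import pandas as pd

    url = ("http://archive.ics.uci.edu/ml/machine-learning-databases/"
           "breast-cancer-wisconsin/breast-cancer-wisconsin.data")
    cols = ["id", "clump_thickness", "cell_size", "cell_shape", "adhesion",
            "epithelial_size", "bare_nuclei", "chromatin", "nucleoli",
            "mitoses", "class"]

    df = pd.read_csv(url, names=cols, na_values="?")  # '?' marks missing values
    print(df["class"].value_counts())                 # 2 = benign, 4 = malignant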
Steps
[Figure: training pipeline: Training Images + Training Labels → Image Features → Training → Learned Model; testing pipeline: Test Image → Image Features → Learned Model → Prediction]
Cross validation
Solution: k-fold cross validation maximizes the use of
the data.
Divide data randomly into k folds (subsets) of equal
size.
Train the model on k-1 folds, use one fold for testing.
Repeat this process k times so that all folds are used
for testing.
Compute the average performance on the k test sets.
This effectively uses all the data for both training and
testing.
Typically k = 10 is used (a sketch follows below).
Sometimes stratified k-fold cross validation is used.
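A minimal numpy sketch of the procedure; the model here is a hypothetical object with fit and score methods:

    import numpy as np

    def k_fold_cv(model, X, y, k=10, seed=0):
        idx = np.random.default_rng(seed).permutation(len(X))  # shuffle indices
        folds = np.array_split(idx, k)                         # k (near-)equal folds
        scores = []
        for i in range(k):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            model.fit(X[train], y[train])                 # train on k-1 folds
            scores.append(model.score(X[test], y[test]))  # test on the held-out fold
        return np.mean(scores)                            # average over the k test sets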
Cross validation
Identify n folds of the available data.
Train on n-1 folds
Test on the remaining fold.
In the extreme (n = N) this is known as leave-one-out cross-validation.
2-fold cross-validation
This is the simplest variation of k-fold cross-validation.
For each fold, we randomly assign data points to two
sets d0 and d1, so that both sets are of equal size (this is
usually implemented as shuffling the data array and
then splitting in two). We then train on d0 and test on
d1, followed by training on d1 and testing on d0.
This has the advantage that our training and test sets
are both large, and each data point is used for both
training and validation on each fold.
Leave-one-out cross-validation
Leave-one-out cross-validation is simply k-fold
cross-validation with k set to n, the number of
instances in the data set.
This means that the test set consists of only a single
instance, which will be classified either correctly or
incorrectly.
Advantages: maximal use of training data, i.e.,
training on n-1 instances. The procedure is
deterministic: no sampling is involved.
Disadvantages: infeasible for large data sets: a large
number of training runs is required, at high computational
cost. Cannot be stratified (there is only one class in the test
set).
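Since leave-one-out is k-fold with k = n, the sketch above works with k = len(X); scikit-learn (assumed available) also ships it directly:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # n training runs, each leaving out exactly one instance
    scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=LeaveOneOut())
    print(scores.mean())  # fraction of held-out instances classified correctly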
Cross-validation visualized
Available Labeled Data
Identify n partitions
[Figure: folds 1 through 6 over the n partitions; in each fold one partition serves as the Dev set, an adjacent one as the Test set, and the remaining partitions are used for Training. The Dev and Test partitions rotate through the data across folds.]