
INTRODUCTION TO

MACHINE LEARNING

A Few Quotes
"A breakthrough in machine learning would be worth ten Microsofts" (Bill Gates, Chairman, Microsoft)

"Machine learning is the next Internet" (Tony Tether, Director, DARPA)

"Machine learning is the hot new thing" (John Hennessy, President, Stanford)

"Web rankings today are mostly a matter of machine learning" (Prabhakar Raghavan, Dir. Research, Yahoo)

"Machine learning is going to result in a real revolution" (Greg Papadopoulos, CTO, Sun)

"Machine learning is today's discontinuity" (Jerry Yang, CEO, Yahoo)

Definitions
Machine learning investigates the mechanisms by which
knowledge is acquired through experience.

Machine Learning is the field that concentrates on
induction algorithms and on other algorithms that can be
said to "learn."
Learning from data sets

Model
A model of learning is fundamental in any machine

learning application:
who is learning (a computer program)
what is learned (a domain)
from what the learner is learning (the information source)

Traditional Programming
Data + Program -> Computer -> Output

Machine Learning
Data + Output -> Computer -> Program

Magic?
No, more like gardening
Seeds = Algorithms
Nutrients = Data
Gardener = You
Plants = Programs

Sample Applications
Web search
Computational biology
Finance
E-commerce
Space exploration
Robotics
Information extraction
Social networks
Debugging
[Your favorite area]

ML in a Nutshell
Tens of thousands of machine learning algorithms
Hundreds of new ones every year
Every machine learning algorithm has three components:
Representation
Evaluation
Optimization

Representation
Decision trees
Sets of rules / Logic programs
Instances
Graphical models (Bayes/Markov nets)
Neural networks
Support vector machines
Model ensembles
Etc.

Evaluation
Accuracy
Precision and recall
Squared error
Likelihood
Posterior probability
Cost / Utility
Margin
Entropy
K-L divergence
Etc.
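As a rough illustration of a few of these evaluation measures, the following Python sketch computes accuracy, precision, recall, and squared error from a toy set of binary labels and predictions (the values are made up for illustration):

# Toy example: hypothetical binary labels (1 = positive) and predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
squared_error = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

print(accuracy, precision, recall, squared_error)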

Optimization
Combinatorial optimization
E.g.: Greedy search
Convex optimization
E.g.: Gradient descent

Constrained optimization
E.g.: Linear programming
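To make the gradient descent example concrete, here is a minimal Python sketch that minimizes the simple quadratic f(w) = (w - 3)^2; the objective, step size, and iteration count are arbitrary choices for illustration:

# Gradient descent on f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = 0.0              # arbitrary starting point
learning_rate = 0.1
for step in range(100):
    gradient = 2 * (w - 3)
    w -= learning_rate * gradient
print(w)             # converges toward the minimizer w = 3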

Data Preparation

Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary

Why Data Preprocessing?


Data in the real world is dirty
incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
noisy: containing errors or outliers
inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results!
Quality decisions must be based on quality data
Data warehouse needs consistent integration of quality data

Major Tasks in Data Preprocessing


Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains a reduced representation in volume that produces the same or
similar analytical results
Data discretization
Part of data reduction but with particular importance, especially for
numerical data

Forms of data preprocessing

Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary

Data Cleaning
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data

Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not have been considered important at the time of entry
history or changes of the data not recorded
Missing data may need to be inferred.

How to Handle Missing Data?


Ignore the tuple: usually done when the class label is missing
(assuming a classification task); not effective when the percentage of
missing values per attribute varies considerably
Fill in the missing value manually: tedious + infeasible?
Use a global constant to fill in the missing value: e.g., "unknown", a
new class?!
Use the attribute mean to fill in the missing value
Use the most probable value to fill in the missing value: inference-based,
such as a Bayesian formula or decision tree
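A minimal pandas sketch of the drop, global-constant, and mean-fill strategies above (the column names and values are hypothetical):

import numpy as np
import pandas as pd

# Hypothetical customer records with missing values.
df = pd.DataFrame({"age": [25, 40, np.nan, 33],
                   "income": [50000, np.nan, 72000, np.nan]})

df_drop = df.dropna()                              # ignore (drop) the tuple
df_const = df.fillna("unknown")                    # global constant
df_mean = df.fillna(df.mean(numeric_only=True))    # attribute mean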

Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data

How to Handle Noisy Data?


Binning method:
first sort data and partition into (equi-depth) bins
then smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
Clustering
detect and remove outliers

Combined computer and human inspection


detect suspicious values and check by human
Regression
smooth by fitting the data into regression functions
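As an illustration of the binning method, a small Python sketch that sorts the data, partitions it into equi-depth bins, and smooths each bin by its mean (the sample values are made up):

# Equi-depth binning with smoothing by bin means.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
n_bins = 3
depth = len(data) // n_bins

bins = [data[i * depth:(i + 1) * depth] for i in range(n_bins)]
smoothed = [[sum(b) / len(b)] * len(b) for b in bins]
print(bins)      # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smoothed)  # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]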

Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary

Data Integration
Data integration:
combines data from multiple sources into a coherent store
Schema integration
integrate metadata from different sources
Entity identification problem: identify real-world entities
from multiple data sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts
for the same real world entity, attribute values from
different sources are different
possible reasons: different representations, different
scales, e.g., metric vs. British units

Handling Redundant Data


Redundant data often occur when integrating multiple
databases
The same attribute may have different names in different
databases
Careful integration of data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality

Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified

range
min-max normalization
z-score normalization
normalization by decimal scaling

Data Transformation:
Normalization
min-max normalization

v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

z-score normalization

v' = (v - mean_A) / stand_dev_A

normalization by decimal scaling

v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
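A small Python sketch of the three normalization formulas above (the sample values and the new [0, 1] range are arbitrary choices):

# Normalize a list of values three ways.
values = [200.0, 300.0, 400.0, 600.0, 900.0]
v_min, v_max = min(values), max(values)
mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5

new_min, new_max = 0.0, 1.0
min_max = [(v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min
           for v in values]
z_score = [(v - mean) / std for v in values]

j = 3  # smallest integer such that max(|v / 10**j|) < 1 for these values
decimal_scaled = [v / 10 ** j for v in values]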

Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary

Data Reduction Strategies


Warehouse may store terabytes of data: complex data
analysis/mining may take a very long time to run on the
complete data set
Data reduction
Obtains a reduced representation of the data set that is
much smaller in volume but yet produces the same (or
almost the same) analytical results
Data reduction strategies
Data cube aggregation
Dimensionality reduction
Numerosity reduction
Discretization and concept hierarchy generation

Data Cube Aggregation


The lowest level of a data cube
the aggregated data for an individual entity of interest
e.g., a customer in a phone calling data warehouse.

Multiple levels of aggregation in data cubes


Further reduce the size of data to deal with

Reference appropriate levels


Use the smallest representation which is enough to solve

the task

Dimensionality Reduction
Feature selection (i.e., attribute subset selection):
Select a minimum set of features such that the
probability distribution of different classes given the
values for those features is as close as possible to the
original distribution given the values of all features
reduces the number of resulting patterns, making them
easier to understand
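One possible way to sketch attribute subset selection is with scikit-learn's SelectKBest (assuming scikit-learn is available; the data here is randomly generated for illustration):

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Randomly generated data: 100 samples, 10 features, binary class labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = (X[:, 0] + X[:, 3] > 0).astype(int)  # only features 0 and 3 matter

selector = SelectKBest(mutual_info_classif, k=2).fit(X, y)
print(selector.get_support(indices=True))  # indices of the selected features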

Sampling
Allow a mining algorithm to run in complexity that is

potentially sub-linear to the size of the data


Choose a representative subset of the data
Simple random sampling may have very poor
performance in the presence of skew
Develop adaptive sampling methods
Stratified sampling:
Approximate the percentage of each class (or subpopulation of interest)

in the overall database


Used in conjunction with skewed data
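A brief sketch of simple random sampling and stratified sampling with pandas (the DataFrame, class column, and sampling fractions are hypothetical):

import pandas as pd

# Hypothetical skewed data: 90 records of class "a", 10 of class "b".
df = pd.DataFrame({"value": range(100),
                   "label": ["a"] * 90 + ["b"] * 10})

srswor = df.sample(n=10, replace=False)  # simple random sampling without replacement
srswr = df.sample(n=10, replace=True)    # simple random sampling with replacement

# Stratified sampling: keep roughly 10% of each class.
stratified = df.groupby("label", group_keys=False).apply(lambda g: g.sample(frac=0.1))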

Sampling

[Figure: raw data reduced by SRSWOR (simple random sampling without replacement) and by SRSWR (simple random sampling with replacement)]

Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary

Discretization
Three types of attributes:
Nominal: values from an unordered set
Ordinal: values from an ordered set
Continuous: real numbers
Discretization:
divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical
attributes.
Reduce data size by discretization
Prepare for further analysis

Discretization and Concept Hierarchy


Discretization
reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace
actual data values.
Concept hierarchies
reduce the data by collecting and replacing low-level
concepts (such as numeric values for the attribute age)
by higher-level concepts (such as young, middle-aged,
or senior).
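For instance, a minimal pandas sketch that discretizes a numeric age attribute into intervals and then replaces the intervals with higher-level concept labels (the cut points are an arbitrary choice):

import pandas as pd

ages = pd.Series([22, 35, 47, 58, 63, 71])

# Discretization: divide the continuous range into intervals.
intervals = pd.cut(ages, bins=[0, 30, 60, 120])

# Concept hierarchy: replace intervals with higher-level labels.
concepts = pd.cut(ages, bins=[0, 30, 60, 120],
                  labels=["young", "middle-aged", "senior"])
print(concepts.tolist())  # ['young', 'middle-aged', 'middle-aged', 'middle-aged', 'senior', 'senior']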

Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary

Summary
Data preparation is a big issue for both warehousing

and mining
Data preparation includes
Data cleaning and data integration
Data reduction and feature selection
Discretization
Many methods have been developed, but this is still an
active area of research


Types of Learning
Supervised (inductive) learning
Training data includes desired outputs
Unsupervised learning
Training data does not include desired outputs

Semi-supervised learning
Training data includes a few desired outputs
Reinforcement learning
Rewards from sequence of actions


Designing a Learning System:


An Example
1. Problem Description: Classifying cancer
2. Choosing the Training Experience: Data Collection
3. Choosing the Target Function / target output: Identify an
appropriate function for the data
4. Choosing a Function Algorithm
5. Design

Data Collection
http://archive.ics.uci.edu/ml/
Breast Cancer Wisconsin (Original) Data Set
#   Attribute                      Domain
1.  Sample code number             id number
2.  Clump Thickness                1 - 10
3.  Uniformity of Cell Size        1 - 10
4.  Uniformity of Cell Shape       1 - 10
5.  Marginal Adhesion              1 - 10
6.  Single Epithelial Cell Size    1 - 10
7.  Bare Nuclei                    1 - 10
8.  Bland Chromatin                1 - 10
9.  Normal Nucleoli                1 - 10
10. Mitoses                        1 - 10
11. Class                          2 for benign, 4 for malignant

Sample records (one per line):
1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1016277,6,8,8,1,3,4,3,7,1,2
1017023,4,1,1,3,2,1,3,1,1,2
1017122,8,10,10,8,7,10,9,7,1,4
1018099,1,1,1,1,2,10,3,1,1,2
1018561,2,1,2,1,2,1,3,1,1,2
1033078,2,1,1,1,2,1,1,1,5,2
1033078,4,2,1,1,2,1,2,1,1,2
1035283,1,1,1,1,1,1,3,1,1,2
1036172,2,1,1,1,2,1,2,1,1,2
1041801,5,3,3,3,2,3,4,4,1,4
1043999,1,1,1,1,2,3,3,1,1,2
1044572,8,7,5,10,7,9,5,5,4,4
1047630,7,4,6,4,6,1,4,3,1,4
1048672,4,1,1,1,2,1,2,1,1,2
1049815,4,1,1,1,2,1,3,1,1,2
1050670,10,7,7,6,4,10,4,1,2,4
1050718,6,1,1,1,2,1,3,1,1,2

Class distribution: Benign: 458 (65.5%), Malignant: 241 (34.5%)
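A sketch of loading this data set with pandas, assuming it has been downloaded from the UCI repository above and saved locally as "breast-cancer-wisconsin.data" (the file name and column names are assumptions):

import pandas as pd

columns = ["id", "clump_thickness", "cell_size_uniformity", "cell_shape_uniformity",
           "marginal_adhesion", "single_epithelial_cell_size", "bare_nuclei",
           "bland_chromatin", "normal_nucleoli", "mitoses", "class"]

# Missing values appear as "?" in this data set.
df = pd.read_csv("breast-cancer-wisconsin.data", names=columns, na_values="?")

X = df.drop(columns=["id", "class"])
y = df["class"]              # 2 = benign, 4 = malignant
print(y.value_counts())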

Choosing a Function Algorithm


Applying to Neural Network
Input Data?
Number of layers?
Output Data?
Design
Training Data?
Testing Data?

Steps
Training: Training Images -> Image Features -> Training (with Training Labels) -> Learned Model
Prediction: Test Image -> Image Features -> Learned Model -> Prediction

Slide credit: D. Hoiem and L.
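A highly simplified sketch of this training/prediction flow in scikit-learn style (feature extraction is reduced to placeholder random arrays; the classifier choice is arbitrary):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder "image features": each row is a flattened feature vector.
train_features = np.random.rand(50, 64)       # features of training images
train_labels = np.random.randint(0, 2, 50)    # training labels
test_features = np.random.rand(5, 64)         # features of test images

model = LogisticRegression(max_iter=1000).fit(train_features, train_labels)  # training -> learned model
predictions = model.predict(test_features)                                   # learned model -> prediction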

Cross validation
Solution: k-fold cross validation maximizes the use of
the data.
Divide data randomly into k folds (subsets) of equal
size.
Train the model on k-1 folds, use one fold for testing.
Repeat this process k times so that all folds are used
for testing.
Compute the average performance on the k test sets.
This effectively uses all the data for both training and
testing.
Typically k = 10 is used.
Sometimes stratified k-fold cross validation is used.
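A minimal scikit-learn sketch of k-fold cross-validation as described above (the classifier and the synthetic data are arbitrary choices):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

scores = []
kf = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])  # train on k-1 folds
    scores.append(model.score(X[test_idx], y[test_idx]))              # test on the held-out fold

print(sum(scores) / len(scores))  # average performance over the k folds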

Cross validation
Identify n folds of the available data.
Train on n-1 folds
Test on the remaining fold.
In the extreme (n=N) this is known as

leave-one-out cross validation


In k-fold cross-validation, the original sample is randomly partitioned
into k subsamples. Of the k subsamples, a single subsample is
retained as the validation data for testing the model, and the
remaining k-1 subsamples are used as training data. The cross-
validation process is then repeated k times (the folds), with each of
the k subsamples used exactly once as the validation data. The k
results from the folds can then be averaged (or otherwise combined)
to produce a single estimation. The advantage of this method over
repeated random sub-sampling is that all observations are used for
both training and validation, and each observation is used for
validation exactly once. 10-fold cross-validation is commonly used,
but in general k remains an unfixed parameter.
In stratified k-fold cross-validation, the folds are selected so that the
mean response value is approximately equal in all the folds. In the
case of a dichotomous classification, this means that each fold
contains roughly the same proportions of the two types of class
labels.


2-fold cross-validation
This is the simplest variation of k-fold cross-validation.
For each fold, we randomly assign data points to two
sets d0 and d1, so that both sets are of equal size (this is
usually implemented by shuffling the data array and
then splitting it in two). We then train on d0 and test on
d1, followed by training on d1 and testing on d0.
This has the advantage that our training and test sets
are both large, and each data point is used for both
training and validation on each fold.


Leave-one-out cross-validation
Leave-one-out cross-validation is simply k-fold
cross-validation with k set to n, the number of
instances in the data set.
This means that the test set only consists of a single
instance, which will be classified either correctly or
incorrectly.
Advantages: maximal use of training data, i.e.,
training on n-1 instances. The procedure is
deterministic, no sampling involved.
Disadvantages: infeasible for large data sets: a large
number of training runs is required, at high
computational cost. Cannot be stratified (only one
class in the test set).
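The same idea expressed with scikit-learn's LeaveOneOut splitter (the tiny synthetic data set and the 1-nearest-neighbor classifier are arbitrary choices, kept small because every instance triggers a training run):

import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

correct = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    model = KNeighborsClassifier(n_neighbors=1).fit(X[train_idx], y[train_idx])
    correct += int(model.predict(X[test_idx])[0] == y[test_idx][0])

print(correct / len(X))  # fraction of instances classified correctly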


Cross-validation visualized
Available Labeled Data
Identify n partitions

Fold 1: Train | Train | Train | Train | Dev | Test

Cross-validation visualized
Available Labeled Data
Identify n partitions

Fold 2: Test | Train | Train | Train | Train | Dev

Cross-validation visualized
Available Labeled Data
Identify n partitions

Fold 3: Dev | Test | Train | Train | Train | Train

Cross-validation visualized
Available Labeled Data
Identify n partitions

Fold 4: Train | Dev | Test | Train | Train | Train

Cross-validation visualized
Available Labeled Data
Identify n partitions

Fold 5: Train | Train | Dev | Test | Train | Train

Cross-validation visualized
Available Labeled Data
Identify n partitions

Fold 6: Train | Train | Train | Dev | Test | Train

Calculate Average Performance
