
INTRODUCTION TO

MACHINE LEARNING

A Few Quotes
"A breakthrough in machine learning would be worth ten Microsofts" (Bill Gates, Chairman, Microsoft)

"Machine learning is the next Internet" (Tony Tether, Director, DARPA)

"Machine learning is the hot new thing" (John Hennessy, President, Stanford)

"Web rankings today are mostly a matter of machine learning" (Prabhakar Raghavan, Dir. Research, Yahoo)

"Machine learning is going to result in a real revolution" (Greg Papadopoulos, CTO, Sun)

"Machine learning is today's discontinuity" (Jerry Yang, CEO, Yahoo)

Definitions
Machine learning investigates the mechanisms by which
knowledge is acquired through experience.

Machine Learning is the field that concentrates on
induction algorithms and on other algorithms that can be
said to "learn."
Learning from data sets

Model
A model of learning is fundamental in any machine

learning application:
who is learning (a computer program)
what is learned (a domain)
from what the learner is learning (the information source)

Traditional Programming
Data + Program -> Computer -> Output

Machine Learning
Data + Output -> Computer -> Program

Magic?
No, more like gardening
Seeds = Algorithms
Nutrients = Data
Gardener = You
Plants = Programs

Sample Applications
Web search
Computational biology
Finance
E-commerce
Space exploration
Robotics
Information extraction
Social networks
Debugging
[Your favorite area]

ML in a Nutshell
Tens of thousands of machine learning algorithms
Hundreds of new ones every year
Every machine learning algorithm has three components:
Representation
Evaluation
Optimization

Representation
Decision trees
Sets of rules / Logic programs
Instances
Graphical models (Bayes/Markov nets)
Neural networks
Support vector machines
Model ensembles
Etc.

Evaluation
Accuracy
Precision and recall
Squared error
Likelihood
Posterior probability
Cost / Utility
Margin
Entropy
K-L divergence
Etc.
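As a rough illustration of a few of these evaluation measures, the following Python sketch computes accuracy, precision, recall, and squared error from a toy set of binary labels and predictions (the values are made up for illustration):

# Toy example: hypothetical binary labels (1 = positive) and predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
squared_error = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

print(accuracy, precision, recall, squared_error)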

Optimization
Combinatorial optimization
E.g.: Greedy search
Convex optimization
E.g.: Gradient descent

Constrained optimization
E.g.: Linear programming
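To make the gradient descent example concrete, here is a minimal Python sketch that minimizes the simple quadratic f(w) = (w - 3)^2; the objective, step size, and iteration count are arbitrary choices for illustration:

# Gradient descent on f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = 0.0              # arbitrary starting point
learning_rate = 0.1
for step in range(100):
    gradient = 2 * (w - 3)
    w -= learning_rate * gradient
print(w)             # converges toward the minimizer w = 3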

Data Preparation

Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary

Why Data Preprocessing?


Data in the real world is dirty
incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
noisy: containing errors or outliers
inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results!
Quality decisions must be based on quality data
Data warehouse needs consistent integration of quality data

Major Tasks in Data Preprocessing


Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains a reduced representation in volume that produces the same or
similar analytical results
Data discretization
Part of data reduction but with particular importance, especially for
numerical data

Forms of data preprocessing

Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary

Data Cleaning
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data

Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not have been considered important at the time of entry
history or changes of the data not recorded
Missing data may need to be inferred.

How to Handle Missing Data?


Ignore the tuple: usually done when the class label is missing
(assuming a classification task); not effective when the percentage of
missing values per attribute varies considerably
Fill in the missing value manually: tedious + infeasible?
Use a global constant to fill in the missing value: e.g., "unknown", a
new class?!
Use the attribute mean to fill in the missing value
Use the most probable value to fill in the missing value: inference-based,
such as a Bayesian formula or decision tree
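A minimal pandas sketch of the drop, global-constant, and mean-fill strategies above (the column names and values are hypothetical):

import numpy as np
import pandas as pd

# Hypothetical customer records with missing values.
df = pd.DataFrame({"age": [25, 40, np.nan, 33],
                   "income": [50000, np.nan, 72000, np.nan]})

df_drop = df.dropna()                              # ignore (drop) the tuple
df_const = df.fillna("unknown")                    # global constant
df_mean = df.fillna(df.mean(numeric_only=True))    # attribute mean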

Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data

How to Handle Noisy Data?


Binning method:
first sort data and partition into (equi-depth) bins
then smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
Clustering
detect and remove outliers

Combined computer and human inspection


detect suspicious values and check by human
Regression
smooth by fitting the data into regression functions
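As an illustration of the binning method, a small Python sketch that sorts the data, partitions it into equi-depth bins, and smooths each bin by its mean (the sample values are made up):

# Equi-depth binning with smoothing by bin means.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
n_bins = 3
depth = len(data) // n_bins

bins = [data[i * depth:(i + 1) * depth] for i in range(n_bins)]
smoothed = [[sum(b) / len(b)] * len(b) for b in bins]
print(bins)      # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smoothed)  # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]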

Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary

Data Integration
Data integration:
combines data from multiple sources into a coherent store
Schema integration
integrate metadata from different sources
Entity identification problem: identify real-world entities
from multiple data sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts
for the same real world entity, attribute values from
different sources are different
possible reasons: different representations, different
scales, e.g., metric vs. British units

Handling Redundant Data


Redundant data often occur when integrating multiple
databases
The same attribute may have different names in different
databases
Careful integration of data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality

Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified

range
min-max normalization
z-score normalization
normalization by decimal scaling

Data Transformation:
Normalization
min-max normalization

v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

z-score normalization

v' = (v - mean_A) / stand_dev_A

normalization by decimal scaling

v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
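A small Python sketch of the three normalization formulas above (the sample values and the new [0, 1] range are arbitrary choices):

# Normalize a list of values three ways.
values = [200.0, 300.0, 400.0, 600.0, 900.0]
v_min, v_max = min(values), max(values)
mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5

new_min, new_max = 0.0, 1.0
min_max = [(v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min
           for v in values]
z_score = [(v - mean) / std for v in values]

j = 3  # smallest integer such that max(|v / 10**j|) < 1 for these values
decimal_scaled = [v / 10 ** j for v in values]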

Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary

Data Reduction Strategies


Warehouse may store terabytes of data: complex data
analysis/mining may take a very long time to run on the
complete data set
Data reduction
Obtains a reduced representation of the data set that is
much smaller in volume but yet produces the same (or
almost the same) analytical results
Data reduction strategies
Data cube aggregation
Dimensionality reduction
Numerosity reduction
Discretization and concept hierarchy generation

Data Cube Aggregation


The lowest level of a data cube
the aggregated data for an individual entity of interest
e.g., a customer in a phone calling data warehouse.

Multiple levels of aggregation in data cubes


Further reduce the size of data to deal with

Reference appropriate levels


Use the smallest representation which is enough to solve

the task

Dimensionality Reduction
Feature selection (i.e., attribute subset selection):
Select a minimum set of features such that the
probability distribution of different classes given the
values for those features is as close as possible to the
original distribution given the values of all features
reduces the number of resulting patterns, making them
easier to understand
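One possible way to sketch attribute subset selection is with scikit-learn's SelectKBest (assuming scikit-learn is available; the data here is randomly generated for illustration):

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Randomly generated data: 100 samples, 10 features, binary class labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = (X[:, 0] + X[:, 3] > 0).astype(int)  # only features 0 and 3 matter

selector = SelectKBest(mutual_info_classif, k=2).fit(X, y)
print(selector.get_support(indices=True))  # indices of the selected features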

Sampling
Allow a mining algorithm to run in complexity that is

potentially sub-linear to the size of the data


Choose a representative subset of the data
Simple random sampling may have very poor
performance in the presence of skew
Develop adaptive sampling methods
Stratified sampling:
Approximate the percentage of each class (or subpopulation of interest)

in the overall database


Used in conjunction with skewed data
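A brief sketch of simple random sampling and stratified sampling with pandas (the DataFrame, class column, and sampling fractions are hypothetical):

import pandas as pd

# Hypothetical skewed data: 90 records of class "a", 10 of class "b".
df = pd.DataFrame({"value": range(100),
                   "label": ["a"] * 90 + ["b"] * 10})

srswor = df.sample(n=10, replace=False)  # simple random sampling without replacement
srswr = df.sample(n=10, replace=True)    # simple random sampling with replacement

# Stratified sampling: keep roughly 10% of each class.
stratified = df.groupby("label", group_keys=False).apply(lambda g: g.sample(frac=0.1))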

Sampling

[Figure: raw data reduced by SRSWOR (simple random sampling without replacement) and by SRSWR (simple random sampling with replacement)]

Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary

Discretization
Three types of attributes:
Nominal: values from an unordered set
Ordinal: values from an ordered set
Continuous: real numbers
Discretization:
divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical
attributes.
Reduce data size by discretization
Prepare for further analysis

Discretization and Concept Hierarchy


Discretization
reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace
actual data values.
Concept hierarchies
reduce the data by collecting and replacing low-level
concepts (such as numeric values for the attribute age)
by higher-level concepts (such as young, middle-aged,
or senior).
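For instance, a minimal pandas sketch that discretizes a numeric age attribute into intervals and then replaces the intervals with higher-level concept labels (the cut points are an arbitrary choice):

import pandas as pd

ages = pd.Series([22, 35, 47, 58, 63, 71])

# Discretization: divide the continuous range into intervals.
intervals = pd.cut(ages, bins=[0, 30, 60, 120])

# Concept hierarchy: replace intervals with higher-level labels.
concepts = pd.cut(ages, bins=[0, 30, 60, 120],
                  labels=["young", "middle-aged", "senior"])
print(concepts.tolist())  # ['young', 'middle-aged', 'middle-aged', 'middle-aged', 'senior', 'senior']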

Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary

Summary
Data preparation is a big issue for both warehousing

and mining
Data preparation includes
Data cleaning and data integration
Data reduction and feature selection
Discretization
Many methods have been developed, but this is still an
active area of research


Types of Learning
Supervised (inductive) learning
Training data includes desired outputs
Unsupervised learning
Training data does not include desired outputs

Semi-supervised learning
Training data includes a few desired outputs
Reinforcement learning
Rewards from sequence of actions


Designing a Learning System:


An Example
1. Problem Description: Classifying cancer
2. Choosing the Training Experience: Data Collection
3. Choosing the Target Function / target output: Identify an
appropriate function for the data
4. Choosing a Function Algorithm
5. Design

Data Collection
http://archive.ics.uci.edu/ml/
Breast Cancer Wisconsin (Original) Data Set
#   Attribute                      Domain
1.  Sample code number             id number
2.  Clump Thickness                1 - 10
3.  Uniformity of Cell Size        1 - 10
4.  Uniformity of Cell Shape       1 - 10
5.  Marginal Adhesion              1 - 10
6.  Single Epithelial Cell Size    1 - 10
7.  Bare Nuclei                    1 - 10
8.  Bland Chromatin                1 - 10
9.  Normal Nucleoli                1 - 10
10. Mitoses                        1 - 10
11. Class                          2 for benign, 4 for malignant

Sample records (one per line):
1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1016277,6,8,8,1,3,4,3,7,1,2
1017023,4,1,1,3,2,1,3,1,1,2
1017122,8,10,10,8,7,10,9,7,1,4
1018099,1,1,1,1,2,10,3,1,1,2
1018561,2,1,2,1,2,1,3,1,1,2
1033078,2,1,1,1,2,1,1,1,5,2
1033078,4,2,1,1,2,1,2,1,1,2
1035283,1,1,1,1,1,1,3,1,1,2
1036172,2,1,1,1,2,1,2,1,1,2
1041801,5,3,3,3,2,3,4,4,1,4
1043999,1,1,1,1,2,3,3,1,1,2
1044572,8,7,5,10,7,9,5,5,4,4
1047630,7,4,6,4,6,1,4,3,1,4
1048672,4,1,1,1,2,1,2,1,1,2
1049815,4,1,1,1,2,1,3,1,1,2
1050670,10,7,7,6,4,10,4,1,2,4
1050718,6,1,1,1,2,1,3,1,1,2

Class distribution: Benign: 458 (65.5%), Malignant: 241 (34.5%)
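A sketch of loading this data set with pandas, assuming it has been downloaded from the UCI repository above and saved locally as "breast-cancer-wisconsin.data" (the file name and column names are assumptions):

import pandas as pd

columns = ["id", "clump_thickness", "cell_size_uniformity", "cell_shape_uniformity",
           "marginal_adhesion", "single_epithelial_cell_size", "bare_nuclei",
           "bland_chromatin", "normal_nucleoli", "mitoses", "class"]

# Missing values appear as "?" in this data set.
df = pd.read_csv("breast-cancer-wisconsin.data", names=columns, na_values="?")

X = df.drop(columns=["id", "class"])
y = df["class"]              # 2 = benign, 4 = malignant
print(y.value_counts())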

Choosing a Function Algorithm


Applying to Neural Network
Input Data?
Number of layers?
Output Data?
Design
Training Data?
Testing Data?

Steps
Training: Training Images -> Image Features -> Training (with Training Labels) -> Learned Model
Prediction: Test Image -> Image Features -> Learned Model -> Prediction

Slide credit: D. Hoiem and L.
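A highly simplified sketch of this training/prediction flow in scikit-learn style (feature extraction is reduced to placeholder random arrays; the classifier choice is arbitrary):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder "image features": each row is a flattened feature vector.
train_features = np.random.rand(50, 64)       # features of training images
train_labels = np.random.randint(0, 2, 50)    # training labels
test_features = np.random.rand(5, 64)         # features of test images

model = LogisticRegression(max_iter=1000).fit(train_features, train_labels)  # training -> learned model
predictions = model.predict(test_features)                                   # learned model -> prediction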

Cross validation
Solution: k-fold cross validation maximizes the use of
the data.
Divide data randomly into k folds (subsets) of equal
size.
Train the model on k-1 folds, use one fold for testing.
Repeat this process k times so that all folds are used
for testing.
Compute the average performance on the k test sets.
This effectively uses all the data for both training and
testing.
Typically k = 10 is used.
Sometimes stratified k-fold cross validation is used.
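A minimal scikit-learn sketch of k-fold cross-validation as described above (the classifier and the synthetic data are arbitrary choices):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

scores = []
kf = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])  # train on k-1 folds
    scores.append(model.score(X[test_idx], y[test_idx]))              # test on the held-out fold

print(sum(scores) / len(scores))  # average performance over the k folds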

Cross validation
Identify n folds of the available data.
Train on n-1 folds
Test on the remaining fold.
In the extreme (n=N) this is known as

leave-one-out cross validation


In k-fold cross-validation, the original sample is randomly partitioned
into k subsamples. Of the k subsamples, a single subsample is
retained as the validation data for testing the model, and the
remaining k-1 subsamples are used as training data. The cross-
validation process is then repeated k times (the folds), with each of
the k subsamples used exactly once as the validation data. The k
results from the folds can then be averaged (or otherwise combined)
to produce a single estimation. The advantage of this method over
repeated random sub-sampling is that all observations are used for
both training and validation, and each observation is used for
validation exactly once. 10-fold cross-validation is commonly used,
but in general k remains an unfixed parameter.
In stratified k-fold cross-validation, the folds are selected so that the
mean response value is approximately equal in all the folds. In the
case of a dichotomous classification, this means that each fold
contains roughly the same proportions of the two types of class
labels.


2-fold cross-validation
This is the simplest variation of k-fold cross-validation.
For each fold, we randomly assign data points to two
sets d0 and d1, so that both sets are of equal size (this is
usually implemented by shuffling the data array and
then splitting it in two). We then train on d0 and test on
d1, followed by training on d1 and testing on d0.
This has the advantage that our training and test sets
are both large, and each data point is used for both
training and validation on each fold.


Leave-one-out cross-validation
Leave-one-out cross-validation is simply k-fold
cross-validation with k set to n, the number of
instances in the data set.
This means that the test set only consists of a single
instance, which will be classified either correctly or
incorrectly.
Advantages: maximal use of training data, i.e.,
training on n-1 instances. The procedure is
deterministic, no sampling involved.
Disadvantages: infeasible for large data sets: a large
number of training runs is required, at high
computational cost. Cannot be stratified (only one
class in the test set).
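The same idea expressed with scikit-learn's LeaveOneOut splitter (the tiny synthetic data set and the 1-nearest-neighbor classifier are arbitrary choices, kept small because every instance triggers a training run):

import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

correct = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    model = KNeighborsClassifier(n_neighbors=1).fit(X[train_idx], y[train_idx])
    correct += int(model.predict(X[test_idx])[0] == y[test_idx][0])

print(correct / len(X))  # fraction of instances classified correctly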


Cross-validation visualized
Available Labeled Data
Identify n partitions

Fold 1: Train | Train | Train | Train | Dev | Test

Cross-validation visualized
Available Labeled Data
Identify n partitions

Fold 2: Test | Train | Train | Train | Train | Dev

Cross-validation visualized
Available Labeled Data
Identify n partitions

Fold 3: Dev | Test | Train | Train | Train | Train

Cross-validation visualized
Available Labeled Data
Identify n partitions

Fold 4: Train | Dev | Test | Train | Train | Train

Cross-validation visualized
Available Labeled Data
Identify n partitions

Fold 5: Train | Train | Dev | Test | Train | Train

Cross-validation visualized
Available Labeled Data
Identify n partitions

Fold 6: Train | Train | Train | Dev | Test | Train

Calculate Average Performance
