
# Data Acquisition

• The better the selection of data, the better the performance of our machine learning algorithm.

• Good data selection will save costs as well.

Selection from the data pool is performed in epochs, whereby at each phase a batch of one or more queries is performed simultaneously.

The combinatorial problem of selecting the most useful such batches (and overall data set) from the pool of candidates makes direct optimization NP-hard.

A first-order Markov relaxation is performed: it selects the most promising data greedily, one at a time, from the pool.

While not guaranteeing a globally optimal result set (regardless of the selection criterion used), such sequential data access often works well in practice while making the selection problem tractable.

NP-hard problems are problems for which no polynomial-time algorithm is known, so the time to find a solution grows exponentially with problem size. Although it has not been definitively proven that no polynomial algorithm exists for solving NP-hard problems, many eminent mathematicians have tried and failed to find one.

## A game you cannot win.

## Feature construction

• Feature construction is sometimes also called preprocessing.

## Preprocessing / feature construction steps

We call x’ a vector of transformed features of dimension n’.

The main preprocessing transformation steps are described below.

## Standardization

Features can have different scales although they refer to comparable objects. Consider, for instance, a pattern x = [x1, x2] where x1 is a width measured in meters and x2 is a height measured in centimeters. Both can be compared, added, or subtracted, but it would be unreasonable to do so before appropriate normalization.

Classical centering and scaling of the data is often used: x’_i = (x_i − µ_i)/σ_i, where µ_i and σ_i are the mean and the standard deviation of feature x_i over the training examples.
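As a sketch, the centering-and-scaling formula above takes only a few lines of NumPy (the `standardize` helper and the sample data are ours, purely for illustration):

```python
import numpy as np

def standardize(X):
    """Center each feature to zero mean and scale it to unit variance:
    x'_i = (x_i - mu_i) / sigma_i, computed per column over the samples."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # avoid division by zero for constant features
    return (X - mu) / sigma

# width in meters, height in centimeters: very different scales
X = np.array([[1.2, 170.0],
              [0.8, 150.0],
              [1.0, 160.0]])
Xp = standardize(X)  # each column now has mean 0 and std 1
```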
## Normalization

Consider, for example, the case where x is an image and the x_i’s are the numbers of pixels with color i. It makes sense to normalize x by dividing it by the total number of counts, in order to encode the distribution and remove the dependence on the size of the image. This translates into the formula x’ = x/||x||.
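A minimal sketch of this unit-norm scaling, assuming the L1 norm since dividing counts by their total yields a distribution (the `normalize` helper is illustrative):

```python
import numpy as np

def normalize(x, ord=1):
    """Scale a pattern to unit norm: x' = x / ||x||.

    For pixel-count histograms the L1 norm (ord=1) turns raw counts
    into a distribution, removing the dependence on image size."""
    n = np.linalg.norm(x, ord=ord)
    return x / n if n > 0 else x

counts = np.array([30.0, 50.0, 20.0])  # pixels of each color
dist = normalize(counts)               # -> [0.3, 0.5, 0.2]
```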
## Signal enhancement

The signal-to-noise ratio may be improved by applying signal- or image-processing filters. These operations include baseline or background removal, de-noising, smoothing, and sharpening. The Fourier and wavelet transforms are popular methods.
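For illustration, one of the simplest smoothing filters, a moving average, can be written as follows (the `smooth` helper and the synthetic test signal are ours; it is a crude stand-in for the Fourier and wavelet methods mentioned above):

```python
import numpy as np

def smooth(signal, window=5):
    """De-noise a 1-D signal by convolving it with a uniform
    moving-average kernel of the given window length."""
    kernel = np.ones(window) / window
    return np.convolve(signal, kernel, mode="same")

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 200)
true = np.sin(2 * np.pi * t)                      # underlying clean signal
noisy = true + 0.3 * rng.standard_normal(200)     # observed noisy signal
clean = smooth(noisy)
# the smoothed signal sits closer to the true sine wave than the noisy one
```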
## Extraction of local features

For sequential, spatial, or other structured data, specific techniques such as convolutional methods using hand-crafted kernels, or syntactic and structural methods, are used. These techniques encode problem-specific knowledge into the features. They are beyond the scope of this book, but it is worth mentioning that they can bring significant improvement.
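To make the idea of a hand-crafted convolutional kernel concrete, here is a minimal sketch (the `conv2d_valid` helper and the Sobel-style edge kernel are just one illustrative choice; as in CNNs, this is correlation-style convolution without kernel flipping):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """2-D 'valid' convolution: each output pixel is a local feature
    computed from a small neighbourhood of the input image."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# hand-crafted vertical-edge kernel, applied to a step image (0s then 1s)
sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
img = np.zeros((5, 6))
img[:, 3:] = 1.0
edges = conv2d_valid(img, sobel_x)  # strongest response at the step
```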
## Linear and non-linear space embedding methods

When the dimensionality of the data is very high, some techniques may be used to project or embed the data into a lower-dimensional space while retaining as much information as possible. Classical examples are Principal Component Analysis (PCA) and Multidimensional Scaling (MDS).

The coordinates of the data points in the lower-dimensional space may be used as features or simply as a means of data visualization.
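A minimal PCA sketch via the SVD of the centered data (the `pca` helper and the random data are illustrative, not a reference implementation):

```python
import numpy as np

def pca(X, k):
    """Project samples onto their top-k principal components.

    Returns the coordinates of the centered samples in the
    k-dimensional subspace capturing the most variance."""
    Xc = X - X.mean(axis=0)
    # right singular vectors of the centered data are the principal axes
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(1)
# correlated 5-D data, embedded down to 2-D (e.g. for visualization)
X = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 5))
Z = pca(X, 2)
```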
## Non-linear expansions

Although dimensionality reduction is often summoned when speaking about complex data, it is sometimes better to increase the dimensionality. This happens when the problem is very complex and first-order interactions are not enough to derive good results. This consists, for instance, in computing products of the original features x_i to create monomials x_{k1} x_{k2} … x_{kp}.
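The monomial expansion can be sketched as follows (the `monomial_features` helper is ours; it enumerates all products of features up to a chosen degree):

```python
import numpy as np
from itertools import combinations_with_replacement

def monomial_features(x, degree=2):
    """Expand a pattern x with all monomials x_{k1}*x_{k2}*...*x_{kp}
    up to the given degree (the original features appear as degree 1)."""
    feats = []
    for p in range(1, degree + 1):
        for idx in combinations_with_replacement(range(len(x)), p):
            feats.append(np.prod([x[i] for i in idx]))
    return np.array(feats)

x = np.array([2.0, 3.0])
expanded = monomial_features(x)  # x1, x2, x1^2, x1*x2, x2^2
```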
## Feature discretization

Some algorithms do not handle continuous data well. It then makes sense to discretize continuous values into a finite discrete set. This step not only facilitates the use of certain algorithms, it may also simplify the data description and improve data understanding (Liu and Motoda, 1998).
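As an illustration, equal-width binning is one simple discretization scheme (the `discretize` helper is ours; other schemes, such as equal-frequency binning, are also common):

```python
import numpy as np

def discretize(values, n_bins=4):
    """Equal-width binning: map continuous values to integer bin
    indices 0..n_bins-1, yielding a finite discrete feature."""
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # digitize against the interior edges; clip keeps the maximum
    # value inside the last bin rather than overflowing it
    return np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1)

v = np.array([0.0, 0.2, 0.5, 0.9, 1.0])
bins = discretize(v)  # each continuous value replaced by its bin index
```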
• In particular, one should be careful not to lose information at the feature construction stage.

• One must try to acquire as much raw data as possible. Its disadvantage: it increases the dimensionality of the patterns and thereby immerses the relevant information in a sea of possibly irrelevant, noisy, or redundant features.
## Feature selection

• It is the process of selecting the correct or most informative features.

• Its benefits include feature set reduction, performance improvement, and data understanding.

• Feature selection must deal with two difficult cases:

• Relevant features that are individually irrelevant

• Redundant features
## Individual relevance ranking

The figure (next slide) shows a situation in which one feature (x1) is relevant individually while the other (x2) does not help provide a better class separation.

For such situations, individual feature ranking works well: the feature that provides good class separation by itself will rank high and will therefore be chosen.

Rotations in feature space often simplify feature selection; refer to the image.

A feature that is irrelevant for classification may be relevant for predicting the class conditional probabilities.
## Relevant features that are individually irrelevant

• Example: feature x1 might represent a measurement in an image that is randomly offset by a local background change; feature x2 might measure that local offset, which by itself is not informative. Hence, feature x2 might be completely uncorrelated with the target and yet improve the separability of feature x1 if subtracted from it.

• Two individually irrelevant features may become relevant when used in combination.

• The Relief algorithm uses an approach based on the K-nearest-neighbor algorithm.

• In projection on feature j, the sum of the distances between the examples and their nearest misses is compared to the sum of the distances to their nearest hits. The Relief method works for multi-class problems.
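A basic sketch of the Relief idea for two classes (the `relief_weights` helper is ours and omits refinements such as random sampling and the multi-class extension):

```python
import numpy as np

def relief_weights(X, y):
    """Basic Relief scores for a two-class problem.

    For each example we find its nearest hit (same class) and nearest
    miss (other class); a feature scores high when, in projection on
    that feature, examples lie far from their misses and close to
    their hits."""
    n, d = X.shape
    w = np.zeros(d)
    for i in range(n):
        dist = np.sum((X - X[i]) ** 2, axis=1)
        dist[i] = np.inf                              # exclude the example itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dist, np.inf))   # nearest same-class example
        miss = np.argmin(np.where(~same, dist, np.inf)) # nearest other-class example
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n

# x1 separates the two classes, x2 is pure noise: x1 should score higher
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (30, 2)),
               rng.normal([4.0, 0.0], 1.0, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
w = relief_weights(X, y)
```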
## Redundant features

• Noise reduction: when two features provide identical projected distributions, combining them can reduce noise.

• Yet they are not completely redundant: the two-dimensional distribution shows a better class separation than that achievable with either feature alone (refer to (d) of the previous figure).

• In (f) the features are redundant, while in (e), even though the projected distributions are the same, the features are not redundant.