
Data Acquisition

• Machine learning requires data.

• The better the selection of data, the better the performance
of our machine learning algorithm.

• Good data selection saves costs as well.


Selection of data from the pool is performed in epochs, whereby at each
phase a batch of one or more queries is performed simultaneously.

The combinatorial problem of selecting the most useful such
batches (and the overall data set) from such a pool of candidates makes
direct optimization an NP-hard problem.

A first-order Markov relaxation is performed: the most promising
data are selected greedily, one at a time, from the pool.

While not guaranteeing a globally optimal result set (regardless of
what selection criterion is being used), such sequential data access
often works well in practice while making the selection problem
tractable.
NP-hard problems are problems for which no polynomial-time
algorithm is known, so that the time to find a solution
grows exponentially with problem size. Although it has not
been definitively proven that there is no polynomial
algorithm for solving NP-hard problems, many eminent
mathematicians have tried and failed to find one.

A game you cannot win.
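To make the greedy one-at-a-time relaxation described above concrete, here is a minimal Python sketch of selecting from a candidate pool. The pool, the scoring function `utility`, and the budget are illustrative placeholders, not part of the original text.

```python
# Minimal sketch of greedy one-at-a-time selection from a candidate pool.
# `utility` is a placeholder scoring function; in practice it would be an
# active-learning criterion such as expected model improvement.

def greedy_select(pool, utility, budget):
    selected = []
    remaining = list(pool)
    for _ in range(budget):
        # Score every remaining candidate given what has been selected so far.
        best = max(remaining, key=lambda cand: utility(cand, selected))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage with a utility that simply prefers larger values:
chosen = greedy_select(pool=[3, 1, 4, 1, 5], utility=lambda c, s: c, budget=2)
print(chosen)  # [5, 4]
```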


Feature construction

• Generally, features are the input data or attributes.

• Feature construction is sometimes also referred to as
preprocessing.


Processing / Feature Construction Steps

Let x be a pattern vector of dimension n, x = [x_1, x_2, ..., x_n].

The components x_i of this vector are the original features.

We call x' a vector of transformed features of dimension n'.


Preprocessing transformation steps

Standardization

Features can have different scales although they refer to
comparable objects.

Consider, for instance, a pattern x = [x_1, x_2] where x_1 is a width
measured in meters and x_2 is a height measured in centimeters.
Both can be compared, added, or subtracted, but it would be
unreasonable to do so before appropriate normalization.

Classical centering and scaling of the data is often used:
x'_i = (x_i − µ_i)/σ_i, where µ_i and σ_i are the mean and the standard
deviation of feature x_i over the training examples.
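A minimal numpy sketch of this centering and scaling, applied column-wise to a small training matrix (the data values are made up for illustration):

```python
import numpy as np

# Rows are training examples, columns are features x_i.
X = np.array([[1.80, 72.0],
              [1.65, 60.0],
              [1.75, 68.0]])

mu = X.mean(axis=0)       # mean of each feature over training examples
sigma = X.std(axis=0)     # standard deviation of each feature
X_std = (X - mu) / sigma  # x'_i = (x_i - mu_i) / sigma_i
print(X_std)
```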
Normalization

Consider, for example, the case where x is an image and the
x_i's are the numbers of pixels with color i. It makes sense to
normalize x by dividing it by the total number of counts in
order to encode the distribution and remove the
dependence on the size of the image. This translates into
the formula: x' = x/||x||.
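A small sketch of this normalization; the L1 norm is used here so the result sums to one and encodes a distribution over colors, though ||x|| could equally denote the L2 norm:

```python
import numpy as np

# x holds pixel counts per color; dividing by the total count
# encodes the color distribution independently of image size.
x = np.array([120.0, 300.0, 80.0])
x_norm = x / np.linalg.norm(x, ord=1)   # x' = x / ||x||  (L1 norm here)
print(x_norm, x_norm.sum())             # normalized counts, sum is 1.0
```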
Signal enhancement

The signal-to-noise ratio may be improved by applying
signal- or image-processing filters. These operations include
baseline or background removal, de-noising, smoothing, or
sharpening. The Fourier transform and wavelet transforms
are popular methods.
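As one deliberately simple example of such filtering, a moving-average smoothing of a noisy 1-D signal with numpy; Fourier- or wavelet-based de-noising follows the same spirit (transform, attenuate noise, invert):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 200)
signal = np.sin(2 * np.pi * 5 * t)                    # clean underlying signal
noisy = signal + 0.3 * rng.standard_normal(t.size)    # signal plus noise

# Simple smoothing filter: moving average with a 5-sample window.
window = np.ones(5) / 5
smoothed = np.convolve(noisy, window, mode="same")    # de-noised estimate
```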
Extraction of local features

For sequential, spatial, or other structured data, specific
techniques like convolutional methods using hand-crafted
kernels or syntactic and structural methods are used. These
techniques encode problem-specific knowledge into the
features. They are beyond the scope of this book, but it is
worth mentioning that they can bring significant
improvement.
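As a toy illustration (not from the original text), a hand-crafted edge-detection kernel convolved over a small synthetic image using scipy:

```python
import numpy as np
from scipy.signal import convolve2d

image = np.zeros((8, 8))
image[:, 4:] = 1.0                      # toy image: dark left half, bright right half

# Hand-crafted kernel: horizontal gradient, a simple vertical-edge detector.
kernel = np.array([[-1.0, 0.0, 1.0]])

edges = convolve2d(image, kernel, mode="same")
print(edges)                            # strong response near the dark/bright boundary
```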
Linear and non-linear space embedding methods

When the dimensionality of the data is very high, some
techniques may be used to project or embed the data into
a lower-dimensional space while retaining as much
information as possible. Classical examples are Principal
Component Analysis (PCA) and Multidimensional Scaling
(MDS).

The coordinates of the data points in the lower-dimensional
space may be used as features or simply as a means of
data visualization.
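A minimal PCA sketch via the SVD of the centered data matrix, keeping the top two components as the embedded coordinates (array shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))    # 100 patterns, 10 original features

Xc = X - X.mean(axis=0)               # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                                 # target (lower) dimensionality
X_embedded = Xc @ Vt[:k].T            # coordinates in the k-dimensional subspace
print(X_embedded.shape)               # (100, 2)
```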
Non-linear expansions

Although dimensionality reduction is often invoked when
speaking about complex data, it is sometimes better to
increase the dimensionality. This happens when the problem
is very complex and first-order interactions are not enough
to derive good results. This consists, for instance, in
computing products of the original features x_i to create
monomials x_{k1} x_{k2} ... x_{kp}.
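A small sketch of such an expansion, appending all degree-2 monomials to the original feature vector (the helper name and toy values are illustrative):

```python
import numpy as np
from itertools import combinations_with_replacement

def degree2_expansion(x):
    """Append all degree-2 monomials x_k1 * x_k2 to the original features."""
    products = [x[i] * x[j]
                for i, j in combinations_with_replacement(range(len(x)), 2)]
    return np.concatenate([x, products])

x = np.array([2.0, 3.0, 5.0])
print(degree2_expansion(x))
# [ 2.  3.  5.  4.  6. 10.  9. 15. 25.]
```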
Feature discretization

Some algorithms do not handle continuous data well. It then
makes sense to discretize continuous values into a
finite discrete set. This step not only facilitates the use of
certain algorithms, it may also simplify the data description and
improve data understanding (Liu and Motoda, 1998).
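A minimal sketch of discretization by binning a continuous feature with numpy (the bin edges are arbitrary, illustrative choices):

```python
import numpy as np

ages = np.array([3, 17, 25, 42, 68, 90])     # a continuous-valued feature

# Discretize into a finite set of codes: three boundaries give four bins.
bin_edges = np.array([18, 40, 65])
codes = np.digitize(ages, bin_edges)         # bin index for each value
print(codes)                                 # [0 0 1 2 3 3]
```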
• In particular, one should be careful not to lose information
at the feature construction stage.

• One should try to acquire as much raw data as possible. The
disadvantage is that this increases the dimensionality of the
patterns and thereby immerses the relevant information
in a sea of possibly irrelevant, noisy, or redundant
features.
Feature selection
• It is the process of selecting the most informative
features.

• But it is also useful for:

• general data reduction

• feature set reduction

• performance improvement

• data understanding
• Feature selection can be performed in the following ways:

• Individual relevance ranking

• Relevant features that are individually irrelevant

• Redundant features
Individual relevance ranking

The figure (next slide) shows a situation in which one feature (x1)
is relevant individually and the other (x2) does not help
provide a better class separation.

For such situations individual feature ranking works well: the
feature that provides a good class separation by itself will rank
high and will therefore be chosen.

Rotations in feature space often simplify feature selection;
see the figure.

A feature that is irrelevant for classification may nevertheless be
relevant for predicting the class-conditional probabilities.
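A minimal sketch of individual relevance ranking, scoring each feature on its own by its absolute correlation with the class label; the correlation criterion is just one of many possible ranking criteria, and the synthetic data is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                 # binary class labels
x1 = y + 0.3 * rng.standard_normal(200)          # individually relevant feature
x2 = rng.standard_normal(200)                    # individually irrelevant feature
X = np.column_stack([x1, x2])

# Rank features by the absolute Pearson correlation with the target.
scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
ranking = np.argsort(scores)[::-1]
print(scores, ranking)                           # x1 ranks first
```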
Relevant features that are
individually irrelevant

• A helpful feature may be irrelevant by itself.

• Example: feature x1 might represent a measurement in an
image that is randomly offset by a local background
change; feature x2 might measure that local offset,
which by itself is not informative. Hence, feature x2 might
be completely uncorrelated with the target and yet improve
the separability of feature x1 if subtracted from it.


• Two individually irrelevant features may become relevant
when used in combination.

• One such feature selection algorithm is the Relief algorithm.

• The Relief algorithm uses an approach based on the K-
nearest-neighbor algorithm.

• In projection on feature j, the sum of the distances
between the examples and their nearest misses is
compared to the sum of the distances to their nearest hits.
The Relief method works for multi-class problems.
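A simplified numpy sketch of this hit/miss comparison for a two-class problem; real Relief implementations differ in details such as example sampling and feature-range normalization, so this is only an illustration of the idea:

```python
import numpy as np

def relief_scores(X, y):
    """Toy Relief: for each example find its nearest hit (same class) and
    nearest miss (other class), then compare per-feature distances.
    Assumes every class has at least two examples."""
    n, d = X.shape
    scores = np.zeros(d)
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                                # exclude the example itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dists, np.inf))   # nearest hit
        miss = np.argmin(np.where(~same, dists, np.inf)) # nearest miss
        scores += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return scores / n                                    # higher score = more relevant

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 100)
X = np.column_stack([y + 0.2 * rng.standard_normal(100),  # informative feature
                     rng.standard_normal(100)])           # noise feature
print(relief_scores(X, y))                                # first feature scores higher
```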
Redundant features
• Noise reduction: when two features provide identical
projected distributions, combining them can reduce noise.

• Yet such features are not completely redundant: the two-
dimensional distribution shows a better class separation
than the one achievable with either feature alone (see (d) of the
previous figure).

• Correlation does NOT imply redundancy; see (e) and (f).

• In (f) the features are redundant, while in (e), although the
projected distributions are the same, the features are not
redundant.