
Numerosity Reduction

Reduce data volume by choosing alternative, smaller forms of data representation.

Parametric methods
o Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
o Example: Log-linear models obtain the value at a point in m-D space as the product on appropriate marginal subspaces

Non-parametric methods
o Do not assume models
o Major families: histograms, clustering, sampling

Data Reduction Method (1): Regression and Log-Linear Models

Linear regression: Data are modeled to fit a straight line
o Often uses the least-squares method to fit the line

Multiple regression: Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector

Log-linear model: Approximates discrete multidimensional probability distributions

Linear regression: Y = w X + b
o Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand
o Using the least-squares criterion on the known values of Y1, Y2, ..., and X1, X2, ...

Multiple regression: Y = b0 + b1 X1 + b2 X2
o Many nonlinear functions can be transformed into the above

Log-linear models:
o The multi-way table of joint probabilities is approximated by a product of lower-order tables

o Probability: p(a, b, c, d) ≈ αab βac χad δbcd
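As a rough illustration of the parametric idea, the sketch below fits Y = w X + b by least squares and keeps only the two coefficients, discarding the raw data. It assumes NumPy is available; the x and y values are invented for illustration.

    import numpy as np

    # Invented illustration data: y is roughly 2x + 1 with some noise
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([3.1, 4.9, 7.2, 9.0, 10.8, 13.1])

    # Least-squares estimate of Y = w X + b (degree-1 polynomial fit)
    w, b = np.polyfit(x, y, deg=1)
    print(f"w = {w:.3f}, b = {b:.3f}")

    # Numerosity reduction: keep only (w, b); any value can be approximated
    y_hat = w * x + b
    print("max reconstruction error:", np.abs(y - y_hat).max())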

Data Reduction Method (2): Histograms

Histograms use binning to approximate data distributions and are a popular form of data reduction.

Divide data into buckets and store the average (or sum) for each bucket

Partitioning rules:

o Equal-width: equal bucket range
o Equal-frequency (or equal-depth): each bucket holds roughly the same number of contiguous values
o V-optimal: the histogram with the least variance (a weighted sum of the original values that each bucket represents)
o MaxDiff: set a bucket boundary between each pair of adjacent values among the pairs that have the β-1 largest differences

Histograms. The following data are a list of prices of commonly sold items at AllElectronics (rounded to the nearest dollar). The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
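The sketch below applies equal-width and equal-frequency bucketing to the price list above, assuming NumPy is available; the choice of three buckets is only for illustration.

    import numpy as np

    # The 52 sorted AllElectronics prices from the example above
    prices = np.array([1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
                       15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
                       20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
                       25, 25, 25, 25, 25, 28, 28, 30, 30, 30])

    # Equal-width: 3 buckets spanning equal value ranges
    counts, edges = np.histogram(prices, bins=3)
    print("equal-width edges:", edges, "counts:", counts)

    # Equal-frequency (equal-depth): each bucket holds roughly 52/3 values;
    # store only the bucket range and a summary (here the mean) per bucket
    for chunk in np.array_split(prices, 3):
        print(f"depth={len(chunk)}  range=[{chunk[0]}, {chunk[-1]}]  mean={chunk.mean():.1f}")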

Data Reduction Method (3): Clustering

Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter)

Can be very effective if data is clustered but not if data is smeared

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms
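As a rough sketch of clustering-based reduction, the code below uses a small hand-rolled k-means loop (assuming NumPy; the 2-D points and k = 3 are invented for illustration) and replaces the full data set with a centroid and diameter per cluster.

    import numpy as np

    rng = np.random.default_rng(0)
    # Invented 2-D data with three loose groups
    data = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
                      for c in ([0, 0], [5, 5], [0, 5])])

    def kmeans(points, k, iters=20):
        """Tiny k-means loop: returns cluster centroids and point labels."""
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(iters):
            # Assign each point to its nearest centroid
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Recompute each centroid as the mean of its members (keep it if empty)
            centroids = np.array([points[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        return centroids, labels

    centroids, labels = kmeans(data, k=3)

    # Store only the cluster representation: centroid and diameter per cluster
    for j, c in enumerate(centroids):
        members = data[labels == j]
        diameter = np.linalg.norm(members[:, None, :] - members[None, :, :], axis=2).max()
        print(f"cluster {j}: centroid={np.round(c, 2)}, diameter={diameter:.2f}, size={len(members)}")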

Data Reduction Method (4): Sampling

Sampling: obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data
o Simple random sampling may have very poor performance in the presence of skew

Develop adaptive sampling methods
o Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database
o Used in conjunction with skewed data

Note: Sampling may not reduce database I/Os (page at a time)

Types of samples:

Simple random sample without replacement (SRSWOR) of size s: This is created by drawing s of the N tuples from D (s < N), where the probability of drawing any tuple in D is 1/N, that is, all tuples are equally likely to be sampled.

Simple random sample with replacement (SRSWR) of size s: This is similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again.
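A minimal sketch of SRSWOR and SRSWR using Python's standard random module; D here is just a list of tuple IDs invented for illustration (N = 100, s = 10).

    import random

    random.seed(42)
    D = list(range(1, 101))   # stand-in for the N = 100 tuples of D
    s = 10

    # SRSWOR: draw s distinct tuples; every tuple is equally likely (1/N)
    srswor = random.sample(D, k=s)

    # SRSWR: each drawn tuple is "placed back", so it may be drawn again
    srswr = random.choices(D, k=s)

    print("SRSWOR:", sorted(srswor))
    print("SRSWR :", sorted(srswr))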

Cluster sample: If the tuples in D are grouped into M mutually disjoint clusters, then an SRS of s clusters can be obtained, where s < M. For example, tuples in a database are usually retrieved a page at a time, so that each page can be considered a cluster. A reduced data representation can be obtained by applying, say, SRSWOR to the pages, resulting in a cluster sample of the tuples. Other clustering criteria conveying rich semantics can also be explored. For example, in a spatial database, we may choose to define clusters geographically, based on how closely different areas are located.
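A small sketch of a cluster sample under the page-as-cluster view described above, again with invented tuple IDs and a page size of 20 (M = 5 pages, s = 2).

    import random

    random.seed(7)
    D = list(range(1, 101))                 # stand-in tuple IDs
    page_size = 20                          # each "page" is one cluster
    pages = [D[i:i + page_size] for i in range(0, len(D), page_size)]

    # SRSWOR over the pages: pick s = 2 whole pages and keep every tuple on them
    sampled_pages = random.sample(pages, k=2)
    cluster_sample = [t for page in sampled_pages for t in page]
    print("pages chosen:", len(sampled_pages), "tuples kept:", len(cluster_sample))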

Stratified sample: If D is divided into mutually disjoint parts called strata, a stratified sample of D is generated by obtaining an SRS at each stratum. This helps ensure a representative sample, especially when the data are skewed. For example, a stratified sample may be obtained from customer data, where a stratum is created for each customer age group. In this way, the age group having the smallest number of customers will be sure to be represented.
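A short sketch of a stratified sample along the lines of the age-group example, with invented customer data; each stratum contributes a proportional SRS but never fewer than one tuple, so the smallest group is still represented.

    import random
    from collections import defaultdict

    random.seed(1)
    # Invented (customer_id, age_group) tuples; the last group is deliberately small
    customers = ([(i, "20-29") for i in range(60)]
                 + [(i, "30-39") for i in range(60, 95)]
                 + [(i, "40-49") for i in range(95, 100)])

    # Group the tuples into mutually disjoint strata by age group
    strata = defaultdict(list)
    for cust in customers:
        strata[cust[1]].append(cust)

    # Obtain an SRS at each stratum: roughly 10% of it, but at least one tuple
    sample = []
    for group, members in strata.items():
        k = max(1, round(0.10 * len(members)))
        sample.extend(random.sample(members, k))

    print("stratum sizes:", {g: len(m) for g, m in strata.items()})
    print("stratified sample size:", len(sample))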
