
Numerosity Reduction

Reduce data volume by choosing alternative, smaller forms of data representation.

Parametric methods
o Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
o Example: Log-linear models obtain the value at a point in m-D space as the product on appropriate marginal subspaces

Non-parametric methods
o Do not assume models
o Major families: histograms, clustering, sampling

Data Reduction Method (1): Regression and Log-Linear Models

Linear regression: Data are modeled to fit a straight line
o Often uses the least-squares method to fit the line

Multiple regression: Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector

Log-linear model: Approximates discrete multidimensional probability distributions

Linear regression: Y = w X + b
o Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand
o Using the least-squares criterion on the known values of Y1, Y2, ..., and X1, X2, ...

Multiple regression: Y = b0 + b1 X1 + b2 X2
o Many nonlinear functions can be transformed into the above

Log-linear models:
o The multi-way table of joint probabilities is approximated by a product of lower-order tables

o Probability: p(a, b, c, d) ≈ αab βac χad δbcd
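As a rough illustration of the parametric idea, the sketch below fits Y = w X + b by least squares and keeps only the two coefficients, discarding the raw data. It assumes NumPy is available; the x and y values are invented for illustration.

    import numpy as np

    # Invented illustration data: y is roughly 2x + 1 with some noise
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([3.1, 4.9, 7.2, 9.0, 10.8, 13.1])

    # Least-squares estimate of Y = w X + b (degree-1 polynomial fit)
    w, b = np.polyfit(x, y, deg=1)
    print(f"w = {w:.3f}, b = {b:.3f}")

    # Numerosity reduction: keep only (w, b); any value can be approximated
    y_hat = w * x + b
    print("max reconstruction error:", np.abs(y - y_hat).max())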

Data Reduction Method (2): Histograms

Histograms use binning to approximate data distributions and are a popular form of data reduction.

Divide data into buckets and store the average (or sum) for each bucket

Partitioning rules:

o Equal-width: equal bucket range
o Equal-frequency (or equal-depth): each bucket holds roughly the same number of contiguous values
o V-optimal: the histogram with the least variance (a weighted sum of the original values that each bucket represents)
o MaxDiff: set a bucket boundary between each pair of adjacent values among the pairs that have the β-1 largest differences

Histograms. The following data are a list of prices of commonly sold items at AllElectronics (rounded to the nearest dollar). The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
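The sketch below applies equal-width and equal-frequency bucketing to the price list above, assuming NumPy is available; the choice of three buckets is only for illustration.

    import numpy as np

    # The 52 sorted AllElectronics prices from the example above
    prices = np.array([1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
                       15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
                       20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
                       25, 25, 25, 25, 25, 28, 28, 30, 30, 30])

    # Equal-width: 3 buckets spanning equal value ranges
    counts, edges = np.histogram(prices, bins=3)
    print("equal-width edges:", edges, "counts:", counts)

    # Equal-frequency (equal-depth): each bucket holds roughly 52/3 values;
    # store only the bucket range and a summary (here the mean) per bucket
    for chunk in np.array_split(prices, 3):
        print(f"depth={len(chunk)}  range=[{chunk[0]}, {chunk[-1]}]  mean={chunk.mean():.1f}")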

Data Reduction Method (3): Clustering

Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter)

Can be very effective if data is clustered but not if data is smeared

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms
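As a rough sketch of clustering-based reduction, the code below uses a small hand-rolled k-means loop (assuming NumPy; the 2-D points and k = 3 are invented for illustration) and replaces the full data set with a centroid and diameter per cluster.

    import numpy as np

    rng = np.random.default_rng(0)
    # Invented 2-D data with three loose groups
    data = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
                      for c in ([0, 0], [5, 5], [0, 5])])

    def kmeans(points, k, iters=20):
        """Tiny k-means loop: returns cluster centroids and point labels."""
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(iters):
            # Assign each point to its nearest centroid
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Recompute each centroid as the mean of its members (keep it if empty)
            centroids = np.array([points[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        return centroids, labels

    centroids, labels = kmeans(data, k=3)

    # Store only the cluster representation: centroid and diameter per cluster
    for j, c in enumerate(centroids):
        members = data[labels == j]
        diameter = np.linalg.norm(members[:, None, :] - members[None, :, :], axis=2).max()
        print(f"cluster {j}: centroid={np.round(c, 2)}, diameter={diameter:.2f}, size={len(members)}")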

Data Reduction Method (4): Sampling

Sampling: obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data
o Simple random sampling may have very poor performance in the presence of skew

Develop adaptive sampling methods
o Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database
o Used in conjunction with skewed data

Note: Sampling may not reduce database I/Os (page at a time)

Types of samples:

Simple random sample without replacement (SRSWOR) of size s: This is created by drawing s of the N tuples from D (s < N), where the probability of drawing any tuple in D is 1/N, that is, all tuples are equally likely to be sampled.

Simple random sample with replacement (SRSWR) of size s: This is similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again.
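A minimal sketch of SRSWOR and SRSWR using Python's standard random module; D here is just a list of tuple IDs invented for illustration (N = 100, s = 10).

    import random

    random.seed(42)
    D = list(range(1, 101))   # stand-in for the N = 100 tuples of D
    s = 10

    # SRSWOR: draw s distinct tuples; every tuple is equally likely (1/N)
    srswor = random.sample(D, k=s)

    # SRSWR: each drawn tuple is "placed back", so it may be drawn again
    srswr = random.choices(D, k=s)

    print("SRSWOR:", sorted(srswor))
    print("SRSWR :", sorted(srswr))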

Cluster sample: If the tuples in D are grouped into M mutually disjoint clusters, then an SRS of s clusters can be obtained, where s < M. For example, tuples in a database are usually retrieved a page at a time, so that each page can be considered a cluster. A reduced data representation can be obtained by applying, say, SRSWOR to the pages, resulting in a cluster sample of the tuples. Other clustering criteria conveying rich semantics can also be explored. For example, in a spatial database, we may choose to define clusters geographically, based on how closely different areas are located.
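A small sketch of a cluster sample under the page-as-cluster view described above, again with invented tuple IDs and a page size of 20 (M = 5 pages, s = 2).

    import random

    random.seed(7)
    D = list(range(1, 101))                 # stand-in tuple IDs
    page_size = 20                          # each "page" is one cluster
    pages = [D[i:i + page_size] for i in range(0, len(D), page_size)]

    # SRSWOR over the pages: pick s = 2 whole pages and keep every tuple on them
    sampled_pages = random.sample(pages, k=2)
    cluster_sample = [t for page in sampled_pages for t in page]
    print("pages chosen:", len(sampled_pages), "tuples kept:", len(cluster_sample))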

Stratified sample: If D is divided into mutually disjoint parts called strata, a stratified sample of D is generated by obtaining an SRS at each stratum. This helps ensure a representative sample, especially when the data are skewed. For example, a stratified sample may be obtained from customer data, where a stratum is created for each customer age group. In this way, the age group having the smallest number of customers will be sure to be represented.
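A short sketch of a stratified sample along the lines of the age-group example, with invented customer data; each stratum contributes a proportional SRS but never fewer than one tuple, so the smallest group is still represented.

    import random
    from collections import defaultdict

    random.seed(1)
    # Invented (customer_id, age_group) tuples; the last group is deliberately small
    customers = ([(i, "20-29") for i in range(60)]
                 + [(i, "30-39") for i in range(60, 95)]
                 + [(i, "40-49") for i in range(95, 100)])

    # Group the tuples into mutually disjoint strata by age group
    strata = defaultdict(list)
    for cust in customers:
        strata[cust[1]].append(cust)

    # Obtain an SRS at each stratum: roughly 10% of it, but at least one tuple
    sample = []
    for group, members in strata.items():
        k = max(1, round(0.10 * len(members)))
        sample.extend(random.sample(members, k))

    print("stratum sizes:", {g: len(m) for g, m in strata.items()})
    print("stratified sample size:", len(sample))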
