Beruflich Dokumente
Kultur Dokumente
Reduce data volume by choosing alternative, smaller forms of data representation Parametric methods o Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers) o Example: Log-linear modelsobtain value at a point in m-D space as the product on appropriate marginal subspaces
Non-parametric methods o o Do not assume models Major families: histograms, clustering, sampling
Linear regression: Data are modeled to fit a straight line o Multiple modeled vector Often uses the least-square method to fit the line regression: as a linear allows a response of variable Y to be
function
multidimensional
feature
Log-linear
model:
approximates
discrete
multidimensional
probability distributions Linear regression: Y = w X + b o o Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand Using the least squares criterion to the known values of Y1, Y2, , X1, X2, . Multiple regression: Y = b0 + b1 X1 + b2 X2. o Many nonlinear functions can be transformed into the above Log-linear models: o The multi-way table of joint probabilities is approximated by a product of lower-order tables
o Probability:
p(a, b, c, d) = ab acad cd b
Data Reduction Method (2): Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction.
Divide data into buckets and store average (sum) for each bucket Partitioning rules:
Histograms.
The
following
data
are
list
of
prices
of
commonly sold items at AllElectronics (rounded to the nearest dollar). The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
data
set
into
clusters
based
on
similarity, and
and
cluster
representation
(e.g.,
centroid
diameter)
Can
have
hierarchical
clustering
and
be
stored
in
multi-
dimensional index tree structures There are many choices of clustering definitions and
Sampling: obtaining a small sample s to represent the whole data set N Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data Choose a representative subset of the data o Simple random sampling may have very poor performance in the presence of skew
Develop adaptive sampling methods o Stratified sampling: Approximate database Used in conjunction with skewed data the of percentage interest) of each in class (or subpopulation the overall
Types Simple random sample without replacement (SRSWOR) of size s: This is created by drawing s of the N tuples fromD (s < N), where the probability of drawing any tuple in D is 1=N, that is, all tuples are equally likely to be sampled.
Simple random sample with replacement (SRSWR) of size s: This is similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again.
Cluster sample: If the tuples in D are grouped into M mutually disjoint clusters, then an SRS of s clusters can be obtained, where s < M. For example, tuples in database are
usually retrieved a page at a time, so that each page can be considered a cluster. A reduced data representation can be obtained by applying, say, SRSWOR to the pages, resulting in a
cluster a
sample
of
the we on
Other choose
clustering to define
criteria clusters
conveying rich semantics can also be explored. For example, in spatial database, based geographically located. closely different areas are
Stratified obtaining
sample: an SRS
If at
is
divided stratum.
into This
mutually helps
disjoint ensure a
parts called strata, a stratified sample of D is generated by each representative sample, especially when the data are skewed. For example, a stratified sample may be obtained from customer data, where a stratum is created for each customer age group. In this way, the age group having the smallest number of customers will be sure to be represented.