Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary
- Data integration: integration of multiple databases, data cubes, or files
- Data transformation: normalization and aggregation
- Data reduction: obtains a reduced representation in volume that produces the same or similar analytical results
- Data discretization: part of data reduction, but of particular importance, especially for numerical data
Data Cleaning
Data cleaning tasks:
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
Missing Data
Data is not always available; e.g., many tuples have no recorded value for several attributes, such as customer income in sales data. Missing data may be due to:
- equipment malfunction
- inconsistency with other recorded data, and thus deleted
- data not entered due to misunderstanding
- certain data not considered important at the time of entry
- history or changes of the data not being registered
Missing data may need to be inferred.
Noisy Data
Noise: random error or variance in a measured variable. Incorrect attribute values may be due to:
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitations
- inconsistency in naming conventions
Other data problems which require data cleaning:
- duplicate records
- incomplete data
- inconsistent data
Cluster Analysis

[Figure: clusters of 2-D data; values falling outside the clusters may be treated as outliers.]
Regression

[Figure: data smoothed by fitting a regression line, e.g., y = x + 1; a value Y1 observed at X1 is replaced by its fitted value Y1' on the line.]
Data Integration
Data integration: combines data from multiple sources into a coherent store.
- Schema integration: integrate metadata from different sources
- Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
- Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources differ; possible reasons include different representations and different scales, e.g., metric vs. British units
Data Transformation
- Smoothing: remove noise from data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scaled to fall within a small, specified range (min-max normalization, z-score normalization, normalization by decimal scaling)
- Attribute/feature construction: new attributes constructed from the given ones
min-max normalization:
$v' = \dfrac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$

z-score normalization:
$v' = \dfrac{v - \mathit{mean}_A}{\mathit{stand\_dev}_A}$

normalization by decimal scaling:
$v' = \dfrac{v}{10^{j}}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$
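A minimal sketch of the three normalizations in Python; the attribute values are illustrative and not from the source:

```python
# Sketches of min-max, z-score, and decimal-scaling normalization.
import math

def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: map [min_A, max_A] onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Z-score normalization: center on the mean, scale by the std deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    """Decimal scaling: divide by the smallest power of 10 with max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

income = [12000, 35000, 54000, 98000]   # hypothetical attribute values
print(min_max(income))
print(z_score(income))
print(decimal_scaling(income))          # divides by 10^5
```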
Dimensionality Reduction
Feature selection (i.e., attribute subset selection):
- Select a minimum set of features such that the probability distribution of the different classes given the values for those features is as close as possible to the original distribution given the values of all features
- Reduces the number of patterns, making the patterns easier to understand
Heuristic methods (due to the exponential number of choices):
- step-wise forward selection (a greedy sketch follows the figure below)
- step-wise backward elimination
- combining forward selection and backward elimination
- decision-tree induction
[Figure: decision-tree induction for attribute subset selection; internal nodes test attributes, leaves are Class 1 / Class 2, and attributes not appearing in the tree are discarded.]
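Below is a minimal sketch of step-wise forward selection, the first heuristic listed above. The `evaluate` callback is a hypothetical stand-in for any subset-quality measure (e.g., cross-validated accuracy of a classifier trained on the subset); the feature names and weights are made up for illustration.

```python
def forward_selection(features, evaluate):
    """Greedily add the best remaining feature while it improves the score."""
    selected, best_score = [], float("-inf")
    while len(selected) < len(features):
        candidates = [f for f in features if f not in selected]
        # Score each candidate when added to the current subset.
        score, best = max((evaluate(selected + [f]), f) for f in candidates)
        if score <= best_score:        # stop: no candidate improves the subset
            break
        selected.append(best)
        best_score = score
    return selected

# Toy additive score: pretend each feature contributes a fixed amount of
# class-separating power (purely illustrative numbers).
weights = {"income": 0.5, "age": 0.3, "zip": 0.05}
print(forward_selection(list(weights), lambda s: sum(weights[f] for f in s)))
# ['income', 'age', 'zip']
```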
Data Compression
String compression:
- There are extensive theories and well-tuned algorithms
- Typically lossless
- But only limited manipulation is possible without expansion
Audio/video compression:
- Typically lossy compression, with progressive refinement
- Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
[Figure: original data reduced to compressed data; lossless methods reconstruct it exactly, lossy methods only approximately.]
Wavelet Transforms
Discrete wavelet transform (DWT): a linear signal processing technique.
[Figure: Haar-2 and Daubechies-4 wavelet basis functions.]
Compressed approximation: store only a small fraction of the strongest wavelet coefficients. Similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space.
Method:
- The length L must be an integer power of 2 (pad with 0s when necessary)
- Each transform has 2 functions: smoothing and difference
- They are applied to pairs of data, resulting in two sets of data of length L/2
- The two functions are applied recursively until the desired length is reached
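A minimal sketch of this recursive smoothing/difference scheme using the Haar wavelet (pairwise averages and half-differences); the example signal is illustrative, and a production implementation would use a library such as PyWavelets.

```python
def haar_dwt(data):
    """Recursively apply average/difference to pairs until length 1."""
    n = 1
    while n < len(data):                 # pad with 0s to a power of 2
        n *= 2
    data = list(data) + [0.0] * (n - len(data))
    coeffs = []
    while len(data) > 1:
        smooth = [(a + b) / 2 for a, b in zip(data[::2], data[1::2])]
        detail = [(a - b) / 2 for a, b in zip(data[::2], data[1::2])]
        coeffs = detail + coeffs         # keep the detail coefficients
        data = smooth                    # recurse on the smoothed half
    return data + coeffs                 # overall average + all details

signal = [2, 2, 0, 2, 3, 5, 4, 4]
print(haar_dwt(signal))   # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

Compression then keeps only the largest-magnitude coefficients and zeroes out the rest.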
Numerosity Reduction
Can we reduce the data volume by choosing alternative, smaller forms of data representation?
Parametric methods:
- Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
- Log-linear models: obtain the value at a point in m-D space as a product over the appropriate marginal subspaces
Non-parametric methods:
- Do not assume models
- Major families: histograms, clustering, sampling
Histograms
A popular data reduction technique: divide the data into buckets and store the average (or sum) for each bucket. Histograms can be constructed optimally in one dimension using dynamic programming (a sketch appears after the V-Optimal Histogram section below).
[Figure: histogram of prices; bucket counts (0-40) over equiwidth ranges from 10,000 to 100,000.]
How are the buckets determined and the attribute values partitioned?
- Equiwidth: the width of each bucket range is uniform
- Equidepth (equiheight): each bucket contains roughly the same number of contiguous data samples
- V-Optimal: the least-variance histogram; the variance is a weighted sum over the original values that each bucket represents, with bucket weight = number of values in the bucket
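A minimal sketch contrasting the two simplest rules on a small illustrative price list:

```python
def equiwidth(values, k):
    """k buckets of uniform width; returns (lo, hi, count) per bucket."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        i = min(int((v - lo) / width), k - 1)  # clamp the max into the last bucket
        counts[i] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]

def equidepth(values, k):
    """k buckets each holding roughly the same number of sorted values."""
    vs = sorted(values)
    size = len(vs) // k
    return [vs[i * size:(i + 1) * size] if i < k - 1 else vs[(k - 1) * size:]
            for i in range(k)]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equiwidth(prices, 3))   # uniform ranges, uneven counts
print(equidepth(prices, 3))   # uneven ranges, 4 values per bucket
```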
MaxDiff
MaxDiff: place a bucket boundary between adjacent sorted values $x_i$ and $x_{i+1}$ whenever the gap $x_{i+1} - x_i$ is one of the $k-1$ largest differences.
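A minimal sketch of this rule, assuming the values are sorted first; the data is illustrative:

```python
def maxdiff_buckets(values, k):
    """Cut at the k-1 largest gaps between adjacent sorted values."""
    vs = sorted(values)
    # Gaps between adjacent values, remembering where each gap occurs.
    gaps = sorted(((vs[i + 1] - vs[i], i) for i in range(len(vs) - 1)),
                  reverse=True)
    cuts = sorted(i for _, i in gaps[:k - 1])  # positions of the k-1 largest gaps
    buckets, start = [], 0
    for c in cuts:
        buckets.append(vs[start:c + 1])
        start = c + 1
    buckets.append(vs[start:])
    return buckets

print(maxdiff_buckets([1, 2, 2, 3, 10, 11, 12, 30, 31], k=3))
# [[1, 2, 2, 3], [10, 11, 12], [30, 31]]
```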
V-Optimal Histogram
- Avoid putting very different frequencies into the same bucket
- Partition so as to minimize $\sum_i \mathrm{VAR}_i$, where $\mathrm{VAR}_i$ is the frequency variance within bucket $i$
- Define the area of a value to be the product of its frequency and its spread (the difference between this value and the next value with non-zero frequency)
- Insert bucket boundaries where two adjacent areas differ by large amounts
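The "optimal construction by dynamic programming" mentioned under Histograms can be sketched as follows: minimize the total within-bucket sum of squared errors over all ways of cutting the sorted frequencies into B buckets. The O(B·n²) recurrence below is a minimal sketch, not a tuned implementation.

```python
def sse(prefix, prefix_sq, i, j):
    """Sum of squared errors of f[i..j] (inclusive), via prefix sums."""
    n = j - i + 1
    s = prefix[j + 1] - prefix[i]
    sq = prefix_sq[j + 1] - prefix_sq[i]
    return sq - s * s / n

def v_optimal(freqs, b):
    """Least total within-bucket variance over all b-bucket partitions."""
    n = len(freqs)
    prefix = [0.0] * (n + 1)
    prefix_sq = [0.0] * (n + 1)
    for i, f in enumerate(freqs):
        prefix[i + 1] = prefix[i] + f
        prefix_sq[i + 1] = prefix_sq[i] + f * f
    INF = float("inf")
    # dp[k][j] = least error covering f[0..j-1] with k buckets
    dp = [[INF] * (n + 1) for _ in range(b + 1)]
    dp[0][0] = 0.0
    for k in range(1, b + 1):
        for j in range(1, n + 1):
            for i in range(k - 1, j):    # last bucket spans f[i..j-1]
                cand = dp[k - 1][i] + sse(prefix, prefix_sq, i, j - 1)
                dp[k][j] = min(dp[k][j], cand)
    return dp[b][n]

print(v_optimal([1, 1, 8, 8, 2, 2, 9, 9], b=4))  # 0.0: four uniform buckets
```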
MaxDiff Histogram

[Figure: example of a MaxDiff histogram.]
Clustering
- Partition the data set into clusters, and store only the cluster representation
- Can be very effective if the data is clustered, but not if the data is smeared
- Can use hierarchical clustering, stored in multidimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms
Sampling
- Allows a mining algorithm to run in complexity that is potentially sub-linear in the size of the data
- Choose a representative subset of the data: simple random sampling may perform very poorly in the presence of skew
- Develop adaptive sampling methods
- Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database; used in conjunction with skewed data
- Sampling may not reduce database I/Os (read a page at a time)
Sampling Methods
- Simple random sample without replacement (SRSWOR) of size n: draw n of the N tuples from D (n < N); P(drawing any tuple) = 1/N, i.e., all are equally likely
- Simple random sample with replacement (SRSWR) of size n: each time a tuple is drawn from D, it is recorded and then replaced, so it may be drawn again
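A minimal sketch of SRSWOR, SRSWR, and stratified sampling using only the standard library; the data set D and the parity-based "class" are illustrative:

```python
import random

D = list(range(1, 101))                        # hypothetical relation, N = 100 tuples

srswor = random.sample(D, 10)                  # without replacement: no repeats
srswr = [random.choice(D) for _ in range(10)]  # with replacement: repeats possible

def stratified(data, key, frac):
    """Sample each stratum at rate frac, preserving subpopulation proportions."""
    strata = {}
    for t in data:
        strata.setdefault(key(t), []).append(t)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * frac))
        sample.extend(random.sample(members, k))
    return sample

# Stratify by parity as a stand-in for a skewed class label.
print(sorted(stratified(D, key=lambda t: t % 2, frac=0.1)))
```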
[Figure: SRSWOR and SRSWR samples drawn from the raw data.]

[Figure: cluster/stratified sample drawn from the raw data.]
Hierarchical Reduction
- Hierarchical clustering is often performed but tends to define partitions of data sets rather than clusters
- Parametric methods are usually not amenable to hierarchical representation
- Hierarchical aggregation: an index tree hierarchically divides a data set into partitions by the value range of some attributes; each partition can be considered a bucket; thus an index tree with aggregates stored at each node is a hierarchical histogram
Discretization
Three types of attributes:
- Nominal: values from an unordered set
- Ordinal: values from an ordered set
- Continuous: real numbers
Discretization: divide the range of a continuous attribute into intervals.
- Some classification algorithms only accept categorical attributes
- Reduces data size
- Prepares for further analysis
Concept Hierarchies

Concept hierarchies reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior). Detail is lost, but the generalized data is more meaningful, easier to interpret, and requires less space.
Discretization Process
Univariate: one feature at a time. Steps of a typical discretization process:
- sorting the continuous values of the feature
- evaluating a cut-point for splitting, or adjacent intervals for merging
- splitting or merging intervals of continuous values
Top-down approach: intervals are split; choose the best cut-point to split the values into two partitions, repeating until a stopping criterion is satisfied.
Values: 11 14 15 18 19 20 21 | 22 23 25 30 31 33 35 | 36
Labels: R  C  C  R  C  R  C  | R  C  C  R  C  R  C  | R
Majority class per interval: C | C | R

- an interval of class C from 11-21
- another interval of class C from 22-35
- a last interval of class R containing just 36
- the two leftmost intervals are merged, as they predict the same class
- stopping criterion: each interval should contain a prespecified minimum number of instances
Entropy
An information-based measure called entropy can be used to recursively partition the values of a numeric attribute A, resulting in a hierarchical discretization. Such a discretization forms a numerical concept hierarchy for the attribute. Given a set of data tuples S, the basic method for entropy-based discretization of A is as follows:
Entropy
A measure of the (im)purity of a sample variable S is defined as
$\mathrm{Ent}(S) = -\sum_{s} p_s \log_2 p_s$
where $s$ is a value of $S$ and $p_s$ is its estimated probability; this is the average amount of information per event. The information of an event is $I(s) = -\log_2 p_s$: information is high for low-probability events and low otherwise.
The boundary T that minimizes the entropy function over all possible boundaries is selected as a binary discretization. The process is applied recursively to the partitions obtained until some stopping criterion is met, e.g.,
$\mathrm{Ent}(S) - E(T, S) > \delta$
where $E(T, S) = \frac{|S_1|}{|S|}\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\mathrm{Ent}(S_2)$ is the weighted entropy after splitting $S$ into $S_1$ and $S_2$ at boundary $T$. Experiments show that entropy-based discretization may reduce data size and improve classification accuracy.
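A minimal sketch of the whole procedure, assuming a fixed gain threshold delta as the stopping criterion; the (value, label) pairs are illustrative:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_split(pairs):
    """pairs: sorted (value, label) tuples. Return (E(T,S), boundary index)."""
    n = len(pairs)
    best = (float("inf"), None)
    for i in range(1, n):
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        e = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        best = min(best, (e, i))
    return best

def discretize(pairs, delta=0.1, cuts=None):
    """Recursively cut while the information gain exceeds delta."""
    if cuts is None:
        pairs, cuts = sorted(pairs), []
    e_split, i = best_split(pairs)
    if i is not None and entropy([l for _, l in pairs]) - e_split > delta:
        cuts.append((pairs[i - 1][0] + pairs[i][0]) / 2)  # midpoint boundary
        discretize(pairs[:i], delta, cuts)
        discretize(pairs[i:], delta, cuts)
    return sorted(cuts)

data = [(5, "low"), (7, "low"), (9, "low"), (21, "high"),
        (24, "high"), (30, "high"), (42, "low"), (45, "low")]
print(discretize(data))   # [15.0, 36.0]
```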
3-4-5 Rule

The 3-4-5 rule partitions a numeric range into relatively uniform, natural intervals based on the number of distinct values covered at the most significant digit. If it covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 intervals; if it covers 2, 4, or 8 distinct values, partition the range into 4 intervals.
If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals.
Since there could be some dramatically large positive or negative values in a data set, the top-level segmentation, based merely on the minimum and maximum values, may give distorted results; segmentation based on the extreme values may lead to a highly biased hierarchy. Thus the top-level segmentation can be performed on the range of data values representing the majority (e.g., 5th percentile to 95th percentile).
[Figure: step-by-step 3-4-5 segmentation of the profit attribute, worked through in the example below.]
Example
Profits at the different branches of a company:
- Min: -$351,976.00; Max: $4,700,896.50
- 5th percentile: -$159,876; 95th percentile: $1,838,761
- low = -159,876, high = 1,838,761
- msd = 1,000,000; rounding low to the million-dollar digit gives low' = -1,000,000; rounding high to the million-dollar digit gives high' = 2,000,000
- (2,000,000 - (-1,000,000)) / 1,000,000 = 3 distinct values at the msd, so the range is partitioned into 3 equiwidth subsegments:
(-1,000,000...0], (0...1,000,000], (1,000,000...2,000,000]
First interval: low' (-1,000,000) < min, so adjust the left boundary: the msd of min is 100,000, and rounding min gives -400,000, so the first interval becomes (-400,000...0]. The last interval (1,000,000...2,000,000] does not cover max: since max > high', form a new interval (2,000,000...5,000,000] (rounding max up to 5,000,000).
Recursively, each interval can be further partitioned according to the 3-4-5 rule:
- first interval (-400,000...0] partitioned into 4 subintervals: (-400,000...-300,000], (-300,000...-200,000], (-200,000...-100,000], (-100,000...0]
- second interval (0...1,000,000] partitioned into 5 subintervals: (0...200,000], (200,000...400,000], (400,000...600,000], (600,000...800,000], (800,000...1,000,000]
- third interval (1,000,000...2,000,000] partitioned into 5 subintervals: (1,000,000...1,200,000], (1,200,000...1,400,000], (1,400,000...1,600,000], (1,600,000...1,800,000], (1,800,000...2,000,000]
- last interval (2,000,000...5,000,000] partitioned into 3 subintervals: (2,000,000...3,000,000], (3,000,000...4,000,000], (4,000,000...5,000,000]
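A minimal sketch of the top-level segmentation used in this example. The boundary adjustment against the true min/max and the recursion are omitted, and the 2-3-2 sub-split used for 7 distinct values in the full rule is simplified to 3 equiwidth intervals:

```python
import math

def round_down(x, msd): return math.floor(x / msd) * msd
def round_up(x, msd):   return math.ceil(x / msd) * msd

def segment_345(low, high):
    """Top-level 3-4-5 segmentation of [low, high] (e.g., 5th-95th percentiles)."""
    msd = 10 ** int(math.log10(max(abs(low), abs(high))))  # most significant digit
    lo, hi = round_down(low, msd), round_up(high, msd)
    distinct = round((hi - lo) / msd)       # distinct msd values covered
    if distinct in (3, 6, 7, 9):            # 7 simplified; full rule uses 2-3-2
        parts = 3
    elif distinct in (2, 4, 8):
        parts = 4
    else:                                   # 1, 5, or 10 distinct values
        parts = 5
    width = (hi - lo) / parts
    return [(lo + i * width, lo + (i + 1) * width) for i in range(parts)]

# The profit example: low = -159,876 and high = 1,838,761 give msd = 1,000,000,
# lo' = -1,000,000, hi' = 2,000,000, 3 distinct values -> 3 equiwidth intervals.
print(segment_345(-159_876, 1_838_761))
# [(-1000000.0, 0.0), (0.0, 1000000.0), (1000000.0, 2000000.0)]
```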