
Data Preprocessing

Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
1

Why Data Preprocessing?


Data in the real world is dirty:
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
noisy: containing errors or outliers
inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results!
Quality decisions must be based on quality data
A data warehouse needs consistent integration of quality data
2

Multi-Dimensional Measure of Data Quality


A well-accepted multidimensional view:
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
3

Major Tasks in Data Preprocessing


Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

Data integration
Integration of multiple databases, data cubes, or files

Data transformation
Normalization and aggregation

Data reduction
Obtains a reduced representation of the data that is much smaller in volume but produces the same or similar analytical results

Data discretization
Part of data reduction but with particular importance, especially for numerical data
4

Forms of data preprocessing

Data Cleaning
Data cleaning tasks:
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data

Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to:
equipment malfunction
data inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data not considered important at the time of entry
failure to register the history or changes of the data
Missing data may need to be inferred.
7

How to Handle Missing Data?


Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
Fill in the missing value manually: tedious + infeasible?
Use a global constant to fill in the missing value: e.g., "unknown", a new class?!
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
Use the most probable value to fill in the missing value: regression, inference-based methods such as decision tree induction
Using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income
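A minimal Python sketch of the mean-based strategies, assuming a pandas DataFrame; the column names and values are illustrative, not from the slides:

import pandas as pd

df = pd.DataFrame({
    "income": [50_000, None, 42_000, None, 61_000],
    "credit_rating": ["fair", "fair", "excellent", "excellent", "fair"],
})

# Strategy: fill with the overall attribute mean
df["income_mean_filled"] = df["income"].fillna(df["income"].mean())

# Smarter strategy: fill with the mean of samples in the same class
df["income_class_filled"] = df["income"].fillna(
    df.groupby("credit_rating")["income"].transform("mean")
)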
8

Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to:
faulty data collection instruments
data entry problems
data transmission problems
technology limitations
inconsistency in naming conventions
Other data problems which require data cleaning:
duplicate records
incomplete data
inconsistent data
9

How to Handle Noisy Data?


Binning method:
first sort the data and partition it into (equi-depth) bins
then smooth by bin means, bin medians, bin boundaries, etc.
Clustering:
detect and remove outliers
Combined computer and human inspection:
detect suspicious values and have them checked by a human
Regression:
smooth by fitting the data to regression functions
10

Simple Discretization Methods: Binning


Equal-width (distance) partitioning:
Divides the range into N intervals of equal size: uniform grid
If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A)/N
The most straightforward approach
But outliers may dominate the presentation
Skewed data is not handled well
Equal-depth (frequency) partitioning:
Divides the range into N intervals, each containing approximately the same number of samples
Good data scaling
Managing categorical attributes can be tricky
11

Binning Methods for Data Smoothing


* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
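A short Python sketch of equi-depth binning with smoothing by bin means and by bin boundaries, reproducing the price example above:

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
prices.sort()
depth = 4  # equi-depth: four values per bin
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: replace each value by the (rounded) mean of its bin
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace each value by the closer bin boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]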
12

Cluster Analysis

13

Regression
[Figure: scatter plot of data points (X1, Y1) with the fitted regression line y = x + 1]

14

Data Integration
Data integration:
combines data from multiple sources into a coherent store
Schema integration:
integrate metadata from different sources
Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts:
for the same real-world entity, attribute values from different sources are different
possible reasons: different representations, different scales, e.g., metric vs. British units
15

Handling Redundant Data in Data Integration


Inconsistencies in attribute or dimension naming can cause redundancies in the resulting data set
Redundant data occur often when integrating multiple databases
The same attribute may have different names in different databases
One attribute may be a derived attribute in another table, e.g., annual revenue
Redundant data may be detected by correlation analysis
Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
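As an illustration of correlation-based redundancy detection, a small numpy sketch; the attribute names, values, and the 0.9 threshold are assumptions for the example:

import numpy as np

monthly_revenue = np.array([10.0, 12.0, 9.5, 14.0, 11.0])
annual_revenue = monthly_revenue * 12          # a derived, redundant attribute
r = np.corrcoef(monthly_revenue, annual_revenue)[0, 1]
if abs(r) > 0.9:                               # the threshold is a modeling choice
    print(f"likely redundant (Pearson r = {r:.2f})")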

16

Data Transformation
Smoothing: remove noise from the data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scale values to fall within a small, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Attribute/feature construction:
new attributes constructed from the given ones
17

Data Transformation: Normalization


min-max normalization

v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A
z-score normalization

v' = \frac{v - mean_A}{stand\_dev_A}

normalization by decimal scaling

v' = \frac{v}{10^j}

where j is the smallest integer such that \max(|v'|) < 1
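A brief Python sketch of the three normalizations on a small numeric column (the values are illustrative):

import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# min-max normalization to [new_min, new_max]
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# z-score normalization
zscore = (v - v.mean()) / v.std()

# decimal scaling: divide by 10^j for the smallest j with max(|v'|) < 1
j = 0
while np.abs(v / 10 ** j).max() >= 1:
    j += 1
decimal = v / 10 ** j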

18

Data Reduction Strategies


A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set
Data reduction: obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Data reduction strategies:
Data cube aggregation
Dimensionality reduction
Numerosity reduction
Discretization and concept hierarchy generation
19

Data Cube Aggregation


The lowest level of a data cube: the aggregated data for an individual entity of interest
Multiple levels of aggregation in data cubes further reduce the size of the data to deal with
Reference appropriate levels: use the smallest representation that is sufficient to solve the task
Queries regarding aggregated information should be answered using the data cube, when possible
20

Dimensionality Reduction
Feature selection (i.e., attribute subset selection):
Select a minimum set of features such that the probability distribution of the different classes given the values of those features is as close as possible to the original distribution given the values of all features
Reduces the number of patterns, making the patterns easier to understand
Heuristic methods (due to the exponential number of choices):
step-wise forward selection
step-wise backward elimination
combining forward selection and backward elimination
decision-tree induction
21

Example of Decision Tree Induction


Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Figure: induced decision tree with internal nodes A4?, A1?, A6? and leaves labeled Class 1 / Class 2]
Reduced attribute set: {A1, A4, A6}


22

Example of Stepwise forward selection


Initial attribute set: {A1, A2, A3, A4, A5, A6}
Initial reduced set: {}
=> {A1}
=> {A1, A4}
=> Reduced attribute set: {A1, A4, A6}
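A minimal Python sketch of this greedy forward-selection loop; the scoring function below is a made-up stand-in for whatever evaluation measure (a significance test, information gain, or a wrapper's cross-validated accuracy) would actually be used:

def forward_selection(candidates, score, max_features):
    # Greedy step-wise forward selection: repeatedly add the single
    # attribute that most improves the score of the current subset
    selected = []
    while candidates and len(selected) < max_features:
        best = max(candidates, key=lambda a: score(selected + [a]))
        selected.append(best)
        candidates = [a for a in candidates if a != best]
    return selected

# Toy scoring function (purely illustrative)
weights = {"A1": 0.9, "A2": 0.1, "A3": 0.2, "A4": 0.8, "A5": 0.15, "A6": 0.7}
score = lambda subset: sum(weights[a] for a in subset)
print(forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], score, 3))
# -> ['A1', 'A4', 'A6']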


23

Example of Stepwise backward elimination


Initial attribute set: {A1, A2, A3, A4, A5, A6}
=> {A1, A3, A4, A5, A6}
=> {A1, A4, A5, A6}
=> Reduced attribute set: {A1, A4, A6}


24

Heuristic Feature Selection Methods


How can we find a good subset of the original attributes?
There are 2^d possible sub-features of d features
Several heuristic feature selection methods:
Best single features under the feature independence assumption: choose by significance tests
Best step-wise feature selection:
The best single feature is picked first
Then the next best feature conditioned on the first, ...
Step-wise feature elimination:
Repeatedly eliminate the worst feature
Best combined feature selection and elimination
If the mining task is classification, and the mining algorithm itself is used to determine the attribute subset, then this is called a wrapper approach; otherwise it is a filter approach.
25

Data Compression
String compression:
There are extensive theories and well-tuned algorithms
Typically lossless
But only limited manipulation is possible without expansion
Audio/video compression:
Typically lossy compression, with progressive refinement
Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
26

Data Compression

[Figure: original data compressed to a smaller representation; a lossless method recovers the original data exactly, a lossy method recovers only an approximation of the original data]


27

Wavelet Transforms
Discrete wavelet transform (DWT): linear signal processing technique
(wavelet families include Haar-2 and Daubechies-4)
Compressed approximation: store only a small fraction of the strongest wavelet coefficients
Similar to the discrete Fourier transform (DFT), but better lossy compression, localized in space
Method:
The length L of the input must be an integer power of 2 (pad with 0s when necessary)
Each transform has 2 functions: smoothing and difference
They are applied to pairs of data points, yielding two sets of data of length L/2
The two functions are applied recursively until the desired length is reached
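An illustrative unnormalized Haar transform in Python, using pairwise averages as the smoothing function and pairwise half-differences as the difference function (coefficient scaling conventions vary between libraries):

def haar_step(x):
    # One level: pairwise averages (smoothing) and half-differences (detail)
    avgs = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]
    diffs = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]
    return avgs, diffs

def haar_dwt(x):
    # Full transform of a sequence whose length is a power of 2
    coeffs = []
    while len(x) > 1:
        x, d = haar_step(x)
        coeffs = d + coeffs          # coarser detail coefficients go in front
    return x + coeffs                # [overall average, details coarsest to finest]

data = [2.0, 6.0, 3.0, 5.0, 9.0, 7.0, 1.0, 3.0]   # length 8 = 2^3
print(haar_dwt(data))   # [4.5, -0.5, 0.0, 3.0, -2.0, -1.0, 1.0, -1.0]

Compression then keeps only the largest-magnitude coefficients and sets the rest to zero.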
28

Principal Component Analysis


Given N data vectors in k dimensions, find c <= k orthogonal vectors that can best be used to represent the data
The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
Each data vector is a linear combination of the c principal component vectors
Works for numeric data only
Used when the number of dimensions is large
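A compact numpy sketch of PCA via the eigenvectors of the covariance matrix, on synthetic data with c = 1 component kept (all names and sizes are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # N = 100 vectors, k = 3 dimensions
X_centered = X - X.mean(axis=0)

cov = np.cov(X_centered, rowvar=False)   # k x k covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # strongest components first

c = 1                                    # number of principal components kept
components = eigvecs[:, order[:c]]       # k x c projection matrix
X_reduced = X_centered @ components      # N x c reduced representation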
29

Principal Component Analysis


[Figure: data points plotted in the original axes X1, X2 with the principal component axes Y1, Y2 overlaid]

30

Numerosity Reduction
Can we reduce the data volume by choosing alternative, smaller forms of data representation?
Parametric methods:
Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
Log-linear models: obtain the value at a point in m-D space as a product over appropriate marginal subspaces
Non-parametric methods:
Do not assume models
Major families: histograms, clustering, sampling
31

Regression and Log-Linear Models


Linear regression: data are modeled to fit a straight line
Often uses the least-squares method to fit the line
Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
Log-linear model: approximates discrete multidimensional probability distributions
32

Regression Analysis and Log-Linear Models


Linear regression: Y = \alpha + \beta X
The two parameters \alpha and \beta specify the line and are estimated from the data at hand, using the least-squares criterion on the known values (X_1, Y_1), (X_2, Y_2), ...
Multiple regression: Y = b_0 + b_1 X_1 + b_2 X_2
Many nonlinear functions can be transformed into the above
Log-linear models:
The multi-way table of joint probabilities is approximated by a product of lower-order tables
Probability: p(a, b, c, d) = \alpha_{ab}\,\beta_{ac}\,\chi_{ad}\,\delta_{bcd}
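A small numpy sketch of fitting \alpha and \beta by least squares on synthetic points (the values are illustrative):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Closed-form least-squares estimates of the slope and intercept
beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha = y.mean() - beta * x.mean()
print(f"Y = {alpha:.2f} + {beta:.2f} X")   # only the two parameters are stored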

Histograms
A popular data reduction technique
Divide the data into buckets and store the average (or sum) for each bucket
Can be constructed optimally in one dimension using dynamic programming
[Figure: equi-width histogram with counts (0-40) on the y-axis and price buckets from 10,000 to 100,000 on the x-axis]
34

How are the buckets determined and the attribute values partitioned?
Equi-width: the width of each bucket range is uniform
Equi-depth (equi-height): each bucket contains roughly the same number of contiguous data samples
V-Optimal: the least-variance histogram; the variance is a weighted sum of the variances of the original values each bucket represents, with bucket weight = number of values in the bucket
35

MaxDiff
MaxDiff: place a bucket boundary between adjacent sorted values x_i and x_{i+1} whenever the difference x_{i+1} - x_i is one of the k-1 largest differences
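A short Python sketch of MaxDiff boundaries for k buckets on toy values:

def maxdiff_buckets(values, k):
    # Cut the sorted values at the k-1 largest gaps between neighbours
    xs = sorted(values)
    gaps = sorted(range(len(xs) - 1), key=lambda i: xs[i + 1] - xs[i],
                  reverse=True)[:k - 1]
    cuts = sorted(i + 1 for i in gaps)
    return [xs[a:b] for a, b in zip([0] + cuts, cuts + [len(xs)])]

print(maxdiff_buckets([1, 2, 3, 10, 11, 12, 30, 31], k=3))
# -> [[1, 2, 3], [10, 11, 12], [30, 31]]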

36

V-optimal histogram
Avoid putting very different frequencies into the same bucket
Partition so as to minimize \sum_i VAR_i, where VAR_i is the frequency variance within bucket i
Define area as the product of the frequency of a value and its spread (the difference between this value and the next value with non-zero frequency)
Insert bucket boundaries where two adjacent areas differ by large amounts
37

MaxDiff Histogram

V-optimal design
Sort the values and assign an equal number of values to each bucket; compute the variance
repeat:
change the buckets' boundary values
compute the new variance
until there is no reduction in variance
variance = \frac{n_1 Var_1 + n_2 Var_2 + \dots + n_k Var_k}{N}, where N = n_1 + n_2 + \dots + n_k and Var_i = \frac{1}{n_i}\sum_{j=1}^{n_i}(x_j - \bar{x}_i)^2

38

Clustering
Partition the data set into clusters, and store only the cluster representation
Can be very effective if the data is clustered, but not if the data is smeared
Can use hierarchical clustering, stored in multi-dimensional index tree structures
There are many choices of clustering definitions and clustering algorithms
39

Sampling
Allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data
Choose a representative subset of the data
Simple random sampling may have very poor performance in the presence of skew
Develop adaptive sampling methods
Stratified sampling:
Approximate the percentage of each class (or subpopulation of interest) in the overall database
Used in conjunction with skewed data
Sampling may not reduce database I/Os (data is read a page at a time)
40

Sampling methods
Simple random sample without replacement (SRSWOR) of size n:
draw n of the N tuples from D (n < N)
P(drawing any tuple) = 1/N: all tuples are equally likely
Simple random sample with replacement (SRSWR) of size n:
each time a tuple is drawn from D, it is recorded and then replaced, so it may be drawn again
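Both schemes in a few lines of Python; D here is just an illustrative list of tuple ids:

import random

D = list(range(1, 101))   # N = 100 tuples
n = 10

srswor = random.sample(D, n)                   # without replacement
srswr = [random.choice(D) for _ in range(n)]   # with replacement: repeats possible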
41

Sampling methods cont.


Cluster sample: if the tuples in D are grouped into M mutually disjoint clusters, an SRS of m clusters can be obtained, where m < M
Tuples in a database are retrieved a page at a time, so each page can be considered a cluster
Stratified sample: D is divided into mutually disjoint parts called strata; obtain an SRS from each stratum
Gives a representative sample when the data are skewed
Ex: customer data with a stratum for each age group; the age group having the smallest number of customers is still sure to be represented
42

Confidence intervals and sample size


The cost of obtaining a sample is proportional to the size of the sample, n
By specifying a confidence interval, one can determine the number of samples n required so that the sample mean falls within the confidence interval with (1-p) confidence
n is very small compared to the size of the database N: n << N
43

Sampling

[Figure: SRSWOR and SRSWR samples drawn from the raw data]
44

Sampling
[Figure: raw data reduced to a cluster sample and a stratified sample]

45

Hierarchical Reduction
Hierarchical clustering is often performed but tends to define partitions of data sets rather than clusters
Parametric methods are usually not amenable to hierarchical representation
Hierarchical aggregation:
An index tree hierarchically divides a data set into partitions by the value range of some attributes
Each partition can be considered as a bucket
Thus an index tree with aggregates stored at each node is a hierarchical histogram

46

Discretization
Three types of attributes:
Nominal: values from an unordered set
Ordinal: values from an ordered set
Continuous: real numbers
Discretization: divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical attributes
Reduce data size by discretization
Prepare for further analysis
47

Discretization and Concept Hierarchy


Discretization reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.

48

Concept hierarchies
Concept hierarchies reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior)
Detail is lost, but the generalized data
are more meaningful and easier to interpret
require less space
Mining on a reduced data set requires fewer input/output operations and can be more efficient
49

Discretization Process
Univariate: one feature at a time
Steps of a typical discretization process:
sorting the continuous values of the feature
evaluating a cut-point for splitting, or adjacent intervals for merging
splitting or merging intervals of continuous values
Top-down approach: intervals are split
choose the best cut-point to split the values into two partitions
repeat until a stopping criterion is satisfied
50

Discretization Process cont.


Bottom-up approach: intervals are merged
find the best pair of intervals to merge
repeat until a stopping criterion is satisfied
Stopping at some point:
fewer intervals: better understanding, less accuracy
more intervals: poorer understanding, higher accuracy
Stopping criteria:
the number of intervals reaches a critical value
the number of observations in each interval exceeds a value
all observations in an interval are of the same class
a minimum information gain is not satisfied
51

Generalized Splitting Algorithm


S = sorted values of feature f

Splitting(S) {
    if stopping criterion satisfied:
        return
    T  = GetBestSplitPoint(S)
    S1 = GetLeftPart(S, T)
    S2 = GetRightPart(S, T)
    Splitting(S1)
    Splitting(S2)
}


52

Discretization and concept hierarchy generation for numeric data


Binning
Histogram analysis
Clustering analysis
Entropy-based discretization
Segmentation by natural partitioning
53

1R: a supervised discretization method


After sorting the continuous values, 1R divides the range into a number of disjoint intervals and adjusts the boundaries based on the class labels
Example:
Values:    11 14 15 18 19 20 21 | 22 23 25 30 31 33 35 | 36
Classes:    R  C  C  R  C  R  C |  R  C  C  R  C  R  C |  R
Predicted:            C         |          C           |  R
One interval of class C from 11-21, another interval of C from 22-35, and a last interval of class R containing just 36
The two leftmost intervals are merged as they predict the same class
Stopping: each interval should contain a prespecified minimum number of instances

54

Entropy
An information-based measure called entropy can be used to recursively partition the values of a numeric attribute A, resulting in a hierarchical discretization. Such a discretization forms a numerical concept hierarchy for the attribute. Given a set of data tuples S, the basic method for entropy-based discretization of A is as follows:

55

Entropy
A measure of the (im)purity of a sample variable S is defined as
Ent(S) = -\sum_{s} p_s \log_2 p_s
where s ranges over the values of S and p_s is its estimated probability
This is the average amount of information per event
The information of an event is I(s) = -\log_2 p_s: information is high for low-probability events and low otherwise
56

Entropy-Based Discretization


Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

E(S, T) = \frac{|S_1|}{|S|} Ent(S_1) + \frac{|S_2|}{|S|} Ent(S_2)

The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization. The process is applied recursively to the partitions obtained until some stopping criterion is met, e.g.,

Ent(S) - E(T, S) > \delta
Experiments show that it may reduce data size and improve classification accuracy
57

Age Splitting Example


A sample of ages for 15 customers:
Age:   19 27 29 35 38 39 40 41 42 43 43 43 45 55 55
Class:  y  n  y  y  y  y  y  y  n  y  y  n  n  n  n
The classes are the y (yes) or n (no) responses to a life insurance promotion, i.e., responding to the promotion vs. age of customer
Entropy of the whole sample without partitioning:
Ent(S) = -(6/15) \log_2(6/15) - (9/15) \log_2(9/15) = 0.4 \times 1.322 + 0.6 \times 0.734 = 0.969

58

Age Splitting Example


Splitting value of age = 41.5
S1 (age < 41.5): 7 y, 1 n
S2 (age > 41.5): 2 y, 5 n
E(S, T) = (|S1|/|S|) Ent(S1) + (|S2|/|S|) Ent(S2) = (8/15) Ent(S1) + (7/15) Ent(S2)
Ent(S1) = -(1/8) \log_2(1/8) - (7/8) \log_2(7/8) = 0.125 \times 3.000 + 0.875 \times 0.193 = 0.544
Ent(S2) = -(2/7) \log_2(2/7) - (5/7) \log_2(5/7) = 0.286 \times 1.806 + 0.714 \times 0.486 = 0.863
E(S1, S2) = (8/15) \times 0.544 + (7/15) \times 0.863 = 0.692
59

Age Splitting Example


Information gain: 0.969 - 0.692 = 0.277
Try all possible partitions and choose the one providing the maximum information gain
Then recursively partition each interval until the stopping criterion is satisfied, e.g., when the information gain falls below a threshold value
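A small Python sketch that reproduces this calculation (exact values differ slightly from the slides, which round the logarithms):

import math

ages    = [19, 27, 29, 35, 38, 39, 40, 41, 42, 43, 43, 43, 45, 55, 55]
classes = ['y','n','y','y','y','y','y','y','n','y','y','n','n','n','n']

def ent(labels):
    # Entropy of a list of class labels
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def info_after_split(threshold):
    # Weighted entropy E(S, T) after splitting at age = threshold
    left  = [c for a, c in zip(ages, classes) if a < threshold]
    right = [c for a, c in zip(ages, classes) if a >= threshold]
    n = len(classes)
    return len(left) / n * ent(left) + len(right) / n * ent(right)

print(round(ent(classes), 3))                            # 0.971 (slide: 0.969)
print(round(info_after_split(41.5), 3))                  # 0.693 (slide: 0.692)
print(round(ent(classes) - info_after_split(41.5), 3))   # gain 0.278 (slide: 0.277)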
60

Segmentation by natural partitioning


The 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals. In general, the rule partitions a given range of data into 3, 4, or 5 relatively equi-width intervals, recursively and level by level, based on the value range at the most significant digit.

61

Segmentation by natural partitioning


* If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals (3 equi-width intervals for 3, 6, and 9, and 3 intervals in the grouping 2-3-2 for 7).
* If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals.
* If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals.
Since there could be some dramatically large positive or negative values in a data set, a top-level segmentation based merely on the minimum and maximum values may give distorted results; segmentation based on the maximal values may lead to a highly biased hierarchy. Thus the top-level segmentation should be performed on the range of data values representing the majority (e.g., the 5th percentile to the 95th percentile).
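A rough Python sketch of a single level of the 3-4-5 rule (top level only; the 2-3-2 grouping for 7 and the boundary adjustments of the worked example that follows are not reproduced here):

def partition_345(low, high, msd):
    # Partition (low, high] into 3, 4, or 5 equi-width intervals,
    # based on the number of distinct values at the most significant digit
    distinct = round((high - low) / msd)
    if distinct in (3, 6, 7, 9):
        n = 3
    elif distinct in (2, 4, 8):
        n = 4
    else:               # 1, 5, or 10 distinct values
        n = 5
    width = (high - low) / n
    return [(low + i * width, low + (i + 1) * width) for i in range(n)]

# Top level of the profit example below: low' = -1,000,000, high' = 2,000,000
print(partition_345(-1_000_000, 2_000_000, 1_000_000))
# -> [(-1000000.0, 0.0), (0.0, 1000000.0), (1000000.0, 2000000.0)]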

62

Example of the 3-4-5 rule


[Figure: 3-4-5 rule applied to branch profits, with count on the y-axis and profit on the x-axis (values shown in the figure use msd = 1,000).
Step 1: Min = -$351, Low (5th percentile) = -$159, High (95th percentile) = $1,838, Max = $4,700
Step 2: msd = 1,000, so Low is rounded to -$1,000 and High to $2,000
Step 3: the top-level range (-$1,000 - $2,000) is partitioned into (-$1,000 - 0), (0 - $1,000), ($1,000 - $2,000)
Step 4: the boundaries are adjusted to cover Min and Max, giving (-$400 - 0), (0 - $1,000), ($1,000 - $2,000), ($2,000 - $5,000); each of these is recursively split into equi-width subintervals, e.g., (-$400 - -$300), ..., ($4,000 - $5,000)]

63

Example
Profits at different branches of a company:
Min: -$351,976.00, Max: $4,700,896.50
5th percentile: -$159,876, 95th percentile: $1,838,761
low = -159,876, high = 1,838,761
msd = 1,000,000; rounding low to the million-dollar digit gives low' = -1,000,000; rounding high to the million-dollar digit gives high' = 2,000,000
(2,000,000 - (-1,000,000)) / 1,000,000 = 3 distinct values at the most significant digit, so the range is partitioned into 3 equi-width subsegments:
(-1,000,000...0],(0...1,000,000],(1,000,000...2,000,000]
64

First interval: low' (-1,000,000) < min, so adjust the left boundary
The msd of min is 100,000; rounding min gives -400,000
The first interval becomes (-400,000...0]
The last interval (1,000,000...2,000,000] does not cover max (max > high'), so a new interval (2,000,000...5,000,000] is added, rounding max to 5,000,000
65

Recursively, each interval can be further partitioned according to the 3-4-5 rule
The first interval (-400,000...0] is partitioned into 4 subintervals: (-400,000...-300,000], (-300,000...-200,000], (-200,000...-100,000], (-100,000...0]
The second interval (0...1,000,000] is partitioned into 5 subintervals: (0...200,000], (200,000...400,000], (400,000...600,000], (600,000...800,000], (800,000...1,000,000]

66

The third interval (1,000,000...2,000,000] is partitioned into 5 subintervals: (1,000,000...1,200,000], (1,200,000...1,400,000], (1,400,000...1,600,000], (1,600,000...1,800,000], (1,800,000...2,000,000]
The last interval (2,000,000...5,000,000] is partitioned into 3 subintervals: (2,000,000...3,000,000], (3,000,000...4,000,000], (4,000,000...5,000,000]

67

Concept hierarchy generation for categorical data


Specification of a partial ordering of attributes explicitly at the schema level by users or experts
Specification of a portion of a hierarchy by explicit data grouping
Specification of a set of attributes, but not of their partial ordering
Specification of only a partial set of attributes

68

Specification of a set of attributes


A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy.
country: 15 distinct values
province_or_state: 65 distinct values
city: 3,567 distinct values
street: 674,339 distinct values
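A tiny Python sketch of this heuristic, ordering the attributes of the slide from the lowest level (most distinct values) up to the highest (fewest distinct values):

distinct_counts = {          # counts taken from the slide
    "country": 15,
    "province_or_state": 65,
    "city": 3_567,
    "street": 674_339,
}
hierarchy = sorted(distinct_counts, key=distinct_counts.get)   # fewest first
print(" < ".join(reversed(hierarchy)))
# -> street < city < province_or_state < country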
69
