Contents
Part 1: Data Preprocessing
Data cleaning
Data integration and transformation
Data reduction
noisy:
containing errors or outliers
inconsistent:
containing discrepancies in codes or names
2. Data integration
3. Data transformation
4. Data reduction
2. Data integration
Integration of multiple databases, data cubes, or files
3. Data transformation
Normalization and aggregation
4. Data reduction
Obtains a reduced representation in volume but produces the same or similar analytical results
5. Data discretization
Part of data reduction but with particular importance, especially for numerical data
Missing Data
Missing data may be due to
1. equipment malfunction
Noisy Data
Noise: random error or variance in a measured variable
Other data problems that require data cleaning
1.2 Clustering
detect and remove outliers (a minimal sketch follows this list)
1.4 Regression
smooth by fitting the data into regression functions
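Clustering-based outlier removal can be illustrated with a minimal sketch: a tiny one-dimensional k-means where members of very small clusters are flagged as outliers. The data, k, and the min_size threshold are all hypothetical choices, not from the slides; regression smoothing is revisited later under parametric methods.

# A minimal sketch of clustering-based outlier detection: run a tiny 1-D
# k-means, then treat members of very small clusters as outliers.
import numpy as np

def cluster_outliers(values, k=2, iters=10, min_size=2):
    data = np.asarray(values, dtype=float)
    centroids = np.linspace(data.min(), data.max(), k)  # crude initialization
    for _ in range(iters):
        # assign each point to its nearest centroid
        labels = np.argmin(np.abs(data[:, None] - centroids[None, :]), axis=1)
        for c in range(k):                  # recompute non-empty centroids
            if np.any(labels == c):
                centroids[c] = data[labels == c].mean()
    sizes = np.bincount(labels, minlength=k)
    return data[sizes[labels] < min_size]   # points falling in tiny clusters

print(cluster_outliers([4, 8, 9, 15, 21, 24, 26, 29, 34, 95]))  # -> [95.]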
Equal-width (distance) partitioning divides the range into N intervals of equal size: if A and B are the lowest and highest values of the attribute, the interval width is W = (B - A)/N. It is the most straightforward method, but outliers may dominate the presentation, and skewed data is not handled well.
Sorted data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
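The same numbers can be reproduced with a short sketch; bin means are rounded to the nearest integer to match the slide.

# A minimal sketch of equal-depth binning with the slide's sorted data,
# smoothing each bin by its mean and by its boundaries.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted
depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(bins)       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]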
1.4 Regression
[Figure: data points in the x-y plane fitted by the regression line y = x + 1]
2. Data Integration
Data integration: combines data from multiple sources into a coherent store.
Schema integration: integrate metadata from different sources.
Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id and B.cust-#.
Redundant data may be detected by correlation analysis. Careful integration of data from multiple sources can help reduce or avoid redundancies and inconsistencies and improve mining speed and quality.
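As a sketch of the correlation check: if two attributes obtained from different sources are highly correlated, one of them is likely redundant. The attribute names, values, and threshold below are hypothetical.

# A minimal sketch of redundancy detection via correlation analysis.
import numpy as np

annual_salary = np.array([30000, 45000, 52000, 61000, 75000], dtype=float)
monthly_pay = np.array([2500, 3750, 4300, 5100, 6250], dtype=float)

r = np.corrcoef(annual_salary, monthly_pay)[0, 1]  # Pearson correlation
if abs(r) > 0.95:                                  # hypothetical threshold
    print(f"r = {r:.3f}: attributes look redundant; consider dropping one")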
3. Data Transformation
1. Smoothing: remove noise from the data
2. Aggregation: summarization, data cube construction
3. Generalization: concept hierarchy climbing
4. Normalization: scale values to fall within a small, specified range
- min-max normalization
- z-score normalization
- normalization by decimal scaling
5. Attribute/feature construction: new attributes constructed from the given ones
Decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.
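A minimal sketch of all three normalization methods on hypothetical values:

# Min-max, z-score, and decimal-scaling normalization (hypothetical data).
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# min-max normalization to the range [0, 1]
minmax = (v - v.min()) / (v.max() - v.min())

# z-score normalization: zero mean, unit standard deviation
zscore = (v - v.mean()) / v.std()

# decimal scaling: divide by 10^j so that max(|v'|) < 1  (here j = 4)
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
decimal = v / 10 ** j

print(minmax, zscore, decimal, sep="\n")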
4. Data Reduction
Data reduction obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
Data cube aggregation:
- Multiple levels of aggregation in data cubes further reduce the size of the data to deal with.
- Reference appropriate levels: use the smallest representation that is sufficient to solve the task.
- Queries regarding aggregated information should be answered using the data cube, when possible.
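A minimal sketch of the idea, rolling hypothetical quarterly sales up to annual totals so that year-level queries touch far fewer rows:

# Data cube aggregation: quarterly sales rolled up to annual totals.
# All figures are hypothetical.
from collections import defaultdict

quarterly_sales = [  # (year, quarter, amount)
    (2008, 1, 224.0), (2008, 2, 408.0), (2008, 3, 350.0), (2008, 4, 586.0),
    (2009, 1, 240.0), (2009, 2, 430.0), (2009, 3, 325.0), (2009, 4, 600.0),
]

annual_sales = defaultdict(float)
for year, _quarter, amount in quarterly_sales:
    annual_sales[year] += amount          # aggregate away the quarter level

print(dict(annual_sales))  # {2008: 1568.0, 2009: 1595.0}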
[Figure: decision-tree induction for attribute selection — internal nodes test attributes A4, A1, A6, and leaves assign Class 1 or Class 2; only the attributes appearing in the tree are retained]
Data Compression
String compression:
- There are extensive theories and well-tuned algorithms.
- Typically lossless, but only limited manipulation is possible without expansion.
Audio/video compression:
- Typically lossy compression, with progressive refinement.
- Sometimes small fragments of the signal can be reconstructed without reconstructing the whole.
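For string data, a lossless round trip is easy to demonstrate with Python's standard zlib module; the text below is an arbitrary example.

# Lossless string compression: the original is recovered exactly.
import zlib

text = ("Data cleaning, data integration, data transformation, "
        "data reduction, data discretization. ") * 20
packed = zlib.compress(text.encode("utf-8"))

print(len(text), "->", len(packed), "bytes")            # much smaller
assert zlib.decompress(packed).decode("utf-8") == text  # lossless round trip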
[Figure: original data reduced to a much smaller set of compressed data]
Wavelet Transforms
[Figure: Haar-2 and Daubechies-4 wavelet basis functions]
- Discrete wavelet transform (DWT): linear signal processing.
- Compressed approximation: store only a small fraction of the strongest wavelet coefficients.
- Similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space.
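A minimal sketch of the DWT idea using a one-level orthonormal Haar transform; the signal and the number of kept coefficients are hypothetical, and a real application would use a full multi-level transform (e.g., the PyWavelets library).

# One-level Haar DWT: keep only the strongest coefficients, then invert.
import numpy as np

signal = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])

# forward transform: pairwise (scaled) averages and differences
avg = (signal[0::2] + signal[1::2]) / np.sqrt(2)
diff = (signal[0::2] - signal[1::2]) / np.sqrt(2)
coeffs = np.concatenate([avg, diff])

# keep the 4 strongest coefficients, zero out the rest (the lossy step)
keep = np.argsort(np.abs(coeffs))[-4:]
approx = np.zeros_like(coeffs)
approx[keep] = coeffs[keep]

# inverse transform gives the reconstructed (approximate) signal
a, d = approx[:4], approx[4:]
recon = np.empty_like(signal)
recon[0::2] = (a + d) / np.sqrt(2)
recon[1::2] = (a - d) / np.sqrt(2)
print(recon)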
Principal Component Analysis (PCA)
- Given N data vectors in k dimensions, find c <= k orthogonal vectors that can best be used to represent the data.
- The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions).
- Each data vector is a linear combination of the c principal component vectors.
[Figure: principal components of a two-dimensional data set plotted against the original axes, X1 shown]
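A minimal sketch of PCA via singular value decomposition; the data matrix and the choice c = 1 are hypothetical.

# PCA via SVD: project N k-dimensional points onto c principal components.
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])  # N=8, k=2
c = 1

Xc = X - X.mean(axis=0)              # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:c]                  # c orthogonal direction vectors
reduced = Xc @ components.T          # N points in c dimensions

# each original vector is (approximately) a linear combination of the
# c principal component vectors:
reconstructed = reduced @ components + X.mean(axis=0)
print(reduced.shape, np.abs(X - reconstructed).max())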
Linear regression
Parametric methods
1. Regression
Linear regression: Y = α + βX
The two parameters α and β specify the line and are estimated from the data at hand, using the least-squares criterion on the known values of Y1, Y2, ..., and X1, X2, ....
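A minimal sketch of the least-squares estimates on hypothetical sample values:

# Closed-form least-squares fit of the line Y = alpha + beta * X.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

beta = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
alpha = y.mean() - beta * x.mean()    # least-squares estimates
print(f"Y = {alpha:.2f} + {beta:.2f} X")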
Non-parametric methods: histograms, clustering, sampling.
Non-parametric methods
1. Histograms
- A popular data reduction technique.
- Divide the data into buckets and store the average (or sum) for each bucket.
- Can be constructed optimally in one dimension using dynamic programming.
- Related to quantization problems.
[Figure: histogram with buckets spanning 10000 to 90000 and counts ranging from 0 to 40]
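A minimal sketch: store only bucket boundaries and counts instead of the raw values. The data is synthetic and the bucket boundaries are taken from the figure above.

# Histogram as a data reduction structure: boundaries and counts only.
import numpy as np

prices = np.random.default_rng(0).integers(10_000, 90_000, size=500)
counts, edges = np.histogram(prices, bins=[10_000, 30_000, 50_000, 70_000, 90_000])

for lo, hi, n in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:>6}-{hi:<6}: {n}")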
Non-parametric methods
2. Clustering
- Partition the data set into clusters, and store only a compact representation of each cluster in place of the raw data.
[Figure: data points grouped into three clusters C1, C2, C3]
Non-parametric methods
3. Sampling
- Allows a mining algorithm to run in complexity that is potentially sub-linear in the size of the data.
- Choose a representative subset of the data: simple random sampling may perform very poorly in the presence of skew.
- Stratified sampling approximates the percentage of each class (or subpopulation of interest) in the overall database; it is used in conjunction with skewed data (a minimal sketch follows the figures below).
- Note that sampling may not reduce database I/Os (data is read a page at a time).
[Figure: a simple random sample drawn from the raw data]
[Figure: a cluster/stratified sample drawn from the raw data]
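A minimal sketch of stratified sampling; the class labels, record counts, and sampling fraction are hypothetical.

# Stratified sampling: sample each class in proportion to its share of the
# database, so that rare (skewed) classes stay represented.
import random
from collections import defaultdict

records = [("fraud", i) for i in range(20)] + [("normal", i) for i in range(980)]
fraction = 0.1

strata = defaultdict(list)
for cls, rec in records:
    strata[cls].append(rec)

random.seed(42)
sample = {cls: random.sample(rows, max(1, round(len(rows) * fraction)))
          for cls, rows in strata.items()}
print({cls: len(rows) for cls, rows in sample.items()})  # {'fraud': 2, 'normal': 98}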
[Figure: a concept hierarchy for location, from higher to lower levels: province_or_state, city, street]
Concept hierarchies reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as age ranges or labels).
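A minimal sketch of hierarchy climbing for age; the cut-points and labels are hypothetical.

# Replace numeric ages (low-level values) with higher-level concept labels.
def generalize_age(age):
    if age < 30:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"

ages = [23, 37, 45, 61, 29, 70]
print([generalize_age(a) for a in ages])
# ['young', 'middle-aged', 'middle-aged', 'senior', 'young', 'senior']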
5. Discretization
Discretization reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals; interval labels can then be used to replace actual data values. Why discretize? Some classification algorithms only accept categorical attributes; discretization reduces the data size and prepares the data for further analysis.
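A minimal sketch of equal-width discretization, reusing the width formula W = (B - A)/N from the binning discussion; the data and N are hypothetical.

# Equal-width discretization: replace each value with its interval label.
import numpy as np

values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)
N = 3
A, B = values.min(), values.max()
W = (B - A) / N                              # interval width, W = (B - A)/N

idx = np.minimum(((values - A) / W).astype(int), N - 1)
labels = [f"[{A + i * W:.0f}, {A + (i + 1) * W:.0f})" for i in idx]
print(list(zip(values.tolist(), labels)))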
2.1 Characterization
2.2 Discrimination
2.3 Association
2.4 Classification/prediction
2.5 Clustering
2.6 Outlier analysis
low_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $50
4.2 Certainty: e.g., confidence
4.4 Novelty: not previously known, surprising (used to remove redundant rules, e.g., the Canada vs. Vancouver rule-implication support ratio)
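Certainty (confidence) and support can be computed directly from transactions; a minimal sketch with hypothetical data:

# Two interestingness measures for a rule A => B:
# support = P(A and B), confidence (certainty) = P(B | A).
transactions = [
    {"milk", "bread"}, {"milk", "bread", "butter"},
    {"bread"}, {"milk", "butter"}, {"milk", "bread"},
]
A, B = {"milk"}, {"bread"}

n_a = sum(1 for t in transactions if A <= t)
n_ab = sum(1 for t in transactions if (A | B) <= t)

support = n_ab / len(transactions)     # 3/5 = 0.6
confidence = n_ab / n_a                # 3/4 = 0.75
print(support, confidence)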
Design
DMQL is designed with the primitives described earlier
1. task-relevant data
2. the kind of knowledge to be mined
3. concept hierarchy specification
4. interestingness measure
5. pattern presentation and visualization
having condition
2.1 Characterization
Mine_Knowledge_Specification ::= mine characteristics [as pattern_name] analyze measure(s)
2.2 Discrimination
Mine_Knowledge_Specification ::= mine comparison [as pattern_name] for target_class where target_condition {versus contrast_class_i where contrast_condition_i} analyze measure(s)
2.4 Classification
Mine_Knowledge_Specification ::= mine classification [as pattern_name] analyze classifying_attribute_or_dimension
2.5 Prediction
Mine_Knowledge_Specification ::= mine prediction [as pattern_name] analyze prediction_attribute_or_dimension {set {attribute_or_dimension_i= value_i}}
Example:
with support threshold = 0.05
with confidence threshold = 0.7
To facilitate interactive viewing at different concept levels, the following syntax is defined:
Multilevel_Manipulation ::= roll up on attribute_or_dimension
| drill down on attribute_or_dimension
| add attribute_or_dimension
| drop attribute_or_dimension
2. Loose coupling
Fetching data from DB/DW
3. Tight coupling
- A uniform information-processing environment: DM is smoothly integrated into the DB/DW system.
- Mining queries are optimized based on, e.g., mining-query analysis, indexing, and query-processing methods.
4. Semi-tight coupling
- Enhanced DM performance: provide efficient implementations of a few data mining primitives in the DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, and precomputation of some statistical functions.
Reference
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques (slides for textbook Chapters 3 and 4), Intelligent Database Systems Research Lab, School of Computing Science, Simon Fraser University, Canada.
The end