Data Preprocessing
Attributes
Attribute (or dimension, feature, variable):
a data field, representing a characteristic or feature
of a data object.
E.g., customer_ID, name, address
Types:
Nominal
Binary
Ordinal
Numeric: quantitative
Interval-scaled
Ratio-scaled
Attribute Types
Nominal: categories, states, or “names of things”
Hair_color = {auburn, black, blond, brown, grey, red, white}
marital status, occupation, ID numbers, zip codes
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to the more important outcome (e.g.,
HIV positive)
Ordinal
Values have a meaningful order (ranking) but magnitude
between successive values is not known.
Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
Quantity (integer or real-valued)
Interval
Measured on a scale of equal-sized units
Values have order
E.g., temperature in °C or °F, calendar dates
No true zero-point
Ratio
Inherent zero-point
We can speak of values as being an order of
magnitude larger than the unit of measurement
(e.g., 10 K is twice as high as 5 K).
e.g., temperature in Kelvin, length, counts,
monetary quantities
Similarity and Dissimilarity
Way to assess how alike and unalike objects
are in comparison to one another.
E.g., a store may want to search for clusters of
customers with similar buying behavior.
Similarity and Dissimilarity
Similarity
Numerical measure of how alike two data objects
are.
Is higher when objects are more alike.
Dissimilarity
Numerical measure of how different two data
objects are.
Is lower when objects are more alike.
Calculate Proximity Measure for
Nominal Attributes
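The computation itself is missing from this extract. A standard measure is the simple matching dissimilarity d(i, j) = (p − m)/p, where p is the total number of attributes and m is the number of attributes on which the two objects match. A minimal sketch (the customer tuples are made-up):

```python
def nominal_dissimilarity(i, j):
    """Simple matching dissimilarity: d(i, j) = (p - m) / p,
    where p = number of attributes, m = number of matches."""
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)
    return (p - m) / p

# two objects described by (hair_color, marital_status, occupation)
x = ("black", "single", "engineer")
y = ("black", "married", "engineer")
print(nominal_dissimilarity(x, y))  # 1 of 3 attributes differ -> 0.333...
```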
Distance Metric
Distance d (p, q) between two points p and q is a
dissimilarity measure if it satisfies:
1. Positive definiteness:
d (p, q) ≥ 0 for all p and q and
d (p, q) = 0 only if p = q.
2. Symmetry: d (p, q) = d (q, p) for all p and q.
3. Triangle Inequality:
d (p, r) ≤ d (p, q) + d (q, r) for all points p, q,
and r.
Minkowski Distance
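The formula itself is missing from this extract; the standard definition is d(p, q) = (Σ |p_i − q_i|^h)^(1/h), which gives the Manhattan distance for h = 1 and the Euclidean distance for h = 2. A sketch (the example points are made-up):

```python
def minkowski(p, q, h):
    """Minkowski distance of order h: (sum |p_i - q_i|^h)^(1/h).
    h = 1 -> Manhattan distance, h = 2 -> Euclidean distance."""
    return sum(abs(a - b) ** h for a, b in zip(p, q)) ** (1 / h)

print(minkowski((1, 2), (4, 6), 1))  # Manhattan: |1-4| + |2-6| = 7.0
print(minkowski((1, 2), (4, 6), 2))  # Euclidean: sqrt(9 + 16) = 5.0
```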
Similarity Between Binary Vectors
A common situation is that objects p and q have only
binary attributes
Compute similarities using the following quantities
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1
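From these four counts one can compute, for example, the Simple Matching Coefficient SMC = (M11 + M00) / (M01 + M10 + M11 + M00) and the Jaccard coefficient J = M11 / (M01 + M10 + M11), which ignores M00 and so suits asymmetric binary attributes. A sketch (the vectors p and q are made-up):

```python
def binary_similarities(p, q):
    """Compute SMC and Jaccard from the M01/M10/M00/M11 counts."""
    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    jaccard = m11 / (m01 + m10 + m11)
    return smc, jaccard

p = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
q = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
print(binary_similarities(p, q))  # SMC = 7/10 = 0.7, Jaccard = 0/3 = 0.0
```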
Cosine Similarity
A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or
phrase in the document.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 • d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5
= 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5
= 4.12
cos(d1, d2) = 25 / (6.481 × 4.12) ≈ 0.94
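The same computation as a short sketch:

```python
import math

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||)"""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))  # 0.94, matching the slide
```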
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking
certain attributes of interest, or containing
only aggregate data
e.g., occupation=“ ”
noisy: containing errors or outliers
e.g., Salary=“-10”
inconsistent: containing discrepancies in codes
or names
e.g., Age=“42” Birthday=“03/07/1997”
e.g., Was rating “1,2,3”, now rating “A, B, C”
e.g., discrepancy between duplicate records
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces the
same or similar analytical results
Data Discretization
reduce the number of values for a given continuous attribute by
dividing the range of the attribute into intervals. Interval labels
can then be used to replace actual data values
Data Cleaning
Missing Data
Data is not always available
E.g., many tuples have no recorded value for
several attributes, such as customer income in sales
data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus
deleted
data not entered due to misunderstanding
certain data may not be considered important at the
time of entry
Missing data may need to be inferred.
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming
the task is classification); not effective when the percentage of
missing values per attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Use a global constant to fill in the missing value: e.g., “unknown”, a
new class?!
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the same class
to fill in the missing value: smarter
Use the most probable value to fill in the missing value: inference-
based, such as a decision tree
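A minimal sketch of filling in missing values with the attribute mean (the income values and `None` markers are made-up):

```python
def fill_with_mean(values):
    """Replace None (missing) entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

incomes = [30000, None, 50000, 40000, None]
print(fill_with_mean(incomes))  # missing incomes become 40000.0
```

The class-based variant from the slide would apply the same function separately to the samples of each class.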
Noisy Data
Noise: random error or variance in a measured variable;
causes include:
technology limitation
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning method:
first sort data and partition into (equi-depth) bins
then smooth by bin means, bin medians, or bin boundaries
Regression
smooth by fitting the data into regression functions
Binning
In the example , the data for price are first sorted and
then partitioned into equidepth bins of depth 3
Smoothing by bin means: each value in a bin is
replaced by the mean value of the bin.
Smoothing by bin medians: each bin value is
replaced by the bin median
Smoothing by bin boundaries: the minimum and
maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the
closest boundary value
The larger the width the greater effect of the
smoothing
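A sketch of smoothing by bin means over equal-depth bins (the price values are made-up):

```python
def smooth_by_bin_means(values, depth):
    """Equal-depth binning followed by smoothing: every value in a bin
    is replaced by the mean of that bin."""
    data = sorted(values)
    out = []
    for i in range(0, len(data), depth):
        bin_ = data[i:i + depth]
        mean = sum(bin_) / len(bin_)
        out.extend([mean] * len(bin_))
    return out

# made-up sorted price data, bins of depth 3:
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# bins (4, 8, 15), (21, 21, 24), (25, 28, 34) -> means 9.0, 22.0, 29.0
```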
Cluster Analysis
Outliers may be detected by clustering: values that fall
outside of the clusters may be considered outliers
[Figure: clustered data points with outliers lying outside the clusters]
Regression
[Figure: data smoothed by fitting the regression line y = x + 1;
an observed value Y1 at X1 is replaced by the fitted value Y1' on the line]
Data Transformation: Normalization
Data transformation
Min-max normalization performs a linear
transformation on the original data.
Suppose that minA and maxA are the minimum
and maximum values of an attribute A. Min-max
normalization maps a value v of A to v' in the
range [new_minA, new_maxA] by computing:
v' = (v − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA
E.g., suppose the minimum and maximum values of an attribute are
12,000 and 98,000, mapped to [0.0, 1.0]; then 73,600 is transformed to:
(73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
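A sketch of the min-max formula, reproducing the 0.716 result:

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """v' = (v - minA) / (maxA - minA) * (new_maxA - new_minA) + new_minA"""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# value 73,600 with observed range [12,000, 98,000], mapped to [0.0, 1.0]
print(round(min_max(73600, 12000, 98000), 3))  # 0.716
```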
Example z-score normalization
In z-score normalization (or zero-mean normalization), the values for an
attribute A are normalized based on the mean and standard deviation of
A. A value v of A is normalized to v’ by computing
v' = (v − meanA) / σA
where meanA and σA are the mean and standard deviation,
respectively, of attribute A. This method of normalization is useful
when the actual minimum and maximum of attribute A are unknown,
or when there are outliers that dominate the min-max normalization.
E.g., with mean 54,000 and standard deviation 16,000,
a value of 73,600 is transformed to:
v' = (73,600 − 54,000) / 16,000 = 1.225
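A sketch of the z-score formula, reproducing the 1.225 result:

```python
def z_score(v, mean_a, std_a):
    """v' = (v - meanA) / sigmaA"""
    return (v - mean_a) / std_a

# value 73,600 with mean 54,000 and standard deviation 16,000
print(z_score(73600, 54000, 16000))  # 1.225
```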
Normalization by decimal scaling:
v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1
Data Integration
Data integration:
combines data from multiple sources into a coherent
store
Schema integration
integrate metadata from different sources
Handling Redundant Data
in Data Integration
Redundant data often occur when multiple databases
are integrated
The same attribute may have different names in
different databases
One attribute may be a “derived” attribute in another
table, e.g., annual revenue
Redundant data can often be detected by
correlation analysis
Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
For example:
Log file size (MB): 20 21 22 24
Time stamp (AM): 7:00 7:02 7:05 7:10
(log file size and time are positively correlated)
For example, suppose the maximum log file size is 50 MB:
Log file size (MB): 20 25 30 35
Remaining size in log file (MB): 30 25 20 15
(the two attributes are negatively correlated)
Data Integration
Correlation coefficient between attributes A and B:
r_A,B = Σ(A − Ā)(B − B̄) / ((n − 1) σA σB)
where n is the number of tuples, Ā and B̄ are the means,
and σA and σB the standard deviations of A and B
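A sketch of the coefficient above, applied to the log-file example (using sample standard deviations, i.e. divisor n − 1, to match the formula):

```python
import math

def pearson(a, b):
    """r_AB = sum((a_i - mean_a)(b_i - mean_b)) / ((n - 1) * s_a * s_b)"""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sa = math.sqrt(sum((x - ma) ** 2 for x in a) / (n - 1))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b) / (n - 1))
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / ((n - 1) * sa * sb)

# log-file example from the slide: size grows while remaining size shrinks
size = [20, 25, 30, 35]
remaining = [30, 25, 20, 15]
print(round(pearson(size, remaining), 6))  # -1.0 (perfect negative correlation)
```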
Data Reduction Strategies
Data discretization
Data discretization converts a large number of
data values into a smaller number of intervals, so that data
evaluation and data management become easier.
Methods include:
Histogram analysis
Binning
Clustering analysis
Decision tree analysis
Equal width partitioning
Equal depth partitioning
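A sketch of the last two methods in the list (the data values are made-up):

```python
def equal_width_bins(values, n):
    """Equal-width partitioning: split the range [A, B] into n intervals
    of width W = (B - A) / n."""
    a, b = min(values), max(values)
    w = (b - a) / n
    bins = [[] for _ in range(n)]
    for v in values:
        idx = min(int((v - a) / w), n - 1)  # clamp the max value into last bin
        bins[idx].append(v)
    return bins

def equal_depth_bins(values, n):
    """Equal-depth partitioning: split sorted values into n bins holding
    the same number of samples."""
    data = sorted(values)
    depth = len(data) // n
    return [data[i * depth:(i + 1) * depth] for i in range(n)]

data = [5, 10, 11, 13, 15, 35, 50, 55, 72]
print(equal_width_bins(data, 3))  # [[5, 10, 11, 13, 15], [35], [50, 55, 72]]
print(equal_depth_bins(data, 3))  # [[5, 10, 11], [13, 15, 35], [50, 55, 72]]
```

Note how equal-width bins can be skewed by outliers, while equal-depth bins keep the counts balanced.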
Dimensionality Reduction
Dimensionality reduction reduces the data set size by
removing irrelevant, weakly relevant, or redundant
attributes (or dimensions) from it
Feature Extraction
Three groups of features
Basic features of individual TCP connections:
source & destination IP (Features 1 & 2)
source & destination port (Features 3 & 4)
protocol (Feature 5)
duration (Feature 6)
[Table: example connection records (dst host, service, flag), e.g.,
repeated h1 / http / S0 entries with a high %S0 indicating a syn flood,
alongside an h2 / ftp / S0 record]
Data Mining for Intrusion Detection
Misuse Detection: Building Predictive Models
[Table: labeled connection records with columns Tid, SrcIP, Start time,
Dest IP, Dest Port, Number of bytes, Attack; e.g., Tid 1: SrcIP
206.135.38.95, start 11:07:20, Dest IP 160.94.179.223, port 139,
192 bytes, Attack = No]
Data Compression
[Figure: original data vs. its compressed (lossless) and
approximated (lossy) representations]
Data Compression
An example of lossless vs. lossy compression is the
following string:
25.888888888
This string can be compressed as:
25.[9]8
Interpreted as, "twenty five point 9 eights", the original
string is perfectly recreated, just written in a smaller
form.
In a lossy system, using 26 instead, the exact original
data is lost, with the benefit of a shorter representation.
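Run-length encoding is one simple lossless scheme in the spirit of the 25.[9]8 example; a sketch (the function names are my own):

```python
def rle_encode(s):
    """Lossless run-length encoding: 'aaab' -> [('a', 3), ('b', 1)]."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return [(c, n) for c, n in runs]

def rle_decode(runs):
    """Invert rle_encode: expand each (char, count) run."""
    return "".join(c * n for c, n in runs)

s = "25.888888888"
runs = rle_encode(s)
assert rle_decode(runs) == s      # lossless: original recovered exactly
print(runs[-1])                   # ('8', 9) -- "nine eights", like 25.[9]8
```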
Numerosity Reduction
Simple Discretization Methods: Binning
Equal-width (distance) partitioning:
divides the range into N intervals of equal size (uniform grid)
if A and B are the lowest and highest values of the attribute,
the width of intervals will be W = (B − A)/N