
Data preprocessing

 Data preprocessing describes any type of processing performed on raw data to prepare it for another processing procedure.
 Commonly used as a preliminary data mining practice, data preprocessing transforms the data into a format that will be more easily and effectively processed.

Data Preprocessing

No quality data, no quality mining results!


"Garbage In, Garbage Out"
 Quality decisions must be based on quality data

 Data warehouse needs consistent integration of

quality data.

Attributes
 Attribute (or dimensions, features, variables):
a data field, representing a characteristic or feature
of a data object.
 E.g., customer_ID, name, address
 Types:
  Nominal
  Binary
  Ordinal
  Numeric: quantitative
   Interval-scaled
   Ratio-scaled

Attribute Types
 Nominal: categories, states, or “names of things”
 Hair_color = {auburn, black, blond, brown, grey, red, white}
 marital status, occupation, ID numbers, zip codes
 Binary
 Nominal attribute with only 2 states (0 and 1)
 Symmetric binary: both outcomes equally important
 e.g., gender
 Asymmetric binary: outcomes not equally important.
 e.g., medical test (positive vs. negative)
 Convention: assign 1 to most important outcome (e.g.,
HIV positive)
 Ordinal
 Values have a meaningful order (ranking) but magnitude
between successive values is not known.
 Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
 Quantity (integer or real-valued)
 Interval
 Measured on a scale of equal-sized units
 Values have order
 E.g., temperature in C˚or F˚, calendar dates
 No true zero-point
 Ratio
 Inherent zero-point
 We can speak of values as being an order of
magnitude larger than the unit of measurement
(10 K˚ is twice as high as 5 K˚).
 e.g., temperature in Kelvin, length, counts,
monetary quantities

Discrete vs. Continuous Attributes


 Discrete Attribute
 Has only a finite or countably infinite set of values

 E.g., zip codes, profession, or the set of words in a

collection of documents
 Sometimes, represented as integer variables

 Note: Binary attributes are a special case of discrete


attributes
 Continuous Attribute
 Has real numbers as attribute values

 E.g., temperature, height, or weight

 Practically, real values can only be measured and


represented using a finite number of digits
 Continuous attributes are typically represented as
floating-point variables
Similarity and Dissimilarity
 Way to assess how alike and unalike objects
are in comparison to one another.
 A store may want to search for clusters of customer objects, resulting in groups of customers with similar characteristics (e.g. similar income, area of residence, and age).
 A similarity measure for two objects, i and j, will typically return the value 0 if the objects are unalike.

Similarity and Dissimilarity

 The higher the similarity value, the greater the similarity between objects (typically a value of 1 indicates complete similarity, i.e., the objects are identical).
 A dissimilarity measure works the opposite way: it returns the value 0 if the objects are the same (and therefore far from being dissimilar).
 The higher the dissimilarity value, the more dissimilar the two objects are.

Similarity and Dissimilarity
 Similarity
 Numerical measure of how alike two data objects

are.
 Is higher when objects are more alike.

 Often falls in the range [0,1]

 Dissimilarity
 Numerical measure of how different two data objects are
 Lower when objects are more alike

 Minimum dissimilarity is often 0

 Upper limit varies

 Proximity refers to a similarity or dissimilarity

Calculate Proximity Measure for Nominal Attributes
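The worked examples for these slides are not preserved in the extraction. In the usual textbook treatment, the dissimilarity between two objects i and j described by nominal attributes is d(i, j) = (p − m) / p, where p is the total number of nominal attributes and m is the number of attributes on which i and j match; sim(i, j) = 1 − d(i, j). A minimal sketch, assuming that convention and hypothetical customer attributes:

```python
def nominal_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p for two objects described only by nominal attributes."""
    p = len(obj_i)                                   # total number of nominal attributes
    m = sum(a == b for a, b in zip(obj_i, obj_j))    # number of attributes that match
    return (p - m) / p

# Hypothetical records: (hair_color, marital_status, occupation)
i = ("black", "single",  "teacher")
j = ("black", "married", "teacher")
print(nominal_dissimilarity(i, j))        # 0.333... (1 mismatch out of 3 attributes)
print(1 - nominal_dissimilarity(i, j))    # similarity = 0.666...
```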

Data Matrix and Dissimilarity Matrix

 Data matrix
  n data points with p dimensions; two modes (rows are objects, columns are attributes):

  $$\begin{pmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{pmatrix}$$

 Dissimilarity matrix
  n data points, but registers only the distances; a triangular matrix; single mode:

  $$\begin{pmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{pmatrix}$$

 sim(i, j) = 1 − d(i, j)
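As an illustration of the two structures, the sketch below builds a dissimilarity matrix from a small data matrix using Euclidean distance as d(i, j); the choice of distance and the data values are assumptions for the example, not from the slides.

```python
import numpy as np

# Data matrix: n = 4 objects (rows), p = 2 dimensions (columns)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [2.0, 1.0],
              [5.0, 5.0]])

n = X.shape[0]
D = np.zeros((n, n))                      # dissimilarity matrix, d(i, i) = 0 on the diagonal
for i in range(n):
    for j in range(i):                    # only the lower triangle needs to be computed
        D[i, j] = np.linalg.norm(X[i] - X[j])
        D[j, i] = D[i, j]                 # mirrored here only for convenient printing

print(D)
# If d is scaled into [0, 1], a similarity can be read off as sim(i, j) = 1 - d(i, j).
```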
Distance Metric
 Distance d (p, q) between two points p and q is a
dissimilarity measure if it satisfies:

1. Positive definiteness:
d (p, q) ≥ 0 for all p and q and
d (p, q) = 0 only if p = q.
2. Symmetry: d (p, q) = d (q, p) for all p and q.
3. Triangle Inequality:
d (p, r) ≤ d (p, q) + d (q, r) for all points p, q,
and r.


Euclidean & Manhattan Distance



Minkowski Distance

 It is a generalization of the Euclidean and Manhattan distances.
 h = 1 gives the Manhattan (city-block, L1) distance. A common example of this is the Hamming distance, which is just the number of bits that differ between two binary vectors.
 h = 2 gives the Euclidean (L2) distance.
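For reference, the formula on this slide is not preserved in the extraction; the conventional definition is d(i, j) = (Σ_f |x_if − x_jf|^h)^(1/h). A minimal sketch under that assumption, with made-up vectors:

```python
def minkowski(x, y, h):
    """Minkowski distance of order h between two numeric vectors."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

x, y = (1, 7, 2), (4, 3, 2)
print(minkowski(x, y, 1))   # Manhattan (city-block) distance: 3 + 4 + 0 = 7
print(minkowski(x, y, 2))   # Euclidean distance: (9 + 16 + 0) ** 0.5 = 5.0
```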

Similarity Between Binary Vectors
 Common situation is that objects, p and q, have only
binary attributes
 Compute similarities using the following quantities
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1

 Simple Matching and Jaccard Coefficients


SMC = number of matches / number of attributes
= (M11 + M00) / (M01 + M10 + M11 + M00)

J = number of 11 matches / number of not-both-zero


attribute values
= (M11) / (M01 + M10 + M11)

SMC versus Jaccard: Example


p= 1000000000
q= 0000001001

M01 = 2 (the number of attributes where p was 0 and q was 1)


M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7

J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
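A small sketch that reproduces the example above; representing the binary vectors as Python lists is just one possible encoding:

```python
def smc_and_jaccard(p, q):
    """Simple Matching Coefficient and Jaccard coefficient for two binary vectors."""
    m11 = sum(a == 1 and b == 1 for a, b in zip(p, q))
    m00 = sum(a == 0 and b == 0 for a, b in zip(p, q))
    m10 = sum(a == 1 and b == 0 for a, b in zip(p, q))
    m01 = sum(a == 0 and b == 1 for a, b in zip(p, q))
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    jac = m11 / (m01 + m10 + m11) if (m01 + m10 + m11) else 0.0
    return smc, jac

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_and_jaccard(p, q))   # (0.7, 0.0), matching the slide
```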

Cosine Similarity
 A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or
phrase in the document.

 Other vector objects: gene features in micro-arrays, …


 Applications: information retrieval, biologic taxonomy, gene feature
mapping, ...
 Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency
vectors), then
cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| ,
where • indicates vector dot product, ||d||: the length of vector d


Example: Cosine Similarity


 cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| ,
where • indicates vector dot product, ||d||: the length of vector d

 Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1•d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.123
cos(d1, d2 ) = 0.94
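The same computation as a small sketch (plain Python, no external libraries assumed):

```python
import math

def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||) for two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))   # 0.94, as in the slide
```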

Why Data Preprocessing?
 Data in the real world is dirty
 incomplete: lacking attribute values, lacking
certain attributes of interest, or containing
only aggregate data
 e.g., occupation=“ ”
 noisy: containing errors or outliers
 e.g., Salary=“-10”
 inconsistent: containing discrepancies in codes
or names
 e.g., Age=“42” Birthday=“03/07/1997”
 e.g., Was rating “1,2,3”, now rating “A, B, C”
 e.g., discrepancy between duplicate records

Major Tasks in Data Preprocessing

 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data transformation
 Normalization and aggregation
 Data reduction
 Obtains reduced representation in volume but produces the
same or similar analytical results
 Data Discretization
 reduce the number of values for a given continuous attribute by
dividing the range of the attribute into intervals. Interval labels
can then be used to replace actual data values

Data Cleaning

 Data cleaning tasks


 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data


Missing Data
 Data is not always available
 E.g., many tuples have no recorded value for
several attributes, such as customer income in sales
data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus
deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the
time of entry
 Missing data may need to be inferred.

How to Handle Missing Data?
 Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
 Fill in the missing value manually: tedious + infeasible?
 Use a global constant to fill in the missing value: e.g., “unknown”, a
new class?!
 Use the attribute mean to fill in the missing value
 Use the attribute mean for all samples belonging to the same class
to fill in the missing value: smarter
 Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or decision tree induction
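A hedged sketch of the mean-based filling strategies using pandas; the column names and values are hypothetical, not taken from the slides:

```python
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30000, None, 52000, None, 48000],
})

# Fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Smarter: fill with the attribute mean of samples belonging to the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(df)
```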

Noisy Data

 Incorrect attribute values may be due to


 faulty data collection instruments

 data entry problems

 data transmission problems

 technology limitation

 inconsistency in naming convention

 Other data problems which require data cleaning


 duplicate records

 incomplete data

 inconsistent data

How to Handle Noisy Data?
 Binning method:
 first sort data and partition into (equi-depth) bins

 then one can smooth by bin means, smooth by bin

median, smooth by bin boundaries, etc.


 Clustering
 detect and remove outliers

 Combined computer and human inspection


 detect suspicious values and check by human

 Regression
 smooth by fitting the data into regression functions


Binning

 Data smoothing techniques are used to


eliminate "noise" and extract real trends and
patterns.
 Binning methods smooth a sorted data value
by consulting its “neighborhood”, i.e. the
values around it.
 These values are distributed into a number of
“buckets” or bins.

Binning
 In the example that follows, the data for price are first sorted and then partitioned into equi-depth bins (three bins of four values each)
 Smoothing by bin means: each value in a bin is
replaced by the mean value of the bin.
 Smoothing by bin medians: each bin value is
replaced by the bin median
 Smoothing by bin boundaries: the minimum and
maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the
closest boundary value
 The larger the bin width, the greater the effect of the smoothing


Binning Methods for Data Smoothing


* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
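A minimal sketch that reproduces the smoothing-by-means and smoothing-by-boundaries results above (equi-depth bins of four values each):

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted
depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

smoothed_by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

smoothed_by_boundaries = [
    [min(b) if abs(v - min(b)) <= abs(v - max(b)) else max(b) for v in b]
    for b in bins
]

print(smoothed_by_means)        # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smoothed_by_boundaries)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```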
Cluster Analysis


Clustering

 Outliers may be detected by clustering, where


similar values are organized into groups or
“clusters”
 The values that fall outside of the set of

clusters may be considered as outliers.


Outlier
 Data points inconsistent with the majority of data
 Different outliers
 Valid: CEO’s salary,
 Noisy: One’s age = 280, widely deviated points
Regression

 Data can be smoothed by fitting the data to a


function, such as regression
 Linear regression involves finding the “best”
line to fit two variables, so that one variable
can be used to predict the other.


Regression
(Figure: data points plotted in the x–y plane with the fitted line y = x + 1; the observed value Y1 at X1 is smoothed to the fitted value Y1′ on the line.)
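A small sketch of smoothing by linear regression with numpy; the data values here are made up for illustration:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 2.8, 4.1, 4.9, 6.2])      # noisy observations

slope, intercept = np.polyfit(x, y, deg=1)    # fit the "best" straight line
y_smoothed = slope * x + intercept            # replace observed values with fitted values
print(slope, intercept)
print(y_smoothed)
```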

Data Transformation:
Normalization

 Different features in the data set may have values in different ranges.
 For example, in an employee data set, the salary feature may range from thousands to lakhs (hundreds of thousands), while the age feature only takes values between about 20 and 60.


Data Transformation:
Normalization

 Normalization: scaled to fall within a small,


specified range
 min-max normalization
 z-score normalization
 normalization by decimal scaling

Data transformation
 Min-max normalization performs a linear
transformation on the original data.
 Suppose that minA and maxA are the minimum
and maximum values of an attribute A. Min-
max transformation maps a value v of A to v’ in
the
range [new_minA,new_maxA] by computing
$$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$$

Example - min-max normalization


 Suppose that the minimum and maximum
values for the attribute income are $12,000 and
$98,000 respectively. We would like to map
an income value of $73,600 to the range [0, 1].
$$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A = \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$$
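A one-function sketch of min-max normalization that reproduces the result above:

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

print(round(min_max_normalize(73600, 12000, 98000), 3))   # 0.716, as in the slide
```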
Example z-score normalization
 In z-score normalization (or zero-mean normalization), the values for an
attribute A are normalized based on the mean and standard deviation of
A. A value v of A is normalized to v’ by computing

$$v' = \frac{v - \bar{A}}{\sigma_A}$$

 where $\bar{A}$ and $\sigma_A$ are the mean and standard deviation, respectively, of attribute A. This method of normalization is useful when the actual minimum and maximum of attribute A are unknown.


Example z-score normalization

 Suppose that the mean and standard deviation


of the values for the attribute income are
$54,000 and $16,000 respectively. With z-score
normalization, a value of $73,600 for income is
transformed to

$$v' = \frac{v - \bar{A}}{\sigma_A} = \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$
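The corresponding sketch for z-score normalization:

```python
def z_score_normalize(v, mean_a, std_a):
    """Zero-mean normalization: v' = (v - mean_A) / std_A."""
    return (v - mean_a) / std_a

print(round(z_score_normalize(73600, 54000, 16000), 3))   # 1.225, as in the slide
```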


Normalization by decimal scaling

 Normalization by decimal scaling normalizes by


moving the decimal point of values of attribute A. The
number of decimal points moved depends on the
maximum absolute value of A. A value v of A is
normalized to v’ by computing

$$v' = \frac{v}{10^{\,j}}$$

 where j is the smallest integer such that Max(|v′|) < 1.
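A small sketch of decimal scaling; the input values are illustrative (a column whose largest absolute value is 986, so j = 3):

```python
def decimal_scale(values):
    """Divide every value by 10**j, with the smallest j such that max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values], j

print(decimal_scale([917, -986, 45]))   # ([0.917, -0.986, 0.045], 3)
```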

Data Integration
 Data integration:
 combines data from multiple sources into a coherent
store
 Schema integration
 integrate metadata from different sources

 Entity identification problem: identify real world entities


from multiple data sources, e.g., A.cust-id ≡ B.cust-#
 Detecting and resolving data value conflicts
 for the same real world entity, attribute values from
different sources are different
 possible reasons: different representations, different
scales, e.g., metric vs. British units

Handling Redundant Data
in Data Integration
 Redundant data occur often when integration of multiple
databases
 The same attribute may have different names in
different databases
 One attribute may be a “derived” attribute in another
table, e.g., annual revenue
 Redundant data may be detected by correlation analysis
 Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality

The Question

 Are two variables related?


 Does one increase as the other increases?

 e. g. Web_Page_hits and Web_Log_size


 Does one decrease as the other increases?
 e. g. Memory_Space and No_of_Process
 How can we get a numerical measure of the
degree of relationship?


For example (the two variables increase together):
Log file size (MB):  20    21    22    24
Time stamp (AM):     7:00  7:02  7:05  7:10

For example (one decreases as the other increases; consider a maximum log file size of 50 MB):
Log file size (MB):              20  25  30  35
Remaining size in log file (MB): 30  25  20  15

Data Integration

 Some redundancies can be detected by


correlation analysis.
 For example, given two attributes A and B, such analysis can measure how strongly one attribute implies the other, based on the available data. The correlation between attributes A and B can be measured by

$$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n - 1)\,\sigma_A \sigma_B}$$
Data Integration

 where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective mean values of A and B, and $\sigma_A$, $\sigma_B$ are their standard deviations.


Data Integration Correlation


 If the resulting value of the equation is greater than 0, then A and B are positively correlated, meaning that the values of A increase as the values of B increase.
 The higher the value, the more each attribute implies
each other.
 Hence, a high value may indicate that A (or B) may be
removed as a redundancy. If the resulting value is equal
to 0, then A and B are independent and there is no
correlation between them.
 If the resulting value is less than 0, then A and B are
negatively correlated, where the values of one attribute
increase as the value of other attribute decreases.
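A small sketch of the correlation coefficient above, checked against the negative-correlation log-file example from the earlier slide:

```python
import math

def correlation(a, b):
    """r_{A,B} = sum((A - mean_A)(B - mean_B)) / ((n - 1) * std_A * std_B)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))   # sample standard deviation
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)
    return cov / (std_a * std_b)

log_size  = [20, 25, 30, 35]
remaining = [30, 25, 20, 15]
print(round(correlation(log_size, remaining), 3))   # -1.0: perfectly negatively correlated
```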
Data Reduction Strategies

 Data Warehouse may store terabytes of


data: Complex data analysis/mining may
take a very long time to run on the complete
data set
 Data reduction
 Obtains a reduced representation of the

data set that is much smaller in volume


but yet produces the same (or almost the
same) analytical results

Data Reduction - Hashing

 In addition, most of the forensic tools and


investigating agencies use the hashing library
to compare the files which are examined in the
suspect cases against the known files,
separating relevant files from benign files;
 as a result, the uninteresting files will not be
examined and investigator’s time and effort will
be saved.

Data discretization
 Data discretization converts a large number of data values into a smaller number of values, so that data evaluation and data management become much easier.
 Histogram analysis
 Binning
 Clustering analysis
 Decision tree analysis
 Equal width partitioning
 Equal depth partitioning

Data Reduction Strategies

• Data cube aggregation, where aggregation operations


are applied to the data in the construction of a data
cube.

• Dimension reduction, where irrelevant, weakly


relevant, or redundant attributes or dimensions may
be detected and removed.

• Data compression, where encoding mechanisms are


used to reduce the data set size.

Data Reduction Strategies

 Numerosity reduction, where the data are


replaced or estimated by alternative, smaller data
representations such as parametric models (which
need store only the model parameters instead of
the actual data), or nonparametric methods such
as clustering, sampling, and the use of
histograms.
 Discretization and concept hierarchy generation,
where raw data values for attributes are replaced
by ranges or higher conceptual levels.


Dimensionality Reduction
 Dimensionality reduction reduces the data set size by removing irrelevant, weakly relevant, or redundant attributes (or dimensions) from it

 Stepwise forward selection : the procedure starts with an


empty set of attributes. The best of the original attributes is
determined and added to the set. At each subsequent iteration
or step, the best of the remaining original attributes is added
to the set.

 Stepwise backward elimination: the procedure starts with the


full set of attributes. At each step, it removes the worst attribute remaining in the set.

 Combination of forward selection and backward elimination:


the stepwise forward selection and backward elimination
methods can be combined so that at each step, the procedure
selects the best attribute and removes the worst from among
the remaining attributes.


Feature Extraction
 Three groups of features
  Basic features of individual TCP connections
   source & destination IP – Features 1 & 2
   source & destination port – Features 3 & 4
   protocol – Feature 5
   duration – Feature 6
   bytes per packet – Feature 7
   number of bytes – Feature 8
 (Figure: raw per-connection fields such as dst host, service, and flag are of little use on their own, whereas a constructed feature such as %S0 – the fraction of recent connections with flag S0 – has high information gain and separates syn-flood traffic from normal traffic.)
  Time based features
 For the same source (destination) IP address, number of unique destination (source)
IP addresses inside the network in last T seconds – Features 9 (13)
 Number of connections from source (destination) IP to the same destination (source)
port in last T seconds – Features 11 (15)
 Connection based features
 For the same source (destination) IP address, number of unique destination (source)
IP addresses inside the network in last N connections - Features 10 (14)
 Number of connections from source (destination) IP to the same destination (source)
port in last N connections - Features 12 (16)

Data Mining for Intrusion Detection
 Misuse detection – building predictive models
  A training set of labeled connection records (Tid, SrcIP, start time, Dest IP, Dest Port, number of bytes, attack = Yes/No) is used to learn a classifier model.
  The learned model is then applied to a test set of new connection records whose attack label is unknown ("?").
 Summarization of attacks using association rules; anomaly detection
  Rules discovered: {Src IP = 206.163.37.95, Dest Port = 139, Bytes ∈ [150, 200]} --> {ATTACK}

Data Compression

In data compression, data encodings or transformations are applied so as to obtain a reduced or "compressed" representation of the original data.
If the original data can be reconstructed from the
compressed data without any loss of information,
the data compression technique used is called
lossless.
If, instead, we can reconstruct only an approximation
of the original data, then the data compression
technique is called lossy.

Data Compression
(Figure: with lossless compression the original data can be reconstructed exactly from the compressed data; with lossy compression only an approximation of the original data is recovered.)

Data Compression
 An example of lossless vs. lossy compression is the
following string:
25.888888888
 This string can be compressed as:
25.[9]8
 Interpreted as, "twenty five point 9 eights", the original
string is perfectly recreated, just written in a smaller
form.
 In a lossy system, using 26 instead, the exact original
data is lost, at the benefit of a shorter representation.

Numerosity Reduction

 “can we reduce the data volume by choosing


alternative, ‘smaller’ forms of data representation?”
 A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets. The buckets are displayed on a horizontal axis, while the height (and area) of a bucket typically reflects the average frequency of the values represented by the bucket.
 If each bucket represents only a single
attribute – value/frequency pair, the buckets
are called singleton buckets.

 The following data are a list of prices of commonly


sold items at AllElectronics (rounded to the nearest dollar).
The numbers have been sorted:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.

 Equiwidth: in an equiwidth histogram, the width of each


bucket range is uniform (such as the width of $10 for the
buckets in Figure 3.10)

 Equidepth (or equiheight): in an equidepth histogram, the


buckets are created so that, roughly, the frequency of each
buckets is constant (that is, each bucket contains
roughly the same number of contiguous data
samples)

Simple Discretization Methods: Binning

 Equal-width (distance) partitioning:


 It divides the range into N intervals of equal size:

uniform grid
 if A and B are the lowest and highest values of the

attribute, the width of intervals will be: W = (B-A)/N.


 The most straightforward

 But outliers may dominate presentation

 Skewed data is not handled well.

 Equal-depth (frequency) partitioning:


 It divides the range into N intervals, each containing

approximately the same number of samples


 Good data scaling

 Managing categorical attributes can be tricky.
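A minimal sketch of both partitioning schemes, reusing the price data from the earlier binning example; the helper names are made up:

```python
def equal_width_bins(values, n_bins):
    """Split the value range [min, max] into n_bins intervals of width W = (max - min) / n_bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)   # clamp the maximum value into the last bin
        bins[idx].append(v)
    return bins

def equal_depth_bins(values, n_bins):
    """Split the sorted values into n_bins groups holding roughly the same number of samples."""
    values = sorted(values)
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins - 1)] + [values[(n_bins - 1) * size:]]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(prices, 3))
# [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]] -- skew/outliers make the counts uneven
print(equal_depth_bins(prices, 3))
# [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]] -- roughly equal counts per bin
```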

