
Data preprocessing

 Data preprocessing describes any type of processing performed on raw data to prepare it for another processing procedure.
 Commonly used as a preliminary data mining practice, data preprocessing transforms the data into a format that will be more easily and effectively processed.

Data Preprocessing

No quality data, no quality mining results!


"Garbage In, Garbage Out"
 Quality decisions must be based on quality data

 Data warehouse needs consistent integration of

quality data.

Attributes
 Attribute (or dimensions, features, variables):
a data field, representing a characteristic or feature
of a data object.
 E.g., customer_ID, name, address
 Types:
  Nominal
  Binary
  Ordinal
  Numeric: quantitative
   Interval-scaled
   Ratio-scaled

Attribute Types
 Nominal: categories, states, or “names of things”
 Hair_color = {auburn, black, blond, brown, grey, red, white}
 marital status, occupation, ID numbers, zip codes
 Binary
 Nominal attribute with only 2 states (0 and 1)
 Symmetric binary: both outcomes equally important
 e.g., gender
 Asymmetric binary: outcomes not equally important.
 e.g., medical test (positive vs. negative)
 Convention: assign 1 to most important outcome (e.g.,
HIV positive)
 Ordinal
 Values have a meaningful order (ranking) but magnitude
between successive values is not known.
 Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
 Quantity (integer or real-valued)
 Interval
 Measured on a scale of equal-sized units
 Values have order
 E.g., temperature in C˚or F˚, calendar dates
 No true zero-point
 Ratio
 Inherent zero-point
 We can speak of values as being an order of
magnitude larger than the unit of measurement
(10 K˚ is twice as high as 5 K˚).
 e.g., temperature in Kelvin, length, counts,
monetary quantities

Discrete vs. Continuous Attributes


 Discrete Attribute
 Has only a finite or countably infinite set of values

 E.g., zip codes, profession, or the set of words in a

collection of documents
 Sometimes, represented as integer variables

 Note: Binary attributes are a special case of discrete


attributes
 Continuous Attribute
 Has real numbers as attribute values

 E.g., temperature, height, or weight

 Practically, real values can only be measured and


represented using a finite number of digits
 Continuous attributes are typically represented as
floating-point variables
Similarity and Dissimilarity
 Way to assess how alike and unalike objects
are in comparison to one another.
 A store may want to search for clusters of customer objects, resulting in groups of customers with similar characteristics (e.g. similar income, area of residence, and age).
 A similarity measure for two objects, i and j, will typically return the value 0 if the objects are unalike.

Similarity and Dissimilarity

 The higher the similarity value, the greater the similarity between objects (typically a value of 1 indicates complete similarity, i.e., the objects are identical).
 A dissimilarity measure works the opposite way: it returns the value 0 if the objects are the same (and therefore far from being dissimilar).
 The higher the dissimilarity value, the more dissimilar the two objects are.

Similarity and Dissimilarity
 Similarity
 Numerical measure of how alike two data objects

are.
 Is higher when objects are more alike.

 Often falls in the range [0,1]

 Dissimilarity
 Numerical measure of how different two data objects are
 Lower when objects are more alike

 Minimum dissimilarity is often 0

 Upper limit varies

 Proximity refers to a similarity or dissimilarity

Calculate Proximity Measure for Nominal Attributes
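The worked examples for these slides are not preserved in the extraction. In the usual textbook treatment, the dissimilarity between two objects i and j described by nominal attributes is d(i, j) = (p − m) / p, where p is the total number of nominal attributes and m is the number of attributes on which i and j match; sim(i, j) = 1 − d(i, j). A minimal sketch, assuming that convention and hypothetical customer attributes:

```python
def nominal_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p for two objects described only by nominal attributes."""
    p = len(obj_i)                                   # total number of nominal attributes
    m = sum(a == b for a, b in zip(obj_i, obj_j))    # number of attributes that match
    return (p - m) / p

# Hypothetical records: (hair_color, marital_status, occupation)
i = ("black", "single",  "teacher")
j = ("black", "married", "teacher")
print(nominal_dissimilarity(i, j))        # 0.333... (1 mismatch out of 3 attributes)
print(1 - nominal_dissimilarity(i, j))    # similarity = 0.666...
```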

Data Matrix and Dissimilarity Matrix

 Data matrix
  n data points with p dimensions; two modes (rows are objects, columns are attributes):

  $$\begin{pmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{pmatrix}$$

 Dissimilarity matrix
  n data points, but registers only the distances; a triangular matrix; single mode:

  $$\begin{pmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{pmatrix}$$

 sim(i, j) = 1 − d(i, j)
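As an illustration of the two structures, the sketch below builds a dissimilarity matrix from a small data matrix using Euclidean distance as d(i, j); the choice of distance and the data values are assumptions for the example, not from the slides.

```python
import numpy as np

# Data matrix: n = 4 objects (rows), p = 2 dimensions (columns)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [2.0, 1.0],
              [5.0, 5.0]])

n = X.shape[0]
D = np.zeros((n, n))                      # dissimilarity matrix, d(i, i) = 0 on the diagonal
for i in range(n):
    for j in range(i):                    # only the lower triangle needs to be computed
        D[i, j] = np.linalg.norm(X[i] - X[j])
        D[j, i] = D[i, j]                 # mirrored here only for convenient printing

print(D)
# If d is scaled into [0, 1], a similarity can be read off as sim(i, j) = 1 - d(i, j).
```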
Distance Metric
 Distance d (p, q) between two points p and q is a
dissimilarity measure if it satisfies:

1. Positive definiteness:
d (p, q) ≥ 0 for all p and q and
d (p, q) = 0 only if p = q.
2. Symmetry: d (p, q) = d (q, p) for all p and q.
3. Triangle Inequality:
d (p, r) ≤ d (p, q) + d (q, r) for all points p, q,
and r.


Euclidean & Manhattan Distance



Minkowski Distance

 It is a generalization of the Euclidean and Manhattan distances.
 h = 1 gives the Manhattan (city-block, L1) distance. A common example of this is the Hamming distance, which is just the number of bits that differ between two binary vectors.
 h = 2 gives the Euclidean (L2) distance.
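For reference, the formula on this slide is not preserved in the extraction; the conventional definition is d(i, j) = (Σ_f |x_if − x_jf|^h)^(1/h). A minimal sketch under that assumption, with made-up vectors:

```python
def minkowski(x, y, h):
    """Minkowski distance of order h between two numeric vectors."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

x, y = (1, 7, 2), (4, 3, 2)
print(minkowski(x, y, 1))   # Manhattan (city-block) distance: 3 + 4 + 0 = 7
print(minkowski(x, y, 2))   # Euclidean distance: (9 + 16 + 0) ** 0.5 = 5.0
```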

Similarity Between Binary Vectors
 Common situation is that objects, p and q, have only
binary attributes
 Compute similarities using the following quantities
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1

 Simple Matching and Jaccard Coefficients


SMC = number of matches / number of attributes
= (M11 + M00) / (M01 + M10 + M11 + M00)

J = number of 11 matches / number of not-both-zero


attribute values
= (M11) / (M01 + M10 + M11)

SMC versus Jaccard: Example


p= 1000000000
q= 0000001001

M01 = 2 (the number of attributes where p was 0 and q was 1)


M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7

J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
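A small sketch that reproduces the example above; representing the binary vectors as Python lists is just one possible encoding:

```python
def smc_and_jaccard(p, q):
    """Simple Matching Coefficient and Jaccard coefficient for two binary vectors."""
    m11 = sum(a == 1 and b == 1 for a, b in zip(p, q))
    m00 = sum(a == 0 and b == 0 for a, b in zip(p, q))
    m10 = sum(a == 1 and b == 0 for a, b in zip(p, q))
    m01 = sum(a == 0 and b == 1 for a, b in zip(p, q))
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    jac = m11 / (m01 + m10 + m11) if (m01 + m10 + m11) else 0.0
    return smc, jac

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_and_jaccard(p, q))   # (0.7, 0.0), matching the slide
```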

Cosine Similarity
 A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or
phrase in the document.

 Other vector objects: gene features in micro-arrays, …


 Applications: information retrieval, biologic taxonomy, gene feature
mapping, ...
 Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency
vectors), then
cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| ,
where • indicates vector dot product, ||d||: the length of vector d


Example: Cosine Similarity


 cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| ,
where • indicates vector dot product, ||d||: the length of vector d

 Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1•d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.123
cos(d1, d2 ) = 0.94
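The same computation as a small sketch (plain Python, no external libraries assumed):

```python
import math

def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||) for two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))   # 0.94, as in the slide
```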

Why Data Preprocessing?
 Data in the real world is dirty
 incomplete: lacking attribute values, lacking
certain attributes of interest, or containing
only aggregate data
 e.g., occupation=“ ”
 noisy: containing errors or outliers
 e.g., Salary=“-10”
 inconsistent: containing discrepancies in codes
or names
 e.g., Age=“42” Birthday=“03/07/1997”
 e.g., Was rating “1,2,3”, now rating “A, B, C”
 e.g., discrepancy between duplicate records

Major Tasks in Data Preprocessing

 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data transformation
 Normalization and aggregation
 Data reduction
 Obtains reduced representation in volume but produces the
same or similar analytical results
 Data Discretization
 reduce the number of values for a given continuous attribute by
dividing the range of the attribute into intervals. Interval labels
can then be used to replace actual data values

Data Cleaning

 Data cleaning tasks


 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data


Missing Data
 Data is not always available
 E.g., many tuples have no recorded value for
several attributes, such as customer income in sales
data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus
deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the
time of entry
 Missing data may need to be inferred.

How to Handle Missing Data?
 Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
 Fill in the missing value manually: tedious + infeasible?
 Use a global constant to fill in the missing value: e.g., “unknown”, a
new class?!
 Use the attribute mean to fill in the missing value
 Use the attribute mean for all samples belonging to the same class
to fill in the missing value: smarter
 Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or decision tree induction
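A hedged sketch of the mean-based filling strategies using pandas; the column names and values are hypothetical, not taken from the slides:

```python
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30000, None, 52000, None, 48000],
})

# Fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Smarter: fill with the attribute mean of samples belonging to the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(df)
```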

Noisy Data

 Incorrect attribute values may be due to


 faulty data collection instruments

 data entry problems

 data transmission problems

 technology limitation

 inconsistency in naming convention

 Other data problems which require data cleaning


 duplicate records

 incomplete data

 inconsistent data

How to Handle Noisy Data?
 Binning method:
 first sort data and partition into (equi-depth) bins

 then one can smooth by bin means, smooth by bin

median, smooth by bin boundaries, etc.


 Clustering
 detect and remove outliers

 Combined computer and human inspection


 detect suspicious values and check by human

 Regression
 smooth by fitting the data into regression functions


Binning

 Data smoothing techniques are used to


eliminate "noise" and extract real trends and
patterns.
 Binning methods smooth a sorted data value
by consulting its “neighborhood”, i.e. the
values around it.
 These values are distributed into a number of
“buckets” or bins.

Binning
 In the example that follows, the data for price are first sorted and then partitioned into equi-depth bins (three bins of four values each)
 Smoothing by bin means: each value in a bin is
replaced by the mean value of the bin.
 Smoothing by bin medians: each bin value is
replaced by the bin median
 Smoothing by bin boundaries: the minimum and
maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the
closest boundary value
 The larger the bin width, the greater the effect of the smoothing


Binning Methods for Data Smoothing


* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
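A minimal sketch that reproduces the smoothing-by-means and smoothing-by-boundaries results above (equi-depth bins of four values each):

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted
depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

smoothed_by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

smoothed_by_boundaries = [
    [min(b) if abs(v - min(b)) <= abs(v - max(b)) else max(b) for v in b]
    for b in bins
]

print(smoothed_by_means)        # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smoothed_by_boundaries)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```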
Cluster Analysis


Clustering

 Outliers may be detected by clustering, where


similar values are organized into groups or
“clusters”
 The values that fall outside of the set of

clusters may be considered as outliers.


Outlier
 Data points inconsistent with the majority of data
 Different outliers
 Valid: CEO’s salary,
 Noisy: One’s age = 280, widely deviated points
Regression

 Data can be smoothed by fitting the data to a


function, such as regression
 Linear regression involves finding the “best”
line to fit two variables, so that one variable
can be used to predict the other.


Regression
(Figure: data points plotted in the x–y plane with the fitted line y = x + 1; the observed value Y1 at X1 is smoothed to the fitted value Y1′ on the line.)
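A small sketch of smoothing by linear regression with numpy; the data values here are made up for illustration:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 2.8, 4.1, 4.9, 6.2])      # noisy observations

slope, intercept = np.polyfit(x, y, deg=1)    # fit the "best" straight line
y_smoothed = slope * x + intercept            # replace observed values with fitted values
print(slope, intercept)
print(y_smoothed)
```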

Data Transformation:
Normalization

 Different features in the data set may have values in different ranges.
 For example, in an employee data set, the salary feature may range from thousands to lakhs (hundreds of thousands), while the age feature only takes values between about 20 and 60.


Data Transformation:
Normalization

 Normalization: scaled to fall within a small,


specified range
 min-max normalization
 z-score normalization
 normalization by decimal scaling

Data transformation
 Min-max normalization performs a linear
transformation on the original data.
 Suppose that minA and maxA are the minimum
and maximum values of an attribute A. Min-
max transformation maps a value v of A to v’ in
the
range [new_minA,new_maxA] by computing
$$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$$

Example - min-max normalization


 Suppose that the minimum and maximum
values for the attribute income are $12,000 and
$98,000 respectively. We would like to map
an income value of $73,600 to the range [0, 1].
$$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A = \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$$
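A one-function sketch of min-max normalization that reproduces the result above:

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

print(round(min_max_normalize(73600, 12000, 98000), 3))   # 0.716, as in the slide
```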
Example z-score normalization
 In z-score normalization (or zero-mean normalization), the values for an
attribute A are normalized based on the mean and standard deviation of
A. A value v of A is normalized to v’ by computing

$$v' = \frac{v - \bar{A}}{\sigma_A}$$

 where $\bar{A}$ and $\sigma_A$ are the mean and standard deviation, respectively, of attribute A. This method of normalization is useful when the actual minimum and maximum of attribute A are unknown.


Example z-score normalization

 Suppose that the mean and standard deviation


of the values for the attribute income are
$54,000 and $16,000 respectively. With z-score
normalization, a value of $73,600 for income is
transformed to

$$v' = \frac{v - \bar{A}}{\sigma_A} = \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$
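The corresponding sketch for z-score normalization:

```python
def z_score_normalize(v, mean_a, std_a):
    """Zero-mean normalization: v' = (v - mean_A) / std_A."""
    return (v - mean_a) / std_a

print(round(z_score_normalize(73600, 54000, 16000), 3))   # 1.225, as in the slide
```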


Normalization by decimal scaling

 Normalization by decimal scaling normalizes by


moving the decimal point of values of attribute A. The
number of decimal points moved depends on the
maximum absolute value of A. A value v of A is
normalized to v’ by computing

$$v' = \frac{v}{10^{\,j}}$$

 where j is the smallest integer such that Max(|v′|) < 1.
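A small sketch of decimal scaling; the input values are illustrative (a column whose largest absolute value is 986, so j = 3):

```python
def decimal_scale(values):
    """Divide every value by 10**j, with the smallest j such that max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values], j

print(decimal_scale([917, -986, 45]))   # ([0.917, -0.986, 0.045], 3)
```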

Data Integration
 Data integration:
 combines data from multiple sources into a coherent
store
 Schema integration
 integrate metadata from different sources

 Entity identification problem: identify real world entities


from multiple data sources, e.g., A.cust-id ≡ B.cust-#
 Detecting and resolving data value conflicts
 for the same real world entity, attribute values from
different sources are different
 possible reasons: different representations, different
scales, e.g., metric vs. British units

Handling Redundant Data
in Data Integration
 Redundant data occur often when integration of multiple
databases
 The same attribute may have different names in
different databases
 One attribute may be a “derived” attribute in another
table, e.g., annual revenue
 Redundant data may be detected by correlation analysis
 Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality

The Question

 Are two variables related?


 Does one increase as the other increases?

 e. g. Web_Page_hits and Web_Log_size


 Does one decrease as the other increases?
 e. g. Memory_Space and No_of_Process
 How can we get a numerical measure of the
degree of relationship?


For example (the two variables increase together):
Log file size (MB):  20    21    22    24
Time stamp (AM):     7:00  7:02  7:05  7:10

For example (one decreases as the other increases; consider a maximum log file size of 50 MB):
Log file size (MB):              20  25  30  35
Remaining size in log file (MB): 30  25  20  15

Data Integration

 Some redundancies can be detected by


correlation analysis.
 For example, given two attributes A and B, such analysis can measure how strongly one attribute implies the other, based on the available data. The correlation between attributes A and B can be measured by

$$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n - 1)\,\sigma_A \sigma_B}$$
Data Integration

 where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective mean values of A and B, and $\sigma_A$, $\sigma_B$ are their standard deviations.


Data Integration Correlation


 If the resulting value of the equation is greater than 0, then A and B are positively correlated, meaning that the values of A increase as the values of B increase.
 The higher the value, the more each attribute implies
each other.
 Hence, a high value may indicate that A (or B) may be
removed as a redundancy. If the resulting value is equal
to 0, then A and B are independent and there is no
correlation between them.
 If the resulting value is less than 0, then A and B are
negatively correlated, where the values of one attribute
increase as the value of other attribute decreases.
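A small sketch of the correlation coefficient above, checked against the negative-correlation log-file example from the earlier slide:

```python
import math

def correlation(a, b):
    """r_{A,B} = sum((A - mean_A)(B - mean_B)) / ((n - 1) * std_A * std_B)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))   # sample standard deviation
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)
    return cov / (std_a * std_b)

log_size  = [20, 25, 30, 35]
remaining = [30, 25, 20, 15]
print(round(correlation(log_size, remaining), 3))   # -1.0: perfectly negatively correlated
```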
Data Reduction Strategies

 Data Warehouse may store terabytes of


data: Complex data analysis/mining may
take a very long time to run on the complete
data set
 Data reduction
 Obtains a reduced representation of the

data set that is much smaller in volume


but yet produces the same (or almost the
same) analytical results

Data Reduction - Hashing

 In addition, most of the forensic tools and


investigating agencies use the hashing library
to compare the files which are examined in the
suspect cases against the known files,
separating relevant files from benign files;
 as a result, the uninteresting files will not be
examined and investigator’s time and effort will
be saved.

Data discretization
 Data discretization converts a large number of data values into a smaller number of values, so that data evaluation and data management become much easier.
 Histogram analysis
 Binning
 Clustering analysis
 Decision tree analysis
 Equal width partitioning
 Equal depth partitioning

Data Reduction Strategies

• Data cube aggregation, where aggregation operations


are applied to the data in the construction of a data
cube.

• Dimension reduction, where irrelevant, weakly


relevant, or redundant attributes or dimensions may
be detected and removed.

• Data compression, where encoding mechanisms are


used to reduce the data set size.

Data Reduction Strategies

 Numerosity reduction, where the data are


replaced or estimated by alternative, smaller data
representations such as parametric models (which
need store only the model parameters instead of
the actual data), or nonparametric methods such
as clustering, sampling, and the use of
histograms.
 Discretization and concept hierarchy generation,
where raw data values for attributes are replaced
by ranges or higher conceptual levels.


Dimensionality Reduction
 Dimensionality reduction reduces the data set size by removing irrelevant, weakly relevant, or redundant attributes (or dimensions) from it

 Stepwise forward selection : the procedure starts with an


empty set of attributes. The best of the original attributes is
determined and added to the set. At each subsequent iteration
or step, the best of the remaining original attributes is added
to the set.

 Stepwise backward elimination: the procedure starts with the


full set of attributes. At each step, it removes the worst attribute remaining in the set.

 Combination of forward selection and backward elimination:


the stepwise forward selection and backward elimination
methods can be combined so that at each step, the procedure
selects the best attribute and removes the worst from among
the remaining attributes.


Feature Extraction
 Three groups of features
  Basic features of individual TCP connections
   source & destination IP – Features 1 & 2
   source & destination port – Features 3 & 4
   protocol – Feature 5
   duration – Feature 6
   bytes per packet – Feature 7
   number of bytes – Feature 8
 (Figure: raw per-connection fields such as dst host, service, and flag are of little use on their own, whereas a constructed feature such as %S0 – the fraction of recent connections with flag S0 – has high information gain and separates syn-flood traffic from normal traffic.)
  Time based features
 For the same source (destination) IP address, number of unique destination (source)
IP addresses inside the network in last T seconds – Features 9 (13)
 Number of connections from source (destination) IP to the same destination (source)
port in last T seconds – Features 11 (15)
 Connection based features
 For the same source (destination) IP address, number of unique destination (source)
IP addresses inside the network in last N connections - Features 10 (14)
 Number of connections from source (destination) IP to the same destination (source)
port in last N connections - Features 12 (16)

Data Mining for Intrusion Detection
 Misuse detection – building predictive models
  A training set of labeled connection records (Tid, SrcIP, start time, Dest IP, Dest Port, number of bytes, attack = Yes/No) is used to learn a classifier model.
  The learned model is then applied to a test set of new connection records whose attack label is unknown ("?").
 Summarization of attacks using association rules; anomaly detection
  Rules discovered: {Src IP = 206.163.37.95, Dest Port = 139, Bytes ∈ [150, 200]} --> {ATTACK}

Data Compression

In data compression, data encodings or transformations are applied so as to obtain a reduced or "compressed" representation of the original data.
If the original data can be reconstructed from the
compressed data without any loss of information,
the data compression technique used is called
lossless.
If, instead, we can reconstruct only an approximation
of the original data, then the data compression
technique is called lossy.

Data Compression
(Figure: with lossless compression the original data can be reconstructed exactly from the compressed data; with lossy compression only an approximation of the original data is recovered.)

Data Compression
 An example of lossless vs. lossy compression is the
following string:
25.888888888
 This string can be compressed as:
25.[9]8
 Interpreted as, "twenty five point 9 eights", the original
string is perfectly recreated, just written in a smaller
form.
 In a lossy system, using 26 instead, the exact original
data is lost, at the benefit of a shorter representation.

Numerosity Reduction

 “can we reduce the data volume by choosing


alternative, ‘smaller’ forms of data representation?”
 A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets. The buckets are displayed on a horizontal axis, while the height (and area) of a bucket typically reflects the average frequency of the values represented by the bucket.
 If each bucket represents only a single
attribute – value/frequency pair, the buckets
are called singleton buckets.

 The following data are a list of prices of commonly


sold items at AllElectronics (rounded to the nearest dollar).
The numbers have been sorted:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.

 Equiwidth: in an equiwidth histogram, the width of each


bucket range is uniform (such as the width of $10 for the
buckets in Figure 3.10)

 Equidepth (or equiheight): in an equidepth histogram, the


buckets are created so that, roughly, the frequency of each
buckets is constant (that is, each bucket contains
roughly the same number of contiguous data
samples)

Simple Discretization Methods: Binning

 Equal-width (distance) partitioning:


 It divides the range into N intervals of equal size:

uniform grid
 if A and B are the lowest and highest values of the

attribute, the width of intervals will be: W = (B-A)/N.


 The most straightforward

 But outliers may dominate presentation

 Skewed data is not handled well.

 Equal-depth (frequency) partitioning:


 It divides the range into N intervals, each containing

approximately the same number of samples


 Good data scaling

 Managing categorical attributes can be tricky.
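A minimal sketch of both partitioning schemes, reusing the price data from the earlier binning example; the helper names are made up:

```python
def equal_width_bins(values, n_bins):
    """Split the value range [min, max] into n_bins intervals of width W = (max - min) / n_bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)   # clamp the maximum value into the last bin
        bins[idx].append(v)
    return bins

def equal_depth_bins(values, n_bins):
    """Split the sorted values into n_bins groups holding roughly the same number of samples."""
    values = sorted(values)
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins - 1)] + [values[(n_bins - 1) * size:]]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(prices, 3))
# [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]] -- skew/outliers make the counts uneven
print(equal_depth_bins(prices, 3))
# [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]] -- roughly equal counts per bin
```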

