Chapter 3: Data Mining

Data Mining & Warehousing

3.1 What is Data Mining?
• Data mining is the process of automatically discovering useful information in large data repositories.
Why do we need Data mining?
• Conventional database systems provide users with query and reporting tools.
• To some extent, query and reporting tools can help answer questions such as: where did the largest number of students come from last year?
• But these tools cannot provide any intelligence about why it happened.
Taking the example of a university database system:
o The OLTP system can quickly answer a query such as “how many students are enrolled in the university?”
o The OLAP system, using the data warehouse, can show trends in student enrollment (ex: how many students prefer BCA).
o Data mining can answer where the university should market.
3.2 Data Mining and Knowledge Discovery:
Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall process of converting raw data into useful information.
The input data can be stored in various formats (flat files, spreadsheets, or relational tables).
The purpose of preprocessing is to transform the raw input data into an appropriate format for subsequent analysis.
Dept of CSE, KLESCET –Shrikant Athanikar Page 1
3.3 Motivating Challenges:
• Traditional analysis techniques have often faced practical difficulties posed by new data sets.
Challenges that motivated the development of data mining:
1) Scalability: Data sets with sizes of terabytes or even petabytes are becoming common. A data mining algorithm that handles such massive data must be scalable, which may require parallel and distributed algorithms.
2) High Dimensionality:
Large data sets often contain hundreds or thousands of attributes, and complexity increases as the number of dimensions grows, so data mining techniques must be able to deal with data containing many dimensions.
Traditional data analysis techniques were developed for low-dimensional data and often do not work well on high-dimensional data.
3) Heterogeneous and Complex Data:
Traditional analysis methods often deal with data sets whose attributes are all of the same type. As businesses grow rapidly, data mining techniques are required that can deal with heterogeneous data.

4) Data Ownership and Distribution: The data is not always stored at one location; it may be scattered across different places and owned by different organizations. This requires distributed data mining techniques, whose main challenges are:
1) How to reduce the amount of communication needed to perform distributed computation
2) How to combine the data mining results obtained from multiple sources
3) How to address data security issues
5) Non-Traditional Analysis: The traditional statistical approach is based on a hypothesize-and-test paradigm. In other words, a hypothesis is proposed, an experiment is designed to gather the data, and then the data is analyzed with respect to the hypothesis. This process is extremely labor-intensive.
3.4 The Origin of Data Mining:
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database
systems
• Traditional techniques may be unsuitable due to:
– Enormity of data
– High dimensionality of data
– Heterogeneous, distributed nature of data
3.5 Data Mining Tasks:
Data mining tasks are generally divided into two major categories:
• Predictive tasks:
– Use some variables to predict unknown or future values of other variables.
– Ex: by observing the values of some variables, we can predict the value of another variable.
– The attribute to be predicted is called the target or dependent variable.
– The attributes used for making the prediction are called the explanatory or independent variables.
• Descriptive tasks:
– The objective is to derive patterns (correlations, anomalies, clusters, etc.) that summarize the relationships in the data.
– Descriptive tasks often require postprocessing of the results to validate and explain them.
Four of the Core data Mining tasks:
1) Predictive Modeling
2) Association analysis
3) Cluster analysis
4) Anomaly detection
1) Predictive Modeling: refers to the task of building a model for the target variable as a function of the explanatory variables.
There are two types of predictive modeling tasks:
1) Classification: used for discrete target variables.
Ex: predicting whether a web user will make a purchase at an online bookstore is a classification task, because the target variable is binary-valued.
2) Regression: used for continuous target variables.
Ex: forecasting the future price of a stock is a regression task, because price is a continuous-valued attribute.
2) Association Analysis: a useful application of association analysis is finding groups of items that are related to each other (for example, items that are frequently bought together).
The goal of association analysis is to extract the most interesting patterns in an efficient manner.
Ex: market basket analysis:
We may discover the rule {Diapers} → {Milk}, which suggests that customers who buy diapers also tend to buy milk.
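How strongly a rule such as {Diapers} → {Milk} holds can be estimated by counting transactions. The sketch below is only an illustration: the transactions are made up, and the two measures it computes, usually called support and confidence, are the standard ones for association rules rather than something defined in these notes.

```python
# Hypothetical market-basket transactions; the items are made up for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing `antecedent`, the fraction that also contain `consequent`."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


rule_support = support({"diapers", "milk"}, transactions)
rule_confidence = confidence({"diapers"}, {"milk"}, transactions)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

In this toy data, four of the five baskets contain diapers and three of those four also contain milk, which is the sense in which such customers "tend to" buy milk.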
3) Cluster Analysis: groups closely related observations together; for example, clustering has been used to group sets of related customers.
Ex: in the collection of news articles in the table below, the first three articles are about the economy and the last three are about the health sector. A good clustering algorithm should be able to identify these two clusters based on the similarity between the words that appear in the articles.
Example (each word is followed by the number of times it appears in the article):
Article  Words
1        Dollar:1, industry:4, country:2, loan:3, government:2
2        Machinery:2, labor:3, market:4, industry:2, work:3, country:1
3        Job:5, inflation:3, rise:2, jobless:2, market:3, country:2
4        Patient:4, symptoms:2, drug:3, health:2, clinic:2, doctor:2
5        Death:2, cancer:4, drug:3, public:4, health:4, director:1
6        Medical:2, cost:3, increase:2, patient:2, health:3, care:2
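One way to see why the six articles fall into two groups is to compute a word-based similarity for every pair. The sketch below uses cosine similarity, which is an assumed choice of measure (the notes do not name one), applied to the word counts from the article table.

```python
import math

# Word counts copied from the article table above (keys lower-cased).
articles = {
    1: {"dollar": 1, "industry": 4, "country": 2, "loan": 3, "government": 2},
    2: {"machinery": 2, "labor": 3, "market": 4, "industry": 2, "work": 3, "country": 1},
    3: {"job": 5, "inflation": 3, "rise": 2, "jobless": 2, "market": 3, "country": 2},
    4: {"patient": 4, "symptoms": 2, "drug": 3, "health": 2, "clinic": 2, "doctor": 2},
    5: {"death": 2, "cancer": 4, "drug": 3, "public": 4, "health": 4, "director": 1},
    6: {"medical": 2, "cost": 3, "increase": 2, "patient": 2, "health": 3, "care": 2},
}


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors stored as dicts."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)


# Print the full 6 x 6 similarity table, one row per article.
for i in articles:
    row = [f"{cosine_similarity(articles[i], articles[j]):.2f}" for j in articles]
    print(i, row)
```

Articles 1 to 3 share words such as "industry", "market" and "country", and articles 4 to 6 share "drug", "patient" and "health", while no word is shared across the two groups, so every cross-group similarity is zero. A clustering algorithm would exploit exactly this structure.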
4) Anomaly Detection: is the task of identifying observations whose characteristics are significantly different from the rest of the data. Such observations are known as anomalies or outliers.
Applications of anomaly detection include fraud detection, network intrusion detection, and the detection of unusual patterns of disease and ecosystem disturbances.
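As a rough illustration of "significantly different from the rest of the data", the sketch below flags values that lie far from the mean. Both the data and the rule (more than 2 standard deviations from the mean) are assumptions made for this example; real anomaly detectors are usually more sophisticated.

```python
import statistics

# Hypothetical daily transaction amounts; the last value is deliberately unusual.
amounts = [10, 12, 11, 13, 12, 95]

mean = statistics.mean(amounts)
std = statistics.pstdev(amounts)  # population standard deviation

# Assumed rule of thumb: flag values more than 2 standard deviations from the mean.
THRESHOLD = 2.0
outliers = [x for x in amounts if abs(x - mean) / std > THRESHOLD]
print(outliers)
```

Note that a single extreme value also inflates the mean and standard deviation, which is one reason this simple rule is only a starting point.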
3.6 Data:
What is Data?
• A collection of data objects and their attributes.
What is an Attribute?
• An attribute is a property or characteristic of an object.
• Examples: eye color of a person, temperature, etc.
• An attribute is also known as a variable, field, characteristic, or feature.
What is an Object?
• A collection of attributes describes an object.
• An object is also known as a record, point, case, sample, entity, or instance.
3.7 Types of Data
1) Attributes and Measures:

Attribute values are numbers or symbols assigned to an attribute.
Distinction between attributes and attribute values:
• The same attribute can be mapped to different attribute values.
Example: height can be measured in feet or meters.
• Different attributes can be mapped to the same set of values.
Example: attribute values for ID and age are integers.
Note: the properties of an attribute can differ from the properties of the values used to represent it. For example, it is meaningful to compute the average age of a group of persons, but not the average of their IDs.
Types of Attributes:
Nominal (values are only distinct names or labels)
Examples: ID numbers, eye color, zip codes
Ordinal (values can be ordered)
Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
Interval (differences between values are meaningful)
Examples: calendar dates, temperatures in Celsius or Fahrenheit
Ratio (both differences and ratios are meaningful)
Examples: temperature in Kelvin, length, time, counts

Properties of Attribute Values:
• The type of an attribute depends on which of the following properties it possesses:
– Distinctness: = ≠
– Order: < >
– Addition: + -
– Multiplication: * /
• Nominal attribute: uses only distinctness
• Ordinal attribute: uses distinctness and order
• Interval attribute: uses distinctness, order, and addition
• Ratio attribute: uses all 4 properties
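The difference between interval and ratio attributes can be checked numerically with the temperature examples given earlier. The sketch below, with made-up temperatures, shows that differences survive a change from Celsius to Kelvin, while ratios do not.

```python
def celsius_to_kelvin(c):
    """Convert a Celsius temperature to Kelvin."""
    return c + 273.15


a_c, b_c = 20.0, 10.0  # hypothetical temperatures in Celsius
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

# Differences are meaningful on an interval scale and survive the unit change.
diff_c = a_c - b_c
diff_k = a_k - b_k

# Ratios are only meaningful on a ratio scale (one with a true zero, like Kelvin).
ratio_c = a_c / b_c  # 2.0, but 20 degrees C is not "twice as hot" as 10 degrees C
ratio_k = a_k / b_k  # about 1.035

print(diff_c, round(diff_k, 2), ratio_c, round(ratio_k, 3))
```

This is why multiplication and division are listed only for ratio attributes: on the Celsius scale the zero point is arbitrary, so the ratio of two values carries no physical meaning.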
Describing Attributes by the nature of Values:
• Discrete Attribute (Integers)
– Has only a finite or countably infinite set of values.
– Examples: zip codes, ID numbers, or the set of words in a collection of documents.
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes.
• Continuous Attribute (Floating point)
– Has real numbers as attribute values.
– Examples: temperature, height, or weight.
– In practice, real values can only be measured and represented using a finite number of digits.
– Continuous attributes are typically represented as floating-point variables.
3.7 Types of Data Sets
• Data sets are grouped into three types:
1) Record data
– Transaction or market basket data
– Data matrix
– Document data or sparse data matrix
2) Graph data
– Data with relationships among objects (World Wide Web)
– Data with objects that are graphs (molecular structures)
3) Ordered data
– Sequential data
– Sequence data or genetic sequence data
– Time series or temporal data
– Spatial data
Three characteristics that apply to many data sets are:
i) Dimensionality
• The dimensionality of a data set is the number of attributes that the objects in the data set possess. Data with a small number of dimensions tends to be qualitatively different from moderate- or high-dimensional data.
• The difficulties associated with analyzing high-dimensional data are sometimes referred to as the curse of dimensionality.
ii) Sparsity
• For data with asymmetric features, most of the attribute values of an object are zero. In practical terms, sparsity is an advantage because usually only the non-zero values need to be stored and manipulated. This results in significant savings in computation time and storage.
• Some data mining algorithms work well for sparse data.
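The storage and computation savings from sparsity can be sketched with a dictionary that keeps only the non-zero entries. The row below is hypothetical, and the dictionary layout is just one simple choice of sparse representation.

```python
# A hypothetical document-term row: mostly zeros, as is typical for sparse data.
dense_row = [0, 0, 3, 0, 0, 0, 1, 0, 0, 0, 0, 2]

# Sparse form: keep only {column index: value} for the non-zero entries.
sparse_row = {i: v for i, v in enumerate(dense_row) if v != 0}


def sparse_dot(a, b):
    """Dot product that only visits the non-zero entries of the smaller vector."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(i, 0) for i, v in a.items())


other = {2: 1, 6: 4, 9: 5}  # another hypothetical sparse row
print(sparse_row)
print(sparse_dot(sparse_row, other))
```

Here 12 stored values shrink to 3, and the dot product touches 3 entries instead of 12; for real document-term data with many thousands of columns the saving is far larger.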
iii) Resolution
• It is frequently possible to obtain data at different levels of resolution, and often the properties of the data are different at different resolutions.
• Ex: the surface of the Earth seems very uneven at a resolution of a few meters, but is relatively smooth at a resolution of tens of kilometers.
• Ex: photo pixels (the higher the pixel resolution, the clearer the image; the lower the resolution, the more blurred the image).
Detailed view on three types of data:
1) Record data:
• A record data set is a collection of data objects (records), each of which consists of a fixed set of data fields (attributes).
• Record data is usually stored in flat files or relational tables.
Types of record data are:
• Transaction or market basket data
• The data matrix
• The sparse data matrix
A) Transaction or Market Basket Data:
• Transaction data is a special type of record data, where each record (transaction) involves a set of items.
B) The Data Matrix:
• If all the data objects have the same set of numeric attributes, the data objects can be thought of as points in a multidimensional space. Such a set of data objects can be interpreted as an M by N matrix, with one row per object and one column per attribute.
C) The Sparse Data Matrix:
• A special case of the data matrix in which most entries are zero, so that only the non-zero values need to be stored.
• Ex: fig (d), document-term matrix.

2) Graph-Based Data:
• A graph can sometimes be a convenient and powerful representation for data.
A. Data with relationships among objects:
– Relationships among objects frequently convey important information. Ex: web pages.
B. Data with objects that are graphs:
– If objects have structure, that is, the objects contain sub-objects that have relationships, then such objects are frequently represented as graphs. Ex: the benzene molecule.
3) Ordered Data:
a. Sequential Data:
Sequential data, also referred to as temporal data, can be thought of as an extension of record data in which each record has a time associated with it.
b. Sequence Data or Genetic Sequence Data:
Sequence data consists of a data set that is a sequence of individual entities, such as a sequence of words or letters.
c. Time Series Data:
Time series data is a special type of sequential data in which each record is a time series, i.e. a series of measurements taken over time.
d. Spatial Data:
Some objects have spatial attributes, such as positions or areas, as well as other types of attributes. An example of spatial data is weather data that is collected for a variety of geographical locations.
3.8 Data Processing

Different Data Processing Techniques are:

1. Aggregation
2. Sampling
3. Dimensionality reduction
4. Feature creation
5. Discretization and binarization
6. Variable transformation
1. Aggregation:
• Combining two or more attributes (or objects) into a single attribute (or object)
Purpose
– Data reduction
• Reduce the number of attributes or objects
– Change of scale
• Cities aggregated into regions, states, countries, etc
– More “stable” data
• Aggregated data tends to have less variability
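The data reduction and change of scale described above can be sketched by rolling daily city-level records up to one total per region. All names and figures below are hypothetical.

```python
from collections import defaultdict

# Hypothetical daily sales records: (city, day, amount).
sales = [
    ("CityA", "day1", 120), ("CityA", "day2", 80),
    ("CityB", "day1", 60), ("CityB", "day2", 140),
    ("CityC", "day1", 200), ("CityC", "day2", 180),
]
# Hypothetical mapping used for the change of scale from cities to regions.
region_of = {"CityA": "North", "CityB": "North", "CityC": "South"}

# Aggregate objects: many (city, day) records collapse into one total per region.
region_totals = defaultdict(int)
for city, _day, amount in sales:
    region_totals[region_of[city]] += amount

print(dict(region_totals))
```

Six records become two, which is the data reduction, and the analysis now happens at the region level rather than the city level.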
2. Sampling
• Sampling is the main technique employed for data selection.
– It is often used for both the preliminary investigation of the data and the final data
analysis.
• Statisticians sample because obtaining the entire set of data of interest is too expensive or
time consuming.
• Sampling is used in data mining because processing the entire set of data of interest is too
expensive or time consuming.
• Analogy: to see whether rice is cooked, we check only one grain of it, not every grain.

Types of Sampling:
• Simple Random Sampling
– There is an equal probability of selecting any particular item.
– There are two variations on random sampling:
1) Sampling without replacement
• As each item is selected, it is removed from the population.
2) Sampling with replacement
• Objects are not removed from the population as they are selected for the sample.
• In sampling with replacement, the same object can be picked more than once.
• Stratified Sampling
– Split the data into several partitions, then draw random samples from each partition.
• Progressive Sampling
– If the proper sample size is difficult to determine, adaptive or progressive sampling is used. These approaches start with a small sample and then increase the sample size until a sample of sufficient size has been obtained.
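The sampling schemes above can be sketched with Python's standard `random` module. The population, the sample sizes, and the even/odd strata are all assumptions chosen only to keep the example small; `stratified_sample` is a hypothetical helper written for this sketch.

```python
import random
from collections import defaultdict

random.seed(0)  # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical population of 100 item IDs

# Simple random sampling without replacement: no item can appear twice.
without_replacement = random.sample(population, k=10)

# Simple random sampling with replacement: the same item may be picked again.
with_replacement = random.choices(population, k=10)


def stratified_sample(records, key, per_group):
    """Partition records by key(record), then draw per_group items from each partition."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    sample = []
    for members in groups.values():
        sample.extend(random.sample(members, k=min(per_group, len(members))))
    return sample


# Hypothetical strata: even vs. odd IDs, five items drawn from each.
stratified = stratified_sample(population, key=lambda x: x % 2, per_group=5)
print(without_replacement)
print(with_replacement)
print(stratified)
```

Stratified sampling guarantees that every partition is represented, which simple random sampling cannot promise when some groups are rare.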
3. Dimensionality Reduction:
The complexity of analyzing data increases as the number of dimensions grows.
Curse of Dimensionality:
• When dimensionality increases, data becomes increasingly sparse in the space that it occupies.
• The definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful.
• As a result, many clustering and classification algorithms have trouble with high-dimensional data, which results in reduced classification accuracy and poor-quality clusters.
Purpose:
– Avoid the curse of dimensionality
– Reduce the amount of time and memory required by data mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce noise
Linear algebra techniques for dimensionality reduction:
– Principal Component Analysis (PCA)
– Singular Value Decomposition (SVD)
– Others: supervised and non-linear techniques
Feature Subset Selection:
• Another way to reduce the dimensionality of data is to use only a subset of the features.
• It might seem that such an approach would lose information, but this is not the case if redundant and irrelevant features are present.
Redundant features
– duplicate much or all of the information contained in one or more other attributes
– Example: purchase price of a product and the amount of sales tax paid
Irrelevant features
– contain no information that is useful for the data mining task at hand
– Example: students' ID is often irrelevant to the task of predicting students' grade point averages
• Techniques for feature selection:
– Brute-force approach:
• Try all possible feature subsets as input to the data mining algorithm
– Embedded approaches:
• Feature selection occurs naturally as part of the data mining algorithm
– Filter approaches:
• Features are selected before the data mining algorithm is run
– Wrapper approaches:
• Use the data mining algorithm as a black box to find the best subset of attributes
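A filter approach can be sketched with the redundant-feature example from above: sales tax duplicates the information in purchase price, so one of the two can be dropped before any mining algorithm is run. The data, the 10% tax rate, the correlation measure, and the 0.95 threshold are all assumptions made for this illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical data: sales tax is a fixed 10% of price, so it is a redundant feature.
features = {
    "price": [100.0, 250.0, 80.0, 400.0, 150.0],
    "sales_tax": [10.0, 25.0, 8.0, 40.0, 15.0],
    "weight_kg": [1.2, 0.4, 3.1, 0.9, 2.2],
}


def drop_redundant(features, threshold=0.95):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


print(drop_redundant(features))
```

Because the selection uses only the data and not the mining algorithm itself, this is a filter approach; a wrapper approach would instead score candidate subsets by running the algorithm on each of them.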

4. Feature Creation:
• Create new attributes that can capture the important information in a data set much more efficiently than the original attributes.
Three general methodologies:
– Feature Extraction
• The creation of a new set of features from the original raw data is known as feature extraction.
– Mapping Data to a New Space
• A totally different view of the data can reveal important and interesting features.
– Feature Construction
• Combining features to get better features than the originals.
5. Discretization and Binarization

– Some data mining algorithms require the data to be in the form of categorical attributes, and algorithms that find association patterns require the data to be in the form of binary attributes.
– Transforming a continuous attribute into a categorical attribute is called discretization.
– Transforming continuous and discrete attributes into one or more binary attributes is called binarization.
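Both transformations can be sketched in a few lines. The example below assumes equal-width bins for discretization and one binary attribute per category for binarization; these are common choices rather than the only ones, and the heights are made up. The helper assumes the values are not all identical.

```python
def equal_width_bins(values, n_bins):
    """Discretize continuous values into bin indices 0..n_bins-1 of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def one_hot(categories, n_categories):
    """Binarize category indices: one binary attribute per category."""
    return [[1 if c == k else 0 for k in range(n_categories)] for c in categories]


heights_cm = [150.0, 158.0, 165.0, 171.0, 180.0]  # hypothetical continuous attribute
bins = equal_width_bins(heights_cm, n_bins=3)  # 0 = short, 1 = medium, 2 = tall
binary = one_hot(bins, n_categories=3)
print(bins)
print(binary)
```

The first step turns a continuous height into one of three categories, and the second turns each category into three binary attributes of which exactly one is set.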
6. Variable transformation
– A variable transformation refers to a transformation that is applied to all the values of a variable (attribute).
– For each object, the transformation is applied to the value of the variable for that object.
– Ex: converting each value to its absolute value.
– Two types of variable transformation:
• Simple function: a simple mathematical function is applied to each value individually. If x is a variable, examples of such transformations include x^k, log x, e^x, 1/x, sin x, or |x|.
• Normalization: the goal of normalization or standardization is to make an entire set of values have a particular property.
• A common example is standardizing a variable: subtracting the mean and dividing by the standard deviation gives a new variable with a mean of 0 and a standard deviation of 1.
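Both kinds of variable transformation can be sketched on a small made-up variable. The standardization step below subtracts the mean and divides by the standard deviation; using the population standard deviation is an assumption of this sketch.

```python
import math
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical variable

# Simple function: apply the same function to every value individually.
log_values = [math.log(x) for x in values]
abs_values = [abs(x) for x in values]

# Standardization: subtract the mean, divide by the standard deviation.
mean = statistics.mean(values)
std = statistics.pstdev(values)  # population standard deviation (an assumption)
standardized = [(x - mean) / std for x in values]

print(mean, std)
print(standardized)
```

After standardization the whole set of values shares a common property, a mean of 0 and a standard deviation of 1, which makes variables measured on different scales comparable.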
3.9 Measures of Similarity and Dissimilarity:
Basics:
• Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
• Dissimilarity
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies

• Similarity and Dissimilarity between Simple Attributes:
Let p and q be the attribute values for two data objects.
With respect to ordinal attributes:
• Consider an attribute that measures the quality of a product, e.g. a candy bar, on the scale {poor, fair, OK, good, wonderful}.
• It would seem reasonable that a product P1, which is rated wonderful, would be closer to a product P2, which is rated good, than it would be to a product P3, which is rated OK.
• To make this observation quantitative, the values of the ordinal attribute are often mapped to successive integers, beginning at 0 or 1, e.g. {poor=0, fair=1, OK=2, good=3, wonderful=4}.
• Then d(P1, P2) = 4 - 3 = 1 or, if we want the dissimilarity to fall between 0 and 1, d(P1, P2) = (4 - 3)/(5 - 1) = 0.25. A similarity for ordinal attributes can then be defined as s = 1 - d.
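The ordinal calculation above can be written as a short function. This is a minimal sketch that maps the scale to the integers 0 to 4 and divides the rank difference by (number of values - 1); the function and constant names are made up for illustration.

```python
SCALE = ["poor", "fair", "OK", "good", "wonderful"]
RANK = {label: i for i, label in enumerate(SCALE)}  # poor=0 ... wonderful=4


def ordinal_dissimilarity(p, q, normalize=True):
    """|rank(p) - rank(q)|, optionally divided by (n - 1) to fall in [0, 1]."""
    d = abs(RANK[p] - RANK[q])
    return d / (len(SCALE) - 1) if normalize else d


def ordinal_similarity(p, q):
    """Similarity defined as s = 1 - d, using the normalized dissimilarity."""
    return 1 - ordinal_dissimilarity(p, q)


print(ordinal_dissimilarity("wonderful", "good"))  # 0.25
print(ordinal_dissimilarity("wonderful", "OK"))    # 0.5
print(ordinal_similarity("wonderful", "good"))     # 0.75
```

As expected, the product rated wonderful is closer to the one rated good (0.25) than to the one rated OK (0.5).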
Possible Questions from This Chapter:
1. What is data mining and why do we need data mining?
2. Write a note on data mining and knowledge discovery.
3. Explain the challenges that motivated the use of data mining.
4. Explain the data mining tasks in detail.
5. Write a note on data, attributes, and objects.
6. Explain in detail the types of attributes.
7. Explain the different types of data sets in detail with proper examples and figures.
8. Explain data processing in detail with examples.
9. Write a note on measures of similarity and dissimilarity.
