
Data Mining: Concepts and

Techniques

3rd Edition

Solution Manual

Jiawei Han, Micheline Kamber, Jian Pei

The University of Illinois at Urbana-Champaign


Simon Fraser University

Version January 2, 2012


© Morgan Kaufmann, 2011

For instructors' reference only

Contents

Chapter 1 Introduction ...................................................................................................................... 1


1.1 Exercises ............................................................................................................................. 1
Chapter 2 Getting to Know Your Data .............................................................................................. 4
2.1 Exercises ............................................................................................................................. 4
Chapter 3 Data Preprocessing ......................................................................................................... 14
3.1 Exercises ........................................................................................................................... 14
Chapter 4 Data Warehousing and Online Analytical Processing .................................................... 23
4.1 Exercises ........................................................................................................................... 23
Chapter 5 Data Cube Technology ................................................................................................... 28
5.1 Exercises ........................................................................................................................... 28
Chapter 6 Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
........................................................................................................................................................ 31
6.1 Exercises ........................................................................................................................... 31
Chapter 8 Classification: Basic Concepts ....................................................................................... 36
8.1 Exercises ........................................................................................................................... 36
Chapter 9 Classification: Advanced Methods .................................................................................. 39
9.1 Exercises ........................................................................................................................... 39
Chapter 10 Cluster Analysis: Basic Concepts and Methods ........................................................... 40
10.1 Exercises ......................................................................................................................... 40

Chapter 1 Introduction

1.1 Exercises

1.1 What is data mining? In your answer, address the following:


(a) Is it another hype?
(b) Is it a simple transformation or application of technology developed from databases,
statistics, machine learning, and pattern recognition?
(c) We have presented a view that data mining is the result of the evolution of database
technology. Do you think that data mining is also the result of the evolution of
machine learning research? Can you present such views based on the historical
progress of this discipline? Do the same for the fields of statistics and
pattern recognition.
(d) Describe the steps involved in data mining when viewed as a process of knowledge
discovery.
Answer:
Data mining refers to the process or method that extracts or mines interesting
knowledge patterns from large amounts of data.
(a) Is it another hype?
Data mining is not another hype. Instead, the need for data mining has arisen due
to the wide availability of huge amounts of data and the imminent need for turning
such data into useful information and knowledge.
Thus, data mining can be viewed as the result of the natural evolution of
information technology.
(b) Is it a simple transformation or application of technology developed from databases,
statistics, machine learning, and pattern recognition?
No. Data mining is more than a simple transformation of technology developed
from databases, statistics, and machine learning. Instead, data mining involves an
integration, rather than a simple transformation, of techniques from multiple
disciplines such as database technology, statistics, machine learning,
high-performance computing, pattern recognition, neural networks, data

visualization, information retrieval, image and signal processing, and spatial data
analysis.
(c) Explain how the evolution of database technology led to data mining.
Database technology began with the development of data collection and database
creation mechanisms that led to the development of effective mechanisms for data
management including data storage and retrieval, and query and transaction
processing. The large number of database systems offering query and transaction
processing eventually and naturally led to the need for data analysis and
understanding. Hence, data mining began its development out of this necessity.
(d) Describe the steps involved in data mining when viewed as a process of knowledge
discovery.
The steps involved in data mining when viewed as a process of knowledge
discovery are as follows:
 Data cleaning, a process that removes or transforms noise and inconsistent data
 Data integration, where multiple data sources may be combined
 Data selection, where data relevant to the analysis task are retrieved from the
database
 Data transformation, where data are transformed or consolidated into forms
appropriate for mining
 Data mining, an essential process where intelligent and efficient methods are
applied in order to extract patterns
 Pattern evaluation, a process that identifies the truly interesting patterns
representing knowledge based on some interestingness measures
 Knowledge presentation, where visualization and knowledge representation
techniques are used to present the mined knowledge to the user.
1.2 How is a data warehouse different from a database? How are they similar?
Answer:
Differences between a data warehouse and a database: A data warehouse is a repository
of information collected from multiple sources, over a history of time, stored under a
unified schema, and used for data analysis and decision support; whereas a database is
a collection of interrelated data that represents the current status of the stored data. There
could be multiple heterogeneous databases where the schema of one database may not
agree with the schema of another. A database system supports ad-hoc query and on-line

transaction processing. For more details, please refer to the section “Differences
between operational database systems and data warehouses.”
Similarities between a data warehouse and a database: Both are repositories of
information, storing huge amounts of persistent data.
1.5 What is the difference between discrimination and classification? Between
characterization and clustering? Between classification and regression? For each of
these pairs of tasks, how are they similar?
Answer:
Discrimination differs from classification in that the former refers to a comparison of
the general features of target class data objects with the general features of objects from
one or a set of contrasting classes, while the latter is the process of finding a set of
models (or functions) that describe and distinguish data classes or concepts for the
purpose of being able to use the model to predict the class of objects whose class label is
unknown. Discrimination and classification are similar in that they both deal with the
analysis of class data objects.
Characterization differs from clustering in that the former refers to a summarization of
the general characteristics or features of a target class of data while the latter deals with
the analysis of data objects without consulting a known class label. This pair of tasks is
similar in that they both deal with grouping together objects or data that are related or
have high similarity in comparison to one another.
Classification differs from regression in that the former predicts categorical (discrete,
unordered) labels while the latter predicts missing or unavailable, and often numerical,
data values. This pair of tasks is similar in that they both are tools for prediction.

Chapter 2 Getting to Know Your Data

2.1 Exercises

2.2 Suppose that the data for analysis includes the attribute age. The age values for the data
tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) What is the mean of the data? What is the median?
(b) What is the mode of the data? Comment on the data’s modality (i.e., bimodal,
trimodal, etc.)
(c) What is the midrange of the data?
(d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
(e) Give the five-number summary of the data.
(f) Show a boxplot of the data.
(g) How is a quantile-quantile plot different from a quantile plot?
Answer:
(a) What is the mean of the data? What is the median?
The (arithmetic) mean of the data is:
x̄ = (1/n) Σ_{i=1}^{n} x_i = 809/27 ≈ 30.

The median (middle value of the ordered set, as the number of values in the set is
odd) of the data is 25, the 14th value.
(b) What is the mode of the data? Comment on the data's modality (i.e., bimodal,
trimodal, etc.).
This data set has two values that occur with the same highest frequency and is,
therefore, bimodal. The modes (values occurring with the greatest frequency) of the
data are 25 and 35.
(c) What is the midrange of the data?
The midrange (average of the largest and smallest values in the data set) of the data
is: (70+13)/2=41.5
(d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?

The first quartile (corresponding to the 25th percentile) of the data is at
(N+1)/4=(27+1)/4=7, so it is: Q1=20.
The third quartile (corresponding to the 75th percentile) of the data is at
3(N+1)/4=21, so it is: Q3=35.
(e) Give the five-number summary of the data.
The five number summary of a distribution consists of the minimum value, first
quartile, median value, third quartile, and maximum value. It provides a good
summary of the shape of the distribution and for this data is: 13, 20, 25, 35, 70.
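A minimal Python sketch of these computations, using the same (N+1)/4 quartile
positioning rule as above (the helper name quartile is illustrative):

```python
from collections import Counter

age = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
       33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

def quartile(vals, i):
    """Quartile Qi using the (N+1)/4 positioning rule used in this manual."""
    pos = i * (len(vals) + 1) / 4          # 1-based, possibly fractional position
    lo, frac = int(pos), pos - int(pos)
    if frac == 0:
        return vals[lo - 1]
    # weighted average of the two neighboring values, closer position gets more weight
    return (1 - frac) * vals[lo - 1] + frac * vals[lo]

vals = sorted(age)
n = len(vals)
mean = sum(vals) / n                                   # 809/27 ≈ 30
median = vals[n // 2]                                  # 25 (14th value, n is odd)
counts = Counter(vals)
modes = [v for v, c in counts.items() if c == max(counts.values())]   # [25, 35] -> bimodal
midrange = (min(vals) + max(vals)) / 2                 # 41.5
q1, q3 = quartile(vals, 1), quartile(vals, 3)          # 20 and 35
print(mean, median, modes, midrange, q1, q3)
print("five-number summary:", min(vals), q1, median, q3, max(vals))
```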
(f) Show a boxplot of the data.
See Figure 2.1.

(g) How is a quantile-quantile plot different from a quantile plot?


A quantile plot is a graphical method used to show the approximate percentage of
values below or equal to the independent variable in a univariate distribution. Thus,
it displays quantile information for all the data, where the values measured for the
independent variable are plotted against their corresponding quantile.
A quantile-quantile plot however, graphs the quantiles of one univariate distribution
against the corresponding quantiles of another univariate distribution. Both axes
display the range of values measured for their corresponding distribution, and
points are plotted that correspond to the quantile values of the two distributions. A
line (y = x) can be added to the graph along with points representing where the first,
second, and third quantiles lie, in order to increase the graph's informational value.
Points that lie above such a line indicate a correspondingly higher value for the
distribution plotted on the y-axis, than for the distribution plotted on the x-axis at

the same quantile. The opposite effect is true for points lying below this line.
2.3 Suppose that the values for a given set of data are grouped into intervals. The intervals
and corresponding frequencies are as follows.
age frequency
1–5 200
6–15 450
16–20 300
21–50 1500
51–80 700
81–110 44
Compute an approximate median value for the data.
Answer:
First determine the median range:
N=200+450+300+1500+700+44=3194; N/2=1597
∴ 200+450+300=950<1597<2450=950+1500
∴ 21~50 is the median range
L1=20, N= 3194, (Σfreq)l=950, freq median = 1500, width = 30
 N / 2   freq l  3194 / 2  950 
median  L1    width  20     30  32.94
 freq median   1500 
∴ median=32.94 years.
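A minimal sketch of this grouped-data median interpolation; the interval boundaries
and widths follow the frequency table above:

```python
# intervals: (lower bound of the interval, frequency); widths give each interval's length
intervals = [(1, 200), (6, 450), (16, 300), (21, 1500), (51, 700), (81, 44)]
widths    = [5, 10, 5, 30, 30, 30]

N = sum(freq for _, freq in intervals)        # 3194
half = N / 2                                  # 1597

cum = 0
for (low, freq), width in zip(intervals, widths):
    if cum + freq >= half:                    # the median falls in this interval
        L1 = low - 1                          # lower boundary of the median interval (20)
        median = L1 + (half - cum) / freq * width
        break
    cum += freq

print(round(median, 2))                       # 32.94
```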

2.4 Suppose a hospital tested the age and body fat data for 18 randomly selected adults with
the following result

(a) Calculate the mean, median and standard deviation of age and %fat.
(b) Draw the boxplots for age and %fat.
(c) Draw a scatter plot and a q-q plot based on these two variables.
Answer:
(a) Calculate the mean, median and standard deviation of age and %fat.
For the variable age the mean is
mean=(23+23+27+27+39+41+47+49+50+52+54+54+56+57+58+58+60+61)/18
=46.44

The median is
median=(50+52)/2=51
The standard deviation is 12.85.
For the variable %fat, the mean is 28.78. Sorting the %fat values gives:
7.8, 9.5, 17.8, 25.9, 26.5, 27.2, 27.4, 28.8, 30.2, 31.2, 31.4, 32.9, 33.4, 34.1, 34.6,
35.7, 41.2, 42.5
the median is (30.2+31.2)/2=30.7, and the standard deviation is 8.99.
(b) Draw the boxplots for age and %fat.
(18+1)/4=4.75, 3(18+1)/4=14.25,
Age: min=23, max=61, median=51
Q1=270.25+390.75=36, Q3=570.75+580.25=57.25,
Inter-quartile range: IQR=Q3Q1=21.25
%fat: min=9.5, max=42.5, median=30.7
Q1=25.90.25+26.50.75=26.35, Q3=34.10.75+34.60.25=34.225,
Inter-quartile range: IQR=Q3Q1=7.875
See Figure 2.2.

(c) Draw a scatter plot and a q-q plot based on these two variables.
See Figure 2.3.

Note: how to compute quartiles
Quantiles are the values that divide the data, arranged in order of magnitude, into equal parts.
If the data are split into two equal parts, the dividing value is the median; into four equal parts,
the quartiles; into eight equal parts, the octiles; and so on. The quartiles (also called quartile
points) divide the data into four equal parts, each containing 25% of the data. There are three
quartiles: the first quartile Q1 (the lower quartile), the second quartile Q2 (the median), and the
third quartile Q3 (the upper quartile). Quartiles play an important role in statistics; their
computation is described in detail below.
1. Ungrouped data
Step 1: Determine the position of each quartile. The position of Qi is i(n+1)/4, where i = 1, 2, 3
and n is the number of observations.
Step 2: Compute the corresponding quartile from its position.
Example 1: The ages of the 11 members of a math study group are 17, 19, 22, 24, 28, 34, 35,
36, 37, 38. The positions of the three quartiles are: position of Q1 = (11+1)/4 = 3, position of
Q2 = 2(11+1)/4 = 6, position of Q3 = 3(11+1)/4 = 9. The 3rd, 6th, and 9th values are the lower
quartile, the median, and the upper quartile, respectively: Q1 = 22, Q2 = 28, Q3 = 36.
In the example above, (n+1) happens to be a multiple of 4, but in practice it often is not, so a
quartile position can be fractional. In that case the quartile is a weighted average of the values
at the two neighboring integer positions: the closer position receives the larger weight, the
farther position the smaller weight, and the weights sum to 1.
Example 2: For the sorted data 12, 15, 17, 19, 20, 23, 25, 28, 30, 33, 34, 35, 36, 37, the
positions of the three quartiles are: position of Q1 = (14+1)/4 = 3.75, position of Q2 =
2(14+1)/4 = 7.5, position of Q3 = 3(14+1)/4 = 11.25. The quartiles at positions 3.75, 7.5, and
11.25 are therefore:
Q1 = 0.25 × (3rd value) + 0.75 × (4th value) = 0.25×17 + 0.75×19 = 18.5;
Q2 = 0.5 × (7th value) + 0.5 × (8th value) = 0.5×25 + 0.5×28 = 26.5;
Q3 = 0.75 × (11th value) + 0.25 × (12th value) = 0.75×34 + 0.25×35 = 34.25.
2. Grouped (interval) data
Step 1: Accumulate the frequencies upward or downward (upward accumulation is used below).
Step 2: Determine the quartile positions from the cumulative frequencies:
position of Q1 = (Σf+1)/4, position of Q2 = 2(Σf+1)/4, position of Q3 = 3(Σf+1)/4,
where Σf is the total frequency.
Step 3: Compute each quartile with the lower-limit interpolation formula
Qi = Li + ((position of Qi − S_{i−1}) / fi) × di,
where Li is the lower limit of the group containing Qi, fi is the frequency of that group, di is the
width of that group, and S_{i−1} is the cumulative frequency of the groups below it.
Example 3: For the grouped daily-output data of a factory's workers (the frequency table is not
reproduced here), Σf = 164, so the quartile positions are: position of Q1 = (164+1)/4 = 41.25,
position of Q2 = 2(164+1)/4 = 82.5, position of Q3 = 3(164+1)/4 = 123.75. Q1, Q2, and Q3
fall in the third, fourth, and fifth groups of the cumulative distribution, and interpolation gives
approximately Q1 = 70 + (…)×10 = 72.49 kg, Q2 = 80 + (…)×10 = 80.83 kg, and
Q3 = 90 + (…)×10 = 90.96 kg, where the intermediate fractions depend on the group
frequencies in the original table.

2.6 Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.

(b) Compute the Manhattan distance between the two objects.
(c) Compute the Minkowski distance between the two objects, using h =3.
(d) Compute the supremum distance between the two objects.
Answer:
(a) Compute the Euclidean distance between the two objects.
The Euclidean distance is computed using Equation (2.6). Therefore, we have
d = sqrt((22 − 20)² + (1 − 0)² + (42 − 36)² + (10 − 8)²) = sqrt(45) ≈ 6.71
(b) Compute the Manhattan distance between the two objects.


The Manhattan distance is computed using Equation (2.7).
Therefore, we have |22-20|+|1-0|+|42-36|+|10-8|=11
(c) Compute the Minkowski distance between the two objects, using h =3.
The Minkowski distance is d = (Σ_i |x_i − y_i|^h)^(1/h),
where h is a real number such that h ≥ 1.
Therefore, with h = 3, we have d = (2³ + 1³ + 6³ + 2³)^(1/3) = 233^(1/3) ≈ 6.15
(d) Compute the supremum distance between the two objects.


The supremum distance is computed using Equation (2.8).

Therefore, we have a supremum distance of 6.
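A minimal sketch of the four distance computations for these two objects:

```python
x = [22, 1, 42, 10]
y = [20, 0, 36, 8]

diffs = [abs(a - b) for a, b in zip(x, y)]           # [2, 1, 6, 2]

euclidean = sum(d ** 2 for d in diffs) ** 0.5        # sqrt(45) ≈ 6.71
manhattan = sum(diffs)                               # 11
minkowski3 = sum(d ** 3 for d in diffs) ** (1 / 3)   # 233^(1/3) ≈ 6.15
supremum = max(diffs)                                # 6

print(euclidean, manhattan, minkowski3, supremum)
```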

2.8 It is important to define or select similarity measures in data analysis. However, there is
no commonly-accepted subjective similarity measure. Results can vary depending on
the similarity measures used. Nonetheless, seemingly different similarity measures may
be equivalent after some transformation. Suppose we have the following
two-dimensional data set:

(a) Consider the data as two-dimensional data points. Given a new data point, x=(1.4,
1.6) as a query, rank the database points based on similarity with the query using
Euclidean distance, Manhattan distance, supremum distance, and cosine
similarity.
(b) Normalize the data set to make the norm of each data point equal to 1. Use Euclidean
distance on the transformed data to rank the data points.
Answer:
(a) Consider the data as two-dimensional data points. Given a new data point, x=(1.4,
1.6) as a query, rank the database points based on similarity with the query using
Euclidean distance, Manhattan distance, supremum distance, and cosine
similarity.
Answer:
Use the Euclidean distance, the Manhattan distance, the supremum distance, and the
cosine similarity formulas to compare the query point with each of the data
points in the data set. Doing so yields the following table.

The values (computed with Matlab), listed as Euclidean distance, Manhattan distance,
supremum distance, and cosine similarity with respect to the query, are:
x1: 0.1414 0.2000 0.1000 1.0000
x2: 0.6708 0.9000 0.6000 0.9958
x3: 0.2828 0.4000 0.2000 1.0000
x4: 0.2236 0.3000 0.2000 0.9990
x5: 0.6083 0.7000 0.6000 0.9654
These values produce the following rankings of the data points based on similarity:
Euclidean distance: x1, x4, x3, x5, x2
Manhattan distance: x1, x4, x3, x5, x2
Supremum distance: x1, x4, x3, x5, x2
Cosine similarity: x1, x3, x4, x2, x5,
(b) Normalize the data set to make the norm of each data point equal to 1. Use Euclidean
distance on the transformed data to rank the data points.
The normalized query is (0.65850, 0.75258). The normalized data set is given by
the following table:

The values (computed with Matlab) are:
x1: 0.6616 0.7498
x2: 0.7250 0.6887
x3: 0.6644 0.7474
x4: 0.6247 0.7809
x5: 0.8321 0.5547
Recomputing the Euclidean distances as before yields

x1: 0.0041
x2: 0.0922
x3: 0.0078
x4: 0.0441
x5: 0.2632
which results in the final ranking of the transformed data points: x1, x3, x4, x2, x5.
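A minimal sketch of the ranking computation. The original data table did not survive
extraction, so the five points below are an assumption taken from the textbook version of
this exercise (they are consistent with the distance table above):

```python
import math

points = {"x1": (1.5, 1.7), "x2": (2.0, 1.9), "x3": (1.6, 1.8),
          "x4": (1.2, 1.5), "x5": (1.5, 1.0)}
q = (1.4, 1.6)

def euclidean(a, b): return math.dist(a, b)
def manhattan(a, b): return sum(abs(u - v) for u, v in zip(a, b))
def supremum(a, b):  return max(abs(u - v) for u, v in zip(a, b))
def cosine(a, b):
    dot = sum(u * v for u, v in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# (a) rank by each measure (distances ascending, cosine similarity descending)
for name, f in [("Euclidean", euclidean), ("Manhattan", manhattan), ("supremum", supremum)]:
    print(name, sorted(points, key=lambda p: f(points[p], q)))     # x1, x4, x3, x5, x2
print("cosine", sorted(points, key=lambda p: -cosine(points[p], q)))  # x1, x3, x4, x2, x5

# (b) normalize every point to unit norm, then rank by Euclidean distance
def normalize(v):
    n = math.hypot(*v)
    return (v[0] / n, v[1] / n)

qn = normalize(q)
print(sorted(points, key=lambda p: euclidean(normalize(points[p]), qn)))  # x1, x3, x4, x2, x5
```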

Chapter 3 Data Preprocessing

3.1 Exercises

3.3 Exercise 2.2 gave the following data (in increasing order) for the attribute age: 13, 15, 16,
16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52,
70.
(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3.
Illustrate your steps. Comment on the effect of this technique for the given data.
(b) How might you determine outliers in the data?
(c) What other methods are there for data smoothing?
Answer:
(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3.
Illustrate your steps. Comment on the effect of this technique for the given data.
The following steps are required to smooth the above data using smoothing by bin
means with a bin depth of 3.
Step1: Sort the data. (This step is not required here as the data are already sorted.)
Step2: Partition the data into equidepth bins of depth 3.
Bin 1: 13, 15, 16 Bin 2: 16, 19, 20 Bin3: 20, 21, 22
Bin 4: 22, 25, 25 Bin 5: 25, 25, 30 Bin 6: 33, 33, 35
Bin 7: 35, 35, 35 Bin 8: 36, 40, 45 Bin 9: 46, 52, 70
Step3: Calculate the arithmetic mean of each bin.
Bin 1: 44/3, 44/3, 44/3 Bin 2: 55/3, 55/3, 55/3 Bin 3: 63/3, 63/3, 63/3
Bin 4: 72/3, 72/3, 72/3 Bin 5: 80/3, 80/3, 80/3 Bin 6: 101/3, 101/3, 101/3
Bin 7: 105/3, 105/3, 105/3 Bin 8: 121/3, 121/3, 121/3 Bin 9: 168/3, 168/3, 168/3
Step4: Replace each of the values in each bin by the arithmetic mean calculated for
the bin.
Bin 1: 14.7, 14.7, 14.7 Bin 2: 18.3, 18.3, 18.3 Bin 3: 21, 21, 21
Bin 4: 24, 24, 24 Bin 5: 26.7, 26.7, 26.7 Bin 6: 33.7, 33.7, 33.7
Bin 7: 35, 35, 35 Bin 8: 40.3, 40.3, 40.3 Bin 9: 56, 56, 56

This method smooths a sorted data value by consulting its “neighborhood”; it
performs local smoothing.
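A minimal sketch of equal-depth binning followed by smoothing by bin means for the age data:

```python
age = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
       33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

depth = 3
bins = [age[i:i + depth] for i in range(0, len(age), depth)]   # data are already sorted

smoothed = []
for b in bins:
    mean = round(sum(b) / len(b), 1)       # e.g. bin 1: 44/3 ≈ 14.7
    smoothed.extend([mean] * len(b))       # replace every value in the bin by its mean

print(bins)
print(smoothed)
```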
(b) How might you determine outliers in the data?
Outliers in the data may be detected by clustering, where similar values are
organized into groups, or “clusters”. Values that fall outside of the set of clusters
may be considered outliers. Alternatively, a combination of computer and human
inspection can be used where a predetermined data distribution is implemented to
allow the computer to identify possible outliers. These possible outliers can then be
verified by human inspection with much less effort than would be required to verify
the entire initial data set.
(c) What other methods are there for data smoothing?
Other methods that can be used for data smoothing include alternate forms of
binning such as smoothing by bin medians or smoothing by bin boundaries.
Alternatively, equiwidth bins can be used to implement any of the forms of binning,
where the interval range of values in each bin is constant. Methods other than
binning include using regression techniques to smooth the data by fitting it to a
function such as through linear or multiple regression. Also, classification
techniques can be used to implement concept hierarchies that can smooth the data
by rolling up lower-level concepts to higher-level concepts.
3.5. What are the value ranges of the following normalization methods?
(a) min-max normalization
(b) z-score normalization
(c) z-score normalization using the mean absolute deviation instead of standard
deviation
(d) normalization by decimal scaling
Answer:
(a) min-max normalization can define any value range and linearly map the original data
to this range. The value range is [new_min, new_max].
(b) z-score normalization normalizes the values of an attribute A based on the mean Ā and
standard deviation σ_A of A. The value range is
[(min_A − Ā)/σ_A, (max_A − Ā)/σ_A].
(c) z-score normalization using the mean absolute deviation is a variation of z-score
normalization that replaces the standard deviation with the mean absolute deviation
of A, denoted by s_A, which is
s_A = (1/n)(|v_1 − Ā| + |v_2 − Ā| + … + |v_n − Ā|).
The value range is
[(min_A − Ā)/s_A, (max_A − Ā)/s_A].
(d) normalization by decimal scaling normalizes by moving the decimal point of the values
of attribute A. The value range is
[min_A/10^j, max_A/10^j],
where j is the smallest integer such that max_i(|v_i|/10^j) ≤ 1.
3.6 Use the methods below to normalize the following group of data:
200, 300, 400, 600, 1000
(a) min-max normalization by setting min=0 and max=1
(b) z-score normalization
(c) z-score normalization using the mean absolute deviation instead of standard
deviation
(d) normalization by decimal scaling
Answer:
(a) min-max normalization by setting min = 0 and max = 1 gets the new value by computing
v'_i = (v_i − 200)/(1000 − 200) × (1 − 0) + 0
The normalized data are:
0, 0.125, 0.25, 0.5, 1
(b) In z-score normalization, a value v_i of A is normalized to v'_i by computing
v'_i = (v_i − Ā)/σ_A,
where
Ā = (1/5)(200 + 300 + 400 + 600 + 1000) = 500 and
σ_A = sqrt[(1/5)(200² + 300² + … + 1000²) − Ā²] = 282.8

The normalized data are: –1.06, –0.707, –0.354, 0.354, 1.77
(c) z-score normalization using the mean absolute deviation instead of standard
deviation replaces σ_A with s_A, where
s_A = (|200 − 500| + |300 − 500| + |400 − 500| + |600 − 500| + |1000 − 500|)/5 = 240.
The normalized data are: −1.25, −0.833, −0.417, 0.417, 2.08


(d) normalization by decimal scaling
The smallest integer j such that max_i(|v_i|/10^j) ≤ 1 is 3. After normalization by
decimal scaling, the data become: 0.2, 0.3, 0.4, 0.6, 1.0
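A minimal sketch of the four normalization methods applied to this data (population
standard deviation, as above):

```python
import math

v = [200, 300, 400, 600, 1000]

# (a) min-max normalization to [0, 1]
mn, mx = min(v), max(v)
minmax = [(x - mn) / (mx - mn) for x in v]                    # 0, 0.125, 0.25, 0.5, 1

# (b) z-score normalization
mean = sum(v) / len(v)                                        # 500
std = math.sqrt(sum((x - mean) ** 2 for x in v) / len(v))     # ≈ 282.8
zscore = [(x - mean) / std for x in v]                        # -1.06, -0.71, -0.35, 0.35, 1.77

# (c) z-score with the mean absolute deviation
mad = sum(abs(x - mean) for x in v) / len(v)                  # 240
zmad = [(x - mean) / mad for x in v]                          # -1.25, -0.83, -0.42, 0.42, 2.08

# (d) decimal scaling: smallest j with max(|v_i|/10^j) <= 1, here j = 3
j = 3
decimal = [x / 10 ** j for x in v]                            # 0.2, 0.3, 0.4, 0.6, 1.0

print(minmax, zscore, zmad, decimal, sep="\n")
```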
3.7 Using the data for age given in Exercise 3.3, answer the following:
(a) Use min-max normalization to transform the value 35 for age onto the range [0.0,
1.0].
(b) Use z-score normalization to transform the value 35 for age, where the standard
deviation of age is 12.94 years.
(c) Use normalization by decimal scaling to transform the value 35 for age.
(d) Comment on which method you would prefer to use for the given data, giving
reasons as to why.
Answer:
(a) Use min-max normalization to transform the value 35 for age onto the range [0.0,
1.0].
Using the corresponding equation with min_A = 13, max_A = 70, new_min_A = 0,
new_max_A = 1.0, then v = 35 is transformed to
v' = (35 − 13)/(70 − 13) × (1.0 − 0) + 0 = 0.3860 ≈ 0.39
(b) Use z-score normalization to transform the value 35 for age, where the standard
deviation of age is 12.94 years.
Using the corresponding equation where Ā = 809/27 = 29.96 and σ_A = 12.94, then
v = 35 is transformed to
v' = (v − Ā)/σ_A = (35 − 29.96)/12.94 = 0.3895 ≈ 0.39

(c) Use normalization by decimal scaling to transform the value 35 for age.
Using the corresponding equation where j = 2, v = 35 is transformed to
v' = v/10^j = 35/10² = 0.35
(d) Comment on which method you would prefer to use for the given data, giving

reasons as to why.
Given the data, one may prefer decimal scaling for normalization as such a
transformation would maintain the data distribution and be intuitive to interpret,
while still allowing mining on specific age groups. Min-max normalization has the
undesired effect of not permitting any future values to fall outside the current
minimum and maximum values without encountering an “out of bounds error”. As
it is probable that such values may be present in future data, this method is less
appropriate. Also, z-score normalization transforms values into measures that
represent their distance from the mean, in terms of standard deviations. It is
probable that this type of transformation would not increase the information value
of the attribute in terms of intuitiveness to users or in usefulness of mining results.

3.8 Using the data for age and body fat given in Exercise 2.4, answer the following:
(a) Normalize the two attributes based on z-score normalization.
(b) Calculate the correlation coefficient (Pearson’s product moment coefficient). Are
these two attributes positively or negatively correlated? Compute their covariance.
Answer:
(a) Normalize the two variables based on z-score normalization.
Age mean:
A=(232+272+39+41+47+49+50+52+542+56+57+582+60+61)/18
=836/18=46.444
Age standard deviation:
σA={[2(23-46.444)2+2(27-46.444)2+(39-46.444)2+(41-46.444)2+(47-46.444)2
+(49-46.444)2+(50-46.444)2+(52-46.444)2+2(54-46.444)2+(56-46.444)2
+(57-46.444)2+2(58-46.444)2+(60-46.444)2+(61-46.444)2]/18}1/2=12.846
Body fat mean:
B=(9.5+26.5+7.8+17.8+31.4+25.9+27.4+27.2+31.2+34.6+42.5+28.8+33.4
+30.2+34.1+32.9+41.2+35.7)/18
=518.1/18=28.783
Body fat standard deviation:
σ_B = sqrt{[(9.5−28.783)² + (26.5−28.783)² + (7.8−28.783)² + (17.8−28.783)² + (31.4−28.783)²
+ (25.9−28.783)² + (27.4−28.783)² + (27.2−28.783)² + (31.2−28.783)² + (34.6−28.783)²
+ (42.5−28.783)² + (28.8−28.783)² + (33.4−28.783)² + (30.2−28.783)² + (34.1−28.783)²
+ (32.9−28.783)² + (41.2−28.783)² + (35.7−28.783)²]/18}
= sqrt(1455.945/18) = sqrt(80.886) = 8.994

age 23 23 27 27 39 41 47 49 50
z-age -1.83 -1.83 -1.51 -1.51 -0.58 -0.42 0.04 0.20 0.28
%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2
z-%fat -2.14 -0.25 -2.33 -1.22 0.29 -0.32 -0.15 -0.18 0.27
age 52 54 54 56 57 58 58 60 61
z-age 0.43 0.59 0.59 0.74 0.82 0.90 0.90 1.06 1.13
%fat 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7
z-%fat 0.65 1.53 0.0 0.51 0.16 0.59 0.46 1.38 0.77

(b) Calculate the correlation coefficient (Pearson’s product moment coefficient). Are
these two variables positively or negatively correlated?
R_{A,B} = [(23−46.444)(9.5−28.783) + (23−46.444)(26.5−28.783) + (27−46.444)(7.8−28.783) +
(27−46.444)(17.8−28.783) + (39−46.444)(31.4−28.783) + (41−46.444)(25.9−28.783) +
(47−46.444)(27.4−28.783) + (49−46.444)(27.2−28.783) + (50−46.444)(31.2−28.783) +
(52−46.444)(34.6−28.783) + (54−46.444)(42.5−28.783) + (54−46.444)(28.8−28.783) +
(56−46.444)(33.4−28.783) + (57−46.444)(30.2−28.783) + (58−46.444)(34.1−28.783) +
(58−46.444)(32.9−28.783) + (60−46.444)(41.2−28.783) +
(61−46.444)(35.7−28.783)]/(18 × 12.846 × 8.994) = 1700.33/(18 × 12.846 × 8.994) = 0.818
The correlation coefficient is 0.82, so the variables are positively correlated. Their
covariance is Cov(A, B) = 1700.33/18 ≈ 94.5.
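A minimal sketch computing the z-scores, the Pearson correlation coefficient, and the
covariance from the age/%fat pairs listed in the table above (population formulas, n = 18):

```python
import math

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
       34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]

n = len(age)
ma, mf = sum(age) / n, sum(fat) / n                              # 46.44, 28.78
sa = math.sqrt(sum((x - ma) ** 2 for x in age) / n)              # ≈ 12.85
sf = math.sqrt(sum((x - mf) ** 2 for x in fat) / n)              # ≈ 8.99

z_age = [(x - ma) / sa for x in age]                             # z-scores of age
z_fat = [(x - mf) / sf for x in fat]                             # z-scores of %fat

cov = sum((x - ma) * (y - mf) for x, y in zip(age, fat)) / n     # ≈ 94.5
corr = cov / (sa * sf)                                           # ≈ 0.82

print(round(cov, 2), round(corr, 2))
```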

3.9 Suppose a group of 12 sales price records has been sorted as follows:
5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215.
Partition them into three bins by each of the following methods.
(a) equal-frequency (equidepth) partitioning
(b) equal-width partitioning
(c) clustering
Answer:
(a) equal-frequency (equidepth) partitioning
Partition the data into equidepth bins of depth 4:
Bin 1: 5, 10, 11, 13 Bin 2: 15, 35, 50, 55 Bin 3: 72, 92, 204, 215
(b) equal-width partitioning
Partitioning the data into 3 equi-width bins will require the width to be (215–
5)/3=70. We get:
Bin 1: 5, 10, 11, 13, 15, 35, 50, 55, 72 Bin 2: 92 Bin 3: 204, 215

(c) clustering
Using K-means clustering to partition the data into three bins we get:
Bin 1: 5, 10, 11, 13, 15, 35 Bin 2: 50, 55, 72, 92 Bin 3: 204, 215
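A minimal sketch of the three partitioning methods; the k-means part is a simple 1-D
version and the initial centers shown are just one reasonable choice:

```python
prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

# (a) equal-frequency (equidepth) bins of depth 4
equidepth = [prices[i:i + 4] for i in range(0, len(prices), 4)]

# (b) equal-width bins of width (215 - 5) / 3 = 70, i.e. boundaries at 75 and 145
width = (max(prices) - min(prices)) / 3
b1, b2 = min(prices) + width, min(prices) + 2 * width
equiwidth = [[p for p in prices if p <= b1],
             [p for p in prices if b1 < p <= b2],
             [p for p in prices if p > b2]]

# (c) a simple 1-D k-means with k = 3; assumes no cluster becomes empty for this data
centers = [5.0, 72.0, 215.0]
for _ in range(10):
    clusters = [[], [], []]
    for p in prices:
        nearest = min(range(3), key=lambda c: abs(p - centers[c]))
        clusters[nearest].append(p)
    centers = [sum(c) / len(c) for c in clusters]

print(equidepth)
print(equiwidth)
print(clusters)
```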
3.11 Using the data for age given in Exercise 3.3,
(a) Plot an equal-width histogram of width 10.
(b) Sketch examples of each of the following sampling techniques: SRSWOR, SRSWR,
cluster sampling, stratified sampling. Use samples of size 5 and the strata youth,
middle-aged, and senior.
Answer:
(a) Plot an equiwidth histogram of width 10.
See Figure 3.4.

(b) Sketch examples of each of the following sampling techniques: SRSWOR, SRSWR,
cluster sampling, stratified sampling. Use samples of size 5 and the strata young,
middle-aged, and senior.
See Figure 3.5.

Chapter 4 Data Warehousing and Online
Analytical Processing

4.1 Exercises

4.4 Suppose that a data warehouse for Big-University consists of the following four
dimensions: student, course, semester, and instructor, and two measures count and
avg_grade. When at the lowest conceptual level (e.g., for a given student, course,
semester, and instructor combination), the avg_grade measure stores the actual course
grade of the student. At higher conceptual levels, avg_grade stores the average grade for
the given combination.
(a) Draw a snowflake schema diagram for the data warehouse.
(b) Starting with the base cuboid [student, course, semester, instructor], what specific
OLAP operations (e.g., roll-up from semester to year) should one perform in order
to list the average grade of CS courses for each Big-University student.
(c) If each dimension has five levels (including all), such as “student < major < status <
university < all”, how many cuboids will this cube contain (including the base and
apex cuboids)?
Answer:
(a) Draw a snowflake schema diagram for the data warehouse.
A snowflake schema is shown in Figure 4.2.

Figure 4.2: A snowflake schema for data warehouse of Exercise 4.4.

(b) Starting with the base cuboid [student, course, semester, instructor], what specific OLAP
operations (e.g., roll-up from semester to year) should one perform in order to list the
average grade of CS courses for each Big-University student.
The specific OLAP operations to be performed are:
• Roll-up on course from course_id to department.
• Roll-up on semester from semester_id to all.
• Slice for course=“CS” .
(c) If each dimension has five levels (including all), such as student < major < status <
university < all, how many cuboids will this cube contain (including the base and apex
cuboids)?
This cube will contain 5^4 = 625 cuboids.

4.5 Suppose that a data warehouse consists of the four dimensions, date, spectator, location,
and game, and the two measures, count and charge, where charge is the fare that a
spectator pays when watching a game on a given date. Spectators may be students,
adults, or seniors, with each category having its own charge rate.
(a) Draw a star schema diagram for the data warehouse.
(b) Starting with the base cuboid [date, spectator, location, game], what specific OLAP
operations should one perform in order to list the total charge paid by student
spectators at GM_Place in 2010?
(c) Bitmap indexing is useful in data warehousing. Taking this cube as an example,
briefly discuss advantages and problems of using a bitmap index structure.
Answer:
(a) Draw a star schema diagram for the data warehouse.

A star schema is shown in Figure 4.3.

Figure 4.3: A star schema for data warehouse of Exercise 4.5.

(b) Starting with the base cuboid [date, spectator, location, game], what specific OLAP
operations should one perform in order to list the total charge paid by student
spectators at GM_Place in 2010?
The specific OLAP operations to be performed are:
• Roll-up on date from date_id to year.
• Roll-up on game from game_id to all.
• Roll-up on location from location_id to location name.
• Roll-up on spectator from spectator_id to status.
• Dice with status=“students”, location name=“GM_Place”, and year = 2010.
(c) Bitmap indexing is useful in data warehousing. Taking this cube as an example,
briefly discuss advantages and problems of using a bitmap index structure.
Bitmap indexing is advantageous for low-cardinality domains. For example, in this
cube, if dimension location is bitmap indexed, comparison, join, and aggregation
operations over location are then reduced to bit arithmetic, which substantially
reduces the processing time. Furthermore, strings of long location names can be
represented by a single bit, which leads to significant reduction in space and I/O.
For dimensions with high cardinality, such as date in this example, the vector to
represent the bitmap index could be very long. For example, a 10-year collection of
data could result in 3650 data records, meaning that every tuple in the fact table
would require 3650 bits, or approximately 456 bytes, to hold the bitmap index.
However, using bit-vector compression techniques can overcome this difficulty to a
certain degree.

4.9 Regarding the computation of measures in a data cube:


(a) Enumerate three categories of measures, based on the kind of aggregate functions
used in computing a data cube.
(b) For a data cube with the three dimensions time, location, and item, which category
does the function variance belong to? Describe how to compute it if the cube is
partitioned into many chunks.

Hint: The formula for computing the variance is (1/N) Σ_{i=1}^{N} (x_i − x̄)², where x̄ is the
average of the x_i.
(c) Suppose the function is “top 10 sales.” Discuss how to efficiently compute this
measure in a data cube.
Answer:
(a) Enumerate three categories of measures, based on the kind of aggregate functions
used in computing a data cube.
The three categories of measures are distributive, algebraic, and holistic.
(b) For a data cube with three dimensions: time, location, and product, which category
does the function variance belong to? Describe how to compute it if the cube is
partitioned into many chunks.
Hint: The formula for computing the variance can be rewritten as
(1/N) Σ_{i=1}^{N} (x_i − x̄)²
= (1/N) [Σ_{i=1}^{N} x_i² − 2x̄ Σ_{i=1}^{N} x_i + N x̄²]
= (1/N) Σ_{i=1}^{N} x_i² − x̄²,
where x̄ = (1/N) Σ_{i=1}^{N} x_i is the average of the x_i. The function variance is algebraic. If
the cube is partitioned into many chunks, the variance can be computed as follows:
Read in the chunks one by one, keeping track of the accumulated
(1) number of tuples,

(2) sum of (x_i)²,
and (3) sum of x_i.
After reading all the chunks, compute the average of the (x_i)² values as the sum of (x_i)²
divided by the total number of tuples. Use the formula as shown in the hint to get
the variance.
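A minimal sketch of this chunked computation; the accumulated quantities are exactly
the count, the sum, and the sum of squares (the example chunks are illustrative):

```python
def chunked_variance(chunks):
    n = s = s2 = 0
    for chunk in chunks:                 # each chunk is read once
        n += len(chunk)
        s += sum(chunk)
        s2 += sum(x * x for x in chunk)
    mean = s / n
    return s2 / n - mean * mean          # (1/N) * sum(x_i^2) - xbar^2

chunks = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]
print(chunked_variance(chunks))          # 4.0, identical to a one-pass computation
```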
(c) Suppose the function is “top 10 sales.” Discuss how to efficiently compute this
measure in a data cube.
For each cuboid, use 10 units to register the top 10 sales found so far. Read the data
in each cuboid once. If the sales amount in a tuple is greater than an existing one in
the top-10 list, insert the new sales amount from the new tuple into the list, and
discard the smallest one in the list. The computation of a higher level cuboid can be
performed similarly by propagation of the top-10 cells of its corresponding lower
level cuboids.
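A minimal sketch of maintaining such a "top 10 sales" measure with a small min-heap,
following the per-cuboid procedure above (the sales amounts are illustrative):

```python
import heapq

def top_k(sales, k=10):
    heap = []                                 # min-heap holding at most k sales amounts
    for amount in sales:
        if len(heap) < k:
            heapq.heappush(heap, amount)
        elif amount > heap[0]:                # larger than the smallest registered value
            heapq.heapreplace(heap, amount)
    return sorted(heap, reverse=True)

# a higher-level cuboid can be computed by merging the top-k lists of its children
print(top_k([120, 5, 300, 42, 77, 310, 18, 95, 260, 230, 88, 410, 12], k=10))
```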

Chapter 5 Data Cube Technology

5.1 Exercises

5.4 Suppose that a base cuboid has three dimensions, A, B, C, with the following number of
cells: |A| =1,000,000, |B| = 100, and |C| = 1000. Suppose that each dimension is evenly
partitioned into 10 portions for chunking.
(a) Assuming each dimension has only one level, draw the complete lattice of the cube.
(b) If each cube cell stores one measure with 4 bytes, what is the total size of the
computed cube if the cube is dense?
(c) State the order for computing the chunks in the cube that requires the least amount of
space, and compute the total amount of main memory space required for computing
the 2-D planes.
Answer:
(a) Assuming each dimension has only one level, draw the complete lattice of the cube.
The complete lattice is shown in Figure 5.1.

Figure 5.1: A complete lattice for the cube of Exercise 5.4.

(b) If each cube cell stores one measure with 4 bytes, what is the total size of the

computed cube if the cube is dense?
The total size of the computed cube is as follows.
• all: 1
• A: 1,000,000; B: 100; C: 1, 000; subtotal: 1,001,100
• AB: 100,000,000; BC: 100,000; AC: 1,000,000,000; subtotal: 1,100,100,000
• ABC:100,000,000,000
• Total: 101,101,101,101 cells × 4 bytes = 404,404,404,404 bytes
(c) State the order for computing the chunks in the cube that requires the least amount of
space, and compute the total amount of main memory space required for computing
the 2-D planes.
The order of computation that requires the least amount of space is B-C-A, as shown
in Figure 5.2.

Figure 5.2: The order of computation in Exercise 5.4 that requires the least amount of space.

The total amount of main memory space required for computing the 2-D planes is:
Total space = (100 × 1,000) (for the whole BC plane) + (1,000,000 × 10) (for one
column of the AB plane)+ (100 × 100,000) (for one chunk of the AC plane) =
20,100,000 cells = 80,400,000 bytes.
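A minimal sketch of this cell-count arithmetic (the chunk-column sizes follow the
reasoning above):

```python
A, B, C = 1_000_000, 100, 1_000          # dimension cardinalities of the base cuboid

cuboids = {"all": 1, "A": A, "B": B, "C": C,
           "AB": A * B, "BC": B * C, "AC": A * C, "ABC": A * B * C}
total_cells = sum(cuboids.values())              # 101,101,101,101 cells
print(total_cells, total_cells * 4, "bytes")     # 404,404,404,404 bytes

# memory for the 2-D planes when computing in the order B-C-A:
# the whole BC plane, one chunk-column of AB, and one chunk of AC
plane_cells = B * C + A * 10 + 100 * 100_000     # 20,100,000 cells
print(plane_cells, plane_cells * 4, "bytes")     # 80,400,000 bytes
```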

Chapter 6 Mining Frequent Patterns,
Associations, and Correlations: Basic
Concepts and Methods

6.1 Exercises

6.6 A database has five transactions. Let min_sup=60% and min_conf=80%.

(a) Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the
efficiency of the two mining processes.
(b) List all of the strong association rules (with support s and confidence c) matching the
following metarule, where X is a variable representing customers, and itemi denotes
variables representing items (e.g., “A”, “B”, etc.):
xtransaction, buys(X, item1)buys(X, item2)buys(X, item3) [s, c]
Answer:
(a) Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the
efficiency of the two mining processes.
Answer:
i. For Apriori, one finds the following frequent itemsets, and candidate itemsets (after
deletion as a result of has_infrequent_subset):
absolute min_sup = 60% × 5 = 3

C1: A:1, C:2, D:1, E:4, I:1, K:5, M:3, N:2, O:3, Y:3
L1: E:4, K:5, M:3, O:3, Y:3
C2: EK:4, EM:2, EO:3, EY:2, KM:3, KO:3, KY:3, MO:1, MY:2, OY:2
L2: EK:4, EO:3, KM:3, KO:3, KY:3
C3: EKO:3
L3: EKO:3
C4: ∅
L4: ∅
Finally resulting in the complete set of frequent itemsets:
{E, K, M, O, Y, EK, EO, KM, KO, KY, EKO}
ii. For FP-growth, one finds the following:
Note: CPB (item) denotes the Conditional Pattern Base of that item, and CFPT (item)
denotes the Conditional FP-Tree of that item.
Scan the transaction database once and find the frequent 1-itemsets (single-item patterns):
L1: E:4, K:5, M:3, O:3, Y:3
Sort the frequent items in frequency-descending order to obtain the f-list: K, E, M, O, Y.
Scan the DB again and construct the FP-tree (described below).
Then derive the conditional FP-trees from the conditional pattern bases:

Header table (item : support count, with node links into the tree): K:5, E:4, M:3, O:3, Y:3
FP-tree: Root → K:5; K:5 → E:4 and M:1; E:4 → M:2 and O:1; M:2 → O:2 → Y:1;
O:1 → Y:1; M:1 → Y:1

Item  Conditional Pattern Base             Conditional FP-tree   Frequent Patterns Generated
Y     {{K,E,M,O:1}, {K,E,O:1}, {K,M:1}}    <K:3>                 {K,Y:3}
O     {{K,E,M:1}, {K,E:2}}                 <K:3, E:3>            {K,O:3}, {E,O:3}, {K,E,O:3}
M     {{K,E:2}, {K:1}}                     <K:3>                 {K,M:3}
E     {{K:4}}                              <K:4>                 {K,E:4}
This finally results in the complete set of frequent itemsets: {{K:5}, {E:4},
{M:3}, {O:3}, {Y:3}, {K,Y:3}, {E,K,O:3}, {K,O:3}, {E,O:3}, {K,M:3}, {E,K:4}}.
The student should identify that FP-growth is more efficient because it is able to
mine in the conditional pattern bases, which may substantially reduce the sizes of
the data sets to be searched. However, when working with small data sets like the
one given (especially when working by hand) the student may feel like Apriori is
more “efficient.”
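A minimal Apriori sketch. The transaction table did not survive extraction, so the five
transactions below are an assumption taken from the textbook version of this exercise
(they reproduce the supports above, with absolute min_sup = 3):

```python
from itertools import combinations

transactions = [set("MONKEY"), set("DONKEY"), set("MAKE"), set("MUCKY"), set("COOKIE")]
min_sup = 3

def support(itemset):
    return sum(itemset <= t for t in transactions)

# L1: frequent single items
items = sorted({i for t in transactions for i in t})
Lk = [frozenset([i]) for i in items if support({i}) >= min_sup]

frequent = list(Lk)
k = 2
while Lk:
    # candidate generation: join Lk with itself, keep size-k sets whose (k-1)-subsets
    # are all frequent (the has_infrequent_subset pruning)
    candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
    candidates = {c for c in candidates
                  if all(frozenset(s) in set(Lk) for s in combinations(c, k - 1))}
    Lk = [c for c in candidates if support(c) >= min_sup]
    frequent += Lk
    k += 1

print([("".join(sorted(f)), support(f)) for f in frequent])
# E:4, K:5, M:3, O:3, Y:3, EK:4, EO:3, KM:3, KO:3, KY:3, EKO:3
```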
(b) List all of the strong association rules (with support s and confidence c) matching the
following metarule, where X is a variable representing customers, and itemi denotes
variables representing items (e.g., “A”, “B”, etc.):
xtransaction, buys(X, item1)buys(X, item2)buys(X, item3) [s, c]
The following strong association rules are found:
xtransaction, buys(X, “E”)buys(X, “O”)buys(X, “K”)
[0.6=min_sup, 1.0>min_conf]
xtransaction, buys(X, “K”)buys(X, “O”)buys(X, “E”)
[0.6=min_sup, 1.0>min_conf]
However, the following strong association rules are not exist

33
xtransaction, buys(X, “E”)buys(X, “K”)buys(X, “O”)
[0.6=min_sup, 0.75<min_conf]
Because the confidence is 3/4=75%,is smaller than min_conf.
6.14 The following contingency table summarizes supermarket transaction data, where
hotdogs refers to the transactions containing hot dogs, ¬hotdogs refers to the
transactions that do not contain hot dogs, hamburgers refers to the transactions
containing hamburgers, and ¬hamburgers refers to the transactions that do not
contain hamburgers.

(a) Suppose that the association rule “hot dogshamburgers” is mined. Given a
minimum support threshold of 25% and a minimum confidence threshold of 50%,
is this association rule strong?
(b) Based on the given data, is the purchase of hot dogs independent of the purchase of
hamburgers? If not, what kind of correlation relationship exists between the two?
(c) Compare the use of the all_confidence, max_confidence, Kulczynski,and cosine
measures with lift and correlation on the given data.
Answer:
(a) Suppose that the association rule “hotdogshamburgers” is mined. Given a
minimum support threshold of 25% and a minimum confidence threshold of 50%,
is this association rule strong?
For the rule, support=2000/5000=40%, and confidence=2000/3000=66.7%.
Therefore, the association rule is strong.
(b) Based on the given data, is the purchase of hotdogs independent of the purchase of
hamburgers? If not, what kind of correlation relationship exists between the two?
Corr{hotdog, hamburger} = P({hot dog, hamburger})/(P({hot dog}) × P({hamburger}))
= 0.4/(0.6 × 0.5) = 1.33 > 1
So, the purchase of hotdogs is NOT independent of the purchase of hamburgers.
There exists a POSITIVE correlation between the two.
Notice: Corr{hotdog, hamburger} is calculated by using “lift”
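A minimal sketch of the support, confidence, and lift computations above, assuming the
contingency counts consistent with the figures used in the answer (2,000 transactions with
both items, 3,000 with hot dogs, 2,500 with hamburgers, 5,000 in total):

```python
n_total = 5000
n_hotdog = 3000
n_hamburger = 2500
n_both = 2000

support = n_both / n_total                                # 0.40
confidence = n_both / n_hotdog                            # ≈ 0.667
lift = (n_both / n_total) / ((n_hotdog / n_total) * (n_hamburger / n_total))   # ≈ 1.33

strong = support >= 0.25 and confidence >= 0.50           # True: the rule is strong
print(support, confidence, lift, strong)
```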
(c) Compare the use of the all_confidence, max_confidence, Kulczynski,and cosine
measures with lift and correlation on the given data.
This subproblem is omitted from your homework.

Chapter 8 Classification: Basic Concepts

8.1 Exercises

8.7. The following table consists of training data from an employee database. The data have
been generalized. For example, “31…35” for age represents the age range of 31 to 35.
For a given row entry, count represents the number of data tuples having the values for
department, status, age, and salary given in that row.

department status age salary count


sales senior 31…35 46K…50K 30
sales junior 26…30 26K…30K 40
sales junior 31…35 31K…35K 40
systems junior 21…25 46K…50K 20
systems senior 31…35 66K…70K 5
systems junior 26…30 46K…50K 3
systems senior 41…45 66K…70K 3
marketing senior 36…40 46K…50K 10
marketing junior 31…35 41K…45K 4
secretary senior 46…50 36K…40K 4
secretary junior 26…30 26K…30K 6
Let status be the class label attribute.
(a) How would you modify the basic decision tree algorithm to take into consideration
the count of each generalized data tuple (i.e., of each row entry)?
(b) Use your algorithm to construct a decision tree from the given data.
(c) Given a data tuple having the values “systems,” “26…30,” and “46K…50K” for the
attributes department, age, and salary, respectively, what would a naïve Bayesian
classification of the status for the tuple be?

Answer:
(a) How would you modify the basic decision tree algorithm to take into consideration
the count of each generalized data tuple (i.e., of each row entry)?
The basic decision tree algorithm should be modified as follows to take into
consideration the count of each generalized data tuple.
 The count of each tuple must be integrated into the calculation of the attribute
selection measure (such as information gain).
 Take the count into consideration to determine the most common class among
the tuples.
(b) Use your algorithm to construct a decision tree from the given data.
The resulting tree is:
(salary=26K...30K:
junior
=31K...35K:
junior
=36K...40K:
senior
=41K...45K:
junior
=46K...50K (department=secretary:
junior
=sales:
senior
=systems:
junior
=marketing:
senior)
=66K...70K:
senior)
(c) Given a data tuple having the values “systems,” “26…30,” and “46K…50K” for the
attributes department, age, and salary, respectively, what would a naïve Bayesian
classification of the status for the tuple be?
Answer: The status can be classified into 2 classes:
C1: Status = ‘junior’, P(junior) = 113/165 = 0.6848;
C2: Status = ‘senior’, P(senior) = 52/165 = 0.3152;

Suppose that all attributes are independent, so the class-conditional probability of X
is the product of the class-conditional probabilities of its individual attribute values.
These per-attribute class-conditional probabilities are:
P(systems|junior)=(20+3)/(40+40+20+3+4+6)=23/113=0.2035;
P(26-30|junior)=(40+3+6)/113=49/113=0.4336;
P(46K-50K|junior)=(20+3)/113=23/113=0.2035;
∵ X=(department=system, age=26…30,salary=46K…50K);
∴ P(X|junior) = P(systems|junior) × P(26-30|junior) × P(46K-50K|junior)
= 23×49×23/113³ = 25921/1442897 = 0.01796;
P(systems|senior)=(5+3)/(30+5+3+10+4)=8/52=0.1538;
P(26-30|senior)=(0)/52=0;
P(46K-50K|senior)=(30+10)/52=40/52=0.7692;
Because the sub-probability P(26-30|senior) is 0, the class-conditional probability
P(X|senior) would be wrongly computed as 0. It is necessary to use the Laplacian
correction (or Laplacian estimator):
P(systems|senior)=(8+1)/(52+3)=9/55=0.1636;
P(26-30|senior)=(0+1)/(52+3)=1/55=0.01818;
P(46K-50K|senior)=(40+1)/(52+3)=41/55=0.7454;
∵ X=(department=system,age=26…30, salary=46K…50K);

P(X|senior) = P(systems|senior) × P(26-30|senior) × P(46K-50K|senior)
= 9×1×41/55³ = 369/166375 = 0.002218;
∴ P(X|junior)P(junior) = 0.01796×0.6848 = 0.0123 >
0.000699 = 0.002218×0.3152 = P(X|senior)P(senior);
Thus, the naïve Bayesian classifier predicts “junior”.
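A minimal sketch of this naïve Bayesian computation, using the generalized tuples with
their counts and applying the Laplacian correction to a class exactly as done above (only
when a zero count appears for that class):

```python
rows = [  # (department, age, salary, status, count)
    ("sales", "31-35", "46-50K", "senior", 30), ("sales", "26-30", "26-30K", "junior", 40),
    ("sales", "31-35", "31-35K", "junior", 40), ("systems", "21-25", "46-50K", "junior", 20),
    ("systems", "31-35", "66-70K", "senior", 5), ("systems", "26-30", "46-50K", "junior", 3),
    ("systems", "41-45", "66-70K", "senior", 3), ("marketing", "36-40", "46-50K", "senior", 10),
    ("marketing", "31-35", "41-45K", "junior", 4), ("secretary", "46-50", "36-40K", "senior", 4),
    ("secretary", "26-30", "26-30K", "junior", 6),
]
query = ("systems", "26-30", "46-50K")

total = sum(r[4] for r in rows)                          # 165

def score(status):
    class_total = sum(r[4] for r in rows if r[3] == status)       # 113 or 52
    counts = [sum(r[4] for r in rows if r[3] == status and r[i] == v)
              for i, v in enumerate(query)]
    denom = class_total
    if 0 in counts:                                      # Laplacian correction
        counts = [c + 1 for c in counts]                 # add 1 to every count and one
        denom += len(counts)                             # per corrected value to the denominator
    p = class_total / total                              # prior P(status)
    for c in counts:
        p *= c / denom
    return p

print({s: round(score(s), 6) for s in ("junior", "senior")})
# {'junior': 0.012303, 'senior': 0.000699} -> predict "junior"
```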

Chapter 9 Classification: Advanced
Methods

9.1 Exercises

Chapter 10 Cluster Analysis: Basic
Concepts and Methods

10.1 Exercises

10.2 Suppose that the data mining task is to cluster points (with (x, y) representing location)
into three clusters, where the points are
A1(2, 10), A2(2, 5), A3(8, 4), B1(5, 8), B2(7, 5), B3(6, 4), C1(1, 2), C2(4, 9).
The distance function is Euclidean distance. Suppose initially we assign A1, B1,
and C1 as the center of each cluster, respectively. Use the k-means algorithm to
show only
(a) The three cluster centers after the first round of execution.
(b) The final three clusters.
Answer:
(a) The three cluster centers after the first round of execution.
Answer:
1) First, calculate the Euclidean distance matrix between the points and the centers.
Each row corresponds to one of the points A1 to C2, and each column to one of the
centers, which initially are Cr1 = A1, Cr2 = B1, and Cr3 = C1.
Table 10-2-1
Cr1=A1 Cr2=B1 Cr3=C1
A1 0.0 3.6056 8.0623
A2 5.0 4.2426 3.1623
A3 8.4853 5.0 7.2801
B1 3.6056 0.0 7.2111
B2 7.0711 3.6056 6.7082
B3 7.2111 4.1231 5.3852
C1 8.0623 7.2111 0.0
C2 2.2361 1.4142 7.6158


2) Second, assign every point to the nearest of the three cluster centers.
Table 10-2-2
Cr1=A1 Cr2=B1 Cr3=C1
A1 0.0(A1) 3.6056 8.0623
A2 5.0 4.2426 3.1623(A2)
A3 8.4853 5.0(A3) 7.2801
B1 3.6056 0.0(B1) 7.2111
B2 7.0711 3.6056(B2) 6.7082
B3 7.2111 4.1231(B3) 5.3852
C1 8.0623 7.2111 0.0(C1)
C2 2.2361 1.4142(C2) 7.6158


After the first round, the three new clusters are: (1) {A1}, (2)
{B1, A3, B2, B3, C2}, (3) {C1, A2}.
3) Then create the next cluster centers Cr1’, Cr2’, and Cr3’ by computing
each cluster’s mean.
Table 10-2-3 The 2nd round:
Cr1’=A1 Cr2’=(6.0 ,6.0) Cr3’=(1.5, 3.5)
A1 0.0(A1) 5.6569 6.5192
A2 5.0 4.1231 1.5811(A2)
A3 8.4853 2.8284(A3) 6.5192
B1 3.6056 2.2361(B1) 5.7009
B2 7.0711 1.4142(B2) 5.7009
B3 7.2111 2.0(B3) 4.5277
C1 8.0623 6.4031 1.5811(C1)
C2 2.2361(C2) 3.6056 6.0415


After this computation, the next cluster centers are (1) (2, 10), (2) (6, 6), (3)
(1.5, 3.5).
4) If the set of cluster centers {Cr1’, Cr2’, Cr3’} is the same as the old one {Cr1, Cr2,
Cr3}, the iteration terminates; otherwise return to step 1).
Table 10-2-4 The 3rd round:
Cr1’=(3.0, 9.5) Cr2’=(6.5, 5.25) Cr3’=(1.5, 3.5)
A1 1.1180(A1) 6.5431 6.5192
A2 4.6098 4.5069 1.5811(A2)
A3 7.4330 1.9526(A3) 6.5192
B1 2.5(B1) 3.1325 5.7009
B2 6.0208 0.5590(B2) 5.7009
B3 6.2650 1.3463(B3) 4.5277
C1 7.7621 6.3885 1.5811(C1)
C2 1.1180(C2) 4.5069 6.0415


Table 10-2-5 The 4th round:


Cr1’=(3.6667, 9.0) Cr2’=(7.0, 4.3333) Cr3’=(1.5, 3.5)
A1 1.9437(A1) 7.5572 6.5192
A2 4.3333 5.0442 1.5811(A2)
A3 6.6165 1.0541(A3) 6.5192
B1 1.6667(B1) 4.1767 5.7009
B2 5.2068 0.6667(B2) 5.7009
B3 5.5176 1.0541(B3) 4.5277
C1 7.4907 6.4377 1.5811(C1)
C2 0.3333(C2) 5.5478 6.0415


Table 10-2-6 The 5th round:


Cr1’=(3.6667, 9.0) Cr2’=(7.0, 4.3333) Cr3’=(1.5, 3.5)
A1 1.9437(A1) 7.5572 6.5192
A2 4.3333 5.0442 1.5811(A2)
A3 6.6165 1.0541(A3) 6.5192
B1 1.6667(B1) 4.1767 5.7009
B2 5.2068 0.6667(B2) 5.7009
B3 5.5176 1.0541(B3) 4.5277
C1 7.4907 6.4377 1.5811(C1)
C2 0.3333(C2) 5.5478 6.0415
The results of the 5th round are the same as those of the 4th round, so the iteration
terminates.
(b) The final three clusters.
Answer:
The cluster centers and clusters from the 1st round to the 5th round are summarized below.
Table 10-2-7
1st round   Cr1=(2, 10)        Cr2=(5, 8)         Cr3=(1, 2)
            A1                 A3,B1,B2,B3,C2     A2,C1
2nd round   Cr1=(2.0, 10.0)    Cr2=(6.0, 6.0)     Cr3=(1.5, 3.5)
            A1                 A3,B1,B2,B3,C2     A2,C1
3rd round   Cr1=(3, 9.5)       Cr2=(6.5, 5.25)    Cr3=(1.5, 3.5)
            A1,C2              A3,B1,B2,B3        A2,C1
4th round   Cr1=(3.6667, 9)    Cr2=(7, 4.3333)    Cr3=(1.5, 3.5)
            A1,B1,C2           A3,B2,B3           A2,C1
5th round   Cr1=(3.6667, 9)    Cr2=(7, 4.3333)    Cr3=(1.5, 3.5)
            A1,B1,C2           A3,B2,B3           A2,C1

The final three clusters are: (1) {A1, B1, C2}, (2) {A3, B2, B3}, (3) {A2, C1}.
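A minimal k-means sketch that reproduces this clustering, starting from A1, B1, and C1
as the initial centers and using Euclidean distance (it assumes no cluster becomes empty
for this data):

```python
points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
          "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9)}

centers = [points["A1"], points["B1"], points["C1"]]

while True:
    clusters = [[], [], []]
    for name, (x, y) in points.items():
        d = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centers]   # squared Euclidean distance
        clusters[d.index(min(d))].append(name)
    new_centers = [
        (sum(points[n][0] for n in c) / len(c), sum(points[n][1] for n in c) / len(c))
        for c in clusters
    ]
    if new_centers == centers:          # converged: centers no longer change
        break
    centers = new_centers

print(clusters)    # [['A1', 'B1', 'C2'], ['A3', 'B2', 'B3'], ['A2', 'C1']]
print(centers)     # ≈ [(3.6667, 9.0), (7.0, 4.3333), (1.5, 3.5)]
```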


Figure 10.2: The final cluster distribution produced by k-means. The “+” signs denote the data
tuples; red, green, and blue represent the 1st, 2nd, and 3rd clusters, respectively, and “*” marks
the cluster centers.

Notice: Students’ answers need not include Tables 10-2-2 to 10-2-6 or the figures.

