Homework 2: The first two pages contain the assignment; pages 3 and 4 contain supporting
material for some of the problems. Consult online resources where you need more clarity.
Data Mining
a. Report the number of attributes of each type (continuous [interval, ratio], categorical
[nominal, ordinal]). Also identify which attributes are input attributes and which are class
attributes (if any). For each attribute type, give a few example attributes and their
values.
b. Compute the five-number summary for the continuous attributes (you may create
boxplots for this). Compute the mode for categorical attributes.
c. Generate the quantile (percentile) plots for two key attributes in each dataset.
d. Generate the histogram or distribution plot for each of the two attributes selected in (c).
e. Generate the scatter plots for the two attributes selected in (c).
f. Compute and visualize the covariance and correlation matrices for the continuous
attributes.
g. Comment on the results regarding characteristics of this database.
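
As a rough illustration of parts (b) and (f), the sketch below computes a five-number summary, a mode, and covariance and correlation matrices using only Python's standard library. The sample values and the helper names (five_number_summary, pairwise_matrix) are invented for illustration and do not come from the assigned datasets. Quartile conventions differ between tools, so the quartiles a boxplot reports may differ slightly.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    data = sorted(values)
    # method="inclusive" interpolates quartiles between observed values,
    # treating the data as the whole population.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))

def pairwise_matrix(columns, fn):
    """Apply fn to every pair of columns; returns a list of rows."""
    names = list(columns)
    return [[fn(columns[a], columns[b]) for b in names] for a in names]

# Made-up illustrative values, not taken from any assigned dataset.
height = [1.0, 2.0, 3.0, 4.0, 5.0]
weight = [2.0, 4.0, 6.0, 8.0, 10.0]
colour = ["red", "blue", "red", "green"]  # a categorical attribute

summary = five_number_summary(height)
mode_colour = statistics.mode(colour)

columns = {"height": height, "weight": weight}
cov_matrix = pairwise_matrix(columns, covariance)
corr_matrix = pairwise_matrix(columns, correlation)

print("five-number summary:", summary)
print("mode:", mode_colour)
print("covariance matrix:", cov_matrix)
print("correlation matrix:", corr_matrix)
```

Because weight is exactly twice height in this toy data, the off-diagonal correlation is 1 up to floating-point rounding, which is a handy sanity check before running the same functions on real attributes.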
2. Data Preprocessing
Download the Communities & Crime
(http://archive.ics.uci.edu/ml/datasets/communities+and+crime) dataset from the UCI repository.
Study the dataset and perform the following tasks:
a. Generate basic statistics on missing values in the dataset (fraction of missing values for
each attribute, fraction of missing values for each object). You may report only the 10
attributes and the 10 objects with the highest fractions of missing values.
b. Fill in the missing values in the data using an appropriate filter.
c. Standardize the dataset to zero mean and unit variance (z-score normalization).
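
The three preprocessing tasks above can be sketched as follows, using only Python's standard library. Several things here are assumptions rather than part of the assignment: the "?" marker for missing entries, mean imputation as the "appropriate filter" (one simple choice among many), the population standard deviation in the z-score, and the tiny invented table standing in for the real data.

```python
import statistics

MISSING = "?"  # assumption: marker used for missing entries in the raw file

def parse(raw_rows):
    """Convert text cells to floats, using None for missing entries."""
    return [[None if c == MISSING else float(c) for c in row] for row in raw_rows]

def missing_fraction_per_attribute(rows):
    n_rows, n_cols = len(rows), len(rows[0])
    return [sum(row[j] is None for row in rows) / n_rows for j in range(n_cols)]

def missing_fraction_per_object(rows):
    return [sum(v is None for v in row) / len(row) for row in rows]

def top_k(fractions, k=10):
    """Indices and fractions of the k entries with the most missing values."""
    ranked = sorted(enumerate(fractions), key=lambda p: p[1], reverse=True)
    return ranked[:k]

def impute_mean(rows):
    """Replace each missing entry with the mean of the observed values in its
    column. Raises StatisticsError if a column has no observed values."""
    n_cols = len(rows[0])
    means = [
        statistics.fmean([row[j] for row in rows if row[j] is not None])
        for j in range(n_cols)
    ]
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]

def standardize(rows):
    """Z-score each column: subtract the mean, divide by the standard deviation.
    Uses the population standard deviation; constant columns map to 0."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Invented toy table (4 objects, 3 attributes), not the real dataset.
raw = [
    ["0.2", "?", "0.5"],
    ["0.4", "0.1", "?"],
    ["0.6", "0.3", "0.7"],
    ["0.8", "?", "0.9"],
]

rows = parse(raw)
attr_missing = missing_fraction_per_attribute(rows)
obj_missing = missing_fraction_per_object(rows)
top_attributes = top_k(attr_missing)
filled = impute_mean(rows)
z_rows = standardize(filled)

print("missing per attribute:", attr_missing)
print("missing per object:", obj_missing)
print("most-missing attributes:", top_attributes)
```

Mean imputation shrinks the variance of heavily imputed attributes, so for attributes that are mostly missing it may be more defensible to drop the attribute than to fill it.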
Attribute type: Nominal
Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
Operations: mode, entropy, contingency correlation, χ² test

Attribute type: Ratio
Description: For ratio variables, both differences and ratios are meaningful. (*, /)
Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
Operations: geometric mean, harmonic mean, percent variation
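
To make the distinction concrete, the short sketch below applies operations that suit each attribute type: mode and entropy for a nominal attribute, geometric and harmonic means for a ratio attribute. The values are invented for illustration, and the entropy is computed in bits (log base 2), which is an assumption about the convention.

```python
import statistics
from collections import Counter
from math import log2

# Nominal attribute: only equality comparisons are meaningful,
# so we can count categories but not average them.
eye_colour = ["brown", "blue", "brown", "green"]
nominal_mode = statistics.mode(eye_colour)

counts = Counter(eye_colour)
n = len(eye_colour)
# Shannon entropy of the category distribution, in bits.
entropy_bits = -sum((c / n) * log2(c / n) for c in counts.values())

# Ratio attribute: a true zero exists, so ratios and multiplicative
# summaries such as the geometric and harmonic mean are meaningful.
mass_kg = [2.0, 8.0]
geo_mean = statistics.geometric_mean(mass_kg)
harm_mean = statistics.harmonic_mean(mass_kg)

print("mode:", nominal_mode)
print("entropy (bits):", entropy_bits)
print("geometric mean:", geo_mean)
print("harmonic mean:", harm_mean)
```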