
Homework-2

1. Data Understanding and Visualization (Part-1)

Homework-2: The first two pages contain the assignment, while pages 3 and 4 contain supporting
material to help solve some of the problems. Consult online resources for further clarity.
Data Mining

2. Data Understanding and Visualization (Part-2)

Download the Wine (https://archive.ics.uci.edu/ml/datasets/wine) and Forest Fires
(https://archive.ics.uci.edu/ml/datasets/forest+fires) datasets from the UCI Machine Learning
Repository. Read the dataset descriptions and report the following:

a. The number of attributes of each type (continuous [interval, ratio], categorical [nominal,
ordinal]). Also identify which attribute(s) are input attribute(s) and which are class
attribute(s) (if any). For each type of attribute give a few example attributes and their
values.
b. Compute the five-number summary for the continuous attributes (you may create
boxplots for this). Compute the mode for categorical attributes.
c. Generate the quantile (percentile) plots for two key attributes in each dataset.
d. Generate the histogram or distribution plot for each of the two attributes selected in (c).
e. Generate the scatter plots for the two attributes selected in (c).
f. Compute and visualize the covariance and correlation matrices for the continuous
attributes.
g. Comment on what the results reveal about the characteristics of each dataset.
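
Parts (b) and (f) can be sketched in a few lines. The following is a minimal illustration using only the Python standard library, with made-up values rather than data from either UCI dataset. The function names are hypothetical, and quartile conventions differ between tools, so Q1 and Q3 may not match a particular boxplot routine exactly.

```python
import statistics
from collections import Counter

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(values)
    # method="inclusive" keeps the quartiles inside [min, max];
    # other conventions give slightly different Q1 and Q3.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

def mode_of(values):
    """Most frequent value of a categorical attribute."""
    return Counter(values).most_common(1)[0][0]

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))

# Made-up illustrative values, NOT taken from the Wine or Forest Fires data.
temps = [8.2, 18.0, 14.6, 8.3, 11.4, 22.2, 24.1]
humidity = [51.0, 33.0, 40.0, 97.0, 99.0, 29.0, 27.0]
months = ["mar", "oct", "oct", "mar", "mar", "aug", "aug"]

summary = five_number_summary(temps)
print("five-number summary:", summary)
print("mode:", mode_of(months))
print("cov:", covariance(temps, humidity))
print("corr:", correlation(temps, humidity))
```

Applying `covariance` and `correlation` to every pair of continuous attributes yields the two matrices asked for in (f); a heat map is one common way to visualize them.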

2. Data Preprocessing
Download the Communities & Crime (http://archive.ics.uci.edu/ml/datasets/communities+and+crime)
dataset from the UCI repository. Study the dataset and perform the following tasks:
a. Generate basic statistics on missing values in the dataset (fraction of missing values for each
attribute, fraction of missing values for each object). You may report only the 10 attributes
and the 10 objects with the highest fractions of missing values.
b. Fill in the missing values in the data using an appropriate filter.
c. Standardize the dataset to zero mean and unit variance (z-score normalization).
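
The three tasks above can be sketched as follows. This is a minimal illustration on a tiny made-up table, not the actual dataset. It assumes missing values are encoded as "?" (check the dataset description), uses column-mean imputation as one possible filter among several, and all function names are hypothetical.

```python
import statistics

MISSING = "?"  # assumed missing-value marker; verify against the dataset description

# Tiny made-up table: rows are objects, columns are attributes.
columns = ["population", "income", "police"]
rows = [
    ["0.19", "0.37", "?"],
    ["0.00", "?", "?"],
    ["0.04", "0.58", "0.12"],
    ["0.01", "0.45", "0.20"],
]

def missing_fraction_per_attribute(rows, columns):
    """Fraction of objects with a missing value, for each attribute."""
    n = len(rows)
    return {c: sum(r[i] == MISSING for r in rows) / n
            for i, c in enumerate(columns)}

def missing_fraction_per_object(rows):
    """Fraction of attributes that are missing, for each object."""
    return [sum(v == MISSING for v in r) / len(r) for r in rows]

def impute_mean(rows):
    """Replace each missing entry with the mean of the observed values in its column."""
    means = []
    for i in range(len(rows[0])):
        observed = [float(r[i]) for r in rows if r[i] != MISSING]
        means.append(statistics.fmean(observed))
    return [[means[i] if v == MISSING else float(v) for i, v in enumerate(r)]
            for r in rows]

def standardize(rows):
    """Z-score each column: subtract the mean, divide by the standard deviation."""
    scaled_cols = []
    for col in zip(*rows):
        mu = statistics.fmean(col)
        sigma = statistics.pstdev(col)
        # A constant column has sigma == 0; map it to zeros instead of dividing by zero.
        scaled_cols.append([0.0 if sigma == 0 else (v - mu) / sigma for v in col])
    return [list(r) for r in zip(*scaled_cols)]

per_attribute = missing_fraction_per_attribute(rows, columns)
per_object = missing_fraction_per_object(rows)
# Report the worst attributes first; on the real data, keep the first 10.
top_attributes = sorted(per_attribute.items(), key=lambda kv: kv[1], reverse=True)[:10]

filled = impute_mean(rows)
scaled = standardize(filled)
print(top_attributes)
print(per_object)
print(scaled)
```

Mean imputation keeps each column mean unchanged but shrinks its variance, so for attributes with many missing values another filter (median, or dropping the attribute) may be more appropriate.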


| Attribute Level | Transformation | Comments |
|---|---|---|
| Nominal | Any permutation of values | If all employee ID numbers were reassigned, would it make any difference? |
| Ordinal | An order preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function. | An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}. |
| Interval | new_value = a * old_value + b where a and b are constants | Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree). |
| Ratio | new_value = a * old_value | Length can be measured in meters or feet. |
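
The transformations in the table above can be checked numerically. The sketch below is only an illustration with made-up values and hypothetical names: it shows that a monotonic recoding keeps the order of ordinal values, that ratios of values do not survive an interval-scale transformation, and that they do survive a ratio-scale one.

```python
def celsius_to_fahrenheit(c):
    # Interval-scale transformation: new_value = a * old_value + b
    return 1.8 * c + 32.0

def meters_to_feet(m):
    # Ratio-scale transformation: new_value = a * old_value
    return m / 0.3048

# Ordinal: two different monotonic codings give the same ordering of objects.
labels = ["better", "good", "best"]
code_a = {"good": 1, "better": 2, "best": 3}
code_b = {"good": 0.5, "better": 1, "best": 10}
order_a = sorted(labels, key=code_a.get)
order_b = sorted(labels, key=code_b.get)

# Interval: the ratio of two temperatures changes with the scale,
# so "twice as hot" is not meaningful in Celsius or Fahrenheit.
ratio_c = 20.0 / 10.0
ratio_f = celsius_to_fahrenheit(20.0) / celsius_to_fahrenheit(10.0)

# Ratio: the ratio of two lengths is the same in meters and in feet.
ratio_m = 6.0 / 3.0
ratio_ft = meters_to_feet(6.0) / meters_to_feet(3.0)

print("same order:", order_a == order_b)
print("temperature ratios:", ratio_c, ratio_f)
print("length ratios:", ratio_m, ratio_ft)
```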

| Attribute Type | Description | Examples | Operations |
|---|---|---|---|
| Nominal | The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, contingency correlation, χ² test |
| Ordinal | The values of an ordinal attribute provide enough information to order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests |
| Interval | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests |
| Ratio | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current | geometric mean, harmonic mean, percent variation |
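
As a rough illustration of the Operations column, the sketch below applies one representative statistic per attribute type using the Python standard library. The values are made up, the variable and function names are hypothetical, and the integer codes for the ordinal attribute are an assumed encoding of {good, better, best}.

```python
import math
import statistics
from collections import Counter

def entropy_bits(values):
    """Shannon entropy (in bits) of the value distribution of a nominal attribute."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

eye_color = ["brown", "blue", "brown", "green"]  # nominal
quality = [1, 3, 2, 2, 3]                        # ordinal: good=1, better=2, best=3 (assumed coding)
temps_c = [10.0, 20.0, 30.0]                     # interval
lengths_m = [1.0, 4.0, 16.0]                     # ratio
speeds = [40.0, 60.0]                            # ratio

nominal_mode = Counter(eye_color).most_common(1)[0][0]
nominal_entropy = entropy_bits(eye_color)
ordinal_median = statistics.median(quality)
interval_mean = statistics.fmean(temps_c)
interval_stdev = statistics.stdev(temps_c)
ratio_geometric = statistics.geometric_mean(lengths_m)
ratio_harmonic = statistics.harmonic_mean(speeds)

print(nominal_mode, nominal_entropy)
print(ordinal_median)
print(interval_mean, interval_stdev)
print(ratio_geometric, ratio_harmonic)
```

Each type also supports the operations of the types listed before it: a median is valid for interval and ratio attributes, but a geometric mean is not meaningful for nominal, ordinal, or interval ones.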

