
Data Mining

Course Number: CS616BH1

GRADE: 20/20
Assignment 1
Student's Name: Chintapalli Sri Ram

1.) Discuss whether or not each of the following activities is a data mining task.
(a) Dividing the customers of a company according to their gender.

A) No, this is a simple segregation that a database query can perform.

Example: SELECT * FROM CustomersTable WHERE Gender = 'Male';
Select the other gender in the same way, then join the results on whatever
condition is required (for example sales, purchases, or age).
(b) Dividing the customers of a company according to their profitability.
A) No. This is a simple mathematical problem: the customers are divided by
comparing the profit each one generates, which depends on the products they
buy, the margin made on those products, and the volume of sales.
(c) Computing the total sales of a company.
A) No, it is just a sum over all sales: every product manufactured and sold
is counted at the price (and margin) at which it was sold.
(d) Sorting a student database based on student identification numbers.
A) No. A simple sorting procedure solves the problem.

Example: SELECT * FROM student_table ORDER BY sid;
(e) Predicting the outcomes of tossing a (fair) pair of dice.

A) No. Since the dice are given to be fair, this is a probability calculation
rather than a data mining task.
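
To illustrate why (e) needs no data, here is a minimal sketch that derives the exact distribution of the sum of two fair six-sided dice by enumeration. The function name is hypothetical and the choice of the sum as the "outcome" is an assumption; the question does not say which outcome is meant.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def two_dice_sum_distribution():
    """Exact distribution of the sum of two fair six-sided dice."""
    # Enumerate all 36 equally likely (die1, die2) pairs and count each sum.
    counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
    return {total: Fraction(n, 36) for total, n in sorted(counts.items())}


dist = two_dice_sum_distribution()
print(dist[7])  # 1/6
```

Everything follows from the fairness assumption alone, so no historical records are involved.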
(f) Predicting the future stock price of a company using historical records.

A) Yes, this is data mining, since it involves predicting future stock
prices from historical data. For example, consider Verizon stock: an investor
can estimate its value before buying from sales figures, company assets, and
other historical data, while the company can use the same data together with
private data that the investor never sees, which bears directly on whether
the stock price rises and on when to sell.

(g) Monitoring the heart rate of a patient for abnormalities.

A) Yes, this problem also falls in the data mining domain. Detecting an
abnormality involves continuous observation of the heartbeat and reporting
whenever anything unusual happens.
(h) Monitoring seismic waves for earthquake activities.

A) Yes, it is a data mining problem. As in the previous case, the seismic
waves are monitored continuously, and an alarm is raised if an unusual wave
appears.
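
The monitoring idea in (g) and (h) can be sketched as a simple anomaly detector: learn what "normal" looks like from baseline readings, then flag readings that deviate strongly. This is only one possible approach; the function name, the 3-standard-deviation threshold, and the heart-rate values are all made up for illustration.

```python
from statistics import mean, stdev


def flag_abnormal(baseline, readings, threshold=3.0):
    """Return the readings that lie more than `threshold` standard
    deviations from the mean of the baseline readings."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return [x for x in readings if abs(x - mu) > threshold * sigma]


# Hypothetical heart-rate data in beats per minute (not from the assignment).
baseline_bpm = [72, 75, 71, 74, 73, 76, 72, 74]
new_bpm = [73, 75, 140, 72, 38]

alerts = flag_abnormal(baseline_bpm, new_bpm)
print(alerts)  # [140, 38]
```

A real monitor would use a richer model of normal behavior, but the structure (model the normal case, report deviations) is the same.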
(i) Extracting the frequencies of a sound wave.

A) No, this is not a data mining problem. It is a signal processing
computation.

2.) For each of the following data sets, explain whether or not data privacy is an
important issue.
(a) Census data collected from 1900-1950.

A) No. Census data is published data, so data privacy is not the primary
issue.
(b) IP addresses and visit times of Web users who visit your Website.

A) Yes, because this is private data about the users. Systems on a network
are identified by their IP addresses and communicate with each other
directly, so anyone on the same network who obtains a user's IP address has
many opportunities to attack that system or misuse the user's data.
(c) Images from Earth-orbiting satellites.

A) No, images from Earth-orbiting satellites are not private. They are used
for public purposes such as navigation in transport, and they also help in
identifying natural threats.

(d) Names and addresses of people from the telephone book.

A) No, telephone books were meant to be shared. Because it is very hard to
remember every name and phone number, people used these printed books to
look up how to contact someone.
(e) Names and email addresses collected from the Web.

A) No, names and email addresses are not private data. This is very similar
to a telephone book, except that the data is stored and looked up
electronically.

3) You are approached by the marketing director of a local company, who believes
that he has devised a foolproof way to measure customer satisfaction. He explains
his scheme as follows: "It's so simple that I can't believe that no one has thought of
it before. I just keep track of the number of customer complaints for each product. I
read in a data mining book that counts are ratio attributes, and so, my measure of
product satisfaction must be a ratio attribute. But when I rated the products based
on my new customer satisfaction measure and showed them to my boss, he told me
that I had overlooked the obvious, and that my measure was worthless. I think that
he was just mad because our best-selling product had the worst satisfaction since it
had the most complaints. Could you help me set him straight?"
(a) Who is right, the marketing director or his boss? If you answered, his boss, what
would you do to fix the measure of satisfaction?

A) The boss is correct, because the key factor, the number of sales, is not
included in the measure of satisfaction. An appropriate measure of
satisfaction would therefore be a function of both quantities:
Measure = f(number of complaints for the product, total sales of the
product),
for example the number of complaints divided by the total sales.
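
As one concrete choice of f, the sketch below divides complaints by units sold. The function name, product names, and all numbers are hypothetical; they only show how a best seller can have the most complaints and still the lowest complaint rate.

```python
def complaint_rate(complaints, units_sold):
    """Complaints per unit sold; lower means more satisfied customers."""
    if units_sold <= 0:
        raise ValueError("units_sold must be positive")
    return complaints / units_sold


# Made-up figures for illustration only.
products = {
    "best_seller": {"complaints": 50, "units_sold": 10_000},
    "niche_item": {"complaints": 10, "units_sold": 200},
}

rates = {
    name: complaint_rate(p["complaints"], p["units_sold"])
    for name, p in products.items()
}
worst = max(rates, key=rates.get)
print(worst)  # niche_item
```

With raw complaint counts the best seller looks worst (50 against 10); once sales are taken into account its rate is 0.5% against 5% for the niche item.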
(b) What can you say about the attribute type of the original product satisfaction
attribute?

A) The attribute type cannot be determined, because the original product
satisfaction attribute depends on many factors, such as the satisfaction
level and the number of complaints.
4.) An educational psychologist wants to use association analysis to analyze test
results. The test consists of 100 questions with four possible answers each.

(a) How would you convert this data into a form suitable for association
analysis?

A) The first step in association analysis is to represent the data in binary
form. The binary representation for this problem is as follows:

Student  Q1(A)  Q1(B)  Q1(C)  Q1(D)  ...  Q100(A)  Q100(B)  Q100(C)  Q100(D)
1        0      1      0      0      ...  1        0        0        0
2        1      0      0      0      ...  0        0        1        0

If the answer for the nth question is A then Qn(A) is 1 else it is 0.
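
A minimal sketch of this conversion, assuming each student's answers arrive as a list of 100 letters. The function name and the sample answers are hypothetical.

```python
OPTIONS = "ABCD"


def to_binary_row(answers):
    """Map a list of chosen options (one per question) to binary attributes
    named Qn(A)..Qn(D), with 1 for the chosen option and 0 otherwise."""
    row = {}
    for q, chosen in enumerate(answers, start=1):
        for opt in OPTIONS:
            row[f"Q{q}({opt})"] = 1 if chosen == opt else 0
    return row


# Hypothetical student: answers B to questions 1-99 and A to question 100.
student_answers = ["B"] * 99 + ["A"]
row = to_binary_row(student_answers)

print(len(row))                        # 400
print(row["Q1(B)"], row["Q100(A)"])    # 1 1
```

Each question contributes exactly one 1 and three 0s, so every row contains 100 ones among its 400 attributes.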

(b) In particular, what type of attributes would you have and how many of them are
there?

A) Since the chosen answer (a value of 1) is considered more important than
the options that were not chosen (a value of 0), the attributes are
asymmetric binary attributes. There are 400 of them in total
(100 questions * 4 options).
5.) Discuss why a document-term matrix is an example of a data set that has
asymmetric discrete or asymmetric continuous features.

A) A document-term matrix is a matrix in which entry ij is the number of
times the ith term appears in the jth document. Most terms do not appear in
a given document, so only the non-zero entries are considered important.
Thus it is a data set with asymmetric discrete features. If the matrix is
normalized so that each document has an L2 norm of 1, the features become
continuous. This transformation does not create non-zero entries where the
value was previously zero, and the zero entries still carry no meaning, so
the matrix has asymmetric continuous features.
What is the L2 norm: it is the Euclidean length of a vector, the square root
of the sum of its squared entries, ||x||_2 = sqrt(x_1^2 + ... + x_n^2). It is
related to least squares, which minimizes the sum of the squares of the
differences D between target (t) and estimated (E) values:
D = sum_{k=1}^{n} (t_k - E_k)^2
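
A minimal sketch of L2 normalization applied to one document's term counts, to show that zero entries stay zero while the non-zero entries become continuous values. The function name and the sample vector are made up for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean (L2) length.
    An all-zero vector is returned unchanged to avoid dividing by zero."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]


# Hypothetical term counts for a single document.
doc = [3, 0, 4, 0]
unit = l2_normalize(doc)
print(unit)  # [0.6, 0.0, 0.8, 0.0]
```

The counts 3 and 4 become 0.6 and 0.8, while the two zeros are untouched, which is why the normalized matrix is still asymmetric.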
6.) Consider a document-term matrix, where tf_ij is the frequency of the ith word
(term) in the jth document and m is the number of documents. Consider the
variable transformation that is defined by

tf'_ij = tf_ij * log(m / df_i)

where df_i is the number of documents in which the ith term appears and is known
as the document frequency of the term. This transformation is known as the inverse
document frequency transformation.
(a) What is the effect of this transformation if a term occurs in one document?
In every document?

A)
a) In one document: df_i = 1, so the factor log(m / df_i) takes its maximum
value, log(m / 1) = log m.
b) In every document: df_i = m, so the factor is log(m / m) = log(1) = 0 and
the transformed value is zero.
(b) What might be the purpose of this transformation?

A) The result above shows that a term which appears in only one document
receives the maximum weight, while a term which appears in all the documents
receives zero weight. This reflects the fact that a term appearing in every
document plays no role in distinguishing one document from another, unlike a
term that appears in only a few documents.
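
The two cases from part (a) can be checked with a short sketch. It assumes the transformation has the standard form tf'_ij = tf_ij * log(m / df_i), with rows as terms and columns as documents; the function name, the toy matrix, and the choice to give a term that appears in no document a weight of 0 are assumptions made for illustration.

```python
import math


def idf_transform(tf):
    """Apply tf'_ij = tf_ij * log(m / df_i) to a term-by-document matrix.
    tf[i][j] is the frequency of term i in document j."""
    m = len(tf[0])  # number of documents
    weighted = []
    for row in tf:
        df = sum(1 for count in row if count > 0)  # document frequency
        # Assumption: a term found in no document gets weight 0.
        weight = math.log(m / df) if df > 0 else 0.0
        weighted.append([count * weight for count in row])
    return weighted


# Made-up matrix: 2 terms, 4 documents.
tf = [
    [2, 0, 0, 0],  # term appears in one document only
    [1, 3, 1, 2],  # term appears in every document
]
weighted = idf_transform(tf)
print(weighted[1])  # [0.0, 0.0, 0.0, 0.0]
```

The first term keeps a non-zero weight of 2 * log(4) in its single document, while the term present in every document is zeroed out everywhere.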

Criterion | | | | A
Correctness | No justification of correctness | Explanation justifies, mostly correct | Explanation justifies correctness | Explanation justifies correctness extremely well; complete and thorough justification
Clarity | Unclear | Explained; somewhat clear | Every point clearly specified; well commented; clear | Every point precisely specified; thoroughly commented; entirely clear
Understanding | Minor understanding evidenced | Satisfactory understanding evidenced | Evidence of good understanding throughout | Evidence throughout of entirely thorough understanding

Note: Nicely done!
