
a. What are the coordinates of the centroids for the good students and the weak students?

Let's start with the already familiar K-means clustering, with the Normalize input data option flagged. For additional analysis, hierarchical clustering could also be performed. I have run K-means both with and without normalization.

Select all the data and, under the XLMiner tab, choose Cluster, then K-Means Clustering from the drop-down menu.

Select all variables on the left side except Student, move them to the right side, and click Next.

Flag Normalize input data.

Keep both options flagged and click Finish.

Use the same steps if you wish to try it without normalization; similar steps also apply to hierarchical clustering (explained in the previous assignment; this time you might try complete linkage, the centroid method, or group average linkage as the clustering method).
The Excel file contains sheets for K-means with and without normalization. The centroid coordinates shown here are from the normalized K-means run.
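To make the procedure concrete, here is a minimal NumPy sketch of what the normalized K-means run computes: z-score the columns, then alternate between assigning points to the nearest centroid and recomputing centroids as cluster means. The (GPA, GMAT) records below are toy values, not the assignment's actual data.

```python
import numpy as np

# Toy (GPA, GMAT) records standing in for the assignment data.
X = np.array([[3.8, 620], [3.6, 590], [3.7, 610],
              [2.4, 440], [2.6, 470], [2.5, 450]], dtype=float)

# "Normalize input data": z-score each column so GMAT's larger
# scale does not dominate the Euclidean distance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

def kmeans(Z, k=2, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = Z[rng.choice(len(Z), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        d = np.linalg.norm(Z[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its cluster.
        new = np.array([Z[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

labels, centroids = kmeans(Z)
print(labels)   # two clusters: good vs. weak students
```

The centroids come back in z-score units; to report them on the original GPA/GMAT scale, multiply by the column standard deviations and add the means back.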

b. Use XLMiner's standard data partition command to partition the data into a training set (with 60% of
the observations) and validation set (with 40% of the observations) using the default seed of 12345.

Select the data and click Partition (be careful: use the Partition command under the Data Mining tab, not the one under Time Series as in the picture). There, choose Standard Partition.
On the next screen, check that the training and validation sets have the desired percentages, then click OK.

The sheet Data_Partition contains the partitioned data sets.
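The standard partition amounts to a seeded shuffle and a 60/40 split, which can be sketched as below. The seed 12345 mirrors the assignment, but the shuffle here is NumPy's, not XLMiner's, so the exact rows landing in each set will differ from XLMiner's output.

```python
import numpy as np

n = 25                        # e.g. 25 student records
rng = np.random.default_rng(12345)
idx = rng.permutation(n)      # seeded shuffle of the row indices
cut = round(0.6 * n)          # 60% training
train_idx, valid_idx = idx[:cut], idx[cut:]
print(len(train_idx), len(valid_idx))   # 15 10
```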


c. Use discriminant analysis to create a classifier for this data. How accurate is this procedure on the
training and validation sets?
In the sheet Data_Partition, select the 3 data columns, then click Classify and choose Discriminant Analysis.

On the next screen you have to choose the selected variables and the output variable. I decided to use Rating as the output variable and the other two as selected variables.

On the next screen I checked Canonical Variate, and for Prior Class Probabilities I kept the system default, but you may read the explanation in the box below to understand whether other probabilities should be used.
If According to relative occurrences in training data is selected, the discriminant analysis procedure incorporates prior assumptions about how frequently the different classes occur: XLMiner assumes that the probability of encountering a particular class in the large data set equals the frequency with which it occurs in the training data.
If Use equal prior probabilities is selected, XLMiner assumes that all classes occur with equal probability.
If User specified prior probabilities is selected, manually enter the desired class and probability values; for example, under the Probability list, enter 0.7 for Class 1 and 0.3 for Class 0.

On the next screen you can see which options I flagged and decided to use.

The results are in the sheet DA_Output. Looking at the training and validation results, we cannot claim a high level of accuracy from the current outputs. To judge the accuracy properly, we should probably rerun the procedure with different parameters.
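For reference, the textbook linear discriminant analysis that this procedure implements can be sketched in a few lines: class means, a pooled within-class covariance, and priors taken from the training frequencies. The (GPA, GMAT) data below are toy values, not the assignment's records.

```python
import numpy as np

# Toy (GPA, GMAT) training data; 1 = good student, 2 = weak student.
X = np.array([[3.8, 620], [3.6, 590], [3.7, 610],
              [2.4, 440], [2.6, 470], [2.5, 450]], dtype=float)
y = np.array([1, 1, 1, 2, 2, 2])

classes = np.unique(y)
means = {c: X[y == c].mean(axis=0) for c in classes}
priors = {c: (y == c).mean() for c in classes}   # relative occurrences
# Pooled within-class covariance (n - #classes degrees of freedom).
S = sum(((X[y == c] - means[c]).T @ (X[y == c] - means[c]))
        for c in classes) / (len(y) - len(classes))
S_inv = np.linalg.inv(S)

def lda_score(x, c):
    # Linear discriminant score; the class with the largest score wins.
    m = means[c]
    return x @ S_inv @ m - 0.5 * m @ S_inv @ m + np.log(priors[c])

def classify(x):
    return max(classes, key=lambda c: lda_score(np.asarray(x, float), c))

print(classify([3.7, 600]), classify([2.5, 455]))   # 1 2
```

Accuracy on the training and validation sets is then just the fraction of records whose predicted class matches the actual one.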

d. Use logistic regression to create a classifier for this data. How accurate is this procedure on the training
and validation data sets?
Again in the sheet Data_Partition, go to the Classify tab and choose Logistic Regression. For the following questions I will not repeat how to launch a method when it sits under the Classify tab.

I did not define a weight variable, but note that a record with a large weight influences the model more than a record with a smaller weight.

You can see that Set confidence is flagged; I did not change the percentage offered by the system. I also used the Advanced button and flagged Perform Collinearity; under Variable Selection you may make changes (screen below).

The results are in the sheet LR_Output. We can easily see differences in the output compared with the previous method. These results indicate good accuracy, especially in the Training/Validation Data Scoring - Summary Report sections.
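Under the hood, logistic regression fits a sigmoid of a linear score to the 0/1 labels. A minimal NumPy sketch using plain gradient descent is below; XLMiner uses its own solver, so its coefficients will not match these exactly, and the data are toy values.

```python
import numpy as np

# Toy (GPA, GMAT) data; label 1 = good student, 0 = weak student.
X = np.array([[3.8, 620], [3.6, 590], [3.7, 610],
              [2.4, 440], [2.6, 470], [2.5, 450]], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0], dtype=float)

Z = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize columns
Zb = np.hstack([np.ones((len(Z), 1)), Z])      # intercept column
w = np.zeros(Zb.shape[1])
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-Zb @ w))          # predicted probabilities
    w -= 0.5 * Zb.T @ (p - y) / len(y)         # cross-entropy gradient step
pred = (1.0 / (1.0 + np.exp(-Zb @ w)) > 0.5).astype(int)
print("training accuracy:", (pred == y).mean())
```

The Summary Report's accuracy figure corresponds to the final line: score each set with the fitted weights, threshold at 0.5, and compare with the actual class.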
e. Use the k-nearest neighbor technique to create a classifier for this data (with normalized inputs and
k=3). How accurate is this procedure on the training and validation sets?
In the same way, select the data in the sheet Data_Partition and start the method from the Classify tab.

Flag Normalize input data and enter 3 for k.


Based on the results we can conclude the accuracy is high, though noticeably higher for one set than for the other. The average error is very close to the previous method's value, yet differs considerably between the two sets.
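The k-nearest-neighbor rule itself is simple enough to sketch directly: z-score the columns, then take a majority vote among the 3 closest training records. Toy data again, not the assignment's.

```python
import numpy as np

# Toy (GPA, GMAT) training records; 1 = good student, 2 = weak student.
train = np.array([[3.8, 620], [3.6, 590], [3.7, 610],
                  [2.4, 440], [2.6, 470], [2.5, 450]], dtype=float)
labels = np.array([1, 1, 1, 2, 2, 2])

mu, sd = train.mean(axis=0), train.std(axis=0)
Zt = (train - mu) / sd                             # normalized inputs

def knn_predict(x, k=3):
    z = (np.asarray(x, float) - mu) / sd           # same scaling as training
    d = np.linalg.norm(Zt - z, axis=1)
    nearest = labels[np.argsort(d)[:k]]            # k closest records
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[counts.argmax()]                   # majority vote

print(knn_predict([3.5, 580]), knn_predict([2.7, 460]))   # 1 2
```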
f. Use the k-nearest neighbor technique to create a classifier for this data (with normalized inputs). What
value of k seems to work best? How accurate is this procedure on the training and validation sets?
Using the previous method again, but with the option to score on the best k flagged, the system again reported k = 3 as the best. With a different range of k values or other parameters, the best k might come out differently.
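The best-k search amounts to evaluating each candidate k on the validation set and keeping the one with the fewest errors. In the toy sketch below (a single GPA feature, so the z-scoring step is a monotone rescale and is omitted), the record at 3.55 is deliberately mislabeled noise, so k = 1 overfits while k = 3 does not; the data are illustrative only.

```python
import numpy as np

# Toy 1-D training data; the 3.55 record is mislabeled noise.
train = np.array([3.8, 3.6, 3.5, 3.4, 3.55, 2.4, 2.5, 2.6])
y_train = np.array([1,   1,   1,   1,   2,    2,   2,   2])
valid = np.array([3.56, 2.46, 3.44, 2.56])
y_valid = np.array([1, 2, 1, 2])

def predict(x, k):
    nearest = y_train[np.argsort(np.abs(train - x))[:k]]
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[counts.argmax()]

# Count validation errors for each candidate k; keep the best.
errors = {k: int(sum(predict(x, k) != t for x, t in zip(valid, y_valid)))
          for k in (1, 3, 5)}
best_k = min(errors, key=errors.get)
print(errors, "best k:", best_k)   # k = 1 is fooled by the noise point
```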

g. Use a classification tree to create a classifier for this data (with normalized inputs and at least 4
observations per terminal node). Write a pseudo-code summarizing the classification rules for the
optimal tree. How accurate is this procedure on the training and validation sets?

I chose Random Trees.

Based on the results we can conclude the accuracy is high, but again there are some misalignments between the training and validation sets.
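The question also asks for pseudo-code summarizing the classification rules of the optimal tree. Such rules are nested if/else tests, one branch per terminal node. The split values below are purely illustrative, not the thresholds XLMiner actually found for this data:

```python
# Hypothetical rules of the kind a pruned classification tree with at
# least 4 observations per terminal node produces. The thresholds
# (3.0 GPA, 500 GMAT) are illustrative placeholders.
def classify(gpa, gmat):
    if gpa >= 3.0:
        if gmat >= 500:
            return "good"   # terminal node: high GPA and high GMAT
        return "weak"       # terminal node: high GPA but low GMAT
    return "weak"           # terminal node: low GPA

print(classify(3.4, 560), classify(2.6, 480))   # good weak
```

Read the real thresholds off the tree diagram in the output sheet and substitute them into this skeleton.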

h. Use a neural network to create a classifier for this data (use normalized inputs and a single hidden
layer with 3 nodes). How accurate is this procedure on the training and validation sets?
In the same way, select the data in the sheet Data_Partition and start the method from the Classify tab.

Here the results are similar to the previous method's, but only for the validation data; the training data results differ considerably from the previous method.
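The requested architecture is small enough to sketch: two inputs, one hidden layer of 3 sigmoid nodes, one sigmoid output. The NumPy version below trains with plain gradient descent; XLMiner's network has the same shape but its own training procedure, so its weights and scores will differ, and the data are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy (GPA, GMAT) data; 1 = good student, 0 = weak student.
X = np.array([[3.8, 620], [3.6, 590], [3.7, 610],
              [2.4, 440], [2.6, 470], [2.5, 450]], dtype=float)
y = np.array([[1], [1], [1], [0], [0], [0]], dtype=float)
Z = (X - X.mean(axis=0)) / X.std(axis=0)            # normalized inputs

W1 = rng.normal(0, 0.5, (2, 3)); b1 = np.zeros(3)   # input -> 3 hidden nodes
W2 = rng.normal(0, 0.5, (3, 1)); b2 = np.zeros(1)   # hidden -> output

sig = lambda a: 1.0 / (1.0 + np.exp(-a))
for _ in range(3000):
    H = sig(Z @ W1 + b1)                  # hidden activations
    out = sig(H @ W2 + b2)                # predicted probability of "good"
    g_out = out - y                       # cross-entropy gradient at output
    g_hid = (g_out @ W2.T) * H * (1 - H)  # backpropagated hidden gradient
    W2 -= 0.5 * H.T @ g_out / len(y); b2 -= 0.5 * g_out.mean(axis=0)
    W1 -= 0.5 * Z.T @ g_hid / len(y); b1 -= 0.5 * g_hid.mean(axis=0)

pred = (sig(sig(Z @ W1 + b1) @ W2 + b2) > 0.5).astype(int)
print("training accuracy:", (pred == y).mean())
```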
i. Return to the data sheet and use the Transform, Bin Continuous Data command to create binned
variables for GPA and GMAT. Use XLMiner's standard data partition command to partition the data into a
training set (with 60% of the observations) and validation set (with 40% of the observations) using the
default seed of 12345. Now use the Naive Bayes technique to create a classifier for the data using the new
binned variables for GPA and GMAT. How accurate is this procedure on the training and validation sets?
In my view, part of this request was already covered under b. I will also do the standard data partition in Excel, but focus more on the last part.

On the next screen, select GPA and click Apply to Selected Variable. Do the same for GMAT.

We now have the sheet Binned_Data. Let's repeat the standard partition, this time using the binned GPA and GMAT.

The results are in the sheet Data_Partition1.


Let's run the last method, Naive Bayes, from the Classify tab.

The results are in the sheets starting with NNB.


In this case the training data results are very good, but the validation results are still at a similar average level to some of the previous methods.
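The bin-then-Naive-Bayes pipeline can be sketched as follows: cut GPA and GMAT into equal-width bins (XLMiner's Bin Continuous Data offers several binning options; equal-width is just one), then apply categorical Naive Bayes with Laplace smoothing. Toy data, not the assignment's records.

```python
import numpy as np

# Toy (GPA, GMAT) data; 1 = good student, 2 = weak student.
X = np.array([[3.8, 620], [3.6, 590], [3.7, 610],
              [2.4, 440], [2.6, 470], [2.5, 450]], dtype=float)
y = np.array([1, 1, 1, 2, 2, 2])

def bin_col(col, nbins=3):
    # Equal-width binning: interior edges, then codes 0 .. nbins-1.
    edges = np.linspace(col.min(), col.max(), nbins + 1)[1:-1]
    return np.digitize(col, edges)

B = np.column_stack([bin_col(X[:, j]) for j in range(X.shape[1])])

def nb_classify(b, alpha=1.0, nbins=3):
    best, best_lp = None, -np.inf
    for c in np.unique(y):
        rows = B[y == c]
        lp = np.log(len(rows) / len(y))        # class prior
        for j, v in enumerate(b):              # features assumed independent
            count = (rows[:, j] == v).sum()
            # Laplace-smoothed conditional probability of bin v in class c.
            lp += np.log((count + alpha) / (len(rows) + alpha * nbins))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

print(nb_classify([2, 2]), nb_classify([0, 0]))   # 1 2
```

Binning matters here because Naive Bayes on categorical bins just counts co-occurrences, which is exactly what the classifier does with the new binned variables.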

j. Which of the classification techniques would you recommend the MBA actually use?
We should first check whether each method was run with the best possible parameterization, and only then decide based on the results. Given the different behavior of the two data sets, we also have to decide how important each of them is, and whether the best average over both is the most acceptable criterion.
Based purely on the results so far, I would recommend Naive Bayes, considering its best training results and a validation average similar to the other methods.
k. Suppose that the MBA director receives applications for admission to the MBA program from the
following individuals. According to your recommended classifier, which of these individuals do you expect
to be good students and which do you expect to be weak?
Name: Mike Dimoupolous GPA: 3.02 GMAT: 450
Name: Scott Frazier GPA: 2.97 GMAT: 587
Name: Paula Curry GPA:3.95 GMAT:551
Name: Terry Freeman GPA: 2.45 GMAT: 484
Name: Dana Simmons GPA:3.26 GMAT:524
Let's say that here you have to make a prediction, but I'm not sure whether your professor expects you to use the tool for the prediction or to check your own prediction using some classifier (I understood it the second way, and that is what I have done).
I marked your students with the numbers 1 to 5 and inserted a Rating based on the calculated averages for GPA and GMAT. We have to be very careful with students 3 and 4, who are far from the average in both categories. (These data are in Sheet1.)
Student  Rating  GPA   GMAT
1        2       3.02  450
2        1       2.97  587
3        1       3.95  551
4        2       2.45  484
5        1       3.26  524
Average          3.13  519.2

On these data you should first run the data partition (you may use the Transform tab to get the binned data; I did this in the sheet BinnedData1), as already explained in part i (I did this in the sheet Data_Partition2). After that, run Naive Bayes on the partitioned data (the sheets start with NNB but end with 1).
Analyzing those sheets, we can see that the rating we inserted as actual (predicted by us, but treated as actual by the system) is largely confirmed with high probability. This is more or less visible in the sheet NNB_TrainingScore1 for students 1, 2 and 5.
If we check NNB_ValidationScore1, we confirm the predicted rating for student 3; for student 4, however, we have to accept the predicted rating of 2, considering that the system was unable to calculate a probability for any class in this case.
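The average-based ratings inserted in Sheet1 can be reproduced with a simple rule. The rule "good (1) if the applicant is above the class average on at least one of GPA and GMAT, otherwise weak (2)" is my reading of the heuristic described above; it matches all five ratings in the table.

```python
# The five applicants from the question, with the averages computed
# exactly as in the table above (3.13 GPA, 519.2 GMAT).
applicants = {
    "Mike Dimoupolous": (3.02, 450),
    "Scott Frazier":    (2.97, 587),
    "Paula Curry":      (3.95, 551),
    "Terry Freeman":    (2.45, 484),
    "Dana Simmons":     (3.26, 524),
}
gpa_avg = sum(g for g, _ in applicants.values()) / len(applicants)    # 3.13
gmat_avg = sum(m for _, m in applicants.values()) / len(applicants)   # 519.2

def rating(gpa, gmat):
    # 1 = good if above average on at least one measure, else 2 = weak.
    return 1 if (gpa > gpa_avg or gmat > gmat_avg) else 2

for name, (g, m) in applicants.items():
    print(name, rating(g, m))
```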
