Let's start with the already familiar K-means clustering, with the Normalize input data option flagged. To allow additional analysis, hierarchical clustering could also be used. I have run K-means both with and without normalization.
Select all the data and, under the XLMiner tab, choose Cluster, then K-Means Clustering from the drop-down menu.
Choose all variables on the left side except Student and move them to the right side. Click Next.
The same steps apply if you wish to try it without normalization, and similar steps apply for hierarchical clustering (explained in the previous assignment; this time you might try complete linkage, the centroid method, or group average linkage as the clustering method).
The Excel file contains sheets for K-means with and without normalization. Shown here are the centroid coordinates for the normalized K-means run.
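The same clustering can be sketched outside XLMiner. Below is a minimal Python/scikit-learn version; the data, the two columns (GPA, GMAT), and the choice of k = 3 are all illustrative assumptions, not taken from the assignment.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the admissions data (GPA, GMAT); values are illustrative
X = np.array([[3.0, 450], [2.9, 587], [3.9, 551], [2.4, 484],
              [3.3, 524], [3.6, 600], [2.5, 430], [3.8, 640]])

# "Normalize input data" corresponds to z-scoring each column first,
# so that GMAT's larger scale does not dominate the distances
X_norm = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=3, n_init=10, random_state=12345).fit(X_norm)
print(km.cluster_centers_)  # centroid coordinates, as reported in the sheet
print(km.labels_)           # cluster assignment for each observation
```

Skipping the StandardScaler step reproduces the "without normalization" variant, where GMAT dominates purely because of its scale.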
b. Use XLMiner's standard data partition command to partition the data into a training set (with 60% of the observations) and a validation set (with 40% of the observations) using the default seed of 12345.
Select the data and click Partition (be careful: use the partition under the Data Mining tab, not, as in the picture, under Time Series; there choose Standard Partition).
On the next screen check that the training and validation sets have the desired percentages, then just click OK.
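For readers without XLMiner, the 60/40 standard partition can be approximated in Python. The seed below plays the same role as XLMiner's 12345 but will not reproduce the identical split; the data is a hypothetical stand-in.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the admissions records and their Rating labels
X = np.arange(20).reshape(10, 2)
y = np.array([1, 2, 1, 2, 1, 2, 1, 2, 1, 2])

# 60% training / 40% validation, seeded for reproducibility
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=12345)
print(len(X_train), len(X_valid))  # 6 4
```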
On the next screen you have to choose the selected variables and the output variable. I've decided to use Rating as the output variable and the other two as selected variables.
On the next screen I've checked Canonical Variate, and for Prior Class Probabilities I've kept the flag defined by the system, but you may read the explanation in the box below to understand whether some other probabilities should be used.
If According to relative occurrences in training data is selected, the discriminant analysis procedure incorporates prior assumptions about how frequently the different classes occur: XLMiner assumes that the probability of encountering a particular class in the larger data set is the same as the frequency with which it occurs in the training data.
If Use equal prior probabilities is selected, XLMiner assumes that all classes occur with equal probability.
If User specified prior probabilities is selected, manually enter the desired class and probability values; for example, under the Probability list, enter 0.7 for Class 1 and 0.3 for Class 0.
On the next screen you can see which options I've flagged and decided to use.
The data are in sheet DA_Output. You can easily see the results for the training and validation sets; based on the current outputs we could not claim a high level of accuracy. To improve it, we should probably rerun the procedure with different parameters.
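The prior-probability options described above map directly onto the `priors` parameter of scikit-learn's discriminant analysis. A sketch, assuming two classes (Rating 1 and 2) and toy GPA/GMAT data:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy stand-in for the partitioned training data
X = np.array([[3.0, 450], [2.9, 587], [3.9, 551], [2.4, 484],
              [3.3, 524], [3.6, 600], [2.5, 430], [3.8, 640]])
y = np.array([2, 1, 1, 2, 1, 1, 2, 1])

# priors=None  -> "according to relative occurrences in training data"
# priors=[...] -> "user specified prior probabilities" (ordered by class label)
lda = LinearDiscriminantAnalysis(priors=[0.7, 0.3]).fit(X, y)
print(lda.score(X, y))            # training accuracy
print(lda.predict([[3.1, 500]]))  # class for a new applicant
```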
d. Use logistic regression to create a classifier for this data. How accurate is this procedure on the training
and validation data sets?
Again select the Data_Partition sheet, go to the Classify tab, and choose Logistic Regression. For the remaining questions I won't repeat how to launch a method when it sits under the Classify tab.
I didn't define a weight variable, but if you wish to, you should know that a record with a large weight will influence the model more than a record with a smaller weight.
You can see that Set Confidence is flagged; I didn't change the percentage offered by the system. I've also used the Advanced button and flagged Perform Collinearity; under Variable Selection you may make changes (screen below).
The data are in sheet LR_Output. We can easily see differences in the output compared with the previous method. These results show good accuracy, especially if we check the Training/Validation Data Scoring - Summary Report sections.
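The same classifier, including the optional per-record weight described above, can be sketched in Python; the data and the weight values are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the partitioned training data
X = np.array([[3.0, 450], [2.9, 587], [3.9, 551], [2.4, 484],
              [3.3, 524], [3.6, 600], [2.5, 430], [3.8, 640]])
y = np.array([2, 1, 1, 2, 1, 1, 2, 1])

# Optional weight variable: a record with a larger weight influences
# the fitted model more than a record with a smaller weight
w = np.array([1, 1, 1, 5, 1, 1, 1, 1])

clf = LogisticRegression(max_iter=5000).fit(X, y, sample_weight=w)
print(clf.score(X, y))  # training accuracy
```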
e. Use the k-nearest neighbor technique to create a classifier for this data (with normalized inputs and k=3). How accurate is this procedure on the training and validation sets?
In the same way, select the data in the Data_Partition sheet and start the method from the Classify tab.
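A Python sketch of the same k-NN setup on toy GPA/GMAT data; the pipeline makes the "normalized inputs" requirement explicit.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the partitioned training data
X = np.array([[3.0, 450], [2.9, 587], [3.9, 551], [2.4, 484],
              [3.3, 524], [3.6, 600], [2.5, 430], [3.8, 640]])
y = np.array([2, 1, 1, 2, 1, 1, 2, 1])

# Normalization happens inside the pipeline, then k = 3 nearest neighbors
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn.fit(X, y)
acc = knn.score(X, y)  # training accuracy; score the validation set the same way
print(acc)
```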
g. Use a classification tree to create a classifier for this data (with normalized inputs and at least 4
observations per terminal node). Write a pseudo-code summarizing the classification rules for the
optimal tree. How accurate is this procedure on the training and validation sets?
Based on the results we can conclude the accuracy is high, but again there are some misalignments between the training and validation sets.
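The "at least 4 observations per terminal node" constraint and the requested rule pseudo-code both have direct equivalents in scikit-learn; the data below is a toy stand-in, and `min_samples_leaf` is the analogue of XLMiner's terminal-node minimum.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy stand-in for the partitioned training data
X = np.array([[3.0, 450], [2.9, 587], [3.9, 551], [2.4, 484],
              [3.3, 524], [3.6, 600], [2.5, 430], [3.8, 640]])
y = np.array([2, 1, 1, 2, 1, 1, 2, 1])

# At least 4 observations per terminal node
tree = DecisionTreeClassifier(min_samples_leaf=4, random_state=12345).fit(X, y)

# export_text prints the if/else rules, i.e. the requested pseudo-code
rules = export_text(tree, feature_names=["GPA", "GMAT"])
print(rules)
```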
h. Use a neural network to create a classifier for this data (use normalized inputs and a single hidden layer with 3 nodes). How accurate is this procedure on the training and validation sets?
In the same way, select the data in the Data_Partition sheet and start the method from the Classify tab.
Here we can see similarities with the results from the previous method, but only for the validation data; the training data results seem quite different from the previous method's.
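The specified architecture (normalized inputs, one hidden layer with 3 nodes) can be sketched with scikit-learn's MLPClassifier; the data is a toy stand-in, and `max_iter` is raised so the tiny example can converge.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the partitioned training data
X = np.array([[3.0, 450], [2.9, 587], [3.9, 551], [2.4, 484],
              [3.3, 524], [3.6, 600], [2.5, 430], [3.8, 640]])
y = np.array([2, 1, 1, 2, 1, 1, 2, 1])

# Normalized inputs, then a single hidden layer with 3 nodes
X_norm = StandardScaler().fit_transform(X)
nn = MLPClassifier(hidden_layer_sizes=(3,), max_iter=2000,
                   random_state=12345).fit(X_norm, y)
print(nn.score(X_norm, y))  # training accuracy
```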
i. Return to the data sheet and use the Transform, Bin Continuous Data command to create binned variables for GPA and GMAT. Use XLMiner's standard data partition command to partition the data into a training set (with 60% of the observations) and a validation set (with 40% of the observations) using the default seed of 12345. Now use the Naive Bayes technique to create a classifier for the data using the new binned variables for GPA and GMAT. How accurate is this procedure on the training and validation sets?
From my point of view, part of this request has already been covered under b. I will also do the standard data partition part in Excel, but focus more on the last part.
In the next screen select GPA and then click Apply to Selected Variable. Do the same for GMAT.
We obtain the sheet Binned_Data. Let's now repeat the standard partition, but using the binned GPA and GMAT.
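The Bin Continuous Data step followed by Naive Bayes can be sketched as below; the data and the choice of 3 equal-width bins are illustrative assumptions, not taken from the assignment.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.naive_bayes import CategoricalNB

# Toy stand-in for the partitioned training data
X = np.array([[3.0, 450], [2.9, 587], [3.9, 551], [2.4, 484],
              [3.3, 524], [3.6, 600], [2.5, 430], [3.8, 640]])
y = np.array([2, 1, 1, 2, 1, 1, 2, 1])

# Bin GPA and GMAT into ordinal categories (the "Bin Continuous Data" step)
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
X_binned = binner.fit_transform(X).astype(int)

# Naive Bayes on the binned (categorical) inputs
nb = CategoricalNB(min_categories=3).fit(X_binned, y)
print(nb.score(X_binned, y))  # training accuracy
```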
j. Which of the classification techniques would you recommend the MBA director actually use?
We should first check whether each method has been run under the best possible parameterization; only then can we decide based on the results. Given the different behavior of the two data sets, we also have to decide how much weight to give each of them, and whether the best average over both is the most acceptable criterion.
Going by the results alone, I would recommend Naive Bayes, considering it has the best results for training and an average similar to the other methods for validation.
k. Suppose that the MBA director receives applications for admission to the MBA program from the
following individuals. According to your recommended classifier, which of these individuals do you expect
to be good students and which do you expect to be weak?
Name: Mike Dimoupolous  GPA: 3.02  GMAT: 450
Name: Scott Frazier  GPA: 2.97  GMAT: 587
Name: Paula Curry  GPA: 3.95  GMAT: 551
Name: Terry Freeman  GPA: 2.45  GMAT: 484
Name: Dana Simmons  GPA: 3.26  GMAT: 524
Let's say that here you have to make a prediction, but I'm not sure whether your professor expects you to use a tool for the prediction or to check your own prediction using some classifier (I've understood it the second way and have done it like that).
I've numbered your students from 1 to 5 and inserted a Rating based on how each compares with the averages for GPA and GMAT. Here we have to be very careful with students 3 and 4, who are far from the average in both categories. (This data is in Sheet1.)
Student   Rating   GPA    GMAT
1         2        3.02   450
2         1        2.97   587
3         1        3.95   551
4         2        2.45   484
5         1        3.26   524
Average            3.13   519.2
On this data you should first implement a data partition (you may use the Transform tab to get the binned data; I've done this in sheet BinnedData1), as already explained in part i (the partition itself is in sheet Data_Partition2).
After that, run Naive Bayes on the partitioned data (the resulting sheets start with NNP and end with 1).
Analyzing all of that in the sheets, we can see that the rating we inserted as actual (predicted by us, but actual from the system's point of view) turns out to be confirmed with high probability. This is more or less visible in sheet NNB_TrainingScore1 for students 1, 2 and 5.
If we check NNB_ValidationScore1, we confirm the predicted rating for student 3; for student 4, however, we have to accept the predicted rating of 2, considering that the system was not able to calculate a probability for any class in this case.
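As a cross-check of the recommended classifier, the five applicants can be scored in Python with the same binned-Naive-Bayes workflow. The training data below is a toy stand-in, so the printed predictions only illustrate the procedure, not the actual answer.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.naive_bayes import CategoricalNB

# Toy stand-in for the historical admissions data (GPA, GMAT -> Rating)
X_train = np.array([[3.0, 450], [2.9, 587], [3.9, 551], [2.4, 484],
                    [3.3, 524], [3.6, 600], [2.5, 430], [3.8, 640]])
y_train = np.array([2, 1, 1, 2, 1, 1, 2, 1])

# The five applicants from part k, in the order listed above
applicants = np.array([[3.02, 450], [2.97, 587], [3.95, 551],
                       [2.45, 484], [3.26, 524]])

# Fit the binning on the training data, then apply it to the applicants too
binner = KBinsDiscretizer(n_bins=3, encode="ordinal",
                          strategy="uniform").fit(X_train)
nb = CategoricalNB(min_categories=3).fit(
    binner.transform(X_train).astype(int), y_train)

pred = nb.predict(binner.transform(applicants).astype(int))
print(pred)  # predicted Rating for each applicant
```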