
Final Exam ADMIN 601 Spring ‘18 Name_____YI TING CHANG_______________

Prof H. Dyck Student ID____006699414______

1. Below is the JMP output for a logistic regression using our toylogistic data set with PassClass regressed on
MidtermScore. Use the output to answer the following questions:
a. What are the parameter estimates?

The intercept is -25.601875, and the slope is 0.36376093.

b. Give an interpretation of the estimates.

The intercept, -25.601875, is the log odds of passing when MidtermScore is 0.

The slope, 0.36376093, gives the expected change in the logit (the log of the odds) for a

one-unit change in the independent variable.

c. Find the odds ratio; Midterm = 75

Log odds: -25.602 + 0.3637(75) = 1.6755

$e^{1.6755} = 5.3414$, the odds of passing (versus not passing) at a midterm score of 75.

d. What is the Unit odds ratio? What does it refer to?

The unit odds ratio is 1.43873 ($e^{0.36376093}$). It is the factor by which the odds of passing are multiplied for a one-unit increase in the independent variable.

e. Explain the meaning of the range odds ratio.

The range odds ratio is the factor by which the odds change when the independent

variable moves from its minimum to its maximum. Here it is 2989.138.
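
As a rough check on parts d and e, both odds ratios can be recomputed from the slope in the Parameter Estimates table. This is only a sketch: the minimum and maximum midterm scores are not listed in the output, so the range below is backed out of the reported range odds ratio rather than read from the data, and the function name `odds_ratio` is illustrative.

```python
import math

SLOPE = 0.36376093  # MidtermScore estimate from the Parameter Estimates table


def odds_ratio(slope, delta):
    """Factor by which the odds are multiplied when the regressor rises by delta."""
    return math.exp(slope * delta)


# Unit odds ratio: a one-point increase in MidtermScore.
unit_or = odds_ratio(SLOPE, 1.0)

# Assumption: the score range is inferred from the reported range odds ratio,
# because the observed minimum and maximum are not shown in the report.
reported_range_or = 2989.138
implied_range = math.log(reported_range_or) / SLOPE

print(f"unit odds ratio: {unit_or:.5f}")
print(f"implied score range: {implied_range:.2f}")
```

The unit odds ratio should agree with the Unit Odds Ratios table, and the implied range suggests the midterm scores span roughly 22 points.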

f. Use the logistic regression to compute probabilities for a student whose midterm score is 70
$\log\left[\frac{\hat\pi}{1-\hat\pi}\right] = -25.601875 + 0.36376093 \times 70 = -0.1386099$

$\frac{\hat\pi}{1-\hat\pi} = e^{-0.1386099} = 0.87057$ ; $\hat\pi = \frac{0.87057}{1+0.87057} = 0.465$

Prob of Pass Class: 0.465 ; Prob of Not Pass Class: 0.535
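
The same calculation can be sketched in a few lines of Python, using the two estimates from the Parameter Estimates table. The function name `pass_probability` is illustrative and is not part of the JMP output.

```python
import math

INTERCEPT = -25.601875  # from the Parameter Estimates table
SLOPE = 0.36376093


def pass_probability(midterm_score):
    """Convert the fitted logit into a probability of PassClass = 1."""
    logit = INTERCEPT + SLOPE * midterm_score
    odds = math.exp(logit)
    return odds / (1.0 + odds)


p70 = pass_probability(70)
print(f"P(pass) = {p70:.3f}, P(not pass) = {1 - p70:.3f}")
```

Calling `pass_probability(75)` gives the probability that corresponds to the odds computed in part c.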

g. For a specified probability of .3, what is the predicted MidtermScore?


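$\log\left[\frac{0.3}{1-0.3}\right] = -25.601875 + 0.36376093 \times \text{MidtermScore}$

$\log[0.428571] = -0.847298 = -25.601875 + 0.36376093 \times \text{MidtermScore}$

MidtermScore $= (-0.847298 + 25.601875)/0.36376093 = 68.05$, which agrees with the Inverse Prediction table (68.05178).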
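
A short sketch of the inverse prediction follows: solve the logit equation for the score at a chosen probability. It assumes natural logarithms and the estimates from the Parameter Estimates table; `score_for_probability` is an illustrative name. For the three probabilities listed in JMP's Inverse Prediction table it should give about 68.05, 70.38, and 72.71.

```python
import math

INTERCEPT = -25.601875  # from the Parameter Estimates table
SLOPE = 0.36376093


def score_for_probability(p):
    """Midterm score at which the fitted probability of passing equals p."""
    logit = math.log(p / (1.0 - p))  # natural log of the odds
    return (logit - INTERCEPT) / SLOPE


for p in (0.3, 0.5, 0.7):
    print(f"p = {p}: MidtermScore = {score_for_probability(p):.2f}")
```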

h. From the Confusion Matrix, what is the overall error rate?

4/20 = 0.2 Total Error Rate

i. What are the error rates for the 0s and for the 1s?
Error Rates for the 0s: 2/8 = 0.25
Error Rates for the 1s: 2/12= 0.167
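
These rates can be recomputed from the counts in the Confusion Matrix. The sketch below stores each count under an (actual, predicted) key; the dictionary layout and the function name `error_rates` are illustrative choices.

```python
# Counts from the Confusion Matrix, keyed by (actual, predicted).
confusion = {(1, 1): 10, (1, 0): 2, (0, 1): 2, (0, 0): 6}


def error_rates(confusion):
    """Return the overall error rate and the error rate for each actual class."""
    total = sum(confusion.values())
    wrong = sum(c for (actual, pred), c in confusion.items() if actual != pred)
    per_class = {}
    for label in sorted({actual for actual, _ in confusion}):
        row_total = sum(c for (actual, _), c in confusion.items() if actual == label)
        row_wrong = sum(
            c for (actual, pred), c in confusion.items()
            if actual == label and pred != label
        )
        per_class[label] = row_wrong / row_total
    return wrong / total, per_class


overall, by_class = error_rates(confusion)
print(f"overall: {overall:.3f}")
print(f"0s: {by_class[0]:.3f}, 1s: {by_class[1]:.3f}")
```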
Logistic Plot

Whole Model Test

Model        -LogLikelihood   DF   ChiSquare   Prob>ChiSq
Difference         6.264486    1    12.52897      0.0004*
Full               7.195748
Reduced           13.460233

RSquare (U) 0.4654


AICc 19.0974
BIC 20.383
Observations (or Sum Wgts) 20
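
The summary statistics in the Whole Model Test report can be reproduced from the two negative log-likelihoods. The sketch below assumes the usual textbook definitions (likelihood-ratio chi-square, RSquare (U) as the proportional reduction in negative log-likelihood, and the standard AICc and BIC formulas with two estimated parameters); that JMP uses exactly these definitions is an assumption, though the results match the reported values to the digits shown.

```python
import math

NLL_FULL = 7.195748      # -LogLikelihood of the fitted model
NLL_REDUCED = 13.460233  # -LogLikelihood of the intercept-only model
N_OBS = 20
N_PARAMS = 2             # intercept and MidtermScore slope

# Likelihood-ratio chi-square: twice the drop in negative log-likelihood.
chi_square = 2 * (NLL_REDUCED - NLL_FULL)

# RSquare (U): share of the reduced model's -LogLikelihood that the fit removes.
r_square_u = (NLL_REDUCED - NLL_FULL) / NLL_REDUCED

# Small-sample corrected AIC, and BIC.
aicc = (2 * NLL_FULL + 2 * N_PARAMS
        + 2 * N_PARAMS * (N_PARAMS + 1) / (N_OBS - N_PARAMS - 1))
bic = 2 * NLL_FULL + N_PARAMS * math.log(N_OBS)

print(f"ChiSquare = {chi_square:.5f}")
print(f"RSquare (U) = {r_square_u:.4f}")
print(f"AICc = {aicc:.4f}, BIC = {bic:.3f}")
```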

Lack Of Fit

Source        DF   -LogLikelihood    ChiSquare
Lack Of Fit   18        7.1957477      14.3915
Saturated     19        0.0000000   Prob>ChiSq
Fitted         1        7.1957477       0.7032

Parameter Estimates

Term Estimate Std Error ChiSquare Prob>ChiSq Lower 95% Upper 95%
Intercept -25.601875 11.184069 5.24 0.0221* -56.110434 -8.8822154
MidtermScore 0.36376093 0.1581661 5.29 0.0215* 0.12889235 0.7958171

Confidence limits are likelihood-based.


For log odds of 1/0

Odds Ratios
For PassClass odds of 1 versus 0
Unit Odds Ratios
Per unit change in regressor

Term Odds Ratio Lower 95% Upper 95% Reciprocal


MidtermScore 1.43873 1.137568 2.216251 0.6950573
Range Odds Ratios
Per change in regressor over entire range

Term Odds Ratio Lower 95% Upper 95% Reciprocal


MidtermScore 2989.138 17.04116 40143701 0.0003345

Tests and confidence intervals on odds ratios are likelihood ratio based.
Inverse Prediction

Specified Probability(PassClass=1)   Predicted MidtermScore   Lower 95%   Upper 95%
0.3000000                            68.05178                 38.41326    72.13128
0.5000000                            70.38105                 59.86802    78.82096
0.7000000                            72.71032                 68.38742    98.44601

Confusion Matrix
Training

Actual        Predicted Count
PassClass       1      0
1              10      2
0               2      6
2. Below is output of a hierarchical clustering analysis using the PublicUtilities data set. Use it to answer the
following questions:

What does the vertical line in the scree plot mean?


The vertical line in the box goes through the third x, which indicates that three clusters might be a good choice.

How many clusters do you think might be best? Why?


The book indicates that 3 is the best, but I think 4 is better because the cluster running from Arizona Public to
Consolidated Edison is too large and there is a clear gap between Florida Power & Light and Boston Edison.

Use the parallel coordinate plot to characterize the clusters.


Cluster 1 is made up of the leftovers from clusters 2 and 3, so I don't find it meaningful. Cluster 2 has the highest Sales, while its Nuclear and Fuel
are low. Cluster 3 has higher Load and Fuel, while its Nuclear is low.
Hierarchical Clustering
Method = Ward
Dendrogram

Cluster  N Rows  Mean(Coverage)  Mean(Return)  Mean(Cost)  Mean(Load)  Mean(Peak)  Mean(Sales)  Mean(Nuclear)  Mean(Fuel)
1        14      1.17            11.71         155.21      55.31       2.43        8354.79      18.20          0.97
2         3      1.00             8.87         223.33      54.83       6.33        15504.6       0.00          0.57
3         5      1.02             9.14         171.40      62.94       3.66        6525.60       1.84          1.80
Parallel Plot

3. Explain how clusters can be used in regression.

The data can be separated into two or more clusters, and a separate regression can then be fit to each cluster to analyze
the relationships within it.

4. Discuss the benefits and drawbacks of the technique of decision trees.

Ability to categorize data in ways that other methods cannot.

Decision trees do not always produce the best results, but they offer a reasonable compromise between

models that perform well and models that can be simply explained.

A high-variance procedure: growing trees on two similar data sets probably will not produce two similar trees,

because an error in any one node does not stay in that node; it carries down to every node below it.

5. What is the difference between a classification and a regression tree?

If the target variable is categorical, then the decision tree is called a classification tree.

If the target variable is continuous, then the decision tree is called a regression tree.
6. Below is output from a partition of the freshman1 data set used to model the return of freshmen to their
sophomore year. Use the output to answer the following questions:

a. How many end nodes are there?

4 end nodes

b. Write a rule for each end node.


First node: the 11 freshmen whose GPA is less than 1.0159 do not return for the sophomore year.

Second node: the 5 freshmen whose GPA is greater than or equal to 3.624 do not return the next year.

Third node: the 22 freshmen whose GPA is less than 3.62 have a higher rate of returning the next year.

Fourth node: the 62 freshmen who are in Engineering, Business, Liberal Arts, and Sciences are the most likely to

return the next year.

c. What actions does this analysis suggest that a university can take to improve the retention rate?

I think most of the students who leave have a GPA lower than 1, so the university should try to find out the reasons and help them improve their
GPA. The university can also enroll more students in the Engineering, Business, Liberal Arts, and Sciences colleges.

Partition for return

RSquare    N      Number of Splits
0.640      100    3

7. Discuss the pros and cons of neural nets and how the two validation methods work.
A NN is like a black box: there is no way to determine precisely how it makes its predictions,
which makes the model hard to present.
There are alternatives that are simpler, faster, and easier to use than a NN and that can provide better performance, e.g. decision trees
and regression.

But a NN is capable of modeling extremely nonlinear phenomena. It is easy to conceptualize, has been used extensively
in industry for years, has many implementations, and
can be used for either classification (binary dependent variable) or prediction (continuous dependent variable).



Holdback Validation: the data set is randomly divided into two parts, a training sample and a holdback sample. They share
the same underlying model but have different noise. The weights are estimated on the training sample and then applied to
the holdback sample to calculate its error. As the process iterates, the error declines and the NN arrives at a model that is
common to both samples. The number of iterations matters: too few give a poor fit, and too many lead to
overfitting.

K-fold Cross-Validation: divide the data set into k folds. Use k-1 folds as the training data set and the remaining fold as the
validation data set, and compute the relevant measure of accuracy on the validation fold. Repeat this k times, holding out a different fold each time, to obtain k
measures of accuracy. Average them for an overall estimate of the accuracy. This also helps avoid overfitting.
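
The k-fold procedure can be sketched in plain Python. Everything here is illustrative rather than taken from the course materials: the toy scores and labels are made up, and the "model" is a simple threshold halfway between the two class means, standing in for whatever model is being validated.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the row indices 0..n-1 and deal them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, k, fit, score, seed=0):
    """Hold out each fold once, train on the rest, and average the k scores."""
    folds = k_fold_indices(len(xs), k, seed)
    scores = []
    for held_out, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        model = fit([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        scores.append(score(model, [xs[j] for j in val_idx], [ys[j] for j in val_idx]))
    return sum(scores) / k, scores


def fit_threshold(xs, ys):
    """Hypothetical stand-in model: cut halfway between the two class means."""
    ones = [x for x, y in zip(xs, ys) if y == 1]
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2


def accuracy(threshold, xs, ys):
    """Share of validation rows whose predicted class matches the actual class."""
    hits = sum((x >= threshold) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(xs)


# Made-up data: 20 scores, labelled 1 when the score is at least 70.
toy_scores = list(range(60, 80))
toy_labels = [1 if x >= 70 else 0 for x in toy_scores]

mean_accuracy, fold_accuracies = cross_validate(
    toy_scores, toy_labels, 5, fit_threshold, accuracy
)
print(mean_accuracy, fold_accuracies)
```

Because every row is held out exactly once, the averaged accuracy is computed only on data the model did not see during fitting.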

8. Explain SAS’ SEMMA phase of data mining/predictive analytics modeling process.

S—Sample. Extract a sample small but significant.

E—Explore. Use discovery tools and various data reduction tools to further understand data and

search for hidden trends and relationships.

M—Modify. Create, transform, and group variables to enhance the analysis.

M—Model. Choose and apply one or more appropriate data mining techniques.

A—Assess. Build several models using multiple techniques; evaluate; assess the usefulness;

and compare the models' results. If a small portion of the large data set was set aside during

the sample stage, validate and test the model.

9. Explain the three major steps in the text mining process.


1. Developing the document term matrix (DTM). Entries of 0 and 1 indicate whether each word appears in each document. Group
related words and remove infrequent words until satisfied.
2. Using multivariate techniques. Text visualization and text multivariate techniques are used to understand
the composition of the DTM.
3. Using predictive techniques. If a dependent variable exists, you can use the text multivariate analysis results
as independent variables.
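
Step 1 can be illustrated with a minimal sketch of a binary document term matrix. The three short documents, the whitespace tokenization, and the `min_doc_freq` cutoff are all assumptions made for this example; a real text-mining tool would also handle stemming, stop words, and phrase grouping.

```python
def document_term_matrix(docs, min_doc_freq=2):
    """Build a 0/1 matrix: rows are documents, columns are the retained terms."""
    tokenized = [set(doc.lower().split()) for doc in docs]

    # Count how many documents contain each term.
    doc_freq = {}
    for tokens in tokenized:
        for term in tokens:
            doc_freq[term] = doc_freq.get(term, 0) + 1

    # Drop infrequent terms, then mark presence (1) or absence (0).
    vocab = sorted(term for term, count in doc_freq.items() if count >= min_doc_freq)
    matrix = [[1 if term in tokens else 0 for term in vocab] for tokens in tokenized]
    return vocab, matrix


# Made-up documents for illustration only.
docs = [
    "the service was slow",
    "great service and great food",
    "slow food",
]
vocab, matrix = document_term_matrix(docs)
print(vocab)
for row in matrix:
    print(row)
```

The columns of this matrix are what steps 2 and 3 would then summarize and use as candidate independent variables.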

10. Evaluate each of your group members (including yourself) on a 1 to 5 scale (5 being best) and provide a short
description of each member’s contribution to the group activities.

Christopher Barajas very good at organizing info. into PowerPoint

Yi-Ting Chang very good at following Chris's suggestions and directions
Christopher Greder very good at using JMP and doing statistical analysis
Monica Lagos very good at relating the data to the practical world
Oscar Santana very good at communicating and interpreting