[Figure: scatter plot of long jump distance (inches) against Year - 1900]
In the data, the years 12, 20; 36, 48, 52; and 64, 68 show marked drops in the LOJ. This may be related to the First and Second World Wars.
b- Run a simple linear regression on the data. Does the linear regression model fit the data well?

            Coef      StdErr     tStat    pVal
Intercept   278.05    4.2534     65.372   8.49E-25
Slope       0.70915   0.078007   9.0909   1.53E-08

R Square = 80.5153, Adj R square = 79.5410
From the t statistics we see that both coefficients are significant at the 0.05 level (both p-values are essentially 0), and the R square of about 80.5% indicates that the model fits the data reasonably well.
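As a quick check on the regression output, each t statistic should equal the coefficient divided by its standard error. The sketch below recomputes them from the reported values; the variable names are illustrative and not part of the original output.

```python
# Reported estimates from the regression output (intercept, slope).
coef = [278.05, 0.70915]
std_err = [4.2534, 0.078007]

# The t statistic for H0: coefficient = 0 is estimate / standard error.
t_stats = [c / s for c, s in zip(coef, std_err)]

for name, t in zip(["intercept", "slope"], t_stats):
    print(f"{name}: t = {t:.3f}")
```

Both ratios agree with the reported tStat values up to rounding.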
c- Conduct an F test at 0.05 level to decide if there is a linear relationship between the performance of the long jump and the year of the game.

Source   df   SS         MS         F       P
Regr     1    10034.49   10034.49   82.64   0
Resid    20   2428.35    121.42
Total    21   12462.84
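As a cross-check on the ANOVA table, the mean squares, the F ratio and R square can be recomputed from the sums of squares and degrees of freedom. This is a minimal sketch; the variable names are illustrative.

```python
# Sums of squares and degrees of freedom as reported in the ANOVA table.
ss_regr, ss_resid = 10034.49, 2428.35
df_regr, df_resid = 1, 20

# Mean square = sum of squares / degrees of freedom.
ms_regr = ss_regr / df_regr
ms_resid = ss_resid / df_resid

# F ratio for H0: no linear relationship (slope = 0).
f_stat = ms_regr / ms_resid

# R square is the share of total variation explained by the regression.
r_squared = ss_regr / (ss_regr + ss_resid)

print(f"MS resid = {ms_resid:.2f}, F = {f_stat:.2f}, R^2 = {r_squared:.4f}")
```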
From the ANOVA table, we see that the linear relationship is significant at the 0.05 level (F = 82.64, p-value ≈ 0).

d- Get an estimate of the mean long jump performance in year 1896, and obtain a 95% confidence interval for the estimate.

The estimate for the mean long jump in the year 1896 (1896 - 1900 = -4) is given by the fitted model $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, where $\hat{\beta}_0 = 278.05$ and $\hat{\beta}_1 = 0.70915$. Plugging $x^* = -4$ into the regression model, we obtain the predicted value for the length of the jump, $\hat{y} = 275.2134$. We use the residual standard error $s$ to calculate the 95% confidence interval for the predicted value:

$$\hat{\beta}_0 + \hat{\beta}_1 x^* \pm t_{\alpha/2,\,n-2}\; s \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$
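The point estimate at x = -4 can be reproduced directly from the fitted coefficients. The sketch below also writes the half-width of a confidence interval for a mean response as a function. The function names are hypothetical, and the inputs `t_crit`, `s`, `x_bar` and `sxx` would have to come from the data, which is not reproduced here.

```python
import math

def predict_mean(b0, b1, x):
    """Fitted mean response at x for a simple linear regression."""
    return b0 + b1 * x

def ci_half_width(t_crit, s, n, x, x_bar, sxx):
    """Half-width of the confidence interval for the mean response at x.

    t_crit : critical t value with n - 2 degrees of freedom
    s      : residual standard error
    sxx    : sum of squared deviations of the x values from x_bar
    """
    return t_crit * s * math.sqrt(1.0 / n + (x - x_bar) ** 2 / sxx)

# Point estimate for 1896, coded as 1896 - 1900 = -4.
y_hat = predict_mean(278.05, 0.70915, -4)
print(f"predicted mean = {y_hat:.4f}")
```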
Using this information we have 275.2134 ± 6.0712 = (269.1422, 281.2846).

e- Analyze the regression output. Are there any outliers in the data? If so, remove the outliers and reanalyze the data. Obtain the residual plots and take a careful look. Do they still reveal any special pattern pertaining to the record setting nature of the data?
From the scatter plot in a, we see some unusual variation in the data, starting from the first year and at 12-20; 36, 48, 52; and 64, 68. These points are suspicious. To check for outliers we draw the residual and leverage charts.
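For a simple linear regression, the leverage of each observation depends only on how far its x value lies from the mean of the x values. The sketch below computes leverages for a short list of illustrative x values (not the full data set). The function name and the 2p/n flagging rule are assumptions added for illustration, not something the analysis above states.

```python
def leverages(x):
    """Leverage h_i = 1/n + (x_i - x_bar)^2 / Sxx for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Illustrative coded years (Year - 1900); NOT the complete data set.
years = [-4, 0, 4, 8, 12, 20]
h = leverages(years)

# Common rule of thumb: flag points with leverage above 2p/n,
# where p = 2 parameters (intercept and slope).
threshold = 2 * 2 / len(years)
flagged = [i for i, hi in enumerate(h, start=1) if hi > threshold]

for i, hi in enumerate(h, start=1):
    print(f"point {i}: leverage = {hi:.3f}")
```

The leverages always sum to the number of fitted parameters, which gives a simple sanity check on the computation.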
[Figure: residuals and leverage (X) plotted against observation index]
From the graphs above we see that the leverages are well behaved, but there are two suspicious points: the 1st and the 16th. Removing the first point (-4, 249.75), we obtain the following model:

            Coef      StdErr     tStat    pVal
Intercept   282.89    3.9418     71.767   1.34E-24
Slope       0.63328   0.070639   8.9651   2.97E-08

Source   df   MS        F       P
Regr     1    6974.74   80.37   0
Resid    19   86.78
Total    20
[Figure: plot for the refitted model; axis in inches]
The results show that removing the point does not bring a major improvement in the model.
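One way to put numbers on this comparison is to recover R square for both fits from their mean squares and degrees of freedom. This is a sketch under the assumption that each sum of squares equals the mean square times its degrees of freedom; the helper name `fit_summary` is hypothetical.

```python
import math

def fit_summary(ms_regr, ms_resid, df_resid, df_regr=1):
    """Recover R square and the residual standard error from ANOVA mean squares."""
    ss_regr = ms_regr * df_regr
    ss_resid = ms_resid * df_resid
    return {
        "r_squared": ss_regr / (ss_regr + ss_resid),
        "s": math.sqrt(ms_resid),
    }

# Mean squares as reported for the original fit and the fit without the first point.
full = fit_summary(ms_regr=10034.49, ms_resid=121.42, df_resid=20)
reduced = fit_summary(ms_regr=6974.74, ms_resid=86.78, df_resid=19)

print(f"all points:        R^2 = {full['r_squared']:.4f}")
print(f"first point removed: R^2 = {reduced['r_squared']:.4f}")
```

R square moves only from about 0.805 to about 0.809, which is in line with the conclusion above.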
The response is undercount (in terms of percentage). Use regression to investigate the relationship between undercount and the eight predictors.

a- Perform regression analysis using all the predictors except city. Show the regression residuals. Which predictors seem to be important? Draw the residual plot against the fitted value. What can you conclude from this plot?

               Coef       StdErr     tStat      pVal
intercept      -2.2177    1.3647     -1.6251    0.10957
minority       0.093366   0.02097    4.4523     3.92E-05
crime          0.034687   0.012775   2.7152     0.008712
poverty        -0.17352   0.085775   -2.023     0.047696
language       0.23078    0.092615   2.4918     0.015589
highschool     0.058204   0.045213   1.2873     0.2031
housing        -0.02252   0.023454   -0.96009   0.341
conventional   0.03612    0.009335   3.8691     0.000279
From the table, we see that at a significance level of 0.05 the predictors highschool and housing fail the t test: we cannot reject the hypothesis that their coefficients are equal to 0, so there is no evidence that they contribute to the model. The other predictors (minority, crime, poverty, language and conventional) are significant.
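The screening step can be written out explicitly by comparing each reported p-value with the significance level. This is a minimal sketch; the dictionary simply restates the pVal column for the seven predictors.

```python
# Reported p-values for the predictors (intercept omitted).
p_values = {
    "minority": 3.92e-05,
    "crime": 0.008712,
    "poverty": 0.047696,
    "language": 0.015589,
    "highschool": 0.2031,
    "housing": 0.341,
    "conventional": 0.000279,
}
alpha = 0.05

# A predictor passes the t test when its p-value is below alpha.
significant = [name for name, p in p_values.items() if p < alpha]
not_significant = [name for name, p in p_values.items() if p >= alpha]

print("significant:    ", significant)
print("not significant:", not_significant)
```

Note that poverty passes only narrowly (0.047696), so its status depends on the chosen level.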
[Figure: residuals against fitted values]
From the graph we see no relation between the fitted values and the residuals, which are scattered randomly around 0.

b- Explain how the variable city differs from the others.

The variable city is a qualitative variable. Even if we coded it as a number, it would be discrete and not normally distributed.
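If a qualitative variable such as city were to enter a regression, it would usually be coded as indicator (dummy) columns rather than as a single number. The sketch below shows that coding on made-up labels. The helper `one_hot` and the labels "a", "b", "c" are hypothetical and do not come from the data set.

```python
def one_hot(labels):
    """Code a qualitative variable as 0/1 indicator columns.

    The first level (in sorted order) is the baseline and gets no column,
    which avoids perfect collinearity with the intercept.
    """
    levels = sorted(set(labels))
    rows = [[1 if lab == lev else 0 for lev in levels[1:]] for lab in labels]
    return levels, rows

# Hypothetical labels, for illustration only.
labels = ["a", "b", "a", "c"]
levels, rows = one_hot(labels)

print("levels:", levels)
for lab, row in zip(labels, rows):
    print(lab, row)
```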
c- Use both best subset regression and stepwise regression to select variables from all the predictors (excluding the variable city). Compare your final models obtained by the two methods.

For the best subset selection method we have:

Vars   R-Sq    RSqa    C-p     BIC     AIC     s      variables
1      49.35   48.56   34.65   81.87   77.49   1.77   1
2      56.96   55.59   22.14   75.32   68.75   1.65   1 5
3      63.75   62      11.17   68.16   59.41   1.52   1 2 7
4      66.33   --      --      67.17   55.38   1.47   --
5      --      --      --      --      54.03   1.44   --
6      --      --      --      --      54.82   1.44   --
7      69.61   --      --      --      55.78   1.44   --
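The AIC column can be scanned for its minimum to find the preferred model size. The sketch below does this with the AIC values as read from the best subset output, one per number of variables; the dictionary name is illustrative.

```python
# AIC of the best model of each size (number of variables -> AIC).
aic = {1: 77.49, 2: 68.75, 3: 59.41, 4: 55.38, 5: 54.03, 6: 54.82, 7: 55.78}

# The preferred size is the one with the smallest AIC.
best_size = min(aic, key=aic.get)

print(f"smallest AIC = {aic[best_size]} with {best_size} variables")
```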
So for the AIC and Cp criteria the best regression is the model with 5 variables: minority, crime, poverty, language and conventional. This agrees with the predictors found significant in part a. For the stepwise selection method we have:
b- Use an F test to test at the 0.01 level the null hypothesis that the four treatments have the same bioactivity. Compute the p value of the observed F statistic.

The F test examines the null hypothesis that there is no difference between the treatments. The F statistic is calculated as follows:

$$F = \frac{MSTr}{MSE} = \frac{\sum_{i=1}^{I} n_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2 / (I - 1)}{\sum_{i=1}^{I} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i\cdot})^2 / (N - I)}$$

where $I$ is the number of treatments, $n_i$ is the sample size of treatment $i$, and $N$ is the total number of observations.

From the ANOVA table, we plug in the MSTr and the MSE data, obtaining F = 21.47 / 2.39 = 8.98. The p-value for this data is 0.0003, which is less than the 0.01 level of significance, so we have statistical evidence to reject the null hypothesis and claim that there are differences among the treatments.

c- The treatment averages are as follows: $\bar{y}_A$ = 66.10 (7 samples), $\bar{y}_B$ = 65.75 (8 samples), $\bar{y}_C$ = 62.63 (9 samples), $\bar{y}_D$ = 63.85 (6 samples). Use the Tukey method to perform multiple comparisons of the four treatments at the 0.01 level.
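Using the group sizes listed in part c, the F ratio from part b and its degrees of freedom can be re-derived in a few lines. This is a minimal sketch; the variable names are illustrative.

```python
# Sample sizes of the four treatments and the mean squares from the ANOVA table.
group_sizes = [7, 8, 9, 6]
ms_treatment, ms_error = 21.47, 2.39

# Degrees of freedom: (number of treatments - 1) and (total observations - number of treatments).
df_treatment = len(group_sizes) - 1
df_error = sum(group_sizes) - len(group_sizes)

# Observed F statistic.
f_stat = ms_treatment / ms_error

print(f"F({df_treatment}, {df_error}) = {f_stat:.2f}")
```

The p-value then comes from the upper tail of the F distribution with these degrees of freedom.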
d- It turns out that A and B are brand-name drugs and C and D are generic drugs. To compare brand-name and generic drugs, consider the contrast $\frac{1}{2}(\mu_A + \mu_B) - \frac{1}{2}(\mu_C + \mu_D)$. Obtain the value of the computed contrast and test its significance at the 0.01 level. Comment on the difference between brand-name and generic drugs.
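A contrast of this kind is usually estimated by applying the weights to the treatment averages, with a standard error built from the MSE and the sample sizes. The sketch below assumes that the averages and sample sizes from part c belong to A, B, C and D as 66.10 (7), 65.75 (8), 62.63 (9) and 63.85 (6), and that the MSE is 2.39 with 26 error degrees of freedom. That assignment is an assumption, and the variable names are illustrative.

```python
import math

# ASSUMED assignment of the part c averages and sample sizes to treatments.
means = {"A": 66.10, "B": 65.75, "C": 62.63, "D": 63.85}
sizes = {"A": 7, "B": 8, "C": 9, "D": 6}

# Contrast weights: brand-name (A, B) minus generic (C, D).
weights = {"A": 0.5, "B": 0.5, "C": -0.5, "D": -0.5}
ms_error = 2.39  # MSE from the ANOVA table

# Estimated contrast and its standard error.
estimate = sum(weights[k] * means[k] for k in means)
std_error = math.sqrt(ms_error * sum(weights[k] ** 2 / sizes[k] for k in means))

# t statistic, to be compared with a t distribution on the error degrees of freedom.
t_stat = estimate / std_error

print(f"contrast = {estimate:.3f}, SE = {std_error:.3f}, t = {t_stat:.2f}")
```

The t statistic would then be compared with the two-sided 0.01 critical value on 26 degrees of freedom.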