
Homework 1

Statistical design models

PROBLEM 1 (Ch.1 Ex.11)


The modern Olympic Games are an international athletic competition that has been held in a different city every four years since its inauguration in 1896, with some interruptions due to wars. The data for the gold medal performances in the men's long jump (distance in inches) are given. The first column, Year, is coded to be zero in 1900. Long jump performance is expected to improve over the years.

a- Draw a scatter plot of the response long jump against the covariate year. Comment on any striking features of the data by relating them to the world events of the specific years.
[Figure: "Length of jump per year", scatter plot of jump length (inches) against Year - 1900.]

In the data, the years 12 and 20; 36, 48 and 52; and 64 and 68 show marked drops in the length of jump (LOJ). This may be related to the First and Second World Wars.

b- Run a simple linear regression on the data. Does the linear regression model fit the data well?

            Coef      StdErr     tStat    pVal
Intercept   278.05    4.2534     65.372   8.49E-25
Year        0.70915   0.078007   9.0909   1.53E-08

R Square 80.5153    Adj R Square 79.5410

From the t statistics we can check that both coefficients are significant (at significance level α = 0.05, both p-values are essentially 0), and the R square of about 80.5% indicates that the model fits the data reasonably well.
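As a quick consistency check on the regression output, each t statistic is the coefficient divided by its standard error, and the adjusted R square follows from R square and the sample size. The sketch below assumes n = 22 observations (the data themselves are not reproduced here) and uses illustrative variable names.

```python
# Values copied from the regression output; the dictionary labels are illustrative.
coef = {"intercept": 278.05, "year": 0.70915}
stderr = {"intercept": 4.2534, "year": 0.078007}

# t statistic = estimate / standard error
t_stat = {name: coef[name] / stderr[name] for name in coef}

# Adjusted R^2 for n observations and p predictors (ASSUMED: n = 22, p = 1)
n, p = 22, 1
r2 = 0.805153
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

for name, t in t_stat.items():
    print(f"{name}: t = {t:.3f}")
print(f"adjusted R^2 = {100 * adj_r2:.2f}%")
```

Small differences in the last digit against the reported output are expected, because the inputs are already rounded.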


c- Conduct an F test at the 0.05 level to decide if there is a linear relationship between the performance of the long jump and the year of the game.

Source   df   SS         MS         F       P
Regr     1    10034.49   10034.49   82.64   0
Resid    20   2428.35    121.42
Total    21   12462.84
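The entries of the ANOVA table are linked by simple arithmetic, which the following minimal sketch reproduces from the sums of squares and degrees of freedom. The variable names are illustrative; only the numbers come from the table.

```python
# Sums of squares and degrees of freedom from the ANOVA table
ss_regr, df_regr = 10034.49, 1
ss_resid, df_resid = 2428.35, 20

ms_regr = ss_regr / df_regr        # mean square for regression
ms_resid = ss_resid / df_resid     # mean square error (estimate of sigma^2)
f_stat = ms_regr / ms_resid        # F statistic on (1, 20) degrees of freedom
r2 = ss_regr / (ss_regr + ss_resid)

# In simple linear regression, F equals the square of the slope t statistic.
t_slope = 9.0909
print(f"MS resid = {ms_resid:.2f}, F = {f_stat:.2f}, t^2 = {t_slope ** 2:.2f}")
print(f"R^2 = {100 * r2:.2f}%")
```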

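From the ANOVA table, F = 82.64 with a p-value of essentially 0, well below 0.05, so we reject the hypothesis of no linear relationship between long jump performance and year.

d- Get an estimate of the mean long jump performance in year 1896, and obtain a 95% confidence interval for the estimate.

The year 1896 is coded as $x_0 = 1896 - 1900 = -4$. The estimate of the mean long jump is given by the fitted model $\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0$, where $\hat{\beta}_0 = 278.05$ and $\hat{\beta}_1 = 0.70915$, so plugging in $x_0 = -4$ we obtain the predicted value for the length of jump, $\hat{y}_0 = 275.2134$. We use the residual mean square to calculate the 95% confidence interval for this mean:

$$\hat{\beta}_0 + \hat{\beta}_1 x_0 \pm t_{n-2,\alpha/2}\,\hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$

Here $\hat{\beta}_0 + \hat{\beta}_1 x_0 = 275.2134$, $n = 22$, $t_{n-2,\alpha/2} = t_{20,0.025} = 2.086$, $\hat{\sigma} = \sqrt{\hat{\sigma}^2} = \sqrt{121.42} = 11.019$, $\bar{x} = 45.45455$, $(x_0 - \bar{x})^2 = 2445.752$ and $\sum_{i=1}^{n}(x_i - \bar{x})^2 = 19953.45$.

The standard error of the estimated mean is therefore $11.019\sqrt{1/22 + 2445.752/19953.45} = 11.019 \times 0.4099 = 4.517$, and the half-width of the interval is $2.086 \times 4.517 = 9.42$. Using this information we have 275.2134 ± 9.42 = (265.79, 284.64).

e- Analyze the regression output. Are there any outliers in the data? If so, remove the outliers and reanalyze the data. Obtain the residual plots and take a careful look. Do they still reveal any special pattern pertaining to the record-setting nature of the data?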

From the scatter plot in part a, we see some unusual variation in the data, starting with the first year and around years 12-20; 36, 48, 52; and 64, 68. These points are suspicious. To check for outliers we plot the standardized residuals and the leverages.


[Figures: "outlier detection from leverage (X)", leverage against observation index; "outlier detection from standardized residual (R)", standardized residual against observation index.]
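The leverage values in the first plot depend only on the covariate, so they can be sketched without the jump lengths. The code below is a minimal illustration that ASSUMES the covariate is the set of Olympic years from 1896 to 1992, skipping the cancelled games of 1916, 1940 and 1944; this assumption is not stated in the data description, but it reproduces the mean 45.45455 and the sum of squares 19953.45 used in part d. The cutoff 2p/n is a common rule of thumb, not necessarily the one used for the plot.

```python
# ASSUMPTION: Olympic years 1896-1992 without 1916, 1940, 1944, coded as year - 1900.
years = [y - 1900 for y in range(1896, 1993, 4) if y not in (1916, 1940, 1944)]

n = len(years)
x_bar = sum(years) / n
sxx = sum((x - x_bar) ** 2 for x in years)

# Leverage of observation i in simple linear regression:
# h_ii = 1/n + (x_i - x_bar)^2 / Sxx
leverage = [1 / n + (x - x_bar) ** 2 / sxx for x in years]

# Rule of thumb: flag leverages above 2p/n, with p = 2 coefficients.
threshold = 2 * 2 / n
flagged = [i + 1 for i, h in enumerate(leverage) if h > threshold]

print(f"n = {n}, x_bar = {x_bar:.5f}, Sxx = {sxx:.2f}")
print(f"largest leverage = {max(leverage):.4f}, threshold = {threshold:.4f}")
print("observations above threshold:", flagged)
```

Under this assumption the largest leverage belongs to the first observation (year -4), which is also the point furthest from the mean year.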

From the graphs above we see that the leverages are well behaved, but the standardized residuals show two suspicious points: the 1st and the 16th. Removing the first point (-4, 249.75), we obtain the following model:

            Coef      StdErr     tStat    pVal
Intercept   282.89    3.9418     71.767   1.34E-24
Year        0.63328   0.070639   8.9651   2.97E-08

Source   df   SS        MS        F       P
Regr     1    6974.74   6974.74   80.37   0
Resid    19   1648.82   86.78
Total    20   8623.56

R Square 80.8800    Adj R Square 79.8737


[Figure: "Length of jump per year" after removing the first point, scatter plot of jump length (inches) against Year - 1900.]


The results show that removing the point does not improve the model much: R square changes only from 80.52 to 80.88.

PROBLEM 2 (Ch.1 Ex.13)


The data in the table are from the 1980 U.S. census undercount (Ericksen et al., 1989). There are 66 rows and 10 columns. The first column is the place where the data were collected. There are eight predictors:

1. Minority: minority percentage.
2. Crime: rate of serious crimes per 1000 population.
3. Poverty: percentage poor.
4. Language: percentage having difficulty speaking or writing English.
5. High school: percentage age 25 or older who had not finished high school.
6. Housing: percentage of housing in small, multi-unit buildings.
7. City: a factor with two levels: city (major city), state (state remainder).
8. Conventional: percentage of households counted by conventional personal enumeration.

The response is undercount (in terms of percentage). Use regression to investigate the relationship between undercount and the eight predictors.

a- Perform regression analysis using all the predictors except city. Show the regression residuals. Which predictors seem to be important? Draw the residual plot against the fitted value. What can you conclude from this plot?

               Coef        StdErr     tStat      pVal
intercept      -2.2177     1.3647     -1.6251    0.10957
minority       0.093366    0.02097    4.4523     3.92E-05
crime          0.034687    0.012775   2.7152     0.008712
poverty        -0.17352    0.085775   -2.023     0.047696
language       0.23078     0.092615   2.4918     0.015589
highschool     0.058204    0.045213   1.2873     0.2031
housing        -0.02252    0.023454   -0.96009   0.341
conventional   0.03612     0.009335   3.8691     0.000279

From the table, we see that at significance level α = 0.05 the predictors highschool and housing fail the t test (p-values 0.2031 and 0.341). We therefore cannot reject the hypothesis that their coefficients are equal to 0, so they are not significant for the model once the other predictors are included. The remaining predictors (minority, crime, poverty, language and conventional) are significant.
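The decision rule used here, comparing each p-value with α, can be written out as a short sketch. The dictionary below copies the p-values from the coefficient table (the intercept is left out on purpose); the variable names are illustrative.

```python
alpha = 0.05

# p-values of the seven predictors, copied from the regression output
p_values = {
    "minority": 3.92e-05,
    "crime": 0.008712,
    "poverty": 0.047696,
    "language": 0.015589,
    "highschool": 0.2031,
    "housing": 0.341,
    "conventional": 0.000279,
}

# A coefficient is declared significant when its p-value is below alpha.
significant = [name for name, p in p_values.items() if p < alpha]
not_significant = [name for name, p in p_values.items() if p >= alpha]

print("significant at 0.05:    ", significant)
print("not significant at 0.05:", not_significant)
```

Note that poverty passes only narrowly (0.0477), so its status would change at a stricter level such as 0.01.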


[Figure: residuals against fitted values.]

From the graph we see no relation between the fitted values and the residuals, which are scattered randomly around 0.

b- Explain how the variable city differs from the others.

The variable city is a qualitative variable. Even if we coded it as a number, it would be discrete and not normally distributed.

c- Use both best subset regression and stepwise regression to select variables from all the predictors (excluding the variable city). Compare your final models obtained by the two methods.

For the best subset selection method we have:

Vars   R-Sq    RSqa    C-p     BIC     AIC     s      variables
1      49.35   48.56   34.65   81.87   77.49   1.77   1
2      56.96   55.59   22.14   75.32   68.75   1.65   1 5
3      63.75   62      11.17   68.16   59.41   1.52   1 2 7
4      66.91   64.75   7.14    66.33   55.38   1.47   1 2 4 7
5      68.55   65.93   6.01    67.17   54.03   1.44   1 2 3 4 7
6      69.12   65.98   6.92    70.15   54.82   1.44   1 2 3 4 5 7
7      69.61   65.94   8       73.3    55.78   1.44   1 2 3 4 5 6 7

The variables are numbered in the order of the predictor list with city omitted, so that 1 = minority, 2 = crime, 3 = poverty, 4 = language, 5 = highschool, 6 = housing and 7 = conventional.

So by the AIC and C-p criteria the best regression is the model with 5 variables: minority, crime, poverty, language and conventional. This result agrees with the predictors found significant in part a.

The stepwise selection method arrives at the same model, so both methods agree.
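The choice of model size under each criterion can be checked mechanically. The following minimal sketch copies the criterion values from the best subset output into a list (the tuple layout is an illustrative choice) and picks the model that minimizes AIC, C-p and BIC in turn.

```python
# Each row: (number of variables, R-Sq, adjusted R-Sq, C-p, BIC, AIC),
# copied from the best subset output.
models = [
    (1, 49.35, 48.56, 34.65, 81.87, 77.49),
    (2, 56.96, 55.59, 22.14, 75.32, 68.75),
    (3, 63.75, 62.00, 11.17, 68.16, 59.41),
    (4, 66.91, 64.75, 7.14, 66.33, 55.38),
    (5, 68.55, 65.93, 6.01, 67.17, 54.03),
    (6, 69.12, 65.98, 6.92, 70.15, 54.82),
    (7, 69.61, 65.94, 8.00, 73.30, 55.78),
]

# Smaller is better for C-p, BIC and AIC.
best_cp = min(models, key=lambda m: m[3])[0]
best_bic = min(models, key=lambda m: m[4])[0]
best_aic = min(models, key=lambda m: m[5])[0]

print(f"C-p selects {best_cp} variables")
print(f"AIC selects {best_aic} variables")
print(f"BIC selects {best_bic} variables")
```

BIC penalizes model size more heavily than AIC, so it can favor a smaller model than the other two criteria.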

PROBLEM 3 (Ch.2 Ex.11)


The bioactivity of four different drugs A, B, C, D for treating a particular illness was compared in a study and the following ANOVA table was given for the data:

a- Describe a proper design of the experiment to allow valid inferences to be made from the data.

From the description of the problem we know that there are 4 treatments to be tested and 30 samples. So the appropriate design is a one-way layout (one-way ANOVA with a single factor), in which the 4 treatments are randomly assigned to the 30 cases.


b- Use an F test to test at the 0.01 level the null hypothesis that the four treatments have the same bioactivity. Compute the p value of the observed F statistic.

The F test examines the null hypothesis that there is no difference between the treatments. With $k = 4$ treatments, $n_i$ observations on treatment $i$ and $N = 30$ observations in total, the F statistic is calculated as follows:

$$F = \frac{\sum_{i=1}^{k} n_i(\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2 / (k-1)}{\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_{i\cdot})^2 / (N-k)} = \frac{MSTr}{MSE}$$

From the ANOVA table, we plug in MSTr and MSE, obtaining $F = 21.47/2.39 = 8.98$. The p value for this statistic is 0.0003, which is less than the 0.01 level of significance, so we have statistical evidence to reject the null hypothesis and conclude that there are differences among the treatments.

c- The treatment averages are as follows: $\bar{y}_A = 66.10$ (7 samples), $\bar{y}_B = 65.75$ (8 samples), $\bar{y}_C = 62.63$ (9 samples), $\bar{y}_D = 63.85$ (6 samples). Use the Tukey method to perform multiple comparisons of the four treatments at the 0.01 level.
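One way to organize the Tukey comparisons for part c is sketched below. It is only an illustration under stated assumptions: the averages are taken to belong to A, B, C, D in the order 66.10 (7), 65.75 (8), 62.63 (9), 63.85 (6); MSE = 2.39 with 30 - 4 = 26 error degrees of freedom comes from part b; and the unequal-sample-size (Tukey-Kramer) form of the statistic is used. The critical value of the studentized range at level 0.01 with 4 treatments and 26 degrees of freedom is deliberately not hard-coded; it has to be looked up and passed in as `q_crit`. The function names are hypothetical.

```python
import math
from itertools import combinations

mse = 2.39  # mean square error from part b (26 error degrees of freedom)

# ASSUMED labelling of the treatment averages: name -> (mean, sample size)
groups = {"A": (66.10, 7), "B": (65.75, 8), "C": (62.63, 9), "D": (63.85, 6)}


def tukey_statistics(groups, mse):
    """Return {(i, j): t_ij} with t_ij = (mean_j - mean_i) / sqrt(mse * (1/n_i + 1/n_j))."""
    stats = {}
    for i, j in combinations(groups, 2):
        mean_i, n_i = groups[i]
        mean_j, n_j = groups[j]
        se = math.sqrt(mse * (1 / n_i + 1 / n_j))
        stats[(i, j)] = (mean_j - mean_i) / se
    return stats


def significant_pairs(stats, q_crit):
    """Pairs declared different: |t_ij| exceeds q_crit / sqrt(2)."""
    cutoff = q_crit / math.sqrt(2)
    return [pair for pair, t in stats.items() if abs(t) > cutoff]


stats = tukey_statistics(groups, mse)
for pair, t in stats.items():
    print(pair, round(t, 3))
```

Calling `significant_pairs(stats, q_crit)` with the tabulated critical value then lists the pairs of drugs that differ at the 0.01 level.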

d- It turns out that A and B are brand-name drugs and C and D are generic drugs. To compare brand-name and generic drugs, the contrast $\frac{1}{2}(\bar{y}_A + \bar{y}_B) - \frac{1}{2}(\bar{y}_C + \bar{y}_D)$ is computed. Obtain the p value of the computed contrast and test its significance at the 0.01 level. Comment on the difference between brand-name and generic drugs.
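The contrast and its t statistic can be evaluated from the treatment averages and the MSE, as in the following minimal sketch. It ASSUMES the same labelling as in part c (A: 66.10 with 7 samples, B: 65.75 with 8, C: 62.63 with 9, D: 63.85 with 6) and MSE = 2.39 on 26 degrees of freedom; the variable names are illustrative. The p value itself is not computed here, since it requires the t distribution with 26 degrees of freedom from a table or a statistics package.

```python
import math

mse = 2.39  # mean square error from part b (26 error degrees of freedom)

# ASSUMED labelling: name -> (mean, sample size, contrast coefficient)
groups = {
    "A": (66.10, 7, 0.5),
    "B": (65.75, 8, 0.5),
    "C": (62.63, 9, -0.5),
    "D": (63.85, 6, -0.5),
}

# Contrast estimate: sum of c_i * mean_i
estimate = sum(c * mean for mean, n, c in groups.values())

# Standard error: sqrt(MSE * sum of c_i^2 / n_i)
std_error = math.sqrt(mse * sum(c ** 2 / n for mean, n, c in groups.values()))

# t statistic, to be compared with the t distribution on 26 degrees of freedom
t_stat = estimate / std_error

print(f"contrast = {estimate:.3f}, SE = {std_error:.4f}, t = {t_stat:.3f}")
```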

