Scope of this tutorial: discussion of scatter plots; regression exercise; revision of earlier hypothesis tests
(c) Annual income and credit card balance of bank clients: positive relation.
Negative linear relation between temperature and latitude: higher latitude => lower temp.
Positive linear relation between chest girth and weight for males.
Astronomy: galaxy
What is a galaxy? A galaxy is a collection of stars, ranging from ten million (10^7) up to a hundred trillion (10^14) stars.
Group of galaxies
454 km/sec
H: Ho: β = 0
A: The relation appears reasonably linear. The points seem to be fairly evenly spread around the line with no obvious outliers, indicating that the residuals have constant standard deviation and may be normally distributed.
T: t = 6.036, df = 22
P: p-value ≈ 0. Since p < 0.05, reject Ho.
Predictions:
Predict the radial velocity of a galaxy which is 1.25 Mpc from Earth.
v = -40.784 + 454.158*distance = -40.784 + 454.158*1.25 = 526.9 km/sec
Predict the radial velocity of a galaxy which is 2.25 Mpc from Earth.
2.25 megaparsecs is outside the range of the data, hence it is not valid to predict.
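The prediction rule above can be sketched in a few lines of Python. This is only an illustration: the function name and the range limits `X_MIN` and `X_MAX` are assumptions made for the example (the tutorial states only that 2.25 Mpc lies outside the observed distances, not what the exact range is).

```python
def predict_velocity(distance_mpc, x_min, x_max):
    """Predict radial velocity (km/sec) from distance (Mpc) with the fitted line.

    Returns None when the distance lies outside the observed range,
    because extrapolating beyond the data is not valid.
    """
    intercept = -40.784   # fitted constant from the tutorial
    slope = 454.158       # fitted slope (km/sec per Mpc) from the tutorial
    if not (x_min <= distance_mpc <= x_max):
        return None
    return intercept + slope * distance_mpc


# Assumed range of observed distances, for illustration only.
X_MIN, X_MAX = 0.0, 2.0

print(predict_velocity(1.25, X_MIN, X_MAX))  # inside the range: a number
print(predict_velocity(2.25, X_MIN, X_MAX))  # outside the range: None
```

Returning None rather than a number makes the "not valid to predict" case explicit instead of silently extrapolating.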
C: There is a significant positive linear relation between distance and radial velocity (Hubble's law). For each increase in distance of 1 megaparsec (Mpc) from Earth, a galaxy's velocity increases by 454 km/sec, on average. We are 95% confident that the true increase is between 298 and 610 km/sec.
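The 95% confidence interval quoted above can be checked from the slope and its t statistic. The sketch below recovers the standard error as slope / t and uses a critical value of 2.074, which is an assumption taken from standard t tables (two-sided 95%, 22 df) rather than a figure given in the tutorial.

```python
slope = 454.158    # fitted slope (km/sec per Mpc)
t_stat = 6.036     # test statistic for Ho: slope = 0
t_crit = 2.074     # assumed: 95% two-sided critical value of t with 22 df

# Under Ho: slope = 0, t = slope / SE, so SE = slope / t.
se_slope = slope / t_stat
margin = t_crit * se_slope
ci = (slope - margin, slope + margin)

print(round(se_slope, 2), [round(v) for v in ci])
```

Rounded to whole km/sec, the interval agrees with the 298 to 610 km/sec quoted in the conclusion.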
Predictions:
Predict the distance from Earth for a celestial object which has a radial velocity of 400 km/sec.
Not valid: we cannot predict the independent variable (X) from the outcome (Y).
(For those curious: If we really want to predict distance from velocity, we need to re-do the regression using velocity as x (independent variable) and distance as y (dependent variable). Then the new regression will be Distance = a + b*velocity)
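To see why the regression has to be re-done rather than simply rearranged, here is a minimal least-squares sketch. The four data points are made up for illustration (they are not the galaxy data), and `least_squares` is a hypothetical helper, not EcStat output.

```python
def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up illustrative data, NOT the tutorial's data set.
distance = [0.5, 1.0, 1.5, 2.0]          # Mpc
velocity = [150.0, 500.0, 550.0, 900.0]  # km/sec

a_v, b_v = least_squares(distance, velocity)  # velocity = a_v + b_v * distance
a_d, b_d = least_squares(velocity, distance)  # distance = a_d + b_d * velocity

# Rearranging the first line algebraically would give a slope of 1 / b_v,
# which is not the slope of the re-fitted regression unless r = +1 or -1.
print(b_d, 1 / b_v)
```

The two slopes differ because each regression minimises errors in its own outcome variable; their product equals r2, which is below 1 whenever the points do not lie exactly on a line.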
Goodness-of-fit statistic r2
Interpret the goodness-of-fit statistic: r2 = 0.624. 62.4% of the variation in radial velocity of galaxies can be explained by the variation in distance from Earth.
Calculate and interpret the correlation coefficient: r = +√0.624 = 0.79, indicating there is a fairly strong positive linear relation between the two variables.
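The step from r2 to r can be written as a small helper. The function name is hypothetical; the only rule it encodes is the one used above, namely that r takes the sign of the regression slope.

```python
import math


def correlation_from_r_squared(r_squared, slope):
    """Return r as the square root of r2, carrying the sign of the slope."""
    r = math.sqrt(r_squared)
    return r if slope >= 0 else -r


# Galaxy example: r2 = 0.624 with a positive slope of 454.158.
print(round(correlation_from_r_squared(0.624, 454.158), 2))
```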
Revision Questions

Variable   Description
ID         ID of male (1 to 252)
D          Density determined from underwater weighing
BF%        Percent body fat from Siri's (1956) equation
Age        Age (years)
W          Weight (kg)
H          Height (m)
BMI        Body Mass Index (kg/m2)
Nec        Neck circumference (cm)
Che        Chest circumference (cm)
Abd        Abdomen circumference (cm)
Hip        Hip circumference (cm)
Thi        Thigh circumference (cm)
Kne        Knee circumference (cm)
Ank        Ankle circumference (cm)
Bic        Biceps (extended) circumference (cm)
Arm        Forearm circumference (cm)
Wri        Wrist circumference (cm)
Question 1: Display
1. a) What type of graphical display should you provide to compare the percentage body fat (BF%) of males aged less than 39 years and males aged 39 years or more?
BF%: numeric (continuous) variable.
Less than or more than 39 years old: binary variable (new variable).
Hence comparative box plots.
b) An obese person is said to have a body mass index (BMI) of more than 30. What type of graphical display should you provide to compare the proportions of obese males aged less than 39 years with those aged 39 or more years?
BMI above or below 30 (obese or not obese): binary variable (new variable).
Less than or more than 39 years old: binary variable (new variable).
Carry out a suitable hypothesis test to answer the research question. Assume that the variation in BMI has not changed.
Question 3(a)
Was there a difference between the average percentage body fat (BF%) of American males in 1985 aged less than 39 years and the average BF% of American males aged 39 years or more? => 2-sample t-test
[Figures: frequency histograms with panels labelled BMI, <39yrs and >39yrs]
0.0003
95% CI = ȳ ± 1.96 × σ/√n = 27.2895 ± 1.96 × 3.5/√20 = (25.76, 28.82)
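The interval above can be reproduced with a short calculation. This is a sketch of the z-based formula mean ± 1.96 × sd / √n, reading 27.2895 as the sample mean, 3.5 as the assumed-known standard deviation and 20 as the sample size; the function name is hypothetical.

```python
import math


def z_confidence_interval(mean, sigma, n, z=1.96):
    """Return (lower, upper) for mean +/- z * sigma / sqrt(n)."""
    margin = z * sigma / math.sqrt(n)
    return mean - margin, mean + margin


lower, upper = z_confidence_interval(27.2895, 3.5, 20)
print(round(lower, 2), round(upper, 2))
```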
Question 3(b): Was the ankle circumference 5cm more, on average, than the wrist circumference of American males in 1985? => paired t-test
We are 95% confident that the BF% of males aged over 39 years is between 1.87% and 6.26% higher than that of the younger males, on average.
[Figure: frequency histogram of the differences]
Question 4: Regression
Research Question: Was the BMI of American males in 1985 a useful predictor of BF%? Use the output to complete this question.
1. Which is the dependent/response variable?
2. Which is the independent/predictor variable?
3. Comment on the scatterplot.
4. Write down the equation of the regression line.
5. Test the statistical significance of the relation.
6. Predict, if appropriate, the expected % Body Fat for: (a) a male with a BMI of 20; (b) a male with a BMI of 15.
7. Predict, if appropriate, the expected BMI for a male with 20% Body Fat.
8. (a) Calculate r and interpret. (b) Calculate r2 and interpret.
1. Which is the dependent/response variable? BF%
2. Which is the independent/predictor variable? BMI
3. Comment on the scatter plot.
[Figure: scatter plot of BF% vs BMI]
Positive linear relation: higher BMI => higher BF%.
Residuals have constant SD.
No outliers; symmetric on both sides.
6. Predict, if appropriate, the expected % Body Fat for: (a) a male with a BMI of 20; (b) a male with a BMI of 15.
A male with a BMI of 20: BF% = -26.9872 + 1.8186*20 = 9.38
A male with a BMI of 15: not valid to predict, since 15 is outside the range of the data.
7. Predict, if appropriate, the expected BMI for a male with 20% Body Fat.
Not valid to predict the independent variable (predictor or x) from the dependent variable (outcome or y).
8. (a) Calculate r and interpret. r = +√0.535 = 0.73. There is a fairly strong positive linear relation between BMI and BF%.
(b) Calculate r2 and interpret. r2 = 0.535. This indicates that about 53.5% of the variation in BF% can be explained by the variation in BMI.
Question 5: Best predictor
Research Question: Which of BMI, neck circumference or abdomen circumference is the best predictor of BF%?
Each of the predictors is a significant predictor of BF%; the p-val for each of the predictors is 0.000. Each regression equation satisfies the assumptions of linearity, constant spread and normality of the residuals. However, the abdomen circumference (Abd) provides the best fit, as its r2 = 67% is much higher than the others.
Note:
1. NEVER compare values of b.
2. It is easier, and better, to compare r2 instead of p-vals.
3. Discard (cross out) variables if they break assumptions or if p-val > 0.05.
Question 1
(a) By comparing the regression line (solid) with the line y = x (i.e. ideal weight_f = weight_f) (dotted), comment on the scatter plot.
(b) From the partial EcStat output above, perform an appropriate hypothesis test to see if there is a linear relation between Ideal weight_f (Y) and Weight_f (X). Partial ans: t = 21.29
Question 1 (answers)
Question 1 (continued)
(c) What is the value of goodness-of-fit statistic? Interpret its meaning. Ans: 70.8% Meaning: .
Question 2
The table on the right shows Accounting and Statistics marks for 12 students. Research question: Can Accounting marks (X) be used to predict Statistics marks (Y)? Use the partial EcStat output below to answer the research question.
[Figure: scatter plot with Acc on the horizontal axis]

df: 10    outcome: Stat
predictor   coeff    SE      t        p-value
constant    7.0194   7.971   0.8806   0.399
Acc         0.9560   0.129
r-sq: 0.845    Resid SS: 1046.876    s: 10.232
Question 2 (answers)
(Partial Ans: t=7.411)
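The partial answer can be checked from the output: for Ho: β = 0 the test statistic is the coefficient divided by its standard error. The sketch below applies this to the Acc row (0.9560, 0.129) and, for comparison, to the Weight row (0.6699, 0.061) of the Question 3 output further on; `t_statistic` is a hypothetical helper name.

```python
def t_statistic(coeff, se, null_value=0.0):
    """Return t = (coefficient - null value) / standard error."""
    return (coeff - null_value) / se


t_acc = t_statistic(0.9560, 0.129)     # Question 2: Acc predicting Stat
t_weight = t_statistic(0.6699, 0.061)  # Question 3: Weight predicting Height

print(round(t_acc, 3), round(t_weight, 2))
```

Because the printed coefficients and standard errors are rounded, a t value computed this way can differ slightly from the one the software reports from unrounded values.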
Question 2 (continued)
(b) What is the value of goodness-of-fit statistic? Interpret its meaning.
Question 3
[Figure: scatter plot of Height vs Weight]
Research question: Can Weight (X) be used to predict Height (Y)? Use the partial EcStat output below to answer the research question.

df: 82    outcome: Height
predictor   coeff      SE      t         p-value   95% C.I.
constant    130.1702   4.041   32.2109   0.000     122.131   138.209
Weight      0.6699     0.061
r-sq: 0.595    Resid SS: 2855.483    s: 5.901
Question 3 (answers)
(Partial Ans: t=10.98)
Question 3 (continued)
(b) What is the value of goodness-of-fit statistic? Interpret its meaning.
Question 4
[Figure: scatter plot of Weight vs Height]
Question 4 (answers)
(Partial Ans: t=10.976)
Question 4 (continued)
(b) What is the value of goodness-of-fit statistic? Interpret its meaning.
Question 5
For each of the following given regression equations, interpret (i) the equation and (ii) r2.
(a) X = time a bee spends on a flower, Y = % pollen removed
y = 13 + 2.05x, r2 = 0.384
Interpretation of equation (slope):
Interpretation of r2:
Question 5 (continued)
(b) X = students' high school results, Y = STAT170 exam results
y = 29.23 + 0.54x, r2 = 6.2%
Question 5 (continued)
(c) X = number of cans of beer drunk, Y = blood alcohol content
y = 0.0217 + 0.0203x, r2 = 82.1%
Make sure X (Account) and Y (Stat) are chosen correctly; otherwise you will get the wrong graph and the wrong regression results.
df: 10    outcome: Stat (Y)
predictor     coeff    SE      t        p-value
constant      7.0194   7.971   0.8806   0.399
Account (X)   0.9560   0.129   7.3927   0.000
r-sq: 0.845    Resid SS: 1046.876    s: 10.232
Fitted line: Stat (Y) = 7.0194 + 0.956 Account (X)
[Figure: scatter plot of Stat (Y) vs Account (X)]
Question 1(continued)
Fill in the following answers:
(a) Ho: ___________________
(b) Write down the regression equation: ______________________________
(c) What is the value of test statistic? (Include symbol z/t) ___________________
(d) What is the value of p-val? ________
(e) Do you reject or not reject Ho? _________
(f) What is a 95% CI for β? ____________________
(g) Does the 95% CI for β include the null value? ______
(h) What is the value of goodness-of-fit statistic? _______
Question 2(continued)
Fill in the following answers:
(a) Ho: ___________________
(b) Write down the regression equation: ______________________________
(c) What is the value of test statistic? (Include symbol z/t) ___________________
(d) What is the value of p-val? ________
(e) Do you reject or not reject Ho? _________
(f) What is a 95% CI for β? ____________________
(g) Does the 95% CI for β include the null value? ______
(h) What is the value of goodness-of-fit statistic? _______
Question 3(continued)
Fill in the following answers:
(a) Ho: ___________________
(b) Write down the regression equation: ______________________________
(c) What is the value of test statistic? (Include symbol z/t) ___________________
(d) What is the value of p-val? ________
(e) Do you reject or not reject Ho? _________
(f) What is a 95% CI for β? ____________________
(g) Does the 95% CI for β include the null value? ______
(h) What is the value of goodness-of-fit statistic? _______
Question 4 (continued)
Fill in the following answers:
(a) Ho: ___________________
(b) Write down the regression equation: ______________________________
(c) What is the value of test statistic? (Include symbol z/t) ___________________
(d) What is the value of p-val? ________
(e) Do you reject or not reject Ho? _________
(f) What is a 95% CI for β? ____________________
(g) Does the 95% CI for β include the null value? ______
(h) What is the value of goodness-of-fit statistic? _______
Question 5 (continued)
Fill in the following answers:
(a) Ho: ___________________
(b) Write down the regression equation: ______________________________
(c) What is the value of test statistic? (Include symbol z/t) ___________________
(d) What is the value of p-val? ________
(e) Do you reject or not reject Ho? _________
(f) What is a 95% CI for β? ____________________
(g) Does the 95% CI for β include the null value? ______
(h) What is the value of goodness-of-fit statistic? _______