
21

The Simple Regression Model

The Simple Regression Model
  Linear Conditional Mean
  Deviations from the Mean
  Data Generating Process
Conditions for the SRM
  Modeling Process
Inference in Regression
  Standard Errors
  Confidence Intervals
  Hypothesis Tests
  Interpreting Tests of the Slope
Prediction Intervals
  Reliability of Prediction Intervals
Summary


The Capital Assets Pricing Model (CAPM) describes the relationship between returns on a speculative asset and returns on the stock market. According to this theory, the market rewards investors for taking unavoidable risks collectively known as market risk. Since investors cannot escape market risk, the market pays investors who are willing to take these risks on. Other risks, called idiosyncratic risk, are avoidable, and the CAPM promises no compensation for these. For instance, if you buy stock in a company that is pursuing an uncertain biotechnology strategy, the market will not compensate you for those unique risks.

We can formulate the CAPM as a simple regression. The response Rt is the percentage change in the value of an asset. The explanatory variable Mt is the contemporaneous percentage change in the stock market as a whole. The intercept in this regression is called alpha. The slope is called beta. Written using Greek letters for these terms, the equation associated with the CAPM is

Rt = α + β Mt + εt

The added term εt represents the effect of everything else on Rt. According to the CAPM, the mean of εt is zero and α = 0. On average, if the return on the market is zero, we expect the return on the stock to be zero as well. We can use regression analysis to test this theory. This scatterplot shows monthly percentage changes in the price of stock in Berkshire-Hathaway, the company managed by the famous investor Warren Buffett.

Estimated % Change Berkshire-Hathaway = 1.32 + 0.737 % Change Market

Figure 21-1. Estimating the CAPM for Berkshire-Hathaway.

The graph plots percentage changes in Berkshire-Hathaway versus percentage changes in the value of the entire stock market. The data span October 1976 (the earliest public trading in Berkshire-Hathaway) through December 2007. The red line is the least squares regression line. Is the estimated intercept b0 = 1.32 large enough to suggest that α ≠ 0? Did Warren Buffett beat the market by a statistically significant amount? To answer this question, we need tools for inference: standard errors and either confidence intervals or hypothesis tests. We develop these methods for regression analysis in this chapter.


The Simple Regression Model

The simple regression model (SRM) combines an equation that relates two numerical variables with a description of the remaining variation. The SRM describes a population, not a sample from the population. The equation in the simple regression model specifies how the explanatory variable is related to the mean of the response. Because we're describing the population, we use Greek letters and random variables. The equation of the SRM states that averages of the response fall on a line:

μy|x = E(Y|X=x) = β0 + β1 x

Linear on Average

conditional mean μy|x
The average of one variable given that another variable takes on a specific value.

Read the notation μy|x = E(Y|X=x) as the expected value, or average, of Y given that the explanatory variable has value x. The average value of Y for each x is the conditional mean of Y given X. For instance, E(Y|X = 5) is the average percentage change in Berkshire-Hathaway during months in which the market increases 5%. According to this model, the conditional means fall on a line with intercept β0 and slope β1. (Finance traditionally denotes β0 by α and β1 by β in the CAPM.) This equation may give the impression that the simple regression model only describes linear patterns, but that's not true. The SRM assumes that you have chosen variables that have a linear relationship. The variables X and Y may involve transformations such as logs or reciprocals, as in Chapter 20.

Deviations from the Mean


error Deviation from the conditional mean specified by the SRM.

The equation of the simple regression model describes what happens on average. Deviations that separate observations from the conditional averages μy|x are called errors. These aren't mistakes, just random variation around the conditional means. The usual symbol for the error is another Greek letter, ε = y − μy|x. The Greek letter ε (epsilon) is a reminder that we do not observe the errors; the errors are deviations from the population conditional mean, not an average in the data. Errors can be positive or negative, depending on whether data lie above the line (positive) or below the line (negative). Because μy|x is the conditional mean of Y in the population, the expected value of an error is zero, E(ε) = 0. On average, the deviation from the line is zero. Because the errors are not observed, the SRM makes several assumptions about them:
1. Independence. The error for one observation is independent of the error for any other observation.
2. Equal variance. The errors have equal variance σ².
3. Normal. The errors are normally distributed.



If all three assumptions hold, then the errors are an iid sample from a normal population with mean 0 and variance σ², ε ~ N(0, σ²).

Data Generating Process


A statistical model describes an idealized sequence of steps that produce the data we observe. As an illustration, let y denote monthly sales of a company and x denote its spending on advertising (both in thousands of dollars). To specify the SRM, we need to choose values for β0, β1, and σ. Suppose that if the company spends x thousand dollars on advertising, then the expected level of sales is

μy|x = 500 + 2x    (β0 = 500, β1 = 2)

Without advertising, sales average $500,000. Every dollar spent on advertising increases expected sales by $2. We further set σ = 45. There's a normal distribution with mean 500 + 2x and standard deviation 45 at each value of x.

Figure 21-2. The simple regression model assumes a normal distribution at each x.

The data generating process defined by this model begins by allowing the company to choose a value for the explanatory variable. The SRM does not specify how this is done; the company is free to decide how much it wants to advertise. Suppose the company spends x1 = 150 (thousand dollars) on advertising in the first month. The expected level of sales given this advertising is

μy|150 = β0 + β1(150) = 500 + 2(150) = $800,000

According to the SRM, all of the other factors that affect sales combine to produce a deviation from μy|150 that looks like a random choice from a normal distribution with mean 0 and SD σ. Let's denote the first error as ε1 and imagine that ε1 = −20. Sales during the first month are then

y1 = μy|150 + ε1 = 800 + (−20) = $780,000

That's the dot in the figure at the left; it's below the line because ε1 < 0. If the company only wants to estimate μy|150, here's what it could do. If it spends $150,000 for advertising month after month, the average level of sales will eventually settle down on μy|150 = $800,000. The company would


not learn β0 or β1 this way, but it would eventually know μy|150. To reveal β0 or β1, the company must vary the explanatory variable x. Let's follow the data generating process for a second month. In the next month, the company spends x2 = 100 (thousand dollars) on advertising. The same line determines the response. The equation for μy|x remains the same: expected sales are μy|100 = β0 + β1(100) = 700. Because the errors are independent of one another, we ignore ε1 and independently draw a second error, say ε2 = 50, from the same normal distribution. Sales in the second month are then y2 = μy|100 + ε2 = 700 + 50 = $750,000. This data point lies above the line because ε2 is positive. This process repeats each month. The company sets the amount of advertising, and the data generating process defined by the SRM determines sales. If the SRM holds, then the company observes data like these.

Figure 21-3. The observed data do not show the population regression line.
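The data generating process is easy to mimic with a short simulation. Here is one possible sketch in Python with numpy, using the hypothetical advertising model above (β0 = 500, β1 = 2, σ = 45); the variable names and the particular advertising levels are our illustrative choices, not part of the SRM itself.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

beta0, beta1, sigma = 500.0, 2.0, 45.0      # parameters of the hypothetical SRM
x = np.concatenate(([150.0, 100.0], rng.uniform(50, 250, size=22)))  # company picks x

mu = beta0 + beta1 * x                      # conditional means fall on the line
eps = rng.normal(0.0, sigma, size=x.size)   # iid draws from N(0, sigma^2)
y = mu + eps                                # observed sales = conditional mean + error

# Advertising at x = 150 month after month pins down mu_{y|150} = 800,
# but reveals neither beta0 nor beta1: for that, x must vary.
print(np.mean(beta0 + beta1 * 150 + rng.normal(0, sigma, size=10_000)))  # close to 800
```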

Recognize that the simple regression model presents a simplified view of reality. The SRM is a model, not the real thing. Nonetheless the SRM is often a reasonable description of data. The closer data conform to this ideal scenario, the more reliable inferences become.

Simple Regression Model. The observed response y is linearly related to the explanatory variable x by the equation

y = β0 + β1 x + ε,    ε ~ N(0, σ²)

The observations
1. are independent of one another,
2. have equal variance σ² around the regression line, and
3. are normally distributed around the regression line.



Are You There?

Which scatterplots show data that are consistent with the data generating process defined by the simple regression model?1

[Four scatterplots, labeled (a) through (d).]

Conditions for the SRM

We never know for certain whether the SRM describes the population. All we observe are a sample and the fitted least squares regression line. The best we can do is to check several conditions. We checked these in Chapter 19, but we can now refine what to look for in the residuals. Instead of checking for no pattern (that the residuals are simple enough), we have three specific conditions. If the answer to each question is yes, then the data match the SRM well enough to move on to inference.

Checklist for the Simple Regression Model
- Is the pattern in the scatterplot of y on x straight enough?
- Have we ruled out embarrassing lurking variables?
- Are the errors evidently independent?
- Are the variances of the residuals similar?
- Are the residuals nearly normal?

Let's check these conditions for the regression of Berkshire-Hathaway on the market (Figure 21-1).

Straight-enough. The scatterplot of y on x in Figure 21-1 seems straight enough. The association is not strong (r2 = 0.24), but it seems linear. We should confirm this by plotting the residuals versus the explanatory variable.
1 All but (d). In (a) the data track along a line with negative slope. In (b), there's little evidence of a line, but that just means the slope is near zero. (c) is an ideal example of a linear pattern with little error variation. (d) fails because it appears that the error variation grows with the mean.



This plot should look like a random swarm of bees, buzzing above and below the horizontal line at zero.

Figure 21-4. Residuals from the regression of percentage changes.

There's no evident pattern in this case. The data are shifted to the right of the plot to accommodate the outliers at the left (months with large declines in the market: October 1987 and August 1998). These do not indicate a problem; the SRM makes no assumption about the distribution of the explanatory variable.

No embarrassing lurking variables. According to the CAPM, there aren't any lurking variables. Some in finance question the CAPM for just this reason, claiming that other variables predict stock performance. In place of checking that the residuals are simple enough, we have three specific tasks: check for independence, equal variance, and normality.

Evidently independent. In general, no plot shows dependence among the observations, with one exception: if the data are a time series, as in this example, we can check this assumption by looking at sequence plots of the residuals. This timeplot shows that the residuals vary around zero consistently over time, with no drifts. These appear independent.

Figure 21-5. Timeplot of residuals from the CAPM regression for Berkshire-Hathaway.

Similar variances. To check this condition, start with the scatterplot of the residuals on x (Figure 21-4). With the fitted line removed, it is easier to see changes in variation. In this case, the spread appears constant around the horizontal line at zero. Be alert for a fan-shaped pattern or a tendency for the variability to grow or shrink.


The timeplot of the residuals, however, does show periods in which the residuals become more and less variable. The residuals seem to have lower variance after 2001. The effect is subtle, and gradual changes in variation will not cause us problems, but this aspect of time series is an important area of research in quantitative finance. (This issue is very important to prediction, however.)

Nearly normal. A normal model is often a good description for the unexplained variation because the errors represent the net effect of all other variables on the response, added together. Since sums of random effects tend to be normally distributed, a normal model is a start. It's only a start, however. To check this condition, we check that the residuals are nearly normal (Chapter 12). We don't observe the errors, so we substitute the residuals in their place. Otherwise, this check is done as before: inspect a histogram and normal quantile plot. The following histogram and normal quantile plot summarize the residuals from the regression of percentage changes in Berkshire-Hathaway on the market.

Figure 21-6. Histogram and normal quantile plot of residuals.

In general, the residuals track the diagonal reference line except near the arrow. At this location, they drift too far from the reference line; these residuals are not nearly normal. The distribution of the residuals also has a long right tail. Fortunately, inferences about β0 and β1 work well even if the data are not normally distributed. As when estimating μ using the sample average, the justification comes from the Central Limit Theorem. Confidence intervals for the slope and intercept are reliable even if the errors are not normally distributed. If the residuals are not normally distributed (as in this example), check the CLT condition for the residuals (Chapter 15). The sample size should be larger than 10 times the larger of the skewness K3 and kurtosis K4 of the residuals. In this example, K3 = 0.8 and K4 = 2.5, so we can rely on the CLT for making inferences about β0 and β1.
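This sample-size check is easy to automate. A minimal sketch with scipy (residuals is an array of regression residuals; note that scipy's kurtosis defaults to excess kurtosis, which appears to be the scale used for K4 above):

```python
from scipy import stats

def clt_condition_ok(residuals):
    """Is n larger than 10 times the bigger of skewness K3 and kurtosis K4?"""
    k3 = abs(stats.skew(residuals))        # sample skewness
    k4 = abs(stats.kurtosis(residuals))    # excess kurtosis (0 for normal data)
    return len(residuals) > 10 * max(k3, k4)

# With K3 = 0.8 and K4 = 2.5 as in the text, any sample larger than 25 passes.
```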


Modeling Process
Before you look at plots, think about these two questions: (a) Does a linear relationship make sense? (b) Is the relationship free of major lurking variables? If either answer is no, then find a remedy for the problem. You may need to transform a variable or find better data. If you think both answers are yes, then follow this outline, which revolves around three plots: (1) y versus x, (2) residuals versus x, and (3) the histogram and normal quantile plot of the residuals. A code sketch for producing them appears after this outline.
- Plot y versus x and verify that the association is straight enough. Don't worry about properties of the residuals until you get an equation that captures the pattern in the scatterplot.
- If the pattern is straight enough, fit the least squares regression line and obtain the residuals, e.
- Plot the residuals versus the explanatory variable. This plot should have no pattern; the residuals should show only simple variation. Curvature suggests that the pattern wasn't straight enough after all, and any thickening indicates different variances in the errors. Note the presence of outliers as well.
- If the data are measured over time, plot the residuals against time to check for dependence.
- Inspect the histogram and normal quantile plot of the residuals to check the nearly normal condition. If the residuals are not nearly normal, check the skewness and kurtosis.
It's a good idea to proceed in this order. If you skip the initial check for straight enough, for example, you are likely to find something unusual in the normal quantile plot of the residuals. At that point, you might conclude "Ah-ha, the errors are not normally distributed." You'd be right, but for the wrong reason. Detecting a problem at this late stage offers little advice for how to fix it. Unless you identify the source of the problem, you're not going to know what to do next. We'll have more to say about fixing problems in Chapter 22.
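Most statistics packages produce these plots on request; one possible sketch in Python (numpy, scipy, and matplotlib assumed; x and y are numpy arrays of the data):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def srm_diagnostics(x, y):
    b1, b0 = np.polyfit(x, y, deg=1)        # least squares slope and intercept
    e = y - (b0 + b1 * x)                   # residuals

    fig, ax = plt.subplots(1, 3, figsize=(12, 4))
    ax[0].scatter(x, y)                     # 1. y versus x, with the fitted line
    ax[0].plot(np.sort(x), b0 + b1 * np.sort(x))
    ax[1].scatter(x, e)                     # 2. residuals versus x
    ax[1].axhline(0.0)
    stats.probplot(e, plot=ax[2])           # 3. normal quantile plot of residuals
    plt.show()
    return e
```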

Inference in Regression
Estimate   Role       Parameter
b0         Intercept  β0
b1         Slope      β1
ŷ          Fit/Mean   μy|x
e          Deviation  ε
se         SD(ε)      σ
Once our model passes this gauntlet of checks, we are ready for inference. Three parameters, β0, β1, and σ², identify each instance of the simple regression model. We estimate these using the least squares regression line. As you may suspect, b0 estimates β0, b1 estimates β1, and se estimates σ. Inference for these parameters proceeds as when testing or building confidence intervals for μ (Chapters 15-18). The estimated standard error is our ruler. We reject a null hypothesis if an estimate is too many standard errors from the hypothesized value, and the confidence interval consists of values within a certain number of standard errors of the estimate. Student's t-distribution determines how many standard errors are necessary for statistical significance.


For the SRM, software handles the details of calculating standard errors of the least squares estimators. Packages routinely summarize the results of a least squares regression in a table. Each row of this table summarizes the estimate of a parameter in the CAPM regression for Berkshire-Hathaway.
Term           Estimate    Std Error   t Statistic   p-value
Intercept b0   1.321082    0.336367    3.93          0.0001
Market b1      0.736763    0.067973    10.84         <.0001

Table 21-1. Regression coefficient estimates for the CAPM regression.

Numbers in the row labeled Intercept describe the estimated intercept b0; those in the next row describe the estimated slope b1.

Standard Errors
Standard errors describe the sample-to-sample variability of b0 and b1. Each time we draw a sample from the population and fit a regression, we get different estimates. How different? That's the job of the standard error: estimate the sample-to-sample variation. If the standard error of b1 is small, then not only are estimates from different samples similar to one another, they are also close to β1.

The formula for the standard error of the slope in a least squares regression resembles the standard error of an average. The estimated standard error of the average of a sample of n observations y1, y2, ..., yn is

se(Ȳ) = sy/√n,   with s²y = [(y1 − ȳ)² + (y2 − ȳ)² + ... + (yn − ȳ)²]/(n − 1)

This formula uses the sample standard deviation sy in place of σy. The standard error of the slope is similar. (The exact formula is at the end of this chapter.)

se(b1) ≈ (se/√n) × 1/sd(X),   with s²e = (e1² + e2² + ... + en²)/(n − 2)

Three aspects of the data determine the standard error of the slope:
- Standard deviation of the residuals
- Sample size
- Standard deviation of the explanatory variable

The residual standard deviation, se, sits on top in the numerator, since more variation around the line increases the standard error. The more variable the data are around the regression line, the less precise the estimate of the slope. The sample size is in the denominator. Larger samples decrease the standard error. The larger the sample is, the more precise the estimate of the slope becomes. To see why the standard deviation of the explanatory variable affects the standard error of the slope, consider these scatterplots.


Figure 21-7. Which data tells you more about the slope?

Each scatterplot shows a sample of 25 observations from the same population. The only difference is that the points on the left are spread out more along the x-axis. This sample produces a more accurate estimate of β1. The lines sketched in the next figure represent the data equally well. Since the data on the right are packed closely together, there's more variability among the slopes. These data provide less information about β1.

Figure 21-8. More variation in x leads to a better estimate of the slope.

The formula for the standard error of the intercept is similar and shown at the end of the chapter.
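A sketch of these calculations, using the exact formulas given at the end of the chapter (numpy assumed; x and y are numpy arrays of the data):

```python
import numpy as np

def coefficient_standard_errors(x, y):
    n = len(x)
    b1, b0 = np.polyfit(x, y, deg=1)                 # least squares estimates
    e = y - (b0 + b1 * x)                            # residuals
    s_e = np.sqrt(np.sum(e**2) / (n - 2))            # residual SD, se
    s_x = np.std(x, ddof=1)                          # SD of the explanatory variable
    se_b1 = s_e / (np.sqrt(n - 1) * s_x)             # se(b1)
    se_b0 = s_e * np.sqrt(1/n + np.mean(x)**2 / ((n - 1) * s_x**2))   # se(b0)
    return se_b0, se_b1
```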

Confidence Intervals
The sampling distribution of b1 is centered on β1 with standard deviation estimated by se(b1). If the errors are nearly normally distributed or satisfy the CLT condition, then the sampling distribution of b1 is approximately normal. Since we substitute se for σ to calculate the standard error, we use a t-distribution. The degrees of freedom are n − 2, the divisor in the estimate se. Put together, the sampling distribution of the ratio

t = (b1 − β1)/se(b1)

is Student's t with n − 2 degrees of freedom. Hence, the 95% confidence interval for β1 (beta for Berkshire-Hathaway) is

b1 ± t0.025,n−2 se(b1) = 0.736763 ± 1.97 × 0.067973 ≈ [0.603, 0.871]

Because 1 lies outside the confidence interval, we conclude that β for Berkshire-Hathaway is statistically significantly less than 1. Returns on Berkshire-Hathaway attenuate returns on the market, reducing the volatility of the value of this asset. For the intercept, the 95% confidence interval is


b0 ± t0.025,n−2 se(b0) = 1.321082 ± 1.97 × 0.315335 ≈ [0.700, 1.942]

Since zero lies well outside the 95% confidence interval, α for Berkshire-Hathaway is statistically significantly larger than zero. Buffett's stock has averaged higher returns than predicted by the CAPM: he beat the market.
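To reproduce such intervals, all you need are the estimate, its standard error, and a t-percentile. A sketch with scipy (the sample size, roughly 375 months for these data, is our assumption from the dates given):

```python
from scipy import stats

def confidence_interval(estimate, std_error, n, level=0.95):
    """Two-sided confidence interval for a regression coefficient."""
    t_crit = stats.t.ppf(1 - (1 - level) / 2, df=n - 2)
    return estimate - t_crit * std_error, estimate + t_crit * std_error

print(confidence_interval(0.736763, 0.067973, n=375))  # slope: about (0.60, 0.87)
print(confidence_interval(1.321082, 0.336367, n=375))  # intercept
```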

Hypothesis Tests
Output from software that fits a least squares regression usually provides several redundant ways to test the hypotheses H0: β0 = 0 and H0: β1 = 0. You can rely on confidence intervals and avoid hypothesis tests, but the output for regression summarizes tests of both hypotheses in the columns labeled t Statistic and p-value in Table 21-1. Each test compares the estimate to zero; positive and negative deviations from 0 are evidence against H0. Hence, these are two-sided tests; the p-value is small whether b0, for instance, is far below or far above zero.

Each t-statistic in Table 21-1 is the ratio of an estimate to its standard error. Each counts the number of standard errors that separate the estimate from zero. For example, the t-statistic for the intercept is

t = b0/se(b0) = 3.93

The estimated intercept lies about 3.93 standard errors above zero. The accompanying p-value converts the t-statistic into a probability, as when testing a hypothesis about a mean. For the intercept, the p-value = 0.0001, far less than the common threshold 0.05. We can reject H0 if we accept a 5% chance of a Type I error. As with tests of the mean, the test agrees with the confidence interval. The test rejects H0: β0 = 0, and 0 lies outside the 95% confidence interval.

The t-statistic and p-value in the next row of output test H0: β1 = 0. The t-statistic tells us that b1 is t = b1/se(b1) = 10.84 standard errors above zero. That's unlikely to happen by chance if the null hypothesis H0: β1 = 0 is true, so the p-value is tiny (less than 0.0001). In plain language, the t-statistic tells us that β for Berkshire-Hathaway is not zero. That's not a surprise; few financial experts would expect β for any stock to be zero.

The reason that software automatically tests the null hypothesis H0: β1 = 0 is simple. If this hypothesis is true, then the distribution of y is the same regardless of the value of the explanatory variable x. There is no linear association between y and x. We'd expect to find the same mean value for the response regardless of the value of x. In a graph, the line is flat. In this example, that would mean returns on Berkshire-Hathaway were unrelated to returns on the market.

equivalent inferences
Might a parameter in the population be zero? Not if
1. zero lies outside the 95% confidence interval,
2. the t-statistic is larger than 2 in absolute size, or
3. the p-value is less than 0.05.
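The t-statistics and p-values in Table 21-1 come from the same ingredients. A sketch of the two-sided test (scipy assumed; n again taken to be about 375):

```python
from scipy import stats

def t_test_zero(estimate, std_error, n):
    """Two-sided t-test of H0: coefficient = 0."""
    t_stat = estimate / std_error                     # SEs separating estimate from zero
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value
    return t_stat, p_value

print(t_test_zero(1.321082, 0.336367, n=375))  # intercept: t about 3.93
print(t_test_zero(0.736763, 0.067973, n=375))  # slope: t about 10.84, p < .0001
```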

Interpreting Tests of the Slope


Special language describes a regression model that rejects H0: β1 = 0. You'll sometimes hear that the model "explains statistically significant variation in the response" or "the slope is significantly different from zero." These expressions mean that the 95% confidence interval for β1 does not include zero. The first phrase comes from the connection between β1 and the correlation between x and y. If β1 = 0, then the correlation is 0. By rejecting H0, we've said that the correlation is not 0. Hence r2 > 0, too.

tip
Remember, though, that if we don't reject H0: β1 = 0, we have not proven that β1 = 0. All we've said is that β1 might be zero, not that it is zero.

Are You There?

The following results summarize a model for the relationship between total compensation of CEOs and net sales at 201 finance companies. The response and explanatory variable are measured on a log10 scale. Because both x and y are expressed as logs, the slope is the elasticity of compensation with respect to net sales, the average percentage change in salary associated with a 1% increase in net sales. (See Chapter 20.)

Figure 21-9. Log of CEO compensation versus log of net sales in the finance industry.

r2  0.403728
se  0.388446

Term                 Estimate    Std Error   t Statistic   p-value
Intercept b0         1.8653351   0.400834    4.65          <.0001
Log10 Net Sales b1   0.5028293   0.043318    11.61         <.0001

(a) Based on what is shown, check the conditions for the SRM: (1) straight-enough (2) no lurking variable (3) evidently independent (4) similar variances (5) nearly normal.2
(b) What does it mean that the t-statistic for b1 is bigger than 10?3
(c) Find the 95% confidence interval for the elasticity of compensation (the slope in this model) with respect to net sales.4
(d) A CEO claimed to her Board of Directors that the elasticity of compensation with respect to net sales is ½. Does this model agree?5
2 The relationship seems straight-enough (with logs). The plots don't show a problem. The variation seems consistent. It's hard to judge normality without a quantile plot, but these residuals probably meet the CLT condition.
3 The estimate b1 is more than 10 standard errors away from zero, and hence statistically significant. Zero will not be in the 95% confidence interval.
4 The confidence interval for the slope is 0.503 ± 1.97 × (0.0433) ≈ [0.42 to 0.58].
5 0.5 lies inside the confidence interval for β1; ½ is a plausible value for the elasticity.



(e) The outlier marked with an x at the right in the plots in Figure 21-9 is Warren Buffett. Why is he an outlier?6

Prediction Intervals

Regression is often used for predicting the response. We know how to compute fitted values of y for any value of x. The same formula gives a prediction, ŷ = b0 + b1 x. Let's go back to the emerald diamonds considered in Chapter 19. If you are interested in buying a ½-carat diamond, say, you'd like to know what you should expect to pay. What range in prices can you anticipate? This plot shows the estimated least squares fit for the diamonds.
Figure 21-10. Prices and weights of emerald cut diamonds.

The observed prices of ½-carat diamonds range from about $1,000 to $1,700. That's a wide interval, due presumably to other factors such as varying color and clarity. According to the SRM, the price of each diamond is y = μy|x + ε. Even if we knew β0 and β1, and hence μy|x, we could not predict the exact price of any diamond because of the error variation around the conditional mean. A range that quantifies the accuracy of a prediction has to account for this remaining variation. Let's begin with the prediction itself. The following tables summarize the fit shown in Figure 21-10.

b0 b1

Table 21-2. Summary of the estimated SRM for the diamonds in Figure 21-10.

Plugging into the equation, the predicted price for a ½-carat diamond is

ŷ = b0 + b1(0.5) = 43.419237 + 2669.8544 × 0.5 = 1378.346437 ≈ $1,378

6 Buffett pays himself a small salary; he makes his money in stock invested in his firm.



According to the SRM, if we have $1,378, we can afford to buy 50% of diamonds that weigh ½ carat. The rest would be too expensive. How accurate is $1,378 as a prediction of a randomly chosen ½-carat diamond? There's no reason to think that we should be able to predict the price of another ½-carat diamond any better than this line fits these diamonds. The SD of the residuals, se = $168.63, gives us a good idea of the accuracy. If the SRM holds, the error variation around μy|x is normally distributed. Hence, prices of 95% of diamonds fall within 1.96 standard deviations of the conditional mean μy|x for any weight,

P(μy|x − 1.96σ ≤ ynew ≤ μy|x + 1.96σ) = 0.95

We don't know μy|x or σ, but we have ŷ and se. If we use these in place of the unknown parameters, then

P(ŷ − t0.025,n−2 se(ŷ) ≤ ynew ≤ ŷ + t0.025,n−2 se(ŷ)) ≈ 0.95

The standard error of the prediction is tedious to calculate, but so long as we're not extrapolating, we can use the handy approximations se(ŷ) ≈ se and t0.025,n−2 ≈ 2.

approximate 95% prediction interval
As long as you are not extrapolating beyond observed conditions, use ŷ ± 2se.

Thus, the range [ŷ − 2se, ŷ + 2se] is an approximate 95% prediction interval. This range is a prediction interval, rather than a confidence interval, because we're making a statement about the price of a single diamond, a future observation. A confidence interval is a statement about a population parameter; a prediction interval is a statement about a specific, as yet unknown, observation. We're not guessing the value of a mythical parameter; we're predicting the price of a single diamond.
The prediction interval provides another way to think about se. The standard deviation of the residuals tells you how well you can predict new observations. If the data are nearly normal, then about 95% of the data lie within 2se of the fitted line. The approximate 95% prediction interval for the cost of a ½-carat diamond is

ŷ ± 2(168.634) = 1378.346 ± 337.268 ≈ [$1041, $1716]

We shouldn't be surprised if the price of the diamond were $1,200 or $1,600. We can also think about the upper endpoint of this interval like this: if we arrive at the jeweler's with $1,716, then we have enough to buy all but the most expensive 2.5% of diamonds that weigh ½ carat.
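A sketch of this calculation, using the estimates in Table 21-2 (valid only because a ½-carat diamond lies inside the observed range of weights):

```python
def approx_prediction_interval(x_new, b0, b1, s_e):
    """Approximate 95% prediction interval, y_hat +/- 2*s_e (no extrapolating)."""
    y_hat = b0 + b1 * x_new
    return y_hat - 2 * s_e, y_hat + 2 * s_e

lo, hi = approx_prediction_interval(0.5, 43.419237, 2669.8544, 168.634)
print(round(lo), round(hi))   # about 1041 and 1716 dollars
```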

Reliability of Prediction Intervals


Prediction intervals are reliable within the range of observed data. Predictions that extrapolate beyond the data, however, rely heavily on the assumed truth of the model. Often, a relationship that is linear over the range of the observed data changes when extrapolated. The next scatterplot shows prices of a larger collection of diamonds, including many that weigh more than ½ carat. The fitted line is the line used in Chapter 19 to describe smaller diamonds. It fits well for diamonds below ½ carat, but typically underpredicts prices for larger gems.


Figure 21-11. Prices rise faster than the model fit to small diamonds anticipates.

When the prediction is not an extrapolation, the reliability of a prediction interval depends on two assumptions: equal variance and normality. If the variance of the errors around the regression line increases with the size of the prediction (as in Figure 21-11), then the prediction interval will be too wide for small items and too narrow for large items. This fan-shaped pattern of increasing variation is a serious issue when using the regression model for prediction. We will deal with this issue in Chapter 22. Similarly, if the errors are not normally distributed, the t-percentiles used to set the endpoints of the interval can be off target. Prediction intervals depend on normality of each observation, not normality produced by averaging.

Example 21.1 Locating a Franchise Outlet

Motivation

state the question

A common saying about real estate is that three things determine the value of a property: location, location, location. The same goes for commercial properties. Consider where to locate a gasoline station. Have you noticed that prices seem higher at stations located near busy interstate highways? With so many cars passing by, a station can raise prices. Even if the high price deters some drivers, more than enough stop in to keep the business profitable. How does traffic volume affect sales? We will compare two sites. One is located on a highway that averages 40,000 drive-bys a day and another gets 32,000. How much more gasoline can we expect to sell at the busier location?

Method
Identify x and y. Link b0 and b1 to problem. Describe data. Check straight-enough condition.

describe the data and select an approach

We will use regression, with y equal to the sales of gasoline per day (thousands of gallons) and x given by the average daily traffic volume (in thousands of cars). Both averages were computed during a recent month at 80 franchise outlets in similar communities. All charge roughly the same price. The intercept sets a baseline of gasoline sales that occur regardless of traffic intensity (probably due to local customers). The slope measures the sales per passing car.


A 95% confidence interval for 8,000 times the estimated slope will indicate how much more gasoline to expect to sell at the busier location. Straight-enough. The scatterplot suggests that the relationship is linear. No lurking variable. These stations are in similar areas with comparable prices. We should also check that they face similar competition.

Mechanics
do the analysis

These tables summarize the fit of the least squares regression equation.

r2  0.548596
se  1.505407
n   80

Term                      Estimate    Std Error   t Stat   p-value
Intercept b0              -1.338097   0.945844    -1.41    0.1611
Traffic Volume (000) b1    0.236729   0.024314     9.74    <.0001
This scatterplot graphs the residuals versus the explanatory variable, and the normal quantile plot summarizes the distribution of the residuals.

Evidently independent. While these are not random samples, nothing in the plots of the data suggests a problem with the assumption of independence.

Similar variances. This is confirmed in the plot of the residuals on the explanatory variable. We expected more variation at the busier stations, but since these data are averages, that effect is not evident.

Nearly normal. The histogram of the residuals is reasonably bell-shaped with no large outliers. The points in the normal quantile plot stay near the diagonal.



Since the conditions for the SRM are met, we can move on to inference. The 95% confidence interval for β1 is

b1 ± t0.025,78 se(b1) = 0.236729 ± 1.99 × 0.024314 ≈ [0.188 to 0.285 gallons/passing car]

Hence, a difference of 8,000 in daily traffic volume is associated with a difference in average daily sales of

8000 × [0.188 to 0.285 gallons per car] ≈ 1,507 to 2,281 more gallons per day

Message

summarize the results

Based on sales at a sample of 80 stations in similar communities, we expect (with 95% confidence) that a station located at a site with 40,000 drive-bys will sell on average from 1,500 to 2,300 more gallons of gasoline daily than a location with 32,000 drive-bys. It is good to mention possible lurking variables. Before we go further, we should check that the stations in our data face similar levels of competition and verify that all charge comparable prices. If, for example, stations at busy locations charge higher prices, this equation may underestimate the benefit of the busier location. If prices differ, then the estimate from this model mixes the effect on sales of increasing traffic (positive effect) with increasing price (negative effect). We can also use this equation for predictions. Since the residuals in this example are nearly normal, we can use the fit of this model to build prediction intervals. For instance, we predict sales at a site with 40,000 drive-bys to be

ŷ = −1.338097 + 0.236729 × 40 ≈ 8.131 thousand gallons per day


The approximate 95% prediction interval for such a location is

ŷ ± 2se = 8.131 ± 2 × 1.5054 ≈ [5.12 to 11.14] thousand gallons per day.


The predicted sales for a location with 32,000 drive-bys is

ŷ = −1.338097 + 0.236729 × 32 ≈ 6.237 thousand gallons per day


The approximate prediction interval is 6.237 ± 2 × 1.5054 ≈ [3.23 to 9.25] thousand gallons per day. These prediction intervals overlap to a considerable extent, even though the confidence interval for the difference in sales based on the slope does not include zero. The reason for the overlap is that the prediction intervals contrast sales at two specific locations. The confidence interval compares the average level of sales across all stations at such locations. Averages are less variable than the individual cases.
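The contrast between the overlapping prediction intervals and the confidence interval for the difference is easy to verify numerically; a sketch using the estimates above (the ±2se and 1.99 multipliers follow the text):

```python
b0, b1, s_e, se_b1 = -1.338097, 0.236729, 1.505407, 0.024314

for traffic in (40, 32):                    # thousands of drive-bys
    y_hat = b0 + b1 * traffic
    print(traffic, (y_hat - 2 * s_e, y_hat + 2 * s_e))   # wide, overlapping intervals

diff = 8 * b1                               # effect of 8,000 more cars (x in thousands)
margin = 8 * 1.99 * se_b1                   # CI half-width for the difference
print((diff - margin, diff + margin))       # about (1.51, 2.28): excludes zero
```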



Summary

The simple regression model (SRM) provides an idealized description of the association between two numerical variables. This model has two components. The equation of the SRM describes the association between the explanatory variable x and the response y. This equation states that the conditional mean of Y given X = x is a line, μy|x = β0 + β1 x. Both x and y may require transformations. The second component of the SRM describes the random variation around this pattern as a sample of independent, normally distributed errors with constant variance. The simple regression model provides a framework for inferences about the parameters β0 and β1 of the linear equation. Confidence intervals for β0 and β1 are centered on the least squares estimates b0 and b1. The 95% confidence intervals are b0 ± t se(b0) and b1 ± t se(b1). The standard summary of a regression includes a t-statistic and p-value for testing H0: β0 = 0 and H0: β1 = 0. A prediction interval measures the accuracy of predictions of new observations. Provided the SRM holds, the approximate 95% prediction interval for a new observation at x is ŷ ± 2se.

Key Terms

condition: evidently independent; nearly normal; similar variances
conditional mean
errors
prediction interval
simple regression model (SRM)

Best Practices

- Verify that your model makes sense, both visually and substantively. If you cannot interpret the slope, then what's the point of fitting a line? If the relationship between x and y isn't linear, there's no sense in summarizing it with a line.
- Consider the effects of other possible explanatory variables. The single explanatory variable in the model may not be the only important influence on the response. If you can think of several others, you may need to use multiple regression (Chapter 23).
- Check the conditions, in the listed order. The farther up the chain you find a problem, the more likely you can fix it. If the plot of y on x is not straight enough, then you know that you need some type of transformation. If you find that the residuals are not normal, what can you do at that point?
- Use confidence intervals to express what you know about the slope and intercept. Confidence intervals convey uncertainty and show that we don't know things like the beta of a stock perfectly from data.


- Use rounding to suppress extraneous digits when presenting results. Nothing makes you look more out of touch than saying things like "the cost per carat is $2333.6732 to $3006.0355." The cost might be $2500 or $2900, and you're worried about $0.0032? Round the values!
- Check the assumption of normality very carefully before using prediction intervals. Other inferences work well even if the data are not normally distributed. Prediction intervals, however, rely on the shape of the normal distribution to set a range for the new value.
- Be careful when extrapolating. It's tempting to think that because we have prediction intervals, they'll take care of all of our uncertainty so we don't have to worry about extrapolating. Wrong: the interval is only as good as the model. Prediction intervals presume that the SRM holds, both within the range of the data and outside.

Pitfalls
- Overreacting to residual plots. If you stare at a residual plot long enough, you'll see a pattern. Use the visual test for simplicity if you're not sure. Even samples from normal distributions have outliers and irregularities every now and then.
- Mistaking lots of data for unequal variances. If the data have more observations at some x's than others, it may seem as though the variance is larger where you have more data. That's because we see the range of the residuals when we look at a scatterplot, not the SD. The range always grows with more data.
- Confusing confidence intervals with prediction intervals. The sampling variation of the estimated parameters determines the width of confidence intervals. The SD of the unexplained (residual) variation determines the major component of the width of prediction intervals.
- Believing that r2 and se improve with a larger sample. Standard errors get smaller as n grows, but r2 doesn't head to 1 and se to zero. Both r2 and se reflect the variation of the data around the regression line. More data provide a better estimate of this line, but even if we knew β0 and β1, there'd still be variation. The ε's are the errors around the ideal line and represent all of the other factors that influence the response that are not accounted for by our model.
- Confusing a problem with the fit as a problem in the errors. If you model the data using a linear relationship when a curve is needed, you'll get residuals with all sorts of problems. It's a consequence of using the wrong equation for the conditional average value of y. Take a look at this scatterplot; it's clearly not straight enough, but we fit the line anyway and got the residuals shown at the right.



Figure 21-12. Fitting a line to a nonlinear pattern often produces residuals that are not nearly normal.

Formulas
Simple Regression Model. The conditional mean of the response, given that the value of the explanatory variable is x, is

μy|x = E(y|x) = β0 + β1 x

As a description of individual values of the response,

yi = β0 + β1 xi + εi

The error terms εi are assumed to
1. be independent of each other,
2. have equal standard deviation σ, and
3. be normally distributed.

Checklist of conditions for the simple regression model:
- Straight enough
- No embarrassing lurking variable
- Evidently independent
- Similar variances
- Nearly normal

Standard error of the slope

se(b1) = se × 1/(√(n − 1) sx) ≈ se/(√n sx)

Standard error of the intercept

se(b0) = se √(1/n + x̄²/((n − 1)sx²)) ≈ (se/√n) √(1 + x̄²/sx²)

If x̄ = 0, the formula reduces to se/√n. The farther x̄ is from 0, the larger the standard error of b0 becomes. Large values of se(b0) warn that the intercept may be an extrapolation.

Standard error of prediction. When using a simple regression to predict the value of an independent observation for which x = xnew,

se(ŷnew) = se √(1 + 1/n + (xnew − x̄)²/((n − 1)sx²))

The approximation by se is accurate so long as xnew lies within the range of the observed data and n is moderately large (on the order of 40 or more).
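In code, the exact standard error of prediction is one line once you have se and the x's (numpy assumed; x is a numpy array of the observed explanatory variable):

```python
import numpy as np

def se_prediction(x_new, x, s_e):
    """Exact standard error for predicting a new observation at x = x_new."""
    n = len(x)
    sx2 = np.var(x, ddof=1)       # sample variance of x
    return s_e * np.sqrt(1 + 1/n + (x_new - np.mean(x))**2 / ((n - 1) * sx2))
```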

About the Data

The stock returns are from the Center for Research in Security Prices (CRSP). The percentage changes shown for Berkshire-Hathaway and the market are 100 times the excess returns on each. The excess return is obtained by subtracting the cost of borrowing from the return. If Berkshire-Hathaway returned 3% in a month, and the cost of borrowing is ½%, then the excess return is 2.5%. For the cost of borrowing, we use the rate of interest on 30-day Treasury Bills. For the total stock market, we used returns on the value-weighted market index. The salaries of CEOs in the finance industry are from the Compustat database for 2003. The data on locating a gasoline station are from a consulting project performed for a national oil company that operates filling stations around the US.

Software Hints

It's best to begin a regression analysis with a scatterplot of Y on X, as emphasized in Chapter 19. For checking the SRM, however, we need to go further than just looking at that plot. We need to see plots of the residuals as well. Don't skip the scatterplot of Y on X to get to the residuals.

Excel

To inspect the residuals from a regression, follow the menu commands Tools > Data Analysis > Regression. (If you don't see this option in your Tools menu, you will need to add these commands. See the Software Hints in Chapter 19.) In the dialog for the regression command, first pick the response and explanatory variable and then choose the options that produce residuals. Be sure to click the option to see the normal probability plot of the residuals.

Minitab

To see plots of the residuals, use the menu sequence Stat > Regression > Regression and select the response and explanatory variable. Click the Graphs button and pick the 4-in-1 option to see all of the residual plots together. The collection includes a scatterplot of the residuals on the fitted values and a normal probability plot of the residuals. The output that summarizes the regression itself appears in the scrolling session window.

JMP

Follow the menu sequence

Analyze > Fit Y by X


to construct a scatterplot and add a regression line. (Click on the red triangle above the scatterplot near the words Bivariate Fit of ...) Once you've added the least squares fitted line, a button labeled Linear Fit appears below the scatterplot. Click on the red triangle in this field and choose the item Plot Residuals to see a plot of the residuals on the explanatory variable. To obtain a normal quantile plot of the residuals, you have to save them first as a column in the data table. Click on the red triangle in the Linear Fit button and choose the item Save Residuals. To get the normal quantile plot, follow the menu sequence Analyze > Distribution and choose the residuals to go into the histogram. Once you see the histogram, click on the red triangle immediately above the histogram and choose the item Normal Quantile Plot.

