
Chapter 2 Looking at Data Relationships

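2.1 Scatterplots (P. 105-112)

A scatterplot displays the form, direction, and strength of the relationship between two quantitative variables.

2.2 Correlation

The correlation r measures the direction and strength of the linear relationship between two quantitative variables. For n observations $(x_i, y_i)$ with means $\bar{x}$, $\bar{y}$ and standard deviations $s_x$, $s_y$, it is defined as

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$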
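The definition of r can be checked numerically. The following is a minimal Python sketch using only the standard library; the function name `correlation` and the data in `xs` and `ys` are made up for illustration and do not come from the text.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation: averaged product of standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divisor n - 1)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(round(r, 4))
```

For this made-up data the result is a moderately strong positive correlation, which matches the upward drift of the y values.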

Properties of Correlation:
1. Correlation requires that both variables be quantitative.
2. Correlation does not change with the units of measurement of x, y, or both.
3. $-1 \le r \le 1$. Positive r indicates positive association and negative r indicates negative association. Values of r close to 0 indicate a weak linear relationship. Values of r close to -1 or 1 indicate a strong linear relationship.
4. Correlation measures the strength of only the linear relationship. Correlation does not describe curved relationships, no matter how strong they are.
5. Correlation can be strongly affected by outliers. Interpret the value of r with caution when outliers appear in the scatterplot.
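Properties 2, 4, and 5 can be illustrated with a short Python sketch. The helper `pearson_r` and all data values below are hypothetical, chosen only to make each property visible.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample correlation computed from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / ((n - 1) * stdev(xs) * stdev(ys))

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Property 2: changing units (here, multiplying x by 2.54) leaves r unchanged.
r_original = pearson_r(xs, ys)
r_rescaled = pearson_r([2.54 * x for x in xs], ys)

# Property 4: a perfect curved relationship (y = x^2) can still give r = 0.
curve_x = [-2, -1, 0, 1, 2]
curve_y = [x ** 2 for x in curve_x]
r_curved = pearson_r(curve_x, curve_y)

# Property 5: one outlying point can change r drastically, even its sign.
r_with_outlier = pearson_r(xs + [10], ys + [0])

print(r_original, r_rescaled, r_curved, r_with_outlier)
```

In this sketch the rescaled and original correlations agree, the parabola gives a correlation of zero although y is completely determined by x, and the single added point turns a positive correlation into a negative one.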

2.3 Least-Squares Regression

A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. It is determined by fitting a line to data, i.e., drawing a line that comes as close as possible to the data points on the scatterplot. Given a set of (x, y) observations, the regression line can be described in the compact mathematical form

$$ y = a + bx $$

where

$$ b = r \, \frac{s_y}{s_x} \quad \text{and} \quad a = \bar{y} - b\bar{x} $$

are the slope and intercept of the line.
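As a rough illustration, the slope and intercept can be computed directly from r, the two standard deviations, and the two means. The function name `fit_line` and the data are assumptions made for this sketch.

```python
from statistics import mean, stdev

def fit_line(xs, ys):
    """Return (a, b) with b = r * s_y / s_x and a = y_bar - b * x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar)
            for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)
print(round(a, 3), round(b, 3))
```

For this made-up data the fitted line is approximately y = 2.2 + 0.6x.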

About the regression line
1. The regression line always passes through the point $(\bar{x}, \bar{y})$.
2. The slope b of the regression line is an estimate of the rate of change of the y variable with respect to the x variable. That is, the predicted y changes by b units for each one-unit increase in x.
3. The intercept a is the value of y when x = 0.
4. The regression line is also called the line of best fit. It is the straight line that best fits the data in the sense that the sum of the squares of the vertical distances of the data points from the line is as small as possible. Hence the term least-squares regression.
5. The regression line can be used to predict the value of y for any given value of x by substituting this x value into the equation of the line. However, extrapolation beyond the range of the observed x-values is risky.
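Points 1, 4, and 5 above can be checked with a small Python sketch. The data, the helper names `predict` and `sse`, and the warning message are all hypothetical; the slope is computed as the ratio of summed cross-deviations to summed squared x-deviations, which is algebraically equal to $r \, s_y / s_x$.

```python
from statistics import mean

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
x_bar, y_bar = mean(xs), mean(ys)

# Least-squares slope and intercept.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

def predict(x):
    """Predict y from the fitted line; warn when x is outside the data range."""
    if not min(xs) <= x <= max(xs):
        print(f"warning: x = {x} is outside the observed range (extrapolation)")
    return a + b * x

def sse(intercept, slope):
    """Sum of squared vertical distances from the data to a candidate line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Point 1: the line passes through (x_bar, y_bar).
passes_through_means = abs(predict(x_bar) - y_bar) < 1e-9

# Point 4: nudging the intercept or slope only increases the sum of squares.
best = sse(a, b)
worse = [sse(a + da, b + db) for da in (-0.1, 0.1) for db in (-0.05, 0.05)]

print(passes_through_means, best, min(worse))
```

Calling `predict` with an x value beyond the largest observed x still returns a number, but the printed warning is a reminder that the data give no evidence the linear pattern continues there.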

Variation

The square of the correlation, $r^2$, is the fraction of the variation in the values of y that is explained by its relationship to x. Specifically,

$$ r^2 = \frac{\text{variance of predicted values of } y}{\text{variance of observed values of } y} $$
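The variance ratio can be compared with the squared correlation in a few lines of Python. This is a sketch on made-up data; the variable names are illustrative.

```python
from statistics import mean, variance

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
x_bar, y_bar = mean(xs), mean(ys)

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

# Least-squares line and its predicted values.
b = s_xy / s_xx
a = y_bar - b * x_bar
predicted = [a + b * x for x in xs]

# Fraction of variation explained, computed two ways.
ratio = variance(predicted) / variance(ys)
r_squared = s_xy ** 2 / (s_xx * s_yy)

print(round(ratio, 4), round(r_squared, 4))
```

Both quantities come out to about 0.6 here, so roughly 60% of the variation in these y values is accounted for by the straight-line relationship with x.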

Residual

A residual is the difference between an observed value of the response variable and the value predicted by the regression line:

$$ \text{residual} = \text{observed } y - \text{predicted } \hat{y} $$

The sum of the least-squares residuals is always 0. A residual plot is a scatterplot of the regression residuals against the explanatory variable. Residual plots help us assess the fit of a regression line.
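A short Python sketch, again on made-up data, computes the residuals of a least-squares line and confirms that they sum to zero up to floating-point rounding. The (x, residual) pairs are the points that a residual plot would display.

```python
from statistics import mean

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
x_bar, y_bar = mean(xs), mean(ys)

# Least-squares slope and intercept.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

# residual = observed y - predicted y
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
residual_sum = sum(residuals)

# Points for a residual plot: residual against the explanatory variable.
for x, res in zip(xs, residuals):
    print(x, round(res, 2))
print("sum of residuals:", round(residual_sum, 10))
```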

Outliers and Influential Observations in Regression

An outlier is an observation that lies outside the overall pattern of the other observations. Some outliers have large residuals, but others do not (see Figures 2.22 and 2.23). An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation. Points that are outliers in the x direction of a scatterplot are often influential for the least-squares regression line.
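One way to see influence is to fit the line with and without a suspect point and compare the results. The sketch below does this in Python; the data, including the point at x = 15, are invented for illustration, and `fit_line` is a hypothetical helper.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept and slope for paired data."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

# Hypothetical data; the last point is an outlier in the x direction.
xs = [1, 2, 3, 4, 5, 15]
ys = [2, 4, 5, 4, 5, 3]

a_all, b_all = fit_line(xs, ys)              # fit using every point
a_drop, b_drop = fit_line(xs[:-1], ys[:-1])  # fit with the last point removed

# Residual of the suspect point under the fit that includes it.
residual_last = ys[-1] - (a_all + b_all * xs[-1])

print(round(b_all, 3), round(b_drop, 3), round(residual_last, 3))
```

In this example, removing the one point changes the slope from slightly negative to clearly positive, yet the point's own residual is modest because it pulls the line toward itself. A small residual is therefore not evidence that a point is harmless.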

Lurking Variable

A lurking variable is a variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables.
