
The Standard Deviation and other Measures of Dispersion

DISPERSION OR VARIATION The degree to which numerical data tend to spread about an average value is called the variation or dispersion of the data. Various measures of dispersion or variation are available, the most common being the range, mean deviation, semi-interquartile range, 10-90 percentile range, and the standard deviation.

THE RANGE The range of a set of numbers is the difference between the largest and smallest numbers in the set. Example: The range of the set 2, 3, 3, 5, 5, 5, 8, 10, 12 is 12 - 2 = 10. Sometimes the range is given by simply quoting the smallest and largest numbers. In the above example, for instance, the range could be indicated as 2 to 12 or 2-12.

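THE MEAN DEVIATION, OR AVERAGE DEVIATION, of a set of N numbers $X_1, X_2, \ldots, X_N$ is defined by

$$\text{Mean Deviation} = \text{M.D.} = \frac{\sum_{j=1}^{N} |X_j - \bar{X}|}{N} = \overline{|X - \bar{X}|} \qquad (1)$$

where $\bar{X}$ is the arithmetic mean of the numbers and $|X_j - \bar{X}|$ is the absolute value of the deviation of $X_j$ from $\bar{X}$. (The absolute value of a number is the number without the associated sign and is indicated by two vertical lines placed around the number. Thus $|-4| = 4$, $|+3| = 3$, $|6| = 6$, $|-0.84| = 0.84$.)

Example: Find the mean deviation of the set of numbers 2, 3, 6, 8, 11.

$$\text{Arithmetic Mean} = \bar{X} = \frac{2 + 3 + 6 + 8 + 11}{5} = 6$$

$$\text{Mean Deviation} = \text{M.D.} = \frac{|2-6| + |3-6| + |6-6| + |8-6| + |11-6|}{5} = \frac{4 + 3 + 0 + 2 + 5}{5} = 2.8$$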
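The worked example can be checked with a few lines of Python. This is only a minimal sketch of the definition of the mean deviation; the function name `mean_deviation` is an illustrative choice, not something taken from the text.

```python
from statistics import mean

def mean_deviation(values):
    """Average absolute deviation of the values from their arithmetic mean."""
    center = mean(values)
    return sum(abs(x - center) for x in values) / len(values)

data = [2, 3, 6, 8, 11]          # the set used in the example
arithmetic_mean = mean(data)
md = mean_deviation(data)
print(arithmetic_mean, md)
```

The absolute deviations are 4, 3, 0, 2 and 5, so the function averages a total of 14 over 5 values.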

If $X_1, X_2, \ldots, X_K$ occur with frequencies $f_1, f_2, \ldots, f_K$ respectively, the mean deviation can be written as

$$\text{Mean Deviation} = \text{M.D.} = \frac{\sum_{j=1}^{K} f_j\,|X_j - \bar{X}|}{N} = \overline{|X - \bar{X}|} \qquad (2)$$

where $N = \sum_{j=1}^{K} f_j$. This form is useful for grouped data where the $X_j$'s represent class marks and the $f_j$'s are the corresponding frequencies.

Occasionally the mean deviation is defined in terms of absolute deviations from the median or other average instead of the mean. An interesting property of the sum $\sum_{j=1}^{N} |X_j - a|$ is that it is a minimum when a is the median, i.e. the mean deviation about the median is a minimum. Note that it would be more appropriate to use the terminology mean absolute deviation rather than mean deviation.

THE SEMI-INTERQUARTILE RANGE, OR QUARTILE DEVIATION, of a set of data is defined by

$$\text{Semi-interquartile Range} = Q = \frac{Q_3 - Q_1}{2} \qquad (3)$$

where $Q_1$ and $Q_3$ are the first and third quartiles for the data. See Problems 4.6 and 4.7. The interquartile range $Q_3 - Q_1$ is sometimes used, but the semi-interquartile range is more common as a measure of dispersion.

THE 10-90 PERCENTILE RANGE of a set of data is defined by

$$\text{10-90 Percentile Range} = P_{90} - P_{10} \qquad (4)$$

where $P_{10}$ and $P_{90}$ are the 10th and 90th percentiles for the data (see Prob. 4.8). The semi-10-90 percentile range, $\tfrac{1}{2}(P_{90} - P_{10})$, can also be used but is not commonly employed.

THE STANDARD DEVIATION of a set of N numbers $X_1, X_2, \ldots, X_N$ is denoted by s and is defined by

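$$s = \sqrt{\frac{\sum_{j=1}^{N} (X_j - \bar{X})^2}{N}} = \sqrt{\frac{\sum x^2}{N}} = \sqrt{\overline{(X - \bar{X})^2}} \qquad (5)$$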

where x represents the deviations of each of the numbers $X_j$ from the mean $\bar{X}$. Thus s is the root mean square of the deviations from the mean or, as it is sometimes called, the root mean square deviation (see Page 49).

If $X_1, X_2, \ldots, X_K$ occur with frequencies $f_1, f_2, \ldots, f_K$ respectively, the standard deviation can be written as

$$s = \sqrt{\frac{\sum_{j=1}^{K} f_j (X_j - \bar{X})^2}{N}} = \sqrt{\frac{\sum f x^2}{N}} \qquad (6)$$

where $N = \sum_{j=1}^{K} f_j = \sum f$. In this form it is useful for grouped data.

Sometimes the standard deviation for the data of a sample is defined with (N - 1) replacing N in the denominators of the expressions in (5) and (6) because the resulting value represents a better estimate of the standard deviation of a population from which the sample is taken. For large values of N (certainly N > 30) there is practically no difference between the two definitions. Also, when the better estimate is needed we can always obtain it by multiplying the standard deviation computed according to the first definition by $\sqrt{N/(N-1)}$. Hence we shall adhere to the definition given above.
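The relation between the two definitions can be illustrated numerically. The sketch below assumes Python's standard `statistics` module, whose `pstdev` divides by N and whose `stdev` divides by N - 1; the data set is simply reused from the mean deviation example and is not tied to a particular problem.

```python
import math
from statistics import pstdev, stdev

data = [2, 3, 6, 8, 11]      # illustrative values only
n = len(data)

s = pstdev(data)             # definition with N in the denominator
s_hat = stdev(data)          # definition with N - 1 in the denominator
factor = math.sqrt(n / (n - 1))

# Multiplying s by sqrt(N / (N - 1)) should reproduce the N - 1 version.
print(s, s_hat, s * factor)
```

With only five values the two results differ noticeably; as N grows, the factor approaches 1 and the difference fades.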

THE VARIANCE The variance of a set of data is defined as the square of the standard deviation and is thus given by $s^2$ in (5) and (6). When it is necessary to distinguish the standard deviation of a population from the standard deviation of a sample drawn from this population, we often use the symbol s for the latter and $\sigma$ for the former. Thus $s^2$ and $\sigma^2$ would represent the sample variance and population variance respectively.

SHORT METHODS FOR COMPUTING THE STANDARD DEVIATION The equations (5) and (6) can be written respectively in the equivalent forms

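$$s = \sqrt{\frac{\sum X^2}{N} - \left(\frac{\sum X}{N}\right)^2} = \sqrt{\overline{X^2} - \bar{X}^2} \qquad (7)$$

$$s = \sqrt{\frac{\sum f X^2}{N} - \left(\frac{\sum f X}{N}\right)^2} = \sqrt{\overline{X^2} - \bar{X}^2} \qquad (8)$$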

where $\overline{X^2}$ denotes the mean of the squares of the various values of X, while $\bar{X}^2$ denotes the square of the mean of the various values of X. See Problems 4.12 to 4.14.

If $d_j = X_j - A$ are the deviations of $X_j$ from some arbitrary constant A, the results (7) and (8) become respectively

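$$s = \sqrt{\frac{\sum d^2}{N} - \left(\frac{\sum d}{N}\right)^2} = \sqrt{\overline{d^2} - \bar{d}^2} \qquad (9)$$

$$s = \sqrt{\frac{\sum f d^2}{N} - \left(\frac{\sum f d}{N}\right)^2} = \sqrt{\overline{d^2} - \bar{d}^2} \qquad (10)$$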

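See Problems 4.15 and 4.17. When data are grouped into a frequency distribution whose class intervals have equal size c, we have $d_j = c\,u_j$ or $X_j = A + c\,u_j$ and (10) becomes

$$s = c\sqrt{\frac{\sum f u^2}{N} - \left(\frac{\sum f u}{N}\right)^2} = c\sqrt{\overline{u^2} - \bar{u}^2} \qquad (11)$$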
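To see that the short forms agree with the original definition, the following sketch computes s for a grouped distribution in three ways: directly from (6), from the mean of squares as in (8), and from coded values as in (11). The class marks, frequencies, A = 67 and c = 3 are illustrative assumptions, and the function names are hypothetical.

```python
import math

def sd_direct(marks, freqs):
    """Definition (6): root mean square of deviations from the mean."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(marks, freqs)) / n
    return math.sqrt(sum(f * (x - mean) ** 2 for x, f in zip(marks, freqs)) / n)

def sd_short(marks, freqs):
    """Form (8): mean of the squares minus the square of the mean."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(marks, freqs)) / n
    mean_sq = sum(f * x * x for x, f in zip(marks, freqs)) / n
    return math.sqrt(mean_sq - mean ** 2)

def sd_coded(marks, freqs, a, c):
    """Form (11): code each class mark as u = (X - A) / c."""
    n = sum(freqs)
    u = [(x - a) / c for x in marks]
    mean_u = sum(f * uj for uj, f in zip(u, freqs)) / n
    mean_u_sq = sum(f * uj * uj for uj, f in zip(u, freqs)) / n
    return c * math.sqrt(mean_u_sq - mean_u ** 2)

# Illustrative grouped data: class marks with equal class size c = 3.
marks = [61, 64, 67, 70, 73]
freqs = [5, 18, 42, 27, 8]

s_direct = sd_direct(marks, freqs)
s_short = sd_short(marks, freqs)
s_coded = sd_coded(marks, freqs, a=67, c=3)
print(round(s_direct, 4), round(s_short, 4), round(s_coded, 4))
```

With A = 67 the coded values are -2, -1, 0, 1, 2, so the sums involve only small integers, which is what makes the hand calculation short.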

The last formula, written in terms of the coded values $u_j$, provides a very short method for computing the standard deviation and should always be used for grouped data when class interval sizes are equal. It is called the coding method and is exactly analogous to that used in computing the arithmetic mean for grouped data in Chapter 3. See Problems 4.16 to 4.19.

PROPERTIES OF THE STANDARD DEVIATION

1. The standard deviation can be defined as

$$s = \sqrt{\frac{\sum_{j=1}^{N} (X_j - a)^2}{N}}$$

where a is an average besides the arithmetic mean. Of all such standard deviations, the minimum is that for which $a = \bar{X}$, because of Property (b), Chap. 3, Page 46. This property provides an important reason for defining the standard deviation as above. For a proof of this property see Prob. 4.27.

2. For normal distributions (see Chapter 7) it turns out that:
a. 68.27% of the cases are included between $\bar{X} - s$ and $\bar{X} + s$ (i.e. one standard deviation on either side of the mean)
b. 95.45% of the cases are included between $\bar{X} - 2s$ and $\bar{X} + 2s$ (i.e. two standard deviations on either side of the mean)
c. 99.73% of the cases are included between $\bar{X} - 3s$ and $\bar{X} + 3s$ (i.e. three standard deviations on either side of the mean)
as indicated in Fig. 4-1. For moderately skewed distributions the above percentages may hold approximately (see Prob. 4.24).

3. Suppose that two sets consisting of $N_1$ and $N_2$ numbers (or two frequency distributions with total frequencies $N_1$ and $N_2$) have variances given by $s_1^2$ and $s_2^2$ respectively and the same mean $\bar{X}$. Then the combined or pooled variance of both sets (or both frequency distributions) is given by

$$s^2 = \frac{N_1 s_1^2 + N_2 s_2^2}{N_1 + N_2}$$

Note that this is a weighted arithmetic mean of the variances. This result can be generalized to 3 or more sets.
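Properties 2 and 3 lend themselves to a quick numerical check. The sketch below assumes Python's `statistics.NormalDist` for the normal percentages and uses two small made-up sets with the same mean for the pooled variance; none of these numbers come from the text's problems.

```python
from statistics import NormalDist, pvariance

# Property 2: share of a normal distribution within k standard deviations
# of the mean, for k = 1, 2, 3.
std_normal = NormalDist()    # mean 0, standard deviation 1
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

# Property 3: pooled variance of two sets sharing the same mean (here 5).
set_1 = [4, 6]               # N1 = 2, variance 1
set_2 = [2, 5, 8]            # N2 = 3, variance 6
n1, n2 = len(set_1), len(set_2)
pooled = (n1 * pvariance(set_1) + n2 * pvariance(set_2)) / (n1 + n2)

# Variance computed directly from the combined set, for comparison.
combined = pvariance(set_1 + set_2)
print(within, pooled, combined)
```

The pooled formula relies on the two sets having the same mean; with different means the directly computed variance of the combined set would be larger than the weighted average.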

CHARLIER'S CHECK Charlier's check in computations of the mean and standard deviation by the coding method makes use of the identities

$$\sum f(u + 1) = \sum fu + \sum f = \sum fu + N$$

$$\sum f(u + 1)^2 = \sum f(u^2 + 2u + 1) = \sum fu^2 + 2\sum fu + N$$

See Problem 4.20.

SHEPPARD'S CORRECTION FOR VARIANCE The computation of the standard deviation is somewhat in error due to grouping of data into classes (grouping error). To adjust for grouping error we use the formula

$$\text{Corrected Variance} = \text{Variance from grouped data} - \frac{c^2}{12}$$

where c is the class interval size. The correction $c^2/12$ which is subtracted is called Sheppard's correction. It is used for distributions of continuous variables where the "tails" go gradually to zero in both directions.

Statisticians differ as to when and whether Sheppard's corrections should be applied. Certainly they should not be applied without thorough examination of the situation. This is so because often they tend to overcorrect and thus replace old errors by new errors. In this book unless otherwise indicated we shall not use these corrections.
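For readers who do want to apply the correction, it amounts to a single subtraction. A minimal sketch follows; the function name and the input values (a grouped variance of 8.5275 with class size 3) are illustrative assumptions.

```python
def sheppard_corrected_variance(grouped_variance, class_size):
    """Subtract Sheppard's correction c^2 / 12 from a grouped-data variance."""
    return grouped_variance - class_size ** 2 / 12

# Illustrative input: variance 8.5275 from grouped data with class size c = 3,
# so the correction is 9 / 12 = 0.75.
corrected = sheppard_corrected_variance(8.5275, 3)
print(corrected)
```

The correction grows with the square of the class size, so it matters most when classes are wide relative to the spread of the data.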

EMPIRICAL RELATIONS BETWEEN MEASURES OF DISPERSION For moderately skewed distributions we have the empirical formulae

$$\text{Mean Deviation} = \tfrac{4}{5}\,(\text{Standard Deviation})$$

$$\text{Semi-interquartile Range} = \tfrac{2}{3}\,(\text{Standard Deviation})$$

These are consequences of the fact that for the normal distribution we find that the mean deviation and semi-interquartile range are equal respectively to 0.7979 and 0.6745 times the standard deviation.
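The two normal-distribution constants can be reproduced from standard results: for a normal distribution the mean deviation equals $\sqrt{2/\pi}$ times the standard deviation, and the semi-interquartile range equals the 75th percentile of the standard normal times the standard deviation. The sketch below assumes Python's `statistics.NormalDist`; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Mean deviation of a normal distribution, in units of its standard deviation.
md_over_sd = math.sqrt(2 / math.pi)

# Semi-interquartile range in the same units: by symmetry Q3 and Q1 lie the
# same distance from the mean, so (Q3 - Q1) / 2 is the 75th percentile
# of the standard normal distribution.
q_over_sd = NormalDist().inv_cdf(0.75)

print(round(md_over_sd, 4), round(q_over_sd, 4))
```

Both values sit close to the rounded fractions 4/5 and 2/3 used in the empirical formulae.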

ABSOLUTE AND RELATIVE DISPERSION. COEFFICIENT OF VARIATION

The actual variation or dispersion as determined from the standard deviation or other measure of dispersion is called the absolute dispersion. However, a variation or dispersion of 1 metre in measuring a distance of 1000 metres is quite different in effect from the same variation of 1 metre in a distance of 20 metres. A measure of this effect is supplied by the relative dispersion defined by

$$\text{Relative Dispersion} = \frac{\text{Absolute Dispersion}}{\text{Average}}$$

If the absolute dispersion is the standard deviation s and the average is the mean $\bar{X}$, the relative dispersion is called the coefficient of variation or coefficient of dispersion, given by

$$\text{Coefficient of Variation} = V = \frac{s}{\bar{X}}$$

and is generally expressed as a percentage. Other possibilities also occur (see Problem 4.30). Note that the coefficient of variation is independent of units used. For this reason it is useful in comparing distributions where units may be different. A disadvantage of the coefficient of variation is that it fails to be useful when $\bar{X}$ is close to zero.
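The unit independence of V is easy to demonstrate. In the sketch below the same made-up heights are expressed in centimetres and in inches; the function name is hypothetical and s is computed with N in the denominator, as in the text.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """V = s / mean, with s computed using N in the denominator."""
    return pstdev(values) / mean(values)

heights_cm = [150, 160, 170, 180, 190]        # made-up data
heights_in = [h / 2.54 for h in heights_cm]   # same heights in inches

v_cm = coefficient_of_variation(heights_cm)
v_in = coefficient_of_variation(heights_in)
print(f"{100 * v_cm:.2f}%  {100 * v_in:.2f}%")
```

Changing units rescales s and the mean by the same factor, which cancels in the ratio.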

STANDARDIZED VARIABLE, STANDARD SCORES The variable

$$z = \frac{X - \bar{X}}{s}$$

which measures the deviation from the mean in units of the standard deviation is called a standardized variable and is a dimensionless quantity (i.e. is independent of units used). If deviations from the mean are given in units of the standard deviation, they are said to be expressed in standard units or standard scores. These are of great value in comparison of distributions (see Problem 4.31).
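As a rough illustration of how standard scores support comparisons, the sketch below converts two hypothetical examination marks to standard units and then standardizes a small data set. All numbers and names are assumptions chosen for the example.

```python
from statistics import mean, pstdev

def standard_score(x, mean_value, s):
    """Deviation of x from the mean, in units of the standard deviation."""
    return (x - mean_value) / s

# Hypothetical marks: 84 where the mean is 76 and s = 10,
# and 90 where the mean is 82 and s = 16.
z_first = standard_score(84, 76, 10)
z_second = standard_score(90, 82, 16)

# Standardizing a whole data set gives values with mean 0 and s = 1.
data = [2, 3, 6, 8, 11]
m, s = mean(data), pstdev(data)
z_values = [standard_score(x, m, s) for x in data]
print(z_first, z_second, mean(z_values), pstdev(z_values))
```

Although 90 is the higher raw mark, the first mark lies further above its own mean in standard units, which is the kind of comparison standard scores make possible.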