
Chapter 3

Descriptive Statistics: Numerical Measures
Part B
■ Measures of Distribution Shape, Relative
Location, and Detecting Outliers
■ Exploratory Data Analysis
■ Measures of Association Between Two Variables
■ The Weighted Mean and
Working with Grouped Data

© 2005 Thomson/South-Western 1
Measures of Distribution Shape,
Relative Location, and Detecting Outliers
■ Distribution Shape
■ z-Scores
■ Chebyshev’s Theorem
■ Empirical Rule
■ Detecting Outliers

© 2005 Thomson/South-Western 2
Distribution Shape: Skewness

■ An important measure of the shape of a distribution is called skewness.
■ The formula for computing skewness for a data
set is somewhat complex.
■ Skewness can be easily computed using
statistical software.

© 2005 Thomson/South-Western 3
Distribution Shape: Skewness

■ Symmetric (not skewed)


• Skewness is zero.
• Mean and median are equal.
[Histogram: relative frequencies, Skewness = 0]

© 2005 Thomson/South-Western 4
Distribution Shape: Skewness

■ Moderately Skewed Left


• Skewness is negative.
• Mean will usually be less than the median.
[Histogram: relative frequencies, Skewness = −.31]

© 2005 Thomson/South-Western 5
Distribution Shape: Skewness

■ Moderately Skewed Right


• Skewness is positive.
• Mean will usually be more than the median.
[Histogram: relative frequencies, Skewness = .31]

© 2005 Thomson/South-Western 6
Distribution Shape: Skewness

■ Highly Skewed Right


• Skewness is positive (often above 1.0).
• Mean will usually be more than the median.
[Histogram: relative frequencies, Skewness = 1.25]

© 2005 Thomson/South-Western 7
Distribution Shape: Skewness

■ Example: Apartment Rents


Seventy efficiency apartments were randomly sampled in a small college town. The monthly rent prices for these apartments are listed in ascending order on the next slide.

© 2005 Thomson/South-Western 8
Distribution Shape: Skewness

425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615

© 2005 Thomson/South-Western 9
Distribution Shape: Skewness

[Histogram of the apartment rents: relative frequencies, Skewness = .92]

© 2005 Thomson/South-Western 10
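The skewness value reported above can be reproduced with statistical software. The sketch below is a minimal Python illustration (NumPy and SciPy assumed); it uses the 70 rent values from the data slide and the bias-corrected skewness that spreadsheet packages typically report, so the result should come out close to the .92 shown in the histogram.

    import numpy as np
    from scipy.stats import skew

    # The 70 monthly rents listed on the data slide, in ascending order
    rents = np.array([
        425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
        440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
        450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
        465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
        480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
        510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
        575, 575, 580, 590, 600, 600, 600, 600, 615, 615])

    print(np.mean(rents))           # sample mean, about 490.80
    print(np.std(rents, ddof=1))    # sample standard deviation, about 54.74
    print(skew(rents, bias=False))  # bias-corrected skewness, close to .92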
z-Scores

The z-score is often called the standardized value.

It denotes the number of standard deviations a data value xi is from the mean.

    zi = (xi − x̄) / s

© 2005 Thomson/South-Western 11
z-Scores

 An observation’s z-score is a measure of the relative


location of the observation in a data set.
 A data value less than the sample mean will have a
z-score less than zero.
 A data value greater than the sample mean will have
a z-score greater than zero.
 A data value equal to the sample mean will have a
z-score of zero.

© 2005 Thomson/South-Western 12
z-Scores

■ z-Score of Smallest Value (425)


    z = (xi − x̄) / s = (425 − 490.80) / 54.74 = −1.20

Standardized Values for Apartment Rents


-1.20 -1.11 -1.11 -1.02 -1.02 -1.02 -1.02 -1.02 -0.93 -0.93
-0.93 -0.93 -0.93 -0.84 -0.84 -0.84 -0.84 -0.84 -0.75 -0.75
-0.75 -0.75 -0.75 -0.75 -0.75 -0.56 -0.56 -0.56 -0.47 -0.47
-0.47 -0.38 -0.38 -0.34 -0.29 -0.29 -0.29 -0.20 -0.20 -0.20
-0.20 -0.11 -0.01 -0.01 -0.01 0.17 0.17 0.17 0.17 0.35
0.35 0.44 0.62 0.62 0.62 0.81 1.06 1.08 1.45 1.45
1.54 1.54 1.63 1.81 1.99 1.99 1.99 1.99 2.27 2.27

© 2005 Thomson/South-Western 13
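As a minimal Python sketch (values taken from the slide above), the smallest and largest standardized values can be reproduced directly from the definition:

    x_bar, s = 490.80, 54.74           # sample mean and standard deviation of the rents
    z_smallest = (425 - x_bar) / s     # about -1.20, the first entry in the table
    z_largest = (615 - x_bar) / s      # about  2.27, the last entry in the table
    print(round(z_smallest, 2), round(z_largest, 2))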
Chebyshev’s Theorem (or Chebyshev's
inequality)

At least (1 − 1/z²) of the items in any data set will be within z standard deviations of the mean, where z is any value greater than 1 (z > 1; note that z = 1 itself is not included).

© 2005 Thomson/South-Western 14
Pafnuty Chebyshev
(1821-1894)

© 2005 Thomson/South-Western 15
Chebyshev’s Theorem

At least 75% of the data values must be within z = 2 standard deviations of the mean.

At least 89% of the data values must be within z = 3 standard deviations of the mean.

At least 94% of the data values must be within z = 4 standard deviations of the mean.

© 2005 Thomson/South-Western 16
Chebyshev’s Theorem

For example:
Let z = 1.5 with x̄ = 490.80 and s = 54.74

At least (1 − 1/(1.5)²) = 1 − 0.44 = 0.56, or 56%,
of the rent values must be between
x̄ − z(s) = 490.80 − 1.5(54.74) ≈ 409
and
x̄ + z(s) = 490.80 + 1.5(54.74) ≈ 573

(Actually, 86% of the rent values are between 409 and 573.)

© 2005 Thomson/South-Western 17
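A short Python sketch of the same calculation (values from the slide; checking the actual 86% would require counting the raw rents, e.g. the rents array from the earlier skewness sketch):

    x_bar, s, z = 490.80, 54.74, 1.5
    guaranteed = 1 - 1 / z**2               # Chebyshev: at least 0.56 of the values
    lo, hi = x_bar - z * s, x_bar + z * s   # about 409 and 573
    print(round(guaranteed, 2), round(lo, 2), round(hi, 2))
    # With the raw data: np.mean((rents >= lo) & (rents <= hi)) gives about 0.86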
Empirical Rule

For data having a bell-shaped distribution:

68.26% of the values of a normal random variable are within ±1 standard deviation of its mean.

95.44% of the values of a normal random variable are within ±2 standard deviations of its mean.

99.72% of the values of a normal random variable are within ±3 standard deviations of its mean.

© 2005 Thomson/South-Western 18
Empirical Rule

[Bell-shaped curve centered at µ: 68.26% of values within µ ± 1σ, 95.44% within µ ± 2σ, 99.72% within µ ± 3σ]

© 2005 Thomson/South-Western 19
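The three percentages can be checked empirically by simulating a large normal sample; the sketch below (Python with NumPy assumed) is an illustration added here, not part of the original slides:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=0.0, scale=1.0, size=1_000_000)  # bell-shaped data
    for k in (1, 2, 3):
        share = np.mean(np.abs(x) <= k)                 # fraction within k std devs
        print(f"within ±{k} std dev: {share:.2%}")      # about 68.3%, 95.4%, 99.7%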
Detecting Outliers

 An outlier is an unusually small or unusually large


value in a data set.
 A data value with a z-score less than -3 or greater
than +3 might be considered an outlier.
 It might be:
• an incorrectly recorded data value
• a data value that was incorrectly included in the
data set
• a correctly recorded data value that belongs in
the data set

© 2005 Thomson/South-Western 20
Detecting Outliers

 The most extreme z-scores are -1.20 and 2.27


 Using |z| > 3 as the criterion for an outlier, there are
no outliers in this data set.
Standardized Values for Apartment Rents
-1.20 -1.11 -1.11 -1.02 -1.02 -1.02 -1.02 -1.02 -0.93 -0.93
-0.93 -0.93 -0.93 -0.84 -0.84 -0.84 -0.84 -0.84 -0.75 -0.75
-0.75 -0.75 -0.75 -0.75 -0.75 -0.56 -0.56 -0.56 -0.47 -0.47
-0.47 -0.38 -0.38 -0.34 -0.29 -0.29 -0.29 -0.20 -0.20 -0.20
-0.20 -0.11 -0.01 -0.01 -0.01 0.17 0.17 0.17 0.17 0.35
0.35 0.44 0.62 0.62 0.62 0.81 1.06 1.08 1.45 1.45
1.54 1.54 1.63 1.81 1.99 1.99 1.99 1.99 2.27 2.27

© 2005 Thomson/South-Western 21
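A generic outlier check based on the |z| > 3 rule might look like the sketch below (Python; the simulated example data are hypothetical, but the same function applied to the 70 rents flags nothing, consistent with the slide):

    import numpy as np

    def z_outliers(data, threshold=3.0):
        """Return the values whose z-score exceeds the threshold in absolute value."""
        data = np.asarray(data, dtype=float)
        z = (data - data.mean()) / data.std(ddof=1)   # z-scores using the sample std dev
        return data[np.abs(z) > threshold]

    rng = np.random.default_rng(1)
    sample = np.append(rng.normal(500, 50, size=100), 900)  # hypothetical data plus one extreme value
    print(z_outliers(sample))                               # the planted 900 is flagged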
Exploratory Data Analysis

■ Five-Number Summary
■ Box Plot

© 2005 Thomson/South-Western 22
Five-Number Summary

1 Smallest Value

2 First Quartile

3 Median

4 Third Quartile

5 Largest Value

© 2005 Thomson/South-Western 23
Five-Number Summary

Lowest Value = 425
First Quartile = 445
Median = 475
Third Quartile = 525
Largest Value = 615
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615

© 2005 Thomson/South-Western 24
Box Plot

 A box is drawn with its ends located at the first and


third quartiles.
 A vertical line is drawn in the box at the location of
the median (second quartile).

[Box plot on a rent scale from 375 to 625: box from Q1 = 445 to Q3 = 525, median line at Q2 = 475]
© 2005 Thomson/South-Western 25
Box Plot

■ Limits are located (not drawn) using the interquartile range (IQR).
■ Data outside these limits are considered outliers.
■ The location of each outlier is shown with the symbol * .
… continued

© 2005 Thomson/South-Western 26
Box Plot

 The lower limit is located 1.5(IQR) below Q1.

Lower Limit: Q1 - 1.5(IQR) = 445 - 1.5(80) = 325

 The upper limit is located 1.5(IQR) above Q3.

Upper Limit: Q3 + 1.5(IQR) = 525 + 1.5(80) = 645

 There are no outliers (values less than 325 or


greater than 645) in the apartment rent data.

© 2005 Thomson/South-Western 27
Box Plot

■ Whiskers (dashed lines) are drawn from the ends of the box to the smallest and largest data values inside the limits.

[Box plot with whiskers on a rent scale from 375 to 625: smallest value inside limits = 425, largest value inside limits = 615]

© 2005 Thomson/South-Western 28
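A minimal Python sketch of the limit calculation (quartiles taken from the five-number summary above; note that NumPy's default quantile interpolation can differ by a few dollars from the textbook convention if you recompute the quartiles from the raw rents):

    q1, q2, q3 = 445, 475, 525          # quartiles from the five-number summary
    iqr = q3 - q1                       # interquartile range = 80
    lower_limit = q1 - 1.5 * iqr        # 325
    upper_limit = q3 + 1.5 * iqr        # 645
    print(iqr, lower_limit, upper_limit)
    # The plot itself could be drawn from the raw data with, e.g.,
    # matplotlib.pyplot.boxplot(rents, vert=False)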
Measures of Association
Between Two Variables
■ Covariance
■ Correlation Coefficient

© 2005 Thomson/South-Western 29
Covariance

The covariance is a measure of the linear association between two variables.

Positive values indicate a positive relationship.

Negative values indicate a negative relationship.

© 2005 Thomson/South-Western 30
Covariance

The covariance is computed as follows:

    sxy = Σ(xi − x̄)(yi − ȳ) / (n − 1)        for samples

    σxy = Σ(xi − µx)(yi − µy) / N            for populations

© 2005 Thomson/South-Western 31
Correlation Coefficient

The correlation coefficient is computed as follows:

    rxy = sxy / (sx sy)        for samples

    ρxy = σxy / (σx σy)        for populations

© 2005 Thomson/South-Western 32
Correlation Coefficient

The coefficient can take on values between −1 and +1.

Values near −1 indicate a strong negative linear relationship.

Values near +1 indicate a strong positive linear relationship.

© 2005 Thomson/South-Western 33
Correlation Coefficient

Correlation is a measure of linear association and not necessarily causation.

Just because two variables are highly correlated, it does not mean that one variable is the cause of the other.

© 2005 Thomson/South-Western 34
Covariance and Correlation Coefficient

A golfer is interested in investigating the relationship, if any, between driving distance and 18-hole score.

Average Driving Distance (yds.)    Average 18-Hole Score
277.6 69
259.5 71
269.1 70
267.0 70
255.6 71
272.9 69

© 2005 Thomson/South-Western 35
Covariance and Correlation Coefficient

x        y      (xi − x̄)   (yi − ȳ)   (xi − x̄)(yi − ȳ)

277.6 69 10.65 -1.0 -10.65


259.5 71 -7.45 1.0 -7.45
269.1 70 2.15 0 0
267.0 70 0.05 0 0
255.6 71 -11.35 1.0 -11.35
272.9 69 5.95 -1.0 -5.95
Average 267.0 70.0 Total -35.40
Std. Dev. 8.2192   .8944

© 2005 Thomson/South-Western 36
Covariance and Correlation Coefficient

■ Sample Covariance

    sxy = Σ(xi − x̄)(yi − ȳ) / (n − 1) = −35.40 / (6 − 1) = −7.08

■ Sample Correlation Coefficient
    rxy = sxy / (sx sy) = −7.08 / ((8.2192)(.8944)) = −.9631

© 2005 Thomson/South-Western 37
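The same numbers can be reproduced with NumPy; this sketch assumes the six golfer observations from the earlier slide:

    import numpy as np

    drive = np.array([277.6, 259.5, 269.1, 267.0, 255.6, 272.9])  # average driving distance (yds.)
    score = np.array([69.0, 71.0, 70.0, 70.0, 71.0, 69.0])        # average 18-hole score

    s_xy = np.cov(drive, score, ddof=1)[0, 1]   # sample covariance, about -7.08
    r_xy = np.corrcoef(drive, score)[0, 1]      # sample correlation, about -.9631
    print(round(s_xy, 2), round(r_xy, 4))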
The Weighted Mean and
Working with Grouped Data
■ Weighted Mean
■ Mean for Grouped Data
■ Variance for Grouped Data
■ Standard Deviation for Grouped Data

© 2005 Thomson/South-Western 38
Weighted Mean

 When the mean is computed by giving each data


value a weight that reflects its importance, it is
referred to as a weighted mean.
 In the computation of a grade point average (GPA),
the weights are the number of credit hours earned for
each grade.
 When data values vary in importance, the analyst
must choose the weight that best reflects the
importance of each value.

© 2005 Thomson/South-Western 39
Weighted Mean

    x̄ = Σ wi xi / Σ wi

where:
xi = value of observation i
wi = weight for observation i

© 2005 Thomson/South-Western 40
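As a small illustration of the formula (the course grades and credit hours below are hypothetical, not taken from the slides):

    grade_points = [4.0, 3.0, 3.0, 2.0]   # xi: grade points earned in each course
    credit_hours = [3, 3, 4, 2]           # wi: weights (credit hours)

    gpa = sum(w * x for w, x in zip(credit_hours, grade_points)) / sum(credit_hours)
    print(round(gpa, 2))   # (12 + 9 + 12 + 4) / 12 = 37/12, about 3.08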
Grouped Data

 The weighted mean computation can be used to


obtain approximations of the mean, variance, and
standard deviation for the grouped data.
 To compute the weighted mean, we treat the
midpoint of each class as though it were the mean
of all items in the class.
 We compute a weighted mean of the class midpoints
using the class frequencies as weights.
 Similarly, in computing the variance and standard
deviation, the class frequencies are used as weights.

© 2005 Thomson/South-Western 41
Mean for Grouped Data

■ Sample Data

    x̄ = Σ fi Mi / n

■ Population Data

    µ = Σ fi Mi / N
where:
fi = frequency of class i
Mi = midpoint of class i

© 2005 Thomson/South-Western 42
Sample Mean for Grouped Data

Given below is the previous sample of monthly rents for 70 efficiency apartments, presented here as grouped data in the form of a frequency distribution.

Rent ($) Frequency


420-439 8
440-459 17
460-479 12
480-499 8
500-519 7
520-539 4
540-559 2
560-579 4
580-599 2
600-619 6

© 2005 Thomson/South-Western 43
Sample Mean for Grouped Data

Rent ($)    fi    Mi      fi Mi
420-439      8   429.5   3436.0
440-459     17   449.5   7641.5
460-479     12   469.5   5634.0
480-499      8   489.5   3916.0
500-519      7   509.5   3566.5
520-539      4   529.5   2118.0
540-559      2   549.5   1099.0
560-579      4   569.5   2278.0
580-599      2   589.5   1179.0
600-619      6   609.5   3657.0
Total       70          34525.0

    x̄ = 34,525 / 70 = 493.21

This approximation differs by $2.41 from the actual sample mean of $490.80.

© 2005 Thomson/South-Western 44
Variance for Grouped Data

■ For sample data

∑ f i ( M i − x ) 2
s2 =
n −1

■ For population data

∑ f i ( M i − µ ) 2
σ2 =
N

© 2005 Thomson/South-Western 45
Sample Variance for Grouped Data

Rent ($)    fi    Mi      Mi − x̄   (Mi − x̄)²   fi(Mi − x̄)²
420-439 8 429.5 -63.7 4058.96 32471.71
440-459 17 449.5 -43.7 1910.56 32479.59
460-479 12 469.5 -23.7 562.16 6745.97
480-499 8 489.5 -3.7 13.76 110.11
500-519 7 509.5 16.3 265.36 1857.55
520-539 4 529.5 36.3 1316.96 5267.86
540-559 2 549.5 56.3 3168.56 6337.13
560-579 4 569.5 76.3 5820.16 23280.66
580-599 2 589.5 96.3 9271.76 18543.53
600-619 6 609.5 116.3 13523.36 81140.18
Total 70 208234.29
continued
© 2005 Thomson/South-Western 46
Sample Variance for Grouped Data

■ Sample Variance

s² = 208,234.29/(70 − 1) = 3,017.89

■ Sample Standard Deviation


s = √3,017.89 = 54.94

This approximation differs by only $.20 from the actual standard deviation of $54.74.

© 2005 Thomson/South-Western 47
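The grouped-data approximations above can be reproduced with a few lines of Python (NumPy assumed); the class frequencies and midpoints come straight from the frequency distribution slide:

    import numpy as np

    freqs = np.array([8, 17, 12, 8, 7, 4, 2, 4, 2, 6])   # class frequencies fi
    mids = 429.5 + 20 * np.arange(10)                    # class midpoints Mi: 429.5, 449.5, ..., 609.5
    n = freqs.sum()                                      # 70

    mean_g = (freqs * mids).sum() / n                            # about 493.21
    var_g = (freqs * (mids - mean_g) ** 2).sum() / (n - 1)       # about 3,017.89
    std_g = np.sqrt(var_g)                                       # about 54.94
    print(round(mean_g, 2), round(var_g, 2), round(std_g, 2))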
End of Chapter 3, Part B

© 2005 Thomson/South-Western 48
