
Chapter 3

Descriptive Statistics: Numerical Measures
Part B
■ Measures of Distribution Shape, Relative
Location, and Detecting Outliers
■ Exploratory Data Analysis
■ Measures of Association Between Two Variables
■ The Weighted Mean and
Working with Grouped Data

© 2005 Thomson/South-Western 1
Measures of Distribution Shape,
Relative Location, and Detecting Outliers
■ Distribution Shape
■ z-Scores
■ Chebyshev’s Theorem
■ Empirical Rule
■ Detecting Outliers

© 2005 Thomson/South-Western 2
Distribution Shape: Skewness

■ An important measure of the shape of a distribution is called skewness.
■ The formula for computing skewness for a data
set is somewhat complex.
■ Skewness can be easily computed using
statistical software.

© 2005 Thomson/South-Western 3
Distribution Shape: Skewness

■ Symmetric (not skewed)


• Skewness is zero.
• Mean and median are equal.
[Histogram: relative frequencies, Skewness = 0]

© 2005 Thomson/South-Western 4
Distribution Shape: Skewness

■ Moderately Skewed Left


• Skewness is negative.
• Mean will usually be less than the median.
[Histogram: relative frequencies, Skewness = −.31]

© 2005 Thomson/South-Western 5
Distribution Shape: Skewness

■ Moderately Skewed Right


• Skewness is positive.
• Mean will usually be more than the median.
[Histogram: relative frequencies, Skewness = .31]

© 2005 Thomson/South-Western 6
Distribution Shape: Skewness

■ Highly Skewed Right


• Skewness is positive (often above 1.0).
• Mean will usually be more than the median.
[Histogram: relative frequencies, Skewness = 1.25]

© 2005 Thomson/South-Western 7
Distribution Shape: Skewness

■ Example: Apartment Rents


Seventy efficiency apartments were randomly sampled in a small college town. The monthly rent prices for these apartments are listed in ascending order on the next slide.

© 2005 Thomson/South-Western 8
Distribution Shape: Skewness

425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615

© 2005 Thomson/South-Western 9
Distribution Shape: Skewness

[Histogram of the apartment rents: relative frequencies, Skewness = .92]

© 2005 Thomson/South-Western 10
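The skewness value reported above can be reproduced with statistical software. The sketch below is a minimal Python illustration (NumPy and SciPy assumed); it uses the 70 rent values from the data slide and the bias-corrected skewness that spreadsheet packages typically report, so the result should come out close to the .92 shown in the histogram.

    import numpy as np
    from scipy.stats import skew

    # The 70 monthly rents listed on the data slide, in ascending order
    rents = np.array([
        425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
        440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
        450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
        465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
        480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
        510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
        575, 575, 580, 590, 600, 600, 600, 600, 615, 615])

    print(np.mean(rents))           # sample mean, about 490.80
    print(np.std(rents, ddof=1))    # sample standard deviation, about 54.74
    print(skew(rents, bias=False))  # bias-corrected skewness, close to .92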
z-Scores

The z-score is often called the standardized value.

It denotes the number of standard deviations a data value xi is from the mean.

    zi = (xi − x̄) / s

© 2005 Thomson/South-Western 11
z-Scores

 An observation’s z-score is a measure of the relative


location of the observation in a data set.
 A data value less than the sample mean will have a
z-score less than zero.
 A data value greater than the sample mean will have
a z-score greater than zero.
 A data value equal to the sample mean will have a
z-score of zero.

© 2005 Thomson/South-Western 12
z-Scores

■ z-Score of Smallest Value (425)


    z = (xi − x̄) / s = (425 − 490.80) / 54.74 = −1.20

Standardized Values for Apartment Rents


-1.20 -1.11 -1.11 -1.02 -1.02 -1.02 -1.02 -1.02 -0.93 -0.93
-0.93 -0.93 -0.93 -0.84 -0.84 -0.84 -0.84 -0.84 -0.75 -0.75
-0.75 -0.75 -0.75 -0.75 -0.75 -0.56 -0.56 -0.56 -0.47 -0.47
-0.47 -0.38 -0.38 -0.34 -0.29 -0.29 -0.29 -0.20 -0.20 -0.20
-0.20 -0.11 -0.01 -0.01 -0.01 0.17 0.17 0.17 0.17 0.35
0.35 0.44 0.62 0.62 0.62 0.81 1.06 1.08 1.45 1.45
1.54 1.54 1.63 1.81 1.99 1.99 1.99 1.99 2.27 2.27

© 2005 Thomson/South-Western 13
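As a minimal Python sketch (values taken from the slide above), the smallest and largest standardized values can be reproduced directly from the definition:

    x_bar, s = 490.80, 54.74           # sample mean and standard deviation of the rents
    z_smallest = (425 - x_bar) / s     # about -1.20, the first entry in the table
    z_largest = (615 - x_bar) / s      # about  2.27, the last entry in the table
    print(round(z_smallest, 2), round(z_largest, 2))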
Chebyshev’s Theorem (or Chebyshev's
inequality)

At least (1 − 1/z²) of the items in any data set will be within z standard deviations of the mean, where z is any value greater than 1 (z > 1; note that z = 1 itself is not included).

© 2005 Thomson/South-Western 14
Pafnuty Chebyshev
(1821-1894)

© 2005 Thomson/South-Western 15
Chebyshev’s Theorem

At least 75% of the data values must be within z = 2 standard deviations of the mean.

At least 89% of the data values must be within z = 3 standard deviations of the mean.

At least 94% of the data values must be within z = 4 standard deviations of the mean.

© 2005 Thomson/South-Western 16
Chebyshev’s Theorem

For example:
Let z = 1.5 with x̄ = 490.80 and s = 54.74

At least (1 − 1/(1.5)²) = 1 − 0.44 = 0.56, or 56%,
of the rent values must be between
x̄ − z(s) = 490.80 − 1.5(54.74) ≈ 409
and
x̄ + z(s) = 490.80 + 1.5(54.74) ≈ 573

(Actually, 86% of the rent values are between 409 and 573.)

© 2005 Thomson/South-Western 17
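A short Python sketch of the same calculation (values from the slide; checking the actual 86% would require counting the raw rents, e.g. the rents array from the earlier skewness sketch):

    x_bar, s, z = 490.80, 54.74, 1.5
    guaranteed = 1 - 1 / z**2               # Chebyshev: at least 0.56 of the values
    lo, hi = x_bar - z * s, x_bar + z * s   # about 409 and 573
    print(round(guaranteed, 2), round(lo, 2), round(hi, 2))
    # With the raw data: np.mean((rents >= lo) & (rents <= hi)) gives about 0.86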
Empirical Rule

For data having a bell-shaped distribution:

68.26% of the values of a normal random variable are within ±1 standard deviation of its mean.

95.44% of the values of a normal random variable are within ±2 standard deviations of its mean.

99.72% of the values of a normal random variable are within ±3 standard deviations of its mean.

© 2005 Thomson/South-Western 18
Empirical Rule

[Bell-shaped curve centered at µ: 68.26% of values within µ ± 1σ, 95.44% within µ ± 2σ, 99.72% within µ ± 3σ]

© 2005 Thomson/South-Western 19
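The three percentages can be checked empirically by simulating a large normal sample; the sketch below (Python with NumPy assumed) is an illustration added here, not part of the original slides:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=0.0, scale=1.0, size=1_000_000)  # bell-shaped data
    for k in (1, 2, 3):
        share = np.mean(np.abs(x) <= k)                 # fraction within k std devs
        print(f"within ±{k} std dev: {share:.2%}")      # about 68.3%, 95.4%, 99.7%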
Detecting Outliers

 An outlier is an unusually small or unusually large


value in a data set.
 A data value with a z-score less than -3 or greater
than +3 might be considered an outlier.
 It might be:
• an incorrectly recorded data value
• a data value that was incorrectly included in the
data set
• a correctly recorded data value that belongs in
the data set

© 2005 Thomson/South-Western 20
Detecting Outliers

 The most extreme z-scores are -1.20 and 2.27


 Using |z| > 3 as the criterion for an outlier, there are
no outliers in this data set.
Standardized Values for Apartment Rents
-1.20 -1.11 -1.11 -1.02 -1.02 -1.02 -1.02 -1.02 -0.93 -0.93
-0.93 -0.93 -0.93 -0.84 -0.84 -0.84 -0.84 -0.84 -0.75 -0.75
-0.75 -0.75 -0.75 -0.75 -0.75 -0.56 -0.56 -0.56 -0.47 -0.47
-0.47 -0.38 -0.38 -0.34 -0.29 -0.29 -0.29 -0.20 -0.20 -0.20
-0.20 -0.11 -0.01 -0.01 -0.01 0.17 0.17 0.17 0.17 0.35
0.35 0.44 0.62 0.62 0.62 0.81 1.06 1.08 1.45 1.45
1.54 1.54 1.63 1.81 1.99 1.99 1.99 1.99 2.27 2.27

© 2005 Thomson/South-Western 21
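A generic outlier check based on the |z| > 3 rule might look like the sketch below (Python; the simulated example data are hypothetical, but the same function applied to the 70 rents flags nothing, consistent with the slide):

    import numpy as np

    def z_outliers(data, threshold=3.0):
        """Return the values whose z-score exceeds the threshold in absolute value."""
        data = np.asarray(data, dtype=float)
        z = (data - data.mean()) / data.std(ddof=1)   # z-scores using the sample std dev
        return data[np.abs(z) > threshold]

    rng = np.random.default_rng(1)
    sample = np.append(rng.normal(500, 50, size=100), 900)  # hypothetical data plus one extreme value
    print(z_outliers(sample))                               # the planted 900 is flagged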
Exploratory Data Analysis

■ Five-Number Summary
■ Box Plot

© 2005 Thomson/South-Western 22
Five-Number Summary

1 Smallest Value

2 First Quartile

3 Median

4 Third Quartile

5 Largest Value

© 2005 Thomson/South-Western 23
Five-Number Summary

Lowest Value = 425
First Quartile = 445
Median = 475
Third Quartile = 525
Largest Value = 615
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615

© 2005 Thomson/South-Western 24
Box Plot

 A box is drawn with its ends located at the first and


third quartiles.
 A vertical line is drawn in the box at the location of
the median (second quartile).

[Box plot on a rent scale from 375 to 625: box from Q1 = 445 to Q3 = 525, median line at Q2 = 475]
© 2005 Thomson/South-Western 25
Box Plot

■ Limits are located (not drawn) using the interquartile range (IQR).
■ Data outside these limits are considered outliers.
■ The location of each outlier is shown with the symbol * .
… continued

© 2005 Thomson/South-Western 26
Box Plot

 The lower limit is located 1.5(IQR) below Q1.

Lower Limit: Q1 - 1.5(IQR) = 445 - 1.5(80) = 325

 The upper limit is located 1.5(IQR) above Q3.

Upper Limit: Q3 + 1.5(IQR) = 525 + 1.5(80) = 645

 There are no outliers (values less than 325 or


greater than 645) in the apartment rent data.

© 2005 Thomson/South-Western 27
Box Plot

■ Whiskers (dashed lines) are drawn from the ends of the box to the smallest and largest data values inside the limits.

[Box plot with whiskers on a rent scale from 375 to 625: smallest value inside limits = 425, largest value inside limits = 615]

© 2005 Thomson/South-Western 28
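A minimal Python sketch of the limit calculation (quartiles taken from the five-number summary above; note that NumPy's default quantile interpolation can differ by a few dollars from the textbook convention if you recompute the quartiles from the raw rents):

    q1, q2, q3 = 445, 475, 525          # quartiles from the five-number summary
    iqr = q3 - q1                       # interquartile range = 80
    lower_limit = q1 - 1.5 * iqr        # 325
    upper_limit = q3 + 1.5 * iqr        # 645
    print(iqr, lower_limit, upper_limit)
    # The plot itself could be drawn from the raw data with, e.g.,
    # matplotlib.pyplot.boxplot(rents, vert=False)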
Measures of Association
Between Two Variables
■ Covariance
■ Correlation Coefficient

© 2005 Thomson/South-Western 29
Covariance

The covariance is a measure of the linear association between two variables.

Positive values indicate a positive relationship.

Negative values indicate a negative relationship.

© 2005 Thomson/South-Western 30
Covariance

The covariance is computed as follows:

    sxy = Σ(xi − x̄)(yi − ȳ) / (n − 1)        for samples

    σxy = Σ(xi − µx)(yi − µy) / N            for populations

© 2005 Thomson/South-Western 31
Correlation Coefficient

The correlation coefficient is computed as follows:

    rxy = sxy / (sx sy)        for samples

    ρxy = σxy / (σx σy)        for populations

© 2005 Thomson/South-Western 32
Correlation Coefficient

The coefficient can take on values between −1 and +1.

Values near −1 indicate a strong negative linear relationship.

Values near +1 indicate a strong positive linear relationship.

© 2005 Thomson/South-Western 33
Correlation Coefficient

Correlation is a measure of linear association and not necessarily causation.

Just because two variables are highly correlated, it does not mean that one variable is the cause of the other.

© 2005 Thomson/South-Western 34
Covariance and Correlation Coefficient

A golfer is interested in investigating the relationship, if any, between driving distance and 18-hole score.

Average Driving Distance (yds.)    Average 18-Hole Score
277.6 69
259.5 71
269.1 70
267.0 70
255.6 71
272.9 69

© 2005 Thomson/South-Western 35
Covariance and Correlation Coefficient

x        y      (xi − x̄)   (yi − ȳ)   (xi − x̄)(yi − ȳ)

277.6 69 10.65 -1.0 -10.65


259.5 71 -7.45 1.0 -7.45
269.1 70 2.15 0 0
267.0 70 0.05 0 0
255.6 71 -11.35 1.0 -11.35
272.9 69 5.95 -1.0 -5.95
Average 267.0 70.0 Total -35.40
Std. Dev. 8.2192   .8944

© 2005 Thomson/South-Western 36
Covariance and Correlation Coefficient

■ Sample Covariance

    sxy = Σ(xi − x̄)(yi − ȳ) / (n − 1) = −35.40 / (6 − 1) = −7.08

■ Sample Correlation Coefficient
    rxy = sxy / (sx sy) = −7.08 / ((8.2192)(.8944)) = −.9631

© 2005 Thomson/South-Western 37
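The same numbers can be reproduced with NumPy; this sketch assumes the six golfer observations from the earlier slide:

    import numpy as np

    drive = np.array([277.6, 259.5, 269.1, 267.0, 255.6, 272.9])  # average driving distance (yds.)
    score = np.array([69.0, 71.0, 70.0, 70.0, 71.0, 69.0])        # average 18-hole score

    s_xy = np.cov(drive, score, ddof=1)[0, 1]   # sample covariance, about -7.08
    r_xy = np.corrcoef(drive, score)[0, 1]      # sample correlation, about -.9631
    print(round(s_xy, 2), round(r_xy, 4))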
The Weighted Mean and
Working with Grouped Data
■ Weighted Mean
■ Mean for Grouped Data
■ Variance for Grouped Data
■ Standard Deviation for Grouped Data

© 2005 Thomson/South-Western 38
Weighted Mean

 When the mean is computed by giving each data


value a weight that reflects its importance, it is
referred to as a weighted mean.
 In the computation of a grade point average (GPA),
the weights are the number of credit hours earned for
each grade.
 When data values vary in importance, the analyst
must choose the weight that best reflects the
importance of each value.

© 2005 Thomson/South-Western 39
Weighted Mean

    x̄ = Σ wi xi / Σ wi

where:
xi = value of observation i
wi = weight for observation i

© 2005 Thomson/South-Western 40
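As a small illustration of the formula (the course grades and credit hours below are hypothetical, not taken from the slides):

    grade_points = [4.0, 3.0, 3.0, 2.0]   # xi: grade points earned in each course
    credit_hours = [3, 3, 4, 2]           # wi: weights (credit hours)

    gpa = sum(w * x for w, x in zip(credit_hours, grade_points)) / sum(credit_hours)
    print(round(gpa, 2))   # (12 + 9 + 12 + 4) / 12 = 37/12, about 3.08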
Grouped Data

 The weighted mean computation can be used to


obtain approximations of the mean, variance, and
standard deviation for the grouped data.
 To compute the weighted mean, we treat the
midpoint of each class as though it were the mean
of all items in the class.
 We compute a weighted mean of the class midpoints
using the class frequencies as weights.
 Similarly, in computing the variance and standard
deviation, the class frequencies are used as weights.

© 2005 Thomson/South-Western 41
Mean for Grouped Data

■ Sample Data

    x̄ = Σ fi Mi / n

■ Population Data

    µ = Σ fi Mi / N
where:
fi = frequency of class i
Mi = midpoint of class i

© 2005 Thomson/South-Western 42
Sample Mean for Grouped Data

Given below is the previous sample of monthly rents for 70 efficiency apartments, presented here as grouped data in the form of a frequency distribution.

Rent ($) Frequency


420-439 8
440-459 17
460-479 12
480-499 8
500-519 7
520-539 4
540-559 2
560-579 4
580-599 2
600-619 6

© 2005 Thomson/South-Western 43
Sample Mean for Grouped Data

Rent ($)    fi    Mi      fi Mi
420-439      8   429.5   3436.0
440-459     17   449.5   7641.5
460-479     12   469.5   5634.0
480-499      8   489.5   3916.0
500-519      7   509.5   3566.5
520-539      4   529.5   2118.0
540-559      2   549.5   1099.0
560-579      4   569.5   2278.0
580-599      2   589.5   1179.0
600-619      6   609.5   3657.0
Total       70          34525.0

    x̄ = 34,525 / 70 = 493.21

This approximation differs by $2.41 from the actual sample mean of $490.80.

© 2005 Thomson/South-Western 44
Variance for Grouped Data

■ For sample data

∑ f i ( M i − x ) 2
s2 =
n −1

■ For population data

∑ f i ( M i − µ ) 2
σ2 =
N

© 2005 Thomson/South-Western 45
Sample Variance for Grouped Data

Rent ($)    fi    Mi      Mi − x̄   (Mi − x̄)²   fi(Mi − x̄)²
420-439 8 429.5 -63.7 4058.96 32471.71
440-459 17 449.5 -43.7 1910.56 32479.59
460-479 12 469.5 -23.7 562.16 6745.97
480-499 8 489.5 -3.7 13.76 110.11
500-519 7 509.5 16.3 265.36 1857.55
520-539 4 529.5 36.3 1316.96 5267.86
540-559 2 549.5 56.3 3168.56 6337.13
560-579 4 569.5 76.3 5820.16 23280.66
580-599 2 589.5 96.3 9271.76 18543.53
600-619 6 609.5 116.3 13523.36 81140.18
Total 70 208234.29
continued
© 2005 Thomson/South-Western 46
Sample Variance for Grouped Data

■ Sample Variance

s² = 208,234.29/(70 − 1) = 3,017.89

■ Sample Standard Deviation


s = √3,017.89 = 54.94

This approximation differs by only $.20 from the actual standard deviation of $54.74.

© 2005 Thomson/South-Western 47
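The grouped-data approximations above can be reproduced with a few lines of Python (NumPy assumed); the class frequencies and midpoints come straight from the frequency distribution slide:

    import numpy as np

    freqs = np.array([8, 17, 12, 8, 7, 4, 2, 4, 2, 6])   # class frequencies fi
    mids = 429.5 + 20 * np.arange(10)                    # class midpoints Mi: 429.5, 449.5, ..., 609.5
    n = freqs.sum()                                      # 70

    mean_g = (freqs * mids).sum() / n                            # about 493.21
    var_g = (freqs * (mids - mean_g) ** 2).sum() / (n - 1)       # about 3,017.89
    std_g = np.sqrt(var_g)                                       # about 54.94
    print(round(mean_g, 2), round(var_g, 2), round(std_g, 2))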
End of Chapter 3, Part B

© 2005 Thomson/South-Western 48
