
Introduction:

The Practice of Statistics for Business and Economics is the required
textbook for both Stat 1181 and Stat 2225. If you are planning to take Stat
2225 after completing Stat 1181, you should purchase the supplementary
material that comes with the main text. This material is prepared for the Stat
2225 course at Langara and thus is only available at the Langara Bookstore.
You do not need the supplementary material for Stat 1181. The topics in the
main text are divided into two areas of concentration:

Exploratory data analysis (Chapters 1 through 3), which deals with
methods and strategies for organizing, exploring, and describing data with
graphs and numerical summaries; it also deals with methods of collecting or
producing data by random sampling.

Probability and Inference (Chapters 4 through 9), which deals with
techniques for drawing conclusions from data using tools of probability to
account for variations in data.
The authors emphasize throughout the text that statistics is learned best by
solving statistical problems. To be successful in this course, you must be
prepared to set aside at least 5 hours per week for your own in-depth study
of the material. This means reading the sections of the text covered in class,
studying any class notes provided, and doing at least 50% of the exercises in
the text plus any assigned exercises. Read pages xxxv through xxxvii for the
authors' recommendations on how best to use the textbook.

Chapter 1
Examining Distributions
In this chapter, we discuss the basic tools for describing data graphically.
More complex graphing tools will be discussed in Chapter 2.
Not all graphs are equal. The basic idea is that the data you have and what
you plan to do with them determine which graph is appropriate. This is
analogous to, for example, a home handyman examining the head of a screw
to decide which screwdriver is needed to take it out. Survey data typically
contain several variables of interest to the researcher. For example, a realtor
interested in house prices in your neighbourhood may collect data on age,
type of dwelling (condo, townhouse, duplex, etc.), number of bedrooms, and
distance to the nearest shopping mall for recently sold homes in the
area. In what price range did most of the houses sell? And is there a
relationship between number of bedrooms and age?
The simplest graphs (histogram, bar graph, pie chart, box plot, and
stem-and-leaf display) organize data by examining just one variable at a time.
More complex graphs, such as the comparative bar graph and scatterplot (to
be discussed in Chapter 2), describe the relationship, if any, between pairs of
variables. You will need to be familiar with the following terms
relating to data analysis in general:

Individual (or case): an entity from whom or about whom
information is gathered. An individual may be a person or an object.
The context will determine which one.
Variable: any characteristic (or attribute) of an individual.

In data analysis, a variable may be described as either categorical or
quantitative. A categorical variable contains descriptive information, which
is typically recorded as words. A categorical variable places individuals
into groups or categories. A quantitative variable contains numeric
information for which arithmetic operations such as averaging are
meaningful. A quantitative variable describes a measurement or a count.
Examples: 1. A financial analyst is interested in determining whether meal
costs at city restaurants differ from meal costs at suburban restaurants. She
collected data from a sample of 100 restaurants, 50 from each area.

Describe the individuals in the study and determine whether the variable of
interest is categorical or quantitative.
Solution: The individuals are the 100 restaurants surveyed. The variable of
interest is the cost (in $) of a meal. The variable is quantitative.
2. A mortgage broker wishes to survey Vancouver households to determine
what percent of a household's income is spent on housing. Describe the
individuals in the study and determine whether the variable of interest is
categorical or quantitative.
Solution: The individuals are the households participating in the survey. The
variable of interest is the percent of the household's income spent on
housing. It is quantitative.
3. A stats student wants to survey Langara students to determine whether a
student's mode of commuting to classes (by car or motorcycle, by bus, by bike,
other) depends on the student's employment status (unemployed,
part-time).
Why is employment status a categorical variable?
Solution: Employment status is categorical because it places each student into
one of two categories.
Your Turn (for practice at home)
a) Determine the type of each of the following variables:
Age, nationality, weight, hourly wage, mother's occupation, program of
study (Career, UT), number of passengers on a bus.
b) Do Apply Your Knowledge Exercise #1.2 on page 5 and check with the
answer provided below.
Solution #1.2: The Who (or cases) is the set of students enrolled in the stat
class. The What (or variables) are the 8 variables, which include ID, Exam1,
Exam2, and Grade; ID and Grade are categorical and the remaining 6 are
quantitative. The Why (or purpose of the data) is to help the instructor keep
track of students' work and to identify students who are falling behind in the
course.
c) Additional practice exercises: page 21, #s 1.13 & 1.15.

Graphing Univariate Data


Graphical tools help to organize data in order to describe their main
characteristics. As a first step in data analysis, begin with a graphical
description; then add appropriate numerical descriptions. Describing a single
variable graphically means displaying its distribution: the possible values it
takes and how often it takes them. The distribution of a categorical variable
is a list of the categories and either a count or a percent of individuals
belonging to each category. The result is displayed in a table called a
frequency distribution; for example, refer to Exercise 1.24, page 22, on market
shares for computer operating systems.
Pie Chart and Bar Graph
Use a pie chart or a bar graph (or preferably a Pareto chart, if we wish to
emphasize categories in terms of their relative frequencies) to organize
data for a categorical variable. Statistical software such as Statgraphics is
required.
Example 1: The number of automobile collisions recorded for each day of the
week is provided in the file Crashes.xls (see page 9 of the text). A frequency
distribution (on the left) and a pie chart of the data are shown below.
Day          Crashes (in thousands)   Percent
Monday        620                     10.1
Tuesday       858                     13.9
Wednesday     935                     15.2
Thursday      915                     14.9
Friday        943                     15.3
Saturday     1051                     17.1
Sunday        830                     13.5

[Figure: Piechart for Crashes (in thousands), with one slice per day of the week: 10.08%, 13.95%, 15.20%, 14.87%, 15.33%, 17.08%, and 13.49% for Monday through Sunday]
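As a rough check on the frequency distribution, the percents can be recomputed from the counts. This is only a sketch in Python (the notes themselves use Statgraphics); the variable names are illustrative.

```python
# Counts of collisions (in thousands) by day, copied from the table in Example 1.
crashes = {
    "Monday": 620, "Tuesday": 858, "Wednesday": 935, "Thursday": 915,
    "Friday": 943, "Saturday": 1051, "Sunday": 830,
}

total = sum(crashes.values())

# Percent of the whole for each category, rounded to two decimals as on the pie chart.
percents = {day: round(100 * count / total, 2) for day, count in crashes.items()}

for day, pct in percents.items():
    print(f"{day:<10} {crashes[day]:>5} {pct:>6.2f}%")
```

Because every day belongs to the same variable, the percents add to 100 (up to rounding), which is what makes a pie chart appropriate here.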

We will discuss how to graph data with Statgraphics later. The pie chart
shows that collisions are spread fairly evenly throughout the week; the fewest
collisions occur on Mondays and the most occur on Saturdays.
Note that a pie chart is suitable for displaying information about individual
categories of a single variable relative to the whole distribution. However, a
bar graph (or Pareto chart) is more flexible than a pie chart. We can use a bar
graph to display categorical data even if the categories do not
represent a single variable. As an example from the text, in a survey of
adults who use several electronic devices or services, we may be interested
in the percent of respondents who love each device or service (Blackberry,
iPod/iPad, Android phone, Satellite radio, Pay TV channels). The
categories under Device or Service do not refer to a single variable, so the
proportions for the categories would not necessarily add to 1 (or 100%), and
displaying the result with a pie chart would be inappropriate. Use a bar
graph instead. A bar graph can also be used to compare any set of quantities
that are measured in the same units, for example, comparing males and
females with respect to the type of recreational sports they prefer.
Example 2: Exercise 1.22, page 22. For your convenience, the data are
reproduced below.
Material      Weight   PercentOfTotal
FoodScraps     31.7        12.5
Glass          13.6         5.3
Metals         20.8         8.2
Paper          83.0        32.7
Plastics       30.7        12.1
Rubber         19.4         7.6
Wood           14.2         5.6
Trimmings      32.6        12.8
Other           8.2         3.2

Refer to the questions in the text.
Solution.
a) The weights add up to 254.2, which is slightly off the sum given in the text,
probably because of rounding error in the calculation. Nothing serious!
b) The bar graph of the variable Percent is shown below.

Histogram
A bar graph or pie chart helps to display data quickly, but both have limited
use in data analysis because it is easy to understand data for a single
categorical variable without a graph. When the variable is quantitative,
however, graphs are essential. The most common method of graphing
quantitative data is a histogram. A histogram is a 2D graph made by grouping
quantitative data into intervals (or bins) along the x-axis and determining a
count of the values belonging to each interval. The counts (called
frequencies) are represented on the y-axis. Steps for making a histogram by
hand are as follows (also see page 12 of the text):
Step 1. Decide how many classes would be suitable for the size of the data. A
rule of thumb is the $2^k$ rule:
If n = data size (that is, the number of individuals), and
k = number of classes required,
then find the two consecutive powers of 2 that bracket n, $2^{k-1} \le n \le 2^k$,
and use the exponent of whichever power is closer to n.
Example: Data for a quantitative variable has n = 900 values. Use the $2^k$ rule
to determine how many classes should be used to make a histogram.
Solution: Since $512 = 2^9 < 900 < 2^{10} = 1024$, and $2^{10}$ is closer to 900 than $2^9$ is,
choose k = 10 classes.
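The $2^k$ rule can be sketched as a small Python function. The function name is made up for illustration, and the tie-breaking choice (a tie goes to the larger k) is an assumption, since the notes do not say what to do when n is exactly halfway between two powers of 2.

```python
def classes_by_power_of_two(n: int) -> int:
    """Return k such that 2**k is the power of 2 closest to n."""
    if n < 2:
        raise ValueError("need at least 2 observations")
    k = 1
    # Find the smallest k with 2**k >= n, so that 2**(k-1) < n <= 2**k.
    while 2 ** k < n:
        k += 1
    lower, upper = 2 ** (k - 1), 2 ** k
    # Keep k if the upper power is at least as close to n; otherwise step down.
    return k if (upper - n) <= (n - lower) else k - 1


print(classes_by_power_of_two(900))  # 10, as in the worked example
print(classes_by_power_of_two(50))   # 6
```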
Step 2. Use k to determine the class width by dividing the range of the
distribution by k. That is, find c = class width = (Max - Min)/k.
Step 3. Set up the classes (bins). Starting with the minimum value, add
multiples of the class width c to Min to obtain the class boundaries:
Min to Min + c, Min + c to Min + 2c, Min + 2c to Min + 3c, etc.
Step 4. Count the values in each class. The counts are the class frequencies.
Step 5. To make the histogram, mark off positions on the horizontal axis
for the class intervals. Then construct vertical bars on top of each interval
with heights corresponding to the class frequencies.
It is time-consuming to draw a histogram manually, so the use of software is
recommended.
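Steps 2 through 4 can be sketched in Python as follows. The data are the 15 waiting times from Example 5 later in these notes, with k = 4 classes (16 is the power of 2 closest to 15). The function name is illustrative, and placing a value that falls exactly on a boundary into the upper class is an assumption; the notes do not state a convention.

```python
def histogram_counts(values, k):
    """Group values into k equal-width classes; return (boundaries, counts)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k                                  # Step 2: class width c
    boundaries = [lo + i * width for i in range(k + 1)]    # Step 3: Min, Min+c, ...
    counts = [0] * k
    for x in values:                                       # Step 4: count per class
        index = min(int((x - lo) / width), k - 1)          # the maximum joins the last class
        counts[index] += 1
    return boundaries, counts


waiting = [9.4, 3.8, 6.5, 8.2, 5.8, 4.2, 8.0, 7.4, 5.3, 5.8, 6.9, 9.3, 8.6, 11.7, 5.6]
boundaries, counts = histogram_counts(waiting, k=4)

# A text version of Step 5: one row of stars per class.
for left, right, count in zip(boundaries, boundaries[1:], counts):
    print(f"{left:6.3f} to {right:6.3f} | {'*' * count}")
```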

Example 3: Refer to the TBillRates50 data on page 16. Graph the data and
describe the main features.
Solution: We can make a histogram because the variable Rate is
quantitative. We will use 6 classes according to the $2^k$ rule.

We see that although the rate varies from a low of 1% to a high of 13.8%, most of
the values fall approximately between 3% and 7%. The graph is not
symmetric but is more elongated in the direction of larger values
of the rate; it is right-skewed.

Interpreting the graph of quantitative data

After making a graph, always look for an overall pattern. This means
describing the shape, center and spread. Also look for outliers, unusual values
that deviate from the pattern.

Symmetric and Skewed Curves.


A graph is said to be symmetric if it can be divided into two equal parts that
are mirror images of each other. A graph is right-skewed if the top 50% of
the values are more spread out than the bottom 50%; it is left-skewed if the
bottom 50% of the values are more spread out than the top 50%.

[Figure: three example density curves, titled Symmetric Curve, Right-Skewed Curve, and Left-Skewed Curve]

Question: Refer to Example 1.15, page 15. What is the shape of the
histogram for Length of Service Call?
Stemplot
Another handy way of organizing quantitative data is a stemplot. A stemplot
is an arrangement of the values into groups (called stems). Values on the
same stem are arranged in order in a row. A stemplot is useful for visualizing
a small data set when the individual values must be preserved.
To make a stemplot:
1. Break up the digits of every value in the data into a stem consisting of
all but the rightmost digit, and a leaf, the last digit.
2. Write all the stem values in a vertical column, with the smallest at the
top and the largest at the bottom, and draw a vertical line at the right
of the column. The line is used to separate the stems and leaves.
3. Write each leaf to the right of the corresponding stem.
4. Arrange the leaves on each stem in numerical order.
Technical note: If the values do not all have the same number of digits,
consider rounding off or padding values that have fewer digits with
zeros.
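The four steps can be sketched in Python for non-negative whole numbers. This is a simplified illustration, not the Statgraphics display: it uses one row per stem (no split stems) and omits the running totals. The data are the ages from Example 7 later in these notes, and the function name is made up.

```python
from collections import defaultdict


def stemplot(values):
    """Return stemplot rows: stem = all but the last digit, leaf = last digit."""
    leaves = defaultdict(list)
    for v in sorted(values):                  # sorting puts the leaves in order (step 4)
        leaves[v // 10].append(v % 10)        # step 1: split into stem and leaf
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):   # step 2: stems, including empty ones
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem}|{row}")                   # step 3: leaves to the right
    return rows


ages = [80, 55, 90, 73, 75, 80, 85, 92, 93, 98]
print("\n".join(stemplot(ages)))
```

Keeping the empty stem 6 in the display preserves the gap between 55 and the rest of the values, which is what makes the shape readable.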
Example 4. A financial analyst is interested in determining whether meal
costs at city restaurants differ from meal costs at suburban restaurants. She
collected data from a sample of 50 restaurants in each area.
Meal Costs at City Restaurants
61 50 35 37 29 74 26 45 54 34
43 56 32 41 33 32 67 25 51 27
44 57 74 50 77 44 66 43 76 50
50 80 39 53 61 42 68 55 44 60
44 42 65 77 33 36 48 35 57 43

Meal Costs at Suburban Restaurants
28 46 70 47 37 47 29 33 39 39
35 34 54 41 59 51 41 60 44 34
50 52 51 51 71 67 37 56 60 68
36 26 37 49 43 34 27 48 52 34
34 51 34 44 48 31 38 40 39 44


Make a stemplot to describe costs of meals at restaurants in the city.


Solution: Using Statgraphics yielded the following:
Stem-and-Leaf Display for MealCost: unit = 1.0   1|2 represents 12.0
   4   2|5679
   9   3|22334
  14   3|55679
  24   4|1223334444
  (2)  4|58
  24   5|0000134
  17   5|5677
  13   6|011
  10   6|5678
   6   7|44
   4   7|677
   1   8|0

The stemplot is like a histogram of ungrouped data that has been rotated 90
degrees clockwise, the stems representing the horizontal axis and the leaves
representing the vertical axis. Observe the values to the left of the stems.
These are running totals of the number of leaves, counted from the nearer end
of the display. For example, 14 on row 3 is the number of leaves on the stems
with values 2 and 3. The number 2 in parentheses is the number of leaves in the
row that contains the median.
The stemplot indicates that the data are right-skewed since values
above the median are more spread out than values below the median.
Yet another useful graphing method for quantitative data is the box plot, which
we will discuss in the next section.
Describing Distributions with Numbers
In addition to graphs, we use special numbers to describe quantitative data.
These numbers are given the technical name numerical measures. We may
classify them into two groups as follows:
1. Measures of Center: Mean, Median
2. Measures of Spread: Range, Interquartile Range and Standard
Deviation
To understand these measures, we look at the following terminology.

The $\Sigma$ (pronounced Sigma) notation

The notation $\Sigma$ means "the sum of". For example, to find the sum of the n
values $x_1, x_2, x_3, \ldots, x_n$, we write $\sum x_i$, where the subscript i varies from 1 to n.
Similarly, the notation $\sum x_i^2$ means the sum of the squares of the values. So
it is a convenient shorthand for writing formulas. For example, if $\bar{x}$ denotes
the mean of the n values, then
$$\bar{x} = \frac{1}{n} \sum x_i$$

Illustration: The amount of oil in a sample of wells in an oil field is a
determining factor in drilling further for oil. In one field, the estimated
amounts of oil in a random sample of 5 wells are (in thousand barrels)
$x_1 = 53$, $x_2 = 38$, $x_3 = 31$, $x_4 = 48$, and $x_5 = 60$. So the sample mean is
$\bar{x} = \frac{1}{5} \sum x_i = 46$. Another well has just been discovered and it is estimated that
it contains $x_6 = 196$ thousand barrels. Then the mean for all six wells is now
$\bar{x} = \frac{1}{6} \sum x_i = 71$. What sample mean should be reported?

This example illustrates an important fact about the mean. As a measure of
centre or a "typical value" in the data, the mean is sensitive to the effect of
a few extreme values. These may be outliers, but a skewed distribution that
has no outliers will also pull the mean towards its long tail. Reporting that
the mean for the 6 wells is 71 thousand barrels may be factually correct, but
it is not representative of the 6 values because five of the six values are
considerably less than the mean and none of the wells contains that
amount of oil.
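The effect of the sixth well on the mean can be reproduced with a few lines of Python. This is just the formula for the mean applied to the values in the illustration; the variable names are illustrative.

```python
wells = [53, 38, 31, 48, 60]                 # thousand barrels, first five wells
mean_five = sum(wells) / len(wells)          # (1/n) times the sum of the x_i

wells_six = wells + [196]                    # add the newly discovered well
mean_six = sum(wells_six) / len(wells_six)

print(mean_five, mean_six)                   # 46.0 71.0
```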
Median: In the presence of outliers or a skewed distribution, the median
provides a better measure of centre than the mean. The median M is the
midpoint of a distribution, the number such that half of the values are
smaller and half are larger.
To determine M,
1. Put the values in order of size, from the smallest to the largest.
2. If the number of observations n is odd, then M is the value in the
$\frac{n+1}{2}$th position in the ordered list.
3. If n is even, then M is the mean of the two centre values, in the
$\frac{n}{2}$th and $\left(\frac{n}{2} + 1\right)$th positions.
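The three steps for finding M translate directly into a short Python function. This is a sketch for illustration (Python's standard library also provides `statistics.median`); it is applied to the oil-well values from the illustration above.

```python
def median(values):
    """Median by the ordered-position rule."""
    ordered = sorted(values)                  # step 1: order the values
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                   # step 2: position (n+1)/2, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2   # step 3: positions n/2 and n/2 + 1


print(median([53, 38, 31, 48, 60]))           # 48
print(median([53, 38, 31, 48, 60, 196]))      # 50.5
```

Adding the 196 thousand barrel well moves the median only from 48 to 50.5, while the mean jumps from 46 to 71.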

Example 5. The manager at a branch of a financial institution wants to find
out how long customers who use the bank during rush hour (between 1200
and 1300 hours) wait in line for service. The waiting times (in minutes) for
a sample of 15 customers were the following:
9.4   8.2   8.0   5.8   8.6
3.8   5.8   7.4   6.9  11.7
6.5   4.2   5.3   9.3   5.6


a) Make a stem plot of the data and describe the shape of the distribution.
b) Calculate the mean and median.
c) Which measure of centre in part b) better represents the data? Explain.
Solution:
a) Stem-and-Leaf Display for WaitingTime: unit = 0.1   1|2 represents 1.2
   1    3|8
   2    4|2
   6    5|3688
  (2)   6|59
   7    7|4
   6    8|026
   3    9|34
   1   10|
   1   11|7


The stem plot shows the distribution is right-skewed (because of the possible
outlier of 11.7 minutes).
b) Statgraphics reported that $\bar{x} = 7.1$ and M = 6.9.
Details of the computation for M are as follows:
Arrange the waiting times in numerical order:
3.8 4.2 5.3 5.6 5.8 5.8 6.5 6.9 7.4 8.0 8.2 8.6 9.3 9.4 11.7
Here n = 15, which is odd, and $\frac{n+1}{2} = 8$. Thus the median M = 6.9, the value in
the 8th ranked position.
c) The median is slightly better because the data contain a possible outlier.
Comparing the Mean and Median.

We have seen that in the presence of outliers, the median is a better measure
of centre because it is less sensitive to extreme values than the mean is. The
mean tends to be pulled more in the direction of outliers relative to the
median. If the distribution is right skewed, a small proportion of the data
have relatively large values and will make the mean greater than it would be
without them. If a distribution is left skewed, the few but relatively small
values will make the mean smaller than it would be without them. If the
distribution is exactly symmetric, the mean and median are identical.
Exercise: Do exercise 1.45, page 29, and check with the answer at the back of
the text.

Measuring Spread.
A measure of the center alone to describe a distribution can be misleading
since two distributions with the same mean can be different in the way their
values are spread out around their mean. A complete numerical description
should include a measure of centre and a measure of spread. We now
examine measures of spread.
To measure the spread in a distribution, you can just determine the
difference between the largest value and the smallest value, called the range.
However, if the distribution is clustered near the center, this measure of
spread will be too big.
We can improve on the range as a measure of spread by using more
intermediate values. It is helpful if we first order the values. If we put 100
values in numerical order, the value in say the 25th position is referred to as
the 25th percentile, the value in the 75th position as the 75th percentile, and so
on. The median is the 50th percentile. In general, the pth percentile of a
distribution is a value K such that p percent of the values are less than or
equal to K and (100 - p) percent are greater than or equal to K.
Example 6: Calculate the 75th percentile for the waiting times in Example 5.
Explain what it means in context.
Solution. The stemplot arranged the values in numerical order. There are 15
values, so 75% of the way up is position 15*0.75 = 11.25. We interpret this rank to
mean that any value between the 11th and 12th positions will work. Since there
is no such value, we use the mean of the two values instead. The values in
positions 11 and 12 are 8.2 and 8.6 respectively. Their mean is 8.4. So the
75th percentile is 8.4 minutes. This means 75% of the customers wait in line
for up to 8.4 minutes and 25% wait 8.4 minutes or longer.
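The rank calculation in Example 6 can be sketched in Python. The function follows the convention used in these notes: a fractional rank means averaging the two neighbouring positions, and a whole-number rank means taking the value in that position. Returning the smallest value when the rank is below 1 is an assumption added to keep the sketch safe. Note that software packages use several different percentile conventions and may give slightly different answers.

```python
import math


def percentile(values, p):
    """p-th percentile using the rank rule from Example 6."""
    ordered = sorted(values)
    rank = len(ordered) * p / 100             # e.g. 15 * 0.75 = 11.25
    if rank < 1:
        return ordered[0]                     # assumption: below the first position
    lower = math.floor(rank)
    if rank == lower:
        return ordered[lower - 1]             # whole-number rank: that position
    return (ordered[lower - 1] + ordered[lower]) / 2   # mean of the two neighbours


waiting = [9.4, 3.8, 6.5, 8.2, 5.8, 4.2, 8.0, 7.4, 5.3, 5.8, 6.9, 9.3, 8.6, 11.7, 5.6]
print(round(percentile(waiting, 75), 2))
```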
Measuring Spread with Interquartile Range (IQR)
The most commonly used percentiles other than the median are quartiles.
The first quartile, denoted by Q1, is the 25th percentile and the third quartile,
denoted by Q3, is the 75th percentile. The quantity Q3-Q1, called the
interquartile range (IQR), shows the spread of the middle 50% of the data.
When the data are skewed, the IQR is usually a better measure of spread than the
range of the entire data.
Observe that Q1 is the median of the bottom 50% of the data and Q3 is the
median of the top 50%. So to find Q1 and Q3:
1. Arrange the values in numerical order and find the overall median M.
2. Find the median of the values that lie below the position of M in the
ordered list. This is Q1.
3. Find the median of the values that lie above the position of M in the
ordered list. This is Q3.
Example 7: The ages of 10 randomly selected residents in a seniors' home
are the following: 80 55 90 73 75 80 85 92 93 98
Calculate the interquartile range and explain what it means.
Solution: In numerical order the ages are
55 73 75 80 80 85 90 92 93 98
Since there are 10 values, the median M is the mean of the values in the 5th and
6th positions. That is, $M = \frac{80 + 85}{2} = 82.5$. None of the values is M, but of the
five values less than M, the median Q1 is 75, and of the five values bigger
than M, the median Q3 is 92. Thus IQR = 92 - 75 = 17. This means the ages of
the middle 50% of the residents in the sample have a spread of 17 years.
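The quartile steps can be sketched in Python using `statistics.median` from the standard library. The function name is illustrative. When n is odd, this sketch leaves the median itself out of both halves, which is an assumption chosen because it reproduces the quartiles reported in Examples 7 and 9; other textbooks and software use slightly different rules.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, M, Q3) by taking medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                  # values below the median position
    upper = ordered[half + 1:] if n % 2 else ordered[half:]  # values above it
    return median(lower), median(ordered), median(upper)


ages = [80, 55, 90, 73, 75, 80, 85, 92, 93, 98]
q1, m, q3 = quartiles(ages)
print(q1, m, q3, "IQR =", q3 - q1)
```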
The Five-Number Summary and Boxplot.
To get a description of the centre and spread of a distribution, we can report
a five-number summary using Minimum, Q1, Q2 , Q3 and Maximum. A
graphical description of the data using the 5-number summary is the box
plot.
Example 8: Obtain a boxplot for the seniors data:
Solution: The five-number summary is Min=55, Q1=75, Q2=82.5,
Q3=92, and Max= 98. Here is the box plot.

[Figure: Box-and-Whisker Plot of the seniors data; horizontal axis labelled Marks, running from 55 to 105]

The line inside the box represents the median; the crosshair represents the
mean. In this case, the mean and the median are almost identical, which might
suggest that the distribution is roughly symmetric. However, the distribution
is skewed to the left since the bottom 50% of the data are more spread out than
the top 50%.
Example 9: Obtain a box plot of the waiting times in Example 5 and
describe the shape of the distribution.
Solution. The five-number summary is (check this)
Min = 3.8, Q1 = 5.6, Q2 = 6.9, Q3 = 8.6, Max = 11.7

The distribution is right-skewed.


Your Turn
For practice, do exercise 1.47, page 33 and check with the answer at the
back of the text. You may find it convenient to arrange the data in numerical
order using Excel. We will discuss later how to make a boxplot in
Statgraphics.

Checking for Outliers


One application of the IQR is checking for outliers. A common rule of thumb is
that an observation that is further than 1.5*IQR below Q1 or further than
1.5*IQR above Q3 may be an outlier. However, further study of the
data and the properties of the variable must be done before concluding that it
is an outlier.
Example 10: Is the resident aged 55 years in the seniors' home an outlier?
Solution: The five-number summary for the data is 55, 75, 82.5, 92, 98.
So IQR = Q3 - Q1 = 92 - 75 = 17 and 1.5*IQR = 25.5.
Further, Q1 - 55 = 75 - 55 = 20 < 1.5*IQR. That is, the value 55 is not further
than 1.5*IQR units below Q1, so it is not an outlier.
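The 1.5*IQR rule of thumb can be written as a pair of cut-off values (often called fences, a term not used in these notes). The sketch below applies it to the seniors data with Q1 = 75 and Q3 = 92; the function name is illustrative.

```python
def outlier_fences(q1, q3):
    """Return the (low, high) cut-offs of the 1.5*IQR rule of thumb."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


low, high = outlier_fences(75, 92)
print(low, high)                       # 49.5 117.5

ages = [55, 73, 75, 80, 80, 85, 90, 92, 93, 98]
suspects = [a for a in ages if a < low or a > high]
print(suspects)                        # [] : no age falls outside the cut-offs
```

A value flagged this way is only a candidate; as noted above, the data and the variable still need to be studied before calling it an outlier.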
Your Turn
Refer to the stemplot in Example 5. Use IQR to verify if the waiting time
11.7 minutes is an outlier.
Standard Deviation
The most commonly used measure of spread for a non-skewed distribution is
the standard deviation. The standard deviation for a sample of values is
denoted by s and is defined by the formula
$$s = \sqrt{\frac{1}{n-1} \sum (x_i - \bar{x})^2}$$

To get s, add up the squares of the differences between each value and the mean,
divide the result by n - 1, and take the square root. That is, s
measures a kind of average distance of the values from their mean. Use a
calculator or software (Statgraphics, Excel, other) to determine s for large
data sets.
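The recipe for s can be followed step by step in Python. The sketch below uses the ten ages from Example 7 and compares the hand-built result with `statistics.stdev` from the standard library, which also divides by n - 1. The function name is illustrative.

```python
import math
from statistics import stdev


def sample_sd(values):
    """Sample standard deviation: square root of the sum of squared deviations over n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))


ages = [80, 55, 90, 73, 75, 80, 85, 92, 93, 98]
print(round(sample_sd(ages), 4))       # 12.5472
print(round(stdev(ages), 4))           # same value from the standard library
```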

Example 11: Exercise 1.50, page 36.


Solution (a)
Stem-and-Leaf Display for FirstExam: unit = 1.0   1|2 represents 12.0
   1   5|5
   1   6|
   1   6|
   2   7|3
   3   7|5
   5   8|00
   5   8|5
   4   9|023
   1   9|8

The stemplot shows that the distribution is skewed to the left because the
value 55 appears to be an outlier and is cut off from the rest of the data.
(b)
Summary Statistics for FirstExam
Count                10
Average              82.1
Standard deviation   12.5472
Minimum              55.0
Maximum              98.0
Range                43.0

(c) Since the distribution is left skewed, the mean and standard deviation
do not describe the distribution well. In this case the
median-IQR pair is a better choice for measures of center and spread.
Your Turn
Use the stat keys on your calculator to find the mean and standard deviation
of the fuel efficiency readings (in litres/100 km) for 5 cars:
7.3, 7.2, 7.6, 7.4, 7.5
Do the following for practice. #s 1.59, 1.67 & 1.71. (Use Excel to view the
data and do the calculations by hand)

Uses of Standard Deviation


The standard deviation as a measure of spread is a useful tool for comparing
data that are measured on different scales. To do this the data must be
standardized or rendered scale-independent.
Example 12: The final exam marks for a student in Chemistry and Physics
courses are given in the following table.

Exam      Student's Mark ($x$)   Class Mean ($\bar{x}$)   Class Standard Dev. (s)
Chem             85                   75                      5
Physics          80                   72                      2

In which course did the student do better in his or her class?


Solution: To compare the chem and physics marks, they should be put on
the same scale by standardizing, using the formula
Standardized mark $z = \frac{x - \bar{x}}{s}$.
A standardized mark represents the number of standard deviations the mark
is higher or lower than the mean. The standardized marks for chemistry and
physics are respectively $z_c = \frac{85 - 75}{5} = 2$ and $z_p = \frac{80 - 72}{2} = 4$.
That is, put on an equal scale, the physics mark is four standard deviations
higher than its mean whereas the chemistry mark is two standard deviations
higher than its mean. Thus the student is stronger in the physics class
than in the chemistry class.
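The standardization in Example 12 is a one-line calculation. The sketch below uses the marks, class means, and class standard deviations from the table; the function name is illustrative.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


z_chem = z_score(85, 75, 5)
z_phys = z_score(80, 72, 2)
print(z_chem, z_phys)                  # 2.0 4.0
```

Although the raw chemistry mark is higher, the physics mark is further above its class mean once the spread of each class is taken into account.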
Another use of the standard deviation is in predicting outcomes from bell
shaped distributions, as is explained below.
Density Curves and Normal Distributions
A density curve is an idealized mathematical model that describes the
overall pattern of a quantitative distribution (see Figure 1.17, page 41). A
density curve has the following properties:
It is always on or above the horizontal axis
The total area underneath it is 1.
Thus we interpret the area underneath a density curve corresponding to a
specified range of values of the distribution as the proportion of the
distribution that falls in that range. For example, the median of a density
curve is the equal-area division point. If the density curve is symmetric, the
median and mean are identical.
Do exercises 1.79 & 1.81 on pages 44-45 and check with the answers at the
back of the text.

In exercise 1.80, note that for the total area under the rectangle to be 1, the
height of the rectangle must also be 1. In part (b), "lie above" and "are
greater than" have the same meaning. In part (c), "lie below" and "are less
than" also have the same meaning. Complete the rest of exercise 1.80.
Normal Density Curves
Normal density curves are a family of curves having the characteristics of
being symmetric, single-peaked and bell-shaped. All normal density curves
have the same overall shape but different centres and spreads. The exact
distribution of a particular normal density curve may be determined
completely by specifying the mean $\mu$, which is located at the center, and the
standard deviation $\sigma$. Being a density curve, the total area under a normal
curve is 1.
Normal density curves are important in statistics for many reasons, three of
which are the following. First, normal density curves are good descriptions
of many real-life data, such as the distribution of marks on a test taken by a
large number of people, or the actual weight of cereal in an 800 gram box.
Second, a normal density curve is a good description of many chance
outcomes, such as the distribution of the proportion of heads in tossing a coin
many times, or the heights of individuals belonging to a given age group.
Third, as we will see later, normal density curves play a pivotal role in
drawing conclusions about a population based on a sample.
The 68-95-99.7 rule (see page 46)
All normal density curves obey the 68-95-99.7 rule. The rule states that if
a density curve is Normal with mean $\mu$ and standard deviation $\sigma$, then
68% of the observations fall within $\sigma$ of $\mu$; that is, in the interval
$(\mu - \sigma, \mu + \sigma)$
95% of the observations fall within $2\sigma$ of $\mu$, or in the interval
$(\mu - 2\sigma, \mu + 2\sigma)$
99.7% of the observations fall within $3\sigma$ of $\mu$, or in the interval
$(\mu - 3\sigma, \mu + 3\sigma)$

(Study Figure 1.23 on page 47)

Example 13. The actual content of an 800 gram box of cereal varies from
box to box and can be modeled by a Normal density curve with mean
$\mu = 805$ grams and standard deviation $\sigma = 4$ grams. The 68-95-99.7 rule says
that the middle 68% of boxes contain between 801 grams and 809 grams of
cereal; the middle 95% of the boxes contain between 797 grams and 813
grams of cereal; and the middle 99.7% of the boxes contain between 793
grams and 817 grams of cereal.

Example 14. Continue from Example 13. What percent of boxes
(a) contain between 801 and 813 grams of cereal?
(b) contain between 809 and 813 grams of cereal?
Solution: Draw a diagram to help visualize each problem.
(a) 801 grams is one $\sigma$ below the mean and 813 grams is two $\sigma$ above
the mean. Thus according to the 68-95-99.7 rule, the proportion of
boxes is $\frac{1}{2}(68\% + 95\%) = 81.5\%$.
(b) 813 grams is two $\sigma$ above the mean and 809 grams is one $\sigma$ above the
mean. Thus according to the 68-95-99.7 rule, the proportion of boxes
in the interval is $\frac{1}{2}(95\% - 68\%) = 13.5\%$.
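The intervals in Example 13 and the answers in Example 14 can be reproduced in Python. The comparison with `statistics.NormalDist` (standard library, Python 3.8 or later) is an addition for illustration, not part of the text's method; it shows that the 68-95-99.7 rule is a rounded approximation to the exact Normal areas.

```python
from statistics import NormalDist

mu, sigma = 805, 4                     # Example 13: cereal box contents in grams

# Intervals given by the 68-95-99.7 rule.
for k, pct in [(1, 68), (2, 95), (3, 99.7)]:
    print(f"middle {pct}%: {mu - k * sigma} to {mu + k * sigma} grams")

rule_a = (68 + 95) / 2                 # Example 14(a): between 801 and 813 grams
rule_b = (95 - 68) / 2                 # Example 14(b): between 809 and 813 grams

# Exact areas under the Normal curve for comparison.
boxes = NormalDist(mu, sigma)
exact_a = 100 * (boxes.cdf(813) - boxes.cdf(801))
exact_b = 100 * (boxes.cdf(813) - boxes.cdf(809))

print(rule_a, round(exact_a, 2))
print(rule_b, round(exact_b, 2))
```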


Your Turn
Do for practice, exercise 1.83, page 48.
The Standard Normal Density Curve
A normal density curve is also known as a normal distribution. A normal
distribution with mean $\mu$ and standard deviation $\sigma$ is abbreviated as $N(\mu, \sigma)$.
The standard normal distribution is the normal distribution N(0,1), with
mean 0 and standard deviation 1.
If a variable x has a $N(\mu, \sigma)$ distribution, then the variable
$$z = \frac{x - \mu}{\sigma}$$
has the standard normal distribution. The process of converting x to z units
using the above formula is called standardizing the variable x. A
standardized value is also called a z-score.

Example: Suppose the variable x has a N(10,2) distribution. What is the z-score for the value x = 15?
Solution. Here $\mu = 10$ and $\sigma = 2$, so if x = 15, then $z = \frac{15 - 10}{2} = 2.5$.
The value 15 is 2.5 standard deviations higher than the mean.


Do for practice exercise 1.86 on page 49 and check with the answer below.
Solution:
Eleanor's standardized score on the SAT is $z = \frac{680 - 500}{100} = 1.8$.
Gerald's standardized score on the ACT is $z = \frac{27 - 18}{6} = 1.5$.
Eleanor had the higher score because her standardized score is
higher and both tests measure the same kind of ability.

Finding and Using areas under Normal density curves


Areas under a Normal curve represent proportions of observations from that
Normal distribution. To find areas under any Normal distribution, find the
corresponding area under the standard normal distribution by standardizing
using the formula above. Areas under the standard normal distribution
require the use of special tables or computer software; for example, see Table
A, pages T-2 & T-3 of the text.
Example: Exercise 1.87, page 53.
Solution: The scores on the test have a N(572, 51) distribution.
If a score x = 600, then $z = \frac{600 - 572}{51} = 0.55$.
From the standard normal table on page T-2, the area under the z curve to the
left of z = 0.55 is 0.7088; so the area above z = 0.55 is 1 - 0.7088 = 0.2912.
That is, about 71% of the students scored lower than or
equal to 600 on the ISTEP and about 29% scored higher than 600.
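Where software is available, the table lookup can be replaced by a direct calculation. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) as one possible tool; it is not the software used in these notes. The small difference from the table answer comes from rounding z to 0.55 before looking it up.

```python
from statistics import NormalDist

scores = NormalDist(mu=572, sigma=51)          # ISTEP scores, N(572, 51)

z = (600 - 572) / 51                           # standardize x = 600
below = scores.cdf(600)                        # area to the left of 600
above = 1 - below                              # area to the right of 600
table_value = NormalDist().cdf(0.55)           # what Table A gives for z = 0.55

print(round(z, 2), round(below, 4), round(above, 4), round(table_value, 4))
```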

Inverse Normal calculations


These are used to answer a question such as "what is the minimum score for
the top 25% of the ISTEP scores?" In this case we know the percent
(proportion) of students and we want to find the interval of scores for that
percent (or proportion). Statistical software does these calculations directly,
but without it we can use Table A backward to find an approximate
interval of values of x corresponding to the given area.

Example: Find the first quartile for the ISTEP scores.


Solution: If x is the first quartile, then 25% of the scores are less than x and
75% are higher. Assuming as in Exercise 1.87 that x has a N(572,51)
distribution, the z-score for x can be found using the formula
$$z = \frac{x - 572}{51}$$
From Table A, the z-score for the first quartile is approximately -0.67.
Thus we solve for x in $-0.67 = \frac{x - 572}{51}$ and get x = 537.83.
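The backward table lookup corresponds to an inverse calculation in software. The sketch below uses `NormalDist.inv_cdf` from the Python standard library (Python 3.8 or later) as one possible tool and compares it with the table-based answer; the variable names are illustrative.

```python
from statistics import NormalDist

scores = NormalDist(mu=572, sigma=51)          # ISTEP scores, N(572, 51)

z_q1 = NormalDist().inv_cdf(0.25)              # z-score with 25% of the area to its left
q1_exact = scores.inv_cdf(0.25)                # first quartile computed directly
q1_table = 572 + (-0.67) * 51                  # solving -0.67 = (x - 572)/51 for x

print(round(z_q1, 4), round(q1_exact, 2), round(q1_table, 2))
```

The two answers differ by about a quarter of a point because Table A gives the z-score only to two decimal places.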

Your Turn:
1. Find the minimum score for the top 10% of the ISTEP scores.
2. Practice Exercises: Exercise Set 1.3, page 60:
a) # 1.111

Class group project


#1.105
For this class project, I would like you to work with your colleagues in
groups of two or three. If this is your first time meeting your
colleague(s), please take the opportunity to introduce yourselves to each
other. Now work collaboratively to do the question in the text.
