Box-and-Whisker Plots:
Quartiles, Boxes, and Whiskers (page 1 of 3)
Sections: Quartiles, boxes, and whiskers, Five-number summary, Interquartile ranges and outliers

Statistics assumes that your data points (the numbers in your list) are clustered around some central
value. The "box" in the box-and-whisker plot contains, and thereby highlights, the middle half of these
data points.

To create a box-and-whisker plot, you start by ordering your data (putting the values in numerical
order), if they aren't ordered already. Then you find the median of your data. The median divides the
data into two halves. To divide the data into quarters, you then find the medians of these two halves.
Note: If you have an even number of values, so the first median was the average of the two middle
values, then you include the middle values in your sub-median computations. If you have an odd
number of values, so the first median was an actual data point, then you do not include that value in
your sub-median computations. That is, to find the sub-medians, you're only looking at the values that
haven't yet been used.

You have three points: the first middle point (the median), and the middle points of the two halves
(what I call the "sub-medians"). These three points divide the entire data set into quarters, called
"quartiles". The top point of each quartile has a name, being a "Q" followed by the number of the
quarter. So the top point of the first quarter of the data points is "Q1", and so forth. Note that Q1 is
also the middle number for the first half of the list, Q2 is also the middle number for the whole list, Q3
is the middle number for the second half of the list, and Q4 is the largest value in the list.

Once you have these three points, Q1, Q2, and Q3, you have all you need in order to draw a simple
box-and-whisker plot. Here's an example of how it works.

• Draw a box-and-whisker plot for the following data set:

4.3, 5.1, 3.9, 4.5, 4.4, 4.9, 5.0, 4.7, 4.1, 4.6, 4.4, 4.3, 4.8, 4.4, 4.2, 4.5, 4.4

My first step is to order the set. This gives me:

3.9, 4.1, 4.2, 4.3, 4.3, 4.4, 4.4, 4.4, 4.4, 4.5, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 5.1

The first number I need is the median of the entire set. Since there are seventeen values in
this list, I need the ninth value:

3.9, 4.1, 4.2, 4.3, 4.3, 4.4, 4.4, 4.4, 4.4, 4.5, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 5.1

The median is Q2 = 4.4.

The next two numbers I need are the medians of the two halves. Since I used the "4.4" in the
middle of the list, I can't re-use it, so my two remaining data sets are:

mhtml:file://G:\MyDocs\Personal\Maths\Int2M\Int2notes\Statistics\Box-and-Whisker... 25/02/2009

3.9, 4.1, 4.2, 4.3, 4.3, 4.4, 4.4, 4.4 and 4.5, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 5.1

The first half has eight values, so the median is the average of the middle two:

Q1 = (4.3 + 4.3)/2 = 4.3

The median of the second half is:

Q3 = (4.7 + 4.8)/2 = 4.75
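
The median-of-halves steps above can be sketched in a few lines of Python. This is only an illustration of the method used in this lesson (the overall median is left out of both halves when the number of values is odd); the function name `quartiles` is made up for this sketch, and other books and software may use different rules.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    lower = data[:n // 2]          # values below the middle position
    upper = data[(n + 1) // 2:]    # skips the middle value when n is odd
    return median(lower), median(data), median(upper)

data = [4.3, 5.1, 3.9, 4.5, 4.4, 4.9, 5.0, 4.7, 4.1,
        4.6, 4.4, 4.3, 4.8, 4.4, 4.2, 4.5, 4.4]
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)  # 4.3 4.4 4.75
```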

Since my list values have one decimal place and range from 3.9 to 5.1, I won't use a scale of, say,
zero to ten, marked off by ones. Instead, I'll draw a number line from 3.5 to 5.5, and mark off by tenths.

Now I'll mark off the minimum and maximum values, and Q1, Q2, and Q3:

The "box" part of the plot goes from Q1 to Q3:

And then the "whiskers" are drawn to the endpoints:

By the way, box-and-whisker plots don't have to be drawn horizontally as I did above; they can be
vertical, too.

Original URL: http://www.purplemath.com/modules/boxwhisk.htm

Copyright 2006 Elizabeth Stapel; All Rights Reserved.


Terms of Use: http://www.purplemath.com/terms.htm


Box-and-Whisker Plots:
Five-Number Summary (page 2 of 3)

More terminology: The top end of your box may also be called the "upper hinge"; the lower end may
also be called the "lower hinge". The lower hinge is also called "the 25th percentile"; the median is
"the 50th percentile"; the upper hinge is "the 75th percentile". This means that 25%, 50% and 75%
of the data, respectively, is at or below that point. The distance between the hinges may be referred
to as the "H-spread" or, as you will see on the following page, the "Interquartile Range", abbreviated
"IQR". ("Hinge" actually has a different technical definition, but the term is sometimes used
informally.)

Also, some books and software will include the overall median (Q2) when computing Q1 and Q3 for
data sets with an odd number of elements. The Texas Instruments calculators do not include Q2 in
this case, so you may encounter a book answer that doesn't match the calculator answer. And
different software packages use all different sorts of formulas. Be careful to use the formula from your
book when doing your homework!

Additionally, the box-and-whisker plot may include a cross or an "X" marking the mean value of the
data, in addition to the line inside the box that marks the median. The difference between the "X" and
the median line can then be used as a measure of "skew".

Please don't ask me to explain "skew".

• Draw the box-and-whisker plot for the following data set:


77, 79, 80, 86, 87, 87, 94, 99

My first step is to find the median. Since there are eight data points, the median will be the
average of the two middle values: (86 + 87) ÷ 2 = 86.5 = Q2

This splits the list into two halves: 77, 79, 80, 86 and 87, 87, 94, 99. Since the halves of
the data set each contain an even number of values, the sub-medians will be the average of
the middle two values.

Q1 = (79 + 80) ÷ 2 = 79.5


Q3 = (87 + 94) ÷ 2 = 90.5

The minimum value is 77 and the maximum value is 99, so I have:

min: 77, Q1: 79.5, Q2: 86.5, Q3: 90.5, max: 99


Then my plot looks like this:

As you can see, you only need the five values listed above (min, Q1, Q2, Q3, and max) in order to
draw your box-and-whisker plot. This set of five values has been given the name "the five-number
summary".

• Give the five-number summary of the following data set:


79, 53, 82, 91, 87, 98, 80, 93

The five-number summary consists of the numbers I need for the box-and-whisker plot: the
minimum value, Q1 (the bottom of the box), Q2 (the median of the set), Q3 (the top of the
box), and the maximum value (which is also Q4). So I need to order the set, find the median
and the sub-medians, and then list the required values in order.

ordering the list: 53, 79, 80, 82, 87, 91, 93, 98, so the minimum is 53 and the
maximum is 98

finding the median: (82 + 87) ÷ 2 = 84.5 = Q2

lower half of the list: 53, 79, 80, 82, so Q1 = (79 + 80) ÷ 2 = 79.5

upper half of the list: 87, 91, 93, 98, so Q3 = (91 + 93) ÷ 2 = 92

five-number summary: 53, 79.5, 84.5, 92, 98
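
As a check on the arithmetic, the five-number summary can be computed with a short Python sketch. The function name `five_number_summary` is an assumption made for this illustration, and the quartile rule is the median-of-halves rule from this lesson, not necessarily the rule your calculator or software uses.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, Q2, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    lower = data[:n // 2]          # lower half of the ordered list
    upper = data[(n + 1) // 2:]    # upper half (middle value dropped if n is odd)
    return data[0], median(lower), median(data), median(upper), data[-1]

summary = five_number_summary([79, 53, 82, 91, 87, 98, 80, 93])
print(summary)  # (53, 79.5, 84.5, 92.0, 98)
```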

Part of the point of a box-and-whisker plot is to show how spread out your values are. But what if one
or another of your values is way out of line? For this, we need to consider "outliers"....

Original URL: http://www.purplemath.com/modules/boxwhisk2.htm

Copyright 2006 Elizabeth Stapel; All Rights Reserved.


Terms of Use: http://www.purplemath.com/terms.htm


Box-and-Whisker Plots:
Interquartile Ranges and Outliers (page 3 of 3)

The "interquartile range", abbreviated "IQR", is just the width of the box in the box-and-whisker plot.
That is, IQR = Q3 – Q1. The IQR can be used as a measure of how spread-out the values are.
Statistics assumes that your values are clustered around some central value. The IQR tells how
spread out the "middle" values are; it can also be used to tell when some of the other values are "too
far" from the central value. These "too far away" points are called "outliers", because they "lie outside"
the range in which we expect them.

The IQR is the length of the box in your box-and-whisker plot. An outlier is any value that lies more
than one and a half times the length of the box from either end of the box. That is, if a data point is
below Q1 – 1.5×IQR or above Q3 + 1.5×IQR, it is viewed as being too far from the central values
to be reasonable. Maybe you bumped the weigh-scale when you were making that one
measurement, or maybe your lab partner is an idiot and you should never have let him touch any of
the equipment. Who knows? But whatever their cause, the outliers are those points that don't seem to
"fit".

(Why one and a half times the width of the box? Why does that particular value demarcate the difference
between "acceptable" and "unacceptable" values? Because, when John Tukey was inventing the
box-and-whisker plot in 1977 to display these values, he picked 1.5×IQR as the demarcation line for
outliers. This has worked well, so we've continued using that value ever since.)

• Find the outliers, if any, for the following data set:

10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4

To find out if there are any outliers, I first have to find the IQR. There are fifteen data points,
so the median will be at position (15 + 1) ÷ 2 = 8. Then Q2 = 14.6. There are seven data
points on either side of the median, so Q1 is the fourth value in the list and Q3 is the twelfth:
Q1 = 14.4 and Q3 = 14.9. Then IQR = 14.9 – 14.4 = 0.5.

Outliers will be any points below Q1 – 1.5×IQR = 14.4 – 0.75 = 13.65 or above Q3 +
1.5×IQR = 14.9 + 0.75 = 15.65.

Then the outliers are at 10.2, 15.9, and 16.4.

The values for Q1 – 1.5×IQR and Q3 + 1.5×IQR are the "fences" that mark off the "reasonable"
values from the outlier values. Outliers lie outside the fences.
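
The fence test can also be written as a short Python sketch. The function name `find_outliers` and the parameter `k` are made up for this illustration; the quartiles are found with the median-of-halves rule used in this lesson, so other software may give slightly different fences.

```python
from statistics import median

def find_outliers(values, k=1.5):
    """Return the values lying outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    data = sorted(values)
    n = len(data)
    q1 = median(data[:n // 2])          # median of the lower half
    q3 = median(data[(n + 1) // 2:])    # median of the upper half
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [x for x in data if x < lower_fence or x > upper_fence]

data = [10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6,
        14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4]
found = find_outliers(data)
print(found)  # [10.2, 15.9, 16.4]
```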


If your assignment is having you consider outliers and "extreme values", then the values for Q1 –
1.5×IQR and Q3 + 1.5×IQR are the "inner" fences and the values for Q1 – 3×IQR and
Q3 + 3×IQR are the "outer" fences. The outliers (marked with asterisks or open dots) are between
the inner and outer fences, and the extreme values (marked with whichever symbol you didn't use for
the outliers) are outside the outer fences.

By the way, your book may refer to the value of "1.5×IQR" as being a "step". Then the outliers will be
the numbers that are between one and two steps from the hinges, and extreme values will be the
numbers that are more than two steps from the hinges.

Looking again at the previous example, the outer fences would be at 14.4 – 3×0.5 = 12.9 and 14.9
+ 3×0.5 = 16.4. Since 16.4 is right on the upper outer fence, this would be considered to be only an
outlier, not an extreme value. But 10.2 is fully below the lower outer fence, so 10.2 would be an
extreme value.
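
Here is a sketch of the same classification in Python. Because 16.4 sits exactly on the upper outer fence, the sketch uses `Decimal` values (passed in as strings) so that the comparison is exact; ordinary floating-point numbers could land a hair above or below the fence. The function name `classify_points` is an assumption made for this illustration.

```python
from decimal import Decimal
from statistics import median

def classify_points(values):
    """Split values into (outliers, extreme values) using inner and outer fences."""
    data = sorted(Decimal(v) for v in values)   # exact decimal arithmetic
    n = len(data)
    q1 = median(data[:n // 2])          # lower half, middle value left out if n is odd
    q3 = median(data[(n + 1) // 2:])    # upper half
    step = Decimal("1.5") * (q3 - q1)   # one "step" is 1.5 x IQR
    inner_low, inner_high = q1 - step, q3 + step
    outer_low, outer_high = q1 - 2 * step, q3 + 2 * step
    extremes = [x for x in data if x < outer_low or x > outer_high]
    outliers = [x for x in data
                if (x < inner_low or x > inner_high) and x not in extremes]
    return outliers, extremes

values = ("10.2 14.1 14.4 14.4 14.4 14.5 14.5 14.6 "
          "14.7 14.7 14.7 14.9 15.1 15.9 16.4").split()
outliers, extremes = classify_points(values)
print([str(x) for x in outliers])  # ['15.9', '16.4']
print([str(x) for x in extremes])  # ['10.2']
```

A value that lands exactly on a fence is not counted as being outside it, which matches the treatment of 16.4 above.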

Your graphing calculator may or may not indicate whether a box-and-whisker plot includes outliers. For
instance, the above problem includes the points 10.2, 15.9, and 16.4 as outliers. One setting on my
graphing calculator gives the simple box-and-whisker plot which uses only the five-number summary, so
the furthest outliers are shown as being the endpoints of the whiskers:

A different calculator setting gives the box-and-whisker plot with the outliers specially marked (in this
case, with a simulation of an open dot), and the whiskers going only as far as the highest and lowest
values that aren't outliers:

Note that my calculator makes no distinction between outliers and extreme values.

If you're using your graphing calculator to help with these plots, make sure you know which setting
you're supposed to be using and what the results mean, or the calculator may give you a perfectly
correct but "wrong" answer.

• Find the outliers and extreme values, if any, for the following data set, and draw the
box-and-whisker plot. Mark any outliers with an asterisk and any extreme values with an
open dot.

21, 23, 24, 25, 29, 33, 49

To find the outliers and extreme values, I first have to find the IQR. Since there are seven
values in the list, the median is the fourth value, so Q2 = 25. The first half of the list is 21, 23,
24, so Q1 = 23; the second half is 29, 33, 49, so Q3 = 33. Then IQR = 33 – 23 = 10.

The outliers will be any values below 23 – 1.5×10 = 23 – 15 = 8 or above 33 + 1.5×10 = 33 + 15 = 48.
The extreme values will be those below 23 – 3×10 = 23 – 30 = –7 or above 33 + 3×10 = 33 + 30 = 63.

So I have an outlier at 49 but no extreme values. I won't have a top whisker because Q3 is
also the highest non-outlier, and my plot looks like this:

It should be noted that the methods, terms, and rules outlined above are what I have taught and what
I have most commonly seen taught. However, your course may have different specific rules, or your
calculator may do computations slightly differently. You may need to be somewhat flexible in finding
the answers specific to your curriculum.

Original URL: http://www.purplemath.com/modules/boxwhisk2.htm

Copyright 2006 Elizabeth Stapel; All Rights Reserved.


Terms of Use: http://www.purplemath.com/terms.htm


Cumulative Frequency Graphs


Cumulative Frequency
This is the running total of the frequencies. On a graph, it can be represented
by a cumulative frequency polygon, where straight lines join up the points, or
a cumulative frequency curve.

Example

Frequency:    Cumulative Frequency:
4             4
6             10 (4 + 6)
3             13 (4 + 6 + 3)
2             15 (4 + 6 + 3 + 2)
6             21 (4 + 6 + 3 + 2 + 6)
4             25 (4 + 6 + 3 + 2 + 6 + 4)
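
The running total in the table can be reproduced with a minimal Python sketch using the standard library's `itertools.accumulate`; the variable names are chosen just for this illustration.

```python
from itertools import accumulate

frequencies = [4, 6, 3, 2, 6, 4]

# Each entry is the sum of all the frequencies up to and including that row.
cumulative = list(accumulate(frequencies))
print(cumulative)  # [4, 10, 13, 15, 21, 25]
```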

The Median Value

The median of a group of numbers is the number in the middle, when the
numbers are in order of magnitude. For example, if the set of numbers is 4, 1,
6, 2, 6, 7, 8, the median is 6:
1, 2, 4, 6, 6, 7, 8 (6 is the middle value when the numbers are in order)
If you have n numbers in a group, the median is the (n + 1)/2 th value. For
example, there are 7 numbers in the example above, so replace n by 7 and the
median is the (7 + 1)/2 th value = 4th value. The 4th value is 6.

When dealing with a cumulative frequency curve, "n" is the total cumulative
frequency (25 in the above example). Therefore the median would be the 13th
value. To find this, on the cumulative frequency curve, find 13 on the y-axis
(which should be labelled cumulative frequency). The corresponding 'x' value is
an estimate of the median.

Quartiles

If we divide a cumulative frequency curve into quarters, the value at the lower
quarter is referred to as the lower quartile, the value at the middle gives the
median and the value at the upper quarter is the upper quartile.
A set of numbers may be as follows: 8, 14, 15, 16, 17, 18, 19, 50. The mean
of these numbers is 19.625 . However, the extremes in this set (8 and 50)
distort this value. The interquartile range is a method of measuring the spread
of the middle 50% of the values and is useful since it ignores the extreme
values.

http://www.mathsrevision.net/gcse/pages.php?page=21 23/02/2009

The lower quartile is the (n+1)/4 th value (n is the cumulative frequency, i.e. 157
in this case) and the upper quartile is the 3(n+1)/4 th value. The difference
between these two is the interquartile range (IQR).
In the above example, the upper quartile is the 118.5th value and the lower
quartile is the 39.5th value. If we draw a cumulative frequency curve, we see
that the lower quartile, therefore, is about 17 and the upper quartile is about
37. Therefore the IQR is 20 (bear in mind that this is a rough sketch- if you
plot the values on graph paper you will get a more accurate value).
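
The positions to read off the cumulative frequency axis follow directly from the formulas above. A small Python sketch (the function name `quartile_positions` is made up for this illustration) gives the positions only; the quartile values themselves still have to be read from the curve.

```python
def quartile_positions(n):
    """Return the positions of the lower quartile, median and upper quartile."""
    lower = (n + 1) / 4
    middle = (n + 1) / 2
    upper = 3 * (n + 1) / 4
    return lower, middle, upper

print(quartile_positions(157))  # (39.5, 79.0, 118.5)
```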


Linear Regression
Scatter Diagrams

We often wish to look at the relationship between two things (e.g. between a person's height and weight) by comparing
data for each of these things. A good way of doing this is by drawing a scatter diagram.

"Regression" is the process of finding the function satisfied by the points on the scatter diagram. Of course, the points
might not fit the function exactly but the aim is to get as close as possible. "Linear" means that the function we are
looking for is a straight line (so our function f will be of the form f(x) = mx + c for constants m and c).

Here is a scatter diagram with a regression line drawn in:

Correlation

Correlation is a term used to describe how strong the relationship between the two variables appears to be.

We say that there is a positive linear correlation if y increases as x increases and we say there is a negative linear
correlation if y decreases as x increases. There is no correlation if x and y do not appear to be related.

Explanatory and Response Variables

In many experiments, one of the variables is fixed or controlled and the point of the experiment is to determine how the
other variable varies with the first. The fixed/controlled variable is known as the explanatory or independent variable
and the other variable is known as the response or dependent variable.

I shall use "x" for my explanatory variable and "y" for my response variable, but I could have used any letters.

Regression Lines

By Eye

If there is very little scatter (we say there is a strong correlation between the variables), a regression line can be drawn
"by eye". You should make sure that your line passes through the mean point (the point (x̄, ȳ), where x̄ is the mean of the
data collected for the explanatory variable and ȳ is the mean of the data collected for the response variable).

Two Regression Lines

When there is a reasonable amount of scatter, we can draw two different regression lines depending upon which
variable we consider to be the most accurate. The first is a line of regression of y on x, which can be used to estimate y
given x. The other is a line of regression of x on y, used to estimate x given y.

If there is a perfect correlation between the data (in other words, if all the points lie on a straight line), then the two
regression lines will be the same.

http://www.mathsrevision.net/alevel/pages.php?page=61 23/02/2009

Least Squares Regression Lines

This is a method of finding a regression line without estimating where the line should go by eye.

If the equation of the regression line is y = ax + b, we need to find what a and b are. We find these by solving the
"normal equations".

Normal Equations

The "normal equations" for the line of regression of y on x are:

Σy = aΣx + nb and

Σxy = aΣx² + bΣx

The values of a and b are found by solving these equations simultaneously.

For the line of regression of x on y, the "normal equations" are the same but with x and y swapped.
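
Solving the two normal equations simultaneously gives a = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²) and b = (Σy - aΣx) / n. The Python sketch below applies these to a small made-up data set whose points lie exactly on y = 2x + 1; the function name and the data are assumptions for illustration only.

```python
def regression_y_on_x(xs, ys):
    """Solve the normal equations for the line y = a*x + b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    denominator = n * sum_xx - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are equal, so no unique line exists")
    a = (n * sum_xy - sum_x * sum_y) / denominator   # gradient
    b = (sum_y - a * sum_x) / n                      # intercept
    return a, b

# Hypothetical data lying exactly on y = 2x + 1
a, b = regression_y_on_x([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 2.0 1.0
```

Swapping the two lists in the call gives the line of regression of x on y.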

MathsII-Statistics
Standard-deviation
Comparing distributions

When comparing distributions, it is better to use a measure of spread or dispersion (such
as standard deviation or semi-interquartile range) in addition to a measure of central
tendency (such as mean, median or mode).

For example, the following two data sets are significantly different in nature and yet have
the same mean, median and range. Some sort of numerical measure which distinguishes
between them would be useful.

• 1, 7, 12, 15, 20, 22, 28


• 1, 15, 15, 15, 15, 16, 28

The standard deviation of the first set of data is significantly larger than the standard
deviation of the second set of data (ie there is more spread about the mean in the first set
of data).
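
This comparison can be checked with Python's standard `statistics` module. The sketch below is only an illustration; it uses `stdev`, which is the sample standard deviation (the "n - 1" version).

```python
from statistics import mean, median, stdev

first = [1, 7, 12, 15, 20, 22, 28]
second = [1, 15, 15, 15, 15, 16, 28]

for name, data in (("first", first), ("second", second)):
    spread = max(data) - min(data)          # the range
    print(name, mean(data), median(data), spread, round(stdev(data), 2))
# first 15 15 27 9.24
# second 15 15 27 7.81
```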

The formulae

There are two formulae for standard deviation given in the formulae list in the Credit
Level examination paper. The first of the two formulae is often referred to as the defining
formula and shows more clearly that the standard deviation of a set of numbers is the
square root of the average of the squares of differences between each of the numbers
and the mean of the numbers.

The second formula is a re-arrangement which may make it better for calculation
purposes.

You may use either of the formulae; they'll give you the same answer.

More about formulae.

Comparing these formulae with standard deviation formulae in books or in your
calculator, you may notice that sometimes the "n - 1" in the denominator is replaced by n.

When you're finding the standard deviation of a set of measures, which are only a sample
of the total set of measures, then it's correct to use "n - 1". All examples in the exams will
be of this type. When statisticians know they're working with the whole set or the
population then they use "n" instead of "n - 1".

Remember

Σ means "sum of"

x̄ is the "mean"
Example

Question 1

Find the mean and standard deviation of the following numbers.

4, 7, 9, 11, 13, 15, 18

The Answer

Here are two ways of calculating the standard deviation, using formulae.

(i)

x     x - x̄    (x - x̄)²
4     -7       49
7     -4       16
9     -2       4
11    0        0
13    2        4
15    4        16
18    7        49
Σ(x - x̄)² = 138

s = √(138 ÷ 6) = 4.796, correct to 3 decimal places

If you're having a problem with this table, this is how it works.

• The first column lists the numbers.


• The second column finds the difference between each of the numbers and the
mean.
• The third column squares these differences. This makes all of them positive
numbers.

The next step is to find the average of these squared differences. In this case, add them
up and divide by six (one less than the number of numbers).

The final step is to take the square root. This undoes the squaring we did earlier.

(ii)

x     x²
4     16
7     49
9     81
11    121
13    169
15    225
18    324
Σx² = 985

So the standard deviation "s" = 4.796, using either of the formulae.
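
Both routes can be checked with a short Python sketch. The formulae themselves did not survive in this copy of the notes, so the rearranged version below assumes the usual form s = √((Σx² - (Σx)²/n) / (n - 1)); treat it as an illustration, not as the exact formula sheet wording.

```python
from math import sqrt

data = [4, 7, 9, 11, 13, 15, 18]
n = len(data)
mean = sum(data) / n                                   # 77 / 7 = 11

# (i) defining formula: squared differences from the mean, divided by n - 1
s_defining = sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

# (ii) rearranged formula (assumed form): uses only the sums of x and x squared
sum_x = sum(data)                                      # 77
sum_x2 = sum(x * x for x in data)                      # 985
s_rearranged = sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

print(round(s_defining, 3), round(s_rearranged, 3))    # 4.796 4.796
```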

Now try a Test Bite


Statistics – Standard Deviation
Test bite

Standard Deviation
1. Here are two sets of data:

• First set of data: 30, 35, 45, 50, 55, 65, 70


• Second set of data: 30, 50, 50, 50, 50, 50, 70

State whether the statements below are true or false.


(i). The mean, median and range of the two sets of data is the
same. True False
(ii). The standard deviation of the first set of data is larger than
the standard deviation of the second set of data. True False

2. Here are two sets of data:

• First set of data: 47, 48, 49, 50, 51, 52, 53


• Second set of data: 1, 10, 20, 50, 80, 90, 99

State whether the statements below are true or false.


(i). The mean, median and range of the two sets of data is the
same. True False
(ii). The standard deviation of the first set of data is larger than
the standard deviation of the second set of data. True False

3. Here is a set of 10 numbers.

2, 7, 5, 5, 3, 9, 10, 8, 12, 11

Choose the correct figures from the lists below.

(i). The mean of this set of numbers is:-

a. 72

b. 7.2

c. 7

d. 7.5

(ii). The standard deviation of the set of numbers is:-

a. 3.39

b. 3

c. 3.22

d. 0.2

4. Here is a set of 5 numbers.

100, 121, 123, 145, 152

Choose the correct figures from the lists below.

(i). The mean of this set of numbers is:-

a. 160.2

b. 130

c. 123

d. 128.2

(ii). The standard deviation of the set of numbers is:-

a. 0.75

b. 20.75

c. 2.75

d. 20

Mean, Median, Mode, and Range


Mean, median, and mode are three kinds of "averages". There are many "averages" in statistics, but
these are, I think, the three most common, and are certainly the three you are most likely to
encounter in your pre-statistics courses, if the topic comes up at all.

The "mean" is the "average" you're used to, where you add up all the numbers and then divide by the
number of numbers. The "median" is the "middle" value in the list of numbers. To find the median,
your numbers have to be listed in numerical order, so you may have to rewrite your list first. The
"mode" is the value that occurs most often. If no number is repeated, then there is no mode for the
list.

The "range" is just the difference between the largest and smallest values.

• Find the mean, median, mode, and range for the following list of values:

13, 18, 13, 14, 13, 16, 14, 21, 13

The mean is the usual average, so:

(13 + 18 + 13 + 14 + 13 + 16 + 14 + 21 + 13) ÷ 9 = 15

Note that the mean isn't a value from the original list. This is a common result. You should not
assume that your mean will be one of your original numbers.

The median is the middle value, so I'll have to rewrite the list in order:

13, 13, 13, 13, 14, 14, 16, 18, 21

There are nine numbers in the list, so the middle one will be the (9 + 1) ÷ 2 = 10 ÷ 2 = 5th
number:

13, 13, 13, 13, 14, 14, 16, 18, 21

So the median is 14.

The mode is the number that is repeated more often than any other, so 13 is the mode.

The largest value in the list is 21, and the smallest is 13, so the range is 21 – 13 = 8.

mean: 15
median: 14
mode: 13
range: 8

Note: The formula for the place to find the median is "( [the number of data points] + 1) ÷ 2", but you
don't have to use this formula. You can just count in from both ends of the list until you meet in the
middle, if you prefer. Either way will work.

mhtml:file://G:\MyDocs\Personal\Maths\Int2M\Int2notes\Statistics\Mean-median-mo... 25/02/2009

• Find the mean, median, mode, and range for the following list of values:

1, 2, 4, 7

The mean is the usual average: (1 + 2 + 4 + 7) ÷ 4 = 14 ÷ 4 = 3.5

The median is the middle number. In this example, the numbers are already listed in numerical
order, so I don't have to rewrite the list. But there is no "middle" number, because there are an
even number of numbers. In this case, the median is the mean (the usual average) of the
middle two values: (2 + 4) ÷ 2 = 6 ÷ 2 = 3

The mode is the number that is repeated most often, but all the numbers appear only once.
Then there is no mode.

The largest value is 7, the smallest is 1, and their difference is 6, so the range is 6.

mean: 3.5
median: 3
mode: none
range: 6

The list values were whole numbers, but the mean was a decimal value. Getting a decimal value for
the mean (or for the median, if you have an even number of data points) is perfectly okay; don't round
your answers to try to match the format of the other numbers.

• Find the mean, median, mode, and range for the following list of values:

8, 9, 10, 10, 10, 11, 11, 11, 12, 13

The mean is the usual average:

(8 + 9 + 10 + 10 + 10 + 11 + 11 + 11 + 12 + 13) ÷ 10 = 105 ÷ 10 = 10.5

The median is the middle value. In a list of ten values, that will be the (10 + 1) ÷ 2 = 5.5th
value; that is, I'll need to average the fifth and sixth numbers to find the median:

(10 + 11) ÷ 2 = 21 ÷ 2 = 10.5

The mode is the number repeated most often. This list has two values that are repeated three
times.

The largest value is 13 and the smallest is 8, so the range is 13 – 8 = 5.

mean: 10.5
median: 10.5
modes: 10 and 11
range: 5

While unusual, it can happen that two of the averages (the mean and the median, in this case) will
have the same value.

Note: Depending on your text or your instructor, the above data set may be viewed as having no
mode (rather than two modes), since no single solitary number was repeated more often than any
other. I've seen books that go either way; there doesn't seem to be a consensus on the "right"
definition of "mode" in the above case. So if you're not certain how you should answer the "mode"
part of the above example, ask your instructor before the next test.
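
The four measures from the examples above can be computed with Python's standard `statistics` module. This is a sketch only: the function name `summarize` is made up, and `multimode` (Python 3.8 or later) follows one particular convention, returning every value tied for most frequent. For a list with no repeats, such as 1, 2, 4, 7, it returns all four values, where this lesson would say "no mode".

```python
from statistics import mean, median, multimode

def summarize(values):
    """Return the mean, median, mode(s) and range of a list of numbers."""
    return {
        "mean": mean(values),
        "median": median(values),
        "modes": multimode(values),            # all values tied for most frequent
        "range": max(values) - min(values),
    }

first = summarize([13, 18, 13, 14, 13, 16, 14, 21, 13])
third = summarize([8, 9, 10, 10, 10, 11, 11, 11, 12, 13])
print(first)  # {'mean': 15, 'median': 14, 'modes': [13], 'range': 8}
print(third)  # {'mean': 10.5, 'median': 10.5, 'modes': [10, 11], 'range': 5}
```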


About the only hard part of finding the mean, median, and mode is keeping straight which "average"
is which. Just remember the following:

mean: regular meaning of "average"


median: middle value
mode: most often

(In the above, I've used the term "average" rather casually. The technical definition of "average" is the
arithmetic mean: adding up the values and then dividing by the number of values. Since you're
probably more familiar with the concept of "average" than with "measure of central tendency", I used
the more comfortable term.)

• A student has gotten the following grades on his tests: 87, 95, 76, and 88. He wants an
85 or better overall. What is the minimum grade he must get on the last test in order to
achieve that average?

The unknown score is "x". Then the desired average is:

(87 + 95 + 76 + 88 + x) ÷ 5 = 85

Multiplying through by 5 and simplifying, I get:

87 + 95 + 76 + 88 + x = 425
346 + x = 425
x = 79

He needs to get at least a 79 on the last test.
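
The same rearrangement works for any target average, as this small Python sketch shows. The function name `needed_score` is hypothetical; it simply multiplies the target by the total number of tests and subtracts the scores already earned.

```python
def needed_score(scores, target):
    """Return the score needed on one more test to reach the target mean."""
    total_tests = len(scores) + 1
    return target * total_tests - sum(scores)

print(needed_score([87, 95, 76, 88], 85))  # 79
```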

Original URL: http://www.purplemath.com/modules/meanmode.htm

Copyright 2006 Elizabeth Stapel; All Rights Reserved.


Terms of Use: http://www.purplemath.com/terms.htm


Probability
Introduction

Probability is the likelihood or chance of an event occurring.


Probability = (the number of ways of achieving success) / (the total number of possible outcomes)
For example, the probability of flipping a coin and it being heads is ½, because
there is 1 way of getting a head and the total number of possible outcomes is 2
(a head or tail). We write P(heads) = ½ .

The probability of something which is certain to happen is 1.


The probability of something which is impossible to happen is 0.
The probability of something not happening is 1 minus the probability that it
will happen.

Single Events
Example

There are 6 beads in a bag, 3 are red, 2 are yellow and 1 is blue. What is the
probability of picking a yellow?

The probability is the number of yellows in the bag divided by the total number
of beads, i.e. 2/6 = 1/3.

Example

There is a bag full of coloured balls, red, blue, green and orange. Balls are
picked out and replaced. John did this 1000 times and obtained the following
results:
Number of blue balls picked out: 300
Number of red balls: 200
Number of green balls: 450
Number of orange balls: 50

a) What is the probability of picking a green ball?


b) If there are 100 balls in the bag, how many of them are likely to be green?

a) For every 1000 balls picked out, 450 are green. Therefore P(green) =
450/1000 = 0.45

b) The experiment suggests that 450 out of 1000 balls are green. Therefore,
out of 100 balls, 45 are green (using ratios).
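
The two answers can be reproduced in Python using exact fractions. This is a minimal sketch; the variable names are chosen for this illustration, and the result for part b) is an estimate based on the experiment, not a certainty.

```python
from fractions import Fraction

counts = {"blue": 300, "red": 200, "green": 450, "orange": 50}
trials = sum(counts.values())                  # 1000 picks in total

p_green = Fraction(counts["green"], trials)    # estimated probability of green
expected_green = p_green * 100                 # scaled to a bag of 100 balls

print(float(p_green))    # 0.45
print(expected_green)    # 45
```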

Multiple Events

http://www.mathsrevision.net/gcse/probability.php 23/02/2009

Possibility Spaces

When working out what the probability of two things happening is, a
probability/ possibility space can be drawn. For example, if you throw two dice,
what is the probability that you will get: a) 8, b) 9, c) either 8 or 9?

a) The black blobs indicate the ways of getting 8 (a 2 and a 6, a 3 and a 5, ...).
There are 5 different ways. The probability space shows us that when throwing
2 dice, there are 36 different possibilities (36 squares). With 5 of these
possibilities, you will get 8. Therefore P(8) = 5/36 .
b) The red blobs indicate the ways of getting 9. There are four ways, therefore
P(9) = 4/36 = 1/9.
c) You will get an 8 or 9 in any of the 'blobbed' squares. There are 9
altogether, so P(8 or 9) = 9/36 = 1/4 .
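
The possibility space can be listed out by a short Python sketch, which counts the squares for each total. The helper name `prob` is made up for this illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first die, second die) pairs
rolls = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability that the total of the two dice satisfies the given test."""
    favourable = sum(1 for roll in rolls if event(sum(roll)))
    return Fraction(favourable, len(rolls))

p8 = prob(lambda total: total == 8)
p9 = prob(lambda total: total == 9)
p8_or_9 = prob(lambda total: total in (8, 9))
print(p8, p9, p8_or_9)  # 5/36 1/9 1/4
```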

Probability Trees

Another way of representing 2 or more events is on a probability tree.

Example

There are 3 balls in a bag: red, yellow and blue. One ball is picked out, and not
replaced, and then another ball is picked out.


The first ball can be red, yellow or blue. The probability is 1/3 for each of
these. If a red ball is picked out, there will be two balls left, a yellow and blue.
The probability the second ball will be yellow is 1/2 and the probability the
second ball will be blue is 1/2. The same logic can be applied to the cases of
when a yellow or blue ball is picked out first.

In this example, the question states that the ball is not replaced. If it was, the
probability of picking a red ball (etc.) the second time will be the same as the
first (i.e. 1/3).

The AND and OR rules

In the above example, the probability of picking a red first is 1/3 and, once
the red has been removed, the probability of picking a yellow second is 1/2.
The probability that a red AND then a yellow will be picked is
1/3 × 1/2 = 1/6 (this is shown at the end of the branch).
The probability of picking a red OR yellow first is 1/3 + 1/3 = 2/3.
When the word 'and' is used we multiply. When 'or' is used for outcomes that
cannot both happen at once, we add. On a probability tree, when moving from
left to right we multiply and when moving down we add.

Example

What is the probability of getting a yellow and a red in any order?


This is the same as: what is the probability of getting a yellow AND a red OR a
red AND a yellow.
P(yellow and red) = 1/3 × 1/2 = 1/6
P(red and yellow) = 1/3 × 1/2 = 1/6


P(yellow and red or red and yellow) = 1/6 + 1/6 = 1/3
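
The tree for this example is small enough to list in full. The sketch below assumes, as in the example, three balls and two picks without replacement; the variable names are only illustrative.

```python
from fractions import Fraction
from itertools import permutations

balls = ["red", "yellow", "blue"]

# Each ordered pair of different balls is one complete branch of the tree.
branches = list(permutations(balls, 2))  # 3 x 2 = 6 branches

# AND rule: multiply along a branch (1/3 for the first pick, 1/2 for the second).
p_branch = Fraction(1, 3) * Fraction(1, 2)

# OR rule: add the probabilities of the branches that fit the question.
wanted = [b for b in branches if set(b) == {"yellow", "red"}]
p_yellow_and_red_any_order = p_branch * len(wanted)

print(wanted)
print(p_yellow_and_red_any_order)
```

Every branch here has the same probability, which is why the matching branches can simply be counted and multiplied by 1/6.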

Representing Data

Bar Chart

A bar chart is a chart where the height of each bar represents the frequency.
The data is 'discrete' (discontinuous, unlike histograms, where the data is
continuous). The bars should be separated by small gaps.

Pie Chart

A pie chart is a circle which is divided into a number of parts.

The pie chart above shows the TV viewing figures for the following TV
programmes:
Eastenders, 15 million
Casualty, 10 million
Peak Practice, 5 million
The Bill, 8 million

Total number of viewers for the four programmes is 38 million. To work out the
angle that 'Eastenders' will have in the pie chart, we divide 15 by 38 and
multiply by 360 (degrees). This is about 142 degrees. So 142 degrees of the
circle represents Eastenders. Similarly, 95 degrees of the circle is Casualty,
47 degrees is Peak Practice and the remaining 76 degrees is The Bill.
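
The angle calculation can be repeated for every programme at once. This is a small sketch using the viewing figures listed above; the dictionary name `viewers` is just a label for this example.

```python
# Viewing figures in millions, from the list above.
viewers = {"Eastenders": 15, "Casualty": 10, "Peak Practice": 5, "The Bill": 8}
total = sum(viewers.values())  # 38 million

# Each programme's share of the 360 degrees in the circle.
angles = {name: 360 * v / total for name, v in viewers.items()}

for name, angle in angles.items():
    print(f"{name}: {angle:.0f} degrees")
```

The unrounded angles add up to exactly 360 degrees; the rounded ones happen to do so as well for these figures.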

http://www.mathsrevision.net/gcse/pages.php?page=10 23/02/2009

Scatterplots and Regressions (page 1 of 4)


Real life is messy, so it is expected that measurements taken from real life will be messy as well.
When you graph measurements of real life, it is expected that the dots won't line up exactly in a nice
neat line, but will instead form a scattering of dots which, at best, might suggest a nice neat line.
These dots are called a scatterplot.

• Create a scatterplot from the following data:

(1, 49), (3, 51), (4, 52), (6, 52), (6, 53), (7, 53), (8, 54), (11, 56),
(12, 56), (14, 57), (14, 58), (17, 59), (18, 59), (20, 60), (20, 61)

One of the first things I have to do when graphing these points is figure out what my axis scale
values are going to be. If I try doing an axis system with the "standard" –10 to 10 values, none
of the above points will even show up on my graph. As is common with these sorts of data
sets, all the x- and y-values are positive, so I only really need scales for the first quadrant. The
y-values are much larger than the x-values, but instead of squeezing all the y-values together,
I'll spread them out (so I can see them better) by using an interrupted scale.

The little "hicky-bob" at the bottom of my y-axis above shows that I've skipped some of the scale
values. For some reason, this broken-axis notation seems almost never to be taught in schools,
though it is very commonly used in "the real world". If you read financial journals, you're very likely to
see many graphs with this sort of axis notation. If you use this notation in your homework, don't be
surprised if you have to explain it to your instructor.

You'll probably be expected to do your scatterplots in your graphing calculator. My calculator gives
me this picture:


You will often need to adjust your WINDOW settings in order to have all your data points show up on
the screen. I used window settings of 0 < X < 25 with an X-scale of 5 and 45 < Y < 65 with a Y-
scale of 5 for the above graph.

When you're done with the scatterplot, don't forget to turn the STATPLOT "off", or the parameters for
the statistics graphing could mess with your regular graphing utility.

I will give you fair warning now: It has become fashionable to insert the topic of scatterplots and
regressions into algebra and other non-statistics classes, and to require students to use a graphing
calculator to answer questions. While they may give you the slope formula and the Quadratic Formula
and all sorts of other stuff on the test (even though you should have memorized them), they will NOT
give you help with your calculator. They often don't seem to care if you've learned the math, but you
had gosh-darned better know your calculator! So pull out your owner's manual, or go to the
manufacturer's web site, or search online, or get together with a friend NOW, because if you're doing
this stuff in class, you ARE going to have to know it, and know it well, on the test.

Original URL: http://www.purplemath.com/modules/scattreg.htm

Copyright 2006 Elizabeth Stapel; All Rights Reserved.


Terms of Use: http://www.purplemath.com/terms.htm


Scatterplots and Regressions (page 2 of 4)


You may be asked about "correlation". Correlation can be used in at least two different ways: to refer
to how well an equation matches the scatterplot, or to refer to the way in which the dots line up. If
you're asked about "positive" or "negative" correlation, they're using the second definition, and they're
asking if the dots line up with a positive or a negative slope, respectively. If you can't plausibly put a
line through the dots, if the dots are just an amorphous cloud of specks, then there is probably no
correlation.

• Tell whether the data graphed in the following scatterplots appear to have positive,
negative, or no correlation.

[Figure: four scatterplots, labelled Plot A, Plot B, Plot C, and Plot D]

Plot A: Low x-values correspond to high y-values, and high x-values correspond to low y-
values. If I put a line through the dots, it would have a negative slope. This scatterplot shows a
negative correlation.

Plot B: Low x-values correspond to low y-values, and high x-values correspond to high y-
values. If I put a line through the dots, it would have a positive slope. This scatterplot shows a
positive correlation.

Plot C: There doesn't seem to be any trend to the dots; they're just all over the place. This
scatterplot shows no correlation.

Plot D: I might think that this plot shows a correlation, because I can clearly put a line through
the dots. But the line would be horizontal, thus having a slope value of zero. These dots
actually show that whatever is being measured on the x-axis has no bearing on whatever is
being measured on the y-axis, because the value of x has no effect on the value of y. So even
though I could draw a line through these points, this scatterplot still shows no correlation.


You may also be asked about "outliers", which are the dots that don't seem to fit with the rest of the
dots. (There are more technical definitions of "outliers", but they will have to wait until you take
statistics classes.) Maybe you dropped the crucible in chem lab, or maybe you should never have left
your idiot lab partner alone with the Bunsen burner in the middle of the experiment. Whatever the
cause, having outliers means you have points that don't line up with everything else.

• Identify any points that appear to be outliers.

Most of the points seem to line up in a fairly straight line, but the dot at (6, 7) is way off to the
side from the general trend-line of the points.

The outlier is the point at (6, 7)

Usually you'll be working with scatterplots where the dots line up in some sort of vaguely straight row.
But you shouldn't expect everything to line up nice and neat, especially in "real life" (like, for instance,
in a physics lab). And sometimes you'll need to pick a different sort of equation as a model, because
the dots line up, but not in a straight line.

• Tell which sort of equation you think would best model the data in the following
scatterplots, and why.

Graph A: The dots look like they line up fairly straight, so a linear model would probably work
well.

Graph B: The dots here do line up, but as more of a curvy line. A quadratic model might work
better.

Graph C: The dots are very close to the x-axis, and then they shoot up, so an exponential or
power-function model might work better here.

In general, expect only to need to recognize linear (straight-line) versus quadratic (curvy-line) models,
and never anything that you haven't already covered in class. For instance, if you haven't done logs
yet, you won't be expected to recognize the need for a logarithmic model for a given scatterplot. The
next lesson explains how to define these models, called "regressions".

Original URL: http://www.purplemath.com/modules/scattreg2.htm


Scatterplots and Regressions (page 3 of 4)


The point of collecting data and plotting the collected values is usually to try to find a formula that can
be used to model a (presumed) relationship. I say "presumed" because the researcher may end up
concluding that there isn't really any relationship where he'd hoped there was one. For instance, you
could run experiments timing a ball as it drops from various heights, and you would be able to find a
definite relationship between "the height from which I dropped the ball" and "the time it took to hit the
floor". On the other hand, you could collect reams of data on the colors of people's eyes and the
colors of their cars, only to discover that there is no discernible connection between the two data
sets.

The process of taking your data points and coming up with an equation is called "regression", and the
graph of the "regression equation" is called "the regression line". If you're doing your scatterplots by
hand, you may be told to find a regression equation by putting a ruler against the first and last dots in
the plot, drawing a line, and guessing the line's equation from the picture. This is an incredibly clumsy
way to proceed, and can give very wrong answers, especially since values at the ends often turn out
to be outliers (numbers that don't quite fit with everything else).

For instance, suppose your dots look like this:

Connecting the first and last points, you would end up with this:

On the other hand, you could ignore the outliers and instead just eyeball the cloud of dots to
locate a general trend. Put the ruler about where you think a line ought to go (regardless of
whether the ruler actually crosses any of the dots), draw the line, and guess the equation from
that. You'll likely end up with a more sensible result. Your equation will still be guess-work, but
it'll be better guess-work than using only the first and last points:


If you're finding regression equations with a ruler, you'll need to work extremely neatly, of course, and
using graph paper would probably be a really good idea. Once you've drawn in your line (and this will
only work for linear, or straight-line, regressions), you will estimate two points on the line that seem to
be close to where the gridlines intersect, and then find the line equation through those two points.
From the above graph, I would guess that the line goes close to the points (3, 7) and (19, 1), so the
regression equation would be y = (–3/8)x + 65/8.
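
Finding the line through two estimated points is ordinary slope-intercept work, sketched below with exact fractions. The function name `line_through` is made up for this illustration; the two points are the ones guessed from the graph above.

```python
from fractions import Fraction

def line_through(p, q):
    """Return (slope, intercept) of the straight line through points p and q."""
    (x1, y1), (x2, y2) = p, q
    slope = Fraction(y2 - y1, x2 - x1)   # rise over run
    intercept = y1 - slope * x1          # solve y1 = slope * x1 + intercept
    return slope, intercept

slope, intercept = line_through((3, 7), (19, 1))
print(f"y = ({slope})x + {intercept}")  # prints: y = (-3/8)x + 65/8
```

Someone who picks two slightly different points on the same ruler line will get a slightly different equation, which is why hand-drawn regressions vary from person to person.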

Most likely, though, you'll be doing regressions in your calculator. Doing regressions properly is a
difficult and technical process, but your graphing calculator has been programmed with the necessary
formulas and has the memory to crunch the many numbers. The calculator will give you "the"
regression line. If you're working by hand, you and your classmates will get slightly different answers;
if you're using calculators, you'll all get the same answer. (Consult your owner's manual or calculator
web sites for specific information on doing regressions with your particular calculator model.)

If you're supposed to report how "good" a given regression is, then figure out how to find the "r",
"$r^2$", and/or "$R^2$" values in your calculator. These diagnostic tools measure the degree to which the
regression equation matches the scatterplot. The closer these correlation values are to 1 (or to –1),
the better a fit your regression equation is to the data values. If the correlation value is more than 0.8
or less than –0.8, the match is judged to be pretty good; if the value is between –0.5 and 0.5, the
match is judged to be pretty poor; and a correlation value close to zero means you're kidding yourself
if you think there's really a relationship of the type you're looking for. (There should be instructions,
somewhere in your owner's manual, for finding this information.) When you're doing a regression,
you're trying to find the "best fit" line to the data, and the correlation numbers help you to tell how
good your "fit" is.

• Given the following data values, find the linear and cubic regression equations.
Say which regression is a better fit, and why.

(2, 23), (3, 24), (8, 32), (10, 36), (13, 51), (14, 59),
(17, 76), (20, 107), (22, 120), (23, 131), (27, 182)

After plugging these values into the STAT utility of my calculator, I can then do a linear
regression:

...and a cubic regression:


The dots look a little curvy on the scatterplot, so it's reasonable that the curvy line, the cubic
$y = 0.000829x^3 + 0.23x^2 - 1.09x + 24.60$, is a better fit to the data points than the
straight-line linear model $y = 6.03x - 10.64$.

Since the correlation value is closer to 1 for the cubic and since the graph of the
cubic model is closer to the dots, the cubic equation
$y = 0.000829x^3 + 0.23x^2 - 1.09x + 24.60$ is the better regression.
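
If you are curious what the calculator is doing for the linear case, the sketch below applies the standard least-squares formulas for a straight line, together with the usual correlation coefficient r, to the data from this example. It is an illustration rather than a description of any particular calculator, it does not attempt the cubic, and the variable names are chosen just for this example.

```python
from math import sqrt

points = [(2, 23), (3, 24), (8, 32), (10, 36), (13, 51), (14, 59),
          (17, 76), (20, 107), (22, 120), (23, 131), (27, 182)]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n

# Sums of squared deviations and of cross-products about the means.
sxx = sum((x - mean_x) ** 2 for x, _ in points)
syy = sum((y - mean_y) ** 2 for _, y in points)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)

slope = sxy / sxx                    # least-squares slope
intercept = mean_y - slope * mean_x  # the line passes through (mean_x, mean_y)
r = sxy / sqrt(sxx * syy)            # correlation coefficient for the linear fit

print(f"y = {slope:.2f}x + {intercept:.2f}, r = {r:.3f}")
```

Rounded to two decimal places, the slope and intercept agree with the linear model quoted above. The r value is high even though the dots curve, so a good-looking correlation number for a line does not rule out a better-fitting curve.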

You shouldn't expect, by the way, always to get correlation values that are close to "1". If they tell
you to find, say, the linear regression equation for a data set, and the correlation factor is close to
zero, this doesn't mean that you've found the "wrong" linear equation; it only means that a linear
equation probably wasn't a good model to the data. A quadratic model, for instance, might have been
better.

Original URL: http://www.purplemath.com/modules/scattreg3.htm


Scatterplots and Regressions (page 4 of 4)


Until (and unless) you get into a statistics class, the preceding pages cover pretty much all there is to
scatterplots and regressions. You draw the dots (or enter them into your calculator), you eyeball a line
(or find one in the calculator), and you see how well the line fits the dots. About the only other thing
you might do is "extrapolate" and "interpolate".

Remember that the point of all this data-collection, dot-drawing, and regression-computing was to try
to find a formula that models... whatever it is that they're measuring. You can use these models to try
to find missing data points or to try to project into the future (or, sometimes, into the past).

If you have data, say, for the years 1950, 1960, 1970, and 1980, and you find a model for your data,
you might use it to guess at values between these dates. For instance, given Namibian population
data for the listed years, you might try to guess the population of Namibia in 1965. The prefix "inter"
means "between", so this guessing-between-the-points would be interpolation. On the other hand,
you might try to work backwards to guess the population in 1940, or try to fill in the missing data up
through 2000. The prefix "extra" means "outside", so this guessing-outside-the-points would be
extrapolation.

• Find a regression equation for the following population data, using t = 0 to stand for
1950. Then estimate the population of Namibia in the years 1940, 1997, and 2005. Note:
Population values are in thousands.

year t   0     5     10    15    20    25    30      35      40      45      50
pop.     511   561   625   704   800   921   1 018   1 142   1 409   1 646   1 894

Setting my window range as 0 < X < 55, counting by 5's, and 500 < Y < 2000, counting by
250's, my calculator gives me the following scatterplot:

The dots look like they line up in a curve, so I'll try a quadratic regression. The calculator gives
me:


As you can see, I've set the calculator to "DiagnosticsOn", so it displays the correlation value
whenever I do a regression. This regression looks pretty darned good, especially when it's
graphed with the data values:

...so I'll use this model for my computations.

Now that I have an equation for modelling Namibia's population, I can use it to estimate the
population in the given years. For 1940, I'll use t = –10, since this is ten years before 1950.
(This is an extrapolated value, since I'm going outside the data set.)

$f(-10) = 0.4958(-10)^2 + 1.9389(-10) + 538.6993 = 568.8903$

For 2005, I'll use t = 55; this will be another extrapolated value.

$f(55) = 0.4958(55)^2 + 1.9389(55) + 538.6993 = 2145.1338$

For 1997, I'll use t = 47. Since this value is between known values, this will be an interpolated
answer.

$f(47) = 0.4958(47)^2 + 1.9389(47) + 538.6993 = 1725.0498$

Remembering that the population values are in thousands, I'll add three zeroes to my numbers
and round to get my final answers:

The estimated value for the population in 1940 is about 569 000; for 2005, the
estimated value is about 2.15 million; and for 1997, the estimated value is about
1.73 million.
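
The three evaluations can be reproduced with a short function. This sketch takes the quadratic coefficients from the computations above; the function name `population_model` is only a label for this illustration.

```python
def population_model(t):
    """Quadratic model from the example: t is years since 1950, result in thousands."""
    return 0.4958 * t**2 + 1.9389 * t + 538.6993

for year in (1940, 1997, 2005):
    t = year - 1950
    # Inside the 1950-2000 data range is interpolation; outside is extrapolation.
    kind = "interpolated" if 0 <= t <= 50 else "extrapolated"
    print(year, round(population_model(t)), "thousand,", kind)
```

The further an extrapolated year lies from the data, the less the model should be trusted, since nothing in the data confirms that the trend continues.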

Depending on your calculator, you may need to memorize what the regression values mean. On my
old TI-85, the regression screen would list values for a and b for a linear regression. But I had to
memorize that the related regression equation was "a + bx", instead of the "ax + b" that I would
otherwise have expected, because the screen didn't say. If you need to memorize this sort of
information, do it now, because the teacher will not bail you out if you forget on the test what your
calculator's variables mean.

Original URL: http://www.purplemath.com/modules/scattreg4.htm


Standard Deviation
The standard deviation measures the spread of the data about the mean
value. It is useful in comparing sets of data which may have the same mean
but a different range. For example, the mean of the following two is the same:
15, 15, 15, 14, 16 and 2, 7, 14, 22, 30. However, the second is clearly more
spread out. If a set has a low standard deviation, the values are not spread out
too much.

Just like when working out the mean, the method is different if the data is
given to you in groups.

Non-Grouped Data

Non-grouped data is just a list of values. The standard deviation is given by
the formula:

$$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$$

σ means 'standard deviation'.
Σ means 'the sum of'.
x̄ means 'the mean'.

Example

Find the standard deviation of 4, 9, 11, 12, 17, 5, 8, 12, 14


First work out the mean: 10.222
Now, subtract the mean individually from each of the numbers given and
square the result. This is the $(x - \bar{x})^2$ step, where x refers to the values
given in the question.

x                   4     9     11    12    17    5     8     12    14
$(x - \bar{x})^2$   38.7  1.49  0.60  3.16  45.9  27.3  4.94  3.16  14.3

Now add up these results (this is the 'sigma' in the formula): 139.55
Divide by n. n is the number of values, so in this case is 9. This gives us:
15.51

http://www.mathsrevision.net/gcse/pages.php?page=42 23/02/2009

And finally, square root this: 3.94
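
The steps of this example can be followed one at a time in Python. This is a minimal sketch with illustrative variable names; it divides by n, as the worked example does, which is what the standard library calls the population standard deviation (`statistics.pstdev`).

```python
from math import sqrt
from statistics import pstdev

values = [4, 9, 11, 12, 17, 5, 8, 12, 14]
n = len(values)

mean = sum(values) / n                                  # step 1: the mean
squared_deviations = [(x - mean) ** 2 for x in values]  # step 2: (x - mean) squared
variance = sum(squared_deviations) / n                  # steps 3 and 4: add up, divide by n
sigma = sqrt(variance)                                  # step 5: square root

print(round(mean, 3), round(variance, 2), round(sigma, 2))
print(round(pstdev(values), 2))  # the library function gives the same value
```

Some calculators also offer a second standard deviation button that divides by n - 1 instead of n; that one gives a slightly larger answer than the method shown here.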

The standard deviation can usually be calculated much more easily with a
calculator and this may be acceptable in some exams. On my calculator, you
go into the standard deviation mode (mode '.'). Then type in the first value,
press 'data', type in the second value, press 'data'. Do this until you have
typed in all the values, then press the standard deviation button (it will
probably have a lower case sigma on it). Check your calculator's manual to see
how to calculate it on yours.

NB: If you have a set of numbers (e.g. 1, 5, 2, 7, 3, 5 and 3) and each number
is increased by the same amount (e.g. to 3, 7, 4, 9, 5, 7 and 5), the standard
deviation will be the same and the mean will have increased by the amount
each of the numbers was increased by (2 in this case). This is because the
standard deviation measures the spread of the data. Increasing each of the
numbers by 2 does not make the numbers any more spread out; it just shifts
them all along.
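
This is easy to confirm numerically. The sketch below uses the standard library's `statistics.pstdev`, which divides by n in the same way as the method above, on the two lists from the note.

```python
from statistics import mean, pstdev

original = [1, 5, 2, 7, 3, 5, 3]
shifted = [x + 2 for x in original]  # every value moved up by 2

print(shifted)
print(mean(original), mean(shifted))      # the mean moves up by 2
print(pstdev(original), pstdev(shifted))  # the spread stays the same
```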

Grouped Data

When dealing with grouped data, such as the following:

x f
4 9
5 14
6 22
7 11
8 17

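the formula for standard deviation becomes:

$$\sigma = \sqrt{\frac{\sum f(x - \bar{x})^2}{\sum f}}$$

where f is the frequency of each value x and x̄ is the mean.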

Try working out the standard deviation of the above data. You should get an
answer of 1.32 .
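
To check your answer, here is a sketch of the calculation. It assumes the usual frequency-weighted version of the formula, in which each squared deviation is counted f times and the total is divided by the sum of the frequencies; the variable names are illustrative.

```python
from math import sqrt

# value x -> frequency f, from the table above
table = {4: 9, 5: 14, 6: 22, 7: 11, 8: 17}

total_f = sum(table.values())                            # number of data points
mean = sum(x * f for x, f in table.items()) / total_f    # weighted mean

# Each squared deviation counts f times.
variance = sum(f * (x - mean) ** 2 for x, f in table.items()) / total_f
sigma = sqrt(variance)

print(total_f, round(mean, 2), round(sigma, 2))
```

For a table given as ranges, the same code works if the midpoints of the groups are used as the x values.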

You may be given the data in the form of groups, such as:

Number      Frequency
3.5 - 4.5   9
4.5 - 5.5   14
5.5 - 6.5   22
6.5 - 7.5   11
7.5 - 8.5   17

In such a circumstance, x is the midpoint of each group. For the table above,
the midpoints are 4, 5, 6, 7 and 8.


Stem-and-Leaf Plots (page 1 of 2)


Stem-and-leaf plots are a method for showing the frequency with which certain classes of values
occur. You could make a frequency distribution table or a histogram for the values, or you can use a
stem-and-leaf plot and let the numbers themselves show pretty much the same information.

For instance, suppose you have the following list of values: 12, 13, 21, 27, 33, 34, 35, 37, 40, 40,
41. You could make a frequency distribution table showing how many tens, twenties, thirties, and
forties you have:

Frequency Class   Frequency
10 - 19           2
20 - 29           2
30 - 39           4
40 - 49           3

You could make a histogram, which is a bar-graph showing the number of occurrences, with the
classes being numbers in the tens, twenties, thirties, and forties:

(The shading of the bars in a histogram isn't necessary, but it can be helpful by making the bars
easier to see, especially if you can't use color to differentiate the bars.)

The downside of frequency distribution tables and histograms is that, while the frequency of each
class is easy to see, the original data points have been lost. You can tell, for instance, that there must
have been three listed values that were in the forties, but there is no way to tell from the table or from
the histogram what those values might have been.

On the other hand, you could make a stem-and-leaf plot for the same data:


The "stem" is the left-hand column which contains the tens digits. The "leaves" are the lists in the
right-hand column, showing all the ones digits for each of the tens, twenties, thirties, and forties. As
you can see, the original values can still be determined; you can tell, from that bottom leaf, that the
three values in the forties were 40, 40, and 41.

Note that the horizontal leaves in the stem-and-leaf plot correspond to the vertical bars in the
histogram, and the leaves have lengths that equal the numbers in the frequency table.
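
A plot like this can be built mechanically: split each value into its tens digit (the stem) and its ones digit (the leaf), and collect the leaves for each stem. The sketch below does this for the list above; the function name `stem_and_leaf` is made up for this illustration, and it only handles non-negative whole numbers.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group whole numbers by tens digit (stem) and list the ones digits (leaves)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 37 -> stem 3, leaf 7
        plot[stem].append(leaf)
    return dict(plot)

data = [12, 13, 21, 27, 33, 34, 35, 37, 40, 40, 41]
plot = stem_and_leaf(data)

for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

The number of leaves on each row matches the frequency table (2, 2, 4, and 3), and each original value can be read back by joining a stem to one of its leaves.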

That's pretty much all there is to a stem-and-leaf plot. You're just listing out how many entries you
have in certain classes of numbers, and what those entries are. Here are some more examples of
stem-and-leaf plots, containing a few additional details.

• Complete a stem-and-leaf plot for the following list of grades on a recent test:

73, 42, 67, 78, 99, 84, 91, 82, 86, 94

I'll use the tens digits as the stem values and the ones digits as the leaves. For convenience's
sake, I'll order the list, but this is not required:

42, 67, 73, 78, 82, 84, 86, 91, 94, 99

Since I know where these data points came from ("a recent test"), I'll use a title. Then my plot
looks like this:

The above is the simplest case for stem-and-leaf plots, but even the "complicated" cases aren't much
more complex.

Original URL: http://www.purplemath.com/modules/stemleaf.htm

Copyright 2006 Elizabeth Stapel; All Rights Reserved.


Stem-and-Leaf Plots: Examples (page 2 of 2)


• Subjects in a psychological study were timed while completing a certain task. Complete
a stem-and-leaf plot for the following list of times:

7.6, 8.1, 9.2, 6.8, 5.9, 6.2, 6.1, 5.8, 7.3, 8.1, 8.7, 7.4, 7.8, 8.2

First, I'll reorder this list:

5.8, 5.9, 6.1, 6.2, 6.8, 7.3, 7.4, 7.6, 7.8, 8.1, 8.1, 8.2, 8.7, 9.2

These values have one decimal place, but the stem-and-leaf plot makes no accommodation for
this. The stem-and-leaf plot only looks at the last digit (for the leaves) and all the digits before
(for the stem). So I'll have to put a "key" or legend on this plot to show what I mean by the
numbers in this plot. The ones digits will be the stem values, and the tenths will be the leaves.

Properly, every stem-and-leaf plot should have a key.

• Complete a stem-and-leaf plot for the following two lists of class sizes:

Economics 101: 9, 13, 14, 15, 16, 16, 17, 19, 20, 21, 21, 22, 25, 25, 26
Libertarianism: 14, 16, 17, 18, 18, 20, 20, 24, 29

This example has two lists of values. Since the values are similar, I can plot them all on one
stem-and-leaf plot by drawing leaves on either side of the stem. I will use the tens digits as the
stem values, and the ones digits as the leaves. Since "9" (in the Econ 101 list) has no tens
digit, the stem value will be "0".


• Complete a stem-and-leaf plot for the following list of values:

100, 110, 120, 130, 130, 150, 160, 170, 170, 190,
210, 230, 240, 260, 270, 270, 280, 290, 290

Since all the ones digits are zeroes, I'll do this plot with the hundreds digits being the stem
values and the tens digits being the leaves. I can do the plot like this:

...but the leaves are fairly long this way, because the values are so close together. To spread
the values out a bit, I can break each leaf into two. For instance, the leaf for the two-hundreds
class can be split into two classes, being the numbers between 200 and 240 and the numbers
between 250 and 290. I can also reverse the order, so the smaller values are at the bottom of
the "stem". The new plot looks like this:

For very compact data points, you can even split the leaves into five classes, like this:


• Complete a stem-and-leaf plot for the following list of values:

23.25, 24.13, 24.76, 24.81, 24.98, 25.31, 25.57, 25.89, 26.28, 26.34, 27.09

If I try to use the last digit, the hundredths digit, for these numbers, the stem-and-leaf plot will
be enormously long, because these values are so spread out. (With the numbers' first three
digits ranging from 232 to 270, I'd have thirty-nine leaves, most of which would be empty.) So
instead of working with the given numbers, I'll round each of the numbers to the nearest tenth,
and then use those new values for my plot. Rounding gives me the following list:

23.3, 24.1, 24.8, 24.8, 25.0, 25.3, 25.6, 25.9, 26.3, 26.3, 27.1

Then my plot looks like this:

Naturally, when you're drawing a stem-and-leaf plot, you should use a ruler to construct a neat table,
and you should label everything clearly.
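
The round-then-plot approach from the last example can be sketched as follows. One assumption is worth flagging: the example rounds 23.25 up to 23.3, so the sketch uses Python's `decimal` module with `ROUND_HALF_UP`, because the built-in `round` does not always round halves upward. The variable names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# Values are kept as strings so that no binary floating-point error creeps in.
raw = ["23.25", "24.13", "24.76", "24.81", "24.98", "25.31",
       "25.57", "25.89", "26.28", "26.34", "27.09"]

# Round each value to the nearest tenth, with halves going up.
rounded = [Decimal(s).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP) for s in raw]

# Stem = digits before the decimal point, leaf = the tenths digit.
plot = {}
for value in rounded:
    stem, leaf = str(value).split(".")
    plot.setdefault(int(stem), []).append(int(leaf))

for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
print("Key: 23 | 3 means 23.3")
```

Because the plotted values are rounded, the original hundredths digits can no longer be read back from the plot, which is the price paid for keeping it short.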

Original URL: http://www.purplemath.com/modules/stemleaf.htm
