Box-and-Whisker Plots:
Quartiles, Boxes, and Whiskers (page 1 of 3)
Sections: Quartiles, boxes, and whiskers, Five-number summary, Interquartile ranges and outliers
Statistics assumes that your data points (the numbers in your list) are clustered around some central
value. The "box" in the box-and-whisker plot contains, and thereby highlights, the middle half of these
data points.
To create a box-and-whisker plot, you start by ordering your data (putting the values in numerical
order), if they aren't ordered already. Then you find the median of your data. The median divides the
data into two halves. To divide the data into quarters, you then find the medians of these two halves.
Note: If you have an even number of values, so the first median was the average of the two middle
values, then you include the middle values in your sub-median computations. If you have an odd
number of values, so the first median was an actual data point, then you do not include that value in
your sub-median computations. That is, to find the sub-medians, you're only looking at the values that
haven't yet been used.
You have three points: the first middle point (the median), and the middle points of the two halves
(what I call the "sub-medians"). These three points divide the entire data set into quarters, called
"quartiles". The top point of each quartile has a name, being a "Q" followed by the number of the
quarter. So the top point of the first quarter of the data points is "Q1", and so forth. Note that Q1 is
also the middle number for the first half of the list, Q2 is also the middle number for the whole list, Q3
is the middle number for the second half of the list, and Q4 is the largest value in the list.
Once you have these three points, Q1, Q2, and Q3, you have all you need in order to draw a simple
box-and-whisker plot. Here's an example of how it works.
4.3, 5.1, 3.9, 4.5, 4.4, 4.9, 5.0, 4.7, 4.1, 4.6, 4.4, 4.3, 4.8, 4.4, 4.2, 4.5, 4.4
3.9, 4.1, 4.2, 4.3, 4.3, 4.4, 4.4, 4.4, 4.4, 4.5, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 5.1
The first number I need is the median of the entire set. Since there are seventeen values in
this list, I need the ninth value:
3.9, 4.1, 4.2, 4.3, 4.3, 4.4, 4.4, 4.4, 4.4, 4.5, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 5.1
The next two numbers I need are the medians of the two halves. Since I used the "4.4" in the
middle of the list, I can't re-use it, so my two remaining data sets are:
mhtml:file://G:\MyDocs\Personal\Maths\Int2M\Int2notes\Statistics\Box-and-Whisker... 25/02/2009
3.9, 4.1, 4.2, 4.3, 4.3, 4.4, 4.4, 4.4 and 4.5, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 5.1
The first half has eight values, so the median is the average of the middle two:
By the way, box-and-whisker plots don't have to be drawn horizontally as I did above; they can be
vertical, too.
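The sub-median rule can be sketched in Python. This is only an illustration of the convention described above (the overall median is left out of the two halves when the count is odd); the function name `quartiles` is made up for this example, and other books or calculators may use a different rule.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # values below the middle position
    upper = ordered[half + n % 2:]   # skips the middle value when n is odd
    return median(lower), median(ordered), median(upper)

# The seventeen values from the example above.
data = [4.3, 5.1, 3.9, 4.5, 4.4, 4.9, 5.0, 4.7, 4.1, 4.6,
        4.4, 4.3, 4.8, 4.4, 4.2, 4.5, 4.4]
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)
```

For this list the sketch gives Q1 = 4.3, Q2 = 4.4 and Q3 = 4.75 (up to floating-point rounding).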
Box-and-Whisker Plots:
Five-Number Summary (page 2 of 3)
More terminology: The top end of your box may also be called the "upper hinge"; the lower end may
also be called the "lower hinge". The lower hinge is also called "the 25th percentile"; the median is
"the 50th percentile"; the upper hinge is "the 75th percentile". This means that 25%, 50% and 75%
of the data, respectively, is at or below that point. The distance between the hinges may be referred
to as the "H-spread" or, as you will see on the following page, the "Interquartile Range", abbreviated
"IQR". ("Hinge" actually has a different technical definition, but the term is sometimes used
informally.)
Also, some books and software will include the overall median (Q2) when computing Q1 and Q3 for
data sets with an odd number of elements. The Texas Instruments calculators do not include Q2 in
this case, so you may encounter a book answer that doesn't match the calculator answer. And
different software packages use all different sorts of formulas. Be careful to use the formula from your
book when doing your homework!
Additionally, the box-and-whisker plot may include a cross or an "X" marking the mean value of the
data, in addition to the line inside the box that marks the median. The difference between the "X" and
the median line can then be used as a measure of "skew".
My first step is to find the median. Since there are eight data points, the median will be the
average of the two middle values: (86 + 87) ÷ 2 = 86.5 = Q2
This splits the list into two halves: 77, 79, 80, 86 and 87, 87, 94, 99. Since the halves of
the data set each contain an even number of values, the sub-medians will be the average of
the middle two values.
As you can see, you only need the five values listed above (min, Q1, Q2, Q3, and max) in order to
draw your box-and-whisker plot. This set of five values has been given the name "the five-number
summary".
The five-number summary consists of the numbers I need for the box-and-whisker plot: the
minimum value, Q1 (the bottom of the box), Q2 (the median of the set), Q3 (the top of the
box), and the maximum value (which is also Q4). So I need to order the set, find the median
and the sub-medians, and then list the required values in order.
ordering the list: 53, 79, 80, 82, 87, 91, 93, 98, so the minimum is 53 and the
maximum is 98
lower half of the list: 53, 79, 80, 82, so Q1 = (79 + 80) ÷ 2 = 79.5
upper half of the list: 87, 91, 93, 98, so Q3 = (91 + 93) ÷ 2 = 92
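The same steps can be written as a short Python sketch. The name `five_number_summary` is just an illustrative choice, and the halves are split by the sub-median rule used in this lesson; software that uses a different quartile formula may give other values for Q1 and Q3.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, Q2, Q3, maximum) using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])            # median of the lower half
    q3 = median(ordered[half + n % 2:])    # median of the upper half
    return ordered[0], q1, median(ordered), q3, ordered[-1]

# The ordered list from the example above.
summary = five_number_summary([53, 79, 80, 82, 87, 91, 93, 98])
print(summary)
```

Here the overall median works out to (82 + 87) ÷ 2 = 84.5, which sits between Q1 = 79.5 and Q3 = 92 in the summary.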
Part of the point of a box-and-whisker plot is to show how spread out your values are. But what if one
or another of your values is way out of line? For this, we need to consider "outliers"....
Box-and-Whisker Plots:
Interquartile Ranges and Outliers (page 3 of 3)
The "interquartile range", abbreviated "IQR", is just the width of the box in the box-and-whisker plot.
That is, IQR = Q3 – Q1. The IQR can be used as a measure of how spread-out the values are.
Statistics assumes that your values are clustered around some central value. The IQR tells how
spread out the "middle" values are; it can also be used to tell when some of the other values are "too
far" from the central value. These "too far away" points are called "outliers", because they "lie outside"
the range in which we expect them.
The IQR is the length of the box in your box-and-whisker plot. An outlier is any value that lies more
than one and a half times the length of the box from either end of the box. That is, if a data point is
below Q1 – 1.5×IQR or above Q3 + 1.5×IQR, it is viewed as being too far from the central values
to be reasonable. Maybe you bumped the weigh-scale when you were making that one
measurement, or maybe your lab partner is an idiot and you should never have let him touch any of
the equipment. Who knows? But whatever their cause, the outliers are those points that don't seem to
"fit".
(Why one and a half times the width of the box? Why does that particular value demarcate the difference
between "acceptable" and "unacceptable" values? Because, when John Tukey was inventing the
box-and-whisker plot in 1977 to display these values, he picked 1.5×IQR as the demarcation line for
outliers. This has worked well, so we've continued using that value ever since.)
To find out if there are any outliers, I first have to find the IQR. There are fifteen data points,
so the median will be at position (15 + 1) ÷ 2 = 8. Then Q2 = 14.6. There are seven data
points on either side of the median, so Q1 is the fourth value in the list and Q3 is the twelfth:
Q1 = 14.4 and Q3 = 14.9. Then IQR = 14.9 – 14.4 = 0.5.
The values for Q1 – 1.5×IQR and Q3 + 1.5×IQR are the "fences" that mark off the "reasonable"
values from the outlier values. Outliers lie outside the fences.
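As a rough sketch, the fence arithmetic looks like this in Python. The helper name `inner_fences` is invented for this illustration, and `Decimal` is used only so that the decimal values come out exactly instead of picking up binary rounding.

```python
from decimal import Decimal

def inner_fences(q1, q3):
    """Return (lower fence, upper fence) at 1.5 x IQR beyond the box."""
    iqr = q3 - q1
    step = Decimal("1.5") * iqr
    return q1 - step, q3 + step

# Q1 = 14.4 and Q3 = 14.9 from the example above, so IQR = 0.5.
low, high = inner_fences(Decimal("14.4"), Decimal("14.9"))
print(low, high)
```

With these quartiles the fences land at 13.65 and 15.65, so any value below 13.65 or above 15.65 would count as an outlier.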
If your assignment is having you consider outliers and "extreme values", then the values for Q1 –
1.5×IQR and Q3 + 1.5×IQR are the "inner" fences and the values for Q1 – 3×IQR and
Q3 + 3×IQR are the "outer" fences. The outliers (marked with asterisks or open dots) are between
the inner and outer fences, and the extreme values (marked with whichever symbol you didn't use for
the outliers) are outside the outer fences.
By the way, your book may refer to the value of "1.5×IQR" as being a "step". Then the outliers will be
the numbers that are between one and two steps from the hinges, and extreme values will be the
numbers that are more than two steps from the hinges.
Looking again at the previous example, the outer fences would be at 14.4 – 3×0.5 = 12.9 and 14.9
+ 3×0.5 = 16.4. Since 16.4 is right on the upper outer fence, this would be considered to be only an
outlier, not an extreme value. But 10.2 is fully below the lower outer fence, so 10.2 would be an
extreme value.
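The inner-fence and outer-fence rule can be sketched as a small classifier. The function name `classify` and its three labels are made up for this example; it follows the convention above, under which a value sitting exactly on an outer fence counts only as an outlier.

```python
from decimal import Decimal

def classify(value, q1, q3):
    """Label a value as 'regular', 'outlier', or 'extreme value'."""
    iqr = q3 - q1
    inner_low, inner_high = q1 - Decimal("1.5") * iqr, q3 + Decimal("1.5") * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr
    if value < outer_low or value > outer_high:
        return "extreme value"      # strictly beyond an outer fence
    if value < inner_low or value > inner_high:
        return "outlier"            # beyond an inner fence only
    return "regular"

q1, q3 = Decimal("14.4"), Decimal("14.9")
for value in (Decimal("16.4"), Decimal("10.2"), Decimal("14.6")):
    print(value, classify(value, q1, q3))
```

Run on the values discussed above, 16.4 comes out as an outlier and 10.2 as an extreme value, matching the hand calculation.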
If you're using your graphing calculator to help with these plots, make sure you know which setting
you're supposed to be using and what the results mean, or the calculator may give you a perfectly
correct but "wrong" answer.
Find the outliers and extreme values, if any, for the following data set, and draw the
box-and-whisker plot. Mark any outliers with an asterisk and any extreme values with an
open dot.
To find the outliers and extreme values, I first have to find the IQR. Since there are seven
values in the list, the median is the fourth value, so Q2 = 25. The first half of the list is 21, 23,
24, so Q1 = 23; the second half is 29, 33, 49, so Q3 = 33. Then IQR = 33 – 23 = 10.
So I have an outlier at 49 but no extreme values, I won't have a top whisker because Q3 is
It should be noted that the methods, terms, and rules outlined above are what I have taught and what
I have most commonly seen taught. However, your course may have different specific rules, or your
calculator may do computations slightly differently. You may need to be somewhat flexible in finding
the answers specific to your curriculum.
Cum. Freq. Graphs
Example
Frequency:    Cumulative Frequency:
4             4
6             10 (4 + 6)
3             13 (4 + 6 + 3)
2             15 (4 + 6 + 3 + 2)
6             21 (4 + 6 + 3 + 2 + 6)
4             25 (4 + 6 + 3 + 2 + 6 + 4)
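The running totals in the table can be reproduced with a few lines of Python. This is just a sketch using the standard library's `itertools.accumulate`; the variable names are illustrative.

```python
from itertools import accumulate

frequencies = [4, 6, 3, 2, 6, 4]

# Each cumulative frequency is the sum of all frequencies up to that row.
cumulative = list(accumulate(frequencies))
print(cumulative)
```

The last running total, 25, is the total number of values.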
The median of a group of numbers is the number in the middle, when the
numbers are in order of magnitude. For example, if the set of numbers is 4, 1,
6, 2, 6, 7, 8, the median is 6:
1, 2, 4, 6, 6, 7, 8 (6 is the middle value when the numbers are in order)
If you have n numbers in a group, the median is the (n + 1)/2 th value. For
example, there are 7 numbers in the example above, so replace n by 7 and the
median is the (7 + 1)/2 th value = 4th value. The 4th value is 6.
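A small Python sketch of the (n + 1)/2 rule follows. The function name `median_by_position` is invented here, and the handling of an even count (averaging the two values either side of the half-position) is an assumption that goes slightly beyond the odd-count case described above.

```python
def median_by_position(values):
    """Find the median by locating the (n + 1)/2 th value of the ordered list."""
    ordered = sorted(values)
    position = (len(ordered) + 1) / 2          # 1-based position in the list
    if position.is_integer():
        return ordered[int(position) - 1]      # odd count: a single middle value
    below = int(position)                      # even count: position ends in .5
    return (ordered[below - 1] + ordered[below]) / 2

print(median_by_position([4, 1, 6, 2, 6, 7, 8]))
```

For the seven numbers above, the position is 4 and the 4th ordered value is 6.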
Quartiles
If we divide a cumulative frequency curve into quarters, the value at the lower
quarter is referred to as the lower quartile, the value at the middle gives the
median and the value at the upper quarter is the upper quartile.
A set of numbers may be as follows: 8, 14, 15, 16, 17, 18, 19, 50. The mean
of these numbers is 19.625. However, the extremes in this set (8 and 50)
distort this value. The interquartile range is a method of measuring the spread
of the middle 50% of the values and is useful since it ignores the extreme
values.
http://www.mathsrevision.net/gcse/pages.php?page=21 23/02/2009
The lower quartile is the (n+1)/4 th value (n is the total cumulative frequency, i.e. 157
in this case) and the upper quartile is the 3(n+1)/4 th value. The difference
between these two is the interquartile range (IQR).
In the above example, the upper quartile is the 118.5th value and the lower
quartile is the 39.5th value. If we draw a cumulative frequency curve, we see
that the lower quartile, therefore, is about 17 and the upper quartile is about
37. Therefore the IQR is 20 (bear in mind that this is a rough sketch; if you
plot the values on graph paper you will get a more accurate value).
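The position arithmetic can be checked with a short Python sketch. The helper name `quartile_positions` is made up for this illustration; the quartile values 17 and 37 are the rough readings from the curve quoted above, not something the code can compute.

```python
def quartile_positions(n):
    """Return the positions of the lower and upper quartiles for n values."""
    return (n + 1) / 4, 3 * (n + 1) / 4

lower_pos, upper_pos = quartile_positions(157)
print(lower_pos, upper_pos)

# Values read off the cumulative frequency curve at those positions.
lower_quartile, upper_quartile = 17, 37
iqr = upper_quartile - lower_quartile
print(iqr)
```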
Linear Regression
Scatter Diagrams
We often wish to look at the relationship between two things (e.g. between a person's height and weight) by comparing
data for each of these things. A good way of doing this is by drawing a scatter diagram.
"Regression" is the process of finding the function satisfied by the points on the scatter diagram. Of course, the points
might not fit the function exactly but the aim is to get as close as possible. "Linear" means that the function we are
looking for is a straight line (so our function f will be of the form f(x) = mx + c for constants m and c).
Correlation
Correlation is a term used to describe how strong the relationship between the two variables appears to be.
We say that there is a positive linear correlation if y increases as x increases and we say there is a negative linear
correlation if y decreases as x increases. There is no correlation if x and y do not appear to be related.
In many experiments, one of the variables is fixed or controlled and the point of the experiment is to determine how the
other variable varies with the first. The fixed/controlled variable is known as the explanatory or independent variable
and the other variable is known as the response or dependent variable.
I shall use "x" for my explanatory variable and "y" for my response variable, but I could have used any letters.
Regression Lines
By Eye
If there is very little scatter (we say there is a strong correlation between the variables), a regression line can be drawn
"by eye". You should make sure that your line passes through the mean point (the point (x̄, ȳ) where x̄ is the mean of the
data collected for the explanatory variable and ȳ is the mean of the data collected for the response variable).
When there is a reasonable amount of scatter, we can draw two different regression lines depending upon which
variable we consider to be the more accurate. The first is a line of regression of y on x, which can be used to estimate y
given x. The other is a line of regression of x on y, used to estimate x given y.
If there is a perfect correlation between the data (in other words, if all the points lie on a straight line), then the two
regression lines will be the same.
http://www.mathsrevision.net/alevel/pages.php?page=61 23/02/2009
This is a method of finding a regression line without estimating where the line should go by eye.
If the equation of the regression line is y = ax + b, we need to find what a and b are. We find these by solving the
"normal equations".
Normal Equations
Σy = aΣx + nb and Σxy = aΣx² + bΣx
For the line of regression of x on y, the "normal equations" are the same but with x and y swapped.
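As a sketch of how the normal equations give a and b, the Python below solves the standard pair Σy = aΣx + nb and Σxy = aΣx² + bΣx for the line of regression of y on x. The function name `regression_y_on_x` and the four sample points are hypothetical, chosen only so the answer is easy to check by hand.

```python
def regression_y_on_x(points):
    """Solve the normal equations for y = ax + b; returns (a, b)."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    # Eliminating b from the two normal equations gives a directly.
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    # The first normal equation then gives b.
    b = (sum_y - a * sum_x) / n
    return a, b

# Hypothetical points lying exactly on y = 2x + 1.
a, b = regression_y_on_x([(0, 1), (1, 3), (2, 5), (3, 7)])
print(a, b)
```

Swapping the x and y coordinates of each point before calling the function would give the line of regression of x on y in the same way.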
MathsII-Statistics
Standard-deviation
Comparing distributions
For example, the following two data sets are significantly different in nature and yet have
the same mean, median and range. Some sort of numerical measure which distinguishes
between them would be useful.
The standard deviation of the first set of data is significantly larger than the standard
deviation of the second set of data (ie there is more spread about the mean in the first set
of data).
The formulae
There are two formulae for standard deviation given in the formulae list in the Credit
Level examination paper. The first of the two formulae is often referred to as the defining
formula and shows more clearly that the standard deviation of a set of numbers is the
square root of the average of the squares of differences between each of the numbers
and the mean of the numbers.
The second formula is a re-arrangement which may make it better for calculation
purposes.
You may use either of the formulae; they'll give you the same answer.
When you're finding the standard deviation of a set of measures, which are only a sample
of the total set of measures, then it's correct to use "n - 1". All examples in the exams will
be of this type. When statisticians know they're working with the whole set or the
population then they use "n" instead of "n - 1".
Remember: x̄ is the "mean".
Example
Question 1
The Answer
Here are two ways of calculating the standard deviation, using formulae.
(i)
x     x - x̄    (x - x̄)²
4     -7       49
7     -4       16
9     -2       4
11    0        0
13    2        4
15    4        16
18    7        49
Σ(x - x̄)² = 138
The next step is to find the average of these squared differences. In this case, add them
up and divide by six (one less than the number of numbers).
The final step is to take the square root. This undoes the squaring we did earlier.
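The three steps of method (i) can be sketched in Python. The function name `sample_std_dev` is illustrative only; it uses the "n - 1" divisor described earlier for sample data.

```python
from math import sqrt

def sample_std_dev(values):
    """Standard deviation from the defining formula, dividing by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_diffs = [(x - mean) ** 2 for x in values]   # the (x - mean)^2 column
    return sqrt(sum(squared_diffs) / (n - 1))

s = sample_std_dev([4, 7, 9, 11, 13, 15, 18])
print(round(s, 2))
```

For these seven numbers the mean is 11, the squared differences add up to 138, and 138 ÷ 6 = 23, so the standard deviation is the square root of 23, roughly 4.80.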
(ii)
x     x²
4     16
7     49
9     81
11    121
13    169
15    225
18    324
Σx² = 985
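Method (ii) can be sketched the same way. This assumes the usual rearranged form, the square root of (Σx² - (Σx)²/n) ÷ (n - 1); the function name `sample_std_dev_shortcut` is made up, and the result is compared against Python's own `statistics.stdev`, which also divides by n - 1.

```python
from math import sqrt
from statistics import stdev

def sample_std_dev_shortcut(values):
    """Standard deviation from the rearranged formula using sums of x and x^2."""
    n = len(values)
    sum_x = sum(values)                      # 77 for the data below
    sum_x2 = sum(x * x for x in values)      # 985 for the data below
    return sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

data = [4, 7, 9, 11, 13, 15, 18]
shortcut = sample_std_dev_shortcut(data)
print(round(shortcut, 2), round(stdev(data), 2))
```

Both routes lead to the same 138 ÷ 6 = 23 under the square root, so the two formulae agree.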
Standard Deviation
1. Here are two sets of data:
2, 7, 5, 5, 3, 9, 10, 8, 12, 11
a. 72
b. 7.2
c. 7
d. 7.5
a. 3.39
b. 3
Statistics – Standard Deviation
Test bite
c. 3.22
d. 0.2
a. 160.2
b. 130
c. 123
d. 128.2
a. 0.75
b. 20.75
c. 2.75
d. 20
The "mean" is the "average" you're used to, where you add up all the numbers and then divide by the
number of numbers. The "median" is the "middle" value in the list of numbers. To find the median,
your numbers have to be listed in numerical order, so you may have to rewrite your list first. The
"mode" is the value that occurs most often. If no number is repeated, then there is no mode for the
list.
The "range" is just the difference between the largest and smallest values.
Find the mean, median, mode, and range for the following list of values:
(13 + 18 + 13 + 14 + 13 + 16 + 14 + 21 + 13) ÷ 9 = 15
Note that the mean isn't a value from the original list. This is a common result. You should not
assume that your mean will be one of your original numbers.
The median is the middle value, so I'll have to rewrite the list in order:
There are nine numbers in the list, so the middle one will be the (9 + 1) ÷ 2 = 10 ÷ 2 = 5th
number:
The mode is the number that is repeated more often than any other, so 13 is the mode.
The largest value in the list is 21, and the smallest is 13, so the range is 21 – 13 = 8.
mean: 15
median: 14
mode: 13
range: 8
Note: The formula for the place to find the median is "( [the number of data points] + 1) ÷ 2", but you
don't have to use this formula. You can just count in from both ends of the list until you meet in the
middle, if you prefer. Either way will work.
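If you'd like to check this sort of answer, here is a minimal Python sketch using the standard `statistics` module. The list is the one from the mean computation above; the dictionary keys are just labels chosen for this example.

```python
from statistics import mean, median, mode

values = [13, 18, 13, 14, 13, 16, 14, 21, 13]

summary = {
    "mean": mean(values),                 # add up, divide by the count
    "median": median(values),             # middle value of the sorted list
    "mode": mode(values),                 # most frequently occurring value
    "range": max(values) - min(values),   # largest minus smallest
}
print(summary)
```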
mhtml:file://G:\MyDocs\Personal\Maths\Int2M\Int2notes\Statistics\Mean-median-mo... 25/02/2009
Find the mean, median, mode, and range for the following list of values:
1, 2, 4, 7
The median is the middle number. In this example, the numbers are already listed in numerical
order, so I don't have to rewrite the list. But there is no "middle" number, because there are an
even number of numbers. In this case, the median is the mean (the usual average) of the
middle two values: (2 + 4) ÷ 2 = 6 ÷ 2 = 3
The mode is the number that is repeated most often, but all the numbers appear only once.
Then there is no mode.
The largest value is 7, the smallest is 1, and their difference is 6, so the range is 6.
mean: 3.5
median: 3
mode: none
range: 6
The list values were whole numbers, but the mean was a decimal value. Getting a decimal value for
the mean (or for the median, if you have an even number of data points) is perfectly okay; don't round
your answers to try to match the format of the other numbers.
Find the mean, median, mode, and range for the following list of values:
The median is the middle value. In a list of ten values, that will be the (10 + 1) ÷ 2 = 5.5th
value; that is, I'll need to average the fifth and sixth numbers to find the median:
The mode is the number repeated most often. This list has two values that are repeated three
times.
mean: 10.5
median: 10.5
modes: 10 and 11
range: 5
While unusual, it can happen that two of the averages (the mean and the median, in this case) will
have the same value.
Note: Depending on your text or your instructor, the above data set may be viewed as having no
mode (rather than two modes), since no single solitary number was repeated more often than any
other. I've seen books that go either way; there doesn't seem to be a consensus on the "right"
definition of "mode" in the above case. So if you're not certain how you should answer the "mode"
part of the above example, ask your instructor before the next test.
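The two conventions are easy to see in code. The sketch below follows the convention used in these examples: no mode when nothing repeats, and every tied value reported when there is a tie. The function name `modes` and the tied list are hypothetical; a text that says a tie means "no mode" would need a different rule.

```python
from collections import Counter

def modes(values):
    """Return the most frequent values; an empty list means 'no mode'."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                      # nothing repeats, so no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([1, 2, 4, 7]))        # no value repeats
tied = [2, 2, 5, 5, 9]            # hypothetical list with a two-way tie
print(modes(tied))
```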
About the only hard part of finding the mean, median, and mode is keeping straight which "average"
is which. Just remember the following:
(In the above, I've used the term "average" rather casually. The technical definition of "average" is the
arithmetic mean: adding up the values and then dividing by the number of values. Since you're
probably more familiar with the concept of "average" than with "measure of central tendency", I used
the more comfortable term.)
A student has gotten the following grades on his tests: 87, 95, 76, and 88. He wants an
85 or better overall. What is the minimum grade he must get on the last test in order to
achieve that average?
(87 + 95 + 76 + 88 + x) ÷ 5 = 85
87 + 95 + 76 + 88 + x = 425
346 + x = 425
x = 79
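The same algebra can be wrapped in a tiny Python function. The name `required_score` and its parameters are made up for this sketch; it simply rearranges (sum of scores + x) ÷ total = target to solve for x.

```python
def required_score(scores, target_mean, total_tests):
    """Score needed on the remaining test to reach the target mean."""
    return target_mean * total_tests - sum(scores)

needed = required_score([87, 95, 76, 88], target_mean=85, total_tests=5)
print(needed)
```

This assumes exactly one test remains and that all tests carry equal weight, as in the exercise above.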
Probability
Introduction
Single Events
Example
There are 6 beads in a bag, 3 are red, 2 are yellow and 1 is blue. What is the
probability of picking a yellow?
The probability is the number of yellows in the bag divided by the total number
of beads, i.e. 2/6 = 1/3.
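A quick way to check this kind of calculation is with exact fractions in Python. The dictionary `bag` and the function name `probability` are illustrative choices, and the sketch assumes every bead is equally likely to be picked.

```python
from fractions import Fraction

bag = {"red": 3, "yellow": 2, "blue": 1}

def probability(colour, bag):
    """Favourable beads divided by the total number of beads."""
    return Fraction(bag[colour], sum(bag.values()))

p_yellow = probability("yellow", bag)
print(p_yellow)
```

`Fraction` reduces 2/6 to 1/3 automatically.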
Example
There is a bag full of coloured balls, red, blue, green and orange. Balls are
picked out and replaced. John did this 1000 times and obtained the following
results:
Number of blue balls picked out: 300
Number of red balls: 200
Number of green balls: 450
Number of orange balls: 50
a) For every 1000 balls picked out, 450 are green. Therefore P(green) =
450/1000 = 0.45
b) The experiment suggests that 450 out of 1000 balls are green. Therefore,
out of 100 balls, 45 are green (using ratios).
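The estimates from John's experiment can be worked out the same way for every colour. This is a sketch of the relative-frequency idea only; the variable names are made up, and the results are estimates based on 1000 picks, not exact probabilities.

```python
from fractions import Fraction

results = {"blue": 300, "red": 200, "green": 450, "orange": 50}
total_picks = sum(results.values())

# Relative frequency: times a colour came up divided by the number of picks.
estimates = {colour: Fraction(count, total_picks)
             for colour, count in results.items()}

# Scale the green estimate to a bag of 100 balls.
expected_green_in_100 = estimates["green"] * 100
print(float(estimates["green"]), expected_green_in_100)
```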
Multiple Events
http://www.mathsrevision.net/gcse/probability.php 23/02/2009
Possibility Spaces
When working out what the probability of two things happening is, a
probability/ possibility space can be drawn. For example, if you throw two dice,
what is the probability that you will get: a) 8, b) 9, c) either 8 or 9?
a) The black blobs indicate the ways of getting 8 (a 2 and a 6, a 3 and a 5, ...).
There are 5 different ways. The probability space shows us that when throwing
2 dice, there are 36 different possibilities (36 squares). With 5 of these
possibilities, you will get 8. Therefore P(8) = 5/36 .
b) The red blobs indicate the ways of getting 9. There are four ways, therefore
P(9) = 4/36 = 1/9.
c) You will get an 8 or 9 in any of the 'blobbed' squares. There are 9
altogether, so P(8 or 9) = 9/36 = 1/4 .
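The possibility space for two dice is small enough to list in full, which the Python sketch below does. The function name `probability_of_total` is hypothetical, and the sketch assumes two fair six-sided dice.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes for two dice, e.g. (2, 6) and (6, 2).
outcomes = list(product(range(1, 7), repeat=2))

def probability_of_total(*totals):
    """Probability that the two dice add up to any of the given totals."""
    favourable = [pair for pair in outcomes if sum(pair) in totals]
    return Fraction(len(favourable), len(outcomes))

print(probability_of_total(8), probability_of_total(9), probability_of_total(8, 9))
```

Counting squares in the possibility space by hand and counting list entries here are the same operation.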
Probability Trees
Example
There are 3 balls in a bag: red, yellow and blue. One ball is picked out, and not
replaced, and then another ball is picked out.
The first ball can be red, yellow or blue. The probability is 1/3 for each of
these. If a red ball is picked out, there will be two balls left, a yellow and blue.
The probability the second ball will be yellow is 1/2 and the probability the
second ball will be blue is 1/2. The same logic can be applied to the cases of
when a yellow or blue ball is picked out first.
In this example, the question states that the ball is not replaced. If it were replaced, the
probability of picking a red ball (etc.) the second time would be the same as the
first (i.e. 1/3).
In the above example, the probability of picking a red first is 1/3 and a yellow
second is 1/2. The probability that a red AND then a yellow will be picked is
1/3 × 1/2 = 1/6 (this is shown at the end of the branch).
The probability of picking a red OR yellow first is 1/3 + 1/3 = 2/3.
When the word 'and' is used we multiply. When 'or' is used, we add. On a
probability tree, when moving from left to right we multiply and when moving
down we add.
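The branches of the tree can be listed with Python as a check. This sketch treats each ordered pair of draws without replacement as equally likely (each has probability 1/3 × 1/2); the variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

balls = ["red", "yellow", "blue"]

# Every branch of the tree: an ordered pair of different balls.
draws = list(permutations(balls, 2))

# "Red AND then yellow" is one branch out of six.
p_red_then_yellow = Fraction(draws.count(("red", "yellow")), len(draws))

# "Red OR yellow first" covers four of the six branches.
first_red_or_yellow = [d for d in draws if d[0] in ("red", "yellow")]
p_red_or_yellow_first = Fraction(len(first_red_or_yellow), len(draws))

print(p_red_then_yellow, p_red_or_yellow_first)
```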
Example
Representing Data
Bar Chart
A bar chart is a chart where the height of bars represents the frequency. The
data is 'discrete' (discontinuous, unlike histograms, where the data is
continuous). The bars should be separated by small gaps.
Pie Chart
The pie chart above shows the TV viewing figures for the following TV
programmes:
Eastenders, 15 million
Casualty, 10 million
Peak Practice, 5 million
The Bill, 8 million
Total number of viewers for the four programmes is 38 million. To work out the
angle that 'Eastenders' will have in the pie chart, we divide 15 by 38 and
multiply by 360 (degrees). This is 142 degrees. So 142 degrees of the circle
http://www.mathsrevision.net/gcse/pages.php?page=10 23/02/2009
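The angle calculation for the pie chart can be repeated for all four programmes with a short Python sketch. The variable names are made up for this illustration; the viewing figures are the ones listed above, in millions.

```python
viewers = {"Eastenders": 15, "Casualty": 10, "Peak Practice": 5, "The Bill": 8}
total = sum(viewers.values())

# Each programme's share of the total, scaled to the 360 degrees of a circle.
angles = {name: count / total * 360 for name, count in viewers.items()}

for name, angle in angles.items():
    print(name, round(angle))
```

Rounding each angle separately can leave the rounded angles a degree short of or over 360, so it is worth checking the total before drawing.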
(1, 49), (3, 51), (4, 52), (6, 52), (6, 53), (7, 53), (8, 54), (11, 56),
(12, 56), (14, 57), (14, 58), (17, 59), (18, 59), (20, 60), (20, 61)
One of the first things I have to do when graphing these points is figure out what my axis scale
values are going to be. If I try doing an axis system with the "standard" –10 to 10 values, none
of the above points will even show up on my graph. As is common with these sorts of data
sets, all the x- and y-values are positive, so I only really need scales for the first quadrant. The
y-values are much larger than the x-values, but instead of squeezing all the y-values together,
I'll spread them out (so I can see them better) by using an interrupted scale.
The little "hicky-bob" at the bottom of my y-axis above shows that I've skipped some of the scale
values. For some reason, this broken-axis notation seems almost never to be taught in schools,
though it is very commonly used in "the real world". If you read financial journals, you're very likely to
see many graphs with this sort of axis notation. If you use this notation in your homework, don't be
surprised if you have to explain it to your instructor.
You'll probably be expected to do your scatterplots in your graphing calculator. My calculator gives
me this picture:
mhtml:file://G:\MyDocs\Personal\Maths\Int2M\Int2notes\Statistics\ScatterPlots-Regr... 25/02/2009
You will often need to adjust your WINDOW settings in order to have all your data points show up on
the screen. I used window settings of 0 < X < 25 with an X-scale of 5 and 45 < Y < 65 with a Y-
scale of 5 for the above graph.
When you're done with the scatterplot, don't forget to turn the STATPLOT "off", or the parameters for
the statistics graphing could mess with your regular graphing utility.
I will give you fair warning now: It has become fashionable to insert the topic of scatterplots and
regressions into algebra and other non-statistics classes, and to require students to use a graphing
calculator to answer questions. While they may give you the slope formula and the Quadratic Formula
and all sorts of other stuff on the test (even though you should have memorized them), they will NOT
give you help with your calculator. They often don't seem to care if you've learned the math, but you
had gosh-darned better know your calculator! So pull out your owner's manual, or go to the
manufacturer's web site, or search online, or get together with a friend NOW, because if you're doing
this stuff in class, you ARE going to have to know it, and know it well, on the test.
Tell whether the data graphed in the following scatterplots appear to have positive,
negative, or no correlation.
Plot A Plot B
Plot C Plot D
Plot A: Low x-values correspond to high y-values, and high x-values correspond to low y-
values. If I put a line through the dots, it would have a negative slope. This scatterplot shows a
negative correlation.
Plot B: Low x-values correspond to low y-values, and high x-values correspond to high y-
values. If I put a line through the dots, it would have a positive slope. This scatterplot shows a
positive correlation.
Plot C: There doesn't seem to be any trend to the dots; they're just all over the place. This
scatterplot shows no correlation.
Plot D: I might think that this plot shows a correlation, because I can clearly put a line through
the dots. But the line would be horizontal, thus having a slope value of zero. These dots
actually show that whatever is being measured on the x-axis has no bearing on whatever is
being measured on the y-axis, because the value of x has no effect on the value of y. So even
though I could draw a line through these points, this scatterplot still shows no correlation.
You may also be asked about "outliers", which are the dots that don't seem to fit with the rest of the
dots. (There are more technical definitions of "outliers", but they will have to wait until you take
statistics classes.) Maybe you dropped the crucible in chem lab, or maybe you should never have left
your idiot lab partner alone with the Bunsen burner in the middle of the experiment. Whatever the
cause, having outliers means you have points that don't line up with everything else.
Most of the points seem to line up in a fairly straight line, but the dot at (6, 7) is way off to the
side from the general trend-line of the points.
Usually you'll be working with scatterplots where the dots line up in some sort of vaguely straight row.
But you shouldn't expect everything to line up nice and neat, especially in "real life" (like, for instance,
in a physics lab). And sometimes you'll need to pick a different sort of equation as a model, because
the dots line up, but not in a straight line.
Tell which sort of equation you think would best model the data in the following
scatterplots, and why.
Graph A: The dots look like they line up fairly straight, so a linear model would probably work
well.
Graph B: The dots here do line up, but as more of a curvy line. A quadratic model might work
better.
Graph C: The dots are very close to the x-axis, and then they shoot up, so an exponential or
power-function model might work better here.
In general, expect only to need to recognize linear (straight-line) versus quadratic (curvy-line) models,
and never anything that you haven't already covered in class. For instance, if you haven't done logs
yet, you won't be expected to recognize the need for a logarithmic model for a given scatterplot. The
next lesson explains how to define these models, called "regressions".
The process of taking your data points and coming up with an equation is called "regression", and the
graph of the "regression equation" is called "the regression line". If you're doing your scatterplots by
hand, you may be told to find a regression equation by putting a ruler against the first and last dots in
the plot, drawing a line, and guessing the line's equation from the picture. This is an incredibly clumsy
way to proceed, and can give very wrong answers, especially since values at the ends often turn out
to be outliers (numbers that don't quite fit with everything else).
If you're finding regression equations with a ruler, you'll need to work extremely neatly, of course, and
using graph paper would probably be a really good idea. Once you've drawn in your line (and this will
only work for linear, or straight-line, regressions), you will estimate two points on the line that seem to
be close to where the gridlines intersect, and then find the line equation through those two points.
From the above graph, I would guess that the line goes close to the points (3, 7) and (19, 1), so the
regression equation would be y = (–3/8)x + 65/8.
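The arithmetic behind that equation can be checked with a short Python sketch. The function name `line_through` is made up for this illustration; the two points are the ones estimated from the graph above.

```python
from fractions import Fraction

def line_through(p, q):
    """Return (slope, intercept) of the straight line through points p and q."""
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        raise ValueError("vertical line: slope is undefined")
    slope = Fraction(y2 - y1, x2 - x1)   # rise over run, kept as an exact fraction
    intercept = y1 - slope * x1          # solve y1 = slope*x1 + b for b
    return slope, intercept

slope, intercept = line_through((3, 7), (19, 1))
print(f"y = ({slope})x + {intercept}")   # y = (-3/8)x + 65/8
```

Working in exact fractions avoids the rounding you would get from decimals, so the result can be compared directly with a hand calculation.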
Most likely, though, you'll be doing regressions in your calculator. Doing regressions properly is a
difficult and technical process, but your graphing calculator has been programmed with the necessary
formulas and has the memory to crunch the many numbers. The calculator will give you "the"
regression line. If you're working by hand, you and your classmates will get slightly different answers;
if you're using calculators, you'll all get the same answer. (Consult your owner's manual or calculator
web sites for specific information on doing regressions with your particular calculator model.)
If you're supposed to report how "good" a given regression is, then figure out how to find the "r",
"r^2", and/or "R^2" values in your calculator. These diagnostic tools measure the degree to which the
regression equation matches the scatterplot. The closer these correlation values are to 1 (or to –1),
the better a fit your regression equation is to the data values. If the correlation value is more than 0.8
or less than –0.8, the match is judged to be pretty good; if the value is between –0.5 and 0.5, the
match is judged to be pretty poor; and a correlation value close to zero means you're kidding yourself
if you think there's really a relationship of the type you're looking for. (There should be instructions,
somewhere in your owner's manual, for finding this information.) When you're doing a regression,
you're trying to find the "best fit" line to the data, and the correlation numbers help you to tell how
good your "fit" is.
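If you'd like to see what the calculator is doing when it reports "r", here is a minimal Python sketch of the usual (Pearson) correlation formula for a linear fit, together with the rule of thumb from the paragraph above. The function names and the sample data are made up for illustration, and the label for values between 0.5 and 0.8 is my own, since the text gives no verdict for that range.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def describe_fit(r):
    """Informal rule of thumb taken from the text; not a formal test."""
    if abs(r) > 0.8:
        return "pretty good"
    if abs(r) < 0.5:
        return "pretty poor"
    return "no verdict in the text"

# Made-up data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
r = pearson_r(xs, ys)
print(round(r, 3), describe_fit(r))
```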
• Given the following data values, find the linear and cubic regression lines.
Say which regression is a better fit, and why.
(2, 23), (3, 24), (8, 32), (10, 36), (13, 51), (14, 59),
(17, 76), (20, 107), (22, 120), (23, 131), (27, 182)
After plugging these values into the STAT utility of my calculator, I can then do a linear
regression:
The dots look a little curvy on the scatterplot, so it's reasonable that the curvy line, the cubic
y = 0.000829x^3 + 0.23x^2 – 1.09x + 24.60, is a better fit to the data points than the
straight-line linear model y = 6.03x – 10.64.
Since the correlation value is closer to 1 for the cubic and since the graph of the
cubic model is closer to the dots, the cubic equation
y = 0.000829x^3 + 0.23x^2 – 1.09x + 24.60 is the better regression.
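As a rough cross-check of this comparison without a calculator, the Python sketch below fits the least-squares line from the data and then computes R^2 for both models. The helper names are made up for illustration, and the cubic coefficients are simply copied (already rounded) from the text rather than refitted, so its R^2 is only approximate.

```python
data = [(2, 23), (3, 24), (8, 32), (10, 36), (13, 51), (14, 59),
        (17, 76), (20, 107), (22, 120), (23, 131), (27, 182)]

def least_squares_line(points):
    """Slope and intercept of the least-squares straight line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(points, model):
    """1 - (sum of squared residuals) / (total sum of squares about the mean)."""
    mean_y = sum(y for _, y in points) / len(points)
    ss_res = sum((y - model(x)) ** 2 for x, y in points)
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - ss_res / ss_tot

slope, intercept = least_squares_line(data)

def linear(x):
    return slope * x + intercept

def cubic(x):
    # Coefficients as quoted in the text, not recomputed here
    return 0.000829 * x**3 + 0.23 * x**2 - 1.09 * x + 24.60

r2_linear = r_squared(data, linear)
r2_cubic = r_squared(data, cubic)
print(f"linear: y = {slope:.2f}x + ({intercept:.2f})")
print(f"R^2 linear = {r2_linear:.3f}, R^2 cubic = {r2_cubic:.3f}")
```

The fitted line agrees with the calculator's y = 6.03x – 10.64, and the cubic's R^2 comes out closer to 1 than the line's, which matches the conclusion above.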
You shouldn't expect, by the way, always to get correlation values that are close to "1". If they tell
you to find, say, the linear regression equation for a data set, and the correlation factor is close to
zero, this doesn't mean that you've found the "wrong" linear equation; it only means that a linear
equation probably wasn't a good model to the data. A quadratic model, for instance, might have been
better.
Remember that the point of all this data-collection, dot-drawing, and regression-computing was to try
to find a formula that models... whatever it is that they're measuring. You can use these models to try
to find missing data points or to try to project into the future (or, sometimes, into the past).
If you have data, say, for the years 1950, 1960, 1970, and 1980, and you find a model for your data,
you might use it to guess at values between these dates. For instance, given Namibian population
data for the listed years, you might try to guess the population of Namibia in 1965. The prefix "inter"
means "between", so this guessing-between-the-points would be interpolation. On the other hand,
you might try to work backwards to guess the population in 1940, or try to fill in the missing data up
through 2000. The prefix "extra" means "outside", so this guessing-outside-the-points would be
extrapolation.
• Find a regression equation for the following population data, using t = 0 to stand for
1950. Then estimate the population of Namibia in the years 1940, 1997, and 2005. Note:
Population values are in thousands.
year t     0     5    10    15    20    25    30    35    40    45    50
pop.     511   561   625   704   800   921  1018  1142  1409  1646  1894
Setting my window range as 0 < X < 55, counting by 5's, and 500 < Y < 2000, counting by
250's, my calculator gives me the following scatterplot:
The dots look like they line up in a curve, so I'll try a quadratic regression. The calculator gives
me:
As you can see, I've set the calculator to "DiagnosticsOn", so it displays the correlation value
whenever I do a regression. This regression looks pretty darned good, especially when it's
graphed with the data values:
Now that I have an equation for modelling Namibia's population, I can use it to estimate the
population in the given years. For 1940, I'll use t = –10, since this is ten years before 1950.
(This is an extrapolated value, since I'm going outside the data set.)
For 2005, I'll use t = 55; this will be another extrapolated value.
For 1997, I'll use t = 47. Since this value is between known values, this will be an interpolated
answer.
Remembering that the population values are in thousands, I'll add three zeroes to my numbers
and round to get my final answers:
The estimated value for the population in 1940 is about 569 000; for 2005, the
estimated value is about 2.15 million; and for 1997, the estimated value is about
1.73 million.
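If you don't have the calculator to hand, the same quadratic regression can be sketched in Python. This is only one way of doing it, under my own assumptions: the `polyfit` and `evaluate` helpers are written for this illustration (they solve the least-squares "normal equations" in exact fractions), and the calculator's internal method may differ.

```python
from fractions import Fraction

years_since_1950 = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
population = [511, 561, 625, 704, 800, 921, 1018, 1142, 1409, 1646, 1894]  # thousands

def polyfit(xs, ys, degree):
    """Least-squares polynomial fit; returns coefficients, lowest power first."""
    size = degree + 1
    # Normal equations: (sum of x^(i+j)) * c_j = sum of y * x^i
    a = [[Fraction(sum(x ** (i + j) for x in xs)) for j in range(size)]
         for i in range(size)]
    b = [Fraction(sum(y * x ** i for x, y in zip(xs, ys))) for i in range(size)]
    # Gauss-Jordan elimination in exact rational arithmetic
    for col in range(size):
        pivot = a[col][col]
        a[col] = [v / pivot for v in a[col]]
        b[col] /= pivot
        for row in range(size):
            if row != col:
                factor = a[row][col]
                a[row] = [rv - factor * cv for rv, cv in zip(a[row], a[col])]
                b[row] -= factor * b[col]
    return [float(c) for c in b]

def evaluate(coeffs, x):
    """Value of the fitted polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))

coeffs = polyfit(years_since_1950, population, 2)
estimates = {year: evaluate(coeffs, year - 1950) for year in (1940, 1997, 2005)}
for year, value in estimates.items():
    print(year, round(value), "thousand")
```

The three estimates round to the figures quoted above. Keep in mind that 1940 and 2005 are extrapolations, so they deserve less trust than the interpolated 1997 value.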
Depending on your calculator, you may need to memorize what the regression values mean. On my
old TI-85, the regression screen would list values for a and b for a linear regression. But I had to
memorize that the related regression equation was "a + bx", instead of the "ax + b" that I would
otherwise have expected, because the screen didn't say. If you need to memorize this sort of
information, do it now, because the teacher will not bail you out if you forget on the test what your
calculator's variables mean.
Standard Deviation
The standard deviation measures the spread of the data about the mean
value. It is useful in comparing sets of data which may have the same mean
but a different range. For example, the following two sets have the same mean (15):
15, 15, 15, 14, 16 and 2, 7, 14, 22, 30. However, the second is clearly more
spread out. If a set has a low standard deviation, the values are not spread out
too much.
Just like when working out the mean, the method is different if the data is
given to you in groups.
Non-Grouped Data
Example
x               4     9    11    12    17     5     8    12    14
(x - mean)^2 38.7  1.49  0.60  3.16  45.9  27.3  4.94  3.16  14.3
Now add up these results (this is the 'sigma' in the formula): 139.55
Divide by n. n is the number of values, so in this case it is 9. This gives us:
15.51
Finally, take the square root of this to get the standard deviation: about 3.94
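The steps of this example can be followed in a few lines of Python. This is a sketch of the "divide by n" version of the calculation used above; the variable names are my own. Carrying full precision, the sum of squares comes out as 139.56 rather than 139.55, because the table entries were rounded before being added.

```python
import math

values = [4, 9, 11, 12, 17, 5, 8, 12, 14]

mean = sum(values) / len(values)
squared_deviations = [(x - mean) ** 2 for x in values]
variance = sum(squared_deviations) / len(values)   # divide by n, as in the text
std_dev = math.sqrt(variance)                      # the standard deviation itself

print(f"mean = {mean:.2f}")
print(f"sum of squared deviations = {sum(squared_deviations):.2f}")
print(f"variance = {variance:.2f}, standard deviation = {std_dev:.2f}")
```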
http://www.mathsrevision.net/gcse/pages.php?page=42 23/02/2009
The standard deviation can usually be calculated much more easily with a
calculator and this may be acceptable in some exams. On my calculator, you
go into the standard deviation mode (mode '.'). Then type in the first value,
press 'data', type in the second value, press 'data'. Do this until you have
typed in all the values, then press the standard deviation button (it will
probably have a lower case sigma on it). Check your calculator's manual to see
how to calculate it on yours.
NB: If each number in a set (e.g. 1, 5, 2, 7, 3, 5 and 3) is increased by the same
amount (e.g. to 3, 7, 4, 9, 5, 7 and 5), the standard deviation stays the same,
while the mean increases by the amount that was added to each number (2 in
this case). This is because the standard deviation measures the spread of the
data. Increasing each of the numbers by 2 does not make the numbers any more
spread out; it just shifts them all along.
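This is easy to confirm with Python's standard `statistics` module, where `pstdev` is the "divide by n" standard deviation. The sketch below just uses the two lists from the note above.

```python
from statistics import mean, pstdev

original = [1, 5, 2, 7, 3, 5, 3]
shifted = [x + 2 for x in original]

print(shifted)                            # [3, 7, 4, 9, 5, 7, 5]
print(mean(shifted) - mean(original))     # the mean moves by the amount added
print(pstdev(original), pstdev(shifted))  # the spread is unchanged
```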
Grouped Data
x f
4 9
5 14
6 22
7 11
8 17
Try working out the standard deviation of the above data. You should get an
answer of 1.32 .
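To check your working, here is a short Python sketch for the frequency table above. Each value x is counted f times, and the total is divided by the sum of the frequencies. The dictionary layout and names are my own choice.

```python
import math

table = {4: 9, 5: 14, 6: 22, 7: 11, 8: 17}   # value x -> frequency f

n = sum(table.values())                        # total number of data points
mean = sum(x * f for x, f in table.items()) / n
variance = sum(f * (x - mean) ** 2 for x, f in table.items()) / n
std_dev = math.sqrt(variance)

print(round(std_dev, 2))   # 1.32
```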
You may be given the data in the form of groups, such as:
Number      Frequency
3.5 - 4.5   9
4.5 - 5.5   14
5.5 - 6.5   22
6.5 - 7.5   11
7.5 - 8.5   17
For instance, suppose you have the following list of values: 12, 13, 21, 27, 33, 34, 35, 37, 40, 40,
41. You could make a frequency distribution table showing how many tens, twenties, thirties, and
forties you have:
Class     Frequency
10 - 19   2
20 - 29   2
30 - 39   4
40 - 49   3
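Building such a table is just a matter of counting how many values fall into each class. A minimal Python sketch, assuming classes of width ten as above (the `#` bars are only a crude text stand-in for a histogram):

```python
from collections import Counter

values = [12, 13, 21, 27, 33, 34, 35, 37, 40, 40, 41]
class_width = 10

# Map each value to the lower limit of its class, then count
counts = Counter((v // class_width) * class_width for v in values)

for lower in sorted(counts):
    upper = lower + class_width - 1
    print(f"{lower} - {upper}  {counts[lower]}  {'#' * counts[lower]}")
```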
You could make a histogram, which is a bar-graph showing the number of occurrences, with the
classes being numbers in the tens, twenties, thirties, and forties:
(The shading of the bars in a histogram isn't necessary, but it can be helpful by making the bars
easier to see, especially if you can't use color to differentiate the bars.)
The downside of frequency distribution tables and histograms is that, while the frequency of each
class is easy to see, the original data points have been lost. You can tell, for instance, that there must
have been three listed values that were in the forties, but there is no way to tell from the table or from
the histogram what those values might have been.
On the other hand, you could make a stem-and-leaf plot for the same data:
The "stem" is the left-hand column which contains the tens digits. The "leaves" are the lists in the
right-hand column, showing all the ones digits for each of the tens, twenties, thirties, and forties. As
you can see, the original values can still be determined; you can tell, from that bottom leaf, that the
three values in the forties were 40, 40, and 41.
Note that the horizontal leaves in the stem-and-leaf plot correspond to the vertical bars in the
histogram, and the leaves have lengths that equal the numbers in the frequency table.
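Since the plot itself is only a sorted grouping of digits, it can be sketched in a few lines of Python. The function name `stem_and_leaf` is made up for this illustration, and this simple version only prints stems that have at least one leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group whole numbers by tens digit (stem), keeping ones digits (leaves)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)

values = [12, 13, 21, 27, 33, 34, 35, 37, 40, 40, 41]
plot = stem_and_leaf(values)
for stem in sorted(plot):
    print(f"{stem} | {' '.join(str(leaf) for leaf in plot[stem])}")
```

The number of leaves on each row matches the frequency table (2, 2, 4, 3), and the original values can be rebuilt by gluing each stem back onto its leaves.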
That's pretty much all there is to a stem-and-leaf plot. You're just listing out how many entries you
have in certain classes of numbers, and what those entries are. Here are some more examples of
stem-and-leaf plots, containing a few additional details.
• Complete a stem-and-leaf plot for the following list of grades on a recent test:
I'll use the tens digits as the stem values and the ones digits as the leaves. For convenience's
sake, I'll order the list, but this is not required:
Since I know where these data points came from ("a recent test"), I'll use a title. Then my plot
looks like this:
The above is the simplest case for stem-and-leaf plots, but even the "complicated" cases aren't much
more complex.
7.6, 8.1, 9.2, 6.8, 5.9, 6.2, 6.1, 5.8, 7.3, 8.1, 8.7, 7.4, 7.8, 8.2
5.8, 5.9, 6.1, 6.2, 6.8, 7.3, 7.4, 7.6, 7.8, 8.1, 8.1, 8.2, 8.7, 9.2
These values have one decimal place, but the stem-and-leaf plot makes no accommodation for
this. The stem-and-leaf plot only looks at the last digit (for the leaves) and all the digits before
(for the stem). So I'll have to put a "key" or legend on this plot to show what I mean by the
numbers in this plot. The ones digits will be the stem values, and the tenths will be the leaves.
• Complete a stem-and-leaf plot for the following two lists of class sizes:
Economics 101: 9, 13, 14, 15, 16, 16, 17, 19, 20, 21, 21, 22, 25, 25, 26
Libertarianism: 14, 16, 17, 18, 18, 20, 20, 24, 29
This example has two lists of values. Since the values are similar, I can plot them all on one
stem-and-leaf plot by drawing leaves on either side of the stem. I will use the tens digits as the
stem values, and the ones digits as the leaves. Since "9" (in the Econ 101 list) has no tens
digit, the stem value will be "0".
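A two-sided plot like this can be sketched in Python as follows. Which list goes on which side is my own assumption (Economics 101 on the left here), and the left-hand leaves are written in reverse so that the smallest leaf sits next to the stem, which is a common convention for back-to-back plots.

```python
def leaves_by_stem(values):
    """Map each tens digit to the sorted list of its ones digits (as text)."""
    rows = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows.setdefault(stem, []).append(str(leaf))
    return rows

econ = [9, 13, 14, 15, 16, 16, 17, 19, 20, 21, 21, 22, 25, 25, 26]
libertarianism = [14, 16, 17, 18, 18, 20, 20, 24, 29]

left, right = leaves_by_stem(econ), leaves_by_stem(libertarianism)
stems = range(min(min(left), min(right)), max(max(left), max(right)) + 1)
width = max(len(" ".join(row)) for row in left.values())

lines = []
for stem in stems:
    # Reverse the left-hand leaves so they read outward from the stem
    left_text = " ".join(reversed(left.get(stem, [])))
    right_text = " ".join(right.get(stem, []))
    lines.append(f"{left_text:>{width}} | {stem} | {right_text}")
print("\n".join(lines))
```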
100, 110, 120, 130, 130, 150, 160, 170, 170, 190,
210, 230, 240, 260, 270, 270, 280, 290, 290
Since all the ones digits are zeroes, I'll do this plot with the hundreds digits being the stem
values and the tens digits being the leaves. I can do the plot like this:
...but the leaves are fairly long this way, because the values are so close together. To spread
the values out a bit, I can break each leaf into two. For instance, the leaf for the two-hundreds
class can be split into two classes, being the numbers between 200 and 240 and the numbers
between 250 and 290. I can also reverse the order, so the smaller values are at the bottom of
the "stem". The new plot looks like this:
For very compact data points, you can even split the leaves into five classes, like this:
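The splitting idea can be sketched in Python as well. The function `split_stem_plot` and its `parts` parameter are invented for this illustration: `parts=2` splits each stem into leaves 0-4 and 5-9, while `parts=5` splits it into the pairs 0-1, 2-3, and so on.

```python
def split_stem_plot(values, leaf_unit=10, parts=2):
    """Group leaves by (stem, sub-class) so each stem is split into `parts` rows."""
    per_part = 10 // parts                 # how many leaf digits share a row
    rows = {}
    for v in sorted(values):
        stem, leaf = divmod(v // leaf_unit, 10)
        rows.setdefault((stem, leaf // per_part), []).append(leaf)
    return rows

values = [100, 110, 120, 130, 130, 150, 160, 170, 170, 190,
          210, 230, 240, 260, 270, 270, 280, 290, 290]

rows = split_stem_plot(values, leaf_unit=10, parts=2)
# Print the largest classes first, so the smaller values end up at the bottom
for key in sorted(rows, reverse=True):
    stem, _ = key
    print(f"{stem} | {' '.join(map(str, rows[key]))}")
```

Calling `split_stem_plot(values, parts=5)` gives the five-way split instead.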
23.25, 24.13, 24.76, 24.81, 24.98, 25.31, 25.57, 25.89, 26.28, 26.34, 27.09
If I try to use the last digit, the hundredths digit, for these numbers, the stem-and-leaf plot will
be enormously long, because these values are so spread out. (With the numbers' first three
digits ranging from 232 to 270, I'd have thirty-nine leaves, most of which would be empty.) So
instead of working with the given numbers, I'll round each of the numbers to the nearest tenth,
and then use those new values for my plot. Rounding gives me the following list:
23.3, 24.1, 24.8, 24.8, 25.0, 25.3, 25.6, 25.9, 26.3, 26.3, 27.1
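If you do this rounding by computer rather than by hand, be a little careful: the list above rounds halves upward (23.25 becomes 23.3), whereas Python's built-in `round` uses round-half-to-even, so `round(23.25, 1)` gives 23.2. A minimal sketch using the standard `decimal` module with explicit half-up rounding (entering the values as strings is my own choice, to avoid floating-point surprises):

```python
from decimal import Decimal, ROUND_HALF_UP

raw = ["23.25", "24.13", "24.76", "24.81", "24.98", "25.31",
       "25.57", "25.89", "26.28", "26.34", "27.09"]

# Round each value to one decimal place, with halves going up
rounded = [Decimal(v).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP) for v in raw]
print(", ".join(str(v) for v in rounded))
```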
Naturally, when you're drawing a stem-and-leaf plot, you should use a ruler to construct a neat table,
and you should label everything clearly.