Note 02

Review
histograms: bins, range, bin limits, bin interval

distributions of: frequency, relative frequency, density
Pareto diagram
dot diagrams (dotplot)
stem-and-leaf displays
data quantifiers: length, max, min, mean, median, variance, standard deviation
Chapter 2: Visualization methods for data
Sample Data:
compressive strength measurements of 58 samples of an aluminum alloy
66.4, 67.7, 68.0, 68.0, 68.3, 68.4, 68.6, 68.8, 68.9, 69.0, 69.1,
69.2, 69.3, 69.3, 69.5, 69.5, 69.6, 69.7, 69.8, 69.8, 69.9, 70.0,
70.0, 70.1, 70.2, 70.3, 70.3, 70.4, 70.5, 70.6, 70.6, 70.8, 70.9,
71.0, 71.1, 71.2, 71.3, 71.3, 71.5, 71.6, 71.6, 71.7, 71.8, 71.8,
71.9, 72.1, 72.2, 72.3, 72.4, 72.6, 72.7, 72.9, 73.1, 73.3, 73.5,
74.2, 74.5, 75.3
Dot diagram: each data value represented as a point, with care taken so that points
with the same value do not overlap
Note: for a fixed set of (equal-width) bin limits, the frequency, relative frequency,
and density distributions all have the same shape; only the vertical scale changes.
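The note above can be checked numerically. The following is a minimal sketch in R, assuming bin limits at 66, 67, ..., 76 (this choice of limits is an assumption made for illustration, not one stated in the notes) and using the alloy data as listed above.

```r
# Alloy compressive strengths, copied from the sample data listing above
x <- c(66.4, 67.7, 68.0, 68.0, 68.3, 68.4, 68.6, 68.8, 68.9, 69.0, 69.1,
       69.2, 69.3, 69.3, 69.5, 69.5, 69.6, 69.7, 69.8, 69.8, 69.9, 70.0,
       70.0, 70.1, 70.2, 70.3, 70.3, 70.4, 70.5, 70.6, 70.6, 70.8, 70.9,
       71.0, 71.1, 71.2, 71.3, 71.3, 71.5, 71.6, 71.6, 71.7, 71.8, 71.8,
       71.9, 72.1, 72.2, 72.3, 72.4, 72.6, 72.7, 72.9, 73.1, 73.3, 73.5,
       74.2, 74.5, 75.3)

# Assumed bin limits 66, 67, ..., 76; bins are (66,67], (67,68], ...
h <- hist(x, breaks = 66:76, plot = FALSE)

freq     <- h$counts                    # frequency
rel_freq <- h$counts / length(x)        # relative frequency
dens     <- rel_freq / diff(h$breaks)   # density = relative frequency / bin width

# The three columns are proportional to one another, so the bar heights
# have the same shape; only the vertical scale differs.
print(data.frame(lower = head(h$breaks, -1), freq, rel_freq, dens))
```

Changing `breaks` changes all three distributions together, which is why the bin limits must be held fixed for the comparison.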
Relative frequency plot with cumulative distribution (Pareto diagram)
Unlike the dot diagram, which displays every value, frequency distributions suppress
information on individual data values.
Stem-Leaf displays provide a tabular way of organizing the display of all data values

stem   leaves                      cumulative leaf count
66     4                            1
67     7                            2
68     0 0 3 4 6 8 9                9
69     0 1 2 3 3 4 4 6 7 8 8 9     21
70     0 0 1 2 3 3 4 5 6 6 8 9     33
71     0 1 2 3 3 5 6 6 7 8 8 9     45
72     1 2 3 4 6 7 9               52
73     1 3 5                       55
74     2 5                         57
75     3                           58

The first column holds the stem labels. Leaf entries are comprised of the last digit of
the data values (e.g. stem 69 with leaf 1 represents 69.1). The cumulative leaf count
is used to check that no data were lost.
In addition to knowing the average behavior (value) of a set of data, we would like to
know how much the data is spread about the average. This is computed by deviations
from the mean
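Let $\bar{x}$ be the mean value of the data $x_1, x_2, \ldots, x_i, \ldots, x_n$.
The differences $x_1 - \bar{x},\ x_2 - \bar{x},\ \ldots,\ x_i - \bar{x},\ \ldots,\ x_n - \bar{x}$ are called the deviations
from the mean.
The sum of the deviations is zero: $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$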
(i.e. the sum of the positive deviations exactly cancels the sum of the negative
deviations.) This is one of the nice properties of the mean.
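A short check of this property, written out from the definition of the mean, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (a sketch in standard notation, not taken from the notes):

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0
```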
Summing the squares of the deviations gives a non-zero result which gives a useful
measure of the spread of the data about the mean. The common measure for this
spread is the sample variance, $s^2$:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Since the deviations sum to 0, the value of any one of them can always be calculated
from the other $n - 1$. Therefore only $n - 1$ of the deviations are independent, explaining
the $n - 1$ (instead of $n$) appearing in the denominator of $s^2$.
Since the units of the sample variance are the square of the units of the data, we
define the sample standard deviation
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
If the deviations from the mean are generally large, the standard deviation will be
large. If the deviations from the mean are generally small, the standard deviation will
be small.
When do we know the standard deviation is large? Is s = 1.67 large or small?

That depends on the mean value. Thus we use
the fractional relative variation, $s / \bar{x}$,
or the percent coefficient of variation, $\frac{s}{\bar{x}} \cdot 100\%$,
to quantify the size of the standard deviation.
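e.g. Sets of measurements made with micrometers A and B give the following table of
results. Which micrometer's measurements are more precise?

Micrometer   mean       std dev      coef of var
A            3.92 mm    0.0152 mm    0.39%
B            1.54 in    0.0086 in    0.56%

Micrometer A has the smaller coefficient of variation, so its measurements vary less
relative to their mean. Because the coefficient of variation is unitless, the comparison
is valid even though A is in mm and B is in inches.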
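The coefficient-of-variation column can be reproduced from the mean and standard deviation columns. A minimal sketch in R (the helper name `coef_var` is hypothetical, introduced here only for illustration):

```r
# Percent coefficient of variation: (s / xbar) * 100
coef_var <- function(s, xbar) 100 * s / xbar

cv_A <- coef_var(s = 0.0152, xbar = 3.92)  # micrometer A (mm)
cv_B <- coef_var(s = 0.0086, xbar = 1.54)  # micrometer B (in)

# Rounded to two decimals for comparison with the table
print(round(c(A = cv_A, B = cv_B), 2))
```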

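Sample variance & standard deviation
Sample variance:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Sample standard deviation:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
Divisor: $n - 1$ instead of $n$,
because the $x_i$'s tend to be closer to $\bar{x}$ than to $\mu$.
Using $n$ tends to underestimate $\sigma^2$.
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1}$$
(hand-held calculator formula)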
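The hand-held calculator formula follows from expanding the squared deviations. A short derivation sketch, using $\sum_{i=1}^{n} x_i = n\bar{x}$:

```latex
\sum_{i=1}^{n} (x_i - \bar{x})^2
  = \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2
  = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2
  = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}
```

Dividing either end of this chain by $n - 1$ gives $s^2$.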
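Alternate way to compute sample variance
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$$
Operations count

Operation   deviation form               alternate form
+           n-1                          n-1
-           n+1                          2
x           n                            n+2
/           1                            1
Total       (2n) + (n+1)                 (n+1) + (n+3)
            more efficient on a computer hand calculator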
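Example 1.10: 6 12 6 6 4 8

Observation   $x$    $x^2$
1             6      36
2             12     144
3             6      36
4             6      36
5             4      16
6             8      64
Total         $\sum x = 42$    $\sum x^2 = 332$

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1} = \frac{332 - 42^2/6}{6 - 1} = 7.6$$
$$s = \sqrt{7.6} = 2.76$$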
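The arithmetic of Example 1.10 can be checked in R. The sketch below computes the variance from the definition and from the hand-calculator formula, then compares both with the built-in `var` and `sd`, which use the $n - 1$ divisor.

```r
x <- c(6, 12, 6, 6, 4, 8)   # data of Example 1.10
n <- length(x)
xbar <- mean(x)

# Definition: sum of squared deviations over n - 1
s2_def <- sum((x - xbar)^2) / (n - 1)

# Hand-calculator formula: (sum of squares - (sum)^2 / n) / (n - 1)
s2_calc <- (sum(x^2) - sum(x)^2 / n) / (n - 1)

# Built-in functions for comparison
print(c(definition = s2_def, calculator = s2_calc, builtin = var(x)))
print(sd(x))
```

On paper the two formulas agree exactly; in floating-point arithmetic the calculator form can lose accuracy when the mean is large relative to the spread, so the deviation form is usually the safer choice in code.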
(3) Measures of Variation
A measure of location cannot give a complete summary of data.
Quartiles and Percentiles
In the same way that we use the median to find the center point of a data set
(1/2 of the data values lie below the median, 1/2 of the values lie above),
we can define the first quartile Q1, the data value that has 1/4 of the data points lying
below it (and 1 - 1/4 = 3/4 of the points lying above it). Since 1/4 of the data corresponds to
25% of the data, Q1 is also referred to as the 25th percentile P0.25 of the data set.
We can generalize this to the 100p-th percentile P0.p, the data value that has 100p% of
the data points lying below it (and 100(1 - p)% of the data points lying above it).
The 50th percentile P0.50 is known as the second quartile Q2 and is, in fact, identical
to the median of the data set.
The 75th percentile P0.75 is known as the third quartile Q3 .
To compute percentile P0.p in a data set of n values:

1. order the data smallest to largest
2. compute the product np
   - if np is not an integer, round it up to the next integer value and find the
     corresponding data point
   - if np is an integer k, take the average of the kth and (k+1)st data values.
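The two-step rule above can be sketched as a small R function. The name `percentile` is hypothetical, the function assumes 0 < p < 1, and it uses the exam scores of Example 1.11 as test data.

```r
# Percentile by the rule above; `percentile` is a hypothetical helper name.
# Assumes 0 < p < 1 so that the indices k and k + 1 exist.
percentile <- function(x, p) {
  x <- sort(x)                 # step 1: order smallest to largest
  n <- length(x)
  np <- n * p                  # step 2: compute np
  k <- round(np)
  if (isTRUE(all.equal(np, k))) {
    # np is (numerically) an integer k: average the kth and (k+1)st values
    (x[k] + x[k + 1]) / 2
  } else {
    # np is not an integer: round up and take that data point
    x[ceiling(np)]
  }
}

scores <- c(81, 57, 85, 84, 99, 90, 69, 76, 76, 83)  # Example 1.11
print(percentile(scores, 0.25))
print(percentile(scores, 0.50))
print(percentile(scores, 0.75))
```

The comparison via `all.equal` rather than `==` is a design choice: a product such as 10 * 0.7 is not exactly 7 in floating-point arithmetic, and an exact test would wrongly round it up.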
e.g. The Q1, Q2 and Q3 values for the alloy strength data set are:
Q1 (p = 0.25): np = 0.25 x 58 = 14.5, round up to 15. From the Stem-Leaf table,
Q1 = $x_{15}$ = 69.4
Q2 (p = 0.50): np = 0.50 x 58 = 29. Average the 29th and 30th values. From the
Stem-Leaf table, Q2 = ($x_{29}$ + $x_{30}$)/2 = (70.5 + 70.6)/2 = 70.55, as previously computed
for the median value of this data set
Q3 (p = 0.75): np = 0.75 x 58 = 43.5, round up to 44. From the Stem-Leaf table,
Q3 = $x_{44}$ = 71.8
As previously implied, the range of the data is
range = maximum - minimum
      = $x_n - x_1$ when the data are ordered smallest to largest
The middle half of the data lies in the interquartile range
interquartile range = Q3 - Q1
e.g. The range for the alloy strength data set is 75.3 - 66.4 = 8.9
The interquartile range is Q3 - Q1 = 71.8 - 69.4 = 2.4

Boxplots (aka box-whisker plots) plot quartile information: a box marked at Q1, Q2 and Q3,
with one whisker extending to the minimum data value and one whisker extending to the
maximum data value.
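Example 1.11: Final exam scores
81 57 85 84 99 90 69 76 76 83
Order: 57 69 76 76 81 83 84 85 90 99
$n = 10$. Median $= Q_2 = \frac{x_{n/2} + x_{n/2+1}}{2} = \frac{x_5 + x_6}{2} = \frac{81 + 83}{2} = 82$
($np = 10 \times 0.5 = 5$)
Q1: $np = 10(0.25) = 2.5 \to 3$, $Q_1 = x_3 = 76$
Q3: $np = 10(0.75) = 7.5 \to 8$, $Q_3 = x_8 = 85$
min: 57, max: 99, range: 99 - 57 = 42
IQR = $Q_3 - Q_1$ = 85 - 76 = 9
$Q_1 - 1.5(\text{IQR}) = 76 - 1.5(9) = 62.5$
$Q_3 + 1.5(\text{IQR}) = 85 + 1.5(9) = 98.5$
Outliers: 57 & 99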
Boxplots
Sample range: $x_n - x_1$ = max - min
Interquartile range (IQR): $Q_3 - Q_1$ (length of the middle half)
Outlier: an observation which is less than $Q_1 - 1.5(\text{IQR})$ or greater than
$Q_3 + 1.5(\text{IQR})$
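The outlier rule above can be sketched in R as follows. The helper name `find_outliers` is hypothetical, and the quartiles are passed in explicitly (here the hand-computed values for the exam scores of Example 1.11) so that any quartile convention can be used.

```r
# Flag outliers with the 1.5 * IQR rule; `find_outliers` is a hypothetical helper.
find_outliers <- function(x, q1, q3) {
  iqr <- q3 - q1
  lower <- q1 - 1.5 * iqr      # lower fence
  upper <- q3 + 1.5 * iqr      # upper fence
  list(lower = lower, upper = upper,
       outliers = x[x < lower | x > upper])
}

scores <- c(81, 57, 85, 84, 99, 90, 69, 76, 76, 83)   # Example 1.11
res <- find_outliers(scores, q1 = 76, q3 = 85)        # hand-computed quartiles
print(res)
```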
Example 1.12: 78 99 47 53 71 69 60 57 45 88 59
Order: 45 47 53 57 59 60 69 71 78 88 99
$n = 11$. Median $= Q_2 = x_{(n+1)/2} = x_{12/2} = x_6 = 60$
($np = 11 \times 0.5 = 5.5 \to 6$)
Q1: $np = 11(0.25) = 2.75 \to 3$, $Q_1 = x_3 = 53$
Q3: $np = 11(0.75) = 8.25 \to 9$, $Q_3 = x_9 = 78$
min: 45, max: 99, range: 99 - 45 = 54
IQR = $Q_3 - Q_1$ = 78 - 53 = 25
$Q_1 - 1.5(\text{IQR}) = 53 - 1.5(25) = 15.5$
$Q_3 + 1.5(\text{IQR}) = 78 + 1.5(25) = 115.5$
No outliers
90th percentile:
$np = 11 \times 0.9 = 9.9 \to 10$, $P_{0.90} = x_{10} = 88$
Using R
The 100p-th percentile of data a can be obtained as
>quantile(a, p)
More than one percentile, say n percentiles, can be obtained as
>quantile(a, c(p1, p2, ..., pn))
For the data in Example 1.11, the quartiles can be obtained as
>a=c(81,57,85,84,99,90,69,76,76,83)
>quantile(a, c(0.25, 0.50, 0.75))
25% 50% 75%
76.00 82.00 84.75
>min(a)
>max(a)
>IQR(a)
>boxplot(a)
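The 75% value of 84.75 differs from the hand-computed Q3 = 85 in Example 1.11 because `quantile` defaults to `type = 7`, which interpolates linearly between ordered values. A minimal sketch of the comparison, using `type = 2` (the variant that averages at discontinuities), which for this data reproduces the hand rule:

```r
a <- c(81, 57, 85, 84, 99, 90, 69, 76, 76, 83)   # Example 1.11

# Default (type = 7): linear interpolation between order statistics
q_default <- quantile(a, c(0.25, 0.50, 0.75))

# type = 2: round np up, or average the two neighbours when np is an integer
q_hand <- quantile(a, c(0.25, 0.50, 0.75), type = 2)

print(q_default)
print(q_hand)
print(IQR(a, type = 2))   # interquartile range under the same convention
```

Note also that `boxplot` by default draws whiskers only to the most extreme points within 1.5 IQR of the box and plots points beyond that individually; `boxplot(a, range = 0)` extends the whiskers to the minimum and maximum.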
Using R
A side-by-side boxplot of a and b can be obtained by
>boxplot(a, b)
For Example 1.12,
>b=c(78,99,47,53,71,69,60,57,45,88,59)
>boxplot(a, b)
Shapes of Distributions
(a) Symmetric
(b) Unimodal: single peak
(c) Bimodal: two peaks
(d) Uniform: The possible values appear with equal frequency
(e) Skewed: One side of the distribution is stretched out longer
than the other side
Skewness
Where the mean and median fall in a frequency
distribution tells us something about the skewness of the
data
If the mean and median have the same value, then the
data are not skewed.
If the mean is greater than the median, the data are
positively skewed (skewed to the right).
If the mean is less than the median, the data are
negatively skewed (skewed to the left).
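The mean-versus-median comparison above can be sketched in R. The helper name `skew_direction` is hypothetical, and the rule is only the rough indicator described in the text, applied here to the data of Examples 1.11 and 1.12.

```r
# Classify skewness by comparing mean and median (the rule stated above);
# `skew_direction` is a hypothetical helper name.
skew_direction <- function(x) {
  m <- mean(x)
  med <- median(x)
  if (isTRUE(all.equal(m, med))) {
    "not skewed"
  } else if (m > med) {
    "positively skewed (right)"
  } else {
    "negatively skewed (left)"
  }
}

a <- c(81, 57, 85, 84, 99, 90, 69, 76, 76, 83)       # Example 1.11: mean 80, median 82
b <- c(78, 99, 47, 53, 71, 69, 60, 57, 45, 88, 59)   # Example 1.12: mean 66, median 60
print(skew_direction(a))
print(skew_direction(b))
```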
(C) Stem-and-Leaf Plot
List one or more leading digits for the stem values.
The trailing digit(s) become the leaves.
Arrange the trailing digits in each row so they are in increasing order.
Qualitative (Categorical) Data
Example 1.6: Frequency distribution of enrollment in a high school

Class      Frequency   Relative Frequency
Algebra    26          0.26
English    30          0.30
Physics    19          0.19
Biology    24          0.24
Total      99          0.99 (possible round-off error)
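The round-off effect in the Total row can be reproduced in R. A minimal sketch: the exact relative frequencies sum to 1, while the two-decimal values shown in the table sum to 0.99.

```r
freq <- c(Algebra = 26, English = 30, Physics = 19, Biology = 24)

rel_freq <- freq / sum(freq)     # exact relative frequencies
rounded  <- round(rel_freq, 2)   # as tabulated, two decimals

print(rounded)
print(sum(rel_freq))             # exact values sum to 1
print(sum(rounded))              # rounded values need not
```

With these vectors, `barplot(rel_freq)` and `pie(freq)` would draw the corresponding bar graph and pie chart.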

Bar Graph: Enrollment (relative frequency, axis from 0 to 0.35, for Algebra, English, Physics, Biology)

Pie Chart: Enrollment (Algebra, English, Physics, Biology)

(E) Time Plots
Time plot (Time series plot): a plot of observations
against time, or the order in which they were
obtained.
(a) Trends: increasing or decreasing, changes in
location of the center, changes in variation
(b) Seasonal variation or cycles: fairly regular
increasing or decreasing movements
Example 5: Number of workers late in arriving at work
         Monday   Tuesday   Wednesday   Thursday   Friday
Week 1   6        3         2           4          7
Week 2   8        0         5           3          2
Week 3   7        2         1           0          2
Week 4   5        0         1           0          1
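To look for a trend or a weekly cycle in the Example 5 data, the table can be entered in R and summarized by week and by weekday. This is a sketch; the variable names `late` and `series` are chosen here for illustration.

```r
# Example 5 data: rows are weeks, columns are weekdays
late <- matrix(c(6, 3, 2, 4, 7,
                 8, 0, 5, 3, 2,
                 7, 2, 1, 0, 2,
                 5, 0, 1, 0, 1),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste("Week", 1:4),
                               c("Mon", "Tue", "Wed", "Thu", "Fri")))

# Observations in time order (week by week, Monday to Friday)
series <- as.vector(t(late))

print(rowSums(late))    # weekly totals: look for a trend
print(colMeans(late))   # weekday averages: look for a weekly cycle

# Uncomment to draw the time plot itself:
# plot(series, type = "b", xlab = "Working day", ylab = "Workers late")
```

The weekly totals fall from 22 to 7, a decreasing trend, and Monday has the highest weekday average, which suggests a weekly cycle.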
Comparison of mean, median and mode
Example 1.9: Professional baseball players' annual income

Income (in $1,000)   Number of players
30                   180
50                   100
100                  50
500                  20
1,000                10
2,000                15
3,000                15
5,000                10
Total                400
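The three measures can be computed from the Example 1.9 table in R by expanding it to one entry per player. A minimal sketch; the variable names are chosen here for illustration.

```r
income  <- c(30, 50, 100, 500, 1000, 2000, 3000, 5000)  # in $1,000
players <- c(180, 100, 50, 20, 10, 15, 15, 10)

all_incomes <- rep(income, times = players)   # one entry per player

print(mean(all_incomes))              # mean, pulled upward by the few large incomes
print(median(all_incomes))            # median
print(income[which.max(players)])     # mode: the most frequent income
```

The mean (401) is far above the median (50) and the mode (30), so the incomes are strongly skewed to the right and the mean is not a typical value.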
Summary
dot diagrams
distributions (histograms): bins, range, bin limits, bin marks, bin interval
distributions of: frequency, relative frequency, percent frequency, cumulative
frequency
distribution graphs, density graphs
stem-and-leaf displays
data quantifiers: sample mean, sample variance, sample standard deviation
Quartiles, percentiles, median, interquartile range
boxplots
