
Review

histograms: bins, range, bin limits, bin interval


distributions of: frequency, relative frequency, density
Pareto diagram
dot diagrams (dotplot)
stem-and-leaf displays
data quantifiers: length, max, min, mean, median, variance, standard deviation
Chapter 2: Visualization methods for data

Sample Data:
compressive strength measurements of 58 samples of an aluminum alloy
66.4, 67.7, 68.0, 68.0, 68.3, 68.4, 68.6, 68.8, 68.9, 69.0, 69.1,
69.2, 69.3, 69.3, 69.5, 69.5, 69.6, 69.7, 69.8, 69.8, 69.9, 70.0,
70.0, 70.1, 70.2, 70.3, 70.3, 70.4, 70.5, 70.6, 70.6, 70.8, 70.9,
71.0, 71.1, 71.2, 71.3, 71.3, 71.5, 71.6, 71.6, 71.7, 71.8, 71.8,
71.9, 72.1, 72.2, 72.3, 72.4, 72.6, 72.7, 72.9, 73.1, 73.3, 73.5,
74.2, 74.5, 75.3

Dot diagram: each data value is represented as a point, with care taken so that points
with the same value do not overlap.
Note: if the bin limits remain the same, the shape of the distribution
(frequency, relative frequency, density) remains unchanged.
Relative frequency plot with cumulative distribution (Pareto diagram)
Unlike the dot diagram, which displays every value, frequency distributions suppress
information on individual data values.
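As a hedged illustration of the note on bin limits, the R sketch below bins the alloy data once and derives the frequency, relative frequency and density distributions from the same counts. The unit-width bins from 66 to 76 and the left-closed convention (`right = FALSE`) are assumptions made for this sketch; the slides do not state which bin limits were used.

```r
# Compressive strength data (58 samples), as listed in the Sample Data above.
x <- c(66.4, 67.7, 68.0, 68.0, 68.3, 68.4, 68.6, 68.8, 68.9, 69.0, 69.1,
       69.2, 69.3, 69.3, 69.5, 69.5, 69.6, 69.7, 69.8, 69.8, 69.9, 70.0,
       70.0, 70.1, 70.2, 70.3, 70.3, 70.4, 70.5, 70.6, 70.6, 70.8, 70.9,
       71.0, 71.1, 71.2, 71.3, 71.3, 71.5, 71.6, 71.6, 71.7, 71.8, 71.8,
       71.9, 72.1, 72.2, 72.3, 72.4, 72.6, 72.7, 72.9, 73.1, 73.3, 73.5,
       74.2, 74.5, 75.3)

# Assumed bin limits: [66,67), [67,68), ..., [75,76]
breaks <- seq(66, 76, by = 1)
h <- hist(x, breaks = breaks, right = FALSE, plot = FALSE)

freq     <- h$counts                  # frequency per bin
rel_freq <- freq / sum(freq)          # relative frequency per bin
dens     <- rel_freq / diff(breaks)   # density = relative frequency / bin width

# The three rows differ only by a constant factor, so the shape is the same.
print(rbind(freq, rel_freq = round(rel_freq, 3), density = round(dens, 3)))
```

With equal-width bins the three distributions are rescaled copies of one another, which is why only the vertical axis changes between the plots.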

Stem-and-leaf displays provide a tabular way of organizing the display of all data values.

stem      leaf                        cumulative count
labels    entries                     (to check no data lost)
66        4                           1
67        7                           2
68        0 0 3 4 6 8 9               9
69        0 1 2 3 3 4 4 6 7 8 8 9     21
70        0 0 1 2 3 3 4 5 6 6 8 9     33
71        0 1 2 3 3 5 6 6 7 8 8 9     45
72        1 2 3 4 6 7 9               52
73        1 3 5                       55
74        2 5                         57
75        3                           58

Leaf entries consist of the last digit of the data values; for example, the leaf 1 in the row with stem 69 represents the value 69.1.
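The running count used to check that no data were lost can be reproduced with a few lines of R. This is a minimal sketch, assuming the stem is the integer part of each value; the variable names are illustrative only.

```r
# Compressive strength data (58 samples), as listed in the Sample Data above.
x <- c(66.4, 67.7, 68.0, 68.0, 68.3, 68.4, 68.6, 68.8, 68.9, 69.0, 69.1,
       69.2, 69.3, 69.3, 69.5, 69.5, 69.6, 69.7, 69.8, 69.8, 69.9, 70.0,
       70.0, 70.1, 70.2, 70.3, 70.3, 70.4, 70.5, 70.6, 70.6, 70.8, 70.9,
       71.0, 71.1, 71.2, 71.3, 71.3, 71.5, 71.6, 71.6, 71.7, 71.8, 71.8,
       71.9, 72.1, 72.2, 72.3, 72.4, 72.6, 72.7, 72.9, 73.1, 73.3, 73.5,
       74.2, 74.5, 75.3)

stems <- floor(x)                     # stem = leading digits (integer part)
leaves_per_stem <- table(stems)       # number of leaves in each row
cum_count <- cumsum(leaves_per_stem)  # running total down the display

print(cum_count)

# The last running total must equal the sample size, otherwise a value was lost.
stopifnot(cum_count[length(cum_count)] == length(x))
```

The printed running totals should match the right-hand column of the display (1, 2, 9, 21, 33, 45, 52, 55, 57, 58).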
In addition to knowing the average behavior (value) of a set of data, we would like to
know how much the data are spread about the average. This spread is computed from the
deviations from the mean.

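Let $\bar{x}$ be the mean value of the data $x_1, x_2, \ldots, x_i, \ldots, x_n$.

The differences $x_1 - \bar{x},\ x_2 - \bar{x},\ \ldots,\ x_i - \bar{x},\ \ldots,\ x_n - \bar{x}$ are called the deviations from the mean.

The sum of the deviations is $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$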

(i.e. the sum of the positive deviations exactly cancels the sum of the negative
deviations.) This is one of the nice properties of the mean.
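The cancellation follows directly from the definition of the mean. A short derivation, writing $x_i$ for the data values, $n$ for their number and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ for the mean:

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\,\bar{x}
  = n\,\bar{x} - n\,\bar{x}
  = 0
```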

Summing the squares of the deviations gives a non-zero result which gives a useful
measure of the spread of the data about the mean. The common measure for this
spread is the sample variance, $s^2$:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Since the deviations sum to 0, the value of any one of them can always be calculated from the
other $n - 1$. Therefore only $n - 1$ of the deviations are independent, explaining the $n - 1$
(instead of $n$) appearing in the denominator of $s^2$.

Since the units of the sample variance are the square of the units of the data, we
define the sample standard deviation

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
If the deviations from the mean are generally large, the standard deviation will be
large. If the deviations from the mean are generally small, the standard deviation will
be small.

When do we know the standard deviation is large? Is s = 1.67 large or small?

That depends on the mean value. Thus we use
the fractional relative variation, $s / \bar{x}$,
or the percent coefficient of variation, $\frac{s}{\bar{x}} \cdot 100\%$,
to quantify the size of the standard deviation.

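e.g. Sets of measurements made with micrometers A and B give the following table of
results. Which micrometer's measurements show less relative variation?

Micrometer   mean      std dev     coef of var
A            3.92 mm   0.0152 mm   0.39%
B            1.54 in   0.0086 in   0.56%

Micrometer A: its coefficient of variation is smaller (0.39% < 0.56%). Because the
coefficient of variation is unit-free, the comparison is valid even though A is in mm and B is in inches.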
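The last column of the table can be checked with a short R sketch. The helper name `coef_var` is hypothetical, introduced only for this illustration.

```r
# Percent coefficient of variation: standard deviation relative to the mean.
# (coef_var is an illustrative helper, not a built-in R function.)
coef_var <- function(s, xbar) 100 * s / xbar

cv_A <- coef_var(s = 0.0152, xbar = 3.92)   # micrometer A, in mm
cv_B <- coef_var(s = 0.0086, xbar = 1.54)   # micrometer B, in inches

# Units cancel in s / xbar, so the two values are directly comparable.
print(round(c(A = cv_A, B = cv_B), 2))
```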


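Sample variance & standard deviation
Sample variance:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Sample standard deviation:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Divisor: $n - 1$ instead of $n$,
because the $x_i$'s tend to be closer to $\bar{x}$ than to $\mu$.
Using $n$ tends to underestimate $\sigma^2$.

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1}$$

(hand-held calculator formula)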
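The calculator formula is the definition with the square expanded. A short derivation of the numerator, using $\sum_{i=1}^{n} x_i = n\bar{x}$:

```latex
\sum_{i=1}^{n} (x_i - \bar{x})^2
  = \sum_{i=1}^{n} x_i^2 - 2\bar{x} \sum_{i=1}^{n} x_i + n\,\bar{x}^2
  = \sum_{i=1}^{n} x_i^2 - 2n\,\bar{x}^2 + n\,\bar{x}^2
  = \sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2
  = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}
```

Dividing both sides by $n - 1$ gives the two equivalent expressions for $s^2$.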
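Alternate way to compute sample variance

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1}$$

Operations count

Operation               First form       Second form
+                       n-1              n-1
−                       n+1              2
×                       n                n+2
÷                       1                1
Total                   (2n) + (n+1)     (n+1) + (n+3)
More efficient on a     computer         hand calculator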
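Example 1.10: 6 12 6 6 4 8

Observation   $x_i$   $x_i^2$
1             6       36
2             12      144
3             6       36
4             6       36
5             4       16
6             8       64

Total         $\sum x_i = 42$   $\sum x_i^2 = 332$

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1} = \frac{332 - 42^2/6}{6 - 1} = 7.6$$

$$s = \sqrt{7.6} = 2.76$$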
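The arithmetic of Example 1.10 can be checked in R. The sketch below computes the variance from the deviations, from the calculator formula, and with the built-in `var()` and `sd()`, which use the $n - 1$ divisor; the variable names are illustrative.

```r
x <- c(6, 12, 6, 6, 4, 8)   # data of Example 1.10
n <- length(x)

# Definition: sum of squared deviations from the mean, divided by n - 1
s2_definition <- sum((x - mean(x))^2) / (n - 1)

# Calculator formula: uses only the sum and the sum of squares
s2_calculator <- (sum(x^2) - sum(x)^2 / n) / (n - 1)

print(c(definition = s2_definition,
        calculator = s2_calculator,
        builtin    = var(x)))          # all three give 7.6

print(round(sd(x), 2))                 # 2.76
```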
(3) Measures of Variation

A measure of location cannot give a complete summary of data.
Quartiles and Percentiles

In the same way that we use the median to find the center point of a data set
(1/2 of the data values lie below the median, 1/2 of the values lie above),
we can define the first quartile $Q_1$, the data value that has 1/4 of the data points lying
below it (and 1 − 1/4 = 3/4 of the points lying above it). Since 1/4 of the data corresponds to
25% of the data, $Q_1$ is also referred to as the 25th percentile $P_{0.25}$ of the data set.

We can generalize this to the 100p-th percentile $P_{0.p}$, the data value that has 100p% of
the data points lying below it (and 100(1 − p)% of the data points lying above it).

The 50th percentile $P_{0.50}$ is known as the second quartile $Q_2$ and is, in fact, identical
to the median of the data set.

The 75th percentile $P_{0.75}$ is known as the third quartile $Q_3$.

To compute percentile $P_{0.p}$ in a data set of n values:

1. order the data smallest to largest
2. compute the product np
   (a) if np is not an integer, round it up to the next integer value and find the
       corresponding data point
   (b) if np is an integer k, take the average of the kth and (k+1)st data values.

e.g. The Q1, Q2 and Q3 values for the alloy strength data set are:

Q1 (p = 0.25): np = 0.25 × 58 = 14.5, round up to 15. From the stem-and-leaf table, Q1 = x15 =
69.4

Q2 (p = 0.50): np = 0.50 × 58 = 29. Average the 29th and 30th values. From the stem-and-leaf
table, Q2 = (x29 + x30)/2 = (70.5 + 70.6)/2 = 70.55, as previously computed for the
median value of this data set

Q3 (p = 0.75): np = 0.75 × 58 = 43.5, round up to 44. From the stem-and-leaf table, Q3 = x44 =
71.8
As previously implied, the range of the data is
range = maximum − minimum
      = $x_n - x_1$ when the data are ordered smallest to largest

The middle half of the data lies in the interquartile range

interquartile range = Q3 − Q1

e.g. The range for the alloy strength data set is 75.3 − 66.4 = 8.9

The interquartile range is Q3 − Q1 = 71.8 − 69.4 = 2.4


Boxplots (aka box-and-whisker plots) plot quartile information.

[Figure: boxplot with Q1, Q2 and Q3 marked on the box, one whisker extending to the minimum data value and one whisker extending to the maximum data value]
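Example 1.11: Final exam scores
81 57 85 84 99 90 69 76 76 83
Order: 57 69 76 76 81 83 84 85 90 99
n = 10. np = 10 × 0.5 = 5 is an integer, so
median = $Q_2 = \frac{x_{n/2} + x_{n/2+1}}{2} = \frac{x_5 + x_6}{2} = \frac{81 + 83}{2} = 82$
$Q_1$: np = 10(0.25) = 2.5, rounded up to 3, so $Q_1 = x_3 = 76$
$Q_3$: np = 10(0.75) = 7.5, rounded up to 8, so $Q_3 = x_8 = 85$
min: 57, max: 99, range: 99 − 57 = 42
IQR = $Q_3 - Q_1$ = 85 − 76 = 9
$Q_1$ − 1.5(IQR) = 76 − 1.5(9) = 62.5
$Q_3$ + 1.5(IQR) = 85 + 1.5(9) = 98.5
Outliers: 57 & 99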
Boxplots

Sample range: $x_n - x_1$ = max − min (data ordered smallest to largest)

Interquartile range (IQR): $Q_3 - Q_1$ (length of the middle half)

Outlier: an observation which is less than $Q_1 - 1.5(\mathrm{IQR})$ or greater than $Q_3 + 1.5(\mathrm{IQR})$
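The outlier rule can be sketched in R as follows, applied to the exam scores of Example 1.11. The function name `find_outliers` is hypothetical. The choice of `type = 2` in `quantile()` is an assumption made here so that the quartiles agree with the hand rule (round np up, or average two values when np is an integer); R's default quantile type interpolates and can give slightly different quartiles.

```r
# Flag observations outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
# find_outliers is an illustrative helper, not a built-in R function.
find_outliers <- function(x) {
  # type = 2: assumed here to mirror the hand computation of Q1 and Q3
  q <- quantile(x, c(0.25, 0.75), type = 2, names = FALSE)
  iqr   <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr
  upper <- q[2] + 1.5 * iqr
  list(Q1 = q[1], Q3 = q[2], IQR = iqr,
       lower = lower, upper = upper,
       outliers = x[x < lower | x > upper])
}

scores <- c(81, 57, 85, 84, 99, 90, 69, 76, 76, 83)   # Example 1.11
res <- find_outliers(scores)
print(res)   # fences 62.5 and 98.5; outliers 57 and 99
```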
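Example 1.12: 78 99 47 53 71 69 60 57 45 88 59
Order: 45 47 53 57 59 60 69 71 78 88 99
n = 11. np = 11 × 0.5 = 5.5, rounded up to 6, so
median = $Q_2 = x_{(n+1)/2} = x_{12/2} = x_6 = 60$
$Q_1$: np = 11(0.25) = 2.75, rounded up to 3, so $Q_1 = x_3 = 53$
$Q_3$: np = 11(0.75) = 8.25, rounded up to 9, so $Q_3 = x_9 = 78$
min: 45, max: 99, range: 99 − 45 = 54
IQR = $Q_3 - Q_1$ = 78 − 53 = 25
$Q_1$ − 1.5(IQR) = 53 − 1.5(25) = 15.5
$Q_3$ + 1.5(IQR) = 78 + 1.5(25) = 115.5
No outliers

90th percentile:
np = 11 × 0.9 = 9.9, rounded up to 10, so $P_{0.90} = x_{10} = 88$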
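The hand rule used in Examples 1.11 and 1.12 can be written as a small R function. This is a minimal sketch; `textbook_percentile` is a hypothetical name, it assumes 0 < p < 1, and it uses a tolerance-based test to decide whether np is an integer.

```r
# Percentile by the hand rule:
#   np not an integer -> round up and take that ordered value
#   np an integer k   -> average the k-th and (k+1)-st ordered values
# textbook_percentile is an illustrative helper, not a built-in R function.
textbook_percentile <- function(x, p) {
  stopifnot(p > 0, p < 1)
  x  <- sort(x)
  np <- length(x) * p
  k  <- round(np)
  if (isTRUE(all.equal(np, k))) {
    (x[k] + x[k + 1]) / 2
  } else {
    x[ceiling(np)]
  }
}

a <- c(81, 57, 85, 84, 99, 90, 69, 76, 76, 83)        # Example 1.11
b <- c(78, 99, 47, 53, 71, 69, 60, 57, 45, 88, 59)    # Example 1.12

quartiles_a <- sapply(c(0.25, 0.50, 0.75), function(p) textbook_percentile(a, p))
quartiles_b <- sapply(c(0.25, 0.50, 0.75), function(p) textbook_percentile(b, p))

print(quartiles_a)                  # 76 82 85
print(quartiles_b)                  # 53 60 78
print(textbook_percentile(b, 0.9))  # 88
```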
Using R
The 100p-th percentile of data a can be obtained as
> quantile(a, p)
More than one percentile, say n percentiles, can be obtained as
> quantile(a, c(p1, p2, ..., pn))
For the data in Example 1.11, the quartiles can be obtained as
> a = c(81, 57, 85, 84, 99, 90, 69, 76, 76, 83)
> quantile(a, c(0.25, 0.50, 0.75))
  25%   50%   75%
76.00 82.00 84.75
> min(a)
> max(a)
> IQR(a)
> boxplot(a)
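Note that the 75% value printed by R (84.75) differs from the hand-computed Q3 = 85 of Example 1.11. R's `quantile()` supports several definitions through its `type` argument; the default (`type = 7`) interpolates linearly between ordered values, while `type = 2` averages at discontinuities, which appears to coincide with the hand rule on these data. A short sketch of the comparison:

```r
a <- c(81, 57, 85, 84, 99, 90, 69, 76, 76, 83)   # Example 1.11

# Default definition (type = 7): linear interpolation between ordered values
q_default <- quantile(a, c(0.25, 0.50, 0.75))

# type = 2: averaging at discontinuities, no interpolation otherwise
q_type2 <- quantile(a, c(0.25, 0.50, 0.75), type = 2)

print(q_default)   # 76.00 82.00 84.75
print(q_type2)     # 76 82 85

# IQR() accepts the same type argument
print(IQR(a))             # 8.75 with the default type
print(IQR(a, type = 2))   # 9, as in the hand computation
```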
Using R
A side-by-side boxplot of a and b can be obtained by
> boxplot(a, b)
For Example 1.12,
> b = c(78, 99, 47, 53, 71, 69, 60, 57, 45, 88, 59)
> boxplot(a, b)
Shapes of Distributions

(a) Symmetric
(b) Unimodal: single peak
(c) Bimodal: two peaks
(d) Uniform: The possible values appear with equal frequency
(e) Skewed: One side of the distribution is stretched out longer
than the other side
Skewness
Where the mean and median fall in a frequency distribution tells us something about
the skewness of the data.
If the mean and median have the same value, then the data are not skewed.
If the mean is greater than the median, the data are positively skewed (skewed to the right).
If the mean is less than the median, the data are negatively skewed (skewed to the left).
(C) Stem-and-Leaf Plot

List one or more leading digits for the stem values.
The trailing digit(s) become the leaves.
Arrange the trailing digits in each row so they are in increasing order.
Qualitative (Categorical) Data
Example 1.6: Frequency distribution of enrollment
in a high school

Class      Frequency   Relative Frequency
Algebra    26          0.26
English    30          0.30
Physics    19          0.19
Biology    24          0.24
Total      99          0.99

The relative frequencies sum to 0.99 rather than 1 because of round-off error.
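The relative frequencies and the round-off effect can be reproduced with a short R sketch. The variable names are illustrative, and the `barplot()` and `pie()` calls are one possible way to draw the bar graph and pie chart for these data.

```r
# Enrollment counts of Example 1.6
enrollment <- c(Algebra = 26, English = 30, Physics = 19, Biology = 24)

# Relative frequency = class count / total count, rounded to two decimals
rel_freq <- round(enrollment / sum(enrollment), 2)
print(rel_freq)

# Each share is rounded separately, so the rounded shares add up to 0.99
print(sum(rel_freq))

# Graphical displays for categorical data
barplot(rel_freq, main = "Enrollment", ylab = "Relative frequency")
pie(enrollment, main = "Enrollment")
```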


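Bar Graph

[Figure: bar graph titled "Enrollment"; vertical axis from 0 to 0.35 (relative frequency); bars for Algebra, English, Physics, Biology]

Pie Chart

[Figure: pie chart titled "Enrollment"; slices for Algebra, English, Physics, Biology]
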

(E) Time Plots
Time plot (time series plot): a plot of observations against time, or against the order
in which they were obtained.
(a) Trends: increasing or decreasing, changes in location of the center, changes in variation
(b) Seasonal variation or cycles: fairly regular increasing or decreasing movements
Example 5: Number of workers arriving late at work

          Monday   Tuesday   Wednesday   Thursday   Friday
Week 1    6        3         2           4          7
Week 2    8        0         5           3          2
Week 3    7        2         1           0          2
Week 4    5        0         1           0          1
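A time plot of these data can be sketched in R by reading the table row by row, so that the observations are in chronological order. The variable names and axis labels are illustrative choices.

```r
# Late arrivals of Example 5: rows are weeks, columns are weekdays
late <- matrix(c(6, 3, 2, 4, 7,
                 8, 0, 5, 3, 2,
                 7, 2, 1, 0, 2,
                 5, 0, 1, 0, 1),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste("Week", 1:4),
                               c("Mon", "Tue", "Wed", "Thu", "Fri")))

# Transpose before flattening so the series runs Mon..Fri of week 1, then week 2, ...
series <- as.vector(t(late))

plot(series, type = "b",
     xlab = "Working day (in order)", ylab = "Workers late",
     main = "Late arrivals")

print(rowSums(late))    # weekly totals: 22 18 12 7 (a decreasing trend)
print(colMeans(late))   # weekday averages: Monday is highest (a weekly cycle)
```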
Comparison of mean, median and mode

Example 1.9: Professional baseball players' annual income

Income (in $1,000)   Number of players
30                   180
50                   100
100                  50
500                  20
1,000                10
2,000                15
3,000                15
5,000                10
Total                400
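The three measures of location for this table can be computed with a short R sketch, expanding the frequency table into 400 individual incomes. The variable names are illustrative.

```r
income  <- c(30, 50, 100, 500, 1000, 2000, 3000, 5000)   # in $1,000
players <- c(180, 100, 50, 20, 10, 15, 15, 10)           # number of players

# Expand the frequency table into one income value per player
x <- rep(income, times = players)

mean_income   <- mean(x)                     # 401
median_income <- median(x)                   # 50
mode_income   <- income[which.max(players)]  # 30, the most frequent value

print(c(mean = mean_income, median = median_income, mode = mode_income))
```

The mean (401) is far above the median (50) because a small number of very high incomes pull it upward, which is the pattern of a positively skewed distribution.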
Summary

dot diagrams
distributions (histograms): bins, range, bin limits, bin marks, bin interval
distributions of: frequency, relative frequency, percent frequency, cumulative
frequency
distribution graphs, density graphs
stem-and-leaf displays
data quantifiers: sample mean, sample variance, sample standard deviation
Quartiles, percentiles, median, interquartile range
boxplots
