bheavlin@stat.stanford.edu
Statistics
Numerical Graphical
The branch of statistics devoted to the
organization, summarization and description of
data sets is called descriptive statistics.
… vs inferential statistics.
Bar charts (Schmidt)
[Figure: two bar charts of frequency (0 to 60) by country, one with horizontal bars and one with vertical bars. Countries: Belgium, Finland, France, Germany, Holland, Japan, Sweden, Switzerland, United States.]
Dot chart formats (Cleveland, McGill)
[Figure: two dot charts of frequency by country, one on a 0 to 50 scale and one on a 0.0 to 0.6 scale. One lists the countries alphabetically; the other lists them as United States, France, Japan, Germany, Belgium, Sweden, Finland, Holland, Switzerland.]
Data-to-ink ratio
[Figure: cumulative book titles (0 to 40) by year of publication, with one series for SAS and one for R, S, & S-plus.]
Low ink-to-data ratio
[Figure: dot chart of frequency (0 to 60) for France, Japan, Germany, Belgium, Sweden, Finland, Holland, Switzerland.]
Log-base-2 rev
[Figure: dot chart on a log-base-2 axis (1, 2, 4, 8, 16, 32, 64) for United States, France, Japan, Germany, Belgium, Sweden, Finland, Holland, Switzerland.]
Rules of thumb for simple labeled data:
• Except for time-ordered plots, favor horizontal
over vertical scales.
• Less ink unrelated to data is usually better.
• Pick meaningful scales, including reference
(e.g. zero) lines.
• Favor linear over non-linear patterns.
• Best practices scale better as the number of
values, the number of groups, and the relative
magnitudes get large.
Stem & leaf, histogram
moderate-to-high 1-2
[Figure: plot residue; axis ticks between 0.0 and 1.0.]
CPU times of 25 jobs

1.17 1.61 1.16 1.38 3.53
1.23 3.76 1.94 0.96 4.75
0.15 2.41 0.71 0.02 1.59
0.19 0.82 0.47 2.16 2.01
0.92 0.75 2.59 3.07 1.40

Stem  Leaf     Count
4     8        1
4
3     58       2
3     1        1
2     6        1
2     024      3
1     669      3
1     022244   6
0     57889    5
0     022      3

Rounded to 1 x 10^-1 before assigning leaves.
[Figure: dot plot of the CPU times on an axis from 0 to 5.]
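A minimal Python sketch of how such a display can be built from the 25 CPU times, assuming round-half-up to the nearest 0.1 and stems split into 5-9 and 0-4 halves. The function name `stem_and_leaf` and the tuple layout are illustrative choices, not part of the slides.

```python
from collections import defaultdict

# CPU times of the 25 jobs (values taken from the slide).
CPU_TIMES = [
    1.17, 1.61, 1.16, 1.38, 3.53,
    1.23, 3.76, 1.94, 0.96, 4.75,
    0.15, 2.41, 0.71, 0.02, 1.59,
    0.19, 0.82, 0.47, 2.16, 2.01,
    0.92, 0.75, 2.59, 3.07, 1.40,
]


def stem_and_leaf(values):
    """Return (stem, leaves, count) rows, top stem first, leaf unit 0.1.

    Assumption: values have two decimals and are rounded half up to 0.1.
    """
    rows = defaultdict(list)
    for v in values:
        hundredths = round(v * 100)        # 1.17 -> 117, integer from here on
        tenths = (hundredths + 5) // 10    # round half up to the nearest 0.1
        stem, leaf = divmod(tenths, 10)
        rows[(stem, leaf >= 5)].append(leaf)

    top = max(stem for stem, _ in rows)
    display = []
    for stem in range(top, -1, -1):
        for upper_half in (True, False):   # 5-9 row first, then 0-4 row
            leaves = sorted(rows.get((stem, upper_half), []))
            display.append((stem, "".join(map(str, leaves)), len(leaves)))
    return display


for stem, leaves, count in stem_and_leaf(CPU_TIMES):
    print(f"{stem} | {leaves:<8} {count}")
```

Each row is a (stem, leaves, count) tuple; the counts sum to the 25 observations, which is a quick check that no value was dropped.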
stem-and-leaf: detailed version

Stem     Leaf             Count
4. 5-9   75               1
4. 0-4
3. 5-9   53,76            2
3. 0-4   07               1
2. 5-9   59               1
2. 0-4   01,16,41         3
1. 5-9   59,61,94         3
1. 0-4   16,17,23,38,40   5
0. 5-9   71,75,82,92,96   5
0. 0-4   02,15,19,47      4

Truncated to 1 x 10^-1 before assigning leaves.

Rule of thumb
• stem interval 1, 2, or 5 x 10^k
• max number of leaves > 2
• ~10 categories, not 50, not 4
• sorting leaves helps find medians…
Histogram
[Figure: histogram of the 25 CPU times; count axis from 1 to 5.]
Bin sizes
• M&S: 5 to 20 bins
• sqrt(n) bins
Bin width
• h_n = 1, 2, or 5 x 10^k
• h_n = 2 x IQR / n^(1/3), or
• h_n = 1.66 x stdev x [ log_e(n) / n ]^(1/3)
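The bin-width rules can be compared on the 25 CPU times with a short Python sketch. Two assumptions are mine rather than the slides': quartiles come from `statistics.quantiles` with its default method, and `stdev` is the sample standard deviation. The helper names (`iqr_width`, `log_width`, `snap_to_125`) are hypothetical. The 2 x IQR / n^(1/3) rule is commonly known as the Freedman-Diaconis rule.

```python
import math
import statistics

CPU_TIMES = [
    1.17, 1.61, 1.16, 1.38, 3.53,
    1.23, 3.76, 1.94, 0.96, 4.75,
    0.15, 2.41, 0.71, 0.02, 1.59,
    0.19, 0.82, 0.47, 2.16, 2.01,
    0.92, 0.75, 2.59, 3.07, 1.40,
]


def iqr_width(values):
    """h_n = 2 x IQR / n^(1/3); quartile convention is an assumption."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return 2 * (q3 - q1) / len(values) ** (1 / 3)


def log_width(values):
    """h_n = 1.66 x stdev x [ln(n) / n]^(1/3), using the sample stdev."""
    n = len(values)
    return 1.66 * statistics.stdev(values) * (math.log(n) / n) ** (1 / 3)


def snap_to_125(width):
    """Snap a raw width to the nearest 1, 2, or 5 x 10^k (on a log scale)."""
    k = math.floor(math.log10(width))
    candidates = [m * 10.0 ** k for m in (1, 2, 5, 10)]
    return min(candidates, key=lambda c: abs(math.log(c / width)))


n_bins_sqrt = math.ceil(math.sqrt(len(CPU_TIMES)))  # sqrt(n) rule for bin count
print("sqrt(n) bins:", n_bins_sqrt)
for rule in (iqr_width, log_width):
    raw = rule(CPU_TIMES)
    print(rule.__name__, round(raw, 3), "->", snap_to_125(raw))
```

For this data set both width formulas land close to 1, so snapping to the 1, 2, 5 x 10^k grid gives unit-width bins, and the sqrt(n) rule gives 5 bins.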
Numerical Descriptive Statistics
• Mean: the arithmetic average = Σ_i x_i / n
• Median: the middle number.
  With n numbers and x(i) the i-th smallest:
  n odd: x((n+1)/2);
  n even: [ x(n/2) + x(n/2 + 1) ] / 2
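The two definitions translate directly into code. The sketch below applies them to the 25 CPU times used on the earlier slides; the function names are illustrative, and Python's 0-based indexing shifts each order-statistic index down by one.

```python
CPU_TIMES = [
    1.17, 1.61, 1.16, 1.38, 3.53,
    1.23, 3.76, 1.94, 0.96, 4.75,
    0.15, 2.41, 0.71, 0.02, 1.59,
    0.19, 0.82, 0.47, 2.16, 2.01,
    0.92, 0.75, 2.59, 3.07, 1.40,
]


def mean(values):
    """Arithmetic average: sum of x_i divided by n."""
    return sum(values) / len(values)


def median(values):
    """Middle order statistic; average of the two middle ones when n is even."""
    x = sorted(values)               # x[i - 1] is the i-th smallest, x(i)
    n = len(x)
    if n % 2 == 1:
        return x[(n + 1) // 2 - 1]   # x((n+1)/2)
    return (x[n // 2 - 1] + x[n // 2]) / 2   # [x(n/2) + x(n/2 + 1)] / 2


print("mean:", round(mean(CPU_TIMES), 2))
print("median:", median(CPU_TIMES))
```

With n = 25 the median is the 13th smallest value. The mean sits above it because the few long-running jobs pull the average upward.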
[Figure: box plots with vertical axes from -30 to 30 and from -12 to -4, annotated with upper “fence”, 75th %ile, median, 25th %ile, and lower “fence”; a point outside a fence is an “outlier”. Axis labels: clearout 01-12 no splits 2kTEOS 2kTEOS_HDP none.]
“fence”
• IQR = 75th %ile - 25th %ile
• step = 1.5 x IQR
• lower fence = min s.t. > 25th %ile - step
• upper fence = max s.t. < 75th %ile + step
Interval to angled ends:
• 95% confidence interval of mean
Interval to flattened ends:
• “overlap” marks
• 2 groups significantly differ when trapezoids do not overlap (approx)
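The fence rules can be sketched in Python as follows. The data behind the box plots is not recoverable here, so the example reuses the 25 CPU times from the earlier slides. Quartiles come from `statistics.quantiles` with its default method, which is an assumption; other percentile conventions can shift the fences slightly. The function name `fences` is hypothetical.

```python
import statistics

CPU_TIMES = [
    1.17, 1.61, 1.16, 1.38, 3.53,
    1.23, 3.76, 1.94, 0.96, 4.75,
    0.15, 2.41, 0.71, 0.02, 1.59,
    0.19, 0.82, 0.47, 2.16, 2.01,
    0.92, 0.75, 2.59, 3.07, 1.40,
]


def fences(values):
    """Return (lower fence, upper fence, outliers) using a 1.5 x IQR step."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # 25th, 50th, 75th %ile
    step = 1.5 * (q3 - q1)
    # Lower fence: smallest observation above (25th %ile - step).
    lower = min(v for v in values if v > q1 - step)
    # Upper fence: largest observation below (75th %ile + step).
    upper = max(v for v in values if v < q3 + step)
    outliers = sorted(v for v in values if v < lower or v > upper)
    return lower, upper, outliers


lower, upper, outliers = fences(CPU_TIMES)
print("lower fence:", lower)
print("upper fence:", upper)
print("outliers:", outliers)
```

Under these assumptions the longest job (4.75) falls beyond the upper fence and would be drawn as an individual outlier point.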