
Stat 110, Lecture 2

Graphical Descriptive Statistics

bheavlin@stat.stanford.edu
Statistics

No data: Probability
Some data: Inferential Statistics
Way too much data: Descriptive Statistics (Numerical, Graphical)

The branch of statistics devoted to the organization, summarization, and description of data sets is called descriptive statistics.

… vs. inferential statistics.

Bar charts (Schmidt)

[Figure: the same frequencies by country drawn two ways, as a horizontal bar chart with a frequency axis from 0 to 60 and as a vertical bar chart with country on the horizontal axis. Countries: Belgium, Finland, France, Germany, Holland, Japan, Sweden, Switzerland, United States.]

Dot chart formats (Cleveland, McGill)

[Figure: two dot charts of frequency by country, one listing the countries alphabetically and one ordering them by value (United States, France, Japan, Germany, Belgium, Sweden, Finland, Holland, Switzerland). Axis scales shown: 0 to 50 and 0.0 to 0.6.]

Data-to-ink ratio

[Figure: cumulative book titles (0 to 40) by year of publication (1996 to 2004), with one series for SAS and one for R, S, & S-plus.]

Low ink-to-data ratio (Tufte)

[Figure: frequency (0 to 60) by country, listed in the order United States, France, Japan, Germany, Belgium, Sweden, Finland, Holland, Switzerland.]

Log-base-2 rev

[Figure: the same countries (United States, France, Japan, Germany, Belgium, Sweden, Finland, Holland, Switzerland) on a log-base-2 axis with ticks at 1, 2, 4, 8, 16, 32, 64.]

Rules of thumb for simple labeled data:
• Except for time-ordered plots, favor horizontal over vertical scales.
• Less ink unrelated to data is usually better.
• Pick meaningful scales, including reference (e.g. zero) lines.
• Favor linear over non-linear patterns.
• Best practices scale better as the number of values, the number of groups, and the relative magnitudes get large.

Avoid pie charts and frivolous color.


Graphs: balance detail with # groups

Graph                    # points/group      # groups
Dot plot                 low-to-moderate     1-3
Stem & leaf, histogram   moderate-to-high    1-2
QQ plot                  moderate-to-high    ~1-3
Boxplot                  moderate-to-high    2-30+
Multi-vari               low-to-moderate     2-4/layer

[Figure: small example plots beside the rows; the surviving axis labels are a 0.0 to 1.0 scale, groups Hc1 and Hc2, and layers W and Z.]

CPU times of 25 jobs

1.17 1.61 1.16 1.38 3.53
1.23 3.76 1.94 0.96 4.75
0.15 2.41 0.71 0.02 1.59
0.19 0.82 0.47 2.16 2.01
0.92 0.75 2.59 3.07 1.40

Stem  Leaf   Count
4     8      1
4
3     58     2
3     1      1
2     6      1
2     024    3
1     6669   4
1     02244  5
0     57889  5
0     022    3

Values rounded to 1 x 10^-1 before leaves are assigned.

[Figure: dot plot of the 25 CPU times on an axis from 0 to 5.]

Stem-and-leaf: detailed version

Data: the 25 CPU times listed above.

Stem     Leaf             Count
4. 5-9   75               1
4. 0-4
3. 5-9   53,76            2
3. 0-4   07               1
2. 5-9   59               1
2. 0-4   01,16,41         3
1. 5-9   59,61,94         3
1. 0-4   16,17,23,38,40   5
0. 5-9   71,75,82,92,96   5
0. 0-4   02,15,19,47      4

Values truncated to 1 x 10^-1 before leaves are assigned.

Rule of thumb
• stem interval: 1, 2, or 5 x 10^k
• max number of leaves > 2
• ~10 categories, not 50, not 4
• sorting leaves helps find medians…
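
As a rough illustration of the detailed stem-and-leaf display, the sketch below rebuilds it from the 25 CPU times. The function name `stem_and_leaf` and the choice to key rows by `(stem, half)` are assumptions made for this example, not part of the lecture; the leaves are the two decimal digits, split into 0-4 and 5-9 rows as on the slide.

```python
from collections import defaultdict

# The 25 CPU times from the slide.
cpu_times = [
    1.17, 1.61, 1.16, 1.38, 3.53,
    1.23, 3.76, 1.94, 0.96, 4.75,
    0.15, 2.41, 0.71, 0.02, 1.59,
    0.19, 0.82, 0.47, 2.16, 2.01,
    0.92, 0.75, 2.59, 3.07, 1.40,
]

def stem_and_leaf(values):
    """Map (stem, half) -> sorted two-digit leaves.

    half 0 holds leaves 00-49 (the "0-4" row), half 1 holds 50-99 (the "5-9" row).
    """
    rows = defaultdict(list)
    for v in values:
        hundredths = round(v * 100)           # 1.17 -> 117; integers avoid float surprises
        stem, leaf = divmod(hundredths, 100)  # 117 -> stem 1, leaf 17
        rows[(stem, leaf // 50)].append(leaf)
    return {key: sorted(leaves) for key, leaves in rows.items()}

rows = stem_and_leaf(cpu_times)

# Print from the largest stem down, as on the slide.
for stem in range(4, -1, -1):
    for half in (1, 0):
        leaves = rows.get((stem, half), [])
        label = f"{stem}. {'5-9' if half else '0-4'}"
        leaf_text = ",".join(f"{leaf:02d}" for leaf in leaves)
        print(f"{label:<8}{leaf_text:<16}{len(leaves) if leaves else ''}")
```

Because the leaves in each row are sorted, the median can be read off by counting in from either end until the 13th of the 25 values is reached.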
Histogram

[Figure: histogram of the 25 CPU times; horizontal axis from 0 to 5, counts from 0 to 5.]

• good for details in tail,
• …for multiple modes
• but binning choice varies
• best only 1-3 groups

Bin sizes
• M&S: 5 to 20 bins
• $\sqrt{n}$

Bin width
• $h_n = 1, 2, 5 \times 10^k$
• $h_n = 2 \times \mathrm{IQR} / n^{1/3}$, or
  $h_n = 1.66 \times \mathrm{stdev} \times [\log_e(n) / n]^{1/3}$

Rounding to the 1-2-5 rule dominates any difference between these formulas.
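
A minimal sketch of the two bin-width formulas, followed by rounding to the 1-2-5 rule, applied to the 25 CPU times from the earlier slides. The helper name `round_to_125`, rounding on a log scale, and the quartile convention (the default of Python's `statistics.quantiles`) are assumptions of this example; other quartile conventions give slightly different IQRs.

```python
import math
import statistics

cpu_times = [
    1.17, 1.61, 1.16, 1.38, 3.53,
    1.23, 3.76, 1.94, 0.96, 4.75,
    0.15, 2.41, 0.71, 0.02, 1.59,
    0.19, 0.82, 0.47, 2.16, 2.01,
    0.92, 0.75, 2.59, 3.07, 1.40,
]

def round_to_125(h):
    """Return the value of the form 1, 2, or 5 x 10^k closest to h on a log scale."""
    k = math.floor(math.log10(h))
    candidates = [m * 10.0 ** e for e in (k, k + 1) for m in (1, 2, 5)]
    return min(candidates, key=lambda c: abs(math.log(c / h)))

n = len(cpu_times)
q1, _, q3 = statistics.quantiles(cpu_times, n=4)  # assumed quartile convention
iqr = q3 - q1
stdev = statistics.stdev(cpu_times)               # sample standard deviation

h_iqr = 2 * iqr / n ** (1 / 3)                      # IQR-based width
h_sd = 1.66 * stdev * (math.log(n) / n) ** (1 / 3)  # stdev-based width

print(h_iqr, round_to_125(h_iqr))
print(h_sd, round_to_125(h_sd))
```

For these data the two formulas give widths that differ only slightly, and both round to the same 1-2-5 value, which is the point of the last line of the slide.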


Cumulative Distribution Function

[Figure: cumulative probability (0.0 to 1.0) against CPU times (0 to 5).]

• Increments at each observation, rising from 0 to 1.
• Does not depend on the choice of bin width $h_n$.
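
The step-at-each-observation behavior can be sketched as an empirical cumulative distribution function for the 25 CPU times. The names `ecdf_steps` and `cum_prob` are made up for this example, and the sketch assumes each observation adds 1/n.

```python
cpu_times = [
    1.17, 1.61, 1.16, 1.38, 3.53,
    1.23, 3.76, 1.94, 0.96, 4.75,
    0.15, 2.41, 0.71, 0.02, 1.59,
    0.19, 0.82, 0.47, 2.16, 2.01,
    0.92, 0.75, 2.59, 3.07, 1.40,
]

def ecdf_steps(values):
    """Return (x, cumulative probability) pairs: one step of height 1/n per observation."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

def cum_prob(values, t):
    """Fraction of observations less than or equal to t."""
    return sum(1 for x in values if x <= t) / len(values)

steps = ecdf_steps(cpu_times)
print(steps[0], steps[-1])        # first and last step
print(cum_prob(cpu_times, 1.38))  # fraction of jobs at or below 1.38
```

No bin width appears anywhere in the computation, which is why the curve does not change with binning choices.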
Numerical Descriptive Statistics
• Mean: the arithmetic average, $\sum_i x_i / n$
• Median: the middle number. With n numbers and $x_{(i)}$ the ith smallest:
    n odd: $x_{((n+1)/2)}$
    n even: $[x_{(n/2)} + x_{(n/2+1)}] / 2$
• Mode: the most frequently occurring value(s), often relative to nearby values. Sometimes dependent on binning choices in the histogram.
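
The three summaries can be sketched on the 25 CPU times from the earlier slides. The function names are chosen for this example only, and `modal_bins` is one assumed way to make "mode relative to nearby values" concrete: it counts observations in bins of a given width and reports the lower edge of every bin that ties for the highest count.

```python
from collections import Counter

cpu_times = [
    1.17, 1.61, 1.16, 1.38, 3.53,
    1.23, 3.76, 1.94, 0.96, 4.75,
    0.15, 2.41, 0.71, 0.02, 1.59,
    0.19, 0.82, 0.47, 2.16, 2.01,
    0.92, 0.75, 2.59, 3.07, 1.40,
]

def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s = sorted(xs)
    n = len(s)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]         # x_((n+1)/2), shifted to 0-based indexing
    return (s[n // 2 - 1] + s[n // 2]) / 2  # average of x_(n/2) and x_(n/2+1)

def modal_bins(xs, width):
    """Lower edges of the bins [k*width, (k+1)*width) holding the most observations."""
    counts = Counter(int(x // width) for x in xs)
    top = max(counts.values())
    return sorted(k * width for k, c in counts.items() if c == top)

print(mean(cpu_times))
print(median(cpu_times))
print(modal_bins(cpu_times, 1.0))  # modal bin with unit-width bins
print(modal_bins(cpu_times, 0.5))  # modal bin(s) with half-unit bins
```

Running `modal_bins` with the two widths gives different answers for the same data, which illustrates the slide's caution that the mode can depend on binning choices.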

[Figure: boxplots on a vertical axis from -30 to 30, annotated with the upper "fence", 75th percentile, median, 25th percentile, lower "fence", and points outside the fences ("outliers"). A second panel has an axis from -4 to -12. Horizontal-axis labels as extracted: "clearout 01-12 no splits 2kTEOS 2kTEOS_HDP none".]

"Fence"
• IQR = 75th percentile - 25th percentile
• step = 1.5 x IQR
• lower fence = smallest observation greater than (25th percentile - step)
• upper fence = largest observation less than (75th percentile + step)

Interval to angled ends:
• 95% confidence interval of the mean

Interval to flattened ends:
• "overlap" marks
• 2 groups differ significantly when the trapezoids do not overlap (approx.)
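
A minimal sketch of the fence calculation, using the 25 CPU times from the earlier slides rather than the data behind the boxplot figure, which is not given. The function name `box_summary` and the quartile convention (the default of Python's `statistics.quantiles`) are assumptions of this example; plotting packages differ in how they compute quartiles.

```python
import statistics

cpu_times = [
    1.17, 1.61, 1.16, 1.38, 3.53,
    1.23, 3.76, 1.94, 0.96, 4.75,
    0.15, 2.41, 0.71, 0.02, 1.59,
    0.19, 0.82, 0.47, 2.16, 2.01,
    0.92, 0.75, 2.59, 3.07, 1.40,
]

def box_summary(values):
    """Quartiles, fences, and outliers following the step = 1.5 x IQR rule."""
    xs = sorted(values)
    q1, med, q3 = statistics.quantiles(xs, n=4)  # assumed quartile convention
    step = 1.5 * (q3 - q1)
    low_limit, high_limit = q1 - step, q3 + step
    inside = [x for x in xs if low_limit < x < high_limit]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "lower_fence": min(inside),  # smallest observation above q1 - step
        "upper_fence": max(inside),  # largest observation below q3 + step
        "outliers": [x for x in xs if not (low_limit < x < high_limit)],
    }

summary = box_summary(cpu_times)
print(summary)
```

With this convention the fences land on actual observations, so the whiskers never extend past the data, and anything beyond them is drawn as an individual outlier point.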