
Stat 110, Lecture 2
Graphical Descriptive Statistics
bheavlin@stat.stanford.edu
Statistics
• No data: Probability
• Some data: Inferential Statistics
• Way too much data: Descriptive Statistics (Numerical, Graphical)
The branch of statistics devoted to the
organization, summarization and description of
data sets is called descriptive statistics.
… vs inferential statistics.
Bar charts (Schmidt)
[Figure: two bar charts of frequency by country, one with horizontal bars and one with vertical bars. Countries: Belgium, Finland, France, Germany, Holland, Japan, Sweden, Switzerland, United States; frequency axis 0 to 60.]
Dot chart formats (Cleveland, McGill)
[Figure: two dot charts of frequency by country, one on a 0 to 50 scale and one on a 0.0 to 0.6 scale. Countries: Belgium, Finland, France, Germany, Holland, Japan, Sweden, Switzerland, United States.]
data-to-ink ratio
[Figure: cumulative book titles (0 to 40) by year of publication (1996 to 2004), with one series for SAS and one for R, S, & S-plus.]
Low ink-to-data ratio (Tufte)
[Figure: dot chart of frequency (0 to 60) by country, listed in the order United States, France, Japan, Germany, Belgium, Sweden, Finland, Holland, Switzerland.]
Log-base-2 rev
[Figure: chart of the countries (United States, France, Japan, Germany, Belgium, Sweden, Finland, Holland, Switzerland) on a log-base-2 scale with ticks at 1, 2, 4, 8, 16, 32, 64.]
Rules of thumb for simple labeled data:
• Except for time-ordered plots, favor horizontal
over vertical scales.
• Less ink unrelated to data is usually better.
• Pick meaningful scales, including reference
(e.g. zero) lines.
• Favor linear over non-linear patterns.
• Best practices scale better as the number of
values, the number of groups, and the relative
magnitudes get large.
Avoid pie charts, frivolous color.

Graphs: balance detail with # groups

Graph                    # points/group      # groups
Dot plot                 low-to-moderate     1-3
Stem & leaf, histogram   moderate-to-high    1-2
QQ plot                  moderate-to-high    ~1-3
Boxplot                  moderate-to-high    2-30+
Multi-vari               low-to-moderate     2-4/layer
CPU times of 25 jobs
1.17 1.61 1.16 1.38 3.53
1.23 3.76 1.94 0.96 4.75
0.15 2.41 0.71 0.02 1.59
0.19 0.82 0.47 2.16 2.01
0.92 0.75 2.59 3.07 1.40

Stem   Leaf    Count
4      8       1
4
3      58      2
3      1       1
2      6       1
2      024     3
1      6669    4
1      02244   5
0      57889   5
0      022     3
Rounded to $1 \times 10^{-1}$ before assigned leaf

[Figure: dot plot of the 25 CPU times on a 0 to 5 axis.]
stem-and-leaf: detailed version
Data: the 25 CPU times listed above.

Stem      Leaf             Count
4. 5-9    75               1
4. 0-4
3. 5-9    53,76            2
3. 0-4    07               1
2. 5-9    59               1
2. 0-4    01,16,41         3
1. 5-9    59,61,94         3
1. 0-4    16,17,23,38,40   5
0. 5-9    71,75,82,92,96   5
0. 0-4    02,15,19,47      4
Truncated to $1 \times 10^{-1}$ before assigned leaf

Rule of Thumb
• stem interval 1, 2, or $5 \times 10^{k}$
• max number of leaves > 2
• ~10 categories, not 50, not 4
• sorting leaves helps find medians…
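The detailed display above can be reproduced with a short script. This is a minimal sketch, not the software used for the slide: the function names `stem_and_leaf` and `render` are made up for illustration, and it assumes non-negative values with two decimal places, a unit stem split into a "0-4" row and a "5-9" row, and two-digit leaves.

```python
from collections import defaultdict

# The 25 CPU times from the slide.
cpu_times = [
    1.17, 1.61, 1.16, 1.38, 3.53,
    1.23, 3.76, 1.94, 0.96, 4.75,
    0.15, 2.41, 0.71, 0.02, 1.59,
    0.19, 0.82, 0.47, 2.16, 2.01,
    0.92, 0.75, 2.59, 3.07, 1.40,
]

def stem_and_leaf(values):
    """Group values into half-unit rows: stem = integer part, leaf = hundredths."""
    rows = defaultdict(list)
    for x in values:
        hundredths = round(x * 100)      # work in integers to avoid float fuzz
        stem, leaf = divmod(hundredths, 100)
        upper_half = leaf >= 50          # True -> "5-9" row, False -> "0-4" row
        rows[(stem, upper_half)].append(leaf)
    # Sorting the leaves makes medians and quartiles easy to read off.
    return {key: sorted(leaves) for key, leaves in rows.items()}

def render(rows):
    """Return text lines, largest stem first, keeping empty rows visible."""
    top = max(stem for stem, _ in rows)
    lines = []
    for stem in range(top, -1, -1):
        for upper_half in (True, False):
            leaves = rows.get((stem, upper_half), [])
            label = f"{stem}. {'5-9' if upper_half else '0-4'}"
            body = ",".join(f"{leaf:02d}" for leaf in leaves)
            lines.append(f"{label}  {body:<16} {len(leaves)}")
    return lines

rows = stem_and_leaf(cpu_times)
print("\n".join(render(rows)))
```

Keeping the empty "4. 0-4" row in the output preserves the gap in the upper tail, which is part of what the display is meant to show.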
histogram
Data: the 25 CPU times listed above.
[Figure: histogram of the CPU times; x-axis 0 to 5, y-axis count 1 to 5.]
• good for details in tail,
• …for multiple modes
• but binning choice varies
• best only 1-3 groups
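Bin sizes
• M&S: 5 to 20 bins
• $\sqrt{n}$ bins
Bin width
• $h_n = 1, 2, \text{ or } 5 \times 10^{k}$
• $h_n = 2 \times \mathrm{IQR} / n^{1/3}$ or
$h_n = 1.66 \times \mathrm{stdev} \times [\log_e(n)/n]^{1/3}$
Rounding to the 1-2-5 rule dominates any difference between the two formulas.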
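To see how the bin-width rules interact, the following sketch evaluates both formulas on the 25 CPU times and then rounds each result to the 1-2-5 grid. The helper name `round_125` is hypothetical, rounding to the nearest grid value is one plausible reading of the rule, and the quartile convention (the default of Python's `statistics.quantiles`) is an assumption; other conventions shift the IQR slightly.

```python
import math
import statistics

# The 25 CPU times from the slide.
cpu_times = [
    1.17, 1.61, 1.16, 1.38, 3.53,
    1.23, 3.76, 1.94, 0.96, 4.75,
    0.15, 2.41, 0.71, 0.02, 1.59,
    0.19, 0.82, 0.47, 2.16, 2.01,
    0.92, 0.75, 2.59, 3.07, 1.40,
]

def round_125(h):
    """Round a positive width to the nearest value of the form 1, 2, or 5 x 10^k."""
    k = math.floor(math.log10(h))
    mantissa = h / 10 ** k               # lies in [1, 10)
    # 10 is included so that mantissas near 10 round up to 1 x 10^(k+1).
    nearest = min((1, 2, 5, 10), key=lambda c: abs(c - mantissa))
    return nearest * 10 ** k

n = len(cpu_times)
q1, _, q3 = statistics.quantiles(cpu_times, n=4)   # quartile convention is an assumption
h_iqr = 2 * (q3 - q1) / n ** (1 / 3)
h_sd = 1.66 * statistics.stdev(cpu_times) * (math.log(n) / n) ** (1 / 3)

print("IQR-based width:  ", h_iqr, "->", round_125(h_iqr))
print("stdev-based width:", h_sd, "->", round_125(h_sd))
```

For these data both formulas give a width close to 1, and the 1-2-5 rounding returns 1 in either case, which illustrates why the rounding step matters more than the choice of formula.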

Cumulative Distribution Function
[Figure: cumulative distribution of the CPU times; x-axis CPU times (0 to 5), y-axis Cum Prob (0.0 to 1.0).]
• Increments at each observation, rising from 0 to 1.
• Does not depend on choice of bin width $h_n$.
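A minimal sketch of the empirical cumulative distribution function for the CPU times follows. The name `make_ecdf` is hypothetical; the function returns the fraction of observations less than or equal to x, so it steps up by 1/n at each observation and needs no binning.

```python
from bisect import bisect_right

# The 25 CPU times from the slide.
cpu_times = [
    1.17, 1.61, 1.16, 1.38, 3.53,
    1.23, 3.76, 1.94, 0.96, 4.75,
    0.15, 2.41, 0.71, 0.02, 1.59,
    0.19, 0.82, 0.47, 2.16, 2.01,
    0.92, 0.75, 2.59, 3.07, 1.40,
]

def make_ecdf(values):
    """Return F(x) = (number of observations <= x) / n as a function."""
    ordered = sorted(values)
    n = len(ordered)

    def ecdf(x):
        # bisect_right counts how many sorted observations are <= x.
        return bisect_right(ordered, x) / n

    return ecdf

ecdf = make_ecdf(cpu_times)
for x in (0.0, 1.0, 1.38, 3.0, 5.0):
    print(x, ecdf(x))
```

At the median (1.38) the function equals 13/25, since 13 of the 25 observations are at or below it.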
Numerical Descriptive Statistics
• Mean: the arithmetic average $= \sum_i x_i / n$
• Median: the middle number.
With n numbers, $x_{(i)}$ the i-th smallest:
n odd, $x_{((n+1)/2)}$;
n even, $[\, x_{(n/2)} + x_{(n/2+1)} \,]/2$
• Mode: the most frequently occurring value(s),
often relative to nearby values.
Sometimes dependent on binning choices in
the histogram.
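The mean and median definitions can be checked directly on the CPU times. This is a small sketch; the functions `mean` and `median` are written out by hand to mirror the order-statistic definition, and the standard library's `statistics.median` is used only as a cross-check.

```python
import statistics

# The 25 CPU times from the slide.
cpu_times = [
    1.17, 1.61, 1.16, 1.38, 3.53,
    1.23, 3.76, 1.94, 0.96, 4.75,
    0.15, 2.41, 0.71, 0.02, 1.59,
    0.19, 0.82, 0.47, 2.16, 2.01,
    0.92, 0.75, 2.59, 3.07, 1.40,
]

def mean(values):
    """Arithmetic average: sum of the values divided by their count."""
    return sum(values) / len(values)

def median(values):
    """Middle order statistic, or the average of the two middle ones if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # x_((n+1)/2) in 1-based indexing
    return (ordered[mid - 1] + ordered[mid]) / 2   # x_(n/2) and x_(n/2+1)

print("mean:  ", mean(cpu_times))
print("median:", median(cpu_times))
print("agrees with statistics.median:", median(cpu_times) == statistics.median(cpu_times))
```

The mean (about 1.63) sits above the median (1.38), consistent with the long upper tail visible in the stem-and-leaf display.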
[Figure: annotated boxplot (y-axis -30 to 30) labelling the upper “fence”, 75th %ile, median, 25th %ile, and lower “fence”; a point outside a fence is an “outlier”. Further axis ticks run from -12 to -4; x-axis group labels: clearout 01-12, no splits, 2kTEOS, 2kTEOS_HDP, none.]
“fence”
• IQR = 75th %ile − 25th %ile
• step = 1.5 x IQR
• lower fence = smallest observation greater than 25th %ile − step
• upper fence = largest observation less than 75th %ile + step
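The fence rules can be sketched on the CPU times as follows. The function name `box_fences` is hypothetical, and the quartiles come from the default method of Python's `statistics.quantiles`, which is an assumption; plotting packages differ in their quartile conventions, so fences can shift slightly between tools.

```python
import statistics

# The 25 CPU times from the slide.
cpu_times = [
    1.17, 1.61, 1.16, 1.38, 3.53,
    1.23, 3.76, 1.94, 0.96, 4.75,
    0.15, 2.41, 0.71, 0.02, 1.59,
    0.19, 0.82, 0.47, 2.16, 2.01,
    0.92, 0.75, 2.59, 3.07, 1.40,
]

def box_fences(values):
    """Return (lower fence, upper fence, outliers) using step = 1.5 x IQR."""
    ordered = sorted(values)
    q1, _, q3 = statistics.quantiles(ordered, n=4)   # quartile convention is an assumption
    step = 1.5 * (q3 - q1)
    # Fences are actual observations: the most extreme points within one step
    # of the quartiles.
    lower = min(x for x in ordered if x > q1 - step)
    upper = max(x for x in ordered if x < q3 + step)
    outliers = [x for x in ordered if x < lower or x > upper]
    return lower, upper, outliers

lower, upper, outliers = box_fences(cpu_times)
print("lower fence:", lower)
print("upper fence:", upper)
print("outliers:   ", outliers)
```

For these data the lower fence is the minimum (0.02), the upper fence is 3.76, and the single value 4.75 falls outside the upper fence.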
Interval to angled ends:
• 95% confidence interval of mean
Interval to flattened ends:
• “overlap” marks
• 2 groups significantly differ when trapezoids do not overlap (approx)
[Figure: boxplots for groups “clearout 01-12” and “no splits”, y-axis -30 to 30.]
