Beruflich Dokumente
Kultur Dokumente
Friday, 27 September, 13
You should be able to name files following the convention,
create your Rpubs and Gists following the convention, and
give us Markdown-ready links by now.
NO!
Friday, 27 September, 13
Consider observations of one quantitative variable
X ... possibly in the presence of one or two
categorical variables Y and Z, that take on a small
number of values
Friday, 27 September, 13
“Using fancy tools like neural nets, boosting, and
support vector machines without understanding
basic statistics is like doing brain surgery before
knowing how to use a band-aid.”
-- Larry Wasserman in preface of “All of Statistics”
Friday, 27 September, 13
X = life expectancy
<make a list>
Friday, 27 September, 13
What would you most like to know about the
observed distribution of the X’s (ignore Y, Z)?
relevant parameters
be careful with your words and thoughts
mean or average never forget the distinction between the
or expectation underlying truth (“the true mean”) and the current
observation (“the observed average”)
median
mode
Friday, 27 September, 13
What would you most like to know about the
observed distribution of the X’s (ignore Y, Z)?
concepts
spread relevant parameters
variability
dispersion
standard deviation, variance
median absolute deviation
interquartile range
range
Friday, 27 September, 13
we can compute
we do so to make
these from the
inference about these
observed data
mean or expectation
average
1 n E(Y ) = ∑ y ypY (y) for discrete rv Y
Yn = ∑ Yi
n i=1 E(Y ) = ∫ yfY (y)dy for continuous rv Y
variance variance
n
2 1 2 2
s = ∑
n − 1 i=1
(Yi − Yn ) V (Y ) = E(Y − µ )
Friday, 27 September, 13
we can compute
we do so to make
these from the
inference about these
observed data
mean or expectation
average
1 n E(Y ) = ∑ y ypY (y) for discrete rv Y
Yn = ∑ Yi
n i=1 E(Y ) = ∫ yfY (y)dy for continuous rv Y
variance variance
n
2 1 2 2
s = ∑
n − 1 i=1
(Yi − Yn ) V (Y ) = E(Y − µ )
Friday, 27 September, 13
density or prob.
mass function CDF
2
N( µ, σ )
Binom(n, p)
Friday, 27 September, 13
sources of images on previous page
http://en.wikipedia.org/wiki/File:Normal_Distribution_PDF.svg
http://en.wikipedia.org/wiki/File:Normal_Distribution_CDF.svg
http://en.wikipedia.org/wiki/File:Binomial_distribution_pmf.svg
http://en.wikipedia.org/wiki/File:Binomial_distribution_cdf.svg
Friday, 27 September, 13
“cumulative distribution function (CDF)”
a
FY (a) = P(Y ≤ a) = ∫ fY (y)dy (for a continuous Y)
−∞
Friday, 27 September, 13
how to get a probability from a density
[2]
b
[1] P(a < Y < b) = ∫ fY (y)dy
a
a
[2] P(Y ≤ a) = ∫ fY (y)dy
−∞
∞
[3] P(Y ≥ a) = ∫ fY (y)dy
a
−a ∞
[4] P( Y ≥ a) = ∫ fY (y)dy + ∫ fY (y)dy
−∞ a
Friday, 27 September, 13
inverse CDF, quantile function
−1
FY (q) = smallest y such that FY (y) > q
Friday, 27 September, 13
inverse CDF, quantile function
−1
FY (q) = smallest y such that FY (y) > q
Friday, 27 September, 13
expectation, expected value, the mean
E(Y ) = ∑ y ypY (y) for discrete rv Y
binomial example:
Y ~ Binom(n, p)
E(Y ) = np
Friday, 27 September, 13
Summaries computed from observed data are
empirical versions of those “key” concepts
Friday, 27 September, 13
Numerical summaries, esp. location
> summary(gDat$lifeExp)
Min. 1st Qu. Median Mean 3rd Qu. Max.
23.60 48.20 60.71 59.47 70.85 82.60
> fivenum(gDat$lifeExp)
[1] 23.5990 48.1850 60.7125 70.8460 82.6030
> mean(gDat$lifeExp)
[1] 59.47444
> median(gDat$lifeExp)
[1] 60.7125
Friday, 27 September, 13
Numerical summaries, esp. spread
> var(gDat$lifeExp)
[1] 166.8517
> sd(gDat$lifeExp)
[1] 12.91711
> mad(gDat$lifeExp)
[1] 16.10104
> IQR(gDat$lifeExp)
[1] 22.6475
Friday, 27 September, 13
Numerical summaries, esp. extremes
> min(gDat$lifeExp)
[1] 23.599
> max(gDat$lifeExp)
[1] 82.603
> range(gDat$lifeExp)
[1] 23.599 82.603
> which.min(gDat$lifeExp)
[1] 1293
> gDat[which.min(gDat$lifeExp), ]
country year pop continent lifeExp gdpPercap
1293 Rwanda 1992 7290203 Africa 23.599 737.0686
> which.max(gDat$lifeExp)
[1] 804
> gDat[which.max(gDat$lifeExp), ]
country year pop continent lifeExp gdpPercap
804 Japan 2007 127467972 Asia 82.603 31656.07
Friday, 27 September, 13
go over to R and try those functions out
Friday, 27 September, 13
but who wants to look at tables of numbers all
day?
Friday, 27 September, 13
STAT 545A
Class meeting #6
Wednesday, September 25, 2013
Friday, 27 September, 13
picking up where we left off ....
Friday, 27 September, 13
Digression: R’s formula syntax
http://cran.r-project.org/doc/manuals/R-intro.html#Formulae-for-statistical-models
y ~ x
“y twiddle x”
●
80 ●
●
●
●
● ●
●
●
75 ●
● ●
● ●
●
lifeExp
●
●
● ●
65 ●
●
●
●
● ●
● ●
60 ●
●
●
●
● ●
●
●
55
●
year
<snip, snip>
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 59.03624 1.20834 48.857 < 2e-16 ***
I(year - 1950) 0.30944 0.03535 8.753 7.52e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Nicaragua Oman
1950 1970 1990
Turkey
80
●
● ●
● ● ●
● ● ● ●●
● ● ● 70
●
● ●
xyplot(lifeExp ~ year | country, zDat, ●
●
● ●
● ●
layout = c(4,2), type = c('p','g','r')) ● ●
●
●
● 60
●
● ●
● ● ● ●
●
● 50
● ●
●
● ●
● ● ●
●
● 40
●
lifeExp
Zimbabwe Paraguay Iraq Denmark
scatterplot example 80
●●●●
●
●
●
●●
●●
●● ●
70 ●
● ●●
●● ●
●●
●
●
●
●●
●
●
●
● ●
● ●
50 ●
● ●
●
●
●
40 ●
year
Friday, 27 September, 13
y ~ x | z
In many lattice plotting functions, this says to plot y against x for
every level of z (assumed to be categorical). Evokes conditional
probability, “given z”, etc.
1950 1970 1990 1950 1970 1990
lifeExp
Zimbabwe Paraguay Iraq Denmark
80 ●
●
●
●●●●
z is categorical
●
● ●
● ●
50 ●
● ●
●
●
●
40 ●
year
Friday, 27 September, 13
Inference example: is there a difference in
distribution of Y given X = x? 30 40 50 60 70 80 90
Africa Europe
y ~ x 0.15
Density
y is quantitative and x is the binary
variable that specifies the two 0.05
groups
> t.test(lifeExp ~ continent, tinyDat)
● ● ●● ●
0.00 ● ●●
●
●
● ● ● ●
●
●
lifeExp
data: lifeExp by continent
t = -6.5267, df = 13.291, p-value = 1.727e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-28.35922 -14.27766
sample estimates:
mean in group Africa mean in group Europe
57.01227 78.33071
Friday, 27 September, 13
y ~ x | z + k
y ~ x | z * k
In lattice plotting functions, this says to plot y against x for every
combination of levels of z and k (both assumed to be
categorical). Note that the “+” and “*” do the same thing in this
context, which is not true in, e.g., a linear model.
Friday, 27 September, 13
watch my formulas in the following
graphing examples to see more ways to
use the formula interface
end digression
Friday, 27 September, 13
20 40 60 80
Europe Oceania
0.10
0.08
0.06
0.04
0.02
●●●
●
● ●
●●
●
●●
●●
●
●
●●
●
●●
●
●●
●
●●
●● ●
●●●●
●●
●● 0.00
Density
●●●● ●●
●●
●●
●●
●●
●
●●
●
●
●
●●
●●
●
●
●●
●
●●
●
●
●●
● ●●●●●
0.08
0.06
0.04
0.02
0.00 ● ●
●
●●
●●
●
●
●
●●
●
●
●
●
●
●●
●
●●
●
●●
●●
●●
●
●
●●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●●
●●
●
●
●●
●●●
●● ●
●
●●●
●
●
●
●●
●
● ●
●
●
●●
●
●
●●
●
●●●
●●
●
●●
●
●
●
●
●●
●
●●
●
●
●●●
●
●●
●
●
●
●●
●
●
●
●●
●
●
●●
●●
●
●
●●●
●
●
●
●
●●
●
●●
● ●●
●
●●
●
●●●
●
●
●●
●
●●
●
●●
●●
●●
●
●
●●●
●
●●
●●
●
●
●
●●
●
●●
●●
●●
●
●
●●
●
●
●
●●
●
●
●
●●
●●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●●
●
●
●
●●
● ●●
●
●●
●
●●
●●
●
●
●●
●●
●
●●
●
●●
●
●●
●●
●
●
●
●
●●
●
●
●●
●
●
●
●●●
●
●
●
●●●
●
●●
●
●
●●
●
●
●
●●
●
●
●●
●
●●
●
●
●●
●●●
●●
●
●
●●
●●
● ●●
●
●
●●
●●
●
●●
●
●
●●
●
●
●●
●
●●
●
●●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●●
●
●
●●
●
●
●●
●
●●
●
20 40 60 80 20 40 60 80
lifeExp
Friday, 27 September, 13
densityplot(~ lifeExp | continent, gDat,
plot.points = FALSE, ref = TRUE)
20 40 60 80
Europe Oceania
0.10
0.08
0.06
0.04
0.02
0.00
Density
0.08
0.06
0.04
0.02
0.00
20 40 60 80 20 40 60 80
lifeExp
Friday, 27 September, 13
densityplot(~ lifeExp, gDat,
groups = reorder(continent, lifeExp), auto.key = TRUE,
plot.points = FALSE, ref = TRUE)
Africa
Asia
Americas
Europe
Oceania ability to superpose
to facilitate direct
0.10 visual comparison is
big advantage of
0.08
densityplot over
histogram
0.06
Density
lifeExp
Friday, 27 September, 13
Illustration of kernel density estimation
0.6
0.4
0.2
0.0
Friday, 27 September, 13
boxplot or `box and whiskers’ plot (hence ‘bwplot()’)
80
●
●
●
●
● ●
●
● ●
70
●
●
60 ●
●
●
●
●
●
lifeExp
●
●
●
●
50
● ●
40
●
30
Friday, 27 September, 13
Where boxplots come from
Friday, 27 September, 13
violin plot 80
70
60
lifeExp 50
40
30
80
●
●
●
●
● ●
●
● ●
70
●
●
60 ●
●
●
●
●
●
lifeExp ●
●
●
●
50
● ●
40
●
30
Europe Oceania
60
50
40
30
20
20 40 60 80
10
Percent of Total
Europe Oceania
0
60
Africa Americas Asia
50 60
50
40
40
30
30
20
20
10
Percent of Total
10
0 0
Africa Americas Asia
60 20 40 60 80 20 40 60 80
lifeExp
50
40
30
20
10
20 40 60 80 20 40 60 80
lifeExp
Friday, 27 September, 13
histogram(~ lifeExp | year, gDat)
20 40 60 80 20 40 60 80
20
15
10
5
0
year year year year
25
20
15
10
5
0
20 40 60 80 20 40 60 80
lifeExp
Friday, 27 September, 13
Raw MATalpha growth
8.0 8.5 9.0 9.5 10.0 10.5
MATalpha_220305_N MATalpha_220305_O
14
13
12
11
10
9
8
7
6
5
4
3
2
Agar plate
1
MATalpha_111105_N MATalpha_111105_O
14
13
12
11
10
9
8
7
6
5
4
3
2
1
ability to superpose
Friday, 27 September, 13
Where boxplots fail
34
20
13
−6
log(Forward Scatter)
34 ● ●
●●
●●
●
●●
●
●●
●
●●
●
●●
●●
●●
●
●
●●
●●
●●
●●
●
●●
●
●●
●●●
●
●●
●●
●
●●
●
●●
●●
●●
●
●●
●
●●
●
●●●
●
●●●
●
●●
Days Past Transplant
27 ● ●●
●●●●
●
● ●
20 ● ●●
●
●
●
●
●●
●
●
●●
●
●●
●●
●●
●
●●●
●●
●●
●
●
●●
●
●●
●●
●●
●
●
●●
●
●●
●
●●
●
●●
●●
●
●●
●●
●
●●●
●
●
●●
●
●●
●●
●
●●
●●
●
●●
●
●●
●
●●
●●
●
●●
●●
●
●●
●●
●
●●
●
●●
●
●●
●●
●
●●
●
●●
●
●●
●
●●
●●
●
●●
●
●●
●
●●
●●
●●
●
●
●●●
●● ●●
13 ● ●
●● ●●●
●●●●
●●●
●●
●●
●●
●
●●
●● ●
● ●●
●●●
●
●●
●●●
●
●●●●● ●
6 ● ●
●
●
●●
●
●
●●
●
●●
●●
●●
●●
●●●
●
●●
●●●
●●
●●
●●●
●●
●●●
●
●●●
●
●●●●
●
●●
●●●
●
●●●
●●●
●● ●●●●
●●●●
●
●●
●●●
●
●●
●●●●
●
●●●●
●●●●●
●
●●●
●● ●
●●● ●
●●●●
●●
● ● ● ●
0 ● ●●
●
●●
●●
●●
●●● ●●● ●
−6 ● ●
●
●●●
●
●
●●
●●
●
●●
●●
●
●●
●●●
●●●●
●
●●
●●
●
●●
●●
●●
●
●●
●
●●
●
●●
●●
●●●●●
●●●●
●
●
●●●● ●
● ●● ● ●
log(Forward Scatter)
Friday, 27 September, 13
gvhd10 package:latticeExtra R Documentation
Description:
Usage:
data(gvhd10)
Format:
'Days' a factor with levels '-6' '0' '6' '13' '20' '27' '34'
Friday, 27 September, 13
Violin plot > boxplot?
## Figure 3.13
bwplot(Days ~ log(FSC.H), data = gvhd10,
xlab = "log(Forward Scatter)",
ylab = "Days Past Transplant")
## Figure 3.14
bwplot(Days ~ log(FSC.H), gvhd10,
panel = panel.violin, box.ratio = 3,
xlab = "log(Forward Scatter)",
ylab = "Days Past Transplant")
Friday, 27 September, 13
what about “empirical cumulative distribution plots”
or ECDF plots?
Friday, 27 September, 13
What is the empirical cumulative distribution (ecdf)?
# xi 's ≤ x
F̂n (x) =
n
1
F̂n (x) = ∑ i I(xi ≤ x)
n
A step function that increases by 1/n at every
observed value of X. The NPMLE of F.
Friday, 27 September, 13
histogram vs. densityplot
Friday, 27 September, 13
34
Empirical CDF
0.2
0.0
the data!
6 13
1.0
0.8
0.6
0.4
0.2
0.0
−6 0
1.0
0.8
0.6
0.4
0.2
0.0
4.0 4.5 5.0 5.5 6.0 6.5 7.0
densityplot() log(FSC.H)
I cannot ‘read’
ecdfplots ... can you
spot bimodality?
What’s the mean?
Which distribution has
greater spread?
Friday, 27 September, 13
For medium-to-large datasets,
main data visualizations driven by
1. the density f
a. histogram
b. kernel density estimate
2. the CDF F
1. box-and-whisker plot
2. empirical cumulative distribution function
Friday, 27 September, 13
Main data visualizations driven by
1. the density f
a. histogram (histogram)
b. kernel density estimate (densityplot)
2. the CDF F
1. box-and-whisker plot (bwplot)
2. empirical cumulative distribution function
(ecdfplot)
stripplot ✓
bwplot ✓
histogram ✓ ✓
densityplot ✓ * ✓ ✓
ecdfplot ✓ ✓ ✓
* I’ve actually extended densityplot to work here, for personal use. See other page.
Friday, 27 September, 13
Visualizing dist’n of X (given Y = y)
• I favor smooth histograms = density estimates. Path of
least resistance is densityplot.
Friday, 27 September, 13
Illustration of kernel density estimation
0.6
0.4
0.2
0.0
Friday, 27 September, 13
brief introduction to kernel density
estimation
Friday, 27 September, 13
Here’s something that IS available via STATSnetBase (and seems to
have been a source for this material in the first place!):
Friday, 27 September, 13
Histogram
Well-established, widely-practiced method of density
estimation.
Friday, 27 September, 13
Histogram
Crucial ‘tuning’ parameter for histogram density
estimation: the bins (or bin widths or number of bins)
hist()
base R
k = 1 + log 2 n
truehist() −1/ 3
MASS
h = 3.5σ̂ n
histogram()
k = round(1 + log 2 n)
lattice
Friday, 27 September, 13
faithful data set
bin selection
has a huge
impact on the
result!
Friday, 27 September, 13
Naive estimator, uniform kernel estimator
Friday, 27 September, 13
Naive estimator, uniform kernel estimator
Therefore, for small h, a crude estimator of f is:
1
f̂h (x) = [# xi ∈(x − h, x + h)]
2nh
Define a weight function:
1
w(x) = if x < 1 and 0 otherwise
2
And re-write the crude / naive estimator as:
1 1 ⎛ x − xi ⎞
f̂h (x) = ∑ i w ⎜ ⎟
n h ⎝ h ⎠
Friday, 27 September, 13
Naive estimator, uniform kernel estimator
And re-write the crude / naive estimator as:
1 1 ⎛ x − xi ⎞
f̂h (x) = ∑ i w ⎜ ⎟
n h ⎝ h ⎠
In plain English, place a box of width 2h and height
(2nh)-1 on each observation. Density estimate at any
point x is the sum of these boxes.
Friday, 27 September, 13
Moving beyond a uniform (or
rectangular) kernel
Let’s replace the weight function with another
function K that satisfies the following:
K(x) ≥ 0
∫ K(x)dx = 1
So K is a probability density function and, usually, is
symmetric.
Friday, 27 September, 13
In general, the kernel estimator is given by:
1 ⎛ x − xi ⎞
f̂h (x) =
nh
∑ i
K ⎜⎝
h
⎟⎠
Friday, 27 September, 13
In general, the kernel estimator is given by:
1 ⎛ x − xi ⎞
f̂h (x) =
nh
∑ i
K ⎜⎝
h
⎟⎠
Friday, 27 September, 13
Illustration of kernel density estimation
0.6
0.4
0.2
0.0
Friday, 27 September, 13
density package:stats R Documentation
Description:
n: the number of equally spaced points at which the density is
The (S3) generic function 'density' computes kernel density to be estimated. When 'n > 512', it is rounded up to a power
estimates. Its default method does so with the given kernel and of 2 during the calculations (as 'fft' is used) and the final
bandwidth for univariate observations. result is interpolated by 'approx'. So it almost always
makes sense to specify 'n' as a power of two.
Usage:
density(x, ...)
## Default S3 method:
density(x, bw = "nrd0", adjust = 1,
kernel = c("gaussian", "epanechnikov", "rectangular",
"triangular", "biweight",
Arguments:
Friday, 27 September, 13
densityplot
allows the user to
densityplot(~ eruptions, data = faithful) specify the
arguments to
densityplot(~ eruptions, data = faithful,
kernel = "rect", bw = 0.2,
kernel, bandwidth
Friday, 27 September, 13
n = 10 n = 35 n = 150
0.5 0.5
Density
Density
0.2 0.2 0.2
● ●●●
●●
● ●
● ● ●● ●●●●●
●●● ●● ●
●
●●●●● ●●
●●●●● ● ●●● ●●●
●●
●● ● ●
●●
●
● ● ● ●● ● ●●● ● ●
●●
● ●● ●●
●●
●●●●●
●●
●
● ● ●
●● ●
●
● ●
● ●● ● ●●
● ●
●● ● ●●
● ● ●● ●
●●●● ●●●●
0.0 ●●●
●
●●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●●
●
●
●●
●●●
●
●
●●
●
●
●
●
●
●
●●
●
●
●●●● ●
●
● ● ●●
● ● ● ●●●
● ●●● ●
●
●●●
● ●
●●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●●
●
●●●
●● 0.0 ●●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●● ● ●
●●
●
●
●
●
● ●●
●
●
●●●●●
●
●●
● ●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●●
●
●
●
●
●
●
●
●●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●●
●●●● 0.0 ●●●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●●
●●●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
● ●● ● ● ● ●●
●●● ●●
● ●●
●
●
●●
●●
●●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●●●
●●
1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6
<snip, snip>
higher value of n.
Friday, 27 September, 13
Practical usage tip: if the actual KDE is
wigglier than you’d like, try changing the
bandwidth.
Friday, 27 September, 13
Other recommended sources:
Friday, 27 September, 13