AnIntroductionToStatistics.tex or .pdf
Edward Morey - August 30, 2010
1.1
To know exactly what this means, one must first define a variable and then
a random variable.
Put simply, a variable is something that varies. That is, a variable can take
on different numerical values (a realized value of a variable is some number).
Each number represents a distinct state. For example, the variable G might
represent gender, where 1 corresponds to the state female and 0 corresponds to
the state not-female. Or, for example, if X is a variable such that $0 \le x \le 123$,
so that X can take any numerical value between zero and 123, inclusive, the variable
X might represent the age of a human. Here, I have assumed that no human is
younger than zero or older than 123 years (not everyone would agree on
this lower bound).1
X is the name of the random variable (e.g. age, price, or amount of sexual
activity), and x is a numerical value of X.2
1 From Wikipedia: The longest unambiguously documented lifespan is that of Jeanne Calment of France (1875-1997), who died at age 122 years and 164 days. She met Vincent van
Gogh at age 14.[1] This led to her being noticed by the media in 1985, at age 110. Subsequent
investigation found that her life was documented in the records of her native city of Arles
beyond reasonable question.[2] More evidence for the Calment case has been produced than
for any other supercentenarian case, which makes her case a standard among the oldest people
recordholders.[citation needed]. http://en.wikipedia.org/wiki/Oldest_people
2 The issue of how to distinguish between a random variable and a specific realization of
that random variable can be confusing, and the literature is not consistent in how it notationally
distinguishes between the two. I won't be either. I will try to use upper case to denote the
name of a random variable, and lower case to denote a specific value of that random variable.
However, I, and others, might use x to refer to both and hope the reader can determine which
is meant by the context.
To say that the value of a rv can be expressed with a number is not very
restrictive: if X has cardinal properties, e.g. if X represents age, then the numbers have cardinal meaning; if X only has meaning in terms of a ranking, e.g.
class rank, then only the ordinal properties of the numbers are important; if X
simply denotes categories, e.g. gender or race, then the numbers mean nothing
other than that different numbers represent different categories.3
Let's assume X is a random variable (I will define random variable in a
second). In which case
y = f (x)
= g(x)
and
m = 4 + 7x
are each statistics, all of the same random variable, X. The letters f and g are
the names of particular functions.
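A small sketch may help here: if X is random, then any function of its realized value, such as m = 4 + 7x, also changes from draw to draw. The uniform distribution on [0, 123] used below is purely an assumption for illustration; the text does not say how X is distributed.

```python
import random


def m(x):
    """The statistic m = 4 + 7x from the text."""
    return 4 + 7 * x


rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumption: X is uniform on [0, 123], chosen only for illustration.
draws = [rng.uniform(0, 123) for _ in range(5)]
values = [m(x) for x in draws]

for x, v in zip(draws, values):
    print(f"x = {x:7.2f}  ->  m = {v:8.2f}")
```

Each new x gives a new m, which is the sense in which a statistic is itself a random variable.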
Or more generally, imagine three random variables: X, Y and Z. In which
case
= (x; y; z)
b = h(x; y; z)
1
1 (x; y; z)
2 (x; y; z)
and
are each statistics.
So c = m(b) is a rv.
Alternatively, one could let x refer to the rv and let xi refer to a value i of the rv. This
approach will be pretty clear if there is only one rv being considered. But what if there are
three rvs? Do I give them different names, like x, y and z? If so, zi refers to a realized value
of z. But what if, instead of x, y and z, I had denoted the three random variables in the text
x1, x2 and x3, where the subscripts now refer to different rvs, not different observations on x?
One must be vigilant.
We need to be careful and figure out what is going on by the context. Being explicit about
what we mean is also a good thing if we don't want the reader to get confused.
3 You are probably ready to conclude that I like footnotes. I do; they allow me to digress
(a former student accused me of being the "King of digression") and go off on tangents. I discuss the
properties of numbers in Morey (confuser surplus). Latent Gold, a statistical program I
use, allows one to change the specification of random variables between cardinal, ordinal and
nominal (categorical). The differences between many statistical and economic models are
often only the numerical properties of the dependent and independent variables.
All statistics are random variables, but not all random variables are statistics,
unless one defines x = x as a function.
1.2
Definition 2 X is a random variable if it is a variable and if it has a distribution. Said another way, X is a random variable if, for all a and b, one can determine
the probability that $a \le x \le b$ if one knows the distribution of X.
Note that X takes specific values (e.g., if X is weight, each of us has a specific
weight, but weight, in the population, has some distribution).
The above definition is not self-contained. It requires that we know what a
distribution is, and we have yet to define that term, other than to say
it is something that allows us to calculate $\Pr(a \le x \le b)$.
Also note the definition requires that X has a distribution, but it does not
require that we know what distribution.
The book, Introduction to the Theory of Statistics (Mood, Graybill and
Boes) defines a continuous random variable as follows:
The variable X is a one-dimensional, continuous random variable if there
exists a function f (x) such that $f(x) \ge 0$ for all x in the interval $-\infty \le x \le \infty$, and
the probability that $(a \le x \le b)$ is4
$$\Pr(a \le x \le b) = \int_a^b f(x)\,dx$$
The function f (X) is called a density function (or a probability density function). The function, f (X), describes the distribution of X.
Any function, f (X), can serve as a density function as long as
$$f(x) \ge 0$$
and5
$$\int_{-\infty}^{+\infty} f(x)\,dx = 1$$
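The two conditions can be checked numerically for any candidate density. The sketch below does so for a made-up density, f(x) = 0.75(1 - x^2) on [-1, 1] and zero elsewhere; that function, the midpoint-rule integrator, and the interval [0, 0.5] are my assumptions for illustration, not something taken from Mood, Graybill and Boes.

```python
def f(x):
    """Assumed example density: 0.75 * (1 - x^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1 - x * x) if -1 <= x <= 1 else 0.0


def integrate(func, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of func from a to b."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))


# Condition 1: f(x) >= 0 on a grid of points covering the support.
nonnegative = all(f(-1 + k / 100) >= 0 for k in range(201))

# Condition 2: the area under f is one (f is zero outside [-1, 1]).
total_area = integrate(f, -1, 1)

# Pr(0 <= x <= 0.5) is the area under f between 0 and 0.5.
prob = integrate(f, 0.0, 0.5)

print(nonnegative, round(total_area, 6), round(prob, 6))
```

Once the density is known, every probability of the form Pr(a ≤ x ≤ b) is just an area, which is all the definition asks for.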
1.2.1
Another example
$$f(x) = \begin{cases} 0 & \text{if } x < .5 \\ .5^{\mathrm{round}(x)} & \text{if } x \ge .5 \end{cases}$$
where round(x) is defined as the integer closest to x, with .5 rounded up.
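To see that this step function can serve as a density, here is a minimal check. It assumes the definition is read as .5 raised to the power round(x), which is the reading under which the area comes out to one. Note that Python's built-in round() sends .5 to the nearest even integer, so the helper below implements the "round .5 up" rule directly.

```python
import math


def round_half_up(x):
    """Integer closest to x, with .5 rounded up (as defined in the text)."""
    return math.floor(x + 0.5)


def f(x):
    """f(x) = 0 for x < .5, and .5 ** round(x) for x >= .5 (assumed reading)."""
    return 0.0 if x < 0.5 else 0.5 ** round_half_up(x)


# Step k = 1, 2, ... covers [k - .5, k + .5): width 1, height .5 ** k.
# The total area is therefore a geometric series; 50 steps is within
# 2 ** -50 of the limit.
area = sum(f(k) * 1.0 for k in range(1, 51))

# Pr(.5 <= x < 2.5) is the area of the first two steps.
prob = f(1) + f(2)

print(area, prob)
```

The heights halve with each step, so the areas sum like 1/2 + 1/4 + 1/8 + ..., which converges to one even though the domain is unbounded above.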
[Figure: plot of f(x) against x]
[Figure: plot of y against x, for x between -1 and 1]
Note that it is much easier to make up a density function f (X) when X has
a limited domain, rather than an infinite domain. It is tough to find functions
where the domain is infinite and the area under the function is one.
After one has chosen some population variable to study, how does/should
one decide what to assume about its density function?
1.2.2
A variable is a rv if there exists some probability that the variable lies in the
interval [a, b].
It is sometimes easy to forget that statistics is all about determining or
estimating probabilities.
For example, in OLS regression analysis, something many of you know, the
probabilities are not always explicit. Many, unfortunately, fixate on the OLS
parameter estimates, but the probabilities are there. For example, given the
OLS parameter estimates, what is the probability that the true value of the
parameter lies between a and b? We should be most interested in that question.
1.3
If one reads statistics books, one quickly gets the idea that statisticians have
a thing about urns and drawing balls from urns. When entering the kitchen
to make the kids breakfast, the statistician takes the lid off the "breakfast urn"
and draws a ping-pong ball. If the ball is red it is eggs, blue cereal, ... Maybe
there is a different urn for weekend breakfasts. Tonight, will it be TV in bed or
sex with the spouse? It all depends on the draw from the bedroom urn.
1.3.1
Econometrics is the study of the application of statistical methods to the analysis of economic phenomena.
Books at http://books.google.com/books?id=B8I5SP69e4kC&dq=Kennedy+a+guide+to+econometrics&printsec=fro
The art of the econometrician consists in finding the set of assumptions which are both sufficiently specific and sufficiently realistic to allow him to take the best possible advantage of the data
available to him (Malinvaud)9
The applied econometrician: The applied econometrician, unlike
the theoretical econometrician, needs to worry as much about her
data as about the theory. The forecasts and predictions generated
by the econometric model are only as good as the data that produced
them.
A well-known econometrician recently mentioned to me that he was hired
by a group of wealthy gamblers to use his choice-modeling skills to predict the
outcome of horse races. It might be important that he get it right.10
1.4
The following quote is from the front of The Advanced Theory of Statistics, Vol.
2, by M.G. Kendall and A. Stuart. They attributed it to the fictitious K.A.C.
Manderville, The Undoing of Lamia Gurdleneck.
"You haven't told me yet," said Lady Nuttal, "what it is your
fiancé does for a living."
"He's a statistician," replied Lamia, with an annoying sense of
being on the defensive.
Lady Nuttal was obviously taken aback. It had not occurred to
her that statisticians entered into normal social relationships. The
species, she would have surmised, was perpetuated in some collateral
manner, like mules.
"But Aunt Sara, it's a very interesting profession," said Lamia
warmly.
"I don't doubt it," said her aunt, who obviously doubted it very
much. "To express anything important in mere figures is so plainly
impossible that there must be endless scope for well-paid advice on
how to do it. But don't you think that life with a statistician
would be rather, shall we say, humdrum?"
Lamia was silent. She felt reluctant to discuss the surprising
depth of emotional possibility which she had discovered below Edgar's
numerical veneer.
"It's not the figures themselves," she said finally, "it's what you
do with them that matters."
Some additional quotes:
To understand God's thoughts we must study statistics, for these
are the measure of His purpose. (Florence Nightingale)
Statistics are like a bikini. What they reveal is suggestive, but
what they conceal is vital. (Aaron Levenstein)
The first lesson that you must learn is, when I call for statistics
about the rate of infant mortality, what I want is proof that fewer
babies died when I was Prime Minister than when anyone else was
Prime Minister. That is a political statistic. (Winston Churchill)
There are three kinds of lies: lies, damned lies, and statistics.
(Benjamin Disraeli, but sometimes attributed to Mark Twain)
Too bad we can't email Florence and ask her what the hell she meant. Maybe
she meant that "casualty statistics", estimates of maimed and dead soldiers, are
a "measure of [God's] purpose": part of God's big plan.
1.4.1
imagine a cemetery where all and only smokers are buried. One digs up a
bunch of the decomposing and takes from each a snip of lung tissue to see whether the smoker
had lung cancer. Here, reference the study that dug up frozen guys from WWI to see what
kind of flu they had.
13 I would modify this to "whose outcomes can be expressed with real numbers." For example,
one would still have a random variable if the variable was hair color and one used letters of
the alphabet, rather than numbers, to denote the different colors hair can take.
"Statistics is the science of estimating the probability distribution of a random variable on the basis of repeated observations
drawn from the same random variable."14 (p. 4)
14 The term density function is typically only used to describe the distribution of the rv
if the rv varies continuously over some range (it is a continuous rv). If a rv can take only a
countable number of values (it is a discrete rv), each with some probability, we don't call its
distribution a density function. Rather, we call it a discrete distribution. The term probability
distribution refers to either a density function, a discrete distribution, or some combination
of the two.
1.5
that I have broken my "rule" about uppercase for the name of the rv.
1.5.1
One could define a random variable c as the number of cigarettes smoked per
day by an individual, where $c_i$ denotes the number of cigarettes smoked by
individual i: c is a random variable and $c_i$ is a realized value of this random
variable.16
We might want to learn about the distribution of this random variable in
our population of interest: determine its density function. The data generating
process is draws from that density function.
Note the term population of interest. For example, the density function
for cigarettes smoked by residents of Italy is very different from the density
function for cigarettes smoked by residents of the U.S. And the density function
for cigarettes smoked by foreign, male graduate students in Boulder is different
from the density function for all Boulder residents.
What properties, if any, must these density functions have?
Can the number of cigarettes smoked take any value or must it be an integer?
Can it be a negative number? Can it be zero? Can it be 1000 a day? Someone
want to check the world record for number of cigarettes smoked in 24 hours?
Go to http://www.jimmouth.com/tv04_body.html to see some idiot smoke
159 cigarettes at once
16 Note that for many i, $c_i = 0$, particularly residents of Boulder. The only people in Boulder
who smoke appear to be foreign graduate students.
1.5.2
According to WolframAlpha the mean is 180 and the median is 173. How
did they get their estimate of the mean?
We observe four weights in the population: denote the weight of the first
person observed $w_1$, the second person $w_2$, the third $w_3$ and the fourth $w_4$.18
Every time we sample, we get four different observations: a different sample.
In the U.S. population there is a very large number of different possible samples
(different sets of 4 people). Let $w^s \equiv (w_1^s, w_2^s, w_3^s, w_4^s)$ denote sample s. $w^s$ is
a vector of four random variables, so any function of $w^s$ is a statistic.
Consider the following three statistics
$$\tilde{w} = f(w_1, w_2, w_3, w_4) = w_1 + 3w_2 + (w_3 - w_4)^2$$
$$\hat{w} = g(w_1, w_4) = \frac{w_1 + w_4}{2}$$
$$\bar{w} = h(w_1, w_2, w_3, w_4) = (.25 \ln w_1 + .25 e^{w_2}) w_3 + w_4$$
Where did these three statistics come from? I made up three functions of the
four random variables.
I now declare each of these an estimator of $\omega$; anything can be an estimator
of anything, so what I declare is not untrue: each is an estimator of $\omega$. That
said, they may be lousy estimators of $\omega$.
Every time we plug in values from a different sample, we will get new estimates.
For example, $\bar{w}^s = h(w_1^s, w_2^s, w_3^s, w_4^s) = (.25 \ln w_1^s + .25 e^{w_2^s}) w_3^s + w_4^s$ is the
estimate of $\omega$ for sample s.
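The point that each sample produces its own estimates can be sketched in a few lines. Everything about the population below is an assumption of mine: weights are drawn uniformly between 100 and 260 pounds only so that there is something to sample from. The function w_hat is (w1 + w4)/2 from the text; w_tilde follows my reading of f as w1 + 3w2 + (w3 - w4)^2, where the minus sign is an assumption.

```python
import random


def w_tilde(w1, w2, w3, w4):
    """Assumed reading of f: w1 + 3*w2 + (w3 - w4)^2."""
    return w1 + 3 * w2 + (w3 - w4) ** 2


def w_hat(w1, w4):
    """g from the text: the average of the first and fourth observations."""
    return (w1 + w4) / 2


rng = random.Random(7)  # fixed seed so the run is repeatable


def draw_sample():
    """One sample of four weights from a hypothetical uniform population."""
    return [rng.uniform(100, 260) for _ in range(4)]


hats = []
for s in range(1, 4):
    w1, w2, w3, w4 = draw_sample()
    hats.append(w_hat(w1, w4))
    print(f"sample {s}: w_tilde = {w_tilde(w1, w2, w3, w4):10.2f}, "
          f"w_hat = {hats[-1]:7.2f}")
```

Three samples give three different values of each statistic; none of them is "the" answer, they are draws.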
If God said that $\hat{w} = g(w_1, w_4) = \frac{w_1 + w_4}{2}$ was the best estimator of $\omega$, the
applied statistician/econometrician would always use this estimator to estimate
$\omega$, no matter what sample they had collected.19
God is either unavailable or unwilling; so, we need to decide which of all feasible
estimators is the preferred estimator (which has the most desirable properties).
To determine which is the preferred estimator, from those available, we ask the
theorists what properties we would like an estimator to have (not all theorists
agree), and which estimator has the most of those properties.
18 Note that here the subscript refers to different observations on w, not to four different
rvs. But, that said, one could also think of them as four different rvs. For example, in every
sample there will be a first observation, $w_1$, and this will vary from sample to sample.
19 This would be an interesting God. If she wanted to be helpful, why didn't she just tell us
$\omega$?
Since we can never know the true population mean of H, $\omega$, we cannot judge
an estimate of mean H by how close it is to the true value. (If we knew the true
value, we would not need to do estimation.)
We judge estimators, not estimates. This point is lost on many souls; those
souls should be damned to Purgatory, maybe the third level of Purgatory.
Words that come to mind when we think about the properties of an estimator
include simple, linear, unbiased (vs. biased), efficient, asymptotically unbiased,
consistent, and easy to estimate.
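Judging an estimator rather than an estimate amounts to asking how it behaves over many samples. In a simulation we get to play God and set the true mean, so we can do this directly. The sketch below assumes a normal population with mean 180 and standard deviation 30 (both numbers are hypothetical), and compares (w1 + w4)/2 with two alternatives I added for contrast: using w1 alone, and the ordinary mean of all four observations.

```python
import random
import statistics

rng = random.Random(42)   # fixed seed so the run is repeatable
OMEGA, SD = 180.0, 30.0   # assumed true mean and std. dev. (hypothetical)


def draw_sample():
    """One sample of four weights from the assumed normal population."""
    return [rng.gauss(OMEGA, SD) for _ in range(4)]


def w_hat(w):
    """(w1 + w4) / 2, the estimator g from the text."""
    return (w[0] + w[3]) / 2


def first_obs(w):
    """Use only w1: a deliberately crude alternative (my addition)."""
    return w[0]


def mean_of_four(w):
    """The ordinary sample mean of all four observations (my addition)."""
    return statistics.fmean(w)


samples = [draw_sample() for _ in range(20_000)]
results = {}
for est in (w_hat, first_obs, mean_of_four):
    estimates = [est(w) for w in samples]
    bias = statistics.fmean(estimates) - OMEGA   # average miss across samples
    var = statistics.pvariance(estimates)        # spread across samples
    results[est.__name__] = (bias, var)
    print(f"{est.__name__:13s} bias = {bias:7.3f}  variance = {var:8.2f}")
```

Under these assumptions all three are roughly unbiased, but their variances differ, so the comparison is between the distributions of the estimators, not between any single estimates.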
1.5.3
(1)
bpi )2
i=1
20 Note that this is a pretty stupid (unrealistic) model because most people smoke no cigarettes; in addition, no one smokes a negative number of cigarettes, so consumption cannot be
normally distributed.
21 Econometricians like to assume away most of the estimation problem. We impose a lot of
assumptions on our models, often independently of anything the data might suggest.
Denote this estimator $b_{OLS}$. Every sample taken will generate a different $b_{OLS}$
estimate of $\beta$. Note that $b_{OLS}$ is a rv; $\beta$ is not a rv, it is a constant. Let $b^s_{OLS}$
denote the OLS estimate generated by sample s.
As applied econometricians, we often mistakenly concentrate on the obtained
estimate rather than keeping in mind that our $b_{OLS}$, $b^1_{OLS}$, is just one draw from
a distribution of $b_{OLS}$.22
That is, bOLS is a random variable with some density function f (bOLS ).
Much of the work of the classical linear regression model has to do with
deriving that density function.
Once we have it, we can answer questions such as "Given $\beta$, what is the
probability that an estimate, $b_{OLS}$, will be between $(\beta - \epsilon)$ and $(\beta + \epsilon)$?" Or,
of more relevance, "What is the probability that $\beta$ is between $(b_{OLS} - \epsilon)$ and
$(b_{OLS} + \epsilon)$?"
So, put simply, the OLS estimator is a special type of statistic, an estimator.
And, OLS estimates are rvs with some distribution.
We like OLS estimates - when we assume the classical linear-regression model
- because we can show that the OLS estimator has nice properties: it is, if one
buys the assumptions, BLUE (a Best Linear Unbiased Estimator).
Note that what has nice properties is the estimator, not any particular
estimate generated by the estimator. Our actual OLS estimate of $\beta$ often sucks.
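A simulation makes the distribution of the OLS estimator concrete. The sketch below assumes a simple model y = alpha + beta*x + e with normal errors; the model, the true values (alpha = 2, beta = 0.5, sigma = 1), the fixed regressors 1..20, and the tolerance of 0.05 are all hypothetical choices of mine. It collects many samples, computes the OLS slope for each, and reports the share of estimates that land within 0.05 of the true slope.

```python
import random
import statistics


def ols_slope(xs, ys):
    """OLS slope for a regression of y on a constant and x."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(1)                  # fixed seed so the run is repeatable
ALPHA, BETA, SIGMA = 2.0, 0.5, 1.0      # hypothetical true values
xs = [float(i) for i in range(1, 21)]   # fixed regressors, n = 20


def one_estimate():
    """Draw one sample of y values and return its OLS slope estimate."""
    ys = [ALPHA + BETA * x + rng.gauss(0, SIGMA) for x in xs]
    return ols_slope(xs, ys)


estimates = [one_estimate() for _ in range(5000)]
eps = 0.05
share = sum(abs(b - BETA) <= eps for b in estimates) / len(estimates)

print("mean of estimates:", round(statistics.fmean(estimates), 4))
print("std. dev. of estimates:", round(statistics.stdev(estimates), 4))
print("share within beta +/- eps:", round(share, 3))
```

The mean of the estimates sits near the true slope (the unbiasedness property), yet a noticeable share of individual estimates miss by more than eps, which is the sense in which a single estimate can be poor even when the estimator is good.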
22 $b^1_{OLS}$ is the estimate from the first sample. I assume that most of the time we only collect
one sample. When we do simulations, we will collect many samples.