2 - Statistics Review

This Week RVs Sampling Hypothesis Testing Summary
G579: Business Econometrics

Statistics Review
Sebastian Wai
Kelley School of Business
Spring 2017
Readings for this Week
Main textbook:
- Chapter 1: The Roles of Data and Predictive Analytics in Business
- Chapter 3: Reasoning from Sample to Population
Objectives for this Week
Introduce the course
Get started using Stata
Statistics review
Working with Samples
These slides introduce our first tools for working with data samples.
Because of this, this slide show is quite math and stats-heavy.
We will learn some very important tools here that we will use for
the rest of the course.
Discrete Continuous CDFs
Random Variables
A random variable is a variable whose value can take on more than
one value, dependent on chance.
- Contrast this with deterministic variables, which can always be
  predicted with certainty
- The chance that a random variable takes certain values is
  determined by a distribution function, which may be unknown
  to us
- The set of possible values is called the support
Draws
Sampling observations from a population is modeled as a
draw from a distribution
- E.g., asking a random person in the class their age is a draw
  from the class population
- Denote random variable \(X_i\) as a single draw from the
  population
Discrete Random Variables
A discrete random variable has a finite or countably infinite
number of possible values
- A coin flip has support {Heads, Tails}
- A count variable has support {0, 1, 2, ...}
- The probability each value is drawn is given by the probability
  mass function (pmf)
- A fair coin has the pmf:
\[\Pr(X = x) = 0.5, \quad x \in \{\text{Heads}, \text{Tails}\}\]
Bernoulli Distribution
The coin flip is an example of a Bernoulli distributed random
variable, one of the simplest probability distributions. The pmf of a
Bernoulli variable is
\[\Pr(X = x) = \begin{cases} p & \text{if } x = 1 \\ 1 - p & \text{if } x = 0 \end{cases}\]
For the coin flip, \(p = 0.5\).
Binomial Distribution
In a business environment, we might use the binomial distribution
to model sales. Suppose a company knows 100 people will access
their website each hour, and that each visitor has a 20% chance of
making a purchase. What is the probability that at least 20 people
make a purchase? Add up the pmf from 20 to 100:
\[\Pr(X \ge 20) = \sum_{x=20}^{100} \binom{100}{x} 0.2^x (0.8)^{100-x} \approx 0.54\]
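As a rough check on the sum above, the tail probability can be computed directly. This is only a minimal Python sketch: the function names `binomial_pmf` and `prob_at_least` are illustrative and not part of the course materials, and it assumes the success probability 0.2 that appears in the sum.

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent Bernoulli(p) draws."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def prob_at_least(k: int, n: int, p: float) -> float:
    """Pr(X >= k): add up the pmf from k to n."""
    return sum(binomial_pmf(x, n, p) for x in range(k, n + 1))

visitors = 100        # people accessing the website each hour
purchase_rate = 0.2   # chance that each visitor makes a purchase

print(round(prob_at_least(20, visitors, purchase_rate), 2))
```

The result should land close to the 0.54 quoted on the slide.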
Binomial Distribution
The binomial distribution is closely related to the Bernoulli
distribution: it represents a series of Bernoulli draws. In addition
to p, we need to know the number of Bernoulli draws n. A draw from
the binomial distribution is the number of successes (1s) drawn from
the Bernoulli. The pmf is:
\[\Pr(X = x) = \begin{cases} \binom{n}{x} p^x (1 - p)^{n-x} & \text{if } x \in \{0, 1, \dots, n\} \\ 0 & \text{otherwise} \end{cases}\]
Example: Roulette
It was obvious that all this ritual and all the mechanical minutiae
of the wheel, of the numbered slots and the cylinder, had been
devised and perfected over the years so that neither the skill of the
croupier nor any bias in the wheel could affect the fall of the ball.
And yet it is a convention among roulette players, and Bond rigidly
adhered to it, to take careful note of the peculiarities in the run of
the wheel... Bond didn't defend the practice. He simply
maintained that the more effort and ingenuity you put into
gambling, the more you took out.
Ian Fleming, Casino Royale (1953)
Example: Roulette
A French roulette wheel has:
- 18 black slots
- 18 red slots
- 1 green zero slot
From the passage, we know James has concluded the wheel is fair.
Suppose he decides to bet on black for 10 spins. A winning bet on
black pays out an amount equal to the initial bet; a losing bet
forfeits it.
Exercises
What is the probability James will win k spins out of 10?
Construct a table representing the probability mass function.
What is the probability James will win at least 6 spins?
How many spins do we expect James to win?
If he bets 100 francs on each spin, what is the expected payout
of 10 spins?
Solution
If the wheel is fair, the chance of a black is 18/37, or about 0.486. Now,
we go to the binomial distribution. The probability of k
successes given 10 trials and success probability 0.486 is given
by:
\[\Pr(X = k) = \binom{10}{k} 0.486^k (0.514)^{10-k}\]
Solution
Based on this, we can construct the table:

 k    Pr(x = k)   Pr(x ≥ k)   Pr(x ≤ k)
 0    0.0013      1           0.0013
 1    0.0121      0.9987      0.0134
 2    0.0515      0.9866      0.0648
 3    0.1301      0.9352      0.1949
 4    0.2157      0.8051      0.4106
 5    0.2452      0.5894      0.6558
 6    0.1936      0.3442      0.8494
 7    0.1048      0.1506      0.9542
 8    0.0372      0.0458      0.9914
 9    0.0078      0.0086      0.9993
10    0.0007      0.0007      1
Solution
Expected wins: 4.86

Expected payout: -27.03 francs
Probability at least 6 wins: 0.34
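The three answers can be reproduced from the binomial pmf. The sketch below is illustrative only: the names `prob_wins`, `BET`, and so on are not from the slides, and the payout is assumed to be a net gain of one bet on a win and a net loss of one bet otherwise.

```python
from math import comb

P_BLACK = 18 / 37   # 18 black slots out of 37 on a fair wheel
SPINS = 10
BET = 100           # francs staked on each spin

def prob_wins(k: int) -> float:
    """Probability of exactly k winning spins out of SPINS."""
    return comb(SPINS, k) * P_BLACK**k * (1 - P_BLACK) ** (SPINS - k)

pmf = [prob_wins(k) for k in range(SPINS + 1)]

prob_at_least_6 = sum(pmf[6:])
expected_wins = sum(k * p for k, p in enumerate(pmf))
# Net payout for k wins: k bets gained, (SPINS - k) bets lost.
expected_payout = sum((k - (SPINS - k)) * BET * p for k, p in enumerate(pmf))

print(f"Pr(at least 6 wins) = {prob_at_least_6:.2f}")
print(f"Expected wins       = {expected_wins:.2f}")
print(f"Expected payout     = {expected_payout:.2f} francs")
```

The expected number of wins is just 10 times 18/37; the weighted sum is used here to mirror the table.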
Custom Distribution
We can also use a table of probabilities to describe a pmf, as we
did in the roulette example. For example, suppose the population
of my bookshelf is: 3 books by Salman Rushdie, 1 by Philip K.
Dick, 2 by Dan Abnett, and 4 by Ian Fleming. The pmf of a draw
from the shelf can be written

Author     Probability
Rushdie    0.3
Dick       0.1
Abnett     0.2
Fleming    0.4
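A table pmf like this one is just each count divided by the total. A small sketch in Python, with variable names chosen here for illustration:

```python
from fractions import Fraction

# Book counts by author, as listed on the slide.
shelf = {"Rushdie": 3, "Dick": 1, "Abnett": 2, "Fleming": 4}
total_books = sum(shelf.values())

# pmf: the share of the shelf written by each author.
pmf = {author: Fraction(count, total_books) for author, count in shelf.items()}

for author, prob in pmf.items():
    print(author.ljust(8), float(prob))
```

Using `Fraction` keeps the probabilities exact, so they sum to exactly 1.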
Continuous Random Variable
A continuous random variable has an (uncountably) infinite support,
such as
- Any number between 1 and 10: \([1, 10]\)
- Any non-negative number: \([0, \infty)\)
- Any number at all: \((-\infty, \infty)\)
Probability Density Function
In general, since there are infinitely many possible values, the
probability of drawing any individual number is zero. Instead of a
pmf, we now have a probability density function (pdf).
- The area under the pdf between a and b tells us the
  probability a draw will be in the interval [a, b]
- In other words, the integral of the pdf from a to b
- Before anyone panics, we won't be doing integrals in this class!
Normal Distribution
Arguably the most famous continuous distribution is the normal
distribution, whose pdf is known as the bell curve.
[Figure: pdf of the normal distribution; y-axis: Density]
Normal Distribution
To describe a particular normal distribution's shape, we need to
know its mean (\(\mu\)) and variance (\(\sigma^2\)).
- Mean is the expected value of the random variable, \(\mu = E[X_i]\)
- Variance is the spread of the variable, \(\sigma^2 = E\left[(X_i - E[X_i])^2\right]\)
- Define standard deviation (\(\sigma\)) as the square root of the
  variance
If \(X_i\) is normally distributed, we write
\[X_i \sim \text{Normal}(\mu, \sigma)\]
Normal Distribution
Put this all together into the pdf:
\[f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)\]
Looks awfully complicated? We will usually use computers when
dealing with this pdf!
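As one example of letting the computer do the work, the sketch below writes the normal pdf out by hand and compares it with Python's standard-library `statistics.NormalDist`. The parameter values and the interval [-1, 1] are illustrative choices, not from the slides.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Normal density at x for mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / sqrt(2 * pi * sigma**2)

mu, sigma = 0.0, 1.0           # illustrative parameters
dist = NormalDist(mu, sigma)   # standard-library equivalent

# The hand-written formula and the library should agree.
print(normal_pdf(0.5, mu, sigma), dist.pdf(0.5))

# The probability of a draw landing in [a, b] is the area under the pdf,
# which the CDF gives without doing the integral by hand.
a, b = -1.0, 1.0
area = dist.cdf(b) - dist.cdf(a)
print(round(area, 4))
```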
Cumulative Distribution Function
The cumulative distribution function (CDF) tells us the probability
the random variable will take a value of at most x. For a discrete
variable,
\[\Pr(X \le x) = \sum_{k \le x} \Pr(X = k)\]
and for a continuous variable,
\[F(x) = \int_{k=-\infty}^{x} f(k)\,dk\]
When x is the highest value of the support, the CDF equals 1. If
not, something went wrong!
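For a discrete variable, the CDF is a running total of the pmf. The sketch below builds it for the roulette example (10 spins, chance of black 18/37) and checks that the last entry equals 1. The variable names are illustrative.

```python
from itertools import accumulate
from math import comb

n, p = 10, 18 / 37  # roulette example: 10 spins, chance of black 18/37

pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
cdf = list(accumulate(pmf))  # cdf[k] = Pr(X <= k)

for k, value in enumerate(cdf):
    print(k, round(value, 4))
```

The printed column should line up with the Pr(x ≤ k) column of the roulette table.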
Basics Confidence Intervals
Population vs. Sample
So far, we have been talking in terms of the population. \(\mu\) and
\(\sigma\) are population parameters, whose true values are unknown to us.
- For example, consider the population of your company's
  potential customers.
- Your boss wants to know the average age of the type of
  person who buys your product. What do you do?
- Draw a sample!
From here, we can use the sample to produce estimates of the
true parameters.
Notation
Sample notation:
- Sample size (\(N\)): the number of observations/draws in the
  sample
- Denote our sample of \(N\) realizations of random variable \(X_i\) as
  \(\{X_1, X_2, \dots, X_N\}\)
Sample Statistics
Sample mean:
\[\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i\]
Sample variance:
\[S^2 = \frac{1}{N - 1} \sum_{i=1}^{N} \left(X_i - \bar{X}\right)^2\]
Sample standard deviation:
\[S = \sqrt{S^2}\]
These statistics are the sample counterparts to the population
parameters \(\mu\), \(\sigma^2\), and \(\sigma\).
Sample Statistics
Notice how the formula for sample variance differs from \(\sigma^2\).
- \(N - 1\) in the denominator instead of \(N\)
- If using Excel, make sure you use the sample versions of the
  functions (VAR.S and STDEV.S, not VAR.P and STDEV.P)
- For very large samples, the difference is negligible
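The same distinction shows up in Python's standard library, where `variance` and `stdev` divide by N - 1 and `pvariance` divides by N. The five ages below are made-up numbers used purely for illustration.

```python
from statistics import mean, pvariance, stdev, variance

sample = [23, 31, 27, 45, 38]   # hypothetical ages, for illustration only

x_bar = mean(sample)            # sample mean
s2 = variance(sample)           # sample variance, N - 1 in the denominator
s = stdev(sample)               # sample standard deviation, sqrt of s2
divide_by_n = pvariance(sample) # population-style formula, N in the denominator

print(x_bar, s2, s, divide_by_n)
```

With only five observations the two variance formulas differ noticeably; the gap shrinks as N grows.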
Estimators
Since we don't know the true values of the parameters, sample
statistics will serve as estimators. How good are the estimates?
- Selection bias can make the estimates inaccurate
- Assume, for now, the sample is truly random (all population
  members are equally likely to be sampled)
- This means individual draws are independent and identically
  distributed (i.i.d.): every draw comes from the same distribution,
  and the distribution of a given draw does not depend on the
  realization of any other draw
Unbiasedness
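When the draws are i.i.d., the sample mean and sample variance are
unbiased estimators for the population parameters. Formally, the
sample statistic is unbiased when the mean of the statistic equals
the population parameter:
\[E[\bar{X}] = \mu\]
\[E[S^2] = \sigma^2\]
\[E[S] \approx \sigma\]
Strictly speaking, \(S\) slightly underestimates \(\sigma\) on average, but
the difference is negligible in large samples.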
Unbiasedness
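\(\bar{X}\) is an unbiased estimator for \(\mu\).
Proof. Start from \(E[X_i] = \mu\) and the definition of the sample mean:
\[\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i\]
\[E[\bar{X}] = \frac{1}{N} \sum_{i=1}^{N} E[X_i] = \frac{1}{N} \sum_{i=1}^{N} \mu\]
\[E[\bar{X}] = \frac{1}{N} N \mu = \mu\]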
Confidence Intervals
- It is still unlikely the sample statistics will perfectly peg the
  population parameters
- Confidence intervals give a range that contains the population
  parameter with a specified probability
- For example, suppose I ask 20 random people in Bloomington
  their income; while the average of my sample may be close to
  the true mean of Bloomington residents' income, chances are
  that it is not quite accurate.
- But can we provide a range?
Variance of X
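In order to construct a confidence interval for the mean estimate,
we need to know how \(\bar{X}\) is distributed. We already know its mean
is \(\mu\), but what is the variance? Because the draws are independent,
the variance of their sum is the sum of their variances:
\[\mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{N} \sum_{i=1}^{N} X_i\right)\]
\[\mathrm{Var}(\bar{X}) = \frac{1}{N^2} \mathrm{Var}\left(\sum_{i=1}^{N} X_i\right)\]
\[\mathrm{Var}(\bar{X}) = \frac{1}{N^2} N \,\mathrm{Var}(X_i)\]
\[\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{N}\]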
Central Limit Theorem
From the derived variance, we know the standard deviation of \(\bar{X}\) is
\(\sigma / \sqrt{N}\). Intuitively, it should make sense that the mean of the
sample has less spread than individual draws.
Theorem
When \(N\) is sufficiently large (conventionally, \(N \ge 30\)), the sample
mean is approximately normally distributed. That is,
\[\bar{X} \sim \text{Normal}\left(\mu, \frac{\sigma}{\sqrt{N}}\right)\]
CLT Example
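For an illustration, let's return to the roulette example. Recall that
the French roulette wheel we discussed has 37 slots. For this
illustration, take the outcomes to be the integers 0 to 37, each
equally likely. (That is 38 values, which is what the parameters
below correspond to; numbering the 37 slots 0 to 36 would instead
give \(\mu = 18\) and \(\sigma^2 = 114\).) Let random variable \(X_i\) be the
number the ball lands on. The parameters are
\[\mu = 18.5, \quad \sigma^2 = 120.25, \quad \sigma = 10.97\]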
CLT Example
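Experiment:
- Simulate samples of 30 roulette spins
- Record the mean of the 30 spins
- Repeat 10,000 times
Results:
- \(E[\bar{X}] = 18.47\)
- \(\mathrm{Var}(\bar{X}) = 4.02\)
- \(\mathrm{Var}(\bar{X}) \times 30 = 120.67\)
These results are quite close to the predicted values
(\(\mu = 18.5\) and \(\sigma^2 = 120.25\)).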
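An experiment along these lines can be sketched in a few lines of Python. This is not the code behind the slide's numbers: the seed is arbitrary, the names are illustrative, and each spin is assumed to be an equally likely integer from 0 to 37, matching the parameters \(\mu = 18.5\) and \(\sigma^2 = 120.25\). Exact results will differ slightly from run to run and from the slide.

```python
import random
from statistics import mean, variance

random.seed(579)          # arbitrary seed so the run is repeatable

SPINS_PER_SAMPLE = 30
REPETITIONS = 10_000

def sample_mean() -> float:
    """Mean of one sample of spins, each an integer from 0 to 37."""
    spins = [random.randint(0, 37) for _ in range(SPINS_PER_SAMPLE)]
    return sum(spins) / SPINS_PER_SAMPLE

sample_means = [sample_mean() for _ in range(REPETITIONS)]

avg_of_means = mean(sample_means)       # theory: 18.5
var_of_means = variance(sample_means)   # theory: 120.25 / 30

print(f"Average of sample means:    {avg_of_means:.2f}")
print(f"Variance of sample means:   {var_of_means:.2f}")
print(f"Variance times sample size: {var_of_means * SPINS_PER_SAMPLE:.2f}")
```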
CLT Example
Histogram of sample means with fitted Normal density:
[Figure: "Histogram of Means"; x-axis: Means, y-axis: Frequency]
Now that we have established the distribution of the sample mean,
we are ready to construct confidence intervals. Commonly, we
consider 90%, 95%, and 99% confidence intervals. For Normal
random variables, draws fall within
- 1.65 s.d. of the mean 90% of the time
- 1.96 s.d. of the mean 95% of the time
- 2.58 s.d. of the mean approximately 99% of the time
You can use a computer or table to do this for CIs of any size.
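One way to look up these multipliers on a computer is the inverse CDF of the standard normal, available in Python's standard library as `NormalDist.inv_cdf`. The helper name `critical_value` is an illustrative choice.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def critical_value(confidence: float) -> float:
    """Number of s.d. from the mean that bounds the central `confidence` share."""
    tail = (1 - confidence) / 2          # probability left in each tail
    return standard_normal.inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: {critical_value(level):.3f}")
```

The 90% value comes out near 1.645, which is commonly rounded to 1.65 as on the slide.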
This implies that in 90% of samples, the sample mean will be
contained in
\[\left[\mu - 1.65 \frac{\sigma}{\sqrt{N}},\ \mu + 1.65 \frac{\sigma}{\sqrt{N}}\right]\]
and so forth for the other intervals.
But wait, this looks a bit backward. We don't actually know \(\mu\) and
\(\sigma\); we were trying to find estimates for those parameters in the
first place.
Instead, we'll flip things around. Replace \(\mu\) with \(\bar{X}\) and \(\sigma\) with \(S\),
and we have what we're looking for.
\[CI_{90} = \left[\bar{X} - 1.65 \frac{S}{\sqrt{N}},\ \bar{X} + 1.65 \frac{S}{\sqrt{N}}\right]\]
will contain \(\mu\) 90% of the time. Using the language of inductive
reasoning, we say:
Based on a sample of size \(N\) with mean \(\bar{X}\) and standard deviation \(S\), the
population mean is contained within \(CI_{90}\), as defined above. The
degree of support for this argument is 90%.
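As a worked instance of the interval, the sketch below plugs in the sample figures reported later in these slides for the Indiana income data (mean 30,552.95, standard deviation 30,623.11, N = 100). The function name and its default multiplier of 1.65 are illustrative assumptions.

```python
from math import sqrt

def confidence_interval(x_bar: float, s: float, n: int, z: float = 1.65):
    """Return (lower, upper) bounds: x_bar -/+ z * s / sqrt(n)."""
    margin = z * s / sqrt(n)
    return x_bar - margin, x_bar + margin

# Sample mean, sample s.d., and sample size from the Indiana example.
lower, upper = confidence_interval(x_bar=30552.95, s=30623.11, n=100)
print(f"CI90 = [{lower:.2f}, {upper:.2f}]")
```

Passing `z=1.96` or `z=2.58` would give the 95% and 99% intervals in the same way.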
We have just learned how to construct a confidence interval for
one parameter, \(\mu\). For other parameters, such as \(\sigma\), the specifics
will be different, but the overall idea is the same.
- Find the distribution of the parameter, in terms of sample
  statistics
- Calculate bounds to the left and right of the sample analogue
  corresponding with the desired confidence level
Intro Steps Discussion P-Value
Hypothesis Testing
We will conclude this week with formal hypothesis testing. A
hypothesis test uses a data sample to assess the credibility of a claim
(the hypothesis) about the underlying population. The steps to a
hypothesis test are:
1. State the null hypothesis and the alternative hypothesis
2. Collect a sample
3. Decide on a confidence level and rejection rule
4. Calculate the test statistic
5. Reject the null hypothesis if the test statistic falls outside the
range allowed by the rejection rule
What is a Hypothesis Test?
Hypothesis tests are a component of making an inductive
argument.
- Degree of support, also called inductive probability, is a
  measure of the strength of the inductive conclusion
- We distinguish between a subjective degree of support, based
  on gut instinct, and an objective degree of support, based on
  data and statistical analysis
- "I am 99% sure the Steelers will win the Super Bowl!"
- A hypothesis test is a tool used to determine whether or not
  the data backs up the conclusion for a particular degree of
  support
Hypothesis Testing
We will discuss hypothesis testing in terms of the population mean,
though we can do a test concerning any population parameter.
Suppose we want to test the hypothesis that the mean personal
income in Indiana is $35,000. We will restrict this to Hoosiers
aged 18 to 54. You can follow along with the example using the
dataset indiana100.dta.
Step 1: State the Hypothesis

The null hypothesis (\(H_0\), pronounced "H-naught") is the default
position we are trying to test. Null hypotheses generally state
that a parameter of interest is equal to something else, which can
be a number or another parameter, as in
\[H_0: \theta = K,\]
where \(\theta\) is a population parameter. In our example,
\[H_0: \mu = \$35{,}000\]
The alternative hypothesis (\(H_1\)) is that the parameter is not equal
to what we hypothesized. In this case,
\[H_1: \mu \neq \$35{,}000\]
Step 2: Collect a Sample
To test the hypothesis, I took a random sample of 100 Indiana
residents aged 18-54 from the 2014 American Community Survey.
- Measures: Sex, age, race, education, personal income (we
  only need the last one, but I included a few others)
- Dimensions: State = IN, Age \(\in [18, 54]\), households only
  (eliminates prisons, mental institutions, etc.)
Step 3: Confidence Level
Now, we need to decide on a confidence level. The stricter the
confidence level, the less likely it is that the hypothesis is rejected,
but when it does happen, the degree of support is high. Some of
the most commonly used levels are 90%, 95%, and 99%. Let's go
with 95% for now.
- Recall that since \(N > 30\), by the CLT, the sample mean is
  approximately normally distributed
- Thus, for 95% confidence, we reject if the sample mean is
  more than 1.96 standard deviations (of the sample mean) away
  from the null value ($35,000)
Step 4: Test Statistic
The test statistic is the value we test to determine if the null is
rejected. The form of the test statistic depends on the parameter
of interest.
- The criterion is: reject if the difference between \(\bar{X}\) and
  $35,000 is more than 1.96 standard deviations
- Construct the test statistic:
\[z = \frac{\bar{X} - \$35{,}000}{\sigma / \sqrt{N}}\]
We're good to go, then, right?
Step 4: Test Statistic

Not quite yet. We don't actually know \(\sigma\), the population standard
deviation.
- Use the next best thing, the sample standard deviation, \(S\)
- Rewrite the test statistic
\[t = \frac{\bar{X} - \$35{,}000}{S / \sqrt{N}}\]
- When we use the sample S.D., we have a t-statistic, which
  follows the t-distribution
- But, when \(N > 30\), the t and normal distributions are very
  similar
Step 5: Do the Test
First, calculate the sample mean and standard deviation:
\[\bar{X} = 30{,}552.95\]
\[S = 30{,}623.11\]
Plug these into t to get
\[t = \frac{30{,}552.95 - 35{,}000}{30{,}623.11 / \sqrt{100}} = -1.452\]
We find $35,000 is only 1.452 standard deviations from the sample
mean, not enough to reject the hypothesis at 95% confidence.
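The arithmetic for this step can be sketched as follows. The function name `t_statistic` is illustrative, and the 1.96 cutoff relies on the normal approximation discussed in Step 3.

```python
from math import sqrt

def t_statistic(x_bar: float, s: float, n: int, null_value: float) -> float:
    """t = (x_bar - null_value) / (s / sqrt(n))."""
    return (x_bar - null_value) / (s / sqrt(n))

x_bar, s, n = 30552.95, 30623.11, 100   # sample figures from the slide
critical = 1.96                          # 95% confidence, normal approximation

t = t_statistic(x_bar, s, n, null_value=35000)
reject = abs(t) > critical
print(f"t = {t:.3f}, reject H0 at 95%: {reject}")
```

Changing `null_value` to 40000 gives a t-stat of about -3.08 for the same sample.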
What Does it All Mean?
Okay, so we didn't reject the hypothesis. That means that the
average income for Hoosiers aged 18-54 is $35,000, right?
- No! "Failed to reject" is not the same as "confirmed."
- What we've done is a bit more nuanced: we showed that the
  evidence we have is not strong enough to say, with 95%
  degree of support, that the mean is not $35,000
- Had we made the more extreme claim that the average
  income is $40,000, the t-stat is -3.08, enough to reject even
  at 99% degree of support
Sample Size
The power of sample size in hypothesis testing cannot be
overstated.
- Note the \(\sqrt{N}\) in the t-stat
- All other things equal, larger sample sizes make larger t-stats
  (in absolute value)
- In other words, larger samples make for stronger evidence
- This is intuitive. What if we tried to make claims about the
  entire state with just two people's income?
- Using the full ACS sample of 28,683 people, the t-stat
  becomes 7.40, a very strong rejection!
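To see the role of \(\sqrt{N}\) in isolation, the sketch below holds the sample mean and standard deviation fixed at the N = 100 values and varies only N. This is a hypothetical exercise: a real larger sample (such as the full ACS sample) has its own mean and standard deviation, so it does not reproduce the 7.40 figure.

```python
from math import sqrt

def t_statistic(x_bar: float, s: float, n: int, null_value: float) -> float:
    """t = (x_bar - null_value) / (s / sqrt(n))."""
    return (x_bar - null_value) / (s / sqrt(n))

# Hypothetical: keep the N = 100 sample's mean and s.d. fixed, vary only N.
x_bar, s, null_value = 30552.95, 30623.11, 35000
t_by_n = {n: t_statistic(x_bar, s, n, null_value) for n in (2, 100, 400, 10_000)}

for n, t in t_by_n.items():
    print(f"N = {n:6d}: t = {t:.2f}")
```

Quadrupling N doubles the t-stat, since the denominator shrinks with the square root of N.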
P-Values
An alternative way to approach a hypothesis test is to calculate the
p-value. The p-value tells us:
- Assuming the sample is large and the null hypothesis is
  correct, what is the probability a t-stat at least as large (in
  absolute value) as the one we got would be observed?
- In other words, if we rejected whenever the t-stat was this
  extreme and the null is correct, what would be the false
  rejection rate?
- The more extreme the t-stat, the lower the p-value
- The p-value also tells us the highest degree of support at
  which we could reject the null (one minus the p-value)
P-Values
[Figure: graphical representation of the p-value]
P-Values
We know that t-stats follow the t distribution. Using this
distribution, we can calculate the p-value
- The degrees of freedom of a t distribution is \(N - 1\)
- P-value is the area under the distribution to the right of \(|t|\) and
  to the left of \(-|t|\)
- For our example, the p-value is 0.15
- This means the highest degree of support for a rejection of
  the null would be 1 - 0.15 = 85%
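A rough version of this calculation can be done with Python's standard library, which has the normal distribution but not the t distribution. The sketch below therefore uses the normal approximation (reasonable here because N = 100 is large), so it will differ slightly from the exact t-based value; the function name is illustrative.

```python
from statistics import NormalDist

def two_tailed_p_value(t: float) -> float:
    """Area in both tails beyond |t| under the standard normal curve."""
    return 2 * (1 - NormalDist().cdf(abs(t)))

p_value = two_tailed_p_value(-1.452)   # t-stat from the Indiana example
print(f"p-value = {p_value:.2f}")
print(f"highest degree of support = {1 - p_value:.0%}")
```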
t Distribution in Excel
To find a p-value in Excel, we need to use the function T.DIST.2T
- T.DIST.2T(t, df)
- t is the absolute value of our t-stat (the function returns an
  error for negative values)
- df is the degrees of freedom, equal to N - 1
- Note that 2T indicates a 2-tailed test. While 1-tailed tests are
  possible, we won't be doing them in this class.
Mean Hypothesis Test in Stata
To do the test in Stata, load up indiana100.dta and run the
command
- ttest inctot == 35000
- Note the double-equals sign for the logical equals
- We will focus on the two-tailed test in the middle of the output.
Summary
Important takeaways:
- Random variables represent draws from a population
- If the sample is random, sample statistics are unbiased
  estimators of population parameters
- Confidence intervals give a range that a parameter lies within
  for a given probability
- Hypothesis testing is important: know how to do the tests and
  understand the interpretation
Up Next
Next, we will cover randomized trials.
- We can then do hypothesis tests using the results of
  randomized trials
- Then, we turn to regression
- Anyone can hit the button to run a regression
- But we need to understand hypothesis tests to understand
  and interpret the numbers that come out
