
This Week RVs Sampling Hypothesis Testing Summary

G579: Business Econometrics


Statistics Review
Sebastian Wai
Kelley School of Business

Spring 2017


Readings for this Week

Main textbook:
- Chapter 1: The Roles of Data and Predictive Analytics in Business
- Chapter 3: Reasoning from Sample to Population


Objectives for this Week

Introduce the course

Get started using Stata

Statistics review


Working with Samples

These slides introduce our first tools for working with data samples.
As a result, this deck is fairly math- and stats-heavy.
The tools we learn here are ones we will use for the rest of the course.


Discrete Continuous CDFs

Random Variables

A random variable is a variable that can take on more than one value, depending on chance.
- Contrast this with deterministic variables, which can always be predicted with certainty
- The chance that a random variable takes certain values is determined by a distribution function, which may be unknown to us
- The set of possible values is called the support


Draws

Sampling observations from a population is modeled as a draw from a distribution.
- E.g., asking a random person in the class their age is a draw from the class population
- Denote random variable \(X_i\) as a single draw from the population


Discrete Random Variables

A discrete random variable has a finite or countably infinite number of possible values.
- A coin flip has support {Heads, Tails}
- A count variable has support {0, 1, 2, ...}
- The probability that each value is drawn is given by the probability mass function (pmf)
- A fair coin has the pmf:
  \[ \Pr(X = x) = 0.5, \quad x \in \{\text{Heads}, \text{Tails}\} \]


Bernoulli Distribution

The coin flip is an example of a Bernoulli-distributed random variable, one of the simplest probability distributions. The pmf of a Bernoulli variable is
\[ \Pr(X = x) = \begin{cases} p & \text{if } x = 1 \\ 1 - p & \text{if } x = 0 \end{cases} \]
For the coin flip, \(p = 0.5\).


Binomial Distribution

In a business environment, we might use the binomial distribution to model sales. Suppose a company knows 100 people will access their website each hour, and that 20% of those will make a purchase. What is the probability that at least 20 people make a purchase? Add up the pmf from \(x = 20\) to \(x = 100\):
\[ \Pr(X \ge 20) = \sum_{x=20}^{100} \binom{100}{x} 0.2^x (0.8)^{100-x} \approx 0.54 \]
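The sum above is tedious by hand but short in code. The following is a minimal Python sketch (Python is not part of the course materials, which use Stata and Excel; the function names are illustrative) that adds up the binomial pmf with purchase probability 0.2, as in the formula:

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent Bernoulli(p) draws."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binomial_tail(k: int, n: int, p: float) -> float:
    """Probability of at least k successes: sum the pmf from k up to n."""
    return sum(binomial_pmf(x, n, p) for x in range(k, n + 1))


# 100 visitors per hour, each assumed to buy independently with probability 0.2
prob_at_least_20 = binomial_tail(20, 100, 0.2)
print(round(prob_at_least_20, 2))  # 0.54
```

Summing the pmf over the whole support (0 to 100) is a quick sanity check: the total should be 1.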


Binomial Distribution

A distribution related to the Bernoulli is the binomial distribution, which represents a series of independent Bernoulli draws. In addition to \(p\), we need to know the number of Bernoulli draws \(n\). A draw from the binomial distribution is the number of successes (1s) among the \(n\) Bernoulli draws. The pmf is:
\[ \Pr(X = x) = \begin{cases} \binom{n}{x} p^x (1 - p)^{n-x} & \text{if } x \in \{0, 1, \dots, n\} \\ 0 & \text{otherwise} \end{cases} \]


Example: Roulette

It was obvious that all this ritual and all the mechanical minutiae
of the wheel, of the numbered slots and the cylinder, had been
devised and perfected over the years so that neither the skill of the
croupier nor any bias in the wheel could affect the fall of the ball.
And yet it is a convention among roulette players, and Bond rigidly
adhered to it, to take careful note of the peculiarities in the run of
the wheel... Bond didn't defend the practice. He simply
maintained that the more effort and ingenuity you put into
gambling, the more you took out.
Ian Fleming, Casino Royale (1953)


Example: Roulette

A French roulette wheel has:
- 18 black slots
- 18 red slots
- 1 green zero slot

From the passage, we know James has concluded the wheel is fair. Suppose he decides to bet on black for 10 spins. A winning bet on black pays even money: he wins an amount equal to the initial bet, and loses the bet otherwise.


Exercises

What is the probability James will win k spins out of 10?

Construct a table representing the probability mass function.

What is the probability James will win at least 6 spins?

How many spins do we expect James to win?

If he bets 100 francs on each spin, what is the expected payout of 10 spins?


Solution

If the wheel is fair, the chance of black is \(18/37\), or about 0.486. Now, we go to the binomial distribution. The probability of \(k\) successes given 10 trials and success probability 0.486 is given by:
\[ \Pr(X = k) = \binom{10}{k} 0.486^k (0.514)^{10-k} \]


Solution
Based on this, we can construct the table:

 k    Pr(x = k)   Pr(x ≥ k)   Pr(x ≤ k)
 0    0.0013      1           0.0013
 1    0.0121      0.9987      0.0134
 2    0.0515      0.9866      0.0648
 3    0.1301      0.9352      0.1949
 4    0.2157      0.8051      0.4106
 5    0.2452      0.5894      0.6558
 6    0.1936      0.3442      0.8494
 7    0.1048      0.1506      0.9542
 8    0.0372      0.0458      0.9914
 9    0.0078      0.0086      0.9993
10    0.0007      0.0007      1


Solution

Expected wins: 4.86


Expected payout: -27.03 francs
Probability at least 6 wins: 0.34
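
The roulette answers can be reproduced with a few lines of code. This is a rough Python sketch, not the method used in the course; the variable names are illustrative. It uses the exact win probability 18/37 rather than the rounded 0.486, so the tail probability may differ slightly from the table in the third decimal place. The payout assumes even money: each win gains the bet and each loss costs the bet.

```python
from math import comb

N_SPINS = 10
P_WIN = 18 / 37  # 18 black slots out of 37 on a fair wheel
BET = 100        # francs staked on each spin

# Binomial pmf: probability of winning exactly k of the 10 spins
pmf = [
    comb(N_SPINS, k) * P_WIN**k * (1 - P_WIN) ** (N_SPINS - k)
    for k in range(N_SPINS + 1)
]

# Expected number of wins: sum of k * Pr(X = k)
expected_wins = sum(k * pk for k, pk in enumerate(pmf))

# Probability of at least 6 wins: add up the pmf for k = 6, ..., 10
prob_at_least_6 = sum(pmf[6:])

# Net payout: gain BET per win, lose BET per loss
expected_payout = BET * expected_wins - BET * (N_SPINS - expected_wins)

for k, pk in enumerate(pmf):
    print(f"{k:2d}  {pk:.4f}")
print(f"expected wins:   {expected_wins:.2f}")
print(f"Pr(at least 6):  {prob_at_least_6:.4f}")
print(f"expected payout: {expected_payout:.2f} francs")
```

The expected number of wins also follows directly from the binomial mean, 10 times 18/37, which is a useful cross-check on the summation.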


Custom Distribution

We can also use a table of probabilities to describe a pmf, as we did in the roulette example. For example, suppose the population of my bookshelf is: 3 books by Salman Rushdie, 1 by Philip K. Dick, 2 by Dan Abnett, and 4 by Ian Fleming. The pmf of a draw from the shelf can be written

Author     Probability
Rushdie    0.3
Dick       0.1
Abnett     0.2
Fleming    0.4
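
To see what "a draw from the shelf" means in practice, one could simulate many draws and compare the observed frequencies with the table. A small Python sketch under that assumption (the seed and the number of draws are arbitrary choices, and draws are made with replacement):

```python
import random
from collections import Counter

# pmf of a single draw from the bookshelf, as in the table
shelf_pmf = {"Rushdie": 0.3, "Dick": 0.1, "Abnett": 0.2, "Fleming": 0.4}

# A valid pmf must sum to 1
assert abs(sum(shelf_pmf.values()) - 1.0) < 1e-12

rng = random.Random(579)  # fixed seed so the run is repeatable
draws = rng.choices(
    list(shelf_pmf), weights=list(shelf_pmf.values()), k=10_000
)

# Share of draws landing on each author; should be close to the pmf
frequencies = {author: count / len(draws) for author, count in Counter(draws).items()}
for author, p in shelf_pmf.items():
    print(f"{author:8s} pmf={p:.1f}  simulated={frequencies[author]:.3f}")
```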


Continuous Random Variable

A continuous random variable has an (uncountably) infinite support, such as
- Any number from 1 to 10: \([1, 10]\)
- Any nonnegative number (positive or zero): \([0, \infty)\)
- Any number at all: \((-\infty, \infty)\)


Probability Density Function

In general, since there are infinitely many possible values, the probability of drawing any individual number is zero. Instead of a pmf, we now have a probability density function (pdf).
- The area under the pdf between \(a\) and \(b\) tells us the probability a draw will be in the interval \([a, b]\)
- In other words, the integral of the pdf from \(a\) to \(b\): \(\Pr(a \le X \le b) = \int_a^b f(x)\,dx\)
- Before anyone panics, we won't be doing integrals in this class!


Normal Distribution

Arguably the most famous continuous distribution is the normal distribution, whose pdf is known as the bell curve.

[Figure: pdf of the normal distribution (bell curve); vertical axis: Density]


Normal Distribution

To describe a particular normal distribution's shape, we need to know its mean (\(\mu\)) and variance (\(\sigma^2\)).
- Mean is the expected value of the random variable, \(\mu = E[X_i]\)
- Variance is the spread of the variable, \(\sigma^2 = E\left[(X_i - E[X_i])^2\right]\)
- Define standard deviation (\(\sigma\)) as the square root of the variance
- If \(X_i\) is normally distributed, we write
  \[ X_i \sim \text{Normal}(\mu, \sigma) \]


Normal Distribution

Put this all together into the pdf:
\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}} \]
Look awfully complicated? We will usually use computers when dealing with this pdf!
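
As an illustration of letting the computer handle the pdf, here is a short Python sketch (an assumption of this write-up; the course itself uses Stata and Excel). It codes the formula directly and compares it with `NormalDist` from Python's standard `statistics` module. The function name `normal_pdf` is illustrative.

```python
from math import exp, pi, sqrt
from statistics import NormalDist


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Normal density at x, written out exactly as in the formula."""
    variance = sigma**2
    return exp(-((x - mu) ** 2) / (2 * variance)) / sqrt(2 * pi * variance)


# Height of the standard normal bell curve at its peak (x = mu = 0)
peak = normal_pdf(0.0, mu=0.0, sigma=1.0)
print(round(peak, 4))  # 0.3989

# The hand-coded formula should agree with the library implementation
by_hand = normal_pdf(1.3, mu=2.0, sigma=0.7)
by_library = NormalDist(mu=2.0, sigma=0.7).pdf(1.3)
print(abs(by_hand - by_library) < 1e-12)  # True
```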


Cumulative Distribution Function

The cumulative distribution function (CDF) tells us the probability the random variable will take a value of at most \(x\). For a discrete variable,
\[ F(x) = \Pr(X \le x) = \sum_{k \le x} \Pr(X = k) \]
and for a continuous variable,
\[ F(x) = \int_{k=-\infty}^{x} f(k)\,dk \]
When \(x\) is the highest value of the support, the CDF equals 1. If not, something went wrong!
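
Both versions of the CDF are easy to check numerically. The sketch below is in Python (an assumption; not part of the course tools). The discrete example, the number of heads in two fair coin flips, is a made-up illustration, and the continuous example uses the standard normal from the `statistics` module.

```python
from itertools import accumulate
from statistics import NormalDist

# Discrete case: number of heads in two fair coin flips (illustrative pmf)
support = [0, 1, 2]
pmf = [0.25, 0.50, 0.25]

# The CDF is the running sum of the pmf over the ordered support
cdf = list(accumulate(pmf))
print(dict(zip(support, cdf)))  # {0: 0.25, 1: 0.75, 2: 1.0}

# Continuous case: the library integrates the pdf for us
standard_normal = NormalDist(mu=0.0, sigma=1.0)
p_within = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)
print(round(p_within, 3))  # 0.95
```

The last entry of the discrete CDF is 1, as it must be at the top of the support.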


Basics Confidence Intervals

Population vs. Sample

So far, we have been talking in terms of the population. \(\mu\) and \(\sigma\) are population parameters, whose true values are unknown to us.
- For example, consider the population of your company's potential customers.
- Your boss wants to know the average age of the type of person who buys your product. What do you do?
- Draw a sample!
- From here, we can use the sample to produce estimates of the true parameters.


Notation

Sample notation:
- Sample size (\(N\)): the number of observations/draws in the sample
- Denote our sample of \(N\) realizations of random variable \(X_i\) as \(\{X_1, X_2, \dots, X_N\}\)


Sample Statistics
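Sample mean:
\[ \bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i \]
Sample variance:
\[ S^2 = \frac{1}{N - 1} \sum_{i=1}^{N} \left(X_i - \bar{X}\right)^2 \]
Sample standard deviation:
\[ S = \sqrt{S^2} \]
These statistics are the sample counterparts to the population parameters \(\mu\), \(\sigma^2\), and \(\sigma\).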

Sample Statistics

Notice how the formula for sample variance differs from that of \(\sigma^2\).
- \(N - 1\) in the denominator instead of \(N\)
- If using Excel, make sure you use the sample versions of the functions (VAR.S and STDEV.S, not VAR.P and STDEV.P)
- For very large samples, the difference is negligible
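
The effect of the denominator is easy to see on a small sample. The following Python sketch uses a made-up sample of eight ages (hypothetical data, chosen so the arithmetic is clean) and compares the hand-computed sample variance with the standard library's `statistics.variance` (divides by N - 1) and `statistics.pvariance` (divides by N).

```python
import statistics

# Hypothetical sample of ages, for illustration only
sample = [22, 24, 24, 24, 25, 25, 27, 29]
n = len(sample)

# Sample mean
x_bar = sum(sample) / n

# Sample variance: squared deviations divided by N - 1
s2 = sum((x - x_bar) ** 2 for x in sample) / (n - 1)

# Sample standard deviation
s = s2**0.5

print(x_bar)                          # 25.0
print(round(s2, 4))                   # 4.5714
print(statistics.pvariance(sample))   # divides by N instead: 4
print(round(s, 4))
```

With only eight observations, the two variances differ noticeably (about 4.57 versus 4); with thousands of observations they would be nearly identical.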


Estimators

Since we don't know the true values of the parameters, sample statistics will serve as estimators. How good are the estimates?
- Selection bias can make the estimates inaccurate
- Assume, for now, the sample is truly random (all population members are equally likely to be sampled)
- This means individual draws are independent and identically distributed (i.i.d.): the distribution of a given draw does not depend on the realization of any other draw


Unbiasedness

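When the draws are i.i.d., sample statistics are unbiased estimators for the population parameters. Formally, the sample statistic is unbiased when the mean of the statistic equals the population parameter:
\[ E[\bar{X}] = \mu \]
\[ E[S^2] = \sigma^2 \]
\[ E[S] \approx \sigma \]
(Strictly, \(S\) slightly underestimates \(\sigma\) on average because the square root is nonlinear; the bias shrinks as \(N\) grows.)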


Unbiasedness
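\(\bar{X}\) is an unbiased estimator for \(\mu\).

Proof.
\[ E[X_i] = \mu \]
\[ \bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i \]
\[ E[\bar{X}] = \frac{1}{N} \sum_{i=1}^{N} E[X_i] = \frac{1}{N} \sum_{i=1}^{N} \mu \]
\[ E[\bar{X}] = \frac{1}{N} N \mu = \mu \]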


Confidence Intervals

- It is still unlikely the sample statistics will perfectly peg the population parameters
- Confidence intervals give a range that contains the population parameter with a specified probability
- For example, suppose I ask 20 random people in Bloomington their income; while the average of my sample may be close to the true mean of Bloomington residents' income, chances are that it is not exactly right.
- But can we provide a range?


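Variance of \(\bar{X}\)
In order to construct a confidence interval for the mean estimate, we need to know how \(\bar{X}\) is distributed. We already know its mean is \(\mu\), but what is the variance?
\[ \mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{N} \sum_{i=1}^{N} X_i\right) \]
\[ \mathrm{Var}(\bar{X}) = \frac{1}{N^2} \mathrm{Var}\left(\sum_{i=1}^{N} X_i\right) \]
\[ \mathrm{Var}(\bar{X}) = \frac{1}{N^2} N \, \mathrm{Var}(X_i) \]
\[ \mathrm{Var}(\bar{X}) = \frac{\sigma^2}{N} \]
The third line uses the independence of the draws: the variance of a sum of independent variables is the sum of their variances.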


Central Limit Theorem

From the derived variance, we know the standard deviation of \(\bar{X}\) is \(\sigma / \sqrt{N}\). Intuitively, it should make sense that the mean of the sample has less spread than individual draws.

Theorem
When \(N\) is sufficiently large (conventionally, \(N \ge 30\)), the sample mean is approximately normally distributed. That is,
\[ \bar{X} \sim \text{Normal}\left(\mu, \frac{\sigma}{\sqrt{N}}\right) \]


CLT Example

For an illustration, let's return to the roulette example. Let random variable \(X_i\) be the number the ball lands on, and assume it is equally likely to be any integer from 0 to 37. (That is 38 values; a real French wheel has 37 slots numbered 0 to 36, which would instead give \(\mu = 18\) and \(\sigma^2 = 114\).) The parameters are
\[ \mu = 18.5, \quad \sigma^2 = 120.25, \quad \sigma = 10.97 \]


CLT Example

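Experiment:
- Simulate samples of 30 roulette spins
- Record the mean of the 30 spins
- Repeat 10,000 times

Results:
- \(E[\bar{X}] = 18.47\)
- \(\mathrm{Var}(\bar{X}) = 4.02\)
- \(4.02 \times 30 \approx 120.67\) (apparently computed from the unrounded variance)

These results are quite close to the predicted values, \(\mu = 18.5\) and \(\sigma^2 = 120.25\).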
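
The experiment is simple to re-run. Below is a rough Python sketch of it, not the code behind the reported results (which is not shown in these slides). It assumes draws are uniform over the integers 0 to 37, the range implied by the stated parameters; the seed is arbitrary, so the numbers will differ slightly from those above.

```python
import random
from statistics import fmean, variance

SPINS_PER_SAMPLE = 30
REPETITIONS = 10_000

rng = random.Random(2017)  # arbitrary fixed seed for repeatability

# One sample mean per repetition: average of 30 uniform draws on 0..37
sample_means = [
    fmean(rng.randint(0, 37) for _ in range(SPINS_PER_SAMPLE))
    for _ in range(REPETITIONS)
]

mean_of_means = fmean(sample_means)
var_of_means = variance(sample_means)

print(f"mean of sample means:     {mean_of_means:.2f}")   # theory: 18.5
print(f"variance of sample means: {var_of_means:.2f}")    # theory: 120.25 / 30
print(f"variance times 30:        {var_of_means * SPINS_PER_SAMPLE:.2f}")
```

Multiplying the variance of the sample means by 30 recovers an estimate of the variance of a single spin, which is the comparison made in the results.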


CLT Example
[Figure: "Histogram of Means". Histogram of the simulated sample means with fitted Normal density; horizontal axis: Means, vertical axis: Frequency]


Confidence Intervals

Now that we have established the distribution of the sample mean, we are ready to construct confidence intervals. Commonly, we consider 90%, 95%, and 99% confidence intervals. For Normal random variables, draws fall within
- 1.65 s.d. of the mean 90% of the time
- 1.96 s.d. of the mean 95% of the time
- 2.58 s.d. of the mean approximately 99% of the time

You can use a computer or table to do this for CIs of any confidence level.


Confidence Intervals

This implies that in 90% of samples, the sample mean will be contained in
\[ \left[\mu - 1.65 \frac{\sigma}{\sqrt{N}},\ \mu + 1.65 \frac{\sigma}{\sqrt{N}}\right] \]
and so forth for the other intervals.
But wait, this looks a bit backward. We don't actually know \(\mu\) and \(\sigma\); we were trying to find estimates for those parameters in the first place.


Confidence Intervals

Instead, we'll flip things around. Replace \(\mu\) with \(\bar{X}\) and \(\sigma\) with \(S\), and we have what we're looking for.
\[ CI_{90} = \left[\bar{X} - 1.65 \frac{S}{\sqrt{N}},\ \bar{X} + 1.65 \frac{S}{\sqrt{N}}\right] \]
will contain \(\mu\) 90% of the time. Using the language of inductive reasoning, we say:
Based on a sample of size \(N\) with mean \(\bar{X}\) and standard deviation \(S\), the population mean is contained within \(CI_{90}\), as defined above. The degree of support for this argument is 90%.
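
A confidence interval for the mean takes only a few lines to compute. The following Python sketch is one way to do it; the function name and the simulated sample are illustrative assumptions, not course data. The multiplier `z` is 1.65, 1.96, or 2.58 for 90%, 95%, or 99% intervals.

```python
import random
from math import sqrt
from statistics import mean, stdev


def mean_confidence_interval(sample, z=1.65):
    """Interval X-bar +/- z * S / sqrt(N) for the population mean."""
    n = len(sample)
    x_bar = mean(sample)
    half_width = z * stdev(sample) / sqrt(n)  # stdev uses the N - 1 denominator
    return x_bar - half_width, x_bar + half_width


# Hypothetical sample of 40 incomes (in thousands), simulated for illustration
rng = random.Random(1)
incomes = [rng.gauss(50, 10) for _ in range(40)]

ci90 = mean_confidence_interval(incomes, z=1.65)
ci95 = mean_confidence_interval(incomes, z=1.96)
print(f"90% CI: [{ci90[0]:.2f}, {ci90[1]:.2f}]")
print(f"95% CI: [{ci95[0]:.2f}, {ci95[1]:.2f}]")
```

The 95% interval is always wider than the 90% interval from the same sample: asking for more confidence costs precision.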


Confidence Intervals

We have just learned how to construct a confidence interval for one parameter, \(\mu\). For other parameters, such as \(\sigma\), the specifics will be different, but the overall idea is the same.
- Find the distribution of the parameter's estimator, in terms of sample statistics
- Calculate bounds to the left and right of the sample analogue corresponding with the desired confidence level


Intro Steps Discussion P-Value

Hypothesis Testing

We will conclude this week with formal hypothesis testing. A hypothesis test uses a data sample to assess the credibility of a claim (the hypothesis) about the underlying population. The steps to a hypothesis test are:
1. State the null hypothesis and the alternative hypothesis
2. Collect a sample
3. Decide on a confidence level and rejection rule
4. Calculate the test statistic
5. Reject the null hypothesis if the test statistic falls outside the range implied by the confidence level


What is a Hypothesis Test?

Hypothesis tests are a component of making an inductive argument.
- Degree of support, also called inductive probability, is a measure of the strength of the inductive conclusion
- We distinguish between a subjective degree of support, based on gut instinct, and an objective degree of support, based on data and statistical analysis
  - "I am 99% sure the Steelers will win the Super Bowl!"
- A hypothesis test is a tool used to determine whether or not the data backs up the conclusion for a particular degree of support


Hypothesis Testing

We will discuss hypothesis testing in terms of the population mean, though we can do a test concerning any population parameter. Suppose we want to test the hypothesis "the mean personal income in Indiana is $35,000." We will restrict this to Hoosiers aged 18 to 54. You can follow along with the example using the dataset indiana100.dta.


Step 1: State the Hypothesis


The null hypothesis (\(H_0\), pronounced "H-naught") is the default position we are trying to test. Null hypotheses generally state that a parameter of interest is equal to something else, which can be a number or another parameter, as in
\[ H_0: \theta = K, \]
where \(\theta\) is a population parameter. In our example,
\[ H_0: \mu = \$35{,}000 \]
The alternative hypothesis (\(H_1\)) is that the parameter is not equal to what we hypothesized. In this case,
\[ H_1: \mu \neq \$35{,}000 \]

Step 2: Collect a Sample

To test the hypothesis, I took a random sample of 100 Indiana residents aged 18-54 from the 2014 American Community Survey.
- Measures: Sex, age, race, education, personal income (we only need the last one, but I included a few others)
- Dimensions: State = IN, Age \(\in [18, 54]\), households only (eliminates prisons, mental institutions, etc.)


Step 3: Confidence Level

Now, we need to decide on a confidence level. The stricter the confidence level, the less likely it is that the hypothesis is rejected, but when it does happen, the degree of support is high. Some of the most commonly used levels are 90%, 95%, and 99%. Let's go with 95% for now.
- Recall that since \(N > 30\), by the CLT, the sample mean is approximately normally distributed
- Thus, for 95% confidence, we reject if the sample mean is more than 1.96 standard deviations away from the null value ($35,000)


Step 4: Test Statistic

The test statistic is the value we test to determine if the null is rejected. The form of the test statistic depends on the parameter of interest.
- The criterion is: reject if the difference between \(\bar{X}\) and $35,000 is more than 1.96 standard deviations of the sample mean
- Construct the test statistic:
  \[ z = \frac{\bar{X} - \$35{,}000}{\sigma / \sqrt{N}} \]
- We're good to go, then, right?


Step 4: Test Statistic


Not quite yet. We don't actually know \(\sigma\), the population standard deviation.
- Use the next best thing, the sample standard deviation, \(S\)
- Rewrite the test statistic
  \[ t = \frac{\bar{X} - \$35{,}000}{S / \sqrt{N}} \]
- When we use the sample S.D., we have a t-statistic, which follows the t-distribution
- But, when \(N > 30\), the t and normal distributions are very similar


Step 5: Do the Test

First, calculate the sample mean and standard deviation:
\[ \bar{X} = 30{,}552.95 \]
\[ S = 30{,}623.11 \]
Plug these into \(t\) to get
\[ t = \frac{30{,}552.95 - 35{,}000}{30{,}623.11 / \sqrt{100}} = -1.452 \]
We find $35,000 is only 1.452 standard deviations from the sample mean, not enough to reject the hypothesis at 95% confidence.
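
The arithmetic of the test can be checked with a short script. This Python sketch (an illustration only; the course runs the test in Stata) plugs the reported sample mean and standard deviation into the t-statistic formula and applies the 1.96 and 2.58 cutoffs. The function name is illustrative.

```python
from math import sqrt


def t_statistic(x_bar: float, s: float, n: int, null_value: float) -> float:
    """(sample mean - hypothesized mean) / (S / sqrt(N))."""
    return (x_bar - null_value) / (s / sqrt(n))


# Summary statistics reported for the sample of 100 Indiana residents
X_BAR, S, N = 30552.95, 30623.11, 100

t_35k = t_statistic(X_BAR, S, N, null_value=35000)
print(round(t_35k, 3), abs(t_35k) > 1.96)  # -1.452 False: fail to reject at 95%

# A more extreme null of $40,000, for comparison
t_40k = t_statistic(X_BAR, S, N, null_value=40000)
print(round(t_40k, 2), abs(t_40k) > 2.58)  # -3.08 True: reject even at 99%
```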


What Does it All Mean?

Okay, so we didn't reject the hypothesis. That means that the average income for Hoosiers aged 18-54 is $35,000, right?
- No! "Failed to reject" is not the same as "confirmed."
- What we've done is a bit more nuanced: we showed that the evidence we have is not strong enough to say, with 95% degree of support, that the mean is not $35,000
- Had we made the more extreme claim that the average income is $40,000, the t-stat is -3.08, enough to reject even at 99% degree of support


Sample Size

The power of sample size in hypothesis testing cannot be overstated.
- Note the \(\sqrt{N}\) in the t-stat
- All other things equal, larger sample sizes make larger t-stats (in absolute value)
- In other words, larger samples make for stronger evidence
- This is intuitive. What if we tried to make claims about the entire state with just two people's income?
- Using the full ACS sample of 28,683 people, the t-stat becomes 7.40, a very strong rejection!


P-Values

An alternative way to approach a hypothesis test is to calculate the p-value. The p-value tells us:
- Assuming the sample is large and the null hypothesis is correct, what is the probability a t-stat at least as large (in absolute value) as the one we got would be observed?
- In other words, if the null is correct, what would be the false rejection rate?
- The more extreme the t-stat, the lower the p-value
- The p-value also tells us the highest degree of support at which we could reject the null


P-Values
[Figure: graphical representation of the p-value]


P-Values

We know that t-stats follow the t distribution. Using this distribution, we can calculate the p-value.
- The degrees of freedom of a t distribution is \(N - 1\)
- P-value is the area under the distribution to the right of \(|t|\) and to the left of \(-|t|\)
- For our example, the p-value is 0.15
- This means the highest degree of support for a rejection of the null would be \(1 - 0.15 = 85\%\)
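
The two-tailed area can be approximated without a t table. The Python sketch below is an approximation, not the exact calculation: Python's standard library has no t distribution, so it uses the standard normal, which the slides note is very similar to the t distribution when N > 30. The function name is illustrative.

```python
from statistics import NormalDist


def two_sided_p_value_normal(t: float) -> float:
    """Area in both tails beyond |t|, using the standard normal
    as a large-sample stand-in for the t distribution."""
    upper_tail = 1 - NormalDist().cdf(abs(t))
    return 2 * upper_tail


p_value = two_sided_p_value_normal(-1.452)
print(round(p_value, 2))      # 0.15
print(round(1 - p_value, 2))  # 0.85, the highest degree of support for rejecting
```

With 99 degrees of freedom the exact t-based value is slightly larger than the normal approximation, but the two agree to two decimal places here.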


t Distribution in Excel

To find a p-value in Excel, we need to use the function T.DIST.2T
- T.DIST.2T(t, df)
- t is our t-stat (enter its absolute value; T.DIST.2T does not accept negative values)
- df is the degrees of freedom, equal to N - 1
- Note that 2T indicates a 2-tailed test. While 1-tailed tests are possible, we won't be doing them in this class.


Mean Hypothesis Test in Stata

To do the test in Stata, load up indiana100.dta and run the command

    ttest inctot == 35000

- Note the double-equals sign for the logical equals
- We will focus on the two-tailed test, reported in the middle of the output.


Summary

Important takeaways:
- Random variables represent draws from a population
- If the sample is random, sample statistics are unbiased estimators of population parameters
- Confidence intervals give a range that a parameter lies within for a given probability
- Hypothesis testing is important: know how to do the tests and understand the interpretation


Up Next

Next, we will cover randomized trials.
- We can then do hypothesis tests using the results of randomized trials
- Then, we turn to regression
  - Anyone can hit the button to run a regression
  - But we need to understand hypothesis tests to understand and interpret the numbers that come out
