
University Admissions: Statistical Inference on Proportions

Slide One:

Voiceover:

Harrigan University is a liberal arts university in the Midwest that attempts to attract the highest quality
students, especially from its region of the country. Harrigan is concerned that it is not getting enough of
the best students, and worse yet, that many of these best students are going to Harrigan’s main rival.

Slide Two:

Text on screen:

• Accepted- whether the applicant accepts Harrigan’s offer to enroll
• HS Sports- number of varsity letters applicant earned
• HS Size- number of students in applicant’s graduation class
• MainRival- whether the applicant enrolls at Harrigan’s main rival university
• HSGPA- applicant’s high school GPA
• SAT- applicant’s combined SAT score
• HSClubs- number of high school clubs in which the applicant served as an officer
• HSPctile- applicant’s percentile (in terms of GPA) in his or her graduating class
• Combined Score- a confidential internal combined score for the applicant used by Harrigan to
rank applicants

Voiceover:

Gertrude Cox has been tasked with investigating this concern and has gathered data on 178 applicants
who were accepted by Harrigan (a random sample from all acceptable applicants over the past several
years). The data contain a range of information about the applicants, including whether each applicant
accepts Harrigan’s offer to enroll. Cox’s first step is to estimate the fraction of highly qualified applicants
that accept the offer, which she completes in Excel.

EMBA758L– Data Analysis | 1


Slide Three:

Voiceover:

Realizing that her estimate is just a single number, a point estimate, Cox wants to build a confidence
interval around it. However, whether or not an applicant chooses to attend Harrigan is not a numerical
outcome, but rather a binary choice. Therefore, .579 is not a sample mean but a sample proportion, and
she will need to explore how to build confidence intervals around sample proportion estimators.

Text on screen:

Success = Accepted offer of admission

Failure = Reject offer of admission

The distribution of $\hat{p}$ is the normal distribution, with

$E[\hat{p}] = p$

$\sigma_{\hat{p}} = SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}$

if $np > 5$ and $n(1-p) > 5$

Sample Mean

• Point estimator of $\mu$
• Distribution: Normal if $n$ is large enough
• Standard error: $\sigma_{\bar{X}} = SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}$

Sample Proportion

• Point estimator of $p$
• Distribution: Normal if $np > 5$ and $n(1-p) > 5$
• Standard error: $\sigma_{\hat{p}} = SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}$

Since $\hat{p}$ has a normal distribution we can apply the same methodology and mechanics to build
confidence intervals as we did with the sample mean.

Point Estimator ± Margin of Error

$\hat{p} \pm (z\text{-multiple}) \sqrt{\frac{p(1-p)}{n}}$

Voiceover:

When records can be classified into one of two categories, “success” or “failure”, the appropriate analysis
is that of a sample proportion. In the case of Harrigan University, success translates to a student who
accepted an offer of admission and failure to a student who decided to go somewhere else.

Let p denote the proportion of successes in the population. It is a parameter which is unknown and must
be estimated. If a random sample of size n is drawn from this population, then we can let p-hat denote
the proportion of successes in the sample. P-hat is a random variable, as the value of p-hat will differ
slightly from one random sample to another.

It can be shown that for sufficiently large n, the sampling distribution of p-hat is approximately normal
with mean p (that is, the expected value of the sample proportion is the true population proportion) and
standard error equal to the square root of p times one minus p divided by the sample size n.
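
The claim above can be checked informally by simulation. The following is a minimal sketch, not part of the original analysis: it assumes a hypothetical "true" proportion of 0.579 (in practice p is unknown), draws many random samples of size 178, and compares the mean and spread of the resulting p-hat values with the theoretical mean p and standard error. The function name and the number of simulated samples are illustrative choices.

```python
import math
import random


def simulate_sample_proportions(p, n, num_samples, seed=0):
    """Draw num_samples random samples of size n and return each sample's p-hat."""
    rng = random.Random(seed)
    p_hats = []
    for _ in range(num_samples):
        # Each observation is a "success" with probability p.
        successes = sum(1 for _ in range(n) if rng.random() < p)
        p_hats.append(successes / n)
    return p_hats


# Assumption: a hypothetical population proportion chosen for illustration only.
TRUE_P = 0.579
N = 178

p_hats = simulate_sample_proportions(TRUE_P, N, num_samples=5000)
simulated_mean = sum(p_hats) / len(p_hats)
simulated_sd = math.sqrt(
    sum((x - simulated_mean) ** 2 for x in p_hats) / (len(p_hats) - 1)
)
theoretical_se = math.sqrt(TRUE_P * (1 - TRUE_P) / N)

print(f"mean of p-hat values: {simulated_mean:.4f} (theory: {TRUE_P})")
print(f"sd of p-hat values:   {simulated_sd:.4f} (theory: {theoretical_se:.4f})")
```

If the approximation holds, the simulated mean should sit close to p and the simulated standard deviation close to the theoretical standard error.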

But what do we mean by sufficiently large? As a rule of thumb, the approximation holds if both n times p
and n times one minus p are greater than five. That is, if p is either very small or very large we need a
bigger sample for the distribution to be approximately normal, but for populations with p close to .5 we
could get away with samples only slightly larger than ten.
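
As a small illustration of this rule of thumb, the sketch below checks both conditions for a few sample sizes. The function name and the example values other than Cox's sample (n = 178, with p-hat = 0.579 standing in for the unknown p) are hypothetical. Note that under the strict "greater than five" reading, n = 10 with p = .5 gives exactly 5 and so does not quite pass.

```python
def normal_approximation_ok(n, p, threshold=5):
    """Rule of thumb: both n*p and n*(1-p) must exceed the threshold."""
    return n * p > threshold and n * (1 - p) > threshold


# Cox's sample, using p-hat in place of the unknown p.
print(normal_approximation_ok(178, 0.579))

# p close to .5: a small sample can be enough.
print(normal_approximation_ok(11, 0.5))

# p close to 0: the same kind of sample size is far too small.
print(normal_approximation_ok(20, 0.05))
```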

There are a lot of parallels between the sample mean and the sample proportion.

Both are point estimators: the sample mean of mu, a population average, and the sample proportion of p,
a population proportion.

The sample mean is approximately normally distributed if n is large enough, and the sample proportion
is also approximately normally distributed if both n times p and n times one minus p are greater than 5.

The standard errors of both estimators are generally estimated from the sample. That is, for sigma in the
standard error of the sample mean we substitute s, the sample standard deviation [make sigma change
to s], and in the standard error of the sample proportion we substitute p-hat for p [make p change to
p-hat].

The basic formula for any confidence interval is point estimator plus-minus margin of error. And as
before, we break down the margin of error into a multiple times the standard error of the point
estimator.

Now our point estimator is p-hat. Our multiple is the z-multiple, as p-hat is normally distributed; we can
find this value in Excel or let StatTools do the calculations. And the standard error is given by the square
root of p times one minus p divided by n.

The standard error formula for p-hat, and as a result the confidence interval formula, contains the
unknown parameter p. We therefore approximate the standard error of p-hat by substituting p-hat for p
in the formula.
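
The interval described above can be sketched in a few lines of Python. This is an illustrative calculation only, not the StatTools procedure: the function name is hypothetical, the z-multiple is taken from the standard normal distribution, and p-hat is substituted for p in the standard error as the text describes.

```python
import math
from statistics import NormalDist


def proportion_confidence_interval(p_hat, n, confidence=0.95):
    """Return (lower, upper) for p-hat +/- z-multiple * sqrt(p-hat(1 - p-hat)/n)."""
    # Two-sided z-multiple, e.g. roughly 1.96 for 95% confidence.
    z_multiple = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Standard error with p-hat substituted for the unknown p.
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin_of_error = z_multiple * standard_error
    return p_hat - margin_of_error, p_hat + margin_of_error


# Values from Cox's sample: p-hat = 0.579, n = 178.
lower, upper = proportion_confidence_interval(p_hat=0.579, n=178)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

Because 0.579 is itself a rounded value, the limits may differ in the third decimal place from those produced by software working with the unrounded proportion.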

Slide Four:

Text on screen:

Cox wants to build a 95% confidence interval around the point estimate of the acceptance rate.

$\hat{p} \pm (z\text{-multiple}) \sqrt{\frac{p(1-p)}{n}}$

Select the appropriate values and place them below.

• 0.579
• 1.96
• 178

Slide Five:

Voiceover:

Any time we use StatTools we first need to define the data set, and we do that in the Data Set Manager.
As the cursor was placed inside the data, StatTools detects the data range and [on selecting Yes] we
select Yes. And since all the variables look correct we [on selecting OK] select OK.

Now we are ready to construct the confidence interval. Again [on selecting StatTools] we select
StatTools and under [on selecting Statistical Inference] Statistical Inference we find Confidence Interval
and then select [on selecting Proportion] Proportion.

First we select the data that we are constructing the interval for, in our case [on checking accepted]
Accepted, and we want to analyze the proportion that accepted, so we designate Yes [on selecting yes] as
the category to analyze. We want a 95% confidence interval [on circling the CI], which is the default
value, so we can press OK.

And here is the resulting confidence interval. We notice that the estimated sample proportion is 0.579,
which we knew, and [on highlighting cells] the lower and upper limits of the interval are 0.506 and
0.651. That is wide for a proportion, so let’s calculate the margin of error [on writing MOE]. The margin
of error is half the width of the interval, so we subtract the lower limit from the upper limit, which gives
us the width of the interval, and then we [on dividing by 2 in the formula] divide by 2.

We see that the margin of error is about 0.07, or [on formatting] 7.3%.
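
The same arithmetic can be written out as a short sketch. It uses the rounded limits quoted in the voiceover, so the result (about 0.0725) agrees with the reported 7.3% only up to rounding; the variable names are illustrative.

```python
# Interval limits as reported in the voiceover (rounded to three decimals).
lower_limit = 0.506
upper_limit = 0.651

# The margin of error is half the width of the interval.
interval_width = upper_limit - lower_limit
margin_of_error = interval_width / 2

print(f"width = {interval_width:.3f}, margin of error = {margin_of_error:.4f}")
```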

Slide Six:

Text on screen:

• Reduce the confidence level
• Increase the sample size

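Sample Size for Proportions

MOE = [Multiple] × [Standard Error of the Point Estimator]

$\text{MOE} = (z\text{-multiple}) \sqrt{\frac{p(1-p)}{n}}$

$\frac{\text{MOE}}{z\text{-multiple}} = \sqrt{\frac{p(1-p)}{n}}$

$\left(\frac{\text{MOE}}{z\text{-multiple}}\right)^2 = \frac{p(1-p)}{n}$

$n = \left(\frac{z\text{-multiple}}{\text{MOE}}\right)^2 p(1-p)$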

Voiceover:

Cox is not happy with the width of the confidence interval; she would like a tighter interval. To
achieve this she has two options: reduce the confidence level or increase the sample size. For Cox,
reducing the confidence level is really not an option, so she decides to increase the sample size to achieve
a margin of error of no more than plus or minus 5%. How large a data set does she need?

Given a confidence level and a margin of error (MOE), what is the sample size that we should draw for
estimating p?

The formula for the margin of error is the multiple times the standard error of the estimate. Now we
want to determine n, so we need to use algebra to isolate n in this equation.

We start by dividing both sides by the z-multiple.

The next step is to square both sides of the equation.

The final step is to move n to the left-hand side and the margin of error divided by the z-multiple, the
whole squared, to the right-hand side.

The formula for n will most often return a fractional value, but since we cannot sample a fraction of an
observation, we round the sample size up to the next whole number.

There is an additional challenge in applying this equation. Prior to polling we don’t know p and
therefore cannot substitute it into the formula.

One approach is to use the worst-case scenario and substitute .5 for p, as p(1-p) is maximized when p
is .5.
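
The sample size calculation can be sketched as follows. This is an illustration under stated assumptions rather than the course's own worksheet: the function name is hypothetical, the z-multiple comes from the standard normal distribution, and the second call reuses Cox's p-hat of 0.579 as a planning value, which is one possible alternative to the worst case and not something the text prescribes.

```python
import math
from statistics import NormalDist


def required_sample_size(margin_of_error, confidence=0.95, p=0.5):
    """n = (z-multiple / MOE)^2 * p(1 - p), rounded up to a whole observation."""
    z_multiple = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = (z_multiple / margin_of_error) ** 2 * p * (1 - p)
    # A fractional sample size is rounded up so the MOE target is still met.
    return math.ceil(n)


# Worst case: p = 0.5 maximizes p(1 - p).
worst_case_n = required_sample_size(0.05)  # 385

# Assumption: use the earlier p-hat as a planning value instead of 0.5.
planning_value_n = required_sample_size(0.05, p=0.579)

print(worst_case_n, planning_value_n)
```

The worst-case choice always gives the larger sample size, which is why it is a safe default when nothing is known about p.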

Slide Seven:

Text on screen:

Cox wants to build a 95% confidence interval with a margin of error of no more than 5%.

Slide Eight:

Text on screen:

Acceptance Rate of highly qualified applicants in previous years 65%

Voiceover:

Speaking with other seasoned admissions staff, Cox gets the feeling that things have been getting
worse over the past few years.
A comprehensive study conducted 3 years ago on all admissions in previous years showed the
acceptance rate of highly qualified applicants to be 65%.

Slide Nine:

Text on screen:

Hypothesis Testing for Proportions

1. Construct H0 and H1
2. Determine one-tailed or two-tailed and appropriate significance level
3. Select appropriate parameter settings in StatTools and run analysis
4. Interpret results

Voiceover:

Cox decides to conduct a hypothesis test at the 5% level of significance with the goal of proving the
overall feeling of the admissions department. In this case, the same mechanism applies as with
hypothesis testing for the sample mean. What has changed is that we are now applying hypothesis
testing to a proportion, whereas before we were testing a mean.

Slide Ten:

Text on screen:

Hypothesis Testing for Proportions

• Step 1
• Step 2
• Step 3
• Step 4

Step 1: Construct H0 and H1

What is the status quo? The acceptance rate has not gone down

What is the alternative hypothesis?

A. H1: The acceptance rate < .65
B. H1: The acceptance rate ≠ .65
C. H1: The acceptance rate > .65

Correct: Correct! The alternative hypothesis is what Cox is trying to prove, that is, that the acceptance
rate has gone down. The null hypothesis represents the status quo, i.e. that the acceptance rate has not
gone down. H0: the acceptance rate ≥ .65

Incorrect: Incorrect, Cox is not trying to prove merely that the acceptance rate has changed; she is trying
to prove that it has gone down.

Incorrect: Incorrect, Cox’s aim is to reject the null hypothesis, which is a strong statement in favor of the
alternative. You should therefore set the alternative to be the statistical statement you are trying to
prove.

Step 2: Determine one-tailed or two-tailed and appropriate significance level

Voiceover:

The significance level was set by Cox at 5%, and we are conducting a one-tailed test, as only evidence in
one extreme counts against the null hypothesis. That is, only very low values of p-hat will be evidence
against the null hypothesis that the acceptance rate has not declined.

Step 3: Apply StatTools
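
For readers without StatTools, the test can be approximated with a one-tailed z-test for a proportion. This is a sketch under assumptions, not a reproduction of the StatTools output: the function name is hypothetical, the count of 103 acceptances is inferred from 0.579 × 178 rather than stated in the source, and the standard error is computed under the null value of .65, which may differ from the convention StatTools uses.

```python
import math
from statistics import NormalDist


def lower_tailed_proportion_test(successes, n, p0):
    """z-test of H0: p >= p0 against H1: p < p0; returns (z, p_value)."""
    p_hat = successes / n
    # Standard error evaluated at the null value p0.
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    # Lower-tailed test: probability of a z this low or lower under H0.
    p_value = NormalDist().cdf(z)
    return z, p_value


# Assumption: 103 of 178 applicants accepted (inferred from p-hat = 0.579).
z_stat, p_value = lower_tailed_proportion_test(successes=103, n=178, p0=0.65)
reject_null = p_value < 0.05

print(f"z = {z_stat:.3f}, p-value = {p_value:.4f}, reject H0: {reject_null}")
```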

Step 4: Interpret Results

Comparing the p-value to the significance level, we ______ the null hypothesis, meaning we ________

A. Accept … we do not have sufficient evidence to conclude that the acceptance rate is lower.
B. Accept … we do have sufficient evidence to claim that the acceptance rate has stayed the same.
C. Reject … we do not have sufficient evidence to conclude that the proportion of students
accepting the offer has gone down.
D. Reject … we conclude that the proportion of students accepting has gone down.

Incorrect: since the p-value is smaller than the significance level we can reject the null hypothesis.
The low p-value indicates that it is unlikely to observe such data if the null hypothesis is true.

Incorrect: since the p-value is smaller than the significance level we can reject the null hypothesis.
The low p-value indicates that it is unlikely to observe such data if the null hypothesis is true.

Incorrect: we can reject the null hypothesis, and as a result we do have sufficient evidence to conclude
that the proportion of students accepting the offer has gone down.

Correct!

Slide Fifteen:

Voiceover:

The data indicate that the acceptance rate of highly qualified students has gone down in
recent years, a worrisome fact for Harrigan University. Cox realizes that she now has a number of other
questions she would like to answer, including: Has the composition of student applicants changed? Has
the acceptance rate gone down equally across all students, or for example, does it differ across the
Combined Score ranking? It is clear that more analysis is needed to search for an explanation for the
decline and for insights into the current admission trends.
