

Midterm I Review
Units 1-5

Midterm Exam Logistics

The Stat 101 Midterm will be held during the regular class time and location on
Thursday: it starts at approximately 10:07 and runs for 83 min.
It's closed-book. You are allowed one 8.5x11 page of notes (front-and-back OK).
No laptops or cell phones. Remember to bring a
It covers Units 1-5 and the material on HWs 1-4. There are practice problems
and a practice exam on the course website.
Be sure to briefly explain your answers and show calculations.
Office Hours Tues-Thurs (no sections or OHs otherwise):
Tues: 11:30am-12:30pm in SC-300b (Kevin)
Wed: 11am-1pm in SC-107 (Joseph), 1-3pm in SC-300b (Kevin)
Thurs: 9-10am in SC-300b (Kevin)

Outline of Material for the Midterm

Sampling and Measurement (Unit 2)
  Surveys and Sampling
  Experimental Design and Randomization
Descriptive Statistics (Unit 3)
Probability & Random Variables (Unit 4 - Part I)
  Binomial Random Variables
  Normal Distribution
  Law of Large Numbers and the Central Limit Theorem
Inference for a Population Mean (Unit 5)
  Confidence Intervals
  Hypothesis Tests
  Power and Sample Size Calculations

Descriptive Statistics (Univariate)

Histograms and Boxplots
Measures of Center/Location
  Mean, Median, Mode, Quantiles
Measures of Spread
  Range, IQR, Variance/Standard Deviation
Measures of Shape
Outliers (1.5*IQR Rule)
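
The 1.5*IQR rule for outliers can be sketched in a few lines of Python. This is only an illustration: the data values are invented, and the quartiles come from the standard library's default method, which may differ slightly from the convention used in class.

```python
import statistics

def iqr_outliers(values):
    """Return (q1, q3, outliers) using the 1.5*IQR rule."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return q1, q3, outliers

# Hypothetical data, invented for illustration only.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
q1, q3, outliers = iqr_outliers(data)
print("Q1:", q1, "Q3:", q3, "outliers:", outliers)
```

Any point beyond either fence is flagged; here only the value 50 lies outside them.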


Descriptive Statistics (Bivariate)

Scatterplots, Side-by-Side Boxplots, Contingency Tables (Two-way tables)
Correlation (r): a measure of linear association
Regression: $\hat{y} = a + bx$
  Residuals: $e = y - \hat{y}$
  $R^2$: proportion of the total variability in the y-variable
  that can be predicted by using the x-variable.
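
As a rough sketch of how the regression quantities fit together, the following Python computes the least-squares intercept a, slope b, correlation r, residuals, and R² by hand. The five data points and the function name `least_squares` are made up for illustration; the formulas are the standard least-squares ones, which the slide itself does not spell out.

```python
import math

def least_squares(x, y):
    """Return (a, b, r) for the least-squares line y_hat = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)   # correlation coefficient
    b = sxy / sxx                    # slope
    a = y_bar - b * x_bar            # intercept
    return a, b, r

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b, r = least_squares(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]  # e = y - y_hat
r_squared = r ** 2  # share of the variability in y predicted by x

print(f"y_hat = {a:.2f} + {b:.2f}*x, r = {r:.3f}, R^2 = {r_squared:.2f}")
```

For simple linear regression, R² is just the square of r, and the residuals always sum to zero.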

Study Design (Experiments)

Control, Replication, Randomization leads to causal associations
Treatments, response
Types
  Simple Randomized Experiment
  Stratified/Blocked, Matched Pairs
Not always feasible (ethically, etc.)

Study Design (Surveys)

Target Population (and a parameter), the sample (and a statistic)
SRS (Randomly select Harvard students)
Stratified (Randomly select three students from each Harvard house)
Bias
  Selection (systematic bias in those that were chosen to be sampled)
  Response (systematic bias in the way that people responded to the
  questions, for example: bad wording or pressure from the ...)
  Non-response (systematic bias in the way that people did not respond
  to the survey)
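
The difference between an SRS and a stratified sample can be illustrated with a small Python sketch. The roster below (three houses of 40 students each) and the helper names are hypothetical; only the idea of "three students from each house" comes from the slide.

```python
import random

def simple_random_sample(population, k, rng):
    """SRS: every subset of size k is equally likely."""
    return rng.sample(population, k)

def stratified_sample(strata, k_per_stratum, rng):
    """Stratified: draw a separate SRS of size k within each stratum."""
    return {name: rng.sample(members, k_per_stratum)
            for name, members in strata.items()}

rng = random.Random(101)  # fixed seed so the sketch is reproducible

# Hypothetical roster, invented for illustration only.
houses = {f"House {h}": [f"{h}-student-{i}" for i in range(1, 41)]
          for h in "ABC"}
everyone = [s for members in houses.values() for s in members]

srs = simple_random_sample(everyone, 9, rng)
stratified = stratified_sample(houses, 3, rng)

print("SRS:", srs)
print("Stratified:", stratified)
```

The SRS may over- or under-represent a house by chance, while the stratified sample guarantees exactly three students per house.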


Intro to Probability

Sample Space, Outcomes
Union, Intersection, Disjoint, Complement
Probability (only on events)
Rules (Unions & Intersections)
Conditional Probability:
  Definition: P(A|B) = P(A and B)/P(B)
Independence
  Check: P(A and B) = P(A)*P(B)
  or P(A|B) = P(A)
Bayes Rule (often, it's just easier using a 2x2 table)
  $P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid A^C)\,P(A^C)}$

Intro to Random Variables

Discrete
  Probability distribution function (sum to one)
  Usually defined in tabular form
  Calculating the mean, variance & sd
  $\mu_X = E(X) = \sum [x \cdot P(X = x)]$
  $\sigma_X^2 = E((X - \mu_X)^2) = \sum [(x - \mu_X)^2 \cdot P(X = x)]$
Continuous
  Probability density function
  Probabilities represented by areas
  Note: P(X = x) = 0 for continuous variables
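
The mean, variance, and sd of a discrete random variable defined in tabular form can be computed directly from the two sums above. The sketch below uses an assumed example table (number of heads in two fair coin flips); the function name `mean_var_sd` is illustrative.

```python
import math

def mean_var_sd(table):
    """table maps each value x to P(X = x); returns (mu, variance, sd)."""
    if not math.isclose(sum(table.values()), 1.0):
        raise ValueError("probabilities must sum to one")
    mu = sum(x * p for x, p in table.items())
    variance = sum((x - mu) ** 2 * p for x, p in table.items())
    return mu, variance, math.sqrt(variance)

# Assumed example: X = number of heads in two fair coin flips.
heads_in_two_flips = {0: 0.25, 1: 0.50, 2: 0.25}

mu, variance, sd = mean_var_sd(heads_in_two_flips)
print(f"E(X) = {mu}, Var(X) = {variance}, SD(X) = {sd:.4f}")
```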

The Normal Distribution

Continuous
$X \sim N(\mu, \sigma)$
Standardize to find probabilities in Table A:
  $Z = \dfrac{X - \mu}{\sigma}$

Binomial Random Variables

Think Coin Flips (counting heads)
4 Major characteristics
  dichotomous, n fixed, $\pi$ fixed, independent trials
Shorthand: $X \sim \text{Bin}(n, \pi)$
Finding probabilities (Formula)
$E(X) = n\pi$, $\text{Var}(X) = n\pi(1 - \pi)$
$\hat{\pi} = X/n$
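
A short Python sketch of both slides follows. It assumes the usual binomial probability formula, C(n, k) * pi^k * (1 - pi)^(n - k), and uses the standard library's normal CDF in place of Table A. The numbers (10 fair coin flips; a N(100, 15) variable evaluated at 130) are invented for illustration.

```python
import math
from statistics import NormalDist

def binomial_pmf(k, n, p_success):
    """P(X = k) for X ~ Bin(n, p_success)."""
    return math.comb(n, k) * p_success**k * (1 - p_success) ** (n - k)

# Binomial: 10 fair coin flips, counting heads (assumed example).
n, p_success = 10, 0.5
p_five_heads = binomial_pmf(5, n, p_success)
mean = sum(k * binomial_pmf(k, n, p_success) for k in range(n + 1))
variance = sum((k - mean) ** 2 * binomial_pmf(k, n, p_success)
               for k in range(n + 1))

# Normal: standardize, then look up the area to the left of z.
mu, sigma = 100, 15      # hypothetical N(mu, sigma)
x = 130
z = (x - mu) / sigma
p_below = NormalDist().cdf(z)   # plays the role of Table A

print(f"P(X = 5) = {p_five_heads:.4f}, E(X) = {mean}, Var(X) = {variance}")
print(f"z = {z}, P(X < {x}) = {p_below:.4f}")
```

Summing over the whole distribution reproduces the shortcut formulas for E(X) and Var(X).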


Binomial Random Variables (Normal approximation to the binomial)

Let $X \sim \text{Bin}(n, \pi)$. Then approximately:
  $X \sim N\left(n\pi, \sqrt{n\pi(1 - \pi)}\right)$
  $\hat{\pi} \sim N\left(\pi, \sqrt{\dfrac{\pi(1 - \pi)}{n}}\right)$
This holds only if:
  $n\pi \geq 10$
  $n(1 - \pi) \geq 10$

Law of Large Numbers and the Central Limit Theorem

Law of Large Numbers
  $\bar{X}$ will have mean equal to the individual observations'
  mean ($\mu$), and its variance will shrink in comparison
  $E(\bar{X}) = \mu$
  $\text{Var}(\bar{X}) = \sigma^2/n$
Central Limit Theorem
  States that all sample means ($\bar{X}$) and sums of RVs will
  be approximately normally distributed, no matter what the original
  distribution (assuming n is large)
  Remember: $\bar{X} \sim N(\mu_{\bar{X}}, \sigma_{\bar{X}})$
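
To see how well the normal approximation to the binomial works, the sketch below compares an exact binomial tail probability with the approximate one. The choice of n = 100, success probability 0.5, and the cutoff 60 is an assumed example; no continuity correction is applied, since the slides do not mention one.

```python
import math
from statistics import NormalDist

def binomial_upper_tail(k, n, p_success):
    """Exact P(X >= k) for X ~ Bin(n, p_success)."""
    return sum(math.comb(n, j) * p_success**j * (1 - p_success) ** (n - j)
               for j in range(k, n + 1))

def normal_upper_tail(k, n, p_success):
    """Approximate P(X >= k) using N(n*p, sqrt(n*p*(1-p)))."""
    approx = NormalDist(mu=n * p_success,
                        sigma=math.sqrt(n * p_success * (1 - p_success)))
    return 1 - approx.cdf(k)

n, p_success = 100, 0.5   # assumed example values

# Rule of thumb from the slide: both expected counts must be at least 10.
conditions_ok = n * p_success >= 10 and n * (1 - p_success) >= 10

exact = binomial_upper_tail(60, n, p_success)
approx = normal_upper_tail(60, n, p_success)
print(f"conditions met: {conditions_ok}")
print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```

The two answers are close but not identical, which is why the result is described as holding only approximately.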


Inference (One-Sample for Means, $\sigma$ unknown)

One sample t-test: $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$
CI for one sample: $\bar{x} \pm t^*_{df = n-1} \dfrac{s}{\sqrt{n}}$
We assume $X_i \sim N(\mu, \sigma)$ & independent
$t \sim t(df = n - 1)$ [when null hypothesis is true]
Assume normal if n is small, OK without extreme outliers
when n is large (n > 15)
Example:
  A sample of 5 Stat 101 students was found to have slept an average of
  5.8 hours the night before their final, with a standard deviation of 2.0.
  Is this significantly lower than the recommended minimum of 7.5 hours?

Power and Sample Size

Calculating Power (2 steps)
  1) Determine the rejection region for $\bar{x}$ under $H_0$
  2) Calculate the probability of $\bar{x}$ falling in that rejection
  region when $H_a$ is true
Power increases with:
  larger sample size, n
  smaller $\sigma$
  further distance between $\mu_A$ and $\mu_0$
Calculating Sample Size from a Desired Margin of Error (m)
  $n = \left(\dfrac{z^* \sigma}{m}\right)^2$
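
The sleep example and the sample-size formula can be worked through as follows. This is a sketch under stated assumptions: a one-sided test at the 5% level (the slide does not name a significance level), a critical value of 2.132 taken from a standard t table for df = 4, and an invented margin of error of 0.5 hours with sigma assumed to be 2.0 for the sample-size part.

```python
import math
from statistics import NormalDist

def one_sample_t(x_bar, mu_0, s, n):
    """t = (x_bar - mu_0) / (s / sqrt(n)), with df = n - 1."""
    return (x_bar - mu_0) / (s / math.sqrt(n))

def sample_size_for_margin(sigma, margin, confidence=0.95):
    """n = (z* sigma / m)^2, rounded up to a whole observation."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z_star * sigma / margin) ** 2)

# Sleep example from the slide: n = 5, x_bar = 5.8, s = 2.0, mu_0 = 7.5.
t_stat = one_sample_t(x_bar=5.8, mu_0=7.5, s=2.0, n=5)

# Assumption: one-sided test at alpha = 0.05; 2.132 is the t-table
# critical value for df = 4.
T_CRIT_DF4 = 2.132
reject_h0 = t_stat < -T_CRIT_DF4

# Hypothetical sample-size question: margin of 0.5 hours, sigma taken as 2.0.
n_needed = sample_size_for_margin(sigma=2.0, margin=0.5)

print(f"t = {t_stat:.3f}, reject H0 at the 5% level: {reject_h0}")
print(f"n needed for a margin of 0.5: {n_needed}")
```

Under these assumptions the t statistic does not fall in the rejection region, so with only 5 students the difference is not significant at the 5% level.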


Practice Problem #1

The following is a collection of unrelated quick problems. Briefly justify
your answer for each problem.

a) Suppose that A and B are two disjoint events within the same sample
space. In addition, let P(A) = 1/8 and P(B) = 1/4. Are events A and B
independent? Explain.

b) Suppose a particular outcome from a random event has a probability of
0.02. Which of the following statements represent correct interpretations of
this probability? Circle the right answer and provide justification.
  i) The outcome will never happen.
  ii) The outcome will happen two times out of every 100 trials, for
  certain.
  iii) The outcome will happen two times out of every 100 trials, on the
  average.
  iv) The outcome could happen or it couldn't; the chances of either
  result are the same.

Practice Problem #1 (cont.)

c) If females of a certain species of lizard always mate with males that are
0.75 years younger than they are, what would the correlation
coefficient between the ages of these male and female lizards be?
Circle the right answer and provide justification.
  i) 0.75
  ii) -0.75
  iii) 0
  iv) 1
  v) -1
  vi) Not enough information to tell

d) Consider the annual salaries of mutual fund managers in the Boston
area. The mean salary is $450,000 and the median salary is $380,000.
Circle the correct answer below. The probability that the salary of a
randomly selected mutual fund manager from the Boston area is larger
than the mean of $450,000 is:
  i) > 0.5
  ii) < 0.5
  iii) = 0.5
  iv) Cannot be determined

Practice Problem #2

Cancer is the #2 cause of death in the United States, yet is not
nearly as deadly in other parts of the world. An investigator
looks at the cancer mortality rate (per 1,000 person-years) vs.
Population Growth Rate per year (in percent) for 171
countries. She starts by looking at the scatterplot and some
summary statistics from her data:

Practice Problem #2 (cont.)

a) What is the equation for the least squares regression line?

b) The US has a growth rate of 0.90%. What is the predicted
cancer mortality rate for the US (the true cancer mortality rate is 123.8)?


Practice Problem #2 (cont.)

c) What percentage of total variability in cancer
mortality rate can be predicted using growth rate?

d) The investigator believes cancer mortality rates
could be lowered if countries encouraged more
baby-making and more immigration. Briefly
explain why this statement may not be correct.

Practice Problem #3

Not everybody likes Britney Spears. In fact, an internet poll run by
Rolling Stone magazine showed that 66% of college-aged men
said they liked Britney, while 30% of college-aged women like her.

a) Imagine Harvard, made up of 52% women, is hosting a Britney
Spears concert. Given that only fans of Britney attend the concert,
what is the probability that the person sitting next to you at the
concert is a woman?

b) A line at the snack bar for the concert has 10 people (all Brit-fans).
What is the probability that exactly 5 of these students are women?
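
One possible setup for Practice Problem #3, parts (a) and (b), is sketched below. It leans on assumptions the problem leaves implicit: the poll percentages apply to Harvard students, all fans attend with equal likelihood, and the people in line are independent of one another. Treat it as a check on your own work rather than an official solution.

```python
import math

# Given in the problem.
p_woman = 0.52
p_fan_given_woman = 0.30
p_fan_given_man = 0.66

# Part (a): Bayes Rule, P(woman | fan).
numerator = p_fan_given_woman * p_woman
denominator = numerator + p_fan_given_man * (1 - p_woman)
p_woman_given_fan = numerator / denominator

# Part (b): X = number of women among 10 fans, X ~ Bin(10, P(woman | fan)).
def binomial_pmf(k, n, p_success):
    """P(X = k) for X ~ Bin(n, p_success)."""
    return math.comb(n, k) * p_success**k * (1 - p_success) ** (n - k)

p_exactly_five = binomial_pmf(5, 10, p_woman_given_fan)

print(f"(a) P(woman | fan) = {p_woman_given_fan:.3f}")
print(f"(b) P(exactly 5 women of 10) = {p_exactly_five:.3f}")
```

The same conditional probability from part (a) becomes the success probability for the binomial count in part (b).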

Practice Problem #3 (cont.)

c) There is a line of 100 students to get into the concert (all of
whom are Brit-fans). What is the probability that the
majority of them are women?

d) The internet poll run by Rolling Stone reported that 66%
of all college-aged men like Britney Spears. Why could this
be a mistake?

Practice Problem #4

A friend of yours is curious to see how confident Harvard students are in
their looks. He asks a random sample of n = 130 Harvard students, "What
percent of Harvard students do you believe are better looking than you?"
This sample had a mean of 30.8% and a standard deviation of 24.2%.
***Note: if people had realistic judgments about themselves, the mean
in the population should be 50%.

a) Calculate the 95% confidence interval to estimate the true mean
percent of students that Harvard students think are better looking
than they are.

b) Based on your confidence interval in part (a), would you expect a
hypothesis test of H0: $\mu$ = 50 to be rejected based
on a two-sided test?

c) Perform the formal hypothesis test as stated in part (b).
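
A possible way to check the arithmetic for Practice Problem #4 is sketched below. It assumes the critical value t* = 1.984, the standard t-table entry for df = 100 at 95% confidence, used here as a conservative stand-in for df = 129; a table or software with the exact df would give a slightly smaller value.

```python
import math

# Given in the problem.
n = 130
x_bar = 30.8   # sample mean, in percent
s = 24.2       # sample standard deviation, in percent
mu_0 = 50.0    # value under H0

# Assumption: t* for df = 100 from a t table, standing in for df = 129.
T_STAR = 1.984

standard_error = s / math.sqrt(n)

# Part (a): 95% confidence interval for the mean.
margin = T_STAR * standard_error
ci_low, ci_high = x_bar - margin, x_bar + margin

# Part (b): H0 is rejected by a two-sided test when mu_0 is outside the CI.
mu_0_outside_ci = not (ci_low <= mu_0 <= ci_high)

# Part (c): formal test statistic.
t_stat = (x_bar - mu_0) / standard_error
reject_h0 = abs(t_stat) > T_STAR

print(f"(a) 95% CI: ({ci_low:.1f}, {ci_high:.1f})")
print(f"(b) 50 outside the interval: {mu_0_outside_ci}")
print(f"(c) t = {t_stat:.2f}, reject H0: {reject_h0}")
```

Parts (b) and (c) agree, as they must for a two-sided test at the matching significance level.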