
AP Statistics Notes

Kathryn Jiang

Contents

1 Part I Describing Data
  1.1 Plotting Data
  1.2 Normal Distribution
  1.3 Contingency Tables
  1.4 Central Limit Theorem

2 Part II Exploring Relationships between Variables
  2.1 Scatterplots
  2.2 Linear Regression
  2.3 Transformations

3 Part III Gathering Data
  3.1 Biases
  3.2 Types of samples
  3.3 Experiments

4 Part IV Probability
  4.1 Probability and Counting
  4.2 General Probability
  4.3 Bernoulli Trials

5 Part V Inference for Proportions

6 Part VI Inference
  6.1 1 and 2 Sample t-tests
  6.2 2-Sample Independent t-tests
  6.3 Paired t-tests
  6.4 Confidence Intervals
  6.5 Hypothesis Testing Control Form

7 Part VII Inference when Variables are Related
  7.1 Chi-square Tests
  7.2 Linear Regression


Part I Describing Data

1.1 Plotting Data

Frequency table: classes on left, frequency on right
Stem and leaf plot: low to high, shows shape
Dot plot: dots on a number line, shows shape
Pie chart: shows parts of a whole
Box plot: five-number summary; outliers are more than 1.5 IQR beyond the 1st and 3rd quartiles
Histogram: a bar graph of quantitative data, shows shape

Data can be described three ways:
1. shape
normal: symmetrical, unimodal, bell-shaped
skewed: name the whale by the tail (the skew is named for the side with the longer tail)
uniform: all bars are the same height
2. center: mean, median, mode
3. spread: stdev increases as spread increases

1.2 Normal Distribution

[Figure 1: Empirical rule — about 0.683 of the area lies within 1 stdev of the mean, and 0.954 within 2 stdevs]

The empirical rule (68-95-99.7) describes how much of the curve falls within 1, 2, and 3 stdevs of the mean.
z-score: how many standard deviations a value x is away from the mean. Used to find area under the curve using the green sheet.

z = (x − μ) / σ


[Figure 2: The mean, median, and mode of different distribution shapes: (a) left skewed (mean < median < mode), (b) normally distributed (mean = median = mode), (c) right skewed (mode < median < mean)]


Table 1: Useful Nspire Functions
normCdf(lower, upper, mean, stdev): area under curve between given bounds
invNorm(area, mean, stdev): value at the given percentile (area to the left), for the given mean and stdev
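For readers without an Nspire, Python's standard library can reproduce both lookups; a minimal sketch (the wrapper names are my own, not the Nspire API):

```python
from statistics import NormalDist

def norm_cdf(lower, upper, mean, stdev):
    """Area under the normal curve between the given bounds."""
    d = NormalDist(mean, stdev)
    return d.cdf(upper) - d.cdf(lower)

def inv_norm(area, mean, stdev):
    """Value with the given area to its left."""
    return NormalDist(mean, stdev).inv_cdf(area)

# Empirical rule check: about 68.3% of the area lies within 1 stdev.
print(round(norm_cdf(-1, 1, 0, 1), 3))  # 0.683
```

The same NormalDist object also converts z-scores to areas directly, standing in for a green-sheet lookup.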

1.3 Contingency Tables

Contingency tables show how individuals are distributed along each variable, depending on
another variable. They test for independence.
Table 2: Contingency Table of Class and Survival of Titanic Passengers

        First  Second  Third  Crew  Total
Alive     202     118    178   212    710
Dead      123     167    528   673   1491
Total     325     285    706   885   2201

In this case, we are trying to determine if survival depends on class. The independent variable (the thing that happened first) is the ticket, not whether you lived or died.
H0: Survival is independent of ticket class
HA: Survival is dependent on ticket class

Table 3: Percentages of Survival by Class of Titanic Passengers

        First  Second  Third  Crew
Alive   62.2%  41.4%   25.2%  24.0%
Dead    37.8%  58.6%   74.8%  76.0%

Because the percentages of survival look different depending on class, the data support HA.
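Table 3 is just the Table 2 counts turned into column percentages; a quick sketch of the computation:

```python
# Conditional percentages of survival by class, from the Table 2 counts.
alive = {"First": 202, "Second": 118, "Third": 178, "Crew": 212}
dead  = {"First": 123, "Second": 167, "Third": 528, "Crew": 673}

for cls in alive:
    total = alive[cls] + dead[cls]           # column total, e.g. 325 for First
    pct_alive = 100 * alive[cls] / total
    print(f"{cls}: {pct_alive:.1f}% alive, {100 - pct_alive:.1f}% dead")
```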

1.4 Central Limit Theorem

σx̄ = σ / √n

As n → ∞, the shape of the sampling distribution becomes more normal, the center stays the same, and the spread (stdev) decreases.
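A short simulation (the uniform population and the sample sizes below are arbitrary choices, not from the notes) illustrates the CLT claim that the spread of sample means shrinks like σ/√n:

```python
import random
import statistics

random.seed(42)

# Population: uniform on [0, 1], whose stdev is sqrt(1/12).
pop_sigma = (1 / 12) ** 0.5

def sampling_stdev(n, reps=2000):
    """Stdev of the sampling distribution of the mean, sample size n."""
    means = [statistics.fmean(random.random() for _ in range(n))
             for _ in range(reps)]
    return statistics.stdev(means)

# Simulated spread vs. the CLT prediction sigma / sqrt(n).
for n in (4, 16, 64):
    print(n, round(sampling_stdev(n), 3), round(pop_sigma / n ** 0.5, 3))
```

Each quadrupling of n roughly halves the spread, matching σ/√n.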


Part II Exploring Relationships between Variables

2.1 Scatterplots

Scatterplots are described using three characteristics:


1. direction: positive, negative (can tell from sign of r)
2. form: linear, nonlinear, curved
3. strength: low, medium, high scatter

2.2 Linear Regression

Table 4: Regression Variables
ŷ: predicted y-value
y: observed y-value
b0: y-intercept
b1: slope = r · (sy / sx)
r²: coefficient of determination
y − ŷ: residuals

Replacing x and y with variables named in context gives the least squares regression line:

ŷ = b0 + b1·x

Interpreting r²: 56.6% of the variability in male weight (response variable) can be explained by the least squares regression on male height (explanatory variable).
Interpreting slope: According to the model, as fht increases by 1, fwt increases by 3.70.
Interpreting y-intercept: According to the model, a gestation period of 0 days corresponds to a life expectancy of 7.87 years.
***CORRELATION DOES NOT IMPLY CAUSATION***
The LSRL always passes through the centroid (x̄, ȳ). Influential points lie outside the x-range of the data and change the slope. Points that are far above or below the mean have leverage, since they drag the entire line up or down toward them, but they do not change the slope.
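The Table 4 formula b1 = r·(sy/sx) and the centroid property can be checked on a small dataset (the height/weight numbers below are hypothetical, not the 56.6% example from the notes):

```python
import statistics

# Hypothetical data: x = height (in), y = weight (lb).
x = [63, 66, 68, 70, 72, 75]
y = [127, 140, 152, 160, 171, 185]

xbar, ybar = statistics.fmean(x), statistics.fmean(y)
sx, sy = statistics.stdev(x), statistics.stdev(y)

# Correlation r as the average product of z-scores.
n = len(x)
r = sum((xi - xbar) / sx * (yi - ybar) / sy for xi, yi in zip(x, y)) / (n - 1)

b1 = r * sy / sx        # slope
b0 = ybar - b1 * xbar   # forces the line through the centroid (xbar, ybar)

print(f"yhat = {b0:.2f} + {b1:.2f} x, r^2 = {r**2:.3f}")
```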

2.3 Transformations

If there is a pattern in the residuals, or if the data looks curved, try transforming it to get a better model. Four ways to transform data:
1. log(x)
2. log(y)
3. log-log
4. when all else fails, break the data into linear segments and perform regression on each

How to see if a model has high predictive power:
residuals show random scatter, with an even number of points above and below 0
high r² value, although some variables are strongly associated yet have correlation not close to 1 because the relationship is curved

Part III Gathering Data

3.1 Biases

Table 5: Types of Bias
voluntary response bias: self-selected, a type of nonresponse (e.g. American Idol voting)
undercoverage: excluding people (e.g. only calling households on landlines)
nonresponse bias: people choose not to respond (e.g. calling when people are not at home)
response bias: responses are distorted (e.g. by loaded questions)

The best way to minimize bias is to use randomization, where each individual is given a
fair, random chance of selection. Using random number tables: starting at the top left, look
at one digit at a time until a dozen numbers are selected. Or, you can use another random
number generator.

3.2 Types of samples

Table 6: Types of Sampling
simple random sample: every grouping is possible in the entire population
stratified random sample: break by characteristic, then sample x from each stratum
systematic random sample: follow a pattern (e.g. every third)
cluster sample: break into groups, then survey everyone in the selected clusters
convenience sample: only asking people nearby; bad

Nonsampling error occurs when you fail to randomize, or there is a computer error.


3.3 Experiments

Table 7: Principles of Experimental Design
control: control group as baseline using null or placebo treatment
randomize: used to even out the effects of things we cannot control
replicate: be able to reproduce results using a different sample
block: match similar subjects to reduce the effects of things you can identify but not control

Blinding:
people who influence results: subjects, treatment administrators, etc.
people who evaluate results: judges, treating physicians, etc.
If only one group is blinded, the experiment is single-blind. Everyone from both groups needs to be blinded for it to be double-blind.
Confounding variables: occur when levels of one factor are associated with levels of another factor, so their effects cannot be separated.
Lurking variables: a third, outside variable that affects both of the variables being studied.

Part IV Probability

4.1 Probability and Counting

nCk = n! / ((n − k)! · k!)

Random things to know:
experiment: tossing a coin 30 times
event: tossing a coin
sample space: {H, T}
blackjack: P(blackjack) = (8/52) · (4/51)

If two events cannot happen at the same time, they are mutually exclusive. The law of large numbers says observed probabilities will go toward theoretical probabilities with a large enough number of trials:

lim (n → ∞) P(observed) = P(theoretical)

Conditional probability is the probability that B happens given that A has already happened, so it's the probability of both A and B happening over the probability of A happening.

P(B|A) = P(A ∩ B) / P(A)
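A worked check of P(B|A) = P(A ∩ B) / P(A) by enumerating two dice (a hypothetical example, not from the notes):

```python
from fractions import Fraction
from itertools import product

# A = "first die shows 3", B = "sum is 8".
outcomes = list(product(range(1, 7), repeat=2))   # all 36 equally likely rolls
A = [o for o in outcomes if o[0] == 3]
A_and_B = [o for o in A if sum(o) == 8]

p_A = Fraction(len(A), len(outcomes))             # 6/36
p_A_and_B = Fraction(len(A_and_B), len(outcomes)) # 1/36
p_B_given_A = p_A_and_B / p_A

print(p_B_given_A)  # 1/6
```

Given the first die is 3, only a second-die 5 gives a sum of 8, so the conditional probability is 1/6 as expected.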

Table 8: Probability Variables
n: number of trials
x: event
σ: standard deviation
p: probability of success
q: 1 − p, probability of failure
P(x): probability of event x
EV or μ: expected value

4.2 General Probability

EV = μ = Σ x · P(x)

σ = √( Σ (x − μ)² · P(x) ) = √( Σ x² · P(x) − μ² )

Standard deviation does not add; variances (σ²) add. σ(aX) = |a| · σ(X)
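Both formulas, plus the shortcut form of the variance, on a small hypothetical payout distribution:

```python
import math

# Discrete distribution as {value: probability}; hypothetical raffle payout.
dist = {0: 0.7, 5: 0.2, 20: 0.1}

ev = sum(x * p for x, p in dist.items())               # EV = sum of x * P(x)
var = sum((x - ev) ** 2 * p for x, p in dist.items())  # sigma^2, definition form
sigma = math.sqrt(var)

# Shortcut form: sigma^2 = (sum of x^2 * P(x)) - mu^2
var_shortcut = sum(x * x * p for x, p in dist.items()) - ev ** 2

print(ev, round(sigma, 3))  # 3.0 6.0
```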

4.3 Bernoulli Trials

There are three conditions for a Bernoulli trial:
1. two possible outcomes
2. probability of success is constant
3. trials are independent

Table 9: Types of Bernoulli Trials

Binomial (fixed number of trials):
μ = n·p
σ = √(n·p·q)
P(x) = C(n, x) · p^x · q^(n−x)

Geometric (stops at first success):
μ = 1/p
P(n) = q^(n−1) · p

Bernoulli models can be approximated with normal models, but only when np ≥ 10 and nq ≥ 10.

Table 10: Useful Nspire Functions
geometricPdf(p, n): probability the first success occurs on the nth trial
binomialCdf(n, p, lower, upper): probability between lower and upper successes occur
binomialPdf(n, p, x): probability exactly x successes occur in n trials
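The Table 10 functions are easy to mirror in Python with math.comb; a sketch (the function names are my own, not the Nspire API):

```python
import math

def binomial_pdf(n, p, x):
    """P(exactly x successes in n trials) = C(n, x) p^x q^(n-x)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomial_cdf(n, p, lower, upper):
    """P(lower <= number of successes <= upper)."""
    return sum(binomial_pdf(n, p, x) for x in range(lower, upper + 1))

def geometric_pdf(p, n):
    """P(first success occurs on trial n) = q^(n-1) p."""
    return (1 - p) ** (n - 1) * p

print(round(binomial_pdf(10, 0.5, 5), 4))  # 0.2461
print(round(geometric_pdf(0.5, 3), 4))     # 0.125
```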


Part V Inference for Proportions

Table 11: Proportions Variables
p: population proportion of success
q: population proportion of failure
p̂: sample proportion of success
q̂: sample proportion of failure
n: number of units in sample
x: number of successes

zts = (p̂ − p) / √(p·q / n)

zts = (p̂1 − p̂2) / √(p̂1·q̂1/n1 + p̂2·q̂2/n2)

Follow this progression to find out what values of p and q to use in formulas:
1. p, q from previous studies
2. proportion from hypothesis
3. use p̂, q̂ from the samples
4. p = 0.5, q = 0.5 worst case

We use p̂ to predict p at a point, and create a confidence interval accordingly:

CI = p̂ ± margin of error = p̂ ± z* · √(p̂·q̂ / n)

where √(p̂·q̂/n) is the standard error. Essentially, this interval is created by taking the number of standard deviations away on each side, which is what z* is. If many, many samples are collected and a confidence interval is created for each, then 95% of the intervals will capture the true proportion of whatever you're trying to measure.
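A sketch of the 1-proportion z-interval above, applied to the overall Titanic survival proportion from Table 2 (the 95% level and the function name are my choices):

```python
from statistics import NormalDist

def prop_ci(x, n, conf=0.95):
    """CI = p-hat +/- z* sqrt(p-hat q-hat / n)."""
    p_hat = x / n
    z_star = NormalDist().inv_cdf(0.5 + conf / 2)  # ~1.96 for 95%
    se = (p_hat * (1 - p_hat) / n) ** 0.5          # standard error
    return p_hat - z_star * se, p_hat + z_star * se

# 710 survivors out of 2201 passengers (Table 2 totals).
lo, hi = prop_ci(710, 2201)
print(round(lo, 3), round(hi, 3))
```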
Table 12: Types of Errors
Type I: symbol α, false positive, null wrongly rejected
Type II: symbol β, false negative, null wrongly supported
power = 1 − β (power increases as n increases)


Part VI Inference

6.1 1 and 2 Sample t-tests

Table 13: Independent t-test Variables
n: number of samples
μ: mean of population
μ0: value from hypothesis
x̄: mean of sample
σ: stdev of population
s: stdev of sample
tts: t-test statistic

tts = (x̄ − μ0) / (s / √n)

This finds the t-test statistic for 1-sample data. The t-test statistic is pretty similar to the z-test statistic in that it describes the number of standard deviations your value is away from the mean on a t-distribution, which depends on degrees of freedom and has fatter tails than a normal distribution.

Error = t* · s / √n

This is the error associated with the confidence interval. In most cases, since we cannot find t*, we use z* instead.
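The 1-sample formula on a small hypothetical dataset (the battery lifespans and H0 value below are made up for illustration):

```python
import math
import statistics

# tts = (xbar - mu0) / (s / sqrt(n)); testing H0: mu = 300 minutes.
data = [291, 305, 318, 299, 310, 322, 295, 308]
mu0 = 300

n = len(data)
xbar = statistics.fmean(data)      # sample mean
s = statistics.stdev(data)         # sample stdev

tts = (xbar - mu0) / (s / math.sqrt(n))
print(round(tts, 3), "df =", n - 1)
```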

6.2 2-Sample Independent t-tests

tts = ((x̄1 − x̄2) − Δ0) / √(s1²/n1 + s2²/n2)

StandardError(x̄1 − x̄2) = √(s1²/n1 + s2²/n2)

Δ0 is the hypothesized difference in means for the two independent groups, which can be written as

H0: μ1 − μ2 = 0
HA: μ1 − μ2 > 0

Table 14: Paired t-test Variables
n: number of pairs
μd: population mean of differences
d̄: sample mean of differences
σd: population stdev of differences
sd: sample stdev of differences
tts: t-test statistic

6.3 Paired t-tests

Paired t-tests are used when each pair of data is related in some way. For example, the data could be before/after a certain treatment given to the same person.

tts = (d̄ − Δ0) / (sd / √n)

StandardError(d̄) = sd / √n

6.4 Confidence Intervals

ConfidenceInterval = Statistic ± CriticalValue · StandardError = x̄ ± t* · s / √n

Confidence interval statement: I am 90% confident that the true mean of battery lifespans is between 291.1 and 321.4 minutes. t* is the number of standard deviations you go out to get that confidence interval (just like a z-score).
If μ0 does not lie within the confidence interval, then the data is statistically significant. If μ0 lies anywhere inside the interval, even if it is not in the center, the data is not statistically significant.
For 2-tailed tests (HA: ≠), each tail will be α/2. For 1-tailed tests (HA: > or <), the single tail will be α.
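A sketch of the mean confidence interval; the Python stdlib has no inverse t-distribution, so this follows the note above and substitutes z* for t* (slightly narrow for small n; the data are hypothetical):

```python
import math
import statistics
from statistics import NormalDist

def mean_ci(data, conf=0.90):
    """CI = xbar +/- critical value * s / sqrt(n), with z* standing in for t*."""
    n = len(data)
    xbar = statistics.fmean(data)
    se = statistics.stdev(data) / math.sqrt(n)     # standard error of the mean
    z_star = NormalDist().inv_cdf(0.5 + conf / 2)  # ~1.645 for 90%
    return xbar - z_star * se, xbar + z_star * se

# Hypothetical battery lifespans (minutes).
lo, hi = mean_ci([291, 305, 318, 299, 310, 322, 295, 308])
print(round(lo, 1), round(hi, 1))
```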

6.5 Hypothesis Testing Control Form

1. Identify parameter of interest: mean μ
2. Pick choice of test: 1- or 2-sample independent or paired t-test
3. Check assumptions and conditions

Table 15: Assumptions and Conditions for Paired t-test
Samples are paired: look at how the study was conducted
Data are independent: SRS & n < 0.1N
Data are normally distributed: look at box plot, normal probability plot, histogram, or dot plot


4. Write null and alternate hypotheses. Make sure to define μ1 and μ2.
H0: μ1 = μ2
HA: μ1 > μ2
5. Calculate test statistic
6. Write p-value statement: There is a 0.0032% chance of seeing data like this if there were no significant difference.
7. Sketch the sampling distribution assuming H0 is true: draw the curve and shade in the p-value
8. Compare p-value against alpha

Table 16: P-value Reconciliation
If p-value > alpha: Fail to reject null hypothesis H0. Do not support alternate hypothesis HA. The data do not support that there is a significant difference in the means of....
If p-value ≤ alpha: Reject null hypothesis H0. Support alternate hypothesis HA. The data do support that there is a significant difference in the means of....

Part VII Inference when Variables are Related

7.1 Chi-square Tests

Table 17: Chi-square tests

Goodness of Fit:
variables: 1
degrees of freedom (df): k − 1
expected counts: given
H0: pO = pE for each category (compares counts against given distribution proportions)
HA: at least 1 significant difference

Homogeneity:
variables: 1 (data stratified)
degrees of freedom (df): (r − 1)(c − 1)
expected counts: (row total · column total) / grand total
H0: e.g. ps,w = pj,w
HA: e.g. ps,w ≠ pj,w

Independence:
variables: 2
degrees of freedom (df): (r − 1)(c − 1)
expected counts: (row total · column total) / grand total
H0: e.g. response independent of gender
HA: e.g. response dependent on gender

Note: a 2-proportion z-test can also be used instead of a chi-square test for homogeneity.

χ² = Σ (Obs − Exp)² / Exp

The chi-square statistic is used to find the p-value based on the chi-square distribution, which is skewed right. Increasing df reduces skew.
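The χ² statistic can be computed by hand for the Titanic table from Part I, using expected = (row total · column total) / grand total:

```python
# Observed counts from Table 2 (columns: First, Second, Third, Crew).
observed = {
    "Alive": [202, 118, 178, 212],
    "Dead":  [123, 167, 528, 673],
}
col_totals = [sum(col) for col in zip(*observed.values())]  # per-class totals
row_totals = {row: sum(v) for row, v in observed.items()}
grand = sum(col_totals)                                     # 2201

chi2 = 0.0
for row, counts in observed.items():
    for obs, ct in zip(counts, col_totals):
        exp = row_totals[row] * ct / grand   # expected count for this cell
        chi2 += (obs - exp) ** 2 / exp

df = (len(observed) - 1) * (len(col_totals) - 1)   # (r-1)(c-1)
print(round(chi2, 1), "df =", df)
```

The very large statistic relative to df = 3 matches the conclusion in Part I that survival depends on class.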

7.2 Linear Regression

H0: β = 0
HA: β ≠ 0

To test whether there is a linear association between two variables, compare the slope β, which is the parameter, to 0. The statistic is b1, the coefficient of the x-variable in the calculator printout (given).

tTS = (b1 − β) / SE(b1)

The t test statistic, when calculated by hand, is found by the equation above. In most cases, β = 0, but it depends on the hypotheses.

tCdf(tTS, ∞, df) = p-value for one tail; double it for two tails

In most cases, you need to double it because the test is two-tailed (HA: ≠). Or use the green sheet.

Confidence interval: b1 ± t*(df = n − 2) · SE(b1)

Statement: I am 95% confident the true slope of the regression line is between 0.883 and 1.737.
From the p-value, you can conclude there is / is not an association between the two variables.
