
AP Statistics Notes

Kathryn Jiang

Contents

1 Part I Describing Data
  1.1 Plotting Data
  1.2 Normal Distribution
  1.3 Contingency Tables
  1.4 Central Limit Theorem

2 Part II Exploring Relationships between Variables
  2.1 Scatterplots
  2.2 Linear Regression
  2.3 Transformations

3 Part III Gathering Data
  3.1 Biases
  3.2 Types of samples
  3.3 Experiments

4 Part IV Probability
  4.1 Probability and Counting
  4.2 General Probability
  4.3 Bernoulli Trials

5 Part V Inference for Proportions

6 Part VI Inference
  6.1 1 and 2 Sample t-tests
  6.2 2-Sample Independent t-tests
  6.3 Paired t-tests
  6.4 Confidence Intervals
  6.5 Hypothesis Testing Control Form

7 Part VII Inference when Variables are Related
  7.1 Chi-square Tests
  7.2 Linear Regression


Part I Describing Data

1.1 Plotting Data

Frequency table: classes on left, frequency on right
Stem and leaf plot: low to high, shows shape
Dot plot: dots on a number line, shows shape
Pie chart: shows parts of a whole
Box plot: five-number summary; outliers are more than 1.5 IQR beyond the 1st and 3rd quartiles
Histogram: a bar graph of quantitative data, shows shape

Data can be described three ways:
1. shape
normal: symmetrical, unimodal, bell-shaped
skewed: name the whale by the tail (the skew is named for the side with the longer tail)
uniform: all bars are the same height
2. center: mean, median, mode
3. spread: stdev increases as spread increases

1.2 Normal Distribution

[Figure 1: Empirical rule — about 0.683 of the area lies within 1 stdev of the mean, and 0.954 within 2 stdevs]

The empirical rule (68-95-99.7) describes how much of the curve falls within 1, 2, and 3 stdevs of the mean.
z-score: how many standard deviations a value x is away from the mean. Used to find area under the curve using the green sheet.

z = (x − μ) / σ


[Figure 2: The mean, median, and mode of different distribution shapes: (a) left skewed (mean < median < mode), (b) normally distributed (mean = median = mode), (c) right skewed (mode < median < mean)]


Table 1: Useful Nspire Functions
normCdf(lower, upper, mean, stdev): area under curve between given bounds
invNorm(area, mean, stdev): value at the given percentile (area to the left), for the given mean and stdev
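For readers without an Nspire, Python's standard library can reproduce both lookups; a minimal sketch (the wrapper names are my own, not the Nspire API):

```python
from statistics import NormalDist

def norm_cdf(lower, upper, mean, stdev):
    """Area under the normal curve between the given bounds."""
    d = NormalDist(mean, stdev)
    return d.cdf(upper) - d.cdf(lower)

def inv_norm(area, mean, stdev):
    """Value with the given area to its left."""
    return NormalDist(mean, stdev).inv_cdf(area)

# Empirical rule check: about 68.3% of the area lies within 1 stdev.
print(round(norm_cdf(-1, 1, 0, 1), 3))  # 0.683
```

The same NormalDist object also converts z-scores to areas directly, standing in for a green-sheet lookup.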

1.3 Contingency Tables

Contingency tables show how individuals are distributed along each variable, depending on
another variable. They test for independence.
Table 2: Contingency Table of Class and Survival of Titanic Passengers

        First  Second  Third  Crew  Total
Alive     202     118    178   212    710
Dead      123     167    528   673   1491
Total     325     285    706   885   2201

In this case, we are trying to determine if survival depends on class. The independent variable (the thing that happened first) is the ticket, not whether you lived or died.
H0: Survival is independent of ticket class
HA: Survival is dependent on ticket class

Table 3: Percentages of Survival by Class of Titanic Passengers

        First  Second  Third  Crew
Alive   62.2%  41.4%   25.2%  24.0%
Dead    37.8%  58.6%   74.8%  76.0%

Because the percentages of survival look different depending on class, the data support HA.
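Table 3 is just the Table 2 counts turned into column percentages; a quick sketch of the computation:

```python
# Conditional percentages of survival by class, from the Table 2 counts.
alive = {"First": 202, "Second": 118, "Third": 178, "Crew": 212}
dead  = {"First": 123, "Second": 167, "Third": 528, "Crew": 673}

for cls in alive:
    total = alive[cls] + dead[cls]           # column total, e.g. 325 for First
    pct_alive = 100 * alive[cls] / total
    print(f"{cls}: {pct_alive:.1f}% alive, {100 - pct_alive:.1f}% dead")
```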

1.4 Central Limit Theorem

σx̄ = σ / √n

As n → ∞, the shape of the sampling distribution becomes more normal, the center stays the same, and the spread (stdev) decreases.
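A short simulation (the uniform population and the sample sizes below are arbitrary choices, not from the notes) illustrates the CLT claim that the spread of sample means shrinks like σ/√n:

```python
import random
import statistics

random.seed(42)

# Population: uniform on [0, 1], whose stdev is sqrt(1/12).
pop_sigma = (1 / 12) ** 0.5

def sampling_stdev(n, reps=2000):
    """Stdev of the sampling distribution of the mean, sample size n."""
    means = [statistics.fmean(random.random() for _ in range(n))
             for _ in range(reps)]
    return statistics.stdev(means)

# Simulated spread vs. the CLT prediction sigma / sqrt(n).
for n in (4, 16, 64):
    print(n, round(sampling_stdev(n), 3), round(pop_sigma / n ** 0.5, 3))
```

Each quadrupling of n roughly halves the spread, matching σ/√n.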


Part II Exploring Relationships between Variables

2.1 Scatterplots

Scatterplots are described using three characteristics:


1. direction: positive, negative (can tell from sign of r)
2. form: linear, nonlinear, curved
3. strength: low, medium, high scatter

2.2 Linear Regression

Table 4: Regression Variables
ŷ: predicted y-value
y: observed y-value
b0: y-intercept
b1: slope = r · (sy / sx)
r²: coefficient of determination
y − ŷ: residuals

Replacing x and y with variables named in context gives the least squares regression line:

ŷ = b0 + b1·x

Interpreting r²: 56.6% of the variability in male weight (response variable) can be explained by the least squares regression on male height (explanatory variable).
Interpreting slope: According to the model, as fht increases by 1, fwt increases by 3.70.
Interpreting y-intercept: According to the model, a gestation period of 0 days corresponds to a life expectancy of 7.87 years.
***CORRELATION DOES NOT IMPLY CAUSATION***
The LSRL always passes through the centroid (x̄, ȳ). Influential points lie outside the x-range of the data and change the slope. Points that are far above or below the mean have leverage, since they drag the entire line up or down toward them, but they do not change the slope.
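The Table 4 formula b1 = r·(sy/sx) and the centroid property can be checked on a small dataset (the height/weight numbers below are hypothetical, not the 56.6% example from the notes):

```python
import statistics

# Hypothetical data: x = height (in), y = weight (lb).
x = [63, 66, 68, 70, 72, 75]
y = [127, 140, 152, 160, 171, 185]

xbar, ybar = statistics.fmean(x), statistics.fmean(y)
sx, sy = statistics.stdev(x), statistics.stdev(y)

# Correlation r as the average product of z-scores.
n = len(x)
r = sum((xi - xbar) / sx * (yi - ybar) / sy for xi, yi in zip(x, y)) / (n - 1)

b1 = r * sy / sx        # slope
b0 = ybar - b1 * xbar   # forces the line through the centroid (xbar, ybar)

print(f"yhat = {b0:.2f} + {b1:.2f} x, r^2 = {r**2:.3f}")
```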

2.3 Transformations

If there is a pattern in the residuals, or if the data looks curved, try transforming it to get a better model. Four ways to transform data:
1. log(x)
2. log(y)
3. log-log
4. when all else fails, break the data into linear segments and perform regression on each

How to see if a model has high predictive power:
residuals show random scatter, with an even number of points above and below 0
high r² value, although some variables are strongly associated yet have correlation not close to 1 because the relationship is curved

Part III Gathering Data

3.1 Biases

Table 5: Types of Bias
voluntary response bias: self-selected, a type of nonresponse (e.g. American Idol voting)
undercoverage: excluding people (e.g. only calling households on landlines)
nonresponse bias: people choose not to respond (e.g. calling when people are not at home)
response bias: responses are distorted (e.g. by loaded questions)

The best way to minimize bias is to use randomization, where each individual is given a
fair, random chance of selection. Using random number tables: starting at the top left, look
at one digit at a time until a dozen numbers are selected. Or, you can use another random
number generator.

3.2 Types of samples

Table 6: Types of Sampling
simple random sample: every grouping is possible in the entire population
stratified random sample: break by characteristic, then sample x from each stratum
systematic random sample: follow a pattern (e.g. every third)
cluster sample: break into groups, then survey everyone in the selected clusters
convenience sample: only asking people nearby; bad

Nonsampling error occurs when you fail to randomize, or there is a computer error.


3.3 Experiments

Table 7: Principles of Experimental Design
control: control group as baseline using null or placebo treatment
randomize: used to even out the effects of things we cannot control
replicate: be able to reproduce results using a different sample
block: match similar subjects to reduce the effects of things you can identify but not control

Blinding:
people who influence results: subjects, treatment administrators, etc.
people who evaluate results: judges, treating physicians, etc.
If only one group is blinded, the experiment is single-blind. Everyone from both groups needs to be blinded for it to be double-blind.
Confounding variables: occur when levels of one factor are associated with levels of another factor, so their effects cannot be separated.
Lurking variables: a third, outside variable that affects both of the variables being studied.

Part IV Probability

4.1 Probability and Counting

nCk = n! / ((n − k)! · k!)

Random things to know:
experiment: tossing a coin 30 times
event: tossing a coin
sample space: {H, T}
blackjack: P(blackjack) = (8/52) · (4/51)

If two events cannot happen at the same time, they are mutually exclusive. The law of large numbers says observed probabilities will go toward theoretical probabilities with a large enough number of trials:

lim (n → ∞) P(observed) = P(theoretical)

Conditional probability is the probability that B happens given that A has already happened, so it's the probability of both A and B happening over the probability of A happening.

P(B|A) = P(A ∩ B) / P(A)
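A worked check of P(B|A) = P(A ∩ B) / P(A) by enumerating two dice (a hypothetical example, not from the notes):

```python
from fractions import Fraction
from itertools import product

# A = "first die shows 3", B = "sum is 8".
outcomes = list(product(range(1, 7), repeat=2))   # all 36 equally likely rolls
A = [o for o in outcomes if o[0] == 3]
A_and_B = [o for o in A if sum(o) == 8]

p_A = Fraction(len(A), len(outcomes))             # 6/36
p_A_and_B = Fraction(len(A_and_B), len(outcomes)) # 1/36
p_B_given_A = p_A_and_B / p_A

print(p_B_given_A)  # 1/6
```

Given the first die is 3, only a second-die 5 gives a sum of 8, so the conditional probability is 1/6 as expected.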

Table 8: Probability Variables
n: number of trials
x: event
σ: standard deviation
p: probability of success
q: 1 − p, probability of failure
P(x): probability of event x
EV or μ: expected value

4.2 General Probability

EV = μ = Σ x · P(x)

σ = √( Σ (x − μ)² · P(x) ) = √( Σ x² · P(x) − μ² )

Standard deviation does not add; variances (σ²) add. σ(aX) = |a| · σ(X)
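Both formulas, plus the shortcut form of the variance, on a small hypothetical payout distribution:

```python
import math

# Discrete distribution as {value: probability}; hypothetical raffle payout.
dist = {0: 0.7, 5: 0.2, 20: 0.1}

ev = sum(x * p for x, p in dist.items())               # EV = sum of x * P(x)
var = sum((x - ev) ** 2 * p for x, p in dist.items())  # sigma^2, definition form
sigma = math.sqrt(var)

# Shortcut form: sigma^2 = (sum of x^2 * P(x)) - mu^2
var_shortcut = sum(x * x * p for x, p in dist.items()) - ev ** 2

print(ev, round(sigma, 3))  # 3.0 6.0
```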

4.3 Bernoulli Trials

There are three conditions for a Bernoulli trial:
1. two possible outcomes
2. probability of success is constant
3. trials are independent

Table 9: Types of Bernoulli Trials

Binomial (fixed number of trials):
μ = n·p
σ = √(n·p·q)
P(x) = C(n, x) · p^x · q^(n−x)

Geometric (stops at first success):
μ = 1/p
P(n) = q^(n−1) · p

Bernoulli models can be approximated with normal models, but only when np ≥ 10 and nq ≥ 10.

Table 10: Useful Nspire Functions
geometricPdf(p, n): probability the first success occurs on the nth trial
binomialCdf(n, p, lower, upper): probability between lower and upper successes occur
binomialPdf(n, p, x): probability exactly x successes occur in n trials
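The Table 10 functions are easy to mirror in Python with math.comb; a sketch (the function names are my own, not the Nspire API):

```python
import math

def binomial_pdf(n, p, x):
    """P(exactly x successes in n trials) = C(n, x) p^x q^(n-x)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomial_cdf(n, p, lower, upper):
    """P(lower <= number of successes <= upper)."""
    return sum(binomial_pdf(n, p, x) for x in range(lower, upper + 1))

def geometric_pdf(p, n):
    """P(first success occurs on trial n) = q^(n-1) p."""
    return (1 - p) ** (n - 1) * p

print(round(binomial_pdf(10, 0.5, 5), 4))  # 0.2461
print(round(geometric_pdf(0.5, 3), 4))     # 0.125
```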


Part V Inference for Proportions

Table 11: Proportions Variables
p: population proportion of success
q: population proportion of failure
p̂: sample proportion of success
q̂: sample proportion of failure
n: number of units in sample
x: number of successes

zts = (p̂ − p) / √(p·q / n)

zts = (p̂1 − p̂2) / √(p̂1·q̂1/n1 + p̂2·q̂2/n2)

Follow this progression to find out what values of p and q to use in formulas:
1. p, q from previous studies
2. proportion from hypothesis
3. use p̂, q̂ from the samples
4. p = 0.5, q = 0.5 worst case

We use p̂ to predict p at a point, and create a confidence interval accordingly:

CI = p̂ ± margin of error = p̂ ± z* · √(p̂·q̂ / n)

where √(p̂·q̂/n) is the standard error. Essentially, this interval is created by taking the number of standard deviations away on each side, which is what z* is. If many, many samples are collected and a confidence interval is created for each, then 95% of the intervals will capture the true proportion of whatever you're trying to measure.
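A sketch of the 1-proportion z-interval above, applied to the overall Titanic survival proportion from Table 2 (the 95% level and the function name are my choices):

```python
from statistics import NormalDist

def prop_ci(x, n, conf=0.95):
    """CI = p-hat +/- z* sqrt(p-hat q-hat / n)."""
    p_hat = x / n
    z_star = NormalDist().inv_cdf(0.5 + conf / 2)  # ~1.96 for 95%
    se = (p_hat * (1 - p_hat) / n) ** 0.5          # standard error
    return p_hat - z_star * se, p_hat + z_star * se

# 710 survivors out of 2201 passengers (Table 2 totals).
lo, hi = prop_ci(710, 2201)
print(round(lo, 3), round(hi, 3))
```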
Table 12: Types of Errors
Type I: symbol α, false positive, null wrongly rejected
Type II: symbol β, false negative, null wrongly supported
power = 1 − β (power increases as n increases)


Part VI Inference

6.1 1 and 2 Sample t-tests

Table 13: Independent t-test Variables
n: number of samples
μ: mean of population
μ0: value from hypothesis
x̄: mean of sample
σ: stdev of population
s: stdev of sample
tts: t-test statistic

tts = (x̄ − μ0) / (s / √n)

This finds the t-test statistic for 1-sample data. The t-test statistic is pretty similar to the z-test statistic in that it describes the number of standard deviations your value is away from the mean on a t-distribution, which depends on degrees of freedom and has fatter tails than a normal distribution.

Error = t* · s / √n

This is the error associated with the confidence interval. In most cases, since we cannot find t*, we use z* instead.
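The 1-sample formula on a small hypothetical dataset (the battery lifespans and H0 value below are made up for illustration):

```python
import math
import statistics

# tts = (xbar - mu0) / (s / sqrt(n)); testing H0: mu = 300 minutes.
data = [291, 305, 318, 299, 310, 322, 295, 308]
mu0 = 300

n = len(data)
xbar = statistics.fmean(data)      # sample mean
s = statistics.stdev(data)         # sample stdev

tts = (xbar - mu0) / (s / math.sqrt(n))
print(round(tts, 3), "df =", n - 1)
```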

6.2 2-Sample Independent t-tests

tts = ((x̄1 − x̄2) − Δ0) / √(s1²/n1 + s2²/n2)

StandardError(x̄1 − x̄2) = √(s1²/n1 + s2²/n2)

Δ0 is the hypothesized difference in means for the two independent groups, which can be written as

H0: μ1 − μ2 = 0
HA: μ1 − μ2 > 0

Table 14: Paired t-test Variables
n: number of pairs
μd: population mean of differences
d̄: sample mean of differences
σd: population stdev of differences
sd: sample stdev of differences
tts: t-test statistic

6.3 Paired t-tests

Paired t-tests are used when each pair of data is related in some way. For example, the data could be before/after a certain treatment given to the same person.

tts = (d̄ − Δ0) / (sd / √n)

StandardError(d̄) = sd / √n

6.4 Confidence Intervals

ConfidenceInterval = Statistic ± CriticalValue · StandardError = x̄ ± t* · s / √n

Confidence interval statement: I am 90% confident that the true mean of battery lifespans is between 291.1 and 321.4 minutes. t* is the number of standard deviations you go out to get that confidence interval (just like a z-score).
If μ0 does not lie within the confidence interval, then the data is statistically significant. If μ0 lies anywhere inside the interval, even if it is not in the center, the data is not statistically significant.
For 2-tailed tests (HA: ≠), each tail will be α/2. For 1-tailed tests (HA: > or <), the single tail will be α.
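A sketch of the mean confidence interval; the Python stdlib has no inverse t-distribution, so this follows the note above and substitutes z* for t* (slightly narrow for small n; the data are hypothetical):

```python
import math
import statistics
from statistics import NormalDist

def mean_ci(data, conf=0.90):
    """CI = xbar +/- critical value * s / sqrt(n), with z* standing in for t*."""
    n = len(data)
    xbar = statistics.fmean(data)
    se = statistics.stdev(data) / math.sqrt(n)     # standard error of the mean
    z_star = NormalDist().inv_cdf(0.5 + conf / 2)  # ~1.645 for 90%
    return xbar - z_star * se, xbar + z_star * se

# Hypothetical battery lifespans (minutes).
lo, hi = mean_ci([291, 305, 318, 299, 310, 322, 295, 308])
print(round(lo, 1), round(hi, 1))
```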

6.5 Hypothesis Testing Control Form

1. Identify parameter of interest: mean μ
2. Pick choice of test: 1- or 2-sample independent or paired t-test
3. Check assumptions and conditions

Table 15: Assumptions and Conditions for Paired t-test
Samples are paired: look at how the study was conducted
Data are independent: SRS & n < 0.1N
Data are normally distributed: look at box plot, normal probability plot, histogram, or dot plot


4. Write null and alternate hypotheses. Make sure to define μ1 and μ2.
H0: μ1 = μ2
HA: μ1 > μ2
5. Calculate test statistic
6. Write p-value statement: There is a 0.0032% chance of seeing data like this if there were no significant difference.
7. Sketch the sampling distribution assuming H0 is true: draw the curve and shade in the p-value
8. Compare p-value against alpha

Table 16: P-value Reconciliation
If p-value > alpha: Fail to reject null hypothesis H0. Do not support alternate hypothesis HA. The data do not support that there is a significant difference in the means of....
If p-value ≤ alpha: Reject null hypothesis H0. Support alternate hypothesis HA. The data do support that there is a significant difference in the means of....

Part VII Inference when Variables are Related

7.1 Chi-square Tests

Table 17: Chi-square tests

Goodness of Fit:
variables: 1
degrees of freedom (df): k − 1
expected counts: given
H0: pO = pE for each category (compares counts against given distribution proportions)
HA: at least 1 significant difference

Homogeneity:
variables: 1 (data stratified)
degrees of freedom (df): (r − 1)(c − 1)
expected counts: (row total · column total) / grand total
H0: e.g. ps,w = pj,w
HA: e.g. ps,w ≠ pj,w

Independence:
variables: 2
degrees of freedom (df): (r − 1)(c − 1)
expected counts: (row total · column total) / grand total
H0: e.g. response independent of gender
HA: e.g. response dependent on gender

Note: a 2-proportion z-test can also be used instead of a chi-square test for homogeneity.

χ² = Σ (Obs − Exp)² / Exp

The chi-square statistic is used to find the p-value based on the chi-square distribution, which is skewed right. Increasing df reduces skew.
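The χ² statistic can be computed by hand for the Titanic table from Part I, using expected = (row total · column total) / grand total:

```python
# Observed counts from Table 2 (columns: First, Second, Third, Crew).
observed = {
    "Alive": [202, 118, 178, 212],
    "Dead":  [123, 167, 528, 673],
}
col_totals = [sum(col) for col in zip(*observed.values())]  # per-class totals
row_totals = {row: sum(v) for row, v in observed.items()}
grand = sum(col_totals)                                     # 2201

chi2 = 0.0
for row, counts in observed.items():
    for obs, ct in zip(counts, col_totals):
        exp = row_totals[row] * ct / grand   # expected count for this cell
        chi2 += (obs - exp) ** 2 / exp

df = (len(observed) - 1) * (len(col_totals) - 1)   # (r-1)(c-1)
print(round(chi2, 1), "df =", df)
```

The very large statistic relative to df = 3 matches the conclusion in Part I that survival depends on class.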

7.2 Linear Regression

H0: β = 0
HA: β ≠ 0

To test whether there is a linear association between two variables, compare the slope β, which is the parameter, to 0. The statistic is b1, the coefficient of the x-variable in the calculator printout (given).

tTS = (b1 − β) / SE(b1)

The t test statistic, when calculated by hand, is found by the equation above. In most cases, β = 0, but it depends on the hypotheses.

tCdf(tTS, ∞, df) = p-value for one tail; double it for two tails

In most cases, you need to double it because the test is two-tailed (HA: ≠). Or use the green sheet.

Confidence interval: b1 ± t*(df = n − 2) · SE(b1)

Statement: I am 95% confident the true slope of the regression line is between 0.883 and 1.737.
From the p-value, you can conclude there is / is not an association between the two variables.
