Inferential statistics        Descriptive statistics
Making inferences             Collection
Hypothesis testing            Organization
Determining relationships     Summarizing
Making predictions            Graphical displays

                      Population   Sample
Number                N            n
Standard deviation    σ            s
Variance              σ²           s²
Mean                  μ            x̄ (xbar)

Ho: null hypothesis. Ha: alternate (research) hypothesis.
Level of confidence (1 − α) vs. level of significance (α).


1-α    α      α/2     z-score (critical value)
.90    .10    .05     1.65
.95    .05    .025    1.96
.99    .01    .005    2.58
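The rows of this table can be checked numerically. A minimal sketch using only the standard library (the function names are my own): it computes the two-tailed area each z-score covers under the standard normal curve, which should match the 1-α column.

```python
import math

def normal_cdf(z):
    # Standard normal CDF, expressed through the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def central_area(z):
    # Area between -z and +z: the confidence level 1 - alpha
    return normal_cdf(z) - normal_cdf(-z)

for z in (1.65, 1.96, 2.58):
    print(f"z = {z}: covers {central_area(z):.3f} of the curve")
```

For example, central_area(1.96) comes out to about 0.950, matching the .95 row.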

Probability: Mutually exclusive means that two events cannot occur at the same time. Collectively exhaustive means that one of the events must occur.
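A minimal sketch of these two definitions for a fair six-sided die (the event sets are my own example): "even" and "odd" share no outcome, yet together cover the whole sample space.

```python
outcomes = {1, 2, 3, 4, 5, 6}   # sample space of a fair die
even = {2, 4, 6}
odd = {1, 3, 5}

mutually_exclusive = even.isdisjoint(odd)           # cannot occur at the same time
collectively_exhaustive = (even | odd) == outcomes  # one of them must occur

# For mutually exclusive events the addition rule is exact:
p_even_or_odd = len(even) / 6 + len(odd) / 6        # 0.5 + 0.5 = 1.0
```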
P = successes / tries. Scales of measurement: Nominal and Ordinal (qualitative; used with Chi-square), Interval (measures by using equal intervals), Ratio (distances on a number scale). Relative Frequency Theory: if an experiment is repeated many times and a particular outcome occurs frequently, its relative frequency is likely close to the probability of that outcome occurring. Addition (or) and Multiplication (and) Rules (joint occurrences): for independent events, P(ABC) = P(A)*P(B)*P(C), e.g. 0.5*0.5*0.5 = 0.125. Frequency distributions are organizations of data into intervals. From these you can take measurements:
Relative frequency: the number of data values you are interested in / the total number in the sample. Univariate Data (describing one variable): Central tendency is a measurement for a collection of data values that is meant to convey the idea of 'centralness' for the data set. The most common measures of central tendency are the mean, median, and mode (MMM); values tend to cluster around the middle of the set of numbers. Z-scores (AKA standard scores) indicate how many standard deviations separate a particular value from the mean; the result is interpreted as the number of standard deviations an observation lies above or below the mean. Skewness measures the degree of symmetry in a frequency distribution; positive skew means the tail is to the right. Variability (dispersion) is a measurement for a collection of data values that is meant to convey the idea of spread for the data set. Range (R): the difference between the smallest and largest numbers in the data. Standard Deviation and Variance: they provide information about how the values are clustered around the mean. If the values are tightly clustered, the standard deviation will be small.
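The clustering claim can be illustrated with two invented samples that share the same mean of 10:

```python
import statistics

clustered = [9, 10, 10, 11, 10]   # values hug the mean of 10
spread = [0, 5, 10, 15, 20]       # same mean of 10, values far apart

s_clustered = statistics.stdev(clustered)  # small standard deviation
s_spread = statistics.stdev(spread)        # much larger one

assert statistics.mean(clustered) == statistics.mean(spread) == 10
```

Same center, very different spread: only the standard deviation tells the two samples apart.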
Vice versa. Sample variance is the approximate average of the squared deviations from the sample mean. First compute the deviations of the data values from the sample mean, then square the results, then find the average using n − 1 (to make the sample variance an unbiased estimator of the population variance): s² = Σ(x − x̄)²/(n − 1). E.g., the variance for 3, 8, 6, 14, 0, 11 (n = 6 data values): (3+8+6+14+0+11)/6 = 42/6 = 7, so the mean is 7. Next compute all of the deviations: the mean is 7, so subtract 7 from each number to find its deviation. Then square the deviations.
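The worked example above (data 3, 8, 6, 14, 0, 11) can be reproduced step by step in plain Python:

```python
data = [3, 8, 6, 14, 0, 11]
n = len(data)

mean = sum(data) / n                      # 42 / 6 = 7
deviations = [x - mean for x in data]     # subtract the mean from each value
squared = [d ** 2 for d in deviations]    # 16, 1, 1, 49, 49, 16

variance = sum(squared) / (n - 1)         # 132 / 5 = 26.4
std_dev = variance ** 0.5                 # √26.4 ≈ 5.14

print(mean, sum(squared), variance, round(std_dev, 2))
```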
Deviation (x − x̄) and squared deviation (x − x̄)²: 3 − 7 = −4, squared = 16; 8 − 7 = 1, squared = 1; 6 − 7 = −1, squared = 1; 14 − 7 = 7, squared = 49; 0 − 7 = −7, squared = 49; 11 − 7 = 4, squared = 16. This is why we square the deviations: unsquared, they would have summed to 0; squared, they yield positive numbers. The sum of the squared deviations is 132. Next we plug into the formula: n − 1 = 6 − 1 = 5 (from the sample size), so 132/5 = 26.4. This is the VARIANCE. A variance of 26.4 is pretty large compared to the size of the data values, so on a graph you can expect the data to be scattered away from the mean. Next we take the SAMPLE STANDARD DEVIATION. This is the positive square root of the sample variance: s = √s² = √(Σ(x − x̄)²/(n − 1)). So the standard deviation for the previous problem is √26.4 ≈ 5.14. Some things to know: If all of the data are of the same value, the sample standard deviation will simply be 0; there is no variability in this case. The variance can be influenced by outliers (values far from the rest of the data); double-check whether such data are errors. Standard deviation is preferred over variance for interpretation because it is in the same units as the data. If you are using a sample: s. If you are using a population: σ. Measures of Variability (AKA spread): Range: the difference between the highest and lowest numbers. Deviation: the distance of a measurement from the mean. Variance: the sum of the squared deviations of n measurements from their mean, divided by (n − 1). Formula: s² = Σ(x − x̄)²/(n − 1) = [nΣx² − (Σx)²]/[n(n − 1)]. Standard error of the mean: σx̄ = σ/√n.
Empirical Rule (the Normal Curve): The empirical rule is a generalization that only works with symmetrical, bell-shaped distributions. It relates the mean to one standard deviation, two standard deviations, and three standard deviations. The practical significance of the standard deviation is that for a bell-curve distribution the Empirical Rule must apply: One sigma (standard deviation): the interval one σ above and below the mean must contain 68% of the measurements. Two sigma: the interval two σ above and below the mean must contain 95% of the measurements. Three sigma: the interval three σ above and below the mean must contain 99.7% of the measurements. Draw the empirical curve with numbers filled in here. (piece of paper in index of 'text') Coefficient of variation (CV) allows us to compare the variation between two or more variables. It is usually expressed as a percentage. The sample coefficient standardizes the variation by dividing it by the sample mean. No units are expressed because the sample standard deviation and the sample mean are in the same units, so they cancel out. The coefficient of variation allows comparison of variability across spatial samples: test which sample has the greatest variability, e.g. for household income, which of the neighborhoods has the greatest variation, is there large variation or uniformity. Standard deviation and variance are absolute measures, so they are influenced by the size of the values in the dataset. To allow a comparison of variation across two or more geographic samples, we can use a relative measure of dispersion called the Coefficient of Variation, expressed as CV = s/x̄. Since absolute comparisons are not always possible, we use relative
measures. Pearson's Coefficient of Skewness (NOT Pearson's r): skewness = 3(x̄ − median)/s. Strength of the bivariate relationship: do the variables move at the same rate, does an increase in x increase or decrease y, etc.; relationships can be perfect, strong, or weak. Correlation coefficients are a more rigorous approach to observing and measuring the strength and direction of a bivariate relationship. Most are constructed to have a maximum value of 1.0 and can be positive or negative. The most common measure is "Pearson's Product Moment Correlation" or "Pearson's r", used with interval/ratio scale data. Pearson's r and covariance: at the basis of r is covariance, which measures the degree to which 2 variables vary together. It begins with deviations around the means of both variables: (x − x̄)(y − ȳ). The sign tells us if the relationship is positive or negative; it does not tell us if the correlation is strong or weak. The higher the value of r (closer to 1), the stronger the relationship: 0-.3 weak, .3-.6 moderate, .7-.9 strong, 1 perfect (weak positive, weak negative, etc.). Numerical Measures of Position: these measure the relative position of a data value in the data set. The most commonly used measures of location are the z-score (or standard score) and percentiles. The z-score is obtained for sample data by subtracting the mean from the value and dividing the result by the standard deviation: (x − x̄)/s; for population data: (x − μ)/σ. The z-score is the number of standard deviations the data value falls above (+ z-score) or below (− z-score) the mean. The point of the z-score is to determine how far away a data value is from the mean; it gives an idea of the position of the data relative to the mean. Bivariate Data: aptly titled, its use is to compare two variables. Scatter plots display data and show the strength of the relationship between the two variables. We can measure correlation coefficients and regressions, e.g. high temperature of the day and number of soft drinks sold. You need to determine x (independent) and y (dependent). From the plotted data you can see patterns and make lines of best fit. Does the pattern slope upward or downward? Are the data tightly clustered or widely spread? Any noticeable deviations or outliers? If all of the data fall on a single line it is 'perfect'.
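Pearson's r as described above (covariance of the deviations, scaled so it lands between -1 and 1) can be sketched from scratch; the small temperature/sales data set is invented for illustration:

```python
import math

def pearson_r(xs, ys):
    # Covariance of deviations around each mean, divided by the
    # product of the root sums of squares: always between -1 and 1.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: daily high temperature vs. soft drinks sold
temps = [20, 24, 27, 31, 35]
sales = [110, 135, 150, 180, 210]
r = pearson_r(temps, sales)   # strongly positive, close to 1
```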
If the plot shows a strong line it is 'very strongly associated' (add 'negative' or 'positive' after 'strongly'). No association would be all data being completely random; if there is no linear pattern, say so: non-linear association. The sample correlation coefficient measures the strength and direction of a linear relationship between two variables using sample data. Regression: think "line of best fit". Coefficient of Determination: measures the proportion of the variability in the dependent variable that is explained by the regression model through the independent variable. It is R squared (if the outcome is close to 1 the model explains most of the variation; close to zero it explains little). Categorical Data: used with qualitative data; a 2-way contingency table (bivariate frequency table), e.g. comparing gender, race, religion, age. How the peaks look is called kurtosis: how peaked the distribution is. Meso: normal. Platy: flat. Lepto: very peaked. There are 3 types of probability distributions. Binomial Distribution: only 2 outcomes possible (rain or no rain). Normal Probability Distribution: bell curve, most common, not skewed, the basis for sampling; the variable can assume any value in the interval. A normal distribution can have different sizes of spread and kurtosis and still be correct; it just needs to be symmetrical. **The Empirical Rule falls in here: 68%, 95%, and 99.7%; don't forget to make halves (for the tails, there are halves). It allows us to compare the probability of what's actually occurring at one/two/three deviations from the mean. You should convert values into z-scores, because each normally distributed random variable has its own mean and standard deviation, which is impractical to work with directly. Another word for z-score is standardized score; this gives everything a standardized score instead of mutually exclusive, collectively exhaustive independent scores. Z-scores indicate how many standard deviations separate a particular value from the mean. Z-scores can be + or − depending on whether the value is above or below the mean. If x is the value we want, we express the probability as P(0 < z < x) if it is greater than zero, or P(−x < z < 0) if it is less than zero. Here, draw in the bell curve with 0 as the mean and the number we are looking for to either side, and shade in the area; the shaded area is the area we want to know about. Formula: z = (x − μ)/σ. Poisson Probability Distribution: all or nothing, only one outcome matters; it will happen or it won't; used when occurrences are rare.
Sampling Distributions: for the sample mean, sample proportion, and differences of sample proportions and sample means from 2 independent populations. The aim is to infer the population (N) from the sample (n). We cannot use the whole population: too expensive and time restrictive. It must be done without bias, so the sample must be randomly chosen; increase the sample size n for a better sample. Sample proportion: p = (# of people you are interested in) / (# of people in the sample); its standard deviation is √(p(1 − p)/n). Central Limit Theorem: even if a population isn't normal, the mean of a large sample (30+) will be approximately the same as the population mean. The probability distribution for a statistic is called its sampling distribution, and has its own σ and μ. The standard deviation of a sampling distribution is called the standard error of the mean. Every statistic has a standard error, which is a measure of the statistic's random variability. 1 σ above and below the mean: 68%; 2 σ above and below the mean: 95%; 3 σ above and below the mean: 99.7%. Values must be standardized by z-score.
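The Central Limit Theorem claim above can be sketched with a small simulation of a decidedly non-normal population (an exponential distribution, chosen arbitrarily for this illustration): the means of size-30 samples cluster around the population mean, with spread close to the standard error σ/√n.

```python
import random
import statistics

random.seed(42)

# A skewed, non-normal population: exponential with mean 10
population_mean = 10.0
n = 30          # sample size
trials = 2000   # number of samples drawn

sample_means = [
    statistics.mean(random.expovariate(1 / population_mean) for _ in range(n))
    for _ in range(trials)
]

# The sample means cluster around the population mean...
grand_mean = statistics.mean(sample_means)
# ...with spread close to the standard error sigma/sqrt(n) = 10/sqrt(30)
spread = statistics.stdev(sample_means)
```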
z = (x − μ)/σ. Hypothesis testing and estimation, 2 types of estimate: Point: a single value used to estimate a population parameter (an exact, actual value). Confidence Interval: casting a net to look at boundaries (a range): "we can say with a level of confidence that the number is somewhere between these boundaries." On graphs, the higher the peak, the smaller the variance, so use a large sample. Point estimates: a single number used to estimate a population parameter; give or take, __ with a margin of error of ___. 95% of point estimates will lie within 1.96 standard errors of the mean. If an estimate is unbiased, 95% of the time the difference between the point estimate and the true parameter value will be less than 1.96 standard errors. This is called the 95% margin of error.
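A sketch of the 95% interval just described, with the sample summary numbers invented for illustration:

```python
# Hypothetical sample summary (numbers invented for illustration)
sample_mean = 52.0
s = 8.0      # sample standard deviation
n = 100      # sample size

std_error = s / n ** 0.5      # 8 / 10 = 0.8
margin = 1.96 * std_error     # 95% margin of error
low, high = sample_mean - margin, sample_mean + margin

print(f"95% CI: ({low:.3f}, {high:.3f})")
```

Reading: we can say with 95% confidence that the interval (low, high) contains the true population mean.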
Probability level 90 = critical z-score of 1.65; 95 (19/20) = critical z-score of 1.96; 99 = critical z-score of 2.58. The margin of error is the critical z-score times s/√n (the standard error). "We can say with a measurable confidence that the interval contains the true population parameter." When a point estimate is used to estimate the parameter of interest it is unlikely that the value of the point estimate will equal the value of the parameter. Therefore, we use the value of the point estimate to help construct an interval estimate of the parameter. We will be able to state, with some confidence, that the parameter lies within the interval, and because of this we refer to these intervals as confidence intervals. Typically, we consider 90, 95, and 99% confidence interval estimates for parameters. Steps: Form a hypothesis, which states the expected outcome. Derive the null hypothesis (that the research will have no effect on the outcome). Test the null; if it is rejected, evidence for the research hypothesis is mounted (never proved, just supported). Ho: null hypothesis. Ha: alternative (research) hypothesis. Hypothesis testing involves the use of distributions of known area, like the normal distribution, to estimate the probability of obtaining a certain value as a result of chance. We usually require that this probability be low, so that the test result was not a mere coincidence but occurred because the researcher's theory is correct. The z-score is one type of test statistic used to determine the probability of obtaining a given value. In a two-sample z test, a test of Equality of Variances must be included (or I get no marks). Equality of Variances: a two-sample means test requires that we decide whether or not to pool the sample variances; the F-test helps here. The F-test is a variance ratio test. Select a low significance level (0.05 or 0.01) in order to avoid a Type I error (false positive: rejecting Ho when it in fact should not be rejected). In order to test hypotheses, you must decide in advance what number to use as a cutoff for whether the null hypothesis will be rejected (rejection region: supports the alternate hypothesis; can be directional or non-directional) or not (acceptance region: supports the null hypothesis). This is the "critical/tabled value". It represents the level of probability that you will use to test the hypothesis. If the computed test statistic has a smaller probability than the critical value, the null is rejected. Directional/one-tailed tests go one way, because everything you need to know is in one tail; non-directional tests use both tails. How to decide which to use: one-tailed is used when there is good reason to expect that the difference will be in a particular direction; two-tailed is more conservative, requiring a more extreme test statistic to reject the null. You can still be wrong 1, 5, or 10% of the time. Type I error (alpha): rejecting the null when it is right. Type II error (beta): failing to reject a false null. As alpha increases, beta decreases, and vice versa. Statistical Significance is the
likelihood of obtaining a given result by chance, represented by p (for probability), between 0 and 1. Alpha table. The smaller the p, the better the chance that the conclusion is right. Don't forget to divide the alpha by 2 for each tail in a two-tailed test; alpha is the chance of getting it wrong. The sampling distribution is the combination of the Normal Distribution and the Central Limit Theorem. (draw in little graph here in index: tails) Rejection region: Ha: x̄ < (to the left) or x̄ > (to the right). Acceptance region: Ho: x̄ = (any time it is equal). Regardless of the test used, always calculate the test statistic (the words test, stat, and value are used interchangeably). Compare that to the critical value found with reference to our level of significance. Use tests of significance: chi-square, t, F, z. t = (x̄ − μ)/(s/√n). Non-Parametric Testing (not regarding parameters): a statistical test used when assumptions about
normal distributions in the population cannot be met, i.e. when the level of measurement is ordinal or nominal. Interval/ratio data: looking for statistics and parameters. Ordinal/nominal data: non-parametric. One of the most common assumptions is a normal population distribution with population mean μ and variance σ². But what happens when using nominal/ordinal or categorical data, or when the normality of the population distribution is unknown? Use non-parametric tests. The most common non-parametric test is the chi-square (X²) test: goodness-of-fit (one sample) and relationships between cross-tabulated variables. It uses observed frequency counts of a variable or variables, looking for a significant difference between actual frequencies and expected frequencies. The null hypothesis assumes no significant difference between actual and expected frequency counts. The chi-square (X²) distribution is, by default, one-tailed. It carries a number of important assumptions: data measured at nominal or ordinal scales; there must be at least 2 mutually exclusive categories (at least 2x2); when only using 2 categories, expected frequencies in each must not be less than 5 observations; when using more than 2 categories, no one category should have an expected frequency of less than one, and not more than one category in 5 should have an expected frequency of less than 5. (Say, small firm, medium firm, large firm by big spending, medium spending, small spending = 9 cells.) You may need to amalgamate categories to satisfy these assumptions. Confidence coefficients. Steps of Hypothesis Testing: Determine the hypotheses. Determine the test to use. Set the level of significance. Locate the critical value. Compute the test statistic (which varies depending on what test is being used). A test statistic is computed from the sample data and used in decision making about Ho/Ha; it is compared against the critical values (±). Decide if it's one-tailed or two-tailed testing.
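The steps above can be walked through in code for a one-sample z test (all numbers invented; two-tailed at α = 0.05):

```python
# Step 1: hypotheses. Ho: mu = 100, Ha: mu != 100 (two-tailed)
mu0 = 100.0

# Steps 2-3: z test, significance level alpha = 0.05
alpha = 0.05

# Step 4: critical value for a two-tailed test (alpha/2 in each tail)
critical = 1.96

# Step 5: compute the test statistic from (invented) sample data
sample_mean = 104.0
sigma = 15.0
n = 64
z = (sample_mean - mu0) / (sigma / n ** 0.5)   # (104-100)/(15/8) ≈ 2.13

# Decision: reject Ho if the statistic falls in either rejection region
reject_null = abs(z) > critical
```

Here z ≈ 2.13 exceeds 1.96, so the null would be rejected at the 0.05 level.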
A one-tailed test points out that the null hypothesis should be rejected when the test statistic value is in the critical region on one side of the parameter value being tested, i.e. when the critical region is ONLY above or below the mean. A two-tailed test points out that the null hypothesis should be rejected when the test statistic value is in either of the two critical regions. One Sample Mean Tests: used when the objective is to examine one sample, testing whether there is a significant difference between the sample and the population. Statistical inference allows the researcher to conclude whether the sample is drawn from a different source. Hypothesis, z or t test statistic, decision rule. Hypothesis Testing: p-value. The decision to reject or fail to reject Ho is made by comparing the test statistic to a critical value of z or t. Different significance levels may lead to different conclusions; to avoid ambiguity, some researchers prefer to report the variable level of significance (AKA p-value). Hypothesis Testing with 2 samples: when comparing 2 samples, if the sample means are significantly different it suggests they come from different populations (and vice versa). We use a Two Sample Difference of Means Test: decide whether the samples are dependent or independent (related or unrelated), e.g. when increasing TTC fares, sample n before and after the increase. To arrive at a sampling distribution of the difference of the means, we need to know something about the population variance, but it is unlikely that we do, so we use one of two approaches (both of which use the t-test): t = (x̄1 − x̄2)/σ(x̄1 − x̄2), i.e. the difference of the means divided by the standard deviation (standard error) of the difference of the means. T Distribution Testing: similar to the z test (bell-shaped, symmetrical, mean = median = mode = 0); different from z in that its variance is greater than 1 and its shape depends on sample size/degrees of freedom. Degrees of freedom: the number of values that are free to vary after a statistic is computed from a set of data values. (1-10 * x = 30)
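A sketch of the two-sample difference-of-means test described above, pooling the sample variances (the before/after data are invented; the pooled form assumes an F-test found the variances roughly equal):

```python
import statistics

# Invented before/after samples (e.g., ridership counts)
before = [12, 15, 14, 10, 13, 14, 11, 15]
after = [10, 12, 11, 9, 12, 10, 11, 13]

n1, n2 = len(before), len(after)
x1, x2 = statistics.mean(before), statistics.mean(after)
s1_sq, s2_sq = statistics.variance(before), statistics.variance(after)

# Pooled variance, weighting by degrees of freedom (n1-1) + (n2-1)
df = n1 + n2 - 2
pooled = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / df

# Standard error of the difference of the means, then the t statistic
se_diff = (pooled * (1 / n1 + 1 / n2)) ** 0.5
t = (x1 - x2) / se_diff
```

The resulting t is compared to the critical value of the t distribution with df = n1 + n2 − 2 degrees of freedom at the chosen significance level.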
