Study Guide for Statistics

Key Terms:
1. Statistics is the art and science of using mathematics to make data-driven decisions, but it is not mathematics.
2. A variable is any characteristic observed in a study.
   a. Categorical (qualitative) data consist of names or labels that are not numbers representing counts or measurements.
   b. Quantitative (numerical) data consist of numbers representing counts or measurements.
3. A stem plot represents quantitative data by separating each value into two parts:
   a. Stem: the leftmost digit
   b. Leaf: the rightmost digit
4. Frequency is the count of the number of values that fall in a class.
5. Relative frequency is another term for the proportion or percentage of the values that fall in a class.
6. A histogram is a representation of the data using a bar graph in which the horizontal scale represents classes of data values and the vertical scale represents frequencies.
7. A dot plot is a graph in which each data value is plotted as a point along a scale of values.
8. A bar graph is the counterpart of a histogram for categorical data.
9. A Pareto chart is a bar graph with the added stipulation that the bars are arranged in descending order. It is used to draw attention to the more important categories.
10. The mean of a set of data, denoted by $\bar{x}$, is the measure of center found by adding the data values and dividing by the total number of data values.
11. The median is the middle value when the values are arranged in increasing order.
12. An outlier is an observation that lies unusually far from the rest of the data.
13. The mode is the value that occurs most often.
14. The range is the difference between the largest and smallest observations.
15. The standard deviation measures the variation of values about the mean.
16. The pth percentile is a value such that p percent of the observations fall at or below it.
17. A box plot is a graph of the data set that displays the 5-number summary.
18. A standard score (z-score) is the number of standard deviations that a given value is away from the mean.

19. The correlation coefficient, r, measures the strength and direction of the linear association between the explanatory variable and the response variable.
20. A confounding variable is an explanatory variable that is associated with the response variable and with other explanatory variables.
21. Simpson's paradox is a case where the direction of an association reverses when a third variable is introduced.
22. An experiment is a study where the researcher assigns subjects to different treatments (the explanatory variable) and then observes the outcome (the response variable).
23. An observational study is a study where the researcher only observes values of the explanatory variables and response variable for the sampled subjects.
24. Sampling:
   a. Random sample: members of the population are selected in such a way that each individual member has an equal chance of being selected.
   b. Simple random sample of size n: every group of size n has an equal chance of being selected.
   c. Systematic sampling: we select a starting point and then select every kth element in the population.
   d. Stratified sampling: we subdivide the population into at least two different subgroups (strata) whose members share the same characteristics, then we draw a sample from each subgroup.
   e. Cluster sampling: we first divide the population into subgroups, and then we randomly select some of the subgroups.
   f. Convenience sampling: we use results that are easy to get.

   g. Self-selecting or voluntary sampling: we allow the subjects to decide whether to respond.
25. Sampling bias occurs when the sampling method is poorly designed.
   a. Nonresponse bias occurs when some sampled subjects don't respond.
   b. Response bias occurs when sampled subjects don't respond accurately.
   c. The placebo effect occurs when a subject who is given an inactive treatment, such as a sugar pill, gets better anyway because they believe the treatment is effective.
26. The control group is the group that is given the placebo treatment.
   a. Blind experiment: the patients don't know which group they are in.
   b. Double-blind study: neither the patients nor the people who administer the medicine know whether it is real.
27. The Hawthorne effect occurs when subjects change their behavior because they know they are being observed.

28. Cross-sectional study: data are observed and measured at one point in time.
   a. Retrospective (case-control) study: data are collected from the past by going back in time.
   b. Prospective (longitudinal/cohort) study: data are collected in the future from groups sharing common factors.
29. Counting rule: for a sequence of two events in which the first can occur in m ways and the second in n ways, the two events together can occur in a total of m·n ways.
30. A phenomenon is any observable occurrence.
31. An event consists of any collection of outcomes of a phenomenon.
32. A simple event (outcome) is an event that cannot be further broken down into simpler components.
33. The sample space for a phenomenon consists of all possible simple events.
34. The complement of event A, denoted by $\bar{A}$, consists of all outcomes in which event A does not occur.
35. Disjoint means separate or not overlapping.
36. The conditional probability of B given A, denoted P(B|A), is the probability that B occurs given that A has already occurred.
37. A random variable is a variable that has a single numerical value, determined by chance, for each outcome of a phenomenon.
38. A probability distribution is a graph, table, or formula P(x) that gives the probability for each value of the random variable.
39. A normal distribution is characterized by a symmetric, bell-shaped curve with two parameters: the mean and the standard deviation.
40. A standard normal distribution is a normal probability distribution that has a mean of 0 and a standard deviation of 1.
41. A sampling distribution shows the distribution of a particular statistic over all possible samples.
42. A data distribution is the distribution of one sample.

Formulas:

Ch. 2.3-2.4 Centers & Variation of Data
1. Mean: $\bar{x} = \frac{\sum x}{n}$
2. Standard deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
3. Variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
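
As a quick check on the three formulas for this chapter, here is a minimal Python sketch. It assumes the sample versions (dividing by n - 1), and the data set is made up purely for illustration.

```python
import math
import statistics

# Made-up sample, for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)

# Mean: add the data values and divide by how many there are.
mean = sum(data) / n

# Sample variance: squared deviations from the mean, divided by n - 1.
variance = sum((x - mean) ** 2 for x in data) / (n - 1)

# Sample standard deviation: square root of the sample variance.
std_dev = math.sqrt(variance)

print(f"mean = {mean}")
print(f"variance = {variance:.2f}")
print(f"standard deviation = {std_dev:.2f}")

# Cross-check against the standard library's sample statistics.
assert math.isclose(variance, statistics.variance(data))
assert math.isclose(std_dev, statistics.stdev(data))
```

Note that `statistics.pvariance` and `statistics.pstdev` divide by n instead, which is the population version.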

Ch. 2.5-2.6 Measures of Position & Misleading Statistics

1. Interquartile range: $\mathrm{IQR} = Q_3 - Q_1$
2. z-score: $z = \frac{x - \bar{x}}{s}$
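
The two measures of position can be sketched in Python as follows. The helper names `iqr` and `z_score` and the scores are hypothetical, and the quartiles come from the default method of `statistics.quantiles`, which is only one of several conventions in use.

```python
import statistics

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _q2, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

def z_score(x, values):
    """Number of sample standard deviations that x lies from the mean."""
    return (x - statistics.mean(values)) / statistics.stdev(values)

# Made-up exam scores, for illustration only.
scores = [55, 60, 65, 70, 75, 80, 85]

print("IQR:", iqr(scores))
print("z-score of 85:", round(z_score(85, scores), 2))
print("z-score of the mean:", z_score(70, scores))
```

A value equal to the mean always has a z-score of 0; positive z-scores lie above the mean and negative ones below it.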

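Ch. 3.3-3.4 Predicting the Outcome of a Variable
1. Regression line: $\hat{y} = a + bx$

Ch. 5.1-5.2 Probability
1. Theoretical probability (equally likely outcomes): $P(A) = \frac{\text{number of outcomes in } A}{\text{number of outcomes in the sample space}}$
2. Relative frequency approximation of probability: $P(A) \approx \frac{\text{number of times } A \text{ occurred}}{\text{number of trials}}$
3. Addition rule: $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$
4. Complement rule: $P(\bar{A}) = 1 - P(A)$, where $\bar{A}$ is the complement of A
5. Multiplication rule (independent events): $P(A \text{ and } B) = P(A) \cdot P(B)$

Ch. 5.3 Conditional Probability
1. Multiplication rule: $P(A \text{ and } B) = P(A) \cdot P(B \mid A)$
2. Conditional probability: $P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}$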
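
The probability rules listed above can be checked by brute-force counting. The sketch below uses a made-up example (two fair dice, with events A and B chosen only for illustration) and exact fractions, so each rule can be verified without rounding.

```python
from fractions import Fraction
from itertools import product

# Sample space: the 36 equally likely outcomes of rolling two fair dice.
sample_space = list(product(range(1, 7), repeat=2))

def prob(event):
    """Theoretical probability: outcomes in the event / outcomes in the sample space."""
    favorable = sum(1 for outcome in sample_space if event(outcome))
    return Fraction(favorable, len(sample_space))

def event_a(outcome):
    return outcome[0] == 6       # A: the first die shows 6

def event_b(outcome):
    return sum(outcome) >= 10    # B: the total is at least 10

p_a = prob(event_a)
p_b = prob(event_b)
p_a_and_b = prob(lambda o: event_a(o) and event_b(o))
p_a_or_b = prob(lambda o: event_a(o) or event_b(o))
p_not_a = prob(lambda o: not event_a(o))

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
assert p_a_or_b == p_a + p_b - p_a_and_b

# Complement rule: P(not A) = 1 - P(A)
assert p_not_a == 1 - p_a

# Conditional probability, then the general multiplication rule.
p_b_given_a = p_a_and_b / p_a
assert p_a_and_b == p_a * p_b_given_a

print("P(A) =", p_a, " P(B) =", p_b)
print("P(A and B) =", p_a_and_b, " P(A or B) =", p_a_or_b)
print("P(B | A) =", p_b_given_a)
```

Here P(B | A) differs from P(B), so A and B are not independent and the shortcut P(A) · P(B) would give the wrong value for P(A and B).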

Counting: Permutations and Combinations
1. Ordered combinations (permutations): $_nP_r = \frac{n!}{(n - r)!}$
2. Unordered combinations: $_nC_r = \frac{n!}{r!\,(n - r)!}$
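
A short Python sketch of both counts, assuming Python 3.8 or later for `math.perm` and `math.comb`. The values n = 5 and r = 3 are arbitrary illustration values.

```python
import math

n, r = 5, 3  # hypothetical: choose 3 items out of 5

# Order matters: number of ordered selections (permutations).
ordered = math.perm(n, r)

# Order does not matter: number of unordered selections (combinations).
unordered = math.comb(n, r)

# Both agree with the factorial formulas.
assert ordered == math.factorial(n) // math.factorial(n - r)
assert unordered == math.factorial(n) // (math.factorial(r) * math.factorial(n - r))

# Each unordered selection can be arranged in r! different orders.
assert ordered == unordered * math.factorial(r)

print("ordered:", ordered)
print("unordered:", unordered)
```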

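Random Variables and Probability Distributions
1. Mean: $\mu = \sum x \, P(x)$
2. Standard deviation: $\sigma = \sqrt{\sum (x - \mu)^2 \, P(x)}$
3. Variance: $\sigma^2 = \sum (x - \mu)^2 \, P(x)$

Binomial Distribution (n trials, probability of success p on each trial)
1. Mean: $\mu = np$
2. Standard deviation: $\sigma = \sqrt{np(1 - p)}$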
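
To see how the general random-variable formulas and the binomial shortcuts relate, the sketch below builds a small binomial distribution and computes its mean and standard deviation both ways. The example (3 flips of a fair coin) is made up, and the binomial probability formula used to build the table is the standard one, not something stated in this guide.

```python
import math

n, p = 3, 0.5  # hypothetical: 3 flips of a fair coin, x = number of heads

# Probability distribution P(x) for x = 0, 1, ..., n (standard binomial formula).
dist = {x: math.comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)}

# General formulas for any discrete random variable.
mean = sum(x * px for x, px in dist.items())
variance = sum((x - mean) ** 2 * px for x, px in dist.items())
std_dev = math.sqrt(variance)

# The probabilities must add up to 1.
assert math.isclose(sum(dist.values()), 1.0)

# The binomial shortcuts give the same answers.
assert math.isclose(mean, n * p)
assert math.isclose(std_dev, math.sqrt(n * p * (1 - p)))

print("P(x):", dist)
print("mean:", mean)
print("standard deviation:", round(std_dev, 3))
```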

Sampling Distribution (of the sample mean $\bar{x}$, for samples of size n)
1. Mean: $\mu_{\bar{x}} = \mu$
2. Standard deviation: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$

Key notes:

Ch. 2.3-2.4 Centers & Variation of Data
The mean is the balance point of the data. For skewed data, the mean is pulled in the direction of the skew. The mean can be highly influenced by an outlier. The median, on the other hand, is resistant to the effect of outliers. For this reason, the median is a more helpful measure of the middle when there are extreme outliers.

In calculating the mean, we use the rounding rule of thumb: round so that you have one more decimal place than is present in the original set of values. The mean and median do not make much sense for categorical data. A comparison of the mean, median, and mode can reveal information about the symmetry of the distribution.

The larger s is, the larger the variability of the data set. s is very sensitive to outliers: it can increase dramatically with the inclusion of a few outliers, so we say that s is not resistant. We use the following round-off rule for s: use one more decimal place than is present in the original data set.

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$, so the standard deviation is, roughly, built by adding up the distances of the observations from the mean. Larger distances mean more variability. A value is said to be statistically usual if it is within 2 standard deviations of the mean.
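
The effect of a single outlier on the mean, median, and standard deviation can be seen in a few lines of Python. The numbers are made up for illustration.

```python
import statistics

# Made-up data: the same values with and without one extreme observation.
values = [1, 2, 3, 4, 5]
with_outlier = values + [100]

for label, data in (("without outlier", values), ("with outlier", with_outlier)):
    print(
        f"{label}: "
        f"mean = {statistics.mean(data):.1f}, "
        f"median = {statistics.median(data):.1f}, "
        f"s = {statistics.stdev(data):.1f}"
    )
```

The median barely moves when the outlier is added, while the mean and s both change sharply, which is what "resistant" and "not resistant" mean in practice.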

Ch. 2.5-2.6 Measures of Position & Misleading Statistics
$Q_2$, the 50th percentile, is more commonly called the median. $Q_1$ is the 25th percentile, and $Q_3$ is the 75th percentile. Different programs may produce different results when finding percentiles. For large data sets this won't be an issue. Just be aware that there are differing practices for determining percentiles.

The IQR can also be used as a measure of spread (variation), along with the range and the standard deviation. The range is heavily influenced by outliers, the standard deviation less so, and the IQR is resistant. The most common way to identify an outlier is as any observation that is more than 1.5 × IQR below $Q_1$ or above $Q_3$. The z-score can also be used to identify potential outliers. An observation in a bell-shaped distribution is regarded as a potential outlier if it falls more than 3 standard deviations from the mean. The z-score is useful for comparing relative positions across different data sets.
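
The 1.5 × IQR rule can be sketched as a small Python function. The function name `iqr_outliers` and the data are hypothetical, and the quartiles use the default method of `statistics.quantiles`, so other software may place the cutoffs slightly differently.

```python
import statistics

def iqr_outliers(values):
    """Return the observations more than 1.5 * IQR below Q1 or above Q3."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_cutoff = q1 - 1.5 * iqr
    upper_cutoff = q3 + 1.5 * iqr
    return [x for x in values if x < lower_cutoff or x > upper_cutoff]

# Made-up data with one suspiciously large value.
data = [10, 12, 13, 14, 15, 16, 18, 40]

print(iqr_outliers(data))
```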

Ch. 3.3-3.4 Predicting the Outcome of a Variable
The slope of the regression line, $b = r\left(\frac{s_y}{s_x}\right)$, represents the marginal change in the second variable y that occurs when the first variable x changes by one unit. $\hat{y}$ is the predicted value. The residual $y - \hat{y}$ represents the difference between the observation and the prediction (the error of the prediction). The intercept $a = \bar{y} - b\bar{x}$ is the predicted value of y when x = 0. Sometimes this has a physical interpretation, but typically in context it does not make sense. The correlation coefficient r is a measure of the association between the explanatory variable and the response variable. In addition, r-squared, $r^2$, gives a measure of the quality of the model.

r and b are closely related and always have the same sign. The regression line always goes through the point $(\bar{x}, \bar{y})$. The regression line has some positive residuals and some negative residuals. The residuals will always add up to 0. An influential outlier has a large effect on the regression model. A lurking variable affects both the explanatory and response variables and therefore may set up an association between the two. Correlation (association) does not imply causation.
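
The regression facts in this section can be checked numerically. The sketch below uses made-up (x, y) pairs, computes r from the standard sample formula, and then derives the slope and intercept from r, the standard deviations, and the means.

```python
import statistics

# Made-up paired data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)

# Correlation coefficient (standard sample formula).
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

# Slope and intercept of the regression line y_hat = a + b * x.
b = r * s_y / s_x
a = y_bar - b * x_bar

predicted = [a + b * xi for xi in x]
residuals = [yi - y_hat for yi, y_hat in zip(y, predicted)]

print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
print(f"regression line: y_hat = {a:.2f} + {b:.2f} x")
print(f"sum of residuals = {sum(residuals):.10f}")
print(f"prediction at x_bar = {a + b * x_bar:.2f}, y_bar = {y_bar}")
```

In this example r and b are both positive, the residuals cancel out (up to floating-point rounding), and the prediction at the mean of x equals the mean of y.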

Experimental and Observational Studies
Sampling is almost always necessary because the population is usually too large to observe in full. The sample size should be large enough to be representative. Randomness is crucial.

Experiments

Randomization balances the groups so that the effect of the lurking variable is evened out. Reasons to randomly assign:
- To eliminate bias that may result if the researchers assign the subjects.
- To balance the groups on variables that you know affect the response.
- To balance the groups on lurking variables that may be unknown.
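
Random assignment itself takes only a few lines. This is a minimal sketch, assuming two equal-sized groups; the function name `randomly_assign`, the subject labels, and the seed are all hypothetical.

```python
import random

def randomly_assign(subjects, seed=None):
    """Shuffle the subjects and split them into treatment and control groups."""
    rng = random.Random(seed)      # a seed makes the assignment reproducible
    shuffled = list(subjects)      # copy, so the original list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical subject labels.
subjects = [f"subject_{i}" for i in range(1, 21)]
treatment, control = randomly_assign(subjects, seed=42)

print("treatment:", treatment)
print("control:  ", control)
```

Because chance alone decides who lands in which group, the researcher cannot steer particular subjects toward the treatment, and both known and unknown variables tend to balance out across the groups.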

Ch. 5.1-5.2 Probability
$0 \le P(A) \le 1$: the probability of an impossible event is 0; the probability of an event that is certain is 1.
