
Descriptive Stats – summarizing and characterizing data
vs. Inferential Stats – interpreting & drawing conclusions from data in a way that takes uncertainty into account.
*With inferential stats, we use data from a sample to make valid, probability-based inferences about the population (from which the sample was taken).
Scientific Method: Start w/ a hypothesis from prior observations, which you test empirically via observation & “experimental manipulation” – experimental manipulation exploits the principle of temporal precedence to support inferences about the cause of a particular outcome.
3 types of sci. research: 1) Observational (study associations b/w variables). 2) Experimental (random assignment). 3) Quasi-experimental (non-random – used b/c randomization is not possible, so match the control group to the exp. group on all other variables, e.g. schizophrenia vs. non-schiz.). **Experimental studies support stronger inferences about cause.
1) Observational: Case series: review interesting/common features of a series of cases. Cross-sectional: observe a group at a single pt. in time, e.g. is there an association b/w diabetes & heart disease? Simply estimates prevalence; cost-efficient, easy to implement, BUT no temporal info., and only shows prevalence now (i.e. not in the future). Cohort study (prosp. OR retro): enroll a group that meets particular criteria (e.g. doesn’t have the disease) and follow the outcome. Prospective: follow the group over time and evaluate the outcome; allows assessment of both risk factor and outcome (therefore some causal inference). The exposure group might have common or rare risk factors, while the control group is similar to the exposure group in all other factors. Case-control (retrospective): select on the basis of outcome status and assess exposure and risk factors. Cases have the disease and controls do not; compare risk factors retrospectively. Good for studying rare diseases and diseases w/ long latency; cheap. Problems: misclassification bias, selection bias, recall bias. Also, since the design is retrospective, exposure and outcome have both already happened, so the temporal relationship is difficult to establish, and there might be other unmeasured “confounding” risk factors that affect the outcome.
2) Experimental: randomized controlled (clinical) trial – used to test a hypothesis or evaluate an intervention (*randomize the subjects to comparison groups). Subjs. randomized to one of two or more treatments (one of which might be a control treatment). Concurrence (active and control treatments occurring at the same time) is important. May be single-center or multi-center. When possible, should be double-blinded. Sometimes subjects cannot be blinded because of the type of treatment, but you can often ensure that the people who evaluate the outcome are unaware of the treatment. GOLD STANDARD in the sense that it minimizes bias and confounding, but it is expensive, requires a lot of monitoring, and inclusion criteria can limit generalizability.
Randomized crossover trial: eligible participants are randomly assigned to treatment or control, then a “wash-out” period, then the groups switch.
3) Quasi-experimental: subjs. NOT randomly assigned; match on all possible confounding variables.
Operationalization: must precisely define risk factors and outcomes to make valid inferences.
Case reports and case series are cheap but have no comparison group and no specific question or hypothesis to test (i.e. descriptive).
Probability and Probability Models in Statistics
Probability Sampling:
Simple random: take all members of population N and select n individuals randomly.
Systematic sample: start w/ a sampling frame; determine the interval (N/n); choose the first person at random from the first N/n and then take every N/n-th person thereafter.
Stratified sample: organize the population into mutually exclusive strata; select individuals at random within each stratum.
Non-probability sampling (i.e. non-random): Convenience sample = e.g. grabbing people in your lab and testing a procedure (convenience sampling is not for inference). Quota sample = select a pre-determined # of individuals into the sample from groups of interest, e.g. want 300 females and 200 males (a non-random version of the stratified sample). Statistical procedures are used to adjust for potential bias from the non-random nature.
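A minimal sketch of the three probability-sampling schemes above, using only Python’s standard library; the sampling frame, sample size, and strata are invented for illustration:

```python
import random

population = list(range(1, 101))  # hypothetical sampling frame, N = 100
n = 10

# Simple random sample: every member has an equal chance of selection.
simple = random.sample(population, n)

# Systematic sample: interval k = N/n; random start within the first k,
# then every k-th member thereafter.
k = len(population) // n
start = random.randrange(k)
systematic = population[start::k]

# Stratified sample: sample at random within mutually exclusive strata
# (here, two made-up strata of equal size).
strata = {"A": population[:50], "B": population[50:]}
stratified = [x for group in strata.values() for x in random.sample(group, n // 2)]

print(simple, systematic, stratified, sep="\n")
```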
0 ≤ Probability ≤ 1
*2 events are independent if the probability of one is not influenced by the occurrence or non-occurrence of the other (determine the unconditional and conditional probabilities and compare).
Test Performance Characteristics: **Disease status is ALWAYS the denominator**
Sensitivity (true positive fraction) = P(test + | disease); false negative fraction = 1 − sensitivity
Specificity (true negative fraction) = P(test − | no disease); false positive fraction = 1 − specificity
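As a quick check on the “disease status is the denominator” rule, a sketch that computes sensitivity and specificity from a hypothetical 2x2 table (all counts invented):

```python
# Hypothetical 2x2 table: rows = test result, columns = true disease status.
tp, fp = 90, 15   # test+ with disease, test+ without disease
fn, tn = 10, 85   # test- with disease, test- without disease

sensitivity = tp / (tp + fn)  # P(test+ | disease): denominator = all diseased
specificity = tn / (tn + fp)  # P(test- | no disease): denominator = all non-diseased

print(f"sensitivity = {sensitivity:.2f}, false negative fraction = {1 - sensitivity:.2f}")
print(f"specificity = {specificity:.2f}, false positive fraction = {1 - specificity:.2f}")
```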
Binomial distribution: for discrete outcomes (2 possible outcomes – success or failure) where P(success) is constant for each replication. n = # of replications, p = P(success), x = # of successes of interest, 0 ≤ x ≤ n.
P(x successes) = n! / [x!(n − x)!] · p^x · (1 − p)^(n − x)
e.g. if we know the probability (p) of success in the population, e.g. 80%, we can determine the chance of success in at least 7 patients (x) if meds are given to 10 patients (n):
P(X ≥ 7) = P(7) + P(8) + P(9) + P(10), or P(X ≥ 7) = 1 − P(X ≤ 6)
Normal distribution: theoretical probability distribution model for continuous outcomes. Defined by mean (μ – mu) and SD (σ – sigma).
*No real data (sample OR population) will truly be normally distributed, but many approximate it well.
*Randomness is essential to all probability models. For discrete random variables (e.g. heads vs. tails), we can compute the exact probability that an event will occur in so many trials. In contrast, continuous random variables can take on an infinite # of values (in practice, a very large but finite # of values) – an exact probability is difficult to compute.
**The binomial distribution gets closer & closer to the normal distribution with more trials.
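The P(X ≥ 7) example can be verified directly from the binomial formula; a sketch with n = 10 and p = 0.8 as in the text:

```python
from math import comb

def binom_pmf(x, n, p):
    # P(x successes) = n!/(x!(n-x)!) * p^x * (1-p)^(n-x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.8
tail = sum(binom_pmf(x, n, p) for x in range(7, 11))        # P(7)+P(8)+P(9)+P(10)
complement = 1 - sum(binom_pmf(x, n, p) for x in range(7))  # 1 - P(X <= 6)
print(tail, complement)  # both ~0.879
```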
Standardization = transforming raw values in a normally distributed data set into z scores (SD units). You can also use z scores to compare scores from different distributions, e.g. males and females.
Percentiles formula: X = m + (z)(s), where m = mean, z = z value for the desired percentile, s = standard deviation.
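A small sketch of standardization and the percentile formula X = m + (z)(s); the mean, SD, and raw score are invented:

```python
# Hypothetical normally distributed scores: mean 100, SD 15.
m, s = 100.0, 15.0

# Standardization: raw value -> z score (SD units).
x = 120.0
z = (x - m) / s          # 1.33 SDs above the mean

# Percentiles formula: z for the desired percentile -> raw value.
z_95 = 1.645             # z cutting off the upper 5% of a standard normal
x_95 = m + z_95 * s      # raw score at the 95th percentile = 124.7
print(z, x_95)
```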
________________________________________________________________________________
*Whereas a population parameter is a constant, a sample statistic varies from sample to sample. Thus, a sample statistic is always an estimate: statistic for sample, parameter for population.
n = sample size; N = population size.
Variables: Qualitative/categorical (nominal (unordered) or ordinal (ordered)) vs. Quantitative (continuous). For categorical, if just two categories = dichotomous.
**“On a scale of 1-5” = ordinal b/c the distances b/w values can differ.
*Quantitative/continuous can be interval (scale w/ a constant interval size but no true zero, e.g. temp. in F or C) or ratio (scale w/ a non-arbitrary zero, e.g. temp. in K).
Cumulative frequency: sum of all frequencies up to and including that value (can do w/ ordinal variables).
If categorical or dichotomous, can only show frequencies/relative frequencies.
Summarizing Continuous Data
Measures of central tendency: mean, median, and mode. Sample mean: X̄ = ΣX / n.
Measures of data variability: range, variance, standard deviation.
**The mean is more sensitive to outliers than the mode or median.
Negative skew (to the left): mean is less than median and mode.
Positive skew (to the right): mean is greater than median and mode.
Symmetric: mean = median = mode.
**Measures of central tendency are important but are limited in the information they provide about the sample – thus we also have…
Measures of Variability/Dispersion: Range, Variance, and Standard Deviation
Unlike the range, the variance measures not only the spread of the data but the spread of the data AROUND THE MEAN – it gives a statistical average of the amount of dispersion in a distribution.
*In calculating the variance, divide the sum of squares by the population N or by the sample n − 1.
Standard deviation (SD) = square root of the variance – because it is in the same units as the original measurements, it provides a more meaningful measure of the deviation b/w individual scores and the mean.
Variance: s² = Σ(X − X̄)² / (n − 1)
SD: s = √s²
SD (machine formula): s = √{ [ΣX² − (ΣX)² / n] / (n − 1) }
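A sketch verifying that the definitional and “machine” formulas for the sample variance agree (the data values are invented):

```python
from math import sqrt

data = [4, 8, 6, 5, 3, 7]
n = len(data)
mean = sum(data) / n

# Definitional formula: s^2 = sum((X - Xbar)^2) / (n - 1)
s2_def = sum((x - mean) ** 2 for x in data) / (n - 1)

# Machine formula: s^2 = (sum(X^2) - (sum(X))^2 / n) / (n - 1)
s2_machine = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

print(s2_def, s2_machine, sqrt(s2_def))  # variance (both forms, 3.5) and SD
```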
Quartiles: when determining Q1 and Q3, if n is odd, remove the median and then determine them; if n is even, Q1 = median of the lower half and Q3 = median of the upper half.
*IQR = Q3 − Q1
If no outliers: the sample mean and standard deviation summarize location and variability.
If outliers: the median and IQR summarize location and variability.
Normal curve: 68% of observations fall within the mean +/- one SD; 95% fall within the mean +/- two SDs; almost 100% fall within the mean +/- three SDs.
Outliers: values that exceed Q3 + 1.5·IQR or fall below Q1 − 1.5·IQR, or… outliers fall outside of the sample mean +/- 3s.
Box plot: the top of the box is Q3 and the bottom is Q1, so the box represents the IQR (i.e. the middle 50% of the data). The whiskers extend up to 1.5·IQR above Q3 and below Q1, respectively; values beyond the whiskers = outliers.
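A sketch computing the quartiles, IQR, and the 1.5·IQR fences a box plot is built from; the data are invented, and the quartile convention follows the “split at the median” rule described above:

```python
def median(xs):
    xs = sorted(xs)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def quartiles(xs):
    xs = sorted(xs)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                        # if n is odd, the median is excluded
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return median(lower), median(xs), median(upper)

data = [2, 4, 5, 7, 8, 9, 11, 12, 40]        # 40 is a planted outlier
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lo_fence or x > hi_fence]
print(q1, q2, q3, iqr, outliers)             # whiskers stop at the fences
```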
________________________________________________________________________________
Sampling distribution: the distribution of all possible values that can be assumed by some statistic, computed from samples of the same size randomly drawn from the same population. We are usually interested in the mean and variance/SD of the sampling distribution. E.g. if you take samples of n = 2 from a population of N = 5 with ages 6, 8, 10, 12, and 14 (mean = 10), the sample means cluster around 10, because the more extreme values like 6 and 14 are pulled toward the central value. *The mean of the sampling distribution equals the original population mean. The SD of the sampling distribution is not equal to the variance/SD of the population; rather, it equals the standard error of the mean, i.e. the population SD divided by the square root of the sample size: SE = σ / √n.
*****Different sample sizes give the same mean, but as the sample size increases, the variance gets smaller. At a sample size > 30, we can expect the distribution of sample means to be approximately normal. Thus, it is the size of the sample rather than the number of samples taken that affects variance (i.e. the larger the sample size, the smaller the standard error, SE).
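The N = 5 ages example is small enough to enumerate: a sketch that builds the full sampling distribution of the mean for samples of n = 2 and checks that its mean equals the population mean:

```python
from itertools import combinations
from statistics import mean, pstdev

ages = [6, 8, 10, 12, 14]                   # population, N = 5
sample_means = [mean(c) for c in combinations(ages, 2)]

print(sorted(sample_means))                 # means cluster around 10
print(mean(sample_means), mean(ages))       # both 10: mean of sampling dist. = mu
# SD of the sampling distribution (1.73 here): sigma/sqrt(n) with a
# finite-population correction, since we enumerate samples without replacement.
print(pstdev(sample_means))
```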
Central limit theorem: if we take a single sample of size n and compute one mean, 95% of the time it falls w/i +/- 2 SEs of mu (the population mean).
z = (X̄ − μ) / (σ / √n)   (use if n > 30)
If n < 30, use the t-distribution with n − 1 degrees of freedom (fewer d.o.f. = flatter distribution). If we don’t know sigma, we can estimate the standard error as s / √n.
_________________________________________________________________________________________
Estimation: the population parameter is unknown; use sample statistics to generate estimates of the parameter.
Hypothesis testing: generate a hypothesis about the population parameter; sample statistics are analyzed to either support or reject the hypothesis about the parameter.
Pt. estimate: the best single-number estimate of the parameter (e.g. the sample mean X̄).
Confidence interval estimate for mu: a range of values for the population parameter w/ a level of confidence = point estimate +/- margin of error = pt. estimate +/- z·SE (z for the desired confidence level, SE = standard error of the pt. estimate).
Confidence interval estimate for mu – one sample with a continuous outcome: based on the CLT, the distribution of sample means is approximately normal, so:
If n > 30: X̄ ± z · (σ / √n) (use s / √n when σ is unknown). If n < 30: X̄ ± t · (s / √n). Always use a 2-tailed z or t value.
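A sketch of the one-sample CI for mu; the summary statistics are invented, and for n < 30 the t multiplier comes from a table (here hard-coded rather than looked up):

```python
from math import sqrt

# Hypothetical sample summary: size, mean, SD.
n, xbar, s = 50, 120.0, 18.0

se = s / sqrt(n)
z = 1.96                      # 2-tailed z for 95% confidence (n > 30)
lo, hi = xbar - z * se, xbar + z * se
print(f"95% CI for mu: ({lo:.1f}, {hi:.1f})")

# If n < 30, swap in the 2-tailed t value with n - 1 df from a t table,
# e.g. t = 2.093 for df = 19 at 95% confidence.
```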
CI estimates for mu1 − mu2 – 2 independent samples with continuous outcomes (i.e. interested in the difference between means, mu1 − mu2), e.g. in a randomized controlled trial or cohort study.
For each group, determine n1, X̄1, s1 (or s1²) and n2, X̄2, s2 (or s2²).
If n1 > 30 AND n2 > 30: (X̄1 − X̄2) ± z · Sp · √(1/n1 + 1/n2)
If n1 < 30 OR n2 < 30: (X̄1 − X̄2) ± t · Sp · √(1/n1 + 1/n2), df = (n1 + n2) − 2
Pooled estimate of the common SD: Sp = √{ [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2) }
**Only use Sp if 0.5 < s1²/s2² < 2 (i.e. the sample variances are comparable).
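A sketch of the independent-samples CI using the pooled SD Sp, including the variance-ratio check; all summary numbers are invented:

```python
from math import sqrt

# Hypothetical group summaries (n, mean, SD), e.g. treatment vs. control.
n1, x1, s1 = 40, 132.0, 14.0
n2, x2, s2 = 35, 125.0, 16.0

assert 0.5 < s1**2 / s2**2 < 2, "variances not comparable; don't pool"

sp = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
z = 1.96                                  # both n's > 30, so use z
margin = z * sp * sqrt(1 / n1 + 1 / n2)
lo, hi = (x1 - x2) - margin, (x1 - x2) + margin
print(f"95% CI for mu1 - mu2: ({lo:.1f}, {hi:.1f})")
```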
CI estimates for mud – dependent (matched) samples w/ continuous outcomes: for each subject, measure the outcome under each experimental condition (e.g. in a crossover trial) and work with n, X̄d, and sd, where d = the difference score. **Need to calculate the diffs and the diffs².
If n > 30: X̄d ± z · (sd / √n). If n < 30: X̄d ± t · (sd / √n), df = n − 1.
e.g. X̄d = Σdiff / n = 66 / 10 = 6.6
sd = √{ [Σdiff² − (Σdiff)² / n] / (n − 1) } = √{ [740 − 66²/10] / 9 } = √33.82 = 5.82
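The worked numbers above (Σdiff = 66, Σdiff² = 740, n = 10) can be reproduced from the formulas; the 2-tailed t for 95% confidence with df = 9 is 2.262:

```python
from math import sqrt

# Summary statistics from the worked example in the text.
n, sum_d, sum_d2 = 10, 66, 740

xbar_d = sum_d / n                               # 6.6
sd = sqrt((sum_d2 - sum_d**2 / n) / (n - 1))     # sqrt(33.82) = 5.82
t = 2.262                                        # 2-tailed t, df = n - 1 = 9, 95%
margin = t * sd / sqrt(n)
print(xbar_d, sd, (xbar_d - margin, xbar_d + margin))
```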
Hypothesis Testing: 1) Null: H0: µ = X. Alternatives: 2-tail: HA: µ ≠ X; 1-tail: HA: µ < X or µ > X. To test the hypothesis, select a random sample of size n.
2) Select the appropriate test statistic: e.g. if X̄ ≈ 130, then z ≈ 0 → H0 is probably true; if X̄ > 130, then z > 0 → HA may be true. Determine the acceptable level of significance/critical value (e.g. alpha = .05 → critical value z = 1.65 for a 1-tail test).
3) Set up the decision rule for rejecting the null hypothesis: Reject H0 if z > 1.65 (corresponding to p < .05); do not reject H0 if z ≤ 1.65 (p > .05).
4) Compute the test statistic: Z = (X̄ − μ) / (s / √n) = (135 − 130) / (15 / √108) = 3.46 (where, for the sample, n = 108, X̄ = 135, s = 15).
5) Conclude: e.g. “significant evidence at alpha = 0.05 that the mean SBP has increased.”
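The five-step test can be checked in a few lines; the numbers (n = 108, X̄ = 135, s = 15, H0: µ = 130) are those in the example above:

```python
from math import sqrt

# One-sample z test for H0: mu = 130 vs. HA: mu > 130 (1-tailed).
n, xbar, s, mu0 = 108, 135.0, 15.0, 130.0

z = (xbar - mu0) / (s / sqrt(n))       # = 3.46
critical = 1.65                        # alpha = .05, 1-tailed (as in the text)
print(z, "reject H0" if z > critical else "do not reject H0")
```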