
Advanced Statistical Methods and Multivariate Analysis In Medicine

David Dayya, D.O., M.P.H. Dept. of Family Medicine Saint Barnabas Hospital

Objectives
Overview of advanced statistical methods used in medicine
Kaplan Meier Curves in Survival Analysis
ANOVA, ANCOVA, MANOVA
Ad Hoc Analysis
Simple Linear, Multiple Linear, Nonlinear and Logistic Regression

Kaplan Meier Survival Curve

Prostate Cancer Research Institute

ONE-WAY ANALYSIS OF VARIANCE


Definition ANOVA is a procedure used to test the null hypothesis that the means of three or more populations are equal.

Assumptions of One-Way ANOVA


The following assumptions must hold true to use one-way ANOVA.
1. The populations from which the samples are drawn are (approximately) normally distributed.
2. The populations from which the samples are drawn have the same variance (or standard deviation).
3. The samples drawn from different populations are random and independent.

Calculating the Value of the Test Statistic


Test Statistic F for a One-Way ANOVA Test The value of the test statistic F for an ANOVA test is calculated as

$$F = \frac{\text{Variance between groups}}{\text{Variance within groups}} \quad \text{or} \quad F = \frac{MSB}{MSW}$$

Sum Of Squares

To calculate MSB and MSW, we first compute the between groups sum of squares denoted by SSB and the within groups sum of squares denoted by SSW. The sum of SSB and SSW is called the total sum of squares and it is denoted by SST; that is,

SST = SSB + SSW

Between Groups and Within Groups Sums of Squares


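The between groups sum of squares, denoted by SSB, is calculated as

$$SSB = \left( \frac{T_1^2}{n_1} + \frac{T_2^2}{n_2} + \frac{T_3^2}{n_3} + \cdots \right) - \frac{(\sum x)^2}{n}$$

where T_i is the sum of the values in sample i, n_i is the size of sample i, n is the total number of values in all samples, and Σx is the sum of all values.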

Between Groups and Within Groups Sums of Squares


The within groups sum of squares, denoted by SSW, is calculated as

$$SSW = \sum x^2 - \left( \frac{T_1^2}{n_1} + \frac{T_2^2}{n_2} + \frac{T_3^2}{n_3} + \cdots \right)$$

Calculating the Values of MSB and MSW


MSB and MSW are calculated as

$$MSB = \frac{SSB}{k - 1} \quad \text{and} \quad MSW = \frac{SSW}{n - k}$$

where k − 1 and n − k are, respectively, the df for the numerator and the df for the denominator for the F distribution.

Table 12.3 ANOVA Table

Source of Variation   Degrees of Freedom   Sum of Squares   Mean Square   Value of the Test Statistic
Between               k − 1                SSB              MSB           F = MSB / MSW
Within                n − k                SSW              MSW
Total                 n − 1                SST
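
The sums of squares and the F statistic can be worked through numerically. The following is a minimal Python sketch, assuming three small made-up samples; the data and the function name `one_way_anova` are illustrative only and do not come from the slides. It uses the shortcut formulas based on the group totals T_i and group sizes n_i.

```python
def one_way_anova(groups):
    """Compute the one-way ANOVA quantities for a list of samples."""
    k = len(groups)                                   # number of groups
    n = sum(len(g) for g in groups)                   # total number of values
    sum_x = sum(sum(g) for g in groups)               # sum of all values
    sum_x2 = sum(x * x for g in groups for x in g)    # sum of squared values

    # Sum over groups of T_i^2 / n_i, where T_i is the total of group i.
    totals_term = sum(sum(g) ** 2 / len(g) for g in groups)

    ssb = totals_term - sum_x ** 2 / n    # between groups sum of squares
    ssw = sum_x2 - totals_term            # within groups sum of squares
    msb = ssb / (k - 1)                   # variance between groups
    msw = ssw / (n - k)                   # variance within groups
    return {
        "SSB": ssb,
        "SSW": ssw,
        "SST": ssb + ssw,
        "MSB": msb,
        "MSW": msw,
        "F": msb / msw,
        "df": (k - 1, n - k),
    }


# Hypothetical data: three groups of three observations each.
samples = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
result = one_way_anova(samples)
print(result)
```

For these made-up samples the sketch gives SSB = 26, SSW = 6 and F = 13 with df = (2, 6). The observed F would then be compared with the critical value of F for (k − 1, n − k) degrees of freedom.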

Solution 12-3
H0: μ1 = μ2 = μ3 (the mean scores of the three groups are equal)
H1: Not all three means are equal

Figure 12.4 Critical value of F for df = (3, 18) and α = .05. The critical value of F is 3.16: the region to its left is "Do not reject H0", and the right tail, with area α = .05, is "Reject H0".

Simple Regression
Definition A regression model is a mathematical equation that describes the relationship between two or more variables. A simple regression model includes only two variables: one independent and one dependent. The dependent variable is the one being explained, and the independent variable is the one used to explain the variation in the dependent variable.

Linear Regression
Definition A (simple) regression model that gives a straight-line relationship between two variables is called a linear regression model.

Figure: Relationship between blood pressure (y-axis) and BMI (x-axis). (a) Linear relationship. (b) Nonlinear relationship.

Figure 13.2 Plotting a linear equation. The line y = 50 + 5x passes through the points (x = 0, y = 50) and (x = 10, y = 100).

Figure: y-intercept and slope of a line. The y-intercept is 50, and each change of 1 in x produces a change of 5 in y.

SIMPLE LINEAR REGRESSION ANALYSIS


Scatter Diagram
Least Squares Line
Interpretation of a and b
Assumptions of the Regression Model

SIMPLE LINEAR REGRESSION ANALYSIS cont.

y = A + Bx

y: dependent variable
A: constant term or y-intercept
B: slope
x: independent variable

y = mx + b (slope-intercept form: m is the slope, b is the y-intercept)

SIMPLE LINEAR REGRESSION ANALYSIS cont.


Definition In the regression model y = A + Bx + ε, A is called the y-intercept or constant term, B is the slope, and ε is the random error term. The dependent and independent variables are y and x, respectively.

SIMPLE LINEAR REGRESSION ANALYSIS


Definition In the model ŷ = a + bx, a and b, which are calculated using sample data, are called the estimates of A and B from:

y = A + Bx + ε
ŷ = a + bx

Scatter Diagram
Definition A plot of paired observations is called a scatter diagram.

Figure: Scatter diagram of blood pressure (y-axis) against BMI (x-axis).

Figure: Scatter diagram and straight lines, blood pressure (y-axis) against BMI (x-axis).

Least Squares Line


Figure: Regression line and random errors. Blood pressure (y-axis) against BMI (x-axis), showing the regression line and an error e.

Error Sum of Squares (SSE)


The error sum of squares, denoted SSE, is

$$SSE = \sum e^2 = \sum (y - \hat{y})^2$$

The values of a and b that give the minimum SSE are called the least squares estimates of A and B, and the regression line obtained with these estimates is called the least squares line.

The Least Squares Line


For the least squares regression line ŷ = a + bx,

$$b = \frac{SS_{xy}}{SS_{xx}} \quad \text{and} \quad a = \bar{y} - b\bar{x}$$

The Least Squares Line

where

$$SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} \quad \text{and} \quad SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$$

and SS stands for "sum of squares". The least squares regression line ŷ = a + bx is also called the regression of y on x.
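
The least squares estimates can be computed directly from the shortcut sums. Below is a minimal Python sketch, assuming five made-up (BMI, SBP) pairs; the data and the function name `least_squares_line` are hypothetical and only illustrate the formulas b = SS_xy / SS_xx and a = ȳ − b·x̄.

```python
def least_squares_line(x, y):
    """Return (a, b) for the least squares line y_hat = a + b * x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)

    # Shortcut sums of squares and cross-products.
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    ss_xx = sum(xi * xi for xi in x) - sum_x ** 2 / n

    b = ss_xy / ss_xx                 # slope
    a = sum_y / n - b * sum_x / n     # y-intercept: y_bar - b * x_bar
    return a, b


# Hypothetical data: BMI values and systolic blood pressures (mmHg).
bmi = [20, 25, 30, 35, 40]
sbp = [110, 118, 130, 134, 148]

a, b = least_squares_line(bmi, sbp)
print(f"SBP_hat = {a:.2f} + {b:.2f} * BMI")
```

With these made-up values the fitted line is approximately ŷ = 72.8 + 1.84x. As a sanity check, two points lying exactly on y = 50 + 5x return a = 50 and b = 5.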

Interpretation of a and b
Interpretation of b The value of b in the regression model gives the change in y due to a change of one unit in x. For example, we might state that, on average, a 5% increase in BMI is associated with a 20 mmHg increase in SBP.

Figure: Positive and negative linear relationships between x and y. (a) Positive linear relationship, b > 0. (b) Negative linear relationship, b < 0.

Assumptions of the Regression Model


Assumption 1: The random error term ε has a mean equal to zero for each x.

Assumptions of the Regression Model


Assumption 2: The errors associated with different observations are independent

Assumptions of the Regression Model


Assumption 3: For any given x, the distribution of errors is normal

Assumptions of the Regression Model


Assumption 4: The distribution of population errors for each x has the same (constant) standard deviation, which is denoted σ_ε.

Error Distribution

Figure: Distribution of errors around the population regression line, with BMI on the x-axis. At each x (shown for x = 20 and x = 35) the errors follow a normal distribution with E(ε) = 0 and (constant) standard deviation σ_ε.

Figure: Nonlinear relations between x and y, panels (a) and (b).

Standard Deviation Of Random Errors


Degrees of Freedom for a Simple Linear Regression Model The degrees of freedom for a simple linear regression model are df = n − 2.

Standard Deviation of Random Errors


The standard deviation of errors is calculated as

$$s_e = \sqrt{\frac{SS_{yy} - b\,SS_{xy}}{n - 2}}$$

where

$$SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}$$
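
To see that the shortcut formula for the standard deviation of errors agrees with the direct definition sqrt(SSE / (n − 2)), here is a short Python sketch. The data are made up for illustration and both function names are hypothetical.

```python
import math


def se_shortcut(x, y):
    """s_e from the shortcut formula sqrt((SS_yy - b * SS_xy) / (n - 2))."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    ss_xx = sum(xi * xi for xi in x) - sum_x ** 2 / n
    ss_yy = sum(yi * yi for yi in y) - sum_y ** 2 / n
    b = ss_xy / ss_xx
    return math.sqrt((ss_yy - b * ss_xy) / (n - 2))


def se_from_residuals(x, y):
    """s_e from the residuals e = y - y_hat of the least squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (n - 2))


# Hypothetical data: BMI values and systolic blood pressures (mmHg).
bmi = [20, 25, 30, 35, 40]
sbp = [110, 118, 130, 134, 148]

s_short = se_shortcut(bmi, sbp)
s_direct = se_from_residuals(bmi, sbp)
print(s_short, s_direct)
```

Both routes give the same value (about 2.42 mmHg for these made-up data), since SS_yy − b·SS_xy equals SSE for the least squares line.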

Coefficient Of Determination
Total Sum of Squares (SST) The total sum of squares, denoted by SST, is calculated as

$$SST = \sum y^2 - \frac{(\sum y)^2}{n}$$

Figure: Total errors. Blood pressure (y-axis) against BMI (x-axis), with the horizontal line ȳ = 9.1429.

Coefficient Of Determination
Regression Sum of Squares (SSR) The regression sum of squares, denoted by SSR, is

$$SSR = SST - SSE$$

Coefficient of Determination
Coefficient of Determination The coefficient of determination, denoted by r², represents the proportion of SST that is explained by the use of the regression model. The computational formula for r² is

$$r^2 = \frac{b\,SS_{xy}}{SS_{yy}} \quad \text{and} \quad 0 \le r^2 \le 1$$
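
The decomposition SST = SSR + SSE and the two equivalent ways of obtaining r² can be checked numerically. This is a minimal Python sketch on made-up data; the function name `coefficient_of_determination` and the returned dictionary keys are illustrative assumptions.

```python
def coefficient_of_determination(x, y):
    """Return SST, SSE, SSR and r^2 for a simple linear regression."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    ss_xx = sum(xi * xi for xi in x) - sum_x ** 2 / n
    ss_yy = sum(yi * yi for yi in y) - sum_y ** 2 / n   # this is SST

    b = ss_xy / ss_xx
    a = sum_y / n - b * sum_x / n

    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ssr = ss_yy - sse                                   # SSR = SST - SSE
    return {
        "SST": ss_yy,
        "SSE": sse,
        "SSR": ssr,
        "r2_from_ssr": ssr / ss_yy,              # proportion of SST explained
        "r2_computational": b * ss_xy / ss_yy,   # computational formula
    }


# Hypothetical data: BMI values and systolic blood pressures (mmHg).
bmi = [20, 25, 30, 35, 40]
sbp = [110, 118, 130, 134, 148]

fit = coefficient_of_determination(bmi, sbp)
print(fit)
```

For these made-up values both routes give r² of about 0.98, meaning that roughly 98% of the total variation in the hypothetical blood pressures is explained by the regression on BMI.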

Inferences About B
Sampling Distribution of b
Estimation of B
Hypothesis Testing About B

Sampling Distribution of b
Mean, Standard Deviation, and Sampling Distribution of b The mean and standard deviation of b, denoted by μ_b and σ_b, respectively, are

$$\mu_b = B \quad \text{and} \quad \sigma_b = \frac{\sigma_\varepsilon}{\sqrt{SS_{xx}}}$$

Estimation of B
Confidence Interval for B The (1 − α)100% confidence interval for B is given by

$$b \pm t\,s_b$$

where

$$s_b = \frac{s_e}{\sqrt{SS_{xx}}}$$
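
A confidence interval for B follows mechanically once b, s_e and SS_xx are known. The sketch below is illustrative only: the summary values are made up, the function name `slope_confidence_interval` is hypothetical, and the t value is assumed to be read from a t table for n − 2 degrees of freedom (3.182 is the tabulated two-sided 95% value for df = 3).

```python
import math


def slope_confidence_interval(b, s_e, ss_xx, t_crit):
    """Return (lower, upper) limits of b +/- t * s_b."""
    s_b = s_e / math.sqrt(ss_xx)   # estimated standard deviation of b
    margin = t_crit * s_b
    return b - margin, b + margin


# Hypothetical summary values from a sample of n = 5 (df = n - 2 = 3).
b = 1.84
s_e = math.sqrt(17.6 / 3)
ss_xx = 250.0
t_crit = 3.182   # from a t table: 95% confidence, df = 3

lower, upper = slope_confidence_interval(b, s_e, ss_xx, t_crit)
print(f"95% CI for B: ({lower:.3f}, {upper:.3f})")
```

If the interval excludes zero, as it does for these made-up numbers, the data are consistent with a nonzero slope at the chosen confidence level.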

Figure: Population and sample regression lines. The population regression line is μ_{y|x} = A + Bx; the regression lines ŷ = a + bx are estimated from different samples.

Using the Regression Model for Estimating the Mean Value of y


Confidence Interval for μ_{y|x} The (1 − α)100% confidence interval for μ_{y|x} for x = x0 is

$$\hat{y} \pm t\,s_{\hat{y}_m}$$
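
The interval for the mean value of y can be sketched in Python as follows. The slides do not show how the standard deviation of ŷ is obtained; this sketch assumes the usual textbook expression s_e · sqrt(1/n + (x0 − x̄)² / SS_xx). The data, the function name `mean_response_interval` and the t value (3.182 for df = 3, read from a t table) are likewise assumptions for illustration.

```python
import math


def mean_response_interval(x, y, x0, t_crit):
    """Return (lower, upper) limits for the mean of y at x = x0."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b = ss_xy / ss_xx
    a = y_bar - b * x_bar
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))

    y_hat = a + b * x0
    # Assumed textbook formula for the standard deviation of y_hat
    # when estimating the mean of y at x0.
    s_ym = s_e * math.sqrt(1 / n + (x0 - x_bar) ** 2 / ss_xx)
    return y_hat - t_crit * s_ym, y_hat + t_crit * s_ym


# Hypothetical data: BMI values and systolic blood pressures (mmHg).
bmi = [20, 25, 30, 35, 40]
sbp = [110, 118, 130, 134, 148]

lo30, hi30 = mean_response_interval(bmi, sbp, 30, 3.182)
lo40, hi40 = mean_response_interval(bmi, sbp, 40, 3.182)
print((lo30, hi30), (lo40, hi40))
```

Under this formula the interval is narrowest at x0 = x̄ and widens as x0 moves away from the mean of x, which is why the interval at BMI = 40 is wider than the one at BMI = 30 in this example.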

System of linear equations i.e. Matrix Algebra


Y = a + b1X1 + b2X2 + b3X3 + ...

1. SBP = a + b1(bmi) + b2(age) + b3(#cigs)
2. SBP = a + b1(bmi) + b2(age) + b3(#cigs)
3. SBP = a + b1(bmi) + b2(age) + b3(#cigs)
4. SBP = a + b1(bmi) + b2(age) + b3(#cigs)
5. ...
6. ...

Model Construction
Solve for the constant and the coefficients using differential calculus of matrices.
Use the model to predict blood pressures given the values for a set of variables of interest, i.e. BMI, age, and number of cigarettes smoked.
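
One common way to carry out this step is to set the derivative of the error sum of squares to zero, which yields the normal equations (XᵀX)β = Xᵀy, and then solve that linear system. The Python sketch below assumes this approach; the slides do not specify the method in detail. The patient data are made up (generated exactly from SBP = 60 + 1.5·bmi + 0.4·age + 0.8·cigs so that the fit can be checked), and all function names are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * beta = rhs by Gaussian elimination with pivoting."""
    size = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(size):
        # Partial pivoting: move the largest remaining entry into place.
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    beta = [0.0] * size
    for r in range(size - 1, -1, -1):
        known = sum(aug[r][c] * beta[c] for c in range(r + 1, size))
        beta[r] = (aug[r][size] - known) / aug[r][r]
    return beta


def fit_multiple_regression(rows, y):
    """Return [a, b1, b2, ...] by solving the normal equations."""
    design = [[1.0] + list(row) for row in rows]   # leading 1 for the constant
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(coefficients, row):
    """Predicted value a + b1*x1 + b2*x2 + ... for one subject."""
    return coefficients[0] + sum(b * x for b, x in zip(coefficients[1:], row))


# Hypothetical patients: (bmi, age, cigarettes per day).
patients = [(22, 30, 0), (25, 45, 10), (30, 50, 5),
            (28, 60, 20), (35, 40, 0), (32, 55, 15)]
# Made-up outcomes generated from known coefficients, without noise.
sbp = [60 + 1.5 * bmi + 0.4 * age + 0.8 * cigs for bmi, age, cigs in patients]

coefficients = fit_multiple_regression(patients, sbp)
predicted = predict(coefficients, (27, 50, 10))
print(coefficients, predicted)
```

Because the made-up outcomes contain no random error, the fit recovers the generating coefficients (60, 1.5, 0.4, 0.8) up to rounding. With real data the residuals would be nonzero, and a statistical package would normally be used in place of hand-written elimination.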


Review
Know your data types, i.e. nominal, ordinal, interval, and ratio; parametric and nonparametric; continuous and discrete.
Be able to identify the type of study, i.e. cohort, case-control, cross-sectional, intervention trial, cross-over designs.
Recognize blinding, systematic error, confounders, sources of bias and their significance.
Understand type 1 and type 2 error, alpha, beta, power.
Understand the general principles around sensitivity, specificity, positive and negative predictive values, and the effect of prevalence.
Understand epidemiologic principles and basic definitions, i.e. incidence, prevalence, causation theory, levels of evidence.
Know how to interpret a relative risk ratio or odds ratio, attributable risk, and the types of studies they can be applied to.
Know what the hypothesis and null hypothesis mean.
Understand why we use a kappa statistic and how to interpret it.
Understand the principles behind a Kaplan-Meier curve and how to interpret it in survival analysis.
Understand how to interpret a p-value, an alpha, or a confidence interval of an effect size or a risk ratio.
Understand the general principles behind literature search techniques.
Understand when a multivariate or a bivariate hypothesis test is required and be able to recognize whether or not the author used the correct test.
Recognize and interpret the measures of central tendency and dispersion in the data.
Differentiate between population and sample data.
Understand the concepts of inclusion/exclusion criteria.
Understand the concepts of validity, reliability, accuracy and precision.
Understand and interpret the graphical representation of data.
Understand the limitations and strengths of prospective vs. retrospective studies.

Anatomy of a Research Article


Abstract: A brief summary of the study and its findings.
Background/Introduction: Historic overview information related to the research question and its relevance.
Methods: A detailed discussion of the research design including limitations.
Results: Descriptive statistics, graphical representation of data, and results of hypothesis testing and findings.
Discussion/Conclusion: An interpretation of the findings.

Scott Wetstone University of Connecticut

Scale Used for the Dependent Variable
Nominal Data and Ordinal Data: non-parametric tests. Interval Data: parametric tests.

One Sample Test
Nominal Data: Binomial (binomial equation or Z approximation); Chi-Square Goodness of Fit Test
Ordinal Data: Kolmogorov-Smirnov One-Sample; Runs
Interval Data: Z test

Two Sample Test, Related Samples
Nominal Data: McNemar
Ordinal Data: Sign
Interval Data: Paired t

Two Sample Test, Unrelated Samples
Nominal Data: Contingency Chi Square; Fisher's Exact
Ordinal Data: Wilcoxon Two Sample Rank
Interval Data: Unpaired t

K Sample Test, Related Samples
Nominal Data: Cochran Q
Ordinal Data: Friedman Chi-square test
Interval Data: Randomized Block Analysis of Variance

K Sample Test, Unrelated Samples
Nominal Data: Contingency Chi Square
Ordinal Data: Kruskal-Wallis
Interval Data: Analysis of Variance (ANOVA) (followed by Tukey or SNK)

Evidence Based Medicine Hierarchy

Evidence Pyramid

Hill's Postulates for Causation (1965)


Strength of Association: The larger the relative effect, the more likely the causal role of the factor.
Dose-response: If the risk increases with increasing dose of the risk factor, the more likely the causal role of the factor. (Pancreatic cancer and coffee)
Consistency: If similar associations are found in different studies in different populations, the more likely the causal role of the factor. (Literature review)
Temporality: Risk factor exposure must precede the outcome. (Effect cannot precede cause)
Intervention: Reduction or removal of the risk factor must reduce the risk of the outcome.
Biological Plausibility: A plausible mechanism exists that may explain the risk. (Thimerosal and autism)
Coherence: Associations between the risk factor and the outcome must be consistent with existing knowledge. (Exercise and obesity)

Retrospective vs. Prospective


Retrospective
Cases and controls selected from a medical facility/facilities, community or general population.
Subjects are diseased at onset of the study.
Can select incident cases or prevalent cases.
Incidence of disease cannot be calculated from case-control data.
Common sources of controls: general population, hospital patients, relatives of cases, associates or friends of cases (related sample).
Sampling is random, systematic or paired.
Bias more common.
Rare diseases
Less cost
Less difficult logistics
Ethical considerations

Prospective
Sample is drawn from the population and randomized into exposure or control group.
Subjects are free of disease at onset of study.
Incidence of disease can be obtained from the prospective data.
Common sources for control are a placebo group that is comparable to the exposed subjects.
Randomized pairing is preferred in the selection process.
Bias is less common.
Common diseases
Greater costs
More difficult logistics
Ethical considerations

Interpreting the Medical Literature Outline

Resident Name_________________________
Date_________________________________
Citation:

When reading the assigned article, consider these questions for discussion at the Journal Club meeting.

General Considerations
1. Is the title of the article consistent with the content of the article?
2. What were the author(s)' conclusions and how strongly were they worded?
3. Did the research question warrant doing a study on this topic (i.e. was the study unnecessary, or of clinical/practical significance)?

Systematic Design Considerations
4. What were the dependent (outcome) variables? Were they clearly defined and adequately measured?
5. What were the independent (exposure/intervention/predictor) variables? Were they clearly defined and adequately measured?
6. What was the design of the study? Was there an adequate control group, blinding, randomization? Were confounders balanced or excluded in the design?
7. If the authors' conclusions are correct, to whom can they be generalized based on the sample selected?

Statistical Considerations
8. Were any associations established and, if so, what was the strength of the associations? Was there statistical significance?
9. Was the statistical test used to determine statistical significance appropriate and correctly interpreted?
10. What do you estimate the potential was for type 1 or type 2 error in the study?

11. Were the authors' conclusions justified based on your assessment of the strengths and weaknesses of this study?
12. How could the study design have been improved?
