Univariate analysis involves collecting and analyzing data on one variable.

Bivariate analysis extends this process to two variables when one believes there may be some association between the two variables.

Often, one is searching for a relationship between a risk factor (an independent variable) and a health outcome (a dependent variable).

1 December 2014

Does an association imply a cause-and-effect relationship?

No.

Finding an association is merely the first step toward supporting a cause-and-effect relationship between two variables.


How do we identify associations between variables?

Several statistical methods are used to identify associations.

The appropriate methods depend on the characteristics of the data.


What should be considered when choosing a statistical method?

A good first question is whether the data are quantitative or qualitative.

Quantitative data are numeric, with some sense of rank or order.

Qualitative data are categoric, such as gender (e.g., male or female) or race.


Yes.

Quantitative data can be continuous or discrete.


Discrete data can assume only certain values.

For example, a variable designated as the number of nevi on a person can assume only discrete integer values (e.g., 0, 1, 2, 3, ...).


Does it matter how the data are presented when deciding the best methods of analysis?

Yes.

In bivariate analysis, consider the properties of each variable because each may have different characteristics.

For example, one variable may be blood pressure measurements (continuous) and the other ages in years (discrete).

The selection of the best analysis tool depends on both variables.


In paired data, the value of one variable is assumed to be associated with the value of the other, usually because they represent data on a single entity or object (e.g., an individual).

Whether or not the data are paired also restricts the choice of bivariate statistical methods.


Are there other considerations with respect to our analysis?

Yes.

For continuous data, we must consider the distribution of the data and ask, "Can the assumption of a normal distribution be made?"


Many tests assume normality (i.e., the data follow a normal distribution), and these are called parametric tests.

Examples of parametric tests include Pearson's correlation and linear regression tests.


What if the assumption of normality cannot be made for the data set?

Tests that make no assumption of normality are termed nonparametric.

Examples of nonparametric tests include the Wilcoxon signed rank test, the Mann-Whitney U test, the Kruskal-Wallis test, and the McNemar test.

Deciding whether to use a parametric or a nonparametric test can be difficult. Sometimes tests for normality can help.


Methods available for the analysis of bivariate continuous data:

Scatterplots

Pearson's correlation

Simple linear regression

Spearman's rank correlation (nonparametric)

Wilcoxon signed rank test (nonparametric)

Mann-Whitney U test (nonparametric)

Kruskal-Wallis test (nonparametric)


What is a scatterplot?

A scatterplot is probably the simplest form of bivariate analysis, but this elementary exercise may convey much meaningful information through the power of the visual medium.

A scatterplot can be applied to either continuous or discrete data, or a combination thereof.


A scatterplot is also known as a scattergram, scatter diagram, or joint distribution graph.


To create a scatterplot, simply plot one variable against another on an x-y graph, creating a set of points.

This "scatter" of points can then be examined for any visual relationships.

The power of this method is best illustrated by an example scatterplot.


A scatterplot can be very useful in the initial analysis of bivariate quantitative data.

For example, a quick glance at a scatterplot of the data suggests that there is no relationship between the amount of change in the left pocket of a male and his shoe size.


A scatterplot is not quantitative.

It is useful to have a numeric value indicating the magnitude of the relationship or association between the two variables.


If we want to examine the relationship or association between two continuous variables in a quantitative fashion, we can calculate Pearson's correlation coefficient, also known as the product-moment correlation coefficient or the population correlation coefficient.


How can we tell which is the independent variable and which is the dependent (i.e., outcome) variable from a correlation coefficient?

We cannot.

It is important to note that here we are not making any assumptions about the interdependence of the variables: neither is considered a function of the other.

In fact, each is considered dependent on the other.

We are not so concerned with predicting the value of the variables as we are with determining the association between the two variables.


Pearson's correlation coefficient can be calculated using one

of many available statistics software programs.

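One formula for its calculation is as follows:

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

where (x_i, y_i) are the paired values of the associated variables, and \bar{x} and \bar{y} are their means.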
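As a minimal sketch of the calculation, the coefficient can be computed directly from the paired values as the sum of cross-products of deviations from the means, divided by the square root of the product of the two sums of squared deviations. The function name `pearson_r` and the height and weight values are hypothetical, invented for illustration only; in practice a statistics package would normally be used.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired samples xs and ys."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two paired samples of equal length (n >= 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of cross-products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations for each variable.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable has zero variance")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired data (not taken from the slides).
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [55, 62, 66, 74, 79]
r = pearson_r(heights_cm, weights_kg)
print(round(r, 3))
```

Because the data must be paired, the function rejects samples of unequal length rather than silently truncating them.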


The correlation coefficient r must take on a value from -1 to 1.

A value of r = 1 indicates a perfect positive correlation in which the association between x and y can be described by a straight line with a positive slope.

A value of r = -1 indicates a perfect negative correlation, described by a straight line with a negative slope.


ASSUMPTIONS OF PEARSON'S CORRELATION COEFFICIENT

Both variables are continuous

Both variables are normally distributed

Data are paired.


How do we interpret Pearson's correlation coefficient?

It puts into quantitative terms the association implied by a scatterplot of the two variables.

r = 0: no linear association

r = +1: perfect positive association

r = -1: perfect negative association

Although valuable in the initial analysis, the correlation coefficient does not indicate a cause-and-effect relationship, and a correlation of zero does not rule out a relationship between the two variables.
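The point that a zero correlation does not rule out a relationship can be illustrated with a small sketch. The data below are hypothetical: y is completely determined by x (y is x squared), yet the relationship is symmetric rather than linear, so the correlation coefficient comes out as zero.

```python
import math

# Hypothetical data: y depends entirely on x, but not linearly.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Same deviation-based calculation as for any Pearson coefficient.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
r = sxy / math.sqrt(sxx * syy)

print(r)  # the positive and negative cross-products cancel out
```

A scatterplot of these points would show a clear U-shaped pattern, which is why the plot should be examined alongside the coefficient.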


[Figure: scatterplots illustrating different r values]


[Table: data of a hypothetic biostatistics class]


Consider an example involving paired height and weight data.

For these data, Pearson's correlation coefficient indicates a correlation or association.

Does this imply a cause-and-effect relationship?

No. Pearson's correlation makes no assumptions about cause-and-effect or about dependency.


[Figure: height and weight data for a hypothetic biostatistics class]


You have collected final exam scores and class attendance rates (in percentages) for a group of college students enrolled in an English course. Pearson's correlation for the two variables is 0.23. What is your interpretation?

A correlation of 0.23 suggests only a weak positive relationship.

However, this conclusion is not necessarily true.

You should examine the scatterplot to assess more subtle forms of relationship.


A more sophisticated method for the analysis of two quantitative continuous variables is simple linear regression, with "simple" indicating that only two variables are involved.

There are some similarities between this method and correlation, but in linear regression one variable is considered dependent on the other.


How is simple linear regression used as an analytic tool?

In one sense, we are assessing whether the value of the

independent variable can be used to predict the value of the

dependent variable.

An example of an application might be whether or not

weight (the independent variable) can be used to predict

blood pressure (the dependent variable).

In simple linear regression, a "best fit" straight line is determined for the data in which the dependent variable is plotted on the y-axis versus the independent variable on the x-axis.


[Figure: "best fit" line determined by simple linear regression]


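How is the "best fit" line in simple linear regression determined?

From algebra, we know that a straight line can be described as y = mx + b, where m is the slope and b is the y-intercept.

The "best fit" m and b can be determined by a mathematic approach called the method of least squares. This method yields the following values for m and b:

$$ m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b = \bar{y} - m\bar{x} $$

where \bar{x} is the mean of the independent variable and \bar{y} is the mean of the dependent variable.

Most regression analysis is performed by use of statistics software.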
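For readers who want to see the least-squares calculation spelled out, the following is a minimal sketch. It computes the slope as the sum of cross-products of deviations divided by the sum of squared deviations of x, and the intercept from the two means. The function name `least_squares_line` and the small data set are hypothetical and chosen only for illustration.

```python
def least_squares_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two paired samples of equal length (n >= 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("slope is undefined when all x values are equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx            # slope
    b = mean_y - m * mean_x  # intercept: the line passes through the means
    return m, b


# Hypothetical data (independent variable xs, dependent variable ys).
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
m, b = least_squares_line(xs, ys)
print(round(m, 3), round(b, 3))
```

Note that, unlike correlation, swapping the roles of xs and ys generally gives a different line, which reflects the fact that one variable is treated as dependent on the other.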


Data are continuous and paired

For each x, the associated y values follow a normal

distribution


A parameter of significant interest in linear regression is called the coefficient of determination, or r².

It is a measure of how well the regression equation describes the relationship between the two variables.

It represents the proportion of variability in the dependent variable y (as measured by the difference from the mean) that is accounted for by the regression line.

The value of r² ranges from 0 to 1, where 1 indicates that all of the points fall directly on the regression line.

Hence, the larger r² is, the better the fit of the regression line to the data points.

The formula for r² is as follows:

$$ r^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} $$

where \hat{y}_i = m x_i + b is the value predicted by the regression line and \bar{y} is the mean of the dependent variable.
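As an illustrative sketch, the coefficient of determination can be computed as the regression sum of squares divided by the total sum of squares of the dependent variable. The function name `r_squared` and the data are hypothetical; the slope 1.9 and intercept 0.0 passed in below are the least-squares values for this particular made-up data set, and the ratio only equals r² when the line supplied is the least-squares line.

```python
def r_squared(xs, ys, m, b):
    """Coefficient of determination for the fitted line y = m*x + b.

    Computed as (regression sum of squares) / (total sum of squares).
    This equals r^2 only when (m, b) is the least-squares line for the data.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two paired samples of equal length (n >= 2)")
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    if ss_total == 0:
        raise ValueError("r^2 is undefined when y has no variability")
    # Variability of the predicted values about the mean of y.
    ss_regression = sum((m * x + b - mean_y) ** 2 for x in xs)
    return ss_regression / ss_total


# Hypothetical data; m = 1.9 and b = 0.0 are its least-squares estimates.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
r2 = r_squared(xs, ys, m=1.9, b=0.0)
print(round(r2, 3))
```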


Are there other ways to evaluate the fit of the regression line?

Statistical testing can also be used to evaluate the fit of the regression line to the data points.

This usually takes the form of hypothesis testing; in other words, testing the hypothesis that the slope of the line is zero.

If you can reject the null hypothesis that the slope is zero, then a line represents a good or reasonable fit to the data points.


Consider an example of quantitative, continuous data consisting of average morning systolic blood pressures, measured at the same time daily over a period of five days, and body mass index (BMI) for a random sample of males aged 35-45 years not receiving treatment for hypertension.


Thus, the slope and intercept of the linear regression line are given by 1.837 and 86.071, respectively.

The coefficient of determination, 0.574, is not close to 1, suggesting that the regression line does not account for a large part of the variation in the dependent variable y.

In the ANOVA table for the data, df = degrees of freedom, SS = sum of squares, MS = mean square, and VR = variance ratio.


One of the more common test methods for regression line fit employs the F-test.

The test statistic is the variance ratio obtained from the ANOVA (analysis of variance) methodology.

Most regression analysis today is done by computer, and statistics software usually displays an ANOVA table when linear regression is applied.

This table shows the associated p-value for the calculated variance ratio.

If this p-value is less than the chosen level of statistical significance (usually designated as the α level and often set at α = 0.05), then typically the null hypothesis can be rejected.
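To make the variance ratio concrete, here is a minimal sketch of how the ANOVA quantities for simple linear regression could be computed by hand. It assumes the usual degrees of freedom for this case (1 for regression and n - 2 for the residual); the function name and data are hypothetical. The p-value itself comes from the F distribution with those degrees of freedom and is left to a table or statistics software.

```python
def regression_variance_ratio(xs, ys):
    """Return (ss_regression, ss_residual, variance_ratio) for y = m*x + b."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need paired samples of equal length (n >= 3)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are equal; no line can be fitted")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x

    # Sums of squares (SS): explained by the line vs. left over.
    ss_regression = sum((m * x + b - mean_y) ** 2 for x in xs)
    ss_residual = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

    # Mean squares (MS) = SS / df, with df = 1 and df = n - 2.
    ms_regression = ss_regression / 1
    ms_residual = ss_residual / (n - 2)
    if ms_residual == 0:
        raise ValueError("perfect fit; the variance ratio is undefined")
    return ss_regression, ss_residual, ms_regression / ms_residual


# Hypothetical data, for illustration only.
ss_reg, ss_res, vr = regression_variance_ratio([1, 2, 3, 4], [2, 4, 5, 8])
print(round(ss_reg, 2), round(ss_res, 2), round(vr, 2))
```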


ANOVA is a statistical test of significance designed to determine whether a significant difference exists among multiple sample means.

The F statistic is the ratio of the variance between the group means to the variance within the groups themselves.

The larger the F statistic, the more significant the result. SS is used to calculate the ANOVA F statistic.


For these data, the p-value is less than 0.05 (or 0.01, for that matter); thus, the null hypothesis that the slope m is zero can be rejected at these levels of significance.

Interestingly, the p-value is very low, yet the coefficient of determination was not very high (0.574), as you will recall.

This is not necessarily a contradiction.

The rejection of the null hypothesis only means that if you are going to fit a straight line to the data, the slope of the line would not be zero; it does not necessarily mean that a straight line is the best fit to the data.

The relatively low r² suggests otherwise.


[Figure: blood pressure versus BMI data]


What are the advantages of nonparametric methods over the parametric tests discussed (i.e., Pearson's correlation coefficient and linear regression)?

Nonparametric methods require fewer assumptions,

especially with respect to normality (i.e., data following a

normal distribution).

In addition, nonparametric methods tend to be simple and

results can be obtained relatively quickly.


When are nonparametric tests typically used?

Typical situations in which nonparametric methods are used include grossly nonnormal data and the analysis of a small sample.


Spearman's rank correlation is a nonparametric analog to

Pearson's correlation coefficient.

Like Pearson's correlation coefficient, the method attempts

to determine if there is an association between two

mutually dependent variables, with no assumption that one

variable is dependent on another.

The variables can be continuous or discrete, and they are

paired.


ASSUMPTIONS OF SPEARMAN'S RANK CORRELATION

No assumptions of normality are made for either variable

Ordinal scale data

Data are paired.


How is an association tested with Spearman's rank correlation?

The test is a form of hypothesis testing in which the null hypothesis (H0) is that there is no relationship between the two variables, and the alternative hypothesis (HA) is that there is a relationship.

A rank correlation coefficient, r_s, is calculated and used to obtain a test statistic.

The test statistic is then compared with a table of critical values to determine the level of statistical significance.


How is Spearman's rank correlation coefficient calculated?

The values within each variable are first ranked from 1 to n by magnitude.

The difference (d_i) is calculated for each pair of observations (x_i, y_i) by subtracting the rank of y_i from the rank of x_i.

Then Spearman's rank correlation coefficient is calculated according to the following formula:

$$ r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} $$

For n ≤ 30, this becomes the test statistic.
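The ranking and differencing steps can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: it assumes the commonly used formula of 1 minus 6 times the sum of squared rank differences, divided by n(n² - 1), and it assigns tied values the average of the ranks they would otherwise occupy. The function names and example values are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n by magnitude; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rs(xs, ys):
    """Spearman's rank correlation coefficient for paired samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two paired samples of equal length (n >= 2)")
    rank_x = average_ranks(xs)
    rank_y = average_ranks(ys)
    # d_i is the rank of x_i minus the rank of y_i.
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical paired data with one tie in the second variable.
rs = spearman_rs([150, 160, 165, 172, 180], [50, 58, 58, 70, 66])
print(round(rs, 3))
```

The resulting coefficient would then be compared with a table of critical values for the given n, as described above.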


For the test, the null hypothesis is that the height and weight are mutually independent.

Our alternative hypothesis is that weight increases with height (i.e., a direct correlation).

In the data table, each of the values is ranked from lowest to highest within its respective group, beginning with 1. d_i represents the difference between the ranks of x_i and y_i (specifically, x_i - y_i).

In this example, the method for dealing with ties is also illustrated: one assigns a calculated rank, representing the average of the tied ranks.

Since n ≤ 30 for these data, the test statistic is given by the formula for r_s.

A table of Spearman test statistic critical values gives 0.3994 for n = 18.

Given the above alternative hypothesis, we are interested in a one-sided test, wherein we will reject the null hypothesis if r_s > 0.3994.

Because this is true, we reject the null hypothesis and accept the alternative at p < .05.


Can bivariate analysis be applied to categoric data?

Yes.

An example is the absence or presence of a cough in cases of group A beta-hemolytic streptococcal pharyngitis versus nongroup A beta-hemolytic streptococcal pharyngitis.

The above example would be depicted in the following 2 × 2 contingency table:

                        Cough     No cough     Total
Group A                 A         B            A + B
Nongroup A              C         D            C + D
Total                   A + C     B + D        A + B + C + D


A through D represent the counts of cases, with the pair of characteristics determined by the intersection of the rows and columns. For instance, A subjects had both a cough and group A streptococcal pharyngitis, whereas C subjects had both a cough and nongroup A streptococcal pharyngitis. The total number of subjects with a cough is A + C, and the total number of subjects in the study is A + B + C + D.


The chi-square statistic is a nonparametric test of bivariate

count or frequency data.

This method of modeling examines the disparity between

the actual count frequencies and the expected count

frequencies when the null hypothesis of no difference

between the groups is true.

We can apply the chi-square test to determine whether an

association exists between group A beta-hemolytic

streptococcal pharyngitis and coughing


A chi-square test is a test of statistical significance that does not specifically test a population parameter (like the mean).

It is used in two situations:

Goodness of fit: this hypothesis test addresses the issue of whether the distribution of a given data set in categories "fits" a hypothesized distribution of the data (e.g., a uniform distribution).

Independence: this hypothesis test addresses the issue of whether the categories in a two-way contingency table are related.


In this method, a chi-square test statistic is calculated by the following formula:

$$ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{O_{i\cdot} \, O_{\cdot j}}{n} $$

where

r = number of rows,

c = number of columns,

O_{ij} = the observed number for the intersection of the ith row and the jth column,

E_{ij} = the expected number for the intersection of the ith row and the jth column,

O_{i.} = sum of terms for the ith row (in the table above, O_{1.} = sum of the first row of entries = A + B),

O_{.j} = sum of the jth column entries (in the table above, O_{.2} = sum of entries in the second column = B + D), and

n = the total number of cases (above, n = A + B + C + D).
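The summation over cells can be sketched as follows. This is a minimal illustration that assumes the usual form of the statistic, in which each cell contributes (observed - expected)² / expected and each expected count is the row total times the column total divided by n. The function name and the counts are hypothetical and do not come from the pharyngitis example.

```python
def chi_square_statistic(table):
    """Chi-square statistic for an r x c table of observed counts.

    `table` is a list of rows, each row a list of counts of equal length.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the null hypothesis of no association.
            expected = row_totals[i] * col_totals[j] / n
            if expected == 0:
                raise ValueError("a row or column total is zero")
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Hypothetical 2 x 2 counts laid out as [[A, B], [C, D]].
observed = [[10, 20], [30, 40]]
chi2 = chi_square_statistic(observed)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(chi2, 4), degrees_of_freedom)
```

The same function handles larger tables, since nothing in the loop depends on there being only two rows or columns.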


How is the chi-square statistic used?

Once the chi-square statistic is calculated, it is compared with a standard chi-square table for the appropriate number of degrees of freedom, (r - 1)(c - 1) for an r × c table, to determine the probability of obtaining a statistic at least this large by chance if the null hypothesis is true.

For any given number of degrees of freedom, the higher the chi-square statistic, the more unlikely the frequency disparities are due to chance.


Variables are categoric.

Data represent counts and can be represented in a table of r rows and c columns, an r × c contingency table.

Cochran's criteria should be met to apply the basic chi-square test.


Cochran's criteria should be fulfilled if the chi-square test statistic is to be used as noted above. These criteria are:

All expected values in each cell have a frequency count ≥ 1.

At least 80% of the expected values should be ≥ 5.

The number of terms or cells for summation will be given by r × c in general.

For the above table, therefore, there would be 2 × 2, or 4, terms for summation for the chi-square statistic.
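Cochran's criteria can be checked mechanically before the statistic is computed. The sketch below assumes the criteria as they are usually stated (every expected count at least 1, and at least 80% of expected counts at least 5); the function name and the example tables are hypothetical.

```python
def meets_cochran_criteria(table):
    """Check Cochran's criteria on the expected counts of an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count for every cell under the null hypothesis.
    expected = [
        row_total * col_total / n
        for row_total in row_totals
        for col_total in col_totals
    ]
    all_at_least_one = all(e >= 1 for e in expected)
    share_at_least_five = sum(e >= 5 for e in expected) / len(expected)
    return all_at_least_one and share_at_least_five >= 0.8


# Hypothetical tables: the second has a very sparse first row.
print(meets_cochran_criteria([[10, 20], [30, 40]]))
print(meets_cochran_criteria([[1, 2], [3, 40]]))
```

Note that the check is applied to the expected counts, not the observed ones, so a table can contain a small observed count and still satisfy the criteria.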


Is there a limit to the number of rows or columns for which the chi-square statistic can be calculated?

There is no theoretic limit to the number of rows or columns (a variable could easily have more than two categories), and the computation can become tedious.

Statistics software is highly recommended.


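Is there a simpler formula for the chi-square statistic in the 2 × 2 contingency table?

Thankfully, yes. For the above 2 × 2 table, the formula simplifies to

$$ \chi^2 = \frac{n (AD - BC)^2}{(A + B)(C + D)(A + C)(B + D)} $$

where n = A + B + C + D.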
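As a small worked sketch, the 2 × 2 shortcut can be written as a single expression: n times (AD - BC) squared, divided by the product of the two row totals and the two column totals. The function name and the counts are hypothetical; the cell labels follow the A through D layout of the contingency table.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2 x 2 table with cells A, B (row 1), C, D (row 2)."""
    n = a + b + c + d
    # Product of both row totals and both column totals.
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denominator


# Hypothetical counts, for illustration only.
chi2_2x2 = chi_square_2x2(10, 20, 30, 40)
print(round(chi2_2x2, 4))
```

For these counts the expected values are 12, 18, 28, and 42, and summing (observed - expected)² / expected over the four cells gives the same value as the shortcut.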


Logistic regression is a statistical method analogous to linear regression for the analysis of categorical data.

In logistic regression, the dependent variable is categorical (in the simplest case, dichotomous, or having two outcomes like absence or presence of disease).

The independent variables can be categorical, or else continuous or discrete.

Like linear regression, logistic regression is a powerful tool

that has been widely used in medical research.

Independent variables can be assessed for their contribution

to the outcome of the dependent variable (e.g., the

presence of disease).


Can logistic regression be used when more than one independent variable may potentially affect the outcome?

Yes.

In the simplest case, logistic regression involves the modeling of a dichotomous dependent variable on a single independent variable that is categoric or quantitative.

In most applications, multiple independent variables are assessed in the regression model.


How can logistic regression be used to quantify risk?

Logistic regression provides the ability to calculate odds ratios of an outcome of the dependent variable (e.g., lung cancer) given specified values of one of the independent variables (e.g., smoking vs. nonsmoking) while controlling for the values of the other independent variables (e.g., alcohol intake, family history, and antioxidant use).
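For reference, the connection between the regression model and the odds ratio can be written out. The notation below is an assumption introduced for illustration (p is the probability of the outcome, x_1 through x_k are the independent variables, and the β terms are the fitted coefficients); it is the standard form of the logistic model rather than something shown on the slides.

```latex
\ln\!\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k,
\qquad
\mathrm{OR}_j = e^{\beta_j}
```

Under this model, e^{β_j} is the odds ratio associated with a one-unit increase in x_j (for a dichotomous variable such as smoking vs. nonsmoking, the odds ratio comparing the two groups), with the other independent variables held constant.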


