

What is bivariate analysis?

Univariate analysis involves collecting and analyzing data on one variable.
Bivariate analysis extends this process to two variables.

Why is bivariate analysis performed?

It is performed when one believes there may be some association
between the two variables.
Often, one is searching for a relationship between a
risk factor (an independent variable) and a
health outcome (a dependent variable).

1 December 2014

Does a found association indicate a causal relationship?

Finding an association is merely the first step toward
supporting a cause-and-effect relationship between two variables.


What methods are used to determine associations between variables?

Several statistical methods are used to identify associations.
The choice of method depends on the characteristics of the data.


What is a good first question one might ask when choosing a statistical method?

A good first question is whether the data are
quantitative or qualitative.
Quantitative data are numeric, with some sense
of rank or order.
Qualitative data are categoric, such as gender
(e.g., male or female) or race.


Can quantitative data be further grouped?

Yes. Quantitative data can be grouped into continuous versus discrete data.


What is discrete about discrete data?

Discrete data can assume only certain values.
For example, a variable designated as the number of nevi on
a person can assume only discrete integer values (e.g., 0, 1,
2, 3, ...).


Does it matter how the data are presented when deciding the
best methods of analysis?
Yes. Consider the properties of each variable, because each may
have different characteristics.
For example, one variable may be blood pressure measurements (continuous)
and the other ages in years (discrete).
The selection of the best analysis tool depends on
both variables.


What are paired data?

In paired data, the value of one variable is assumed to be
associated with the value of the other, usually because they
represent data on a single entity or object.
Whether or not the data are paired also restricts the choice
of bivariate statistical methods.



When dealing with continuous data, are there any considerations with respect to our analysis?

For continuous data, we must consider the distribution of
the data and ask,
"Can the assumption of a normal distribution be made?"



What are parametric tests?

Many tests assume normality (i.e., the data follow a normal
or binomial distribution), and these are called parametric tests.
Examples of parametric tests include
Pearson's correlation and
linear regression tests.



What if the assumption of normality cannot be made for the data set?

Tests that make NO assumption of normality are termed nonparametric tests.
Examples of nonparametric tests include the
Wilcoxon signed rank test,
the Mann-Whitney U test, the
Kruskal-Wallis test, and the
McNemar test.
Deciding whether to use a parametric or a
nonparametric test can be difficult. Sometimes tests for
normality can help.



List methods commonly used to analyze quantitative continuous data.

Pearson's correlation
Simple linear regression
Spearman's rank correlation (nonparametric)
Wilcoxon signed rank test (nonparametric)
Mann-Whitney U test (nonparametric)
Kruskal-Wallis test (nonparametric)



What is a scatterplot?
A scatterplot is probably the simplest form of bivariate
analysis, but this elementary exercise may convey much
meaningful information through the power of the visual.
A scatterplot can be applied to either continuous or discrete
data, or a combination thereof.



What are some other terms for scatterplot?

Other terms include scatter diagram and
joint distribution graph.



How is a scatterplot created?

To create a scatterplot, simply plot one variable against
another on an x-y graph, creating a set of points.
This "scatter" of points can then be examined for any visual pattern.
The power of this method is best illustrated by example.



How can a scatterplot be useful?

A scatterplot can be very useful in the initial analysis of
bivariate quantitative data.
For example,
a quick glance below suggests that there is no
relationship between the amount of change in the left
pocket of a male and his shoe size.



What is a major weakness of the scatterplot?

A scatterplot is not quantitative.
It is useful to have a numeric value indicating the magnitude of the
relationship or association between the two variables.



What is the Pearson's correlation coefficient?

If we want to examine the relationship or association
between two continuous variables in a quantitative fashion,
we can calculate Pearson's correlation coefficient,
also called the product-moment correlation coefficient.
Calculated from a sample, it is an estimate of the population correlation coefficient.



How can we tell which is the independent (i.e., predictor)
variable and which is the dependent (i.e., outcome) variable
from a correlation coefficient?

We cannot.
It is important to note that here we are not making any
assumptions about the interdependence of the variables:
neither is considered a function of the other.
In fact, each is considered dependent on the other.
We are not so concerned with predicting the value of the
variables as we are with determining the association
between the two variables



Pearson's correlation coefficient

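Pearson's correlation coefficient can be calculated using one
of many available statistics software programs.
One formula for its calculation is as follows:

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

where r is Pearson's correlation coefficient, x_i and y_i are
the paired values of the associated variables, and \bar{x} and \bar{y} are their means.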
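
A minimal Python sketch of the product-moment calculation, r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²), may help make the definition concrete. The function name `pearson_r` and the height and weight values are hypothetical and chosen only for illustration; they are not the data from the slides.

```python
import math


def pearson_r(xs, ys):
    """Pearson's product-moment correlation for paired samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two paired samples of equal length (n >= 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired data (illustrative only).
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 88]
r = pearson_r(heights_cm, weights_kg)
print(f"r = {r:.3f}")
```

Calling `pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])` gives zero even though y is an exact function of x (y = x²), which shows that a zero correlation does not rule out a nonlinear relationship.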



What is the range of values for r?

The correlation coefficient r must take on a value from -1 to 1.
A value of r = 1 suggests a perfect positive correlation in
which the association between x and y can be described by a
straight line with a positive slope.
A value of r = -1 indicates a perfect negative correlation.




Assumptions of Pearson's correlation

Both variables are continuous.
Both variables are normally distributed.
Data are paired.



What is the purpose of Pearson's correlation coefficient?

It puts into quantitative terms the association implied by a
scatterplot of the two variables.
A value of
r = 0 indicates no linear association,
r = +1 indicates a perfect positive association, and
r = -1 indicates a perfect negative association.
It is valuable in the initial analysis, but it
does not indicate a cause-and-effect relationship, and a
correlation of zero does not rule out a relationship
between the two variables.



Several examples of scatterplots yielding different r values







Consider the hypothetical example below,
involving paired height and weight data.
The value of r calculated for these data indicates a high (i.e., close to 1) positive
correlation or association.
Does this imply a cause-and-effect relationship?
No. Pearson's correlation makes no assumptions about
cause-and-effect or about dependency.



Scatterplot of weight versus height data for a hypothetical biostatistics class



You have collected final exam scores and class attendance rates
(in percentages) for a group of college students enrolled in an
English course. Pearson's correlation for the two variables is
0.23. What is your interpretation?

This is a low positive correlation, suggesting little
association between the two variables.
However, this conclusion is not necessarily true.
You should examine the scatterplot to assess more
subtle forms of relationship.



What is linear regression?

A more sophisticated method for the analysis of two
quantitative continuous variables is simple linear
regression, "simple" indicating that only two variables are involved.
There are some similarities between this method and
correlation, but in linear regression one variable is
considered dependent on the other.



Can linear regression be used as a prediction tool?

In one sense, we are assessing whether the value of the
independent variable can be used to predict the value of the
dependent variable.
An example of an application might be whether or not
weight (the independent variable) can be used to predict
blood pressure (the dependent variable).
In simple linear regression, a "best fit" straight line is
determined for the data in which the dependent variable is
plotted on the y-axis versus the independent variable on the x-axis.



A "best fit" straight line for bivariate data, determined by simple linear regression



How is the function of the line in simple linear regression determined?

From algebra, we know that a straight line can be described
as y = mx + b, where m is the slope and b is the y-intercept.
The "best fit" m and b can be determined by a mathematical
approach called the method of least squares. This method
yields the following values for m and b:

$$ m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b = \bar{y} - m\bar{x} $$

where \bar{x} is the mean of the independent variable and
\bar{y} is the mean of the dependent variable.
Most regression analysis is performed by use of statistics software.
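
As a rough illustration of the least-squares result (slope m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², intercept b = ȳ − m·x̄), the following Python sketch fits a line to a small data set. The function name `least_squares_line` and the BMI and blood pressure values are hypothetical, not taken from the slides.

```python
def least_squares_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("slope is undefined when x does not vary")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx              # slope
    b = mean_y - m * mean_x    # intercept: the line passes through the means
    return m, b


# Hypothetical paired data (illustrative only).
bmi = [20, 22, 25, 28, 30]
systolic = [118, 121, 130, 135, 141]
m, b = least_squares_line(bmi, systolic)
print(f"y = {m:.3f}x + {b:.3f}")
```

The fitted line always passes through the point (x̄, ȳ), which is why the intercept follows directly from the slope and the two means.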



Assumptions of simple linear regression

Data are continuous and paired.
For each x, the associated y values follow a normal distribution.



What is the coefficient of determination?

A parameter of significant interest in linear regression is
called the coefficient of determination, or r².
It is a measure of how well the regression equation
describes the relationship between the two variables.
It represents the proportion of variability in the dependent
variable y (as measured by the difference from the mean)
that is accounted for by the regression line.
The value of r² ranges from 0 to 1, where 1 indicates that all of
the points fall directly on the regression line.
Hence, the larger r² is, the better the fit of the regression
line to the data points.
The formula for r² is as follows:

$$ r^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} $$

where \hat{y}_i = m x_i + b is the value predicted by the regression line for x_i.
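
The coefficient of determination can be sketched in Python as the ratio of the variation explained by the fitted line to the total variation of y about its mean. The function name `r_squared` and the sample values are hypothetical and serve only as an illustration.

```python
def r_squared(xs, ys):
    """Coefficient of determination for a simple least-squares line."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or ss_total == 0:
        raise ValueError("r^2 is undefined when x or y does not vary")
    # Least-squares slope and intercept.
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    b = mean_y - m * mean_x
    # Variation of the predicted values about the mean of y.
    ss_regression = sum((m * x + b - mean_y) ** 2 for x in xs)
    return ss_regression / ss_total


# Hypothetical data: a noisy upward trend (illustrative only).
print(r_squared([1, 2, 3, 4], [1, 3, 2, 4]))
```

For simple linear regression this quantity equals the square of Pearson's r for the same paired data.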


Can statistical testing be used to evaluate the fit of the regression line?

Statistical testing can also be used to evaluate the fit of the
regression line to the data points.
This usually takes the form of hypothesis testing; in other
words, testing the null hypothesis that the slope of the line is zero.
If you can reject the null hypothesis that the slope is zero,
then a line represents a good or reasonable fit to the data.



Let us apply simple regression to the following hypothetical paired,
quantitative, continuous data, consisting of average morning systolic
blood pressures, measured at the same time daily over a period of five
days, and body mass index (BMI) for a random sample of males aged
35-45 years not receiving treatment for hypertension.



Thus, the slope and intercept of the linear regression line are given by 1.837 and 86.071, respectively.
The coefficient of determination, 0.574, is not close to 1, suggesting that the regression line
does not account for a large part of the variation in the dependent variable y.
The ANOVA table for the data is given below,
where df = degrees of freedom,
SS = sum of squares,
MS = mean square, and
VR = variance ratio.



What is the F-test?

One of the more common test methods for regression line fit employs
the F-test.
The test statistic is the variance ratio obtained from the
ANOVA (analysis of variance) methodology.
Most regression analysis today is done by computer, and
statistics software usually displays an ANOVA table when
linear regression is applied.
This table shows the associated p-value for the calculated
variance ratio.
If this p-value is less than the chosen level of statistical
significance (usually designated as the α level and often set
at α = 0.05), then typically the null hypothesis can be rejected.
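
To show where the variance ratio comes from, here is a hedged Python sketch that builds the ANOVA quantities (df, SS, MS, VR) for a simple linear regression, using 1 degree of freedom for the regression and n − 2 for the residual. The function name `regression_anova`, the dictionary keys, and the data are hypothetical. The p-value is deliberately not computed, because it requires the F distribution, which statistics software provides.

```python
def regression_anova(xs, ys):
    """ANOVA quantities for a simple least-squares regression line."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x must vary")
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    b = mean_y - m * mean_x
    # Sum of squares explained by the line, and left over as residual.
    ss_regression = sum((m * x + b - mean_y) ** 2 for x in xs)
    ss_residual = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    df_regression, df_residual = 1, n - 2
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    # Variance ratio (the F statistic); infinite for a perfect fit.
    vr = ms_regression / ms_residual if ms_residual > 0 else float("inf")
    return {
        "df": (df_regression, df_residual),
        "SS": (ss_regression, ss_residual),
        "MS": (ms_regression, ms_residual),
        "VR": vr,
    }


# Hypothetical data (illustrative only).
table = regression_anova([1, 2, 3, 4], [1, 3, 2, 4])
print(table)
```

The variance ratio would then be referred to an F distribution with (1, n − 2) degrees of freedom to obtain the p-value.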


What is ANOVA? When is it used?

ANOVA (analysis of variance) is a statistical test of significance designed to determine
whether a significant difference exists among multiple
sample means.
The F statistic is the ratio of the variance occurring
between the group means to the variance that occurs within
the groups themselves.
The larger the F statistic, the more significant the result. The sums of
squares (SS) are used to calculate the ANOVA F statistic.



The above p-value would be statistically significant at α =
0.05 (or 0.01, for that matter); thus, the null hypothesis that
the slope m is zero can be rejected at these levels of significance.
Interestingly, the p-value is very low, yet the coefficient of
determination was not very high (0.574), as you will recall.
This is not necessarily a contradiction.
The rejection of the null hypothesis only means that if you
are going to fit a straight line to the data, the slope of the
line would not be zero; it does not necessarily mean that a
straight line is the best fit to the data.
The relatively low r² suggests otherwise.



Simple linear regression line for hypothetical systolic blood pressure versus BMI data.



How are the nonparametric tests different from the parametric
tests discussed (i.e., Pearson's correlation coefficient and linear regression)?

Nonparametric methods require fewer assumptions,
especially with respect to normality (i.e., data following a
normal distribution).
In addition, nonparametric methods tend to be simple, and
results can be obtained relatively quickly.



Under what situations are nonparametric tests typically used?

Typical situations in which nonparametric methods are used
include grossly nonnormal data and the analysis of a small sample.



What is Spearman's rank correlation?

Spearman's rank correlation is a nonparametric analog to
Pearson's correlation coefficient.
Like Pearson's correlation coefficient, the method attempts
to determine if there is an association between two
mutually dependent variables, with no assumption that one
variable is dependent on another.
The variables can be continuous or discrete.




Assumptions of Spearman's rank correlation

No assumptions of normality are made for either variable.
Data are at least on an ordinal scale.
Data are paired.



What are the principles underlying Spearman's rank correlation?

The test is a form of hypothesis testing in which the null
hypothesis (H0) is that there is no relationship between the
two variables, and the alternative hypothesis (HA) is that there is
a relationship.
A rank correlation coefficient, rs, is calculated and used to
obtain a test statistic.
The test statistic is then compared with a table of critical
values to determine the level of statistical significance.



How is Spearman's rank correlation coefficient calculated?

The values within each variable are first ranked from 1 to n
by magnitude.
The difference (di) is calculated for each pair of observations
(xi, yi) by subtracting the rank of yi from the rank of xi.
Then Spearman's rank correlation coefficient is calculated
according to the following formula:

$$ r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} $$

For n ≤ 30, this becomes the test statistic.
For n > 30, one must use

$$ z = r_s \sqrt{n - 1} $$

which is compared with the standard normal distribution.
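
The ranking steps can be sketched in Python as follows: rank each variable, give tied values the average of the ranks they occupy, take the rank differences d, and apply rs = 1 − 6Σd² / (n(n² − 1)). The function names (`average_ranks`, `spearman_rs`, `large_sample_z`) are hypothetical. The large-sample statistic shown, z = rs·√(n − 1), is the usual normal approximation and is included here as an assumption about what is used for n > 30. With many ties, the shortcut formula is only approximate.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rs(xs, ys):
    """Spearman's rank correlation via the sum of squared rank differences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


def large_sample_z(rs, n):
    """Normal approximation to the test statistic for larger samples."""
    return rs * math.sqrt(n - 1)


# Hypothetical data (illustrative only).
print(average_ranks([10, 20, 20, 30]))
print(spearman_rs([1, 2, 3, 4, 5], [10, 8, 6, 4, 2]))
```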




For the test, the null hypothesis is that height and weight are mutually independent.
Our alternative hypothesis is that weight increases with height (i.e., a direct correlation).
In the table below, each of the values is ranked from lowest to highest within its respective group,
beginning with 1. di represents the difference between the ranks of xi and yi (specifically, rank of xi minus rank of yi).
In this example, the method for dealing with ties is also illustrated: one assigns a calculated
rank, representing the average of the tied ranks.

Since n ≤ 30 for the above data, the test statistic is rs itself, given by the formula

$$ r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} $$

For α = 0.05, the table of Spearman's test statistic critical values gives 0.3994 for n = 18.
Given the above alternative hypothesis, we are interested in a one-sided test,
wherein we will reject the null hypothesis if rs > 0.3994.
Because this is true, we reject the null hypothesis and accept the alternative at p < 0.05.


Is it possible to analyze two groups of categoric data?

Yes. An example is the absence or presence of a cough in cases
of group A beta-hemolytic streptococcal pharyngitis versus
nongroup A beta-hemolytic streptococcal pharyngitis.
This example would be depicted in the
following 2 × 2 contingency table:

                                        Cough present    Cough absent
Group A streptococcal pharyngitis             A                B
Nongroup A streptococcal pharyngitis          C                D



Explain the above contingency table.

A-D represent the count of cases, with the pair of characteristics
determined by the intersection of the rows and columns. For
instance, A subjects had both a cough and group A streptococcal
pharyngitis, whereas C subjects had both a cough and nongroup A
streptococcal pharyngitis. The total subjects with cough is A + C, and
the total number of subjects in the study is A + B + C + D.



What is the chi-square statistic?

The chi-square statistic is a nonparametric test of bivariate
count or frequency data.
This method of modeling examines the disparity between
the actual count frequencies and the expected count
frequencies when the null hypothesis of no difference
between the groups is true.
We can apply the chi-square test to determine whether an
association exists between group A beta-hemolytic
streptococcal pharyngitis and coughing



What is chi-square? When is it used?

A chi-square test is a test of statistical significance that does
not specifically test a population parameter (like the mean).
It is used in two situations:
Goodness of fit: this hypothesis test addresses the issue
of whether the distribution of a given data set in
categories "fits" a hypothesized distribution (e.g., a uniform distribution) of the data.
Independence: this hypothesis test addresses the issue of
whether the categories in a two-way contingency table
are related.



How is the chi-square statistic calculated?

In this method, a chi-square test statistic is calculated by the
following formula:

$$ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{O_{i\cdot} \, O_{\cdot j}}{n} $$

where
r = number of rows,
c = number of columns,
O_{ij} = the observed number for the intersection of the ith row and the jth column,
E_{ij} = the expected number for the intersection of the ith row and the jth column,
O_{i.} = sum of terms for the ith row (in the table above,
O_{1.} = sum of the first row of entries = A + B),
O_{.j} = sum of the jth column entries (in the table above,
O_{.2} = sum of entries in the second column = B + D), and
n = the total number of cases (above, n = A + B + C + D).
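
A short Python sketch may clarify the bookkeeping: each expected count is (row total × column total) / n, and the statistic sums (observed − expected)² / expected over all cells. The function name `chi_square` and the counts are hypothetical and not drawn from any study. The degrees of freedom for an r × c table are (r − 1)(c − 1).

```python
def chi_square(table):
    """Chi-square statistic and degrees of freedom for an r x c count table.

    `table` is a list of rows, each row a list of observed counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if n == 0 or 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs a nonzero total")
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the null hypothesis of no association.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical 2 x 2 counts laid out as [[A, B], [C, D]] (illustrative only).
stat, df = chi_square([[10, 20], [30, 40]])
print(f"chi-square = {stat:.4f}, df = {df}")
```

The statistic and its degrees of freedom would then be looked up in a chi-square table or passed to statistics software to obtain a p-value.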



How is the calculated chi-square statistic interpreted?

Once the chi-square statistic is calculated, the probability
of obtaining frequency disparities at least this large by chance
when the null hypothesis is true can be determined for
the number of degrees of freedom, (r - 1)(c - 1), from a standard chi-square table.
For any given number of degrees of freedom, the higher the
chi-square statistic, the more unlikely it is that the frequency
disparities are due to chance.




Assumptions of the chi-square test

Variables are categoric.
Data represent counts and can be represented in a table of r
rows and c columns, an r × c contingency table.
Cochran's criteria should be met to apply the basic chi-square test.



What are Cochran's criteria?

Cochran's criteria should be fulfilled if the chi-square test
statistic is to be used as noted above. These criteria are:
All expected values in each cell have a frequency count ≥ 1.
At least 80% of expected values in each cell should be ≥ 5.
The number of terms or cells for summation will be given by
r × c in general.
For the above table, therefore, there would be 2 × 2, or 4,
terms for summation for the chi-square statistic.
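
The criteria can be checked mechanically before running the test. The Python sketch below assumes the usual statement of Cochran's rule (every expected count at least 1, and at least 80% of expected counts at least 5). The function names `expected_counts` and `meets_cochran_criteria` and the example tables are hypothetical.

```python
def expected_counts(table):
    """Expected cell counts (row total * column total / n) for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if n == 0:
        raise ValueError("table has no observations")
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]


def meets_cochran_criteria(table):
    """True if all expected counts are >= 1 and at least 80% are >= 5."""
    cells = [e for row in expected_counts(table) for e in row]
    all_at_least_one = all(e >= 1 for e in cells)
    share_at_least_five = sum(e >= 5 for e in cells) / len(cells)
    return all_at_least_one and share_at_least_five >= 0.8


# Hypothetical tables (illustrative only).
print(meets_cochran_criteria([[10, 20], [30, 40]]))  # ample counts
print(meets_cochran_criteria([[1, 2], [3, 40]]))     # sparse first row
```

Note that the check is applied to the expected counts, not to the observed counts in the table.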



What is the maximum number of rows or columns for which a
chi-square statistic can be calculated?

There is no theoretical limit to the number of rows or columns
(a variable could easily have more than two categories), but
the computation can become tedious.
Statistics software is highly recommended.



Is there a simplified formula for calculating the chi-square
statistic in the 2 × 2 contingency table?

Thankfully, yes. For the above 2 × 2 table, the formula
simplifies to

$$ \chi^2 = \frac{n (AD - BC)^2}{(A + B)(C + D)(A + C)(B + D)} $$

where n = A + B + C + D.
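
A brief Python sketch of the 2 × 2 shortcut, χ² = n(AD − BC)² / ((A + B)(C + D)(A + C)(B + D)), is given here under the assumption that the cells are laid out with A and B in the first row and C and D in the second. The function name `chi_square_2x2` and the counts are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2 x 2 table with rows (a, b) and (c, d)."""
    n = a + b + c + d
    # Product of the two row totals and the two column totals.
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("every row and column needs a nonzero total")
    return n * (a * d - b * c) ** 2 / margins


# Hypothetical counts (illustrative only).
print(round(chi_square_2x2(10, 20, 30, 40), 4))
```

This shortcut gives the same value as summing (observed − expected)² / expected over the four cells, with no continuity correction applied.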



Describe logistic regression.

Logistic regression is a statistical method analogous to linear regression for the
analysis of categorical data.
The dependent variable is categorical (in the simplest case,
dichotomous, or having two outcomes like absence or
presence of disease).
The independent variables can be categorical, or else
continuous or discrete.
Like linear regression, logistic regression is a powerful tool
that has been widely used in medical research.
Independent variables can be assessed for their contribution
to the outcome of the dependent variable (e.g., the
presence of disease).


Can logistic regression be performed when more than one
independent variable may potentially affect the outcome?

Yes.
In the simplest case, logistic regression involves the
modeling of a dichotomous dependent variable on a single
independent variable that is categoric or quantitative.
In most applications, multiple independent variables are
assessed in the regression model



Can logistic regression models be used to quantify risk?

Yes. A key strength of logistic regression is
the ability to calculate odds ratios of an outcome of the
dependent variable (e.g., lung cancer) given specified values
of one of the independent variables (e.g., smoking vs.
nonsmoking) while controlling for the values of the other
independent variables (e.g., alcohol intake, family history,
and antioxidant use).
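
For the simplest case of one dichotomous independent variable, the odds ratio can be computed directly from a 2 × 2 table, and its natural logarithm is the coefficient a logistic regression would estimate for that variable. The Python sketch below illustrates only this unadjusted case; controlling for other variables requires fitting a multivariable model with statistics software. The function names and the counts are hypothetical.

```python
import math


def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Unadjusted odds ratio of the outcome for exposed versus unexposed."""
    if exposed_noncases == 0 or unexposed_cases == 0:
        raise ValueError("odds ratio is undefined with a zero in these cells")
    return (exposed_cases * unexposed_noncases) / (exposed_noncases * unexposed_cases)


def log_odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Natural log of the odds ratio (the logistic coefficient in this simple case)."""
    return math.log(
        odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases)
    )


# Hypothetical counts (illustrative only): 20 of 100 exposed subjects and
# 10 of 100 unexposed subjects have the outcome.
print(odds_ratio(20, 80, 10, 90))
print(log_odds_ratio(20, 80, 10, 90))
```

An odds ratio above 1 indicates higher odds of the outcome in the exposed group, and a value of 1 indicates no difference.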
