
Correlation

1. Pearson r correlation
Pearson r correlation is widely used in statistics to
measure the degree of the relationship between
linearly related variables. For example, in the stock
market, if we want to measure how two
commodities are related to each other,
Pearson r correlation measures the degree of
relationship between them. The following formula
is used to calculate the Pearson r correlation:
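$$ r = \frac{N \sum xy - (\sum x)(\sum y)}{\sqrt{[N \sum x^2 - (\sum x)^2][N \sum y^2 - (\sum y)^2]}} $$
Where:
r = Pearson r correlation coefficient
N = number of values in each data set
Σxy = sum of the products of paired scores
Σx = sum of x scores
Σy = sum of y scores
Σx² = sum of squared x scores
Σy² = sum of squared y scores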
Questions a Pearson correlation answers

Is there a statistically significant relationship between age, as measured in years, and height, measured in inches?
Is there a relationship between temperature, measured in degrees Fahrenheit, and ice cream sales, measured by income?
Is there a relationship between job satisfaction, as measured by the JSS, and income, measured in dollars?
Sample question: Find the value of the correlation
coefficient from the following table:

SUBJECT   AGE X   GLUCOSE LEVEL Y
1         43      99
2         21      65
3         25      79
4         42      75
5         57      87
6         59      81
Step 1: Make a chart. Use the given data,
and add three more columns: xy, x², and y².

SUBJECT   AGE X   GLUCOSE LEVEL Y   XY      X²      Y²
1         43      99
2         21      65
3         25      79
4         42      75
5         57      87
6         59      81
Step 2: Multiply x and y together to fill the xy
column. For example, row 1 would be 43 × 99
= 4,257.

SUBJECT   AGE X   GLUCOSE LEVEL Y   XY      X²      Y²
1         43      99                4257
2         21      65                1365
3         25      79                1975
4         42      75                3150
5         57      87                4959
6         59      81                4779
Step 3: Take the square of the numbers in the x
column, and put the result in the x² column.

SUBJECT   AGE X   GLUCOSE LEVEL Y   XY      X²      Y²
1         43      99                4257    1849
2         21      65                1365    441
3         25      79                1975    625
4         42      75                3150    1764
5         57      87                4959    3249
6         59      81                4779    3481
Step 4: Take the square of the numbers in the y
column, and put the result in the y² column.

SUBJECT   AGE X   GLUCOSE LEVEL Y   XY      X²      Y²
1         43      99                4257    1849    9801
2         21      65                1365    441     4225
3         25      79                1975    625     6241
4         42      75                3150    1764    5625
5         57      87                4959    3249    7569
6         59      81                4779    3481    6561
Step 5: Add up all of the numbers in each column
and put the result at the bottom of that column. The Greek
letter sigma (Σ) is a short way of saying "sum of".

SUBJECT   AGE X   GLUCOSE LEVEL Y   XY      X²      Y²
1         43      99                4257    1849    9801
2         21      65                1365    441     4225
3         25      79                1975    625     6241
4         42      75                3150    1764    5625
5         57      87                4959    3249    7569
6         59      81                4779    3481    6561
Σ         247     486               20485   11409   40022
Step 6: Use the following correlation
coefficient formula.

$$ r = \frac{n \sum xy - (\sum x)(\sum y)}{\sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]}} $$

From our table:
Σx = 247
Σy = 486
Σxy = 20,485
Σx² = 11,409
Σy² = 40,022
n is the sample size, in our case n = 6.
The correlation coefficient =
[6(20,485) - (247 × 486)] / √([6(11,409) - 247²] × [6(40,022) - 486²])
= 2868 / 5413.27
= 0.529809, or about 0.5298
The range of the correlation coefficient is from -1 to 1.
Our result is 0.5298, which means the
variables have a moderate positive correlation.
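The six steps above can be sketched in a few lines of Python. This is only an illustration of the column-sum formula applied to the age and glucose data from the table; the function name `pearson_r` and the variable names are made up for this example.

```python
from math import sqrt

# Data from the worked example: age (x) and glucose level (y).
ages = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]


def pearson_r(xs, ys):
    """Pearson r via the column-sum formula used in Steps 1 to 6."""
    n = len(xs)
    sum_x = sum(xs)                               # sum of x scores
    sum_y = sum(ys)                               # sum of y scores
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # sum of products
    sum_x2 = sum(x * x for x in xs)               # sum of squared x scores
    sum_y2 = sum(y * y for y in ys)               # sum of squared y scores
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(ages, glucose)
print(round(r, 4))  # 0.5298
```

The intermediate sums match the bottom row of the chart (247, 486, 20485, 11409, 40022), so the function reproduces 2868 / 5413.27.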
How to test correlation coefficients

Sample problem: Test the significance of the
correlation coefficient r = 0.565 using the critical values
for PPMC table. Test at α = 0.01 for a sample size of 9.
Step 1: Subtract two from the sample size to get df,
the degrees of freedom.
9 - 2 = 7
Step 2: Look the values up in the PPMC table. With df =
7 and α = 0.01, the table value is 0.798.
Visit this site for the table:
http://www.statisticshowto.com/tables/ppmc-critical-values/
Pearson Product-Moment Correlation (PPMC)
Coefficient Table of Critical Values
Level of significance for two-tailed test

df = n - 2
(n = # of pairs of data)   .10     .05     .02     .01
1                          .988    .997    .9995   .9999
2                          .900    .950    .980    .990
3                          .805    .878    .934    .959
4                          .729    .811    .882    .917
5                          .669    .754    .833    .874
6                          .622    .707    .789    .834
7                          .582    .666    .750    .798
8                          .549    .632    .716    .765
9                          .521    .602    .685    .735
10                         .497    .576    .658    .708
Step 3: Draw a graph, so you can more
easily see the relationship.

r = 0.565 does not fall into the rejection region (above
0.798), so there isn't enough evidence to state that a
statistically significant linear relationship exists in the data.
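The table lookup can also be written as a short check. The sketch below hard-codes only the α = 0.01 column of the table above; the names `CRITICAL_R_ALPHA_01` and `is_significant` are invented for this illustration, and the dictionary covers df = 1 to 10 only.

```python
# Two-tailed critical values for alpha = 0.01, copied from the PPMC table
# (key: degrees of freedom, df = n - 2).
CRITICAL_R_ALPHA_01 = {
    1: 0.9999, 2: 0.990, 3: 0.959, 4: 0.917, 5: 0.874,
    6: 0.834, 7: 0.798, 8: 0.765, 9: 0.735, 10: 0.708,
}


def is_significant(r, n, critical_values=CRITICAL_R_ALPHA_01):
    """Return True when |r| exceeds the table value for df = n - 2."""
    df = n - 2
    return abs(r) > critical_values[df]


print(is_significant(0.565, 9))  # False, because 0.565 is below 0.798
```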
Meaning of the Linear Correlation
Coefficient

Pearson's Correlation Coefficient is a linear
correlation coefficient that returns a value
between -1 and +1. A -1 means there
is a perfect negative correlation and +1
means that there is a perfect positive
correlation. A 0 means that there is no
correlation (this is also called zero order
correlation).
This can initially be a little hard to wrap
your head around. The Political Science
Department at Quinnipiac University
posted this useful list of the meaning of
Pearson's Correlation coefficients. They
note that these are crude estimates
for interpreting strengths of correlations
using Pearson's Correlation:
r value          Meaning
+.70 or higher   Very strong positive relationship
+.40 to +.69     Strong positive relationship
+.30 to +.39     Moderate positive relationship
+.20 to +.29     Weak positive relationship
+.01 to +.19     No or negligible relationship
0                No relationship [zero order correlation]
-.01 to -.19     No or negligible relationship
-.20 to -.29     Weak negative relationship
-.30 to -.39     Moderate negative relationship
-.40 to -.69     Strong negative relationship
-.70 or lower    Very strong negative relationship
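As a rough sketch, the verbal labels listed above can be turned into a lookup function. The function name `describe_r` is hypothetical, and treating the bands as continuous (so that a value such as .195 falls into the "negligible" band) is an assumption of this example, not something the list states.

```python
def describe_r(r):
    """Map a Pearson r to the crude verbal labels listed above."""
    if r == 0:
        return "no relationship"
    size = abs(r)
    direction = "positive" if r > 0 else "negative"
    if size >= 0.70:
        return f"very strong {direction} relationship"
    if size >= 0.40:
        return f"strong {direction} relationship"
    if size >= 0.30:
        return f"moderate {direction} relationship"
    if size >= 0.20:
        return f"weak {direction} relationship"
    return "no or negligible relationship"


print(describe_r(-0.35))  # moderate negative relationship
print(describe_r(0.75))   # very strong positive relationship
```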
It may be helpful to see graphically what
these correlations look like:

[Figure: graphs showing a correlation of -1 (a negative correlation), 0 and +1 (a positive correlation)]
The images show that a strong
negative correlation means that the
graph has a downward slope from
left to right: as the x-values
increase, the y-values get smaller. A
strong positive correlation means
that the graph has an upward slope
from left to right: as the x-values
increase, the y-values get larger.
Cramér's V Correlation

Cramér's V Correlation is similar to the
Pearson Correlation coefficient. While the
Pearson correlation is used to test the
strength of linear relationships, Cramér's V
is used to calculate the association between
categorical variables in tables
with more than 2 x 2 columns and rows.
Cramér's V varies between 0
and 1. A value close to 0 means that there
is very little association between the
variables. A Cramér's V close to 1
indicates a very strong association.
Cramér's V       Meaning
.25 or higher    Very strong relationship
.15 to .25       Strong relationship
.11 to .15       Moderate relationship
.06 to .10       Weak relationship
.01 to .05       No or negligible relationship
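The text does not give a formula for Cramér's V, so the sketch below uses the standard textbook definition, V = sqrt(χ² / (n × min(rows - 1, columns - 1))), as an assumption. The function name `cramers_v` and the counts are invented for illustration only, and the code assumes every row and column total is nonzero.

```python
from math import sqrt


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return sqrt(chi2 / (n * k))


# Hypothetical 2 x 3 table of counts, for illustration only.
counts = [[10, 20, 30],
          [30, 20, 10]]
print(round(cramers_v(counts), 3))  # 0.408
```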
Kendall rank correlation
Kendall rank correlation is a non-parametric test
that measures the strength of dependence
between two variables. If we consider two
samples, a and b, where each sample size is n, we
know that the total number of pairings of a with b
is n(n-1)/2. The following formula is used to
calculate the value of Kendall rank correlation:

$$ \tau = \frac{N_c - N_d}{\frac{1}{2} n(n-1)} $$

Where:
Nc = number of concordant pairs
Nd = number of discordant pairs
Kendall's Tau is a non-parametric measure
of relationships between columns of
ranked data. The Tau correlation
coefficient returns a value of 0 to 1,
where:

0 is no relationship,
1 is a perfect relationship.
The test can also produce negative values (i.e. from -1 to 0).
A negative value means the two rankings tend to run in
opposite directions, with -1 indicating that one ranking is
the exact reverse of the other. To judge only the strength of
the relationship, look at the absolute value of Tau.
Several versions of Tau exist.
Tau-A and Tau-B are usually used for square tables (with
equal columns and rows). Tau-B will adjust for tied
ranks.
Tau-C is usually used for rectangular tables. For square
tables, Tau-B and Tau-C are essentially the same.
Most statistical packages have Tau-B
built in, but you can use the
following formula to calculate it by
hand:
Kendall's Tau = (Nc - Nd) / (Nc + Nd)
Where Nc is the number of
concordant pairs and Nd is the number
of discordant pairs.
Example Problem

Sample Question: Two
interviewers ranked 12 candidates
(A through L) for a position. The
results from most preferred to least
preferred are:
Interviewer 1: ABCDEFGHIJKL.
Interviewer 2: ABDCFEHGJILK.
Calculate the Kendall Tau
correlation.
Step 1: Make a table of rankings. The first
column, "Candidate", is optional and for
reference only. The rankings for Interviewer
1 should be in ascending order (from least to
greatest).
Step 2: Count the number of concordant pairs,
using the second column. For each row, the number of
concordant pairs is how many larger ranks are below
that rank. For example, the first rank in the second
interviewer's column is a 1, so all 11 ranks below it are larger.
However, going down the list to the third row
(a rank of 4), the rank immediately below
(3) is smaller, so it doesn't count as a
concordant pair.
When all concordant pairs have
been counted, it looks like this:
Step 3: Count the number of discordant
pairs and insert them into the next column.
Counting discordant pairs is similar to
Step 2, only you're looking for smaller ranks,
not larger ones.
Step 4: Sum the values in the
two columns:
Step 5: Insert the totals into the
formula:

Kendall's Tau = (C - D) / (C + D)
= (61 - 5) / (61 + 5) = 56 / 66 = .85.
The Tau coefficient is .85,
suggesting a strong relationship
between the rankings.
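The counting in Steps 2 to 5 can be checked with a short script. This sketch compares every pair of candidates directly instead of counting down a column, which gives the same totals; the variable names are invented for the example.

```python
from itertools import combinations

interviewer_1 = "ABCDEFGHIJKL"
interviewer_2 = "ABDCFEHGJILK"

# Rank of each candidate in each ordering (1 = most preferred).
rank_1 = {c: i + 1 for i, c in enumerate(interviewer_1)}
rank_2 = {c: i + 1 for i, c in enumerate(interviewer_2)}

concordant = discordant = 0
for a, b in combinations(interviewer_1, 2):
    # A pair is concordant when both interviewers order it the same way.
    product = (rank_1[a] - rank_1[b]) * (rank_2[a] - rank_2[b])
    if product > 0:
        concordant += 1
    elif product < 0:
        discordant += 1

tau = (concordant - discordant) / (concordant + discordant)
print(concordant, discordant, round(tau, 2))  # 61 5 0.85
```

With 12 candidates there are 12 × 11 / 2 = 66 pairs in total, which is why the two counts add up to 66.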
Perfect Correlation
Counting how many values are below each rank in the second column seems
very odd when you first do it. But it does work. Just as a
thought experiment, here's what the spreadsheet looks like if
both interviewers were in perfect agreement:

And, inserting the totals into the formula we get:

Tau = (66 - 0) / (66 + 0) = 1, which is (as we expect)
perfect agreement.
Spearman rank correlation

The Spearman rank correlation coefficient, r_s, is
the nonparametric version of the Pearson
correlation coefficient. Your data must
be ordinal, interval or ratio. Spearman's returns a
value from -1 to 1, where:
+1 = a perfect positive correlation between ranks
-1 = a perfect negative correlation between ranks
0 = no correlation between ranks.
Spearman Rank Correlation:
Worked Example (No Tied Ranks)

The formula for the Spearman rank
correlation coefficient when there are no
tied ranks is:

$$ r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} $$
Sample Question:

The scores for nine students in physics
and math are as follows:
Physics: 35, 23, 47, 17, 10, 43, 9, 6, 28
Mathematics: 30, 33, 45, 23, 8, 49, 12, 4, 31
Compute the students' ranks in the two
subjects and compute the Spearman rank
correlation.
Step 1: Find the ranks for each individual
subject. Order the scores from greatest to
smallest; assign the rank 1 to the highest
score, 2 to the next highest and so on:
Step 2: Add a third column, d, to your data. The d is
the difference between ranks. For example, the first
student's physics rank is 3 and math rank is 5, so
the difference is 3 - 5 = -2.
Step 3: In a fourth column, square your d values.
Step 4: Sum (add up) all of your d-squared values.
4 + 4 + 1 + 0 + 1 + 1 + 1 + 0 + 0 = 12. You'll need this
for the formula (the Σd² is just the sum of d-squared values).
Step 5: Insert the values into the formula. These ranks are
not tied, so use the first formula:

r_s = 1 - (6 × 12) / (9(81 - 1))
= 1 - 72/720
= 1 - 0.1
= 0.9
The Spearman Rank Correlation for this set of data is
0.9
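The worked example can be reproduced with a small script. This is a sketch for data without ties only; the helper names `ranks_descending` and `spearman_no_ties` are made up for the illustration.

```python
physics = [35, 23, 47, 17, 10, 43, 9, 6, 28]
maths = [30, 33, 45, 23, 8, 49, 12, 4, 31]


def ranks_descending(scores):
    """Rank scores so the highest gets rank 1 (assumes no ties)."""
    ordered = sorted(scores, reverse=True)
    return [ordered.index(s) + 1 for s in scores]


def spearman_no_ties(xs, ys):
    """Spearman r_s via 1 - 6 * sum(d^2) / (n(n^2 - 1)); valid without ties."""
    n = len(xs)
    rx, ry = ranks_descending(xs), ranks_descending(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


print(ranks_descending(physics))  # [3, 5, 1, 6, 7, 2, 8, 9, 4]
print(round(spearman_no_ties(physics, maths), 3))  # 0.9
```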
Spearman Rank Correlation:
Worked Example (Tied Ranks)

Tied ranks are where two items in a
column have the same rank. A couple of
different formulas exist for dealing with
tied ranks. Perhaps the easiest way is to
use the mean of the tied ranks.
Let's say two items in the above example
tied for ranks 5 and 6. You would assign
each data point a mean rank of 5.5:
Use the same formula, this time
inserting the sum of d-squared values for the
tied ranks (14.5 from the new data):

r_s = 1 - (6 × 14.5) / (9(81 - 1))
= 1 - 87/720
= 1 - 0.120833333
= 0.879
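Assigning mean ranks to ties can be sketched as follows. The helper `average_ranks` and the four scores are hypothetical and only show how tied items share the mean of the ranks they span; the last two lines plug the sum of squared differences quoted above (14.5) into the formula.

```python
def average_ranks(scores):
    """Rank highest-first, giving tied scores the mean of the ranks they span."""
    ordered = sorted(scores, reverse=True)
    ranks = []
    for s in scores:
        first = ordered.index(s) + 1           # best rank held by this score
        count = ordered.count(s)               # how many items share the score
        ranks.append(first + (count - 1) / 2)  # mean of first..first+count-1
    return ranks


# Hypothetical scores, for illustration only: the two 20s tie for ranks 2 and 3.
print(average_ranks([30, 20, 20, 10]))  # [1.0, 2.5, 2.5, 4.0]

# Using the sum of squared differences from the tied-ranks example (14.5):
n, sum_d2 = 9, 14.5
print(round(1 - 6 * sum_d2 / (n * (n ** 2 - 1)), 3))  # 0.879
```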