
MULTIVARIATE ANALYSIS

EXAM 2

Dan Sewell

Question 1:

a. The first few observations of the mortality rates per 100,000, calculated in SAS (see the code in the
Appendix), are shown below. The entire data set can be found in the exhaustive SAS output (see
Appendix).

Obs    newcopd     newcvd    newpneu    newresp

  1    64.2845    336.711    32.4186    97.2557
  2    15.8351    160.462    16.8907    32.7258
  3    21.5268    206.496    21.3919    42.9862
  4    34.7171    161.644    12.1879    47.3974
  5    47.6086    289.430    19.0434    66.9543
  6    22.2840    370.108    19.8618    42.1459
  7    29.5735    242.242    29.4285    59.1470
  8    42.5145    427.039    31.8858    74.8212
  9    31.7778    216.549    28.7582    60.8236
 10    33.0126    328.191    31.1527    64.5372
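
The conversion to a rate per 100,000 is simple arithmetic. A minimal Python sketch follows; the original calculation was done in SAS, so the function name, the argument names, and the example figures here are hypothetical and are not taken from the data set. It assumes the raw data hold a death count and a population for each city.

```python
def rate_per_100k(deaths: float, population: float) -> float:
    """Return the mortality rate per 100,000 residents.

    Assumption: `deaths` and `population` refer to the same city and period.
    """
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply first so that whole-number inputs stay exact where possible.
    return deaths * 100_000 / population


# Hypothetical city: 250 deaths in a population of 500,000.
example_rate = rate_per_100k(250, 500_000)
print(example_rate)
```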

b. Monthly average temperature and ozone, from April to September, were computed using Microsoft
Excel (=average(…)). The resulting data set was then imported into SAS. The first few observations are
below, but the entire data set can be found in the exhaustive SAS output.

Obs   TempApril   TempMay   TempJune   TempJuly   TempAug   TempSep    O3April

  1     51.4000   61.6129    70.0667    76.0968   68.4516   63.7333   -2.59742
  2     56.7667   67.4839    74.9333    83.2258   79.9032   70.2667   -5.77401
  3     65.3667   69.1290    75.1667    79.3871   82.1290   73.3000   -4.00079
  4     72.0000   76.6452    81.9667    83.4516   88.5161   81.6667    8.26811
  5     58.0667   66.6452    73.9000    80.2258   77.8710   77.2333    2.01369
  6     72.0333   74.7419    80.5000    81.8710   85.2903   76.3000    4.48913
  7     49.4000   58.4194    71.2000    76.0323   71.6452   67.3667    6.72094
  8     46.1333   59.8387    68.5667    74.5806   68.0645   64.4667    2.51458
  9     62.2333   65.9677    73.6333    79.0645   79.6774   69.3667    0.53066
 10     49.8000   61.8710    70.5667    78.5806   70.5484   63.6667    1.25153

Obs      O3May     O3June     O3July      O3Aug      O3Sep

  1     2.9776     9.3160    10.1753    -1.3167    -5.4473
  2     3.3453     6.7535    13.4359     8.1736    -6.5235
  3     2.8535    -1.9897    -0.0305    13.6344     2.2994
  4     2.1398    -5.5971   -11.3581     9.2746    11.6430
  5    10.1284    14.0376    13.9645     9.5357    16.2752
  6    13.4318     0.1609    -3.6472     8.2088     3.4473
  7     2.6733     7.3469     7.7088     3.5155    -3.5027
  8     9.0380    13.3186    15.9207     1.0480     0.3954
  9     2.7441     4.2031     3.6292    12.7265    -6.9343
 10     6.0385     7.6226     7.0373    -0.5158     0.0429
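
The averaging step itself was done in Excel. As a rough equivalent, the sketch below groups readings by month and averages them. It assumes the raw data are individual (for example daily) readings labelled by month; the function name and the sample readings are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def monthly_means(readings):
    """Average (month, value) pairs by month and return {month: mean}."""
    by_month = defaultdict(list)
    for month, value in readings:
        by_month[month].append(value)
    return {month: mean(values) for month, values in by_month.items()}


# Hypothetical readings for one city (not taken from the data set).
sample = [
    ("April", 50.0), ("April", 52.0),
    ("May", 60.0), ("May", 64.0), ("May", 62.0),
]
averages = monthly_means(sample)
print(averages)
```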

Question 2:

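Are mortalities related to coastline? To answer this question, I ran a MANOVA with the following model:

$$
\begin{bmatrix} newcopd_i & newcvd_i & newpneu_i & newresp_i \end{bmatrix}
=
\begin{bmatrix} 1 & coast_i \end{bmatrix}
\begin{bmatrix}
\beta_{01} & \beta_{02} & \beta_{03} & \beta_{04} \\
\beta_{11} & \beta_{12} & \beta_{13} & \beta_{14}
\end{bmatrix}
+
\begin{bmatrix} \varepsilon_{i1} & \varepsilon_{i2} & \varepsilon_{i3} & \varepsilon_{i4} \end{bmatrix}
$$

for i = 1, …, 70.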

I used Wilks' Lambda to decide whether to reject the null hypothesis that coastline has no effect
(i.e. $H_0: \beta_1 = 0$, where $\beta_1 = (\beta_{11}, \beta_{12}, \beta_{13}, \beta_{14})$). Less relevant to the question, but also
tested, was the null hypothesis that the intercept is zero (i.e. $H_0: \beta_0 = 0$). The p-values from Wilks'
Lambda for these two tests are 0.003 and less than 0.001, respectively. This implies that mortality rates
do differ by geography (specifically, by whether a city lies in a region next to the coast). Since the
mortality rates differ depending on region, it is important to look at the means of the mortality rates for
each of the two locations.
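
Wilks' Lambda compares the within-group (error) variation with the total variation: Λ = |E| / |E + H|, where E and H are the error and hypothesis sums-of-squares-and-cross-products matrices, and values near 0 are evidence against the null hypothesis. A minimal pure-Python sketch for a one-way layout is given below. The function names and the toy data are hypothetical, and the conversion of Λ to a p-value (done by SAS in the analysis above) is not shown.

```python
def mean_vector(rows):
    """Column means of a list of equal-length observation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sscp(rows, center):
    """Sums of squares and cross-products of rows about a center vector."""
    p = len(center)
    m = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [x - c for x, c in zip(row, center)]
        for a in range(p):
            for b in range(p):
                m[a][b] += d[a] * d[b]
    return m


def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(a[r][i]))
        if a[pivot][i] == 0.0:
            return 0.0
        if pivot != i:
            a[i], a[pivot] = a[pivot], a[i]
            result = -result
        result *= a[i][i]
        for r in range(i + 1, n):
            factor = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= factor * a[i][c]
    return result


def wilks_lambda(groups):
    """Wilks' Lambda = |E| / |E + H| for a one-way layout."""
    all_rows = [row for g in groups for row in g]
    total = sscp(all_rows, mean_vector(all_rows))  # T = E + H
    p = len(all_rows[0])
    within = [[0.0] * p for _ in range(p)]         # E, pooled over groups
    for g in groups:
        s = sscp(g, mean_vector(g))
        for a in range(p):
            for b in range(p):
                within[a][b] += s[a][b]
    return det(within) / det(total)


# Toy data with two responses per city (hypothetical, not the exam data).
inland = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
coastal = [[x + 5.0, y + 5.0] for x, y in inland]
print(wilks_lambda([inland, coastal]))  # well below 1: the group means differ
print(wilks_lambda([inland, inland]))   # 1 (up to rounding): identical means
```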

coast   N Obs   Variable    N          Mean       Std Dev       Minimum       Maximum
--------------------------------------------------------------------------------------
0          32   newcopd    32    46.9256059    10.4654806    24.8226145    74.4175238
                newcvd     32   342.2685421   104.0352160   170.4295948   706.4312769
                newpneu    32    26.9604181     6.1229006    14.3832579    39.9480052
                newresp    32    74.3137923    13.2431913    45.9742789    99.4803952

1          38   newcopd    38    36.8365308    11.7025433    15.8350620    77.1583167
                newcvd     38   304.6484248   117.8645380   160.4619615   738.6820742
                newpneu    38    20.5756272     8.0217809     7.3123619    38.0844329
                newresp    38    57.6683982    15.7261493    32.7257948   110.2571727

We can see that for each type of death, the mean mortality rates are higher for those cities which are
not along a coastline. I conclude that there is a higher probability of a Chronic Obstructive Pulmonary
Disease death, a cardiovascular death, a pneumonia death, or a respiratory death if one lives away from
the coastline.

Diagnostic checks were run for normality and for equal covariance matrices. Testing the homogeneity of
the covariance matrices with a chi-squared test leads us to fail to reject the null hypothesis (equal
covariance matrices) at the 0.05 level. However, the Henze-Zirkler test run on the residuals indicated
that they are not multivariate normal.

Question 3:

We wish to better see the underlying variation structure in the monthly averages of temperature and
ozone. To this end, I perform principal component analysis on both temperature and ozone.

First, with monthly average temperature, I find that 96.04% of the variation of the data is explained by
the first two PC’s. Further evidence for choosing to use just the first two PC’s comes from the following
Scree Plot, and noticing the elbow is at 2:

[Figure: Scree plot of eigenvalues for the six monthly average temperature PC's. Vertical axis: eigenvalues; horizontal axis: component number. The first eigenvalue is far larger than the others, and the elbow is at component 2.]
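
The share of variation explained by the first k PC's is the sum of the k largest eigenvalues divided by the sum of all eigenvalues. A minimal sketch of that calculation is given below. The eigenvalues in it are hypothetical placeholders, not the values from the SAS output, and the function names are illustrative only.

```python
def cumulative_proportion(eigenvalues):
    """Cumulative share of total variance for PC's sorted by eigenvalue."""
    total = sum(eigenvalues)
    running = 0.0
    shares = []
    for value in sorted(eigenvalues, reverse=True):
        running += value
        shares.append(running / total)
    return shares


def components_needed(eigenvalues, threshold):
    """Smallest number of PC's whose cumulative share reaches threshold."""
    for k, share in enumerate(cumulative_proportion(eigenvalues), start=1):
        if share >= threshold:
            return k
    return len(eigenvalues)


# Hypothetical eigenvalues for six standardized variables (they sum to 6).
eigs = [5.2, 0.55, 0.1, 0.08, 0.05, 0.02]
shares = cumulative_proportion(eigs)
print([round(s, 4) for s in shares])
print(components_needed(eigs, 0.95))
```

A scree plot shows the same eigenvalues graphically, so the two criteria usually point to a similar number of components.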

The following table shows how each monthly average temperature is correlated with the first two PC’s.

PC     APRIL      MAY       JUNE     JULY     AUG      SEPT

1      0.928     0.987     0.952    0.809    0.957    0.949
2     -0.353    -0.0183    0.256    0.562    0.147   -0.092

Second, for average monthly ozone, I find that I should choose either 3 or 4 PC's by looking at the Scree
Plot below:

[Figure: Scree plot of eigenvalues for the six monthly average ozone PC's. Vertical axis: eigenvalues (0.0 to 2.5); horizontal axis: component number. The first two eigenvalues are near 2, the third is near 1, and the remaining three are at or below 0.5.]

The first 3 PC's account for 88.24% of the variance in average monthly ozone, and the first 4 PC's
account for 94.38%. The choice is rather subjective, but I will use 4 PC's in the analyses that follow.
Correlations between each monthly ozone average and each of the first four PC's are given below.

PC     APRIL       MAY         JUNE        JULY        AUG         SEP

Y1    -0.24531    0.246509    0.919129    0.972862    0.206886   -0.29725
Y2     0.44967    0.592193    0.026071   -0.0257      0.847247    0.77709
Y3     0.781158   0.603777    0.177878    0.024001   -0.4416      0.109002
Y4    -0.27142   -0.14753     0.122663    0.020533   -0.20632     0.536598

The data set with the PC scores from all 6 PC's is attached in the exhaustive SAS output. Note that PCTi
refers to the ith PC for temperature, and PCOi refers to the ith PC for ozone, i = 1, …, 6.

Question 4:

In order to understand the relationships and associations between mortality and monthly temperature,
monthly ozone, and geography (coastline or no coastline), I set up a regression model with four
response variables (the four mortality rates) and 13 covariates (14 including the intercept). The
following are the regression coefficients for each of the four mortality variables:

For Chronic Obstructive Pulmonary Disease:
Intercept 20.28952
TempApril -0.19539
TempMay -0.48570
TempJune -0.16807
TempJuly 1.12896
TempAug -1.52321
TempSep 1.61922
O3April 0.15609
O3May -0.03675
O3June -0.31666
O3July -0.16258
O3Aug -0.15963
O3Sep 0.29670
coast -12.50162

For cardiovascular deaths:


Intercept -193.94372
TempApril 10.72090
TempMay -39.54714
TempJune 26.74436
TempJuly 22.51034
TempAug -24.88893
TempSep 8.86207
O3April -5.95457
O3May 14.62094
O3June -9.52123
O3July -0.69330
O3Aug -1.42830
O3Sep -0.95663
coast 39.53147

For pneumonia deaths:


Intercept -193.94372
TempApril 10.72090
TempMay -39.54714
TempJune 26.74436
TempJuly 22.51034
TempAug -24.88893
TempSep 8.86207
O3April -5.95457
O3May 14.62094
O3June -9.52123
O3July -0.69330
O3Aug -1.42830
O3Sep -0.95663
coast 39.53147

For respiratory deaths:


Intercept 40.02573
TempApril 0.12833
TempMay -1.50822
TempJune 2.27612
TempJuly 1.33574
TempAug -2.91614
TempSep 1.12797
O3April 0.32036
O3May -0.17657
O3June -0.85262
O3July 0.00097820
O3Aug 0.11777
O3Sep -0.00352
coast -13.97280

A test for the overall significance of the model was conducted. The null hypothesis $H_0: B = 0$ (all
covariate coefficients are zero) gave a p-value of 0.0041. At a significance level of 0.05, we reject the
null hypothesis and conclude that at least one of the 13 covariates is significant. In other words, at least
one monthly temperature average, one monthly ozone average, or geography has an effect on mortality
rates. This is not surprising, since we found earlier that geography plays a role in mortality rates and, in
particular, that cities by the coast had lower mean mortality rates for all four causes of death. It is
important to identify which of these covariates play a role in mortality rates, so a separate test is
conducted for each. For each test, Wilks' Lambda is used to obtain a p-value for the null hypothesis that
the covariate has no effect. A table of these results is given below.

COVARIATE p-value Does it affect mortality rates?


Mean Temperature for April 0.7441 No
Mean Temperature for May 0.1777 No
Mean Temperature for June 0.0549 No
Mean Temperature for July 0.3901 No
Mean Temperature for Aug 0.5209 No
Mean Temperature for Sept 0.3019 No
Mean Ozone for April 0.4409 No
Mean Ozone for May 0.2785 No
Mean Ozone for June 0.4956 No
Mean Ozone for July 0.9295 No
Mean Ozone for Aug 0.8481 No
Mean Ozone for Sept 0.4615 No
coast 0.0346 Yes
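
The Yes/No column follows from comparing each p-value with the 0.05 significance level. A small sketch that reproduces the rule from the p-values in the table above (the dictionary keys and variable names are illustrative only):

```python
ALPHA = 0.05  # significance level used throughout the report

# p-values copied from the table above.
p_values = {
    "TempApril": 0.7441, "TempMay": 0.1777, "TempJune": 0.0549,
    "TempJuly": 0.3901, "TempAug": 0.5209, "TempSep": 0.3019,
    "O3April": 0.4409, "O3May": 0.2785, "O3June": 0.4956,
    "O3July": 0.9295, "O3Aug": 0.8481, "O3Sep": 0.4615,
    "coast": 0.0346,
}

# A covariate is flagged when its p-value falls below ALPHA.
significant = [name for name, p in p_values.items() if p < ALPHA]
print(significant)  # ['coast']

# The covariate with the smallest p-value among those not flagged.
closest = min((n for n in p_values if n not in significant), key=p_values.get)
print(closest)  # 'TempJune' (p = 0.0549, just above the threshold)
```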

Next, we use only the PC's (specifically, PCT1, PCT2, PCO1, PCO2, PCO3, and PCO4) as covariates, along
with coast, to construct a model of the same form. The following are the regression coefficients for this
model.

For Chronic Obstructive Pulmonary Disease:


Intercept 48.11297
PCT1 0.13659
PCT2 -0.02778
PCO1 -0.01593
PCO2 -0.26565
PCO3 0.41358
PCO4 0.69546
coast -12.27632

For cardiovascular deaths:


Intercept 328.24533
PCT1 0.88349
PCT2 2.34210
PCO1 0.03544
PCO2 -4.82485
PCO3 -0.93247
PCO4 2.95725
coast -11.78788

For pneumonia deaths:


Intercept 24.82633
PCT1 0.02900
PCT2 0.27819
PCO1 -0.04426
PCO2 -0.38240
PCO3 -0.36996
PCO4 -0.24137
coast -2.45358

For respiratory deaths:


Intercept 73.31611
PCT1 0.16035
PCT2 0.26899
PCO1 -0.06609
PCO2 -0.65850
PCO3 0.05854
PCO4 0.46664
coast -14.80757

Again using Wilks' Lambda for our tests, we test the null hypothesis that none of the covariates has
any effect on mortality rates. The test yields a p-value of 0.008, so we reject the null hypothesis and
conclude that at least one of the covariates affects mortality rates. The results for the individual
covariate tests are given in the same fashion as for the first regression model.

COVARIATE p-value Does it affect mortality rates?


PCT1 0.5959 No
PCT2 0.6509 No
PCO1 0.9490 No
PCO2 0.0610 No
PCO3 0.0211 Yes
PCO4 0.1783 No
Coast 0.0220 Yes

Finally, one last regression model is fit, with only two covariates, PCO3 and coast, to check whether we
can really say that ozone has an effect on the mortality rates. With this model, we get the following
regression coefficients:

For Chronic Obstructive Pulmonary Disease:


Intercept 47.53327
PCO3 0.33993
coast -11.20846

For cardiovascular deaths:


Intercept 340.00412
PCO3 -1.26673
coast -33.44881

For pneumonia deaths:


Intercept 26.28443
PCO3 -0.37815
coast -5.13954

For respiratory deaths:


Intercept 74.27432
PCO3 -0.02208
coast -16.57268

The overall model is significant, as Wilks' Lambda gave a p-value of less than 0.0001. Individual tests
show that both covariates are significant: the p-value for PCO3 is 0.0166, and the p-value for coast is
0.0006. This indicates that the four mortality rates are affected by both geography and monthly
average ozone (through its third PC).

Diagnostics were run to check whether the residuals were multivariate normal. To assess the
multivariate normality of the residuals, I used the Henze-Zirkler test, which indicated that they were in
fact not normal.

Question 5:

In order to see which cities have similar average monthly temperatures, I performed both hierarchical
and non-hierarchical clustering. First, I performed hierarchical clustering based on average linkage. The
problem then lies in how many clusters to choose. The plot of the first two PC's does not give a clear
idea of how many clusters to choose, so I looked at both the Cubic Clustering Criterion (CCC) and the
Pseudo Hotelling's T² statistic. On the plot of CCC vs. number of clusters, there are peaks at 11 clusters
and at 7 clusters. The Pseudo T² also points to 11 clusters, since it jumps from 4.9 (obtaining 11
clusters) to 41.7 (obtaining 10 clusters). As a final check, I used a non-hierarchical method with Beale's
F-type statistic to determine whether to choose 7 or 11 clusters. Beale's F-type statistic for this
comparison is 16.9, which leads us to conclude that 11 clusters is better. The statistic is computed and
compared to the F critical value using the R code in the Appendix. See the SAS output for a plot of the
first 2 temperature PC's by cluster, and for a list of cities sorted by cluster. I conclude that each of our
cities falls into one of 11 groups based on average monthly temperature.
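
For reference, one common textbook form of Beale's F-type statistic for comparing a solution with c1 clusters against one with c2 > c1 clusters is F = [(W1 - W2) / W2] / [((n - c1) / (n - c2)) (c2 / c1)^(2/p) - 1], referred to an F distribution with p(c2 - c1) and p(n - c2) degrees of freedom, where W1 and W2 are the within-cluster sums of squared distances, n is the number of observations, and p is the number of variables. The sketch below implements that form in Python. It is an assumption that the R code in the Appendix uses the same form, and the W values in the example are hypothetical.

```python
def beale_f(w1, w2, n, p, c1, c2):
    """Beale's F-type statistic for c1 versus c2 clusters (c2 > c1).

    w1, w2: within-cluster sums of squared distances for c1 and c2 clusters.
    n: number of observations; p: number of variables.
    Returns (F, df1, df2). A large F favors the c2-cluster solution.
    """
    if not c2 > c1:
        raise ValueError("c2 must be larger than c1")
    numerator = (w1 - w2) / w2
    denominator = ((n - c1) / (n - c2)) * (c2 / c1) ** (2.0 / p) - 1.0
    return numerator / denominator, p * (c2 - c1), p * (n - c2)


# Hypothetical sums of squares; n = 70 cities, p = 6 monthly temperatures.
f_stat, df1, df2 = beale_f(w1=100.0, w2=50.0, n=70, p=6, c1=7, c2=11)
print(round(f_stat, 3), df1, df2)
```

The statistic would then be compared with an F critical value at the chosen level, taken from a table or a statistics library.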

Since parts a-f are all clustered in the same manner, I will summarize all of the results in a tabulated
form below.

Cities clustered by . . .       # of clusters      # of clusters      # indicated from    Decided number
                                indicated by CCC   indicated by       Beale's F-type      of clusters
                                                   Pseudo T2          statistic

Monthly avg. Temperatures       7, 11              11                 11                  11
Monthly avg. Ozone              3, 5, 7, 9, 11     3, 7, 9, 11        11                  11
Monthly avg. Temps. and Ozone   4, 6, 10           4, 10              10                  10
First 2 Temperature PC's        6, 11, 13          SAS did not        13                  13
                                                   calculate
First 4 Ozone PC's              3, 7, 11           3, 7, 11, 15       15                  15
First 2 Temp. PC's and          4, 6, 10           4, 7, 10, 12       12                  12
first 4 ozone PC's

From these data, the following conclusions can be made: Each of our cities can be put into one of 11
groups, where the cities within a group have similar monthly average temperatures. Each city can also
be put into one of 11 groups, where the cities within a group have similar monthly average ozone. They
can also be put into one of 10 groups, where the cities within a group have similar monthly average
temperatures and ozone. If we use the first few PC’s from temperature and ozone which explain most
of the variance of the original variables, we can put each city in one of 13 groups, where the cities within
a group have similar average monthly temperatures, or 15 groups, where the cities in a group have
similar average monthly ozone, or 12 groups, where the cities within a group have similar average
monthly temperatures and ozone. To find which city belongs to which group, for each of these 6
grouping methods, see the exhaustive SAS output.

Question 6:

It is of interest to find the correlation between temperature and ozone. To learn more about how these
two things are correlated, I used canonical correlation analysis. First, I used all the monthly average
temperatures, and correlated those to the set of variables consisting of the monthly average ozone.
Second, I used only the first two PC’s of monthly average temperature, and correlated those to the set
of the first four PC’s of monthly average ozone. Note that in the SAS output, you can find the canonical
correlation coefficients, as well as the correlation matrices for all four sets of variables, and the
correlation matrices for each pair of sets of variables (i.e. for temperatures and ozone, and for the PC’s
of temperatures and ozone). For both analyses, I report results based on the first canonical variables,
since they explain most of the correlation.

For the first case, I find that the canonical correlation between the set of all monthly average
temperatures and the set of all monthly average ozone values is 0.965. This implies that there is a very
high correlation between temperature and ozone. Since it is positive, as the temperature canonical
variable (Temp1) increases, we can feel quite certain that the ozone canonical variable (Oz1) will also
increase. Furthermore, we can see which months of temperature are more correlated with the average
monthly ozone, and vice versa. First, temperatures for April and August, followed by May, have the
strongest correlation with the canonical variable Temp1. They also have the strongest (positive)
correlation with the canonical variable Oz1. We may conclude from this that the average temperatures
for April, August, and May have the strongest correlation with the monthly average ozone from April to
September. Second, looking at the monthly average ozone variables, we see that the average ozone in
June and July has the strongest (negative) correlation with the canonical variable Oz1. These two
months also have the strongest (negative) correlation with the canonical variable Temp1. We may
conclude that the average ozone during June and July has the strongest correlation with the average
monthly temperatures from April to September.

For the second case, I find that the canonical correlation between the first two PC's of monthly average
temperature and the first four PC's of monthly average ozone is 0.823. This implies that there is a
strong positive correlation between the PC temperature canonical variable (PCTemp1) and the PC ozone
canonical variable (PCOz1). As one would expect, the first PC for temperature has the strongest
(negative) correlation with PCTemp1, and the first two PC's for ozone have the strongest correlation
with PCOz1 (positive for PCO1 and negative for PCO2). Similarly, the first PC for temperature has the
strongest (negative) correlation with PCOz1, and the first two PC's for ozone have the strongest
correlation with PCTemp1 (positive for PCO1 and negative for PCO2). This implies that the first linear
combination of average monthly temperatures (PCT1) has the strongest correlation with the average
monthly ozone from April to September, and the first two linear combinations of average monthly
ozone (PCO1 and PCO2) have the strongest correlation with average monthly temperatures from April
to September.

Below are two plots of the canonical variable scores, first using the monthly means, and second using
the PC’s.

[Figure: Plot of Temp1*Oz1 (canonical variable scores from the monthly means). Vertical axis: Temp1; horizontal axis: Oz1; both run from -2.0 to 2.0. Legend: A = 1 obs, B = 2 obs, etc.]

[Figure: Plot of PCTemp1*PCOz1 (canonical variable scores from the PC's). Vertical axis: PCTemp1; horizontal axis: PCOz1. Legend: A = 1 obs, B = 2 obs, etc.]

The plots show visually that the canonical correlation is stronger when using the monthly means of
temperature and ozone than when using the PC's. The canonical variable scores can be seen in the full
SAS output.

Question 7:

We are now interested in being able to classify a region as having a coastline or not having a coastline by
various sets of variables. The end result of the following analysis will be that if we are given the monthly
averages for temperature and ozone for a region, we will be able to predict whether it is coastal or not.
Three methods will be applied to 6 different sets of variables, and the misclassification rates will
determine which rule and which set of variables are most effective for determining if a region is coastal
or not. The cross-validation technique will be used to calculate misclassification rates. The following
table summarizes the results, which can be found in detail in the SAS output. For each method and each
variable set, there is a complete listing of the cities and their classification/misclassification which is
found in the SAS output.

VARIABLES                        LINEAR DISCRIMINANT   K NEAREST NEIGHBOR   LOGISTIC
                                 FUNCTION              (K=5)                REGRESSION

TempApril to TempSep             13/70                 12/70                7/70
O3April to O3Sep                 21/70                 18/70                17/70
TempApril to TempSep
and O3April to O3Sep             10/70                 9/70                 0/70
PCT1 and PCT2                    14/70                 20/70                14/70
PCO1 to PCO4                     21/70                 18/70                22/70
PCT1, PCT2 and PCO1 to PCO4      15/70                 18/70                12/70
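
The comparison across the table can be made explicit by turning each count into a rate and taking the minimum. A small sketch using the counts from the table above (the dictionary layout and names are illustrative only):

```python
N_CITIES = 70  # every cell in the table is a count out of 70 cities

# Misclassified counts copied from the table above.
misclassified = {
    "TempApril to TempSep": (13, 12, 7),
    "O3April to O3Sep": (21, 18, 17),
    "TempApril to TempSep and O3April to O3Sep": (10, 9, 0),
    "PCT1 and PCT2": (14, 20, 14),
    "PCO1 to PCO4": (21, 18, 22),
    "PCT1, PCT2 and PCO1 to PCO4": (15, 18, 12),
}
METHODS = ("Linear discriminant function", "K nearest neighbor (K=5)",
           "Logistic regression")

# Misclassification rate for each (variable set, method) pair.
rates = {
    (variables, method): count / N_CITIES
    for variables, counts in misclassified.items()
    for method, count in zip(METHODS, counts)
}

best = min(rates, key=rates.get)
print(best, rates[best])
```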

As the table shows, the best way to classify a region as coastal or not coastal is to use all of the
monthly means (the average monthly temperatures and the average monthly ozone) with logistic
regression, which has the lowest cross-validated misclassification rate (0 of 70). With this combination,
we have the highest estimated probability of correctly classifying a region as coastal or not coastal.
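
The counts in the table come from the SAS output. As a rough illustration of how a cross-validated count arises for the K nearest neighbor rule with K = 5, the sketch below leaves out one city at a time, classifies it from the remaining cities, and counts the errors. It assumes leave-one-out cross-validation and Euclidean distance, which the report does not state explicitly; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, point, k=5):
    """Majority label among the k training points nearest to `point`."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], point))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loo_misclassified(xs, ys, k=5):
    """Number of points misclassified under leave-one-out cross-validation."""
    errors = 0
    for i in range(len(xs)):
        # Hold out observation i and classify it from all the others.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if knn_predict(train_x, train_y, xs[i], k) != ys[i]:
            errors += 1
    return errors


# Toy data: two well-separated groups (hypothetical, not the exam data).
inland = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (0.2, 0.8)]
coastal = [(x + 10.0, y + 10.0) for x, y in inland]
features = inland + coastal
labels = [0] * len(inland) + [1] * len(coastal)

errors = loo_misclassified(features, labels, k=5)
print(f"{errors}/{len(features)} misclassified")
```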
