EXAM 2
Dan Sewell
Question 1:
a. Mortalities per 100,000 were calculated in SAS (see the code in the Appendix). The first few
observations are shown below; the entire data set can be found in the exhaustive SAS output (see
Appendix).
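The SAS code itself is in the Appendix and is not reproduced here. As a minimal sketch of the calculation, assuming each rate is a death count divided by the population and scaled to 100,000 residents, the arithmetic looks like this. The function name and the numbers are hypothetical and are not taken from the data set.

```python
def rate_per_100k(deaths: float, population: float) -> float:
    """Deaths per 100,000 residents (assumed definition of the rate)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return deaths / population * 100_000


# Hypothetical city: 450 deaths in a population of 1,200,000.
example_rate = rate_per_100k(450, 1_200_000)
```

Scaling to a common base is what makes cities of very different sizes comparable in the analyses that follow.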
b. Monthly average temperatures and ozone levels, from April to September, were computed using
Microsoft Excel (=average(…)). The resulting data set was then imported into SAS. The first few
observations are below, but the entire data set can be found in the exhaustive SAS output.
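The averaging was done in Excel, so the following is only an equivalent sketch of the same step. It assumes the raw readings can be laid out as (city, month, value) records, which is a hypothetical layout and not the layout of the original spreadsheet; the city name and readings are made up.

```python
from collections import defaultdict
from statistics import mean


def monthly_averages(records):
    """Average readings by (city, month).

    `records` is an iterable of (city, month, value) tuples; this layout
    is an assumption made for illustration.
    """
    grouped = defaultdict(list)
    for city, month, value in records:
        grouped[(city, month)].append(value)
    return {key: mean(values) for key, values in grouped.items()}


# Hypothetical temperature readings for one city.
readings = [
    ("CityA", "April", 58.0),
    ("CityA", "April", 62.0),
    ("CityA", "May", 70.0),
]
averages = monthly_averages(readings)
```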
Question 2:
Are mortalities related to coastline? To answer this question, I ran MANOVA, with the following model:
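\[
\begin{bmatrix} \text{newcopd}_i & \text{newcvd}_i & \text{newpneu}_i & \text{newresp}_i \end{bmatrix}
=
\begin{bmatrix} 1 & \text{coast}_i \end{bmatrix}
\begin{bmatrix}
\beta_{01} & \beta_{02} & \beta_{03} & \beta_{04} \\
\beta_{11} & \beta_{12} & \beta_{13} & \beta_{14}
\end{bmatrix}
+
\begin{bmatrix} \varepsilon_{i1} & \varepsilon_{i2} & \varepsilon_{i3} & \varepsilon_{i4} \end{bmatrix}
\]
for i = 1, …, 70.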
I used Wilks’ Lambda to determine whether to reject or fail to reject the null hypothesis that coastline
has no effect (i.e. $H_0: \beta_1 = 0$, where $\beta_1$ is the row of coast coefficients). Less relevant to
the question but still tested was the null hypothesis that the intercept is zero (i.e. $H_0: \beta_0 = 0$).
The p-values from Wilks’ Lambda for these two tests are 0.003 and less than 0.001, respectively. This
implies that mortality rates do differ with geography (specifically, with whether a region lies next to
the coast). Since the mortality rates differ depending on region, it is important to look at the means of
the mortality rates for each of the two locations.
coast  N Obs  Variable   N         Mean      Std Dev      Minimum      Maximum
------------------------------------------------------------------------------
0         32  newcopd   32   46.9256059   10.4654806   24.8226145   74.4175238
              newcvd    32  342.2685421  104.0352160  170.4295948  706.4312769
              newpneu   32   26.9604181    6.1229006   14.3832579   39.9480052
              newresp   32   74.3137923   13.2431913   45.9742789   99.4803952
We can see that, for each type of death, the mean mortality rates are higher for cities that are not
along a coastline. I conclude that there is a higher probability of a chronic obstructive pulmonary
disease death, a cardiovascular death, a pneumonia death, or a respiratory death if one lives away from
the coastline.
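For reference, with a single two-level covariate such as coast, the Wilks' Lambda test above is the same as a one-way, two-group MANOVA test: Lambda is the determinant of the within-group sums of squares and cross-products matrix divided by the determinant of the total one, and values near 0 indicate well-separated groups. The sketch below illustrates that ratio in plain Python. It is not the SAS computation, and the two small groups with two responses each are hypothetical numbers, not mortality data.

```python
def col_means(rows):
    """Column means of a list of equal-length tuples."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sscp(rows, center):
    """Sums of squares and cross-products of `rows` about `center`."""
    p = len(center)
    out = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [x - c for x, c in zip(row, center)]
        for j in range(p):
            for k in range(p):
                out[j][k] += d[j] * d[k]
    return out


def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(a[r][i]))
        if a[pivot][i] == 0.0:
            return 0.0
        if pivot != i:
            a[i], a[pivot] = a[pivot], a[i]
            result = -result
        result *= a[i][i]
        for r in range(i + 1, n):
            factor = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= factor * a[i][c]
    return result


def wilks_lambda(groups):
    """Wilks' Lambda = det(within SSCP) / det(total SSCP)."""
    all_rows = [row for g in groups for row in g]
    p = len(all_rows[0])
    total = sscp(all_rows, col_means(all_rows))
    within = [[0.0] * p for _ in range(p)]
    for g in groups:
        s = sscp(g, col_means(g))
        for j in range(p):
            for k in range(p):
                within[j][k] += s[j][k]
    return det(within) / det(total)


# Hypothetical two-response data for two groups (not the exam data).
group_a = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0)]
group_b = [(4.0, 5.0), (5.0, 4.0), (6.0, 6.0)]
lam = wilks_lambda([group_a, group_b])
```

Turning Lambda into a p-value requires an F approximation, which SAS reports and which is left out of this sketch.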
Diagnostic checks were run for multivariate normality and for equal covariance matrices. The chi-squared
test for homogeneity of the covariance matrices leads us to fail to reject the null hypothesis (equal
covariance matrices) at the 0.05 level. However, when the Henze-Zirkler test was run on the residuals, it
indicated that they are not multivariate normal.
Question 3:
We wish to better see the underlying variation structure in the monthly averages of temperature and
ozone. To this end, I perform principal component analysis on both temperature and ozone.
First, with monthly average temperature, I find that 96.04% of the variation of the data is explained by
the first two PC’s. Further evidence for choosing to use just the first two PC’s comes from the following
scree plot, where the elbow is at 2:
The following table shows how each monthly average temperature is correlated with the first two PC’s.
[Scree plot: Eigenvalues (vertical axis, 0.0 to 2.5) against component Number (horizontal axis), with
points plotted for components 1 through 6.]
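The percentages of variation quoted in this section and the eigenvalues shown on a scree plot are two views of the same numbers: the share explained by the first k PC's is the sum of the k largest eigenvalues divided by the sum of all of them. A minimal sketch follows. The eigenvalues are hypothetical, chosen to sum to 6 as they would for six standardized monthly variables, and are not the values from the SAS output.

```python
def cumulative_variance_explained(eigenvalues):
    """Cumulative share of total variance, PCs ordered by eigenvalue."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    shares = []
    running = 0.0
    for value in ordered:
        running += value
        shares.append(running / total)
    return shares


# Hypothetical eigenvalues for six standardized variables.
eigenvalues = [3.6, 1.8, 0.3, 0.15, 0.1, 0.05]
shares = cumulative_variance_explained(eigenvalues)
```

Here `shares[1]` is the fraction of variance explained by the first two components, which is the kind of figure reported in the text.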
The first 3 PC’s account for 88.24% of the variance in average monthly ozone, and 4 PC’s account for
94.38% of the variance. The choice is rather subjective, but I will use 4 PC’s in the analyses that
follow. Correlations between each monthly ozone average and each of the first four PC’s are
given below.
Question 4:
In order to understand the relationships and associations between mortality and monthly temperatures,
monthly ozone levels, and geography (coastline or no coastline), I set up a regression model with four
response variables (the four mortality rates) and 13 covariates (14 including the intercept). The
following are the regression coefficients for each of the four mortality variables:
For Chronic Obstructive Pulmonary Disease:
Intercept 20.28952
TempApril -0.19539
TempMay -0.48570
TempJune -0.16807
TempJuly 1.12896
TempAug -1.52321
TempSep 1.61922
O3April 0.15609
O3May -0.03675
O3June -0.31666
O3July -0.16258
O3Aug -0.15963
O3Sep 0.29670
coast -12.50162
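For reference, the model behind these coefficients can be written in matrix form. This is a sketch that assumes the same 70 regions as in Question 2; the symbols Y, X, B, and E are notation introduced here and do not come from the SAS output.

```latex
\[
\underset{70 \times 4}{Y}
\;=\;
\underset{70 \times 14}{X}\;\underset{14 \times 4}{B}
\;+\;
\underset{70 \times 4}{E},
\qquad
\hat{B} = (X^{\top} X)^{-1} X^{\top} Y .
\]
```

Each column of B holds the 14 coefficients (intercept, six temperatures, six ozone averages, and coast) for one mortality rate, so the list above for chronic obstructive pulmonary disease corresponds to one column of the least-squares estimate.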
Next, we use only the PC’s (specifically, PCT1, PCT2, PCO1, PCO2, PCO3, and PCO4) as covariates, along
with coast, to construct much the same model. The following are the regression coefficients for this
model.
Again using Wilks’ Lambda for our tests, we test the null hypothesis that none of the covariates has
any effect on mortality rates. The test yields a p-value of 0.008, so we reject the null hypothesis,
concluding that at least one of the covariates affects mortality rates. The results for individual
covariate tests are given in the same fashion as for the first regression model.
Finally, one last regression model is fit, with only two covariates, PCO3 and coast, to check whether
we can really say that ozone has an effect on the mortality rates. With this model, we
get the following regression coefficients:
The overall model is significant, as Wilks’ Lambda gave a p-value of less than 0.0001. Individual tests
show that both covariates are significant, as the p-value for PCO3 is 0.0166 and the p-value for coast is
0.0006. This indicates that the four mortality rates are affected by both geography and the monthly
average ozone.
Diagnostics were run to check whether the residuals were multivariate normal. To assess the
multivariate normality of the residuals, I used the Henze-Zirkler test, which indicated that they were in
fact not normal.
Question 5:
In order to see which cities have similar average monthly temperatures, I performed both hierarchical
and non-hierarchical clustering methods. First, I performed a hierarchical clustering method based on
average linkage. The problem then lies in how many clusters to choose. The plot of the first
two PC’s does not give any clear idea as to how many clusters one should choose, so I looked at
both the Cubic Clustering Criterion (CCC) and the pseudo Hotelling’s $T^2$ statistic. On the plot of CCC
vs. number of clusters, there appear to be peaks at 11 clusters and at 7 clusters. The pseudo $T^2$ also
supports choosing 11 clusters, since it jumps from 4.9 (obtaining 11 clusters) to 41.7
(obtaining 10 clusters). As a final check, I used a non-hierarchical method, with Beale’s F-type statistic,
to determine whether I should choose 7 or 11 clusters. Beale’s F-type statistic for this comparison is
16.9, which leads us to conclude that 11 clusters is better. The statistic is computed and
compared to the F value using the R code in the appendix. See the SAS output for a plot of the first 2
temperature PC’s by cluster, and for a list of cities sorted by their cluster. So I can say that all of our
cities fall into one of 11 groups based on their average monthly temperature.
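The R code for Beale's statistic is in the appendix and is not shown here. As a hedged sketch, the statistic as it is commonly stated compares the within-cluster sums of squares of a smaller and a larger cluster solution, and is referred to an F distribution with p(c2 - c1) and p(n - c2) degrees of freedom. The function name `beale_f` and the sums of squares below are hypothetical. The values n = 70 and p = 6 assume the 70 regions and six monthly temperature averages used elsewhere in this report.

```python
def beale_f(w_small, w_large, c_small, c_large, n, p):
    """Beale's F-type statistic for c_small versus c_large clusters.

    w_small, w_large: within-cluster sums of squares of each solution.
    n: number of observations; p: number of clustering variables.
    Returns (F, df1, df2).
    """
    if not (0 < c_small < c_large < n):
        raise ValueError("need 0 < c_small < c_large < n")
    if w_large <= 0:
        raise ValueError("w_large must be positive")
    numerator = (w_small - w_large) / w_large
    scale = (n - c_small) / (n - c_large)
    denominator = scale * (c_large / c_small) ** (2.0 / p) - 1.0
    df1 = p * (c_large - c_small)
    df2 = p * (n - c_large)
    return numerator / denominator, df1, df2


# Hypothetical within-cluster sums of squares for 7 and 11 clusters.
f_stat, df1, df2 = beale_f(300.0, 100.0, 7, 11, n=70, p=6)
```

A large value of the statistic relative to the F critical value favors the solution with more clusters, which is the direction of the conclusion drawn above.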
From these data, the following conclusions can be made: Each of our cities can be put into one of 11
groups, where the cities within a group have similar monthly average temperatures. Each city can also
be put into one of 11 groups, where the cities within a group have similar monthly average ozone. They
can also be put into one of 10 groups, where the cities within a group have similar monthly average
temperatures and ozone. If we use the first few PC’s from temperature and ozone which explain most
of the variance of the original variables, we can put each city in one of 13 groups, where the cities within
a group have similar average monthly temperatures, or 15 groups, where the cities in a group have
similar average monthly ozone, or 12 groups, where the cities within a group have similar average
monthly temperatures and ozone. To find which city belongs to which group, for each of these 6
grouping methods, see the exhaustive SAS output.
Question 6:
It is of interest to find the correlation between temperature and ozone. To learn more about how these
two things are correlated, I used canonical correlation analysis. First, I used all the monthly average
temperatures, and correlated those to the set of variables consisting of the monthly average ozone.
Second, I used only the first two PC’s of monthly average temperature, and correlated those to the set
of the first four PC’s of monthly average ozone. Note that in the SAS output, you can find the canonical
correlation coefficients, as well as the correlation matrices for all four sets of variables, and the
correlation matrices for each pair of sets of variables (i.e. for temperatures and ozone, and for the PC’s
of temperatures and ozone). For both analyses, I report results based on the first canonical variables,
since they explain most of the correlation.
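As a reminder of what is being reported, the first canonical correlation is the largest correlation attainable between a linear combination of one set of variables and a linear combination of the other. The statement below is the standard textbook one, written with notation introduced here (x for the temperature variables, y for the ozone variables, and Σ for the blocks of their joint covariance matrix) and not taken from the SAS output.

```latex
\[
\rho_1 \;=\; \max_{a,\,b}\ \operatorname{corr}\!\left(a^{\top} x,\; b^{\top} y\right),
\qquad
\rho_1^{2} \;=\; \lambda_{\max}\!\left(\Sigma_{xx}^{-1}\,\Sigma_{xy}\,\Sigma_{yy}^{-1}\,\Sigma_{yx}\right).
\]
```

The maximizing combinations are the first pair of canonical variables, and the correlations of the original variables with those combinations are what the month-by-month interpretation relies on.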
For the first case, I find that the canonical correlation between the set of all monthly average
temperatures and the set of all monthly average ozone levels is 0.965. This implies that there is a very
high correlation between temperature and ozone. Since it is positive, as the temperature canonical
variable (Temp1) increases, the ozone canonical variable (Oz1) tends to increase as well.
Furthermore, we can see which months of temperature are more correlated with the average monthly
ozone, and vice versa. First, temperatures for April and August, followed by May, have the strongest
correlation with the canonical variable Temp1. They also have the strongest (positive) correlation with
the canonical variable Oz1. We may conclude from this that the average temperatures for April, August,
and May have the strongest correlation with the monthly average ozone from April to September.
Second, by looking at the monthly average ozone variables, we see that the average ozone levels in June
and July have the strongest (negative) correlation with the canonical variable Oz1. They also have the
strongest (negative) correlation with the canonical variable Temp1. We may conclude that the
average ozone levels during June and July have the strongest correlation with the average monthly
temperatures from April to September.
For the second case, I find that the canonical correlation between the first two PC’s of monthly average
temperatures and the first four PC’s of monthly average ozone is 0.823. This implies that there is a
strong positive correlation between the PC temperature canonical variable (PCTemp1) and the PC ozone
canonical variable (PCOz1). As one would expect, the first PC for temperature has the highest (negative)
correlation with PCTemp1, and the first two PC’s for ozone have the strongest (PCO1 is positive, and
PCO2 is negative) correlation with PCOz1. Similarly, the first PC for temperature has the strongest
(negative) correlation with PCOz1, and the first two PC’s for ozone have the strongest (PCO1 is positive
and PCO2 is negative) correlation with PCTemp1. This implies that the first linear combination of average
monthly temperatures (PCT1) has the strongest correlation with the average monthly ozone from April
to September, and the first two linear combinations of average monthly ozone (PCO1 and PCO2) have
the strongest correlation with average monthly temperatures from April to September.
Below are two plots of the canonical variable scores, first using the monthly means, and second using
the PC’s.
[Plot of canonical variable scores: Temp1 (vertical axis, -2.0 to 2.0) against Oz1 (horizontal axis,
-2.0 to 2.0).]
[Plot of PCTemp1*PCOz1. Legend: A = 1 obs, B = 2 obs, etc. PCTemp1 on the vertical axis (-3 to 2),
PCOz1 on the horizontal axis (-2.5 to 2.0).]
The plots give a good visual reference for the fact that the canonical correlation was stronger when
using the monthly averages of temperature and ozone than when using the PC’s. The canonical
variable scores can be seen in the full SAS output.
Question 7:
We are now interested in being able to classify a region as having a coastline or not having a coastline by
various sets of variables. The end result of the following analysis will be that, given the monthly
averages for temperature and ozone for a region, we will be able to predict whether it is coastal or not.
Three methods will be applied to 6 different sets of variables, and the misclassification rates will
determine which rule and which set of variables are most effective for determining if a region is coastal
or not. Cross-validation will be used to calculate the misclassification rates. The following
table summarizes the results, which can be found in detail in the SAS output. For each method and each
variable set, the SAS output contains a complete listing of the cities and their
classification/misclassification.
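To make the cross-validated misclassification rate concrete, here is a minimal leave-one-out sketch: each region is held out in turn, the rule is built from the remaining regions, and the held-out region is then classified. The nearest-centroid rule used here is only a stand-in chosen to keep the example short; it is not claimed to be one of the three methods compared in this question. The data are hypothetical (temperature, ozone) pairs.

```python
def nearest_centroid_predict(train_x, train_y, point):
    """Predict the label whose class mean is closest to `point`."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        dist = sum((a - b) ** 2 for a, b in zip(point, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_misclassification_rate(xs, ys, predict=nearest_centroid_predict):
    """Leave-one-out: hold out each observation, train on the rest."""
    errors = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) != ys[i]:
            errors += 1
    return errors / len(xs)


# Hypothetical (temperature, ozone) pairs; 1 = coastal, 0 = not coastal.
xs = [(60.0, 30.0), (62.0, 32.0), (61.0, 29.0),
      (75.0, 45.0), (77.0, 47.0), (64.0, 33.0)]
ys = [1, 1, 1, 0, 0, 0]
rate = loo_misclassification_rate(xs, ys)
```

Because the held-out region never contributes to the rule that classifies it, this rate is a less optimistic estimate than the error rate obtained by reclassifying the training data.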
As the table shows, the best way to correctly classify a region as coastal or not coastal is to use all
of our monthly means (our average monthly temperatures and our average monthly ozone) with
logistic regression. In this manner, we have the highest probability of correctly classifying a region as
coastal or not coastal.