
Homework 1 | Joao Wench Milanezi, Savio Lorentz, Leila Rohd-Thomsen, Mohan Ru

Homework 1
BUS41100 Applied Regression Analysis
Due Oct 3, 2017
Team: Joao Wench Milanezi, Savio Lorentz, Leila Rohd-Thomsen, Mohan Ru

1. Sample statistics and the regression coefficients


(a) If the sample variance for X is 1, the sample variance for Y is 2, and the sample
correlation is 0.7, what is the slope of the least squares line?

\[ b_1 = r_{xy}\,\frac{s_y}{s_x} = 0.7 \times \frac{\sqrt{2}}{\sqrt{1}} \]

\[ b_1 \approx 0.9899 \]

(b) If the sample means for X and Y are 0 and 2 respectively, what is the intercept of this
line?

The sample means lie on the regression line, so we have

\[ \bar{Y} = b_0 + b_1 \bar{X} \]
\[ 2 = b_0 + 0.9899 \times 0 \]
\[ b_0 = 2 \]
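The two calculations in Question 1 can be checked numerically. The following is a minimal sketch in R; the variable names are illustrative and not part of the assignment.

```r
# Summary statistics given in Question 1
var_x <- 1      # sample variance of X
var_y <- 2      # sample variance of Y
r_xy  <- 0.7    # sample correlation
x_bar <- 0      # sample mean of X
y_bar <- 2      # sample mean of Y

# Least-squares slope: correlation times the ratio of standard deviations
b1 <- r_xy * sqrt(var_y) / sqrt(var_x)

# Intercept: the fitted line passes through (x_bar, y_bar)
b0 <- y_bar - b1 * x_bar

round(c(b0 = b0, b1 = b1), 4)
```

Because the mean of X is zero, the intercept equals the mean of Y regardless of the slope.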

2. Always Look at Scatter Plots


The file scatterplots.csv on the course site contains 4 pairs of Xs and Ys. For each pair:

(a) Compute the correlation

Pairs     Correlation
X1, Y1    0.8164
X2, Y2    0.8162
X3, Y3    0.8163
X4, Y4    0.8165

(b) Show a scatter plot for each pair along with the least-squares regression line.
plot(data$X1, data$Y1, main="Y1,X1",pch=20)
abline(lm(data$Y1 ~ data$X1),lwd=1.5)


If you were to use these regression models to make a business decision which one would you trust?

We would most likely base our business decision on the model between X1 and Y1 because:

- The scatterplot of X2 and Y2 suggests that a non-linear relationship may better describe the
true relationship between the variables.
- There is a significant outlier in the scatterplot of Y3 and X3, biasing the regression line
upward. We need to understand whether there's a business case to exclude the outlier, thereby
obtaining a different regression that better describes the remaining data points.
- There is a significant outlier in the scatterplot of Y4 and X4. Except for the outlier, the
sample of X4 has zero variance, and so the regression line cannot explain the variance in
Y4 meaningfully.

Hence, the regression line for X1 and Y1 is the only one that seems to represent the real relationship
between the variables, and it is the model we would probably trust when making a business decision.
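The correlations reported in (a) coincide, to four decimals, with those of Anscombe's quartet, which ships with R as the `anscombe` data frame. As a sketch only, and assuming (the assignment does not confirm this) that scatterplots.csv holds the same data, the built-in data set illustrates how four very different scatterplots can share nearly the same correlation and fitted line.

```r
# Built-in Anscombe data: columns x1..x4 and y1..y4
# (used here as an assumed stand-in for scatterplots.csv)
data(anscombe)

# Correlation and least-squares coefficients for each of the four pairs
pair_stats <- sapply(1:4, function(k) {
  x <- anscombe[[paste0("x", k)]]
  y <- anscombe[[paste0("y", k)]]
  fit <- lm(y ~ x)
  c(correlation = cor(x, y),
    intercept   = unname(coef(fit)[1]),
    slope       = unname(coef(fit)[2]))
})
colnames(pair_stats) <- paste0("pair", 1:4)
round(pair_stats, 4)
```

The summary numbers are almost indistinguishable across the four pairs, which is why the scatterplots, not the coefficients, decide which model to trust.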


3. Teacher Salary Exploratory Analysis

The teach.csv data contains information on salary (in 1971 $ Sterling) for n = 90 teachers in
the United Kingdom, along with the following characteristics of the teachers and the schools
they work in: number of months of service (minus 12); sex (M/F); marry, indicating
(TRUE/FALSE) whether the female teachers were married or not; type of degree offered to
graduates ({0; 1; 2; 3} with 3 being the "highest" type of degree); type of school (A/B);
whether or not the teacher had special training (TRUE/FALSE); and brk, indicating whether
or not the teacher had a break in service for two or more years (TRUE/FALSE). You are
going to explore how these variables affect teacher pay.

(a) Make a plot of salary versus the number of months in service using color, or otherwise, to
indicate the sex of each teacher on the plot. Comment on what you see, and why the original
article published using this data may have been called "Sex differentials in teachers'
pay" (Turnbull & Williams; JRSSA 1974).

The scatterplot seems to suggest a differential in pay between the genders. On average,
men earn more than women who have a comparable length of work experience. This
differential seems to become more pronounced among those with most experience.

- In the sample, the average salary for men is higher than for women.
- Within any given range of working experience, men seem to earn more than
women.


- Those who earn top salaries are predominantly men: the top 7 earners are all men.
Conversely, in the lowest salary group there are many more women than men.
- The pay gap between men and women widens at higher levels of experience. Up
to around 200 months of working experience, salary seems to increase with
experience at a comparable rate for both men and women. Beyond 200 months,
however, men seem to experience a faster salary increase with work experience
than their female counterparts.
- In the lower tenure range (0-100 months), there are many more women than men.
(b) Now, ignore months and produce six sets of boxplots, one set for each other factor (sex,
marry, degree, type, train, and break), showing the conditional distribution of salary for
each level of each factor. Which seems to have the strongest effect on teacher salary?

Degree seems to have the strongest effect on teachers' salary. The median top-degree
holder (i.e. degree = 3) earns approx. GBP 1000 more than the median lowest-degree
holder (i.e. degree = 0). The interquartile range is also compressed for those with degree = 3
compared to those with degree = 0.
In addition to degree, training also seems to have a considerable effect on teacher salary.
Teachers with special training earn around GBP 500 more than teachers without special
training.

(c) Using color, or otherwise, plot salary versus months in service [similar to (a)] with
indications for the levels of your chosen factor [from (b)] for each teacher. How does this new
plot compare with the plot from (a), and what do you conclude based on this new evidence?

Comparing the plots, we see that those people who earn disproportionately more than
others with comparable experience are also those who have higher degrees. The previous
scatterplot highlighted these high earners to be predominantly men, but this new
information calls into question whether it is gender or education level that is driving the
pay differential.
Across a range of working experience, the section of highest earners seems to correlate
more consistently with the level of education, than it does with gender.
The scatterplot in 3(a) focuses on just one factor (i.e. gender) and its explanatory power on
salary differential. In fact, as we see in the new evidence, more factors may contribute.

(d) Reconsider the questions in (c) through regression. That is, run two regressions: salary on
your chosen factor and then again on sex.

(i) Explicitly write down the regression model you are fitting in each case.


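Regression 1: Salary on Degree

\[ \text{Salary}_i = \beta_0 + \beta_1\,\text{Degree}_i + \varepsilon_i \]

Regression 2: Salary on Sex

\[ \text{Salary}_i = \beta_0 + \beta_1\,\text{Sex}_i + \varepsilon_i \]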

(ii) How do you interpret the slope coefficient in the regression on sex?

The estimated slope coefficient is 283.81. This indicates that, on average, a male teacher
earns GBP 283.81 more than a female teacher.
Note: The variable [male, female] was automatically converted to a dummy variable
in R where male=1 and female=0 (i.e. teacherdata$sexM).
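The interpretation of a dummy-variable slope as a difference in group means can be verified on a small example. The sketch below uses made-up salaries (not the teach.csv data) and hypothetical object names.

```r
# Hypothetical salaries (GBP) for three female and three male teachers
toy <- data.frame(
  salary = c(1500, 1700, 1600, 2000, 2100, 1900),
  sex    = factor(c("F", "F", "F", "M", "M", "M"))
)

# R codes the factor as a 0/1 dummy named sexM (F is the baseline level)
fit <- lm(salary ~ sex, data = toy)
slope <- unname(coef(fit)["sexM"])

# Difference between the two group means
gap <- mean(toy$salary[toy$sex == "M"]) - mean(toy$salary[toy$sex == "F"])

c(slope = slope, gap = gap)
```

The slope and the gap in means coincide, and the intercept is the mean salary of the baseline group.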
(iii) Do you think that these factors make a meaningful difference in teacher's pay? What
is your evidence?

The p-values for Regression 1 and 2 in (i) are 0.00523 and 2.59*10^-8 respectively.
Given the very low p-values, we can:

- Reject the null hypothesis that the coefficient on degree is zero in Regression 1
- Reject the null hypothesis that the coefficient on sex is zero in Regression 2

Therefore, both these factors seem to have a non-zero, meaningful impact on teachers'
pay: a male teacher on average earns GBP 283.81 more than a female teacher, and each
level of higher degree on average increases pay by GBP 298.98.

(iv) Compare your results to the boxplots in (b).

The regression models agree with the box plots in that both show the two factors have a
meaningful impact on teachers' pay. The box plots show substantially different
distributions conditional on different values of the factors, and the regression models
show statistically significant coefficients.

(v) (Looking ahead a little.) Run your first multiple linear regression (MLR) by regressing
salary on sex and your chosen factor. What do you learn from the slope coefficients in
this regression? How does this compare with what the two separate regressions
indicated?

If you chose marry in (b), the R code for the MLR would be
> my.first.mlr <- lm(salary ~ sex + marry)
> summary(my.first.mlr)


The coefficient on the sex variable is much smaller in the multiple linear regression (150.61)
than it is in the single-variable regression (283.81). This implies that when holding
education level constant, the pay differential between men and women decreases
significantly.
Furthermore, the coefficient is also much less statistically significant. In fact, at the 5%
significance level, we cannot reject the hypothesis that the coefficient is zero.
This implies that if we control for the effect of education, we cannot say for sure that gender
has a meaningful impact on salary.
The coefficient on degree is only slightly smaller in the multiple linear regression (275.87)
compared to the single-variable regression (298.98). The coefficient also remains statistically
significant. Holding gender constant, we can still conclude that having a more advanced
degree has a positive and meaningful impact on salary.
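The shrinking sex coefficient is what one would expect when sex and degree are correlated. As an illustration only, the toy data below (entirely made up, with salary depending on degree alone) shows a simple regression on sex absorbing the degree effect, which then disappears once degree enters the model.

```r
# Hypothetical data: men are more likely to hold the higher degree,
# and salary depends on degree only (no direct effect of sex)
toy <- data.frame(
  male   = rep(c(0, 1), each = 4),
  degree = c(0, 0, 0, 1,  0, 1, 1, 1)
)
toy$salary <- 1500 + 300 * toy$degree

# Simple regression on sex alone picks up the degree effect
slr <- lm(salary ~ male, data = toy)

# Multiple regression holds degree constant
mlr <- lm(salary ~ male + degree, data = toy)

round(c(slr_male   = unname(coef(slr)["male"]),
        mlr_male   = unname(coef(mlr)["male"]),
        mlr_degree = unname(coef(mlr)["degree"])), 4)
```

In the teacher data the sex coefficient falls but does not vanish, so the toy case is an extreme version of the same mechanism rather than a description of the actual result.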

(e) Now, consider only the portion of the data corresponding to teachers whose school offers
a degree of type "0":

> teach0 <- teach[teach$degree == 0,]


Investigate the effect of months of service on salary in this subset of the data. Calculate the
correlation between months and salary and use this to fit the regression line
salary = b0 + b1*months + e. What does b1 tell you about the influence of months? How would you
predict the starting salary for teachers in schools which offer degree "0"?

For degree=0, correlation between months and salary is 0.88


The fitted regression line is
salary = 1135.4570 + 2.7749 * months


Teachers with degree=0 earn on average GBP 2.7749 more with each additional month of
work experience.
We would predict starting salary for teachers with degree=0 to be the intercept in the above
model, i.e. GBP 1135.4570.
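The route from correlation to fitted line, and the reading of the intercept as the predicted salary at months = 0, can be sketched as follows. The numbers are hypothetical and only illustrate the mechanics; they are not the teach.csv values.

```r
# Hypothetical months of service and salaries, for illustration only
toy <- data.frame(
  months = c(10, 50, 120, 200, 310, 400),
  salary = c(1150, 1300, 1420, 1750, 1980, 2250)
)

# Slope and intercept from the correlation and the sample moments
b1 <- cor(toy$months, toy$salary) * sd(toy$salary) / sd(toy$months)
b0 <- mean(toy$salary) - b1 * mean(toy$months)

# The same line fitted by lm()
fit <- lm(salary ~ months, data = toy)

# Predicted salary at months = 0 is the intercept
start_salary <- unname(predict(fit, newdata = data.frame(months = 0)))

c(b0 = b0, b1 = b1, start_salary = start_salary)
```

Passing the data frame through `data =` (rather than `toy$salary ~ toy$months`) is what lets `predict()` accept new values of months.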

(f) Consider the results from your regression in (e). Plot the data (subset) and regression line.
Plot the residuals both as a histogram and against months. Comment on any problems you
see.

The first scatterplot shows that the variance of salaries seems to differ at different levels of
work experience: salaries for the least experienced are more tightly clustered, while the
most experienced have a much wider range of salaries. The salary data is therefore
heteroskedastic, implying that the estimated variance of the model coefficient may be
biased, and any inference based on this coefficient is suspect.
The histogram shows that the residuals do not seem to be normally distributed.
The mean of the residuals also seems to differ with length of working experience. Here the
residuals are computed as fitted minus actual salary, as in the appendix code. From
roughly 100-300 months the mean of the residuals appears to be negative, implying that the
model may be underpredicting salary. From 300+ months the mean of the residuals seems
to be greater than zero, implying that the model may be systematically overpredicting
salary at these high levels of experience.
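A simple numeric companion to the residual plots is to compare the residual spread at low and high experience. The sketch below uses constructed data (not teach.csv) whose scatter grows with months, and R's own residual convention of actual minus fitted; the sign does not affect the spread comparison.

```r
# Hypothetical data whose scatter around the line grows with months
months <- 1:100
salary <- 1000 + 3 * months + 0.5 * months * (-1)^months

fit <- lm(salary ~ months)
res <- resid(fit)   # R's convention: actual minus fitted

# Compare residual spread for low and high experience
sd_low  <- sd(res[months <= 50])
sd_high <- sd(res[months > 50])
c(sd_low = sd_low, sd_high = sd_high)

# The two diagnostic plots discussed in 3(f)
par(mfrow = c(1, 2))
hist(res, xlab = "Residuals", main = "Residuals Histogram")
plot(months, res, pch = 20, xlab = "Months", ylab = "Residuals",
     main = "Residuals vs Months")
abline(h = 0, lty = 2)
```

A clearly larger spread in the high-experience half is the fan shape that signals heteroskedasticity.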


4. Market Model Example

The CAPM (Capital Asset Pricing Model) relates asset returns to market returns through a
simple linear regression model. Here we will model individual company returns as a function
of the S&P500 index returns. This model assumes the rate of return Rs on a generic stock is
linearly related to the rate of return Rm on the overall stock market as: Rs_i = alpha + beta * Rm_i
+ error_i, where the error term follows the assumptions of the SLR Model. The slope coefficient
measures the sensitivity of the stock's rate of return to changes in the level of the overall market,
and the intercept is market independent income. (The CAPM is discussed also in lecture 2.)
For this problem, use the file mktmodel.csv from the course website. The dataset contains 60
monthly returns (from 1992 to 1996) of the S&P500 and 30 individual US stocks (labelled by
ticker).

(a) Use the code below to plot the return time series for the S&P and for each individual
equity. Comment on what you see.


The S&P index seems to be less volatile than individual stocks (some stocks are
particularly volatile). The average S&P 500 monthly return in this period seems to be
slightly positive. In particular, index returns remain consecutively positive from month 40
to 50.

(i) Calculate the market correlation for each stock. Based on this information alone, which
CAPM fit would yield the highest R^2? Can you give a practical reasoning for this?


GE's monthly return has the highest correlation with the market return, at 0.6312947. This
means that, based on historical information only, market returns explain the largest portion
of the variance of GE's monthly returns. Hence a regression of GE's monthly return on the
market return should yield the highest R-square.
Intuitively, GE has a highly diversified product portfolio corresponding to a wide range of
economic activities. It therefore can be expected to behave like a partial index of many
industries.
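The step from "highest correlation" to "highest R-square" relies on the fact that, in a regression with one explanatory variable, R-square equals the squared correlation. A short check on made-up returns (the series below are hypothetical, not from mktmodel.csv):

```r
# Hypothetical monthly returns for the market and one stock
market <- c(0.02, -0.01, 0.03, 0.01, -0.02, 0.04, 0.00, 0.015)
stock  <- c(0.03, -0.02, 0.02, 0.03, -0.04, 0.05, 0.01, 0.000)

fit <- lm(stock ~ market)
r_squared <- summary(fit)$r.squared
corr <- cor(market, stock)

# In a one-regressor model these two numbers coincide
c(r_squared = r_squared, corr_squared = corr^2)

# Applied to the correlation reported for GE
0.6312947^2
```

Squaring the reported GE correlation gives an R-square of roughly 0.40, so the market explains about two fifths of the variance in GE's monthly returns.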


(ii) Estimate alpha and beta for each stock and plot them against each other. Describe the
results.


All sample betas are positive ranging from approximately 0.5 to 1.9. Alphas range from
approximately -0.010 to 0.020.
The sample mean for alpha is positive, confirming our observation in a).
There seems to be no clear linear relationship between stock alphas and betas. A regression
between alphas and betas produced intercept and coefficient that are statistically
insignificant.
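One way to estimate every alpha and beta in a single pass is to run one regression per column. The sketch below is an alternative to the loop in the appendix, shown on three made-up tickers (AAA, BBB, CCC are hypothetical, as is the market series).

```r
# Hypothetical returns: a market series and three made-up tickers
market <- c(0.02, -0.01, 0.03, 0.01, -0.02, 0.04, 0.00, 0.015)
toy_stocks <- data.frame(
  AAA = c(0.03, -0.02, 0.02, 0.03, -0.04, 0.05,  0.01, 0.000),
  BBB = c(0.01,  0.00, 0.02, 0.00, -0.01, 0.03,  0.01, 0.020),
  CCC = c(0.05, -0.03, 0.06, 0.01, -0.05, 0.07, -0.01, 0.030)
)

# One regression per stock; each returns c(alpha, beta)
capm <- sapply(toy_stocks, function(r) {
  fit <- lm(r ~ market)
  c(alpha = unname(coef(fit)[1]), beta = unname(coef(fit)[2]))
})

# Same betas from the covariance formula
betas_check <- sapply(toy_stocks, function(r) cov(r, market) / var(market))

round(t(capm), 4)
plot(capm["alpha", ], capm["beta", ], pch = 20, xlab = "alpha", ylab = "beta")
```

With the real data, `toy_stocks` would be replaced by the 30 stock columns and `market` by the S&P500 column.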

(b) Pairs Trading is a strategy which picks two stocks that generally move together and
attempts to make money through arbitrage on differences within the pair. For example, if
two stocks have the same market sensitivity (beta), you could sell $100 of the stock with
low alpha (say alpha_low) and buy $100 of the stock with high alpha (say alpha_high).

Suppose this is your trading strategy:

(i) Show that your average return is alpha_high - alpha_low. Do you lose money if the
market goes down?


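Let \( R \) denote returns and \( \bar{R} \) denote the sample average. Then we have the following
relationship between average returns, alpha, and beta:

\[ \bar{R}_{high} = \alpha_{high} + \beta_{high}\,\bar{R}_{S\&P500} \]

\[ \bar{R}_{low} = \alpha_{low} + \beta_{low}\,\bar{R}_{S\&P500} \]

Taking the difference to obtain the strategy return,

\[ \bar{R}_{high} - \bar{R}_{low} = \alpha_{high} + \beta_{high}\,\bar{R}_{S\&P500} - (\alpha_{low} + \beta_{low}\,\bar{R}_{S\&P500}) \]

Since \( \beta_{high} = \beta_{low} \):

\[ \bar{R}_{high} - \bar{R}_{low} = \alpha_{high} - \alpha_{low} \]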

Based on the above equation, the strategy yields an average return independent of
market movement and depends only on the alpha differential. That means based on
historical regression, on average you would not lose money if the market goes down.
(However, your actual realized return for a particular month (as opposed to average
return) may be negative, and future returns can certainly be negative since the equation
does not predict the future perfectly).
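The decomposition of the average long-short return into an alpha part and a beta part holds exactly in-sample, because each least-squares line passes through the sample means. A sketch on made-up return series (all three vectors are hypothetical):

```r
# Hypothetical monthly returns for the market and two stocks
market <- c(0.02, -0.01, 0.03, 0.01, -0.02, 0.04,  0.00, 0.015)
high   <- c(0.05, -0.01, 0.04, 0.03, -0.02, 0.07,  0.02, 0.030)
low    <- c(0.02, -0.02, 0.03, 0.00, -0.03, 0.04, -0.01, 0.010)

fit_high <- lm(high ~ market)
fit_low  <- lm(low ~ market)

alpha_gap <- unname(coef(fit_high)[1] - coef(fit_low)[1])
beta_gap  <- unname(coef(fit_high)[2] - coef(fit_low)[2])

# Average return of buying `high` and selling `low`
avg_spread <- mean(high - low)

# Least-squares lines pass through the sample means, so the average spread
# splits exactly into an alpha part and a beta part
c(avg_spread    = avg_spread,
  decomposition = alpha_gap + beta_gap * mean(market))
```

When the two betas are equal, `beta_gap` is zero and the average spread reduces to the alpha difference alone.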
(ii) Based on the regressions you ran above, choose a pair of stocks for trading according
to this strategy. Which would you buy and which would you sell?

We could buy ENE and sell WMT, since both stocks have almost identical betas and
the largest difference in alphas. ENE has the more positive alpha, so we would want to buy
it to make a positive spread on alpha.
(Note that ENE has a slightly smaller beta, so the strategy is not entirely market-neutral:
a positive market return slightly dampens our alpha differential, since we tend to lose
slightly more on selling WMT than we gain on buying ENE.)

(iii) Calculate what you would have made executing this strategy over the time span of
our dataset. What is your average monthly return? How does this compare to the
difference in alphas?

Over 60 months, the strategy returns 189.2%. The average monthly return is 2.12%.
The difference in alpha between ENE and WMT is 2.12%, which agrees with the
above.
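The total return and the average monthly return are different summaries of the same series, and the arithmetic and compound (geometric) averages generally differ. The sketch below uses three made-up monthly returns to show how each is computed; it is the arithmetic mean that the alpha difference should be compared with.

```r
# Hypothetical monthly returns of a long-short pair
spread <- c(0.10, -0.05, 0.03)

# Compounded return over the whole period
total_return <- prod(1 + spread) - 1

# Arithmetic average monthly return (the quantity tied to the alpha gap)
arith_mean <- mean(spread)

# Compound (geometric) average monthly return
geo_mean <- (1 + total_return)^(1 / length(spread)) - 1

c(total = total_return, arithmetic = arith_mean, geometric = geo_mean)
```

The geometric average never exceeds the arithmetic one, and the gap widens as the monthly returns become more volatile.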


APPENDIX: CODE IN R

## 2 - Always look at scatter plots

## Set the working directory and load the data

setwd("C:/Users/jp_mi/Desktop/Chicago Booth/03. Classes/2017 - Fall/41100 - Applied Regression Analysis - Max Farrell/Homework/Homework 1")
data <- read.csv("scatterplots.csv")

## Run correlations
cor(data$X1,data$Y1)
cor(data$X2,data$Y2)
cor(data$X3,data$Y3)
cor(data$X4,data$Y4)

## Run least squares functions


summary(lm (data$Y1 ~ data$X1))
summary(lm (data$Y2 ~ data$X2))
summary(lm (data$Y3 ~ data$X3))
summary(lm (data$Y4 ~ data$X4))

## Plot dispersion graphs with least squares


par(mfrow=c(2,2))
plot(data$X1, data$Y1, main="Y1,X1",pch=20)
abline(lm(data$Y1 ~ data$X1),lwd=1.5)
text(x=3.1, y=26, col=2,cex=1.5, paste("corr(y, x) ="))
plot(data$X2, data$Y2, main="Y2,X2",pch=20)
abline(lm(data$Y2 ~ data$X2),lwd=1.5)
plot(data$X3, data$Y3, main="Y3,X3",pch=20)
abline(lm(data$Y3 ~ data$X3),lwd=1.5)
plot(data$X4, data$Y4, main="Y4,X4",pch=20)

abline(lm(data$Y4 ~ data$X4),lwd=1.5)

## 3 - Teacher Salary Exploratory Analysis

## Set the working directory and load the data

setwd("C:/Users/jp_mi/Desktop/Chicago Booth/03. Classes/2017 - Fall/41100 - Applied Regression Analysis - Max Farrell/Homework/Homework 1")
teacherdata <- read.csv("teach.csv")

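## a) Plot salary vs. number of months - color representing teachers' sex

## Make sure sex is a factor so it can be used for colors and legend levels
## (read.csv no longer converts strings to factors by default in R >= 4.0)
teacherdata$sex <- as.factor(teacherdata$sex)
plot(teacherdata$months, teacherdata$salary,ylab="Salary ($ Sterling)",xlab="Months (m)",
main="Salary vs Months",pch=20,col=teacherdata$sex)
legend("topleft", levels(teacherdata$sex), fill=1:2)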

## b) Boxplots
par(mfrow=c(2,3))
boxplot(teacherdata$salary ~ teacherdata$sex, ylab="Salary ($ Sterling)", main="Sex")
boxplot(teacherdata$salary ~ teacherdata$marry, ylab="Salary ($ Sterling)", main="Marry")
boxplot(teacherdata$salary ~ teacherdata$degree, ylab="Salary ($ Sterling)", main="Degree")
boxplot(teacherdata$salary ~ teacherdata$type, ylab="Salary ($ Sterling)", main="Type")
boxplot(teacherdata$salary ~ teacherdata$train, ylab="Salary ($ Sterling)", main="Train")
boxplot(teacherdata$salary ~ teacherdata$brk, ylab="Salary ($ Sterling)", main="Break")

## c) Plot salary vs. number of months - color representing teachers' degrees


par(mfrow=c(1,1))
teacherdata$degree <- as.factor(teacherdata$degree)
plot(teacherdata$months, teacherdata$salary,ylab="Salary ($ Sterling)",xlab="Months (m)",
main="Salary vs Months",pch=20,col=teacherdata$degree)
legend("topleft", levels(teacherdata$degree), fill=1:4)

## d) Regressions

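## degree was converted to a factor in (c); use its numeric levels here so the
## fit has a single slope per degree level, as reported in 3(d)
degree_num <- as.numeric(as.character(teacherdata$degree))
summary(lm(teacherdata$salary ~ degree_num))
summary(lm(teacherdata$salary ~ teacherdata$sex))

## Multiple linear regression on sex and degree
summary(lm(teacherdata$salary ~ teacherdata$sex + degree_num))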

## e) Data cut - degree == 0

teach0 <- teacherdata[teacherdata$degree == 0,]
cor(teach0$months, teach0$salary)
## Slope and intercept from the correlation and the sample moments
b1 <- cor(teach0$months,teach0$salary)*sd(teach0$salary)/sd(teach0$months)
b0 <- mean(teach0$salary)-b1*mean(teach0$months)

## f) Fitted line and residual plots
par(mfrow=c(1,3))
reg <- lm(salary ~ months, data=teach0)
plot(teach0$months, teach0$salary, ylab="Salary ($ Sterling)", xlab="Months (m)",
main="Salary vs Months", pch=20)
abline(reg, lwd=1.5)
## Note: residuals are plotted as fitted minus actual, the opposite sign of resid(reg)
hist(reg$fitted.values-teach0$salary, xlab="Residuals", main="Residuals Histogram")
plot(teach0$months, reg$fitted.values-teach0$salary, pch=20, ylab="Residuals", xlab="Months",
main="Residuals vs Months")

## 4 - Market Model Example

## a) CAPM

par(mfrow=c(1,1))
mkt <- read.csv("mktmodel.csv")
SP500 <- mkt$SP500
stocks <- mkt[,-1]
plot(SP500, col=0, ## Just get the plot up
  xlab = "Month", ylab = "Returns",
  main = "Monthly returns for 1992-1996",
  ylim=range(unlist(mkt)))
colors <- rainbow(30) ## 30 different colors
## this is how you do 'loops' in R... this is useful!
for(i in 1:30){ lines(stocks[,i], col=colors[i], lty=2) }
lines(SP500, lwd=2)

## market correlations
correlations <- c()
for(i in 1:30){ correlations <- c(correlations,cor(SP500,stocks[,i])) }
correlations

## alphas and betas


par(mfrow=c(1,1))
plot(0,0,xlab="Delta SP500",ylab="Delta stock")
alphas <- c()
betas <- c()
for(i in 1:30){
  betas <- c(betas,cor(SP500, stocks[,i])*sd(stocks[,i])/sd(SP500))
  alphas <- c(alphas,mean(stocks[,i])-betas[i]*mean(SP500))
}
plot(alphas, betas, pch=20)

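## pairs trading
## Columns 10 and 29 are assumed to be ENE (bought) and WMT (sold), the pair chosen in 4(b)(ii)
Profit <- stocks[,10] - stocks[,29] + 1 ## monthly gross return of the long-short pair
TotalProfit <- prod(Profit) - 1 ## compounded return over the 60 months
mean(Profit - 1) ## average monthly return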
