PROJECT ABSTRACT
This work develops the best linear model of residential real estate prices for homes throughout De Witt,
New York. It differs from other studies comparing models for predicting house prices by taking more
variables into account, such as school district and fuel type. The goal of this study is to predict
housing prices in De Witt accurately, using the Zestimates© produced by Zillow© as the benchmark.
Rachael Paciello
Introduction
Many different parties are interested in accurate predictions of the market value of residential homes.
Buyers and sellers have a clear interest in setting prices relative to market values. Real estate investors
want to know the predicted housing price in order to maximize profits. Local governments also look at
housing values in order to set county taxes. Many academic studies have addressed house price
prediction, but they are limited by data availability.
Methods
I began my study by pulling all available data from Zillow. I looked specifically at De Witt, NY, and at
beds, baths, sqft, year built, lot size, fuel type and delivery, school district, housing type, number of
garage stalls, garage size in sqft, and taxes per house. I then ran a pairs plot on all of the
variables to get a feel for the data and to see which variable had the strongest visible correlation with
housing price. I also visually inspected which variable had the strongest connection to garage stalls:
house size was clearly the strongest, with an adjusted R-squared of 0.3304.
Using the formula predictGarageStalls = 1.0738601 + 0.0002912 * HouseSize, given by R, I eliminated the
GarageSize variable and filled in the missing GarageStalls values, yielding 37 usable
data points for all variables listed above.
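The imputation step can be sketched in a few lines. This is a minimal Python illustration rather than the R code used in the study: the coefficients are the ones quoted above, while the dictionary keys (`house_size`, `garage_stalls`) and the choice to round to a whole stall are assumptions made for the example.

```python
# Coefficients quoted in the text (fitted in R by the author).
INTERCEPT = 1.0738601
SLOPE = 0.0002912  # garage stalls per square foot of house size


def predict_garage_stalls(house_size_sqft):
    """Return the fitted number of garage stalls for a given house size."""
    return INTERCEPT + SLOPE * house_size_sqft


def fill_missing_stalls(rows):
    """Return copies of `rows` with missing garage stalls filled in.

    `rows` is a list of dicts; the key names are hypothetical.
    Rounding to a whole stall is an assumption, not stated in the text.
    """
    filled = []
    for row in rows:
        row = dict(row)  # copy so the input is left untouched
        if row.get("garage_stalls") is None:
            row["garage_stalls"] = round(predict_garage_stalls(row["house_size"]))
        filled.append(row)
    return filled


# Invented example records, not the De Witt data.
homes = [
    {"house_size": 3524, "garage_stalls": None},
    {"house_size": 1800, "garage_stalls": 1},
]
filled_homes = fill_missing_stalls(homes)
```

Only records with a missing stall count are changed; observed values are kept as they are.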
I then fit a model for each of the quantitative variables and used multiple R-squared as an indicator
of a good or weak linear fit. I did the same with the qualitative variables; however, their R-squared values
never exceeded 0.1127, indicating a very poor fit.
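For a single quantitative predictor, the multiple R-squared used here is the share of the variance in price that the fitted line explains. The sketch below computes it by hand in Python; the function names and the toy numbers are invented for illustration and are not the De Witt data or the R output.

```python
def fit_simple_ols(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x, y):
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    intercept, slope = fit_simple_ols(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Invented toy data: a perfect line and a noisy one.
perfect_fit = r_squared([1, 2, 3, 4], [2, 4, 6, 8])
noisy_fit = r_squared([1, 2, 3], [1, 3, 2])
```

A value near 1 means the line explains most of the variation; a value near 0, as with the qualitative variables, means it explains very little.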
Next, I wanted to check normality and error structure, so I created a residual plot for each quantitative
variable. In all of the plots, the positive residuals appear more variable than the negative ones.
Additionally, the variability of the residuals appears to increase for larger fitted values. A
natural log transformation should take care of both of these issues.
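As a rough illustration of the transformation, the Python sketch below fits a line to the natural log of price and then back-transforms a prediction to dollars. It assumes the log is applied to the response (price), which the text does not state explicitly; the function names and the toy data are invented for the example.

```python
import math


def fit_simple_ols(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_log_price(sizes, prices):
    """Fit log(price) = a + b * size and return (a, b)."""
    return fit_simple_ols(sizes, [math.log(p) for p in prices])


def predict_price(a, b, size):
    """Back-transform a log-scale prediction to the dollar scale."""
    return math.exp(a + b * size)


# Invented toy data lying exactly on an exponential curve.
toy_sizes = [1000, 2000, 3000, 4000]
toy_prices = [math.exp(11.0 + 0.0003 * s) for s in toy_sizes]
a, b = fit_log_price(toy_sizes, toy_prices)
price_at_2500 = predict_price(a, b, 2500)
```

On the log scale, errors that grow in proportion to price become roughly constant in size, which is why the transformation tends to even out a fan-shaped residual plot.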
Furthermore, I wished to create a confidence interval for the true average price of a house of
size 3524 sqft (the mean sqft of houses in my data). After coding my confidence interval, I am 95% confident
that the average price of homes with 3524 sqft lies between $126,934 and $521,744.
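The interval for the average price at a given size, and the wider interval for a single house at that size, can be sketched as follows. This is a hand-rolled Python illustration for a one-predictor model, not the code used in the study. The critical value `t_crit` is supplied by the caller because the Python standard library has no Student t quantile function (in R it would come from `qt(0.975, df)` with `df = n - 2`). The function name and the toy data are invented.

```python
import math


def interval_at(x, y, x0, t_crit):
    """Confidence and prediction intervals at x0 for simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual variance with n - 2 degrees of freedom.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = ss_res / (n - 2)

    fitted = intercept + slope * x0
    leverage = 1 / n + (x0 - mean_x) ** 2 / sxx
    se_mean = math.sqrt(s2 * leverage)        # uncertainty in the average response
    se_pred = math.sqrt(s2 * (1 + leverage))  # adds the scatter of one new house

    return {
        "fitted": fitted,
        "confidence": (fitted - t_crit * se_mean, fitted + t_crit * se_mean),
        "prediction": (fitted - t_crit * se_pred, fitted + t_crit * se_pred),
    }


# Invented toy data; 3.182 is the 97.5% t quantile for 3 degrees of freedom.
toy_x = [1, 2, 3, 4, 5]
toy_y = [2, 4, 5, 4, 5]
result = interval_at(toy_x, toy_y, x0=3, t_crit=3.182)
```

The prediction interval is always the wider of the two, because it includes the extra term for the variation of an individual observation around the average.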
After this, I created box plots for all variables of interest: beds, baths, year built, lot size, school
district, garage stalls, house type, taxes, and house size.
Finally, I generated predictions from a full model (including all the variables) and from a model using
house size alone.
Results
Pairs plot showing the strong correlation between house size and garage stalls:
Residual Plot for quantitative variables:
Residual Plot using Log Transformation:
Confidence Interval for predicting true cost of houses with 3524 sqft:
The blue lines mark the confidence interval and the green lines mark the prediction interval. Predicting
the price of an individual house is much harder than estimating the average price, which is why the band
between the upper and lower green lines is much wider than the band between the blue lines.
Box Plots for variables:
Interestingly, school district seems to be playing an unexpected role. 3 corresponds to Jamesville De Witt,
1 corresponds to other, and 4 corresponds to Fayetteville Manlius. All other variables increase as
expected.
The sqft model clearly predicted the housing price more accurately than the full model.
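One simple way to make such a comparison concrete is to score each model's predictions against the Zestimates. The text does not say which accuracy measure was used, so the mean absolute error below is an assumption, and all dollar figures in this Python sketch are invented for illustration.

```python
def mean_absolute_error(predicted, reference):
    """Average absolute gap between predictions and reference values."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Invented numbers, not results from the study.
zestimates = [250000, 310000, 180000]
full_model_predictions = [230000, 350000, 150000]
size_model_predictions = [245000, 320000, 175000]

full_mae = mean_absolute_error(full_model_predictions, zestimates)
size_mae = mean_absolute_error(size_model_predictions, zestimates)
```

The model with the smaller error tracks the Zestimates more closely on average.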
Conclusions/Discussion
I have systematically designed a local linear model with the goal of finding the model whose
predicted housing prices are closest to Zillow's estimates. The best model was found to be the one using
house size as the predictor variable. I believe my model has good predictive utility, given its
high multiple R-squared value of 0.7734. There are limitations, however, in that not all the data from
Zillow is completely accurate. An interesting idea for future study would be to compare local
linear models with the non-linear models that the literature has found to outperform linear models. This could
increase the accuracy of my estimates.
Bibliography
Lowrance, Roy E. "Predicting the Market Value of Single-Family Residential Real Estate." (n.d.): n. pag.