
Dewitt Housing Data

PROJECT ABSTRACT
This work develops the best linear model of residential real estate prices for homes throughout De Witt,
New York. It differs from other studies comparing models for predicting house prices by taking more
variables into account, such as school district and fuel type. The goal of this study is to predict
housing prices in De Witt accurately and to compare those predictions with the Zestimates© produced by Zillow©.

Rachael Paciello

Introduction
Many different parties are interested in accurate predictions of the market value of residential homes.
Buyers and sellers have a clear interest in setting prices relative to market values. Real estate investors
want to know the predicted housing price in order to maximize profits. Local governments also look at
housing values in order to set county taxes. Many academic studies have attempted to predict housing
prices, but they are limited by data availability.

Methods
I began my study by pulling all available data from Zillow. I looked specifically at De Witt, NY, and at
the following variables: beds, baths, sqft, year built, lot size, fuel type and delivery, school district,
housing type, number of garage stalls, garage size in sqft, and taxes per house. I then ran a pairs plot on
all of the variables to get a feel for the data and to see which variable had the strongest obvious
correlation with housing price. I also visually inspected which variable had the strongest connection to
garage stalls. It was very clear that house size had the strongest correlation with garage stalls, with an
adjusted R-squared value of 0.3304. Using the formula given by R,
predictGarageStalls = 1.0738601 + 0.0002912 * HouseSize, I eliminated the GarageSize variable and filled in
any missing data for GarageStalls, thus creating 37 usable data points for all variables listed above.
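
The fill-in step above can be sketched in a few lines. The study itself used R; this is a Python illustration only. The coefficients are the ones reported in the text, while the record layout, the key names "HouseSize" and "GarageStalls", and the sample values are assumptions made up for the example. The text does not say whether predicted stalls were rounded, so no rounding is applied here.

```python
# Coefficients reported in the text (fitted in R by the author).
INTERCEPT = 1.0738601
SLOPE = 0.0002912  # garage stalls per square foot of house size


def predict_garage_stalls(house_size_sqft):
    """Predicted number of garage stalls from the fitted line in the text."""
    return INTERCEPT + SLOPE * house_size_sqft


def fill_missing_stalls(houses):
    """Return copies of the records with missing GarageStalls filled in.

    `houses` is a list of dicts with the hypothetical keys "HouseSize"
    and "GarageStalls"; None marks a missing value.
    """
    filled = []
    for house in houses:
        record = dict(house)  # copy so the input is not modified
        if record["GarageStalls"] is None:
            record["GarageStalls"] = predict_garage_stalls(record["HouseSize"])
        filled.append(record)
    return filled


# Made-up records for illustration only (not the study's data).
sample = [
    {"HouseSize": 2000, "GarageStalls": 2},
    {"HouseSize": 3524, "GarageStalls": None},
]
filled_sample = fill_missing_stalls(sample)
for row in filled_sample:
    print(row)
```

Observed values are left untouched; only records with a missing stall count receive the predicted value.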

I then created a model for each of the quantitative variables and used multiple R-squared as an indicator
of a good or weak linear fit. I did the same with the qualitative variables; however, their R-squared values
never exceeded 0.1127, indicating a very poor fit.
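
As a rough illustration of this screening step, the Python sketch below fits a one-variable least-squares line for each predictor and reports its multiple R-squared (1 - SSE/SST). The study used R; the variable names and all numbers here are made up for the example, and only quantitative predictors are covered (qualitative ones would first need indicator coding).

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(xs, ys):
    """Multiple R-squared of the fitted line: 1 - SSE / SST."""
    intercept, slope = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst


# Made-up values for illustration only (not the study's data).
price = [150000, 210000, 265000, 330000, 420000]
predictors = {
    "Sqft": [1200, 1700, 2100, 2600, 3400],
    "Beds": [2, 3, 3, 4, 4],
}
for name, values in predictors.items():
    print(name, round(r_squared(values, price), 4))
```

A value near 1 indicates that the single predictor explains most of the variation in price; a value near 0 indicates a weak linear fit.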

Next, I wanted to check normality and error structure, so I created residual plots for each quantitative
variable. In all of the plots, the positive residuals appear to have higher variability than the negative
ones. Additionally, the variability of the residuals appears to increase for larger fitted values. A
natural log transformation should take care of both of these issues.
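
A minimal Python sketch of how such a transformation could be applied is given here, assuming the log is taken of the price (the text does not say which variable was transformed). It regresses log(price) on house size, computes the residuals on the log scale, and back-transforms a prediction to dollars with exp. All data values are made up; the study itself used R.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_log_price(xs, prices):
    """Fit log(price) = intercept + slope * x and return (intercept, slope)."""
    return fit_line(xs, [math.log(p) for p in prices])


def predict_price(coeffs, x):
    """Back-transform a log-scale prediction to dollars."""
    intercept, slope = coeffs
    return math.exp(intercept + slope * x)


# Made-up data for illustration only (not the study's data).
sqft = [1000, 1500, 2000, 2500, 3000, 3500]
price = [140000, 205000, 250000, 370000, 430000, 640000]

coeffs = fit_log_price(sqft, price)
# Residuals on the log scale; these are what a residual plot would show.
log_residuals = [
    math.log(p) - (coeffs[0] + coeffs[1] * s) for s, p in zip(sqft, price)
]
print([round(r, 3) for r in log_residuals])
print(round(predict_price(coeffs, 3524)))
```

On the log scale, an error of a fixed percentage has the same size for cheap and expensive houses, which is why the transformation tends to even out residual spread that grows with the fitted value.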

Furthermore, I wished to create a confidence interval for the true average price of a house of
size 3,524 sqft (the mean sqft of houses in my data). After coding my confidence interval, I am 95% confident
that the average price of homes with 3,524 sqft lies between $126,934 and $521,744.
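
For reference, the standard intervals for a one-predictor linear model can be sketched as follows. This is a Python illustration under textbook assumptions, not the R code used in the study. The function name, the sample data, and the decision to pass the t critical value in as a table lookup are all choices made for the example. With the study's 37 data points the degrees of freedom would be 35, for which the two-sided 95% critical value is about 2.030.

```python
import math


def intervals_at(xs, ys, x0, t_crit):
    """Fitted value, confidence interval, and prediction interval at x0.

    Uses the simple linear regression formulas:
      CI half-width = t * s * sqrt(1/n + (x0 - mean_x)^2 / Sxx)
      PI half-width = t * s * sqrt(1 + 1/n + (x0 - mean_x)^2 / Sxx)
    where s^2 = SSE / (n - 2). `t_crit` is the t critical value for
    n - 2 degrees of freedom, looked up by the caller.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))  # residual standard error
    leverage = 1 / n + (x0 - mean_x) ** 2 / sxx
    fit = intercept + slope * x0
    ci_half = t_crit * s * math.sqrt(leverage)
    pi_half = t_crit * s * math.sqrt(1 + leverage)
    return fit, (fit - ci_half, fit + ci_half), (fit - pi_half, fit + pi_half)


# Made-up data for illustration only (not the study's data).
sqft = [1200, 1800, 2400, 3000, 3600, 4200]
price = [160000, 240000, 250000, 390000, 400000, 560000]

# Six points give 4 degrees of freedom; t(0.975, 4) is about 2.776.
fit, ci, pi = intervals_at(sqft, price, 3524, 2.776)
print(round(fit), [round(v) for v in ci], [round(v) for v in pi])
```

The confidence interval bounds the average price of all houses of that size, while the prediction interval bounds the price of one individual house, so the prediction interval is always the wider of the two.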

After this, I created box plots for all variables of interest, including beds, baths, year built, lot size,
school district, garage stalls, house type, taxes, and house size.

Finally, I ran a prediction using a full model (including all the variables) and using a model based only on
house size.

Results
Pairs plot below, showing a strong correlation between house size and garage stalls.
Residual Plot for quantitative variables:
Residual Plot using Log Transformation:
Confidence Interval for predicting true cost of houses with 3524 sqft:

The blue lines mark the confidence interval and the green lines mark the prediction interval. Predicting the
price of an individual house is much harder than estimating the average price, which is why the area between
the upper and lower green lines is much larger than that between the blue lines.
Box Plots for variables:

Interestingly, school district seems to be playing an unexpected role. 3 corresponds to Jamesville De Witt,
1 corresponds to other, and 4 corresponds to Fayetteville Manlius. For all other variables, price increases
as expected.

I performed paired t-tests and used ANOVA tables throughout my study.
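
For readers unfamiliar with the test, the paired t statistic is the mean of the pairwise differences divided by its standard error. The Python sketch below shows the calculation. The text does not state which quantities were paired, so pairing the list price with a model's predicted price is an assumption made for this example. The five pairs are the first five rows of the results table (list price and sqft model).

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for a paired t-test of a against b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# First five rows of the results table: list price and sqft-model price.
list_price = [469900, 725000, 144900, 379900, 179900]
sqft_model = [455269, 661252, 235140, 312468, 184082]

t_stat, df = paired_t_statistic(list_price, sqft_model)
print(round(t_stat, 3), df)
```

The statistic is then compared against a t distribution with the returned degrees of freedom; a value close to zero means the two sets of prices do not differ systematically.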


The table below compares the list price of each house with the prices predicted by my full model and my sqft model.

House  List Price ($)  Full Model ($)  Sqft Model ($)
    1         469,900         529,541         455,269
    2         725,000         618,750         661,252
    3         144,900         229,815         235,140
    4         379,900         454,083         312,468
    5         179,900         133,381         184,082
    6         149,900         107,348         113,760
    7         629,000         513,689         425,900
    8         299,900         457,313         596,992
    9         164,900         190,761         200,248
   10         369,000         423,975         455,808
   11         289,900         286,226         309,235
   12         349,900         437,339         206,580
   13         670,000         530,644         395,185
   14         275,000         393,353         428,056
   15         387,900         380,688         388,584
   16         799,900         677,538         670,278
   17          87,709        -117,794         975,933
   18          69,900          -7,107          97,189
   19         128,000          90,819         110,661
   20         173,500         151,386         223,824
   21          79,900         131,046         127,366
   22         800,000         701,180         465,373
   23         214,900         271,122         220,321
   24         799,000         648,427         556,846
   25         259,900         343,476         355,578
   26         283,900         321,509         276,633
   27       1,375,000       1,337,955       1,316,788
   28         224,900         366,945         388,718
   29         264,900         264,900         307,618
   30         497,500         462,868         486,254
   31         559,777         567,895         529,902
   32         444,900         465,096         555,499
   33         379,900         396,184         408,656
   34         424,900         432,933         399,226
   35         499,900         510,213         705,844
   36         650,000         666,284         696,548
   37         479,900         624,389         619,894

From this comparison, I judged the sqft model to be the more accurate predictor of housing price.
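
The comparison in the table can also be summarized numerically. The Python sketch below (the study itself used R) computes the mean and median absolute error of each model against the list price, using the 37 rows of the table. The choice of absolute error as the accuracy measure is an assumption; the text does not say which measure was used.

```python
from statistics import median

# (list price, full model, sqft model) for houses 1 to 37, from the table.
rows = [
    (469900, 529541, 455269), (725000, 618750, 661252),
    (144900, 229815, 235140), (379900, 454083, 312468),
    (179900, 133381, 184082), (149900, 107348, 113760),
    (629000, 513689, 425900), (299900, 457313, 596992),
    (164900, 190761, 200248), (369000, 423975, 455808),
    (289900, 286226, 309235), (349900, 437339, 206580),
    (670000, 530644, 395185), (275000, 393353, 428056),
    (387900, 380688, 388584), (799900, 677538, 670278),
    (87709, -117794, 975933), (69900, -7107, 97189),
    (128000, 90819, 110661), (173500, 151386, 223824),
    (79900, 131046, 127366), (800000, 701180, 465373),
    (214900, 271122, 220321), (799000, 648427, 556846),
    (259900, 343476, 355578), (283900, 321509, 276633),
    (1375000, 1337955, 1316788), (224900, 366945, 388718),
    (264900, 264900, 307618), (497500, 462868, 486254),
    (559777, 567895, 529902), (444900, 465096, 555499),
    (379900, 396184, 408656), (424900, 432933, 399226),
    (499900, 510213, 705844), (650000, 666284, 696548),
    (479900, 624389, 619894),
]


def abs_errors(actual, predicted):
    """Absolute difference between each actual and predicted value."""
    return [abs(a - p) for a, p in zip(actual, predicted)]


def mean_abs_error(actual, predicted):
    """Mean absolute error of the predictions."""
    errors = abs_errors(actual, predicted)
    return sum(errors) / len(errors)


list_price = [r[0] for r in rows]
full_model = [r[1] for r in rows]
sqft_model = [r[2] for r in rows]

mae_full = mean_abs_error(list_price, full_model)
mae_sqft = mean_abs_error(list_price, sqft_model)
print("Full model: mean", round(mae_full),
      "median", round(median(abs_errors(list_price, full_model))))
print("Sqft model: mean", round(mae_sqft),
      "median", round(median(abs_errors(list_price, sqft_model))))
```

A single large miss, such as house 17, can dominate a mean, which is why the median absolute error is printed alongside it.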

Conclusions/Discussion
I have systematically designed a local linear model with the goal of finding the model whose predicted
housing price is closest to Zillow's estimate. The best model was found to be the one that uses
house size as the predictor variable. I believe my model has good predictive utility because of its
high multiple R-squared value of 0.7734. There are limitations, however, in that not all of the data from
Zillow is completely accurate. An interesting idea for future study would be to compare local
linear models with the non-linear models that the literature reports as outperforming linear models. This
could increase the accuracy of my estimates.

Bibliography
Lowrance, Roy E. "Predicting the Market Value of Single-Family Residential Real Estate." New York University, 1 Jan. 2015. Web. 4 May 2017.

"Predicting House Prices." N.p., 5 Feb. 2016. Web. 4 May 2017.
