15A1HP032
Nandita Mehta
15A1HP086
15A3HP624
Introduction
For this term paper, we use the data set House_data to run a multivariate multiple regression.
As the name implies, multivariate regression is a technique that estimates a single regression model
with multiple outcome variables and one or more predictor variables. Our objective is to determine
the relationship between price, which we treat as the dependent variable, and the remaining
variables, which we treat as independent variables.
Variables description
Variable         Description
price
sqft_living      area of living
sqft_lot
sqft_above       floor area
sqft_basement    basement area
bedrooms         number of bedrooms
bathrooms        number of bathrooms
floors           number of floors
view
condition
Objective
The objective of this analysis is to investigate whether the price of a
house increases as the number of bedrooms increases, and whether the view,
the condition of the house, the area in which it is built, etc., also
affect the price of the house.
Business Domain
Real Estate
Data Cleaning
In the first step, the water variable was removed, along with some other
variables that were not needed for the statistical analysis. The data
contained no missing values.
Data Exploration
1. The data is first imported into SAS 9.2.
2. Univariate analysis is then run on the variables price, sqft_living, sqft_lot,
sqft_above, and sqft_basement.
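Several entries in the Moments tables that follow are simple functions of N, the mean and the standard deviation. The sketch below recomputes them for price from the three reported values, as a quick consistency check. The function name `derived_moments` and its dictionary keys are illustrative assumptions, not part of the SAS output.

```python
import math


def derived_moments(n, mean, std_dev):
    """Recompute quantities that PROC UNIVARIATE derives from N, mean and std dev."""
    std_error = std_dev / math.sqrt(n)
    return {
        "variance": std_dev ** 2,                 # Variance = (Std Deviation)^2
        "coeff_variation": 100 * std_dev / mean,  # Coeff Variation, in percent
        "std_error_mean": std_error,              # Std Error Mean = sd / sqrt(N)
        "t_statistic": mean / std_error,          # Student's t for H0: mean = 0
    }


# Inputs are the N, Mean and Std Deviation reported for price.
price = derived_moments(n=5274, mean=406001.838, std_dev=258577.971)
for name, value in price.items():
    print(f"{name:>16}: {value:.6g}")
```

The results can be compared with the Coeff Variation, Std Error Mean and Student's t entries reported for price.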
The UNIVARIATE Procedure
Variable: price (price)

                            Moments
N                        5274    Sum Weights              5274
Mean               406001.838    Sum Observations   2141253691
Std Deviation      258577.971    Variance           6.68626E10
Skewness           4.15851016    Kurtosis           34.0894842
Uncorrected SS     1.22192E15    Corrected SS       3.52566E14
Coeff Variation    63.6888672    Std Error Mean     3560.58585

                Basic Statistical Measures
    Location                     Variability
Mean      406001.8      Std Deviation             258578
Median    350000.0      Variance              6.68626E10
Mode      250000.0      Range                    4412000
                        Interquartile Range       230000

               Tests for Location: Mu0=0
Test           -Statistic-      -----p Value------
Student's t    t   114.0267     Pr > |t|    <.0001
Sign           M       2637     Pr >= |M|   <.0001
Signed Rank    S    6955088     Pr >= |S|   <.0001

                  Tests for Normality
Test                  --Statistic---      -----p Value------
Kolmogorov-Smirnov    D      0.145755     Pr > D      <0.0100
Cramer-von Mises      W-Sq   45.83572     Pr > W-Sq   <0.0050
Anderson-Darling      A-Sq   269.9525     Pr > A-Sq   <0.0050

Quantiles (Definition 5)
Quantile       Estimate
100% Max        4490000
99%             1360000
95%              822000
90%              652000
75% Q3           480000
50% Median       350000
25% Q1           250000
10%              200000
5%               170000
1%               118000
0% Min            78000

         Extreme Observations
-----Lowest----      -----Highest-----
 Value     Obs         Value      Obs
 78000    3960       2890000     1011
 81000    4173       3100000     1881
 82000    2149       3200000     5147
 82500     553       3400000     2725
 83000    4780       4490000     2249
[Histogram and boxplot of price: the bars and boxplot are not recoverable from the extraction. Row counts (#), top to bottom: 1, 1, 1, 1, 2, 3, 2, 6, 2, 5, 10, 14, 38, 50, 153, 404, 1386, 2678, 517.]
[Normal probability plot of price (The UNIVARIATE Procedure, Variable: price): vertical axis from 100000 to 4500000, horizontal axis from -2 to +2.]
The SAS System                         01:47 Wednesday, October 19, 2016

The UNIVARIATE Procedure
Variable: sqft_living (sqft_living)

                            Moments
N                        5274    Sum Weights              5274
Mean               1480.67444    Sum Observations      7809077
Std Deviation      705.801163    Variance           498155.282
Skewness           2.11916203    Kurtosis           7.05883488
Uncorrected SS     1.41895E10    Corrected SS       2626772800
Coeff Variation    47.6675455    Std Error Mean     9.71879243

                Basic Statistical Measures
    Location                     Variability
Mean      1480.674      Std Deviation          705.80116
Median    1300.000      Variance                  498155
Mode      1010.000      Range                       7460
                        Interquartile Range    720.00000

               Tests for Location: Mu0=0
Test           -Statistic-      -----p Value------
Student's t    t   152.3517     Pr > |t|    <.0001
Sign           M       2637     Pr >= |M|   <.0001
Signed Rank    S    6955088     Pr >= |S|   <.0001

                  Tests for Normality
Test                  --Statistic---      -----p Value------
Kolmogorov-Smirnov    D      0.132984     Pr > D      <0.0100
Cramer-von Mises      W-Sq   36.02791     Pr > W-Sq   <0.0050
Anderson-Darling      A-Sq   211.6481     Pr > A-Sq   <0.0050

Quantiles (Definition 5)
Quantile       Estimate
100% Max           7850
99%                4168
95%                2850
90%                2340
75% Q3             1730
50% Median         1300
25% Q1             1010
10%                 840
5%                  750
1%                  630
0% Min              390

        Extreme Observations
----Lowest----       ----Highest---
Value     Obs        Value      Obs
  390    5248         5780     1011
  420    2972         5940     5210
  460    4670         6330     1572
  470    3944         6430     2249
  480    2115         7850     5113
                      Histogram                              #
7750+*                                                       1
    .
    .
6250+*                                                       2
    .*                                                       6
    .*                                                       9
4750+*                                                      12
    .*                                                      35
    .**                                                     62
3250+***                                                    97
    .*****                                                 191
    .***********                                           454
1750+************************                             1052
    .************************************************     2146
    .***************************                          1202
 250+*                                                       5
     ----+----+----+----+----+----+----+----+----+--
     * may represent up to 45 counts
[Boxplot of sqft_living: not recoverable from the extraction.]
Sqft_lot
The UNIVARIATE Procedure
Variable: sqft_lot (sqft_lot)

                            Moments
N                        5274    Sum Weights              5274
Mean               13408.9863    Sum Observations     70718994
Std Deviation      45173.8626    Variance           2040677858
Skewness           18.9416156    Kurtosis            511.48021
Uncorrected SS     1.17088E13    Corrected SS       1.07605E13
Coeff Variation    336.892449    Std Error Mean     622.038354

                Basic Statistical Measures
    Location                     Variability
Mean      13408.99      Std Deviation              45174
Median     7351.00      Variance              2040677858
Mode       6000.00      Range                    1650759
                        Interquartile Range         4630

               Tests for Location: Mu0=0
Test           -Statistic-      -----p Value------
Student's t    t   21.55653     Pr > |t|    <.0001
Sign           M       2637     Pr >= |M|   <.0001
Signed Rank    S    6955088     Pr >= |S|   <.0001

                  Tests for Normality
Test                  --Statistic---      -----p Value------
Kolmogorov-Smirnov    D      0.388485     Pr > D      <0.0100
Cramer-von Mises      W-Sq   302.3959     Pr > W-Sq   <0.0050
Anderson-Darling      A-Sq   1445.553     Pr > A-Sq   <0.0050

Quantiles (Definition 5)
Quantile       Estimate
100% Max        1651359
99%              188760
95%               32137
90%               16030
75% Q3             9730
50% Median         7351
25% Q1             5100
10%                3621
5%                 2550
1%                 1105
0% Min              600

         Extreme Observations
----Lowest----       -----Highest-----
Value     Obs          Value      Obs
  600    1950         843309     1214
  635    5173         871200     2537
  649    1153         982278     1179
  649     143        1164794     4455
  651    5098        1651359      441
                      Histogram
1650000+*
       .
       .
       .
       .
       .*
       .
       .*
 850000+*
       .
       .*
       .
       .*
       .*
       .*
       .*
  50000+************************************************
        ----+----+----+----+----+----+----+----+----+--
        * may represent up to 109 counts
[# column as extracted, top to bottom, with two entries missing: 1, 1, 2, 5, 4, 30, 41, 5188. Boxplot of sqft_lot: not recoverable from the extraction.]
Sqft_above
The UNIVARIATE Procedure
Variable: sqft_above (sqft_above)

                            Moments
N                        5274    Sum Weights              5274
Mean               1295.86974    Sum Observations      6834417
Std Deviation       590.95481    Variance           349227.587
Skewness           2.58433978    Kurtosis           10.8621274
Uncorrected SS      1.0698E10    Corrected SS       1841477068
Coeff Variation    45.6029485    Std Error Mean     8.13737273

                Basic Statistical Measures
    Location                     Variability
Mean      1295.870      Std Deviation          590.95481
Median    1140.000      Variance                  349228
Mode      1010.000      Range                       7460
                        Interquartile Range    520.00000

               Tests for Location: Mu0=0
Test           -Statistic-      -----p Value------
Student's t    t   159.2492     Pr > |t|    <.0001
Sign           M       2637     Pr >= |M|   <.0001
Signed Rank    S    6955088     Pr >= |S|   <.0001

                  Tests for Normality
Test                  --Statistic---      -----p Value------
Kolmogorov-Smirnov    D      0.155122     Pr > D      <0.0100
Cramer-von Mises      W-Sq   48.74819     Pr > W-Sq   <0.0050
Anderson-Darling      A-Sq   278.5273     Pr > A-Sq   <0.0050

Quantiles (Definition 5)
Quantile       Estimate
100% Max           7850
99%                3650
95%                2430
90%                1950
75% Q3             1460
50% Median         1140
25% Q1              940
10%                 790
5%                  720
1%                  610
0% Min              390

        Extreme Observations
----Lowest----       ----Highest---
Value     Obs        Value      Obs
  390    5248         5000     2814
  420    2972         5320      365
  460    4670         5430     5073
  470    3944         6430     2249
  480    3643         7850     5113
                      Histogram                              #
7750+*                                                       1
    .
    .
6250+*                                                       1
    .
    .*                                                       3
4750+*                                                       7
    .*                                                      22
    .*                                                      35
3250+**                                                     66
    .***                                                   102
    .******                                                259
1750+***************                                       726
    .************************************************     2352
    .***********************************                  1692
 250+*                                                       8
     ----+----+----+----+----+----+----+----+----+--
     * may represent up to 49 counts
[Boxplot of sqft_above: not recoverable from the extraction.]
Sqft_basement
The UNIVARIATE Procedure
Variable: sqft_basement (sqft_basement)

                            Moments
N                        5274    Sum Weights              5274
Mean               184.804702    Sum Observations       974660
Std Deviation      345.082535    Variance           119081.956
Skewness           2.03724345    Kurtosis            4.0259496
Uncorrected SS      808040906    Corrected SS        627919155
Coeff Variation    186.728222    Std Error Mean     4.75174271

                Basic Statistical Measures
    Location                     Variability
Mean      184.8047      Std Deviation          345.08254
Median      0.0000      Variance                  119082
Mode        0.0000      Range                       2250
                        Interquartile Range    250.00000

               Tests for Location: Mu0=0
Test           -Statistic-      -----p Value------
Student's t    t   38.89198     Pr > |t|    <.0001
Sign           M        788     Pr >= |M|   <.0001
Signed Rank    S     621338     Pr >= |S|   <.0001

                  Tests for Normality
Test                  --Statistic---      -----p Value------
Kolmogorov-Smirnov    D      0.405037     Pr > D      <0.0100
Cramer-von Mises      W-Sq   182.3533     Pr > W-Sq   <0.0050
Anderson-Darling      A-Sq   905.4814     Pr > A-Sq   <0.0050

Quantiles (Definition 5)
Quantile       Estimate
100% Max           2250
99%                1350
95%                 960
90%                 730
75% Q3              250
50% Median            0
25% Q1                0
10%                   0
5%                    0
1%                    0
0% Min                0

        Extreme Observations
----Lowest----       ----Highest---
Value     Obs        Value      Obs
    0    5274         2060     3174
    0    5273         2100     1168
    0    5272         2170     1414
    0    5271         2196     2851
    0    5270         2250     3477
                      Histogram                              #
2250+*                                                       1
    .*                                                       3
    .*                                                       3
    .*                                                       3
    .*                                                       6
    .*                                                       3
    .*                                                       3
    .*                                                      15
    .*                                                      11
    .*                                                      18
    .*                                                      33
1150+*                                                      56
    .**                                                     87
    .**                                                     93
    .**                                                    131
    .**                                                    137
    .***                                                   164
    .**                                                    152
    .***                                                   178
    .***                                                   170
    .***                                                   162
    .**                                                    138
  50+************************************************     3707
     ----+----+----+----+----+----+----+----+----+--
     * may represent up to 78 counts
[Boxplot of sqft_basement: not recoverable from the extraction.]
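Test of Normality
For sqft_living, sqft_lot, sqft_above and sqft_basement, the Kolmogorov-Smirnov, Cramer-von Mises
and Anderson-Darling tests all return p-values below 0.01. The hypothesis of normality is therefore
rejected for each of these variables, which agrees with the large positive skewness reported in the
Moments tables. The variables cannot be assumed to be normal, let alone multivariate normal.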
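To make the Kolmogorov-Smirnov D statistic in the normality tests more concrete, the sketch below computes D against a normal distribution whose mean and standard deviation are estimated from the sample. It is a minimal illustration on synthetic data, not the house data, and it does not reproduce the p-value calculation that SAS performs. The function name `ks_statistic_normal` and both samples are assumptions made for the example.

```python
import math
from statistics import NormalDist, mean, stdev


def ks_statistic_normal(sample):
    """Kolmogorov-Smirnov D against a normal fitted with the sample mean and sd."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = fitted.cdf(x)
        # Largest gap between the empirical step function and the fitted CDF.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d


# Synthetic, illustrative samples: evenly spaced normal quantiles and their exponentials.
n = 200
normal_like = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
right_skewed = [math.exp(z) for z in normal_like]

d_normal = ks_statistic_normal(normal_like)
d_skewed = ks_statistic_normal(right_skewed)
print(f"D (near-normal sample):  {d_normal:.4f}")
print(f"D (right-skewed sample): {d_skewed:.4f}")
```

A right-skewed sample gives a much larger D than a near-normal one, which is the pattern seen in the D values reported for the house variables.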
Statistical Technique
The statistical technique applied to the data is multivariate regression, a
technique that estimates a single regression model with more than one outcome variable. When there
is more than one predictor variable in a multivariate regression model, the model is a multivariate
multiple regression. The model is used to assess whether customers are willing to pay a higher
price when the number of bedrooms or bathrooms increases by one unit, or when a given house
offers a better view.
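The technique described above can be written compactly in matrix form. The symbols below (Y, X, B, E, n, p, q) are notation assumed for this illustration and do not appear in the SAS output.

```latex
\[
  \underset{n \times p}{Y}
  \;=\;
  \underset{n \times (q+1)}{X}\;\underset{(q+1) \times p}{B}
  \;+\;
  \underset{n \times p}{E},
  \qquad
  \hat{B} \;=\; (X^{\top}X)^{-}\,X^{\top}Y
\]
```

Here n is the number of observations (5274 in this data set), p is the number of outcome variables, q is the number of predictors, X carries a leading column of ones for the intercept, and $(X^{\top}X)^{-}$ denotes a generalized inverse. Each column of $\hat{B}$ equals the ordinary least-squares fit of the corresponding outcome on its own, which is why the output contains one Analysis of Variance table per outcome variable.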
Dependent Variable: bedrooms

Number of Observations Read    5274
Number of Observations Used    5274

                       Analysis of Variance
                              Sum of         Mean
Source               DF      Squares       Square    F Value    Pr > F
Model                 4   1030.10115    257.52529     746.53    <.0001
Error              5269   1817.60534      0.34496
Corrected Total    5273   2847.70648

Root MSE           0.58733    R-Square    0.3617
Dependent Mean     2.80319    Adj R-Sq    0.3612
Coeff Var         20.95240

NOTE: Model is not full rank. Least-squares solutions for the parameters are not
unique. Some statistics will be misleading. A reported DF of 0 or B means that the
estimate is biased.
NOTE: The following parameters have been set to 0, since the variables are a linear
combination of other variables as shown.
    sqft_basement = sqft_living - sqft_above

                              Parameter Estimates
                                        Parameter       Standard
Variable         Label            DF     Estimate          Error    t Value    Pr > |t|
Intercept        Intercept         1      1.94080        0.01983      97.88      <.0001
price            price             1  -6.46778E-7    4.169038E-8     -15.51      <.0001
sqft_living      sqft_living       B   0.00074449     0.00002542      29.29      <.0001
sqft_lot         sqft_lot          1    -7.395E-7    1.820141E-7      -4.06      <.0001
sqft_above       sqft_above        B   0.00002512     0.00002819       0.89      0.3728
sqft_basement    sqft_basement     0            0              .          .           .
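The summary figures under each Analysis of Variance table follow directly from the sums of squares and degrees of freedom. As a hedged check, the sketch below recomputes them from the Model and Error rows of the first fitted model (sums of squares 1030.10115 and 1817.60534, 4 and 5269 degrees of freedom, dependent mean 2.80319). The function name `anova_summary` and its dictionary keys are assumptions for this example.

```python
import math


def anova_summary(ss_model, ss_error, df_model, df_error, dependent_mean):
    """Derive the usual regression summary figures from an ANOVA table."""
    ss_total = ss_model + ss_error
    df_total = df_model + df_error
    ms_model = ss_model / df_model
    ms_error = ss_error / df_error
    root_mse = math.sqrt(ms_error)
    return {
        "f_value": ms_model / ms_error,
        "r_square": ss_model / ss_total,
        "adj_r_square": 1 - ms_error / (ss_total / df_total),
        "root_mse": root_mse,
        "coeff_var": 100 * root_mse / dependent_mean,
    }


# Values taken from the first Analysis of Variance table in the output.
first_model = anova_summary(1030.10115, 1817.60534, 4, 5269, 2.80319)
for name, value in first_model.items():
    print(f"{name:>12}: {value:.5f}")
```

The same function can be applied to the other four Analysis of Variance tables by substituting their sums of squares and dependent means.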
Dependent Variable: bathrooms

Number of Observations Read    5274
Number of Observations Used    5274

                       Analysis of Variance
                              Sum of         Mean
Source               DF      Squares       Square    F Value    Pr > F
Model                 4   1518.67050    379.66763    1799.42    <.0001
Error              5269   1111.72919      0.21099
Corrected Total    5273   2630.39970

Root MSE           0.45934    R-Square    0.5774
Dependent Mean     1.50436    Adj R-Sq    0.5770
Coeff Var         30.53397

NOTE: Model is not full rank. Least-squares solutions for the parameters are not
unique. Some statistics will be misleading. A reported DF of 0 or B means that the
estimate is biased.
NOTE: The following parameters have been set to 0, since the variables are a linear
combination of other variables as shown.
    sqft_basement = sqft_living - sqft_above

                              Parameter Estimates
                                        Parameter       Standard
Variable         Label            DF     Estimate          Error    t Value    Pr > |t|
Intercept        Intercept         1      0.34830        0.01551      22.46      <.0001
price            price             1  2.567253E-8     3.26051E-8       0.79      0.4311
sqft_living      sqft_living       B   0.00064208     0.00001988      32.30      <.0001
sqft_lot         sqft_lot          1  -6.55906E-7    1.423491E-7      -4.61      <.0001
sqft_above       sqft_above        B   0.00015721     0.00002204       7.13      <.0001
sqft_basement    sqft_basement     0            0              .          .           .
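The "not full rank" note arises because sqft_basement is an exact linear combination of two other predictors. The sketch below illustrates this with a handful of made-up rows: a design matrix that contains an intercept, sqft_living, sqft_above and their difference has rank 3 rather than 4. The function name `matrix_rank` and the sample values are hypothetical and are not taken from the data set.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix by Gaussian elimination with exact rational arithmetic."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a non-zero entry in this column.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == n_rows:
            break
    return rank


# Hypothetical (sqft_living, sqft_above) pairs; basement = living - above.
houses = [(1180, 1180), (2570, 2170), (770, 770), (1960, 1050), (1680, 1680)]
with_basement = [[1, living, above, living - above] for living, above in houses]
without_basement = [[1, living, above] for living, above in houses]

print("columns:", len(with_basement[0]), "rank:", matrix_rank(with_basement))
print("columns:", len(without_basement[0]), "rank:", matrix_rank(without_basement))
```

Dropping one of the three area variables from the model statement would restore full rank without changing the fitted values.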
Dependent Variable: floors

Number of Observations Read    5274
Number of Observations Used    5274

                       Analysis of Variance
                              Sum of         Mean
Source               DF      Squares       Square    F Value    Pr > F
Model                 4    253.90699     63.47675     487.30    <.0001
Error              5269    686.34728      0.13026
Corrected Total    5273    940.25427

Root MSE           0.36092    R-Square    0.2700
Dependent Mean     1.17349    Adj R-Sq    0.2695
Coeff Var         30.75583

NOTE: Model is not full rank. Least-squares solutions for the parameters are not
unique. Some statistics will be misleading. A reported DF of 0 or B means that the
estimate is biased.
NOTE: The following parameters have been set to 0, since the variables are a linear
combination of other variables as shown.
    sqft_basement = sqft_living - sqft_above

                              Parameter Estimates
                                        Parameter       Standard
Variable         Label            DF     Estimate          Error    t Value    Pr > |t|
Intercept        Intercept         1      0.70070        0.01219      57.50      <.0001
price            price             1  1.745027E-7    2.561875E-8       6.81      <.0001
sqft_living      sqft_living       B  -0.00009904     0.00001562      -6.34      <.0001
sqft_lot         sqft_lot          1  -5.14999E-7    1.118477E-7      -4.60      <.0001
sqft_above       sqft_above        B   0.00042866     0.00001732      24.75      <.0001
sqft_basement    sqft_basement     0            0              .          .           .
Dependent Variable: view

Number of Observations Read    5274
Number of Observations Used    5274

                       Analysis of Variance
                              Sum of         Mean
Source               DF      Squares       Square    F Value    Pr > F
Model                 4    237.10683     59.27671     184.47    <.0001
Error              5269   1693.15634      0.32134
Corrected Total    5273   1930.26318

Root MSE           0.56687    R-Square    0.1228
Dependent Mean     0.14941    Adj R-Sq    0.1222
Coeff Var        379.40089

NOTE: Model is not full rank. Least-squares solutions for the parameters are not
unique. Some statistics will be misleading. A reported DF of 0 or B means that the
estimate is biased.
NOTE: The following parameters have been set to 0, since the variables are a linear
combination of other variables as shown.
    sqft_basement = sqft_living - sqft_above

                              Parameter Estimates
                                        Parameter       Standard
Variable         Label            DF     Estimate          Error    t Value    Pr > |t|
Intercept        Intercept         1     -0.17437        0.01914      -9.11      <.0001
price            price             1   7.47685E-7    4.023783E-8      18.58      <.0001
sqft_living      sqft_living       B   0.00011362     0.00002453       4.63      <.0001
sqft_lot         sqft_lot          1  6.480955E-7    1.756725E-7       3.69      0.0002
sqft_above       sqft_above        B  -0.00012092     0.00002720      -4.45      <.0001
sqft_basement    sqft_basement     0            0              .          .           .
Dependent Variable: condition

Number of Observations Read    5274
Number of Observations Used    5274

                       Analysis of Variance
                              Sum of         Mean
Source               DF      Squares       Square    F Value    Pr > F
Model                 4     51.22687     12.80672      26.80    <.0001
Error              5269   2517.61272      0.47782
Corrected Total    5273   2568.83959

Root MSE           0.69124    R-Square    0.0199
Dependent Mean     3.46189    Adj R-Sq    0.0192
Coeff Var         19.96721

NOTE: Model is not full rank. Least-squares solutions for the parameters are not
unique. Some statistics will be misleading. A reported DF of 0 or B means that the
estimate is biased.
NOTE: The following parameters have been set to 0, since the variables are a linear
combination of other variables as shown.
    sqft_basement = sqft_living - sqft_above

                              Parameter Estimates
                                        Parameter       Standard
Variable         Label            DF     Estimate          Error    t Value    Pr > |t|
Intercept        Intercept         1      3.50167        0.02334     150.05      <.0001
price            price             1  2.506291E-7    4.906598E-8       5.11      <.0001
sqft_living      sqft_living       B   0.00016466     0.00002992       5.50      <.0001
sqft_lot         sqft_lot          1  -1.96459E-7    2.142149E-7      -0.92      0.3591
sqft_above       sqft_above        B  -0.00029533     0.00003317      -8.90      <.0001
sqft_basement    sqft_basement     0            0              .          .           .
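L Ginv(X'X) L' (the price row is missing from the extraction)
                      price     sqft_living        sqft_lot      sqft_above
sqft_living    -1.17092E-12    1.8732555E-9    -1.67642E-13    -1.651999E-9
sqft_lot       7.776209E-16    -1.67642E-13    9.603697E-14    -1.31872E-12
sqft_above     -7.63057E-14    -1.651999E-9    -1.31872E-12    2.3029018E-9

LB-cj
                   bedrooms       bathrooms          floors            view       condition
price          -6.467775E-7    2.5672529E-8    1.7450273E-7    7.4768504E-7    2.5062915E-7
sqft_living    0.0007444866    0.0006420774    -0.000099036    0.0001136185    0.0001646605
sqft_lot       -7.394999E-7    -6.559065E-7    -5.149987E-7    6.4809548E-7    -1.964595E-7
sqft_above     0.0000251197    0.0001572081    0.0004286615     -0.00012092    -0.000295333

Inv(L Ginv(X'X) L') (the price row is missing from the extraction)
                      price     sqft_living        sqft_lot      sqft_above
sqft_living    635548754985      2626772800     25805732831    1920165356.3
sqft_lot       4.7124527E12     25805732831    1.0760494E13     24829857583
sqft_above     470294937757    1920165356.3     24829857583    1841477067.5

Inv()(LB-cj) (the bathrooms column is missing from the extraction)
                   bedrooms          floors            view       condition
price          253454358.33    197751985.68    282004486.05    53193572.838
sqft_living    1573689.0694     660570.8868    558177.54077    19654.062571
sqft_lot       8830446.8294    3368602.4915    10426828.758    -4016823.744
sqft_above     1153256.7884    668485.18942    363218.64619    -114682.6826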
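Error Matrix (E)
                  bedrooms       bathrooms          floors            view       condition
bedrooms        1817.60534    241.08037659    -12.10035207    -123.4849555    193.12284288
bathrooms     241.08037659    1111.7291944     306.4134363    -43.33131935    -56.21408864
floors        -12.10035207     306.4134363    686.34727878    20.028916995    -155.8725535
view          -123.4849555    -43.33131935    20.028916995    1693.1563433    38.762558085
condition     193.12284288    -56.21408864    -155.8725535    38.762558085      2517.61272

Hypothesis Matrix (H)
                  bedrooms       bathrooms          floors            view       condition
bedrooms        1030.10115    1192.4463583     378.1856763    234.57483031    -19.68256984
bathrooms     1192.4463583    1518.6705022    532.09623378     415.8948385     -1.40934708
floors         378.1856763    532.09623378    253.90698743    144.25891008    -39.75543283
view          234.57483031     415.8948385    144.25891008    237.10683454    53.269296295
condition     -19.68256984     -1.40934708    -39.75543283    53.269296295    51.226870419

Hypothesis + Error Matrix (T)
                  bedrooms       bathrooms          floors            view       condition
bedrooms        2847.70648    1433.5267349    366.08532423    111.08987486    173.44027304
bathrooms     1433.5267349    2630.3996966    838.50967008    372.56351915    -57.62343572
floors        366.08532423    838.50967008    940.25426621    164.28782708     -195.6279863
view          111.08987486    372.56351915    164.28782708    1930.2631779     92.03185438
condition     173.44027304    -57.62343572     -195.6279863     92.03185438    2568.8395904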
Eigenvectors
     0.003560     0.008803     0.015117    -0.010636     0.010215
 -0.000054776     0.001252     0.006737     0.018051     0.005257
     0.006046    -0.010945    -0.007015    -0.003435     0.016544
     0.013853    -0.003438     0.009279     0.002717    -0.020334
     0.002709     0.027851    -0.022405     0.008322     0.012991

Eigenvalues
     0.639941
     0.113344
     0.101557
     0.002117
 1.035931E-16
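Multivariate Statistics and F Approximations
M=0    N=2631.5

Statistic                      Value    F Value    Num DF    Den DF    Pr > F
Wilks' Lambda             0.28621943     400.06        20     17463    <.0001
Pillai's Trace            0.85695903     287.27        20     21072    <.0001
Hotelling-Lawley Trace    2.02031257     531.74        20     11576    <.0001
Roy's Greatest Root       1.77732087    1872.59         5      5268    <.0001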
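The four multivariate test statistics are functions of the eigenvalues listed in the output. The sketch below recomputes them, assuming the printed eigenvalues are the characteristic roots of H * inv(H + E). That assumption is consistent with the output, since the roots sum to the second reported statistic, 0.85695903. The function name `multivariate_statistics` and its dictionary keys are illustrative only.

```python
import math


def multivariate_statistics(eigenvalues):
    """Test statistics from the roots of H * inv(H + E), each root in [0, 1)."""
    ratios = [ev / (1 - ev) for ev in eigenvalues]
    return {
        "wilks_lambda": math.prod(1 - ev for ev in eigenvalues),
        "pillai_trace": sum(eigenvalues),
        "hotelling_lawley_trace": sum(ratios),
        "roy_greatest_root": max(ratios),
    }


# Eigenvalues as printed in the MTEST output.
roots = [0.639941, 0.113344, 0.101557, 0.002117, 1.035931e-16]
stats = multivariate_statistics(roots)
for name, value in stats.items():
    print(f"{name:>24}: {value:.6f}")
```

Because the eigenvalues are printed to six decimals, the recomputed statistics agree with the reported values only to about five decimal places.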
Appendix
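/* Import the Excel workbook into a SAS data set. */
PROC IMPORT OUT= WORK.swarnanaman
            DATAFILE= "C:\Users\imt\Desktop\Nandita_SAS\housedataset.xlsx"
            DBMS=EXCEL REPLACE;
     RANGE="housedataset$";
     GETNAMES=YES;
     MIXED=NO;
     SCANTEXT=YES;
     USEDATE=YES;
     SCANTIME=YES;
RUN;

/* Univariate statistics, plots and normality tests for price and the area variables. */
PROC univariate data=WORK.swarnanaman plot normal;
     var price sqft_living sqft_lot sqft_above sqft_basement;
run;

/* Working copy of the imported data for the regression step. */
data Reg;
     set Swarnanaman;
run;

/* Multivariate multiple regression of five outcomes on price and the area variables.
   sqft_basement = sqft_living - sqft_above, so SAS reports the model as not full rank. */
proc reg data = Reg;
     model bedrooms bathrooms floors view condition = price sqft_living
           sqft_lot sqft_above sqft_basement;
     mtest / details print;
run;
quit;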