
Surajit Basak

Contents
Introduction
Problem Statement
List of Technical Tasks
Data Description
    Qualitative Variables
    Quantitative Variables
Analysis
Conclusions and Recommendations
Appendix
    Regression after outlier removal-1
    Regression after outlier removal-2

Introduction
This regression class was really interesting; it taught us not only more about useful statistical
techniques in general but also regression analysis in particular, which is one of the most
important statistical tools used in the professional world. What makes regression modeling so
vital and useful is that it lets us predict a future value, or another variable, from data already
collected in the real world. We therefore wanted to build a regression model from data we
collected ourselves, so that we could apply the material we learned to real-life data and
strengthen our knowledge through practice.
Regression analysis can be used in many real-life situations, so finding suitable data is not a
problem. As a first step, we looked at a few frequent activities in our daily lives, because data
on activities done regularly (if not every day) are easy to collect. As a second criterion, the
result of the regression model should be useful in daily life. From these two considerations we
chose TV watching as the topic. Nowadays most people watch TV, and personal data can easily be
profiled for each person. Moreover, TV watching time combined with general personal data matters
to broadcasting companies, TV manufacturers, and advertising companies. If we obtain a significant
regression model, these companies can use the result to target viewers based on the specific
factors they need. For example, if people aged 40-45 with incomes of $30,000-$35,000 have the
highest TV watching time, advertising companies should focus on the products this group wants.
We thought this would be a useful exercise, since we learn by applying a regression model to
real-world data.

Problem Statement
Although a high proportion of people watch television, some do not, and viewing time and habits
depend mostly on personal factors such as the kind of programs people like and how much leisure
time they have. We therefore set out to capture data from regular TV watchers, who can plausibly
affect a company's profit. A statistical understanding of how many hours people watch television
will help companies develop an advertising strategy.
In this report, we identified various factors that can affect people's TV watching time: gender,
race, employment, presence of a spouse, cable availability, education level, number of children,
income, and hours spent on leisure. These factors are the independent variables in our regression
model, which is used to predict the hours spent watching TV. Our overall objective is to show how
TV watching time differs with certain significant factors, so that companies can later align
their advertising with these factors to increase their profit margin. For example, if we find
that women tend to watch more TV than men, advertising companies can target more of their
advertisements at women, which could increase their profit margin.

List of Technical Tasks
Our target is to find a suitable regression model to predict TV watching time from the
independent variables.
A regression model rests on several assumptions that must be fulfilled for the model to be
considered valid, so we also need to check these assumptions and make corrections where necessary.
I will start with scatter plots, which show whether there is any linear relationship between the
dependent and independent variables.
Next I will run the regression analysis and check whether there are any outliers in the data. If
outliers are found, I will remove them until the data contain no significant outliers. This takes
care of another assumption of the regression analysis.
Then I will select a subset model of the significant variables from the full model.
All the other assumptions will be checked on this model to see whether they hold.

Data Description
Rather than taking data from online or other sources, our group decided to collect the data in
person, since we wanted our own data (which we expected to be more accurate) rather than data
collected by others. Our group members went to several different locations, such as Georgia State,
Atlantic Station, and Coca-Cola, and selected people at random; by doing so we tried to reduce
data collection or sampling bias. We recorded several qualitative and quantitative variables. The
independent variables below were chosen because we considered them important in affecting TV
watching time.

Qualitative Variables:

Gender: 1 = Male, 0 = Female (Qualitative variable with Nominal Scale)
Asian: 1 = Asian, 0 = Non-Asian (Qualitative variable with Nominal Scale)
Caucasian: 1 = Caucasian, 0 = Non-Caucasian (Qualitative variable with Nominal Scale)
African-American: 1 = African-American, 0 = Non-African-American (Qualitative variable with
Nominal Scale)
Employment: 1 = Viewer has a job, 0 = he/she does not (Qualitative variable with Nominal Scale)
Spouse: 1 = Viewer is married, 0 = he/she is not (Qualitative variable with Nominal Scale)
Cable TV: 1 = Viewer has a cable connection, 0 = he/she does not (Qualitative variable with
Nominal Scale)
Education: Measurement of viewer's education level (1 = High School Diploma, 2 = College,
3 = Graduate school) (Qualitative variable with Ordinal Scale)

Quantitative Variables:
Age: Quantitative measurement of viewer's age (Quantitative variable with Ratio Scale)
Children: Quantitative measurement of viewer's number of children (Quantitative variable with
Ratio Scale)
Income: Quantitative measurement of viewer's income (Income range) (Quantitative variable
with Ratio Scale)
Leisure: Quantitative measurement of viewer's time spent on leisure (Hour spent on leisure
weekly) (Quantitative variable with Ratio Scale)

Our dependent variable is:
Hours: Hours spent on watching TV weekly (Quantitative variable with Ratio Scale)
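The coding scheme above can be sketched in a few lines of Python. This is only an illustration of the 0/1 and ordinal codes listed in this section; the function name `encode_respondent` and the text labels it accepts (such as "asian" or "college") are assumptions made for the example and are not part of the original survey. A respondent who matches none of the three race indicators gets 0 for all of them.

```python
def encode_respondent(gender, race, employed, married, has_cable, education):
    """Map one respondent's answers to the numeric codes described above.

    The argument labels are hypothetical; only the resulting 0/1 and 1-3
    codes come from the variable definitions in the report.
    """
    education_levels = {"high school": 1, "college": 2, "graduate": 3}
    return {
        "GENDER": 1 if gender == "male" else 0,
        "ASIAN": 1 if race == "asian" else 0,
        "CAUCASIAN": 1 if race == "caucasian" else 0,
        "AFRICAN AMERICAN": 1 if race == "african-american" else 0,
        "EMPLOYED": 1 if employed else 0,
        "SPOUSE": 1 if married else 0,
        "CABLE": 1 if has_cable else 0,
        "EDUCATION": education_levels[education],
    }


# A made-up respondent, used only to show the shape of the encoded record.
example = encode_respondent("female", "asian", True, False, True, "college")
print(example)
```

Keeping three separate race indicators, rather than one numeric race code, avoids imposing an artificial ordering on a nominal variable.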

Analysis:

There are several assumptions that we need to check when performing the regression analysis.
Because the regression model depends on these assumptions, violating them may give a regression
equation that is not useful at all.
Before proceeding to that kind of analysis, however, we need to check the relationship between
the dependent and independent variables. The best way to do this is to look at the correlation
matrix or the scatter plots.
Scatter plots are meaningful only for the quantitative variables, so only those are shown below.
[Figure: scatterplots of HOURS vs AGE, HOURS vs CHILDREN, HOURS vs INCOME, and HOURS vs LEISURE]

The above plots show no significant relationship between the independent variables and
the dependent variable. We can only see some support of a negative relationship between Hours
and the independent variable income. Let us look at the correlation matrix for more information.

The obtained correlation matrix is given below,


Correlation: HOURS, AGE, CHILDREN, INCOME, LEISURE

             HOURS      AGE  CHILDREN   INCOME
AGE          0.037
             0.601

CHILDREN     0.000    0.038
             0.997    0.594

INCOME      -0.375    0.011     0.124
             0.000    0.877     0.079

LEISURE     -0.123   -0.115     0.007    0.204
             0.082    0.104     0.919    0.004

Cell Contents: Pearson correlation
               P-Value

From the above result it is clear that only INCOME has a significant correlation with HOURS.
Although this suggests eliminating the variables that are not significantly correlated with the
dependent variable HOURS, we also have many qualitative dummy variables, so I am keeping these
insignificant variables in my regression for now.
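As a rough check on how such a correlation and its significance are obtained, the sketch below computes a Pearson correlation and the usual test statistic t = r * sqrt((n - 2) / (1 - r^2)). The toy lists are invented for illustration; the only figures taken from the report are r = -0.375 for INCOME and n = 200, the latter inferred from the total of 199 degrees of freedom in the full regression output. The function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing zero correlation, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Toy data, not from the survey.
toy_r = pearson_r([1, 2, 3, 4, 5], [1, 3, 2, 5, 4])

# Reported correlation between HOURS and INCOME, assuming n = 200.
t_income = t_statistic(-0.375, 200)
print(round(toy_r, 3), round(t_income, 2))
```

With roughly 198 degrees of freedom, a t statistic this far from zero is consistent with the reported P-value of 0.000 for INCOME.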
Before starting to analyze the data, we need to check the assumptions of regression analysis:
i) Linear relationship
ii) Normality
iii) No or little multicollinearity
iv) Homoscedasticity
v) No significant outliers in the model
vi) No serial correlation in the model

The first assumption has already been checked through the scatter plots.
The other assumptions could also be checked before the regression analysis, but since we may
need to select a subset of variables, I am leaving those checks for a later part.
The full regression model output, with all the independent variables, is given below.
Regression Analysis: HOURS versus GENDER, ASIAN, CAUCASIAN,
AFRICAN AMER, EMPLOYED, ...

Analysis of Variance
Source               DF   Seq SS  Contribution   Adj SS   Adj MS  F-Value  P-Value
Regression           12  342.100        54.39%  342.100   28.508    18.58    0.000
  GENDER              1    0.126         0.02%    0.627    0.627     0.41    0.524
  ASIAN               1    0.004         0.00%    0.458    0.458     0.30    0.585
  CAUCASIAN           1    1.176         0.19%    2.751    2.751     1.79    0.182
  AFRICAN AMERICAN    1    1.754         0.28%    0.255    0.255     0.17    0.684
  EMPLOYED            1  268.434        42.68%  227.611  227.611   148.36    0.000
  SPOUSE              1    0.220         0.03%    4.661    4.661     3.04    0.083
  CABLE               1   27.197         4.32%   28.553   28.553    18.61    0.000
  AGE                 1    0.441         0.07%    1.435    1.435     0.94    0.335
  EDUCATION           1   18.581         2.95%    5.963    5.963     3.89    0.050
  CHILDREN            1    2.483         0.39%    4.527    4.527     2.95    0.088
  INCOME              1   20.832         3.31%   18.628   18.628    12.14    0.001
  LEISURE             1    0.852         0.14%    0.852    0.852     0.56    0.457
Error               187  286.900        45.61%  286.900    1.534
  Lack-of-Fit       186  286.400        45.53%  286.400    1.540     3.08    0.431
  Pure Error          1    0.500         0.08%    0.500    0.500
Total               199  629.000       100.00%

Model Summary
      S    R-sq  R-sq(adj)    PRESS  R-sq(pred)
1.23864  54.39%     51.46%  329.350      47.64%

Coefficients
Term                   Coef   SE Coef  95% CI                  T-Value  P-Value   VIF
Constant              5.944     0.502  (4.954, 6.934)            11.85    0.000
GENDER               -0.116     0.181  (-0.473, 0.242)           -0.64    0.524  1.06
ASIAN                 0.162     0.296  (-0.422, 0.745)            0.55    0.585  1.86
CAUCASIAN             0.372     0.278  (-0.176, 0.919)            1.34    0.182  2.09
AFRICAN AMERICAN      0.111     0.272  (-0.426, 0.648)            0.41    0.684  2.16
EMPLOYED             -3.050     0.250  (-3.544, -2.556)         -12.18    0.000  1.13
SPOUSE               -0.397     0.228  (-0.846, 0.052)           -1.74    0.083  1.69
CABLE                 0.793     0.184  (0.431, 1.156)             4.31    0.000  1.07
AGE                 0.00898   0.00929  (-0.00934, 0.02730)        0.97    0.335  1.12
EDUCATION            -0.263     0.133  (-0.526, 0.000)           -1.97    0.050  1.27
CHILDREN              0.190     0.111  (-0.028, 0.409)            1.72    0.088  1.66
INCOME            -0.000021  0.000006  (-0.000032, -0.000009)    -3.48    0.001  1.30
LEISURE             -0.0294    0.0395  (-0.1074, 0.0485)         -0.75    0.457  1.11

Regression Equation
HOURS = 5.944 - 0.116 GENDER + 0.162 ASIAN + 0.372 CAUCASIAN + 0.111 AFRICAN AMERICAN
- 3.050 EMPLOYED - 0.397 SPOUSE + 0.793 CABLE + 0.00898 AGE - 0.263 EDUCATION
+ 0.190 CHILDREN - 0.000021 INCOME - 0.0294 LEISURE
Fits and Diagnostics for Unusual Observations
Obs  HOURS    Fit  SE Fit  95% CI           Resid  Std Resid  Del Resid         HI  Cook's D     DFITS
  3  5.000  2.021   0.318  (1.394, 2.648)   2.979       2.49       2.52  0.0659293      0.03   0.67055  R
 10  9.000  6.545   0.379  (5.797, 7.292)   2.455       2.08       2.10  0.0937056      0.03   0.67568  R
 11  9.500  2.964   0.291  (2.390, 3.538)   6.536       5.43       5.90  0.0552064      0.13   1.42586  R
 15  8.500  5.766   0.339  (5.096, 6.435)   2.734       2.30       2.32  0.0750477      0.03   0.66147  R
 62  9.500  6.274   0.333  (5.617, 6.930)   3.226       2.70       2.75  0.0720839      0.04   0.76682  R
 71  6.000  2.831   0.301  (2.237, 3.424)   3.169       2.64       2.68  0.0590371      0.03   0.67157  R
 83  7.000  2.951   0.331  (2.299, 3.603)   4.049       3.39       3.49  0.0712014      0.07   0.96693  R
 96  2.000  5.014   0.331  (4.361, 5.667)  -3.014      -2.53      -2.56  0.0713602      0.04  -0.71031  R
 99  3.000  5.445   0.310  (4.833, 6.057)  -2.445      -2.04      -2.06  0.0627761      0.02  -0.53224  R
116  8.000  5.450   0.364  (4.732, 6.169)   2.550       2.15       2.18  0.0865024      0.03   0.66936  R
120  5.500  2.874   0.297  (2.288, 3.461)   2.626       2.18       2.21  0.0576223      0.02   0.54557  R
163  3.500  6.153   0.341  (5.481, 6.824)  -2.653      -2.23      -2.25  0.0755750      0.03  -0.64377  R
188  3.000  5.362   0.391  (4.591, 6.134)  -2.362      -2.01      -2.03  0.0996398      0.03  -0.67419  R
192  7.500  2.939   0.258  (2.431, 3.447)   4.561       3.76       3.91  0.0432661      0.05   0.83049  R

R  Large residual

Durbin-Watson Statistic
Durbin-Watson Statistic = 1.76295

From the above output we can clearly see that many variables are insignificant in the model.
Moreover, although the overall regression model is significant, the R-square value is 54.39%,
implying that only 54.39% of the variation is explained by the regression model.
Since many variables are insignificant, we should select a model without them. But because there
are many outliers in the data (which can cause some variables to appear insignificant), I am
removing these outliers first.
For a normal distribution, about 95% and 99.73% of the values fall within 2 and 3 standard
deviations of the mean, respectively. So let's remove all data points whose standardized residual
is greater than +2 or less than -2. The output obtained after deleting them and rerunning the
regression is given in Appendix: Regression after outlier removal-1.
We can still see some outliers outside the 2 standard deviation interval. By repeatedly deleting
those data points and rerunning the model, we reached the point where no standardized residual
lies outside the 3 standard deviation interval.
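To make the screening rule concrete, the sketch below recomputes the standardized residual and Cook's distance for observation 11 of the full-model output from its raw residual (6.536), its leverage HI (0.0552064), and the model's S (1.23864). The formulas are the standard textbook ones, assumed here rather than quoted from Minitab; the function names are hypothetical, and the parameter count of 13 assumes 12 predictors plus the constant.

```python
import math


def standardized_residual(resid, s, leverage):
    """Internally studentized residual: resid / (s * sqrt(1 - h))."""
    return resid / (s * math.sqrt(1.0 - leverage))


def cooks_distance(std_resid, leverage, n_params):
    """Cook's D from a standardized residual, leverage, and parameter count."""
    return (std_resid ** 2 / n_params) * leverage / (1.0 - leverage)


S_FULL_MODEL = 1.23864  # "S" from the full-model Model Summary

# Observation 11 in the full-model diagnostics table.
r11 = standardized_residual(6.536, S_FULL_MODEL, 0.0552064)
d11 = cooks_distance(r11, 0.0552064, 13)

# The rule used in the text: flag anything beyond +/- 2.
flagged = abs(r11) > 2
print(round(r11, 2), round(d11, 2), flagged)
```

Both values agree with the reported 5.43 and 0.13 to two decimals, which suggests these are the definitions behind the table.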
Since there is about a 5% chance that a standardized residual falls outside the 2 standard
deviation interval, I am keeping this dataset and running the stepwise selection method with
alpha to enter of 0.05 and alpha to remove of 0.15.
The resulting model is given below. All the intermediate regression outputs are given in the
appendix under the corresponding headings.

Regression Analysis: HOURS versus GENDER, ASIAN, CAUCASIAN,
AFRICAN AMER, EMPLOYED, ...

Stepwise Selection of Terms
α to enter = 0.05, α to remove = 0.15

Analysis of Variance
Source           DF   Seq SS  Contribution   Adj SS   Adj MS  F-Value  P-Value
Regression        6  184.481        75.48%  184.481   30.747    85.66    0.000
  EMPLOYED        1  162.047        66.30%  142.103  142.103   395.88    0.000
  SPOUSE          1    0.075         0.03%    1.551    1.551     4.32    0.039
  CABLE           1   11.339         4.64%   12.425   12.425    34.61    0.000
  CHILDREN        1    2.968         1.21%    4.224    4.224    11.77    0.001
  INCOME          1    6.257         2.56%    4.870    4.870    13.57    0.000
  LEISURE         1    1.795         0.73%    1.795    1.795     5.00    0.027
Error           167   59.945        24.52%   59.945    0.359
  Lack-of-Fit   166   59.445        24.32%   59.445    0.358     0.72    0.761
  Pure Error      1    0.500         0.20%    0.500    0.500
Total           173  244.425       100.00%

Model Summary
       S    R-sq  R-sq(adj)    PRESS  R-sq(pred)
0.599125  75.48%     74.59%  66.7525      72.69%

Coefficients
Term            Coef   SE Coef  95% CI                  T-Value  P-Value   VIF
Constant       5.572     0.196  (5.186, 5.959)            28.47    0.000
EMPLOYED      -3.193     0.160  (-3.509, -2.876)         -19.90    0.000  1.10
SPOUSE        -0.242     0.116  (-0.472, -0.012)          -2.08    0.039  1.64
CABLE         0.5500    0.0935  (0.3654, 0.7345)           5.88    0.000  1.03
CHILDREN      0.1946    0.0567  (0.0826, 0.3066)           3.43    0.001  1.67
INCOME     -0.000011  0.000003  (-0.000017, -0.000005)    -3.68    0.000  1.13
LEISURE      -0.0450    0.0201  (-0.0847, -0.0053)        -2.24    0.027  1.04


Regression Equation
HOURS = 5.572 - 3.193 EMPLOYED - 0.242 SPOUSE + 0.5500 CABLE + 0.1946 CHILDREN
- 0.000011 INCOME - 0.0450 LEISURE
Fits and Diagnostics for Unusual Observations
Obs  HOURS    Fit  SE Fit  95% CI           Resid  Std Resid  Del Resid        HI  Cook's D      DFITS
  1  0.500  1.715   0.088  (1.541, 1.888)  -1.215      -2.05      -2.07  0.021474      0.01  -0.306549  R
  8  0.500  2.082   0.090  (1.904, 2.260)  -1.582      -2.67      -2.72  0.022635      0.02  -0.414255  R
 16  0.000  1.247   0.145  (0.960, 1.534)  -1.247      -2.15      -2.17  0.058811      0.04  -0.542294  R
 21  7.000  5.823   0.173  (5.481, 6.165)   1.177       2.05       2.07  0.083517      0.05   0.625467  R
 29  0.000  1.466   0.128  (1.214, 1.718)  -1.466      -2.50      -2.55  0.045412      0.04  -0.555239  R
 47  0.000  1.379   0.118  (1.147, 1.612)  -1.379      -2.35      -2.38  0.038685      0.03  -0.477636  R
 49  7.000  5.835   0.195  (5.449, 6.220)   1.165       2.06       2.08  0.106367      0.07   0.716874  R
 58  7.500  5.928   0.169  (5.594, 6.262)   1.572       2.74       2.79  0.079613      0.09   0.820652  R
 73  4.000  5.321   0.171  (4.983, 5.658)  -1.321      -2.30      -2.33  0.081243      0.07  -0.692789  R
 80  3.500  4.817   0.168  (4.486, 5.149)  -1.317      -2.29      -2.32  0.078552      0.06  -0.677400  R
 87  4.500  3.109   0.201  (2.712, 3.506)   1.391       2.46       2.50  0.112469      0.11   0.891054  R
 95  7.000  5.781   0.168  (5.449, 6.112)   1.219       2.12       2.14  0.078559      0.05   0.625761  R
106  0.000  1.409   0.113  (1.185, 1.633)  -1.409      -2.39      -2.43  0.035807      0.03  -0.468253  R
110  4.000  5.214   0.186  (4.847, 5.581)  -1.214      -2.13      -2.16  0.096315      0.07  -0.703592  R
115  3.500  2.157   0.092  (1.975, 2.338)   1.343       2.27       2.30  0.023601      0.02   0.357270  R

R  Large residual

Durbin-Watson Statistic
Durbin-Watson Statistic = 2.11398

Here we can see that quite a few variables turned out to be significant. Although some residuals
are outside the 2 standard deviation interval, none are outside the 3 standard deviation
interval. With 174 data points, about 9 residuals (5% of 174) are expected to fall outside 2
standard deviations under normality, and the observed number is 15, which is somewhat higher but
of the same order.
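The expected counts can be checked with a short calculation. The sketch below, which assumes exactly normal residuals and uses only the sample size of 174 from the text, compares the 5% rule of thumb with the exact two-sided normal tail probabilities beyond 2 and 3 standard deviations. The function name `prob_outside` is hypothetical.

```python
import math


def prob_outside(k):
    """P(|Z| > k) for a standard normal variable Z."""
    return 1.0 - math.erf(k / math.sqrt(2.0))


n = 174  # observations remaining after outlier removal

rule_of_thumb = n * 0.05                  # the "5% outside 2 SD" approximation
expected_outside_2 = n * prob_outside(2)  # exact normal tail beyond 2 SD
expected_outside_3 = n * prob_outside(3)  # exact normal tail beyond 3 SD

print(round(rule_of_thumb, 1), round(expected_outside_2, 2), round(expected_outside_3, 2))
```

Under exact normality fewer than one residual would be expected beyond 3 standard deviations, which is in line with none being observed there.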
By applying the above method we also took care of the fifth assumption, no significant outliers
in the model.
Now let's check the other assumptions.
Normality can be checked using the normal probability plot of the residuals, which is given
below.

[Figure: normal probability plot of the residuals (response is HOURS); x-axis: Residual, y-axis: Percent]

The above plot shows no significant deviation from the straight line, so the normality
assumption is validated.
Similarly, the homoscedasticity assumption can be checked using the residuals versus fits plot,
which is given below.

[Figure: residuals versus fitted values (response is HOURS); x-axis: Fitted Value, y-axis: Residual]

The plot suggests a slight deviation from randomness; however, all values are within 3 standard
deviations. Ignoring this slight deviation, we can say that the homoscedasticity assumption is
validated.
The Durbin-Watson statistic for this model is 2.11398, which is close to 2 and implies no
significant serial correlation; thus another assumption is validated.
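For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below applies that definition to two invented residual sequences (not the survey residuals, which are not listed in this report) to show how values near 2 differ from values produced by serially correlated residuals.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Invented sequences for illustration only.
alternating = [1, -1, 1, -1]    # sign flips every step (negative correlation)
drifting = [1, 1, 1, -1, -1, -1]  # long runs of the same sign (positive correlation)

print(durbin_watson(alternating), round(durbin_watson(drifting), 3))
```

The statistic ranges from 0 to 4: values well below 2 point to positive serial correlation, values well above 2 to negative serial correlation, and values near 2 to neither.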
The last assumption concerns multicollinearity, which can be checked using the Variance
Inflation Factors (VIFs). All VIFs have low values, implying no multicollinearity in the model.
Let's check the correlation matrix to be sure.
Since many of the variables here are qualitative, Spearman's rank correlation is used instead of
Pearson's; the obtained output is given below.
Spearman Rho: EMPLOYED, SPOUSE, CABLE, CHILDREN, INCOME, LEISURE

          EMPLOYED   SPOUSE    CABLE  CHILDREN   INCOME
SPOUSE      -0.016
             0.838

CABLE        0.112    0.091
             0.139    0.231

CHILDREN    -0.050    0.694   -0.016
             0.511    0.000    0.833

INCOME       0.245    0.062    0.057     0.128
             0.001    0.420    0.452     0.092

LEISURE     -0.067   -0.030   -0.036     0.012    0.162
             0.377    0.695    0.640     0.877    0.033

Cell Contents: Spearman rho
               P-Value
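Spearman's rho is simply the Pearson correlation computed on ranks, with tied values sharing their average rank. The sketch below shows that construction on a small invented data set; the function names are hypothetical, and it does not compute the P-values shown in the output above.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


# Invented data with one tie in y.
rho = spearman_rho([1, 2, 3, 4, 5], [5, 6, 7, 8, 7])
print(round(rho, 4))
```

Because it depends only on ranks, this measure is less sensitive to the coding of 0/1 and ordinal variables than the Pearson correlation.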

Here we can see two clearly significant correlations: between INCOME and EMPLOYED, and between
SPOUSE and CHILDREN. However, since the Variance Inflation Factors for all variables are low,
there is no problematic multicollinearity in the model.
So we can say that all the steps have been performed and the model is performing well.
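The link between a VIF and the underlying correlation structure is VIF = 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the other predictors. The sketch below inverts this relation for the VIFs reported for the final stepwise model; the function names are hypothetical, and the cutoff of 5 is a commonly quoted rule of thumb assumed here, not a threshold stated in this report.

```python
def vif_from_r2(r_squared):
    """VIF of a predictor whose regression on the other predictors has this R^2."""
    return 1.0 / (1.0 - r_squared)


def r2_from_vif(vif):
    """Share of a predictor's variance explained by the other predictors."""
    return 1.0 - 1.0 / vif


# VIFs reported for the final stepwise model.
final_vifs = {
    "EMPLOYED": 1.10,
    "SPOUSE": 1.64,
    "CABLE": 1.03,
    "CHILDREN": 1.67,
    "INCOME": 1.13,
    "LEISURE": 1.04,
}

implied_r2 = {name: r2_from_vif(v) for name, v in final_vifs.items()}
all_below_cutoff = max(final_vifs.values()) < 5  # assumed rule-of-thumb cutoff

for name, r2 in implied_r2.items():
    print(f"{name}: about {r2:.0%} of its variance explained by the other predictors")
```

Even for CHILDREN, the predictor with the largest VIF, only about 40% of its variance is explained by the other predictors, which supports the conclusion drawn from the VIFs.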

Conclusions and Recommendations:


From the above analysis we have a fairly clear idea about the data and the outcome. We saw that
the outliers strongly affect the model: while the outliers were present, most of the variables
were insignificant, and after the outliers were dealt with, many variables became significant.
Although the final model looks good, it could be improved further by spending more time on the
data. With a more iterative approach we could identify additional significant terms, such as
interaction terms and higher-order terms, which would improve the model.
The model performs well. The F test shows that the regression model is significant at the 5%
significance level (P-value < 0.05). The variables EMPLOYED, SPOUSE, CABLE, CHILDREN, INCOME, and
LEISURE are all significant at the 5% significance level.
The R-sq is 75.48% and the adjusted R-sq is 74.59%, implying that 75.48% of the variation in
HOURS is explained by the regression model. Thus the model fits the data well.
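These summary figures follow directly from the sums of squares in the final model's Analysis of Variance table. The sketch below recomputes R-sq, adjusted R-sq, and the overall F statistic from those reported values using the standard formulas; the variable names are hypothetical, and small differences in the last digit can arise because the inputs are already rounded.

```python
# Values copied from the stepwise model's Analysis of Variance table.
ss_regression = 184.481
ss_error = 59.945
ss_total = 244.425
df_regression = 6
df_error = 167
df_total = 173

# R-sq: share of the total variation explained by the model.
r_sq = 1.0 - ss_error / ss_total

# Adjusted R-sq: the same comparison on mean squares, penalizing extra terms.
r_sq_adj = 1.0 - (ss_error / df_error) / (ss_total / df_total)

# Overall F statistic: mean square regression over mean square error.
f_value = (ss_regression / df_regression) / (ss_error / df_error)

print(f"R-sq = {r_sq:.2%}, R-sq(adj) = {r_sq_adj:.2%}, F = {f_value:.2f}")
```

The gap between R-sq and adjusted R-sq is under one percentage point, so the six retained terms are not inflating the fit by much.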
The significant variables also give us useful information. The coefficient of EMPLOYED is
negative and large, so advertising companies should target non-employed people. They should also
consider unmarried people, since unmarried persons appear to spend more time watching TV than
married people (the beta coefficient for SPOUSE is negative).
They should also consider people who have more children and people who have a cable connection
to optimize their profit.
However, the regression model also suggests targeting people with lower income and less leisure
time, which might be misleading. We should run more tests to see whether this is a real effect
or just an artifact of the characteristics of the collected data.
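To illustrate how the fitted equation could be used for targeting, the sketch below evaluates the final stepwise regression equation for two hypothetical viewer profiles that differ only in employment. The coefficients are copied from the reported equation; the profiles, the function name `predict_hours`, and the chosen income and leisure values are assumptions for illustration, and predictions outside the range of the collected data should not be trusted.

```python
# Coefficients of the final stepwise model, as reported in the regression equation.
FINAL_COEFS = {
    "Constant": 5.572,
    "EMPLOYED": -3.193,
    "SPOUSE": -0.242,
    "CABLE": 0.5500,
    "CHILDREN": 0.1946,
    "INCOME": -0.000011,
    "LEISURE": -0.0450,
}


def predict_hours(employed, spouse, cable, children, income, leisure):
    """Predicted HOURS from the fitted equation."""
    return (
        FINAL_COEFS["Constant"]
        + FINAL_COEFS["EMPLOYED"] * employed
        + FINAL_COEFS["SPOUSE"] * spouse
        + FINAL_COEFS["CABLE"] * cable
        + FINAL_COEFS["CHILDREN"] * children
        + FINAL_COEFS["INCOME"] * income
        + FINAL_COEFS["LEISURE"] * leisure
    )


# Two made-up profiles: unmarried, cable, no children, $30,000 income, 10 leisure hours.
employed_viewer = predict_hours(1, 0, 1, 0, 30000, 10)
unemployed_viewer = predict_hours(0, 0, 1, 0, 30000, 10)

print(round(employed_viewer, 3), round(unemployed_viewer, 3))
```

With everything else held fixed, the two predictions differ by exactly the EMPLOYED coefficient, which is why employment status dominates the targeting recommendation.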

Appendix:
Regression after outlier removal-1:
Regression Analysis: HOURS versus GENDER, ASIAN, CAUCASIAN,
AFRICAN AMER, EMPLOYED, ...

Analysis of Variance
Source               DF   Seq SS  Contribution   Adj SS   Adj MS  F-Value  P-Value
Regression           12  258.689        70.75%  258.689   21.557    34.87    0.000
  GENDER              1    3.991         1.09%    0.413    0.413     0.67    0.415
  ASIAN               1    0.731         0.20%    0.121    0.121     0.20    0.659
  CAUCASIAN           1   11.922         3.26%    0.055    0.055     0.09    0.766
  AFRICAN AMERICAN    1    1.642         0.45%    0.158    0.158     0.25    0.614
  EMPLOYED            1  199.412        54.54%  190.482  190.482   308.07    0.000
  SPOUSE              1    0.462         0.13%    2.276    2.276     3.68    0.057
  CABLE               1   18.445         5.04%   17.970   17.970    29.06    0.000
  AGE                 1    1.704         0.47%    1.843    1.843     2.98    0.086
  EDUCATION           1    5.017         1.37%    1.086    1.086     1.76    0.187
  CHILDREN            1    4.703         1.29%    6.084    6.084     9.84    0.002
  INCOME              1    8.544         2.34%    6.837    6.837    11.06    0.001
  LEISURE             1    2.116         0.58%    2.116    2.116     3.42    0.066
Error               173  106.967        29.25%  106.967    0.618
  Lack-of-Fit       172  106.467        29.12%  106.467    0.619     1.24    0.630
  Pure Error          1    0.500         0.14%    0.500    0.500
Total               185  365.656       100.00%

Model Summary
       S    R-sq  R-sq(adj)    PRESS  R-sq(pred)
0.786324  70.75%     68.72%  126.045      65.53%

Coefficients
Term                   Coef   SE Coef  95% CI                  T-Value  P-Value   VIF
Constant              5.327     0.341  (4.654, 6.000)            15.62    0.000
GENDER                0.097     0.119  (-0.138, 0.332)            0.82    0.415  1.05
ASIAN                 0.086     0.195  (-0.298, 0.470)            0.44    0.659  1.78
CAUCASIAN             0.054     0.182  (-0.305, 0.414)            0.30    0.766  2.06
AFRICAN AMERICAN      0.089     0.177  (-0.260, 0.438)            0.50    0.614  2.12
EMPLOYED             -3.139     0.179  (-3.492, -2.786)         -17.55    0.000  1.12
SPOUSE               -0.290     0.151  (-0.588, 0.008)           -1.92    0.057  1.72
CABLE                 0.652     0.121  (0.413, 0.890)             5.39    0.000  1.07
AGE                 0.01050   0.00608  (-0.00150, 0.02251)        1.73    0.086  1.11
EDUCATION           -0.1156    0.0872  (-0.2878, 0.0565)         -1.33    0.187  1.26
CHILDREN             0.2242    0.0715  (0.0831, 0.3653)           3.14    0.002  1.67
INCOME            -0.000013  0.000004  (-0.000020, -0.000005)    -3.33    0.001  1.24
LEISURE             -0.0485    0.0262  (-0.1001, 0.0032)         -1.85    0.066  1.11


Regression Equation
HOURS = 5.327 + 0.097 GENDER + 0.086 ASIAN + 0.054 CAUCASIAN + 0.089 AFRICAN AMERICAN
- 3.139 EMPLOYED - 0.290 SPOUSE + 0.652 CABLE + 0.01050 AGE - 0.1156 EDUCATION
+ 0.2242 CHILDREN - 0.000013 INCOME - 0.0485 LEISURE
Fits and Diagnostics for Unusual Observations
Obs  HOURS    Fit  SE Fit  95% CI           Resid  Std Resid  Del Resid        HI  Cook's D     DFITS
  9  5.000  2.857   0.188  (2.485, 3.229)   2.143       2.81       2.86  0.057396      0.04   0.70691  R
 10  7.000  5.112   0.235  (4.648, 5.575)   1.888       2.52       2.56  0.089301      0.05   0.80053  R
 30  4.000  1.755   0.231  (1.300, 2.211)   2.245       2.99       3.06  0.086113      0.06   0.93842  R
 49  8.000  5.776   0.225  (5.331, 6.221)   2.224       2.95       3.02  0.082241      0.06   0.90440  R
 75  1.500  3.138   0.206  (2.732, 3.544)  -1.638      -2.16      -2.18  0.068314      0.03  -0.59065  R
 88  4.000  5.723   0.215  (5.298, 6.148)  -1.723      -2.28      -2.31  0.075053      0.03  -0.65697  R
 99  8.000  6.123   0.254  (5.622, 6.625)   1.877       2.52       2.56  0.104455      0.06   0.87501  R
115  3.000  4.785   0.250  (4.291, 5.278)  -1.785      -2.39      -2.43  0.101250      0.05  -0.81480  R
130  7.000  5.450   0.220  (5.016, 5.885)   1.550       2.05       2.07  0.078376      0.03   0.60432  R
131  2.500  4.248   0.271  (3.714, 4.782)  -1.748      -2.37      -2.40  0.118514      0.06  -0.88007  R
149  5.000  2.998   0.220  (2.564, 3.432)   2.002       2.65       2.70  0.078220      0.05   0.78636  R
183  2.000  4.273   0.253  (3.774, 4.772)  -2.273      -3.05      -3.13  0.103446      0.08  -1.06315  R

R  Large residual

Durbin-Watson Statistic
Durbin-Watson Statistic = 1.94462

Regression after outlier removal-2:


Regression Analysis: HOURS versus GENDER, ASIAN, CAUCASIAN,
AFRICAN AMER, EMPLOYED, ...

Analysis of Variance
Source               DF   Seq SS  Contribution   Adj SS   Adj MS  F-Value  P-Value
Regression           12  185.856        76.04%  185.856   15.488    42.57    0.000
  GENDER              1    0.728         0.30%    0.009    0.009     0.03    0.873
  ASIAN               1    0.036         0.01%    0.341    0.341     0.94    0.335
  CAUCASIAN           1    9.831         4.02%    0.000    0.000     0.00    0.983
  AFRICAN AMERICAN    1    0.027         0.01%    0.072    0.072     0.20    0.658
  EMPLOYED            1  152.910        62.56%  137.866  137.866   378.98    0.000
  SPOUSE              1    0.217         0.09%    1.339    1.339     3.68    0.057
  CABLE               1   10.858         4.44%   11.504   11.504    31.62    0.000
  AGE                 1    0.195         0.08%    0.193    0.193     0.53    0.467
  EDUCATION           1    1.706         0.70%    0.187    0.187     0.52    0.474
  CHILDREN            1    3.180         1.30%    4.262    4.262    11.72    0.001
  INCOME              1    4.431         1.81%    3.477    3.477     9.56    0.002
  LEISURE             1    1.736         0.71%    1.736    1.736     4.77    0.030
Error               161   58.569        23.96%   58.569    0.364
  Lack-of-Fit       160   58.069        23.76%   58.069    0.363     0.73    0.758
  Pure Error          1    0.500         0.20%    0.500    0.500
Total               173  244.425       100.00%

Model Summary
       S    R-sq  R-sq(adj)    PRESS  R-sq(pred)
0.603145  76.04%     74.25%  69.9706      71.37%

Coefficients
Term                   Coef   SE Coef  95% CI                  T-Value  P-Value   VIF
Constant              5.518     0.276  (4.973, 6.063)            19.99    0.000
GENDER               0.0151    0.0943  (-0.1712, 0.2014)          0.16    0.873  1.05
ASIAN                 0.146     0.151  (-0.152, 0.444)            0.97    0.335  1.75
CAUCASIAN            -0.003     0.143  (-0.285, 0.279)           -0.02    0.983  2.00
AFRICAN AMERICAN     -0.061     0.138  (-0.334, 0.211)           -0.44    0.658  2.01
EMPLOYED             -3.232     0.166  (-3.560, -2.904)         -19.47    0.000  1.16
SPOUSE               -0.230     0.120  (-0.467, 0.007)           -1.92    0.057  1.72
CABLE                0.5409    0.0962  (0.3509, 0.7308)           5.62    0.000  1.08
AGE                 0.00354   0.00486  (-0.00605, 0.01313)        0.73    0.467  1.13
EDUCATION           -0.0494    0.0688  (-0.1853, 0.0865)         -0.72    0.474  1.29
CHILDREN             0.1977    0.0577  (0.0836, 0.3117)           3.42    0.001  1.70
INCOME            -0.000010  0.000003  (-0.000016, -0.000004)    -3.09    0.002  1.29
LEISURE             -0.0453    0.0208  (-0.0863, -0.0044)        -2.18    0.030  1.10


Regression Equation
HOURS = 5.518 + 0.0151 GENDER + 0.146 ASIAN - 0.003 CAUCASIAN - 0.061 AFRICAN AMERICAN
- 3.232 EMPLOYED - 0.230 SPOUSE + 0.5409 CABLE + 0.00354 AGE - 0.0494 EDUCATION
+ 0.1977 CHILDREN - 0.000010 INCOME - 0.0453 LEISURE
Fits and Diagnostics for Unusual Observations
Obs  HOURS    Fit  SE Fit  95% CI           Resid  Std Resid  Del Resid        HI  Cook's D     DFITS
  1  0.500  1.914   0.148  (1.622, 2.205)  -1.414      -2.42      -2.45  0.059821      0.03  -0.61923  R
  8  0.500  1.965   0.133  (1.703, 2.226)  -1.465      -2.49      -2.53  0.048293      0.02  -0.57008  R
 16  0.000  1.319   0.193  (0.938, 1.701)  -1.319      -2.31      -2.34  0.102496      0.05  -0.79113  R
 21  7.000  5.720   0.205  (5.315, 6.124)   1.280       2.26       2.29  0.115574      0.05   0.82669  R
 23  3.500  2.303   0.127  (2.053, 2.554)   1.197       2.03       2.05  0.044229      0.01   0.44095  R
 29  0.000  1.392   0.159  (1.079, 1.706)  -1.392      -2.39      -2.43  0.069151      0.03  -0.66204  R
 33  6.000  4.838   0.193  (4.457, 5.219)   1.162       2.03       2.05  0.102488      0.04   0.69405  R
 47  0.000  1.365   0.159  (1.052, 1.679)  -1.365      -2.35      -2.38  0.069404      0.03  -0.65010  R
 58  7.500  5.869   0.182  (5.509, 6.229)   1.631       2.84       2.90  0.091226      0.06   0.91922  R
 73  4.000  5.479   0.232  (5.021, 5.936)  -1.479      -2.66      -2.71  0.147459      0.09  -1.12589  R
 80  3.500  4.719   0.194  (4.336, 5.103)  -1.219      -2.14      -2.16  0.103423      0.04  -0.73343  R
 87  4.500  3.152   0.223  (2.711, 3.593)   1.348       2.41       2.44  0.137104      0.07   0.97371  R
 94  1.500  2.789   0.193  (2.409, 3.170)  -1.289      -2.26      -2.29  0.101942      0.04  -0.76986  R
106  0.000  1.337   0.149  (1.042, 1.631)  -1.337      -2.29      -2.32  0.061106      0.03  -0.59129  R
115  3.500  2.116   0.135  (1.848, 2.383)   1.384       2.36       2.39  0.050391      0.02   0.55046  R

R  Large residual

Durbin-Watson Statistic
Durbin-Watson Statistic = 2.11168
