MLR

CHAPTER 1
INTRODUCTION

1.1 Child Mortality

The United Nations (UN) (2013) reported that the well-being of children has improved
and that the child mortality rate has decreased remarkably in most countries through
worldwide efforts since 1950. However, progress in reducing the neonatal mortality rate
has been slower, and progress towards the Millennium Development Goal (MDG) of
reducing the under-five mortality rate by two-thirds by 2015 seems insufficient.

Although the child mortality rate is falling globally, it remains high in a few
regions, so the rate of decline is not fast enough to sustain child survival. Of the
67 countries defined as having a high child mortality rate, only 10 are on track to
meet the MDG target.

Based on the UN report (2013), East Asia showed the largest decline in the child
mortality rate, followed by Southeast Asia, West Asia, and South Asia, with the smallest
decline in Central Asia. The under-five mortality rate in Asia declined by half between
1990 and 2011. However, this decline is also not fast enough to meet the MDG target for
child survival mentioned above.

1.2 Child Mortality Rate Analysis

To investigate the interaction effects among the factors in the data set, a multiple
linear regression model, which examines the contribution of a set of independent
variables to a dependent variable, will be built. The model can be further refined to
incorporate additional parameters and variables, with interactions, for examining the
child mortality rate.

1.3 Objectives

The objectives of this study are:
(i) To examine the interrelationship between the under-five child mortality rate in
Asia in 2010 and its factors, namely carbon dioxide emissions, percentage of
Hepatitis B immunization coverage among one-year-olds, percentage case detection
rate for all forms of tuberculosis, number of reported deaths from human
immunodeficiency virus (HIV), number of reported deaths from measles, percentage
of the population using an improved drinking water source, percentage of
one-year-old children immunized against measles, and number of birth trauma cases
reported.
(ii) To determine the regression coefficients in the multiple regression model of the
under-five child mortality rate in Asia.
(iii) To identify possible outliers and influential points of the regression model.

1.4 Overview of the Research

This report is divided into 5 chapters. Chapter 1 presents the background of the
under-five child mortality rate in Asia in 2010, gives some suggestions for analyzing
the child mortality rate, and states the objectives of this study.

Chapter 2 focuses on the literature related to child mortality in Asia. A few previous
studies on the methods used in this study, such as multiple regression analysis
techniques, model selection methods, and robust regression, are also reviewed in this
chapter.

Chapter 3 describes the raw data, drawn from various sources, on the child mortality
rate and its potential factors in 47 Asian countries. It also describes the methods used
in analyzing the data, including correlation analysis, variable selection, data
transformation, detection of outliers and influential points, and robust regression.

Chapter 4 presents the application of multiple regression to the child mortality rate
data. The methods listed in Chapter 3 are applied in this chapter to produce results
that fulfill the objectives of this study.

Chapter 5 summarizes the overall conclusions and discusses some opportunities for
future research.

CHAPTER 2
LITERATURE REVIEW

2.1 Child Mortality Studies

According to the United Nations Children's Fund (UNICEF) (2014) child mortality
statistics, sub-Saharan Africa and South Asia showed the highest numbers of child deaths,
accounting for 41% and 34% respectively. Black et al. (2003) used a prediction model to
estimate the distribution of deaths in children under five by cause for 42 countries in
2000. The output of the prediction model was compared with World Health Organization
(WHO) statistics, and the analysis of the differences between the two approaches
contributed to an understanding of their strengths and weaknesses for the major causes
of child mortality, such as neonatal causes, diarrhea, respiratory infections, acquired
immunodeficiency syndrome (AIDS), and other causes.

Gabriele and Schettino (2008) conducted several analyses, such as ordinary least squares
(OLS), to illustrate a model for the basic causal relations among UN, prevalence of
underweight, and under-five child mortality. Furthermore, a seemingly unrelated
regression (SUR) was conducted to analyze the impact of both income factors on the
underweight and under-five child mortality variables.


Sarmin et al. (2014) studied the prediction and comparison of mortality in Bangladeshi
children under five with diarrhea, between those who showed septic shock and drowsiness,
in 2010 and 2011. They used the chi-square test to compare differences in proportions,
Student's t-test to compare differences in means, and the Mann-Whitney test for data
that were not normally distributed.

2.2 Multiple Regression

Draper and Smith (1981) and Aiken et al. (2003) described the use of multiple regression
analysis as a general system for examining the relationship between several independent
variables and a dependent variable. Draper and Smith (1981) discussed in more detail the
building of multiple linear models and the application of multiple regression to
analysis of variance (ANOVA) problems.

In statistical analyses of child mortality, the databases are often split up and held by
separate parties. Hall et al. (2011) proposed an approach that gives a full statistical
calculation on the combined database, and demonstrated it through an experiment on a
dataset with 51,016 cases and 22 covariates extracted from the Current Population
Survey. Their discussion focused on computing linear regression, regression estimates,
and certain goodness-of-fit statistics.

2.2.1 Significant Variables of Model Selection Method

Several concepts and methods for selecting variables in linear regression models have
been reviewed in many articles. One of the variable selection approaches reviewed is the
stepwise method, which is built on two basic ideas: Forward Selection (FS) and Backward
Elimination (BE). A criticism of FS, namely that it may overlook an excellent model
because it adds only a single variable at a time, was also discussed.

The stepwise method was applied in the study by Broadhurst et al. (1996). Three stepwise
techniques, FS, BE, and stepwise multiple regression (SMR), were applied and compared in
examining all the possible combinations of the variables.

Grossman et al. (1996) selected the variables of leaf carbon, nitrogen, lignin,
cellulose, dry weight, and water composition from leaf-level reflectance using stepwise
multiple linear regression. The output of the leaf-level studies and the wavelengths
derived from 15 bands of chemical spectroscopy were compared between unconstrained
stepwise multiple linear regression and constrained regression.


Telmo et al. (2010) studied the relationship between gross calorific value and chemical
and ultimate analysis through backward stepwise multiple linear regression. The BE
technique deleted the non-significant candidate variables one by one, leaving only six
of the eight independent variables. The remaining two independent variables were
described as having a negative explanation of the variation in gross calorific value.

Besides stepwise-type search methods, Quinino et al. (2013) demonstrated an alternative
test of significance for a multiple linear regression model based on its coefficient of
determination and the beta sampling distribution. They found that their proposed
approach was easier to understand than the traditional F-test, because the coefficient
of determination can be presented as a percentage index.

2.2.2 Robustness in Regression

In real-world practice, it is nearly impossible for an original dataset to contain no
outliers. However, least squares estimates are highly sensitive to outliers in
regression models. The least squares fit becomes inefficient because the estimate of the
residual scale is inflated. Therefore, robust regression methods that are able to
identify the true outliers are needed. Many approaches have been proposed in the study
of residuals from a robust regression, such as M estimation introduced by Huber (1973),
Least Trimmed Squares (LTS) by Rousseeuw (1984), and S estimation by Rousseeuw and
Yohai (1984).

Chatterjee and Hadi (1986) discussed the effect of deleting data points when outliers
are detected. They demonstrated several diagnostic methods and found that these worked
well when only one outlier was present, but had difficulty when there was a group of
outliers. Deleting outliers is therefore neither effective nor computationally feasible
when outliers occur in groups.

A further estimation method combines high breakdown value estimation with M estimation.
This method, known as MM estimation, was introduced by Yohai (1987) and achieves both a
high breakdown point and high efficiency under normal errors.

Bianco et al. (2005) proposed an extension of the MM-estimators as a new class of robust
parameter estimators for regression models whose error distribution belongs to a class
of exponential families. They proved the consistency of the MM estimation method,
derived its asymptotic normal distribution, and carried out a Monte Carlo study.

Maronna and Yohai (2013) presented a robust version of a spline-based estimate, in the
form of an MM estimate for functional regression, which replaces the minimization of the
$L_2$ norm of the residuals with the minimization of a bounded loss function.

Overall, throughout their study, the MM estimation approach showed better predictive
performance than the $L_2$ estimate and both partial least squares versions.

CHAPTER 3
METHODOLOGY

3.1 Data Background

In this study, the annual data employed are the under-five child mortality rate
(probability per 1,000 live births), carbon dioxide emissions (metric tons per capita),
coverage of Hepatitis B immunization (percentage of coverage), case detection rate for
all forms of tuberculosis (percentage of detection), number of reported deaths from HIV,
number of reported deaths from measles, population using an improved drinking water
source (percentage of population), children immunized against measles (percentage
immunized), and number of birth trauma cases reported. The data set covers all 47 Asian
countries in 2010.

The data on the under-five child mortality rate are drawn from the UNICEF database. The
carbon dioxide emissions, the percentage of the population using an improved drinking
water source, and the percentage of children immunized against measles are collected
from the UN database. Meanwhile, the percentage coverage of Hepatitis B immunization,
the percentage case detection rate for all forms of tuberculosis, the number of reported
deaths from HIV, the number of reported deaths from measles, and the number of birth
trauma cases reported are gathered from the WHO database. The choice of countries,
indicators, and time period in this analysis is influenced by the availability of
consistent data.

3.2 Methods

3.2.1 Correlation Analysis

Before the regression analysis, correlation analysis is conducted to check the
relationships among all the variables, and a correlation matrix is constructed to
observe the pairwise correlations. The correlation coefficient is a measure of linear
association between two variables. Its value always lies between -1 and +1. A
correlation coefficient of +1 indicates that two variables are perfectly related in a
positive linear sense; a coefficient of -1 indicates that they are perfectly related in
a negative linear sense; and a coefficient of 0 indicates that there is no linear
relationship between the two variables.
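
As an illustration of the correlation matrix described above, the following is a minimal Python sketch using NumPy. It is not the SAS procedure used in this study; the function name correlation_matrix and the toy data are hypothetical and chosen only so that the limiting values +1 and -1 are visible.

```python
import numpy as np

def correlation_matrix(data):
    """Pearson correlation matrix of the columns of a 2-D array."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    # Sample covariance matrix of the columns.
    cov = centered.T @ centered / (data.shape[0] - 1)
    sd = np.sqrt(np.diag(cov))
    # Dividing each covariance by the product of standard deviations
    # gives values between -1 and +1.
    return cov / np.outer(sd, sd)

# Hypothetical toy data: column 2 rises exactly with column 1,
# column 3 falls exactly with column 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy = np.column_stack([x, 2.0 * x + 1.0, -3.0 * x])
R = correlation_matrix(toy)
print(np.round(R, 3))
```

In this toy case the first two columns have correlation +1 and the first and third have correlation -1, matching the interpretation given above.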

3.2.2 Multiple Regression Analysis

The general multiple regression model, which is used to investigate the contribution of
various independent variables to a dependent variable, is defined as
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_k x_k + \varepsilon, \qquad (3.1)$$
where the dependent variable $y$ is related to the $k$ independent variables
$x_1, x_2, x_3, \dots, x_k$; $\beta_0$ denotes the $y$-intercept of the regression line;
$\beta_1, \beta_2, \beta_3, \dots, \beta_k$ are the coefficients of the independent
variables; and $\varepsilon$ is a random variable called the error term.

The error term $\varepsilon$ must also satisfy the following basic model assumptions:
(i) the error term $\varepsilon$ is a random variable with mean zero,
(ii) the variance of the error term $\varepsilon$, denoted by $\sigma^2$, is the same
for all values of the independent variables,
(iii) the values of the error term $\varepsilon$ are independent, and
(iv) the error term $\varepsilon$ is a normally distributed random variable.

The OLS method is applied to fit the model that best characterizes the dataset of this
study and to estimate its parameters.
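
To make the OLS step concrete, the sketch below estimates the coefficients of the multiple regression model by least squares with NumPy. It is an illustration under stated assumptions rather than the SAS fit used in this study: the function name fit_ols and the toy data are hypothetical, and the data are generated without noise so that the true coefficients are recovered.

```python
import numpy as np

def fit_ols(X, y):
    """Least squares estimates (b0, b1, ..., bk) for y = b0 + X b + error."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that b0 is the intercept.
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

# Hypothetical toy data generated exactly from y = 1 + 2*x1 - 3*x2.
X_toy = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 3.0], [4.0, 2.0]])
y_toy = 1.0 + 2.0 * X_toy[:, 0] - 3.0 * X_toy[:, 1]
beta_hat = fit_ols(X_toy, y_toy)
print(np.round(beta_hat, 6))
```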

3.2.3 Backward Elimination Technique

In the model selection phase of this Asian child mortality study, the stepwise BE
algorithm is used. The BE technique starts with all the candidate independent variables
in the model, calculates their F statistics, and then deletes variables one by one to
improve the model. This process is repeated until no further improvement in the model is
possible. For the BE method, the significance level for staying in the model is set to
0.15 in the statistical software SAS.

The detailed procedure for applying the BE method is as follows.
Step 1: at the beginning, the original model is set as in Equation 3.1. Then the
following $k$ tests are carried out with the null hypotheses
$$H_{0j}: \beta_j = 0, \qquad j = 1, 2, \dots, k. \qquad (3.2)$$
The highest p-value, which corresponds to
$$H_{0l}: \beta_l = 0, \qquad (3.3)$$
is compared with the preselected significance level, which is 0.15.
Step 2(a): if the p-value > 0.15, $x_l$ can be deleted and the new original model is
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_{l-1} x_{l-1} + \beta_{l+1} x_{l+1} + \dots + \beta_k x_k + \varepsilon. \qquad (3.4)$$
Then go back and repeat Step 1.
Step 2(b): if the p-value $\le$ 0.15,
the original model is the model we should choose.
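
The steps above can be sketched in Python as follows. This is a minimal illustration, not the SAS implementation used in this study. It assumes NumPy and SciPy are available, uses the two-sided t-test p-value of each coefficient (equivalent to the partial F test for a single coefficient, since F equals t squared), and the names coefficient_p_values, backward_elimination, and the toy variables x1 to x3 are hypothetical.

```python
import numpy as np
from scipy import stats

def coefficient_p_values(X, y):
    """Two-sided p-values for H0: beta_j = 0, j = 1..k (intercept excluded)."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    df = n - k - 1
    mse = resid @ resid / df
    cov = mse * np.linalg.inv(design.T @ design)
    t_values = beta / np.sqrt(np.diag(cov))
    p_values = 2.0 * stats.t.sf(np.abs(t_values), df)
    return p_values[1:]

def backward_elimination(X, y, names, alpha_stay=0.15):
    """Drop the variable with the highest p-value until all are <= alpha_stay."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = list(range(X.shape[1]))
    while keep:
        p_values = coefficient_p_values(X[:, keep], y)   # Step 1
        worst = int(np.argmax(p_values))
        if p_values[worst] <= alpha_stay:                # Step 2(b): stop
            break
        del keep[worst]                                  # Step 2(a): delete x_l
    return [names[j] for j in keep]

# Hypothetical toy data: y depends strongly on x1 and x2, not on x3.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = 2.0 + 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=60)
selected = backward_elimination(X_toy, y_toy, ["x1", "x2", "x3"])
print(selected)
```

Because the pure-noise variable x3 can still pass a 0.15 threshold by chance, the sketch only guarantees that the strongly related variables are retained.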

3.2.4 Data Transformation

To address potential modeling problems, especially those caused by violations of the
assumptions when building a multiple regression model, a data transformation is needed
to reduce or correct the violations and improve model accuracy.

The Box-Cox method is used to identify an appropriate transformation of the response
variable, given its set of independent variables.

The chart of recommended transformations presented by Box and Cox (1964) is consulted
to choose the transformation recommended for the model.
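
The following Python sketch illustrates one common way to choose the Box-Cox power: for each candidate value of lambda on a grid, the response is transformed as (y^lambda - 1)/lambda (or log y when lambda is 0), the regression is refitted, and the lambda with the largest profile log-likelihood is kept. This is an assumption-laden illustration, not the SAS procedure used in this study; the function names, the grid step of 0.25, and the toy data are hypothetical.

```python
import numpy as np

def boxcox_transform(y, lam):
    """Box-Cox power transformation of a positive response."""
    return np.log(y) if abs(lam) < 1e-12 else (y**lam - 1.0) / lam

def boxcox_profile_loglik(X, y, lam):
    """Profile log-likelihood of lambda for a linear regression of y on X."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    z = boxcox_transform(y, lam)
    beta, *_ = np.linalg.lstsq(design, z, rcond=None)
    rss = np.sum((z - design @ beta) ** 2)
    # Residual term plus the Jacobian of the transformation.
    return -0.5 * n * np.log(rss / n) + (lam - 1.0) * np.sum(np.log(y))

def best_lambda(X, y, grid=None):
    """Grid search for lambda between -5 and +5 (grid step is an assumption)."""
    grid = np.linspace(-5.0, 5.0, 41) if grid is None else np.asarray(grid)
    loglik = [boxcox_profile_loglik(X, y, lam) for lam in grid]
    return float(grid[int(np.argmax(loglik))])

# Hypothetical toy data in which sqrt(y) is linear in x, so lambda near 0.5
# is expected.
rng = np.random.default_rng(1)
x = np.linspace(1.0, 10.0, 50)
y = (2.0 + 0.5 * x + rng.normal(scale=0.02, size=50)) ** 2
lam_hat = best_lambda(x.reshape(-1, 1), y)
print(lam_hat)
```

A value of lambda close to 0.5 corresponds to a square-root transformation of the response in the usual table of recommended transformations.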

3.2.5 Identifying Outliers and Influential Points

In detecting outliers, an observation whose studentized residual has an absolute value
larger than 3 is generally deemed an outlier. For a linear regression, the residuals can
be modified to detect unusual observations better. The ratio of a residual to its
standard error is called the standardized residual,
$$r_{si} = \frac{y_i - \hat{y}_i}{s\sqrt{1 - h_{ii}}}, \qquad (3.5)$$
where $s^2$ denotes the estimate of $\sigma^2$, and $h_{ii}$ denotes the hat values.

The studentized residual is obtained by estimating $\sigma^2$ by $s_{(i)}^2$, the
estimate of $\sigma^2$ obtained after deleting the $i$th observation,
$$r_{ti} = \frac{y_i - \hat{y}_i}{s_{(i)}\sqrt{1 - h_{ii}}}. \qquad (3.6)$$

To identify influential data points in this study, the difference in fits (DFITS)
measure is used. The basic idea of this measure is to delete each observation in turn
and refit the regression model on the remaining $n - 1$ observations each time. The
results using all $n$ observations are then compared with the results with the $i$th
observation deleted, which shows the influence of that observation on the analysis.

$DFITS_i$ for the $i$th observation is defined as
$$DFITS_i = \frac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{MSE_{(i)}\, h_{ii}}}, \qquad (3.7)$$
where $\hat{y}_{(i)}$ is the predicted response for the $i$th observation based on the
estimated model with the $i$th observation deleted, and $MSE_{(i)}$ is the mean squared
error of that model.

A data point is considered influential if the absolute value of its DFITS is greater
than
$$2\sqrt{\frac{p + 1}{n - p - 1}}, \qquad (3.8)$$
where $n$ is the number of observations, and $p$ is the number of parameters.
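
The diagnostics in Equations 3.5 to 3.8 can be computed without refitting the model n times, using standard leave-one-out identities based on the hat values. The Python sketch below is a hedged illustration of this, not the SAS output used in this study. The function names influence_measures and dfits_cutoff and the toy data are hypothetical, and the cutoff follows Equation 3.8 with p taken as in Equation 4.6 of this report (p = 5, n = 47).

```python
import numpy as np

def influence_measures(X, y):
    """Standardized residuals, studentized residuals, and DFITS."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    m = design.shape[1]                      # coefficients incl. intercept
    hat = design @ np.linalg.inv(design.T @ design) @ design.T
    h = np.diag(hat)                         # hat values h_ii
    resid = y - hat @ y
    sse = resid @ resid
    s2 = sse / (n - m)                       # estimate of sigma^2
    # Estimate of sigma^2 with observation i deleted (leave-one-out identity).
    s2_del = (sse - resid**2 / (1.0 - h)) / (n - m - 1)
    r_std = resid / np.sqrt(s2 * (1.0 - h))          # Equation 3.5
    r_stu = resid / np.sqrt(s2_del * (1.0 - h))      # Equation 3.6
    dfits = r_stu * np.sqrt(h / (1.0 - h))           # equals Equation 3.7
    return r_std, r_stu, dfits

def dfits_cutoff(n, p):
    """Cutoff of Equation 3.8."""
    return 2.0 * np.sqrt((p + 1) / (n - p - 1))

# Hypothetical toy data with one shifted observation.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(20, 2))
y_toy = 1.0 + X_toy @ np.array([2.0, -1.0]) + 0.3 * rng.normal(size=20)
y_toy[5] += 4.0
r_std, r_stu, dfits = influence_measures(X_toy, y_toy)
print(round(dfits_cutoff(47, 5), 4))
```

With n = 47 and p = 5 the cutoff evaluates to about 0.7651, the value used later in this report.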

3.2.6 Robust Regression

Robust regression is a useful and important tool for analyzing a data set that is
contaminated with outliers. When fitting a least squares regression, some high leverage
points or outliers may be detected that are confirmed not to be data entry errors. In
this situation, robust regression is a good strategy: it can be used to detect outliers
and provides stable results in the presence of outliers. Robust regression also has the
characteristics of weighted and reweighted least squares regression.

There are four main robust regression methods for outlier detection: M estimation, LTS
estimation, S estimation, and MM estimation. In this study, MM estimation is used
because it combines high breakdown value estimation with M estimation. Furthermore, it
achieves a high breakdown point and higher statistical efficiency than S estimation.
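
To illustrate the reweighted least squares idea mentioned above, the sketch below implements a plain Huber M estimator by iteratively reweighted least squares. This is deliberately the simpler M estimation, not the MM estimation used in this study: MM estimation additionally requires a high breakdown initial S estimate, which is omitted here. The function name huber_irls, the tuning constant 1.345, the MAD scale estimate, and the toy data are assumptions made for illustration.

```python
import numpy as np

def huber_irls(X, y, c=1.345, max_iter=50, tol=1e-8):
    """Huber M estimate of regression coefficients via reweighted least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)    # start from OLS
    for _ in range(max_iter):
        resid = y - design @ beta
        # Robust residual scale from the median absolute deviation.
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        if scale == 0.0:
            break
        u = np.abs(resid) / scale
        # Huber weights: 1 for small residuals, c/u for large ones.
        w = np.where(u <= c, 1.0, c / np.maximum(u, 1e-12))
        sw = np.sqrt(w)
        new_beta, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta

# Hypothetical toy data: a line y = 1 + 2x with one gross outlier at the end.
x = np.arange(20, dtype=float)
y = 1.0 + 2.0 * x + 0.1 * np.sin(x)
y[19] += 100.0
design_toy = np.column_stack([np.ones(20), x])
beta_ols, *_ = np.linalg.lstsq(design_toy, y, rcond=None)
beta_huber = huber_irls(x.reshape(-1, 1), y)
print(np.round(beta_ols, 3), np.round(beta_huber, 3))
```

In this toy case the least squares slope is pulled well away from 2 by the single outlier, while the reweighted fit stays close to it.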


CHAPTER 4
RESULTS AND DISCUSSION

4.1 Descriptive Statistics of the Data

There are 8 independent variables and 47 observations in this study. The variable co2
denotes carbon dioxide emissions, with a mean of 378354.47 metric tons per capita, a
standard deviation of 1249863.79, a minimum of 183 metric tons per capita, and a
maximum of 8286692.00 metric tons per capita.

The variable drinking denotes the percentage of the population using an improved
drinking water source in each country. On average, 89.45% of the population in the
Asian countries used an improved drinking water source, with a standard deviation of
12.22%, a minimum of 55%, and a maximum of 100%. The summary for the rest of the
independent variables is shown in Table 4.1.


Table 4.1: Summary of Independent Variables
Variable     n    Mean         Standard Deviation   Minimum   Maximum
co2          47   378354.47    1249863.79           183.00    8286692.00
hepb3        47   0.89         0.14                 0.37      0.99
tb           47   74.17        13.68                29.00     89.00
hiv          47   510.068      2121.59              0         14156.00
measles      47   3264.30      14853.66             0         100066.00
drinking     47   0.89         0.12                 0.55      1.00
immunized    47   0.91         0.11                 0.62      0.99
trauma       47   17056.89     56551.77             4.00      366346.00

4.2 Result of Correlation Analysis

The under-five child mortality rate (mortality), as the dependent variable, is related
to carbon dioxide emissions (co2), Hepatitis B immunization coverage (hepb3), case
detection rate for all forms of tuberculosis (tb), number of reported deaths from HIV
(hiv), number of reported deaths from measles (measles), population using an improved
drinking water source (drinking), children immunized against measles (immunized), and
number of birth trauma cases reported (trauma). In this study, the regression model
building is performed using SAS statistical software.

Correlation analysis is used to determine the interrelationships among the dependent and
independent variables. Table 4.2 shows that a significant positive relationship exists
between mortality and the variables hiv, measles, and trauma. A negative relationship
exists between mortality and the variables co2, hepb3, tb, drinking, and immunized.


4.3 Multicollinearity Test
To check for multicollinearity, the variance inflation factor (VIF) is used. Variables
with VIF values greater than 10 are considered to have a high degree of
multicollinearity.
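
As a sketch of how the VIF is obtained, each independent variable is regressed on all the others and the VIF is 1/(1 - R^2) from that auxiliary regression. The Python code below illustrates this under stated assumptions; it is not the SAS output reported in this chapter, and the function name variance_inflation_factors and the toy data are hypothetical.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF of each column of X: 1 / (1 - R^2) from regressing it on the rest."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Hypothetical toy data: columns a and b are nearly collinear, c is unrelated.
rng = np.random.default_rng(1)
a = rng.normal(size=100)
b = a + 0.05 * rng.normal(size=100)
c = rng.normal(size=100)
X_toy = np.column_stack([a, b, c])
vifs = variance_inflation_factors(X_toy)
print(np.round(vifs, 2))
```

In this toy case the two nearly collinear columns exceed the threshold of 10 while the unrelated column stays close to 1.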

Table 4.3: Full Model Parameter Estimates Table for mortality with Variance Inflation
Factor
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
Intercept   1    183.97461            14.91594         12.33     <.0001     0
co2         1    -0.00000738          0.00000175       -4.22     0.0001     2.27628
hepb3       1    -1.73563             24.96106         -0.07     0.9449     5.47084
tb          1    -0.10235             0.14688          -0.70     0.4901     1.92185
hiv         1    0.00029982           0.00249          0.12      0.9048     13.29688
measles     1    -0.00170             0.00045482       -3.73     0.0006     21.73623
drinking    1    -147.00506           17.84689         -8.24     <.0001     2.26473
immunized   1    -17.18049            27.87994         -0.62     0.5414     4.41274
trauma      1    0.00058426           0.00009842       5.94      <.0001     14.75245

According to Table 4.3, the VIF values are worrisome for the variables hiv, measles, and
trauma, indicating that these variables are possibly redundant and that
multicollinearity arises. Among these 3 variables, measles has the highest VIF value,
21.73623, which is greater than 10. Therefore, measles is omitted and the model is
checked again.


Table 4.4: Parameter Estimates Table for mortality with Variance Inflation Factor After
measles is Deleted
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
Intercept   1    170.77432            16.72310         10.21     <.0001     0
co2         1    -0.00000333          0.00000158       -2.10     0.0420     1.39844
hepb3       1    12.04383             28.48951         0.42      0.6748     5.35125
tb          1    -0.30391             0.15764          -1.93     0.0612     1.66222
hiv         1    -0.00554             0.00224          -2.48     0.0177     8.05400
drinking    1    -121.40519           19.01549         -6.38     <.0001     1.93047
immunized   1    -24.92026            32.08555         -0.78     0.4420     4.38834
trauma      1    0.00034283           0.00008562       4.00      0.0003     8.38390

After omitting the variable measles, Table 4.4 shows that all the VIF values in the
analysis are now less than 10. The variables hiv and trauma, which initially showed high
VIF values, now have acceptable values.

4.4 Variables Selection

Based on Table 4.4, not all the independent variables are statistically significant at
the significance level of 0.15. The variables hepb3 and immunized, whose p-values are
greater than 0.15, are considered insignificant.

Therefore, the backward elimination method, which removes one variable at each step, is
used in this study. The backward elimination takes 2 steps in total. Step 0 of the
algorithm starts with all the candidate independent variables: co2, hepb3, tb, hiv,
drinking, immunized, and trauma. It shows that hepb3 has the highest p-value, 0.6748,
among all the 7 independent variables.

In step 1, the hepb3 variable is deleted because it has the highest p-value among all
the 7 independent variables and this p-value is greater than the significance level of
0.15. This step also shows that immunized has the highest p-value, 0.4769, among the 6
remaining independent variables.

Then, in step 2, the immunized variable is deleted because it has the highest p-value
among the 6 independent variables and this p-value is greater than the significance
level of 0.15. After this step, all the remaining variables are significant at the 0.15
level, as shown in Table 4.5.

Table 4.5: Parameter Estimation for Step 2
Variable    Parameter Estimate   Standard Error   Type II SS    F Value   Pr > F
Intercept   165.09192            13.03198         19981         160.48    <.0001
co2         -0.00000345          0.00000150       656.22015     5.27      0.0269
tb          -0.30389             0.15368          486.83670     3.91      0.0547
hiv         -0.00600             0.00208          1039.63309    8.35      0.0061
drinking    -128.26421           16.49238         7530.63207    60.48     <.0001
trauma      0.00035554           0.00008188       2347.36048    18.85     <.0001


So the final updated model is
$$\text{mortality} = 165.09192 - 0.00000345\,\text{co2} - 0.30389\,\text{tb} - 0.00600\,\text{hiv} - 128.26421\,\text{drinking} + 0.00035554\,\text{trauma}. \qquad (4.1)$$

Note that in step 2 all the remaining variables, namely co2, tb, hiv, drinking, and
trauma, have p-values less than 0.15. Therefore, the final reduced model is as shown in
Equation 4.1.

4.5 Check for Model Assumption of the Child Mortality Rate Data

Table 4.9: Normality Test for mortality
Test                  Statistic           p Value
Shapiro-Wilk          W       0.900921    Pr < W       0.0008
Kolmogorov-Smirnov    D       0.172832    Pr > D      <0.0100
Cramer-von Mises      W-Sq    0.263018    Pr > W-Sq   <0.0050
Anderson-Darling      A-Sq    1.537701    Pr > A-Sq   <0.0050


Figure 4.1: Q-Q Plot for mortality
For the selected model, the Anderson-Darling normality test in Table 4.9 is significant.
$H_0$: The data follow the normal distribution
$H_1$: The data do not follow the normal distribution
In checking the normality assumption, a significance level of 0.1 is set. Based on Table
4.9, the p-value for the Anderson-Darling test is less than 0.0050, so $H_0$ is
rejected. This means that the data do not follow the normal distribution, which violates
one of the multiple regression assumptions.

Besides that, the normality assumption can also be checked through the Q-Q plot shown in
Figure 4.1. It shows a linear trend with a slight deviation at the tails, which suggests
that the normality assumption is satisfied.


Figure 4.2: Scatter Plot of Residual Versus Predicted Value of mortality
Based on Figure 4.2, some points are not scattered randomly, which indicates a slight
violation. However, this slight violation does not invalidate the constant variance
assumption, under which the error terms have equal variance.


Figure 4.3: Scatter Plot of Residual Versus Regressors of mortality
Figure 4.3 shows 4 scatter plots of the residuals against the 4 selected independent
variables: co2, measles, drinking, and trauma. In all the plots, the points are not
scattered randomly but are concentrated on one side. Therefore, the linearity assumption
is not valid for this mortality regression model.
From the output shown, the regression model does not fulfill the multiple regression
assumptions, and this violation may make the model inefficient, seriously biased, or
misleading. Appropriate statistical steps should therefore be taken to overcome this
problem.

4.4 Data Transformation

To address these assumption violations, an appropriate data transformation is needed.
In this study, a transformation intended to make the data normal is used, based on Box
and Cox (1964), and their table of recommended transformations is referred to.

According to the procedure developed, the value of $\lambda$ indicates the power to
which all data should be raised. The Box-Cox power transformation therefore searches
from $\lambda = -5$ to $\lambda = +5$ until the best value is found.

Figure 4.4: Box-Cox Analysis for mortality
Based on Figure 4.4, the lambda value is 0.5, for which the recommended transformation
table suggests transforming the dependent variable mortality to
$\sqrt{\text{mortality}}$.

Table 4.10: Parameter Estimates Table for sqrtmortality
Variable    DF   Parameter Estimate           Standard Error              t Value   Pr > |t|
Intercept   1    17.68237                     1.03749                     17.04     <.0001
co2         1    $-6.64405 \times 10^{-7}$    $1.455377 \times 10^{-7}$   -4.57     <.0001
measles     1    -0.00014477                  0.00003038                  -4.76     <.0001
drinking    1    -14.39144                    1.15136                     -12.50    <.0001
trauma      1    0.00005241                   0.00000854                  6.13      <.0001

After the transformation, Table 4.10 shows that all the variables selected through the
BE search algorithm are statistically significant at the chosen significance level.
Furthermore, the estimated regression coefficients in Table 4.10 differ from those in
Table 4.7, where the regression analysis was done before the transformation.

Assumption checking is needed to confirm that there are no assumption violations in the
data of this child mortality rate study. When any of the assumptions stated in Section
3.2.2 is violated, the predictions and confidence intervals from the regression model
may be inefficient, seriously biased, or misleading.


According to the normality test before the data transformation, the error distribution
was significantly non-normal. Therefore, the Box-Cox transformation method is used to
overcome this violation.

Table 4.11: Normality Test for sqrtmortality
Test                  Statistic           p Value
Shapiro-Wilk          W       0.951442    Pr < W       0.0835
Kolmogorov-Smirnov    D       0.10437     Pr > D      >0.1500
Cramer-von Mises      W-Sq    0.124595    Pr > W-Sq    0.1941
Anderson-Darling      A-Sq    0.733602    Pr > A-Sq    0.1953

According to Table 4.11, the Anderson-Darling normality test is not statistically
significant for the regression model after the data transformation.
$H_0$: The data follow the normal distribution
$H_1$: The data do not follow the normal distribution
In checking the normality assumption, a significance level of 0.1 is set. Based on Table
4.11, the p-value for the Anderson-Darling test is 0.1953, so $H_0$ is not rejected.
This means that the child mortality rate data follow the normal distribution.

Figure 4.5: Q-Q Plot for sqrtmortality

Based on Figure 4.5, the Q-Q plot shows a linear trend with a slight deviation at the
tails, which again suggests that the normality assumption is satisfied.

Figure 4.6: Scatter Plot of Residual Versus Predicted Value of sqrtmortality
Based on Figure 4.6, some points are not scattered randomly, which indicates a slight
violation. However, this slight violation does not invalidate the constant variance
assumption, under which the error terms have equal variance.

Figure 4.7: Scatter Plot of Residual Versus Regressors of sqrtmortality
Figure 4.7 shows 4 scatter plots of the residuals against the 4 selected independent
variables: co2, measles, drinking, and trauma. In all the plots, the points are not
scattered randomly but are concentrated on one side. Therefore, the linearity assumption
is not valid for this sqrtmortality regression model.


Figure 4.8: Scatter Plot of Studentized Residual Versus Leverage

Since the data transformation has not yet overcome the assumption violations, and
potential outliers, influential points, and high leverage points are also present, as
shown in Figure 4.8, robust regression is used to obtain a fit that is resistant to
high leverage points and outliers.

4.5 Identifying Outliers and Influential Points

4.5.1 Outliers Determination

An outlier is a data point whose response y does not follow the general trend of the
data; in other words, it is an observation that is distant from the other observations.
An influential point is a data point that unduly influences any part of a regression
analysis, such as the predicted responses, the estimated slope coefficients, or the
hypothesis test results. In this study, a few data points show the characteristics of
potential outliers and influential points.

Figure 4.9: Scatter Plot of Studentized Residual Versus Predicted Value of
sqrtmortality

Figure 4.9 gives evidence of outlying observations: 2 points fall outside the band. To
identify which data points are outliers, an observation whose studentized residual has
an absolute value larger than 3 is generally deemed an outlier.

Table 4.12: Studentized Residuals Absolute Value Larger Than 3
Observation Country Studentized Residual
6 Bhutan 3.30887
8 Cambodia -2.13049

Table 4.12 shows the 2 largest absolute studentized residuals among the 47 observations.
However, only one observation has an absolute value greater than 3. Therefore, only
observation 6, Bhutan, is identified as an outlier.
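The screening rule above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: it fits a single-regressor least-squares line (the model in this study has several regressors), it uses externally studentized residuals (each observation is left out when the error variance is estimated), and the function names and toy data are hypothetical.

```python
import math


def studentized_residuals(x, y):
    """Externally studentized residuals for the least-squares line y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    leverages = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    sse = sum(e ** 2 for e in residuals)
    result = []
    for e, h in zip(residuals, leverages):
        # Error variance re-estimated with this observation left out;
        # n - 3 degrees of freedom because the line has 2 parameters.
        s2_deleted = (sse - e ** 2 / (1 - h)) / (n - 3)
        result.append(e / math.sqrt(s2_deleted * (1 - h)))
    return result


def flag_outliers(t_values, cutoff=3.0):
    """Indices of observations whose |studentized residual| exceeds the cutoff."""
    return [i for i, t in enumerate(t_values) if abs(t) > cutoff]


# Hypothetical toy data: y is close to 2*x except for the point at x = 5.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0.1, 1.9, 4.2, 5.8, 8.1, 20.0, 12.1, 13.9, 16.2, 17.8]
toy_flags = flag_outliers(studentized_residuals(x, y))

# The two values reported in Table 4.12 (Bhutan, Cambodia).
reported_flags = flag_outliers([3.30887, -2.13049])
```

Applied to the two reported values, the rule flags only the first entry (Bhutan), which agrees with the conclusion drawn from Table 4.12.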

The child mortality rate in Bhutan may differ from that of other Asian countries because
of the flash floods from major rivers in May 2009. According to the Central Intelligence
Agency (CIA) World Factbook (2012), these flash floods caused major destruction to lives
and property, and many deaths were reported.

Furthermore, the government of Bhutan, newly elected in 2009, only began in 2010 to focus
on a program named Early Childhood Care and Development (ECCD), which aims to enhance the
development of Bhutanese children, prepare them for school, and overcome children's basic
health problems. This late action might also have caused the child mortality rate in
Bhutan to differ.

4.5.2 Influential Points Determination
To identify influential points in this study of mortality among children under the age of
five, the DFITS measure is used. Any observation with an absolute DFITS value greater than

\[ 2\sqrt{\frac{5 + 1}{47 - 5 - 1}} = 0.7651 \tag{4.6} \]

is considered an influential data point.

Table 4.13: Absolute DFITS Values Larger than 0.7651
Observation   Country    DFITS Value
8             Cambodia      -0.79638
12            India         -3.69008

Table 4.13 shows that there are 2 influential data points in this study: Cambodia, which
has only a slight influence, and India, which has the greatest influence on the data. This
is not surprising, as India has one of the largest populations and one of the highest
child mortality rates in Asia.
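The cut-off in (4.6) and the screening of the reported DFITS values can be checked with a few lines of Python. The sketch assumes the cut-off has the form 2*sqrt((p + 1)/(n - p - 1)) with p = 5 and n = 47, which reproduces the reported value 0.7651; `dfits_cutoff` is a hypothetical helper name, not part of any library.

```python
import math


def dfits_cutoff(n_obs, p):
    """Cut-off 2*sqrt((p + 1)/(n - p - 1)), the form assumed for equation (4.6)."""
    return 2 * math.sqrt((p + 1) / (n_obs - p - 1))


cutoff = dfits_cutoff(n_obs=47, p=5)

# DFITS values reported in Table 4.13.
reported_dfits = {"Cambodia": -0.79638, "India": -3.69008}

# An observation is treated as influential when |DFITS| exceeds the cut-off.
influential = [country for country, value in reported_dfits.items()
               if abs(value) > cutoff]

print(round(cutoff, 4))  # 0.7651
print(influential)       # ['Cambodia', 'India']
```

Cambodia exceeds the cut-off only narrowly, while India exceeds it by a wide margin, which matches the description of their relative influence.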

4.6 Robust Regression

Since outliers and influential points are still detected even after the data have been
transformed, robust regression with the MM method, which is able to obtain a fit that is
resistant to the presence of outliers and influential data points, is used in this study.
The estimated regression coefficients are then compared between robust regression and OLS
regression.

All of the independent variables are first included in the robust regression. At each
step, the variable with the highest p-value is deleted if that p-value is greater than the
significance level of 0.15. The insignificant variables are deleted one by one until only
significant variables remain in the model.

Table 4.14: Parameter Estimates
Parameter   DF   Estimate   Standard Error   Chi-Square   Pr > ChiSq
Intercept    1   183.9219          11.1613       271.54       <.0001
co2          1    -0.0000           0.0000         6.46       0.0110
hepb3        1     0.3478          17.9087         0.00       0.9845
tb           1    -0.2746           0.1138         5.82       0.0159
hiv          1     0.0018           0.0018         1.04       0.3070
measles      1    -0.0009           0.0004         4.29       0.0383
drinking     1   -149.072          14.0339       112.83       <.0001
immunized    1    -4.1131          20.3620         0.04       0.8399
trauma       1     0.0003           0.0001         5.79       0.0161

Table 4.14 shows that the hepb3 variable has the highest p-value, 0.9845, among the 8
independent variables, and it is greater than the significance level of 0.15. Therefore,
hepb3 is deleted from the robust regression model, and the remaining 7 independent
variables are carried forward to the next significance test.

Table 4.15: Parameter Estimates
Parameter   DF   Estimate   Standard Error   Chi-Square   Pr > ChiSq
Intercept    1   183.5311          10.7162       293.32       <.0001
co2          1    -0.0000           0.0000         6.87       0.0088
tb           1    -0.2747           0.1101         6.23       0.0126
hiv          1     0.0018           0.0017         1.17       0.2795
measles      1    -0.0009           0.0004         4.53       0.0333
drinking     1   -149.378          13.6600       119.58       <.0001
immunized    1    -3.0262          12.6556         0.06       0.8110
trauma       1     0.0003           0.0001         6.07       0.0138

Table 4.15 shows that the immunized variable has the highest p-value, 0.8110, among the 7
independent variables, and it is greater than the significance level of 0.15. Therefore,
immunized is deleted from the robust regression model, and the remaining 6 independent
variables are carried forward to the next significance test.

Table 4.16: Parameter Estimates
Parameter   DF   Estimate   Standard Error   Chi-Square   Pr > ChiSq
Intercept    1    182.043           8.7177       435.87       <.0001
co2          1    -0.0000           0.0000         7.48       0.0063
tb           1    -0.2764           0.1067         6.71       0.0096
hiv          1     0.0019           0.0017         1.32       0.2504
measles      1    -0.0009           0.0004         4.84       0.0278
drinking     1   -150.612          12.1867       152.74       <.0001
trauma       1     0.0003           0.0001         6.44       0.0112

Table 4.16 shows that the hiv variable has the highest p-value, 0.2504, among the 6
independent variables, and it is greater than the significance level of 0.15. Therefore,
hiv is deleted from the robust regression model, and the remaining 5 independent
variables are carried forward to the next significance test.

Table 4.17: Parameter Estimates
Parameter   DF   Estimate   Standard Error   Chi-Square   Pr > ChiSq
Intercept    1   178.4948           8.3919       452.41       <.0001
co2          1    -0.0000           0.0000         5.92       0.0150
tb           1    -0.3035           0.1083         7.86       0.0051
measles      1    -0.0007           0.0004         3.15       0.0757
drinking     1   -144.564          11.6213       154.74       <.0001
trauma       1     0.0003           0.0001         6.46       0.0110

According to Table 4.17, the remaining 5 variables, co2, tb, measles, drinking, and
trauma, are statistically significant at the 0.15 significance level. The set of
significant variables in the robust regression also differs from that obtained from the
OLS regression.
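The elimination steps in this section can be sketched as a small loop in Python. This is a minimal sketch, not the software procedure used in the study: `backward_eliminate` is a hypothetical helper, and instead of refitting a robust regression it looks up the p-values reported in the parameter estimate tables of this section (entries printed as below .0001 are entered as 0.0001).

```python
def backward_eliminate(fit_p_values, variables, alpha=0.15):
    """Drop the variable with the largest p-value until all are <= alpha.

    fit_p_values: callable taking the list of kept variables and returning
    a dict {variable: p-value} for the model fitted on those variables.
    """
    kept = list(variables)
    removed = []
    while kept:
        p_values = fit_p_values(kept)
        worst = max(kept, key=lambda v: p_values[v])
        if p_values[worst] <= alpha:
            break  # every remaining variable is significant
        kept.remove(worst)
        removed.append(worst)
    return kept, removed


# Stand-in for the robust fit: reported p-values, keyed by model size.
REPORTED = {
    8: {"co2": 0.0110, "hepb3": 0.9845, "tb": 0.0159, "hiv": 0.3070,
        "measles": 0.0383, "drinking": 0.0001, "immunized": 0.8399,
        "trauma": 0.0161},
    7: {"co2": 0.0088, "tb": 0.0126, "hiv": 0.2795, "measles": 0.0333,
        "drinking": 0.0001, "immunized": 0.8110, "trauma": 0.0138},
    6: {"co2": 0.0063, "tb": 0.0096, "hiv": 0.2504, "measles": 0.0278,
        "drinking": 0.0001, "trauma": 0.0112},
    5: {"co2": 0.0150, "tb": 0.0051, "measles": 0.0757,
        "drinking": 0.0001, "trauma": 0.0110},
}


def reported_fit(kept):
    return REPORTED[len(kept)]


all_variables = ["co2", "hepb3", "tb", "hiv", "measles",
                 "drinking", "immunized", "trauma"]
kept, removed = backward_eliminate(reported_fit, all_variables)
print(removed)  # ['hepb3', 'immunized', 'hiv']
print(kept)     # ['co2', 'tb', 'measles', 'drinking', 'trauma']
```

In a real analysis, `reported_fit` would be replaced by a function that refits the robust regression on the kept variables and returns the new p-values.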

4.7 Comparison of Estimated Regression Coefficients

Table 4.18: Estimated Regression Coefficients for Each Type of Data
Parameter   OLS (After BE)   OLS (Transformation)   Robust
β0          172.38856        17.68237               178.4948
β1          0.00000809       6.64 × 10^-7           -0.0000
β3          -                -                      -0.3035
β5          0.00175          0.00014477             -0.0007
β6          161.90520        14.39144               -144.564
β8          0.00062903       0.00005241             0.0003

Table 4.18 shows that the estimated regression coefficients all differ between the models,
and that the number of significant variables increases from 4 in the OLS regression to 5
in the robust regression.

The difference arises because robust regression is able to obtain a fit that is resistant
to the presence of outliers and influential data points. Robust regression is designed so
that it is not overly affected by violations of assumptions, whereas OLS can give
misleading results when such violations occur. Therefore, the estimated regression
coefficients from the robust regression are believed to be more accurate than those from
the OLS regression.
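The resistance of a robust fit to an outlying observation, as described above, can be illustrated with a small Python sketch. Note the substitution: this study uses the MM method, whereas the sketch uses the simpler Huber M-estimator fitted by iteratively reweighted least squares on a single regressor. The function names, the tuning constant c = 1.345 and the toy data are assumptions made for illustration only.

```python
import statistics


def weighted_line(x, y, w):
    """Weighted least-squares line; returns (intercept, slope)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def huber_line(x, y, c=1.345, iterations=50):
    """Huber M-estimate of a line by iteratively reweighted least squares."""
    weights = [1.0] * len(x)
    for _ in range(iterations):
        b0, b1 = weighted_line(x, y, weights)
        residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
        # Robust scale: median absolute residual, rescaled for normal errors.
        scale = statistics.median([abs(e) for e in residuals]) / 0.6745
        if scale == 0:
            break
        limit = c * scale
        # Huber weights: 1 inside the limit, shrinking as limit/|e| outside it.
        weights = [1.0 if abs(e) <= limit else limit / abs(e) for e in residuals]
    return weighted_line(x, y, weights)


# Hypothetical toy data: y is close to 2*x except for the point at x = 5.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0.1, 1.9, 4.2, 5.8, 8.1, 20.0, 12.1, 13.9, 16.2, 17.8]

ols_b0, ols_b1 = weighted_line(x, y, [1.0] * len(x))  # ordinary least squares
rob_b0, rob_b1 = huber_line(x, y)
```

On this toy data the single outlying point pulls the OLS intercept away from zero, while the reweighted fit stays closer to the line followed by the other nine points.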

CHAPTER 5
CONCLUSION AND FUTURE WORK

5.1 Conclusion

The findings of this study reveal that the main determinants affecting the mortality rate
of children under the age of five in Asian countries are carbon dioxide emissions, the
case detection rate for all forms of tuberculosis, the number of reported deaths from
measles, the population using an improved drinking water source, and the number of
reported birth trauma cases. These 5 independent variables have statistically significant
impacts on the dependent variable.

The flash flood that occurred in Bhutan in May 2009 caused major destruction and deaths in
the country, which may explain why Bhutan has a child mortality observation that differs
from those of other countries in Asia. India, one of the countries with the biggest
population and highest mortality rate in Asia, was shown to be an influential point in
this study.

Children are the future leaders of a country. They are the group of people whom we need to
protect the most, to make sure they have a healthy environment in which to survive and
grow up. Only then will they be able to lead the country to a brighter future. The
government of each Asian country should come up with rules or guidelines to improve the
quality of life of children, so that all children can live healthily and safely. Moreover,
regional cooperation among Asian communities should be established to overcome the child
mortality problem and reduce the number of cases to a minimum.

5.2 Future Work

In the future, this child mortality rate analysis can be carried out using robust
regression with other estimation methods, such as M estimation, LTS estimation or S
estimation.

Besides that, more factors which may cause child mortality in Asia can also be included in
future analyses. Factors such as income group, gross domestic product (GDP), cases of
child malnutrition, and underweight are suggested for use in the analysis of child
mortality in Asia.

REFERENCES

43

Aiken, L. S., West, S. G. and Pitts, S. C. Multiple Linear Regression in Handbook of
Psychology ed. Schinka, J. A., Velicer, W. F. and Weiner, I. B. Canada: John
Wiley & Sons Inc., 2003.

Bianco, A. M., Ben, M. G. and Yohai, V. J. 2005. Robust Estimation for Linear
Regression with Asymmetric Errors, Canadian Journal of Statistics. 33: 511-528.

Black, R. E., Morris, S. S. and Bryce, J. 2003. Where and Why are 10 Million Children
Dying Every Year, The Lancet. 361: 2226-2234.

Box, G. E. P., and Cox, D. R. 1964. An Analysis of Transformations, Journal Royal
Statistics Society, Series B. 26: 211-234.

Broadhurst, D., Goodacre, R., Jones, A., Rowland, J. J. and Kell, D. B. 1996. Genetic
Algorithms as a Method for Variable Selection in Multiple Linear Regression and
Partial Least Squares Regression, with Applications to Pyrolysis Mass
Spectrometry, Analytica Chimica Acta. 348: 71-86.

Chatterjee, S. and Hadi, A.S. 1986. Influential Observations, High Leverage Points, and
Outliers in Linear Regression, Statistical Science. 1: 379-416.

Draper, N. R. and Smith, H. Applied Regression Analysis 2
nd
Edition: John Wiley and
Sons, New York, 1981.

Gabriele, A. and Schettino, F. 2008. Child Malnutrition and Mortality in Developing
Countries: Evidence from a Cross-Country Analysis, Analysis of Social Issue and
Public Policy. 8: 53-81.
Grossman, Y. L., Ustin, S. L., Jacquemoud, S., Sanderson, E. W., Schmuck, G. and
Verdebout, J. 1996. Critique of Stepwise Multiple Linear Regression for the
Extraction of Leaf Biochemistry Information from Leaf Reflectance Data, Remote
Sensing of Environment. 56: 182-193.
44

Hall, R., Fienberg, S. E. and Nardi, Y. 2011. Secure Multiple Linear Regression Based
on Homomorphic Encrption, Journal of Official Statistics. 27: 1-23.

Huber, P. J. 1973. Robust Regression: Asymptotics, Conjectures and Monte Carlo, Ann.
Stat., 1: 799-821.

Maronna R. A. and Yohai, V. J. 2013. Robust Functional Linear Regression Based on
Splines, Computational Statistics & Data Analysis. 65: 46-55.

Quinino, R. C., Reis, E. A. and Bessegato, L. F. 2011. Using the Coefficient of
Determination R
2
to Test the Significance of Multiple Linear Regression,
Teaching Statistics. 35: 84-88.

Rousseeuw, P. J. 1984. Least Median of Squares Regression, Journal of the American
Statistical Association, 79: 871-880.

Rousseeuw, P. J. and Yohai, V. Robust Regression by Means of S Estimatiors in Robust
and Nonlinear Time Series Analysis ed. Franke, J., Hrdle W. and Martin, R. D.
New York: Springer Verlag, 1984.

Sarmin, M., Ahmed, T., Bardhan, P. K. and Chisti, M. J. 2014. Specialist Hospital Study
Shows that Septic Shock and Drowsiness Predict Mortality in Children Under
Five with Diarrhoea, Acta Paediatrica. 1-6.

Telmo, C., Lousada, J. and Moreira, N. 2010. Proximate Analysis, Backward Stepwise
Regression Between Gross Calorific Value, Ultimate and Clinical Analysis of
Wood, Bioresource Technology. 101: 3808-3815.

45

Yohai, V. J. 1987. High Breakdown Point and High Efficiency Robust Estimates for
Regression, Annals of Statistics. 15: 642-656.

Zainodin, H. J., Noraini, A. and Yap, S. J. 2011. An Alternative Multicollinearity
Approach in Solving Multiple Regression Problem, Trends in Applied Sciences
Research. 6: 1241-1255.
