
South Dakota State University

STAT 786, Fall 2016


Project 1

ASSOCIATION OF INSULIN WITH CHEMICAL AND


DEMOGRAPHIC VARIABLES

Authors:
1. Acharya, Subash
2. Khan, Riaz
3. Shrestha, Mahesh
4. Suehring, Aaron

INTRODUCTION
The body breaks down carbohydrates from foods into glucose. Glucose, a form of sugar, is used by tissues in the body for energy (USDHHS 2014a). Once in the bloodstream, glucose needs the hormone insulin in order to be absorbed by organs and tissues (USDHHS 2014a). Insulin, made by the pancreas to regulate blood glucose levels as part of metabolism, allows glucose to be absorbed by organs and tissues (Joshi et al. 2007; USDHHS 2014a). Glucose accumulates in the bloodstream if the pancreas cannot secrete enough insulin (Mayo Clinic). Prolonged periods of elevated blood glucose can lead to diabetes or prediabetes, kidney damage, and nerve damage (USDHHS 2014a).
Using data from the National Health and Nutrition Examination Survey (NHANES), our objective was to determine whether there was a relationship between insulin in the bloodstream (pmol/L) and a set of chemical predictor variables measured in the urine and demographic predictor variables for a sample of 974 individuals. The
chemical predictor variables we examined were urinary iodine, urinary creatinine,
urinary perchlorate, urinary nitrate, and urinary thiocyanate. Demographic variables
included gender, age, and household income.
Iodine (µg/L), measured in the urine as a quantitative variable, is an element that must come from the diet; it cannot be made in the body (WebMD). Iodine is needed by the thyroid gland to produce hormones. Iodine has been shown to have a negative correlation with blood glucose and insulin (Al-Attas et al. 2012). Further, iodine has been associated with insulin resistance in people with type 2 diabetes (Al-Attas et al. 2012). Insulin resistance is a condition in which organs and body tissue do not respond to insulin, and therefore cannot easily absorb glucose from the bloodstream; as a result, the body needs to produce more insulin (USDHHS 2014b).
Creatinine (mg/dL), also measured in the urine as a quantitative variable, is a waste product of muscle breakdown that is filtered through the kidneys and excreted in the urine (WebMD 2015). Testing the level of creatinine in a person's body is a common measure of kidney health, and high levels of creatinine in the blood indicate kidney failure in patients with diabetes (National Kidney Foundation; WebMD 2015).
Perchlorate (ng/mL), measured in the urine as a quantitative variable, is an
element commonly found in the environment. Exposure to it occurs through the
ingestion of food or water that contains perchlorate (ATSDR 2015). Perchlorate in the
body is associated with the inhibition of iodide uptake by the thyroid, which can lead
to hypothyroidism (Blount et al. 2006; ATSDR 2015). Similarly, exposure to nitrate (ng/mL), measured in the urine as a quantitative variable, is primarily linked to the ingestion of food or water that contains nitrates (CDC 2013). Similar to perchlorate, it
is known to disrupt thyroid function by inhibiting iodide uptake (CDC 2013).
Thiocyanate (ng/mL) is another element known to inhibit iodide uptake by the thyroid.
Measured in the urine as a quantitative variable, exposure to thiocyanate occurs from
cigarette smoke and some Brassica genus vegetables (Leung et al. 2014). Similar to
perchlorate and nitrate, thiocyanate could lead to lower thyroid hormone production
(Steinmaus et al. 2013).
The demographic variables included in the analysis were gender, age, and
household income. Gender, a qualitative variable coded as 1 for male and 2 for
female, was included for its possible relationship with insulin resistance. Given that
increased body fat has been associated with insulin resistance (USDHHS 2014b),
higher levels of visceral and hepatic adipose tissue in men may contribute to higher
levels of insulin resistance (Geer & Shen 2009).
Aging has also been associated with increased body fat and increased levels of insulin resistance (Ryan 2000). Therefore, age, measured as a quantitative variable, was included in the analysis. The elderly were found to be more prone to insulin resistance, although it is uncertain whether the cause is biological or environmental (i.e., decreased physical activity, weight gain) (Refaie et al. 2006).
Lastly, household income, coded in qualitative bins, was included in the analysis. It has been shown that greater distance from wealthy, resource-rich areas is related to higher insulin resistance (Auchincloss et al. 2007). A potential explanation is that grocery stores in poorer areas have less-healthy food options (Horowitz et al. 2004). Alternatively, in a low-income, poorly educated population of patients with diabetes, 48% reported an unwillingness to use insulin, a phenomenon known as psychological insulin resistance (Machinani et al. 2013). This may result in a relationship between lower-income households and lower levels of insulin.
A previous study by Blount et al. (2006) used this data from NHANES to examine the relationship between perchlorate and thyroid hormone levels. The covariates incorporated in their analysis that were also examined in ours include urinary creatinine, urinary iodine, urinary nitrate, urinary thiocyanate, and age. They found a relationship between perchlorate and thyroid hormone production in women; however, no such significant relationship was found in men (Blount et al. 2006). A
study by Steinmaus et al. (2013) also examined data from NHANES. They looked for
an interaction effect between perchlorate, thiocyanate, and iodine on thyroid hormone
levels. They found greater effects when all three variables were assessed together
than when they were examined separately (Steinmaus et al. 2013).

EXPLORATORY DATA ANALYSIS


Figure 1 is a scatterplot matrix of the data, with the response in the left-most column. Looking at the smoothed curve, creatinine seems to have some predictive power. However, the corresponding scatter plot suggests a lack of homogeneity of variance, which implies that a transformation of the data may be needed; this is addressed later. The other variables appear to have very low predictive power.
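A plot of this kind can be produced with the base R pairs() function. The sketch below is illustrative only: the data frame name train and the column names are assumptions, not the actual NHANES variable codes.

# Sketch (assumed names): scatterplot matrix with lowess smoothers,
# response (insulin) in the first column, as in Figure 1.
eda_vars <- c("insulin", "iodine", "creatinine", "perchlorate",
              "nitrate", "thiocyanate", "age")
pairs(train[, eda_vars], panel = panel.smooth,
      main = "Insulin vs. chemical and demographic predictors")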

Figure 1: Scatter plot of the data

ANALYSIS
Data Formatting
Four files, which contained the data for estimating the insulin level in the blood, were acquired from NHANES. In those files, each respondent was given a unique sequence number. The final dataset was created by aggregating the four datasets to extract the desired response variable and predictor variables based on common sequence numbers. In total, there were 974 sequence numbers common to all four files for the variables included in the analysis. After forming the final dataset, it was randomly divided into two equal parts: a training dataset and a test dataset. The training dataset was used to perform the initial analysis, whereas the test dataset was used for model validation.
Insulin level was selected as the response variable, and urinary iodine, urinary creatinine, urinary perchlorate, urinary nitrate, urinary thiocyanate, gender, age, and household income were treated as the predictor variables. In total, eight predictor variables were selected for estimating the insulin level in the blood. Two of the variables, gender and household income, were qualitative variables, and the six remaining variables were quantitative variables.
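The assembly and split described above can be sketched in R roughly as follows. The file names, variable names, and random seed are illustrative assumptions rather than the exact code used for this report.

# Sketch (assumed file and variable names): merge the four NHANES files on the
# common sequence number (SEQN), keep complete cases, and split in half.
library(haven)                     # NHANES releases SAS transport (.XPT) files
demo <- read_xpt("DEMO_G.XPT")     # gender, age, household income
ins  <- read_xpt("INS_G.XPT")      # blood insulin (pmol/L)
pern <- read_xpt("PERNTS_G.XPT")   # urinary perchlorate, nitrate, thiocyanate
iod  <- read_xpt("UIO_G.XPT")      # urinary iodine and creatinine

full <- Reduce(function(x, y) merge(x, y, by = "SEQN"),
               list(demo, ins, pern, iod))
full <- na.omit(full)              # 974 complete records in our data
full$gender <- factor(full$gender) # qualitative: 1 = male, 2 = female
full$income <- factor(full$income) # qualitative: household income bins

set.seed(2016)                     # arbitrary seed, for reproducibility only
idx   <- sample(nrow(full), floor(nrow(full) / 2))
train <- full[idx, ]               # training half
test  <- full[-idx, ]              # validation half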

Initial Model Fitting and Diagnostics


As an initial analysis of the data, a linear model was fitted to the training data using all the variables, and its diagnostic plots were analyzed to check whether the assumptions of the linear model were met.
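A minimal sketch of this step, assuming the training data are in a data frame called train with the column names used earlier:

# Fit the full first-order model and draw the standard lm diagnostic plots (Figure 2).
fit_full <- lm(insulin ~ iodine + creatinine + perchlorate + nitrate +
                 thiocyanate + gender + age + income, data = train)
par(mfrow = c(2, 2))
plot(fit_full)   # residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage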

Figure 2: Diagnostics plots, full model, train data

A plot of residuals vs. fitted values for this fitted model suggested that the variance of the errors is not constant. Furthermore, the normal Q-Q plot of the residuals suggests some departure from normality of the errors exists. After assessing the diagnostic plots, we concluded that the assumptions of constant error variance and normality of the error terms are violated. Furthermore, the estimated intercept and coefficients for the different predictor variables varied widely in magnitude, so the values of the predictor variables were rescaled. Urinary iodine, urinary creatinine, and urinary thiocyanate were divided by 100, whereas urinary nitrate was divided by 1000. These scaling factors were chosen to put all the predictor variables on a common scale.

Figure 3: Graphical representation of Box-Cox transformation

As previously stated, there appears to be unequal error variance and non-normality of the error terms. To alleviate the violation of these assumptions, transformation of the response variable is an appropriate measure, given the shape and spread of the distribution of the response variable. To accomplish this, a suitable transformation to mitigate the unequal error variance and non-normality of the residuals needed to be determined. The Box-Cox procedure was used to determine the best transformation to apply to the response variable. The built-in R function boxcox() identified a range of powers for the transformation. Figure 3 shows this range, indicating the optimal lambda value to be near zero, so it is reasonable to select lambda to be zero. Therefore, a log transformation applied to the response variable was suitable to remove the non-constant error variance and the non-normality of the residuals. A linear model was again fitted to the data using all the predictor variables and the log-transformed response variable, and the diagnostic plots (Figure 4) were analyzed to check whether the transformation was appropriate. The residuals versus fitted values plot showed the constant error variance assumption was met, and the normality plot indicated the residuals were approximately normally distributed.
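A sketch of this transformation step, assuming fit_full is the full model refit on the rescaled training data as described above:

# Box-Cox profile likelihood for the response transformation (Figure 3).
library(MASS)
bc <- boxcox(fit_full, lambda = seq(-1, 1, by = 0.05))
bc$x[which.max(bc$y)]            # estimated lambda; near zero for our data

# Lambda near zero supports a log transformation of the response (Figure 4).
fit_log <- lm(log(insulin) ~ iodine + creatinine + perchlorate + nitrate +
                thiocyanate + gender + age + income, data = train)
par(mfrow = c(2, 2))
plot(fit_log)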


Figure 4: Diagnostics of full model after transformation, train data

Multicollinearity Check
To check whether there is any serious multicollinearity problem in our data, we used the variance inflation factor (VIF) criterion. In this technique, each predictor is regressed on the remaining predictors and the VIF is computed as $\mathrm{VIF}_k = 1/(1 - R_k^2)$, where $R_k^2$ is the coefficient of multiple determination for the regression of $X_k$ on the other predictors. The suggested cutoff point is 10 (Kutner et al. 2004). For our data, the highest VIF was 1.76, suggesting there is not a serious multicollinearity problem. Table 1 shows the VIF values and the correlations between the predictor variables of the data.
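The VIF computation can be sketched as follows; car::vif() is one common implementation, and the manual calculation for a single predictor applies the definition directly (variable names as assumed earlier):

# VIFs for all predictors of the log-scale full model.
library(car)
vif(fit_log)                     # highest value was about 1.76 for our data

# Manual check of the definition VIF_k = 1 / (1 - R_k^2) for one predictor.
r2_creat <- summary(lm(creatinine ~ iodine + perchlorate + nitrate + thiocyanate +
                         gender + age + income, data = train))$r.squared
1 / (1 - r2_creat)               # VIF for creatinine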

Table 1: VIF and Correlation of the training data


Model selection
In total, we fit four models. In model 1, we included all the variables except age and income; when fitting with all the predictors, these terms were not significant at any common significance level (maximum 0.1). Additionally, the response was regressed on each of these predictors separately, and neither showed a significant linear relationship. Model 2 was selected by stepwise selection based on the Akaike information criterion (AIC). Model 3 is a simpler model, derived from model 2 by deleting the variable with the highest p-value. Model 4 is a further simplification in which one more variable was deleted from model 3 based on the highest p-value.
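A sketch of how these four candidates can be constructed in R. Which term is dropped at each manual step is illustrative only, since it depends on the fitted p-values; the final model retained creatinine and gender.

# Model 1: all predictors except age and income.
model1 <- lm(log(insulin) ~ iodine + creatinine + perchlorate + nitrate +
               thiocyanate + gender, data = train)

# Model 2: AIC-based stepwise selection starting from the full model.
model2 <- step(fit_log, direction = "both", trace = FALSE)

# Models 3 and 4: drop the term with the largest p-value, one at a time.
# (The dropped terms below are placeholders for the terms actually removed.)
model3 <- update(model2, . ~ . - thiocyanate)
model4 <- update(model3, . ~ . - iodine)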

Table 2: Linear predictors of each of the models considered

Figures 5 and 6 show the residual scatter plots and the normal Q-Q plots for all the proposed models. From visual inspection, they appear to meet the normality and constant-variance assumptions for the residuals. However, the Q-Q plots suggest some outlying observations in the data. Further analysis, discussed later, attempted to identify influential points.


Figure 5: Residual plot of four candidate models, training data

Figure 6: Q-Q plot of four candidate models, training data

With Influential Observations


Table 3 shows the regression results for the candidate models based on the training and validation datasets, with influential and outlying observations included. For each of the four candidate models, we recorded the point estimates for the intercept and the slope of each predictor variable, along with the corresponding standard errors. We also color-coded the slope coefficients and intercept point estimates based on each predictor's level of significance. We performed a lack-of-fit test for all four of the models to determine the significance of the model. Given that the p-value for all the models was significant (p-value < 0.05), we found that there were four competitive models. We calculated several model selection statistics to determine the best model.

Table 3: Regression results for candidate models based on training and validation
dataset

We calculated the AIC value for each model using $AIC = 2k - 2\ln(L)$, where $k$ is the number of estimated parameters and $L$ is the maximum value of the likelihood function. Model 2 produced the lowest AIC value of the four models and was therefore considered the best model using this criterion. We compared the SSE values for the four models using $SSE = \sum_i (Y_i - \hat{Y}_i)^2$, where $\hat{Y}_i$ is the predicted value for the $i$th observation and $Y_i$ is the $i$th observation. Although model 1 and model 2 produced almost identical SSE values, model 1 produced the lowest SSE and was therefore considered the best model using this criterion.
We compared the MSE values for the four models using $MSE = SSE/(n - p)$, where $n$ is the number of observations and $p$ is the number of parameters to be estimated. All four models produced similar values; however, model 2 produced the lowest MSE and was therefore considered the best model. We calculated and compared the PRESS statistic for all four models using $PRESS = \sum_i (Y_i - \hat{Y}_{i(i)})^2$, where $\hat{Y}_{i(i)}$ is the prediction of the $i$th value with the $i$th observation removed. The PRESS criterion is a way to determine how well the fitted values for a given subset model can predict the observed response values (Kutner et al. 2004). Although models 2, 3, and 4 were all very similar, model 4 produced the lowest PRESS value and was therefore considered the best model using this criterion.
Next, we calculated the $C_p$ value for all four models using $C_p = SSE_p / MSE(X_1, \ldots, X_{P-1}) - (n - 2p)$, where $SSE_p$ is the error sum of squares of the candidate model and $MSE(X_1, \ldots, X_{P-1})$ is the residual mean square error of the full model including all the potential predictors. Model 2 produced the smallest $C_p$ value and was therefore considered the best model. Lastly, we calculated and compared the MSPR for all four models using $MSPR = \sum_{i=1}^{n^*} (Y_i - \hat{Y}_i)^2 / n^*$, where $Y_i$ and $\hat{Y}_i$ are the observed values and point estimates of the response, respectively, and $n^*$ is the number of cases in the validation dataset. Although models 3 and 4 produced very similar results, model 3 produced the lowest MSPR and was therefore considered the best model.
All four models were then applied to the validation dataset, and the results were recorded in Table 3. The results show that model 1 was the best model; however, three of the predictor variables were not significant.
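Under the assumptions used in the earlier sketches (fit_log as the full model, model2 as one candidate, and test as the validation half), these criteria can be computed along the following lines:

n <- nrow(train)
p <- length(coef(model2))                                 # parameters in the candidate

aic   <- AIC(model2)                                      # 2k - 2 ln(L)
sse   <- sum(resid(model2)^2)                             # error sum of squares
mse   <- sse / (n - p)                                    # residual mean square
press <- sum((resid(model2) / (1 - hatvalues(model2)))^2) # leave-one-out PRESS

mse_full <- sum(resid(fit_log)^2) / (n - length(coef(fit_log)))
cp <- sse / mse_full - (n - 2 * p)                        # Mallows' Cp vs. the full model

pred_valid <- predict(model2, newdata = test)             # validation predictions
mspr <- mean((log(test$insulin) - pred_valid)^2)          # mean squared prediction error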
From the results presented in Table 3, model 2 seems to be an appealing choice, as it possesses the lowest AIC and $C_p$ values. The SSE, MSE, and PRESS statistics do not show a lot of variability across the four models. However, looking at the MSPR values, models 3 and 4 look to be better choices than the other two. Because the results in Table 3 are not consistent across the different model selection criteria, this suggests the presence of influential points. Therefore, we applied measures to detect influential observations.
Up to this point, none of our data had been screened for the presence of influential observations. Looking at Figures 5 and 6, there seem to be observations whose residuals deviate greatly from the mean. This could be a direct consequence of potential outliers and influential points in the data. Although a point may be an outlier in terms of the range of the predictor variables, it may not be an outlier in terms of the response variable. Conversely, a point may be an outlier in terms of the response variable, yet not in terms of the predictor variables. Further, a point may be an outlier in terms of both the response and the predictor variables. In these instances, it is possible that although the point is outlying for all or only one of the variables, it may not have an influence on the regression line. Therefore, it was necessary to assess the data for influential points.
We used two measures to assess the presence of influential points: DFFITS and Cook's distance. The DFFITS value for the $i$th case is given by $(\mathrm{DFFITS})_i = t_i \sqrt{h_{ii}/(1 - h_{ii})}$, where $t_i$ is the studentized deleted residual and the second term of the product is the leverage factor of the observation. An observation with a large absolute DFFITS value is identified as an influential point under this criterion; we used $2\sqrt{p/n}$ as the threshold value (Kutner et al. 2004). DFFITS considers the influence of the $i$th case on its own fitted value $\hat{Y}_i$, while Cook's distance measures the influence of the $i$th case on all $n$ fitted values. Cook's distance for the $i$th case is calculated using $D_i = \dfrac{e_i^2}{p \, MSE} \cdot \dfrac{h_{ii}}{(1 - h_{ii})^2}$. Thus, we get a high Cook's distance value for high residuals and/or high leverage values, and a higher $D_i$ indicates a higher degree of influence of the $i$th case on the fitted values. We used a Cook's distance cutoff value of $4/n$ to identify influential points, where $n$ is the number of observations (Introduction to SAS). It should be noted that identifying influential points by both of these criteria depends on the model itself, as the calculations involve $h_{ii}$, the diagonal elements of the hat matrix. We aggregated the influential cases flagged by both criteria and removed them from the data for the corresponding model. This process was repeated for each of the four models.
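A sketch of this screening step for one candidate model, using the thresholds stated above (model4 and train as assumed in the earlier sketches):

n <- nrow(train)
p <- length(coef(model4))

dffits_i <- dffits(model4)                       # influence on the i-th fitted value
cooks_i  <- cooks.distance(model4)               # influence on all fitted values

flag_dffits <- abs(dffits_i) > 2 * sqrt(p / n)   # DFFITS guideline (Kutner et al. 2004)
flag_cooks  <- cooks_i > 4 / n                   # Cook's distance cutoff used here

influential <- which(flag_dffits | flag_cooks)   # union of the two criteria
train_clean <- train[-influential, ]             # data used to refit this model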

Figure 7: DFFITS and Cook's distance plots for model 1



Figure 7 shows the influential points identified by these measures for model 1.
This was repeated for all four proposed models. The models were fitted again without
these influential points. Table 4 summarizes the regression results for the candidate
models based on the datasets after removing the influential points.

Table 4: Regression results after removing influential points

Looking at the results presented in Table 4, we see that model 4 produces the best model selection values in terms of AIC, SSE, MSE, PRESS, and $C_p$. Five out of the six criteria we considered for model evaluation indicate model 4 is the best model. The remaining statistic for model 4, MSPR, is only slightly above the lowest among all the models. These results are consistent for the validation data as well, with the exception of one statistic. Therefore, we chose model 4 for this study. This model is supported by the statistics and is very simple in nature, as it includes only two predictor variables. For this chosen model, we ran a multiple comparison test to check whether all the coefficients were significantly different from zero. The built-in R function glht() was used for this purpose. This function tests a general linear hypothesis of the form $K\beta = m$ (Hothorn et al. 2016). The default choice of $m$ is $0$, and $K$ was specified as a diagonal (identity) matrix of size 3, one row for each coefficient. Table 5 summarizes the results of the multiple comparisons, and they support our choice of model 4.
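A sketch of the simultaneous test, assuming model4 has an intercept and two slope coefficients as described:

# Test K %*% beta = 0 for all three coefficients simultaneously.
library(multcomp)
K <- diag(length(coef(model4)))                  # 3 x 3 identity: one row per coefficient
rownames(K) <- names(coef(model4))
summary(glht(model4, linfct = K))                # default right-hand side m = 0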

Table 5: Simultaneous inference results

Robust regression
To investigate whether our choice of model based on the data without the influential points holds up, we also implemented robust regression. Robust regression dampens the effect of influential cases and safeguards against these influences (Kutner et al. 2004). The built-in R function rlm() was used for this purpose. By default, this function uses the iteratively reweighted least squares (IRLS) method to fit the model (Yegorov 2016), with the scale estimated by the median absolute deviation (MAD). Starting from an ordinary least squares fit, it obtains weights from the residuals and fits the model again using weighted least squares; the weights are re-estimated after each iteration until convergence is obtained (Kutner et al. 2004).
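A minimal sketch of the robust fit, assuming the chosen model formula; because rlm() downweights influential cases itself, it can be run on the training data without any prior removal of points:

# IRLS robust fit of the chosen model (Huber weights, MAD scale by default).
library(MASS)
fit_rob <- rlm(log(insulin) ~ creatinine + gender, data = train, maxit = 50)
summary(fit_rob)        # compare coefficients and standard errors with the OLS fit (Table 6)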
Table 6: Results from robust regression


Table 6 summarizes the regression results obtained from robust regression. Comparison between the test and training data indicates similar coefficient values and standard errors. Further, these results are consistent with the results obtained from ordinary least squares. Given that robust regression is insensitive to influential observations, this indicates that the data used for the ordinary least squares analysis had been adequately screened for influential cases.

Regression Tree
We also implemented a regression tree, which is a simple, powerful, non-parametric regression technique (Kutner et al. 2004). Implementing this method on the training dataset without the influential cases, the MSE, PRESS, and MSPR were found to be 0.44, 203.36, and 0.61, respectively, which supports our choice of model.
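The tree fit can be sketched with the rpart package; train_clean and test are the assumed cleaned training and validation sets from the earlier sketches:

# Regression tree on the log-transformed response with all candidate predictors.
library(rpart)
tree_fit <- rpart(log(insulin) ~ iodine + creatinine + perchlorate + nitrate +
                    thiocyanate + gender + age + income, data = train_clean)
mean(resid(tree_fit)^2)                                          # in-sample MSE
mean((log(test$insulin) - predict(tree_fit, newdata = test))^2)  # MSPR on validation data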

CONCLUSION
Based on our analysis, we selected model 4 as our final model; i.e., the blood insulin level can be modeled as a linear function of urinary creatinine level and gender. However, the sign of the coefficient of the gender variable is opposite in the training and validation data. This holds true for all four models (and also in the robust regression, Table 6). This was a direct consequence of how the data were sampled when the full dataset was divided into training and test sets, and can be explained with the help of Figure 8. The boxplot of the training data shows that females have a higher insulin level on average, whereas the test data tell a different story. The full data agree with the training dataset. Therefore, we rely on the coefficients found from the training dataset when interpreting the effect of gender on insulin level.

Figure 8: Boxplot of response based on gender


The model we chose to show the relationship of blood insulin level with urinary creatinine level and gender was found to be statistically significant. However, it had very low predictive power, as indicated by a low coefficient of determination ($R^2$ less than 5%). Figure 9 shows the original values and the predicted values along with the 95% confidence and prediction bands. This figure indicates that this model has little practical application, because the prediction and confidence bands are similar across the entire range of the data. This emphasizes the importance of examining the practical application of a statistically significant model.
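The bands in Figure 9 can be reproduced along these lines, assuming model4 was refit on the cleaned training data and gender is coded as a factor with levels 1 (male) and 2 (female):

# 95% confidence and prediction intervals over the observed range of creatinine.
grid <- data.frame(creatinine = seq(min(train_clean$creatinine),
                                    max(train_clean$creatinine), length.out = 100),
                   gender = factor(1, levels = c(1, 2)))   # held fixed at 'male'
conf_band <- predict(model4, newdata = grid, interval = "confidence", level = 0.95)
pred_band <- predict(model4, newdata = grid, interval = "prediction", level = 0.95)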

Figure 9: Original and predicted values of training data

REFERENCES

Al-Attas, O. S., Al-Daghri, N. M., Alkharfy, K. M., Alokail, M. S., Al-Johani, N. J., Abd-Alrahman, S. H., Yakout, S. M., Draz, H. M., & Sabico, S. (2012). Urinary iodine is associated with insulin resistance in subjects with diabetes mellitus type 2. Experimental and Clinical Endocrinology & Diabetes, 120(10), 618-622.
ATSDR (Agency for Toxic Substances and Disease Registry), Division of Toxicology
and Environmental Medicine. (2015). Public health statement: Perchlorates.
https://www.atsdr.cdc.gov/ToxProfiles/tp162-c1-b.pdf. Accessed 4 Dec. 2016.
Auchincloss, A. H., Roux, A. V. D., Brown, D. G., O'Meara, E. S., & Raghunathan, T. E. (2007). Association of insulin resistance with distance to wealthy areas: the Multi-Ethnic Study of Atherosclerosis. American Journal of Epidemiology, 165(4), 389-397.
Blount, B. C., Pirkle, J. L., Osterloh, J. D., Valentin-Blasini, L., & Caldwell, K. L. (2006).
Urinary perchlorate and thyroid hormone levels in adolescent and adult men
and women living in the United States. Environmental Health Perspectives,
1865-1871.
CDC (Centers for Disease Control and Prevention). (2013). National Health and Nutrition Examination Survey. https://wwwn.cdc.gov/nchs/nhanes/2011-2012/PERNTS_G.htm. Accessed 4 Dec. 2016.


Geer, E. B., & Shen, W. (2009). Gender differences in insulin resistance, body
composition, and energy balance. Gender Medicine, 6, 60-75.
Horowitz, C. R., Colson, K. A., Hebert, P. L., & Lancaster, K. (2004). Barriers to buying
healthy foods for people with diabetes: evidence of environmental disparities.
American Journal of Public Health, 94(9), 1549-1554.
Hothorn, T., Bretz, F., Westfall, P., Heiberger, R. M., Schuetzenmeister, A., & Scheibe, S. (2016). Simultaneous Inference in General Parametric Models. https://cran.r-project.org/web/packages/multcomp/multcomp.pdf. Accessed 10 Dec. 2016.

Joshi, S. R., Parikh, R. M., & Das, A. K. (2007). Insulin: history, biochemistry, physiology and pharmacology. Journal-Association of Physicians of India, 55(L), 19.
Kutner, M. H., Nachtsheim, C. J., Neter, J. (2004). Applied Linear Regression Models
(4th ed.). McGraw-Hill Irwin.
Leung, A. M., Katz, P. M., He, X., Feig, D. S., Pearce, E. N., & Braverman, L. E. (2014).
Urinary perchlorate and thiocyanate concentrations in pregnant women from
Toronto, Canada. Thyroid, 24(1), 175-176.
Machinani, S., Bazargan-Hejazi, S., & Hsia, S. H. (2013). Psychological insulin
resistance among low-income, US racial minority patients with type 2 diabetes.
Primary Care Diabetes, 7(1), 51-55.
Mayo Clinic. Diabetes treatment: Using insulin to manage blood sugar. http://www.mayoclinic.org/diseases-conditions/diabetes/in-depth/diabetes-treatment/art-20044084. Accessed 2 Dec. 2016.
National Kidney Foundation. Diabetes a major risk factor for kidney disease.
https://www.kidney.org/atoz/content/diabetes. Accessed 2 Dec. 2016.

Ryan, A. S. (2000). Insulin resistance with aging. Sports Medicine, 30(5), 327-346.
Refaie, M. R., Sayed-Ahmed, N. A., Bakr, A. M., Aziz, M. Y. A., El Kannishi, M. H., &
Abdel-Gawad, S. S. (2006). Aging is an Inevitable Risk Factor for Insulin
Resistance. Journal of Taibah University Medical Sciences, 1(1), 30-41.
Steinmaus, C., Miller, M. D., Cushing, L., Blount, B. C., & Smith, A. H. (2013).
Combined effects of perchlorate, thiocyanate, and iodine on thyroid function in
the National Health and Nutrition Examination Survey 2007-08. Environmental
Research, 123, 17-24.
Introduction to SAS. UCLA: Statistical Consulting Group.
http://www.ats.ucla.edu/stat/sas/dae/rreg.htm. Accessed 9 Dec. 2016.
USDHHS (United States Department of Health and Human Services), National
Institute of Diabetes and Digestive and Kidney Diseases. (2014a). Causes of
diabetes. https://www.niddk.nih.gov/health-information/diabetes/causes.
Accessed 5 Dec. 2016.
USDHHS (United States Department of Health and Human Services), National Institute of Diabetes and Digestive and Kidney Diseases. (2014b). Prediabetes and insulin resistance. https://www.niddk.nih.gov/health-information/diabetes/types/prediabetes-insulin-resistance. Accessed 5 Dec. 2016.
WebMD. Iodine. http://www.webmd.com/vitamins-supplements/ingredientmono-35-iodine.aspx?activeingredientid=35. Accessed 2 Dec. 2016.
WebMD. (2015). Creatinine and creatinine clearance blood tests. http://www.webmd.com/a-to-z-guides/creatinine-and-creatinine-clearance-blood-tests#1. Accessed 2 Dec. 2016.
Yegorov, O. (2016). Robust Fitting of Linear Models.
http://stat.ethz.ch/R-manual/R-devel/library/MASS/html/rlm.html. Accessed 10
Dec. 2016.


APPENDIX
Part of the Data

R code: Submitted as separate .R files.

