Authors:
1. Acharya, Subash
2. Khan, Riaz
3. Shrestha, Mahesh
4. Suehring, Aaron
INTRODUCTION
The body breaks down carbohydrates from food into glucose. Glucose, a form
of sugar, is used by tissues in the body for energy (USDHHS 2014a). Once in
the bloodstream, glucose needs the hormone insulin in order to be absorbed by organs
and tissues (USDHHS 2014a). Insulin is made by the pancreas and regulates blood
glucose levels as part of metabolism (Joshi et al. 2007; USDHHS 2014a). Glucose
accumulates in the bloodstream if the pancreas cannot secrete enough insulin (Mayo
Clinic). Prolonged periods of elevated blood glucose can lead to diabetes or
prediabetes, kidney damage, and nerve damage (USDHHS 2014a).
Using data from the National Health and Nutrition Examination Survey
(NHANES), our objective was to determine whether there was a relationship between
insulin in the bloodstream (pmol/L) and a group of chemical predictor variables
measured in the urine, together with demographic predictor variables, for a sample of
974 individuals. The
chemical predictor variables we examined were urinary iodine, urinary creatinine,
urinary perchlorate, urinary nitrate, and urinary thiocyanate. Demographic variables
included gender, age, and household income.
Iodine (µg/L), measured in the urine as a quantitative variable, is an element
that must come from the diet; it cannot be made in the body (WebMD). Iodine is
needed by the thyroid gland to produce hormones. Iodine has been shown to have a
negative correlation with blood glucose and insulin (Al-Attas et al. 2012). Further,
iodine has been associated with insulin resistance in people with type 2 diabetes
(Al-Attas et al. 2012). Insulin resistance is a condition in which organs and body
tissue do not respond to insulin and therefore cannot easily absorb glucose from the
bloodstream. Thus, the body needs to produce more insulin (USDHHS 2014b).
Creatinine (mg/dL), also measured in the urine as a quantitative variable, is a
waste product of muscle deterioration and is filtered through the kidneys and excreted
through the urine (WebMD 2015). Testing the level of creatinine in a person's body is
a common measure of kidney health, and high levels of creatinine in the blood
indicate kidney failure in patients with diabetes (National Kidney Foundation; WebMD
2015).
Perchlorate (ng/mL), measured in the urine as a quantitative variable, is a
chemical commonly found in the environment. Exposure to it occurs through the
ingestion of food or water that contains perchlorate (ATSDR 2015). Perchlorate in the
body is associated with the inhibition of iodide uptake by the thyroid, which can lead
to hypothyroidism (Blount et al. 2006; ATSDR 2015). Similarly, exposure to nitrate
(ng/mL), measured in the urine as a quantitative variable, is primarily linked to the
ingestion of food or water that contains nitrates (CDC 2013). Like perchlorate, it
is known to disrupt thyroid function by inhibiting iodide uptake (CDC 2013).
Thiocyanate (ng/mL) is another chemical known to inhibit iodide uptake by the thyroid.
It was measured in the urine as a quantitative variable; exposure to thiocyanate occurs
from cigarette smoke and some vegetables of the genus Brassica (Leung et al. 2014).
Like perchlorate and nitrate, thiocyanate could lead to lower thyroid hormone
production (Steinmaus et al. 2013).
The demographic variables included in the analysis were gender, age, and
household income. Gender, a qualitative variable coded as 1 for male and 2 for
female, was included for its possible relationship with insulin resistance. Given that
increased body fat has been associated with insulin resistance (USDHHS 2014b),
higher levels of visceral and hepatic adipose tissue in men may contribute to higher
levels of insulin resistance (Greer & Shen 2009).
Aging has also been associated with increased body fat and increased levels
of insulin resistance (Ryan 2000). Age, a quantitative variable, was therefore
included in the analysis. The elderly were found to be more prone to insulin
resistance, although it is uncertain whether the cause is biological or environmental
(i.e., decreased physical activity, weight gain) (Refaie et al. 2006).
Lastly, household income, coded in qualitative bins, was included in the
analysis. It has been shown that proximity to resources in high-income areas is related
to insulin resistance (Auchincloss et al. 2007). A potential explanation is that grocery
stores in poorer areas have less-healthy food options (Horowitz et al. 2004).
Alternatively, in a low-income, poorly educated population of patients with diabetes,
48% reported an unwillingness to use insulin (Machinani et al. 2013). This
unwillingness is known as psychological insulin resistance (Machinani et al. 2013).
It may result in a relationship between lower-income households and lower levels
of insulin.
A previous study by Blount et al. (2006) used these NHANES data to
examine the relationship between perchlorate and thyroid hormone levels. The
covariates incorporated in their analysis that were also examined in ours include
urinary creatinine, urinary iodine, urinary nitrate, urinary thiocyanate, and age.
ANALYSIS
Data Formatting
Four files, which contained the data for estimating insulin level in the blood,
were acquired from NHANES. In those files, each respondent was given a unique
sequence number. The final dataset was created by merging the four files on their
common sequence number to extract the desired response variable and predictor
variables. The final dataset contained only records with common sequence numbers
for all the variables to be included in the analysis. After forming the final
dataset, we randomly divided it into two equal parts: a training dataset and a test
dataset. The training dataset was used to perform the initial analysis, whereas the
test dataset was used for model validation.
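The merge-and-split step described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' code: the helper names merge_on_seqn and split_in_half, the column names, and all toy values are hypothetical.

```python
import random

def merge_on_seqn(*tables):
    """Inner-join {seqn: {column: value}} tables on the sequence numbers they share."""
    common = set(tables[0])
    for table in tables[1:]:
        common &= set(table)
    merged = {}
    for seqn in sorted(common):
        row = {}
        for table in tables:
            row.update(table[seqn])
        merged[seqn] = row
    return merged

def split_in_half(keys, seed=0):
    """Shuffle the keys reproducibly and cut them into two halves.

    With an odd number of keys, the second half receives the extra one.
    """
    shuffled = sorted(keys)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Toy stand-ins for the four source files (hypothetical columns, made-up values).
insulin = {1: {"insulin": 60.1}, 2: {"insulin": 45.3}, 3: {"insulin": 80.0},
           4: {"insulin": 52.2}, 6: {"insulin": 71.4}}
iodine = {1: {"iodine": 150.0}, 2: {"iodine": 90.0}, 4: {"iodine": 210.0},
          5: {"iodine": 130.0}, 6: {"iodine": 175.0}}
perchlorate = {1: {"perchlorate": 3.1}, 2: {"perchlorate": 2.4}, 3: {"perchlorate": 5.0},
               4: {"perchlorate": 4.2}, 6: {"perchlorate": 1.9}}
demographics = {1: {"gender": 1, "age": 34}, 2: {"gender": 2, "age": 51},
                4: {"gender": 2, "age": 27}, 6: {"gender": 1, "age": 63}}

merged = merge_on_seqn(insulin, iodine, perchlorate, demographics)
train_ids, test_ids = split_in_half(merged)
```

Fixing the seed makes the split reproducible, which matters here because the report later attributes a sign change in the gender coefficient to the particular random split.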
Insulin level was selected as the response variable, and urinary iodine, urinary
creatinine, urinary perchlorate, urinary nitrate, urinary thiocyanate, gender, age, and
household income were treated as the predictor variables. In total, there were eight
predictor variables selected for estimating the insulin level in the blood. Two of the
variables, gender and household income, were qualitative variables, and the six
remaining variables were quantitative variables.
A plot of residuals vs. fitted values for this fitted model suggested that the
variance of the error is not constant. Furthermore, the normal Q-Q plot of the residuals
suggests that some departure from normality of the errors exists. After assessing the
diagnostic plots, we concluded that the assumptions of constant error variance and
normality of the error terms are violated. Furthermore, the estimated
intercept and coefficients for the different predictor variables span a wide range,
so the values of the predictor variables were rescaled. Urinary
iodine, urinary creatinine, and urinary thiocyanate were divided by 100, whereas
urinary nitrate was divided by 1000. These scaling factors were chosen to put all the
predictor variables on a common scale.
As previously stated, there appears to be unequal error variance and nonnormality of the error terms.
Multicollinearity Check
To check whether there is any serious multicollinearity problem in our data, we used
the variance inflation factor (VIF) criterion. Each predictor is regressed against the
rest of the predictors, and the VIF is found using $VIF_k = 1/(1 - R_k^2)$, where
$R_k^2$ is the multiple coefficient of determination when the $k$th predictor is
regressed on the other predictors. The cutoff for serious multicollinearity is
suggested to be 10 (Kutner et al. 2004). For our data, the highest VIF was
1.76, suggesting there is not a serious multicollinearity problem in our data. Table 1
shows the VIF values and the correlations between the predictor variables of the data.
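The VIF computation can be illustrated with a small standard-library Python sketch. It is not the authors' code: the helper names solve, r_squared, and vif are hypothetical, the normal equations are solved directly (adequate for a toy example, not numerically robust for real data), and the example columns are made up.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def r_squared(y, columns):
    """R^2 from regressing y on the given predictor columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)] for a in range(k)]
    xty = [sum(design[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - sse / sst

def vif(columns):
    """VIF_k = 1 / (1 - R_k^2), regressing each predictor on all the others."""
    result = []
    for k, col in enumerate(columns):
        others = columns[:k] + columns[k + 1:]
        result.append(1.0 / (1.0 - r_squared(col, others)))
    return result

# Uncorrelated toy predictors give VIFs of 1; nearly collinear ones exceed 10.
vif_uncorrelated = vif([[-1.0, -1.0, 1.0, 1.0], [-1.0, 1.0, -1.0, 1.0]])
vif_collinear = vif([[1.0, 2.0, 3.0, 4.0, 5.0], [1.1, 1.9, 3.2, 3.9, 5.1]])
```

A VIF near 1, such as the maximum of 1.76 reported above, means a predictor shares little variance with the others, whereas the nearly collinear toy pair lands far above the cutoff of 10.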
Model selection
In total, we fit four models. In model 1, we included all the variables except
age and income. When the model was fitted with all the predictors, these two terms
were not significant at any conventional significance level (up to 0.1). Additionally,
when the response was regressed on each predictor separately, these two predictors did
not show any significant linear relationship. Model 2 was selected by stepwise
selection based on the Akaike information criterion (AIC). Model 3 is a simpler model,
derived from model 2 by deleting the variable with the highest p-value. Model 4
is a further simplification in which one more variable, again the one with the highest
p-value, was deleted from model 3.
Figures 5 and 6 show the residual scatter plots and the normal Q-Q plots for all
the models proposed. From visual inspection, the residuals appear to meet the
normality and constant-variance assumptions. However, the Q-Q plots suggest that
there are some outlying observations in the data. Further analysis, discussed later,
attempted to identify influential points.
value < 0.05), we found that there were four competitive models. We calculated
several model selection statistics to determine the best model.
Table 3: Regression results for candidate models based on training and validation
dataset
The AIC was calculated using $AIC = 2k - 2\ln(L)$, where $k$
is the number of estimated parameters and $L$ is the maximum value of the likelihood
function. Model 2 produced the lowest AIC value of the four models, and was therefore
considered the best model using this criterion. We compared the SSE values for the
four models using $SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$, where $Y_i$ is the
$i$th observation and $\hat{Y}_i$ is the predicted value for the $i$th observation.
Although all four models produced almost identical SSE values, model 1 produced the
lowest SSE and was therefore considered the best model using this criterion.
We compared the MSE values for the four models using $MSE = SSE / (n - p)$, where
$n$ is the number of observations and $p$ is the number of estimated regression
parameters. All four models produced similar values; however, model 2 produced the
lowest MSE and was therefore considered the best model. We calculated and compared the
PRESS values using $PRESS = \sum_{i=1}^{n} (Y_i - \hat{Y}_{i(i)})^2$, where
$\hat{Y}_{i(i)}$ is the predicted value for the $i$th observation from the model
fitted without that observation. PRESS is a way to determine how well the fitted
values for a given subset model can predict the observed response values (Kutner et
al. 2004). Although models 2, 3, and 4 were all very similar, model 4 produced the
lowest PRESS value and was therefore considered the best model using this criterion.
Next, we calculated Mallows' $C_p$ for the four models using
$C_p = SSE_p / MSE_{full} - n + 2p$, where $SSE_p$ is the SSE value of the candidate
model, and $MSE_{full}$ is the MSE of the model containing all predictors; the model
with the lowest $C_p$ value was therefore considered the best model. Lastly, we
calculated and compared the MSPR for all four models using
$MSPR = \sum_{i=1}^{n^*} (Y_i - \hat{Y}_i)^2 / n^*$, where $Y_i$ and $\hat{Y}_i$ are
the original and predicted values in the validation dataset and $n^*$ is the number
of cases in the validation dataset. Although models 3 and 4 produced very similar
results, model 3 produced the lowest MSPR and was therefore considered the best
model. All four models were then applied to the validation dataset, and the results
were recorded in Table 3. The results show that model 1 was the best model;
however, three of the predictor variables were not significant.
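To make the selection statistics concrete, the sketch below computes SSE, MSE, PRESS, AIC, and MSPR for a one-predictor least squares fit in plain Python. It is an illustration under stated assumptions rather than the authors' procedure: the function names and toy data are hypothetical, the AIC assumes normally distributed errors and counts the error variance as one of the $k$ parameters, and PRESS uses the standard leverage shortcut $e_i / (1 - h_{ii})$ instead of refitting the model $n$ times.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

def selection_stats(x, y):
    """SSE, MSE, PRESS and AIC for the simple linear regression of y on x."""
    n, p = len(x), 2  # p counts the intercept and the slope
    b0, b1 = fit_line(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    leverage = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    sse = sum(e ** 2 for e in resid)
    # Deleted residual identity: Y_i - Yhat_i(i) = e_i / (1 - h_ii).
    press = sum((e / (1 - h)) ** 2 for e, h in zip(resid, leverage))
    # Maximised Gaussian log-likelihood; k = p + 1 includes the error variance.
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(sse / n) + 1)
    aic = 2 * (p + 1) - 2 * log_lik
    return {"SSE": sse, "MSE": sse / (n - p), "PRESS": press, "AIC": aic}

def mspr(b0, b1, x_new, y_new):
    """Mean squared prediction error on validation cases."""
    errors = [(yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x_new, y_new)]
    return sum(errors) / len(errors)

# Made-up training and validation data.
x_train, y_train = [1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0]
stats = selection_stats(x_train, y_train)
b0, b1 = fit_line(x_train, y_train)
validation_mspr = mspr(b0, b1, [5.0, 6.0], [4.5, 6.3])
```

PRESS is never smaller than SSE because each deleted residual is the ordinary residual inflated by $1 / (1 - h_{ii})$, and an MSPR close to the MSE suggests that the training MSE is not badly optimistic.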
From the results presented in Table 3, model 2 seems to be an appealing
choice as it possesses the lowest AIC and
statistics do not show a lot of variability for the four models. However, looking at the
MSPR value, models 3 and 4 appear to be better choices than the other two.
Because the results in Table 3 are not consistent across the model selection
criteria, influential points may be present. Therefore, we applied
measures to detect influential observations.
Up to this point, none of our data had been screened for the presence of influential
observations. Looking at Figures 4 and 5, there appear to be observations whose
residuals deviate strongly from the mean. This could be a direct
consequence of potential outliers and influential points in the data. Although a point
may be an outlier in terms of the range of predictor variables, it may not be an outlier
in terms of the response variable. Conversely, a point may be an outlier in terms of
the response variable, yet it may not be an outlier in terms of the predictor variables.
Further, a point may be an outlier in terms of both the response and predictor variables.
In these instances, it is possible that although the point is outlying for all or only one
of the variables, it may not have an influence on the regression line. Therefore, it was
necessary to assess the data for influential points.
We used two measures to assess the presence of influential points: DFFITS
and Cook's Distance. DFFITS is given by
$DFFITS_i = t_i \left( \frac{h_{ii}}{1 - h_{ii}} \right)^{1/2}$, where $t_i$ is the
studentized deleted residual and the second term of the product is the leverage factor
of the observation. An observation having a high DFFITS value is identified as an
influential point per this criterion. We used $2\sqrt{p/n}$ as the DFFITS cutoff,
while the Cook's Distance for the $i$th case is calculated using
$D_i = \frac{e_i^2}{p \, MSE} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$. Thus, we
get a high Cook's Distance value for high residuals and/or high leverage values. A
higher $D_i$ indicates a higher degree of influence of the $i$th case. We used a
Cook's Distance cutoff value of 4/n to identify influential points, where n was the
number of data observations (Introduction to SAS). It should be noted that identifying
the influential points by both these criteria depends on the model itself, as the
calculation involves $h_{ii}$, the diagonal element of the hat matrix. We aggregated
the influential points from both tests and removed them from the data for the
corresponding model. This process was repeated for each of the four models.
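The two influence measures can be sketched for a one-predictor regression as follows. This is a minimal illustration, not the authors' code: the function names and toy data are hypothetical, the cutoffs $2\sqrt{p/n}$ for DFFITS and $4/n$ for Cook's Distance are applied exactly as stated in the text, and the studentized deleted residual uses the standard identity $t_i = e_i \sqrt{(n - p - 1) / (SSE(1 - h_{ii}) - e_i^2)}$.

```python
import math

def influence_measures(x, y):
    """Return (dffits, cooks) lists for the simple linear regression of y on x."""
    n, p = len(x), 2  # p counts the intercept and the slope
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    leverage = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]  # hat-matrix diagonal
    sse = sum(e ** 2 for e in resid)
    mse = sse / (n - p)
    dffits, cooks = [], []
    for e, h in zip(resid, leverage):
        # Studentized deleted residual, computed without refitting the model.
        t = e * math.sqrt((n - p - 1) / (sse * (1 - h) - e ** 2))
        dffits.append(t * math.sqrt(h / (1 - h)))
        cooks.append(e ** 2 / (p * mse) * h / (1 - h) ** 2)
    return dffits, cooks

def flag_influential(x, y):
    """Indices flagged by either cutoff: |DFFITS| > 2*sqrt(p/n) or Cook's D > 4/n."""
    n, p = len(x), 2
    dffits, cooks = influence_measures(x, y)
    return [i for i in range(n)
            if abs(dffits[i]) > 2 * math.sqrt(p / n) or cooks[i] > 4 / n]

# Made-up data: nine points close to y = x plus one high-leverage point far off the line.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.9, 8.1, 9.0, 5.0]
flagged = flag_influential(x, y)
```

Taking the union of the two criteria mirrors the aggregation step described above. Because both measures depend on $h_{ii}$ and on the fitted residuals, the flagged set has to be recomputed for every candidate model.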
Figure 7 shows the influential points identified by these measures for model 1.
This was repeated for all four proposed models. The models were fitted again without
these influential points. Table 4 summarizes the regression results for the candidate
models based on the datasets after removing the influential points.
Looking at the results presented in Table 4, we see that model 4 produces the
best model selection values in terms of
value. Five
out of the six criteria we have considered for model evaluation indicate model 4 is the
best model. The other statistic for model 4, MSPR is only slightly above the lowest of
all the models. These results are consistent for the validation data as well, with the
exception of the
is supported by the statistics and is very simple in nature as it includes only two
predictor variables. For this chosen model, we ran a multiple comparison test to
check whether all the coefficients were significantly different from zero. The glht()
function from the R package multcomp was used to do that. This function takes the
null hypothesis in the form $K\beta = m$, where $K$ is the matrix defining the linear
functions of the coefficients and $m$ is the specified right-hand-side vector.
Robust regression
To investigate whether our choice of model, which was based on the data after
removing the influential points, holds up, we fitted a robust regression. Robust
regression dampens the effect of influential cases and safeguards against these
influences (Kutner et al. 2004). The rlm() function from the R package MASS was used
for this purpose. By default, this function uses the iteratively reweighted least
squares (IRLS) method to fit the model (Yegorov 2016). With the default settings, it
uses the Huber weight function and rescales the residuals by the median absolute
deviation (MAD). Starting from the residuals of an ordinary least squares fit, it
obtains the weights and fits the model again using weighted least squares. The
weights are re-estimated after each iteration until convergence is
obtained (Kutner et al. 2004).
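The IRLS idea can be sketched in plain Python for a single predictor. This is a simplified illustration and not the MASS implementation of rlm(): the function names and toy data are hypothetical, the Huber tuning constant 1.345 and the scale estimate (median absolute residual divided by 0.6745) are assumptions chosen to resemble common defaults, and the convergence test on the coefficients is a simplification.

```python
import statistics

def weighted_line(x, y, w):
    """Weighted least squares for y = b0 + b1 * x; returns (b0, b1)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

def huber_irls(x, y, k=1.345, tol=1e-8, max_iter=100):
    """Iteratively reweighted least squares with Huber weights."""
    weights = [1.0] * len(x)
    b0, b1 = weighted_line(x, y, weights)  # start from ordinary least squares
    for _ in range(max_iter):
        resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
        scale = statistics.median(abs(e) for e in resid) / 0.6745  # MAD-type scale
        if scale == 0:
            break  # at least half the points are fitted exactly; nothing to rescale
        # Huber weights: 1 inside the threshold, shrinking as k*scale/|e| outside it.
        weights = [1.0 if abs(e) <= k * scale else k * scale / abs(e) for e in resid]
        new_b0, new_b1 = weighted_line(x, y, weights)
        converged = abs(new_b0 - b0) < tol and abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1, weights

# Made-up data: points near y = x with one gross outlier in the response at x = 5.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [1.2, 1.9, 3.1, 3.8, 30.0, 6.1, 6.9, 8.2, 8.8, 10.1]
rob_b0, rob_b1, weights = huber_irls(x, y)
```

Cases with large scaled residuals receive weights below 1, so they pull less on the refitted line. Note that Huber weighting guards mainly against outliers in the response; a case that is extreme only in the predictors can still dominate the fit.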
Table 6: Results from robust regression
Regression Tree
We also fitted a regression tree, which is a simple and powerful non-parametric
regression technique (Kutner et al. 2004). Applying this method to
the training dataset without the influential cases, the MSE, PRESS, and MSPR were
found to be 0.44, 203.36, and 0.61, respectively, which validates our choice of model.
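The core step of a regression tree is choosing the split that minimizes the within-node sum of squared errors. The sketch below shows that single step for one predictor in plain Python; it is not the authors' fitting procedure, the function names and toy data are hypothetical, and a full tree would apply the same search recursively to each resulting node and across all predictors.

```python
def sse_about_mean(values):
    """Sum of squared deviations of the values from their mean (0 for an empty node)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total_sse) for the binary split on x with the smallest SSE."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate tied x values
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [yv for _, yv in pairs[:i]]
        right = [yv for _, yv in pairs[i:]]
        total = sse_about_mean(left) + sse_about_mean(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

# Made-up data with an obvious step between x = 3 and x = 10.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
threshold, split_sse = best_split(x, y)
```

Each node then predicts the mean response of its cases, which is why the tree needs no assumption about the functional form of the relationship.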
CONCLUSION
Based on our analysis, we selected model 4 as our final model; that is, the blood
insulin level can be modeled as a linear function of urinary creatinine level and gender.
However, the coefficient of the gender variable has opposite signs in the training and
the validation data. This holds true for all four models (and for the robust
regression, Table 6). It is a direct consequence of the random split of the full
dataset into training and test data, as Figure 8 illustrates.
The boxplot of the training data shows that females have a higher insulin level on
average, whereas the test data tell a different story. The full data agree with the
training dataset. Therefore, we kept the coefficients found from the training dataset
when interpreting the effect of gender on insulin level.
The model we chose to show the relationship of blood insulin level with urinary
creatinine level and gender was found to be statistically significant. However, it had
very low predictive power, as indicated by the low coefficient of determination
($R^2$ less than 5%). Figure 9 shows the original values and the predicted values
along with the 95% confidence and prediction bands. This figure indicates that this
model has little practical application because the prediction and confidence bands
are similar across the entire range of the data. This emphasizes the importance of
examining the practical application of a statistically significant model.
REFERENCES
Al-Attas, O. S., Al-Daghri, N. M., Alkharfy, K. M., Alokail, M. S., Al-Johani, N. J.,
Abd-Alrahman, S. H., Yakout, S. M., Draz, H. M., & Sabico, S. (2012). Urinary iodine
is associated with insulin resistance in subjects with diabetes mellitus type 2.
Experimental and Clinical Endocrinology & Diabetes, 120(10), 618-622.
ATSDR (Agency for Toxic Substances and Disease Registry), Division of Toxicology
and Environmental Medicine. (2015). Public health statement: Perchlorates.
https://www.atsdr.cdc.gov/ToxProfiles/tp162-c1-b.pdf. Accessed 4 Dec. 2016.
Auchincloss, A. H., Roux, A. V. D., Brown, D. G., O'Meara, E. S., & Raghunathan, T.
Examination
Survey.
https://wwwn.cdc.gov/nchs/nhanes/2011-
Ryan, A. S. (2000). Insulin resistance with aging. Sports Medicine, 30(5), 327-346.
Refaie, M. R., Sayed-Ahmed, N. A., Bakr, A. M., Aziz, M. Y. A., El Kannishi, M. H., &
Abdel-Gawad, S. S. (2006). Aging is an Inevitable Risk Factor for Insulin
Resistance. Journal of Taibah University Medical Sciences, 1(1), 30-41.
Steinmaus, C., Miller, M. D., Cushing, L., Blount, B. C., & Smith, A. H. (2013).
Combined effects of perchlorate, thiocyanate, and iodine on thyroid function in
the National Health and Nutrition Examination Survey 2007-08. Environmental
Research, 123, 17-24.
Introduction to SAS. UCLA: Statistical Consulting Group.
http://www.ats.ucla.edu/stat/sas/dae/rreg.htm. Accessed 9 Dec. 2016.
USDHHS (United States Department of Health and Human Services), National
Institute of Diabetes and Digestive and Kidney Diseases. (2014a). Causes of
diabetes. https://www.niddk.nih.gov/health-information/diabetes/causes.
Accessed 5 Dec. 2016.
USDHHS (United States Department of Health and Human Services), National
Institute of Diabetes and Digestive and Kidney Diseases. (2014b). Prediabetes
and insulin resistance.
https://www.niddk.nih.gov/health-information/diabetes/types/prediabetes-insulin-resistance.
Accessed 5 Dec. 2016.
WebMD. Iodine.
http://www.webmd.com/vitamins-supplements/ingredientmono-35-iodine.aspx?activeingredientid=35.
Accessed 2 Dec. 2016.
WebMD. (2015). Creatinine and creatinine clearance blood tests.
http://www.webmd.com/a-to-z-guides/creatinine-and-creatinine-clearance-blood-tests#1.
Accessed 2 Dec. 2016.
Yegorov, O. (2016). Robust Fitting of Linear Models.
http://stat.ethz.ch/R-manual/R-devel/library/MASS/html/rlm.html. Accessed 10
Dec. 2016.
APPENDIX
Part of the Data