Final Project
Stat 152
May 11, 2017
Contents
1 Introduction
1.1 Background
1.2 Research Question
1.3 Analysis Summary
2 Survey Design
2.1 NHANES Overview
2.1.1 Public Release Data
2.1.2 Design Element Definitions
2.2 Design Elements of Survey
2.3 Exploration of Design Elements
3 Methodology
3.1 Data Description
3.2 Data Merging and Cleaning
3.3 Fixing Missing Data Values
4 Results
4.1 Effect of Participation in PE on Physical Health
4.2 Effect of Participation in PE on Weight Awareness
4.3 Effect of Frequency of PE on Physical Health
4.4 Effect of Frequency of PE on Weight Awareness
4.5 Effect of Enjoyment of PE on Physical Health
4.6 Effect of Enjoyment of PE on Weight Awareness
5 Conclusions
5.1 Caveats and Limitations
5.2 Next Steps
6 Appendix
Introduction
In recent decades, America has grown into its reputation as one of the top ten fattest countries
in the world. According to the 2013-2014 National Health and Nutrition Examination Survey
(NHANES), the obesity rate of American adults is at a staggering 38%, compared to a global
average of 13% [10]. In particular, childhood obesity rates (ages 2-19) have nearly tripled
since the early 1980s and have hovered around 17% for the past ten years. Rates have
declined among 2 to 5-year-olds, held stable among 6 to 11-year-olds, and increased among
12 to 19-year-olds [9]. The Centers for Disease Control and Prevention (CDC) attributes
many of these trends to unhealthy eating habits, excessive sedentary activity, and lack of
regular physical activity [5].
In response to these alarmingly high rates of both adult and childhood obesity across the
nation, the United States Department of Agriculture (USDA) has taken steps to promote
portion control and healthy food choices among the American public. In 2005, the dietary
system MyPyramid was released, and in 2011 it was replaced by MyPlate [3]. Both of
these dietary guidelines served as a part of a larger communication initiative encouraging a
shift towards more healthful eating habits. However, in November of 2015, it was revealed
that results of the 2013-2014 CDC survey indicated no significant decrease in obesity rates,
which remained at 17% for youth and 36% for adults [3].
MyPyramid and MyPlate were federal attempts to improve the eating habits of the American
public, but many wonder why no federal regulations have been imposed to encourage
regular physical activity and exercise, particularly in youth. The question of whether
or not physical education classes should be mandatory in schools has sparked widespread
national debate, with proponents arguing that such classes not only foster healthy and active
lifestyles, but also drastically improve the physical and mental health of adolescents. In
this study, we examine data from the 2013-2014 National Health and Nutrition Examination
Survey (NHANES) to determine whether or not there exists significant evidence that
physical education classes are correlated with better physical health and weight awareness
of American adolescents aged 12-15.
1.1 Background
Body Mass Index (BMI) is a value derived from weight and height measurements that attempts
to quantify the amount of tissue mass (muscle, fat, and bone) in an individual and
place that individual in one of four distinct weight categories: underweight, healthy weight,
overweight, and obese [4].
\[ \mathrm{BMI} = \frac{\text{Weight (lb)}}{\text{Height (in)}^{2}} \times 703 \]
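As a quick check of the formula, the following R sketch computes BMI from imperial measurements. The helper name bmi_imperial and the example weight and height are illustrative assumptions and are not part of the original analysis.

```r
# Hypothetical helper: BMI from weight in pounds and height in inches.
bmi_imperial <- function(weight_lb, height_in) {
  703 * weight_lb / height_in^2
}

# Illustrative values only: 120 lb and 64 in.
bmi_example <- bmi_imperial(120, 64)
print(round(bmi_example, 2))
```

The factor 703 converts from pounds and inches to the metric definition of BMI (kilograms per square meter).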
Additionally, the standard weight categories corresponding to BMI values are given in Table 1.1
below:
1. Does the likelihood of an adolescent being of healthy weight (as defined by BMI) vary
by participation in a physical education class?
Additionally, we consider the impact of two other variables that may or may not affect
the significance of our results: frequency of physical education class per week and enjoyment
of physical education class. These specific sub-questions are stated below:
1. Does the likelihood of an adolescent being of healthy weight (as defined by BMI) vary
by frequency of physical education class?
2. Does the likelihood of an adolescent being of healthy weight (as defined by BMI) vary
by enjoyment of physical education class?
census block, and household information. One indirect consequence of this is reduction
in the variance of estimates, as more PSUs are included in each release. Because of this
modification in variances, NHANES also releases Masked Variance Units (MVUs) in its
public data set [8]. The 2013-2014 NHANES data used in this study consist of 15 masked-variance
strata and 30 masked-variance primary sampling units [2]. Rather than treating
the survey design as a four-stage sample, the MVU design allows us to treat the survey as
a two-stage sample and was designed to closely approximate the variances that would have
been estimated using the true design variables [7].
It should also be noted that NHANES provides both interview and exam weights due to
two different methods of data collection. For example, in our project, we have information
that was collected via examination (BMI) and information that was collected by interview
(responses about PE). However, because not everyone selected for an interview or examination
responded, the weights must be adjusted to account for non-response.
Additionally, the CDC instructs that the examination weights should be used
exclusively for analyses of data from the examination, or in conjunction with the interview
data [8].
SSU: Secondary Sampling Unit, the classification of sampling units in the second stage
of a cluster sample. The SSUs in this design are the groups of Census Blocks. A subset
of SSUs is sampled.
Census Block: The finest grouping used by the US Census Bureau in the Decennial
Census. In highly populated areas Blocks are typically city blocks, but they can cover much
larger geographic areas in rural regions.
Sample Weight: The reciprocal of the inclusion probability of a sample unit. The
sample weight is the number of people that a unit in the sample represents in the
population.
Sample Unit: An individual unit that can be sampled in each stage of a sample. The
sample unit of the final stage of this design is an individual person.
Observation Unit: The unit on which data are collected. The observation unit of this
design is an individual that has been selected.
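To make the Sample Weight definition concrete, the short R sketch below turns inclusion probabilities into weights. The probabilities are made-up values chosen for illustration, not NHANES figures.

```r
# Made-up inclusion probabilities for four sampled people (assumption).
incl_prob <- c(0.001, 0.002, 0.0005, 0.004)

# Sample weight = reciprocal of the inclusion probability.
samp_wt <- 1 / incl_prob

# The sum of the weights estimates the size of the population represented.
est_pop_size <- sum(samp_wt)
print(samp_wt)
print(est_pop_size)
```

A person sampled with probability 0.001 therefore stands in for 1,000 people in the population.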
The first stage samples PSUs from all counties in the United States (aggregated contiguously
if not sufficiently large). All PSUs were stratified into groups based on health and
urban and rural composition of the county. States were aggregated into 5 groups based on
derived health factors, geography, and population size to create homogeneous strata. A
table of how the states were divided can be found in Table 2.2.
Next, counties within each group were stratified by urban and rural composition [7]. Of
the 13 major strata and 52 minor strata in the two-year data, two PSUs per major stratum (of
unique minor strata) are selected. These PSUs were sampled with probability proportional
to size (PPS), with a select few PSUs included with probability 1, determined by the
size of the PSU. The selection probability for non-certainty PSUs was also determined by the
size of the PSU, with a correction to reduce the number of PSUs selected in both the 2007-2010 and
2011-2014 NHANES [7].
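The idea of PPS selection with certainty units can be sketched in R as follows. This is a generic textbook construction under stated assumptions, not the NHANES procedure: the helper pps_incl_prob and the PSU sizes are hypothetical, and the overlap correction mentioned above is not modeled.

```r
# Hypothetical helper: first-order inclusion probabilities for selecting n units
# with probability proportional to size. Units whose probability would reach
# or exceed 1 are taken with certainty, and the rest are recomputed.
pps_incl_prob <- function(size, n) {
  prob <- rep(NA_real_, length(size))
  certain <- rep(FALSE, length(size))
  repeat {
    n_left <- n - sum(certain)
    prob[!certain] <- n_left * size[!certain] / sum(size[!certain])
    new_certain <- !certain & prob >= 1
    if (!any(new_certain)) break
    certain <- certain | new_certain
  }
  prob[certain] <- 1
  prob
}

psu_size <- c(900, 50, 30, 20)  # made-up PSU population sizes
pi_psu <- pps_incl_prob(psu_size, n = 2)
print(pi_psu)
```

In this toy case the largest PSU is a certainty selection, and the remaining single draw is shared among the other three in proportion to their sizes.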
The second stage sampled geographic area segments of census blocks and aggregated
combinations of census blocks. Segments were selected with PPS to create approximately
equal sample sizes per PSU and SSU. The population size of each segment is based on estimates
from the 2000 Census and updated estimates from the American Community Survey.
In the third stage, a sample of all households, or Dwelling Units (DUs), within selected
segments is taken. Probability of selection in this stage was mostly determined by domains
that NHANES intended to over-sample due to low frequencies in the population. In
2013-2014, the over-sampled population consisted of Hispanic persons; non-Hispanic black
persons; non-Hispanic, non-black Asian persons; and low-income non-Hispanic, non-black,
non-Asian white and other persons (at or below 130% of the federal poverty line) [7].
The final stage of sampling selected individuals within households. Here individuals
were selected in order to maximize the average number of sampled participants per sample
household while trying to meet the target sample size based on sex, age, race and Hispanic
origin, and income [7]. Table 2.3 lists the target sizes for some of these categories.
Sampled Unit                  Number
Study Locations               15
Segments                      360
DUs to be Screened            13,529
Households to be Screened     11,500
Sampled Persons               6,888
Examined Persons              5,000
to account for non-response and post-stratification to demographic rates. Weights for the
interview and examination are different because a participant could respond in the interview
and fail to go to an MEC. As mentioned previously, NHANES specifies that exam weights be
used for analyses that include data from the MEC.
In the range of the smaller weights, the data look approximately normal. However, there
are many extreme values and the distribution has a long right tail. The mean
exam weight was 24,067.91 and the median exam weight was 15,869.53. The largest exam
weight belonged to an 11-year-old female in PSU 2 and stratum 111.
There is high variability of the median exam weights by the pseudo-stratum. This is
expected as the survey design reduces variance when the strata are homogeneous within but
Median weights are consistent across the BMI classifications. The underweight
classification has the lowest density of high exam weight responses.
Comparing the sum of the weights to the distribution of the weights yields an interesting
perspective on the survey design. The sum of the weights of subjects with PE five times per
week represents the highest proportion of the weighted data. This subset has approximately
the same median weight but far more upper-bound outliers for exam weight. Those with
PE five times per week both make up the highest proportion of the sample and represent
the greatest number of people in the population. Here we also notice that there is a column
with NA values; how we correct for that is addressed in the methodology section.
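The comparison of weight sums across PE frequencies can be reproduced compactly with tapply. The sketch below uses a made-up toy data frame; the column names follow the appendix code (freq_pe, exam_wt), but the values are assumptions.

```r
# Toy data (assumption): PE frequency per week and exam weight.
toy <- data.frame(
  freq_pe = c(0, 5, 5, 3, 5, 0),
  exam_wt = c(10000, 20000, 30000, 15000, 10000, 15000)
)

# Sum of exam weights within each PE frequency, and each group's share.
wt_sum <- tapply(toy$exam_wt, toy$freq_pe, sum)
wt_share <- wt_sum / sum(wt_sum)
print(round(wt_share, 2))
```

Each share estimates the fraction of the represented population that falls in that PE frequency group.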
Using the select function from the dplyr package, we subsetted our merged data
frame for only relevant columns. We then renamed these variables using the rename
function from the same package into more generally comprehensible names. Original
Factor variables in the original data set were coded with integer levels that held no
contextual meaning without use of the code book. For convenience, we recoded
these factor levels into more comprehensible labels using the mutate function in
the dplyr package. The recoded variables, original factor levels, and modified labels
are displayed in Table 3.2 on the previous page.
Our main explanatory variables of interest in this study were the interview questions
related to existence, frequency, and enjoyment of physical education classes (pe_yn,
freq_pe, enjoy_pe). Because the NHANES survey only reported
values of these variables for adolescents between the ages of 12 and 15, we filtered our
entire data frame to include only respondents of ages 12-15. Completion of this step
using the filter function in the dplyr package left us with 737 observation units.
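The select, rename, mutate, and filter steps described above can be sketched as a single dplyr pipeline. The toy data frame and its raw column names (raw_id, raw_age, raw_pe) are hypothetical stand-ins for the NHANES variables; the factor levels and labels for pe_yn follow the appendix code.

```r
library(dplyr)

# Toy stand-in for the merged data; raw column names and values are assumptions.
merged <- data.frame(
  raw_id  = 1:6,
  raw_age = c(11, 12, 13, 15, 16, 14),
  raw_pe  = c(1, 2, 1, 1, 2, 9),
  unused  = letters[1:6]
)

cleaned <- merged %>%
  select(raw_id, raw_age, raw_pe) %>%                      # keep relevant columns
  rename(id = raw_id, age = raw_age, pe_yn = raw_pe) %>%   # readable names
  mutate(pe_yn = factor(pe_yn, levels = c(1, 2, 7, 9),
                        labels = c("yes", "no", "refused", "unsure"))) %>%
  filter(age >= 12, age <= 15)                             # ages 12-15 only

print(cleaned)
```

Only the four respondents aged 12 to 15 remain, with pe_yn shown as labels instead of integer codes.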
2. Item non-response
With regards to exam weights of zero, closer inspection of the data revealed that exam
weights equalled zero when a survey respondent completed the interview portion of the survey
but did not have any data for the exam portion. We considered this as unit non-response with
regards to the examination portion of the survey. With no examination data to cross-analyze
with interview data, it is not possible to draw any conclusions regarding impact of physical
education on physical health and weight awareness from these responses. Additionally, it
did not make sense to impute exam data for these respondents based on interview data,
as the relationship between actual and perceived weight categories is a primary variable of
interest in our study. Therefore, assuming these instances of unit non-response were missing
completely at random (MCAR), we removed all rows corresponding to survey respondents
who had exam weights of zero. Completion of this step left us with 713 rows in our data
frame.
Moving on to issues of item non-response, we first noted that one particular variable with
a high frequency of missing data was freq_pe. Upon closer examination of the data, we
discovered that freq_pe was set to NA for all instances in which pe_yn equalled no. That
is, for any respondent who did not participate in physical education classes, the frequency
of physical education class was recorded as NA. To fix this issue, for all survey respondents
who answered no for the pe_yn variable, we replaced the NA values with 0, indicating that
these respondents participated in physical education 0 times per week. This substitution
eliminated all instances of item non-response for the freq_pe variable.
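The two cleaning steps above (dropping respondents with an exam weight of zero, then setting freq_pe to 0 for respondents without PE) can be illustrated in base R. The toy values are assumptions; only the column names come from the appendix code.

```r
# Toy data (assumption): exam_wt of 0 marks interview-only respondents.
toy <- data.frame(
  exam_wt = c(15000, 0, 22000, 9000),
  pe_yn   = c("yes", "no", "no", "yes"),
  freq_pe = c(3, NA, NA, 5)
)

# Drop unit non-response for the exam portion.
toy <- toy[toy$exam_wt > 0, ]

# No PE class means PE zero times per week.
toy$freq_pe[toy$pe_yn == "no" & is.na(toy$freq_pe)] <- 0

print(toy)
```

The order matters only for bookkeeping: respondents removed in the first step never reach the recoding step.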
Remaining issues of item non-response were instances in which a survey respondent had
nonzero exam weight and missing values for some but not all demographic variables and
exam data. In total, there were 46 instances of item non-response. Assuming that these
instances of item non-response were missing at random given covariates (MAR), we chose to
use random hot deck imputation by row in order to fix missing data values. For each row with
one or more missing values, a random row was selected from a subset of observations of the
same age and sex. Missing values were substituted from the random complete observation.
Within-group variability of the subset data is expected to be far lower than overall vari-
ability in the data. Each observation of a variable for a given respondent is highly correlated
with observations of other variables for that given respondent. For example, it would be
highly unlikely that a BMI-defined obese person would categorize himself as underweight.
Therefore, for any row that contained a missing data value, we chose a substitute row at
random in which the provided observations of variables matched with those of the substitute
row, and replaced the missing data values.
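A minimal base R sketch of random hot deck imputation by row, assuming donors are complete rows that share the recipient's age and sex, is given below. The toy data, column names, and seed are assumptions for illustration, and this is not the appendix implementation, which pairs rows through a buddies table.

```r
set.seed(152)  # arbitrary seed so the sketch is reproducible

# Toy data (assumption): three rows with one missing item each.
toy <- data.frame(
  age     = c(12, 12, 12, 13, 13, 13),
  sex     = c("F", "F", "F", "M", "M", "M"),
  bmi_cat = c("normal", NA, "obese", "normal", "normal", NA),
  enjoy   = c("agree", "agree", NA, "disagree", "agree", "agree"),
  stringsAsFactors = FALSE
)

complete <- complete.cases(toy)
imputed <- toy
for (r in which(!complete)) {
  # Donor pool: complete rows with the same age and sex as row r.
  donors <- which(complete & toy$age == toy$age[r] & toy$sex == toy$sex[r])
  if (length(donors) == 0) next  # no donor available; leave the row as is
  donor <- donors[sample.int(length(donors), 1)]
  # Copy only the items that are missing in row r from the donor row.
  gaps <- which(vapply(toy[r, ], is.na, logical(1)))
  imputed[r, gaps] <- toy[donor, gaps]
}
print(imputed)
```

Indexing with sample.int avoids the pitfall that sample(x, 1) draws from 1:x when x is a single number, which would pick the wrong row when only one donor exists.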
Usage of random hot deck imputation by row allowed us to uphold some variability in our
data while simultaneously preserving the correlations between observed variables. However,
we do make note of the fact that these imputed values are not real observations and could
potentially be biased, and that duplication of rows decreases the variance of our estimated
proportions.
Figure 3.1 above displays a plot of the frequency of unit non-response (regarding the
exam portion of the survey) by age and gender as well as a plot of the frequency of item non-
response by age and gender. From the plot on the left of unit non-response, we note that unit
non-response appeared to be random across age and gender, as there are no apparent trends
or patterns from the plot. This remains consistent with our decision to throw out these data
points due to our assumption of unit non-response instances being missing completely at
random (MCAR). However, from the plot of item non-response on the right, we note that
frequency of item non-response among boys was relatively similar across ages, whereas frequency of item
non-response among girls was particularly high for 14-year-olds. This could be an indication
that 14-year-old girls are more self-conscious and intentionally left certain questions blank,
but again, for purposes of this study, we assume that all item non-response is
missing at random given covariates (MAR). We make particular note of this as we move on
to the analysis of our data, as it is possible that through random hot deck imputation by
row, we have introduced potential biases into the data or masked significant trends.
With our merged, cleaned, and imputed data set of 713 observations across 7 variables,
we set up our survey design using the survey package in R using the following line of code:
The PSU variable tells us which pseudo-PSU the individual belongs to. The original survey split
the states into five different groups based upon how healthy the state is, with group 1 being
the most healthy and group 5 the least healthy. However, to maintain anonymity of the
respondents, NHANES only distributes masked variance units in the form of pseudo-PSUs,
which are labeled as PSU 1 and PSU 2. The CDC states that the use of MVUs closely
approximates the variances that would have been estimated using the true design variables [7].
Simplified pseudo-strata are also provided in this data, and therefore we are able
to analyze this complex survey as if it were two-stage. ID identifies each unique
individual. Stratum tells us which stratum the individual belongs to. Finally, for the weights
we used exam weights. The CDC instructs that the examination weights should be used exclusively
for analyses of data from the examination, or in conjunction with the interview data
[8]. Our analysis features use of both examination and household interview data. In our
svydesign(), we did not include a finite population correction because we do not have any
information regarding the population sizes.
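The exact svydesign() call is not reproduced in this copy of the report. A minimal sketch consistent with the description above, assuming columns named psu, stratum, and exam_wt and using made-up data, might look like the following. The object name nhanes_design matches the appendix; the argument choices (PSU as the only cluster identifier, nest = TRUE) are assumptions.

```r
library(survey)

# Toy stand-in for the cleaned data (assumption): 2 pseudo-PSUs in each of 2 pseudo-strata.
toy <- data.frame(
  psu     = c(1, 1, 2, 2, 1, 1, 2, 2),
  stratum = c(104, 104, 104, 104, 105, 105, 105, 105),
  exam_wt = c(12000, 15000, 9000, 30000, 20000, 11000, 18000, 25000),
  pe_yn   = factor(c("yes", "yes", "no", "yes", "no", "yes", "yes", "no"))
)

# PSU labels repeat across strata, so nest = TRUE; no fpc argument is supplied.
nhanes_design <- svydesign(ids = ~psu, strata = ~stratum,
                           weights = ~exam_wt, nest = TRUE, data = toy)

# Weighted proportion of adolescents with and without PE.
pe_share <- svymean(~pe_yn, nhanes_design)
print(pe_share)
```

Omitting fpc makes the variance estimates treat PSUs as sampled with replacement, which matches the statement that no population sizes were available.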
Results
We confirm these initial observations with a two-sample F-test for difference in proportions
between the two groups, which yielded a p-value of 0.2827 > 0.05. Therefore, at the
5% significance level, we fail to reject the null hypothesis that the distribution of weight
categories is the same for adolescents who participate in a PE class and those who do not. We
have insufficient evidence to conclude that there is an association between participation in a PE class
and likelihood of being of healthy weight.
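In the survey package, the appendix call svychisq(..., statistic = "F") requests the Rao-Scott second-order corrected chi-squared test, which is referred to an F distribution. The sketch below shows the shape of such a call on made-up data; the data frame, the variable healthy, and the design object toy_design are assumptions, and the resulting p-value has no connection to the results reported here.

```r
library(survey)

# Toy data (assumption): 6 pseudo-strata x 2 pseudo-PSUs x 4 respondents.
toy <- data.frame(
  stratum = rep(1:6, each = 8),
  psu     = rep(rep(1:2, each = 4), times = 6),
  exam_wt = rep(c(10000, 15000, 20000, 12000, 30000), length.out = 48),
  pe_yn   = factor(rep(c("yes", "yes", "no", "yes"), times = 12)),
  healthy = factor(rep(c("yes", "no", "yes", "yes", "no", "yes"), times = 8))
)

toy_design <- svydesign(ids = ~psu, strata = ~stratum,
                        weights = ~exam_wt, nest = TRUE, data = toy)

# Rao-Scott second-order corrected test, referred to an F distribution.
test_out <- svychisq(~pe_yn + healthy, toy_design, statistic = "F")
p_value <- unname(test_out$p.value)
print(test_out)
```

A p-value above 0.05 from such a test means the weighted table gives insufficient evidence of an association at the 5% level.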
25% incorrectly categorizing themselves. For those who do participate in PE, the accuracy
level is actually slightly lower than for those who do not have PE, but the difference appears to
be very small.
Again, to confirm our initial observations, we conduct a two-sample F-test of proportions
and obtain a p-value of 0.4077 > 0.05. Therefore, at the 5% significance level, we fail to reject
the null hypothesis that weight awareness is consistent across participation/no participation
in a physical education class.
From Figure 4.6, we note that there are similar patterns of correct and incorrect
self-weight-categorizations across all frequencies of PE. Groups separated by varying frequencies
of PE class per week exhibited similar behavior across the board, with close to 75% of
adolescents categorizing themselves in the correct weight class (defined by BMI) and the
other 25% having incorrect perceptions about their own weight category. Groups with PE
class 4 or 5 times a week deviate slightly from the others, but these differences are
small.
To confirm this initial observation, we conduct a chi-squared test of independence and
obtain a p-value of 0.4465 > 0.05. Therefore, at the 5% significance level, we have insufficient
evidence to conclude that weight awareness varies across frequencies of physical education
class.
Question: Does the likelihood of an adolescent being in a particular weight class (as defined
by BMI) vary across enjoyment of physical education class?
H0 : Proportions in each weight class are consistent across enjoyment of physical education
class.
H1 : Proportions in each weight class vary across enjoyment of physical education class.
From Figure 4.7, we note that different BMI categories tended to behave in
different ways. For those in the normal, overweight, or obese BMI categories, responses to the
statement "I enjoy participating in PE or gym class" tended towards the more moderate options.
That is to say, the majority of their responses were "agree", "disagree", and "neither
agree nor disagree". Within each of these groups, the majority of adolescents agreed with the
statement of enjoying PE class. In contrast, those who had BMI categorized as
underweight tended to have different responses. Within the underweight group, there was a
much higher proportion of "unsure" responses and no negative responses at all.
Because the responses from the underweight group deviated so much from those of the other groups,
we wanted to more closely examine the relationship between the different enjoyment levels
of PE and one's BMI category. In Figure 4.8, we plotted the proportions of PE enjoyment
for each weight group. We had determined from previous analyses
that the normal BMI category was most prevalent among the adolescents sampled,
but from the barplot, we note additional information about the responses from each
weight category. In particular, we note that a higher proportion of those who responded
with "disagree" or "strongly disagree" were categorized as overweight
than as any other weight class. Additionally, we note that most of the respondents who answered
"unsure" were underweight. These observations led us to believe that there did exist some
correlation between BMI category and one's enjoyment of PE.
Upon completing a chi-squared test of independence and obtaining a p-value of 0.0000516,
we confirmed our initial observation that the proportion of adolescents in each weight category
appears to depend on enjoyment of PE. Therefore, at the 5% significance level,
we reject our null hypothesis that proportions in each weight category are consistent across
enjoyment of physical education class.
Figure 4.9 on the following page displays the table of proportions in each weight category
by enjoyment of PE.
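A table of this kind can be approximated in base R by forming weighted counts and then scaling within each enjoyment level. The sketch below uses made-up data and only reproduces point estimates; it does not replace the survey-based svytable output, which also carries design information for testing.

```r
# Toy data (assumption); exam_wt plays the role of the survey weight.
toy <- data.frame(
  enjoy   = c("agree", "agree", "disagree", "disagree", "agree", "disagree"),
  bmi     = c("normal", "over", "over", "normal", "normal", "over"),
  exam_wt = c(10000, 5000, 15000, 5000, 5000, 10000)
)

# Weighted counts, then proportions within each enjoyment level (rows sum to 1).
wt_tbl <- xtabs(exam_wt ~ enjoy + bmi, data = toy)
prop_by_enjoy <- prop.table(wt_tbl, margin = 1)
print(round(prop_by_enjoy, 3))
```

Using margin = 1 standardizes within rows, so each row reads as the weight-category distribution for one enjoyment level.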
The original intent of this study was to determine whether or not distributions of adoles-
cents in the four defined BMI weight categories (underweight, normal, overweight, obese)
and adolescent weight awareness (categorization of self into correct weight category based
on BMI) were significantly different across existence, frequency, or enjoyment of physical
education classes. Based on hypothesis tests using F-tests and chi-squared tests of independence,
we found that distribution of weight awareness did not differ significantly across existence,
frequency, or enjoyment of physical education classes. That is to say, the proportion
of adolescents who categorized themselves in the correct weight category (as defined by
their measured BMI) was not significantly associated with the existence, frequency, or
enjoyment of PE classes.
Additionally, it was determined through F-tests and chi-squared tests of independence that
the distribution of adolescents in the four defined BMI weight categories did not differ significantly
across existence or frequency of physical education classes. However, distribution
in weight categories did differ significantly across enjoyment of physical education classes.
Specifically, we noted in our analysis that overweight adolescents had the highest tendency
of disliking PE classes, and underweight adolescents had the highest tendency of unsure
opinions regarding their enjoyment of PE class.
Prior to completing this study, we hypothesized that there would undoubtedly be a
correlation between the existence/frequency of physical education classes and an adolescent's
physical health, but our data did not support this hypothesis. However, while we did not find a
significant correlation between our original variables, we did come away with
other interesting results on general misconceptions regarding this topic. First, we found
that there is a significant association between an adolescent's enjoyment of PE and that
adolescent's weight category (as defined by BMI). Second, contrary to the popular belief that
PE is only liked/dominated by athletic students (typically of normal weight), we found that
a large proportion of adolescents who are overweight or obese also seem to really enjoy PE.
It has always seemed logical to believe that if a child has PE more frequently, then they
have higher chances of being of healthy/normal weight because they will get more hours of
physical activity. However, we find that the mere existence of a class that promotes physical
activity and healthy lifestyles does not automatically translate to healthier students. On the
contrary, it is a student's enjoyment and happiness linked to physical activity and PE classes
that contributes more to that student's physical health (as measured by BMI).
The implications of these findings for the ongoing national debate over whether or not
physical education classes should be mandatory in schools are huge. The primary argument
for mandating physical education in schools is the logical assumption that such classes not
only foster healthy and active lifestyles in youth and adolescents, but also have the capability
of making a direct impact on the physical and mental health of students. Using BMI weight
categories as our measure of physical health in adolescents, our results indicate that neither
the existence nor frequency of physical education classes in schools is directly associated
with the physical health of students. Specifically, the distribution of adolescents in each
weight category did not vary across existence or frequency of physical education classes.
Surprisingly, what did have an impact on the physical health of students as defined by their
BMI weight category was the enjoyment of said physical education class. Therefore, we
conclude that the perceived improvement of physical health in adolescents participating in
physical education classes is not directly linked to those classes, or even to the frequency of those
classes, but rather to an intrinsic motivation to live an active and healthy lifestyle.
the 2012 London Olympics have BMI measurements placing them in the overweight or obese
weight categories [1]. However, these athletes cannot be considered as unhealthy individuals.
FiveThirtyEight's article "BMI Is A Terrible Measure Of Health" states that while there is
a positive correlation between weight and fat composition of one's body, it is important to
remember that weight also comprises bone mass, muscle mass, fluids in the body, etc.
[6]. Hence, the implication of using BMI as a measure of physical health is that it is difficult
to differentiate what proportion of weight comes from muscle mass versus what proportion
comes from fat. In the case of these athletes, muscle weight is what makes up most of their
body weight, but BMI incorrectly classifies this as excess fat.
Finally, we would again like to note the caveat of using imputation to deal with missing
data values. By imputing our data using random hot-deck imputation by row, we made the
assumption that people of the same gender, age, and BMI had similar experiences and feelings
towards physical education classes. However, that may not always be the case. Usage of
imputation could possibly have introduced bias into our dataset or masked important trends
or patterns we did not catch during initial exploratory data analysis. For example, as noted
previously, we noticed that item non-response tended to be higher in females of age 14. We
decided to impute their responses, but there could actually have been some reason that this
gender and age group decided not to answer some of the questions (e.g. higher levels of
self-consciousness in 14-year-old girls). The implications of using imputation by row are
decreased variance and likely increased bias of estimates.
Choice and personal desires drive people in different directions and can also have enormous
impact on physical health, as demonstrated by the link between enjoyment of PE and
distribution of weight categories. The discussion of whether PE should be mandatory in
schools merely touches the tip of the iceberg. Beyond the scope of PE classes, the discussion
can be expanded to participation in school sports, which allows students the opportunity to
choose what type of physical activity they would like to partake in. With our new findings,
balance between what adolescents want and what is considered best for them is probably
the trickiest thing about the current debate.
Appendix
# Loading the Data
library(Hmisc)
library(plyr)
library(dplyr)
    opinion_wt = factor(opinion_wt, levels = c(1, 2, 3, 7, 9),
                        labels = c("overweight", "underweight", "normal",
                                   "refused", "unsure")),
    action_wt = factor(action_wt, levels = c(1, 2, 3, 4, 7, 9),
                       labels = c("lose", "gain", "maintain",
                                  "nothing", "refused", "unsure")),
    pe_yn = factor(pe_yn, levels = c(1, 2, 7, 9),
                   labels = c("yes", "no", "refused", "unsure")),
    enjoy_pe = factor(enjoy_pe, levels = c(1, 2, 3, 4, 5, 7, 9),
                      labels = c("strongly agree", "agree",
                                 "neither agree nor disagree", "disagree",
                                 "strongly disagree", "refused", "unsure"))) %>%
  mutate(freq_pe = ifelse(pe_yn == "no", 0, freq_pe))
##########
##### Hot Deck Imputation #####
# rows with NAs
nas <- which(is.na(data4), arr.ind = TRUE)
need_buddy <- unique(nas[, 1])
# copy each missing value from the donor (buddy) row paired with the incomplete row
for (j in 1:nrow(buddies)) {
  for (i in seq_along(data4[buddies$need_buddy[j], ])) {
    if (is.na(data4[buddies$need_buddy[j], i])) {
      data4[buddies$need_buddy[j], i] <- data4[buddies$buddy[j], i]
    }
  }
}
##########
##### EDA and Survey Design Code #####
(nrow(data))
head(data, 5)
summary(data)
library(ggplot2)
er <- table(data$bmi, data$enjoy_pe)
mosaicplot(er, las = 1, xlab = "BMI", ylab = "Enjoy PE",
           main = "BMI and PE Enjoyment")
library(plyr)
data$bmi <- revalue(data$bmi, c("overweight" = "over",
                                "underweight" = "under"))
hist(data$exam_wt ~ data$bmi)
axis(1, at = c(0.7, 1.9, 3.1, 4.3, 5.5, 6.6), labels = x.labs,
     cex.axis = 0.5)
ugh3 <- c(
  sum(data$exam_wt[which(data$freq_pe == 0)]),
  sum(data$exam_wt[which(data$freq_pe == 1)]),
  sum(data$exam_wt[which(data$freq_pe == 2)]),
  sum(data$exam_wt[which(data$freq_pe == 3)]),
  sum(data$exam_wt[which(data$freq_pe == 4)]),
  sum(data$exam_wt[which(data$freq_pe == 5)])
)
##########
##### Analysis Code #####
# Barplot
bmi_prop_df = as.data.frame(bmi_prop_tbl)
prop_pe_yn = tapply(bmi_prop_df$Freq, bmi_prop_df$pe_yn, sum)
bmi_prop_df$prop_pe_yn = prop_pe_yn[bmi_prop_df$pe_yn]
plot_df = bmi_prop_df %>%
  mutate(std_prop = Freq / prop_pe_yn)
ggplot(plot_df, aes(pe_yn, std_prop)) +
  geom_bar(aes(fill = bmi), position = "dodge", stat = "identity") +
  labs(x = "Existence of PE",
       y = "Proportion in Each Weight Category",
       fill = "Weight Category",
       title = "Proportion of Adolescents in Each Weight Category by Existence of PE") +
  theme(plot.title = element_text(hjust = 0.5))
######################################
# Proportion of Correct Perceptions by Existence of PE
svymean(~interaction(pe_yn, correct_perception), design = nhanes_design)
bmi_tbl = svytable(~pe_yn + correct_perception, nhanes_design)
bmi_prop_tbl = round(prop.table(bmi_tbl), 5)
summary(bmi_tbl, statistic = "F")
svychisq(~pe_yn + correct_perception, nhanes_design, statistic = "F")
# Barplot
bmi_prop_df = as.data.frame(bmi_prop_tbl)
prop_pe_yn = tapply(bmi_prop_df$Freq, bmi_prop_df$pe_yn, sum)
bmi_prop_df$prop_pe_yn = prop_pe_yn[bmi_prop_df$pe_yn]
plot_df = bmi_prop_df %>%
  mutate(std_prop = Freq/prop_pe_yn)
ggplot(plot_df, aes(pe_yn, std_prop)) +
  geom_bar(aes(fill = correct_perception), position = "dodge",
           stat = "identity") +
  labs(x = "Existence of PE",
       y = "Proportion",
       fill = "Correct Perception",
       title = "Proportion of Correct Perceptions by Existence of PE") +
  theme(plot.title = element_text(hjust = 0.5))
######################################
# Proportion in Each Weight Category by Frequency of PE
svymean(~interaction(freq_pe, bmi), design = nhanes_design)
bmi_tbl = svytable(~freq_pe + bmi, nhanes_design)
bmi_prop_tbl = round(prop.table(bmi_tbl), 5)
summary(bmi_tbl, statistic = "F")
svychisq(~freq_pe + bmi, nhanes_design, statistic = "F")
# Barplot
bmi_prop_df = as.data.frame(bmi_prop_tbl)
prop_freq_pe = tapply(bmi_prop_df$Freq, bmi_prop_df$freq_pe, sum)
bmi_prop_df$prop_freq_pe = prop_freq_pe[bmi_prop_df$freq_pe]
plot_df = bmi_prop_df %>%
  mutate(std_prop = Freq/prop_freq_pe)
ggplot(plot_df, aes(freq_pe, std_prop)) +
  geom_bar(aes(fill = bmi), position = "dodge", stat = "identity") +
  labs(x = "Frequency of PE",
       y = "Proportion in Each Weight Category",
       fill = "Weight Category",
       title = "Proportion of Adolescents in Each Weight Category by Frequency of PE") +
  theme(plot.title = element_text(hjust = 0.5))
######################################
# Barplot
perception_prop_df = as.data.frame(perception_prop_tbl)
prop_freq_pe = tapply(perception_prop_df$Freq,
                      perception_prop_df$freq_pe, sum)
perception_prop_df$prop_freq_pe = prop_freq_pe[perception_prop_df$freq_pe]
plot_df = perception_prop_df %>%
  mutate(std_prop = Freq/prop_freq_pe)
ggplot(plot_df, aes(freq_pe, std_prop)) +
  geom_bar(aes(fill = correct_perception),
           position = "dodge", stat = "identity") +
  labs(x = "Frequency of PE",
       y = "Proportion",
       fill = "Correct Perception",
       title = "Proportion of Adolescents Whose Weight Perception Matched BMI Category\nby Frequency of PE") +
  theme(plot.title = element_text(hjust = 0.5))
######################################
# Proportion in Each Weight Category by Enjoyment of PE
svymean(~interaction(enjoy_pe, bmi), design = nhanes_design)
bmi_tbl = svytable(~enjoy_pe + bmi, nhanes_design)
bmi_prop_tbl = round(prop.table(bmi_tbl), 5)
summary(bmi_tbl, statistic = "F")
svychisq(~enjoy_pe + bmi, nhanes_design, statistic = "F")
# Barplot
bmi_prop_df = as.data.frame(bmi_prop_tbl)
prop_enjoy_pe = tapply(bmi_prop_df$Freq, bmi_prop_df$enjoy_pe, sum)
bmi_prop_df$prop_enjoy_pe = prop_enjoy_pe[bmi_prop_df$enjoy_pe]
plot_df = bmi_prop_df %>%
  mutate(std_prop = Freq/prop_enjoy_pe)
ggplot(plot_df, aes(enjoy_pe, std_prop)) +
  geom_bar(aes(fill = bmi), position = "dodge", stat = "identity") +
  labs(x = "Enjoyment of PE",
       y = "Proportion in Each Weight Category",
       fill = "Weight Category",
       title = "Proportion of Adolescents in Each Weight Category by Enjoyment of PE") +
  theme(plot.title = element_text(hjust = 0.5))
######################################
# Proportion Who Enjoy PE by Weight Class
svymean(~interaction(enjoy_pe, bmi), design = nhanes_design)
bmi_tbl = svytable(~enjoy_pe + bmi, nhanes_design)
bmi_prop_tbl = round(prop.table(bmi_tbl), 5)
summary(bmi_tbl, statistic = "F")
svychisq(~enjoy_pe + bmi, nhanes_design, statistic = "F")
# Barplot
bmi_prop_df = as.data.frame(bmi_prop_tbl) %>%
  mutate(enjoy_pe = factor(enjoy_pe, levels = c("strongly agree",
                                                "agree",
                                                "neither agree nor disagree",
                                                "disagree",
                                                "strongly disagree",
                                                "unsure")))
prop_weight = tapply(bmi_prop_df$Freq, bmi_prop_df$bmi, sum)
bmi_prop_df$prop_weight = prop_weight[bmi_prop_df$bmi]
new_prop = bmi_prop_df$Freq/bmi_prop_df$prop_weight
plot_df = bmi_prop_df %>%
  mutate(std_prop = new_prop)
######################################
# Perceived and Actual Weight Category by Enjoyment of PE
svymean(~interaction(enjoy_pe, bmi, opinion_wt), design = nhanes_design)
perception_tbl = svytable(~enjoy_pe + correct_perception,
                          nhanes_design)
perception_prop_tbl = round(prop.table(perception_tbl), 5)
summary(perception_tbl, statistic = "F")
svychisq(~enjoy_pe + correct_perception, nhanes_design, statistic = "F")
# Barplot
perception_prop_df = as.data.frame(perception_prop_tbl)
prop_enjoy_pe = tapply(perception_prop_df$Freq,
                       perception_prop_df$enjoy_pe, sum)
perception_prop_df$prop_enjoy_pe =
  prop_enjoy_pe[perception_prop_df$enjoy_pe]
plot_df = perception_prop_df %>%
  mutate(std_prop = Freq/prop_enjoy_pe)
ggplot(plot_df, aes(enjoy_pe, std_prop)) +
  geom_bar(aes(fill = correct_perception), position = "dodge",
           stat = "identity") +
  labs(x = "Enjoyment of PE",
       y = "Proportion",
       fill = "Correct Perception",
       title = "Proportion of Adolescents Whose Weight Perception Matched BMI Category\nby Enjoyment of PE") +
  theme(plot.title = element_text(hjust = 0.5))
##########
References
[1] Sally Adee. Overweight Olympians: Guess the BMI of top athletes. May 2014. url:
https://www.newscientist.com/gallery/obese-olympians/.
[2] CDC. NHANES Data Documentation 2013-2014, Demographic Data. Oct. 2015. url:
https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.htm#SDMVPSU.
[3] Cable News Network (CNN). Obesity in the U.S. Fast Facts. July 2016. url:
http://www.cnn.com/2013/09/02/health/obesity-in-u-s-fast-facts/.
[4] Centers for Disease Control and Prevention (CDC). About Adult BMI. May 2015. url:
https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/.
[5] Centers for Disease Control and Prevention (CDC). Childhood Obesity Causes & Consequences.
Dec. 2016. url: https://www.cdc.gov/obesity/childhood/causes.html.
[6] Katherine Hobson. BMI Is A Terrible Measure Of Health. Feb. 2016. url:
https://fivethirtyeight.com/features/bmi-is-a-terrible-measure-of-health/.
[7] Clifford L. Johnson. National Health and Nutrition Examination Survey: Sample Design,
2011-2014. Mar. 2014. url: https://wwwn.cdc.gov/nchs/data/series/sr02_162.pdf.
[8] Lisa B. Mirel. National Health and Nutrition Examination Survey: Estimation Procedures,
2007-2010. Aug. 2013. url: https://wwwn.cdc.gov/nchs/data/series/sr02_159.pdf.
[9] The State of Obesity. Obesity Rates & Trends Overview. Sept. 2016. url:
http://stateofobesity.org/obesity-rates-trends-overview/.
[10] World Health Organization (WHO). Obesity and Overweight: Fact Sheet. June 2016.
url: http://www.who.int/mediacentre/factsheets/fs311/en/.