STUDENT DEBT
Nina Satasiya and Priya Chandrashekar
A deeper look into the differences in student debt between 2004 and 2014.
Introduction
Brief Overview:
Attending college and obtaining a degree is an important investment. For many, the
burden of this investment lies on the parents who support their children's educations; however,
for some, the burden of obtaining higher education lies with the student. With this project, we
would like to analyze our data to draw conclusions and make predictions about future student
debt in the United States. Furthermore, through this project, we hope not only to analyze the history
of this issue, but also to begin brainstorming how to help future students obtain higher education
without this large financial strain.
The data and calculations in our dataset are from The Institute for
College Access & Success (TICAS).1 They are based on data collected from the U.S. Department of
Education's National Center for Education Statistics, the Integrated Postsecondary Education Data
System (IPEDS), and Peterson's Undergraduate Financial Aid and Undergraduate Databases
(copyright 2015 Peterson's, a Nelnet company, all rights reserved).
To thoroughly understand what the data looks like, one must actually visualize the data.
The entire dataset contains data for 2004 and 2014; to make it easier to carry out tests
and identify data points, we have separated the dataset into two smaller subsections: one data
file for 2004 and another for 2014. The first 10 observations in all three data files (the full
dataset and the two subsets) can be seen in Appendix A. Additionally, to give a clear picture of
the data's normality and overall distribution, we have included histograms and Normal Plots in Appendix B.
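As an illustration of how such diagnostics could be produced, the sketch below draws a histogram and a normal Q-Q plot for one variable. The file and column names (student_debt_2014.csv, Avg_Debt) are assumptions made for the example, not the dataset's actual headers.

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical file and column names for the 2014 subset.
debt_2014 = pd.read_csv("student_debt_2014.csv")
avg_debt = debt_2014["Avg_Debt"]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(avg_debt, bins=12)                       # overall shape of the distribution
ax1.set_title("Average debt, class of 2014")
stats.probplot(avg_debt, dist="norm", plot=ax2)   # normal Q-Q plot: roughly straight means roughly normal
plt.show()
```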
Terminology:
Overall, our dataset contains a great deal of information, including:
State: The specific state of the United States the data reflects.
Percent Change: The percentage of change of average debt from 2004 to 2014.
Robustness: The robustness in 2004 and 2014 describes the validity of the data in each
state. This variable is categorical with three levels: Strong, Medium, and Weak.
o Where the share of graduates represented was at least two-thirds in both years,
the change over time was categorized as Strong; where this share was at least half
in both years but less than two-thirds in at least one of the two years, it was
categorized as Medium; and for the remaining states it was categorized as Weak.
Average Debt 2004/Average Debt 2014: The average debt of those with loans from the
class of 2004/2014, respectively.
represented of the US population in student debt studies is 83%. Therefore, 0.83 is used in our
null hypothesis. After the Great Recession, many students and families may not have been able to
afford postsecondary education, and so many students may have declined to submit their financial
information for personal reasons (embarrassment, etc.).
Thus, we ultimately chose to perform a one-tailed t-test, as the proportion of students
represented in 2014 could possibly be less than the listed average. The hypotheses are as follows:
H0: μ2014 = 0.83 versus Ha: μ2014 < 0.83
This results in a t-distribution with 47 degrees of freedom. The t-statistic is -1.0977, with a
corresponding p-value of 0.139. Based on this, there is insufficient statistical evidence at the 0.05
significance level to reject the null hypothesis. We fail to reject the claim that the true proportion
of student represented in 2014 is equal to 0.83
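A minimal sketch of how this one-sided, one-sample t-test could be run in Python is shown below; the file and column names are assumptions, and the dataset's 48 state-level values supply the 47 degrees of freedom.

```python
import pandas as pd
from scipy import stats

# Hypothetical file and column names; the 2014 subset holds one percent-represented
# value per state (48 observations, hence 47 degrees of freedom).
debt_2014 = pd.read_csv("student_debt_2014.csv")

# One-sided, one-sample t-test of H0: mu = 0.83 against Ha: mu < 0.83.
t_stat, p_value = stats.ttest_1samp(debt_2014["Percent_Represented"], popmean=0.83,
                                    alternative="less")
print(t_stat, p_value)   # the report gives t = -1.0977 and p = 0.139
```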
2-Sample Dependent Test
Non-applicable
Our dataset contains a great deal of information pertaining to student debt (e.g.,
percent in debt, average debt, percent represented) for the years 2004 and 2014. Although
the same type of data is reflected in both respective years, the data itself was not collected from
the same subjects for both years.
Dependent statistical tests typically require that the same subjects are tested more than
once. Thus, "related groups" indicate that the same subjects are present in both groups. This is
because each subject has been measured on two occasions for the same dependent variable. So
essentially, data is collected from the same subjects/individuals over different time
periods/conditions. This is not the case for our dataset at hand.
Data for our dataset was collected once in 2004 and again in 2014; this information came
from recent college graduates from their respective graduating years. Hence, it is highly
unlikely/nearly impossible that the same subjects are present in both groups. This indicates that
the data is not dependent, and so, we are unable to perform any two sample dependent tests.
2-Sample Independent Tests
2-Sample F-Test
For this test, we are comparing the variances between average debt in 2004 and in 2014.
Again, the distributions of these variables are independent and approximately normal, so the F-test may be used (Appendix B). Analyzing the variances of these variables would help further
evaluate the spread of the data. In turn, this would reveal whether average debt throughout the
United States varies more drastically in one year compared to the other. We believe that after the
Great Recession, U.S. families would have a greater range of financial conditions. And so, our
alternative hypothesis is testing whether the average debt in 2014 had a greater variance than the
average debt in 2004. The two hypotheses are as follows:
H0: σ²2004 = σ²2014 versus Ha: σ²2004 < σ²2014
This results in an F-distribution with 47 and 47 degrees of freedom. The F-statistic is 1.8609, with a
corresponding p-value of 0.01782. Based on this, there is sufficient statistical evidence at the
0.05 significance level to reject the null hypothesis. We can reject the claim that the population
variances of 2004 and 2014 are equal to one another.
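Below is a rough sketch of how this variance comparison could be computed directly; the column name Avg_Debt and the placement of 2014 in the numerator (matching the alternative that 2014 has the larger variance) are assumptions made for illustration.

```python
import pandas as pd
from scipy import stats

# Hypothetical file and column names for the per-year subsets.
debt_2004 = pd.read_csv("student_debt_2004.csv")
debt_2014 = pd.read_csv("student_debt_2014.csv")

# F-ratio of the two sample variances; with 48 states per year, each variance has 47 d.f.
# 2014 is placed in the numerator to match the alternative that its variance is larger.
f_stat = debt_2014["Avg_Debt"].var(ddof=1) / debt_2004["Avg_Debt"].var(ddof=1)
p_value = stats.f.sf(f_stat, dfn=47, dfd=47)   # one-sided upper-tail p-value
print(f_stat, p_value)                          # the report gives F = 1.8609, p = 0.01782
```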
2-Sample T-Test
In this test, we are analyzing the average percentage of students in debt in 2004 versus
the average percentage of students in debt in 2014. Our population standard deviations are
unknown, and so, we chose to perform a 2-sample t test. Our distributions for percent debt are
approximately normal (Appendix B), and so the assumptions of this test are fulfilled. The 2-sample t-test will compare the total average percent debt of 2004 and the total average percent
debt of 2014.
This will analyze whether the average percent debt has grown from 2004 to 2014. If it
has, it will confirm our intuition that student debt has increased, as a larger proportion of students
are in debt as time moved on. Thus, our alternative hypothesis is testing whether the total
average percent debt is greater in 2014 than in 2004. Note that μ2004 indicates the average percent
debt in 2004 and μ2014 reflects the average percent debt in 2014. The hypotheses are as follows:
H0: μ2004 = μ2014 versus Ha: μ2004 < μ2014
This results in a t-distribution. According to statistical software, there are 86.797 degrees of
freedom; we may also consider there to be 94 degrees of freedom (n1 + n2 - 2). The t-statistic is
-2.3037, with a corresponding p-value of 0.01181. Based on this, there is sufficient statistical
evidence at the 0.05 significance level to reject the null hypothesis. We can reject the claim that
the average percent debt of students in 2004 is equal to the average percent debt of students in
2014.
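A sketch of how Welch's two-sample t-test could be run is given below; the file and column names are assumptions, and equal_var=False reproduces the fractional degrees of freedom (about 86.8) reported by the software.

```python
import pandas as pd
from scipy import stats

# Hypothetical file and column names: percent of graduates with debt, by state and year.
debt_2004 = pd.read_csv("student_debt_2004.csv")
debt_2014 = pd.read_csv("student_debt_2014.csv")

# Welch's two-sample t-test (unequal variances), one-sided: Ha is mean(2004) < mean(2014).
t_stat, p_value = stats.ttest_ind(debt_2004["Percent_Debt"], debt_2014["Percent_Debt"],
                                  equal_var=False, alternative="less")
print(t_stat, p_value)   # the report gives t = -2.3037 on about 86.8 d.f., p = 0.01181
```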
Categorical Tests
Chi Square
Here we are comparing robustness with percent represented in our dataset using the chi-square
test. Our variable robustness is categorical (i.e., we have counts), and we have transformed
our percent represented variable into a factor (so it is categorical as well). Thus, the chi-square
test may be used. In our 2x3 contingency table, the columns are the different levels of
robustness (Strong, Medium, Weak) while the rows are two ranges (0-70%, 70-100%) of the
percent represented data (Appendix A). This test will analyze whether the counts at
each level of robustness vary across the ranges of percentage of students represented.
Furthermore, this test can be performed because both variables adhere to the assumptions of
normality, independence, and the categorical nature of the variables (Appendix B).
The analysis will show whether there is an association between the robustness of the data
and the percentage of students represented in each state. We hope to see an association between
these two variables, as it would add to the reliability and credibility of the entire dataset. Based on
this, our alternative hypothesis is that there is an association between robustness and percent
represented. The hypotheses are as follows:
H0: Robustness and Percent Represented are independent of one another
versus
Ha: Robustness and Percent Represented are associated with one another
This results in a chi-square distribution with 2 degrees of freedom. The test statistic, X-squared,
is 30.532, with a corresponding p-value of 2.344e-07. Based on this, there is sufficient statistical
evidence at the 0.05 significance level to reject the null hypothesis. We can reject the claim that
Robustness and Percent Represented do not have any relationship with one another.
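The sketch below shows how a chi-square test of independence could be run on such a table; the counts shown are placeholders for illustration only, since the real 2x3 table is the one in Appendix A.

```python
import numpy as np
from scipy import stats

# Placeholder counts for illustration only; the real 2x3 table (percent-represented range
# by robustness level) is the one shown in Appendix A.
table = np.array([
    [4, 3, 2],     # 0-70% represented:   Strong, Medium, Weak
    [21, 10, 8],   # 70-100% represented: Strong, Medium, Weak
])

chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(chi2, p_value, dof)   # the report gives X-squared = 30.532, p = 2.344e-07, df = 2
```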
1-Sample Proportion Z-Test
For this test, we are testing the strong category of robustness. Since the
proportions/counts of all levels of robustness are the same for 2004 and 2014, we only need to do
this test once and it will apply to both years. The results of this Z-test for proportions will show
whether the majority of the data has a strong level of robustness. This will be helpful as we
would like to see our data have a strong level of robustness to indicate credibility and reliability
of results (and of the dataset in general). So, we will be comparing our proportion of strong to the
null hypothesis of p=0.5 to see if more than half of our data has a strong robustness.
Furthermore, since the data is an SRS and independent, and since 48(0.5) ≥ 10 and 48(1 - 0.5) ≥ 10
hold true, the assumptions for this test are fulfilled and the test can be carried out in good faith.
The hypotheses are as follows:
H0: ps = 0.5 versus Ha: ps > 0.5
This results in a normal distribution. By software, X-squared is 0.0833, with a corresponding p-value of 0.3864 and 1 degree of freedom. By hand, the z-statistic is 0.2886751, with a corresponding
p-value of 0.386415. Based on this, there is insufficient statistical evidence at the 0.05
significance level to reject the null hypothesis. We fail to reject the claim that the population
proportion of strong robustness is equal to 0.50.
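A sketch of the by-hand calculation is shown below. The count of Strong states is not restated in this section, so the value 25 (out of 48) is used only because it reproduces the reported z-statistic of roughly 0.289; the actual count comes from the dataset.

```python
import math
from scipy import stats

n = 48        # number of states in the dataset
p0 = 0.5      # null-hypothesis proportion

# The number of Strong states comes from the dataset; 25 is used here only because it
# reproduces the reported z-statistic of about 0.289.
p_hat = 25 / n

z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
p_value = stats.norm.sf(z)    # one-sided (upper-tail) p-value
print(z, p_value)             # roughly 0.289 and 0.386, matching the reported values
```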
robustness level in the data set. We assigned a dummy variable to each of the other groups and
organized this in a 3x2 contrast matrix (Appendix A).
Assumptions
Prior to using a regression model, several assumptions need to be addressed. First, the
correct variables need to be present. In this case, they are: all explanatory variables are
quantitative or categorical with at least two categories, and the response variable is
quantitative. Furthermore, the residuals must be independent and normally distributed, the
relationship must be linear, and the variance must be constant. The graphics displaying and
confirming these assumptions are shown in Appendix C.
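For reference, diagnostics like those in Appendix C could be generated as in the sketch below; the file and column names are assumptions, and any fitted model's residuals can be checked the same way.

```python
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from scipy import stats

# Hypothetical file and column names; any fitted model's residuals can be checked this way.
train = pd.read_csv("student_debt_train.csv")
model = smf.ols("Percent_Change ~ Avg_Debt + Percent_Represented", data=train).fit()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(model.fittedvalues, model.resid)        # no pattern suggests linearity; even spread, constant variance
ax1.axhline(0, linestyle="--")
ax1.set_xlabel("Fitted values")
ax1.set_ylabel("Residuals")
stats.probplot(model.resid, dist="norm", plot=ax2)  # Q-Q plot checks residual normality
plt.show()
```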
Maximal Population Model/Variables & Terminology
Ultimately, we hope that our model's explanatory variables and interaction terms (both
continuous and dummy) can be used to predict the percentage of change in student debt. Our
maximal model is as follows:
Ŷ = β0 + β1Xa + β2Xp + β3Xw + β4Xm + β5(XpXa) + β6(XaXw) + β7(XaXm) + β8(XpXw) + β9(XpXm) + β10(XwXm) + β11(XaXpXw) + β12(XaXwXm) + β13(X…
Where:
β0 refers to the intercept of the regression line; it is the estimated mean response value
when all the explanatory variables have a value of 0
The variable Average Debt will reflect on how average debt looked holistically in each year.
We decided to include this variable in the model as we thought it would surely be a variable that
would have a strong, direct relationship with the percentage of change in student debt. Average
debt usually fluctuates greatly over time. Hence, it would correlate directly with our response
variable. Thus, we believed that this variable will be beneficial to include in the model.
We hope that Percent Represented will give us insight into how reliable the data is. This
variable is clearly important in the holistic analysis of the dataset. However, we are skeptical about
the extent of its influence in the model and on the response variable. By including this variable, we
will see if it truly benefits the model. If not, we would like to see it removed.
The variable Robustness is categorical. It focuses on the reliability of the data by
determining if the colleges that reported data/information were consistent in 2004 and 2014. This
variable gives useful information in regards to the credibility and reliability of the dataset itself.
So, we assumed it would be very beneficial to include in our maximal model; we hope to see it
ultimately have significance.
Furthermore, Percent Change is the response variable. It allows us to see the overall trend
in the percentage of change in student debt between 2004 and 2014. This statistic is a good
summary of the overall trend we are attempting to analyze, and so, we will be using it as our
response variable.
Model Testing - Brief Overview
We began by fitting the maximal model. From there on, the process was ultimately a
repetition of cases where we attempted to simplify the model by removing non-significant
interaction terms/quadratic or nonlinear terms/explanatory variables. Each case consisted of
checking the ANOVA and t-tests for the model at hand. We first checked the ANOVA to see if the
overall model (all slopes collectively) had significance. Then, we sought out the t-test for slopes
(individual test for slopes) to test for coefficient significance. If we found a slope of a variable to
not be significant, then we had no need for it in the model. We would then proceed to remove the
variable (if need be), update the model, and repeat this procedure with the new, updated model
until we came across a significant/desired model. Hence, this process can be considered a
repetition of cases, where each case is a new, updated model.
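A sketch of one round of this process, using statsmodels with assumed file and column names (and only two-way interactions, for brevity), might look like the following.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file and column names for the training split.
train = pd.read_csv("student_debt_train.csv")

# A maximal-style model: main effects plus two-way interactions ("**2" expands them).
model = smf.ols(
    "Percent_Change ~ (Avg_Debt + Percent_Represented + C(Robustness)) ** 2",
    data=train,
).fit()

print(model.fvalue, model.f_pvalue)  # overall ANOVA F-test: are the slopes jointly significant?
print(model.summary())               # t-tests for individual slopes; drop the least significant term

# Refit without the dropped term and repeat the checks on the reduced model.
reduced = smf.ols("Percent_Change ~ Avg_Debt + C(Robustness)", data=train).fit()
print(reduced.summary())
```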
Model Testing - Data Discussion
We repeated this process eight times before arriving at the final, simplest, and most
significant model. When we arrived at the seventh model (our second-to-last model), we
checked the ANOVA for overall significance. The hypotheses were as follows: H0: the model is
not significant versus Ha: the model is significant. The F test statistic in this case was 1.851
on 2 and 73 d.f. The ANOVA returned a p-value of 0.1644, and so we failed to reject the null
hypothesis. Thus, there is insufficient statistical evidence at the 0.05 significance level to reject
the claim that the model (all slopes collectively) is not significant.
We then proceeded to take a closer look at the slopes (t-test for slopes). We had already
examined all 3-way and 2-way interactions. Now, we could focus on the single, explanatory
variables. The hypotheses for this test were as follows: H0: βrobustness = 0 versus Ha: βrobustness ≠ 0.
The t test statistic for Robustness was 0.616 with 71 d.f. The t-test for slopes
returned a p-value of 0.540, and so, we failed to reject the null hypothesis. Thus, there is
insufficient evidence at the 0.05 level to reject the claim that the regression coefficient equals 0.
Since this term was insignificant, we removed it, updated our model, and continued to our eighth
model at hand. It turned out that the eighth model ended up being the last.
Final Model
After the seventh model, all of our terms had been dropped. So, our final model was
simply Percent_Change = 0.56934. In other words, the model simply equals the intercept.
This model indicates that a unit change in any predictor variable results in no change in the
predicted value of the outcome, the percentage of change in student debt.
All of the coefficients proved to be insignificant, reiterating the fact that the overall
model is insignificant. We would have liked to obtain a significant model, or at least one that
retained some variables. After many attempts at making new models with new explanatory and
response variables, we had no luck in finding a significant model. None of the variables
in our dataset resulted in a significant model. There may have been multicollinearity between
some of the explanatory variables that could have led to these results. Additionally, our sample
size may have been too small in respect to the number of explanatory variables. Nevertheless, we
still attempted to fit a model to our dataset.
Accuracy
Our line of best fit for the final model is Ŷ = 0.56934, with R-squared and adjusted R-squared
values of 0. This indicates that 0% of the variation in the response variable can be
explained by the variation in the explanatory variables. Our value for SSE is 0.02052, which
indicates the residual variation/unexplained variation. After finding these statistics for our final
model, we attempted to fit it to the testing data set. Ultimately, we found that our fitted model
was not a good representation of the data. The plot in Appendix D portrays the differences in the
observed and predicted values from the training dataset. As seen, the differences were
substantial, reiterating the poor fit the training model provides.
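A sketch of how the fitted intercept-only model could be scored against the testing data is shown below; the file and column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file and column names for the training and testing splits.
train = pd.read_csv("student_debt_train.csv")
test = pd.read_csv("student_debt_test.csv")

# The final model kept only the intercept.
final_model = smf.ols("Percent_Change ~ 1", data=train).fit()

predicted = final_model.predict(test)            # every prediction equals the intercept
residuals = test["Percent_Change"] - predicted
sse = np.sum(residuals ** 2)                     # unexplained (residual) variation
sst = np.sum((test["Percent_Change"] - test["Percent_Change"].mean()) ** 2)
r_squared = 1 - sse / sst
print(sse, r_squared)
```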
Binary Logistic Regression:
Addressing Dummy Variables
Robustness has three levels: Strong, Medium, and Weak. Thus, two dummy variables were
made to represent the information contained in the single categorical variable. We inferred that it
would have been a good idea to make one of the extremes, strong or weak, the baseline group.
We ended up choosing strong to be the baseline group because it was the most prominent
robustness level in the data set. We assigned a dummy variable for each other group and
organized this in a contrast matrix (Appendix A).
We decided to make Years into a categorical variable. The two levels are 2004 and 2014.
Half of the data set is from 2004, and the other half is from 2014. We're interested in seeing
whether the year 2014 has greater student debt than 2004. Thus, we decided to make 2004 our baseline
and 2014 our dummy variable. We assigned a dummy variable for the other group and organized
this into a contrast matrix (Appendix A).
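A sketch of how these dummy variables could be constructed with pandas is shown below; the file and column names are assumptions for the combined 2004/2014 data.

```python
import pandas as pd

# Hypothetical file and column names for the combined 2004/2014 data.
df = pd.read_csv("student_debt_combined.csv")

# Strong is the baseline robustness level, so only Medium and Weak get indicator columns.
df["Robustness"] = pd.Categorical(df["Robustness"], categories=["Strong", "Medium", "Weak"])
dummies = pd.get_dummies(df["Robustness"], prefix="Rob", drop_first=True)  # Rob_Medium, Rob_Weak
df = pd.concat([df, dummies], axis=1)

# 2004 is the baseline year, so only 2014 gets an indicator column.
df["Year_2014"] = (df["Year"] == 2014).astype(int)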
Assumptions
Prior to using a regression model, several assumptions need to be addressed. First off,
there must be linearity. We need to assume that there is a linear relationship between any
continuous predictors and the logit of the outcome variable. We can assume that there is
independence of errors and the data is independently distributed. Furthermore, by looking at the
data, we can infer that there is indeed a linear relationship.
Maximal Population Model/Variables & Terminology
Ultimately, we hope that our model's explanatory variables and interaction terms (both
continuous and dummy) are used to predict the probability of Robustness (with the level
Strong being the baseline) occurring in the dataset given the known values of our explanatory
variables.
P(Y = Robustness) = 1 / (1 + e^-(b0 + b1X1 + b2X2 + b3X3 + b4X4 + b5X5))
Where:
b0 is a constant
o i.e., the estimated value of the model when all explanatory variables equal 0 (at baseline).
e^bi: for every unit increase in the respective Xi, the odds that the outcome occurs change by a factor of e^bi.
The explanatory variables (both continuous & dummy) used to predict the response are:
X1 is the average debt
X2 is the percentage of students in debt
X3 is the percentage of students represented
X4 is the percentage of change in student debt from 2004 to 2014
X5 is the dummy variable for 2014
The variable Average Debt will reflect on how average debt looked holistically in each
year. We decided to include this variable in the model as we thought it would provide more
information about the trend we are attempting to analyze. Average debt usually fluctuates greatly
over time. Hence, it would have an association with our response variable, Robustness. Thus, we
believed that this variable will be beneficial to include in the model.
The variable Percent Debt is the percentage of students in debt in each year. It is an
informative variable to have when looking at the dataset in its entirety. However, we are unsure
as to how much influence this variable will have on the model. If anything, we would like to see
it removed if it is of no benefit.
We hope that Percent Represented will give us more insight into how reliable the data is.
This variable is clearly important in holistically analyzing the dataset. Thus, we are interested in
the extent of its influence in the model and on the response variable. We believe this variable
and Robustness will have a close association. We would like to see if it truly is beneficial to
the model.
We believe Percent Change would be a valuable variable to include in our model. It
allows us to see the overall trend in the percentage of change in student debt between 2004 and
2014. This statistic is a good summary of the overall trend we are attempting to analyze, and so,
we hope it will be beneficial to keep in the model.
Years is a categorical variable. Half of the data set is from 2004, and the other half is
from 2014. We're interested in seeing whether the year 2014 has greater student debt than 2004. Thus,
we decided to make 2004 our baseline and 2014 our dummy variable.
Our response variable is Robustness, with the level Strong being the baseline. Robustness is
categorical. It focuses on the reliability of the data by determining if the colleges that reported
data/information were consistent in 2004 and 2014. This variable gives useful information in
regards to the credibility and reliability of the dataset itself. We chose this as our response
variable as we would like to see a majority of our data having a Strong level of Robustness.
Model Testing - Overview
We began by fitting the maximal model using the training dataset. After fitting the model,
there began a process of cases where we attempted to simplify the model by removing non-significant interaction terms/quadratic or nonlinear terms/explanatory variables. Each case
consisted of checking the likelihood ratio and Wald Test for the model at hand.
First off, we checked the likelihood ratio between our first model and the baseline/null
model. The likelihood ratio test checks for model significance. From this first point, we found
that our model is significantly different than the baseline model. Hence we continued on to
simplify and improve our model through the Z-test for Individual Components (Wald Test).
Furthermore, we would constantly conduct a likelihood ratio test after each new model to assess
its improvement. Ultimately, this process can be considered a repetition of cases, where each
case is a new, updated model.
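A sketch of this workflow with statsmodels is shown below. The file and column names are assumptions, and the response is recoded here, purely for illustration, as an indicator for a state not being rated Strong (Strong being the baseline).

```python
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Hypothetical file and column names for the training split.
train = pd.read_csv("student_debt_train.csv")
train["Year_2014"] = (train["Year"] == 2014).astype(int)
# Strong is the baseline level, so the response is recoded as "not Strong" for illustration.
train["Not_Strong"] = (train["Robustness"] != "Strong").astype(int)

full = smf.logit(
    "Not_Strong ~ Avg_Debt + Percent_Debt + Percent_Represented + Percent_Change + Year_2014",
    data=train,
).fit()
null = smf.logit("Not_Strong ~ 1", data=train).fit()

# Likelihood-ratio test of the fitted model against the intercept-only (null) model.
lr_stat = 2 * (full.llf - null.llf)
p_value = stats.chi2.sf(lr_stat, full.df_model - null.df_model)
print(p_value)

print(full.summary())   # Wald z-tests for the individual coefficients
```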
Model Testing - Data Discussion
We repeated this process three times before arriving at the final, simplest, and most
significant model. When we arrived at the second (second-to-last) model, we checked its
likelihood ratio to assess its significance. The hypotheses were as follows: H0: the model is not
significantly different from the previous model (model 1) versus Ha: the model is significantly
different from the previous model (model 1). The chi-square test statistic in this case was 0.41713
on 1 d.f. It had a corresponding p-value of 0.5183726, and so we failed to reject the null hypothesis.
Thus, there is insufficient statistical evidence at the 0.05 significance level to reject the claim that
the model is not significantly different from the previous model. Although this model was not
significantly different, it was simpler, and so we could use it.
We needed to look at the model some more to see if we could improve it. So, we
proceeded to take a closer look at the explanatory variables through the Wald test. If we found an
explanatory variable to be insignificant, we would have no need for it in the model. The
hypotheses for this test were as follows: H0: βPercent Change = 0 versus Ha: βPercent Change ≠ 0.
The z test statistic for Percent Change was 0.146 with a p-value of 0.883858. So, we failed to reject
the null hypothesis. Thus, there is insufficient evidence at the 0.05 level to reject the claim that
this regression coefficient equals 0. Since this term was insignificant, we removed it, updated our
model, and continued to our third model at hand. It turned out that the third model ended up
being the last.
Final Model
After the second model, the terms Year and Percent Change had been dropped. So, our final model was:
P(Y = Robustness) = 1 / (1 + e^-(b0 + b1X1 + b2X2 + b3X3))
Odds Ratios
We then found the odds ratios for the significant explanatory variables in our final
model. The odds ratio for percent debt is 8.818586e-07; because this value is below 1, it is not
likely that robustness increases as percent debt increases. The odds ratio for percent represented
is 1.916972e-09; like that of percent debt, this value is below 1, so it is not likely that robustness
increases as percent represented increases. The odds ratio for average debt is 1.000153e+00;
since this value is slightly above 1, it can be deduced that robustness is slightly more likely to
increase as average debt increases.
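For reference, these odds ratios could be obtained by exponentiating the coefficients of the retained model, as in the sketch below; the file and column names, and the "not Strong" recoding, are the same assumptions used in the earlier logistic-regression sketch.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file and column names; the retained predictors match the final model above.
train = pd.read_csv("student_debt_train.csv")
train["Not_Strong"] = (train["Robustness"] != "Strong").astype(int)

reduced = smf.logit("Not_Strong ~ Avg_Debt + Percent_Debt + Percent_Represented", data=train).fit()

odds_ratios = np.exp(reduced.params)   # exponentiated coefficients are the odds ratios
print(odds_ratios)                     # below 1: odds fall as the predictor rises; above 1: odds rise
```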
Appendix A
Introduction: First Ten Observations
Appendix B
Introduction & Hypothesis Testing: Normality Histograms & QQ Plots
Chi Square Matrix
Robustness: Dummy Variable Contrast Matrix
Appendix C
Modeling: Training and Testing Data Sets
Appendix D
Modeling MLR: Accuracy
Works Cited
1 "Project on Student Debt." State by State Data. The Institute for College Access and Success,
2015. Web. 18 Feb. 2016. http://ticas.org/posd/map-state-data-2015#
Student Debt and the Class of 2013. Nov. 2014: 22. Web.