
This semester project explores a wide array of variables and examines their impact, if any,
on the total income of individual Americans. The data used for the project were acquired from
IPUMS USA and downloaded into Excel. Due to the nature of the statistical question, there is no
single hypothesis for the project. Rather, there is a hypothesis for each variable on how it
relates to income. The variables were compiled and their significance to income was determined.
Income was chosen as the topic because it is multi-faceted. It naturally takes a wide range of
values, assuming the sample size is appropriate, and that range is desirable in the variable a
project is based on: there needs to be substantial variation in the data for it to be evident how
the test variables relate to and affect income. Income is also a good choice because a very large
number of variables could plausibly have an effect on it, which opens the door to testing many
variables against income. This is valuable for the project because it gives an in-depth look at
which variables are significantly related to income. Additionally, it can potentially identify a
category of variables that, after testing, have no significant relationship to income.
When the data are first obtained, filtering and organization are required. IPUMS USA
provides data with roughly 3 million observations. A file of that size would be too large to
handle, so a smaller sample of about 100 thousand observations was selected. Observations with
no income, or for which income is not available, were filtered out. This resulted in about 75
thousand testable observations. Additionally, dummy variables must be created for categorical
variables such as gender or race, which have no natural numeric scale. Race, for example, must be
separated into 9 different categories (White, Black, Chinese, et cetera), each coded 1 if the
person is that race and 0 if they are not. Each observation can be only one race, so when running
a regression, only the coefficient for that person's race applies to them.
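The filtering and dummy coding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names, and the helper functions `filter_usable` and `add_dummies` are made up, "no income" is taken to mean an income of zero, and a missing income is represented by `None`. The project itself did this work in Excel.

```python
# Toy records standing in for IPUMS rows; field names and values are illustrative.
records = [
    {"income": 42000, "race": "White"},
    {"income": None, "race": "Black"},     # income not available -> dropped
    {"income": 0, "race": "Chinese"},      # no income (assumed to mean zero) -> dropped
    {"income": 58000, "race": "Chinese"},
    {"income": 31000, "race": "Black"},
]


def filter_usable(rows):
    """Keep only rows with a recorded, nonzero income."""
    return [r for r in rows if r["income"] not in (None, 0)]


def add_dummies(rows, field, categories):
    """Add one 0/1 column per category, e.g. race_White, race_Black."""
    coded_rows = []
    for r in rows:
        row = dict(r)  # copy so the original record is left untouched
        for c in categories:
            row[f"{field}_{c}"] = 1 if r[field] == c else 0
        coded_rows.append(row)
    return coded_rows


usable = filter_usable(records)
coded = add_dummies(usable, "race", ["White", "Black", "Chinese"])
for row in coded:
    print(row)
```

Because each person has exactly one race, the dummy columns in every row sum to 1. In a regression with an intercept, one category is therefore normally left out as the reference group, and the remaining coefficients are read as differences from that group.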
The next step was to establish a baseline for what income looks like. Descriptive statistics
were computed to provide that baseline and to make sure the numbers looked reasonable.
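A baseline of this kind can be produced with Python's standard `statistics` module. The incomes below are invented and the `describe` helper is hypothetical; the sketch only shows which summary figures are typically checked.

```python
import statistics

# Made-up incomes in dollars; one large value mimics the long right tail of income.
incomes = [12000, 25000, 31000, 42000, 58000, 250000]


def describe(values):
    """Return the usual descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


summary = describe(incomes)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A mean well above the median, as in this toy sample, is one sign of the right skew that income data usually show.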
The first variable examined for its relationship to income is gender. The question and
hypotheses are as follows:
Question: Does gender affect an individual's total income?
Null Hypothesis: There is no relationship between an individual's gender and their total income.
Alternative Hypothesis: There is a relationship between an individual's gender and their total
income.
These hypotheses mean that the test attempts to determine whether a person's sex dictates
how much money they make. If it does, the natural follow-up question is which sex makes more
money. Gender was chosen to test against total income because it makes for a great hypothesis.
The question is one that many people have heard raised before: whether the wage gap really
exists. It therefore carries a lot of weight, especially in society today. In this project, it is
just one of several questions raised in relation to income. For this statistical question, here
are the key figures from the test output:
P-value: 0
F-value: 2156.301
R Squared: 0.03
Adjusted R Squared: 0.03
The adjusted R Squared value is a slightly smaller and more conservative value than its R
Squared counterpart; it appears identical here only because of rounding. Because the P-value is
less than 0.05 (it is reported as 0 only because of rounding) and the F-value is large, the null
hypothesis is rejected and the alternative hypothesis is accepted: there is a relationship
between an individual's gender and their income. However, it is worth noting that the R Squared
value is low, in this case 0.03. This means that only 3% of the variation in total income is
explained by gender. So even though the relationship between gender and income is statistically
significant, gender predicts very little about income. Most of the variability in income is left
unexplained by gender alone.
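For readers who want to see where these figures come from, the following is a minimal sketch of a one-predictor regression on a 0/1 dummy. The data are made up, the function name `simple_ols` is hypothetical, and this is not the tool that produced the figures reported here; it only shows how R Squared, adjusted R Squared, and the F-value are defined.

```python
def simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares and report fit statistics."""
    n = len(x)
    k = 1  # number of predictors
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Total variation in y, and the part left unexplained by the fitted line.
    sst = sum((yi - y_bar) ** 2 for yi in y)
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))
    return {"intercept": intercept, "slope": slope,
            "r2": r2, "adj_r2": adj_r2, "f": f_stat}


# Made-up data: female is a 0/1 dummy, income is in dollars.
female = [0, 0, 0, 0, 1, 1, 1, 1]
income = [52000, 31000, 78000, 45000, 38000, 27000, 61000, 42000]

fit = simple_ols(female, income)
for name, value in fit.items():
    print(f"{name}: {value:.4f}")
```

With a single dummy, the intercept is the mean income of the group coded 0 and the slope is the difference between the two group means. The adjusted R Squared is always below the plain R Squared, but with tens of thousands of observations the gap is far too small to survive rounding.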
The second variable examined for its relationship to income is marital status. The question
and hypotheses are as follows:
Question: Does marital status affect an individual's total income?
Null Hypothesis: There is no relationship between an individual's marital status and their total
income.
Alternative Hypothesis: There is a relationship between an individual's marital status and their
total income.
These hypotheses mean that the test attempts to determine whether a person's marital
status dictates how much money they make. If it does, the natural follow-up question is which
marital status is associated with the highest income. For this statistical question, here are the
key figures from the test output:
P-value: 0
F-value: 710.37
R Squared: 0.05
Adjusted R Squared: 0.05
Because the P-value is less than 0.05 and the F-value is large, the null hypothesis is
rejected and the alternative hypothesis is accepted: there is a relationship between an
individual's marital status and their income. One possible explanation is that a married couple
may be content to keep two slightly lower-paying jobs instead of taking risks by pursuing
higher-paid and possibly less secure positions. However, it is worth noting that the R Squared
value is low (0.05). This means that only 5% of the variation in total income is explained by
marital status. Therefore, marital status predicts very little about income.
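When a variable such as marital status has more than two categories, regressing income on its dummy variables amounts to comparing group means, and the F-value measures variation between groups relative to variation within them. The sketch below illustrates that calculation. The category labels and incomes are invented, and the helper `one_way_f` is hypothetical.

```python
def one_way_f(groups):
    """groups maps a category label to a list of incomes; returns (R^2, F)."""
    all_values = [v for values in groups.values() for v in values]
    n = len(all_values)
    k = len(groups) - 1  # dummies needed: one per category minus a reference group
    grand_mean = sum(all_values) / n
    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(
        len(values) * (sum(values) / len(values) - grand_mean) ** 2
        for values in groups.values()
    )
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_within = ss_total - ss_between
    r2 = ss_between / ss_total
    f_stat = (ss_between / k) / (ss_within / (n - k - 1))
    return r2, f_stat


# Made-up incomes in thousands of dollars for three illustrative categories.
groups = {
    "married": [60, 55, 70],
    "single": [30, 35, 40],
    "divorced": [45, 50, 40],
}

r2, f_stat = one_way_f(groups)
print(f"R squared: {r2:.3f}, F: {f_stat:.2f}")
```

Because the number of dummies appears in the F formula, F-values from variables with different numbers of categories are not directly comparable.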
The third variable examined for its relationship to income is age. The question and
hypotheses are as follows:
Question: Does age affect an individual's total income?
Null Hypothesis: There is no relationship between an individual's age and their total income.
Alternative Hypothesis: There is a relationship between an individual's age and their total
income.
These hypotheses mean that the test attempts to determine whether a person's age affects
how much money they make. If it does, the natural follow-up question is at what age people earn
the highest income. For this statistical question, here are the key figures from the test
output:
P-value: 0
F-value: 359.84
R Squared: 0.005
Adjusted R Squared: 0.005
Yet again, because the P-value is less than 0.05 and the F-value is large, the null
hypothesis is rejected and the alternative hypothesis is accepted: there is a relationship
between an individual's age and their income. However, the R Squared value is very low (0.005).
This means that only 0.5% of the variation in total income is explained by age, so age, entered
this way, predicts very little about income. The data show that income is lowest at the beginning
of a person's working life, highest in the middle, and drops off later in life. This relationship
is not linear, so the small R Squared value may reflect the use of a linear regression rather
than a genuinely weak relationship.
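The effect of fitting a straight line to a rise-and-fall pattern can be illustrated with a small sketch. The ages and incomes below are invented to follow an inverted U, and the helpers `solve` and `r_squared` are hypothetical. Adding an age-squared term is shown as one common remedy, not as something this project did.

```python
def solve(a, b):
    """Solve the small linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(columns, y):
    """R^2 of a least-squares fit of y on an intercept plus the given columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    beta = solve(xtx, xty)
    y_bar = sum(y) / len(y)
    sse = sum((yi - sum(bj * rj for bj, rj in zip(beta, r))) ** 2
              for r, yi in zip(rows, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst


# Made-up data: income (thousands of dollars) rises, peaks in mid-life, then falls.
age = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
income = [18, 30, 41, 50, 56, 58, 57, 52, 44, 33, 20]

age_c = [a - 45 for a in age]      # centering keeps the arithmetic well-conditioned
age_c_sq = [a * a for a in age_c]

linear_r2 = r_squared([age_c], income)
quadratic_r2 = r_squared([age_c, age_c_sq], income)
print(f"linear R squared:    {linear_r2:.3f}")
print(f"quadratic R squared: {quadratic_r2:.3f}")
```

On data shaped like this, the straight line explains almost none of the variation while the quadratic fit explains nearly all of it, even though age is just as informative in both cases.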
The fourth variable examined for its relationship to income is education. The question and
hypotheses are as follows:
Question: Does education affect an individual's total income?
Null Hypothesis: There is no relationship between an individual's level of education and their
total income.
Alternative Hypothesis: There is a relationship between an individual's level of education and
their total income.
These hypotheses mean that the test attempts to determine whether a person's level of
education affects their income. If it does, the natural follow-up question is at what level of
education people earn the highest income. Education is a good variable to test against income
because it raises a question that people ask themselves all the time: everyone wants to know
whether pursuing higher education is worth it in terms of their potential income. For this
statistical question, here are the key figures from the test output:
P-value: 0
F-value: 2,974.69
R Squared: 0.14
Adjusted R Squared: 0.14
Here again, because the P-value is less than 0.05 and the F-value is very large, the null
hypothesis is rejected and the alternative hypothesis is accepted: there is a relationship
between an individual's level of education and their income. The R Squared value (0.14) is
higher than it was for the other variables. This means that 14% of the variation in total income
is explained by education. Therefore, education predicts more about income than any of the other
variables tested.
The fifth and final variable examined for its relationship to income is race. The question
and hypotheses are as follows:
Question: Does race affect an individual's total income?
Null Hypothesis: There is no relationship between an individual's race and their total income.
Alternative Hypothesis: There is a relationship between an individual's race and their total
income.
These hypotheses mean that the test attempts to determine whether a person's race affects
their income. If it does, it raises a range of troubling possible reasons as to why this might
be the case, including discrimination. For this statistical question, here are the key figures
from the test output:
P-value: 0
F-value: 86.23
R Squared: 0.0094
Adjusted R Squared: 0.0093
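Race also proved to be significant, and the null hypothesis is rejected. Income does differ
by race, which is consistent with discrimination, although this test alone cannot separate
discrimination from other factors associated with race. The R Squared is very small at 0.0094,
which means race explains less than 1% of the variation in income. A small R Squared describes
how little of the overall variation race accounts for, not how often discrimination occurs.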
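It may seem odd that a variable with such a small R Squared can still produce a P-value that rounds to 0. The sketch below shows the arithmetic linking the overall F-value to R Squared and sample size. It is a rough consistency check, not a reproduction of the original output: the sample size is the approximate 75 thousand mentioned earlier, and the count of 8 predictors assumes the 9 race categories were entered as 8 dummies plus a reference group. The function name `f_from_r2` is hypothetical.

```python
def f_from_r2(r2, n, k):
    """Overall F statistic implied by R^2, sample size n, and k predictors."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


n = 75_000  # "about 75 thousand" testable observations (approximate)
k = 8       # assumption: 9 race categories coded as 8 dummies plus a reference group

f_race = f_from_r2(0.0094, n, k)
print(f"implied F with n = {n}: {f_race:.1f}")

# The same R squared in a sample one hundred times smaller gives a much weaker test.
f_small = f_from_r2(0.0094, 750, k)
print(f"implied F with n = 750: {f_small:.2f}")
```

Under these assumptions the implied F is in the same range as the reported 86.23, while the same R Squared with only 750 observations gives an F below 1. Statistical significance here is largely a product of the very large sample, which is why the R Squared is the better guide to how much each variable actually explains.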
Overall, the numbers show that income is extremely difficult to predict on a general scale
from demographic variables like these. Income can be predicted well only on a case-by-case
basis, meaning that specific information about the person's job, such as industry, geographic
location, required qualifications, and danger level, has to be known. For example, take two men
with the same education level, one of whom specializes in business management and the other in
forestry. If business management pays more than forestry, as is commonly assumed, education
level alone cannot account for the difference, so other information more specific to the job
must be known. That is one of the limitations with IPUMS: the data used here do not include
suitable information on things like college major and job type. There is an occupation variable,
but it contains thousands of codes that would take much longer than a single semester to
analyze. Even so, income is still a good subject to choose, since it allows for an examination of
the limits of the data and shows that some variables are not so simple.
IPUMS USA
Steven Ruggles, J. Trent Alexander, Katie Genadek, Ronald Goeken, Matthew B. Schroeder, and
Matthew Sobek. Integrated Public Use Microdata Series: Version 5.0 [Machine-readable
database]. Minneapolis: University of Minnesota, 2010.
