
Anna Mariah Acevedo

MATH 1040-402
Fall Semester 2014 Maw
December 1st, 2014
Summary of Term Project
This term project has been a culmination of each of the key aspects of statistics that
we have learned in this class: collecting the data, creating graphs to visually illustrate the
dataset, examining the distribution and determining if it is normal or not, working with
confidence intervals, and using hypothesis testing to determine whether claims should be rejected or whether we failed to reject them.
By completing each section of the project, I was able to see how a dataset could be examined and analyzed from the earliest stages all the way to the end. I didn't realize how important it would be to apply each of the chapters to a single dataset; compiling everything together gave me a better sense of what statisticians actually do. It helped to broaden my understanding of the process that one goes through to
interpret the data. While that was key to my comprehension of statistics, I found that the
most important thing for me to take away from this project and this class in general is that I
can take charge of my group projects and put myself out there as a leader.
In the past, I've been more reluctant to take such a role because that usually meant that I was going to end up doing all of the work. This project really showed me that I can lead my group in a way that doesn't require me to do everything on my own. I started the
dialogue with my group members for each of the sections by laying out the assignment. At first, we all divided the sections up evenly, but that didn't seem very successful, so I suggested
that we work on the computations on our own. Then we could come together with an
overall understanding of what that section entailed. When my group members struggled, I
helped them out and vice-versa. That was something that I needed to learn for my future
classes. I now know that I am capable of taking a leadership role and I can lead my group in
the right direction. This was the most important skill that this project has taught me.
Part II: Comparing Our Results with the Class Totals
For this individual portion of our term project, I compared the number of each color of Skittles in my own bag to the counts found in the entire class sample. The results are shown in the table below:

                   RED   ORANGE   YELLOW   GREEN   PURPLE   TOTAL CANDIES
My Bag              17       11       13                               57
Class Bags         408      410      424     402      429            2073
(37 Total)

As you can see, the highest count in my own bag was red, followed by green, yellow, and orange, with purple having the lowest amount. After comparing my results to those of the entire class, I was surprised to see that red was one of the lowest counts overall. Typically, I despise the red ones and always seem to have more of them than any other color (as you can see from my results). On the opposite end of the spectrum, I had very few purple Skittles in my bag, yet purple was the most common color in the entire group. My predictions were borne out by my own bag, but completely contradicted by the class totals. The overall data from the class disagrees with the data I collected from my own bag. This is more visually evident in the graphs my group members and I produced.
As a group, we created a pie chart that illustrates just how close in number the colors were in terms of percentages (ranging between 19.39% and 20.69%). They were all less than a percentage point away from an even split among the five colors. However, because the pie chart appears to show five equal amounts, we also created a Pareto chart to show our readers the difference between the colors found most often and least often. This chart (similar in appearance to a bar chart) organizes the categories in descending order, from highest to lowest, and is better able to demonstrate the distinction between them. Readers can much more easily see that there were more purple Skittles than any other color and that green ones were the least common.
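To make the arithmetic behind those percentages concrete, here is a minimal Python sketch (not part of our original group work, and assuming the class totals from the table above) that recomputes each color's share and sorts the categories the way a Pareto chart would:

```python
# Recompute the class color percentages and the Pareto (descending) ordering.
class_counts = {"Red": 408, "Orange": 410, "Yellow": 424, "Green": 402, "Purple": 429}
total = sum(class_counts.values())  # 2073 candies across the 37 bags

percentages = {color: 100 * count / total for color, count in class_counts.items()}

# A Pareto chart orders the categories from most frequent to least frequent.
for color, pct in sorted(percentages.items(), key=lambda item: item[1], reverse=True):
    print(f"{color:>7}: {pct:5.2f}%")
# Purple comes out highest at about 20.69% and Green lowest at about 19.39%.
```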
Part III: Analyzing the Distribution
The distribution of the total number of Skittles in each bag is by no means a normal, bell shape. Instead we see the shape begin low, then there is a gap, and then a high spike in the data set before it drops off again. There's almost a sense of symmetry in the higher numbers of the data set, starting low, increasing, and then dropping off in the same fashion, but the outliers prevent it from being symmetrical. With 37 bags in total, we see an interquartile range running from 57 to 61 candies per bag (Q1 = 57, Q3 = 61). My bag consisted of 57 Skittles, so my value falls at the first quartile (the 25th percentile). There are a few outliers close to the IQR, but there are also some in the lower region that are quite a bit lower than the rest of the data.
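Since only the summary values (Q1 = 57, Q3 = 61) are reported here, the sketch below uses placeholder per-bag totals to show how the quartiles, IQR, and the usual 1.5 × IQR outlier fences would be computed; swapping in the real 37 counts would reproduce our numbers:

```python
import statistics

# Placeholder per-bag totals for illustration only; these are NOT the real class data.
bag_totals = [57, 58, 61, 60, 59, 56, 62]

q1, median, q3 = statistics.quantiles(bag_totals, n=4)  # Q1, median, Q3
iqr = q3 - q1

# Values outside these fences are flagged as potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
print(f"Outlier fences: below {lower_fence} or above {upper_fence}")
```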
The graphs are pretty close to what I expected to see. However, I was surprised to see those few outliers that were so far away from the IQR. I can't imagine that someone would have only around 20 candies or fewer in their bag. I think bags with around 50 or 70 candies are a little more realistic than 20. So I assume that those very low outliers are probably human error: either the data was typed into the system incorrectly or the respondent had a different-sized bag than the rest of the class.
When comparing my own data to the rest of the class, I feel that they agree. My bag may be at the 25th percentile, but since I had 57 candies and the mean for the overall data was about 56, I'm within one standard deviation of the mean. Even if you disregarded the outliers, I would still be close to the adjusted mean. If you compare my data to the median (which is more resistant to the outliers), then I'm only off by a little. With all of these factors in mind, I would say the evidence supports my assertion that my data agrees with the overall dataset.
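As a rough check of that "within one standard deviation" statement, here is a short sketch; the class standard deviation was not reported in this write-up, so the value below is only an assumed placeholder:

```python
my_bag = 57
class_mean = 2073 / 37      # about 56.03 candies per bag
sample_sd = 3.0             # placeholder assumption, not the actual class value

z_score = (my_bag - class_mean) / sample_sd
print(f"z = {z_score:.2f}")  # |z| < 1 means within one standard deviation of the mean
```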
Part III: Categorical Data vs. Quantitative Data
The difference between categorical and quantitative data is that quantitative data consists of numerical values that can be counted or measured, while categorical data consists of labels, names, or other qualitative descriptions. Categorical data can contain numbers, but those numbers act as labels rather than measurements. You would want to use the types of graphs that show that data in the most appropriate way. Pareto charts, pie charts, bar graphs, and dot plots are all helpful graphs for showing the frequencies in categorical data, which is usually what we most want to interpret about that type of data. It would be irrational to use graphs such as stemplots, boxplots, histograms, or other graphs that correspond to quantitative data, because we're not using numbers, at least not in that way. Those graphs, along with scatterplots and time-series graphs (and dot plots, which work for both since we're dealing only with frequencies), are better suited to quantitative data. These graphs work so well because there are actual formulas for turning the data into a visual representation of the data set. It would be silly to use something like a pie chart to show a five-number summary, because it wouldn't make sense to anyone trying to interpret your data.
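To illustrate the pairing of graph type with data type described above, here is a hedged matplotlib sketch (the per-bag totals are placeholders, not the real class data): a bar chart for the categorical color counts and a histogram for the quantitative candies-per-bag values.

```python
import matplotlib.pyplot as plt

colors = ["Red", "Orange", "Yellow", "Green", "Purple"]
counts = [408, 410, 424, 402, 429]         # categorical frequencies (class totals)
bag_totals = [57, 58, 61, 60, 59, 56, 62]  # placeholder quantitative values

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.bar(colors, counts)                    # categorical data: bar chart of frequencies
ax1.set_title("Skittles by color (categorical)")
ax2.hist(bag_totals, bins=5)               # quantitative data: histogram of bag sizes
ax2.set_title("Candies per bag (quantitative)")
plt.tight_layout()
plt.show()
```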
There are different types of calculations that make sense for each type of data. Since categorical data is qualitative, it makes no sense to try to find something like the mean. Even if you have numbers rather than actual words, there is no meaning behind your results because the numbers are only labels. However, calculations like the mode do make sense, because finding out which category appears most frequently is reasonable. When it comes to quantitative data, most calculations work: we're using actual numbers, and interpreting them in a way that helps us understand the data is rational. The mean, standard deviation, summations, five-number summaries, and more (including the mode, just like for categorical data) are all calculations that make sense. As for calculations that don't work or make sense for quantitative data, I honestly cannot think of any that would work for names but not for numbers. Even alphabetical ranking can be simulated by just listing the numbers in order. I think that, depending on how you want to interpret the data, you would want to use the appropriate formula for your calculations; using a different one would give you a result you weren't looking for, so in that sense it wouldn't be reasonable.
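A short sketch of the point above about which calculations fit which data type; the color list is a made-up sample used only for illustration:

```python
import statistics

colors = ["purple", "red", "yellow", "purple", "green", "purple"]  # categorical labels
bag_totals = [57, 58, 61, 60, 59, 56, 62]                          # quantitative placeholders

# The mode makes sense for categorical data: which label occurs most often?
print(statistics.mode(colors))          # -> 'purple'

# The mean and standard deviation only make sense for quantitative data.
print(statistics.mean(bag_totals))      # average candies per bag
print(statistics.stdev(bag_totals))     # sample standard deviation
```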
Part IV: Confidence Intervals
A confidence interval (CI) is an estimate that is used to make inferences about the general population. By using simple random samples, one can create a CI to help make these inferences. A CI is a range of numbers that may or may not contain the actual value of the population parameter, depending on the confidence level. The higher the level of confidence, the surer you can be that the true value falls within that range. Because it would be impossible to calculate the exact value for an entire population, we use a range of numbers to estimate where the actual value might lie. This allows us to narrow down the possible value with a high level of confidence. When interpreting a CI, you would say that you are ___% confident that the true value falls somewhere within that range of values (that is, the interval). This is not to say that there is a ___% chance that the value will be within the interval, but rather that one can be ___% confident that the interval actually contains the true value of the population parameter.
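For example, here is a minimal sketch of a normal-approximation confidence interval for the proportion of red Skittles, using the class count of 408 out of 2073 from the table in Part II (the 95% level is just an illustrative choice, not necessarily the level used in our project; scipy is assumed to be available):

```python
from math import sqrt
from scipy.stats import norm

x, n = 408, 2073               # red Skittles and total candies in the class sample
confidence = 0.95
p_hat = x / n

z_star = norm.ppf(1 - (1 - confidence) / 2)      # critical value, about 1.96
margin = z_star * sqrt(p_hat * (1 - p_hat) / n)  # margin of error

print(f"{confidence:.0%} CI for the proportion of red Skittles: "
      f"({p_hat - margin:.4f}, {p_hat + margin:.4f})")
```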
Part V: Hypothesis Testing
1. In a paragraph, explain in general the purpose and meaning of a hypothesis test.

A hypothesis test is the statistical process of testing a claim made about a characteristic of a population. This is done by creating a symbolic, mathematical form of the original claim and then determining whether we should reject it or fail to reject it (we don't want to say "accept" because that could be misleading). The purpose of a hypothesis test is to provide statistical evidence either to back up a claim or to show that a false claim has been made. It's important for people to be able to test the claims that others make, whether they come from a government agency, a large corporation, a nonprofit organization, or even just a small group of people. Since decisions can be made directly based on these claims, it's imperative that we make sure the claims hold up before a decision is made. This protects people from making choices based on false evidence.
2. Use a 0.05 significance level to test the claim that 20% of all Skittles candies are red,
using the entire class data set as your sample.

Since the p-value of 0.7171 is greater than the significance level of 0.05, we fail to reject the null hypothesis. There is not sufficient evidence to warrant rejection of the claim that 20% of all Skittles candies are red. (A computational sketch of this test appears after question 5.)
3. Use a 0.01 significance level to test the claim that the mean number of candies in a bag
of Skittles is 55, using the entire class data set as your sample.

Since the p-value is essentially 0, far below the significance level of 0.01, we reject the null hypothesis. There is sufficient evidence to warrant rejection of the claim that the mean number of candies in a bag is 55 Skittles. (See the computational sketch after question 5.)
4. In detail, discuss how your samples meet (or fail to meet) the requirements for
performing these hypothesis tests.

In problem #2, the samples met the requirements for performing this hypothesis test. The first requirement is that the data come from a simple random sample. We accomplished this because each of us randomly selected a bag of Skittles to obtain our individual data. There was no systematic way of choosing the bags, and since each of us bought our own, it wasn't as if someone bought all the bags from the same store and handed them out (which would lessen the randomness of the selection).
The second requirement is that the conditions for a binomial distribution are satisfied. The four parts of this requirement are: there must be a fixed number of trials (n = 2073 total Skittles), which we have; the trials must be independent, and since the result of one trial didn't affect any of the others, this is met; there must be two outcomes (red Skittle or not red); and the probability must be constant for each trial, which we meet since, under the claim, the probability of selecting a red Skittle is 1 out of the 5 colors (p = 0.20).
The last requirement is that np ≥ 5 and n(1 - p) ≥ 5, so that the binomial distribution of the sample proportions is approximately normal. Multiplying 2073 by 20% gives almost 415, which is greater than 5. Multiplying 2073 by 80% gives about 1,658, which is also greater than 5. So this requirement is also met.
In problem #3, the requirements that we must meet are that the data was a
simple random sample (see above for why that is true), and that the data is
normally distributed or that n>30. Since we are looking for the number of
candies per bag, our total number of bags is 37, and 37>30.
5. Discuss and interpret the results of each of your two hypothesis tests.

The results of #2 supported the claim that 20% of all Skittles are red. We did this by testing the hypothesis that the proportion of red Skittles out of the total number of Skittles was 20%. Because our p-value of 0.7171 was higher than the significance level of 0.05, we could not reject the claim; the evidence simply wasn't there to support rejecting it. The sample data was consistent with the claim. This result is not too surprising: if we assume a 1 in 5 probability of getting a red Skittle from a bag, then the claim that the proportion is 20% reflects that probability.
The result of #3 was that we could not support the claim that the mean number of candies per bag was 55. We used the hypothesis test to judge whether or not this claim was true. Using the claim as our null hypothesis, we calculated the test and ended up with a p-value of essentially 0. Because it was much smaller than our significance level of 0.01, we had to reject the claim. The evidence just wasn't there to support the claim that the mean number of candies per bag was 55. This result is not too shocking: when you gather up all of the bags and the candies inside them, we see a sample mean of about 56 candies per bag. Since the best point estimate of a population mean is the sample mean, 56 is probably a better number to represent the population than 55, especially since we rejected 55 as our population mean.
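As referenced in the answers to questions 2 and 3, here is a hedged computational sketch of both tests (scipy assumed available). The z-test numbers come straight from the class totals; the per-bag standard deviation was not reported in this project, so sample_sd below is only a placeholder assumption.

```python
from math import sqrt
from scipy.stats import norm, t

# Question 2: one-proportion z-test of H0: p = 0.20 (red Skittles), two-tailed.
x, n, p0 = 408, 2073, 0.20
p_hat = x / n
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value_z = 2 * norm.sf(abs(z))            # about 0.717, matching the value above
print(f"z = {z:.3f}, p-value = {p_value_z:.4f}")

# Question 4's check: np >= 5 and n(1 - p) >= 5.
print(n * p0, n * (1 - p0))                # 414.6 and 1658.4, both far above 5

# Question 3: one-sample t-test of H0: mu = 55 candies per bag, two-tailed.
n_bags = 37
sample_mean = 2073 / n_bags                # about 56.03
sample_sd = 1.0                            # placeholder assumption, not the real class value
t_stat = (sample_mean - 55) / (sample_sd / sqrt(n_bags))
p_value_t = 2 * t.sf(abs(t_stat), df=n_bags - 1)
print(f"t = {t_stat:.3f}, p-value = {p_value_t:.6f}")
```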
