Dax Allen

MATH 1040

Intro to Statistics

Stats Project Reflection

A Brief Explanation

The goal of this assignment was to use what we learned in class on

a real world example that was unique to us. We were each tasked with

purchasing a bag of Skittles, counting the total, counting the total of each

color, recording both of those, and recording our height. Once the

complete data set had been created we did a statistical analysis on the

data. We did the analysis in 5 separate parts as we learned them in class.

Reflection

This project taught me that, at least in my class, I am an outlier in

physical height. Joking aside, this project has taught me that while stats

can be useful, they are also vulnerable to misinterpretation by anyone who

doesn’t know much about them.

There was a fairly large range of total candies in the bags that were present

in my class: a difference of 21 candies between the smallest and the largest.

Does that mean that the company should be held liable for not supplying
enough candies to the smallest bag or that they are losing money on the

largest bag? Learning how to find the answer to that question is something

I think will do me good in my future.

This project really helped me improve my ability to interpret

information that I was given in a much more concise and detailed way.

Having to take a bunch of numbers and run them through differing

calculations and then take the answers and create a story with them is

something that I as a computer science/data science major will be doing

daily.

I don’t know if I can say, yet, how this has influenced my view of real

world math applications. Being in computer science and dabbling in data

science I’ve experienced real world applications of math in many different

areas. Take computer graphics, for example: there is a ton of

trigonometry, calculus, and physics involved in creating realistic

animations for movies, whether it be the whole movie or just some special

effects.

One of the biggest things that I will take away from this course is how

statistics are manipulated and how to identify whether conclusions were drawn

properly from the statistics. In today’s connected and, supposedly, data-driven

world, people throw incorrect conclusions around all the

time.
Dax Allen
MATH 1040
09/21/2019

Which of the two variables (candy color, number of candies per bag) is qualitative and which is
quantitative and an explanation of how you know?

Candy Color – qualitative – Colors are categories (names), not numeric measurements.


Number of candies of each color per bag – quantitative – You can count how many of each color are in a bag.
Number of Candies per bag – quantitative – You can count the number of candies in the bag.
Student Height – quantitative – You can measure the students’ height.

State what the individuals (single objects being observed) are for measuring each variable.

The colors are nominal.


The totals per color and per bag can be ordinal or interval.
Student height can be ordinal or interval.

Color    Total   Relative Frequency
Red       1064   20.85%
Orange     985   19.30%
Yellow    1063   20.83%
Green     1016   19.91%
Purple     976   19.12%
Total     5104   100.00%
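
The relative frequencies in the table can be reproduced with a minimal Python sketch. Python is not one of the tools used in this project, and the variable names are only illustrative; the counts are the class totals from the table.

```python
# Class totals per color, copied from the table above.
counts = {"Red": 1064, "Orange": 985, "Yellow": 1063, "Green": 1016, "Purple": 976}

total = sum(counts.values())

# Relative frequency = count for one color / total number of candies.
rel_freq = {color: n / total for color, n in counts.items()}

for color, share in rel_freq.items():
    print(f"{color:<8}{counts[color]:>6}{share:>10.2%}")
print(f"{'Total':<8}{total:>6}{sum(rel_freq.values()):>10.2%}")
```

Because each percentage is rounded separately, the rounded column can add up to slightly more or less than 100%.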

Column  n   Mean    Std. dev.  Median  Range  Min  Max  Q1  Q3  IQR  Unadj. std. dev.  Mode
Red     87  12.230  3.259      12      17     6    23   10  15  5    3.240             11
Orange  87  11.322  3.215      11      21     2    23   10  13  3    3.197             12
Yellow  87  12.218  2.503      12      10     8    18   10  14  4    2.489             Multiple modes
Green   87  11.678  3.197      12      14     5    19   9   14  5    3.179             12
Purple  87  11.218  3.353      11      18     4    22   9   13  4    3.334             9
Total   87  58.667  2.380      59      21     51   72   58  60  2    2.367             59
Height  87  65.802  3.893      65      20     58   78   63  68  5    3.870             65

Column  Lower Fence  Upper Fence  Outliers          LF Calc         UF Calc
Red     2.5          22.5         23                10 - (1.5 * 5)  15 + (1.5 * 5)
Orange  5.5          17.5         2, 5, 18, 19, 23  10 - (1.5 * 3)  13 + (1.5 * 3)
Yellow  4            20           No Outliers       10 - (1.5 * 4)  14 + (1.5 * 4)
Green   1.5          21.5         No Outliers       9 - (1.5 * 5)   14 + (1.5 * 5)
Purple  3            19           20, 22            9 - (1.5 * 4)   13 + (1.5 * 4)
Total   55           63           51, 54, 72        58 - (1.5 * 2)  60 + (1.5 * 2)
Height  55.5         75.5         77, 78            63 - (1.5 * 5)  68 + (1.5 * 5)

There were 61 total candies in my bag: 6 Red, 15 Orange, 9 Yellow, 19 Green, and 12 Purple. My bag isn’t
an outlier, and neither are any of my colors. My height is an outlier at 78 inches.
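
The fence calculations and the check of my own values can be sketched in Python. This is only an illustration of the 1.5 * IQR rule; the function names `fences` and `is_outlier` are made up for the example, and the quartiles are copied from the summary table.

```python
# (Q1, Q3) per column, copied from the summary statistics table.
quartiles = {
    "Red": (10, 15), "Orange": (10, 13), "Yellow": (10, 14),
    "Green": (9, 14), "Purple": (9, 13), "Total": (58, 60), "Height": (63, 68),
}


def fences(q1, q3):
    """Return (lower fence, upper fence) using the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def is_outlier(value, q1, q3):
    """A value is an outlier if it falls outside either fence."""
    lower, upper = fences(q1, q3)
    return value < lower or value > upper


# My own bag and height, as reported in the text.
mine = {"Red": 6, "Orange": 15, "Yellow": 9, "Green": 19,
        "Purple": 12, "Total": 61, "Height": 78}

flags = {name: is_outlier(value, *quartiles[name]) for name, value in mine.items()}
print(flags)
```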

I think that it is appropriate to discuss the shape of the graphs for the individual colors, the totals per
bag, and the height of the students. The colors are all mostly symmetric in shape, though yellow has a bit
of a skew to the right. The totals per bag are very symmetric, while the height of the students is skewed
right.
Dax Allen Project Part 3 Stats 1040

What do you think your results relative to the research


question will be? Explain.
I don’t think there will be any correlation between the height of
the individual and the number of Skittles they received. There is
no reason to think that an individual’s height has any impact on
the number of Skittles in the bag they chose.

Determine which of the two variables, number of candies


per bag and height of the person, is the explanatory
variable and which is the response variable based on the
research question above.
The number of candies per bag is the response variable and the
height in inches is the explanatory variable.

Simple linear regression results:

Dependent Variable: Total
Independent Variable: Height
Total = 59.965897 - 0.019744454 Height
Sample size: 87
R (correlation coefficient) = -0.032289218
R² = 0.0010425936
Estimate of error standard deviation: 2.3931895

Parameter estimates:

Parameter  Estimate      Std. Err.    Alternative  DF  T-Stat       P-value
Intercept  59.965897     4.36961      ≠ 0          85  13.723398    <0.0001
Slope      -0.019744454  0.066290551  ≠ 0          85  -0.29784719  0.7665

Analysis of variance table for regression model:

Source  DF  SS          MS          F-stat      P-value
Model   1   0.50809063  0.50809063  0.08871295  0.7665
Error   85  486.82524   5.7273558
Total   86  487.33333
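
The pieces of this output fit together in a way that can be checked with a short Python sketch. It only reuses numbers printed by StatCrunch; the variable names are illustrative and nothing here refits the model.

```python
import math

# Values copied from the StatCrunch regression output above.
ss_model, ss_error, ss_total = 0.50809063, 486.82524, 487.33333
df_model, df_error = 1, 85
slope, slope_se = -0.019744454, 0.066290551

# R-squared is the share of total variation explained by the model.
r_squared = ss_model / ss_total

# The correlation coefficient takes the sign of the slope.
r = math.copysign(math.sqrt(r_squared), slope)

# F is the ratio of the two mean squares; t is the slope over its standard error.
f_stat = (ss_model / df_model) / (ss_error / df_error)
t_stat = slope / slope_se

print(r_squared, r, f_stat, t_stat)
```

With a single explanatory variable, the square of the slope's t statistic equals the F statistic, which is why both rows report the same P-value of 0.7665.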

Is there a significant relationship between the two


variables? Identify the value of the correlation coefficient
from the StatCrunch regression output and identify the
correct critical value for determining whether there is a
significant relationship. Is this what you expected when
you thought about what the results would be before
analyzing the data? Explain.

The correlation coefficient is -0.032289218


The critical value for 87 items is 0.211 @ .05 significance level
according to this Pearson Chart.
|-0.032289218| < 0.211
This is what I expected. There is no reason to think that the
height of an individual has anything to do with the number of
skittles they get in a bag.

Using the same regression output, give the regression


equation. Use the regression equation and YOUR height in
inches to predict the number of candies in the next bag of
skittles you buy? Was it appropriate to use this regression
equation to make this prediction? Why or why not?
The regression equation is:
y-hat = -0.019744454x + 59.965897
-0.019744454(78) + 59.965897 = 58.425829588, or about 58 candies
I don’t think it is appropriate to use the regression equation.
Because the correlation is not significant, height tells us almost nothing
about the number of candies, and the sample mean of 58.667 is the better prediction.
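
A small Python sketch shows how little the regression line actually moves with height. The function name `predict_total` is made up for this example; the coefficients come from the regression output, and 58 and 78 inches are the minimum and maximum heights in the class data.

```python
def predict_total(height_in):
    """Predicted candies per bag from the fitted line in the regression output."""
    return 59.965897 - 0.019744454 * height_in


# Shortest and tallest heights in the class data, in inches.
shortest, tallest = 58, 78

for height in (shortest, tallest):
    print(height, round(predict_total(height), 3))

# Difference in predicted candies across the whole observed height range.
spread = predict_total(shortest) - predict_total(tallest)
print(round(spread, 3))
```

Across the full 20-inch range of heights the prediction changes by less than half a candy, which is another way of seeing that height carries essentially no information about the count.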

Using the regression output, give the value for R² and


interpret its meaning.
R² = 0.0010425936
This shows that only 0.1% of the total variation can be explained
by the linear relationship between the height of the person and
the number of candies, which is close enough to none that it’s
almost indistinguishable from it.

Assume there is a significant relationship between height


and number of candies per bag. Would it be appropriate to
predict the number of candies in a bag purchased by retired
Houston Rockets player Yao Ming, who is 90 inches tall?
Why or why not?
It would not be appropriate. Yao Ming’s height is outside the scope
of the current data, where heights ranged from 58 to 78 inches.

Enter these values into your calculator and report the


correlation coefficient and regression equation. Give the
appropriate critical value and state whether there is a
significant linear relationship between X and Y for this
smaller data set.
r = .1836245558
y-hat = 0.797503467x + 53.17683773
Critical value = .666 – From accepted tables doc on Canvas page.
|.1836245558| < .666
There is no significant linear relationship in this subset.
Dax Allen Project Part 4 Math 1040

Problem 1: Suppose all of the Skittles in the class data set are
combined into one large bowl and you are going to randomly
select one Skittle.
o What is the probability that you select a green Skittle? (4
points)
1016 / 5104 = 0.1991
o What is the probability that you select a Skittle that is NOT
green? (4 points)
1 - 0.1991 = 0.8009
o What is the probability that you select a Skittle that is red
OR yellow? (4 points)
Red 1064 / 5104 = 0.2085
Yellow 1063 / 5104 = 0.2083
0.2085 + 0.2083 = 0.4168
o What is the probability that you select a Skittle that is
orange GIVEN that it is a secondary color (secondary
colors are green, orange and purple)? (4 points)
985 (Or) + 1016 (Gr) + 976 (Pr) = 2977
985 / 2977 = 0.3309
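
The four Problem 1 probabilities can be sketched in Python with exact fractions. The variable names are illustrative; the counts are the class totals from Part 2.

```python
from fractions import Fraction

# Class totals per color from Part 2.
counts = {"Red": 1064, "Orange": 985, "Yellow": 1063, "Green": 1016, "Purple": 976}
total = sum(counts.values())

# P(green) and its complement.
p_green = Fraction(counts["Green"], total)
p_not_green = 1 - p_green

# Red and yellow cannot happen together, so their probabilities add.
p_red_or_yellow = Fraction(counts["Red"] + counts["Yellow"], total)

# Conditional probability: restrict the sample space to the secondary colors.
secondary = counts["Green"] + counts["Orange"] + counts["Purple"]
p_orange_given_secondary = Fraction(counts["Orange"], secondary)

for value in (p_green, p_not_green, p_red_or_yellow, p_orange_given_secondary):
    print(round(float(value), 4))
```

Working with exact fractions and rounding only at the end gives 0.4167 for red or yellow; adding the two already-rounded values (0.2085 + 0.2083) gives 0.4168, so the last digit depends on when the rounding happens.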

Problem 2: Suppose you are going to randomly select two


Skittles from the bag YOU purchased.
o What is the probability that both Skittles are purple if you
select them with replacement? Give your answer correct to
four decimal places. (4 points)
12/61 * 12/61 = 0.0387
o What is the probability that both Skittles are purple if you
select them without replacement? Give your answer
correct to four decimal places. (4 points)
12/61 * 11/60 = 0.0361
o What is the probability that the first skittle is purple and the
second skittle is not purple if you select them with
replacement? (4 points)
12/61 * 49/61 = 0.158
o What is the probability that at least one Skittle is purple if
you select them with replacement? (4 points)
1 - (49/61 * 49/61) = 1 - 0.6453 = 0.3547
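
The Problem 2 answers can be sketched in Python with exact fractions. The variable names are illustrative; the counts (12 purple out of 61) are from my own bag.

```python
from fractions import Fraction

purple, total = 12, 61          # counts from my own bag
p = Fraction(purple, total)     # P(purple) on a single draw
q = 1 - p                       # P(not purple) on a single draw

# With replacement the two draws are independent, so probabilities multiply.
both_with = p * p

# Without replacement the second draw has one fewer purple and one fewer candy.
both_without = p * Fraction(purple - 1, total - 1)

# Purple first, then not purple, with replacement.
purple_then_not = p * q

# "At least one purple" is the complement of "neither draw is purple".
at_least_one = 1 - q * q

for value in (both_with, both_without, purple_then_not, at_least_one):
    print(round(float(value), 4))
```

Note that 1 - 12/61 = 0.8033 is the probability that a single draw is not purple; the "at least one" question needs the complement of both draws missing, 1 - (49/61)^2.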

Problem 3: Suppose all of the Skittles in the class data set are
combined into one large bowl and you are going to randomly
select ten Skittles with replacement and count how many are
yellow.
o List the requirements of the binomial probability
distribution and show that this meets them, including
identifying the values for n and p. (6 points)
Binary choice: it is or it isn’t – Yellow or not Yellow
With replacement – probability remains constant
Fixed number of trials, n = 10
Choosing a yellow and replacing it doesn’t impact the
ability to choose another yellow, or any other color – trials
are independent
n = 10, p = 0.2083
o What is the probability that exactly 4 of the 10 Skittles are
yellow? (4 points)
binompdf(10, 0.2083, 4) = 0.0974
o What is the probability that at most 2 of the 10 Skittles are
yellow? (4 points)
binomcdf(10, 0.2083, 2) = 0.6526
o For samples of size 10, what is the expected value and
standard deviation for the number of yellow skittles that
will be included? (4 points)
μ = 10 * 0.2083 = 2.083, or about 2
σ = sqrt(10 * 0.2083 * (1 - 0.2083)) = 1.284
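
The calculator commands in Problem 3 can be mirrored with a short Python sketch. The helper names `binom_pmf` and `binom_cdf` are made up for this example (they are not library functions); n and p are the values identified above.

```python
import math


def binom_pmf(n, p, k):
    """P(X = k) for a binomial random variable with n trials and success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(n, p, k):
    """P(X <= k), the sum of the probabilities for 0 through k successes."""
    return sum(binom_pmf(n, p, i) for i in range(k + 1))


n, p = 10, 0.2083

exactly_4 = binom_pmf(n, p, 4)
at_most_2 = binom_cdf(n, p, 2)

# Mean and standard deviation of a binomial distribution.
mean = n * p
sd = math.sqrt(n * p * (1 - p))

print(round(exactly_4, 4), round(at_most_2, 4), round(mean, 3), round(sd, 3))
```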
Dax Allen Project Part 5 MATH 1040

Assume p = the proportion of yellow candies for all Skittles =


0.2. Describe the sampling distribution for the proportion of
yellow candies for samples of 85 candies, including center,
spread, and shape (justify your answers). (6 points)
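o np(1 - p) = 85 * .2 * (1 - .2) = 13.6 ≥ 10
o 85 ÷ 5104 = 0.0167 ≤ 0.05, so n ≤ 0.05N
o The center is the mean of the sample proportions, p = .2 (an expected 17 yellow candies out of 85)
o The shape is approximately normal because np(1 - p) ≥ 10 and n ≤ 0.05N
o Spread: sqrt(.2 * (1 - .2) / 85) = 0.0434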

Based on this sampling distribution, what is the probability


that in a random sample of 85 skittles, more than 22% of them
will be yellow? (5 points)
o normalcdf(-1E99, .22, .2, 0.0434) = .6775 (the area at or below 22%)
o 1 - .6775 = .3225
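
The sampling distribution of the proportion can be sketched with Python's standard library. This is only an illustration; `NormalDist` is from the `statistics` module, and the other names are made up for the example.

```python
import math
from statistics import NormalDist

p, n = 0.2, 85

# Standard deviation of the sample proportion (the "spread" above).
se = math.sqrt(p * (1 - p) / n)

# Approximate sampling distribution of the sample proportion.
sampling_dist = NormalDist(mu=p, sigma=se)

# P(sample proportion > 0.22) is the area to the right of 0.22.
prob_above_22 = 1 - sampling_dist.cdf(0.22)

print(round(se, 4), round(prob_above_22, 3))
```

The sketch keeps the standard deviation unrounded, so the last digit of the probability can differ slightly from a calculation that uses 0.0434.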

Assume μ = mean number of candies per bag for all 2.17 oz
bags of Skittles = 60 candies and σ = standard deviation of
number of candies per bag for all 2.17 oz bags of Skittles = 2.5.
Describe the sampling distribution for the mean number of
candies per bag for samples of 32 bags, including center,
spread, and shape (justify your answers). (6 points)
o The center is going to be the same as μ, 60.
o Spread: σ / sqrt(n) = 2.5 / sqrt(32) = 0.4419
o Distribution is approximately normal according to the Central Limit
Theorem, since n = 32 ≥ 30

Based on this sampling distribution, what is the probability


that a sample of 32 bags will have a mean of less than 59
candies per bag? (5 points)

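o normalcdf(-1E99, 59, 60, 0.4419) = .0118


o The probability that a sample of 32 bags has a mean of less than 59 candies per bag is
.0118 or 1.18%. The sampling distribution uses the standard deviation of the sample
mean (0.4419), not the standard deviation of a single bag (2.5).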
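
A short Python sketch contrasts the probability for a single bag with the probability for the mean of 32 bags, since the two use different standard deviations. `NormalDist` is from the standard library; the other names are illustrative.

```python
import math
from statistics import NormalDist

mu, sigma, n = 60, 2.5, 32

# Standard deviation of the sample mean for samples of 32 bags.
se = sigma / math.sqrt(n)

# P(one bag has fewer than 59 candies), assuming bag counts are roughly normal
# and using the standard deviation of a single bag.
single_bag = NormalDist(mu, sigma).cdf(59)

# P(the mean of 32 bags is below 59): uses the much smaller standard error.
sample_mean = NormalDist(mu, se).cdf(59)

print(round(se, 4), round(single_bag, 4), round(sample_mean, 4))
```

A single bag below 59 is fairly common (about 0.34), while a mean below 59 across 32 bags is rare (about 0.012), because averaging shrinks the spread by a factor of sqrt(32).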

Explain in general the purpose of a confidence interval. (2


points)
o It is to give us a range of values instead of a single value for
our estimated population parameter.

Using values for the class data that you computed in Part 2 of
the project, construct a 99% confidence interval estimate for
the true proportion of yellow candies using the class data as
your sample. Remember that for this computation, n is the
number of CANDIES for the entire class data. Include all your
work, showing the formula used and appropriate values
inserted (neatly written and scanned or typed) or including the
appropriate calculator commands and inputs. (5 points)
o 1-PropZInt (
o x = 1063
o n = 5104
o C-Level = .99 )
o p-hat = 1063 / 5104 = 0.2083
o Lower Bound: 0.1936
o Upper Bound: 0.2229

Give an appropriate interpretation of your interval. (3 points)


o We can be 99% confident that the true proportion of yellow Skittles is
between 0.1936 and 0.2229.

Based on your interval for the true proportion of yellow


candies, was the proportion of yellow candies in the single bag
of candy you purchased a likely value for the true population
proportion? Explain how you know using actual values from
your data and computations. (5 points)
o The proportion of Yellow in my bag was 9 / 61 = 0.1475. This is lower than
the lower bound of 0.1936, so it falls outside, to the left of, the 99%
confidence interval and as such would not be considered a
likely value for the true population proportion.
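
Since this question asks about a proportion, with n equal to the number of candies, a one-proportion z-interval can be sketched in Python as follows. This is a minimal illustration using the standard library's `NormalDist`; the variable names are made up, and the counts are the class totals (1063 yellow out of 5104) and my own bag (9 yellow out of 61).

```python
import math
from statistics import NormalDist

x, n, level = 1063, 5104, 0.99   # yellow candies, all candies, confidence level

p_hat = x / n

# Critical z value that leaves (1 - level) / 2 in each tail.
z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)

# Margin of error for a one-proportion z-interval.
margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - margin, p_hat + margin

# Proportion of yellow in my own bag, for comparison with the interval.
my_bag = 9 / 61

print(round(lower, 4), round(upper, 4), round(my_bag, 4), lower <= my_bag <= upper)
```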

Using values you computed in Part 2 of the project, construct a


95% confidence interval estimate for the true mean number of
candies per bag using the class data as your sample, but for
this computation, n is the number of BAGS. Make sure you use
the correct standard deviation from Part 2, which treats the
class data set as a sample. Include all your work, showing the
formula used and appropriate values inserted (neatly written
and scanned or typed) or including the appropriate calculator
commands and inputs. (5 points)
o TInterval (
o x-bar: 58.667
o Sx: 2.38
o n: 87
o C-Level: .95 )
o Lower Bound: 58.16
o Upper Bound: 59.17

Give an appropriate interpretation of your interval. (3 points)


o We can be 95% confident that the true mean number of Skittles per
bag is between 58.16 and 59.17.

Based on your interval for the true mean number of candies


per bag, was the total number of candies in the single bag you
purchased a likely value for the population mean? Explain how
you know using actual values from your data and
computations. (5 points)
o The total number of Skittles in my bag was 61. This value is
higher than the upper bound of 59.17, so it falls to the right of the
95% interval and is not a likely value for the population mean.
Dax Allen Project Part 6 MATH 1040

Explain in general the purpose of a hypothesis test. (2 points)


o The purpose of hypothesis testing is to determine if there is enough statistical
evidence in favor of a hypothesis about a parameter.

Using values for the class data that you computed in Part 2 of the project and a 0.1
significance level, test the claim that 20% of all Skittles candies are red. Show all the
steps (neatly written and scanned, typed, or copied from StatCrunch) including:

1. the hypotheses with correct notation (4 points)


H0: P = 0.2
H1: P ≠ 0.2

2. the requirements for performing the hypothesis test, along with discussing
how they are met or not met (hint: they are not all met!). In your discussion,
describe how you selected the bag of skittles you bought (give me some
details, like where you bought it and how you picked the bag to buy), and
identify what type of sampling method it was. (6 points)
When selecting my bag of Skittles, I just walked into my local Harmon’s
and chose the first one my hand touched. I would say this was
convenience sampling, so the random-sample requirement is not met.
The number of red Skittles per bag also has one outlier, but the sample size
is large enough.
np0(1 - p0) = 5104 (0.2) (1 - 0.2) = 816.64 ≥ 10 is True
5104 ≤ 0.05(Total Skittles population) is True – Sample values are
independent of each other

3. the test statistic and supporting work (3 points)


1-PropZTest:
P0: .2
x: 1064
n: 5104
Prop: ≠ P0
Z0 = 1.5117

4. the p-value (3 points)


1-PropZTest:
P0: .2
x: 1064
n: 5104
Prop: ≠ P0
P-Value = .1306

5. the appropriate decision about the null hypothesis and an appropriate


conclusion (4 points)
P-Value = .1306 > α = 0.1 level of significance
We will not reject the null hypothesis because there is insufficient
evidence to suggest that the proportion of red skittles is not 0.2.

6. Also interpret the p-value for this test. (4 points)


The P-Value shows that, if the true proportion of red Skittles were 0.2, about 13 in 100
samples would result in a sample proportion at least as far from 0.2 as the one we
observed (0.2085). Since this is not less than the 0.1 significance level,
it is not an unusual occurrence.
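
The one-proportion z test for the red Skittles can be sketched in Python with the standard library. This mirrors the 1-PropZTest inputs (P0, x, n) listed in the steps; the variable names are illustrative.

```python
import math
from statistics import NormalDist

x, n, p0, alpha = 1064, 5104, 0.2, 0.1

p_hat = x / n

# Test statistic: distance of p-hat from p0 in standard errors under H0.
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Two-tailed p-value, because the alternative hypothesis is "not equal".
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

reject_h0 = p_value < alpha
print(round(z, 4), round(p_value, 4), reject_h0)
```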

Using values for the class data that you computed in Part 2 of the project and a 0.05
significance level, test the claim that the mean number of candies in a bag of Skittles
is more than 58. Show all the steps (neatly written and scanned, typed, or copied
from StatCrunch) including:

1. the hypotheses with correct notation (4 points)


H0: µ = 58
H1: µ > 58

2. the requirements for performing the hypothesis test, along with discussing
how they are met or not met (hint: the same one that wasn’t met above isn’t
met here!) (4 points)
The samples were not selected at random; it was done via convenience
sampling.
There are three outliers in the sample, but the sample size is 87, which is
large enough to offset that. 87 ≥ 30
87 ≤ 0.05 (population of Skittles bags) – Sample values are independent of each
other

3. the test statistic and supporting work (3 points)


T Test:
µ0: 58
x̄: 58.667
Sx: 2.380
N: 87
µ > µ0
t = 2.61
4. the p-value (3 points)
T Test:
µ0: 58
x̄: 58.667
Sx: 2.380
N: 87
µ > µ0
p = .0053
5. the appropriate decision about the null hypothesis and an appropriate
conclusion (4 points)
At a level of significance, α = 0.05, we will reject the null hypothesis as
the P-Value is less than α. There is sufficient evidence that the mean number
of candies per bag is more than 58. The P-Value of 0.0053 means that, if the
true mean were 58, we could expect about 5 of every 1000 samples to have a
mean of 58.667 or higher. That is a very rare occurrence.
6. Also describe the Type I and Type II errors for this test. (6 points)
Type I would be rejecting the null when it should not be rejected. Let’s
say the Skittles factory needs the mean to be 58 per bag. If we
made a Type I error, they might re-calibrate the
Skittles factory when they didn’t need to.
Type II would be not rejecting the null when it should be rejected. Using the
example above, this could be a serious problem if we needed the average
per bag to be 62. We would be promising a certain amount of Skittles per
bag and not delivering on that promise, which could end up in legal issues.
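
The test statistic for the mean can be reproduced with a minimal Python sketch. It uses the summary values from Part 2 (x̄, Sx, n); the variable names are illustrative. The p-value itself requires the Student's t distribution, which Python's standard library does not provide, so this sketch stops at the statistic and the degrees of freedom and leaves the p-value to the calculator's T-Test.

```python
import math

x_bar, mu0, s, n = 58.667, 58, 2.380, 87

# Standard error of the sample mean, estimated from the sample standard deviation.
se = s / math.sqrt(n)

# t statistic: how many standard errors the sample mean lies above the claimed mean.
t_stat = (x_bar - mu0) / se

# Degrees of freedom for a one-sample t test.
df = n - 1

print(round(se, 4), round(t_stat, 2), df)
```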