
Hypothesis Testing for Mean Difference (2 Samples)

using R
Umair Durrani
Thursday, April 26, 2015

Contents
Problem Statement
Data Description
Hypothesis Testing
Conclusion
Resources

Problem Statement:
A traffic analyst in the city of Zreeha wants to find out whether there is any difference in the crash
frequencies (number of crashes per year) between rear-end and side-swipe crashes. The transport department
collects crash frequencies for a year at 10 sites of 4-legged intersections. The data is described below in
the data frame df.
Statistically speaking, the analyst wants to answer the question:
Are the crash frequencies of rear-end and side-swipe crashes at 4-legged intersections statistically different?

photo credit: wikimedia.org

Data Description:
Reading Data
We will first read the data using the readr package:
library(readr)
df <- read_csv("HT2.csv", col_names = TRUE)
head(df)
##               [EMPTY] Crash Frequency \n(Crashes per year)    [EMPTY]
## 1 (Crashes per year)"                                 <NA>
## 2              Site #                             Rear-end Side-swipe
## 3                   1                                   10         12
## 4                   2                                    7          9
## 5                   3                                    6          4
## 6                   4                                    5          7

The data has some empty cells because it was laid out this way in an Excel sheet. So, we read the data again,
skipping the unnecessary rows:
df <- read_csv("HT2.csv", col_names = TRUE, skip = 2)
head(df)
##   Site # Rear-end Side-swipe
## 1      1       10         12
## 2      2        7          9
## 3      3        6          4
## 4      4        5          7
## 5      5        9          8
## 6      6       11         10

The data is in good shape now.
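To see why skipping the first two lines fixes the header, here is a minimal sketch that writes a small file to a temporary location and reads it back. The file layout (two leftover spreadsheet lines above the real header) is an assumption, because HT2.csv itself is not shown; the six data rows are the ones printed by head(df) above. The sketch uses base R's read.csv rather than readr so that it runs without extra packages.

```r
# Assumed layout: two leftover spreadsheet lines, then the real header row.
# Only the six rows printed by head(df) are reproduced here.
lines <- c(
  ",Crash Frequency (Crashes per year),",
  ",,",
  "Site #,Rear-end,Side-swipe",
  "1,10,12",
  "2,7,9",
  "3,6,4",
  "4,5,7",
  "5,9,8",
  "6,11,10"
)
tmp <- tempfile(fileext = ".csv")
writeLines(lines, tmp)

# skip = 2 drops the two leftover lines so the third line becomes the header;
# check.names = FALSE keeps names such as "Site #" unchanged.
demo <- read.csv(tmp, skip = 2, check.names = FALSE)
print(names(demo))
print(demo)
unlink(tmp)
```

If the real file has a different number of leftover lines, the value of skip would need to change accordingly.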


Summary Statistics
The following summarizes the data set:
# Summary statistics for each column
summary(df)
##      Site #         Rear-end       Side-swipe
##  Min.   : 1.00   Min.   : 5.00   Min.   : 4.00
##  1st Qu.: 3.25   1st Qu.: 7.00   1st Qu.: 7.00
##  Median : 5.50   Median : 8.50   Median : 8.00
##  Mean   : 5.50   Mean   : 8.20   Mean   : 8.30
##  3rd Qu.: 7.75   3rd Qu.: 9.75   3rd Qu.: 9.75
##  Max.   :10.00   Max.   :11.00   Max.   :12.00

However, we are interested not in the individual averages of rear-end and side-swipe crashes but in the
difference between them. Our main goal is to test whether the mean of the differences is statistically
different from zero.

Hypothesis Testing
To assess the significance of the mean difference in crash frequencies, we'll first compute the differences:
library(dplyr)
colnames(df) <- c("site", "rear", "side")
names(df)
## [1] "site" "rear" "side"
# Adding a new column
df <- df %>%
  mutate(d = rear - side)
head(df)
## Source: local data frame [6 x 4]
##
##   site rear side  d
## 1    1   10   12 -2
## 2    2    7    9 -2
## 3    3    6    4  2
## 4    4    5    7 -2
## 5    5    9    8  1
## 6    6   11   10  1

The mean and standard deviation of the differences between the two samples are:
dbar <- mean(df$d)
dbar
## [1] -0.1
s <- sd(df$d)
s
## [1] 1.66333
Hypothesis
Our null hypothesis is that there is no difference between the crash frequencies of rear-end and side-swipe
crashes or, in other words, that the mean of the population of all these differences is zero:
$H_0: \mu_D = 0$
The alternative hypothesis is:
$H_A: \mu_D \neq 0$
Level of significance = 0.05

Critical Value
Because the sample size is only 10, we use the t distribution instead of the Z distribution.
According to the CLT, the mean of the sampling distribution of mean differences in crash frequencies between
rear-end and side-swipe crashes is equal to the population mean difference, which is assumed to be zero under
the null hypothesis.
The critical t value for a two-tailed test with degrees of freedom = 10 - 1 = 9 and level of significance =
0.05 is:
tcritical = abs(qt(0.05/2, df=9))
tcritical
## [1] 2.262157
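As a quick illustration of why the t distribution matters for a sample this small, the sketch below compares the two-tailed critical value from the t distribution with 9 degrees of freedom against the corresponding standard normal value. The variable names are chosen for illustration only.

```r
# Two-tailed critical values at a 0.05 level of significance
alpha <- 0.05
n <- 10                                  # number of sites (from the text)

t_crit <- qt(1 - alpha / 2, df = n - 1)  # t distribution with 9 degrees of freedom
z_crit <- qnorm(1 - alpha / 2)           # standard normal, for comparison

print(c(t = t_crit, z = z_crit))
```

The t critical value is larger than the Z value, so the rejection region is further from zero, which reflects the extra uncertainty from estimating the standard deviation with only 10 observations.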
t-statistic
From our data we can compute the t score using the following formula:
$$t = \frac{\bar{d} - \mu_D}{s / \sqrt{n}}$$
tval = (dbar - 0) / (s / sqrt(10))
tval
## [1] -0.1901173
One-step procedure

We can do a t.test in R in one step:

t.test(df$rear,df$side,alternative="two.sided", paired=TRUE, conf.level = 0.95)

##
##  Paired t-test
##
## data:  df$rear and df$side
## t = -0.1901, df = 9, p-value = 0.8534
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.289875  1.089875
## sample estimates:
## mean of the differences
##                    -0.1
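The 95 percent confidence interval in this output can be reproduced by hand from the summary values computed earlier. The sketch below assumes only the reported mean difference, the reported (rounded) standard deviation, and n = 10; it does not need the raw data.

```r
# Summary values reported earlier in the text
dbar <- -0.1      # mean of the paired differences
s    <- 1.66333   # standard deviation of the differences (rounded)
n    <- 10        # number of sites

se     <- s / sqrt(n)               # standard error of the mean difference
t_crit <- qt(0.975, df = n - 1)     # two-tailed critical value for 95% confidence
ci     <- dbar + c(-1, 1) * t_crit * se
print(round(ci, 4))
```

Because the interval contains zero, it leads to the same decision as the two-tailed test at the 0.05 level.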

Conclusion
Because the t-value (-0.19) falls in the non-rejection region, i.e. between the critical t-values of -2.262
and 2.262, we fail to reject the null hypothesis.
Another way to interpret the result is through the p-value. Because the p-value (0.8534) is higher than the
level of significance (0.05), the probability of getting the observed or a more extreme mean difference, given
that the null hypothesis is true, is higher than the probability of rejecting the null hypothesis when it is in
fact true. Therefore, we fail to reject the null hypothesis. In the context of this example, we say that the
mean difference between rear-end and side-swipe crashes is not statistically significant.
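The two decision rules can be checked side by side. This is a minimal sketch that starts from the reported summary values (mean difference, rounded standard deviation, n = 10) rather than the raw data; the names reject_by_t and reject_by_p are illustrative.

```r
# Summary values reported in the text
dbar  <- -0.1
s     <- 1.66333
n     <- 10
alpha <- 0.05

t_val  <- (dbar - 0) / (s / sqrt(n))        # t-statistic under H0: mean difference = 0
t_crit <- qt(1 - alpha / 2, df = n - 1)     # two-tailed critical value
p_val  <- 2 * pt(-abs(t_val), df = n - 1)   # two-tailed p-value

reject_by_t <- abs(t_val) > t_crit          # critical-value rule
reject_by_p <- p_val < alpha                # p-value rule
print(c(t = t_val, p = p_val))
print(c(reject_by_t = reject_by_t, reject_by_p = reject_by_p))
```

Both rules give the same answer for a two-tailed test, since the p-value falls below the level of significance exactly when the absolute t-value exceeds the critical value.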

Resources:

readr package examples


Data Analysis and Statistical Inference course
t test in R
Caldwell, Sally. Statistics unplugged. Cengage Learning, 2012.
