
Department of Mathematics and Computing

University of Southern Queensland


Solution to Assignment # 2
Course No. STA2300 Data Analysis
Semester 2, 2011
Question 1 (18 marks)
(a) 4 marks
Variable of interest: The number of small businesses that fail in their first year of
operation.
If we assume that small businesses fail independently of one another and that each has
the same probability of failing, then the binomial model is appropriate.
The assumptions for a binomial model include:
- S = success/ failure condition: A success is that a small business fails in its first year
of operation, a failure is that a small business does not fail in its first year of
operation.
- P = Probability of success = the probability of a small business failing in its first year
of operation is 10% (thus p = 0.10).
- I = Independence = it is assumed that each small business is independent of each
other.
- N = a fixed number of small businesses is needed. For parts (b) and (c) this is 12; for
parts (d) and (e) this is 200.


(b) 2 marks
Let X represent the number of small businesses, out of 12, that fail in their first year of
operation. We will assume that X follows a binomial model with parameters n=12 and
p=0.10. We are interested in 2 or fewer failing, thus X = 0, 1, 2.
Using Table B in the Study Book:
$P(X \le 2) = P(X=0) + P(X=1) + P(X=2) = 0.2824 + 0.3766 + 0.2301 = 0.8891$
We estimate the probability that 2 or fewer (X ≤ 2) of the next 12 small businesses fail
in their first year of operation to be 0.8891 = 88.91%.
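
The Table B values can be reproduced directly from the binomial formula. The following is a minimal sketch rather than part of the marking scheme; the function name `binomial_pmf` is chosen here for illustration only.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Return P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 12, 0.10

# P(X <= 2) = P(X=0) + P(X=1) + P(X=2)
prob_at_most_2 = sum(binomial_pmf(k, n, p) for k in range(3))
print(round(prob_at_most_2, 4))  # 0.8891
```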
(c) 2 marks
From above we know:
- p=0.10
- n=12
The formula for the mean of a binomial model is thus:
$\mu = np = 12 \times 0.10 = 1.2$ small businesses
1.2 small businesses are expected to fail in their first year of operation.

(d) 4 marks
We now assume we have 200 small businesses. Let X represent the number of small
businesses, out of 200, that fail in their first year of operation. We will assume that X
follows a binomial model with parameters n=200 and p=0.10. As n is large we will not
be able to use Table B and so will use the normal approximation to the binomial instead.
Calculate the mean and standard deviation of this new sample:
- Mean: $\mu = np = 200 \times 0.10 = 20$ small businesses
- Standard deviation: $\sigma = \sqrt{np(1-p)} = \sqrt{200 \times 0.10 \times (1 - 0.10)} = \sqrt{18} = 4.24$ small businesses
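
As a quick check of the mean and standard deviation formulas, the sketch below evaluates them for both sample sizes used in this question. The helper name `binomial_mean_sd` is an illustrative assumption, not something from the Study Book.

```python
from math import sqrt


def binomial_mean_sd(n: int, p: float) -> tuple[float, float]:
    """Return (mean, standard deviation) of a Binomial(n, p) count."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    return mean, sd


mean_12, sd_12 = binomial_mean_sd(12, 0.10)
mean_200, sd_200 = binomial_mean_sd(200, 0.10)

print(round(mean_12, 2))                      # 1.2
print(round(mean_200, 2), round(sd_200, 2))   # 20.0 4.24
```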
(e) 4 marks
To use the normal approximation we need to assume that the 200 small businesses are
less than 10% of the entire population. We also need to check the rule of thumb:

- np=200 x 0.10=20>10 and
- n(1-p)=200 x (1-0.10)=180>10. So the data passes the rule of thumb.
The mean and standard deviation of this new sample were calculated in part (d):
- Mean: $\mu = np = 20$
- Standard deviation: $\sigma = 4.24$

Using z-scores and Table Z, calculate the probability that 15 or fewer will fail:
$P(X \le 15) = P\left(\frac{X - \mu}{\sigma} \le \frac{15 - \mu}{\sigma}\right) = P\left(Z \le \frac{15 - 20}{4.24}\right) = P(Z \le -1.18) = 0.1190$

We estimate the probability that 15 or fewer of the 200 small businesses will fail in their
first year of operation to be approximately 11.90%.
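
The table lookup can be reproduced with Python's standard library. This is a sketch under the same assumptions as the worked answer: the standard deviation is rounded to 4.24 and the z-score to two decimal places, as it would be when reading Table Z.

```python
from statistics import NormalDist

mu, sigma = 20, 4.24  # mean and standard deviation from part (d)

# Standardise 15 and round to two decimals, as when using Table Z
z = round((15 - mu) / sigma, 2)

# Proportion of the standard normal curve below z
prob_15_or_fewer = NormalDist().cdf(z)

print(z)                           # -1.18
print(round(prob_15_or_fewer, 4))  # 0.119
```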


(f) 2 marks
Firstly, the assumptions of the binomial model need to be satisfied:
- S = success/ failure condition: A success is that a small business fails in its first year
of operation, a failure is that a small business does not fail in its first year of
operation.
- P = Probability of success = the probability of a small business failing in its first year
of operation is 10% (thus p = 0.10).
- I = Independence = it is assumed that each small business is independent of each
other.
- N = a fixed number of small businesses is needed. For parts (d) and (e) this is 200.

Secondly the rules of thumb need to be satisfied:
- np=200 x 0.10=20>10 and
- nq = n(1-p)=200 x (1-0.10)=180>10. So the data passes the rules of thumb.
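
The rule-of-thumb check is easy to automate. The sketch below is illustrative only: the function name and the choice of "at least 10" as the comparison are assumptions, and either a strict or non-strict comparison gives the same result for the values in this question.

```python
def normal_approximation_ok(n: int, p: float, threshold: float = 10) -> bool:
    """Check that expected successes and failures both reach the threshold."""
    expected_successes = n * p
    expected_failures = n * (1 - p)
    return expected_successes >= threshold and expected_failures >= threshold


print(normal_approximation_ok(200, 0.10))  # True: np = 20, n(1-p) = 180
print(normal_approximation_ok(12, 0.10))   # False: np = 1.2
```

The second call shows why parts (b) and (c), with n=12, rely on the binomial table rather than the normal approximation.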

Question 2 (12 marks)
(a) 2 marks
Variable of interest: speed of cars
Unit of measurement: speed is measured in km/h

(b) 5 marks
Let y denote a car's speed (km/h). The proportion of cars for which y is greater than
110 is calculated using z-scores (standardising/forward method):
$P(y > 110) = P\left(\frac{y - \mu}{\sigma} > \frac{110 - \mu}{\sigma}\right) = P\left(z > \frac{110 - 105}{4}\right) = P(z > 1.25)$

We wish to use Table Z to find this proportion, but remember that Table Z shows the
proportion below z. So first we need to rewrite the proportion we are looking for (making an
adjustment, as the book and diagram don't match):
$P(Z > 1.25) = 1 - P(Z \le 1.25)$
Now using Table Z we find
$P(Z > 1.25) = 1 - 0.8944 = 0.1056$
The percentage of cars expected to obtain a speeding ticket is 10.56%.
NOTE: Some students may use a speed of 111 or more (reasoning that a car travelling at
110 isn't speeding but doing the limit). They will get full marks if they do this. If
so, they will find a z-score of 1.50 and a final answer of (1-0.9332) = 0.0668 = 6.68%.
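
Both the main answer and the alternative in the note can be checked with a normal model of mean 105 and standard deviation 4, the values used in the standardisation above. This is a minimal sketch using Python's `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

speed = NormalDist(mu=105, sigma=4)  # car speeds in km/h

# Table Z gives the area below z, so the upper tail is 1 minus the cdf
prob_over_110 = 1 - speed.cdf(110)    # z = 1.25
prob_111_or_more = 1 - speed.cdf(111)  # z = 1.50 (alternative reading)

print(round(prob_over_110, 4))     # 0.1056
print(round(prob_111_or_more, 4))  # 0.0668
```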

(c) 5 marks
The question asks us to find some score y (unstandardising) such that a car would be
considered a nuisance driver:
$P(Y < y) = 0.06$
Converting to z-scores, that is,
$P\left(z < \frac{y - 105}{4}\right) = 0.0600$
Using Table Z we can see that $P(z < -1.56) = 0.0594$, which is pretty close to the answer we
are looking for (you could also use $P(z < -1.55) = 0.0606$ OR $P(z < -1.555) = 0.0600$).

To find the y which produces a z value less than or equal to -1.56 we need to solve
$\frac{y - 105}{4} = -1.56$

That is, $y - 105 = -1.56 \times 4$, or $y = -1.56 \times 4 + 105$, or $y = 98.76$.
Similarly you can simply use the unstandardising formula:
$y = z\sigma + \mu = -1.56 \times 4 + 105 = 98.76$
To be classified as a nuisance driver a car needs to be travelling at 98.76 km/h or less.
- if you had used z=-1.55 then you would have gotten y=98.80
- if you had used z=-1.555 then you would have gotten y=98.78
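
The backward (unstandardising) calculation can also be done without a table lookup. The sketch below compares the table-based answer with the value from `statistics.NormalDist.inv_cdf`, which finds the z-value for a lower-tail area of 0.06 directly; it is offered as a check, not as the required method.

```python
from statistics import NormalDist

mu, sigma = 105, 4  # car speeds in km/h

# Table-based answer: y = z * sigma + mu with z read from Table Z
z_table = -1.56
y_table = z_table * sigma + mu

# Software answer: the speed below which 6% of cars fall
y_exact = NormalDist(mu=mu, sigma=sigma).inv_cdf(0.06)

print(round(y_table, 2))  # 98.76
print(round(y_exact, 2))  # 98.78
```

The software value agrees with the z=-1.555 alternative listed above, which is why any of the three table readings is acceptable.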

Question 3 (14 marks)
(a) 4 marks
The relationship between the number of cylinders and origin of these cars

                                  Origin of Car
Number of cylinders    America   European   Japanese   Total
4                           65         58         67     190
6                           64          4          6      74
8                          101          0          0     101
Total                      230         62         73     365
(b) 2 marks 67/190 = 35.26%
(c) 2 marks 67/365 = 18.36%
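
The two percentages use the same count (67 Japanese 4-cylinder cars) but different denominators: the 4-cylinder row total in (b) and the grand total in (c). A minimal sketch with the counts from the table above (the dictionary layout is an illustrative choice):

```python
# Counts by number of cylinders, in the order America, European, Japanese
counts = {
    4: (65, 58, 67),
    6: (64, 4, 6),
    8: (101, 0, 0),
}

japanese_4cyl = counts[4][2]
four_cyl_total = sum(counts[4])                          # 190
grand_total = sum(sum(row) for row in counts.values())   # 365

pct_within_4cyl = 100 * japanese_4cyl / four_cyl_total
pct_of_all_cars = 100 * japanese_4cyl / grand_total

print(round(pct_within_4cyl, 2))  # 35.26
print(round(pct_of_all_cars, 2))  # 18.36
```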
(d) 6 marks
Origin and the number of cylinders of these cars do appear to be associated. From the
table below (see the "% within Number of cylinders" rows) it can be seen that 100% of
8-cylinder cars are from America, whereas only 34.2% of 4-cylinder cars are from America.
The large difference in percentages indicates that origin and the number of cylinders of
these cars may be associated (it appears that the origin of the cars depends on how
many cylinders the car has or vice versa).
Conditional distribution table showing the relationship between origin and number of cylinders

                                                         Origin of Car
Number of cylinders                            America   European   Japanese    Total
4       Count                                       65         58         67      190
        % within Number of cylinders             34.2%      30.5%      35.3%   100.0%
6       Count                                       64          4          6       74
        % within Number of cylinders             86.5%       5.4%       8.1%   100.0%
8       Count                                      101          0          0      101
        % within Number of cylinders            100.0%        .0%        .0%   100.0%
Total   Count                                      230         62         73      365
        % within Number of cylinders             63.0%      17.0%      20.0%   100.0%



                                                         Origin of Car
Number of cylinders                            America   European   Japanese    Total
4       Count                                       65         58         67      190
        % within Origin of Car                   28.3%      93.5%      91.8%    52.1%
6       Count                                       64          4          6       74
        % within Origin of Car                   27.8%       6.5%       8.2%    20.3%
8       Count                                      101          0          0      101
        % within Origin of Car                   43.9%        .0%        .0%    27.7%
Total   Count                                      230         62         73      365
        % within Origin of Car                  100.0%     100.0%     100.0%   100.0%
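
The two SPSS tables differ only in which total each count is divided by: the row total (number of cylinders) or the column total (origin). The sketch below recomputes both conditional distributions from the counts; the function names are illustrative and not SPSS terminology.

```python
# Counts by number of cylinders, in the order America, European, Japanese
counts = {
    4: (65, 58, 67),
    6: (64, 4, 6),
    8: (101, 0, 0),
}


def row_percentages(table):
    """Percentages within each row (% within Number of cylinders)."""
    return {
        key: tuple(100 * c / sum(row) for c in row)
        for key, row in table.items()
    }


def column_percentages(table):
    """Percentages within each column (% within Origin of Car)."""
    column_totals = [sum(col) for col in zip(*table.values())]
    return {
        key: tuple(100 * c / t for c, t in zip(row, column_totals))
        for key, row in table.items()
    }


row_pct = row_percentages(counts)
col_pct = column_percentages(counts)

print([round(x, 1) for x in row_pct[4]])  # [34.2, 30.5, 35.3]
print([round(x, 1) for x in col_pct[4]])  # [28.3, 93.5, 91.8]
```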

Question 4 (20 marks)
(a) 4 marks

Time to accelerate is a quantitative variable, so an appropriate graph is a histogram. There
is too much data for a stem and leaf plot; a boxplot is for comparing a quantitative
variable across various groups. The histogram appears below. (To see any useful
information, it may be necessary to adjust the scale and limits of the histogram).


If students change the bin size then the following graph will be produced. Either graph is
acceptable.

(b) 4 marks
- The shape of the distribution of time is slightly skewed to the right with one mode at
approximately 14-15 seconds (unimodal). (The graph could also be said to be
approximately symmetric.)
- The centre is between 14-15 seconds;
- The spread is from about 11 seconds to 25 seconds (if students said there were outliers
then this would be 11-22 seconds).
- There do not appear to be any outliers (if students produced the first graph and said
there are outliers, this is acceptable).

(c) 4 marks
The mean time is 16.55 seconds
The standard deviation is 2.37 seconds

(d) 4 marks
The median time is 16.05 seconds
The IQR is 3.2 seconds

(e) 4 marks
As the distribution is slightly skewed it would be best to use the median and IQR as the
centre and spread for this distribution, as skewness affects both the mean and standard
deviation.

If students said in part (b) that the distribution was symmetric then they should report the
centre and spread as the mean and standard deviation.

- Students are not to include SPSS output in parts (c) and (d) BUT they may include it as an
appendix.


Time to accelerate from 0 to 100km/h (seconds)

                                                   Statistic   Std. Error
Mean                                                  16.547        .1720
95% Confidence Interval for Mean    Lower Bound       16.208
                                    Upper Bound       16.886
5% Trimmed Mean                                       16.418
Median                                                16.050
Variance                                               5.622
Std. Deviation                                        2.3711
Minimum                                                 11.6
Maximum                                                 24.8
Range                                                   13.2
Interquartile Range                                      3.2
Skewness                                                .892         .176
Kurtosis                                                .888         .351
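
Several entries in the SPSS output can be cross-checked against each other, which is a useful habit when transcribing values into an answer. The sketch below is illustrative only. Note that the sample size is not reported in the output; the value implied by (standard deviation / standard error) squared is an inference made here, not a stated figure.

```python
# Values transcribed from the SPSS descriptives output
std_error = 0.1720
std_dev = 2.3711
variance = 5.622
minimum, maximum = 11.6, 24.8
reported_range = 13.2

# Variance is the square of the standard deviation
variance_from_sd = std_dev ** 2

# Range is maximum minus minimum
range_from_extremes = maximum - minimum

# Std. error of the mean = sd / sqrt(n), so n is roughly (sd / se)^2.
# This is an inferred sample size, not one stated in the output.
implied_n = (std_dev / std_error) ** 2

print(round(variance_from_sd, 3))     # 5.622
print(round(range_from_extremes, 1))  # 13.2
print(round(implied_n))               # 190
```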
Question 5 (24 marks)
(a) 2 marks
Horsepower and vehicle weight are both quantitative variables.
(b) 4 marks
As both horsepower and vehicle weight are quantitative variables the most
appropriate graph to display both of these variables is a scatterplot.


(c) 4 marks
Form: Approximately Linear
Direction: Positive
Scatter: Medium
Outliers: There appears to be an outlier at vehicle weight 1400kg with a horsepower
of 225hp. This car deviates away from the rest of the linear form of the data and thus
is classified as an outlier.

(d) 4 marks
The most appropriate statistic to measure both strength and direction is the correlation
coefficient, r, since both variables are quantitative and the form of the scatterplot is
approximately linear. Using SPSS the correlation coefficient is found to be 0.836.
Students should NOT report $R^2$ here as this statistic measures strength ONLY.

Model Summary
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .836(a)    .699       .698                22.201
a. Predictors: (Constant), Vehicle weight (in kilograms)
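
For a regression with a single predictor, R Square is the square of the correlation coefficient, which is why it carries no sign and cannot show direction. A small check using the values reported in the Model Summary:

```python
r = 0.836  # correlation coefficient reported by SPSS

# Squaring removes the sign, so R Square measures strength only
r_squared = r ** 2

print(round(r_squared, 3))  # 0.699
```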


(e) 6 marks
The regression line to predict horsepower (HP) from vehicle weight (kg) is:
$y = -21.839 + 0.092x$
Where:
- y is horsepower (HP)
- x is vehicle weight (kg)

Coefficients(a)
                                      Unstandardized Coefficients   Standardized Coefficients
Model                                 B           Std. Error        Beta                          t         Sig.
1   (Constant)                        -21.839     6.345                                           -3.442    .001
    Vehicle weight (in kilograms)     .092        .004              .836                          23.015    .000

Students may have included the regression line in the graph on part (b) and this is fine and
will receive full marks.

(f) 4 marks
A car weighing 3000kg would have a predicted horsepower of 254.16HP:
$y = -21.839 + 0.092x = -21.839 + 0.092 \times 3000 = 254.161$
It is difficult to tell from the above graph if this would be an accurate prediction or
not. We need to extrapolate beyond the data to find the horsepower required for a
3000kg car and we cannot say with certainty that the graph will continue in a linear
form beyond the data given.
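
The prediction is a direct substitution into the fitted line. The sketch below wraps the SPSS coefficients in a small function; the function name is illustrative, and the code makes no attempt to judge whether a given weight lies inside the range of the data, which is the extrapolation concern raised above.

```python
INTERCEPT = -21.839  # (Constant) from the SPSS Coefficients table
SLOPE = 0.092        # horsepower per kilogram of vehicle weight


def predict_horsepower(weight_kg: float) -> float:
    """Predicted horsepower from the fitted regression line."""
    return INTERCEPT + SLOPE * weight_kg


predicted_3000 = predict_horsepower(3000)
print(round(predicted_3000, 3))  # 254.161
```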

Question 6 (12 marks)
(a) 3 marks
This is an observational study. In order to be an experiment, treatments (in this case
the treatment is the location of a person) have to be imposed (i.e. you have to be told
which treatment you are going to be doing/taking). Ethically you can't tell a person
where they can live and thus this is an observational study.
(b) 3 marks
Explanatory variable: location of residence
Response variable: incidence of premature death from heart disease and circulatory
disease.
OR
Parameter of interest: incidence of premature death from heart disease and
circulatory disease.
(c) 3 marks
In order to be a lurking variable it must affect both the explanatory and response
variables. An example of a lurking variable in this case may be a person's job.
Link to explanatory: As a person's job is located in Ipswich they have to move to this
area.
Link to response: It may be the person's job that is causing them to be sick and not in
fact where they live.
(d) 3 marks
In order to say that an explanatory variable (location of residence) causes some
change in the response variable (premature death), a well-designed EXPERIMENT is
needed. For an observational study you can only say that there appears to be a
relationship between the two variables of interest.

Key to errors
The marker of your assignment may have used some of the following key words to describe
errors in your assignment answers.
Define Variables: Symbols without some way of knowing what they refer to are
meaningless.
Title: Tables, like graphs, should be given contextual titles. Your assignment is a report and
your submission of your assignment is publication of this report so each graph and table
should be easy to read by anyone.
Label: Columns and rows of tables and axes of graphs should be given meaningful labels,
describing exactly what you have used in your table or graph. The numbers such as 0 and 1
from the data are not meaningful unless you give values to these numbers in the variable view
of the data file (see SPSS Exercise 4). Also in the variable view the variables need to have the
correct measure selected. See Assignment 1 solutions for comments about this.
Units: Units (such as cm) should be attached to numerical results or included with labels
where appropriate.
Printout: Raw output alone from computer software typically contains lots of information,
some of which may be relevant to the required answer. The marker will not decide for you
what is relevant and what is not. You must indicate this yourself, preferably by transcribing
the appropriate information into your answer.
Error: Your method and formula are correct but you have made an arithmetic error.
Rounding: As a general rule, final answers should be rounded to one or two more significant
figures than the original data. Too severe rounding, incorrect rounding or ridiculous precision
in final answers will be penalised. Rounding however should only occur at the final answer.
For intermediate calculations use whatever precision your calculator provides. This was
highlighted in Assignment 1. (These procedures were followed in the worked solutions even
though in some places intermediate steps may only show numbers rounded to four decimal
places.)
Working: Incorrect numerical answers without working receive no marks. Incorrect
numerical answers with working may attract part marks. Working includes appropriate
formulae, values of relevant statistics or variables, intermediate calculations, statement of
assumptions, etc.
Consequential Marking: If you make an error in one part of a question which impacts
further parts of the question, consequential marking is used where possible.
In Context: You need to finish off each question by putting your answer into context (in
relation to the question asked). Often this just requires a sentence at the end of the question
stating what the answer you found means.
