
TOPIC 7

Sampling distributions
Prepared by Robin Boyle and Hilton Short for the unit team

Contents
Introduction
Objectives
Learning resources
Textbook
Sampling issues
Non-probability (non-random) sampling
Probability (random) sampling
Bias and self-selection
Sampling distributions
Sampling distribution of the sample mean, X̄
Calculating probabilities for the sample mean, X̄
Sampling distribution for the sample proportion, p
Calculating probabilities for the sample proportion, p
Summary
Review exercises
Further resources
Suggested answers
Exercises


Introduction
In topics 1 to 4 we concentrated on exploring and describing the features of a given set of data. If the data set under investigation relates to the whole population (that is, if we have conducted a census) then clearly what we reveal from the data relates to that entire population. Frequently, however, we cannot take a census but instead must rely on a sample, that is, we gather data on a fraction of the population. The aim, however, is to use what we find in the sample to draw conclusions about the population as a whole.

Clearly if we do take a sample (for example, a sample of 40 workers from a firm's workforce of 3 500, or a sample of 2 000 voters from a population of 10 000 000 voters in Australia) then generalising the sample results to the population is almost certain to result in error. That is, the sample is unlikely to look exactly the same as the population from which it was drawn. If error is so likely to occur then why do we bother using sample data? The answer is that by using random sampling combined with probability we can largely control and predict the magnitude of the error that is likely to occur. In fact, statistical theory based on a concept known as the sampling distribution of a sample statistic shows that relatively small samples (at great savings of time and money) can provide remarkably high degrees of accuracy in estimating features of a population. By making use of the probability concepts covered in topics 5 and 6, we will be able to draw conclusions about whole populations on the basis of random sample data.

This topic is split into two broad areas. First, we look at sample surveys as a method of collecting data. Second, we introduce the concept of the sampling distribution, the key theoretical concept for statistical inference.

Objectives
At the completion of this topic you should be able to:
• explain the concept of sampling from populations, describe the various sampling techniques, and explain their uses and their effects on the sample results and inference about populations
• explain the sampling distributions for the sample mean and the sample proportion
• use the sampling distributions to calculate relevant probabilities.

Learning resources
Textbook
Berenson, ML, Levine, DM, Krehbiel, TC, Watson, J, Jayne, N, Turner, L & O'Brien, M 2010, Basic business statistics 2: concepts and applications, Pearson Education, Australia, chapter 7.


Sampling issues
We take samples rather than a census because a sample (of just 400 respondents, say) can save a great deal of time and money compared to a census (of, say, 10,000 employees in a company or 10,000,000 voters in a country), and in circumstances where items are destroyed by testing (like finding out the average life of a new type of light globe, or a new style of car tyre), a census does not make sense. Perfect accuracy is not always, and perhaps rarely, justified:
• If a new product is expected to earn you $500,000 over five years, you can't justify spending $1,000,000 to interview every Australian (a census) to see if they would buy the product. You need a cost-effective data collection procedure.
• If you need to make a decision in three weeks about accepting a new contract, but need some data to help make that decision, you can't wait three months for a census of all people/items to be carried out, processed and analysed. You need a timely data collection procedure.
• If you need to tell consumers the average life of a new make of tyre, a census means you can be perfectly honest with the consumer, but you would have nothing to sell (all tyres destroyed).

There are many issues related to sampling. Some of these sampling issues are covered in sections 7.4 and 7.5, pp. 212–220 of Berenson et al (Aust.) 2010, and we covered that material in topic 1. However, before proceeding to inferential statistics techniques, it helps to quickly review some of these issues. One reason for such a review is that all of the inferential statistics techniques you study for the rest of this course depend on simple random sampling being used when selecting the samples taken from the population. Thus, you need to have a clear understanding of random sampling before proceeding. In practice, samples are often not taken randomly. We must be able to recognise when non-random samples have been taken and what impact they might have. We can classify sampling procedures into two main categories:
• non-probability (or non-random) sampling
• probability (or random) sampling.

Non-probability (non-random) sampling


These are procedures for when:
• the probability of any particular element of the population entering the sample is unknown, and
• some individual people or items in the population have a greater chance of selection than others.

These non-random procedures may have been chosen on the basis of lower cost or convenience, or even because the researcher specifically wishes to gather information from a particular target group. They may also have been used in ignorance of the necessity (not desirability) for probability (random) sampling.


Non-random techniques include:
(a) Convenience. Members of samples are chosen primarily because they are both readily available and willing to participate. Examples include asking members of just one class to respond to a survey meant to be about all students at a university, or asking shoppers in a mall to participate in a survey meant to reflect broad community opinion.
(b) Purposive. Members of the sample are chosen specifically because they are not typical of the population. For example, one would present a new device to experts in the field and gauge their reaction or ability to use the device. If the experts cannot use the product, then the general population will have difficulty.
(c) Judgment. The researcher chooses members of the sample because he/she believes they are representative of the population. For example, in trying to gain student opinions about a course, an instructor might choose a few very bright students, a few average and a few struggling, plus a few keen ones and a few not-so-keen students.

The main advantages of these non-random sampling methods are lower time and cost. However, statistical inferential analysis is forfeited. Non-probability procedures do have their place if employed properly. In particular, they are commonly used in the exploratory stages of research, for example, to test out the wording on a questionnaire. Defining the nature of the problems to be researched in detail can be determined by some preliminary, smaller scale exploratory research based on non-random methods.

The main concern with non-random methods is that they are very frequently employed in ignorance of their shortcomings and of the desirability of random sampling methods. Worse still, inferential techniques are frequently applied to non-random sample data. Very important decisions and conclusions may be made on the basis of such faulty data gathering and analysis procedures.

Probability (random) sampling


In random sampling, each element of the population has a known (non-zero) chance of being included in the sample chosen. Sampling procedures in this category include the following:
(a) Simple random. In simple random sampling, every element in the population has an equal chance of being chosen in a particular sample. Alternatively, we could say that all sample combinations of a given sample size have an equal chance of selection. Usually, a list of the population or a list of eligible elements is compiled and this is called a sampling frame. For example, at a university we could list all 7600 students in alphabetical order, then number them from 1 to 7600. A random number generator (using a computer) is then used to select the numbered population elements for the sample. (A short sketch of this procedure appears after this list.)
(b) Systematic. This is similar in concept to simple random sampling but easier to apply. In this approach we again use the sampling frame and number all the elements. We select a starting point between 1 and k randomly, then select every kth element from the population from that point. The value k will be determined by the population size and the size of the sample desired.


(c) Stratified. The population is divided into strata (sub-groups). The proportion of the population in each stratum is calculated. The sample is chosen so that the proportions of the strata in the sample are identical with the calculated population proportions. Then the members in each stratum for the sample are randomly chosen (typically from a separate sampling frame for each stratum). Examples of strata are gender (male and female), age group (young adult, middle age, elderly) or state (Victoria, Queensland, etc.).
(d) Cluster. In cluster sampling we divide the population into groups or clusters based on some criterion, usually geographical area. We randomly select some groups from the entire set of groups or clusters formed and then take either a census or samples from the chosen groups. Suppose we wish to survey some of our business customers, who are dispersed across all capital cities in Australia, for in-depth interviews about our product and service. We could consider each capital city to be a cluster or group and choose, for example, two cities randomly from all capital cities. We could then survey all of our customers in these cities or choose a random selection of customers in these cities. The advantage of the cluster sampling procedure is mainly time and cost savings. In this example, we reduce our costs as we need to travel to only two cities. With a simple random sample, on the other hand, we may have needed to visit all capital cities for the interviews.

For all the probability (random) sampling procedures, the sampling distributions for estimators (sample means or proportions) are easily manageable and thus probabilities for ranges of values of sample estimators can be calculated. We can use this information to build inferential procedures for the population parameters of interest.
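As a concrete illustration, here is a minimal Python sketch of drawing a simple random sample and a systematic sample from a numbered sampling frame; the 7 600-student frame and the sample size of 40 are illustrative assumptions, not figures from a real study.

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

# Sampling frame: students numbered 1 to 7600 (illustrative)
frame = list(range(1, 7601))
sample_size = 40

# Simple random sampling: every element has an equal chance of selection
simple_random_sample = random.sample(frame, sample_size)

# Systematic sampling: random start between 1 and k, then every kth element
k = len(frame) // sample_size          # here k = 190
start = random.randint(1, k)
systematic_sample = frame[start - 1::k][:sample_size]

print(sorted(simple_random_sample)[:5], "...")
print(systematic_sample[:5], "...")
```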

Bias and self-selection


In all of the non-probability (non-random) cases, and with poorly constructed probability sampling procedures, you need to be aware of systematic biases that can occur. For example, shopping mall surveys to ascertain opinions on general topics (voting intentions, gun control) are unlikely to be representative of the entire population, as some people do not shop regularly while others typically do. The results of such surveys will reflect the opinions of those in the mall at the time rather than the opinions of the population in general. Some other biases are:
• Response and non-response bias. Can occur if personal questions (income, alcohol consumption, tax, etc.) are asked, which some participants find embarrassing and thus do not answer (non-response bias) or for which they have an incentive to lie (response bias).
• Interviewer bias. Occurs with personal interviews, where the attitude or demeanour of the interviewer can influence the response.
• Self-selection bias (an important consideration). Often the participants in surveys are self-selected because they have to mail a questionnaire, phone in to a certain number, or be present at a location when the survey is conducted. These respondents are typically motivated to respond and may have views different from the population concerned. A common mistake is to take a customer satisfaction survey in-store and generalise the results to the population. Those that are shopping in the store are more likely to be happy with the store service, ambience, etc. (otherwise they may shop elsewhere!). Results from such surveys can be misleading, providing a false indication of consumer satisfaction.


It does, however, circumvent the need for a costly sampling procedure. A small bias may be acceptable for a reduced sampling cost. The way a survey is conducted can influence the results and lead to misleading conclusions. When examining the results of a survey you should always ask about the structure of the survey, the questions asked in the survey and the sampling procedures used. This additional information will give you a better idea about the validity and applicability of the survey results.

Reading
You are now strongly advised to re-read Berenson et al (Aust.) 2010, sections 7.4 and 7.5, pp. 212–220.

Exercise 7.1
1 Often a sample is preferred to a population census. Why would this be the case?
2 If you were conducting research to help determine possible new product concepts, would you most likely use a probability or non-probability procedure?
3 If larger samples are more representative of populations, why don't we take as large a sample as possible?
4 To gauge the approval of our customer service we write to all customers that have purchased our product in the last three months and ask them to fill out a questionnaire. Is this a sample or a census? Are there any problems with this sampling procedure in terms of the validity of the results?

Sampling distributions
From your readings so far you should understand that:
1 sampling is necessary as conducting a census is impractical in most real-life situations
2 a random sampling technique should be used when collecting data and any potential biases are to be avoided
3 results from a well-chosen sample will give (hopefully) good approximations to the true population values but, nevertheless, some error is inevitable.
In this section we expand on the third point by introducing the concept of a sampling distribution, which explains the relationship between population parameter, sample statistic, random error and probability. Consider exhibit 7.1:
Exhibit 7.1: Parameters, statistics and error

Summary measure      Population    Sample    Error
Arithmetic mean      μ             X̄         X̄ − μ
Proportion           π             p         p − π

In practice we use our sample results (X̄ and p) to approximate the corresponding population results (μ and π). The error that occurs in doing so can never actually be calculated exactly. However, sampling distributions provide the basis on which this random sampling error can be gauged, predicted and controlled.


Reading
Please read the Packaging Tea Tree Shampoo case study in Berenson et al (Aust.) 2010, p. 198 and section 7.1, p. 199 which introduces the concept of the sampling distribution.

We now proceed to look at the sampling distribution of the mean. Later we will consider the sampling distribution for proportions.

Sampling distribution of the sample mean, X̄


Reading
Please read Berenson et al (Aust.) 2010, section 7.2, pp. 199208.

If you are feeling confused about the concepts covered in the above section, read the following paragraphs and then reread the text. In discussing the sampling distribution of X̄ the following points should be kept in mind. There are three distributions we can talk about:
• The first is the population distribution, which may or may not be normal. For example, the distribution of annual incomes is likely to be skewed to the right.
• The second is the sample distribution, which is the distribution of the sample you have taken. Generally a sample will possess features similar to the population from which it was drawn. Thus household incomes from a random sample of, say, 100 households are still likely to be skewed to the right, to cover a similar range of incomes and to have a mean income and standard deviation not that different from the population mean and population standard deviation.
• The third distribution is the sampling distribution of X̄ which, for a given sample size, is the probability distribution of how the sample mean, X̄, varies around the population mean, μ. Thus if we took a sample of size n = 100 we can talk about the sampling distribution of X̄ for n = 100. (There is in fact a sampling distribution for each possible sample size, for example, n = 20, n = 550 or n = 1 760, etc.) See figure 7.4 on p. 200 in the textbook, which shows the sampling distributions for n = 2, 5 and 30.

Clearly if we only take a sample of 100 households, the mean income of those 100 households is unlikely to be exactly the same as the mean income of all households in the population. In fact the sample mean could fall above μ, below μ or spot on μ. Further, the sample mean should fall relatively close to the population mean, μ. The reason for this is that generally a sample will have a range of values similar to the range of values in the population. For example, if household incomes range from $20 000 to $60 000 for all households, a random sample of say 20 households is likely to contain incomes over a not dissimilar range. As the mean of a sample must fall towards the centre of the sample (recall that the mean is a measure of central tendency) it must therefore fall close to the value of the population mean. In fact, by the central limit theorem (see p. 206 of the textbook) we can say:
• If the population distribution is a normal probability distribution, the sampling distribution of X̄ will be normal no matter what the size of n.
• If the population distribution is not normal, the sampling distribution of X̄ will be normally distributed or approximately normally distributed for n sufficiently large, in particular for n ≥ 30.


In both cases:

μ_X̄ = μ

and (for an infinite or effectively infinite population)

σ_X̄ = σ / √n

where μ_X̄ and σ_X̄ are the mean and standard deviation of the sampling distribution respectively. The first equation means that, on average, the sample mean will be the same as the population mean. In the second equation, the term σ_X̄ is known as the standard error of the sample mean. It is called the standard error since it is the average error that could be made in using the value of X̄ as an estimate of the value of the population mean, μ. For a sample size greater than one, the standard error is smaller (by a factor of √n) than the standard deviation of the whole population. The reason is that possible sample means for a given sample size will tend to fall much closer to the population mean than individual values in the population. Because of the formula, the bigger n is, the smaller the standard error and therefore the greater confidence we have that the sample mean will be close to the population mean. However, there are diminishing returns to increasing the sample size, n. The standard error of the mean varies inversely with √n. Thus we need to increase n by a factor of 4 in order to halve the standard error, and increase n by a factor of 9 to cut the standard error to one third. At the same time the standard error varies directly with the population standard deviation, σ. Thus the greater the population standard deviation, the greater the standard error; for the same n, the chance of a large error is greater. Alternatively, we can say we need a bigger n for the same level of precision. The less variable a population is, the smaller the sample can be.
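A minimal simulation sketch (Python with NumPy; the skewed income population, the sample sizes and the number of repetitions are illustrative assumptions rather than values from the text) shows both results at work: the sample means centre on μ and their spread shrinks like σ/√n even though the population itself is skewed.

```python
import numpy as np

rng = np.random.default_rng(42)

# A right-skewed "population" of household incomes (illustrative numbers only)
population = rng.gamma(shape=2.0, scale=20_000, size=1_000_000) + 20_000
mu, sigma = population.mean(), population.std()
print(f"population: mu = {mu:,.0f}, sigma = {sigma:,.0f}")

for n in (25, 100, 400):
    # 10 000 sample means, each from a random sample of size n
    # (sampling with replacement, which is effectively the same for a huge population)
    sample_means = rng.choice(population, size=(10_000, n)).mean(axis=1)
    print(f"n = {n:3d}:  mean of sample means = {sample_means.mean():,.0f}  "
          f"sd of sample means = {sample_means.std():,.0f}  "
          f"sigma/sqrt(n) = {sigma / np.sqrt(n):,.0f}")
```

Each quadrupling of n roughly halves the observed spread of the sample means, in line with the √n relationship described above.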

Calculating probabilities for the sample mean, X̄


In order to calculate probabilities for the sample mean we need to know the distribution of the sample mean. As the preceding section suggests, if the conditions for the central limit theorem hold then the sampling distribution will be normal or approximately normal. We can therefore make use of the procedures covered in topic 6 for calculating probabilities from a normal distribution. The one difference when dealing with the sampling distribution of X̄ is that the Z transform formula becomes:

Z = (X̄ − μ_X̄) / σ_X̄

Consider the following examples.


Application 7.1: Production task

Suppose a bank claims that the time taken to complete a transaction task by phone is normally distributed with μ = 25 seconds and σ = 4 seconds. To test this claim, you take a random sample of n = 9 individuals who have used the phone system. The average time taken to complete the task by the 9 individuals was 28 seconds. Is this a likely result for the assumed distribution? Does this result fit in with what the bank is claiming?


We can do some probability calculations as follows. You will notice that since we are now in a sampling situation, we do not use the standard deviation σ (which describes how individual values vary), but calculate the standard error, σ_X̄ (i.e. how a sample mean from a sample of size n = 9 could vary). Since the population is assumed to be normal, we assume:

μ_X̄ = μ = 25 seconds

σ_X̄ = σ / √n = 4 / √9 = 4/3 = 1.33 seconds.

What is the probability that X̄ could be equal to or exceed 28? We use the Z transformation:

Z = (X̄ − μ_X̄) / σ_X̄ = (28 − 25) / 1.33 = 2.25

P(X̄ ≥ 28) = P(Z ≥ 2.25) = 1 − 0.9878 = 0.0122

Note: to calculate the probability use statistical software or look up Z = 2.25 in the standardised normal probability table (table E.2, p. 576 in the textbook).

This probability is 0.0122, that is, just over 1% or about 1 chance in 100, and surely raises some questions, in the same way as some of our exercises and applications in topic 6. Is the population average time really 25 seconds, or something greater? Is our sample representative of the population? What we do know is that if the population average is 25 and the population standard deviation is 4, there is slightly more than 1 chance in 100 that the average time taken to complete the task for a sample of 9 elements is 28 seconds or more. How do you react if something occurs that was only meant to have a chance of about 1 in 100 of occurring? Some decision makers would conclude that this is an unlikely occurrence and conclude that it is more likely, based on the sample result, that μ is not 25 but that the population average is greater than 25 seconds (that is, μ > 25). Other decision makers, who have different sensitivity to probabilities, might adopt a different attitude, conclude that a chance of 1 in 100 is bearable, and stay with the original claim that μ = 25 and σ = 4.

So what is your personal probability threshold? If you assumed something to be true (in this case that μ = 25 and σ = 4) yet something occurred (in this case a random sample of n = 9 gives a mean of X̄ = 28) which had a chance of only about 1 in 100 of occurring, would you stick with the original assumption? Or would you conclude that the original assumption was false?

Say instead the random sample of 9 had given a sample mean of 25.8 seconds. Would you question the original assumption of μ = 25 and σ = 4? Confirm that P(X̄ ≥ 25.8 seconds) = 0.2743 or 27.43%.


Thus, if our original assumption is true (μ = 25 and σ = 4), there is a chance of about 1 in 4 of obtaining a random sample of 9 from the whole population with a sample mean of 25.8 seconds or more. On the basis of the probabilities, we should stick with the original assumption.

If instead the sample mean for the 9 observations had been 31.4, we still have

σ_X̄ = 4 / √9 = 1.33

and the Z score would be

Z = (31.4 − 25) / 1.33 = 6.4 / 1.33 = 4.81

If you check with computer software or the normal probability tables, you will see that the Z scores effectively run out at 3.99. Now it is not impossible to get a result that is 4.81 standard errors away from the mean, but it is most unlikely. On the basis of the probabilities we would have to conclude that the original claim (μ = 25 and σ = 4) is incorrect. We are in effect saying that, based on probabilities, a sample mean of 31.4 could not occur if μ = 25 and σ = 4. But the sample mean did occur (and is correct as you, as the statistician, have undertaken the sample!), hence we would have to conclude that for the particular population under question, μ cannot be 25 but must be greater than 25. (Or it could be that σ is somewhat greater than 4, but if your sample has a standard deviation close to 4 then that would indicate that the problem is with μ.) This example illustrates one way in which statisticians use probabilities to draw inferences. We will take up that type of thinking in detail when we study hypothesis testing in topic 9.
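If you prefer to let software do the table look-ups, a short sketch (assuming SciPy is available) reproduces the three tail probabilities used in Application 7.1.

```python
from math import sqrt
from scipy.stats import norm

mu, sigma, n = 25, 4, 9
se = sigma / sqrt(n)                      # standard error of the sample mean, 1.333...

print(norm.sf(28, loc=mu, scale=se))      # P(X-bar >= 28), about 0.0122
print(norm.sf(25.8, loc=mu, scale=se))    # P(X-bar >= 25.8), about 0.274
print(norm.sf(31.4, loc=mu, scale=se))    # P(X-bar >= 31.4), effectively zero (Z is about 4.8)
```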
Application 7.2: Average spend

Last financial year, a detailed study of customers' spending patterns proved that the average (mean) spending of shoppers in our store was $245 with a standard deviation of $60. Spending was quite severely skewed to the right. We wish to carry out some analysis on this year's customers to see if there has been any change in spending patterns since last year.
(a) Initially, it was proposed to take a random sample of 16 of this year's customers. Your advice was: given that spending was severely skewed to the right, a sample of size n = 16 is too small for us to confidently invoke the Central Limit Theorem (CLT); therefore, we recommend a sample of size 30 or more.
(b) Eventually, we took a random sample of 100 of this year's customers and found their average expenditure was $230. What does this result tell us about average expenditure for this financial year compared to last financial year?
With a sample size of 100 shoppers we can invoke the CLT and the problem would be solved as follows:

Z = (X̄ − μ_X̄) / σ_X̄ = (230 − 245) / (60 / √100) = −15/6 = −2.5

P(X̄ ≤ 230) = P(Z ≤ −2.5) = 0.0062

Note: to calculate the probability use statistical software or look up Z = −2.5 in the standardised normal probability table (table E.2, p. 576 in the textbook).


Interpretation of the result suggests that if the population mean is $245 and the standard deviation is $60, it is unlikely (0.62%) that a sample of 100 customers will have an average spending of $230 or lower. Given that a sample of 100 customers was taken and the sample mean was as low as $230, what conclusions can be made? There are really two alternative conclusions:
1 Our assumption about the population mean spending (or standard deviation, or both) may be wrong. We may be over-estimating the average spending of customers. Our sample result of $230 is more consistent with a population mean below $245 and closer to $230.
2 The mean spending is as assumed (that is, the same as last year), and the reason that the sample mean was as low as $230 was the sampling process. The sample generated was (due to bad luck) not representative of the population. (However, our preceding calculations determined that the chance of this was only 0.62%, less than a 1% chance.)

It is up to the researcher to decide. But in general we would adopt alternative 1, and conclude (on the basis of the probabilities, 0.62%) that mean expenditure for all customers for this year is less than $245. Our sample result is consistent with μ < $245.
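The 0.62% figure can be checked the same way (again assuming SciPy):

```python
from math import sqrt
from scipy.stats import norm

# P(X-bar <= 230) when mu = 245, sigma = 60 and n = 100
print(norm.cdf(230, loc=245, scale=60 / sqrt(100)))   # about 0.0062
```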
Exercise 7.2
Complete problem 7.2 in Berenson et al (Aust.) 2010, p. 208.

Exercise 7.3
Complete problem 7.8 in Berenson et al (Aust.) 2010, p. 209.

Exercise 7.4
1 What do you understand by the concept of a sampling distribution?
2 Suppose I wish to be 95% confident that the sample mean from a sample will be within 10 units of the unknown population mean.
(a) Given that I know the population standard deviation, how could I achieve the objective? Do I need to know if the population is normally distributed?
(b) Say we estimate a population has a standard deviation of $75. What sized sample is required to ensure the sample mean falls within $10?
3 The time taken to complete a unit of output for an automated process is normally distributed with mean equal to 45 seconds and standard deviation equal to 4 seconds. Suppose a sample of 16 units is taken and the completion time measured.
(a) What is the probability that a single unit of output will take longer than 47 seconds to be completed?
(b) What is the probability that the average of the sample of 16 units will exceed 47 seconds?


Sampling distribution for the sample proportion, p


In many cases we are interested in the average of a variable, such as the average time to complete a task or average expenditure of customers. In such instances we use the sampling distribution of the sample mean, X̄, as described in the previous section. In other cases, though, the variable of interest is the proportion of a population that possesses some attribute or characteristic: for example, the proportion of customers that are repeat customers, the proportion of accounts that are in error or the proportion of voters who will vote for a particular party. Again, we generally have to use sample information (in this case, the sample proportion) to deduce or infer information on the population proportion. Understanding the relevant sampling distribution is important. In this section we demonstrate how we can use a sample proportion (p) to make inferences about the equivalent population proportion (π).
Reading
Please read Berenson et al (Aust.) 2010, section 7.3, pp. 20910.

The sampling distribution for p is basically an extension of the binomial distribution when large samples are being taken. If we were taking a random sample of 10 items to determine the number of defectives, we would use the binomial distribution. But if we can take a random sample of 100, or even 500, we can concentrate on the proportion (rather than the number) of defectives in each sample and use the normal distribution rather than the binomial. The reason is that, provided the sample size is large enough (see below), the binomial distribution can be approximated by the normal distribution, which is a far easier distribution to work with. The sample size is considered to be large enough if:

nπ ≥ 5, and n(1 − π) ≥ 5

Then the distribution of p will be approximately normally distributed about a mean of π and with a standard deviation (or standard error) of:

σ_p = √(π(1 − π) / n)

We then proceed to calculate probabilities associated with given ranges by using the Z transformation, because the variable is normal. It also gives us a framework on which we can build a formal procedure for drawing conclusions about π with associated probability statements and error statements.

Calculating probabilities for the sample proportion, p


In order to calculate probabilities for the sample proportion when n is large enough (as described above), we need to know the distribution of the sample proportion. As the preceding section suggests, if the sampling distribution is normal or approximately normal, then the Z transform can be used to calculate areas under the curve. In this case, the Z transform becomes:

Z = (p − π) / σ_p
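A small simulation sketch (assuming NumPy; the value π = 0.05 and the sample sizes are illustrative choices) shows both the large-sample condition check and the standard-error formula in action.

```python
import numpy as np

rng = np.random.default_rng(0)
pi = 0.05                                # assumed population proportion (illustrative)

for n in (10, 100, 500):
    ok = (n * pi >= 5) and (n * (1 - pi) >= 5)        # normal-approximation conditions
    # 10 000 simulated sample proportions, each from a sample of size n
    p = rng.binomial(n, pi, size=10_000) / n
    se = np.sqrt(pi * (1 - pi) / n)                   # theoretical standard error
    print(f"n={n:4d}  conditions met: {ok}  "
          f"sd of simulated p = {p.std():.4f}  theory = {se:.4f}")
```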


Exercise 7.5
Complete problem 7.12 in Berenson et al (Aust.) 2010, p. 211.

Exercise 7.6
Complete problem 7.16 in Berenson et al (Aust.) 2010, p. 211.

Exercise 7.7
Complete problem 7.18 in Berenson et al (Aust.) 2010, p. 212.

Application 7.3: Proportion of defectives

A manufacturer of electronic components has found the historical proportion of defective output from a particular machine is 5%. From time to time, samples are taken to maintain quality control, to see that the defect rate does not deteriorate beyond this. On each testing, a sample of 400 units is randomly drawn and a warning is initiated if the number of defects is 30 or more. What is the probability associated with this rule?

The first thing to do is assume that the machine is producing to historical standard, thus the current defect rate is assumed to be 5%. We then calculate the sample proportion indicated by our rule:

n = 400, X = 30, p = X / n = 30/400 = 7.5% (see formula 7.6 in Berenson et al [Aust.] 2010, p. 209)

The rule suggests that if the proportion of defectives for a particular sample is 7.5% or more, we would question whether the historical defect rate was being maintained. We then calculate the probability that a randomly selected sample from a population with a defect rate of 5% will yield a sample proportion of 7.5% or more. The calculations are as follows. Since nπ = 400 × 0.05 = 20 and n(1 − π) = 400 × 0.95 = 380, which both exceed 5, the sample proportion is approximately normally distributed, with:

μ_p = π = 0.05, σ_p = √(0.05 × 0.95 / 400) = 0.0109

Since the sampling distribution of the sample proportion p is normally distributed, we can use the standardisation rule (formula 7.8 in Berenson et al (Aust.) 2010, p. 210) to calculate probabilities:

Z = (p − π) / σ_p = (0.075 − 0.05) / 0.0109 = 2.29

P(p ≥ 0.075) = P(Z ≥ 2.29) = 1 − 0.9890 = 0.0110 = 1.1%

Note: to calculate the probability use statistical software or look up Z = 2.29 in the standardised normal probability table (table E.2, p. 576 in the textbook).

This result tells us that, given the machine maintains its historical defect rate of 5%, there is a very small, almost negligible chance that a random sample of 400 units will produce a sample defect rate of 7.5% or more. This suggests, on balance, that if ever a sample is taken that produces a sample defect rate of 7.5% or more, then it is more likely that the actual defect rate at the time the sample was taken was greater than the historical defect rate of 5%, and we would argue that the machine needs adjustment to bring it back to 5%.


If a random sample of n = 400 showed 18 defective items, should we stop the machine? Clearly not, as p = 18/400 = 0.045 or 4.5%. The sample result is less than the acceptable historical rate of 5%.

If a random sample of n = 400 showed 22 defectives, should we stop the machine? Perhaps the answer is still clearly not. The reasoning is as follows. If the machine is not malfunctioning, and is producing 5% defectives in the long run, then out of 400 randomly selected units we would expect to get 5% of 400, or 20 defectives. Clearly there could be some variation around this figure: 18, 19, 21 or 22 defects could occur due to the random sample selection process. Other values are not out of the question either. Out of interest, we calculate the following probability. If n = 400 and 22 defectives occur,

p = 22/400 = 0.055 or 5.5%

Z = (p − π) / σ_p = (0.055 − 0.05) / 0.0109 = 0.46

P(p ≥ 0.055) = P(Z ≥ 0.46) = 1 − 0.6772 = 0.3228 = 32.28%

Thus, even if the true overall defective rate is 5%, there is a 32.28% chance that a random sample of size 400 will produce at least 22 defective items. Confirm the following: P(25 or more defectives in 400) = P(p ≥ 0.0625) = 0.1251 or 12.51%. Generally, statisticians would consider this to be a reasonably high probability, and not low enough to conclude that the machine is playing up and producing more than 5% defectives. By using a cross-over point of 30 defectives in 400 (or 7.5% in 400), the machine engineers are using a probability of about 1% to indicate that a sample result is inconsistent with the assumed historical rate of 5%.
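As a check of the figures in Application 7.3, the sketch below (assuming SciPy) computes the same tail probabilities from the normal approximation, and adds the exact binomial equivalents only to show that the approximation is reasonable here.

```python
from math import sqrt
from scipy.stats import norm, binom

pi, n = 0.05, 400
se = sqrt(pi * (1 - pi) / n)              # standard error of p, about 0.0109

# Normal approximation used in the text
print(norm.sf(0.075, loc=pi, scale=se))   # P(p >= 7.5%), about 0.011
print(norm.sf(0.055, loc=pi, scale=se))   # P(p >= 5.5%), about 0.32
print(norm.sf(0.0625, loc=pi, scale=se))  # P(p >= 6.25%), about 0.125

# Exact binomial equivalents: P(X >= 30), P(X >= 22), P(X >= 25) out of 400
print(binom.sf(29, n, pi), binom.sf(21, n, pi), binom.sf(24, n, pi))
```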
Application 7.4: Market share

The previous example illustrated that with sample results (especially when sampling for proportions) there can be quite a deal of variation around the expected population parameter. The reason is that a sample (being relatively small) is unlikely to be perfectly representative of the population from which it was drawn. The following application provides another example of sample variation and how we can use probability to make judgments about sample data. Suppose the market share for a brand is 20%. If we survey 100 customers, can we calculate the probability that the proportion of the group who indicate that they buy our brand will be somewhere between 15 and 25% inclusive?


That is, if π = 20%, what is the probability that a randomly selected sample will produce a sample proportion between 15% and 25%? We use sampling theory to find:

P(0.15 ≤ p ≤ 0.25)

This calculation is possible if we know that p has a normal or approximately normal distribution. Since nπ = 100 × 0.20 = 20 and n(1 − π) = 100 × 0.80 = 80, which both exceed 5, n is large enough for us to assume the sampling distribution for the sample proportion is approximately normally distributed, with

μ_p = π = 0.20, σ_p = √(0.20 × 0.80 / 100) = 0.04

Since p is normally distributed we can use the standardisation rule to calculate probabilities:

Z1 = (0.15 − 0.20) / 0.04 = −1.25
Z2 = (0.25 − 0.20) / 0.04 = 1.25

P(0.15 ≤ p ≤ 0.25) = P(−1.25 ≤ Z ≤ 1.25) = 0.8944 − 0.1056 = 0.7888 = 78.88%

Note: to calculate the probability use statistical software or look up Z = −1.25 and Z = 1.25 in the standardised normal probability table (table E.2, p. 576 in the textbook).

The result suggests that, given the population proportion (that is, market share in this example) is 20%, a sample of size 100 could show a great deal of variability. In particular, the probability that a sample proportion could fall somewhere in the range 15–25% is quite high, almost 0.8. That is, a sample proportion in this range is likely. The chances of a sample proportion below 15% (or above 25%) are each still just over 10%. This example illustrates why statisticians urge caution about drawing hasty conclusions about population parameters (for example, μ and π) from the raw sample results (that is, X̄ or p) without:
• taking into account sampling variability (using the standard errors σ_X̄ and σ_p), and
• assessing the probability of occurrence.

Out of interest, rework the above problem, but for a sample of size 1000 customers. As indicated earlier, the sample proportion in fact follows a binomial distribution, as all of the circumstances underlying the binomial are present. Given the large sample sizes typically involved for proportions, the normal distribution provides an adequate approximation to the binomial. It simplifies calculations and avoids the need for extensive tables. It also emphasises the importance of the normal distribution as the limiting distribution for many other distributions. Second, you should be careful in this section not to confuse proportions of interest with the probabilities calculated for ranges of proportions. In the previous example we calculated that the probability of the sample proportion being in the range 15–25% was 78.88%. You should realise that the values 15–25% represent a range on the horizontal axis of the normal distribution, whereas 78.88% represents the area under the particular normal curve over that range.
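To check the 78.88% figure, and to see what happens when the sample grows to 1 000 customers, the short sketch below (assuming SciPy is installed) repeats the calculation with the normal approximation.

```python
from math import sqrt
from scipy.stats import norm

pi = 0.20
for n in (100, 1000):
    se = sqrt(pi * (1 - pi) / n)                          # standard error of p
    prob = norm.cdf(0.25, pi, se) - norm.cdf(0.15, pi, se)
    print(f"n = {n:4d}: P(0.15 <= p <= 0.25) = {prob:.4f}")
```

With n = 1 000 the same band from 15% to 25% captures almost all (roughly 99.99%) of the possible sample proportions, illustrating how a larger sample shrinks the standard error.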


Exercise 7.8
1 Using sampling theory, explain how a poll of 400 people that reveals a sample proportion of 52% supporting the proposition does not necessarily mean that there is majority support for the proposition in general.
2 Based on historical data, a bank has found that, throughout Australia, 7% of customers with the bank will default on a loan. At a particular branch, the manager claims it has special risk screening procedures which mean its default rate is lower than the national average for the bank. A sample of 100 loan customers from the branch is selected randomly. The sample reveals only 4 have defaulted. On the basis of the sample evidence and sampling theory, evaluate the branch manager's claim about better screening methods.

Reading
Please skim read section 7.6 on the textbook companion website. This section covers a very important concept, but the calculations (whilst also very important) are not examinable. The important concept is that if a population is quite small, adjustments need to be made to some formulae. However, the rest of the theory and thinking remains the same.

Summary
In this topic we made use of our knowledge about the normal distribution as studied in topic 6 and showed how probability is an essential aid to decision making in a sampling situation. In particular, we introduced the important concept of the sampling distribution of a sample statistic and showed how, in conjunction with the normal distribution, we are able to predict and manage error from sampling. The Central Limit Theorem assures us that, under the right circumstances, the sampling distribution for the sample mean and for the sample proportion will be (approximately) normally distributed. This means that probabilities for sample outcomes can be easily determined from the normal distribution, and inferences about population parameters are systematically and objectively based. These sampling distribution properties depend on the use of random sampling.

Review exercises
Solve the following problems by computer and also by hand using a calculator and the probability tables in the back of the textbook.
Exercise 7.9
Complete problems 7.48, 7.50, 7.52 in Berenson et al (Aust.) 2010, pp. 222–223.


Further resources
Black, K 2008, Business statistics for contemporary decision making, 5th edn, Wiley, NJ.
Anderson, DR, Sweeney, DJ & Williams, TA 2008, Statistics for business and economics, 10th edn, South-Western Thomson Learning, Cincinnati, Ohio.
Selvanathan, A, Selvanathan, S, Keller, G & Warrack, B 2006, Australian business statistics, 4th edn, Nelson Thomson Learning, Melbourne.


Suggested answers

Exercises
Exercise 7.1

1 A sample is preferred on the basis of cost and time. In other cases, collection of population data can lead to serious mis-measurement due to the size of the population; we may lose more information than we gain. In other cases, measurement causes the destruction of the population element.
2 Exploratory research for testing new product concepts is usually undertaken at a greater depth and with fewer elements surveyed than the final overall survey. A new technical concept may be shown to some experts, or a new toy given to some children and we observe how they use the toy. In these cases the sampling procedure is a non-probability sample, as the elements that form the sample have been purposively picked.
3 Larger samples are more costly and time-consuming. The additional information from a larger sample may only be of marginal benefit, but incurred at a great cost. Larger samples may also be less efficient, from the point of view of the likelihood of errors in managing the sampling process or data collection. One of the common misapprehensions is that larger samples are required for larger populations. This is not necessarily the case. A reasonable sample size to ensure a representative sample from the population is, in general, about 100.
4 This is in some ways a census and in others a sample. If we consider the bounds to be the last three months, then all customers that purchase in this time would form a census. From another perspective we could consider these customers as a sample from a wider definition of potential customers in the last three months. We have ignored customers in earlier months. In any case, the results from the questionnaire will only be indicative of those customers that have purchased at our store in the last three months. And given that they did purchase, the implication is that these customers could be reasonably happy with our service and we may over-estimate the satisfaction with service levels for all existing and potential customers. We may wish to interview potential customers who did not purchase at our store to find out the reasons. Combined with the survey results of those who did purchase, this may give us a balanced indication of the satisfaction with service.

Exercise 7.2

Answers are available at the textbook companion website.


Exercise 7.3

Answers are available at the textbook companion website.


Exercise 7.4

1 A sampling distribution is a probability distribution that associates probabilities with the possible outcomes of a sample statistic when sampling from a population. It can be used as a basis for inference about population characteristics when the Central Limit Theorem applies.
2 (a) We know that variation for a sample mean (the standard error) depends on the population standard deviation and the sample size. What the question is asking is: how can I achieve 95% of sample means in an interval 10 units around the population mean? In answering this, we need to determine the shape of the sampling distribution. We would want it to be normal to apply our theory. The sampling distribution is normal if the population distribution is normal or the sample size is large. Given that the sampling distribution is normal and we know the population standard deviation, it is possible to calculate the size of the sample, n, to achieve the above objective. When setting sample sizes for surveys, researchers ensure (with a degree of confidence) that the sample statistic is within a specified distance from the population parameter of interest.
(b) We estimate σ = 75. Therefore, for a given sample size the standard error is given by

σ_X̄ = σ / √n = 75 / √n

From the normal distribution we know 95% of possible values (sample means) fall within 1.96 (the Z score) standard deviations (or standard errors). The $10 must correspond to 1.96 standard errors. Thus, we get:

1.96 × 75 / √n = 10
√n = (1.96 × 75) / 10 = 14.7
n = 14.7² ≈ 216

3 Time taken is normal with mean μ = 45 and standard deviation σ = 4. The average time for a sample of 16 units will also be normal with mean 45 and standard error 4 / √16 = 1.
(a) The probability that a single unit of output will take more than 47 seconds is:
P(X > 47) = P(Z > (47 − 45) / 4) = P(Z > 0.5) = 1.0 − 0.6915 = 0.3085
Thus, there is about a 30% chance that a single unit of output will exceed 47 seconds.
(b) The probability that the sample average will exceed 47 seconds is:
P(X̄ > 47) = P(Z > (47 − 45) / 1) = P(Z > 2) = 1.0 − 0.9772 = 0.0228
There is a 2.28% chance that the average of 16 units will exceed 47 seconds.
A comparison of these answers reveals that there is a greater probability of a single unit of output exceeding 47 seconds than of an average of 16 units doing so. The averages of samples from the population are less dispersed than the individual population units; they will tend to be relatively more closely centred around the population mean.


Rework this problem assuming a sample of size 25 and show that P(X̄ > 47) = 0.0062. Thus, the bigger the sample, the smaller the probability that the sample mean will be distant from the overall population mean.
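The three probabilities in answer 3 (and the n = 25 rework) can be verified with a few lines (assuming SciPy):

```python
from scipy.stats import norm

# Single unit: X ~ N(45, 4); sample mean of n units: X-bar ~ N(45, 4/sqrt(n))
print(norm.sf(47, loc=45, scale=4))            # P(X > 47), about 0.3085
print(norm.sf(47, loc=45, scale=4 / 16**0.5))  # n = 16: P(X-bar > 47), about 0.0228
print(norm.sf(47, loc=45, scale=4 / 25**0.5))  # n = 25: P(X-bar > 47), about 0.0062
```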
Exercise 7.5

Answers are available at the textbook companion website.


Exercise 7.6

Answers are available at the textbook companion website.


Exercise 7.7

Answers are available at the textbook companion website.


Exercise 7.8

1 Random selection of sample units could have created this result. The result of 52% in the sample could have come from a population with a proportion of less than 50%, say 48%. For example, say π = 48%; then for n = 400:

σ_p = √(0.48 × 0.52 / 400) = 0.025

Therefore, P(p ≥ 52% given π = 48%) = P(Z ≥ 1.6) = 1 − 0.9452 = 0.0548 or 5.48%

So while it is not a very likely occurrence, there is still a chance of 5.48% that a sample of size n = 400 could give a proportion of 52% or more when the true population proportion is in fact 48%.

2 The statistical approach to this type of decision-making problem is to assume that the proportion of defaulters at the branch is similar to other branches, that is, 0.07. The sample proportion was 4/100 or 4%. Given this we calculate:

P(p ≤ 0.04) = P(Z ≤ (0.04 − 0.07) / √(0.07 × 0.93 / 100)) = P(Z ≤ −1.18) = 0.1190 (or 11.90%)

It would appear that the manager's claim may not be valid. The probability of 100 randomly selected customers having a default rate of 4% or less, given a true default percentage of 7%, is 11.9% (generally, statisticians would consider this to be a reasonably high probability). On the balance of probabilities it may suggest that the default rate at the branch is the same as at every other branch, namely 7%, and that the screening process is not successful.
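Both probabilities in this suggested answer can be reproduced as follows (assuming SciPy):

```python
from math import sqrt
from scipy.stats import norm

# Question 1: P(p >= 0.52) when pi = 0.48 and n = 400
se1 = sqrt(0.48 * 0.52 / 400)                  # about 0.025
print(norm.sf(0.52, loc=0.48, scale=se1))      # about 0.055

# Question 2: P(p <= 0.04) when pi = 0.07 and n = 100
se2 = sqrt(0.07 * 0.93 / 100)                  # about 0.0255
print(norm.cdf(0.04, loc=0.07, scale=se2))     # about 0.12
```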
Exercise 7.9
Answers are available at the textbook companion website.

