David Ramsey
e-mail: david.ramsey@ul.ie website: www.ul.ie/ramsey
1. Data Collection and Descriptive Statistics.
2. Probability Theory.
3. Statistical Methods of Estimation.

A more detailed description of the course (lectures and tutorials), together with suggested reading, is available on my website.
Populations of objects and individuals show variation with respect to various traits (e.g. height, political preferences, the working life of a light bulb). It is impractical to observe all the members of the population. In order to describe the distribution of a trait in the population, we select a sample. On the basis of the sample we gain information on the population as a whole.
Quantitative Variables
These are variables which naturally take numerical values (e.g. age, height, number of children). Such variables can be measured or counted. As before we distinguish between two types of quantitative variables. a) Discrete variables: these are variables that take values from a set that can be listed (most commonly integer values, i.e. they are variables that can be counted). For example, number of children, the results of die rolls.
b) Continuous variables
These are variables that can take values in a given range to any number of decimal places (such variables are measured, normally according to some unit, e.g. height, age, weight). It should be noted that such variables are only measured to a given accuracy (e.g. height is measured to the nearest centimetre, age is normally given to the nearest year). If a discrete random variable takes many values (e.g. the population of a town), then for practical purposes it is treated as a continuous variable.
A parameter is an unknown number describing a population. For example, it may be that 9% of the population of eligible voters (the electorate) wish to vote for the Green Party (we do not, however, observe this population proportion). A statistic is a number describing a sample. For example, 8% of a sample may wish to vote for the Green Party. This is the sample proportion. Statistics may be used to describe a population, but they only estimate the real parameters of the population.
Naturally, the statistics from a sample will show some variation around the corresponding parameters, e.g. 9% of the population wish to vote for the Green Party, but only 8% of the sample. The greater the sample size, the more precise the results: if we take a large number of samples of size n, then the larger n is, the less the sample proportion varies from sample to sample, i.e. the more replicable the results.
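The effect of sample size on precision can be illustrated by simulation. The following is a minimal sketch, not part of the course material: it assumes a population proportion of 9% (as in the Green Party example), and the function name, the number of repeated samples and the random seed are arbitrary choices made for illustration.

```python
import random
import statistics

def sample_proportion_spread(p, n, repeats=1000, seed=0):
    """Standard deviation of the sample proportion over many samples of size n.

    p is the (assumed) population proportion; repeats and seed are
    illustrative choices, not values taken from the lecture.
    """
    rng = random.Random(seed)
    proportions = []
    for _ in range(repeats):
        # Each sampled individual supports the party with probability p.
        supporters = sum(rng.random() < p for _ in range(n))
        proportions.append(supporters / n)
    return statistics.pstdev(proportions)

spread_small = sample_proportion_spread(p=0.09, n=100)
spread_large = sample_proportion_spread(p=0.09, n=1000)
print(spread_small, spread_large)
```

Under these assumptions the sample proportions from samples of size 1000 should vary noticeably less than those from samples of size 100.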
Non-Sampling Bias
Other sources of non-sampling bias may be: 1. Lack of anonymity. 2. The wording of a question. 3. The desire to give an answer that would please the interviewer. For example, surveys may systematically overestimate the willingness of individuals to pay extra for environmentally friendly goods, as stating that you are prepared to pay more is seen to be the politically correct answer.
It should be noted that bias is a characteristic of the way in which data are collected, not of a single sample. Increasing the sample size will improve the precision of an estimate, but will not affect the bias. Returning to the example of height: as the sample size increases, the sample mean becomes more replicable. However, if we are estimating the mean height of the entire population based on samples of students, there will always be a tendency to overestimate the mean height of the population.
We may describe qualitative data using a) Frequency tables. b) Bar charts. c) Pie charts.
Frequency tables
Frequency tables display how many observations fall into each category (the frequency column), as well as the relative frequency of each category (the proportion of observations falling into each category). Let $n_i$ denote the number of observations in category $i$ and $n$ the total number of observations. The relative frequency of category $i$ is $f_i$, where

$$f_i = \frac{n_i}{n}.$$

Multiplying by 100, we obtain the relative frequency as a percentage. If there are missing data, we may also give the relative frequencies in terms of the actual number of observations, $n'$, i.e.

$$f_i' = \frac{n_i}{n'}.$$
For example, 200 students were asked which of the following bands they preferred: Franz Ferdinand, Radiohead or Coldplay. The answers may be presented in the following frequency table.

| Band      | Frequency | Relative Frequency (%) |
|-----------|-----------|------------------------|
| Coldplay  | 62        | 62 × 100/200 = 31      |
| Franz F.  | 66        | 66 × 100/200 = 33      |
| Radiohead | 72        | 72 × 100/200 = 36      |
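The table above can be reproduced with a few lines of code. This is a minimal sketch in Python; the function name `frequency_table` is a hypothetical helper introduced here for illustration, and the list of answers is rebuilt from the frequencies in the table.

```python
from collections import Counter

def frequency_table(observations):
    """Return {category: (frequency, relative frequency in %)}."""
    counts = Counter(observations)
    n = len(observations)
    # Relative frequency in percent: 100 * n_i / n.
    return {cat: (n_i, 100 * n_i / n) for cat, n_i in sorted(counts.items())}

# Answers rebuilt from the frequencies given in the table.
answers = ["Coldplay"] * 62 + ["Franz F."] * 66 + ["Radiohead"] * 72
table = frequency_table(answers)
for band, (n_i, percent) in table.items():
    print(f"{band:<10} {n_i:>4} {percent:>6.1f}")
```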
Bar chart
In a bar chart the height of a bar represents the relative frequency of a given category (or the number of observations in that category).
Pie chart
The size of a slice in a pie chart represents the relative frequency of a category. Hence, the angle made by the slice representing category $i$ is given (in degrees) by $\theta_i$, where

$$\theta_i = 360 f_i = \frac{360 n_i}{n}$$

(i.e. we multiply the relative frequency by the number of degrees in a full revolution). If the relative frequency of observations in group $i$ is given in percentage terms, denoted $p_i$, then 1% of the observations in the sample corresponds to an angle of 3.6 degrees. Thus, $\theta_i = 3.6 p_i$.
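As an illustration of the angle formula, the sketch below computes the slice angles for the band preference data (62, 66 and 72 answers out of 200). The function name `pie_angles` is an assumption made for this example.

```python
def pie_angles(frequencies):
    """Angle in degrees of each slice: 360 * n_i / n."""
    n = sum(frequencies.values())
    return {cat: 360 * n_i / n for cat, n_i in frequencies.items()}

angles = pie_angles({"Coldplay": 62, "Franz F.": 66, "Radiohead": 72})
for band, angle in angles.items():
    print(f"{band:<10} {angle:6.1f} degrees")
```

The angles are 3.6 times the percentages 31, 33 and 36, and they add up to a full revolution of 360 degrees.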
Discrete data can be presented in the form of frequency tables and/or bar charts (as above). The distribution of continuous data can be presented using a histogram. The histogram estimates the probability density function of a continuous random variable (see later).
In order to draw a histogram for a continuous variable, we need to categorise the data into intervals of equal length. The end points of these intervals should be round numbers. The number of categories used should be approximately $\sqrt{n}$ (normally between 5 and 20 categories are used). For example, if we have 30 observations then we should use about $\sqrt{30} \approx 5.5$ categories. Hence, 5 and 6 are sensible choices for the number of categories. Let $k$ be the number of categories.
Histograms
In order to choose the length of each interval, $L$, we use

$$L \approx \frac{x_{\max} - x_{\min}}{k},$$

where $x_{\max}$ is the smallest round number larger than all the observations and $x_{\min}$ is the largest round number smaller than all the observations. If necessary, $L$ is rounded upwards, so that the intervals are of nice length and the whole range of the data is covered.
The intervals used are $[x_{\min}, x_{\min} + L]$, $(x_{\min} + L, x_{\min} + 2L]$, ..., $(x_{\max} - L, x_{\max}]$. In general the lower end-point of an interval is assumed not to belong to that interval (to avoid a number belonging to two classes).
A histogram is very similar to a bar chart. The height of the block corresponding to an interval is the relative frequency of observations in that block. Thus, the height of a block is the number of observations in that interval divided by the total number of observations.
Example 1.2
We observe the height of 20 individuals (in cm). The data are given below 172, 165, 188, 162, 178, 183, 171, 158, 174, 184, 167, 175, 192, 170, 179, 187, 163, 156, 178, 182. Draw a histogram representing these data.
We first choose the number of classes and the corresponding intervals. Since $\sqrt{20} \approx 4.5$, we should choose 4 or 5 intervals.
The tallest individual is 192cm tall and the shortest 156cm. 200cm is the smallest round number larger than all the observations and 150cm is the largest round number smaller than all the observations. To calculate the length of the intervals, we use

$$L = \frac{200 - 150}{k}.$$
Taking k to be 4, L = 12.5. Taking k = 5, L = 10 (a nicer length). Hence, it seems reasonable to use 5 intervals of length 10, starting at 150.
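The choice between 4 and 5 intervals can be tabulated quickly. The sketch below is illustrative only: it takes the round end-points 150 and 200 as given (choosing "round" numbers is a judgement call that the code does not attempt), and the function name `interval_lengths` is hypothetical.

```python
import math

def interval_lengths(n, x_min, x_max):
    """Class counts on either side of sqrt(n) and the interval length for each."""
    root = math.sqrt(n)
    candidates = sorted({math.floor(root), math.ceil(root)})
    return {k: (x_max - x_min) / k for k in candidates}

lengths = interval_lengths(n=20, x_min=150, x_max=200)
print(lengths)  # {4: 12.5, 5: 10.0}
```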
If we assume that the upper endpoint of an interval belongs to that interval, then we have the intervals [150,160], (160, 170], (170,180], (180,190], (190,200]. Now we count how many observations fall into each interval and hence the relative frequency of observations in each interval.
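| Height (x)    | No. of Observations | Rel. Frequency |
|---------------|---------------------|----------------|
| 150 ≤ x ≤ 160 | 2                   | 2/20 = 0.1     |
| 160 < x ≤ 170 | 5                   | 5/20 = 0.25    |
| 170 < x ≤ 180 | 7                   | 7/20 = 0.35    |
| 180 < x ≤ 190 | 5                   | 5/20 = 0.25    |
| 190 < x ≤ 200 | 1                   | 1/20 = 0.05    |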
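The counts in each interval can be checked with a short program. This is a minimal sketch that assumes all observations lie between the lower and upper end-points; the function name `histogram_counts` is introduced here for illustration. As in the text, the upper end-point of each interval belongs to that interval.

```python
import math

def histogram_counts(data, x_min, length, k):
    """Count observations in [x_min, x_min+L], (x_min+L, x_min+2L], ..."""
    counts = [0] * k
    for x in data:
        if x == x_min:
            index = 0  # the first interval also contains its lower end-point
        else:
            # Upper end-points belong to the interval, hence the ceiling.
            index = math.ceil((x - x_min) / length) - 1
        counts[index] += 1
    return counts

heights = [172, 165, 188, 162, 178, 183, 171, 158, 174, 184,
           167, 175, 192, 170, 179, 187, 163, 156, 178, 182]
counts = histogram_counts(heights, x_min=150, length=10, k=5)
relative = [c / len(heights) for c in counts]
print(counts)    # [2, 5, 7, 5, 1]
print(relative)
```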
[Figure: histogram of the heights in Example 1.2.]
A histogram is an estimator of the density function of a variable (see the chapter on the distribution of random variables in Section 2). The distribution of height seems to be reasonably symmetrical around 175cm.
From a histogram we may infer whether the distribution of a random variable is symmetric or not. The histogram of height shows that the distribution is reasonably symmetric (even if the distribution of height in the population were symmetric, we would normally observe some small deviation from symmetry in the histogram as we observe only a sample).
Right-Skewed distributions
A distribution is said to be right-skewed if there are observations a long way to the right of the centre of the distribution, but not a long way to the left. The distribution of wages is right-skewed, since a small proportion of individuals will earn several times more than the mean wage.
[Figure: a right-skewed distribution.]
Left-skewed distributions
A distribution is said to be left-skewed if there are observations a long way to the left of the centre of the distribution, but not a long way to the right. For example, the weights of participants in coxed boat races have a left-skewed distribution. This is because the majority of participants are heavy rowers, while a minority are very light coxes.
[Figure: a left-skewed distribution.]
We consider two types of measure: 1. Measures of centrality - give information regarding the location of the centre of the distribution (the mean, median). 2. Measures of variability (dispersion) - give information regarding the level of variation (the range, variance, standard deviation, interquartile range).
1. The Sample Mean, $\bar{x}$. Suppose we have a sample of $n$ observations, the mean is given by the sum of the observations divided by the number of observations.

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
$\mu$ denotes the population mean. If there are $N$ units in the population, then

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N},$$

where $x_i$ is the value of the trait for individual $i$ in the population. $\mu$ is normally unknown. The sample mean $\bar{x}$ (a statistic) is an estimator of the population mean $\mu$ (a parameter).
In order to calculate the sample median, we first order the observations from the smallest to the largest. The order statistic $x_{(i)}$ is the $i$-th smallest observation in a sample (i.e. $x_{(1)}$ is the smallest observation and $x_{(n)}$ is the largest observation). The notation for the median, $Q_2$, comes from the fact that the median is the second quartile (see quartiles in the section on measures of dispersion).
If $n$ is odd, then the median is the observation which appears in the centre of the ordered list of observations. Hence,

$$Q_2 = x_{(0.5[n+1])}.$$

If $n$ is even, then the median is the average of the two observations which appear in the centre of the ordered list of observations. Hence,

$$Q_2 = 0.5[x_{(0.5n)} + x_{(0.5n+1)}].$$

One half of the observations are smaller than the median and one half are greater.
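The two cases of the median rule translate directly into code. The following is a minimal sketch; the only subtlety is that order statistics are numbered from 1, while Python lists are indexed from 0. The heights are the data from Example 1.2, used here purely as an illustration.

```python
def median(observations):
    """Sample median Q2 computed from the order statistics."""
    ordered = sorted(observations)
    n = len(ordered)
    if n % 2 == 1:
        # x_(0.5[n+1]) in 1-based notation is list index (n+1)//2 - 1.
        return ordered[(n + 1) // 2 - 1]
    # Average of x_(0.5n) and x_(0.5n+1), i.e. list indices n//2 - 1 and n//2.
    return 0.5 * (ordered[n // 2 - 1] + ordered[n // 2])

heights = [172, 165, 188, 162, 178, 183, 171, 158, 174, 184,
           167, 175, 192, 170, 179, 187, 163, 156, 178, 182]
print(median(heights))  # 174.5
```

With 20 observations the median is the average of the 10th and 11th smallest heights, 174 and 175.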
The range is defined to be the largest observation minus the smallest observation. Since the range is only based on 2 observations it conveys little information and is sensitive to extreme values (errors).
The sample variance is a measure of the average square distance from the mean. The formula for the sample variance $s_{n-1}^2$ is given by

$$s_{n-1}^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2.$$
The sample standard deviation is given by the square root of the variance. It (and hence the sample variance) can be calculated on a scientific calculator by using the $\sigma_{n-1}$ or $s_{n-1}$ function as appropriate. In simple terms, the standard deviation is a measure of the average distance of an observation from the mean. It cannot be greater than the maximum deviation from the mean.
Otherwise, if $a$ is the integer part of $\frac{n+1}{4}$ [this is obtained by simply removing everything after the decimal point], then

$$Q_1 = 0.5[x_{(a)} + x_{(a+1)}].$$

Similarly, if $b$ is the integer part of $\frac{3n+3}{4}$, then

$$Q_3 = 0.5[x_{(b)} + x_{(b+1)}].$$

The interquartile range (IQR) is the difference between the upper and lower quartiles:

$$\text{IQR} = Q_3 - Q_1.$$
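The quartile rule can be sketched in code as follows. One assumption is made and should be read with care: the text above only shows the case where $\frac{n+1}{4}$ (or $\frac{3n+3}{4}$) is not an integer, so when the position is an integer the sketch simply takes the observation at that position. The function name `quartiles` and the requirement of at least 3 observations are also choices made for this illustration.

```python
def quartiles(observations):
    """Return (Q1, Q3, IQR) using the integer-part rule."""
    ordered = sorted(observations)
    n = len(ordered)
    if n < 3:
        raise ValueError("this sketch assumes at least 3 observations")

    def quartile(position):
        # position is (n+1)/4 or (3n+3)/4 in 1-based order-statistic terms.
        whole = int(position)
        if position == whole:
            # Assumption: integer case, not shown in the text above.
            return ordered[whole - 1]
        # Average of x_(whole) and x_(whole+1).
        return 0.5 * (ordered[whole - 1] + ordered[whole])

    q1 = quartile((n + 1) / 4)
    q3 = quartile((3 * n + 3) / 4)
    return q1, q3, q3 - q1

print(quartiles([6, 9, 12, 9, 8, 10]))  # (7.0, 11.0, 4.0)
```

For these six observations, $a = 1$ and $b = 5$, so $Q_1$ averages the two smallest values (6 and 8) and $Q_3$ averages the two largest (10 and 12).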
Sometimes we wish to compare the dispersion of two variables. In cases where different units are used to measure the two variables or the means of two variables are very different, it may be useful to use a measure of dispersion which does not depend on the units in which it is measured. The coefficient of variation C.V. does not depend on the units of measurement. It is the standard deviation divided by the sample mean:

$$C.V. = \frac{s_{n-1}}{\bar{x}}.$$
Calculate the measures of centrality and dispersion defined above for the following data: 6, 9, 12, 9, 8, 10. There are 6 items of data, hence

$$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{6 + 9 + 12 + 9 + 8 + 10}{6} = 9.$$
The range is the difference between the largest and the smallest observations: Range = 12 − 6 = 6.
The sample variance is given by

$$s_{n-1}^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{(6-9)^2 + (9-9)^2 + (12-9)^2 + (9-9)^2 + (8-9)^2 + (10-9)^2}{5} = 4.$$

The standard deviation is given by $s_{n-1} = \sqrt{s_{n-1}^2} = 2$.
The coefficient of variation is

$$C.V. = \frac{s_{n-1}}{\bar{x}} = \frac{2}{9}.$$

Suppose a variable is by definition positive, e.g. height, weight. A coefficient of variation above 1 is accepted to be very large (such variation may occur in the case of wages when wage inequality is high). With regard to the physical traits of people, values for the coefficient of variation of around 0.1 to 0.3 are common (in humans the coefficient of variation of height is around 0.1, the coefficient of variation for weight is somewhat bigger).
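The hand calculations for the data 6, 9, 12, 9, 8, 10 can be checked with Python's standard `statistics` module, whose `variance` and `stdev` functions use the divisor $n - 1$. This is only a verification sketch; the variable names are chosen for illustration.

```python
import statistics

data = [6, 9, 12, 9, 8, 10]

mean = statistics.mean(data)
median = statistics.median(data)
data_range = max(data) - min(data)
variance = statistics.variance(data)  # sample variance, divisor n - 1
std_dev = statistics.stdev(data)      # square root of the sample variance
cv = std_dev / mean                   # coefficient of variation

print(mean, median, data_range, variance, std_dev, cv)
```

The results agree with the values obtained by hand: a mean of 9, a range of 6, a variance of 4, a standard deviation of 2 and a coefficient of variation of 2/9.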
1.5 Measures of Location and Dispersion for Grouped Data - a) Discrete Random Variables
A die was rolled 100 times and the following data were obtained.

| Result              | 1  | 2  | 3  | 4  | 5  | 6  |
|---------------------|----|----|----|----|----|----|
| No. of observations | 15 | 18 | 20 | 14 | 15 | 18 |
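Let $x_i$ denote the $i$-th possible result and $f_i$ the number of observations equal to $x_i$, where $k$ is the number of distinct results. The total number of observations is

$$n = \sum_{i=1}^{k} f_i.$$

The sample mean is

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{k} x_i f_i.$$

The sample variance is

$$s_{n-1}^2 = \frac{1}{n-1}\sum_{i=1}^{k} f_i (x_i - \bar{x})^2.$$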
In this case we know the exact values of the observations and hence we can order the data. In this way we can calculate the median. Since there are 100 observations, the median is

$$Q_2 = 0.5[x_{(50)} + x_{(51)}].$$
The 15 smallest observations are equal to 1, i.e. $x_{(1)} = x_{(2)} = \ldots = x_{(15)} = 1$. The next 18 smallest observations are equal to 2, i.e. $x_{(16)} = x_{(17)} = \ldots = x_{(33)} = 2$. The next 20 smallest observations are all equal to 3, i.e. $x_{(34)} = x_{(35)} = \ldots = x_{(53)} = 3$. Hence $x_{(50)} = x_{(51)} = 3$, so $Q_2 = 3$.
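The grouped-data calculations for the die rolls can be sketched as follows, taking $f_i$ to be the number of observations equal to $x_i$. The helper names `grouped_summary` and `grouped_order_statistic` are hypothetical; the second finds $x_{(i)}$ by accumulating frequencies, exactly as in the argument above.

```python
def grouped_summary(values, frequencies):
    """Sample size, mean and variance (divisor n - 1) from a frequency table."""
    n = sum(frequencies)
    mean = sum(x * f for x, f in zip(values, frequencies)) / n
    variance = sum(f * (x - mean) ** 2
                   for x, f in zip(values, frequencies)) / (n - 1)
    return n, mean, variance

def grouped_order_statistic(values, frequencies, i):
    """Return x_(i), the i-th smallest observation (values sorted ascending)."""
    cumulative = 0
    for x, f in zip(values, frequencies):
        cumulative += f
        if i <= cumulative:
            return x
    raise ValueError("i exceeds the number of observations")

results = [1, 2, 3, 4, 5, 6]
counts = [15, 18, 20, 14, 15, 18]

n, mean, variance = grouped_summary(results, counts)
median = 0.5 * (grouped_order_statistic(results, counts, 50)
                + grouped_order_statistic(results, counts, 51))
print(n, mean, variance, median)
```

For the table above this gives $n = 100$, a sample mean of 3.5, a sample variance of $289/99 \approx 2.92$ and a median of 3.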
1.5 Measures of Location and Dispersion for Grouped Data - b) Continuous Random Variables
In such cases we have data grouped into intervals. Let $x_i$ be the centre of the $i$-th interval and $f_i$ the number of observations in the $i$-th interval. The approach to calculating the sample mean and variance is the same as in the case of discrete data. In order to carry out the calculations, we assume that each observation is in the middle of the appropriate interval.
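As an illustration, the sketch below applies this approach to the grouped height data from Example 1.2 (intervals of length 10 from 150 to 200, with 2, 5, 7, 5 and 1 observations). Using that table here is a choice made for this example, and the variable names are hypothetical.

```python
intervals = [(150, 160), (160, 170), (170, 180), (180, 190), (190, 200)]
counts = [2, 5, 7, 5, 1]

# Each observation is assumed to lie at the centre of its interval.
midpoints = [(lower + upper) / 2 for lower, upper in intervals]

n = sum(counts)
grouped_mean = sum(x * f for x, f in zip(midpoints, counts)) / n
grouped_variance = sum(f * (x - grouped_mean) ** 2
                       for x, f in zip(midpoints, counts)) / (n - 1)
print(grouped_mean, grouped_variance)
```

The grouped mean is 174, close to the mean of the 20 raw observations (174.2). The small difference comes from replacing each observation by the centre of its interval.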