
Business Statistics BCM 307

INTRODUCTION

Expected Learning Outcomes
At the end of the course a student should be able to:
Demonstrate an understanding of statistics and its importance for business and management
Demonstrate proficiency with qualitative and quantitative measures through the ability to organize and present data in tables, charts, graphs and polygons
Demonstrate an understanding of measures of central tendency and measures of dispersion
Demonstrate an understanding of time series and their application
Draw scatter diagrams, construct a simple linear regression equation and apply it
Demonstrate proficiency in contingency tables, probability concepts and their application in business
Demonstrate some level of understanding and application of hypothesis testing

1. Introduction
Definition of statistics; Application of statistics in business; Terminologies used in statistics

2. Collection, Organization and Presentation of Data
Qualitative data: Summary tables, bar, pie and Pareto charts
Quantitative data: Summary tables, histogram, graphs, polygons, ogive and Lorenz curve
Time series: Time series graphs, application and forecasting

3. Numerical Descriptive Measures: Discrete and Continuous Data
Measures of central tendency: Mode, Median, Mean
Measures of dispersion: Significance of the measures, Range, Interquartile range, Variance, Standard deviation, Coefficient of variation

4. Probability Distributions
a) Discrete distribution
b) Normal distribution (continuous)
Introduction
Standard Normal Distribution
Z-scores
Areas to the Left and Right of x
Calculations of Probabilities
Using the Central Limit Theorem

5. Confidence Interval
Introduction
Confidence Interval, Single Population Mean, Population Standard Deviation Known
Confidence Interval, Single Population Mean, Standard Deviation

6. Linear Regression Model and Scatter Diagram (Simple and Multilinear)
Drawing scatter diagrams, describing relationships, equation of the relationship between variables
Using the least squares method to derive the simple regression equation
Explaining the coefficients of the variables and their significance

7. Hypothesis Testing
Introduction
Definition of Hypothesis Testing
Null and Alternate Hypothesis
Testing for the Mean
Hypothesis Testing for the Proportion
Support One of the Hypotheses
Decision and Conclusion

Course Texts:
1. Dean, S. and Illowsky, B.; Principles of Business Statistics; <http://cnx.org/content/col10874/1.5/>
2. Berenson, M. et al.; Basic Business Statistics: Concepts and Application. 11th edition (2009)
3. T. Lucey; Quantitative Techniques: 6th edition (2002)
4. R.I. et al.; Quantitative Approaches to Management. 8th edition (1992)

INTRODUCTION
Statistics is a branch of mathematics that transforms numbers into useful information for decision making. It does this by providing a set of methods for analyzing the numbers. Statistics is therefore the science of data: it involves collecting, classifying, summarizing, organizing, analyzing and interpreting numerical information.

Definition of Terms
The application of statistics can be divided into two broad areas:
Descriptive statistics
Inferential statistics

Descriptive statistics: It uses numerical and graphical methods to look for patterns in a data set, to summarize the information revealed and to present that information in a convenient form. This can be referred to as analysis of data. The data is usually presented in the form of tables, charts and graphs, and analyzed using statistics such as the mean, median, mode, variance, standard deviation and coefficient of variation.

Inferential statistics: It uses the data collected from a small group to draw conclusions about a larger group. The conclusions may be decisions, predictions or other generalizations about a larger set of data.

Important applications can therefore be summarized as:
Summarizing business data
Drawing conclusions from the data
Making reliable forecasts about business activities
Improving business processes

Statistics: The word statistics has two meanings:
1. Numerical facts derived from analysis of sample data, for example the mean, standard deviation and proportions. Any numerical fact can also be referred to as a statistic, e.g. number of people, number of countries, marks scored in a test.
2. A field or discipline of study. It is a branch of mathematics that transforms numbers into useful information for decision making. It does this by providing a set of methods for analyzing the numbers. These methods help to find patterns in the numbers, which enables one to determine whether differences in the numbers are just due to chance. In this sense statistics can be seen as a science of data. It involves collecting, classifying, summarizing, analyzing and interpreting numerical information.

Population (target population): The set of units that we are interested in studying and about which we need to draw a conclusion; the whole set of elements of focus. Its characteristics are called parameters, e.g. population mean $\mu$, population standard deviation $\sigma$, population proportion $p$.

Sample: The portion or subset of the population that is selected for analysis. The sample is randomly selected so that it reflects the characteristics of the population. Its characteristics are called statistics, e.g. sample mean $\bar{x}$, sample standard deviation $s$, sample proportion $\hat{p}$.

Representative sample: A sample is representative if it exhibits the typical characteristics possessed by the population of interest (the target population). The most common way to satisfy this requirement is to use methods that select a random sample, that is, that give every element an equal chance of being selected.

Random sample: A sample selected from the population in such a way that every element, and every possible sample of a given size, has an equal chance of selection.

Element/member: A specific subject or object about which information is collected, e.g. a firm, a country, a person, a university.

Variable: A characteristic under study that assumes different values for different elements. For example, scores in a test: different students are expected to obtain different scores when they do a given test. Likewise, the profit a firm makes may differ from one period to another, and people's tastes for a given product may differ.

Observation/measurement: The value of a variable for an element, e.g. 120 cm as a height, a yes for an opinion, Sh. 20 000 as an income.

Data set: A collection of observations on one or more variables.

Types of variables
A variable may be classified as qualitative or quantitative.

Qualitative Variable
These are characteristics that cannot be measured on a natural numerical scale but can only be classified into groups or categories. Data from a categorical variable are measured on a nominal or an ordinal scale.

Nominal scale: It classifies data into distinct categories that cannot be ranked, for example gender (female or male), preference for a product or service (e.g. a soft drink), or a yes or no response. This is the weakest form of measurement.

Ordinal scale: It classifies data into distinct categories that can be ranked, e.g. responses such as excellent, very good, fair, poor. Although the categories can be ranked, the scale is still weak because the size of the difference between categories cannot be measured.

Quantitative Variable
The measurements are recorded on a naturally occurring numerical scale. The scale can be either an interval or a ratio scale. On both scales the values can be ranked, and the difference between two values can also be calculated and interpreted.

Interval scale: Differences between values are meaningful but ratios are not, so the scale cannot be used for proportional comparisons. For example, a student who scores 100% is not twice as intelligent as one who scores 50%.

Ratio scale: It refers to data that can also be compared as ratios, so that all arithmetic operations (addition, subtraction, multiplication and division) are meaningful. For example, sales of a company, income of families, returns from an investment.

2. COLLECTION, ORGANIZATION AND PRESENTATION OF DATA
The first step is to identify the type of data one wants to collect: quantitative or qualitative. The second step is to devise a suitable method for collecting the data. Various methods are used, for example surveys, designed experiments and observational studies.

Survey: The researcher samples a group of people and asks them questions, from which responses are obtained. Tools used in data collection include questionnaires, mail, and telephone or in-person interviews.

Experiment: Designed experiments normally involve strict control over the elements in the study. Two groups are formed: a treatment group, which receives the experimental treatment, and a control group.

Observational study: The researcher observes the elements in their natural setting and records the variables of interest.

Regardless of the data collection method, it is likely that the data will come from a sample. Data is classified as primary or secondary. Primary data is collected by the person analyzing it, while secondary data is obtained by the analyst from publications such as books, journals and newspapers.

Organization and Presentation of Data
After data is collected it is cleaned to remove unnecessary entries from the records. It is then ready for analysis, the process that transforms raw data into meaningful information that an analyst can use in decision making. The analysis process includes organization, presentation, description and inference, in which results obtained from the sample are generalized to the population. The methods used to present and analyze data depend on the type of data collected: qualitative or quantitative.

Qualitative Data
The data can be organized and presented in:
Summary tables: frequency, relative frequency or percentage frequency tables
Bar charts
Pie charts
Pareto charts

Example 1
A sample was taken of 25 high school seniors who were planning to join college. Each was asked which major they intended to choose: business (BUS), economics (ECON), management information systems (MIS), behavioral science (BS) or other. The responses, in the order given, are listed below:
ECON BUS Other ECON BS MIS BUS BS BUS MIS ECON Other MIS MIS Other BUS Other Other BUS Other BUS Other MIS Other Other

Required: Organize and present the data by constructing a
i. Frequency distribution table
ii. Bar chart
iii. Pie chart

Solution
The above data is measured on a categorical scale. The analyst collected the data by recording which of the above categories each student intended to major in.

i. Frequency Distribution Table: This is a summary table that organizes the raw data into three columns, as demonstrated below:

Categories   Tally   Frequency
BUS                  6
ECON                 3
MIS                  6
BS                   2
Other                8
Total                25

To make sure that all the responses in each category are included, the student goes through the raw data, striking out each response and recording it as a tally mark in the tally column. Tallies are recorded as slashes, with every fifth mark drawn across the previous four so that groups of five are easy to count. The frequency of each category is the number of tally marks. To ensure that all items were considered, the student must write the total frequency as shown above; it must match the sample size of the data set.

The frequency distribution table gives the number of students who chose each major. From the frequencies we can identify the categories with the highest and the lowest number of students and, more generally, describe how the frequency is distributed among the categories.

A relative frequency distribution table can also be used as a summary table. In this case an additional column is added in which the frequency of each category is divided by the total frequency, resulting in:

Categories   Tally   Frequency   Relative frequency
BUS                  6           6/25 = 0.24
ECON                 3           3/25 = 0.12
MIS                  6           0.24
BS                   2           0.08
Other                8           0.32
Total                25          1.00

The total relative frequency is equal to 1.00. From this table we obtain the proportion of the students in each category.

ii. Percentage frequency distribution table
The summary table can also show percentage frequencies, with a column added for this purpose. The table may look like:

Categories   Tally   Frequency   Percentage frequency
BUS                  6           6/25 × 100 = 24
ECON                 3           12
MIS                  6           24
BS                   2           8
Other                8           32
Total                25          100

The percentage column has a total of 100. Percentages are a simpler way of expressing proportions.

NB: To develop either a relative frequency or a percentage frequency distribution table, one must first construct the frequency distribution table. It is advisable to put all of them in the same table.

iii) Bar Chart
The bar chart presents the data using a horizontal and a vertical axis. The horizontal axis takes the categories, which are represented by bars of equal width separated from each other by uniform spaces. The frequency (or relative frequency or percentage) is shown on the vertical axis. The scale of the vertical axis is determined by the highest frequency among the categories. The scale must be easy to construct and read.

The vertical axis may also use the relative frequencies or percentages; the scale must be chosen to include the highest relative frequency or percentage value. The bar chart is a good presentation of the data from which various information can be drawn. For example, the BUS and MIS bars are of equal height, so the numbers of students choosing business and MIS majors are equal. The category Other is the highest, which means that courses not distinctly identified are also offered. Behavioral science has the least number of students. Bar charts are best suited for comparing different categories by checking the heights of the bars.

iv) Pie Chart
The information could also be presented on a pie chart. The chart represents each category by a sector whose size reflects its proportion. The percentages (or relative frequencies) are converted to degrees by multiplying the relative frequency by 360.

Categories   RF             Degrees
BUS          0.24 (× 360)   86.4
ECON         0.12           43.2
MIS          0.24           86.4
BS           0.08           28.8
Others       0.32           115.2
Total        1.00           360

Then draw the pie chart, showing the categories as sectors of different sizes.
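The summary-table columns and the pie chart angles for Example 1 can be reproduced with a few lines of code. The following is a minimal sketch, assuming the frequencies are taken from the frequency distribution table above; the function name `summarize` is illustrative and not part of the course material.

```python
# Frequencies copied from the Example 1 frequency distribution table.
freq = {"BUS": 6, "ECON": 3, "MIS": 6, "BS": 2, "Other": 8}


def summarize(freq):
    """Return {category: (relative frequency, percentage, pie angle in degrees)}."""
    total = sum(freq.values())
    table = {}
    for category, f in freq.items():
        rel = f / total  # relative frequency = frequency / total frequency
        table[category] = (round(rel, 2), round(rel * 100, 1), round(rel * 360, 1))
    return table


table = summarize(freq)
for category, (rel, pct, deg) in table.items():
    print(f"{category:<6}{rel:>6.2f}{pct:>7.1f}{deg:>8.1f}")
```

Because each angle is the relative frequency times 360, the angles add up to a full circle whenever the relative frequencies add up to 1.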

The pie chart readily shows the size of the different portions and makes it easy to draw conclusions.

v) Pareto charts
Pareto charts classify categories into the vital few and the trivial many. The Pareto principle holds when the majority of items in a set of data occur in a small number of categories and the few remaining items are spread out over a larger number of categories. The separation helps to identify and focus on the important categories.

Example 2
The hotel X Y Z samples complaints about its hotel rooms and categorizes them as in the table below. The sample that gave the responses was made up of 106 customers. The table summarizes the complaint categories and the number of customers who complained about each issue.

      Reasons for complaint             Number of customers
A     Dirty room                        32
B     Not stocked                       17
C     Not ready                         12
D     Too noisy                         10
E     Needs maintenance                 17
F     Has too few beds                  9
G     Doesn't have promised features    7
H     No special accommodation          2

Required
a. Construct a Pareto chart.
b. Which reasons for complaint do you think the hotel managers should focus on if they want to reduce the number of complaints? Explain.
c. Also construct a pie chart and a bar chart, and compare the suitability of each chart for presenting this data.

Solution
Order the categories from the one with the highest frequency to the one with the lowest, convert the frequencies to percentages and show these in a column. Also add columns for cumulative frequency and cumulative percentage frequency. The categories can be identified with symbols to avoid a lot of writing. Let the categories in the question be labelled A to H before rearrangement. Arranging them in order from the highest frequency to the lowest gives: A B E C D F G H

The table has the following information:

Reasons   Frequency   Cumulative frequency   %      Cumulative %
A         32          32                     30.2   30.2
B         17          49                     16.0   46.2
E         17          66                     16.0   62.3
C         12          78                     11.3   73.6
D         10          88                     9.4    83.0
F         9           97                     8.5    91.5
G         7           104                    6.6    98.1
H         2           106                    1.9    100.0
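The Pareto table can be built mechanically from the complaint counts. The following is a minimal sketch, assuming the category labels A to H and the counts from Example 2; the function name `pareto_table` is illustrative. Cumulative percentages here are computed from the cumulative frequencies, so they can differ in the last digit from values obtained by adding up already rounded percentages.

```python
# Complaint counts from Example 2, keyed by the labels A to H.
complaints = {"A": 32, "B": 17, "C": 12, "D": 10, "E": 17, "F": 9, "G": 7, "H": 2}


def pareto_table(freq):
    """Return rows of (category, frequency, cumulative frequency, %, cumulative %)."""
    total = sum(freq.values())
    # sorted() is stable, so categories with equal counts keep their original order.
    ordered = sorted(freq.items(), key=lambda item: item[1], reverse=True)
    rows, running = [], 0
    for category, f in ordered:
        running += f
        rows.append((category, f, running,
                     round(100 * f / total, 1),
                     round(100 * running / total, 1)))
    return rows


rows = pareto_table(complaints)
for row in rows:
    print(*row, sep="\t")
```

Reading down the cumulative percentage column shows how few categories are needed to account for most of the complaints, which is the point of part b of the question.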

Summary: Summary tables together with the charts are used to describe the portion of items of interest in each category. Each chart best suits certain situations:

Bar chart: More suitable for comparing the size of categories, especially when there are not many of them; in our case, not more than six. With more than that, the chart becomes crowded.

Pie chart: Best suited to situations where the main objective is to investigate the portion a category occupies in relation to the whole. Coloring the sectors differently enhances the display. It is also best for few categories.

Pareto chart: Sorts the frequencies in descending order and provides the cumulative curve on the same graph. This allows the viewer to see which categories matter most in the given situation. The chart can present many categories, including those with small differences in percentage, because the curve makes it easy to identify the additional proportion contributed by each added category.

Quantitative Data
These are measurements that are recorded on a naturally occurring numerical scale. They are measured on an interval or ratio scale, as explained earlier. Quantitative data can be organized and presented in a number of ways, including:
Ordered array
Stem-and-leaf display
Summary tables
Histogram
Frequency polygons
The cumulative percentage polygon (ogive)

Quantitative data can be either discrete or continuous.

Discrete data: A variable whose values are countable, i.e. they assume whole-number values, e.g. number of persons, cars, companies.

Continuous data: A variable that can assume any numerical value over a certain interval or intervals, e.g. time taken to serve a customer in a bank, amount of money, height of individuals.

Discrete data can be organized and presented in:
Ordered array
Stem-and-leaf display
Bar chart
Summary tables

Example 1
The following data represent the stock prices of 25 companies.
31 16 22 13 22 15 22 18 26 27 13 12 33 16 20 17 23 21 26 20 23 30 18 27 22

Required: Construct
i. Ordered array
ii. Stem-and-leaf display

i) Ordered array: This requires that the data be written in ascending or descending order.

12 13 13 15 16 16 17 18 18
20 20 21 22 22 22 22 23 23
26 26 27 27 30 31 33

An ordered array is most useful when the data set is not very large.

ii) Stem-and-leaf display: Each value is split into a stem (the leading one, two or three digits, depending on the nature of the data) and a leaf (the remaining digits). Since the above data values have two digits, the tens digit forms the stem and the ones digit the leaf. The stems are 1, 2 and 3, i.e. tens, twenties and thirties, while the ones digits form the leaves.

Stem   Leaf
1      2 3 3 5 6 6 7 8 8
2      0 0 1 2 2 2 2 3 3 6 6 7 7
3      0 1 3

The ones are written after the appropriate tens. From the display, the twenties are the most frequent and the thirties the least.

Example 2
The following data represent the monthly rents paid by a sample of 30 households selected from a city.
429 540 650 585 732 956 950 675 550 1020
750 880 750 660 1070 871 780 989 900 620
578 1030 930 870 765 800 975 820 1020 840

Solution:
The values contain either three or four digits. We can take a one-digit stem for the three-digit numbers and a two-digit stem for the four-digit numbers, so that the leaf is always a two-digit number. The stem-and-leaf display does not require the data to be arranged in order; if the data is arranged, the pattern obtained stays the same.

Stem-and-leaf display
4    29
5    85 50 40 78
6    75 20 60 50
7    32 50 65 80 50
8    71 80 40 70 00 20
9    89 56 30 75 50 00
10   20 30 70 20

By looking at the stem-and-leaf display we can observe how the data values are distributed. The stem and leaf display does not lose the information on individual observation or measurement.
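The splitting of values into stems and leaves described above can be sketched in code. This is a minimal illustration, assuming whole-number data; the function name `stem_and_leaf` and the parameter `leaf_digits` are hypothetical names chosen for this sketch. It sorts the leaves within each stem, which the display does not strictly require.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_digits=1):
    """Group values by stem; the last `leaf_digits` digits of each value form the leaf."""
    base = 10 ** leaf_digits
    display = defaultdict(list)
    for value in sorted(values):
        display[value // base].append(value % base)
    return dict(display)


# Stock prices of the 25 companies from Example 1.
prices = [31, 16, 22, 13, 22, 15, 22, 18, 26, 27, 13, 12, 33,
          16, 20, 17, 23, 21, 26, 20, 23, 30, 18, 27, 22]

display = stem_and_leaf(prices)
for stem, leaves in display.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

For the monthly rents in Example 2 the same function could be called with `leaf_digits=2`, so that 429 becomes stem 4 with leaf 29 and 1020 becomes stem 10 with leaf 20.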

Example 3
The following data give the number of computer courses taken by 30 business majors who recently graduated from a university.
2 2 3 3 2 4 3 1 1 2 4 3 2 2 2 1 3 4 4 2

Required
a. Prepare a frequency distribution table.
b. Compute the relative frequency and percentage distributions.
c. Draw a bar graph for the frequency distribution.
d. What percentage of the graduates took 2 or 3 computer courses?

Solution
Identify all the values present in the data set: 1, 2, 3 and 4. Construct the summary table with columns for number of courses, tallies and frequency; the relative and percentage frequency distributions can be included in the same table.

Number of courses   Tally   Frequency (f)   Relative frequency (f/30)   Percentage frequency (× 100)
1                           7               0.2333                      23.33
2                           11              0.3667                      36.67
3                           7               0.2333                      23.33
4                           5               0.1667                      16.67
Total                       30              1.000                       100.00

Bar graph (chart)

The graduates who took 2 or 3 courses are the total of those who took 2 and those who took 3: (36.67 + 23.33)% = 60%.

Grouped/Continuous data
Discrete data can be presented like categorical data in a bar graph, with the values on the horizontal axis. Frequency, relative frequency and percentage distribution tables can be constructed as for categorical variables, with each discrete value standing as a category. For grouped data, however, the frequencies, relative frequencies and percentages are assigned to intervals of values in the table. A stem-and-leaf display may not be very applicable, and in place of a bar chart grouped data is presented in a histogram.

Sometimes it becomes necessary to look at the values in a data set in the form of classes or groups. Each class gives the number of values that fall within a given range. One must identify the class width, that is, the range of values accommodated in each class. The number of classes should be neither too few (not fewer than 3) nor too many (not more than 10) in the context of our class work. In real life, however, data may be grouped into many more classes.

This is necessary because we are interested in presenting data in a more organized, easily interpretable form and in a way that makes sense.

Example 1
The data on the stock prices of 25 companies:
31 15 13 33 23 16 12 12 23 26 22 18 27 21 18 13 26 16 17 27 22 22 26 20 30

To group the data we can choose a class width of 4 or 5. Starting at 10, the classes must cover about 25 units (10 to 34). With a width of 4 the approximate number of classes is 25/4 = 6.25, i.e. about 6; with a width of 5 it is 25/5 = 5 classes. Either can be used. Let us use a class width of 5.

Identify the smallest value, which is 12. The first class can start at the lowest value in the data, or we can decide to start at 10. This means the first class covers 10, 11, 12, 13, 14, i.e. 10-14. The next class covers 15, 16, 17, 18, 19, i.e. 15-19, and so on: 20-24, 25-29, etc. Written this way, the lowest and highest values in each class are included in the interval. The classes can also be written as 10 to less than 15, 15 to less than 20, etc.; in this style the upper value in each class is not included. In either case the class width is five. Be careful to use each style correctly.

We can include the relative frequency and percentage distributions in the summary table. Grouped data can be presented in a
i. Summary table
ii. Histogram
iii. Frequency polygon
iv. Cumulative frequency curve (ogive)

i) Summary table

Class   Tally   Frequency   Relative frequency (f/25)   % frequency (× 100)   Cumulative %
10-14           4           0.16                        16                    16
15-19           6           0.24                        24                    40
20-24           8           0.32                        32                    72
25-29           4           0.16                        16                    88
30-34           3           0.12                        12                    100
Total           25          1.00                        100
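The columns of the grouped summary table follow directly from the class limits and frequencies. The sketch below is a minimal illustration, assuming the class limits and frequencies from the table above; the function name `grouped_summary` and the dictionary keys are hypothetical. It also derives the class boundaries (limits adjusted by 0.5) and class midpoints, which are the quantities needed when drawing a histogram and a frequency polygon.

```python
# Class limits and frequencies copied from the summary table.
classes = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34)]
freqs = [4, 6, 8, 4, 3]


def grouped_summary(classes, freqs):
    """Return one dict per class with the derived summary columns."""
    total = sum(freqs)
    rows, cumulative = [], 0
    for (lower, upper), f in zip(classes, freqs):
        cumulative += f
        rows.append({
            "class": f"{lower}-{upper}",
            "boundaries": (lower - 0.5, upper + 0.5),  # so adjacent classes meet
            "midpoint": (lower + upper) / 2,
            "frequency": f,
            "relative": round(f / total, 2),
            "percent": round(100 * f / total, 1),
            "cumulative_percent": round(100 * cumulative / total, 1),
        })
    return rows


summary = grouped_summary(classes, freqs)
for row in summary:
    print(row)
```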

ii) Histogram
This is a graph in which the classes are marked on the horizontal axis. The classes are written so that adjacent classes meet: each class is adjusted by subtracting 0.5 from its lower limit and adding 0.5 to its upper limit, giving 9.5-14.5, 14.5-19.5, 19.5-24.5, etc. The vertical axis takes the frequency, relative frequency or percentage frequency. The scale must include the highest frequency, in this case 8. Draw bars with heights corresponding to the frequency in each class, making sure that the bars are adjacent (touching), because the data is continuous and any value in the range can occur. The information that can be obtained from a histogram is much like that from a bar chart for discrete or qualitative data. Like a stem-and-leaf display, a histogram also displays the distribution pattern of the data. Data may be normally (or approximately normally) distributed or skewed, and a histogram displays this well.

iii) Frequency polygon
It is formed by plotting the midpoint of each class against the frequency and joining the points with straight lines. The polygon can be drawn on the histogram by marking the middle of the top of each bar and joining the points. It is also effective for displaying the pattern of the data across the classes.

iv) Cumulative frequency graph: Ogive
The graph is drawn by plotting the upper boundary of each class against the cumulative frequency, cumulative relative frequency or cumulative percentage frequency. It can then be used to read off, for example, the proportion of companies whose stock prices fell between 15 and 22. We may also want to know the number of companies whose stock prices were 27 or below. Locate 27 on the price axis and draw a vertical line to meet the curve. Then draw a horizontal line across to read the cumulative frequency, which is 21. Therefore 21 companies had stock prices of 27 or below, and 25 - 21 = 4 companies had stock prices above 27.

The cumulative frequency curve: ogive

Lorenz curve: It is a special ogive that plots the cumulative income or wealth of a country against the cumulative population. It shows how wealth is distributed in a given country. Many Lorenz curves form a long S shape, showing some level of unequal distribution of wealth among citizens. Tax policies can be used to level out the inequality by charging higher tax rates for the wealthier and lower rates for the less wealthy part of the population. For a perfectly equal distribution the long S shape becomes a straight line, an ideal situation; the more equitably wealth is distributed, the nearer the curve is to a straight line. The curve is drawn with the cumulative percentage of the population on the vertical axis and the cumulative amount of wealth or income on the horizontal axis.

3. NUMERICAL DESCRIPTIVE MEASURES AND ANALYSIS
The descriptive measures can be classified as:
Measures of central tendency: mode, median and mean
Measures of dispersion or spread: range, variance, standard deviation, semi-interquartile range, coefficient of variation, etc.

Measures of Central Tendency: Discrete data
These are summary measures that give averages. The measures of central tendency can be calculated for discrete (ungrouped) or continuous (grouped) data.

Discrete data:

a. Mode: This is the most popular or common item in the data set; it is the value with the highest frequency. A data set can be unimodal (one mode), bimodal (two modes) or multimodal.

Example
29 31 35 39 39 40 43 44 44
The above set is bimodal, with modes 39 and 44.

b. Median: It is the value of the middle term in a data set that has been ranked in ascending or descending order. The position of the median is
$\frac{N+1}{2}$, where $N$ is the total frequency.
In the above data the median is at position $\frac{9+1}{2} = 5$, i.e. the 5th value, which is 39.

Example 1
23 36 210 249 257 506 385 13 50 97 234 275
Find the median.

Solution
Arrange the data in ascending order:
13 23 36 50 97 210 234 249 257 275 385 506
Middle position $= \frac{n+1}{2} = \frac{13}{2} = 6.5$
The median lies between the 6th and 7th positions:
$\frac{210 + 234}{2} = 222$

The advantage of using the median as a measure of central tendency is that it is not influenced by outliers. It is preferred to the mean for data sets that contain outliers. Outliers are the few figures in the data that have extreme values compared with the rest: either very low or very high.

Mean (arithmetic mean): It is the most frequently used measure of central tendency. It is the sum of all values divided by the total frequency. The mean is preferred in that it represents the whole data set from which it is computed.

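Sample data: $\bar{x} = \frac{\sum x}{n}$, where $\bar{x}$ is the sample mean and $n$ is the sample size.

Population data: $\mu = \frac{\sum X}{N}$, where $\mu$ is the mean from population data and $N$ is the population size.

Example 2
The following data give the profits, in thousand dollars, of a sample of five companies in a given year:
4725 1884 3807 4939 and 1625

$\bar{x} = \frac{\sum x}{n} = \frac{16980}{5} = 3396$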

The average profit of the five companies is 3,396,000 dollars. A major shortcoming of the mean as a measure of central tendency is that it is very sensitive to outliers.

Example 3
The following data give the number of years eight employees have been with their current employers:
11 9 13 12 8 9 24 10
a) Identify the outlier.
b) What would be the mean if the outlier was i) excluded ii) included?

Solution
a) The outlier is 24, which is an extreme number of years for an employee to have been with the employer compared with the rest.

b) i) Mean excluding 24:
$\frac{11+9+13+12+8+9+10}{7} = \frac{72}{7} = 10.286$

ii) Mean including 24:
$\frac{11+9+13+12+8+9+10+24}{8} = \frac{96}{8} = 12$

The one extreme value changes the mean by almost 2 units, i.e. from 10.286 to 12 (a difference of 1.714). The mean is very sensitive to outliers. For example, the mean mark of a BCM 307 test can easily be affected by a few very poorly performing students or a few very well performing students. The mean may then not accurately represent the whole class.

Example 4
The mean of 60, 80, 90, 120:
$\frac{60+80+90+120}{4} = \frac{350}{4} = 87.5$
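The mode, median and mean examples above can be checked with Python's standard `statistics` module. This is a minimal sketch that simply applies library functions to the data sets from the examples; the variable names are illustrative.

```python
from statistics import mean, median, multimode

# Mode: multimode returns every value that shares the highest frequency.
modes = multimode([29, 31, 35, 39, 39, 40, 43, 44, 44])

# Median: with an even number of values, the two middle values are averaged.
ordered = [13, 23, 36, 50, 97, 210, 234, 249, 257, 275, 385, 506]
middle = median(ordered)

# Mean with and without the outlier (24 years) from Example 3.
years = [11, 9, 13, 12, 8, 9, 24, 10]
mean_with_outlier = mean(years)
mean_without_outlier = mean([y for y in years if y != 24])

print(modes, middle, mean_with_outlier, round(mean_without_outlier, 3))
```

Comparing `mean_with_outlier` and `mean_without_outlier` shows the sensitivity of the mean to a single extreme value, while the median of the same data would move very little.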

The arithmetic mean is very useful because it represents the values of most observations in the population. The mean therefore describes the population quite well in terms of the magnitudes attained by most of its members.

Measures of Dispersion: Discrete data
These are statistics or measures that show how data is dispersed. The measures include:
Range
Interquartile range
Variance
Standard deviation

Range: The difference between the highest and the lowest value.

Example 1
The example on the number of years the employees have stayed with the employer:
11 9 13 12 8 9 24 10
Range: 24 - 8 = 16

The range is influenced by outliers because it is based on only two values. Its disadvantage is that it ignores the rest of the values in the data set, so it is not a satisfactory measure of dispersion.

Interquartile Range: The difference between the upper quartile Q3 and the lower quartile Q1. It contains the middle 50% of the data.

Example 1
Arrange the data in order:
8 9 9 10 11 12 13 24
1st quartile (Q1): position = 1/4 × 8 = 2nd value, so Q1 = 9
3rd quartile (Q3): position = 3/4 × 8 = 6th value, so Q3 = 12
Interquartile range: 12 - 9 = 3

Example 2
The following is a set of discrete data:
2, 5, 8, 10, 11, 14, 17, 20

Required: (i) (ii) Solution Position = .3(n + 1) = .3(9) = 2.7 30th percentile = 5 + .7(8 5) = 5 + 2.1 = 7.1 Lower Quartile (25th percentile) Position = .25(n + 1) = .25(9) = 2.25 Q1 = 5+.25(8 5) = 5 + .75 = 5.75 Median (50th percentile) Position = .5(n + 1) = .5(9) = 4.5 Median: Q2 = 10+.5(11 10) = 10.5 Upper Quartile (75th percentile) Position = .75(n + 1) = .75(9) = 6.75 Q3 = 14+.75(17 14) = 16.25 Interquartiles IQ = Q3 Q1 = 16.25 5.75 = 9.50 Example 2 (Grouped Data) The following table shows the levels of retirement benefits given to a group of workers in a given establishment. Retirement benefits 000 20 30 40 50 60 70 80 29 39 49 59 69 79 89 No of retirees (f) 50 69 70 90 52 40 11 Upper class limit 29.5 39.5 49.5 59.5 69.5 79.5 89.5 50 119 189 279 331 371 382 cf Find the 30th percentile The quartiles.

Required
i. Determine the semi-interquartile range for the above data
ii. Determine the minimum value for the top ten per cent (10%)
iii. Determine the maximum value for the lower 40% of the retirees

Solution
i. The lower quartile (Q1) lies at position (N + 1)/4 = (382 + 1)/4 = 95.75, which falls in the 30-39 class (lower boundary 29.5, cumulative frequency below the class 50, class frequency 69).
The value of Q1 = 29.5 + [(95.75 - 50)/69] x 10 = 29.5 + 6.63 = 36.13
The upper quartile (Q3) lies at position 3(N + 1)/4 = 3(382 + 1)/4 = 287.25, which falls in the 60-69 class.
The value of Q3 = 59.5 + [(287.25 - 279)/52] x 10 = 59.5 + 1.59 = 61.09
The semi-interquartile range = (Q3 - Q1)/2 = (61.09 - 36.13)/2 = 12.48, i.e. about 12,480

ii. The minimum value for the top 10% is the maximum value for the lower 90% of the retirees.
The position corresponding to the lower 90% = (90/100)(N + 1) = 0.9 x 383 = 344.7, which falls in the 70-79 class.
The benefit corresponding to the minimum value for the top 10% = 69.5 + [(344.7 - 331)/40] x 10 = 69.5 + 3.425 = 72.925, i.e. 72,925

iii. The lower 40% corresponds to position (40/100)(382 + 1) = 153.20, which falls in the 40-49 class.
Retirement benefit corresponding to this position = 39.5 + [(153.2 - 119)/70] x 10 = 39.5 + 4.89 = 44.39, i.e. about 44,390

e. The 10th-90th percentile range

This is a measure of dispersion based on percentiles. A percentile is a value that separates one division from the next when the data are divided into 100 equal divisions. This measure of dispersion is very important when calculating the coefficient of skewness.

Variance: Variance is the square of the standard deviation.
Formula: σ² = Σ(x - μ)² / N
where x: the values in the data
N: size of the population
μ: the mean

Standard deviation: It is the square root of the average squared deviation of the values of a variable from the mean. The deviation of each value from the mean is squared, and the sum of the squared deviations is divided by the total frequency (N) for population data, or by the sample size less 1 (n - 1) if sample data are used; the square root is then taken.
Formula for population data: σ = √[Σ(x - μ)² / N]

Example 1
Assume the data on the number of years employees remained with the employer were collected from a sample.
Variance: S² = Σ(x - x̄)² / (n - 1)
Mean: x̄ = 12 (obtained earlier)

S² = [(11-12)² + (9-12)² + (13-12)² + (12-12)² + (8-12)² + (9-12)² + (24-12)² + (10-12)²] / (8 - 1)
   = (1 + 9 + 1 + 0 + 16 + 9 + 144 + 4) / 7
   = 184 / 7 = 26.286

The average squared deviation of the values from the mean is 26.286.
Standard deviation: S = √26.286 = 5.127

On average each value deviates from the mean by 5.127 years. In general, the lower the standard deviation of a data set, the closer together the values lie around the mean; a higher standard deviation indicates that the values are relatively spread out or scattered. If the standard deviation of the scores obtained by students in a BCM 307 class is higher than that of the scores obtained in a different class, the abilities of the students are more spread out: some perform very poorly while others perform well.

If the data set is larger, the working can be done from a frequency distribution table.

Example 2
A sample comprises the following observations: 14, 18, 17, 16, 25, 31. Determine the standard deviation of this sample.

x       (x - x̄)   (x - x̄)²
14      -6.1      37.21
18      -2.1      4.41
17      -3.1      9.61
16      -4.1      16.81
25      4.9       24.01
31      10.9      118.81
Total   121                 210.86

x̄ = Σx/n = 121/6 = 20.1 (taken to one decimal place)
Standard deviation = √[Σ(x - x̄)²/n] = √(210.86/6) = 5.93
(With the sample divisor n - 1 = 5 described above, the value is √(210.86/5) = 6.49.)

Example 3
The data represent the number of bedrooms in homes owned by 30 families:
3 4 3 5 1 4 2 4 3 3
3 1 2 1 2 3 3 4 1 3
2 2 2 2 1 2 5 3 3 3

Required
a) Identify the mode
b) Calculate the
i) mean
ii) variance and standard deviation

Solution
Construct a frequency distribution table.

Number of rooms (x)   Frequency (f)   fx    x - x̄    f(x - x̄)²
1                     5               5     -1.67    13.9445
2                     8               16    -0.67    3.5912
3                     11              33    0.33     1.1979
4                     4               16    1.33     7.0756
5                     2               10    2.33     10.8578
Total                 30              80             36.667

n = Σf = 30, Σfx = 80, Σf(x - x̄)² = 36.667

a) The mode is 3 bedrooms.
b) i) Mean: x̄ = Σfx/n = 80/30 = 2.67
ii) Variance: S² = Σf(x - x̄)²/(n - 1) = 36.667/(30 - 1) = 1.264
Standard deviation: S = √1.264 = 1.124
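The frequency-table working for the bedroom data can be checked with a short script. This is a minimal sketch in Python; the function name `freq_table_stats` is an illustrative choice, not part of the course material, and it assumes sample data (divisor n - 1).

```python
from math import sqrt

def freq_table_stats(values, freqs):
    """Mean, sample variance and sample standard deviation from a frequency table."""
    n = sum(freqs)                                        # total frequency
    mean = sum(x * f for x, f in zip(values, freqs)) / n  # sum(fx) / n
    # Sum of f * (x - mean)^2, divided by n - 1 because the data are a sample
    ss = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs))
    variance = ss / (n - 1)
    return mean, variance, sqrt(variance)

rooms = [1, 2, 3, 4, 5]        # number of bedrooms (x)
families = [5, 8, 11, 4, 2]    # frequency (f)

mean, variance, sd = freq_table_stats(rooms, families)
print(round(mean, 2), round(variance, 3), round(sd, 3))
```

Working from the unrounded mean avoids the small rounding differences that arise when the deviations are first rounded to two decimal places.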

The standard deviation of 1.124 means that, on average, each value differs from the mean by about 1.124 bedrooms.

Coefficient of Variation
The variances or standard deviations of different data sets are not easy to compare directly. The coefficient of variation makes it possible to compare different data sets by relating a measure of dispersion (normally the standard deviation) to a measure of central tendency (normally the mean).
Coefficient of variation: CV = standard deviation / mean
In the above example: CV = 1.124/2.67 = 0.421
CV can also be written as a percentage: CV = (1.124/2.67) x 100 = 42.1%
The lower the CV, the smaller the spread of the values about the mean, i.e. the values are closer together.

Measures of Central Tendency and Measures of Dispersion for Continuous Data

Example 1
The table gives the frequency distribution of the daily commuting time from home to work for all employees of a company.

Time (minutes)        Number of employees
0 to less than 10     4
10 to less than 20    9
20 to less than 30    6
30 to less than 40    4
40 to less than 50    2
Total                 25

Solution: The measures are computed as for discrete data, with the value of x taken as the mid-point of each class: x = (sum of the class boundaries)/2, e.g. (0 + 10)/2 = 5 is the mid-point of the 1st class.

Time                  Mid-point (x)   Frequency (f)   fx     (x - x̄)² f
0 to less than 10     5               4               20     1075.84
10 to less than 20    15              9               135    368.64
20 to less than 30    25              6               150    77.76
30 to less than 40    35              4               140    739.84
40 to less than 50    45              2               90     1113.92
Total                                 Σf = 25         Σfx = 535    Σ(x - x̄)² f = 3376.00

The mean may be written μ instead of x̄ when the data come from a population; whether the column is headed (x - x̄)² f or (x - μ)² f makes no difference to the values.
Mean: μ = Σfx/N = 535/25 = 21.4
Standard deviation: σ = √[Σ(x - μ)² f / N] = √(3376/25) = √135.04 = 11.62
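The mid-point method for the commuting-time table can be sketched in a few lines of Python. The variable names are illustrative assumptions; the data are treated as a population (divisor N), as in the worked solution.

```python
from math import sqrt

# Class boundaries (minutes) and number of employees in each class
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [4, 9, 6, 4, 2]

# Each class is represented by its mid-point
midpoints = [(lower + upper) / 2 for lower, upper in classes]

N = sum(freqs)
mean = sum(f * x for x, f in zip(midpoints, freqs)) / N
# Population variance: sum of f * (x - mean)^2 divided by N
variance = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs)) / N
sd = sqrt(variance)

print(midpoints)
print(mean, round(variance, 2), round(sd, 2))
```

Because every observation in a class is replaced by the class mid-point, the results are estimates of the true mean and standard deviation of the ungrouped data.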

NB: For continuous data the mode is replaced by the modal class, which is simply the class with the highest frequency. For the above example the modal class is 10 to less than 20.

Practice Questions:
1. The following data represent the age of a sample of 10 employees of a given company:
39 29 43 52 39 44 40 31 44 35
Required:
i) identify the mode
ii) identify the median
iii) compute the mean
iv) compute the standard deviation
v) compute the coefficient of variation

2. The data give the frequency distribution of the number of orders received each day during the past 50 days at the office of a mail order company.

Number of orders   Number of days
10-12              4
13-15              12
16-18              20
19-21              14

a) Identify the modal class
b) Calculate the
i) mean
ii) variance and standard deviation
iii) coefficient of variation

3. The price of the ordinary 25p shares of Manco PLC quoted on the stock exchange at the close of business on successive Fridays is tabulated below:
126 125 128 124 127 120 127 126 127 122
122 113 117 114 106 105 112 114 111 121
129 130 120 116 116 119 122 123 131 135
131 134 127 128 142 138 136 140 137 130
Required
a) Group the above data into eight classes.
b) Calculate the cumulative frequency, the median value, the quartile values and the semi-interquartile range.
c) Calculate the mean and standard deviation of your frequency distribution.
d) Compute:
i) The median and mean
ii) The semi-interquartile range and the standard deviation

5. The managers of an import agency are investigating the length of time that customers take to pay their invoices, the normal terms for which are 30 days net. They have checked the payment record of 100 customers chosen at random and have compiled the following table:

Payment in        Number of customers
5 to 9 days       4
10 to 14 days     10
15 to 19 days     17
20 to 24 days     20
25 to 29 days     22
30 to 34 days     16
35 to 39 days     8
40 to 44 days     3

Required:
a) Calculate the arithmetic mean.
b) Calculate the standard deviation.
c) Construct a histogram and insert the modal value.
d) Estimate the probability that an unpaid invoice chosen at random will be between 30 and 39 days old.

4. PROBABILITY DISTRIBUTIONS

A probability distribution can be either discrete or continuous, and it can take a uniform, normal or skewed shape. For any distribution of numerical data, whether discrete or continuous, the mean and standard deviation can be used to find the proportion or percentage of the total observations that fall within a given interval about the mean. The pattern of the data values across the entire range of values gives the distribution a certain shape. The shape can be identified from a bar chart for discrete data or a histogram for continuous data. The shape of the distribution can be
i) Uniform
ii) Bell-shaped, that is, symmetrical
iii) Skewed

[Figure: i) Uniform or rectangular distribution]

[Figure: ii) Symmetrical, bell-shaped distribution]

For a symmetrical continuous distribution the measures of central tendency (mode, median and mean) are equal, and their common value lies at the middle of the shape. Such a distribution is called a normal distribution or Gaussian distribution.

a) DISCRETE DATA
A probability distribution for a discrete random variable is a mutually exclusive listing of all the possible numerical outcomes, together with the probability of occurrence of each outcome.

Example 1
The following table contains the probability distribution for the number of traffic accidents daily in a small city.

Number of accidents (x)   Probability P(x)
0                         0.10
1                         0.20
2                         0.45
3                         0.15
4                         0.05
5                         0.05

Required: Compute
a) the expected number of accidents
b) the variance and standard deviation
c) the coefficient of variation

Solution
Probability is a term that reflects uncertainty. It is used to make predictions by assigning a probability to an event happening. The mean of such a distribution is referred to as the expected value E(x):
E(x) = μ = Σ xi P(xi)
where xi is a value of the variable and P(xi) is the probability that xi will occur.
The variance: σ² = Σ (xi - E(x))² P(xi)

Number of accidents (xi)   P(xi)   xi P(xi)   (xi - E(x))² P(xi)
0                          0.10    0.00       0.40
1                          0.20    0.20       0.20
2                          0.45    0.90       0.00
3                          0.15    0.45       0.15
4                          0.05    0.20       0.20
5                          0.05    0.25       0.45
Total                      1.00    2.00       1.40

i) Expected value: E(x) = μ = Σ xi P(xi) = 2.00
ii) Variance: σ² = Σ (xi - E(x))² P(xi) = 1.4
iii) Standard deviation: σ = √1.4 = 1.1832
iv) Coefficient of variation: CV = σ/μ = 1.1832/2 = 0.592 (59.2%)

Example 2

Given the following probability distributions A and B:

Distribution A        Distribution B
x    P(x)             x    P(x)
0    0.25             0    0.15
1    0.25             1    0.25
2    0.25             2    0.45
3    0.25             3    0.15

a) Compute:
i) The expected value for each distribution
ii) The standard deviation for each distribution
iii) Compare the results of distributions A and B.

Solution

Distribution A
x    P(x)    x P(x)    [x - E(x)]² P(x)
0    0.25    0.00      0.5625
1    0.25    0.25      0.0625
2    0.25    0.50      0.0625
3    0.25    0.75      0.5625
μ = E(x) = 1.5; σ² = 1.25; σ = √1.25 = 1.118

Distribution B
x    P(x)    x P(x)    [x - E(x)]² P(x)
0    0.15    0.00      0.384
1    0.25    0.25      0.090
2    0.45    0.90      0.072
3    0.15    0.45      0.294
μ = E(x) = 1.6; σ² = 0.84; σ = √0.84 = 0.917
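The expected values and variances of distributions A and B can be verified with a small helper. This is a sketch in Python; `discrete_summary` is a hypothetical name introduced here for illustration.

```python
from math import sqrt

def discrete_summary(dist):
    """Return (expected value, variance, standard deviation) of a discrete
    distribution given as a mapping {outcome x: probability P(x)}."""
    # A valid probability distribution must have probabilities summing to 1
    assert abs(sum(dist.values()) - 1.0) < 1e-9, "probabilities must sum to 1"
    mean = sum(x * p for x, p in dist.items())                     # E(x)
    variance = sum((x - mean) ** 2 * p for x, p in dist.items())   # sigma^2
    return mean, variance, sqrt(variance)

dist_a = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
dist_b = {0: 0.15, 1: 0.25, 2: 0.45, 3: 0.15}

mean_a, var_a, sd_a = discrete_summary(dist_a)
mean_b, var_b, sd_b = discrete_summary(dist_b)
print(round(mean_a, 2), round(var_a, 2), round(sd_a, 3))
print(round(mean_b, 2), round(var_b, 2), round(sd_b, 3))
```

The same function applies to any table of outcomes and probabilities, such as the traffic accident distribution in Example 1.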

Distribution A is uniform and symmetric. Distribution B has one mode, at x = 2, and the smaller variance, so its values are more concentrated around the mean.

b) CONTINUOUS DISTRIBUTION: NORMAL DISTRIBUTION

A frequency distribution for continuous data can be converted to a probability distribution by calculating the relative frequency for each class. This column is taken as the equivalent of the probabilities for each class. Like the sum of the relative frequencies, the total probability is equal to 1, i.e. ΣP(x) = 1.

The normal distribution is the most common continuous distribution used in statistics, for the following main reasons:
Numerous continuous variables common in business and other natural occurrences have distributions that closely resemble the normal distribution.
The normal distribution can be used to approximate various discrete probability distributions.
It provides the basis for classical statistical inference.

The normal distribution is represented by the classical bell shape, and its probability density function is denoted by the symbol f(x). The mean (μ) is in the middle of the symmetrical distribution. The standard deviation (σ) measures the distance from the mean to a point on the x (horizontal) axis. In order to work with a set of standard values it is necessary to convert or transform any normal distribution to the standard normal distribution, which has a mean of 0 and a standard deviation of 1. The total area under the curve is 1, and each half of the curve has an area of 0.5. Any value of x in a distribution can be converted to a value called the z value or z-score by the formula:
Z = (x - μ)/σ

where x is the value of the variable, μ is the mean and σ is the standard deviation. The areas corresponding to Z values are obtained from the normal probability distribution tables; each tabulated area is the shaded area under the normal curve between 0 and Z.

Example 1
The heights of adult males are normally distributed with mean 170 cm and standard deviation 10 cm. Find the probability that the height of an adult male is:
a) Between 180 cm and 190 cm
b) Taller than 190 cm
c) Shorter than 180 cm
d) Shorter than 165 cm
e) Between 165 cm and 177 cm

Solution
The heights are normally distributed with μ = 170 (the mean, at the middle of the curve) and σ = 10. Use the formula to find Z, then find the area (probability) from the normal tables.

a) Between 180 cm and 190 cm
Z = (x - μ)/σ = (180 - 170)/10 = 1 (180 is 1 standard deviation above the mean)
Z = (190 - 170)/10 = 2 (190 is 2 standard deviations above the mean)
Areas from the normal tables:
Z    Area under the curve between 0 and Z
1    0.3413
2    0.4772
P(180 < x < 190) = P(1 < Z < 2) = 0.4772 - 0.3413 = 0.1359

b) Taller than 190 cm: P(x > 190)
Z = (190 - 170)/10 = 2, and the area between 0 and Z = 2 is 0.4772
P(Z > 2) = 0.5 - 0.4772 = 0.0228

c) Shorter than 180 cm
Z = (180 - 170)/10 = 1
P(x < 180) = P(Z < 1) = 0.5 + 0.3413 = 0.8413

d) Shorter than 165 cm
Z = (165 - 170)/10 = -0.5, and the area between 0 and Z = 0.5 is 0.1915
P(x < 165) = P(Z < -0.5) = 0.5 - 0.1915 = 0.3085

e) Between 165 cm and 177 cm
Z = (165 - 170)/10 = -0.5, area 0.1915
Z = (177 - 170)/10 = 0.7, area 0.2580
P(165 < x < 177) = P(-0.5 < Z < 0.7) = 0.1915 + 0.2580 = 0.4495

Example 2
If X is a variable following a normal distribution with parameters μ = 3 and σ² = 9 (so σ = 3), find
i. P(2 < X < 5)
ii. P(X > 0)
iii. P(X > 9)

Solution
(i) P(2 < X < 5) = P(-0.33 < Z < 0.67) = 0.1293 + 0.2486 = 0.3779
(ii) P(X > 0) = P(Z > -1) = P(Z < 1) = 0.8413
(iii) P(X > 9) = P(Z > 2.0) = 0.5 - 0.4772 = 0.0228
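The table look-ups in these examples can be cross-checked numerically. The sketch below uses the error function `math.erf` from the Python standard library to evaluate the cumulative normal probability; the helper names `normal_cdf` and `prob_between` are assumptions made for illustration. Results can differ from table values in the fourth decimal place, because the tables round Z to two decimals.

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X < x) for a normal variable with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma                 # standardise to a z-score
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def prob_between(low, high, mu, sigma):
    """P(low < X < high)."""
    return normal_cdf(high, mu, sigma) - normal_cdf(low, mu, sigma)

# Heights of adult males: mean 170 cm, standard deviation 10 cm
print(round(prob_between(180, 190, 170, 10), 4))   # between 180 and 190
print(round(1 - normal_cdf(190, 170, 10), 4))      # taller than 190
print(round(normal_cdf(180, 170, 10), 4))          # shorter than 180
print(round(normal_cdf(165, 170, 10), 4))          # shorter than 165
print(round(prob_between(165, 177, 170, 10), 4))   # between 165 and 177
```

Note that `normal_cdf` returns the whole area to the left of x, whereas the tables used in the text give the area between 0 and Z; the two differ by 0.5 for positive Z.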

Example 3
A sample of students had a mean age of 35 years with a standard deviation of 5 years. A student was randomly picked from a group of 200 students. Find the probability that the age of the student turned out to be as follows:
i. Lying between 35 and 40
ii. Lying between 30 and 40
iii. Lying between 25 and 30
iv. Lying beyond 45 yrs
v. Lying beyond 30 yrs
vi. Lying below 25 years

Solution
(i) The standardized value for 35 years: Z = (35 - 35)/5 = 0
The standardized value for 40 years: Z = (40 - 35)/5 = 1
The values from the standard normal curve tables (see appendix) are: when Z = 0, p = 0, and when Z = 1, p = 0.3413.
The area under the curve between Z = 0 and Z = 1 is 0.3413 - 0 = 0.3413.
The probability of the age lying between 35 and 40 yrs is 0.3413.

(ii) 30 and 40 years
Z = (30 - 35)/5 = -5/5 = -1
Z = (40 - 35)/5 = 5/5 = 1
The area between Z = -1 and Z = 1 is 0.3413 (lying on the negative side of zero) + 0.3413 (lying on the positive side of zero), so P = 0.6826.
The probability of the age lying between 30 and 40 yrs is 0.6826.

(iii) 25 and 30 years
Z = (25 - 35)/5 = -10/5 = -2
Z = (30 - 35)/5 = -1
The area required lies between Z = -2 and Z = -1.
Area corresponding to Z = -2 is 0.4772 (the z value to check from the tables is 2)
Area corresponding to Z = -1 is 0.3413 (the z value to check is 1)
The probability that the age lies between 25 and 30 yrs = 0.4772 - 0.3413 = 0.1359

(iv) P(beyond 45 years) = P(x > 45)
Z = (45 - 35)/5 = +10/5 = +2
The area corresponding to Z = 2 is 0.4772, which is the probability of an age between 35 and 45.
P(Age > 45 yrs) = 0.5000 - 0.4772 = 0.0228

Practice Questions
1. Identify the following as discrete or continuous random variables.
(i) The market value of a publicly listed security on a given day
(ii) The number of printing errors observed in an article in a weekly news magazine
(iii) The time to assemble a product (e.g. a chair)
(iv) The number of emergency cases arriving at a city hospital
(v) The number of sophomores in a randomly selected Math. class at a university
(vi) The rate of interest paid by your local bank on a given day

2. A random variable X has the following probability distribution:
x      1     2     3     4     5
P(x)   .05   .10   .15   .45   .25

(i) Verify that X has a valid probability distribution.
(ii) Find the probability that X is greater than 3, i.e. P(X > 3).
(iii) Find the probability that X is greater than or equal to 3, i.e. P(X ≥ 3).
(iv) Find the probability that X is less than or equal to 2, i.e. P(X ≤ 2).
(v) Find the probability that X is an odd number.
(vi) Graph the probability distribution for X.

3. Calculate the area under the standard normal curve between the following values.
(i) z = 0 and z = 1.6 (i.e. P(0 ≤ Z ≤ 1.6))
(ii) z = -1.6 and z = 0 (i.e. P(-1.6 ≤ Z ≤ 0))
(iii) z = .86 and z = 1.75 (i.e. P(.86 ≤ Z ≤ 1.75))
(iv) z = -1.75 and z = -.86 (i.e. P(-1.75 ≤ Z ≤ -.86))
(v) z = 1.26 and z = 1.86 (i.e. P(1.26 ≤ Z ≤ 1.86))
(vi) z = -1.0 and z = 1.0 (i.e. P(-1.0 ≤ Z ≤ 1.0))
(vii) z = -2.0 and z = 2.0 (i.e. P(-2.0 ≤ Z ≤ 2.0))
(viii) z = -3.0 and z = 3.0 (i.e. P(-3.0 ≤ Z ≤ 3.0))

4. Let Z be a standard normal random variable. Find z0 such that
(i) P (Z z0) = 0.05
(ii) P (Z z0) = 0.99
(iii) P (Z z0) = 0.0708
(iv) P (Z z0) = 0.0708
(v) P(-z0 ≤ Z ≤ z0) = 0.68
(vi) P(-z0 ≤ Z ≤ z0) = 0.95

5. A normally distributed random variable X possesses a mean of μ = 10 and a standard deviation of σ = 5. Find the following probabilities.
(i) X falls between 10 and 12 (i.e. P(10 ≤ X ≤ 12)).
(ii) X falls between 6 and 14 (i.e. P(6 ≤ X ≤ 14)).
(iii) X is less than 12 (i.e. P(X ≤ 12)).
(iv) X exceeds 10 (i.e. P(X ≥ 10)).

6. CONFIDENCE INTERVAL

An interval estimate, or confidence interval, consists of a range (between a lower confidence limit and an upper confidence limit) within which we are confident that a population parameter lies, together with a probability that this interval contains the true population value. The confidence interval is the interval between the confidence limits. The higher the confidence level, the wider the confidence interval. For example, a normal distribution has the following characteristics:
i. Mean ± 1.960σ includes 95% of the population
ii. Mean ± 2.576σ includes 99% of the population

Large Samples
The Central Limit Theorem: The theorem states that if we select a large number of simple random samples of size n from any population and determine the mean of each sample, the distribution of these sample means will tend to be described by the normal probability distribution with mean μ and variance σ²/n. This is true even if the population itself is not normally distributed. In other words, the sampling distribution of sample means approaches a normal distribution irrespective of the distribution of the population from which the samples are taken, and the approximation becomes increasingly close as the sample size increases.

Large samples are those with a sample size greater than 30 (i.e. n > 30). Such samples can use levels of confidence based on the normal distribution.

Estimation of population mean
Here we assume that if we take a large sample from a population, then the mean of the population is very close to the mean of the sample. The steps to follow to estimate the population mean are:
i. Take a random sample of n items where n > 30; n is the sample size
ii. Compute the sample mean (x̄) and standard deviation (s)
iii. Compute the standard error of the mean using the formula S_x̄ = s/√n
where S_x̄ = standard error of the mean, s = standard deviation of the sample, n = sample size
iv. Choose a confidence level, e.g. 95% or 99%
v. Estimate the population mean as
Population mean = x̄ ± (appropriate number) x S_x̄
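The steps above can be sketched as a short Python function. The function name `mean_confidence_interval` and the sample figures (mean 50, standard deviation 12, n = 36) are hypothetical values chosen only for illustration; the multiplier z is assumed to come from the normal tables (1.96 at 95% confidence).

```python
from math import sqrt

def mean_confidence_interval(sample_mean, s, n, z=1.96):
    """Large-sample (n > 30) confidence interval for the population mean."""
    standard_error = s / sqrt(n)     # step iii: S_x = s / sqrt(n)
    margin = z * standard_error      # step v: z times the standard error
    return sample_mean - margin, sample_mean + margin

# Hypothetical sample: mean 50, standard deviation 12, n = 36, 95% confidence
low, high = mean_confidence_interval(sample_mean=50, s=12, n=36, z=1.96)
print(round(low, 2), round(high, 2))
```

Passing a larger multiplier for a higher confidence level widens the interval, which matches the statement that a higher confidence level gives a wider interval.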

The appropriate number is the value that corresponds to the chosen confidence level, e.g. 1.96 at the 95% confidence level. This number is usually denoted by Z and is obtained from the normal tables, where it corresponds to the probability given by the confidence level.

Example 1
The quality department of a wire manufacturing company periodically selects a sample of wire specimens in order to test for breaking strength. Past experience has shown that the breaking strengths of a certain type of wire are normally distributed with a standard deviation of 200 kg. A random sample of 64 specimens gave a mean of 6200 kg. Find the population mean at the 95% level of confidence.

Solution
Population mean = x̄ ± 1.96 S_x̄
Note that the sample size is already n > 30 and that s and x̄ are given, so steps i), ii) and iv) are provided.
Here x̄ = 6200 kg
S_x̄ = s/√n = 200/√64 = 25
Population mean = 6200 ± 1.96(25) = 6200 ± 49 = 6151 to 6249
At the 95% level of confidence, the population mean lies between 6151 and 6249.

Estimation of population proportions
This type of estimation applies when information cannot be given as a mean or as a measure but only as a fraction or percentage. Sampling theory stipulates that if repeated large random samples are taken from a population, the sample proportion p will be normally distributed with mean equal to the population proportion and standard error equal to

S_p = √(pq/n) = standard error for sampling of population proportions

where n is the sample size and q = 1 - p. The procedure for estimating a proportion is similar to that for estimating a mean; only the formula for calculating the standard error is different.

Example 1
In a sample of 800 candidates, 560 were male. Estimate the population proportion at the 95% confidence level.

Solution
Here the sample proportion p = 560/800 = 0.70
q = 1 - p = 1 - 0.70 = 0.30
n = 800
S_p = √(pq/n) = √[(0.70)(0.30)/800] = 0.016
Population proportion = p ± 1.96 S_p, where 1.96 = Z
= 0.70 ± 1.96(0.016) = 0.70 ± 0.03 = 0.67 to 0.73
= between 67% and 73%

Example 2

A sample of 600 accounts was taken to test the accuracy of posting and balancing of accounts, and 45 mistakes were found. Find the population proportion. Use the 99% level of confidence.

Solution
Here n = 600; p = 45/600 = 0.075
q = 1 - 0.075 = 0.925
S_p = √(pq/n) = √[(0.075)(0.925)/600] = 0.011
Population proportion = p ± 2.58(S_p) = 0.075 ± 2.58(0.011) = 0.075 ± 0.028 = 0.047 to 0.103
= between 4.7% and 10.3%

Small Samples
Estimation of population mean
If the sample size is small (n < 30), the arithmetic means of such samples are not normally distributed. In such circumstances, Student's t distribution must be used to estimate the population mean. In this case
Population mean μ = x̄ ± t S_x̄
where x̄ = sample mean
S_x̄ = s/√n
s = standard deviation of the sample = √[Σ(x - x̄)²/(n - 1)] for small samples
n = sample size
v = n - 1 degrees of freedom
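The small-sample formulas can be sketched as follows. The helper names and the five weights are hypothetical, introduced only for illustration. The t value is not computed by the code; it is assumed to be read from Student's t tables for v = n - 1 degrees of freedom (2.776 for v = 4 at 95% confidence).

```python
from math import sqrt

def sample_std(values):
    """Sample standard deviation with divisor n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

def t_interval(sample_mean, s, n, t_value):
    """Small-sample confidence interval: mean +/- t * s / sqrt(n).
    t_value must be looked up in t tables for n - 1 degrees of freedom."""
    standard_error = s / sqrt(n)
    return sample_mean - t_value * standard_error, sample_mean + t_value * standard_error

# Hypothetical sample of five weights in grams
weights = [48, 52, 50, 47, 53]
n = len(weights)
x_bar = sum(weights) / n
s = sample_std(weights)

# v = n - 1 = 4 degrees of freedom; t = 2.776 at 95% confidence (from tables)
low, high = t_interval(x_bar, s, n, t_value=2.776)
print(round(low, 2), round(high, 2))
```

Keeping the table value as an explicit argument mirrors the manual procedure, where the degrees of freedom and the confidence level together determine which entry of the t table is used.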

The value of t is obtained from Student's t distribution tables for the required confidence level.

Example
A random sample of 12 items is taken and is found to have a mean weight of 50 grams and a standard deviation of 9 grams. What is the mean weight of the population
a) with 95% confidence
b) with 99% confidence

Solution
x̄ = 50; s = 9; v = n - 1 = 12 - 1 = 11; S_x̄ = s/√n = 9/√12; μ = x̄ ± t S_x̄

At the 95% confidence level
μ = 50 ± 2.201 x 9/√12 = 50 ± 5.72 grams
Therefore we can state with 95% confidence that the population mean is between 44.28 and 55.72 grams.

At the 99% confidence level
μ = 50 ± 3.106 x 9/√12 = 50 ± 8.07 grams
Therefore we can state with 99% confidence that the population mean is between 41.93 and 58.07 grams.

Note: To use the t distribution tables it is important to find the degrees of freedom (v = n - 1). In the example above v = 12 - 1 = 11. From the tables we find that at the 95% confidence level, against 11 and under 0.05, the value of t = 2.201.

7. SIMPLE LINEAR REGRESSION EQUATION

A regression model is a mathematical equation that describes the relationship between two or more variables. A simple regression model includes only two variables: one independent and one dependent.
Independent variable: the variable used to explain the variation in the dependent variable, i.e. it is used to make predictions about the dependent variable.
Dependent variable: the variable being explained.

A linear regression model gives the equation of a linear relationship between two variables Y (dependent) and X (independent), as shown below:
Y = a + bx
The constant a is the y-intercept: the value of y where the line cuts the y-axis. The constant b is the slope or gradient of the line. The linear relationship between x and y is defined once the values of the constants a and b are determined. The values of a and b can be determined in the following ways.

Scatter plots:

Scatter plots are used to examine the relationship between two variables. One variable takes the horizontal axis (x) while the other takes the vertical axis (y). The variation between the variables can show a relationship that is positive or negative. A positive relationship that is linear or close to linear indicates that the variables move together in a linear manner: the scatter shows points lying in a region that follows a line, and when one variable increases the other also increases, while when one decreases the other also decreases. In a negative relationship, an increase in one variable is accompanied by a decrease in the other. A linear relationship shows points scattered so as to lie along a line. Relationships between variables can also be non-linear; in such cases the points concentrate in a region that follows a curve. The relationship between two variables may therefore assume many possible shapes, which can be classified as linear or as non-linear relationships described by more complicated mathematical functions. The simplest relationship is a straight-line or linear relationship. Check the scatter plots on page 606 of the main textbook.

From a scatter plot, a line of best fit can be drawn through the points so that they lie approximately along a straight line. This requires good and refined skill: the better the line fits the points, the more accurate it is. However, there will always be an error due to random causes, called the random error term. This is the difference between the actual value of y obtained from the survey and the estimated value of y obtained by assuming it falls along the line. For every value of x, a different value of y may be obtained by estimating from the line. The error for each value of y can be written as E = y - ŷ, where y is the actual value and ŷ is the estimated value. Across all the values of y, some estimated values are less than the actual values while others are more than the actual values. For the line of best fit, the sum of the negative differences and the sum of the positive differences cancel out, so the sum of the errors is zero.

Example:

The following data represent a sample of seven households showing their incomes and food expenditures for a given month.

Income (hundreds of dollars)   Food expenditure (hundreds of dollars)
35                             9
49                             15
21                             7
39                             11
15                             5
28                             8
25                             9

Required:
Construct a scatter diagram, with income on the x-axis and food expenditure on the y-axis
Draw the prediction line
Identify the y-intercept (a) and the slope (gradient) (b)
Write the simple linear regression equation

Scatter diagram

Depending on one's skill, different lines such as L1 and L2 can be drawn. Using L2:
Y-intercept, i.e. the constant a = 1.2
The gradient, i.e. the coefficient of x: b = (12 - 6)/(46 - 22) = 0.25
The linear equation, generally written as Y = a + bx, is Y = 1.2 + 0.25x.

The line gives an estimate of the values of y for different values of x, and it can be used to predict values of y given x. However, since the line is an estimate, there is a difference between the observed or actual value of y and the value obtained from the prediction line. This difference is an error called the random error, also called the residual. It measures the surplus (positive or negative) difference. The random error obtained from a population is denoted by ε while that of a sample is denoted by e. In the above example, e = actual food expenditure - predicted food expenditure = y - ŷ. If the predicted line is the line of best fit, the sum of the positive errors and the negative errors is equal to zero.

Drawing a scatter diagram may not give the line of best fit. The other option, which results in a sum of errors equal to zero, Σe = Σ(y - ŷ) = 0, is the least squares method.

The Least squares method
The least squares method minimizes the random error. It determines the constants a and b of the equation Y = a + bx that give the line of best fit. The method gives the values of a and b for the equation (model) such that the sum of squared errors (SSE) is a minimum:
SSE = Σe² = Σ(y - ŷ)²
The values of a and b which give the minimum SSE are called the least squares estimates, and the line is called the least squares line.

For the line ŷ = a + bx:
b = SSxy/SSxx and a = ȳ - b x̄
where SSxy = Σxy - (Σx)(Σy)/n
SSxx = Σx² - (Σx)²/n

To find the least squares regression line for the data on incomes and food expenditures of seven households, we construct a table to guide the computation of a and b:

Income (x)   Food expenditure (y)   xy          x²
35           9                      315         1225
49           15                     735         2401
21           7                      147         441
39           11                     429         1521
15           5                      75          225
28           8                      224         784
25           9                      225         625
Σx = 212     Σy = 64                Σxy = 2150  Σx² = 7222

x̄ = Σx/n = 212/7 = 30.286
ȳ = Σy/n = 64/7 = 9.143

SSxy = Σxy - (Σx)(Σy)/n = 2150 - (212)(64)/7 = 211.714
SSxx = Σx² - (Σx)²/n = 7222 - (212)²/7 = 801.429
b = SSxy/SSxx = 211.714/801.429 = 0.2642
a = ȳ - b x̄ = 64/7 - 0.2642(212/7) = 1.1414
ŷ = 1.1414 + 0.2642x

Interpretation:

The line gives the coefficients a and b to four decimal places, making it more accurate to use for prediction. We can check the accuracy as follows. A household with a monthly income of $3500, denoted by x = 35, would be expected to spend on food:
ŷ = 1.1414 + 0.2642x
With x = 35: ŷ = 1.1414 + 0.2642(35) = 10.3884, i.e. in hundreds of dollars ($1038.84)
The actual value in the data is y = 9. The value 10.3884 should be regarded as an average: households having an income of $3500 (x = 35) spend an average of $1038.84 on food.

The constant a is the value of y when x = 0, that is, the amount of money a household would spend on food per month if it had no income. It means that food expenditure does not depend only on income; there could be other factors. For purposes of prediction using the linear regression line obtained, we can only predict values of y for values of x that lie within the range of our data. For example, the incomes lie between $1500 and $4900, i.e. x = 15 and x = 49, so we can only predict values of y for values of x between 15 and 49. Prediction outside this range may not hold true (the prediction is not reliable). x = 0 is not within the range, so the prediction that households with no income spend $114.14 per month cannot be supported by our equation.

The constant b in the model gives the gradient, or the change in y due to a change of one unit in x. For example, when x increases by one unit of income (in hundreds of dollars), y increases by 0.2642 (in hundreds of dollars) spent on food. If the income of a household changes from x = 30 to x = 31, y changes as follows:
ŷ = 1.1414 + 0.2642(30) = 9.0674
ŷ = 1.1414 + 0.2642(31) = 9.3316
9.3316 - 9.0674 = 0.2642
When b is positive, as x increases y also increases, and as x decreases y also decreases. There is a positive linear relationship between the variables, i.e. the change in y and the change in x are in the same direction; the variables move together. If the value of b is negative, the change in y is in the opposite direction to the change in x, i.e. there is a negative linear relationship between the variables. When b is greater than zero (b > 0) the line slopes upwards from left to right. If b < 0 the line slopes downwards from left to right.

Assumptions of the regression model
The mean value of the error is zero. In the above example, among the households with the same income some spend more on food and others less. The sum of the differences (positive errors and negative errors) is equal to zero.
The errors associated with different observations are independent. That is, all households decide independently how much to spend on food.
For any given x, the distribution of errors is normal, i.e. in the above example the food expenditures of all households with the same income are normally distributed.
The distribution of population errors for each x has the same (constant) standard deviation. The assumption is that the spread of points around the regression line is similar for all x values.

Example 2
A random sample of eight auto drivers insured with a company and having similar auto insurance policies was selected. The following table lists their driving experience (in years) and the monthly auto insurance premium (in dollars) paid by them. Find the linear equation using the least squares method (L.S.M.).

Driving experience (years), x:                 5   2  12   9  15   6  25  16
Monthly auto insurance premium (dollars), y:  64  87  50  71  44  56  42  60

Solution
n = 8, Σx = 90, Σy = 474, Σxy = 4739, Σx² = 1396

\[ \bar{x} = \frac{\sum x}{n} = \frac{90}{8} = 11.25, \qquad \bar{y} = \frac{\sum y}{n} = \frac{474}{8} = 59.25 \]

\[ SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 4739 - \frac{(90)(474)}{8} = -593.5 \]

\[ SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 1396 - \frac{(90)^2}{8} = 383.5 \]

\[ b = \frac{SS_{xy}}{SS_{xx}} = \frac{-593.5}{383.5} = -1.5476 \]

\[ a = \bar{y} - b\bar{x} = 59.25 - (-1.5476)(11.25) = 76.6605 \]

\[ \hat{y} = 76.6605 - 1.5476x \]
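The least squares arithmetic in Example 2 can be reproduced with a short Python sketch. The function name least_squares is a hypothetical helper introduced for illustration; it applies the same SSxy and SSxx formulas to the eight observations in the table.

```python
# Data from Example 2
x = [5, 2, 12, 9, 15, 6, 25, 16]        # driving experience (years)
y = [64, 87, 50, 71, 44, 56, 42, 60]    # monthly premium (dollars)

def least_squares(x, y):
    """Hypothetical helper: return (a, b) for the line y = a + bx."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    ss_xx = sum(xi ** 2 for xi in x) - sum_x ** 2 / n
    b = ss_xy / ss_xx                   # slope
    a = sum_y / n - b * sum_x / n       # intercept: y-bar minus b times x-bar
    return a, b

a, b = least_squares(x, y)
print(round(b, 4))   # -1.5476
print(round(a, 4))   # 76.6604
```

The intercept prints as 76.6604 rather than 76.6605 because the code keeps b unrounded, whereas the hand calculation rounds b to -1.5476 before computing a.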

Practice Questions
1. The age versus price data for printers is reported in the table below. Age is in years while prices are in hundreds of dollars.

Age (years), x:                   5   7   6   6   5   4   7   6   5   5    2
Price (hundreds of dollars), y:  80  57  58  55  70  88  43  60  69  63  118

Required:
i. Find the equation of the regression line.
ii. Describe the apparent relationship between age and price for the printers.
iii. What does the slope of the regression equation represent in terms of the price for printers?
iv. Panama enterprise wants to buy a 3-year-old and a 4-year-old printer from the firm. How much do you predict the firm will spend in buying the two printers?

8.HYPOTHESIS TESTING
Definition

A hypothesis is a claim or an opinion about an item or issue. It therefore has to be tested statistically in order to establish whether it is correct or not. When testing a hypothesis, one must fully understand the two basic hypotheses to be tested, namely:
i. The null hypothesis (H0)
ii. The alternative hypothesis (H1)

The null hypothesis
This is the hypothesis being tested, the belief in a certain characteristic. For example, the Kenya Bureau of Standards (KBS) may walk into a sugar-making company with the intention of confirming that the 2 kg bags of sugar produced actually weigh 2 kg and not less. They conduct hypothesis testing with the null hypothesis being H0: each bag weighs 2 kg. The testing will set out to confirm this or to refute it.

The alternative hypothesis
While formulating a null hypothesis we also consider the fact that the belief might be found to be untrue, in which case we will reject it. We therefore formulate an alternative hypothesis which contradicts the null hypothesis; thus when we reject the null hypothesis we accept the alternative hypothesis. In our example the alternative hypothesis would be H1: each bag does not weigh 2 kg.

Acceptance and rejection regions
The possible values of a test statistic are divided into those consistent with the null hypothesis (the acceptance region) and those that lead to the rejection of the null hypothesis (the rejection region or critical region). The values which separate the rejection region from the acceptance region are called critical values.

Type I and type II errors
While testing a hypothesis (H0) and deciding to either accept or reject the null hypothesis, there are four possible occurrences:
a) Acceptance of a true hypothesis (correct decision): the null hypothesis is accepted and this happens to be the correct decision. Note that statistics does not give absolute information; its conclusion could be wrong, only the probability of it being right is high.
b) Rejection of a false hypothesis (correct decision).
c) Rejection of a true hypothesis (incorrect decision): this is called a type I error, with probability α.
d) Acceptance of a false hypothesis (incorrect decision): this is called a type II error, with probability β.

Levels of significance
A level of significance is a probability value used when conducting tests of hypothesis. It is the probability of making an incorrect decision after the statistical testing has been done, specifically of rejecting a null hypothesis that is in fact true (a type I error). The probabilities used are usually very small, e.g. 1% or 5%.

[Figure: normal curve marked with areas 0.5000 and 0.4900, leaving a 1% provision for errors]
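A level of significance translates into a critical value on the standard normal curve. The sketch below uses NormalDist from Python's standard statistics module; the helper name critical_value and its two_tailed parameter are assumptions introduced here for illustration, not part of the course material.

```python
from statistics import NormalDist

def critical_value(alpha, two_tailed=False):
    """Hypothetical helper: positive critical z for significance level alpha.

    For a two-tailed test the error probability alpha is split equally
    between the two tails.
    """
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)

print(round(critical_value(0.05), 3))                    # 1.645 (one-tailed, 5%)
print(round(critical_value(0.05, two_tailed=True), 2))   # 1.96  (two-tailed, 5%)
print(round(critical_value(0.01, two_tailed=True), 3))   # 2.576 (two-tailed, 1%)
```

Printed tables often give the last value as 2.575, and the one-tailed 5% value is frequently rounded to 1.65; the small differences come from table rounding.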

Hypothesis testing procedure
Whenever a business complaint comes up, there is a recommended procedure for conducting a statistical test. The purpose of such a test is to establish whether the null hypothesis or the alternative hypothesis is to be accepted. The following steps are normally adopted:
1. Statement of the null and alternative hypotheses
2. Statement of the level of significance to be used
3. Statement about the test statistic, i.e. what is to be tested, e.g. the sample mean, sample proportion, difference between sample means or sample proportions
4. Type of test, whether two-tailed or one-tailed
5. Statement on critical values using the appropriate level of significance
6. Standardizing the test statistic
7. Conclusion showing whether to accept or reject the null hypothesis

Hypothesis testing for the mean
Example 1
A certain NGO carried out a survey in a certain community in order to establish the average age at which girls are married. The results of the survey indicated that the marriage age for girls is 19 years. In order to establish the validity of this mean marital age, a sample of 50 women was interviewed, and their average age at marriage was 16 years, with a standard deviation of 2.1 years. The sample data indicates that the marital age is less than 19 years. Is this conclusion true or not?

Required
Conduct a statistical test to either support or refute the above conclusion drawn from the sample statistics, i.e. that the marriage age is less than 19 years. Use a level of significance of 5%.

Solution
1. Null hypothesis H0: μ (mean marital age) = 19 years
   Alternative hypothesis H1: μ (mean marital age) < 19 years
2. The level of significance is 5%.
3. The test statistic is the sample mean age, X̄ = 16 years.
4. The critical value of the one-tailed test (one-tailed because the alternative hypothesis is a one-sided inequality) at the 5% level of significance is 1.65 in absolute value, i.e. -1.65 for this lower-tailed test.

The standardized test statistic is

\[ Z = \frac{\bar{X} - \mu}{S_{\bar{x}}}, \qquad \text{where } S_{\bar{x}} = \frac{S}{\sqrt{n}} \]

Where
X̄ = sample mean
μ = population mean
S = sample standard deviation
n = sample size
Z = standard value (as per computation)

5. The standard value Z must fall within the acceptance region for us to accept the null hypothesis. Thus it must be > -1.65; otherwise we accept the alternative hypothesis.

\[ Z = \frac{16 - 19}{2.1/\sqrt{50}} = -10.1 \]

[Figure: standard normal curve showing the rejection region and the acceptance region]

6. Since -10.1 < -1.65, we reject the null hypothesis and accept the alternative hypothesis at the 5% level of significance, i.e. the marriage age in this community is significantly lower than 19 years.

Example 2
Test the hypothesis that weight loss in a new diet program exceeds 20 pounds during the first month.
Sample data: n = 36, x̄ = 21, s² = 25, μ0 = 20, α = 0.05
H0: μ = 20 (μ is not larger than 20)
Ha: μ > 20 (μ is larger than 20)

\[ Z = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} = \frac{21 - 20}{5/\sqrt{36}} = 1.2 \]

[Figure: standard normal curve showing the acceptance region and the rejection region beyond the critical value z = 1.645]

At 5%, the critical value is z = 1.645.
RR: Reject H0 if z > 1.645.
Decision: Do not reject H0, because the test statistic (Z = 1.2) is outside the rejection region.
Conclusion: At the 5% significance level there is insufficient statistical evidence to conclude that weight loss in the new diet program exceeds 20 pounds in the first month.
Exercise: Test the claim that weight loss is not equal to 19.5.

Example 3
A machine is set to cut out bars to an average length of 150 mm. An operator wants to check whether the setting is accurate. She samples 50 bars and finds a mean of 148 mm. The standard deviation is known to be 5 mm. Is the machine still reliable? Test this at the 1% significance level.

Solution
H0: μ = 150 (machine may be reliable)
Ha: μ ≠ 150 (two-tailed test; machine not reliable, as it may produce lengths that are too long or too short. We cannot get a direction from the wording of the question.)
Alpha: α = 0.01. Critical value: z(α/2) = z(0.005) = 2.575

\[ Z = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} = \frac{148 - 150}{5/\sqrt{50}} = -2.83 \]

[Figure: standard normal curve with an area of 0.005 in each tail]
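The test statistics in Examples 1 to 3 all come from the same standardization, so they can be checked with one small Python sketch. The function names z_statistic and reject_h0 and the tail labels are hypothetical, chosen here for illustration; the critical values are the ones quoted in the examples.

```python
from math import sqrt

def z_statistic(sample_mean, mu0, sd, n):
    """Standardize a sample mean: Z = (sample_mean - mu0) / (sd / sqrt(n))."""
    return (sample_mean - mu0) / (sd / sqrt(n))

def reject_h0(z, critical, tail):
    """Return True when z falls in the rejection region.

    critical is the positive critical value; tail is "lower", "upper" or "two".
    """
    if tail == "lower":
        return z < -critical
    if tail == "upper":
        return z > critical
    return abs(z) > critical

z1 = z_statistic(16, 19, 2.1, 50)    # Example 1: marriage age
z2 = z_statistic(21, 20, 5, 36)      # Example 2: weight loss
z3 = z_statistic(148, 150, 5, 50)    # Example 3: bar length

print(round(z1, 1), reject_h0(z1, 1.65, "lower"))    # -10.1 True
print(round(z2, 1), reject_h0(z2, 1.645, "upper"))   # 1.2 False
print(round(z3, 2), reject_h0(z3, 2.575, "two"))     # -2.83 True
```

Keeping the tail as an explicit argument makes the one-tailed versus two-tailed choice in step 4 of the procedure visible in the code.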

Conclusion (Example 3): The calculated value Z = -2.83 is less than -2.575, so it falls in the rejection region (the region of Ha). Reject H0. There is enough evidence to support that the machine is no longer reliable.

Hypothesis testing for proportion
A member of parliament (MP) claims that in his constituency only 50% of the total youth population lacks university education. A local media company wanted to ascertain that claim, so they conducted a survey taking a sample of 400 youths; of these, 54% lacked university education.

Required: At the 5% level of significance, confirm whether the MP's claim is wrong.

Solution
Note: This is a two-tailed test, since we wish to test whether the proportion is different from (≠) the claimed value and not against a specific alternative hypothesis, e.g. < (less than) or > (more than).
H0: π = 50% of all youth in the constituency lack university education.
H1: π ≠ 50% of all youth in the constituency lack university education.

With p = 0.5 (the claimed proportion), q = 1 - p = 0.5 and n = 400:

\[ S_p = \sqrt{\frac{pq}{n}} = \sqrt{\frac{0.5 \times 0.5}{400}} = 0.025 \]

\[ Z = \frac{0.54 - 0.50}{0.025} = 1.6 \]

At the 5% level of significance for a two-tailed test the critical value is 1.96. Since the calculated Z value (1.6) is less than 1.96, it falls in the acceptance region. Conclusion: do not reject H0. At the 5% level of significance there is insufficient evidence that the MP's claim is wrong.

Practice Questions
Kenya Commercial Bank Ltd. commissioned research whose results indicated that an automatic teller machine (ATM) reduces the cost of routine banking transactions. Following this information, the bank installed an ATM facility at the premises of Joy Processing Company Ltd., which for the last several months has been used exclusively by Joy's 605 employees. A survey on the usage of the ATM facility by 100 of the employees in a month indicated the following:

Number of times ATM used:   0   1   2   3   4   5
Frequency:                 20  32  20  13  10   5

Required:

a) An estimate of the proportion of Joy's employees who do not use the ATM facility in a month
b) i) Determine the 95% confidence interval for the estimate in (a) above
   ii) Can the bank be certain that at least 40% of Joy's employees will use the ATM facility?
c) The number of ATM transactions on average an employee of Joy makes per month
d) Determine the 95% confidence interval of the mean number of transactions made by an employee in a month.
e) Is it possible that the population mean number of transactions is four? Explain.
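The proportion test for the MP's claim, worked earlier in this section, can be sketched in Python as follows. The function name proportion_z is a hypothetical helper; it uses the claimed proportion (0.5) for the standard error, as the worked solution does, and it is not a solution to the practice questions above.

```python
from math import sqrt

def proportion_z(p_hat, p0, n):
    """Hypothetical helper: Z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)."""
    standard_error = sqrt(p0 * (1 - p0) / n)   # Sp under the null hypothesis
    return (p_hat - p0) / standard_error

z = proportion_z(0.54, 0.50, 400)   # sample proportion 54%, claim 50%, n = 400
print(round(z, 2))                  # 1.6
print(abs(z) > 1.96)                # False: Z is not in the two-tailed rejection region
```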
