
Notes on

RESEARCH METHODOLOGY

BY

LOKENDRA KUMAR OJHA

UNIT 1
Primary data are those that you have collected yourself, whereas secondary data originate elsewhere. Generally, you will find that you are expected to collect primary data when using quantitative methods, but that secondary data are more acceptable when you are using a qualitative method. This is because certain common forms of qualitative research involve only secondary data, such as the study of television or newspaper discourses. If you wanted to understand the nature of the representation of Romany people on television, you wouldn't make your own television programmes! You would use those which exist, and they would form your secondary data (Forshaw, 2000). A secondary data research project involves the gathering and/or use of existing data for purposes other than those for which they were originally collected. These secondary data may be obtained from many sources, including literature, industry surveys, compilations from computerized databases and information systems, and computerized or mathematical models of environmental processes.

Secondary Data

What is secondary data?
o Data may be described as primary or secondary:
  - Primary data: collected by the researcher himself.
  - Secondary data: collected by others and "re-used" by the researcher.

What form does secondary data take?
o Quantitative sources
  - Published statistics:
    - National government sources: demographic (Census, Vital Statistics, cancer registrations); administrative (a by-product of government, collected by government departments and overseen by the ONS, e.g. employment, prices, trade, finance); government surveys (an input to government), e.g. the General Household Survey (GHS), Family Expenditure Survey (FES), Labour Force Survey (LFS), Family Resources Survey (FRS) and the Omnibus Survey.
    - Local government sources: planning documents; trends documents (e.g. the former Strathclyde Social Trends and Economic Trends).
    - Other sources: firms and trade associations, e.g. the Society of Motor Manufacturers & Traders (SMMT); market and opinion research, e.g. Gallup, NOP, SCPR, System 3; trade unions, the TUC and STUC; professional bodies, e.g. CIPFA (Chartered Institute of Public Finance & Accountancy), which provides a Statistical Information Service on local government statistics; political parties; voluntary and charitable bodies, e.g. the Low Pay Unit, SCF (Save the Children Fund), the Rowntree Foundation; academic and research institutes, e.g. the Micro-Social Change Research Centre (MSRC) at Essex University, the National Institute for Economic & Social Research (NIESR) and the Institute for Fiscal Studies (IFS).
    - International sources: EU, OECD, World Bank, IMF.
  - Non-published / electronic sources:
    - Data archives, e.g. the Data Archive at Essex (data sub-setting service on tape, disk, by post or via JANET).
    - On-line access to national computing centres: MIMAS (Manchester Information & Associated Services), EDINA (Edinburgh).
    - International sources on the Internet and Web.

o Qualitative sources (sources for qualitative research):
  - Biographies: subjective interpretation involved.
  - Diaries: more spontaneous, less distorted by memory lapses.
  - Memoirs: the benefit/problem of hindsight.
  - Letters: reveal interactions.
  - Newspapers: public interest and opinion.
  - Novels and literature in general: e.g. Atkinson's tribute to the usefulness of Gordon's "Dr" novels; McLelland's study of achievement motivation in different cultures via children's stories and folktales.
  - Handbooks, policy statements, planning documents, reports, historical and official documents (Hansard, Royal Commission reports), etc. N.B. Marx's use of Factory Inspectors' reports in developing his theories of the labour process.

Ways of Using Secondary Sources
o Exploratory phase: getting ideas.
o Design phase: definitions and sampling frames, question wording.
o Supplement to the main research: reinforcement and/or comparison.
o Main mode of research: when direct data collection is impossible, or costly and time consuming.

Limitations of Secondary Data
o Collected for a different purpose.
o Problems of definition.
o Problems of comparability over time.
o Lack of awareness of sources of error/bias.
o Has the data been "massaged"?
o What do the statistics really mean? E.g. health, crime, unemployment.
o Limitations of survey data: representativeness; validity of responses.
o Limitations of documents: documents "construct" as well as report social reality.

How to Search and Use Secondary Sources
o Documents: bibliographic skills, use of keywords, Boolean operators.
o Published statistics: Guide to Official Statistics, digests and abstracts, primary publications.
o Electronic sources: BIRON, gateways (SOSIG, BUBL), search engines (Infoseek, Alta Vista, Webcrawler, etc.).

Diagrams and Graphs of Statistical Data

We have discussed the techniques of classification and tabulation that help us in organizing the collected data in a meaningful fashion. However, this way of presenting statistical data does not always prove interesting to a layman. Too many figures are often confusing and fail to convey the message effectively. One of the most effective and interesting alternative ways in which statistical data may be presented is through diagrams and graphs. There are several ways in which statistical data may be displayed pictorially, such as different types of graphs and diagrams. The commonly used diagrams and graphs, discussed in subsequent paragraphs, are listed below.

Types of Diagrams/Charts:
Simple Bar Chart
Multiple Bar Chart or Cluster Chart
Stacked Bar Chart or Sub-Divided Bar Chart or Component Bar Chart
Simple Component Bar Chart
Percentage Component Bar Chart
Sub-Divided Rectangular Bar Chart
Pie Chart

Types of Diagrams/Charts (continued):
Histogram
Frequency Curve and Polygon
Lorenz Curve
Historigram

Simple Bar Chart
A simple bar chart is used to represent data involving only one variable, classified on a spatial, quantitative or temporal basis. In a simple bar chart we make bars of equal width but variable length, i.e. the magnitude of a quantity is represented by the height or length of the bar. The following steps are undertaken in drawing a simple bar diagram:
o Draw two perpendicular lines, one horizontal and the other vertical, at an appropriate place on the paper.
o Take the basis of classification along the horizontal line (X-axis) and the observed variable along the vertical line (Y-axis), or vice versa.
o Mark bars of equal breadth for each class and leave a gap of not less than half the breadth between two classes.
o Finally, mark the values of the given variable to prepare the required bars.
Example: Draw a simple bar diagram to represent the profits of a bank for 5 years, given the years and the corresponding profits (in million $). The resulting simple bar chart shows the profit of the bank over the 5 years.
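As a quick illustration of the steps above, the sketch below draws a simple bar chart with matplotlib. The years and profit figures are hypothetical placeholders, since the original table is not reproduced here.

```python
import matplotlib.pyplot as plt

# Hypothetical data: profit (million $) of a bank over 5 years
years = ["1991", "1992", "1993", "1994", "1995"]
profits = [10, 12, 18, 25, 42]

plt.bar(years, profits, width=0.6, color="steelblue")
plt.xlabel("Year")                    # basis of classification on the X-axis
plt.ylabel("Profit (million $)")      # observed variable on the Y-axis
plt.title("Simple bar chart: profit of a bank for 5 years")
plt.show()
```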

Multiple Bar Chart
A multiple bar diagram represents two or more sets of inter-related data (a multiple bar diagram facilitates comparison between more than one phenomenon). The technique of the simple bar chart is used to draw this diagram, but the difference is that we use different shades, colors or dots to distinguish between the different phenomena. We draw multiple bar charts when the total of the different phenomena is meaningless. Example: Draw a multiple bar chart to represent the imports and exports of Canada (values in $) for the years 1991 to 1995, given the yearly import and export figures.

Multiple bar chart showing the imports and exports of Canada from 1991 to 1995.

Component Bar Chart
A sub-divided or component bar chart is used to represent data in which the total magnitude is divided into different components. In this diagram we first make simple bars for each class, taking the total magnitude in that class, and then divide these simple bars into parts in the ratio of the various components. This type of diagram shows the variation in the different components within each class as well as between different classes. The sub-divided bar diagram is also known as a component bar chart or stacked chart. Example: The table below shows the quantity, in hundred kgs, of Wheat, Barley and Oats produced on a certain farm during the years 1991 to 1994 (columns: Years, Wheat, Barley, Oats).

Construct a component bar chart to illustrate these data. Solution: To make the component bar chart, first of all we have to take the year-wise total production (columns: Years, Wheat, Barley, Oats, Total).

The required diagram is given below:

Pie Chart
A pie chart can be used to compare the relation between the whole and its components. A pie chart is a circular diagram, and the areas of the sectors of the circle are used to represent the components. Circles are drawn with radii proportional to the square root of the quantities, because the area of a circle is πr². To construct a pie chart (sector diagram), we draw a circle with radius proportional to the square root of the total. The total angle of the circle is 360°. The angle of each component is calculated by the formula:

Angle of sector = (Component value / Total) × 360°.
These angles are marked in the circle by means of a protractor to show the different components. The arrangement of the sectors is usually anti-clockwise. Example: The following table gives the details of the monthly budget of a family (items of expenditure: Food, Clothing, House Rent, Fuel and Lighting, Miscellaneous, and the Total). Represent these figures by a suitable diagram.

Solution: The necessary computations are given below.

For each item (Food, Clothing, House Rent, Fuel and Lighting, Miscellaneous, Total) we tabulate the expenditure ($), the angle of the sector computed from the formula above, and the cumulative angle.
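A minimal sketch of the computation just described: each sector angle is the item's share of the total multiplied by 360°. The budget figures used here are hypothetical, since the original table values are not reproduced.

```python
# Hypothetical monthly family budget (in $)
budget = {
    "Food": 600,
    "Clothing": 100,
    "House Rent": 400,
    "Fuel and Lighting": 100,
    "Miscellaneous": 300,
}

total = sum(budget.values())
cumulative = 0.0
for item, expenditure in budget.items():
    angle = expenditure / total * 360      # angle of sector
    cumulative += angle                    # cumulative angle, for drawing with a protractor
    print(f"{item:18s} {expenditure:6d} {angle:8.1f} {cumulative:8.1f}")
print(f"{'Total':18s} {total:6d} {360.0:8.1f}")
```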

Measures of Central Tendency
A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. As such, measures of central tendency are sometimes called measures of central location; they are also classed as summary statistics. The mean (often called the average) is most likely the measure of central tendency that you are most familiar with, but there are others, such as the median and the mode. The mean, median and mode are all valid measures of central tendency, but under different conditions some measures become more appropriate to use than others. In the following sections we will look at the mean, mode and median, learn how to calculate them, and see under what conditions each is most appropriate.

Arithmetic Mean
It is the most commonly used average or measure of central tendency, applicable only in the case of quantitative data. The arithmetic mean is also simply called the mean, and is defined as the quotient of the sum of the given values and the number of the given values. The arithmetic mean can be computed for both ungrouped data (raw data: data without any statistical treatment) and grouped data (data arranged in tabular form containing different groups). If X is the variable involved, then the arithmetic mean of X is abbreviated as A.M. of X and denoted by X̄.

The arithmetic mean of X can be computed by any of the following methods, for both ungrouped and grouped data:

Method                           Ungrouped Data             Grouped Data
Direct Method                    X̄ = ΣX / n                 X̄ = ΣfX / Σf
Indirect or Short-Cut Method     X̄ = A + ΣD / n             X̄ = A + ΣfD / Σf
Method of Step-Deviation         X̄ = A + (Σu / n) × c       X̄ = A + (Σfu / Σf) × h

where
X indicates the values of the variable,
n indicates the number of values of X,
f indicates the frequency of the different groups,
A indicates the assumed mean,
D indicates the deviation from A, i.e. D = X − A,
u = D/c indicates the step-deviation,
c indicates the common divisor,
h indicates the size of the class or class interval in the case of grouped data, and
Σ indicates summation or addition.

Example (1): The one-sided train fares of five selected BS students are recorded. Calculate the arithmetic mean of the data. Solution: Let the train fare be indicated by X; the five recorded fares are the values X1, X2, X3, X4 and X5, so n = 5.

Since the data are ungrouped, we use the direct-method formula X̄ = ΣX / n. From the given data we obtain ΣX and n = 5; placing these two quantities in the formula gives the arithmetic mean of the fares.

Example (2): Given the following frequency distribution of the ages of first year students of a particular college (Age in years against Number of Students), calculate the arithmetic mean. Solution: The given distribution is grouped data; the variable involved is the age of the first year students, while the numbers of students represent the frequencies f. We tabulate the Ages (years), the Number of Students f and fX, together with the totals.

Now we find the arithmetic mean as X̄ = ΣfX / Σf, which gives the mean age in years.

Example (3): The following data show the distances (in km) covered by a number of persons to perform their routine jobs (Distance in km against Number of Persons).

Solution: The given distribution is grouped data and the variable involved is the distance covered, while the numbers of persons represent the frequencies f.

To compute the mean we tabulate, for each class, the Distance (km), the Number of Persons f, the Mid Point x and the product fx, together with the totals Σf and Σfx.

Now we find the arithmetic mean as X̄ = Σfx / Σf km.
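The two formulas used in the examples above can be checked with a few lines of Python. The train fares and the grouped age distribution below are hypothetical stand-ins, since the original tables are not reproduced; only the formulas X̄ = ΣX/n and X̄ = Σfx/Σf are taken from the text.

```python
# Ungrouped data: X̄ = ΣX / n (hypothetical train fares)
fares = [12, 14, 16, 18, 20]
mean_ungrouped = sum(fares) / len(fares)
print("Mean fare:", mean_ungrouped)

# Grouped data: X̄ = Σf·x / Σf, where x is the class mid point (hypothetical ages)
mid_points = [14.5, 16.5, 18.5, 20.5]   # e.g. classes 14-15, 16-17, 18-19, 20-21
frequencies = [10, 25, 40, 15]
mean_grouped = sum(f * x for f, x in zip(frequencies, mid_points)) / sum(frequencies)
print("Mean age:", mean_grouped)
```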

Merits and Demerits of Arithmetic Mean
Merits:
o It is rigidly defined.
o It is easy to calculate and simple to follow.
o It is based on all the observations.
o It is determined for almost every kind of data.
o It is finite and not indefinite.
o It is readily put to algebraic treatment.
o It is least affected by fluctuations of sampling.
Demerits:
o The arithmetic mean is highly affected by extreme values.
o It cannot average ratios and percentages properly.
o It is not an appropriate average for highly skewed distributions.
o It cannot be computed accurately if any item is missing.
o The mean sometimes does not coincide with any of the observed values.

Geometric Mean
It is another measure of central tendency with a mathematical footing, like the arithmetic mean. The geometric mean can be defined in the following terms: the geometric mean is the nth positive root of the product of n positive given values. Hence, the geometric mean of a variable X containing n values X1, X2, ..., Xn is denoted by G.M. of X and given as:

G.M. = (X1 · X2 · ... · Xn)^(1/n)   (for ungrouped data)
If we have a series of positive values X1, X2, ..., Xk repeated f1, f2, ..., fk times respectively, then the geometric mean becomes
G.M. = (X1^f1 · X2^f2 · ... · Xk^fk)^(1/n)   (for grouped data), where n = Σf.
Example: Find the geometric mean of the values 10, 5, 15, 8, 12. Solution: Here n = 5 and the product of the values is 10 × 5 × 15 × 8 × 12 = 72000, so G.M. = 72000^(1/5) ≈ 9.36.


Example-Find the Geometric Mean of the following Data

Solution: We may write it as a frequency table, listing each distinct value X with its frequency f.

Using the formula of geometric mean for grouped data, geometric mean in this case will become:

The method explained above for the calculation of the geometric mean is useful when the number of values in the given data is small and an electronic calculator is available. When a set of data contains a large number of values, we need an alternative way of computing the geometric mean, based on logarithms:
G.M. = antilog(Σ log X / n)   (for ungrouped data)
G.M. = antilog(Σ f log X / Σf)   (for grouped data)

Example: Find the geometric mean of the values 10, 5, 15, 8, 12 using logarithms. Solution: Σ log X = log 10 + log 5 + log 15 + log 8 + log 12 ≈ 4.8573, so G.M. = antilog(4.8573 / 5) = antilog(0.9715) ≈ 9.36.

Example: Find the geometric mean for the following distribution of students' marks (Marks classes against No. of Students).

Solution: We tabulate the Marks classes, the No. of Students f, the Mid Points x and f log x, together with the totals, and apply G.M. = antilog(Σ f log x / Σf).
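A short sketch of both ways of computing the geometric mean for the ungrouped example above (10, 5, 15, 8, 12): the direct product form and the logarithmic form give the same value.

```python
import math

values = [10, 5, 15, 8, 12]

# Direct definition: nth root of the product of the n values
gm_direct = math.prod(values) ** (1 / len(values))

# Logarithmic form: G.M. = antilog(Σ log X / n)
gm_logs = 10 ** (sum(math.log10(x) for x in values) / len(values))

print(round(gm_direct, 2), round(gm_logs, 2))   # both ≈ 9.36
```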

Merits and Demerits of Geometric Mean
Merits:
o It is rigidly defined and its value is a precise figure.
o It is based on all observations.
o It is capable of further algebraic treatment.
o It is not much affected by fluctuations of sampling.
o It is not affected by extreme values.
Demerits:
o It cannot be calculated if any of the observations is zero or negative.
o Its calculation is rather difficult.
o It is not easy to understand.
o It may not coincide with any of the observations.

Harmonic Mean
The harmonic mean is another measure of central tendency, also based on a mathematical footing like the arithmetic mean and the geometric mean. Like them, the harmonic mean is useful for quantitative data. It is defined as the quotient of the number of the given values and the sum of the reciprocals of the given values. In mathematical terms it is defined as follows:
H.M. = n / Σ(1/X)   (for ungrouped data)
H.M. = Σf / Σ(f/X)   (for grouped data)

Example: Calculate the harmonic mean of the numbers 13.5, 14.5, 14.8, 15.2 and 16.1. Solution: The sum of the reciprocals is 1/13.5 + 1/14.5 + 1/14.8 + 1/15.2 + 1/16.1 ≈ 0.3385, so H.M. = 5 / 0.3385 ≈ 14.77.

Example: Given the following frequency distribution of the ages of first year students of a particular college (Age in years against Number of Students), calculate the harmonic mean. Solution: The given distribution is grouped data; the variable involved is the age of the first year students, while the numbers of students represent the frequencies f. We tabulate the Ages (years), f and f/X, together with the totals.

Now we find the harmonic mean as H.M. = Σf / Σ(f/X) years.

Example: Calculate the harmonic mean for the data given below (Marks classes against frequency f). Solution: The necessary calculations tabulate the Marks, the mid points X, f and f/X, together with the totals; we then find the harmonic mean as H.M. = Σf / Σ(f/X).

Merits and Demerits of Harmonic Mean
Merits:
o It is based on all observations.
o It is not much affected by fluctuations of sampling.
o It is capable of algebraic treatment.
o It is an appropriate average for averaging ratios and rates.
o It does not give much weight to large items.
Demerits:
o Its calculation is difficult.
o It gives high weightage to small items.
o It cannot be calculated if any one of the items is zero.
o It is usually a value which does not exist in the given data.
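The harmonic-mean formulas above can be verified directly; the ungrouped values are those of the worked example, while the grouped distribution is a hypothetical illustration of H.M. = Σf / Σ(f/X).

```python
# Ungrouped data: H.M. = n / Σ(1/X)
values = [13.5, 14.5, 14.8, 15.2, 16.1]
hm = len(values) / sum(1 / x for x in values)
print(round(hm, 2))   # ≈ 14.77

# Grouped data: H.M. = Σf / Σ(f/x), x = class mid point (hypothetical distribution)
mid_points = [15, 25, 35, 45]
frequencies = [4, 10, 8, 3]
hm_grouped = sum(frequencies) / sum(f / x for f, x in zip(frequencies, mid_points))
print(round(hm_grouped, 2))
```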

Median
The median is the middle value of the arrayed data: when the data are arranged in order, the median is the middle value if the number of values is odd, and the mean of the two middle values if the number of values is even. A value which divides the arrayed set of data into two equal parts is called the median; the number of values greater than the median equals the number of values smaller than it. It is also known as a positional average. It is denoted by X̃, read as X-tilde.

Median from Ungrouped Data:

Median = value of the ((n + 1)/2)th item.
Example: Find the median of the values 4, 1, 8, 13, 11. Solution: Arrange the data: 1, 4, 8, 11, 13. Here n = 5, so
Median = value of the ((5 + 1)/2)th item = value of the 3rd item = 8.

Example: Find the median of the values 5, 7, 10, 20, 16, 12. Solution: Arrange the data: 5, 7, 10, 12, 16, 20. Here n = 6, so
Median = value of the ((6 + 1)/2)th item = value of the 3.5th item = (10 + 12)/2 = 11.

Median from Grouped Data

For grouped data we find the cumulative frequencies and then calculate the median number n/2. The median lies in the group (class) which corresponds to the cumulative frequency in which n/2 lies. We use the following formula to find the median:

Median = l + (h/f) × (n/2 − C)

where
l = lower class boundary of the median class,
f = frequency of the median class,
n = number of values, or total frequency,
C = cumulative frequency of the class preceding the median class,
h = class interval size of the median class.

Example: Calculate the median from the following data.
Group:      60 - 64   65 - 69   70 - 74   75 - 79   80 - 84   85 - 89
Frequency:      1         5         9        12         7         2

Solution:
Group      f    Class Boundaries    Cumulative Frequency
60 - 64    1    59.5 - 64.5          1
65 - 69    5    64.5 - 69.5          6
70 - 74    9    69.5 - 74.5         15
75 - 79   12    74.5 - 79.5         27
80 - 84    7    79.5 - 84.5         34
85 - 89    2    84.5 - 89.5         36

Here n = 36, so n/2 = 18. The cumulative frequency in which 18 lies is 27, so the median class is 75 - 79 with boundaries 74.5 - 79.5. Thus l = 74.5, f = 12, h = 5 and C = 15, and
Median = 74.5 + (5/12)(18 − 15) = 74.5 + 1.25 = 75.75.

Median from Discrete Data
When the data follow a discrete set of values grouped by size, we use the formula median = value of the ((n + 1)/2)th item. First we form a cumulative frequency distribution, and the median is the value of X which corresponds to the cumulative frequency in which the ((n + 1)/2)th item lies.
Example: The following frequency distribution is classified according to the number of leaves on different branches. Calculate the median number of leaves per branch.
No. of Leaves (X):     1    2    3    4    5    6    7
No. of Branches (f):   2   11   15   20   25   18   10
Solution:
X     f    Cumulative Frequency (C.F.)
1     2      2
2    11     13
3    15     28
4    20     48
5    25     73
6    18     91
7    10    101
Total: 101

Median = size of the ((101 + 1)/2)th item = size of the 51st item. The 51st item lies in the cumulative frequency 73, which corresponds to X = 5, so the median is 5 leaves per branch.
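The median rules above are easy to check in code. The ungrouped cases below reuse the data of the worked examples, and the grouped case applies Median = l + (h/f)(n/2 − C) to the 60-89 frequency table.

```python
def median_ungrouped(data):
    xs = sorted(data)
    n = len(xs)
    if n % 2 == 1:
        return xs[n // 2]                        # middle value when n is odd
    return (xs[n // 2 - 1] + xs[n // 2]) / 2     # mean of the two middle values

print(median_ungrouped([4, 1, 8, 13, 11]))       # 8
print(median_ungrouped([5, 7, 10, 20, 16, 12]))  # 11.0

# Grouped data: classes 60-64 ... 85-89 with frequencies 1, 5, 9, 12, 7, 2
boundaries = [59.5, 64.5, 69.5, 74.5, 79.5, 84.5, 89.5]
freqs = [1, 5, 9, 12, 7, 2]
n = sum(freqs)
cum = 0
for i, f in enumerate(freqs):
    if cum + f >= n / 2:                         # median class found
        l, h, C = boundaries[i], boundaries[i + 1] - boundaries[i], cum
        print(l + (h / f) * (n / 2 - C))         # 75.75
        break
    cum += f
```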

Concept of Mode
Mode is the value which occurs the greatest number of times in the data. When each value occurs the same number of times, there is no mode. If two or more values occur the same number of times, then there are two or more modes and the distribution is said to be multimodal. If the data have only one mode, the distribution is said to be unimodal, and if the data

have two modes, the distribution is said to be bimodal.

Mode from Ungrouped Data
The mode is found from ungrouped data by inspecting the given data: we pick out the value which occurs the greatest number of times.

Mode from Grouped Data
For a frequency distribution with equal class interval sizes, the class which has the maximum frequency is called the modal class, and the mode is calculated as

Mode = l + ((fm − f1) / ((fm − f1) + (fm − f2))) × h,   or equivalently   Mode = l + ((fm − f1) / (2fm − f1 − f2)) × h

where
l = lower class boundary of the modal class,
fm = frequency of the modal class (the maximum frequency),
f1 = frequency preceding the modal class frequency,
f2 = frequency following the modal class frequency,
h = class interval size of the modal class.

Mode from Discrete Data: when the data follow a discrete set of values, the mode may be found by inspection; it is the value of X corresponding to the maximum frequency.

Example: Find the mode of the values 5, 7, 2, 9, 7, 10, 8, 5, 7. Solution: The mode is 7, because it occurs the greatest number of times in the data.

Example: The weights of 50 college students are given in the following table. Find the mode of the distribution.
Weights (kg):          60 - 64   65 - 69   70 - 74   75 - 79   80 - 84
No. of Students (f):       5         9        16        12         8
Solution:
Weights (kg)    f    Class Boundaries
60 - 64         5    59.5 - 64.5
65 - 69         9    64.5 - 69.5
70 - 74        16    69.5 - 74.5
75 - 79        12    74.5 - 79.5
80 - 84         8    79.5 - 84.5
The maximum frequency is 16, so the modal class is 70 - 74 with boundaries 69.5 - 74.5. Thus l = 69.5, fm = 16, f1 = 9, f2 = 12 and h = 5, and
Mode = 69.5 + ((16 − 9) / ((16 − 9) + (16 − 12))) × 5 = 69.5 + (7/11) × 5 ≈ 72.68 kg.

Example: The following frequency distribution shows the number of children in each family in a locality. Find the mode.
No. of Children:    0    1    2    3    4    5    6
No. of Families:    6   30   42   55   25   18    5
Solution: The data follow a discrete set of values, so the mode is 3 (the value corresponding to the maximum frequency, 55).
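A small sketch applying the grouped-data mode formula Mode = l + ((fm − f1)/((fm − f1) + (fm − f2))) × h to the weights distribution above; the discrete case is just the most frequent value.

```python
from collections import Counter

# Discrete / ungrouped data: the mode is the most frequent value
print(Counter([5, 7, 2, 9, 7, 10, 8, 5, 7]).most_common(1)[0][0])   # 7

# Grouped data: weights 60-64 ... 80-84 with frequencies 5, 9, 16, 12, 8
lower_boundaries = [59.5, 64.5, 69.5, 74.5, 79.5]
freqs = [5, 9, 16, 12, 8]
h = 5
i = freqs.index(max(freqs))   # modal class = class with the maximum frequency
# (assumes the modal class is neither the first nor the last class)
l, fm, f1, f2 = lower_boundaries[i], freqs[i], freqs[i - 1], freqs[i + 1]
print(round(l + (fm - f1) / ((fm - f1) + (fm - f2)) * h, 2))        # ≈ 72.68
```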


UNIT 2
Introduction to Measures of Dispersion
A modern student of statistics is mainly interested in the study of variability and uncertainty. In this section we shall discuss variability and its measures; uncertainty will be discussed under probability. We live in a changing world, and changes are taking place in every sphere of life. A student of statistics does not show much interest in things which are constant. The total area of the earth may not be very important to a research-minded person, but the area under different crops, the area covered by forests, and the area covered by residential and commercial buildings are figures of great importance, because these figures keep on changing from time to time and from place to place. A very large number of experts are engaged in the study of changing phenomena. Experts working in different countries of the world keep a watch on the forces which are responsible for bringing changes in the fields of human interest. Agricultural, industrial and mineral production and their transportation from one part of the world to another are matters of great interest to economists, statisticians and other experts. Changes in human population, in the standard of living, in the literacy rate and in prices attract the experts to make detailed studies of them and then correlate these changes with human life. Thus variability or variation is connected with human life, and its study is very important for mankind.

Dispersion: The word dispersion has a technical meaning in statistics. The average measures the center of the data; it is one aspect of the observations. Another feature of the observations is how they are spread about the center. The observations may be close to the center or they may be spread away from it. If the observations are close to the center (usually the arithmetic mean or median), we say that the dispersion, scatter or variation is small. If the observations are spread away from the center, we say the dispersion is large. Suppose we have three groups of students who have obtained the following marks in a test; the arithmetic means of the three groups are also given:
Group A: 46, 48, 50, 52, 54 (mean 50)
Group B: 30, 40, 50, 60, 70 (mean 50)
Group C: 40, 50, 60, 70, 80 (mean 60)
In groups A and B the arithmetic means are equal, i.e. 50. But in group A the observations are concentrated about the center; all students of group A have almost the same level of performance, and we say that there is consistency in the observations of group A. In group B the mean is also 50, but the observations are not close to the center: one observation is as small as 30 and one is as large as 70. Thus there is greater dispersion in group B. In group C the mean is 60, but the spread of the observations with respect to their center 60 is the same as the spread of the observations in group B with respect to their own center, 50. Thus in groups B and C the means are different but their dispersion is the same, while in groups A and C both the means and the dispersions are different. Dispersion is an important feature of the observations, and it is measured with the help of measures of dispersion, scatter or variation; the word variability is also used for this idea. The study of dispersion is very important in statistical data. If in a certain factory there is consistency in the wages of workers, the workers will be satisfied. But if some workers have high wages and some have low wages, there will be unrest among the low-paid workers and they might go on strike and arrange demonstrations. If in a certain country some people are very poor and some are very rich, we say there is economic disparity; it means that the dispersion is large. The idea of dispersion is important in the study of wages of workers, prices of commodities, standards of living of different people, the distribution of wealth, the distribution of land among farmers, and various other fields of life. Some brief definitions of dispersion are:
o The degree to which numerical data tend to spread about an average value is called the dispersion or variation of the data.
o Dispersion or variation may be defined as a statistic signifying the extent of the scatteredness of items around a measure of central tendency.
o Dispersion or variation is the measurement of the scatter of the sizes of the items of a series about the average.

Measures of Dispersion
For the study of dispersion, we need measures which show whether the dispersion is small or large. There are two types of measures of dispersion:
(a) Absolute measures of dispersion
(b) Relative measures of dispersion

Absolute Measures of Dispersion
These measures give us an idea about the amount of dispersion in a set of observations. They give the answers in the same units as the units of the original observations: when the observations are in kilograms, the absolute measure is also in kilograms. If we have two sets of observations, we cannot always use the absolute measures to compare their dispersion; we shall explain later when the absolute measures can be used for comparing dispersion in two or more sets of data. The absolute measures which are commonly used are:
o The Range
o The Quartile Deviation
o The Mean Deviation
o The Standard Deviation and Variance

Relative Measures of Dispersion
These measures are calculated for the comparison of dispersion in two or more sets of observations. They are free of the units in which the original data are measured: if the original data are in dollars or kilometers, we do not use these units with a relative measure of dispersion. These measures are a sort of ratio and are called coefficients. Each absolute measure of dispersion can be converted into its relative measure. The relative measures of dispersion are:
o Coefficient of Range, or Coefficient of Dispersion
o Coefficient of Quartile Deviation, or Quartile Coefficient of Dispersion
o Coefficient of Mean Deviation, or Mean Coefficient of Dispersion
o Coefficient of Standard Deviation, or Standard Coefficient of Dispersion
o Coefficient of Variation (a special case of the Standard Coefficient of Dispersion)

Range and Coefficient of Range
The Range: the range is defined as the difference between the maximum and the minimum observation of the given data. If Xm denotes the maximum observation and X0 denotes the minimum observation, then the range is defined as

Range = Xm − X0.
In the case of grouped data, the range is the difference between the upper boundary of the highest class and the lower boundary of the lowest class. It may also be calculated as the difference between the mid points of the highest class and the lowest class. It is the simplest measure of dispersion and gives a general idea about the total spread of the observations. It does not enjoy any prominent place in statistical theory, but it has its application and utility in quality control methods, which are used to maintain the quality of products produced in factories: the quality of products is to be kept within a certain range of values. The range is based on the two extreme observations and gives no weight to the central values of the data. It is a poor measure of dispersion and does not give a good picture of the overall spread of the observations with respect to the center. Let us consider three groups of data which have the same range:
Group A: 30, 40, 40, 40, 40, 40, 50
Group B: 30, 30, 30, 40, 50, 50, 50
Group C: 30, 35, 40, 40, 40, 45, 50
In all three groups the range is 50 − 30 = 20. In group A there is concentration of the observations in the center; in group B the observations cluster at the extreme corners; and in group C the observations are almost equally distributed over the interval from 30 to 50. The range fails to explain these differences between the three groups of data. This defect in the range cannot be removed even if we calculate the coefficient of range, which is a relative measure of dispersion. If we calculate the range of a sample, we cannot draw any inference about the range of the population.

Coefficient of Range
It is a relative measure of dispersion based on the value of the range; it is also called the range coefficient of dispersion. It is defined as:

Coefficient of Range = (Xm − X0) / (Xm + X0).
The range is standardized by the total Xm + X0. Let us take two sets of observations. Set A contains the marks of five students in Mathematics out of 25 marks, and set B contains the marks of the same students in English out of 100 marks.
Set A: 10, 15, 18, 20, 20
Set B: 30, 35, 40, 45, 50
The values of the range and coefficient of range are calculated as:

Set A (Mathematics): Range = 20 − 10 = 10; Coefficient of Range = (20 − 10) / (20 + 10) = 10/30 ≈ 0.33.
Set B (English): Range = 50 − 30 = 20; Coefficient of Range = (50 − 30) / (50 + 30) = 20/80 = 0.25.
In set A the range is 10 and in set B the range is 20. Apparently it seems as if there is greater dispersion in set B, but this is not true. The range of 20 in set B is for large observations and the range of 10 in set A is for small observations; thus 20 and 10 cannot be compared directly, because their base is not the same. Marks in Mathematics are out of 25 and marks in English are out of 100, so it makes no sense to compare 10 with 20. When we convert these two values into coefficients of range, we see that the coefficient of range for set A is greater than that of set B. Thus there is greater dispersion or variation in set A: the marks of the students in English are more stable than their marks in Mathematics.

Example: The following are the wages of 8 workers of a factory. Find the range and the coefficient of range. Wages ($): 1400, 1450, 1520, 1380, 1485, 1495, 1575, 1440. Solution: Here the largest value is Xm = 1575 and the smallest value is X0 = 1380, so Range = 1575 − 1380 = 195 and Coefficient of Range = 195 / (1575 + 1380) = 195/2955 ≈ 0.066.

Example: The following distribution gives the number of houses and the number of persons per house. Calculate the range and the coefficient of range. Solution: The largest and smallest numbers of persons per house give Range = Xm − X0 and Coefficient of Range = (Xm − X0) / (Xm + X0).
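A couple of lines reproduce the range computation of the wages example above; the same two formulas cover any ungrouped data set.

```python
wages = [1400, 1450, 1520, 1380, 1485, 1495, 1575, 1440]

xm, x0 = max(wages), min(wages)
value_range = xm - x0                       # Range = Xm - X0
coeff_range = (xm - x0) / (xm + x0)         # Coefficient of Range

print(value_range)              # 195
print(round(coeff_range, 3))    # ≈ 0.066
```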

Example: Find the range of the weights of the students of a university, given the weight classes (kg) and the number of students in each class; calculate the range and the coefficient of range. Solution: We tabulate, for each class, the Weights (kg), the Class Boundaries, the Mid Value and the No. of Students.

Method 1:

Here, using the upper class boundary of the highest class and the lower class boundary of the lowest class,
Range = (upper boundary of highest class) − (lower boundary of lowest class) kilograms,
Coefficient of Range = (difference of boundaries) / (sum of boundaries).

Method 2: Here, using the mid value of the highest class and the mid value of the lowest class,
Range = (mid value of highest class) − (mid value of lowest class) kilograms,
Coefficient of Range = (difference of mid values) / (sum of mid values).

Quartile Deviation and its Coefficient
Quartile Deviation: it is based on the lower quartile Q1 and the upper quartile Q3. The difference Q3 − Q1 is called the inter-quartile range, and the difference Q3 − Q1 divided by 2 is called the semi-inter-quartile range or the quartile deviation. Thus

Quartile Deviation (Q.D.) = (Q3 − Q1) / 2.
The quartile deviation is a slightly better measure of absolute dispersion than the range, but it ignores the observations in the tails. If we take different samples from a population and calculate their quartile deviations, their values are quite likely to be sufficiently different; this is called sampling fluctuation. It is not a popular measure of dispersion, and the quartile deviation calculated from sample data does not help us to draw any conclusion (inference) about the quartile deviation in the population.

Coefficient of Quartile Deviation: a relative measure of dispersion based on the quartile deviation is called the coefficient of quartile deviation. It is defined as

Coefficient of Quartile Deviation = (Q3 − Q1) / (Q3 + Q1).
It is a pure number, free of any units of measurement, and can be used for comparing the dispersion in two or more sets of data.
Example: The wheat production (in kg) of 20 acres is given as: 1120, 1240, 1320, 1040, 1080, 1200, 1440, 1360, 1680, 1730, 1785, 1342, 1960, 1880, 1755, 1720, 1600, 1470, 1750, and 1885. Find the quartile deviation and the coefficient of quartile deviation. Solution: After arranging the observations in ascending order, we get 1040, 1080, 1120, 1200, 1240, 1320, 1342, 1360, 1440, 1470, 1600, 1680, 1720, 1730, 1750, 1755, 1785, 1880, 1885, 1960.


Q1 = value of the ((n + 1)/4)th item = value of the 5.25th item = 1240 + 0.25(1320 − 1240) = 1260.
Q3 = value of the (3(n + 1)/4)th item = value of the 15.75th item = 1750 + 0.75(1755 − 1750) = 1753.75.
Quartile Deviation (Q.D.) = (1753.75 − 1260) / 2 ≈ 246.88.
Coefficient of Quartile Deviation = (1753.75 − 1260) / (1753.75 + 1260) ≈ 0.164.

The Mean Deviation
The mean deviation, or average deviation, is defined as the mean of the absolute deviations of the observations from some suitable average, which may be the arithmetic mean, the median or the mode. The difference (X − average) is called a deviation, and when we ignore the negative sign this deviation is written as |X − average| and read as a "mod deviation". The mean of these mod or absolute deviations is called the mean deviation or the mean absolute deviation. Thus, for sample data in which the suitable average is the arithmetic mean X̄, the mean deviation (M.D.) is given by the relation:

M.D. = Σ|X − X̄| / n.

For a frequency distribution, the mean deviation is given by
M.D. = Σf|X − X̄| / Σf.

When the mean deviation is calculated about the median, the formula becomes
M.D. (about the median) = Σf|X − Median| / Σf,
and the mean deviation about the mode is
M.D. (about the mode) = Σf|X − Mode| / Σf.

For population data the mean deviation about the population mean μ is
M.D. = Σ|X − μ| / N.

The mean deviation is a better measure of absolute dispersion than the range and the quartile deviation. A drawback of the mean deviation is that the use of absolute deviations does not seem logical. The reason for using them is that Σ(X − X̄) is always equal to zero. Even if we use the median or the mode in place of X̄, the summation Σ(X − Median) or Σ(X − Mode) will be zero or approximately zero, with the result that the mean deviation would always be either zero or close to zero. Thus the very definition of the mean deviation is possible only with the absolute deviations.

The mean deviation is based on all the observations, a property which is not possessed by the range and the quartile deviation. The formula of the mean deviation gives the mathematical impression that it is a better way of measuring the variation in the data. Any suitable average among the mean, median or mode can be used in its calculation, but the value of the mean deviation is minimum if the deviations are taken from the median. A serious drawback of the mean deviation is that it cannot be used in statistical inference.

Coefficient of the Mean Deviation
A relative measure of dispersion based on the mean deviation is called the coefficient of the mean deviation or the coefficient of dispersion. It is defined as the ratio of the mean deviation to the average used in the calculation of the mean deviation. Thus
Coefficient of Mean Deviation = Mean Deviation / (mean, median or mode used in its calculation).

Example: Calculate the mean deviation from (1) the arithmetic mean, (2) the median and (3) the mode for the marks obtained by nine students given below, and show that the mean deviation from the median is minimum. Marks (out of 25): 7, 4, 10, 9, 15, 12, 7, 9, 7. Solution: After arranging the observations in ascending order, we get Marks: 4, 7, 7, 7, 9, 9, 10, 12, 15.
Mean = 80/9 ≈ 8.89; Median = 9 (the 5th value); Mode = 7 (since the mark 7 is repeated the maximum number of times).
Σ|X − Mean| ≈ 21.11, so M.D. from the mean ≈ 21.11/9 ≈ 2.35.
Σ|X − Median| = 21, so M.D. from the median = 21/9 ≈ 2.33.
Σ|X − Mode| = 23, so M.D. from the mode = 23/9 ≈ 2.56.
From the above calculations it is clear that the mean deviation from the median has the least value.
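The three mean deviations in the marks example can be confirmed with a short script; the median and mode are computed the same way as in the earlier sections.

```python
from statistics import mean, median, mode

marks = [7, 4, 10, 9, 15, 12, 7, 9, 7]

def mean_deviation(data, center):
    return sum(abs(x - center) for x in data) / len(data)

for name, center in [("mean", mean(marks)), ("median", median(marks)), ("mode", mode(marks))]:
    print(name, round(mean_deviation(marks, center), 2))
# mean 2.35, median 2.33, mode 2.56 -> the deviation about the median is smallest
```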

Example: Calculate the mean deviation from the mean and its coefficient for the following data on the sizes of items and their frequencies. Solution: The necessary calculations tabulate, for each size of item X, the frequency f, |X − X̄| and f|X − X̄|, together with the totals.

Standard Deviation
The standard deviation is defined as the positive square root of the mean of the squared deviations taken from the arithmetic mean of the data. For sample data the standard deviation is denoted by S and is defined as:

S = √( Σ(X − X̄)² / n )

For population data the standard deviation is denoted by σ (sigma) and is defined as:

σ = √( Σ(X − μ)² / N )

For a frequency distribution the formulas become

S = √( Σf(X − X̄)² / Σf )   or   σ = √( Σf(X − μ)² / Σf )

The standard deviation is in the same units as the units of the original observations: if the original observations are in grams, the value of the standard deviation will also be in grams. The standard deviation plays a dominating role in the study of variation in the data. It is a very widely used measure of dispersion and stands like a tower among the measures of dispersion. As far as the important statistical tools are concerned, the first important tool is the mean X̄ and the second important tool is the standard deviation S.

The standard deviation is based on all the observations and is subject to mathematical treatment. It is of great importance for the analysis of data and for various statistical inferences. However, some alternative methods are also available to compute the standard deviation, and these alternative methods simplify the computation. Moreover, in discussing these methods we will confine ourselves to sample data, because a statistician mostly deals with sample data rather than the whole population.

Actual Mean Method
In applying this method we first compute the arithmetic mean of the given data, either ungrouped or grouped, and then take the deviations from the actual mean. The following formulas are applied:
S = √( Σ(X − X̄)² / n )   (for ungrouped data)
S = √( Σf(X − X̄)² / Σf )   (for grouped data)
This method is also known as the direct method.

Assumed Mean Method
(a) We use the following formulas to calculate the standard deviation:
S = √( ΣD²/n − (ΣD/n)² )   (for ungrouped data)
S = √( ΣfD²/Σf − (ΣfD/Σf)² )   (for grouped data)
where D = X − A and A is any assumed mean other than zero. This method is also known as the short-cut method.
(b) If A is considered to be zero, the above formulas reduce to:
S = √( ΣX²/n − (ΣX/n)² )   (for ungrouped data)
S = √( ΣfX²/Σf − (ΣfX/Σf)² )   (for grouped data)
(c) If we are in a position to simplify the calculation by taking some common factor or divisor c (or the class interval h) out of the given data, the formulas for computing the standard deviation are:
S = c × √( Σu²/n − (Σu/n)² )   (for ungrouped data)
S = h × √( Σfu²/Σf − (Σfu/Σf)² )   (for grouped data)
where u = D/c (or D/h), h is the class interval and c the common divisor. This method is also called the method of step-deviation.

Example: Calculate the standard deviation for the following sample data: 2, 4, 8, 6, 10, and 12.
Solution:
Method I: Actual Mean Method

Here X̄ = 42/6 = 7 and Σ(X − X̄)² = 25 + 9 + 1 + 1 + 9 + 25 = 70, so S = √(70/6) ≈ 3.42.

Method II (assumed mean A): with D = X − A, S = √(ΣD²/n − (ΣD/n)²), which again gives S ≈ 3.42 whatever the choice of A.

Method III (assumed mean zero): ΣX = 42 and ΣX² = 364, so S = √(364/6 − (42/6)²) = √(60.67 − 49) ≈ 3.42.

Method IV (step-deviation with a common divisor c): with u = X/c, S = c × √(Σu²/n − (Σu/n)²), which also gives S ≈ 3.42.

Example: Calculate the standard deviation for the following distribution of marks, using all the methods (Marks classes against No. of Students).

Solution:
Method I (Actual Mean Method): tabulate the Marks classes, the mid points X, f, fX and f(X − X̄)², obtain the totals, and apply S = √( Σf(X − X̄)² / Σf ).
Method II (Assumed Mean): tabulate D = X − A, fD and fD², and apply S = √( ΣfD²/Σf − (ΣfD/Σf)² ).
Method III (Assumed Mean Zero): tabulate fX and fX², and apply S = √( ΣfX²/Σf − (ΣfX/Σf)² ).
Method IV (Step-Deviation): tabulate u = (X − A)/h, fu and fu², and apply S = h × √( Σfu²/Σf − (Σfu/Σf)² ).
All four methods give the same value for the standard deviation of the marks.
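The four computational methods above all reduce to the same number. The sketch below checks the direct and short-cut formulas on the ungrouped data 2, 4, 8, 6, 10, 12, and shows the grouped step-deviation formula on a hypothetical marks distribution (the assumed mean A = 6 and the grouped table are illustrative choices, not values from the original).

```python
import math

data = [2, 4, 8, 6, 10, 12]
n = len(data)
mean = sum(data) / n

# Direct (actual mean) method: S = sqrt(Σ(X - X̄)² / n)
s_direct = math.sqrt(sum((x - mean) ** 2 for x in data) / n)

# Short-cut method with an assumed mean A: S = sqrt(ΣD²/n - (ΣD/n)²), D = X - A
A = 6
D = [x - A for x in data]
s_shortcut = math.sqrt(sum(d * d for d in D) / n - (sum(D) / n) ** 2)

print(round(s_direct, 2), round(s_shortcut, 2))   # both ≈ 3.42

# Grouped data, step-deviation: S = h * sqrt(Σfu²/Σf - (Σfu/Σf)²)  (hypothetical marks)
mid_points = [15, 25, 35, 45, 55]
freqs = [3, 8, 12, 5, 2]
h, A = 10, 35
u = [(x - A) / h for x in mid_points]
sf = sum(freqs)
s_grouped = h * math.sqrt(sum(f * ui * ui for f, ui in zip(freqs, u)) / sf
                          - (sum(f * ui for f, ui in zip(freqs, u)) / sf) ** 2)
print(round(s_grouped, 2))
```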

The Variance
Variance is another absolute measure of dispersion. It is defined as the average of the squared differences between each of the observations in a set of data and the mean. For sample data the variance is denoted by S², and the population variance is denoted by σ² (sigma square). The sample variance S² has the formula:

S² = Σ(X − X̄)² / n

where X̄ is the sample mean and n is the number of observations in the sample.

The population variance σ² is defined as:

σ² = Σ(X − μ)² / N

where μ is the mean of the population and N is the number of observations in the data. It may be remembered that the population variance σ² is usually not calculated; the sample variance S² is calculated and, if need be, is used to make inferences about the population variance.

The term Σ(X − X̄)² is positive, therefore S² is always positive. If the original observations are in centimeters, the value of the variance will be (centimeters)²; thus the unit of variance is the square of the unit of the original measurement.

For a frequency distribution the sample variance S² is defined as:

S² = Σf(X − X̄)² / Σf

and the population variance σ² is defined as:

σ² = Σf(X − μ)² / Σf

In simple words we can say that the variance is the square of the standard deviation.

Example: Calculate the variance for the following sample data: 2, 4, 8, 6, 10, and 12. Solution: X̄ = 7 and Σ(X − X̄)² = 70, so S² = 70/6 ≈ 11.67.
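Since the variance is just the square of the standard deviation, the result of this example can be checked directly; the n-divisor definition used in the text is followed here.

```python
data = [2, 4, 8, 6, 10, 12]
n = len(data)
mean = sum(data) / n

variance = sum((x - mean) ** 2 for x in data) / n   # S² = Σ(X - X̄)² / n
print(round(variance, 2))                           # ≈ 11.67
```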

Example: Calculate the variance for the following distribution of marks (Marks classes against No. of Students). Solution: Tabulate the mid points X, f, fX and f(X − X̄)², obtain the totals, and apply S² = Σf(X − X̄)² / Σf.

Sheppard Corrections and Corrected Coefficient of Variation
Sheppard Corrections: in grouped data the different observations are put into the same class. In the calculation of the variance or standard deviation for grouped data, the frequency f is multiplied by X, the mid-point of the respective class. Thus it is assumed that all the observations in a class are centered at the mid-point X. But this is not true, because the observations are spread over the whole class. This assumption introduces some error into the calculation of S² and S. The values of S² and S can be corrected to some extent by applying Sheppard's correction. Thus

Corrected S² = S² − h²/12   and   Corrected S = √(S² − h²/12)

where h is the uniform class interval. This correction is applied to grouped data which have almost equal tails at the start and at the end of the data. If the data have a longer tail on either side, this correction is not applied. If the size of the class interval is not the same in all classes, the correction is not applicable.

Corrected Coefficient of Variation: when the corrected standard deviation is used in the calculation of the coefficient of variation, we get what is called the corrected coefficient of variation. Thus

Corrected Coefficient of Variation = (Corrected S / X̄) × 100.

Example: Calculate Sheppard's correction and the corrected coefficient of variation for the following distribution of marks (Marks classes against No. of Students).

Solution: Tabulate the Marks classes, the mid points X, f, fX and fX²; obtain X̄ and S², and then apply Corrected S² = S² − h²/12 and Corrected Coefficient of Variation = (Corrected S / X̄) × 100.
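A minimal sketch of Sheppard's correction as stated above, using a hypothetical grouped marks distribution with uniform class interval h: the grouped variance is reduced by h²/12 before the coefficient of variation is formed.

```python
import math

# Hypothetical grouped marks: mid points and frequencies, uniform class interval h = 10
mid_points = [15, 25, 35, 45, 55]
freqs = [3, 8, 12, 5, 2]
h = 10

sf = sum(freqs)
mean = sum(f * x for f, x in zip(freqs, mid_points)) / sf
variance = sum(f * (x - mean) ** 2 for f, x in zip(freqs, mid_points)) / sf

corrected_variance = variance - h ** 2 / 12           # Sheppard's correction
corrected_cv = math.sqrt(corrected_variance) / mean * 100

print(round(variance, 2), round(corrected_variance, 2), round(corrected_cv, 2))
```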

Skewness
Lack of symmetry is called skewness. If a distribution is not symmetrical, it is called a skewed distribution; the mean, median and mode then differ in value, and one tail becomes longer than the other. The skewness may be positive or negative.
Positively skewed distribution: if the frequency curve has a longer tail to the right, the distribution is known as positively skewed, and Mean > Median > Mode.
Negatively skewed distribution: if the frequency curve has a longer tail to the left, the distribution is known as negatively skewed, and Mean < Median < Mode.

Measures of Skewness
The difference between the mean and the mode gives an absolute measure of skewness. If we divide this difference by the standard deviation we obtain a relative measure of skewness, known as the coefficient of skewness and denoted by SK.
Karl Pearson's coefficient of skewness: SK = (Mean − Mode) / Standard Deviation.
Sometimes the mode is difficult to find, so we use the alternative formula SK = 3(Mean − Median) / Standard Deviation.
Bowley's coefficient of skewness: SK = (Q3 + Q1 − 2 Median) / (Q3 − Q1).
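The two coefficients can be computed side by side; the sample below is a hypothetical, mildly right-skewed data set, so Pearson's and Bowley's coefficients both come out positive.

```python
from statistics import mean, median, pstdev, quantiles

# Hypothetical, mildly right-skewed sample
data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 9, 12]

# Karl Pearson's coefficient (median form): SK = 3(Mean - Median) / S
sk_pearson = 3 * (mean(data) - median(data)) / pstdev(data)

# Bowley's coefficient: SK = (Q3 + Q1 - 2*Median) / (Q3 - Q1)
q1, q2, q3 = quantiles(data, n=4)
sk_bowley = (q3 + q1 - 2 * q2) / (q3 - q1)

print(round(sk_pearson, 2), round(sk_bowley, 2))
```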

Kurtosis
The height and sharpness of the peak relative to the rest of the data are measured by a number called kurtosis. Higher values indicate a higher, sharper peak; lower values indicate a lower, less distinct peak. This occurs because higher kurtosis means more of the variability is due to a few extreme differences from the mean, rather than a lot of modest differences from the mean. Balanda and MacGillivray say the same thing in another way: increasing kurtosis is associated with the movement of probability mass from the shoulders of a distribution into its center and tails. The reference standard is a normal distribution, which has a kurtosis of 3. In token of this, often the excess kurtosis is presented: excess kurtosis is simply kurtosis minus 3. For example, the kurtosis reported by Excel is actually the excess kurtosis. A normal distribution has kurtosis exactly 3 (excess kurtosis exactly 0). Any distribution with kurtosis approximately 3 (excess approximately 0) is called mesokurtic. A distribution with kurtosis < 3 (excess kurtosis < 0) is called platykurtic: compared to a normal distribution, its central peak is lower and broader, and its tails are shorter and thinner. A distribution with kurtosis > 3 (excess kurtosis > 0) is called leptokurtic: compared to a normal distribution, its central peak is higher and sharper, and its tails are longer and fatter.

Examples (figure): a Uniform distribution (min = −3, max = 3) has kurtosis 1.8 (excess −1.2); a Normal distribution (μ = 0, σ = 1) has kurtosis 3 (excess 0); a Logistic distribution (μ = 0, s = 0.55153) has kurtosis 4.2 (excess 1.2).


UNIT 3
Introduction to Regression and Correlation
The statistical methods discussed so far are used to analyze data involving only one variable. Often an analysis of data concerning two or more variables is needed to look for any statistical relationship or association between them. A few instances where knowledge of an association or relationship between two variables would be vital for making a decision are:
o Family income and expenditure on luxury items.
o Sales revenue and expenses incurred on advertising.
o Yield of a crop and quantity of fertilizer applied.
The following aspects are considered in examining the statistical relationship between two or more variables:
o Is there an association between two or more variables? If yes, what is the form and degree of that relationship?
o Is the relationship strong or significant enough to arrive at a desirable conclusion?
o Can the relationship be used for predictive purposes, that is, to predict the most likely value of the dependent variable corresponding to a given value of the independent variable or variables?
There are two different techniques used for the study of two or more variables: regression and correlation. Both study the behavior of the variables, but they differ in their end results. Regression studies the relationship where dependence is necessarily involved: one variable depends on a certain number of other variables, and regression can be used for predicting the values of the variable which depends on the others. The term regression was introduced by the English biometrician Sir Francis Galton (1822 - 1911). Correlation attempts to study the strength of the mutual relationship between two variables; in correlation we assume that the variables are random and that dependence of any nature is not involved.

Linear Model
Regression involves the study of equations. First we talk about some simple equations or linear models. The simplest mathematical model or equation is the equation of a straight line.
Example: Suppose a shopkeeper is selling pencils at 2 cents per pencil. A table can be made giving the number of pencils sold and the sale price of the pencils. For convenience, let X denote the number of pencils sold and S (S for sale) denote the amount realized by selling X pencils.

The information written above can be presented in other forms as well. For example, we can write an equation describing the relation between X and S. It is very simple to write the equation: the algebraic equation connecting X and S is S = 2X. It is called a mathematical equation or mathematical model, in which S depends upon X. Here X is called the independent variable and S is called the dependent variable. Each pencil sold brings exactly 2 cents, neither less than 2 nor more than 2. The above model is called a deterministic mathematical model, because we can determine the value of S without any error by putting the value of X into the equation. The sale S is said to be a function of X; in symbolic form this statement is written as S = f(X), read as "S is a function of X". It means that S depends upon X and only X, and no other element. The data in the table can be presented in the form of a graph, as shown in the figure.

The main features of the graph in the figure are:
1. The graph lies in the first quadrant, because all the values of X and S are positive.
2. It is an exact straight line. But not all graphs are in the form of a straight line; a graph could also be some curve.
3. All the points (pairs of X and S) lie on the straight line.
4. The line passes through the origin.
5. Take any point on the line and drop a perpendicular joining it to the X-axis, and find the ratio of the perpendicular (the change in S) to the base (the change in X). For this line the ratio is 2, because S increases by 2 units for every 1 unit increase in X.
This ratio is called the slope of the line, and in general it is denoted by b. The slope of the line is the same at all points on the line; it is equal to the change in S for a unit change in X. The relation S = 2X is also called a linear equation between X and S.

Example: Suppose a carpenter wants to make some wooden toys for small children. He has purchased some wood and other material for a fixed amount, and there is in addition a cost of making each toy. A table can give the information about the number of toys made and the total cost of the toys. Let X denote the number of toys and Y denote the cost of the toys. What is the algebraic relation between X and Y? When X = 0, the cost Y equals the amount spent on wood and material; this is called the fixed or starting cost and may be denoted by a. For each additional toy the cost increases by the cost b of making one toy. Thus X and Y are connected through the equation Y = a + bX. It is called the equation of a straight line, and it is also a mathematical model of deterministic nature. Let us make the graph of the data in the given table; the figure shows this graph.

Let us note some important features of the graph obtained in the figure:
1. The line does not pass through the origin. It passes through the point (0, a) on the Y-axis; the distance a between the origin and this point is called the intercept and is usually denoted by a.
2. Take any point on the line and complete a triangle as shown in the figure, and find the ratio between the perpendicular and the base of this triangle. This ratio is the change in Y per unit change in X, and it is denoted by b in the equation of the straight line. Thus the equation of the straight line Y = a + bX has intercept a and slope b.
In general, when the values of the intercept and slope are not known, we write the equation of the straight line as Y = a + bX. It is also called the linear equation between X and Y, and the relation between X

and Y is called linear. The equation Y = a + bX may also be called the exact linear model between X and Y, or simply the linear model between X and Y. The value of Y can be determined completely when X is given; the relation Y = a + bX is therefore called the deterministic linear model between X and Y. In statistics, when we use the term linear model, we shall not mean a mathematical model as described above.

Non-Linear Model
Let us consider a quadratic equation such as Y = a + bX + cX². By putting the values of X into this equation, we find the values of Y as given in a table, from which the first and second differences of Y are calculated. The second differences are exactly constant. The general quadratic equation, or non-linear model, is written as

Y = a + bX + cX²

It is also called a second-degree parabola or second-degree curve. The graph of such data is shown in the figure given below: the figure is not a straight line, it is a curve, and we say that the model is non-linear. The reader is advised to remember that if in certain observed data the second differences are constant or almost constant, we fit a second-degree curve close to the observed data. We shall face this type of situation in time series.

Scatter Diagram
A scatter diagram is a graphic picture of the sample data. Suppose a random sample of n pairs of observations is available. These points are plotted on a rectangular co-ordinate system, taking the independent variable on the X-axis and the dependent variable on the Y-axis; whatever the name of the independent variable, it is taken on the X-axis. Suppose the plotted points are as shown in figure (a); such a diagram is called a scatter diagram. In this figure we see that when X has a small value, Y is also small, and when X takes a large value, Y also takes a large value. This is called a direct or positive relationship between X and Y. The plotted points cluster around a straight line, and it appears that if a straight line is drawn passing through the points, the line will be a good approximation for representing the original data. Suppose we draw a line AB to represent the scattered points. The line AB rises from left to right and has a positive slope. This line can be used to establish an approximate relation between the random variable Y and the independent variable X. It is a non-mathematical method, in the sense that different persons may draw different lines; this line is called the regression line obtained by inspection or judgment.


Making a scatter diagram and drawing a line or curve is the primary investigation for assessing the type of relationship between the variables. The knowledge gained from the scatter diagram can be used for further analysis of the data. In most cases the diagrams are not as simple as in figure (a); there are quite complicated diagrams, and it is difficult to choose a proper mathematical model for representing the original data. The scatter diagram gives an indication of the appropriate model which should be used for further analysis with the help of the method of least squares. Figure (b) shows points in the scatter diagram falling from the top left corner towards the right; this relation is called inverse or indirect. The points lie in the neighborhood of a certain line called the regression line. As long as the scattered points show closeness to a straight line in some direction, we draw a straight line to represent the sample data; but when the points do not lie around a straight line, we do not draw the regression line. Figure (c) shows points which have a tendency to fall from left to right in the form of a curve; this relation is called non-linear or curvilinear. Figure (d) shows points which apparently do not follow any pattern: if X takes a small value, Y may take a small or a large value. There seems to be no sympathy between X and Y, and such a diagram suggests that there is no relationship between the two variables.

Correlation
Correlation is a technique which measures the strength of the association between two variables. Both variables X and Y may be random, or it may be that one variable is independent (non-random) and the other, to be correlated, is dependent. When the changes in one variable appear to be linked with the changes in the other variable, the two variables are said to be correlated. When the two variables are meaningfully related and both increase or both decrease simultaneously, the correlation is termed positive. If an increase in one variable is associated with a decrease in the other variable, the correlation is termed negative or inverse. Suppose marks in Mathematics are denoted by X and marks in Statistics are denoted by Y. If small values of X appear with small values of Y and large values of X come with large values of Y, the correlation is said to be positive. If X stands for marks in English and Y stands for marks in Mathematics, it is possible that small values of X appear with large values of Y; this is a case of negative correlation.

Linear and Non-Linear Correlation
Linear Correlation: correlation is said to be linear if the ratio of change is constant; doubling the output of a factory by doubling the number of workers is an example of linear correlation. In other words, if all the points on the scatter diagram tend to lie near a line which looks like a straight line, the correlation is said to be linear, as shown in the figure.
Non-Linear (Curvilinear) Correlation: correlation is said to be non-linear if the ratio of change is not constant. In other words, if all the points on the scatter diagram tend to lie near a smooth curve, the correlation is said to be non-linear (curvilinear), as shown in the figure.

Positive Correlation: correlation in the same direction is called positive correlation: if one variable increases, the other also increases, and if one variable decreases, the other also decreases. For example, the length of an iron bar will increase as the temperature increases.
Negative Correlation: correlation in the opposite direction is called negative correlation: if one variable increases, the other decreases, and vice versa. For example, the volume of a gas will decrease as the pressure increases, or the demand for a particular commodity increases as the price of that commodity decreases.
No Correlation or Zero Correlation: if there is no relationship between the two variables, so that the value of one variable changes while the other variable remains constant, we speak of no correlation or zero correlation.

Perfect Correlation: If any change in the value of one variable is accompanied by a change in the value of the other variable in a fixed proportion, the correlation between them is said to be perfect. It is indicated numerically as +1 or -1.
Perfect Positive Correlation: If the values of both variables move in the same direction in a fixed proportion, the correlation is called perfect positive correlation. It is indicated numerically as +1.
Perfect Negative Correlation: If the values of both variables move in opposite directions in a fixed proportion, the correlation is called perfect negative correlation. It is indicated numerically as -1.
Karl Pearson's coefficient of correlation
Definition: Given a set of N pairs of observations (X1, Y1), (X2, Y2), ..., (XN, YN) relating to two variables X and Y, the coefficient of correlation between X and Y, denoted by the symbol r, is defined as

r = Cov(X, Y) / (sX · sY)

where Cov(X, Y) = covariance of X and Y, sX = standard deviation of variable X, and sY = standard deviation of variable Y. This expression is known as Pearson's product-moment formula and is used as a measure of linear correlation between X and Y. Expanding the formula of Cov(X, Y) gives the equivalent form

r = Σ(X − X̄)(Y − Ȳ) / √[ Σ(X − X̄)² · Σ(Y − Ȳ)² ]
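The formula can be checked with a short script. Below is a minimal sketch (Python, hypothetical marks data) that computes r directly from the definition, dividing by N in both the covariance and the standard deviations; dividing by N − 1 throughout gives the same value of r, since the factors cancel.

import math

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n   # Cov(X, Y)
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)             # standard deviation of X
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)             # standard deviation of Y
    return cov / (s_x * s_y)

maths_marks = [35, 40, 60, 78, 88]       # hypothetical marks in Mathematics (X)
stats_marks = [30, 45, 58, 75, 85]       # hypothetical marks in Statistics (Y)
print(pearson_r(maths_marks, stats_marks))  # close to +1: a strong positive correlation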

Properties of Regression Coefficients
The following are the important properties of regression coefficients:
1. Same sign: Both regression coefficients have the same sign, i.e. either both are positive or both are negative.
2. Both cannot be greater than one: If one of the regression coefficients is greater than unity, the other must be less than unity, so that the product of the two regression coefficients does not exceed unity. In other words, both regression coefficients cannot be greater than one.
3. Independent of origin: Regression coefficients are independent of the change of origin but not of the change of scale.
4. A.M. ≥ r: The arithmetic mean of the two regression coefficients is greater than or equal to the correlation coefficient.
5. r is the G.M.: The correlation coefficient is the geometric mean of the two regression coefficients.
6. r, bxy and byx have the same sign: The coefficient of correlation has the same sign as the regression coefficients, i.e. if the regression coefficients are positive, r will also be positive, and if the regression coefficients are negative, r will also be negative.

Examples of Correlation
Example: Calculate and analyze the correlation coefficient between the number of study hours and the number of sleeping hours of different students.

Number of study hours (X):    2   4   6   8   10
Number of sleeping hours (Y): 10  9   8   7   6

Solution: The necessary calculations are given below, with deviations taken from the means X̄ = 6 and Ȳ = 8 (x = X − X̄, y = Y − Ȳ):

X     Y     x     y     xy     x²    y²
2     10    -4    2     -8     16    4
4     9     -2    1     -2     4     1
6     8     0     0     0      0     0
8     7     2     -1    -2     4     1
10    6     4     -2    -8     16    4
                        Σxy = -20   Σx² = 40   Σy² = 10

r = Σxy / √(Σx² · Σy²) = -20 / √(40 × 10) = -20 / 20 = -1

There is perfect negative correlation between the number of study hours and the number of sleeping hours.

Example: From the following data, compute the coefficient of correlation between X and Y:

                                    X series    Y series
Number of items                     15          15
Arithmetic mean                     25          18
Sum of squares of deviations        136         138

Summation of products of deviations of the X and Y series from their arithmetic means = 122.

Solution: Here Σxy = 122, Σx² = 136 and Σy² = 138, and hence

r = Σxy / √(Σx² · Σy²) = 122 / √(136 × 138) = 122 / 137.0 ≈ 0.89
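As a quick cross-check of the two examples above, the sketch below (assuming NumPy is available) recomputes the first correlation from the raw data and the second from the given summary figures.

import math
import numpy as np

study = [2, 4, 6, 8, 10]                 # number of study hours (X)
sleep = [10, 9, 8, 7, 6]                 # number of sleeping hours (Y)
print(np.corrcoef(study, sleep)[0, 1])   # -1.0: perfect negative correlation

r = 122 / math.sqrt(136 * 138)           # r = Σxy / √(Σx² · Σy²)
print(round(r, 2))                       # approximately 0.89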

Coefficient of Standard Deviation and Variation
Coefficient of Standard Deviation: The standard deviation is an absolute measure of dispersion. Its relative measure is called the coefficient of dispersion or the coefficient of standard deviation. It is defined as:

Coefficient of Standard Deviation = S / X̄

where S is the standard deviation and X̄ is the arithmetic mean of the data.

Coefficient of Variation: The most important of all the relative measures of dispersion is the coefficient of variation (C.V.). Note that the word is variation, not variance; there is no such thing as a coefficient of variance. The coefficient of variation is defined as:

Coefficient of Variation: C.V. = (S / X̄) × 100%

Thus the C.V. is the value of S when X̄ is assumed equal to 100. It is a pure number, and the unit of the observations is not mentioned with its value. It is written in percentage form, like 20% or 25%. When its value is 20%, it means that when the mean of the observations is assumed equal to 100, their standard deviation will be 20. The C.V. is used to compare the dispersion in different sets of data, particularly data which differ in their means or in their units of measurement. The wages of workers may be in dollars and the consumption of meat in their families may be in kilograms. The standard deviation of wages in dollars cannot be compared with the standard deviation of amounts of meat in kilograms. Both standard deviations need to be converted into coefficients of variation for comparison. Suppose the value of the C.V. for wages is 10% and the value

of the C.V. for kilograms of meat is 25%. This means that the wages of workers are consistent: their wages are close to the overall average of their wages. But the families consume meat in quite different quantities: some families use very small quantities of meat and some others use large quantities. We say that there is greater variation in their consumption of meat; the observations about the quantity of meat are more dispersed, or more variable.
Example: Calculate the coefficient of standard deviation and coefficient of variation for the following sample data: 2, 4, 8, 6, 10 and 12. Solution:

X̄ = ΣX / n = 42 / 6 = 7,  S = √(ΣX² / n − X̄²) = √(364 / 6 − 49) = √11.67 ≈ 3.42

Coefficient of Standard Deviation = S / X̄ = 3.42 / 7 ≈ 0.49

Coefficient of Variation = (S / X̄) × 100 = (3.42 / 7) × 100 ≈ 48.86%
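A short sketch of the same calculation is shown below. It assumes the standard deviation is taken in its population form (dividing by n, as in the working above); dividing by n − 1 instead would give S ≈ 3.74 and a coefficient of variation of about 53%.

import math

data = [2, 4, 8, 6, 10, 12]
n = len(data)
mean = sum(data) / n                                    # X̄ = 7.0
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)  # S ≈ 3.42 (divides by n)
coeff_sd = sd / mean                                    # approximately 0.49
cv = coeff_sd * 100                                     # approximately 48.9 %
print(round(sd, 2), round(coeff_sd, 2), round(cv, 2))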

Example: Calculate the coefficient of standard deviation and coefficient of variation from a frequency distribution of marks (classes of marks with the number of students in each class).

Solution: From the frequency table, compute the mean X̄ = Σfx / Σf and the standard deviation S = √(Σfx² / Σf − (Σfx / Σf)²); the coefficient of standard deviation is then S / X̄ and the coefficient of variation is (S / X̄) × 100%.

Linear Regression
Regression: The term regression was introduced by Francis Galton in 1885. It is defined as the dependence of one variable upon another variable; for example, weight depends upon height, or the yield of wheat depends upon the amount of fertilizer. In regression we can estimate the unknown values of one (dependent) variable from known values of the other (independent) variable.
Linear Regression: When the dependence of the variable is represented by a straight line, it is called linear regression; otherwise it is said to be non-linear or curvilinear regression. For example, if Y is the dependent variable and X the independent variable, then the relation Y = a + bX is a linear regression.
Regression Line of Y on X: Regression lines study the average relationship between two variables. In the regression line of Y on X, we estimate the average value of Y for a given value of X:
Y = a + bX
where Y is the dependent and X the independent variable. The alternate form of the regression line of Y on X is:

Y − Ȳ = byx (X − X̄),  where byx = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²

Regression Line of X on Y: In the regression line of X on Y, we estimate the average value of X for a given value of Y.

X = c + dY

where X is the dependent and Y the independent variable. The alternate form of the regression line of X on Y is:

X − X̄ = bxy (Y − Ȳ),  where bxy = Σ(X − X̄)(Y − Ȳ) / Σ(Y − Ȳ)²

Properties of the Regression Line
Regression is concerned with the study of the relationship among variables. The aim of regression (or regression analysis) is to build models for prediction and for making other inferences, and two or more variables may be treated by regression. The regression line of Y on X is usually written as Ŷ = a + bX, where a = Ȳ − bX̄ and b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)². The general properties of the regression line are:
1. The line Ŷ = a + bX passes through the point (X̄, Ȳ), i.e. through the means of the two variables.
2. The sum of the errors (the deviations of the observed Y from the estimated Ŷ) is equal to zero: Σ(Y − Ŷ) = 0.
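A minimal least-squares sketch (Python, with hypothetical data) of the regression line of Y on X is given below, illustrating the two properties above: the fitted line passes through (X̄, Ȳ) and the errors Y − Ŷ sum to zero.

def regression_y_on_x(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # slope b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)², intercept a = Ȳ − b·X̄
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x
    return a, b

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = regression_y_on_x(x, y)
y_hat = [a + b * xi for xi in x]
print(a, b)                                                   # a = 2.2, b = 0.6
print(abs((a + b * 3.0) - 4.0) < 1e-9)                        # passes through (X̄, Ȳ) = (3, 4)
print(abs(sum(yi - fi for yi, fi in zip(y, y_hat))) < 1e-9)   # Σ(Y − Ŷ) = 0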


UNIT 4
The method in which we select samples to learn more about characteristics in a given population is called hypothesis testing. Hypothesis testing is really a systematic way to test claims or ideas about a group or population. To illustrate, suppose we read an article stating that children in the United States watch an average of 3 hours of TV per week. To test whether this claim is true, we record the time (in hours) that a group of 20 American children (the sample), among all children in the United States (the population), watch TV. The mean we measure for these 20 children is a sample mean. We can then compare the sample mean we select to the population mean stated in the article. Hypothesis testing or significance testing is a method for testing a claim or hypothesis about a parameter in a population, using data measured in a sample. In this method, we test some hypothesis by determining the likelihood that a sample statistic could have been selected, if the hypothesis regarding the population parameter were true. The method of hypothesis testing can be summarized in four steps.
1. To begin, we identify a hypothesis or claim that we feel should be tested. For example, we might want to test the claim that the mean number of hours that children in the United States watch TV is 3 hours.
2. We select a criterion upon which we decide whether the claim being tested is true or not. For example, the claim is that children watch 3 hours of TV per week. Most samples we select should have a mean close to or equal to 3 hours if the claim we are testing is true. So at what point do we decide that the discrepancy between the sample mean and 3 is so big that the claim we are testing is likely not true? We answer this question in this step of hypothesis testing.
3. Select a random sample from the population and measure the sample mean. For example, we could select 20 children and measure the mean time (in hours) that they watch TV per week.
4. Compare what we observe in the sample to what we expect to observe if the claim we are testing is true. We expect the sample mean to be around 3 hours. If the discrepancy between the sample mean and the population mean is small, then we will likely decide that the claim we are testing is indeed true. If the discrepancy is too large, then we will likely decide to reject the claim as being not true.
Step 1: State the hypotheses. We begin by stating the value of a population mean in a null hypothesis, which we presume is true. For the children watching TV example, we state the null hypothesis that children in the United States watch an average of 3 hours of TV per week. This is a starting point so that we can decide whether this is likely to be true, similar to the presumption of innocence in a courtroom. When a defendant is on trial, the jury starts by assuming that the defendant is innocent. The basis of the decision is to determine whether this assumption is true. Likewise, in hypothesis testing, we start by assuming that the hypothesis or claim we are testing is true. This is stated in the null hypothesis. The basis of the decision is to determine whether this assumption is likely to be true. The null hypothesis (H0), stated as the null, is a statement about a population parameter, such as the population mean, that is assumed to be true. The null hypothesis is a starting point. We will test whether the value stated in the null hypothesis is likely to be true. Keep in mind that the only reason we are testing the null hypothesis is because we think it is wrong.
We state what we think is wrong about the null hypothesis in an alternative hypothesis. For the children watching TV example, we may have reason to believe that children watch more than (>) or less than (<) 3 hours of TV per week. When we are uncertain of the direction, we can state that the value in the null hypothesis is not equal to (≠) 3 hours. In a courtroom, since the defendant is assumed to be innocent (this is the null hypothesis, so to speak), the burden is on a prosecutor to conduct a trial to show evidence that the defendant is not innocent. In a similar way, we assume the null hypothesis is true, placing the burden on the researcher to conduct a study to show evidence that the null hypothesis is unlikely to be true. Regardless, we always make a decision about the null hypothesis (that it is likely or unlikely to be true). The alternative hypothesis is needed for Step 2. An alternative hypothesis (H1) is a statement that directly contradicts a null hypothesis by stating that the actual value of a population parameter is less than, greater than, or not equal to the value stated in the null hypothesis. The alternative hypothesis states what we think is wrong about the null hypothesis, which is needed for Step 2.
Step 2: Set the criteria for a decision. To set the criteria for a decision, we state the level of significance for a test. This is similar to the criterion that jurors use in a criminal trial. Jurors decide whether the evidence presented shows guilt beyond a reasonable doubt (this is the criterion). Likewise, in hypothesis testing, we collect data to show that the null hypothesis is not true, based on the likelihood of selecting a sample mean from a population (the likelihood is the criterion). The likelihood or

level of significance is typically set at 5% in behavioral research studies. When the probability of obtaining a sample mean is less than 5% if the null hypothesis were true, then we conclude that the sample we selected is too unlikely and so we reject the null hypothesis. Level of significance, or significance level, refers to a criterion of judgment upon which a decision is made regarding the value stated in a null hypothesis. The criterion is based on the probability of obtaining a statistic measured in a sample if the value stated in the null hypothesis were true. In behavioral science, the criterion or level of significance is typically set at 5%. When the probability of obtaining a sample mean is less than 5% if the null hypothesis were true, then we reject the value stated in the null hypothesis. The alternative hypothesis establishes where to place the level of significance. Remember that we know that the sample mean will equal the population mean on average if the null hypothesis is true. All other possible values of the sample mean are normally distributed (central limit theorem). The empirical rule tells us that at least 95% of all sample means fall within about 2 standard deviations (SD) of the population mean, meaning that there is less than a 5% probability of obtaining a sample mean that is beyond 2 SD from the population mean. For the children watching TV example, we can look for the probability of obtaining a sample mean beyond 2 SD in the upper tail (greater than 3), the lower tail (less than 3), or both tails (not equal to 3). Figure 8.2 shows that the alternative hypothesis is used to determine in which tail or tails to place the level of significance for a hypothesis test.
Step 3: Compute the test statistic. Suppose we measure a sample mean equal to 4 hours per week that children watch TV. To make a decision, we need to evaluate how likely this sample outcome is, if the population mean stated by the null hypothesis (3 hours per week) is true. We use a test statistic to determine this likelihood. Specifically, a test statistic tells us how far, or how many standard deviations, a sample mean is from the population mean. The larger the value of the test statistic, the further the distance, or number of standard deviations, a sample mean is from the population mean stated in the null hypothesis. The value of the test statistic is used to make a decision in Step 4. The test statistic is a mathematical formula that allows researchers to determine the likelihood of obtaining sample outcomes if the null hypothesis were true. The value of the test statistic is used to make a decision regarding the null hypothesis.
Step 4: Make a decision. We use the value of the test statistic to make a decision about the null hypothesis. The decision is based on the probability of obtaining a sample mean, given that the value stated in the null hypothesis is true. If the probability of obtaining a sample mean is less than 5% when the null hypothesis is true, then the decision is to reject the null hypothesis. If the probability of obtaining a sample mean is greater than 5% when the null hypothesis is true, then the decision is to retain the null hypothesis. In sum, there are two decisions a researcher can make:
1. Reject the null hypothesis. The sample mean is associated with a low probability of occurrence when the null hypothesis is true.
2. Retain the null hypothesis. The sample mean is associated with a high probability of occurrence when the null hypothesis is true.
The probability of obtaining a sample mean, given that the value stated in the null hypothesis is true, is stated by the p value. The p value is a probability: it varies between 0 and 1 and can never be negative. In Step 2, we stated the criterion or probability of obtaining a sample mean at which point we will decide to reject the value stated in the null hypothesis, which is typically set at 5% in behavioral research. To make a decision, we compare the p value to the criterion we set in Step 2. A p value is the probability of obtaining a sample outcome, given that the value stated in the null hypothesis is true. The p value for obtaining a sample outcome is compared to the level of significance. Significance, or statistical significance, describes a decision made concerning a value stated in the null hypothesis. When the null hypothesis is rejected, we reach significance. When the null hypothesis is retained, we fail to reach significance. When the p value is less than 5% (p < .05), we reject the null hypothesis. We will refer to p < .05 as the criterion for deciding to reject the null hypothesis, although note that when p = .05, the decision is also to reject the null hypothesis. When the p value is greater than 5% (p > .05), we retain the null hypothesis. The decision to reject or retain the null hypothesis is called significance. When the p value is less than .05, we reach significance; the decision is to reject the null hypothesis. When the p value is greater than .05, we fail to reach significance; the decision is to retain the null hypothesis. Figure 8.3 shows the four steps of hypothesis testing.
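As a rough illustration of the four steps, the sketch below runs a one-sample t test in SciPy on made-up weekly TV hours for 20 children; the chapter's own reasoning uses the normal distribution and 2-SD cut-offs, so the t test here is only a stand-in for "compute a test statistic and its p value".

from scipy import stats

# Step 1: H0: the population mean is 3 hours per week; H1: it is not equal to 3.
mu_0 = 3
# Step 2: set the criterion (level of significance), alpha = .05.
alpha = 0.05
# Step 3: a hypothetical sample of 20 children's weekly TV hours, and the test statistic.
sample = [4, 3, 5, 2, 4, 4, 3, 5, 4, 3, 4, 5, 2, 4, 3, 4, 5, 4, 3, 4]
t_stat, p_value = stats.ttest_1samp(sample, mu_0)
# Step 4: compare the p value to alpha and make a decision.
decision = "reject H0" if p_value < alpha else "retain H0"
print(round(t_stat, 2), round(p_value, 3), decision)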

Simple Hypothesis and Composite Hypothesis
A simple hypothesis is one in which all parameters of the distribution are specified. For example, if the heights of college students are normally distributed with a known variance σ², the hypothesis that the mean has a particular value, say H0: μ = μ0, is a simple hypothesis, as the mean and variance together specify a normal distribution completely. A simple hypothesis, in general, states that θ = θ0, where θ0 is the specified value of a parameter θ (θ may represent μ, p, μ1 − μ2, etc.). A hypothesis which is not simple (i.e. in which not all of the parameters are specified) is called a composite hypothesis. For instance, if we hypothesize that μ > μ0 (and σ² = σ0²), or that μ = μ0 and σ² < σ0², the hypothesis becomes a composite hypothesis, because we cannot know the exact distribution of the population in either case: the parameters have more than one possible value and no single specified values are being assigned. The general form of a composite hypothesis is θ ≤ θ0 or θ ≥ θ0, that is, the parameter θ does not exceed, or does not fall short of, a specified value θ0. The concept of simple and composite hypotheses applies to both the null hypothesis and the alternative hypothesis.
Hypotheses may also be classified as exact and inexact. A hypothesis is said to be exact if it specifies a unique value for the parameter, such as θ = θ0. A hypothesis is called inexact when it indicates more than one possible value for the parameter, such as θ ≠ θ0 or θ > θ0. A simple hypothesis must be an exact one, while an exact hypothesis is not necessarily a simple hypothesis. An inexact hypothesis is a composite hypothesis.

MAKING A DECISION: TYPES OF ERROR
In Step 4, we decide whether to retain or reject the null hypothesis. Because we are observing a sample and not an entire population, it is possible that a conclusion may be wrong. Table 8.3 shows that there are four decision alternatives regarding the truth and falsity of the decision we make about a null hypothesis:
1. The decision to retain the null hypothesis could be correct.
2. The decision to retain the null hypothesis could be incorrect.
3. The decision to reject the null hypothesis could be correct.
4. The decision to reject the null hypothesis could be incorrect.
DECISION: RETAIN THE NULL HYPOTHESIS
When we decide to retain the null hypothesis, we can be correct or incorrect. The correct decision is to retain a true null hypothesis. This decision is called a null result or null finding. This is usually an uninteresting decision because the decision is to retain what we already assumed: that the value stated in the null hypothesis is correct. For this reason, null results alone are rarely published in behavioral research. The incorrect decision is to retain a false null hypothesis. This decision is an example of a Type II error, or beta (β) error. With each test we make, there is always some probability that the decision could be a Type II error. In this decision, we decide to retain previous notions of truth that are in fact false. While it's an error, we still did nothing; we retained the null hypothesis. We can always go back and conduct more studies. Type II error, or beta (β) error, is the probability of retaining a null hypothesis that is actually false.
DECISION: REJECT THE NULL HYPOTHESIS
When we decide to reject the null hypothesis, we can be correct or incorrect. The incorrect decision is to reject a true null hypothesis. This decision is an example of a Type I error. With each test we make, there is always some probability that our decision is a Type I error. A researcher who makes this error decides to reject previous notions of truth that are in fact true.

Making this type of error is analogous to finding an innocent person guilty. To minimize this error, we assume a defendant is innocent when beginning a trial. Similarly, to minimize making a Type I error, we assume the null hypothesis is true when beginning a hypothesis test. Type I error is the probability of rejecting a null hypothesis that is actually true. Researchers directly control for the probability of committing this type of error. An alpha (α) level is the level of significance or criterion for a hypothesis test. It is the largest probability of committing a Type I error that we will allow and still decide to reject the null hypothesis. Since we assume the null hypothesis is true, we control for Type I error by stating a level of significance. The level we set, called the alpha level (symbolized as α), is the largest probability of committing a Type I error that we will allow and still decide to reject the null hypothesis. This criterion is usually set at .05 (α = .05), and we compare the alpha level to the p value. When the probability of a Type I error is less than 5% (p < .05), we decide to reject the null hypothesis; otherwise, we retain the null hypothesis. The correct decision is to reject a false null hypothesis. There is always some probability that we decide that the null hypothesis is false when it is indeed false. This decision is called the power of the decision-making process. It is called power because it is the decision we aim for. Remember that we are only testing the null hypothesis because we think it is wrong. Deciding to reject a false null hypothesis, then, is the power, inasmuch as we learn the most about populations when we accurately reject false notions of truth. This decision is the most published result in behavioral research.
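These decision outcomes can be made concrete with a small simulation (a sketch assuming NumPy and SciPy): when the null hypothesis is true, the long-run proportion of rejections approximates the Type I error rate (alpha); when it is false, that proportion approximates the power of the test.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n, trials = 0.05, 30, 5000

def rejection_rate(true_mean):
    rejections = 0
    for _ in range(trials):
        sample = rng.normal(loc=true_mean, scale=1.0, size=n)
        _, p = stats.ttest_1samp(sample, 0.0)   # test H0: population mean = 0
        rejections += p < alpha
    return rejections / trials

print(rejection_rate(0.0))   # H0 true:  roughly 0.05 (Type I error rate)
print(rejection_rate(0.5))   # H0 false: roughly 0.7 to 0.8 here (power)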

The F-statistic is a value resulting from a standard statistical test used in ANOVA and regression analysis to determine whether two population variances (or, in ANOVA, the variation between group means relative to the variation within groups) are significantly different. For practical purposes, it is important to know that this value determines the P-value, but the F-statistic itself will not actually be used in the interpretation here. If the null hypothesis is true, then the F test statistic given above can be simplified (dramatically): the ratio of the two sample variances is the test statistic used. If this ratio departs too far from 1, we reject the null hypothesis that the ratio equals 1, and with it our assumption that the variances are equal. There are several different F-tables, each for a different level of significance. So, find the correct level of significance first, and then look up the numerator degrees of freedom and the denominator degrees of freedom to find the critical value. You will notice that all of the tables only give levels of significance for right-tail tests. Because the F distribution is not symmetric, and there are no negative values, you may not simply take the opposite of the right critical value to find the left critical value. The way to find a left critical value is to reverse the degrees of freedom, look up the right critical value, and then take the reciprocal of this value. For example, the critical value with 0.05 on the left with 12 numerator and 15 denominator degrees of freedom is found by taking the reciprocal of the critical value with 0.05 on the right with 15 numerator and 12 denominator degrees of freedom. Significance, or the P-value, is the probability that an effect at least as extreme as the current observation has occurred by chance. Therefore, in these particular examples, the drop in the prevalence of low WAZ from 38% to 26% in the better-roofing groups, and from 40% to 16% in the groups with higher education, is unlikely to have occurred by chance. For the roofing example, P (Sig.) = 0.031: roughly 97 times out of every 100, a difference this large would not occur by chance alone. For the education example, P (Sig.) = 0.000: there is greater than 99.9% certainty that the difference did not occur by chance. In medical research, if the P-value is less than or equal to 0.05, meaning that there is no more than a 5%, or 1 in 20, probability of observing a result as extreme as that observed solely due to chance, then the association between the exposure and disease is considered statistically significant.
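The left-tail rule described above is easy to verify with SciPy's F distribution: the critical value with 0.05 in the left tail for (12, 15) degrees of freedom equals the reciprocal of the value with 0.05 in the right tail for (15, 12).

from scipy import stats

right_tail = stats.f.ppf(0.95, dfn=15, dfd=12)  # 0.05 in the right tail, degrees of freedom reversed
left_tail = stats.f.ppf(0.05, dfn=12, dfd=15)   # 0.05 in the left tail
print(left_tail, 1 / right_tail)                # the two values agree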

                          Variable 1
Variable 2        Category 1    Category 2    Total
Data type 1       a             c             a + c
Data type 2       b             d             b + d
Totals            a + b         c + d         a + b + c + d = N

For a 2 x 2 contingency table the chi-square statistic is calculated by the formula:

χ² = N(ad − bc)² / [(a + b)(c + d)(a + c)(b + d)]

Note: notice that the four components of the denominator are the four totals from the table columns and rows. Suppose you conducted a drug trial on a group of animals and you hypothesized that the animals receiving the drug would show increased heart rates compared to those that did not receive the drug. You conduct the study and collect the following data:
Ho: The proportion of animals whose heart rate increased is independent of drug treatment.
Ha: The proportion of animals whose heart rate increased is associated with drug treatment.
Table 2. Hypothetical drug trial results.

                 Heart Rate Increased    No Heart Rate Increase    Total
Treated          36                      14                        50
Not treated      30                      25                        55
Total            66                      39                        105

Applying the formula above we get:

χ² = 105 × [(36)(25) − (14)(30)]² / [(50)(55)(39)(66)] = 3.418

Before we can proceed we need to know how many degrees of freedom we have. When a comparison is made between one sample and another, a simple rule is that the degrees of freedom equal (number of columns minus one) × (number of rows minus one), not counting the totals for rows or columns. For our data this gives (2 − 1) × (2 − 1) = 1. We now have our chi-square statistic (χ² = 3.418), our predetermined alpha level of significance (0.05), and our degrees of freedom (df = 1). Entering the chi-square distribution table with 1 degree of freedom and reading along the row, we find our value of χ² (3.418) lies between 2.706 and 3.841. The corresponding probability is between the 0.10 and 0.05 probability levels. That means that the p-value is above 0.05 (it is actually 0.065). Since a p-value of 0.065 is greater than the conventionally accepted significance level of 0.05 (i.e. p > 0.05), we fail to reject the null hypothesis. In other words, there is no statistically significant difference in the proportion of animals whose heart rate increased. What would happen if the number of control animals whose heart rate increased dropped to 29 instead of 30 and, consequently, the number of controls whose heart rate did not increase changed from 25 to 26? Try it. Notice that the new χ² value is 4.125, and this value exceeds the table value of 3.841 (at 1 degree of freedom and an alpha level of 0.05). This means that p < 0.05 (it is now 0.04) and we reject the null hypothesis in favor of the alternative hypothesis: the heart rate of animals is different between the treatment groups. When p < 0.05 we generally refer to this as a significant difference.
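The same example can be cross-checked in SciPy; correction=False matches the uncorrected formula used above (no Yates continuity correction).

from scipy.stats import chi2_contingency

table = [[36, 14],   # treated:     heart rate increased, no increase
         [30, 25]]   # not treated: heart rate increased, no increase
chi2, p, df, expected = chi2_contingency(table, correction=False)
print(round(chi2, 3), round(p, 3), df)   # approximately 3.418, 0.065, 1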

T-test to Compare Two Sample Means (Data Sets A and B)
This test (as described below) assumes: (a) a normal (Gaussian) distribution for the populations of the random errors, and (b) no significant difference between the standard deviations of the two population samples. The two means and the corresponding standard deviations are calculated by using the following equations (nA and nB are the numbers of measurements in data set A and data set B, respectively):

x̄A = (Σ xA,i) / nA,   x̄B = (Σ xB,i) / nB
sA = √[ Σ(xA,i − x̄A)² / (nA − 1) ],   sB = √[ Σ(xB,i − x̄B)² / (nB − 1) ]

Then, the pooled estimate of the standard deviation, sAB, is calculated:

sAB = √[ ((nA − 1) sA² + (nB − 1) sB²) / (nA + nB − 2) ]

Finally, the statistic texp (the experimental t value) is calculated:

texp = |x̄A − x̄B| / ( sAB · √(1/nA + 1/nB) )

The texp value is compared with the critical (theoretical) tth value corresponding to the given degrees of freedom N (in the present case N = nA + nB − 2) and the confidence level chosen. Tables of critical t values can be found in any book of statistical analysis, as well as in many quantitative analysis textbooks. If texp > tth, then H0 is rejected; otherwise H0 is retained.
How this test and the other significance tests are performed using a statistical analysis program
Nowadays, the rather tedious calculation of statistics (such as texp) has been greatly simplified by using statistical analysis programs. Furthermore, there is no need to use statistical tables containing critical values. Instead, after loading the data and executing the program, a numerical value P is internally calculated (usually by mathematically complicated procedures) and displayed. This P is the probability of a Type 1 error (specific to the data given), and this is more than adequate information for the user to judge the acceptance or the rejection of the null hypothesis. For example, supposing that we have decided to work at a confidence level of 95% (i.e. we accept a probability of a Type 1 error no greater than 0.05), then: (a) a value of P = 0.085 means that H0 must be accepted, because otherwise we would risk an unacceptably high probability (more than 0.05) of a Type 1 error; (b) a value of P = 0.021 means that H0 must be rejected, because the probability of a Type 1 error is quite low (less than 0.05). Accordingly, if we had decided to work at a confidence level of 90%, in both cases (P = 0.085, P = 0.021) H0 would have been rejected, whereas if we had decided to work at a confidence level of 99%, in both cases H0 would have been accepted.
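A sketch of this test on hypothetical measurements A and B is shown below; scipy.stats.ttest_ind with equal_var=True uses the pooled standard deviation sAB and N = nA + nB − 2 degrees of freedom, and returns the two-tailed P value discussed above.

from scipy import stats

A = [10.1, 10.3, 9.8, 10.0, 10.2]   # hypothetical measurements, data set A
B = [10.4, 10.6, 10.5, 10.3, 10.7]  # hypothetical measurements, data set B
t_exp, p = stats.ttest_ind(A, B, equal_var=True)
alpha = 0.05                        # working at the 95% confidence level
print(round(t_exp, 3), round(p, 4), "reject H0" if p < alpha else "retain H0")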

Page | 42

Das könnte Ihnen auch gefallen