
# Chapter - 1

## 1.1.1 Definition of Statistics and Classification of Statistics

A. Definition of Statistics
Statistics can be defined in two senses: plural (as statistical data) and singular (as statistical
methods).
Plural sense: Statistics are a collection of facts (figures). This meaning of the word is widely used
when reference is made to facts and figures on sales, employment or unemployment, accidents,
weather, deaths, education, etc. E.g.: sales statistics, labor statistics, employment statistics.
In this sense the word statistics simply means data. But not all numerical data are statistics.
Singular sense: Statistics is the science that deals with the methods of collecting,
organizing, presenting, analyzing and interpreting data. It refers to the subject area that is
concerned with extracting relevant information from available data with the aim of making sound
decisions. According to this meaning, statistics is concerned with the development and
application of methods and techniques for collecting, organizing, presenting, analyzing and
interpreting statistical data.
B. Classification of Statistics
Statistics can be classified into two broad classes: descriptive statistics and inferential
statistics.
Descriptive statistics:
This part of statistics deals only with describing characteristics of the data collected,
without going beyond the data. In other words, it describes the sample data without
attempting to infer (conclude) anything beyond them.
Descriptive statistics deals with the collection of data, its presentation in various forms such
as tables, graphs and diagrams, and the finding of averages and other measures that
describe the data.
Descriptive statistics refers only to the actual data, that is, the data at hand. It is used to
describe the features of the data gathered by the researcher.

Examples:
- Classification of students at DMU Campus according to their department
- The number of female and male students in this class
Inferential Statistics:
This type of statistics is concerned with drawing statistically valid conclusions about the
characteristics of the population (large group) based on information obtained from a
sample (small group). That is, this part of statistics is concerned with generalizing the
results of a sample to the entire population from which the sample is drawn.
It is the part of statistics that generalizes from sample to population using
probability, performs hypothesis tests, determines relationships between variables,
and makes predictions.
Example: Of 50 randomly selected people in the town of Gondar, 10 people had the last name
Abebe. An example of inferential statistics is the following statement: "about 20% of
all people living in Ethiopia have the last name Abebe."

## 1.1.2 Stages in Statistical Investigation

According to the definition of statistics, we have the following five stages of a statistical
investigation.
1. Collection of data: This is the first stage of a statistical investigation. The data should be
collected with a specific and well-defined purpose so that the conclusions drawn are not
misleading. There are two methods of data collection: primary and secondary. The primary
method refers to obtaining original, first-hand data; the secondary method involves
obtaining data from other sources.
2. Organization of data: This is a methodology for classifying and describing the properties
of data in a summary form. Editing, coding and classification are the three steps in the
organization of data.
3. Presentation of data: In this stage the collected and organized data are presented in
some systematic order to facilitate statistical analysis. The organized data are presented with
the help of tables, diagrams and graphs.

4. Analysis of data: Analysis of data involves extracting relevant information from the
collected data (such as the mean, median, mode, range and variance) using mathematical
and statistical tools, mainly elementary mathematical operations.
5. Interpretation of data: This stage involves drawing valid conclusions from the analyzed
data. That is, interpretation of data involves making inferences (drawing conclusions) based
on the analysis of data.

## 1.1.3 Definition of Some Statistical Terms

Sampling: The process of selecting a sample from the population.
Population: The totality of things, objects, people, etc. about which information is being
collected.
Sample: A subset or part of a population selected to draw conclusions about the population.
Census survey: The process of examining the entire population; the total count of the
population.
Parameter: A descriptive measure (value) computed from the population and used to describe
it. Example: population mean and population standard deviation.
Statistic: A measure used to describe the sample; a value computed from the sample.
Sampling frame: A list of people, items or units from which the sample is taken.
Data: A collection of related facts and figures from which conclusions may be drawn.
Variable: A characteristic that changes from object to object and from time to time.
Sample size: The number of elements or observations to be included in the sample.
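
The distinction between a parameter and a statistic can be illustrated with a short Python sketch. The population values below are hypothetical (invented for illustration only), and the sample size of 4 and the random seed are arbitrary assumptions.

```python
import random
import statistics

# Hypothetical population: heights (cm) of 10 students. These values are made up.
population = [150, 155, 160, 162, 165, 168, 170, 172, 175, 183]

# Parameter: a measure computed from the whole population (a census).
population_mean = statistics.mean(population)

# Statistic: a measure computed from a sample drawn from the population.
rng = random.Random(1)                  # fixed seed so the run is repeatable
sample = rng.sample(population, k=4)    # sampling without replacement
sample_mean = statistics.mean(sample)

print("sample size:", len(sample))
print("parameter (population mean):", population_mean)
print("statistic (sample mean):", sample_mean)
```

The sample mean generally differs from the population mean and changes from sample to sample, which is why it is called a statistic rather than a parameter.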

## 1.1.4 Applications, Uses and Limitations of Statistics

Application: Statistics is applied in almost all fields of human endeavor. It has become a
scientific framework for many fields, including education, agriculture, business and economics,
industry and health.

Uses of Statistics

- Presents facts in a summarized and precise form
- Simplifies complex data (data reduction)
- Facilitates comparisons
- Helps in estimating unknown population characteristics
- Helps in studying the relationship between two or more variables
- Helps in predicting and forecasting future values and in formulating policies
- In scientific research: statistics is used as a tool; statistical formulas and concepts are
  applied to data that result from an experiment.
- In quality control: statistical methods help to check whether a product satisfies a given
  standard.
- In reliability engineering: the study of the ability of a system or component to perform
  its required functions under stated conditions for a specified period of time.
- In mechanics: probability theory, which includes mathematical tools for dealing with
  large populations, is applied to the field of mechanics, which is concerned with the motion
  of particles or objects subjected to a force.

The field of statistics deals with the collection, presentation, analysis and use of data to
make decisions, solve problems, and design products and processes. It is the science of
learning information from data.
Limitations of Statistics

- Statistics deals only with aggregates of facts, not with individual data items.
- Statistics deals only with quantitative data (information).
- Statistical results are true only on average (approximately).
- Statistics can easily be misused and therefore should be used by experts.

## 1.1.5 Types of Variables and Measurement Scales

Variable: A characteristic or attribute that can assume different values. E.g.: height,
family size, gender.
Based on the values that they assume, variables can be classified as:
1. Qualitative variables: do not assume numeric values. E.g.: gender

2. Quantitative variables: assume numeric values. These variables are numeric in nature. E.g.:
height, family size
Quantitative variables can be further classified as discrete or continuous.
- Discrete variable: takes whole-number values and consists of distinct, recognizable
  individual elements that can be counted. It is a variable that assumes a finite or countable
  number of possible values. These values are obtained by counting (0, 1, 2, ...). E.g.:
  family size, number of children in a family, number of cars at a traffic light
- Continuous variable: takes any value, including decimals. Such a variable can
  theoretically assume an infinite number of possible values. These values are obtained by
  measuring. E.g.: height, weight, time and temperature
In general, the values of a variable are obtained by counting for discrete variables, by
measuring for continuous variables, or by making categories for qualitative variables.
Exercise: Classify each of the following as qualitative or quantitative and, if it is quantitative,
classify it as discrete or continuous.
a. Color of automobiles in a dealer's showroom.
b. Number of seats in a movie theater.
c. Classification of patients based on nursing care needed (complete, partial or self-care).
d. Number of tomatoes on each plant on a field.
e. Weight of newly born babies.
Scales of Measurement / Levels of Measurement
According to the scale of measurement, data can be classified as nominal, ordinal, interval or
ratio data.
Nominal scale variables are qualitative variables that show the category of
individuals. They reflect classification into categories (names of groups) where there is no
particular order or qualitative difference between the labels. Numbers may be assigned to the
categories simply for coding purposes, but it is not possible to compare individuals based on
the numbers assigned to them. The only mathematical operation permissible on these variables
is counting. These variables:
- Have mutually exclusive (non-overlapping) and exhaustive categories.
- Have no ranking or order between (among) the values of the variable.

Example: Gender, religion, ID number, ethnicity, color
Ordinal scale variables are qualitative variables whose values can be ordered
and ranked. Ranking and counting are the only mathematical operations that can be done on the
values of the variables. There is no precise difference between the values (categories) of
the variable. E.g.: Academic qualifications (B.Sc., M.Sc., Ph.D.), strength (very weak, weak,
strong, very strong), health status (very sick, sick, cured)
Interval scale variables are quantitative variables for which a value of zero
does not show absence of the characteristic, i.e. there is no true zero. Zero indicates
a low value rather than nothing. There is a precise difference between the units of
measurement (levels).
E.g.: Temperature. 0 °C does not mean there is no temperature; it means it is cold.
Ratio scale variables are quantitative variables for which a value of
zero shows absence of the characteristic. All
mathematical operations can be performed on the values of the variables.
E.g.: Height, weight, income, amount of yield, expenditure, consumption.
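
A short numerical sketch may help to show why the absence of a true zero matters. The temperatures and weights below are arbitrary illustrative values, and the Kelvin conversion (adding 273.15) is brought in here only for illustration; it is not part of the original notes.

```python
# Interval scale: Celsius has an arbitrary zero, so ratios are not meaningful.
t1_c, t2_c = 10.0, 20.0
diff_c = t2_c - t1_c          # differences are meaningful: 10 degrees
ratio_c = t2_c / t1_c         # 2.0, yet 20 °C is not "twice as hot" as 10 °C

# The same two temperatures on the Kelvin scale, which has a true zero.
t1_k, t2_k = t1_c + 273.15, t2_c + 273.15
diff_k = t2_k - t1_k          # still 10 degrees
ratio_k = t2_k / t1_k         # about 1.035, not 2

# Ratio scale: weight has a true zero, so ratios survive a change of unit.
w1_kg, w2_kg = 40.0, 80.0
ratio_kg = w2_kg / w1_kg                      # 2.0
ratio_g = (w2_kg * 1000) / (w1_kg * 1000)     # 2.0 again

print(diff_c, ratio_c, round(ratio_k, 3), ratio_kg, ratio_g)
```

On the interval scale only differences keep their meaning when the unit changes; on the ratio scale both differences and ratios do.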

## 1.2 Methods of Data Collection and Presentation

## 1.2.1 Methods of Data Collection

We have already explained what is meant by statistical data. Numerical facts or measurements
obtained in the course of an enquiry into a phenomenon marked by uncertainty constitute
statistical data. The statistical data may already be available or may have to be collected by an
investigator or an agency. Data are termed primary when they are collected for the
first time by the investigator, and secondary when they are taken from records or other sources.
Based on the source, data can therefore be classified into two types: primary data and secondary data.
Method of primary data collection
In primary data collection, you collect the data yourself using methods such as interviews,
observations, laboratory experiments and questionnaires. The key point here is that the data you
collect is unique to you and your research and, until you publish, no one else has access to it.
There are many methods of collecting primary data and the main methods include:
Questionnaire: It is a popular means of collecting data, but it is difficult to design and often
requires many rewrites before an acceptable questionnaire is produced.

Interviewing: A technique that is primarily used to gain an understanding of the underlying
reasons and motivations for people's attitudes, preferences or behavior. Interviews can be
undertaken on a personal one-to-one basis or in a group. They can be conducted at work, at
home, in the street, in a shopping center, or at some other agreed location.
Observation: It involves recording the behavioral patterns of people, objects and events in a
systematic manner.
Diaries: A diary is a way of gathering information about the way individuals spend their time on
professional activities. Diaries in this sense are not records of engagements or personal
journals of thought. They can record either quantitative or qualitative data, and in management
research they can provide information about work patterns and activities.
Laboratory experiment: Conducting laboratory experiments in fields such as the chemical and
biological sciences.
Methods of secondary data collection
Secondary data analysis can be literally defined as second-hand analysis. It is the analysis of
data or information that was either gathered by someone else (e.g., researchers, institutions,
NGOs, etc.) or gathered for some purpose other than the one currently being considered, or often a
combination of the two. Some sources of secondary data are government documents,
official statistics, technical reports, scholarly journals, trade journals, review articles, reference
books, research institutes, universities, hospitals, libraries, library search engines and
computerized databases.

## 1.2.2 Methods of Data Presentation

Having collected and edited the data, the next important step is to present it, that is, to put
the data in a comprehensible, condensed and suitable form that makes it easier to
interpret. Data presentation is the statistical procedure of arranging data in
the form of tables, graphs, charts and diagrams. The need for proper presentation arises because
statistical data in their raw form are not easy to understand.

The process of arranging data into classes or categories according to similarities is called
classification. Classification is a preliminary step that prepares the ground for proper
presentation of data.

The presentation of data is broadly classified into the following two categories:

- Tabular presentation
- Diagrammatic and graphical presentation

## Tabular presentation of data

A statistical table is an orderly and systematic presentation of numerical data in rows and
columns. Rows (stubs) are horizontal and columns (captions) are vertical arrangements. The use
of tables for organizing data involves grouping the data into mutually exclusive categories of the
variables and counting the number of occurrences (frequency) in each category.

The objective of tabular presentation of data, or of classification in general, is to arrange data in
groups or classes according to their similarities.

Tabular presentation of data has the following advantages:

- Eliminates unnecessary details.
- Facilitates comparison.
- Helps to give a bird's-eye view of the significant features of the data.
- Enables required figures to be located more quickly.
- Enables comparisons between different classes to be made more easily.
- Reveals patterns within the figures which cannot be seen in the narrative form.

There are no hard and fast rules for constructing tables, but the following general guidelines apply:

2. Tables should be self-explanatory.

Definitions:

Raw data: Data collected in their original form are called raw data.

Frequency: The number of times a certain value, or a value in a certain class, occurs in
the data.

Frequency distribution: The organization of raw data in table form using classes and
frequencies. A frequency distribution (or frequency table) lists classes (or categories) of values
along with their frequencies. The table consists of two columns: one contains the list of possible
values and/or classes, and the other contains the number of times each of those values or classes
occurred in the data.

A frequency distribution can be categorical, ungrouped or grouped.

1) Categorical frequency Distribution: is a frequency distribution in which the data is only nominal
or ordinal.

Example: A social worker collected the following data on marital status for 25 persons.
(M=married, S=single, W=widowed, D=divorced)

M S D W D
S S M M M
W D S M M
W D D S S
S W W D D
Q. Construct a frequency distribution.

Solution:
Since the data are categorical, discrete classes can be used. There are four types of marital status: M,
S, D and W. These types are used as the classes of the distribution. The following
procedure is used to construct the frequency distribution.

Step 1: Make a table with the columns shown below and enter the classes M, S, D and W in column (1).

| Class (1) | Tally (2) | Frequency (3) | Percent (4) |
|---|---|---|---|
| M | | | |
| S | | | |
| D | | | |
| W | | | |

Step 2: Tally the data and place the result in column (2).
Step 3: Count the tallies and place the result in column (3).
Step 4: Find the percentage of values in each class using
$\% = \frac{f}{n} \times 100$, where $f$ is the frequency of the class and $n$ is the total number of values.
Percentages are not normally part of a frequency distribution, but they can be added since they are
used in certain types of diagrams, such as pie charts.
Step 5: Find the totals for columns (3) and (4).

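Combining all the steps, one can construct the following frequency distribution.

| Class (1) | Tally (2) | Frequency (3) | Percent (4) |
|---|---|---|---|
| M | ///// / | 6 | 24 |
| S | ///// // | 7 | 28 |
| D | ///// // | 7 | 28 |
| W | ///// | 5 | 20 |
| Total | | 25 | 100 |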
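
As a check on the tallying, steps 2 to 5 can be sketched in Python. This is only an illustrative sketch: it counts the 25 marital-status codes exactly as they are listed in the example, and the variable names are arbitrary.

```python
from collections import Counter

# Marital-status codes exactly as listed in the example (25 persons).
data = """
M S D W D
S S M M M
W D S M M
W D D S S
S W W D D
""".split()

n = len(data)
freq = Counter(data)                       # steps 2 and 3: tally and count

print(f"{'Class':<6}{'Freq':>5}{'Percent':>9}")
for cls in ["M", "S", "D", "W"]:
    pct = freq[cls] / n * 100              # step 4: % = f / n * 100
    print(f"{cls:<6}{freq[cls]:>5}{pct:>9.0f}")
print(f"{'Total':<6}{n:>5}{100:>9}")       # step 5: column totals
```

Counting by program is a useful safeguard, since hand tallies of categorical data are easy to miscount.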

2) Ungrouped frequency distribution: is a frequency distribution of numerical data in which
the values are not grouped. It is a table of all the raw score values that could possibly
occur in the data, along with the number of times each actually occurred. It is often constructed
for small sets of data on a discrete variable.

To facilitate counting, one may include a column of tallies.

Example: The following data represent the mark of 20 students.

80 76 90 85 80 65 60 63 74 75
70 60 62 70 85 76 70 70 80 85

Q. Construct an ungrouped frequency distribution.

Solution: Step 1: Arrange the data in order of magnitude and make a table as shown below.
Steps 2 and 3: Tally the data and compute the frequencies.

| Mark | 60 | 62 | 63 | 65 | 70 | 74 | 75 | 76 | 80 | 85 | 90 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Tally | // | / | / | / | //// | / | / | // | /// | /// | / |
| Frequency | 2 | 1 | 1 | 1 | 4 | 1 | 1 | 2 | 3 | 3 | 1 |

Each individual value is presented separately, which is why it is called an ungrouped frequency
distribution.
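
The same procedure can be sketched in Python. The sketch below counts the 20 marks exactly as listed in the example and prints one row per distinct mark; the names `marks` and `table` are arbitrary.

```python
from collections import Counter

# Marks of the 20 students, as listed in the example.
marks = [80, 76, 90, 85, 80, 65, 60, 63, 74, 75,
         70, 60, 62, 70, 85, 76, 70, 70, 80, 85]

freq = Counter(marks)
table = sorted(freq.items())   # (mark, frequency) pairs in ascending order of mark

for mark, f in table:
    # One tally stroke per observation, followed by the frequency.
    print(f"{mark:>4} {'/' * f:<5} {f}")
```

Because every distinct value gets its own row, this approach only stays readable when the number of distinct values is small.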

3) Grouped frequency distribution: When the range of the data is large, the data must be grouped
into classes that are more than one unit in width. A grouped frequency distribution is a frequency
distribution in which several numbers are grouped into one class.

Definitions

Grouped frequency distribution: a frequency distribution in which several numbers are
grouped in one class.

Class limits: separate one class in a grouped frequency distribution from another. The limits
can actually appear in the data, and there is a gap between the upper limit of one class and
the lower limit of the next.

Unit of measurement (U or d): the distance between two possible consecutive measures, or
the gap between two successive classes. It is usually taken as 1, 0.1, 0.01, 0.001, ...

Class boundaries: separate one class in a grouped frequency distribution from another. The
boundaries have one more decimal place than the raw data and therefore do not appear in
the data. There is no gap between the upper boundary of one class and the lower boundary of the
next class. The lower class boundary is found by subtracting U/2 from the corresponding
lower class limit, and the upper class boundary is found by adding U/2 to the corresponding
upper class limit.

Class width: the difference between the upper and lower class boundaries of any class. It is
also the difference between the lower limits of any two consecutive classes or the difference
between any two consecutive class marks.

Class mark (midpoint): the average of the lower and upper class limits, or the average
of the lower and upper class boundaries.

Cumulative frequency: the number of observations less than/more than or equal to a
specific value.

Cumulative frequency above: the total frequency of all values greater than or equal to
the lower class boundary of a given class.

Cumulative frequency below: the total frequency of all values less than or equal to the
upper class boundary of a given class.

Cumulative frequency distribution (CFD): the tabular arrangement of class intervals
together with their corresponding cumulative frequencies. It can be of the more-than or less-than
type, depending on the type of cumulative frequency used.

Relative frequency (rf): the frequency divided by the total frequency.

Relative cumulative frequency (rcf): the cumulative frequency divided by the total
frequency.

## Guidelines for classes

1. Choose the number of classes to use, preferably between 5 and 20.
2. The classes must be mutually exclusive. This means that no data value can fall into two
different classes.
3. The classes must be all inclusive or exhaustive. This means that all data values must be
included.
4. The classes must be continuous. There are no gaps in a frequency distribution.
5. The classes must be equal in width. The exception here is the first or last class when we
have a "below ..." or "... above" class. This is often used with ages.

Steps for constructing Grouped frequency Distribution

1. Find the largest and smallest values and compute the range: $R = \text{Maximum} - \text{Minimum}$.
2. Select the number of classes desired, usually between 5 and 20, or use Sturges' rule
$k = 1 + 3.322 \log_{10} n$, where $k$ is the number of classes desired and $n$ is the total
number of observations.
3. Find the class width by dividing the range by the number of classes, $w = \frac{R}{k}$,
rounded up where necessary. It is also the difference between the upper and lower class
boundaries of a class, that is, $w = \text{UCB} - \text{LCB}$.
4. Pick a suitable starting point less than or equal to the minimum value. The starting point
is called the lower limit of the first class. Continue to add the class width to this lower
limit to get the rest of the lower limits.
5. To find the upper limit of the first class, subtract U from the lower limit of the second
class. Then continue to add the class width to this upper limit to find the rest of the upper
limits.
6. Find the boundaries by subtracting U/2 units from the lower limits and adding U/2 units
to the upper limits.
7. Tally the data.
8. Find the frequencies.
9. Find the cumulative frequencies. Depending on what you are trying to accomplish, it may
not be necessary to find the cumulative frequencies.
10. If necessary, find the relative frequencies and/or relative cumulative frequencies.
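
Steps 1 to 6 can be sketched as a small Python function. This is a minimal sketch under stated assumptions: the data are whole numbers (U = 1), the starting point is the minimum value, and the number of classes and the class width are both rounded up. The function name `build_classes` is hypothetical, and the data are the 20 observations of the worked example in these notes.

```python
import math

def build_classes(data, u=1):
    """Steps 1-6 for data recorded to the nearest `u` (assumed to be 1 here)."""
    r = max(data) - min(data)                           # step 1: range
    k = math.ceil(1 + 3.322 * math.log10(len(data)))    # step 2: Sturges' rule, rounded up
    w = math.ceil(r / k)                                # step 3: class width, rounded up
    lower = [min(data) + i * w for i in range(k)]       # step 4: lower class limits
    upper = [lo + w - u for lo in lower]                # step 5: upper class limits
    bounds = [(lo - u / 2, up + u / 2)                  # step 6: class boundaries
              for lo, up in zip(lower, upper)]
    return k, w, list(zip(lower, upper)), bounds

data = [11, 29, 6, 33, 14, 31, 22, 27, 19, 20,
        18, 17, 22, 38, 23, 21, 26, 34, 39, 27]
k, w, limits, bounds = build_classes(data)
print(k, w)
print(limits)
print(bounds)
```

The sketch does not check that the last upper limit reaches the maximum value; if it does not, an extra class or a wider class width would be needed.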
Example: Construct a frequency distribution for the following data.
11 29 6 33 14 31 22 27 19 20
18 17 22 38 23 21 26 34 39 27

Solution:
Step 1: Find the highest and lowest values, H = 39 and L = 6, and the range: R = H - L = 39 - 6 = 33.
Step 2: Find the number of classes using Sturges' rule:
$k = 1 + 3.322 \log_{10} n = 1 + 3.322 \log_{10}(20) \approx 5.32$, rounded up to 6.
Step 3: Find the class width: $w = \frac{R}{k} = \frac{33}{6} = 5.5$, rounded up to 6.
Step 4: Select the starting point; let it be the minimum observation.
6, 12, 18, 24, 30, 36 are the lower class limits.
Step 5: Find the upper class limits; e.g. the first upper class limit = 12 - U = 12 - 1 = 11.
11, 17, 23, 29, 35, 41 are the upper class limits.
Combining steps 4 and 5, one can construct the following classes.
Class limits: 6-11, 12-17, 18-23, 24-29, 30-35, 36-41

Step 6: Find the class boundaries.
E.g. for class 1: lower class boundary = 6 - U/2 = 5.5
upper class boundary = 11 + U/2 = 11.5

Then continue adding w to both boundaries to obtain the remaining boundaries. By doing so, one
obtains the following classes.

Class boundaries: 5.5-11.5, 11.5-17.5, 17.5-23.5, 23.5-29.5, 29.5-35.5, 35.5-41.5

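Steps 7 and 8: Tally the data and write the numeric values of the tallies in the frequency column.
Steps 9 and 10: Find the cumulative, relative and/or relative cumulative frequencies.
The complete frequency distribution is given below.

| Class limit | Class boundary | Class mark | Tally | Freq. | Cf (less than type) | Cf (more than type) | rf | rcf (less than type) |
|---|---|---|---|---|---|---|---|---|
| 6-11 | 5.5-11.5 | 8.5 | // | 2 | 2 | 20 | 0.10 | 0.10 |
| 12-17 | 11.5-17.5 | 14.5 | // | 2 | 4 | 18 | 0.10 | 0.20 |
| 18-23 | 17.5-23.5 | 20.5 | ///// // | 7 | 11 | 16 | 0.35 | 0.55 |
| 24-29 | 23.5-29.5 | 26.5 | //// | 4 | 15 | 9 | 0.20 | 0.75 |
| 30-35 | 29.5-35.5 | 32.5 | /// | 3 | 18 | 5 | 0.15 | 0.90 |
| 36-41 | 35.5-41.5 | 38.5 | // | 2 | 20 | 2 | 0.10 | 1.00 |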
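
Steps 7 to 10 can be sketched in Python as follows. The sketch takes the class limits found in step 5 as given and computes the class marks, frequencies, both types of cumulative frequency, and the relative frequencies; the variable names are arbitrary.

```python
from itertools import accumulate

data = [11, 29, 6, 33, 14, 31, 22, 27, 19, 20,
        18, 17, 22, 38, 23, 21, 26, 34, 39, 27]
limits = [(6, 11), (12, 17), (18, 23), (24, 29), (30, 35), (36, 41)]

n = len(data)
# Steps 7 and 8: count how many observations fall inside each pair of class limits.
freq = [sum(lo <= x <= up for x in data) for lo, up in limits]
marks = [(lo + up) / 2 for lo, up in limits]        # class marks (midpoints)
cf_less = list(accumulate(freq))                    # step 9: less-than type
cf_more = list(accumulate(freq[::-1]))[::-1]        # step 9: more-than type
rf = [f / n for f in freq]                          # step 10: relative frequency
rcf_less = [c / n for c in cf_less]                 # step 10: relative cumulative frequency

for (lo, up), m, f, cl, cm, r, rc in zip(limits, marks, freq, cf_less,
                                         cf_more, rf, rcf_less):
    print(f"{lo:>2}-{up:<2} {m:>5} {f:>2} {cl:>3} {cm:>3} {r:>5.2f} {rc:>5.2f}")
```

The more-than cumulative frequencies are obtained by accumulating the frequencies from the last class backwards, which mirrors the definition of cumulative frequency above.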

## Diagrammatic presentation of data

These are techniques for presenting data in visual displays using geometric figures and pictures.
Importance:
- They have greater attraction.
- They facilitate comparison.
- They are easily understandable.
- They have greater memorizing value than mere figures.

Diagrams are appropriate for presenting discrete data. The choice of the particular form among
the different possibilities will depend on personal choices and/or the type of the data.

The most commonly used diagrammatic presentation for discrete as well as qualitative data are:
Pie charts, Bar charts and Pictogram.

Pie chart: A pie chart is a circle that is divided into sections or wedges according to the
percentage of frequencies in each category of the distribution. It shows the
distribution of the values of a variable (absolute or relative) as slices of a pie, and
the frequency determines the size of each slice.

The proportion of a category can be expressed either as a percentage or as an angle:
degree of central angle of a category = (amount of the category / total amount) × 360°, and
percentage of a category = (frequency of the category / total frequency) × 100%.

| Category | Men | Women | Girls | Boys |
|---|---|---|---|---|
| Frequency | 2500 | 2000 | 4000 | 1500 |

Solutions:
Step 1: Find the percentage and/ or degree for each class.
Step 2: Using a protractor and compass, graph each section and write its name and the
corresponding percentage.

| Class | Frequency | Percent | Degree |
|---|---|---|---|
| Men | 2500 | 25 | 90 |
| Women | 2000 | 20 | 72 |
| Girls | 4000 | 40 | 144 |
| Boys | 1500 | 15 | 54 |
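
The percentage and angle calculations of Step 1 can be sketched in Python. The sketch applies the two formulas for the proportion of a category to the four frequencies of this example; the dictionary names are arbitrary.

```python
# Frequencies of the four categories in the example.
counts = {"Men": 2500, "Women": 2000, "Girls": 4000, "Boys": 1500}
total = sum(counts.values())

# percentage = frequency / total * 100; central angle = frequency / total * 360
percent = {name: f * 100 / total for name, f in counts.items()}
degrees = {name: f * 360 / total for name, f in counts.items()}

for name in counts:
    print(f"{name:<6} {counts[name]:>5} {percent[name]:>5.0f}% {degrees[name]:>5.0f} degrees")
```

As a check, the percentages should add up to 100 and the angles to 360.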

[Figure: pie chart with slices Men 25%, Women 20%, Girls 40%, Boys 15%]

Bar Charts: Bar charts are used to represent and compare the frequency distribution of discrete
variables and attributes or categorical series. When data are represented using a bar chart, all the bars
must have equal width and the distance between bars must be equal, while the length of each bar varies in
proportion to the size (frequency) of the item. The height of the bars represents frequencies and the base
represents the categories. Bars can be drawn either vertically or horizontally. There are different types
of bar charts. The most common are simple bar charts, component bar charts and multiple bar charts.

i. Simple Bar Chart: It is a one-dimensional chart in which the bar represents the whole of the
magnitude. The height or length of each bar indicates the size (frequency) of the figure represented.

They are thick lines (narrow rectangles) having the same breadth. The magnitude of a quantity is
represented by the height /length of the bar.

Example: The following data represent sales by product, 1957 to 1959, of a given company for three
products A, B and C.

| Product | Sales in 1957 (\$) | Sales in 1958 (\$) | Sales in 1959 (\$) | Total |
|---|---|---|---|---|
| A | 12 | 14 | 18 | 44 |
| B | 24 | 21 | 18 | 63 |
| C | 24 | 35 | 54 | 113 |
| Total | 60 | 70 | 90 | 220 |
Solution:

[Figure: simple bar chart "Sales by Product Type"; horizontal axis: Product (A, B, C); vertical axis: Sales]

[Figure: simple bar chart "Sales by Year"; horizontal axis: Year (1957, 1958, 1959); vertical axis: Sales]
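
The bar heights for the two simple bar charts are the row and column totals of the sales table. The following Python sketch recomputes them directly from the nine individual entries of the table and prints a rough text version of the chart by product; the scale of one `#` per 4 dollars is an arbitrary choice.

```python
# Sales (in dollars) by product and year, taken from the individual table entries.
sales = {
    "A": {1957: 12, 1958: 14, 1959: 18},
    "B": {1957: 24, 1958: 21, 1959: 18},
    "C": {1957: 24, 1958: 35, 1959: 54},
}
years = [1957, 1958, 1959]

# Height of each bar in the chart by product: total sales per product.
by_product = {p: sum(row.values()) for p, row in sales.items()}
# Height of each bar in the chart by year: total sales per year.
by_year = {y: sum(sales[p][y] for p in sales) for y in years}

# Rough horizontal text bars, one '#' per 4 dollars.
for p, total in by_product.items():
    print(f"{p} {'#' * (total // 4):<30} {total}")
print(by_year)
```

Recomputing the totals from the entries is a simple way to check a table before drawing a chart from it.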

ii. Component Bar Chart: When there is a desire to show how a total (or aggregate) is divided
into its component parts, we use a component bar chart. The bars represent the total value of a variable,
with each total broken into its component parts, and different colors or designs are used for
identification. This is done by dividing the bars into parts representing the components.
Example: Draw a component bar chart to represent sales in dollars by product type from 1957 to
1959.
Solution: [Figure: component bar chart of sales by product type, 1957 to 1959]
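
To draw a component bar chart, each bar is built by stacking its components, so each segment starts where the previous one ends. The Python sketch below computes those segment positions for the sales example; the stacking order A, B, C is an arbitrary assumption.

```python
from itertools import accumulate

# Sales (in dollars) by product and year, from the example table.
sales = {
    "A": {1957: 12, 1958: 14, 1959: 18},
    "B": {1957: 24, 1958: 21, 1959: 18},
    "C": {1957: 24, 1958: 35, 1959: 54},
}
products = ["A", "B", "C"]          # stacking order, bottom to top
years = [1957, 1958, 1959]

segments = {}
for y in years:
    heights = [sales[p][y] for p in products]
    tops = list(accumulate(heights))        # running total = top of each segment
    bottoms = [0] + tops[:-1]               # each segment starts at the previous top
    segments[y] = list(zip(products, bottoms, tops))

for y in years:
    print(y, segments[y])
```

The top of the last segment in each bar equals the total for that year, which is what distinguishes a component bar chart from a multiple bar chart.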

iii. Multiple Bar Chart: In this type of chart the component figures are shown as separate bars adjoining
each other. The height of each bar represents the actual value of the component figure. It is used when
the distributional pattern of more than one variable is depicted and comparisons between components are desired.

Example: Draw a multiple bar chart to represent the sales by product from 1957 to 1959.

Solution: [Figure: multiple bar chart of sales by product, 1957 to 1959]

## Graphical Presentation of data

A graphic presentation of the data in a table is more likely to get the attention of the casual
observer, and it shows trends or relationships that might be overlooked in a table. Graphs are among
the most commonly used devices for presenting statistical data. The histogram, frequency polygon and
cumulative frequency graph (ogive) are the most commonly applied graphical representations for
continuous data.

To construct these graphs:
- Draw and label the X and Y axes.
- Choose a suitable scale for the frequencies or cumulative frequencies and label it on the Y axis.
- Represent the class boundaries (for the histogram or ogive) or the midpoints (for the frequency
  polygon) on the X axis.
- Plot the points and draw the bars or lines to connect the points.
Histogram: A graph which displays the data by using vertical bars of various heights to
represent frequencies. Class boundaries are placed along the horizontal axis. The heights of the
bars correspond to the frequency values, and the bars are drawn adjacent to each other (without
gaps). It differs from a bar chart in that there is a numerical scale on the horizontal axis.
Example: Construct a histogram to represent the following data.

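| Class limit | Class boundary | Class mark | Tally | Freq. | Cf (less than type) | Cf (more than type) | rf | rcf (less than type) |
|---|---|---|---|---|---|---|---|---|
| 6-11 | 5.5-11.5 | 8.5 | // | 2 | 2 | 20 | 0.10 | 0.10 |
| 12-17 | 11.5-17.5 | 14.5 | // | 2 | 4 | 18 | 0.10 | 0.20 |
| 18-23 | 17.5-23.5 | 20.5 | ///// // | 7 | 11 | 16 | 0.35 | 0.55 |
| 24-29 | 23.5-29.5 | 26.5 | //// | 4 | 15 | 9 | 0.20 | 0.75 |
| 30-35 | 29.5-35.5 | 32.5 | /// | 3 | 18 | 5 | 0.15 | 0.90 |
| 36-41 | 35.5-41.5 | 38.5 | // | 2 | 20 | 2 | 0.10 | 1.00 |

[Figure: histogram of the frequencies in the table, with class boundaries on the horizontal axis]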
Frequency Polygon: A line graph of class frequencies against the midpoints of the classes. The
frequency is placed along the vertical axis, the class midpoints are placed along the horizontal
axis, and the plotted points are connected with lines.
Example: Draw a frequency polygon for the above data.

[Figure: frequency polygon; frequencies 2, 2, 7, 4, 3, 2 (vertical axis) plotted against class midpoints 8.5, 14.5, 20.5, 26.5, 32.5, 38.5 (horizontal axis)]

Ogive (cumulative frequency polygon): A line graph that represents the cumulative frequencies
(less than or more than type) plotted against the upper or lower class boundaries respectively. That is,
class boundaries are plotted along the horizontal axis and the corresponding cumulative
frequencies are plotted along the vertical axis. The points are joined by a freehand curve.
To construct an ogive:

- Compute the less-than and more-than cumulative frequencies of the distribution.
- Prepare a graph with the cumulative frequency on the vertical axis and the true class
  limits (class boundaries) scaled along the horizontal axis (X-axis).
- Mark with a dot the point where each class boundary meets its cumulative frequency.
- Connect the points using a line (curve).
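
The points to be plotted for the two ogives can be sketched in Python. The sketch uses the class boundaries and frequencies of the grouped example above (frequencies 2, 2, 7, 4, 3, 2); adding a starting point with cumulative frequency 0 at the first boundary, and a matching end point for the more-than curve, is an assumption made so that both curves touch the horizontal axis.

```python
from itertools import accumulate

# Class boundaries and frequencies of the grouped frequency distribution.
boundaries = [5.5, 11.5, 17.5, 23.5, 29.5, 35.5, 41.5]
freq = [2, 2, 7, 4, 3, 2]
n = sum(freq)

# Less-than ogive: cumulative frequency below each boundary, starting from 0.
cf_less = [0] + list(accumulate(freq))
less_than = list(zip(boundaries, cf_less))

# More-than ogive: observations at or above each boundary.
more_than = [(b, n - c) for b, c in less_than]

print(less_than)
print(more_than)
```

The less-than curve rises from 0 to the total frequency, while the more-than curve falls from the total frequency to 0.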
