
PROBABILITY AND STATISTICS LECTURE NOTE

1. Basic concepts, methods of data collection and presentation

1.1. Introduction
1.1.1 Definitions and classification of Statistics
Definition
Statistics can be defined in two senses:
a) Statistics in its plural sense: Statistics refers to numerical facts, figures or quantitative information that
describe aspects of social and economic phenomena. Statistics are the raw data themselves, like
statistics of births, statistics of deaths, statistics of imports and exports, etc.
b) Statistics in its singular sense: Statistics as a branch of scientific method deals with the planning and
design of data collection, organization, presentation, analysis and interpretation and drawing conclusions
based on the data.
Classification
Statistics can be divided into two broad areas.
1. Descriptive Statistics is concerned with summarizing or describing important features of the available
data without going beyond the data themselves. It is concerned with summary calculations, graphs,
charts and tables.
2. Inferential Statistics is a method used to generalize from a sample to a population. It involves the use of
data from samples to make inferences about the population from which the samples are drawn.
For example, the average income of all families (the population) in Ethiopia can be estimated from
figures obtained from a few hundred (the sample) families.
Statistical techniques based on probability theory are required.

1.1.2. Stages in statistical investigation


The stages or steps in any statistical investigation are
1. Collection of data: The process of measuring, gathering and assembling the raw data upon which the
statistical investigation is to be based. Data can be collected in a variety of ways. For example, one of the
most common methods is the survey. A survey can itself be carried out in different ways, such as by
questionnaire or interview.
2. Organization of data: Summarization of data in some meaningful way. Organization of data may involve
Editing, coding and classification of the collected data.

3. Presentation of the data: In this stage the collected and organized data are presented with some
systematic order to facilitate statistical analysis. The organized data are presented with the help of tables,
diagrams and graphs.
4. Analysis of data: The process of extracting numerical descriptions of the data, mainly through the use of
elementary mathematical operations (such as computing the mean and standard deviation).
5. Interpretation of data: This involves giving meaning to the analyzed data and drawing conclusions.
Statistical techniques based on probability theory are required.

1.1.3. Definitions of some terms


A (statistical) population is the complete set of possible measurements for which inferences are to be
made. The population represents the target of an investigation, and the objective of the investigation is to
draw conclusions about the population; hence we sometimes call it the target population.
Examples
• Population of trees under specified climatic conditions
• Population of animals fed a certain type of diet
• Population of farms having a certain type of natural fertility
• Population of households, etc.

The population could be finite or infinite (an imaginary collection of units)


There are two ways of investigation: Census and sample survey.
Census: a complete enumeration of the population. In most real problems a census cannot be carried out, so
we take a sample instead.
Sample: A sample from a population is the set of measurements that are actually collected in the course of
an investigation. It should be selected using some pre-defined sampling technique in such a way that it
represents the population well. A sample is a subset of the population.
Parameter: Characteristic or measure obtained from a population.
Statistic: Characteristic or measure obtained from a sample.
Sampling: The process or method of sample selection from the population.
Sample size: The number of elements or observation to be included in the sample.

1.1.4. Applications, Uses and Limitations of statistics.
Applications of statistics:
• In almost all fields of human endeavor
• In daily life, where almost everyone encounters numerical facts
• In particular processes, e.g. the development of certain drugs or measuring the extent of environmental pollution
• In industry, especially in quality control

Uses of statistics
The main function of statistics is to enlarge our knowledge of complex phenomena. Some uses of statistics:
• It presents facts in a definite and precise form.
• Data reduction.
• Measuring the magnitude of variations in data.
• Furnishing a technique of comparison.
• Estimating unknown population characteristics.
• Formulating and testing hypotheses.
• Studying the relationship between two or more variables.
• Forecasting future events.

Limitations of statistics
As a science statistics has its own limitations.
Some of the limitations:
• It deals only with aggregates of facts and not with individual data items.
• Statistical data are only approximately, not mathematically, correct.
• Statistics can easily be misused and therefore should be used by experts.

1.1.5 Types of variables and measurement scales


Variable: It is an attribute or characteristic that can assume different values.
Variables are divided into two types: qualitative and quantitative.
1. Qualitative variables are nonnumeric variables and cannot be measured numerically.

Examples: gender, religious affiliation, and state of birth.


2. Quantitative Variables are numerical variables and can be measured. Examples include balance in
checking account, number of children in family.

Note that quantitative variables are either discrete or continuous
Discrete variable: It assumes a finite or countable number of possible values. It is usually obtained by
counting.
Example: number of children in a family, number of cars at a traffic light
Continuous variable: It can assume any value within the defined range. Continuous variables are usually
obtained by measuring. Example: weight in kg, height, time, air pressure in a tire

Measurement scales
Proper knowledge about the nature and type of data to be dealt with is essential in order to specify and apply
the proper statistical method for their analysis and inferences. Measurement scale refers to the property of
value assigned to the data based on the properties of order, distance and fixed zero.
Order
The property of order exists when an object that has more of the attribute than another object, is given a
bigger number by the rule system.
Distance
The property of distance is concerned with the relationship of differences between objects. If a measurement
system possesses the property of distance, the unit of measurement means the same thing
throughout the scale of numbers. More precisely, an equal difference between two numbers reflects an equal
difference in the "real world" between the objects that were assigned the numbers.

Fixed zero (true zero)


True zero is related to the property of absolute absence of characteristic under consideration.
The property of fixed zero (true zero) is necessary for ratios between numbers to be meaningful.

Scale types
Four levels of measurement scales are commonly distinguished: nominal, ordinal, interval, and ratio. Each
possesses a different combination of the three properties above.
Nominal Scales
Nominal scales are measurement systems that possess none of the three properties stated above.
• Level of measurement which classifies data into mutually exclusive, all-inclusive categories in which
no order or ranking can be imposed on the data.
• No arithmetic or relational operations can be applied.

Examples:
• Sex (Male or Female)
• Marital status (married, single, widowed, divorced)
• Country code
• Regional differentiation of Ethiopia

Ordinal Scales
Ordinal scales are measurement systems that possess the property of order, but not the property of distance.
The property of fixed zero is not important if the property of distance is not satisfied.
• Level of measurement which classifies data into categories that can be ranked. Differences between
the ranks are not meaningful.
• Arithmetic operations are not applicable, but relational operations are applicable.
• Ordering is the sole property of the ordinal scale.

Example: Rating scales (Excellent, Very good, Good, Fair, poor), Military status.

Interval Scales
Interval scales are measurement systems that possess the properties of order and distance, but not the
property of fixed zero.
• Level of measurement which classifies data that can be ranked and in which differences are meaningful.
However, there is no meaningful zero, so ratios are meaningless.
• All arithmetic operations except division are applicable.
• Relational operations are also possible.

Examples: Temperature in degrees Celsius (°C) or Fahrenheit (°F); your score on an individual intelligence
test as a measure of your intelligence.
A temperature of 0°C does not mean that there is no temperature. Furthermore, a temperature of 30°C in
town X on a specific day is not twice as warm as 15°C on another day in the same town.

Ratio Scales
Ratio scales are measurement systems that possess all three properties: order, distance, and fixed zero. The
added power of a fixed zero allows ratios of numbers to be meaningfully interpreted; e.g. the ratio of one
person's height to another person's height is 1.32, whereas this is not possible with interval scales.

• Level of measurement which classifies data that can be ranked, in which differences are meaningful, and
which has a true zero. True ratios exist between the different units of measure.
• All arithmetic and relational operations are applicable.
Examples: Weight, Height, Number of students, Age
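
The four scale types can be summarized by which of the three properties (order, distance, fixed zero) each one possesses. The following Python sketch simply encodes that summary as a lookup table; the names `SCALE_PROPERTIES` and `classify_scale` are illustrative and do not come from the notes.

```python
# Which of the three properties each scale possesses, as stated in the text.
SCALE_PROPERTIES = {
    "nominal":  {"order": False, "distance": False, "fixed_zero": False},
    "ordinal":  {"order": True,  "distance": False, "fixed_zero": False},
    "interval": {"order": True,  "distance": True,  "fixed_zero": False},
    "ratio":    {"order": True,  "distance": True,  "fixed_zero": True},
}


def classify_scale(order, distance, fixed_zero):
    """Return the scale whose properties match the flags, or None if none does."""
    wanted = {"order": order, "distance": distance, "fixed_zero": fixed_zero}
    for name, props in SCALE_PROPERTIES.items():
        if props == wanted:
            return name
    return None


# Temperature in degrees Celsius: ordered, equal units, but no true zero.
print(classify_scale(order=True, distance=True, fixed_zero=False))
```

Combinations that match no scale (for example distance without order) return None, which reflects that the properties build on one another.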

Exercises: Classify the following different measurement systems into one of the four types of scales.
1. Your checking account number as a name for your account.
2. Your checking account balance as a measure of the amount of money you have in that account
3. Your score on the first statistics test as a measure of your knowledge of statistics
4. A response to the statement "Abortion is a woman's right" where "Strongly Disagree" = 1, "Disagree" =
2, "No Opinion" = 3, "Agree" = 4, and "Strongly Agree" = 5, as a measure of attitude toward abortion.
5. Times for swimmers to complete a 50-meter race
6. Months of the year Meskerm, Tikimit…
7. Socioeconomic status of a family when classified as low, middle and upper classes.
8. Blood type of individuals, A, B, AB and O.
9. Pollen counts provided as numbers between 1 and 10 where 1 implies there is almost no pollen and 10 that
it is rampant, but for which the values do not represent an actual counts of grains of pollen.
10. Regions numbers of Ethiopia
11. The number of students in a college
12. The net wages of a group of workers
13. The height of the men in a town

1.2. Methods of data collection and presentation


1.2.1 Methods of data collection
The statistical data may be classified under two categories, depending upon the sources – (1) Primary data
(2) Secondary data.

Primary Data: are those data, which are collected by the investigator himself for the purpose of a specific
inquiry or study. Such data are original in character and are mostly generated by surveys conducted by
individuals or research institutions.
Secondary Data: When an investigator uses data, which have already been collected by others, such data are
called "Secondary Data".

The secondary data can be obtained from journals, reports, government publications, publications of
professionals and research organizations.

According to the role of time, data are classified into cross-section and time series data. Cross-section data are
a set of observations taken at one point in time, while time series data are a set of observations collected over a
sequence of times, usually at equal intervals, which may be weekly, monthly, quarterly, yearly, etc.

Before any statistical work can be done data must be collected. Depending on the type of variable and the
objective of the study different data collection methods can be employed. In the collection of data we have
to be systematic. If data are collected haphazardly, it will be difficult to answer our research questions in a
conclusive way.

Various data collection techniques can be used such as:


• Observation
• Using available information
• Interview (face-to-face/telephone interviews)
• Focus group discussions (FGD)
• Questionnaire (mailed and self-administered questionnaire)
• Other data collection techniques – life histories, case studies, etc.
i) Observation – It includes all methods from simple visual observation to the use of high-level machines
and measurements, and sophisticated equipment or facilities such as radiographic and X-ray machines and microscopes.
An observation guide should be prepared prior to data collection.
Advantages: Gives relatively more detailed, accurate and context-related information.
Disadvantages: The investigator's or observer's own biases, prejudices and desires can affect the results, and
the use of high-level machines requires more resources and skilled human power.
ii) Interview
Could be face to face /telephone interview
Advantage:
- suitable for use with illiterates
- permits clarifications of questions
- higher response rate than self-administered questionnaire
Disadvantage:
- presence of interviewer can influence the response
- more costly than self-administered questionnaire
iii) Questionnaire (Mailed and self-administered questionnaire)
A questionnaire is a list of questions arranged in a predetermined sequence for a predetermined purpose.
Self-administered questionnaires: under this method, the questionnaire is distributed by hand to the
respondents. The use of self-administered questionnaires is simpler and cheaper; such questionnaires can be
administered to many persons simultaneously (e.g. to a class of students).
Mailed Questionnaire Method
- The questionnaires are sent by post to the informants.
Limitations of questionnaires:
• The method can be used only if the respondents are educated.
• The response rates tend to be relatively low.
• Informants may not return the completed questionnaire, and even if they do, they may have
filled it in incorrectly.
• It may not give the investigator a chance to explain the questions or ask supplementary and follow-up
questions.
Types of questions used in a questionnaire
Depending on how questions are asked and recorded, we can distinguish two major possibilities: open-ended
questions and closed-ended questions.
a) Open-ended questions: Open-ended questions permit free responses that should be recorded in the
respondent's own words. The respondent is not given any possible answers to choose from. Such questions
are useful to obtain information on:
• Facts with which the researcher is not very familiar
• Opinions, attitudes and suggestions of informants, or sensitive issues
b) Closed-ended questions: Closed questions offer a list of possible options or answers from which the
respondents must choose. When designing closed questions one should try to:
• Offer a list of options that are exhaustive and mutually exclusive
• Keep the number of options as small as possible.

1.2.2 METHODS OF DATA PRESENTATION


The data collected in a survey is called raw data. In most cases, useful information is not immediately
evident from the mass of unsorted data. Collected data need to be organized in such a way as to condense the
information they contain in a way that will show patterns of variation clearly. Precise methods of analysis
can be decided upon only when the characteristics of the data are understood. For this purpose,
different techniques of data organization and presentation, such as ordered arrays, tables and diagrams, are used.

Statistical Tables
A statistical table is an orderly and systematic presentation of data in rows and columns. Rows are horizontal
and columns are vertical arrangements. The use of tables for organizing, for example qualitative data,
involves grouping the data into mutually exclusive categories of the variables and counting the number of
occurrences (frequency) to each category.
The simple frequency table is used when the individual observations involve only a single variable,
whereas cross tabulation is used to obtain the frequency distribution of one variable within the subsets of
another variable.
Examples:
Simple or one-way table
Table 1: Immunization status of 210 children in a certain Woreda
Immunization status      Number of children    Percent (%)
Not immunized                    75               35.7
Partially immunized              57               27.1
Fully immunized                  78               37.2
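
As a check on the percent column of Table 1, each percentage is the category count divided by the total, times 100. The short Python sketch below recomputes them from the counts in the table; the variable names are illustrative.

```python
# Counts taken from Table 1.
counts = {
    "Not immunized": 75,
    "Partially immunized": 57,
    "Fully immunized": 78,
}

total = sum(counts.values())  # 210 children in all
percents = {status: 100 * n / total for status, n in counts.items()}

for status, n in counts.items():
    print(f"{status:<22}{n:>5}{percents[status]:>8.1f}")
```

Plain rounding of 78/210 gives 37.1 rather than the 37.2 printed in Table 1; the printed value may have been adjusted so that the column adds to exactly 100, but that is a guess.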

Two-way table: This table shows two characteristics and is formed when either the row or the column is
divided into two or more parts.

Table 2: Immunization status by marital status of the women of childbearing age in a town.
                      Immunization status
Marital status    Immunized    Non-immunized    Total
Single                58            177           235
Married              156            294           450
Divorced              10             18            28
Widowed                7              7            14
Total                231            496           727
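
The row and column totals of a two-way table can be recomputed from the cell counts. The following Python sketch does this for the cells of Table 2; the dictionary layout is just one convenient choice, not something prescribed by the notes.

```python
# Cell counts from Table 2: (immunized, non-immunized) for each marital status.
table = {
    "Single": (58, 177),
    "Married": (156, 294),
    "Divorced": (10, 18),
    "Widowed": (7, 7),
}

# Row totals: one per marital status.
row_totals = {status: imm + non for status, (imm, non) in table.items()}

# Column totals: one per immunization status.
col_totals = (
    sum(imm for imm, _ in table.values()),
    sum(non for _, non in table.values()),
)

grand_total = sum(row_totals.values())
print(row_totals, col_totals, grand_total)
```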

Frequency distributions
For data to be more easily appreciated and to draw quick comparisons, it is often useful to arrange the data in
the form of a table, or in one of a number of different graphical forms.
Frequency: is the number of times a certain value of the variables is repeated in the given data. It is the
number of observations belonging to a given value or a group.

Frequency distribution: a table which contains the values and the corresponding frequencies. From the
definition, a frequency distribution has two parts, namely the values of the variable on the one hand and the
number of observations (frequency) corresponding to the values of the variable on the other.
Array (ordered array): a serial arrangement of numerical data in ascending or descending order.

Types of frequency distribution


There are two types of frequency distributions: categorical (qualitative) and numerical (quantitative).
1. Categorical frequency distribution: Here data are classified according to non-numerical categories.
To construct a categorical frequency distribution, the categories contained in the frequency
distribution must be mutually exclusive and exhaustive. In other words, an element must be counted
in one and only one category.
Example: Seniors of a high school were interviewed on their plan after completing high school. The
following data give plans of 548 seniors of a high school.
Seniors' plan                                 Number of seniors
Plan to attend college                              240
May attend college                                  146
Plan to or may attend a vocational school            57
Will not attend any school                          105
Total                                               548

2. Numerical frequency distribution: In this frequency distribution, data are classified according to
numerical size. Numerical frequency distributions are either discrete or continuous according to
whether the variable is discrete or continuous.

Continuous grouped frequency distribution:


Example: 10,392 persons were surveyed by a social scientist who wants to study the age of persons arrested
in a country. We can construct a continuous frequency distribution for this data, since age is a continuous
variable. In connection with large sets of data, a good overall picture and sufficient information can often be
conveyed by grouping the data into a number of class intervals as shown below.

Age (years)      Number of persons
Under 18               1,748
18 – 24                3,325
25 – 34                3,149
35 – 44                1,323
45 – 54                  512
55 and over              335
Total                 10,392

This kind of frequency distribution is called a grouped frequency distribution. Frequency distributions present
data in a relatively compact form, give a good overall picture, and contain information that is adequate for
many purposes, but there are usually some things which can be determined only from the original data. For
instance, the above grouped frequency distribution cannot tell how many of the arrested persons are 19 years
old, or how many are over 62.

Some terminologies used in a continuous grouped frequency distribution


Class frequency (f): the number of observations belonging to a class.
Class limits: the lowest (called lower class limit, LCL) and highest (called upper class limit, UCL) values
that can be included in a class.
Unit of measurement (U): the distance between two possible consecutive measures. It is usually taken as 1,
0.1, 0.01, 0.001, and so on.
Class boundaries: the values that fall halfway between the class limits of adjacent classes. The
boundaries have one more decimal place than the raw data and therefore do not appear in the data. Each
class has a lower class boundary (LCB) and an upper class boundary (UCB).
Then UCB = UCL + ½*U and LCB = LCL – ½*U.
Class mark (class midpoint, mi): the value located halfway between the lower and upper class limits of
that class. The class mark of the ith class, denoted by mi, is
mi = ½*(LCL + UCL) = ½*(LCB + UCB).
Class width (class size-w): is the difference between the upper and lower class boundaries of the class, that
is, w = UCB – LCB. It is also the difference between the lower limits of any two consecutive classes or the
difference between any two consecutive class marks.

Cumulative frequencies: when frequencies of two or more classes are added up, such total frequencies are
called cumulative frequencies. These frequencies help us to find the total number of items whose values are
less than or greater than some value.
More than cumulative frequency: it is the total frequency of all values greater than or equal to the lower
class boundary of a given class.
Less than Cumulative frequency: it is the total frequency of all values less than or equal to the upper class
boundary of a given class.
Relative frequency: it is the frequency of each value or class divided by the total frequency

Steps in the construction of a grouped continuous frequency distribution:

• Determine the number of classes to use, preferably between 5 and 20. The approximate number of
classes (K) can be obtained from Sturges' formula:
K = 1 + 3.322×log(n), where n is the number of observations and log is the base-10 logarithm.
• Determine the class size (class width) as:
W = (Maximum value – Minimum value)/K = Range/K.
• Pick a suitable starting point less than or equal to the minimum value. The starting point is called the
lower limit of the first class. Continue to add the class width to this lower limit to get the rest of the
lower limits.
• To find the upper limit of the first class, subtract U from the lower limit of the second class. Then
continue to add the class width to this upper limit to find the rest of the upper limits.
• Find the boundaries by subtracting U/2 units from the lower limits and adding U/2 units to the
upper limits.
• Find the frequency and relative frequency of each class.
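
The first four steps can be sketched in Python as follows. This is a minimal illustration, assuming integer-valued data with U = 1 and a class width rounded up to the next whole number; the function name `class_limits` and its parameters are hypothetical, and the inputs (minimum 10, maximum 44, n = 80) are those of the leisure-time example that follows.

```python
import math


def class_limits(minimum, maximum, n, unit=1):
    """Return (k, width, limits) where limits is a list of (LCL, UCL) pairs.

    Assumes integer-valued data; the width is rounded up to a whole number.
    """
    k = round(1 + 3.322 * math.log10(n))        # Sturges' formula
    width = math.ceil((maximum - minimum) / k)  # Range / K, rounded up
    limits = []
    lower = minimum                             # starting point = minimum value
    while lower <= maximum:
        # The upper limit is one unit below the next class's lower limit.
        limits.append((lower, lower + width - unit))
        lower += width
    return k, width, limits


k, width, limits = class_limits(10, 44, 80)
print(k, width)
print(limits)
```

Because the width is rounded, the number of classes actually produced can differ slightly from K for other data sets; here both are 7.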

Example: Construct a grouped frequency distribution of the following data on the amount of time (in hours)
that 80 college students devoted to leisure activities during a typical school week:
23 24 18 14 20 24 24 26 23 21 16 15 19 20 22 14 13 20 19 27 29 22 38 28 34 44
23 19 21 31 16 28 19 18 12 27 15 21 25 16 30 17 22 29 29 18 25 20 16 11 17 12
15 24 25 21 22 17 18 15 21 20 23 18 17 15 16 26 23 22 11 16 18 20 23 19 17 15
20 10

Solution:
Using the above formula: K = 1 + 3.322 × log(80) = 7.32 ≈ 7 classes. Maximum value = 44 and minimum
value = 10, so Range = 44 – 10 = 34 and class width W = 34/7 = 4.857 ≈ 5.
Let 10 be the lower limit of the first class. That is, LCL1 = 10, LCL2 = 10 + W = 10 + 5 = 15, etc.
10, 15, 20, 25, 30, 35, and 40 are the lower class limits.
Find the upper class limits; e.g. the first upper class limit is UCL1 = 15 – U = 15 – 1 = 14,
UCL2 = 14 + W = 14 + 5 = 19, etc.
14, 19, 24, 29, 34, 39, and 44 are the upper class limits.
Time spent (hours) Frequency
10 – 14 8
15 – 19 28
20 – 24 27
25 – 29 12
30 – 34 3
35 – 39 1
40 – 44 1

The class boundaries are calculated by: UCB = UCL + ½*U and LCB = LCL – ½*U.
Example: consider the above example and determine the class boundaries.
UCB1 = UCL1 + ½*(U=1)=14 +1/2 = 14.5 and LCB1 = LCL1 - ½*(U=1) =10 - 1/2 = 9.5 etc.
The class marks are also calculated as: m1 = ½*(UCL1 +LCL1) = ½*(UCB1 + LCB1) = 12.
m2 = ½*(UCL2 +LCL2) = 17, etc.
So, the complete frequency distribution table with cumulative frequencies is as follows.
Class      Class          Class mark   Frequency   Relative    Less than   More than
limits     boundaries     (mi)         (fi)        frequency   cf          cf
10 – 14    9.5 – 14.5     12            8          0.1          8          80
15 – 19    14.5 – 19.5    17           28          0.35        36          72
20 – 24    19.5 – 24.5    22           27          0.3375      63          44
25 – 29    24.5 – 29.5    27           12          0.15        75          17
30 – 34    29.5 – 34.5    32            3          0.0375      78           5
35 – 39    34.5 – 39.5    37            1          0.0125      79           2
40 – 44    39.5 – 44.5    42            1          0.0125      80           1
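
The whole table can be rebuilt from the 80 raw observations. The Python sketch below tallies each class and derives the boundaries, class marks, relative frequencies and both cumulative frequencies; it assumes U = 1 and the class limits found above, and the variable names are illustrative.

```python
# The 80 leisure-time observations (hours) from the example.
data = [
    23, 24, 18, 14, 20, 24, 24, 26, 23, 21, 16, 15, 19, 20, 22, 14, 13, 20, 19, 27,
    29, 22, 38, 28, 34, 44, 23, 19, 21, 31, 16, 28, 19, 18, 12, 27, 15, 21, 25, 16,
    30, 17, 22, 29, 29, 18, 25, 20, 16, 11, 17, 12, 15, 24, 25, 21, 22, 17, 18, 15,
    21, 20, 23, 18, 17, 15, 16, 26, 23, 22, 11, 16, 18, 20, 23, 19, 17, 15, 20, 10,
]
limits = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39), (40, 44)]
unit = 1  # U, the unit of measurement
n = len(data)

rows = []
less_cf = 0
for lcl, ucl in limits:
    f = sum(lcl <= x <= ucl for x in data)  # class frequency
    more_cf = n - less_cf                   # observations in this class or above
    less_cf += f                            # observations in this class or below
    rows.append({
        "limits": (lcl, ucl),
        "boundaries": (lcl - unit / 2, ucl + unit / 2),
        "mark": (lcl + ucl) / 2,
        "f": f,
        "rel": f / n,
        "less_cf": less_cf,
        "more_cf": more_cf,
    })

for r in rows:
    print(r["limits"], r["boundaries"], r["mark"], r["f"],
          round(r["rel"], 4), r["less_cf"], r["more_cf"])
```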

Diagrammatic and graphical presentation of Data
An appropriately drawn graph or diagram allows readers to grasp the overall picture of the data rapidly.
The relationship between numbers of various magnitudes can usually be seen more quickly and easily from a
graph or diagram than from a table.

• Bar charts and pie charts are commonly used diagrammatic presentations for qualitative data.
• Histograms, frequency polygons and ogive curves are graphical presentations of quantitative
continuous data.

Type of Diagrams
1) Bar Chart:
There are different types of bar charts; the most important ones are the simple bar chart, the component bar
chart and the multiple bar chart.
a) Simple bar chart: It is a one-dimensional chart in which the bar represents the whole of the
magnitude. The height or length of each bar indicates the size (frequency) of the figure represented.
Consider the data on immunization status of children (Table 1)
[Figure: simple bar chart of the number of children by immunization status: not immunized 75, partially immunized 57, fully immunized 78]
Fig. 1. Immunization status

b) Component Bar chart: Bars are sub-divided into component parts of the figure. These sorts of
diagrams are constructed when each total is built up from two or more component figures. This is
done by dividing the bars into parts representing the components and shading them accordingly.

Consider the data on immunization status of women by marital status (table 2)
[Figure: component bar chart with one bar per marital status (single, married, divorced, widowed), each divided into immunized and non-immunized parts; x-axis: Marital status]
Fig. 2. Immunization status by marital status of women 15-49 years

c) Multiple bar chart: In this type of chart the component figures are shown as separate bars
adjoining each other. The height of each bar represents the actual frequency of the component figure.
It is used when more than one variable is shown and the components are to be compared with
each other.

Example of multiple bar chart: consider the data on immunization status of women by marital status.
[Figure: multiple bar chart with adjoining immunized and non-immunized bars for each marital status (single 58 and 177, married 156 and 294, divorced 10 and 18, widowed 7 and 7); x-axis: Marital status]
Fig. 3. Immunization status by marital status of women 15-49 years

2) Pie chart: a circle representing categorical data, divided into sectors whose central angles are
proportional to the amount associated with each category, the whole circle being 360°. The share of each
category can be expressed either as a percentage or as an angle.
That is, the central angle of a category = (amount of the category / total amount) × 360°, and the
percentage of a category = (frequency of the category / total frequency) × 100%.
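
As an illustration of the angle formula, the following Python sketch computes the central angle of each sector for the immunization data of Table 1. It is a small worked example with illustrative variable names, not part of the original notes.

```python
# Counts from Table 1 (immunization status of 210 children).
counts = {
    "Not immunized": 75,
    "Partially immunized": 57,
    "Fully immunized": 78,
}
total = sum(counts.values())

# Central angle = (amount of the category / total amount) * 360 degrees.
angles = {status: 360 * n / total for status, n in counts.items()}

for status, angle in angles.items():
    print(f"{status:<22}{angle:>7.1f} degrees")
```

The angles necessarily add up to 360°, which is a quick check on the arithmetic.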

[Figure: pie chart with sectors NI (not immunized) 36%, PI (partially immunized) 27%, FI (fully immunized) 37%]
Fig. 4. Immunization status of children


Type of Graphs
The following are the most commonly used graphical presentations of data.
1) Histograms: A histogram is the graph of the frequency distribution of continuous measurement variables.
It is constructed on the basis of the following principles:
a) The horizontal axis is a continuous scale running from one extreme end of the distribution to the other. It
should be labeled with the name of the variable and the units of measurement.
b) For each class in the distribution a vertical rectangle is drawn with (i) its base on the horizontal axis
extending from one class boundary of the class to the other class boundary, there will never be any gap
between the histogram rectangles. (ii) the bases of all rectangles will be determined by the width of the
class intervals. If a distribution with unequal class-interval is to be presented by means of a histogram, it
is necessary to make adjustment for varying magnitudes of the class intervals.

Example: Consider the data on time (in hours) that 80 college students devoted to leisure activities during a
typical school week. Draw the histogram

2) Frequency Polygon: If we join the midpoints of the tops of the adjacent rectangles of the histogram with
line segments, a frequency polygon is obtained. When the polygon is continued to the X-axis just beyond the
first and last classes, the total area under the polygon equals the total area under the histogram.
Example: Consider the above data on time spent on leisure activities.
[Figure: frequency polygon with frequencies 8, 28, 27, 12, 3, 1, 1 plotted against the class marks, closed on the horizontal axis at both ends; horizontal axis from 0 to 45]
Fig 5: Frequency polygon curve on time spent for leisure activities by students

3) Ogive or Cumulative Frequency Curve: When the cumulative frequencies of a distribution are graphed,
the resulting curve is called an ogive. Ogives are of two types, namely the "less than" ogive and the "more
than" ogive.
Less than ogive: the "less than" cumulative frequencies are plotted against the upper class
boundaries of their respective classes, and adjacent points are joined by lines.
More than ogive: the "more than" cumulative frequencies, scaled on the Y-axis, are plotted
against the lower class boundaries of their respective classes, scaled on the X-axis, and adjacent points are
joined by lines.
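
The points of both ogives can be generated from the class boundaries and frequencies. The Python sketch below does this for the leisure-time distribution (boundaries 9.5 to 44.5, frequencies 8, 28, 27, 12, 3, 1, 1); it only computes the coordinates and leaves the plotting to whichever tool is preferred. Starting the "less than" curve at zero and ending the "more than" curve at zero is a common convention assumed here.

```python
# Class boundaries and frequencies of the leisure-time distribution.
boundaries = [9.5, 14.5, 19.5, 24.5, 29.5, 34.5, 39.5, 44.5]
freqs = [8, 28, 27, 12, 3, 1, 1]
n = sum(freqs)

# "Less than" ogive: cumulative frequency against upper class boundaries,
# starting from 0 at the first lower boundary.
less_than = [(boundaries[0], 0)]
running = 0
for ucb, f in zip(boundaries[1:], freqs):
    running += f
    less_than.append((ucb, running))

# "More than" ogive: remaining frequency against lower class boundaries,
# ending at 0 at the last upper boundary.
more_than = []
remaining = n
for lcb, f in zip(boundaries[:-1], freqs):
    more_than.append((lcb, remaining))
    remaining -= f
more_than.append((boundaries[-1], 0))

print(less_than)
print(more_than)
```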
Example: Consider the above data on time spent on leisure activities.
[Figure: "less than" ogive (0, 8, 36, 63, 75, 78, 79, 80) and "more than" ogive (80, 72, 44, 17, 5, 2, 1, 0) plotted against the class boundaries 9.5, 14.5, 19.5, 24.5, 29.5, 34.5, 39.5, 44.5]
Fig 7: Cumulative frequency curve for amount of time college students devoted to leisure activities

2. SUMMARIZING OF DATA
2.1. MEASURES OF CENTRAL TENDENCY
When we want to make comparison between groups of numbers it is good to have a single value that is
considered to be a good representative of each group. This single value is called the average of the group.
Averages are also called measures of central tendency.
Objectives
Since the number of sample points is frequently large and it is easy to lose track of the overall picture by
looking at all the data at once, the data must be summarized as briefly as possible.
Some objectives of measuring central tendency:
 To comprehend (understand) the data easily.
 To facilitate comparison.
 To make further statistical analysis.

The Summation Notation


Let $X_1, X_2, X_3, \ldots, X_n$ be a number of measurements, where $n$ is the total number of observations
and $X_i$ is the $i$th observation.
The symbol $\sum_{i=1}^{n} X_i$ (read as "the sum of $X_i$ where $i$ runs from 1 to $n$") is mathematical
shorthand for $X_1 + X_2 + X_3 + \cdots + X_n$. That is,
$$\sum_{i=1}^{n} X_i = X_1 + X_2 + \cdots + X_n$$

Example: Suppose the following were scores made on the first homework assignment for five students in the
class: 5, 7, 7, 6, and 8.
$$\sum_{i=1}^{5} X_i = X_1 + X_2 + X_3 + X_4 + X_5 = 5 + 7 + 7 + 6 + 8 = 33$$

Properties of Summation
1. $\sum_{i=1}^{n} k = nk$, where $k$ is any constant.
2. $\sum_{i=1}^{n} kX_i = k\sum_{i=1}^{n} X_i$, where $k$ is any constant.
3. $\sum_{i=1}^{n} (a + bX_i) = na + b\sum_{i=1}^{n} X_i$, where $a$ and $b$ are constants.
4. $\sum_{i=1}^{n} (X_i + Y_i) = \sum_{i=1}^{n} X_i + \sum_{i=1}^{n} Y_i$

Example: Consider the following data and determine the sums below.
Xi    5   7   7   6   8
Yi    6   7   8   7   8
a) $\sum_{i=1}^{5} X_i = 5 + 7 + 7 + 6 + 8 = 33$
b) $\sum_{i=1}^{5} Y_i = 36$
c) $\sum_{i=1}^{5} 10 = 10 \times 5 = 50$
d) $\sum_{i=1}^{5} (X_i + Y_i) = \sum_{i=1}^{5} X_i + \sum_{i=1}^{5} Y_i = 69$
e) $\sum_{i=1}^{5} (X_i - Y_i) = -3$
f) $\sum_{i=1}^{5} X_i Y_i = 241$
g) $\sum_{i=1}^{5} X_i^2 = 223$
h) $\left(\sum_{i=1}^{5} X_i\right)\left(\sum_{i=1}^{5} Y_i\right) = 1188$

Types of measures of central tendency


The different measures of central tendency are the Mean (Arithmetic, Geometric and Harmonic), the Mode,
the Median.

The Arithmetic Mean:


It is defined as the sum of the magnitude of the items divided by the number of items.
Suppose X1, X2, X3, …, Xn are n observed values in a sample of size n. Then the arithmetic mean of the sample, denoted by $\bar{X}$, is given as:
$$\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}$$

If we take an entire population, the mean is denoted by μ and is given by:
$$\mu = \frac{X_1 + X_2 + \dots + X_N}{N} = \frac{\sum_{i=1}^{N} X_i}{N},$$
where N stands for the total number of observations in the population.

Example: Suppose the sample consists of birth weights (in grams) of live born infants at a private hospital in
a certain city during a 1-week period. These sample birth weights are:
3265, 3323, 2581, 2759, 3260, 3649, 2841, 3248, 3245, 3200, 3609, 3314, 3484,
3031, 2838, 3101, 4146, 2069, 3541, 2834.
Find the arithmetic mean of the sample birth weights.
Solution: $\bar{X} = \frac{1}{20}\sum_{i=1}^{20} X_i = \frac{1}{20}(3265 + 3323 + \dots + 2834) = \frac{63338}{20} = 3166.9$ grams.
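As a quick check of this calculation, here is a short Python sketch (variable names are illustrative) that sums the 20 birth weights and divides by the sample size.

```python
# Sample birth weights in grams, as listed in the example.
weights = [3265, 3323, 2581, 2759, 3260, 3649, 2841, 3248, 3245, 3200,
           3609, 3314, 3484, 3031, 2838, 3101, 4146, 2069, 3541, 2834]

total = sum(weights)               # sum of all observations
mean_weight = total / len(weights) # arithmetic mean = sum / n

print(total, mean_weight)
```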

If X is a variable having values X1, X2, …, Xk occurring with frequencies f1, f2, …, fk respectively, then its arithmetic mean is given by:
$$\bar{X} = \frac{X_1 f_1 + X_2 f_2 + \dots + X_k f_k}{f_1 + f_2 + \dots + f_k} = \frac{\sum_{i=1}^{k} X_i f_i}{\sum_{i=1}^{k} f_i}$$

Example: Suppose the X values are 3, 5, 4, 2, 7 and 6 with corresponding frequencies of 2, 1, 3, 2, 1 and 1 respectively. Find the mean of the data.
Xi 3 5 4 2 7 6
frequency, fi 2 1 3 2 1 1

Solution: $\bar{X} = \frac{3 \times 2 + 5 \times 1 + \dots + 7 \times 1 + 6 \times 1}{2 + 1 + \dots + 1} = \frac{40}{10} = 4$.

Mean for Grouped Data


This method is applicable where the entire range of observations has been grouped into a continuous frequency distribution. In such cases the mean of the distribution is computed as:
$$\bar{X} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i},$$ where
• k is the number of classes,
• mi is the midpoint of the ith class, and
• fi is the ith class frequency.

Example: Calculate the mean for grouped data on the amount of time (in hours) that 80 college students
devoted to leisure activities during a typical school week given below:
Time spent (hours) Frequency
10 – 14 8
15 – 19 28
20 – 24 27
25 – 29 12
30 – 34 3
35 – 39 1
40 - 44 1
Solution:
• First find the class marks (midpoints).
• Find the product of each frequency and its class mark.
• Find the mean using the formula.
The class marks of the distribution are: 12, 17, 22, 27, 32, 37, 42.
Then the mean of the data is computed as:
$$\bar{X} = \frac{\sum_{i=1}^{7} m_i f_i}{\sum_{i=1}^{7} f_i} = \frac{12 \times 8 + 17 \times 28 + \dots + 42 \times 1}{8 + 28 + \dots + 1} = \frac{1665}{80} \approx 20.8 \text{ hours}.$$
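The three steps of the solution can be sketched in Python. This is a minimal illustration, assuming each class is stored as a (lower limit, upper limit, frequency) tuple; the function name `grouped_mean` is hypothetical and not part of the notes.

```python
# Each class of the leisure-time table: (lower limit, upper limit, frequency).
classes = [(10, 14, 8), (15, 19, 28), (20, 24, 27), (25, 29, 12),
           (30, 34, 3), (35, 39, 1), (40, 44, 1)]

def grouped_mean(classes):
    """Mean of grouped data: sum(m_i * f_i) / sum(f_i), with m_i the class midpoint."""
    total_freq = sum(f for _, _, f in classes)
    weighted_sum = sum((lo + hi) / 2 * f for lo, hi, f in classes)
    return weighted_sum / total_freq

midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]  # class marks
mean_hours = grouped_mean(classes)
print(midpoints, mean_hours)
```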

Special Properties of the Arithmetic Mean

1) The sum of the deviations about the mean is zero, i.e., $\sum_{i=1}^{n} (X_i - \bar{X}) = 0$.
2) The sum of the squares of deviations from the arithmetic mean is less than the sum of squares of deviations about any other value, i.e., $\sum_{i=1}^{n} (X_i - \bar{X})^2 < \sum_{i=1}^{n} (X_i - A)^2$ for any $A \neq \bar{X}$.
3) Suppose we have means $\bar{X}_1, \bar{X}_2, \bar{X}_3, \dots, \bar{X}_k$ of k groups having the same unit of measurement of a variable, based on n1, n2, n3, …, nk observations respectively. Then the mean of all the observations in all groups, often called the combined mean, is given by
$$\bar{X}_c = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2 + \dots + n_k \bar{X}_k}{n_1 + n_2 + \dots + n_k}$$

Example: The mean final exam mark of one class of 50 students is 30, and the mean mark of another class of 100 students in the same final exam is 40. What is the mean mark of all 150 students?
Solution: $\bar{X}_c = \frac{50 \times 30 + 100 \times 40}{50 + 100} = \frac{5500}{150} \approx 36.7$.

4) If a wrong figure has been used when calculating the mean, then the correct mean can be obtained without repeating the whole process using:
$$\text{Correct mean} = \text{wrong mean} + \frac{\text{correct value} - \text{wrong value}}{n},$$
where n = number of observations.

Example: The average weight of 10 students was calculated to be 65 kg. Later it was discovered that one weight was misread as 40 kg instead of 80 kg.
Calculate the correct average weight.
Correct mean = $65 + \frac{80 - 40}{10}$ = 65 + 4 = 69 kg

5) The effect of transforming original series on the mean.
a) If a constant k is added to (or subtracted from) every observation, then the new mean will be the old mean plus (or minus) k, respectively.
b) If every observation is multiplied by a constant k, then the new mean will be k times the old mean.
Example: The mean of a set of numbers is 500.
a. If 10 is added to each of the numbers in the set, then what will be the mean of the new set?
New mean = 500+10 =510
b. If each of the numbers in the set is multiplied by -5, then what will be the mean of the new set?
New mean = -5*500= -2500
Example: The mean of n observations X1, X2, …, Xn is known to be 12. A new set of observations is obtained by the linear transformation Yi = 2Xi – 0.5 (i = 1, 2, …, n). What will be the mean of the new set of observations?

Solution: New mean = 2 * old mean – 0.5 = 2*12 – 0.5 = 23.5.

Advantages of arithmetic mean

• It is based on all values.
• It is easy to calculate and simple to understand.
• It is suitable for further mathematical treatment.
• It is a stable average, i.e. it is not much affected by fluctuations of sampling.

Disadvantages of arithmetic mean

• It is affected by extreme observations.
• It cannot be used in the case of open-end classes.
• It cannot be determined by the method of inspection.
• It cannot be used when dealing with qualitative characteristics, such as intelligence, honesty, beauty.
• Sometimes it leads to a wrong conclusion if the details of the data from which it is obtained are not available.

Weighted Mean
In computing the arithmetic mean we gave equal importance to each observation. However, when averaging quantities, it is often necessary to account for the fact that not all of them are equally important in the phenomenon being described. To give the quantities being averaged their proper degree of importance, we assign them relative importance values called weights, and then calculate a weighted mean.
In general, the weighted mean $\bar{X}_w$ of a set of values X1, X2, …, Xn, whose relative importance is expressed numerically by a corresponding set of weights W1, W2, …, Wn, is given by:
$$\bar{X}_w = \frac{X_1 W_1 + X_2 W_2 + \dots + X_n W_n}{W_1 + W_2 + \dots + W_n} = \frac{\sum_{i=1}^{n} X_i W_i}{\sum_{i=1}^{n} W_i}$$

Example: A student obtained results 60, 75, 63, 59, and 55 in English, Biology, Mathematics, Physics and Chemistry examinations respectively. Find the student's weighted arithmetic mean if weights 1, 2, 1, 3, 3 respectively are allotted to the subjects.
Solution: $\bar{X}_w$ = (60*1 + 75*2 + 63*1 + 59*3 + 55*3)/(1+2+1+3+3) = 615/10 = 61.5.
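The weighted mean formula translates directly into code. The sketch below is illustrative only; `weighted_mean` is a hypothetical helper name, and the marks and weights are the ones from the example.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(X_i * W_i) / sum(W_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# English, Biology, Mathematics, Physics, Chemistry
marks = [60, 75, 63, 59, 55]
weights = [1, 2, 1, 3, 3]

result = weighted_mean(marks, weights)
print(result)
```

With all weights equal, the same function reduces to the ordinary arithmetic mean.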

The mode
The mode is the value of the observation that occurs with the greatest frequency. A particular disadvantage is that, with a small number of observations, there may be no mode. In addition, there may sometimes be more than one mode, such as when dealing with a bimodal (two-peak) distribution.
Example: Find the modal values for the following data:
(a) 1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5 (modal value = 3.0 kg).
(b) 10, 10, 9, 9, 8, 12, 15, 5 (modal value = 9 and 10). Hence, it is possible for a frequency distribution to
have more than one mode.

Note: Distributions with one mode are called unimodal, those with two modes are called bimodal, and
those with more than two modes are called multimodal.

Modal Value for Grouped data


To find the modal value for a grouped (continuous) frequency distribution, first find the modal class, which is the class with the highest frequency. Then compute the modal value using the formula:
$$\text{Mode} = L_{mo} + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times w,$$ where
Lmo = lower class boundary of the modal class;
w = the class width of the modal class;
∆1 = fmo − f1;
∆2 = fmo − f3;
fmo = frequency of the modal class;
f1 = frequency of the class immediately preceding the modal class;
f3 = frequency of the class immediately succeeding the modal class.
Example: Consider the following grouped quantitative data. Calculate the modal value of the data.

Class limit    Class boundary    Frequency
6 – 11         5.5 – 11.5        2
12 – 17        11.5 – 17.5       2
18 – 23        17.5 – 23.5       7
24 – 29        23.5 – 29.5       4
30 – 35        29.5 – 35.5       3
36 – 41        35.5 – 41.5       2

Solution: 17.5 – 23.5 is the modal class.

Lmo = 17.5, w = 6, ∆1 = fmo − f1 = 7 – 2 = 5; ∆2 = fmo − f3 = 7 – 4 = 3
$$\text{Mode} = L_{mo} + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times w = 17.5 + \frac{5}{5 + 3} \times 6 = 21.25$$
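The grouped-mode formula can be sketched in Python as follows. The function name `grouped_mode` is hypothetical. Two choices here are assumptions of this sketch rather than rules stated in the notes: when several classes tie for the highest frequency the first one is used, and a missing neighbouring class (modal class at either end) is treated as having frequency 0.

```python
def grouped_mode(boundaries, freqs):
    """Mode of grouped data: L_mo + d1 / (d1 + d2) * w.

    boundaries: list of (lower boundary, upper boundary) per class
    freqs:      list of class frequencies
    """
    # Index of the modal class (first class with the highest frequency).
    i = max(range(len(freqs)), key=freqs.__getitem__)
    f_mo = freqs[i]
    f_prev = freqs[i - 1] if i > 0 else 0               # assumption: 0 if no preceding class
    f_next = freqs[i + 1] if i < len(freqs) - 1 else 0  # assumption: 0 if no succeeding class
    lower, upper = boundaries[i]
    d1 = f_mo - f_prev
    d2 = f_mo - f_next
    return lower + d1 / (d1 + d2) * (upper - lower)

boundaries = [(5.5, 11.5), (11.5, 17.5), (17.5, 23.5),
              (23.5, 29.5), (29.5, 35.5), (35.5, 41.5)]
freqs = [2, 2, 7, 4, 3, 2]

mode_value = grouped_mode(boundaries, freqs)
print(mode_value)
```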

The Median
An alternative measure of location, perhaps second in popularity to the arithmetic mean, is the median. In a
distribution, median is the value of the variable which divides it in to two equal halves. In an ordered series
of data median is an observation lying exactly in the middle of the series. It is the middle most value in the
sense that the number of values less than the median is equal to the number of values greater than it.
Suppose there are n observations in a sample. If these observations are ordered from smallest to largest, then the sample median for ungrouped data is defined as:
(1) The $\left(\frac{n+1}{2}\right)$th observation if n is odd.
(2) The average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2} + 1\right)$th observations if n is even.

Example: Find the median of the following numbers.
(a) 6, 2, 8, 9, 4 (b) 5, 2, 1, 8, 3,7, 8, 9.
Solution: a) Ascending order: 2, 4, 6, 8, 9 (n = 5)
Median = $\left(\frac{5+1}{2}\right)$th value = 3rd value = 6
b) Ascending order: 1, 2, 3, 5, 7, 8, 8, 9 (n = 8)
Median = (4th value + 5th value)/2 = (5 + 7)/2 = 6
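The odd/even rule for the ungrouped median can be written as a short Python function. This is a sketch with an illustrative function name; Python's standard library also offers `statistics.median`, which follows the same rule.

```python
def median(data):
    """Median of ungrouped data using the odd/even position rule."""
    ordered = sorted(data)           # arrange from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the ((n + 1) / 2)th value, which is index mid (0-based)
        return ordered[mid]
    # n even: average of the (n/2)th and (n/2 + 1)th values
    return (ordered[mid - 1] + ordered[mid]) / 2

median_a = median([6, 2, 8, 9, 4])
median_b = median([5, 2, 1, 8, 3, 7, 8, 9])
print(median_a, median_b)
```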
Median for Grouped Data
For a grouped (continuous) frequency distribution, the median is calculated as:
$$\text{Median} = L + \frac{\frac{n}{2} - cf}{f} \times w,$$ where
L = lower class boundary of the median class
w = length of the interval
n = total frequency of the sample
cf = cumulative frequency preceding the median class
f = frequency of the interval containing the median.
The median class is the class with the smallest cumulative frequency (less than type) greater than or equal to $\frac{n}{2}$.

Example: Find the median for the following distribution

Class limit    Frequency    Cumulative freq. (less than type)
40 – 44        7            7
45 – 49        10           17
50 – 54        22           39
55 – 59        15           54
60 – 64        12           66
65 – 69        6            72
70 – 74        3            75

$\frac{n}{2} = \frac{75}{2} = 37.5$
39 is the first cumulative frequency to be greater than or equal to 37.5.
Therefore, 50 – 54 is the median class. L = 49.5, n = 75, w = 5, cf = 17, f = 22
Hence,
$$\text{Median} = L + \frac{\frac{n}{2} - cf}{f} \times w = 49.5 + \frac{(37.5 - 17) \times 5}{22} \approx 54.16$$
Note:
• Median is a positional average and hence not influenced by extreme observations.
• Median can be calculated in the case of open-end intervals.
• Median can be located even if the data are incomplete.

Other measures of locations (Quantiles: quartiles, deciles, percentiles)

When a distribution is arranged in order of magnitude of items, the median is the value of the middle term. Other measures that depend on position in the distribution, namely quartiles, deciles, and percentiles, are collectively called quantiles.
Quartiles: Quartiles are measures that divide the frequency distribution into four equal parts. The values of the variable corresponding to these divisions are denoted Q1, Q2, and Q3, often called the first, the second and the third quartile respectively.

Q1 is a value such that 25% of the items are less than or equal to it. Q2 has 50% of the items with values less than or equal to it, and Q3 has 75% of the items with values less than or equal to it.

The kth quartile Qk for ungrouped data is the value of the item in the $\left(\frac{k(n+1)}{4}\right)$th position, where k = 1, 2, 3 and n is the total number of observations.
The three quartiles for grouped data can be computed as follows:
• Calculate $\frac{kn}{4}$ and search for the minimum cumulative frequency which is greater than or equal to $\frac{kn}{4}$, k = 1, 2, 3.
• The class corresponding to this cumulative frequency is the kth quartile class. This is the class where Qk lies.
• Thus, $Q_k = L + \frac{w\left(\frac{kn}{4} - cf\right)}{f}$, k = 1, 2, 3, where
L = lower class boundary of the kth quartile class
n = the total number of observations
cf = the less-than cumulative frequency corresponding to the class immediately preceding the kth quartile class
w = the class width of the quartile class and
f = frequency of the kth quartile class
Deciles: Deciles are measures that divide the frequency distribution into ten equal parts. The values of the variable corresponding to these divisions are denoted D1, D2, …, D9, often called the first, the second, …, the ninth decile respectively.

To find Dk (k = 1, 2, …, 9) we count $\frac{kn}{10}$ of the observations, beginning from the lowest class.

For grouped data we have the following formula:
$$D_k = L + \frac{w\left(\frac{kn}{10} - cf\right)}{f}, \quad k = 1, 2, 3, \dots, 9,$$ where
L = lower class boundary of the kth decile class
n = the total number of observations
cf = the less-than cumulative frequency corresponding to the class immediately preceding the kth decile class
w = the class width of the decile class
f = frequency of the kth decile class
Percentiles: Percentiles are measures that divide the frequency distribution into a hundred equal parts. The values of the variable corresponding to these divisions are denoted P1, P2, …, P99, often called the first, the second, …, the ninety-ninth percentile respectively.

To find Pk (k = 1, 2, …, 99) we count $\frac{kn}{100}$ of the observations, beginning from the lowest class.
For grouped data we have the following formula:
$$P_k = L + \frac{w\left(\frac{kn}{100} - cf\right)}{f}, \quad k = 1, 2, 3, \dots, 99,$$ where
L = lower class boundary of the kth percentile class
n = the total number of observations
cf = the less-than cumulative frequency corresponding to the class immediately preceding the kth percentile class
w = the class width of the percentile class
f = frequency of the kth percentile class
Note: To compute quantiles, we first sort the data in ascending order.
Q2 = D5 = P50 = median, P25 = Q1, P75 = Q3, and $D_i = P_{10i}$, i = 1, 2, 3, …, 9.
Example: Consider the following distribution and calculate:
a) All quartiles b) The 7th decile c) The 90th percentile.
Class limit Frequency Cumulative freq.(less than type)
141 – 150 17 17
151 – 160 29 46
161 – 170 42 88
171 – 180 72 160
181 – 190 84 244
191 – 200 107 351
201 – 210 49 400
211 – 220 34 434
221 – 230 31 465
231 – 240 16 481
241 – 250 12 493
Solution: a) Quartiles
Q1: Determine the class containing the first quartile.
$\frac{n}{4} = \frac{493}{4} = 123.25$. Hence, 171 – 180 is the class containing the first quartile.
L = 170.5, n = 493, w = 10, cf = 88, f = 72
$$Q_1 = L + \frac{w\left(\frac{n}{4} - cf\right)}{f} = 170.5 + \frac{10(123.25 - 88)}{72} \approx 175.40$$

Q2: Determine the class containing the second quartile.
$\frac{2n}{4} = 246.5$. Hence, 191 – 200 is the class containing the second quartile.
L = 190.5, n = 493, w = 10, cf = 244, f = 107
$$Q_2 = L + \frac{w\left(\frac{2n}{4} - cf\right)}{f} = 190.5 + \frac{10(246.5 - 244)}{107} \approx 190.73$$

Q3: Determine the class containing the third quartile.
$\frac{3n}{4} = 369.75$. Hence, 201 – 210 is the class containing the third quartile.
L = 200.5, n = 493, w = 10, cf = 351, f = 49
$$Q_3 = L + \frac{w\left(\frac{3n}{4} - cf\right)}{f} = 200.5 + \frac{10(369.75 - 351)}{49} \approx 204.33$$
b) D7: Determine the class containing the 7th decile.
$\frac{7n}{10} = 345.1$. Hence, 191 – 200 is the class containing the seventh decile.
L = 190.5, n = 493, w = 10, cf = 244, f = 107
$$D_7 = L + \frac{w\left(\frac{7n}{10} - cf\right)}{f} = 190.5 + \frac{10(345.1 - 244)}{107} \approx 199.95$$
c) P90: Determine the class containing the 90th percentile.
$\frac{90n}{100} = 443.7$. Hence, 221 – 230 is the class containing the 90th percentile.
L = 220.5, n = 493, w = 10, cf = 434, f = 31
$$P_{90} = L + \frac{w\left(\frac{90n}{100} - cf\right)}{f} = 220.5 + \frac{10(443.7 - 434)}{31} \approx 223.63$$
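Quartiles, deciles and percentiles for grouped data all use the same interpolation, differing only in the fraction kn/4, kn/10 or kn/100. The Python sketch below captures this in one hypothetical function, `grouped_quantile`. It assumes equal class widths and a gap of 1 between one class's upper limit and the next class's lower limit, so that the lower boundary is the lower limit minus 0.5, as in the example table.

```python
def grouped_quantile(lower_limits, freqs, width, k, parts, gap=1):
    """k-th quantile of grouped data when the distribution is cut into `parts` parts.

    parts = 4 gives quartiles, 10 gives deciles, 100 gives percentiles.
    gap is the distance between consecutive class limits (assumed 1 here),
    so the lower class boundary is lower_limit - gap / 2.
    """
    n = sum(freqs)
    target = k * n / parts          # the position kn/parts
    cf = 0                          # cumulative frequency before the current class
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and cf + f >= target:
            boundary = lower - gap / 2
            return boundary + width * (target - cf) / f
        cf += f
    raise ValueError("k must satisfy 0 < k <= parts")

lower_limits = list(range(141, 242, 10))   # 141, 151, ..., 241
freqs = [17, 29, 42, 72, 84, 107, 49, 34, 31, 16, 12]

q1 = grouped_quantile(lower_limits, freqs, 10, 1, 4)
q2 = grouped_quantile(lower_limits, freqs, 10, 2, 4)
q3 = grouped_quantile(lower_limits, freqs, 10, 3, 4)
d7 = grouped_quantile(lower_limits, freqs, 10, 7, 10)
p90 = grouped_quantile(lower_limits, freqs, 10, 90, 100)
print(round(q1, 2), round(q2, 2), round(q3, 2), round(d7, 2), round(p90, 2))
```

Calling the function with k = 1 and parts = 2 gives the grouped median, which equals Q2.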

2.2. Measures of variation (dispersion)

Introduction
Measures of central tendency help us describe a set of data by a single number or typical value. However, they do not provide any information about the extent to which the values differ from one another or from the average value. Hence, to increase our understanding of the pattern of the data, we must also measure its dispersion, which indicates the degree to which the numerical data tend to spread about an average value. The scatter or spread of items of a distribution is known as dispersion or variation.
The measures of dispersion also enable us to compare several samples with similar averages.
Consider the following data sets:
Set 1: 60 40 30 50 60 40 70 50
Set 2: 50 49 49 51 48 50 53 50
Set 3: 50 50 50 50 50 50 50 50
The three data sets have a mean of 50, but obviously set 1 is more “spread out” than set 2 and set 3 has no
variability.
Objectives
The general objective of measuring dispersion is to obtain a single summary figure which adequately exhibits whether the distribution is compact or spread out. Further objectives are:
• To judge the reliability of measures of central tendency
• To control variability itself.
• To compare two or more groups of numbers in terms of their variability.
• To make further statistical analysis.

Absolute and Relative Measures of Dispersion


The measures of dispersion which are expressed in terms of the original unit of a series are termed as
absolute measures. Such measures are not suitable for comparing the variability of two distributions which
are expressed in different units of measurement and different average size. Relative measures of dispersions
are a ratio or percentage of a measure of absolute dispersion to an appropriate measure of central tendency
and are thus pure numbers independent of the units of measurement. For comparing the variability of two
distributions (even if they are not measured in the same unit), we compute the relative measure of dispersion
instead of absolute measures of dispersion.

Types of Measures of Dispersion
Absolute measures are useful for comparing variation in two or more distributions where the units of measurement are the same.
Various measures of dispersions are in use. The most commonly used measures of dispersions are:
1) Range and Relative Range
2) Quartile Deviation and Coefficient of Quartile Deviation
3) Mean Deviation and Coefficient of Mean Deviation
4) Standard Deviation and Coefficient of Variation.

The Range (R)


The range is the largest value minus the smallest value in a data set. The range is greatly affected by extreme
values. Range = largest value – smallest value.
The following two distributions have the same range, 13, yet appear to differ greatly in the amount of
variability.

Distribution 1: 32 35 36 36 37 38 40 42 42 43 43 45
Distribution 2: 32 32 33 33 33 34 34 34 34 34 35 45
For this reason, among others, the range is not the most important measure of variability.

Merits and Demerits of range


Merits:
• It is rigidly defined.
• It is easy to calculate and simple to understand.
Demerits:
• It is not based on all observations.
• It is highly affected by extreme observations.
• It is affected by fluctuation in sampling.
• It cannot be computed in the case of open end distribution.
• It is very sensitive to the size of the sample.
Relative Range (RR)
It is also sometimes called the coefficient of range and is given by:
$$RR = \frac{\text{Highest value} - \text{Lowest value}}{\text{Highest value} + \text{Lowest value}}$$
Example:
1. Find the relative range of the above two distributions. (Exercise!)
2. If the range and relative range of a series are 4 and 0.25 respectively, then what is the value of:
a) Smallest observation (Ans. 6)
b) Largest observation (Ans. 10)
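One possible way to reach the answers in exercise 2 is sketched below. The symbols H and L for the largest and smallest observations are introduced here only for illustration.

```latex
\begin{aligned}
H - L &= 4, \qquad \frac{H - L}{H + L} = 0.25 \\
\Rightarrow\quad H + L &= \frac{4}{0.25} = 16 \\
\Rightarrow\quad H &= \frac{16 + 4}{2} = 10, \qquad L = \frac{16 - 4}{2} = 6
\end{aligned}
```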

The Quartile Deviation (Semi-inter quartile range), Q.D

The interquartile range is the difference between the third and the first quartiles of a set of items: IQR = Q3 – Q1. The semi-interquartile range is half of the interquartile range:
$$Q.D = \frac{Q_3 - Q_1}{2}$$
Coefficient of Quartile Deviation (C.Q.D)
$$C.Q.D = \frac{(Q_3 - Q_1)/2}{(Q_3 + Q_1)/2} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$

Remark: Q.D or C.Q.D includes only the middle 50% of the observations.

The Mean Deviation (M.D):


The mean deviation of a set of items is defined as the arithmetic mean of the values of the absolute
deviations from a given average. Depending up on the type of averages used we have different mean
deviations.
Mean deviation about the mean for a data set x1, x2, …, xn:
$$MD = \frac{\sum_{i=1}^{n} |x_i - \bar{X}|}{n}$$
For the case of frequency distribution data where the values X1, X2, X3, …, Xk occur f1, f2, f3, …, fk times respectively, the mean deviation is obtained by:
$$MD = \frac{\sum_{i=1}^{k} f_i |X_i - \bar{X}|}{\sum_{i=1}^{k} f_i}$$
If the data are given in the form of a frequency distribution of k classes in which mi and fi are the class mark and frequency of the ith class respectively, then the mean deviation is given by:
$$MD = \frac{\sum_{i=1}^{k} f_i |m_i - \bar{X}|}{\sum_{i=1}^{k} f_i}$$

Steps to calculate M.D:
1. Find the arithmetic mean,
2. Find the deviations of each reading from $\bar{X}$ and
3. Find the arithmetic mean of the deviations, ignoring sign.

Example: calculate the mean deviation for the following data:


Xi 10 8 9 7 6
fi 8 9 13 6 3

Solution: First find the mean: $\bar{X}$ = (10*8 + 8*9 + … + 6*3)/(8 + 9 + … + 3) = 329/39 ≈ 8.4, then

Xi                 10     8     9     7     6
fi                 8      9     13    6     3
|Xi − X̄|           1.6    0.4   0.6   1.4   2.4
fi |Xi − X̄|        12.8   3.6   7.8   8.4   7.2

Thus, $MD = \frac{\sum f_i |X_i - \bar{X}|}{\sum f_i} = \frac{12.8 + 3.6 + 7.8 + 8.4 + 7.2}{8 + 9 + 13 + 6 + 3} = \frac{39.8}{39} \approx 1.02$

Interpretation: each value deviates on average 1.02 from the arithmetic mean, 8.4.
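The three steps for the mean deviation can be sketched in Python. This is a minimal illustration with assumed variable names; it keeps the unrounded mean (329/39), so intermediate values differ slightly from a hand calculation that rounds the mean to 8.4, but the result still rounds to 1.02.

```python
values = [10, 8, 9, 7, 6]
freqs = [8, 9, 13, 6, 3]

n = sum(freqs)                                           # total frequency
mean = sum(x * f for x, f in zip(values, freqs)) / n     # step 1: arithmetic mean
abs_devs = [abs(x - mean) for x in values]               # step 2: deviations, ignoring sign
mean_dev = sum(f * d for f, d in zip(freqs, abs_devs)) / n  # step 3: mean of the deviations

print(round(mean, 2), round(mean_dev, 2))
```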
Coefficient of Mean Deviation (C.M.D)

$$C.M.D = \frac{\text{mean deviation}}{\text{mean}}$$

The Variance and Standard Deviation


The variance
The variance is the "average squared deviation from the mean": it measures the average of the squared deviations of the observations from the mean.
Suppose we have a population of N observations, say X1, X2, X3, …, XN. Then we define the population variance as:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} X_i^2 - N\mu^2}{N}$$
But most of the time we have a sample of n observations, say X1, X2, X3, …, Xn, from the population of N. Then we define the sample variance as:
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}, \quad \text{or} \quad S^2 = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n - 1}, \quad \text{or} \quad S^2 = \frac{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}{n(n - 1)}$$
This measure of variation is universally used to show the scatter of the individual measurements around the mean of all the measurements in a given distribution. Its disadvantage is that the units of variance are the square of the units of the original observations. The easiest way around this difficulty is to use the square root of the variance as a measure of variability, called the standard deviation.
Standard deviation
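The population and the sample standard deviations, denoted by σ and S respectively, are defined as:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}, \text{ where } \mu \text{ is the population mean}$$
$$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}}, \text{ where } \bar{X} \text{ is the sample mean}$$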
For the case of frequency distribution data the population and sample variance are given as:
$$\sigma^2 = \frac{\sum f_i (x_i - \mu)^2}{N}, \text{ where } N = \sum f_i$$
$$S^2 = \frac{\sum f_i (x_i - \bar{X})^2}{n - 1}, \text{ where } n = \sum f_i$$

Variance and Standard Deviation for Grouped Data


The sample variance for a grouped frequency distribution is given by
$$S^2 = \frac{\sum f_i (m_i - \bar{X})^2}{n - 1}, \text{ where } n = \sum f_i \text{ and } m_i = \text{midpoint of the ith class}$$

Example: Areas of sprayable surfaces with DDT from a sample of 15 houses are as follows (m²): 101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137, 140, 145. Find the variance and standard deviation.
Solution: The mean of the sample is 125 ($\bar{X} = 125$), then
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{(101 - 125)^2 + (105 - 125)^2 + \dots + (145 - 125)^2}{14} = \frac{2502}{14} \approx 178.71$$
Hence, the standard deviation is $S = \sqrt{178.71} \approx 13.37$.
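This example can be reproduced with Python's standard `statistics` module, whose `variance` and `stdev` functions use the sample divisor n − 1. The variable names below are illustrative.

```python
import statistics

# Sprayable surface areas (square metres) for the 15 sampled houses.
areas = [101, 105, 110, 114, 115, 124, 125, 125, 130,
         133, 135, 136, 137, 140, 145]

mean_area = statistics.mean(areas)       # sample mean
variance = statistics.variance(areas)    # sample variance, divisor n - 1
std_dev = statistics.stdev(areas)        # square root of the sample variance

print(mean_area, round(variance, 2), round(std_dev, 2))
```

For population data, `statistics.pvariance` and `statistics.pstdev` divide by N instead.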
Examples: Find the variance and standard deviation of the following grouped sample data.
Class    Frequency
40-44    7
45-49    10
50-54    22
55-59    15
60-64    12
65-69    6
70-74    3
Sample mean, $\bar{X}$ = 55, n = 75

mi (midpoint)     42     47    52    57   62    67    72    Total
fi(mi − X̄)²       1183   640   198   60   588   864   867   4400

Then $S^2 = \frac{\sum f_i (m_i - \bar{X})^2}{n - 1} = \frac{4400}{74} \approx 59.46$
and $S = \sqrt{59.46} \approx 7.71$
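The grouped-data calculation can be sketched in Python as follows. Variable names are illustrative, and the class midpoints are entered directly as in the table.

```python
import math

midpoints = [42, 47, 52, 57, 62, 67, 72]   # class marks m_i
freqs = [7, 10, 22, 15, 12, 6, 3]          # class frequencies f_i

n = sum(freqs)
mean = sum(m * f for m, f in zip(midpoints, freqs)) / n

# f_i * (m_i - mean)^2 for each class, then their total
squared_terms = [f * (m - mean) ** 2 for m, f in zip(midpoints, freqs)]
total_ss = sum(squared_terms)

variance = total_ss / (n - 1)    # sample variance for grouped data
std_dev = math.sqrt(variance)

print(n, mean, total_ss, round(variance, 2), round(std_dev, 2))
```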

Note:
If the standard deviation of X1, X2, …, Xn is S, then the standard deviation of
a) X1 + k, X2 + k, …, Xn + k will also be S (where k is a constant)
b) kX1, kX2, …, kXn will be |k|S.
c) c + kX1, c + kX2, …, c + kXn will be |k|S (c and k are constants)

Example 1: The standard deviation of n observations X1, X2, ..., Xn is known to be 3. A new set of observations is obtained by the linear transformation Yi = 2Xi – 0.5 (i = 1, 2, …, n). What will be the standard deviation of the new set of observations?
Solution: New standard deviation = |k|S = 2*3 = 6
Example 2: The mean and the standard deviation of a set of numbers are respectively 500 and 10.
a) If 10 is added to each of the numbers in the set, then what will be the variance and standard deviation of the new set?
b) If each of the numbers in the set is multiplied by -5, then what will be the variance and standard deviation of the new set?
Solutions: a) The variance and standard deviation will remain the same (variance = 10² = 100, standard deviation = 10).
b) New standard deviation = |k|S = 5*10 = 50, and new variance = 50² = 2500.

Coefficient of Variation (CV)


The coefficient of variation (CV) is defined by
$$CV = \frac{\text{standard deviation}}{\text{mean}} \times 100\%, \quad \text{that is,} \quad CV = \frac{S}{\bar{X}} \times 100\%.$$
The coefficient of variation is most useful in comparing the variability of several different samples, each
with different means. This is because a higher variability is usually expected when the mean increases, and
the CV is a measure that accounts for this variability.
CV is a relative measure free from unit of measurement.

Examples: An analysis of the weekly wages paid (in Birr) to workers in two firms A and B belonging to the same industry gives the following results.
In which firm are the wages more variable?
Value        Firm A    Firm B
Mean wage    56        72
Variance     100       121

Solution: $C.V_A = \frac{S_A}{\bar{X}_A} \times 100\% = \frac{10}{56} \times 100\% = 17.86\%$ and
$C.V_B = \frac{S_B}{\bar{X}_B} \times 100\% = \frac{11}{72} \times 100\% = 15.28\%$.
Since C.VA > C.VB, there is greater variability in individual wages in firm A.
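The comparison of the two firms can be sketched in Python. The helper name `coefficient_of_variation` is hypothetical; the standard deviations are obtained as the square roots of the given variances.

```python
import math

def coefficient_of_variation(mean, std_dev):
    """CV as a percentage: (standard deviation / mean) * 100."""
    return std_dev / mean * 100

# Mean wage and variance for each firm, as given in the table.
cv_a = coefficient_of_variation(56, math.sqrt(100))
cv_b = coefficient_of_variation(72, math.sqrt(121))

more_variable = "A" if cv_a > cv_b else "B"
print(round(cv_a, 2), round(cv_b, 2), more_variable)
```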

The standard Score (Z-score):


It is the number of standard deviations that a given value X is below or above the mean.
The standard score of any value Xi is defined as
$$Z_i = \frac{X_i - \text{mean}}{\text{standard deviation}}, \quad \text{that is,} \quad Z_i = \frac{X_i - \bar{X}}{S} \text{ (for sample data sets)}$$

Values above the mean have positive Z-scores and values below the mean have negative Z-scores. Z-scores are generally meaningless by themselves unless they are compared to the distribution or scores from some reference group.
Note: A Z-score less than -2 or greater than 2 is considered an unusually low or unusually high value, respectively.

Example 1: Two sections were given introduction to statistics examinations. The following information was
given.
Value Section 1 Section 2
Mean 78 90
Standard deviation 6 5

Student A from section 1 scored 90 and student B from section 2 scored 95. Relatively speaking, who performed better?
Solution: $Z_A = \frac{X_A - \bar{X}_1}{S_1} = \frac{90 - 78}{6} = 2$ and $Z_B = \frac{X_B - \bar{X}_2}{S_2} = \frac{95 - 90}{5} = 1$
Student A performed better relative to his section because the score of student A is two standard deviations above the mean score of his section, while the score of student B is only one standard deviation above the mean score of his section.

Example 2: Two groups of people were trained to perform a certain task and tested to find out which group is
faster to learn the task. For the two groups the following information was given:
Value Group one Group two
Mean 10.4 min 11.9 min
Stan.dev. 1.2 min 1.3 min
Relatively speaking:
a) Which group is more consistent(less variable) in its performance?
b) Suppose a person A from group one takes 9.2 minutes while person B from Group
two takes 9.3 minutes, who was faster in performing the task? Why?

Solutions:
a) Use the coefficient of variation.
$CV_1 = \frac{S_1}{\bar{X}_1} \times 100\% = \frac{1.2}{10.4} \times 100\% = 11.54\%$
$CV_2 = \frac{S_2}{\bar{X}_2} \times 100\% = \frac{1.3}{11.9} \times 100\% = 10.92\%$
Since C.V2 < C.V1, group 2 is more consistent (less variable).
b) Calculate the standard scores of A and B.
$Z_A = \frac{X_A - \bar{X}_1}{S_1} = \frac{9.2 - 10.4}{1.2} = -1$ and $Z_B = \frac{X_B - \bar{X}_2}{S_2} = \frac{9.3 - 11.9}{1.3} = -2$
Person B is faster because the time taken by person B is two standard deviations shorter than the average time taken by group 2, while the time taken by person A is only one standard deviation shorter than the average time taken by group 1.
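The standard scores in this example can be computed with a short Python sketch. The function name `z_score` is illustrative. Because the task is timed, the more negative score indicates the relatively faster person.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

z_a = z_score(9.2, 10.4, 1.2)   # person A, group one
z_b = z_score(9.3, 11.9, 1.3)   # person B, group two

# Lower time is better here, so the smaller (more negative) z-score wins.
faster = "A" if z_a < z_b else "B"
print(round(z_a, 2), round(z_b, 2), faster)
```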

3. Elementary probability

3.1 Deterministic and non-deterministic models

A deterministic model is one in which every set of variable states is uniquely determined by parameters in the model and by sets of previous states of these variables. Such a model hypothesizes exact relationships and is suitable when prediction error is negligible.
In a non-deterministic (stochastic/probabilistic) model, randomness is present, and variable states are not described by unique values, but rather by probability distributions. The outcomes nevertheless show a pattern or regularity that can be used to construct a precise mathematical model. Such a model hypothesizes two components: a deterministic component and a random error.

3.2 Random Experiments, Sample Space and Events

Random experiments
An experiment is the process by which an observation (measurement) is obtained. The results of an experiment may not be the same even when it is repeated under identical conditions. Such experiments are called random experiments.
Example:
a. If we toss a fair die, the result of the experiment is one of the numbers in the set S = {1, 2, 3, 4, 5, 6}.
b. If an experiment consists of measuring “lifetimes” of electric light bulbs produced by a company,
then the result of the experiment is a time t in hours that lies in some interval say, 0 ≤ t ≤ 4000 where
we assume that no bulb lasts more than 4000 hours.

Sample space
Sample space is the set of all possible outcomes of a random experiment. It is denoted by S. Each
outcome is called sample point.

Event: An event is a subset of the sample space S.

Simple event: If an event E consists of a single outcome, then it is called a simple or elementary event.

Impossible event: This is an event which will never occur.

Complement of an event: The complement of event A (denoted by A^c or Ā) consists of all the sample points in the sample space that are not in A.

Mutually exclusive events: Two events A and B are said to be mutually exclusive if they cannot occur simultaneously (i.e. A ∩ B = ∅). The intersection of mutually exclusive sets is the empty set.

Independent events: Two events are said to be independent if the occurrence of one is not affected by, and
does not affect, the other. If two events are not independent, then they are said to be dependent.

Equally likely outcomes: If each outcome in an experiment has the same chance to occur, then the outcomes are said to be equally likely.

Example: In an experiment of rolling a fair die, S = {1, 2, 3, 4, 5, 6}, each sample point is an equally likely
outcome. It is possible to define many events on this sample space as follows:

A = {1, 4}: the event of getting a perfect square number.

B = {2, 4, 6}: the event of getting an even number.

C = {1, 3, 5}: the event of getting an odd number.

E = the event of getting number 8.

Then, A^c = {2, 3, 5, 6}; B and C are complementary and E is an impossible event.

Example: In tossing a coin the sample space S is S = {Head, Tail} The events will be
A = { Head, Tail }, B = { Head}, C = { Tail } and D = {}.

3.3 Review of set theory


Definition
A set is a collection of well-defined objects. These objects are called elements. Sets are usually denoted by capital letters and elements by small letters. Membership in a given set is denoted by ∈ (belongs to the set) and ∉ (does not belong to the set).
Description of sets: Sets can be described by any of the following three ways. That is the complete listing
method (all element of the set are listed), the partial listing method (the elements of the set can be indicated
by listing some of the elements of the set) and the set builder method (using an open proposition to describe
elements that belongs to the set).
Example: The possible outcomes in tossing a six side die
S = {1, 2, 3, 4, 5, 6} or S = {1, 2, . . ., 6} or S = {x: x is an outcome in tossing a six side die}
Types of set
Universal set: is a set that contains all elements of the set that can be considered the objects of that particular
discussion.

Empty or null set: is a set which has no element, denoted by {} or ∅.
Finite set: is a set which contains a finite number of elements. (eg.{x: x is an integer, 0 < x < 5})
Infinite set: is a set which contains an infinite number of elements. (eg. {x : x   , x > 0})
Subset: If every element of set A is also an element of set B, set A is called a subset of B, denoted by A ⊆ B.
Proper subset: For two sets A and B, if A is a subset of B and B is not a subset of A, then A is said to be a proper subset of B, denoted by A ⊂ B.
Equal sets: Two sets A and B are said to be equal if every element of A is an element of B and every element of B is an element of A.
Equivalent sets: Two sets A and B are said to be equivalent if there is a one-to-one correspondence between the elements of the two sets.
Set Operations and their Properties
There are many ways of operating on two or more sets to get another set. Some of them are discussed below.
Union of sets: The union of two sets A and B is the set of elements which belong to at least one of the two sets. It is denoted by A ∪ B (A union B).
Intersection of sets: The intersection of two sets A and B is the set of elements which belong to both A and B. It is denoted by A ∩ B (A intersection B).
Disjoint sets: are two sets whose intersection is the empty set.
Absolute complement or complement: Let U be the universal set and A be a subset of U. Then the complement of A, denoted by A^c, is the set of elements in U that do not belong to A.
Relative complement (or difference): The difference of set A with respect to set B, written as A ∩ B^c (or A – B), is the set of elements in A that do not belong to B.
Symmetric difference: The symmetric difference of two sets A and B, denoted by A Δ B, is the set of elements which belong to A but not to B, or to B but not to A. That is, A Δ B = (A ∩ B^c) ∪ (B ∩ A^c).

Basic Properties of the Set Operations


Let U be the universal set and let A, B, C be sets in the universe. Then the following properties hold.
1. A ∪ B = B ∪ A (Union of sets is commutative)
2. A ∪ (B ∪ C) = (A ∪ B) ∪ C = A ∪ B ∪ C (Union of sets is associative)
3. A ∩ B = B ∩ A (Intersection of sets is commutative)
4. A ∩ (B ∩ C) = (A ∩ B) ∩ C = A ∩ B ∩ C (Intersection of sets is associative)
5. A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C) (Union of sets is distributive over intersection)
6. A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C) (Intersection of sets is distributive over union)
7. If A ⊆ B, then B^c ⊆ A^c
8. A ∪ ∅ = A and A ∩ ∅ = ∅
9. A ∪ U = U and A ∩ U = A
10. (A ∪ B)^c = A^c ∩ B^c De Morgan's first rule
11. (A ∩ B)^c = A^c ∪ B^c De Morgan's second rule
12. A = (A ∩ B) ∪ (A ∩ B^c)
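
The set operations and a few of the listed properties can be checked on a small example. The sketch below uses Python's built-in set type; the universal set U and the events A and B are illustrative choices (one roll of a six-sided die), not taken from the notes.

```python
# Illustrative universal set and two events for one roll of a six-sided die.
U = {1, 2, 3, 4, 5, 6}
A = {1, 3, 5}   # odd outcomes
B = {1, 2, 3}   # outcomes of at most 3


def complement(event, universe=U):
    """Elements of the universe that do not belong to the event."""
    return universe - event


union = A | B                  # A ∪ B
intersection = A & B           # A ∩ B
difference = A - B             # A ∩ B^c
symmetric_difference = A ^ B   # (A ∩ B^c) ∪ (B ∩ A^c)

# De Morgan's rules (properties 10 and 11)
de_morgan_first = complement(A | B) == (complement(A) & complement(B))
de_morgan_second = complement(A & B) == (complement(A) | complement(B))

# Property 12: A = (A ∩ B) ∪ (A ∩ B^c)
decomposition = A == ((A & B) | (A & complement(B)))

print(union, intersection, difference, symmetric_difference)
print(de_morgan_first, de_morgan_second, decomposition)
```

A check on one pair of sets is only a sanity test, not a proof; the properties themselves hold for all sets.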

In many problems of probability, we are interested in events that are actually combinations of two or more
events formed by unions, intersections, and complements. Since the concept of set theory is of vital
importance in probability theory, we need a brief review.

• The union of two sets A and B, A ∪ B, is the set with all elements in A or B or both.

• The intersection of A and B, A ∩ B, is the set that contains all elements in both A and B.

• The complement of A, A^c, is the set that contains all elements in the universal set U that are not found in A.

3.4. Finite, infinite sample space and equally likely outcomes

If a sample space has a finite number of points, it is called a finite sample space. If it has as many points as
there are natural numbers 1, 2, 3, …, it is called a countably infinite sample space. If it has as many points as
there are in some interval, such as 0 < x < 1, it is called an uncountably infinite sample space. A sample space
which is finite or countably infinite is often called a discrete sample space, while one which is uncountably
infinite is called a continuous sample space.
Equally Likely Outcomes
Equally likely outcomes are outcomes of an experiment which have an equal chance (are equally probable) of
occurring. The outcomes of a finite sample space are commonly assumed to be equally likely unless stated
otherwise. If we have n equally likely outcomes in the sample space, then the probability of the i-th sample
point xi is p(xi) = 1/n, where xi can be the first, second, ..., or the n-th outcome.

Example: In an experiment of tossing a fair die, the outcomes are equally likely (each outcome is equally
probable). Hence, P(xi = 1) = P(xi = 2) = P(xi = 3) = P(xi = 4) = P(xi = 5) = P(xi = 6) = 1/6.

3.5. Counting Techniques
In many cases the number of sample points in a sample space is not very large, and so direct enumeration or
counting of sample points used to obtain probabilities is not difficult. However, problems arise where direct
counting becomes a practical impossibility. To avoid such difficulties we apply the fundamental principles of
counting (counting techniques).
Multiplication Rule
Suppose a task is completed in k stages by carrying out a number of subtasks in each one of the k stages. If
in the first stage the task can be accomplished in n1 different ways, after this in the second stage the task
can be accomplished in n2 different ways, . . . , and finally in the k-th stage the task can be accomplished in nk
different ways, then the overall task can be done in n1 × n2 × ··· × nk different ways.

Example: Suppose that a person has 2 different pairs of trousers and 3 shirts. In how many ways can he
wear his trousers and shirts?

Solution: He can choose the trousers in n1 = 2 ways and the shirts in n2 = 3 ways.

Therefore, he can dress in n1 × n2 = 2 × 3 = 6 possible ways.

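Example: How many four-digit numbers can be formed from the digits 1, 2, 5, 6 and 9 if each digit can be
used only once?
Solution: There are 5 choices for the first digit, 4 for the second, 3 for the third and 2 for the fourth, so
we have a total of 5 × 4 × 3 × 2 = 120 four-digit numbers.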
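
For small problems such as the two examples above, the multiplication rule can be confirmed by listing every possibility. This is a minimal sketch using the standard itertools module; the labels for the trousers and shirts are hypothetical.

```python
from itertools import permutations, product

# Two pairs of trousers and three shirts (hypothetical labels).
trousers = ["T1", "T2"]
shirts = ["S1", "S2", "S3"]
outfits = list(product(trousers, shirts))   # every (trousers, shirt) pair
print(len(outfits))                         # n1 * n2

# Four-digit numbers from the digits 1, 2, 5, 6, 9 without repetition.
digits = [1, 2, 5, 6, 9]
four_digit_numbers = list(permutations(digits, 4))
print(len(four_digit_numbers))              # 5 * 4 * 3 * 2
```

Direct listing quickly becomes impractical as the counts grow, which is why the counting formulas are used instead.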

Permutations
Suppose that we are given n distinct objects and wish to arrange r of these objects in a line. Since there are n
ways of choosing the 1st object, and after this is done, n - 1 ways of choosing the 2nd object, . . . , and finally
n - r + 1 ways of choosing the rth object, it follows by the fundamental principle of counting that the number
of different arrangements or permutations is given by n(n - 1)(n - 2) . . . (n - r + 1) = nPr where it is noted that
the product has r factors.
We call nPr the number of permutations of n objects taken r at a time; it is given by
nPr = n! / (n − r)!

When r = n, the above equation becomes nPn = n! / (n − n)! = n! / 0! = n!, which is called n factorial.

Note: 0! = 1
Example: In one year, three awards (research, teaching, and service) will be given to a class of 25 graduate
students in a statistics department. If each student can receive at most one award, how many possible
selections are there?

Solution: Since the awards are distinguishable, it is a permutation problem. The total number of sample
points is
25P3 = 25! / (25 − 3)! = 25! / 22! = (25)(24)(23) = 13,800.
Example: A president and a treasurer are to be chosen from a student club consisting of 50 people. How
many different choices of officers are possible if there are no restrictions?

Solution: The total number of choices of officers, without any restrictions, is
50P2 = 50! / (50 − 2)! = 50! / 48! = (50)(49) = 2450.
Remark
If a set consists of n objects of which n1 are of one type (i.e., indistinguishable from each other), n2 are of a
second type, . . . , and nk are of a k-th type, where n1 + n2 + . . . + nk = n, then the number of different
permutations of the objects is given by:
nP(n1, n2, . . . , nk) = n! / (n1! n2! . . . nk!)

Example: How many different letter arrangements can be made from the letters in the word
“STATISTICS”?
Solution: Here we have 10 letters in total, with 2 letters (S, T) appearing 3 times each, the letter I appearing
twice, and the letters A and C appearing once each.
Therefore, there are 10! / (3! 3! 2! 1! 1!) = 50,400 letter arrangements.
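
The permutation counts in the examples above can be reproduced with the standard library. In this sketch, math.perm (available from Python 3.8) gives nPr; multiset_permutations is a helper name introduced here for illustration.

```python
from collections import Counter
from math import factorial, perm


def multiset_permutations(word):
    """Return n! / (n1! n2! ... nk!) for the letters of `word`."""
    total = factorial(len(word))
    for count in Counter(word).values():   # n1, n2, ..., nk
        total //= factorial(count)
    return total


awards = perm(25, 3)       # 25P3: three distinct awards among 25 students
officers = perm(50, 2)     # 50P2: president and treasurer from 50 people
arrangements = multiset_permutations("STATISTICS")
print(awards, officers, arrangements)
```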
Combinations
In permutation we are interested in the order of arrangement of the objects. In many problems, however, we
are interested only in selecting or choosing objects without regard to order. Such selections are called
combinations.
The total number of combinations of r objects selected from n (also called the combinations of n objects
taken r at a time) is denoted by C(n, r) or nCr and is given by
nCr = n! / (r! (n − r)!)
Example: In how many ways can a committee of 2 students be formed out of 6?

Solution: 6C2 = 6! / (2! 4!) = (6 × 5) / 2! = 15.
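
As a quick check of the committee example, the sketch below lists all unordered pairs and compares the count with math.comb (Python 3.8 or later). The student labels are hypothetical.

```python
from itertools import combinations
from math import comb

students = ["s1", "s2", "s3", "s4", "s5", "s6"]   # hypothetical labels
committees = list(combinations(students, 2))      # order does not matter
print(len(committees), comb(6, 2))
```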
Example: Out of 5 male workers and 7 female workers of a factory, a task force consisting of 5 workers is to
be formed. In how many ways can this be done if the task force will consist of
(a) 2 male and 3 female workers?
(b) all female workers?
(c) at least 3 male workers?
Solution:

a) 5C2 × 7C3 = [5! / (2! 3!)] × [7! / (3! 4!)] = 10 × 35 = 350

b) 5C0 × 7C5 = [5! / (0! 5!)] × [7! / (5! 2!)] = 1 × 21 = 21

c) 5C3 × 7C2 + 5C4 × 7C1 + 5C5 × 7C0 = 210 + 35 + 1 = 246

3.6. Definitions of Probability


In any random experiment there is always uncertainty as to whether a particular event will or will not occur.
As a measure of the chance, or probability, with which we can expect the event to occur, it is convenient to
assign a number between 0 and 1. If we are sure or certain that the event will occur, we say that its
probability is 100% or 1, but if we are sure that the event will not occur, we say that its probability is zero.
There are different procedures by means of which we can define or estimate the probability of an event.
These procedures are discussed below:

1. Classical Approach
Let S be a sample space associated with a certain random experiment and consisting of finitely many sample
points, n say, each of which is equally likely to occur whenever the random experiment is carried out. Then
the probability of any event A consisting of k sample points (0 ≤ k ≤ n) is given by: P(A) = k/n

Example: What is the probability that an odd number will turn up in rolling a fair die?

Solution: S = {1, 2, 3, 4, 5, 6}; let A = {1, 3, 5}. For a fair die, P(1) = P(2) = ··· = P(6) = 1/6; then
P(A) = k/n = 3/6 = 1/2.

Example: In an experiment of tossing a fair coin three times, find the probability of getting exactly two heads.

Solution: For each toss, there are two possible outcomes, head (H) or tail (T). Thus, the number of possible
outcomes is n = 2 × 2 × 2 = 8.

The sample space is S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}

If E1 = the event of getting exactly 2 heads, then E1 = {HHT, HTH, THH} and n(E1) = k = 3.

Therefore, P(E1) = k/n = 3/8.

Example: Out of 5 male workers and 7 female workers of a factory, a task force consisting of 5 workers is to
be formed. What is the probability that the task force will consist of
(a) 2 male and 3 female workers?
(b) all female workers?
(c) at least 3 male workers?

Solution: The total number of possible task forces is n(S) = 12C5 = 792.

a) Let A = the task force consists of 2 male and 3 female workers; n(A) = 5C2 × 7C3 = 350.

Hence, P(A) = n(A) / n(S) = 350/792 = 0.442

b) P(all female) = (5C0 × 7C5) / 12C5 = 21/792 = 0.0265

c) P(at least 3 male) = 246/792 = 0.311
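
The classical probabilities in the task force example can be recomputed exactly. The following sketch uses fractions.Fraction to avoid rounding until the final print; the helper name ways is introduced here for illustration.

```python
from fractions import Fraction
from math import comb

MALES, FEMALES, SIZE = 5, 7, 5
total = comb(MALES + FEMALES, SIZE)   # n(S): all possible task forces


def ways(males_chosen):
    """Number of task forces with exactly `males_chosen` male workers."""
    return comb(MALES, males_chosen) * comb(FEMALES, SIZE - males_chosen)


p_a = Fraction(ways(2), total)                           # 2 male, 3 female
p_b = Fraction(ways(0), total)                           # all female
p_c = Fraction(sum(ways(m) for m in (3, 4, 5)), total)   # at least 3 male
print(total, round(float(p_a), 3), round(float(p_b), 4), round(float(p_c), 3))
```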

2. Relative Frequency Approach


Let N(A) be the number of times an event A occurs in N repetitions of a random experiment, and assume that
the relative frequency of A, N(A)/N, converges to a limit as N → ∞. This limit is denoted by P(A) and is called
the probability of A.
Example: Records show that 60 out of 100,000 bulbs produced are defective. What is the probability that a
newly produced bulb is defective?

Solution: Let A be the event that the newly produced bulb is defective.
P(A) = 60/100,000 = 0.0006
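
The idea that a relative frequency settles near a fixed value can be illustrated by simulation. The sketch below is an assumption-laden illustration rather than part of the notes: it simulates rolls of a fair die with Python's pseudo-random generator (with a fixed seed so the run is repeatable) and tracks how often an odd number appears.

```python
import random


def roll_die(rng):
    """One simulated roll of a fair six-sided die."""
    return rng.randint(1, 6)


def is_odd(outcome):
    return outcome % 2 == 1


def relative_frequency(event, experiment, repetitions, seed=0):
    """Return N(A)/N for `repetitions` simulated runs of the experiment."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(repetitions) if event(experiment(rng)))
    return hits / repetitions


for n in (100, 10_000, 100_000):
    print(n, relative_frequency(is_odd, roll_die, n))
```

With more repetitions the printed values tend to move closer to 1/2, the classical probability of an odd number.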

Axioms of probability
Probability is a function defined for each event of a sample space S, taking on values in the real line ℝ, and
satisfying the following three properties (or axioms of probability). We write P(A) for the probability that
event A occurs.
Axiom 1: P(A) ≥ 0 for every event A
Axiom 2: P(S) = 1, where S = sample space (sure or certain event)
Axiom 3: If A1, A2, A3, …, An are mutually exclusive events (meaning Ai ∩ Aj = ∅ for i ≠ j),
then P(A1 ∪ A2 ∪ A3 ∪ . . . ∪ An) = P(A1) + P(A2) + P(A3) + . . . + P(An) = Σ_{i=1}^{n} P(Ai)

3.7. Derived Theorems of Probability

Theorem 1: P(A) ≤ 1, for any event A

Theorem 2: If A^c is the complement of A, then P(A^c) = 1 − P(A)

Theorem 3: P(∅) = 0, where ∅ is the empty set
Theorem 4: If A and B are any two events, then P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
More generally, if A, B, C are any three events, then
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(B ∩ C) − P(A ∩ C) + P(A ∩ B ∩ C)
Theorem 5: For any events A and B, P(A) = P(A ∩ B) + P(A ∩ B^c), since (A ∩ B) and
(A ∩ B^c) are mutually exclusive.
Example: In a class of 200 students, 138 are enrolled in a mathematics course, 115 are enrolled in
statistics, and 91 are enrolled in both. What percent of these students take
a) either mathematics or statistics?
b) neither mathematics nor statistics?
c) statistics but not mathematics?
d) mathematics but not statistics?
Solution: Let M = the student takes mathematics and S = the student takes statistics.
a) P(M ∪ S) = P(M) + P(S) − P(M ∩ S) = 138/200 + 115/200 − 91/200 = 0.69 + 0.575 − 0.455 = 0.81
81% of the students take either course.
b) P((M ∪ S)^c) = 1 − P(M ∪ S) = 1 − 0.81 = 0.19. 19% of the students take neither course.
c) P(S ∩ M^c) = P(S) − P(S ∩ M) = 0.575 − 0.455 = 0.12.
12% of the students take statistics but not mathematics.
d) P(M ∩ S^c) = P(M) − P(S ∩ M) = 0.69 − 0.455 = 0.235
23.5% of the students take mathematics but not statistics.
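
The four answers follow directly from Theorems 2, 4 and 5, and can be recomputed with exact fractions. This is a small sketch; the variable names are chosen here for illustration.

```python
from fractions import Fraction

n = 200
p_m = Fraction(138, n)      # P(M): mathematics
p_s = Fraction(115, n)      # P(S): statistics
p_both = Fraction(91, n)    # P(M ∩ S)

p_either = p_m + p_s - p_both   # Theorem 4: P(M ∪ S)
p_neither = 1 - p_either        # Theorem 2: P((M ∪ S)^c)
p_s_only = p_s - p_both         # Theorem 5: P(S ∩ M^c)
p_m_only = p_m - p_both         # Theorem 5: P(M ∩ S^c)
print(float(p_either), float(p_neither), float(p_s_only), float(p_m_only))
```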

4. Conditional Probability and Independence
4.1. Conditional Probability
Definition: The conditional probability of an event A, given that event B has occurred with P(B) > 0,
denoted by P(A|B), is defined as P(A|B) = P(A ∩ B) / P(B).
Note:
• P(S|B) = 1, for any event B with P(B) > 0, where S = sample space
• P(A^c|B) = 1 − P(A|B)
Example: A fair die is tossed once. What is the probability that the die shows a 4 given that the die shows
an even number?
Let A = {4} and B = {2, 4, 6}; then A ∩ B = {4} and
P(A|B) = P(A ∩ B) / P(B) = (1/6) / (3/6) = 1/3
Example: A random sample of 200 adults is classified below by sex and level of education attained.

Education     Male   Female   Total
Elementary     38      45       83
Secondary      28      50       78
College        22      17       39
Total          88     112      200

If a person is picked at random from this group, find the probability that
(a) the person is a male given that the person has a secondary education
(b) the person does not have a college education given that the person is a female.

Solution: Let S = the person has secondary education, M = the person is male, F = the person is female, C =
the person has college education and C^c = the person does not have college education.

a) P(M|S) = P(M ∩ S) / P(S) = (28/200) / (78/200) = 28/78 = 14/39

b) P(C^c|F) = P(C^c ∩ F) / P(F) = (95/200) / (112/200) = 95/112
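
The same conditional probabilities can be read off the table programmatically. The sketch below stores the six cell counts from the example and applies the definition P(A|B) = P(A ∩ B) / P(B); the function names prob and conditional are introduced here for illustration.

```python
from fractions import Fraction

# Cell counts from the table: (education, sex) -> number of adults.
counts = {
    ("Elementary", "Male"): 38, ("Elementary", "Female"): 45,
    ("Secondary", "Male"): 28, ("Secondary", "Female"): 50,
    ("College", "Male"): 22, ("College", "Female"): 17,
}


def prob(predicate):
    """Probability that a randomly picked person satisfies the predicate."""
    total = sum(counts.values())
    favourable = sum(c for key, c in counts.items() if predicate(*key))
    return Fraction(favourable, total)


def conditional(event, given):
    """P(event | given) = P(event ∩ given) / P(given)."""
    joint = prob(lambda edu, sex: event(edu, sex) and given(edu, sex))
    return joint / prob(given)


p_male_given_secondary = conditional(
    lambda edu, sex: sex == "Male", lambda edu, sex: edu == "Secondary")
p_not_college_given_female = conditional(
    lambda edu, sex: edu != "College", lambda edu, sex: sex == "Female")
print(p_male_given_secondary, p_not_college_given_female)
```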

4.2 Multiplication Rule
If in an experiment the events A and B can both occur, then P(A ∩ B) = P(A)P(B|A), provided P(A) > 0. Thus,
the probability that both A and B occur is equal to the probability that A occurs multiplied by the conditional
probability that B occurs, given that A occurs.

Example: Suppose that we have a fuse box containing 20 fuses, of which 5 are defective. If 2 fuses are
selected at random and removed from the box in succession without replacing the first, what is the
probability that both fuses are defective?
Solution: Let A be the event that the first fuse is defective and B the event that the second fuse is defective;
then A ∩ B is the event that both fuses are defective.
The probability of first removing a defective fuse is 5/20 = 1/4 (that is, P(A) = 1/4); the probability that the
second fuse is defective given that the first fuse is defective is then 4/19 (i.e. P(B|A) = 4/19).
Hence, P(A ∩ B) = P(A)P(B|A) = 1/4 × 4/19 = 1/19

Multiplication rule (Multiplicative Theorem)


In general, if, in an experiment, the events A1, A2, . . . , Ak can occur, then
P(A1 ∩ A2 ∩ ··· ∩ Ak) = P(A1)P(A2|A1)P(A3|A1 ∩ A2) ··· P(Ak|A1 ∩ A2 ∩ ··· ∩ A(k−1)).
Example: Three balls are drawn in succession, without replacement, from a box containing 6 red and 4 blue
balls. Find the probability that all three of them are red.
Solution: First we define the events
A1: the first ball is red,
A2: the second ball is red,
A3: the third ball is red
Required: P(A1 ∩ A2 ∩ A3) = ?

Now, P(A1) = 6/10 = 3/5, P(A2|A1) = 5/9, P(A3|A1 ∩ A2) = 4/8 = 1/2

Hence, P(A1 ∩ A2 ∩ A3) = P(A1)P(A2|A1)P(A3|A1 ∩ A2)

P(all three are red) = P(A1 ∩ A2 ∩ A3) = 3/5 × 5/9 × 1/2 = 1/6
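
The chain of conditional probabilities can be cross-checked by counting. In this sketch the ten balls are given positions 0 to 9 (a labelling assumed here so that every ordered draw of three distinct balls is equally likely), and the classical definition is applied to the ordered draws.

```python
from fractions import Fraction
from itertools import permutations

# Multiplication rule: P(A1) P(A2|A1) P(A3|A1 ∩ A2)
p_chain = Fraction(6, 10) * Fraction(5, 9) * Fraction(4, 8)

# Classical check: enumerate all ordered draws of 3 distinct balls.
balls = ["R"] * 6 + ["B"] * 4
draws = list(permutations(range(len(balls)), 3))
all_red = sum(1 for draw in draws if all(balls[i] == "R" for i in draw))
p_enumerated = Fraction(all_red, len(draws))
print(p_chain, p_enumerated)
```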

4.3 Theorem of total probability and Bayes’ Theorem
Partition of sample space: A collection of events {B1, B2, . . . , Bn} of a sample space S is called a partition
of S if B1, B2, . . . , Bn are mutually exclusive and B1 ∪ B2 ∪ ··· ∪ Bn = S.

Theorem of total probability

If the events B1, B2, . . . , Bn constitute a partition of the sample space S such that P(Bi) ≠ 0 for i = 1, 2, . . . , n,
then for any event A of S,
P(A) = P(A ∩ B1) + P(A ∩ B2) + P(A ∩ B3) + . . . + P(A ∩ Bn)
= P(B1)P(A|B1) + P(B2)P(A|B2) + . . . + P(Bn)P(A|Bn)

Example: In a certain assembly plant, three machines, B1, B2, and B3, make 30%, 45%, and 25%,
respectively, of the products. It is known from past experience that 2%, 3%, and 2% of the products made by
each machine, respectively, are defective. Now, suppose that a finished product is randomly selected. What
is the probability that it is defective?

Solution: Consider the following events:


A: the product is defective,
B1: the product is made by machine B1,
B2: the product is made by machine B2,
B3: the product is made by machine B3.
Then, P(B1) = 0.3, P(B2) = 0.45, P(B3) = 0.25, P(A|B1) = 0.02, P(A|B2) = 0.03, P(A|B3) = 0.02
Applying the theorem of total probability,
P(A) = P(B1)P(A|B1) + P(B2)P(A|B2) + P(B3)P(A|B3)
= (0.3)(0.02) + (0.45)(0.03) + (0.25)(0.02) = 0.006 + 0.0135 + 0.005 = 0.0245

Bayes' Theorem or Rule

Suppose that B1, B2, . . . , Bn form a partition of the sample space S (they are mutually exclusive events whose
union is S) with P(Bi) ≠ 0 for each i. Then if A is any event with P(A) > 0, we have the following theorem:
P(Br|A) = P(Br)P(A|Br) / Σ_{i=1}^{n} P(Bi)P(A|Bi), for r = 1, 2, . . . , n.

Example: With reference to the above example, if a product was chosen randomly and found to be defective,
what is the probability that it was made by machine B3?

Solution: Using Bayes' rule,
P(B3|A) = P(B3)P(A|B3) / [P(B1)P(A|B1) + P(B2)P(A|B2) + P(B3)P(A|B3)]

and then substituting the probabilities calculated in the above example, we have
P(B3|A) = 0.005 / (0.006 + 0.0135 + 0.005) = 0.005/0.0245 = 10/49

Example: An instructor has taught probability for many years. The instructor has found that 80% of students
who do the homework pass the exam, while 10% of students who don't do the homework pass the exam. If
60% of the students do the homework,
a) What percent of students pass the exam?
b) Of students who pass the exam, what percent did the homework?

Solution: Consider the events

A: the student passes the exam
B: the student does the homework
B^c: the student does not do the homework
Now, P(A|B) = 0.8, P(A|B^c) = 0.1, P(B) = 0.6, P(B^c) = 0.4
a) Applying the theorem of total probability,
P(A) = P(B)P(A|B) + P(B^c)P(A|B^c) = (0.6)(0.8) + (0.4)(0.1) = 0.48 + 0.04 = 0.52
52% of students pass the exam.
b) Applying Bayes' rule,
P(B|A) = P(B)P(A|B) / [P(B)P(A|B) + P(B^c)P(A|B^c)] = 0.48 / (0.48 + 0.04) = 0.9231

Of students who pass the exam, 92.31% did the homework.
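
Both the machine example and the homework example follow the same two steps: total probability for the denominator, then Bayes' rule. A minimal sketch, assuming the priors P(Bi) and the conditional probabilities P(A|Bi) are supplied as two lists in the same order (the function names are introduced here for illustration):

```python
from fractions import Fraction


def total_probability(priors, likelihoods):
    """P(A) = sum of P(Bi) P(A|Bi) over a partition B1, ..., Bn."""
    return sum(p * l for p, l in zip(priors, likelihoods))


def bayes(priors, likelihoods):
    """Return the list of posteriors P(Bi|A), in the same order as the inputs."""
    evidence = total_probability(priors, likelihoods)
    return [p * l / evidence for p, l in zip(priors, likelihoods)]


# Machines B1, B2, B3 and their defect rates.
machine_priors = [Fraction(30, 100), Fraction(45, 100), Fraction(25, 100)]
machine_defect = [Fraction(2, 100), Fraction(3, 100), Fraction(2, 100)]
p_defective = total_probability(machine_priors, machine_defect)
p_b3_given_defective = bayes(machine_priors, machine_defect)[2]

# Homework (B) versus no homework (B^c) and the pass rates.
hw_priors = [Fraction(6, 10), Fraction(4, 10)]
hw_pass = [Fraction(8, 10), Fraction(1, 10)]
p_pass = total_probability(hw_priors, hw_pass)
p_hw_given_pass = bayes(hw_priors, hw_pass)[0]

print(float(p_defective), p_b3_given_defective)
print(float(p_pass), round(float(p_hw_given_pass), 4))
```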

4.4 Independent Event


Definition: Two events A and B are said to be independent (in the probability sense) if P(A ∩ B) = P(A)P(B).
In other words, two events A and B are independent if the probability that one occurs is not affected by
the occurrence or non-occurrence of the other.

Remark
If two events A and B are independent, then P(B|A) = P(B) for P(A) > 0, and P(A|B) = P(A) for P(B) > 0.
The definition of independent events can be extended to more than two events as follows:

Definition: The events A1, A2, . . . , An are said to be independent (statistically or stochastically or in the
probability sense) if, for all possible choices of k out of the n events (2 ≤ k ≤ n), the probability of their
intersection equals the product of their probabilities. More formally, a collection of events {A1, A2, . . . , An}
is mutually independent if for any sub-collection Ai1, Ai2, . . . , Aik with 2 ≤ k ≤ n, we have
P(Ai1 ∩ . . . ∩ Aik) = P(Ai1) . . . P(Aik)

NB: If at least one of these relations fails to hold, the events are said to be dependent.

The three events, A1, A2 and A3,are independent if the following four conditions are satisfied.
P(A1∩A2) = P(A1) P(A2),
P(A1∩A3) = P(A1) P(A3),
P(A2∩A3) = P(A2) P(A3),
P(A1∩A2∩A3) = P(A1) P(A2) P(A3).
The first three conditions simply assert that any two of the events are independent, a property known as pairwise
independence. But the fourth condition is also important and does not follow from the first three. Conversely,
the fourth condition does not imply the first three conditions.

NB: The equality P(A1 ∩ A2 ∩ A3) = P(A1)P(A2)P(A3) is not enough for independence.

Example: Consider two independent rolls of a fair die, and the following events:
A = {1st roll shows 1, 2, or 3},
B = {1st roll shows 3, 4, or 5},
C = {the sum of the two rolls is 9}.

We have
P(A ∩ B) = 1/6 ≠ 1/2 × 1/2 = P(A)P(B);
P(A ∩ C) = 1/36 ≠ 1/2 × 4/36 = P(A)P(C);
P(B ∩ C) = 1/12 ≠ 1/2 × 4/36 = P(B)P(C).
Thus the three events A, B, and C are not independent, and indeed no two of these events are independent.
On the other hand, we have
P(A ∩ B ∩ C) = 1/36 = 1/2 × 1/2 × 4/36 = P(A)P(B)P(C).

Note: If the events A and B are independent, then each of the following pairs of events is also independent:
A and B^c; A^c and B; A^c and B^c

Example: If A and B are independent, then show that A and B^c are also independent.
Proof:
We need to show P(A ∩ B^c) = P(A)P(B^c).
From set and probability theory, P(A) = P(A ∩ B) + P(A ∩ B^c)
So, P(A ∩ B^c) = P(A) − P(A ∩ B)
= P(A) − P(A)P(B), since A and B are independent (given)
= P(A)[1 − P(B)]
P(A ∩ B^c) = P(A)P(B^c), hence proved.
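
The two-dice example in this section, where the triple product condition holds although no pair of events is independent, can be confirmed by enumerating the 36 equally likely outcomes. This is a sketch; the helper names P and both are introduced here for illustration.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # (1st roll, 2nd roll)


def P(event):
    """Classical probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def both(*events):
    """Intersection of the given events."""
    return lambda o: all(e(o) for e in events)


def A(o):
    return o[0] in (1, 2, 3)


def B(o):
    return o[0] in (3, 4, 5)


def C(o):
    return o[0] + o[1] == 9


pairwise = {
    "A,B": P(both(A, B)) == P(A) * P(B),
    "A,C": P(both(A, C)) == P(A) * P(C),
    "B,C": P(both(B, C)) == P(B) * P(C),
}
triple = P(both(A, B, C)) == P(A) * P(B) * P(C)
print(pairwise, triple)
```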

5. One-Dimensional Random Variables
5.1. Definitions of Random Variables
Let S be the sample space of an experiment and let X be a real-valued function defined over S;
then X is called a random variable (or stochastic variable).

A random variable, usually shortened to r.v. (rv), is a function defined on a sample space S and taking values
in the real line ℝ; it is denoted by a capital letter, such as X, Y, Z. Thus, the value of the r.v. X at the sample
point s is X(s), and the set of all values of X, that is, the range of X, is usually denoted by X(S) or RX.

The difference between a r.v. and an ordinary function is that the domain of a r.v. is a sample space S, unlike
the usual concept of a function, whose domain is a subset of ℝ or of a Euclidean space of higher dimension. The
usage of the term "random variable" rather than "function" may be explained by the
fact that a r.v. is associated with the outcomes of a random experiment. Of course, on the same sample space,
one may define many distinct r.v.s.

Example 1: Consider tossing three distinct coins once, so that the sample space is S = {HHH, HHT, HTH,
THH, HTT, THT, TTH, TTT}. Then a random variable X can be defined by X(s) = the number
of heads (H's) in s.
Example 2: Consider rolling two distinct dice once. The sample space is S = {(1, 1), (1, 2), . . , (2, 1), . . , (6, 1),
(6, 2), . . , (6, 6)}, and a r.v. X of interest may be defined by X(s) = the sum of the numbers in the pair s.

In the examples discussed above we saw r.v.s with different values. Hence, random variables can be
categorized in to two broad categories such as discrete and continuous random variables.

5.2. Discrete Random Variables


A random variable X is called discrete (or of the discrete type) if X takes on a finite or countably infinite
number of values; that is, either finitely many values such as x1, . . . , xn, or countably infinitely many values
such as x0, x1, x2, . . . .
Alternatively, a discrete random variable can be described as one that
• takes whole-number values (like 0, 1, 2, 3, etc.),
• takes a finite or countably infinite number of values,
• jumps from one value to the next and cannot take any value in between.

Example 3: In Example 1 and 2 above, the random variables defined are discrete r.v.s.

Example 4:
Experiment                                       Random Variable (X)        Variable values
Children of one gender in a family               Number of girls            0, 1, 2, …
Answer 23 questions of an exam                   Number of correct answers  0, 1, 2, ..., 23
Count cars at toll between 11:00 am & 1:00 pm    Number of cars arriving    0, 1, 2, ..., n

5.2.1. Probability Distribution of Discrete Random Variable


If X is a discrete random variable, the function given by f(x) = P(X = x) for each x within the range of X is
called the probability distribution or probability mass function of X.

Remark
• The probability distribution (mass) function f(x) of a discrete random variable X satisfies the
following two conditions:
1. f(x) ≥ 0
2. Σ_x f(x) = 1, where the summation is taken over all possible values of x.

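Example 5: A shipment of 20 similar laptop computers to a retail outlet contains 3 that are defective. If a
school makes a random purchase of 2 of these computers, find the probability distribution for the number of
defectives. Check that f(x) defines a probability distribution.
Solution: Let X be a random variable whose values x are the possible numbers of defective computers
purchased by the school. Then x can only take the values 0, 1, and 2. Now
f(0) = P(X = 0) = (3C0 × 17C2) / 20C2 = 136/190 = 68/95,
f(1) = P(X = 1) = (3C1 × 17C1) / 20C2 = 51/190,
f(2) = P(X = 2) = (3C2 × 17C0) / 20C2 = 3/190.
Each value is non-negative and f(0) + f(1) + f(2) = 190/190 = 1, so f(x) satisfies both conditions.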
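
One way to tabulate the distribution in Example 5 is sketched below. It assumes the 2 computers are drawn without replacement, so that every pair of computers is equally likely and the counting rules of Section 3.5 apply; the function name pmf_defectives is introduced here for illustration.

```python
from fractions import Fraction
from math import comb


def pmf_defectives(x, total=20, defective=3, drawn=2):
    """P(X = x): x defectives among `drawn` computers picked without replacement."""
    favourable = comb(defective, x) * comb(total - defective, drawn - x)
    return Fraction(favourable, comb(total, drawn))


f = {x: pmf_defectives(x) for x in range(3)}
for x, p in f.items():
    print(x, p)

# Both conditions for a probability distribution:
print(all(p >= 0 for p in f.values()), sum(f.values()) == 1)
```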

5.3. Continuous Random Variables
A r.v. X is called continuous (or of the continuous type) if X takes all values in a proper interval I ⊆ ℝ.
Alternatively, a continuous random variable can be described as one that
• takes whole or fractional values,
• is obtained by measuring,
• takes an infinite number of values in an interval.
Example 7: Recording the lifetime of an electronic device, or of an electrical appliance. Here S is the
interval (0, T) or, for some justifiable reasons, S = (0, ∞); a r.v. X of interest is X(s) = s, s ∈ S. The
random variable defined here is a continuous r.v.

Example 8: The following are examples of continuous r.v.s:

Experiment                       Random Variable X     Variable values
Weigh 100 people                 Weight                45.1, 78, ...
Measure part life                Hours                 900, 875.9, …
Measure time between arrivals    Inter-arrival time    0, 1.3, 2.78, ...

5.3.1. Probability Density Function of Continuous Random Variables


A function with values f(x), defined over the set of all real numbers, is called a probability density function
of the continuous random variable X if and only if
P(a ≤ X ≤ b) = ∫_a^b f(x) dx for any real constants a ≤ b.

Probability density functions are also referred to as probability densities (p.d.f.), probability functions, or
simply densities.
Remarks
• The probability density function f(x) of the continuous random variable X has the following
properties (satisfies the conditions):
1. f(x) ≥ 0 for all x, −∞ < x < ∞
2. ∫_{−∞}^{∞} f(x) dx = 1

• If X is a continuous random variable and a and b are real constants with a ≤ b, then
P(a ≤ X ≤ b) = P(a < X ≤ b) = P(a ≤ X < b) = P(a < X < b)

Example 9: Suppose that the r.v. X is continuous with the pdf
f(x) = 2x, 0 < x < 1,
f(x) = 0, otherwise.

a) Check that f(x) is a pdf.

b) Find P(X ≤ 0.5).

c) Evaluate P(X ≤ 1/2 given that 1/3 ≤ X ≤ 2/3).

Solution: a) Obviously, for 0 < x < 1, f(x) > 0, and

∫_{−∞}^{∞} f(x) dx = ∫_0^1 f(x) dx = ∫_0^1 2x dx = [x^2]_0^1 = 1.

Hence, f(x) is the pdf of some r.v. X. Note that ∫_{−∞}^{∞} f(x) dx = ∫_0^1 f(x) dx, since f(x) is zero in the
other two intervals: (−∞, 0] and [1, ∞).

b) P(X ≤ 0.5) = ∫_{−∞}^{0.5} f(x) dx = ∫_0^{0.5} 2x dx = [x^2]_0^{0.5} = 0.25.

c) Let A = {X ≤ 1/2} and B = {1/3 ≤ X ≤ 2/3}, so that A ∩ B = {1/3 ≤ X ≤ 1/2}.

Then P(X ≤ 1/2 | 1/3 ≤ X ≤ 2/3) = P(A|B) = P(A ∩ B) / P(B), where

P(A ∩ B) = ∫_{1/3}^{1/2} 2x dx = 1/4 − 1/9 = 5/36 and P(B) = ∫_{1/3}^{2/3} 2x dx = 4/9 − 1/9 = 1/3.

∴ P(A|B) = (5/36) / (1/3) = (5/36) × 3 = 5/12.
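
The three parts of Example 9 can be checked numerically. The sketch below uses a simple midpoint rule written out by hand (an approximation method chosen here for illustration, not the method of the notes); for a linear density such as 2x the midpoint rule is exact up to floating-point rounding.

```python
def f(x):
    """The pdf of Example 9: 2x on (0, 1) and 0 elsewhere."""
    return 2 * x if 0 < x < 1 else 0.0


def integrate(func, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    width = (b - a) / steps
    return sum(func(a + (i + 0.5) * width) for i in range(steps)) * width


total_area = integrate(f, 0, 1)                  # part (a): should be 1
p_half = integrate(f, 0, 0.5)                    # part (b): P(X ≤ 0.5)
p_joint = integrate(f, 1 / 3, 1 / 2)             # P(A ∩ B)
p_given = p_joint / integrate(f, 1 / 3, 2 / 3)   # part (c): P(A|B)
print(total_area, p_half, p_given)
```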

5.4. Cumulative distribution function and its properties


The cumulative distribution function, or the distribution function, for a random variable X is a function
defined by: F(x) = P(X ≤ x)

where x is any real number, i.e., −∞ < x < ∞. Thus, the distribution function specifies, for all real values x,
the probability that the random variable is less than or equal to x.

Properties of distribution functions, F(x)


1. 0 ≤ F(x) ≤ 1 for all x in ℝ
2. F(x) is non-decreasing [i.e., F(x) ≤ F(y) if x ≤ y].
3. F(x) is continuous from the right [i.e., lim_{h→0+} F(x + h) = F(x) for all x].
4. lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1.

5.4.1.Distribution Functions for Discrete Random Variables


If X is a discrete random variable, the function given by

F(x) = P(X ≤ x) = Σ_{t ≤ x} f(t), for all x in ℝ,

where the sum is over the values t in the range of X with t ≤ x and f(t) is the value of the probability
distribution (p.m.f.) of X at t, is called the distribution function, or the cumulative distribution function, of X.
If X takes on only a finite number of values x1 < x2 < . . . < xn, then the distribution function is given by:

F(x) = 0,                            −∞ < x < x1
     = f(x1),                        x1 ≤ x < x2
     = f(x1) + f(x2),                x2 ≤ x < x3
       . . .
     = f(x1) + . . . + f(xn) = 1,    xn ≤ x < ∞

Example 10: Find the cumulative distribution function of the random variable X, given that
f(0) = 1/16, f(1) = 1/4, f(2) = 3/8, f(3) = 1/4, and f(4) = 1/16. Therefore,
F(x) = 0,        x < 0
     = 1/16,     0 ≤ x < 1
     = 5/16,     1 ≤ x < 2
     = 11/16,    2 ≤ x < 3
     = 15/16,    3 ≤ x < 4
     = 1,        x ≥ 4
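
The cumulative sums for Example 10 can be built with a running total. In the sketch below, F is a step function that looks up how many support points are less than or equal to x; the use of bisect and accumulate is an implementation choice made here, not part of the notes.

```python
from bisect import bisect_right
from fractions import Fraction
from itertools import accumulate

values = [0, 1, 2, 3, 4]
pmf = [Fraction(1, 16), Fraction(1, 4), Fraction(3, 8),
       Fraction(1, 4), Fraction(1, 16)]
cumulative = list(accumulate(pmf))   # f(x1), f(x1) + f(x2), ...


def F(x):
    """F(x) = P(X ≤ x) for the discrete distribution above."""
    k = bisect_right(values, x)      # number of support points ≤ x
    return cumulative[k - 1] if k else Fraction(0)


for x in (-1, 0, 1.5, 2, 3, 10):
    print(x, F(x))
```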

5.4.2.Distribution Functions of Continuous Random Variables
If X is a continuous random variable and the value of its probability density at t is f(t), then the function given by
F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt, −∞ < x < ∞,
is called the distribution function, or the cumulative distribution function, of the continuous r.v. X.

Theorem: If f(x) and F(x) are the values of the probability density and the distribution function of X at x,
then P(a ≤ X ≤ b) = F(b) − F(a)
for any real constants a and b with a ≤ b, and f(x) = dF(x)/dx where the derivative exists.

Example 11: (a) Find the constant c such that the function f(x) is the density function of a r.v. X, where f(x)
is given by
f(x) = c x^2, 0 < x < 3
f(x) = 0, otherwise
(b) Compute P(1 < X < 2).

Solution: a) P(0 ≤ X ≤ 3) = ∫_0^3 f(x) dx = ∫_0^3 c x^2 dx = 1
c [x^3/3]_0^3 = 1, so 27c/3 = 9c = 1 and c = 1/9.

b) P(1 < X < 2) = P(1 ≤ X ≤ 2) = ∫_1^2 f(x) dx = ∫_1^2 c x^2 dx = c [x^3/3]_1^2 = (1/27)(8 − 1) = 7/27
6. Functions of Random Variables
In standard statistical methods, the result of statistical hypothesis testing, estimation, or even statistical
graphics does not involve a single random variable but, rather, functions of one or more random variables.
As a result, statistical inference requires the distributions of these functions. In many situations in statistics,
it is therefore necessary to derive the probability distribution of a function of one or more
random variables.

6.1. Equivalent Events


Let E be an experiment with sample space S, let X be a random variable defined on S, and let Rx be its
range space. Let B be an event with respect to Rx, that is, B ⊆ Rx. If A is defined as A = {s ∈ S: X(s) ∈ B},
then A and B are called equivalent events: although they are defined on different spaces, one occurs if and
only if the other occurs. Similarly, let Y be a function of X, so that Y is also a random variable, with range
space Ry. If C ⊆ Ry and B ⊆ Rx is defined as B = {x ∈ Rx: Y(x) ∈ C}, then the events B and C are
equivalent events.

Example 1: In tossing two coins the sample space S = {HH, HT, TH, TT}. Let the random variable X =
Number of heads, Rx = {0, 1, 2}. Let B ⊆ Rx and B = {1}. Moreover X (HT) = X (TH) = 1. If A =
{HT, TH} then A and B are equivalent events.

Example 2: Let X be a discrete random variable giving the score on a die and let Y = X^2; then Y is a discrete
random variable since X is discrete. The range space of X is Rx = {1, 2, 3, 4, 5, 6} and the
range space of Y is Ry = {1, 4, 9, 16, 25, 36}. Now,
{Y = 4} is equivalent to {X = 2}
{Y < 9} is equivalent to {X < 3}
{Y ≤ 25} is equivalent to {X ≤ 5}, etc.

Let B be an event in the range space Rx of the random variable X. We define P(B) as P(B) = P(A), where A =
{s ∈ S: X(s) ∈ B}. From this definition, we see that if two events are equivalent then their probabilities are
equal.
6.2. Functions of discrete random variables
If X is a discrete random variable and Y is a function of X, then it follows immediately that Y
is also discrete. Suppose that X is a discrete random variable with probability distribution p(x).
Let Y = g(X) define a one-to-one transformation between the values of X and Y so that the equation y = g(x)
can be uniquely solved for x in terms of y, say x = w(y). Then the probability distribution of Y is p(y) =
p[w(y)].

Example: Let X be a random variable with probability distribution p(x) = (3/4)(1/4)^(x−1), x = 1, 2, 3, . . . . Find
the probability distribution of the random variable Y = X^2.

Solution: Since the values of X are all positive, the transformation defines a one-to-one correspondence
between the x and y values, y = x^2 and x = √y. Hence p(y) = p(√y) = (3/4)(1/4)^(√y − 1), y = 1, 4, 9, . . . , and
0 elsewhere.
Example: If X is the number of heads obtained in four tosses of a balanced coin, find the probability
distribution of H(X) = 1/(1 + X).

Solution: The sample space is S = {HHHH, HHHT, HHTH, HTHH, THHH, HHTT, HTHT, HTTH, TTHH,
THTH, THHT, HTTT, TTTH, TTHT, THTT, TTTT}. Counting the outcomes with x heads gives

x       0      1      2      3      4
p(x)    1/16   4/16   6/16   4/16   1/16

Then, using the relation y = 1/(1 + x) to substitute values of Y for values of X, we find the
probability distribution of Y:

y       1      1/2    1/3    1/4    1/5
p(y)    1/16   4/16   6/16   4/16   1/16
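
The substitution step in this example can be checked mechanically. The following is a minimal sketch, not part of the original notes; the names `p_x` and `p_y` are illustrative, and exact fractions are assumed so that no rounding occurs.

```python
from fractions import Fraction
from math import comb

# pmf of X = number of heads in four tosses of a balanced coin
p_x = {x: Fraction(comb(4, x), 16) for x in range(5)}

# Y = 1/(1 + X); probabilities of x values that map to the same y are added
p_y = {}
for x, prob in p_x.items():
    y = Fraction(1, 1 + x)
    p_y[y] = p_y.get(y, Fraction(0)) + prob

for y in sorted(p_y, reverse=True):
    print(y, p_y[y])
```

Because the map from x to y is one-to-one here, each probability simply carries over; the accumulation in the loop would also handle a transformation that is not one-to-one.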

6.3. Functions of continuous random variables


A straightforward method of obtaining the probability density function of a function of a continuous random
variable consists of first finding its distribution function and then obtaining the probability density by
differentiation. Thus, if X is a continuous random variable with probability density f(x), then the probability
density of Y = H(X) is obtained by first determining an expression for the probability
G(y) = P(Y ≤ y) = P(H(X) ≤ y) and then differentiating:
$g(y) = \frac{dG(y)}{dy}$

Finally, determine the values of y for which g(y) > 0.

To find the probability distribution of the random variable Y = g(X) when X is a continuous random variable
and the transformation is one-to-one, we need the following result.

Suppose that X is a continuous random variable with probability distribution f(x). Let Y = g(X) define a
one-to-one correspondence between the values of X and Y so that the equation y = g(x) can be uniquely solved for
x in terms of y, say x = w(y). Then the probability distribution of Y is f(y) = f[w(y)]|J|, where J = w'(y) is
called the Jacobian of the transformation.

Remarks
• Suppose that X1 and X2 are discrete random variables with joint probability distribution p(x1, x2). Let
Y1 = g1(X1, X2) and Y2 = g2(X1, X2) define a one-to-one transformation between the points (x1, x2) and
(y1, y2) so that the equations y1 = g1(x1, x2) and y2 = g2(x1, x2) may be uniquely solved for x1 and x2 in
terms of y1 and y2, say x1 = w1(y1, y2) and x2 = w2(y1, y2). Then the joint probability distribution of Y1 and
Y2 is g(y1, y2) = p[w1(y1, y2), w2(y1, y2)].

To find the joint probability distribution of the random variables Y1 = g1(X1, X2) and Y2 = g2(X1, X2) when X1 and
X2 are continuous and the transformation is one-to-one, we need an additional result:

• Suppose that X1 and X2 are continuous random variables with joint probability distribution f(x1, x2).
Let Y1 = g1(X1, X2) and Y2 = g2(X1, X2) define a one-to-one transformation between the points (x1, x2) and
(y1, y2) so that the equations y1 = g1(x1, x2) and y2 = g2(x1, x2) may be uniquely solved for x1 and x2 in
terms of y1 and y2, say x1 = w1(y1, y2) and x2 = w2(y1, y2). Then the joint probability distribution of Y1 and
Y2 is g(y1, y2) = f[w1(y1, y2), w2(y1, y2)]|J|, where the Jacobian is the 2 × 2 determinant

$J = \begin{vmatrix} \dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_1}{\partial y_2} \\ \dfrac{\partial x_2}{\partial y_1} & \dfrac{\partial x_2}{\partial y_2} \end{vmatrix}$

and $\frac{\partial x_1}{\partial y_1}$ is simply the derivative of x1 = w1(y1, y2) with respect to y1 holding y2 constant, that is, the partial
derivative of x1 with respect to y1. The other partial derivatives are defined in a similar manner.

Example: Let X be a continuous random variable with probability distribution

$f(x) = \begin{cases} \dfrac{x}{12}, & 1 < x < 5 \\ 0, & \text{elsewhere} \end{cases}$

Find the probability distribution of the random variable Y = 2X − 3.
Solution: The inverse solution of y = 2x − 3 yields x = (y + 3)/2, from which we obtain
J = w'(y) = dx/dy = 1/2. Therefore, we find the density function of Y to be
$f(y) = \frac{y+3}{24} \cdot \frac{1}{2} = \frac{y+3}{48}$, −1 < y < 7, and 0 elsewhere.
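
A quick numerical sanity check of this change of variable, offered as a sketch only: the helper `integrate` is a plain midpoint rule written for this illustration (it is not from the notes), and the function names are assumptions.

```python
def f_x(x):
    """Density of X from the example: x/12 on (1, 5), 0 elsewhere."""
    return x / 12 if 1 < x < 5 else 0.0

def g_y(y):
    """Density of Y = 2X - 3 via f(w(y)) |J| with w(y) = (y + 3)/2, J = 1/2."""
    return f_x((y + 3) / 2) * 0.5

def integrate(func, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / n
    return sum(func(a + (i + 0.5) * h) for i in range(n)) * h

total = integrate(g_y, -1, 7)       # a density should integrate to about 1
p_y_le_3 = integrate(g_y, -1, 3)    # P(Y <= 3)
p_x_le_3 = integrate(f_x, 1, 3)     # P(X <= 3), the equivalent event
```

The two probabilities agree because {Y ≤ 3} and {X ≤ 3} are equivalent events under y = 2x − 3.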

Example: Let X1 and X2 be two continuous random variables with joint probability distribution
f(x1, x2) = 4x1x2, 0 < x1 < 1, 0 < x2 < 1, and 0 elsewhere. Find the joint probability
distribution of Y1 = X1² and Y2 = X1X2.
Solution: The inverse solutions of y1 = x1² and y2 = x1x2 are $x_1 = \sqrt{y_1}$ and $x_2 = \frac{y_2}{\sqrt{y_1}}$, from which we obtain $J = \frac{1}{2y_1}$.

Finally, from the above definition the joint probability distribution of Y1 and Y2 is $g(y_1, y_2) = \frac{2y_2}{y_1}$, y2² < y1 < 1,
0 < y2 < 1, and 0 elsewhere.

7. Two or More Dimension Random Variables
7.1. Definitions of Two-dimensional Random Variables
We are often interested simultaneously in two outcomes rather than one. With each of these
outcomes a random variable is associated, so we have two random variables, or a 2-dimensional
random vector, denoted by (X, Y).
o Let (X, Y) be a two-dimensional random variable. (X, Y) is called a two-dimensional discrete random
variable if the possible values of (X, Y) are finite or countably infinite. That is, the possible values of
(X, Y) may be represented as (xi, yj), i = 1, 2, …, n, … and j = 1, 2, …, m, ….

o Let (X, Y) be a two-dimensional random variable. (X, Y) is called a two-dimensional continuous
random variable if (X, Y) can assume all values in some uncountable set of the
Euclidean plane. For example, (X, Y) can assume values in a rectangle {(x, y): a ≤ x ≤ b and c ≤ y ≤ d} or in
a circle {(x, y): x² + y² ≤ 1}.

7.2. Joint Probability Distribution

If X and Y are two random variables, the probability distribution for their simultaneous occurrence can be
represented by a function with values p(x, y) for any pair of values (x, y) within the range of the random
variables X and Y. It is customary to refer to this function as the joint probability distribution of X and Y.

o Let (X, Y) be a two-dimensional discrete random variable, that is, the possible values of (X, Y) may
be represented as (xi, yj), i = 1, 2, …, n, … and j = 1, 2, …, m, …. In the discrete case, p(x, y)
= P(X = x, Y = y); that is, the values p(x, y) give the probability that outcomes x and y occur at the
same time. The function p(x, y) is a joint probability distribution or probability mass function
of the discrete random variables X and Y if:
1. p(xi, yj) ≥ 0 for all (xi, yj)
2. $\sum_{x}\sum_{y} p(x, y) = 1$

Example: Two ballpoint pens are selected at random from a box that contains 3 blue pens, 2 red pens, and 3
green pens. If X is the number of blue pens selected and Y is the number of red pens selected, then find the
joint probability mass function p(x, y) and verify that it is a pmf.

Solution: The possible pairs of values (x, y) are (0, 0), (0, 1), (1, 0), (1, 1), (0, 2), and (2, 0). Now p(0, 1), for
example, represents the probability that a red and a green pen are selected. The total number of equally
likely ways of selecting any 2 pens from the 8 is $\binom{8}{2} = 28$. The number of ways of selecting 1 red from 2 red
pens and 1 green from 3 green pens is $\binom{2}{1}\binom{3}{1} = 6$. Hence, p(0, 1) = 6/28 = 3/14. Similar calculations yield the
probabilities for the other cases, which are presented in the following table.

Joint Probability Distribution (rows: y, columns: x)
y \ x      0       1       2       py(y)
0          3/28    9/28    3/28    15/28
1          3/14    3/14    0       3/7
2          1/28    0       0       1/28
px(x)      5/14    15/28   3/28    1

All entries are non-negative and the probabilities sum to 1, which shows that p(x, y) is a probability mass
function. Note that the joint probability mass function in the table can be represented by the formula
$p(x, y) = \dfrac{\binom{3}{x}\binom{2}{y}\binom{3}{2-x-y}}{\binom{8}{2}}$, for x = 0, 1, 2; y = 0, 1, 2; and 0 ≤ x + y ≤ 2.
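
The counting formula for the pens example can be evaluated directly. This is a minimal sketch under the assumption that the formula above is used as stated; the function name `p` and the dictionary `joint` are illustrative.

```python
from fractions import Fraction
from math import comb

def p(x, y):
    """Joint pmf: x blue and y red pens among 2 drawn from 3 blue, 2 red, 3 green."""
    if x < 0 or y < 0 or x + y > 2:
        return Fraction(0)
    return Fraction(comb(3, x) * comb(2, y) * comb(3, 2 - x - y), comb(8, 2))

# Tabulate p(x, y) on the grid x, y in {0, 1, 2} and check the two pmf conditions
joint = {(x, y): p(x, y) for x in range(3) for y in range(3)}
total = sum(joint.values())
all_nonnegative = all(prob >= 0 for prob in joint.values())
```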

Example: Consider two discrete random variables, X and Y, where x = 1 or x = 2, and y = 0 or y = 1. The
bivariate probability mass function for X and Y is defined as $p(x, y) = \frac{0.25 + x - y}{5}$.
Tabulate the joint probability function and verify that the properties of a discrete joint
probability mass function are satisfied.

Solution: Since X takes on two values (1 or 2) and Y takes on two values (0 or 1), there are 2 × 2 = 4 possible
combinations of X and Y. These four (x, y) pairs are (1, 0), (1, 1), (2, 0), and (2, 1). Substituting these
possible values of X and Y into the formula for p(x, y), we obtain the following joint probabilities.

           x = 1   x = 2
y = 0      0.25    0.45
y = 1      0.05    0.25

The probabilities sum to 1 and all values are non-negative, which shows that p(x, y) is a probability mass function.
o Let (X, Y) be a two-dimensional continuous random variable assuming all values in some region R of
the Euclidean plane; for example, (X, Y) can assume values in a rectangle {(x, y): a ≤ x ≤ b and c ≤ y ≤ d} or
in a circle {(x, y): x² + y² ≤ 1}. Then the function f(x, y) is a joint density function of the
continuous random variables X and Y if:
1) f(x, y) ≥ 0 for all (x, y) ∈ R and
2) $\iint_R f(x, y)\,dx\,dy = 1$
Example: The joint probability density function of two continuous random variables X and Y is given by
f(x, y) = c(2x + y), where x and y can assume all values such that 0 ≤ x ≤ 2, 0 ≤ y ≤ 3, and f(x, y) = 0
otherwise.
a) Find the value of the constant c.
b) Find P(X ≤ 2, Y ≤ 1).
Solution: (a) Since the density must integrate to 1,
$\int_0^3\int_0^2 c(2x + y)\,dx\,dy = c\int_0^3 [x^2 + yx]_0^2\,dy = c\int_0^3 (4 + 2y)\,dy = c[4y + y^2]_0^3 = 21c = 1$, so c = 1/21.
(b) $P(X \le 2, Y \le 1) = \frac{1}{21}\int_0^1\int_0^2 (2x + y)\,dx\,dy = \frac{1}{21}\int_0^1 [x^2 + yx]_0^2\,dy$
$= \frac{1}{21}\int_0^1 (4 + 2y)\,dy = \frac{1}{21}[4y + y^2]_0^1 = \frac{1}{21}(4 + 1) = \frac{5}{21}$
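
The normalising constant and the probability in part (b) can be cross-checked numerically. The sketch below uses a hand-written midpoint rule (`double_integral` is a hypothetical helper for this illustration, not a library routine); for a linear integrand such as 2x + y the rule is exact up to rounding.

```python
def double_integral(func, x_lo, x_hi, y_lo, y_hi, n=400):
    """Midpoint-rule approximation of a double integral over a rectangle."""
    hx = (x_hi - x_lo) / n
    hy = (y_hi - y_lo) / n
    total = 0.0
    for i in range(n):
        x = x_lo + (i + 0.5) * hx
        for j in range(n):
            y = y_lo + (j + 0.5) * hy
            total += func(x, y)
    return total * hx * hy

# Integral of (2x + y) over 0 <= x <= 2, 0 <= y <= 3; c is its reciprocal
unnormalised = double_integral(lambda x, y: 2 * x + y, 0, 2, 0, 3)
c = 1 / unnormalised

# P(X <= 2, Y <= 1) with the normalised density
prob = double_integral(lambda x, y: c * (2 * x + y), 0, 2, 0, 1)
```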

A function closely related to the probability distribution is the cumulative distribution function (CDF). If (X,
Y) is a two-dimensional random variable, then the cumulative distribution function is defined as follows.

Let (X, Y) be a two-dimensional discrete random variable. Then the joint distribution function, or joint
cumulative distribution function (CDF), of (X, Y) is defined by
$F(x, y) = P(X \le x, Y \le y) = \sum_{s \le x}\sum_{t \le y} p(s, t)$,
for −∞ < x < ∞ and −∞ < y < ∞, where p(s, t) is the joint probability mass function
of (X, Y) at (s, t).

Let (X, Y) be a two-dimensional continuous random variable. Then the joint distribution function, or joint
cumulative distribution function (CDF), of (X, Y) is defined by
$F(x, y) = P(X \le x, Y \le y) = \int_{-\infty}^{y}\int_{-\infty}^{x} f(s, t)\,ds\,dt$,
for −∞ < x < ∞ and −∞ < y < ∞, where f(s, t) is the joint probability density function of (X, Y)
at (s, t).

Remark:
If F(x, y) is the joint cumulative distribution function of a two-dimensional random variable (X, Y) with joint
p.d.f. f(x, y), then $f(x, y) = \dfrac{\partial^2 F(x, y)}{\partial x\,\partial y}$.

7.3. Marginal Probability Distributions and Conditional Probability Distributions


Marginal Probability Distributions
With a two-dimensional random variable (X, Y) we associate two one-dimensional random variables, X and Y.
Sometimes we may be interested in the probability distribution of X or Y alone. Given the joint probability
distribution p(x, y) of the discrete random variables X and Y, the probability distribution px(x) of X alone is
obtained by summing p(x, y) over the values of Y. Similarly, the probability distribution py(y) of Y alone is
obtained by summing p(x, y) over the values of X. We define px(x) and py(y) to be the marginal
distributions of X and Y, respectively. When X and Y are continuous random variables, summations are
replaced by integrals.

If X and Y are discrete random variables and p(x, y) is the value of their joint probability
mass function at (x, y), the function given by $p_x(x) = \sum_{y} p(x, y)$ for each x within the range of X is called the
marginal distribution of X. Similarly, the function given by $p_y(y) = \sum_{x} p(x, y)$ for each y within the range of
Y is called the marginal distribution of Y.

The term marginal is used here because, in the discrete case, the values of px(x) and py(y) are just the marginal
totals of the respective columns and rows when the values of p(x, y) are displayed in a rectangular table.

Example: Consider two discrete random variables X and Y with the joint probability mass function

           x = 1   x = 2
y = 0      0.25    0.45
y = 1      0.05    0.25

Construct the marginal probability mass functions of X and Y.

Solution:
x        1      2      Total
px(x)    0.3    0.7    1

y        0      1      Total
py(y)    0.7    0.3    1

Example: Two ballpoint pens are selected at random from a box that contains 3 blue pens, 2 red pens, and 3
green pens. If X is the number of blue pens selected and Y is the number of red pens selected, the
joint probability mass function p(x, y) is as shown below. Verify that the column and row totals are
the marginal distributions of X and Y, respectively.

y \ x      0       1       2
0          3/28    9/28    3/28
1          3/14    3/14    0
2          1/28    0       0

Solution:
x        0       1       2       Total
px(x)    5/14    15/28   3/28    1

y        0       1       2       Total
py(y)    15/28   3/7     1/28    1
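
The row and column totals of the pens table can be reproduced with a short routine. This is an illustrative sketch; `marginals` is a hypothetical helper name, and the table is keyed as (x, y) with exact fractions.

```python
from fractions import Fraction as F

# joint[(x, y)] = p(x, y) for the pens example (x blue pens, y red pens)
joint = {
    (0, 0): F(3, 28), (1, 0): F(9, 28), (2, 0): F(3, 28),
    (0, 1): F(3, 14), (1, 1): F(3, 14), (2, 1): F(0),
    (0, 2): F(1, 28), (1, 2): F(0),     (2, 2): F(0),
}

def marginals(joint_pmf):
    """Return (p_x, p_y) by summing the joint pmf over the other variable."""
    p_x, p_y = {}, {}
    for (x, y), prob in joint_pmf.items():
        p_x[x] = p_x.get(x, 0) + prob
        p_y[y] = p_y.get(y, 0) + prob
    return p_x, p_y

p_x, p_y = marginals(joint)
```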

If X and Y are continuous random variables and f(x, y) is the value of their joint probability
density function at (x, y), the function given by $f_x(x) = \int_{-\infty}^{\infty} f(x, y)\,dy$ for −∞ < x < ∞ is called the marginal
distribution of X. Similarly, the function given by $f_y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx$ for −∞ < y < ∞ is called the marginal
distribution of Y.

Remark
The fact that the marginal distributions px(x) and py(y) are indeed the probability distributions of the
individual variables X and Y alone can be verified by showing that the conditions of probability distributions
stated in the one-dimensional case are satisfied.

Conditional Probability Distributions


In the one-dimensional case, we stated that the value x of the random variable X represents an
event that is a subset of the sample space. If we use the definition of conditional probability as stated in the
previous chapter, $P(B \mid A) = \frac{P(A \cap B)}{P(A)}$, provided P(A) > 0, where A and B are now the events defined by X = x and
Y = y, respectively, then

$P(Y = y \mid X = x) = \frac{P(X = x, Y = y)}{P(X = x)} = \frac{p(x, y)}{p_x(x)}$, provided px(x) > 0, where X and Y are discrete random variables. It is
clear that the function p(x, y)/px(x), which is strictly a function of y with x fixed, satisfies all the conditions of a
probability distribution. This is also true when f(x, y) and fx(x) are the joint probability density function and
marginal distribution, respectively, of continuous random variables. As a result, distributions of the form
f(x, y)/fx(x) are what we use to compute conditional probabilities. This type of distribution is called a
conditional probability distribution; the formal definitions are given as follows.

o The probability of the numerical event X = x, given that the event Y = y occurred, is the conditional
probability of X given Y = y. A table, graph or formula that gives these probabilities for all values of X is
called the conditional probability distribution of X given Y and is denoted by the symbol p(x/y).

Therefore, let X and Y be discrete random variables and let p(x, y) be their joint probability mass function.
Then the conditional probability distribution of X given that Y = y is defined as $p(x/y) = \frac{p(x, y)}{p_y(y)}$, provided py(y) > 0.

Similarly, the conditional probability distribution of Y given that X = x is defined as $p(y/x) = \frac{p(x, y)}{p_x(x)}$, provided
px(x) > 0.

Again, let X and Y be continuous random variables and let f(x, y) be their joint probability density function.
Then the conditional probability distribution of X given that Y = y is defined as $f(x/y) = \frac{f(x, y)}{f_y(y)}$, provided fy(y) > 0.

Similarly, the conditional probability distribution of Y given that X = x is defined as $f(y/x) = \frac{f(x, y)}{f_x(x)}$, provided
fx(x) > 0.

Example: The joint probability mass function of two discrete random variables X and Y is given by p(x, y)
= cxy for x = 1, 2, 3 and y = 1, 2, 3, and zero otherwise. Find the conditional probability
distribution of X given Y and of Y given X.
Solution: First, $\sum_x\sum_y cxy = c(1\cdot1 + 1\cdot2 + \dots + 3\cdot2 + 3\cdot3) = 36c = 1$, so c = 1/36 and p(x, y) = xy/36.
The marginal distributions are $p_y(y) = \sum_x \frac{xy}{36} = \frac{y}{6}$ and $p_x(x) = \sum_y \frac{xy}{36} = \frac{x}{6}$.
Therefore, $p(x/y) = \frac{p(x, y)}{p_y(y)} = \frac{xy/36}{y/6} = \frac{x}{6}$, x = 1, 2, 3, and $p(y/x) = \frac{p(x, y)}{p_x(x)} = \frac{xy/36}{x/6} = \frac{y}{6}$, y = 1, 2, 3.

Example: A software program is designed to perform two tasks, A and B. Let X represent the number of
IF-THEN statements in the code for task A and let Y represent the number of IF-THEN statements in the
code for task B. The joint probability distribution p(x, y) for the two discrete random variables is
given in the accompanying table.

                          x
           0       1       2       3       4       5
y   0      0.000   0.050   0.025   0.000   0.025   0.000
    1      0.200   0.050   0.000   0.300   0.000   0.000
    2      0.100   0.000   0.000   0.000   0.100   0.150

Find the conditional probability of X = 0 given Y = 1 and of Y = 2 given X = 5.
Solution: $p(X = 0 / Y = 1) = \frac{p(x = 0, y = 1)}{p_y(y = 1)} = \frac{0.2}{0.55} = \frac{4}{11}$, and
$p(Y = 2 / X = 5) = \frac{p(x = 5, y = 2)}{p_x(x = 5)} = \frac{0.15}{0.15} = 1$.
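
The conditional probabilities for the IF-THEN table follow from dividing a cell by its row or column total. The sketch below is illustrative only; the list layout (`joint[y][x]`) and both function names are assumptions made for this example.

```python
# joint[y][x] = p(x, y) from the table: rows y = 0, 1, 2; columns x = 0, ..., 5
joint = [
    [0.000, 0.050, 0.025, 0.000, 0.025, 0.000],
    [0.200, 0.050, 0.000, 0.300, 0.000, 0.000],
    [0.100, 0.000, 0.000, 0.000, 0.100, 0.150],
]

def cond_x_given_y(x, y):
    """p(x | y) = p(x, y) / p_y(y), where p_y(y) is the row total."""
    p_y = sum(joint[y])
    if p_y == 0:
        raise ValueError("conditioning event has probability zero")
    return joint[y][x] / p_y

def cond_y_given_x(y, x):
    """p(y | x) = p(x, y) / p_x(x), where p_x(x) is the column total."""
    p_x = sum(row[x] for row in joint)
    if p_x == 0:
        raise ValueError("conditioning event has probability zero")
    return joint[y][x] / p_x

print(cond_x_given_y(0, 1))  # p(X = 0 | Y = 1)
print(cond_y_given_x(2, 5))  # p(Y = 2 | X = 5)
```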

Example: The joint density function for the random variables (X, Y), where X is the unit temperature change
and Y is the proportion of spectrum shift that a certain atomic particle produces, is f(x, y) = 10xy², 0 <
x < y < 1, and 0 elsewhere.
(a) Construct the conditional probability distribution of Y given X.
(b) Find the probability that the spectrum shifts more than half of the total observations, given that the
temperature is increased by 0.25 units.
Solution: (a) The marginal density of X is $f_x(x) = \int_x^1 10xy^2\,dy = \frac{10}{3}x(1 - x^3)$, 0 < x < 1, so
$f(y/x) = \frac{f(x, y)}{f_x(x)} = \frac{10xy^2}{\int_x^1 10xy^2\,dy} = \frac{10xy^2}{\frac{10}{3}x(1 - x^3)} = \frac{3y^2}{1 - x^3}$, 0 < x < y < 1.
(b) $P(Y > 1/2 \mid X = 1/4) = \int_{1/2}^{1} f(y / x = 1/4)\,dy = \int_{1/2}^{1} \frac{3y^2}{1 - (1/4)^3}\,dy = \frac{8}{9}$.
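
Since the conditional density 3y²/(1 − x³) has the antiderivative y³/(1 − x³), the probability in part (b) can be evaluated exactly. A minimal sketch, with `cond_tail` as a hypothetical helper name:

```python
from fractions import Fraction

def cond_tail(y0, x):
    """P(Y > y0 | X = x) for f(y | x) = 3y^2 / (1 - x^3) on x < y < 1."""
    lower = min(max(y0, x), 1)          # the density is zero outside (x, 1)
    return (1 - lower**3) / (1 - x**3)

answer = cond_tail(Fraction(1, 2), Fraction(1, 4))
print(answer)
```

Passing `Fraction` arguments keeps the arithmetic exact; floats would also work, at the cost of rounding.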

7.4. Independent Random Variables

If the conditional probability distribution of X given Y does not depend on y, then the joint probability
distribution of X and Y becomes the product of the marginal distributions of X and Y. This makes sense:
if the conditional probability distribution of X given Y does not depend on y, then
the outcome of the random variable Y has no impact on the outcome of the random variable X. In other
words, we say that X and Y are independent random variables. We now offer the following formal definition
of statistical independence.

o Let X and Y be two discrete random variables with joint probability mass function p(x, y) and
marginal distributions px(x) and py(y), respectively. The random variables X and Y are said to be
statistically independent if and only if p(x, y) = px(x)py(y), for all (x, y) within their range.

o Let X and Y be two continuous random variables with joint probability density function f(x, y) and
marginal distributions fx(x) and fy(y), respectively. The random variables X and Y are said to be
statistically independent if and only if f(x, y) = fx(x)fy(y), for all (x, y) within their range.

Note that, checking for statistical independence of discrete random variables requires a more thorough
investigation, since it is possible to have the product of the marginal distributions equal to the joint
probability distribution for some but not all combinations of (x, y). If you can find any point (x, y) for which
p(x, y) is defined such that p(x, y) ≠px(x)py(y), the discrete variables X and Y are not statistically independent.

Remark

 If we know the joint probability distribution of X and Y, we can find the marginal probability
distributions, but if we have the marginal probability distributions, we may not have the joint
probability distribution unless X and Y are statistically independent.

Theorem:
a) Let (X, Y) be a two-dimensional discrete random variable. Then X and Y are independent if and only
if p(xi | yj) = px(xi) for all i and j, and equivalently p(yj | xi) = py(yj) for all i and j.
b) Let (X, Y) be a two-dimensional continuous random variable. Then X and Y are independent if and
only if f(x | y) = fx(x) for all (x, y), and equivalently f(y | x) = fy(y) for all (x, y).

Example: Let X and Y be binary random variables; that is, 0 and 1 are the only possible outcomes for each of
X and Y. Suppose p(0, 0) = 0.3, p(1, 1) = 0.2, and the marginal probabilities of x = 0 and x = 1 are
0.6 and 0.4, respectively.
(a) Construct the joint probability mass function of X and Y;
(b) Calculate the marginal probability mass function of Y.
Solution: (a) Since px(0) = p(0, 0) + p(0, 1) = 0.6, we get p(0, 1) = 0.3; since px(1) = p(1, 0) + p(1, 1) = 0.4,
we get p(1, 0) = 0.2. (b) The marginal distribution of Y is given by the row totals.

           x = 0   x = 1   py(y)
y = 0      0.3     0.2     0.5
y = 1      0.3     0.2     0.5
px(x)      0.6     0.4     1

Example: Let X and Y be the life lengths of two electronic devices. Suppose that their joint p.d.f. is given
by $f(x, y) = \begin{cases} e^{-(x + y)}, & x \ge 0 \text{ and } y \ge 0 \\ 0, & \text{elsewhere} \end{cases}$ Are these two random variables independent?
Solution: If X and Y are independent, then the product of their marginal distributions should equal the
joint p.d.f. Integrating out the other variable gives $f_x(x) = e^{-x}$, x ≥ 0, and $f_y(y) = e^{-y}$, y ≥ 0.
Now $f_x(x) f_y(y) = e^{-x} e^{-y} = e^{-(x+y)} = f(x, y)$ for x ≥ 0, y ≥ 0, which implies that X and Y are statistically independent.
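
For discrete variables, the definition of independence can be tested point by point. The sketch below applies it to the binary example given earlier in this section (p(0, 0) = 0.3, p(1, 0) = 0.2, p(0, 1) = 0.3, p(1, 1) = 0.2); `is_independent` is a hypothetical helper written for this illustration.

```python
from fractions import Fraction as F

# joint[(x, y)] = p(x, y) for the binary example
joint = {(0, 0): F(3, 10), (1, 0): F(2, 10), (0, 1): F(3, 10), (1, 1): F(2, 10)}

def is_independent(joint_pmf):
    """True if p(x, y) == p_x(x) * p_y(y) at every point of the product range."""
    xs = {x for x, _ in joint_pmf}
    ys = {y for _, y in joint_pmf}
    p_x = {x: sum(joint_pmf.get((x, y), 0) for y in ys) for x in xs}
    p_y = {y: sum(joint_pmf.get((x, y), 0) for x in xs) for y in ys}
    return all(joint_pmf.get((x, y), 0) == p_x[x] * p_y[y] for x in xs for y in ys)

print(is_independent(joint))
```

Exact fractions are used so that the equality test is not affected by floating-point rounding; with float inputs a tolerance-based comparison would be the safer choice.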

8. Expectation
8.1. Expectation of a Random Variable
The data we analyze in engineering and the sciences often results from observing a process. Consequently,
we can describe process data with numerical descriptive measures, such as its mean and variance. Therefore,
the expectation of X is very often called the mean of X and is denoted by E(X). The mean, or expectation, of
the random variable X gives a single value that acts as a representative or average of the values of X, and for
this reason it is often called a measure of central tendency.

• Let X be a discrete random variable which takes values x1, . . . , xn with corresponding
probabilities P(X = xi) = p(xi), i = 1, . . . , n. Then the expectation of X (or mathematical expectation
or mean value of X) is denoted by E(X) and is defined as:

$E(X) = x_1 p(x_1) + \dots + x_n p(x_n) = \sum_{i=1}^{n} x_i\,p(x_i) = \sum_{x} x\,p(x)$

The last summation is taken over all appropriate values of x.

Example: A school class of 120 students is driven in 3 buses to a symphonic performance. There are 36
students in one of the buses, 40 in another, and 44 in the third bus. When the buses arrive, one of the
120 students is randomly chosen. Let X denote the number of students on the bus of that randomly
chosen student, and find E[X].

Solution: Since the randomly chosen student is equally likely to be any of the 120 students, it follows that:
P{X = 36} = 36/120, P{X = 40} = 40/120, P{X = 44} = 44/120.
Hence $E(X) = 36\cdot\frac{3}{10} + 40\cdot\frac{1}{3} + 44\cdot\frac{11}{30} = \frac{1208}{30} = 40.2667$.
30

Example: Let a fair die be rolled once. Find the mean number rolled, say X.

Solution: Since S = {1, 2, 3, 4, 5, 6} and all outcomes are equally likely with probability 1/6, we have

$E(X) = 1\cdot\frac{1}{6} + 2\cdot\frac{1}{6} + 3\cdot\frac{1}{6} + 4\cdot\frac{1}{6} + 5\cdot\frac{1}{6} + 6\cdot\frac{1}{6} = \frac{21}{6} = 3.5$.
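
Both expectations above are instances of the same weighted sum. A minimal sketch, assuming the pmf is supplied as a dictionary from value to probability (the name `expectation` is illustrative):

```python
from fractions import Fraction as F

def expectation(pmf):
    """E(X) = sum of x * p(x) over the support; pmf maps value -> probability."""
    assert sum(pmf.values()) == 1, "probabilities must sum to 1"
    return sum(x * p for x, p in pmf.items())

bus = {36: F(36, 120), 40: F(40, 120), 44: F(44, 120)}
die = {k: F(1, 6) for k in range(1, 7)}

e_bus = expectation(bus)   # 604/15, about 40.2667
e_die = expectation(die)   # 7/2
```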

Example: A lot of 12 TV sets includes two which are defectives. If two of the sets are chosen at random,
find the expected number of defective sets.

Solution: Let X = the number of defective sets.

Then the possible values of X are 0, 1, 2. Using the conditional probability rule, we get

$P(X = 0) = P(\text{both non-defective}) = \frac{10}{12}\cdot\frac{9}{11} = \frac{15}{22}$,

$P(X = 2) = P(\text{both defective}) = \frac{2}{12}\cdot\frac{1}{11} = \frac{1}{66}$,

$P(X = 1) = P(\text{first defective and second good}) + P(\text{first good and second defective})$
$= \frac{2}{12}\cdot\frac{10}{11} + \frac{10}{12}\cdot\frac{2}{11} = \frac{10}{66} + \frac{10}{66} = \frac{10}{33}$.

Or, since P(X = 0) + P(X = 1) + P(X = 2) = 1, we can use $P(X = 1) = 1 - \frac{45}{66} - \frac{1}{66} = \frac{10}{33}$.

$\therefore\; E(X) = \sum_{i=0}^{2} x_i P(X = x_i) = 0\cdot\frac{15}{22} + 1\cdot\frac{10}{33} + 2\cdot\frac{1}{66} = \frac{1}{3}$.

• The mathematical expectation of a continuous r-v is defined in a similar way to that of
a discrete r-v, except that summations are replaced by integrations over the specified
domain. Let the random variable X be continuous with p.d.f. f(x). Its expectation is defined by:

$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$, provided this integral exists.


1
 x 0  x 2
Example: The density function of a random variable X is given by: f ( x)   2 Then, find

0 otherwise

the expected value of X?


2 21
Solution: E(X) = ∫0 xf x dx = ∫0 2 x 2 dx = [1/6 x3]20 = 4/3.

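Example: Find the expected value of the random variable X with the CDF F(x) = x³, 0 < x < 1.
Solution: Differentiating the CDF gives the density f(x) = F'(x) = 3x², 0 < x < 1. Hence
$E(X) = \int_0^1 x f(x)\,dx = \int_0^1 3x^3\,dx = \left[\frac{3x^4}{4}\right]_0^1 = \frac{3}{4}$.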

8.2. Expectation of a Function of a Random Variable

The statistics that we will subsequently use for making inferences are computed from the data contained in a
sample. The sample measurements can be viewed as observations on n random variables, x1, x2, x3, …, xn.
Since the sample statistics are functions of the random variables x1, x2, x3, …, xn, they are also random
variables and possess probability distributions. To describe these distributions, we will define the
expected value (or mean) of functions of random variables.

Now let us consider a new random variable g(X), which depends on X; that is, each value of g(X) is
determined by the value of X. In particular, let X be a discrete random variable with probability function
p(x). Then Y = g(X) is also a discrete random variable, and the probability function of Y is

$p(y) = P(Y = y) = \sum_{\{x:\, g(x) = y\}} P(X = x) = \sum_{\{x:\, g(x) = y\}} p(x)$,

and hence we can define the expectation of functions of random variables as follows.

Let X be a random variable and let Y = g(X), then


(a) If X is a discrete random variable and p(xi) = P(X = xi) is the p.m.f., we will have

$E(Y) = E(g(X)) = \sum_{i \ge 1} g(x_i)\,p(x_i)$

(b) If X is a continuous random variable with p.d.f. f(x), we will have

$E(Y) = E(g(X)) = \int_{-\infty}^{\infty} g(x) f(x)\,dx$


The reader should note that the way to calculate the expected value, or mean, shown here is different from the
way to calculate the sample mean described in Introduction to Statistics, where the sample mean is obtained
by using data. For a random variable, the expected value is calculated by using the probability
distribution. In both cases, however, the mean is understood as a “center” value of the underlying
distribution.

Example: Suppose that a balanced die is rolled once. If X is the number that shows up, find the
expected value of g(X) = 2X² + 1.

Solution: Since each possible outcome has the probability 1/6, we get
$E(g(X)) = \sum_{x=1}^{6} (2x^2 + 1)\cdot\frac{1}{6} = (2\cdot 1^2 + 1)\cdot\frac{1}{6} + \dots + (2\cdot 6^2 + 1)\cdot\frac{1}{6} = \frac{94}{3}$.

8.2.1. Expectation of Two Dimensional Random Variables


Let X and Y be random variables with joint probability distribution p(x, y) [or f(x, y)] and let H = g(X, Y) be a
real-valued function of (X, Y). Then the expected values of the product XY and of g(X, Y)
are:

• $E(XY) = \sum_{x}\sum_{y} xy\,p(x, y)$ if X and Y are discrete random variables.

• $E(XY) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy\,f(x, y)\,dx\,dy$ if X and Y are continuous random variables.

• $E[g(X, Y)] = \sum_{x}\sum_{y} g(x, y)\,p(x, y)$ if X and Y are discrete random variables.

• $E[g(X, Y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y)\,f(x, y)\,dx\,dy$ if X and Y are continuous random variables.

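Example: If the joint probability density function of X and Y is given by

$f(x, y) = \begin{cases} \frac{2}{7}(x + 2y), & 0 < x < 1,\; 1 < y < 2 \\ 0, & \text{otherwise} \end{cases}$

find the expected value of g(X, Y) = X/Y.

Solution: $E[g(X, Y)] = \iint g(x, y) f(x, y)\,dx\,dy = \frac{2}{7}\int_1^2\int_0^1 \frac{x}{y}(x + 2y)\,dx\,dy = \frac{2}{7}\int_1^2\int_0^1 \left(\frac{x^2}{y} + 2x\right)dx\,dy$
$= \frac{2}{7}\int_1^2 \left[\frac{x^3}{3y} + x^2\right]_0^1 dy = \frac{2}{7}\int_1^2 \left(\frac{1}{3y} + 1\right)dy = \frac{2}{7}\left\{\frac{1}{3}(\ln 2 - \ln 1) + 1\right\}$
= 0.35172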
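
The value 0.35172 can be confirmed numerically. The sketch below is an illustration, not the method of the notes: `expect_g2` is a hypothetical midpoint-rule helper restricted to the support rectangle 0 < x < 1, 1 < y < 2 of this example.

```python
from math import log

def f(x, y):
    """Joint density from the example: (2/7)(x + 2y) on 0 < x < 1, 1 < y < 2."""
    return 2 / 7 * (x + 2 * y) if 0 < x < 1 and 1 < y < 2 else 0.0

def expect_g2(g, n=400):
    """Midpoint-rule approximation of E[g(X, Y)] over the support rectangle."""
    hx, hy = 1 / n, 1 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * hx
        for j in range(n):
            y = 1 + (j + 0.5) * hy
            total += g(x, y) * f(x, y)
    return total * hx * hy

numeric = expect_g2(lambda x, y: x / y)
closed_form = 2 / 7 * (log(2) / 3 + 1)   # the expression derived in the solution
```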
Remark
In calculating E(X) over a two-dimensional space, one may use either the joint probability distribution of X
and Y or the marginal distribution of X:

$E[X] = \sum_{x}\sum_{y} x\,p(x, y) = \sum_{x} x\,p_x(x)$ if X is a discrete random variable.

$E[X] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x f(x, y)\,dx\,dy = \int_{-\infty}^{\infty} x f_x(x)\,dx$ if X is a continuous random variable,

where px(x) [or fx(x)] is the marginal distribution of X. Similarly, we define

$E[Y] = \sum_{x}\sum_{y} y\,p(x, y) = \sum_{y} y\,p_y(y)$ if Y is a discrete random variable.

$E[Y] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} y f(x, y)\,dx\,dy = \int_{-\infty}^{\infty} y f_y(y)\,dy$ if Y is a continuous random variable,

where py(y) [or fy(y)] is the marginal distribution of the random variable Y.

8.3. Variance of a Random Variable and its properties


Let X be a random variable. The variance of X, denoted by V(X), Var(X) or σ²x, is defined as:
$V(X) = E[(X - E(X))^2] = E(X^2) - [E(X)]^2$,
where $E(X^2) = \sum_i x_i^2\,p(x_i)$ if X is discrete and $E(X^2) = \int_{-\infty}^{\infty} x^2 f(x)\,dx$ if X is continuous.

Note that the positive square root of V(X) is called the standard deviation of X and is denoted by σx. Unlike the
variance, the standard deviation is measured in the same units as X (and E(X)) and serves as a yardstick for
measuring deviations of X from E(X).

Example: Find the expected value and the variance of the r-v with density

$f(x) = \begin{cases} x, & 0 < x < 1 \\ 2 - x, & 1 < x < 2 \\ 0, & \text{elsewhere} \end{cases}$

Solution:
$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx = \int_0^1 x\cdot x\,dx + \int_1^2 x(2 - x)\,dx = \int_0^1 x^2\,dx + \int_1^2 (2x - x^2)\,dx$
$= \left[\frac{x^3}{3}\right]_0^1 + \left[x^2 - \frac{x^3}{3}\right]_1^2 = \frac{1}{3} + \left(4 - \frac{8}{3}\right) - \left(1 - \frac{1}{3}\right) = \frac{1}{3} + \frac{4}{3} - \frac{2}{3} = 1$.

$E(X^2) = \int_{-\infty}^{\infty} x^2 f(x)\,dx = \int_0^1 x^2\cdot x\,dx + \int_1^2 x^2(2 - x)\,dx = \int_0^1 x^3\,dx + \int_1^2 (2x^2 - x^3)\,dx$
$= \left[\frac{x^4}{4}\right]_0^1 + \left[\frac{2x^3}{3} - \frac{x^4}{4}\right]_1^2 = \frac{1}{4} + \left(\frac{16}{3} - 4\right) - \left(\frac{2}{3} - \frac{1}{4}\right) = \frac{1}{4} + \frac{4}{3} - \frac{5}{12} = \frac{7}{6}$.

$V(X) = E(X^2) - [E(X)]^2 = \frac{7}{6} - 1^2 = \frac{1}{6}$.
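
The mean and variance of this triangular density can be verified numerically. A sketch under the assumption that a simple midpoint rule is accurate enough here; `moment` is a hypothetical helper, not a library function.

```python
def f(x):
    """Triangular density from the example."""
    if 0 < x < 1:
        return x
    if 1 <= x < 2:
        return 2 - x
    return 0.0

def moment(k, n=200_000):
    """Midpoint-rule approximation of E(X^k) over the support (0, 2)."""
    h = 2 / n
    return sum(((i + 0.5) * h) ** k * f((i + 0.5) * h) for i in range(n)) * h

mean = moment(1)                    # close to 1
variance = moment(2) - mean ** 2    # close to 1/6, using V(X) = E(X^2) - [E(X)]^2
```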
Properties of Expectation and Variance

There are cases where our interest may not only be on the expected value of a r -v, but also on the expected
value of a r -v related to X. In general, such relations are useful to explain the properties of the mean and the
variance.

o If a is a constant, then E(aX) = aE(X).
o If b is a constant, then E(b) = b.
o If a and b are constants, then E(aX + b) = aE(X) + b.
o Let X and Y be any two random variables. Then E(X + Y) = E(X) + E(Y). This can be generalized to
n random variables. That is, if X1, X2, X3, . . . , Xn are random variables, then E(X1 + X2 + X3 + . . . +
Xn) = E(X1) + E(X2) + E(X3) + . . . + E(Xn).
o Let X and Y be any two random variables. If X and Y are independent, then E(XY) = E(X)E(Y).
o Let (X, Y) be a two-dimensional random variable with a joint probability distribution. Let Z = H1(X,
Y) and W = H2(X, Y). Then E(Z + W) = E(Z) + E(W).
o For constants a and b, V(aX + b) = a²V(X).
o Variance is not independent of change of scale, i.e., V(aX) = a²V(X).
o Variance is independent of change of origin, i.e., V(X + b) = V(X).
o Variance of a constant is zero, i.e., V(b) = 0.
o Let X1, X2, X3, . . . , Xn be n independent random variables. Then V(X1 + X2 + X3 + . . . + Xn) = V(X1)
+ V(X2) + V(X3) + . . . + V(Xn).
o If (X, Y) is a two-dimensional random variable and X and Y are independent, then V(X + Y) =
V(X) + V(Y) and V(X − Y) = V(X) + V(Y).

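Example: A continuous random variable X has probability density given by

$f(x) = \begin{cases} 2e^{-2x}, & x > 0 \\ 0, & x \le 0 \end{cases}$

and let K be a constant. Find
(a) The variance of X (b) The standard deviation of X (c) Var(KX) (d) Var(K + X)
Solution: (a) $V(X) = \int_0^{\infty} x^2 f(x)\,dx - [E(X)]^2 = \int_0^{\infty} 2x^2 e^{-2x}\,dx - \left[\int_0^{\infty} 2x e^{-2x}\,dx\right]^2 = 2\left(\frac{1}{2}\right)^2 - \left(\frac{1}{2}\right)^2 = \frac{1}{4}$
(b) $SD(X) = \sqrt{V(X)} = \sqrt{1/4} = \frac{1}{2}$
(c) $V(KX) = K^2 V(X) = \frac{K^2}{4}$
(d) V(K + X) = V(X) = 1/4.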

Example: Let X be a random variable with p.d.f. f (x) = 3x2, for 0 < x <1.
(a) Calculate the Var (X). (b) If the random variable Y is defined by Y = 3X − 2, calculate the Var(Y).

Solution: (a) $V(X) = \int_0^1 x^2 f(x)\,dx - [E(X)]^2 = \int_0^1 3x^4\,dx - \left[\int_0^1 3x^3\,dx\right]^2 = \frac{3}{5} - \left[\frac{3}{4}\right]^2 = \frac{3}{80}$

(b) V(3X − 2) = 9V(X) = 9 × 3/80 = 27/80.


8.4. Chebyshev’s Inequality
Let X be a random variable with mean E(X) = µ and variance σ², and let k be any positive constant. Then the
probability that X will assume a value within k standard deviations of the mean is at
least 1 − 1/k². That is, $P(\mu - k\sigma < X < \mu + k\sigma) \ge 1 - \frac{1}{k^2}$.

Note that Chebyshev's theorem holds for any distribution of observations, and for this reason the results are
usually weak. The value given by the theorem is a lower bound only. That is, we know that the probability of
a random variable falling within two standard deviations of the mean can be no less than 3/4, but we never
know how much more it might actually be. Only when the probability distribution is known can we
determine exact probabilities. For this reason we call the theorem a distribution-free result. The use of
Chebyshev's theorem is relegated to situations where the form of the distribution is unknown.

Example: A random variable X has a mean μ = 8, a variance σ² = 9, and an unknown probability
distribution. Find
(a) P(−4 < X < 20),
(b) P(|X − 8| ≥ 6).
Solution: Since the distribution is unknown, only Chebyshev bounds can be stated. Here σ = 3.
(a) $P(-4 < X < 20) = P[8 - (4)(3) < X < 8 + (4)(3)] \ge 1 - \frac{1}{4^2} = \frac{15}{16}$.
(b) $P(|X - 8| \ge 6) = 1 - P(|X - 8| < 6) = 1 - P[8 - (2)(3) < X < 8 + (2)(3)] \le \frac{1}{2^2} = \frac{1}{4}$.
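
When only the mean and variance are known, Chebyshev's inequality gives bounds rather than exact probabilities. The following sketch computes the bounds for μ = 8 and σ = 3; `chebyshev_lower_bound` is a hypothetical helper name introduced for illustration.

```python
def chebyshev_lower_bound(k):
    """Lower bound on P(|X - mu| < k*sigma) for any distribution, k > 0."""
    if k <= 0:
        raise ValueError("k must be positive")
    return max(0.0, 1 - 1 / k**2)

mu, sigma = 8, 3

# (a) the interval (-4, 20) is mu +/- 4 sigma
k_a = (20 - mu) / sigma
bound_a = chebyshev_lower_bound(k_a)        # P(-4 < X < 20) >= 15/16

# (b) |X - 8| >= 6 is the complement of being within 2 sigma of the mean
k_b = 6 / sigma
bound_b = 1 - chebyshev_lower_bound(k_b)    # P(|X - 8| >= 6) <= 1/4
```

The `max(0.0, ...)` guard reflects that for k ≤ 1 the inequality gives no information beyond the trivial bound 0.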

8.5. Covariance and Correlation Coefficient
8.5.1.Covariance
The covariance between two random variables is a measure of the nature of the association between the two. If
large values of X often result in large values of Y, or small values of X result in small values of Y, positive
X − μX will often result in positive Y − μY and negative X − μX will often result in negative Y − μY. Thus, the
product (X − μX)(Y − μY) will tend to be positive. On the other hand, if large X values often result in small Y
values, the product (X − μX)(Y − μY) will tend to be negative. The sign of the covariance indicates whether the
relationship between two dependent random variables is positive or negative.

The covariance of two random variables X and Y, denoted by Cov(X, Y), is defined by

$$\mathrm{Cov}(X, Y) = \sigma_{XY} = E[(X - E(X))(Y - E(Y))] = E(XY) - E(X)E(Y).$$

N.B.: When X and Y are statistically independent, the covariance is zero, because the expectation of the product factors into the product of the expectations and E[X − E(X)] = 0:
$$\mathrm{Cov}(X, Y) = E[(X - E(X))(Y - E(Y))] = E[X - E(X)]\,E[Y - E(Y)] = 0.$$

Thus if X and Y are independent, they are also uncorrelated. However, the converse is not true, as illustrated by the following example.

Example: The pair of random variables (X, Y) takes the values (1, 0), (0, 1), (−1, 0), and (0, −1), each with probability 1/4. Show that X and Y are uncorrelated but not independent.
Solution: The marginal p.m.f.'s of X and Y are symmetric around 0, so E[X] = E[Y] = 0. Furthermore, for all possible value pairs (x, y), either x or y is equal to 0, which implies that XY = 0 and E[XY] = 0.
Therefore, Cov(X, Y) = E[(X − E(X))(Y − E(Y))] = E[XY] − E[X]E[Y] = 0. Yet X and Y are not independent: P(X = 1, Y = 0) = 1/4, whereas P(X = 1)P(Y = 0) = (1/4)(1/2) = 1/8.
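
The example can be checked by direct enumeration. This is a minimal Python sketch using exact fractions; the helper name expect and the variable names are illustrative, not notation from these notes.

```python
from fractions import Fraction

# Joint p.m.f. of the example: four points, each with probability 1/4.
pmf = {(1, 0): Fraction(1, 4), (0, 1): Fraction(1, 4),
       (-1, 0): Fraction(1, 4), (0, -1): Fraction(1, 4)}

def expect(g):
    """Expectation of g(X, Y) under the joint p.m.f."""
    return sum(p * g(x, y) for (x, y), p in pmf.items())

ex = expect(lambda x, y: x)
ey = expect(lambda x, y: y)
cov = expect(lambda x, y: x * y) - ex * ey

# Marginal probabilities needed for the independence check.
p_x1 = sum(p for (x, y), p in pmf.items() if x == 1)
p_y0 = sum(p for (x, y), p in pmf.items() if y == 0)
p_joint = pmf[(1, 0)]

print("Cov(X, Y) =", cov)
print("P(X=1, Y=0) =", p_joint, " P(X=1)P(Y=0) =", p_x1 * p_y0)
```

The covariance is zero although the joint probability differs from the product of the marginals, so zero covariance does not imply independence.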

Properties of covariance
o Cov(X, Y) = Cov (Y, X)
o Cov (X, X) = Var(X)
o Cov(KX, Y) = K Cov(X, Y) for a constant K
o Var (X ± Y) = Var (X) + Var (Y) ± 2 Cov (X, Y)

8.5.2. Correlation Coefficient
Although the covariance between two random variables does provide information regarding the nature of the
relationship, the magnitude of σXY does not indicate anything regarding the strength of the relationship, since
σXY is not scale-free. Its magnitude will depend on the units used to measure both X and Y. There is a scale-
free version of the covariance called the correlation coefficient that is used widely in statistics.

Let X and Y be random variables with covariance Cov(X, Y) and standard deviations σX and σY, respectively. The correlation coefficient (or coefficient of correlation) ρXY of two random variables X and Y that have nonzero variances is defined as:

$$\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} = \frac{E(XY) - E(X)E(Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} = \frac{E\{[X - E(X)][Y - E(Y)]\}}{\sqrt{V(X)\,V(Y)}}.$$

It should be clear to the reader that ρXY is free of the units of X and Y. The correlation coefficient satisfies the
inequality −1 ≤ ρXY ≤ 1 and it assumes a value of zero when σXY = 0.

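Example: Let X and Y be random variables having joint probability density function
$$f(x, y) = \begin{cases} x + y, & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0, & \text{elsewhere.} \end{cases}$$
Find Cor(X, Y).
Solution: Here E(X) = E(Y) = 7/12, E(XY) = 1/3 and Var(X) = Var(Y) = 5/12 − (7/12)² = 11/144, so
$$\mathrm{Cor}(X, Y) = \frac{E(XY) - E(X)E(Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} = \frac{1/3 - (7/12)(7/12)}{11/144} = \frac{-1/144}{11/144} = -\frac{1}{11} \approx -0.0909.$$
The result lies between −1 and 1, as every correlation coefficient must.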
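
As a check on this example, the moments of f(x, y) = x + y on the unit square can be written in closed form, since ∫∫ x^a y^b (x + y) dx dy = 1/((a+2)(b+1)) + 1/((a+1)(b+2)). The Python sketch below uses that identity with exact fractions; the function name moment is an illustrative choice.

```python
from fractions import Fraction
import math

def moment(a, b):
    """E[X^a Y^b] for f(x, y) = x + y on 0 <= x, y <= 1, integrated in closed form."""
    return Fraction(1, (a + 2) * (b + 1)) + Fraction(1, (a + 1) * (b + 2))

ex, ey = moment(1, 0), moment(0, 1)     # both 7/12
exy = moment(1, 1)                      # 1/3
var_x = moment(2, 0) - ex**2            # 11/144
var_y = moment(0, 2) - ey**2            # 11/144
cov = exy - ex * ey                     # -1/144

# Correlation coefficient: covariance divided by the product of standard deviations.
rho = float(cov) / math.sqrt(float(var_x * var_y))
print(ex, exy, var_x, cov, rho)
```

The computed value of rho agrees with −1/11, a weak negative correlation.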

9. Common Probability distributions
9.1. Common Discrete Distributions and their Properties
9.1.1. Binomial distribution
In this sub-unit, we shall study one of the most popular discrete probability distributions, namely, the Binomial distribution. It simplifies many probability problems which would otherwise be tedious and complicated, because they would require listing all the possible outcomes of an experiment.

Repeated trials play an important role in probability and statistics, especially when the number of trials (n) is fixed, the parameter p (the probability of success) is the same for each trial, and the trials are all independent. Several random variables arise in connection with repeated trials. The one we shall study here concerns the total number of successes.

Examples of Binomial Experiments

Tossing a coin 20 times to see how many tails occur.


Asking 200 people whether they watch BBC news.
Rolling a die 10 times to see if a 5 appears.
A random variable X has a Binomial distribution, and is referred to as a Binomial random variable, if and only if its probability distribution is given by:
$$f(x; n, \theta) = \binom{n}{x}\,\theta^{x}(1 - \theta)^{n - x}, \quad x = 0, 1, \ldots, n.$$
In general, the binomial distribution has the following characteristics:

An experiment is repeated n times.
Each trial has only two possible outcomes: success (S) or failure (F).
P(S) = θ (fixed at any trial).
The n trials are independent.
Mean: $E(X) = \mu = \sum_{x=0}^{n} x\,p(x) = \sum_{x=1}^{n} x \binom{n}{x}\theta^{x}(1 - \theta)^{n-x} = n\theta$
Variance: $\mathrm{Var}(X) = E[(X - E(X))^2] = n\theta(1 - \theta)$

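Remark
The mean of the Binomial distribution (writing p for the probability of success and q = 1 − p) is
$$\begin{aligned}
E(X) &= \sum_{x=0}^{n} x\,P(X = x) \\
&= \sum_{x=0}^{n} x \binom{n}{x} p^{x} q^{n-x} \\
&= \sum_{x=0}^{n} x\,\frac{n!}{x!\,(n-x)!}\,p^{x} q^{n-x} \\
&= \sum_{x=1}^{n} x\,\frac{n\,(n-1)!}{x\,(x-1)!\,(n-x)!}\,p\,p^{x-1} q^{n-x} \qquad [\text{the } x = 0 \text{ term is zero}] \\
&= np \sum_{x=1}^{n} \frac{(n-1)!}{(x-1)!\,(n-x)!}\,p^{x-1} q^{n-x} \\
&= np \sum_{x=1}^{n} \binom{n-1}{x-1} p^{x-1} q^{n-x} \\
&= np\,(q + p)^{n-1} \\
&= np\,(1)^{n-1} \qquad [\,q + p = 1\,] \\
&= np.
\end{aligned}$$
Therefore, the mean of the Binomial distribution is np.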

 Variance of the Binomial distribution:


The variance of the Binomial distribution is
$$V(X) = E(X^2) - [E(X)]^2 = E(X^2) - (np)^2 \qquad \ldots (1) \qquad [\,E(X) = np\,]$$

Now,
$$\begin{aligned}
E(X^2) &= \sum_{x=0}^{n} x^2 \binom{n}{x} p^{x} q^{n-x} \\
&= \sum_{x=0}^{n} [x(x-1) + x] \binom{n}{x} p^{x} q^{n-x} \\
&= \sum_{x=0}^{n} x(x-1)\,\frac{n!}{x!\,(n-x)!}\,p^{x} q^{n-x} + \sum_{x=0}^{n} x\,\frac{n!}{x!\,(n-x)!}\,p^{x} q^{n-x} \\
&= \sum_{x=2}^{n} x(x-1)\,\frac{n(n-1)(n-2)!}{x(x-1)(x-2)!\,(n-x)!}\,p^{2} p^{x-2} q^{n-x} + E(X) \\
&= n(n-1)p^{2} \sum_{x=2}^{n} \frac{(n-2)!}{(x-2)!\,(n-x)!}\,p^{x-2} q^{n-x} + np \\
&= n(n-1)p^{2} \sum_{x=2}^{n} \binom{n-2}{x-2} p^{x-2} q^{n-x} + np \\
&= n(n-1)p^{2}\,(q + p)^{n-2} + np \\
&= n(n-1)p^{2}\,(1)^{n-2} + np \qquad [\,q + p = 1\,] \\
&= n(n-1)p^{2} + np \qquad \ldots (2)
\end{aligned}$$

Putting (2) in (1) we get
$$\begin{aligned}
V(X) &= n(n-1)p^{2} + np - (np)^{2} \\
&= np\,(np - p + 1 - np) \\
&= np\,(1 - p) \\
&= npq.
\end{aligned}$$
Therefore, the variance of the Binomial distribution is npq.

Example: A machine that produces stampings for automobile engines is malfunctioning and producing 5% defectives. The defective and non-defective stampings proceed from the machine in a random manner. If the next five stampings are tested, find the probability that three of them are defective.

Solution: Let X equal the number of defectives in n = 5 trials. Then X is a binomial random variable with p, the probability that a single stamping will be defective, equal to 0.05, and q = 1 − 0.05 = 0.95. The required probability is:
$$P(X = 3) = \binom{5}{3}(0.05)^3(1 - 0.05)^{5-3} = \frac{5!}{3!\,(5-3)!}(0.05)^3(0.95)^2 = \frac{5 \times 4 \times 3 \times 2 \times 1}{(3 \times 2 \times 1)(2 \times 1)}(0.05)^3(0.95)^2 = 10\,(0.000125)(0.9025) \approx 0.0011.$$

Mean = np = 5 × 0.05 = 0.25 and variance = npq = 5 × 0.05 × 0.95 = 0.2375.

Example: If the probability is 0.20 that a person traveling on a certain airplane flight will request a
vegetarian lunch, what is the probability that three of 10 people traveling on this flight will request a
vegetarian lunch?

Solution: Let X be the number of vegetarians. Given n = 10, p = 0.20, x = 3; then,
$$P(X = 3) = \binom{10}{3}(0.2)^3(0.8)^7 \approx 0.201.$$
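
The binomial formula, its mean np and variance npq, and the two worked examples can all be checked with a few lines of Python. This is a sketch only; binom_pmf and binom_moments are illustrative names, and the moments are obtained by brute-force summation over x = 0, ..., n rather than by the algebra of the Remark.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for a Binomial(n, p) random variable."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binom_moments(n, p):
    """Mean and variance computed directly from the p.m.f."""
    mean = sum(x * binom_pmf(x, n, p) for x in range(n + 1))
    second = sum(x * x * binom_pmf(x, n, p) for x in range(n + 1))
    return mean, second - mean**2

p_stampings = binom_pmf(3, 5, 0.05)   # three defectives among five stampings
p_lunch = binom_pmf(3, 10, 0.20)      # three vegetarian lunches among ten travellers
mean, var = binom_moments(5, 0.05)    # compare with np and npq

print(p_stampings, p_lunch, mean, var)
```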

9.1.2. Poisson Distribution
The Poisson probability distribution, named for the French mathematician S.D. Poisson (1781-1840), provides a model for the relative frequency of the number of “rare events” that occur in a unit of time, area, volume, etc.

Examples of events whose relative frequency distribution can be modeled by a Poisson probability distribution are:
• The number of new jobs submitted to a computer in any one minute,
• The number of fatal accidents per month in a manufacturing plant,
• The number of customers arriving during a given period of time,
• The number of bacteria per small volume of fluid.
The properties of Poisson random variables are the following.
• The experiment consists of counting the number of times X a particular event occurs during a given unit (of time, area, volume, etc.).
• The probability that an event occurs in a given unit is the same for all units.
• The number of events that occur in one unit is independent of the number that occur in other units.

A random variable X has a Poisson distribution with parameter λ, and is referred to as a Poisson random variable, if and only if its probability distribution is given by:
$$p(x; \lambda) = \frac{\lambda^{x} e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \ldots$$

Mean: $E(X) = \mu = \sum_{x=0}^{\infty} x\,\frac{\lambda^{x} e^{-\lambda}}{x!} = \lambda$

Variance: $\mathrm{Var}(X) = E[(X - E(X))^2] = \lambda$


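Remark
The mean and variance for a Poisson distribution are both λ.

$$E(X) = \sum_{x=0}^{\infty} x\,\frac{\lambda^{x} e^{-\lambda}}{x!} = \lambda e^{-\lambda} \sum_{x=1}^{\infty} \frac{\lambda^{x-1}}{(x-1)!} = \lambda e^{-\lambda} \sum_{y=0}^{\infty} \frac{\lambda^{y}}{y!} = \lambda e^{-\lambda} e^{\lambda} = \lambda \qquad (\text{letting } y = x - 1).$$

To calculate Var(X), we first calculate
$$E(X^2) = \sum_{x=0}^{\infty} x^2\,\frac{\lambda^{x} e^{-\lambda}}{x!} = \lambda e^{-\lambda} \sum_{x=1}^{\infty} \frac{(x - 1 + 1)\,\lambda^{x-1}}{(x-1)!} = \lambda e^{-\lambda} \left[ \sum_{x=1}^{\infty} \frac{(x-1)\,\lambda^{x-1}}{(x-1)!} + \sum_{x=1}^{\infty} \frac{\lambda^{x-1}}{(x-1)!} \right].$$
Rewriting x − 1 as y,
$$E(X^2) = \lambda e^{-\lambda} \left[ \sum_{y=0}^{\infty} \frac{y\,\lambda^{y}}{y!} + \sum_{y=0}^{\infty} \frac{\lambda^{y}}{y!} \right] = \lambda e^{-\lambda} \left( \lambda e^{\lambda} + e^{\lambda} \right) = \lambda^2 + \lambda,$$
hence
$$\mathrm{Var}(X) = E(X^2) - (E(X))^2 = \lambda^2 + \lambda - \lambda^2 = \lambda.$$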

Example: Suppose that customers enter a waiting line at random at a rate of 4 per minute. Assuming that the number entering the line during a given time interval has a Poisson distribution, find the probability that one customer enters during a given one-minute interval of time.
Solution: Given λ = 4 per minute,
$$P(X = 1) = \frac{4^{1} e^{-4}}{1!} = 4e^{-4} \approx 0.0733.$$
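
The Poisson formula and the claim that the mean and variance both equal λ can be checked numerically. In the Python sketch below, poisson_pmf is an illustrative name, and the infinite sums are truncated at x = 99, an assumption that is harmless for λ = 4 because the remaining terms are negligibly small.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with parameter lam."""
    return lam**x * exp(-lam) / factorial(x)

lam = 4.0
p_one = poisson_pmf(1, lam)   # one arrival in a one-minute interval

# Truncated sums standing in for the infinite series of the Remark.
mean = sum(x * poisson_pmf(x, lam) for x in range(100))
second = sum(x * x * poisson_pmf(x, lam) for x in range(100))
var = second - mean**2

print(round(p_one, 4), mean, var)
```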

9.1.3. Geometric distribution


The geometric distribution arises in a binomial experiment situation when trials are carried out independently (with constant probability p of success) until the first success occurs. The random variable X denoting the number of required trials is geometrically distributed with parameter p.

Often we will be interested in measuring the length of time before some event occurs, for example, the length of time a customer must wait in line until receiving service, or the length of time until a piece of equipment fails. For this application, we view each unit of time as a Bernoulli trial and consider a series of trials identical to those described for the Binomial experiment. Unlike the Binomial experiment, where X is the total number of successes, the random variable of interest here is X, the number of trials (time units) until the first success is observed.

A random variable X has a Geometric distribution with parameter p, and is referred to as a Geometric random variable, if and only if its probability distribution is given by:
$$p(x; p) = p\,(1 - p)^{x-1}, \quad x = 1, 2, 3, \ldots,$$
where p is the probability of success and x is the number of trials until the first success occurs.

Mean: $E(X) = \mu = \sum_{x=1}^{\infty} x\,p\,(1 - p)^{x-1} = \frac{1}{p}$

Variance: $\mathrm{Var}(X) = E[(X - E(X))^2] = \frac{1 - p}{p^2} = \frac{q}{p^2}$

Example: If the probability is 0.75 that an applicant for a driver's license will pass the road test on any given try, what is the probability that an applicant will finally pass the test on the fourth try?

Solution: Assuming that trials are independent, we substitute x = 4 and p = 0.75 into the formula for the geometric distribution, to get:
$$p(4) = p\,(1 - p)^{x-1} = 0.75\,(1 - 0.75)^{4-1} = 0.75\,(0.25)^3 \approx 0.011719.$$
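
A short Python sketch of the geometric distribution, counting trials from x = 1, reproduces the driver's-license example and checks the mean 1/p and variance q/p² by truncated summation. The name geom_pmf and the truncation point of 200 terms are illustrative assumptions.

```python
def geom_pmf(x, p):
    """P(first success occurs on trial x), for x = 1, 2, 3, ..."""
    if x < 1:
        raise ValueError("x counts trials, so it must be at least 1")
    return p * (1 - p)**(x - 1)

p = 0.75
p_fourth = geom_pmf(4, p)   # pass the road test on the fourth try

# Truncated sums; the terms decay geometrically, so 200 terms are ample here.
mean = sum(x * geom_pmf(x, p) for x in range(1, 201))
second = sum(x * x * geom_pmf(x, p) for x in range(1, 201))
var = second - mean**2

print(p_fourth, mean, var)
```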

9.2. Common Continuous Probability Distributions


9.2.1. Uniform Distribution
One of the simplest continuous distributions in all of statistics is the continuous uniform distribution. This distribution is characterized by a density function that is “flat,” and thus the probability is uniform in a closed interval, say [a, b]. Suppose you were to randomly select a number X represented by a point in the interval a ≤ x ≤ b. The density function of X is represented graphically as follows.

[Figure: density function of the uniform distribution on the interval [a, b].]

Note that the density function forms a rectangle with base b − a and constant height $\frac{1}{b - a}$, to ensure that the area under the rectangle equals one. As a result, the uniform distribution is often called the rectangular distribution.

A random variable with the density shown in the above graph is called a uniform random variable. Therefore, the probability density function for a uniform random variable X with parameters a and b is given by:
$$f(x) = \begin{cases} \dfrac{1}{b - a}, & a \le x \le b \\ 0, & \text{elsewhere} \end{cases}$$
Mean: $E(X) = \mu = \int_a^b x\,\frac{1}{b - a}\,dx = \frac{a + b}{2}$
Variance: $\mathrm{Var}(X) = E[(X - E(X))^2] = \frac{(b - a)^2}{12}$

Example: The department of transportation has determined that the winning (low) bid X (in dollars) on a road construction contract has a uniform distribution with probability density function $f(x) = \frac{5}{8d}$ if $\frac{2d}{5} < x < 2d$ (and 0 elsewhere), where d is the department of transportation estimate of the cost of the job. (a) Find the mean and SD of X. (b) What fraction of the winning bids on road construction contracts are greater than the department of transportation estimate?

Solution: (a) Here a = 2d/5 and b = 2d, so
$$E(X) = \int_{2d/5}^{2d} x\,\frac{5}{8d}\,dx = \frac{2d/5 + 2d}{2} = \frac{6d}{5},$$
$$V(X) = E[(X - E(X))^2] = \frac{(2d - 2d/5)^2}{12} = \frac{(8d/5)^2}{12} = \frac{16d^2}{75}, \qquad SD(X) = \frac{4d}{5\sqrt{3}} \approx 0.462\,d.$$
(b) $$P(X > d) = \int_{d}^{2d} \frac{5}{8d}\,dx = \frac{5}{8d}\,[x]_{d}^{2d} = \frac{5}{8d}\,(2d - d) = \frac{5}{8}.$$
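
The bid example can be checked numerically for a concrete estimate. The Python sketch below assumes the density is 5/(8d) on the interval from 2d/5 to 2d, and uses a hypothetical estimate d = 100000 dollars; the function names are illustrative only.

```python
import math

def uniform_summary(a, b):
    """Mean and standard deviation of a Uniform(a, b) random variable."""
    return (a + b) / 2.0, (b - a) / math.sqrt(12.0)

def uniform_prob_greater(c, a, b):
    """P(X > c) for X ~ Uniform(a, b)."""
    if c <= a:
        return 1.0
    if c >= b:
        return 0.0
    return (b - c) / (b - a)

d = 100_000.0               # hypothetical cost estimate, for illustration only
a, b = 2 * d / 5, 2 * d     # assumed support of the winning bid
mean, sd = uniform_summary(a, b)
frac_above = uniform_prob_greater(d, a, b)

print(mean, round(sd, 2), frac_above)
```

Under these assumptions the mean is 1.2 d, the standard deviation is about 0.462 d, and the fraction of winning bids above the estimate does not depend on d.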

9.2.2.Normal Distribution
A random variable X is normal, or normally distributed with parameters μ and σ² (abbreviated N(μ, σ²)), if it is continuous with probability density function:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}, \quad -\infty < x < \infty;\ \sigma > 0 \text{ and } -\infty < \mu < \infty.$$
The parameters μ and σ² are the mean and the variance, respectively, of the normal random variable.

Properties of the Normal Distribution


1. The curve is bell-shaped.
[Figure: Normal probability curve.]
2. The mean, median and mode are equal and located at the center of the distribution.
3. The curve is symmetrical about the mean and it is uni-modal.
4. The curve is continuous, i.e., for each X, there is a corresponding Y value.
5. It never touches the X axis.
6. The total area under the curve is 1, so the area on each side of the mean is 0.5000.
7. The areas under the curve that lie within one standard deviation, two and three standard deviations of
the mean are approximately 0.68 (68%), 0.95 (95%) and 0.997 (99.7%) respectively.
Graphically, it can be shown as:

9.2.2.1. Standard Normal Distribution

If we want to compute the probability P(a < X < b), we have to evaluate the area under the normal curve f(x) on the interval (a, b). This means we need to integrate the function f(x) defined above. However, the integral
$$P(a < X < b) = \int_{a}^{b} \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x - \mu)^2}{2\sigma^2}}\,dx$$
has no closed form and cannot be integrated directly.

But it is easily evaluated using a table of probabilities prepared for a special kind of normal distribution, called the standard normal distribution.

If X is a normal random variable with mean μ and standard deviation σ, then the variable $Z = \frac{X - \mu}{\sigma}$ is the standardized normal random variable. In particular, if μ = 0 and σ = 1, then the density function is called the standardized normal density, and its graph has the same bell shape as that of the normal distribution.

Convert all normal random variables to standard normal in order to easily obtain the area under the curve
with the help of the standard normal table.

Let X be a normal r-v with mean μ and standard deviation σ. Then we define the standard normal variable Z as $Z = \frac{X - \mu}{\sigma}$. The pdf of Z is thus given by:
$$f(z) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}z^2}, \quad -\infty < z < \infty.$$
Properties of the Standard Normal Curve (Z):
1. The highest point occurs at μ = 0.
2. It is a bell-shaped curve that is symmetric about the mean, μ = 0. One half of the curve is a mirror image of the other half, i.e., the area under the curve to the right of μ = 0 equals the area under the curve to the left of μ = 0, and each equals ½.
3. The total area under the curve equals one.
4. Empirical Rule:
• Approximately 68% of the area under the curve is between −1 and +1.
• Approximately 95% of the area under the curve is between −2 and +2.
• Approximately 99.7% of the area under the curve is between −3 and +3.

Steps to find area under the standard normal distribution curve


i. Draw the picture.
ii. Shade the desired area/region.
iii. If the area/region is:
• between 0 and any Z value, then look up the Z value in the table,
• in any tail, then look up the Z value to get the area and subtract the area from 0.5000,
• between two Z values on the same side of the mean, then look up both Z values from the table and subtract the smaller area from the larger,
• between two Z values on opposite sides of the mean, then look up both Z values and add the areas,
• less than any Z value to the right of the mean, then look up the Z value from the table to get the area and add 0.5000 to the area,
• greater than any Z value to the left of the mean, then look up the Z value and add 0.5000 to the area,
• in any two tails, then look up the Z values from the table, subtract the areas from 0.5000 and add the answers.
Note that finding the area under the curve over a region is the same as finding the probability that a randomly chosen Z value falls in that region.
Example: Find the probabilities that a r-v having the standard normal distribution will take on a value
a) less than 1.72; b) less than −0.88;
c) between 1.30 and 1.75; d) between −0.25 and 0.45.
Solution:
a) P(Z < 1.72) = P(Z < 0) + P(0 < Z < 1.72) = 0.5 + 0.4573 = 0.9573.
b) P(Z < −0.88) = P(Z > 0.88) = 0.5 − P(0 < Z < 0.88) = 0.5 − 0.3106 = 0.1894.
c) P(1.30 < Z < 1.75) = P(0 < Z < 1.75) − P(0 < Z < 1.30) = 0.4599 − 0.4032 = 0.0567.
d) P(−0.25 < Z < 0.45) = P(−0.25 < Z < 0) + P(0 < Z < 0.45)
= P(0 < Z < 0.25) + P(0 < Z < 0.45) = 0.0987 + 0.1736 = 0.2723.
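
Where no table is at hand, the same areas can be obtained from the error function, since the standard normal c.d.f. satisfies Φ(z) = ½[1 + erf(z/√2)]. The Python sketch below recomputes parts a) to d); the name phi is illustrative, and because the table values are rounded, the results may differ from the table-based answers in the fourth decimal place.

```python
import math

def phi(z):
    """Standard normal c.d.f. P(Z <= z), computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

prob_a = phi(1.72)                # less than 1.72
prob_b = phi(-0.88)               # less than -0.88
prob_c = phi(1.75) - phi(1.30)    # between 1.30 and 1.75
prob_d = phi(0.45) - phi(-0.25)   # between -0.25 and 0.45

for label, value in zip("abcd", (prob_a, prob_b, prob_c, prob_d)):
    print(label, round(value, 4))
```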

Remark
The curve of any continuous probability distribution or density function is constructed so that the area under the curve bounded by the two ordinates x = a and x = b equals the probability that the random variable X assumes a value between x = a and x = b. Thus, for the normal curve:
$$P(a < X < b) = P\left(\frac{a - \mu}{\sigma} < \frac{X - \mu}{\sigma} < \frac{b - \mu}{\sigma}\right) = P(z_1 < Z < z_2),$$
where z1 = (a − μ)/σ and z2 = (b − μ)/σ. Now, we need only get the readings from the Z-table corresponding to z1 and z2 to get the required probabilities, as we have done in the preceding example.

Example 9.5: If the scores for an IQ test have a mean of 100 and a standard deviation of 15, find the probability that IQ scores will fall below 112.
Solution: Let Y denote the IQ score, with Y ~ N(100, 225). Then
$$P(Y < 112) = P\left[\frac{Y - \mu}{\sigma} < \frac{112 - 100}{15}\right] = P[Z < 0.80] = 0.5 + P(0 < Z < 0.80) = 0.5 + 0.2881 = 0.7881.$$
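
The standardization step can also be carried out in software. This Python sketch uses statistics.NormalDist from the standard library (available from Python 3.8) and assumes, as the solution does, that the scores are normally distributed; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 100.0, 15.0
iq = NormalDist(mu=mu, sigma=sigma)

# Direct evaluation of P(Y < 112) for Y ~ N(100, 15^2).
p_direct = iq.cdf(112.0)

# The same probability after converting to the standard normal variable Z.
z = (112.0 - mu) / sigma
p_standardized = NormalDist().cdf(z)

print(z, round(p_direct, 4), round(p_standardized, 4))
```

Both routes give the same value, which is the point of converting to Z before using the table.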

9.2.3. Exponential Distribution


The exponential distribution is an important density function that is employed as a model for the relative frequency distribution of the length of time between random arrivals at a service counter when the probability of a customer arrival in any one unit of time is equal to the probability of arrival during any other. It is also used as a model for the length of life of industrial equipment or products when the probability that an “old” component will operate at least t additional time units, given it is now functioning, is the same as the probability that a “new” component will operate at least t time units. Equipment subject to periodic maintenance and parts replacement often exhibits this property of “never growing old”.
The exponential distribution is related to the Poisson probability distribution. In fact, it can be shown that if the number of arrivals at a service counter follows a Poisson probability distribution with the mean number of arrivals per unit of time equal to $\frac{1}{\beta}$, then the length of time between successive arrivals has an exponential distribution with mean β.

The continuous random variable X has an exponential distribution, with parameter β, if its density function is given by:
$$f(x) = \frac{e^{-x/\beta}}{\beta}, \quad x \ge 0,\ \beta > 0.$$
Mean: $E(X) = \mu = \int_0^{\infty} x\,\frac{e^{-x/\beta}}{\beta}\,dx = \beta$
Variance: $\mathrm{Var}(X) = E[(X - E(X))^2] = \int_0^{\infty} x^2\,\frac{e^{-x/\beta}}{\beta}\,dx - \beta^2 = \beta^2$

Example: Let X be an exponential random variable with pdf $f(x) = \frac{e^{-x/2}}{2}$, x ≥ 0. Find the mean and variance of the random variable X.

Solution: Here β = 2, so $E(X) = \mu = \int_0^{\infty} x\,\frac{e^{-x/2}}{2}\,dx = 2$ and Var(X) = E[(X − E(X))²] = β² = 4.

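Example: The probability density of X is
$$f(x) = \begin{cases} 3e^{-3x}, & x > 0 \\ 0, & \text{elsewhere.} \end{cases}$$
What are the mean and variance of this pdf?

Solution: This distribution is exponential (with β = 1/3), and the mean and variance are obtained as:
$$E(X) = \int_0^{\infty} x\,3e^{-3x}\,dx = \frac{1}{3} \quad \text{and} \quad V(X) = \int_0^{\infty} x^2\,3e^{-3x}\,dx - \left(\frac{1}{3}\right)^2 = \frac{2}{9} - \frac{1}{9} = \frac{1}{9}.$$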
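
The two integrals in this solution, and the “never growing old” property described earlier, can be checked numerically. The Python sketch below uses a simple midpoint rule on a truncated range, which is an approximation chosen here for illustration; the function names, the truncation at 40β and the step count are all assumptions.

```python
import math

def exp_pdf(x, beta):
    """Exponential density with mean beta."""
    return math.exp(-x / beta) / beta

def exp_survival(t, beta):
    """P(X > t) for an exponential random variable with mean beta."""
    return math.exp(-t / beta)

def numeric_moment(k, beta, upper_mult=40.0, steps=100_000):
    """Midpoint-rule approximation of E[X^k] over [0, upper_mult * beta]."""
    h = upper_mult * beta / steps
    return h * sum(((i + 0.5) * h) ** k * exp_pdf((i + 0.5) * h, beta)
                   for i in range(steps))

beta = 1.0 / 3.0                       # f(x) = 3 e^{-3x} corresponds to beta = 1/3
mean = numeric_moment(1, beta)
var = numeric_moment(2, beta) - mean**2

# Memoryless property: P(X > s + t | X > s) equals P(X > t).
s, t = 0.5, 0.2
conditional = exp_survival(s + t, beta) / exp_survival(s, beta)

print(mean, var, conditional, exp_survival(t, beta))
```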
