Uploaded by Syed Hussain

Inference

Dr. Bilal A Usmani

STRUCTURE OF THE COURSE

Business Statistics

Approach of the course

concepts and solving the examples that are related

to Statistics, especially descriptive statistics.

2. An introduction to SPSS / R, very powerful statistical software (R is free and open source), which might comprise one or two lectures.

What is Statistics?

describing and analysing data.

• This involves collecting, classifying,

summarizing, organizing, analyzing, and

interpreting numerical information.

What is Statistics? (cont…)

methodology which scientists and mathematicians

have developed for interpreting and drawing

conclusions from collected data.

collection, processing, interpretation and

presentation of data belongs to the domain of

statistics.

Scope of statistics

broad scope of applications to business,

government, and the physical and social sciences

is almost limitless. Virtually every single subject

from Anthropology to Zoology …. A to Z!

• Statistics plays a key role in critical thinking, whether in the classroom, on the job, or in everyday life.

The science of Statistics

It would therefore be desirable to know the characteristic

features of statistics in order to appreciate and

understand its general nature.

• Statistics deals with the behavior of aggregates or large groups of

data. It has nothing to do with what is happening to a particular

individual or object of the aggregate.

The science of Statistics (cont…)

kind rather than isolated figures.

• Statistics deals with variability that obscures underlying

patterns. No two objects in this universe are exactly alike. If

they were, there would have been no statistical problem.

• Statistics deals with uncertainties as every process of getting

observations whether controlled or uncontrolled, involves

deficiencies or chance variation. That is why we have to talk

in terms of probability.

The science of Statistics (cont…)

things which can be described numerically either by

counts or by measurements.

• Statistics deals with those aggregates which are subject

to a number of random causes, e.g. the heights of

persons are subject to a number of causes such as race,

ancestry, age, diet, habits, climate and so forth.

The science of Statistics (cont…)

run. There is no guarantee that a certain law will hold in

all cases. Statistical inference is therefore made in the

face of uncertainty.

• Statistical results might be misleading or incorrect if sufficient care in collecting, processing and interpreting the data is not exercised, or if the statistical data are handled by a person who is not well versed in the subject matter of statistics.

The Purpose of studying Statistics

definitely useful to have a fairly good idea about

the way in which it works, and this is exactly the

purpose of this introductory course.

• The following points indicate some of the main

functions of this science:

• Statistics assists in summarizing a large set of data in a form that is easily understandable.

The Purpose of studying Statistics

(contd…)

and field experiments as well as surveys.

• Statistics assists in sound and effective planning in any field of inquiry.

• Statistics assists in drawing general conclusions and in

making predictions of how much of a thing will

happen under given conditions.

TYPES OF STATISTICS

Types of Statistics

descriptive and inferential statistics

Descriptive statistics

numerically or graphically by deriving:

• Statistics pertaining to central tendency such as the

mean, median, or mode

• Statistics pertaining to dispersion around the central

tendency such as the range or standard deviation or

statistics or graphs depicting the shape of a

distribution

Inferential statistics

• Inferential statistics allow one to infer population

parameters based upon sample statistics and to

model relationships within the data.

• Estimation is the group of statistical methods that allow population values to be estimated from sample data.

Inferential statistics (contd…)

population parameter estimates and confidence

intervals.

• Modelling allows us to develop mathematical

equations which describe the interrelationships

between two or more variables.

• Hypothesis testing allows us to test for whether a

particular hypothesis we’ve developed is supported by

a systematic analysis of the data.

THE NATURE OF STATISTICS

DESCRIPTIVE STATISTICS

PROBABILITY

INFERENTIAL STATISTICS

FUNDAMENTAL ELEMENTS OF

STATISTICS

Fundamental elements of

statistics

1.Data and its collection

2.Variable

3. Population

4. Sample

1. Data and types of data

information.

•All data can be classified as one of

the two types:

• qualitative and quantitative

Quantitative data

are recorded on a naturally occurring

numerical scale.

• For example, the current unemployment rate

for each of the 4 provinces, the number of

convicted murderers who receive the death

penalty each year over a 10-year period.

Qualitative data

cannot be measured on a natural numerical

scale; they can only be classified into a

group of categories.

• For example, political party affiliation in a

sample of 50 voters, the defective status

(defective or not) of each of 100 computer

chips manufactured by Intel.

Primary and secondary data

data) and have not undergone any sort of

statistical treatment, are called PRIMARY data.

• Data that have undergone any sort of treatment by statistical methods at least ONCE, i.e. the data that have been collected, classified, tabulated or presented in some form for a certain purpose, are called SECONDARY data.

Data: Objective

concerned, they should be stated in such a

way that you are absolutely clear about the

goal of your study ---

Exactly what it is that you

are trying to find out?

Data: Collection methodology

COLLECTION is concerned, you need to

consider:

1. Source of your data

(the statistical population)

2. Sampling Methodology

3. Instrument for collecting data

Collection of data

perhaps the collection of data.

• Statistical data are collected either by a

complete enumeration of the whole field, called

census, which in many cases would be too costly

and too time consuming as it requires a large

number of enumerators and supervisory staff, or

by a partial enumeration associated with a

sample which saves much time and money.

Collection of primary data

employed to collect primary data:

ii) Indirect Investigation.

iii) Collection through Questionnaires.

iv) Collection through Enumerators.

v) Collection through Local Sources.

Collection of secondary data

following sources:

i) Official, e.g. the publications of the Statistical Division, Ministry of

Finance, the Federal and Provincial Bureaus of Statistics, Ministries of

Food, Agriculture, Industry, Labour, etc.

ii) Semi-Official, e.g., State Bank, Railway Board, Central Cotton

Committee, Boards of Economic Inquiry, District Councils, Municipalities,

etc.

iii) Publications of Trade Associations, Chambers of Commerce, etc.

iv) Technical and Trade Journals and Newspapers.

v) Research Organizations such as universities, and other institutions

2. Variable

an individual population unit.

• Age, sex, business income and expenses,

country of birth, capital expenditure, class

grades, and vehicle type are examples of

variables.

Types of variables

Numeric variables

measurable quantity as a number. Therefore

numeric variables are quantitative variables. The

data collected for a numeric

variable are quantitative data.

either continuous or discrete

Continuous variable

variable. Observations can take any value

between a certain set of real numbers.

height, time, age, and temperature.

Discrete Variable

Variables that can take only a countable number of distinct values, such as whole-number counts. A discrete variable cannot take the value of a fraction between one value and the next closest value.

• Examples of discrete variables include the number of registered cars, number of business locations, and number of children in a family, all of which are measured as whole units (e.g. 1, 2, 3 cars).

Categorical variables

describe a 'quality' or 'characteristic' of a

data unit, like 'what type' or 'which

category'. The data collected for a

categorical variable are qualitative data.

• Categorical variables may be further described

as ordinal or nominal:

Ordinal variable

Observations can take a value that can be

logically ordered or ranked. The categories

associated with ordinal variables can be ranked

higher or lower than another, but do not

necessarily establish a numeric difference

between each category.

• Examples of ordinal categorical variables include academic grades (e.g. A, B, C), clothing size (e.g. small, medium, large, extra-large) and attitudes (e.g. strongly agree, agree, disagree, strongly disagree).

Nominal variable

Observations can take a value that is not

able to be organized in a logical sequence.

• Examples of nominal categorical variables

include sex, business type, eye color, religion

and brand.

3. Population

transactions, or events) that we are interested in

studying.

• For example, populations may include

• (1) all employed workers in a country

• (2) all registered voters in a city or country

• (3) everyone who is afflicted with AIDS

• (4) all the cars produced by a particular assembly line

• (5) the set of all accidents occurring in a particular

stretch of interstate highway during a holiday period.

4. Sample

is also known as representative data.

used to infer the characteristics of the population

from which it was drawn.

Sample (cont…)

population you are studying that is selected for

the actual research study.

• For example, instead of polling all 140 million registered voters in the U.S. during a presidential election year, a pollster might select and question a sample of just 1500 voters and record the preference of each sampled voter.

Very

Important

Point!

DATA REPRESENTATION

Data representation

pictorially as a chart, diagram or graph.

information the reader or viewer is supposed to

be able to assimilate "from cold" while reading or

listening.

TABLES

Tables

• A table is perhaps the simplest means of

summarizing a set of observations

• They can be used for all types of numerical data.

• Tables are commonly used in collecting and

organizing raw data during an experiment and

also for representing final data to be included in

a paper or report.

• The representation of data in a table is formally

referred to as “tabular presentation.”

https://www.ncsu.edu/labwrite/res/gh/gh-tables.html

Tables (cont…)

in your report, list specific data values or draw

comparisons between variables by listing

subtotals, totals, averages, percentages,

frequencies, statistical results, etc.

• Good tables should be easy to read across rows

and down columns, easy to understand, and easy

to refer to in the text of your report. They should

also include only relevant data from your results.

A slightly complicated table!

Look at the components

Parts of Previous table

• Title

• Table number

• Headings & Subheadings

• Table Body

• Table Spanner

• Dividers

• Table Notes

Types of Tables

•Statistical Tables

•Numerical Tables

GOOD

Practice

Versus

BAD

practice

FIGURES

Figures

Figures.

viewer the greatest number of ideas in the shortest

time with the least ink in the smallest place.”

Edward R. Tufte

The main themes of a graph

summarising numerical data

• The use of a visual image can simplify complex

information and help to highlight patterns and trends in

the data.

• Designed to add understanding of information that is difficult to convey with words

• They are a particularly effective way of presenting a

large amount of data but can also be used instead of a

table to present smaller datasets.

The main themes of a graph (cont…)

from and a critical issue is to ensure that the

graph type selected is the most appropriate for

the data.

• In general, a graph:

• Must be clear, accurate, and appropriate

• Should avoid mere decoration

• Needs a legend

Parts of a

Graph

(line)

Types of graphs

the most common types of graphs is provided in the next

slides

• It is followed by some general guidelines for designing

readily understandable graphs.

• There is more detailed information on the uses and good

design of particular types of graph in the literature.

• We will focus on graphs that can be used to compare

groups.

Graphs that can compare groups:

1. Bar chart
2. Scatterplot
3. Box plots
4. Stem and leaf plot
5. Line plot
6. Pie chart

1. Bar chart

heights of category measures. Bar charts can be made of

category tallies, of different statistics by categories, or of

summary values. The height of the bars signifies the

magnitude of the values.

• For example, bars could represent:

• Total sales for four branch stores for a year

• Mean diameters of parts manufactured by four different machines in

a factory during a week.

• Counts of visitors to four local tourist destinations during a weekend.

Example

this bar chart

represent the counts

of paint flaws on an

automobile part.

Clustered and stacked bars

creating clusters of bars or by stacking bars.

• For example, suppose you want to track the

number of students at four regional high schools

by grade.

• Each cluster of bars represents a school, and each

bar within a cluster represents the number of

students in a grade.

Example 1

to compare subcategories within and

across categories. For example, this graph

shows:

East High has the most students.

Within East High, 12th grade has the most

students.

For each high school, the number of

students in each grade is similar.

The fewest students are in 9th grade at

West High, followed closely by the other

three grades at that high school.

Example 1: Stacked bars

school, and each bar within a

stack represents the number of

students in a grade.

you want to compare

subcategories within categories

and categories with each other.

Histograms

the histogram.

of a frequency distribution for either nominal or

ordinal data, a histogram depicts a frequency

distribution for discrete or continuous data.

Frequency Polygons

in many respects.

• A frequency polygon uses the same two axes as a

histogram.

• It is constructed by placing a point at the center of

each interval such that the height of the point is

equal to the frequency or relative frequency

associated with that interval

2. Scatterplot

relationship between two sets of data.

• The relationship between two variables is called

their correlation.

• The closer the data points come when plotted to

making a straight line, the higher the correlation

between the two variables, or the stronger the

relationship.

• It is an inferential plot and is used for continuous

data.

• It is affected by changing the order of the variables.
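To make the idea of correlation concrete, the following is a minimal Python sketch that computes Pearson's correlation coefficient for two small data sets. The function name `pearson_r` and all the numbers are illustrative assumptions, not data from the course.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: points lying exactly on a straight line
x = [1, 2, 3, 4, 5]
y_line = [2, 4, 6, 8, 10]
# Hypothetical data: same x values, but y is more scattered
y_scatter = [2, 5, 3, 9, 6]

r_line = pearson_r(x, y_line)        # perfect linear relationship
r_scatter = pearson_r(x, y_scatter)  # weaker relationship
print(r_line, r_scatter)
```

The points that fall exactly on a line give a coefficient of 1, while the scattered points give a smaller value, which matches the description of a stronger or weaker relationship.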

Scatterplots and correlation

3. Boxplots

however, they display only a summary

of the data. It is used for factor data

having levels.

• The central box, depicted vertically here but possibly horizontal, extends from the 25th percentile to the 75th percentile.

• The 25th and 75th percentiles of a data set are

called the quartiles of the data.

• The line running between the quartiles marks the

50th percentile of the data set; half the

observations are less than or equal to, whereas

the other half are greater than or equal to this

value.

• If the 50th percentile lies approximately halfway

between the two quartiles, this implies that the

observations in the center of the data set are

roughly symmetric.

• The lines projecting out from the box on either side extend to the adjacent values of the plot. The adjacent values are the most extreme observations in the data set that lie no more than 1.5 times the height of the box beyond either quartile.
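The quantities a box plot displays can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample values are hypothetical, the quartiles use the "inclusive" interpolation of the standard `statistics.quantiles` function (other conventions give slightly different quartiles), and the adjacent values follow the common convention of lying within 1.5 box heights of the quartiles.

```python
import statistics

def box_plot_summary(values, whisker=1.5):
    """Quartiles, adjacent values, and outliers for a box plot."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    box_height = q3 - q1
    low_fence = q1 - whisker * box_height
    high_fence = q3 + whisker * box_height
    # Adjacent values: most extreme observations still inside the fences
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "adjacent": (inside[0], inside[-1]),
        "outliers": outliers,
    }

sample = [2, 4, 5, 5, 6, 7, 8, 9, 10, 25]  # hypothetical observations
summary = box_plot_summary(sample)
print(summary)
```

In this sample the value 25 falls outside the upper fence, so the upper line of the box plot would stop at 10 and 25 would be drawn as a separate point.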

4. Stem and leaf plot

leaf plot, is a diagram that quickly summarizes

data while maintaining the individual data points.

• The "stem" is a column of the unique elements of

data after removing the last digit.

• The final digits ("leaves") of each column are then

placed in a row next to the appropriate column

and sorted in numerical order.
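The construction described above can be sketched in Python as follows. The helper name `stem_and_leaf` and the sample values are assumptions for illustration, and the sketch only handles non-negative whole numbers.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted leaves (last digits)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)

values = [45, 32, 37, 46, 39, 36, 41, 48, 36]  # hypothetical two-digit data
for stem, leaves in stem_and_leaf(values).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Each printed row shows one stem followed by its leaves, so every individual observation can still be read back from the plot.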

5. Line plots

on a graph.

• It can be used to illustrate the relationship

between continuous quantities.

• Each point on the graph represents a pair of

values.

• Each value on the x-axis has a single corresponding measurement on the y-axis.

• Adjacent points are connected by straight lines.

Most commonly, the scale along the horizontal

axis represents time.

• Consequently, we are able to trace the

chronological change in the quantity on the

vertical axis over a specified period.

[Figure: line plot titled "Inflation"; vertical axis "Frequency", horizontal axis Category 1 to Category 4]

6. Pie chart

the data is represented in the form of a circle.

• In a pie chart, a circle is divided into various sections or

segments such that each sector or segment represents a

certain proportion or percentage of the total.

• In such a diagram, the total of all the given items is

equated to 360 degrees and the degrees of angles,

representing different items, are calculated

proportionately.

http://www.lofoya.com/Data-Interpretation/Pie-Charts/intro

• The entire diagram looks like a pie and its components

resemble slices cut from a pie. The pie chart is used to

show the break-up of one continuous variable into its

component parts.

• For example, chart on the next slide shows the

distribution of the sales of the car industry between six

car companies.

Pie chart: Explanation

Maruti accounts for 24 per cent of the market

share, while GM accounts for 35 percent of the

market share, Ford for 4 percent of the market

share, Tata for 10 percent of the market share,

Hyundai for 15 percent of the market share and

Fiat for 12 per cent of the market share.

Pie chart: In a nutshell

degrees which represents 100 per cent of the

value of the continuous variable.

• Thus, 3.6 degrees on the pie chart represent 1

percent of the total value of the continuous

variable being represented.

• A single pie diagram can represent only one

continuous variable.
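As a small illustration of the 360-degree rule, the sketch below converts the market shares quoted in the pie chart explanation into sector angles. The variable names are assumptions made for this example.

```python
# Market shares (percent) quoted in the pie chart explanation
shares = {"Maruti": 24, "GM": 35, "Ford": 4, "Tata": 10, "Hyundai": 15, "Fiat": 12}

total = sum(shares.values())  # should be 100 percent
# Each sector's angle is its proportion of the total, scaled to 360 degrees
angles = {name: pct * 360 / total for name, pct in shares.items()}

for name, angle in angles.items():
    print(f"{name}: {angle:.1f} degrees")
```

Since 1 percent corresponds to 3.6 degrees, GM's 35 percent share takes up 126 degrees of the circle.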

• In terms of versatility of data representation, pie charts are less versatile than bar charts, x-y graphs, or tables.

• However, their utility is in the fact that the

representation of data is cleaner and it gives an

immediate idea of the relative distribution of the

continuous variable amongst different sectors.

MEASUREMENT OF CENTRAL

TENDENCY AND DISPERSION

CENTRAL TENDENCY

Central tendency

the measurements that is the

tendency of the data to cluster,

center about certain numeric values.

• Measures of central tendency are sometimes

called measures of central location.

• They are also classed as summary statistics.

• They form part of the descriptive statistics of the

data.

• Some measures of central tendency are more

appropriate to use than others

Averages enable us to measure the

central tendency of variable data

https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median.php

Averages

represent a set of data or a distribution as a whole.

• It is a more or less CENTRAL value AROUND which the observations in the set of data or distribution usually tend to cluster.

• As a measure of central tendency (i.e. an average)

indicates the location or general position of the

distribution on the X-axis, it is also known as a measure

of location or position.

• Let us consider an example:

• Suppose that we have the following two frequency distributions:

Number of rooms   Suburb A   Suburb B
4                 0          0
5                 8          0
6                 27         8
7                 30         27
8                 16         30
9                 0          16
10                0          0

• If we draw the frequency polygon of the two frequency distributions, we obtain

[Figure: frequency polygons for Suburb A and Suburb B; horizontal axis from 4 to 10, vertical axis from 0 to 35]

Inspection of these frequency polygons shows that they

have exactly the same shape. It is their position relative

to the horizontal axis (X-axis) which distinguishes them.

each of the two suburbs, we will find that the average

number of rooms per house in A is 6.67 while in B it is

7.67.

• This difference of 1 is equivalent to the difference in

position of the two frequency polygons.

there are LARGER houses in suburb B than in suburb A, to the extent that there is, on the average, ONE more room in each house.
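The two averages quoted above can be checked with a short Python sketch that computes the mean of a frequency distribution as the sum of value times frequency divided by the total frequency. The function and variable names are illustrative assumptions; the frequencies are those of the Suburb A and Suburb B example.

```python
def frequency_mean(values, frequencies):
    """Mean of a discrete frequency distribution: sum(x * f) / sum(f)."""
    total = sum(frequencies)
    return sum(x * f for x, f in zip(values, frequencies)) / total

rooms = [4, 5, 6, 7, 8, 9, 10]
suburb_a = [0, 8, 27, 30, 16, 0, 0]
suburb_b = [0, 0, 8, 27, 30, 16, 0]

mean_a = frequency_mean(rooms, suburb_a)
mean_b = frequency_mean(rooms, suburb_b)
print(round(mean_a, 2), round(mean_b, 2), round(mean_b - mean_a, 2))
```

The two distributions have identical shapes shifted by one room, so the difference between the means is one room.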

Various TYPES of Averages

has a use in specifically defined circumstances.

1) The arithmetic mean,

2) The geometric mean,

3) The harmonic mean

4) The median, and

5) The mode

• The Arithmetic, Geometric and Harmonic means are

averages that are mathematical in character, and give an

indication of the magnitude of the observed values.

• The Median indicates the middle position while the

mode provides information about the most frequent

value in the distribution or the set of data.

• The Mode is defined as that value which occurs most

frequently in a set of data i.e. it indicates the most

common result.

Arithmetic Mean

• Also called the “First moment” of X

• $x_i$ = values measured

• $p_i = \Pr(X = x_i) = \Pr(\text{we measure } x_i)$

$$E[X] = \sum_{i=1}^{n} x_i p_i$$

Arithmetic Mean

• pi = constant = 1/n

• n = number of measurements

• Arithmetic mean

• Common “average”

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Copyright 2004 David J. Lilja

Potential Problem with Means

measurements

• Outliers can have a large influence on the

computed mean value

• Distorts our intuition about the central tendency

of the measured values

Potential Problem with Means

[Figure: two panels, each marking the position of the mean]

Geometric Mean

values $X_1, X_2, \ldots, X_n$ is defined as the positive nth root of their product:

$$G = \sqrt[n]{X_1 X_2 \cdots X_n} \quad (\text{where } X_i > 0)$$

mean becomes laborious as we have to extract the

nth root of the product of all the values.

Geometric Mean

• Taking logarithms to the base 10, we get

$$\log G = \frac{1}{n} \sum \log X$$

Therefore

$$G = \operatorname{antilog}\left(\frac{\sum \log X}{n}\right)$$

Example

• 45, 32, 37, 46, 39, 36, 41, 48, 36.

• Solution:

• We need to compute the numerical value of

$$\sqrt[9]{45 \times 32 \times 37 \times 46 \times 39 \times 36 \times 41 \times 48 \times 36}$$

ninth root of a quantity. So we make use of

logarithms, as shown below:

X      log X
45     1.6532
32     1.5052
37     1.5682
46     1.6628
39     1.5911
36     1.5563
41     1.6128
48     1.6812
36     1.5563
Sum    14.3870

$$\log G = \frac{\sum \log X}{n} = \frac{14.3870}{9} = 1.5986$$

Hence $G = \operatorname{antilog}(1.5986) = 39.68$
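The logarithm method for raw data can be sketched in Python as below. The function name `geometric_mean` is an assumption for this example; the data are the nine values of the example above.

```python
import math

def geometric_mean(values):
    """Geometric mean via logarithms: antilog of the mean of log10(x)."""
    if any(v <= 0 for v in values):
        raise ValueError("the geometric mean requires positive values")
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log  # antilog to the base 10

data = [45, 32, 37, 46, 39, 36, 41, 48, 36]
g = geometric_mean(data)
print(round(g, 2))
```

Working with logarithms avoids forming the very large product of all nine values before taking the ninth root.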

• The above example pertained to the computation

of the geometric mean in case of raw data.

• Next, we consider the computation of the

geometric mean in the case of grouped data.

Geometric mean for grouped data

with midpoints $X_1, X_2, \ldots, X_k$ and the corresponding frequencies $f_1, f_2, \ldots, f_k$ (such that $\sum f_i = n$), the geometric mean is given by

$$G = \sqrt[n]{X_1^{f_1} X_2^{f_2} \cdots X_k^{f_k}}$$

• Each value of X thus has to be multiplied by itself f

times, and the whole procedure becomes quite a

formidable task!

In terms of logarithms, the formula becomes

$$\log G = \frac{1}{n} \sum f \log X$$

$$G = \operatorname{antilog}\left(\frac{\sum f \log X}{n}\right)$$

• Obviously, the above formula is much easier to

handle.

• In the example of the EPA mileage ratings, we

have:

Mileage Rating   No. of Cars (f)   Class-mark (midpoint) X   log X    f log X
30.0 - 32.9      2                 31.45                     1.4976   2.9952
33.0 - 35.9      4                 34.45                     1.5372   6.1488
36.0 - 38.9      14                37.45                     1.5735   22.0290
39.0 - 41.9      8                 40.45                     1.6069   12.8552
42.0 - 44.9      2                 43.45                     1.6380   3.2760
Total            30                                                   47.3042

$$G = \operatorname{antilog}\left(\frac{47.3042}{30}\right) = \operatorname{antilog}(1.5768) = 37.74$$

measures the central tendency of this data set, then the

central value of the mileage of those 30 cars comes out

to be 37.74 miles per gallon.
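The grouped-data calculation can be reproduced with a short Python sketch. The variable names are assumptions; the class midpoints and frequencies are those of the EPA mileage table.

```python
import math

# (class midpoint X, frequency f) pairs from the EPA mileage ratings table
classes = [(31.45, 2), (34.45, 4), (37.45, 14), (40.45, 8), (43.45, 2)]

n = sum(f for _, f in classes)                          # total number of cars
sum_f_log = sum(f * math.log10(x) for x, f in classes)  # sum of f * log X
g_grouped = 10 ** (sum_f_log / n)                       # antilog of the mean log

print(n, round(sum_f_log, 4), round(g_grouped, 2))
```

Small differences in the last decimal place of the sum compared with the table are expected, because the table rounds each logarithm to four places.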

When should we use the geometric

mean?

• The question is, “When should we use the

geometric mean?”

when relative changes in some variable

quantity are averaged, we prefer

the geometric mean.

Arithmetic or geometric mean

increased during 4 years by the following

amounts:

Year   Turnover    Percentage Compared With Year Earlier
1958   £ 2,000     –
1959   £ 2,500     125
1960   £ 5,000     200
1961   £ 7,500     150
1962   £ 10,500    140

• The yearly increase is shown in a percentage form

in the right-hand column i.e. the turnover of 1959

is 125 percent of the turnover of 1958, the

turnover of 1960 is 200 percent of the turnover of

1959, and so on.

his average rate of turnover growth.

• If the arithmetic mean is adopted he finds his answer

to be:

• Arithmetic Mean

$$\frac{125 + 200 + 150 + 140}{4} = 153.75$$

i.e. we are concluding that the turnover for any year

is 153.75% of the turnover for the previous year.

• In other words, the turnover in each of the years

considered appears to be 53.75 per cent higher than in

the previous year.

• If this percentage is used to calculate the turnover from

1958 to 1962 inclusive, we obtain:

153.75% of £ 2,000 = £ 3,075

153.75% of £ 3,075 = £ 4,728

153.75% of £ 4,728 = £ 7,269

153.75% of £ 7,269 = £ 11,176

• Whereas the actual turnover figures were

Year Turnover

1958 £ 2,000

1959 £ 2,500

1960 £ 5,000

1961 £ 7,500

1962 £ 10,500

• It seems that both the individual figures and,

more important, the total at the end of the

period, are incorrect.

• Using the arithmetic mean has exaggerated the

‘average’ annual rate of increase in the turnover

of this firm.

• Obviously, we would like to rectify this false

impression.

• The geometric mean enables us to do so.

$$G = \sqrt[4]{125 \times 200 \times 150 \times 140} = \sqrt[4]{525000000} = 151.37\%$$

• Now, if we utilize this particular value to obtain the

individual turnover figures, we find that:

151.37% of £2,000 = £3,027

151.37% of £3,027 = £4,583

151.37% of £4,583 = £6,937

151.37% of £6,937 = £10,500

• So that the turnover figure of 1962 is exactly the same as

what we had in the original data.
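The comparison between the two averages can be sketched in Python. The helper `project` and the variable names are assumptions for illustration; the percentages and the 1958 turnover of £2,000 come from the example.

```python
import math

ratios = [125, 200, 150, 140]  # each year's turnover as a percent of the year before

arithmetic = sum(ratios) / len(ratios)
geometric = math.prod(ratios) ** (1 / len(ratios))

def project(start, percent, years):
    """Apply the same yearly percentage repeatedly to a starting turnover."""
    value = start
    for _ in range(years):
        value *= percent / 100
    return value

final_arithmetic = project(2000, arithmetic, 4)  # overshoots the actual 1962 figure
final_geometric = project(2000, geometric, 4)    # reproduces the actual 1962 figure

print(round(arithmetic, 2), round(geometric, 2))
print(round(final_arithmetic), round(final_geometric))
```

Compounding the geometric mean over four years recovers the actual 1962 turnover of £10,500, whereas compounding the arithmetic mean gives about £11,176.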

Interpretation

at a constant rate, then the annual increase would have

been 51.37 percent. (On the average, each year’s

turnover is 51.37% higher than that in the previous year.)

the geometric mean in a situation when relative changes

in a variable quantity are to be averaged.

Interpretation

are not encountered too often, and that the

occasion to calculate the geometric mean arises

less frequently than the arithmetic mean. (The

most frequently used measure of central

tendency is the arithmetic mean.)

Mode

• EXAMPLE:

• Suppose that the marks of eight students in a

particular test are as follows:

• 2, 7, 9, 5, 8, 9, 10, 9

• Obviously, the most common mark is 9.

• In other words,

Mode = 9.
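Finding the mode amounts to counting how often each value occurs. A minimal Python sketch using the eight marks above follows; the variable names are assumptions, and the list `modes` allows for ties even though this data set has a single mode.

```python
from collections import Counter

marks = [2, 7, 9, 5, 8, 9, 10, 9]

counts = Counter(marks)         # frequency of each mark
highest = max(counts.values())  # the highest frequency
modes = sorted(value for value, count in counts.items() if count == highest)

print(modes, highest)
```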

THE MODE IN CASE OF A DISCRETE

FREQUENCY DISTRIBUTION:

identification of the mode is immediate; one

simply finds that value which has the highest

frequency.

Example:

forty-seated plane.

Highest Frequency fm = 13

Occurs against the X value 39.

Hence:

Mode = x = 39

The mode is obviously 39 passengers and the company should be quite

satisfied that a 40 seater is the correct-size aircraft for this particular route.

THE MODE IN CASE OF THE FREQUENCY

DISTRIBUTION OF A CONTINUOUS VARIABLE:

one that has the highest frequency).

At what point within the modal group does the mode lie?

The answer is contained in the following formula:

$$\text{Mode} = l + \frac{f_m - f_1}{(f_m - f_1) + (f_m - f_2)} \times h$$

Where,

• l = lower class boundary of the modal class,

• fm = frequency of the modal class,

• f1 = frequency of the class preceding the

modal class,

• f2 = frequency of the class following modal

class, and

• h = length of class interval of the modal class

Example of EPA mileage ratings, we have:

It is evident that the third class is the modal class. The mode lies

somewhere between 35.95 and 38.95.

In order to apply the formula for the mode, we note that fm = 14,

f1 = 4 and f2 = 8.

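Hence, with l = 35.95 and h = 3, we obtain:

$$\text{Mode} = 35.95 + \frac{14 - 4}{(14 - 4) + (14 - 8)} \times 3 = 35.95 + 1.875 = 37.825$$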
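The calculation for the EPA example can be sketched in Python. This assumes the usual grouped-data formula, mode = l + (fm − f1) / ((fm − f1) + (fm − f2)) × h, which uses exactly the symbols defined above; the function name `grouped_mode` is hypothetical.

```python
def grouped_mode(l, fm, f1, f2, h):
    """Mode of grouped data, interpolated within the modal class.

    l  : lower class boundary of the modal class
    fm : frequency of the modal class
    f1 : frequency of the class preceding the modal class
    f2 : frequency of the class following the modal class
    h  : length of the class interval of the modal class
    """
    return l + (fm - f1) / ((fm - f1) + (fm - f2)) * h

# EPA mileage ratings: the modal class runs from 35.95 to 38.95
mode = grouped_mode(l=35.95, fm=14, f1=4, f2=8, h=3)
print(round(mode, 3))
```

The result lies inside the modal class, closer to its upper half because the following class (frequency 8) is larger than the preceding one (frequency 4).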

Mode by considering the graphical representation.

For the example of EPA Mileage Ratings, the

histogram was as shown below:

The frequency polygon of the same

distribution

Median

when the variable values are placed in order of

magnitude.

OR

The median is defined as a value which divides a

set of data into two halves, one half comprising

observations greater than and the other half

smaller than it. More precisely, the median is a

value at or below which 50% of the data lie.

The median value can be ascertained by inspection

in many series.

EXAMPLE-1:

The average number of floors in the buildings at the center of a city:

5, 4, 3, 4, 5, 4, 3, 4, 5, 20, 5, 6, 32, 8, 27

Arranging these values in ascending order, we obtain

3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 8, 20, 27, 32

Picking up the middle value, we obtain the median equal to 5.

Interpretation

buildings, 7 have up to 5 floors and 7 have 5

floors or more. We noticed earlier that the

arithmetic mean was distorted toward the few

extremely high values in the series and hence

became unrepresentative. The median = 5 is

much more representative of this series.
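A quick Python check on the floors data shows how far the few tall buildings pull the arithmetic mean away from the median. The variable names are assumptions for this sketch.

```python
import statistics

floors = [5, 4, 3, 4, 5, 4, 3, 4, 5, 20, 5, 6, 32, 8, 27]

mean_floors = sum(floors) / len(floors)
median_floors = statistics.median(floors)

# Number of buildings that have fewer floors than the mean
below_mean = sum(1 for f in floors if f < mean_floors)

print(mean_floors, median_floors, below_mean)
```

Twelve of the fifteen buildings have fewer floors than the mean, which is why the median of 5 describes a typical building better.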

Median in Case of a Frequency Distribution of

a Continuous Variable:

formula

$$\text{Median} = l + \frac{h}{f}\left(\frac{n}{2} - c\right)$$

Where

l =lower class boundary of the median class (i.e. that class for

which the cumulative frequency is just in excess of n/2).

h=class interval size of the median class

f =frequency of the median class

n = Σf (the total number of observations)

c =cumulative frequency of the class preceding the median

class

Example of the EPA mileage ratings:

In this example, n = 30 and n/2 = 15.

Thus the third class is the median class. The median lies somewhere

between 35.95 and 38.95. Applying the above formula, we obtain

$$\text{Median} = 35.95 + \frac{3}{14}(15 - 6) = 35.95 + 1.93 = 37.88$$
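The median for the EPA example can be sketched in Python. This assumes the usual grouped-data formula, median = l + (h / f)(n / 2 − c), with the symbols defined above; the lower class boundaries and variable names are assumptions derived from the class limits of the mileage table.

```python
from itertools import accumulate

# Lower class boundaries and frequencies for the EPA mileage classes
lower_boundaries = [29.95, 32.95, 35.95, 38.95, 41.95]
freqs = [2, 4, 14, 8, 2]
h = 3.0  # class interval size

n = sum(freqs)
cumulative = list(accumulate(freqs))

# Median class: first class whose cumulative frequency is just in excess of n/2
idx = next(i for i, cf in enumerate(cumulative) if cf > n / 2)
c = cumulative[idx - 1] if idx > 0 else 0  # cumulative frequency before it

median = lower_boundaries[idx] + (h / freqs[idx]) * (n / 2 - c)
print(idx + 1, c, round(median, 2))
```

The search picks the third class as the median class, with 6 cars in the classes before it, in agreement with the text.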

Interpretation:

mileage less than or up to 37.88 miles per gallon

whereas the other half of the cars has mileage

greater than 37.88 miles per gallon. As discussed

earlier, the median is preferable to the arithmetic

mean when there are a few very high or low

figures in a series. It is also exceedingly valuable

when one encounters a frequency distribution

having open-ended class intervals.

Empirical relation between the mean,

median and the mode

empirical relation between the mean,

median and the mode. This is a concept

which is not based on a rigid mathematical

formula; rather, it is based on observation.

In fact, the word ‘empirical’ implies ‘based

on observation’.

Empirical relation between the mean,

median and the mode

relative positions of the mean, median and the mode in case of a hump-shaped distribution. In a single-peaked frequency distribution, the values of the mean, median and mode coincide if the frequency distribution is absolutely symmetrical.

[Figure: symmetrical hump-shaped frequency curve, f against X, with Mean = Median = Mode marked on the X-axis]

Empirical relation between the mean,

median and the mode

the mean, median and mode do not

all lie on the same point. They are

pulled apart from each other, and

the empirical relation explains the

way in which this happens.

Experience tells us that in a

unimodal curve of moderate

skewness, the median is usually

sandwiched between the mean and

the mode.

Empirical relation between the mean,

median and the mode

THE MEAN, MEDIAN AND THE MODE

DISPERSION

DISPERSION

their location on the horizontal axis (having

different ‘average’ values); similarly, they

differ in terms of the amount of variability

which they exhibit.

MEASURES OF DISPERSION

the variability in a sample or population.

a measure of central tendency, such as the

mean or median, to provide an overall

description of a set of data.

Absolute versus Relative Measures of

Dispersion

• There are two types of measures of dispersion: absolute and relative.

• An absolute measure of dispersion is one that measures the dispersion in the same units as the data, or in the square of those units.

• For example, if the units of the data are rupees, meters, kilograms, etc., the units of the measures of dispersion will also be rupees, meters, kilograms, etc.

Absolute versus Relative Measures of

Dispersion

• A relative measure of dispersion is one that is expressed in the form of a ratio, coefficient or percentage and is independent of the units of measurement.

• A relative measure of dispersion is useful for comparing data of different natures. A measure of central tendency together with a measure of dispersion gives an adequate description of data.

Why is it important to measure the

spread of data?

• There are many reasons why measuring the spread of data values is important, but one of the main reasons is its relationship with measures of central tendency.

• A measure of spread gives us an idea of how well the mean, for example, represents the data. If the spread of values in the data set is large, the mean is less representative of the data than if the spread is small.

Why is it important to measure the

spread of data?

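• This is because a large spread indicates that there are probably large differences between individual scores.

• Additionally, in research, it is often seen as positive if there is little variation in each data group, as it indicates that the scores in the group are similar.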

In a nutshell…

• The following questions are answered by measures of dispersion:

• How “spread out” are the values?

• How much spread relative to the mean?

• What is the shape of the distribution of

values?

EXAMPLES

• The ages of a group of first-year students are quite consistent, e.g. 17, 18, 18, 19, 18, 19, 19, 18, 17, 18 and 18 years.

• The ages of a group of students studying in their spare time may show just the opposite situation, e.g. 35, 23, 19, 48, 32, 24, 29, 37, 58, 18, 21 and 30.
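
The contrast between the two groups can be checked numerically. The following is a minimal sketch in Python using only the ages quoted above; the helper name `describe` is made up for this illustration, and the standard deviation is used here only as a preview of a measure listed later in these notes.

```python
import statistics

# Ages quoted in the two examples above.
first_year = [17, 18, 18, 19, 18, 19, 19, 18, 17, 18, 18]
second_group = [35, 23, 19, 48, 32, 24, 29, 37, 58, 18, 21, 30]


def describe(ages):
    """Return (mean, range, population standard deviation) of a list of ages."""
    return (
        statistics.mean(ages),
        max(ages) - min(ages),
        statistics.pstdev(ages),
    )


for label, ages in [("first-year students", first_year), ("second group", second_group)]:
    mean, value_range, sd = describe(ages)
    print(f"{label}: mean = {mean:.2f}, range = {value_range}, sd = {sd:.2f}")
```

The first group has a range of only 2 years, whereas the second has a range of 40 years, so an average alone would hide how different the two groups are.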

EXAMPLES

• It is obvious from these examples that the variation that exists between the various values of a data-set is of substantial importance.

• We obviously need to be aware of the amount of variability present in a data-set if we are to come to useful conclusions about the situation under review.

EXAMPLES

• The sizes of the classes in two different areas are as follows:

Number of Pupils    Number of Classes
                    Area A    Area B
10 – 14                0         5
15 – 19                3         8
20 – 24               13        10
25 – 29               24        12
30 – 34               17        14
35 – 39                3         5
40 – 44                0         3
45 – 49                0         3
Total                 60        60

EXAMPLES

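• If the arithmetic mean size of class is calculated for each area, we discover that the answer is identical: 27.33 pupils in both areas.

• Even though these two distributions share a common average, it can readily be seen that they are entirely DIFFERENT.

• And the graphs of the two distributions (given below) clearly indicate this fact.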

[Figure: graphs of the two distributions; vertical axis: Number of Classes (0 to 25), horizontal axis: Number of Pupils]
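
The claim that both areas share the same mean but differ in spread can be verified from the table. The sketch below assumes each class interval is represented by its midpoint (12, 17, 22, and so on) and uses the population form of the standard deviation; both choices are assumptions of this illustration, and the function name `grouped_mean_sd` is hypothetical.

```python
import math

# Class-size table from above: (lower limit, upper limit, classes in Area A, classes in Area B).
table = [
    (10, 14, 0, 5), (15, 19, 3, 8), (20, 24, 13, 10), (25, 29, 24, 12),
    (30, 34, 17, 14), (35, 39, 3, 5), (40, 44, 0, 3), (45, 49, 0, 3),
]


def grouped_mean_sd(midpoints, freqs):
    """Mean and population standard deviation of grouped data."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / n
    variance = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs)) / n
    return mean, math.sqrt(variance)


# Assumption: every class is represented by its midpoint.
midpoints = [(low + high) / 2 for low, high, _, _ in table]
mean_a, sd_a = grouped_mean_sd(midpoints, [a for _, _, a, _ in table])
mean_b, sd_b = grouped_mean_sd(midpoints, [b for _, _, _, b in table])

print(f"Area A: mean = {mean_a:.2f}, sd = {sd_a:.2f}")
print(f"Area B: mean = {mean_b:.2f}, sd = {sd_b:.2f}")
```

Both means come out at 27.33 pupils, while the standard deviation for Area B is roughly twice that of Area A.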

EXAMPLES

• The question which must be posed and answered is ‘In what way can these two situations be distinguished?’

• We need a measure of variability or DISPERSION to accompany the relevant measure of position or ‘average’ used.

• The word ‘relevant’ is important here, for we shall find one measure of dispersion which expresses the scatter of values round the arithmetic mean, another the scatter of values round the median, and so forth. Each measure of dispersion is associated with a particular ‘average’.

Histograms

[Figure: two histograms side by side, each with a vertical axis running from 0 to 40]

• Widely different distributions

• How to capture this variability in one number?

Measures of dispersion

• There are several measures of dispersion; these can be classified as:

• the range,

• variance,

• absolute deviation,

• standard deviation and

• quartiles, deciles and percentiles.

RANGE

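• The range is the difference between the highest and lowest scores in a data set and is the simplest measure of spread. So we calculate range as:

\[ \text{Range} = X_m - X_0 \]

where X_m is the highest value and X_0 the lowest value in the data set.

[Figure: frequency curve (f against X) with the range marked from X0 to Xm]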

RANGE

• The simplicity of the concept does not necessarily invalidate it, but in general it gives no idea of the DISTRIBUTION of the observations between the two ends of the series.

• For this reason it is used principally as a supplementary aid in the description of variable data, in conjunction with other measures of dispersion.

Disadvantages

• When the data are grouped into a frequency distribution, the range is estimated by finding the difference between the upper boundary of the highest class and the lower boundary of the lowest class.

• However, because it is computed from only the two extreme values in a data-set, it has two serious disadvantages:

• It ignores all the INFORMATION available from the intermediate observations.

• It might give a MISLEADING picture of the spread in the data.

Disadvantages

• For these reasons, the range is generally an unsatisfactory measure of dispersion. However, it is APPROPRIATELY used in statistical quality control charts of manufactured products, daily temperatures, stock prices, etc.

Example

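• As an example, let us consider the following data set:

23 56 45 65 59 55 62 54 85 25

• The maximum value is 85 and the minimum value is 23. This results in a range of 62, which is 85 minus 23.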
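
The same calculation can be written as a short Python sketch; the variable names are chosen for this illustration only.

```python
# Data set from the example above.
scores = [23, 56, 45, 65, 59, 55, 62, 54, 85, 25]

highest = max(scores)
lowest = min(scores)
value_range = highest - lowest  # Range = highest value - lowest value

print(f"highest = {highest}, lowest = {lowest}, range = {value_range}")
```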

Points to ponder!

• Whilst using the range as a measure of spread is limited, it does set the boundaries of the scores.

• This can be useful if you are measuring a variable that has either a critical low or high threshold (or both) that should not be crossed.

• The range will instantly inform you whether at least one value broke these critical thresholds.

• In addition, the range can be used to detect any errors when entering data. For example, if you have recorded the age of school children in your study and your range is 7 to 123 years old, you know you have made a mistake!

Quartiles, Deciles and Percentiles

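• The median divides the total area under the frequency curve into two equal halves (50% on each side of the median).

• This area can be divided further, into four quarters.

• The quartiles, together with the median, achieve the division of the total area into four equal parts.

[Figure: frequency curve (X on the horizontal axis) with the median splitting the area into two halves of 50% each]

• The first, second and third quartiles are given by the formulae: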

Quartiles, Deciles and Percentiles

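• First quartile:

\[ Q_1 = l + \frac{h}{f}\left(\frac{n}{4} - c\right) \]

What are l, n, h, f and c?

• Second quartile (i.e. median):

\[ Q_2 = l + \frac{h}{f}\left(\frac{2n}{4} - c\right) = l + \frac{h}{f}\left(\frac{n}{2} - c\right) \]

• Third quartile:

\[ Q_3 = l + \frac{h}{f}\left(\frac{3n}{4} - c\right) \]

[Figure: frequency curve divided into four areas of 25% each by Q1, Q2 = X̃ (the median) and Q3]

It is clear from the formula of the second quartile that the second quartile is the same as the median.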

Quartiles, Deciles and Percentiles

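• The deciles and the percentiles give the division of the total area into 10 and 100 equal parts respectively.

• The formula for the first decile is

\[ D_1 = l + \frac{h}{f}\left(\frac{n}{10} - c\right) \]

• The formulae for the subsequent deciles are

\[ D_2 = l + \frac{h}{f}\left(\frac{2n}{10} - c\right), \qquad D_3 = l + \frac{h}{f}\left(\frac{3n}{10} - c\right) \]

• And so on. It is easily seen that the 5th decile is the same as the median.

Quartiles, Deciles and Percentiles

• The formula for the first percentile is

\[ P_1 = l + \frac{h}{f}\left(\frac{n}{100} - c\right) \]

• The formulae for the subsequent percentiles are

\[ P_2 = l + \frac{h}{f}\left(\frac{2n}{100} - c\right), \qquad P_3 = l + \frac{h}{f}\left(\frac{3n}{100} - c\right) \]

• And so on. It is easily seen that the 50th percentile is the same as the median.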
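
All of these formulae share one pattern, which can be sketched as a single Python function. The sketch assumes the symbols carry their usual meaning from the median formula: l is the lower boundary of the class containing the required value, h its width, f its frequency, n the total frequency and c the cumulative frequency of the preceding classes. The function name `grouped_quantile` is hypothetical, and the data are the Area A class sizes from the earlier table, with class boundaries assumed to be 9.5, 14.5, 19.5 and so on.

```python
def grouped_quantile(boundaries, freqs, k, m):
    """k-th of m equal divisions for grouped data: l + (h / f) * (k * n / m - c)."""
    n = sum(freqs)
    target = k * n / m
    c = 0  # cumulative frequency of the classes before the current one
    for (lower, upper), f in zip(boundaries, freqs):
        if f > 0 and c + f >= target:
            return lower + (upper - lower) / f * (target - c)
        c += f
    raise ValueError("the requested position lies outside the data")


# Area A class sizes; boundaries are an assumption (limits 10-14 become 9.5-14.5, etc.).
boundaries = [(9.5 + 5 * i, 14.5 + 5 * i) for i in range(8)]
freqs_a = [0, 3, 13, 24, 17, 3, 0, 0]

q1 = grouped_quantile(boundaries, freqs_a, 1, 4)
q2 = grouped_quantile(boundaries, freqs_a, 2, 4)
q3 = grouped_quantile(boundaries, freqs_a, 3, 4)
d5 = grouped_quantile(boundaries, freqs_a, 5, 10)
p50 = grouped_quantile(boundaries, freqs_a, 50, 100)

print(f"Q1 = {q1:.2f}, Q2 = {q2:.2f}, Q3 = {q3:.2f}")
print(f"D5 = {d5:.2f}, P50 = {p50:.2f}")
```

Because 2n/4, 5n/10 and 50n/100 are all equal to n/2, the second quartile, the 5th decile and the 50th percentile come out identical, as expected for the median.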

A simple example for estimating quartiles

• For example, consider the marks of the 100 students below, which

have been ordered from the lowest to the highest scores.

Order Score Order Score Order Score Order Score Order Score

1st 35 21st 42 41st 53 61st 64 81st 74

2nd 37 22nd 42 42nd 53 62nd 64 82nd 74

3rd 37 23rd 44 43rd 54 63rd 65 83rd 74

4th 38 24th 44 44th 55 64th 66 84th 75

5th 39 25th 45 45th 55 65th 67 85th 75

6th 39 26th 45 46th 56 66th 67 86th 76

7th 39 27th 45 47th 57 67th 67 87th 77

8th 39 28th 45 48th 57 68th 67 88th 77

9th 39 29th 47 49th 58 69th 68 89th 79

10th 40 30th 48 50th 58 70th 69 90th 80

11th 40 31st 49 51st 59 71st 69 91st 81

12th 40 32nd 49 52nd 60 72nd 69 92nd 81

13th 40 33rd 49 53rd 61 73rd 70 93rd 81

14th 40 34th 49 54th 62 74th 70 94th 81

15th 40 35th 51 55th 62 75th 71 95th 81

16th 41 36th 51 56th 62 76th 71 96th 81

17th 41 37th 51 57th 63 77th 71 97th 83

18th 42 38th 51 58th 63 78th 72 98th 84

19th 42 39th 52 59th 64 79th 74 99th 84

20th 42 40th 52 60th 64 80th 74 100th 85

A simple example for estimating quartiles

• The first quartile (Q1) lies between the 25th and 26th student's marks,

• The second quartile (Q2) between the 50th and 51st student's marks, and

• The third quartile (Q3) between the 75th and 76th student's marks.

First quartile (Q1) = (45 + 45) ÷ 2 = 45

Second quartile (Q2) = (58 + 59) ÷ 2 = 58.5

Third quartile (Q3) = (71 + 71) ÷ 2 = 71
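
The three averages above can be reproduced with a short Python sketch. It follows the convention used in this example (averaging the two scores on either side of each quartile position) and assumes the number of scores is a multiple of four; the function name `quartiles_by_averaging` is made up for this illustration.

```python
# The 100 ordered marks from the table above (1st to 100th).
scores = [
    35, 37, 37, 38, 39, 39, 39, 39, 39, 40, 40, 40, 40, 40, 40, 41, 41, 42, 42, 42,
    42, 42, 44, 44, 45, 45, 45, 45, 47, 48, 49, 49, 49, 49, 51, 51, 51, 51, 52, 52,
    53, 53, 54, 55, 55, 56, 57, 57, 58, 58, 59, 60, 61, 62, 62, 62, 63, 63, 64, 64,
    64, 64, 65, 66, 67, 67, 67, 67, 68, 69, 69, 69, 70, 70, 71, 71, 71, 72, 74, 74,
    74, 74, 74, 75, 75, 76, 77, 77, 79, 80, 81, 81, 81, 81, 81, 81, 83, 84, 84, 85,
]


def quartiles_by_averaging(ordered):
    """Quartiles of an ordered list whose length is a multiple of 4."""
    n = len(ordered)

    def between(k):
        # Average of the k-th and (k+1)-th scores, counting from 1.
        return (ordered[k - 1] + ordered[k]) / 2

    return between(n // 4), between(n // 2), between(3 * n // 4)


q1, q2, q3 = quartiles_by_averaging(scores)
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}")
```

Statistical packages use several different quartile conventions, so their results may differ slightly from this hand method.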

Reflection from the example

• In the above example, we have an even number of scores (100 students, rather than an odd number, such as 99 students). This means that when we calculate the quartiles, we take the sum of the two scores around each quartile and then halve them (hence Q1 = (45 + 45) ÷ 2 = 45). However, if we had an odd number of scores (say, 99 students), we would only need to take one score for each quartile (that is, the 25th, 50th and 75th scores). You should recognize that the second quartile is also the median.

Reflection from the example

• Quartiles are a useful measure of spread because they are much less affected by outliers or a skewed data set than the equivalent measures of mean and standard deviation.

• For this reason, quartiles are often reported along with the median as the best choice of measure of spread and central tendency, respectively, when dealing with skewed data and/or data with outliers.

• A common way of expressing quartiles is as an interquartile range. The interquartile range describes the difference between the third quartile (Q3) and the first quartile (Q1), telling us about the range of the middle half of the scores in the distribution.

Reflection from the example

Interquartile range = Q3 - Q1

= 71 - 45

= 26

• However, it should be noted that in journals and other

publications you will usually see the interquartile range reported

as 45 to 71, rather than the calculated range.

• A slight variation on this is the semi-interquartile range, which is

half the interquartile range = ½ (Q3 - Q1). Hence, for our 100

students, this would be 26 ÷ 2 = 13. This is also known as the

quartile deviation.

[Figure: frequency curve (f against X) showing the inter-quartile range between Q1 and Q3 and the quartile deviation (semi inter-quartile range)]
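
The statement that quartiles are less affected by outliers than the range can be illustrated with a small Python sketch. It reuses the ten scores from the range example and replaces the highest score with a hypothetical data-entry error of 850. Note that `statistics.quantiles` uses its own interpolation convention, which differs from the averaging method used for the 100 students, so the quartile values themselves are not directly comparable with the hand calculation.

```python
import statistics


def spread(values):
    """Return (range, interquartile range) of a list of values."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return max(values) - min(values), q3 - q1


scores = [23, 56, 45, 65, 59, 55, 62, 54, 85, 25]
# Hypothetical data-entry error: the highest score 85 is typed as 850.
with_outlier = [850 if x == 85 else x for x in scores]

range_clean, iqr_clean = spread(scores)
range_outlier, iqr_outlier = spread(with_outlier)

print(f"clean data:   range = {range_clean}, IQR = {iqr_clean}")
print(f"with outlier: range = {range_outlier}, IQR = {iqr_outlier}")
print(f"quartile deviation (clean data) = {iqr_clean / 2}")
```

The range jumps from 62 to 827, while the interquartile range, and hence the quartile deviation, is unchanged.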

Coefficient of Dispersion (COD)

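• There is also a relative measure of dispersion based on the range. This relative measure is known as the COEFFICIENT OF DISPERSION, and is defined by the relation given below:

\[ \text{COD} = \frac{\tfrac{1}{2}\,\text{Range}}{\text{Mid-Range}} = \frac{(X_m - X_0)/2}{(X_m + X_0)/2} = \frac{X_m - X_0}{X_m + X_0} \]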

Coefficient of Dispersion (COD)

• Since the coefficient of dispersion is a pure number (without units), it is used for the purposes of COMPARISON. (This is so because a pure number can be compared with another pure number.)

• For example, if the coefficient of dispersion for one data-set comes out to be 0.6 whereas the coefficient of dispersion for another data-set comes out to be 0.4, then it is obvious that there is a greater amount of dispersion in the first data-set as compared with the second.
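
As a minimal sketch, the coefficient of dispersion can be computed from the highest and lowest values alone. The example below reuses the ten scores from the range example and also rescales them by a factor of 100 (a hypothetical change of units) to show that the coefficient does not depend on the units of measurement. The function name is chosen for this illustration.

```python
import math


def coefficient_of_dispersion(values):
    """(Xm - X0) / (Xm + X0), where Xm is the highest and X0 the lowest value."""
    x_m, x_0 = max(values), min(values)
    return (x_m - x_0) / (x_m + x_0)


scores = [23, 56, 45, 65, 59, 55, 62, 54, 85, 25]
rescaled = [100 * x for x in scores]  # hypothetical change of units

cod = coefficient_of_dispersion(scores)
cod_rescaled = coefficient_of_dispersion(rescaled)

print(f"COD = {cod:.3f}, COD after rescaling = {cod_rescaled:.3f}")
```

Here the coefficient is 62 ÷ 108, roughly 0.574, in both cases, whereas the range itself would change from 62 to 6200.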

Chebychev’s Inequality

• Data that is normally distributed, that is, in the shape of a bell curve, has several features. One of them deals with the spread of the data relative to the number of standard deviations from the mean.

• In a normal distribution, approximately 68% of the data lies within one standard deviation of the mean, approximately 95% within two standard deviations, and approximately 99.7% within three standard deviations.

• But if the data set is not distributed in the shape of a bell curve, then a different amount could be within one standard deviation. Chebychev’s inequality gives a lower bound on the fraction of data that falls within K standard deviations of the mean for any data set.
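
In its standard form (not spelled out in the text above), Chebychev’s inequality states that at least 1 − 1/K² of the data lies within K standard deviations of the mean, for any K greater than 1. The sketch below checks this bound against the second set of ages from the earlier examples, using the population standard deviation; the function names are hypothetical.

```python
import statistics


def chebyshev_bound(k):
    """Minimum fraction of data within k standard deviations of the mean (k > 1)."""
    return 1 - 1 / k ** 2


def fraction_within(values, k):
    """Observed fraction of values within k population standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum(abs(x - mean) <= k * sd for x in values) / len(values)


ages = [35, 23, 19, 48, 32, 24, 29, 37, 58, 18, 21, 30]

for k in (1.5, 2, 3):
    print(f"K = {k}: bound = {chebyshev_bound(k):.3f}, observed = {fraction_within(ages, k):.3f}")
```

The bound is deliberately conservative: for K = 2 it guarantees only 75%, compared with roughly 95% for bell-shaped data.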

Why sampling?

• It is usually impractical or impossible to study or survey every member of a population; studying a sample of that population is a more attainable goal.

• A well-designed sample gives you the opportunity to have confidence in the data, and also to assess what might happen if the assumptions are wrong.

• If you perform your research with the wrong sample, or one that is inaccurately designed, you will almost certainly get misleading results.

Techniques of Sampling

• Sampling techniques can be divided into two major subsets:

• Probabilistic and

• Non-probabilistic

• There are four types of Probabilistic sampling: Random, Systematic, Cluster, and Stratified.
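
The four probabilistic techniques can be sketched in Python as follows. Everything here is illustrative: the population of 100 numbered units, the two strata and the ten clusters are all made-up assumptions, and real survey designs involve many more considerations.

```python
import random

rng = random.Random(0)  # fixed seed so that the sketch is repeatable
population = list(range(1, 101))  # hypothetical population of 100 numbered units

# Simple random sampling: every unit has the same chance of selection.
simple = rng.sample(population, 10)

# Systematic sampling: a random start, then every k-th unit.
k = len(population) // 10
start = rng.randrange(k)
systematic = population[start::k]

# Stratified sampling: sample 10% from each of two hypothetical strata.
strata = {"stratum 1": population[:60], "stratum 2": population[60:]}
stratified = [
    unit
    for group in strata.values()
    for unit in rng.sample(group, len(group) // 10)
]

# Cluster sampling: pick 2 of 10 hypothetical clusters and take every unit in them.
clusters = [population[i:i + 10] for i in range(0, len(population), 10)]
cluster_sample = [unit for cluster in rng.sample(clusters, 2) for unit in cluster]

print("simple random:", sorted(simple))
print("systematic:   ", systematic)
print("stratified:   ", sorted(stratified))
print("cluster:      ", cluster_sample)
```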

5. Statistical inference

• A statistical inference is an estimate, prediction or other generalization about a population based on information contained in a sample. We use the information contained in the smaller sample to learn about the larger population.

Thus, from the sample of 1500 voters, the pollster may estimate the percentage of all the voters who would vote for each presidential candidate if the elections were held on the day the poll was conducted, or he might use the information to predict the outcome on election day.
