
1/5/2020 Dr. Arsham's Statistics Site

Statistical Thinking for Managerial Decisions


For my visitors from the Spanish-speaking world, this site is available in Spanish at:

Latin America, Spain

This Web site is a course in statistics appreciation; i.e., acquiring a feeling for the statistical way of
thinking. It contains various useful concepts and topics at many levels of learning statistics for
decision making under uncertainties. The cardinal objective for this Web site is to increase the
extent to which statistical thinking is merged with managerial thinking for good decision making
under uncertainty.

Professor Hossein Arsham

MENU
Chapter 1: Towards Statistical Thinking for Decision Making
Chapter 2: Descriptive Sampling Data Analysis
Chapter 3: Probability as a Confidence Measuring Tool for Statistical Inference
Chapter 4: Necessary Conditions for Statistical Decision Making
Chapter 5: Estimators and Their Qualities
Chapter 6: Hypothesis Testing: Rejecting a Claim
Chapter 7: Hypotheses Testing for Means and Proportions
Chapter 8: Tests for Statistical Equality of Two or More Populations
Chapter 9: Applications of the Chi-square Statistic
Chapter 10: Regression Modeling and Analysis
Chapter 11: Unified Views of Statistical Decision Technologies
Chapter 12: Index Numbers and Ratios with Applications
A Why List: Frequently Asked Statistical Questions (Word.Doc)
Formulas Concerning the Mean(s) (PDF), Print to enlarge
A Conceptual Summary-Sheet
A Technical Summary-Sheet
Exercise Your Knowledge to Enhance What You Have Learned (PDF)
E-Labs and Computational Tools
Excel for Statistical Data Analysis
Widely Used Statistical Tables (PDF)
What Maths Do I Need for This Course? (Word.Doc), A Sample of "How Things Can Go
Wrong?"

Companion Sites:
Topics in Statistical Data Analysis
Time Series Analysis and Business Forecasting
Computers and Computational Statistics
Questionnaire Design and Surveys Sampling
Probabilistic Modeling
Systems Simulation
Probability and Statistics Resources
Success Science
Leadership Decision Making
Linear Programming (LP) and Goal-Seeking Strategy
Artificial-variable Free LP Solution Algorithms
Integer Optimization and the Network Models
Tools for LP Modeling Validation
The Classical Simplex Method
Zero-Sum Games with Applications
Computer-assisted Learning Concepts and Techniques
Linear Algebra and LP Connections
From Linear to Nonlinear Optimization with Business Applications
Construction of the Sensitivity Region for LP Models
Zero Sagas in Four Dimensions

home.ubalt.edu/ntsbarsh/Business-stat/opre504.htm#rgl 1/139

Business Keywords and Phrases


Collection of JavaScript E-labs Learning Objects
Compendium of Web Site Review
Impact of the Internet on Learning & Teaching
The Business Statistics Online Course

To search the site, try Edit | Find in page [Ctrl + f]. Enter a word or phrase in the dialogue box,
e.g., "parameter" or "probability". If the first appearance of the word/phrase is not what you are looking for, try Find
Next.

1. Towards Statistical Thinking for Decision Making

1. Introduction
2. The Birth of Probability and Statistics
3. Statistical Modeling for Decision-Making under Uncertainties
4. Statistical Decision-Making Process
5. What is Business Statistics?
6. Common Statistical Terminology with Applications

2. Descriptive Sampling Data Analysis

1. Greek Letters Commonly Used in Statistics


2. Type of Data and Levels of Measurement
3. Why Statistical Sampling?
4. Sampling Methods
5. Representative of a Sample: Measures of Central Tendency
6. Selecting Among the Mean, Median, and Mode
7. Specialized Averages: The Geometric & Harmonic Means
8. Histogramming: Checking for Homogeneity of Population
9. How to Construct a BoxPlot
10. Measuring the Quality of a Sample
11. Selecting Among the Measures of Dispersion
12. Shape of a Distribution Function: The Skewness-Kurtosis Chart
13. A Numerical Example & Discussions
14. The Two Statistical Representations of a Population
15. Empirical (i.e., observed) Cumulative Distribution Function

3. Probability as a Confidence Measuring Tool for Statistical Inference

1. Introduction
2. Probability, Chance, Likelihood, and Odds
3. How to Assign Probabilities
4. General Computational Probability Rules
5. Combinatorial Math: How to Count Without Counting
6. Joint Probability and Statistics
7. Mutually Exclusive versus Independent Events
8. What Is so Important About the Normal Distributions?
9. What Is a Sampling Distribution?
10. What Is The Central Limit Theorem (CLT)?
11. An Illustration of CLT
12. What Is "Degrees of Freedom"?
13. Applications of and Conditions for Using Statistical Tables
14. Numerical Examples for Statistical Tables
Beta Density Function
Binomial Probability Function
Chi-square Density Function
Exponential Density Function
F-Density Function
Gamma Density Function


Geometric Probability Function
Hypergeometric Probability Function
Log-normal Density Function
Multinomial Probability Function
Negative Binomial Probability Function
Normal Density Function
Poisson Probability Function
Student T-Density Function
Triangular Density Function
Uniform Density Function
Other Density and Probability Functions

4. Necessary Conditions for Statistical Decision Making

1. Introduction
2. Measure of Surprise for Outlier Detection
3. Homogeneous Population (Don't mix apples and oranges)
4. Test for Randomness
5. Test for Normality

5. Estimators and Their Qualities

1. Introduction
2. Qualities of a Good Estimator
3. Estimations with Confidence
4. What Is the Margin of Error?
5. Bias Reduction Techniques: Bootstrapping and Jackknifing
6. Prediction Intervals
7. What Is a Standard Error?
8. Sample Size Determination
9. Pooling the Sampling Estimates for Mean, Variance, and Standard Deviation
10. Revising the Expected Value and the Variance
11. Subjective Assessment of Several Estimates
12. Bayesian Statistical Inference: An Introduction

6. Hypothesis Testing: Rejecting a Claim

1. Introduction
2. Managing the Producer's or the Consumer's Risk
3. Classical Approach to Testing Hypotheses
4. The Meaning and Interpretation of P-values (what the data say)
5. Blending the Classical and the P-value Based Approaches in Test of Hypotheses
6. Bonferroni Method for Multiple P-Values Procedure
7. Power of a Test and the Size Effect
8. Parametric vs. Non-Parametric vs. Distribution-free Tests

7. Hypotheses Testing for Means and Proportions

1. Introduction
2. Single Population t-Test
3. Two Independent Populations
4. Non-parametric Multiple Comparison Procedures
5. The Before-and-After Test
6. ANOVA for Normal but Condensed Data Sets
7. ANOVA for Dependent Populations

8. Tests for Statistical Equality of Two or More Populations

1. Introduction
2. Equality of Two Normal Populations
3. Testing a Shift in Normal Populations
4. Analysis of Variance (ANOVA)
5. Equality of Proportions in Several Populations
6. Distribution-free Equality of Two Populations
7. Comparison of Two Random Variables

9. Applications of the Chi-square Statistic

1. Introduction
2. Test for Crosstable Relationship
3. 2 by 2 Crosstable Analysis
4. Identical Populations Test for Crosstable Data
5. Test for Equality of Several Population Proportions
6. Test for Equality of Several Population Medians
7. Goodness-of-Fit Test for Probability Mass Functions
8. Compatibility of Multi-Counts
9. Necessary Conditions in Applying the Above Tests
10. Testing the Variance: Is the Quality that Good?
11. Testing the Equality of Multi-Variances
12. Correlation Coefficients Testing

10. Regression Modeling and Analysis

1. Simple Linear Regression: Computational Aspects


2. Regression Modeling and Analysis
3. Regression Modeling Selection Process
4. Covariance and Correlation
5. Pearson, Spearman, and Point-biserial Correlations
6. Correlation, and Level of Significance
7. Independence vs. Correlated
8. How to Compare Two Correlation Coefficients
9. Conditions and the Check-list for Linear Models
10. Analysis of Covariance: Comparing the Slopes
11. Residential Properties Appraisal Application

11. Unified Views of Statistical Decision Technologies

1. Introduction
2. Hypothesis Testing with Confidence
3. Regression Analysis, ANOVA, and Chi-square Test
4. Regression Analysis, ANOVA, T-test, and Coefficient of Determination
5. Relationships among Popular Distributions

12. Index Numbers and Ratios with Applications

1. Introduction
2. Consumer Price Index
3. Ratio Indexes
4. Composite Index Numbers
5. Variation Index as a Quality Indicator
6. Labor Force Unemployment Index
7. Seasonal Index and Deseasonalizing Data
8. Human Ideal Weight: The Body Mass Index
9. Statistical Technique and Index Numbers

Introduction to Statistical Thinking for Decision Making

This site builds up the basic ideas of business statistics systematically and correctly. It is a
combination of lectures and computer-based practice, joining theory firmly with practice. It
introduces techniques for summarizing and presenting data, estimation, confidence intervals and
hypothesis testing. The presentation focuses more on understanding of key concepts and
statistical thinking, and less on formulas and calculations, which can now be done on small
computers through user-friendly Statistical JavaScript A, etc. A Spanish version of this site is
available at Razonamiento Estadístico para la Toma de Decisiones Gerenciales and its collection
of JavaScript.

Today's good decisions are driven by data. In all aspects of our lives, and importantly in the
business context, an amazing diversity of data is available for inspection and analytical insight.
Business managers and professionals are increasingly required to justify decisions on the basis of
data. They need statistical model-based decision support systems.

Statistical skills enable them to intelligently collect, analyze and interpret data relevant to their
decision-making. Statistical concepts and statistical thinking enable them to:

solve problems in a diversity of contexts.


add substance to decisions.
reduce guesswork.

This Web site is a course in statistics appreciation; i.e., acquiring a feel for the statistical way
of thinking. It hopes to make sound statistical thinking understandable in business terms. An
introductory course in statistics, it is designed to provide you with the basic concepts and
methods of statistical analysis for processes and products. Materials in this Web site are
tailored to help you make better decisions and to get you thinking statistically. A cardinal
objective for this Web site is to embed statistical thinking into managers, who must often
decide with little information.

In a competitive environment, business managers must design quality into products, and into
the processes of making the products. They must facilitate a process of never-ending
improvement at all stages of manufacturing and service. This is a strategy that employs
statistical methods, particularly statistically designed experiments, and produces processes
that provide high yield and products that seldom fail. Moreover, it facilitates development of
robust products that are insensitive to changes in the environment and internal component
variation. Carefully planned statistical studies remove hindrances to high quality and
productivity at every stage of production. This saves time and money. It is well recognized
that quality must be engineered into products as early as possible in the design process. One
must know how to use carefully planned, cost-effective statistical experiments to improve,
optimize and make robust products and processes.

Business Statistics is a science assisting you to make business decisions under
uncertainties based on some numerical and measurable scales. Decision making processes
must be based on data, not on personal opinion or belief.

The Devil is in the Deviations: Variation is inevitable in life! Every process, every
measurement, every sample has variation. Managers need to understand variation for two
key reasons: first, so that they can lead others to apply statistical thinking in day-to-day
activities, and second, so that they can apply the concept for continuous improvement. This
course will provide you with hands-on experience to promote the use of statistical thinking
and techniques to apply them to make educated decisions, whenever you encounter variation
in business data. You will learn techniques to intelligently assess and manage the risks
inherent in decision-making. Therefore, remember that:

Just like the weather, if you cannot control something, you should learn how to measure
and analyze it in order to predict it effectively.

If you have taken statistics before, and have a feeling of inability to grasp concepts, it may be
largely due to your former non-statistician instructors teaching statistics. Their deficiencies
lead students to develop phobias for the sweet science of statistics. In this respect,
Professor Herman Chernoff (1996) made the following remark:

"Since everybody in the world thinks he can teach statistics even though he does
not know any, I shall put myself in the position of teaching biology even though I
do not know any"

Inadequate statistical teaching during university education leads, even after graduation, to
one or a combination of the following scenarios:

1. In general, people do not like statistics and therefore they try to avoid it.
2. There is pressure to produce scientific papers; however, one is often confronted with "I need something quick."
3. At many institutes in the world, there are only a few (mostly 1) statisticians, if any at all. This means that
these people are extremely busy. As a result, they tend to advise simple and easy to apply techniques,
or they will have to do it themselves. For my teaching philosophy statements, you may like to visit the
Web site On Learning & Teaching.
4. Communication between a statistician and decision-maker can be difficult. One speaks in statistical
jargon; the other understands the monetary or utilitarian benefit of using the statistician's
recommendations.

Plugging numbers into formulas and crunching them has no value by itself. You should
continue to put effort into the concepts and concentrate on interpreting the results.

Even when you solve a small size problem by hand, I would like you to use the available computer
software and Web-based computation to do the dirty work for you.

You must be able to read the logic behind any formula rather than memorize it. For example, in
computing the variance, consider its formula. Instead of memorizing it, you should start with some
whys:

i. Why do we square the deviations from the mean?

Because if we add up all the deviations, we always get zero. So, to deal with this problem, we
square the deviations. Why not raise them to the power of four (three will not work)? Squaring does the
trick; why should we make life more complicated than it is? Notice also that squaring
magnifies the deviations; therefore it works to our advantage in measuring the quality of the data.

ii. Why is there a summation notation in the formula?

To add up the squared deviations of all the data points and obtain the total sum of squared deviations.

iii. Why do we divide the sum of squares by n-1?

The amount of deviation should also reflect how large the sample is, so we must bring in the
sample size. That is, in general, larger samples have a larger sum of squared deviations from the
mean. Why n-1 and not n? The reason is that when you divide by n-1, the sample variance
provides an estimate much closer to the population variance than when you divide by n. Dividing
by n tends to underestimate the population variance, because the deviations are measured from the
sample mean rather than from the unknown population mean.
Note that for a large sample size n (say over 30), it really does not matter whether you divide by
n or n-1. The results are almost the same, and they are acceptable. The factor n-1 is what we
consider as the "degrees of freedom".
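
The three "whys" above can be checked numerically. The following is a minimal Python sketch, assuming a small made-up sample (the values in `data` are hypothetical and do not come from this site). It shows that the raw deviations sum to zero, and it compares dividing the sum of squares by n with dividing by n-1.

```python
import statistics

# Hypothetical sample, chosen only so the arithmetic is easy to follow.
data = [2.0, 4.0, 4.0, 5.0, 10.0]
n = len(data)
mean = sum(data) / n

# (i) Raw deviations from the mean cancel out, so their sum is zero.
deviations = [x - mean for x in data]
sum_of_deviations = sum(deviations)

# (ii) Squaring removes the signs; the summation gives the total sum of squares.
sum_of_squares = sum(d ** 2 for d in deviations)

# (iii) Divide by n-1 (the degrees of freedom) for the sample variance,
# or by n for the uncorrected version.
variance_n_minus_1 = sum_of_squares / (n - 1)
variance_n = sum_of_squares / n

print("sum of deviations:", sum_of_deviations)
print("sum of squares:", sum_of_squares)
print("variance (n-1):", variance_n_minus_1)
print("variance (n):", variance_n)

# The standard library agrees with the hand computation.
print(statistics.variance(data), statistics.pvariance(data))
```

For this sample the sum of squares is 36, so the n-1 version gives 9.0 and the n version gives 7.2. With only five observations the two differ noticeably; the gap shrinks as n grows.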

This example shows how to question statistical formulas, rather than memorizing them. In fact,
when you try to understand the formulas, you do not need to remember them; they become part of your
brain connectivity. Clear thinking is always more important than the ability to do arithmetic.

When you look at a statistical formula, the formula should talk to you, just as a musician looking at
a piece of sheet music hears the music.

Computer-assisted learning: Computer-assisted learning provides you with a "hands-on"
experience which will enhance your understanding of the concepts and techniques covered in this
site.

JavaScript, once an esoteric programming language for animating Web pages, is now a full-fledged
platform for building E-labs' learning objects with useful applications. As you used to do
experiments in physics labs to learn physics, computer-assisted learning enables you to use any
online interactive tool available on the Internet to perform experiments. The purpose is the same;
i.e., to understand statistical concepts by using statistical applets which are entertaining and
educational.

The appearance of computer software, JavaScript, statistical demonstration applets, and online
computation is the most important development in the process of teaching and learning concepts in
model-based statistical decision-making courses. These e-lab technologies allow you to construct
numerical examples to understand the concepts, and to find their significance for yourself.

Unfortunately, most classroom courses are not learning systems. The way the instructors attempt
to help their students acquire skills and knowledge has absolutely nothing to do with the way
students actually learn. Many instructors rely on lectures, tests, and memorization. All too
often, they rely on "telling." No one remembers much that's taught by telling, and what's told doesn't
translate into usable skills. Certainly, we learn by doing, failing, and practicing until we do it right.
Computer-assisted learning serves this purpose.

A course in appreciation of statistical thinking gives business professionals an edge.
Professionals with strong quantitative skills are in demand. This phenomenon will grow as the
impetus for data-based decisions strengthens and the amount and availability of data
increase. The statistical toolkit can be developed and enhanced at all stages of a career.
The decision-making process under uncertainty is largely based on the application of statistics for
probability assessment of uncontrollable events (or factors), as well as risk assessment of
your decision. For the foundation of decision making, visit the Operations/Operational Research
site. For more statistics-based Web sites with decision-making applications, visit the Decision
Science Resources, and Modeling and Simulation Resources sites.

The main objective of this course is to learn statistical thinking; to emphasize concepts
more, and theory and recipes less; and finally to foster active learning using
useful and interesting Web sites. It is already a known fact that "Statistical thinking will one
day be as necessary for efficient citizenship as the ability to read and write." So, let's be
ahead of our time.

Further Readings:
Chernoff H., A Conversation With Herman Chernoff, Statistical Science, Vol. 11, No. 4, 335-350, 1996.
Churchman C., The Design of Inquiring Systems, Basic Books, New York, 1971. Early in the book he stated that knowledge could be
considered as a collection of information, or as an activity, or as a potential. He also noted that knowledge resides in the user and not
in the collection.
Rustagi M., et al. (eds.), Recent Advances in Statistics: Papers in Honor of Herman Chernoff on His Sixtieth Birthday, Academic Press,
1983.

The Birth of Probability and Statistics

The original idea of "statistics" was the collection of information about and for the "state". The
word statistics derives directly, not from any classical Greek or Latin roots, but from the Italian
word for state.

The birth of statistics occurred in the mid-17th century. A commoner named John Graunt, who
was a native of London, began reviewing a weekly church publication issued by the local
parish clerk that listed the number of births, christenings, and deaths in each parish. These
so-called Bills of Mortality also listed the causes of death. Graunt, who was a shopkeeper,
organized these data in the form we call descriptive statistics, which was published as Natural
and Political Observations Made upon the Bills of Mortality. Shortly thereafter he was elected
as a member of the Royal Society. Thus, statistics has to borrow some concepts from sociology,
such as the concept of Population. It has been argued that since statistics usually involves
the study of human behavior, it cannot claim the precision of the physical sciences.

Probability has a much longer history. Probability is derived from the verb to probe, meaning
to "find out" what is not too easily accessible or understandable. The word "proof" has the
same origin: a proof provides the details necessary to understand what is claimed to be true.

Probability originated from the study of games of chance and gambling during the 16th
century. Probability theory was a branch of mathematics studied by Blaise Pascal and Pierre
de Fermat in the seventeenth century. Currently, in the 21st century, probabilistic modeling is used
to control the flow of traffic through a highway system, a telephone interchange, or a
computer processor; to find the genetic makeup of individuals or populations; and in quality control,
insurance, investment, and other sectors of business and industry.

New and ever growing diverse fields of human activities are using statistics; however, it
seems that this field itself remains obscure to the public. Professor Bradley Efron expressed
this fact nicely:

During the 20th Century statistical thinking and methodology have become the scientific
framework for literally dozens of fields including education, agriculture, economics,
biology, and medicine, and with increasing influence recently on the hard sciences such
as astronomy, geology, and physics. In other words, we have grown from a small
obscure field into a big obscure field.

Further Readings:
Daston L., Classical Probability in the Enlightenment, Princeton University Press, 1988.
The book points out that early Enlightenment thinkers could not face uncertainty. A mechanistic, deterministic machine, was the
Enlightenment view of the world.
David H., and A. Edwards, Annotated Readings in the History of Statistics, Springer, 2001. Offers a general historical collection of the
probability and statistics literature.
Gillies D., Philosophical Theories of Probability, Routledge, 2000. Covers the classical, logical, subjective, frequency, and propensity
views.
Hacking I., The Emergence of Probability, Cambridge University Press, London, 1975. A philosophical study of early ideas about
probability, induction and statistical inference.
Hald A., A History of Probability and Statistics and Their Applications before 1750, Wiley, 2003.
Peters W., Counting for Something: Statistical Principles and Personalities, Springer, New York, 1987. It teaches the principles of applied
economic and social statistics in a historical context. Featured topics include public opinion polls, industrial quality control, factor
analysis, Bayesian methods, program evaluation, non-parametric and robust methods, and exploratory data analysis.
Porter T., The Rise of Statistical Thinking, 1820-1900, Princeton University Press, 1986. The author states that statistics has become
known in the twentieth century as the mathematical tool for analyzing experimental and observational data. Enshrined by public
policy as the only reliable basis for judgments as to the efficacy of medical procedures or the safety of chemicals, and adopted by
business for such uses as industrial quality control, it is evidently among the products of science whose influence on public and
private life has been most pervasive. Statistical analysis has also come to be seen in many scientific disciplines as indispensable for
drawing reliable conclusions from empirical (i.e., observed) results. This new field of mathematics found an extensive domain of
applications.
Stigler S., The History of Statistics: The Measurement of Uncertainty Before 1900, U. of Chicago Press, 1990. It covers the people, ideas,
and events underlying the birth and development of early statistics.
Tankard J., The Statistical Pioneers, Schenkman Books, New York, 1984.
This work provides the detailed lives and times of theorists whose work continues to shape much of the modern statistics.

Statistical Modeling for Decision-Making under Uncertainties: From Data to the Instrumental Knowledge

In this diverse world of ours, no two things are exactly the same. A statistician is interested in
both the differences and the similarities; i.e., both departures and patterns.

The actuarial tables published by insurance companies reflect their statistical analysis of the
average life expectancy of men and women at any given age. From these numbers, the
insurance companies then calculate the appropriate premiums for a particular individual to
purchase a given amount of insurance.

Exploratory analysis of data makes use of numerical and graphical techniques to study
patterns and departures from patterns. The widely used descriptive statistical techniques are:
Frequency Distribution; Histograms; Boxplot; Scattergrams and Error Bar plots; and
diagnostic plots.
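
Two of the descriptive techniques listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `defects` data are hypothetical, and the 1.5 × IQR rule used to flag unusual values is a common rule of thumb rather than a method taken from this site.

```python
import statistics
from collections import Counter

# Hypothetical data: number of defects found in 12 inspected batches.
defects = [0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 12]

# Frequency distribution: how often each value occurs.
frequency = Counter(defects)
for value in sorted(frequency):
    print(f"{value:>3} | {'*' * frequency[value]}")  # crude text histogram

# Five-number summary, the ingredients of a boxplot.
q1, median, q3 = statistics.quantiles(defects, n=4)
summary = (min(defects), q1, median, q3, max(defects))
print("min, Q1, median, Q3, max:", summary)

# Rule of thumb (an assumption here): flag points beyond 1.5 * IQR from the quartiles.
iqr = q3 - q1
unusual = [x for x in defects if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print("unusual values:", unusual)
```

The text histogram shows the pattern (most batches have few defects), while the flagged value is the departure from that pattern.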

In examining distribution of data, you should be able to detect important characteristics, such
as shape, location, variability, and unusual values. From careful observations of patterns in
data, you can generate conjectures about relationships among variables. The notion of how
one variable may be associated with another permeates almost all of statistics, from simple
comparisons of proportions through linear regression. The difference between association
and causation must accompany this conceptual development.

Data must be collected according to a well-developed plan if valid information on a conjecture
is to be obtained. The plan must identify important variables related to the conjecture, and
specify how they are to be measured. From the data collection plan, a statistical model can
be formulated from which inferences can be drawn.

As an example of statistical modeling with managerial implications, such as "what-if"
analysis, consider regression analysis. Regression analysis is a powerful technique for
studying the relationship between dependent variables (i.e., output, performance measure) and
independent variables (i.e., inputs, factors, decision variables). Summarizing relationships
among the variables by the most appropriate equation (i.e., modeling) allows us to predict or
identify the most influential factors and study their impacts on the output for any changes in
their current values.
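
As a rough illustration of such a "what-if" analysis, the sketch below fits a straight line by ordinary least squares and then predicts the output for a new input value. The data (`advertising`, `sales`) and the helper names `fit_line` and `predict` are hypothetical, introduced only for this example; a single-input linear model is assumed.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising spend (input) and sales (output).
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [12.0, 15.0, 19.0, 20.0, 24.0]

slope, intercept = fit_line(advertising, sales)


def predict(x_new):
    """Predicted output for a new input value, using the fitted line."""
    return intercept + slope * x_new


print("slope:", slope, "intercept:", intercept)
# "What-if": predicted sales if advertising were raised to 6.0.
print("predicted sales at 6.0:", predict(6.0))
```

The slope estimates how much the output changes per unit change in the input, which is the quantity a manager would examine when asking what happens if a decision variable is changed. Predictions far outside the observed input range should be treated with caution.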

Frequently, marketing managers, for example, are faced with the question: What sample
size do I need? This is an important and common statistical decision, which should be given
due consideration, since an inadequate sample size invariably leads to wasted resources.
The sample size determination section provides a practical solution to this risky decision.
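
To give a feel for the question, here is a minimal sketch based on the standard textbook formula for estimating a population mean, n = (z·σ/E)², rounded up. It is an assumption that this simple case applies: the function name `sample_size_for_mean`, the planning values, and the use of a known standard deviation are illustrative and are not necessarily the approach taken in the sample size determination section.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(sigma, margin_of_error, confidence=0.95):
    """Smallest n such that z * sigma / sqrt(n) <= margin_of_error."""
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / margin_of_error) ** 2)


# Hypothetical planning values: sigma guessed at 15, desired margin of error 3.
print(sample_size_for_mean(sigma=15, margin_of_error=3))
# Halving the margin of error roughly quadruples the required sample size.
print(sample_size_for_mean(sigma=15, margin_of_error=1.5))
```

The trade-off is visible directly: tighter precision or higher confidence demands a larger, more expensive sample.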

Statistical models are currently used in various fields of business and science. However, the
terminology differs from field to field. For example, the fitting of models to data is variously called
calibration, history matching, or data assimilation, all of which are synonymous with parameter
estimation.

Your organization's database contains a wealth of information, yet the decision technology
group members tap only a fraction of it. Employees waste time scouring multiple sources for a
database. The decision-makers are frustrated because they cannot get business-critical data
exactly when they need it. Therefore, too many decisions are based on guesswork, not facts.
Many opportunities are also missed, if they are even noticed at all.

Knowledge is what we know well. Information is the communication of knowledge. In every
knowledge exchange, there is a sender and a receiver. The sender makes common what is
private, does the informing, the communicating. Information can be classified into explicit and
tacit forms. Explicit information can be explained in structured form, while tacit
information is inconsistent and fuzzy to explain.

Data are only crude information and not knowledge by themselves. The sequence from data
to knowledge is: from Data to Information, from Information to Facts, and finally, from
Facts to Knowledge. Data become information when they become relevant to your decision
problem. Information becomes fact when the data can support it. Facts are what the data
reveal. However, the decisive instrumental (i.e., applied) knowledge is expressed together
with some statistical degree of confidence.

Fact becomes knowledge, when it is used in the successful completion of a decision process.
Once you have a massive amount of facts integrated as knowledge, then your mind will be
superhuman in the same sense that mankind with writing is superhuman compared to
mankind before writing. The following figure illustrates the statistical thinking process based
on data in constructing statistical models for decision making under uncertainties.


The Path from Statistical Data to Managerial Knowledge

The above figure depicts the fact that as the exactness of a statistical model increases, the
level of improvements in decision-making increases. That's why we need Business Statistics.
Statistics arose from the need to place knowledge on a systematic evidence base. This
required a study of the rules of computational probability, the development of measures of
data properties and relationships, and so on.

Statistical inference aims at determining whether any statistical significance can be attached
to results after due allowance is made for any random variation as a source of error.
Intelligent and critical inferences cannot be made by those who do not understand the
purpose, the conditions, and the applicability of the various techniques for judging significance.

Considering the uncertain environment, the chance that "good decisions" are made increases
with the availability of "good information." The chance that "good information" is available
increases with the level of structuring the process of Knowledge Management.

Knowledge is more than knowing something technical. Knowledge needs wisdom. Wisdom is
the power to put our time and our knowledge to the proper use. Wisdom comes with age and
experience. Wisdom is the accurate application of accurate knowledge, and its key
component is knowing the limits of your knowledge. Wisdom is about knowing how
something technical can be best used to meet the needs of the decision-maker. Wisdom, for
example, creates statistical software that is useful, rather than technically brilliant.
Ever since the Web entered the popular consciousness, observers have noted that
it puts information at your fingertips but tends to keep wisdom out of reach.

The notion of "wisdom" in the sense of practical wisdom has entered Western civilization
through biblical texts. In the Hellenic experience this kind of wisdom received a more
structural character in the form of philosophy. In this sense philosophy also reflects one of the
expressions of traditional wisdom.

Business professionals need a statistical toolkit. Statistical skills enable you to intelligently
collect, analyze and interpret data relevant to your decision-making. Statistical concepts
enable us to solve problems in a diversity of contexts. Statistical thinking enables you to add
substance to your decisions.

That's why we need statistical data analysis in probabilistic modeling. Statistics arose from
the need to place knowledge management on a systematic evidence base. This required a
study of the rules of computational probability, the development of measures of data
properties, relationships, and so on.

The purpose of statistical thinking is to get acquainted with the statistical techniques, to be
able to execute procedures using available JavaScript, and to be conscious of the conditions
and limitations of various techniques.

Statistical Decision-Making Process

Unlike deterministic decision-making processes, such as linear optimization by solving
systems of equations or parametric systems of equations, in decision making under pure
uncertainty the variables are often more numerous and more difficult to measure and control.
However, the steps are the same. They are:

1. Simplification
2. Building a decision model
3. Testing the model
4. Using the model to find the solution:
      It is a simplified representation of the actual situation.
      It need not be complete or exact in all respects.
      It concentrates on the most essential relationships and ignores the less essential ones.
      It is more easily understood than the empirical (i.e., observed) situation, and hence permits the
      problem to be solved more readily with minimum time and effort.
      It can be used again and again for similar problems or can be modified.

Fortunately the probabilistic and statistical methods for analysis and decision making under
uncertainty are more numerous and powerful today than ever before. The computer makes
possible many practical applications. A few examples of business applications are the following:

An auditor can use random sampling techniques to audit the accounts receivable for clients.
A plant manager can use statistical quality control techniques to assure the quality of his production with
a minimum of testing or inspection.
A financial analyst may use regression and correlation to help understand the relationship of a financial
ratio to a set of other variables in business.
A market researcher may use tests of significance to accept or reject the hypotheses about a group of
buyers to which the firm wishes to sell a particular product.
A sales manager may use statistical techniques to forecast sales for the coming year.

Questions Concerning the Statistical Decision-Making Process:

1. Objectives or Hypotheses: What are the objectives of the study or the questions to be answered? What
is the population to which the investigators intend to refer their findings?

2. Statistical Design: Is the study a planned experiment (i.e., primary data), or an analysis of records (i.e.,
secondary data)? How is the sample to be selected? Are there possible sources of selection bias, which
would make the sample atypical or non-representative? If so, what provision is to be made to deal with
this bias? What is the nature of the control group, standard of comparison, or cost? Remember that
statistical modeling means reflections before actions.

3. Observations: Are there clear definitions of variables, including classifications, measurements (and/or
counting), and the outcomes? Is the method of classification or of measurement consistent for all the
subjects and relevant to Item No. 1? Are there possible biases in measurement (and/or counting) and, if
so, what provisions must be made to deal with them? Are the observations reliable and replicable (to
defend your finding)?

4. Analysis: Are the data sufficient and worthy of statistical analysis? If so, are the necessary conditions of
the methods of statistical analysis appropriate to the source and nature of the data? The analysis must
be correctly performed and interpreted.

5. Conclusions: Which conclusions are justifiable by the findings? Which are not? Are the conclusions
relevant to the questions posed in Item No. 1?

6. Representation of Findings: The findings must be represented clearly, objectively, and in sufficient but
non-technical terms and detail to enable the decision-maker (e.g., a manager) to understand and judge them
for himself or herself. Are the findings internally consistent; i.e., do the numbers add up properly? Can the different
representations be reconciled?

7. Managerial Summary: When your findings and recommendation(s) are not clearly put, or framed in an
appropriate manner understandable by the decision maker, then the decision maker does not feel
convinced of the findings and therefore will not implement any of the recommendations. You have
wasted the time, money, etc. for nothing.

Further Readings:
Corfield D., and J. Williamson, Foundations of Bayesianism, Kluwer Academic Publishers, 2001. Contains Logic, Mathematics, Decision
Theory, and Criticisms of Bayesianism.
Lapin L., Statistics for Modern Business Decisions, Harcourt Brace Jovanovich, 1987.
Pratt J., H. Raiffa, and R. Schlaifer, Introduction to Statistical Decision Theory, The MIT Press, 1994.

What is Business Statistics?

The main objective of Business Statistics is to make inferences (e.g., prediction, making decisions)
about certain characteristics of a population based on information contained in a random sample
from the entire population. The condition for randomness is essential to make sure the sample is
representative of the population.

Business Statistics is the science of 'good' decision making in the face of uncertainty and is
used in many disciplines, such as financial analysis, econometrics, auditing, production and
operations, and marketing research. It provides knowledge and skills to interpret and use statistical
techniques in a variety of business applications. A typical Business Statistics course is intended for
business majors, and covers statistical study, descriptive statistics (collection, description, analysis,
and summary of data), probability, and the binomial and normal distributions, test of hypotheses
and confidence intervals, linear regression, and correlation.

Statistics is a science of making decisions with respect to the characteristics of a group of persons
or objects on the basis of numerical information obtained from a randomly selected sample of the
group. Statisticians refer to this numerical observation as a realization of a random sample. However,
notice that one cannot see a random sample. A random sample is only a sample of a finite
number of outcomes of a random process.

At the planning stage of a statistical investigation, the question of sample size (n) is critical. For
example, one rule of thumb sets the sample size for sampling from a finite population of size N at
√N + 1, rounded up to the nearest integer. Clearly, a larger sample provides more relevant
information, and as a result a more accurate estimation and better statistical judgement regarding
tests of hypotheses.
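
As a quick illustration of the square-root rule quoted above, here is a minimal Python sketch. The function name rule_of_thumb_sample_size is our own label for illustration, and the population sizes are arbitrary; the rule itself is only a starting point, not a substitute for a formal sample-size determination.

```python
import math


def rule_of_thumb_sample_size(population_size: int) -> int:
    """Return sqrt(N) + 1, rounded up to the nearest integer."""
    if population_size < 1:
        raise ValueError("population_size must be a positive integer")
    return math.ceil(math.sqrt(population_size) + 1)


# Arbitrary illustrative population sizes.
for n_pop in (100, 1_000, 10_000):
    print(n_pop, rule_of_thumb_sample_size(n_pop))
```

Note how slowly the suggested sample size grows relative to the population size.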

Under-lit Streets and the Crime Rate: It is a fact that if residential city streets are under-lit then
major crimes take place therein. Suppose you are working in the Mayor's office and are put in
charge of helping him/her decide which manufacturer to buy the light bulbs from in order to
reduce the crime rate by at least a certain amount, given a limited budget.


Activities Associated with the General
Statistical Thinking and Its Applications

The above figure illustrates the idea of statistical inference from a random sample about the
population. It also provides estimation for the population's parameters; namely the expected value
µx, the standard deviation σx, and the cumulative distribution function (cdf) Fx, and their
corresponding sample statistics: the mean x̄, the sample standard deviation Sx, and the empirical (i.e.,
observed) cumulative distribution function (cdf), respectively.

The major task of Statistics is the scientific methodology for collecting, analyzing, and interpreting a
random sample in order to draw inference about some particular characteristic of a specific
Homogeneous Population. For two major reasons, it is often impossible to study an entire
population:

The process would be too expensive or too time-consuming.


The process would be destructive.

In either case, we would resort to looking at a sample chosen from the population and trying
to infer information about the entire population by only examining the smaller sample. Very
often the numbers that interest us most about the population are the mean µ and standard
deviation σ. Any number, like the mean or standard deviation, that is calculated from an
entire population is called a Parameter. If the very same numbers are derived only from the
data of a sample, then the resulting numbers are called Statistics. Frequently, Greek letters
represent parameters and Latin letters represent statistics (as shown in the above Figure).

The uncertainties in extending and generalizing sampling results to the population are
measured and expressed by probabilistic statements called Inferential Statistics. Therefore,
probability is used in statistics as a measuring tool and decision criterion for dealing with
uncertainties in inferential statistics.

An important aspect of statistical inference is estimating population values (parameters) from
samples of data. An estimate of a parameter is unbiased if the expected value of its sampling
distribution is equal to that population parameter. The sample mean is an unbiased estimate of the
population mean. The sample variance (computed with divisor n − 1) is an unbiased estimate of
the population variance. This allows us to combine several estimates to obtain a much better
estimate. The Empirical distribution is the distribution of a random sample, shown by a
step-function in the above figure. The empirical distribution function is an unbiased estimate
for the population distribution function F(x).
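
To make the idea of unbiasedness concrete, the following Python sketch enumerates every possible sample rather than simulating. It rests on assumptions of our own: a tiny hypothetical population (2, 4, 6, 8), samples of size n = 2, and sampling with replacement so that all ordered samples are equally likely. Exact fractions are used so the equalities hold without rounding.

```python
from fractions import Fraction
from itertools import product

population = [2, 4, 6, 8]  # hypothetical population, for illustration only
n = 2                      # sample size


def mean(values):
    """Exact arithmetic mean as a Fraction."""
    return Fraction(sum(values), len(values))


def variance(values, divisor):
    """Sum of squared deviations from the mean, divided by `divisor`."""
    m = mean(values)
    return sum((Fraction(v) - m) ** 2 for v in values) / divisor


mu = mean(population)                           # population mean
sigma2 = variance(population, len(population))  # population variance

# Every ordered sample of size n drawn with replacement is equally likely.
samples = list(product(population, repeat=n))

expected_mean = mean([mean(s) for s in samples])
expected_var_unbiased = mean([variance(s, n - 1) for s in samples])
expected_var_biased = mean([variance(s, n) for s in samples])

print("mu =", mu, " E[sample mean] =", expected_mean)
print("sigma^2 =", sigma2, " E[S^2, divisor n-1] =", expected_var_unbiased)
print("E[variance with divisor n] =", expected_var_biased)
```

Averaged over all possible samples, the sample mean equals µ and the variance with divisor n − 1 equals σ², whereas dividing by n underestimates σ² on average.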

Given you already have a realization set of a random sample, to compute the descriptive
statistics including those in the above figure, you may like using Descriptive Statistics
JavaScript.

Hypothesis testing is a procedure for reaching a probabilistic conclusive decision about a
claimed value for a population's parameter based on a sample. To reduce the
uncertainty and to have high confidence that statistical inferences are correct, a sample must
give an equal chance of selection to each member of the population, which can be achieved by
sampling randomly with a relatively large sample size n.

Given you already have a realization set of a random sample, to perform hypothesis testing
for mean µ and variance σ², you may like using Testing the Mean and Testing the Variance
JavaScript, respectively.


Statistics is a tool that enables us to impose order on the disorganized cacophony of the real
world of modern society. The business world has grown both in size and competition.
Corporate executives must take risks in business, hence the need for business statistics.

Business statistics has grown with the art of constructing charts and tables! It is a science of
basing decisions on numerical data in the face of uncertainty.

Business statistics is a scientific approach to decision making under risk. In practicing
business statistics, we search for an insight, not the solution. Our search is for the one
solution that meets all the business's needs with the lowest level of risk. Business statistics
can take a normal business situation, and with the proper data gathering, analysis, and
research for a solution, turn it into an opportunity.

While business statistics cannot replace the knowledge and experience of the decision
maker, it is a valuable tool that the manager can employ to assist in the decision making
process in order to reduce the inherent risk, measured by, e.g., the standard deviation σ.

Among other useful questions, you may ask why we are interested in estimating the
population's expected value µ and its Standard Deviation σ. Here are some applicable
reasons. Business Statistics must provide justifiable answers to the following concerns for
every consumer and producer:

1. What is your (or your customers') Expectation of the product/service you buy (or that
you sell)? That is, what is a good estimate for µ?
2. Given the information about your (or your customers') expectation, what is the Quality of
the product/service you buy (or that you sell)? That is, what is a good estimate for σ?
3. Given the information about the expectation and the quality of the product/service you buy
(or that you sell), how does the product/service compare with other existing similar
types? That is, comparing several µ's and several σ's.

Common Statistical Terminology with Applications

Like all professions, statisticians have their own keywords and phrases to ease precise
communication. However, one must interpret the results of any decision making in a
language that is easy for the decision-maker to understand. Otherwise, he/she does not
believe in what you recommend, and therefore does not go into the implementation phase.
This lack of communication between statisticians and managers is the major roadblock
to using statistics.

Population: A population is any entire collection of people, animals, plants or things on
which we may collect data. It is the entire group of interest, which we wish to describe or
about which we wish to draw conclusions. In the above figure the life of the light bulbs
manufactured, say, by GE is the population concerned.

Qualitative and Quantitative Variables: Any object or event that can vary in successive
observations either in quantity or quality is called a "variable." Variables are classified
accordingly as quantitative or qualitative. A qualitative variable, unlike a quantitative variable,
does not vary in magnitude in successive observations. The values of quantitative and
qualitative variables are called "Variates" and "Attributes", respectively.

Variable: A characteristic or phenomenon that may take different values, such as weight or
gender, since these differ from individual to individual.

Randomness: Randomness means unpredictability. The fascinating fact about inferential
statistics is that, although each random observation may not be predictable when taken
alone, collectively they follow a predictable pattern called its distribution function. For
example, the distribution of a sample average is approximately normal for sample sizes
over about 30 (a common rule of thumb). In other words, an extreme value of the sample mean
is less likely than an extreme value of a few raw data.
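
The last point can be checked exactly for a simple case. The sketch below is our own illustration, assuming rolls of a fair six-sided die as the random process: it computes the exact distribution of the sum of the rolls and compares the chance that a single roll is at least 5 with the chance that the mean of 30 rolls is at least 5.

```python
from fractions import Fraction


def sum_distribution(num_rolls):
    """Exact distribution of the sum of `num_rolls` fair six-sided dice."""
    dist = {0: Fraction(1)}
    for _ in range(num_rolls):
        new = {}
        for total, p in dist.items():
            for face in range(1, 7):
                new[total + face] = new.get(total + face, Fraction(0)) + p / 6
        dist = new
    return dist


def prob_mean_at_least(num_rolls, threshold):
    """P(sample mean of `num_rolls` rolls >= threshold), computed exactly."""
    dist = sum_distribution(num_rolls)
    return sum(p for total, p in dist.items()
               if Fraction(total, num_rolls) >= threshold)


p_single = prob_mean_at_least(1, 5)   # one raw observation
p_mean30 = prob_mean_at_least(30, 5)  # mean of 30 observations

print("P(single roll >= 5)      =", float(p_single))
print("P(mean of 30 rolls >= 5) =", float(p_mean30))
```

An extreme value that is quite ordinary for one raw observation becomes very unlikely for the mean of 30 observations.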

Sample: A subset of a population or universe.


An Experiment: An experiment is a process whose outcome is not known in advance with
certainty.

Statistical Experiment: An experiment in general is an operation in which one chooses the
values of some variables and measures the values of other variables, as in physics. A
statistical experiment, in contrast, is an operation in which one takes a random sample from a
population and infers the values of some variables. For example, in a survey, we "survey",
i.e., "look at", the situation without aiming to change it, such as in a survey of political opinions.
A random sample from the relevant population provides information about the voting
intentions.

In order to make any generalization about a population, a random sample from the entire
population, one that is meant to be representative of the population, is often studied. For each
population, there are many possible samples. A sample statistic gives information about a
corresponding population parameter. For example, the sample mean for a set of data would
give information about the overall population mean µ.

It is important that the investigator carefully and completely defines the population before
collecting the sample, including a description of the members to be included.

Example: The population for a study of infant health might be all children born in the U.S.A.
in the 1980's. The sample might be all babies born on 7th of May in any of the years.

An experiment is any process or study which results in the collection of data, the outcome of
which is unknown. In statistics, the term is usually restricted to situations in which the
researcher has control over some of the conditions under which the experiment takes place.

Example: Before introducing a new drug treatment to reduce high blood pressure, the
manufacturer carries out an experiment to compare the effectiveness of the new drug with
that of one currently prescribed. Newly diagnosed subjects are recruited from a group of local
general practices. Half of them are chosen at random to receive the new drug, the remainder
receives the present one. So, the researcher has control over the subjects recruited and the
way in which they are allocated to treatment.
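
The random allocation step in this example can be sketched in a few lines of Python. Everything named here is hypothetical: the subject labels, the group size of 20, and the seed, which is fixed only so that the allocation can be reproduced and audited.

```python
import random


def allocate(subjects, seed=None):
    """Randomly split subjects into two groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical identifiers for 20 newly diagnosed subjects.
subjects = [f"subject_{i:02d}" for i in range(1, 21)]
new_drug, current_drug = allocate(subjects, seed=42)

print("New drug:    ", sorted(new_drug))
print("Current drug:", sorted(current_drug))
```

Because chance alone decides who receives which drug, neither the researcher's judgement nor the subjects' characteristics can systematically favor one treatment.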

Design of experiments is a key tool for increasing the rate of acquiring new knowledge.
Knowledge in turn can be used to gain competitive advantage, shorten the product
development cycle, and produce new products and processes which will meet and exceed
your customer's expectations.

Primary data and Secondary data sets: If the data are from a planned experiment relevant
to the objective(s) of the statistical investigation, collected by the analyst, it is called a
Primary Data set. However, if some condensed records are given to the analyst, it is called a
Secondary Data set.

Random Variable: A random variable is a real function (yes, it is called a "variable", but in
reality it is a function) that assigns a numerical value to each simple event. For example, in
sampling for quality control an item could be defective or non-defective, therefore, one may
assign X=1, and X = 0 for a defective and non-defective item, respectively. You may assign
any other two distinct real numbers, as you wish; however, non-negative integer random
variables are easy to work with. Random variables are needed since one cannot do
arithmetic operations on words; the random variable enables us to compute statistics, such
as average and variance. Any random variable has a distribution of probabilities associated
with it.
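
The quality-control example can be written out as a short Python sketch. The inspection results below are made up for illustration; the function X plays the role of the random variable, turning words into numbers so that arithmetic becomes possible.

```python
import statistics


def X(outcome):
    """Random variable: assigns 1 to a defective item and 0 otherwise."""
    return 1 if outcome == "defective" else 0


# Hypothetical inspection results for ten sampled items.
inspected = [
    "non-defective", "defective", "non-defective", "non-defective",
    "non-defective", "non-defective", "defective", "non-defective",
    "non-defective", "non-defective",
]

values = [X(outcome) for outcome in inspected]
sample_mean = statistics.mean(values)          # proportion of defectives
sample_variance = statistics.variance(values)  # divisor n - 1

print("values:", values)
print("mean:", sample_mean, "variance:", round(sample_variance, 4))
```

With the 0/1 coding, the sample mean is simply the observed proportion of defective items.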

Probability: Probability (i.e., probing for the unknown) is the tool used for anticipating what
the distribution of data should look like under a given model. Random phenomena are not
haphazard: they display an order that emerges only in the long run and is described by a
distribution. The mathematical description of variation is central to statistics. The probability
required for statistical inference is not primarily axiomatic or combinatorial, but is oriented
toward describing data distributions.

Sampling Unit: A unit is a person, animal, plant or thing which is actually studied by a
researcher; the basic objects upon which the study or experiment is executed. For example,
a person; a sample of soil; a pot of seedlings; a zip code area; a doctor's practice.


Parameter: A parameter is an unknown value, and therefore it has to be estimated.
Parameters are used to represent a certain population characteristic. For example, the
population mean µ is a parameter that is often used to indicate the average value of a
quantity.

Within a population, a parameter is a fixed value that does not vary. Each sample drawn from
the population has its own value of any statistic that is used to estimate this parameter. For
example, the mean of the data in a sample is used to give information about the overall mean
µ in the population from which that sample was drawn.

Statistic: A statistic is a quantity that is calculated from a sample of data. It is used to give
information about unknown values in the corresponding population. For example, the
average of the data in a sample is used to give information about the overall average in the
population from which that sample was drawn.

A statistic is a function of an observable random sample. It is therefore an observable
random variable. Notice that, while a statistic is a "function" of observations, unfortunately, it is
commonly called a random "variable", not a function.

It is possible to draw more than one sample from the same population, and the value of a
statistic will in general vary from sample to sample. For example, the average value in a
sample is a statistic. The average values in more than one sample, drawn from the same
population, will not necessarily be equal.

Statistics are often assigned Roman letters (e.g., x̄ and s), whereas the equivalent unknown
values in the population (parameters) are assigned Greek letters (e.g., µ, σ).

The word estimate means to esteem, that is giving a value to something. A statistical
estimate is an indication of the value of an unknown quantity based on observed data.

More formally, an estimate is the particular value of an estimator that is obtained from a
particular sample of data and used to indicate the value of a parameter.

Example: Suppose the manager of a shop wanted to know µ, the mean expenditure of
customers in her shop in the last year. She could calculate the average expenditure of the
hundreds (or perhaps thousands) of customers who bought goods in her shop; that is, the
population mean µ. Instead she could use an estimate of this population mean µ by
calculating the mean of a representative sample of customers. If this value were found to be
$25, then $25 would be her estimate.

There are two broad subdivisions of statistics: Descriptive Statistics and Inferential Statistics
as described below.

Descriptive Statistics: The numerical statistical data should be presented clearly, concisely,
and in such a way that the decision maker can quickly obtain the essential characteristics of
the data in order to incorporate them into the decision process.

The principal descriptive quantity derived from sample data is the mean (x̄), which is the
arithmetic average of the sample data. It serves as the most reliable single measure of the
value of a typical member of the sample. If the sample contains a few values that are so large
or so small that they have an exaggerated effect on the value of the mean, the sample is
more accurately represented by the median -- the value where half the sample values fall
below and half above.

The quantities most commonly used to measure the dispersion of the values about their
mean are the variance s² and its square root, the standard deviation s. The variance is
calculated by determining the mean, subtracting it from each of the sample values (yielding
the deviation of the samples), and then averaging the squares of these deviations. The mean
and standard deviation of the sample are used as estimates of the corresponding
characteristics of the entire group from which the sample was drawn. They do not, in
general, completely describe the distribution (Fx) of values within either the sample or the
parent group; indeed, different distributions may have the same mean and standard
deviation. They do, however, provide a complete description of the normal distribution, in
which positive and negative deviations from the mean are equally common, and small
deviations are much more common than large ones. For a normally distributed set of values,
a graph showing the dependence of the frequency of the deviations upon their magnitudes is
a bell-shaped curve. About 68 percent of the values will differ from the mean by less than the
standard deviation, and almost 100 percent (about 99.7 percent) will differ by less than three
times the standard deviation.
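
These percentages can be verified with Python's standard library (version 3.8 or later), which provides a NormalDist class. The sketch below uses the standard normal distribution; the helper name within is our own.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1


def within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {within(k):.4f}")
```

Because any normal distribution can be rescaled to the standard one, the same proportions hold whatever the mean and standard deviation are.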

Inferential Statistics: Inferential statistics is concerned with making inferences from samples
about the populations from which they have been drawn. In other words, if we find a
difference between two samples, we would like to know whether this is a "real" difference (i.e., it is
present in the population) or just a "chance" difference (i.e., it could just be the result of
random sampling error). That's what tests of statistical significance are all about. Any
conclusion inferred from sample data about the population from which the sample is drawn must be
expressed in probabilistic terms. Probability is the language and a measuring tool for
uncertainty in our statistical conclusions.

Inferential statistics could be used for explaining a phenomenon or checking for validity of a
claim. In these instances, inferential statistics is called Exploratory Data Analysis or
Confirmatory Data Analysis, respectively.

Statistical Inference: Statistical inference refers to extending the knowledge obtained from
a random sample to the whole population from which it was drawn. This is known in
mathematics as Inductive Reasoning, that is, knowledge of the whole from a particular. Its
main application is in hypotheses testing about a given population. Statistical inference
guides the selection of appropriate statistical models. Models and data interact in statistical
work. Inference from data can be thought of as the process of selecting a reasonable model,
including a statement in probability language of how confident one can be about the
selection.

Normal Distribution Condition: The normal or Gaussian distribution is a continuous
symmetric distribution that follows the familiar bell-shaped curve. One of its nice features is
that the mean and variance uniquely and independently determine the distribution. It has
been noted empirically that many measurement variables have distributions that are at least
approximately normal. Even when a distribution is non-normal, the distribution of the mean of
many independent observations from the same distribution becomes arbitrarily close to a
normal distribution as the number of observations grows large, provided the distribution has a
finite variance. Many frequently used statistical tests require the condition that the data come
from a normal distribution.

Estimation and Hypothesis Testing: Inference in statistics is of two types. The first is
estimation, which involves the determination, with a possible error due to sampling, of the
unknown value of a population characteristic, such as the proportion having a specific
attribute or the average value µ of some numerical measurement. To express the accuracy of
the estimates of population characteristics, one must also compute the standard errors of the
estimates. The second type of inference is hypothesis testing. It involves the definition of a
hypothesis as one set of possible population values and an alternative, a different set. There
are many statistical procedures for determining, on the basis of a sample, whether the true
population characteristic belongs to the set of values in the hypothesis or the alternative.
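
As a small illustration of estimation together with its standard error, the Python sketch below uses the usual formula for the standard error of a sample mean, s/√n. The eight expenditure figures are invented for the example.

```python
import math
import statistics

# Hypothetical expenditures (in dollars) of a random sample of 8 customers.
sample = [23.5, 27.0, 24.8, 26.1, 25.6, 22.9, 28.3, 24.4]

n = len(sample)
x_bar = statistics.mean(sample)   # point estimate of the population mean
s = statistics.stdev(sample)      # sample standard deviation (divisor n - 1)
standard_error = s / math.sqrt(n) # accuracy of x_bar as an estimate

print(f"n = {n}, mean = {x_bar:.3f}, s = {s:.3f}, standard error = {standard_error:.3f}")
```

The standard error shrinks as n grows, which is one way to see why larger samples give more accurate estimates.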

Statistical inference is grounded in probability, idealized concepts of the group under study,
called the population, and the sample. The statistician may view the population as a set of
balls from which the sample is selected at random, that is, in such a way that each ball has
the same chance as every other one for inclusion in the sample.

Notice that to be able to estimate the population parameters, the sample size n must be
greater than one. For example, with a sample size of one, the variation (s²) within the sample
is 0/1 = 0. An estimate for the variation (σ²) within the population would be 0/0, which is an
indeterminate quantity, meaning impossible.
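
Python's statistics module behaves in the same way, which makes for a compact demonstration. In the sketch below the single observation 25.0 is arbitrary; pvariance uses divisor n, and variance uses divisor n − 1.

```python
import statistics

single = [25.0]  # a sample of size one (arbitrary value)

# Variation within the sample: squared deviations sum to 0, divided by n = 1.
within_sample = statistics.pvariance(single)

# Estimate of the population variance: divisor n - 1 = 0, so it is undefined.
try:
    statistics.variance(single)
    estimate_defined = True
except statistics.StatisticsError:
    estimate_defined = False

print("variation within the sample:", within_sample)
print("population variance estimable:", estimate_defined)
```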

Greek Letters Commonly Used as Statistical Notations

We use Greek letters as scientific notations in statistics and other scientific fields to honor the
ancient Greek philosophers who invented science and scientific thinking. Before Socrates, in
the 6th Century BC, Thales and Pythagoras, among others, applied geometrical concepts to
arithmetic, and Socrates is the inventor of dialectic reasoning. The revival of scientific
thinking (initiated by Newton's work) was valued and hence reappeared almost 2000 years
later.

Greek Letters Commonly Used as Statistical Notations


alpha  beta  ki-sqre  delta  mu  nu  pi  rho  sigma  tau  theta
α      β     χ²       δ      µ   ν   π   ρ    σ      τ    θ

Note: ki-square (ki-sqre, Chi-square), χ², is not the square of anything; its name is simply
Chi-square (read, ki-square). Ki does not exist in statistics.

I'm glad that you're overcoming all the confusions that exist in learning statistics.

Type of Data and Levels of Measurement

Information can be collected in statistics using qualitative or quantitative data. Qualitative
data, such as the eye color of a group of individuals, are not computable by arithmetic relations.
They are labels that indicate in which category or class an individual, object, or process falls.
They are called categorical variables.

Quantitative data sets consist of measures that take numerical values for which descriptions
such as means and standard deviations are meaningful. They can be put into an order and
further divided into two groups: discrete data or continuous data.

Discrete data are countable data and are collected by counting, for example, the number of
defective items produced during a day's production.

Continuous data are collected by measuring and are expressed on a continuous scale. For
example, measuring the height of a person.

Among the first activities in statistical analysis is to count or measure: Counting/measurement
theory is concerned with the connection between data and reality. A set of data is a
representation (i.e., a model) of the reality based on numerical and measurable scales. Data
are called "primary type" data if the analyst has been involved in collecting the data relevant
to his/her investigation. Otherwise, they are called "secondary type" data.

Data come in the forms of Nominal, Ordinal, Interval, and Ratio (remember the French word
NOIR for the color black). Data can be either continuous or discrete.

Levels of Measurements
_________________________________________
                        Nominal   Ordinal   Interval/Ratio
Ranking?                no        yes       yes
Numerical difference    no        no        yes

Both the zero point and the units of measurement are arbitrary on the Interval scale. While
the unit of measurement is arbitrary on the Ratio scale, its zero point is a natural attribute.
The categorical variable is measured on an ordinal or nominal scale.

Counting/measurement theory is concerned with the connection between data and reality.
Both statistical theory and counting/measurement theory are necessary to make inferences
about reality.

Since statisticians live for precision, they prefer Interval/Ratio levels of measurement.

Pareto Chart: A Pareto chart is similar to the histogram, except that it is a frequency bar
chart for qualitative variables, rather than being used for quantitative data that have been
grouped into classes. The following is an example of a Pareto chart that shows the frequency of
the types of shoes worn in the class on a particular day:



A Typical Pareto Chart

For a good business application of discrete random variables, visit Markov Chain Calculator,
Large Markov Chain Calculator and Zero-Sum Games.

Why Statistical Sampling?

Sampling is the selection of part of an aggregate or totality, known as the population, on the basis
of which a decision concerning the population is made.

The following are the advantages and/or necessities for sampling in statistical decision
making:

1. Cost: Cost is one of the main arguments in favor of sampling, because often a sample
can furnish data of sufficient accuracy and at much lower cost than a census.

2. Accuracy: Much better control over data collection errors is possible with sampling than
with a census, because a sample is a smaller-scale undertaking.

3. Timeliness: Another advantage of a sample over a census is that the sample produces
information faster. This is important for timely decision making.

4. Amount of Information: More detailed information can be obtained from a sample
survey than from a census, because it takes less time, is less costly, and allows us to
take more care in the data processing stage.

5. Destructive Tests: When a test involves the destruction of an item under study,
sampling must be used. Statistical sampling determination can be used to find the
optimal sample size within an acceptable cost.

Further Reading:
Thompson S., Sampling, Wiley, 2002.

Sampling Methods

From the food you eat to the television you watch, from political elections to school board
actions, much of your life is regulated by the results of sample surveys.

A sample is a group of units selected from a larger group (the population). By studying the
sample, one hopes to draw valid conclusions about the larger group.

A sample is generally selected for study because the population is too large to study in its
entirety. The sample should be representative of the general population. This is often best
achieved by random sampling. Also, before collecting the sample, it is important that one
carefully and completely defines the population, including a description of the members to be
included.

A common problem in business statistical decision-making arises when we need information
about a collection called a population but find that the cost of obtaining the information is
prohibitive. For instance, suppose we need to know the average shelf life of current inventory.
If the inventory is large, the cost of checking records for each item might be high enough to
cancel the benefit of having the information. On the other hand, a hunch about the average
shelf life might not be good enough for decision-making purposes. This means we must
arrive at a compromise that involves selecting a small number of items and calculating an
average shelf life as an estimate of the average shelf life of all items in inventory. This is a
compromise, since the measurements for a sample from the inventory will produce only an
estimate of the value we want, but at substantial savings. What we would like to know is
how "good" the estimate is and how much more it will cost to make it "better". Information of
this type is intimately related to sampling techniques. This section provides a short discussion
on the common methods of business statistical sampling.

Cluster sampling can be used whenever the population is homogeneous but can be
partitioned. In many applications the partitioning is a result of physical distance. For instance,
in the insurance industry, there are small "clusters" of employees in field offices scattered
about the country. In such a case, a random sampling of employee work habits might
require travel to many of the "clusters" or field offices in order to get the data. Completely
sampling each one of a small number of clusters chosen at random can eliminate much of
the cost associated with the data requirements of management.

Stratified sampling can be used whenever the population can be partitioned into smaller
sub-populations, each of which is homogeneous according to the particular characteristic of
interest. If there are k sub-populations and we let $N_i$ denote the size of sub-population i, let N
denote the overall population size, and let n denote the sample size, then we select a
stratified sample whenever we choose:

$n_i = n (N_i / N)$

items at random from sub-population i, i = 1, 2, ..., k.

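The estimate of the population mean is:

$\bar{x}_s = \sum_t W_t \bar{x}_t$, over the strata t = 1, 2, ..., L, where $\bar{x}_t = \sum_i X_{it} / n_t$ is the sample mean of stratum t and $W_t$ is its weight ($W_t = N_t / N$, consistent with the variance of the total given below).

Its variance is:

$\sum_t W_t^2 (N_t - n_t) S_t^2 / [n_t (N_t - 1)]$

The population total T is estimated by $N \bar{x}_s$; its variance is:

$\sum_t N_t^2 (N_t - n_t) S_t^2 / [n_t (N_t - 1)]$.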
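
As a rough illustration of proportional allocation and the stratified estimate of the mean, the following Python sketch may help. The stratum sizes and sample values are made up, the function names are hypothetical, and the weights are assumed to be the stratum shares of the population (N_t/N).

```python
from statistics import mean

def proportional_allocation(stratum_sizes, n):
    """Return n_i = n * N_i / N for each stratum (not rounded)."""
    N = sum(stratum_sizes)
    return [n * N_i / N for N_i in stratum_sizes]

def stratified_mean(stratum_sizes, samples):
    """Weighted sum of stratum sample means, with assumed weights N_t / N."""
    N = sum(stratum_sizes)
    return sum((N_t / N) * mean(x) for N_t, x in zip(stratum_sizes, samples))

# Hypothetical population of 1000 items split into three strata
sizes = [500, 300, 200]
allocation = proportional_allocation(sizes, n=50)

# Tiny made-up samples, one list per stratum
samples = [[10, 12, 11], [20, 22], [30]]
estimate = stratified_mean(sizes, samples)      # estimate of the population mean
total_estimate = sum(sizes) * estimate          # estimate of the population total T
```

In practice the allocated sizes would be rounded to whole items, which this sketch leaves to the user.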

Random sampling is probably the most popular sampling method used in decision making
today. Many decisions are made, for instance, by choosing a number out of a hat or a
numbered bead from a barrel, and both of these methods are attempts to achieve a random
choice from a set of items. But true random sampling must be achieved with the aid of a
computer or a random number table whose values are generated by computer random
number generators.

A random sample of size n is drawn from a population of size N. The unbiased estimate for the
variance of the sample mean $\bar{x}$ is:

$\mathrm{Var}(\bar{x}) = S^2 (1 - n/N) / n$,

where n/N is the sampling fraction. For sampling fraction less than 10% the finite population
correction factor (N-n)/(N-1) is almost 1.

The total T is estimated by $N \times \bar{x}$; its variance is $N^2 \mathrm{Var}(\bar{x})$.
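
The variance of the sample mean under simple random sampling, including the factor (1 - n/N), can be sketched in Python as follows. The data and the population size N = 100 are made up for illustration, and the function name is hypothetical.

```python
from statistics import mean, variance

def srs_mean_and_variance(sample, N):
    """Return the sample mean and S^2 (1 - n/N) / n for a sample from N items."""
    n = len(sample)
    s2 = variance(sample)              # sample variance with divisor n - 1
    return mean(sample), s2 * (1 - n / N) / n

sample = [4, 8, 6, 5, 7]               # made-up observations
xbar, var_xbar = srs_mean_and_variance(sample, N=100)

total = 100 * xbar                     # estimate of the population total T
var_total = 100**2 * var_xbar          # N^2 times the variance of the mean
```

Here the sampling fraction is 5%, so the correction factor changes the variance only slightly.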

For 0, 1 (binary) type variables, the variation in the estimated proportion p is:

$S^2 = p(1-p) \times (1 - n/N) / (n-1)$.

For the ratio $r = \sum x_i / \sum y_i = \bar{x} / \bar{y}$, the (approximate) variation for r is:

$(N-n)[S_x^2 + r^2 S_y^2 - 2 r\,\mathrm{Cov}(x, y)] / [n (N-1) \bar{y}^2]$.


Determination of sample sizes (n) with regard to binary data: Smallest integer greater than or
equal to:

$[t^2 N p(1-p)] / [t^2 p(1-p) + a^2 (N-1)]$,

with N being the size of the total number of cases, n being the sample size, a the expected
error, t being the value taken from the t-distribution corresponding to a certain confidence
level, and p being the probability of an event.
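
A minimal Python sketch of this sample-size formula is given below. The default t = 1.96 is an assumption (the large-sample value for 95% confidence), and the population size, p, and error used in the example are made up.

```python
import math

def sample_size_binary(N, p, a, t=1.96):
    """Smallest integer >= t^2 N p(1-p) / (t^2 p(1-p) + a^2 (N-1))."""
    q = p * (1 - p)
    return math.ceil(t**2 * N * q / (t**2 * q + a**2 * (N - 1)))

# Hypothetical case: 10,000 cases, p = 0.5 (the most conservative choice),
# and an expected error of 5 percentage points
n = sample_size_binary(N=10_000, p=0.5, a=0.05)
```

Taking p = 0.5 maximizes p(1-p), so it gives the largest required sample when nothing is known about p in advance.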

Cross-Sectional Sampling: A cross-sectional study is the observation of a defined population at
a single point in time or time interval. Exposure and outcome are determined simultaneously.

What is a statistical instrument? A statistical instrument is any process that aims at
describing a phenomenon by using any instrument or device; the results may, however, be
used as a control tool. Examples of statistical instruments are questionnaires and survey
sampling.

What is grab sampling technique? The grab sampling technique is to take a relatively
small sample over a very short period of time; the results obtained are usually instantaneous.
Passive sampling, in contrast, is a technique where a sampling device is used for an
extended time under similar conditions. Depending on the desired statistical investigation,
passive sampling may be a useful alternative or even more appropriate than grab
sampling. However, a passive sampling technique needs to be developed and tested in the
field.

Further Reading:
Thompson S., Sampling, Wiley, 2002.

Statistical Summaries

Representative of a Sample: Measures of Central Tendency Summaries

How do you describe the "average" or "typical" piece of information in a set of data? Different
procedures are used to summarize the most representative information depending on the type
of question asked and the nature of the data being summarized.

Measures of location give information about the location of the central tendency within a
group of numbers. The measures of location presented in this unit for ungrouped (raw) data
are the mean, the median, and the mode.

Mean: The arithmetic mean (or the average, simple mean) is computed by summing all
numbers in an array of numbers ($x_i$) and then dividing by the number of observations (n) in
the array.

$\text{Mean} = \bar{x} = \sum x_i / n$, where the sum is over all i's.

The mean uses all of the observations, and each observation affects the mean. Even though
the mean is sensitive to extreme values (i.e., extremely large or small data can pull the
mean toward them), it is still the most widely used measure of
location. This is due to the fact that the mean has valuable mathematical properties that
make it convenient for use with inferential statistical analysis. For example, the sum of the
deviations of the numbers in a set of data from the mean is zero, and the sum of the squared
deviations of the numbers in a set of data is at its minimum when taken from the mean.

You might like to use Descriptive Statistics to compute the mean.

Weighted Mean: In some cases, the data in the sample or population should not be
weighted equally; rather, each value should be weighted according to its importance.
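
As a small illustration, the weighted mean is usually computed as the sum of weight times value divided by the sum of the weights. The Python sketch below assumes that standard definition; the scores and weights are made up.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights w_i."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical example: three scores with importance weights 2, 3 and 5
scores = [80, 90, 70]
weights = [2, 3, 5]
wm = weighted_mean(scores, weights)    # (160 + 270 + 350) / 10
```

With equal weights this reduces to the ordinary arithmetic mean.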

Median: The median is the middle value in an ordered array of observations. If there is an
even number of observations in the array, the median is the average of the two middle
numbers. If there is an odd number of data in the array, the median is the middle number.


The median is often used to summarize the distribution of an outcome. If the distribution is
skewed, the median and the interquartile range (IQR) may be better than other measures to
indicate where the observed data are concentrated.

Generally, the median provides a better measure of location than the mean when there are
some extremely large or small observations; i.e., when the data are skewed to the right or to
the left. For this reason, median income is used as the measure of location for the U.S.
household income. Note that if the median is less than the mean, the data set is skewed to
the right. If the median is greater than the mean, the data set is skewed to the left. For a
normal population, the sample median is approximately normally distributed with mean μ (the
population mean) and a standard error of $(\pi/2)^{1/2} \approx 1.25$ times the standard error of the mean.

The mean has two distinct advantages over the median. It is more stable, and one can
compute the mean of two combined samples from the two sample means (weighted by the
sample sizes).

Mode: The mode is the most frequently occurring value in a set of observations. Why use the
mode? The classic example is the shirt/shoe manufacturer who wants to decide what sizes to
introduce. Data may have two modes. In this case, we say the data are bimodal, and sets of
observations with more than two modes are referred to as multimodal. Note that the mode is
not a helpful measure of location, because there can be more than one mode or even no
mode.

When the mean and the median are known, it is possible to estimate the mode for the
unimodal distribution using the other two averages as follows:

Mode ≈ 3(median) - 2(mean)

This estimate is applicable to both grouped and ungrouped data sets.

Whenever more than one mode exists, the population from which the sample came is a
mixture of more than one population, as shown, for example, in the following bimodal
histogram.



A Mixture of Two Different Populations

However, notice that a Uniform distribution has an uncountable number of modes with equal
density value; therefore it is considered a homogeneous population.

Almost all standard statistical analyses are conditioned on the assumption that the population
is homogeneous.

Notice that Excel has very limited statistical capability. For example, it displays only one
mode, the first one. Unfortunately, this is very misleading. However, you may find out if there
are others by inspection only, as follows: create a frequency distribution by invoking the menu
sequence Tools, Data analysis, Frequency and following the instructions on the screen. You will
see the frequency distribution and can then find the mode visually. Unfortunately, Excel does not
draw a Stem and Leaf diagram. Commercial off-the-shelf software, such as SAS and
SPSS, displays a Stem and Leaf diagram, which is a frequency distribution of a given data set.
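
For readers working outside Excel, a frequency distribution and the full list of modes can be obtained in a few lines. The sketch below uses Python's standard library (multimode requires Python 3.8 or later); the data are made up.

```python
from collections import Counter
from statistics import multimode

shoe_sizes = [7, 8, 8, 9, 9, 10]       # made-up observations

freq = Counter(shoe_sizes)             # frequency distribution: value -> count
modes = multimode(shoe_sizes)          # every value that attains the top frequency

for value in sorted(freq):
    print(value, "*" * freq[value])    # crude text display of the frequencies
```

Reporting all modes rather than only the first makes a bimodal data set visible at once.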

Selecting Among the Mode, Median, and Mean

It is a common mistake to specify the wrong index for central tendency.



Selecting Among the Mode, Median, and Mean

The first consideration is the type of data: if the variable is categorical, the mode is the single
measure that best describes the data.

The second consideration in selecting the index is to ask whether the total of all observations
is of any interest. If the answer is yes, then the mean is the proper index of central tendency.

If the total is of no interest, then depending on whether the histogram is symmetric or skewed
one must use either mean or median, respectively.

In all cases the histogram must be unimodal. However, notice that, e.g., a Uniform
distribution has an uncountable number of modes with equal density value; therefore it is
considered a homogeneous population.

Notice also that:

$|\text{Mean} - \text{Median}| \le \sigma$, where σ is the standard deviation.

The main characteristics of these three statistics are tabulated below:

The Main Characteristics of the Mode, the Median, and the Mean

Fact No. 1
   The Mode: It is the most frequent value in the distribution; it is the point of greatest density.
   The Median: It is the value of the middle point of the array (not midpoint of range), such that half the items are above and half below it.
   The Mean: It is the value in a given aggregate which would obtain if all the values were equal.

Fact No. 2
   The Mode: The value of the mode is established by the predominant frequency, not by the values in the distribution.
   The Median: The value of the median is fixed by its position in the array and doesn't reflect the individual values.
   The Mean: The sums of deviations on either side of the mean are equal; hence, the algebraic sum of the deviations is equal to zero.

Fact No. 3
   The Mode: It is the most probable value, hence the most typical.
   The Median: The aggregate distance between the median point and all the values in the array is less than from any other point.
   The Mean: It reflects the magnitude of every value.

Fact No. 4
   The Mode: A distribution may have 2 or more modes. On the other hand, there is no mode in a rectangular distribution.
   The Median: Each array has one and only one median.
   The Mean: An array has one and only one mean.

Fact No. 5
   The Mode: The mode does not reflect the degree of modality.
   The Median: It cannot be manipulated algebraically: medians of subgroups cannot be weighted and combined.
   The Mean: Means may be manipulated algebraically: means of subgroups may be combined when properly weighted.

Fact No. 6
   The Mode: It cannot be manipulated algebraically: modes of subgroups cannot be combined.
   The Median: It is stable in that grouping procedures do not affect it appreciably.
   The Mean: It may be calculated even when individual values are unknown, provided the sum of the values and the sample size n are known.

Fact No. 7
   The Mode: It is unstable in that it is influenced by grouping procedures.
   The Median: Values must be ordered, and may be grouped, for computation.
   The Mean: Values need not be ordered or grouped for this calculation.

Fact No. 8
   The Mode: Values must be ordered and grouped for its computation.
   The Median: It can be computed when ends are open.
   The Mean: It cannot be calculated from a frequency table when ends are open.

Fact No. 9
   The Mode: It can be calculated when table ends are open.
   The Median: It is not applicable to qualitative data.
   The Mean: It is stable in that grouping procedures do not seriously affect it.

The Descriptive Statistics JavaScript provides a complete set of information about all statistics that
you ever need. You might like to use it to perform some numerical experimentation for validating
the above assertions for a deeper understanding.

Specialized Averages: The Geometric & Harmonic Means

The Geometric Mean: The geometric mean (G) of n non-negative numerical values is the nth root
of the product of the n values.

If some values are very large in magnitude and others are small, then the geometric mean is a
better representative of the data than the simple average. In a "geometric series", the most
meaningful average is the geometric mean (G). The arithmetic mean is very biased toward the
larger numbers in the series.

An Application: Suppose sales of a certain item increase to 110% in the first year and to 150% of
that in the second year. For simplicity, assume you sold 100 items initially. Then the number sold in
the first year is 110 and the number sold in the second is 150% x 110 = 165. The arithmetic
average of 110% and 150% is 130%, so we would incorrectly estimate that the number sold in
the first year is 130 and the number in the second year is 169. The geometric mean of 110% and
150% is $G = (1.65)^{1/2} \approx 1.285$, so we would correctly estimate that we would sell $100\,G^2 = 165$ items
in the second year.
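
The sales example can be checked numerically. This Python sketch uses the standard library's geometric_mean (Python 3.8 or later); the variable names are illustrative only.

```python
from statistics import geometric_mean, mean

growth = [1.10, 1.50]                  # year-over-year factors from the example

G = geometric_mean(growth)             # square root of 1.10 * 1.50 = 1.65
A = mean(growth)                       # arithmetic average, 1.30

second_year_geo = 100 * G**2           # compounding the geometric mean twice
second_year_arith = 100 * A**2         # compounding the arithmetic mean twice

print(round(second_year_geo), round(second_year_arith))
```

Compounding G reproduces the actual second-year sales of 165, while compounding the arithmetic average overstates them at 169.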

The Harmonic Mean: The harmonic mean (H) is another specialized average, which is useful in
averaging variables expressed as a rate per unit of time, such as miles per hour or number of units
produced per day. The harmonic mean (H) of n non-zero numerical values $x_i$ is: $H = n / \sum (1/x_i)$.

An Application: Suppose 4 machines in a machine shop are used to produce the same part.
However, the four machines take 2.5, 2.0, 1.5, and 6.0 minutes to make one part,
respectively. What is the average time needed to make one part?

The harmonic mean is: H = 4/[(1/2.5) + (1/2.0) + (1/1.5) + (1/6.0)] = 2.31 minutes.

If all machines work for one hour, how many parts will be produced? Since four machines
running for one hour represent 240 minutes of operating time, 240 / 2.31 = 104 parts will be
produced.
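
The machine-shop figures can be verified with a short Python sketch using the standard library's harmonic_mean. The direct per-machine count is added here only as a cross-check and is not part of the original example.

```python
from statistics import harmonic_mean

minutes_per_part = [2.5, 2.0, 1.5, 6.0]

H = harmonic_mean(minutes_per_part)                 # average minutes per part
parts_via_H = 4 * 60 / H                            # 240 machine-minutes divided by H

# Cross-check: count the parts each machine makes in 60 minutes
parts_direct = sum(60 / m for m in minutes_per_part)

print(round(H, 2), round(parts_via_H), round(parts_direct))
```

Both routes give 104 parts, which is why the harmonic mean is the appropriate average for these rates.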

The Order Among the Three Means: If all three means exist, then the Arithmetic Mean is
never less than the other two; moreover, the Harmonic Mean is never larger than the other two.

You might like to use The Other Means JavaScript in performing some numerical experimentation
for validating the above assertions for a deeper understanding.

Further Reading:
Langley R., Practical Statistics Simply Explained, 1970, Dover Press.

Histogramming: Checking for Homogeneity of Population

A histogram is a graphical presentation of an estimate for the density (for continuous random
variables) or probability mass function (for discrete random variables) of the population.

The geometric feature of histogram enables us to find out useful information about the data, such
as:

1. The location of the "center" of the data.

2. The degree of dispersion.

3. The extent to which it is skewed, that is, it does not fall off symmetrically on both sides of its
peak.

4. The degree of peakedness: how steeply it rises and falls.

The mode is the most frequently occurring value in a set of observations. Data may have two
modes. In this case, we say the data are bimodal, and sets of observations with more than two
modes are referred to as multimodal. Whenever more than one mode exists, the population
from which the sample came is a mixture of more than one population. Almost all standard
statistical analyses are conditioned on the assumption that the population is homogeneous,
meaning that its density (for continuous random variables) or probability mass function (for discrete
random variables) is unimodal. However, notice that, e.g., a Uniform distribution has an uncountable
number of modes with equal density value; therefore it is considered a homogeneous
population.

To check the unimodality of sampling data, one may use the histogramming process.

Number of Class Intervals in a Histogram: Before we can construct our frequency distribution we
must determine how many classes we should use. The choice is somewhat arbitrary, but too few classes or
too many classes will not provide as clear a picture as can be obtained with some more nearly
optimum number. An empirical (i.e., observed) relationship, presented here as Sturges' rule (whose
classical form is k = 1 + Log2(n)), may be used as a useful guide: the optimal number of classes (k) is

the smallest integer greater than or equal to

$\min\{ n^{1/2},\ 10 \log_{10}(n) \}$, n ≥ 30,

where Log is the logarithm in base 10, and n is the total number of the numerical values which
comprise the data set.

Therefore, class width is:

(highest value - lowest value) / k
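
The rule and the resulting class width can be sketched in Python as follows. The function names are hypothetical, and the sketch rounds up as stated in the rule; rounding to the nearest integer instead gives slightly smaller counts.

```python
import math

def number_of_classes(n):
    """Smallest integer >= min(sqrt(n), 10 * log10(n)), stated for n >= 30."""
    if n < 30:
        raise ValueError("the rule is stated for n >= 30")
    return math.ceil(min(math.sqrt(n), 10 * math.log10(n)))

def class_width(data):
    """(highest value - lowest value) / k, with k from number_of_classes."""
    k = number_of_classes(len(data))
    return (max(data) - min(data)) / k

data = list(range(100))                # made-up data: the integers 0..99
k = number_of_classes(len(data))
width = class_width(data)
```

In practice the width would usually be rounded to a tidy value before drawing the histogram.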

The following JavaScript produces a histogram based on this rule:


Test for Homogeneity of a Population.

To have an "optimum" you need some measure of quality -- presumably in this case, the "best" way
to display whatever information is available in the data. The sample size contributes to this; so the
usual guidelines are to use between 5 and 15 classes, with more classes if you have a larger
sample. You should take into account a preference for tidy class widths, preferably a multiple of 5
or 10, because this makes it easier to understand.

Beyond this it becomes a matter of judgement. Try out a range of class widths, and choose the one
that works best. This assumes you have a computer and can generate alternative histograms fairly
readily.

There are often management issues that come into play as well. For example, if your data is to be
compared to similar data -- such as prior studies, or from other countries -- you are restricted to the
intervals used therein.

If the histogram is very skewed, then unequal classes should be considered. Use narrow classes
where the class frequencies are high, wide classes where they are low.

The following approaches are common:

Let n be the sample size, then the number of class intervals could be

$\min\{ n^{1/2},\ 10 \log_{10}(n) \}$.

The Log is the logarithm in base 10. Thus for 200 observations you would use about 14 intervals
(the square root of 200 is about 14.1), but for 2000 you would use about 33 (10 Log(2000) is about 33.0).

Alternatively,

1. Find the range (highest value - lowest value).

2. Divide the range by a reasonable interval size: 2, 3, 5, 10 or a multiple of 10.

3. Aim for no fewer than 5 intervals and no more than 15.

One of the main applications of histogramming is to Test for Homogeneity of a Population. The
unimodality of the histogram is a necessary condition for the homogeneity of population to make
any statistical analysis meaningful. However, notice that, e.g., a Uniform distribution has an
uncountable number of modes with equal density value; therefore it is considered a
homogeneous population.

Further Reading:
Efron B., and R. Tibshirani, An Introduction to the Bootstrap, Chapman & Hall (now the CRC Press), 1994. Contains a tedious test for
multimodality that is based on Gaussian kernel density estimates and uses the window-size approach.

How to Construct a BoxPlot

A BoxPlot is a graphical display that has many characteristics. It shows the presence of possible
outliers. It illustrates the range of data. It shows measures of dispersion such as the upper
quartile, lower quartile and interquartile range (IQR) of the data set, as well as the median as a
measure of central location, which is useful for comparing sets of data. It also gives an indication of
the symmetry or skewness of the distribution. The main reason for the popularity of boxplots is that
they offer much information in a compact way.

Steps to Construct a BoxPlot:

1. A horizontal line is drawn from the smallest observation (A) to the lower quartile (B), and another from
the upper quartile (D) to the largest observation (E). Vertical lines at points B and D join
these horizontal lines to produce the box.

2. A vertical line is drawn at the median point (C), as shown in the above Figure.

For a deeper understanding, you may like using graph paper, and Descriptive Sampling Statistics
JavaScript in constructing the BoxPlots for some sets of data; e.g., from your textbook.
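
The five points A to E used in the construction can be computed before drawing. The Python sketch below uses the standard library's quantiles function with its "inclusive" method, which is one of several quartile conventions, so other tools may give slightly different quartiles; the data are made up.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (smallest, lower quartile, median, upper quartile, largest)."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)

data = [2, 4, 4, 5, 7, 9, 10, 12, 15]  # made-up observations
A, B, C, D, E = five_number_summary(data)
print(A, B, C, D, E)
```

The box runs from B to D, the line inside it sits at C, and the whiskers extend to A and E.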

Measuring the Quality of a Sample

Average by itself is not a good indication of quality. You need to know the variance to make any
educated assessment. We are reminded of the dilemma of the six-foot tall statistician who drowned
in a stream that had an average depth of three feet.

Statistical measures are often used for describing the nature and extent of differences among the
information in the distribution. A measure of variability is generally reported together with a
measure of central tendency.

Statistical measures of variation are numerical values that indicate the variability inherent in a set
of data measurements. Note that a small value for a measure of dispersion indicates that the data
are concentrated around the mean; therefore, the mean is a good representative of the data set.
On the other hand, a large measure of dispersion indicates that the mean is not a good
representative of the data set. Also, measures of dispersion can be used when we want to
compare the distributions of two or more sets of data. Quality of a data set is measured by its
variability: Larger variability indicates lower quality. That is why high variation makes the manager
very worried. Your job, as a statistician, is to measure the variation, and if it is too high and
unacceptable, then it is the job of the technical staff, such as engineers, to fix the process.

Decision situations with complete lack of knowledge, known as the flat uncertainty, have the
largest risk. For simplicity, consider the case when there are only two outcomes, one with
probability of p. Then, the variation in the outcomes is p(1-p). This variation is the largest if we set
p = 50%. That is, equal chance for each outcome. In such a case, the quality of information is at its
lowest level.

Remember, quality of information and variation are inversely related. The larger the variation
in the data, the lower the quality of the data (i.e., information): the Devil is in the Deviations.

The four most common measures of variation are the range, variance, standard deviation, and
coefficient of variation.

Range: The range of a set of observations is the absolute value of the difference between the
largest and smallest values in the data set. It measures the size of the smallest contiguous interval
of real numbers that encompasses all of the data values. It is not useful when extreme values are
present. It is based solely on two values, not on the entire data set. In addition, it cannot be defined
for open-ended distributions such as the normal distribution.

Notice that, when dealing with discrete random observations, some authors define the range as:
Range = Largest value - Smallest value + 1.

A normal distribution does not have a range. A student observed that, since the tails of a normal density
function never touch the x-axis, very large positive and negative values must exist for observations to
form such a curve. Such remote values are always possible,
but increasingly improbable. This encapsulates the asymptotic behavior of the normal density very
well. In spite of this behavior, the normal distribution is useful and applicable to a wide range of decision-making
situations.

Quartiles: When we order the data, for example in ascending order, we may divide the data into
quarters, Q1…Q4, known as quartiles. The first Quartile (Q1) is that value where 25% of the
values are smaller and 75% are larger. The second Quartile (Q2) is that value where 50% of the
values are smaller and 50% are larger. The third Quartile (Q3) is that value where 75% of the
values are smaller and 25% are larger.

Percentiles: Percentiles have a similar concept and therefore, are related; e.g., the 25th percentile
corresponds to the first quartile Q1, etc. The advantage of percentiles is that they may be
subdivided into 100 parts. The percentiles and quartiles are most conveniently read from a
cumulative distribution function, as depicted in the following figure.



Empirical Cumulative Distribution Function as an Informative Tool

Interquartile Range: The interquartile range (IQR) describes the extent to which the middle 50%
of the observations are scattered or dispersed. It is the distance between the first and the third
quartiles:

IQR = Q3 - Q1,

which is twice the Quartile Deviation. For data that are skewed, the relative dispersion, similar to
the coefficient of variation (C.V.) is given (provided the denominator is not zero) by the Coefficient
of Quartile Variation:

CQV = (Q3-Q1) / (Q3 + Q1).
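
A short Python sketch of the IQR, the quartile deviation, and the coefficient of quartile variation follows. The data are made up, and the "inclusive" quartile method of the standard library is an assumption; other quartile conventions give slightly different values.

```python
from statistics import quantiles

data = [1, 3, 5, 7, 9, 11, 13, 15, 17]        # made-up observations

q1, _, q3 = quantiles(data, n=4, method="inclusive")

iqr = q3 - q1                                  # interquartile range
quartile_deviation = iqr / 2                   # half of the IQR
cqv = (q3 - q1) / (q3 + q1)                    # defined only when q3 + q1 is not zero

print(q1, q3, iqr, quartile_deviation, round(cqv, 3))
```

Because the CQV is a ratio, it does not depend on the unit of measurement.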

Note that almost all statistics that we have covered up to now can be obtained and understood
deeply by graphical method using Empirical (i.e., observed) Cumulative Distribution Function
(ECDF) JavaScript. However, the numerical Descriptive Statistics provides a complete set of
information about all statistics that you ever need.


The Duality between the ECDF and the Histogram: Notice that the height of the empirical (i.e., observed)
cumulative distribution function (ECDF) at a particular point is
numerically equal to the area in the corresponding histogram to the left of that point. Therefore,
either or both could be used depending on the intended applications.

Mean Absolute Deviation (MAD): A simple measure of variability is the mean absolute deviation:

$MAD = \sum |x_i - \bar{x}| / n$.

The mean absolute deviation is widely used as a performance measure to assess the quality of
modeling, such as forecasting techniques. However, MAD does not lend itself to further use in making
inference; moreover, even in error analysis studies, the variance is preferred since variances of
independent (i.e., uncorrelated) errors are additive, whereas MAD does not have such a nice
feature.

The MAD is a simple measure of variability, which unlike range and quartile deviation, takes every
item into account, and it is simpler and less affected by extreme deviations. It is therefore often
used in small samples that include extreme values.

The mean absolute deviation theoretically should be measured from the median, since it is at its
minimum; however, it is more convenient to measure the deviations from the mean.

As a numerical example, consider the price (in $) of same item at 5 different stores: $4.75, $5.00,
$4.65, $6.10, and $6.30. The mean absolute deviation from the mean is $0.67, while from the
median is $0.60, which is a better representative of deviation among the prices.
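
The two figures in this example can be reproduced with a few lines of Python; the helper name mean_abs_dev is hypothetical.

```python
from statistics import mean, median

prices = [4.75, 5.00, 4.65, 6.10, 6.30]        # the five store prices in dollars

def mean_abs_dev(data, center):
    """Average absolute distance of the data from the given center."""
    return sum(abs(x - center) for x in data) / len(data)

mad_from_mean = mean_abs_dev(prices, mean(prices))       # center is 5.36
mad_from_median = mean_abs_dev(prices, median(prices))   # center is 5.00

print(round(mad_from_mean, 2), round(mad_from_median, 2))
```

The deviation measured from the median is the smaller of the two, in line with the minimum property noted above.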

Variance: An important measure of variability is variance. Variance is the average of the squared
deviations of each observation in the set from the arithmetic mean of all of the observations.

$\text{Variance} = \sum (x_i - \bar{x})^2 / (n - 1)$, where n is at least 2.

The variance is a measure of spread or dispersion among values in a data set. Therefore, the
greater the variance, the lower the quality.

The variance is not expressed in the same units as the observations. In other words, the
variance is hard to interpret because the deviations from the mean are squared, so it is
expressed in squared units. This problem can be solved by working with the square root of the
variance, which is called the standard deviation.

Standard Deviation: Both variance and standard deviation provide the same information; one can
always be obtained from the other. In other words, the process of computing a standard
deviation always involves computing a variance. Since standard deviation is the square root of the
variance, it is always expressed in the same units as the raw data:

$\text{Standard Deviation} = S = (\text{Variance})^{1/2}$
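
As a minimal sketch, the variance with divisor n - 1 and its square root can be computed by hand and compared with Python's standard library; the data are made up.

```python
import math
from statistics import mean, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]                # made-up observations

xbar = mean(data)
var_manual = sum((x - xbar) ** 2 for x in data) / (len(data) - 1)
sd_manual = math.sqrt(var_manual)              # same units as the raw data

# The library functions use the same n - 1 divisor
print(var_manual, variance(data))
print(sd_manual, stdev(data))
```

Computing the standard deviation this way always passes through the variance first.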

For large data sets (say, more than 30 observations) that are approximately bell-shaped, approximately 68% of the data are contained within one
standard deviation of the mean and 95% within two standard deviations; 99.7% (or almost
100%) of the data are contained within three standard deviations (S) of the mean.

You may use Descriptive Statistics JavaScript to compute the mean, and standard deviation.

The Mean Square Error (MSE) of an estimate is the variance of the estimate plus the square of its
bias; therefore, if an estimate is unbiased, then its MSE is equal to its variance, as is the case in
the ANOVA table.

Coefficient of Variation: Coefficient of Variation (CV) is the absolute relative deviation with
respect to the size of the mean \( \bar{x} \), provided \( \bar{x} \) is not zero, expressed as a percentage:

CV = \( 100 \, |S / \bar{x}| \, \% \)

CV is independent of the unit of measurement. In estimation of a parameter, when its CV is less
than 10%, the estimate is assumed acceptable. The inverse of the CV, namely 1/CV, is called the
Signal-to-noise Ratio.

The coefficient of variation is used to represent the relationship of the standard deviation to the
mean, telling how representative the mean is of the numbers from which it came. It expresses the
standard deviation as a percentage of the mean; i.e., it reflects the variation in a distribution
relative to the mean. However, confidence intervals for the coefficient of variation are rarely
reported. One of the reasons is that the exact confidence interval for the coefficient of variation is
computationally tedious.

Note that, for a skewed or grouped data set, the coefficient of quartile variation:

VQ = 100(Q3 - Q1)/(Q3 + Q1)%

is more useful than the CV.

You may use Descriptive Statistics to compute the mean, standard deviation and the coefficient of
variation.
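
A minimal sketch of both coefficients for the store prices used earlier. Quartiles can be defined in several ways; here we assume the default method of Python's `statistics.quantiles`, so the value of VQ may differ slightly under another quartile convention.

```python
from statistics import mean, quantiles, stdev

prices = [4.75, 5.00, 4.65, 6.10, 6.30]

# Coefficient of variation: standard deviation as a percentage of the mean.
cv = 100 * abs(stdev(prices) / mean(prices))

# Coefficient of quartile variation, using the library's default quartiles.
q1, _q2, q3 = quantiles(prices, n=4)
vq = 100 * (q3 - q1) / (q3 + q1)

print(round(cv, 2), round(vq, 2))
```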

Variation Ratio for Qualitative Data: Since the mode is the most frequently used measure of
central tendency for qualitative variables, variability is measured with reference to the mode. The
statistic that describes the variability of qualitative data is the Variation Ratio (VR):

VR = 1 - fm/n,

where fm is the frequency of the mode, and n is the total number of scores in the distribution.
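
The variation ratio is easy to compute from a frequency count. The sketch below uses a small, made-up list of color responses purely for illustration:

```python
from collections import Counter

def variation_ratio(categories):
    """VR = 1 - fm/n, where fm is the frequency of the mode."""
    counts = Counter(categories)
    fm = counts.most_common(1)[0][1]   # frequency of the most common category
    return 1 - fm / len(categories)

# Hypothetical qualitative data.
responses = ["red", "blue", "red", "green", "red", "blue"]
vr = variation_ratio(responses)
print(vr)
```

A VR near 0 means almost all observations fall in the modal category; a larger VR means the observations are spread over several categories.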

Z Score: A Z score tells how many standard deviations a given point (i.e., observation) is above or
below the mean. The larger the absolute value of Z, the further away the value is from the
mean. Note that, for roughly bell-shaped data, values beyond three standard deviations are very unlikely.
If a Z score is negative, the observation (x) is below the mean; if the Z score is positive, the
observation (x) is above the mean. The Z score is found as:

Z = \( (x - \bar{x}) \) / standard deviation of X

The Z score is a measure of the number of standard deviations that an observation is above or
below the mean. Since the standard deviation is never negative, a positive Z score indicates that
the observation is above the mean, a negative Z score indicates that the observation is below the
mean. Note that Z is a dimensionless value, and therefore is a useful measure by which to
compare data values from two different populations, even those measured by different units.

Z-Transformation: Applying the formula \( z = (X - \mu) / \sigma \) will always produce a transformed variable
with a mean of zero and a standard deviation of one. However, the shape of the distribution will not
be affected by the transformation. If X is not normal, then the transformed distribution will not be
normal either.

One of the nice features of the z-transformation is that the resulting distribution of the transformed
data has an identical shape but with mean zero, and standard deviation equal to 1.

One can generalize this data transformation to have any desirable mean and standard deviation
other than 0 and 1, respectively. Suppose we wish the transformed data to have the mean and
standard deviation of M and D, respectively. For example, in the SAT Scores, they are set at M =
500, and D=100. The following transformation should be applied:

Z = (standard Z) × D + M

Suppose you have two data sets with very different scales (e.g., one has very low values, another
very high values). If you wish to compare these two data sets, due to differences in scales, the
statistics that you generate are not comparable. It is a good idea to use the Z-transformation of
both original data sets and then make any comparison.
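
The z-transformation and its generalization to a chosen mean M and standard deviation D can be sketched as follows. The raw scores are made up, and the sample standard deviation (with n - 1) is assumed as the divisor:

```python
from statistics import mean, stdev

def z_transform(values):
    """Subtract the mean and divide by the standard deviation."""
    x_bar, s = mean(values), stdev(values)
    return [(x - x_bar) / s for x in values]

def rescale(z_values, M, D):
    """Give standardized values a new mean M and standard deviation D."""
    return [z * D + M for z in z_values]

raw = [52, 61, 70, 75, 92]           # hypothetical raw scores
z = z_transform(raw)
scaled = rescale(z, M=500, D=100)    # SAT-style scale from the text

print([round(v, 2) for v in z])
print([round(v) for v in scaled])
```

The ordering and relative spacing of the values are unchanged; only the location and scale differ.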

You have heard the terms z value, z test, z transformation, and z score. Do all of these terms mean
the same thing? Certainly not:

The z value refers to the critical value (a point on the horizontal axis) of the Normal (0, 1) density
function, for a given area to the left of that z-value.

The z test refers to the procedures for testing the equality of the mean(s) of one (or two) population(s).

The z score of a given observation x, in a sample of size n, is simply (x - average of the sample)
divided by the standard deviation of the sample. One must be careful not to mistake z scores for
the Standard Scores.

The z transformation of a set of observations of size n is simply (each observation - average of all
observations) divided by the standard deviation among all observations. The aim is to produce a
transformed data set with a mean of zero and a standard deviation of one. This makes the
transformed set dimensionless and manageable with respect to its magnitudes. It is used also in
comparing several data sets that have been measured using different scales of measurements.

Pearson coined the term "standard deviation" sometime near 1900. The idea of using squared
deviations goes back to Laplace in the early 1800s.

Finally, notice again that transforming raw scores to z scores does NOT normalize the data.

Computation of Descriptive Statistics for Grouped Data: One of the most common ways to
describe a single variable is with a frequency distribution. A histogram is a graphical presentation
of an estimate for the frequency distribution of the population. Depending upon the particular
variable, all of the data values may be represented, or you may group the values into categories
first (e.g., by age, where it would usually not be sensible to determine the frequency of each value;
rather, the values are grouped into ranges, and the frequency of each range is then determined).
Frequency distributions can be depicted in two ways: as a table or as a graph that is often referred to
as a histogram or bar chart. The bar chart is often used to show the relationship between two
categorical variables.

Grouped data are derived from raw data, and consist of frequencies (counts of raw values)
tabulated with the classes in which they occur. The Class Limits represent the largest (Upper) and
lowest (Lower) values which the class will contain. The formulas for the descriptive statistics
become much simpler for grouped data, as shown below for the Mean, Variance, and Standard
Deviation, respectively, where (f) is the frequency of each class, and n is the total frequency:
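
A minimal sketch of the grouped-data computation, under the usual assumption (ours) that every class is represented by its midpoint x, so that the mean is Σ f·x / n and the variance is [Σ f·x² - (Σ f·x)² / n] / (n - 1). The class midpoints and frequencies below are made up for illustration:

```python
from math import sqrt

def grouped_stats(midpoints, freqs):
    """Mean, variance and standard deviation from class midpoints and frequencies."""
    n = sum(freqs)                                          # total frequency
    sum_fx = sum(f * x for x, f in zip(midpoints, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(midpoints, freqs))
    mean = sum_fx / n
    variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
    return mean, variance, sqrt(variance)

# Hypothetical classes 0-10, 10-20, 20-30, 30-40 represented by their midpoints.
midpoints = [5, 15, 25, 35]
freqs = [2, 5, 8, 5]

mean_g, var_g, sd_g = grouped_stats(midpoints, freqs)
print(mean_g, round(var_g, 3), round(sd_g, 3))
```

Because the individual values inside each class are replaced by the midpoint, the results are approximations to the statistics of the raw data.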

Selecting Among the Quartile Deviation, Mean Absolute Deviation, and Standard Deviation

A general guideline for selecting a suitable statistic in describing the dispersion in a population
includes consideration of the following factors:

1. The concept of dispersion required by the problem. Is a single pair of values adequate, such
as the two extremes or the two quartiles (range or Q)?

2. The type of data available. If they are few in number, or contain extreme values, avoid the
standard deviation. If they are generally skewed, avoid the mean absolute deviation as well.
If they have a gap around the quartile, the quartile deviation should be avoided.

3. The peculiarity of the dispersion measures themselves. These are summarized under "The
Main Characteristics of the Quartile Deviation, the Mean Absolute Deviation, and the
Standard Deviation" below.

The Main Characteristics of the Quartile Deviation, the Mean Absolute Deviation, and the Standard
Deviation

| Fact No. | The Quartile Deviation | The Mean Absolute Deviation | The Standard Deviation |
|---|---|---|---|
| 1 | The quartile deviation is also easy to calculate and to understand. However, it is unreliable if there are gaps in the data around the quartiles. | The mean absolute deviation has the advantage of giving equal weight to the deviation of every value from the mean or median. | The standard deviation is usually more useful and better adapted to further analysis than the mean absolute deviation. |
| 2 | It depends on only 2 values, which include the middle half of the items. | Therefore, it is a more sensitive measure of dispersion than those described above and ordinarily has a smaller sampling error. | It is more reliable as an estimator of the population dispersion than other measures, provided the distribution is normal. |
| 3 | It is usually superior to the range as a rough measure of dispersion. | It is also easier to compute and to understand and is less affected by extreme values than the standard deviation. | It is the most widely used measure of dispersion and the easiest to handle algebraically. |
| 4 | It may be determined in an open-end distribution, or one in which the data may be ranked but not measured quantitatively. | Unfortunately, it is difficult to handle algebraically, since minus signs must be ignored in its computation. | Compared with the others, it is harder to compute and more difficult to understand. |
| 5 | It is also useful in badly skewed distributions or those in which other measures of dispersion would be warped by extreme values. | Its main application is in modeling accuracy for comparative forecasting techniques. | It is generally affected by extreme values that may be due to skewness of data. |

You might like to use the Descriptive Sampling Statistics JavaScript in performing some numerical
experimentation for validating the above assertions for a deeper understanding.

Shape of a Distribution Function:


The Skewness-Kurtosis Chart

The pair of statistical measures, skewness and kurtosis, are measuring tools that are used in
selecting a distribution (or distributions) to fit your data. To make an inference with respect to the
population distribution, you may first compute skewness and kurtosis from your random sample from the
entire population. Then, locating a point with these coordinates on the widely used skewness-kurtosis
chart, guess a couple of possible distributions to fit your data. Finally, you might use the
goodness-of-fit test to rigorously come up with the best candidate fitting your data. Removing
outliers improves the accuracy of both skewness and kurtosis.

Skewness: Skewness is a measure of the degree to which the sample population deviates from
symmetry with the mean at the center.

Skewness = \( \sum_{i=1}^{n} (x_i - \bar{x})^3 / [(n - 1) S^3] \), where n is at least 2.

Skewness will take on a value of zero when the distribution is a symmetrical curve. A positive value
indicates the observations are clustered more to the left of the mean with most of the extreme
values to the right of the mean. A negative skewness indicates clustering to the right. In this case
we typically have: Mean ≤ Median ≤ Mode. The reverse order typically holds for observations with
positive skewness.

Kurtosis: Kurtosis is a measure of the relative peakedness of the curve defined by the distribution
of the observations.

Kurtosis = \( \sum_{i=1}^{n} (x_i - \bar{x})^4 / [(n - 1) S^4] \), where n is at least 2.

Standard normal distribution has kurtosis of +3. A kurtosis larger than 3 indicates the distribution is
more peaked than the standard normal distribution.

Coefficient of Excess Kurtosis = Kurtosis - 3.

A value of less than 3 for kurtosis indicates that the distribution is flatter than the standard normal
distribution.

It can be shown that,

Kurtosis - \( \text{Skewness}^2 \) is greater than or equal to 1, and

Kurtosis is less than or equal to the sample size n.

These inequalities hold for any probability distribution having finite skewness and kurtosis.

In the Skewness-Kurtosis Chart, you notice two useful families of distributions, namely the beta
and gamma families.

The Beta-Type Density Function: Since the beta density has two shape parameters, it describes
many random phenomena, provided the random variable lies in the interval [0, 1].
For example, when both parameters are integers, the cumulative beta distribution function can be
expressed in terms of binomial probabilities.

Applications: A basic distribution of statistics for variables bounded at both sides; for example x
between [0, 1]. The beta density is useful for both theoretical and applied problems in many areas.
Examples include distribution of proportion of population located between lowest and highest value
in sample; distribution of daily per cent yield in a manufacturing process; description of elapsed
times to task completion (PERT). There is also a relationship between the Beta and Normal
distributions. The conventional calculation is that given a PERT Beta with highest value as b,
lowest as a, and most likely as m, the equivalent normal distribution has a mean and mode of (a +
4m + b)/6 and a standard deviation of (b - a)/6.

Comments: Uniform, right triangular, and parabolic distributions are special cases. To generate a
beta random value, generate two random values, g1 and g2, from gamma distributions with a common scale
parameter. The ratio g1/(g1 + g2) is distributed like a beta distribution. In other words, the beta
distribution can be thought of as the distribution of X1/(X1 + X2), when X1 and X2 are independent
gamma random variables with the same scale.
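
The gamma-ratio construction can be tried out with a short simulation. This is only a sketch: the shape values a = 2 and b = 3, the seed, and the helper name `beta_from_gammas` are our choices, and the check relies on the known mean a/(a + b) of a beta distribution with shape parameters a and b.

```python
import random

def beta_from_gammas(a, b, rng):
    """Draw one beta(a, b) value as g1 / (g1 + g2) from two gamma draws."""
    g1 = rng.gammavariate(a, 1.0)   # shape a, scale 1
    g2 = rng.gammavariate(b, 1.0)   # shape b, same scale
    return g1 / (g1 + g2)

rng = random.Random(0)              # fixed seed for reproducibility
a, b = 2.0, 3.0
draws = [beta_from_gammas(a, b, rng) for _ in range(20_000)]

sample_mean = sum(draws) / len(draws)
print(round(sample_mean, 3), a / (a + b))
```

Every draw lies between 0 and 1, and the sample mean settles close to a/(a + b).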

Gamma-Type Density Function: Some random variables are always non-negative. The density
function associated with these random variables often is adequately modeled as the gamma
density function. The Gamma-Type Density Function has both a shape and a scale parameter.
With the shape parameter equal to 1, the result is the exponential density function.
Chi-square is also a special case of the gamma density function, with scale parameter equal to 2
and shape parameter equal to half the degrees of freedom.

Applications: A basic distribution of statistics for variables bounded at one side; for example, x
greater than or equal to zero. The gamma density gives the distribution of the time required for exactly k
independent events to occur, assuming events take place at a constant rate. It is used frequently in
queuing theory, reliability, and other industrial applications. Examples include the distribution of time
between re-calibrations of an instrument that needs re-calibration after k uses; time between inventory
restocking; and time to failure for a system with standby components.

Comments: Erlangian, Exponential, and Chi-square distributions are special cases. The negative
binomial is an analog to gamma distribution with discrete random variable.

What is the distribution of the product of sample observations from the uniform (0, 1) distribution? Like
many problems with products, this becomes a familiar problem when turned into a problem about
sums. If X is uniform (for simplicity of notation make it U(0,1)), Y = -log(X) is exponentially
distributed, so the negative log of the product of X1, X2, ..., Xn is the sum of Y1, Y2, ..., Yn, which has a
gamma (scaled Chi-square) distribution. Thus, the negative log of the product has a gamma density with
shape parameter n and scale 1.
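
A quick simulation can illustrate this result. The sketch below assumes n = 5 and relies on the fact that a gamma distribution with shape n and scale 1 has mean n and variance n; the seed and sample count are arbitrary choices.

```python
import math
import random
from statistics import fmean, variance

def neg_log_product(n, rng):
    """Return -log(X1 * X2 * ... * Xn) for independent uniform (0, 1) values."""
    prod = 1.0
    for _ in range(n):
        prod *= 1.0 - rng.random()   # value in (0, 1], so the log is always defined
    return -math.log(prod)

rng = random.Random(1)               # fixed seed for reproducibility
n = 5
samples = [neg_log_product(n, rng) for _ in range(20_000)]

mean_s = fmean(samples)
var_s = variance(samples)
print(round(mean_s, 2), round(var_s, 2))
```

Both the sample mean and the sample variance should come out close to n.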

The Log-normal Density Function: Permits representation of a random variable whose logarithm
follows a normal distribution. The ratio of two log-normally distributed random variables is also log-normal.

Applications: Model for a process arising from many small multiplicative errors. Appropriate when
the value of an observed variable is a random proportion of the previously observed value.
Examples include distribution of sizes from a breakage process; distribution of
income size, inheritances and bank deposits; distribution of various biological phenomena; life
distribution of some transistor types.

The lognormal distribution is widely used in situations where values are positively skewed (where
the distribution has a long right tail; negatively skewed distributions have a long left tail; a normal
distribution has no skewness). Examples of data that "fit" a lognormal distribution include financial
security valuations or real estate property valuations. Financial analysts have observed that
stock prices are usually positively skewed, rather than normally (symmetrically) distributed. Stock
prices exhibit this trend because the stock price cannot fall below the lower limit of zero but may
increase to any price without limit. Similarly, healthcare costs illustrate positive skewness since unit
costs cannot be negative. For example, there can't be a negative cost for services in a capitation
contract. This distribution is often a reasonable description of healthcare cost data.

In the case where the data are log-normally distributed, the Geometric Mean acts as a better data
descriptor than the mean. The more closely the data follow a log-normal distribution, the closer the
geometric mean is to the median, since the log re-expression produces a symmetrical distribution.
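
The closeness of the geometric mean to the median can be seen in a simulated log-normal sample. The parameters (log-mean 1, log-standard deviation 1), the seed, and the sample size below are arbitrary choices for the sketch:

```python
import random
from statistics import fmean, geometric_mean, median

rng = random.Random(7)   # fixed seed for reproducibility
data = [rng.lognormvariate(1.0, 1.0) for _ in range(20_000)]

arith = fmean(data)           # pulled upward by the long right tail
geo = geometric_mean(data)    # exponential of the mean of the logs
med = median(data)

print(round(arith, 2), round(geo, 2), round(med, 2))
```

The arithmetic mean sits well above the median, while the geometric mean stays close to it.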

Further Reading:
Snell J., Introduction to Probability, Random House, 1987. Read section 4.2 for a link between beta and F distributions (with the advantage
that tables are easy to find).
Tabachnick B., and L. Fidell, Using Multivariate Statistics, HarperCollins, 1996. Has a good discussion on applications and significance tests
for skewness and kurtosis.

Numerical Example and Discussions

A Numerical Example: Given the following, small (n = 4) data set, compute the descriptive
statistics: x1 = 1, x2 = 2, x3 = 3, and x4 = 6.

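| i | \( x_i \) | \( x_i - \bar{x} \) | \( (x_i - \bar{x})^2 \) | \( (x_i - \bar{x})^3 \) | \( (x_i - \bar{x})^4 \) |
|---|---|---|---|---|---|
| 1 | 1 | -2 | 4 | -8 | 16 |
| 2 | 2 | -1 | 1 | -1 | 1 |
| 3 | 3 | 0 | 0 | 0 | 0 |
| 4 | 6 | 3 | 9 | 27 | 81 |
| Sum | 12 | 0 | 14 | 18 | 98 |

The mean is 12 / 4 = 3; the variance is \( s^2 \) = 14 / 3 = 4.67; the standard deviation is s = \( (14/3)^{0.5} \)
= 2.16; the skewness is 18 / [3 \( (2.16)^3 \)] = 0.5952, and finally, the kurtosis is 98 / [3 \( (2.16)^4 \)] = 1.5.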

You might like to use Descriptive Statistics to check your hand computation.
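
The hand computation can also be checked with a few lines of code. The function below is a sketch that follows the formulas of this section (with n - 1 in the denominators); the name `describe` is ours.

```python
from math import sqrt

def describe(xs):
    """Mean, variance, standard deviation, skewness and kurtosis as defined above."""
    n = len(xs)
    x_bar = sum(xs) / n
    dev = [x - x_bar for x in xs]
    variance = sum(d ** 2 for d in dev) / (n - 1)
    s = sqrt(variance)
    skewness = sum(d ** 3 for d in dev) / ((n - 1) * s ** 3)
    kurtosis = sum(d ** 4 for d in dev) / ((n - 1) * s ** 4)
    return x_bar, variance, s, skewness, kurtosis

x_bar, variance, s, skewness, kurtosis = describe([1, 2, 3, 6])
print(x_bar, round(variance, 2), round(s, 2), round(skewness, 4), round(kurtosis, 4))
```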

A Short Discussion on the Descriptive Statistics:

Deviations about the mean \( \mu \) of a distribution are the basis for most of the statistical tests we will
learn. Since we are measuring how much a set of scores is dispersed about the mean \( \mu \), we are
measuring variability. We can calculate the deviations about the mean \( \mu \) and express them as
variance \( \sigma^2 \) or standard deviation \( \sigma \). It is very important to have a firm grasp of this concept
because it will be a central concept throughout your statistics course.

Both variance \( \sigma^2 \) and standard deviation \( \sigma \) measure variability within a distribution. Standard
deviation \( \sigma \) is a number that indicates how much on average each of the values in the distribution
deviates from the mean \( \mu \) (or center) of the distribution. Keep in mind that variance \( \sigma^2 \) measures
the same thing as standard deviation \( \sigma \) (dispersion of scores in a distribution). Variance \( \sigma^2 \),
however, is the average of the squared deviations about the mean \( \mu \). Thus, variance \( \sigma^2 \) is the square of
the standard deviation \( \sigma \).

The expected value and the variance of the statistic \( \bar{x} \) are \( \mu \) and \( \sigma^2/n \), respectively.

The expected value and variance of the statistic \( S^2 \) are \( \sigma^2 \) and \( 2\sigma^4/(n-1) \), respectively.

\( \bar{x} \) and \( S^2 \) are the best estimators for \( \mu \) and \( \sigma^2 \). They are Unbiased (you may update your estimate);
Efficient (they have the smallest variation among other estimators); Consistent (increasing sample
size provides a better estimate); and Sufficient (you do not need to have the whole data set; what
you need are \( \sum x_i \) and \( \sum x_i^2 \) for estimations). Note also that the above variance of \( S^2 \) is justified only in
the case where the population distribution tends to be normal; otherwise one may use
bootstrapping techniques.
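
These expected values and variances can be illustrated by simulation. The sketch below assumes a normal population with mean 10 and standard deviation 2, samples of size n = 5, and 20,000 replications; all of these values, and the seed, are arbitrary choices.

```python
import random
from statistics import fmean, variance

rng = random.Random(123)             # fixed seed for reproducibility
mu, sigma, n, reps = 10.0, 2.0, 5, 20_000

means, variances = [], []
for _ in range(reps):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    means.append(fmean(sample))          # sample mean of this replication
    variances.append(variance(sample))   # sample variance S^2 (divides by n - 1)

mean_of_means, var_of_means = fmean(means), variance(means)
mean_of_vars, var_of_vars = fmean(variances), variance(variances)

print(round(mean_of_means, 2), round(var_of_means, 2))   # compare with mu and sigma^2/n
print(round(mean_of_vars, 2), round(var_of_vars, 2))     # compare with sigma^2 and 2*sigma^4/(n-1)
```

With these settings the theoretical values are 10 and 0.8 for the sample mean, and 4 and 8 for the sample variance.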

In general, it is believed that the mode, median, and mean go from lower to higher in
positively skewed data sets, and follow just the opposite pattern in negatively skewed data sets.
However, this is not always so. For example, in the following 23 numbers, mean = 2.87 and
median = 3, but the data are positively skewed:

4, 2, 7, 6, 4, 3, 5, 3, 1, 3, 1, 2, 4, 3, 1, 2, 1, 1, 5, 2, 2, 3, 1

and, the following 10 numbers have mean = median = mode = 4, but the data set is left skewed:

1, 2, 3, 4, 4, 4, 5, 5, 6, 6.
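
Both counterexamples can be verified directly. Since the denominator of the skewness formula is always positive, the sign of the skewness is the sign of the sum of cubed deviations, which is what this sketch computes (the helper name is ours):

```python
from statistics import mean, median, mode

def cubed_deviation_sum(xs):
    """Sum of cubed deviations from the mean; its sign is the sign of the skewness."""
    m = mean(xs)
    return sum((x - m) ** 3 for x in xs)

a = [4, 2, 7, 6, 4, 3, 5, 3, 1, 3, 1, 2, 4, 3, 1, 2, 1, 1, 5, 2, 2, 3, 1]
b = [1, 2, 3, 4, 4, 4, 5, 5, 6, 6]

print(round(mean(a), 2), median(a), cubed_deviation_sum(a) > 0)
print(mean(b), median(b), mode(b), cubed_deviation_sum(b) < 0)
```

The first set has a mean below its median yet a positive third moment; the second has equal mean, median, and mode yet a negative third moment.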

Note also that many commercial software packages compute skewness and kurtosis with formulas that
differ from the ones given above, so their output may not match a hand computation. There
is no easy way to determine confidence intervals about a computed skewness or kurtosis value
from a small to medium sample. The literature gives tables based on asymptotic methods for
sample sets larger than 100 for normal distributions only.

You may have noticed that using the above numerical example on some computer packages such
as SPSS, the skewness and the kurtosis are different from what we have computed. For example,
the SPSS output for the skewness is 1.190. However, for a large sample size n, the results are
nearly identical.
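
One way to account for the 1.190 figure is to assume that the package applies a small-sample adjustment, namely the factor n / [(n - 1)(n - 2)] applied to the sum of cubed standardized deviations. This is our assumption about the package, not something stated in the text; the sketch below shows that it reproduces the reported value for the data set 1, 2, 3, 6.

```python
from math import sqrt

def simple_skewness(xs):
    """Skewness as defined in this section: sum of cubed deviations / [(n - 1) s^3]."""
    n = len(xs)
    x_bar = sum(xs) / n
    s = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    return sum((x - x_bar) ** 3 for x in xs) / ((n - 1) * s ** 3)

def adjusted_skewness(xs):
    """Assumed package formula: n / [(n - 1)(n - 2)] * sum of cubed z scores."""
    n = len(xs)
    x_bar = sum(xs) / n
    s = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - x_bar) / s) ** 3 for x in xs)

data = [1, 2, 3, 6]
print(round(simple_skewness(data), 4), round(adjusted_skewness(data), 3))
```

Under this assumption the two versions differ by the factor n / (n - 2), which approaches 1 as n grows; this is why the results agree closely for large samples.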

Reference and Further Readings:


David H., Early sample measures of variability, Statistical Science, Vol. 13, 1998, 368-377. This article provides a good historical account of
statistical measures.
Groeneveld R., A class of quantile measures for kurtosis, The American Statistician, 325, Nov. 1998.
Lehmann E., Testing Statistical Hypotheses, 1996, Wiley. Exact confidence interval for the coefficient of variation is computationally tedious as
shown in this book.

The Two Statistical Representations of a Population

The following figure depicts a typical relationship between the cumulative distribution function (cdf)
and the density (for continuous random variables),

All characteristics of the population are well described by either of these two functions. The figure
also illustrates their applications in determining the (lower) percentile measures denoted by P:

P = P[X ≤ x] = Probability that the random variable X is less than or equal to a given number x,

among other useful information. Notice that the probability P is the area under the density function
curve to the left of x, and it is numerically equal to the height of the cdf curve at point x.

Both functions can be estimated by smoothing the empirical (i.e., observed) cumulative step-
function, and smoothing the histogram constructed from a random sample.

Empirical (i.e., observed) Cumulative Distribution Function

The empirical cumulative distribution function (ECDF), also known as the Ogive (pronounced o-jive),
is used to graph cumulative frequency.

The ogive is the estimator for the population's cumulative distribution function, which contains
all the characteristics of the population. The empirical distribution is a staircase function with
the locations of the steps randomly placed. The size of each step at each point depends
on the frequency of that point value, and it is equal to the frequency/n, where n is the sample
size. The sample size is the sum of all frequencies.
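
The staircase behavior is easy to see in a small sketch. The function name `make_ecdf` and the four-value sample are ours, chosen only for illustration:

```python
from bisect import bisect_right

def make_ecdf(sample):
    """Return a function F with F(x) = (number of observations <= x) / n."""
    data = sorted(sample)
    n = len(data)

    def ecdf(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(data, x) / n

    return ecdf

F = make_ecdf([3, 1, 2, 2])
print([F(x) for x in (0, 1, 2, 3)])
```

The value 2 occurs twice in a sample of size 4, so the step at x = 2 has height 2/4, while the steps at 1 and 3 each have height 1/4.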

Note that almost all statistics we have covered up to now can be obtained, and understood
more deeply, graphically by using the Empirical Distribution Function JavaScript. You may like
using this JavaScript in performing some numerical experimentation for a deeper
understanding.

Other widely used decision models based upon the empirical cumulative distribution function
(ECDF) as a measuring tool and decision procedure are the ABC Inventory Classification,
Single-period Inventory Analysis (The Newsboy Model), and determination of the Best Time
to Replace Equipment. For other inventory decisions, visit the Inventory Control Models site.

Introduction

Modeling of a Data Set: Families of parametric distribution models are widely used to
summarize a huge data set, to obtain predictions, assess goodness of fit, to estimate
functions of the data not easily derived directly, or to render manageable random effects. The
trustworthiness of the results obtained depends on the generality of the distribution family
employed.

Inductive Inference: This extension of our knowledge from a particular random sample to
the population is called inductive inference. The main function of business statistics is the
provision of techniques for making inductive inference and for measuring the degree of
uncertainty of such inference. Uncertainty is measured in terms of probability statements, and
that is the reason we need to learn the language of uncertainty and its measuring tool called
probability.

In contrast to the inductive inference, mathematics often uses deductive inference to prove
theorems, while in empirical science, such as statistics, inductive inference is used to find
new knowledge or to extend our knowledge.

Further Readings:
Brown B., F. Spears, and L. Levy, The log F: A distribution for all seasons, Computational Statistics, 17(1), 47-58, 2002.

Probability, Chance, Likelihood, and Odds

The concept of probability occupies an important place in the decision-making process under
uncertainty, whether the problem is one faced in business, in government, in the social
sciences, or just in one's own everyday personal life. In very few decision-making situations
is perfect information -- all the needed facts -- available. Most decisions are made in the face
of uncertainty. Probability enters into the process by playing the role of a substitute for
certainty - a substitute for complete knowledge.

Probability is especially significant in the area of statistical inference. Here the statistician's
prime concern lies in drawing conclusions or making inferences from experiments which
involve uncertainties. The concepts of probability make it possible for the statistician to
generalize from the known (sample) to the unknown (population) and to place a high degree
of confidence in these generalizations. Therefore, Probability is one of the most important
tools of statistical inference.

Probability has an exact technical meaning -- well, in fact it has several, and there is still
debate as to which term ought to be used. However, for most events for which probability is
easily computed, e.g., in the rolling of a die, the probability of getting a four, almost all agree on
the actual value (1/6), if not the philosophical interpretation. A probability is always a number
between 0 and 1. Zero is not "quite" the same thing as impossibility. It is possible that "if" a
coin were flipped infinitely many times, it would never show "tails", but the probability of an
infinite run of heads is 0. One is not "quite" the same thing as certainty but close enough.

The word "chance" or "chances" is often used as an approximate synonym of "probability",
either for variety or to save syllables. It would be better practice to leave "chance" for informal
use, and say "probability" if that is what is meant. One occasionally sees "likely"
and "likelihood"; however, these terms are used casually as synonyms for "probable"
and "probability".

Odds is a probabilistic concept related to probability. The odds in favor of an event are the ratio of
the probability (p) of the event to the probability (1-p) that it does not happen: p/(1-p). They are
often expressed as a ratio of whole numbers; e.g., "odds" of 1 to 5 in the die example above, but for
technical purposes the division may be carried out to yield a positive real number (here 0.2). The odds
against an event are the reverse ratio, of nonevents to events. If the event rate for a disease is 0.1
(10 per cent), its nonevent rate is 0.9 and therefore the odds against it are 9:1.

Another way to compare probabilities and odds is using"part-whole thinking" with a binary
(dichotomous) split in a group. A probability is often a ratio of a part to a whole; e.g., the ratio
of the part [those who survived 5 years after being diagnosed with a disease] to the whole
[those who were diagnosed with the disease]. Odds are often a ratio of a part to a part; e.g.,
the odds against dying are the ratio of the part that succeeded [those who survived 5 years
after being diagnosed with a disease] to the part that 'failed' [those who did not survive 5
years after being diagnosed with a disease].

Aside from their value in betting, odds allow one to specify a small probability (near zero) or a
large probability (near one) using large whole numbers (1,000 to 1 or a million to one). Odds
magnify small probabilities (or large probabilities) so as to make the relative differences
visible. Consider two probabilities: 0.01 and 0.005. They are both small. An untrained
observer might not realize that one is twice as much as the other. But if expressed as odds against
(99 to 1 versus 199 to 1) it may be easier to compare the two situations by focusing on large
whole numbers (199 versus 99) rather than on small ratios or fractions.
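
The conversions between probability and odds can be written as three one-line functions. This sketch uses exact fractions so that the whole-number ratios quoted above come out exactly; the function names are ours.

```python
from fractions import Fraction

def odds_in_favor(p):
    """Odds in favor of an event with probability p: p / (1 - p)."""
    return p / (1 - p)

def odds_against(p):
    """Odds against an event with probability p: (1 - p) / p."""
    return (1 - p) / p

def probability_from_odds(odds):
    """Recover p from the odds in favor: odds / (1 + odds)."""
    return odds / (1 + odds)

print(odds_in_favor(Fraction(1, 6)))     # rolling a four: odds of 1 to 5
print(odds_against(Fraction(1, 10)))     # disease with event rate 0.1
print(odds_against(Fraction(1, 100)), odds_against(Fraction(1, 200)))
```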

How to Assign Probabilities?

Probability is an instrument to measure the likelihood of the occurrence of an event. There
are five major approaches to assigning probability: Classical Approach, Relative Frequency
Approach, Subjective Approach, Anchoring, and the Delphi Technique:

1. Classical Approach: Classical probability is predicated on the condition that the outcomes of an
experiment are equally likely to happen. The classical probability utilizes the idea that the lack of
knowledge implies that all possibilities are equally likely. The classical probability is applied when the
events have the same chance of occurring (called equally likely events), and the sets of events are
mutually exclusive and collectively exhaustive. The classical probability is defined as:

P(X) = Number of favorable outcomes / Total number of possible outcomes

2. Relative Frequency Approach: Relative probability is based on accumulated historical or experimental
data. Frequency-based probability is defined as:

P(X) = Number of times an event occurred / Total number of opportunities for the event to occur.

Note that relative probability is based on the idea that what has happened in the past will continue to hold.

3. Subjective Approach: The subjective probability is based on personal judgment and experience. For
example, medical doctors sometimes assign subjective probability to the length of life expectancy for a
person who has cancer.

4. Anchoring: is the practice of assigning a value obtained from a prior experience and adjusting the value
in consideration of current expectations or circumstances.

5. The Delphi Technique: It consists of a series of questionnaires. Each series is one "round". The
responses from the first "round" are gathered and become the basis for the questions and feedback of
the second "round". The process is usually repeated for a predetermined number of "rounds" or until the
responses are such that a pattern is observed. This process allows expert opinion to be circulated to all
members of the group and eliminates the bandwagon effect of majority opinion.

Delphi Analysis is used in decision making processes, in particular in forecasting. Several "experts" sit
together and try to compromise on something upon which they cannot agree.
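
The classical and relative frequency approaches above can be compared on the die example. In this sketch the classical probability of a four is 1/6 by counting equally likely outcomes, while the relative frequency comes from 60,000 simulated rolls (the seed and number of rolls are arbitrary choices):

```python
import random
from fractions import Fraction

# Classical approach: favorable outcomes / total number of equally likely outcomes.
faces = [1, 2, 3, 4, 5, 6]
classical = Fraction(sum(1 for f in faces if f == 4), len(faces))

# Relative frequency approach: times the event occurred / number of opportunities.
rng = random.Random(42)   # fixed seed for reproducibility
rolls = [rng.randint(1, 6) for _ in range(60_000)]
relative = rolls.count(4) / len(rolls)

print(classical, round(relative, 4))
```

The relative frequency is only an estimate, but with many rolls it settles near the classical value.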

Further Reading:
Delbecq, A., Group Techniques for Program Planning, Scott Foresman, 1975.

General Computational Probability Rules

1. Addition: When two or more events will happen at the same time, and the events are not mutually
exclusive, then:

P (X or Y) = P (X) + P (Y) - P (X and Y)

Notice that the equation P (X or Y) = P (X) + P (Y) - P (X and Y) contains two special events: an event (X
and Y), which is the intersection of sets/events X and Y, and another event (X or Y), which is the union (i.e.,
either/or) of sets X and Y. Although this is very simple, it says relatively little about how event X
influences event Y and vice versa. If P (X and Y) is 0, indicating that events X and Y do not intersect
(i.e., they are mutually exclusive), then we have P (X or Y) = P (X) + P (Y). On the other hand, if P (X and
Y) is not 0, then there are interactions between the two events X and Y. Usually it could be a physical
interaction between them. In this case P (X or Y) is no longer the simple sum P (X) + P (Y): the
P (X and Y) term must be subtracted so that outcomes belonging to both events are not counted twice.

The above rule is known also as the Inclusion-Exclusion Formula. It can be extended to more than
two events. For example, for three events A, B, and C, it becomes:

P(A or B or C) =
P(A) + P(B) + P(C) - P(A and B) - P(A and C) - P(B and C) + P(A and B and C)
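
As a quick check of the addition rule, the following minimal sketch enumerates one roll of a fair die. The die, the events X, Y, and C, and the helper name `prob` are illustrative assumptions, not part of the original text.

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die (an illustrative assumption).
OMEGA = frozenset(range(1, 7))

def prob(event):
    """Probability of an event (a subset of OMEGA) under equally likely outcomes."""
    return Fraction(len(event & OMEGA), len(OMEGA))

X = {2, 4, 6}   # the roll is even
Y = {4, 5, 6}   # the roll is greater than 3
C = {1, 2}      # a third event, used for the three-event formula

# Two events: P(X or Y) = P(X) + P(Y) - P(X and Y)
two_lhs = prob(X | Y)
two_rhs = prob(X) + prob(Y) - prob(X & Y)

# Three events: the Inclusion-Exclusion Formula
three_lhs = prob(X | Y | C)
three_rhs = (prob(X) + prob(Y) + prob(C)
             - prob(X & Y) - prob(X & C) - prob(Y & C)
             + prob(X & Y & C))

print(two_lhs, two_rhs)      # both sides agree
print(three_lhs, three_rhs)  # both sides agree
```

Exact fractions are used so that the two sides can be compared without rounding error.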

2. Special Case of Addition: When two or more events will happen at the same time, and the events are
mutually exclusive, then:

P(X or Y) = P(X) + P(Y)

3. General Multiplication Rule: When two or more events will happen at the same time, and the events
are dependent, then the general multiplication rule is used to find the joint probability:

P(X and Y) = P(Y) × P(X|Y),

where P(X|Y) is a conditional probability.

4. Special Case of Multiplicative Rule: When two or more events will happen at the same time, and the
events are independent, then the special multiplication rule is used to find the joint probability:

P(X and Y) = P(X) × P(Y)

5. Conditional Probability: A conditional probability is denoted by P(X|Y). This phrase is read: the
probability that X will occur given that Y is known to have occurred.

Conditional probabilities are based on knowledge of one of the variables. The conditional probability of
an event, such as X, occurring given that another event, such as Y, has occurred is expressed as:

P(X|Y) = P(X and Y) ÷ P(Y),

provided P(Y) is not zero. Note that when using the conditional rule of probability, you always divide the
joint probability by the probability of the event after the word given. Thus, to get P(X given Y), you divide
the joint probability of X and Y by the unconditional probability of Y. In other words, the above equation is
used to find the conditional probability for any two dependent events.

The simplest version of the Bayes' Theorem is:

P(X|Y) = P(Y|X) × P(X) ÷ P(Y)

If two events, such as X and Y, are independent then:

P(X|Y) = P(X),

and
P(Y|X) = P(Y)

6. The Bayes' Rule:

P(X|Y) = [P(X) × P(Y|X)] ÷ [P(X) × P(Y|X) + P(not X) × P(Y|not X)]

Bayes' rule provides the posterior probability [i.e., P(X|Y)] by sharpening the prior probability [i.e., P(X)] with
accurate and relevant information expressed in probabilistic terms.

An Application: Suppose two machines, A and B, produce identical parts. Machine A has
probability 0.1 of producing a defective each time, whereas Machine B has probability 0.4 of
producing a defective. Each machine produces one part. One of these parts is selected at random,
tested, and found to be defective. What is the probability that it was produced by Machine B?

Probability tree diagrams depict events or sequences of events as branches of a tree. Tree
diagrams are useful for visualizing the conditional probabilities:

The probabilities at the end of each branch are the probability that events leading to that end will
happen simultaneously. The above tree diagram indicates that the probability of a part testing
Good is 9/20 + 6/20 = 3/4, therefore the probability of Bad is 1/4. Thus, P(made by B | it is bad) =
(4/20) / (1/4) = 4/5.

Now using the Bayes' Rule we are able to obtain useful information such as:

P(it is bad | made by B) = 1/4(4/5) / [1/4(4/5) + 3/4(2/5)] = 2/5.

Equivalently, using the above conditional probability, results in:

P(it is bad | made by B) = P(it is bad & made by B)/P(made by B) = (4/20)/(1/2) = 2/5.
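
The two-machine calculation can be reproduced with a few lines of arithmetic. This is a minimal sketch; the dictionary names are illustrative, and the equal priors of 1/2 reflect the statement that each machine produces one part and one of them is selected at random.

```python
from fractions import Fraction

# Prior: the tested part is equally likely to come from either machine.
prior = {"A": Fraction(1, 2), "B": Fraction(1, 2)}
# Probability of a defective part from each machine (from the text).
p_bad_given = {"A": Fraction(1, 10), "B": Fraction(4, 10)}

# Joint probabilities P(machine and bad): the ends of the "bad" branches of the tree.
joint_bad = {m: prior[m] * p_bad_given[m] for m in prior}

# Total probability of a bad part, then Bayes' rule for each machine.
p_bad = sum(joint_bad.values())
posterior = {m: joint_bad[m] / p_bad for m in prior}

print(p_bad)           # 1/4
print(posterior["B"])  # 4/5
```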

Venn Diagram: A diagram used, in general, to represent sets and subsets. It is a way of displaying
how different sets of objects overlap. John Venn, an English mathematician, devised them. A Venn
diagram can be used as a computational probability tool similar to the probability tree diagram.
The following are Venn diagram representations of two of the above Probability Rules:

An Application: A survey shows that 70% of all convenience store shoppers buy milk and 55%
buy bread. If 45% buy both bread and milk, what percentage buy neither?

Solution: The Venn diagram model for this problem is depicted below:

The solution is readily available from the above Venn diagram model, i.e.

P [buy neither] = 1 - [0.25 + 0.45 + 0.1] = 20%

Another approach is to use both, first the Complement Probability Rule and then the Addition
Probability Rule, i.e.

P [buy neither] = 1 - P[bread OR milk] = 1 - [0.70 + 0.55 – 0.45] = 20%

It is up to you to decide which approach is "nicer" and more transparent.

Exercise Your Knowledge on the following probabilistic problem: An urn contains 4 red balls
(representing, say, defective items) and 8 white balls (representing, say, non-defective items), as
depicted below:

An Urn Model

Suppose 2 balls are drawn at random. Use the following tree diagram, which is a probabilistic
model for this experiment, and verify the solution to the following questions, with the answer given
in the bracket at the end of each question:

A Tree Diagram as a Probabilistic Model

1. What is the probability of having at least 1 white ball? (10/11)
2. What is the probability that the balls are the same color? (17/33)
3. What is the probability that the second ball is white? (2/3)
4. What is the probability that the second ball is white given that the balls are the same color? (14/17)
5. Are the events in (2) and (3) independent? (No. Why Not?)
6. What is the expected number of white balls? (4/3)
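
One way to verify the bracketed answers is to enumerate every equally likely ordered draw of two balls without replacement. The sketch below assumes sampling without replacement (as the answers imply); the helper name `prob` is illustrative.

```python
from fractions import Fraction
from itertools import permutations

# 4 red and 8 white balls; positions in the list make every ball distinct.
balls = ["R"] * 4 + ["W"] * 8

# All ordered draws of 2 distinct balls: 12 * 11 = 132 equally likely outcomes.
draws = [(balls[i], balls[j]) for i, j in permutations(range(len(balls)), 2)]

def prob(predicate):
    """Fraction of equally likely ordered draws that satisfy the predicate."""
    return Fraction(sum(1 for d in draws if predicate(d)), len(draws))

at_least_one_white = prob(lambda d: "W" in d)
same_color = prob(lambda d: d[0] == d[1])
second_white = prob(lambda d: d[1] == "W")
both_white = prob(lambda d: d == ("W", "W"))   # second white AND same color

second_white_given_same = both_white / same_color
independent = (both_white == same_color * second_white)
expected_white = sum(Fraction(d.count("W")) for d in draws) / len(draws)

print(at_least_one_white, same_color, second_white)
print(second_white_given_same, independent, expected_white)
```

Question 5 is answered by comparing the joint probability with the product of the two individual probabilities; they differ, so the events are dependent.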

Another Question for You: A fair coin is flipped twice. What is the conditional probability that both
flips land on heads, given:

a. The first flip lands on heads.

b. At least one of the flips lands on heads.

Are the answers to part a and b identical? Why?

You may like using the Bayes' Revised Probability JavaScript.

Further Reading:
Ross Sh., A First Course in Probability, Prentice Hall, 2001.

Combinatorial Math: How to Count Without Counting

Many disciplines and sciences require the answer to the question: How Many? In finite probability
theory we need to know how many outcomes there would be for a particular event, and we need to
know the total number of outcomes in the sample space.

Combinatorics, also referred to as Combinatorial Mathematics, is the field of mathematics
concerned with problems of selection, arrangement, and operation within a finite or discrete
system. Its objective is: How to count without counting. Therefore, one of the basic problems of
combinatorics is to determine the number of possible configurations of objects of a given type.

You may ask, why combinatorics? If a sample space contains a finite set of outcomes,
determining the probability of an event often is a counting problem. But often the numbers are just
too large to count in the ordinary 1, 2, 3, 4 way.

A Fundamental Result: If an operation consists of two steps, of which the first can be done in
n1 ways and for each of these the second can be done in n2 ways, then the entire operation can be
done in a total of n1 × n2 ways.

This simple rule can be generalized as follows: If an operation consists of k steps, of which the first
can be done in n1 ways and for each of these the second step can be done in n2 ways, for each of
these the third step can be done in n3 ways and so forth, then the whole operation can be done in
n1 × n2 × n3 × ... × nk ways.

Numerical Example: A quality control inspector wishes to select one part for inspection from each
of four different bins containing 4, 3, 5 and 4 parts respectively. The total number of ways that the
parts can be selected is 4×3×5×4 or 240 ways.

Factorial Notation: The notation n! (read as, n factorial) means by definition the product:

n! = (n)(n-1)(n-2)(n-3)...(3)(2)(1).

Notice that by convention, 0! = 1 (i.e., 0! ≡ 1). For example, 6! = 6×5×4×3×2×1 = 720.

Permutations versus Combinations: A permutation is an arrangement of objects from a set of
objects. That is, the objects are chosen from a particular set and listed in a particular order. A
combination is a selection of objects from a set of objects; that is, objects are chosen from a
particular set and listed, but the order in which the objects are listed is immaterial.

Permutations Example: How many permutations (ordered arrangements) are there of the letters
a, b, and c? In this case it is easy to make a list:

abc, bac, cab
acb, bca, cba

The number of permutations is six. We might observe that there are 3 choices for the first letter, 2
choices for the second letter and 1 choice for the third letter. There are 3 × 2 × 1 = 3! permutations
of the three letters a, b and c. Generalizing, if we have n distinct objects, we would have n choices
for the first position, n-1 choices for the second position and so on. We find that the number of
permutations of n objects selected among n distinct objects is n!.

The number of ways of lining up k objects at a time from n distinct objects is denoted by n P k, and
by the preceding we have:

n P k = (n)(n-1)(n-2)(n-3)......(n-k+1)

Therefore, the number of permutations of n distinct objects taken k at a time can be written as:

n P k = n! / (n - k) !
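
The listing and the formula can both be checked mechanically. This is a minimal sketch; the function name `n_P_k` is an illustrative assumption, and Python 3.8 or later also offers `math.perm` for the same count.

```python
import math
from itertools import permutations

# All ordered arrangements of the letters a, b, c.
arrangements = ["".join(p) for p in permutations("abc")]
print(arrangements)

def n_P_k(n, k):
    """Number of permutations of n distinct objects taken k at a time: n! / (n - k)!"""
    return math.factorial(n) // math.factorial(n - k)

count_abc = len(arrangements)   # should equal 3!
p_4_3 = n_P_k(4, 3)             # 4 * 3 * 2
print(count_abc, p_4_3)
```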

Combinations: There are many problems in which we are interested in determining the number of
ways in which k objects can be selected from n distinct objects without regard to the order in which
they are selected. Such selections are called combinations or k-sets. It may help to think of
combinations as a committee. The key here is without regard for order.

The number of combinations of k objects from a set with n objects is n C k. For example, the
combinations of {1,2,3,4} taken k=2 at a time are {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}, for a total of 6
= 4! / [(2!)(4-2) !] subsets.

The general formula is:

n C k = n! / [k! (n-k) !].

This is basically a subset problem where you specify the number of elements in the subset.

You may ask, what is the relation of combinations to permutations? Consider the subsets of size 3
drawn from {1,2,3,4}; there are four of them. Each of these subsets forms 3! = 6 distinct permutations,
and 6 × 4 = 24, which equals 4 P 3. If we use the notation 4 C 3 to indicate the number of combinations
of 4 distinct objects taken 3 at a time, then by the above we have:

4 C 3 = 4 P 3 / 3! = 4! / [3! (4 - 3)!] = 4.

Notice that:

n C k = n C (n-k)
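
The combination counts above can be confirmed with the standard library. This is a small sketch; the choice of n = 10 for the symmetry check is arbitrary.

```python
import math
from itertools import combinations

# The 2-subsets of {1, 2, 3, 4}, listed explicitly.
subsets = list(combinations([1, 2, 3, 4], 2))
n_subsets = len(subsets)

# Relation to permutations: n C k = n P k / k!
c_4_3 = math.perm(4, 3) // math.factorial(3)

# Symmetry: n C k = n C (n-k), checked here for n = 10.
symmetric = all(math.comb(10, k) == math.comb(10, 10 - k) for k in range(11))

print(subsets)
print(n_subsets, c_4_3, symmetric)
```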

An Application: One of the fundamental aspects of economic activity is a trade in which one party
provides another party something, in return for which the second party provides the first something
else, i.e., the Barter Economics.

The invention of money provided a necessary tool for trading. The
use of money greatly simplifies the barter system of trading, thus lowering transaction costs. If a
society produces 100 different goods, there are:

100 C 2 = 100! / [2! (100 - 2)!] = 100(99)(98!) / [2 (98!)] = (100)(99) / 2 = 4,950

different possible "good-for-good" trades. With money, only 100 prices are needed to establish all
possible trading ratios.

As another application, consider the following probabilistic problem. Suppose there are at most
10 defective items in a batch of size 150. You have shipped 15 items to one of your customers.
What is the chance that the customer would find at least one defective item? Taking the worst case
of exactly 10 defective items:

P [at least one defective item] = 1 - P [no defective items] = 1 - [(10 C 0)(140 C 15) / (150 C 15)]
≈ 2/3

This probability is large, meaning there is a high risk of an unsatisfied customer.
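
The shipment probability is a hypergeometric calculation and can be evaluated directly. The sketch below assumes exactly 10 defective items in the batch (the worst case allowed by "at most 10"); the function and parameter names are illustrative.

```python
import math

def prob_at_least_one_defective(batch, defective, shipped):
    """1 - P(no defective items in the shipment), sampling without replacement."""
    good = batch - defective
    # Choose all shipped items from the good ones, out of all possible shipments.
    p_none = math.comb(good, shipped) / math.comb(batch, shipped)
    return 1 - p_none

p = prob_at_least_one_defective(batch=150, defective=10, shipped=15)
print(round(p, 3))   # close to 2/3
```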

Permutation with Repetitions: How many different letter arrangements can be formed using the
letters P E P P E R?

In general, there are

n! / (n1! n2! n3! ... nr!)

different permutations of n objects, of which n1 are alike, n2 are alike, n3 are alike, ..., nr are alike.
This quantity is called a multinomial coefficient.
Therefore, the answer is 6! / (3! 2! 1!) = 60 possible arrangements of the letters P E P P E R.
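
The count for P E P P E R can be checked both by the formula and by brute-force enumeration. This is a minimal sketch; the function name `distinct_arrangements` is an illustrative assumption.

```python
import math
from collections import Counter
from itertools import permutations

def distinct_arrangements(word):
    """Multinomial coefficient n! / (n1! n2! ... nr!) for the letters of word."""
    result = math.factorial(len(word))
    for count in Counter(word).values():
        result //= math.factorial(count)
    return result

by_formula = distinct_arrangements("PEPPER")
# Brute force: generate all 6! orderings and keep only the distinct ones.
by_enumeration = len(set(permutations("PEPPER")))

print(by_formula, by_enumeration)
```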

You may like using the Combinatorial Math JavaScript.

Further Reading:
Ross Sh., Introduction to Probability and Statistics for Engineers and Scientists, Academic Press, 2004.

Joint Probability and Statistics

A joint probability distribution of a group of random variables is the distribution of the group of
variables as a whole. Applied business statistics deals mostly with the joint probability distribution of
two discrete random variables. The joint probability distribution of two discrete random variables gives
the likelihood of observing each combination of values of the two variables.

Joint Probability Function: Let us have two discrete random variables X and Y, taking values x_i, i
= 1,...,m, and y_j, j = 1,...,n, respectively. The function:

P_{X,Y}(x, y) = P(X = x, Y = y)

is called the joint probability function of the random variables X and Y.

As an example, consider two competitive stocks (A and B). Suppose the estimated rates of return
of stocks A and B are given as follows (respectively):

RA = [0.8, 1.0, 1.2], and RB = [0.9, 1.0, 1.1]

The numbers in the body of the following table are the estimated joint probabilities of all possible
combinations of values of the two random variables RA and RB:

                  RB
            0.9   1.0   1.1
      0.8   0.1   0.1   0.1
RA    1.0   0.1   0.1   0.1
      1.2   0.1   0.2   0.1

Joint Probability

Marginal Probability Function: The function:

P_X(x) = Σ_{j=1}^{n} P(X = x, Y = y_j)

is called the marginal density of X and similarly P_Y(y):

P_Y(y) = Σ_{i=1}^{m} P(X = x_i, Y = y)

is called the marginal density of Y.

Numerical Example: Find the marginal density of RA and RB from the Joint Probability table.

To calculate the marginal distribution of RB, simply look at the table and add the probabilities in
each column.

To obtain the marginal distribution of RA, add the probabilities in each row. The marginal
distributions of RA and RB are shown at the right and the bottom margins of the table below,
respectively:

                      RB             Marginal
                0.9   1.0   1.1         ↓
          0.8   0.1   0.1   0.1        0.3
RA        1.0   0.1   0.1   0.1        0.3
          1.2   0.1   0.2   0.1        0.4
Marginal →      0.3   0.4   0.3

Marginal Probability Functions
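
The row and column sums can be computed in a few lines. This is a minimal sketch using plain lists; the variable names are illustrative, and the rounding only removes floating-point noise.

```python
# Joint probabilities: rows are RA = 0.8, 1.0, 1.2; columns are RB = 0.9, 1.0, 1.1.
joint = [
    [0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1],
    [0.1, 0.2, 0.1],
]

# Marginal of RA: sum each row. Marginal of RB: sum each column.
marginal_RA = [round(sum(row), 10) for row in joint]
marginal_RB = [round(sum(col), 10) for col in zip(*joint)]

print(marginal_RA)  # row sums
print(marginal_RB)  # column sums
```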

It is clear that a given joint distribution determines the marginal distributions uniquely. However, the
converse is not true; a given marginal distribution can come from many different joint distributions.
The function that links the marginal densities and the joint density is called the copula. In practice,
one picks the marginal distributions first and then selects an appropriate copula to achieve the right
amount of dependency among the individual random variables.

Cumulative Distribution: Take X and Y as above, then the function:

F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y)

is called the joint cumulative distribution of X and Y.

                  RB
            0.9   1.0   1.1
      0.8   0.1   0.2   0.3
RA    1.0   0.2   0.4   0.6
      1.2   0.3   0.7   1.0

Joint Cumulative Distribution

The resulting F must be non-decreasing in the left-to-right and top-to-bottom directions.
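
The cumulative table is obtained from the joint table by accumulating first along each row and then down each column. A minimal sketch with illustrative variable names:

```python
from itertools import accumulate

# Joint probabilities: rows are RA = 0.8, 1.0, 1.2; columns are RB = 0.9, 1.0, 1.1.
joint = [
    [0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1],
    [0.1, 0.2, 0.1],
]

# Running totals along each row, then down each column of the result.
row_cumulative = [list(accumulate(row)) for row in joint]
col_cumulative = [list(accumulate(col)) for col in zip(*row_cumulative)]
# Transpose back to row-major order and remove floating-point noise.
F = [[round(v, 10) for v in row] for row in zip(*col_cumulative)]

for row in F:
    print(row)

# F should never decrease from left to right or from top to bottom.
non_decreasing = all(
    F[i][j] <= F[i][j + 1] and F[j][i] <= F[j + 1][i]
    for i in range(3) for j in range(2)
)
print(non_decreasing)
```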

The function:

F_X(x) = Σ_{j=1}^{n} P(X ≤ x, Y = y_j)

is called the marginal cumulative distribution of X, and similarly:

F_Y(y) = Σ_{i=1}^{m} P(X = x_i, Y ≤ y)

is called the marginal cumulative distribution of Y.

Stochastic Independence: Events A and B are stochastically independent when P(A | B) does not
depend on the event B, that is, P(A | B) = P(A), where the conditional probability is given by:

P(A | B) = [P(A ∩ B)] / [P(B)] if P(B) > 0

and is left undefined when P(B) = 0. The symbol A ∩ B means "the event A occurs and the event B
occurs".

As an example, suppose we wish to compute the probability that the return on A is medium or high
(RA ≥ 1.0) given that the return on B is medium or high (RB ≥ 1.0).

We need to calculate:

P(RA ≥ 1.0 | RB ≥ 1.0) = [P(RA ≥ 1.0 and RB ≥ 1.0)] / [P(RB ≥ 1.0)]

Now referring to the joint probability table:

                  RB
            0.9   1.0   1.1
      0.8   0.1   0.1   0.1
RA    1.0   0.1   0.1   0.1
      1.2   0.1   0.2   0.1

P(RA ≥ 1.0 and RB ≥ 1.0) is the sum of the four cells in the last two rows and last two columns:
0.1 + 0.1 + 0.2 + 0.1 = 0.5.

P(RB ≥ 1.0) is the sum of the six cells in the last two columns: 0.4 + 0.3 = 0.7.

Consequently:

P(RA ≥ 1 | RB ≥ 1) = [0.5] / [0.7] = 0.714.
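
The same conditional probability can be computed by summing the relevant cells of the joint table. This sketch stores the table in tenths and uses exact fractions; the helper name `prob` and the variable names are illustrative.

```python
from fractions import Fraction

ra_values = [Fraction("0.8"), Fraction("1.0"), Fraction("1.2")]
rb_values = [Fraction("0.9"), Fraction("1.0"), Fraction("1.1")]
tenths = [[1, 1, 1], [1, 1, 1], [1, 2, 1]]   # joint probabilities in tenths

joint = {
    (ra, rb): Fraction(tenths[i][j], 10)
    for i, ra in enumerate(ra_values)
    for j, rb in enumerate(rb_values)
}

def prob(event):
    """Sum the joint probabilities of the cells where event(ra, rb) is true."""
    return sum((p for (ra, rb), p in joint.items() if event(ra, rb)), Fraction(0))

p_both = prob(lambda ra, rb: ra >= 1 and rb >= 1)
p_rb = prob(lambda ra, rb: rb >= 1)
p_conditional = p_both / p_rb

print(p_both, p_rb, p_conditional, round(float(p_conditional), 3))
```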


An Application: Determine the number of elementary outcomes and then find the probability of
the event 1/2(RA + RB) < 1.0.

Note that each return takes three values and is allowed to move independently of the other return,
which means we have nine elementary outcomes. The probabilities of the elementary outcomes
that belong to the event 1/2(RA + RB) < 1.0 are marked with an asterisk:

                  RB
            0.9    1.0    1.1
      0.8   0.1*   0.1*   0.1*
RA    1.0   0.1*   0.1    0.1
      1.2   0.1    0.2    0.1

Consequently,

P[1/2(RA + RB) < 1.0] = 0.1 + 0.1 + 0.1 + 0.1 = 0.4.
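
The event probability can be confirmed by listing the nine elementary outcomes and keeping those whose average return is below 1.0. This is a sketch with illustrative names; exact fractions avoid floating-point trouble at the boundary value 1.0.

```python
from fractions import Fraction

ra_values = [Fraction("0.8"), Fraction("1.0"), Fraction("1.2")]
rb_values = [Fraction("0.9"), Fraction("1.0"), Fraction("1.1")]
tenths = [[1, 1, 1], [1, 1, 1], [1, 2, 1]]   # joint probabilities in tenths

# Each elementary outcome is (RA value, RB value, probability).
outcomes = [
    (ra, rb, Fraction(tenths[i][j], 10))
    for i, ra in enumerate(ra_values)
    for j, rb in enumerate(rb_values)
]

# Keep the outcomes whose average return is strictly below 1.0.
in_event = [(ra, rb, p) for ra, rb, p in outcomes if (ra + rb) / 2 < 1]
p_event = sum(p for _, _, p in in_event)

print(len(outcomes), len(in_event), float(p_event))
```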

For estimation of the expected values, variances, etc., you may use the Bivariate Distributions
JavaScript.

Further Reading:
Ross Sh., Introduction to Probability Models, Academic Press, 2002.

Mutually Exclusive versus Independent Events

Mutually Exclusive (ME): Events A and B are ME if both cannot occur simultaneously. That is, P[A
and B] = 0.

Independency (Ind.): Events A and B are independent if having the information that B already
occurred does not change the probability that A will occur. That is, P[A given B occurred] = P[A].

If two events (each with nonzero probability) are ME they are also Dependent: P[A given B] = P[A
and B] ÷ P[B], and since P[A and B] = 0 (by ME), then P[A given B] = 0, which differs from P[A]. Similarly,

If two events are Independent then they are also not ME.

If two events are Dependent then they may or may not be ME.

If two events are not ME, then they may or may not be Independent.
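
These four statements can be illustrated with events on a single roll of a fair die. The die and the particular events below are illustrative assumptions chosen to give one example of each case; they are not from the original text.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))   # one roll of a fair die

def prob(event):
    return Fraction(len(event), len(OMEGA))

def mutually_exclusive(a, b):
    return prob(a & b) == 0

def independent(a, b):
    return prob(a & b) == prob(a) * prob(b)

low, middle = {1, 2}, {3, 4}            # cannot both occur
even = {2, 4, 6}                        # overlaps `low` in exactly {2}
first_three, last_four = {1, 2, 3}, {3, 4, 5, 6}

# ME events with nonzero probability are dependent.
case_me = (mutually_exclusive(low, middle), independent(low, middle))
# Independent events with nonzero probability are not ME.
case_ind = (mutually_exclusive(even, low), independent(even, low))
# Dependent events need not be ME.
case_dep = (mutually_exclusive(first_three, last_four),
            independent(first_three, last_four))

print(case_me, case_ind, case_dep)
```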

The following Figure contains all possibilities. The notations used in this table are as follows: X
means does not imply, question mark ? means it may or may not imply, while the check mark
means it implies.

Notice that the (probabilistic) pairwise independency and mutual independency for a collection of
events A1,..., An are two different notions.

Further Reading:
Ross Sh., A First Course in Probability, Prentice Hall, 2001.

What Is so Important About the Normal Distributions?

The term "normal" possibly arose because of the various attempts made to establish this
distribution as the underlying rule governing all continuous variables. These attempts were based
on false premises and consequently failed. Nonetheless, the normal distribution rightly occupies a
preeminent place in the field of probability. In addition to portraying the distribution of many types of
natural and physical phenomena (such as the heights of men, diameters of machined parts, etc.), it
also serves as a convenient approximation of many other distributions which are less tractable.
Most importantly, it describes the manner in which certain estimators of population characteristics
vary from sample to sample and, thereby, serves as the foundation upon which much statistical
inference from a random sample to a population is made.

Normal Distribution (also called Gaussian) curves, which have a bell-shaped appearance (they are
sometimes even referred to as "bell-shaped curves"), are very important in statistical analysis. In
any normal distribution the observations are distributed symmetrically around the mean: about 68% of all
values under the curve lie within one standard deviation of the mean and about 95% lie within two
standard deviations.
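
The 68% and 95% figures can be checked against the standard normal cumulative distribution. A minimal sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later):

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, standard deviation 1

within_one_sd = z.cdf(1) - z.cdf(-1)
within_two_sd = z.cdf(2) - z.cdf(-2)

print(round(within_one_sd, 4), round(within_two_sd, 4))
```

Because any normal variable can be standardized, the same proportions hold for every mean and standard deviation.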

There are many reasons for the popularity of normal distributions. The following are the most
important reasons for their applicability:

1. One reason the normal distribution is important is that a wide variety of naturally occurring random
variables such as heights and weights of all creatures are distributed evenly around a central value,
average, or norm (hence, the name normal distribution). Although the distributions are only
approximately normal, they are usually quite close.

Whenever many factors influence the outcome of a random phenomenon, the underlying
distribution is approximately normal. For example, the height of a tree is determined by the "sum" of such
factors as rain, soil quality, sunshine, disease, etc.

As Francis Galton wrote in 1889, "Whenever a large sample of chaotic elements are taken in hand and
arranged in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to
have been latent all along."

2. Almost all statistical tables are limited by the size of their parameters. However, when these
parameters are large enough one may use the normal distribution for calculating the critical values for these
tables. For example, the F-statistic is related to the standard normal z-statistic as follows: F = z², where F
has d.f.1 = 1 and d.f.2 is the largest available in the F-table. For more, visit the Relationships among
Common Distributions.

Approximation of the binomial: For example, the normal distribution provides a very accurate
approximation of the binomial when n is large and p is close to 1/2. Even if n is small and p is not
extremely close to 0 or to 1, the approximation is adequate. In fact, the normal approximation of the
binomial will be satisfactory for most purposes provided that np > 5 and nq > 5.

Here is how the approximation is made. First, set μ = np and σ² = npq. To allow for the fact that the
binomial is a discrete distribution, we conventionally use a continuity correction factor of 1/2 unit added
to or subtracted from X on the grounds that the discrete value (x = a) should correspond on a continuous
scale to (a - 1/2) < x < (a + 1/2). Then we compute the value of the standard normal variable by:

z = [(a - 1/2) - μ]/σ OR z = [(a + 1/2) - μ]/σ

Now one may use the standard normal table for the numerical values.

An Application: The probability of a defective item coming off a certain assembly line is p = 0.25. A
sample of 400 items is selected from a large lot of these items. What is the probability that 90 or fewer
items are defective?

3. If the mean and standard deviation of a normal distribution are known, it is easy to convert back and
forth from raw scores to percentiles.

4. It has been proven that the underlying distribution is normal if and only if the sample mean is
independent of the sample variance; this property characterizes the normal distribution. Therefore many
effective transformations can be applied to convert almost any shaped distribution into a normal one.

5. The most important reason for popularity of normal distribution is the Central Limit Theorem (CLT). The
distribution of the sample averages of a large number of independent random variables will be
approximately normal regardless of the distributions of the individual random variables. The Central
Limit Theorem is a useful tool when you are dealing with a population with an unknown distribution.
Often, you may analyze the mean (or the sum) of a sample of size n. For example instead of analyzing
the weights of individual items you may analyze the batch of size n, that is, the packages each
containing n items.

6. The sampling distributions for normal populations provide more information than those of any other
distribution. For example, the following standard errors (i.e., having the same unit as the data) are
readily available:

Standard Error of the Median = (π/(2n))^½ S.

Standard Error of the Standard Deviation = S/(2n)^½.

Therefore, the test statistic for the null hypothesis σ = σ0 is Z = (2n)^½ (S - σ0)/σ0.

Standard Error of the Variance = S²[2/(n-1)]^½.

Standard Error of the Interquartile Half-Range (Q) = 1.166Q/n^½.

Standard Error of the Skewness = (6/n)^½.

Standard Error of the Skewness of Sample Mean = Skewness/n^½.

Notice that the skewness in the sampling distribution of the mean rapidly disappears as n gets larger.

Standard Error of the Kurtosis = (24/n)^½ = 2 times the standard error of skewness.

Standard Error of the Correlation (r) = [(1 - r²)/(n-1)]^½.

Moreover,

Quartile deviation ≈ 2S/3, and Mean absolute deviation ≈ 4S/5.

7. The other reason the normal distributions are so important is that the normality condition is required by
almost all kinds of parametric statistical tests. Most statistical tables, such as the T-table (except its
last row), the χ²-table, and the F-tables, require the normality condition of the population. This condition
must be tested before using these tables; otherwise the conclusion might be wrong.
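
The assembly-line application under reason 2 above (p = 0.25, n = 400, at most 90 defective items) can be worked through with the continuity-corrected normal approximation and compared with the exact binomial sum. This is a sketch; the variable names are illustrative, and the upper correction (a + 1/2) is used because the event is "90 or fewer".

```python
import math
from statistics import NormalDist

n, p, a = 400, 0.25, 90
q = 1 - p

mu = n * p                    # np
sigma = math.sqrt(n * p * q)  # square root of npq

# "90 or fewer" on the discrete scale corresponds to x < 90.5 on the continuous scale.
z = ((a + 0.5) - mu) / sigma
approx = NormalDist().cdf(z)

# Exact binomial probability for comparison.
exact = sum(math.comb(n, k) * p**k * q**(n - k) for k in range(a + 1))

print(round(z, 3), round(approx, 4), round(exact, 4))
```

Here np = 100 and nq = 300, both well above 5, so the rule of thumb stated in the text is satisfied.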

What Is A Sampling Distribution?

A sampling distribution describes probabilities associated with a statistic when a random sample is
drawn from the entire population.

The sampling distribution is the density (for a continuous statistic, such as an estimated mean), or
probability function (for discrete statistic, such as an estimated proportion).

Derivation of the sampling distribution is the first step in calculating a confidence interval or
carrying out a hypothesis test for a parameter.

Example: Suppose that x1,...,xn are a simple random sample from a normally distributed
population with expected value μ and known variance σ². Then, the sample mean is normally
distributed with expected value μ and variance σ²/n.
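
The example can be illustrated by simulation: draw many samples, compute each sample mean, and compare the mean and variance of those sample means with μ and σ²/n. The values μ = 10, σ = 2, n = 25, the number of replications, and the seed are arbitrary assumptions made for this sketch.

```python
import random

random.seed(0)                      # fixed seed so the run is reproducible
mu, sigma, n = 10.0, 2.0, 25        # illustrative population and sample size
replications = 20_000

# One sample mean per replication.
sample_means = [
    sum(random.gauss(mu, sigma) for _ in range(n)) / n
    for _ in range(replications)
]

mean_of_means = sum(sample_means) / replications
var_of_means = sum((m - mean_of_means) ** 2 for m in sample_means) / replications

print(round(mean_of_means, 3))                   # close to mu
print(round(var_of_means, 4), sigma ** 2 / n)    # close to sigma^2 / n
```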

The main idea of statistical inference is to take a random sample from the entire particular
population and then to use the information from the sample to make inferences about the particular
population characteristics such as the mean μ (measure of central tendency), the standard
deviation σ (measure of dispersion, spread) or the proportion of units in the population that have a
certain characteristic. Sampling saves money, time, and effort. Additionally, a sample can provide,
in some cases, as much or more accuracy than a corresponding study that would attempt to
investigate an entire population. Careful collection of data from a sample will often provide better
information than a less careful study that tries to look at everything.

Often, one must also study the behavior of the mean of sample values taken from different
specified populations; e.g., for comparison purposes.

Because a sample examines only part of a population, the sample mean will not exactly equal the
corresponding mean of the population μ. Thus, an important consideration for those planning and
interpreting sampling results is the degree to which sample estimates, such as the sample mean,
will agree with the corresponding population characteristic.

In practice, only one sample is usually taken. In some cases a small "pilot sample" is used to test
the data-gathering mechanisms and to get preliminary information for planning the main sampling
scheme. However, for purposes of understanding the degree to which sample means will agree
with the corresponding population mean μ, it is useful to consider what would happen if 10, or 50,
or 100 separate sampling studies, of the same type, were conducted. How consistent would the
results be across these different studies? If we could see that the results from each of the samples
would be nearly the same (and nearly correct!), then we would have confidence in the single
sample that will actually be used. On the other hand, seeing that answers from the repeated
samples were too variable for the needed accuracy would suggest that a different sampling plan
(perhaps with a larger sample size) should be used.

A sampling distribution is used to describe the distribution of outcomes that one would observe
from replication of a particular sampling plan.

Know that estimates computed from one sample will be different from estimates that would be
computed from another sample.

Understand that estimates are expected to differ from the population characteristics (parameters)
that we are trying to estimate, but that the properties of sampling distributions allow us to quantify,
based on probability, how they will differ.

Understand that different statistics have different sampling distributions with distribution shape
depending on (a) the specific statistic, (b) the sample size, and (c) the parent distribution.

Understand the relationship between sample size and the distribution of sample estimates.

Understand that increasing the sample size can reduce the variability in a sampling distribution.

See that in large samples, many sampling distributions can be approximated with a normal
distribution.

Sampling Distribution of the Mean and the Variance for Normal Populations: Given that the
random variable X is distributed normally with mean μ and standard deviation σ, then for a random
sample of size n, with sample mean x̄ and sample standard deviation S:

The sampling distribution of [x̄ - μ] × n^½ ÷ σ is the standard normal distribution.

The sampling distribution of [x̄ - μ] × n^½ ÷ S is a T-distribution with parameter d.f. = n-1.

The sampling distribution of [S²(n-1) ÷ σ²] is a χ²-distribution with parameter d.f. = n-1.

For two independent samples from populations with equal variances, the sampling distribution of
[S1² / S2²] is an F-distribution with parameters d.f.1 = n1-1 and d.f.2 = n2-1.

What Is The Central Limit Theorem?

The central limit theorem (CLT) is a "limit" that is "central" to statistical practice. For practical
purposes, the main idea of the CLT is that the average (center of data) of a sample of observations
drawn from some population is approximately distributed as a normal distribution if certain
conditions are met. In theoretical statistics there are several versions of the central limit theorem
depending on how these conditions are specified. These are concerned with the types of
conditions made about the distribution of the parent population (population from which the sample
is drawn) and the actual sampling procedure.

One of the simplest versions of the central limit theorem stated by many textbooks is: if we take a
random sample of size (n) from the entire population, then, the sample mean which is a random
variable defined by:

Σ x_i / n,

has a histogram which converges to a normal distribution shape if n is large enough. Equivalently,
the sample mean distribution approaches a normal distribution as the sample size increases.

Some students have difficulty reconciling their own understanding of the central limit theorem
with some of the textbook statements. Some textbooks do not emphasize the requirement of
independent, random samples of fixed size n (say, more than 30).

The shape of the sampling distribution of the mean becomes increasingly normal as the sample
size n becomes larger. The increasing sample size is what causes the distribution to become
increasingly normal, and the independence condition provides the √n contraction of the standard
deviation.

For the CLT applied to proportion data, such as binary 0, 1 data, the sampling distribution, while
becoming increasingly "bell-shaped", remains confined to the domain [0,1]. This domain
represents a dramatic difference from a normal distribution, which has an unbounded domain.
However, as n increases without bound, the "width" of the bell becomes very small so that the CLT
"still works".

In applications of the central limit theorem to practical problems in statistical inference, however,
we are more interested in how closely the approximate distribution of the sample mean follows a
normal distribution for finite sample size, than in the limiting distribution itself. Sufficiently close
agreement with a normal distribution allows us to use normal theory for making inferences about
population parameters (such as the mean μ) using the sample mean, irrespective of the actual form
of the parent population.

It can be shown that, if the parent population has mean µ and a finite standard deviation σ, then
the distribution of the sample mean has the same mean µ but a smaller standard deviation, which is σ
divided by √n.

You know by now that, whatever the parent population is, the standardized variable Z = (X − µ)/σ
will have a distribution with mean 0 and standard deviation 1 under random sampling.
Moreover, if the parent population is normal, then Z is distributed exactly as the standard normal.
The central limit theorem states the remarkable result that, even when the parent population is
non-normal, the standardized variable is approximately normal if the sample size is large enough.
It is generally not possible to state conditions under which the approximation given by the central
limit theorem works and what sample sizes are needed before the approximation becomes good
enough. As a general guideline, statisticians have used the prescription that, if the parent
distribution is symmetric and relatively short-tailed, then the sample mean more closely
approximates normality for smaller samples than if the parent population is skewed or long-tailed.

Under certain conditions, in large samples, the sampling distribution of the sample mean can be
approximated by a normal distribution. The sample size needed for the approximation to be
adequate depends strongly on the shape of the parent distribution. Symmetry (or lack thereof) is
particularly important.

For a symmetric parent distribution, even if very different from the shape of a normal distribution,
an adequate approximation can be obtained with small samples (e.g., 15 or more for the uniform
distribution). For symmetric, short-tailed parent distributions, the sample mean more closely
approximates normality for smaller sample sizes than if the parent population is skewed and long-
tailed. In some extreme cases (e.g. binomial) sample sizes far exceeding the typical guidelines
(e.g., over 30) are needed for an adequate approximation. For some distributions without first and
second moments (e.g., the Cauchy distribution), the central limit theorem does not
hold.

For some distributions, extremely large (impractical) samples would be required to approach a
normal distribution. In manufacturing, for example, when defects occur at a rate of less than 100
parts per million, using a Beta distribution yields an honest Confidence Interval (CI) for the total
defects in the population.

A question for you: Roll two perfectly balanced dice one time and the result will sum to an integer
between 2 and 12. Which sum is most likely? (Hint: what does the CLT imply?)

An Illustration of CLT

Sampling Distribution of the Sample Means: Instead of working with individual scores,
statisticians often work with means. Several samples are taken, the mean is
computed for each sample, and then the means are used as the data, rather than the individual
scores. The distribution of these means is the sampling distribution of the sample mean.

The central limit theorem explains why many distributions tend to be close to the normal
distribution. The key ingredient is that the random variable being observed should be the sum or
mean of many independent identically distributed random variables.

We can draw the probability distribution of the following random variables:

Sampling Distribution of Values (X): Consider the case where a single, fair die is rolled.

Here are the values that are possible and their probabilities.

X Values 1 2 3 4 5 6
Probability 1/6 1/6 1/6 1/6 1/6 1/6

Here are the mean and variance of this random variable X:

Mean = µ = E[X] = Σ [ x × p(x) ] = 3.5

Variance = σ² = E[X²] − µ² = Σ [ x² × p(x) ] − µ² = 2.92

Sampling Distribution of Samples' Mean (Xbar): Consider the case where two fair dice are
rolled instead of one.

Here are the sums that are possible and their probabilities.

Sum 2 3 4 5 6 7 8 9 10 11 12
Prob 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36

But, we are not interested in the sum of the dice, we are interested in the sample mean. We find
the sample mean by dividing the sum by the sample size.

Xbar 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0
Prob 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36

Now let us compute the mean and variance of this new random variable Xbar:

Mean (Xbar) = µ_Xbar = E[Xbar] = Σ [ xbar × p(xbar) ] = 3.5

Variance (Xbar) = σ²_Xbar = E[Xbar²] − (µ_Xbar)² = Σ [ xbar² × p(xbar) ] − (µ_Xbar)² = 1.46
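
The two sets of figures above can be checked by direct enumeration. The following is a minimal sketch, not part of the original course material; the helper name mean_and_variance is an assumption made for illustration. It uses exact fractions so that no rounding enters until the final print.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

def mean_and_variance(pmf):
    """Return (mean, variance) of a discrete distribution given as {value: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    second_moment = sum(x * x * p for x, p in pmf.items())
    return mean, second_moment - mean ** 2

# One fair die: each face has probability 1/6.
pmf_x = {Fraction(x): Fraction(1, 6) for x in faces}

# Mean of two fair dice: enumerate all 36 equally likely ordered pairs.
pmf_xbar = {}
for a, b in product(faces, repeat=2):
    xbar = Fraction(a + b, 2)
    pmf_xbar[xbar] = pmf_xbar.get(xbar, Fraction(0)) + Fraction(1, 36)

mean_x, var_x = mean_and_variance(pmf_x)
mean_xbar, var_xbar = mean_and_variance(pmf_xbar)

print(float(mean_x), round(float(var_x), 2))        # 3.5 2.92
print(float(mean_xbar), round(float(var_xbar), 2))  # 3.5 1.46
```

The variance of Xbar comes out as exactly half the variance of X, which is the population variance divided by the sample size n = 2.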

Another way to think of sampling distributions is as probability distributions of random variables.

If we take repeated samples of the same size from a population, and then plot the means
of all those samples, the distribution looks closer to a bell shape than the distribution of the
individual values. We call distributions of sample statistics Sampling Distributions.

The reason for this is that you can get the middle values in many more different ways than the
extremes.

Example: When throwing two dice: 1+6 = 2+5 = 3+4 = 7, but only 1+1 = 2 and only 6+6 = 12. That
is: even though you get any of the six numbers equally likely when throwing one die, the extremes
are less probable than middle values in sums of several dice.

To see how the central limit theorem works, look at the distribution of scores from increasing
numbers of dice throws, as below.

In this illustration, the number on the top of each rolled die is an independent random event.
Independent because the result of each die roll does not depend on the result of any previous roll,
and random because, assuming that the die is "fair", the value on the top of the rolled die cannot
be predicted in advance. The sum of their results is the total number of dots on the tops of all the
rolled dice. The bar chart illustrates the distribution of the sum. The distribution of each
independent die roll is flat, not bell-shaped. See for yourself. Roll one die a bunch of times and
watch the bar chart evolve. The distribution of the sum of two independent die rolls is triangular.
Try it and see. What about five dice? What about ten?
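
If no dice are at hand, the exact distribution of the sum can be computed instead of simulated. The sketch below is an illustration only (the function names sum_pmf and text_bar_chart are assumptions, not part of the course site). It builds the distribution of the sum one die at a time and prints a crude text bar chart: flat for one die, triangular for two, and visibly bell-shaped for five.

```python
from fractions import Fraction

def sum_pmf(num_dice):
    """Exact distribution of the sum of `num_dice` fair dice, as {sum: probability}."""
    pmf = {0: Fraction(1)}
    for _ in range(num_dice):
        new_pmf = {}
        for total, prob in pmf.items():
            for face in range(1, 7):
                new_pmf[total + face] = new_pmf.get(total + face, Fraction(0)) + prob / 6
        pmf = new_pmf
    return pmf

def text_bar_chart(pmf, width=40):
    """Print one row of '#' characters per possible sum, scaled to the tallest bar."""
    tallest = max(pmf.values())
    for total in sorted(pmf):
        bar = "#" * round(width * pmf[total] / tallest)
        print(f"{total:3d} {bar}")

for n in (1, 2, 5):
    print(f"Sum of {n} fair dice")
    text_bar_chart(sum_pmf(n))
```

Changing the tuple (1, 2, 5) to include 10 answers the last question in the paragraph above.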

The CLT says that, whatever the distribution of the population looks like (provided it has a finite
variance), the sampling distribution of the sample mean will be approximately normal, as long as
your sample size is big enough (a common rule of thumb is about 30). The
distribution will have a mean equal to the population mean and a standard error equal to the
population standard deviation divided by the square root of the sample size.

The measure of spread that we use for Sampling Distributions is the standard error (SE). The SE
will always be smaller than the population standard deviation since the sampling distribution is one
of sample statistics. Each sample mean will dampen the effect of outliers, bringing the tails of the
sampling distribution in and creating a bigger "lump" in the middle, centered on the population
mean. You can interpret the SE for sampling distributions in the same way as the standard
deviation for populations.

Limiting Behavior of the Sample Mean:
An Experimental Demonstration.

Properties of the Sampling Distribution of the Sample Means:

When all of the possible sample means are computed, then the following properties are true:

The mean of the sample means will be the mean of the population
The variance of the sample means will be the variance of the population divided by the
sample size.
The standard deviation of the sample means (known as the standard error of the mean) will
be smaller than the population standard deviation and will be equal to the standard deviation of
the population divided by the square root of the sample size.
If the population has a normal distribution, then the sample means will have a normal
distribution.
If the population is not normally distributed, but the sample size is sufficiently large, then the
sample means will have an approximately normal distribution. Some books define sufficiently
large as at least 30 and others as at least 31.

What Is "Degrees of Freedom"?

Recall that in estimating the population's variance, we used (n-1) rather than n, in the denominator.
The factor (n-1) is called "degrees of freedom."

Estimation of the Population Variance: Variance in a population is defined as the average of
squared deviations from the population mean. If we draw a random sample of n cases from a
population where the mean is known, we can estimate the population variance in an intuitive way.
We sum the squared deviations of scores from the population mean and divide this sum by n. This
estimate is based on n independent pieces of information, and we have n degrees of freedom. Each
of the n observations, including the last one, is unconstrained ('free' to vary).

When we do not know the population's mean, we can still estimate the population variance; but,
now we compute deviations around the sample mean. This introduces an important constraint
because the sum of the deviations around the sample mean is known to be zero. If we know the
value for the first (n-1) deviations, the last one is known. There are only n-1 independent pieces of
information in this estimate of variance.

If you study a system with n parameters xi, i =1..., n, you can represent it in an n-dimension space.
Any point of this space shall represent a potential state of your system. If your n parameters could
vary independently, then your system would be fully described in a n-dimension hyper-volume (for
n over 3). Now, imagine you have one constraint between the parameters (an equation with your n
parameters), then your system would be described by a (n-1)-dimension hyper-surface (for n over
3). For example, in three dimensional space, a linear relationship means a plane which is 2-
dimensional.

In statistics, your n parameters are your n data. To evaluate the variance, you first need to
estimate the mean µ from the same data. So when you evaluate the variance, you have one constraint
on your system (which is the expression of the mean), and only (n-1) degrees of freedom remain.

Therefore, we divide the sum of squared deviations by n-1, rather than by n, when we have sample
data. On average, deviations around the sample mean are smaller than deviations around the
population mean. This is because our sample mean is always in the middle of our sample scores;
in fact, the minimum possible sum of squared deviations for any sample of numbers is around the
mean for that sample of numbers. Thus, if we sum the squared deviations from the sample mean
and divide by n, we have an underestimate of the variance in the population (which is based on
deviations around the population mean).

If we divide the sum of squared deviations by n-1 instead of n, our estimate is a bit larger, and it
can be shown that this adjustment gives us an unbiased estimate of the population variance.
However, for large n, say, over 30, it does not make too much difference if we divide by n, or n-1.
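
The unbiasedness claim can be verified exactly on a toy population. The sketch below is an illustration under stated assumptions: the population is one fair die, and the sample size n = 2 is chosen only so that every possible sample can be listed. It averages both estimators over all 36 equally likely samples.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
n = 2  # hypothetical sample size, kept tiny so every possible sample can be listed

population_variance = Fraction(35, 12)  # variance of one fair die roll

def sum_sq_dev(sample):
    """Sum of squared deviations of the sample values around the sample mean."""
    mean = Fraction(sum(sample), len(sample))
    return sum((x - mean) ** 2 for x in sample)

samples = list(product(faces, repeat=n))  # all 36 equally likely samples

# Average each estimator over every possible sample (its expected value).
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(avg_div_n, avg_div_n_minus_1, population_variance)  # 35/24 35/12 35/12
```

Dividing by n underestimates the population variance on average, while dividing by n-1 recovers it exactly.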

Degrees of Freedom in ANOVA: You will see the key phrase "degrees of freedom" also appearing in
the Analysis of Variance (ANOVA) tables. If I tell you about 4 numbers, but don't say what they are,
the average could be anything. I have 4 degrees of freedom in the data set. If I tell you 3 of those
numbers, and the average, you can work out the fourth number. The data set, given the average, has
3 degrees of freedom. If I tell you the average and the standard deviation of the numbers, I have
given you 2 pieces of information, and reduced the degrees of freedom from 4 to 2. You only need
to know 2 of the numbers' values to work out the other 2.

In an ANOVA table, degree of freedom (df) is the divisor in (Sum of Squared deviations)/df which
will result in an unbiased estimate of the variance of a population.

In general, a degree of freedom d.f. = N - k, where N is the sample size, and k is a small number,
equal to the number of "constraints", the number of "bits of information" already "used up". As we
will see in the ANOVA section, degree of freedom is an additive quantity; total amounts of it can be
"partitioned" into various components. For example, suppose we have a sample of size 13 and
calculate its mean, and then the deviations from the mean; only 12 of the deviations are free to
vary. Once one has found 12 of the deviations, the thirteenth one is determined.

In bivariate correlation or regression situations, k = 2. The calculation of the sample means of each
variable "uses up" two bits of information, leaving N - 2 independent bits of information.

In a one-way analysis of variance (ANOVA) with g groups, there are three ways of using the data
to estimate the population variance. If all the data are pooled, the conventional SST/(n-1) would
provide an estimate of the population variance.

If the treatment groups are considered separately, the sample means can also be considered as
estimates of the population mean, and thus SSb/(g - 1) can be used as an estimate. The remaining
("within-group","error") variance can be estimated from SSw/(n - g). This example demonstrates
the partitioning of d.f.:
d.f. total = n - 1 = d.f.(between) + d.f.(within) = (g - 1) + (n - g).
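
The partition of both the sums of squares and the degrees of freedom can be checked numerically. The data below are hypothetical, invented only for this sketch (g = 3 groups, n = 9 observations); the variable names sst, ssb, and ssw mirror SST, SSb, and SSw in the text.

```python
from math import isclose

# Hypothetical data: g = 3 treatment groups, n = 9 observations in total.
groups = [[4.0, 5.0, 6.0], [6.0, 7.0, 8.0], [9.0, 10.0, 11.0]]

all_values = [x for grp in groups for x in grp]
n, g = len(all_values), len(groups)
grand_mean = sum(all_values) / n

# Total, between-group, and within-group sums of squared deviations.
sst = sum((x - grand_mean) ** 2 for x in all_values)
ssb = sum(len(grp) * (sum(grp) / len(grp) - grand_mean) ** 2 for grp in groups)
ssw = sum((x - sum(grp) / len(grp)) ** 2 for grp in groups for x in grp)

df_total, df_between, df_within = n - 1, g - 1, n - g

print(isclose(sst, ssb + ssw))               # True
print(df_total == df_between + df_within)    # True
print(sst / df_total, ssb / df_between, ssw / df_within)  # the three variance estimates
```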

Therefore, the simple 'working definition' of d.f. is 'sample size minus the number of estimated
parameters'. A more complete answer would have to explain why there are situations in which the
degrees of freedom is not an integer. Having said all this, the best explanation is mathematical:
we use d.f. to obtain an unbiased estimate.

In summary, the concept of degrees of freedom is used for the following two different purposes:

Parameter(s) of certain distributions, such as F and t-distribution, are called degrees of freedom.

Most importantly, the degrees of freedom are used to obtain unbiased estimates for the population
parameters.

Applications of and Conditions for Using Statistical Tables

A problem with almost all statistical textbooks is that they do not provide the information needed
to understand the connections between statistical tables. Students often ask: Why are the T-table
values with d.f. = 1 so much larger than those for other d.f. values? Some tables are limited; what
should I do when the sample size is too large? How can I become familiar with the tables and their
differences? Is there any type of integration among tables? Are there any connections between tests
of hypotheses and confidence intervals under different scenarios, for example testing with respect
to one, two, or more than two populations? And so on.

The following Figure demonstrates useful relationships among common statistical tables:

Relationships Among Common Statistical
Tables with Their Applications

Some widely used applications of the popular statistical tables can be categorized as follows:

T - Table:

1. Single Population µ Test.
2. Two Independent Populations Means Test.
3. The Before-and-After µ's Test.
4. Tests Concerning Regression Coefficients.
5. Test Concerning Correlation.

Conditions for using this table: Test for randomness of the data is needed before using this
table. Test for normality condition of the population distribution is also needed if the sample size is
small, or it may not be possible to invoke the central limit theorem.

Z - Table:

1. Test for Randomness.
2. Tests concerning µ for one population or two populations based on their large-size, random
sample(s) (say over 30) to invoke the central limit theorem. This includes tests concerning
proportions, with large-size, random samples (say n over 30) to invoke distribution
convergence results.
3. To Compare Two Correlation Coefficients.

Notes: As you know by now, in tests of hypotheses concerning µ, and construction of confidence
intervals for it, we start with σ known, since the critical value (and the p-value) of the Z-Table
distribution can be used. Considering the more realistic situations, when we don't know σ, the T-
Table is used. In both cases, we need to verify the normality condition of the population's
distribution; however, if the sample size n is very large, we can in fact switch back to Z-Table by
virtue of the central limit theorem. For perfectly normal populations, the t-distribution corrects for
any errors introduced by estimating σ with S when doing inference.

Note also that, in hypothesis testing concerning the parameter of binomial and Poisson
distributions for large sample sizes, the standard deviation is known under the null hypotheses.
That's why you may use the normal approximations for both of these distributions.

Conditions for using this table: Test for randomness of the data is needed before using this
table. Test for normality condition of the population distribution is also needed if the sample size is
small, or it may not be possible to invoke the Central Limit Theorem.

Chi-square - Table:

1. Test for Cross-table Relationship.
2. Identical-Populations Test for Crosstable Data.
3. Test for Equality of Several Population Proportions.
4. Test for Equality of Several Population Medians.
5. Goodness-of-Fit Test for Probability Mass Functions.
6. Compatibility of Multi-Counts.
7. Correlation-Coefficient Testing.
8. Necessary Conditions in Applying the Above Tests.
9. Testing the Variance: Is the Quality that Good?
10. Testing the Equality of Multi-Variances.

Conditions for using this table: The necessary conditions for using this table for all the above
tests, except for the last one, can be found at Conditions for the Chi-square Based Tests. The last
application requires normality (condition) of the population distribution.

F - Table:

1. Multi-Means Comparisons: Analysis of Variance (ANOVA).
2. Tests Concerning Two Variances.
3. Overall Assessment of Regression Models.

Conditions for using this table: Tests for randomness of the data and normality (condition) of the
populations are needed before using this table for ANOVA. Same conditions must be satisfied for
the residuals in regression analysis.

The following chart summarizes application of statistical tables with respect to tests of hypotheses
and construction of confidence intervals for the mean µ and variance σ² in one population or the
comparison of two or more populations.

Selection of an Appropriate Statistical Table

You may like using Online Statistical Computation in performing most of these tests. The P-values
for the Popular Distributions Web site provides P-values useful in major statistical testing. The
results are more accurate than those that can be obtained (by interpolation) from the statistical
tables in your textbook.

Further Reading:
Balakrishnan N., and V. Nevzorov, A Primer on Statistical Distributions, Wiley, 2003.
Evans M., N. Hastings, and B. Peacock, Statistical Distributions, Wiley, 2000.
Kanji G., 100 Statistical Tests, Sage Publisher, 1995.

Numerical Examples for Statistical Tables

The presentation of the statistical tables is not universal. Some statistical textbook authors
prefer to give tabular values of the right-tail probabilities, while others prefer left-tail
probabilities. Even within each of these groups you will find differences in how each table is
presented, never in a unified format. This lack of uniformity often confuses students
while learning statistics.

The following presents some numerical examples of common statistical tables with some
applications. You may like using The P-values for the Popular Distributions JavaScript.

Binomial Probability

X ~ B(n, p), read: the random variable X has a binomial distribution with parameters n (number of
trials) and p (probability of a success).

Example: Find the probability of at most k = 3 successes from B(n = 7, p = 0.4). Using any Binomial
table, one should get:
P[k ≤ 3] = 0.7102.
Using The P-values for the Popular Distributions JavaScript, one gets:

P[k ≤ 3] = 1 – P[k ≥ 4] = 1 – 0.2898 = 0.7102.

Questions for you: Which of the following two events is more likely to happen: getting exactly 6
heads in tossing a fair coin (i.e., p = 1/2) n = 10 times, or in tossing it n = 20 times? Why?

Application: A traveling salesman has found that the probability of a sale on a single contact is
0.02. If the salesman contacts 200 prospects, find the probability that he will make at least one sale.

P[at least one sale] = 1 – P[no sale] = 1 – (1 – 0.02)^200 = 1 – (0.98)^200 = 98%
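
Both binomial calculations above can be reproduced without a table. This is a minimal sketch using only the Python standard library; the function names binom_pmf and binom_cdf are assumptions made for illustration.

```python
from math import comb

def binom_pmf(r, n, p):
    """P(exactly r successes in n independent trials with success probability p)."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

def binom_cdf(k, n, p):
    """P(at most k successes in n trials)."""
    return sum(binom_pmf(r, n, p) for r in range(k + 1))

p_at_most_3 = binom_cdf(3, 7, 0.4)
p_at_least_one_sale = 1 - binom_pmf(0, 200, 0.02)

print(round(p_at_most_3, 4))          # 0.7102
print(round(p_at_least_one_sale, 2))  # 0.98
```

The same two functions can be used to explore the coin-tossing question by comparing binom_pmf(6, 10, 0.5) with binom_pmf(6, 20, 0.5).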

Normal Density Function

X ~ N(0, 1), read: the random variable X is distributed Normally with mean 0 and variance 1.

A Fact: If X ~ N(µ, σ), then

Z = (X − µ) / σ ~ N(0, 1)

Example: Let X ~ N(1, 2), that is, µ = 1 and σ = 2; compute P(X ≤ 5.21)

P[(X − 1) / 2 ≤ (5.21 − 1) / 2] = P(Z ≤ 2.105) ≈ P(Z ≤ 2.11) = .4826 + .5 = .9826
Notice that P(Z ≤ 0) = .5

Similarly, P(X ≥ 2.1) = P(Z ≥ (2.1 − 1) / 2) = P(Z ≥ .55) = 0.5 − .2088 = .2912
Using The P-values for the Popular Distributions JavaScript, the two-tailed p-value (2p-value) is:

P[|Z| ≥ .55] = 0.582.
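
These table lookups can be cross-checked with the NormalDist class from the Python standard library. The sketch assumes, as the worked example does, that the second parameter of N(1, 2) is the standard deviation. Because the code does not round z to two decimals, the first probability differs slightly in the fourth decimal from the table-based .9826.

```python
from statistics import NormalDist

x = NormalDist(mu=1, sigma=2)  # second parameter is the standard deviation

p_left = x.cdf(5.21)        # P(X <= 5.21)
p_right = 1 - x.cdf(2.1)    # P(X >= 2.1)
two_tailed = 2 * p_right    # P(|Z| >= 0.55)

print(round(p_left, 4), round(p_right, 4), round(two_tailed, 3))
```

The inverse lookup needed for the last question below (finding a for a given right-tail probability) is available as x.inv_cdf.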

Questions for you: Compute P(X ≥ 3), P(1 ≤ X ≤ 4), P(X ≥ −1), and find the value of a such that
P(X ≥ a) = 0.4515

Applications:

1. Testing hypotheses on the population's mean, with a known variance, at a given significance
level α.

H0: µ = µ0 Ha: µ ≠ µ0

A Fact: Given X ~ N(µ, σ) and having a random realization of size n: x1, x2, ..., xn, then

Z = [xbar_n − µ] / (σ / √n) ~ N(0, 1).

Notice that in most cases the standard deviation σ is unknown. However, one may use the
sampling estimate S for σ provided the sample size is large enough, say, over 30.

Given n = 4, xbar_4 = 492, test H0: µ = 500 at significance level α = 0.05 if σ = 16.
The Z-statistic is Z = [492 − 500] / [16 / √4] = −1; however, the tabulated critical Z-value is
Z(.025) = 1.96.
Conclusion: No reason to reject H0, since |Z| = 1 < 1.96.

Question for you: Given the same sampling information, test H0: µ = 505 vs. Ha: µ ≠ 505.

2. Setting a confidence interval on the mean, variance known.

Given xbar_4 = 492, construct a 95% confidence interval for µ given σ = 16

P[xbar − Z(α/2) σ / √n ≤ µ ≤ xbar + Z(α/2) σ / √n] ≥ 1 − α

Plugging in the numerical values, one gets:

P[476.3 ≤ µ ≤ 507.7] ≥ 0.95

Notice the Duality between the test of hypothesis and confidence interval.
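
The duality can be made concrete by computing the test and the interval from the same ingredients. The following is a minimal sketch using the numbers of the worked example (n = 4, xbar = 492, σ = 16, µ0 = 500); the variable names are assumptions chosen for readability.

```python
from math import sqrt
from statistics import NormalDist

n, xbar, sigma, alpha = 4, 492.0, 16.0, 0.05
mu0 = 500.0

standard_error = sigma / sqrt(n)
z_stat = (xbar - mu0) / standard_error
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96

reject_h0 = abs(z_stat) > z_crit
ci_low = xbar - z_crit * standard_error
ci_high = xbar + z_crit * standard_error

print(z_stat, reject_h0)                       # -1.0 False
print(round(ci_low, 1), round(ci_high, 1))     # 476.3 507.7
# Duality: H0 is not rejected exactly when mu0 lies inside the interval.
print(ci_low <= mu0 <= ci_high)                # True
```

Setting mu0 to 505 or alpha to 0.10 covers the two questions posed in this part of the text.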

Question for you: Given the same sampling information, construct a 90% confidence interval for µ.

3. Central Limit Theorem (CLT)

A Fact: If E(X) = µ, Var(X) = σ², then

(xbar − µ) / (σ / √n) ~ N(0, 1), approximately,

for large n, say, n ≥ 30

As a strong result, the CLT implies that if the sample size is large enough, then one may relax the
normality condition whenever dealing with the question of testing or constructing a confidence
interval for the population's mean (µ).

T-Density Function

A Fact: If X ~ N(µ, σ), then

[xbar − µ] / [S / √n] ~ t(n − 1)

Example: Find t such that P(T(11) > t) = .1 => t = 1.363

Using The P-values for the Popular Distributions JavaScript, the two-tailed p-value (2p-value) is:

P[|T| ≥ 1.363] = 0.2.

Question for you: Find t such that P(T(8) > t) = .01

Applications:

1. Testing hypotheses on mean, variance unknown

Given xbar_16 = 12.1 and S² = 2.225, test µ = 12.5 vs µ ≠ 12.5, at α = 0.05 significance level.
The computed statistic is t = −1.07 but the critical value from t-table = 2.131.
Conclusion: There is no reason to reject that µ = 12.5.

Question for you: Given the same sampling information perform the test H0: µ = 11 vs µ ≠ 11 at
α = .01.

2. Construction of confidence interval for µ, variance unknown

Example: Given xbar_16 = 12.1, S² = 2.225 develop a 95% confidence interval for µ

P[xbar − t(α/2, n − 1) S / √n ≤ µ ≤ xbar + t(α/2, n − 1) S / √n] ≥ 1 − α

Therefore P[11.31 ≤ µ ≤ 12.89] ≥ 0.95. Again, notice the Duality between the test of hypothesis and
confidence interval.
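
The t-based test and interval can be reproduced as follows. This sketch deliberately takes the critical value 2.131 from the t-table, as the text does, because the Python standard library has no t-distribution; everything else is computed from the sample information (n = 16, xbar = 12.1, S² = 2.225).

```python
from math import sqrt

n, xbar, s_squared = 16, 12.1, 2.225
mu0 = 12.5
t_crit = 2.131  # tabulated t(0.025, 15), copied from the text rather than computed

standard_error = sqrt(s_squared) / sqrt(n)
t_stat = (xbar - mu0) / standard_error

reject_h0 = abs(t_stat) > t_crit
ci_low = xbar - t_crit * standard_error
ci_high = xbar + t_crit * standard_error

print(round(t_stat, 2), reject_h0)            # -1.07 False
print(round(ci_low, 2), round(ci_high, 2))    # 11.31 12.89
```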

Question for you: Construct a 90% confidence interval for the same problem. Is it wider than the
other one? Why or why not?

Notice that the T-density converges to the standard normal N(0, 1) as sample size gets larger. In
fact the elements in the last row of the t-table are the N(0, 1) critical values.

Chi-square Density Function

A Fact: If X ~ N(µ, σ), then the random variable

(n − 1)S² / σ² ~ χ²(n − 1)

the parameter (n − 1) = ν is called degrees-of-freedom (d.f.).

Example: if d.f. = ν = 15, and the right-tail probability is α = 0.975, find the χ² value. From the
χ² table, we get χ² = 6.26
Using The P-values for the Popular Distributions JavaScript, the p-value is:

P[χ² ≥ 6.26] = 0.975.

Applications:

1. Tests of hypotheses on the variance of a normal population.

Given n = 16 and S² = 2.22 test that σ² = 2.0 at α = .05. The sampling statistic is
χ²₀ = (n − 1)S² / σ² = 16.65; however, from the table, the critical values are
χ²(15, .025) = 27.4884 and χ²(15, .975) = 6.26
Conclusion: There is no reason to reject that σ² = 2.0

2. Interval estimation of the variance of a normal population

P[(n − 1)S² / χ²(ν, α/2) ≤ σ² ≤ (n − 1)S² / χ²(ν, 1 − α/2)] ≥ 1 − α

Example: Given the same sampling information as above construct a 95% confidence interval for σ²
Plugging in the given information, with χ²(15, .025) = 27.4884 and χ²(15, .975) = 6.26, you should get:
P[1.211 ≤ σ² ≤ 5.319] ≥ .95

Again, notice the Duality between the test of hypothesis and confidence interval.
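
A short computation shows the test and the interval side by side. This is a sketch only: the two critical values are the tabulated χ²(15, .025) and χ²(15, .975) figures quoted in the text, hard-coded here because the Python standard library has no chi-square quantile function.

```python
n, s_squared = 16, 2.22
sigma0_squared = 2.0

# Tabulated right-tail critical values for 15 d.f., copied from the text.
chi2_upper = 27.4884   # chi-square(15, .025)
chi2_lower = 6.26      # chi-square(15, .975)

sum_sq = (n - 1) * s_squared
chi2_stat = sum_sq / sigma0_squared

reject_h0 = chi2_stat < chi2_lower or chi2_stat > chi2_upper
ci_low, ci_high = sum_sq / chi2_upper, sum_sq / chi2_lower

print(round(chi2_stat, 2), reject_h0)         # test statistic and decision
print(round(ci_low, 3), round(ci_high, 3))    # 95% interval for the variance
print(ci_low <= sigma0_squared <= ci_high)    # duality check
```

The hypothesized variance is not rejected exactly when it falls inside the interval built from the same two critical values.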

Question for you: Given the same sampling information, should we reject that σ² = 2.0 at α = .1?
Note that χ²(15, .05) = 25.00, and χ²(15, .95) = 7.26.

F-Density Function

A Fact: Consider two independent samples, one from each of two normal populations with variances
σ1² and σ2²; then

(S1² / σ1²) / (S2² / σ2²) ~ F(n1 − 1, n2 − 1)

Example: Find F such that P[F(8, 7) ≥ F] = .05 => The F value is F = 3.73

Notice: By now, you should have noticed that every Statistical Table collected at the end of
your textbook provides the critical values for the right-tail as well as the left-tail probabilities,
except the F-Table, which contains the critical values for the right-tail probabilities only. However,
one might use the following nice property of the F-distribution:

F(n1, n2, 1 − α) = 1 / F(n2, n1, α)

to obtain the critical values for the left-tail probabilities. Here is a numerical example:

F(2, 3, 0.9) = 1 / F(3, 2, 0.1) = 1 / 9.16 = 0.109

You need both tails probabilities for test of hypothesis and construction of confidence interval for
the ratio of two independent populations' variances.

Example: Find F such that P[F(8, 7) ≥ F] = .95. We may not be able to get the critical value from
the table; however, one may utilize the fact that:

F(n1, n2, 1 − α) = 1 / F(n2, n1, α)

Therefore, F = 1 / F(7, 8, .05) = 1 / 3.50 = 0.2857

Using The P-values for the Popular Distributions JavaScript, the p-value is: P[F ≥ 0.2857] = 0.942
(which is exact).

Applications:

1. Testing of hypothesis on the variance of two normal populations.

Example: Given n1 = n2 = 16, S1² = 34.14, and S2² = 47.32, should we reject that σ1² = σ2² at
α = 0.1?
The sampling statistic is F = S1² / S2² = 34.14 / 47.32 = .721, but the critical values are
F(15, 15, .05) = 2.38, and F(15, 15, .95) = 1 / 2.38 = 0.420.
Conclusion: There is no reason to reject.
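
The two-sided variance-ratio test can be sketched in a few lines. The upper critical value 2.38 is the tabulated figure quoted in the text, hard-coded as an assumption because the Python standard library has no F quantile function; the lower critical value is obtained from it with the reciprocal property, which applies directly here because both degrees of freedom are 15.

```python
s1_squared, s2_squared = 34.14, 47.32
f_upper = 2.38           # tabulated F(15, 15, .05), copied from the text
f_lower = 1 / f_upper    # left-tail critical value via the reciprocal property

f_stat = s1_squared / s2_squared
reject_h0 = f_stat < f_lower or f_stat > f_upper

print(round(f_stat, 3), round(f_lower, 3), reject_h0)
```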

Question for you: Given the same sampling information, construct a 90% confidence interval for
the variance ratio: σ1² / σ2²

Binomial Probability Function

An important class of decision problems under uncertainty involves situations for which there are
only two possible random outcomes.

The binomial probability function gives the probability of an exact number of "successes" in n
independent trials, when the probability of success p on a single trial is constant. Each single
trial is called a Bernoulli Trial satisfying the following conditions:

1. Each trial results in one of two possible, mutually exclusive, outcomes. One of the possible
outcomes is denoted (arbitrarily) as a success, and the other is denoted a failure.
2. The probability of a success, denoted by p, remains constant from trial to trial. The probability
of a failure, 1-p, is denoted by q.
3. The trials are independent; that is, the outcome of any particular trial is not affected by the
outcome of any other trial.

The probability of getting r successes in n trials is:

P(r successes in n trials) = nCr × p^r × (1 − p)^(n − r)

= n! / [r!(n − r)!] × [p^r × (1 − p)^(n − r)].

The mean and variance of the random variable r are np and np(1 − p) = npq, respectively, where
q = 1 − p. The skewness and kurtosis are (2q − 1) / √(npq) and (1 − 6pq) / (npq), respectively. From
its skewness, we notice that the distribution is symmetric for p = 1/2 and becomes most skewed as p
approaches 0 or 1.

Its mode is within the interval [(n+1)p − 1, (n+1)p]; therefore, if (n+1)p is not an integer, then
the mode is the integer within the interval. However, if (n+1)p is an integer, then its probability
function has two adjacent modes: (n+1)p − 1 and (n+1)p.

Determination of probabilities for p over 0.5: The binomial tables in some textbooks are limited
to determining the probabilities for values of p up to 0.5. However, these tables can be used for
values of p over 0.5 by recasting the problem in terms of 1 − p and replacing r with n − r: the
probability of obtaining r successes in n trials with success probability p is equal to the
probability of obtaining n − r successes in n trials with success probability 1 − p.

An Application: A large shipment of purchased parts is received at a warehouse, and a sample of
10 parts is checked for quality. The manufacturer's claim is that at most 5% might be defective.
What is the chance that the sample includes one defective?

P(one defective out of ten) = {10! / [(1!)(9!)]} (0.05)^1 (0.95)^9 = 32%.
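
Both the application and the table trick for p over 0.5 can be checked directly. In this sketch the function name binom_pmf and the values n = 10, r = 7, p = 0.8 used for the symmetry check are assumptions chosen only for illustration.

```python
from math import comb

def binom_pmf(r, n, p):
    """P(exactly r successes in n independent trials with success probability p)."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

p_one_defective = binom_pmf(1, 10, 0.05)
print(round(p_one_defective, 2))   # 0.32

# Table lookup for p > 0.5: r successes at p equals n - r successes at 1 - p.
n, r, p = 10, 7, 0.8   # hypothetical values chosen only for illustration
print(abs(binom_pmf(r, n, p) - binom_pmf(n - r, n, 1 - p)) < 1e-12)   # True
```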

Note that the binomial distribution must satisfy the following five requirements: (1) each trial can
have only two outcomes, or its outcomes can be reduced to two categories, which are called pass
and fail; (2) there must be a fixed number of trials; (3) the outcome of each trial must be
independent; (4) the probabilities must remain constant; and (5) the outcome of interest is the
number of successes.

Normal approximation for binomial: All binomial tables are limited in their scope; therefore it is
necessary to use standard normal distribution in computing the binomial probabilities. The
following numerical example illustrates how good the approximation could be. This provides an
indication for real applications when n is beyond the given values in the available binomial tables.

Numerical Example: A sample of 20 items is taken randomly from a manufacturing process with
defective probability p = 0.40. What is the probability of obtaining exactly 5 defectives?

P(5 out of 20) = {20! / [(5!)(15!)]} × (0.40)^5 (0.6)^15 = 7.5%

Since the mean and standard deviation of the distribution are:

µ = np = 8, and σ = √(npq) = 2.19,

respectively, the standardized observations for r = 5, using the continuity factor (which
always enlarges the interval), are:

z1 = [(r − 1/2) − µ] / σ = (4.5 − 8) / 2.19 = −1.60, and

z2 = [(r + 1/2) − µ] / σ = (5.5 − 8) / 2.19 = −1.14.

Therefore, the approximated P(5 out of 20) is P(z being within the interval −1.60, −1.14). Now, by
using the standard normal table, we obtain:

P(5 out of 20) = 0.44520 − 0.37286 = 7.2%
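
The comparison between the exact binomial probability and its normal approximation can be reproduced as follows. This is a minimal sketch using the standard library; unlike the table lookup, it does not round the z-values to two decimals, so the approximation may differ from the hand calculation in the last digit.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, r = 20, 0.40, 5

# Exact binomial probability of exactly r defectives.
exact = comb(n, r) * p ** r * (1 - p) ** (n - r)

# Normal approximation with the continuity correction: P(r - 1/2 < X < r + 1/2).
mu = n * p
sigma = sqrt(n * p * (1 - p))
z = NormalDist()  # standard normal
approx = z.cdf((r + 0.5 - mu) / sigma) - z.cdf((r - 0.5 - mu) / sigma)

print(round(exact, 3), round(approx, 3))   # 0.075 0.072
```

Increasing n while keeping p fixed shows the two numbers drawing closer together.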

Comments: The approximation for binomial distribution is used frequently in quality control,
reliability, survey sampling, and other industrial problems.

Poisson approximation for binomial: Notice that, whenever you use the Poisson approximation to
the binomial distribution with parameters n and p, the goodness of the approximation is
largely determined by how small p is rather than by how large n is.

You might like to use Common Discrete Probability Functions to obtain probability and the
cumulative probability functions.

You might like to use the Exact Confidence Interval Construction and Test of Hypothesis for
Binomial Population, and Binomial Probability Function JavaScript in performing some numerical
experimentation for validating the above assertions for a deeper understanding.

Geometric Distribution

In a sequence of independent and identically distributed Bernoulli (p) trials, the number of trials
required to get the 1st success has a Geometric(p) distribution.

A Typical Geometric Probability Function



If a single event or trial has two possible outcomes, say Xi can be 0 or 1 with P(Xi=1) = p, the
probability of having to observe k trials before the first "one" appears is given by the geometric
distribution.


The probability that the first "one" would appear on the first trial is p.
The probability that the first "one" appears on the second trial is p(1-p), because the first trial had
to have been a zero followed by a one.
By generalizing this procedure, the probability that there will be k-1 failures before the first success
is:

P (X = k) = (1 - p)^(k-1) p

This is the geometric distribution.

A geometric distribution has a mean of 1/p and a variance of (1 - p)/p^2.

Application: A manufacturing process is monitored. As each product exits the process line, it is
tested for defective versus non-defective. On the first defect, the process is stopped for
re-adjustment. The number of products tested up to and including the first defect, X, follows a
Geometric distribution with p = P(product is defective), since the "success" being waited for is a
defect.

The Geometric distribution has the memoryless property. Mathematically, for any non-negative
integer s and positive integer t, this property can be written

P(X = s + t | X > s) = P(X = t)
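A quick numerical check of the memoryless property and of the mean 1/p is sketched below. It conditions on X > s (the first s trials are all failures); the values of p, s, and t are hypothetical and chosen only for illustration.

```python
def geometric_pmf(k: int, p: float) -> float:
    """P(X = k): first success on trial k, for k = 1, 2, ..."""
    return (1 - p) ** (k - 1) * p

def geometric_tail(s: int, p: float) -> float:
    """P(X > s): the first s trials are all failures."""
    return (1 - p) ** s

p = 0.2          # hypothetical success probability, chosen only for illustration
s, t = 3, 4      # hypothetical trial counts

conditional = geometric_pmf(s + t, p) / geometric_tail(s, p)   # P(X = s+t | X > s)
unconditional = geometric_pmf(t, p)                            # P(X = t)
print(conditional, unconditional)

# Mean check against 1/p using a long truncated sum.
mean_estimate = sum(k * geometric_pmf(k, p) for k in range(1, 2000))
print(mean_estimate)
```

The two printed probabilities agree up to floating-point rounding, and the truncated sum is close to 1/p = 5.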

Application: Gives the probability of requiring exactly x Bernoulli trials before the first success
is achieved. Used in quality control, reliability, and other industrial situations.

Example: Determination of the probability of requiring exactly five test firings before the first
success is achieved.

The Geometric distribution is the discrete analogue of the Exponential distribution, which models
the (continuous) time needed to get a success. Like the Geometric distribution, the Exponential
distribution also has the memoryless property.

Mathematically, for any non-negative real numbers s and t, this property can be written

P(X > s + t | X > s ) = P(X > t)

The Exponential distribution is a special case of the Gamma distribution (r = 1). Furthermore, the
sum of r independent and identically distributed Exponential (λ) random variables has a Gamma
distribution with parameters r and λ.

In a Poisson (λ) process, the waiting times between consecutive events are distributed as
Exponential with mean 1/λ.

You might like to use Common Discrete Probability Functions to obtain probability and the
cumulative probability functions.

Negative Binomial Distribution

This is an extension of the geometric distribution, describing the waiting time until r "ones" have
appeared. The probability of the rth "one" appearing on the kth trial is given by the negative
binomial distribution:

P (X = k) = C(k-1, r-1) p^(r-1) (1 - p)^(k-r) p, where C(k-1, r-1) = (k-1)! / [(r-1)! (k-r)!].

In other words, the first part is the binomial probability of r-1 successes in the previous k-1
trials, and the last term is the probability of success on the kth trial.

The following is a Negative Binomial probability function with parameters (r = 6 , k= 30, p = 0.5):



A Negative Binomial Probability Function

A negative binomial distribution has:

mean = r/p and variance = r(1 - p)/p^2, where X counts the trials needed to get the rth success.

Application: Suppose we are at a rifle range with an old gun that misfires 5 out of 6 times. Define
"success" as the event that the gun fires, and let X be the number of failures before the third
success. Then X has a negative binomial distribution with parameters (3, 1/6). The probability that
there are 10 failures before the third success is given by:

P(X = 10) = C(12, 2) (1/6)^3 (5/6)^10 = 5%

Because X here counts failures rather than trials, its expected value and variance are:
E(X) = 3(5/6) / (1/6) = 15, and Var(X) = 3(5/6) / (1/6)^2 = 90.
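The rifle-range numbers can be verified with a few lines of code. This is a sketch for the "number of failures before the rth success" form of the distribution; the function name is made up for this example.

```python
from math import comb

def neg_binomial_failures_pmf(x: int, r: int, p: float) -> float:
    """P(exactly x failures occur before the r-th success)."""
    # The last trial is a success; the r-1 earlier successes fall among x + r - 1 trials.
    return comb(x + r - 1, r - 1) * p**r * (1 - p) ** x

r, p = 3, 1 / 6
prob_10_failures = neg_binomial_failures_pmf(10, r, p)

mean_failures = r * (1 - p) / p
var_failures = r * (1 - p) / p**2
print(f"P(X = 10) = {prob_10_failures:.4f}")
print(f"E(X) = {mean_failures:.1f}, Var(X) = {var_failures:.1f}")
```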

In a sequence of independent and identically distributed Bernoulli (p) trials, the number of trials
required to get the rth success has a Negative Binomial (r,p) distribution.

Example: The number of oil wells that must be drilled to get r productive wells.

Relationships to Other Distributions: A Negative Binomial (r, p) random variable can be thought
of as the sum of r independent and identically distributed Geometric(p) random variables. The
Geometric (p) is a special case of the Negative Binomial with r=1.

Application: Gives probability similar to Poisson distribution when events do not occur at a
constant rate and occurrence rate is a random variable that follows a gamma distribution.

Example: Distribution of number of cavities for a group of dental patients.

Comments: Generalization of the Pascal distribution to the case when r is not an integer. Many
authors do not distinguish between Pascal and negative binomial distributions.

You might like to use Common Discrete Probability Functions to obtain probability and the
cumulative probability functions.

Hypergeometric Distribution

The Hypergeometric (x; n, M, N) Distribution applies when we are sampling n items without
replacement from a population of M successes and N-M failures.

The hypergeometric distribution arises when a random selection (without repetition) is made
among objects of two distinct types. Typical examples:

Choose a team of 8 from a group of 10 men and 7 women.


Choose a committee of five from the legislature consisting of 52 Democrats and 48 Republicans.


The Concept of Hypergeometric Events

The above Venn diagram depicts choosing a random subset of size n from N items, of which M
belong to a particular category; the distribution gives the probability that x of the selected items
belong to that category.

The Binomial distribution looks at n trials "with replacement." The hypergeometric distribution is for
the case "without replacement."

Here p changes from one Bernoulli trial to the next. Specifically, we have a population of size N
with M out of the N members being "Successes" and the remaining (N-M) being "Failures." We
choose a random sample of n (equivalent to taking out n members in succession without
replacement).

The probability that X = x is given by:

P (X = x) = C(M, x) C(N-M, n-x) / C(N, n)

for all integers x between Max [0, n - (N - M)] and Min [n, M], where C(a, b) = a! / [b! (a-b)!] is
the number of ways to choose b items from a.

The expected value and variance of X are given by:

nM / N and nM(N-M)(N-n) / [N^2 (N-1)],

respectively.

In other words, there is a total of N chips in the urn, M of them red and N - M white, and n chips
are drawn at random without replacement. Out of these n chips, x chips are red, and the remainder
(n - x) are white. So, the formula is the number of ways to choose x chips from the M red chips in
the urn multiplied by the number of ways to choose n - x chips from the N - M white chips. This is
divided by the size of the sample space, i.e., the number of ways to select n chips from the total
of N chips in the urn.

Application: Gives probability of picking exactly x good units in a sample of n units from a
population of N units when there are k bad units in the population. Used in quality control and
related applications.

Example: Given a lot with 21 good units and four defective. What is the probability that a sample
of five will yield not more than one defective?
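The question in this example can be answered directly from the hypergeometric probability function. The sketch below treats the four defective units as the "marked" category; the function and variable names are illustrative.

```python
from math import comb

def hypergeom_pmf(x: int, n: int, M: int, N: int) -> float:
    """P(exactly x marked items in a sample of n drawn without replacement
    from N items, M of which are marked)."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

N = 21 + 4   # lot size: 21 good units plus 4 defective
M = 4        # "marked" items are the defectives
n = 5        # sample size

p_at_most_one = sum(hypergeom_pmf(x, n, M, N) for x in (0, 1))
print(f"P(not more than one defective) = {p_at_most_one:.4f}")
```

The probability comes out at roughly 83%.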

Example: The number of defective items in a sample of size n from a box containing N items of
which k are defective.

Application: A manufacturing process is monitored. As each product exits the process line, it is
tested for defective versus non-defective. On the fifth defect, the process is stopped for
re-adjustment. The number of products tested, X, follows a Negative Binomial distribution with
r = 5 and p = P(product is defective).

Relationships to Other Distributions: The Hypergeometric (N, k, n) may be approximated by a
Binomial (n, p = k/N) if N is very large relative to n. In this circumstance, replacement and
non-replacement tend to become indistinguishable.

By extension, since the Binomial can be approximated by the Poisson, we can also approximate
the Hypergeometric by a Poisson if the Binomial approximation is appropriate and n is reasonably
large with k/N small.

You might like to use Common Discrete Probability Functions to obtain probability and the
cumulative probability functions.


Exponential Density Function

An important class of decision problems under uncertainty concerns the random durations between
events. For example, one may ask for the probability that the length of time between breakdowns of
a machine exceeds a certain time interval, such as the copying machine in your office not breaking
down during this week.

Exponential distribution gives the distribution of time between independent events occurring at a
constant rate. Its density function is:

f(t) = λ exp(-λt), for t ≥ 0,

where λ is the average number of events per unit of time, which is a positive number.

The mean and the variance of the random variable t (time between events) are 1/λ and 1/λ^2,
respectively.
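As an illustration of how the density is used, the sketch below computes the probability that no breakdown occurs within one week. The rate of 0.5 breakdowns per week is a made-up value, not taken from the text.

```python
from math import exp

def exponential_survival(t: float, lam: float) -> float:
    """P(T > t) for an Exponential(lam) waiting time: the integral of lam*exp(-lam*u) from t on."""
    return exp(-lam * t)

lam = 0.5   # hypothetical rate: on average one breakdown every two weeks
p_no_breakdown_this_week = exponential_survival(1.0, lam)

# Memoryless property: P(T > s + t | T > s) equals P(T > t).
s, t = 3.0, 1.0
conditional = exponential_survival(s + t, lam) / exponential_survival(s, lam)

print(p_no_breakdown_this_week, conditional)
```

Both printed values coincide, which is the continuous counterpart of the memoryless property of the Geometric distribution.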

Applications include probabilistic assessment of the time between arrivals of patients to the
emergency room of a hospital, and time between arrivals of ships at a particular port.

Comments: It is a special case of the Gamma distribution.

You might like to use Exponential Density to perform your computations, and Lilliefors Test for
Exponentiality to perform the goodness-of-fit test.

F-Density Function

The F-distribution is the distribution of the ratio of two independent sampling (of size of n1, and n2,
respectively) estimates of variance from standard normal distributions. It is also formed by the ratio
of two independent chi-square variables divided by their respective independent degrees of
freedom.

Its main applications are in testing equality of two independent population variances based on two
independent random samples, ANOVA, and regression analysis.

By now, you should have noticed that every Statistical Table collected at the end of your
textbook provides the critical values for the right-tail as well as the left-tail probabilities,
except the F-Table, which contains the critical values for the right-tail probabilities only.
However, one may use the following property of the F-distribution:

F(n1, n2, 1-α) = 1 / F(n2, n1, α)

to obtain the critical values for the left-tail probabilities, where the last argument is the
right-tail probability. Here is a numerical example:

F(2, 3, 0.9) = 1 / F(3, 2, 0.1) = 1 / 9.16 = 0.109

You need both tail probabilities for tests of hypothesis and construction of confidence intervals
for the ratio of two independent populations' variances.

You might like to use F-Density Function to obtain its P-values.

Chi-square Density Function

The probability density curve of a Chi-square distribution is an asymmetric curve stretching over
the positive side of the line and having a long right tail. The form of the curve depends on the value
of a parameter known as the degree of freedom (d.f.).

The expected value of a Chi-square statistic is its d.f., its variance is twice its d.f., and its
mode is equal to (d.f. - 2) when the d.f. is at least 2.

Chi-square Distribution relation to Normal Distribution: The Chi-square distribution is related
to the sampling distribution of the variance when the sample is from a normal distribution. The
sample variance is built from a sum of squares of standard normal variables N (0, 1), and the square
of an N (0, 1) random variable is a Chi-square with 1 d.f.

Notice that the Chi-square is related to the F-statistic as follows: F = Chi-square/d.f.1, where F
has d.f.1 = d.f. of the Chi-square table, and d.f.2 is the largest available in the F-table (i.e.,
d.f.2 tends to infinity).

Similar to Normal random variables, the Chi-square has the additive property. For example, for two
independent Chi-square variables, their sum is also Chi-square with degrees of freedom equal to
the sum of the individual d.f. Thus, for a sample of size n from N (0, 1), (n-1) times the unbiased
sample variance can be written as a sum of n-1 independent Chi-squares, each with d.f. = 1, and is
hence Chi-square with d.f. = n-1.

The most widely used applications of Chi-square distribution are:

The Chi-square Test for Association, which is a non-parametric test; therefore, it can be used for
nominal data too. It is a test of statistical significance widely used in bivariate tabular
association analysis. Typically, the hypothesis is whether or not two populations are different in
some characteristic or aspect of their behavior based on two random samples. This test procedure is
also known as the Pearson Chi-square test.

The Chi-square Goodness-of-Fit Test is used to test if an observed distribution conforms to any
particular distribution. Calculation of this goodness-of-fit test is by comparison of observed data
with data expected based on a particular distribution.

You might like to use Chi-square Density to find its P-values.

Multinomial Probability Function

A multinomial random variable is an extended binomial. The difference is that in a multinomial
case, there are more than two possible outcomes. There are a fixed number of independent trials,
with a given probability for each outcome.

The Expected Value (i.e., averages):

Expected Value = μ = Σ (Xi × Pi), the sum is over all i's.

Expected value is another name for the mean and (arithmetic) average.

It is an important statistic because your customers want to know what to "expect" from your
product/service, or, as a purchaser of "raw material" for your product/service, you need to know
what you are buying, in other words, what you expect to get.

To read off the meaning of the above formula, consider the computation of the average of the
following data:

2, 3, 2, 2, 0, 3

The average is obtained by summing up all the numbers and dividing by their count:

(2 + 3 + 2 + 2 + 0 + 3) / 6

This can be grouped and rewritten as:

[ 2(3) + 3(2) + 0(1)] / 6 = 2(3/6) + 3(2/6) + 0(1/6)

which is the sum of each distinct observation times its probability. Right?

Expected value is known also as the First Moment, borrowed from Physics, because it is the
point of balance where the data and the probabilities are the distances and the weights,
respectively.

The Variance is:

Var(X) = E[(X - μ)^2] = E[X^2 - 2Xμ + μ^2].

We simplify this using the above rules. First, because the expectation of a sum equals the
sum of expectations:

Var(X) = E[X^2] - E[2Xμ] + E[μ^2].

Then, because constants may be taken out of an expectation:

Var(X) = E[X^2] - 2μE[X] + μ^2 E[1] = E[X^2] - 2μ^2 + μ^2 = E[X^2] - μ^2.

Finally, notice that E[X^2] can be written as E[g(X)] where g(X) = X^2. From the final fact about
expectations, we can calculate this:

E[X^2] = Σ x^2 P(X = x), for all x

Therefore, the Variance is:

Variance = σ^2 = Σ [Xi^2 × Pi] - μ^2, the sum is over all i's.

For example, suppose we toss two fair coins and we are interested in determining the
expected value and the variance of the outcome X, the number of heads. The expected value is
μ = 0(1/4) + 1(1/2) + 2(1/4) = 1, and

E[X^2] = (0)^2 P(X=0) + (1)^2 P(X=1) + (2)^2 P(X=2) = 0(1/4) + 1(1/2) + 4(1/4) = 3/2.

From this, we calculate the variance:

Var(X) = E[X^2] - μ^2 = 3/2 - (1)^2 = 1/2.
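The two formulas (mean as a probability-weighted sum, variance as the second moment minus the squared mean) translate directly into code. The sketch below uses exact fractions and takes the outcome to be the number of heads in two tosses, which matches the probabilities 1/4, 1/2, 1/4 used above; the function name is illustrative.

```python
from fractions import Fraction

def mean_and_variance(distribution: dict) -> tuple:
    """Mean and variance of a discrete random variable given {value: probability}."""
    mu = sum(x * p for x, p in distribution.items())
    second_moment = sum(x**2 * p for x, p in distribution.items())
    return mu, second_moment - mu**2

# Number of heads in two tosses of a fair coin.
two_coins = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
mu, var = mean_and_variance(two_coins)
print(mu, var)
```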

Useful Tools for Population's Mean and Variance Estimations: It is not difficult to show
that,

E(aX + b) = aE(X) + b, for any constants a and b

Var(aX + b) = a^2 Var(X), for any constants a and b

Application: Notice that the above two rules are among the tools well suited for reducing, or
even preventing, round-off errors in statistical computation as well as computers'
over/underflows.

Example: Suppose a random sample of size n = 9 is:

X: 220, 220, 260, 280, 270, 250, 300, 290, 240.

We wish to estimate the mean and the variance of the population based on this sample.

Dividing each observation by 10 and then subtracting 22 from each value, we obtain a new data
set Y = X/10 - 22:

Y: 0, 0, 4, 6, 5, 3, 8, 7, 2.

Computing the sums needed for the mean and the variance of set Y, we obtain:

Σ yi = 35, Σ yi^2 = 203

Hence, the estimated mean and variance using the Y data set are 35/9 = 3.89 and
[203 - 9(35/9)^2] / 8 = 8.36, respectively. However, notice that X = 10Y + 220; therefore, the
estimated mean and variance for the population are E(X) = 10 E(Y) + 220 = 38.89 + 220 = 258.89,
and Var(X) = 10^2 Var(Y) = 836, respectively.
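The coding trick can be checked against a direct computation on the raw data. The sketch below inverts the coding Y = X/10 - 22 as X = 10Y + 220 and compares the results with Python's statistics module; all variable names are illustrative.

```python
from statistics import mean, variance

x = [220, 220, 260, 280, 270, 250, 300, 290, 240]

# Code the data as Y = X/10 - 22 (integer arithmetic keeps the values exact).
y = [xi // 10 - 22 for xi in x]

mean_y = sum(y) / len(y)
var_y = (sum(v * v for v in y) - len(y) * mean_y**2) / (len(y) - 1)

# Undo the coding: X = 10*Y + 220, so E(X) = 10*E(Y) + 220 and Var(X) = 100*Var(Y).
mean_x = 10 * mean_y + 220
var_x = 10**2 * var_y

print(mean_x, var_x)
print(mean(x), variance(x))   # direct computation on the raw data, for comparison
```

Both routes give a sample mean of about 258.9 and a sample variance of about 836.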

Notice that, the variance is not expressed in the same units as the expected value. So, the
variance is hard to understand and to explain as a result of the squared term in its
computation. This can be alleviated by working with the square root of the variance, which is
called the Standard (i.e., having the same unit as the data have) Deviation:

Standard Deviation = σ = (Variance)^(1/2)


Both variance and standard deviation provide the same information and, therefore, one can
always be obtained from the other. In other words, the process of computing standard
deviation always involves computing the variance. Since standard deviation is the square
root of the variance, it is always expressed in the same units as the expected value.

For the dynamic process, the Volatility as a measure for risk includes the time period over
which the standard deviation is computed. The Volatility measure is defined as standard
deviation divided by the square root of the time duration.

Coefficient of Variation: Coefficient of Variation (CV) is the absolute relative deviation with
respect to the size of the mean μ, provided μ is not zero, expressed in percentage:

CV = 100 |σ/μ| %

Notice that the CV does not depend on the unit in which the expected value is measured. The
coefficient of variation demonstrates the relationship between standard deviation and expected
value, by expressing the risk as a percentage of the expected value. The inverse of CV (namely
1/CV) is called the Signal-to-Noise Ratio.

You might like to use Multinomial Applet for checking your computation and performing
computer-assisted experimentation.

An Application: Consider two investment alternatives, Investment I and Investment II, with
the characteristics outlined in the following table:

- Two Investments -

     Investment I           Investment II
  Payoff %    Prob.      Payoff %    Prob.
      1       0.25           3       0.33
      7       0.50           5       0.33
     12       0.25           8       0.34

Performance of Two Investments

To rank these two investments under the Standard Dominance Approach in Finance, first we
must compute the mean and standard deviation and then analyze the results. Using the
Multinomial for calculation, we notice that Investment I has mean = 6.75% and standard
deviation = 3.9%, while Investment II has mean = 5.36% and standard deviation = 2.06%. First
observe that under the usual mean-variance analysis, these two investments cannot be ranked:
the first investment has the greater mean, but it also has the greater standard deviation.
Therefore, the Standard Dominance Approach is not a useful tool here. We have to resort to the
coefficient of variation (C.V.) as a systematic basis of comparison. The C.V. for Investment I
is 57.74% and for Investment II is 38.43%. Therefore, Investment II is preferred over
Investment I. Clearly, this approach can be used to rank any number of alternative investments.
Notice that less variation in return on investment implies less risk.
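The ranking can be reproduced from the table with a short script. This is a sketch; the function name mean_sd_cv is introduced only for this example. Because it works with the unrounded standard deviation, the C.V. it prints for Investment II lands a few hundredths of a percentage point above the 38.43% obtained from the rounded value 2.06%.

```python
from math import sqrt

def mean_sd_cv(payoffs, probs):
    """Mean, standard deviation, and coefficient of variation (in %) of a discrete payoff."""
    mu = sum(x * p for x, p in zip(payoffs, probs))
    var = sum(x**2 * p for x, p in zip(payoffs, probs)) - mu**2
    sd = sqrt(var)
    return mu, sd, 100 * abs(sd / mu)

investment_1 = mean_sd_cv([1, 7, 12], [0.25, 0.50, 0.25])
investment_2 = mean_sd_cv([3, 5, 8], [0.33, 0.33, 0.34])

for name, (mu, sd, cv) in (("I", investment_1), ("II", investment_2)):
    print(f"Investment {name}: mean = {mu:.2f}%, sd = {sd:.2f}%, CV = {cv:.2f}%")
```

The investment with the smaller C.V. carries less risk per unit of expected return, which is the basis of the ranking above.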

Expectation of a sum of a random number of random variables: Suppose that the
number of people entering a department store on a given day is a random variable with mean
50. Suppose further that the amounts of money spent by these customers are independent
random variables having a common mean of $80, and are independent of the number of
customers. What is the expected amount of money spent in the store on a given day?

E (sum of N random variables Xi) = E(N) · E(X)

Hence, the expected amount of money spent in the store is (50)(80) = $4000.
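The identity can be justified by conditioning on the number of customers N. The following short derivation is a sketch that assumes N is independent of the individual amounts X_i:

```latex
\begin{align*}
E\Big[\sum_{i=1}^{N} X_i\Big]
  &= E\Big[\,E\Big[\sum_{i=1}^{N} X_i \,\Big|\, N\Big]\Big] \\
  &= E\big[\,N \, E(X)\,\big] \\
  &= E(N)\, E(X) = (50)(80) = 4000 .
\end{align*}
```

The second line uses the fact that, given N, the sum has exactly N terms, each with mean E(X).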

You might like to use this JavaScript in performing some numerical experimentation to:

1. Show that E[aX + b] = aE(X) + b.
2. Show that V[aX + b] = a^2 V(X).
3. Show that E(X^2) = V(X) + (E(X))^2.

Normal Density Function

In the Descriptive Statistics Section of this Web site, we have been concerned with how
empirical scores are distributed and how best to describe their distribution. We have
discussed several different measures, but the mean μ will be the measure that we use to
describe the center of the distribution, and the standard deviation σ will be the measure we
use to describe the spread of the distribution. Knowing these two facts gives us ample
information to make statements about the probability of observing a certain value within that
distribution. If I know, for example, that the average Intelligence Quotient (I.Q.) score is 100
with a standard deviation of σ = 20, then I know that someone with an I.Q. of 140 is very
smart. I know this because 140 deviates from the mean μ by twice the standard deviation, that
is, by twice the typical deviation of the scores in the distribution. Thus, it is unlikely to see a
score as extreme as 140 because most of the I.Q. scores are clustered around 100 and typically
deviate only about 20 points from the mean μ.

Many applications arise from the central limit theorem (CLT). The CLT states that the
distribution of the average of n observations approaches a normal distribution as n grows,
irrespective of the form of the original distribution, under quite general conditions.
Consequently, the normal distribution is an appropriate model for many, but not all, physical
phenomena, such as the distribution of physical measurements on living organisms, intelligence
test scores, product dimensions, average temperatures, and so on.

Know that the Normal distribution satisfies eight requirements: (1) the graph is a bell-shaped
curve; (2) mean, median, and mode are all equal; (3) mean, median, and mode are located at the
center of the distribution; (4) it has only one mode; (5) it is symmetric about the mean; (6) it
is a continuous function; (7) it never touches the x-axis; and (8) the area under the curve
equals one.

Many methods of statistical analysis presume normal distribution.

When we know the mean and variance of a Normal distribution, we can find probabilities. So
if, for example, you knew some things about the average height of women in the nation,
including the fact that heights are distributed normally, you could measure all the women in
your extended family and find the average height. This enables you to determine a probability
associated with your result. If the probability of getting your result, given your knowledge of
women nationwide, is high, then your family's female height cannot be said to be different
from average. If that probability is low, then your result is rare (given the knowledge about
women nationwide), and you can say your family is different. You have just completed a test
of the hypothesis that the average height of women in your family is different from the overall
average.

The ratio of two independent observations from the standard normal is distributed as the
Cauchy Distribution, which has thicker tails than a normal distribution. Its density function is
f(x) = 1/[π(1 + x^2)], for all real values x.

An Application: A portfolio manager believes that the overnight loss of his portfolio is
distributed normally with mean $0 and standard deviation of $10 000. Find the 5% one-day
value at risk for this portfolio.

Let X denote the random portfolio loss, distributed as X ~ N (0, 10 000^2). The value at risk
v5% is by definition a number such that

P(X ≤ v5%) = 0.95.

To find v5% we standardize the random variable on the left-hand side:

X ≤ v5% ⇔ X - 0 ≤ v5% - 0 ⇔ [X - 0] / [10 000] ≤ [v5% - 0] / [10 000].


The transformation is denoted by Z = (X - 0) / 10 000, which has the standard normal
distribution. Therefore,

P{Z ≤ [v5% - 0] / [10 000]} = 0.95.

If we denote by z95% the 95% quantile of a standard normal distribution, then

[v5%] / [10 000] = z95%

The value of z95% can be found in a standard normal table:

z95% = 1.645, v5% = 10 000 z95% = 16 450

Therefore, the overnight 5% value at risk is $16 450.
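The same quantile lookup can be done in code rather than from a table. The sketch below uses the standard library's NormalDist; since it does not round z95% to three decimals, its answer differs from the table-based figure by a dollar or two.

```python
from statistics import NormalDist

loss = NormalDist(mu=0, sigma=10_000)   # overnight loss, in dollars

z_95 = NormalDist().inv_cdf(0.95)       # 95% quantile of the standard normal
var_5pct = loss.inv_cdf(0.95)           # same as 0 + 10_000 * z_95

print(f"z95% = {z_95:.3f}, 5% value at risk = ${var_5pct:,.0f}")
```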

You might like to use Standard Normal JavaScript instead of using tabular values from your
textbook, and the well-known Lilliefors' Test for Normality to assess the goodness-of-fit.

Poisson Probability Function

Life is good for only two things, discovering mathematics and teaching
mathematics.
-- Simeon Poisson

An important class of decision problems under uncertainty is characterized by the small
chance of the occurrence of a particular event, such as an accident. The Poisson probability
function computes the probability of exactly x independent occurrences during a given period
of time, if events take place independently and at a constant rate. The Poisson probability
function may also represent the number of occurrences over constant areas or volumes.

Poisson probabilities are often used; for example in quality control, software and hardware
reliability, insurance claim, number of incoming telephone calls, and queuing theory.

Application: Gives the probability of exactly x independent occurrences during a given period
of time if events take place independently and at a constant rate. May also represent the number
of occurrences over constant areas or volumes. It is used frequently in quality control,
reliability, queuing theory, and so on.

Example: Used to represent the distribution of the number of defects in a piece of material,
customer arrivals, insurance claims, incoming telephone calls, alpha particles emitted, and so
on.

A process that creates fabric is monitored. If the number of defects (X) per meter of fabric
exceeds 5 then the process is stopped for diagnosis. The random variable X follows a
Poisson distribution with rate = number of defects per meter of fabric.

An Application: One of the most useful applications of the Poisson distribution is in the field
of queuing theory. In many situations where queues occur it has been shown that the number
of people joining the queue in a given time period follows the Poisson model. For example, if
the rate of arrivals to an emergency room is λ per unit of time period (say 1 hr), then:

P (n arrivals) = λ^n e^(-λ) / n!

The mean and variance of the random variable n are both λ. However, if the mean and variance
of a random variable have equal numerical values, it does not necessarily follow that its
distribution is Poisson. Its mode is within the interval [λ - 1, λ].

Applications:

P (0 arrivals) = e^(-λ)
P (1 arrival) = λ e^(-λ) / 1!
P (2 arrivals) = λ^2 e^(-λ) / 2!

and so on. In general:

P (n+1 arrivals) = λ P (n arrivals) / (n+1).
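The recursion P(n+1 arrivals) = λ P(n arrivals) / (n+1), which follows from dividing two consecutive terms of the Poisson formula, gives a convenient way to build a whole table of probabilities. The sketch below compares it with the direct formula; the rate of 3 arrivals per hour is a hypothetical value.

```python
from math import exp, factorial

def poisson_pmf(n: int, lam: float) -> float:
    """P(n arrivals) computed directly from the formula."""
    return lam**n * exp(-lam) / factorial(n)

def poisson_pmf_table(n_max: int, lam: float) -> list:
    """P(0), P(1), ..., P(n_max) built with the recursion, avoiding large factorials."""
    probs = [exp(-lam)]
    for n in range(n_max):
        probs.append(lam * probs[n] / (n + 1))
    return probs

lam = 3.0   # hypothetical arrival rate per hour, for illustration only
table = poisson_pmf_table(10, lam)
for n, p in enumerate(table[:4]):
    print(n, round(p, 4), round(poisson_pmf(n, lam), 4))
```

The recursive and the direct values agree in every row.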

Normal approximation for Poisson: All Poisson tables are limited in their scope; therefore,
beyond the tabulated values one may use the standard normal distribution to approximate the
Poisson probabilities. The following numerical example illustrates how good the approximation
could be.

Numerical Example: Emergency patients arrive at a large hospital at the rate of 0.033 per
minute. What is the probability of exactly two arrivals during the next 30 minutes?

The arrival rate during 30 minutes is λ = (30)(0.033) ≈ 1. Therefore,

P (2 arrivals) = [1^2 / (2!)] e^(-1) = 18%

The mean and standard deviation of the distribution are:

μ = λ = 1, and σ = λ^(1/2) = 1,

respectively; therefore, the standardized values for n = 2, using the continuity correction
(which always widens the interval), are:

z1 = [(n - 1/2) - μ] / σ = (1.5 - 1)/1 = 0.5, and

z2 = [(n + 1/2) - μ] / σ = (2.5 - 1)/1 = 1.5.

Therefore, the approximated P (2 arrivals) is P (z being within the interval 0.5, 1.5). Now, by
using the standard normal table, we obtain:

P (2 arrivals) = 0.43319 - 0.19146 = 24%

As you see, the approximation overestimates the exact value (24% versus 18%), so the error is
on the safe side; the gap is this wide because λ = 1 is small. For large values of λ, say over
20, one may use the Normal approximation to calculate Poisson probabilities.
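To see how the quality of the approximation depends on the size of the rate, the sketch below repeats the calculation for the example above and for a larger, hypothetical rate of 25. The function names are illustrative.

```python
from math import exp, factorial, sqrt
from statistics import NormalDist

def poisson_pmf(n: int, lam: float) -> float:
    """Exact Poisson probability of n arrivals."""
    return lam**n * exp(-lam) / factorial(n)

def normal_approx(n: int, lam: float) -> float:
    """Continuity-corrected normal approximation to P(exactly n arrivals)."""
    dist = NormalDist(mu=lam, sigma=sqrt(lam))
    return dist.cdf(n + 0.5) - dist.cdf(n - 0.5)

# The text's example (rate 1, n = 2), then a hypothetical larger rate (rate 25, n = 25).
small = (poisson_pmf(2, 1.0), normal_approx(2, 1.0))
large = (poisson_pmf(25, 25.0), normal_approx(25, 25.0))
print(small, large)
```

For the small rate the two numbers differ by several percentage points, while for the larger rate they nearly coincide.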

Notice that by taking the square root of a Poisson random variable, the transformed variable
is more symmetric. This is a useful transformation in regression analysis of Poisson
observations.

Poisson approximation for binomial: Notice that, whenever you use the Poisson
approximation to the binomial distribution with parameters n and p, the goodness of the
approximation is largely determined by how small p is rather than by how large n is.

You might like to use Common Discrete Probability Functions to obtain probability and the
cumulative probability functions.

You might like to use Poisson Probability Function JavaScript to perform your computation,
and Testing Poisson to perform the goodness-of-fit test.

Further Reading:
Barbour et al., Poisson Approximation, Oxford University Press, 1992.

Student T-Density Function

The t distributions were discovered in 1908 by William Gosset, who was a chemist and a
statistician employed by the Guinness brewing company. He considered himself a student
still learning statistics, so he signed his papers with the pseudonym "Student". Or
perhaps he used a pseudonym due to "trade secret" restrictions by Guinness.

Note that there are different t-distributions; it is a class of distributions. When we speak of a
specific t distribution, we have to specify the degrees of freedom. The t density curves are
symmetric and bell-shaped like the normal distribution and have their peak at 0. However, the
spread is more than that of the standard normal distribution. The larger the degrees of
freedom, the closer the t-density is to the normal density.

The shape of a t-distribution depends on a parameter called "degree-of-freedom". As the
degree-of-freedom gets larger, the t-distribution gets closer and closer to the standard normal
distribution. For practical purposes, the t-distribution is treated as the standard normal
distribution when degree-of-freedom is greater than 30.

Suppose we have two independent random variables, one is Z, distributed as the standard
normal distribution, while the other, χ^2, has a Chi-square distribution with (n-1) d.f.; then
the random variable:

Z / [χ^2 / (n-1)]^(1/2)

has a t-distribution with (n-1) d.f. For large sample size (say, n over 30), the new random
variable has an expected value equal to zero, and its variance is (n-1)/(n-3), which is close to
one.

Notice that the t-statistic is related to the F-statistic as follows: F = t^2, where F has
d.f.1 = 1 and d.f.2 = d.f. of the t-table.

You might like to use Student t-Density to obtain its P-values.

Triangular Density Function

The triangular distribution shows the number of successes when you know the minimum,
maximum, and most likely values. For example, you could describe the number of intakes
seen per week when past intake data show the minimum, maximum, and most likely number
of cases seen. It has a continuous probability distribution.

The parameters for the triangular distribution are Minimum (a), Maximum (b), and Likeliest
(c). There are three conditions underlying the triangular distribution:

The minimum number of items is fixed.
The maximum number of items is fixed.
The most likely number of items falls between the minimum and maximum values.

These three parameters form a triangular-shaped distribution, which shows that values
near the minimum and maximum are less apt to occur than those near the most likely value.

The following are the general Triangular density function, together with the expected value
and the variance, for a Triangular random variable X (a, c, b):

$$f(x) = \frac{2(x-a)}{(b-a)(c-a)}, \quad a \le x \le c$$

$$f(x) = \frac{2(b-x)}{(b-a)(b-c)}, \quad c \le x \le b$$

$$E(X) = \frac{a + b + c}{3}$$

$$\mathrm{Var}(X) = \frac{a^2 + b^2 + c^2 - ab - ac - bc}{18}$$

The following is a Triangular density function with parameters (a = 0, c = 0.25, b = 1):



A Triangular Density Function


Application: Given X is distributed as above, compute the tail probability
$P(X \le 0.1 \text{ or } X \ge 0.9)$.
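One way to work this application is to integrate the two linear pieces of the density into a cumulative distribution function. The sketch below assumes minimum 0, likeliest 0.25, and maximum 1, as in the figure; the helper name `triangular_cdf` is illustrative.

```python
def triangular_cdf(x, a, c, b):
    """CDF of a triangular distribution with minimum a, likeliest c, maximum b."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    if x <= c:
        # area of the rising part of the triangle up to x
        return (x - a) ** 2 / ((b - a) * (c - a))
    # one minus the area of the falling part beyond x
    return 1.0 - (b - x) ** 2 / ((b - a) * (b - c))


a, c, b = 0.0, 0.25, 1.0
lower_tail = triangular_cdf(0.1, a, c, b)        # P(X <= 0.1)
upper_tail = 1.0 - triangular_cdf(0.9, a, c, b)  # P(X >= 0.9)
both_tails = lower_tail + upper_tail

print(f"P(X <= 0.1)            = {lower_tail:.4f}")
print(f"P(X >= 0.9)            = {upper_tail:.4f}")
print(f"P(X <= 0.1 or X >= 0.9) = {both_tails:.4f}")
```

By hand, the lower tail is 0.1²/(1 × 0.25) = 0.04 and the upper tail is 0.1²/(1 × 0.75) ≈ 0.0133, so the two tails together carry a probability of about 0.0533.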

Further Reading:
Evans M., Hastings N., and B., Peacock, Triangular Distribution, Ch. 40 in Statistical Distributions, Wiley, pp. 187-188, 2000.

Uniform Density Function

The uniform density function gives the probability that an observation will occur within a
particular sub-interval of [a, b] when the probability of occurrence is directly
proportional to the length of that sub-interval. Its mean and variance are:

$$\mu = \frac{a+b}{2}, \qquad \sigma^2 = \frac{(b-a)^2}{12}.$$

Applications: Used to generate random numbers in sampling and Monte Carlo simulation.

Comments: Special case of beta distribution.

You might like to use Goodness-of-Fit Test for Uniform and perform some numerical
experimentation for a deeper understanding of the concepts.

Notice that any Uniform distribution has an uncountable number of modes with equal density
value; therefore it is considered a homogeneous population.

Discrete Uniform Distribution: The discrete uniform distribution describes the distribution of
n equally likely events (labeled with the integers from 1 to n), each with probability 1/n.

If X is a discrete uniform random variable with parameter n, then the mean, and variance are
as follows:

$$E(X) = \frac{n+1}{2}, \qquad \mathrm{Var}(X) = \frac{n^2 - 1}{12}$$

Further Reading:
Balakrishnan N., and V. Nevzorov, A Primer on Statistical Distributions, Wiley, 2003.

Necessary Conditions for Statistical Decision Making

Introduction to Inferential Data Analysis Necessary Conditions: Do not just learn
formulas and number-crunching. Learn about the conditions under which statistical testing
procedures apply. The following conditions are common to almost all statistical tests:

1. Any undetected outliers may have major impact and may influence the results of almost all statistical
estimation and testing procedures.

2. Homogeneous population. That is, there is not more than one mode. Perform Test for Homogeneity of a
Population

3. The sample must be random. Perform Test for Randomness.

4. In addition to the Homogeneity requirement, each population has a normal distribution. Perform the
Lilliefors' Test for Normality.

5. Homogeneity of variances. Variation in each population is almost the same as in the other(s). Perform
Bartlett's Test.

For two populations use the F-test. For 3 or more populations, there is a practical rule known as
the "Rule of 2". In this rule, one divides the largest sample variance by the smallest sample
variance. If the sample sizes are almost the same and this ratio is less
than 2, then the variations of the populations are almost the same.

Notice: This important condition in analysis of variance (ANOVA and the t-test for mean differences) is
commonly tested by the Levene test or its modified test known as the Brown-Forsythe test. Interestingly,
both tests rely on the homogeneity of variances condition!
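The "Rule of 2" in condition 5 is simple enough to sketch in a few lines. The example below is illustrative only: the function name and the three small groups of numbers are hypothetical, and the rule presumes roughly equal sample sizes, as stated above.

```python
import statistics


def rule_of_two(samples):
    """Return (ratio, passes): largest over smallest sample variance, and ratio < 2."""
    variances = [statistics.variance(s) for s in samples]  # n - 1 divisor
    ratio = max(variances) / min(variances)
    return ratio, ratio < 2


# Hypothetical groups of equal size, made up for illustration.
group_a = [12, 15, 11, 14, 13]
group_b = [22, 25, 21, 24, 23]
group_c = [10, 18, 6, 14, 2]

ratio, passes = rule_of_two([group_a, group_b, group_c])
print(f"variance ratio = {ratio:.2f}, variances roughly equal: {passes}")
```

When the check fails, a formal test such as Bartlett's is the natural next step before relying on procedures that assume equal variances.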


These conditions are crucial, not for the method of computation, but for the testing using the
resultant statistic. Otherwise, we can do ANOVA and regression without any assumptions, and the
numbers come out the same. Simple computations give us least-square fits, partitions of variance,
regression coefficients, and so on. We do need the above conditions when tests of hypotheses are
our main concern.

Further Readings:
Good Ph., and J. Hardin, Common Errors in Statistics, Wiley, 2003.
Wang H., Improved confidence estimators for the usual one-sided confidence intervals for the ratio of two normal variances, Statistics &
Probability Letters, Vol. 59, No.3, 307-315, 2002.

Measure of Surprise for Outlier Detection

Robust statistical techniques are needed to cope with any undetected outliers; otherwise they are
more likely to invalidate the conditions underlying statistical techniques, and they may seriously
distort estimates and produce misleading conclusions in tests of hypotheses. A common approach
consists of assuming that contaminating models, different from the one generating the rest of the
data, generate the (possible) outliers.

Because of a potentially large variance, outliers could be the outcome of sampling errors or clerical
errors such as errors in recording data. Therefore, you must be very careful and cautious. Before declaring
an observation "an outlier," find out why and how such an observation occurred. It could even be an
error at the data entry stage while using any computer package.

In practice, any observation with a standardized value greater than 2.5 in absolute value is a
candidate for being an outlier. In such a case, one must first investigate the source of the datum. If
there is no doubt about the accuracy or veracity of the observation, then it should be removed, and
the model should be refitted.

1. Compute the mean ($\bar{x}$) and standard deviation (S) of the whole sample.

2. Set limits around the mean: $\bar{x} - k \times S$ and $\bar{x} + k \times S$.
A typical value for k is 2.5.

3. Remove all sample values outside the limits.

4. Repeat steps 1 through 3 on the reduced sample, since removing outliers changes the mean
and the standard deviation.

5. In most cases, we need to iterate through this algorithm several times until all outliers are
removed.
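The steps above can be sketched as a short loop. This is a minimal illustration, assuming the sample standard deviation with the n - 1 divisor; the function name and the readings are hypothetical and not taken from the original site's JavaScript tool.

```python
import statistics


def remove_outliers(sample, k=2.5):
    """Iteratively drop values outside mean +/- k*S until no value lies outside."""
    data = list(sample)
    removed = []
    while len(data) > 2:
        mean = statistics.mean(data)
        s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
        lower, upper = mean - k * s, mean + k * s
        kept = [x for x in data if lower <= x <= upper]
        if len(kept) == len(data):
            break  # nothing outside the limits, so the iteration stops
        removed.extend(x for x in data if not (lower <= x <= upper))
        data = kept
    return data, removed


# Hypothetical readings with one suspicious value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7, 10.0, 25.0]
kept, removed = remove_outliers(readings, k=2.5)
print("kept:   ", kept)
print("removed:", removed)
```

Note that in a small sample a single extreme value also inflates S, so a value can sit just inside the limits on the first pass; this is one reason the source of each flagged datum should be investigated rather than relying on the loop alone.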

An Application: Suppose you ask ten of your classmates to measure a given length X. The
results (in mm) are:

46, 48, 38, 45, 47, 58, 44, 45, 43, 44

Is 58 an outlier? The mean and the standard deviation of the ten measurements, computed using the
Descriptive Sampling Statistics JavaScript, are 45.8 and 5.1 (after the needed adjustment),
respectively. The Z-value for 58 is Z(58) = (58 - 45.8)/5.1 = 2.4. Since measurements, in general, follow a
normal distribution,

Probability [X as large as 2.4 times standard deviation] = 0.008,

obtained by using the Standard Normal P-value JavaScript, or from the normal table in your
textbook.

According to this probability, one expects only 10 × 0.008 = 0.08 of the ten measurements to be as
extreme as this one. This is a very rare event; however, in spite of such a small probability, it has
occurred. Therefore, it might be an outlier.
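The numbers in this application can be reproduced with the standard library. The sketch below is illustrative: it uses the n - 1 divisor for the standard deviation and obtains the upper-tail normal probability from the complementary error function instead of a table.

```python
import math
import statistics

measurements = [46, 48, 38, 45, 47, 58, 44, 45, 43, 44]

mean = statistics.mean(measurements)
s = statistics.stdev(measurements)  # sample standard deviation (n - 1 divisor)
z = (58 - mean) / s                 # standardized value of the suspect reading

# P(Z >= z) for a standard normal variable, via the complementary error function
upper_tail = 0.5 * math.erfc(z / math.sqrt(2))
expected_count = len(measurements) * upper_tail

print(f"mean = {mean:.1f}, S = {s:.2f}, Z(58) = {z:.2f}")
print(f"P(Z >= {z:.2f}) = {upper_tail:.4f}")
print(f"expected number this extreme among {len(measurements)}: {expected_count:.3f}")
```

Changing 58 to 38 in the line that computes `z` (and using the lower tail) is one way to approach the question about the second suspect measurement.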

The next most suspect measurement is 38. Is it an outlier? That is a question for you.


A Notice: Outlier detection in the single population setting is not too difficult. Quite often, however,
one can argue that the detected outliers are not really outliers, but form a second population. If
this is the case, a data separation approach needs to be taken.

You might like to use the Identification of Outliers JavaScript to perform some numerical
experimentation for validation and for a deeper understanding of the concepts.

Further Reading:
Barnett V., and T. Lewis, Outliers in Statistical Data, Wiley, 1994.

Homogeneous Population

A homogeneous population is a statistical population which has a unique mode.

Notice that, e.g., a Uniform distribution has an uncountable number of modes with equal density
value; therefore it is considered a homogeneous population.

To determine if a given population is homogeneous or not, construct the histogram of a random
sample from the entire population. If there is more than one mode, then you have a mixture of two
or more different populations. Know that to perform any statistical testing, you need to make sure
you are dealing with a homogeneous population.

One of the main applications of histogramming is to Test for Homogeneity of a Population. The
unimodality of the histogram is a necessary condition for the homogeneity of a population in order
to conduct any meaningful statistical analysis.

Test for Randomness: The Runs' Test

A basic condition in almost all inferential statistics is that a set of data constitutes a random sample
from a given homogeneous population. The condition of randomness is essential to make sure the
sample is truly representative of the population. The most widely used test for randomness is the Runs
test.

A "run" is a maximal subsequence of like elements.

Consider the following sequence (D for Defective items, N for Non-defective items) from a
production line: DDDNNDNDNDDD. The number of runs is R = 7, with $n_1$ = 8 and $n_2$ = 4, which are
the numbers of D's and N's, respectively.

A sequence is a random sequence if it is neither "over-mixed" nor "under-mixed". An example of
an over-mixed sequence is DDDNDNDNDNDD, with R = 9, while an under-mixed sequence looks like
DDDDDDDDNNNN, with R = 2. Therefore, the first sequence above seems to be a random sequence.

The Runs Test, which is also known as the Wald-Wolfowitz Test, is designed to test the randomness
of a given sample at the $100(1-\alpha)\%$ confidence level. To conduct a runs test on a sample, perform the
following steps:

Step 1: Compute the mean of the sample.

Step 2: Going through the sample sequence, replace each observation with + or - depending on
whether it is above or below the mean. Discard any ties.

Step 3: Compute R, $n_1$, and $n_2$.

Step 4: Compute the expected mean and variance of R, as follows:

$$\mu = 1 + \frac{2 n_1 n_2}{n_1 + n_2}$$

$$\sigma^2 = \frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}$$


Step 5: Compute $z = (R - \mu)/\sigma$.

Step 6: Conclusion:

If $z > z_{\alpha}$, then there might be cyclic or seasonal behavior (over-mixing: too many runs).

If $z < -z_{\alpha}$, then there might be a trend (under-mixing: too few runs).

If $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$, reject the randomness.

Note: This test is valid for cases in which both $n_1$ and $n_2$ are large, say greater than 10. For small
sample sizes, special tables must be used.

For example, suppose for a given sample of size 50, we have R = 24, $n_1$ = 14 and $n_2$ = 36. Test for
randomness at $\alpha$ = 0.05.
Plugging these into the above formulas, we have $\mu$ = 1 + 2(14)(36)/50 = 21.16, $\sigma$ = 2.81, and
z = (24 - 21.16)/2.81 = 1.01. From the Z-table, we have $z_{\alpha}$ = 1.645 and $z_{\alpha/2}$ = 1.96. Since |z| is smaller
than both critical values, there is no evidence of a trend or of cyclic behavior, and randomness is not
rejected.
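The six steps can be sketched as follows. This is an illustrative outline using only the standard library; the function names are hypothetical, and the normal approximation it relies on is the large-sample one noted above (both counts greater than about 10).

```python
import math


def dichotomize(sample):
    """Steps 1-2: mark each value '+' or '-' relative to the mean; ties are dropped."""
    mean = sum(sample) / len(sample)
    return ["+" if x > mean else "-" for x in sample if x != mean]


def count_runs(sequence):
    """Step 3: return (R, n1, n2) for a sequence made of exactly two symbols."""
    symbols = sorted(set(sequence))
    if len(symbols) != 2:
        raise ValueError("sequence must contain exactly two distinct symbols")
    # a new run starts wherever two neighbours differ
    runs = 1 + sum(1 for prev, cur in zip(sequence, sequence[1:]) if prev != cur)
    return runs, sequence.count(symbols[0]), sequence.count(symbols[1])


def runs_test_z(runs, n1, n2):
    """Steps 4-5: mean and standard deviation of R under randomness, and z."""
    n = n1 + n2
    mu = 1 + 2 * n1 * n2 / n
    var = 2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) / (n ** 2 * (n - 1))
    sigma = math.sqrt(var)
    return mu, sigma, (runs - mu) / sigma


# The production-line sequence from the text (too short for the normal
# approximation, so it is used here only to illustrate the run count).
print(count_runs("DDDNNDNDNDDD"))

# The sample of size 50 from the example.
mu, sigma, z = runs_test_z(24, 14, 36)
print(f"mu = {mu:.2f}, sigma = {sigma:.3f}, z = {z:.2f}")
```

Step 6 then compares `z` with the tabulated critical values for the chosen significance level.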

You may use the following JavaScript to Test for Randomness.

Test for Normality

The standard test for normality is the Lilliefors' statistic. A histogram and a normal probability plot will
also help you detect a systematic departure from normality, which shows up as a
curve in the probability plot.

Lilliefors' Test for Normality: This test is a special case of the Kolmogorov-Smirnov goodness-of-fit
test, developed for testing the normality of a population's distribution. When applying the Lilliefors
test, a comparison is made between the standard normal cumulative distribution function and a
sample cumulative distribution function of the standardized random variable. If there is close
agreement between the two cumulative distributions, the hypothesis that the sample was drawn
from a population with a normal distribution function is supported. If, however, there is a discrepancy
between the two cumulative distribution functions too great to be attributed to chance alone, then
the hypothesis is rejected.

The difference between the two cumulative distribution functions is measured by the statistic D,
which is the greatest vertical distance between the two functions.
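The statistic D can be computed directly from its definition. The sketch below is illustrative only: it standardizes the sample with its own mean and standard deviation, evaluates the standard normal CDF with the error function, and checks the gap on both sides of each step of the sample CDF. It does not supply critical values, which would still have to come from a Lilliefors table or a dedicated tool.

```python
import math
import statistics


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def lilliefors_d(sample):
    """Greatest vertical distance between the sample CDF and the normal CDF."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)
    z_sorted = sorted((x - mean) / s for x in sample)
    d = 0.0
    for i, z in enumerate(z_sorted, start=1):
        f = normal_cdf(z)
        # the sample CDF jumps from (i - 1)/n to i/n at this point
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


measurements = [46, 48, 38, 45, 47, 58, 44, 45, 43, 44]
print(f"D = {lilliefors_d(measurements):.4f}")
```

A small D indicates close agreement with the normal distribution; how small is "small enough" depends on the sample size and the chosen significance level.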

You might like to use the well-known Lilliefors' Test for Normality to assess the goodness-of-fit.

Further Readings
Thode T., Testing for Normality, Marcel Dekker, Inc., 2001. Contains the major tests for normality.

Introduction to Estimation

To estimate means to esteem (to give value to). An estimator is any quantity calculated from the
sample data which is used to give information about an unknown quantity in the population. For
example, the sample mean $\bar{x}$ is an estimator of the population mean $\mu$.

Results of estimation can be expressed as a single value, known as a point estimate, or a range of
values, referred to as a confidence interval. Whenever we use point estimation, we calculate the
margin of error associated with that point estimate.

Estimators of population parameters are sometimes distinguished from the true value by using the
symbol 'hat'. For example, the true population standard deviation $\sigma$ is estimated by $\hat{\sigma}$,
the standard deviation computed from a sample.

Again, the usual estimator of the population mean is $\bar{x} = \sum x_i / n$, where n is the size of the sample
and $x_1, x_2, x_3, \ldots, x_n$ are the values of the sample. If the value of the estimator in a particular
sample is found to be 5, then 5 is the estimate of the population mean $\mu$.


Qualities of a Good Estimator

A "good" estimator is one which provides an estimate with the following qualities:

Unbiasedness: An estimator is said to be an unbiased estimator of a given parameter when its
expected value can be shown to be equal to the parameter being estimated. For
example, the mean of a sample is an unbiased estimate of the mean of the population from which
the sample was drawn. Unbiasedness is a good quality for an estimate, since, in such a case,
using a weighted average of several estimates provides a better estimate than each one of those
estimates. Therefore, unbiasedness allows us to upgrade our estimates. For example, if your
estimates of the population mean µ are, say, 10 and 11.2 from two independent samples of sizes
20 and 30 respectively, then a better estimate of the population mean µ based on both samples
is [20 (10) + 30 (11.2)] / (20 + 30) = 10.72.
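The pooling of two unbiased estimates is a weighted average with the sample sizes as weights. A minimal sketch, with an illustrative function name:

```python
def pooled_estimate(estimates, sizes):
    """Weighted average of estimates, using the sample sizes as weights."""
    return sum(e * n for e, n in zip(estimates, sizes)) / sum(sizes)


pooled = pooled_estimate([10, 11.2], [20, 30])
print(f"pooled estimate of the mean = {pooled:.2f}")
```

The larger sample gets the larger weight, so the pooled value lies closer to 11.2 than to 10.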

Consistency: The standard deviation of an estimate is called the standard error of that estimate.
The larger the standard error the more error in your estimate. The standard deviation of an
estimate is a commonly used index of the error entailed in estimating a population parameter
based on the information in a random sample of size n from the entire population.

An estimator is said to be "consistent" if increasing the sample size produces an estimate with
smaller standard error. Therefore, your estimate is "consistent" with the sample size. That is,
spending more money to obtain a larger sample produces a better estimate.

Efficiency: An efficient estimate is one which has the smallest standard error among all unbiased
estimators.

The "best" estimator is the one which is the closest to the population parameter being estimated:



The Concept of "Distance" for an Estimator Is Demonstrated

The above figure illustrates the concept of closeness by means of aiming at the center for
unbiased with minimum variance. Each dart board has several samples:

The first one has all its shots clustered tightly together, but none of them hit the center. The second
one has a large spread, but around the center. The third one is worse than the first two. Only the
last one has a tight cluster around the center, therefore has good efficiency.

If an estimator is unbiased, then its variability will determine its reliability. If an estimator is
extremely variable, then the estimates it produces may not on average be as close to the
population parameter as a biased estimator with small variance.

The following chart depicts the quality of a few popular estimators for the population mean µ:


Sample Mean as a "Good" Estimator for the Population's Expected Value

The widely used estimator of the population mean $\mu$ is $\bar{x} = \sum x_i / n$, where n is the size of the sample
and $x_1, x_2, x_3, \ldots, x_n$ are the values of the sample. It has all of the above good properties;
therefore, it is a "good" estimator.

If you want an estimate of central tendency as a parameter for a test or for comparison, then small
sample sizes are unlikely to yield any stable estimate. The mean is sensible in a symmetrical
distribution as a measure of central tendency; but, e.g., with ten cases, you will not be able to
judge whether you have a symmetrical distribution. However, the mean estimate is useful if you are
trying to estimate the population sum, or some other function of the expected value of the
distribution. Would the median be a better measure? In some distributions (e.g., shirt size) the
mode may be better. BoxPlot will indicate outliers in the data set. If there are outliers, the median is
better than the mean as a measure of central tendency.

You might like to use Descriptive Statistics JavaScript for obtaining "good" estimates.

Further Readings
Casella G., and R. Berger, Statistical Inference, Wadsworth Pub. Co., 2001.
Lehmann E., and G. Casella, Theory of Point Estimation, Springer Verlag, New York, 1998.

Estimations with Confidence

In practice, a confidence interval is used to express the uncertainty in a quantity being estimated.
There is uncertainty because inferences are based on a random sample of finite size from the
entire population or process of interest. To judge the statistical procedure we can ask what would
happen if we were to repeat the same study, over and over, getting different data (and thus
different confidence intervals) each time.

In most studies, investigators are usually interested in determining the size of difference of a
measured outcome between groups, rather than a simple indication of whether or not it is
statistically significant. Confidence intervals present a range of values, on the basis of the sample
data, in which the value of such a difference may lie.

Know that a confidence interval computed from one sample will be different from a confidence
interval computed from another sample.

Understand the relationship between sample size and the width of the confidence interval; moreover, know
that sometimes the computed confidence interval does not contain the true value.

Let's say you compute a 95% confidence interval for a mean $\mu$. The way to interpret this is to
imagine an infinite number of samples from the same population: at least 95% of the computed
intervals will contain the population mean $\mu$, and at most 5% will not. However, it is wrong to
state, "I am 95% confident that the population mean $\mu$ falls within the interval."

Again, the usual definition of a 95% confidence interval is an interval constructed by a process
such that the interval will contain the true value at least 95% of the time. This means that "95%" is a
property of the process, not the interval.
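The "repeat the study over and over" interpretation can be imitated by simulation. The sketch below is illustrative: the population mean of 50, standard deviation of 10, sample size of 100, and the seed are all assumptions made for the demonstration, and the large-sample critical value 1.96 is used in place of a t-table value.

```python
import math
import random
import statistics


def coverage(true_mean=50.0, true_sd=10.0, n=100, repetitions=2000, seed=2020):
    """Fraction of simulated 95% intervals that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        xbar = statistics.mean(sample)
        margin = 1.96 * statistics.stdev(sample) / math.sqrt(n)
        if xbar - margin <= true_mean <= xbar + margin:
            hits += 1
    return hits / repetitions


rate = coverage()
print(f"fraction of intervals containing the true mean: {rate:.3f}")
```

Each individual interval either contains the true mean or it does not; the figure near 95% describes the long-run behavior of the procedure.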

Is the probability of occurrence of the population mean greater in the confidence interval (CI)
center and lowest at the boundaries? Does the probability of occurrence of the population mean in
a confidence interval vary in a measurable way from the center to the boundaries? In a general
sense, normality condition is assumed, and then the interval between CI limits is represented by a
bell shaped t distribution. The expectation (E) of another value is highest at the calculated mean
value, and decreases as the values approach the CI limits.

Tolerance Interval and CI: A good approximation for the single measurement tolerance interval is
$\sqrt{n}$ times the confidence interval of the mean.


Statistics with Confidence



You may use Estimations With Confidence, and Confidence Intervals for Two Populations to check
your hand computations.

You need to use Sample Size Determination JavaScript at the design stage of your statistical
investigation in decision making with specific subjective requirements.

A Note on Multiple Comparison via Individual Intervals: Notice that, if the confidence intervals
from two samples do not overlap, there is a statistically significant difference, say at 5%. However,
the other way is not true; two confidence intervals can overlap even when there is a significant
difference between them.

As a numerical example, consider the means of two independent samples. Suppose their values
are 10 and 22 with equal standard error of 4. The 95% confidence intervals for the two statistics
(using the critical value of 1.96) are [2.2, 17.8] and [14.2, 29.8], respectively. As you see, they
display considerable overlap. However, the z-statistic for the two-population mean is
$|22 - 10| / (16 + 16)^{1/2} = 2.12$, which is clearly significant under the same conditions as applied for
constructing the confidence intervals.
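The numerical example can be reproduced in a few lines. The helper names below are illustrative; the inputs are the means and standard errors given in the example.

```python
import math


def confidence_interval(estimate, std_error, z=1.96):
    """Two-sided interval: estimate +/- z * standard error."""
    half_width = z * std_error
    return estimate - half_width, estimate + half_width


def z_for_difference(est1, se1, est2, se2):
    """z-statistic for the difference of two independent estimates."""
    return abs(est1 - est2) / math.sqrt(se1 ** 2 + se2 ** 2)


ci1 = confidence_interval(10, 4)
ci2 = confidence_interval(22, 4)
overlap = ci1[1] >= ci2[0] and ci2[1] >= ci1[0]
z = z_for_difference(10, 4, 22, 4)

print(f"CI 1 = [{ci1[0]:.1f}, {ci1[1]:.1f}], CI 2 = [{ci2[0]:.1f}, {ci2[1]:.1f}]")
print(f"intervals overlap: {overlap}")
print(f"z for the difference = {z:.2f} (critical value 1.96)")
```

The intervals overlap because each uses the full standard error of 4, while the test of the difference uses the combined standard error $(16+16)^{1/2} \approx 5.66$, which is smaller than 4 + 4.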

One should examine the confidence interval for the difference explicitly. Even if the confidence
intervals are overlapping, it is hard to find the exact overall confidence level. However, the sum of
individual confidence levels can serve as an upper limit. This is evident from the fact that
$P(A \text{ and } B) \le P(A) + P(B)$.

Numerical examples for construction of confidence intervals are given in The Statistical Tables
section.

Further Reading:
Cohen J., Statistical Power Analysis for the Behavioral Sciences, L. Erlbaum Associates, 1988.
Kraemer H., and S. Thiemann, How Many Subjects? Provides basic sample size tables, explanations, and power analysis.
Murphy K., and B. Myors, Statistical Power Analysis, L. Erlbaum Associates, 1998. Provides a simple and general sample size determination
for hypothesis tests.
Newcombe R., Interval estimation for the difference between independent proportions: Comparison of eleven methods, Statistics in Medicine,
17, 873-890, 1998.
Hahn G. and W. Meeker, Statistical Intervals: A Guide for Practitioners, Wiley, 1991.
Schenker N., and J. Gentleman, On judging the significance of differences by examining the overlap between confidence intervals, The
American Statistician, 55(2), 135-139, 2001.

What Is the Margin of Error?

Estimation is the process by which sample data are used to indicate the value of an unknown
quantity in a population.

Results of estimation can be expressed as a single value, known as a point estimate; or a range of
values, referred to as a confidence interval.

Whenever we use point estimation, we calculate the margin of error associated with that point
estimate. For example, for the estimation of the population proportion, by the means of sample
proportion (p), the margin of error is calculated often as follows:

$$\pm 1.96 \sqrt{p(1-p)/n}$$

In newspapers and television reports on public opinion polls, the margin of error often appears in a
small font at the bottom of a table or screen. However, reporting the amount of error only is not
informative enough by itself; what is missing is the degree of confidence in the findings. The
more important missing piece of information is the sample size n; that is, how many people
participated in the survey, 100 or 100000? By now, you know well that the larger the sample size
the more accurate is the finding, right?

The reported margin of error is the margin of "sampling error". There are many non-sampling errors
that can and do affect the accuracy of polls. Here we talk about sampling error. Because
subgroups might have a sampling error larger than that of the whole group, one must include the
following statement in the report:

"Other sources of error include, but are not limited to, individuals refusing to participate
in the interview and inability to connect with the selected number. Every feasible effort
was made to obtain a response and reduce the error, but the reader (or the viewer)
should be aware that some error is inherent in all research."

If you have a yes/no question in a survey, you probably want to calculate a proportion p of Yes's
(or No's). In a simple random sample survey, the variance of p is p(1-p)/n, ignoring the finite
population correction, for large n, say over 30. Now a 95% confidence interval is

$$p - 1.96\sqrt{p(1-p)/n}, \qquad p + 1.96\sqrt{p(1-p)/n}.$$

A conservative interval can be calculated, since p(1-p) takes its maximum value when p = 1/2.
Replace 1.96 by 2, put p = 1/2, and you have a 95% conservative confidence interval of $p \pm 1/\sqrt{n}$. This
approximation works well as long as p is not too close to 0 or 1. This useful approximation allows
you to calculate approximate 95% confidence intervals.
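The two margins can be compared side by side. The sketch below is illustrative; the function names and the poll figures (42% "Yes" among 1000 respondents) are hypothetical.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Margin of error for a sample proportion p from a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)


def conservative_margin(n):
    """Conservative 95% margin: z rounded up to 2 and p set to 1/2, giving 1/sqrt(n)."""
    return 1 / math.sqrt(n)


# Hypothetical poll: 42% answered "Yes" out of 1000 respondents.
p, n = 0.42, 1000
moe = margin_of_error(p, n)
print(f"95% margin of error: +/- {moe:.4f}")
print(f"95% interval: [{p - moe:.4f}, {p + moe:.4f}]")
print(f"conservative margin: +/- {conservative_margin(n):.4f}")
```

The conservative margin depends only on n, which is why it is convenient for quick checks of published poll results.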

For continuous random variables, such as the estimation of the population mean $\mu$, the margin of
error is calculated often as follows:

$$\pm 1.96\, S/\sqrt{n}.$$

The margin of error can be reduced by one or a combination of the following strategies:

1. Decreasing the confidence in the estimate -- an undesirable strategy since confidence relates to the
chance of drawing the wrong conclusion (i.e., it increases the Type I error).
2. Reducing the standard deviation -- something we cannot do since it is usually a static property of the
population.
3. Increasing the sample size -- this provides more information for a better decision.

You might like to use Descriptive Statistics JavaScript to check your computations, and Sample
Size Determination JavaScript at the design stage of your statistical investigation in decision
making with specific subjective requirements.

Further Reading
Levy P., and S. Lemeshow, Sampling of Populations: Methods and Applications, Wiley, 1999.

Bias Reduction Techniques: Bootstrapping and Jackknifing

Some inferential statistical techniques do not require distributional assumptions about the statistics
involved. These modern non-parametric methods use large amounts of computation to explore the
empirical variability of a statistic, rather than making a priori assumptions about this variability, as is
done in the traditional parametric t- and z- tests.

Bootstrapping: The bootstrapping method obtains an estimate by combining the estimates computed from each of
many sub-samples of a data set. Often M samples of T observations each are drawn randomly,
with replacement, from the original data set of size n, where T is less than n.

Jackknife Estimator: A jackknife estimator creates a series of estimates from a single data set by
computing the statistic repeatedly on the data set, leaving one data value out each time. This
produces a mean estimate of the parameter and a standard deviation of the estimates of the
parameter.

Monte Carlo simulation allows for the evaluation of the behavior of a statistic when its
mathematical analysis is intractable. Bootstrapping and jackknifing allow inferences to be made
from a sample when traditional parametric inference fails. These techniques are especially useful
to deal with statistical problems, such as small sample size, statistics with no well-developed
distributional theory, and parametric inference condition violations. Both are computer intensive.
Bootstrapping means you take repeated samples from a sample and then make statements about
a population. Bootstrapping entails sampling-with-replacement from a sample. Jackknifing involves
systematically doing n steps, omitting 1 case from a sample at a time, or, more generally, n/k
steps of omitting k cases; computations that compare "included" vs. "omitted" can be used
(especially) to reduce the bias of estimation. Both have applications in reducing bias in estimations.
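Both resampling schemes can be sketched with the standard library. This is an illustration under stated assumptions, not the original site's procedure: the function names and the seed are hypothetical, the bootstrap resample size defaults to the full sample size (the parameter `size` can be set to a smaller T), and the jackknife standard error and bias correction use the conventional textbook formulas, which the text above does not spell out.

```python
import math
import random
import statistics


def bootstrap_estimates(data, statistic, m=1000, size=None, seed=0):
    """Compute `statistic` on m samples drawn with replacement from data."""
    rng = random.Random(seed)
    size = len(data) if size is None else size
    return [statistic(rng.choices(data, k=size)) for _ in range(m)]


def jackknife(data, statistic):
    """Leave-one-out estimates: their mean, jackknife SE, and bias-corrected estimate."""
    n = len(data)
    loo = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    mean_loo = sum(loo) / n
    se = math.sqrt((n - 1) / n * sum((t - mean_loo) ** 2 for t in loo))
    bias_corrected = n * statistic(data) - (n - 1) * mean_loo
    return mean_loo, se, bias_corrected


data = [46, 48, 38, 45, 47, 58, 44, 45, 43, 44]

boot = bootstrap_estimates(data, statistics.mean, m=1000)
print(f"bootstrap: mean of estimates = {statistics.mean(boot):.2f}, "
      f"SD of estimates = {statistics.stdev(boot):.2f}")

jk_mean, jk_se, jk_corrected = jackknife(data, statistics.mean)
print(f"jackknife: mean = {jk_mean:.2f}, SE = {jk_se:.2f}, "
      f"bias-corrected = {jk_corrected:.2f}")
```

For the sample mean, the jackknife standard error coincides with the familiar S divided by the square root of n, which makes the mean a convenient statistic for checking an implementation before applying it to statistics without a simple formula, such as the median.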

Resampling -- including the bootstrap, permutation, and other non-parametric tests -- is a method
for hypothesis testing, confidence limits, and other applied problems in statistics and probability. It
involves no formulas or tables.

Following the first publication of the general technique (and the bootstrap) in 1969 by Julian Simon
and subsequent independent development by Bradley Efron, resampling has become an
alternative approach for testing hypotheses.

There are other findings: "The bootstrap started out as a good notion in that it presented, in theory,
an elegant statistical procedure that was free of distributional conditions. In practice the bootstrap
technique doesn't work very well, and the attempts to modify it make it more complicated and more
confusing than the parametric procedures that it was meant to replace."

While resampling techniques may reduce the bias, they achieve this at the expense of increase in
variance. The two major concerns are:

1. The loss in accuracy of the estimate as measured by variance can be very large.
2. The dimension of the data affects drastically the quality of the samples and therefore the estimates.

Further Readings:
Young G., Bootstrap: More than a Stab in the Dark?, Statistical Science, l9, 382-395, 1994. Provides the pros and cons on the bootstrap
methods.
Yatracos Y., Assessing the quality of bootstrap samples and of the bootstrap estimates obtained with finite resampling, Statistics and
Probability Letters, 59, 281-292, 2002.

Prediction Intervals

In many applications of business statistics, such as forecasting, we are interested in the construction of
a statistical interval for a random variable, rather than a parameter of a population distribution.

Tchebysheff's inequality is often used to put bounds on the probability that a random
variable X will fall within k > 1 standard deviations of the mean $\mu$, for any probability distribution. In other
words:

$$P[\,|X - \mu| \ge k\sigma\,] \le 1/k^2, \quad \text{for any } k > 1.$$

The symmetric property of Tchebysheff's inequality is useful; e.g., in constructing control limits in the
quality control process. However, the limits are very conservative due to lack of knowledge about the
underlying distribution.

The above bounds can be improved (i.e., become tighter) if we have some knowledge about the
population distribution. For example, if the population is homogeneous; that is, its distribution is
unimodal; then,

$$P[\,|X - \mu| \ge k\sigma\,] \le 1/(2.25k^2), \quad \text{for any } k > 1.$$

The above inequality is known as the Camp-Meidell inequality.
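To see how conservative these bounds are, they can be tabulated next to the exact two-sided tail probability of a normal distribution. The sketch below is illustrative; the function names are hypothetical, and the normal column is included only as a reference case for a distribution whose shape is fully known.

```python
import math


def chebyshev_bound(k):
    """Tchebysheff upper bound on P(|X - mu| >= k*sigma), any distribution."""
    return 1 / k ** 2


def camp_meidell_bound(k):
    """Camp-Meidell upper bound for a unimodal distribution."""
    return 1 / (2.25 * k ** 2)


def normal_two_sided_tail(k):
    """Exact P(|Z| >= k) for a standard normal variable."""
    return math.erfc(k / math.sqrt(2))


print(" k   Tchebysheff  Camp-Meidell  Normal")
for k in (1.5, 2.0, 2.5, 3.0):
    print(f"{k:3.1f}  {chebyshev_bound(k):11.4f}  "
          f"{camp_meidell_bound(k):12.4f}  {normal_two_sided_tail(k):6.4f}")
```

The more that is known about the distribution, the tighter the bound: the unimodal bound is 2.25 times smaller than the distribution-free one, and the exact normal probability is smaller still.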

Now, let X be a random variable distributed normally with estimated mean $\bar{x}$ and standard deviation S.
Then a prediction interval for a future observation of X, centered at the sample mean, with
$100(1-\alpha)\%$ confidence level is:

$$\bar{x} \pm t_{\alpha/2} \times S \times (1 + 1/n)^{1/2}.$$

This is the range of a random variable with $100(1-\alpha)\%$ confidence, using the t-table. Relaxing the
normality condition for this prediction interval requires a large sample size, say n over 30.


Further Readings:
Grant E., and R. Leavenworth, Statistical Quality Control, McGraw-Hill, 1996.
Ryan T., Statistical Methods for Quality Improvement, John Wiley & Sons, 2000. A very good book for a starter.

What Is a Standard Error?

For statistical inference, namely statistical testing and estimation, one needs to estimate the
population's parameter(s). Estimation involves the determination, with a possible error due to
sampling, of the unknown value of a population parameter, such as the proportion having a specific
attribute or the average value m of some numerical measurement. To express the accuracy of the
estimates of population characteristics, one must also compute the standard errors of the
estimates. These are measures of accuracy that determine the possible errors arising from the fact
that the estimates are based on random samples from the entire population, and not on a complete
population census.

Standard error is a statistic indicating the accuracy of an estimate. That is, it lets us assess how
different the estimate (such as $\bar{x}$) is likely to be from the population parameter (such as $\mu$). It is,
therefore, the standard deviation of the sampling distribution of the estimator, such as $\bar{x}$. The
following is a collection of standard errors for the widely used statistics:

Standard Error for the Mean is: $S/n^{1/2}$.

As one expects, the standard error decreases as the sample size increases. However, the
standard deviation of the estimate decreases by a factor of $n^{1/2}$, not n. For example, if you
wish to reduce the error by 50%, the sample size must be 4 times n, which is expensive.
Therefore, as an alternative to increasing sample size, one may reduce the error by
obtaining "quality" data that provide a more accurate estimate.
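
The square-root effect can be checked numerically. The sketch below uses a hypothetical sample standard deviation and sample size (both assumed purely for illustration) and shows that quadrupling n halves the standard error of the mean.

```python
import math

def standard_error_of_mean(s: float, n: int) -> float:
    """Standard error of the sample mean: S / sqrt(n)."""
    return s / math.sqrt(n)

s = 12.0   # hypothetical sample standard deviation
n = 25     # hypothetical sample size

se_n = standard_error_of_mean(s, n)
se_4n = standard_error_of_mean(s, 4 * n)

print(f"SE with n = {n}: {se_n:.2f}")
print(f"SE with n = {4 * n}: {se_4n:.2f}")
```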

For a finite population of size N, the standard error of the sample mean of size n, is:

$$S \times [(N - n)/(nN)]^{1/2}.$$

Standard Error for sample Variance $S^2$ is: $S^2 / [(n-1)/2]^{1/2}$

Standard Error for the Multiplication of Two Independent Means $\bar{x}_1 \times \bar{x}_2$ is:

$$\{\bar{x}_1^2 S_2^2/n_2 + \bar{x}_2^2 S_1^2/n_1\}^{1/2}.$$

Standard Error for Two Dependent Means $\bar{x}_1 \pm \bar{x}_2$ is:

$$\{S_1^2/n_1 + S_2^2/n_2 \pm 2r \times [(S_1^2/n_1)(S_2^2/n_2)]^{1/2}\}^{1/2}.$$

Standard Error for the Proportion P is:

$$[P(1-P)/n]^{1/2}$$

Standard Error for $P_1 \pm P_2$, Two Dependent Proportions is:

$$\{[P_1 + P_2 - (P_1 - P_2)^2] / n\}^{1/2}.$$

Standard Error of the Proportion (P) from a finite population is:

$$[P(1-P)(N - n)/(nN)]^{1/2}.$$

The two finite-population formulas above are frequently used when we wish to compare a
sub-sample of size n with a larger sample of size N, which contains the sub-sample. In such
a comparison, it would be wrong to treat the two samples "as if" they were two independent
samples. For example, in comparing the two means one may use the t-statistic but with the
standard error:

$$S_N [(N - n)/(nN)]^{1/2}$$

as its denominator. Similar treatment is needed for proportions.

Standard Error of the Slope (m) in Linear Regression is:

$$S_{res} / S_{xx}^{1/2}, \quad \text{where } S_{res} \text{ is the residuals' standard deviation.}$$

Standard Error of the Intercept (b) in Linear Regression is:

$$S_{res} [(S_{xx} + n \times \bar{x}^2) / (n \times S_{xx})]^{1/2}.$$

Standard Error of the Predicted Value using a Linear Regression is:

$$S_y (1 - r^2)^{1/2}.$$

The term $(1 - r^2)^{1/2}$ is called the coefficient of alienation. Therefore if r = 0, the error of
prediction is $S_y$, as expected.

Standard Error of the Linear Regression is:

$$S_y (1 - r^2)^{1/2}.$$

Note that if r = 0, then the standard error reaches its maximum possible value, which is the
standard deviation in Y.

Stability of an estimator: An estimator is stable if two different samples of the same
size produce two estimates having a "small" absolute difference. The stability of an estimator is
measured by its reliability:

$$\text{Reliability of an estimator} = 1 / (\text{its standard error})^2$$

The larger the standard error, the less reliable is the estimate. Reliability of estimators is often used
to select the "best" estimator among all unbiased estimators.

Sample Size Determination

At the planning stage of a statistical investigation, the question of sample size (n) is critical and
should not be taken lightly. To take a larger sample than is
needed to achieve the desired results is wasteful of resources, whereas very small samples often
lead to results that are of no practical use for making good decisions. The main objective is to obtain both a
desirable accuracy and a desirable confidence level with minimum cost.

Students sometimes ask me, what fraction of the population do you need for good estimation? I
answer, "It's irrelevant; accuracy is determined by sample size." This answer has to be modified if
the sample is a sizable fraction of the population.

The confidence level of conclusions drawn from a set of data depends on the size of the data set.
The larger the sample, the higher is the associated confidence. However, larger samples also
require more effort and resources. Thus, your goal must be to find the smallest sample size that
will provide the desirable confidence.

For an item scored 0 or 1, for no or yes, the standard error (SE) of the estimated proportion p,
based on your random sample observations, is given by:

$$SE = [p(1-p)/n]^{1/2}$$

where p is the proportion obtaining a score of 1, and n is the sample size. This SE is the standard
deviation of the range of possible estimate values.

The SE is at its maximum when p = 0.5, therefore the worst case scenario occurs when 50% are
yes, and 50% are no.

Under this extreme condition, the sample size, n, can then be expressed, rounded to the nearest
integer, as:

$$n = 0.25/SE^2$$

To have some notion of the sample size, for example for SE to be 0.01 (i.e. 1%), a sample size of
2500 will be needed; 2%, 625; 3%, 278; 4%, 156; 5%, 100.
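
The worst-case figures quoted above correspond to evaluating 0.25/SE² and rounding to the nearest integer. A minimal sketch (the function name is an illustrative assumption):

```python
def worst_case_sample_size(se: float) -> int:
    """Sample size for a target standard error when p = 0.5 (worst case).

    Uses n = 0.25 / SE^2, rounded to the nearest integer.
    """
    return round(0.25 / se**2)

for se in (0.01, 0.02, 0.03, 0.04, 0.05):
    print(f"SE = {se:.0%}: n = {worst_case_sample_size(se)}")
```

If the target SE must never be exceeded, one would round up instead; that is a design choice rather than something the text prescribes.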

Note, incidentally, that as long as the sample is a small fraction of the total population, the actual
size of the population is entirely irrelevant for the purposes of this calculation.

Pilot Studies: When the needed estimates for sample size calculation are not available from an
existing database, a pilot study is needed for adequate estimation with a given precision. A pilot, or
preliminary, sample must be drawn from the population, and the statistics computed from this
sample are used in determination of the sample size. Observations used in the pilot sample may
be counted as part of the final sample, so that the computed sample size minus the pilot sample
size is the number of observations needed to satisfy the total sample size requirement.

Sample Size with Acceptable Absolute Precision: The following presents a widely used method for
determining the sample size required for estimating a population mean or proportion.

Let us suppose we want an interval that extends d units on either side of the estimator. We can write

$$d = \text{Absolute Precision} = (\text{reliability coefficient}) \times (\text{standard error}) = Z_{\alpha/2} \times (S/n^{1/2})$$

Suppose, based on a pilot sample of size n, the estimated proportion is p; then the required
sample size with the absolute error size not exceeding d, with $1-\alpha$ confidence, is:

$$[t^2 \, n \, p(1-p)] / [t^2 \, p(1-p) - d^2 (n-1)],$$

where $t = t_{\alpha/2}$ is the value taken from the t-table with parameter d.f. $= \nu = n-1$, corresponding to
the desired $1-\alpha$ confidence interval.

For large pilot sample sizes (n), say over 30, the simplest sample size determinate is:

$$[(Z_{\alpha/2})^2 S^2] / d^2 \quad \text{for the mean } \mu$$

$$[(Z_{\alpha/2})^2 \, p(1-p)] / d^2 \quad \text{for the proportion,}$$

where d is the desirable margin of error (i.e., the absolute error), which is the half-length of the
confidence interval with $100(1-\alpha)\%$ confidence level.
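
As a rough illustration of the two large-sample formulas, the sketch below obtains the normal critical value from Python's standard library and rounds the result up to the next whole observation. The rounding-up rule, the function names, and the example inputs are assumptions made for illustration only.

```python
import math
from statistics import NormalDist

def z_upper(alpha: float) -> float:
    """Critical value Z_{alpha/2}: upper alpha/2 point of the standard normal."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

def n_for_mean(s: float, d: float, alpha: float = 0.05) -> int:
    """Sample size for estimating a mean within +/- d: (Z^2 * S^2) / d^2."""
    return math.ceil((z_upper(alpha) * s / d) ** 2)

def n_for_proportion(p: float, d: float, alpha: float = 0.05) -> int:
    """Sample size for estimating a proportion within +/- d: Z^2 * p(1-p) / d^2."""
    return math.ceil(z_upper(alpha) ** 2 * p * (1.0 - p) / d**2)

# Hypothetical pilot results: S = 15 with margin d = 2; p = 0.5 with margin d = 0.03.
print(n_for_mean(s=15.0, d=2.0))          # 95% confidence
print(n_for_proportion(p=0.5, d=0.03))    # 95% confidence
```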

Sample Size with Acceptable Type I and Type II Errors: One may use the following sample size
determinate, which is based on the size of Type I and Type II errors:

$$2(Z_{\alpha/2} + Z_{\beta/2})^2 S^2 / d^2,$$

where $\alpha$ and $\beta$ are the desirable Type I and Type II errors, respectively, $S^2$ is the variance obtained
from the pilot run, and d is the difference between the null and alternative ($\mu_0 - \mu_a$).

Sample Size with Acceptable Relative Precision: You may use the following sample size
determinate for a desirable relative error D in %, which requires an estimate of the coefficient of
variation (C.V. in %) from a pilot sample with size over 30:

$$[(Z_{\alpha/2})^2 (C.V.)^2] / D^2$$
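
A minimal sketch of the relative-precision formula, with both the coefficient of variation and the relative error expressed in percent. The function name, the rounding-up rule, and the example values (C.V. = 40%, D = 5%) are hypothetical.

```python
import math
from statistics import NormalDist

def n_for_relative_precision(cv_percent: float, d_percent: float,
                             alpha: float = 0.05) -> int:
    """Sample size (Z_{alpha/2}^2 * CV^2) / D^2, rounded up (an assumption)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.ceil((z * cv_percent / d_percent) ** 2)

# Hypothetical pilot estimate: C.V. = 40%, desired relative error D = 5%.
print(n_for_relative_precision(40.0, 5.0))
```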

Sample Size Based on the Null and an Alternative: One may use the power of the test to determine
the sample size. The functional relation of the power and the sample size is known as the
operating characteristic curve. On this curve, as sample size increases, the power function
increases rapidly. Let d be such that:

$$\mu_a = \mu_0 + d$$

is an alternative to represent departure from the null hypothesis. We wish to be reasonably
confident to find evidence against the null, if in fact the particular alternative holds. That is, the Type II
error $\beta$ is the probability of failing to find evidence at least at level $\alpha$ when the alternative holds.
This implies

$$\text{Required sample size} = (z_1 + z_2)^2 S^2 / d^2$$

where $z_1 = |\text{mean} - \mu_0| / SE$, $z_2 = |\text{mean} - \mu_a| / SE$, the mean is the current estimate for $\mu$, and S is
the current estimate for $\sigma$.

All of the above sample size determinates could also be used for estimating the mean of any
unimodal population, with discrete or continuous random variables, provided the pilot run size (n) is
larger than (say) 30.

In estimating the sample size, when the standard deviation is not known, instead of S one may
use 1/4 of the range, for sample sizes over 30, as a "good" estimate for the standard deviation. It is a
good practice to compare the result with IQR/1.349.

One may extend the sample size determination to other useful statistics, such as correlation
coefficient (r) based on acceptable Type I and Type II errors:

$$2 + [(Z_{\alpha/2} + Z_{\beta/2}(1 - r^2)^{1/2}) / r]^2$$

provided r is not equal to -1, 0, or 1.

The aim of applying any one of the above sample size determinates is to improve your pilot
estimates at feasible costs.

You might like to use Sample Size Determination JavaScript to check your computations.

Further Reading:
Kish L., Survey Sampling, Wiley, 1995.
Murphy K., and B. Myors, Statistical Power Analysis, L. Erlbaum Associates, 1998. Provides a simple and general sample size determination
for hypothesis tests.

Revising the Expected Value and the Variance

Averaging Variances: What is the mean variance of k variances without regard to differences in
their sample sizes? The answer is simply:

$$\text{Average of Variances} = [\sum S_i^2] / k$$

However, what is the variance of all k groups combined? The answer must consider the sample
size $n_i$ of the ith group:

$$\text{Combined Group Variance} = \sum n_i [S_i^2 + d_i^2] / N,$$

where $d_i = \text{mean}_i - \text{grand mean}$, and $N = \sum n_i$, for all i = 1, 2, ..., k.

Notice that the above formula allows us to split up the total variance into its two component parts.
This splitting process permits us to determine the extent to which the overall variation is inflated by
the difference between group means; that is, what the variation would be if all groups had the same
mean. ANOVA is a well-known application of this concept where the equality of several means is
tested.
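
The decomposition can be verified numerically. The sketch below uses three small hypothetical groups and assumes each group variance is computed with divisor n_i (the population form), in which case the combined-variance formula reproduces the variance of the pooled data exactly.

```python
from statistics import mean, pvariance

# Hypothetical data for three groups of unequal size.
groups = [[2.0, 4.0, 6.0], [10.0, 12.0], [20.0, 21.0, 22.0, 25.0]]

N = sum(len(g) for g in groups)
grand_mean = sum(sum(g) for g in groups) / N

# Within-group part: sum of n_i * S_i^2 / N (variances with divisor n_i).
within = sum(len(g) * pvariance(g) for g in groups) / N
# Between-group part: sum of n_i * d_i^2 / N, with d_i = mean_i - grand mean.
between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups) / N

combined = within + between
direct = pvariance([x for g in groups for x in g])  # variance of pooled data

print(f"within = {within:.4f}, between = {between:.4f}")
print(f"combined = {combined:.4f}, direct = {direct:.4f}")
```

The "between" term is the amount by which differences among group means inflate the overall variation; it would be zero if all groups had the same mean.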

Subjective Mean and Variance: In many applications, we saw how to make decisions based on
objective data; however, an informed decision-maker might also have subjective input and wish
to combine the two sources of information.

Application: Suppose the following information is available from two independent sources:

Revising the Expected Value and the Variance

Estimate Source     Expected value      Variance
Sales manager       $\mu_1 = 110$       $\sigma_1^2 = 100$
Market survey       $\mu_2 = 70$        $\sigma_2^2 = 49$

The combined expected value is:

$$[\mu_1/\sigma_1^2 + \mu_2/\sigma_2^2] / [1/\sigma_1^2 + 1/\sigma_2^2]$$

The combined variance is:

$$2 / [1/\sigma_1^2 + 1/\sigma_2^2]$$

For our application, using the above tabular information, the combined estimate of expected sales
is 83.15 units with combined variance of 65.77.
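
The figures in this application can be reproduced with a few lines of Python. The sketch implements the two formulas exactly as stated above, including the numerator 2 in the combined variance; the function name is an illustrative assumption.

```python
def combine_two_estimates(m1: float, v1: float, m2: float, v2: float):
    """Combine two independent estimates, weighting each by 1/variance."""
    total_precision = 1.0 / v1 + 1.0 / v2
    combined_mean = (m1 / v1 + m2 / v2) / total_precision
    combined_variance = 2.0 / total_precision  # numerator 2, as in the text
    return combined_mean, combined_variance

# Sales manager: mean 110, variance 100. Market survey: mean 70, variance 49.
combined_mean, combined_variance = combine_two_estimates(110.0, 100.0, 70.0, 49.0)
print(f"{combined_mean:.2f} {combined_variance:.2f}")  # 83.15 65.77
```

Because the market survey has the smaller variance, the combined estimate lies closer to 70 than to 110.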

You might like to use Revising the Mean and Variance JavaScript in performing some numerical
experimentation. You may apply it for validating the above example and for a deeper
understanding of the concept where more than two sources of information are to be combined.

Subjective Assessment of Several Estimates Based on Relative Precision

In many cases, we may wish to compare several estimates of the same parameter. The
simplest approach is to measure the closeness among the estimates in an attempt to
determine that at least one of the estimates is more than r times the parameter away from the
parameter, where r is a subjective, non-negative number less than one.

You might like to use Subjective Assessment of Estimates JavaScript to isolate any
inaccurate estimate. By repeating the same process you might be able to remove all
inaccurate estimates.

Further Reading:
Tsao H. and T. Wright, On the maximum ratio: A tool for assisting inaccuracy assessment, The American Statistician, 37(4), 1983.

Bayesian Statistical Inference: An Introduction

Statistical inference describes the procedures by which we use the observed data to draw
conclusions about the population from which the data came or about the process by which
the data were generated. Our assumption is that there is an unknown process that generates
the data we have and that this process can be described by a probability distribution, which,
in turn, can be characterized by some unknown parameters. For instance, for a normal
distribution the unknown parameters are $\mu$ and $\sigma^2$.

Broadly speaking, statistical inference can be classified under two headings: classical
inference and Bayesian inference. Classical statistical inference is based on two premises:

1. The sample data constitute the only relevant information.

2. The construction and assessment of the different procedures for inference are based
on long-run behavior under essentially similar circumstances.

In Bayesian inference we combine sample information with prior information. Suppose that
we draw a random sample $x_1, x_2, \ldots, x_n$ of size n from a normal population.

In statistical inference we take the sample mean as our estimate of $\mu$. Its variance is $\sigma^2/n$.
The inverse of this variance is known as the sample precision. Thus the sample precision is
$n/\sigma^2$.

In the Bayesian inference we have prior information on $\mu$. This is expressed in terms of a
probability distribution known as the prior distribution. Suppose that the prior distribution is
normal with mean $\mu_0$ and variance $\sigma_0^2$, that is, precision $1/\sigma_0^2$. We now combine this with
the sample information to obtain what is known as the posterior distribution of $\mu$. This
distribution can be shown to be normal. Its mean is a weighted average of the sample mean
and the prior mean, weighted by the sample precision and prior precision, respectively. Thus,

$$\text{Posterior mean} = (W_1 \bar{x} + W_2 \mu_0) / (W_1 + W_2)$$

$$\text{Posterior variance} = 1 / (W_1 + W_2)$$

where

$$W_1 = \text{Sample precision} = n/S^2, \quad \text{and} \quad W_2 = \text{Prior precision} = 1/\sigma_0^2$$

Also, the precision (or inverse of the variance) of the posterior distribution of $\mu$ is $W_1 + W_2$,
that is, the sum of the sample precision and prior precision.

The posterior mean will lie between the sample mean and the prior mean. The posterior
variance will be less than both the sample and prior variances.
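
A small numerical sketch of this precision-weighted update follows. It takes the prior precision as one over the prior variance, as in the definition of the prior distribution above; the function name and all input values are hypothetical.

```python
def normal_posterior(sample_mean: float, s2: float, n: int,
                     prior_mean: float, prior_var: float):
    """Posterior mean and variance for a normal mean with a normal prior."""
    w1 = n / s2            # sample precision
    w2 = 1.0 / prior_var   # prior precision
    post_mean = (w1 * sample_mean + w2 * prior_mean) / (w1 + w2)
    post_var = 1.0 / (w1 + w2)
    return post_mean, post_var

# Hypothetical inputs: n = 4 observations with mean 52 and S^2 = 16;
# prior belief centered at 50 with variance 4.
post_mean, post_var = normal_posterior(52.0, 16.0, 4, 50.0, 4.0)
print(post_mean, post_var)
```

Here the two precisions happen to be equal, so the posterior mean falls midway between the sample mean and the prior mean, and the posterior variance is smaller than both the variance of the sample mean (16/4 = 4) and the prior variance (4).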

This Web site does not discuss Bayesian inference in depth because this would take us into a lot more
detail than we intend to cover. However, the basic notion of combining the sample mean and
prior mean in inverse proportion to their variances is both interesting and useful.

You may like to use the Bayesian Statistical Inference JavaScript for checking your
computations and performing some experiments.

Further Reading:
Ghosh M., and G. Meeden, Bayesian Methods for Finite Population Sampling, Chapman & Hall/CRC, 1997.

Managing the Producer's or the Consumer's Risk

The logic behind a statistical test of hypothesis is similar to the following logic. Draw two lines
on a paper and determine whether they are of different lengths. You compare them and
say, "Well, certainly they are not equal. Therefore they must be of different lengths." By
rejecting equality, that is, the null hypothesis, you assert that there is a difference.

The power of a statistical test is best explained by the overview of the Type I and Type II
errors. The following matrix shows the basic representation of these errors.

[Figure: The Type-I and Type-II Errors]

As indicated in the above matrix, a Type-I error occurs when, based on your data, you reject
the null hypothesis when in fact it is true. The probability of a Type-I error is the level of
significance of the test of hypothesis and is denoted by $\alpha$.

Type-I error is often called the producer's risk that consumers reject a good product or
service indicated by the null hypothesis. That is, a producer introduces a good product; in
doing so, he or she takes a risk that consumers will reject it.

A Type-II error occurs when you do not reject the null hypothesis when it is in fact false. The
probability of a Type-II error is denoted by $\beta$. The quantity $1 - \beta$ is known as the Power of a
Test. A Type-II error can be evaluated for any specific alternative hypothesis stated in the
form "Not Equal to" as a competing hypothesis.

Type-II error is often called the consumer's risk for not rejecting possibly a worthless
product or service indicated by the null hypothesis.

Students often raise questions, such as what are the 'right' confidence intervals, and why do
most people use the 95% level? The answer is that the decision-maker must consider both
the Type I and II errors and work out the best tradeoff. Ideally one wishes to reduce the
probability of making these types of error; however, for a fixed sample size, we cannot reduce
one type of error without at the same time increasing the probability of the other type of error.
The only way to reduce the probabilities of both types of error simultaneously is to increase
the sample size. That is, by having more information one makes a better decision.

The following example highlights this concept. An electronics firm, Big Z, manufactures and
sells a component part to a radio manufacturer, Big Y. Big Z consistently maintains a
component part failure rate of 10% per 1000 parts produced. Here Big Z is the producer and
Big Y is the consumer. Big Y, for reasons of practicality, will test a sample of 10 parts out of lots
of 1000. Big Y will adopt one of two rules regarding lot acceptance:

Rule 1: Accept lots with one or fewer defectives in the sample; that is, the sample has
either 0 or 1 defective.
Rule 2: Accept lots with two or fewer defectives in the sample; that is, the sample has
either 0, 1, or 2 defectives.

On the basis of the binomial distribution, P(0 or 1) is 0.7361. This means that, with a
defective rate of 0.10, Big Y will accept 74% of tested lots and will reject 26% of the lots
even though they are good lots. The 26% is the producer's risk or the $\alpha$ level. This $\alpha$ level is
analogous to a Type I error: rejecting a true null, or, in other words, rejecting a good lot. In
this example, for illustration purposes, the lot represents a null hypothesis. The rejected lot
goes back to the producer; hence, producer's risk. If Big Y is to take Rule 2, then the
producer's risk decreases. P(0, 1, or 2) is 0.9298; therefore, Big Y will accept 93%
of all tested lots, and 7% will be rejected, even though the lot is acceptable. The primary
reason for this is that, although the probability of a defective is 0.10, Big Y through Rule 2
allows for a higher defective acceptance rate. Big Y thereby increases its own risk (consumer's risk),
as stated previously.
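
The acceptance probabilities under the two rules follow directly from the binomial distribution with a sample of 10 parts and a defective rate of 0.10. A minimal sketch (the function name is an illustrative assumption):

```python
from math import comb

def prob_accept(max_defectives: int, n: int = 10, p: float = 0.10) -> float:
    """P(number of defectives in a sample of n is at most max_defectives)."""
    return sum(comb(n, k) * p**k * (1.0 - p) ** (n - k)
               for k in range(max_defectives + 1))

rule1 = prob_accept(1)   # accept with 0 or 1 defective
rule2 = prob_accept(2)   # accept with 0, 1, or 2 defectives

print(f"Rule 1: accept {rule1:.4f}, producer's risk {1 - rule1:.4f}")
print(f"Rule 2: accept {rule2:.4f}, producer's risk {1 - rule2:.4f}")
```

Moving from Rule 1 to Rule 2 cuts the producer's risk from roughly 26% to roughly 7%, at the price of a more lenient acceptance rule for the consumer.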

Making Good Decisions: Given that there is a relevant profit (which could be negative) for
the outcome of your decision, and a prior probability (before testing) for the null hypothesis to
be true, the objective is to make a good decision. Let us denote the profits for each cell in the
decision table as \$a, \$b, \$c and \$d (column-wise), respectively. The expectation of profit is
$[\alpha a + (1-\alpha) b]$ if the null is true, and $[(1-\beta) c + \beta d]$ if it is false.

Now having a prior (i.e., before testing) subjective probability of p that the null is true, the
expected profit of your decision is:

$$\text{Net Profit} = [\alpha a + (1-\alpha) b]\,p + [(1-\beta) c + \beta d]\,(1-p) - \text{Sampling cost}$$

A good decision makes this profit as large as possible. To this end, we must suitably choose
the sample size and all other factors in the above profit function.
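
The profit function is easy to explore numerically. In the sketch below, the four cell profits, the error probabilities, the prior probability, and the sampling cost are all hypothetical values chosen only to show the arithmetic.

```python
def expected_net_profit(a: float, b: float, c: float, d: float,
                        alpha: float, beta: float,
                        p: float, sampling_cost: float) -> float:
    """Expected net profit given error rates and prior P(null is true) = p."""
    profit_if_null_true = alpha * a + (1.0 - alpha) * b
    profit_if_null_false = (1.0 - beta) * c + beta * d
    return (p * profit_if_null_true
            + (1.0 - p) * profit_if_null_false
            - sampling_cost)

# Hypothetical payoffs: a = -100, b = 500, c = 200, d = -400.
net = expected_net_profit(a=-100.0, b=500.0, c=200.0, d=-400.0,
                          alpha=0.05, beta=0.20, p=0.6, sampling_cost=50.0)
print(f"{net:.2f}")
```

In practice both the Type II error probability and the sampling cost depend on the sample size, so one would evaluate this function over a range of candidate designs and keep the most profitable one.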

Note that, since we are using a subjective probability expressing the strength of belief
assessment of the truthfulness of the null hypothesis, it is called a Bayesian Approach to
statistical decision making, which is a standard approach in decision theory.

You might like to use the Subjectivity in Hypothesis Testing JavaScript in performing some
numerical experimentation for validating the above assertions for a deeper understanding.

Further Reading:
Cochran W., Planning and Analysis of Observational Studies, Wiley, 1983.

Hypothesis Testing: Rejecting a Claim

To perform a hypothesis test, one must be very specific about the test one wishes to perform.
The null hypothesis must be clearly stated, and the data must be collected in a repeatable
manner. If there is any subjectivity, the results are technically not valid. All of the analyses,
including the sample size, significance level, the time, and the budget, must be planned in
advance, or else the user runs the risk of "data diving".

Hypothesis testing resembles a mathematical proof by contradiction. For example, for a Student's t test
comparing two groups, we assume that the two groups come from the same population
(same means, standard deviations, and in general same distributions). Then we do our best
to prove that this assumption is false. Rejecting H0 means either H0 is false, or a rare event
has occurred.

The real question in statistics is not whether a null hypothesis is correct, but whether it is
close enough to be used as an approximation.

[Figure: Test of Hypotheses]

In most statistical tests concerning $\mu$, we start by assuming the $\sigma^2$, and the higher moments,
such as skewness and kurtosis, are equal. Then, we hypothesize that the $\mu$'s are equal, which
is the null hypothesis.

The "null" often suggests no difference between group means, or no relationship between
quantitative variables, and so on.

Then we test with a calculated t-value. For simplicity, suppose we have a two-sided test. If
the calculated t is close to 0, we say "it is good", as we expected. If the calculated t is far from
0, we say, "the chance of getting this value of t, given my assumption that the populations are
statistically the same, is so small that I will not believe the assumption. We will say that the
populations are not equal; specifically the means are not equal."

As an example, sketch a normal distribution with mean $\bar{x}_1 - \bar{x}_2$ and standard deviation s. If
the null hypothesis is true, then the mean is 0. We calculate the 't' value, as per the equation.
We look up a "critical" value of t. The probability of calculating a t value more extreme ( + or - )
than this, given that the null hypothesis is true, is equal to or less than the $\alpha$ risk we used in
pulling the critical value from the table. Mark the calculated t, and critical t (both sides) on the
sketch of the distribution. Now, if the calculated t is more extreme than the critical value, we
say, "the chance of getting this t, by sheer chance, when the null hypothesis is true, is so
small that I would rather say the null hypothesis is false, and accept the alternative, that the
means are not equal." When the calculated value is less extreme than the critical value,
we say, "I could get this value of t by sheer chance. I cannot detect a difference in the means
of the two groups at the $\alpha$ significance level."

In this test, we need (among others) the condition that the population variances (i.e.,
treatment impacts on central tendency but not variability) are equal. However, this test is
robust to violations of that condition if n's are large and almost the same size. A counter
example would be to try a t-test between (11, 12, 13) and (20, 30, 40). The pooled and
unpooled tests both give t statistics of 3.10, but the degrees of freedom are different: d.f. = 4
(for pooled) or d.f. about 2 (for unpooled). Consequently the pooled test gives p = 0.036 and
the unpooled p = 0.088. We could go down to n = 2 and get something still more extreme.
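
The t statistics and degrees of freedom quoted for this counterexample can be reproduced with Python's standard library. The sketch below stops short of the p-values, since the t distribution's tail areas are not available in the standard library; those would come from a t-table or a statistical package. The unpooled degrees of freedom use the Welch-Satterthwaite approximation, which is an assumption about how "about 2" was obtained.

```python
import math
from statistics import mean, variance

x = [11.0, 12.0, 13.0]
y = [20.0, 30.0, 40.0]
n1, n2 = len(x), len(y)
v1, v2 = variance(x), variance(y)   # sample variances (divisor n - 1)
diff = mean(y) - mean(x)

# Pooled-variance t test (assumes equal population variances).
pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t_pooled = diff / math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
df_pooled = n1 + n2 - 2

# Unpooled t test with Welch-Satterthwaite degrees of freedom.
se2 = v1 / n1 + v2 / n2
t_unpooled = diff / math.sqrt(se2)
df_unpooled = se2**2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))

print(f"pooled:   t = {t_pooled:.2f}, d.f. = {df_pooled}")
print(f"unpooled: t = {t_unpooled:.2f}, d.f. = {df_unpooled:.2f}")
```

With equal sample sizes the two t statistics coincide; only the degrees of freedom, and hence the p-values, differ.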

More numerical examples with applications are given in The Statistical Tables section.

You might like to use Testing the Mean, and Testing the Variance in performing more of these
tests.

You might need to use Sample Size Determination JavaScript at the design stage of your
statistical investigation in decision making with specific subjective requirements.

Classical Approach to Testing Hypotheses

In this treatment there are two parties: One party (or a person) proposes the null hypothesis
(the claim). Another party proposes an alternative hypothesis. A significance level $\alpha$ and a
sample size n are agreed upon by both parties. The next step is to compute the relevant
statistic based on the null hypothesis and the random sample of size n. Finally, one
determines the rejection region. The conclusion based on this approach is as follows:

If the computed statistic falls within the rejection region, then Reject the null hypothesis;
otherwise Do Not Reject the null hypothesis (the claim).

You may ask: How do you determine the critical value (such as z-value) for the rejection
interval for one and two-tailed hypotheses? What is the rule?

First, you have to choose a significance level $\alpha$. Knowing that the null hypothesis is always
in "equality" form, the alternative hypothesis has one of the three possible
forms: "greater-than", "less-than", or "not equal to". The first two forms correspond to a one-tail
hypothesis while the last one corresponds to a two-tail hypothesis.

If your alternative is in the form of "greater-than", then z is the value that gives you an area to the right
tail of the distribution that is equal to $\alpha$.

If your alternative is in the form of "less-than", then z is the value that gives you an area to the left tail of
the distribution that is equal to $\alpha$.

If your alternative is in the form of "not equal to", then there are two z values, one positive and the other
negative. The positive z is the value that gives you an $\alpha/2$ area to the right tail of the distribution, while
the negative z is the value that gives you an $\alpha/2$ area to the left tail of the distribution.

The above rule can be generalized and implemented for determining the critical value for any test
of hypothesis. However, you must first master reading the statistical tables because, as you see, not all
tables in your textbook are presented in the same format.
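
As an alternative to table look-up, the rule for z critical values can be sketched in a few lines of Python using the standard library's inverse normal CDF. The function name and the labels used for the three forms of the alternative are illustrative assumptions.

```python
from statistics import NormalDist

def critical_z(alpha: float, alternative: str) -> tuple:
    """Critical z value(s) for a given significance level and alternative."""
    z = NormalDist()
    if alternative == "greater":      # area alpha in the right tail
        return (z.inv_cdf(1.0 - alpha),)
    if alternative == "less":         # area alpha in the left tail
        return (z.inv_cdf(alpha),)
    if alternative == "not equal":    # area alpha/2 in each tail
        upper = z.inv_cdf(1.0 - alpha / 2.0)
        return (-upper, upper)
    raise ValueError("alternative must be 'greater', 'less', or 'not equal'")

for alt in ("greater", "less", "not equal"):
    values = ", ".join(f"{v:.3f}" for v in critical_z(0.05, alt))
    print(f"{alt}: {values}")
```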

The Meaning and Interpretation of P-values (what the data say?)

The p-value, which directly depends on a given sample, attempts to provide a measure of the
strength of the results of a test for the null hypothesis, in contrast to a simple reject or do not reject
in the classical approach to the test of hypotheses. If the null hypothesis is true, and if the chance
of random variation is the only reason for sample differences, then the p-value is a quantitative
measure to feed into the decision-making process as evidence. The following table provides a
reasonable interpretation of p-values:

P-value                  Interpretation
$P < 0.01$               very strong evidence against H0
$0.01 \le P < 0.05$      moderate evidence against H0
$0.05 \le P < 0.10$      suggestive evidence against H0
$0.10 \le P$             little or no real evidence against H0

This interpretation is widely accepted, and many scientific journals routinely publish papers using
this interpretation for the result of a test of hypothesis.

For the fixed-sample size, when the number of realizations is decided in advance, the distribution
of p is uniform, assuming the null hypothesis is true. We would express this as $P(p \le x) = x$. That
means the criterion of $p \le 0.05$ achieves $\alpha$ of 0.05.

Understand that the distribution of p-values under the null hypothesis H0 is uniform, and thus does not
depend on a particular form of the statistical test. In a statistical hypothesis test, the p-value is the
probability of observing a test statistic at least as extreme as the value actually observed,
assuming that the null hypothesis is true. The value of p is defined with respect to a distribution.
Therefore, we could call it the "model-distribution hypothesis" rather than "the null hypothesis".
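
The uniformity of p-values under a true null hypothesis can be illustrated by simulation. The sketch below repeatedly draws samples from a population for which the null is true, computes an upper-tail z-test p-value for each, and reports the fraction of p-values at or below a few thresholds. The choice of a one-sided z test with known standard deviation, the sample size, the number of repetitions, and the seed are all assumptions made for illustration.

```python
import random
from statistics import NormalDist, mean

random.seed(12345)            # fixed seed so the run is repeatable
std_normal = NormalDist()

def upper_tail_p_value(sample, mu0: float, sigma: float) -> float:
    """One-sided z-test p-value: P(Z >= observed z) under the null."""
    z = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return 1.0 - std_normal.cdf(z)

# The null hypothesis (mu = 0, sigma = 1) is true for every simulated sample.
p_values = [
    upper_tail_p_value([random.gauss(0.0, 1.0) for _ in range(20)], 0.0, 1.0)
    for _ in range(5000)
]

def fraction_at_most(x: float) -> float:
    """Empirical estimate of P(p <= x)."""
    return sum(p <= x for p in p_values) / len(p_values)

for x in (0.05, 0.25, 0.50):
    print(f"P(p <= {x}) is approximately {fraction_at_most(x):.3f}")
```

Each reported fraction should be close to its threshold x, which is what a uniform distribution of p predicts.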

In short, the p-value is the probability, computed as if the null were true, of obtaining data at
least as extreme as those observed; a small value counts as evidence against the null.
The p-value is determined by the observed value; however, this makes it difficult to
even state the inverse of p.

Finally, since the p-values are random variables, one cannot compare several p-values for any
statistical conclusions (nor order them). This is a common mistake many people make; the
above table is not intended for such a comparison.

You might like to use The P-values for the Popular Distributions JavaScript.

Further Readings:
Arsham H., Kuiper's P-value as a Measuring Tool and Decision Procedure for the Goodness-of-fit Test, Journal of Applied Statistics, Vol. 15,
No.3, 131-135, 1988.
Good P., Resampling Methods: A Practical Guide to Data Analysis, Springer Verlag, 1999.

Blending the Classical and the P-value Based Approaches in Test of Hypotheses

A p-value is a measure of how much evidence you have against the null hypothesis. Notice that
the null hypothesis is always in "=" form, and does not contain any forms of inequalities. The smaller
the p-value, the more evidence you have. In this setting, the p-value is based on the null
hypothesis and has nothing to do with an alternative hypothesis and therefore with the rejection
region. In recent years, some authors have tried to use a mixture of the classical and the p-value
approaches. It is based on the critical value obtained from a given $\alpha$, the computed statistic, and the
p-value. This is a blend of two different schools of thought. In this setting, some textbooks compare
the p-value with the significance level to make decisions on a given test of hypothesis. The larger
the p-value is when compared with $\alpha$ (in one-sided alternative hypothesis, and $\alpha/2$ for the two
sided alternative hypotheses), the less evidence we have for rejecting the null hypothesis. In such
a comparison, if the p-value is less than some threshold (usually 0.05, sometimes a bit larger like
0.1 or a bit smaller like 0.01) then you reject the null hypothesis. The following deals with such a
combined approach.

Use of P-value and $\alpha$: In this setting, we must also consider the alternative hypothesis in drawing
the rejection region. There is only one p-value to compare with $\alpha$ (or $\alpha/2$). Know that, for any test of
hypothesis, there is only one p-value. The following outlines the computation of the p-value and the
decision process involved in a given test of hypothesis:

1. P-value for One-sided Alternative Hypotheses: The p-value is defined as the area under the right tail of
distribution, if the rejection region in on the right tail; if the rejection region is on the left tail, then the p-
value is the area under the left tail (in one-sided alternative hypotheses).

2. P-value for Two-sided Alternative Hypotheses: If the alternative hypothesis is two-sided (that is, rejection
regions are on both the left and the right tails), then the p-value is the area under the right tail or under
the left tail of the distribution, depending on whether the computed statistic is closer to the right rejection
region or the left rejection region.

For symmetric densities (such as the t-density), the left-tail and right-tail p-values are the same. However, for
non-symmetric densities (such as Chi-square), use the smaller of the two. This makes the test more
conservative. Notice that, for a two-sided alternative hypothesis, the p-value so defined is never greater than
0.5.

3. After finding the p-value as defined here, you compare it with a pre-set α value for one-sided tests, and
with α/2 for two-sided tests. The larger the p-value is when compared with α (for a one-sided alternative
hypothesis, and α/2 for a two-sided alternative hypothesis), the less evidence we have for rejecting
the null hypothesis.
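
The three steps above can be sketched in code. The following is a minimal Python illustration, assuming the test statistic is a standard normal Z (an assumption made only to keep the example self-contained). The function names tail_p_value and reject_null are hypothetical and are not part of this site's tools.

```python
from statistics import NormalDist


def tail_p_value(z, alternative):
    """Tail-area p-value of a Z statistic, following the definition in the text.

    alternative: "right", "left", or "two-sided" (labels chosen for this sketch).
    """
    left = NormalDist().cdf(z)   # area under the left tail
    right = 1.0 - left           # area under the right tail
    if alternative == "right":
        return right
    if alternative == "left":
        return left
    if alternative == "two-sided":
        # The tail the statistic is closer to, i.e. the smaller of the two areas.
        return min(left, right)
    raise ValueError("alternative must be 'right', 'left', or 'two-sided'")


def reject_null(z, alternative, alpha=0.05):
    """Compare the p-value with alpha (one-sided) or alpha/2 (two-sided)."""
    threshold = alpha / 2 if alternative == "two-sided" else alpha
    return tail_p_value(z, alternative) < threshold


if __name__ == "__main__":
    for z, alt in [(1.8, "right"), (1.8, "two-sided"), (-2.0, "left")]:
        print(z, alt, round(tail_p_value(z, alt), 4), reject_null(z, alt))
```

With Z = 1.8, the same tail area leads to rejection against a right-sided alternative but not against a two-sided one, because the threshold drops from α to α/2.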

To avoid looking-up the p-values from the limited statistical tables given in your textbook, most
professional statistical packages such as SAS and SPSS provide the two-tailed p-value. Based on
where the rejection region is, you must find out what p-value to use.

Some textbooks have many misleading statements about the p-value and its applications. For
example, in many textbooks you find the authors double the p-value to compare it with α when
dealing with the two-sided test of hypotheses. One wonders how they do it in the case when "their"
p-value exceeds 0.5? Notice that, while it is correct to compare the p-value with α for one-sided
tests of hypotheses, for two-sided hypotheses one must compare the p-value with α/2, NOT α
with 2 times the p-value, as some textbooks advise. While the decision is the same, there is a clear
distinction here and an important difference, which the careful reader will note.

How to set the appropriate α value? You may have wondered why α = 0.05 is so popular in a
test of hypothesis. α = 0.05 is traditional for tests, but is arbitrary in its origins. It was suggested by R.A.
Fisher in the spirit of 0.05 being the biggest p-value at which one would think
maybe the null hypothesis in a statistical experiment was to be considered false. This was also a
tradeoff between "type I error" and "type II error": we do not want to reject a true null
hypothesis, but we do not want to fail to reject a false null hypothesis, either. As a final note, the
average of these two p-values is often called the mid-p value.

Conversions from two-sided to one-sided probabilities: Let C be the probability for a two-sided
confidence interval (CI) constructed for an estimate. The probability (C1) that either the estimate is
greater than the lower limit or that it is less than the upper limit can be computed by using:

C1 = C/2 + 1/2, for conversion to one-sided

Numerical Example: Suppose you wish to convert a C = 90% two-sided CI into a one-sided, then
C1 = 0.90/2 + 1/2 = 95%.
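
The conversion can be checked with a few lines of Python. This is only a sketch of the formula C1 = C/2 + 1/2; the function name one_sided_confidence is hypothetical.

```python
def one_sided_confidence(two_sided_c):
    """Convert a two-sided confidence level C into the one-sided level C1 = C/2 + 1/2."""
    if not 0.0 < two_sided_c < 1.0:
        raise ValueError("confidence level must lie strictly between 0 and 1")
    return two_sided_c / 2 + 0.5


if __name__ == "__main__":
    # Reproduces the numerical example in the text: C = 90%.
    print(one_sided_confidence(0.90))
```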

You might need to use Sample Size Determination JavaScript at the design stage of your statistical
investigation in decision making with specific, subjective requirements.

Bonferroni Method for Multiple P-Values Procedure

One may combine several t-tests by using the Bonferroni method. It works reasonably well when
there are only a few tests, but as the number of comparisons increases above 8, the value of 't'
required to conclude that a difference exists becomes much larger than it really needs to be, and
the method becomes overly conservative.

One way to make the Bonferroni t-test less conservative is to use the estimate of the population
variance computed from within the groups in the analysis of variance:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\left( s^2/n_1 + s^2/n_2 \right)^{1/2}},$$

where s² is the estimate of the population variance computed within the groups.
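
As an illustration, the following Python sketch computes the within-group variance estimate and the t statistic above for every pair of groups. It assumes the usual Bonferroni adjustment of dividing α by the number of pairwise comparisons, which the text does not spell out; the function name bonferroni_pairwise_t is hypothetical. The critical t value for the adjusted level still has to be taken from a t-table.

```python
from itertools import combinations
from math import sqrt


def bonferroni_pairwise_t(groups, alpha=0.05):
    """Pairwise t statistics using the within-group variance estimate s^2.

    Returns (s2, alpha_per_test, t_values), where t_values maps a pair of
    group indices (i, j) to t = (mean_i - mean_j) / sqrt(s2/n_i + s2/n_j).
    """
    m = len(groups)
    n_total = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    # Within-group sum of squares divided by its degrees of freedom.
    ss_within = sum((x - mean) ** 2 for g, mean in zip(groups, means) for x in g)
    s2 = ss_within / (n_total - m)

    pairs = list(combinations(range(m), 2))
    alpha_per_test = alpha / len(pairs)  # assumed Bonferroni adjustment
    t_values = {}
    for i, j in pairs:
        se = sqrt(s2 / len(groups[i]) + s2 / len(groups[j]))
        t_values[(i, j)] = (means[i] - means[j]) / se
    return s2, alpha_per_test, t_values


if __name__ == "__main__":
    data = [[2, 3, 1, 3, 1], [3, 4, 3, 5, 0], [5, 5, 5, 3, 2]]
    s2, alpha_per_test, t_values = bonferroni_pairwise_t(data)
    print(round(s2, 4), round(alpha_per_test, 4))
    for pair, t in t_values.items():
        print(pair, round(t, 4))
```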

Hommel's Multiple P-Values Procedure: This test can be summarized as follows:

Suppose we have n P-values, p(i), i = 1, ..., n, in ascending order, corresponding to
independent tests. Let j be the largest integer such that:

p(n-j+k) > kα/j, for all k = 1, ..., j.

If no such j exists, reject all hypotheses; otherwise, reject all hypotheses with p(i) ≤ α/j. This
provides strong control of the family-wise error rate at level α.
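
The rule above translates directly into a short routine. The following Python sketch follows the statement of the procedure as given here; the function name hommel_reject is hypothetical, and the example P-values are made up for illustration.

```python
def hommel_reject(p_values, alpha=0.05):
    """Apply the rule stated above; returns one True/False (reject?) per P-value.

    The decisions are returned in the same order as the input P-values.
    """
    n = len(p_values)
    p_sorted = sorted(p_values)  # p(1) <= p(2) <= ... <= p(n)

    j_star = None
    for j in range(n, 0, -1):  # search for the largest qualifying j
        # p(n-j+k) in 1-based notation is p_sorted[n - j + k - 1].
        if all(p_sorted[n - j + k - 1] > k * alpha / j for k in range(1, j + 1)):
            j_star = j
            break

    if j_star is None:
        return [True] * n  # no such j exists: reject all hypotheses
    return [p <= alpha / j_star for p in p_values]


if __name__ == "__main__":
    print(hommel_reject([0.01, 0.04, 0.5]))
    print(hommel_reject([0.01, 0.02, 0.03, 0.04]))
```

In the second call no qualifying j exists, so all four hypotheses are rejected, whereas a plain Bonferroni comparison with α/4 = 0.0125 would reject only the first.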

There are other improvements on the Bonferroni adjustment when multiple tests are independent
or positively dependent. However, Hommel's method is generally the most powerful of these
Bonferroni-type procedures.

Further Readings:
Hommel G., Bonferroni procedures for logically related hypotheses, Journal of Statistical Planning and Inference, 82, 119-128, 1999.
Kost J., and M. McDermott, Combining dependent P-values, Statistics and Probability Letters, 60, 183-190, 2002.
Westfall P., and S. Young, Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment, Wiley, 1992.
Wright S., Adjusted P-values for simultaneous inference, Biometrics, 48, 1005-1013, 1992.

Power of a Test and the Size Effect


The power of a test plays the same role in hypothesis testing that Standard Error played in
estimation. It is a measuring tool for assessing the accuracy of a test or in comparing two
competing test procedures.

The power of a test is the probability of rejecting the null hypothesis when it is false. This
probability is the complement of the probability of making a Type II error (not rejecting
the null hypothesis when it is false). Recall that we choose the probability of making a Type I error
when we set α. If we decrease the probability of making a Type I error, then, for a fixed sample size,
we increase the probability of making a Type II error. Therefore, there are basically two errors possible when
conducting a statistical analysis, Type I error and Type II error:

Type I error - (producer's) risk of rejecting the null hypothesis when it is in fact true.
Type II error - (consumer's) risk of not rejecting the null hypothesis when it is in fact false.

Power and Alpha (α): Thus, the probability of not rejecting a true null has the same relationship to
Type I errors as the probability of correctly rejecting an untrue null does to Type II errors. Yet, as
mentioned, if we decrease the odds of making one type of error we increase the odds of making the
other type of error. What is the relationship between Type I and Type II errors? For a fixed sample
size, decreasing one type of error increases the size of the other one.

Power and the Size Effect: Anytime we test whether a sample differs from a population, or
whether two samples come from two separate populations, there is the condition that each of the
populations we are comparing has its own mean and standard deviation (even if we do not know
them). The distance between the two population means will affect the power of our test. This is known
as the size of treatment, also known as the effect size, as shown in the following table with the
three popular values for α:

Power as a Function of α and the Size Effect

Size Effect    α = 0.10    α = 0.05    α = 0.01
    1.0           .22         .13         .03
    2.0           .39         .26         .09
    3.0           .59         .44         .20
    4.0           .76         .64         .37
    5.0           .89         .79         .57
    6.0           .96         .91         .75
    7.0           .99         .97         .88

Power and the Size of Variance σ²: The greater the variance σ², the lower the power 1 - β.
Anything that increases the extent to which the two distributions share common values will increase β
(the likelihood of making a Type II error).

Power and the Sample Size: The smaller the sample size n, the lower the power. A very small n
produces power so low that false null hypotheses are likely to go unrejected.

The following is a list of four factors influencing the power:

effect size (for example, the difference between the means)
variance σ²
significance level α
number of observations, or the sample size n

In practice, the first three factors are often fixed. Only the sample size can be controlled by the
statistician and that only within budget constraint. There exists a tradeoff between budget and
achievement of desirable accuracy in any analysis.
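
To see how the four factors interact, here is a minimal Python sketch for one particular case: a right-tailed Z-test of a mean with known standard deviation. That setting, the function name z_test_power, and the example numbers are assumptions made for illustration; they are not the computation behind the table above.

```python
from statistics import NormalDist


def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a right-tailed Z-test of H0: mu = mu0 against mu = mu0 + effect.

    effect : difference between the alternative mean and the null mean
    sigma  : population standard deviation (assumed known)
    n      : sample size
    alpha  : significance level
    """
    std_normal = NormalDist()
    z_critical = std_normal.inv_cdf(1.0 - alpha)   # rejection cutoff under H0
    shift = effect * (n ** 0.5) / sigma            # shift of Z under the alternative
    beta = std_normal.cdf(z_critical - shift)      # probability of a Type II error
    return 1.0 - beta


if __name__ == "__main__":
    base = dict(effect=0.5, sigma=1.0, n=25, alpha=0.05)
    print("baseline       ", round(z_test_power(**base), 4))
    print("larger effect  ", round(z_test_power(**{**base, "effect": 1.0}), 4))
    print("larger variance", round(z_test_power(**{**base, "sigma": 2.0}), 4))
    print("smaller alpha  ", round(z_test_power(**{**base, "alpha": 0.01}), 4))
    print("larger sample  ", round(z_test_power(**{**base, "n": 100}), 4))
```

Changing one argument at a time shows the direction of each factor: power rises with the effect size, α, and n, and falls as the standard deviation grows.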

A Numerical Example: The power of a test is most easily understood by viewing it in the context
of a composite test. A composite test requires the specification of a population mean as the
alternative hypothesis. For example, consider the Z-test of hypothesis in the following Figure. The power
is developed from specification of an alternative hypothesis such as μ = 2.5, and μ = 3. The
resultant distribution under this alternative shifts to the right 2.5 units with the shaded area
representing the power of the test, correctly rejecting a false null.

Power of a Test

Not rejecting the null hypothesis when it is false is defined as a Type II error, and is denoted by the
β region. In the configuration shown in the above Figure, β falls to the left of the critical value (and
below the statistic's density (or probability) function under the alternative hypothesis Ha). β is also
defined as the probability of not rejecting a false null hypothesis, also called a miss. Related to the value of β is
the power of a test. The power is defined as the probability of rejecting the null hypothesis given
that a specific alternative is true, and is computed as (1 - β).

A Short Discussion: Consider testing a simple null versus a simple alternative. In the Neyman-Pearson
setup, an upper bound is set for the probability of a Type I error (α), and then it is
desirable to find tests with low probability of Type II error (β) given this. The usual justification for
this is that "we are more concerned about a Type I error, so we set an upper limit on the α that we
can tolerate." I have seen this sort of reasoning in elementary texts and also in some advanced
ones. It doesn't seem to make any sense. When the sample size is large, for most standard tests,
the ratio β/α tends to 0. If we care more about Type I error than Type II error, why should this
concern dissipate with increasing sample size?

This is indeed a drawback of the classical theory of testing statistical hypotheses. A second
drawback is that the choice lies between only two test decisions: reject the null or accept the null. It
is worth considering approaches that overcome these deficiencies. This can be done, for example,
by the concept of profile-tests at a 'level' a. Neither the Type I nor the Type II error rate is
considered separately; instead, the decision rests on a ratio of probabilities. For example, we accept the
alternative hypothesis Ha and reject the null H0 if an event is observed which is at least a times
more probable under Ha than under H0. Conversely, we accept H0 and reject Ha if an event is observed
which is at least a times more probable under H0 than under Ha. This is a symmetric concept which is
formulated within the classical approach.

Power of Parametric versus Non-parametric Tests: As a general rule, for a given sample size n,
parametric tests are more powerful than their non-parametric counterparts, provided their
conditions hold. This is the primary reason that we have emphasized parametric tests. Moreover, among the parametric
tests, those which use correlation are more powerful, such as the before-and-after test. This is
known as a Variance Reduction Technique, used in system simulation to increase the accuracy
(i.e., reduce variation) without increasing the sample size.

Correlation Coefficient as a Measuring Tool and Decision Criterion for the Effect Size: The
correlation coefficient can be obtained and used as a measuring tool and decision criterion for the
strength of the effect size, based on the computed test statistic for major hypothesis tests.

The correlation coefficient r stands as a very useful and accessible index of the magnitude of
effect. It is commonly accepted that small, medium, and large effect sizes correspond to r-values
over 0.1, 0.3, and 0.5, respectively. The following are the needed transformations of some major
inferential statistics to the r-value:

For the t(df)-statistic: $r = \left[ t^2 / (t^2 + df) \right]^{1/2}$

For the F(1, df2)-statistic: $r = \left[ F / (F + df_2) \right]^{1/2}$

For the χ²(1)-statistic: $r = \left[ \chi^2 / n \right]^{1/2}$

For the Standard Normal Z: $r = \left( Z^2 / n \right)^{1/2}$
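
These transformations are easy to apply in code. The Python sketch below implements the four formulas and labels the result using the 0.1, 0.3, and 0.5 cutoffs mentioned above. The function names, and the label "negligible" for values not over 0.1, are hypothetical choices made for this illustration.

```python
from math import sqrt


def r_from_t(t, df):
    """r = [t^2 / (t^2 + df)]^(1/2)"""
    return sqrt(t ** 2 / (t ** 2 + df))


def r_from_f(f, df2):
    """r = [F / (F + df2)]^(1/2), for an F statistic with 1 numerator d.f."""
    return sqrt(f / (f + df2))


def r_from_chi2(chi2, n):
    """r = [chi^2 / n]^(1/2), for a chi-square statistic with 1 d.f."""
    return sqrt(chi2 / n)


def r_from_z(z, n):
    """r = (Z^2 / n)^(1/2)"""
    return sqrt(z ** 2 / n)


def effect_label(r):
    """Label the effect size using the cutoffs 0.1, 0.3, and 0.5."""
    if r > 0.5:
        return "large"
    if r > 0.3:
        return "medium"
    if r > 0.1:
        return "small"
    return "negligible"  # label assumed here; the text gives no name for this range


if __name__ == "__main__":
    r = r_from_t(2.0, 16)
    print(round(r, 4), effect_label(r))
```

Since F(1, df) is the square of t(df), r_from_f(t**2, df) and r_from_t(t, df) agree, which gives a quick consistency check.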

You might like to use Sample Size Determination JavaScript at the design stage of your statistical
investigation in decision making with specific subjective requirements.

Further Reading:
Murphy K., and B. Myors, Statistical Power Analysis, L. Erlbaum Associates, 1998.

Parametric vs. Non-Parametric vs. Distribution-free Tests

A statistical technique is called non-parametric if it satisfies at least one of the following
five types of criteria:

1. The data entering the analysis are enumerative; that is, counted data represent the number of
observations in each category or cross-category.

2. The data are measured and/or analyzed using a nominal scale of measurement.

3. The data are measured and/or analyzed using an ordinal scale of measurement.

4. The inference does not concern a parameter in the population distribution; for example, the hypothesis
that a time-ordered set of observations exhibits a random pattern.

5. The probability distribution of the statistic upon which the analysis is based is not dependent upon
specific information or conditions (i.e., assumptions) about the population(s) from which the sample(s)
are drawn, but only upon general assumptions, such as a continuous and/or symmetric population
distribution.

According to these criteria, the distinction of non-parametric is accorded either because of the
level of measurement used or required for the analysis, as in types 1 through 3; the type of
inference, as in type 4; or the generality of the assumptions made about the population distribution,
as in type 5.

For example, one may use the Mann-Whitney Rank Test as a non-parametric alternative to
Student's t-test when one does not have normally distributed data.

Mann-Whitney: To be used with two independent groups (analogous to the independent groups t-test)
Wilcoxon: To be used with two related (i.e., matched or repeated) groups (analogous to the
related samples t-test)
Kruskal-Wallis: To be used with two or more independent groups (analogous to the single-factor
between-subjects ANOVA)
Friedman: To be used with two or more related groups (analogous to the single-factor
within-subjects ANOVA)

Non-parametric vs. Distribution-free Tests:

Non-parametric tests are those used when some specific conditions for the ordinary tests are
violated.

Distribution-free tests are those for which the procedure is valid for all different shapes of the
population distribution.

For example, the Chi-square test concerning the variance of a given population is parametric since
this test requires that the population distribution be normal. The Chi-square test of independence
does not assume normality, or even that the data are numerical. The Kolmogorov-Smirnov
test is a distribution-free test, which is applicable to comparing two populations with any
continuous distribution.

The following is an interesting non-parametric procedure with various useful
applications.

Comparison of Two Random Variables: Consider two independent sets of observations
X = (x1, x2, ..., xr) and Y = (y1, y2, ..., ys) for two random variables X and Y, respectively. To estimate the reliability
function:

R = Pr (X > Y)

One may use:

The estimator RS = U/(r × s),

where U is the number of pairs (xi, yj) such that xi > yj, for all i = 1, 2, ..., r, and j = 1, 2, ..., s.

This estimator is an unbiased one with the minimum variance for R. It is important to know that,
for a non-negative delta value d, the accuracy of the estimate is bounded as follows:

Pr{R ≥ RS - d} ≥ max {1 - exp(-2nd²), 4nd²/(1 - 4nd²)}.

Application areas include the insurance ruin problem. Let random variable Y denote the claims per
unit of time and let random variable X denote the return on investment (ROI) for the Insurance
Company. Finally, let z denote the constant premium amount collected; then the probability that the
insurance company will survive is:

R = Pr [X + z > Y].
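
The estimator RS = U/(r × s) is simple to compute by counting pairs. The Python sketch below does so, with an optional constant added to each x to mirror the premium z in the insurance example. The function name reliability_estimate, the parameter name premium, and the data are hypothetical.

```python
def reliability_estimate(x, y, premium=0.0):
    """Estimate R = Pr(X + premium > Y) by RS = U / (r * s).

    U counts the pairs (xi, yj) with xi + premium > yj, over all r * s pairs.
    With premium = 0 this is the estimator of R = Pr(X > Y).
    """
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    u = sum(1 for xi in x for yj in y if xi + premium > yj)
    return u / (len(x) * len(y))


if __name__ == "__main__":
    returns = [3, 5, 7]   # made-up observations of X
    claims = [1, 4, 6]    # made-up observations of Y
    print(reliability_estimate(returns, claims))
    print(reliability_estimate(returns, claims, premium=2))
```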

You might like to use the Kolmogorov-Smirnov Test for Two Populations and Comparing Two
Random Variables in checking your computations and performing some numerical experiment for a
deeper understanding of these concepts.

Further Readings:
Arsham H., A generalized confidence region for stress-strength reliability, IEEE Transactions on Reliability, 35(4), 586-589, 1986.
Conover W., Practical Nonparametric Statistics, Wiley, 1998.
Hollander M., and D. Wolfe, Nonparametric Statistical Methods, Wiley, 1999.
Kotz S., Y. Lumelskii, and M. Pensky, The Stress-Strength Model and Its Generalizations: Theory and Applications,
Imperial College Press, London, UK, 2003, distributed by World Scientific Publishing.

Hypotheses Testing

Let us consider a simple problem of inference about population mean. We have a large population
with known mean. We take a sample and wish to know whether the sample mean is significantly
different from the population mean. Our null hypothesis is that it is not.

The theory of probability is only capable of dealing with random variables which generate a
frequency distribution "in the long run". We have one fixed population and one fixed sample. There
is nothing random about this problem and the experiment is conducted once, so there is no "long
run".

We pretend that the experiment was not conducted once, but an infinite number of times, that is,
we consider all possible samples of the same size. We assume that each sample mean includes
an "error", which is independently and normally distributed about zero. The sample mean now
becomes our random variable, which we call our "statistic". We can now apply the t-test or z-test
interpretation of probability.

We are now able to determine the probability of a randomly chosen sample mean having a value at
least as extreme as our original sample mean. Note that we are implicitly assuming that the null
hypothesis is true. This probability is our p-value which we apply to the original problem.

Remember that, in the t-tests for differences in means, there is a condition of equal population
variances that must be examined. One way to test for possible differences in variances is to do an
F test. However, the F test is very sensitive to violations of the normality condition; i.e., if
populations appear not to be normal, then the F test will tend to reject the null of no
differences in population variances too often.

You might like to use the following JavaScript to check your computations and to perform some
statistical experiments for deeper understanding of these concepts:

Testing the Mean.
Testing the Variance.
Testing Two Populations.
Testing the Difference: The Before-and-After Test.
ANOVA.
For statistical equality of two populations, you might like to use the Kolmogorov-Smirnov Test.

Single Population t-Test

The purpose is to compare the sample mean with the given population mean. The aim is to judge
the claimed mean value, based on a set of random observations of size n. A necessary condition
for validity of the result is that the population distribution is normal, if the sample size n is small
(say less than 30).

The task is to decide whether to accept a null hypothesis:

H0: μ = μ0

or to reject the null hypothesis in favor of the alternative hypothesis:

Ha: μ is significantly different from μ0

The testing framework consists of computing the t-statistic:

$$T = \frac{(\bar{x} - \mu_0)\, n^{1/2}}{S}$$

where x̄ is the estimated mean and S² is the estimated variance based on n random observations.

The above statistic is distributed as a t-distribution with parameter d.f. = ν = (n - 1). If the absolute
value of the computed T-statistic is "too large" compared with the critical value of the t-table, then
one rejects the claimed value for the population's mean.
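
As a small worked illustration, the Python sketch below computes the T-statistic and its degrees of freedom for a made-up sample. The function name one_sample_t and the data are hypothetical; the critical value is not computed but passed in as read from a t-table (2.776 is the two-sided 5% value for 4 degrees of freedom).

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, mu0):
    """Return (T, d.f.) with T = (x_bar - mu0) * sqrt(n) / S and d.f. = n - 1."""
    n = len(sample)
    if n < 2:
        raise ValueError("at least two observations are needed")
    s = stdev(sample)  # square root of the sample variance with divisor n - 1
    t_stat = (mean(sample) - mu0) * sqrt(n) / s
    return t_stat, n - 1


if __name__ == "__main__":
    data = [2, 3, 1, 3, 1]     # made-up observations
    claimed_mean = 3
    t_critical = 2.776         # from a t-table: two-sided, alpha = 0.05, d.f. = 4
    t_stat, df = one_sample_t(data, claimed_mean)
    print(round(t_stat, 4), df, abs(t_stat) > t_critical)
```

Here the absolute value of T does not exceed the tabulated critical value, so the claimed mean would not be rejected at the 5% level.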

This test could also be used for testing similar claims for other unimodal populations including
those with discrete random variables, such as proportion, provided there are sufficient
observations (say, over 30).

You might like to use Testing the Mean JavaScript in checking your computations, and Sample
Size Determination JavaScript at the design stage of your statistical investigation in decision
making with specific subjective requirements.

You might like also to use JavaScript Testing Two Populations.

Two Independent Populations

If an estimate is unbiased, such as the sample mean, then it is a good idea to pool the estimates to
get a single estimate from several relatively small samples. The pooled estimate is a "good"
estimate when compared with each individual estimate.

Pooled Mean: Suppose we have m estimates x̄(i), each from a sample of size n(i), for the
population expected value μ; the pooled estimate is:

$$\frac{\sum_i n(i)\, \bar{x}(i)}{\sum_i n(i)},$$

where both sums are over all values of i = 1, 2, ..., m.

Pooled Variance: Since the sample variance is also an unbiased estimate of the population variance σ²,
it is a good idea to pool the estimates to get a single estimate from m
estimates S(i)², each from a sample of size n(i); the pooled estimate is:

$$\frac{\sum_i [n(i) - 1]\, S(i)^2}{\left[ \sum_i n(i) \right] - m},$$

where both sums are over all values of i = 1, 2, ..., m.
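
Both pooling formulas can be written as short functions. The following Python sketch is one way to do it; the function names pooled_mean and pooled_variance and the example numbers are hypothetical.

```python
def pooled_mean(means, sizes):
    """Weighted mean: sum of n(i) * mean(i) divided by the sum of n(i)."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)


def pooled_variance(variances, sizes):
    """Sum of [n(i) - 1] * S(i)^2 divided by ([sum of n(i)] - m), m = number of samples."""
    m = len(sizes)
    numerator = sum((n - 1) * v for n, v in zip(sizes, variances))
    return numerator / (sum(sizes) - m)


if __name__ == "__main__":
    # Three made-up samples of size 5 with means 2, 3, 4 and variances 1, 3.5, 2.
    sizes = [5, 5, 5]
    print(pooled_mean([2, 3, 4], sizes))
    print(round(pooled_variance([1, 3.5, 2], sizes), 4))
```

Following the remark on standard deviations in this section, a pooled standard deviation would be obtained as the square root of pooled_variance rather than by averaging standard deviations.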

We pool variance estimates for other good reasons as well. Depending on the particular reason, the
conclusion might have to be made explicitly conditional on, e.g., the validity of the equal-variance
model. There are several different good reasons for pooling:

to get a single stable estimate from several relatively small samples, where variance
fluctuations seem not to be systematic; or

for convenience, when all the variance estimates are near enough to equality; or

when there is no choice but to model variance, as in simple linear regression with no
replicated X values.

You might like to use JavaScript Pooling the Means, and Variances.

Pooled Standard Deviation: Both the sample mean and the sample variance are unbiased estimates for the
population parameters μ and σ², respectively; however, the sample standard deviation is NOT an
unbiased estimate of the population standard deviation σ. This is so because of an inequality known as
Jensen's inequality when applied to a concave function, i.e., the square root of the unbiased
variance estimate. Therefore, pooling standard deviations directly is meaningless; the best one can
do is to take the square root of the pooled variance.

Notice that when sample sizes are large and nearly equal, there is essentially no difference
between the pooled and unpooled estimates of standard errors of paired-data samples, and the
degrees of freedom are nearly asymptotic. This rationale can fall apart in any other case. One
must pool variances rather than merely taking a shortcut in the computation of standard errors.

If you calculate the test without the equal-variance assumption, you have to determine the degrees of freedom
(d.f.). The formula works in such a way that d.f. will be less if the larger sample variance is in the
group with the smaller number of observations. This is the case in which the two tests will differ
considerably. A study of the formula for the d.f. is most enlightening, and one must understand the
correspondence between the unfortunate design, having the most observations in the group with
little variance, and the low d.f. and accompanying large critical t-value.

Applications: When doing t tests for differences in means of populations, for independent
samples case:

1. For differences in means that do not make any assumption about equality of population variances, use
the standard error formula:

$$\left[ S_1^2/n_1 + S_2^2/n_2 \right]^{1/2},$$

with d.f. = ν = n1 or n2, whichever is smaller.

2. With equal variances, use the statistic:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\left[ S_p^2 \left( 1/n_1 + 1/n_2 \right) \right]^{1/2}},$$

with parameter d.f. = ν = (n1 + n2 - 2), for n1 and n2 greater than 1, where the pooled variance is:

$$S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}.$$

3. If total N is less than 50 and one sample is 1/2 the size of the other (or less), and if the smaller sample
has a standard deviation at least twice as large as the other sample, then apply the procedure given in
item no. 1, but adjust the d.f. parameter of the t-test to the largest integer less than or equal to:

d.f. = ν = A/(B + C),

where:

$$A = \left[ S_1^2/n_1 + S_2^2/n_2 \right]^2,$$

$$B = \left[ S_1^2/n_1 \right]^2 / (n_1 - 1),$$

$$C = \left[ S_2^2/n_2 \right]^2 / (n_2 - 1).$$

Otherwise, do not worry about the problem of having an actual α level that is much different from what
you have set it to be.
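
The standard error of item no. 1 and the adjusted degrees of freedom of item no. 3 can be computed as follows. This Python sketch takes the two sample variances and sizes as inputs; the function names unpooled_se and adjusted_df and the example numbers are hypothetical.

```python
from math import floor, sqrt


def unpooled_se(s1_sq, n1, s2_sq, n2):
    """Standard error without the equal-variance assumption: [S1^2/n1 + S2^2/n2]^(1/2)."""
    return sqrt(s1_sq / n1 + s2_sq / n2)


def adjusted_df(s1_sq, n1, s2_sq, n2):
    """Largest integer less than or equal to A / (B + C), as defined in item no. 3."""
    a = (s1_sq / n1 + s2_sq / n2) ** 2
    b = (s1_sq / n1) ** 2 / (n1 - 1)
    c = (s2_sq / n2) ** 2 / (n2 - 1)
    return floor(a / (b + c))


if __name__ == "__main__":
    # Made-up case: the smaller sample (n2 = 5) has twice the standard deviation.
    s1_sq, n1 = 4.0, 10    # variance 4, i.e. standard deviation 2
    s2_sq, n2 = 16.0, 5    # variance 16, i.e. standard deviation 4
    print(round(unpooled_se(s1_sq, n1, s2_sq, n2), 4))
    print(adjusted_df(s1_sq, n1, s2_sq, n2))
```

In this example the adjusted d.f. is far below n1 + n2 - 2 = 13, which illustrates the point made earlier about the larger variance sitting in the smaller sample.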

The following decision chart provides a guide in selecting an appropriate test statistic concerning the
means μ's for both one and two populations.

[Figure: A Decision Chart for Testing the Means μ's]

The last approach, which is very general with conservative results, can be implemented using
Testing Two Populations JavaScript.

You might like to use JavaScript Testing the Mean for One Population

Non-parametric Multiple Comparison Procedures

Duncan's multiple-range test: This is one of the many multiple comparison procedures. It is based
on the standardized range statistic by comparing all pairs of means while controlling the overall
Type I error at a desirable level. While it does not provide interval estimates of the difference
between each pair of means, it does indicate which means are significantly different from the
others. For determining the significant differences between a single control group mean and the
other means, one may use Dunnett's multiple-comparison test.

Introduction to Tests for Statistical Equality of Two or More Populations:

Two random variables X and Y having distribution FX(x) and FY(y) respectively, are said to be
equivalent, or equal in rule, or equal in distribution, if and only if they have the same distribution
function. That is,

FX(z) = FY(z), for all z,

There are different tests depending on the intended applications. The widely used tests for
statistical equality of populations are as follows:

1. Equality of Two Normal Populations: One may use the Z-test and F-test to check the equality
of the means, and the equality of variances, respectively.

2. Testing a Shift in Normal Populations: Often we are interested in testing for a given shift in a
given population distribution, that is, testing if a random variable Y is equal in distribution to
another X + c for some constant c. In other words, the distribution of Y is the distribution of X
shifted. In testing any shift in distribution one needs to test for normality first, and then test
the difference in expected values by applying the two-sided Z-test with the null hypothesis of:

H0: μY - μX = c.

3. Analysis of Variance: Analysis of Variance (ANOVA) tests are designed for simultaneous
testing of equality of three or more populations. The preconditions in applying ANOVA are
normality of each population's distribution, and the equality of all variances simultaneously
(not the pair-wise tests).

Notice that ANOVA is an extension of item no. 1 in testing equality of more than two
populations. It can be shown that if one applies ANOVA for testing the equality of two
populations based on two independent samples with sizes of n1 and n2 from each
population, respectively, then the results of both tests will be identical. Moreover, the test
statistics obtained by each test are directly related, i.e.,

$$F_{\alpha,\,(1,\; n_1 + n_2 - 2)} = t^2_{\alpha/2,\,(n_1 + n_2 - 2)}$$

4. Equality of Proportions in Several Populations: This test is for discrete random variables. It is
one of the many interesting chi-square applications.

5. Distribution-free Equality of Two Populations: Whenever one is interested in testing the
equality of two populations with a common continuous random variable, without any
reference to the underlying distribution such as normality condition, one may use the
distribution-free test known as the K-S test.

6. Non-parametric Comparison of Two Random Variables: Consider two independent sets of
observations X = (x1, x2, ..., xr) and Y = (y1, y2, ..., ys) for two independent populations with
random variables X and Y, respectively. Often we are interested in estimating Pr (X > Y).

Equality of Two Normal Populations:

The normal or Gaussian distribution is a continuous symmetric distribution that follows the familiar
bell-shaped curve. One of its nice features is that the mean and variance uniquely and
independently determine the distribution.

Therefore, for testing the statistical equality of two independent normal populations, one must first
perform the Lilliefors' Test for Normality to assess this condition. Given that both populations are
normally distributed, one must then perform two more tests, namely the test for equality of the
two means and the test for equality of the two variances. Both of these tests can be carried out by
using the Test of Hypotheses for Two Populations JavaScript.

Multi-Means Comparisons: Analysis of Variance (ANOVA)

The tests we have learned up to this point allow us to test hypotheses that examine the difference
between only two means. Analysis of Variance or ANOVA will allow us to test the difference
between two or more means. ANOVA does this by examining the ratio of variability between two
conditions and variability within each condition. For example, say we give a drug that we believe
will improve memory to a group of people and give a placebo to another group of people. We might
measure memory performance by the number of words recalled from a list we ask everyone to
memorize. A t-test would compare the likelihood of observing the difference in the mean number of
words recalled for each group. An ANOVA test, on the other hand, would compare the variability
that we observe between the two conditions to the variability observed within each condition.
Recall that we measure variability as the sum of the squared differences of each score from the mean. When
we actually calculate an ANOVA we will use a short-cut formula.

Thus, when the variability that we predict between the two groups is much greater than the
variability we don't predict within each group, then we will conclude that our treatments produce
different results.

An Illustrative Numerical Example for ANOVA

Consider the following (small integers, indeed for illustration while saving space) random samples
from three different populations.

With the null hypothesis:


H0: µ1 = µ2 = µ3,

and the alternative:

Ha: at least two of the means are not equal.

At the significance level α = 0.05, the critical value from the F-table is

F 0.05, 2, 12 = 3.89.

             Obs 1  Obs 2  Obs 3  Obs 4  Obs 5    Sum   Mean
Sample P1        2      3      1      3      1     10      2
Sample P2        3      4      3      5      0     15      3
Sample P3        5      5      5      3      2     20      4

Demonstrate that, SST=SSB+SSW.


That is, the sum of squares total (SST) equals sum of squares between (SSB) the groups plus sum
of squares within (SSW) the groups.

Computation of sample SST: With the grand mean = 3, first, start with taking the difference
between each observation and the grand mean, and then square it for each data point.

             Obs 1  Obs 2  Obs 3  Obs 4  Obs 5    Sum
Sample P1        1      0      4      0      4      9
Sample P2        0      1      0      4      9     14
Sample P3        4      4      4      0      1     13

Therefore SST = 36 with d.f. = (n - 1) = 15 - 1 = 14

Computation of sample SSB:

Second, let all the data in each sample have the same value as the mean in that sample. This
removes any variation WITHIN. Compute SS differences from the grand mean.

             Obs 1  Obs 2  Obs 3  Obs 4  Obs 5    Sum
Sample P1        1      1      1      1      1      5
Sample P2        0      0      0      0      0      0
Sample P3        1      1      1      1      1      5

Therefore SSB = 10, with d.f. = (m - 1) = 3 - 1 = 2 for m = 3 groups.

Computation of sample SSW:

Third, compute the SS difference within each sample using their own sample means. This provides
SS deviation WITHIN all samples.

             Obs 1  Obs 2  Obs 3  Obs 4  Obs 5    Sum
Sample P1        0      1      1      1      1      4
Sample P2        0      1      0      4      9     14
Sample P3        1      1      1      1      4      8

SSW = 26 with d.f. = 3(5 - 1) = 12. That is, 3 groups times (5 observations in each - 1).

Results are: SST = SSB + SSW, and d.f.(SST) = d.f.(SSB) + d.f.(SSW), as expected.

Now, construct the ANOVA table for this numerical example by plugging the results of your
computation into the ANOVA Table. Note that the Mean Squares are the Sums of Squares divided by
their Degrees of Freedom. The F-statistic is the ratio of the two Mean Squares.

The ANOVA Table


Sources of Variation   Sum of Squares   Degrees of Freedom   Mean Squares   F-Statistic
Between Samples        10               2                    5.00           2.31
Within Samples         26               12                   2.17
Total                  36               14


Conclusion: Since the computed F-statistic is smaller than the critical value 3.89, there is not
enough evidence to reject the null hypothesis H0.
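
The decomposition above can be checked with a few lines of code. The following is a minimal sketch in Python (the variable names are illustrative choices, not part of the original course material); it recomputes SST, SSB, SSW and the F-statistic for the three samples P1, P2, and P3.

```python
# One-way ANOVA sums of squares for the three samples in the text.
samples = {
    "P1": [2, 3, 1, 3, 1],
    "P2": [3, 4, 3, 5, 0],
    "P3": [5, 5, 5, 3, 2],
}

all_values = [x for xs in samples.values() for x in xs]
n_total = len(all_values)
grand_mean = sum(all_values) / n_total

# Total: every observation against the grand mean.
sst = sum((x - grand_mean) ** 2 for x in all_values)

# Between: each sample mean against the grand mean, weighted by sample size.
ssb = sum(len(xs) * (sum(xs) / len(xs) - grand_mean) ** 2
          for xs in samples.values())

# Within: every observation against its own sample mean.
ssw = sum((x - sum(xs) / len(xs)) ** 2
          for xs in samples.values() for x in xs)

df_between = len(samples) - 1          # m - 1
df_within = n_total - len(samples)     # n - m
msb = ssb / df_between
msw = ssw / df_within
f_stat = msb / msw

print(f"SST={sst:.0f}  SSB={ssb:.0f}  SSW={ssw:.0f}")
print(f"MSB={msb:.2f}  MSW={msw:.2f}  F={f_stat:.2f}")
```

The computed F-statistic is then compared with the tabulated critical value F 0.05, 2, 12 = 3.89.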

The Logic behind ANOVA: First, let us try to explain the logic and then illustrate it with a simple
example. In performing the ANOVA test, we are trying to determine if a certain number of
population means are equal. To do that, we measure the difference of the sample means and
compare that to the variability within the sample observations. That is why the test statistic is the
ratio of the between-sample variation (MSB) and the within-sample variation (MSW). If this ratio is
close to 1, the data are consistent with equal population means; a ratio much larger than 1 is
evidence against equality.

Here is a good application for you: Many people believe that men get paid more in the business
world, in a specific profession at specific level, than women, simply because they are male. To
justify or reject such a claim, you could look at the variation within each group (one group being
women's salaries and the other group being men's salaries) and compare that to the variation
between the means of randomly selected samples of each population. If the variation in the
women's salaries is much larger than the variation between the men's and women's mean salaries,
one could argue that, because the variation within the women's group is so large, the difference
may not be a gender-related problem.

Now, getting back to our numerical example of the drug treatment to improve memory vs. the
placebo, we notice that, given the test conclusion and the ANOVA test's conditions, we may
conclude that these three populations are in fact the same population. Therefore, the ANOVA
technique could be used as a measuring tool and statistical routine for quality control, as described
below using our numerical example.

Construction of the Control Chart for the Sample Means: Under the null hypothesis, the
ANOVA concludes that µ1 = µ2 = µ3; that is, we have a "hypothetical parent population." The
question is, what is its variance? The estimated variance (i.e., the total mean squares) is 36 / 14 =
2.57. Thus, the estimated standard deviation is 1.60, and the estimated standard deviation for the
means of samples of size 5 is 1.60 / √5 ≈ 0.72. Under the conditions of ANOVA, we can construct a
control chart with the warning limits = 3 ± 2(0.72) and the action limits = 3 ± 3(0.72). The
following figure depicts the control chart.



ANOVA and Quality Control

Conditions for Using ANOVA Test: The following conditions must be tested prior to using
ANOVA test, otherwise the results are not valid: Randomness of the samples, Normality of
populations, and Equality of Variances for all populations.

You May Ask Why Not Use Pair-wise T-tests Instead of ANOVA? Here are two reasons:
First, for K populations you would need to perform K(K-1)/2 pair-wise t-tests.
Second, if the significance level for each test is set at the 5% level, then the overall significance
level can be as large as approximately 0.05 × K(K-1)/2. For example, for K = 5 populations, you have
to perform 10 pair-wise t-tests, and the overall significance level can be as high as 50%, which is
far too high a type-I error for any statistical decision making.
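
To see how quickly the overall error rate grows, here is a small Python sketch. The function name `pairwise_error_rates` is a hypothetical helper. It reports the number of pair-wise tests, the additive upper bound used in the text, and, for comparison only, the rate that would apply if the tests were independent (an assumption that pair-wise t-tests sharing samples do not strictly satisfy).

```python
from math import comb

def pairwise_error_rates(k, alpha=0.05):
    """Return (number of pair-wise tests, additive bound, rate under independence)."""
    m = comb(k, 2)                                # K(K-1)/2 pairs
    additive_bound = min(1.0, m * alpha)          # upper bound on the overall level
    independent_rate = 1.0 - (1.0 - alpha) ** m   # assumes independent tests
    return m, additive_bound, independent_rate

for k in (3, 5, 10):
    m, bound, indep = pairwise_error_rates(k)
    print(f"K={k:2d}  tests={m:2d}  bound={bound:.2f}  independent={indep:.2f}")
```

Either way, the overall type-I error for K = 5 is many times the nominal 5%, which is the motivation for a single ANOVA test.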

You might like to use ANOVA: Testing Equality of Means for your computations, and then to
interpret the results in managerial (not technical) terms.

You might need to use Sample Size Determination JavaScript at the design stage of your statistical
investigation in decision making with specific subjective requirements.


ANOVA for Normal but Condensed Data Sets

In testing the equality of several means, often the raw data are not available. In such a case, one
must perform the needed analysis based on secondary data using the data summaries; namely,
the triple-set: The sample sizes, the sample means, and the sample variances.

Suppose one of the samples is of size n, having the sample mean x̄ and the sample variance S².
Let:

yi = x̄ + √(S²/n) for all i = 1, 2, …, n-1,

and

yn = n x̄ - (n - 1) y1

Then the new data yi's are surrogate data having the same mean and variance as the
original data set: the yi's sum to n x̄, and their squared deviations from x̄ sum to (n - 1)S².
Therefore, by generating the surrogate data for each sample, one can perform
the standard ANOVA test. The results are identical.
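
As an illustration, the following Python sketch builds such a surrogate sample from the summary triple (n, mean, variance). The function name `surrogate_sample` is a hypothetical choice; the example values n = 5, mean 2, variance 1 are the summaries of Sample P1 in the earlier ANOVA example (its within sum of squares 4 divided by 5 - 1).

```python
from math import sqrt
from statistics import mean, variance

def surrogate_sample(n, xbar, s2):
    """Build n values whose sample mean is xbar and sample variance is s2."""
    if n < 2:
        raise ValueError("need at least two observations")
    step = sqrt(s2 / n)
    y = [xbar + step] * (n - 1)             # y_1 ... y_(n-1)
    y.append(n * xbar - (n - 1) * y[0])     # y_n restores the mean
    return y

y = surrogate_sample(5, 2.0, 1.0)
print(y)
print(mean(y), variance(y))   # variance() uses the n-1 divisor
```

Feeding one such surrogate sample per group into an ordinary one-way ANOVA routine reproduces the sums of squares that the raw data would give.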

You might like to use ANOVA for Condensed Data for your computation and experimentation.

The JavaScript Subjective Assessment of Estimates tests the claim that at least the ratio of one
estimate to the largest estimate is as large as a given claimed value.

Further Reading:
Larson D., Analysis of variance with just summary statistics as input, The American Statistician, 46(2), 151-152, 1992.

ANOVA for Dependent Populations

Populations can be dependent in either of the following ways:

1. Every subject is tested in every experimental condition. This kind of dependency is called the
repeated-measurement design.
2. Subjects under different experimental conditions are related in some manner. This kind of
dependency is called the matched-subjects design.

An Application: Suppose we are interested in studying the effect of alcohol on driving ability. Ten
subjects are each tested at three different alcohol levels, and the number of driving errors is
tabulated below:

Mean
0 oz 2 3 1 3 1 4 1 3 2 1 2.1
2 oz 3 2 1 4 2 3 1 5 1 2 2.4
4 oz 3 1 2 4 2 5 2 4 3 2 3.1

The test null hypothesis is:

H0: µ1 = µ2 = µ3,

and the alternative:

Ha: at least two of the means are not equal.

Using the ANOVA for Dependent Populations JavaScripts, we obtain the needed information in
constructing the following ANOVA table:

The ANOVA Table


Sources of Variation   Sum of Squares   Degrees of Freedom   Mean Squares   F-Statistic
Subjects               31.50            9                    3.50           -
Between                5.26             2                    2.63           7.03
Within                 6.70             18                   0.37
Total                  43.46            29

Conclusion: The p-value is P = 0.006, indicating strong evidence against the null hypothesis.
The means of the populations are not equal. Here, one may conclude that a person who has
consumed more than a certain level of alcohol commits more driving errors.
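
The way the total variation is split into Subjects, Between, and Within parts in a repeated-measurement design can be sketched as follows. This is an illustrative Python sketch only: the function name `repeated_measures_anova` and the small data set (4 subjects under 3 conditions) are hypothetical and are not the alcohol data above.

```python
def repeated_measures_anova(data):
    """data[c][s] is the score of subject s under condition c."""
    k, n = len(data), len(data[0])
    grand = sum(map(sum, data)) / (k * n)

    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    # Between conditions: condition means against the grand mean.
    ss_between = n * sum((sum(row) / n - grand) ** 2 for row in data)
    # Subjects: each subject's own mean against the grand mean.
    subject_means = [sum(row[s] for row in data) / k for s in range(n)]
    ss_subjects = k * sum((m - grand) ** 2 for m in subject_means)
    # Within (error): what is left after removing both effects.
    ss_error = ss_total - ss_between - ss_subjects

    df_between, df_subjects = k - 1, n - 1
    df_error = df_between * df_subjects
    f_stat = (ss_between / df_between) / (ss_error / df_error)
    return {
        "ss_subjects": ss_subjects, "ss_between": ss_between,
        "ss_error": ss_error, "ss_total": ss_total,
        "df": (df_subjects, df_between, df_error), "F": f_stat,
    }

# Hypothetical scores: 3 conditions (rows) by 4 subjects (columns).
result = repeated_measures_anova([[1, 2, 3, 4],
                                  [2, 3, 4, 6],
                                  [4, 4, 6, 7]])
print(result)
```

Removing the subject-to-subject variation from the error term is what distinguishes this design from the one-way ANOVA for independent samples.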

A "block design sampling" implies studying more than two dependent populations. For testing the
equality of means of more than two populations based on block design sampling, you may use
Two-Way ANOVA Test JavaScript. In the case of having block design data with replications, use
Two-Way ANOVA with Replications JavaScript to obtain the needed information for constructing
the ANOVA tables.

Test for Equality of Several Population Proportions

The Chi-square test of homogeneity provides an alternative method for testing the null
hypothesis that two population proportions are equal. Moreover, it extends to several
populations, similar to the ANOVA test that compares several means.

An Application: Suppose we wish to test the null hypothesis

H0: P1 = P2 = ... = Pk

That is, all k population proportions are identical; here k = 3. The sample data from each of
the three populations are given in the following table:

Test for homogeneity of Several Population Proportions


Populations Yes No Total
Sample I 60 40 100
Sample II 57 53 110
Sample III 48 72 120
Total 165 165 330

The Chi-square statistic is 8.95 with d.f. = (3-1)(2-1) = 2, since the table has three rows
(samples) and two columns (Yes, No). The p-value is equal to 0.011, indicating that there is
moderate evidence against the null hypothesis that the three populations are statistically
identical.
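
The computation can be sketched in Python as follows. The helper names `chi2_sf` and `chi_square_crosstable` are hypothetical; the tail probability is obtained from the standard recurrence for the chi-square distribution with integer degrees of freedom, so only the standard library is needed. The table has 3 rows and 2 columns, so the sketch uses (3-1)(2-1) = 2 degrees of freedom.

```python
from math import erfc, exp, gamma, sqrt

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, exp(-z)           # exact tail for df = 2
    else:
        a, q = 0.5, erfc(sqrt(z))     # exact tail for df = 1
    while a < df / 2.0:               # step df up by 2 at a time
        q += z ** a * exp(-z) / gamma(a + 1.0)
        a += 1.0
    return q

def chi_square_crosstable(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for row, rt in zip(table, row_totals):
        for observed, ct in zip(row, col_totals):
            expected = rt * ct / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, (len(row_totals) - 1) * (len(col_totals) - 1)

yes_no = [[60, 40], [57, 53], [48, 72]]   # Samples I, II, III
chi2, df = chi_square_crosstable(yes_no)
print(f"chi-square = {chi2:.2f}, d.f. = {df}, p-value = {chi2_sf(chi2, df):.3f}")
```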

You might like to use Testing Proportions to perform this test.

Distribution-free Equality of Two Populations

For statistical equality of two populations, one may use the Kolmogorov-Smirnov Test (K-S
Test) for two populations. The K-S test seeks differences between the two populations'
distribution functions based on their two independent random samples. The test rejects the
null hypothesis of no difference between the two populations if the difference between the
two empirical distribution functions is "large".

Prior to applying the K-S test it is necessary to arrange each of the two sample observations
in a frequency table. The two frequency tables must have a common classification. The test is
then based on these frequency tables, and it belongs to the family of distribution-free tests.

The K-S Test process is as follows:

1. Some k number of "classes" is selected, each typically covering a different but similar
range of values.

2. Some much larger number of independent observations (n1 and n2, both larger than
40) are taken. Each is measured and its frequency is recorded in a class.

3. Based on the frequency table, the empirical cumulative distribution functions F1i and
F2i for the two samples are constructed, for i = 1, 2, ..., k.

4. The K-S statistic is the largest absolute difference between F1i and F2i; i.e.,

K-S statistic = D = Maximum | F1i - F2i |, for all i = 1, 2, ..., k.

The above process is depicted in the following figure.



The K-S Test Process for Equality of Two Populations Decision

The critical values of K-S statistic can be found at Computers and Computational Statistics
with Applications

An Application: The daily sales of the two subsidiaries of The PC & Accessories Company
are shown in the following table, with n1 = 44, and n2 = 54:

Daily Sales at Two Branches Over 6 Months


Sales ($1000)   Frequency I   Frequency II
0 - 2           11            1
3 - 5           7             3
6 - 8           8             6
9 - 11          3             12
12 - 14         5             12
15 - 17         5             14
18 - 20         5             6
Sums            44            54

The manager of the first branch is claiming that "since the daily sales are random
phenomena, my overall performance is as good as the other manager's performance." In
other words:

H0: The daily sales at the two stores are almost the same.
Ha: The performance of the managers is significantly different.

Following the above process for this test, the K-S statistic is D = 0.406, attained at the 6 - 8
class (26/44 - 10/54), with a p-value below 0.001, indicating strong evidence against the null
hypothesis. There is enough evidence that the performance of the manager of the second branch
is better.
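
Steps 3 and 4 of the K-S process can be sketched in Python as follows, using the two frequency columns of the sales table. The function name `ks_statistic` is a hypothetical choice, and the sketch stops at the statistic D; critical values or p-values would still come from a K-S table.

```python
from itertools import accumulate

freq_branch_1 = [11, 7, 8, 3, 5, 5, 5]     # Frequency I, n1 = 44
freq_branch_2 = [1, 3, 6, 12, 12, 14, 6]   # Frequency II, n2 = 54

def ks_statistic(freq_1, freq_2):
    """Return (D, index of the class where the largest gap occurs)."""
    n1, n2 = sum(freq_1), sum(freq_2)
    cdf_1 = [c / n1 for c in accumulate(freq_1)]   # empirical F1i
    cdf_2 = [c / n2 for c in accumulate(freq_2)]   # empirical F2i
    gaps = [abs(a - b) for a, b in zip(cdf_1, cdf_2)]
    d = max(gaps)
    return d, gaps.index(d)

d, class_index = ks_statistic(freq_branch_1, freq_branch_2)
print(f"D = {d:.3f} at class number {class_index + 1}")
```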

Introduction to Applications of the Chi-square Statistic

The variance is not the only thing for which you can use a Chi-square test.

The most widely used applications of Chi-square distribution are:

The Chi-square Test for Association, which is a non-parametric test; therefore, it can be used
for nominal data too. It is a test of statistical significance widely used in bivariate tabular
association analysis. Typically, the hypothesis is whether or not two populations are different
in some characteristic or aspect of their behavior based on two random samples. This test
procedure is also known as the Pearson Chi-square test.

The Chi-square Goodness-of-Fit Test, which is used to test if an observed distribution conforms to
any particular distribution. Calculation of this goodness-of-fit test is by comparison of
observed data with data expected based on a particular distribution.

One of the disadvantages of some of the Chi-square tests is that they do not permit the
calculation of confidence intervals; therefore, determination of the sample size is not readily
available.

Treatment of Cases with Many Categories: Notice that, although in the following section
most of the crosstables have only two categories, it is always possible to convert cases with
many categories into similar crosstables. To do so, one must consider all possible pairs of
categories and their numerical values while constructing the equivalent "two-categories"
crosstable.

Test for Crosstable Relationship

Crosstables: Often crosstables are used to test relationships between two categorical types of
data, or independence of two variables, such as cigarette smoking and drug use. If you were
to survey 1000 people on whether or not they smoke and whether or not they use drugs, you
would get one of four answers: (no, no), (no, yes), (yes, no), or (yes, yes).

By compiling the number of people in each category, you can ultimately test whether drug
usage is independent of cigarette smoking by using the Chi-square distribution (this is
approximate, but works well). Again, the methodology for this is in your textbook. The
degrees of freedom is equal to (number of rows - 1)(number of columns - 1). That is, this is
how many cell counts are needed to fill in the entire body of the crosstable; the rest are
determined by the given row sums and column sums.

Do not forget the condition for the validity of the Chi-square test: the expected values must
be greater than 5 in 80% or more of the cells. Otherwise, one could use an "exact" test, using
either a permutation or resampling approach.

An Application: Suppose a counselor of a school in a small town is interested in whether the
curriculum chosen by students is related to the occupation of their parents. It is necessary to
record the data as shown in the following contingency table with two rows (r1, r2) and three
columns (c1, c2, c3):

Relationship between occupation of parents and curriculum chosen by high school students

                        Curriculum Chosen by Students
Parental Occupation     College prep   Vocational   General   Totals
Professional            12             2            6         20
Blue collar             6              6            8         20
Totals                  18             8            14        40

Under the hypothesis that there is no relation, the expected (E) frequency would be:

Ei,j = (Σri)(Σcj)/N,

where Σri is the total of row i, Σcj is the total of column j, and N is the overall sample size.

The Observed (O) and Expected (E) frequencies are recorded in the following table:

Expected frequencies for the data.

                College prep   Vocational   General   Totals
Professional    O = 12         O = 2        O = 6     ΣO = 20
                E = 9          E = 4        E = 7     ΣE = 20
Blue collar     O = 6          O = 6        O = 8     ΣO = 20
                E = 9          E = 4        E = 7     ΣE = 20
Totals          ΣO = 18        ΣO = 8       ΣO = 14
                ΣE = 18        ΣE = 8       ΣE = 14

The quantity

χ² = Σ [(O - E)² / E]

is a measure of the degree of deviation between the Observed and Expected frequencies. If
there is no relationship between the row variable and the column variable, this measure will
be very close to zero. Under the hypothesis that there is no relationship between the rows and
the columns, this quantity has a Chi-square distribution with parameter equal to number of
rows minus 1, multiplied by number of columns minus 1.

For this numerical example we have:

χ² = Σ [(O - E)² / E] = 30/7 = 4.3

with d.f. = (2-1)(3-1) = 2, which has a p-value of 0.12, suggesting little or no real evidence
against the null hypothesis.

The main question is how large this measure is. The maximum value of this measure is:

χ²max = N(A-1),

where A is the number of rows or columns, whichever is smaller. For our numerical example
it is 40(2-1) = 40.

The coefficient of determination, which has a range of [0, 1], provides the relative strength of
the relationship, computed as

χ²/χ²max = 4.3/40 = 0.11

Therefore we conclude that the degree of association is only 11%, which is fairly weak.

Alternatively, you could also look at the contingency coefficient φ statistic, which is:

φ = √[χ²/(N + χ²)] = 0.31

This statistic ranges between 0 and 1 and can be interpreted like the correlation coefficient.
This measure also indicates only a fairly weak relationship between the curriculum chosen by
students and the occupation of their parents.
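
The expected frequencies and the two strength-of-association measures for this example can be reproduced with the short Python sketch below. The variable names are illustrative; the formulas are the ones given above.

```python
from math import sqrt

# Observed counts: rows are parental occupation, columns are curriculum.
table = [[12, 2, 6],    # Professional
         [6, 6, 8]]     # Blue collar

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n   # E = (row sum)(column sum)/N
        chi2 += (observed - expected) ** 2 / expected

smaller_dim = min(len(row_totals), len(col_totals))   # A in the text
chi2_max = n * (smaller_dim - 1)
relative_strength = chi2 / chi2_max
contingency = sqrt(chi2 / (n + chi2))

print(f"chi-square = {chi2:.2f}, maximum = {chi2_max}")
print(f"relative strength = {relative_strength:.2f}, contingency coefficient = {contingency:.2f}")
```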

You might like to use Chi-square Test for Crosstable Relationship in performing this test, and
the P-values for the Popular Distributions JavaScript to find out the p-values of the Chi-square
statistic.

Further Readings:
Agresti A., Categorical Data Analysis, Wiley, 2002.
Fleiss J., Statistical Methods for Rates and Proportions, Wiley, 1981.

2 by 2 Crosstable Analysis

Using Chi-square in a 2x2 table requires Yates's correction. One first subtracts 0.5 from
the absolute differences between observed and expected frequencies for each of the four
cells before squaring, dividing by the expected frequency, and summing. The formula
for the Chi-square value in a 2x2 table can be derived from the Normal Theory comparison of
the two proportions in the table using the total incidence to produce the standard errors. The
rationale of the correction is a better equivalence of the area under the normal curve and the
probabilities obtained from the discrete frequencies. In other words, the simplest correction is
to move the cut-off point for the continuous distribution from the observed value of the
discrete distribution to midway between that and the next value in the direction of the null
hypothesis expectation. Therefore, the correction essentially applies only to one-d.f. tests,
where the "square root" of the Chi-square looks like a "normal/t-test" and where a direction
can be attached to the 0.5 adjustment.

The Chi-square distribution is used as an approximation of the binomial distribution. By applying a
continuity correction, we get a better approximation of the binomial distribution for the
purposes of calculating tail probabilities.
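
A minimal Python sketch of the corrected statistic follows. The function name `chi_square_2x2` and the example counts are hypothetical; the p-value uses the fact that a Chi-square variable with one degree of freedom is the square of a standard normal variable.

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d, yates=True):
    """Chi-square statistic and p-value (1 d.f.) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            gap = abs(observed[i][j] - expected)
            if yates:
                gap = max(0.0, gap - 0.5)    # continuity correction
            chi2 += gap ** 2 / expected
    p_value = erfc(sqrt(chi2 / 2.0))         # upper tail, 1 d.f.
    return chi2, p_value

# Hypothetical counts, for illustration only.
print(chi_square_2x2(15, 5, 8, 12, yates=False))
print(chi_square_2x2(15, 5, 8, 12, yates=True))
```

The corrected statistic is never larger than the uncorrected one, so the correction makes the test more conservative.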

Given the following 2x2 table, one may compute some relative risk measures:

a b
c d

The most usual measures are:

Rate-difference: a/(a+c) - b/(b+d)
Rate-ratio: (a/(a+c))/(b/(b+d))
Odds-ratio: ad/bc

The rate difference and rate ratio are appropriate when you are contrasting two groups
whose sizes (a+c and b+d) are given. The odds ratio is for when the issue is association
rather than difference.

The risk-ratio (RR) is the ratio of the proportion (a/(a+b)) to the proportion (c/(c+d)):

RR = (a / (a + b)) / (c / (c + d))

RR is thus a measure of how much larger the proportion in the first row is compared to the
second. An RR value below 1.00 indicates a 'negative' association [a/(a+b) < c/(c+d)], a value
of 1.00 indicates no association [a/(a+b) = c/(c+d)], and a value above 1.00 indicates a
'positive' association [a/(a+b) > c/(c+d)]. The further from 1.00 the RR is, the stronger the
association.

Notice that the odds ratio (OR) is equal to the simple crossproduct ratio of a 2×2 table.

The OR can be written as: (a/b)/(c/d) which is the ratio of these two odds -- hence its name,
the odds ratio. Both the numerator and denominator are odds. For example, the numerator,
a/b, gives the odds of a positive versus negative rating by Rater 2 given that Rater 1's rating
is positive. The denominator c/d gives the odds of a positive versus negative rating by Rater
2 given that Rater 1's rating is negative.

Since the sampling distribution of the odds ratio is skewed, we cannot easily compute a
standard error for the odds ratio itself. We can, however, find a standard error for the natural
logarithm of the odds ratio. It is simply:

√(1/a + 1/b + 1/c + 1/d)

Notice that, you need to compute the confidence interval on the log scale and then
transform the results back to the original scale of measurement.

We see that as any or all of the counts in the two by two table increase, the confidence
interval for the log odds ratio shrinks. Also, it turns out that the smallest count in the 2 by 2
table plays the largest role in determining the size of the standard error.
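
The measures above, together with a confidence interval for the odds ratio built on the log scale, can be sketched in Python as follows. The function name `two_by_two_measures` and the example counts are hypothetical; z = 1.96 is the usual standard normal quantile for an approximate 95% interval.

```python
from math import exp, log, sqrt

def two_by_two_measures(a, b, c, d, z=1.96):
    """Risk measures for the 2x2 table [[a, b], [c, d]]; all counts must be positive."""
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return {
        "rate_difference": a / (a + c) - b / (b + d),
        "rate_ratio": (a / (a + c)) / (b / (b + d)),
        "risk_ratio": (a / (a + b)) / (c / (c + d)),
        "odds_ratio": odds_ratio,
        "se_log_or": se_log_or,
        # Interval computed on the log scale, then transformed back.
        "or_interval": (exp(log(odds_ratio) - z * se_log_or),
                        exp(log(odds_ratio) + z * se_log_or)),
    }

# Hypothetical counts, for illustration only.
for name, value in two_by_two_measures(20, 10, 5, 15).items():
    print(name, value)
```

Because the interval is symmetric only on the log scale, the back-transformed limits are not equally distant from the odds ratio itself.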

Identical Populations Test for Crosstable Data

The test of homogeneity is much like the Test for Crosstable Relationship in that both deal with
the cross-classification of nominal data; that is, r × c tables. The method of computing the
Chi-square statistic is the same for both tests, with the same d.f.

The two tests differ, however, in the following respect. The Test for Crosstable Relationship is
made on data drawn from a single population (with fixed total) where one is concerned with
whether one set of attributes is independent of another set. The test for homogeneity, on the
other hand, is designed to test whether two or more random samples are drawn from the same
population or from different populations, according to some criterion of classification
applied to the samples.

The homogeneity test is concerned with the question: Are the samples drawn from
populations that are homogeneous (i.e., the same) with respect to some criterion of
classification?

In the crosstable for this test, either the row or the column categories may represent the
populations from which the samples are drawn.

An Application: Suppose a board of directors of a labor union wishes to survey the opinion
of its members regarding a change in its constitution. The following table shows the result of
the survey sent to three union locals:

Reactions of A Sample of Three Locals Group Members

                  Union Local
Reaction          A     B     C
In Favor          18    22    10
Against           7     14    9
No Response       5     4     11

The problem is not to determine whether or not the union members are in favor of the
change. The question is to test if there is a significant difference in the proportions of opinion
of the three populations' members concerning the proposed change.

The Chi-square statistic is 9.58 with d.f. = (3-1)(3-1) = 4. The p-value is equal to 0.048,
indicating that there is moderate evidence against the null hypothesis that the three union
locals are the same.

You might like to use Populations Homogeneity Test to perform this test.

Further Readings:
Agresti A., Categorical Data Analysis, Wiley, 2002.
Clark Ch., and L. Schkade, Statistical Analysis for Administrative Decisions, South-Western Pub., 1979.

Test for Equality of Several Population Medians

Generally, the median provides a better measure of location than the mean when there are
some extremely large or small observations; i.e., when the data are skewed to the right or to
the left. For this reason, median income is used as the measure of location for the U.S.
household income.

Suppose we are interested in testing the equality of the medians of k number of populations
with respect to the same continuous random variable.

The first step in calculating the test statistic is to compute the common median of the k
samples combined. Then, determine for each group the number of observations falling above
and below the common median. The resulting frequencies are arranged in a 2 by k
crosstable. If the k samples are, in fact, from populations with the same median, one expects
about one half of the scores in each sample to be above the combined median and about one
half to be below. In the case that some observations are equal to the combined median, one
may drop those few observations in constructing the 2 by k crosstable. Under this condition,
the Chi-square statistic may now be computed and its p-value obtained from the Chi-square
distribution with d.f. = k-1.

An illustrative application: Do public and private primary school teachers differ with respect
to their salary? The data from a random sample are given in the following table (in thousands
of dollars per year).

Public:   35 26 27 21 27 38 23 25 25 27 45 46 33 26 46 41
Private:  29 50 43 22 42 47 42 32 50 37 34 31

The test of hypothesis is:

H0: The public and private school teachers' salaries are almost the same.

The median of all data (i.e., combined) is 33.5. Now determine in each group the number of
observations falling above and below the common median of 33.5. The resulting frequencies
are shown in the following table:

Crosstable for the public and private school teachers'


Public Private Total
Above median 6 8 14
Below median 10 4 14
Total 16 12 28

The Chi-square statistic based on this table is 2.33. The p-value for the computed test
statistic with d.f. = (2-1)(2-1) = 1 is 0.127, therefore, we are unable to reject the null
hypothesis.
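
The whole procedure for this example can be sketched in Python as follows. The function name `median_test` is a hypothetical helper; the p-value line applies only to the two-group case (one degree of freedom).

```python
from math import erfc, sqrt
from statistics import median

public = [35, 26, 27, 21, 27, 38, 23, 25, 25, 27, 45, 46, 33, 26, 46, 41]
private = [29, 50, 43, 22, 42, 47, 42, 32, 50, 37, 34, 31]

def median_test(*groups):
    """Return (common median, counts above, counts below, chi-square statistic)."""
    common = median([x for g in groups for x in g])
    above = [sum(x > common for x in g) for g in groups]
    below = [sum(x < common for x in g) for g in groups]   # ties are dropped
    n = sum(above) + sum(below)
    chi2 = 0.0
    for row in (above, below):
        for count, a, b in zip(row, above, below):
            expected = sum(row) * (a + b) / n
            chi2 += (count - expected) ** 2 / expected
    return common, above, below, chi2

common, above, below, chi2 = median_test(public, private)
p_value = erfc(sqrt(chi2 / 2.0))   # upper tail of Chi-square with 1 d.f. (two groups)
print(common, above, below, round(chi2, 2), round(p_value, 3))
```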

You might like to use Testing Medians to perform this test.

Goodness-of-Fit Test for Probability Mass Functions

There are other tests that use the Chi-square statistic, such as the goodness-of-fit test for
discrete random variables. Here, Chi-square serves as a statistical test that measures
"goodness-of-fit". In other words, it measures how much the observed or actual frequencies
differ from the expected or predicted frequencies. Using a Chi-square table will enable you to
discover how significant the difference is. A null hypothesis in the context of the Chi-square
test is the model that you use to calculate your expected or predicted values. If the value you
get from calculating the Chi-square statistic is sufficiently high (as compared to the values in
the Chi-square table), it tells you that your null hypothesis is probably wrong.

Let Y1, Y2, ..., Yn be a set of independent and identically distributed discrete random
variables. Assume that the probability distribution of the Yi's has the probability mass
function fo(y). We can divide the set of all possible values of Yi, i = 1, 2, ..., n, into m
non-overlapping intervals D1, D2, ..., Dm. Define the probability values p1, p2, ..., pm as:

p1 = P(Yi ∈ D1)
p2 = P(Yi ∈ D2)
:
pm = P(Yi ∈ Dm)

where the symbol ∈ means "an element of".

Since the union of the mutually exclusive intervals D1, D2, ..., Dm is the set of all possible
values for the Yi's, (p1 + p2 + ... + pm) = 1. Define the set of discrete random variables X1,
X2, ..., Xm, where

X1 = number of Yi's whose value ∈ D1
X2 = number of Yi's whose value ∈ D2
:
Xm = number of Yi's whose value ∈ Dm

and (X1 + X2 + ... + Xm) = n. Then the set of discrete random variables X1, X2, ..., Xm will
have a multinomial probability distribution with parameters n and the set of probabilities {p1,
p2, ..., pm}. If the intervals D1, D2, ..., Dm are chosen such that npi ≥ 5 for i = 1, 2, ..., m,
then the statistic

C = Σ (Xi - npi)² / npi,

where the sum is over i = 1, 2, ..., m, is distributed approximately as χ² with m-1 degrees of
freedom.

For the goodness-of-fit sample test, we formulate the null and alternative hypotheses as

H0: fY(y) = fo(y)
Ha: fY(y) ≠ fo(y)

At the α level of significance, H0 will be rejected in favor of Ha if

C = Σ (Xi - npi)² / npi

is greater than the upper-α critical value of the χ² distribution with m-1 degrees of freedom.

However, it is possible that in a goodness-of-fit test, one or more of the parameters of fo(y)
are unknown. Then the probability values p1, p2, ..., pm will have to be estimated by
assuming that H0 is true and calculating their estimated values from the sample data. That is,
another set of probability values p'1, p'2, ..., p'm will need to be computed so that the values
(np'1, np'2, ..., np'm) are the estimated expected values of the multinomial random variable
(X1, X2, ..., Xm). In this case, the random variable C will still have a Chi-square distribution,
but its degrees of freedom will be reduced. In particular, if the probability function fo(y) has r
unknown parameters,

C = Σ (Xi - np'i)² / np'i

is distributed as χ² with m-1-r degrees of freedom.

For this goodness-of-fit test, we formulate the null and alternative hypotheses as

H0: fY(y) = fo(y)
Ha: fY(y) ≠ fo(y)

At the α level of significance, H0 will be rejected in favor of Ha if C is greater than the
upper-α critical value of the χ² distribution with m-1-r degrees of freedom.

An Application: A die is thrown 300 times and the following frequencies are observed. Test
the hypothesis that the die is fair at level 0.05. Under the null hypothesis that the die is fair,
the expected frequencies are all equal to 300/6 = 50. Both the Observed (O) and Expected
(E) frequencies are recorded in the following table together with the random variable Y that
represents the number on each side of the die:

Goodness-of-fit Test For Discrete Variables

Y    1    2    3    4    5    6
O    57   43   59   55   63   23
E    50   50   50   50   50   50

The quantity

χ² = Σ [(O - E)² / E] = 22.04

is a measure of the goodness-of-fit. If there is a reasonably good fit to the hypothetical
distribution, this measure will be very close to zero. Since 22.04 exceeds the critical value of
χ² with 6 - 1 = 5 d.f. at the 0.95 level, which is 11.07, we reject the
null hypothesis that the die is a fair one.
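
The die example can be reproduced with the Python sketch below. The variable names are illustrative; the critical value 11.07 is the tabulated value quoted in the text, not something the code derives.

```python
observed = [57, 43, 59, 55, 63, 23]     # counts for faces 1..6
n = sum(observed)                        # 300 throws
probabilities = [1 / 6] * 6              # fair die under H0
expected = [n * p for p in probabilities]

# Rule of thumb from the text: every expected count should be at least 5.
condition_ok = all(e >= 5 for e in expected)

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
critical_value = 11.07                   # Chi-square table, 5 d.f., level 0.05
reject_h0 = chi2 > critical_value

print(f"chi-square = {chi2:.2f}, d.f. = {df}, reject H0: {reject_h0}")
```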

You might like to use this JavaScript to perform this test.

For statistical equality of two random variables characterizing two populations, you might like
to use the Kolmogorov-Smirnov Test if you have two independent sets of random
observations, one from each population.

Compatibility of Multi-Counts Test

In some applications, such as quality control, it is necessary to check if the process is under
control. This can be done by testing if there are significant differences between the numbers
of "counts" taken over k equal periods of time. The counts are supposed to have been
obtained under comparable conditions.

The null hypothesis is:

H0: There is no significant difference between the numbers of "counts" taken over k equal
periods of time.

Under the null hypothesis, the statistic:

Σ (Ni - N)² / N

has a Chi-square distribution with d.f. = k-1, where i is the count's number, Ni is its count,
and N = ΣNi/k.

One may extend this useful test to where the duration of obtaining the ith count is ti. Then the
above test statistic becomes:

Σ [(Ni - tiN)² / tiN]

and has a Chi-square distribution with d.f. = k-1, where i is the count's number, Ni is its
count, and N = ΣNi/Σti.
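
Both versions of the statistic can be computed with one small Python function, since the equal-period case is the special case ti = 1 for all i. The function name `multi_count_statistic` and the example counts are hypothetical.

```python
def multi_count_statistic(counts, durations=None):
    """Return (Chi-square statistic, d.f.) for the compatibility of k counts."""
    k = len(counts)
    if durations is None:
        durations = [1.0] * k                  # equal periods
    rate = sum(counts) / sum(durations)        # N in the text
    statistic = sum((n_i - t_i * rate) ** 2 / (t_i * rate)
                    for n_i, t_i in zip(counts, durations))
    return statistic, k - 1

# Hypothetical counts over four equal periods.
print(multi_count_statistic([12, 15, 9, 14]))
# Hypothetical counts observed over periods of length 1, 2, and 1.
print(multi_count_statistic([10, 22, 8], durations=[1, 2, 1]))
```

The resulting statistic is compared with the Chi-square table value for k-1 degrees of freedom.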

You might like to use the Compatibility of Multi-Counts JavaScript to check your
computations, and to perform some numerical experimentation for a deeper understanding of
the concepts.

Necessary Conditions for the Above Chi-square Based Testing

Like any statistical test procedure, Chi-square based testing must meet certain
necessary conditions to apply; otherwise, any obtained conclusion might be wrong or
misleading. This is true in particular for using the Chi-square-based test for cross-tabulated
data.

Necessary conditions for the Chi-square based tests for crosstable data are:

1. Expected values greater than 5 in 80% or more of the cells.

2. Moreover, if the number of cells is fewer than 5, then all expected values must be greater
than 5.


An Example: Suppose the monthly number of accidents reported in a factory in three eight-
hour shifts is 1, 7, and 7, respectively. Are the working conditions and the exposure to risk
similar for all shifts? Clearly, the answer must be, No they are not. However, applying the
goodness-of-fit test at the 0.05 level, under the null hypothesis that there are no differences
in the number of accidents in the three shifts, one expects 5, 5, and 5 accidents in each shift.
The Chi-square test statistic is:

χ² = Σ [(O - E)² / E] = 4.8

However, since the critical value of χ² with 3 - 1 = 2 d.f. at the 0.95 level is 5.99, there is no
reason to reject that there is no difference, which is a very strange conclusion. What is wrong
with this application?

You might like to use this JavaScript to verify your computation.

Testing the Variance: Is the Quality that Good?

Suppose a population has a normal distribution. The manager is to test a specific claim made
about the quality of the population by way of testing its variance σ². Among three possible
scenarios, the interesting case is in testing the following null hypothesis based on a set of n
random sample observations:

H0: Variation is about the claimed value.
Ha: The variation is more than what is claimed, indicating the quality is much lower than
expected.

Upon computing the estimated variance S² based on n observations, the statistic:

χ² = [(n-1)S²] / σ²

has a Chi-square distribution with degrees of freedom ν = n - 1. This statistic is then used for
testing the above null hypothesis.
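
A minimal Python sketch of this one-sided test follows. The measurements and the claimed variance are hypothetical, and the helper names `chi2_sf` and `variance_test` are illustrative; the tail probability uses the standard recurrence for the Chi-square distribution with integer degrees of freedom.

```python
from math import erfc, exp, gamma, sqrt
from statistics import variance

def chi2_sf(x, df):
    """Upper-tail probability of a Chi-square variable with integer df."""
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, exp(-z)
    else:
        a, q = 0.5, erfc(sqrt(z))
    while a < df / 2.0:
        q += z ** a * exp(-z) / gamma(a + 1.0)
        a += 1.0
    return q

def variance_test(data, claimed_variance):
    """Test H0: variance equals the claim, against Ha: variance is larger."""
    df = len(data) - 1
    statistic = df * variance(data) / claimed_variance   # (n-1)S^2 / sigma^2
    return statistic, df, chi2_sf(statistic, df)

# Hypothetical measurements and a hypothetical claimed variance of 0.1.
measurements = [9.8, 10.4, 10.1, 9.5, 10.9, 10.3, 9.6, 10.2]
statistic, df, p_value = variance_test(measurements, 0.1)
print(f"chi-square = {statistic:.2f}, d.f. = {df}, p-value = {p_value:.3f}")
```

A small p-value here suggests that the variation exceeds what is claimed, provided the normality condition holds.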

You might like to use Testing the Variance JavaScript to check your computations.

Testing the Equality of Multi-Variances

The equality of variances across populations is called homogeneity of variances or
homoscedasticity. Some statistical tests, such as testing equality of the means by the t-test
and ANOVA, assume that the data come from populations that have the same variance, even
if the test rejects the null hypothesis of equality of population means. If this condition of
homogeneity of variance is not met, the statistical test results may not be valid.
Heteroscedasticity refers to lack of homogeneity of variances.

Bartlett's Test is used to test if k samples have equal variances. It compares the Geometric
Mean of the group variances to the arithmetic mean; therefore, it is a Chi-square statistic with
(k-1) degrees of freedom, where k is the number of categories in the independent variable.
The test is sensitive to departures from normality. The sample sizes do not have to be equal
but each must be at least 6. Just like the two population t-test, ANOVA can go wrong when
the equality of variances condition is not met.

The Bartlett test statistic is designed to test for equality of variances across groups against
the alternative that variances are unequal for at least two groups. Formally,

H0: All variances are almost equal.

The test statistic:

$B = \{\sum [(n_i - 1) \ln S^2] - \sum [(n_i - 1) \ln S_i^2]\} / C$


In the above, $S_i^2$ is the variance of the ith group, $n_i$ is the sample size of the ith group, k is the
number of groups, and $S^2$ is the pooled variance. The pooled variance is a weighted average
of the group variances and is defined as:

$S^2 = \{\sum [(n_i - 1) S_i^2]\} / \sum (n_i - 1)$, over all i = 1, 2, ..., k

and

$C = 1 + \{\sum [1/(n_i - 1)] - 1 / \sum (n_i - 1)\} / [3(k - 1)]$.

You might like to use the Equality of Multi-Variances JavaScript to check your computations,
and to perform some numerical experimentation for a deeper understanding of the concepts.

Rule of 2: For 3 or more populations, there is a practical rule known as the "Rule of 2".
According to this rule, one divides the largest sample variance by the smallest sample
variance. If the sample sizes are almost the same and this ratio is less than 2, then the
variations of the populations are almost the same.

Example: Consider the following three random samples from three populations, P1, P2, P3:

            Sample P1   Sample P2   Sample P3
               25          17           8
               25          21          10
               20          17          14
               18          25          16
               13          19          12
                6          21          14
                5          15           6
               22          16          16
               25          24          13
               10          23           6
N              10          10          10
Mean        16.90       19.80       11.50
Std.Dev.     7.87        3.52        3.81
SE Mean      2.49        1.11        1.20

The ANOVA Table

Sources of Variation   Sum of Squares   Degrees of Freedom   Mean Squares   F-Statistic
Between Samples             79.40                2               39.70          4.38
Within Samples             244.90               27                9.07
Total                      324.30               29

With an F = 4.38 and a p-value of 0.023, we reject the null at $\alpha = 0.05$. This is not good news,
since ANOVA, like the two-sample t-test, can go wrong when the equality of variances
condition is not met.
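
To see why the equality-of-variances condition is in doubt for these three samples, the following Python sketch (standard library only, illustrative names) applies the Rule of 2 and Bartlett's test to the data above. It assumes the usual form of the correction factor, $C = 1 + \{\sum [1/(n_i - 1)] - 1/\sum (n_i - 1)\} / [3(k - 1)]$, and uses the fact that with k = 3 groups the statistic has 2 degrees of freedom, where the Chi-square upper tail is exactly exp(-x/2).

```python
import math
import statistics

samples = {
    "P1": [25, 25, 20, 18, 13, 6, 5, 22, 25, 10],
    "P2": [17, 21, 17, 25, 19, 21, 15, 16, 24, 23],
    "P3": [8, 10, 14, 16, 12, 14, 6, 16, 13, 6],
}

variances = {g: statistics.variance(xs) for g, xs in samples.items()}
dfs = {g: len(xs) - 1 for g, xs in samples.items()}
k = len(samples)
total_df = sum(dfs.values())

# Rule of 2: ratio of the largest to the smallest sample variance.
ratio = max(variances.values()) / min(variances.values())

# Bartlett's statistic: pooled variance, log comparison, correction factor C.
pooled = sum(dfs[g] * variances[g] for g in samples) / total_df
numerator = total_df * math.log(pooled) - sum(dfs[g] * math.log(variances[g]) for g in samples)
correction = 1 + (sum(1 / d for d in dfs.values()) - 1 / total_df) / (3 * (k - 1))
bartlett = numerator / correction

p_value = math.exp(-bartlett / 2)   # exact only because df = k - 1 = 2

print(f"variance ratio = {ratio:.2f}")
print(f"Bartlett B = {bartlett:.2f}, df = {k - 1}, p-value = {p_value:.3f}")
```

A variance ratio well above 2 together with a small Bartlett p-value would both point away from homogeneity of variances for these samples.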

Further Readings:
Hand D., and C. Taylor, Multivariate Analysis of Variance and Repeated Measures, Chapman and Hall, 1987.
Miller R. Jr, Beyond ANOVA: Basics of Applied Statistics, Wiley, 1986.

Correlation Coefficients Testing

Fisher's Z-transformation is a useful tool in the circumstances in which two or more
independent correlation coefficients are to be compared simultaneously. To perform such a
test one may evaluate the Chi-square statistic:

$\chi^2 = \sum [(n_i - 3) Z_i^2] - [\sum (n_i - 3) Z_i]^2 / [\sum (n_i - 3)]$, where the sums are over all i = 1, 2, ..., k.


where the Fisher Z-transformation is

$Z_i = 0.5[\ln(1 + r_i) - \ln(1 - r_i)]$, provided $|r_i| \neq 1$.

Under the null hypothesis:

H0: All correlation coefficients are almost equal.

The test statistic $\chi^2$ has (k-1) degrees of freedom, where k is the number of populations.

An Application: Consider the following correlation coefficients obtained by random sampling
from ten independent populations.

Population Pi   Correlation ri   Sample Size ni
      1             0.72              67
      2             0.41              93
      3             0.57              73
      4             0.53              98
      5             0.62              82
      6             0.21              39
      7             0.68              91
      8             0.53              27
      9             0.49              75
     10             0.50              49

Using the above formula, the $\chi^2$ statistic is 19.916 with k - 1 = 9 degrees of freedom, which has a
p-value of about 0.02. Therefore, there is moderate evidence against the null hypothesis.
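
The statistic for this application can be recomputed with the short Python sketch below (standard library only). The helper `chi2_sf` is an assumption of this sketch, not something from the original site: it evaluates the Chi-square upper-tail probability through the classical finite-sum expressions for integer degrees of freedom, so that no external library is needed.

```python
import math

def chi2_sf(x, df):
    """Upper-tail Chi-square probability for integer df, via finite-sum closed forms."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= (x / 2) / j                 # (x/2)^j / j!
            total += term
        return math.exp(-x / 2) * total
    chi = math.sqrt(x)
    density = math.exp(-x / 2) / math.sqrt(2 * math.pi)   # standard normal density at chi
    term, total = chi, 0.0
    for j in range(1, (df - 1) // 2 + 1):
        if j > 1:
            term *= x / (2 * j - 1)             # chi^(2j-1) / (1*3*...*(2j-1))
        total += term
    return math.erfc(chi / math.sqrt(2)) + 2 * density * total

r = [0.72, 0.41, 0.57, 0.53, 0.62, 0.21, 0.68, 0.53, 0.49, 0.50]
n = [67, 93, 73, 98, 82, 39, 91, 27, 75, 49]

z = [math.atanh(ri) for ri in r]     # atanh(r) = 0.5 * [ln(1 + r) - ln(1 - r)]
w = [ni - 3 for ni in n]             # weights n_i - 3

weighted_sum = sum(wi * zi for wi, zi in zip(w, z))
chi_sq = sum(wi * zi ** 2 for wi, zi in zip(w, z)) - weighted_sum ** 2 / sum(w)
df = len(r) - 1
p_value = chi2_sf(chi_sq, df)

print(f"chi-square = {chi_sq:.3f}, df = {df}, p-value = {p_value:.3f}")
```

Dropping a suspected outlier from `r` and `n` and rerunning the computation is one way to carry out the sub-group search described next.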

In such a case, one may omit a few outliers from the group, then use the Test for Equality of
Several Correlation Coefficients JavaScript. Repeat this process until a possible
homogeneous sub-group may emerge.

You might need to use Sample Size Determination JavaScript at the design stage of your
statistical investigation in decision making with specific subjective requirements.

Simple Linear Regression: Computational Aspects

Regression analysis has three goals: predicting, modeling, and characterization. What
would be the logical order in which to tackle these three goals such that one task leads to
and/or justifies the other tasks? Clearly, it depends on what the prime objective is.
Sometimes you wish to model in order to get better prediction. Then the order is obvious.
Sometimes, you just want to understand and explain what is going on. Then modeling is
again the key, though out-of-sample predicting may be used to test any model. Often
modeling and predicting proceed in an iterative way and there is no 'logical order' in the
broadest sense. You may model to get predictions, which enable better control, but iteration
is again likely to be present and there are sometimes special approaches to control
problems.

The following presents the essential steps in building and analyzing a regression model,
in the context of an applied numerical example.

Formulas and Notations:

$\bar{x} = \sum x / n$
This is just the mean of the x values.
$\bar{y} = \sum y / n$
This is just the mean of the y values.
$S_{xx} = SS_{xx} = \sum (x_i - \bar{x})^2 = \sum x^2 - (\sum x)^2 / n$
$S_{yy} = SS_{yy} = \sum (y_i - \bar{y})^2 = \sum y^2 - (\sum y)^2 / n$
$S_{xy} = SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum (x \times y) - (\sum x)(\sum y) / n$
Slope: $m = SS_{xy} / SS_{xx}$
Intercept: $b = \bar{y} - m \bar{x}$
y-predicted: $\hat{y}_i = m x_i + b$
Residual(i) = Error(i) = $y_i - \hat{y}_i$
$SSE = SS_{res} = SS_{errors} = \sum (y_i - \hat{y}_i)^2 = SS_{yy} - m \, SS_{xy}$
Standard deviation of residuals: $s = S_{res} = S_{errors} = [SS_{res} / (n - 2)]^{1/2}$
Standard error of the slope (m): $S_{res} / SS_{xx}^{1/2}$
Standard error of the intercept (b): $S_{res} [(SS_{xx} + n \bar{x}^2) / (n \, SS_{xx})]^{1/2}$
$R^2 = (SS_{yy} - SSE) / SS_{yy}$

A Computational Example: A taxicab company manager believes that the monthly repair
costs (Y) of cabs are related to age (X) of the cabs. Five cabs are selected randomly and
from their records we obtained the following data: (x, y) = {(2, 2), (3, 5), (4, 7), (5, 10), (6, 11)}.

The first step in constructing a simple linear regression model is to draw a scatter diagram,
as shown in the following figure for our numerical example:

A Visual Procedure as an Assessment Tool and Decision Process
for Linearity of the Best Fit Based on the Scatter Diagram

The linear dependency of variable Y on variable X can be checked graphically by carefully
examining all the points in the scatter diagram to see whether it is possible to bound all the
points within two parallel lines, shown in green in the above figure.

The graphical method of line fitting is illustrated in the above figure. The best regression line
fitting the data is parallel to the bounds and always passes through the point with
coordinates (mean of x values, mean of y values). This point is known as the mean-mean
point, and it is highlighted by a red circle around it in the above figure.

Based on our practical knowledge and the scatter diagram of the data, we hypothesize a
linear relationship between predictor X, and the cost Y.

Least Squares Method: The best-fit line is the one with the smallest value for the sum of
the squares of the deviations between y and $\hat{y}$. Notice that a regression of Y against X
and a regression of X against Y generally give very different estimates of the slope and
the intercept.

Now the question is how we can best (i.e., in the least squares sense) use the sample information to
estimate the unknown slope (m) and the intercept (b). The first step in finding the least
squares line is to construct a sum of squares table to find the sums of the x values ($\sum x$), the y values
($\sum y$), the squares of the x values ($\sum x^2$), the squares of the y values ($\sum y^2$), and the
cross-products of the corresponding x and y values ($\sum xy$), as shown in the following table:

         x      y     x²     xy     y²
         2      2      4      4      4
         3      5      9     15     25
         4      7     16     28     49
         5     10     25     50    100
         6     11     36     66    121
SUM     20     35     90    163    299

The second step is to substitute the values of $\sum x$, $\sum y$, $\sum x^2$, $\sum xy$, and $\sum y^2$ into the following
formulas:

$SS_{xy} = \sum xy - (\sum x)(\sum y)/n = 163 - (20)(35)/5 = 163 - 140 = 23$

$SS_{xx} = \sum x^2 - (\sum x)^2/n = 90 - (20)^2/5 = 90 - 80 = 10$

$SS_{yy} = \sum y^2 - (\sum y)^2/n = 299 - (35)^2/5 = 299 - 245 = 54$

Use the first two values to compute the estimated slope:

Slope $= m = SS_{xy} / SS_{xx} = 23 / 10 = 2.3$

To estimate the intercept of the least squares line, use the fact that the graph of the least
squares line always passes through the point $(\bar{x}, \bar{y})$; therefore,

Intercept $= b = \bar{y} - m \bar{x} = (\sum y)/5 - (2.3)(\sum x)/5 = 35/5 - (2.3)(20/5) = -2.2$

Therefore the least squares line is:

y-predicted $= \hat{y} = mx + b = -2.2 + 2.3x$.
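
The two steps above can be sketched in a few lines of Python (standard library only; variable names are illustrative). The sketch builds the same sums as the sum of squares table and then applies the computational formulas for the slope and the intercept.

```python
xs = [2, 3, 4, 5, 6]       # age of cab (x)
ys = [2, 5, 7, 10, 11]     # monthly repair cost (y)
n = len(xs)

# Column sums of the sum of squares table.
sum_x, sum_y = sum(xs), sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_y2 = sum(y * y for y in ys)

# Corrected sums of squares and cross-products.
ss_xy = sum_xy - sum_x * sum_y / n
ss_xx = sum_x2 - sum_x ** 2 / n
ss_yy = sum_y2 - sum_y ** 2 / n

slope = ss_xy / ss_xx
intercept = sum_y / n - slope * sum_x / n    # line passes through the mean-mean point

print(f"SSxy = {ss_xy}, SSxx = {ss_xx}, SSyy = {ss_yy}")
print(f"yhat = {intercept:.1f} + {slope:.1f} x")
```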

After estimating the slope and the intercept, the question is how we determine statistically if
the model is good enough, say for prediction. The standard error of the slope is:

Standard error of the slope (m) $= S_m = S_{res} / SS_{xx}^{1/2}$,

and its relative precision is measured by the statistic

$t_{slope} = m / S_m$.

For our numerical example, with $S_{res} = 0.6055$ (the standard deviation of the residuals), it is:

$t_{slope} = 2.3 / [0.6055 / 10^{1/2}] = 12.01$

which is large enough, indicating that the fitted model is a "good" one.

You may ask, in what sense is the least squares line the "best-fitting" straight line to 5 data
points. The least squares criterion chooses the line that minimizes the sum of squared vertical
deviations, i.e., residual = error = $y - \hat{y}$:

$SSE = \sum (y - \hat{y})^2 = \sum (error)^2 = 1.1$

The numerical value of SSE is obtained from the following computational table for our
numerical example.

Predictor x   y-predicted (-2.2 + 2.3x)   Observed y   Error   Squared error
     2                  2.4                    2         -0.4       0.16
     3                  4.7                    5          0.3       0.09
     4                  7.0                    7          0.0       0.00
     5                  9.3                   10          0.7       0.49
     6                 11.6                   11         -0.6       0.36
                                                        Sum = 0   Sum = 1.1

Alternatively, one may compute SSE by:

$SSE = SS_{yy} - m \, SS_{xy} = 54 - (2.3)(23) = 54 - 52.9 = 1.1$,

which, as expected, agrees with the value computed directly from the above table.
The numerical value of SSE gives the estimate of the variance of the errors, $s^2$:

$s^2 = SSE / (n - 2) = 1.1 / (5 - 2) = 0.36667$

The estimated error variance is a measure of the variability of the y values about
the estimated line. Clearly, we could also compute the estimated standard deviation s of the
residuals by taking the square root of the variance $s^2$.

As the last step in the model building, the following Analysis of Variance (ANOVA) table is
then constructed to assess the overall goodness-of-fit using the F-statistic:

Analysis of Variance Components

Source   DF   Sum of Squares   Mean Square   F Value   Prob > F
Model     1   52.90000         52.90000      144.273   0.0012
Error     3   SSE = 1.1        0.36667
Total     4   SSyy = 54

For practical purposes, the fit is considered acceptable if the F-statistic is more than five
times the F-value from the F distribution tables at the back of your textbook. Note that the
criterion that the F-statistic must be more than five times the F-value from the F distribution
tables is independent of the sample size.

Notice also that there is a relationship between the two statistics that assess the quality of the
fitted line, namely the t-statistic of the slope and the F-statistic in the ANOVA table. The
relationship is:

$t_{slope}^2 = F$

This relationship can be verified for our computational example: $12.01^2 \approx 144.27$.
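
As a check on the quantities used so far, the following Python sketch (standard library only, illustrative names) recomputes the residuals, SSE, the standard deviation of the residuals, the t-statistic of the slope, and the F-statistic for the taxicab data, and confirms numerically that the squared t-statistic equals F.

```python
import math

xs = [2, 3, 4, 5, 6]
ys = [2, 5, 7, 10, 11]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

ss_xx = sum((x - x_bar) ** 2 for x in xs)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
ss_yy = sum((y - y_bar) ** 2 for y in ys)

slope = ss_xy / ss_xx
intercept = y_bar - slope * x_bar

residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
sse = sum(e * e for e in residuals)            # error sum of squares
s_res = math.sqrt(sse / (n - 2))               # standard deviation of residuals
se_slope = s_res / math.sqrt(ss_xx)            # standard error of the slope
t_slope = slope / se_slope
f_stat = (ss_yy - sse) / (sse / (n - 2))       # model mean square / error mean square
r_squared = 1 - sse / ss_yy

print(f"SSE = {sse:.2f}, s = {s_res:.4f}")
print(f"t = {t_slope:.2f}, t^2 = {t_slope ** 2:.2f}, F = {f_stat:.2f}, R^2 = {r_squared:.2f}")
```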

The Coefficient of Determination: The coefficient of determination is defined, and denoted
by $R^2$:

$R^2 = (SS_{yy} - SSE) / SS_{yy} = 1 - (SSE / SS_{yy})$, $0 \le R^2 \le 1$

The numerical value of R2 represents the proportion of the sum of squares of deviations of
the y values about their mean that can be attributed to the linear relationship between y and
x.

For our numerical example, we have:

$R^2 = (SS_{yy} - SSE) / SS_{yy} = (54 - 1.1) / 54 = 0.98$

This means that about 98% of the variation in the monthly repair cost is associated with the
cabs having different ages. Therefore, the age of a cab is a very strong factor in predicting its
repair cost by the constructed linear model between age (x) and repair cost (y).

If the sample size is large enough, say over 30 pairs of (x, y), then $R^2$ has a stronger and more
useful meaning. That is, the value of $R^2$ is the percentage of variation in y that can be
attributed to the variation in predictor x when predicting y by the constructed linear model.

Predictions by Regression: After we have statistically checked the goodness-of-fit of the
model and the residuals conditions are satisfied, we are ready to use the model for prediction
with confidence. A confidence interval provides a useful way of assessing the quality of
prediction. In prediction by regression, often one or more of the following constructions are of
interest:

1. A confidence interval for a single future value of Y corresponding to a chosen value of X.
2. A confidence interval for a single point on the line.
3. A confidence region for the line as a whole.

Confidence Interval Estimate for a Future Value: A confidence interval of interest can be
used to evaluate the accuracy of a single (future) value of Y corresponding to a chosen value
of X (say, X0). It provides a confidence interval for a single future value of Y corresponding to X0
with a desirable confidence level $1 - \alpha$:

$Y_p \pm S_e \, t_{n-2,\,\alpha/2} \, \{1 + 1/n + (X_0 - \bar{x})^2 / SS_{xx}\}^{1/2}$

where $S_e$ is the standard deviation of the residuals.

Confidence Interval Estimate for a Single Point on the Line: If a particular value of the
predictor variable (say, X0) is of special importance, a confidence interval on the value of the
criterion variable (i.e., average Y at X0) corresponding to X0 may be of interest. It provides a
confidence interval on the estimated mean value of Y corresponding to X0 with a desirable
confidence level $1 - \alpha$:

$Y_p \pm S_e \, t_{n-2,\,\alpha/2} \, \{1/n + (X_0 - \bar{x})^2 / SS_{xx}\}^{1/2}$

It is of interest to compare the above two different kinds of confidence interval. The first kind
is wider, which reflects the lower accuracy in estimating a single future value of Y
rather than the mean value computed for the second kind of confidence
interval. The second kind of confidence interval can also be used to identify any outliers in
the data.

Confidence Region for the Regression Line as a Whole: When the entire line is of interest,
a confidence region permits one to simultaneously make confidence statements about
estimates of Y for a number of values of the predictor variable X. In order that the region
adequately covers the range of interest of the predictor variable X, the data size usually must be
more than 10 pairs of observations.

$Y_p \pm S_e \{(2 F_{2,\,n-2,\,\alpha}) [1/n + (X_0 - \bar{x})^2 / SS_{xx}]\}^{1/2}$

In all cases the JavaScript provides the results for the nominal (x) values. For other values of
X one may use computational methods directly, a graphical method, or linear
interpolation to obtain approximate results. These approximations are in the safe direction,
i.e., they are slightly wider than the exact values.
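
The difference between the interval for the mean value of Y at X0 and the interval for a single future value can be illustrated with the taxicab data. The Python sketch below (standard library only) is an assumption-laden illustration: the critical value 3.182 is the usual t-table entry for n - 2 = 3 degrees of freedom at the two-sided 95% level, entered by hand because the standard library has no t quantile function, and the function name `half_widths` is hypothetical.

```python
import math

xs = [2, 3, 4, 5, 6]
ys = [2, 5, 7, 10, 11]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

ss_xx = sum((x - x_bar) ** 2 for x in xs)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
slope = ss_xy / ss_xx
intercept = y_bar - slope * x_bar
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
s_e = math.sqrt(sse / (n - 2))       # standard deviation of the residuals

T_CRIT = 3.182                       # t-table value: 3 df, alpha/2 = 0.025 (entered by hand)

def half_widths(x0):
    """Return (mean-of-Y, single-future-value) interval half-widths at x0."""
    core = 1 / n + (x0 - x_bar) ** 2 / ss_xx
    on_line = T_CRIT * s_e * math.sqrt(core)        # point on the line (average Y at x0)
    future = T_CRIT * s_e * math.sqrt(1 + core)     # single future value of Y at x0
    return on_line, future

for x0 in (4, 6):
    y_p = intercept + slope * x0
    on_line, future = half_widths(x0)
    print(f"x0 = {x0}: Yp = {y_p:.2f}, mean of Y: +/- {on_line:.2f}, future Y: +/- {future:.2f}")
```

Both intervals are narrowest at the mean of the x values and widen as X0 moves away from it, and the interval for a single future value is always the wider of the two.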

Regression Modeling and Analysis

Many problems in analyzing data involve describing how variables are related. The simplest
of all models describing the relationship between two variables is a linear, or straight-line,
model. Linear regression is always linear in the coefficients being estimated, not necessarily
linear in the variables.

The simplest method of drawing a linear model is to "eye-ball" a line through the data on a
plot, but a more elegant, and conventional, method is that of least squares, which finds the
line minimizing the sum of the squared vertical distances between observed points and the fitted line.
Realize that fitting the "best" line by eye is difficult, especially when there is much residual
variability in the data.

Know that there is a simple connection between the numerical coefficients in the regression
equation and the slope and intercept of the regression line.

Know that a single summary statistic, like a correlation coefficient, does not tell the whole
story. A scatterplot is an essential complement to examining the relationship between the two
variables.


Again, the regression line is a group of estimates for the variable plotted on the Y-axis. It has
a form of y = b + mx, where m is the slope of the line. The slope is the rise over run. If a line
goes up 2 for each 1 it goes over, then its slope is 2.

The regression line goes through a point with coordinates of (mean of x values, mean of y
values), known as the mean-mean point.

If you plug each x in the regression equation, then you obtain a predicted value for y. The
difference between the predicted y and the observed y is called a residual, or an error term.
Some errors are positive and some are negative. The sum of squares of the errors plus the
sum of squares of the estimates add up to the sum of squares of Y:

Partitioning the Three Sum of Squares



The regression line is the line that minimizes the variance of the errors. The mean error is
zero; so, this means that it minimizes the sum of the squared errors.

The reason for finding the best fitting line is so that you can make a reasonable prediction of
what y will be if x is known (not vice versa).

$r^2$ is the variance of the estimates divided by the variance of Y. r is the size of the slope of the
regression line, in terms of standard deviations. In other words, it is the slope of the
regression line if we use the standardized X and Y. It is how many standard deviations of Y
you would go up, when you go one standard deviation of X to the right.

Coefficient of Determination: Another measure of the closeness of the points to the
regression line is the Coefficient of Determination:

$r^2 = SS_{\hat{y}\hat{y}} / SS_{yy}$

which is the proportion of the squared deviation in Y that is explained by the points on the least
squares regression line.

Homoscedasticity and Heteroscedasticity: Homoscedasticity (homo = same, skedasis =
scattering) is a word used to describe the distribution of data points around the line of best fit.
The opposite term is heteroscedasticity. Briefly, homoscedasticity means that data points are
distributed equally about the line of best fit. Therefore, homoscedasticity means constancy of
variance over all the levels of factors. Heteroscedasticity means that the data points cluster
or clump above and below the line in a non-equal pattern.

Standardized Regression Analysis: The scale of measurements used to measure X and Y
has a major impact on the regression equation and correlation coefficient. This impact is more
drastic when comparing two regression equations having different scales of measurement. To
overcome these drawbacks, one must standardize both X and Y prior to constructing the
regression and interpreting the results. In such a model, the slope is equal to the correlation
coefficient r. Notice that, in the standardized model, the derivative of Y with respect to the
independent variable X is the correlation coefficient. Therefore, there is a nice similarity in the
meaning of r in statistics and the derivative from calculus, in that its sign and its magnitude reveal
the increasing/decreasing and the rate of change, as the derivative of a function does.

In the usual regression modeling the estimated slope and intercept are correlated;
therefore, any error in estimating the slope influences the estimate of the intercept. One of
the main advantages of using the standardized data is that the intercept is always equal to
zero.

Regression when both X and Y are random: Simple linear least-squares regression has
among its conditions that the data for the independent (X) variables are known without error.
In fact, the estimated results are conditioned on whatever errors happened to be present in
the independent data sets. When the X-data have an error associated with them the result is
to bias the slope downwards. A procedure known as Deming regression can handle this
problem quite well. Biased slope estimates (due to error in X) can be avoided using Deming
regression.

If X and Y are random variables, then the correlation coefficient R is often referred to as the
Coefficient of Reliability.

The Relationship Between Slope and Correlation Coefficient: By a little bit of algebraic
manipulation, one can show that the coefficient of correlation is related to the slopes of the two
regression lines: Y on X, and X on Y, denoted by $m_{yx}$ and $m_{xy}$, respectively:

$R^2 = m_{yx} \cdot m_{xy}$
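
Both this relationship and the earlier statement that the slope of the standardized regression equals r can be checked numerically. The sketch below is a minimal Python illustration (standard library only) that reuses the taxicab data from the computational example; the variable names are illustrative.

```python
import math

xs = [2, 3, 4, 5, 6]
ys = [2, 5, 7, 10, 11]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

ss_xx = sum((x - x_bar) ** 2 for x in xs)
ss_yy = sum((y - y_bar) ** 2 for y in ys)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

m_yx = ss_xy / ss_xx                       # slope of the regression of Y on X
m_xy = ss_xy / ss_yy                       # slope of the regression of X on Y
r = ss_xy / math.sqrt(ss_xx * ss_yy)       # Pearson correlation

# Standardize both variables, then fit the least squares line to the z-scores.
sd_x = math.sqrt(ss_xx / (n - 1))
sd_y = math.sqrt(ss_yy / (n - 1))
zx = [(x - x_bar) / sd_x for x in xs]
zy = [(y - y_bar) / sd_y for y in ys]
std_slope = sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)
std_intercept = sum(zy) / n - std_slope * sum(zx) / n

print(f"m_yx * m_xy = {m_yx * m_xy:.4f}, r^2 = {r ** 2:.4f}")
print(f"standardized slope = {std_slope:.4f}, r = {r:.4f}, intercept = {std_intercept:.4f}")
```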

Lines of regression through the origin: Often the conditions of a practical problem require
that the regression line go through the origin (x = 0, y = 0). In such a case, the regression line
has one parameter only, which is its slope:

$m = \sum (x_i \times y_i) / \sum x_i^2$

Notice: The requirement of a zero intercept has a major impact on the inferential statistics of
the estimated model; that is, in general one cannot apply any test of hypothesis or construct a
confidence interval. Forcing the regression equation through the origin limits its
applications. In the usual unconstrained case, the expected error of the regression equation
is equal to zero, and the errors are distributed normally. One may apply
inferential statistics to the zero-intercept model only if the mean of the X and Y falls
exactly upon the calculated regression line; this is an additional requirement to all other
conditions of the usual regression analysis. Therefore, for models that omit the
intercept, it is generally agreed that, for example, $R^2$ should not be defined or even
considered.
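
A short Python sketch (standard library only) shows the slope formula for a line through the origin and why the usual properties are lost. Applying it to the taxicab data is purely an illustration chosen here, not something the original example does.

```python
xs = [2, 3, 4, 5, 6]
ys = [2, 5, 7, 10, 11]
n = len(xs)

# Slope of the least squares line forced through the origin.
slope_origin = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

residuals = [y - slope_origin * x for x, y in zip(xs, ys)]
mean_residual = sum(residuals) / n

# Does the line pass through the mean-mean point?
x_bar, y_bar = sum(xs) / n, sum(ys) / n
gap_at_mean = y_bar - slope_origin * x_bar

print(f"slope through origin = {slope_origin:.4f}")
print(f"mean residual = {mean_residual:.4f}, gap at the mean-mean point = {gap_at_mean:.4f}")
```

Because the forced line does not pass through the mean-mean point for these data, the residuals no longer average to zero, which is the additional requirement mentioned in the notice above.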

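Parabola models: Parabola regressions have three coefficients with a general form:

$Y = a + bX + cX^2$,

where

$c = \{n \sum (x_i - \bar{x})^2 y_i - \sum (x_i - \bar{x})^2 \sum y_i\} / \{n \sum (x_i - \bar{x})^4 - [\sum (x_i - \bar{x})^2]^2\}$

$b = [\sum (x_i - \bar{x}) y_i] / [\sum (x_i - \bar{x})^2] - 2 c \bar{x}$

$a = \{\sum y_i - c \sum (x_i - \bar{x})^2\} / n - (c \bar{x}^2 + b \bar{x})$,

where $\bar{x}$ is the mean of the $x_i$'s. These closed forms assume that $\sum (x_i - \bar{x})^3 = 0$, as is
the case for equally spaced x values.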
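
As a sanity check on these closed forms, the Python sketch below (standard library only) applies them to hypothetical, equally spaced data generated exactly from $Y = 1 + 2X + 3X^2$, so the known coefficients should be recovered. It uses the numerator of c in the form $n \sum (x_i - \bar{x})^2 y_i - \sum (x_i - \bar{x})^2 \sum y_i$ and relies on $\sum (x_i - \bar{x})^3 = 0$, which holds for equally spaced x values; the data and names are assumptions of this sketch.

```python
xs = [0, 1, 2, 3, 4]                          # hypothetical, equally spaced
ys = [1 + 2 * x + 3 * x * x for x in xs]      # exact parabola: a = 1, b = 2, c = 3
n = len(xs)
x_bar = sum(xs) / n

u = [x - x_bar for x in xs]                   # centered x values
sum_u2 = sum(v ** 2 for v in u)
sum_u3 = sum(v ** 3 for v in u)               # must be zero for these formulas
sum_u4 = sum(v ** 4 for v in u)
sum_y = sum(ys)
sum_uy = sum(v * y for v, y in zip(u, ys))
sum_u2y = sum(v ** 2 * y for v, y in zip(u, ys))

c = (n * sum_u2y - sum_u2 * sum_y) / (n * sum_u4 - sum_u2 ** 2)
b = sum_uy / sum_u2 - 2 * c * x_bar
a = (sum_y - c * sum_u2) / n - (c * x_bar ** 2 + b * x_bar)

print(f"sum of cubed deviations = {sum_u3}")
print(f"a = {a:.3f}, b = {b:.3f}, c = {c:.3f}")
```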

Applications of quadratic regression include fitting the supply and demand curves in
econometrics and fitting the ordering cost and holding cost functions in inventory control for
finding the optimal ordering quantity.

You might like to use Quadratic Regression JavaScript to check your hand computation. For
higher degrees than quadratic, you may like to use the Polynomial Regressions JavaScript.

Multiple Linear Regression: The objectives in a multiple regression problem are essentially
the same as for a simple regression. While the objectives remain the same, the more
predictors we have, the more complicated the calculations and interpretations become. With
multiple regression, we can use more than one predictor. It is always best, however, to be
parsimonious, that is, to use as few variables as predictors as necessary to get a reasonably
accurate forecast. Multiple regression is best modeled with a commercial package such as
SAS or SPSS. The forecast takes the form:

$Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n$,

where $b_0$ is the intercept, and $b_1, b_2, \dots, b_n$ are coefficients representing the contribution of the
independent variables $X_1, X_2, \dots, X_n$.

For small sample size, you may like to use the Multiple Linear Regression JavaScript.

What Is Auto-Regression: In time series analysis and forecasting techniques, linear
regression is often used to combine present and past values of an observation in order to forecast
its future value. The model is called an autoregressive model. For details and the implementation
process visit Autoregressive Modeling JavaScript.

What Is Logistic Regression: Standard logistic regression is a method for modeling binary
data (e.g., does a person smoke or not? does a person survive a disease or not?).
Polytomous logistic regression is a method for modeling more than two options (e.g., does a
person take the bus, drive a car, or take the subway? does an office use WordPerfect, Word,
or other office-ware?).

Why Linear Regression? The study of corn shell (i.e., ear of corn) height versus rainfall has
been shown to have the following regression curve:

Why Linear Regression?

Clearly, the relationship is highly nonlinear; however, if we are interested in a "small" range
(say, for a specific geographical area, like southern region of the state of Maryland) then the
condition of linearity might be satisfactory. A typical application is depicted in the above figure
where we are interested in predicting the height of corn in an area with rainfall in the range of
[a, b]. Magnifying process of scale for this range allows us to fit a useful linear regression. If
the range is not short enough, then one may sub-divide the range accordingly by applying the
same process of fitting a few lines, one for each sub-interval.

Structural Changes: When a regression model has been estimated using the available data
set, an additional data set may sometimes become available. To test if previous model is still
valid or the two separate models are equivalent or not, one may use the analysis of
covariance testing described on this site.

You might like to use the Regression Analysis JavaScript to check your computations and to
perform some numerical experimentation for a deeper understanding of the concepts.

Further Reading:
Chatterjee S., B. Price, and A. Hadi, Regression Analysis by Example, Wiley, 1999.

Regression Modeling Selection Process

When you have more than one regression equation based on data, to select the "best model",
you should compare:

1. R-squares: That is, the percentage of variance [in fact, the sum of squares] in Y accounted for by
variance in X captured by the model.
2. When you want to compare models of different sizes (different numbers of independent variables (p)
and/or different sample sizes n), you must use the Adjusted R-Square, because the usual r-square tends
to grow with the number of independent variables.

$r_a^2 = 1 - (n - 1)(1 - r^2)/(n - p - 1)$

3. Standard deviation of error terms, i.e., observed y-value - predicted y-value for each x.
4. Trends in errors as a function of control variable x. Systematic trends are not uncommon.
5. The T-statistic of individual parameters.
6. The values of the parameters and their consistency with the subject-matter (content) underpinnings.
7. The $F_{df_1,\,df_2}$ value for overall assessment, where $df_1$ (numerator degrees of freedom) is the number of
linearly independent predictors in the assumed model minus the number of linearly independent
predictors in the restricted model, i.e., the number of linearly independent restrictions imposed on the
assumed model, and $df_2$ (denominator degrees of freedom) is the number of observations minus the
number of linearly independent predictors in the assumed model.

The observed F-statistic should exceed not merely the selected critical value of F-table, but at least four
times the critical value.

Finally, in statistics for business, there exists an opinion that with more than 4 parameters, one can
fit an elephant; so if one attempts to fit a regression function that depends on many parameters,
the result should not be regarded as very reliable.
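
The Adjusted R-Square from item 2 is simple to compute. The Python sketch below is a minimal illustration (the function name `adjusted_r_squared` and the second set of example values are hypothetical); the first call uses the taxicab example, where $r^2 = 52.9/54$ with n = 5 observations and p = 1 predictor.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R-square for n observations and p independent variables."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 observations")
    return 1 - (n - 1) * (1 - r2) / (n - p - 1)

# Taxicab example: r^2 = 52.9 / 54 with n = 5 and p = 1.
taxi_adjusted = adjusted_r_squared(52.9 / 54, n=5, p=1)
print(f"adjusted r-square (taxicab example) = {taxi_adjusted:.4f}")

# Hypothetical comparison: the same r^2 = 0.90 achieved with more predictors
# earns a lower adjusted value.
for p in (2, 5, 10):
    print(f"p = {p:2d}: adjusted r-square = {adjusted_r_squared(0.90, n=30, p=p):.4f}")
```

The adjustment penalizes each additional predictor, which is why it is preferred over the plain r-square when comparing models of different sizes.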

Further Reading:
Draper N., and H. Smith, Applied Regression Analysis, Wiley, 1998.

Covariance and Correlation

Suppose that X and Y are two random variables for the outcome of a random experiment. The
covariance of X and Y is defined by

Cov (X, Y) = E{[X - E(X)][Y - E(Y)]}

and, given that the variances are strictly positive, the correlation of X and Y is defined by

$\rho(X, Y) = Cov(X, Y) / [sd(X) \cdot sd(Y)]$

Correlation is a scaled version of covariance; note that the two parameters always have the same
sign (positive, negative, or 0). When the sign is positive, the variables are said to be positively
correlated; when the sign is negative, the variables are said to be negatively correlated; and when
it is 0, the variables are said to be uncorrelated.

Notice that the correlation between two random variables is often due only to the fact that both
variables are correlated with the same third variable.

As these terms suggest, covariance and correlation measure a certain kind of behavior in both
variables. Correlation is very similar to the derivative of a function that you may have studied in
high school.

Coefficient of Determination: The square of the correlation coefficient, $r^2$, indicates the proportion of
the variation in one variable that can be associated with the variance in the other variable. The
three typical possibilities are depicted in the following figure:

The proportion of shared variance by two variables for the different values of the coefficient of
determination: $r^2 = 0$, $r^2 = 1$, and $r^2 = 0.25$, as shown by the shaded areas in this figure.

Properties: The following exercises give some basic properties of expected values. The main tool
that you will need is the fact that expected value is a linear operation.

You might like to use this Applet in performing some numerical experimentation to:

1. Show that $E[X/Y] \neq E(X)/E(Y)$.
2. Show that $E[X \times Y] \neq E(X) \times E(Y)$.
3. Show that $[E(X \times Y)]^2 \le E(X^2) \times E(Y^2)$.
4. Show that $E[(X/Y)^n] \ge E(X^n)/E(Y^n)$, for any n.
5. Show that Cov(X, Y) = E(XY) - E(X)E(Y).
6. Show that Cov(X, Y) = Cov(Y, X).
7. Show that Cov(X, X) = V(X).
8. Show that: If X and Y are independent random variables, then
$Var(XY) = V(X) \times V(Y) + V(X)[E(Y)]^2 + V(Y)[E(X)]^2$.
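
Some of these properties can be explored exactly, without simulation, on a small discrete example. The Python sketch below uses two hypothetical independent random variables (their values and probabilities are made up for illustration) and exact fractions, and checks the covariance identity, the variance of a product of independent variables in the form $V(X)V(Y) + V(X)[E(Y)]^2 + V(Y)[E(X)]^2$, the Cauchy-Schwarz inequality, and the fact that the expected ratio differs from the ratio of expectations.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical independent discrete variables: value -> probability.
px = {1: F(1, 2), 2: F(1, 3), 4: F(1, 6)}
py = {1: F(1, 4), 3: F(3, 4)}

# Independence: joint probability is the product of the marginals.
joint = {(x, y): p * q for (x, p), (y, q) in product(px.items(), py.items())}

def expect(g):
    """Expected value of g(X, Y) under the joint distribution."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

ex, ey = expect(lambda x, y: x), expect(lambda x, y: y)
vx = expect(lambda x, y: x * x) - ex ** 2
vy = expect(lambda x, y: y * y) - ey ** 2
e_xy = expect(lambda x, y: x * y)

cov_definition = expect(lambda x, y: (x - ex) * (y - ey))
cov_shortcut = e_xy - ex * ey
var_product = expect(lambda x, y: (x * y) ** 2) - e_xy ** 2
var_formula = vx * vy + vx * ey ** 2 + vy * ex ** 2
e_ratio = expect(lambda x, y: F(x, y))

print("Cov by definition:", cov_definition, " Cov by shortcut:", cov_shortcut)
print("Var(XY):", var_product, " formula:", var_formula)
print("E[X/Y]:", e_ratio, " E(X)/E(Y):", ex / ey)
print("Cauchy-Schwarz holds:", e_xy ** 2 <= expect(lambda x, y: x * x) * expect(lambda x, y: y * y))
```

Since X and Y are independent here, the covariance is zero and E(XY) equals E(X)E(Y); property 2 concerns the general case, where dependence makes the two sides differ.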

Pearson, Spearman, and Point-Biserial Correlations

There are measures that describe the degree to which two variables are linearly related. For the
majority of these measures, the correlation is expressed as a coefficient that ranges from 1.00 to
-1.00. A value of 1 or -1 indicates a perfect linear relationship, such that knowing the value of one
variable will allow perfect prediction of the value of the related variable. A value of 0 indicates no
predictability by a linear model. Negative values indicate that, when the value of one variable
is higher than average, the other is lower than average (and vice versa); positive values
indicate that, when the value of one variable is high, so is the other (and vice versa).

Correlation is similar to the derivative you have learned in calculus (a deterministic course).

Pearson's product-moment correlation is an index of the linear relationship between two variables.

Pearson's correlation is

$r = SS_{xy} / (SS_{xx} \times SS_{yy})^{0.5}$

A positive relationship indicates that if an individual value of x is above the mean of x's, then this
individual x is likely to have a y value that is above the mean of y's, and vice versa. A negative
relationship would be an x score above the mean of x and a y score below the mean of y. It is a
measure of the relationship between variables and an index of the proportion of individual
differences in one variable that can be associated with the individual differences in another
variable.

Notice that the correlation coefficient is the mean of the cross-products of standardized scores.
Therefore, if you have three values for r of 0.40, 0.60, and 0.80, you cannot say that the difference
between r = 0.40 and r = 0.60 is the same as the difference between r = 0.60 and r = 0.80, or that
r = 0.80 is twice as large as r = 0.40, because the scale of values for the correlation coefficient is
not interval or ratio, but ordinal. Therefore, all you can say is that, for example, a correlation
coefficient of +.80 indicates a high positive linear relationship and a correlation coefficient of
+.40 indicates a somewhat lower positive linear relationship.

The square of the correlation coefficient equals the proportion of the total variance in Y that can be
associated with the variance in x. It can tell us how much of the total variance of one variable can
be associated with the variance of another variable.
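
As a small illustration of the formula, the sketch below computes r from the sums of squares and cross-products and then squares it. The five data pairs are made up for the example, and `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / math.sqrt(ss_xx * ss_yy)

x = [1, 2, 3, 4, 5]      # illustrative values only
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_squared = r * r        # share of the variance in y associated with x

print(round(r, 4))          # 0.7746
print(round(r_squared, 2))  # 0.6
```

Here SSxy = 6, SSxx = 10 and SSyy = 6, so r = 6 / (60)^0.5 and r squared is 0.6.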

Note that the correlation coefficient measures only linear association. If the data form a parabola
that is symmetric about the mean of x, then a linear correlation of x and y will produce an r-value
equal to zero. So one must be careful and look at the data.
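
The point can be checked numerically. In this sketch (made-up data, hypothetical helper `pearson_r`), y depends on x exactly, yet r is zero when the x values are symmetric about their mean, and r is high when only one arm of the same parabola is observed.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / math.sqrt(ss_xx * ss_yy)

# Whole parabola y = x^2, with x symmetric about its mean
x_full = [-3, -2, -1, 0, 1, 2, 3]
r_full = pearson_r(x_full, [v * v for v in x_full])

# Only the right arm of the same parabola
x_arm = [0, 1, 2, 3]
r_arm = pearson_r(x_arm, [v * v for v in x_arm])

print(r_full, round(r_arm, 3))
```

Neither number alone describes the relationship correctly, which is why a scatter diagram should accompany any r-value.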

The standard statistic for testing the hypothesis $H_0: \rho = \rho_0$ is Fisher's z-transformation:

$z = 0.5[\ln(1+r) - \ln(1-r)]$, with mean $\mu = 0.5[\ln(1+\rho_0) - \ln(1-\rho_0)]$, and standard
deviation $\sigma = (n-3)^{-1/2}$.

Having constructed a desirable confidence interval, say [a, b], based on statistic Z, it has to be
transformed back to the original scale. That is, the confidence interval is:

$(e^{2a} - 1)/(e^{2a} + 1)$, $(e^{2b} - 1)/(e^{2b} + 1)$,

provided $|r| \ne 1$, $|\rho_0| \ne 1$, and n is greater than 3.

Alternatively,

$\{1 + r - (1-r)\exp[2 z_{\alpha/2}/(n-3)^{1/2}]\} / \{1 + r + (1-r)\exp[2 z_{\alpha/2}/(n-3)^{1/2}]\}$, and

$\{1 + r - (1-r)\exp[-2 z_{\alpha/2}/(n-3)^{1/2}]\} / \{1 + r + (1-r)\exp[-2 z_{\alpha/2}/(n-3)^{1/2}]\}$,

where $z_{\alpha/2}$ is the standard Normal critical value for a two-sided 100(1 - α)% interval.
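
The two forms of the interval can be compared numerically. The sketch below is an illustration only: the values r = 0.60 and n = 50 are made up, the function names are hypothetical, and the critical value is taken as the two-sided standard Normal quantile.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, confidence=0.95):
    # Interval [a, b] on the z scale, transformed back with (e^(2a)-1)/(e^(2a)+1) = tanh(a).
    z = 0.5 * (math.log(1 + r) - math.log(1 - r))
    half_width = NormalDist().inv_cdf(0.5 + confidence / 2) / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)

def fisher_ci_direct(r, n, confidence=0.95):
    # The alternative closed form, written directly in terms of r.
    c = NormalDist().inv_cdf(0.5 + confidence / 2) / math.sqrt(n - 3)
    lower = (1 + r - (1 - r) * math.exp(2 * c)) / (1 + r + (1 - r) * math.exp(2 * c))
    upper = (1 + r - (1 - r) * math.exp(-2 * c)) / (1 + r + (1 - r) * math.exp(-2 * c))
    return lower, upper

ci_a = fisher_ci(0.60, 50)          # hypothetical r and n
ci_b = fisher_ci_direct(0.60, 50)
print(ci_a, ci_b)
```

Both functions should return the same limits, and the interval is not symmetric about r because the back-transformation is non-linear.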

You might like to use this calculator for your needed computation. You may perform Testing the
Population Correlation Coefficient .

Spearman rank-order correlation is used as a non-parametric version of Pearson's. It is
expressed as:

$r = 1 - (6 \sum d^2) / [n(n^2 - 1)]$,

where d is the difference between the ranks of each X and Y pair.

The Spearman correlation coefficient can be algebraically derived from the Pearson correlation
formula by making use of sums of series. Pearson contains expressions for $\sum X(i)$, $\sum Y(i)$,
$\sum X(i)^2$, and $\sum Y(i)^2$.

In the Spearman case, the X(i)'s and Y(i)'s are ranks, and so the sums of the ranks, and the sums of
the ranks squared, are entirely determined by the number of cases (without any ties):

$\sum i = n(n+1)/2$, $\sum i^2 = n(n+1)(2n+1)/6$.

The Spearman formula then is equal to:

$[12P - 3n(n+1)^2] / [n(n^2 - 1)]$,

where P is the sum of the products of each pair of ranks X(i)Y(i). This reduces to:

$r = 1 - (6 \sum d^2) / [n(n^2 - 1)]$,

where d is the difference between the ranks of each x(i) and y(i) pair.

An important consequence of this is that if you enter ranks into the Pearson formula, you get
precisely the same numerical value as that obtained by entering the ranks into the Spearman
formula. This comes as a bit of a shock to those who like to adopt simplistic slogans, such
as "Pearson is for interval data, Spearman is for ranked data". Spearman doesn't work too well if
there are many tied ranks. That's because the formula for calculating the sums of squared ranks
no longer holds true. If one has many tied ranks, use the Pearson formula applied to the ranks.
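
The equivalence can be verified on a small example. The sketch below uses six made-up pairs without ties; `ranks`, `pearson_r` and the variable names are hypothetical.

```python
import math

def ranks(values):
    # Ranks 1..n; this simple version assumes there are no ties.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / math.sqrt(ss_xx * ss_yy)

x = [10, 20, 30, 40, 50, 60]     # illustrative values only
y = [12, 9, 25, 31, 28, 40]
rx, ry = ranks(x), ranks(y)
n = len(x)

sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
spearman_d = 1 - 6 * sum_d2 / (n * (n * n - 1))

p = sum(a * b for a, b in zip(rx, ry))            # P, the sum of rank products
spearman_p = (12 * p - 3 * n * (n + 1) ** 2) / (n * (n * n - 1))

pearson_on_ranks = pearson_r(rx, ry)
print(spearman_d, spearman_p, pearson_on_ranks)
```

For these data the sum of squared rank differences is 4 and P is 89, so all three expressions give 186/210.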

One may use this measure as a decision-making tool:

Value of |r|     Interpretation
0.00 - 0.40      Poor
0.41 - 0.75      Fair
0.76 - 0.85      Good
0.86 - 1.00      Excellent

This interpretation is widely accepted, and many scientific journals routinely publish papers using
this interpretation for the estimated result, and even for the test of hypothesis.

Point-Biserial Correlation is used when one random variable is binary (0, 1) and the other is a
continuous random variable; the strength of the relationship is measured by the point-biserial
correlation:

$r = (\bar{X}_1 - \bar{X}_0)\,[pq / S^2]^{1/2}$

where $\bar{X}_1$ and $\bar{X}_0$ are the means of the scores having binary values 1 and 0, and p and
q are their proportions, respectively. $S^2$ is the sample variance of the continuous random
variable. This is a simplified version of the Pearson correlation for the case when one of the two
random variables is a (0, 1) nominal random variable.

Note also that r is invariant under shifts and positive changes of scale. That is, ax + c and by + d
have the same r as x and y, for any positive a and b.
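
Both statements can be checked with a short sketch. The data are made up, the function names are hypothetical, and the sketch assumes that S squared is computed with divisor n; with divisor n - 1 the point-biserial value would differ from Pearson's r by a factor of ((n - 1)/n)^0.5.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / math.sqrt(ss_xx * ss_yy)

def point_biserial(binary, scores):
    n = len(scores)
    ones = [s for b, s in zip(binary, scores) if b == 1]
    zeros = [s for b, s in zip(binary, scores) if b == 0]
    p, q = len(ones) / n, len(zeros) / n
    grand_mean = sum(scores) / n
    s2 = sum((s - grand_mean) ** 2 for s in scores) / n   # divisor n (assumption)
    return (sum(ones) / len(ones) - sum(zeros) / len(zeros)) * math.sqrt(p * q / s2)

group = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]            # illustrative values only
score = [12, 15, 11, 14, 18, 21, 17, 20, 22, 19]

r_pb = point_biserial(group, score)
r_pearson = pearson_r(group, score)

# Shift and positive rescaling: 2.5x + 10 and 3y - 4 give the same r as x and y.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [3.0, 1.0, 6.0, 5.0, 9.0]
r_original = pearson_r(x, y)
r_rescaled = pearson_r([2.5 * v + 10 for v in x], [3 * v - 4 for v in y])
print(r_pb, r_pearson, r_original, r_rescaled)
```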

Correlation, and Level of Significance

It is intuitive that with very few data points, a high correlation may not be statistically significant.
You may see statements such as, "correlation is significant between x and y at the α = 0.005 level"
and "correlation is significant at the α = 0.05 level." The question is: how do you determine these
numbers?

Using the simple correlation r, the formula for the F-statistic is:

$F = (n-2)\, r^2 / (1 - r^2)$, where n is greater than 2,

which is compared with the F distribution with 1 and n - 2 degrees of freedom. As you see, the
F-statistic is an increasing function of both $r^2$ and the sample size n.

Notice that the test for the statistical significance of a correlation coefficient requires that the two
variables be distributed as a bivariate normal.
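
To see how the sample size enters, the following sketch evaluates the F-statistic for the same correlation with two sample sizes. The value r = 0.60 and the sizes 5 and 30 are made up for illustration; the critical values in the comments are the usual 5% entries of an F table.

```python
def f_statistic(r, n):
    # F = (n - 2) r^2 / (1 - r^2), to be compared with F(1, n - 2).
    if n <= 2 or abs(r) >= 1:
        raise ValueError("need n > 2 and |r| < 1")
    return (n - 2) * r * r / (1 - r * r)

r = 0.60                          # hypothetical observed correlation
f_small = f_statistic(r, 5)       # 5% critical value of F(1, 3) is 10.13
f_large = f_statistic(r, 30)      # 5% critical value of F(1, 28) is 4.20

print(round(f_small, 4), round(f_large, 2))   # 1.6875 15.75
```

The same r = 0.60 is far from significant with 5 observations but clearly significant with 30.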

Independence vs. Correlated

In the sense that it is used in statistics; i.e., as an assumption in applying a statistical test; a
random sample from the entire population provides a set of random variables X1,...., Xn, that are
identically distributed and mutually independent. Mutually independent is stronger than pairwise
independence. The random variables are mutually independent if their joint distribution is equal to
the product of their marginal distributions.

In the case of joint normality, independence is equivalent to zero correlation, but not in general.
Independence will imply zero correlation but not conversely. Note that not all random variables have
a first moment, let alone a second moment, and hence there may not be a correlation coefficient.

However; if the correlation coefficient of two random variables is not zero then the random
variables are not independent.

How to Compare Two Correlation Coefficients?

Given that two populations have normal distributions, we wish to test for the following null
hypothesis regarding the equality of correlation coefficients:

$H_0: \rho_1 = \rho_2$,

based on two observed correlation coefficients $r_1$ and $r_2$, obtained from two random samples of
size $n_1$ and $n_2$, respectively, provided $|r_1| \ne 1$, $|r_2| \ne 1$, and $n_1$, $n_2$ are both
greater than 3. Under the null hypothesis and normality condition, the test statistic is:

$Z = (z_1 - z_2) / [1/(n_1 - 3) + 1/(n_2 - 3)]^{1/2}$


where:

$z_1 = 0.5 \ln[(1 + r_1)/(1 - r_1)]$,
$z_2 = 0.5 \ln[(1 + r_2)/(1 - r_2)]$,

and $n_1$ = sample size associated with $r_1$, and $n_2$ = sample size associated with $r_2$.

The distribution of the Z-statistic is the standard Normal(0,1); therefore, you may reject H0 if
|Z| > 1.96 at the 5% level of significance (95% confidence level).

An Application: Suppose r1 = 0.47, r2 = 0.63 are obtained from two independent random samples
of size n1 = 103, and n2 = 103, respectively. Therefore, z1 = 0.510, and z2 = 0.741, with
Z-statistic:

$Z = (0.510 - 0.741) / [1/(103-3) + 1/(103-3)]^{1/2} = -1.63$

This result is not within the rejection region of the two-tailed critical values at α = 0.05, and
therefore is not significant. There is not sufficient evidence to reject the null hypothesis that
the two correlation coefficients are equal.
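
The application can be reproduced with a few lines. This is a sketch only; the function name is hypothetical, and the two-sided p-value is an addition computed from the standard Normal distribution.

```python
import math
from statistics import NormalDist

def compare_correlations(r1, n1, r2, n2):
    # Fisher's transformation: atanh(r) = 0.5 ln[(1 + r)/(1 - r)]
    z1, z2 = math.atanh(r1), math.atanh(r2)
    z_stat = (z1 - z2) / math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))   # two-sided
    return z1, z2, z_stat, p_value

z1, z2, z_stat, p_value = compare_correlations(0.47, 103, 0.63, 103)
print(round(z1, 3), round(z2, 3), round(z_stat, 2), round(p_value, 2))
```

Without intermediate rounding the statistic is about -1.64, which leads to the same conclusion as the hand computation.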

Clearly, this test can be modified and applied to a test of hypothesis regarding the population
correlation ρ based on the observed r obtained from a random sample of size n:

$Z = (z_r - z_\rho) / [1/(n-3)]^{1/2}$,

provided $|r| \ne 1$, $|\rho| \ne 1$, and n is greater than 3.

Testing the Equality of Two Dependent Correlations: In testing the hypothesis of no difference
between two population correlation coefficients:

$H_0: \rho(X, Y) = \rho(X, Z)$

against the alternative:

$H_a: \rho(X, Y) \ne \rho(X, Z)$

with a common covariate X, one may use the following test statistic:

$t = (r_{xy} - r_{xz})\,[(n-3)(1 + r_{yz})]^{1/2} \,/\, [2(1 - r_{xy}^2 - r_{xz}^2 - r_{yz}^2 + 2 r_{xy} r_{xz} r_{yz})]^{1/2}$,

with n - 3 degrees of freedom, where n is the number of ordered triples (X, Y, Z) in the sample,
provided that none of the r's has absolute value equal to 1.

Numerical example: Suppose n = 87, rxy = 0.631, rxz = 0.428, and ryz = 0.683, then t-statistic is
equal to 3.014, with p-value equal to 0.002, indicating a strong evidence against the null
hypothesis.
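
The statistic for the numerical example can be recomputed as follows. The function name is hypothetical; with the three correlations as printed (three decimals) the result is roughly 3.01, in line with the value quoted above up to rounding.

```python
import math

def dependent_correlations_t(r_xy, r_xz, r_yz, n):
    numerator = (r_xy - r_xz) * math.sqrt((n - 3) * (1 + r_yz))
    determinant = (1 - r_xy ** 2 - r_xz ** 2 - r_yz ** 2
                   + 2 * r_xy * r_xz * r_yz)
    return numerator / math.sqrt(2 * determinant)

t_stat = dependent_correlations_t(0.631, 0.428, 0.683, 87)
degrees_of_freedom = 87 - 3
print(round(t_stat, 2), degrees_of_freedom)
```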

Adjusted $R^2$: In a model selection process based on $R^2$ values, it is often necessary and
meaningful to adjust the $R^2$'s for their degrees of freedom. Each Adjusted $R^2$ is calculated by:

$1 - [(n - i)(1 - R^2)] / (n - p)$,

where i is equal to 1 if there is an intercept and 0 otherwise; n is the number of observations used
to fit the model; and p is the number of parameters in the model.
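
A short sketch of the adjustment follows. The inputs (R squared of 0.80, 20 observations, 3 or 6 parameters) are made up, and the function name is hypothetical.

```python
def adjusted_r_squared(r_squared, n, p, has_intercept=True):
    # 1 - (n - i)(1 - R^2)/(n - p), with i = 1 when the model has an intercept.
    i = 1 if has_intercept else 0
    return 1 - (n - i) * (1 - r_squared) / (n - p)

small_model = adjusted_r_squared(0.80, n=20, p=3)   # 1 - 19(0.2)/17
large_model = adjusted_r_squared(0.80, n=20, p=6)   # 1 - 19(0.2)/14
print(round(small_model, 4), round(large_model, 4))
```

For the same raw R squared, the model with more parameters receives the lower adjusted value.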

You might like to use the Testing the Population Correlation Coefficient JavaScript in performing
some numerical experimentation for validating and a deeper understanding of the concepts.

Conditions and the Check-list for Linear Models

Almost all models of reality, including regression models, have assumptions that must be verified in
order that the model has power to test hypotheses and for it to be able to predict accurately.

The following is the list of basic assumptions (i.e., conditions) and the tools to check these
necessary conditions.

1. Any undetected outliers may have a major impact on the regression model. Outliers are a few
observations that are not well fitted by the "best" available model. In such a case, one must first
investigate the source of the data; if there is any doubt about the accuracy or veracity of an
observation, then it should be removed and the model should be refitted.

You might like to use the Determination of the Outliers JavaScript to perform some numerical
experimentation for validating and for a deeper understanding of the concepts

2. The dependent variable Y is a linear function of the independent variable X. This can be checked by
carefully examining all the points in the scatter diagram, and see if it is possible to bound them all within
two parallel lines. You may also use the Detective Testing for Trend to check this condition, see the
numerical example for the details.

[Figure: A Typical Scatter-diagram for a Linear Model]

3. The distribution of the residual must be normal. You may check this condition by using the Lilliefors Test
for Normality.

4. The residuals should have a mean equal to zero, and a constant standard deviation (i.e., homoskedastic
condition). You may check this condition by dividing the residuals data into two or more groups; this
approach is known as the Goldfeld-Quandt test. You may use the Stationary Testing Process to check
this condition.

5. The residuals constitute a set of random variables. You may use the Test for Randomness and Test for
Randomness of Fluctuations to check this condition.

6. The Durbin-Watson (D-W) statistic quantifies the serial correlation of least-squares errors in its
original form. The D-W statistic is defined by:

$\text{D-W statistic} = \sum_{j=2}^{n} (e_j - e_{j-1})^2 \,/\, \sum_{j=1}^{n} e_j^2$,

where $e_j$ is the jth error. D-W takes values within [0, 4]. For no serial correlation, a value close
to 2 is expected. With positive serial correlation, adjacent deviates tend to have the same sign,
therefore D-W becomes less than 2; whereas, with negative serial correlation (alternating signs of
error), D-W takes values larger than 2. For a least-squares fit where the value of D-W is significantly
different from 2, the estimates of the variances and covariances of the parameters (i.e., coefficients)
can be in error, being either too large or too small. Serial correlation of the deviates also arises in
time series analysis and forecasting. You may use the Measuring for Accuracy JavaScript to check this
condition.
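
The behaviour of the D-W statistic described in item 6 can be seen on two artificial error sequences. Both sequences are made up for illustration, and `durbin_watson` is a hypothetical helper name.

```python
def durbin_watson(errors):
    # Sum of squared successive differences over the sum of squared errors.
    numerator = sum((errors[j] - errors[j - 1]) ** 2 for j in range(1, len(errors)))
    denominator = sum(e * e for e in errors)
    return numerator / denominator

drifting = [-3, -2, -1, 1, 2, 3]        # neighbours share a sign: positive serial correlation
alternating = [1, -1, 1, -1, 1, -1]     # signs alternate: negative serial correlation

dw_drifting = durbin_watson(drifting)         # 8/28, well below 2
dw_alternating = durbin_watson(alternating)   # 20/6, well above 2
print(round(dw_drifting, 3), round(dw_alternating, 3))
```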

The "good" regression equation candidate is further analyzed using a plot of the residuals versus
the independent variable(s). If any patterns are seen in the graph (e.g., an indication of
non-constant variance), then there is a need for data transformation. The following are the widely
used transformations:

X' = 1/X, for non-zero X.
X' = Ln (X), for positive X.
X' = Ln(X), Y' = Ln (Y), for positive X, and Y.
Y' = Ln (Y), for positive Y.
Y' = Ln (Y) - Ln (1-Y), for Y positive, less than one.
Y' = Ln [Y/(100-Y)], for Y expressed as a percentage between 0 and 100, known as the logit
transformation, which is useful for the S-shape functions.

Taking square root of a Poisson random variable, the transformed variable is more symmetric. This is a
useful transformation in regression analysis with Poisson observations. It also stabilizes the residual
variation.

Box-Cox Transformations: The Box-Cox transformation, below, can be applied to a regressor, a
combination of regressors, and/or to the dependent variable (y) in a regression. The objective of doing
so is usually to make the residuals of the regression more homoskedastic (i.e., of constant variance)
and closer to a normal distribution:

$(y^{\lambda} - 1) / \lambda$ for a constant $\lambda$ not equal to zero, and $\ln(y)$ for $\lambda = 0$.
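
The transformation is simple to code. The sketch below (hypothetical function name, made-up input value) also shows why the logarithm is the natural choice at zero: the power form approaches ln(y) as the constant approaches zero.

```python
import math

def box_cox(y, lam):
    # (y^lam - 1)/lam for lam != 0, and ln(y) for lam = 0; y must be positive.
    if y <= 0:
        raise ValueError("the Box-Cox transformation needs positive y")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam

y = 4.0                                   # illustrative value only
square_root_case = box_cox(y, 0.5)        # (2 - 1)/0.5 = 2
identity_case = box_cox(y, 1.0)           # y - 1 = 3
log_case = box_cox(y, 0)                  # ln(4)
near_zero_case = box_cox(y, 1e-6)         # very close to ln(4)
print(square_root_case, identity_case, log_case, near_zero_case)
```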

You might like to use the Regression Analysis with Diagnostic Tools JavaScript to check your
computations, and to perform some numerical experimentation for a deeper understanding of the
concepts.

Analysis of Covariance: Comparing the Slopes

Consider the following two samples of before-and-after independent treatments.

Values of Covariate X and a Dependent Variable Y

   Treatment-I        Treatment-II
    X      Y           X      Y
    5     11           2      1
    3      9           6      7
    1      5           4      3
    4      8           7      8
    6     12           3      2

We wish to test the following test of hypothesis on the two means of the dependent variable Y1,
and Y2:

H0: The difference between the two means is about a given value M.
Ha: The difference between the two means is quite different than it is claimed.

Since we are dealing with dependent variables, it's natural to investigate the linear regression
coefficients of the two samples; namely, the slopes and the intercepts.

Suppose we are interested in testing the equality of two slopes. In other words, we wish to
determine if two given lines are statistically parallel. Let m1 represent the regression coefficient for
explanatory variable X1 in sample 1 with size n1. Let m2 represent the regression coefficient for X2
in sample 2 with size n2. The difference between the two estimated slopes has the following
variance:

$V = \mathrm{Var}[m_1 - m_2] = \{[(n_1 - 2) S_{res1}^2 + (n_2 - 2) S_{res2}^2] / (n_1 + n_2 - 4)\} \times (S_{xx1} + S_{xx2}) / (S_{xx1} \times S_{xx2})$.

Then, the quantity:

(m1 - m2) / V½

has a t-distribution with d.f. = n1 + n2 - 4.

This test and its generalization in comparing more than two slopes are called the Analysis of
Covariance (ANOCOV). The ANOCOV test is the same as in the ANOVA test; however there is an
additional variable called covariate. ANOCOV enables us to conduct and to extend the before-and-
after test for two different populations. The process is as follows:

1. Find a linear model for (X1, Y1) = (before1, after1), and one for (X2, Y2) = (before2, after2) that
fit best.

2. Perform the test of the hypothesis m1 = m2.


3. If the test result indicates that the slopes are almost equal, then compute the common slope
of the two parallel regression lines:

$\text{Slope}_{par} = (m_1 S_{xx1} + m_2 S_{xx2}) / (S_{xx1} + S_{xx2})$.

The variance of the residuals is:

$S_{res}^2 = [S_{yy1} + S_{yy2} - (S_{xy1} + S_{xy2})\,\text{Slope}_{par}] / (n_1 + n_2 - 3)$.

4. Now, perform the test for the difference between the two intercepts, which is the vertical
difference between the two parallel lines:

Intercepts' difference $= \bar{y}_1 - \bar{y}_2 - (\bar{x}_1 - \bar{x}_2)\,\text{Slope}_{par}$.

The test statistic is:

(Intercepts' difference) $/ \{S_{res}\,[1/n_1 + 1/n_2 + (\bar{x}_1 - \bar{x}_2)^2 / (S_{xx1} + S_{xx2})]^{1/2}\}$,

which has a t-distribution with parameter d.f. = $n_1 + n_2 - 3$.

Depending on the outcome of the last test, one may reject the null hypothesis.

For our numerical example, using the Analysis of Covariance JavaScripts, we obtained the
following statistics:
Slope 1 = 1.3513514; its standard error = 0.2587641
Slope 2 = 1.4883721; its standard error = 1.0793906

These indicate that there is no evidence against equality of the slopes. Now, we may test for any
differences in the intercepts. Suppose we wish to test the null hypothesis that the vertical distance
between the two parallel lines is about 4 units.

Using the second function in the Analysis of Covariance JavaScript, we obtained the statistics:
Common Slope = 1.425, Intercept =5.655, providing a moderate evidence against the null
hypothesis.
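
The slopes, the common slope and the vertical distance quoted above can be recomputed from the two treatment samples. The sketch below is not the JavaScript tool itself: the helper names are hypothetical, the slope comparison uses the usual pooled residual variance multiplied by (1/Sxx1 + 1/Sxx2), and the claimed distance of 4 units is subtracted in the last statistic as one reading of the hypothesis.

```python
import math

def summary(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return n, x_bar, y_bar, sxx, syy, sxy

n1, xb1, yb1, sxx1, syy1, sxy1 = summary([5, 3, 1, 4, 6], [11, 9, 5, 8, 12])  # Treatment-I
n2, xb2, yb2, sxx2, syy2, sxy2 = summary([2, 6, 4, 7, 3], [1, 7, 3, 8, 2])    # Treatment-II

m1, m2 = sxy1 / sxx1, sxy2 / sxx2                      # the two slopes

# Step 2: equality of slopes, d.f. = n1 + n2 - 4
pooled_var = ((syy1 - m1 * sxy1) + (syy2 - m2 * sxy2)) / (n1 + n2 - 4)
t_slopes = (m1 - m2) / math.sqrt(pooled_var * (1 / sxx1 + 1 / sxx2))

# Step 3: common slope and residual standard deviation
slope_par = (m1 * sxx1 + m2 * sxx2) / (sxx1 + sxx2)
s_res = math.sqrt((syy1 + syy2 - (sxy1 + sxy2) * slope_par) / (n1 + n2 - 3))

# Step 4: vertical distance between the parallel lines, tested against 4 units
distance = (yb1 - yb2) - (xb1 - xb2) * slope_par
std_error = s_res * math.sqrt(1 / n1 + 1 / n2 + (xb1 - xb2) ** 2 / (sxx1 + sxx2))
t_distance = (distance - 4) / std_error                # d.f. = n1 + n2 - 3

print(round(m1, 7), round(m2, 7), round(slope_par, 3), round(distance, 3))
print(round(t_slopes, 2), round(t_distance, 2))
```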

Further Reading:
Wall F., Statistical Data Analysis Handbook, by McGraw-Hill, New York, 1986.

Residential Properties Appraisal Application

Estimating the market value of large numbers of residential properties is of interest to a number of
socio-economic stakeholders, such as mortgage and insurance companies, banks and real-estate
agencies, and investment property companies, etc. It is both a science and an art. It is a science,
because it is based on formal, rigorous and proven methods. It is an art because interaction with
socio-economic stakeholders and the methods used give rise to all sorts of tradeoffs and
compromises that assessors and their organizations must take into account when making
decisions on the basis of their experience and skills.

The market value assessment of a set of selected houses involves performing an assessment by a
few individual appraisers for each property and then computing an average obtained from the few
individuals.

Individual appraisal refers to the process of estimating the exchange value of a house on the basis
of a direct comparison between its profiles and the profiles of a set of other comparable properties
sold on acceptable conditions. The profile of a property consists of all the relevant attributes of
each house, such as the location, size, gross living space, age, one-story, two-story or more,
garage, swimming pool, basement, etc. Data on prices and characteristics of individual houses are
available; e.g., from the U.S Bureau of the Census.

Often regression analysis is used to determine what characteristics influence the price of the
houses. Thus it is important to correct the subjective elements in the appraisal value before
carrying out the regression analysis. Coefficients that are not significantly different from zero as
indicated by insignificant t-statistics at a 5% level are dropped from the regression model.

There are several practical questions to be answered before the actual data collection can take
place.

The first step is to use statistical techniques, such as geographic clustering, to define
homogeneous groupings of houses within an urban area.

How many houses should we look at? Ideally, one would collect information on as many houses as
time and money allow. It is these practical considerations that make statistics so useful. Hardly
anyone could spend the time, money, and effort needed to look at every house for sale. It is
unrealistic to obtain information on every house of interest, or in statistical terms, on every item in
the population. Thus, we can look only at a sample of houses -- a subset of the population -- and
hope that this sample will give us reasonably accurate information about the population. Let us say
we can afford to look at 16 houses.

We would probably choose to select a simple random sample; that is, a sample in which, roughly
speaking, every house in the population has equal chance of being included. Then we would
expect to get a reasonably representative sample of houses throughout this selected size range,
reflecting prices for the whole neighborhood. This sample should give us some information about
all houses of all sizes within this range, since a simple random sample tends to select as many
larger houses as smaller houses, and as many expensive as less expensive ones.

Suppose that the 16 houses in our random sample have the sizes and prices shown in the
following Table. If 16 houses are randomly selected, variables Y, X1, and X2 are random
variables. We have no control over them and cannot know what specific values will be selected. It
is chance only that determines them.

- Sizes, Ages, and Prices of Sixteen Houses -

X1 = Size   X2 = Age   Y = Price      X1 = Size   X2 = Age   Y = Price
   1.8         30         32             2.3         30         44
   1.0         33         24             1.4         17         27
   1.7         25         27             3.3         16         50
   1.2         12         25             2.2         22         37
   2.8         12         47             1.5         29         28
   1.7          1         30             1.1         29         20
   2.5         12         43             2.0         25         38
   3.6         28         52             2.6          2         45

What can we tell about the relationship between size and price from our sample? Reading the data
from the above table row-wise, and entering them in the Regression Analysis with Diagnostic Tools
JavaScript, we found the following simple regression model:

Price = 9.253 + 12.873(Size)
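
The fitted line can be checked by ordinary least squares on the sixteen (Size, Price) pairs, read row-wise from the table. This is a minimal sketch with hypothetical variable names, not the JavaScript tool referred to above.

```python
size = [1.8, 2.3, 1.0, 1.4, 1.7, 3.3, 1.2, 2.2, 2.8, 1.5, 1.7, 1.1, 2.5, 2.0, 3.6, 2.6]
price = [32, 44, 24, 27, 27, 50, 25, 37, 47, 28, 30, 20, 43, 38, 52, 45]

n = len(size)
x_bar, y_bar = sum(size) / n, sum(price) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(size, price))
s_xx = sum((x - x_bar) ** 2 for x in size)

slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
print(round(intercept, 3), round(slope, 3))   # 9.253 12.873
```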

Now consider the problem of estimating the price (Y) of a house from knowing its size (X1) and
also its age (X2). The sizes and prices will be the same as in the simple regression problem. What
we have done is add ages of houses to the existing data. Note carefully that in real life, one would
not first go out and collect data on sizes and prices and then analyze the simple regression
problem. Rather, one would collect all data that might be pertinent on all sixteen houses at the
outset. Then the analysis performed would throw out predictors which turn out not to be needed.

The objectives in a multiple regression problem are essentially the same as for a simple
regression. While the objectives remain the same, the more predictors we have, the more complicated
the calculations and interpretations become. For a large data set, one may use the multiple
regression module of any statistical package such as SAS and SPSS. Using the Multiple Linear
Regression JavaScript, for our numerical example with X1 = Size, X2 = Age, and Y = Price, we
obtain the following statistical model:

Price = 9.959 + 12.800(Size) - 0.027(Age)
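
With only two predictors the normal equations can be solved by hand. The sketch below (hypothetical names, data read row-wise from the table) solves them with Cramer's rule and also computes the correlation between Price and Age. The Size coefficient comes out near 12.80 and the other two values agree with the model above to three decimals.

```python
import math

size = [1.8, 2.3, 1.0, 1.4, 1.7, 3.3, 1.2, 2.2, 2.8, 1.5, 1.7, 1.1, 2.5, 2.0, 3.6, 2.6]
age = [30, 30, 33, 17, 25, 16, 12, 22, 12, 29, 1, 29, 12, 25, 28, 2]
price = [32, 44, 24, 27, 27, 50, 25, 37, 47, 28, 30, 20, 43, 38, 52, 45]

def mean(values):
    return sum(values) / len(values)

def centered_sum(u, v):
    # Sum of cross-products of deviations from the means.
    u_bar, v_bar = mean(u), mean(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))

s11, s22, s12 = centered_sum(size, size), centered_sum(age, age), centered_sum(size, age)
s1y, s2y, syy = centered_sum(size, price), centered_sum(age, price), centered_sum(price, price)

# Normal equations: b1*s11 + b2*s12 = s1y and b1*s12 + b2*s22 = s2y
determinant = s11 * s22 - s12 ** 2
b_size = (s1y * s22 - s12 * s2y) / determinant
b_age = (s11 * s2y - s12 * s1y) / determinant
intercept = mean(price) - b_size * mean(size) - b_age * mean(age)

r_price_age = s2y / math.sqrt(s22 * syy)
print(round(intercept, 3), round(b_size, 3), round(b_age, 3), round(r_price_age, 3))
```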

The regression results suggest that, on average, as the Size of a house increases, the Price
increases. However, the coefficient of the variable Age is very small, with a negative value
indicating an inverse relationship: older houses tend to cost less than newer houses. Moreover,
the correlation between Price and Age is -0.236. This result indicates that only about 6% of the
variation in price can be accounted for by the difference in the ages of the houses. This result
supports our suspicion that Age is not a significant predictor of price. Therefore, we return to the
simple regression:

Price = 9.253 + 12.873(Size)

Now, the question is: Is this model good enough to satisfy the usual conditions of regression
analysis?

The following is the list of basic assumptions (i.e., conditions) and the tools to check these
necessary conditions.

1. Any undetected outliers may have major impact on the regression model. Using the Determination of the
Outliers JavaScript we found that there is no outlier in the above data set.

2. The dependent variable Price is a linear function of the independent variable Size. By carefully
examining the scatter diagram we found that the linearity condition is satisfied.

3. The distribution of the residual must be normal. Reading the data from the above table row-wise, and
entering them in the Regression Analysis with Diagnostic Tools JavaScript, we found that the normality
condition is also satisfied.

4. The residuals should have a mean equal to zero, and a constant standard deviation (i.e., homoskedastic
condition). By the Regression Analysis with Diagnostic Tools JavaScript, the results are satisfactory.

5. The residuals constitute a set of random variables. Persistent non-randomness in the residuals
would violate the best linear unbiased estimator condition. However, since the numerical statistics
for the residuals obtained by using the Regression Analysis with Diagnostic Tools JavaScript are not
significant, our ordinary least squares regression is adequate for our analysis.

6. The Durbin-Watson (D-W) statistic quantifies the serial correlation of least-squares errors in its
original form. The D-W statistic for this model is 1.995, which is close enough to 2 to rule out any
serious serial correlation.

7. More Useful Statistics for the Model: The standard errors for the Slope and the Intercept are 0.881
and 1.916, respectively, which are small enough. The F-statistic is 213.599, which is large enough to
indicate that the model is good enough overall for prediction purposes.
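
The statistics quoted in items 6 and 7 can be recomputed from the residuals of the simple regression. The sketch below uses hypothetical variable names and keeps the data in the row-wise order of the table, which matters for the Durbin-Watson statistic.

```python
import math

size = [1.8, 2.3, 1.0, 1.4, 1.7, 3.3, 1.2, 2.2, 2.8, 1.5, 1.7, 1.1, 2.5, 2.0, 3.6, 2.6]
price = [32, 44, 24, 27, 27, 50, 25, 37, 47, 28, 30, 20, 43, 38, 52, 45]

n = len(size)
x_bar, y_bar = sum(size) / n, sum(price) / n
s_xx = sum((x - x_bar) ** 2 for x in size)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(size, price))
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

residuals = [y - (intercept + slope * x) for x, y in zip(size, price)]
ss_res = sum(e * e for e in residuals)
residual_variance = ss_res / (n - 2)

se_slope = math.sqrt(residual_variance / s_xx)
se_intercept = math.sqrt(residual_variance * (1 / n + x_bar ** 2 / s_xx))
f_stat = (slope * s_xy) / residual_variance        # regression SS has 1 d.f.
dw = sum((residuals[j] - residuals[j - 1]) ** 2 for j in range(1, n)) / ss_res

print(round(se_slope, 3), round(se_intercept, 3), round(f_stat, 1), round(dw, 2))
```

Note that the F-statistic equals the square of slope/(its standard error), as expected for a regression with a single predictor.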

Notice that since the above analysis is performed on a specific set of data, as always, one must be
careful in generalizing its findings. For example, one may ask, Is the aim prediction, or is the
interest in interpretation of individual regression coefficients? In the latter case, inferences that
condition on "other things being constant" will not be valid unless all other relevant variables are
included in the regression equation. Even if their effect is not "significant" they have to be included,
unless it can be shown that their exclusion makes little difference to the values of other
coefficients. Regression is not at all robust against departures from the assumption that data have
been randomly sampled from the population that is of interest. This is an issue for all observational
data.

Monte Carlo simulations of the importance of these conditions demonstrate that, for linear
regression, the Normality assumption on the residuals is not at all crucial. Lack of Normality is
moderated by the sample size, depending on the number of independent variables, so the bigger the
sample size, the more non-Normality you can tolerate. However, the independence assumption on
the error terms and the constancy of their variance are very important. Any large error in the
independent variables also has a big effect.

Further Reading:
Lovell, R., and French, N., Estimated realization price: what do the banks want and what can be realistically provided? Journal of property
finance, 6, 7-16, 1995.
Newsome, B.A. and Zeitz, J., 1992. Adjusting comparable sales using multiple regression analysis-the need for segmentation, The Appraisal
Journal, 8, 129-135.

Introduction to Integrating Statistical Concepts

Statistical thinking for decision-making requires a deeper understanding than merely memorizing
each isolated technique. Understanding involves an ever-expanding neural network built from
correct connections between concepts. The aim of this chapter is to look closely at some of the
concepts and techniques that we have learned up to now under a unifying theme. The following case
studies improve your statistical thinking by showing the wholeness and manifoldness of statistical
tools.

As you will see, although one would hope that all tests give the same results, this is not always the
case. It all depends on how informative the data are and to what extent they have been
condensed before being presented to you for analysis. The following sections are illustrations
examining how much useful information the data provide and how they may lead to opposite
conclusions if one is not careful enough.

Hypothesis Testing with Confidence

One of the main advantages of constructing a confidence interval (CI) is to provide a degree of
confidence for the point estimate for the population parameter. Moreover, one may utilize CI for the
test of hypothesis purposes. Suppose you wish to test the following general test of hypothesis:

H0: The population parameter is almost equal to a given claimed value,

against the alternative:

Ha: The population parameter is not even close to the claimed value.

The process of executing the above test of hypothesis at a level of significance α using a CI is as
follows:

1. Ignore the claimed value in the null hypothesis, for the time being.
2. Construct a 100(1 - α)% confidence interval based on the available data.
3. If the constructed CI does not contain the claimed value, then there is enough evidence to
reject the null hypothesis; otherwise, there is no reason to reject the null hypothesis.

You might like to use the Hypothesis Testing with Confidence JavaScript to perform some
numerical experimentation for validating the above assertions and for a deeper understanding.
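
The three steps can be sketched for a population mean. Everything numerical here is assumed for illustration: a sample mean of 52.1 from 36 observations, a known population standard deviation of 6, and claimed values of 50 and 53. The function names are hypothetical, and a Normal-based interval is used because the standard deviation is taken as known.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, alpha=0.05):
    # Step 2: a 100(1 - alpha)% interval, built without looking at the claim.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

def reject_claim(claimed_value, interval):
    # Step 3: reject only if the claimed value lies outside the interval.
    lower, upper = interval
    return not (lower <= claimed_value <= upper)

interval = z_confidence_interval(52.1, sigma=6.0, n=36)
reject_50 = reject_claim(50.0, interval)
reject_53 = reject_claim(53.0, interval)
print(interval, reject_50, reject_53)
```

The interval runs from about 50.14 to 54.06, so a claimed mean of 50 is rejected at the 5% level while a claimed mean of 53 is not.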

Regression Analysis, ANOVA, and Chi-square Test

There are close relationships among linear regression, analysis of variance and the Chi-square
test. To illustrate the relationship, consider the following application:

Relationship between age and income in a given neighborhood: A random survey sampling of
size 33 individuals in a neighborhood revealed the following pairs of data. For each pair age is in
years and the indicated income is in thousands of dollars:

- Relation between Age and Income($1000) -


Age Income Age Income Age Income
20 15 42 19 61 13
22 13 47 17 62 14
23 17 53 13 65 9
28 19 55 18 67 7
35 15 41 21 72 7
24 21 53 39 65 22
26 26 57 28 65 24
29 27 58 22 69 27
39 31 58 29 71 22
31 16 46 27 69 9
37 19 44 35 62 21

Constructing a linear regression gives us:

Income = 22.88 - 0.05834 (Age)

This suggests a negative relationship; as people get older, they have lower income, on average.
However, the slope is small and its t-statistic is -0.70, which is not statistically significant at
the usual levels, so a straight line alone gives only weak evidence of a relationship.
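
The fitted line and the t-statistic of its slope can be recomputed from the 33 pairs in the table. This is a sketch with hypothetical variable names; the pairs are listed column pair by column pair.

```python
import math

ages = [20, 22, 23, 28, 35, 24, 26, 29, 39, 31, 37,
        42, 47, 53, 55, 41, 53, 57, 58, 58, 46, 44,
        61, 62, 65, 67, 72, 65, 65, 69, 71, 69, 62]
incomes = [15, 13, 17, 19, 15, 21, 26, 27, 31, 16, 19,
           19, 17, 13, 18, 21, 39, 28, 22, 29, 27, 35,
           13, 14, 9, 7, 7, 22, 24, 27, 22, 9, 21]

n = len(ages)
x_bar, y_bar = sum(ages) / n, sum(incomes) / n
s_xx = sum((x - x_bar) ** 2 for x in ages)
s_yy = sum((y - y_bar) ** 2 for y in incomes)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, incomes))

slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
residual_variance = (s_yy - slope * s_xy) / (n - 2)
t_slope = slope / math.sqrt(residual_variance / s_xx)   # d.f. = n - 2 = 31

print(round(intercept, 2), round(slope, 5), round(t_slope, 2))
```

For comparison, the two-sided 5% critical value of t with about 30 degrees of freedom is close to 2.04.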

Now suppose you have only the following secondary data, where the original data have been
condensed:

- Relation between Age and Income($1000) -

Age (20 - 39)   Age (40 - 59)   Age (60 & Over)
     15              19              13
     13              17              14
     17              13               9
     21              21               7
     15              39              21
     26              28              24
     27              22              27
     31              29              22
     16              27               9
     19              35              22
     19              18               7

One may use ANOVA in testing that there is no relationship among age and income. Performing
the analysis provides the F-statistic equal to 3.87 which is quite significant; i.e., rejecting the
hypothesis of no difference in population average income for the three age groups.
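
The F-statistic can be recomputed by grouping the incomes of the 33 individuals from the original table by age range. This is a sketch with hypothetical names; the critical value in the comment is the usual 5% entry of an F table.

```python
groups = {
    "20-39": [15, 13, 17, 19, 15, 21, 26, 27, 31, 16, 19],
    "40-59": [19, 17, 13, 18, 21, 39, 28, 22, 29, 27, 35],
    "60 and over": [13, 14, 9, 7, 7, 22, 24, 27, 22, 9, 21],
}

def one_way_anova(samples):
    values = [v for sample in samples for v in sample]
    grand_mean = sum(values) / len(values)
    means = [sum(sample) / len(sample) for sample in samples]
    ss_between = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    ss_within = sum((v - m) ** 2 for s, m in zip(samples, means) for v in s)
    df_between = len(samples) - 1
    df_within = len(values) - len(samples)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

f_stat, df_between, df_within = one_way_anova(list(groups.values()))
print(round(f_stat, 2), df_between, df_within)   # 5% critical value of F(2, 30) is 3.32
```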

Now, suppose more condensed secondary data are provided as in the following table:

Relation between Age and Income($1000):

                                  Age
Income               20-39     40-59     60 and over
Up to $20,000          7         4           6
$20,000 and over       4         7           5

One may use the Chi-square test for the null hypothesis that age and income are unrelated. The
Chi-square statistic is 1.70, which is not significant; therefore there is no reason to believe income
and age are related! But of course, these data are over-condensed, because when all data in the
sample were used, there was an observable relationship.
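
The Chi-square statistic for the condensed table can be recomputed as follows. The function name is hypothetical; the p-value line uses the fact that a Chi-square variable with exactly 2 degrees of freedom has upper tail probability exp(-x/2), so it does not generalize to other tables.

```python
import math

observed = [[7, 4, 6],    # up to $20,000
            [4, 7, 5]]    # $20,000 and over

def chi_square_independence(table):
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            expected = row_totals[i] * column_totals[j] / total
            statistic += (count - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

chi2, df = chi_square_independence(observed)
p_value = math.exp(-chi2 / 2)      # valid only because df == 2
print(round(chi2, 2), df, round(p_value, 2))   # 5% critical value for 2 d.f. is 5.99
```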

Regression Analysis, ANOVA, T-test, and Coefficient of Determination

There are very direct relationships among linear regression, analysis of variance, t-test and the
coefficient of determination. The following small data set is for illustrating the connections among
the above statistical procedures, and therefore relationships among statistical tables:

X1 4 5 4 6 7 7 8 9 9 11

X2 8 6 8 10 10 11 13 14 14 16

Suppose we apply the t-test. The statistic is t = 3.207, with d.f. = 18. The p-value is 0.003 indicating
a very strong evidence against the null hypothesis.

Now, by introducing a dummy variable x with two values, say 0 and 1, representing the two data
sets, respectively, we are able to apply regression analysis:

x 0 0 0 0 0 0 0 0 0 0

y 4 5 4 6 7 7 8 9 9 11

x 1 1 1 1 1 1 1 1 1 1

y 8 6 8 10 10 11 13 14 14 16

Among other statistics, we obtain a large slope, m = 4 ≠ 0, indicating the rejection of the null
hypothesis. Notice that the t-statistic for the slope is: t-statistic = slope/(its standard error) =
4/1.2472191 = 3.207, which is the t-statistic we obtained from the t-test. In general, the square of the
t-statistic of the slope is the F-statistic in the ANOVA table; i.e.,

$t_m^2 = F\text{-statistic}$

Moreover, the coefficient of determination is $r^2 = 0.36$, which is always obtainable from the t-test, as
follows:

$r^2 = t^2 / (t^2 + \text{d.f.})$.

For our numerical example, $r^2 = (3.207)^2 / [(3.207)^2 + 18] = 0.36$, as expected.

Now, applying ANOVA to the two sets of data, we obtain the F-statistic = 10.285, with d.f.1 = 1
and d.f.2 = 18. The F-statistic is large enough; therefore, one must reject the null hypothesis.
Note that, in general,

$F_{\alpha,\,(1,\,n)} = t^2_{\alpha/2,\,n}$.

For our numerical example, $F = t^2 = (3.207)^2 = 10.285$, as expected.

As expected, by just looking at the data, all three tests indicate strongly that the means of the two
sets are quite different.
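
The connections in this section can be checked numerically with the small data set above. The following is a minimal sketch in plain Python; the helper names `mean` and `sum_sq_dev` are illustrative, and the t-test is assumed to be the pooled-variance two-sample version, which is the one consistent with d.f. = 18.

```python
from math import sqrt

x1 = [4, 5, 4, 6, 7, 7, 8, 9, 9, 11]
x2 = [8, 6, 8, 10, 10, 11, 13, 14, 14, 16]

def mean(values):
    return sum(values) / len(values)

def sum_sq_dev(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

n1, n2 = len(x1), len(x2)
df = n1 + n2 - 2

# 1) Pooled-variance two-sample t-test.
pooled_var = (sum_sq_dev(x1) + sum_sq_dev(x2)) / df
t_stat = (mean(x2) - mean(x1)) / sqrt(pooled_var * (1 / n1 + 1 / n2))

# 2) Simple regression of y on a 0/1 dummy variable x.
x = [0] * n1 + [1] * n2
y = x1 + x2
sxx = sum_sq_dev(x)
sxy = sum((a - mean(x)) * (b - mean(y)) for a, b in zip(x, y))
slope = sxy / sxx
intercept = mean(y) - slope * mean(x)
sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
slope_se = sqrt(sse / df / sxx)
t_slope = slope / slope_se

# 3) One-way ANOVA with two groups.
ss_between = n1 * (mean(x1) - mean(y)) ** 2 + n2 * (mean(x2) - mean(y)) ** 2
ss_within = sum_sq_dev(x1) + sum_sq_dev(x2)
f_stat = (ss_between / 1) / (ss_within / df)

# 4) Coefficient of determination recovered from t and d.f.
r_squared = t_stat ** 2 / (t_stat ** 2 + df)

print(f"t = {t_stat:.3f}, slope = {slope:.1f}, slope s.e. = {slope_se:.7f}")
print(f"t for slope = {t_slope:.3f}, F = {f_stat:.3f}, r^2 = {r_squared:.2f}")
```

The three routes are computed independently here, so agreement between `t_stat`, `t_slope`, and the square root of `f_stat` is a genuine check rather than a restatement.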

Relationships among Distributions and Unification of Statistical Tables

Particular attention must be paid to a first course in statistics. When I first began studying statistics,
it bothered me that there were different tables for different tests. It took me a while to learn that this
is not as haphazard as it appeared. The Binomial, Normal, Chi-square, t, and F distributions that you
will learn are actually closely connected.

A problem with elementary statistical textbooks is that they usually do not provide these conceptual
links, which are what permit a useful understanding of the principles involved. If you want to
understand connections between statistical concepts, then you should practice making these
connections. Learning by doing statistics lends itself to active rather than passive learning.
Statistics is a highly interrelated set of concepts, and to be successful at it, you must learn to make
these links conscious in your mind.

Students often ask: Why are t-table values with d.f. = 1 so much larger than those for other d.f.
values? Some tables are limited; what should I do when the sample size is too large? How can I
become familiar with the tables and their differences? Is there any type of integration among tables?
Is there any connection between tests of hypotheses and confidence intervals under different
scenarios, for example, testing with respect to one, two, or more than two populations?

The following Figure demonstrates useful relationships among distributions and a unification of
statistical tables:

Figure: Useful Relationships Among Common Density Functions

For example, the following are some nice connections between major tables:

Standard normal z and F-statistic: $F = z^2$, where F has d.f.1 = 1, and d.f.2 is the largest available in
the F-table.

t-statistic and F-statistic: $F = t^2$, where F has d.f.1 = 1, and d.f.2 = d.f. of the t-table.

Chi-square and F-statistic: $F = \chi^2/\text{d.f.}_1$, where F has d.f.1 = d.f. of the Chi-square table, and
d.f.2 is the largest available in the F-table.

t-statistic and Chi-square: $(\chi^2)^{1/2} = t$, where Chi-square has d.f. = 1, and t has d.f. = $\infty$.

Standard normal z and t-statistic: $z = t$, where t has d.f. = $\infty$.

Standard normal z and Chi-square: $(2\chi^2)^{1/2} - (2\,\text{d.f.} - 1)^{1/2} = z$, where d.f. is the largest
available in the Chi-square table.

Standard normal z, Chi-square, and t-statistic: $z/(\chi^2/n)^{1/2} = t$ with d.f. = n.

F-statistic and its inverse: $F_{\alpha}(n_1, n_2) = 1/F_{1-\alpha}(n_2, n_1)$; therefore it is only necessary to
tabulate, say, the upper tail probabilities.

Correlation coefficient r and t-statistic: $t = r\,(n-2)^{1/2}/(1 - r^2)^{1/2}$.

Transformation of Some Inferential Statistics to the Standard Normal Z:

For the t(df): $Z = \{df \times \ln[1 + (t^2/df)]\}^{1/2} \times \{1 - [1/(2\,df)]\}^{1/2}$.

For the F(1, df): $Z = \{df \times \ln[1 + (F/df)]\}^{1/2} \times \{1 - [1/(2\,df)]\}^{1/2}$,

where ln is the natural logarithm.
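
As a quick numerical experiment, the two transformations to Z can be coded directly. The following is a minimal sketch in plain Python; the function names `t_to_z`, `f_to_z`, and `two_sided_normal_p` are illustrative. Note that the formula returns only the magnitude of Z, so the sign of t must be carried over separately, and that the result is an approximation, not an exact p-value.

```python
from math import erfc, log, sqrt

def t_to_z(t, df):
    """Approximate |Z| equivalent of a t-statistic with df degrees of freedom."""
    return sqrt(df * log(1 + t * t / df)) * sqrt(1 - 1 / (2 * df))

def f_to_z(f, df):
    """Approximate |Z| equivalent of an F(1, df) statistic."""
    return sqrt(df * log(1 + f / df)) * sqrt(1 - 1 / (2 * df))

def two_sided_normal_p(z):
    """Two-sided tail probability of the standard normal distribution."""
    return erfc(abs(z) / sqrt(2))

# The t-statistic from the regression/ANOVA example: t = 3.207, d.f. = 18.
t_value, df = 3.207, 18
z_from_t = t_to_z(t_value, df)
z_from_f = f_to_z(t_value ** 2, df)  # F = t^2 gives the same Z
p_approx = two_sided_normal_p(z_from_t)

print(f"Z from t = {z_from_t:.3f}, Z from F = {z_from_f:.3f}")
print(f"approximate two-sided p-value = {p_approx:.4f}")
```

Because F = t² when d.f.1 = 1, the two functions return the same value for matching inputs, which is itself one of the table connections listed above.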

Visit also the Relationships among Common Distributions.

You may like using the statistical tables at the back of your book and/or P-values JavaScript in
performing some numerical experimentation for validating the above relationships for a deeper
understanding of the concepts. You might need to use a scientific calculator, too.

Further Reading:
Kagan A., What students can learn from tables of basic distributions, Int. Journal of Mathematical Education in Science and Technology, 30(6), 1999.

Index Numbers and Ratios

When facing a lack of a unit of measure, we often use indicators as surrogates for direct
measurement. For example, the height of a column of mercury is a familiar indicator of
temperature. No one presumes that the height of mercury column constitutes temperature in quite
the same sense that length constitutes the number of centimeters from end to end. However, the
height of a column of mercury is a dependable correlate of temperature and thus serves as a
useful measure of it. Therefore, an indicator is an accessible and dependable correlate of a
dimension of interest; that correlate is used as a measure of that dimension because direct
measurement of the dimension is not possible or practical. In like manner, index numbers serve as
surrogates for actual data.

The primary purposes of an index number are to provide a value useful for comparing magnitudes
of aggregates of related variables to each other, and to measure the changes in these magnitudes
over time. Consequently, many different index numbers have been developed for special use.
There are a number of particularly well-known ones, some of which are announced on public
media every day. Government agencies often report time series data in the form of index numbers.
For example, the consumer price index is an important economic indicator. Therefore, it is useful to
understand how index numbers are constructed and how to interpret them. These index numbers
usually start from a base value of 100 at a specified point in time, so that later values indicate the
change in magnitude relative to that base.

For example, in determining the cost of living, the Bureau of Labor Statistics (BLS) first identifies
a "market basket" of goods and services the typical consumer buys. Annually, the BLS surveys
consumers to determine what they buy, where they buy it, and how much they pay. The Consumer
Price Index (CPI) is used to monitor changes in the cost of living (i.e., the cost of the selected market
basket) over time. When the CPI rises, the typical family has to spend more dollars to maintain the
same standard of living. The CPI reports the movement of prices, not in dollar amounts, but with an
index number.

Consumer Price Index

The simplest and most widely used measure of inflation is the Consumer Price Index (CPI). To
compute the price index, the cost of the market basket in any period is divided by the cost of the
market basket in the base period, and the result is multiplied by 100.

If you want to forecast the economic future, you can do so without knowing anything about how the
economy works. Further, your forecasts may turn out to be as good as those of professional
economists. The key to your success will be the Leading Indicators, an index of items that
generally swing up or down before the economy as a whole does.

Consider the following two-item market basket:

               Period 1               Period 2
Items      Quantity   Price       Quantity   Price
             (q1)      (p1)         (q2)      (p2)
Apples        10      $0.20           8      $0.25
Oranges        9      $0.25          11      $0.21

Using period 1 quantities, the price index in period 2 is

($4.39/$4.25) x 100 = 103.29

Using period 2 quantities, the price index in period 2 is

($4.31/$4.35) x 100 = 99.08

A better price index could be found by taking the geometric mean of the two. To find the geometric
mean, multiply the two together and then take the square root. The result is called a Fisher Index.
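
The two price indexes and their geometric mean can be computed directly from the apples-and-oranges table. The following is a minimal sketch in plain Python. The variable names are illustrative; `laspeyres` and `paasche` are the conventional textbook names for the base-period-quantity and current-period-quantity indexes, which the text above does not name.

```python
from math import sqrt

# item: (q1, p1, q2, p2) = quantity and price in period 1, then in period 2
basket = {
    "Apples": (10, 0.20, 8, 0.25),
    "Oranges": (9, 0.25, 11, 0.21),
}

def basket_cost(use_period2_prices, use_period2_quantities):
    """Cost of the basket for a chosen combination of prices and quantities."""
    total = 0.0
    for q1, p1, q2, p2 in basket.values():
        price = p2 if use_period2_prices else p1
        quantity = q2 if use_period2_quantities else q1
        total += price * quantity
    return total

# Index using period 1 quantities (conventionally called Laspeyres).
laspeyres = 100 * basket_cost(True, False) / basket_cost(False, False)

# Index using period 2 quantities (conventionally called Paasche).
paasche = 100 * basket_cost(True, True) / basket_cost(False, True)

# Fisher index: geometric mean of the two.
fisher = sqrt(laspeyres * paasche)

print(f"period 1 quantities: {laspeyres:.2f}")
print(f"period 2 quantities: {paasche:.2f}")
print(f"Fisher index:        {fisher:.2f}")
```

The Fisher index necessarily falls between the other two, since it is their geometric mean.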

In the USA, since January 1999, the geometric mean formula has been used to calculate most basic
indexes within the Consumer Price Indexes (CPI); in other words, the prices within most item
categories (e.g., apples) are averaged using a geometric mean formula. This improvement moves
the CPI somewhat closer to a cost-of-living measure, as the geometric mean formula allows for a
modest amount of consumer substitution as relative prices within item categories change.

Notice that, since the geometric mean formula is used only to average prices within item
categories, it does not account for consumer substitution taking place between item categories. For
example, if the price of pork increases compared to those of other meats, shoppers might shift their
purchases away from pork to beef, poultry, or fish. The CPI formula does not reflect this type of
consumer response to changing relative prices.

Ratio Index Numbers

The following provides the computational procedures with applications for some Index numbers,
including the Ratio Index, and Composite Index numbers.

Suppose we are interested in the labor utilization of two manufacturing plants, A and B, with the unit
outputs and man-hours over the last three months shown in the following table, together with the
national standard:

              Plant Type A                Plant Type B
Months     Unit Output  Man-Hours     Unit Output  Man-Hours
1               283      200000          11315      680000
2               760      300000          12470      720000
3              1195      530000          13395      750000
Standard       4000      600000          16000      800000

The labor utilization for Plant A in the first month is:

$L_{A,1} = (200000/283) / (600000/4000) = 706.71/150 = 4.71$

Similarly,

$L_{B,3} = (750000/13395) / (800000/16000) = 55.99/50 = 1.12$.

Upon computing the labor utilization for both plants for each month, one can present the results by
graphing the labor utilization over time for comparative studies.
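
The ratio index for every plant and month can be tabulated in a few lines. The following is a minimal sketch in plain Python, computed directly from the figures in the table; the function name `labor_utilization` and the data layout are illustrative.

```python
# (unit output, man-hours) for months 1 to 3, taken from the table above.
plant_a = [(283, 200000), (760, 300000), (1195, 530000)]
plant_b = [(11315, 680000), (12470, 720000), (13395, 750000)]

# National standard (unit output, man-hours) for each plant type.
standard_a = (4000, 600000)
standard_b = (16000, 800000)

def labor_utilization(output, man_hours, standard):
    """Man-hours per unit relative to the standard man-hours per unit."""
    std_output, std_man_hours = standard
    return (man_hours / output) / (std_man_hours / std_output)

util_a = [labor_utilization(out, hrs, standard_a) for out, hrs in plant_a]
util_b = [labor_utilization(out, hrs, standard_b) for out, hrs in plant_b]

for month, (a, b) in enumerate(zip(util_a, util_b), start=1):
    print(f"month {month}: plant A = {a:.2f}, plant B = {b:.2f}")
```

A value above 1 means the plant used more man-hours per unit of output than the standard; the monthly values are what one would plot over time for the comparative study.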

You might like to use the Index Numbers JavaScript to check your hand computation.

Composite Index Numbers

Consider the total labor, and material cost for two consecutive years for an industrial plant, as
shown in the following table:

                               Year 2000             Year 2001
              Units Needed   Unit Cost  Total     Unit Cost  Total
Labor              20            10      200          11      220
Aluminum            2           100      200         110      220
Electricity         2            50      100          60      120
Total                                    500                  560

From the information given in the above table, the indexes for the two consecutive years are 500/500
= 1 and 560/500 = 1.12, respectively.
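
The composite index is simply the ratio of total costs, with the quantities held fixed. The following is a minimal sketch in plain Python using the figures from the table; the dictionary layout and variable names are illustrative.

```python
# item: (units needed, unit cost in 2000, unit cost in 2001)
inputs = {
    "Labor": (20, 10, 11),
    "Aluminum": (2, 100, 110),
    "Electricity": (2, 50, 60),
}

total_2000 = sum(units * cost_2000 for units, cost_2000, _ in inputs.values())
total_2001 = sum(units * cost_2001 for units, _, cost_2001 in inputs.values())

# Year 2000 is the base year, so its index is 1 by construction.
index_2000 = total_2000 / total_2000
index_2001 = total_2001 / total_2000

print(f"totals: {total_2000} and {total_2001}")
print(f"indexes: {index_2000:.2f} and {index_2001:.2f}")
```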

Further Readings:
Watson C., P. Billingsley, D. Croft, and D. Huntsberger, Statistics for Management and Economics, Allyn & Bacon, Inc., 1993.

Variation Index as a Quality Indicator

A commonly used index for measuring and comparing variation in nominal and ordinal data is
called the index of dispersion:

$D = k\,(N^2 - \sum f_i^2) / [N^2 (k-1)]$

where k is the number of categories, $f_i$ is the number of ratings in category i, and N is the total
number of ratings. D is a number between zero and 1: it equals zero if all ratings fall into one
category, and 1 if the ratings are equally divided among the k categories.

An Application: Consider the following data with N = 100 participants, k = 5 categories, f1 = 25, f2
= 42, and so on.

Category Frequency
A 25
B 42
C 8
D 13
E 12

Therefore the dispersion index is: $D = 5\,(100^2 - 2766)/[100^2 (4)] = 0.904$, indicating a good spread
of scores across the categories.
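
The index of dispersion is easy to compute for any frequency table. The following is a minimal sketch in plain Python; the function name `index_of_dispersion` is illustrative. The two extra calls check the limiting cases: all ratings in one category, and ratings divided equally.

```python
def index_of_dispersion(frequencies):
    """D = k (N^2 - sum of f_i^2) / [N^2 (k - 1)] for a list of category counts."""
    k = len(frequencies)
    n = sum(frequencies)
    sum_sq = sum(f * f for f in frequencies)
    return k * (n * n - sum_sq) / (n * n * (k - 1))

# Frequencies for categories A to E from the application above.
d_example = index_of_dispersion([25, 42, 8, 13, 12])

# Limiting cases with the same N = 100 and k = 5.
d_one_category = index_of_dispersion([100, 0, 0, 0, 0])
d_equal_split = index_of_dispersion([20, 20, 20, 20, 20])

print(f"D (example)      = {d_example:.3f}")
print(f"D (one category) = {d_one_category:.3f}")
print(f"D (equal split)  = {d_equal_split:.3f}")
```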

You might like to use the Index Numbers JavaScript to check your hand computation.

Labor Force Unemployment Index

Is a given city an economically depressed area? The degree of unemployment among the labor
force is considered to be a proper indicator of economic depression. To construct the
unemployment index, each person is classified with respect to both membership in the labor force
and degree of unemployment, each expressed as a fraction ranging from 0 to 1. The fraction of the
labor force that is idle is:

$L = \sum U_i P_i / \sum P_i$, where the sums are over all i = 1, 2, …, n,

where $P_i$ is the proportion of a full workweek for which resident i held or sought employment,
and n is the total number of residents in the area. $U_i$ is the proportion of $P_i$ for which resident i
was unemployed. For example, a person seeking two days of work per week (of 5 days) and
employed for only one-half day would be identified with $P_i$ = 2/5 = 0.4 and $U_i$ = 1.5/2 = 0.75.
The product $U_i P_i$ = 0.3 is the portion of a full workweek for which the person was
unemployed.
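
To see how individual records combine into the index, consider a tiny made-up area. The following is a minimal sketch in plain Python. Only the first resident comes from the example above; the other two residents are hypothetical and are included purely for illustration, as is the function name `unemployment_index`.

```python
def unemployment_index(residents):
    """L = sum(U_i * P_i) / sum(P_i) over (P_i, U_i) pairs, one per resident."""
    idle = sum(p * u for p, u in residents)
    labor = sum(p for p, _ in residents)
    return idle / labor

residents = [
    (0.4, 0.75),  # from the text: seeks 2 of 5 days, works only half a day
    (1.0, 0.0),   # hypothetical: full-time and fully employed
    (1.0, 1.0),   # hypothetical: seeks full-time work, has none
]

L = unemployment_index(residents)
print(f"L = {L:.3f}")
```

A resident who is not in the labor force at all would enter with P_i = 0 and so would not affect either sum.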

Now the question is: what value of L constitutes an economically depressed area? That is for the
decision-maker to decide.

Seasonal Index and Deseasonalizing Data

A seasonal index represents the extent of seasonal influence for a particular segment of the year.
The calculation involves a comparison of the expected values of that period to the grand mean.

We need to get an estimate of the seasonal index for each month, or other period such as
quarter or week, depending on the data availability. Seasonality is a pattern that repeats every
cycle. For example, an annual seasonal pattern has a cycle that is 12 periods long if the periods are
months, or 4 periods long if the periods are quarters.

A seasonal index is how much the average for that particular period tends to be above (or below)
the grand average. Therefore, to get an accurate estimate of it, we compute the average of the
first period of the cycle, the second period, etc., and divide each by the overall average. The
formula for computing seasonal factors is:

$S_i = D_i / D$,

where:
$S_i$ = the seasonal index for the ith period,
$D_i$ = the average value of the ith period,
D = the grand average,
i = the ith seasonal period of the cycle.

A seasonal index of 1.00 for a particular month indicates that the expected value for that month
equals the overall monthly average, i.e., 1/12 of the average annual total. A seasonal index of 1.25
indicates that the expected value for that month is 25% greater than the overall monthly average.
A seasonal index of 0.80 indicates that the expected value for that month is 20% less than the
overall monthly average.

Deseasonalizing Process: Deseasonalizing the data, also called Seasonal Adjustment, is the
process of removing recurrent and periodic variations over a short time frame (e.g., weeks,
quarters, months). Seasonal variations are regularly repeating movements in series
values that can be tied to recurring events. The deseasonalized data are obtained by simply dividing
each time series observation by the corresponding seasonal index.

Almost all time series published by the government are already deseasonalized using the seasonal
index, to unmask the underlying trends in the data, which could otherwise be hidden by the
seasonality factor.

A Numerical Application: The following table provides monthly sales ($000) at a college
bookstore.

Year     Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec   Total
1        196    188    192    164    140    120    112    140    160    168    192    200    1972
2        200    188    192    164    140    122    132    144    176    168    196    194    2016
3        196    212    202    180    150    140    156    144    164    186    200    230    2160
4        242    240    196    220    200    192    176    184    204    228    250    260    2592
Mean   208.5  207.0  195.5  182.0  157.5  143.5  144.0  153.0  176.0  187.5  209.5  221.0    2185
Index   1.15   1.14   1.07   1.00   0.86   0.79   0.79   0.84   0.97   1.03   1.15   1.21      12

The sales show a seasonal pattern, with the greatest sales when the college is in session and
a decrease during the summer months. For example, for January the index is:

S(Jan) = D(Jan)/D = 208.5/182.08 = 1.15,

where D(Jan) is the mean of all four January values, and D is the grand mean of all monthly sales
over the past four years.
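
The whole calculation, including the deseasonalizing step, can be recomputed directly from the four rows of sales data. The following is a minimal sketch in plain Python; the variable names are illustrative and not part of the original material.

```python
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Monthly sales ($000) for years 1 to 4, one row per year.
sales = [
    [196, 188, 192, 164, 140, 120, 112, 140, 160, 168, 192, 200],
    [200, 188, 192, 164, 140, 122, 132, 144, 176, 168, 196, 194],
    [196, 212, 202, 180, 150, 140, 156, 144, 164, 186, 200, 230],
    [242, 240, 196, 220, 200, 192, 176, 184, 204, 228, 250, 260],
]

# D_i: average of each month across the years.
monthly_means = [sum(column) / len(column) for column in zip(*sales)]

# D: grand average over all monthly observations.
grand_mean = sum(sum(row) for row in sales) / (len(sales) * len(months))

# S_i = D_i / D
seasonal_index = [m / grand_mean for m in monthly_means]

# Deseasonalized data: each observation divided by its month's index.
deseasonalized = [
    [value / s for value, s in zip(row, seasonal_index)] for row in sales
]

for name, m, s in zip(months, monthly_means, seasonal_index):
    print(f"{name}: mean = {m:.1f}, index = {s:.2f}")
print(f"grand mean = {grand_mean:.2f}, sum of indexes = {sum(seasonal_index):.2f}")
```

The indexes always sum to the number of periods in the cycle (here 12), which is a convenient check on a hand computation.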

You might like to use the Seasonal Index JavaScript to check your hand computation. As always
you must first use Plot of the Time Series as a tool for the initial characterization process.

For testing seasonality based on seasonal index, you may like to use Test for Seasonality
JavaScript.

For modeling the time series having both the seasonality and trend components, visit the Business
Forecasting site.

Human Ideal Weight: The Body Mass Index

One of the oldest and still most popular indexes models human ideal weight by means of the Body
Mass Index (BMI).

The foundation of Ideal Weight rests on historical, social, behavioral, cultural, physiological,
metabolic, and genetic perspectives.

The normal digestive process: Normally, as food moves along the digestive tract, digestive
juices and enzymes digest and absorb calories and nutrients (see figure 1). After we chew and
swallow our food, it moves down the esophagus to the stomach, where a strong acid continues the
digestive process. The stomach can hold about 3 pints of food at one time. When the stomach
contents move to the duodenum, the first segment of the small intestine, bile and pancreatic juice
speed up digestion. Most of the iron and calcium in the foods we eat is absorbed in the duodenum.
The jejunum and ileum, the remaining two segments of the nearly 20 feet of small intestine,
complete the absorption of almost all calories and nutrients. The food particles that cannot be
digested in the small intestine are stored in the large intestine until eliminated.

The history of the formulas for calculating ideal body weight began in 1871 when a French medical
doctor developed a model. These formulas pre-dated and probably influenced development of the
Metropolitan Life tables of height and weight. However, these formulas have no method to
compensate for Age and Current Weight. They are only based on Height. For people who are very
overweight or obese the formulas would suggest an ideal weight that is virtually impossible to
achieve or maintain through dieting.

Body Mass Index or BMI is the standardized method for determining whether your body weight and
the amount of body fat you have are in a healthy range. A BMI Metric Calculator uses a
weight-to-height ratio ($\text{BMI} = \text{kg}/\text{m}^2$) and assigns a number to the result. To get your
approximate BMI using the English system, multiply your weight in pounds by 703, then divide the
result by your height in inches, and divide that result by your height in inches a second time; i.e.,
$\text{BMI} = 703\,W/h^2$.
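
Both versions of the formula can be written as one-line functions. The following is a minimal sketch in plain Python; the function names and the sample person (70 kg, 1.75 m) are hypothetical and chosen only for illustration.

```python
def bmi_metric(weight_kg, height_m):
    """BMI from weight in kilograms and height in meters."""
    return weight_kg / height_m ** 2

def bmi_english(weight_lb, height_in):
    """Approximate BMI from weight in pounds and height in inches."""
    return 703 * weight_lb / height_in ** 2

# A hypothetical person: 70 kg and 1.75 m, i.e. about 154.32 lb and 68.9 in.
bmi_from_metric = bmi_metric(70, 1.75)
bmi_from_english = bmi_english(154.32, 68.9)

print(f"metric:  {bmi_from_metric:.2f}")
print(f"English: {bmi_from_english:.2f}")
```

The two results agree to within rounding of the unit conversions, which is all the factor 703 is meant to achieve.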

Metric and English Conversions:

Converting Kilograms into Pounds:

1 kilogram = 2.2 pounds

Converting Meters and Centimeters into Feet and Inches:

1 foot = 12 inches, 1 cm = 0.394 inch, 1 inch = 2.54 cm,

1 meter = 3.281 feet, 1 foot = 30.48 cm.

BMI values of interest range from 18.5 to 30 or greater. Generally speaking, a Body Mass Index
over 25 is considered overweight and 30 or above is obese. People with a higher percentage of
body fat tend to have a higher BMI; body builders are an exception, since their high BMI reflects
muscle rather than fat.

The BMI ranges for adults are shown in the following chart.

They are not exact ranges of healthy and unhealthy weights. However, they show that health risk
increases at higher levels of overweight and obesity. Even within the healthy BMI range, weight
gains can carry health risks for adults.

This Body Mass Index chart lets you see if your weight falls within a healthy range. Use this as a
guide only. Work closely with your doctor to develop a weight control plan that is right for you.

Overweight refers to an excess of body weight, but not necessarily body fat. Obesity means an
excessively high proportion of body fat. Health professionals use a measurement called body mass
index (BMI) to classify an adult's weight as healthy, overweight, or obese. BMI describes body
weight relative to height and is correlated with total body fat content in most adults. For example,
having excess abdominal body fat is a health risk. Men with a waist of more than 40 inches around
and women with a waist of 35 inches or more are at risk for health problems.

Formulas for Lean Body

For Men:
$\text{Lean Body Weight} = 1.10 \times W - 128 \times W^2/(100 \times H)^2$

For Women:
$\text{Lean Body Weight} = 1.07 \times W - 148 \times W^2/(100 \times H)^2$

where W is the weight in kg and H is the height in m.

Women tend to imagine their ideal weight is unrealistically low, so they diet unnecessarily. Men
tend to allow their ideal weight to be higher than medically recommended. Men and Women should
learn from each other.

You might like to use the Body Mass Index JavaScript to check your hand computation.

Notice that, for example, a large waist and wide hips signal accumulation of so-called
"intra-abdominal fat", the particularly harmful deep "hidden" fat that surrounds the abdominal organs
and is linked to diabetes, high blood pressure, and heart disease. Therefore, one must also consider
fat distribution, which is not indicated by weight and height alone.

Further Readings:
Pai M., and Paloucek F., The origin of the "Ideal" body weight equations, Ann Pharmacother, 34, 1066-1069, 2000.

Statistical Technique and Index Numbers

One must be careful in applying or generalizing any statistical technique to index numbers. For
example, the correlation of rates raises a potential problem. Specifically, let X, Y, and Z be three
independent variables, so that the pair-wise correlations are zero; however, the ratios X/Y and Z/Y
will be correlated due to the common denominator.
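
A small simulation makes the common-denominator effect visible. The following is a minimal sketch in plain Python. The distributions (normal with mean 10 and standard deviation 1), the sample size, the seed, and the function name `pearson` are all assumptions chosen for illustration; nothing here comes from the original text beyond the idea being demonstrated.

```python
import random
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20000

# Three independent variables, kept well away from zero so ratios are stable.
x = [rng.gauss(10, 1) for _ in range(n)]
y = [rng.gauss(10, 1) for _ in range(n)]
z = [rng.gauss(10, 1) for _ in range(n)]

corr_raw = pearson(x, z)
corr_ratios = pearson(
    [a / b for a, b in zip(x, y)],
    [c / b for c, b in zip(z, y)],
)

print(f"correlation of X and Z:     {corr_raw:.3f}")
print(f"correlation of X/Y and Z/Y: {corr_ratios:.3f}")
```

With equal coefficients of variation for all three variables, as assumed here, the ratios come out with a correlation of roughly one half even though X and Z themselves are essentially uncorrelated.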

Let I = X1/X2, where X1 and X2 are dependent variables with correlation r, having means and
coefficients of variation m1, c1 and m2, c2, respectively; then, approximately,

$\text{Mean of } I = m_1 (1 - r\,c_1 c_2 + c_2^2)/m_2$,

$\text{Standard Deviation of } I = m_1 (c_1^2 - 2\,r\,c_1 c_2 + c_2^2)^{1/2}/m_2$.

For more index numbers and ratios, visit the Economics and Financial Ratios and Indices site.

The Copyright Statement: The fair use, according to the 1996 Fair Use Guidelines for Educational
Multimedia, of materials presented on this Web site is permitted for non-commercial and classroom
purposes only.
This site may be mirrored intact (including these notices), on any server with public access, and
linked to other Web pages. All files are available at http://home.ubalt.edu/ntsbarsh/Business-stat
for mirroring.

Kindly e-mail me your comments, suggestions, and concerns. Thank you.

Professor Hossein Arsham

This site was launched on 1/18/1994, and its intellectual materials have been thoroughly revised
on a yearly basis. The current version is the 12th Edition. All external links are checked once a
month.

EOF: © 1994-2015
