On
Quantitative Techniques
Prepared By:
Submitted To: Prof. Priyabrata Nayak
Contents
Introduction
History
Data Collection
Arranging Data
Using Data Array
Using Frequency Distribution
Central Tendency
Dispersion
Skewness
Measures of Central Tendency
Mean
Median
Mode
Probability
Probability Distribution
Discrete Probability Distribution
Sampling
Central Limit Theorem
Standard Error
Sample Size
Estimation
Point Estimation
Interval Estimation
Correlation
Coefficient of Determination
Introduction to Statistics
Dispersion
Measures of Central Tendency
The three most common measures of central tendency are
the mean, the median, and the mode.
Arithmetic Mean
The arithmetic mean is the most common measure of central tendency. It is
simply the sum of the numbers divided by the number of numbers. The symbol
μ is used for the mean of a population, and the symbol M is used for the
mean of a sample. The formula for μ is shown below:

    μ = ΣX / N

where ΣX is the sum of all the numbers in the sample and N is the number
of numbers in the sample. As an example, the mean of the numbers 1, 2, 3,
6, 8 is (1 + 2 + 3 + 6 + 8) / 5 = 20 / 5 = 4, regardless of whether the
numbers constitute the entire population or just a sample from the
population.
The table, Number of touchdown passes, shows the number of touchdown (TD) passes
thrown by each of the 31 teams in the National Football League in the 2000 season.
The mean number of touchdown passes thrown is 20.4516, as shown below:

    μ = ΣX / N = 634 / 31 = 20.4516
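As a quick sketch of the formula μ = ΣX / N, the mean can be computed directly in Python (illustrative code, not part of the original report, using the 1, 2, 3, 6, 8 example above):

```python
# Arithmetic mean: the sum of the numbers divided by how many there are.
def mean(values):
    return sum(values) / len(values)

print(mean([1, 2, 3, 6, 8]))  # 4.0, matching the worked example
```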
Median
             Dataset 1   Dataset 2   Dataset 3
You              3           3           3
John's           3           4           2
Maria's          3           4           2
Shareecia's      3           4           2
Luther's         3           5           1
For Dataset 1, the median is three, the same as your score. For Dataset 2, the
median is 4. Therefore, your score is below the median. This means you are in the
lower half of the class. Finally for Dataset 3, the median is 2. For this dataset, your
score is above the median and therefore in the upper half of the distribution.
Computation of the Median: When there is an odd number of numbers, the median is
simply the middle number. For example, the median of 2, 4, and 7 is 4. When there
is an even number of numbers, the median is the mean of the two middle numbers.
Thus, the median of the numbers 2, 4, 7, 12 is (4 + 7) / 2 = 5.5.
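The two computation rules above (middle value for an odd count, mean of the two middle values for an even count) can be sketched in Python as follows (illustrative code):

```python
# Median: middle value for an odd count of numbers, mean of the
# two middle values for an even count.
def median(values):
    s = sorted(values)
    n = len(s)
    if n % 2 == 1:
        return s[n // 2]
    return (s[n // 2 - 1] + s[n // 2]) / 2

print(median([2, 4, 7]))      # 4
print(median([2, 4, 7, 12]))  # 5.5
```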
Mode
The mode is the most frequently occurring value. For the data in the table,
Number of touchdown passes, the mode is 18 since more teams (4) had
18 touchdown passes than any other number of touchdown passes. With
continuous data such as response time measured to many decimals, the
frequency of each value is one since no two scores will be exactly the
same (see discussion of continuous variables). Therefore the mode of
continuous data is normally computed from a grouped frequency
distribution. The Grouped frequency distribution table shows a grouped
frequency distribution for the target response time data. Since the interval
with the highest frequency is 600-700, the mode is the middle of that
interval (650).
Grouped frequency distribution
Range Frequency
500-600 3
600-700 6
700-800 5
800-900 5
900-1000 0
1000-1100 1
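Finding the mode of grouped data as described above, the midpoint of the interval with the highest frequency, can be sketched in Python using the table's values (illustrative code):

```python
# Mode of grouped data: midpoint of the interval with the highest frequency.
groups = {(500, 600): 3, (600, 700): 6, (700, 800): 5,
          (800, 900): 5, (900, 1000): 0, (1000, 1100): 1}
low, high = max(groups, key=groups.get)  # modal class: (600, 700)
mode = (low + high) / 2
print(mode)  # 650.0
```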
Probability
Probability theory is the mathematical study of phenomena
characterized by randomness or uncertainty.
More precisely, probability is used to model situations in which an
experiment, realized under the same circumstances, can produce
different results (typically the roll of a die or the toss of a coin).
Mathematicians and actuaries think of probabilities as numbers in the
closed interval from 0 to 1 assigned to "events" whose occurrence or
failure to occur is random. Probabilities P(A) are assigned to events A
according to the probability axioms.
The probability that an event A occurs given the known occurrence of an
event B is the conditional probability of A given B; its numerical value is
P(A|B) = P(A and B) / P(B) (as long as P(B) is nonzero). If the conditional
probability of A given B is the same as the ("unconditional") probability of
A, then A and B are said to be independent events. That this relation
between A and B is symmetric may be seen more readily by realizing
that it is the same as saying P(A and B) = P(A)P(B) when A and B are
independent events.
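Conditional probability and independence can be checked by enumerating a small sample space. This sketch (illustrative code, not from the original text) uses two tosses of a fair coin, with A = "first toss is a head" and B = "second toss is a head":

```python
from fractions import Fraction
from itertools import product

# Sample space: two tosses of a fair coin, all four outcomes equally likely.
space = list(product("HT", repeat=2))

def prob(event):
    # Probability of an event as (favorable outcomes) / (total outcomes).
    return Fraction(sum(1 for w in space if event(w)), len(space))

A = lambda w: w[0] == "H"  # first toss is a head
B = lambda w: w[1] == "H"  # second toss is a head

p_a_given_b = prob(lambda w: A(w) and B(w)) / prob(B)  # P(A|B)
print(p_a_given_b == prob(A))  # True: A and B are independent
```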
Probability Distribution
A probability distribution lists the outcomes of an experiment and their
probabilities of occurrence. If the experiment were repeated any number
of times, the same probabilities should hold. For example, the probability
distribution for the possible number of heads from two tosses of a fair
coin having both a head and a tail would be as follows:

Number of Heads   Tosses                        Probability of Event
0                 (tail, tail)                  0.25
1                 (head, tail) + (tail, head)   0.50
2                 (head, head)                  0.25
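The coin-toss distribution above can be reproduced by counting heads over the four equally likely outcomes (illustrative code):

```python
from collections import Counter
from itertools import product

# Number of heads in each of the four equally likely two-toss outcomes.
heads = [toss.count("H") for toss in product("HT", repeat=2)]
counts = Counter(heads)
dist = {k: counts[k] / len(heads) for k in sorted(counts)}
print(dist)  # {0: 0.25, 1: 0.5, 2: 0.25}
```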
Example
A typical example is the following: assume 5% of the
population is green-eyed. You pick 500 people randomly.
The number of green-eyed people you pick is a random
variable X which follows a binomial distribution with n = 500
and p = 0.05 (when picking the people with replacement).
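The probabilities for such a binomial variable follow the standard formula P(X = k) = C(n, k) p^k (1 - p)^(n - k), sketched here in Python (illustrative code):

```python
from math import comb

# P(X = k) for X ~ Binomial(n, p).
def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 500, 0.05
print(n * p)  # 25.0 -- the expected number of green-eyed people
# The probabilities over all possible k sum to 1:
print(round(sum(binom_pmf(k, n, p) for k in range(n + 1)), 6))  # 1.0
```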
Sampling
In many disciplines, there is often a need to describe the
characteristics of some large entity, such as the air quality in a region,
the prevalence of smoking in the general population, or the output
from a production line of a pharmaceutical company. Due to practical
considerations, it is impossible to assay the entire atmosphere,
interview every person in the nation, or test every pill. Sampling is the
process whereby information is obtained from selected parts of an
entity, with the aim of making general statements that apply to the
entity as a whole, or an identifiable part of it. Opinion pollsters use
sampling to gauge political allegiances or preferences for brands of
commercial products, whereas water quality engineers employed by
public health departments will take samples of water to make sure it
is fit to drink. The process of drawing conclusions about the larger
entity based on the information contained in a sample is known as
statistical inference.
There are several advantages to using sampling rather than
conducting measurements on an entire population. An important
advantage is the considerable savings in time and money that can
result from collecting information from only a small fraction of the population.
When sampling individuals, the reduced number of subjects that need
to be contacted may allow more resources to be devoted to finding
and persuading nonresponders to participate. The information
collected using sampling is often more accurate, as greater effort can
be expended on the training of interviewers, more sophisticated and
expensive measurement devices can be used, repeated
measurements can be taken, and more detailed questions can be
posed.
Definitions
The term "target population" is commonly used to refer to the group of
people or entities (the "universe") to which the findings of the sample
are to be generalized. The "sampling unit" is the basic unit (e.g.,
person, household, pill) around which a sampling procedure is
planned. For instance if one wanted to apply sampling methods to
estimate the prevalence of diabetes in a population, the sampling unit
would be persons, whereas households would be the sampling unit
for a study to determine the number of households where one or
more persons were smokers. The "sampling frame" is any list of all
the sampling units in the target population. Although a complete list of
all individuals in a population is rarely available, alphabetic listings
of residents in a community or of registered voters are examples of
sampling frames.
Central Limit Theorem
A central limit theorem is any of a set of weak-
convergence results in probability theory. They all express
the fact that any sum of many independent identically
distributed random variables will tend to be distributed
according to a particular "attractor distribution". The most
important and famous result, the Central Limit Theorem, states that if
the sum of the variables has a finite variance, then it will be
approximately normally distributed.
Since many real processes yield distributions with finite
variance, this explains the ubiquity of the normal
distribution.
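A small simulation illustrates this tendency. Sums of independent Uniform(0, 1) draws should cluster around a normal shape with mean n/2 and standard deviation sqrt(n/12) (illustrative code, with an arbitrary choice of n = 48 and a fixed random seed):

```python
import random
import statistics

random.seed(0)

# Sums of 48 independent Uniform(0, 1) draws. By the CLT these sums are
# approximately Normal with mean 48 * 0.5 = 24 and sd sqrt(48 / 12) = 2.
sums = [sum(random.random() for _ in range(48)) for _ in range(20000)]
print(round(statistics.mean(sums), 1))   # close to 24
print(round(statistics.stdev(sums), 1))  # close to 2
```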
Standard Error
In statistics, the standard error of a measurement, value or quantity
is the estimated standard deviation of the process by which it was
generated, including adjusting for sample size. In other words the
standard error is the standard deviation of the sampling distribution
of the sample statistic (such as sample mean, sample proportion or
sample correlation).
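For the sample mean, the standard error is estimated as s / sqrt(n), where s is the sample standard deviation. A minimal sketch, using made-up data for illustration:

```python
import statistics

# Standard error of the sample mean: s / sqrt(n), where s is the
# sample standard deviation (illustrative made-up data).
sample = [12, 15, 11, 14, 13, 16, 10, 15]
se = statistics.stdev(sample) / len(sample) ** 0.5
print(round(se, 6))  # 0.75
```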
Sample Size
Sample size, usually designated N, is the number of repeated
measurements in a statistical sample. These measurements are used to
estimate a parameter, a descriptive quantity of some population, and N
determines the precision of that estimate: larger N gives smaller
error bounds. A typical statement is that one can be 95% sure the
true parameter is within ± B of the estimate, where B is an error
bound that decreases with increasing N. Such a bounded estimate is
referred to as a confidence interval for that parameter.
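The way the error bound B shrinks with N can be sketched for the common case of a 95% interval for a mean with known standard deviation, where B = 1.96 σ / sqrt(N) (illustrative code; σ = 10 is an arbitrary choice):

```python
# Error bound for a 95% confidence interval for a mean with known
# sigma: B = 1.96 * sigma / sqrt(N). Quadrupling N halves B.
def error_bound(sigma, n, z=1.96):
    return z * sigma / n ** 0.5

sigma = 10.0
for n in (25, 100, 400):
    print(n, error_bound(sigma, n))  # B shrinks as N grows
```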
Estimation
Estimation is the calculated approximation of a result which is
usable even if input data may be incomplete, uncertain, or noisy.
In statistics, this is the subject of estimation theory, and a rule for
producing estimates is called an estimator.
In mathematics, approximation or estimation typically means
finding upper or lower bounds of a quantity that cannot readily be
computed precisely. While initial results may be uncertain or even
unusable, iteratively feeding outputs back as inputs can refine the
results until they are approximately accurate, complete, and noise-free.
Point Estimation
In statistics, point estimation involves the use of sample data to
calculate a single value (known as a statistic) which is to serve as a
"best guess" for an unknown (fixed or random) population parameter.
More formally, it is the application of a point estimator to the data.
Point estimation should be contrasted with Bayesian methods of
estimation, where the goal is usually to compute (perhaps to an
approximation) the posterior distributions of parameters and other
quantities of interest. The contrast here is between estimating a single
point (point estimation), versus estimating a weighted set of points (a
probability density function).
Interval Estimation
In statistics, interval estimation is the use of sample data to calculate
an interval of possible (or probable) values of an unknown population
parameter. The most prevalent forms of interval estimation are
confidence intervals (a frequentist method) and credible intervals (a
Bayesian method).
Introduction
Types of regression
Coefficient of Determination
The coefficient of determination is a statistical measure of
goodness of fit. It measures how well the estimated regression
equation fits the data and is designated r2 (read as r-squared).
The higher the r-squared, the more confidence one can have in the
equation. Statistically, the coefficient of determination represents
the proportion of the total variation in the y variable that is
explained by the regression equation. It has a range of values
between 0 and 1. It is computed as

    r2 = explained variation / total variation = 1 - SSE / SST

where SSE is the sum of squared errors of the regression and SST is
the total sum of squares of y about its mean.
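The "proportion of variation explained" interpretation can be sketched for a simple least-squares line, computing r2 as 1 - SSE/SST (illustrative code with made-up data, not from the report):

```python
# r-squared = 1 - SSE/SST for a simple least-squares line
# (illustrative made-up data).
def r_squared(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Slope and intercept of the least-squares line y = a + b*x.
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1 - sse / sst

print(r_squared([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0 -- a perfect linear fit
```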