
Chapter 2: Statistics of Data Science
Chapter Index

S.No.  Reference  Particulars
1      -          Learning Objectives
2      Topic 1    Measures of Central Tendency
3      Topic 2    Probability Theory
4      Topic 3    Statistical Inference
5      Topic 4    Sampling Theory
6      Topic 5    Hypothesis Testing
7      Topic 6    Regression Analysis
8      -          Let’s Sum Up
Learning Objectives

• Describe the concept of probability theory
• Explain the meaning and types of statistical inference
• Discuss the importance of sampling theory
• Elucidate the meaning and importance of hypothesis testing
• Describe the concept and types of regression analysis
1. Measures of Central Tendency

There are three main measures of central tendency:

• Arithmetic Mean: The mean of a variable represents its average
value. For observations x_i with frequencies f_i, it can be
calculated using the formula:

$$\bar{X} = \frac{\sum_{i} f_i x_i}{\sum_{i} f_i}$$

where \bar{X} represents the sample mean, x_i the ith observation
and f_i the frequency of the ith observation of the variable. The
mean is a computed value; it may or may not occur in the dataset.
• Median: The median is called the positional average of a variable.
When the observations of a variable are arranged in ascending or
descending order, the middle value of the series is the median. The
median divides the observations into two equal halves: half of the
observations are lower than the median and the other half are
higher. Quartiles, deciles and percentiles are extensions of the
median.
• Mode: The mode of a variable is the observation with the highest
frequency or the highest concentration of frequencies.
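
The three measures above can be sketched with Python's standard `statistics` module. The observations and frequencies below are hypothetical values chosen for illustration; they do not come from the slides.

```python
from statistics import mean, median, mode

# Hypothetical observations of a variable (illustrative only).
observations = [2, 3, 3, 5, 7, 10]

print(mean(observations))    # arithmetic mean
print(median(observations))  # middle value of the sorted series
print(mode(observations))    # most frequent observation

# Frequency-weighted mean: each distinct value x_i occurs f_i times.
values = [2, 3, 5, 7, 10]
frequencies = [1, 2, 1, 1, 1]
weighted_mean = sum(f * x for f, x in zip(frequencies, values)) / sum(frequencies)
print(weighted_mean)
```

In this example the median (4.0) is not one of the observations, which shows that a measure of central tendency need not occur in the dataset.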
2. Probability Theory

• Probability theory is the branch of mathematics concerned with
chance. It expresses its concepts as axioms, formalized in terms of
a probability space. The set of all possible outcomes is called the
sample space, and any subset of the sample space is called an event.
The probability space assigns each event a probability, which is a
value between 0 and 1.
• Probability theory makes use of discrete and continuous random
variables and probability distributions. These distributions provide
mathematical abstractions of non-deterministic or uncertain
processes and measured quantities, which may occur as a single
occurrence or over time.
Continuous Probability Distributions

• A continuous random variable is a random variable with an infinite
and uncountable range. The probability distribution of a continuous
random variable is called a continuous probability distribution.
• A continuous distribution describes the probabilities of the
possible values of a continuous random variable.
• A continuous probability distribution can be described by an
equation called the Probability Density Function (PDF). The area
under the PDF curve over an interval gives the probability that the
variable takes a value in that interval.
• The probability that a continuous random variable takes any single
exact value is zero.
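
As a minimal sketch of the "area under the PDF" idea, the snippet below uses `statistics.NormalDist` from the Python standard library. The choice of a standard normal variable and the helper name `interval_probability` are assumptions made for illustration.

```python
from statistics import NormalDist

# Assumed example: a standard normal variable (mean 0, standard deviation 1).
dist = NormalDist(mu=0.0, sigma=1.0)

def interval_probability(d, low, high):
    """P(low <= X <= high): the area under the PDF between low and high."""
    return d.cdf(high) - d.cdf(low)

p_one_sigma = interval_probability(dist, -1.0, 1.0)    # roughly 0.68
p_single_point = interval_probability(dist, 0.5, 0.5)  # interval of zero width
print(p_one_sigma, p_single_point)
```

An interval of zero width has zero area, which is why the probability of any single exact value is zero.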
Discrete Probability Distributions

• A discrete random variable takes its values from a countable set
of outcomes of a random event. Usually, a discrete random variable
is denoted as X and its probability distribution as P(X).
• Some of the most common discrete probability distributions used in
statistics include the binomial, geometric, hypergeometric,
multinomial, negative binomial and Poisson distributions.
• Discrete probability distributions can be described using
frequency distribution tables, graphs or charts.

Classical Probability Distributions

There are four types of Classical Probability Distributions:

• Bernoulli Distribution: A Bernoulli distribution has only one
trial and only two possible outcomes, namely 1 (success) and 0
(failure).
• Uniform Distribution: In a uniform distribution, there may be any
number of outcomes and every outcome is equally likely.
• Binomial Distribution: In a binomial distribution, each of a fixed
number of trials has only two possible outcomes, and the results of
the trials are independent of each other.
• Normal Distribution: A normal distribution results in a
bell-shaped symmetrical curve. This distribution occurs naturally in
many situations.
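
The Bernoulli and binomial distributions described above can be sketched in a few lines. The function names and the coin-toss example (10 trials, success probability 0.5) are hypothetical and chosen only for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def bernoulli_pmf(k, p):
    """A Bernoulli trial is a binomial with a single trial (n = 1)."""
    return binomial_pmf(k, 1, p)

# Hypothetical example: 10 tosses of a fair coin.
n, p = 10, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
print(pmf[5])    # probability of exactly 5 heads
print(sum(pmf))  # probabilities over all outcomes sum to 1
```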


3. Statistical Inference

• Statistical inference is the process of drawing conclusions about
a population on the basis of statistics calculated from a sample of
data drawn from that population. In other words, it is the use of
probability theory to make inferences about a population from sample
data.
• Suppose we want to estimate the average life expectancy of males
living in Tamil Nadu, India, or the percentage of the public that is
satisfied with the work done by the current government. We cannot
obtain data from every person in the population, so we obtain data
from a part of the population called a sample. The sample data is
then analyzed to draw inferences about the population.
In inferential statistics, the experimenter tries to achieve three
goals:
• Parameter estimation: Parameters are the unknown constants of a
probability distribution that determine its properties.
• Data prediction: Once the parameters of a particular distribution
have been estimated, they can be used to predict future data.
• Model comparison: After predictions have been made, the
experimenter selects, from two or more models, the one that best
explains the observed data.
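
Parameter estimation can be illustrated with a small simulation. This is a sketch under stated assumptions: the population is simulated (normally distributed values with mean 70 and standard deviation 10), the sample size of 200 is arbitrary, and the 95% interval uses the normal critical value 1.96.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated values (assumed, not real data).
population = [random.gauss(70, 10) for _ in range(10_000)]

# In practice only a sample of the population is observed.
sample = random.sample(population, 200)

estimate = mean(sample)  # point estimate of the population mean
standard_error = stdev(sample) / sqrt(len(sample))

# Approximate 95% confidence interval for the population mean.
ci = (estimate - 1.96 * standard_error, estimate + 1.96 * standard_error)
print(estimate, ci)
```

The sample mean estimates the unknown population mean, and the interval expresses the uncertainty that comes from observing only a sample.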
4. Sampling Theory

The practice of collecting samples and analyzing them to derive
useful information is called sampling theory. Some important concepts
related to sampling theory are as follows:
• Data: Data refers to the entire set of observations that have been
collected.
• Population: The entire group of subjects or objects that are to be
studied and analysed is called the population.
• Sample: A sample is a portion or sub-collection of elements that
are examined in order to estimate the characteristics of a
population.
• Parameter: A parameter is a characteristic of a population. It is
usually unknown and is estimated from the corresponding
characteristic of a sample, which is called a statistic.
• Statistics: It is a branch of mathematics that deals with planning
and conducting experiments, obtaining data, and organising,
summarising, presenting, analysing, interpreting and drawing
conclusions based on data.
Sampling Frame
• A sampling frame is the complete list of all the items (everyone
and everything) from which a sample can be drawn. At first, a
sampling frame may appear to be the same as the population. However,
the population is general, whereas the sampling frame is specific.
• For example, we may define the population as all the Indian
Americans living in Texas, USA, whereas the sampling frame is the
actual list of Indian Americans living in Texas, USA from which the
sample is drawn. The two can differ because not every member of the
population necessarily appears on the list.
• In statistical research, the experimenter requires a list of items
in order to draw a sample from it. The sampling frame must be
adequate for the needs of the experimenter.
Sampling Methods
• In statistics, there are various sampling methods. They are
divided into two categories, namely probability sampling and
non-probability sampling.
• In probability sampling, each unit has a known probability of
being selected. We can determine the probability that each sample
will be selected, and we can also determine which sampling units
belong to which sample.
• In non-probability sampling, the probability of a unit being
selected is not known.
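
Simple random sampling is one example of probability sampling, used here as an illustration; the slides do not name a specific method. The sampling frame below is a hypothetical list of 1,000 unit identifiers.

```python
import random

random.seed(42)  # fixed seed for repeatability

# Hypothetical sampling frame: a list of 1,000 unit identifiers.
sampling_frame = list(range(1, 1001))
sample_size = 50

# Simple random sampling without replacement.
sample = random.sample(sampling_frame, sample_size)

# Every unit has the same known chance of being selected.
inclusion_probability = sample_size / len(sampling_frame)
print(sorted(sample)[:5], inclusion_probability)
```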
Sampling Errors
• The errors involved in sampling are shown in the following figure:
[Figure: errors involved in sampling]
5. Hypothesis Testing

• A hypothesis is a statement or a proposed explanation about one or
more populations. A hypothesis statement is usually associated with
the population parameters. A hypothesis can be tested using a
research method.
• In hypothesis testing, there are two types of hypotheses, namely
the null hypothesis and the alternate hypothesis. The null
hypothesis (H0) is the hypothesis to be tested. The alternate
hypothesis (HA) is the hypothesis that must be accepted if the
sample data leads to the rejection of H0.
• Hypothesis testing, also called significance testing, is a method
used to test a hypothesis about population parameters using the data
collected from a sample. Alternatively, we can say that hypothesis
testing is a method of evaluating samples to learn about the
characteristics of a given population.
Four Steps to Hypothesis Testing

• The process of hypothesis testing consists of four steps as follows:

Step 1: Identify the hypothesis to be tested.

Step 2: Set the criterion upon which the hypothesis will be tested.

Step 3: Select a random sample from the population and measure the
sample mean (compute the test statistic).

Step 4: Make a decision. Compare the observed value of the sample
statistic to what we would expect to observe if the claim being
tested were true.
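
The four steps can be sketched with a two-sided one-sample z-test. The choice of a z-test is an assumption made for illustration (the slides do not prescribe a specific test), as are the hypothesized mean of 50, the known population standard deviation of 10, the 5% significance level and the sample values.

```python
from math import sqrt
from statistics import NormalDist, mean

# Step 1: hypotheses. H0: population mean = 50; HA: population mean != 50.
mu_0 = 50.0
sigma = 10.0  # population standard deviation, assumed known

# Step 2: criterion. Significance level of 5%.
alpha = 0.05

# Step 3: sample and test statistic (hypothetical sample values).
sample = [52, 61, 48, 57, 66, 54, 59, 63, 50, 58, 62, 55, 60, 49, 64, 56]
z = (mean(sample) - mu_0) / (sigma / sqrt(len(sample)))
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value

# Step 4: decision. Reject H0 when the p-value is below alpha.
reject_h0 = p_value < alpha
print(z, p_value, reject_h0)
```

Here the sample mean lies 2.85 standard errors above the hypothesized mean, so H0 is rejected at the 5% level.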
Analysis of Variance (ANOVA)

• The independent-sample t-test can be applied to situations where
there are only two independent samples. In other words, we can use
independent-sample t-tests for comparing the means of two
populations (such as males and females). When we have more than two
independent samples, the t-test is inappropriate. The Analysis of
Variance (ANOVA) has an advantage over the t-test when the
researcher wants to compare the means of a larger number of
populations (i.e., three or more).
• ANOVA is a parametric test that is used to study the differences
among more than two groups in a dataset. It helps in explaining the
amount of variation in the dataset.
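
A minimal sketch of a one-way ANOVA computation is shown below, assuming three hypothetical groups of three measurements each. It splits the variation into between-group and within-group parts and forms the F statistic; looking up the corresponding p-value is omitted.

```python
from statistics import mean

# Hypothetical measurements for three independent groups.
groups = [
    [4, 5, 6],
    [6, 7, 8],
    [8, 9, 10],
]

all_values = [x for g in groups for x in g]
grand_mean = mean(all_values)
k = len(groups)      # number of groups
n = len(all_values)  # total number of observations

# Variation between group means and variation within groups.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

# F statistic: ratio of the between-group to the within-group mean square.
f_statistic = (ss_between / (k - 1)) / (ss_within / (n - k))
print(ss_between, ss_within, f_statistic)
```

A large F value indicates that the group means differ by more than the variation within the groups would explain.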
6. Regression Analysis

• Regression analysis is a statistical method that is used to model
a relationship between two or more variables of interest.
• Regression analysis is usually used to model a relationship
between a response variable (dependent variable) and one or more
predictor (independent) variables.
• There are various types of regression. However, the basic function
of these regression models is to examine the influence of one or
more independent variables on a dependent variable.
• Regression analysis helps in identifying which variables have an
impact on a variable of interest.
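
As an illustration, simple linear regression with one predictor can be fitted by ordinary least squares using only the standard library. The data points and the helper name `predict` are hypothetical.

```python
from statistics import mean

# Hypothetical data: one predictor x and one response y.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

x_bar, y_bar = mean(x), mean(y)

# Least-squares estimates for the line y = intercept + slope * x.
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar

def predict(new_x):
    """Predicted response for a new predictor value."""
    return intercept + slope * new_x

print(slope, intercept, predict(6))
```

The slope measures the influence of the independent variable on the dependent variable: here each unit increase in x is associated with an increase of about 1.99 in y.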
Types of Regression Techniques

Seven important types of regression techniques include:
• Linear Regression
• Logistic Regression
• Polynomial Regression
• Ordinal Regression
• Ridge Regression
• Principal Components Regression (PCR)
• Partial Least Squares (PLS) Regression
Let’s Sum Up

• Probability theory is a branch of mathematics that is concerned
with chance or probability.
• Probability theory involves the use of discrete and continuous
random variables and probability distributions.
• Statistical inference describes the use of probability theory to
make inferences about a population from sample data.
• The practice of collecting samples and analyzing them to derive
useful information is called sampling theory.
• Hypothesis testing is a method of evaluating samples to learn
about the characteristics of a given population.
• Regression analysis is a statistical method which is used to model
a relationship between two or more variables of interest.
THANK YOU
