
Sevilla, William H.

CpE-2101

Introduction to Probability

Probability is the science of how likely events are to happen. At its simplest, it's concerned
with the roll of a die, or the fall of the cards in a game. But probability is also vital to science and
life more generally.

Probability is used, for example, in areas as diverse as weather forecasting and working out the
cost of your insurance premiums.

A basic understanding of probability is an essential skill in life, even if you are not a professional
gambler or weather forecaster.

Source: https://www.skillsyouneed.com/num/probability.html

What is Probability?

The probability that an event will occur is a number between 0 and 1. In other words, it is
a fraction. It is also sometimes written as a percentage, because a percentage is simply a fraction
with a denominator of 100. For more about these concepts, see our pages
on Fractions and Percentages.

An event that is certain to occur has a probability of 1, or 100%, and one that will definitely not
occur has a probability of zero; such an event is said to be impossible.

The equation for probability is:

P = Number of Desired Outcomes / Total Number of Possible Outcomes

Source: https://www.skillsyouneed.com/num/probability.html
Sample Space

In the study of probability, an experiment is a process or investigation from which results are
observed or recorded.

An outcome is a possible result of an experiment.

A sample space is the set of all possible outcomes in the experiment. It is usually denoted by the
letter S. Sample space can be written using the set notation, { }.
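
To tie the probability formula and the set notation together, here is a minimal Python sketch; the six-sided die and the "roll an even number" event are illustrative assumptions, not taken from the sources:

```python
from fractions import Fraction

# Sample space S for one roll of a six-sided die, written as a set.
S = {1, 2, 3, 4, 5, 6}

# An event is a subset of S, e.g. "roll an even number".
event = {2, 4, 6}

# P = Number of Desired Outcomes / Total Number of Possible Outcomes
p = Fraction(len(event), len(S))
print(p)  # 1/2
```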

Source: https://www.onlinemathlearning.com/samples-in-probability.html

Event

Life is full of random events! You need to get a "feel" for them to be a smart and successful
person. The toss of a coin, the throw of a die and lottery draws are all examples of random events.

When we say "Event" we mean one (or more) outcomes.

Example Events:

 Getting a Tail when tossing a coin is an event.

 Rolling a "5" is an event.

Events can be one of the following:

 Independent (each event is not affected by other events),

Events can be "Independent", meaning each event is not affected by any other events. This is
an important idea! A coin does not "know" that it came up heads before ... each toss of a coin is a
perfect isolated thing.

 Dependent (also called "Conditional", where an event is affected by other events)

But some events can be "dependent" ... which means they can be affected by previous events.
For example, when one person draws 2 cards from a deck: after taking one card from the deck
there are fewer cards available, so the probabilities change.

 Mutually Exclusive events

Mutually Exclusive means we can't get both events at the same time. It is either one or the
other, but not both. For example, turning left and turning right are Mutually Exclusive (you can't
do both at the same time), Heads and Tails are Mutually Exclusive, and Kings and Aces are
Mutually Exclusive.
 Inclusive events

Inclusive events are events that can happen at the same time. To find the probability of an
inclusive event we first add the probabilities of the individual events and then subtract the
probability of the two events happening at the same time.

Source: https://www.mathsisfun.com/data/probability-events-types.html

Probability of events

To solve for the probability of each type of event, we use the following rules:

 Independent Events

P(X and Y) = P(X) ⋅ P(Y)

Example

If one has three dice, what is the probability of getting three 4s? The probability of
getting a 4 on one die is 1/6. The probability of getting three 4s is:

P(4 and 4 and 4) = 1/6 ⋅ 1/6 ⋅ 1/6 = 1/216

 Dependent Events

P(X and Y) = P(X) ⋅ P(Y after X)

What is the probability of choosing two red cards from a deck of cards?

A deck of cards has 26 black and 26 red cards. The probability of choosing a red
card randomly is:

P(red) = 26/52 = 1/2

The probability of choosing a second red card from the deck is now:

P(red) = 25/51

The probability of choosing two red cards is therefore:

P(2 red) = 1/2 ⋅ 25/51 = 25/102
 Mutually Exclusive events

P(X or Y) = P(X) + P(Y)

An example of two mutually exclusive events is a wheel of fortune. Let's say you win a bar of
chocolate if you end up in a red or a pink field. There are 8 possible outcomes: two slots are red,
one is pink, and the other 5 are different colors. What is the probability that the wheel stops at
red or pink?

P(red or pink) = P(red) + P(pink)

P(red) = 2/8 = 1/4

P(pink) = 1/8

P(red or pink) = 2/8 + 1/8 = 3/8

 Inclusive Events

P(X or Y) = P(X) + P(Y) − P(X and Y)

What is the probability of drawing a black card or a ten from a deck of cards? There are 4
tens in a deck of cards: P(10) = 4/52. There are 26 black cards: P(black) = 26/52. There are 2
black tens: P(black and 10) = 2/52.

P(black or ten) = 4/52 + 26/52 − 2/52 = 30/52 − 2/52 = 28/52 = 7/13
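
The four worked examples above can be verified with exact fractions; here is a minimal Python sketch using the numbers from those examples:

```python
from fractions import Fraction as F

# Independent: three dice all showing 4
print(F(1, 6) ** 3)                       # 1/216

# Dependent: two red cards drawn without replacement
print(F(26, 52) * F(25, 51))              # 25/102

# Mutually exclusive: wheel stops on red or pink
print(F(2, 8) + F(1, 8))                  # 3/8

# Inclusive: black card or a ten
print(F(26, 52) + F(4, 52) - F(2, 52))    # 7/13
```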

Source: https://www.mathplanet.com/education/pre-algebra/probability-and-statistic/probability-
of-events

Conditional Probability

The conditional probability of an event B is the probability that the event will occur given
the knowledge that an event A has already occurred. This probability is written P(B|A), notation
for the probability of B given A. In the case where events A and B are independent (where
event A has no effect on the probability of event B), the conditional probability of event B given
event A is simply the probability of event B, that is P(B).

If events A and B are not independent, then the probability of the intersection of A and B (the
probability that both events occur) is defined by
P(A and B) = P(A)P(B|A).
From this definition, the conditional probability P(B|A) is easily obtained by dividing by P(A):

P(B|A) = P(A and B) / P(A)

Note: This expression is only valid when P(A) is greater than 0.
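
A minimal Python sketch of this definition, reusing the two-red-cards numbers from the events section above:

```python
from fractions import Fraction as F

p_a = F(26, 52)                     # P(A): first card drawn is red
p_a_and_b = F(26, 52) * F(25, 51)   # P(A and B): both cards are red

# P(B|A) = P(A and B) / P(A), valid only when P(A) > 0
p_b_given_a = p_a_and_b / p_a
print(p_b_given_a)                  # 25/51
```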

Source: http://www.stat.yale.edu/Courses/1997-
98/101/condprob.htm#targetText=The%20conditional%20probability%20of%20an,event%20A%20has%
20already%20occurred.

Random Variable

Random variable, usually written X, is a variable whose possible values are numerical
outcomes of a random phenomenon. There are two types of random
variables, discrete and continuous.

 Discrete Random Variables

A discrete random variable is one which may take on only a countable number of distinct values
such as 0, 1, 2, 3, 4, ... Discrete random variables are usually (but not necessarily) counts. If a random
variable can take only a finite number of distinct values, then it must be discrete. Examples of
discrete random variables include the number of children in a family, the Friday night attendance
at a cinema, the number of patients in a doctor's surgery, the number of defective light bulbs in a
box of ten.

The probability distribution of a discrete random variable is a list of probabilities associated with
each of its possible values. It is also sometimes called the probability function or the probability
mass function. Suppose a random variable X may take k different values, with the probability
that X = xi defined to be P(X = xi) = pi. The probabilities pi must satisfy the following:

1: 0 < pi < 1 for each i

2: p1 + p2 + ... + pk = 1.

 Continuous Random Variables

A continuous random variable is one which takes an infinite number of possible values.
Continuous random variables are usually measurements. Examples include height, weight, the
amount of sugar in an orange, the time required to run a mile.

A continuous random variable is not defined at specific values. Instead, it is defined over
an interval of values, and is represented by the area under a curve (in advanced mathematics, this
is known as an integral). The probability of observing any single value is equal to 0, since the
number of values which may be assumed by the random variable is infinite.
Suppose a random variable X may take all values over an interval of real numbers. Then the
probability that X is in the set of outcomes A, P(A), is defined to be the area above A and under a
curve. The curve, which represents a function p(x), must satisfy the following:

1: The curve has no negative values (p(x) > 0 for all x)

2: The total area under the curve is equal to 1.

A curve meeting these requirements is known as a density curve.

All random variables (discrete and continuous) have a cumulative distribution function. It is a
function giving the probability that the random variable X is less than or equal to x, for every
value x. For a discrete random variable, the cumulative distribution function is found by summing
up the probabilities.
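
A minimal Python sketch of the two PMF conditions and of a cumulative distribution function built by summing probabilities; the PMF values are invented for illustration:

```python
# A discrete PMF as a dict mapping values x_i to probabilities p_i.
pmf = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}

# Condition 1: 0 < p_i < 1 for each i
assert all(0 < p < 1 for p in pmf.values())
# Condition 2: p_1 + p_2 + ... + p_k = 1
assert abs(sum(pmf.values()) - 1.0) < 1e-12

# CDF: P(X <= x), found by summing up the probabilities.
def cdf(x):
    return sum(p for xi, p in pmf.items() if xi <= x)

print(cdf(1))  # 0.4
```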

Source: http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm

Joint Probability Distribution

Joint probability is the probability of two events happening together. The two events are
usually designated event A and event B. In probability terminology, it can be written as:

p(A and B) or p(A ∩ B)

Joint probability is also called the intersection of two (or more) events.

A joint probability distribution shows a probability distribution for two (or more) random
variables. Instead of events being labeled A and B, the norm is to use X and Y. The formal
definition is:

f(x, y) = P(X = x, Y = y)

The whole point of the joint distribution is to look for a relationship between two variables.

 Joint Probability Mass Function

If your variables are discrete their distribution can be described by a joint probability mass
function (Joint PMF). Basically, if you have found all probabilities for all possible combinations of
X and Y, then you have created a joint PMF.

 Joint Probability Density Function

If you have continuous variables, they can be described with a probability density function (PDF).
Unlike the discrete variable example above, you can’t write out every combination of every
variable because you would have infinite possibilities to write out (which is, of course,
impossible). What you can do is create a formula; the formula that describes all possible
combinations of X and Y is called a joint PDF.
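
A minimal Python sketch of a joint PMF f(x, y) = P(X = x, Y = y) for two fair coin flips, where the 0/1 encoding (0 = tails, 1 = heads) is an assumption made for illustration:

```python
from itertools import product

# Joint PMF for two independent fair coin flips X and Y.
joint_pmf = {(x, y): 0.25 for x, y in product([0, 1], repeat=2)}

# f(x, y) = P(X = x, Y = y)
print(joint_pmf[(1, 0)])  # 0.25

# All combinations of X and Y together must sum to 1.
assert abs(sum(joint_pmf.values()) - 1.0) < 1e-12
```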

Source: https://www.statisticshowto.datasciencecentral.com/joint-probability-distribution/

Binomial Distribution

A binomial distribution can be thought of as simply the probability of a SUCCESS or FAILURE
outcome in an experiment or survey that is repeated multiple times. The binomial is a
type of distribution that has two possible outcomes (the prefix “bi” means two, or twice). For
example, a coin toss has only two possible outcomes: heads or tails and taking a test could have
two possible outcomes: pass or fail.

The first variable in the binomial formula, n, stands for the number of times the experiment runs.
The second variable, p, represents the probability of one specific outcome. For example, let’s
suppose you wanted to know the probability of getting a 1 on a die roll. If you were to roll a die 20
times, the probability of rolling a one on any throw is 1/6. Roll twenty times and you have a
binomial distribution of (n=20, p=1/6). SUCCESS would be “roll a one” and FAILURE would be
“roll anything else.” If the outcome in question was the probability of the die landing on an even
number, the binomial distribution would then become (n=20, p=1/2). That’s because your
probability of throwing an even number is one half.

Criteria

Binomial distributions must also meet the following three criteria:

 The number of observations or trials is fixed. In other words, you can only figure out
the probability of something happening if you do it a certain number of times. This is
common sense: if you toss a coin once, your probability of getting a tails is 50%. If you
toss a coin 20 times, your probability of getting at least one tails is very, very close to 100%.
 Each observation or trial is independent. In other words, none of your trials have an effect
on the probability of the next trial.
 The probability of success (tails, heads, fail or pass) is exactly the same from one trial to
another.

The binomial distribution formula is:

b(x; n, P) = nCx * P^x * (1 – P)^(n – x)

Where:
b = binomial probability
x = total number of “successes” (pass or fail, heads or tails etc.)
P = probability of a success on an individual trial
n = number of trials

Note: The binomial distribution formula can also be written in a slightly different way,
because nCx = n! / (x! (n − x)!). This version of the formula uses factorials, and "q" in it is just
the probability of failure (subtract your probability of success from 1).
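
A minimal Python sketch of the formula, reusing the die example above (n = 20, p = 1/6); math.comb supplies nCx:

```python
from math import comb

def binomial_pmf(x, n, p):
    """b(x; n, P) = nCx * P**x * (1 - P)**(n - x)"""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Probability of exactly four 1s in 20 die rolls (n=20, p=1/6).
print(binomial_pmf(4, 20, 1/6))  # about 0.20
```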

Source: https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/binomial-
theorem/binomial-distribution-formula/

Hypergeometric Distribution

The hypergeometric distribution is used to calculate probabilities when sampling without
replacement. For example, suppose you first randomly sample one card from a deck of 52. Then,
without putting the card back in the deck you sample a second and then (again without replacing
cards) a third. Given this sampling procedure, what is the probability that exactly two of the
sampled cards will be aces (4 of the 52 cards in the deck are aces)? You can calculate this
probability using the following formula based on the hypergeometric distribution:

p = kCx · (N−k)C(n−x) / NCn

Where

k is the number of "successes" in the population
x is the number of "successes" in the sample
N is the size of the population
n is the number sampled
p is the probability of obtaining exactly x successes
kCx is the number of combinations of k things taken x at a time
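
A minimal Python sketch of the card example under the formula above; math.comb supplies the combination terms:

```python
from math import comb

def hypergeom_pmf(x, k, N, n):
    """p = kCx * (N-k)C(n-x) / NCn"""
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)

# Exactly two aces (k=4 successes in a population of N=52)
# in a sample of n=3 cards drawn without replacement.
print(hypergeom_pmf(2, 4, 52, 3))  # about 0.013
```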

Source: http://onlinestatbook.com/2/probability/hypergeometric.html

Poisson Distribution

The Poisson distribution can be used to calculate the probabilities of various numbers of
"successes" based on the mean number of successes. In order to apply the Poisson distribution,
the various events must be independent. Keep in mind that the term "success" does not really
mean success in the traditional positive sense. It just means that the outcome in question occurs.
Suppose you knew that the mean number of calls to a fire station on a weekday is 8. What is the
probability that on a given weekday there would be 11 calls? This problem can be solved using the
following formula based on the Poisson distribution:

P(x) = (e^(−μ) · μ^x) / x!

Where

e is the base of natural logarithms (2.7183)
μ is the mean number of "successes"
x is the number of "successes" in question

For this example,

P(11) = (e^(−8) · 8^11) / 11! ≈ 0.072

since the mean is 8 and the question pertains to 11 calls.

The mean of the Poisson distribution is μ. The variance is also equal to μ. Thus, for this example,
both the mean and the variance are equal to 8.
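
A minimal Python sketch of the fire-station example under the formula above:

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """P(x) = e**(-mu) * mu**x / x!"""
    return exp(-mu) * mu**x / factorial(x)

# Probability of 11 calls on a weekday when the mean is 8.
print(poisson_pmf(11, 8))  # about 0.072
```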

Source: http://onlinestatbook.com/2/probability/poisson.html

Hypothesis, Statistical Hypothesis and Hypothesis Testing

Hypothesis testing was introduced by Ronald Fisher, Jerzy Neyman, Karl Pearson and
Pearson's son, Egon Pearson. Hypothesis testing is a statistical method that is used in making
statistical decisions using experimental data. A hypothesis is basically an assumption that
we make about a population parameter.

Hypothesis can be classified as follows:

 Null hypothesis
 Alternative hypothesis

In statistical analysis, we have to make decisions about the hypothesis. These decisions include
deciding if we should accept the null hypothesis or if we should reject the null hypothesis. Every
test in hypothesis testing produces the significance value for that particular test. In Hypothesis
testing, if the significance value of the test is greater than the predetermined significance level,
then we accept the null hypothesis. If the significance value is less than the predetermined value,
then we should reject the null hypothesis. For example, if we want to see the degree of
relationship between two stock prices and the significance value of the correlation coefficient is
greater than the predetermined significance level, then we accept the null hypothesis and
conclude that there is no relationship between the two stock prices; any relationship the sample
appears to show is attributed to chance.

Source: https://www.statisticssolutions.com/hypothesis-testing/

Two Types of Hypothesis and Level of Significance

As said before, hypotheses can be classified as:

 Null hypothesis: Null hypothesis is a statistical hypothesis that assumes that the
observation is due to a chance factor. The null hypothesis is denoted by H0: μ1 = μ2, which
shows that there is no difference between the two population means.
 Alternative hypothesis: Contrary to the null hypothesis, the alternative hypothesis shows
that observations are the result of a real effect.

The significance level, also denoted as alpha or α, is a measure of the strength of the evidence that
must be present in your sample before you will reject the null hypothesis and conclude that
the effect is statistically significant. The researcher determines the significance level before
conducting the experiment.

The significance level is the probability of rejecting the null hypothesis when it is true. For
example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when
there is no actual difference. Lower significance levels indicate that you require stronger evidence
before you will reject the null hypothesis. Use significance levels during hypothesis testing to help
you determine which hypothesis the data support. Compare your p-value to your significance
level. If the p-value is less than your significance level, you can reject the null hypothesis and
conclude that the effect is statistically significant. In other words, the evidence in your sample is
strong enough to be able to reject the null hypothesis at the population level.

Rejection region:
The rejection region is the set of values of the test statistic for which the null hypothesis is rejected.
Non-rejection region:
The set of all possible values of the test statistic for which the null hypothesis is not rejected is
called the non-rejection region.
Source: https://statisticsbyjim.com/glossary/significance-level/
Hypothesis Testing Approach

There are basically three approaches to hypothesis testing. The researcher should note that all
three approaches require different subject criteria and objective statistics, but all three
approaches give the same conclusion.

 The first approach is the test statistic approach. The common first step in all three
approaches of hypothesis testing is to state the null and alternative hypothesis. The
second step of the test statistic approach is to determine the test size and to obtain the
critical value. The third step is to compute the test statistic. The fourth step is to reject
or accept the null hypothesis depending upon the comparison between the tabulated
value and the calculated value. If the tabulated value in hypothesis testing is more than
the calculated value, then the null hypothesis is accepted. Otherwise it is rejected. The
last step of this approach of hypothesis testing is to make a substantive interpretation.
 The second approach of hypothesis testing is the probability value approach. The second
step of this approach is to determine the test size. The third step is to compute the test
statistic and the probability value. The fourth step of this approach is to reject the null
hypothesis if the probability value is less than the test size (the significance level). The
last step of this approach of hypothesis testing is to make a substantive interpretation.
 The third approach is the confidence interval approach. The second step is to determine
the test size or the (1 − test size) and the hypothesized value. The third step is to construct
the confidence interval. The fourth step is to reject the null hypothesis if the
hypothesized value does not fall within the confidence interval. The last step of this
approach of hypothesis testing is to make the substantive interpretation.

The first approach of hypothesis testing is thus a classical test statistic approach, which
computes a test statistic from the empirical data and then makes a comparison with the critical
value. If the test statistic in this classical approach is larger than the critical value, then the null
hypothesis is rejected. Otherwise, it is accepted.

Source: https://www.statisticssolutions.com/hypothesis-
testing2/#targetText=The%20first%20approach%20of%20hypothesis,the%20null%20hypothesis%20
is%20rejected.

Test Statistics and Steps to Hypothesis Testing

Hypothesis testing is a scientific process of testing whether or not the hypothesis is plausible.
The following steps are involved in hypothesis testing:

 The first step is to state the null and alternative hypothesis clearly. The null and
alternative hypothesis in hypothesis testing can be a one tailed or two tailed test.
 The second step is to determine the test size. This means that the researcher decides
whether a test should be one tailed or two tailed to get the right critical value and the
rejection region.
 The third step is to compute the test statistic and the probability value. This step of the
hypothesis testing also involves the construction of the confidence interval depending
upon the testing approach.
 The fourth step involves the decision making step. This step of hypothesis testing helps
the researcher reject or accept the null hypothesis by making comparisons between the
subjective criterion from the second step and the objective test statistic or the probability
value from the third step.
 The fifth step is to draw a conclusion about the data and interpret the results obtained
from the data.

Source: https://www.statisticssolutions.com/hypothesis-
testing2/#targetText=The%20first%20approach%20of%20hypothesis,the%20null%20hypothesis%20is%2
0rejected.

Making Decision: Type I and Type II Error

The statistical practice of hypothesis testing is widespread not only in statistics but also
throughout the natural and social sciences. When we conduct a hypothesis test, there are a couple
of things that could go wrong. There are two kinds of errors, which by design cannot be avoided,
and we must be aware that these errors exist. The errors are given the quite pedestrian names of
type I and type II errors. What are type I and type II errors, and how do we distinguish between
them? Briefly:

 Type I errors happen when we reject a true null hypothesis.

 Type II errors happen when we fail to reject a false null hypothesis.

When you perform a hypothesis test, there are four possible outcomes depending on the
actual truth (or falseness) of the null hypothesis H0 and the decision to reject or not. The four
possible outcomes in the table are: The decision is not to reject H0 when H0 is true (correct
decision). The decision is to reject H0 when H0 is true (incorrect decision known as a Type I
error). The decision is not to reject H0 when, in fact, H0 is false (incorrect decision known as
a Type II error). The decision is to reject H0 when H0 is false (correct decision whose probability
is called the Power of the Test).

Each of the errors occurs with a particular probability. The Greek letters α and β represent these
probabilities.

α = probability of a Type I error = P(Type I error) = probability of rejecting the null hypothesis
when the null hypothesis is true.
β = probability of a Type II error = P(Type II error) = probability of not rejecting the null
hypothesis when the null hypothesis is false.

α and β should be as small as possible because they are probabilities of errors. They are rarely
zero. The Power of the Test is 1 − β. Ideally, we want a high power that is as close to one as
possible. Increasing the sample size can increase the Power of the Test.

Source: https://courses.lumenlearning.com/introstats1/chapter/outcomes-and-the-type-i-and-type-ii-
errors/#targetText=α%20%3D%20probability%20of%20a%20Type,the%20null%20hypothesis%20is%20f
alse.

T-Test for Independent Sample

The independent t-test, also called the two sample t-test, independent-samples t-test or
student's t-test, is an inferential statistical test that determines whether there is a statistically
significant difference between the means in two unrelated groups.

Null and alternative hypotheses for the independent t-test

The null hypothesis for the independent t-test is that the population means from the two
unrelated groups are equal:

H0: μ1 = μ2

In most cases, we are looking to see if we can show that we can reject the null hypothesis and
accept the alternative hypothesis, which is that the population means are not equal:

HA: μ1 ≠ μ2

To do this, we need to set a significance level (also called alpha) that allows us to either reject or
accept the alternative hypothesis. Most commonly, this value is set at 0.05.

What do you need to run an independent t-test?

In order to run an independent t-test, you need the following:

 One independent, categorical variable that has two levels/groups.

 One continuous dependent variable.
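
Assuming SciPy is available, an independent t-test can be sketched as follows; the two groups of measurements are invented for illustration:

```python
from scipy import stats

# Continuous dependent variable measured in two unrelated groups.
group_a = [5.1, 4.9, 6.2, 5.8, 5.5, 6.0]
group_b = [4.2, 4.8, 4.5, 4.0, 4.7, 4.4]

# H0: the two population means are equal.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
if p_value < 0.05:  # significance level alpha = 0.05
    print("Reject H0: the means differ significantly.")
else:
    print("Fail to reject H0.")
```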

Source: https://statistics.laerd.com/statistical-guides/independent-t-test-statistical-guide.php
T-Test Correlated Sample

The Paired Samples t Test compares two means that are from the same individual, object,
or related units. The two means can represent things like:

 A measurement taken at two different times (e.g., pre-test and post-test with an
intervention administered between the two time points)
 A measurement taken under two different conditions (e.g., completing a test under a
"control" condition and an "experimental" condition)
 Measurements taken from two halves or sides of a subject or experimental unit (e.g.,
measuring hearing loss in a subject's left and right ears).

The purpose of the test is to determine whether there is statistical evidence that the mean
difference between paired observations on a particular outcome is significantly different from
zero. The Paired Samples t Test is a parametric test.

This test is also known as:

 Dependent t Test
 Paired t Test
 Repeated Measures t Test

The variable used in this test is known as the dependent variable, or test variable: a continuous
variable measured at two different times or for two related conditions or units.

Your data must meet the following requirements:

1. Dependent variable that is continuous (i.e., interval or ratio level)
o Note: The paired measurements must be recorded in two separate variables.
2. Related samples/groups (i.e., dependent observations)
o The subjects in each sample, or group, are the same. This means that the subjects
in the first group are also in the second group.
3. Random sample of data from the population
4. Normal distribution (approximately) of the difference between the paired values
5. No outliers in the difference between the two related groups
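
Assuming SciPy is available, a paired samples t-test on invented pre-test/post-test scores might look like this:

```python
from scipy import stats

# Paired measurements recorded in two separate variables.
pre_test  = [68, 72, 75, 60, 80, 70, 65]
post_test = [74, 76, 79, 66, 83, 75, 70]

# H0: the mean difference between paired observations is zero.
t_stat, p_value = stats.ttest_rel(pre_test, post_test)
print(t_stat, p_value)
```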

Source: https://libguides.library.kent.edu/SPSS/PairedSamplestTest

Z-Test

A Z-test is a type of hypothesis test. Hypothesis testing is just a way for you to figure out if
results from a test are valid or repeatable. For example, if someone said they had found a new
drug that cures cancer, you would want to be sure it was probably true. A hypothesis test will tell
you if it's probably true, or probably not true. A Z-test is used when your data is
approximately normally distributed.

 When you can run a Z Test.

Several different types of tests are used in statistics (i.e. f test, chi square test, t test). You would
use a Z test if:

 Your sample size is greater than 30. Otherwise, use a t test.

 Data points should be independent from each other. In other words, one data point isn't
related or doesn't affect another data point.
 Your data should be normally distributed. However, for large sample sizes (over 30) this
doesn’t always matter.
 Your data should be randomly selected from a population, where each item has an equal
chance of being selected.
 Sample sizes should be equal if at all possible.

Running a Z test on your data requires five steps:

1. State the null hypothesis and alternate hypothesis.
2. Choose an alpha level.
3. Find the critical value of z in a z table.
4. Calculate the z test statistic (see below).
5. Compare the test statistic to the critical z value and decide if you should support or reject
the null hypothesis.

Source: https://www.statisticshowto.datasciencecentral.com/z-test/

Z-Test for One-Sample Group and One Group Proportion

A one-sample z-test is used to test whether a population parameter is significantly different from
some hypothesized value.

Here is how to use the test.

 Define hypotheses.
 Specify significance level. Often, researchers choose significance levels equal to 0.01, 0.05,
or 0.10; but any value between 0 and 1 can be used.
 Compute test statistic. The test statistic is a z-score (z) defined by the following equation:

z = (x − M) / (σ / √n)

where x is the observed sample mean, M is the hypothesized population mean (from the
null hypothesis), σ is the standard deviation of the population, and n is the sample size.

 Compute P-value. The P-value is the probability of observing a sample statistic as extreme
as the test statistic. Since the test statistic is a z-score, use the Normal Distribution
Calculator to assess the probability associated with the z-score.

 Evaluate null hypothesis. The evaluation involves comparing the P-value to
the significance level, and rejecting the null hypothesis when the P-value is less than the
significance level.

The one-sample z-test can be used when the population is normally distributed, and the
population variance is known.
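
A minimal Python sketch of these steps using only the standard library; the sample numbers are assumptions for illustration, and the normal CDF is built from math.erf:

```python
from math import sqrt, erf

def normal_cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + erf(z / sqrt(2)))

x_bar, M, sigma, n = 52.0, 50.0, 6.0, 36   # assumed sample values
z = (x_bar - M) / (sigma / sqrt(n))        # z = (x - M) / (sigma / sqrt(n))

# Two-sided P-value: probability of a statistic at least this extreme.
p_value = 2 * (1 - normal_cdf(abs(z)))
print(z, p_value)  # z = 2.0, p ~= 0.0455
```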

Source: https://stattrek.com/statistics/dictionary.aspx?definition=one-sample%20z-test

Z-Test for Two-Sample Means

This tests for a difference in proportions. A two proportion z-test allows you to compare
two proportions to see if they are the same.

 The null hypothesis (H0) for the test is that the proportions are the same.
 The alternate hypothesis (H1) is that the proportions are not the same.

Steps:

Sample question: let’s say you’re testing two flu drugs A and B. Drug A works on 41 people out of a
sample of 195. Drug B works on 351 people in a sample of 605. Are the two drugs comparable? Use
a 5% alpha level.

 Step 1: Find the two proportions:


P1 = 41/195 = 0.21 (that's 21%)

P2 = 351/605 = 0.58 (that's 58%).

Set these numbers aside for a moment.
 Step 2: Find the overall sample proportion. The numerator will be the total number of
“positive” results for the two samples and the denominator is the total number of people
in the two samples.
p = (41 + 351) / (195 + 605) = 0.49.
Set this number aside for a moment.
 Step 3: Insert the numbers from Step 1 and Step 2 into the test statistic formula:

z = (P1 − P2) / √( p(1 − p) · (1/n1 + 1/n2) )

Solving the formula (and taking the absolute value), we get:

Z = 8.99

We need to find out if the z-score falls into the “rejection region.”

Step 4: Find the z-score associated with α/2, using a table of known critical values.

The z-score associated with a 5% alpha level / 2 is 1.96.

Step 5: Compare the calculated z-score from Step 3 with the table z-score from Step 4. If the
calculated z-score is larger, you can reject the null hypothesis.

8.99 > 1.96, so we can reject the null hypothesis.
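
The arithmetic in Steps 1 through 5 can be checked with a short script; a minimal sketch of the same formula:

```python
from math import sqrt

x1, n1 = 41, 195    # drug A: successes, sample size
x2, n2 = 351, 605   # drug B: successes, sample size

p1, p2 = x1 / n1, x2 / n2
p = (x1 + x2) / (n1 + n2)           # overall sample proportion

z = (p1 - p2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
print(abs(z))  # about 8.99, which exceeds the 1.96 critical value
```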

Source: https://www.statisticshowto.datasciencecentral.com/z-test/

F-Test

F-Test is any test that uses F-distribution. F value is a value on the F distribution. Various
statistical tests generate an F value. The value can be used to determine whether the test is
statistically significant. In order to compare two variances, one has to calculate the ratio of the
two variances, which is as under:

F Value = Larger Sample Variance / Smaller Sample Variance = σ1² / σ2²

While carrying out F-Test, we need to frame the null and alternate hypothesis. Then, we need to
determine the level of significance under which the test has to be carried out. Subsequently, we
have to find out the degrees of freedom of both the numerator and denominator. This will help
determine the F table value. The F Value seen in the table is then compared to the calculated F
Value to determine whether or not to reject the null hypothesis.

Below are the steps where the F-Test formula is used to test the null hypothesis that the variances
of two populations are equal:

Step 1: Firstly, frame the null and alternate hypothesis. The null hypothesis assumes that the
variances are equal: H0: σ1² = σ2². The alternate hypothesis states that the variances are
unequal: H1: σ1² ≠ σ2². Here σ1² and σ2² are the symbols for the variances.

Step 2: Calculate the test statistic (F distribution).

Step 3: Calculate the degrees of freedom. Degree of freedom (df1) = n1 – 1 and Degree of freedom
(df2) = n2 – 1 where n1 and n2 are the sample sizes.

Step 4: Look at the F value in the F table. For 2 tailed tests, divide the alpha by 2 for finding the
right critical value. Thus, the F value is found looking at the degrees of freedom in the numerator
and the denominator in the F table.

Step 5: Compare the F statistic obtained in Step 2 with the critical value obtained in Step 4. If the
F statistic is greater than the critical value at the required level of significance, we reject the null
hypothesis. If the F statistic obtained in Step 2 is less than the critical value at the required level of
significance, we cannot reject the null hypothesis.
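
A minimal Python sketch of Steps 2 and 3; the two samples are invented, and statistics.variance computes the sample variances:

```python
from statistics import variance

sample_1 = [21, 23, 25, 22, 28, 30, 24]
sample_2 = [20, 21, 22, 21, 23, 22, 21]

v1, v2 = variance(sample_1), variance(sample_2)

# F Value = Larger Sample Variance / Smaller Sample Variance
f_value = max(v1, v2) / min(v1, v2)
df1, df2 = len(sample_1) - 1, len(sample_2) - 1
print(f_value, df1, df2)  # compare against the F table value
```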

Source: https://www.wallstreetmojo.com/f-test-formula/

One-Way Analysis of Variance

The one-way analysis of variance (ANOVA) is used to determine whether there are any
statistically significant differences between the means of two or more independent (unrelated)
groups (although you tend to only see it used when there are a minimum of three, rather than
two groups). For example, you could use a one-way ANOVA to understand whether exam
performance differed based on test anxiety levels amongst students, dividing students into three
independent groups (e.g., low, medium and high-stressed students). Also, it is important to
realize that the one-way ANOVA is an omnibus test statistic and cannot tell you which specific
groups were statistically significantly different from each other; it only tells you that at least two
groups were different. Since you may have three, four, five or more groups in your study design,
determining which of these groups differ from each other is important. You can do this using a
post hoc test.

Note: If your study design not only involves one dependent variable and one independent
variable, but also a third variable (known as a "covariate") that you want to "statistically control",
you may need to perform an ANCOVA (analysis of covariance), which can be thought of as an
extension of the one-way ANOVA. Alternatively, if your dependent variable is the time until an
event happens, you might need to run a Kaplan-Meier analysis.

Assumptions

When you choose to analyse your data using a one-way ANOVA, part of the process
involves checking to make sure that the data you want to analyse can actually be analysed using a
one-way ANOVA. You need to do this because it is only appropriate to use a one-way ANOVA if
your data "passes" six assumptions that are required for a one-way ANOVA to give you a valid
result. In practice, checking for these six assumptions just adds a little bit more time to your
analysis, requiring you to click a few more buttons in SPSS Statistics when performing your
analysis, as well as think a little bit more about your data, but it is not a difficult task.

 Assumption #1: Your dependent variable should be measured at the interval or ratio
level (i.e., they are continuous).
 Assumption #2: Your independent variable should consist of two or more
categorical, independent groups.
 Assumption #3: You should have independence of observations, which means that there is
no relationship between the observations in each group or between the groups
themselves.
 Assumption #4: There should be no significant outliers. Outliers are simply single data
points within your data that do not follow the usual pattern.
 Assumption #5: Your dependent variable should be approximately normally distributed
for each category of the independent variable.
 Assumption #6: There needs to be homogeneity of variances. You can test this assumption
in SPSS Statistics using Levene's test for homogeneity of variances.
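
Assuming SciPy is available, the three-group example can be sketched like this; the exam scores are invented for illustration:

```python
from scipy import stats

# Exam scores for three independent groups (low/medium/high anxiety).
low    = [85, 88, 90, 84, 87]
medium = [80, 78, 83, 81, 79]
high   = [70, 72, 68, 75, 71]

# H0: all group means are equal (omnibus test).
f_stat, p_value = stats.f_oneway(low, medium, high)
print(f_stat, p_value)
# A significant result only says at least two groups differ;
# a post hoc test is needed to say which ones.
```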

Source: https://statistics.laerd.com/spss-tutorials/one-way-anova-using-spss-
statistics.php#targetText=The%20one%2Dway%20analysis%20of,%2C%20rather%20than%20two%20gr
oups).

Pearson Product Moment Coefficient of Correlation r

Correlation is a technique for investigating the relationship between two quantitative,
continuous variables, for example, age and blood pressure. Pearson's correlation coefficient (r) is
a measure of the strength of the association between the two variables.

The first step in studying the relationship between two continuous variables is to draw a scatter
plot of the variables to check for linearity. The correlation coefficient should not be calculated if
the relationship is not linear. For correlation only purposes, it does not really matter on which
axis the variables are plotted. However, conventionally, the independent (or explanatory) variable
is plotted on the x-axis (horizontally) and the dependent (or response) variable is plotted on the
y-axis (vertically).

The nearer the scatter of points is to a straight line, the higher the strength of association between
the variables. Also, it does not matter what measurement units are used.

r = −1: data lie on a perfect straight line with a negative slope

r = 0: no linear relationship between the variables

r = +1: data lie on a perfect straight line with a positive slope
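
On Python 3.10 or newer, the standard library can compute r directly; a minimal sketch with invented age and blood pressure data:

```python
from statistics import correlation

age            = [25, 30, 35, 40, 45, 50, 55]
blood_pressure = [118, 121, 125, 127, 130, 134, 138]

# Pearson's correlation coefficient r, between -1 and +1.
r = correlation(age, blood_pressure)
print(r)  # close to +1: a strong positive linear association
```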

Source: http://learntech.uwe.ac.uk/da/Default.aspx?pageid=1442

Linear Regression

Linear regression attempts to model the relationship between two variables by fitting a
linear equation to observed data. One variable is considered to be an explanatory variable, and
the other is considered to be a dependent variable. For example, a modeler might want to relate
the weights of individuals to their heights using a linear regression model.

Before attempting to fit a linear model to observed data, a modeler should first determine
whether or not there is a relationship between the variables of interest. This does not necessarily
imply that one variable causes the other (for example, higher SAT scores do not cause higher
college grades), but that there is some significant association between the two variables.
A scatterplot can be a helpful tool in determining the strength of the relationship between two
variables. If there appears to be no association between the proposed explanatory and dependent
variables (i.e., the scatterplot does not indicate any increasing or decreasing trends), then fitting a
linear regression model to the data probably will not provide a useful model. A valuable
numerical measure of association between two variables is the correlation coefficient, which is a
value between -1 and 1 indicating the strength of the association of the observed data for the two
variables.

A linear regression line has an equation of the form Y = a + bX, where X is the explanatory
variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the
value of y when x = 0).

 Least-Squares Regression

The most common method for fitting a regression line is the method of least-squares. This
method calculates the best-fitting line for the observed data by minimizing the sum of the squares
of the vertical deviations from each data point to the line (if a point lies on the fitted line exactly,
then its vertical deviation is 0). Because the deviations are first squared, then summed, there are
no cancellations between positive and negative values.

 Outliers and Influential Observations

After a regression line has been computed for a group of data, a point which lies far from the
line (and thus has a large residual value) is known as an outlier. Such points may represent
erroneous data, or may indicate a poorly fitting regression line. If a point lies far from the other
data in the horizontal direction, it is known as an influential observation.

 Residuals

Once a regression model has been fit to a group of data, examination of the residuals (the
deviations from the fitted line to the observed values) allows the modeler to investigate the
validity of his or her assumption that a linear relationship exists. Plotting the residuals on the y-
axis against the explanatory variable on the x-axis reveals any possible non-linear relationship
among the variables, or might alert the modeler to investigate lurking variables. A residual plot
can also amplify the presence of outliers.

 Lurking Variables

If non-linear trends are visible in the relationship between an explanatory and dependent
variable, there may be other influential variables to consider. A lurking variable exists when the
relationship between two variables is significantly affected by the presence of a third variable
which has not been included in the modeling effort. Since such a variable might be a factor of
time (for example, the effect of political or economic cycles), a time series plot of the data is often
a useful tool in identifying the presence of lurking variables.
 Extrapolation

Whenever a linear regression model is fit to a group of data, the range of the data should be
carefully observed. Attempting to use a regression equation to predict values outside of this range
is often inappropriate, and may yield incredible answers. This practice is known as extrapolation.
Consider, for example, a linear model which relates weight gain to age for young children.
Applying such a model to adults, or even teenagers, would be absurd, since the relationship
between age and weight gain is not consistent for all age groups.
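
On Python 3.10 or newer, a least-squares line Y = a + bX can be fitted with the standard library; a minimal sketch with invented height and weight data:

```python
from statistics import linear_regression

heights = [150, 155, 160, 165, 170, 175, 180]   # X, explanatory
weights = [52, 56, 59, 64, 68, 72, 77]          # Y, dependent

# Least-squares fit: returns slope b and intercept a for Y = a + bX.
b, a = linear_regression(heights, weights)
print(f"Y = {a:.2f} + {b:.3f}X")

# Extrapolating far outside the 150-180 cm range would be inappropriate.
```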

Source: http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm

Chi-Square Test of Goodness-of-Fit

Chi-Square goodness of fit test is a non-parametric test that is used to find out how the
observed value of a given phenomenon is significantly different from the expected value. In Chi-
Square goodness of fit test, the term goodness of fit is used to compare the observed sample
distribution with the expected probability distribution. Chi-Square goodness of fit test
determines how well a theoretical distribution (such as normal, binomial, or Poisson) fits the
empirical distribution. In Chi-Square goodness of fit test, sample data is divided into intervals.
Then the numbers of points that fall into each interval are compared with the expected numbers
of points in each interval.

Procedure for Chi-Square Goodness of Fit Test:

 Set up the hypothesis for Chi-Square goodness of fit test:

A. Null hypothesis: In Chi-Square goodness of fit test, the null hypothesis assumes that there is no
significant difference between the observed and the expected value.

B. Alternative hypothesis: In Chi-Square goodness of fit test, the alternative hypothesis assumes
that there is a significant difference between the observed and the expected value.

 Compute the value of Chi-Square goodness of fit test using the following formula:

χ² = Σ (O − E)² / E

where χ² = Chi-Square goodness of fit statistic, O = observed value, and E = expected value.

 Degree of freedom: In Chi-Square goodness of fit test, the degree of freedom depends on
the distribution of the sample.
 Hypothesis testing: Hypothesis testing in Chi-Square goodness of fit test is the same as in
other tests, like t-test, ANOVA, etc. The calculated value of Chi-Square goodness of fit
test is compared with the table value. If the calculated value of Chi-Square goodness of fit
test is greater than the table value, we will reject the null hypothesis and conclude that
there is a significant difference between the observed and the expected frequency. If the
calculated value of Chi-Square goodness of fit test is less than the table value, we will
accept the null hypothesis and conclude that there is no significant difference between the
observed and expected value.
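
A minimal Python sketch of this procedure for a die assumed fair under the null hypothesis; the observed counts are invented:

```python
# Observed counts from 60 die rolls vs. the expected counts under H0.
observed = [8, 9, 12, 11, 6, 14]
expected = [10, 10, 10, 10, 10, 10]   # fair die: 60 / 6 per face

# Chi-square statistic: sum of (O - E)^2 / E over all intervals.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1  # degrees of freedom depend on the distribution
print(chi_square, df)   # compare against the chi-square table value
```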

Source: https://www.statisticssolutions.com/chi-square-goodness-of-fit-
test/#targetText=Chi%2DSquare%20goodness%20of%20fit%20test%20is%20a%20non%2Dparametri
c,different%20from%20the%20expected%20value.&targetText=In%20Chi%2DSquare%20goodness%
20of%20fit%20test%2C%20sample,data%20is%20divided%20into%20intervals.

Chi-Square Test of Independence

The Chi-Square test of independence is used to determine if there is a significant relationship
between two nominal (categorical) variables. The frequency of each category for one nominal
variable is compared across the categories of the second nominal variable. The data can be
displayed in a contingency table where each row represents a category for one variable and each
column represents a category for the other variable. For example, say a researcher wants to
examine the relationship between gender (male vs. female) and empathy (high vs. low). The chi-
square test of independence can be used to examine this relationship. The null hypothesis for this
test is that there is no relationship between gender and empathy. The alternative hypothesis is
that there is a relationship between gender and empathy (e.g. there are more high-empathy
females than high-empathy males).

How to calculate the chi-square statistic by hand: first we have to calculate the expected value of
the two nominal variables. We can calculate the expected value of the two nominal variables by
using this formula:

E_ik = (sum of the ith column × sum of the kth row) / N

Where

E_ik = expected value

N = total number of observations

After calculating the expected value, we will apply the following formula to calculate the value of
the Chi-Square test of Independence:

χ² = Σ (O − E)² / E

Where

χ² = Chi-Square test of Independence statistic

O = observed value of the two nominal variables

E = expected value of the two nominal variables

Degree of freedom is calculated by using the following formula:


DF = (r-1)(c-1)
Where
DF = Degree of freedom
r = number of rows
c = number of columns

Hypothesis:

Null hypothesis: Assumes that there is no association between the two variables.

Alternative hypothesis: Assumes that there is an association between the two variables.
Hypothesis testing: Hypothesis testing for the chi-square test of independence proceeds as it
does for other tests like ANOVA, where a test statistic is computed and compared to a critical
value. The critical value for the chi-square statistic is determined by the level of significance
(typically .05) and the degrees of freedom. The degrees of freedom for the chi-square are
calculated using the formula df = (r − 1)(c − 1), where r is the number of rows and c is the number
of columns. If the observed chi-square test statistic is greater than the critical value, the null
hypothesis can be rejected.
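
A minimal Python sketch of the expected-value and chi-square calculations for an invented 2×2 gender-by-empathy table:

```python
# Contingency table: rows = gender, columns = empathy (high, low).
table = [[30, 20],   # male
         [40, 10]]   # female

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
N = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(table):
    for k, observed in enumerate(row):
        # E_ik = (row sum * column sum) / N
        expected = row_totals[i] * col_totals[k] / N
        chi_square += (observed - expected) ** 2 / expected

df = (len(table) - 1) * (len(table[0]) - 1)  # DF = (r-1)(c-1)
print(chi_square, df)  # compare against the chi-square critical value
```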

Source: https://www.statisticssolutions.com/non-parametric-analysis-chi-
square/#targetText=The%20Chi%2DSquare%20test%20of,of%20the%20second%20nominal%20variable.
