
Ferriol, Charlotte Ann M.

18-57405
BSEE-2102
Engineering Data Analysis
Probability
Probability is a measure quantifying the likelihood that events will occur.
Sample Space
The set of all possible outcomes of a statistical experiment is called the sample space and
is represented by the symbol S. Each outcome in a sample space is called an element or a
member of the sample space, or simply a sample point. If the sample space has a finite number of
elements, we may list the members separated by commas and enclosed in braces. Thus, the
sample space S of possible outcomes when a coin is flipped may be written S = {H, T}, where H
and T correspond to heads and tails, respectively.
Event
An event is a subset of a sample space.
Probability of an event
The probability of an event A is the sum of the weights of all sample points in A.
Therefore, 0 ≤ P(A) ≤ 1, P(∅) = 0, and P(S) = 1. Furthermore, if A1, A2, A3, ... is a sequence of
mutually exclusive events, then P(A1 ∪ A2 ∪ A3 ∪ ···) = P(A1) + P(A2) + P(A3) + ···.
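To make these properties concrete, here is a short Python sketch using a fair six-sided die as the sample space (an assumed example; the coin flip above would work the same way), with each sample point carrying weight 1/6:

from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # sample space
A = {2, 4, 6}            # event: an even number is rolled
B = {1, 3}               # event: a 1 or a 3 is rolled (mutually exclusive with A)

def P(event):
    # probability of an event = sum of the weights of its sample points
    return sum(Fraction(1, 6) for _ in event)

print(P(A))                      # 1/2
print(P(set()))                  # 0, matching P(∅) = 0
print(P(S))                      # 1, matching P(S) = 1
print(P(A | B) == P(A) + P(B))   # True: additivity for mutually exclusive events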
Conditional probability
The probability of an event B occurring when it is known that some event A has occurred
is called a conditional probability and is denoted by P(B|A). The symbol P(B|A) is usually read
“the probability that B occurs given that A occurs” or simply “the probability of B, given A.”
The conditional probability of B, given A, denoted by P(B|A), is defined by P(B|A) =
P(A ∩ B)/P(A), provided P(A) > 0.
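The defining ratio is easy to compute directly. A minimal Python sketch, again using an assumed fair-die sample space with equally likely outcomes:

from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # fair die, equally likely outcomes
A = {2, 4, 6}            # an even number is rolled
B = {4, 5, 6}            # a number greater than 3 is rolled

def P(event):
    return Fraction(len(event), len(S))

# P(B|A) = P(A ∩ B)/P(A), defined only when P(A) > 0
print(P(A & B) / P(A))   # (1/3)/(1/2) = 2/3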
Random variable
A random variable is a function that associates a real number with each element in the
sample space.
Joint probability distribution
If X and Y are two discrete random variables, the probability distribution for their
simultaneous occurrence can be represented by a function with values f(x,y) for any pair of
values (x, y) within the range of the random variables X and Y. It is customary to refer to this
function as the joint probability distribution of X and Y. Hence, in the discrete case,
f(x, y) = P(X = x, Y = y); that is, the values f(x, y) give the probability that outcomes x and y
occur at the same time.
The function f(x,y) is a joint probability distribution or probability mass function of the
discrete random variables X and Y.
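A short Python sketch of a joint pmf stored as a dictionary keyed by (x, y); the particular probabilities are invented for illustration (any non-negative values summing to 1 would do):

f = {
    (0, 0): 0.10, (0, 1): 0.20,
    (1, 0): 0.30, (1, 1): 0.40,
}

# f(x, y) = P(X = x, Y = y)
print(f[(1, 0)])                        # 0.3

# the joint pmf sums to 1 over its whole range
print(round(sum(f.values()), 10))       # 1.0

# marginal distribution of X: sum f(x, y) over all y
g = {}
for (x, y), p in f.items():
    g[x] = round(g.get(x, 0) + p, 10)
print(g)                                # {0: 0.3, 1: 0.7}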
Binomial distribution
The number X of successes in n Bernoulli trials is called a binomial random variable.
The probability distribution of this discrete random variable is called the binomial distribution,
and its values will be denoted by b(x;n,p) since they depend on the number of trials and the
probability of a success on a given trial. In general, b(x; n, p) = C(n, x) p^x (1 − p)^(n−x) for
x = 0, 1, ..., n, where C(n, x) is the binomial coefficient “n choose x”. For example, if n = 3 items
are inspected and each is defective with probability p = 1/4, then for X, the number of defectives,
P(X = 2) = f(2) = b(2; 3, 1/4) = 9/64.
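The worked value 9/64 can be verified with scipy's binomial pmf (a quick check, assuming scipy is installed):

from scipy.stats import binom

# b(2; 3, 1/4): 2 successes in 3 trials with success probability 1/4
print(binom.pmf(2, n=3, p=0.25))   # 0.140625
print(9 / 64)                      # 0.140625, the value quoted above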
Hypergeometric distribution
Hypergeometric distribution does not require independence and is based on sampling
done without replacement.
The probability distribution of the hypergeometric random variable X, the number of
successes in a random sample of size n selected from N items of which k are labeled success and
N − k labeled failure, is
h(x; N, n, k) = C(k, x) C(N − k, n − x)/C(N, n),    max{0, n − (N − k)} ≤ x ≤ min{n, k}.
The range of x can be determined by the three binomial coefficients in the definition, where x and
n − x are no more than k and N − k, respectively, and both of them cannot be less than 0. Usually,
when both k (the number of successes) and N − k (the number of failures) are larger than the
sample size n, the range of a hypergeometric random variable is x = 0, 1, ..., n.
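A minimal Python sketch, assuming scipy is installed. One caution: scipy's parameter names differ from the notation above; scipy.stats.hypergeom.pmf(x, M, n, N) takes M as the population size (the text's N), n as the number of successes in the population (the text's k), and N as the sample size (the text's n). The numbers below are invented:

from math import comb
from scipy.stats import hypergeom

# Text's notation: N = 50 items, k = 5 successes among them, sample size n = 10
N, k, n = 50, 5, 10
x = 2   # number of successes in the sample

print(hypergeom.pmf(x, N, k, n))                      # scipy's (M, n, N) = (50, 5, 10)
print(comb(k, x) * comb(N - k, n - x) / comb(N, n))   # same value, from the definition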
Poisson distribution
The probability distribution of the Poisson random variable X, representing the number
of outcomes occurring in a given time interval or specified region denoted by t, is
p(x; λt) = e^(−λt)(λt)^x/x!,    x = 0, 1, 2, ...,
where λ is the average number of outcomes per unit time, distance, area, or volume.
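A small Python sketch, assuming scipy is installed; the rate λ = 2 per hour and interval t = 3 hours are invented illustration values:

from math import exp, factorial
from scipy.stats import poisson

lam, t = 2, 3    # invented: 2 outcomes per hour, over a 3-hour interval
mu = lam * t     # the Poisson mean λt = 6
x = 4

print(poisson.pmf(x, mu))               # P(X = 4)
print(exp(-mu) * mu**x / factorial(x))  # same value, from the pmf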
Hypothesis
A hypothesis is a proposed explanation for a phenomenon. For a hypothesis to be a
scientific hypothesis, the scientific method requires that it be testable.
Statistical Hypothesis
A statistical hypothesis is an assertion or conjecture concerning one or more populations.
Hypothesis Testing
Hypothesis testing is a way to test the results of a survey or experiment to see whether
you have meaningful results. It also tests whether your results are valid, by figuring out the odds
that they could have occurred by chance.
Two types of hypothesis and level of significance
The structure of hypothesis testing will be formulated with the use of the term null
hypothesis, which refers to any hypothesis we wish to test and is denoted by H0. The rejection of
H0 leads to the acceptance of an alternative hypothesis, denoted by H1. An understanding of the
different roles played by the null hypothesis (H0) and the alternative hypothesis (H1) is crucial to
one’s understanding of the rudiments of hypothesis testing. The alternative hypothesis H1
usually represents the question to be answered or the theory to be tested, and thus its
specification is crucial. The null hypothesis H0 nullifies or opposes H1 and is often the logical
complement to H1. As the reader gains more understanding of hypothesis testing, he or she
should note that the analyst arrives at one of the two following conclusions:
 reject H0 in favor of H1 because of sufficient evidence in the data or
 fail to reject H0 because of insufficient evidence in the data.
The probability of committing a type I error, also called the level of significance, is denoted
by the Greek letter α.
Hypothesis testing approach
Testing a hypothesis is an important part of the scientific method that allows you to
evaluate the validity of an educated guess. Common tests include the following:
 Z-test
 F-Test
 Normality
 Chi-square test for independence
 Analysis of variance (ANOVA)
 Mood’s median test
 Welch’s t-test
 Kruskal-Wallis H test
 Box-Cox power transformation
Test statistics and steps to hypothesis testing
Hypothesis testing can be one of the most confusing aspects for students, mostly because
before you can even perform a test, you have to know what your null hypothesis is. Often, those
tricky word problems that you are faced with can be difficult to decipher. But it’s easier than you
think; all you need to do is:
Step 1: State the Null hypothesis.
Step 2: State the Alternate Hypothesis.
Step 3: Draw a picture to help you visualize the problem.
Step 4: State the alpha level. If you aren’t given an alpha level, use 5% (0.05).
Step 5: Find the rejection region area (given by your alpha level above) from the z-table.

Step 6: Find the test statistic using this formula:
z = (x̄ − μ)/(σ/√n),
where x̄ is the sample mean, μ the hypothesized population mean, σ the population standard
deviation, and n the sample size.
Step 7: If the test statistic from Step 6 falls in the rejection region from Step 5, reject the null
hypothesis.
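Putting the steps together, a short Python sketch, assuming scipy is installed; the hypotheses and sample figures are invented for illustration:

from math import sqrt
from scipy.stats import norm

# Invented example: H0: μ = 100 vs H1: μ > 100, known σ = 10
alpha = 0.05                        # Step 4: significance level
z_crit = norm.ppf(1 - alpha)        # Step 5: rejection region is z > 1.645

x_bar, mu0, sigma, n = 103, 100, 10, 50
z = (x_bar - mu0) / (sigma / sqrt(n))   # Step 6: the z test statistic

# Step 7: compare the statistic with the critical value
print("reject H0" if z > z_crit else "fail to reject H0")   # z ≈ 2.12 > 1.645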
Making decision: Type I and Type II error
Rejection of the null hypothesis when it is true is called a type I error. Nonrejection of
the null hypothesis when it is false is called a type II error.
T-test for independent sample
The independent t-test examines significant differences on one factor or dimension
(dependent variable) between the means of two independent groups (e.g., male vs.
female, with disability vs. without disability) or two experimental groups (control group vs.
treatment group). For example, you might want to know whether there is a significant difference
on the level of social activity between individuals with disabilities and individuals without
disabilities.
When to use the independent t-test: any analysis where (as in the sketch below)
 There is only one dimension or factor (dependent variable)
 There are only two groups of the factor (independent variable)
 One is interested in looking at mean differences across two independent groups
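A minimal Python sketch of such a comparison, assuming scipy is installed; the two groups' scores are invented:

from scipy.stats import ttest_ind

group_a = [12, 15, 14, 10, 13, 16]   # invented scores, group A
group_b = [9, 11, 10, 12, 8, 10]     # invented scores, group B

t_stat, p_value = ttest_ind(group_a, group_b)
print(t_stat, p_value)   # reject H0 of equal means when p_value < alpha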
T-test for correlated sample
As we do in Sampling Distributions, we can consider the distribution of r over repeated
samples of x and y. The following theorem is analogous to the Central Limit Theorem, but
for r instead of x̄. This time we require that x and y have a joint bivariate normal distribution or
that samples are sufficiently large. You can think of a bivariate normal distribution as the three-
dimensional version of the normal distribution, in which any vertical slice through the surface
which graphs the distribution results in an ordinary bell curve.
The sampling distribution of r is only symmetric when ρ = 0 (i.e. when x and y are
independent). If ρ ≠ 0, then the sampling distribution is asymmetric and so the following theorem
does not apply, and other methods of inference must be used.
Theorem 1: Suppose ρ = 0. If x and y have a bivariate normal distribution or if the sample
size n is sufficiently large, then r has a normal distribution with mean 0, and t = r/sr ~ T(n − 2),
where
sr = √((1 − r²)/(n − 2)).
Here the numerator r of the random variable t is the estimate of ρ = 0 and sr is the standard
error of r.
Observation: If we solve the equation in Theorem 1 for r, we get
r = t/√(n − 2 + t²).
Observation: The theorem can be used to test the hypothesis that population random
variables x and y are independent i.e. ρ = 0.
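A short Python sketch of this test, computing t = r/sr by hand and checking it against the p-value that scipy's pearsonr reports (the data pairs are invented; scipy is assumed):

from math import sqrt
from scipy.stats import pearsonr, t as t_dist

x = [1, 2, 3, 4, 5, 6, 7, 8]   # invented paired data
y = [2, 1, 4, 3, 7, 5, 8, 9]
n = len(x)

r, p_scipy = pearsonr(x, y)
sr = sqrt((1 - r**2) / (n - 2))                   # standard error of r under ρ = 0
t_stat = r / sr
p_manual = 2 * t_dist.sf(abs(t_stat), df=n - 2)   # two-sided p-value from T(n − 2)

print(p_manual, p_scipy)   # the two p-values agree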
Z-test
A Z-test is any statistical test for which the distribution of the test statistic under the null
hypothesis can be approximated by a normal distribution. Because of the central limit theorem,
many test statistics are approximately normally distributed for large samples. For each
significance level, the Z-test has a single critical value (for example, 1.96 for 5% two-tailed),
which makes it more convenient than the Student's t-test, which has separate critical values for
each sample size. Therefore, many statistical tests can be conveniently performed as
approximate Z-tests if the sample size is large or the population variance is known. If the
population variance is unknown (and therefore has to be estimated from the sample itself) and
the sample size is not large (n < 30), the Student's t-test may be more appropriate.
Z-test for one sample group and one group proportion
A one-proportion Z-test is a hypothesis test that compares the proportion in one group to
a specified population proportion. The test requires the analyst to state a null hypothesis and an
alternative hypothesis. The two are mutually exclusive: if one is true, the other must be false, and
vice versa. The test statistic is computed from the observed proportion p̂, the sample size n, and
the hypothesized (null) proportion p0:
z = (p̂ − p0)/√(p0(1 − p0)/n).
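A minimal Python sketch of this test, assuming scipy is installed; the counts are invented:

from math import sqrt
from scipy.stats import norm

# Invented example: H0: p = 0.5 vs H1: p ≠ 0.5, with 58 successes in 100 trials
p0, successes, n = 0.5, 58, 100
p_hat = successes / n

z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value = 2 * norm.sf(abs(z))   # two-sided
print(z, p_value)               # z = 1.6, p ≈ 0.11: fail to reject at α = 0.05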
Z-test for one sample means
A one sample z test is one of the most basic types of hypothesis test. In order to run a one
sample z test, you work through several steps:
Step 1: State the Null Hypothesis. This is one of the common stumbling blocks: in order
to make sense of your sample and have the one sample z test give you the right information, you
must make sure you’ve written the null hypothesis and alternate hypothesis correctly. For
example, you might be asked to test the hypothesis that the mean weight gain of pregnant women
was more than 30 pounds. Your null hypothesis would be H0: μ = 30 and your alternate
hypothesis would be H1: μ > 30.
Step 2: Use the z-formula to find a z-score:
z = (x̄ − μ0)/(σ/√n).
All you do is put in the values you are given into the formula. Your question should give
you the sample mean (x̄), the standard deviation (σ), and the number of items in the sample (n).
Your hypothesized mean (in other words, the mean you are testing the hypothesis for, or
your null hypothesis) is μ0.
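Applied to the weight-gain example above, a short Python sketch; the sample figures (x̄ = 32, σ = 6, n = 36) are invented purely to make the arithmetic concrete, and scipy is assumed:

from math import sqrt
from scipy.stats import norm

mu0 = 30                      # H0: μ = 30
x_bar, sigma, n = 32, 6, 36   # hypothetical sample results

z = (x_bar - mu0) / (sigma / sqrt(n))
p_value = norm.sf(z)          # one-sided, since H1: μ > 30
print(z, p_value)             # z = 2.0, p ≈ 0.023: reject H0 at α = 0.05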
F-test
An F-test is any statistical test in which the test statistic has an F-distribution under
the null hypothesis. It is most often used when comparing statistical models that have been fitted
to a data set, in order to identify the model that best fits the population from which the data were
sampled. Exact "F-tests" mainly arise when the models have been fitted to the data using least
squares.
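A minimal Python sketch of one common case, an exact F-test for the equality of two variances (the data are invented, and both samples are assumed to come from normal populations; scipy is assumed):

import numpy as np
from scipy.stats import f as f_dist

a = np.array([21.0, 23.5, 22.1, 24.8, 20.9, 23.2])   # invented samples
b = np.array([19.5, 25.3, 18.2, 26.1, 20.0, 24.7])

F = a.var(ddof=1) / b.var(ddof=1)           # ratio of the two sample variances
df1, df2 = len(a) - 1, len(b) - 1
p_value = 2 * min(f_dist.sf(F, df1, df2),   # two-sided p-value
                  f_dist.cdf(F, df1, df2))
print(F, p_value)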

One-way analysis of variance
In statistics, one-way analysis of variance (abbreviated one-way ANOVA) is a technique
that can be used to compare means of two or more samples (using the F distribution). This
technique can be used only for numerical response data, the "Y", usually one variable, and
numerical or (usually) categorical input data, the "X", always one variable, hence "one-way".
The ANOVA tests the null hypothesis that samples in all groups are drawn from
populations with the same mean values. To do this, two estimates are made of the population
variance. These estimates rely on various assumptions (see below). The ANOVA produces an F-
statistic, the ratio of the variance calculated among the means to the variance within the samples.
If the group means are drawn from populations with the same mean values, the variance between
the group means should be lower than the variance of the samples, following the central limit
theorem. A higher ratio therefore implies that the samples were drawn from populations with
different mean values.
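A short Python sketch with scipy's one-way ANOVA (the three groups' scores are invented; scipy is assumed):

from scipy.stats import f_oneway

g1 = [85, 86, 88, 75, 78, 94]   # invented scores for three groups
g2 = [91, 92, 93, 85, 87, 84]
g3 = [79, 78, 88, 94, 92, 85]

F, p_value = f_oneway(g1, g2, g3)
print(F, p_value)   # reject H0 of equal group means when p_value < alpha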
Pearson product moment coefficient of correlation r
The Pearson product-moment correlation coefficient (PPMCC), or the bivariate correlation, is a
measure of the linear correlation between two variables X and Y.
The Pearson correlation coefficient, r, can take a range of values from +1 to -1. A value
of 0 indicates that there is no association between the two variables. A value greater than 0
indicates a positive association; that is, as the value of one variable increases, so does the value
of the other variable. A value less than 0 indicates a negative association; that is, as the value of
one variable increases, the value of the other variable decreases. This is shown in the diagram
below:
[Diagram omitted: positive, zero, and negative correlations.]
How can we determine the strength of association based on the Pearson correlation coefficient?
The stronger the association of the two variables, the closer the Pearson correlation
coefficient, r, will be to either +1 or -1 depending on whether the relationship is positive or
negative, respectively. Achieving a value of +1 or -1 means that all your data points are included
on the line of best fit – there are no data points that show any variation away from this line.
Values for r between +1 and -1 (for example, r = 0.8 or -0.4) indicate that there is variation
around the line of best fit. The closer the value of r is to 0, the greater the variation around the
line of best fit. Different relationships and their correlation coefficients are shown in the diagram
below:
[Diagram omitted: scatter plots for various values of r between −1 and +1.]
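A small Python sketch of how r reflects scatter around the line of best fit, using three constructed data sets (numpy is assumed):

import numpy as np

x = np.arange(10, dtype=float)
rng = np.random.default_rng(0)

exact = 2 * x + 1                              # every point on the line: r = 1.0
noisy = 2 * x + 1 + rng.normal(0, 5, size=10)  # variation around the line: |r| < 1
unrelated = rng.normal(0, 5, size=10)          # no built-in linear association

for name, y in [("exact", exact), ("noisy", noisy), ("unrelated", unrelated)]:
    print(name, np.corrcoef(x, y)[0, 1])   # r shrinks toward 0 as scatter grows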
Linear regression
Linear regression is a basic and commonly used type of predictive analysis. The overall
idea of regression is to examine two things: (1) does a set of predictor variables do a good job in
predicting an outcome (dependent) variable? (2) Which variables in particular are significant
predictors of the outcome variable, and in what way (indicated by the magnitude and sign of the
beta estimates) do they impact the outcome variable? These regression estimates are used to
explain the relationship between one dependent variable and one or more independent
variables. The simplest form of the regression equation with one dependent and one independent
variable is defined by the formula y = c + b*x, where y = estimated dependent variable score, c =
constant, b = regression coefficient, and x = score on the independent variable.
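A minimal Python sketch of simple linear regression in the form y = c + b*x, assuming scipy is installed (the data points are invented):

from scipy.stats import linregress

x = [1, 2, 3, 4, 5, 6]                # invented data
y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]

fit = linregress(x, y)
print(fit.intercept, fit.slope)       # c and b in y = c + b*x
print(fit.rvalue, fit.pvalue)         # r, and the p-value for H0: b = 0

print(fit.intercept + fit.slope * 7)  # predicted score at x = 7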
Chi-square test for goodness-of-fit
A goodness-of-fit test between observed and expected frequencies is based on the
quantity
χ2 = Σ (oi − ei)^2/ei,
where the sum runs over the k cells and χ2 is a value of a random variable whose sampling
distribution is approximated very closely by the chi-squared distribution with v = k − 1 degrees
of freedom. The symbols oi and ei represent the observed and expected frequencies, respectively,
for the ith cell.
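A short Python sketch of a goodness-of-fit test, checking whether a die is fair from 120 invented rolls (scipy is assumed):

from scipy.stats import chisquare

observed = [20, 22, 17, 18, 19, 24]   # invented counts for faces 1..6
expected = [20] * 6                    # a fair die: 120/6 rolls per face

stat, p_value = chisquare(observed, f_exp=expected)
print(stat, p_value)   # compared against χ2 with v = k − 1 = 5 degrees of freedom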
Chi-square test of independence
The chi-squared test can also be used to test the hypothesis of independence of two
variables of classification. Suppose that we wish to determine whether the opinions of the voting
residents of the state of Illinois concerning a new tax reform are independent of their levels of
income. Members of a random sample of 1000 registered voters from the state of Illinois are
classified as to whether they are in a low, medium, or high income bracket and whether or not
they favor the tax reform. The observed frequencies are presented in Table 10.6, which is known
as a contingency table.
[Table 10.6 omitted: observed frequencies of the 1000 voters, classified by income level (low,
medium, high) and opinion on the tax reform (for, against).]
A contingency table with r rows and c columns is referred to as an r × c table (“r × c” is
read “r by c”). The row and column totals in Table 10.6 are called marginal frequencies. Our
decision to accept or reject the null hypothesis, H0, of independence between a voter’s opinion
concerning the tax reform and his or her level of income is based upon how good a fit we have
between the observed frequencies in each of the 6 cells of Table 10.6 and the frequencies that we
would expect for each cell under the assumption that H0 is true. To find these expected
frequencies, let us define the following events:
L: A person selected is in the low-income level.
M: A person selected is in the medium-income level.
H: A person selected is in the high-income level.
F: A person selected is for the tax reform.
A: A person selected is against the tax reform.
By using the marginal frequencies, we can list the following probability estimates: for example,
P(L) is estimated by 336/1000 and P(F) by 598/1000, and under H0 the joint probability
P(L ∩ F) is estimated by the product P(L)P(F). The remaining estimates follow in the same way
from the marginal totals of Table 10.6.
The expected frequencies are obtained by multiplying each cell probability by the total
number of observations. As before, we round these frequencies to one decimal. Thus, the
expected number of low-income voters in our sample who favor the tax reform is estimated to be
(336/1000)(598/1000)(1000) = (336)(598)/1000 ≈ 200.9
when H0 is true. The general rule for obtaining the expected frequency of any cell is
given by the following formula:
expected frequency = (column total × row total)/grand total.
The expected frequency for each cell is recorded in parentheses beside the actual
observed value in Table 10.7. Note that the expected frequencies in any row or column add up to
the appropriate marginal total. In our example, we need to compute only two expected
frequencies in the top row of Table 10.7 and then find the others by subtraction. The number of
degrees of freedom associated with the chi-squared test used here is equal to the number of cell
frequencies that may be filled in freely when we are given the marginal totals and the grand total,
and in this illustration that number is 2. A simple formula providing the correct number of
degrees of freedom is v = (r − 1)(c − 1).
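A short Python sketch of the test with scipy; the 2 × 3 table below is invented for illustration and is not Table 10.6, whose full cell counts are not reproduced here:

from scipy.stats import chi2_contingency

observed = [[200, 180, 210],   # for the reform, by low/medium/high income (invented)
            [140, 120, 150]]   # against the reform, by low/medium/high income

stat, p_value, dof, expected = chi2_contingency(observed)
print(stat, p_value)
print(dof)        # (r − 1)(c − 1) = (2 − 1)(3 − 1) = 2
print(expected)   # each cell: column total × row total / grand total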
