Statistics and Probability Reviewer

Statistics
- a science of conducting studies to collect, organize, summarize, analyze, present, interpret, and draw conclusions from data
- used to analyze the results of surveys and as a tool in scientific research to make decisions based on controlled experiments. Other uses of statistics include operations research, quality control, estimation, and prediction

Variable - a characteristic or attribute under study that can assume different values
Random Variables - variables whose values are determined by chance
Data - values (observations or measurements) that the variables can assume
Data Set - a collection of observations (data values) on one or more variables
Population - consists of all subjects (human, etc.) that are being studied
Sample - a group of subjects selected from a population

2 Main Areas of Statistics
1. Descriptive statistics - the collection, organization, summarization, and presentation of data. Tables, charts, or graphs are used to organize and present data. Descriptive values such as the average score are used to summarize data.
2. Inferential statistics - generalizing from samples to populations, performing estimations and hypothesis tests, determining relationships among variables, and making predictions.
Hypothesis testing - a decision-making process for evaluating claims about a population, based on information obtained from samples

Classifications of Variables
1. Quantitative - numerical and can be ordered or ranked (age, heights, weights, body temperatures)
a) Discrete - values that can be counted
b) Continuous - assume an infinite number of values between any two specific values; obtained by measuring and often include fractions and decimals
2. Qualitative - variables that can be placed into distinct categories, according to some characteristic or attribute (gender, religion, geographic locations)

Measuring Variables - to establish relationships between variables, observe the variables and measure/record the observations.
Scale of measurement - a set of categories and a process that classifies each individual into one category

4 Types of Measurement Scales
1. Nominal - classifies data into mutually exclusive (non-overlapping), exhaustive categories in which no order or ranking can be imposed on the data (gender, zip code, eye color, nationality, religion)
2. Ordinal - classifies data into categories that can be ranked, but precise differences between the ranks do not exist; judging (1st, 2nd), rating scale (poor, excellent)
3. Interval - ranks data, and precise differences between units of measure exist; there is no meaningful zero (IQ, temperature)
4. Ratio - possesses all the characteristics of interval measurement, and a true zero exists. True ratios exist when the same variable is measured on two different members of the population; height, weight, time, salary, age

Data Collection Methods
1. Surveys
a) Telephone - less costly; respondents may be more candid because the survey is not face-to-face. Disadvantages: some people don't have phones or will not answer; unlisted numbers
b) Mailed Questionnaires - less expensive to conduct; respondents can remain anonymous. Disadvantages: low number of responses; inappropriate answers to questions; some people may have difficulty reading or understanding the questions
c) Personal Interview - obtains in-depth responses. Disadvantages: interviewers must be trained in asking questions and recording responses; the interviewer may be biased
2. Surveying records
3. Direct observation of situations

Reasons for Using Samples
1. Saves time and money
2. Enables the researcher to get information that he or she might not be able to obtain otherwise
3. Enables the researcher to get more detailed information about a particular subject

4 Basic Sampling Techniques
1. Random sampling - subjects are selected by random numbers from calculators, computers, or tables; for a sample of size n, all possible samples of this size have an equal chance of being selected from the population.
Limitation: if the population is extremely large, it is time consuming to number and select the sample elements
Methods for Random Sampling
a) Fish bowl - number each element of the population, place the numbers on cards in a hat or fishbowl, mix them, and select the sample by drawing the cards
b) Random numbers - number the elements of the population sequentially and then select each element by using random numbers
2. Systematic random sampling - selecting every kth subject after the first subject is selected at random from 1 through k. The advantage of systematic sampling is the ease of selecting the sample elements.
3. Stratified random sampling - dividing the population into subgroups, called strata, and randomly selecting subjects within each group; ensures representation of all population subgroups that are important to the study. Disadvantages:
when there are many variables of interest, dividing a large population into representative subgroups requires a great deal of effort.
4. Cluster sampling - subjects are selected by using an intact group (cluster) that is representative of the population.
Advantages: a cluster sample can reduce costs, it can simplify fieldwork, and it is convenient.
Disadvantage: subjects within a cluster may be homogeneous (too similar to one another)

Frequency Distribution and Graphs
Constructing a frequency distribution - most convenient method of organizing data
Frequency distribution - organization of raw data in table form, using classes and frequencies; a way of presenting a summary of the data that shows
a) possible patterns or relationships in the data
b) how many times each data point (observation/outcome) occurs in a data set

Components of frequency distribution table
Class - quantitative/qualitative category into which each raw data value is placed
Tally - data recorded in the sequence in which they are collected, before they are processed/ranked
Frequency - number of data values contained in a specific class
1. Qualitative variable (ordinal/nominal data)
a) Class, tally, frequency, percent
2. Quantitative variable (numerical data)
a) Class limits, class boundaries (numbers used to separate the classes so there are no gaps in the frequency distribution), tally, frequency

Basic Rules: Constructing "Class" in the Frequency Distribution
1. There should be 5-20 classes
2. Class limits should have the same decimal place value as the data
a) Class boundaries should have one additional place value and end in a 5. For whole-number data:
Lower limit - 0.5 = lower boundary
Upper limit + 0.5 = upper boundary
3. Classes must be equal in width - the width is found by subtracting the lower (or upper) class limit of one class from the lower (or upper) class limit of the next class or, if boundaries are given, by subtracting the lower boundary of a class from its upper boundary. When building the table, find the class width by dividing the range by the number of classes
* don't subtract the limits of a single class; this gives an incorrect answer
* the researcher decides how many classes to use and the width of each class
4. Classes must be mutually exclusive - non-overlapping class limits so that data cannot be placed into two classes
5. Classes must be continuous - no gaps in the frequency distribution
6. Classes must be exhaustive - enough classes to accommodate all the data

Sturges' Rule - determines the number of classes to use in a histogram or frequency distribution table
$k = 1 + 3.322 \log_{10} n$
k = number of classes
n = size of the data

Reasons for constructing a frequency distribution
1. To organize the data in a meaningful, intelligible way.
2. To enable the reader to determine the nature or shape of the distribution.
3. To facilitate computational procedures for measures of average and spread
4. To enable the researcher to draw charts and graphs to present data
5. To enable the reader to compare different data sets

Types of Frequency Distribution
1. Categorical Frequency Distribution - used for data that can be placed in specific categories, such as nominal/ordinal level data.
2. Grouped Frequency Distribution - used when the range of the data is large; the data must be grouped into classes that are more than one unit in width.
3. Ungrouped Frequency Distribution - used when the range of the data values is relatively small; a frequency distribution can be constructed using single data values for each class
4. Cumulative Frequency Distribution - gives the total number of values that fall below the upper boundary of each class. Values are found by adding the frequencies of the classes less than or equal to the upper class boundary of a specific class (ascending cumulative frequency)

[Table: Sample of Frequency Distribution Table]

Constructing statistical charts and graphs - most useful method of presenting the data

Uses of graphs in statistics
1. Convey data to viewers in pictorial form
2. Useful in getting the audience's attention in a presentation
3. Describe/analyze a data set
4. Discuss an issue, reinforce a critical point, summarize a data set
5. Discover trends/patterns in a situation

Frequency Distribution Graphs
• X axis - score categories (X values)
• Y axis - frequencies
• Histogram or polygon - used when the score categories are numerical scores from an interval or ratio scale
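
The class-construction rules and Sturges' rule above can be tried out on a small data set. The following is a minimal sketch, not a prescribed method: the scores are made up, the function names are hypothetical, and rounding k up and pushing the width to the next whole number are assumptions chosen so that the last class always contains the maximum.

```python
import math

def sturges_classes(n):
    # k = 1 + 3.322 * log10(n); rounding up to a whole number is an assumption
    return math.ceil(1 + 3.322 * math.log10(n))

def frequency_table(data, k):
    """Rows of (lower limit, upper limit, lower boundary, upper boundary,
    frequency, cumulative frequency) for whole-number data."""
    low, high = min(data), max(data)
    # Width = range / k, pushed up to the next whole number so the
    # k classes always cover the maximum (an assumption of this sketch).
    width = (high - low) // k + 1
    rows = []
    cumulative = 0
    for i in range(k):
        lower = low + i * width          # lower class limit
        upper = lower + width - 1        # upper class limit
        freq = sum(1 for x in data if lower <= x <= upper)
        cumulative += freq               # ascending cumulative frequency
        rows.append((lower, upper, lower - 0.5, upper + 0.5, freq, cumulative))
    return rows

# Made-up whole-number scores, for illustration only
scores = [12, 15, 17, 18, 21, 22, 22, 25, 27, 28,
          30, 31, 33, 35, 36, 38, 40, 41, 44, 49]

k = sturges_classes(len(scores))
table = frequency_table(scores, k)
for row in table:
    print(row)
```

Each class has the same width, the limits do not overlap, the boundaries leave no gaps, and the final cumulative frequency equals the number of data values, which is a quick check that the classes are exhaustive.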
Commonly Used Graphs
1. Histogram - contiguous vertical bars whose heights represent the frequencies
2. Frequency polygon - lines that connect points plotted for the frequencies
3. Ogive or Cumulative Frequency Graph - represents the cumulative frequencies; visually represents how many values are below a certain upper class boundary

Constructing Statistical Graphs
1. Draw and label the x and y axes
2. Choose a suitable scale and label it on the y axis
3. Represent the class boundaries on the x axis
4. Plot the points and draw the bars or lines

Relative Frequency Graphs - used when the proportion of data values is more important than the actual number of data values
To convert a frequency into a proportion or relative frequency, divide the frequency for each class by the total of the frequencies. The sum of the relative frequencies will always be 1

Other Types of Graph
1. Bar graph - vertical or horizontal bars whose heights or lengths represent the frequencies of the data
2. Pareto chart - frequency distribution for a categorical variable; frequencies are displayed by vertical bars, arranged in order from highest to lowest
3. Time series graph - represents data that occur over a specific period of time; look for trends/patterns
4. Pie graph - circle divided into sections or wedges according to the percentage of frequencies; nominal/categorical

Data Distribution
Measures of Central Tendency
Central tendency - descriptive statistical measure that determines a single value that best describes the center and represents the entire distribution; condenses a large set of data into a single value
- goal is to identify the single value that is the best representative for the entire set of data
Statistic - a characteristic or measure obtained by using the data values from a sample
Parameter - a characteristic or measure obtained by using all the data values from a specific population

1. Mean - most commonly used measure of central tendency; balance point of the distribution; sum of the values divided by the total number of values
2. Median - midpoint of the list when the scores in a distribution are listed from smallest to largest; a more appropriate measure of central tendency than the mean when the distribution is skewed or contains extreme values; divides the scores so that 50% of the scores in the distribution have values that are equal to or less than the median
3. Mode - most frequently occurring category or score in the distribution or in the data set; peak or high point of the distribution; reported along with the mean or the median

Modal class - the mode for grouped data; the class with the largest frequency
1. Unimodal - a data set that has only one value that occurs with the greatest frequency
2. Bimodal - a data set that has two values that occur with the same greatest frequency; both values are considered to be the mode
3. Multimodal - a data set that has more than two values that occur with the same greatest frequency; each value is used as the mode

Central Tendency and the Shape of the Distribution
1. Symmetrical (Normal) Distribution - the data values are evenly distributed on both sides of the mean. When the distribution is unimodal, the mean, median, and mode are the same and are at the center of the distribution
2. Positively Skewed or Right-skewed Distribution - the majority of the data values fall to the left of the mean and cluster at the lower end of the distribution; the "tail" is to the right. The mean is to the right of the median, and the mode is to the left of the median
3. Negatively Skewed or Left-skewed Distribution - the majority of the data values fall to the right of the mean and cluster at the upper end of the distribution, with the tail to the left. The mean is to the left of the median, and the mode is to the right of the median
*When a distribution is extremely skewed, the value of the mean will be pulled toward the tail

Central Tendency and Variability - two primary values that are used to describe a distribution of scores
Central tendency - the central point of the distribution
Variability - descriptive statistic that describes how the scores are scattered around that central point; determined by measuring distance
- inferential statistic that describes how accurately any individual score or sample represents the entire population

Measures of Variation
1. Range - total distance covered by the distribution, from the highest score to the lowest score
R = highest value - lowest value
2. Variance ($\sigma^2$ or $s^2$) - average of the squares of the distance each value is from the mean
Population: $\sigma^2 = \dfrac{\sum (X - \mu)^2}{N}$
Sample: $s^2 = \dfrac{\sum (X - \bar{X})^2}{n - 1}$
X = individual value
$\bar{X}$ = sample mean
$\mu$ = population mean
n = sample size
N = population size
3. Standard Deviation ($\sigma$ or $s$) - standard distance between a score and the mean; square root of the variance

Uses of Variance and Standard Deviation
1. To determine the spread of the data.
2. To determine the consistency of a variable
3. To determine the number of data values that fall within a specified interval in a distribution
4. Used quite often in inferential statistics.

Coefficient of Variation (CVar) - statistic that allows standard deviations to be compared when the units are different; the standard deviation divided by the mean, with the result expressed as a percentage
For samples: $\text{CVar} = \dfrac{s}{\bar{X}} \cdot 100\%$
For populations: $\text{CVar} = \dfrac{\sigma}{\mu} \cdot 100\%$

Measures of Position - used to locate the relative position of a data value in the data set
1. Standard score (z-score) - tells how many standard deviations a data value is above or below the mean for a specific distribution of values
a) if the z score is 0, the data value is the same as the mean
b) if the z score is (+), the score is above the mean
c) if the z score is (-), the score is below the mean
When all data for a variable are transformed into z scores, the resulting distribution will have a mean of 0 and a standard deviation of 1
$z = \dfrac{\text{value} - \text{mean}}{\text{sd}}$
2. Percentile - percentiles divide the data set into 100 equal groups
$\text{percentile} = \dfrac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \cdot 100\%$
3. Quartiles - divide the distribution into four groups, separated by Q1, Q2, Q3
Q1 is the same as the 25th percentile
Q2 is the same as the 50th percentile, or the median
Q3 corresponds to the 75th percentile
4. Interquartile Range (IQR) - difference between Q3 and Q1 (IQR = Q3 - Q1); the range of the middle 50% of the data; used to identify outliers, and as a measure of variability in exploratory data analysis (EDA)
5. Deciles - divide the distribution into 10 groups, denoted by D1, D2, etc. Deciles can be found by using the formulas given for percentiles

Relationships Among Percentiles, Deciles, and Quartiles
• Deciles are denoted by D1, D2, D3, ... and they correspond to P10, P20, P30, ...
• Quartiles are denoted by Q1, Q2, Q3 and they correspond to P25, P50, P75
• The median is the same as P50 or Q2 or D5

Exploratory (Descriptive) Data Analysis, EDA - to examine data to find out what information can be discovered about the data, such as the center and the spread

Stem-and-Leaf Plot - data plot that uses part of the data value as the stem and part of the data value as the leaf to form groups or classes. Columns: leading digit (stem), trailing digit (leaf), frequency

Boxplot (Box and Whisker Plot) - graph of a data set obtained by drawing the lowest value of the data set (minimum), Q1, the median, Q3, and the highest value of the data set (maximum)

Comparing Boxplots for Two or More Data Sets - to compare the averages, use the location of the medians. To compare the variability, use the interquartile range or the length of the boxes.

Probability and Counting Rules
Probability - the chance of an event occurring

Basic Concepts of Probability
1. Probability Experiment - a chance process that generates a set of data or well-defined results called outcomes
2. Outcome - the result of a single trial of a probability experiment
3. Sample space (S) - set of all possible outcomes of a statistical experiment

Tree Diagram - used to determine all possible outcomes of a probability experiment
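
A tree diagram can also be enumerated in code. The sketch below assumes a hypothetical experiment (a coin tossed three times); each toss is one level of the tree and each root-to-leaf path is one outcome in the sample space. The variable names are illustrative only.

```python
from itertools import product

# Hypothetical experiment: a coin is tossed three times.
toss = ["H", "T"]

# Every path through the three-level tree is one outcome.
sample_space = ["".join(path) for path in product(toss, repeat=3)]

print(len(sample_space))  # 2 * 2 * 2 paths
for first in toss:
    # Outcomes that share the same first branch of the tree
    branch = [o for o in sample_space if o[0] == first]
    print(first, "->", branch)
```

Listing the outcomes by first branch mirrors how the tree is drawn on paper: two branches at the first toss, each splitting again at the second and third toss.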
Classifications of Events
Event (E) - consists of a set of outcomes of a probability experiment
1. Independent - the first event does not affect the probability of the next event occurring
2. Dependent - the probability of the second event occurring depends on the first event
3. Complementary event ($\bar{E}$) - set of outcomes in the sample space that are not included in the outcomes of event E; E and $\bar{E}$ are mutually exclusive
$P(\bar{E}) = 1 - P(E)$
$P(E) + P(\bar{E}) = 1$

Three Basic Interpretations of Probability
1. Classical Probability - relies on the sample space; assumes all outcomes are equally likely to occur; actual performance of the experiment is not necessary; outcomes are obtained by observation and tree diagram
$P(E) = \dfrac{\text{number of outcomes in } E}{\text{total number of outcomes}} = \dfrac{n(E)}{n(S)}$
2. Empirical Probability - uses a frequency distribution; outcomes are based on the frequency distribution and observation
$P(E) = \dfrac{\text{frequency for class}}{\text{total frequencies}} = \dfrac{f}{n}$
3. Subjective Probability - the researcher makes an educated guess about the chance of an event occurring; experiment performance not needed; based on educated personal judgment/estimate, opinions, and inexact information

Four Basic Probability Rules
Probability Rule 1 - the probability of any event is a number (fraction/decimal) between and including 0 and 1
$0 \le P(E) \le 1$
Probability Rule 2 - if event E can't occur, its probability is 0
Probability Rule 3 - if event E is certain, its probability is 1
Probability Rule 4 - the sum of the probabilities of all outcomes in the sample space is 1
*Probability values range from 0 to 1
*When probability is near 0, occurrence is highly unlikely
*When probability is near 0.5, there is a 50-50 chance
*When probability is near 1, the event is likely to occur
*When the probability of an event/complement is known, the other can be found by subtracting the probability from 1

Rules in Solving Probability of Compound Events (2 or more)
1. Addition Rule
a) Mutually Exclusive Events - when two events A and B are mutually exclusive, P(A or B) = P(A) + P(B)
b) Non-mutually Exclusive Events - if A and B are not mutually exclusive, P(A or B) = P(A) + P(B) - P(A and B)
2. Multiplication Rule and Conditional Probability
a) Independent Events - the probability of both occurring is P(A and B) = P(A) x P(B)
b) Dependent Events - uses the conditional probability P(B|A); the probability of both occurring is P(A and B) = P(A) x P(B|A)

Conditional Probability
The probability that event B occurs given that event A has already occurred:
$P(B \mid A) = \dfrac{P(A \text{ and } B)}{P(A)}$

Determination of the Number of Outcomes of Events
1. Fundamental Counting Rule - multiply the number of ways each event can occur ($k_1 \cdot k_2 \cdot k_3 \cdots k_n$)
2. Permutation - arrangement of n objects in a specific order
Permutation Rule - number of permutations of n objects taking r objects at a time; order is important
${}_nP_r = \dfrac{n!}{(n - r)!}$ where n! = n factorial
3. Combination - selection of distinct objects without regard to order
Combination Rule - number of combinations of r objects selected from n objects; order is not important
${}_nC_r = \dfrac{n!}{(n - r)!\, r!}$

Probability Distribution - a relative frequency distribution of all possible outcomes of an experiment

Different Types of Probability Distribution
1. Probability Distribution of Discrete Variables - binomial, Poisson distribution
2. Probability Distribution of Continuous Variables - uniform, normal distribution

Random Variables - characteristic that varies from one member of a population to another; its values vary randomly or by chance
1. Discrete Random Variables - have a finite or countable number of values (0, 1, 2…)
2. Continuous Random Variables - have infinitely many values associated with measurements on a continuous scale where there are no gaps or interruptions (5, 5.1, 6.2…)

Discrete Probability Distribution - table, graph, or mathematical expression that specifies all possible values (outcomes) of a random variable with their probabilities. It should satisfy the criteria:
1. $\sum P(x) = 1$, where x is a discrete variable and P(x) is the probability of x
2. $0 \le P(x) \le 1$ for every value of x

Mean of a Probability Distribution - expected value; typical value that represents the central location of a probability distribution
$\mu = \sum x P(x)$
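
The two criteria for a discrete probability distribution and the expected-value formula can be checked directly. This is a minimal sketch under stated assumptions: the distribution is a made-up example (number of heads in two tosses of a fair coin), and the helper names `is_valid` and `mean` are hypothetical. Exact fractions are used so the sum-to-1 check is not affected by floating-point rounding.

```python
from fractions import Fraction

# Made-up example: number of heads in two tosses of a fair coin.
dist = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def is_valid(dist):
    # Criterion 1: the probabilities sum to 1.
    # Criterion 2: every probability lies between 0 and 1 inclusive.
    return sum(dist.values()) == 1 and all(0 <= p <= 1 for p in dist.values())

def mean(dist):
    # Expected value: sum of x * P(x) over all values x
    return sum(x * p for x, p in dist.items())

print(is_valid(dist))  # True
print(mean(dist))      # 1
```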
Variance and Standard Deviation of a Probability Distribution - measure the amount of spread in a distribution
$\sigma^2 = \sum \left[ (x - \mu)^2 P(x) \right]$

Binomial Distribution - with parameters n and p, is the discrete probability distribution of the number of successes in a sequence of n independent experiments

4 Properties of Binomial Distribution
1. Fixed number of trials (n)
2. Two outcomes in a trial, success or failure
3. Trials are independent
4. Probability of success p remains constant

General Formula
X ~ B(n, p)
$P(X = r) = {}_nC_r\, p^r q^{n - r}$
X = random variable
n = number of trials
r = number of successes
p = probability of success
q = probability of failure

Mean and Variance
X ~ B(n, p)
mean $\mu = E(X) = np$
variance $\sigma^2 = \mathrm{Var}(X) = npq$
where $q = 1 - p$

Mode - of a binomial B(n, p) distribution
$\lfloor (n+1)p \rfloor$ if (n+1)p is 0 or a noninteger
(n+1)p and (n+1)p - 1 if $(n+1)p \in \{1, \dots, n\}$
n if (n+1)p = n+1

Median - there is no general formula for the median of a binomial distribution

Multinomial Distribution - used to compute probabilities in situations that have more than 2 possible outcomes
1. Statistical experiment with k outcomes
2. Repeated independently n times
$P = \dfrac{n!}{n_1!\, n_2! \cdots n_k!}\, p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}$
where P = probability
n = total number of events
n1 = number of times outcome 1 occurs
n2 = number of times outcome 2 occurs
nk = number of times outcome k occurs
p1 = probability of outcome 1
p2 = probability of outcome 2
pk = probability of outcome k

Hypergeometric Distribution - discrete probability distribution that describes the probability of k successes in n draws, without replacement, from a finite population of size N that contains exactly K objects classified as successes, wherein each draw is either a success or a failure

Conditions Characterizing Hypergeometric Distribution
1. The result of each draw can be classified into one of two mutually exclusive categories (Pass/Fail, True/False)
2. The probability of a success changes on each draw, as each draw decreases the population (sampling without replacement from a finite population)

Hypergeometric Random Variable - the number X of successes of a hypergeometric experiment
Probability mass function (pmf)
$P(X = k) = \dfrac{\binom{K}{k} \binom{N - K}{n - k}}{\binom{N}{n}}$
where N = population size
K = number of success states in the population
n = number of draws
k = number of observed successes
$\binom{a}{b}$ = a binomial coefficient
The pmf is positive when $\max(0,\, n + K - N) \le k \le \min(K, n)$
The pmf satisfies the recurrence relation
$(k + 1)(N - K - n + k + 1)\, P(X = k + 1) = (K - k)(n - k)\, P(X = k)$
with
$P(X = 0) = \dfrac{\binom{N - K}{n}}{\binom{N}{n}}$
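
The binomial and hypergeometric probability formulas in this section can be evaluated with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, the parameter values are made up, and `math.comb` (Python 3.8+) supplies the binomial coefficients.

```python
from math import comb

def binomial_pmf(r, n, p):
    # P(X = r) = nCr * p^r * q^(n - r), with q = 1 - p
    q = 1 - p
    return comb(n, r) * p**r * q**(n - r)

def hypergeometric_pmf(k, N, K, n):
    # The pmf is zero outside max(0, n + K - N) <= k <= min(K, n)
    if k < max(0, n + K - N) or k > min(K, n):
        return 0.0
    # P(X = k) = C(K, k) * C(N - K, n - k) / C(N, n)
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Made-up binomial case: 4 trials, success probability 0.5
print(binomial_pmf(2, 4, 0.5))                                # 0.375
print(sum(r * binomial_pmf(r, 4, 0.5) for r in range(5)))     # mean np = 2.0

# Made-up hypergeometric case: N = 10 objects, K = 4 successes, n = 3 draws
print(hypergeometric_pmf(1, 10, 4, 3))                        # 0.5
```

The binomial case models draws with a constant success probability, while the hypergeometric case models draws without replacement, so its success probability changes from draw to draw.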
