The Statistical Analysis of Data


by Dr. Dang Quang A & Dr. Bui The Hong
Hanoi Institute of Information Technology
Preface
Statistics is the science of collecting, organizing and interpreting
numerical and nonnumerical facts, which we call data.
The collection and study of data are important in the work of
many professions, so training in the science of statistics is
valuable preparation for a variety of careers, for example
economists, financial advisors, businessmen, engineers, and
farmers.
Knowledge of probability and statistical methods is also
useful for informatics specialists in fields such as data
mining, knowledge discovery, neural networks, fuzzy systems and
so on.
Whatever else it may be, statistics is, first and foremost, a
collection of tools used for converting raw data into information
to help decision makers in their work.
The science of data - statistics - is the subject of this course.


Audience and objective
Audience
This tutorial, an introductory course in statistics, is intended
mainly for users such as engineers, economists,
managers, ... who need to use statistical methods in their work,
and for students. However, in many respects it will also be
useful for computer trainers.
Objectives
Understanding statistical reasoning
Mastering basic statistical methods for analyzing data
such as descriptive and inferential methods
Ability to use statistical methods in practice with the
help of statistical computer software
Entry requirements
High school algebra course (+elements of calculus)
Skill in working with a computer




Contents
Preface
Chapter 1 Introduction
Chapter 2 Data presentation
Chapter 3 Data characteristics: descriptive summary statistics
Chapter 4 Probability: Basic concepts
Chapter 5 Basic Probability distributions
Chapter 6 Sampling Distributions
Chapter 7 Estimation
Chapter 8 General Concepts of Hypothesis Testing
Chapter 9 Applications of Hypothesis Testing
Chapter 10 Categorical Data Analysis and Analysis of Variance
Chapter 11 Simple Linear regression and correlation
Chapter 12 Multiple regression
Chapter 13 Nonparametric statistics
References
Appendix A
Appendix B
Appendix C
Appendix D
Index

Chapter 1 Introduction
1.1 What is Statistics?
Whatever else it may be, statistics is, first and foremost, a collection of
tools used for converting raw data into information to help decision
makers in their works.
1.2. Populations and samples
A population is a whole, and a sample is a fraction of the whole.
A population is a collection of all the elements we are studying and
about which we are trying to draw conclusions. Such a population is
often referred to as the target population.
A sample is a collection of some, but not all, of the elements of the
population.
1.3. Descriptive and inferential statistics
Descriptive statistics is devoted to the summarization and description of
data (population or sample).
Inferential statistics uses sample data to make an inference about a
population.
1.4. Brief history of statistics
1.5 Computer software for statistical analysis
Chapter 2 Data presentation
2.1 Introduction
The objective of data description is to summarize the characteristics of a data set.
Ultimately, we want to make the data set more comprehensible and meaningful. In this
chapter we will show how to construct charts and graphs that convey the nature of a
data set. The procedure that we will use to accomplish this objective depends on the
type of data.
2.2 Types of data
- Quantitative data are observations measured on a numerical scale.
- Nonnumerical data that can only be classified into categories are said to be
qualitative data.
2.3 Qualitative data presentation
Category frequency = the number of observations that fall in that category.
Relative frequency = the proportion of the total number of observations that fall in that
category
Percentage for a category = Relative frequency for the category x 100%
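These three measures can be sketched with Python's standard library (the data here are hypothetical, chosen only to illustrate the formulas):

```python
from collections import Counter

# Hypothetical sample of qualitative observations (car colors)
data = ["red", "blue", "red", "green", "blue", "red"]

freq = Counter(data)                                   # category frequency
n = len(data)                                          # total observations
rel_freq = {c: f / n for c, f in freq.items()}         # relative frequency
percent = {c: rf * 100 for c, rf in rel_freq.items()}  # percentage

print(freq["red"], rel_freq["red"], percent["red"])    # 3 0.5 50.0
```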
2.4 Graphical description of qualitative data
Bar graphs and pie charts

Chapter 2 (continued 1)
2.5 Graphical description of quantitative data: Stem and
Leaf displays
A stem and leaf display is widely used in exploratory data analysis when the data set is
small.
Steps to follow in constructing a stem and leaf display
Advantages and disadvantages of a stem and leaf display
2.6 Tabulating quantitative data: Relative frequency
distributions
A frequency distribution is a table that organizes data into classes.
Class frequency = the number of observations that fall into the class.
Class relative frequency = Class frequency/ Total number of observations
- Relative class percentage = Class relative frequency x 100%
2.7 Graphical description of quantitative data: histogram
and polygon
frequency histogram, relative frequency histogram and percentage histogram.
frequency polygon, relative frequency polygon and percentage polygon
2.8 Cumulative distributions and cumulative polygons
2.9 Exercises
Chapter 3 Data characteristics: descriptive summary
statistics
3.1 Introduction
3.2 Types of numerical descriptive measures
3.3 Measures of central tendency
3.4 Measures of data variation
3.5 Measures of relative standing
3.6 Shape
3.7 Methods for detecting outliers
3.8 Calculating some statistics from grouped data
3.9 Computing descriptive summary statistics using
computer software
3.10 Exercises




Chapter 3 (continued 1)
3.2 Types of numerical descriptive measures:
Location, Dispersion, Relative standing and Shape
3.3 Measures of location (or central tendency)
3.3.1 Mean
3.3.2 Median
3.3.3 Mode
3.3.4 Geometric mean
3.4 Measures of data variation
3.4.1 Range
3.4.2 Variance and standard deviation
Uses of the standard deviation: Chebyshev's Theorem,
The Empirical Rule
3.4.3 Relative dispersion: The coefficient of variation
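The measures listed in 3.3 and 3.4 are all available in Python's standard `statistics` module; a minimal sketch on hypothetical data:

```python
import statistics as st

# Hypothetical sample
x = [2, 4, 4, 4, 5, 5, 7, 9]

mean = st.mean(x)        # measure of central tendency
median = st.median(x)
mode = st.mode(x)        # most frequent value
rng = max(x) - min(x)    # range
var = st.pvariance(x)    # population variance
sd = st.pstdev(x)        # population standard deviation
cv = sd / mean * 100     # coefficient of variation, in percent

print(mean, median, mode, rng, var, sd, cv)
```

For a sample rather than a whole population, `st.variance` and `st.stdev` (with the n − 1 divisor) would be used instead.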



Chapter 3 (continued 2)
3.5 Measures of relative standing
Descriptive measures that locate the relative position of an
observation in relation to the other observations are called
measures of relative standing
The pth percentile is a number such that p% of the
observations of the data set fall below it and (100−p)% of the
observations fall above it.
Lower quartile = 25th percentile, mid-quartile = 50th percentile,
upper quartile = 75th percentile
Interquartile range, z-score
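A sketch of quartiles, the interquartile range and a z-score, using `statistics.quantiles` on a small hypothetical sample (the "inclusive" method interpolates between the observed data points):

```python
import statistics as st

x = [15, 20, 35, 40, 50]    # hypothetical sample, already sorted

# statistics.quantiles with n=4 returns the three quartile cut points
q1, q2, q3 = st.quantiles(x, n=4, method="inclusive")
iqr = q3 - q1               # interquartile range

# z-score: how many standard deviations an observation lies from the mean
z = (50 - st.mean(x)) / st.stdev(x)
print(q1, q2, q3, iqr)
```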
3.6 Shape
3.6.1 Skewness
3.6.2 Kurtosis
3.7 Methods for detecting outliers
3.8 Calculating some statistics from grouped
data

Chapter 4. Probability: Basic concepts

4.1 Experiment, Events and Probability of an Event
4.2 Approaches to probability
4.3 The field of events
4.4 Definitions of probability
4.5 Conditional probability and independence
4.6 Rules for calculating probability
4.7 Exercises

Chapter 4 (continued 1)
4.1 Experiment, Events and Probability of an Event
The process of making an observation or recording a
measurement under a given set of conditions is a trial or
experiment.
Outcomes of an experiment are called events.
We denote events by capital letters A, B, C, ...
The probability of an event A, denoted by P(A), in general, is the
chance A will happen.
4.2 Approaches to probability
- Definitions of probability as a quantitative measure of the
degree of certainty of the observer of the experiment.
- Definitions that reduce the concept of probability to the more
primitive notion of equal likelihood (the so-called classical
definition).
- Definitions that take as their point of departure the relative
frequency of occurrence of the event in a large number of trials
(the statistical definition).


Chapter 4 (continued 2)
4.3 The field of events
Definitions and relations between the events: A implies B, A and
B are equivalent (A=B), product or intersection of the events A and
B (AB), sum or union of A and B (A+B), difference of A and B (A−B or
A\B), certain (or sure) event, impossible event, complement of A,
mutually exclusive events, simple (or elementary) events, sample space.
Venn diagrams
Field of events
4.4 Definitions of probability
4.4.1 The classical definition of probability
4.4.2 The statistical definition of probability
4.4.3 Axiomatic construction of the theory of probability (optional)
4.5 Conditional probability and independence
Definition, formula, multiplicative theorem, independent and
dependent events







Chapter 4 (continued 3)
4.5 Conditional probability and independence
4.6 Rules for calculating probability
4.6.1 The addition rule
For pairwise mutually exclusive events:
P(A1 + A2 + ... + An) = P(A1) + P(A2) + ... + P(An)
For two non-mutually exclusive events A and B:
P(A+B) = P(A) + P(B) − P(AB).
4.6.2 Multiplicative rule
P(AB) = P(A) P(B|A) = P(B) P(A|B).
4.6.3 Formula of total probability
P(B) = P(A1)P(B|A1) + P(A2)P(B|A2) + ... + P(An)P(B|An).
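The three rules can be checked numerically on a hypothetical experiment, a fair six-sided die, using exact `Fraction` arithmetic:

```python
from fractions import Fraction as F

# Fair die: A = "even" = {2,4,6}, B = "at most 3" = {1,2,3}
P_A, P_B = F(3, 6), F(3, 6)
P_AB = F(1, 6)                    # A and B = {2}

# Addition rule for two non-mutually exclusive events
P_A_or_B = P_A + P_B - P_AB       # {1,2,3,4,6} -> 5/6

# Multiplicative rule: P(AB) = P(A) P(B|A)
P_B_given_A = F(1, 3)             # of {2,4,6}, only 2 is <= 3
assert P_AB == P_A * P_B_given_A

# Total probability with the partition A1 = "even", A2 = "odd"
P_B_given_odd = F(2, 3)           # of {1,3,5}, both 1 and 3 are <= 3
P_B_total = P_A * P_B_given_A + F(3, 6) * P_B_given_odd

print(P_A_or_B, P_B_total)        # 5/6 1/2
```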
Chapter 5 Basic Probability distributions
5.1 Random variables
5.2 The probability distribution for a discrete random
variable
5.3 Numerical characteristics of a discrete random
variable
5.4 The binomial probability distribution
5.5 The Poisson distribution
5.6 Continuous random variables: distribution function
and density function
5.7 Numerical characteristics of a continuous random
variable
5.8 The normal distribution
5.9 Exercises

Chapter 5 (continued 1)
5.1 Random variables
A random variable is a variable that assumes numerical values
associated with events of an experiment.
Classification of random variables: discrete random variables
and continuous random variables
5.2 The probability distribution for a discrete random
variable
The probability distribution for a discrete random variable x is a
table, graph, or formula that gives the probability of observing
each value of x.
Properties of the probability distribution

Chapter 5 (continued 2)
5.3 Numerical characteristics of a discrete random
variable
5.3.1 Mean or expected value: μ = E(X) = Σ x·p(x)
5.3.2 Variance and standard deviation: σ² = E[(X − μ)²], σ = √σ²
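Both formulas amount to sums over the probability distribution; a sketch for a hypothetical pmf:

```python
# Hypothetical pmf of a discrete random variable X
pmf = {0: 0.2, 1: 0.5, 2: 0.3}

mu = sum(x * p for x, p in pmf.items())               # E(X)
var = sum((x - mu) ** 2 * p for x, p in pmf.items())  # E[(X - mu)^2]
sd = var ** 0.5                                       # standard deviation

print(mu, var, sd)
```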
5.4 The binomial probability distribution
Model (or characteristics) of a binomial random variable
The probability distribution
mean and variance for a binomial random variable
5.5 The Poisson distribution
Model (or characteristics) of a Poisson random variable
The probability distribution
mean and variance for a Poisson random variable
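The binomial and Poisson pmfs, and the agreement of the binomial mean and variance with their closed forms np and np(1−p), can be sketched with hypothetical parameters:

```python
from math import comb, exp, factorial

n, p = 10, 0.3                       # hypothetical binomial parameters

def binom_pmf(k):
    # P(X = k) for a binomial random variable
    return comb(n, k) * p**k * (1 - p)**(n - k)

mean = sum(k * binom_pmf(k) for k in range(n + 1))
var = sum((k - mean)**2 * binom_pmf(k) for k in range(n + 1))
# These agree with the closed forms: mean = n*p, var = n*p*(1-p)

lam = 2.0                            # hypothetical Poisson parameter
def pois_pmf(k):
    return exp(-lam) * lam**k / factorial(k)
# For the Poisson distribution, mean = variance = lam

print(round(mean, 6), round(var, 6))
```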
Chapter 5 (continued 3)
5.6 Continuous random variables: distribution function
and density function
Cumulative distribution function: F(x) = P(X < x)
Probability density function: f(x) = F′(x)
5.7 Numerical characteristics of a continuous random
variable
Mean or expected value: μ = E(X) = ∫ x·f(x) dx
Variance and standard deviation
5.8 The normal distribution
The density function, mean and variance for a normal random variable
The σ, 2σ and 3σ rules
The normal distribution as an approximation to the binomial probability
distribution
Chapter 6 Sampling Distributions
6.1 Why the method of sampling is important
6.2 Obtaining a Random Sample
6.3 Sampling Distribution
6.4 The sampling distribution of sample mean: the
Central Limit Theorem
6.5 Summary
6.6 Exercises



Chapter 6 (continued 1)
6.1 Why the method of sampling is important
two samples from the same population can provide
contradictory information about the population
Random sampling eliminates the possibility of bias in selecting a
sample and, in addition, provides a probabilistic basis for
evaluating the reliability of an inference
6.2 Obtaining a Random Sample
A random sample of n experimental units is one selected in
such a way that every different sample of size n has an equal
probability of selection
procedures for generating a random sample

Chapter 6 (continued 2)
6.3 Sampling Distribution
A numerical descriptive measure of a population is called a parameter.
A quantity computed from the observations in a random sample is
called a statistic.
A sampling distribution of a sample statistic (based on n observations)
is the relative frequency distribution of the values of the statistic
theoretically generated by taking repeated random samples of size n
and computing the value of the statistic for each sample.
Examples of computer-generated random samples
6.4 The sampling distribution of sample mean: the
Central Limit Theorem
If the sample size is sufficiently large, the mean of a random sample from a
population has a sampling distribution that is approximately normal,
regardless of the shape of the relative frequency distribution of the
target population
Mean and standard deviation of the sampling distribution
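The Central Limit Theorem can be illustrated by a small simulation (a sketch with arbitrary parameters): sample means from a decidedly non-normal population cluster near μ with spread close to σ/√n.

```python
import random
import statistics as st

random.seed(0)
n, reps = 40, 2000

# Population: uniform on [0, 1), so mu = 0.5 and sigma^2 = 1/12
means = [st.mean(random.random() for _ in range(n)) for _ in range(reps)]

# Mean of the sampling distribution is near mu = 0.5,
# its standard deviation near sigma / sqrt(n) = sqrt(1/(12*40))
print(round(st.mean(means), 3), round(st.stdev(means), 3))
```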
6.5 Summary

Chapter 7. Estimation
7.1 Introduction
7.2 Estimation of a population mean: Large-sample case
7.3 Estimation of a population mean: small sample case
7.4 Estimation of a population proportion
7.5 Estimation of the difference between two population
means: Independent samples
7.6 Estimation of the difference between two population
means: Matched pairs
7.7 Estimation of the difference between two population
proportions
7.8 Choosing the sample size
7.9 Estimation of a population variance
7.10 Summary

Chapter 7 (continued 1)
7.2 Estimation of a population mean: Large-sample case
Point estimate for a population mean: the sample mean x̄
Large-sample (1−α)100% confidence interval for a population
mean (uses the fact that for sufficiently large sample size, n ≥ 30,
the sampling distribution of the sample mean x̄ is approximately
normal)
7.3 Estimation of a population mean: small-sample case
(n < 30)
Problems arising for small sample sizes. Assumption: the
population has an approximately normal distribution
(1−α)100% confidence interval using the t-distribution
7.4 Estimation of a population proportion
For sufficiently large samples, the sampling distribution of the
sample proportion p̂ is approximately normal
Large-sample (1−α)100% confidence interval for a population
proportion
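A sketch of the large-sample interval x̄ ± z·s/√n on hypothetical data (1.96 is the z-value for a 95% interval, α = 0.05):

```python
import statistics as st

# Hypothetical large sample (n >= 30)
x = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3] * 6   # n = 36
n = len(x)
xbar, s = st.mean(x), st.stdev(x)

z = 1.96                         # z-value for a 95% confidence level
half = z * s / n**0.5            # half-width of the interval
ci = (xbar - half, xbar + half)

print(n, round(ci[0], 3), round(ci[1], 3))
```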

Chapter 7 (continued 2)
7.5 Estimation of the difference between two population
means: Independent samples
For sufficiently large sample sizes (n1 ≥ 30 and n2 ≥ 30), the
sampling distribution of x̄1 − x̄2, based on independent
random samples from two populations, is approximately normal
Small sample sizes: under some assumptions on the populations
7.6 Estimation of the difference between two population
means: Matched pairs
Assumption: the population of paired differences is normally
distributed. Procedure
7.7 Estimation of the difference between two population
proportions
For sufficiently large sample sizes (n1 ≥ 30 and n2 ≥ 30), the
sampling distribution of p̂1 − p̂2, based on independent random
samples from two populations, is approximately normal
(1−α)100% confidence interval for p1 − p2
Chapter 8. General Concepts of Hypothesis Testing
8.1 Introduction
The procedures to be discussed are useful in situations where we are
interested in making a decision about a parameter value rather than
obtaining an estimate of its value
8.2 Formulation of Hypotheses
The null hypothesis H0 is the hypothesis against which we hope to
gather evidence. The hypothesis for which we wish to gather
supporting evidence is called the alternative hypothesis Ha.
One-tailed (directional) tests and two-tailed tests

8.3 Conclusions and Consequences for a Hypothesis Test
The goal of any hypothesis test is to make a decision based on
sample information: whether to reject H0 in favor of Ha. In doing so
we may make one of two types of error.
A Type I error occurs if we reject H0 when it is true. The probability of
committing a Type I error is denoted by α (also called the significance
level).
A Type II error occurs if we do not reject H0 when it is false. The
probability of committing a Type II error is denoted by β.

Chapter 8 (continued 1)
8.4 Test statistics and rejection regions
The test statistic is a sample ststistic, upon which the decision
concerning the null and alternative hypotheses is based.
The rejection region is the set of possible values of the test statistic for
which the null hypotheses will be rejected.
Steps for testing hypothesis
Critical value =boundary value of the rejection region
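The steps above can be sketched as a large-sample two-tailed z-test on hypothetical data (1.96 is the critical value bounding the rejection region at α = 0.05):

```python
import statistics as st

# Hypothetical large sample; test H0: mu = 50 against Ha: mu != 50
x = [52, 49, 53, 55, 48, 51, 54, 50, 56, 47] * 4   # n = 40
n = len(x)
xbar, s = st.mean(x), st.stdev(x)

z = (xbar - 50) / (s / n**0.5)   # test statistic
critical = 1.96                  # boundary of the rejection region, alpha = 0.05
reject_H0 = abs(z) > critical    # decision rule

print(round(z, 3), reject_H0)
```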
8.5 Summary
8.6 Exercises

Chapter 9. Applications of Hypothesis Testing
9.1 Diagnosing a hypothesis test
9.2 Hypothesis test about a population mean
9.3 Hypothesis test about a population proportion
9.4 Hypothesis tests about the difference between two
population means
9.5 Hypothesis tests about the difference between two
proportions
9.6 Hypothesis test about a population variance
9.7 Hypothesis test about the ratio of two population
variances
9.8 Summary
9.9 Exercises
Chapter 9 (continued 1)
9.2 Hypothesis test about a population mean
Large-sample test (n ≥ 30):
the sampling distribution of the sample mean x̄ is approximately normal
and s is a good approximation of σ
Procedure for the large-sample test
Small-sample test:
Assumption: the population has an approximately normal distribution
Procedure for the small-sample test (using the t-distribution)
9.3 Hypothesis test about a population proportion
Large-sample test
9.4 Hypothesis tests about the difference between
two population means
Large-sample test:
Assumptions: n1 ≥ 30, n2 ≥ 30; samples are selected randomly and
independently from the populations
Small-sample test

Chapter 9 (continued 2)
9.5 Hypothesis tests about the difference between two
proportions:
Assumptions, Procedure
9.6 Hypothesis test about a population variance
Assumption: the population has an approx. normal distribution
Procedure using the chi-square distribution
9.7 Hypothesis test about the ratio of two population
variances (optional)
Assumptions: the populations have approx. normal distributions;
the random samples are independent
Procedure using the F-distribution

Chapter 10. Categorical Data Analysis and Analysis of
Variance
10.1 Introduction
10.2 Tests of goodness of fit
10.3 The analysis of contingency tables
10.4 Contingency tables in statistical software packages
10.5 Introduction to analysis of variance
10.6 Design of experiments
10.7 Completely randomized designs
10.8 Randomized block designs
10.9 Multiple comparisons of means and confidence regions
10.10 Summary
10.11 Exercises
Chapter 10 (continued 1)
10.1 Introduction
10.2 Tests of goodness-of-fit
Purpose: to test hypotheses about a qualitative variable that
allows more than two categories for a response. Namely, it tests
whether there is a significant difference between an observed frequency
distribution and a theoretical frequency distribution.
Procedure for a Chi-square goodness-of-fit test
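The test statistic is Σ(observed − expected)²/expected; a sketch on hypothetical counts against a uniform theoretical distribution:

```python
# Hypothetical observed counts for 4 categories
observed = [18, 22, 27, 13]
n = sum(observed)
expected = [n / 4] * 4            # uniform theoretical distribution: 20 each

# Chi-square goodness-of-fit statistic
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1            # degrees of freedom

# Compare chi2 with the chi-square critical value for df = 3
# (7.815 at the 5% significance level)
print(round(chi2, 3), df)
```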
10.3 The analysis of contingency tables
Purpose: to determine whether a dependence exists between two
qualitative variables
Procedure for a Chi-square test for independence of two
directions of classification
10.4 Contingency tables in statistical software packages

Chapter 10 (continued 2)
10.5 Introduction to analysis of variance
Purpose: Comparison of more than two means
10.6 Design of experiments
Concepts of experiment, design of the experiment, response variable,
factor, treatment
Concepts of Between-sample variation, Within-sample variation
10.7 Completely randomized designs
This design involves a comparison of the means of k treatments, based
on independent random samples of n1, n2, ..., nk observations drawn
from the k populations.
Assumptions: all k populations are normal and have equal variances
F-test for comparing k population means
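The F statistic is the ratio of between-sample to within-sample variation; a sketch computed from scratch on hypothetical samples from k = 3 treatments:

```python
import statistics as st

# Hypothetical independent samples from k = 3 treatments
samples = [[5, 7, 6, 8], [9, 11, 10, 10], [4, 5, 6, 5]]
k = len(samples)
n = sum(len(s) for s in samples)
grand = st.mean(x for s in samples for x in s)   # grand mean

# Between-sample variation (sum of squares for treatments)
sst = sum(len(s) * (st.mean(s) - grand) ** 2 for s in samples)
# Within-sample variation (sum of squares for error)
sse = sum((x - st.mean(s)) ** 2 for s in samples for x in s)

mst = sst / (k - 1)   # mean square for treatments
mse = sse / (n - k)   # mean square for error
F = mst / mse         # compare with the F(k-1, n-k) critical value

print(round(F, 3))
```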
10.8 Randomized block designs
Concept of randomized block design
Tests to compare k Treatment and b Block Means
10.9 Multiple comparisons of means and confidence
regions
Chapter 11. Simple Linear regression and correlation
11.1 Introduction: Bivariate relationships
11.2 Simple Linear regression: Assumptions
11.3 Estimating A and B: the method of least squares
11.4 Estimating σ²
11.5 Making inferences about the slope, B
11.6. Correlation analysis
11.7 Using the model for estimation and prediction
11.8. Simple Linear Regression: An Overview Example
11.9 Exercises
Chapter 11 (continued 1)
11.1 Introduction: Bivariate relationships
The subject is to determine the relationship between two variables.
Types of relationships: direct and inverse
Scattergram
11.2 Simple Linear regression: Assumptions
The simple linear regression model: y = A + Bx + e
Assumptions required for a linear regression model: E(e) = 0, e is
normal, σ² is constant for all values of x
11.3 Estimating A and B: the method of least squares
the least squares estimators a and b , formula for a and b
11.4 Estimating σ²
Formula for s², an estimator of σ²
Interpretation of s, the estimated standard deviation of e
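A sketch of the least squares formulas b = SSxy/SSxx, a = ȳ − b·x̄, and s² = SSE/(n − 2), on hypothetical bivariate data:

```python
import statistics as st

# Hypothetical bivariate data for y = A + B*x + e
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)

xbar, ybar = st.mean(xs), st.mean(ys)
ss_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
ss_xx = sum((x - xbar) ** 2 for x in xs)

b = ss_xy / ss_xx        # least squares estimate of the slope B
a = ybar - b * xbar      # least squares estimate of the intercept A

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s2 = sse / (n - 2)       # estimator of sigma^2

print(round(b, 3), round(a, 3))
```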

Chapter 11 (continued 2)
11.5 Making inferences about the slope, B
Problem of making an inference about the population regression line
E(y) = A + Bx based on the sample regression line ŷ = a + bx
Sampling distribution of the least squares estimator of the slope, b
Test of the utility of the model: H0: B = 0 against Ha: B ≠ 0, B > 0, or B < 0
A (1−α)100% confidence interval for B
11.6. Correlation analysis
Correlation analysis is the statistical tool for describing the degree to
which one variable is linearly related to another.
The coefficient of correlation r is a measure of the strength of the
linear relationship between two variables
The coefficient of determination
11.7 Using the model for estimation and prediction
A (1−α)100% confidence interval for the mean value of y for x = xp
A (1−α)100% confidence interval for an individual y for x = xp
11.8. Simple Linear Regression: An Example
Chapter 12. Multiple regression
12.1. Introduction: the general linear model
12.2 Model assumptions
12.3 Fitting the model: the method of least squares
12.4 Estimating σ²
12.5 Estimating and testing hypotheses about the B
parameters
12.6. Checking the utility of a model
12.7. Using the model for estimating and prediction
12.8 Multiple linear regression: An overview example
12.9. Model building: interaction models
12.10. Model building: quadratic models
12.11 Exercises
Chapter 12 (continued 1)
12.1. Introduction: the general linear model
y = B0 + B1x1 + ... + Bkxk + e, where y is the dependent variable,
x1, x2, ..., xk are independent variables, and e is a random error.
12.2 Model assumptions
For any given set of values x1, x2, ..., xk, the random error e has a normal
probability distribution with mean 0 and variance σ².
The random errors are independent.
12.3 Fitting the model: the method of least squares
Least squares prediction equation: ŷ = b0 + b1x1 + ... + bkxk
12.4 Estimating σ²
12.5 Estimating and testing hypotheses about the B parameters
Sampling distributions of b0, b1, ..., bk
A (1−α)100% confidence interval for Bi (i = 0, 1, ..., k)
Test of an individual parameter coefficient Bi

Chapter 12 (continued 2)
12.6. Checking the utility of a model
Finding a measure of how well a linear model fits a set of data: the
multiple coefficient of determination
testing the overall utility of the model
12.7. Using the model for estimating and prediction
A (1−α)100% confidence interval for the mean value of y for a given x
A (1−α)100% confidence interval for an individual y for a given x
12.8 Multiple linear regression: An overview example
12.9. Model building: interaction models
Interaction model with two independent variables:
E(y) = B0 + B1x1 + B2x2 + B3x1x2
Procedure to build an interaction model
12.10. Model building: quadratic models
Quadratic model in a single variable: E(y) = B0 + B1x + B2x²
Procedure to build a quadratic model
Chapter 13. Nonparametric statistics
13.1. Introduction
Situations where the t and F tests are unsuitable
What do nonparametric methods use?
13.2. The sign test for a single population
Purpose: to test hypotheses about the median of any population
Procedure for the sign test for a population median
Sign test based on a large sample (n ≥ 10)
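Under H0 the number of observations above the hypothesized median is Binomial(n, 0.5), so the p-value is a binomial tail probability; a sketch on hypothetical data:

```python
from math import comb

# Hypothetical two-tailed sign test of H0: median = 10
x = [12, 9, 14, 11, 13, 8, 15, 12, 11, 13]
above = sum(v > 10 for v in x)      # number of plus signs
n = sum(v != 10 for v in x)         # ties with the median are discarded

def binom_upper_tail(k, n):
    # P(X >= k) for X ~ Binomial(n, 0.5)
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

p_value = 2 * binom_upper_tail(above, n)   # two-tailed p-value

print(above, n, round(p_value, 4))
```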
13.3 Comparing two populations based on independent
random samples: Wilcoxon rank sum test
A nonparametric test about the difference between two populations:
a test to detect whether distribution 1 is shifted to the right of
distribution 2, or vice versa
Wilcoxon rank sum test for a shift in population locations
The case of large samples (n1 > 10, n2 > 10)
Chapter 13 (continued 1)
13.4. Comparing two populations based on matched pairs
Wilcoxon signed ranks test for a shift in population locations
Wilcoxon signed ranks test for large samples (n>25)
13.5. Comparing populations using a completely randomized
design: The Kruskal-Wallis H test
The Kruskal-Wallis H test is the nonparametric equivalent of the ANOVA F
test, used when the assumptions that the populations are normally distributed
with common variance are not satisfied.
The Kruskal-Wallis H test for comparing k population probability
distributions
13.6 Rank Correlation: Spearman's rs statistic
Spearman's rs is a statistic developed to measure and to test for correlation
between two random variables.
Formula for computing Spearman's rank correlation coefficient rs
Spearman's nonparametric test for rank correlation
13.7 Exercises
