BFC 34303 Chapter 1 Descriptive Statistics

BFC 34303
CIVIL ENGINEERING STATISTICS

Chapter 1
Descriptive Statistics
Faculty of Civil and Environmental Engineering
Universiti Tun Hussein Onn Malaysia
What is ‘statistics’?
Statistics is the science that deals with collecting, classifying, presenting,
describing, analysing and interpreting data to enable us to draw
conclusions and make reasonable decisions.
It can be divided into two categories:

a) Descriptive statistics
b) Inferential statistics
Descriptive statistics
The activity of collecting, classifying, presenting and describing
quantitative data.
Methods for organising (frequency table), representing (graphs) and
summarising data (central tendency and variability).
Inferential statistics
The part dealing with the techniques and methods for interpreting the
results obtained from descriptive statistics.
Population
Population is the entire (complete) collection of data whose properties
are analysed.
It contains all the subjects of interest.
Can be of any size; its items need not be uniform but must share at least
one measurable feature.
Sample
A portion of the population selected for study.
A sample is any set of entities, cases, subjects, items or experimental
units chosen from the population.
Random Sample
A random sample is a sample selected in such a way that each element
of the population has the same chance of being selected.
Parameter
Parameter is a numerical measurement describing some characteristics
of a population.
Eg. the population mean and variance
Statistic
Statistic is a numerical measurement describing some characteristics of
a sample.
Eg. the sample mean and variance
Variable
Any measured characteristic or attribute that differs for different
elements.
For example, if the weights of 30 people were measured, then weight
would be a variable.
Can be classified as quantitative or qualitative.
Quantitative Variable
The variable being studied is numeric and measured on an ordinal,
interval or ratio scale.
Eg. Ambient temperature, vehicular speed and walking distance.
Qualitative Variable
The variable being studied is non-numeric and measured on a nominal
scale.
Also called ‘categorical’ variable.
Eg. Gender, eye colour and educational level.
Scales of measurement:
Ordinal: data are ranked.
Interval: meaningful differences between values.
Ratio: meaningful zero point and ratio between values.
Nominal: data may only be classified.
Examples: position in race, exam grade, clothes size, temperature,
distance to class, number of patients seen, gender, marital status.
Data
A set of data is a collection of observations, measurements or
information obtained for a study.
It can be classified as qualitative data or quantitative data.
Quantitative Data
• Data that can be measured numerically or counted.
• Can be either continuous data or discrete data.
• Eg. length, time, mass, temperature
Qualitative Data
• Data that are not in numerical form but instead assigned as attributes.
• Eg. race, age, gender, marital status
Discrete Data
• Data that can only take exact and countable values.
• Eg. number of students in a class, number of cars sold in a day,
number of persons in a family.
Continuous Data
• Data that take any value over a certain interval and can be measured to
a certain degree of accuracy (correct to certain decimal places).
• Eg. weight of students in a class, time taken to complete race, fat
content in canned food.
Ungrouped and Grouped Data
Ungrouped Data
Raw data that have not been grouped into class intervals.
The data may be listed individually or arranged in order as a frequency
distribution of single values.
Example:
Weight of seven students: 56, 74, 68, 90, 52, 48, 65
Number of cars owned per household:
No. of cars owned 0 1 2 3
No. of households 6 28 12 4
Grouped Data
Data is grouped according to class intervals before the frequency
distribution is assigned.
Example:
Height of students in a class:
Height (cm) 150-159 160-169 170-179 180-189
No. of students 5 11 21 8
Measures of Location
Mean, median, mode, quartile and percentile.
Measures of Central Tendency
Central tendency is a statistical measure that determines a single value
that accurately describes the center of the distribution and represents
the entire distribution of scores.
The goal is to identify the single value that is the best representative
for the entire set of data.
Measures of central tendency: mean, median and mode.
Mode
• The mode is the most frequent score in a data set.
Median
• The median is the middle score for a data set that has been arranged in
order of magnitude.
Mean
• The mean (or average) is equal to the sum of all the values in a data
set divided by the number of values in the data set.
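As a minimal sketch, the three measures can be computed with Python's standard statistics module. The data are the seven student weights and the car-ownership table from the ungrouped data example in this chapter; the variable names are illustrative only.

```python
import statistics

# Weights of seven students (ungrouped data example from this chapter).
weights = [56, 74, 68, 90, 52, 48, 65]

# Car-ownership frequency table expanded into raw observations:
# 6 households own 0 cars, 28 own 1, 12 own 2 and 4 own 3.
cars_owned = [0] * 6 + [1] * 28 + [2] * 12 + [3] * 4

mean_weight = statistics.mean(weights)      # sum of values / number of values
median_weight = statistics.median(weights)  # middle value after sorting
mode_cars = statistics.mode(cars_owned)     # most frequent value

print(f"Mean weight = {mean_weight:.2f}")
print(f"Median weight = {median_weight}")
print(f"Mode of cars owned = {mode_cars}")
```

The mode is taken from the car-ownership data because every weight occurs only once, so the weights have no single most frequent value.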
Quartiles
Quartiles are values that divide a data set into four parts containing an
approximately equal number of observations.
The total of 100% is split into four equal parts (four quarters):
25% | Q1 | 25% | Q2 | 25% | Q3 | 25%
Interquartile Range = Q3 – Q1
First quartile (Q1) or lower quartile

Second quartile (Q2) or middle quartile, which is also the median
Third quartile (Q3) or upper quartile
Percentiles
Percentiles divide a set of data which are arranged in ascending order
into 100 equal parts.
A percentile is a measure used to indicate the value below which a given
percentage of observations in a group of observations fall.
For example, the 25th percentile is the value below which 25% of the
observations may be found.
Note:
25th percentile (P25) = First quartile (Q1)
50th percentile (P50) = Second quartile (Q2), which is also the median
75th percentile (P75) = Third quartile (Q3)
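The relationship between quartiles and percentiles can be checked numerically. The sketch below uses statistics.quantiles from the Python standard library on the seven student weights from the ungrouped data example. Several conventions exist for locating quartiles in small data sets; the library's default "exclusive" method is only one of them, so hand calculations made with a different rule may give slightly different values.

```python
import statistics

# Weights of seven students (ungrouped data example from this chapter).
weights = [56, 74, 68, 90, 52, 48, 65]

# Quartiles: three cut points that split the ordered data into four parts.
# statistics.quantiles sorts the data itself; its default "exclusive"
# method is one of several accepted conventions.
q1, q2, q3 = statistics.quantiles(weights, n=4)
iqr = q3 - q1  # interquartile range = Q3 - Q1

# Percentiles: 99 cut points; index 24 is P25, index 49 is P50, index 74 is P75.
percentiles = statistics.quantiles(weights, n=100)
p25, p50, p75 = percentiles[24], percentiles[49], percentiles[74]

print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {iqr}")
print(f"P25 = {p25}, P50 = {p50}, P75 = {p75}")
print("Q2 equals the median:", q2 == statistics.median(weights))
```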
Measures of Dispersion
Range, variance and standard deviation.
Measures of dispersion (or variation) describe how spread out a set of
data is, or the extent of the variability in individual items of the distribution.
Let us look at the following data sets to see how measures of central
tendency differ from measures of dispersion:
Data Set 1: 6, 7, 8, 6, 9, 6 (Mean = 7) (Range = 9 – 6 = 3)
Data Set 2: 5, 7, 2, 6, 13, 9 (Mean = 7) (Range = 13 – 2 = 11)
Most of the numbers in data set 1 are close to the mean value, while in
data set 2 the numbers are spread away from the mean. The difference in
the spread can be determined by a measure of dispersion.
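A short sketch of this comparison, with the range computed as the largest value minus the smallest value. The helper names mean and value_range are illustrative, not part of the lecture notes.

```python
data_set_1 = [6, 7, 8, 6, 9, 6]
data_set_2 = [5, 7, 2, 6, 13, 9]

def mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)

def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

for name, data in (("Data Set 1", data_set_1), ("Data Set 2", data_set_2)):
    print(f"{name}: mean = {mean(data)}, range = {value_range(data)}")
```

Both data sets have the same mean, but the second has a much larger range, which is the difference in spread described above.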
However, range is not a good measure of dispersion because it is
influenced by the extreme values and the calculation does not cover all
observations.
Variance and standard deviation are the most useful and widely used
measures of dispersion. Although they are influenced by the extreme
values, the calculations cover all the observations.
Variance
Variance (σ² or s²) is the average of the squared differences from the
mean.
Standard Deviation
Standard deviation (σ or s) is a measure of dispersion of observations
within a data set. It is simply the square root of the variance.
If the observations are all close to the mean, then the standard deviation
is close to zero.
If many observations are far from the mean, then the standard deviation
is far from zero.
If all the observations are equal, then the standard deviation is zero.
The equation for variance ($s^2$) is given below:
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
The equation for standard deviation ($s$) is given below:
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
where $\bar{x}$ is the mean and $n$ is the number of observations.
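The sample formulas above (sum of squared deviations from the mean, divided by n − 1) can be sketched in a few lines of Python. The function names are illustrative; the two data sets are the ones used earlier to compare spread, and the results are cross-checked against the standard statistics module.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def sample_std_dev(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))

data_set_1 = [6, 7, 8, 6, 9, 6]
data_set_2 = [5, 7, 2, 6, 13, 9]

for name, data in (("Data Set 1", data_set_1), ("Data Set 2", data_set_2)):
    s2 = sample_variance(data)
    s = sample_std_dev(data)
    # Cross-check against the standard library implementations.
    assert math.isclose(s2, statistics.variance(data))
    assert math.isclose(s, statistics.stdev(data))
    print(f"{name}: variance = {s2:.2f}, standard deviation = {s:.2f}")
```

Although both data sets have a mean of 7, the second one produces the larger variance and standard deviation because its values lie farther from the mean.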
Stem-and-Leaf Diagram
A stem-and-leaf diagram (or display) is a method for presenting
quantitative data in a graphical format to assist in visualising the shape
of a distribution.
The "stem" is the first digit or digits, and the "leaf" is the last digit.
To construct a stem-and-leaf diagram:
1. Arrange the data in order of magnitude (ascending order).
2. Place the stems in order, vertically from smallest to largest.
3. Place the leaves in order, in each row from smallest to largest.
4. Create a key for the stem-and-leaf diagram so that people know how
to interpret the diagram.
Online tutorial: https://www.youtube.com/watch?v=_7m0Q_m2ppg
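The construction steps can be sketched as a small Python function. This is an illustration under stated assumptions: the data are non-negative whole numbers, the stem is the tens part and the leaf is the units digit. The function name stem_and_leaf is hypothetical, and the data are the seven student weights from the ungrouped data example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the rows of a stem-and-leaf diagram for non-negative integers.

    Assumption: the stem is the tens part and the leaf is the units digit.
    """
    ordered = sorted(values)  # step 1: arrange the data in ascending order
    leaves_by_stem = defaultdict(list)
    for value in ordered:
        stem, leaf = divmod(value, 10)
        leaves_by_stem[stem].append(leaf)  # step 3: leaves stay in ascending order

    rows = []
    # Step 2: list every stem from smallest to largest, including empty ones.
    for stem in range(ordered[0] // 10, ordered[-1] // 10 + 1):
        leaves = " ".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        rows.append(f"{stem} | {leaves}".rstrip())
    return rows

weights = [56, 74, 68, 90, 52, 48, 65]
diagram = stem_and_leaf(weights)

print("Stem | Leaf")
print("\n".join(diagram))
print("Key: 5 | 2 means 52")  # step 4: key for interpreting the diagram
```

Empty stems are kept in the output so that gaps in the data remain visible when the diagram is used to judge the shape of the distribution.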
Distribution of Data
A symmetric (bell-shaped) curve is one in which the two sides of the
distribution would exactly match each other if the figure were folded over
its central point. This is called a normal distribution.
An example is shown below:
A distribution is said to be skewed to the right, or positively skewed,
when most of the data are concentrated on the left of the distribution.
The right tail clearly extends farther from the distribution's centre than
the left tail, as shown below:
[Figure: positively skewed distribution. Most data on the left; right tail elongated.]
A distribution is said to be skewed to the left, or negatively skewed, if
most of the data are concentrated on the right of the distribution.
The left tail clearly extends farther from the distribution's centre than the
right tail, as shown below:
[Figure: negatively skewed distribution. Most data on the right; left tail elongated.]
Interpreting Distribution of Data from Stem-and-Leaf Diagram
If the stem-and-leaf diagram is turned on its side, it will look like the
following:
The distribution shows that most data are clustered at the right. The left
tail extends farther from the data centre than the right tail. Therefore, the
distribution is skewed to the left or negatively skewed.
Box-and-Whisker Plot
A box-and-whisker plot (also called a box plot) displays the five-number
summary of a set of data.
The five-number summary is the:

1. Minimum
2. First quartile
3. Second quartile (median)
4. Third quartile
5. Maximum
In a box plot, we draw a box from the first quartile to the third quartile. A
vertical line goes through the box at the median.
[Figure: horizontal and vertical box-and-whisker plots on a 0 to 70 scale, with min, Q1, Q2, Q3 and max marked.]
To construct a box-and-whisker plot:
1. Determine the five-number summary.
2. Draw a horizontal axis on which the numbers obtained in step 1 can
be located. Above this axis, mark the five-number summary with
vertical lines.
3. Connect the quartiles to each other to make a box, and then connect
the box to the maximum and minimum lines.
4. Calculate the values of the upper and lower inner fences to determine
whether the data have outliers.
Upper inner fence = Q3 + 1.5*(Q3 – Q1)
Lower inner fence = Q1 – 1.5*(Q3 – Q1)
Online tutorial: https://www.youtube.com/watch?v=o7qWblT5NZI
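The numerical part of this procedure (the five-number summary, the inner fences and the outlier check) can be sketched in Python. The observations below are hypothetical values invented for illustration, and the helper names are not from the lecture notes. Quartiles come from statistics.quantiles with its default method, which is one of several conventions and may differ slightly from a hand calculation.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, Q2, Q3, maximum)."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)

def inner_fences(q1, q3):
    """Return (lower inner fence, upper inner fence)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical observations, invented for illustration only.
data = [12, 15, 18, 20, 22, 25, 27, 30, 33, 36, 90]

minimum, q1, q2, q3, maximum = five_number_summary(data)
lower_fence, upper_fence = inner_fences(q1, q3)

# Any observation outside the inner fences is flagged as an outlier.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"min = {minimum}, Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, max = {maximum}")
print(f"Lower inner fence = {lower_fence}, upper inner fence = {upper_fence}")
print(f"Outliers: {outliers}")
```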
[Figure: box plot on a 10 to 100 scale with min, Q1, Q2, Q3, max and the lower and upper inner fences marked.]
The data lie within the upper and lower inner fences, so the data have no outliers.
[Figure: box plot on a 10 to 100 scale with one observation beyond the inner fences, marked as an outlier.]
An observation that lies outside the fences is known as an outlier.
Shape of Distribution: Symmetry and Skewness
The diagram below shows a symmetrical distribution (normal
distribution). The ‘whiskers’ are the same length and the median (Q2) is
in the centre of the box.
[Figure: box plot with min, Q1, Q2, Q3 and max marked.]
The diagram below shows a positively skewed distribution (skewed to
the right). The left ‘whisker’ is shorter than the right ‘whisker’ and the
median (Q2) is nearer to Q1.
[Figure: box plot with min, Q1, Q2, Q3 and max marked.]
The diagram below shows a negatively skewed distribution (skewed
to the left). The left ‘whisker’ is longer than the right ‘whisker’ and the
median (Q2) is nearer to Q3.
[Figure: box plot with min, Q1, Q2, Q3 and max marked.]
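The whisker and median rules described in this section can be written as a simple check on the five-number summary. This is a sketch only: the function name box_plot_shape and the three example summaries are hypothetical. Real data rarely give exactly equal whiskers, so a tolerance would be needed in practice, and mixed cases are reported as inconclusive.

```python
def box_plot_shape(minimum, q1, q2, q3, maximum):
    """Classify the shape of a distribution from its five-number summary."""
    left_whisker = q1 - minimum
    right_whisker = maximum - q3
    left_box = q2 - q1    # distance from Q1 to the median
    right_box = q3 - q2   # distance from the median to Q3

    if left_whisker == right_whisker and left_box == right_box:
        return "symmetrical"
    if left_whisker < right_whisker and left_box < right_box:
        return "positively skewed"  # median nearer to Q1, longer right whisker
    if left_whisker > right_whisker and left_box > right_box:
        return "negatively skewed"  # median nearer to Q3, longer left whisker
    return "inconclusive"

# Hypothetical five-number summaries, invented for illustration only.
print(box_plot_shape(10, 30, 40, 50, 70))
print(box_plot_shape(10, 15, 20, 35, 70))
print(box_plot_shape(10, 45, 60, 65, 70))
```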
Analysing Grouped Data
Measures of location: mean, median, mode, quartile, decile and percentile.
Measures of dispersion: range, interquartile range, variance and standard deviation.
Formula
Mean, $\bar{x} = \dfrac{\sum fx}{\sum f}$
where x = data and f = frequency
Mode $= L_m + \left(\dfrac{d_1}{d_1 + d_2}\right) c$
where Lm = lower boundary of the class containing the mode
d1 = difference between the frequency of the mode class and the
frequency of the class immediately before it
d2 = difference between the frequency of the mode class and the
frequency of the class immediately after it
c = size of the mode class
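As a worked sketch, the grouped mean and mode can be computed for the student height table from the grouped data example. Two assumptions are made here that the notes do not state explicitly: x is taken as the class midpoint, and class boundaries lie 0.5 cm beyond the stated class limits. The mode uses the standard grouped-data form Lm + d1 / (d1 + d2) × c with the symbols defined above.

```python
# Height classes (cm) and frequencies from the grouped data example:
# (lower limit, upper limit, number of students).
classes = [(150, 159, 5), (160, 169, 11), (170, 179, 21), (180, 189, 8)]

# Assumption: heights are recorded to the nearest cm, so class boundaries
# lie 0.5 below the lower limit and 0.5 above the upper limit.
lower_boundaries = [lo - 0.5 for lo, hi, f in classes]
class_sizes = [(hi + 0.5) - (lo - 0.5) for lo, hi, f in classes]
midpoints = [(lo + hi) / 2 for lo, hi, f in classes]
frequencies = [f for lo, hi, f in classes]

# Mean = sum(f * x) / sum(f), with x taken as the class midpoint.
grouped_mean = sum(f * x for f, x in zip(frequencies, midpoints)) / sum(frequencies)

# Mode = Lm + d1 / (d1 + d2) * c for the class with the highest frequency.
m = frequencies.index(max(frequencies))
f_before = frequencies[m - 1] if m > 0 else 0
f_after = frequencies[m + 1] if m < len(frequencies) - 1 else 0
d1 = frequencies[m] - f_before
d2 = frequencies[m] - f_after
grouped_mode = lower_boundaries[m] + d1 / (d1 + d2) * class_sizes[m]

print(f"Grouped mean = {grouped_mean:.2f} cm")
print(f"Grouped mode = {grouped_mode:.2f} cm")
```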
Median $= L_m + \left(\dfrac{\frac{n}{2} - F_L}{f_m}\right) c$
where Lm = lower boundary of the class containing the median
n = total number of observations
FL = cumulative frequency of the class before the median class
fm = frequency of the median class
c = size of the median class
Quartile, $Q_k = L_k + \left(\dfrac{\frac{kn}{4} - F_L}{f_k}\right) c_k$
Percentile, $P_k = L_k + \left(\dfrac{\frac{kn}{100} - F_L}{f_k}\right) c_k$
Decile, $D_k = L_k + \left(\dfrac{\frac{kn}{10} - F_L}{f_k}\right) c_k$
where k = 1, 2, 3, …
Lk = lower boundary of the class where Qk, Pk, Dk lies
n = total number of observations
FL = cumulative frequency of the class before the Qk, Pk, Dk class
fk = frequency of the class where Qk, Pk, Dk lies
ck = size of the class where Qk, Pk, Dk lies
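The median, quartile, decile and percentile formulas share one pattern: find the class where the cumulative frequency first reaches the target position, then interpolate inside that class. The sketch below applies this pattern to the student height table from the grouped data example. The function name grouped_quantile is hypothetical, and class boundaries are assumed to lie 0.5 cm beyond the stated class limits.

```python
# Height classes (cm) and frequencies from the grouped data example:
# (lower limit, upper limit, number of students).
classes = [(150, 159, 5), (160, 169, 11), (170, 179, 21), (180, 189, 8)]

def grouped_quantile(classes, fraction):
    """Value below which `fraction` of the observations fall.

    Implements L + ((fraction * n - F_L) / f) * c, assuming class
    boundaries 0.5 beyond the stated integer class limits.
    """
    n = sum(f for _, _, f in classes)
    target = fraction * n      # n/2 for the median, kn/4 for Qk, and so on
    cumulative = 0             # F_L: cumulative frequency before the class
    for lower, upper, f in classes:
        if f > 0 and cumulative + f >= target:
            lower_boundary = lower - 0.5
            size = (upper + 0.5) - lower_boundary
            return lower_boundary + (target - cumulative) / f * size
        cumulative += f
    raise ValueError("fraction must be between 0 and 1")

median = grouped_quantile(classes, 1 / 2)
q1 = grouped_quantile(classes, 1 / 4)
q3 = grouped_quantile(classes, 3 / 4)
p90 = grouped_quantile(classes, 90 / 100)

print(f"Median = {median:.2f} cm")
print(f"Q1 = {q1:.2f} cm, Q3 = {q3:.2f} cm, IQR = {q3 - q1:.2f} cm")
print(f"P90 = {p90:.2f} cm")
```

Using a single function with a fraction argument avoids repeating the same interpolation for Qk (k/4), Dk (k/10) and Pk (k/100).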
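Variance, $s^2 = \dfrac{\sum fx^2 - \dfrac{(\sum fx)^2}{\sum f}}{\sum f - 1}$
Standard Deviation, $s = \sqrt{\dfrac{\sum fx^2 - \dfrac{(\sum fx)^2}{\sum f}}{\sum f - 1}}$
where x = data and f = frequency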
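A sketch of the grouped variance and standard deviation for the student height table from the grouped data example. It assumes the usual computational form, s² = (Σfx² − (Σfx)² / Σf) / (Σf − 1), with x taken as the class midpoint, and cross-checks it against the definitional form based on squared deviations from the mean.

```python
import math

# Height classes (cm) and frequencies from the grouped data example:
# (lower limit, upper limit, number of students).
classes = [(150, 159, 5), (160, 169, 11), (170, 179, 21), (180, 189, 8)]

# Assumption: x is the class midpoint.
midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
frequencies = [f for _, _, f in classes]

sum_f = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
sum_fx2 = sum(f * x ** 2 for f, x in zip(frequencies, midpoints))

# Computational form: s^2 = (sum fx^2 - (sum fx)^2 / sum f) / (sum f - 1)
grouped_variance = (sum_fx2 - sum_fx ** 2 / sum_f) / (sum_f - 1)
grouped_std_dev = math.sqrt(grouped_variance)

# Definitional form, used only as a cross-check.
grouped_mean = sum_fx / sum_f
check = sum(
    f * (x - grouped_mean) ** 2 for f, x in zip(frequencies, midpoints)
) / (sum_f - 1)

print(f"Variance = {grouped_variance:.2f} cm^2")
print(f"Standard deviation = {grouped_std_dev:.2f} cm")
print("Forms agree:", math.isclose(grouped_variance, check))
```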
