
BFC 34303

CIVIL ENGINEERING STATISTICS


Chapter 1
Descriptive Statistics
Faculty of Civil and Environmental Engineering
Universiti Tun Hussein Onn Malaysia

What is ‘statistics’?
Statistics is the science that deals with collecting, classifying, presenting,
describing, analysing and interpreting data to enable us to draw
conclusions and make reasonable decisions.

It can be divided into two categories:


a) Descriptive statistics
b) Inferential statistics

Descriptive statistics
The activity of collecting, classifying, presenting and describing
quantitative data.
Methods for organising (frequency table), representing (graphs) and
summarising data (central tendency and variability).

Inferential statistics
The part of statistics dealing with the techniques and methods for
interpreting the results obtained from descriptive statistics.

Population
Population is the entire (complete) collection of data whose properties are analysed.
It contains all the subjects of interest.
It can be of any size; its items need not be uniform but must share at least one measurable feature.

Sample
A sample is a portion of the population selected for study.
A sample is any set of entities, cases, subjects, items or experimental units chosen from the population.

Random Sample
A random sample is a sample selected in such a way that each element
of the population has the same chance of being selected.
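
As a rough illustration of simple random sampling, the Python sketch below draws a sample without replacement so that every element has the same chance of being selected. The population of 100 numbered items, the sample size and the seed are assumptions made only for this example.

```python
import random

def draw_random_sample(population, sample_size, seed=None):
    """Return a simple random sample drawn without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(population, sample_size)

# Hypothetical population: items numbered 1 to 100.
population = list(range(1, 101))
sample = draw_random_sample(population, sample_size=10, seed=1)
print(sample)
```

Any statistic computed from `sample` (for example its mean) then estimates the corresponding parameter of `population`.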

Parameter
A parameter is a numerical measurement describing some characteristic
of a population.
E.g. the population mean and variance.

Statistic
A statistic is a numerical measurement describing some characteristic of
a sample.
E.g. the sample mean and variance.

Variable
Any measured characteristic or attribute that differs for different
elements.
For example, if the weights of 30 people were measured, then weight
would be a variable.
A variable can be classified as quantitative or qualitative.

Quantitative Variable
The variable being studied is numeric and measured on an interval or
ratio scale.
E.g. ambient temperature, vehicular speed and walking distance.

Qualitative Variable
The variable being studied is non-numeric and measured on a nominal or
ordinal scale.
Also called a ‘categorical’ variable.
E.g. gender, eye colour and educational level.

Scale      Property                                          Examples
Ordinal    Data are ranked                                   Position in race, exam grade, clothes size
Interval   Meaningful differences between values             Temperature
Ratio      Meaningful zero point and ratio between values    Distance to class, number of patients seen
Nominal    Data may only be classified                       Gender, marital status

Data
A set of data is a collection of observations, measurements or
information obtained for a study.
It can be classified as qualitative data or quantitative data.

Quantitative Data
• Data that can be measured numerically or counted.
• Can be either continuous data or discrete data.
• E.g. length, time, mass, temperature.

Qualitative Data
• Data that are not in numerical form but are instead assigned as attributes.
• E.g. race, gender, marital status.
Discrete Data
• Data that can only take exact and countable values.
• E.g. number of students in a class, number of cars sold in a day, number of persons in a family.

Continuous Data
• Data that take any value over a certain interval and can be measured to a certain degree of accuracy (correct to a certain number of decimal places).
• E.g. weight of students in a class, time taken to complete a race, fat content in canned food.

Ungrouped and Grouped Data

Ungrouped Data
Raw data that have not been grouped into class intervals.
The values may be listed individually or arranged in a frequency distribution of single values.
Example:
Weight of seven students: 56, 74, 68, 90, 52, 48, 65
Number of cars owned per household:

No. of cars owned    0    1    2    3
No. of households    6   28   12    4

Grouped Data
Data are grouped into class intervals, and a frequency is assigned to
each interval.
Example:
Height of students in a class:

Height (cm)        150-159   160-169   170-179   180-189
No. of students          5        11        21         8
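
A minimal Python sketch of how such frequency tables could be tallied from raw observations. The raw lists `cars` and `heights_cm` are hypothetical values invented for illustration (they are not the data behind the tables above), and the function names are likewise assumptions.

```python
from collections import Counter

def ungrouped_frequency(values):
    """Count how often each individual value occurs, in ascending order."""
    counts = Counter(values)
    return dict(sorted(counts.items()))

def grouped_frequency(values, start, width, n_classes):
    """Count values falling in consecutive integer class intervals.

    Classes are start..start+width-1, start+width..start+2*width-1, and so on.
    """
    table = {}
    for i in range(n_classes):
        lower = start + i * width
        upper = lower + width - 1
        table[f"{lower}-{upper}"] = sum(1 for v in values if lower <= v <= upper)
    return table

# Hypothetical raw data, invented for illustration only.
cars = [0, 1, 1, 2, 1, 3, 0, 2, 1, 1]
heights_cm = [152, 158, 161, 165, 169, 170, 171, 174, 178, 183]

print(ungrouped_frequency(cars))
print(grouped_frequency(heights_cm, start=150, width=10, n_classes=4))
```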


Measures of Location
The measures of location are the mean, median, mode, quartile and percentile.

Measures of Central Tendency
Central tendency is a statistical measure that determines a single value that accurately describes the centre of the distribution and represents the entire distribution of scores.
The goal is to identify the single value that best represents the entire set of data.
The measures of central tendency are the mean, median and mode.

Mode
• The mode is the most frequent score in a data set.

Median
• The median is the middle score for a data set that has been arranged in order of magnitude.

Mean
• The mean (or average) is equal to the sum of all the values in a data set divided by the number of values in the data set.
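
The three definitions can be checked with Python's standard `statistics` module. This is only a small sketch; the six values are illustrative.

```python
import statistics

data = [6, 7, 8, 6, 9, 6]  # illustrative values

mean = statistics.mean(data)      # sum of the values / number of values
median = statistics.median(data)  # middle of the sorted values
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)
```

Because there is an even number of values, the median is the average of the two middle values of the sorted list (6 and 7), giving 6.5. The mean is 7 and the mode is 6.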

Quartiles
Quartiles are values that divide a data set into four parts containing an
approximately equal number of observations.
The total of 100% is split into four equal parts (four quarters):

25%  |Q1|  25%  |Q2|  25%  |Q3|  25%

First quartile (Q1) or lower quartile
Second quartile (Q2) or middle quartile, which is also the median
Third quartile (Q3) or upper quartile

Interquartile Range = Q3 – Q1

Percentiles
Percentiles divide a set of data which are arranged in ascending order
into 100 equal parts.
A percentile is a measure used to indicate the value below which a given
percentage of observations in a group of observations fall.
For example, the 25th percentile is the value below which 25% of the
observations may be found.
Note:
25th percentile (P25) = First quartile (Q1)
50th percentile (P50) = Second quartile (Q2), which is also the median
75th percentile (P75) = Third quartile (Q3)
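
The relationship between quartiles and percentiles can be sketched with `statistics.quantiles` from the Python standard library, applied to the seven student weights from the ungrouped-data example. Note that several conventions exist for locating quartiles in a small data set; this sketch uses the function's default ("exclusive") method, so other textbooks or software may give slightly different values.

```python
import statistics

weights = [56, 74, 68, 90, 52, 48, 65]  # weights of seven students

# n=4 returns the three quartile cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(weights, n=4)
iqr = q3 - q1  # interquartile range

# n=100 returns the 99 percentile cut points P1..P99.
percentiles = statistics.quantiles(weights, n=100)
p25, p50, p75 = percentiles[24], percentiles[49], percentiles[74]

print(q1, q2, q3, iqr)
print(p25, p50, p75)
```

With this method Q1 = 52, Q2 = 65 (the median) and Q3 = 74, so the interquartile range is 22, and P25, P50 and P75 coincide with Q1, Q2 and Q3.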

Measures of Dispersion
The measures of dispersion are the range, variance and standard deviation.
Measures of dispersion (or variation) describe how spread out a set of
data is, or the extent of the variability in individual items of the distribution.
Let us look at the following data sets to see how measures of central
tendency differ from measures of dispersion:

Data Set 1: 6, 7, 8, 6, 9, 6 (Mean = 7) (Range = 9 – 6 = 3)
Data Set 2: 5, 7, 2, 6, 13, 9 (Mean = 7) (Range = 13 – 2 = 11)

Most of the numbers in data set 1 are close to the mean value, while in
data set 2 the numbers are spread away from the mean. The difference in
the spread can be determined by a measure of dispersion.
However, the range is not a good measure of dispersion because it is
influenced by extreme values and its calculation does not use all the
observations.
Variance and standard deviation are the most useful and widely used
measures of dispersion. Although they are also influenced by extreme
values, their calculations use all the observations.


Variance
Variance (σ² for a population or s² for a sample) is the average of the
squared differences from the mean.

Standard Deviation
Standard deviation (σ for a population or s for a sample) is a measure of
dispersion of observations within a data set. It is simply the square root
of the variance.
If the observations are all close to the mean, then the standard deviation
is close to zero.
If many observations are far from the mean, then the standard deviation
is far from zero.
If all the observations are equal, then the standard deviation is zero.
The equation for variance (s²) is given below:

$$ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} $$

The equation for standard deviation (s) is given below:

$$ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} $$

where $\bar{x}$ is the mean and n is the number of observations.
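
As a worked check, the short Python sketch below applies the sample variance and standard deviation (with the n − 1 divisor) to Data Set 1 and Data Set 2 from the earlier comparison. The function names are chosen for this example only.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_std(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))

data_set_1 = [6, 7, 8, 6, 9, 6]
data_set_2 = [5, 7, 2, 6, 13, 9]

for name, data in (("Data Set 1", data_set_1), ("Data Set 2", data_set_2)):
    print(name, round(sample_variance(data), 4), round(sample_std(data), 4))
```

Both sets have a mean of 7, but the variances are about 1.6 and 14 respectively, which quantifies how much more widely the second set is spread.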


Stem-and-Leaf Diagram

A stem-and-leaf diagram (or display) is a method for presenting quantitative data in a graphical format to assist in visualising the shape of a distribution.
The "stem" is the first digit or digits, and the "leaf" is the last digit.

[Figure: example stem-and-leaf diagram with a Stem column and a Leaf column]

To construct a stem-and-leaf diagram:
1. Arrange the data in order of magnitude (ascending order).
2. Place the stems in order, vertically from smallest to largest.
3. Place the leaves in order, in each row from smallest to largest.
4. Create a key for the stem-and-leaf diagram so that people know how
to interpret the diagram.
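
The four steps above can be sketched in Python as follows. This is a minimal illustration that assumes non-negative whole numbers, with the last digit as the leaf; the function names are hypothetical, and the data are the seven student weights from the ungrouped-data example.

```python
def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted leaves (last digit)."""
    ordered = sorted(values)                                # step 1
    stems = range(ordered[0] // 10, ordered[-1] // 10 + 1)
    diagram = {stem: [] for stem in stems}                  # step 2 (empty stems kept)
    for value in ordered:                                   # step 3
        diagram[value // 10].append(value % 10)
    return diagram

def format_stem_and_leaf(diagram):
    """Render the diagram as text, ending with a key (step 4)."""
    lines = ["Stem | Leaf"]
    for stem, leaves in diagram.items():
        lines.append(f"{stem:>4} | {' '.join(str(leaf) for leaf in leaves)}")
    # Build the key from the first stem that has at least one leaf.
    stem, leaves = next((s, l) for s, l in diagram.items() if l)
    lines.append(f"Key: {stem} | {leaves[0]} means {stem * 10 + leaves[0]}")
    return "\n".join(lines)

weights = [56, 74, 68, 90, 52, 48, 65]
print(format_stem_and_leaf(stem_and_leaf(weights)))
```

Keeping empty stems (here the stem 8) preserves gaps in the data, so the shape of the distribution is not distorted.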

Online tutorial: https://www.youtube.com/watch?v=_7m0Q_m2ppg


Distribution of Data
A symmetric distribution is one in which the two sides of the curve
would match exactly if the figure were folded over its central point.
The bell-shaped normal distribution is an example.

[Figure: symmetric, bell-shaped curve]

A distribution is said to be skewed to the right, or positively skewed,
when most of the data are concentrated on the left of the distribution.
The right tail clearly extends farther from the distribution's centre than
the left tail, as shown below:

[Figure: positively skewed distribution, with most data on the left and an elongated right tail]


A distribution is said to be skewed to the left, or negatively skewed, if
most of the data are concentrated on the right of the distribution.
The left tail clearly extends farther from the distribution's centre than the
right tail, as shown below:

[Figure: negatively skewed distribution, with most data on the right and an elongated left tail]

Interpreting Distribution of Data from Stem-and-Leaf Diagram
If the stem-and-leaf diagram is turned on its side, it will look like the
following:

[Figure: stem-and-leaf diagram rotated onto its side]

The distribution shows that most data are clustered at the right. The left
tail extends farther from the data centre than the right tail. Therefore, the
distribution is skewed to the left or negatively skewed.

Box-and-Whisker Plot
A box-and-whisker plot (also called a box plot) displays the five-number
summary of a set of data.

The five-number summary is the:
1. Minimum
2. First quartile
3. Second quartile (median)
4. Third quartile
5. Maximum

In a box plot, we draw a box from the first quartile to the third quartile. A
vertical line goes through the box at the median.

[Figure: horizontal and vertical box-and-whisker plots, each marking min, Q1, Q2, Q3 and max on a scale from 0 to 70]

To construct a box-and-whisker plot:
1. Determine the five-number summary.
2. Draw a horizontal axis on which the numbers obtained in step 1 can
be located. Above this axis, mark the five-number summary with
vertical lines.
3. Connect the quartiles to each other to make a box, and then connect
the box to the maximum and minimum lines.
4. Calculate the values of the upper and lower inner fences to determine
whether the data have outliers.

Upper inner fence = Q3 + 1.5 × (Q3 – Q1)
Lower inner fence = Q1 – 1.5 × (Q3 – Q1)
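
A small Python sketch of the five-number summary and the inner-fence check. It assumes the quartiles are computed with `statistics.quantiles` (default method), which is only one of several quartile conventions; the function names are hypothetical, and the extra value 150 is invented purely to show an outlier.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, Q2, Q3, maximum)."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)

def inner_fences(values):
    """Return (lower inner fence, upper inner fence)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values):
    """Return the observations lying outside the inner fences."""
    lower, upper = inner_fences(values)
    return [v for v in values if v < lower or v > upper]

weights = [56, 74, 68, 90, 52, 48, 65]  # weights of seven students
print(five_number_summary(weights))
print(inner_fences(weights))
print(find_outliers(weights))           # no value lies outside the fences
print(find_outliers(weights + [150]))   # hypothetical extra value, flagged as an outlier
```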

Online tutorial: https://www.youtube.com/watch?v=o7qWblT5NZI

[Figure: box-and-whisker plot on a scale from 10 to 100, with the lower and upper inner fences marked and min and max inside them]
The data lie within the upper and lower inner fences, so the data have no outliers.

[Figure: box-and-whisker plot on a scale from 10 to 100, with one observation beyond an inner fence marked as an outlier]
An observation that lies outside the inner fences is known as an outlier.

Shape of Distribution: Symmetry and Skewness

The diagram below shows a symmetrical distribution (normal
distribution). The ‘whiskers’ are the same length and the median (Q2) is
in the centre of the box.

[Figure: box-and-whisker plot of a symmetrical distribution, marking min, Q1, Q2, Q3 and max]

The diagram below shows a positively skewed distribution (skewed to
the right). The left ‘whisker’ is shorter than the right ‘whisker’ and the
median (Q2) is nearer to Q1.

[Figure: box-and-whisker plot of a positively skewed distribution, marking min, Q1, Q2, Q3 and max]

The diagram below shows a negatively skewed distribution (skewed
to the left). The left ‘whisker’ is longer than the right ‘whisker’ and the
median (Q2) is nearer to Q3.

[Figure: box-and-whisker plot of a negatively skewed distribution, marking min, Q1, Q2, Q3 and max]
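
The three box-plot rules (symmetric, positively skewed, negatively skewed) can be written as a simple check on the five-number summary. This is a sketch only: the function name and the three example summaries are hypothetical, and real data rarely give exactly equal whiskers, so a tolerance may be needed in practice.

```python
def box_plot_shape(minimum, q1, q2, q3, maximum):
    """Classify the shape of a distribution from its five-number summary."""
    left_whisker = q1 - minimum
    right_whisker = maximum - q3
    lower_box = q2 - q1   # distance from Q1 to the median
    upper_box = q3 - q2   # distance from the median to Q3

    if left_whisker == right_whisker and lower_box == upper_box:
        return "symmetric"
    if left_whisker < right_whisker and lower_box < upper_box:
        return "positively skewed"
    if left_whisker > right_whisker and lower_box > upper_box:
        return "negatively skewed"
    return "inconclusive"  # whiskers and box disagree

# Hypothetical five-number summaries, invented for illustration.
print(box_plot_shape(10, 20, 30, 40, 50))
print(box_plot_shape(10, 20, 25, 40, 70))
print(box_plot_shape(10, 40, 55, 60, 70))
```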

Analysing Grouped Data

Measures of location: mean, median, mode, quartile, decile and percentile.
Measures of dispersion: range, interquartile range, variance and standard deviation.

Formula

$$ \text{Mean, } \bar{x} = \frac{\sum fx}{\sum f} $$

where x = data (the class midpoint for grouped data) and f = frequency

$$ \text{Mode} = L_m + \left( \frac{d_1}{d_1 + d_2} \right) c $$

where $L_m$ = lower boundary of the class containing the mode
$d_1$ = difference between the frequency of the mode class and the
frequency of the class immediately before it
$d_2$ = difference between the frequency of the mode class and the
frequency of the class immediately after it
c = size of the mode class
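
A worked sketch in Python, applied to the height table from the grouped-data example. It assumes the usual grouped-data forms: mean = Σfx / Σf with x taken as the class midpoint, and mode = Lm + d1 / (d1 + d2) × c. It also assumes integer class limits, so that class boundaries lie 0.5 below and above the limits. The function names are chosen for this example only.

```python
# Height classes (cm) and frequencies from the grouped-data example.
classes = [(150, 159), (160, 169), (170, 179), (180, 189)]
freqs = [5, 11, 21, 8]

def grouped_mean(classes, freqs):
    """Mean = sum(f * x) / sum(f), with x the class midpoint."""
    midpoints = [(lo + hi) / 2 for lo, hi in classes]
    return sum(f * x for f, x in zip(freqs, midpoints)) / sum(freqs)

def grouped_mode(classes, freqs):
    """Mode = Lm + d1 / (d1 + d2) * c for the class with the highest frequency."""
    m = freqs.index(max(freqs))                  # index of the mode class
    lower_boundary = classes[m][0] - 0.5         # assumes integer class limits
    width = classes[m][1] - classes[m][0] + 1    # class size c
    before = freqs[m - 1] if m > 0 else 0
    after = freqs[m + 1] if m < len(freqs) - 1 else 0
    d1 = freqs[m] - before
    d2 = freqs[m] - after
    return lower_boundary + d1 / (d1 + d2) * width

print(round(grouped_mean(classes, freqs), 2))
print(round(grouped_mode(classes, freqs), 2))
```

For this table the mode class is 170-179 (Lm = 169.5, d1 = 10, d2 = 13, c = 10), giving a mode of about 173.85 cm and a mean of about 171.61 cm.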


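$$ \text{Median} = L_m + \left( \frac{\frac{n}{2} - F_L}{f_m} \right) c $$

where $L_m$ = lower boundary of the class containing the median
n = total number of observations
$F_L$ = cumulative frequency of the class before the median class
$f_m$ = frequency of the median class
c = size of the median class

$$ \text{Quartile, } Q_k = L_k + \left( \frac{\frac{kn}{4} - F_L}{f_k} \right) c_k $$

$$ \text{Percentile, } P_k = L_k + \left( \frac{\frac{kn}{100} - F_L}{f_k} \right) c_k $$

$$ \text{Decile, } D_k = L_k + \left( \frac{\frac{kn}{10} - F_L}{f_k} \right) c_k $$

where k = 1, 2, 3, …
$L_k$ = lower boundary of the class where $Q_k$, $P_k$, $D_k$ lies
n = total number of observations
$F_L$ = cumulative frequency of the class before the $Q_k$, $P_k$, $D_k$ class
$f_k$ = frequency of the class where $Q_k$, $P_k$, $D_k$ lies
$c_k$ = size of the class where $Q_k$, $P_k$, $D_k$ lies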
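
The median, quartile, decile and percentile formulas share one pattern, so a single hypothetical helper can cover them all. The sketch assumes the form Lk + ((k·n/parts − FL) / fk) × ck, with parts = 2 for the median, 4 for quartiles, 10 for deciles and 100 for percentiles, and integer class limits (boundaries 0.5 below the lower limit). It uses the height table from the grouped-data example.

```python
# Height classes (cm) and frequencies from the grouped-data example.
classes = [(150, 159), (160, 169), (170, 179), (180, 189)]
freqs = [5, 11, 21, 8]

def grouped_quantile(classes, freqs, k, parts):
    """Return the k-th of `parts` equal divisions for grouped data."""
    if not 0 < k < parts:
        raise ValueError("k must satisfy 0 < k < parts")
    n = sum(freqs)
    target = k * n / parts   # position k*n/parts in the cumulative count
    cumulative = 0           # F_L: cumulative frequency before the current class
    for (lo, hi), f in zip(classes, freqs):
        if cumulative + f >= target:
            lower_boundary = lo - 0.5   # assumes integer class limits
            width = hi - lo + 1         # class size c_k
            return lower_boundary + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("frequencies must sum to a positive number")

median = grouped_quantile(classes, freqs, 1, 2)
q1 = grouped_quantile(classes, freqs, 1, 4)
q3 = grouped_quantile(classes, freqs, 3, 4)
print(round(median, 2), round(q1, 2), round(q3, 2), round(q3 - q1, 2))
```

The fifth decile and the 50th percentile, `grouped_quantile(classes, freqs, 5, 10)` and `grouped_quantile(classes, freqs, 50, 100)`, return the same value as the median, as expected.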

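$$ \text{Variance, } s^2 = \frac{\sum f (x - \bar{x})^2}{\sum f - 1} $$

$$ \text{Standard Deviation, } s = \sqrt{\frac{\sum f (x - \bar{x})^2}{\sum f - 1}} $$

where x = data and f = frequency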
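
A brief Python sketch for grouped data, using the height table from the grouped-data example. It assumes the definitional form s² = Σf(x − x̄)² / (Σf − 1), with x taken as the class midpoint, which mirrors the ungrouped formula given earlier; the function names are chosen for this example only.

```python
import math

# Height classes (cm) and frequencies from the grouped-data example.
classes = [(150, 159), (160, 169), (170, 179), (180, 189)]
freqs = [5, 11, 21, 8]

def grouped_variance(classes, freqs):
    """s^2 = sum(f * (x - mean)^2) / (sum(f) - 1), with x the class midpoint."""
    n = sum(freqs)
    midpoints = [(lo + hi) / 2 for lo, hi in classes]
    mean = sum(f * x for f, x in zip(freqs, midpoints)) / n
    return sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints)) / (n - 1)

def grouped_std(classes, freqs):
    """Square root of the grouped variance."""
    return math.sqrt(grouped_variance(classes, freqs))

print(round(grouped_variance(classes, freqs), 2))
print(round(grouped_std(classes, freqs), 2))
```

For this table the variance is about 80.10 cm² and the standard deviation about 8.95 cm.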
