
Chapter 2 Data Summary and Presentation
Items to be covered in this
presentation include:

Review basic statistics


Discuss samples and populations
Discuss presentation techniques

Statistics is a foundation for the
engineering sciences.
It quantifies the likelihood of an
occurrence based on a summary of
the past.
Statistics is the taming of
randomness.
Engineers need to understand the
fundamentals of statistics.

Statistical Analysis of Experimental Data


Through statistics we want
to quantify three points:
A single representative
value for the data
A value that represents the
variation or spread in the
data.
An interval within which the
true value is expected to
lie.

Mean
Median
Mode
Deviation
Variance
Standard Deviation

Sample Statistics
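The variance of a sample is given
by:

$$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 $$

where the sample mean is

$$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i $$

The variance for a population is:

$$ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 $$

The standard deviation = square
root of the variance.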
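As a minimal sketch of the sample and population variance formulas above, the Python below computes both for a short list. The data values and function names are illustrative assumptions and do not come from the slides.

```python
import math


def sample_variance(values):
    """Squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def population_variance(values):
    """Squared deviations from the mean, divided by N."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical values
s2 = sample_variance(data)
sigma2 = population_variance(data)
s = math.sqrt(s2)  # standard deviation = square root of the variance
print(s2, sigma2, s)
```

For the same data the sample variance is always the larger of the two, because the divisor is n - 1 rather than n.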

n or n-1
In general, whenever a simple random sample is
used, one needs to increase the size of the variance
to compensate for the fact that $\bar{x}$ does not exactly
equal the true mean $\mu$; hence n-1 is used as the
divisor rather than n.
The term degrees of freedom is also used to
describe n-1.
Recall that the sum of the deviations is always zero,
so if one knows all the deviations $(x_i - \bar{x})$ except for
one, the unknown one can easily be solved for.
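The degrees-of-freedom argument can be checked numerically. The sketch below, with hypothetical data values, shows that the deviations from the sample mean sum to zero (up to floating-point rounding), so the last deviation can be recovered from the other n-1.

```python
data = [15.0, 30.0, 51.0, 20.0, 17.0]  # hypothetical values
mean = sum(data) / len(data)

# Deviations of every value from the sample mean.
deviations = [x - mean for x in data]
total = sum(deviations)  # zero, apart from rounding error

# Pretend the last deviation is unknown: it must cancel the others.
known = deviations[:-1]
recovered = -sum(known)
print(total, recovered, deviations[-1])
```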

Median
The median is another central measure; to find it:
Arrange all the sample values from highest to lowest (or lowest
to highest)
If n is odd, the median is the value in the middle position
((n+1)/2)
If n is even, the sample median is the average of the
values in positions (n/2) and ((n/2)+1)

The determination of the median for a population is
identical.
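The median procedure above can be sketched as a short function. The function name and test values are illustrative assumptions. For even n the code averages the two values on either side of the centre of the ordered list.

```python
def median(values):
    """Median of a list: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: 1-based position (n + 1)/2 is 0-based index n // 2.
        return ordered[mid]
    # Even n: average the two central values (1-based n/2 and n/2 + 1).
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([3, 1, 2]))
print(median([4, 1, 3, 2]))
```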

Mode and Outliers


The mode is the value that occurs most frequently.
There may be more than one mode in a data set.
The range = maximum value - minimum value.
Outliers are values that are well outside the other
values in the data set. Some are legitimate values,
but some may be due to error in measurement or
recording.
Note: Summary statistics that use arithmetic methods and
involve the entire data set are always affected by outliers;
e.g. the mean and standard deviation. The median is
much less affected by outliers.
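A small sketch of the point about outliers, using Python's standard `statistics` module. The measurements are hypothetical; one extreme value is appended to show how far the mean and the median each move.

```python
import statistics

values = [15, 17, 15, 19, 20, 18]   # hypothetical measurements
with_outlier = values + [51]        # same data plus one extreme value

modes = statistics.multimode(values)    # a list, since several modes may exist
data_range = max(values) - min(values)  # maximum value minus minimum value

# How far each central measure moves when the outlier is added.
mean_shift = statistics.mean(with_outlier) - statistics.mean(values)
median_shift = statistics.median(with_outlier) - statistics.median(values)
print(modes, data_range, mean_shift, median_shift)
```

In this sketch the mean moves by several units while the median moves only slightly.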

Quartiles
The trimmed mean reduces the effect of outliers by
calculating the mean of the data set when p% of the values at
each end (highest and lowest) are trimmed.
Quartiles divide the ranked data set into four equal-size
groups, as closely as possible. Arrange the data set in order.

First quartile (Q1): the value at position (n + 1)/4.
Third quartile (Q3): the value at position 3(n + 1)/4.
Q2, the value at position (n + 1)/2, is the median.
Definitions of quartiles vary. We will use these ones.
Quartiles are quite insensitive to outliers.

Interquartile Range (IQR) is the difference between the
upper and lower quartiles (Q3 - Q1, the range spanned by the
middle two quarters of the data). It highlights
the variability in the data.

Percentiles
The pth percentile divides a data set so that p%
of the values are less than, and (100 - p)% are
greater than, it.
pth percentile = value at position p(n + 1)/100. Other
definitions exist; this one works well except at the
extremes.
Q1 = 25th percentile; median = 50th percentile; Q3 = 75th
percentile.
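The position rules for quartiles and percentiles, together with the trimmed mean, can be sketched as follows. The slides do not say what to do when a position is fractional; linear interpolation between the neighbouring values is assumed here. The function names and data are illustrative.

```python
def value_at_position(ordered, position):
    """Value at a 1-based position in sorted data.

    Fractional positions are linearly interpolated between neighbours
    (an assumption). Positions below 1 or above n are not handled.
    """
    lower = int(position)
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


def percentile(values, p):
    """pth percentile using position p(n + 1)/100."""
    ordered = sorted(values)
    return value_at_position(ordered, p * (len(ordered) + 1) / 100)


def trimmed_mean(values, p):
    """Mean after dropping p% of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * p / 100)  # number trimmed per end, rounded down
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


data = [13, 15, 15, 16, 17, 18, 19, 20, 22, 23, 51]  # hypothetical values
q1 = percentile(data, 25)  # position (n + 1)/4
q2 = percentile(data, 50)  # the median
q3 = percentile(data, 75)  # position 3(n + 1)/4
iqr = q3 - q1
print(q1, q2, q3, iqr)
print(sum(data) / len(data), trimmed_mean(data, 10))
```

Trimming 10% from each end removes the single extreme value 51 here, which pulls the trimmed mean well below the ordinary mean.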

For qualitative data, the summary statistics above
cannot be calculated because the values are
names or labels. The only meaningful statistics are
the frequencies and relative frequencies of values
or groups of values.
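For qualitative data, frequencies and relative frequencies can be tallied with `collections.Counter`. The labels below are hypothetical and serve only to illustrate the calculation.

```python
from collections import Counter

# Hypothetical qualitative observations (labels, not numbers).
labels = ["steel", "aluminum", "steel", "copper", "steel", "aluminum"]

frequencies = Counter(labels)
total = sum(frequencies.values())
relative = {label: count / total for label, count in frequencies.items()}

for label, count in frequencies.most_common():
    print(label, count, round(relative[label], 3))
```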

Summary Statistics
For a sample, the values calculated above are
known as statistics. For a population, the values
are called parameters.
We want to know the parameters of a population
but it is impractical/impossible to access the entire
population.
That is why we collect a sample and use the
descriptive statistics above to provide a cursory
assessment, or we use inferential statistical
methods to make estimates, test theories, or
formulate models of the parameters.

Data Plots
The most effective way
to review data is
through graphical methods.
Engineers are visual
people and therefore require
charts to be transformed
into graphs.

Stem and Leaf Diagram


A stem-and-leaf plot is used for exploring data but is not
used for formal reporting.
It allows a quick review to determine the median and mode.
The method involves the following procedure:

Stem and Leaf Diagram (textbook)
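A stem-and-leaf display can be produced in a few lines of Python. This is a sketch under stated assumptions: the values are non-negative integers, the tens digit is the stem and the units digit is the leaf. The data are the tensile results from the worked example in this chapter; the function name is hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return display lines such as '1 | 3 5 5' (tens = stem, units = leaf)."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)
    lines = []
    # Include empty stems so gaps in the data remain visible.
    for stem in range(min(groups), max(groups) + 1):
        leaves = " ".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


data = [15, 30, 51, 20, 17, 19, 20, 32, 17, 15,
        23, 19, 15, 18, 16, 22, 29, 15, 13, 15]
for line in stem_and_leaf(data):
    print(line)
```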

Histogram
Histograms & frequency distributions
1. Choose boundary points for the class intervals (cells or
bins). Usually, intervals are the same width. Class limits
must not overlap.
2. Find the frequency (number of data values) in each
interval.
3. Calculate the relative frequencies (number of data values
in an interval / total number of data values).
4. If the class intervals are the same width, draw rectangles
with heights equal to the frequencies or relative
frequencies.

5. If the class intervals are not equal in width, draw
rectangles with areas that represent the frequencies or
relative frequencies.
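Steps 1 to 3 can be sketched as a frequency table for equal-width class intervals. The interval boundaries (start 10, width 10, five intervals) are an illustrative choice, and each interval is treated as closed on the left and open on the right so that the classes do not overlap. The data are the tensile results from the worked example in this chapter.

```python
def frequency_table(values, start, width, n_bins):
    """Rows of (lower, upper, frequency, relative frequency).

    Intervals are [lower, upper); values outside all intervals are
    not counted in any row.
    """
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    total = len(values)
    table = []
    for i, count in enumerate(counts):
        lower = start + i * width
        table.append((lower, lower + width, count, count / total))
    return table


data = [15, 30, 51, 20, 17, 19, 20, 32, 17, 15,
        23, 19, 15, 18, 16, 22, 29, 15, 13, 15]
table = frequency_table(data, start=10, width=10, n_bins=5)
for lower, upper, freq, rel in table:
    print(f"[{lower}, {upper})  {freq:2d}  {rel:.2f}")
```

Because the intervals have equal width, the frequencies (or relative frequencies) can be used directly as the rectangle heights.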

Histogram (example)

Skewed or Symmetric
A histogram is perfectly symmetric if its right
half is a mirror image of its left half.
Histograms that are not symmetric are referred to
as skewed.
A histogram with a long right tail is said to be
skewed to the right, or positively skewed.
A histogram with a long left tail is said to be
skewed to the left, or negatively skewed.

Unimodal, bimodal, and multimodal data sets
A data set that has a histogram with only one peak
is called unimodal. Example?
If a data set has a histogram with two peaks, we
say that it is bimodal. Example?
If there are more than two peaks in the histogram
of a data set, it is said to be multimodal.
Example?

Box Plot
The box plot is a graphical display that simultaneously
describes several important features of a data set, such as
center, spread, departure from symmetry, and identification
of observations that lie unusually far from the bulk of the
data.
The basic boxplot presents the median, Q1, Q3, the maximum
and the minimum values of the data set.
The width of the box is the interquartile range (Q3-Q1). The
median is marked by a line in the box.
Compute the fences at 1.5 IQR beyond the quartiles, then draw lines
(called whiskers) from the box to the values that are closest to, but
within, the fences.
- Lower fence = Q1 - 1.5(Q3 - Q1). Upper fence = Q3 + 1.5(Q3 - Q1)

Identify each value outside the fences separately; these are
outliers.
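The quantities needed to draw a box plot can be computed with Python's standard `statistics` module. This is a sketch, not the textbook's procedure: `statistics.quantiles` with its default "exclusive" method places the quartiles at positions i(n + 1)/4 and interpolates between neighbouring values. The data are the tensile results from the worked example in this chapter.

```python
import statistics

data = [15, 30, 51, 20, 17, 19, 20, 32, 17, 15,
        23, 19, 15, 18, 16, 22, 29, 15, 13, 15]

# Quartiles at positions (n + 1)/4, (n + 1)/2 and 3(n + 1)/4.
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme values that are still inside the fences.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Values beyond the fences are plotted individually.
outliers = sorted(x for x in data if x < lower_fence or x > upper_fence)
print(q1, median, q3, iqr)
print(lower_fence, upper_fence, whisker_low, whisker_high, outliers)
```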

Box Plots
Comparative (side-by-side)
boxplots

When we want to compare
samples from more than one
data set, we plot the boxplots
side-by-side using the same
scale.
This allows us to compare how
the distributions differ between
the data sets.

Boxplots can be plotted
horizontally or vertically. It is
usual for comparative boxplots
to be plotted vertically.
Histograms are often used in
formal reports. Boxplots are
also seen in formal reports, but
stem and leaf plots should not
appear in formal reports.

Example
Tensile tests on an aluminum alloy yield the
following results (ksi):
15 30 51 20 17 19 20 32 17 15 23 19 15 18 16 22 29 15 13 15
Plot the data using a stem and leaf diagram, a histogram, and a box plot.
1 | 3 5 5 5 5 [Q1] 5 6 7 7 8 [Q2] 9 9
2 | 0 0 2 [Q3] 3 9
3 | 0 2
4 |
5 | 1
Median = 18.5
Mode = 15

Example

Time Series Plots

A time series plot is a graph in which the vertical
axis denotes the observed value of the variable
(say x) and the horizontal axis denotes the time
(which could be minutes, days, years, etc.).
A time series or time sequence is a data set in
which the observations are recorded in the order in
which they occur.
When measurements are plotted as a time series,
one often sees
trends,
cycles, or
other broad features of the data.

Time Series

Multivariate Data
Multivariate data occurs when each data
observation in the data set has two or more values
that are possibly related.
We can present bivariate data graphically using a
scatterplot or an x-y diagram.
From a scatterplot, we can assess the following
aspects of the possible relationship between the
two variables:
Direction: positive or negative
Strength: strong, medium, weak, no relationship
Linearity: linear or non-linear.

Multivariate Data
Covariance and correlation measure the
association or relatedness between variables for
bivariate (x,y) data.
Given bivariate (x,y) data, calculate
$\bar{x}$, $s_x$, $\bar{y}$, and $s_y$.
The sample covariance gives the direction of the
association but does not give an indication of the
comparative strength:

$$ S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right) $$

$$ \mathrm{Cov}(x, y) = s_{xy} = \frac{S_{xy}}{n-1} $$
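The sample covariance can be sketched in Python using both forms of the sum of cross-products: the sum of products of deviations, and the shortcut form built from the raw sums. The function name and the data pairs are illustrative assumptions.

```python
def covariance_terms(xs, ys):
    """Return (S_xy by definition, S_xy by the shortcut, sample covariance)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Definition: sum of products of deviations from the means.
    s_xy_definition = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Shortcut: sum of x*y minus (sum of x)(sum of y)/n.
    s_xy_shortcut = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return s_xy_definition, s_xy_shortcut, s_xy_definition / (n - 1)


xs = [1, 2, 3, 4, 5]  # hypothetical data
ys = [2, 4, 5, 4, 5]
by_definition, by_shortcut, cov = covariance_terms(xs, ys)
print(by_definition, by_shortcut, cov)
```

A positive covariance, as in this sketch, indicates that y tends to increase with x; its size depends on the units of both variables, which is why it says little about strength.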

Multivariate Data
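The sample correlation coefficient, r, gives the direction (+/-)
and the comparative strength of the association ($-1 \le r_{xy} \le 1$):

$$ r_{xy} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{s_{xy}}{s_x s_y} $$

$$ S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = (n-1) s_x^2 $$

$$ S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = (n-1) s_y^2 $$

A negative r means a negative slope.
We want a value as close to -1 or +1 as possible,
because that means a stronger correlation.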

Multivariate Data

We can make some general remarks about the
correlation coefficient:
For |r| > 0.8 we have a strong correlation (0.9 and better to draw
conclusions).
With 0.5 < |r| < 0.8, we have a moderate correlation.
With |r| < 0.5, we have a weak correlation.

Example
The following are results for
tensile strength and hardness
for a copper alloy. Is there a
correlation?

Tensile Str.   Brinell Hardness
106.2          35.0
106.3          37.2
105.3          39.8
106.1          35.8
105.4          41.3
106.3          40.7
104.7          38.7
105.4          40.2
105.5          38.1
105.1          41.6
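One way to answer the question is to compute the sample correlation coefficient for the ten pairs. The sketch below assumes the values pair up in the order listed (tensile strength first, then hardness). The function names are hypothetical. The labels apply the 0.8 and 0.5 thresholds to the magnitude of r, and a value exactly on a threshold falls into the lower class, which is an arbitrary choice.

```python
import math

tensile = [106.2, 106.3, 105.3, 106.1, 105.4, 106.3, 104.7, 105.4, 105.5, 105.1]
hardness = [35.0, 37.2, 39.8, 35.8, 41.3, 40.7, 38.7, 40.2, 38.1, 41.6]


def correlation(xs, ys):
    """Sample correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


def strength(r):
    """Label the magnitude of r as strong, moderate or weak."""
    size = abs(r)
    if size > 0.8:
        return "strong"
    if size > 0.5:
        return "moderate"
    return "weak"


r = correlation(tensile, hardness)
print(round(r, 3), strength(r))
```

For these ten pairs r comes out at roughly -0.51, a moderate negative correlation that sits close to the boundary with weak.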

Example
