
CBE 488/588

DOE
1.17.11
Overview for today
• Syllabus and course plan
• Importance/usefulness of DOE
• Statistics, statistics, statistics…
Introductions
• Name
• Year
• Major
• Interesting fact
• Why you chose to take this class
Course website
http://webpages.sdsmt.edu/~tkirschli/CBE_488_588.html
Why study DOE?
• A typical experiment follows the flow Hypothesis → Experiment → Analysis:
  – Hypothesis: choose a hypothesis to test and pick a variable to measure
  – Experiment: run a series of tests and collect data on that variable
  – Analysis: try to interpret the results in a meaningful way
Why study DOE?
• The same Hypothesis → Experiment → Analysis flow, with a question attached to the Experiment step: what does "run a series of tests" actually mean?
Why study DOE?
• “Frequently conclusions are easily drawn from
a well-designed experiment, even when
elementary methods of analysis are
employed. Conversely, even the most
sophisticated statistical analysis cannot
salvage a badly designed experiment”
– Box, Hunter and Hunter
Why study DOE?
• Think of your typical chemical reactor…
  – How many variables can you name that might affect product quality?
  – How many experiments would it take to test each while holding all the others constant?
Design of Experiments
• Usually it is most efficient to test several variables simultaneously
Pre-DOE Pre-test
Statistics, statistics, statistics…
• Objectives
• By the end of today you should be able to answer
the following questions:
• What are the key parameters used to statistically
describe both a population and a sample?
• What is a normal distribution and how can I use
it?
• What is the central limit theorem?
• How can I apply hypothesis testing to statistical
problems?
Toss up!
Break into three teams
Get familiar with the game
Play one round….
Toss up!
• Dice Game Instructions
• You win by being the player with the highest score over 100 points at the end of the game.
• 1. Start by rolling 10 dice.
• 2. Score a point for each die that comes up green.
• 3. Set the green dice aside and re-roll the rest.
• 4. Repeat step 3 until one of the following happens:
  – You choose to stop, add up your points, and pass the dice.
  – You roll 1 (or more) red dice AND NO green dice, which means you lose all points earned that round.
  – You pass 100 points!
• A player may stop rolling at any time and keep any points earned for that round. The points are written on the score sheet and cannot be lost once recorded there. When one player passes the 100-point total, every other player has one chance to roll the dice and score higher to beat that total.
Toss up!
• You wonder whether a modified rolling method is better than the standard rolling method…
• As a team, pick a modified rolling method you think may give better results when you roll all 10 dice at once
• Plot the number of green dice versus the roll number for 10 trials of each strategy
• Calculate the average for each strategy
Toss up!
• Is your second method better than your first in the long run?
• Might any difference have arisen from pure chance?
• Maybe if the experiment were repeated the results would be reversed?
• To consider these questions we need to develop some elementary statistical theory.
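As a rough illustration of how much variation pure chance alone can produce, here is a minimal Python sketch that simulates the 10-dice rolls. It assumes each die comes up green with probability 1/2 — a placeholder, not the real Toss Up! color split — and it uses the same probability for both "strategies", so any difference in the averages is due to chance.

```python
# Minimal simulation sketch (not part of the original slides). The green
# probability of 0.5 is an illustrative assumption only.
import random

def roll_ten_dice(p_green=0.5):
    """Roll 10 dice at once and count how many come up green."""
    return sum(random.random() < p_green for _ in range(10))

def run_trials(n_trials=10, p_green=0.5):
    """Return the number of green dice seen in each of n_trials rolls."""
    return [roll_ten_dice(p_green) for _ in range(n_trials)]

# Both "strategies" use identical probabilities, so any difference in the
# averages below arises from chance alone -- exactly the point of the slide.
standard = run_trials()
modified = run_trials()

print("standard rolls:", standard, "average:", sum(standard) / len(standard))
print("modified rolls:", modified, "average:", sum(modified) / len(modified))
```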
Experimental Error
• experimental error: unavoidable variation
– contributing factors: ambient conditions, operator
skill, reagent purity/age, equipment condition
– not associated with blame
• How can you tell there is experimental error in
your toss up results?
Experimental Error
• Variation exists in all processes.
• Variation makes life interesting. It's always present, but
we tend to forget it. Business students learn to expect
that the actual financial figures should always equal
the budget. Science students are taught that mass
balances must balance exactly. What about your own
world? Do you expect the mail to arrive at exactly the
same time every day? Do you get irritated with the
postal worker when it doesn't? How often is the
weather forecast accurate? You must get a grip on
variation. It is critical to understanding how to deal
with your data.
Experimental error
• Understanding and reducing variation are the keys to success.
  – reducing product variation → customer satisfaction
  – reducing process variation → efficiency
• In order to reduce variation, you need to know where it comes from. Only then can you focus your energies on the areas where you can have the biggest influence.
Types of variation
• Common Cause
• Common cause variation occurs as a state of nature. It
creates a background noise that comes from the
cumulative effect of many small, unavoidable and
unknown sources. It is repeatable and predictable. It
cannot be altered without changing the process itself,
which requires management intervention and
engineering breakthroughs catalyzed by application of
DOE.
• Examples: raw material fluctuations, internal machine
control fluctuations
Types of variation
• Special Cause
• This source of variation occurs due to assignable
causes. It's not purely a matter of chance. Special
causes produce erratic behavior that appears
unnatural when viewed with control charts. It is
sporadic and unpredictable.
• Examples: improperly adjusted machines,
operator errors, defective raw material, excessive
tool wear
Population vs. Sample
• Everyday
  – everyone living in Rapid City → population
  – everyone you see on your way to school → sample
• Industry
  – any possible product density → population
  – the samples your operator measures at 8 am, noon, 4 pm, 8 pm and midnight → sample
Population vs. Sample
• In your Toss up! experiment what is the
population and what is the sample?
Population vs. Sample
• population: the total aggregate of observations that might conceptually occur
  – the population has a size N, where N is large
  – what you could see
• sample: the small number n of observations that actually have occurred
  – what you actually see
Probability distribution
• each individual value of an observation is y
• p(y) is the probability density
• the area under p(y) between two values of y is the probability of an observation falling in that range
Sample average and population mean
• sample mean (the average value): $\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$, the sum of the observations divided by the number of observations
Sample average and population mean
• for a hypothetical population containing a very large number of observations (N), the population mean is $\eta = \frac{\sum_{i=1}^{N} y_i}{N}$
Descriptive Statistics
• Mean - This is the most commonly used statistic. It is more commonly referred to as the overall average of the data. We will use the words "mean" and "average" interchangeably. The mean equals the sum of the individuals divided by the number of individuals. The mean pinpoints the location of the data but says nothing about the spread, so it can't be the only statistic you report on your data.
Median
• Median - the middle value of a dataset when
ordered from smallest to largest. Think of this
as the "middle of the road" data point. Half
the data will be to the left and half to the
right. If you've got an even number of points,
the median becomes the average of the two
points in the middle.
Average vs. Median
Average vs. Median
State | Avg. listing price (week ending Jan 4) | Median sales price (week ending Jan 4) | Trulia popularity rank (Aug-Oct '06)
Ohio | $172,053 | $133,358 | 13
Iowa | $174,133 | $150,500 | 37
Indiana | $179,975 | $63,655 | 22
Michigan | $187,318 | $123,192 | 8
Nebraska | $188,347 | $150,000 | 47
Kansas | $194,118 | $59,044 | 33
Missouri | $194,275 | $84,624 | 24
North Dakota | $195,791 | $88,600 | 51
West Virginia | $196,948 | $105,000 | 38
Oklahoma | $199,081 | $106,000 | 28
Average vs. Median
New Jersey | $417,515 | $342,000 | 5
Montana | $454,874 | $151,300 | 40
Massachusetts | $501,828 | $310,000 | 12
Colorado | $539,148 | $240,000 | 16
Connecticut | $549,778 | $260,000 | 23
California | $554,301 | $452,000 | 1
Wyoming | $564,266 | $130,702 | 50
District Of Columbia | $689,888 | $720,000 | 45
New York | $697,609 | $350,000 | 4
Hawaii | $931,276 | $522,000 | 41
Which number better describes the typical house?
Ohio | $172,053 | $133,358 | 13
Wyoming | $564,266 | $130,702 | 50
Mode
• Mode - the most frequently occurring value in
the data set. This statistic is simply the value that
is found most often. For example, let's say you
walk into a suit store. You glance around and see
lots of $200 price tags. This gives you the
impression that the typical price is $200. Right or
wrong, this is the only convenient way to get a
feeling for the average, which could only be
properly calculated from a complete list of the
inventory. The mode is convenient, but it's not
very accurate as an indicator of data location, so
we don't advocate you use this statistic.
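A minimal sketch of these three measures of location, using Python's standard-library statistics module on a small made-up data set (the numbers below are illustrative only, not from the slides).

```python
# Sketch: mean, median and mode of a small, made-up price list.
import statistics

data = [200, 200, 200, 250, 300, 450, 900]   # e.g., suit prices in dollars

print("mean:  ", statistics.mean(data))      # pulled upward by the $900 suit
print("median:", statistics.median(data))    # middle value of the sorted list: 250
print("mode:  ", statistics.mode(data))      # most frequent value: 200
```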
Mean/Median
• Think back to your Toss up! game. We know the sample mean and median for both sample sets. Tell these numbers to the group next to you. Can they adequately judge what your rolls looked like from this data?
• What else do you need to know?
Measures of variability
• The mean gives us an idea of data location but
tells us nothing about variability
Measures of variability
• Say you want to know something about a population. You need to know a couple of things: first, what is the average attribute of that population? Second, how much variability is there? What is the spread?
• We need a measure of how much deviation from the mean occurs.
• deviation: the difference between an observation and the population mean, $y - \eta$
Measures of variability
• Variance: the mean value of the squared deviations, taken over the entire population: $\sigma^2 = \frac{\sum_{i=1}^{N} (y_i - \eta)^2}{N}$
• The square root of the variance is the standard deviation, $\sigma$
Measures of variability
• What if we don't know η?
  – sample variance: $s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}$
  – sample standard deviation: $s = \sqrt{s^2}$
n-1?
• n-1 → degrees of freedom
• Because the deviations from the sample average must sum to 0, once n-1 of them are specified the last one is fixed and the system is 100% specified.
• Therefore the number of degrees of freedom is n-1.
n-1?
• Another way to think about it…
– if you only have one measurement, can you tell
anything about the variability?
Population vs. sample
• population variance: $\sigma^2 = \frac{\sum (y_i - \eta)^2}{N}$  vs.  sample variance: $s^2 = \frac{\sum (y_i - \bar{y})^2}{n-1}$
• population standard deviation: $\sigma$  vs.  sample standard deviation: $s$
Standard Deviation
• Standard Deviation - a second measure of variation, derived from the variance. The standard deviation is equal to the square root of the variance. This statistic is not additive, so it is not good to use within a series of calculations. However, it allows you to express variation in the original units of measure.
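A minimal sketch contrasting the population formulas (divide by N) with the sample formulas (divide by n-1). The data values are made up for illustration.

```python
# Sketch: population vs. sample variance and standard deviation.
import math

y = [4.0, 6.0, 5.0, 7.0, 3.0]     # made-up observations
n = len(y)
ybar = sum(y) / n

ss = sum((yi - ybar) ** 2 for yi in y)   # sum of squared deviations

pop_var = ss / n          # population variance (if y were the whole population)
samp_var = ss / (n - 1)   # sample variance, using n-1 degrees of freedom

print("mean:", ybar)
print("population variance:", pop_var, " std dev:", math.sqrt(pop_var))
print("sample variance:    ", samp_var, " std dev:", math.sqrt(samp_var))
```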
In your groups
• Group 1
• Calculate the average and standard deviation for the following data on hog-corn ratios (the ratio of the price of hogs per 100 lb to the price of corn per bushel): 16.8, 13.3, 11.8, 15.0, 13.2
In your groups
• Group 2
• A psychologist measured (in seconds) the following times required for 10 experimental rats to complete a maze: 24, 37, 38, 43, 33, 35, 48, 29, 30, 38. Find the average, sample variance and sample standard deviation for these data.
In your groups
• Group 3
• The following observations on the lift of an
airfoil (in kg) were obtained on successive
trials in a wind tunnel: 9072, 9148, 9103,
9084, 9077, 9111, 9096. Find the average,
sample variance and sample standard
deviation for these data
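To check your group's answers, here is a quick sketch using the standard-library statistics module, shown with the Group 2 rat-maze times copied from above; swap in your own data set.

```python
# Sketch: average, sample variance and sample standard deviation.
import statistics

times = [24, 37, 38, 43, 33, 35, 48, 29, 30, 38]   # seconds, from Group 2

print("average:                  ", statistics.mean(times))
print("sample variance (n-1):    ", statistics.variance(times))
print("sample standard deviation:", statistics.stdev(times))
```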
5 minute break
• What are the key parameters used to
statistically describe both a population and a
sample?
Normal Distributions
• Normal Distribution
• The normal distribution is characterized by its smooth, bell-shaped curve. It's symmetric, so the mean, median and mode are all equal. The normal distribution is defined by two parameters - the mean and the standard deviation.
Normal Distributions
• Consider two normal distributions, each with a mean of 25. The distribution with a standard deviation of 1 is narrower but taller than the distribution with a standard deviation of 2.
Characterizing the normal distribution
• Once the mean and variance are known, the entire distribution is characterized
• Notation: $N(\eta, \sigma^2)$
  – e.g. N(30, 25) has a mean of 30 and a variance of 25
• σ (the standard deviation) measures the distance from the mean (η) to the inflection point of the curve
Normal Distributions
• The area under the normal curve relates to the proportion of the population falling within specified limits. For example, 68% of the total area falls within plus or minus one standard deviation of the mean; in other words, you find about two-thirds of any given population within these limits. If you go plus or minus two standard deviations, the area is 95%, and within plus or minus three standard deviations, 99.7%.
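A quick sketch verifying the 68/95/99.7 rule numerically: for a normal distribution, the fraction falling within plus or minus k standard deviations of the mean equals erf(k/√2).

```python
# Sketch: check the 68/95/99.7 rule with the standard library.
import math

for k in (1, 2, 3):
    frac = math.erf(k / math.sqrt(2))
    print(f"within +/- {k} sigma: {frac:.4f}")   # ~0.6827, 0.9545, 0.9973
```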
Z-scores and Probabilities
• The areas under normal curves can be broken down further with the aid of statistical tables. These tables break down probabilities by the number of standard deviations (Z) that a data point falls from the mean. The Z-score is zero at the mean, negative to the left, and positive to the right. Any normal distribution, no matter what its mean and standard deviation, can be characterized via these standardized Z-scores.
Z scores and probability
• To calculate the Z-score, subtract the mean (µ) from your data point (Y) and divide by the standard deviation (σ): $Z = \frac{Y - \mu}{\sigma}$
Z scores and probability
Using a Z table
• This Z-table only displays positive values. The normal distribution is symmetric, so the tail area for a negative z-value is identical to that of the corresponding positive z-value.
• To find the area between two values, take: 1 - (left tail area) - (right tail area).
Example
• Example 1
• Intelligence quotient (IQ) test scores typically follow a normal distribution for any given population. Consider the Stanford-Binet IQ test, which by design produces a true mean of 100 points and a standard deviation of 16 points. The scores of individuals taking this test can be scaled in terms of how many standard deviations (Z) they deviate from the mean (0 on the Z-scale).
• What is the probability that a randomly selected person has an IQ score below 100?
Example
• Answer: 100 is the mean of a symmetric distribution, so the probability of a score below 100 is 0.5 (a 50% chance).
Example
• What is the probability that the person's score is greater than 120?
Example
• What is the probability that the person's score is greater than 120?
• Answer: Z = (120 - 100)/16 = 1.25, and the upper-tail area beyond Z = 1.25 is about 0.106 (roughly an 11% chance).
Example
• What is the probability that the person's score is between 90 and 115?
Example
• What is the probability that the person's score is between 90 and 115?
• Answer: Z = (90 - 100)/16 = -0.625 and Z = (115 - 100)/16 = 0.9375; the area between them is about 0.56 (roughly a 56% chance).
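The same answers can be reproduced without a printed Z-table. A minimal sketch: the standard normal cumulative probability Φ(z) can be written with math.erf.

```python
# Sketch: IQ probabilities from the normal CDF instead of a Z-table.
import math

def phi(z):
    """Standard normal cumulative probability, P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 100.0, 16.0   # Stanford-Binet mean and standard deviation

def z(y):
    """Z-score for an IQ score y."""
    return (y - mu) / sigma

print("P(IQ < 100)      =", phi(z(100)))               # exactly 0.5
print("P(IQ > 120)      =", 1.0 - phi(z(120)))         # about 0.106
print("P(90 < IQ < 115) =", phi(z(115)) - phi(z(90)))  # about 0.56
```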
Your turn
• The times taken to install bumpers on cars are normally distributed with a mean time of 1.50 minutes and a standard deviation of 0.25 minutes. We've got several questions that need answering.
• What portion of bumpers:
  a) require between 1.40 and 1.50 minutes?
  b) exceed 1.75 minutes?
  c) fall at or below 1.45 minutes?
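A sketch of how these three calculations can be set up in Python, reusing the same erf-based normal CDF; only the mean and standard deviation from the problem statement are used.

```python
# Sketch: bumper installation-time probabilities.
import math

def phi(z):
    """Standard normal cumulative probability, P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 1.50, 0.25   # minutes, from the problem statement

def z(t):
    """Z-score for an installation time t."""
    return (t - mu) / sigma

print("a) P(1.40 < T < 1.50):", phi(z(1.50)) - phi(z(1.40)))
print("b) P(T > 1.75):       ", 1.0 - phi(z(1.75)))
print("c) P(T <= 1.45):      ", phi(z(1.45)))
```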
Testing for normality
• In order to use the normal distribution, our data must be normally distributed
• Is the data bell shaped?
  – requires a lot of data (150 points)
Testing for normality
• Is the data bell shaped?
  – requires a lot of data (150 points)
• Plot on special paper or in a software program
  – Normal probability plot
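A minimal sketch of the software route, assuming SciPy and matplotlib are available; the 150 data points here are randomly generated from a normal distribution, so they should fall roughly on a straight line.

```python
# Sketch: normal probability plot in software instead of special paper.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

data = np.random.normal(loc=25.0, scale=2.0, size=150)   # 150 simulated points

stats.probplot(data, dist="norm", plot=plt)   # near-straight line if data are normal
plt.title("Normal probability plot")
plt.show()
```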
Self test
• What is a normal distribution and how do we
use it?
Random Sampling
• The hypothesis of random sampling: a set of observations is a random sample from the population
• Unfortunately this does not always apply to actual data
  – weather data: warm days follow warm days
  – political polling: location matters
Random sampling
• Random sampling means that each observation is drawn independently of the others, so the probability distribution of one observation is not affected by any of the rest.
Random sampling
• When the probability distribution of one observation is affected by the level of another, the observations are said to be statistically dependent
• In real life this happens all the time
• In statistics we need a way around it
Central Limit Theorem
• The distribution of errors tends to be approximately normal; the fact that such a tendency is to be expected is called the Central Limit Theorem.
Central Limit Theorem
• Variations in Normality
Central Limit Theorem
• http://www.stat.sc.edu/~west/javahtml/CLT.html
Central Limit Theorem
• The central limit theorem (CLT) shows the tremendous advantages that experimenters gain by relying on averages rather than individual results. It applies to any process, even one that generates a non-normally distributed output. Here are the three main tenets of the CLT:
  – As the sample size (n) becomes large, the distribution of averages becomes normal.
  – The variance of the averages is smaller than the variance of the individuals by a factor of n.
  – The mean of the distribution of averages is the same as the mean of the distribution of individuals.
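A minimal sketch illustrating the three tenets with a deliberately non-normal (exponential) population; the sample size of 25, the population size and the number of repeated samples are arbitrary choices.

```python
# Sketch: CLT tenets checked on a skewed (exponential) population.
import random
import statistics

random.seed(1)
n = 25                                                   # sample size per average
population = [random.expovariate(1.0) for _ in range(100_000)]

# Draw many samples of size n and record their averages.
averages = [statistics.mean(random.sample(population, n)) for _ in range(2_000)]

print("population mean:          ", statistics.mean(population))
print("mean of the averages:     ", statistics.mean(averages))       # about the same
print("population variance:      ", statistics.pvariance(population))
print("variance of the averages: ", statistics.pvariance(averages))  # roughly 1/n as large
```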
Hypothesis Testing
• Remember back to your Toss up! experiment…
• null hypothesis
– the hypothesis that there is no difference between
two population means
– if the null hypothesis is discredited we say there is
a statistically significant difference
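For the Toss up! comparison, the null hypothesis of "no difference" can be checked with a two-sample t-test. A sketch assuming SciPy is available; the green-dice counts below are placeholders, so substitute your own results.

```python
# Sketch: two-sample t-test on the two rolling strategies.
from scipy import stats

standard = [4, 5, 3, 6, 5, 4, 5, 6, 4, 5]   # greens per roll, strategy 1 (hypothetical)
modified = [5, 6, 4, 6, 7, 5, 6, 5, 6, 5]   # greens per roll, strategy 2 (hypothetical)

t_stat, p_value = stats.ttest_ind(standard, modified)
print("t statistic:", t_stat, " p-value:", p_value)

# A small p-value (say, below 0.05) would discredit the null hypothesis and
# suggest a statistically significant difference between the strategies.
```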
Hypothesis testing
• Suppose a prosecutor suspects that a defendant is guilty of
a crime. Under the American system of criminal justice, the
accused person is presumed innocent. The prosecutor's job
is to disprove the "hypothesis" of innocence in favor of the
alternative hypothesis that the accused is guilty. Factual
evidence must be presented that is both inconsistent with
the hypothesis of innocence and consistent with the
hypothesis of guilt. If the prosecution fails, the defendant is
found "not guilty." That is, the jury rejects the hypothesis
of guilt. This does not imply that the jury accepts the
hypothesis of innocence. It only means that the evidence is
not strong enough to support a conviction.
Hypothesis Testing
• A newly formulated plastic material is being tested to determine if it has a higher tensile strength than the current product. Company management will not allow the product to be marketed unless there is strong evidence that it is substantially different from the current product. A sample of 100 plastic parts is selected for the test, and each part is subjected to the standard tensile test procedure.
• On the basis of the sampling experiment, management must decide between two courses of action:
  1. Approve the new product.
  2. Disapprove the new product (and request further research).
Hypothesis Testing
• Proposal - Determine what you want to test and
express it in mathematical terms. Actually, you
must create two statements.
• null hypothesis (Ho) - states that some parameter equals a specific value.
• alternative hypothesis (Ha) - specifies the value the parameter is expected to take if the null hypothesis is incorrect.
• The hypotheses are always formulated so that Ho and Ha are opposites; when one is true, the other is false.
Hypothesis Testing
• For the plastic material example, management is only considering two possible alternatives. The null hypothesis is that of "no difference": the new product is not substantially different and would not be worth marketing. The alternative hypothesis is that the new plastic material is substantially different (improved), and marketing will be able to sell the new product. Management will classify the new plastic as improved only if it has a tensile strength of more than 20 pounds per square inch (psi).
• These hypotheses are written like this: Ho: µ = 20 psi versus Ha: µ > 20 psi, where µ is the mean tensile strength of the new material.
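A sketch of how this one-sided test could be run in Python, assuming SciPy is available. The tensile measurements below are simulated placeholders; only the 20 psi threshold and the sample size of 100 come from the slide.

```python
# Sketch: one-sample, one-sided t-test for the plastic-strength decision.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
strengths = rng.normal(loc=21.0, scale=4.0, size=100)   # hypothetical tensile data

# Ho: mu = 20 psi   vs.   Ha: mu > 20 psi  (one-sided alternative)
t_stat, p_value = stats.ttest_1samp(strengths, popmean=20.0, alternative="greater")

print("t statistic:", t_stat, " p-value:", p_value)
# Approve the new product only if the p-value gives strong evidence against Ho.
```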
Statistics, statistics, statistics…
• Objectives
• By the end of today you should be able to answer
the following questions:
• What are the key parameters used to statistically
describe both a population and a sample?
• What is a normal distribution and how can I use
it?
• What is the central limit theorem?
• How can I apply hypothesis testing to statistical
problems?
