
Mathematical Model

Equation, formula

Ideal gas law: $PV = NRT$
Q1: Is this relationship true?
Q2: What is the value of the constant $R$?
Answer these questions with a set of measurements.

$E = mc^2$

V = IR

y = x 2 3 cos 2 (log )

Mathematical Model

$PV = NRT$
P: pressure, V: volume, T: temperature, N: number of moles, R: universal gas constant

Each measurement $(P_i, V_i, T_i, N_i)$ gives

$R_i = \frac{P_i V_i}{N_i T_i}$

Assumptions: ideal gas, static and closed environment.

Errors due to unknown outside factors exist.
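The measurement idea can be tried out on simulated data. The sketch below is only an illustration: the readings are hypothetical, generated from $PV = NRT$ with an assumed 1% relative error on each quantity, and the names `simulate_measurement`, `readings` and `r_values` are made up for this example.

```python
import random
import statistics

R_TRUE = 8.314  # J/(mol K); used only to generate the hypothetical readings


def simulate_measurement(rng):
    """Return one hypothetical (P, V, T, N) reading with small random errors."""
    n_true = rng.uniform(0.5, 2.0)      # mol
    t_true = rng.uniform(280.0, 350.0)  # K
    v_true = rng.uniform(0.01, 0.05)    # m^3
    p_true = n_true * R_TRUE * t_true / v_true  # Pa, from PV = NRT

    def noisy(x):
        # Assumed error model: about 1% relative error on every reading.
        return x * (1.0 + rng.gauss(0.0, 0.01))

    return noisy(p_true), noisy(v_true), noisy(t_true), noisy(n_true)


rng = random.Random(0)  # fixed seed so the run is reproducible
readings = [simulate_measurement(rng) for _ in range(50)]

# R_i = P_i V_i / (N_i T_i) for each reading
r_values = [p * v / (n * t) for p, v, t, n in readings]

r_mean = statistics.mean(r_values)
r_sd = statistics.stdev(r_values)
print(f"mean of R_i = {r_mean:.3f}, standard deviation of R_i = {r_sd:.3f}")
```

The individual $R_i$ scatter around the constant because of the errors in the readings, which is what motivates treating the relationship statistically.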

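Statistical Model
Observed data:
$p = P + \varepsilon_p$,  $v = V + \varepsilon_v$,  $t = T + \varepsilon_t$,  $n = N + \varepsilon_n$
where $\varepsilon_p, \varepsilon_v, \varepsilon_t, \varepsilon_n$ are unobserved (random) measurement errors.

Ideal gas law: substituting into $PV = NRT$,
$(p - \varepsilon_p)(v - \varepsilon_v) = (n - \varepsilon_n) R (t - \varepsilon_t)$
$pv = nRt + (\varepsilon_p v + \varepsilon_v p - \varepsilon_p \varepsilon_v - R t \varepsilon_n - R n \varepsilon_t + R \varepsilon_n \varepsilon_t)$

Statistical model = systematic component ($nRt$) + random errors (the bracketed terms)
Model parameter: unknown parameter in the systematic component, e.g. the universal gas constant $R$

Analysis of Variance Model (ANOVA)

One-way ANOVA
Compare multiple populations

Population                  Data
$N(\mu_1, \sigma_1^2)$      $Y_{11}, Y_{12}, \ldots, Y_{1 n_1}$
$N(\mu_2, \sigma_2^2)$      $Y_{21}, Y_{22}, \ldots, Y_{2 n_2}$
...
$N(\mu_a, \sigma_a^2)$      $Y_{a1}, Y_{a2}, \ldots, Y_{a n_a}$

Assumptions
1. Normal
2. Equal variances: $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_a^2 = \sigma^2$
3. Independence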

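One-way ANOVA
Total sample size: $N = \sum_{i=1}^{a} n_i$
Overall population mean (grand mean): $\mu = \frac{1}{N} \sum_{i=1}^{a} n_i \mu_i$
ith treatment effect: $\alpha_i = \mu_i - \mu$, so that $\sum_{i=1}^{a} n_i \alpha_i = 0$
Random errors: $\varepsilon_{ij} = Y_{ij} - \mu_i = Y_{ij} - \mu - \alpha_i$

ANOVA model
$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$,  $j = 1, 2, \ldots, n_i$,  $i = 1, 2, \ldots, a$
$\varepsilon_{ij} \overset{iid}{\sim} N(0, \sigma^2)$,  $\sum_{i=1}^{a} n_i \alpha_i = 0$

Between group: the group means are $\mu + \alpha_1, \mu + \alpha_2, \ldots, \mu + \alpha_a$
Within group: for example, in group 2,
$\mu + \alpha_2 + \varepsilon_{21} = Y_{21}$
$\mu + \alpha_2 + \varepsilon_{22} = Y_{22}$
...
$\mu + \alpha_2 + \varepsilon_{2 n_2} = Y_{2 n_2}$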

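Test for Treatment Effects
$H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_a = 0$  vs  $H_1$: some $\alpha_i \neq 0$
$H_0$: There is no treatment effect (the population means are all the same).
$H_1$: There are some treatment effects (the population means are not all the same).

Breakdown of sum of squares
ith sample mean: $\bar{Y}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} Y_{ij}$
Overall sample mean: $\bar{Y} = \frac{1}{N} \sum_{i=1}^{a} \sum_{j=1}^{n_i} Y_{ij} = \frac{1}{N} \sum_{i=1}^{a} n_i \bar{Y}_i$

$Y_{ij} - \bar{Y} = (\bar{Y}_i - \bar{Y}) + (Y_{ij} - \bar{Y}_i)$
$\sum_{i=1}^{a} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y})^2 = \sum_{i=1}^{a} n_i (\bar{Y}_i - \bar{Y})^2 + \sum_{i=1}^{a} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2$
$SS_T = SS_A + SS_E$

Total sum of squares: $SS_T = \sum_{i=1}^{a} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y})^2$
Treatment sum of squares (between-group variation): $SS_A = \sum_{i=1}^{a} \sum_{j=1}^{n_i} (\bar{Y}_i - \bar{Y})^2 = \sum_{i=1}^{a} n_i (\bar{Y}_i - \bar{Y})^2$
Error sum of squares (within-group variation): $SS_E = \sum_{i=1}^{a} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2$

Treatment mean squares: $MS_A = \frac{SS_A}{a - 1} = \frac{1}{a - 1} \sum_{i=1}^{a} n_i (\bar{Y}_i - \bar{Y})^2$
Error mean squares: $MS_E = \frac{SS_E}{N - a} = \frac{1}{N - a} \sum_{i=1}^{a} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2$

$H_1$ true: the $\mu_i$ are not all the same, so the variation of $\bar{Y}_i$ around $\bar{Y}$ is large and $MS_A$ tends to be large.
$MS_E$ is unaffected by the population means.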
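The breakdown $SS_T = SS_A + SS_E$ relies on a cross term vanishing. A short derivation, using only the definitions of $\bar{Y}_i$ and $\bar{Y}$ given in these slides, may make that step explicit:

```latex
\begin{align*}
\sum_{i=1}^{a}\sum_{j=1}^{n_i} (Y_{ij} - \bar{Y})^2
 &= \sum_{i=1}^{a}\sum_{j=1}^{n_i} \left[ (\bar{Y}_i - \bar{Y}) + (Y_{ij} - \bar{Y}_i) \right]^2 \\
 &= \sum_{i=1}^{a} n_i (\bar{Y}_i - \bar{Y})^2
  + \sum_{i=1}^{a}\sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2
  + 2 \sum_{i=1}^{a} (\bar{Y}_i - \bar{Y}) \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i) \\
 &= SS_A + SS_E ,
 \qquad \text{since } \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i) = n_i \bar{Y}_i - n_i \bar{Y}_i = 0 .
\end{align*}
```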

Test for Treatment Effects
Treatment mean squares: $MS_A = \frac{SS_A}{a - 1} = \frac{1}{a - 1} \sum_{i=1}^{a} n_i (\bar{Y}_i - \bar{Y})^2$
Error mean squares: $MS_E = \frac{SS_E}{N - a} = \frac{1}{N - a} \sum_{i=1}^{a} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2$
Test statistic: $F = \frac{MS_A}{MS_E}$
Reject $H_0$ if $F_{obs}$ is too large: reject $H_0$ if $F_{obs} > F(a - 1, N - a, \alpha)$, where $F(a - 1, N - a, \alpha)$ is obtained from the F distribution table.

F Distribution
$X \sim F(r_1, r_2)$
$f(x) = \frac{\Gamma\left(\frac{r_1 + r_2}{2}\right)}{\Gamma\left(\frac{r_1}{2}\right) \Gamma\left(\frac{r_2}{2}\right)} \left(\frac{r_1}{r_2}\right)^{r_1/2} x^{r_1/2 - 1} \left(1 + \frac{r_1 x}{r_2}\right)^{-(r_1 + r_2)/2}$,  $x > 0$
$E(X) = \frac{r_2}{r_2 - 2}$
$Var(X) = \frac{2 r_2^2 (r_1 + r_2 - 2)}{r_1 (r_2 - 2)^2 (r_2 - 4)}$

[Figure: F densities for $(r_1, r_2)$ = (2, 4), (4, 6), (9, 9), (12, 12), plotted for x from 0 to 5.]

F Distribution Table
[Figure: density of $F(r_1, r_2)$ with the upper-tail point $F(r_1, r_2, \alpha)$ marked.]
$F(3, 4, 0.05) = 6.59$
$F(4, 6, 0.01) = 9.15$
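The density, mean and variance formulas of the F distribution can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names and the crude trapezoidal integrator are choices made for this illustration, not part of the slides.

```python
import math


def f_density(x, r1, r2):
    """Density of the F(r1, r2) distribution at x (zero for x <= 0)."""
    if x <= 0:
        return 0.0
    log_const = (math.lgamma((r1 + r2) / 2) - math.lgamma(r1 / 2)
                 - math.lgamma(r2 / 2) + (r1 / 2) * math.log(r1 / r2))
    log_kernel = ((r1 / 2 - 1) * math.log(x)
                  - ((r1 + r2) / 2) * math.log(1 + r1 * x / r2))
    return math.exp(log_const + log_kernel)


def f_mean(r2):
    """E(X) = r2 / (r2 - 2), defined for r2 > 2."""
    return r2 / (r2 - 2)


def f_variance(r1, r2):
    """Var(X) = 2 r2^2 (r1 + r2 - 2) / (r1 (r2 - 2)^2 (r2 - 4)), for r2 > 4."""
    return 2 * r2 ** 2 * (r1 + r2 - 2) / (r1 * (r2 - 2) ** 2 * (r2 - 4))


def integrate(func, lower, upper, steps):
    """Trapezoidal rule: a rough numerical check only."""
    h = (upper - lower) / steps
    total = 0.5 * (func(lower) + func(upper))
    for k in range(1, steps):
        total += func(lower + k * h)
    return total * h


for r1, r2 in [(2, 4), (4, 6), (9, 9), (12, 12)]:
    area = integrate(lambda x: f_density(x, r1, r2), 0.0, 200.0, 100_000)
    print(f"F({r1},{r2}): f(1) = {f_density(1.0, r1, r2):.4f}, "
          f"area on (0, 200] = {area:.4f}, E(X) = {f_mean(r2):.4f}")
```

The area under each density should come out close to 1. Critical values such as $F(3, 4, 0.05)$ still have to come from a table or a statistics library, since this sketch does not invert the distribution function.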

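ANOVA Table
$H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_a = 0$  vs  $H_1$: some $\alpha_i \neq 0$
Test statistic: $F = \frac{MS_A}{MS_E}$
Reject $H_0$ if $F_{obs} > F(a - 1, N - a, \alpha)$.

Source      SS     d.f.     MS             F-ratio
Treatment   SSA    a - 1    SSA/(a - 1)    MSA / MSE
Error       SSE    N - a    SSE/(N - a)
Total       SST    N - 1

Computational Formulae
ith total: $T_i = \sum_{j=1}^{n_i} Y_{ij}$
Overall total: $T_{..} = \sum_{i=1}^{a} T_i$

$SS_A = \sum_{i=1}^{a} n_i (\bar{Y}_i - \bar{Y})^2 = \sum_{i=1}^{a} \frac{T_i^2}{n_i} - \frac{T_{..}^2}{N}$
$SS_E = \sum_{i=1}^{a} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2 = \sum_{i=1}^{a} \sum_{j=1}^{n_i} Y_{ij}^2 - \sum_{i=1}^{a} \frac{T_i^2}{n_i}$
$SS_T = \sum_{i=1}^{a} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y})^2 = \sum_{i=1}^{a} \sum_{j=1}^{n_i} Y_{ij}^2 - \frac{T_{..}^2}{N}$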

One-way ANOVA
Example: Color brightness of films
Brands: Kodak, Agfa, Fuji ($a = 3$);  $n_1 = n_2 = n_3 = 15$,  $N = 45$

Data
i    Readings                                                        T_i
1    32, 34, 31, 30, 37, 28, 28, 27, 30, 32, 26, 29, 27, 30, 31      452
2    41, 44, 50, 32, 32, 38, 38, 43, 47, 32, 36, 35, 34, 40, 36      578
3    23, 24, 25, 21, 26, 25, 27, 26, 22, 25, 27, 30, 25, 25, 27      378

$T_{..} = 452 + 578 + 378 = 1408$,  $\sum_{i=1}^{3} \sum_{j=1}^{15} Y_{ij}^2 = 46040$

$H_0: \mu_1 = \mu_2 = \mu_3$  vs  $H_1$: not $H_0$,  $\alpha = 0.05$

$SS_A = \sum_{i=1}^{3} \frac{T_i^2}{n_i} - \frac{T_{..}^2}{N} = \frac{452^2}{15} + \frac{578^2}{15} + \frac{378^2}{15} - \frac{1408^2}{45} = 1363.38$
$SS_T = \sum_{i=1}^{3} \sum_{j=1}^{15} Y_{ij}^2 - \frac{T_{..}^2}{N} = 46040 - \frac{1408^2}{45} = 1985.24$
$SS_E = SS_T - SS_A = 1985.24 - 1363.38 = 621.86$

Source      SS         d.f.    MS        F-ratio
Treatment   1363.38    2       681.69    46.03
Error       621.86     42      14.81
Total       1985.24    44

From the F distribution table: $F(2, 42, 0.05) \approx F(2, 40, 0.05) = 3.23$
F-ratio = 46.03 > 3.23
Reject $H_0$ at $\alpha = 0.05$.
The color brightness of the three brands of film differs significantly.
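The film example can be reproduced with the computational formulae. The sketch below is an illustration under stated assumptions: the function name `one_way_anova` is made up here, and the second sample's readings are taken as recovered from the slide (they reproduce $T_2 = 578$ and $\sum\sum Y_{ij}^2 = 46040$).

```python
# Film brightness readings, three samples of 15 (as read from the example).
samples = [
    [32, 34, 31, 30, 37, 28, 28, 27, 30, 32, 26, 29, 27, 30, 31],
    [41, 44, 50, 32, 32, 38, 38, 43, 47, 32, 36, 35, 34, 40, 36],
    [23, 24, 25, 21, 26, 25, 27, 26, 22, 25, 27, 30, 25, 25, 27],
]


def one_way_anova(groups):
    """One-way ANOVA table entries via the totals-based computational formulae."""
    a = len(groups)
    n_total = sum(len(g) for g in groups)
    totals = [sum(g) for g in groups]                # T_i
    grand_total = sum(totals)                        # T..
    sum_sq = sum(y * y for g in groups for y in g)   # sum of Y_ij^2
    between = sum(t * t / len(g) for t, g in zip(totals, groups))
    correction = grand_total ** 2 / n_total          # T..^2 / N
    ss_a = between - correction
    ss_e = sum_sq - between
    ss_t = sum_sq - correction
    ms_a = ss_a / (a - 1)
    ms_e = ss_e / (n_total - a)
    return {"SSA": ss_a, "SSE": ss_e, "SST": ss_t,
            "MSA": ms_a, "MSE": ms_e, "F": ms_a / ms_e}


table = one_way_anova(samples)
for name, value in table.items():
    print(f"{name} = {value:.2f}")
```

Without intermediate rounding the F-ratio comes out near 46.04; the 46.03 in the table results from dividing the rounded mean squares 681.69 and 14.81.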

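Estimation
Treatment effect: $\alpha_i$
Point estimate: $\bar{Y}_i - \bar{Y}$
Interval estimate: $(\bar{Y}_i - \bar{Y}) \pm t_{N-a, \alpha/2} \sqrt{MS_E \left(\frac{1}{n_i} - \frac{1}{N}\right)}$

Difference in treatment effects: $\alpha_i - \alpha_j$
Point estimate: $\bar{Y}_i - \bar{Y}_j$
Interval estimate: $(\bar{Y}_i - \bar{Y}_j) \pm t_{N-a, \alpha/2} \sqrt{MS_E \left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$

Example: Color brightness of films
$\bar{Y}_1 = \frac{452}{15} = 30.13$,  $\bar{Y}_2 = \frac{578}{15} = 38.53$,  $\bar{Y}_3 = \frac{378}{15} = 25.2$,  $\bar{Y} = \frac{1408}{45} = 31.29$

95% C.I. for $\alpha_1$:
$(\bar{Y}_1 - \bar{Y}) \pm t_{42, 0.025} \sqrt{MS_E \left(\frac{1}{n_1} - \frac{1}{N}\right)} = (30.13 - 31.29) \pm 2.021 \sqrt{14.81 \left(\frac{1}{15} - \frac{1}{45}\right)} = -1.16 \pm 1.64 = [-2.80, 0.48]$

95% C.I. for $\alpha_2 - \alpha_3$:
$(\bar{Y}_2 - \bar{Y}_3) \pm t_{42, 0.025} \sqrt{MS_E \left(\frac{1}{n_2} + \frac{1}{n_3}\right)} = (38.53 - 25.2) \pm 2.021 \sqrt{14.81 \left(\frac{1}{15} + \frac{1}{15}\right)} = 13.33 \pm 2.84 = [10.49, 16.17]$,  so $\alpha_2 > \alpha_3$

95% C.I. for $\alpha_1 - \alpha_2$: $[-11.24, -5.56]$,  so $\alpha_1 < \alpha_2$
95% C.I. for $\alpha_1 - \alpha_3$: $[2.09, 7.77]$,  so $\alpha_1 > \alpha_3$

Conclusion: $\alpha_2 > \alpha_1 > \alpha_3$
Overall confidence < 95%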
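The intervals of the film example can be recomputed from the summary figures. This is a sketch only: the helper names are invented for the illustration, and the critical value 2.021 is hard-coded as read from a t table rather than computed.

```python
import math

MSE = 14.81       # error mean squares from the ANOVA table
T_CRIT = 2.021    # t critical value for 95% intervals, taken from a t table
N_TOTAL = 45
means = {1: 452 / 15, 2: 578 / 15, 3: 378 / 15}
sizes = {1: 15, 2: 15, 3: 15}
grand_mean = 1408 / 45


def effect_interval(i):
    """Interval for the treatment effect alpha_i."""
    estimate = means[i] - grand_mean
    half = T_CRIT * math.sqrt(MSE * (1 / sizes[i] - 1 / N_TOTAL))
    return estimate - half, estimate + half


def difference_interval(i, j):
    """Interval for the difference alpha_i - alpha_j."""
    estimate = means[i] - means[j]
    half = T_CRIT * math.sqrt(MSE * (1 / sizes[i] + 1 / sizes[j]))
    return estimate - half, estimate + half


print("alpha_1:", [round(v, 2) for v in effect_interval(1)])
for i, j in [(1, 2), (1, 3), (2, 3)]:
    print(f"alpha_{i} - alpha_{j}:", [round(v, 2) for v in difference_interval(i, j)])
```

An interval for a difference that excludes zero supports ordering the two effects, which is how the ranking of the three brands is read off. Each interval has 95% confidence on its own, so the joint confidence of all the comparisons is lower.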

Two-way ANOVA
Example: Brightness of synthetic fabric

                    Temperature
Time (cycles)       350F          375F          400F
40                  38, 32, 30    37, 35, 40    36, 39, 43
50                  40, 45, 36    39, 42, 46    39, 48, 47

Two-way factorial ANOVA model:
$Y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}$,  $i = 1, 2$,  $j = 1, 2, 3$,  $k = 1, 2, 3$
$\sum_i \alpha_i = \sum_j \beta_j = \sum_i \gamma_{ij} = \sum_j \gamma_{ij} = 0$
$\varepsilon_{ijk} \overset{iid}{\sim} N(0, \sigma^2)$
Here i indexes Time (i = 1: 40, i = 2: 50) and j indexes Temperature (j = 1: 350, j = 2: 375, j = 3: 400).

MTB > print 'Bright' 'Time' 'Temp'

Data Display
Row   Bright   Time   Temp
  1       38     40    350
  2       32     40    350
  3       30     40    350
  4       37     40    375
  5       35     40    375
  6       40     40    375
  7       36     40    400
  8       39     40    400
  9       43     40    400
 10       40     50    350
 11       45     50    350
 12       36     50    350
...

MTB > ANOVA 'Bright' = Time Temp Time*Temp.

Analysis of Variance (Balanced Designs)
Factor   Type    Levels   Values
Time     fixed        2    40    50
Temp     fixed        3   350   375   400

Analysis of Variance for Bright
Source       DF       SS       MS      F       P
Time          1   150.22   150.22   9.69   0.009   (significant)
Temp          2    80.78    40.39   2.61   0.115
Time*Temp     2     3.44     1.72   0.11   0.896
Error        12   186.00    15.50
Total        17   420.44
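The sums of squares in the Minitab output can be reproduced by hand for this balanced design. The sketch below uses the standard balanced two-way formulas based on totals; the variable names are made up for the illustration, and p-values are not computed because that would need the F distribution function.

```python
# Brightness readings keyed by (time in cycles, temperature in F), 3 replicates each.
cells = {
    (40, 350): [38, 32, 30], (40, 375): [37, 35, 40], (40, 400): [36, 39, 43],
    (50, 350): [40, 45, 36], (50, 375): [39, 42, 46], (50, 400): [39, 48, 47],
}
time_levels = sorted({time for time, _ in cells})
temp_levels = sorted({temp for _, temp in cells})
a, b, r = len(time_levels), len(temp_levels), 3   # levels of each factor, replicates
n_total = a * b * r

grand_total = sum(sum(v) for v in cells.values())
correction = grand_total ** 2 / n_total

ss_total = sum(y * y for v in cells.values() for y in v) - correction
ss_time = sum(sum(sum(cells[(t, c)]) for c in temp_levels) ** 2
              for t in time_levels) / (b * r) - correction
ss_temp = sum(sum(sum(cells[(t, c)]) for t in time_levels) ** 2
              for c in temp_levels) / (a * r) - correction
ss_cells = sum(sum(v) ** 2 for v in cells.values()) / r - correction
ss_inter = ss_cells - ss_time - ss_temp           # Time*Temp interaction
ss_error = ss_total - ss_cells

ms_error = ss_error / (a * b * (r - 1))
f_time = (ss_time / (a - 1)) / ms_error
f_temp = (ss_temp / (b - 1)) / ms_error
f_inter = (ss_inter / ((a - 1) * (b - 1))) / ms_error

print(f"Time      SS = {ss_time:7.2f}  F = {f_time:.2f}")
print(f"Temp      SS = {ss_temp:7.2f}  F = {f_temp:.2f}")
print(f"Time*Temp SS = {ss_inter:7.2f}  F = {f_inter:.2f}")
print(f"Error     SS = {ss_error:7.2f}")
print(f"Total     SS = {ss_total:7.2f}")
```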

Interaction
[Figure: group mean against Temperature (350, 375, 400), one line each for Time = 40 and Time = 50, in two panels labeled Additive and Non-additive.]

Regression
Sir Francis Galton (1822-1911)
[Figure: Height of Son plotted against Height of Father.]
The heights of sons, compared with those of their fathers, regressed towards the mean height of the population.

Regression
Simple Linear Regression Examples

Dependent variable (Y)      Independent variable (X)
Job performance             Extent of training
Return of a stock           Risk of the stock
Overall CGA                 A-Level score
Tree age (by C14)           Tree age (by tree rings)

Simple Linear Regression
Scatterplot
Linear model: the relationship between the dependent variable and the independent variable(s)
Simple linear regression: one independent variable
Regression line: a line that fits the data well

Simple Linear Regression Model
Data: $\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$
$Y_i = \alpha + \beta X_i + \varepsilon_i$,  $i = 1, 2, \ldots, n$
$\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$

Example: Y = Height of son (in cm), X = Height of father (in cm)
Suppose the true relation is given by $Y = 0.9X + 15$. Then fathers with the same height would have sons with the same height. Unrealistic!
A more reasonable relationship: $E(Y) = 0.9X + 15$, with $Y = E(Y) + \varepsilon$ (random error).

X (observed)    E(Y) = 0.9X + 15 (unobserved)    Y (observed)    $\varepsilon$ (unobserved)
170             168                              169.3            1.3
175             172.5                            171.7           -0.8
180             177                              174.6           -2.4
185             181.5                            182.2            0.7

Estimate the regression line from these observed data: fit a regression line to the data.
Estimation of Model Parameters
Sample statistics
$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$,  $\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$
$S_{xy} = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} X_i Y_i - n \bar{X} \bar{Y}$
$S_{xx} = \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n \bar{X}^2$
$S_{yy} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - n \bar{Y}^2$

Estimates
$\hat{\beta} = b = \frac{S_{xy}}{S_{xx}}$
$\hat{\alpha} = a = \bar{Y} - b \bar{X}$
Fitted regression line: $\hat{Y} = a + bX$
True regression line: $E(Y) = \alpha + \beta X$

Fitting Regression Line
Example: Study of how wheat yield depends on fertilizer.
X = Fertilizer (in lb/acre), Y = Yield (in bu/acre)

X    100   200   300   400   500   600   700
Y     40    50    50    70    65    65    80

$n = 7$,  $\bar{X} = 400$,  $\bar{Y} = 60$
$\sum_{i=1}^{7} X_i^2 = 1400000$,  $\sum_{i=1}^{7} Y_i^2 = 26350$,  $\sum_{i=1}^{7} X_i Y_i = 184500$

$S_{xx} = \sum X_i^2 - n \bar{X}^2 = 1400000 - (7)(400)^2 = 280000$
$S_{yy} = \sum Y_i^2 - n \bar{Y}^2 = 26350 - (7)(60)^2 = 1150$
$S_{xy} = \sum X_i Y_i - n \bar{X} \bar{Y} = 184500 - (7)(400)(60) = 16500$

$b = \frac{S_{xy}}{S_{xx}} = \frac{16500}{280000} = 0.059$
$a = \bar{Y} - b \bar{X} = 60 - \frac{16500}{280000}(400) = 36.43$
Fitted regression line: $\hat{Y} = 36.43 + 0.059X$

Prediction
$X_0 = 650$:  $\hat{Y}_0 = 36.43 + (0.059)(650) = 74.78$
$X_0 = 400$:  $\hat{Y}_0 = 36.43 + (0.059)(400) = 60.03$
$X_0 = 0$:  $\hat{Y}_0 = 36.43$ ?
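The wheat-yield fit can be reproduced directly from the data. This is a minimal sketch using the sample-statistic formulas of these slides; the `predict` helper is a name introduced for the illustration.

```python
# Fertilizer (lb/acre) and wheat yield (bu/acre) from the example.
x = [100, 200, 300, 400, 500, 600, 700]
y = [40, 50, 50, 70, 65, 65, 80]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xx = sum(xi ** 2 for xi in x) - n * x_bar ** 2
s_yy = sum(yi ** 2 for yi in y) - n * y_bar ** 2
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar

b = s_xy / s_xx            # slope estimate
a = y_bar - b * x_bar      # intercept estimate


def predict(x0):
    """Point prediction from the fitted line at X = x0."""
    return a + b * x0


print(f"S_xx = {s_xx:.0f}, S_yy = {s_yy:.0f}, S_xy = {s_xy:.0f}")
print(f"fitted line: Y = {a:.2f} + {b:.4f} X")
print(f"prediction at X0 = 650: {predict(650):.2f}")
```

With unrounded coefficients the prediction at 650 is about 74.73; the 74.78 on the slide comes from using the rounded values 36.43 and 0.059. The fitted value at $X_0 = 0$ lies far outside the observed fertilizer range of 100 to 700, so it should be read with caution.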

Danger of Extrapolation
[Figure series: SARS Trend. Panels plot No. of cases and No. of patients in hospital against Date, over date ranges between 28-Feb and 8-Jun.]

Nonlinear Relationships

Association ≠ Causation
Example: Price and Demand for gas

Year   Price   Demand      Year   Price   Demand
1960    30      134        1970    58      65
1961    31      112        1971    58      56
1962    37      136        1972    60      58
1963    42      109        1973    73      55
1964    43      105        1974    88      49
1965    45       87        1975    89      39
1966    50       56        1976    92      36
1967    54       43        1977    97      46
1968    54       77        1978   100      40
1969    57       35        1979   102      42

Fitted regression line: Demand = 139.24 - 1.11 Price
Is low demand due to high price?

Simpson's Paradox
[Figure: Demand against Price, with points grouped by Year: 1960-1965, 1966-1973, 1974-1979.]

Test For Regression Effect
Test $H_0: \beta = 0$  vs  $H_1: \beta \neq 0$

Fitted values: $\hat{Y}_i = a + b X_i$
Residuals: $r_i = Y_i - \hat{Y}_i$
Random error: $\varepsilon_i = Y_i - \alpha - \beta X_i$

Decomposition of Variation
$Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$
Variation of Y = Explained variation + Unexplained variation

Breakdown of sum of squares
$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$
$SS_T = SS_R + SS_E$

Total sum of squares: $SS_T = S_{yy}$
Regression sum of squares: $SS_R = \sum_{i=1}^{n} (a + b X_i - \bar{Y})^2 = \sum_{i=1}^{n} (b X_i - b \bar{X})^2 = b^2 S_{xx} = \frac{S_{xy}^2}{S_{xx}}$
Error sum of squares: $SS_E = SS_T - SS_R = S_{yy} - b^2 S_{xx} = S_{yy} - \frac{S_{xy}^2}{S_{xx}}$

Test For Regression Effect
$H_0: \beta = 0$  vs  $H_1: \beta \neq 0$
$MS_R = \frac{SS_R}{1} = SS_R$,  $MS_E = \frac{SS_E}{n - 2}$
Test statistic: $F = \frac{MS_R}{MS_E}$
Reject $H_0$ if $F_{obs} > F(1, n - 2, \alpha)$.

ANOVA table
Source        SS     d.f.     MS             F-ratio
Regression    SSR    1        SSR            MSR / MSE
Error         SSE    n - 2    SSE/(n - 2)
Total         SST    n - 1

Example: Wheat yield example
Regression line: $\hat{Y} = 36.43 + 0.059X$
$S_{xx} = 280000$,  $S_{yy} = 1150$,  $S_{xy} = 16500$

$SS_R = b^2 S_{xx} = (0.059)^2 (280000) = 974.68$
$SS_T = S_{yy} = 1150$
$SS_E = SS_T - SS_R = 1150 - 974.68 = 175.32$

Source        SS        d.f.    MS        F-ratio
Regression    974.68    1       974.68    27.805
Error         175.32    5       35.064
Total         1150      6

$F(1, 5, 0.05) = 6.61 < 27.805$
Reject $H_0$ at $\alpha = 0.05$.
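The regression ANOVA for the wheat data can be recomputed from the raw observations. The sketch below keeps full precision throughout, so its figures differ slightly from the table, which uses the rounded slope b = 0.059: the unrounded regression sum of squares is $S_{xy}^2 / S_{xx} \approx 972.32$ rather than 974.68. The critical value 6.61 is hard-coded from the F table.

```python
# Fertilizer (lb/acre) and wheat yield (bu/acre) from the example.
x = [100, 200, 300, 400, 500, 600, 700]
y = [40, 50, 50, 70, 65, 65, 80]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

ss_t = s_yy                    # total sum of squares
ss_r = s_xy ** 2 / s_xx        # regression sum of squares, equals b^2 S_xx
ss_e = ss_t - ss_r             # error sum of squares

ms_r = ss_r / 1
ms_e = ss_e / (n - 2)
f_ratio = ms_r / ms_e

F_CRIT = 6.61                  # F(1, 5, 0.05), taken from an F table
print(f"SSR = {ss_r:.2f}, SSE = {ss_e:.2f}, SST = {ss_t:.2f}")
print(f"MSE = {ms_e:.3f}, F = {f_ratio:.2f}")
print("reject H0" if f_ratio > F_CRIT else "do not reject H0")
```

Either way the F-ratio is well above the critical value, so the conclusion of the test is unchanged.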

Coefficient of Determination
$R^2 = \frac{SS_R}{SS_T} = \frac{\text{Explained variation}}{\text{Total variation}}$,  $0 \le R^2 \le 1$
$R^2 = 0$: no linear relationship
$R^2 = 1$: perfect linear relationship
A strong relationship gives high prediction power.
Example: $R^2 = \frac{974.68}{1150} = 84.8\%$

C.I. For Regression Parameters
100(1 - $\alpha$)% C.I. for $\beta$:
$b \pm t_{n-2, \alpha/2} \sqrt{\frac{MS_E}{S_{xx}}}$

100(1 - $\alpha$)% C.I. for $\alpha$:
$a \pm t_{n-2, \alpha/2} \sqrt{MS_E \left(\frac{1}{n} + \frac{\bar{X}^2}{S_{xx}}\right)}$

Large $S_{xx}$: more accurate estimates
Demonstration

Example: Wheat yield example
Regression line: $\hat{Y} = 36.43 + 0.059X$

Source        SS        d.f.    MS        F-ratio
Regression    974.68    1       974.68    27.805
Error         175.32    5       35.064
Total         1150      6

95% C.I. for $\beta$:
$b \pm t_{5, 0.025} \sqrt{\frac{MS_E}{S_{xx}}} = 0.059 \pm 2.57 \sqrt{\frac{35.064}{280000}} = [0.0302, 0.0878]$

95% C.I. for $\alpha$:
$a \pm t_{5, 0.025} \sqrt{MS_E \left(\frac{1}{n} + \frac{\bar{X}^2}{S_{xx}}\right)} = 36.43 \pm 2.57 \sqrt{35.064 \left(\frac{1}{7} + \frac{(400)^2}{280000}\right)} = 36.43 \pm 12.86 = [23.57, 49.29]$
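The two interval formulas can be evaluated from the summary figures of the wheat example. A sketch under stated assumptions: the function names are invented here, the inputs are the rounded values shown on the slides, and the critical value 2.57 is hard-coded from a t table.

```python
import math

# Summary figures from the wheat yield example (rounded as on the slides).
A_HAT, B_HAT = 36.43, 0.059
MSE, S_XX = 35.064, 280000
N_OBS, X_BAR = 7, 400
T_CRIT = 2.57   # t critical value with 5 d.f. for a 95% interval, from a t table


def beta_interval():
    """Interval for the slope: b +/- t * sqrt(MSE / S_xx)."""
    half = T_CRIT * math.sqrt(MSE / S_XX)
    return B_HAT - half, B_HAT + half


def alpha_interval():
    """Interval for the intercept: a +/- t * sqrt(MSE (1/n + X_bar^2 / S_xx))."""
    half = T_CRIT * math.sqrt(MSE * (1 / N_OBS + X_BAR ** 2 / S_XX))
    return A_HAT - half, A_HAT + half


print("95% C.I. for beta: ", [round(v, 4) for v in beta_interval()])
print("95% C.I. for alpha:", [round(v, 2) for v in alpha_interval()])
```

The slope interval excludes zero, in line with the F test. The intercept interval is wide because $\bar{X} = 400$ is far from zero relative to the spread of the X values.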

Prediction
Predict the value of $Y_0$ at a fixed value of $X = X_0$.
Point prediction: $\hat{Y}_0 = a + b X_0$
100(1 - $\alpha$)% prediction interval (P.I.):
$\hat{Y}_0 \pm t_{n-2, \alpha/2} \sqrt{MS_E \left(1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{S_{xx}}\right)}$

Example: Wheat yield example
Regression line: $\hat{Y} = 36.43 + 0.059X$

Source        SS        d.f.    MS        F-ratio
Regression    974.68    1       974.68    27.805
Error         175.32    5       35.064
Total         1150      6

At $X_0 = 450$:  $\hat{Y}_0 = 36.43 + (0.059)(450) = 62.98$

90% prediction interval:
$\hat{Y}_0 \pm t_{5, 0.05} \sqrt{MS_E \left(1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{S_{xx}}\right)} = 62.98 \pm 2.02 \sqrt{35.064 \left(1 + \frac{1}{7} + \frac{(450 - 400)^2}{280000}\right)} = 62.98 \pm 12.837 = [50.143, 75.817]$
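The prediction interval at $X_0 = 450$ can be checked with a few lines of code. This is an illustrative sketch: `prediction_interval` is a made-up helper, the inputs are the rounded summary figures of the wheat example, and the critical value 2.02 is the one the slide's arithmetic implies for a 90% interval with 5 d.f.

```python
import math

# Summary figures from the wheat yield example (rounded as on the slides).
A_HAT, B_HAT = 36.43, 0.059
MSE, S_XX = 35.064, 280000
N_OBS, X_BAR = 7, 400


def prediction_interval(x0, t_crit):
    """Point prediction and prediction interval for a new Y at X = x0."""
    y0 = A_HAT + B_HAT * x0
    half = t_crit * math.sqrt(MSE * (1 + 1 / N_OBS + (x0 - X_BAR) ** 2 / S_XX))
    return y0, y0 - half, y0 + half


y0, lower, upper = prediction_interval(450, t_crit=2.02)
print(f"point prediction = {y0:.2f}")
print(f"90% P.I. = [{lower:.3f}, {upper:.3f}]")
```

The leading 1 inside the square root accounts for the random error of the new observation itself, which is why a prediction interval is wider than a confidence interval for the mean response at the same $X_0$.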

Multiple Linear Regression
Example: Fuel consumption data

$FUEL = \beta_0 + \beta_1 TAX + \beta_2 DLIC + \beta_3 INC + \beta_4 ROAD + \varepsilon$

Data Display
Row  State    POP     TAX    NLIC    INC      ROAD    FUELC    DLIC
  1  ME      1029    9.00     540   3.571    1.976      557   52.4781
  2  NH       771    9.00     441   4.092    1.250      404   57.1984
  3  VT       462    9.00     268   3.865    1.586      259   58.0087
  4  MA      5787    7.50    3060   4.870    2.351     2396   52.8771
  5  RI       968    8.00     527   4.399    0.431      397   54.4422
  6  CN      3082   10.00    1760   5.342    1.333     1408   57.1058
  7  NY     18366    8.00    8278   5.319   11.868     6312   45.0724
  8  NJ      7367    8.00    4074   5.126    2.138     3439   55.3007
  9  PA     11926    8.00    6312   4.447    8.577     5528   52.9264
 10  OH     10783    7.00    5948   4.512    8.507     5375   55.1609
 11  IN      5291    8.00    2804   4.391    5.939     3068   52.9957
 12  IL     11251    7.50    5903   5.126   14.186     5301   52.4664
...

Regression Analysis
The regression equation is
FUEL = 37.7 - 3.48 TAX + 1.34 DLIC - 6.65 INC - 0.242 ROAD

Predictor      Coef     Stdev   t-ratio       p
Constant      37.68     18.57      2.03   0.049
TAX          -3.478     1.298     -2.68   0.010
DLIC         1.3366    0.1924      6.95   0.000
INC          -6.651     1.723     -3.86   0.000
ROAD        -0.2417    0.3391     -0.71   0.480

s = 6.633    R-sq = 67.8%    R-sq(adj) = 64.9%

Analysis of Variance
SOURCE        DF        SS        MS       F       p
Regression     4   3991.92    997.98   22.68   0.000
Error         43   1892.05     44.00
Total         47   5883.96

Unusual Observations
Obs.   TAX     FUEL      Fit   Stdev.Fit   Residual   St.Resid
 37    5.0   63.963   64.758       3.723     -0.795     -0.14 X
 40    7.0   96.812   73.371       2.102     23.441      3.73R

R denotes an obs. with a large st. resid.
X denotes an obs. whose X value gives it large influence.