
Introduction to Generalized Linear Models
2007 CAS Predictive Modeling Seminar

Prepared by
Louise Francis
Francis Analytics and Actuarial Data Mining, Inc.
www.data-mines.com
Louise_francis@msn.com
October 11, 2007

Objectives

Gentle introduction to Linear Models
Illustrate some simple applications of linear models
Address some practical modeling issues
Show features common to LMs and GLMs

Predictive Modeling Family

Predictive Modeling
  Classical Linear Models
  GLMs
  Data Mining

Linear Models Are Basic Statistical Building Blocks: Ex: Mean Payment by Age Group

Linear Model for Means: A Step Function. Ex: Mean Payment by Age Group

Linear Models Based on Means: Severity Payment by Age Group and Attorney Involvement

An Introduction to Linear Regression

Intro to Regression Cont.

Fits the line that minimizes the squared deviation between actual and fitted values:

$\min \sum_i (Y_i - \hat{Y}_i)^2$
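The criterion above can be sketched in a few lines of NumPy. This is a minimal illustration with toy x/y values, not the seminar's claim data.

```python
import numpy as np

# Toy data; not the seminar's claim data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# polyfit returns the slope and intercept minimizing sum((y - (a + b*x))**2)
b, a = np.polyfit(x, y, 1)  # highest-degree coefficient first
fitted = a + b * x
sse = np.sum((y - fitted) ** 2)
print(round(a, 4), round(b, 4), round(sse, 4))
```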

Some Work-Related Liability Data

Closed claims from the Texas Dept of Insurance
  Total Award
  Initial Indemnity Reserve
  Policy Limit
  Attorney Involvement
  Lags
    Closing
    Report
  Injury
    Sprain, back injury, death, etc.

The data, along with some of the analysis, will be posted on the internet

Simple Illustration
Total Settlement vs. Initial Indemnity Reserve

How Strong Is the Linear Relationship?: Correlation Coefficient

Varies between -1 and 1
Zero = no linear correlation

                       lnInitialIndemnityRes  lnTotalAward  lnInitialExpense  lnReportlag
lnInitialIndemnityRes                  1.000
lnTotalAward                           0.303         1.000
lnInitialExpense                       0.118         0.227             1.000
lnReportlag                           -0.112         0.048             0.090        1.000
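A matrix like the one above can be produced outside Excel as well; here is a hedged pandas sketch with toy values, where only the column names mirror the slide's variables.

```python
import pandas as pd

# Toy values; only the column names mirror the slide's variables.
df = pd.DataFrame({
    "lnInitialIndemnityRes": [9.2, 10.1, 8.7, 11.0, 9.9],
    "lnTotalAward": [10.1, 11.3, 9.5, 12.2, 10.8],
})
corr = df.corr()  # Pearson correlation; each entry lies in [-1, 1]
print(corr.round(3))
```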

Scatterplot Matrix

Prepared with Excel add-in XLMiner

Linear Modeling Tools Widely Available: Excel Analysis ToolPak

Install the Data Analysis ToolPak (add-in) that comes with Excel
Click Tools, Data Analysis, Regression

How Good Is the Fit?

First Step: Compute the residuals

Residual = actual - fitted

Y = lnTotalAward   Predicted   Residual
10.13              11.76       -1.63
14.08              12.47        1.61
10.31              11.65       -1.34

Sum the squares of the residuals (SSE)
Compute the total variance of the data with no model (SST)
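The bookkeeping above can be sketched directly, using the three illustrative rows from the table:

```python
import numpy as np

# The three illustrative rows from the slide (actual = lnTotalAward).
actual = np.array([10.13, 14.08, 10.31])
predicted = np.array([11.76, 12.47, 11.65])

residual = actual - predicted                 # residual = actual - fitted
sse = np.sum(residual ** 2)                   # sum of squared residuals
sst = np.sum((actual - actual.mean()) ** 2)   # total variation with no model
print(residual.round(2), round(sse, 4), round(sst, 4))
```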

Goodness of Fit Statistics

R²: (SS Regression/SS Total)
  The percentage of variance explained
Adjusted R²
  R² adjusted for the number of coefficients in the model

Note: SSE = sum of squared errors; MS = mean square error
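A sketch of the R² arithmetic, using the sums of squares from the multiple-regression output later in these slides (n observations, k predictors):

```python
# Sums of squares taken from the regression output later in these slides.
ss_regression = 718.36
ss_total = 2891.45
n, k = 1802, 15

r2 = ss_regression / ss_total                   # fraction of variance explained
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # penalizes extra coefficients
print(round(r2, 5), round(adj_r2, 5))
```

The results reproduce the R Square (0.24844) and Adjusted R Square (0.24213) shown in that output.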

R2 Statistic

Significance of the Regression

F statistic: (Mean Square of Regression / Mean Square of Residual)
  df of numerator = k = number of predictor variables
  df of denominator = N - k - 1
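The F statistic can be sketched the same way, again using the ANOVA sums of squares from the regression output later in these slides:

```python
# ANOVA sums of squares from the regression output later in these slides.
ss_regression, df_regression = 718.36, 15    # k = 15 predictors
ss_residual, df_residual = 2173.09, 1786     # N - k - 1 = 1802 - 15 - 1

f_stat = (ss_regression / df_regression) / (ss_residual / df_residual)
print(round(f_stat, 3))
```

This reproduces the F of 39.360 shown in that ANOVA table.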

ANOVA (Analysis of Variance) Table

Standard way to evaluate the fit of a model
Breaks the total sum of squares into model and residual components

Goodness of Fit Statistics

T statistics: Use the SE of a coefficient to determine whether it is significant
  The SE of a coefficient is a function of s (the standard error of the regression)
  Uses the t-distribution for the test
  It is customary to drop a variable if its coefficient is not significant

T-Statistic: Are the Intercept and Coefficient Significant?

                       Coefficients  Standard Error   t Stat    P-value
Intercept                    10.343           0.112   92.122    0
lnInitialIndemnityRes         0.154           0.011   13.530    8.21E-40

Other Diagnostics: Residual Plot

Independent Variable vs. Residual
Points should scatter randomly around zero
If not, regression assumptions are violated

Predicted vs. Residual

Random Residual
(Figure: residual plot for data with normally distributed, randomly generated errors)

What May Residuals Indicate?

If the absolute size of the residuals increases as the predicted value increases, this may indicate nonconstant variance
  May indicate a need to log the dependent variable
  May need to use weighted regression
May indicate a nonlinear relationship

Standardized Residual: Find Outliers

$z_i = \frac{y_i - \hat{y}_i}{s_e}, \quad s_e = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N - k - 1}}$
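The formula above translates directly to code; this toy sketch uses k = 1 predictor, so the denominator degrees of freedom are N - k - 1.

```python
import numpy as np

# Toy actual and fitted values; k = 1 predictor.
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
y_hat = np.array([2.1, 4.0, 6.0, 8.0, 10.0])
k = 1

resid = y - y_hat
se = np.sqrt(np.sum(resid ** 2) / (len(y) - k - 1))
z = resid / se
outliers = np.abs(z) > 2   # flag observations more than 2 SEs out
print(z.round(2), outliers.sum())
```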

Standardized Residuals by Observation

(Figure: standardized residuals plotted by observation number, 1 to 2000; values range from about -4 to 6)

Outliers

May represent errors
May be legitimate but have undue influence on the regression
Can downweight outliers
  Weight inversely proportional to the variance of the observation
  Robust regression
    Based on absolute deviations
    Based on lower weights for more extreme observations
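A hedged NumPy sketch of the downweighting idea: iteratively reweighted least squares with weights inversely proportional to the absolute residual, which is one route to the absolute-deviations approach mentioned above. The data are toy values with a deliberate outlier, not the claim data.

```python
import numpy as np

# Toy data; the last point is a deliberate outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 50.0])

X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]   # ordinary least squares
beta_robust = beta.copy()
for _ in range(100):
    resid = y - X @ beta_robust
    w = 1.0 / np.maximum(np.abs(resid), 1e-6)  # downweight large residuals
    WX = X * w[:, None]
    beta_robust = np.linalg.solve(X.T @ WX, WX.T @ y)

print(beta.round(2), beta_robust.round(2))
```

The ordinary fit is dragged toward the outlier, while the reweighted fit stays close to the trend of the other four points.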

Non-Linear Relationship

Non-Linear Relationships

Suppose the relationship between the dependent and independent variable is non-linear?
Linear regression requires a linear relationship

Transformation of Variables

Apply a transformation to the dependent variable, the independent variable, or both
Examples:
  Y' = log(Y)
  X' = log(X)
  X' = 1/X
  Y' = Y^(1/2)
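A sketch of why the log transform in particular helps: the heavy right tail of the toy values below is pulled in sharply. The crude moment-based skewness measure here is for illustration only.

```python
import numpy as np

# Crude moment-based skewness, for illustration only.
def skewness(v):
    d = v - v.mean()
    return np.mean(d ** 3) / np.std(v) ** 3

# Toy award-like values with a heavy right tail.
y = np.array([1_000.0, 5_000.0, 20_000.0, 100_000.0, 1_500_000.0])
y_log = np.log(y)
print(round(skewness(y), 2), round(skewness(y_log), 2))
```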

Transformation of Variables: Skewness of Distribution

Use Exploratory Data Analysis to detect skewness and heavy tails

After a log transformation, the data are much less skewed, more like the Normal, though still skewed

(Figures: box plot and histogram of lnTotalAward; histogram bins run from 9.6 to 17.6)

Transformation of Variables

Suppose Claim Severity is a function of the log of the report lag
Compute X = log(Report Lag)
Regress Severity on X

Categorical Independent Variables: The Other Linear Model: ANOVA

Average of Totalsettlementamountorcourtaward
Injury                     Total
Amputation               567,889
Backinjury               168,747
Braindamage              863,485
Burnschemical          1,097,402
Burnsheat                801,748
Circulatorycondition     302,500

Table above created with Excel Pivot Tables

Model

The model is Y = a_i, where i is a category of the independent variable and a_i is the mean of category i.

Average Severity by Injury (Average of Trended Severity; the chart labels the nine category means Y = a_1 through Y = a_9)

Injury       Average Trended Severity
BRUISE                4,215.78
BURN                  2,185.64
CRUSHING              2,608.14
CUT/PUNCT             1,248.90
EYE                     534.23
FRACTURE             14,197.4
OTHER                 6,849.98
SPRAIN                3,960.45
STRAIN                7,493.70

Two Categories

Model Y = a_i, where i is a category of the independent variable and a_i is its mean
In traditional statistics we compare a_1 to a_2

If Only Two Categories: T-Test for Significance of the Independent Variable

                              Variable 1      Variable 2
Mean                             124,002         440,758
Variance                     2.35142E+11     1.86746E+12
Observations                         354            1448
Hypothesized Mean Difference           0
df                                  1591
t Stat                             -7.17
P(T<=t) one-tail                    0.00
t Critical one-tail                 1.65
P(T<=t) two-tail                    0.00
t Critical two-tail                 1.96

Use the T-Test from the Excel Data Analysis ToolPak
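The t statistic in the table can be reproduced from its own summary numbers; this is the unequal-variance (Welch) form, which the sharply different group variances call for.

```python
import numpy as np

# Summary numbers from the t-test table above.
mean1, var1, n1 = 124_002.0, 2.35142e11, 354     # first group
mean2, var2, n2 = 440_758.0, 1.86746e12, 1448    # second group

se = np.sqrt(var1 / n1 + var2 / n2)   # SE of the difference in means
t_stat = (mean1 - mean2) / se
print(round(t_stat, 2))
```

This recovers the t Stat of -7.17 shown in the table.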

More Than Two Categories

Use an F-Test instead of a T-Test
With more than two categories, we refer to it as an Analysis of Variance (ANOVA)

Fitting ANOVA With Two Categories Using a Regression

Create a dummy variable for Attorney Involvement
The variable is 1 if an attorney is involved, and 0 otherwise

Attorneyinvolvement-insurer  Attorney  TotalSettlement
Y                                   1            25000
Y                                   1          1300000
Y                                   1            30000
N                                   0            42500
Y                                   1            25000
N                                   0            30000
Y                                   1            36963
Y                                   1           145000
N                                   0           875000

More Than 2 Categories

If there are k categories, create k-1 dummy variables
  Dummy_i = 1 if the claim is in category i, and 0 otherwise
The kth category is 0 for all the dummies
  Its value is the intercept of the regression
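The k-1 dummy coding above can be sketched with pandas; drop_first=True leaves out one category, which becomes the base absorbed into the intercept. The injury labels are toy values.

```python
import pandas as pd

# Toy injury labels standing in for the claim data's injury field.
df = pd.DataFrame({"Injury": ["Backinjury", "Sprain", "Backinjury", "Other"]})
dummies = pd.get_dummies(df["Injury"], prefix="Injury", drop_first=True)
print(dummies)   # Backinjury is dropped and becomes the base category
```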

Design Matrix

Injury Code  Injury_Backinjury  Injury_Multipleinjuries  Injury_Nervouscondition  Injury_Other
1                            0                        0                        0             0
1                            0                        0                        0             0
12                           1                        0                        0             0
11                           0                        1                        0             0
17                           0                        0                        0             1

In the top table the dummy variables were hand coded; in the bottom table they were created by XLMiner.

Regression Output for a Categorical Independent Variable

A More Complex Model: Multiple Regression

Let Y = a + b1*X1 + b2*X2 + ... + bn*Xn + e
The Xs can be numeric variables or categorical dummies

Multiple Regression

Y = a + b1*Initial Reserve + b2*Report Lag + b3*Policy Limit + b4*Age + Σ ci*Attorney_i + Σ dk*Injury_k + e
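A minimal sketch of fitting such a multiple regression by least squares. The variable names and true coefficients below are toy stand-ins loosely echoing the regression output that follows, not the actual claim data.

```python
import numpy as np

# Toy data; names and coefficients loosely echo the regression output below.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(10.0, 2.0, n)     # stand-in for lnInitialIndemnityRes
x2 = rng.normal(1.0, 0.5, n)      # stand-in for lnReportlag
y = 10.0 + 0.1 * x1 + 0.02 * x2 + rng.normal(0.0, 0.1, n)

X = np.column_stack([np.ones(n), x1, x2])   # leading column of 1s gives the intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta.round(3))   # approximately [10.0, 0.1, 0.02]
```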
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.49844
R Square             0.24844
Adjusted R Square    0.24213
Standard Error       1.10306
Observations         1802

ANOVA
             df      SS        MS       F
Regression   15      718.36    47.89    39.360
Residual     1786    2173.09   1.22
Total        1801    2891.45

                             Coefficients  Standard Error   t Stat   P-value
Intercept                          10.052           0.156   64.374     0.000
lnInitialIndemnityRes               0.105           0.011    9.588     0.000
lnReportlag                         0.020           0.011    1.887     0.059
Policy Limit                        0.000           0.000    4.405     0.000
Clmt Age                           -0.002           0.002   -1.037     0.300
Attorney                            0.718           0.068   10.599     0.000
Injury_Backinjury                  -0.150           0.075   -1.995     0.046
Injury_Braindamage                  0.834           0.224    3.719     0.000
Injury_Burnschemical                0.587           0.247    2.375     0.018
Injury_Burnsheat                    0.637           0.175    3.645     0.000
Injury_Circulatorycondition         0.935           0.782    1.196     0.232

More Than One Categorical Variable

For each categorical variable:
  Create k-1 dummy variables, where k is the number of categories
  The category left out becomes the base category
    Its value is contained in the intercept
Model is Y = a_i + b_j + ... + e, or
Y = u + a_i + b_j + ... + e, where a_i and b_j are offsets to u
e is the random error term

Correlation of Predictor Variables: Multicollinearity

Predictor variables are assumed uncorrelated
Assess with a correlation matrix

Remedies for Multicollinearity

Drop one or more of the highly correlated variables
Use factor analysis or principal components to produce a new variable that is a weighted average of the correlated variables
Use stepwise regression to select variables to include
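One common way to quantify the problem before applying a remedy is the variance inflation factor (VIF): 1 / (1 - R²) from regressing each predictor on the others. This is a hedged sketch on toy data with two deliberately correlated columns; the VIF diagnostic itself is standard, though not named on the slide.

```python
import numpy as np

# Toy data: x2 is nearly a copy of x1, x3 is independent.
rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)

def vif(target, others):
    # Regress target on the other predictors; VIF = 1 / (1 - R^2).
    X = np.column_stack([np.ones(len(target))] + others)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    r2 = 1 - resid.var() / target.var()
    return 1.0 / (1.0 - r2)

print(round(vif(x1, [x2, x3]), 1), round(vif(x3, [x1, x2]), 1))
```

Values near 1 are harmless; values above 5 or 10 signal a variable worth dropping or combining as described above.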

Similarities with GLMs

Linear Models                              GLMs
Transformation of variables                Link functions
Dummy coding for categorical variables     Dummy coding for categorical variables
Residuals                                  Deviance
Test significance of coefficients          Test significance of coefficients

Introductory Modeling Library Recommendations

Berry, W., Understanding Regression Assumptions, Sage University Press
Iversen, R. and Norpoth, H., Analysis of Variance, Sage University Press
Fox, J., Regression Diagnostics, Sage University Press
Shmueli, Patel and Bruce, Data Mining for Business Intelligence: Concepts, Applications and Techniques in Microsoft Office Excel with XLMiner, Wiley, 2007
De Jong and Heller, Generalized Linear Models for Insurance Data, Cambridge, 2008
