Ch16, pg 16-7
Specimen 2010, Q1
September 2011, Q9 (ii)
Two types of variable need to be defined for GLMs:
Weight/exposure: these are the weights used in the model fit to attach
an importance to each observation. Eg in a claim frequency model, the
exposure would be defined as the length of time the policy has been on
risk; for an average claim size model, the exposure would be the number
of claims for the observation.
Common choices for the prior weight are equal to:
o 1 (eg when modeling claim counts)
o the number of exposures (eg when modeling claim frequency)
o the number of claims (eg when modeling claim severity)
Response: this is the value that the model is trying to predict. Hence, in
the claim frequency model, it is the number of claims for that observation,
and for the average claim size model, it is the total claims cost for that
observation.
In general, the name of the model corresponds to the ratio response/weight, ie:
Claim frequency = number of claims / policy years
Average claim size = cost of claims / number of claims
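As an illustration of these definitions, here is a minimal sketch in Python using statsmodels; the data, column names and factor levels are all hypothetical, not taken from the course notes:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical policy-level data (all names and values are illustrative)
df = pd.DataFrame({
    "car_age":      ["0-1", "0-1", "2-6", "2-6", "7+", "7+"],
    "policy_years": [900.0, 600.0, 750.0, 500.0, 400.0, 300.0],
    "claim_count":  [45, 48, 30, 35, 36, 45],
    "claim_cost":   [90e3, 110e3, 60e3, 80e3, 95e3, 140e3],
})

# Claim frequency model: response = number of claims,
# weight/exposure = length of time on risk (policy years).
freq_model = smf.glm(
    "claim_count ~ car_age",
    data=df,
    family=sm.families.Poisson(),    # log link by default
    exposure=df["policy_years"],     # enters the model as a log-offset
).fit()

# Average claim size model: response = cost per claim,
# prior weight = number of claims for the observation.
df["avg_claim_size"] = df["claim_cost"] / df["claim_count"]
sev_model = smf.glm(
    "avg_claim_size ~ car_age",
    data=df,
    family=sm.families.Gamma(link=sm.families.links.Log()),
    var_weights=df["claim_count"],   # prior weight = claim count
).fit()

print(freq_model.summary())
print(sev_model.summary())
```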
A categorical factor is a factor to be used for modelling where the values of
each level are distinct, and often cannot be given any natural ordering or score.
An example is car manufacturer, which has various values. By contrast, a
non-categorical factor is one that takes a naturally ordered value, eg age or car
value (these may need to be rounded at the input stage to reduce the number of
levels to a convenient number).
An interaction term is used where the pattern in the response variable is
modeled better by including extra parameters for each combination of two or
more factors. This combination adds predictive value over and above the
separate single factors.
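Where an interaction is suspected, the extra parameters for each combination of levels can be specified directly in a model formula; a hedged sketch in Python/statsmodels with invented data:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical data: two rating factors and an exposure measure
df = pd.DataFrame({
    "car_age":      ["0-1", "0-1", "2-6", "2-6", "7+", "7+"] * 2,
    "mileage":      ["0-8k", "8k+"] * 6,
    "policy_years": [500.0] * 12,
    "claim_count":  [20, 45, 15, 30, 25, 80, 22, 43, 14, 31, 27, 78],
})

# "car_age * mileage" expands to the two single factors plus extra
# parameters for each combination of their levels (the interaction term)
inter_model = smf.glm(
    "claim_count ~ car_age * mileage",
    data=df,
    family=sm.families.Poisson(),
    exposure=df["policy_years"],
).fit()
print(inter_model.params)
```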
Initial analyses
Ch17, pg 23
April 2010, Q8
One of the first analyses is to identify any factor levels with very low
exposure and claim count. If these levels are not ultimately combined with
other levels, the GLM maximum likelihood algorithm may not converge. (If a
factor level has zero claims and a multiplicative model is being fitted, the
theoretically correct multiplier for that level will be close to zero, and the
parameter estimate corresponding to the log of that multiplier may be so
large and negative that the numerical algorithm seeking the maximum
likelihood will not converge.)
In addition to investigating the exposure and claim distribution, a query of
one-way statistics (eg frequency, severity, loss ratio and pure premium) will
give a preliminary indication of the effect of each factor; a code sketch
follows the example below.
Example:

Factor 1    Factor 2    Exposure    Predicted value    Total response
(car age)   (mileage)               (apply GLM)        (= exposure x predicted value)
0-1         0-8k             900         0.4                 360
0-1         8k+           25,450         0.8              20,360
2-6         0-8k           4,700         0.5               2,350
2-6         8k+           13,025         1.0              13,025
7+          0-8k           5,273         1.5               7,909
7+          8k+              652         3.0               1,956
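The one-way statistics query mentioned above can be expressed as a simple group-by; a minimal sketch in Python/pandas, with hypothetical data and column names:

```python
import pandas as pd

# Hypothetical claims data (all names and values are illustrative)
df = pd.DataFrame({
    "car_age":      ["0-1", "0-1", "2-6", "2-6", "7+", "7+"],
    "policy_years": [900.0, 600.0, 750.0, 500.0, 400.0, 300.0],
    "claim_count":  [45, 48, 30, 35, 36, 45],
    "claim_cost":   [90e3, 110e3, 60e3, 80e3, 95e3, 140e3],
    "premium":      [270e3, 250e3, 180e3, 170e3, 160e3, 150e3],
})

# One-way statistics by level of a single factor
oneway = df.groupby("car_age").sum(numeric_only=True)
oneway["frequency"]    = oneway["claim_count"] / oneway["policy_years"]
oneway["severity"]     = oneway["claim_cost"] / oneway["claim_count"]
oneway["loss_ratio"]   = oneway["claim_cost"] / oneway["premium"]
oneway["pure_premium"] = oneway["claim_cost"] / oneway["policy_years"]
print(oneway)
```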
Spatial smoothing
Spatial smoothing improves the predicted values by taking into account the
credibility (or lack thereof) of the response in a single location.
Two main forms of spatial smoothing are typically employed:
Distance-based smoothing
Adjacency-based smoothing
The features of each form of smoothing make it more or less appropriate to use
depending on the underlying processes behind the loss type being modelled.
Distance-based smoothing incorporates information about nearby location
codes based on the distance between the location codes: the further away a
location code, the less influence (or weight) is given to its experience.
This is true regardless of whether an area is urban or rural, and whether natural
or artificial boundaries (such as rivers) exist between location codes. It may
therefore not be appropriate for certain perils such as theft, where a river
with no bridge separates two areas whose claims experience is therefore
different.
As such, distance-based smoothing methods are often employed for
weather-related perils, where there is less danger of over- or under-smoothing
urban and rural areas.
Distance-based smoothing methods have the advantage of being easy to
understand and implement, as no distributional assumptions are required in the
algorithm.
Distance-based methods can also be enhanced by amending the distance metric
to include dimensions other than longitude and latitude. Eg including urban
density in the distance metric would allow urban areas to be more influenced by
experience in nearby urban areas than by nearby rural ones, which may be
appropriate.
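A minimal sketch of one possible distance-based approach (a Gaussian inverse-distance weighting; the coordinates, kernel and bandwidth are all assumptions, not a prescribed method from the notes):

```python
import numpy as np

# Hypothetical location codes: (longitude, latitude) plus a raw
# relativity observed at each location. Extra dimensions (eg urban
# density) could be appended to refine the distance metric.
coords = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0]])
raw    = np.array([1.20, 0.90, 1.10, 0.70])

def distance_smooth(coords, raw, bandwidth=0.5):
    """Smooth each location using all others, with weight decaying in
    distance: the further away a location code, the less influence."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))    # pairwise distances
    w = np.exp(-(dist / bandwidth) ** 2)        # Gaussian kernel weights
    return (w @ raw) / w.sum(axis=1)            # weighted averages

print(distance_smooth(coords, raw))
```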
Adjacency-based smoothing incorporates information about directly
neighbouring location codes. Each location code is influenced by its direct
neighbours, each of which is in turn influenced by its direct neighbours;
distributional assumptions or prior knowledge of the claims processes can be
incorporated in the technique. The algorithms are therefore iterative and
complex to implement.
As this smoothing method relies on defining which location codes neighbour each
other, natural or artificial boundaries (eg rivers or motorways) can be reflected in
the smoothing process.
Location codes tend to be smaller in urban regions and larger in rural areas, so
adjacency-based smoothing can sometimes handle urban and rural differences
more appropriately for non-weather-related perils.
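Real adjacency-based methods embed distributional assumptions (eg conditional autoregressive priors); the iterative sketch below shows only the neighbour structure, with an invented adjacency map and smoothing strength:

```python
import numpy as np

# Hypothetical adjacency map: location code -> directly neighbouring
# codes. A boundary such as a river can be reflected by omitting an edge.
neighbours = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
}
raw = {"A": 1.20, "B": 0.90, "C": 1.10, "D": 0.70}

def adjacency_smooth(raw, neighbours, alpha=0.5, n_iter=20):
    """Iteratively pull each location towards the mean of its direct
    neighbours; alpha controls the strength of the smoothing."""
    smoothed = dict(raw)
    for _ in range(n_iter):
        smoothed = {
            loc: (1 - alpha) * value
                 + alpha * np.mean([smoothed[n] for n in neighbours[loc]])
            for loc, value in smoothed.items()
        }
    return smoothed

print(adjacency_smooth(raw, neighbours))
```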
The Gini coefficient is a measure of statistical dispersion that can range from 0 to
1. The higher the Gini coefficient, the more predictive the model.
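A hedged sketch of one common way to compute a Gini coefficient for a pricing model: rank risks by predicted value and measure how far the cumulative actual-loss curve departs from the line of equality (the data and function name are invented):

```python
import numpy as np

def gini_coefficient(predicted, actual, exposure):
    """Order risks by predicted value, build the Lorenz curve of
    cumulative actual losses against cumulative exposure, and return
    one minus twice the area under that curve."""
    order = np.argsort(predicted)
    cum_exp  = np.append(0.0, np.cumsum(exposure[order])) / exposure.sum()
    cum_loss = np.append(0.0, np.cumsum(actual[order])) / actual.sum()
    # Trapezium-rule area under the Lorenz curve
    area = np.sum((cum_exp[1:] - cum_exp[:-1])
                  * (cum_loss[1:] + cum_loss[:-1]) / 2)
    return 1.0 - 2.0 * area

predicted = np.array([0.4, 0.8, 0.5, 1.0, 1.5, 3.0])
actual    = np.array([300.0, 900.0, 400.0, 1200.0, 1800.0, 3600.0])
exposure  = np.ones(6)
print(gini_coefficient(predicted, actual, exposure))
```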
The structure of a GLM
Ch16, pg 12
September 2011, Q9 (i)
The linear model structure:

Y_i = \sum_{j=1}^{k} X_{ij} \beta_j + \varepsilon_i

The generalised linear model structure, where g is the link function and
\xi_i is an offset term:

Y_i = g^{-1}\left( \sum_{j=1}^{k} X_{ij} \beta_j + \xi_i \right) + \varepsilon_i

The deviance of a model, where d(Y_i; \mu_i) is the contribution of
observation i:

D = \sum_{i=1}^{n} d(Y_i; \mu_i)
F statistics: in cases where the scale parameter for the model is not
known, its estimator is distributed as a \chi^2 distribution, and the ratio of
two \chi^2 distributions is an F distribution. Two nested models, with
deviances D_1 and D_2 and degrees of freedom df_1 and df_2, can therefore be
compared using:

\frac{D_1 - D_2}{(df_1 - df_2)(D_2 / df_2)} \sim F_{df_1 - df_2,\, df_2}
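A sketch of this nested-model comparison in Python/statsmodels with invented data; here D is the model deviance and df_resid the residual degrees of freedom:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

# Hypothetical average claim size data (all values are illustrative)
df = pd.DataFrame({
    "car_age":   ["0-1", "0-1", "2-6", "2-6", "7+", "7+"] * 4,
    "mileage":   ["0-8k", "8k+"] * 12,
    "avg_claim": [400., 820., 510., 980., 1490., 3100.,
                  390., 780., 490., 1020., 1510., 2900.,
                  410., 800., 500., 1000., 1480., 3050.,
                  420., 810., 520., 990., 1500., 2950.],
})

gamma_log = sm.families.Gamma(link=sm.families.links.Log())
m1 = smf.glm("avg_claim ~ car_age", data=df, family=gamma_log).fit()
m2 = smf.glm("avg_claim ~ car_age + mileage", data=df, family=gamma_log).fit()

# F statistic for the extra parameter(s) in the larger model m2
D1, df1 = m1.deviance, m1.df_resid
D2, df2 = m2.deviance, m2.df_resid
f_stat  = (D1 - D2) / ((df1 - df2) * (D2 / df2))
p_value = stats.f.sf(f_stat, df1 - df2, df2)
print(f_stat, p_value)
```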
For example, if the levels represent a continuous variable (eg vehicle age),
then the relativity should vary smoothly as the factor value increases. The
error ranges of these relativities are also distinct (they don't overlap much),
indicating that the response from the data underlying each level has a
significantly different relativity value. Hence the factor should be accepted.
These can be illustrated on a graph.
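A minimal plotting sketch (matplotlib); the relativities and standard errors are invented for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical fitted relativities for a continuous factor (vehicle age),
# with invented standard errors
age        = np.array([0, 1, 2, 3, 4, 5])
relativity = np.array([1.00, 1.05, 1.12, 1.21, 1.33, 1.48])
std_error  = np.array([0.01, 0.02, 0.02, 0.03, 0.03, 0.04])

# A smooth progression plus non-overlapping error ranges would support
# accepting the factor
plt.errorbar(age, relativity, yerr=2 * std_error, fmt="o-", capsize=3)
plt.xlabel("Vehicle age")
plt.ylabel("Fitted relativity")
plt.title("Relativities with approximate error ranges")
plt.show()
```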
Consistency checks with other factors: note that time is not the only
factor that can be used as a consistency check. If you are producing a
model for a multi-distribution-channel business, then it is particularly
important that each factor is checked to ensure that it is valid for every
channel.
Differences in the data collection methods by channel can cause problems
here.
Also, a random factor could be created in the data as a means of checking
consistency for a factor, as in the sketch below.
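A hedged sketch of creating such a random factor (Python; the column and level names are invented). Because the factor carries no real signal, its fitted relativities should come out flat, with wide error ranges, if the modelling process is sound:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

# Hypothetical policy data
df = pd.DataFrame({"policy_id": range(10_000)})

# Assign each record at random to one of three arbitrary groups
df["random_factor"] = rng.choice(["A", "B", "C"], size=len(df))
```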