
A Markov state-space model of Lupus Nephritis disease dynamics
A. Eleuteri

1. The data
The data set comprises 234 observations from 93 subjects. Some covariates have missing
observations:

             LPGDS   MCP1   CP   VCAM1
# missing    1       40     1    40

When using the data to fit models, median imputation was used for LPGDS and CP, and
multiple imputation for MCP1 and VCAM1.
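The median imputation step mentioned above can be sketched as follows. This is only an illustration: the helper name `impute_median` and the example values are hypothetical and are not the study data, and the multiple imputation used for the other covariates is not shown.

```python
import numpy as np

def impute_median(values):
    """Replace NaN entries with the median of the observed (non-NaN) entries."""
    values = np.asarray(values, dtype=float)
    filled = values.copy()
    filled[np.isnan(filled)] = np.nanmedian(values)
    return filled

# Hypothetical measurements with one missing value (not the study data).
example = np.array([2.0, 4.0, np.nan, 10.0])
imputed = impute_median(example)
```

Median imputation is a reasonable shortcut when only a single value is missing; with many missing values it would understate the variability of the covariate.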

1.1 Assessing the difference between the UK and US cohorts


The two-sample Kolmogorov-Smirnov test was used to compare the distribution of each measurement
between the two cohorts (the null hypothesis is equality of distributions). There is evidence of a
difference only for the LPGDS measurements.

The following graphs show the number of events (combined into two different classes).

Given the overall differences, using both cohorts for data analysis requires careful investigation.
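A cohort comparison of this kind can be sketched with SciPy's two-sample Kolmogorov-Smirnov test. The two samples below are synthetic stand-ins for one biomarker measured in each cohort; the sample sizes, distributions and the 0.05 threshold are assumptions made for illustration only.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Synthetic stand-ins for one biomarker in each cohort (not the study data).
uk = rng.lognormal(mean=0.0, sigma=1.0, size=100)
us = rng.lognormal(mean=1.0, sigma=1.0, size=120)

# Null hypothesis: both samples come from the same distribution.
result = ks_2samp(uk, us)
statistic, p_value = result.statistic, result.pvalue

# Illustrative decision rule at the 5% level.
reject_equality = p_value < 0.05
```

In practice the test would be repeated for each measurement, so some adjustment for multiple comparisons may be worth considering.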

1.2 Correlation analysis


The following table shows the pairwise Spearman rank correlations amongst all the measurements.
All correlations are statistically significant, with every p-value far below 0.00001.

Spearman rank correlations

                 LPGDS   MCP1   CP     AGP1   VCAM1   TF     BILAG (binary)
LPGDS            1.00    0.49   0.58   0.63   0.64    0.37   0.49
MCP1             0.49    1.00   0.59   0.51   0.39    0.37   0.50
CP               0.58    0.59   1.00   0.72   0.50    0.61   0.69
AGP1             0.63    0.51   0.72   1.00   0.65    0.63   0.72
VCAM1            0.64    0.39   0.50   0.65   1.00    0.41   0.52
TF               0.37    0.37   0.61   0.63   0.41    1.00   0.57
BILAG (binary)   0.49    0.50   0.69   0.72   0.52    0.57   1.00
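A pairwise Spearman correlation matrix like the one above can be computed as in the following sketch. The data are synthetic (three correlated columns standing in for the measurements); only the call pattern is the point.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n = 234

# Synthetic correlated columns standing in for the measurements (not the study data).
shared = rng.normal(size=n)
data = np.column_stack(
    [shared + rng.normal(scale=s, size=n) for s in (0.5, 1.0, 1.5)]
)

# With a 2-D input of more than two columns, spearmanr returns the full
# matrix of rank correlations and the matching matrix of p-values.
rho, p_values = spearmanr(data)
```

The resulting matrix is symmetric with ones on the diagonal, which is a quick sanity check when transcribing such a table.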

1.3 Principal components analysis


In order to assess the viability of projecting the prognostic factors onto a low-dimensional space, a
principal components analysis (PCA) has been performed. The graph for the first two components
does not show any appreciable clustering structure.

[Figure: PCA biplot of the observations on the first two principal components (Comp.1 vs. Comp.2), with loadings for LPGDS, MCP1, CP, AGP1, VCAM1 and TF.]
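A projection onto the first two principal components can be sketched with a singular value decomposition. The function name `pca_scores`, the synthetic data and the choice to log-transform and standardise the columns are assumptions for illustration; the report does not state which preprocessing was used.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """PCA of standardised columns via SVD.

    Returns the component scores, the loadings and the fraction of
    variance explained by each retained component.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    explained = (s**2 / np.sum(s**2))[:n_components]
    return scores, loadings, explained

rng = np.random.default_rng(2)
# Synthetic positive-valued measurements for six factors (not the study data).
X = rng.lognormal(size=(234, 6))
scores, loadings, explained = pca_scores(np.log(X))
```

Plotting the two columns of `scores` against each other gives the kind of scatter in which clustering, if present, would be visible.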

2. A multi-state Markov model


We will assume a model of state transitions as follows:

Inactive Nephritis (1) <-> Active Nephritis (2)

The state transitions will be represented by a transition intensity matrix:

\[ Q = \begin{pmatrix} -\lambda_{12} & \lambda_{12} \\ \lambda_{21} & -\lambda_{21} \end{pmatrix} \]

The off-diagonal entries of the matrix are hazard rates, i.e. the instantaneous rates of moving from one
state to the other; each diagonal entry is the negative of the off-diagonal entry in its row, so that rows
sum to zero. The simplest case is to assume time homogeneity, i.e. the hazard rates are
time-independent. This is equivalent to assuming that the time spent in each state is exponentially
distributed.
A simple extension of the model can be obtained by considering a time-dependent transition matrix of
the form:

\[ Q(t) = Q \, g(t) \]

with g(t) a non-negative function. This formulation implies the existence of an operational time for
which the process is time homogeneous. For example, with

\[ g(t) = \alpha t^{\alpha - 1} \]

we obtain Weibull transition intensities, with operational time $t^{\alpha}$ (note that the time
homogeneous model is obtained as the special case $\alpha = 1$).
Hazard rates can depend on prognostic factors. For a set of k factors, we will assume a log-linear
model:

\[ \log \lambda_{ij} = \beta_{ij,0} + \beta_{ij,1} x_1 + \beta_{ij,2} x_2 + \cdots + \beta_{ij,k} x_k \]
The advantage of this model is that it is possible to get subject-specific predictions. Fitting this model
is more complex than fitting only the baseline hazard rates, and requires more data, as more parameters
need to be estimated.
Once the transition intensity matrix has been determined, it can be shown that the probabilities for all
state transitions at a given time t can be calculated by using the relationship:

\[ P(t) = \exp(Qt) \]

2.1 Fitting the time homogeneous model


If we denote by state 1 the set of BILAG scores {D, E} and by state 2 the set {A, B, C}, we get the
following table of observed transitions:
            to 1   to 2
from 1      83     9
from 2      14     35

Note that there are only 9 transitions from state 1 to state 2, and 14 from state 2 to state 1, which
limits model complexity. The sample size is reduced to 211 records, since the 23 subjects with only
one record are dropped (at least two records are needed to observe a transition).
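The table of observed transitions (141 in total, 83 + 9 + 14 + 35) is obtained by pairing consecutive records of the same subject. A minimal sketch follows; the record layout `(subject_id, time, state)`, the function name `count_transitions` and the example records are hypothetical.

```python
from collections import Counter

def count_transitions(records):
    """Count (from_state, to_state) pairs between consecutive visits.

    records: iterable of (subject_id, time, state) tuples.
    Subjects with a single record contribute no transitions.
    """
    by_subject = {}
    for subject, time, state in records:
        by_subject.setdefault(subject, []).append((time, state))

    counts = Counter()
    for visits in by_subject.values():
        visits.sort()  # order each subject's visits by time
        for (_, s_from), (_, s_to) in zip(visits, visits[1:]):
            counts[(s_from, s_to)] += 1
    return counts

# Hypothetical records: subject "c" has only one visit and is effectively dropped.
records = [
    ("a", 0.0, 1), ("a", 0.5, 2), ("a", 1.0, 1),
    ("b", 0.0, 1), ("b", 0.3, 1),
    ("c", 0.0, 2),
]
counts = count_transitions(records)
```

For a subject with m records this yields m - 1 transitions, which is why single-record subjects carry no information about the dynamics.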
A time homogeneous model provides the following estimate of the transition intensity matrix:

\[ Q = \begin{pmatrix} -0.2613 & 0.2613 \\ 0.5997 & -0.5997 \end{pmatrix} \]

The model's log likelihood is -57.72, and the Akaike Information Criterion corrected for the small
sample size (AICc) is 119.49.
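Given an estimated intensity matrix, transition probabilities follow from the matrix exponential P(t) = exp(Qt). The sketch below uses the estimates quoted above, taking the diagonal entries as the negatives of the off-diagonal rates so that each row sums to zero, and assumes time is measured in years. The closed-form expression for a two-state chain is included only as a cross-check; the function names are illustrative.

```python
import numpy as np
from scipy.linalg import expm

# Time-homogeneous estimate; rows sum to zero.
Q = np.array([[-0.2613, 0.2613],
              [0.5997, -0.5997]])

def transition_probabilities(Q, t):
    """P(t) = exp(Q t) for a time-homogeneous Markov model."""
    return expm(Q * t)

def two_state_closed_form(q12, q21, t):
    """Closed-form P(t) for a two-state chain with rates q12 and q21."""
    total = q12 + q21
    moved = 1.0 - np.exp(-total * t)
    p12 = q12 / total * moved
    p21 = q21 / total * moved
    return np.array([[1.0 - p12, p12],
                     [p21, 1.0 - p21]])

# Transition probabilities over one unit of time (assumed to be one year).
P1 = transition_probabilities(Q, 1.0)
```

Each row of `P1` is a probability distribution over the state occupied one time unit later, given the starting state.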
The expected (predicted by the model) vs. observed prevalence of the states up to 1 year from first
observation is shown in the following graph:

[Figure: observed vs. expected prevalence (%) of State 1 and State 2 against time (0.0 to 1.0), time-homogeneous model.]

Empirically we notice slight trends in the observed state prevalences, which are not captured by the
time homogeneous model.

2.2 Fitting the time dependent model

We next fit a time-dependent model with Weibull transitions. The log likelihood for different values of
$\alpha$ is shown in this graph:

[Figure: log likelihood (between about -56.436 and -56.431) against $\alpha$ (0.62 to 0.66).]

The function is maximised for $\alpha = 0.64$. The log likelihood for the new model is -56.43 and the AICc is
118.98. Despite the model being more complex, it has a slightly lower AICc, so it should be preferred.
The estimated transition intensity matrix is:

\[ Q_w = \begin{pmatrix} -0.2695 & 0.2695 \\ 0.7300 & -0.7300 \end{pmatrix} \]

Note that the 1-2 transition intensity has changed little with respect to the first model (0.2695 vs.
0.2613), whereas the 2-1 transition intensity is noticeably higher (0.7300 vs. 0.5997).
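The AICc values quoted for the two models can be checked with the usual small-sample correction. The parameter counts (2 for the time-homogeneous model, 3 once the Weibull shape is added) and the sample size n = 211 are assumptions; with them, the reported values are reproduced to within the rounding of the log likelihoods.

```python
def aicc(log_likelihood, n_params, n_obs):
    """Akaike Information Criterion with the small-sample correction."""
    aic = -2.0 * log_likelihood + 2.0 * n_params
    correction = 2.0 * n_params * (n_params + 1) / (n_obs - n_params - 1)
    return aic + correction

# Assumed: n = 211 records; 2 and 3 free parameters respectively.
aicc_homogeneous = aicc(-57.72, n_params=2, n_obs=211)
aicc_weibull = aicc(-56.43, n_params=3, n_obs=211)
```

The difference between the two values is about half a unit, so the preference for the Weibull model on this criterion is modest.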
The prevalence counts of the states for the new model are shown here:

[Figure: observed vs. expected prevalence (%) of State 1 and State 2 against time (0.0 to 1.0), time-dependent (Weibull) model.]

Qualitatively we can see that this model is better at following the trends in the data.
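For the Weibull model the transition probabilities are obtained by evaluating the matrix exponential at the operational time, P(t) = exp(Q t^α). The sketch below uses the estimates from this section (α = 0.64 and the intensity matrix with rates 0.2695 and 0.7300, diagonals taken as their negatives) and assumes time is measured in years; the function name is illustrative.

```python
import numpy as np
from scipy.linalg import expm

# Weibull-model estimates; rows sum to zero.
Q_W = np.array([[-0.2695, 0.2695],
                [0.7300, -0.7300]])
ALPHA = 0.64

def weibull_transition_probabilities(Q, alpha, t):
    """P(t) = exp(Q * t**alpha): homogeneous in the operational time t**alpha."""
    return expm(Q * t**alpha)

# Half a time unit after the first observation.
P_half = weibull_transition_probabilities(Q_W, ALPHA, 0.5)

# Same matrix evaluated without the time transformation, for comparison.
P_half_untransformed = expm(Q_W * 0.5)
```

Because α < 1, the operational time runs ahead of calendar time for t < 1, so transitions are more likely early on than the same matrix would imply without the transformation.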

2.3 Including predictive factors in the model


Taking the time-dependent model as the baseline, prognostic factors were added to the model one at a
time, so that the transition intensities can be written:

\[ \log \lambda_{12} = \beta_{12,0} + \beta_{12,1} x \]
\[ \log \lambda_{21} = \beta_{21,0} + \beta_{21,1} x \]

where x stands for a prognostic factor. Notice that each factor is taken into account twice, once for
each transition, requiring four coefficients to be estimated.
The AICc was calculated for each model and then rescaled by subtracting the minimum AICc amongst
all models. The following table shows the rescaled AICc values:

Model            Rescaled AICc
Baseline only    10.16
LPGDS            9.27
MCP1             10.84
CP               2.31
AGP1             0
VCAM1            8.65
TF               11.34

The best model (AGP1) has value zero; all other models are worse, with CP the closest. A step-down
procedure was also applied (using AICc as the selection criterion) and the same results were obtained.
Based on the above results, we will investigate a model with CP and AGP1 as predictors.
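The rescaling and ranking step can be sketched as follows. The dictionary holds the rescaled values from the table above; the helper `rescale` shows how such values would be derived from raw AICc values and is illustrative only.

```python
def rescale(aicc_values):
    """Subtract the minimum AICc so that the best model has value zero."""
    best = min(aicc_values.values())
    return {name: value - best for name, value in aicc_values.items()}

# Rescaled AICc values as reported in the table.
delta_aicc = {
    "Baseline only": 10.16,
    "LPGDS": 9.27,
    "MCP1": 10.84,
    "CP": 2.31,
    "AGP1": 0.0,
    "VCAM1": 8.65,
    "TF": 11.34,
}

# Models ordered from best (smallest value) to worst.
ranking = sorted(delta_aicc, key=delta_aicc.get)
```

The ordering puts AGP1 and CP first, and shows that adding MCP1 or TF gives a larger value than the baseline alone.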

The following table shows the baseline transition intensities and the hazard ratio for each covariate,
with 95% confidence intervals:

Transition   Baseline            AGP1                CP
1-2          0.37 (0.14, 0.98)   2.01 (1.21, 3.32)   0.59 (0.24, 1.46)
2-1          1.59 (0.50, 5.03)   1.18 (0.74, 1.90)   0.52 (0.30, 0.92)

Note that the confidence intervals for CP on the 1-2 transition and for AGP1 on the 2-1 transition
include the value 1, which suggests that the impact of the covariates on these transitions might not be
significant. The AICc for this model is 106.39.
We will therefore fit another model, in which AGP1 only has effect on the 1-2 transition, and CP only
on the 2-1 transition. The following table shows the parameter estimates:
Transition   Baseline            AGP1                CP
1-2          0.45 (0.19, 1.05)   1.54 (1.13, 2.10)   -
2-1          1.83 (0.83, 4.07)   -                   0.65 (0.43, 0.98)

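The AICc for this model is 104.36, lower than that of the previous model (106.39), so this model is
preferred. The equations for the final model can then be written:

\[ \log \lambda_{12} = 0.452 + 0.432 \log \mathrm{AGP1} \]
\[ \log \lambda_{21} = 1.833 - 0.429 \log \mathrm{CP} \]
\[ Q = \begin{pmatrix} -\lambda_{12} & \lambda_{12} \\ \lambda_{21} & -\lambda_{21} \end{pmatrix} \]
\[ P(t) = \exp(Q t^{0.64}) \]

The covariate coefficients are the logarithms of the hazard ratios in the table above (log 1.54 is
approximately 0.432 and log 0.65 is approximately -0.43).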
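The link between the covariate coefficients in the final equations and the hazard ratios in the preceding table can be checked by exponentiation. The sketch assumes that the CP coefficient is negative, which is what a hazard ratio of 0.65 (below 1) implies; the dictionary keys are illustrative labels.

```python
import math

# Covariate coefficients of the final model.
# Assumed sign: the CP coefficient is taken as negative (hazard ratio below 1).
coefficients = {
    "AGP1 on 1-2": 0.432,
    "CP on 2-1": -0.429,
}

# A unit increase in the log covariate multiplies the intensity by exp(beta).
hazard_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}
```

Read this way, higher AGP1 raises the rate of moving from inactive to active nephritis, while higher CP lowers the rate of returning from active to inactive.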
The following graph shows the transition intensities as functions of log AGP1 and log CP:

2.4 Model diagnostics


In this section we will diagnose model fit using different indicators.

Summary residuals
We first look at summary residuals, which allow the functional form of the covariates to be assessed.
The first graph shows that non-linear effects might be involved in the relationship of log CP with the
transition intensity.

[Figure: summary residuals (-1.0 to 1.0) against log CP.]

The next graph shows that there is considerable prediction uncertainty for log AGP1 values less than 5,
and possible non-linear effects becoming relevant for values larger than 8.

[Figure: summary residuals (-1.0 to 1.0) against log AGP1.]

Prevalence counts
The next graph shows the prevalence counts of the states, taking into account the effects of prognostic
factors.

[Figure: observed vs. expected prevalence (%) of State 1 and State 2 against time (0.0 to 1.0), model with prognostic factors.]

The following table shows the same information at a discrete set of times:

             State 1 (BILAG D,E)      State 2 (BILAG A,B,C)
Time         Observed   Expected      Observed   Expected
3 months     47         48.8          22         20.2
6 months     43         43.7          18         17.3
9 months     35         35.9          14         13.1
1 year       21         21.1          10         9.9
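The observed and expected counts in the table can be turned into percentage prevalences and compared directly. The sketch below uses the tabulated values; the helper name `percent_prevalence` and the choice of the largest percentage-point gap as a summary are illustrative.

```python
times = ["3 months", "6 months", "9 months", "1 year"]

# Counts from the table, keyed by state.
observed = {1: [47, 43, 35, 21], 2: [22, 18, 14, 10]}
expected = {1: [48.8, 43.7, 35.9, 21.1], 2: [20.2, 17.3, 13.1, 9.9]}

def percent_prevalence(counts):
    """Convert per-state counts into percentages of the total at each time."""
    totals = [a + b for a, b in zip(counts[1], counts[2])]
    percentages = {
        state: [100.0 * c / n for c, n in zip(counts[state], totals)]
        for state in counts
    }
    return percentages, totals

obs_pct, obs_totals = percent_prevalence(observed)
exp_pct, exp_totals = percent_prevalence(expected)

# Largest observed-vs-expected gap in percentage points, over states and times.
max_gap = max(
    abs(o - e)
    for state in (1, 2)
    for o, e in zip(obs_pct[state], exp_pct[state])
)
```

The observed and expected totals agree at every time point, so the comparison isolates how the model splits subjects between the two states.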


Influence plot
The following influence plot shows the residuals of the model, annotated with subject identifier. For a
good model all residuals should be around zero. Large residuals denote possible outliers. For
example, we see that subject 15119 has the highest residual, and that there are twice as many
outliers with residuals larger than 2 in the US cohort as in the UK cohort (4 vs. 2). High residuals flag
observations which may exhibit dynamic patterns that for some reason are not well captured by the
model. Note that a robust variance estimator has been calculated to take account of the two
different cohorts in the data. This estimator would be used when calculating confidence intervals on
model predictions.

[Figure: influence plot of adjusted residuals (adjsres) against observation index (0 to 70), annotated with subject identifiers.]