
A Hidden Markov Model for

Transcriptional Regulation in Single Cells


John Goutsias
Abstract: We discuss several issues pertaining to the use of stochastic biochemical systems for modeling transcriptional regulation in
single cells. By appropriately choosing the system state, we can model transcriptional regulation by a hidden Markov model (HMM).
This opens the possibility of using well-known techniques for the statistical analysis and stochastic control of HMMs to mathematically
and computationally study transcriptional regulation in single cells. Unfortunately, in all but a few simple cases, analytical
characterization of the statistical behavior of the proposed HMM is not possible. Moreover, analysis by Monte Carlo simulation is
computationally cumbersome. We discuss several techniques for approximating the HMM by one that is more tractable. We employ
simulations, based on a biologically relevant transcriptional regulatory system, to show the relative merits and limitations of various
approximation techniques and provide general guidelines for their use.
Index Terms: Hidden Markov models, Monte Carlo simulation, stochastic biochemical systems, stochastic dynamical systems,
transcriptional regulation, transcriptional regulatory systems.

1 INTRODUCTION
Transcriptional regulation is a fundamental biological
process used by cells to control their actions and
properties through protein synthesis. Transcription maps
genetic information encoded in a DNA molecule into RNA
molecules, which are then used for protein synthesis by
translation. Understanding transcriptional regulation is
fundamental to cell biology and may eventually lead to
novel techniques for the prevention and treatment of
human diseases [1], [2]. Since some readers may not be
familiar with basic cell biology, we provide a simple
introduction to transcriptional regulation in Section 2. For
more information, we refer the reader to [3], [4].
Most work on transcriptional regulation requires extensive
biological experimentation, which is time consuming and
expensive. However, it is becoming increasingly clear that
mathematical modeling of transcriptional regulation may lead to
inexpensive computational tools that can be used to understand
and predict basic principles underlying this important biological
process and guide biological experimentation via simulation [5], [6].
We may consider transcriptional regulation in a large
population of cells [7] or in single cells [8], [9]. In the former
case, we may construct a model to predict the dynamic
evolutions of the concentrations of molecular species in the
population. In the latter case, we may construct a model to
predict the dynamic evolutions of various statistics (e.g.,
means and standard deviations) of the copy numbers of each
molecular species in a single cell. To model transcriptional
regulation in a population of cells, we need to assume that a
large number of genotypically identical cells are available
[7], which express the same set of genes using identical
molecular machineries. Unfortunately, we cannot satisfy
this assumption in practice. Moreover, the averaging effect
of studying transcriptional regulation in a large population
of cells may mask important biological behavior and may
lead to false conclusions. Therefore, it is more appropriate
to study transcriptional regulation in single cells [10].
Because biochemical reactions in single cells may be initiated by
molecular collisions at random times, fluctuations may dominate
transcriptional regulation dynamics [11], [12], [13]. This necessitates
the use of a stochastic approach to transcriptional regulation [8], [9],
[11], [14]. To develop such an approach, we may assume that a cell is a
well-stirred homogeneous medium at thermal equilibrium, composed of a
number of interacting molecules. New molecules are synthesized by
biochemical reactions initiated at random times by stochastic
interactions among existing molecules. This simplified view allows us to
model transcriptional regulation in single cells by a mathematically
tractable stochastic biochemical system. We discuss this approach in
Section 3.
Our main objective in this paper is to discuss several important issues
pertaining to the use of stochastic biochemical systems for modeling
transcriptional regulation in single cells. It is most common in the
literature to characterize the state of a stochastic biochemical system
by the vector $\mathbf{X}(t)$ of the copy numbers of the molecular
species present in the system at time $t$. Then, a continuous-time
vector-valued Markov chain (more precisely, a birth-death process) is
used to characterize the dynamic evolution of that state. In Section 3,
we argue that it may be more appropriate to characterize the state of a
stochastic biochemical system by the vector $\mathbf{Z}(t)$ of the
numbers of occurrences of the underlying reactions, from which the copy
numbers of the molecular species may be directly calculated. In this
case, a continuous-time vector-valued Markov chain (more precisely, a
birth process) is used to characterize the dynamic evolution of that
state. Unfortunately, we cannot measure the state $\mathbf{Z}(t)$ or
calculate it from $\mathbf{X}(t)$. Therefore, we may model
transcriptional regulation by a hidden Markov model (HMM) [15], with
$\mathbf{Z}(t)$ being the hidden state and $\mathbf{X}(t)$ being the
observable state of that model. This opens the possibility of using
well-known techniques for the statistical analysis and stochastic
control of HMMs to mathematically and computationally study
transcriptional regulation in single cells.
IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 3, NO. 1, JANUARY-MARCH 2006 57
The author is with the Whitaker Biomedical Engineering Institute, Clark
Hall 308A, The Johns Hopkins University, Baltimore, MD 21218.
E-mail: goutsias@jhu.edu.
Manuscript received 28 Mar. 2005; revised 21 July 2005; accepted 15 Aug.
2005; published online 31 Jan. 2006.
For information on obtaining reprints of this article, please send e-mail to:
tcbb@computer.org, and reference IEEECS Log Number TCBB-0018-0305.
1545-5963/06/$20.00 © 2006 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM.
In all but a few simple cases, we cannot analytically
characterize the statistical behavior of the hidden and
observable states. However, we can use Monte Carlo
techniques to stochastically simulate the system and
estimate relevant statistics. Unfortunately, this is a computationally
intensive approach. In Section 4, we discuss
several techniques that can be used to approximate the
dynamic evolutions of the hidden states. Some techniques
have been used in the literature to approximate the
dynamic evolutions of the observable states. Simulations,
based on a biologically relevant transcriptional regulatory
system, clearly show the relative merits and limitations of
various approximations. It turns out that some techniques
may not be appropriate, whereas others may produce
excellent approximations. Finally, we summarize our
conclusions in Section 5.
2 TRANSCRIPTIONAL REGULATION
Transcription and translation are two important biological
mechanisms used by cells for protein synthesis. During
transcription, the DNA coding region of a gene is copied
into messenger RNA (mRNA) molecules. A gene is usually
associated with two DNA regions, known as the regulatory
region and the promoter of the gene. Proteins, known as
transcription factors (TFs), bind at specific sites along the
regulatory region of a gene and recruit a large enzyme, the
RNA polymerase II, at the promoter of the gene. After
binding at the promoter, the RNA polymerase II locally
separates the two DNA strands and transcribes the gene by
moving along one of the strands. The TFs regulate
transcription by either activating or repressing the recruitment
and binding of the RNA polymerase II at the promoter
of the gene.
During translation, the information encoded in mRNA
molecules is used for protein synthesis. This is done by a
large molecular complex, the ribosome. After binding an
mRNA molecule, the ribosome reads the encoded genetic
information, maps each codon to one of 20 amino acids, and
chemically links these amino acids to form a protein.
mRNAs and proteins may be subject to degradation.
Proteins may also be subject to chemical modifications and
processing (e.g., dimerization, cleavage, phosphorylation,
etc.). These steps may alter mRNA and protein activity and
exert additional control on transcriptional regulation.
To illustrate the previous steps, we refer to the one-gene regulatory
system depicted in Fig. 1 (see footnote 1). The gene's regulatory region
consists of two distinct binding sites, $R_1$ and $R_2$. Moreover, its
promoter coincides with $R_2$. The TF D may bind at site $R_1$ and, at
sufficiently high concentrations, may also bind at site $R_2$. The
binding of D at $R_1$ activates transcription of the gene by recruiting
the RNA polymerase II at the promoter. Activation of transcription
produces mRNA transcripts that are translated into proteins M. After
synthesis, two M molecules may bind (i.e., dimerize) to form a stable TF
molecule D. These steps form a positive feedback loop that, if left
unchecked, may produce an unbounded number of proteins M and TFs D.
However, since the number of D molecules increases as a function of
time, a D molecule will eventually bind at site $R_2$. This will exclude
RNA polymerase II from binding at the promoter, in which case
transcription will be repressed. The resulting negative feedback will
eventually stabilize protein synthesis at some desired level.
The reader should keep in mind that transcriptional
regulation is controlled by several additional, poorly understood
biological mechanisms, such as mRNA and
protein localization, alternative splicing, protein folding,
and chromatin modification and remodeling [3], [4]. By
limiting ourselves to the previously discussed mechanisms,
we obtain an approximation that allows us to design simple
and tractable mathematical models for transcriptional
regulation. We discuss one such model next.
3 A HIDDEN MARKOV MODEL
3.1 Stochastic Biochemical Systems
A stochastic biochemical system consists of $M$ elementary
(monomolecular or bimolecular) irreversible reaction channels, which
react at random times. A monomolecular reaction channel converts a
reactant molecule into one or more product molecules. A bimolecular
reaction channel converts two reactant molecules into one or more
product molecules. We can decompose a reaction channel that involves
more than two reactant molecules into a cascade of elementary reaction
channels and model a reversible reaction channel by two irreversible
reaction channels.
We characterize the state of a stochastic biochemical system at time $t$
by the $M$-dimensional random vector
$\mathbf{Z}(t) = [Z_1(t)\ Z_2(t)\ \cdots\ Z_M(t)]^T$, where $Z_m(t) = z$
if the $m$th reaction has occurred $z$ times during the time interval
$[0, t)$ and $T$ denotes vector or matrix transposition. The random
variable $Z_m(t)$ is referred to as the degree of advancement (DA) of the
$m$th reaction [17]. In the following, we denote by $X_n(t)$ the number
of molecules of the $n$th reactant or product species present in the
system at time $t$. By assuming $N$ distinct species, we set
$\mathbf{X}(t) = [X_1(t)\ X_2(t)\ \cdots\ X_N(t)]^T$.
Footnote 1. This is a simplified version of a basic biological mechanism
of a genetic switch that controls the fate of an E. coli cell infected
by the bacteriophage $\lambda$ virus [16]. This mechanism controls
transcriptional regulation of the bacteriophage $\lambda$ repressor
protein (CI), responsible for maintaining a passive integration of the
$\lambda$ chromosome into the host DNA, a state known as lysogeny.
Fig. 1. A simple transcriptional regulatory system.
Given that the biochemical system is at state $\mathbf{X}(t) = \mathbf{x}$
at time $t$, let $\gamma_m(\mathbf{x})$ be the number of all possible
distinct combinations of the reactant molecules associated with the
$m$th reaction channel when the system is at state $\mathbf{x}$. Note that
$$
\gamma_m(\mathbf{x}) =
\begin{cases}
x_n, & \text{for monomolecular reactions} \\
x_n (x_n - 1)/2, & \text{for bimolecular reactions with identical reactants} \\
x_n x_{n'}, & \text{for bimolecular reactions with different reactants,}
\end{cases}
\tag{1}
$$
for some $1 \le n, n' \le N$, $n \ne n'$. Moreover, let $c_m > 0$ be the
probability per unit time that a randomly chosen combination of reactant
molecules will react through the $m$th reaction channel. This
probability is known as the specific probability rate constant of the
$m$th reaction. Then, the probability that one $m$th reaction will occur
during a time interval $[t, t + dt)$ will approximately be equal to
$\pi_m(\mathbf{x})\,dt$, for a sufficiently small $dt$, where
$$
\pi_m(\mathbf{x}) \triangleq c_m \gamma_m(\mathbf{x}), \quad
m \in \mathcal{M} \triangleq \{1, 2, \ldots, M\},
$$
is known as the propensity function of the $m$th reaction
channel [18], [19].
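The case distinction in (1) and the definition of the propensity function can be sketched in a few lines of Python. This is only an illustration: the function names, the 0-based species indices, and the copy numbers and rate constant used below are hypothetical and do not come from the paper.

```python
def combinations(x, reactants):
    """Number of distinct reactant combinations for one channel, as in (1).

    x         -- list of copy numbers, one entry per species
    reactants -- tuple of species indices (an illustrative encoding):
                 (n,)    monomolecular
                 (n, n)  bimolecular with identical reactants
                 (n, k)  bimolecular with different reactants
    """
    if len(reactants) == 1:
        return x[reactants[0]]
    n, k = reactants
    if n == k:
        return x[n] * (x[n] - 1) // 2
    return x[n] * x[k]


def propensity(c, x, reactants):
    """Propensity of one channel: rate constant times number of combinations."""
    return c * combinations(x, reactants)


x = [5, 3]                          # hypothetical copy numbers of two species
print(combinations(x, (0,)))        # 5
print(combinations(x, (0, 0)))      # 10
print(combinations(x, (0, 1)))      # 15
print(propensity(0.5, x, (0, 0)))   # 5.0
```

Multiplying the last value by a small dt gives the approximate probability that this channel fires once during an interval of length dt.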
Note that, given the state $\mathbf{z}(t)$ of the biochemical system at
time $t$, we can uniquely determine the state $\mathbf{x}(t)$ of the
system at time $t$. This is due to the fact that
$$
X_n(t) = g_n(\mathbf{Z}(t)) \triangleq x_{0,n} + \sum_{m \in \mathcal{M}} s_{nm} Z_m(t),
\quad t \ge 0, \tag{2}
$$
for $n \in \mathcal{N} \triangleq \{1, 2, \ldots, N\}$, where $x_{0,n}$
is the initial number of molecules of the $n$th species present in the
cell at time $t = 0$ and $s_{nm}$ is the stoichiometric coefficient.
This coefficient quantifies the change in the number of molecules of the
$n$th molecular species caused by one occurrence of the $m$th reaction.
The state $\mathbf{z}(t)$ cannot be determined from $\mathbf{x}(t)$ in
general since there might be several states $\mathbf{z}(t)$ that lead to
the same state $\mathbf{x}(t)$. To distinguish $\mathbf{Z}(t)$ from
$\mathbf{X}(t)$, we refer to $\mathbf{Z}(t)$ as the hidden state and to
$\mathbf{X}(t)$ as the observable state.
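A small sketch may help to see why the map in (2) cannot be inverted. The two-reaction system below (dimerization and its reverse), its initial copy numbers, and the helper name `observable_state` are illustrative assumptions rather than part of the paper.

```python
def observable_state(x0, S, z):
    """Evaluate (2): initial copy numbers plus stoichiometry times DAs."""
    return [x0[n] + sum(S[n][m] * z[m] for m in range(len(z)))
            for n in range(len(x0))]


# Hypothetical toy system: reaction 0 is M + M -> D, reaction 1 is D -> M + M.
x0 = [2, 4]        # initial copy numbers of M and D (illustrative)
S = [[-2, 2],      # change in M per occurrence of reactions 0 and 1
     [1, -1]]      # change in D per occurrence of reactions 0 and 1

print(observable_state(x0, S, [0, 0]))   # [2, 4]
print(observable_state(x0, S, [1, 1]))   # [2, 4]
print(observable_state(x0, S, [1, 0]))   # [0, 5]
```

The hidden states [0, 0] and [1, 1] produce the same observable state, so the reaction counts cannot be recovered from the copy numbers alone.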
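The discrete-valued random process
$\mathbf{Z} = \{\mathbf{Z}(t), t \ge 0\}$ characterizes the dynamic
evolution of the hidden state of a biochemical system. This process is
specified by the probability mass function (PMF)
$P_Z(\mathbf{z}; t) = \Pr[\mathbf{Z}(t) = \mathbf{z} \mid \mathbf{Z}(0) = \mathbf{0}]$,
for every $t \ge 0$. Simple probabilistic arguments show that
$P_Z(\mathbf{z}; t)$ satisfies the following first-order differential
equation [20]:
$$
\frac{\partial P_Z(\mathbf{z}; t)}{\partial t} =
\sum_{m \in \mathcal{M}} \left\{ \alpha_m(\mathbf{z} - \mathbf{e}_m)
P_Z(\mathbf{z} - \mathbf{e}_m; t) - \alpha_m(\mathbf{z}) P_Z(\mathbf{z}; t) \right\},
\tag{3}
$$
for $t > 0$, with initial condition $P_Z(\mathbf{0}; 0) = 1$, where
$\mathbf{e}_m$ is the $m$th column of the $M \times M$ identity matrix and
$$
\alpha_m(\mathbf{z}) \triangleq \pi_m(\mathbf{g}(\mathbf{z}))
= c_m \gamma_m(\mathbf{g}(\mathbf{z})), \tag{4}
$$
with $\mathbf{g}(\mathbf{z}) = [g_1(\mathbf{z})\ g_2(\mathbf{z})\ \cdots\ g_N(\mathbf{z})]^T$.
This is the well-known forward Kolmogorov differential equation [21],
[22], [23] that governs the stochastic evolution of a continuous-time
Markov chain. However, in computational biochemistry, (3) is referred to
as the chemical master equation (CME) [17], a term that we also use in
this paper. It turns out that $\mathbf{Z}$ is a multivariate birth
process [21], [23].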
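We can show from (3) that the means $\mu_m(t) \triangleq E[Z_m(t)]$ and
covariances $\rho_{mm'}(t) \triangleq \mathrm{Cov}[Z_m(t), Z_{m'}(t)]$ of
the hidden state process satisfy the following system of first-order
differential equations (see footnote 2):
$$
\frac{d\mu_m(t)}{dt} = E[\alpha_m(\mathbf{Z}(t))], \quad m \in \mathcal{M}, \tag{5}
$$
$$
\begin{aligned}
\frac{d\rho_{mm'}(t)}{dt} ={}& E[\alpha_m(\mathbf{Z}(t))]\,\delta(m - m') \\
&+ E[Z_m(t)\,\alpha_{m'}(\mathbf{Z}(t))] - \mu_m(t)\,E[\alpha_{m'}(\mathbf{Z}(t))] \\
&+ E[Z_{m'}(t)\,\alpha_m(\mathbf{Z}(t))] - \mu_{m'}(t)\,E[\alpha_m(\mathbf{Z}(t))],
\quad m, m' \in \mathcal{M},
\end{aligned}
\tag{6}
$$
where $\delta(0) = 1$ and $\delta(m) = 0$, for $m \ne 0$. Note that the
time derivatives $d\mu_m(t)/dt$, $m \in \mathcal{M}$, given by (5),
define the reaction rates of the reactions in $\mathcal{M}$. These
derivatives are also known as (time-dependent) fluxes [24].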
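As a quick sanity check of (3), (5), and (6), consider a hypothetical system with a single reaction whose propensity is a constant $a > 0$, for instance a monomolecular channel whose reactant is not consumed. This special case is our own illustration and is not taken from the paper. Writing $Z(t)$ for the single DA, with mean $\mu(t)$ and variance $\rho(t)$, and using $P_Z(-1; t) = 0$, the three equations reduce to:

```latex
\begin{align*}
\frac{\partial P_Z(z;t)}{\partial t} &= a\,[P_Z(z-1;t) - P_Z(z;t)]
  &&\Longrightarrow\quad P_Z(z;t) = \frac{(at)^z}{z!}\,e^{-at}, \\
\frac{d\mu(t)}{dt} &= E[a] = a
  &&\Longrightarrow\quad \mu(t) = at, \\
\frac{d\rho(t)}{dt} &= a + 2\,\{E[Z(t)\,a] - \mu(t)\,a\} = a
  &&\Longrightarrow\quad \rho(t) = at.
\end{align*}
```

The DA is then a Poisson process with rate $a$, and the mean and variance obtained from the moment equations agree with those of the Poisson PMF that solves the CME.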
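By following probabilistic arguments similar to the ones that lead to
(3), we can show that the PMF
$P_X(\mathbf{x}; t) = \Pr[\mathbf{X}(t) = \mathbf{x} \mid \mathbf{X}(0) = \mathbf{x}_0]$
of the observable state process
$\mathbf{X} = \{\mathbf{X}(t), t \ge 0\}$ satisfies the following CME [18]:
$$
\frac{\partial P_X(\mathbf{x}; t)}{\partial t} =
\sum_{m \in \mathcal{M}} \left\{ \pi_m(\mathbf{x} - \mathbf{s}_m)
P_X(\mathbf{x} - \mathbf{s}_m; t) - \pi_m(\mathbf{x}) P_X(\mathbf{x}; t) \right\},
\tag{7}
$$
for $t > 0$, with $P_X(\mathbf{x}_0; 0) = 1$, where
$\mathbf{s}_m = [s_{1m}\ s_{2m}\ \cdots\ s_{Nm}]^T$ is the
$N$-dimensional vector of the stoichiometric coefficients associated
with the $m$th reaction. In this case, $\mathbf{X}$ is a multivariate
birth-death process [21], [23].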
In most publications (except [20], [25]), only the molecular population
process $\mathbf{X}$ is used to characterize the state of a stochastic
biochemical system. In certain cases, however, we must also use the DA
process $\mathbf{Z}$. For example, we may want to evaluate the
efficiency of a transcriptional reaction by calculating the average
number of mRNA molecules synthesized during a given time period or the
mean waiting time between successive occurrences of mRNA synthesis
events. Since we cannot, in general, evaluate these quantities
analytically, we must repeatedly sample the hidden states of the
biochemical system and use the resulting DA trajectories to obtain Monte
Carlo estimates of these quantities. Another important use of the DA
process comes from our need to elucidate the biochemical mechanisms of
transcriptional regulation and investigate how these mechanisms affect
cellular function. It has been noted in [24] that a promising approach
to this problem is to develop a quantitative methodology that allows us
to systematically study how various reactions in a transcriptional
pathway determine the molecular population dynamics and fluxes. We
believe that, in view of the fact that the observable system dynamics
are determined by a (linear) superposition of individual hidden state
dynamics (recall (2)), the DA process will play a key role in
constructing such a methodology. Finally, there might be cases in which
we can only specify a stochastic biochemical system by a CME over the
DA process. This is true in Section 4.5, where we approximate a
stochastic biochemical system that contains slow and fast reactions by
one that contains only slow reactions. It turns out that the
approximating system can only be specified by a CME similar to (3); see
(25) below. Since we cannot determine the DA process $\mathbf{Z}$ from
the molecular population process $\mathbf{X}$, we must characterize a
stochastic biochemical system by using both states.
Footnote 2. Although most statistical quantities used in this paper
depend on the initial conditions $\mathbf{Z}(0) = \mathbf{0}$ or
$\mathbf{X}(0) = \mathbf{x}_0$, for simplicity, we do not show this
dependence in our formulation.
It is clear from the previous discussion that we can use the following
HMM to characterize a stochastic biochemical system:
$$
\mathbf{z}(t) \sim P_Z(\mathbf{z}; t) \quad \text{(hidden state)}, \tag{8}
$$
$$
\mathbf{x}(t) = \mathbf{g}(\mathbf{z}(t)) \quad \text{(observable state)}, \tag{9}
$$
$$
\mathbf{y}(t) \sim p_{y|x}(\mathbf{y}(t) \mid \mathbf{x}(t)) \quad \text{(measurements)}, \tag{10}
$$
where $p_{y|x}(\cdot \mid \cdot)$ in (10) is the conditional probability
density function of obtaining measurements $\mathbf{y}(t)$ of the
observable system state $\mathbf{x}(t)$. Since, in this paper, we are
not interested in modeling the measurement process, we focus our
attention on (8) and (9).
3.2 Simulation
Except in simple cases, it is not possible to analytically
derive a solution to the CME (3). However, it is possible to
simulate the dynamics of the HMM (8), (9) by an exact
stochastic simulation algorithm, known as the Gillespie
algorithm [26], and estimate relevant statistics (e.g., means,
variances, and PMFs) by Monte Carlo simulation [27].
The Gillespie algorithm, applied to (3), generates a sample trajectory,
$\{\mathbf{z}(t), t \ge 0\}$, by following two steps. First, given the
hidden state $\mathbf{z}(t)$ of the biochemical system at time $t$, the
time $t + \tau$ of the next reaction is determined by drawing a sample
$\tau$ from the exponential distribution:
$$
T(\tau; t) = \alpha_0(\mathbf{z}(t))\, e^{-\alpha_0(\mathbf{z}(t))\,\tau}, \quad \tau \ge 0,
$$
where
$$
\alpha_0(\mathbf{z}) \triangleq \sum_{m \in \mathcal{M}} \alpha_m(\mathbf{z}).
$$
Then, the choice of the next reaction is determined by drawing a sample
from the PMF:
$$
P(m; t) = \frac{\alpha_m(\mathbf{z}(t))}{\alpha_0(\mathbf{z}(t))}, \quad m \in \mathcal{M},
$$
and the DA of that reaction is increased by one.
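The two steps above can be sketched as follows for the hidden (DA) state. This is a minimal illustration, not the author's Matlab code: the two-reaction toy system (M + M -> D and D -> M + M), its rate constants, and its initial copy numbers are hypothetical.

```python
import random


def propensities(z, c=(0.05, 0.1), x0=(20, 0)):
    """Propensities as functions of the DAs for the toy system (illustrative)."""
    x_m = x0[0] - 2 * z[0] + 2 * z[1]      # monomers, via the linear map (2)
    x_d = x0[1] + z[0] - z[1]              # dimers, via the linear map (2)
    return [c[0] * x_m * (x_m - 1) / 2, c[1] * x_d]


def gillespie_da(t_end, rng):
    """Sample one trajectory of the hidden state up to time t_end."""
    t, z = 0.0, [0, 0]
    path = [(t, tuple(z))]
    while True:
        alpha = propensities(z)
        alpha0 = sum(alpha)
        if alpha0 == 0.0:                  # no reaction can occur any more
            break
        # Step 1: exponential waiting time with rate alpha0.
        t += rng.expovariate(alpha0)
        if t > t_end:
            break
        # Step 2: pick reaction m with probability alpha[m] / alpha0.
        u, m, acc = rng.random() * alpha0, 0, alpha[0]
        while u >= acc and m < len(alpha) - 1:
            m += 1
            acc += alpha[m]
        z[m] += 1                          # advance the DA of reaction m
        path.append((t, tuple(z)))
    return path


path = gillespie_da(t_end=10.0, rng=random.Random(0))
print(len(path) - 1, "reactions; final hidden state", path[-1][1])
```

Each entry of `path` records a reaction time and the DA vector just after it, so observable copy numbers can be recovered at any point through (2).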
The previous algorithm is referred to as the direct
Gillespie algorithm, to distinguish it from another variation
known as the first reaction method [28]. Unfortunately, both
versions of the Gillespie algorithm are computationally
intensive, especially when applied to large and highly
reactive biochemical systems. Recent attempts to accelerate
the Gillespie algorithm have produced a number of
refinements [29], [30], [31], [32], [33], [34], [35], [36].
However, these algorithms remain computationally intensive
as biochemical systems become progressively more
complex.
In the following section, we discuss techniques that, under specific
assumptions, can be effectively used to approximate the dynamic
evolution of the hidden state $\mathbf{Z}(t)$. These techniques lead to
a more efficient implementation of the Gillespie algorithm and, under
additional assumptions, allow us to analytically approximate the
solution of the CME (3) by a multivariate normal distribution whose
means and covariances are calculated by recursively solving a system of
first-order ordinary differential equations.
3.3 Example
We illustrate various concepts and techniques discussed in this paper by
using the simple transcriptional regulatory system depicted in Fig. 1.
This system consists of $N = 6$ molecular species which react in
accordance with $M = 10$ reactions. We summarize these reactions in
Table 1 and provide biologically relevant values for the associated
specific probability rate constants, obtained from our work in [25]. The
first reaction models translation of mRNA into protein M, whereas
reaction 3 models transcription. Reactions 2 and 4 model the degradation
of M and mRNA, respectively. Reactions 5-8 model dimer/DNA
binding/unbinding. Finally, reactions 9 and 10 model dimerization of M
to D. Note that reactions 1-4, 6, 8, and 10 are monomolecular, whereas
reactions 5, 7, and 9 are bimolecular. We initialize the system with two
monomers and four dimers and assume two DNA templates per cell. In this
case:
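$$
\begin{aligned}
\alpha_1(\mathbf{z}) &= c_1 (z_3 - z_4) \\
\alpha_2(\mathbf{z}) &= c_2 (2 + z_1 - z_2 - 2z_9 + 2z_{10}) \\
\alpha_3(\mathbf{z}) &= c_3 (z_5 - z_6 - z_7 + z_8) \\
\alpha_4(\mathbf{z}) &= c_4 (z_3 - z_4) \\
\alpha_5(\mathbf{z}) &= c_5 (4 - z_5 + z_6 - z_7 + z_8 + z_9 - z_{10})(2 - z_5 + z_6) \\
\alpha_6(\mathbf{z}) &= c_6 (z_5 - z_6 - z_7 + z_8) \\
\alpha_7(\mathbf{z}) &= c_7 (4 - z_5 + z_6 - z_7 + z_8 + z_9 - z_{10})(z_5 - z_6 - z_7 + z_8) \\
\alpha_8(\mathbf{z}) &= c_8 (z_7 - z_8) \\
\alpha_9(\mathbf{z}) &= c_9 (2 + z_1 - z_2 - 2z_9 + 2z_{10})(1 + z_1 - z_2 - 2z_9 + 2z_{10})/2 \\
\alpha_{10}(\mathbf{z}) &= c_{10} (4 - z_5 + z_6 - z_7 + z_8 + z_9 - z_{10}).
\end{aligned}
\tag{11}
$$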
Moreover, (2) results in:
$$
\begin{aligned}
X_1(t) &= 2 + Z_1(t) - Z_2(t) - 2Z_9(t) + 2Z_{10}(t) \\
X_2(t) &= 4 - Z_5(t) + Z_6(t) - Z_7(t) + Z_8(t) + Z_9(t) - Z_{10}(t) \\
X_3(t) &= Z_3(t) - Z_4(t) \\
X_4(t) &= 2 - Z_5(t) + Z_6(t) \\
X_5(t) &= Z_5(t) - Z_6(t) - Z_7(t) + Z_8(t) \\
X_6(t) &= Z_7(t) - Z_8(t),
\end{aligned}
\tag{12}
$$
where the correspondence between $X_n$ and a particular
molecular species is depicted in Table 1.
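The consistency between (4), (11), and (12) can be checked numerically. In the sketch below, the stoichiometric matrix and the reactant lists are our reading of the signs and factors in (11) and (12); the rate constants are placeholders set to 1 and are not the values of Table 1, and the chosen hidden state is arbitrary.

```python
# Initial copy numbers and stoichiometric matrix (rows: species X_1..X_6,
# columns: reactions 1..10), read off from (12).
X0 = [2, 4, 0, 2, 0, 0]
S = [
    [1, -1, 0,  0,  0,  0,  0,  0, -2,  2],   # X_1
    [0,  0, 0,  0, -1,  1, -1,  1,  1, -1],   # X_2
    [0,  0, 1, -1,  0,  0,  0,  0,  0,  0],   # X_3
    [0,  0, 0,  0, -1,  1,  0,  0,  0,  0],   # X_4
    [0,  0, 0,  0,  1, -1, -1,  1,  0,  0],   # X_5
    [0,  0, 0,  0,  0,  0,  1, -1,  0,  0],   # X_6
]
# Reactant species (0-based) of each reaction, read off from (11).
REACTANTS = [(2,), (0,), (4,), (2,), (1, 3), (4,), (1, 4), (5,), (0, 0), (1,)]
C = [1.0] * 10   # placeholder rate constants, NOT the values of Table 1


def g(z):
    """Observable state from the hidden state, as in (2) and (12)."""
    return [X0[n] + sum(S[n][m] * z[m] for m in range(10)) for n in range(6)]


def alpha(z):
    """Propensities as functions of the hidden state, as in (4)."""
    x = g(z)
    out = []
    for c, r in zip(C, REACTANTS):
        if len(r) == 1:
            gamma = x[r[0]]
        elif r[0] == r[1]:
            gamma = x[r[0]] * (x[r[0]] - 1) / 2
        else:
            gamma = x[r[0]] * x[r[1]]
        out.append(c * gamma)
    return out


z = [3, 0, 2, 1, 1, 0, 0, 0, 1, 0]   # an arbitrary hidden state
print(g(z))                          # [3, 4, 1, 1, 1, 0]
print(alpha(z))
```

With all rate constants equal to one, each entry of `alpha(z)` should match the bracketed expression of the corresponding line of (11) evaluated at the same hidden state.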
In Fig. 2, we depict typical realizations of the dynamic evolutions of
some hidden and some observable states of the transcriptional regulatory
system depicted in Fig. 1. These realizations have been obtained by the
exact simulation method of Gillespie, applied to (3), during a 35 minute
period (a typical time between successive divisions of E. coli cells).
We also depict the dynamic evolutions of the means and standard
deviations about the means, estimated by Monte Carlo simulation of 1,000
sample trajectories. Moreover, in Fig. 3a, we depict the PMFs of the
monomers, dimers, and mRNA transcripts at time $t = 10$ min, estimated
by the same Monte Carlo simulation.
Initially, there are no DNA templates that are bound at both regulatory
sites $R_1$ and $R_2$ by D, in which case active transcription of the
gene takes place. The resulting positive feedback sustains mRNA
synthesis, which results in a gradual increase of monomer M and dimer D
molecules. Eventually, dimers D bind at both regulatory sites, in which
case transcription is effectively repressed. The resulting negative
feedback gradually represses mRNA synthesis. The number of mRNA
molecules present in the cell reaches a maximum of eight molecules.
Subsequently, all mRNA molecules are consumed by degradation. Overall,
positive feedback gradually increases the population of dimers, which is
then stabilized by negative feedback. The simulations depicted in Fig. 2
were coded in Matlab and took, on average, 60 sec of CPU time per sample
trajectory on a 2 GHz Xeon PC running Windows 2000 (see footnote 3).
4 APPROXIMATIONS
We mentioned in the previous section that, in most cases, it
is not possible to derive an analytical solution of the CME.
Instead, we need to use the Gillespie algorithm, in
conjunction with Monte Carlo simulation techniques, to
stochastically simulate the CME (3) and estimate hidden
and observable state statistics (e.g., means and variances). It
turns out that this approach is computationally intensive.
For example, the simulations depicted in Fig. 2 and Fig. 3a
took about 16 hrs. of CPU time. For this reason, it is very
important to approximate the CME (3) by a more tractable
equation. In this section, we present a number of approximations
and discuss their relative merits and limitations.
4.1 Langevin Approximation
A useful approximation to the CME (3) is obtained by assuming that there
exists a time step $dt$ such that the following two conditions are
satisfied:
C1. Changes in the hidden system states that occur during any time
interval $[t, t + dt)$ do not appreciably affect the propensity
functions $\alpha_m(\mathbf{z})$, $m \in \mathcal{M}$.
C2. The expected number of occurrences of each reaction in a time
interval $[t, t + dt)$ is much larger than one.
It can be shown that, under both conditions, the dynamic evolution of
the hidden state process $\mathbf{Z}$ is governed by the following
system of stochastic differential equations [19]:
$$
dZ_m(t) = \alpha_m(\mathbf{Z}(t))\,dt + \sqrt{\alpha_m(\mathbf{Z}(t))}\;dW_m(t),
\quad m \in \mathcal{M}, \tag{13}
$$
for $t > 0$, where $\{W_m, m \in \mathcal{M}\}$ are mutually independent
standard Brownian motions, which are also independent of $\mathbf{Z}$.
Each equation in (13) is a Langevin equation [17].
TABLE 1. Reactions Associated with the Transcriptional Regulatory System
Depicted in Fig. 1
Footnote 3. The reader can find our code as supplementary material,
which can be accessed on the Computer Society Digital Library at
http://computer.org/tcbb/archives.htm. The cited CPU times are not the
best possible since our code is not optimized. However, they provide a
clear distinction between the computational requirements of the
techniques discussed in this paper.
The system (13) can be numerically solved by discretizing time at
equally spaced points $k\,dt$, $k = 0, 1, \ldots$, and by integrating
(13) using the well-known Euler-Maruyama method [37]. Because of
condition C1, this leads to the following iterations:
$$
Z_m((k+1)\,dt) = Z_m(k\,dt) + \alpha_m(\mathbf{Z}(k\,dt))\,dt
+ \sqrt{\alpha_m(\mathbf{Z}(k\,dt))\,dt}\;N_m,
\quad k = 0, 1, \ldots, \ m \in \mathcal{M}, \tag{14}
$$
initialized by $Z_m(0) = 0$, for every $m \in \mathcal{M}$, where
$\{N_m, m \in \mathcal{M}\}$ are mutually independent zero mean Gaussian
random variables with unit variance, which are also independent of
$\mathbf{Z}$. We refer to the resulting approximation technique as the
Langevin approximation (LA) method. A similar approximation, applied to
the observable states $\mathbf{X}(t)$, has been extensively used for
modeling biological fluctuations in single cells (e.g., see [8], [38],
[39], [40], [41], [42]).
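The iteration (14) can be sketched as below for a hypothetical two-reaction toy system (M + M -> D and D -> M + M). The rate constants, initial copy numbers, and the clamping of propensities at zero are our own assumptions; the clamping is a practical guard that the paper does not discuss, and the toy system is far too small to satisfy condition C2, so the sketch shows the mechanics only.

```python
import math
import random


def propensities(z, c=(0.05, 0.1), x0=(20.0, 0.0)):
    """Toy propensities in terms of (now real-valued) DAs; clamped at zero."""
    x_m = x0[0] - 2 * z[0] + 2 * z[1]
    x_d = x0[1] + z[0] - z[1]
    a1 = c[0] * x_m * (x_m - 1) / 2 if x_m > 1 else 0.0   # assumption: clamp
    a2 = c[1] * x_d if x_d > 0 else 0.0                   # assumption: clamp
    return [a1, a2]


def langevin_da(t_end, dt, rng):
    """Euler-Maruyama iteration of the form (14), started at zero DAs."""
    z = [0.0, 0.0]
    for _ in range(int(round(t_end / dt))):
        alpha = propensities(z)
        # Drift alpha*dt plus Gaussian noise with variance alpha*dt.
        z = [z_m + a * dt + math.sqrt(a * dt) * rng.gauss(0.0, 1.0)
             for z_m, a in zip(z, alpha)]
    return z


z = langevin_da(t_end=10.0, dt=0.05, rng=random.Random(0))
print("approximate DAs at t = 10:", z)
```

The returned DAs are real numbers rather than integers, which is the behavior the text contrasts with the step-like trajectories of the exact method.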
In many cases, we may not be able to simultaneously satisfy the previous
conditions. Referring to the transcriptional regulatory system depicted
in Fig. 1, we may pick a sufficiently small value for $dt$ so that the
propensity functions do not appreciably change during any time interval
$[t, t + dt)$, thus satisfying condition C1. However, transcription and
translation are slow reactions, which means that they will occur
infrequently during the time interval $[t, t + dt)$, as compared to
other reactions. In this case, condition C2 will not be satisfied for
the chosen value of $dt$ and the LA method may fail to provide a
satisfactory approximation.
We illustrate this problem in Fig. 4, where we depict typical
realizations of the dynamic evolutions of some hidden and some
observable states of the transcriptional regulatory system depicted in
Fig. 1, obtained by (14) and (2), with $dt = 0.05$ s. We also depict the
dynamic evolutions of their means and standard deviations about the
means, estimated by Monte Carlo simulation of 1,000 sample trajectories.
Moreover, in Fig. 3b, we depict the PMFs of the monomers, dimers, and
mRNA transcripts at time $t = 10$ min, estimated by the same Monte Carlo
simulation. The sample trajectories of the numbers of reactions and mRNA
transcripts depicted in Fig. 4 are not satisfactory since they do not
follow the integer-valued, step-like behavior of the actual trajectories
(compare with Fig. 2). This is due to the fact that these reactions
occur infrequently, in which case we cannot simultaneously satisfy
conditions C1 and C2.
However, the LA method results in very good Monte Carlo estimates for
the means, standard deviations, and PMFs. This is due to the fact that,
in the limit as $dt \to 0$, (14) implies that the hidden state means and
covariances of the approximating system satisfy the same system of
differential equations as the original system, given by (5), (6).
Therefore, the LA method always provides an exact match of the first and
second-order statistics of $\mathbf{Z}$ for a sufficiently small time
step $dt$.
On average, it took about 10 sec of CPU time to sample the system
states, which is six times faster than the CPU time required by the
exact method. Note that the computational savings obtained by using the
LA method are moderate. This is due to the fact that, to satisfy
condition C1, we need to choose a rather small time step, $dt$, which
leads to a large number of iterations (42,000 iterations).
4.2 Linear Noise Approximation
Unfortunately, the LA method does not allow us to obtain an expression
for the joint probability density function (PDF) $p_Z(\mathbf{z}; t)$ of
the hidden states. However, by using additional approximations, we can
characterize the hidden states by a multivariate Gaussian PDF
$\tilde{p}_Z(\mathbf{z}; t)$, given by
$$
\tilde{p}_Z(\mathbf{z}; t) =
\frac{1}{(2\pi)^{M/2}\,|\Omega \mathbf{Q}(t)|^{1/2}}
\exp\left\{ -\frac{1}{2\Omega}
[\mathbf{z} - \Omega\boldsymbol{\zeta}(t)]^T \mathbf{Q}^{-1}(t)
[\mathbf{z} - \Omega\boldsymbol{\zeta}(t)] \right\}, \tag{15}
$$
for $t > 0$, with mean vector $\Omega\boldsymbol{\zeta}(t)$ and
covariance matrix $\Omega\mathbf{Q}(t)$, where $\Omega$ is the cellular
volume and $|\mathbf{Q}|$ denotes the determinant of matrix
$\mathbf{Q}$. For completeness, we outline the mathematical steps that
lead to (15) in the Appendix.
Fig. 2. Typical dynamic evolutions of some hidden and some observable states (gray solid lines) of the transcriptional regulatory system depicted in
Fig. 1, their estimated means (black solid lines), and standard deviations about the means (dotted lines), obtained by Monte Carlo simulation of the
CME (3) based on the exact simulation method.
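The vector $\boldsymbol{\zeta}(t)$ satisfies the following system of
first-order ordinary differential equations:
$$
\frac{d\zeta_m(t)}{dt} = \tilde{\alpha}_m(\boldsymbol{\zeta}(t)),
\quad t > 0, \ m \in \mathcal{M}, \tag{16}
$$
where
$$
\tilde{\alpha}_m(\mathbf{z}) \triangleq \frac{1}{\Omega}\,\alpha_m(\Omega \mathbf{z}). \tag{17}
$$
Moreover, $\mathbf{Q}(t)$ satisfies the matrix Riccati differential
equation
$$
\frac{d\mathbf{Q}(t)}{dt} = \mathbf{B}(t)\mathbf{Q}(t) + \mathbf{Q}(t)\mathbf{B}^T(t) + \mathbf{C}(t),
\quad t > 0, \tag{18}
$$
where $\mathbf{B}(t)$ is an $M \times M$ matrix with elements
$b_{mm'}(t)$ and $\mathbf{C}(t)$ is an $M \times M$ diagonal matrix with
elements $c_m(t)$, given by
$$
b_{mm'}(t) = \frac{\partial \tilde{\alpha}_m(\boldsymbol{\zeta}(t))}{\partial z_{m'}}
\quad \text{and} \quad
c_m(t) = \tilde{\alpha}_m(\boldsymbol{\zeta}(t)). \tag{19}
$$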
Fig. 3. PMF estimates of monomer, dimer, and mRNA transcript
distributions in the transcriptional regulatory system depicted in
Fig. 1 at time $t = 10$ min obtained by: (a) exact simulation,
(b) Langevin approximation, (c) Poisson approximation, (d) mean-field
approximation, and (e) quasi-equilibrium approximation.
The system (16) and (18) can be solved numerically (e.g., by the
standard Euler method) to provide an approximation to the dynamic
evolutions of the DA means and covariances. This approach is
substantially faster than Monte Carlo simulations and can be used to
provide a rapid assessment of the statistical behavior of the HMM (8),
(9).
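A minimal sketch of such an Euler integration is given below for the simplest possible case, a single hypothetical decay reaction A -> 0 with propensity c (x0 - z), so that (16) and (18) become scalar equations. The reaction, the parameter values, and the zero initial conditions for the scaled mean and covariance are our assumptions. Because this propensity is linear in z, the result can be compared with the exact binomial mean and variance of the DA.

```python
import math


def lna_decay(x0, c, omega, t_end, dt):
    """Euler integration of the scalar (M = 1) form of (16) and (18).

    Hypothetical reaction A -> 0 with propensity c * (x0 - z).
    Returns the approximate mean and variance of the DA at time t_end.
    """
    zeta, q = 0.0, 0.0                    # assumed initial conditions
    for _ in range(int(round(t_end / dt))):
        a = c * (x0 / omega - zeta)       # scaled propensity, as in (17)
        b = -c                            # its derivative with respect to z
        # Scalar form of (18): dq/dt = 2*b*q + a, stepped together with (16).
        zeta, q = zeta + dt * a, q + dt * (2.0 * b * q + a)
    return omega * zeta, omega * q


mean, var = lna_decay(x0=10, c=1.0, omega=5.0, t_end=1.0, dt=1e-3)
p = 1.0 - math.exp(-1.0)                  # exact reaction probability at t = 1
print(mean, 10 * p)                       # Euler estimate vs exact mean
print(var, 10 * p * (1 - p))              # Euler estimate vs exact variance
```

For this linear example the answer does not depend on the chosen volume parameter, which only rescales the intermediate variables.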
For reasons we explain in the Appendix, we refer to the resulting
technique as the linear noise approximation (LNA) method. A detailed
description of how this method can be used in certain biochemical
systems can be found in [43], [44]. The LNA method conveniently
characterizes the HMM (8), (9) by the multivariate Gaussian PDF (15),
which is determined by the system of first-order differential equations
(16) and (18). In this case, and from (2), $X_n(t)$ is a linear
combination of Gaussian DAs. Therefore, the observable state
$\mathbf{X}(t)$ will follow a multivariate Gaussian distribution as
well.
The LNA method is based on (A.4), which is obtained by linearizing the
propensity function $\alpha_m(\mathbf{Z}(t))$ about the mean vector
$\boldsymbol{\mu}(t) = E[\mathbf{Z}(t)]$. However, when the derivatives
$\partial^2 \alpha_m(\boldsymbol{\mu}) / \partial z_{m'} \partial z_{m''}$
and the covariances $\rho_{m'm''}$ are not negligible, then (A.4) may
not hold. Another problem is that the LNA method is obtained from a
Langevin approximation in the limit as the system volume $\Omega$ tends
to infinity. However, $\Omega$ is a biological parameter that cannot be
artificially increased to improve the accuracy of the LNA method.
Finally, since the LNA method is obtained from the LA method by
additional approximations, it suffers from similar drawbacks. For these
reasons, we need more accurate and versatile approximation techniques.
In view of the approximation techniques discussed in Sections 4.3 and
4.4 below, we believe that there is no advantage in using the LNA
method.
4.3 Poisson Approximation
A better approximation of the HMM (8), (9) may be obtained
by employing a time step $dt$ that satisfies condition C1, but
may not necessarily satisfy condition C2. Since reactions
that occur during the time interval $[k\,dt, (k+1)\,dt)$ will not
appreciably change the values of the propensity functions,
given the DA values at time $k\,dt$, these reactions will
occur independently of each other. Moreover, the number
of occurrences of the $m$th reaction during $[k\,dt, (k+1)\,dt)$
will be a Poisson random variable with parameter
$\alpha_m(\mathbf{z}(k\,dt))\,dt$ [19]. In this case, (14) becomes
$$Z_m((k+1)\,dt) = Z_m(k\,dt) + \mathcal{P}_m\big(\alpha_m(\mathbf{Z}(k\,dt))\,dt\big), \quad k = 0, 1, \ldots, \; m \in \mathcal{M}, \tag{20}$$
initialized by $Z_m(0) = 0$, for every $m \in \mathcal{M}$. Given the
DA values at time $k\,dt$, $\mathcal{P}_m(\alpha_m(\mathbf{z}(k\,dt))\,dt)$, $m \in \mathcal{M}$, are
mutually independent Poisson random variables with
parameters $\alpha_m(\mathbf{z}(k\,dt))\,dt$, $m \in \mathcal{M}$, respectively. We refer to
the resulting approximation as the Poisson approximation
(PA) method.
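The recursion (20) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used to produce the figures: the helper names `sample_poisson` and `poisson_approximation`, the birth-death toy system, and its rate constants `C1` and `C2` are all hypothetical, and the propensities are written as functions of the populations obtained from the DAs through (2).

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw a Poisson variate by Knuth's multiplication method.

    Adequate for the small means lam = alpha * dt used in this sketch.
    """
    if lam <= 0.0:
        return 0
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def poisson_approximation(propensities, stoich, x0, dt, n_steps, seed=0):
    """Iterate Z_m((k+1)dt) = Z_m(k dt) + Poisson(alpha_m(Z(k dt)) dt), as in (20)."""
    rng = random.Random(seed)
    n_reactions = len(propensities)
    z = [0] * n_reactions                      # Z_m(0) = 0 for every reaction
    path = [tuple(z)]
    for _ in range(n_steps):
        # Populations from the DAs via (2): X_n = x_{0,n} + sum_m s_{nm} Z_m
        x = [x0[n] + sum(stoich[n][m] * z[m] for m in range(n_reactions))
             for n in range(len(x0))]
        # Propensities are frozen at their values at time k*dt; negative values
        # (possible after an overshoot) are clipped to zero in this sketch.
        rates = [max(alpha(x), 0.0) for alpha in propensities]
        z = [z[m] + sample_poisson(rates[m] * dt, rng) for m in range(n_reactions)]
        path.append(tuple(z))
    return path


# Hypothetical toy system: 0 -> A (rate C1) and A -> 0 (rate C2 * X_A).
C1, C2 = 10.0, 1.0
toy_propensities = [lambda x: C1, lambda x: C2 * x[0]]
toy_stoich = [[+1, -1]]                        # one species, two reactions
da_path = poisson_approximation(toy_propensities, toy_stoich, [0], dt=0.01, n_steps=500)
```

Because each DA only counts reaction occurrences, every component of `da_path` is nondecreasing; the populations follow from the DAs through (2).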
The PA method has been recently used to develop
computational improvements of the stochastic simulation
algorithm of Gillespie [30], [31], [32], [34], [35], [36]. Note
that, in the limit as $dt \to 0$, (20) implies that the hidden state
means and covariances of the approximating system satisfy
the same differential equations as the original system, given
by (5), (6). Therefore, the PA method always provides an
exact match of the first and second-order statistics of Z for a
sufficiently small time step dt. But, most importantly, it may
result in a better approximation than the LA method.
In Fig. 5, we depict typical realizations of the dynamic
evolutions of some hidden and some observable states of
the transcriptional regulatory system depicted in Fig. 1,
obtained by (20) and (2), with $dt = 0.05$ s. We also depict the
dynamic evolutions of the means and standard deviations
about the means, estimated by Monte Carlo simulation of
Fig. 4. Typical dynamic evolutions of some hidden and some observable states (gray solid lines) of the transcriptional regulatory system depicted in
Fig. 1, their estimated means (black solid lines), and standard deviations about the means (dotted lines), obtained by Monte Carlo simulation based
on the Langevin approximation method.
1,000 sample trajectories. Moreover, in Fig. 3c, we depict the
PMFs of the monomers, dimers, and mRNA transcripts at
time $t = 10$ min, estimated by the same Monte Carlo
simulation. Similarly to the LA method, it took about
12 sec of CPU time on average to sample the system states.
The PA method produces very good approximations of
the dynamic evolutions of the hidden and observable states,
accurately preserving the discrete nature of these states.
Moreover, the method results in very good Monte Carlo
estimates for the means, standard deviations, and PMFs, as
expected. Therefore, we believe that this method should be
preferred over the LA method.
4.4 Mean-Field Approximation
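Similarly to the LA method, the PA method does not allow
us to derive an expression for the joint PMF $P_{\mathbf{Z}}(\mathbf{z};t)$ of the
hidden states. However, we show in the Appendix that we
can approximately characterize the hidden states by a PMF
$\tilde{P}_{\mathbf{Z}}(\mathbf{z};t)$, given by
$$\tilde{P}_{\mathbf{Z}}(\mathbf{z};t) = \frac{1}{\Theta(t)} \exp\left\{ -\frac{1}{2} \left[\mathbf{z} - \tilde{\boldsymbol{\mu}}(t)\right]^T \tilde{\mathbf{R}}^{-1}(t) \left[\mathbf{z} - \tilde{\boldsymbol{\mu}}(t)\right] \right\}, \tag{21}$$
for $t > 0$, where
$$\Theta(t) \triangleq \sum_{\mathbf{z} \ge \mathbf{0}} \exp\left\{ -\frac{1}{2} \left[\mathbf{z} - \tilde{\boldsymbol{\mu}}(t)\right]^T \tilde{\mathbf{R}}^{-1}(t) \left[\mathbf{z} - \tilde{\boldsymbol{\mu}}(t)\right] \right\}. \tag{22}$$
In (21), (22), the elements of the mean vector $\tilde{\boldsymbol{\mu}}(t)$ and
covariance matrix $\tilde{\mathbf{R}}(t)$ satisfy the following first-order
differential equations:
$$\frac{d\tilde{\mu}_m(t)}{dt} = \alpha_m(\tilde{\boldsymbol{\mu}}(t)) + \frac{1}{2} \sum_{k,l=1}^{M} d_{2,mkl}\, \tilde{\rho}_{kl}(t), \quad m \in \mathcal{M}, \tag{23}$$
$$\frac{d\tilde{\rho}_{mm'}(t)}{dt} = \left[ \alpha_m(\tilde{\boldsymbol{\mu}}(t)) + \frac{1}{2} \sum_{k,l=1}^{M} d_{2,mkl}\, \tilde{\rho}_{kl}(t) \right] \delta_{mm'} + \sum_{k=1}^{M} \left[ d_{1,mk}(t)\, \tilde{\rho}_{m'k}(t) + d_{1,m'k}(t)\, \tilde{\rho}_{mk}(t) \right], \quad m, m' \in \mathcal{M}, \tag{24}$$
where
$$d_{1,mk}(t) \triangleq \frac{\partial \alpha_m(\tilde{\boldsymbol{\mu}}(t))}{\partial z_k} \quad \text{and} \quad d_{2,mkl} \triangleq \frac{\partial^2 \alpha_m(\tilde{\boldsymbol{\mu}}(t))}{\partial z_k\, \partial z_l}.$$
Note that $d_{2,mkl}$ does not depend on $t$. Moreover, $\tilde{P}_{\mathbf{Z}}(\mathbf{z};t)$ is a
normal Gibbs distribution at temperature $2/k_B$, with energy
function $[\mathbf{z} - \tilde{\boldsymbol{\mu}}(t)]^T \tilde{\mathbf{R}}^{-1}(t) [\mathbf{z} - \tilde{\boldsymbol{\mu}}(t)]$ and partition function
$\Theta(t)$, where $k_B$ is Boltzmann's constant.$^4$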
For reasons we explain in the Appendix, we refer to the
resulting technique as the mean-field approximation (MFA)
method. This method conveniently characterizes the stochas-
tic biochemical system by the dynamic evolution of the
normal Gibbs distribution (21), (22), which is determined by
the system of coupled first-order differential equations (23),
(24). From (2), we may approximate the PMF $P_{\mathbf{X}}(\mathbf{x};t)$ by a
Fig. 5. Typical dynamic evolutions of some hidden and some observable states (gray solid lines) of the transcriptional regulatory system depicted in
Fig. 1, their estimated means (black solid lines), and standard deviations about the means (dotted lines), obtained by Monte Carlo simulation based
on the Poisson approximation method.
4. The Gibbs distribution (21), (22) may be approximated by a sampled
Gaussian distribution, in which case $\Theta(t) = (2\pi)^{M/2} |\tilde{\mathbf{R}}(t)|^{1/2}$. However, the
accuracy of this approximation depends on the values of $\tilde{\boldsymbol{\mu}}$ and $\tilde{\mathbf{R}}$ and may
not always be acceptable. For example, in the univariate case, when $\tilde{\mu} = 100$
and $\tilde{\sigma} = 40$, the values of the sampled Gaussian distribution are 0.6 percent
smaller than the values of the Gibbs distribution, but, when $\tilde{\mu} = \tilde{\sigma} = 30$, this
error increases to 15.5 percent.
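normal Gibbs distribution $\tilde{P}_{\mathbf{X}}(\mathbf{x};t)$ with means and covariances given by
$$\tilde{\mathrm{E}}[X_n(t)] = x_{0,n} + \sum_{m \in \mathcal{M}} s_{nm}\, \tilde{\mu}_m(t),$$
$$\widetilde{\mathrm{Cov}}[X_n(t), X_{n'}(t)] = \sum_{m, m' \in \mathcal{M}} s_{nm}\, s_{n'm'}\, \tilde{\rho}_{mm'}(t),$$
for $n, n' \in \mathcal{N}$.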
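As an illustration of how (23), (24) may be integrated with Euler's method, the following Python sketch applies the recursion to a toy system. It is only a sketch under stated assumptions: the function name `mfa_euler`, the birth-death toy system (reactions 0 -> A and A -> 0), and the rate constants `C1` and `C2` are hypothetical and are not taken from the transcriptional regulatory system of Fig. 1. The callables return the propensities, their gradients $d_{1,mk}$, and their Hessians $d_{2,mkl}$ evaluated at the current mean vector.

```python
def mfa_euler(alpha, grad, hess, n_reactions, dt, n_steps):
    """Euler integration of the moment equations (23), (24), started from zero DAs."""
    M = n_reactions
    mu = [0.0] * M
    rho = [[0.0] * M for _ in range(M)]
    for _ in range(n_steps):
        a, d1, d2 = alpha(mu), grad(mu), hess(mu)
        # Bracketed term shared by (23) and (24)
        drift = [a[m] + 0.5 * sum(d2[m][k][l] * rho[k][l]
                                  for k in range(M) for l in range(M))
                 for m in range(M)]
        # Right-hand side of (24); the Kronecker delta keeps drift on the diagonal
        new_rho = [[rho[m][p] + dt * ((drift[m] if m == p else 0.0)
                                      + sum(d1[m][k] * rho[p][k] + d1[p][k] * rho[m][k]
                                            for k in range(M)))
                    for p in range(M)] for m in range(M)]
        mu = [mu[m] + dt * drift[m] for m in range(M)]
        rho = new_rho
    return mu, rho


# Hypothetical toy system with X = Z_1 - Z_2: alpha_1 = C1, alpha_2 = C2 * (z_1 - z_2).
C1, C2 = 10.0, 1.0


def toy_alpha(mu):
    return [C1, C2 * (mu[0] - mu[1])]


def toy_grad(mu):
    return [[0.0, 0.0], [C2, -C2]]


def toy_hess(mu):
    return [[[0.0, 0.0], [0.0, 0.0]] for _ in range(2)]


mu, rho = mfa_euler(toy_alpha, toy_grad, toy_hess, 2, dt=0.01, n_steps=2000)
mean_x = mu[0] - mu[1]                              # mean of X from (2)
var_x = rho[0][0] - 2.0 * rho[0][1] + rho[1][1]     # variance of X from the DA covariances
```

For this linear toy system the moment equations are closed, so the mean and variance of the population both approach `C1 / C2`, the value expected for a Poisson stationary distribution.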
In Fig. 6, we depict dynamic evolutions of the means and
standard deviations of some hidden and some observable
states of the transcriptional regulatory system depicted in
Fig. 1, approximated by the MFA method. The means $\tilde{\mu}_m(t)$,
$m \in \mathcal{M}$, and covariances $\tilde{\rho}_{mm'}(t)$, $m, m' \in \mathcal{M}$, are calculated
by using Euler's method, with $dt = 0.05$ s, to recursively
solve (23), (24). These quantities are superimposed on the
state realizations depicted in Fig. 5. In Fig. 3d, we depict the
marginal normal Gibbs approximations $\tilde{P}_{X_n}(x;t)$ of the
PMFs of the monomers, dimers, and mRNA transcripts at
time $t = 10$ min, where
$$\tilde{P}_{X_n}(x;t) = \frac{1}{\Theta_n(t)} \exp\left\{ -\frac{\left(x - \tilde{\mathrm{E}}[X_n(t)]\right)^2}{2\, \widetilde{\mathrm{Var}}[X_n(t)]} \right\}, \quad t > 0,$$
with
$$\Theta_n(t) \triangleq \sum_{x \ge 0} \exp\left\{ -\frac{\left(x - \tilde{\mathrm{E}}[X_n(t)]\right)^2}{2\, \widetilde{\mathrm{Var}}[X_n(t)]} \right\}.$$
It took about 16 seconds of CPU time to obtain the dynamic
evolution of the means and standard deviations depicted in
Fig. 6, which is about 750 times faster than the Monte Carlo
approach based on the PA method.
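The marginal normal Gibbs PMF and its normalizing sum are straightforward to evaluate numerically. The Python sketch below is an illustration only: the function names `normal_gibbs_pmf` and `gaussian_shortfall` and the truncation of the infinite sum at twelve standard deviations above the mean are assumptions made here. It also evaluates the relative amount by which a sampled Gaussian falls below the Gibbs values in the two univariate cases mentioned in footnote 4.

```python
import math


def normal_gibbs_pmf(mean, var, x_max):
    """Normal Gibbs PMF on x = 0, 1, ..., x_max and its partition function."""
    weights = [math.exp(-(x - mean) ** 2 / (2.0 * var)) for x in range(x_max + 1)]
    partition = sum(weights)
    return [w / partition for w in weights], partition


def gaussian_shortfall(mean, std):
    """Relative amount by which sampled Gaussian values fall below the Gibbs values."""
    x_max = int(mean + 12 * std)               # truncation point (assumption of this sketch)
    _, partition = normal_gibbs_pmf(mean, std ** 2, x_max)
    return 1.0 - partition / (math.sqrt(2.0 * math.pi) * std)


pmf, _ = normal_gibbs_pmf(30.0, 30.0 ** 2, 400)
shortfall_a = gaussian_shortfall(100.0, 40.0)  # case mean = 100, std = 40
shortfall_b = gaussian_shortfall(30.0, 30.0)   # case mean = std = 30
```

The shortfall grows as the mean approaches zero relative to the standard deviation, because a larger share of the Gaussian mass then lies on negative values that the Gibbs distribution excludes.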
By comparing the results depicted in Fig. 5 and Fig. 6 and,
more specifically, the evolution of the standard deviation
associated with the reaction $\mathrm{DNA{\cdot}D} + \mathrm{D} \to \mathrm{DNA{\cdot}2D}$, we
see that we may need to increase the accuracy of the MFA
method in certain cases. As we explain in the Appendix,
this may be accomplished by including higher-order ($\ge 3$)
moments in the differential equation (24). Such an inclusion
will, however, result in increasing the complexity of the
method.
Although the MFA method may produce results that are
not as accurate as the ones obtained by Monte Carlo
estimation, this method is very attractive since, similarly to
the LNA method, it may be used to provide a rapid
assessment of the statistical behavior of a biochemical
system. Moreover, this method is superior to the LNA
method for three main reasons: 1) It is based on the more
accurate Poisson approximation, 2) its approximation
accuracy does not depend on the cellular volume, and 3) it
does not require linearization of the underlying propensity
functions.
4.5 Stochastic Quasi-Equilibrium Approximation
Most often, reactions occur on vastly different time scales.
For example, the transcription and translation reactions
depicted in Fig. 1 are typically slow reactions, whereas
dimerization is a fast reaction. This means that transcription
and translation may occur infrequently, whereas
dimerization may occur numerous times between successive
occurrences of slow reactions. In such cases, the Gillespie
algorithm will spend most of its time simulating fast
reaction events. It may, however, be less important to know
the activity of fast reactions in detail since the system's
dynamic evolution may be mostly determined by the
activity of the slow reactions. This is illustrated by the
simulation results depicted in Fig. 2, which show that it
may not be important to know the detailed dynamic
Fig. 6. Typical dynamic evolutions of some hidden and some observable states (gray solid lines) of the transcriptional regulatory system depicted in
Fig. 1, their estimated means (black solid lines), and standard deviations about the means (dotted lines). The state evolutions have been obtained by
the Poisson approximation method, whereas the evolutions of the means and standard deviations have been obtained by the mean-field
approximation method.
evolution of the monomer and dimer states since the large
fluctuations underlying these states do not seem to affect
the remaining states. Therefore, we may be able to
approximate the CME (3) by one that involves only slow
reactions.
This idea has recently been explored by several
investigators, who have proposed a number of techniques
for eliminating fast reaction kinetics [20], [25], [45], [46]. The
techniques proposed in [20], [25] are based on the CME (3),
whereas the techniques proposed in [45], [46] are based on
the CME (7). Since our interest here focuses on the CME (3),
we briefly discuss the quasi-equilibrium approximation
technique proposed by us in [25].
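In the following, we assume that the first $M_0$ reactions of
a biochemical system are slow, whereas the remaining
$M - M_0$ reactions are fast. We set
$$\mathbf{Z}(t) = \begin{bmatrix} \mathbf{Z}_s(t) \\ \mathbf{Z}_f(t) \end{bmatrix}, \quad \mathbf{z} = \begin{bmatrix} \mathbf{z}_s \\ \mathbf{z}_f \end{bmatrix}, \quad \mathbf{e}_m = \begin{bmatrix} \bar{\mathbf{e}}_m \\ \mathbf{0} \end{bmatrix}, \; m \in \mathcal{M}_s, \quad \text{and} \quad \mathbf{e}_m = \begin{bmatrix} \mathbf{0} \\ \bar{\mathbf{e}}_m \end{bmatrix}, \; m \in \mathcal{M}_f,$$
where $\mathcal{M}_s \triangleq \{1, 2, \ldots, M_0\}$ and
$\mathcal{M}_f \triangleq \{M_0 + 1, M_0 + 2, \ldots, M\}$,
with $\mathbf{Z}_s(t)$, $\mathbf{z}_s$, and $\bar{\mathbf{e}}_m$, $m \in \mathcal{M}_s$, being $M_0$-dimensional vectors, and
$\mathbf{Z}_f(t)$, $\mathbf{z}_f$, and $\bar{\mathbf{e}}_m$, $m \in \mathcal{M}_f$, being $(M - M_0)$-dimensional vectors. From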
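(3), we can show that the marginal PMF $P_{\mathbf{Z}_s}(\mathbf{z}_s;t)$ of the
slow reactions satisfies the following CME [25]:
$$\frac{\partial P_{\mathbf{Z}_s}(\mathbf{z}_s;t)}{\partial t} = \sum_{m \in \mathcal{M}_s} \left\{ \alpha^{\star}_m(\mathbf{z}_s - \bar{\mathbf{e}}_m)\, P_{\mathbf{Z}_s}(\mathbf{z}_s - \bar{\mathbf{e}}_m;t) - \alpha^{\star}_m(\mathbf{z}_s)\, P_{\mathbf{Z}_s}(\mathbf{z}_s;t) \right\}, \tag{25}$$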
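where
$$\alpha^{\star}_m(\mathbf{z}_s) \triangleq \sum_{\mathbf{z}_f} \alpha_m(\mathbf{z}_s, \mathbf{z}_f)\, P(\mathbf{z}_f \mid \mathbf{z}_s;t), \quad m \in \mathcal{M}_s, \tag{26}$$
with $P(\mathbf{z}_f \mid \mathbf{z}_s;t)$ being the conditional probability of the
fast DAs at time $t$, given the state of the slow reactions at $t$.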
If we focus our interest on stochastic biochemical
systems for which the slow propensity functions depend
linearly on fast DAs (which is the case for the transcrip-
tional regulatory system depicted in Fig. 1), we can show
that [25]
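$$\alpha^{\star}_m(\mathbf{z}_s) = \alpha_m(\mathbf{z}_s, \boldsymbol{\mu}(\mathbf{z}_s;t)), \quad m \in \mathcal{M}_s, \tag{27}$$
where $\boldsymbol{\mu}(\mathbf{z}_s;t) \triangleq [\mu_{M_0+1}(\mathbf{z}_s;t) \;\; \mu_{M_0+2}(\mathbf{z}_s;t) \;\; \cdots \;\; \mu_M(\mathbf{z}_s;t)]^T$, with
$\mu_m(\mathbf{z}_s;t) \triangleq \mathrm{E}[Z_m(t) \mid \mathbf{Z}_s(t) = \mathbf{z}_s]$, $m \in \mathcal{M}_f$, being the mean
DA of the $m$th fast reaction at time $t$, given the state $\mathbf{z}_s$ of
the slow reactions at $t$. Equations (25), (26) show that we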
can approximate the biochemical system by one that is
comprised of only slow reactions, provided that we can
calculate the conditional mean DAs of the fast reactions,
given the hidden states of the slow reactions. In this case,
the fast reactions will exert their influence on the slow
reactions by means of their conditional mean DAs through
the propensity functions of the slow reactions.
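Given that $\mathbf{Z}_s(t) = \mathbf{z}_s$, the optimal mean-square estimate
$\hat{x}_n(t)$ of the observable system state $X_n(t)$ of the original
biochemical system will be given by (recall (2)):
$$\hat{x}_n(t) = \mathrm{E}[X_n(t) \mid \mathbf{Z}_s(t) = \mathbf{z}_s] = x_{0,n} + \sum_{m \in \mathcal{M}_s} s_{nm}\, z_m(t) + \sum_{m \in \mathcal{M}_f} s_{nm}\, \mu_m(\mathbf{z}_s;t), \tag{28}$$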
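for $n \in \mathcal{N}$. Therefore, the original biochemical system can be
approximated by one whose hidden state is governed by (25),
whereas its observable state is given by (28). This leads to the
following approximation of the state-space model (8), (9):
$$\mathbf{z}_s(t) \sim P_{\mathbf{Z}_s}(\mathbf{z}_s;t) \quad \text{(hidden state)}$$
$$\hat{\mathbf{x}}(t) = \hat{\mathbf{g}}(\mathbf{z}_s(t)) \quad \text{(observable state)},$$
where
$$\hat{g}_n(\mathbf{z}_s(t)) = x_{0,n} + \sum_{m \in \mathcal{M}_s} s_{nm}\, z_m(t) + \sum_{m \in \mathcal{M}_f} s_{nm}\, \mu_m(\mathbf{z}_s;t).$$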
From (25) and (26), we can show that the means and
covariances of the slow hidden states satisfy the same
differential equations as the original system, given by (5),
(6). Note also that, if the $n$th observable state is not affected
by a fast reaction, then (28) implies that $\hat{X}_n(t) = X_n(t)$.
For all other states, $\mathrm{E}[\hat{X}_n(t)] = \mathrm{E}[X_n(t)]$, for $t \ge 0$, but
$\mathrm{Cov}[\hat{X}_n(t), \hat{X}_{n'}(t)] \ne \mathrm{Cov}[X_n(t), X_{n'}(t)]$. Therefore, the observable
states of the approximating and original systems that
are not affected by any fast reaction are identical. The
mean values of the remaining states are also identical, but
their covariances may be different.
To calculate the conditional mean DAs of the fast
reactions required by (25), we have proposed in [25] a
quasi-equilibrium approach, based on a principle of
statistical microscopic reversibility, according to which the
fast reactions are assumed to rapidly reach a state of
equilibrium between consecutive occurrences of slow
reactions. We illustrate this approach by using the example
depicted in Fig. 1.
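Since dimerization is a fast reaction, we set $\mathbf{z}_s = [z_1 \; z_2 \; z_3 \; z_4 \; z_5 \; z_6 \; z_7 \; z_8]^T$
and $\mathbf{z}_f = [z_9 \; z_{10}]^T$. In this case, (11)
and (27) result in
$$\begin{aligned}
\alpha^{\star}_1(\mathbf{z}_s) &= c_1 (z_3 - z_4) \\
\alpha^{\star}_2(\mathbf{z}_s) &= c_2 \left[ 2 + z_1 - z_2 - 2\mu_9(\mathbf{z}_s;t) + 2\mu_{10}(\mathbf{z}_s;t) \right] \\
\alpha^{\star}_3(\mathbf{z}_s) &= c_3 (z_5 - z_6 - z_7 + z_8) \\
\alpha^{\star}_4(\mathbf{z}_s) &= c_4 (z_3 - z_4) \\
\alpha^{\star}_5(\mathbf{z}_s) &= c_5 \left[ 4 - z_5 + z_6 - z_7 + z_8 + \mu_9(\mathbf{z}_s;t) - \mu_{10}(\mathbf{z}_s;t) \right] (2 - z_5 + z_6) \\
\alpha^{\star}_6(\mathbf{z}_s) &= c_6 (z_5 - z_6 - z_7 + z_8) \\
\alpha^{\star}_7(\mathbf{z}_s) &= c_7 \left[ 4 - z_5 + z_6 - z_7 + z_8 + \mu_9(\mathbf{z}_s;t) - \mu_{10}(\mathbf{z}_s;t) \right] (z_5 - z_6 - z_7 + z_8) \\
\alpha^{\star}_8(\mathbf{z}_s) &= c_8 (z_7 - z_8).
\end{aligned} \tag{29}$$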
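Moreover,
$$\hat{X}_1(t) = 2 + Z_1(t) - Z_2(t) - 2\mu_9(\mathbf{Z}_s;t) + 2\mu_{10}(\mathbf{Z}_s;t) \tag{30}$$
$$\hat{X}_2(t) = 4 - Z_5(t) + Z_6(t) - Z_7(t) + Z_8(t) + \mu_9(\mathbf{Z}_s;t) - \mu_{10}(\mathbf{Z}_s;t), \tag{31}$$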
with $\hat{X}_n(t) = X_n(t)$, for $n = 3, 4, 5, 6$, where $X_n(t)$ is given
by (12). Since the 10th reaction is the reverse of the
9th reaction, it is expected that, between successive
occurrences of slow reactions, dimerization will rapidly
reach equilibrium. At equilibrium, given the state $\mathbf{z}_s$ of the
slow reactions, the probability that a reaction 9 will occur
within the time interval $[t, t + dt)$ will approximately equal
the probability that a reaction 10 will occur within the next
time interval $[t + dt, t + 2dt)$. Otherwise, dimerization may
eventually result in an unreasonably large number of
dimers (if the first probability is larger than the second)
or no dimers at all (if the second probability is larger
than the first), conditions that will be harmful to the cell.
This implies that $\alpha_9(\mathbf{z}_s, \mathbf{Z}_f - \bar{\mathbf{e}}_9) = \alpha_{10}(\mathbf{z}_s, \mathbf{Z}_f)$. However,
since the value of $\mathbf{Z}_f$ rapidly becomes large, a slight
change in $\mathbf{Z}_f$ will not affect the value of the propensity
function. Therefore, we can approximately assume that
$\alpha_9(\mathbf{z}_s, \mathbf{Z}_f) = \alpha_{10}(\mathbf{z}_s, \mathbf{Z}_f)$. This equality leads to [25]:
$$\mu_9(\mathbf{z}_s;t) - \mu_{10}(\mathbf{z}_s;t) = \frac{1}{2} \left[ a(\mathbf{z}_s) - \sqrt{a^2(\mathbf{z}_s) - 4\, b(\mathbf{z}_s)} \right], \tag{32}$$
between successive occurrences of slow reactions, where
$$a(\mathbf{z}_s) = 2 + z_1 - z_2 + \frac{c_{10}}{2c_9} - \frac{1}{2}, \tag{33}$$
$$b(\mathbf{z}_s) = \frac{1}{4} (2 + z_1 - z_2)(2 + z_1 - z_2 - 1) - \frac{c_{10}}{2c_9} (4 - z_5 + z_6 - z_7 + z_8). \tag{34}$$
We refer to the resulting approximation as the stochastic
quasi-equilibrium approximation (SQEA) method.
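Equations (29) and (32)-(34) translate directly into code. The Python sketch below is an illustration under stated assumptions: the function names `net_dimerization` and `slow_propensities` and all rate constant values are hypothetical, and the check at the end assumes that the propensities of reactions 9 and 10 have the usual combinatorial forms $c_9 X_1 (X_1 - 1)/2$ and $c_{10} X_2$, which is the form under which the quadratic with coefficients (33), (34) balances the two reactions.

```python
import math


def net_dimerization(z, c):
    """Conditional mean mu_9 - mu_10 from (32)-(34).

    z holds z_1..z_8 at indices 0..7; c holds c_1..c_10 at indices 0..9.
    """
    m0 = 2 + z[0] - z[1]                       # monomers if no net dimerization
    d0 = 4 - z[4] + z[5] - z[6] + z[7]         # dimers if no net dimerization
    kappa = c[9] / (2.0 * c[8])
    a = m0 + kappa - 0.5                       # (33)
    b = 0.25 * m0 * (m0 - 1) - kappa * d0      # (34)
    return 0.5 * (a - math.sqrt(a * a - 4.0 * b))   # (32)


def slow_propensities(z, c):
    """Effective propensities of the eight slow reactions, following (29)."""
    d = net_dimerization(z, c)
    mrna = z[2] - z[3]
    monomers = 2 + z[0] - z[1] - 2.0 * d
    dimers = 4 - z[4] + z[5] - z[6] + z[7] + d
    dna = 2 - z[4] + z[5]
    dna_d = z[4] - z[5] - z[6] + z[7]
    dna_2d = z[6] - z[7]
    return [c[0] * mrna, c[1] * monomers, c[2] * dna_d, c[3] * mrna,
            c[4] * dimers * dna, c[5] * dna_d, c[6] * dimers * dna_d, c[7] * dna_2d]


# Hypothetical rate constants c_1..c_10, chosen only to exercise the formulas.
rates = [0.04, 0.001, 0.08, 0.004, 0.02, 0.5, 0.01, 0.1, 0.05, 0.5]
z0 = [0] * 8                                   # no slow reaction has occurred yet
d_eq = net_dimerization(z0, rates)
monomers_eq = 2 - 2.0 * d_eq
dimers_eq = 4 + d_eq
alpha_9 = rates[8] * monomers_eq * (monomers_eq - 1) / 2.0   # assumed form of reaction 9
alpha_10 = rates[9] * dimers_eq                               # assumed form of reaction 10
props = slow_propensities(z0, rates)
```

A negative value of `d_eq` simply means that, at equilibrium, dimers have on average dissociated into monomers relative to the initial populations.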
In general, we may not be able to obtain a CME for the
molecular population process $\hat{\mathbf{X}}$ of a stochastic biochemical
system obtained by SQEA. The CME (7) is derived by
relating the molecular population process $\mathbf{X}(t + dt)$ at time
$t + dt$ to $\mathbf{X}(t)$ [18]. This is possible due to the linear
relationship between $\mathbf{X}(t)$ and $\mathbf{Z}(t)$, given by (2), which
implies that $\mathbf{X}(t + dt) = \mathbf{X}(t) + \mathbf{s}_m$, if the $m$th reaction
occurs during the time interval $[t, t + dt)$. However, we may
not be able to relate $\hat{\mathbf{X}}(t + dt)$ and $\hat{\mathbf{X}}(t)$ since $\hat{\mathbf{X}}(t)$ may be a
nonlinear function of the DA process $\mathbf{Z}(t)$, e.g., see (30)-(34).
In Fig. 7, we depict typical realizations, obtained by the
direct Gillespie algorithm, of the dynamic evolutions of some
hidden and some observable states of the transcriptional
regulatory system depicted in Fig. 1, approximated by the
CME (25) and (28), with slow propensity functions given by
(29) and (32)-(34). We also depict the dynamic evolutions of
their means and standard deviations about the means,
estimated by Monte Carlo simulation of 1,000 sample
trajectories. Moreover, in Fig. 3e, we depict the PMFs of the
monomers, dimers, and mRNA transcripts at time $t = 10$ min,
estimated by the same Monte Carlo simulation. On average, it
took about 0.5 sec of CPU time to sample the system states,
which is 120 times faster than the exact method and 24 times
faster than the PA method.
The results depicted in Fig. 7 show that the SQEA method
produces a relatively smooth approximation of the
dynamic evolution of the number of monomers and dimers.
This is expected since $\hat{X}_1(t)$ and $\hat{X}_2(t)$ are the conditional
expectations of $X_1(t)$ and $X_2(t)$, respectively, conditioned on
the state of the slow reactions. Note, however, that the use
of the SQEA method is based on the premise that we are not
interested in the detailed evolutions of the observable states
$X_1(t)$ and $X_2(t)$. On the other hand, the SQEA method
provides very good approximations of the remaining states
and rapidly produces good Monte Carlo estimates for the
means, standard deviations, and PMFs.
Fig. 7. Typical dynamic evolutions of some hidden and some observable states (gray solid lines) of the transcriptional regulatory system depicted in
Fig. 1, their estimated means (black solid lines), and standard deviations about the means (dotted lines), obtained by Monte Carlo simulation based
on the stochastic quasi-equilibrium approximation method.
5 CONCLUSION
In this paper, we introduced an HMM for transcriptional
regulation in single cells. The reaction DAs are used as the
hidden states of the model, whereas the molecular popula-
tions are used as the observable states. The dynamic
evolution of the hidden states is characterized by the
CME (3), whereas the observable states are directly
calculated from the hidden states by means of (2).
Unfortunately, analytical characterization of the pro-
posed HMM is not possible. A Monte Carlo simulation
approach, based on the Gillespie algorithm, can be used to
estimate various statistics. This approach is computation-
ally intensive and often not practical. Therefore, we are
forced to seek analytical and computationally tractable
approximations of the original HMM. We presented several
approximation techniques, including the LA, PA, and
SQEA methods. Moreover, we discussed the LNA and
MFA methods, which approximate the state probabilities by
multivariate normal distributions. We pointed out that the
LA and LNA methods should not be used unless the two
conditions required for their validity can be verified.
If we can separate the reactions of a transcriptional
regulatory system into slow and fast reactions, we may
use the SQEA method to simplify the system. This method
eliminates the fast reactions, provided that we are not
interested in such reactions. We can also use an HMM to
characterize the resulting slow reaction subsystem. We can
study its dynamical behavior by employing a Monte Carlo
simulation approach based on the Gillespie algorithm. If this
approach turns out to be computationally intensive, we may
use the MFA method for a rapid assessment and the PA
method for a more precise assessment of the statistical
behavior of the simplified system.
Although we have focused our discussion on modeling
transcriptional regulation in single cells, the techniques
presented in this paper are general enough to be applied to
other types of stochastic biochemical systems, such as
signaling and metabolic networks.
APPENDIX
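Linear Noise Approximation. The PDF $p_{\mathbf{Z}}(\mathbf{z};t)$ of the
hidden state vector $\mathbf{Z}(t)$, governed by the system of
Langevin equations (13), satisfies the following nonlinear
Fokker-Planck equation [17], [19]:
$$\frac{\partial p_{\mathbf{Z}}(\mathbf{z};t)}{\partial t} = \sum_{m \in \mathcal{M}} \left\{ -\frac{\partial \left[ \alpha_m(\mathbf{z})\, p_{\mathbf{Z}}(\mathbf{z};t) \right]}{\partial z_m} + \frac{1}{2} \frac{\partial^2 \left[ \alpha_m(\mathbf{z})\, p_{\mathbf{Z}}(\mathbf{z};t) \right]}{\partial z_m^2} \right\}, \tag{A.1}$$
for $t > 0$. Unfortunately, $p_{\mathbf{Z}}(\mathbf{z};t)$ cannot be determined since
finding a solution to this equation is as difficult as finding a
solution to the original CME (3). However, we can
approximate (A.1) by the linear Fokker-Planck equation
(A.7) below, whose solution leads to the Gaussian PDF
$\tilde{p}_{\mathbf{Z}}(\mathbf{z};t)$, given by (15).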
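Indeed, let $\boldsymbol{\Xi}(t)$ be a noise process with PDF $p_{\boldsymbol{\Xi}}(\boldsymbol{\xi};t)$, such
that
$$\mathbf{Z}(t) = \Omega\, \boldsymbol{\zeta}(t) + \sqrt{\Omega}\, \boldsymbol{\Xi}(t), \quad t > 0, \tag{A.2}$$
where $\boldsymbol{\zeta}(t)$ satisfies (16) and $\Omega$ is the cellular volume. A
Taylor series expansion of the propensity function $\alpha_m(\mathbf{Z}(t))$
about the mean vector $\boldsymbol{\mu}(t) = \mathrm{E}[\mathbf{Z}(t)]$ results in
$$\alpha_m(\mathbf{Z}(t)) = \alpha_m(\boldsymbol{\mu}(t)) + \left[\mathbf{Z}(t) - \boldsymbol{\mu}(t)\right]^T \frac{\partial \alpha_m(\boldsymbol{\mu}(t))}{\partial \mathbf{z}} + \frac{1}{2} \left[\mathbf{Z}(t) - \boldsymbol{\mu}(t)\right]^T \frac{\partial^2 \alpha_m(\boldsymbol{\mu}(t))}{\partial \mathbf{z}^2} \left[\mathbf{Z}(t) - \boldsymbol{\mu}(t)\right], \tag{A.3}$$
where $\partial g(\mathbf{z})/\partial \mathbf{z}$ and $\partial^2 g(\mathbf{z})/\partial \mathbf{z}^2$ denote the gradient and
Hessian of $g$, respectively. From (1), (2), and (4), note that
the derivatives of $\alpha_m(\mathbf{z})$ with respect to $\mathbf{z}$ of order greater
than 2 are zero. Therefore, (A.3) is exact. By taking
expectation on both sides of (A.3) and by assuming that
the third term on the right-hand side is negligible (thus, at
each time $t$, linearizing the propensity functions $\alpha_m(\mathbf{z})$,
$m \in \mathcal{M}$, about the mean vector $\boldsymbol{\mu}(t)$), we approximately
obtain
$$\mathrm{E}[\alpha_m(\mathbf{Z}(t))] = \alpha_m(\boldsymbol{\mu}(t)), \quad t > 0, \; m \in \mathcal{M}. \tag{A.4}$$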
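Then, (5), (16), (17), (A.2), and (A.4) imply that $\mathrm{E}[\boldsymbol{\Xi}(t)] = \mathbf{0}$,
$t > 0$. Note that
$$p_{\boldsymbol{\Xi}}(\boldsymbol{\xi};t) = p_{\mathbf{Z}}\big(\Omega\, \boldsymbol{\zeta}(t) + \sqrt{\Omega}\, \boldsymbol{\xi};t\big), \tag{A.5}$$
whereas
$$\frac{1}{\Omega}\, \alpha_m\big(\Omega\, \boldsymbol{\zeta}(t) + \sqrt{\Omega}\, \boldsymbol{\xi}\big) = \tilde{\alpha}_m\big(\boldsymbol{\zeta}(t) + \Omega^{-1/2} \boldsymbol{\xi}\big) = \tilde{\alpha}_m(\boldsymbol{\zeta}(t)) + \frac{1}{\sqrt{\Omega}}\, \boldsymbol{\xi}^T \frac{d\tilde{\alpha}_m(\boldsymbol{\zeta}(t))}{d\mathbf{z}} + O(1/\Omega). \tag{A.6}$$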
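The first equality in (A.6) is due to (17), whereas the second
equality is due to a Taylor series expansion of $\tilde{\alpha}_m(\mathbf{z})$ about
$\boldsymbol{\zeta}(t)$. By using (A.5) and (A.6), we can expand (A.1) in terms
of order $\Omega^{1/2}$, $\Omega^{0}$, $\Omega^{-1/2}$, etc. It turns out that the term of
order $\Omega^{1/2}$ is zero due to (16). Moreover, for a sufficiently
large volume $\Omega$, we may ignore all terms of order other than
$\Omega^{0}$, in which case, the PDF $p_{\boldsymbol{\Xi}}(\boldsymbol{\xi};t)$ will approximately
satisfy the following linear multivariate Fokker-Planck
equation:
$$\frac{\partial p_{\boldsymbol{\Xi}}(\boldsymbol{\xi};t)}{\partial t} = -\sum_{m \in \mathcal{M}} \sum_{m' \in \mathcal{M}} b_{mm'}(t)\, \frac{\partial \left[ \xi_{m'}\, p_{\boldsymbol{\Xi}}(\boldsymbol{\xi};t) \right]}{\partial \xi_m} + \frac{1}{2} \sum_{m \in \mathcal{M}} c_m(t)\, \frac{\partial^2 p_{\boldsymbol{\Xi}}(\boldsymbol{\xi};t)}{\partial \xi_m^2}, \quad t > 0, \tag{A.7}$$
where $b_{mm'}(t)$ and $c_m(t)$ are given by (19).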
It can be verified that the solution of the previous
equation is a multivariate Gaussian distribution with zero
mean and covariance matrix $\mathbf{Q}$ that satisfies the matrix
Riccati equation (18) [17]. Equation (A.5) implies that
$$p_{\mathbf{Z}}(\mathbf{z};t) = p_{\boldsymbol{\Xi}}\!\left( \frac{\mathbf{z} - \Omega\, \boldsymbol{\zeta}(t)}{\sqrt{\Omega}};t \right).$$
Therefore, the PDF $p_{\mathbf{Z}}(\mathbf{z};t)$ can be approximated by the
multivariate Gaussian PDF $\tilde{p}_{\mathbf{Z}}(\mathbf{z};t)$, given by (15). Because
this technique is based on linearizing, at each time $t$, the
propensity functions $\alpha_m(\mathbf{z})$, $m \in \mathcal{M}$, about the mean vector
$\boldsymbol{\mu}(t)$ and, since the DAs $\mathbf{Z}(t)$ are modeled by the mean-plus-noise
model in (A.2), it is referred to as the linear noise
approximation (LNA) method [17].
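Mean-field Approximation. To derive an expression for
the PMF $P_{\mathbf{Z}}(\mathbf{z};t)$, we may approximate the stochastic
biochemical system with one whose hidden state vector
$\tilde{\mathbf{Z}}(t)$ satisfies (compare with (20)):
$$\tilde{\mathbf{Z}}((k+1)\,dt) = \tilde{\mathbf{Z}}(k\,dt) + \boldsymbol{\Delta}(k;dt), \quad k = 0, 1, \ldots, \tag{A.8}$$
where $\boldsymbol{\Delta}(k;dt)$, $k = 0, 1, \ldots$, are mutually independent
discrete-valued random vectors, with $\boldsymbol{\Delta}(k;dt)$ being
independent of $\tilde{\mathbf{Z}}(k\,dt)$, for every $k = 1, 2, \ldots$. We may
then determine $\boldsymbol{\Delta}$ and the PMF $\tilde{P}_{\mathbf{Z}}(\mathbf{z};t)$ of $\tilde{\mathbf{Z}}(t)$ so that
the means $\tilde{\mu}_m(t) \triangleq \mathrm{E}[\tilde{Z}_m(t)]$, $m \in \mathcal{M}$, and covariances
$\tilde{\rho}_{mm'}(t) \triangleq \mathrm{Cov}[\tilde{Z}_m(t), \tilde{Z}_{m'}(t)]$, satisfy the same differential
equations as the actual system; in particular (recall (5)
and (6)),
$$\frac{d\tilde{\mu}_m(t)}{dt} = \mathrm{E}[\alpha_m(\tilde{\mathbf{Z}}(t))], \quad m \in \mathcal{M}, \tag{A.9}$$
$$\begin{aligned}
\frac{d\tilde{\rho}_{mm'}(t)}{dt} = {} & \mathrm{E}[\alpha_m(\tilde{\mathbf{Z}}(t))]\, \delta_{mm'} \\
& + \mathrm{E}[\tilde{Z}_m(t)\, \alpha_{m'}(\tilde{\mathbf{Z}}(t))] - \tilde{\mu}_m(t)\, \mathrm{E}[\alpha_{m'}(\tilde{\mathbf{Z}}(t))] \\
& + \mathrm{E}[\tilde{Z}_{m'}(t)\, \alpha_m(\tilde{\mathbf{Z}}(t))] - \tilde{\mu}_{m'}(t)\, \mathrm{E}[\alpha_m(\tilde{\mathbf{Z}}(t))], \quad m, m' \in \mathcal{M}.
\end{aligned} \tag{A.10}$$
Finally, we may set $P_{\mathbf{Z}}(\mathbf{z};t) = \tilde{P}_{\mathbf{Z}}(\mathbf{z};t)$.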
From (A.8), note that
$$\tilde{\mathbf{Z}}(k\,dt) = \sum_{l=0}^{k-1} \boldsymbol{\Delta}(l;dt), \quad k = 1, 2, \ldots.$$
Therefore, $\tilde{\mathbf{Z}}(k\,dt)$ is the sum of $k$ independent discrete-valued
random vectors. Under certain general conditions,
the central limit theorem implies that, for sufficiently large $k$,
$\tilde{\mathbf{Z}}(k\,dt)$ will be governed by a multivariate sampled
Gaussian distribution (e.g., see [23, p. 279]). By setting
$k\,dt \to t$, we obtain (21), (22).
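From a Taylor series expansion of the propensity
function $\alpha_m(\tilde{\mathbf{Z}}(t))$ about the mean vector $\tilde{\boldsymbol{\mu}}(t)$, we have that
(recall (A.3))
$$\mathrm{E}[\alpha_m(\tilde{\mathbf{Z}}(t))] = \alpha_m(\tilde{\boldsymbol{\mu}}(t)) + \frac{1}{2} \sum_{k,l=1}^{M} \frac{\partial^2 \alpha_m(\tilde{\boldsymbol{\mu}}(t))}{\partial z_k\, \partial z_l}\, \tilde{\rho}_{kl}(t), \tag{A.11}$$
$$\mathrm{E}[\tilde{Z}_{m'}(t)\, \alpha_m(\tilde{\mathbf{Z}}(t))] = \alpha_m(\tilde{\boldsymbol{\mu}}(t))\, \tilde{\mu}_{m'}(t) + \sum_{k=1}^{M} \frac{\partial \alpha_m(\tilde{\boldsymbol{\mu}}(t))}{\partial z_k}\, \tilde{\rho}_{m'k}(t) + \frac{\tilde{\mu}_{m'}(t)}{2} \sum_{k,l=1}^{M} \frac{\partial^2 \alpha_m(\tilde{\boldsymbol{\mu}}(t))}{\partial z_k\, \partial z_l}\, \tilde{\rho}_{kl}(t), \tag{A.12}$$
where we set the third-order central moments of $\tilde{\mathbf{Z}}(t)$ in
(A.12) equal to zero. Equations (A.11) and (A.12), together
with (A.9) and (A.10), result in (23), (24). Note that the
second-order derivatives of $\alpha_m(\mathbf{z})$ with respect to $\mathbf{z}$ will
either be 0 or constant. These derivatives are therefore
independent of $t$.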
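The exactness of (A.11) for propensities that are at most quadratic can be checked numerically in the univariate case. The Python sketch below is an illustration only: the quadratic propensity `alpha`, its constant `C`, and the discrete distribution `pmf` are arbitrary choices made here, not quantities from the paper.

```python
def expectation(f, pmf):
    """Expectation of f(Z) for a discrete distribution given as {value: probability}."""
    return sum(p * f(z) for z, p in pmf.items())


C = 0.3


def alpha(z):
    """Hypothetical quadratic propensity; its second derivative is the constant C."""
    return C * z * (z - 1) / 2.0


pmf = {0: 0.1, 1: 0.2, 3: 0.4, 6: 0.3}         # arbitrary discrete distribution
mean = expectation(lambda z: z, pmf)
var = expectation(lambda z: (z - mean) ** 2, pmf)
lhs = expectation(alpha, pmf)                  # E[alpha(Z)] computed directly
rhs = alpha(mean) + 0.5 * C * var              # univariate form of (A.11)
```

The two sides agree to rounding error for any distribution with finite variance, which is why no moment-closure assumption is needed for (A.11) itself; the zero third-moment assumption enters only through (A.12).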
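From (A.8), and since $\boldsymbol{\Delta}(k;dt)$ is independent of $\tilde{\mathbf{Z}}(k\,dt)$,
we have that
$$\mathrm{E}[\Delta_m(k;dt)] = \tilde{\mu}_m((k+1)\,dt) - \tilde{\mu}_m(k\,dt), \quad m \in \mathcal{M}, \tag{A.13}$$
$$\mathrm{Cov}[\Delta_m(k;dt), \Delta_{m'}(k;dt)] = \tilde{\rho}_{mm'}((k+1)\,dt) - \tilde{\rho}_{mm'}(k\,dt), \quad m, m' \in \mathcal{M}. \tag{A.14}$$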
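Because of (23), (24), (A.11), (A.13), and (A.14), we can set
$$\Delta_m(k;dt) = \mathcal{P}_m(\lambda_k) + W_m, \tag{A.15}$$
where $\mathcal{P}_m(\lambda_k)$ is a Poisson random variable with parameter
$\lambda_k = \mathrm{E}[\alpha_m(\tilde{\mathbf{Z}}(k\,dt))]\,dt$ and $W_m$, $m \in \mathcal{M}$, are zero-mean
random variables, with $W_m$ being independent of $\mathcal{P}_{m'}$,
$m' \in \mathcal{M}$, such that
$$\mathrm{Cov}[W_m, W_{m'}] = \sum_{k=1}^{M} \left( \frac{\partial \alpha_m(\tilde{\boldsymbol{\mu}}(t))}{\partial z_k}\, \tilde{\rho}_{m'k}(t) + \frac{\partial \alpha_{m'}(\tilde{\boldsymbol{\mu}}(t))}{\partial z_k}\, \tilde{\rho}_{mk}(t) \right) dt. \tag{A.16}$$
In this case, the number of occurrences of the $m$th reaction
during a time interval $[k\,dt, (k+1)\,dt)$ follows a Poisson
distribution with parameter $\mathrm{E}[\alpha_m(\tilde{\mathbf{Z}}(k\,dt))]\,dt$, instead of
$\alpha_m(\tilde{\mathbf{Z}}(k\,dt))\,dt$. To compensate for errors introduced by this
approximation, we add a zero-mean correction term $W_m$ to
the Poisson random variable $\mathcal{P}_m$. The covariances of $W_m$ are
chosen so that (A.8) and (A.15) imply (A.9), (A.10).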
In other words, we here assume that the most important
influence on the firing rate of a given reaction is exerted by
the mean propensity function of that reaction through a
Poisson process. The additive correction term compensates
for statistical variations not accounted for by the Poisson
process. This approach is a type of mean-field approximation,
similar to the one used in statistical mechanics [47], [48],
which employs a linear correction term to compensate for
errors in the approximation. For this reason, we refer to this
technique as the mean-field approximation (MFA) method.
Note that the MFA method discussed above is based on
the assumption that the third-order central moments of $\tilde{\mathbf{Z}}(t)$
are zero; otherwise, the right-hand side of (A.12) must
include a fourth term, which is a function of those moments.
This assumption allows calculation of the dynamic evolutions
of the means and covariances of $\tilde{\mathbf{Z}}$ by means of the
system of differential equations (A.9), (A.10). For a more
accurate approximation, we may include higher-order ($\ge 3$)
central moments in the formulation. In this case, $\Delta_m(k;dt)$
will still be given by (A.15), but the covariances of $W_m$ will
be given by a more complicated formula than (A.16), in
terms of higher-order central moments of $\tilde{\mathbf{Z}}$. These moments
can be calculated by differential equations that are
similar to, albeit more complex than, (A.9) and (A.10).
REFERENCES
[1] M.V. Karamouzis, V.G. Gorgoulis, and A.G. Papavassiliou, "Transcription Factors and Neoplasia: Vistas in Novel Drug Design," Clinical Cancer Research, vol. 8, pp. 949-961, 2002.
[2] L. Hood, J.R. Heath, M.E. Phelps, and B. Lin, "Systems Biology and New Technologies Enable Predictive and Preventive Medicine," Science, vol. 306, pp. 640-643, 2004.
[3] M. Ptashne and A. Gann, Genes & Signals. Cold Spring Harbor, N.Y.: Cold Spring Harbor Laboratory Press, 2002.
[4] H. Lodish, A. Berk, P. Matsudaira, C.A. Kaiser, M. Krieger, M.P. Scott, S.L. Zipursky, and J. Darnell, Molecular Cell Biology, fifth ed. New York: W.H. Freeman and Company, 2003.
[5] D. Endy and R. Brent, "Modelling Cellular Behaviour," Nature, vol. 409, pp. 391-395, 2001.
[6] H. Kitano, Computational Systems Biology, Nature, vol. 420,
pp. 206-210, 2002.
[7] J. Goutsias and S. Kim, A Nonlinear Discrete Dynamical Model
for Transcriptional Regulation: Construction and Properties,
Biophysical J., vol. 86, pp. 1922-1945, 2004.
[8] C.V. Rao, D.M. Wolf, and A.P. Arkin, Control, Exploitation and
Tolerance of Intracelluar Noise, Nature, vol. 420, pp. 231-237,
2002.
[9] J.M. Raser and E.K. OShea, Control of Stochasticity in Eukaryotic
Gene Expression, Science, vol. 304, pp. 1811-1814, 2004.
[10] J.M. Levsky and R.H. Singer, Gene Expression and the Myth of
the Average Cell, Trends in Cell Biology, vol. 13, no. 1, pp. 4-6,
2003.
[11] H.H. McAdams and A. Arkin, Stochastic Mechanisms in Gene
Expression, Proc. Natl Academy of Science, vol. 94, pp. 814-819,
1997.
[12] H.H. McAdams and A. Arkin, Its a Noisy Business! Genetic
Regulation at the Nanomolar Scale, Trends in Genetics, vol. 15,
no. 2, pp. 65-69, 1999.
[13] N. Fedoroff and W. Fontana, Small Numbers of Big Molecules,
Science, vol. 297, pp. 1129-1131, 2002.
[14] M.B. Elowitz, A.J. Levine, E.D. Siggia, and P.S. Swain, Stochastic
Gene Expression in a Single Cell, Science, vol. 297, pp. 1183-1186,
2002.
[15] W.J. Ewens and G.R. Grant, Statistical Methods in Bioinformatics: An
Introduction. New York: Springer, 2001.
[16] M. Ptashne, A Genetic Switch: Phage Lambda Revisited, third ed.
Cold Spring Harbor, N.Y.: Cold Spring Harbor Laboratory Press,
2004.
[17] N.G. vanKampen, Stochastic Processes in Physics and Chemistry.
Amsterdam: Elsevier, 1992.
[18] D.T. Gillespie, A Rigorous Derivation of the Chemical Master
Equation, Physica A, vol. 188, pp. 404-425, 1992.
[19] D.T. Gillespie, The Chemical Langevin Equation, J. Chemical
Physics, vol. 113, no. 1, pp. 297-306, 2000.
[20] E.L. Haseltine and J.B. Rawlings, Approximate Simulation of
Coupled Fast and Slow Reactions for Stochastic Chemical
Kinetics, J. Chemical Physics, vol. 117, no. 15, pp. 6959-6969, 2002.
[21] S. Karlin and H.M. Taylor, A First Course in Stochastic Processes,
second ed. San Diego, Calif.: Academic Press, 1975.
[22] S. Karlin and H.M. Taylor, A Second Course in Stochastic Processes.
San Diego, Calif.: Academic Press, 1981.
[23] A. Papoulis and S.U. Pillai, Probability, Random Variables and
Stochastic Processes, fourth ed. New York: McGraw-Hill, 2002.
[24] R. Heinrich and S. Schuster, The Regulation of Cellular Systems.
New York: Chapman & Hall, 1996.
[25] J. Goutsias, Quasiequilibrium Approximation of Fast Reaction
Kinetics in Stochastic Biochemical Systems, J. Chemical Physics,
vol. 122, 184102, 2005.
[26] D.T. Gillespie, Exact Stochastic Simulation of Coupled Chemical
Reactions, J. Physical Chemistry, vol. 81, no. 25, pp. 2340-2361,
1977.
[27] J.S. Liu, Monte Carlo Strategies in Scientific Computing. New York:
Springer-Verlag, 2001.
[28] D.T. Gillespie, General Method for Numerically Simulating
Stochastic Time Evolution of Coupled-Chemical Reactions,
J. Computational Physics, vol. 22, pp. 403-434, 1976.
[29] M.A. Gibson and J. Bruck, Efficient Exact Stochastic Simulation
of Chemical Systems with Many Species and Many Channels,
J. Physical Chemistry A, vol. 104, pp. 1876-1889, 2000.
[30] D.T. Gillespie, Approximate Accelerated Stochastic Simulation of
Chemically Reacting Systems, J. Chemical Physics, vol. 115, no. 4,
pp. 1716-1733, 2001.
[31] D.T. Gillespie and L.R. Petzold, Improved Leap-Size Selection for
Accelerated Stochastic Simulation, J. Chemical Physics, vol. 119,
no. 16, pp. 8229-8234, 2003.
[32] M. Rathinam, L.R. Petzold, Y. Cao, and D.T. Gillespie, Stiffness in
Stochastic Chemically Reacting Systems: The Implicit Tau-Leap-
ing Method, J. Chemical Physics, vol. 119, no. 24, pp. 12784-12794,
2003.
[33] Y. Cao, H. Li, and L. Petzold, Efficient Formulation of the
Stochastic Simulation Algorithm for Chemically Reacting Sys-
tems, J. Chemical Physics, vol. 121, no. 9, pp. 4059-4067, 2004.
[34] T. Tian and K. Burrage, Bionomial Leap Methods for Simulating
Stochastic Chemical Kinetics, J. Chemical Physics, vol. 121, no. 21,
pp. 10356-10364, 2004.
[35] A. Chatterjee, D.G. Vlachos, and M.A. Katsoulakis, Binomial
Distribution Based τ-Leap Accelerated Stochastic Simulation,
J. Chemical Physics, vol. 122, 024112, 2005.
[36] Y. Cao, D.T. Gillespie, and L.R. Petzold, Avoiding Negative
Populations in Explicit Poisson Tau-Leaping, J. Chemical Physics,
vol. 123, 054104, 2005.
[37] D.J. Higham, An Algorithmic Introduction to Numerical Simulation
of Stochastic Differential Equations, SIAM Rev., vol. 43, no. 3,
pp. 525-546, 2001.
[38] J. Hasty, J. Pradines, M. Dolnik, and J.J. Collins, Noise-Based
Switches and Amplifiers for Gene Expression, Proc. Nat'l
Academy of Sciences, vol. 97, no. 5, pp. 2075-2080, 2000.
[39] E.M. Ozbudak, M. Thattai, I. Kurtser, A.D. Grossman, and A. van
Oudenaarden, Regulation of Noise in the Expression of a Single
Gene, Nature Genetics, vol. 31, pp. 69-73, 2002.
[40] J.R. Pirone and T.C. Elston, Fluctuations in Transcription Factor
Binding Can Explain the Graded and Binary Responses Observed
in Inducible Gene Expression, J. Theoretical Biology, vol. 226,
pp. 111-121, 2004.
[41] R. Steuer, Effects of Stochasticity in Models of the Cell Cycle:
From Quantized Cycle Times to Noise-Induced Oscillations,
J. Theoretical Biology, vol. 228, pp. 293-301, 2004.
[42] M.L. Simpson, C.D. Cox, and G.S. Sayler, Frequency Domain
Chemical Langevin Analysis of Stochasticity in Gene Transcriptional
Regulation, J. Theoretical Biology, vol. 229, pp. 383-394, 2004.
[43] J. Elf and M. Ehrenberg, Fast Evaluation of Fluctuations in
Biochemical Networks with the Linear Noise Approximation,
Genome Research, vol. 13, pp. 2475-2484, 2003.
[44] R. Tomioka, H. Kimura, T.J. Kobayashi, and K. Aihara, Multivariate
Analysis of Noise in Genetic Regulatory Networks,
J. Theoretical Biology, vol. 229, pp. 501-521, 2004.
[45] C.V. Rao and A.P. Arkin, Stochastic Chemical Kinetics and the
Quasi-Steady-State Assumption: Application to the Gillespie
Algorithm, J. Chemical Physics, vol. 118, no. 11, pp. 4999-5010,
2003.
[46] Y. Cao, D.T. Gillespie, and L.R. Petzold, The Slow-Scale
Stochastic Simulation Algorithm, J. Chemical Physics, vol. 122,
014116, 2005.
[47] G. Parisi, Statistical Field Theory. Redwood City, Calif.: Addison-
Wesley, 1988.
[48] J.J. Binney, N.J. Dowrick, A.J. Fisher, and M.E.J. Newman, The
Theory of Critical Phenomena: An Introduction to the Renormalization
Group. Oxford, U.K.: Oxford Univ. Press, 1992.
John Goutsias received the Diploma degree in electrical engineering
from the National Technical University of Athens, Greece, in 1981, and
the MS and PhD degrees in electrical engineering from the University of
Southern California, Los Angeles, in 1982 and 1986, respectively. In
1986, he joined the Department of Electrical and Computer Engineering
at The Johns Hopkins University, Baltimore, Maryland, where he is
currently a professor of electrical and computer engineering, a
Whitaker Biomedical Engineering Professor, and a professor of applied
mathematics and statistics. His research interests include signal
processing and analysis, computational biology, and bioinformatics.
Dr. Goutsias served as an associate editor for the IEEE Transactions on
Signal Processing (1991-1993) and the IEEE Transactions on Image
Processing (1995-1997). He is currently an area editor for the Journal
of Visual Communication and Image Representation, a coeditor for the
Journal of Mathematical Imaging and Vision, and an associate editor for
the IEEE Transactions on Pattern Analysis and Machine Intelligence and
the EURASIP Journal on Bioinformatics and Systems Biology. He is a
senior member of the IEEE.