Sie sind auf Seite 1von 17

Inside-Outside and Forward-Backward Algorithms Are Just Backprop

(Tutorial Paper)
Jason Eisner
Department of Computer Science
Johns Hopkins University
jason@cs.jhu.edu

Abstract in neural networks. Thus, it now seems useful to


A probabilistic or weighted grammar implies call attention to its role in some of NLP’s core algo-
a posterior probability distribution over possi- rithms for structured prediction.
ble parses of a given input sentence. One often
needs to extract information from this distri- 1.1 Why the connection matters
bution, by computing the expected counts (in The connection is fundamental. However, in the
the unknown parse) of various grammar rules, present author’s experience, it is not as widely
constituents, transitions, or states. This re- known as it should be, even among experienced re-
quires an algorithm such as inside-outside or
searchers in this area. Other pedagogical presenta-
forward-backward that is tailored to the gram-
mar formalism. Conveniently, each such al- tions treat the inside-outside algorithm as if it were
gorithm can be obtained by automatically dif- sui generis within NLP, deriving it “directly” as
ferentiating an “inside” algorithm that merely a challenging dynamic programming method that
computes the log-probability of the evidence sums over exponentially many parses. That treat-
(the sentence). This mechanical procedure ment follows the original papers (Baker, 1979; Je-
produces correct and efficient code. As for linek, 1985; see Lari and Young, 1991 for history).
any other instance of back-propagation, it can
While certainly valuable, it ignores the point that the
be carried out manually or by software. This
pedagogical paper carefully spells out the con-
algorithm is working with a log-linear (exponential-
struction and relates it to traditional and non- family) distribution. All such distributions share the
traditional views of these algorithms. property that a certain gradient is a vector of ex-
pected feature counts. The inside-outside algorithm
can be viewed as following a standard recipe—
1 Introduction
back-propagation—for computing this gradient.
The inside-outside algorithm (Baker, 1979) is a That insight is practically useful when deriv-
core method in natural language processing. Given ing new algorithms. The original inside algorithm
a sentence, it computes the expected count of each applies to probabilistic context-free grammars in
possible grammatical substructure at each position Chomsky Normal Form. However, other inside al-
in the sentence. Such expected counts are commonly gorithms are frequently constructed for other pars-
used (1) to train grammar weights from data, (2) to ing strategies or other grammar formalisms (see sec-
select low-risk parses, and (3) as soft features that tion 8 for examples). It is very handy that these
characterize sentence positions for other NLP tasks. can be algorithmically differentiated to obtain the
The algorithm can be derived directly but is gen- corresponding inside-outside algorithms. The core
erally perceived as tricky. This paper explains how it of this paper (section 5) demonstrates by example
can be obtained simply and automatically by back- how to do this manually, by working through the
propagation—more precisely, by differentiating the derivation of standard inside-outside. Alternatively,
inside algorithm. In the same way, the forward- one can implement one’s new inside algorithm using
backward algorithm (Baum, 1972) can be gotten by a software framework that supports automatic dif-
differentiating the backward algorithm. ferentiation in a general-purpose programming lan-
Back-propagation is now widely known in the guage (see www.autodiff.org) or a neural net-
natural language processing and machine learning work (e.g., Bergstra et al., 2010)—hopefully without
communities, thanks to the recent surge of interest too much overhead. Then the rest comes for free.
Note that we use the name “inside-outside” (or inside and outside computations. Klein and Man-
“forward-backward”) to denote just an algorithm ning (2001) regard the resulting proof forests—
that computes certain expected counts. Such an traditionally called parse forests—as weighted hy-
algorithm runs an inside pass and then an outside pergraphs. Li and Eisner (2009) show how to com-
pass, and then combines their results. The resulting pute various expectations and gradients over such
counts are broadly useful, as we noted at the start hypergraphs, by techniques including the inside-
of the paper (see section 4 for details). Thus, we outside algorithm, and clarify the “wonderful” con-
use “inside-outside” narrowly to mean computing nection between expected counts and gradients. Eis-
these counts. We do not use it to refer to the larger ner et al. (2005, section 5) observe without de-
method that computes the counts repeatedly in order tails that for real-weighted proof systems, the ex-
to iteratively reestimate grammar parameters: we pected counts of the axioms (in our setting, gram-
call that method by its generic name, Expectation- mar rules) can be obtained by applying back-
Maximization (section 4). propagation. They detail two ways to apply back-
propagation, noting inside-outside as an example.
1.2 Contents of the paper Graphical models are like context-free grammars
After concisely stating the formal setting (section 2) in that they also specify log-linear distributions over
and the inside algorithm (section 3), we discuss the structures.1 Darwiche (2003) shows how to com-
expected counts, their uses, and their relation to the pute marginal posteriors (i.e., expected counts) in a
gradient (section 4). Finally, we show how to differ- graphical model by the same technique given here.
entiate the inside algorithm to obtain the new algo-
rithm that computes this gradient (section 5). 2 Definitions and Notation
For readers who are using this paper to learn the
algorithms, section 6 gives interpretations of the β Assume a given alphabet Σ of terminal symbols
and α quantities that arise and relates them to the and a disjoint finite alphabet N of nonterminal
traditional dynamic programming presentation. symbols that includes the special symbol ROOT.
As a bonus, section 7 then offers a supplementary A derivation T is a rooted, ordered tree whose
perspective. Here the inside algorithm is presented leaves are labeled with elements of Σ and whose in-
as normalizing any weighted parse forest and, fur- ternal nodes are labeled with elements of N . We
ther, converting it into a PCFG that can be sampled say that the internal node t uses the production rule
from. The inside-outside algorithm is then explained A → σ if A ∈ N is the label of t and σ ∈ (Σ ∪ N )∗
as computing this sampler’s probabilities of hitting is the sequence of labels of its children (in order).
various anchored constituents and rules. Similarly, We denote this rule by Tt .
the backward algorithm can be regarded as normal- In this paper, we focus mainly on derivations in
izing a weighted “trellis” graph and, further, con- Chomsky Normal Form (CNF)—those for which
verting it into a non-stationary Markov model; the each rule Tt has the form A → B C or A → w for
forward-backward algorithm computes hitting prob- some A, B, C ∈ N and w ∈ Σ. We write R for
abilities in this model. the set of all possible rules of these forms, and R[A]
Section 8 discusses other settings where the same for the subset with A to the left of the arrow. How-
approach can be applied, starting with the forward- ever, the following definitions generalize naturally
backward algorithm for Hidden Markov Models. to other choices of R.
Two appendices work through some additional vari- A weighted context-free grammar (WCFG) in
ants of the algorithms. Chomsky Normal Form is a function G : R → R≥0 .
Thus, G assigns a weight to each CNF rule. We ex-
1.3 Related work tend it to assign a weight to each CNF derivation, by
Other papers have also provided significant insight 1
Indeed, the two formalisms can be unified under a broader
into this subject. In particular, Goodman (1998, formalism such as case-factor diagrams (McAllester et al.,
1999) unifies most parsing algorithms as semiring- 2004) or probabilistic programming (Sato, 1995; Sato and
weighted theorem proving, with discussion of both Kameya, 2008).
Q
defining G(T ) = t∈T G(Tt ), where t ranges over Algorithm 1 The inside algorithm
the internal nodes of T . 1: function I NSIDE(G, w)
A probabilistic context-free grammar
P (PCFG) 2: initialize all β[· · · ] to 0
is a WCFG G in which (∀A ∈ N ) R∈R[A] G(R) = 3: for k := 1 to n : . width-1 constituents
1. In this case, G(T ) is a probability measure over 4: for A ∈ N :
all derivations.2 5: β[Akk−1 ] += G(A → wk )
The CNF derivation T is called a parse of w ∈ 6: for width := 2 to n : . wider constituents
Σ∗ if ROOT is the label of its root and w is its fringe, 7: for i := 0 to n − width : . start point
i.e., the sequence of labels of its leaves. We refer to 8: k := i + width . end point
w as a sentence and denote its length by n; T (w) 9: for j := i + 1 to k − 1 : . midpoint
denotes the set of all parses of w. 10: for A, B, C ∈ N :
The triple hA, i, ki is mnemonically written as Aki 11: β[Aki ] += G(A → B C)β[Bij ]β[Cjk ]
and pronounced as “A from i to k.” We say that a
parse of w uses the anchored nonterminal Aki or 12: return Z := β[ROOTn0 ]
the anchored rule Aki → Bij Cjk or Akk−1 → wk if
it contains the respective configurations
polynomial-time dynamic programming algorithm,
A which interleaves sums and products.
Along the way, the inside algorithm computes
A B C A useful intermediate quantities. Each inner weight
β[Aki ] is the total weight of all derivations with root
wi+1 . . . wk wi+1 . . . wj wj+1 . . . wk wk A and fringe wi+1 wi+2 . . . wk . This implies the cor-
rectness of the return value, and is rather easy to es-
3 The Inside Algorithm tablish by induction on the width k − i.
Note that if a parse contains any 0-weight rules,
The inside algorithm (Algorithm 1) returns the to-
then that parse also has weight 0 and so does not
tal weight Z of all parses of sentence w accord-
contribute to Z. In effect, such rules and parses are
ing to a WCFG G. It is the natural extension to
excluded by G. Such rules can in fact be skipped
weighted CFGs of the CKY algorithm, a recognition
at lines 5 and 11, where they clearly have no ef-
algorithm for unweighted CFGs in Chomsky Nor-
fect. This further reduces runtime from O(n3 |N |3 )
mal Form (Kasami, 1965; Younger, 1967).
to O(n3 |G|), where |G| denotes the number of rules
The importance of Z is that a probability distribu-
of nonzero weight.
tion over the parses T ∈ T (w) is given by
def
p(T | w) = G(T )/Z (1) 4 Expected Counts and Derivatives

When G is a PCFG representing a prior distribution 4.1 The goal of inside-outside


on parses T , (1) is its posterior after observing the The inside-outside algorithm aims to extract useful
fringe w. When G is a WCFG, (1) directly defines a information from the distribution (1). Given a sen-
conditional distribution on parses. tence w, it computes the expected count of each
Z is a sum of exponentially many products, since rule R ∈ R in a random parse T drawn from that
|T (w)| is exponential in n = |w|. Fortunately, distribution:
many of the sub-products are shared across multiple X X 
def
summands, and can be factored out using the dis- c(R) = p(T | w) δ(Tt = R) (2)
tributive property. This strategy leads to the above T t∈T
2
For this statement to hold even for “non-tight” PCFGs (Chi, For example, when G is a PCFG, c(A → B C) is
1999), we must consider the uncountable space of all finite and
infinite derivations. That requires equipping this space with an
the posterior expectation of the number of times that
appropriate σ-algebra and defining the measure G more pre- A expanded as B C while generating the sentence
cisely. w. Why are these expected counts useful? As Baker
(1979) saw, summing them over all observed sen- A CNF parse never uses a rule or constituent more
tences constitutes the E step within the Expectation- than once at a given position, so an anchored ex-
Maximization (EM) method (Dempster et al., 1977). pected count is always in [0, 1]. In fact, it is the
EM adjusts the rule probabilities G to locally maxi- probability that a random parse uses this anchored
mize likelihood (i.e., the probability of the observed rule or anchored constituent.
sentences under G). These anchored probabilities are independently
useful. They can be used in subsequent NLP tasks
4.2 The log-linear view as soft features that characterize each portion of
Of course, another way to locally maximize likeli- the sentence by its likely syntactic behavior. If
hood is to follow the gradient of log-likelihood. It c(Aki ) = 0.9, then (according to G) the substring
has often been pointed out that EM is related to gra- wi+1P . . . wk is probably a constituent of type A. If
dient ascent (e.g., Salakhutdinov et al., 2003; Berg- also j c(Aki → Bij Cjk ) = 0.75, this A probably
Kirkpatrick et al., 2010). splits into subconstituents B and C.
We now observe that (1) is an exponential-family Even for the parsing task itself, the anchored
model, implying a close relationship between ex- probabilities are useful for decoding—that is, se-
pected counts and this gradient. lecting a single “best” parse tree T̂ . If the system
When we re-express the distribution (1) in the will be rewarded for finding correct constituents,
standard log-linear form, we see that its natural pa- the expected reward of T̂ is the sum of the an-
def
rameters are given by θR = log G(R) for R ∈ R: chored probabilities of the anchored constituents in-
cluded in T̂ . The T̂ that maximizes this sum4 can
p(T | w) = G(T )/Z be selected by a Viterbi-style algorithm, once all the
1 Y 1 X
anchored probabilities have been computed (Good-
= G(Tt ) = exp θTt
Z Z man, 1996; Matsuzaki et al., 2005).
t∈T t∈T
1 X
= exp θR · fR (T ) (3) 5 Deriving the Inside-Outside Algorithm
Z
R∈R
5.1 Back-propagation
Here each fR is a feature function: fR (T ) ∈ N
counts the occurrences of rule R in parse T . Section 4.2 showed that the expected counts can be
By standard properties of log-linear models, obtained as the partial derivatives of log Z. How-
∂(log Z)/∂θR equals the expectation of fR (T ) un- ever, we will start by obtaining the partial derivatives
der distribution (3). But the latter is precisely the of Z. This will lead to a more standard presentation
expected count c(R) that we desire. Thus, expected of the inside-outside algorithm, exposing quantities
counts can be obtained as the gradient of log Z. We such as the outer weights α that are both intuitive
will show in section 5 that this is precisely how the and useful.
inside-outside algorithm operates. The inside algorithm can be regarded as evalu-
ating an arithmetic circuit that has many inputs
4.3 Anchored probabilities {G(R) : R ∈ R} and one output Z. Each non-input
Along the way, the classical inside-outside algo- node of the circuit is a β value, which is defined as a
rithm finds the expected counts ofPthe anchored sum of products of certain other β values and G val-
rules. It finds c(A → B C) as i,j,k c(Aki → ues. The circuit’s size and structure are determined
Bij Cjk ), where c(Aki → Bij Cjk ) denotes the ex- by w. The nested loops in Algorithm 1 simply it-
pected count of that anchored rule, or equivalently, erate through the nodes of this circuit in a topologi-
the expected number of times that A → B C is cally sorted order, computing the value at each node.
used at the particular position described by i, j, k. A Given any arithmetic circuit that represents a dif-
simple extension (section 6.3) will find c(Aki ), the ferentiable function Z, automatic differentiation
expected count of an anchored constituent.3 extends it with a new adjoint circuit that computes
3
A slower method is c(Aki ) = c(Aki → Bij Cjk ). 4
P
B,C,j An example of minimum Bayes risk decoding.
the gradient of Z—that is, the partial derivatives of yields three adjoint summands
the output Z with respect to the inputs.
In the common case of reverse-mode automatic ðG(A → B C) += ðβ[Aki ]β[Bij ]β[Cjk ] (8)
differentiation (Griewank and Corliss, 1991), the ad- ðβ[Bij ] += G(A → B C)ðβ[Aki ]β[Cjk ] (9)
joint circuit employs a back-propagation strategy
ðβ[Cjk ] += G(A → B C)β[Bij ]ðβ[Aki ] (10)
(Werbos, 1974).5 For each node x in the original cir-
cuit (not just the input nodes), the adjoint circuit in- Similarly, the initialization step in line 5,
cludes an adjoint node ðx whose value is the partial
derivative ∂Z/∂x. Beginning with the obvious fact β[Akk−1 ] += G(A → wk ) (11)
that ðZ = 1, the adjoint circuit next computes the
yields the single (obvious) adjoint summand
partials of Z with respect to the nodes that directly
influence Z, and then with respect to the nodes that ðG(A → wk ) += ðβ[Akk−1 ] (12)
influence those, gradually working back toward the
inputs. Thus, the earlier x is computed, the later ðx Importantly, computing the adjoints increases the
is computed. runtime by only a constant factor.
The adjoint of an inner weight, ðβ[· · · ], corre-
5.2 Differentiating the inside algorithm sponds to the traditional outer weight, written as
Back-propagation is popular for optimizing the pa- α[· · · ]. We will adopt this notation below. Fur-
rameters of neural networks, which involve nonlin- thermore, for consistency, we will write ðG(R) as
ear functions (LeCun, 1985; Rumelhart et al., 1986). α[R]—the “outer weight” of rule R.
We do not need to give a full presentation here, be-
cause the inside algorithm’s circuit consists entirely 5.3 The inside-outside algorithm
of multiply-adds. In this simple case, automatic dif- We need only one more move to derive the inside-
def
ferentiation just augments each operation outside algorithm. So far, we have obtained α[R] =
ðG(R), the partial of Z with respect to G(R) =
x += y1 · y2 (4) exp θR . We log-transform both Z and G(R) to ar-
rive finally at the expected count:
with the pair of adjoint operations
∂ log Z
ðy1 += ðx · y2 (5) c(R) = . from section 4.2
∂θR
ðy2 += y1 · ðx (6) ∂ log Z ∂Z ∂G(R)
= · · . chain rule
∂Z ∂G(R) ∂θR
Intuitively, (5) recognizes that increasing y1 by a
small increment  will increase x by  · y2 , which = (1/Z) · α[R] · G(R) (13)
in turn increases Z by ðx ·  · y2 . This suggests that The final algorithm appears as Algorithm 2. This
ðy1 = ðx · y2 . However, increasing y1 may affect visits all of the inside nodes in a topologically sorted
Z through more than just x, since y1 may also feed order by calling Algorithm 1, then visits all of their
into other equations like (4). Differentiability im- adjoints in a topologically sorted order (roughly the
plies that the combined effect of these influences of reverse), and finally applies (13).
y1 on Z is additive as  → 0, accounting for the +=
in (5) (which is unrelated to the += in (4)). 6 Detailed Discussion
This pattern in (4)–(6) extends to three-way prod-
6.1 Relationship to the traditional version
ucts as well. Thus, the key step at line 11 of the
inside algorithm, An “off-the-shelf” application of automatic dif-
ferentiation produces efficient code. Indeed, we
β[Aki ] += G(A → B C)β[Bij ]β[Cjk ] (7) would have gotten slightly more efficient code if
5
In cases like ours, where the original circuit has variable
we had applied the technique directly to the prob-
size, it is sometimes referred to as “backprop through structure” lem of section 4.2—finding the gradient of log-
(Williams and Zipser, 1989; Goller and Küchler, 2005). likelihood. This version is shown as Algorithm 3
Algorithm 2 The inside-outside algorithm mands in (14) and (15). To derive this fact via gra-
1: procedure I NSIDE -O UTSIDE(G, w) dients, just revise our previous construction to use
2: Z := I NSIDE(G, w) . side effect: sets β[· · · ] anchored rules R0 ∈ R. Extending the notation of
3: initialize all α[· · · ] to 0 section 4.2, let fR0 (T ) ∈ {0, 1} denote the count of
4: α[ROOTn0 ] += 1 . sets ðZ = 1 R0 in parse T . We wish to find its expectation c(R0 ).
5: for width := n downto 2 : . wide to narrow Intuitively, this should equal ∂(log Z)/∂θR0 , which
6: for i := 0 to n − width : . start point intuitively should work out to the summand in ques-
7: k := i + width . end point tion via anchored versions of (8) and (13). This in-
8: for j := i + 1 to k − 1 : . midpoint deed holds if we revise (3) to use anchored features
9: for A, B, C ∈ N : fR0 of the tree T :
10: α[A → B C] += α[Aki ]β[Bij ]β[Cjk ]
11: α[Bij ] += G(A → B C)α[Aki ]β[Cjk ] 1 X
p(T | w) = exp θR0 · fR0 (T ) (16)
12: α[Cjk ] += G(A → B C)β[Bij ]α[Aki ] Z 0 0
R ∈R
13: for k := 1 to n : . width-1 constituents
14: for A ∈ N : def
and then set θR0 = log G(R) whenever R0 is an
15: α[A → wk ] += α[Akk−1 ] anchoring of R. Notice that (16) is a more ex-
16: for R ∈ R : . expected rule counts pressive model than a WCFG, since it permits the
17: c(R) := α[R] · G(R)/Z same rule R to have different weights at different
positions—an idea we will revisit in section 8.4.
However, here we take the gradient of its Z only
in Appendix A. It computes adjoint quantities of at the point in parameter space that corresponds to
the form Zα [x] = ∂(log Z)/∂x rather than α[x] = the actual WCFG G—obtained by setting θR0 to the
∂Z/∂x. As a result, it divides once by Z at line 4, same value, log G(R), for all anchorings R0 of R.
whereas Algorithm 2 must do so many times at
line 17 (to implement the correction in (13)).
Even Algorithm 2 is slightly more efficient than 6.3 Interpreting the outer weights
the traditional version (Lari and Young, 1990),
thanks to the new quantity α[R]. The traditional ver- Section 5.3 exposes some interesting quantities.
sion (Algorithm 4) leaves out line 17, instead replac- The outer weight of a constituent is α[Aki ] =
ing lines 10 and 15 respectively with ðβ[Aki ]. The outer weight of a rule—novel to our
presentation—is α[R] = ðG(R). We now connect
α[Aki ]G(A→B C)β[Bij ]β[Cjk ]
c(A → B C) += (14) these partial derivatives to a traditional view of the
Z
α[Akk−1 ]G(A→wk )
outer weights.
c(A → wk ) += (15)
Z As we will see, β[Aki ] · ðβ[Aki ] (inner times outer)
Our automatically derived Algorithm 2 efficiently is the total weight of all parses that contain Aki .
omits the common factor G(R)/Z from each sum- Then (1) implies that the total probability of these
mand above. It uses α[R] to accumulate the result- parses is β[Aki ] · ðβ[Aki ]/Z. This gives the marginal
ing intermediate sum (over positions of R), and fi- probability of the anchored nonterminal, i.e., c(Aki ).
nally multiplies the sum by G(R)/Z at line 17.6 Analogously, G(R) · ðG(R) is the total weight of
6.2 Obtaining the anchored probabilities all parses that contain R, where a parse is counted
multiple times in this total if it contains R multiple
If one wants the anchored rule probabilities dis- times. Dividing by Z as before gives the expected
cussed in section 4.3, they are precisely the sum- count c(R), which is indeed what line 17 does.
6
In practice, to avoid allocating separate memory for α[R], What does an outer weight signify on its own, and
it can be stored in c(R) and then multiplied in place by G(R)/Z
at line 17 to obtain the true c(R). Current automatic differenti-
how does it help compute the total weight? Con-
ation packages do not discover optimizations of this sort, as far sider a parse T that contains the anchored nontermi-
as this author knows. nal Aki , so that T has the form
ROOT within it can be obtained by combining R with the
incomplete “outside derivation” that lacks the token
w1 . . . wi A wk+1 . . . wn of R at that position, which has the form
ROOT
wi+1 . . . wk
w1 . . . wi A wk+1 . . . wn
Its weight G(T ) is the product of weights of all rules
in the parse. This can be factored into an inside
B C
product that considers the rules dominated by Aki ,
times an outside product that considers all the other
rules in T . That is, T is obtained by combining wi+1 . . . wj wj+1 . . . wk
an “inside derivation” with an incomplete “outside
Summing the weights of these instances (a tree with
derivation,” of the respective forms
c copies of R is added c times), the distributive prop-
A ROOT erty implies the total is G(R) · α[R], where the outer
weight α[R] sums over the “outside derivations.”
wi+1 . . . wk w1 . . . wi A wk+1 wn Algorithm 2 accumulates that sum at line 10 (for a
binary rule as drawn above) or line 15 (unary rule).
and the weight of T is the product of their weights.
We can verify from this fact that ∂Z/∂G(R) is in-
The many parses T that contain Aki can be ob-
deed α[R], by checking the effect onP Z of increasing
tained by pairwise combinations of inside and out-
G(R) slightly. Let us express Z = ∞ c=0 Zc , where
side derivations of the forms shown. Thus—thanks
Zc is the total weight of parses that contain exactly
to the distributive property—the total weight of
c copies of R. Increasing G(R) by aPmultiplicative
these parses is β[Aki ] · α[Aki ], where the inner weight
factor of 1 +P will increase Z to c (1 + )c Zc ,
β[Aki ] sums the inside product over all such inside
which ≈ c (1 + c)Zc for  ≈ 0. Put an-
derivations, and the outer weight α[Aki ] sums the
other way, increasing G(R)P by adding G(R) results
outside product over all such outside derivations.
in increasing Z by about c cZc . Therefore (at
Both sums are accomplished in practice by dynamic
programming recurrences that build from smaller to
least
P when  G(R) 6= 0), we conclude ∂Z/∂G(R) =
larger derivations. Namely, Algorithm 1 line 11 ex- Pc /G(R) = (G(R) · α[R])/G(R) = α[R],
c cZ
since c cZc is the total weight computed in the
tends from smaller inside derivations to β[Aki ], while
previous paragraph.
Algorithm 2 lines 11–12 extend from α[Aki ] to larger
outside derivations. 7 The Forest PCFG
The previous paragraph is part of the traditional
(“difficult”) explanation of the inside-outside algo- For additional understanding, this section presents a
rithm. However, it gives an alternative route to the different motivation for the inside algorithm—which
gradient interpretation. It implies that Z can be then leads to an attractive independent derivation of
found as β[Aki ] · α[Aki ] plus the weights of some the inside-outside algorithm.
other parses that do not involve β[Aki ]. It follows
that ∂Z/∂β[Aki ] is indeed α[Aki ]. 7.1 Constructing the parse forest as a PCFG
Similarly, how about the outer weight and total The inner weights serve to enable the construc-
weight of a rule? Consider a parse T that contains tion of a convenient representation—as a PCFG—
rule R at a particular position i, j, k. G(T ) can be of the distribution (1). Using a grammar to repre-
factored into the rule probability G(R)—which can sent the packed forest of all parses of w was origi-
be regarded as an “inside product” with only one nally discussed by Bar-Hillel et al. (1961) and Billot
factor—times an “outside product” that considers all and Lang (1989). The construction below (Neder-
the other rule tokens in T . This decomposition helps hof and Satta, 2003) takes the weighted version of
us find the total weight as before. Each of the many such a grammar and “renormalizes” it into a PCFG
instances of a parse T with a token of R marked (Thompson, 1974; Abney et al., 1999; Chi, 1999).
This PCFG G 0 generates only parses of w: that p. 56; Goodman, 1998, section 4.4.1; Finkel et al.,
is, it assigns weight 0 to any derivation with fringe 2006). Our presentation merely exposes the con-
6= w. G 0 uses the original terminals Σ, but a different struction of G 0 as an intermediate step.8
nonterminal set—namely the anchored nonterminals
N 0 = {Aki : A ∈ N , 0 ≤ i < k ≤ n}, with root 7.3 Expected counts from the forest PCFG
symbol ROOT0 = ROOTn0 . The CNF rules over these The above reparameterization of p(T | w) as a
nonterminals are the anchored rules R0 . The func- PCFG G 0 makes it easier to find c(Aki ). The proba-
tion G 0 : R0 → R≥0 can now be defined in terms of bility of finding Aki in a parse must be the probability
the inner weights: of encountering it when sampling a parse top-down
from G 0 (the hitting probability).
G 0 (Aki → Bij Cjk ) Observe that the top-down sampling procedure
def G(A → B C) · β[Bij ] · β[Cjk ] starts at ROOT0 . If it reaches Aki , it has probability
= (17) G 0 (Aki → Bij Cjk ) of reaching Bij as well as Cjk on
β[Aki ]
the next step.
def G(A → wk )
G 0 (Akk−1 → wk ) = =1 (18) Thus, the hitting probability c(Aki ) of an anchored
β[Akk−1 ]
nonterminal is the total probability of all “paths”
for A, B, C ∈ N and 0 ≤ i < j < k ≤ n, and from ROOT0 to Aki . To find all such totals, we ini-
G0 (· · · ) = 0 otherwise. All trees T with G 0 (T ) > 0 tialize all c(· · · ) = 0 and then set c(ROOT0 ) = 1.
have fringe w. Note that |G 0 | = O(n3 |G|). Once we know c(Aki ), we can extend those paths to
Thus, G 0 defines the weights of the rules in R0 [Aki ] its successor vertices, much as in the forward algo-
to be the relative contributions made to β[Aki ] in Al- rithm for HMMs.
gorithm 1 by its summands.7 This ensures that they Clearly, the probability that c(Aki ) expands using
sum to 1, making G 0 a PCFG. a particular anchored rule R0 ∈ R0 [Aki ] during top-
down sampling is
7.2 Sampling from the forest PCFG
c(R0 ) = c(Aki ) · G 0 (R0 ) (19)
One use of G0
is to enable easy sampling from (1),
since PCFGs are designed to allow sampling (by a which we add into the expected count of the unan-
straightforward top-down recursive procedure). In chored version R:
effect, the top-down sampler recursively subdivides c(R) += c(R0 ) (20)
the total probability mass Z. For example, suppose
that 60% of the total Z = β[ROOTn0 ] was contributed This anchored rule hits the successors of Aki . E.g.:
by the various derivations in which the two child c(Bij ) += c(Aki → Bij Cjk ) (21)
subtrees of ROOT have root labels B, C and fringes
w1 . . . wj , wj+1 . . . wn . Then the first step of sam- c(Cjk ) += c(Aki → Bij Cjk ) (22)
pling from G 0 has a 60% chance of expanding ROOT0 This view leads to a correct algorithm that directly
into B0j and Cjn . If that happens, the sampler then re- computes c values without using α values at all. This
cursively chooses how to expand those nonterminals Algorithm 5 looks exactly like Algorithm 2 or 3, ex-
according to their inner weights. cept it uses the above lines that modify c(· · · ) in
By thus sampling a tree T 0 from G 0 , and then sim- place of the lines that modify α[· · · ] or Zα [· · · ]. We
plifying each internal node label Aki to A, we obtain can in fact regard Algorithm 3 as simply the result of
a parse T of w, distributed according to p(T | w) as rearranging Algorithm 5 to avoid some of the multi-
desired. This method is equivalent to traditional pre- plications and divisions needed to construct G 0 , ex-
sentations of sampling from p(T | w) (Bod, 1995, ploiting the fact that they cancel out as paths are ex-
7
When β[Aki ] = 0, these relative contributions are indeter-
tended.
minate (quotients are 0/0). Then G 0 can specify any probability 8
Exposing G 0 can be computationally advantageous, in fact.
distribution over the rules R0 [Aki ]. That distribution will never After preprocessing a PCFG, samples can be drawn in time
be used: the PCFG G 0 has probability 0 of reaching the an- O(n) per tree independent of the size of the grammar, by us-
chored nonterminal Aki and needing to expand it. ing alias sampling (Vose, 1991) to draw each rule in O(1) time.
For example, in Algorithm 5, update (22) above 8.1 The forward-backward algorithm
expands via (19) and (17) into A Hidden Markov Model (HMM) is essentially a
G(A→B C)·β[Bij ]·β[Cjk ] simple PCFG. Its nonterminal symbols N are often
c(Cjk ) += c(Aki ) · β[Aki ]
(23)
called states or tags. The rule set R now consists
Algorithm 3 rearranges this into not of CNF rules as in section 2, but all rules of the
form ROOT → A or A → w B or A → w (for
c(Cjk ) c(Aki )
β[Cjk ]
+= β[Aki ]
· G(A → B C) · β[Bij ] (24) A, B ∈ N and w ∈ Σ). As a result, a parse of a
length-4 sentence w must have the form
and then systematically uses Zα [x] to store c(x)
β[x] . It
similarly transforms all other c updates into Zα up-

C
OT

D
RO
dates as well. Algorithm 3 then recovers c(x) :=

w4
α

w3
Z [x] · β[x] at the end of the algorithm, for all x ∈ R,

w2
w1
and could do the same for all x ∈ R0 or x ∈ N 0 .
which is said to tag each word wj with its par-
8 Other Settings ent state. All parses share this right-branching tree
structure, so they differ only in their choice of tags.
Many types of weighted grammar are used in com-
(Traditionally, the weight G(A → w B) or
putational linguistics. Hence one often needs to
G(A → w) is defined as the product of an emission
construct new inside and inside-outside algorithms
probability, p(w | A), and a transition probabil-
(Goodman, 1998, 1999). As examples, the weighted
ity, p(B | A) or p(HALT | A). However, the fol-
version of Earley’s (1970) algorithm (Stolcke, 1995)
lowing algorithms do not require this. Indeed, the
handles arbitrary WCFGs (not restricted to CNF).
algorithms do not require that G is a PCFG—any
Vijay-Shanker and Weir (1993) treat tree-adjoining
right-branching WCFG will do.)
grammar. Eisner (1996) handles projective depen-
In this setting, the inside algorithm (which com-
dency grammars. Smith and Smith (2007) and
putes β bottom-up) is known as the backward al-
Koo et al. (2007) handle non-projective dependency
gorithm, because it proceeds from right to left. The
grammars, using an inside algorithm with a different
subsequent outside pass (which computes α top-
structure (not a dynamic programming algorithm).
down) is known as the forward algorithm.
In typical settings, each T is a derivation
Thanks to the fixed tree structure, this special-
tree (Vijay-Shankar et al., 1987)—a tree-structured
ized version of inside-outside can run in total time
recipe for assembling some syntactic description of
of only O(n|N |2 ). Of course, once we work out the
the input sentence.QThe parse probability p(T | w)
fast backward algorithm (directly or from the inside
takes the form Z1 t∈T exp θt , where t ranges over
algorithm), the forward-backward algorithm comes
certain configurations in T . Z is the total weight
for free by algorithmic differentiation, with α[x] de-
of all T . Then the core insight of section 4.2 holds
noting ðx = ∂Z/∂x. Pseudocode appears as Algo-
up: given an “inside” algorithm to compute log Z,
rithms 6–7 in Appendix B.
we can differentiate to obtain an “inside-outside” al-
The forest of all taggings of w may be compactly
gorithm that computes ∇ log Z, which yields up the
represented as a directed acyclic graph—the trel-
expected counts of the configurations.
lis. The forward-backward algorithms can be nicely
In this section we review several such settings.
understood with reference to this trellis, shown as
Parsing with a WCFG in CNF (Algorithms 1–2) is
Figure 1 in Appendix B. Each maximal path in the
merely the simplest case that fully illustrates the “ins
trellis corresponds to a tagging; α and β quantities
and outs,” as it were.
sum the weights of prefix and suffix paths. G de-
In all these cases, the EM algorithm is not the only
fines a probability distribution over these taggings
use of the expected counts: Sections 1 and 4.3 men-
tioned more important uses. EM itself is only useful log p(w) in the case of a generative grammar. Even in this case,
for training a generative grammar, at best.9 EM may not be the best choice: once we are already comput-
ing ∇ log Z, any continuous optimization algorithm (batch or
9
EM locally maximizes log Z, which equals the likelihood online) can exploit that gradient to improve log Z.
via (1). Following section 7, one can determine tran- to partially supervised parsing (Pereira and Schabes,
sition probabilities to the edges of the trellis so that 1992; Matsuzaki et al., 2005), where the output tree
a random walk on the trellis samples a tagging, from is not fully unobserved.
left to right, according to (1). The backward algo-
rithm serves to compute β values from which these 8.3 Synchronous grammars
transition probabilities are found (cf. section 7.1). In a synchronous grammar (Shieber and Schabes,
Under these probabilities, the trellis becomes a non- 1990), a parse T is a derivation tree that produces
stationary Markov model over the taggings. The for- aligned syntactic representations of a pair of sen-
ward pass now finds the hitting probabilities in this tences. These cases are handled as before. The in-
Markov model (cf. section 7.3), which describe how put to the inside algorithm is a pair of aligned or
often the random walk will reach specific anchored unaligned sentences, or a single sentence.
nonterminals or traverse edges between them. These The simplest synchronous grammar is a finite-
are the expected counts of tags and tag bigrams at state transducer. Here T is an alignment of the two
specific positions. sentences, tagged with states. Eisner (2002) general-
ized the forward-backward algorithm to this setting
8.2 Other grammar formalisms and drew a connection to gradients.
Many grammar formalisms have weighted versions
that produce exponential-family distributions over 8.4 Conditional grammars
tree-structured derivations: In the conditional random field (CRF) approach,
p(T | w) is defined as Gw (T )/Z, where Gw is a
• PCFGs or WCFGs whose nonterminals are lex-
specialized grammar constructed given the input w.
icalized or extended with other attributes, in-
Generally Gw defines weights for the anchored rules
cluding unification-based grammars (Johnson
R0 , allowing the weight of a rule to be position-
et al., 1999)
specific. (This slightly affects lines 5 and 11 of
• Categorial grammars, which use an unbounded
Algorithm 1.) Any weighted grammar formalism
set of nonterminals N (bounded for any given
may be used: e.g., the formalisms in section 2 and
input sentence)
section 8.1 respectively yield the CRF-CFG (Finkel
• Tree substitution grammars and tree adjoin-
et al., 2008) and the popular linear-chain CRF (Sut-
ing grammars (Schabes, 1992), in which the
ton and McCallum, 2011).
derivation tree is distinct from the derived tree
Nothing in our approach changes. In fact, super-
• History-based stochasticizations such as the
vised training of a CRF usually follows the gradi-
structured language model (Jelinek, 2004)
ent ∇ log p(T ∗ | w). This equals the vector of rule
• Projective and non-projective dependency
counts observed in the supervised tree T ∗ , minus the
grammars as mentioned earlier
expected rule counts ∇ log Z—as equivalently com-
• Semi-Markov models for chunking (Sarawagi
puted by backprop or inside-outside.
and Cohen, 2004), with runtime O(n2 )
Each formalism has one or more inside algo- 8.5 Pruned, prioritized, and beam parsing
rithms10 that efficiently compute Z, typically in time For speed, it is common in practice to perform only
O(n2 ) to O(n6 ) via dynamic programming. These a subset of the updates at line 11 of Algorithm 1.
inside algorithms can all be differentiated using the This approximate algorithm computes an underesti-
same recipe. mate Ẑ of Z (via underestimates β̂[Aki ]), because
The trick continues to work when one of these in- only a subset of the inside circuit is used. By
side algorithms is extended to the case of prefix pars- storing this dynamically determined smaller circuit
ing or lattice parsing (Nederhof and Satta, 2003), and performing back-propagation through it (Eisner
where the input sentence is not fully observed, or et al., 2005), we can compute the partial deriva-
10
Even basic WCFGs admit multiple algorithms—Earley’s
tives ∂ Ẑ/∂ β̂[Aki ] and ∂ Ẑ/∂G(R). These approxi-
algorithm, unary cycle elimination, the “hook trick,” and more mations may be used in all formulas in order to com-
(see Goodman, 1998; Eisner and Blatz, 2007). pute the expected counts of constituents and rules
in the pruned parse forest, which consists of the 9 Conclusions and Further Reading
parse trees—with total weight Ẑ—explored by the
Computational linguists have been back-
approximate algorithm.
propagating through their arithmetic circuits
Many clever single-pass or multi-pass strategies
for a long time without realizing it—indeed, since
exist for including the most important updates.
before Rumelhart et al. (1986) popularized the
Strategies are usually based either on direct prun-
use of this technique to train neural networks.
ing of constituents or prioritized updates with early
Recognizing this connection can help us to under-
stopping. A key insight is that β[Aki ] · ðβ[Aki ] is the
stand, teach, develop, and implement many core
total contribution of β[Aki ] to Z, so one should be
algorithms of the field.
sure to include the update β[Aki ] += ∆ if ∆ · α[Aki ]
Good follow-up reading includes
is predicted to be large.
Weighted automata are popular for parsing, par- • how to use the inside algorithm within a larger
ticularly dependency parsing (Nivre, 2003). In this neural network that can be differentiated end-
case, T is similar to a derivation tree: it is a sequence to-end (Gormley et al., 2015; Gormley, 2015);
of operations (e.g., shift and reduce) that constructs • how to efficiently obtain the partial derivatives
a syntactic representation. Here an approximate Ẑ of the expected counts by differentiating the al-
is usually computed by beam search, and can be dif- gorithm a second time (Li and Eisner, 2009);
ferentiated as above. • how to navigate through the space of possible
inside algorithms (Eisner and Blatz, 2007);
8.6 Inside-outside should be as fast as inside • how to convert an inside algorithm into variant
In all cases, it is good practice to derive one’s algorithms such as Viterbi parsing, k-best pars-
inside-outside algorithm by differentiation—manual ing, and recognition (Goodman, 1999);
or automatic—to ensure correctness and efficiency. • how to apply the same ideas to graphical mod-
It should be a red flag if a proposed inside-outside els (Aji and McEliece, 2000; Darwiche, 2003).
algorithm is asymptotically slower than its inside al-
Acknowledgments
gorithm.11 Why? Because automatic differentiation
produces an adjoint circuit that is at most twice the Thanks to anonymous referees and to Tim Vieira for
size of the original circuit (assuming binary opera- useful comments that improved the paper—and to
tors). That means ∇ log Z always can be evaluated the organizers of this workshop for soliciting tutorial
with the same asymptotic runtime as log Z. papers in the first place.
Good researchers do sometimes slip up and pub-
lish less efficient algorithms. Koo et al. (2007)
present the O(n3 ) algorithms for non-projective de-
pendency parsing: they point out that a contempo-
raneous IWPT paper with the same O(n3 ) inside
algorithm had somehow raised inside-outside’s run-
time from O(n3 ) to O(n5 ). Similarly, Vieira et al.
(2016) provide efficient algorithms for variable-
order linear-chain CRFs, but note that a JMLR pa-
per with the same forward algorithm had raised the
grammar constant in forward-backward’s runtime.
11
Unless inside-outside has been deliberately slowed down to
reduce its space requirements, as in the discard-and-recompute
scheme of Zweig and Padmanabhan (2000). Absent such a
scheme, inside-outside may need asymptotically more space
than inside: though the adjoint circuit is not much bigger than
the original, evaluating it may require keeping more of the orig-
inal in memory at once (unless time is traded for space).
A Pseudocode for Inside-Outside Variants
Algorithm 3 A cleaner variant of the inside-outside
The core of this paper is Algorithms 1 and 2 in the
algorithm
main text. For easy reference, this appendix pro-
1: procedure I NSIDE -O UTSIDE(G, w)
vides concrete pseudocode for some close variants
2: Z := I NSIDE(G, w) . side effect: sets β[· · · ]
of Algorithm 2 that are discussed in the main text,
3: initialize all α[· · · ] to 0
highlighting the differences from Algorithm 2. All α n 1
of these variants compute the same expected counts
4: Z [ ROOT 0 ] += Z . sets ðZ = 1/Z
5: for width := n downto 2 : . wide to narrow
c, with only small constant-factor differences in ef-
6: for i := n − width downto 0 : . start point
ficiency.
7: k := i + width . end point
8: for j := k − 1 downto i + 1 : . midpoint
Algorithm 3 is the clean, efficient version of
9: for A, B, C ∈ N :
inside-outside that obtains the expected counts α α k j k
∇ log Z by direct implementation of backprop. We
10: Z [A → B C] += Z [Ai ]β[Bi ]β[Cj ]
α j α k k
may regard this as the fundamental form of the algo- 11: Z [Bi ] += G(A → B C) Z [Ai ]β[Cj ]
α k j α k
rithm. The differences from Algorithm 2 are high- 12: Z [Cj ] += G(A → B C)β[Bi ] Z [Ai ]
lighted in red and discussed in section 6.1. 13: for k := n downto 1 : . width-1 constituents
Since log Z is replacing Z as the output being dif- 14: for A ∈ N : p
ferentiated, the adjoint quantity ðx is redefined as α α k
15: Z [A → wk ] += Z [Ak−1 ]
∂(log Z)/∂x and stored in Zα [x]. The name Zα is
16: for R ∈ R : . expected rule counts
chosen because it is Z1 times the traditional α. The α
17: c(R) := Z [R] · G(R) . no division by Z
computations in the loops are not affected by this
rescaling: they perform the same operations as in
Algorithm 2. Only the start and end of the algorithm
are different (lines 4 and 17).
To emphasize the pattern of reverse-mode auto- Algorithm 4 A more traditional variant of the
matic differentiation, Algorithm 3 takes care to com- inside-outside algorithm
pute the adjoint quantities in exactly the reverse of 1: procedure I NSIDE -O UTSIDE(G, w)
the order in which Algorithm 1 computed the orig- 2: Z := I NSIDE(G, w) . side effect: sets β[· · · ]
inal quantities. The resulting iteration order in the 3: initialize all α[· · · ] to 0
i, j, and k loops is highlighted in blue. Comput- 4: α[ROOTn0 ] += 1 . sets ðZ = 1
ing adjoints in reverse order always works and can 5: for width := n downto 2 : . wide to narrow
be regarded as the default strategy. However, this 6: for i := 0 to n − width : . start point
is merely a cosmetic change: the version in Algo- 7: k := i + width . end point
rithm 2 is just as valid, because it too visits the nodes 8: for j := i + 1 to k − 1 : . midpoint
of the adjoint circuit in a topologically sorted order. 9: for A, B, C ∈ N :
Indeed, since each of these blue loops is paralleliz- 10: c(A → B C)
able, it is clear that the order cannot matter. What α[Ak ]G(A→B C)β[B j ]β[C k ]
i i j
+= Z
is crucial is that the overall narrow-to-wide order of
11: α[Bij ] += G(A → B C)α[Aki ]β[Cjk ]
Algorithm 1, which is not parallelizable, is reversed
by both Algorithm 2 and Algorithm 3. 12: α[Cjk ] += G(A → B C)β[Bij ]α[Aki ]
13: for k := 1 to n : . width-1 constituents
Algorithm 4 is a more traditional version of Al- 14: for A ∈ N :
α[Akk−1 ]G(A→wk )
gorithm 2. The differences from Algorithm 2 are 15: c(A → wk ) += Z
highlighted in red and discussed in section 6.1. 16: for R ∈ R :
17: do nothing . c(R) has already been computed
Finally, Algorithm 5 is a version that is derived
on independent principles (section 7.3). It directly
computes the hitting probabilities of anchored con- j ≤ n) to denote an anchored nonterminal A that
stituents and rules in a PCFG representation of the serves as the tag of wj . That is, Aj is anchored so
parse forest. This may be regarded as the most natu- that its left child is wj . (Since this Aj actually dom-
ral way to obtain the algorithm without using gradi- inates all of wj . . . wn in the right-branching tree, it
ents. However, as section 7.3 notes, it can be fairly would be called Anj−1 if we were running the full
easily rearranged into Algorithm 3, which is slightly inside algorithm.)
more efficient. The differences from Algorithms 2 Line 7 builds up the right-branching tree by com-
and 3 are highlighted in red. bining a word from j − 1 to j (namely wj ) with
To rearrange Algorithm 5 into Algorithm 3, the a phrase from j to n (namely B j+1 ). This line is
key is to compute not the count c(x), but the ratio a specialization of line 11 in Algorithm 1, which
c(x) c(x)
β[x] (or G(x) when x is a rule), storing this ratio in
combines a phrase from i to j with another phrase
the variable Zα [x]. This ratio Zα [x] can be interpreted from j to k. Thanks to the right-branching con-
as ∂(log Z)/∂x as previously discussed. straint, the backward algorithm only has to loop over
O(n) triples of the form (j−1, j, n) (with fixed n)—
whereas the inside algorithm must loop over O(n3 )
Algorithm 5 An inside-outside variant motivated as triples of the form (i, j, k).
finding hitting probabilities
1: procedure I NSIDE -O UTSIDE(G, w) Algorithm 6 The backward algorithm
2: Z := I NSIDE(G, w) . side effect: sets β[· · · ] 1: function BACKWARD(G, w)
3: initialize all c[· · · ] to 0 2: initialize all β[· · · ] to 0
4: c(ROOTn0 ) += 1 3: for A ∈ N : . stopping rules
5: for width := n downto 2 : . wide to narrow 4: β[An ] += G(A → wn )
6: for i := 0 to n − width : . start point
5: for j := n − 1 downto 1 :
7: k := i + width . end point
6: for A, B ∈ N : . transition rules
8: for j := i + 1 to k − 1 : . midpoint
7: β[Aj ] += G(A → wj B) β[B j+1 ]
9: for A, B, C ∈ N :
10: c(Aki → Bij Cjk ) := . eqs. (19), (17) 8: for A ∈ N : . starting rules
G(A→B C)·β[Bij ]·β[Cjk ]
9: β[ROOT ] += G(ROOT → A) β[A1 ]
0
c(Aki ) · β[Aki ] 10: return Z := β[ROOT0 ]
11: c(A → B C) += c(Ai → Bij Cjk )
k

12: c(Bij ) += c(Aki → Bij Cjk ) The forward-backward algorithm, Algorithm 7,


13: c(Cjk ) += c(Aki → Bij Cjk ) is derived mechanically by differentiating Algo-
rithm 6, by exactly the same procedure as in sec-
14: for k := 1 to n : . width-1 constituents
tion 5. As a result, it is a specialization of Algo-
15: for A ∈ N :
rithm 2.
16: c(Akk−1 → wk ) := c(Akk−1 ) . eqs. (19),(18)
This presentation of the forward-backward algo-
17: c(A → wk ) += c(Akk−1 → wk )
rithm finds the expected counts of rules R ∈ R.
18: for R ∈ R : . expected rule counts However, section 8.1 mentions that each rule R can
19: do nothing . c(R) has already been computed be regarded as consisting of an emission action Re
followed by a transition action Rt . We may want
to find the expected counts of the various actions.
B Pseudocode for Forward-Backward
These can of course be found by summing the ex-
Algorithm 6 is the backward algorithm, as intro- pected counts of all rules containing a given action.
duced in section 8.1. It is an efficient specializa- However, this step can also be handled naturally by
tion of the inside algorithm (Algorithm 1) to right- backprop, in the common case where each G(R) is
branching trees. defined as a product pRe ·pRt of the conditional prob-
Notation: We use ROOT0 to denote the root an- abilities of the emission and transition. In this case,
chored nonterminal, and Aj (for A ∈ N and 1 ≤ θR = log G(R) from section 4.2 can be re-expressed
Algorithm 7 The forward-backward algorithm represents the anchored rule ROOT → A1 . The
1: procedure F ORWARD -BACKWARD(G, w) weight of a trellis edge corresponding to an anchor-
2: Z := BACKWARD(G, w) . also sets β[· · · ] ing of rule R is given by G(R). The weight G(T )
3: initialize all α[· · · ] to 0 of a tagging T is then the product weight of the path
4: α[ROOT0 ] += 1 . sets ðZ = 1 that corresponds to that tagging.
5: for A ∈ N : . starting rules On this view, the inner weight β[Aj ] can be re-
6: α[ROOT → A] += α[ROOT0 ] β[A1 ] garded as a suffix weight: it sums up the weight
7: α[A1 ] += α[ROOT0 ] G(ROOT → A) of all paths from Aj to HALT. Algorithm 6 can be
8: for j := 1 to n − 1 : transparently viewed as computing all suffix weights
9: for A, B ∈ N : . transition rules from right to left by dynamic programming. Z =
10: α[A → wj B] += α[A ] β[B j+1 ]
j ROOT 0 sums the weight of all paths from ROOT 0 to
11: α[B j+1 ] += α[Aj ] G(A → wj B) HALT. Similarly, the outer weight α[Aj ] can be re-
garded as a prefix weight, computed symmetrically
12: for A ∈ N : . stopping rules
n within Algorithm 7.
13: G(A → wn ) += α[A ]
14: for R ∈ R : . expected rule counts The constructions of section 7 are easier to un-
15: c(R) := α[R] · G(R)/Z derstand in this setting. Here is the interpretation.
It is possible to replace the non-negative weights on
ROOT 0 ROOT 1 ROOT 2 ROOT 3 HALT the trellis edges with probabilities, in such a way
that the product weight of each path is not changed.
NOUN 1 NOUN 2 NOUN 3
Indeed, the method is essentially identical to the
“weight pushing” algorithm for weighted finite-state
automata (Mohri, 2000).
VERB 1 VERB 2 VERB 3
The probabilistic version of the trellis is a repre-
sentation of a new weighted grammar G 0 —an HMM
Figure 1: The trellis of taggings of a length-3 sentence, under
(hence a type of PCFG) that generates only taggings
an HMM where N = {ROOT, NOUN, VERB}. (Although the
of w, with the probabilities given by (1).
trellis shows that ROOT may be used as an ordinary tag, often
In the probabilistic version of the trellis, the edges
in practice it is allowed only at the root. This can be arranged
j from a node have total probability of 1. Thus it is
by giving weight 0 to rules involving ROOT for j > 0, corre-
the graph of a Markov chain, whose states are the
sponding to the gray edges.)
anchored nonterminals N 0 . Sampling a tagging of
w is now as simple as taking a random walk from
as the sum of two parameters, θRe + θRt , which ROOT 0 until HALT is reached. The forward pass can
represent the logs of these conditional probabilities. be interpreted as a straightforward use of dynamic
Then the expected emission and transition counts are programming to compute the hitting probabilities of
given by ∂(log Z)/∂θRe and ∂(log Z)/∂θRt . the nodes in the trellis, as well as the probabilities of
traversing a node’s out-edges once the node is hit.
It is traditional to view the forward-backward al-
But how were the trellis probabilities found in the
gorithm as running over a “trellis” of taggings (Fig-
first place? The edge Aj → B j+1 originally had
ure 1), which represents the forest of parses. Since
weight G(A → wj B). In the probabilistic version
a nonterminal Aj that is anchored at position j nec- G(A→wj B)·β[B j+1 ]
essarily emits wj , the trellis representation does not of the trellis, it has probability β[Aj ]
.
bother to show the emissions. It is simply a directed This represents the total weight of paths from A j

graph showing the transitions. Every parse (tagging) that start with this edge, as a fraction of the total
of w corresponds to a path in Figure 1. Specifically, weight β[Aj ] of all paths from Aj . (The edges in-
edge Aj → B j+1 in the trellis represents the an- volving ROOT0 and HALT edges are handled simi-
chored rule Aj → wj B j+1 , without showing wj . larly.) Computing the necessary β weights to deter-
Similarly, An → HALT represents the anchored rule mine these probabilities is the essential function of
An → wn , without showing wn , and ROOT → A1 the backward algorithm.