Inside-Outside and Forward-Backward Algorithms Are Just Backprop
(Tutorial Paper)
Jason Eisner
Department of Computer Science
Johns Hopkins University
jason@cs.jhu.edu
…dates as well. Algorithm 3 then recovers c(x) := (α/Z)[x] · β[x] at the end of the algorithm, for all x ∈ R, and could do the same for all x ∈ R′ or x ∈ N′.
8 Other Settings

Many types of weighted grammar are used in computational linguistics. Hence one often needs to construct new inside and inside-outside algorithms (Goodman, 1998, 1999). As examples, the weighted version of Earley's (1970) algorithm (Stolcke, 1995) handles arbitrary WCFGs (not restricted to CNF). Vijay-Shanker and Weir (1993) treat tree-adjoining grammar. Eisner (1996) handles projective dependency grammars. Smith and Smith (2007) and Koo et al. (2007) handle non-projective dependency grammars, using an inside algorithm with a different structure (not a dynamic programming algorithm).

In typical settings, each T is a derivation tree (Vijay-Shankar et al., 1987)—a tree-structured recipe for assembling some syntactic description of the input sentence. The parse probability p(T | w) takes the form (1/Z) ∏_{t∈T} exp θ_t, where t ranges over certain configurations in T. Z is the total weight of all T. Then the core insight of section 4.2 holds up: given an “inside” algorithm to compute log Z, we can differentiate to obtain an “inside-outside” algorithm that computes ∇ log Z, which yields up the expected counts of the configurations.
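As a concrete sanity check of this insight (our own toy illustration, not taken from the paper): when the parse space is small enough to enumerate, one can compute log Z by brute force and verify numerically that ∂(log Z)/∂θ_t equals the posterior expected count of configuration t. The configurations and weights below are made up.

    import math

    theta = {"a": 0.5, "b": -1.0, "c": 2.0}             # configuration weights (made up)
    parses = [["a", "a", "b"], ["a", "c"], ["b", "c"]]  # each parse T = its configurations

    def log_Z(th):
        return math.log(sum(math.exp(sum(th[t] for t in T)) for T in parses))

    # Expected counts, computed directly from the posterior p(T) ∝ exp(Σ_t θ_t).
    Z = math.exp(log_Z(theta))
    expected = {t: 0.0 for t in theta}
    for T in parses:
        p = math.exp(sum(theta[t] for t in T)) / Z
        for t in T:
            expected[t] += p

    # The same numbers from a finite-difference estimate of ∇ log Z.
    eps = 1e-6
    for t in theta:
        bumped = dict(theta)
        bumped[t] += eps
        grad_t = (log_Z(bumped) - log_Z(theta)) / eps
        assert abs(grad_t - expected[t]) < 1e-3, (t, grad_t, expected[t])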
In this section we review several such settings. Parsing with a WCFG in CNF (Algorithms 1–2) is merely the simplest case that fully illustrates the “ins and outs,” as it were.

In all these cases, the EM algorithm is not the only use of the expected counts: Sections 1 and 4.3 mentioned more important uses. EM itself is only useful for training a generative grammar, at best.⁹

⁹ EM locally maximizes log Z, which equals the likelihood log p(w) in the case of a generative grammar. Even in this case, EM may not be the best choice: once we are already computing ∇ log Z, any continuous optimization algorithm (batch or online) can exploit that gradient to improve log Z.

8.1 Hidden Markov models

… which is said to tag each word w_j with its parent state. All parses share this right-branching tree structure, so they differ only in their choice of tags. (Traditionally, the weight G(A → w B) or G(A → w) is defined as the product of an emission probability, p(w | A), and a transition probability, p(B | A) or p(HALT | A). However, the following algorithms do not require this. Indeed, the algorithms do not require that G is a PCFG—any right-branching WCFG will do.)
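To make the traditional parameterization in the parenthetical above concrete, here is a small sketch (ours; the two states and all probabilities are made up) of how an HMM's emission, transition, and halting tables induce the weights of a right-branching WCFG:

    emit = {"N": {"dogs": 0.4, "bark": 0.1}, "V": {"dogs": 0.1, "bark": 0.5}}  # p(w | A)
    trans = {"N": {"N": 0.3, "V": 0.6}, "V": {"N": 0.5, "V": 0.2}}             # p(B | A)
    halt = {"N": 0.1, "V": 0.3}                                                # p(HALT | A)

    def G(rule):
        # Weight of a right-branching rule: (A, w, B) encodes A -> w B,
        # and (A, w) encodes A -> w.  Any nonnegative weights would work;
        # the product form below is just the traditional PCFG choice.
        if len(rule) == 3:
            A, w, B = rule
            return emit[A][w] * trans[A][B]
        A, w = rule
        return emit[A][w] * halt[A]

    print(G(("N", "dogs", "V")), G(("V", "bark")))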
In this setting, the inside algorithm (which computes β bottom-up) is known as the backward algorithm, because it proceeds from right to left. The subsequent outside pass (which computes α top-down) is known as the forward algorithm.

Thanks to the fixed tree structure, this specialized version of inside-outside can run in total time of only O(n|N|²). Of course, once we work out the fast backward algorithm (directly or from the inside algorithm), the forward-backward algorithm comes for free by algorithmic differentiation, with α[x] denoting ðx = ∂Z/∂x. Pseudocode appears as Algorithms 6–7 in Appendix B.
The forest of all taggings of w may be compactly represented as a directed acyclic graph—the trellis. The forward-backward algorithms can be nicely understood with reference to this trellis, shown as Figure 1 in Appendix B. Each maximal path in the trellis corresponds to a tagging; α and β quantities sum the weights of prefix and suffix paths. G defines a probability distribution over these taggings via (1). Following section 7, one can determine transition probabilities to the edges of the trellis so that a random walk on the trellis samples a tagging, from left to right, according to (1). The backward algorithm serves to compute β values from which these transition probabilities are found (cf. section 7.1). Under these probabilities, the trellis becomes a non-stationary Markov model over the taggings. The forward pass now finds the hitting probabilities in this Markov model (cf. section 7.3), which describe how often the random walk will reach specific anchored nonterminals or traverse edges between them. These are the expected counts of tags and tag bigrams at specific positions.
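The random-walk view can be made concrete in a few lines of Python (a sketch of ours; the two states and all weights below are made up, and the words are left implicit because only the rule weights enter the computation): run the backward pass, normalize each trellis edge by β as in section 7.1, and sample a tagging left to right.

    import random

    n = 3                                  # sentence length; w_1..w_n left implicit
    states = ["N", "V"]
    start = {"N": 1.0, "V": 0.5}           # weight of ROOT -> A^1
    trans = {A: {B: (0.3 if A == B else 0.7) for B in states} for A in states}
                                           # weight of A^j -> w_j B^{j+1}
    halt = {"N": 0.4, "V": 0.6}            # weight of A^n -> w_n

    beta = {(A, n): halt[A] for A in states}          # backward (inside) pass
    for j in range(n - 1, 0, -1):
        for A in states:
            beta[A, j] = sum(trans[A][B] * beta[B, j + 1] for B in states)
    Z = sum(start[A] * beta[A, 1] for A in states)

    def sample_tagging():
        # Each step chooses an outgoing trellis edge with probability
        # (edge weight * β of target) / (β of source), which sums to 1.
        tags = []
        probs = [start[A] * beta[A, 1] / Z for A in states]
        A = random.choices(states, probs)[0]
        tags.append(A)
        for j in range(1, n):
            probs = [trans[A][B] * beta[B, j + 1] / beta[A, j] for B in states]
            A = random.choices(states, probs)[0]
            tags.append(A)
        return tags

    print(Z, sample_tagging())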
8.2 Other grammar formalisms

Many grammar formalisms have weighted versions that produce exponential-family distributions over tree-structured derivations:

• PCFGs or WCFGs whose nonterminals are lexicalized or extended with other attributes, including unification-based grammars (Johnson et al., 1999)
• Categorial grammars, which use an unbounded set of nonterminals N (bounded for any given input sentence)
• Tree substitution grammars and tree-adjoining grammars (Schabes, 1992), in which the derivation tree is distinct from the derived tree
• History-based stochasticizations such as the structured language model (Jelinek, 2004)
• Projective and non-projective dependency grammars as mentioned earlier
• Semi-Markov models for chunking (Sarawagi and Cohen, 2004), with runtime O(n²)

Each formalism has one or more inside algorithms¹⁰ that efficiently compute Z, typically in time O(n²) to O(n⁶) via dynamic programming. These inside algorithms can all be differentiated using the same recipe.

The trick continues to work when one of these inside algorithms is extended to the case of prefix parsing or lattice parsing (Nederhof and Satta, 2003), where the input sentence is not fully observed, or to partially supervised parsing (Pereira and Schabes, 1992; Matsuzaki et al., 2005), where the output tree is not fully unobserved.

¹⁰ Even basic WCFGs admit multiple algorithms—Earley's algorithm, unary cycle elimination, the “hook trick,” and more (see Goodman, 1998; Eisner and Blatz, 2007).

8.3 Synchronous grammars

In a synchronous grammar (Shieber and Schabes, 1990), a parse T is a derivation tree that produces aligned syntactic representations of a pair of sentences. These cases are handled as before. The input to the inside algorithm is a pair of aligned or unaligned sentences, or a single sentence.

The simplest synchronous grammar is a finite-state transducer. Here T is an alignment of the two sentences, tagged with states. Eisner (2002) generalized the forward-backward algorithm to this setting and drew a connection to gradients.

8.4 Conditional grammars

In the conditional random field (CRF) approach, p(T | w) is defined as G_w(T)/Z, where G_w is a specialized grammar constructed given the input w. Generally G_w defines weights for the anchored rules R′, allowing the weight of a rule to be position-specific. (This slightly affects lines 5 and 11 of Algorithm 1.) Any weighted grammar formalism may be used: e.g., the formalisms in section 2 and section 8.1 respectively yield the CRF-CFG (Finkel et al., 2008) and the popular linear-chain CRF (Sutton and McCallum, 2011).

Nothing in our approach changes. In fact, supervised training of a CRF usually follows the gradient ∇ log p(T* | w). This equals the vector of rule counts observed in the supervised tree T*, minus the expected rule counts ∇ log Z—as equivalently computed by backprop or inside-outside.
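In code, the supervised gradient is just a difference of two count vectors. A minimal sketch (ours; observed_counts and expected_counts are hypothetical helpers, the latter standing in for an inside-outside or backprop routine run on the grammar G_w built for the input):

    def crf_gradient(T_star, G_w, observed_counts, expected_counts):
        obs = observed_counts(T_star)   # rule counts in the supervised tree T*
        exp_ = expected_counts(G_w)     # expected rule counts = ∇ log Z
        # ∇ log p(T* | w) = observed counts − expected counts
        return {r: obs.get(r, 0.0) - exp_[r] for r in exp_}

A gradient-ascent step then adds a multiple of this vector to the rule weights (in log space).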
8.5 Pruned, prioritized, and beam parsing

For speed, it is common in practice to perform only a subset of the updates at line 11 of Algorithm 1. This approximate algorithm computes an underestimate Ẑ of Z (via underestimates β̂[A_i^k]), because only a subset of the inside circuit is used. By storing this dynamically determined smaller circuit and performing back-propagation through it (Eisner et al., 2005), we can compute the partial derivatives ∂Ẑ/∂β̂[A_i^k] and ∂Ẑ/∂G(R). These approximations may be used in all formulas in order to compute the expected counts of constituents and rules in the pruned parse forest, which consists of the parse trees—with total weight Ẑ—explored by the approximate algorithm.
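Here is a sketch of that idea (ours, under simplifying assumptions): the inside pass records each update it actually performs on a tape, a hypothetical keep(i, k) heuristic prunes whole spans, and a reverse sweep over the tape back-propagates through the smaller recorded circuit only. Using the final β̂ values in the reverse sweep is valid because every β̂ is fully accumulated, narrow to wide, before any wider update reads it.

    from collections import defaultdict

    def pruned_inside_outside(grammar, words, keep):
        # grammar = {"binary": {(A, B, C): weight}, "unary": {(A, word): weight}}
        n = len(words)
        beta = defaultdict(float)
        tape = []                              # the dynamically built smaller circuit
        for k in range(1, n + 1):              # width-1 constituents
            for (A, word), g in grammar["unary"].items():
                if word == words[k - 1]:
                    beta[A, k - 1, k] += g
                    tape.append((("unary", A, word), None, None, (A, k - 1, k), g))
        for width in range(2, n + 1):          # narrow to wide
            for i in range(n - width + 1):
                k = i + width
                if not keep(i, k):             # pruning: Ẑ underestimates Z
                    continue
                for j in range(i + 1, k):
                    for (A, B, C), g in grammar["binary"].items():
                        if beta[B, i, j] and beta[C, j, k]:
                            beta[A, i, k] += g * beta[B, i, j] * beta[C, j, k]
                            tape.append((("binary", A, B, C), (B, i, j), (C, j, k),
                                         (A, i, k), g))
        Zhat = beta["ROOT", 0, n]
        adj = defaultdict(float)               # adjoints ∂Ẑ/∂β̂[·]
        adj["ROOT", 0, n] = 1.0
        grad_G = defaultdict(float)            # ∂Ẑ/∂G(R)
        for rule, left, right, parent, g in reversed(tape):
            a = adj[parent]
            if left is None:                   # unary update
                grad_G[rule] += a
            else:                              # binary update
                grad_G[rule] += a * beta[left] * beta[right]
                adj[left] += a * g * beta[right]
                adj[right] += a * g * beta[left]
        # e.g., approximate expected rule counts: grad_G[rule] * weight / Zhat
        return Zhat, grad_G, adj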
Many clever single-pass or multi-pass strategies exist for including the most important updates. Strategies are usually based either on direct pruning of constituents or prioritized updates with early stopping. A key insight is that β[A_i^k] · ðβ[A_i^k] is the total contribution of β[A_i^k] to Z, so one should be sure to include the update β[A_i^k] += ∆ if ∆ · α[A_i^k] is predicted to be large.

Weighted automata are popular for parsing, particularly dependency parsing (Nivre, 2003). In this case, T is similar to a derivation tree: it is a sequence of operations (e.g., shift and reduce) that constructs a syntactic representation. Here an approximate Ẑ is usually computed by beam search, and can be differentiated as above.

8.6 Inside-outside should be as fast as inside

In all cases, it is good practice to derive one's inside-outside algorithm by differentiation—manual or automatic—to ensure correctness and efficiency. It should be a red flag if a proposed inside-outside algorithm is asymptotically slower than its inside algorithm.¹¹ Why? Because automatic differentiation produces an adjoint circuit that is at most twice the size of the original circuit (assuming binary operators). That means ∇ log Z can always be evaluated with the same asymptotic runtime as log Z.

Good researchers do sometimes slip up and publish less efficient algorithms. Koo et al. (2007) present O(n³) algorithms for non-projective dependency parsing: they point out that a contemporaneous IWPT paper with the same O(n³) inside algorithm had somehow raised inside-outside's runtime from O(n³) to O(n⁵). Similarly, Vieira et al. (2016) provide efficient algorithms for variable-order linear-chain CRFs, but note that a JMLR paper with the same forward algorithm had raised the grammar constant in forward-backward's runtime.

¹¹ Unless inside-outside has been deliberately slowed down to reduce its space requirements, as in the discard-and-recompute scheme of Zweig and Padmanabhan (2000). Absent such a scheme, inside-outside may need asymptotically more space than inside: though the adjoint circuit is not much bigger than the original, evaluating it may require keeping more of the original in memory at once (unless time is traded for space).

9 Conclusions and Further Reading

Computational linguists have been back-propagating through their arithmetic circuits for a long time without realizing it—indeed, since before Rumelhart et al. (1986) popularized the use of this technique to train neural networks. Recognizing this connection can help us to understand, teach, develop, and implement many core algorithms of the field.

Good follow-up reading includes:

• how to use the inside algorithm within a larger neural network that can be differentiated end-to-end (Gormley et al., 2015; Gormley, 2015);
• how to efficiently obtain the partial derivatives of the expected counts by differentiating the algorithm a second time (Li and Eisner, 2009);
• how to navigate through the space of possible inside algorithms (Eisner and Blatz, 2007);
• how to convert an inside algorithm into variant algorithms such as Viterbi parsing, k-best parsing, and recognition (Goodman, 1999);
• how to apply the same ideas to graphical models (Aji and McEliece, 2000; Darwiche, 2003).

Acknowledgments

Thanks to anonymous referees and to Tim Vieira for useful comments that improved the paper—and to the organizers of this workshop for soliciting tutorial papers in the first place.
A Pseudocode for Inside-Outside Variants

The core of this paper is Algorithms 1 and 2 in the main text. For easy reference, this appendix provides concrete pseudocode for some close variants of Algorithm 2 that are discussed in the main text, highlighting the differences from Algorithm 2. All of these variants compute the same expected counts c, with only small constant-factor differences in efficiency.

Algorithm 3 is the clean, efficient version of inside-outside that obtains the expected counts ∇ log Z by direct implementation of backprop. We may regard this as the fundamental form of the algorithm. The differences from Algorithm 2 are highlighted in red and discussed in section 6.1.

Algorithm 3 A cleaner variant of the inside-outside algorithm
 1: procedure INSIDE-OUTSIDE(G, w)
 2:   Z := INSIDE(G, w)                  ▷ side effect: sets β[···]
 3:   initialize all (α/Z)[···] to 0
 4:   (α/Z)[ROOT_0^n] += 1/Z             ▷ sets ðZ = 1/Z
 5:   for width := n downto 2 :          ▷ wide to narrow
 6:     for i := n − width downto 0 :    ▷ start point
 7:       k := i + width                 ▷ end point
 8:       for j := k − 1 downto i + 1 :  ▷ midpoint
 9:         for A, B, C ∈ N :
10:           (α/Z)[A → B C] += (α/Z)[A_i^k] β[B_i^j] β[C_j^k]
11:           (α/Z)[B_i^j] += G(A → B C) (α/Z)[A_i^k] β[C_j^k]
12:           (α/Z)[C_j^k] += G(A → B C) β[B_i^j] (α/Z)[A_i^k]
13:   for k := n downto 1 :              ▷ width-1 constituents
14:     for A ∈ N :
15:       (α/Z)[A → w_k] += (α/Z)[A_{k−1}^k]
16:   for R ∈ R :                        ▷ expected rule counts
17:     c(R) := (α/Z)[R] · G(R)          ▷ no division by Z

Since log Z is replacing Z as the output being differentiated, the adjoint quantity ðx is redefined as ∂(log Z)/∂x and stored in (α/Z)[x]. The name α/Z is chosen because this quantity is 1/Z times the traditional α. The computations in the loops are not affected by this rescaling: they perform the same operations as in Algorithm 2. Only the start and end of the algorithm are different (lines 4 and 17).

To emphasize the pattern of reverse-mode automatic differentiation, Algorithm 3 takes care to compute the adjoint quantities in exactly the reverse of the order in which Algorithm 1 computed the original quantities. The resulting iteration order in the i, j, and k loops is highlighted in blue. Computing adjoints in reverse order always works and can be regarded as the default strategy. However, this is merely a cosmetic change: the version in Algorithm 2 is just as valid, because it too visits the nodes of the adjoint circuit in a topologically sorted order. Indeed, since each of these blue loops is parallelizable, it is clear that the order cannot matter. What is crucial is that the overall narrow-to-wide order of Algorithm 1, which is not parallelizable, is reversed by both Algorithm 2 and Algorithm 3.
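For concreteness, here is a direct Python rendering of Algorithm 3 together with the Algorithm 1 inside pass that it differentiates. This is a sketch: the dict-based grammar encoding is our own, with rules keyed as (A, B, C) or (A, word) over string nonterminals and words, and anchored items keyed as (A, i, k) with integer endpoints, so the key shapes never collide.

    from collections import defaultdict

    def inside(G, N, w):                       # Algorithm 1
        n = len(w)
        beta = defaultdict(float)
        for k in range(1, n + 1):              # width-1 constituents
            for A in N:
                beta[A, k - 1, k] += G.get((A, w[k - 1]), 0.0)
        for width in range(2, n + 1):          # narrow to wide
            for i in range(n - width + 1):
                k = i + width
                for j in range(i + 1, k):
                    for A in N:
                        for B in N:
                            for C in N:
                                beta[A, i, k] += (G.get((A, B, C), 0.0)
                                                  * beta[B, i, j] * beta[C, j, k])
        return beta["ROOT", 0, n], beta

    def inside_outside(G, N, w):               # Algorithm 3
        n = len(w)
        Z, beta = inside(G, N, w)
        alphaZ = defaultdict(float)            # (α/Z)[x] stores ∂(log Z)/∂x
        alphaZ["ROOT", 0, n] += 1.0 / Z        # line 4: sets ðZ = 1/Z
        for width in range(n, 1, -1):          # wide to narrow (reverse order)
            for i in range(n - width, -1, -1):
                k = i + width
                for j in range(k - 1, i, -1):
                    for A in N:
                        for B in N:
                            for C in N:
                                g = G.get((A, B, C), 0.0)
                                a = alphaZ[A, i, k]
                                alphaZ[A, B, C] += a * beta[B, i, j] * beta[C, j, k]
                                alphaZ[B, i, j] += g * a * beta[C, j, k]
                                alphaZ[C, j, k] += g * beta[B, i, j] * a
        for k in range(n, 0, -1):              # width-1 constituents
            for A in N:
                alphaZ[A, w[k - 1]] += alphaZ[A, k - 1, k]
        c = {R: alphaZ[R] * G[R] for R in G}   # line 17: no division by Z
        return Z, c

    # Toy usage: N = ["ROOT", "X"]; G = {("ROOT","X","X"): 1.0,
    #     ("X","X","X"): 0.5, ("X","a"): 1.0}
    # Z, c = inside_outside(G, N, ["a", "a", "a"])   # c = expected rule counts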
Algorithm 4 is a more traditional version of Algorithm 2. The differences from Algorithm 2 are highlighted in red and discussed in section 6.1.

Algorithm 4 A more traditional variant of the inside-outside algorithm
 1: procedure INSIDE-OUTSIDE(G, w)
 2:   Z := INSIDE(G, w)                  ▷ side effect: sets β[···]
 3:   initialize all α[···] to 0
 4:   α[ROOT_0^n] += 1                   ▷ sets ðZ = 1
 5:   for width := n downto 2 :          ▷ wide to narrow
 6:     for i := 0 to n − width :        ▷ start point
 7:       k := i + width                 ▷ end point
 8:       for j := i + 1 to k − 1 :      ▷ midpoint
 9:         for A, B, C ∈ N :
10:           c(A → B C) += α[A_i^k] G(A → B C) β[B_i^j] β[C_j^k] / Z
11:           α[B_i^j] += G(A → B C) α[A_i^k] β[C_j^k]
12:           α[C_j^k] += G(A → B C) β[B_i^j] α[A_i^k]
13:   for k := 1 to n :                  ▷ width-1 constituents
14:     for A ∈ N :
15:       c(A → w_k) += α[A_{k−1}^k] G(A → w_k) / Z
16:   for R ∈ R :
17:     do nothing                       ▷ c(R) has already been computed
Finally, Algorithm 5 is a version that is derived on independent principles (section 7.3). It directly computes the hitting probabilities of anchored constituents and rules in a PCFG representation of the parse forest. This may be regarded as the most natural way to obtain the algorithm without using gradients. However, as section 7.3 notes, it can be fairly easily rearranged into Algorithm 3, which is slightly more efficient. The differences from Algorithms 2 and 3 are highlighted in red.

To rearrange Algorithm 5 into Algorithm 3, the key is to compute not the count c(x), but the ratio c(x)/β[x] (or c(x)/G(x) when x is a rule), storing this ratio in the variable (α/Z)[x]. This ratio (α/Z)[x] can be interpreted as ∂(log Z)/∂x as previously discussed.
Algorithm 5 An inside-outside variant motivated as finding hitting probabilities
 1: procedure INSIDE-OUTSIDE(G, w)
 2:   Z := INSIDE(G, w)                  ▷ side effect: sets β[···]
 3:   initialize all c[···] to 0
 4:   c(ROOT_0^n) += 1
 5:   for width := n downto 2 :          ▷ wide to narrow
 6:     for i := 0 to n − width :        ▷ start point
 7:       k := i + width                 ▷ end point
 8:       for j := i + 1 to k − 1 :      ▷ midpoint
 9:         for A, B, C ∈ N :
10:           c(A_i^k → B_i^j C_j^k) := c(A_i^k) · G(A → B C) · β[B_i^j] · β[C_j^k] / β[A_i^k]   ▷ eqs. (19), (17)
11:           c(A → B C) += c(A_i^k → B_i^j C_j^k)
 ⋮

B Pseudocode for Forward-Backward

… j ≤ n) to denote an anchored nonterminal A that serves as the tag of w_j. That is, A^j is anchored so that its left child is w_j. (Since this A^j actually dominates all of w_j . . . w_n in the right-branching tree, it would be called A_{j−1}^n if we were running the full inside algorithm.)

Algorithm 6 The backward algorithm
 1: function BACKWARD(G, w)
 2:   initialize all β[···] to 0
 3:   for A ∈ N :                        ▷ stopping rules
 4:     β[A^n] += G(A → w_n)
 5:   for j := n − 1 downto 1 :
 6:     for A, B ∈ N :                   ▷ transition rules
 7:       β[A^j] += G(A → w_j B) β[B^{j+1}]
 8:   for A ∈ N :                        ▷ starting rules
 9:     β[ROOT^0] += G(ROOT → A) β[A^1]
10:   return Z := β[ROOT^0]

Line 7 builds up the right-branching tree by combining a word from j − 1 to j (namely w_j) with a phrase from j to n (namely B^{j+1}). This line is a specialization of line 11 in Algorithm 1, which combines a phrase from i to j with another phrase from j to k. Thanks to the right-branching constraint, the backward algorithm only has to loop over O(n) triples of the form (j−1, j, n) (with fixed n)—whereas the inside algorithm must loop over O(n³) triples of the form (i, j, k).
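Algorithm 7 (the forward pass) is not reproduced in this excerpt, but per the main text it falls out of Algorithm 6 by algorithmic differentiation. The sketch below (ours; the weight functions g_start, g_trans, g_stop are an assumed encoding of G) transcribes BACKWARD and then differentiates it by hand; the adjoint sweep that results is exactly the classical forward algorithm, with alpha[A, j] = ∂Z/∂β[A^j].

    def backward(w, states, g_start, g_trans, g_stop):      # Algorithm 6
        n = len(w)                                          # w[j-1] is the j-th word
        beta = {}
        for A in states:                                    # stopping rules: A -> w_n
            beta[A, n] = g_stop(A, w[n - 1])
        for j in range(n - 1, 0, -1):                       # transition rules: A -> w_j B
            for A in states:
                beta[A, j] = sum(g_trans(A, w[j - 1], B) * beta[B, j + 1]
                                 for B in states)
        Z = sum(g_start(A) * beta[A, 1] for A in states)    # starting rules; Z = β[ROOT^0]
        return Z, beta

    def forward(w, states, g_start, g_trans, beta):
        # Adjoint sweep of backward(), run in reverse order of its updates.
        n = len(w)
        alpha = {(A, j): 0.0 for A in states for j in range(1, n + 1)}
        for A in states:                                    # adjoint of the Z line
            alpha[A, 1] += g_start(A)
        for j in range(1, n):                               # reverse of "j := n-1 downto 1"
            for A in states:
                for B in states:
                    alpha[B, j + 1] += g_trans(A, w[j - 1], B) * alpha[A, j]
        return alpha
    # Posterior probability that word j is tagged A: alpha[A, j] * beta[A, j] / Z.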
… graph showing the transitions. Every parse (tagging) of w corresponds to a path in Figure 1. Specifically, edge A^j → B^{j+1} in the trellis represents the anchored rule A^j → w_j B^{j+1}, without showing w_j. Similarly, A^n → HALT represents the anchored rule A^n → w_n, without showing w_n, and ROOT → A^1 […] that start with this edge, as a fraction of the total weight β[A^j] of all paths from A^j. (The edges involving ROOT^0 and HALT edges are handled similarly.) Computing the necessary β weights to determine these probabilities is the essential function of the backward algorithm.
References

Steven Abney, David McAllester, and Fernando Pereira. Relating probabilistic grammars and automata. In Proceedings of ACL, pages 542–557, 1999.

S. Aji and R. McEliece. The generalized distributive law. IEEE Transactions on Information Theory, 46(2):325–343, 2000.

J. K. Baker. Trainable grammars for speech recognition. In Jared J. Wolf and Dennis H. Klatt, editors, Speech Communication Papers Presented at the 97th Meeting of the Acoustical Society of America, MIT, Cambridge, MA, June 1979.

Yehoshua Bar-Hillel, M. Perles, and E. Shamir. On formal properties of simple phrase structure grammars. Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung, 14:143–172, 1961. Reprinted in Y. Bar-Hillel (1964), Language and Information: Selected Essays on their Theory and Application, Addison-Wesley, 116–150.

L. E. Baum. An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process. Inequalities, 3, 1972.

Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. Painless unsupervised learning with features. In Proceedings of NAACL, June 2010.

James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: A CPU and GPU math compiler in Python. In Stéfan van der Walt and Jarrod Millman, editors, Proceedings of the 9th Python in Science Conference, pages 3–10, 2010.

S. Billot and B. Lang. The structure of shared forests in ambiguous parsing. In Proceedings of ACL, pages 143–151, April 1989.

Rens Bod. Enriching Linguistics with Statistics: Performance Models of Natural Language. PhD thesis, University of Amsterdam, Academische Pers, Amsterdam, 1995. ILLC Dissertation Series 1995-14.

Zhiyi Chi. Statistical properties of probabilistic context-free grammars. Computational Linguistics, 25(1):131–160, 1999.

Adnan Darwiche. A differential approach to inference in Bayesian networks. Journal of the Association for Computing Machinery, 50(3):280–305, 2003.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. Ser. B, 39(1):1–38, 1977. With discussion.

J. Earley. An efficient context-free parsing algorithm. Communications of the ACM, 13(2):94–102, 1970.

Jason Eisner. Three new probabilistic models for dependency parsing: An exploration. In Proceedings of COLING, pages 340–345, Copenhagen, August 1996.

Jason Eisner. Parameter estimation for probabilistic finite-state transducers. In Proceedings of ACL, pages 1–8, 2002.

Jason Eisner and John Blatz. Program transformations for optimization of parsing algorithms and other weighted logic programs. In Proceedings of the 11th Conference on Formal Grammar, pages 45–85, 2007.

Jason Eisner, Eric Goldlust, and Noah A. Smith. Compiling comp ling: Weighted dynamic programming and the Dyna language. In Proceedings of HLT-EMNLP, pages 281–290, 2005.

Jenny Rose Finkel, Christopher D. Manning, and Andrew Y. Ng. Solving the problem of cascading errors: Approximate Bayesian inference for linguistic annotation pipelines. In Proceedings of EMNLP, pages 618–626, 2006.

Jenny Rose Finkel, Alex Kleeman, and Christopher D. Manning. Efficient, feature-based, conditional random field parsing. In Proceedings of ACL-08: HLT, pages 959–967, June 2008.

Christoph Goller and Andreas Küchler. Learning task-dependent distributed representations by backpropagation through structure. Report AR-95-02, Fakultät für Informatik, Technische Universität München, 1995.

Joshua Goodman. Efficient algorithms for parsing the DOP model. In Proceedings of EMNLP, 1996.

Joshua Goodman. Parsing Inside-Out. PhD thesis, Harvard University, May 1998.

Joshua Goodman. Semiring parsing. Computational Linguistics, 25(4):573–605, December 1999.

Matthew R. Gormley. Graphical Models with Structured Factors, Neural Factors, and Approximation-Aware Training. PhD thesis, Johns Hopkins University, October 2015.

Matthew R. Gormley, Mark Dredze, and Jason Eisner. Approximation-aware dependency parsing by belief propagation. Transactions of the Association for Computational Linguistics, 3:489–501, August 2015.

Andreas Griewank and George Corliss, editors. Automatic Differentiation of Algorithms. SIAM, Philadelphia, 1991.

Frederick Jelinek. Markov source modelling of text generation. In NATO Advanced Study Institute: Impact of Processing Techniques on Communication, pages 569–598. Martinus Nijhoff, 1985.

Frederick Jelinek. Stochastic analysis of structured language modeling. In Mark Johnson, Sanjeev P. Khudanpur, Mari Ostendorf, and Roni Rosenfeld, editors, Mathematical Foundations of Speech and Language Processing, number 138 in IMA Volumes in Mathematics and its Applications, pages 37–71. Springer, 2004.

Mark Johnson, Stuart Geman, Stephen Canon, Zhiyi Chi, and Stefan Riezler. Estimators for stochastic ‘unification-based’ grammars. In Proceedings of ACL, pages 535–549, University of Maryland, 1999.

Tadao Kasami. An efficient recognition and syntax algorithm for context-free languages. Technical report, Air Force Cambridge Research Laboratory, 1965.

Dan Klein and Christopher D. Manning. Parsing and hypergraphs. In Proceedings of the International Workshop on Parsing Technologies (IWPT), 2001.

Terry Koo, Amir Globerson, Xavier Carreras, and Michael Collins. Structured prediction models via the matrix-tree theorem. In Proceedings of EMNLP-CoNLL, pages 141–150, June 2007.

K. Lari and S. Young. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 4:35–56, 1990.

K. Lari and S. Young. Applications of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 5:237–257, 1991.

Yann LeCun. Une procédure d'apprentissage pour réseau à seuil asymétrique (a learning scheme for asymmetric threshold networks). In Proceedings of Cognitiva 85, pages 599–604, Paris, France, 1985.

Zhifei Li and Jason Eisner. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proceedings of EMNLP, pages 40–51, 2009.

Takuya Matsuzaki, Yusuke Miyao, and Jun'ichi Tsujii. Probabilistic CFG with latent annotations. In Proceedings of ACL, pages 75–82, 2005.

David McAllester, Michael Collins, and Fernando Pereira. Case-factor diagrams for structured probabilistic modeling. In Proceedings of UAI, 2004.

Mehryar Mohri. Minimization algorithms for sequential transducers. Theoretical Computer Science, 324:177–201, March 2000.

Mark-Jan Nederhof and Giorgio Satta. Probabilistic parsing as intersection. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT), pages 137–148, April 2003.

Joakim Nivre. An efficient algorithm for projective dependency parsing. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT), pages 149–160, 2003.

Fernando Pereira and Yves Schabes. Inside-outside reestimation from partially bracketed corpora. In Proceedings of ACL, 1992.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1, pages 318–362. MIT Press, 1986.

Ruslan Salakhutdinov, Sam Roweis, and Zoubin Ghahramani. Optimization with EM and expectation-conjugate-gradient. In Proceedings of ICML, Washington, DC, 2003.

Sunita Sarawagi and William W. Cohen. Semi-Markov conditional random fields for information extraction. In Proceedings of NIPS, 2004.

Taisuke Sato. A statistical learning method for logic programs with distribution semantics. In Proceedings of ICLP, pages 715–729, 1995.

Taisuke Sato and Yoshitaka Kameya. New advances in logic-based probabilistic modeling by PRISM. In Probabilistic Inductive Logic Programming, pages 118–155. Springer, 2008.

Yves Schabes. Stochastic lexicalized tree-adjoining grammars. In Proceedings of COLING, 1992.

Stuart Shieber and Yves Schabes. Synchronous tree-adjoining grammars. In Proceedings of COLING, 1990.

David A. Smith and Noah A. Smith. Probabilistic models of nonprojective dependency trees. In Proceedings of EMNLP-CoNLL, pages 132–140, 2007.

Andreas Stolcke. An efficient probabilistic context-free parsing algorithm that computes prefix probabilities. Computational Linguistics, 21(2):165–201, 1995.

Charles Sutton and Andrew McCallum. An introduction to conditional random fields. Foundations and Trends in Machine Learning, 4(4):267–373, 2011.

Richard A. Thompson. Determination of probabilistic grammars for functionally specified probability-measure languages. IEEE Transactions on Computers, C-23(6):603–614, 1974.

Tim Vieira, Ryan Cotterell, and Jason Eisner. Speed-accuracy tradeoffs in tagging with variable-order CRFs and structured sparsity. In Proceedings of EMNLP, Austin, TX, November 2016.
K. Vijay-Shankar, David J. Weir, and Aravind K. Joshi.
Characterizing structural descriptions produced by
various grammatical formalisms. In Proceedings of
ACL, pages 104–111, 1987.
K. Vijay-Shanker and David J. Weir. Parsing some con-
strained grammar formalisms. Computational Linguis-
tics, 19(4):591–636, 1993.
Michael D. Vose. A linear algorithm for generating
random numbers with a given distribution. IEEE
Transactions on Software Engineering, 17(9):972–
975, September 1991.
P. Werbos. Beyond Regression: New Tools for Prediction
and Analysis in the Behavioral Sciences. PhD thesis,
Harvard University, 1974.
Ronald J. Williams and David Zipser. A learning al-
gorithm for continually running fully recurrent neural
networks. Neural Computation, 1(2):270–280, 1989.
D. H. Younger. Recognition and parsing of context-free languages in time n³. Information and Control, 10(2):189–208, February 1967.
G. Zweig and M. Padmanabhan. Exact alpha-beta com-
putation in logarithmic space with application to MAP
word graph construction. In Proceedings of ICSLP,
2000.