
PHYSICAL REVIEW LETTERS 122, 128301 (2019)

Editors' Suggestion Featured in Physics

Random Language Model


E. DeGiuli
Institut de Physique Théorique Philippe Meyer, École Normale Supérieure, PSL University,
Sorbonne Universités, CNRS, 75005 Paris, France

(Received 11 September 2018; revised manuscript received 5 February 2019; published 29 March 2019)

Many complex generative systems use languages to create structured objects. We consider a model of
random languages, defined by weighted context-free grammars. As the distribution of grammar weights
broadens, a transition is found from a random phase, in which sentences are indistinguishable from noise,
to an organized phase in which nontrivial information is carried. This marks the emergence of deep
structure in the language, and can be understood by a competition between energy and entropy.

DOI: 10.1103/PhysRevLett.122.128301

It is a remarkable fact that structures of the most astounding complexity can be encoded into sequences of digits from a finite alphabet. Indeed, the complexity of life is written in the genetic code, with an alphabet {A, T, C, G}, proteins are coded from strings of 20 amino acids, and human-written text is composed in small, fixed alphabets. This "infinite use of finite means" [1] was formalized by Post and Chomsky with the notion of generative grammar [2,3], and has been elaborated upon since, both by linguists and computer scientists [4]. A generative grammar consists of an alphabet of hidden symbols, an alphabet of observable symbols, and a set of rules, which allow certain combinations of symbols to be replaced by others. From an initial start symbol S, one progressively applies the rules until only observable symbols remain; any sentence produced this way is said to be "grammatical," and the set of all such sentences is called the language of the grammar. The sequence of rule applications is called a derivation. For example, the grammar {S → SS, S → (S), S → ()} has a single hidden symbol S and two observable symbols, ( and ), and produces the infinite set of all strings of well-formed parentheses. A simple derivation in this grammar is S → SS → (S)S → (())S → (())(). Besides their original use in linguistics, where the observable symbols are typically taken to be words, and grammars produce sentences [Fig. 1(a)] [3,5], generative grammars have found applications in manifold domains: in the secondary structure of ribonucleic acid (RNA) [Fig. 1(b)] [6,7], in compiler design [4], in self-assembly [8], in protein sequence analysis [9], and in quasicrystals [10], to name a few.

FIG. 1. Illustrative derivation trees for (a) simple English sentence, and (b) the RNA secondary structure (after [6]). The latter is a derivation of the sequence "gacuaagcugaguc" and shows its folded structure. Terminal symbols are encircled.

The complexity of a language is limited by conditions imposed on its grammar, as described by the Chomsky hierarchy, which, in increasing complexity, distinguishes regular, context-free, context-sensitive, and recursively enumerable grammars [11]. Each class of grammar has a characteristic graphical structure of its derivations: regular grammars produce linear derivations, context-free grammars produce trees (Fig. 1), and context-sensitive and recursively enumerable grammars produce more elaborate graphs. Associated with an increase in complexity is an increased difficulty of parsing [4]. Because biological instantiations of grammars must have been discovered by evolution, there is a strong bias toward simpler grammars; we consider context-free grammars (CFGs), which are the lowest order of the Chomsky hierarchy that supports hierarchical structure.

Despite their ubiquity in models of complex generative systems, grammars have hitherto played a minor role in physics, and most known results on grammars are theorems regarding worst-case behavior [12], which need not represent the typical case. Human languages show Zipf's law [13–15], a power-law dependence of word frequency on its rank, and many sequences, including human text, show long-range information-theoretic correlations [16–18], which can be created by a CFG [18]; but are these typical features of some ensemble of grammars? In this work we initiate this research program by proposing and simulating an ensemble of CFGs, so that grammars can be considered as physical systems [19]. We will find that CFGs possess two natural "temperature" scales that control grammar complexity, one at the surface interface, and another in the tree interior. As either of these temperatures is lowered, there is a phase transition, which corresponds to the emergence of nontrivial information propagation. We characterize this phase transition using results from simulations, and understand its location by a balance between energy and entropy.
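To make the generative process concrete, the following is a minimal sketch (not from the paper) of a sampler for the bracket grammar {S → SS, S → (S), S → ()} introduced above. Rules are chosen uniformly at random, and the depth cap is an added assumption of this sketch to keep derivations finite.

```python
import random

# The bracket grammar {S -> SS, S -> (S), S -> ()} from the text:
# one hidden symbol "S", observable symbols "(" and ")".
RULES = [["S", "S"], ["(", "S", ")"], ["(", ")"]]

def expand(symbol, depth=0, max_depth=8):
    """Recursively expand a symbol into observable characters.

    Rules are picked uniformly at random; beyond max_depth the
    terminating rule S -> () is forced so derivations stay finite
    (an assumption of this sketch, not part of the grammar)."""
    if symbol != "S":
        return symbol                      # observable symbol: emit as is
    rule = RULES[2] if depth >= max_depth else random.choice(RULES)
    return "".join(expand(s, depth + 1, max_depth) for s in rule)

if __name__ == "__main__":
    random.seed(0)
    for _ in range(5):
        print(expand("S"))                 # well-formed parenthesis strings
```

Every string produced this way is a sentence of the grammar, and the sequence of recursive calls traces out its derivation tree.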




Generative grammars.—A generative grammar is defined by an alphabet χ and a set of rules R. The alphabet has N hidden, "nonterminal" symbols χ_H, and T observable, "terminal" symbols χ_O. The most general rule is of the form a_1 a_2 … a_n → b_1 b_2 … b_m, where a_i ∈ χ_H, b_i ∈ χ = χ_H ∪ χ_O. In a CFG the rules are specialized to the form a_1 → b_1 b_2 … b_m, and we will insist that m ≥ 1, so that there is no "empty" string. Without loss of generality, we consider CFGs in the Chomsky normal form, in which case all rules are of the form [4] a → bc or a → A, where a, b, c ∈ χ_H and A ∈ χ_O. Note that we may have b = a, or b = c, or a = b = c. Any derivation in the Chomsky reduced form can be drawn on a binary tree. Beginning from the start symbol S ∈ χ_H, rules are applied until the string contains only observable symbols. Such a string is called a sentence. The set of all sentences is the language of the grammar. Given a string of observables S = A_1…A_l and a grammar G, one can ask whether there exists a derivation that produces S from the start symbol S; if so, S is said to be grammatical.

A formal grammar as defined above can only distinguish grammatical from ungrammatical sentences. A richer model is obtained by giving each rule a non-negative real-valued weight. Such a weighted grammar is useful in applications, because weights can be continuously driven by a learning process, and can be used to define probabilities of parses. Moreover, a weighted grammar can be put into the Gibbs form, as shown below. For CFGs, to every rule of the form a → bc we assign a weight M_abc, and to every rule of the form a → A we assign a weight O_aA.

Each candidate derivation of a sentence has two different types of degrees of freedom. There is the topology 𝒯 of the tree, namely the identity (terminal or nonterminal) of each node, as well as the variables, both terminal and nonterminal, on the nodes. We write Ω_𝒯 for the set of internal factors, i.e., factors of the form a → bc, and ∂Ω_𝒯 for the boundary factors, i.e., those associated to a → A rules. The number of boundary factors is written l_𝒯, which is also the number of leaves. Since derivations are trees, the number of internal factors is l_𝒯 − 1. We will write σ for nonterminal symbols, and o for terminals; these can be enumerated in an arbitrary way 1, …, N and 1, …, T, respectively. Given 𝒯, we can write σ_i for the value of the nonterminal on site i, and similarly o_j for the terminal on site j. The number of σ_i is 2l_𝒯 − 1, while the number of o_j is l_𝒯. We write G for the pair M, O, σ for {σ_i}, and o for {o_t}.

To define a probability measure on derivations, it is convenient to factorize it into the part specifying 𝒯, and the remainder. In this way, we separate the tree shape from the influence of the grammar on variables. For a fixed 𝒯, the weight of a configuration is

    $W(\sigma, o \mid \mathcal{T}, G) = \prod_{\alpha \in \Omega_{\mathcal{T}}} M_{\sigma_{\alpha_1} \sigma_{\alpha_2} \sigma_{\alpha_3}} \prod_{\alpha \in \partial\Omega_{\mathcal{T}}} O_{\sigma_{\alpha_1} o_{\alpha_2}}$,   (1)

where each α = (α_1, α_2, α_3) is a factor in the order σ_{α_1} → σ_{α_2} σ_{α_3}. Note that M_abc ≠ M_acb in general; thus the left and right branches are distinguished [20]. We can write W = e^{−E} with

    $E = -\sum_{a,b,c} \pi_{abc}(\sigma) \log M_{abc} - \sum_{a,B} \rho_{aB}(\sigma, o) \log O_{aB}$,   (2)

where π_abc is the number of times the rule a → bc appears in the configuration σ, and likewise, ρ_aB is the number of times the rule a → B appears. This defines a conditional probability measure on configurations P(σ, o | 𝒯, G) = e^{−E(σ,o|𝒯,G)}/Z(𝒯, G), where

    $Z(\mathcal{T}, G) = \sum_{\{\sigma_i, o_t\}} e^{-E(\sigma, o \mid \mathcal{T}, G)}$.   (3)

All configurations have S at the root node. For simplicity, in this work we consider as a model for the tree topology probability P(𝒯|G) = W_tree/Z_tree with W_tree(𝒯) = p^{|∂Ω_𝒯|}(1 − p)^{|Ω_𝒯|}, where p is the emission probability, the probability that a hidden node becomes an observable node. p controls the size of trees; we will choose it such that the tree size distribution is cut off above a length ξ = 1000. Some facts about the resulting binary trees are recorded in the Supplemental Material (SM) [22].

A model with weights of the form of Eq. (1) is called a weighted CFG (WCFG). In the particular case where 1 = ∑_{b,c} M_abc = ∑_A O_aA for all a, it is easy to see that M and O are conditional probabilities: M_abc = P(a → bc | a → nonterminal) and O_aA = P(a → A | a → terminal). In this case the model is called a probabilistic CFG (PCFG). In the main text, we consider a weighted CFG, model W; in the SM, we show that our results are robust in model P, a PCFG. There are tradeoffs between these models: model P is easier to sample, because it has Z(𝒯, G) = 1 from normalization of probability, and thus is factorized. But model W is more amenable to theory, since it is less constrained.
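As an illustration of this factorized measure, here is a sketch of a sentence sampler for model P, in which the rows of M and O are conditional probabilities and each hidden node emits a terminal with probability p, consistent with W_tree(𝒯) = p^{|∂Ω_𝒯|}(1 − p)^{|Ω_𝒯|}. The variable names, the uniform random placeholder weights, and the safety budget are assumptions of this sketch, not the paper's own sampling procedure (which is described in the SM [22]).

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, p = 5, 7, 0.55      # hidden symbols, terminal symbols, emission probability

# Model P: rows of M and O are conditional probabilities,
#   M[a, b, c] = P(a -> b c | a branches),  O[a, A] = P(a -> A | a emits).
# Uniform random weights are used here purely as placeholders.
M = rng.random((N, N, N))
M /= M.sum(axis=(1, 2), keepdims=True)
O = rng.random((N, T))
O /= O.sum(axis=1, keepdims=True)

def sample_sentence(a=0, budget=2000):
    """Sample one sentence (list of terminal indices) from hidden symbol a.

    Each hidden node becomes a leaf with probability p (rule a -> A drawn
    from O[a]); otherwise it branches with a rule a -> b c drawn from M[a].
    The budget is a recursion cap added in this sketch."""
    if budget <= 0 or rng.random() < p:
        return [int(rng.choice(T, p=O[a]))]
    bc = int(rng.choice(N * N, p=M[a].ravel()))
    b, c = divmod(bc, N)
    return sample_sentence(b, budget // 2) + sample_sentence(c, budget // 2)

print(sample_sentence())   # e.g. a short list of terminal indices
```

With p > 1/2 the branching process terminates almost surely, which is why the emission probability controls the size of the trees.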


Random language model.—Each grammar defines probabilities for sentences. To extract the universal properties of grammars, which do not depend on all details of M and O, we need a measure on the space of grammars. What is an appropriate measure? From Eq. (2), log M and log O are analogous to coupling constants in statistical mechanics. A simple model is to assume a Gaussian distribution for these, so that M and O are lognormal. This can be motivated as follows: language evolution is a dynamical process, which must be slow in order for language to remain comprehensible at any given moment. If each log M_abc and log O_aB is the accumulation of independent, additive increments [25], these will lead to a lognormal. We define deep and surface sparsities as, respectively,

    $s_d = \frac{1}{N^3} \sum_{a,b,c} \log^2\!\left(\frac{M_{abc}}{\bar{M}}\right), \qquad s_s = \frac{1}{NT} \sum_{a,B} \log^2\!\left(\frac{O_{aB}}{\bar{O}}\right)$,   (4)

where M̄ = 1/N² and Ō = 1/T are the corresponding uniform probabilities; it is convenient to use this normalization even for model W, where weights are not strictly normalized. A lognormal distribution of grammar weights is

    $P_G(M, O) \equiv Z_G^{-1}\, J\, e^{-\epsilon_d s_d - \epsilon_s s_s}$,   (5)

where $J = e^{-\sum_{a,b,c} \log M_{abc} - \sum_{a,B} \log O_{aB}}$, and the space of M and O is defined by appropriate normalization and positivity constraints. We define the random language model (RLM) as the ensemble of grammars drawn from Eq. (5).

An alternative motivation of Eq. (5) is that this is the maximum-entropy measure when the grammar averages s̄_d and s̄_s are constrained. s_d and s_s measure the density of rules about their respective median values M̄ and Ō. When s_d and s_s are finite, all rules must have a finite probability: this reflects the fact that, given any finite amount of data, one can only put a lower bound on the probability of any particular rule. In model W the Lagrange multipliers ϵ_d and ϵ_s satisfy

    $s_d = \frac{N^3}{2\epsilon_d}, \qquad s_s = \frac{NT}{2\epsilon_s}$.   (6)

When ϵ_d → ∞, s̄_d → 0, which is the value corresponding to a completely uniform deep grammar, that is, when for a nonterminal a, all rules a → bc have the same probability 1/N². This is clearly the limit in which the grammar carries no information. As ϵ_d is lowered, s_d increases, and the grammar carries more information. In terms of how deterministic the rules are, ϵ_d plays the role of temperature, with random ↔ hot and deterministic ↔ cold; we will refer to it as the deep temperature. This analogy can also be seen formally: in the SM, we show that if the energy E is replaced by βE, then Eq. (6) is replaced by s_d = β²N³/(2ϵ_d), such that lowering ϵ_d is equivalent to increasing β. Similarly, ϵ_s controls information transmission at the surface; we call it the surface temperature.

To investigate the role of ϵ_d on language structure, we sampled grammars from the RLM at fixed values T = 27, ϵ_s/(NT) = 0.01. Since the surface sparsity is large, there is already some simple structure at the surface; we will explore how deep structure emerges as N and ϵ_d are varied. For each value of N and ϵ_d, we created 120 distinct grammars, from which we sample 200 sentences (see SM for more details). Altogether, approximately 7200 distinct languages were constructed.

The information content of a grammar G is naturally encoded by Shannon entropies. For a sequence o_1, o_2, …, o_k the Shannon block entropy rate is

    $H_s(G, k) = \frac{1}{k} \left\langle \log \frac{1}{P(o_1, o_2, \ldots, o_k \mid G)} \right\rangle$.   (7)

For CFGs, we can also consider the block entropy rate of deep configurations,

    $H_d(G, k) = \frac{1}{k} \left\langle \log \frac{1}{P(\sigma_1, \sigma_2, \ldots, \sigma_k \mid G)} \right\rangle$,   (8)

where the symbols are taken from a (leftmost) derivation. In both cases the ensemble average is taken with the actual probability of occurrence, P(o|G) for H_s, and P(σ|G) for H_d.

The grammar averages H̄_d(k) and H̄_s(k) are shown in Fig. 2, for k as indicated; here and in the following, the bars show the 20th and 80th percentiles, indicating the observable range of H_d and H_s over the ensemble of grammars [26]. The dependence on ϵ_d is striking: for ϵ_d ≳ N³/log²N, both H̄_s(1) and H̄_d(1) are flat. In this regime, H̄_d(1) ≈ log N, indicating that although configurations strictly follow the rules of a WCFG, deep configurations are nearly indistinguishable from completely random configurations. However, at ϵ_d = ϵ* ≈ N³/log²N there is a pronounced transition, and both entropies begin to drop. This transition corresponds to the emergence of deep structure.

FIG. 2. Shannon entropy of random CFGs as functions of ϵ̃_d = ϵ_d/N³. (a) Block entropy of hidden configurations for indicated k and N. (b) Block entropy of observed strings; symbols as in (a). The constant value for ϵ_d > ϵ* depends on the surface temperature ϵ_s. Bars indicate 20th and 80th percentiles.

The first block entropy H_d(G, 1) measures information in the single-character distribution, while the differential entropies δH_d(G, k) = (k + 1)H_d(G, k + 1) − kH_d(G, k) measure incremental information in the higher-order distributions [17]. The Shannon entropy rate including all correlations can either be obtained from lim_{k→∞} H_d(G, k), or from lim_{k→∞} δH_d(G, k). These coincide, but the latter converges faster [17]. In the SM, we show that δH_d(G, k), and thus the limiting rate, appears to collapse with ϵ̃_d log N. For all entropies, the sample-to-sample fluctuations decrease rapidly with k, suggesting that the limiting rates are self-averaging.

To further investigate the nature of the transition, we show in Fig. 3(a) a Zipf plot: the frequency of each symbol, arranged in decreasing order. Figure 3(a) shows the Zipf plot for deep structure; the Zipf plot for surface structure is similar, but less dramatic (see SM). We see a sharp change at ϵ*: for ϵ_d > ϵ*, the frequencies of hidden symbols are nearly uniform, while below ϵ*, the distribution is closer to exponential (in the SM, we show that a power-law regime for the observable symbols appears when T is large). The permutation symmetry among hidden symbols is thus spontaneously broken at ϵ*.
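Because the Jacobian factor J in Eq. (5) converts the measure on M, O into one on log M, log O, drawing a grammar from the RLM amounts, up to the normalization and positivity constraints noted above (which this sketch ignores), to drawing Gaussian log-weights about log M̄ and log Ō with variances N³/(2ϵ_d) and NT/(2ϵ_s). The sketch below, with names of my own choosing, does this and checks that the sample sparsities of Eq. (4) come out near the values of Eq. (6).

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_rlm_grammar(N, T, eps_d, eps_s):
    """Draw model-W weights from the RLM prior, Eq. (5).

    With the Jacobian J, log(M/Mbar) and log(O/Obar) are Gaussian with
    variances N^3/(2 eps_d) and N*T/(2 eps_s); the normalization and
    positivity constraints of the text are ignored in this sketch."""
    Mbar, Obar = 1.0 / N**2, 1.0 / T
    logM = np.log(Mbar) + rng.normal(0.0, np.sqrt(N**3 / (2 * eps_d)), (N, N, N))
    logO = np.log(Obar) + rng.normal(0.0, np.sqrt(N * T / (2 * eps_s)), (N, T))
    return np.exp(logM), np.exp(logO)

def sparsities(M, O):
    """Deep and surface sparsities s_d, s_s of Eq. (4)."""
    N, T = M.shape[0], O.shape[1]
    s_d = np.sum(np.log(M * N**2) ** 2) / N**3
    s_s = np.sum(np.log(O * T) ** 2) / (N * T)
    return s_d, s_s

N, T = 10, 27
eps_s = 0.01 * N * T                     # eps_s/(NT) = 0.01, as in the text
for eps_d in (1e5, 1e4, 1e3):            # lowering the deep temperature
    M, O = sample_rlm_grammar(N, T, eps_d, eps_s)
    print(eps_d, sparsities(M, O), N**3 / (2 * eps_d), N * T / (2 * eps_s))
```

To generate sentences from such a grammar one could, for instance, normalize M and O row-wise and reuse the model-P sampler sketched earlier; the paper's own sampling of model W follows the methods described in the SM [22,23].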

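The block entropies of Eqs. (7) and (8) can be approximated by a simple plug-in estimate over k-grams of sampled surface or deep strings. This naive estimator is my own sketch, not the estimator of Ref. [17] (which converges faster), and is adequate only for small k and many samples.

```python
from collections import Counter
import math

def block_entropy_rate(strings, k):
    """Plug-in estimate of the block entropy rate of Eq. (7) or (8):
    (1/k) <log 1/P(x_1..x_k)>, with P estimated from k-gram counts."""
    counts = Counter()
    for s in strings:
        for i in range(len(s) - k + 1):
            counts[tuple(s[i:i + k])] += 1
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values()) / k

# Tiny check: an alternating binary string has H(1) = log 2 ~ 0.69.
print(block_entropy_rate([[0, 1, 0, 1, 0, 1, 0, 1]], 1))
# With the samplers sketched above:
# sentences = [sample_sentence() for _ in range(200)]
# print(block_entropy_rate(sentences, 1), block_entropy_rate(sentences, 2))
```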

What is the correct order parameter to describe this transition? The ferromagnetic order parameter is m_r = ⟨Nδ_{σ_i,r} − 1⟩, where i is a site. This does not show any signal of a transition, despite the fact that the start symbol explicitly breaks the replica symmetry. A more interesting choice is one of Edwards-Anderson type, such as Q^{EA}_{rs} = ⟨Nδ_{σ_i,r} − 1⟩⟨Nδ_{σ_i,s} − 1⟩, where r and s label different sentences produced from the same grammar, and σ_i is a specified site [27]. However, sentences produced by a CFG do not have fixed derivation trees, so we need to compare symbols in relative position. For each interior rule a → bc we can define

    $Q_{abc}(G) = \left\langle \delta_{\sigma_{\alpha_1}, a} \left( N^2 \delta_{\sigma_{\alpha_2}, b}\, \delta_{\sigma_{\alpha_3}, c} - 1 \right) \right\rangle$,   (9)

averaged over all interior vertices α, and averaged over derivations. Here, σ_{α_1} is the head symbol at vertex α, and σ_{α_2}, σ_{α_3} are the left and right symbols, respectively. Q measures patterns in rule application at each branching of a derivation tree. It is thus an order parameter for deep structure. Upon averaging over grammars in the absence of any fields, the permutation symmetry must be restored: Q̄_abc = q_0 + δ_ab q_l + δ_ac q_r + δ_bc q_h + δ_ab δ_ac q_*. As shown in the SM, these components show a transition, but there is significant noise below ϵ*, despite there being 120 replicas at each point. Evidently, Q_abc has large fluctuations below ϵ*. This suggests a definition Q² ≡ ∑_{a,b,c} Q²_abc, plotted in Fig. 3(b). The signal is clear: on the large scale, Q² has a scaling form Q² ≈ N³ f(ϵ_d/ϵ*) and is small above ϵ*. The scaling Q² ∼ N³ suggests that below the transition, all hidden symbols start to carry information in the deep structure.

FIG. 3. (a) Zipf plot of hidden symbols for N = 40. Here ϵ̃_d = ϵ_d/N³. (b) Order parameter Q², with bars indicating 20th and 80th percentile ranges over grammars at each parameter value. Inset: same plot in log-log axes.

Theory.—How can we gain some theoretical insight into the RLM? Consider the entropy of an observed string of length l, composed of n sentences of length l_k, ∑_k l_k = l. The entropy of this string derives from three distinct combinatorial levels: (i) each sentence can be represented by a derivation tree with many different topologies, (ii) each derivation tree can host a variety of internal hidden variables, and (iii) given the hidden variables, the observed symbols can themselves vary.

Some scaling considerations are useful. Each derivation tree can have many topologies: the entropy of binary trees scales as l_k log 4, so that the total tree entropy scales as S_t ∼ l log 4. Each derivation tree has 2l_k − 1 hidden variables, so that the total number of hidden d.o.f. is 2l − n, and the corresponding deep entropy scales as S_d ∼ (2l − n) log N. Finally, the sentences have an entropy S_o ∼ l log T.

We see that when typical sentences are of length ⟨l⟩ ≫ 1, so that l − n ∼ l, these numbers are independent of partitioning, to the leading order. For large ⟨l⟩, we get the scaling S ∼ l log(4N²T).

This must be compared with the "energetic" terms log W_tree = (l − n) log(1 − p) + l log p ∼ −2l log 2 for p near 1/2, and E, Eq. (2). In E, π is positively correlated with M, since rules with a higher weight are more frequently used; hence we can obtain a simple scaling estimate E ∼ −N³ π̄ log m − NT ρ̄ log o, where π̄ is the mean value of π_abc, log m is the value of a typical positive fluctuation of log M_abc, and similarly for O. From the sum rules ∑_{a,b,c} π_abc = |Ω| = l − n and ∑_{a,B} ρ_aB = |∂Ω| = l we have π̄ = (l − n)/N³, ρ̄ = l/(NT). The mean value of log M_abc is log M̄, and the mean value of log O_aB is log Ō. These contributions lead to a constant value of E. The positive fluctuations in log M and log O that couple to E scale as √(N³/(2ϵ_d)) and √(NT/(2ϵ_s)), respectively, leading to

    $E \sim -l \sqrt{\frac{N^3}{2\epsilon_d}} - l \sqrt{\frac{NT}{2\epsilon_s}} + \mathrm{const}$.   (10)

Combining this with S, the effective free energy F = E − log W_tree − S reflects a competition between energy and entropy. If we consider N and ϵ_d as varying, then there is a scale ϵ* = N³/log²N where the energetic fluctuations balance entropy (balancing the deep term l√(N³/(2ϵ_d)) against the deep entropy ∼ l log N gives this scale, up to constant factors). For ϵ_d ≫ ϵ*, the energy of a configuration is unimportant, and the grammar is thus irrelevant: the language produced by the WCFG must then be indistinguishable from random sequences, as found empirically above. In contrast, for ϵ_d ≪ ϵ*, the language reflects those sequences with high intrinsic weight, and their entropy is less important. The characteristic scale ϵ* identified by these simple arguments agrees with that found empirically above, and locates the emergence of deep structure. However, further work is needed to predict the behavior of Q², H_s, and H_d.
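Since the average in Eq. (9) runs over interior vertices, Q_abc reduces to N² f_abc − f_a, where f_abc is the empirical frequency of the branching a → bc among interior vertices and f_a that of the head symbol a. Below is a sketch (my own, not the paper's analysis code) of the corresponding estimator; the assumed input is a pooled list of branchings collected from sampled derivations.

```python
import numpy as np

def q_squared(branchings, N):
    """Estimate Q^2 = sum_abc Q_abc^2 from pooled interior branchings.

    Each branching is a triple (a, b, c) of hidden-symbol indices for a
    rule a -> b c. The empirical version of Eq. (9) is
    Q_abc = N^2 * f_abc - f_a, with f_abc the frequency of the triple
    and f_a that of the head symbol a."""
    f_abc = np.zeros((N, N, N))
    for a, b, c in branchings:
        f_abc[a, b, c] += 1.0
    f_abc /= len(branchings)
    f_a = f_abc.sum(axis=(1, 2))
    Q = N**2 * f_abc - f_a[:, None, None]
    return float(np.sum(Q**2))

rng = np.random.default_rng(3)
N = 5
# A grammar that always applies the same rule 0 -> 1 2 gives a large Q^2;
# uniformly random branchings give a much smaller value.
print(q_squared([(0, 1, 2)] * 10_000, N))
print(q_squared([tuple(rng.integers(0, N, 3)) for _ in range(10_000)], N))
```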


Learning human languages.—Around 6000 languages are spoken around the world [28]; given fractured and highly sparse input, how does a child come to learn the precise syntax of one of these many languages? This question has a long history in linguistics and cognitive science [29,30]. One scenario for learning is known as the Principles and Parameters (P&P) theory [31]. This posits that the child is biologically endowed with a general class of grammars, the "principles," and by exposure to one particular language, fixes its syntax by setting some number of parameters, assumed to be binary. For example, the head-directionality parameter controls whether verbs come before or after objects, as in English and Japanese, respectively. A vast effort has been devoted to mapping out the possible parameters of human languages [28,32]. The richness of the discovered structure has been used as criticism of the approach [33]: if the child needs to set many parameters, then do these all need to be innate? This would be a heavy evolutionary burden, and a challenge to efficient learning.

The RLM can shed some light on this debate. First, since only two living human languages are known to possess syntax beyond CFG [34], we consider WCFGs a valid starting point [37]. Following experimental work [30], we picture the learning process as follows. Initially, the child does not know the rules of the grammar, so it begins with some small number of hidden symbols and assigns uniform values to the weights M and O. To learn is to increase the likelihood of the grammar by adjusting the weights and adding new hidden symbols. As weights are driven away from uniform values, the temperatures ϵ_d and ϵ_s decrease. Eventually, the transition to deep structure is encountered, and the grammar begins to carry information.

In the absence of any bias, this transition would occur suddenly and dramatically, spontaneously breaking all N³ directions in M space simultaneously, as in Fig. 3(b). However, in realistic child language learning, the child's environment acts as a field on this likelihood ascent, and can cause the structure-emerging transitions to occur at different critical deep temperatures, depending on their coupling to the field. For example, a left-right symmetry breaking could correspond to setting the head-directionality parameter.

Although this description is schematic, we insist that the various symmetry-breaking transitions, which could give rise to parameters, are emergent properties of the model. Thus, if there are indeed many parameters to be set, these do not all need to be innate: the child only needs the basic structure of a WCFG, and the rest is emergent. The P&P theory is thus consistent with the existence of many parameters. If the RLM can be solved, by which we mean that the partition function Z can be computed, then the series of symmetry-breaking transitions that occur in the presence of a field can be inferred, and a map of syntax in CFGs could be deduced. This is a tantalizing goal for future work.

Conclusion.—We introduced a model of random languages, which captures the generative aspect of complex systems. The model has a transition in parameter space that corresponds to the emergence of deep structure. Since the interaction is long-range, we expect that the RLM, or a variant, is exactly solvable. We hope that this will be clarified in the future.

This work benefited from discussions with C. Callan, J. Kurchan, G. Parisi, R. Monasson, G. Semerjian, P. Urbani, F. Zamponi, A. Zee, and Z. Zeravcic.

[1] W. von Humboldt, On Language: On the Diversity of Human Language Construction and Its Influence on the Mental Development of the Human Species (Cambridge University Press, Cambridge, England, 1999).
[2] E. L. Post, Formal reductions of the general combinatorial decision problem, Am. J. Math. 65, 197 (1943).
[3] N. Chomsky, Syntactic Structures (Walter de Gruyter, Berlin, 2002).
[4] J. E. Hopcroft, R. Motwani, and J. D. Ullman, Introduction to Automata Theory, Languages, and Computation, 3rd ed. (Pearson, Boston, MA, 2007).
[5] N. Chomsky, Aspects of the Theory of Syntax (MIT Press, Cambridge, Massachusetts, 2014), Vol. 11.
[6] D. B. Searls, The language of genes, Nature (London) 420, 211 (2002).
[7] B. Knudsen and J. Hein, Pfold: RNA secondary structure prediction using stochastic context-free grammars, Nucleic Acids Res. 31, 3423 (2003).
[8] E. Winfree, X. Yang, and N. C. Seeman, Universal computation via self-assembly of DNA: Some theory and experiments, in DNA Based Computers II, DIMACS Series in Discrete Mathematics and Theoretical Computer Science Vol. 44 (American Mathematical Society, Providence, RI, 1999), p. 191.
[9] J. P. Barton, A. K. Chakraborty, S. Cocco, H. Jacquin, and R. Monasson, On the entropy of protein families, J. Stat. Phys. 162, 1267 (2016).
[10] J. G. Escudero, Formal languages for quasicrystals, in Symmetries in Science IX (Springer, Boston, 1997), pp. 139–152.
[11] M. A. Nowak, N. L. Komarova, and P. Niyogi, Computational and evolutionary aspects of language, Nature (London) 417, 611 (2002).
[12] For example, from Ref. [4], Theorem 7.17 on the size of derivation trees, Theorem 7.31 on the conversion of an automaton to a CFG, and Theorem 7.32 on the complexity of conversion to the Chomsky normal form (see below).
[13] G. K. Zipf, The Psycho-Biology of Language: An Introduction to Dynamic Philology (Routledge, Milton Park, 2013).
[14] R. Ferrer i Cancho and R. V. Solé, Least effort and the origins of scaling in human language, Proc. Natl. Acad. Sci. U.S.A. 100, 788 (2003).


[15] A. Corral, G. Boleda, and R. Ferrer-i-Cancho, Zipf's law for word frequencies: Word forms versus lemmas in long texts, PLoS One 10, e0129031 (2015).
[16] W. Ebeling and T. Pöschel, Entropy and long-range correlations in literary English, Europhys. Lett. 26, 241 (1994).
[17] T. Schürmann and P. Grassberger, Entropy estimation of symbol sequences, Chaos 6, 414 (1996).
[18] H. W. Lin and M. Tegmark, Critical behavior in physics and probabilistic formal languages, Entropy 19, 299 (2017).
[19] G. Parisi, Complex systems: A physicist's viewpoint, Physica (Amsterdam) 263A, 557 (1999).
[20] Indeed, if the left-right branches are not distinguished, CFGs do not have any more expressive power than regular grammars [21].
[21] J. Esparza, P. Ganty, S. Kiefer, and M. Luttenberger, Parikh's theorem: A simple and direct automaton construction, Inf. Process. Lett. 111, 614 (2011).
[22] See Supplemental Material at http://link.aps.org/supplemental/10.1103/PhysRevLett.122.128301, which includes details on binary trees, sampling methods, robustness in PCFG, differential entropies, and equation derivations, and Refs. [23,24].
[23] S. Chib and E. Greenberg, Understanding the Metropolis-Hastings algorithm, Am. Stat. 49, 327 (1995).
[24] P. Flajolet and R. Sedgewick, Analytic Combinatorics (Cambridge University Press, Cambridge, England, 2009).
[25] D. Sornette and R. Cont, Convergent multiplicative processes repelled from zero: Power laws and truncated power laws, J. Phys. I (France) 7, 431 (1997).
[26] The error bars in measurements are then smaller by a factor of approximately √120 ≈ 11.
[27] D. J. Gross, I. Kanter, and H. Sompolinsky, Mean-Field Theory of the Potts Glass, Phys. Rev. Lett. 55, 304 (1985).
[28] M. C. Baker, The Atoms of Language: The Mind's Hidden Rules of Grammar (Basic Books, New York, 2008).
[29] R. C. Berwick, P. Pietroski, B. Yankama, and N. Chomsky, Poverty of the stimulus revisited, Cogn. Sci. 35, 1207 (2011).
[30] C. Yang, S. Crain, R. C. Berwick, N. Chomsky, and J. J. Bolhuis, The growth of language: Universal Grammar, experience, and principles of computation, Neurosci. Biobehav. Rev. 81, 103 (2017).
[31] N. Chomsky, Lectures on Government and Binding: The Pisa Lectures (Walter de Gruyter, Berlin, 1993), Vol. 9.
[32] U. Shlonsky, The cartographic enterprise in syntax, Lang. Linguist. Compass 4, 417 (2010).
[33] G. Ramchand and P. Svenonius, Deriving the functional hierarchy, Lang. Sci. 46, 152 (2014).
[34] Only Swiss-German and Bambara have confirmed features beyond CFG [35,36].
[35] C. Culy, The complexity of the vocabulary of Bambara, Linguistics and Philosophy 8, 345 (1985).
[36] S. M. Shieber, Evidence against the context-freeness of natural language, in Philosophy, Language, and Artificial Intelligence (Springer, Cambridge, 1985), pp. 79–89.
[37] Note also that some lexicalized models used for machine learning, such as [38], are WCFGs with multi-indexed hidden variables.
[38] M. Collins, Head-driven statistical models for natural language parsing, Computational Linguistics 29, 589 (2003).

