Abstract
This paper describes in detail an algorithm for the unsupervised learning of natural lan-
guage morphology, with emphasis on challenges that are encountered in languages typolog-
ically similar to European languages. It utilizes the Minimum Description Length analysis
described in Goldsmith (2001), and has been implemented in software that is available for
downloading and testing.
This paper describes in detail an algorithm used for the unsupervised learning of
natural language morphology which works well for European languages and other
languages in which the average number of morphemes per word is not too high.
It has been implemented and tested in Linguistica, and is based on the theoretical
principles described in Goldsmith (2001).1 The executable for this program and
its source code are available at http://linguistica.uchicago.edu.
Section 2 of this paper gives a brief overview of the theory that lies behind this
work; sections 3 through 10 describe the algorithm in considerable detail.
Section 11 presents an evaluation of the algorithm in an application to a
corpus of English, and section 12 addresses briefly the theoretical implications of
work on the unsupervised learning of linguistic structure more generally.
1 I am grateful for comments, suggestions, and criticisms from Yu Hu, Irina Matveeva,
Colin Sprague, Jeni Parham, and the anonymous reviewers of this journal.
2 John Goldsmith
2 There is no natural home in the analysis presented in this paper for the distinction
between inflectional and derivational morphology. This question is addressed, however,
in Goldsmith and Hu (2005), in which an analysis of the distinction is offered in terms
of the geometry of a finite state automaton for the morphology.
An Algorithm for the Unsupervised Learning of Morphology 3
plus the length of the optimal compression of the corpus, when we use the proba-
bilistic model to compress the data (see (1)). The length of the optimal compression
of the corpus is the base 2 logarithm of the reciprocal of the probability assigned to
the corpus by the model; we return to this notion (a standard one in information
theory) below. Since we are concerned with morphological analysis, I will hence-
forth use the more specific term the morphology rather than model; one can read
the M in (1) as referring specifically to a morphology.
(1) DescriptionLength(Corpus C, Model M ) = length(M ) + log2 (1 / prob(C|M ))
MDL analysis proposes that the morphology M which minimizes the objective
function in (1) is the best morphology of the corpus. Intuitively, the first term (the
length of the model, in bits) expresses the conciseness of the morphology, giving us
strong motivation to find the simplest morphology possible, while the second term
expresses how well the model describes the corpus in question. The morphology
M spreads probability mass over a wide universe of possible words (by assigning
a probability to all possible words in the language, and by being subject to the
requirement that the probabilities sum to 1), and we want one that assigns as
much of it as possible to the words of the particular corpus which we happen to be
looking at. Instead of considering the probability of the corpus, we consider the log
of the reciprocal of that probability, because this is a quantity which is expressible
in information theoretic bits, and which can then be added to the first term in (1);
that is, by multiplying the log probability of the corpus by −1, we can reasonably
add the two terms and attempt to find the analysis which minimizes the sum of
the two terms. Hence the term: minimum description length.
Thus we need to design a morphology M which assigns a distribution D over
words such that the observed words in the corpus lie in the support of D (the set
to which D assigns non-zero probability), and we need to do this in a way which
allows us to easily calculate the length of M.
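The computation in (1) can be sketched directly in a few lines; the uniform toy model and the function names below are illustrative stand-ins of ours, not the morphology model developed later in the paper:

```python
import math

def description_length(model_length_bits, corpus, word_prob):
    """Description length per equation (1): length(M) plus the optimal
    compressed length of the corpus, log2(1 / prob(C|M)), where the
    corpus probability is the product of the word probabilities."""
    compressed_bits = sum(math.log2(1.0 / word_prob(w)) for w in corpus)
    return model_length_bits + compressed_bits

# Toy model: a uniform distribution over a four-word vocabulary,
# so every word costs log2(4) = 2 bits to encode.
vocabulary = {"jump", "jumps", "walk", "walks"}
uniform = lambda w: 1.0 / len(vocabulary)
corpus = ["jump", "walks", "jump", "walk"]
dl = description_length(100.0, corpus, uniform)  # 100 + 4*2 = 108.0 bits
```

Minimizing this sum over candidate morphologies is exactly the MDL objective: a smaller model term rewards conciseness, a smaller compression term rewards fit to the corpus.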
And what is the information that composes the morphology of a language such as
regarding the ordering of possible morphemes in the language. We condense all of this
information into essentially three components of the morphology: a list of stems, a list of
affixes, and a list of signatures, which are structures indicating which stems may appear
with which affixes.
(2) (a) { crawl, jump, walk } { NULL, ed, ing, s }
(b) [Figure: a list of Stems (jump, climb, walk, dog, crawl, book, stand, ...)
connected by pointers to a list of Suffixes (NULL, ing, ed, s, ship, dom)]
As the structure in (2b) suggests, the role of pointers in the construction of the
formal morphology is critical. We must ask precisely how long a pointer is in such
a diagram, and we must get an answer to that question expressed in units of bits.
Information theory provides an answer to this question, or rather, to the question:
what is the shortest encoding system that we can set up for pointers in such a
situation, measured in bits? The answer is the base 2 logarithm of the reciprocal
of the frequency of the item being pointed to, or − log2 f req(•) (see, e.g., Bell et
al. (1990)). Intuitively, this means that up to this limit, we can find an encoding
that allows frequently used items to be more easily accessed, if by “easily” we
mean pointed to in a more concise fashion. It is possible to quite literally encode a
pointer to an object X by a string of binary digits no larger than 1 greater than the
base 2 logarithm of the reciprocal of the frequency of X, and thus this quantity is
often referred to simply as the “length of the pointer” to X. When we speak of the
“length” of a pointer, then, one may paraphrase that as the length of the optimal
encoding of the pointer.
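This optimal pointer length is straightforward to compute; the function below is an illustrative sketch (the name is ours, not Linguistica's):

```python
import math

def pointer_length_bits(count, total):
    """Optimal length in bits of a pointer to an item that accounts for
    `count` of `total` uses: -log2(frequency), as discussed above."""
    return -math.log2(count / total)

# Frequent items get short pointers, rare items long ones:
assert pointer_length_bits(500, 1000) == 1.0   # frequency 1/2 -> 1 bit
assert pointer_length_bits(125, 1000) == 3.0   # frequency 1/8 -> 3 bits
```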
3 An extension of this heuristic is discussed in Hu and Goldsmith (2005).
letters which appear immediately after word-initial “a”, etc. Such data structures
are widely used today, and will be familiar to most readers of this paper, especially
in the context of a Patricia trie, a trie in which all nodes with only unary branching
beneath them are merged with their daughters.
Harris proposed that peaks in successor frequency would be suitable detectors
for the discovery of morpheme breaks. As Hafer and Weiss (1974) note, Harris’s ap-
parent proposal is actually a family of closely related proposals, and none of them
work anywhere close to perfectly, for various reasons, some of which we will review
here. There are a number of parameters that one can modify in the actual imple-
mentation of Harris’s suggestion, and we adopt a set of parameters that increases
the precision, while decreasing its recall. In short, we adjust Harris's proposal so
that it makes fewer analytical claims about the words, but those that it makes
are relatively trustworthy. We do this in the following way.
Looking at peaks in the successor frequency in the first 3 letters of a word tends
to give rise to a large number of spurious peaks, in the sense that the peaks do not
signal morpheme boundaries. Since there are more consonants than vowels, and
since vowels tend to follow consonants, just as consonants tend to follow vowels,
there is a strong tendency for the successor frequency to be larger after a vowel than
after a consonant within the first 3 letters of a word, and hence for this algorithm
to find a (spurious) morpheme break after any vowel in the first 3 letters of a word.
Since we are at this point looking for “stem-suffix” breaks, we restrict our attention
to candidate stems that are at least 3 letters in length, recognizing that there are
some shorter stems (e.g., be) which will only be discovered at a later point.
We actually place a more stringent requirement on the cuts motivated by a
peak in successor frequency at this point: we require that to make a cut after
the ith letter, the successor frequency must be exactly 1 after both the (i−1)th letter
and the (i+1)th letter. This decision is a conservative one, in the following sense.
The two most common reasons to find a successor frequency greater than 1 in
two successive positions are these: either both peaks are accurate indicators of
morpheme breaks, and the first morpheme is one letter long (for example, with
the words petit, petits, petite, petites, a successor frequency of 3 is found after petit
and a successor frequency of 2 is found after petite), or a morpheme break is found
after the first position, and two of the suffixes that occur begin with the same
letter (e.g., many stems are followed by both –ing and –ion, in addition to –ed and
–s). It is difficult to be certain which is the correct cut at this point; by putting
this condition on the bootstrapping heuristic, no cut is made for the –ing and –
ion words at this point. The algorithm will very soon have considerable knowledge
about the morphology of the language, and it will know that –ing and –ion are
common suffixes, but that –ng and –on are not common suffixes, so it will be able
to make a much more informed choice than it can right now.
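The successor-frequency bootstrap just described can be sketched as follows; this is a simplified illustration of ours (it ignores end-of-word counts and efficiency concerns), not Linguistica's implementation:

```python
from collections import defaultdict

def successor_frequencies(words, prefix_len):
    """Harris's successor frequency: the number of distinct letters that
    follow each prefix of the given length in the word list."""
    followers = defaultdict(set)
    for w in words:
        if len(w) > prefix_len:
            followers[w[:prefix_len]].add(w[prefix_len])
    return {p: len(s) for p, s in followers.items()}

def conservative_cuts(words, word):
    """Propose a stem-suffix cut after letter i only when the successor
    frequency peaks there (> 1) while being exactly 1 after letters i-1
    and i+1, and the candidate stem is at least 3 letters long."""
    def sf(k):
        return successor_frequencies(words, k).get(word[:k], 0)
    return [i for i in range(3, len(word) - 1)
            if sf(i) > 1 and sf(i - 1) == 1 and sf(i + 1) == 1]

words = ["jump", "jumped", "jumping", "jumps"]
# "jump" is followed by e, i, and s (successor frequency 3), while
# "jum" and "jumpe"/"jumpi" each have successor frequency 1:
cuts = conservative_cuts(words, "jumped")  # -> [4], i.e. jump + ed
```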
set of suffixes with which it appears in the corpus as an alphabetized list: a suffix-
list. We then create a list of such suffix-lists, and associate with each such list the set
of stems that appears with precisely that set of suffixes. This association is exactly a
signature, as described earlier in this paper, as in (2). Each stem is associated with
exactly one signature. Common signatures in English include NULL.s (primarily
nouns), NULL.ed.ing.s (verbs), and NULL.er.est.ly (adjectives).
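Grouping stems by their exact suffix set can be sketched as below; the dictionary input is an assumed representation of the stem analyses, not Linguistica's internal data structure:

```python
from collections import defaultdict

def build_signatures(analyses):
    """Group stems by the exact (alphabetized) set of suffixes they
    occur with; each suffix-list plus its stems is a signature.
    `analyses` maps each stem to the set of suffixes observed with it
    (NULL represents a bare stem)."""
    signatures = defaultdict(list)
    for stem, suffixes in analyses.items():
        signatures[".".join(sorted(suffixes))].append(stem)
    return {sig: sorted(stems) for sig, stems in signatures.items()}

analyses = {
    "jump": {"NULL", "ed", "ing", "s"},
    "walk": {"NULL", "ed", "ing", "s"},
    "dog":  {"NULL", "s"},
    "cat":  {"NULL", "s"},
}
sigs = build_signatures(analyses)
# -> {"NULL.ed.ing.s": ["jump", "walk"], "NULL.s": ["cat", "dog"]}
```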
We then apply a set of filters in order to eliminate certain implausible signatures,
because our goal in this first heuristic is to prefer precision over recall, in the sense
that we would rather fail to uncover some morphological structure than detect
spurious or false structure. We set a threshold (of 3) for the minimum number of
words an affix may appear in; a hypothetical suffix occurring less often than that
is eliminated.
The second heuristic we use to eliminate signatures is based on the a priori
improbability of a suffix being just one letter in length. NULL is a likely affix
in general (in the sense that languages often build words with no overt affixes), but
suffixes with only one letter (phoneme) are both rare and suspect. Even if we did
not know English, we would be wise to be suspicious of a morphological analysis
which posits a stem car that can be followed by the affixes NULL, e, t, p, b, and d.
These are really distinct stems (in English: car, care, cart, carp, carb, and card ). As noted
by Brent (1999), natural languages do act as if they select their morphemes with
an eye to keeping their mean length to the neighborhood of 5, with the average less
for affixes than for stems, but with a relatively low probability of morphemes of
length 1. To be sure, NULL.s is the most common signature in English, French, and
Spanish, so we can take this length consideration only as a tendency, and be willing
to accept a signature such as NULL.s as legitimate if it is found in association with
a sufficient number of examples.
A certain amount of experimentation has led us to the following heuristic.4 Any
signature with a large number of stems (defined as 25) is permitted, while those
with fewer are subject to the following test. A signature must have at least two
affixes that are of length at least 2 (where a NULL affix is considered to be of
length 2 for these purposes); otherwise it is dropped. Thus by this latter criterion,
NULL.t, or b.p, would be eliminated, while br.tr and NULL.br would be accepted.
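This filter can be stated compactly; the sketch below assumes signatures are given as stem and affix lists, and the function name is ours:

```python
def keep_signature(stems, affixes, stem_threshold=25):
    """The filter described above: a signature with at least
    `stem_threshold` stems is kept outright; otherwise it must contain
    at least two affixes of length >= 2, NULL counting as length 2."""
    if len(stems) >= stem_threshold:
        return True
    long_affixes = [a for a in affixes if a == "NULL" or len(a) >= 2]
    return len(long_affixes) >= 2

# The examples from the text (stem lists here are placeholders):
assert not keep_signature(["car"], ["NULL", "t"])        # NULL.t dropped
assert not keep_signature(["car"], ["b", "p"])           # b.p dropped
assert keep_signature(["stemA", "stemB"], ["br", "tr"])  # br.tr kept
assert keep_signature(["stemA"], ["NULL", "br"])         # NULL.br kept
```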
What is the connection between finding signatures and MDL? Each signature
represents a considerable savings in the number of letters that are needed in the
stem lists. We may think of the null morphology as being the morphology in which
there are no affixes, and the only structure present is that the words of the corpus are
each individually represented in the list of stems. When we are able to reduce a set of
t stems and f affixes to a description as a signature (and such a signature represents
t times f words altogether), we are able to save f–1 copies of each stem, and t–1
copies of each affix. If the average length of a stem is S and the average length of
an affix is F, the signature will save approximately log2 (27)[tS(f − 1) + f F (t − 1)],
4 The value of this parameter has been chosen somewhat arbitrarily, and here, as
at a few other places in this paper, experiments with a large number of gold standards
for different languages might lead to somewhat different optimal settings.
4 Check signatures
The Check signatures function directly incorporates the insights of the Minimum
Description Length perspective on grammar induction. It examines each signature
in turn, and attempts to determine if the transfer of material (letters, phonemes)
from stem to suffix will improve the overall description length of the morphology.
For example, if there is a large set of words ending in –ion and –ive, the function
described in the prior section will draw the conclusion that there are suffixes –on
and –ve in these cases, and place the –i in the stems, not in the suffixes. The
purpose of this function is to identify and correct that error.
Now, each signature consists of a list of pointers to stems, and pointers to suffixes,
and in most cases, there are more stems than there are suffixes in a signature.
When we examine a signature, we typically expect a healthy variety of different
final letters: while there may be a skew in the distribution of letters that may
appear stem-finally, there should nonetheless be a good variety. Check signatures
computes the entropy of the set of stem-final letters. If that entropy is greater than
the threshold value (experimentally set at 1.4), the function returns, performing no
change. If the entropy is less than the threshold amount, it considers the entropy
of the set of stem-final bigrams, and performs the same check against
the entropy threshold. The function successively considers the entropy of stem-final
strings of up to 4 characters, and determines the largest k for which the set
of stem-final strings of length k has an entropy less than the threshold.5
It then considers each of these restructurings of the signature, and calculates an
approximation of the change in the morphology’s description length brought about
by the change in the cuts between stem and suffix that would be caused by shifting
a certain amount of stem-final material to the beginning of the suffixes, such as the
–i– alluded to above.
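The entropy check can be sketched as follows; the threshold of 1.4 and the limit of 4 characters come from the text, while the function names and the early-exit behavior are our simplifications:

```python
import math
from collections import Counter

def final_ngram_entropy(stems, k):
    """Entropy in bits of the distribution of stem-final strings of length k."""
    counts = Counter(s[-k:] for s in stems if len(s) >= k)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def stem_final_shift_length(stems, threshold=1.4, max_k=4):
    """Largest k (up to max_k) for which the stem-final k-grams have
    entropy below the threshold; 0 means the stem-final letters are
    varied enough that no material should be shifted to the suffixes."""
    best = 0
    for k in range(1, max_k + 1):
        if final_ngram_entropy(stems, k) < threshold:
            best = k
        else:
            break
    return best

# Stems that all end in -i (as with wrongly cut -ion/-ive stems) trigger
# a shift of the final letter; a varied set of stems does not:
assert stem_final_shift_length(["positi", "negati", "relati", "acti"]) == 2
assert stem_final_shift_length(["jump", "walk", "crawl", "stand"]) == 0
```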
The first step is to calculate how much length the signature σ is responsible for in
the overall morphology —so that we can compare that length to the length of the
alternative signatures which attempt to handle the same data. Now, a signature is
composed essentially of the following: a list of pointers to stems, and a list of pointers
to suffixes. From here on out, it will be convenient to have a good notation to
indicate the frequency of a word or morpheme in the corpus, and we shall henceforth
5 It should be clear that this strategy is just a heuristic, and a more complex heuristic
may prove worthwhile in more complex cases. Testing the entropy of the last k letters of
the stems is a rough test as to whether we have wrongly cut up one or a small number
of suffixes between the stem and the affix, but it works well in practice.
Indeed, a suffix which is associated with only a single signature is a bit suspect;
being able to reanalyze a signature (such as on.ve) so that it is replaced by a
signature that consists only of suffixes that “already” and “independently” exist is
a good thing, as it decreases the description length of the morphology by increased
use of a smaller inventory of parts. In order to be able to keep track of the possibility
of making such a move, when we calculate the bit-length (information content) of
a signature, we assign to it a portion of the information content of the suffix entry
itself that is proportional to the relative use made of the suffix by that signature.
For example, if signature σ is the only signature to use the suffix on, and storage
of the suffix on takes 9.2 bits, then signature σ is charged the full 9.2 bits at this
point, in addition to the length of the pointer to on which the signature needs in
order to do its work. If, however, there were another signature σ′ which used the
suffix on to cover an equal number of tokens in the corpus, then signature σ would
only be responsible for 9.2/2 (= 4.6) bits in the present calculation.
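The prorating scheme amounts to a one-line computation; the sketch below uses token counts per signature as the measure of relative use, which is one reading of "relative use" in the text:

```python
def prorated_suffix_cost(suffix_bits, tokens_by_signature, signature):
    """Charge a signature the share of a suffix's storage cost that is
    proportional to its relative use of that suffix, as described above."""
    total = sum(tokens_by_signature.values())
    return suffix_bits * tokens_by_signature[signature] / total

# The example from the text: storing -on costs 9.2 bits. A signature
# that is the sole user of -on is charged all 9.2 bits; if a second
# signature uses -on over an equal number of tokens, each pays half:
assert prorated_suffix_cost(9.2, {"sig": 10}, "sig") == 9.2
assert prorated_suffix_cost(9.2, {"sig": 10, "other": 10}, "sig") == 4.6
```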
created by shifting all stem-final a’s to form a suffix –able, and all of the stem-final
–i ’s to form a suffix –ible, meanwhile shortening all of the stems by one letter.6
The description length of one of these alternative signatures is calculated as
follows: to determine whether the restructuring is preferable, we must total each
of the description lengths, and compare them to the original description length,
opting for the situation in which the description length is the least.
Consider first the length of the pointers to stems. Since by design, each stem T
is associated with exactly one signature, these numbers will not generally change
when we restructure the signature: whether the stem is positi- or posit- will not
change the number of occurrences of position and positive in the overall corpus;
but as this example suggests, the removal of a portion of material from stem T (in
this case, the material i ) may well give rise to a “new” stem T′ which independently
occurs elsewhere in the corpus (for example, as an unanalyzed word). Indeed, that
discovery should speak in favor of this reanalysis, for the stem posit is being used
more often. Restructuring the entire morphology in order to calculate the overall
effects of this change would be the most accurate way to proceed; however, we
accept a simplification, and merely decrease the length of the stem-pointer in the
signature by increasing the frequency of the stem in question: it becomes the sum of
the number of occurrences of the stem T in the present signature σ, plus the number
of occurrences of the stem T′ in its other signature or its unanalyzed occurrences.
Thus the length of the pointer to T will shift from log ([W ]/[T ]) to
log ([W ]/([T ] + [T′])) (where [W] is the total number of words in the corpus), a
difference equal to log(1 + [T′]/[T ]) (the reader may recall that log(1 + x) is
approximately x − x2 /2 + x3 /3 for small x), and similarly the change in the length
of the pointer to T′ will be equal to log(1 + [T ]/[T′]).
Furthermore, the stem T is now entirely removable from the list of stems, and
therefore an additional savings equal to approximately |T | ∗ log(27) occurs, which
is likely to be a considerably larger amount.
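The quantities in this paragraph can be combined into a single estimate; the function below is our sketch, with integer counts standing in for [T] and [T′]:

```python
import math

def stem_merge_savings(count_t, count_t_prime, stem_length):
    """Approximate savings in bits when stem T (count [T]) is merged
    with an independently occurring stem T' (count [T']): the pointer
    to T shortens by log2(1 + [T']/[T]), the pointer to T' by
    log2(1 + [T]/[T']), and deleting T from the stem list saves about
    |T| * log2(27) bits."""
    pointer_to_t = math.log2(1 + count_t_prime / count_t)
    pointer_to_t_prime = math.log2(1 + count_t / count_t_prime)
    letters = stem_length * math.log2(27)
    return pointer_to_t + pointer_to_t_prime + letters

# With equal counts and the letter term set aside, each pointer
# shortens by log2(2) = 1 bit:
assert stem_merge_savings(4, 4, 0) == 2.0
```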
Even when a new stem is created which did not exist before (e.g., posit- instead
of positi-), if it is shorter, then the amount of information in the stem list decreases;
hence if the number of stems associated with signature σ is [Stems(σ)], and a final
string of length k is removed from them, there is a total savings of approximately
[Stems(σ)] ∗ k ∗ log(27) bits associated with the new signature.
And what of the list of suffixes in this new signature? In the first place, it is pos-
sible that this list of suffixes already exists in the morphology as an independently
needed signature σ∗, and if that is the case, then a considerable simplification can
be achieved by simply merging the signature σ with σ∗. Let us construct a list of
all the places in the morphology where this merger will give rise to a simplifica-
First, the length of a pointer to σ∗ will shift from log ([W ]/[σ∗]) to
log ([W ]/([σ∗] + [σ])), and that difference is log(1 + [σ]/[σ∗]); in parallel fashion,
the length of the pointers to σ will change by an amount equal to log(1 + [σ∗]/[σ]).
6 As this example illustrates, it may be that the best analysis would be one where one of
these letters was transferred to the suffix, and the other was not. This possibility is not
currently considered by the algorithm.
Using the approximation mentioned above, we see that this means a savings of
about [σ∗]/[σ] bits for every place
where the signature σ was mentioned in the grammar previously. Thus the savings
are considerable when a signature is replaced, or superseded, by a signature which
occurs considerably more times. And these savings will indeed accrue quite a few
times, for there are many places in the grammar where pointers to signatures occur:
minimally, there is a pointer to a signature associated with each stem. The savings
to σ∗ that occur when signature σ can be replaced by an already existing signa-
ture σ∗ due to the collapsing procedure alone are thus [Stems(σ∗)] ∗ log(1 + [σ]/[σ∗]),
while the savings in the cost of the pointers to σ∗ from the stems of σ is equal to
[Stems(σ)] ∗ log(1 + [σ∗]/[σ]). One is tempted to see this as a quantitative evaluation
of Meillet’s classic dictum that a language is a système où tout se tient.
If the new signature σ∗ did not independently occur, we must calculate the
relevant parts of its description length: the length of its pointers to its individual
suffixes. The length of the pointer to suffix f is log ([W ]/[f ]). We continue, as we noted
above, to prorate the information content of the actual phonological material of the
suffix between this new signature and all the other signatures that also point to
this suffix. The more signatures point to the suffix, the less any of them will have
to be responsible for that suffix’s phonological content.
One of the conditions that we placed on the successor frequency bootstrapping al-
gorithm blocked it from associating a stem with a particular suffix if there were two
or more suffixes that began with the same letter (e.g., conservation and conserva-
tive could not be analyzed as conserv-ation and conserv-ative, even in the presence
of conserve and conserving). We now make up for this initial conservatism, by
scanning through our list of discovered stems and looking to see if there are any
unanalyzed words which consist of such a stem followed by a suffix that had been
discovered elsewhere. When such words are found, they are analyzed and divided
into stem and suffix. If there should be two such ways found, the one with the more
common stem is preferred.
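This recovery step can be sketched as a scan over the unanalyzed words; the data structures below are assumed representations of ours, not Linguistica's own:

```python
def analyze_leftovers(unanalyzed, stems, suffixes, stem_count):
    """Split each unanalyzed word into a known stem plus a known suffix
    where possible; when several splits exist, prefer the more common
    stem, as the text specifies. Returns word -> (stem, suffix)."""
    analyses = {}
    for word in unanalyzed:
        candidates = [
            (word[:i], word[i:])
            for i in range(1, len(word))
            if word[:i] in stems and word[i:] in suffixes
        ]
        if candidates:
            analyses[word] = max(candidates, key=lambda c: stem_count[c[0]])
    return analyses

# Hypothetical counts: conserv- is far more common than conserva-,
# so conservation is analyzed as conserv + ation:
stems = {"conserv", "conserva"}
suffixes = {"ation", "ative", "tion", "tive"}
counts = {"conserv": 12, "conserva": 2}
result = analyze_leftovers(["conservation"], stems, suffixes, counts)
# -> {"conservation": ("conserv", "ation")}
```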
We now consider all signatures containing at least two stems and two suffixes, and
scan through the words unanalyzed so far, to see if they fall into any such signatures.
We sort the signatures by robustness, and look for the most robust signatures first.
When we find that a signature matches a set of words, we analyze the words into
stem and affix with that signature. One of the consequences of this is that we now
can find stems whose length is shorter than the limit we placed on stems in the
initial bootstrap heuristic, because our knowledge of the morphological patterns is
now greater.
7 In this paper, we have simplified things slightly by assuming each observed word occurs
once (so that stem frequencies can be derived from the number of different affixes they are
observed with); in general, this is a simplifying assumption that does not need to be
made, but if it is, then [w] here is equal to 1.
term, a suffix term, and a pointer to the signature itself. The stem term is the sum,
over all of the stems in the signature, of the graphological information of the stem
and a pointer to that stem; the suffix term is the sum, over all of the suffixes in
the signature, of the graphological information of the suffix if the suffix did not
already exist, and a pointer to that suffix. If the cost of the new analysis is less
than the cost of the current analysis, we select the new analysis and its concomitant
word-analyses.
Approximate information content of an analysis, where F is the set of suffixes
which already existed in the morphology, and G is the set which did not:
(5) |t| log2 27 + Σf ∈F ∪G log2 (1/freq(f )) + Σf ∈G (|f | log2 27 + log2 ([W ]/[f ]))
Following this, we use the MDL-based check-signatures function (see section 4).
10 Allomorphy
Determining the correct segmentation of an arbitrary word is only the first step
in analyzing the morphology of a language: in addition, virtually all languages
display allomorphy, that is, variation in the realized form of a given morpheme.
In English, the same morpheme appears as love (in the words love, lovesick, and
loves) and lov (in the words loving, lover, and loved ). More generally, word-final –e
in English deletes before a range of suffixes, including –ed, –ing, and –ity. Suffixes
too take different forms: the plural –s in English appears as –es after stems that
end in s, sh, or ch (hisses, masses, hitches, etc.). It is often not obvious to the
analyst or to the native speaker just where this allomorphy begins and ends (a point
we discuss in greater detail in section 11 below). For example, it is reasonable to
assert that the stems receive and recept (as in recept-ion) are alternate realizations
(that is, allomorphs) of the same morpheme, paralleled by deceive/decept(ion),
perceive/percept(ion) and conceive/concept(ion), but it is less clear whether the
correct form is recep or recept. And other conceivably related forms are not in
fact related at all: for example, the stems resolut- and revolut- (from the words
resolution/revolution) are not related by any rule relating s and v in English.
In this section, we present an algorithm that takes certain steps towards dealing
with this challenge. At present, the algorithm to be presented is capable only of
detecting rules of allomorphy that delete stem-final material, like the deletion of
word-final –e in English, and rules that cause alternations of a stem-final letter (e.g.,
y becomes i ) before certain suffixes (e.g., –es). This capability is useful, however,
for a range of languages, including English. Considerable work remains before the
range of actual alternations can be automatically detected; some further examples
are discussed in the next section.
Let us step back and think about this problem more generally. The task of finding
the principles that relate the forms (allomorphs) of a stem is generally conceived
of as the task of discovering the phonology of a language, a problem that has been
attacked by a number of researchers, especially in the past ten years (Ellison (1991),
Albright (2002), Albright and Hayes (2003), Neuvel and Fulop (2002), and others.)
Most, but not all, of this work has assumed that some “oracle”—some outside
source of information—provides the phonology learner with the information that
two words are morphologically related: the two words may be explicitly marked
as being part of the same morphological paradigm, for example. But the present
algorithm does not have access to that information, by the ground-rules that we
have set for it.
The most reliable information the framework has is the set of robust signatures
in the language, and it is this information that it uses to determine if there is stem-
final deletion at play in the language it is considering. Suppose there is a suffix
F which deletes stem-final L, and suffix F appears with stems that appear with a
null suffix. (For example, F might be the suffix –ing in English, and L the letter
–e.) Then there will appear to be two distinct signatures in the language: NULL.F
(from “regular” stems that do not end in L) and L.F (from stems that end in L). In
addition, under these phonological conditions, the morphology may have wrongly
analyzed some cases of stem-final L’s as having been part of a larger suffix.
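Detecting this pair of signature patterns can be sketched as follows, under the simplifying assumptions (ours) that signatures are given as alphabetized suffix-list strings and that only two-suffix signatures are inspected:

```python
def deletion_candidates(signatures, letter="e"):
    """For a 1-letter suffix `letter`, find each suffix F such that both
    NULL.F and letter.F occur as signatures -- the two patterns predicted
    when F deletes a stem-final `letter` (e.g. NULL.ing and e.ing for
    English e-deletion before -ing)."""
    pairs = {tuple(sig.split(".")) for sig in signatures if sig.count(".") == 1}
    candidates = []
    for first, second in pairs:
        if first == "NULL" and (letter, second) in pairs:
            candidates.append(second)
    return sorted(candidates)

sigs = {"NULL.ing", "e.ing", "NULL.ed", "e.ed", "NULL.s"}
found = deletion_candidates(sigs, "e")  # -> ["ed", "ing"]
```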
Since signatures are the most reliable distributional information we have about
the language at this point, we use them to detect this situation in the following
way. We consider all 1-letter suffixes, and for each one (which we will mnemonically
enough call ‘e’, in honor of one of the suffixes that actually passes the tests we are
about to describe), we wish to establish three classes of suffixes:
In cases (1) and (3), we indicate that a suffix f deletes a preceding e with this
number of stems that are associated with these signatures, setting thresholds of 5
and 50, respectively. A suffix L that passes this test is interpreted as being erro-
neously analyzed as a suffix, and is reintegrated into preceding stems; suffixes are
reassigned according to the function T, as defined above.
A similar method is used to identify a stem-final segment that mutates under the
influence of a following suffix (for example, stem-final y mutates to i before suffixes
such as –al : bury + al > burial, beauty + ful > beautiful, dry + ed > dried). For
each 1-letter suffix Y, we do the following. For each suffix Z that occurs with Y
in a signature Y.Z (e.g., Z might be the suffix ies in the signature y.ies), we look
to see if Z can be decomposed into IZ*, where I is a single letter, and Z* is an
existing suffix (in the cases just mentioned, the Z* would be –al, –ful, or –ed ).
If that condition is met, we define T(Z) as the ordered pair (I, Z*), written a bit
more perspicuously as {Y |I}Z∗. Intuitively, {Y |I}Z∗ means a fixed suffix Z* which
has the property of mutating an immediately preceding Y to I, much as <e>ing
refers to a fixed suffix ing which mutates a preceding e to the null string. Note that
the letter identified as I can be distinct in the case of each suffix; it is merely the
first letter of the suffix. If, however, there exists a common letter (call it I ) that is
shared by a majority of the suffixes, then we conclude that all suffixes Z such that
T (Z) = (I, Z∗) are indeed of the form Z* (e.g., –ial is reanalyzed as –al ) with the
property that they mutate a stem-final Y to I. The corresponding stems are then
modified, so that they take on a final –Y.
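The mutation check can be sketched as follows; the majority test and the decomposition Z = I + Z* follow the text, while the function shape and names are our simplification:

```python
from collections import Counter

def mutation_rule(y_suffix, partner_suffixes, known_suffixes):
    """For each suffix Z occurring with the 1-letter suffix Y in a
    signature Y.Z, try to decompose Z as I + Z* where Z* is an existing
    suffix. If a majority of the Zs share the same initial letter I,
    report the rule: Y mutates to I before the listed suffixes Z*."""
    decomposed = [(z[0], z[1:]) for z in partner_suffixes
                  if len(z) > 1 and z[1:] in known_suffixes]
    if not decomposed:
        return None
    (letter, count), = Counter(i for i, _ in decomposed).most_common(1)
    if count > len(partner_suffixes) / 2:
        mutators = sorted(z for i, z in decomposed if i == letter)
        return (y_suffix, letter, mutators)
    return None

# With -al, -ful, and -ed as known suffixes, the suffixes -ial, -iful,
# and -ied found alongside -y yield the rule y -> i:
known = {"al", "ful", "ed", "es"}
rule = mutation_rule("y", ["ial", "iful", "ied"], known)
# -> ("y", "i", ["al", "ed", "ful"])
```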
These procedures illustrate how the drive for simplification of signatures can lead
to the discovery of simple patterns of allomorphy.
11 Evaluation
As is the case with most natural language processing efforts, a quantitative evaluation
of the accuracy of morphological analysis of English is fraught with issues that were
initially unexpected. We built by hand a gold standard of some 15,000 words and
the target morphological analysis for each. This turned out to be a much greater
challenge than we anticipated, and we will explain why in what follows.8
We decided to evaluate with an accuracy measure, rather than with precision
and recall as in Goldsmith (2001). This was based on a practical consideration: in a
certain sense, all that we care about is producing a system that gets the “right”
answer as often as possible, so we decided to assign a
positive value only to the analyses of those words which matched our gold standard
analysis. The gold standard contains an indication of where the final suffix is in
each (non-compound, non-proper noun) word, if there is one.
We ran into the following sorts of issues in developing the gold standard:
1. Words in which we did not know whether there was a morphological analysis.
Is there a morphological analysis in such words as boisterous, ambassador,
annual, poem (cf. poet), agrarian, armor, benediction, crucial, or worn?
8 The initial preparation of this gold standard was done by Nikki Adams.
2. Words in which we were certain that there was a morphological analysis, but
we were not sure which of two different analyses was the “right” one: is allergic
based on a stem allerg, or is it from allergy plus the suffix ic? Is alphabetical
based on alphabet or on alphabetic? Is Algerian from Algeria plus –n, or plus
–an, or plus –ian? We know there is a suffix –ian in Corinth-ian (and maybe
in Belg-ian), and Palestin-ian, and probably in Canad-ian. But what about
Cuban? Is that a suffix –an or –n? In a different area, is dogmatically to
be analyzed as dogmatic plus ally? Most words ending in –ally are arguably
made up of two suffixes, -al- plus –ly, as in abnormally (from abnormal plus
ly); but dogmatical is not a word: shouldn’t this play a role in our analysis?
3. Words in which simple segmentation of the words into stem plus suffix was
not sufficient; the true stem of the word was different from the result of
segmenting the word into two pieces. The clearest example of this involved
final –e’s: loving is composed of love plus –ing. In other cases, the modification
is greater: decision is decide + ion, cutting is cut plus ing, decency is decent
plus y. Curiosity is curious + ity. Is application built from apply plus ation?
Not so clear.
4. In some words, segmentation is the wrong thing to worry about: crises is
crisis in the plural form. How do we deal with that: treat it as crisis + s?
5. In some cases, it is not clear what the “right” form for the suffix is. Is the
analysis of churches to be church plus s or plus es?
6. We know there is morphology involved, but is it English morphology? Is
corpus based on a stem corp plus a suffix us? I am not sure, though I am
reasonably confident that alumnus is alumn + us (related to alumn + i ).
Similarly: debutante.
We decided that our research goals would be best satisfied by the following set
of decisions:
1. Make the standard of the gold standard extremely high; a low score is an
acceptable consequence. The algorithm should not be penalized if it comes up
with an analysis that is in some sense correct, and yet “better” than the one
placed in the gold standard. For example, if the algorithm discovers the analysis
of alumnus as alumn plus –us, it should not be penalized for this, even if we are
surprised that it does so.
When it is really not clear what the analysis is, do not score the algorithm
one way or the other on the word. The word will still be part of the input,
but it will not be scored. We also made the assumption that the analyses of
proper nouns were not to be tested.
2. When there is clear allomorphy, make a decision ahead of time as to which
aspects of the morphology the algorithm is responsible for. At this point in
time, we decided that we wanted our algorithm to be tested on learning the
stem-final –e deletion and stem-final –y allomorphy, and so we set the gold
standard correct analysis of words such as loving and cries as love+ing and
cry+es (but not cry+s), respectively. In future work, we will add to our
gold standard, and make it possible to select which other aspects of English
allomorphy one wishes one’s algorithm to be tested against.
3. The gold standard must be made publicly available.
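The scoring scheme that results from these decisions can be sketched as follows. The function and data structures here are hypothetical, not the Linguistica code: the gold standard is modeled as a map from each word to its expected stem+suffix split, with None marking words that are deliberately left unscored (unclear analyses, proper nouns).

```python
def accuracy(analyses, gold):
    """Word-level accuracy against a gold standard.  `gold` maps each
    word to its expected (stem, suffix) pair, or to None when the word
    is left unscored; unscored words remain in the input but count
    neither for nor against the learner."""
    scored = {w: g for w, g in gold.items() if g is not None}
    hits = sum(1 for w, g in scored.items() if analyses.get(w) == g)
    return hits / len(scored)
```

Note that the gold entries already encode the allomorphy decisions above: loving is listed as ('love', 'ing') and cries as ('cry', 'es'), so a learner that segments the surface string without restoring the stem-final –e or –y is scored as wrong on those words.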
On the first 200,000 and the first 300,000 words of the Brown corpus, Linguistica
achieved an accuracy of 72%. Of the errors (that is, of the 28% of the words that were
not correctly analyzed), approximately 30% were due to failure to reconstruct
“missing” stem-final –e’s. For example, when the words abused and abusive were
found (but no other related words, notably abuse), the algorithm was unable to
reconstruct abuse as the stem; it reconstructed abus instead, and these analyses
were scored as errors. (That is, if we did not demand the reconstruction of these
–e’s, accuracy would rise to approximately 80%.)
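The approximately-80% figure follows directly from this breakdown: waiving the –e requirement would recover roughly 30% of the 28-point error mass.

```python
accuracy = 0.72                  # reported word-level accuracy
error_rate = 1 - accuracy        # 28% of words misanalyzed
e_share = 0.30                   # share of errors from missing stem-final e's
adjusted = accuracy + error_rate * e_share
print(f"{adjusted:.1%}")         # prints 80.4%, i.e. roughly the 80% cited
```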
References
Albright, Adam. (2002) Islands of reliability for regular morphology: Evidence from Italian.
Language 78: 684–709.
Albright, Adam and Bruce Hayes. (2003) Rules vs. analogy in English past tenses: a
computational/experimental study. Cognition 90: 119–61.
Bell, Timothy C., John G. Cleary, and Ian H. Witten. (1990) Text Compression. Englewood
Cliffs, N.J.: Prentice Hall.
Brent, Michael. (1999) An efficient, probabilistically sound algorithm for segmentation
and word discovery. Machine Learning 34(1-3): 71–105.
Ellison, T. Mark. (1991) The iterative learning of phonological constraints. Dissertation.
http://citeseer.nj.nec.com/ellison91iterative.html
Goldsmith, John. (2001) The unsupervised learning of natural language morphology. Com-
putational Linguistics 27: 153–198.
Goldsmith, John, and Yu Hu. (2004) From signatures to finite state automata. Paper
presented at the Midwest Computational Linguistics Colloquium. Bloomington IN.
Hafer, M. A., and Weiss, S. F. (1974) Word segmentation by letter successor varieties.
Information Storage and Retrieval 10: 371–385.
Harris, Zellig. (1955) From phoneme to morpheme. Language 31: 190–222, Reprinted in
Harris (1970).
Harris, Zellig. (1967) Morpheme boundaries within words: report on a computer test.
Transformations and Discourse Analysis Papers 73. Reprinted in Harris (1970).
Harris, Zellig. (1970) Papers in Structural and Transformational Linguistics. Dordrecht:
D. Reidel.
Hu, Yu, Irina Matveeva, John Goldsmith, and Colin Sprague. (2005) The SED heuristic for
morpheme discovery: a look at Swahili. Papers from the Psychocomputational Models
of Human Language Acquisition Workshop at ACL 2005, edited by William Sakas,
Alexander Clark, James Cussens, and Aris Xanthos.
Neuvel, Sylvain and Sean Fulop. (2002) Unsupervised learning of morphology without
morphemes. Proceedings of the ACL Workshop on Morphological and Phonological
Learning, pp. 31–40. Philadelphia.
Rissanen, Jorma. (1989) Stochastic Complexity in Statistical Inquiry. World Scientific Se-
ries in Computer Science. Singapore: World Scientific.
Wallace, C. S. and M. P. Georgeff. (1983) A general objective for inductive inference.
Technical Report 32, Department of Computer Science, Monash University.
Wallace, C.S. and D. L. Dowe. (1999) Minimum Message Length and Kolmogorov Com-
plexity. The Computer Journal 42(4): 270–283.
Xanthos, Aris. (2003) Du k-gramme au mot: variation sur un thème distributionnaliste.
Bulletin de linguistique et des sciences du langage (BIL) no. 21, Lausanne: UNIL.