
A unifying grammar or a growing bag of words

Hielke Prins
6359973

The role of rules in language acquisition has been heavily debated, as illustrated by the
papers of Saffran [1] and Marcus [2] on artificial language learning. Where Marcus argues in favor of a
rule-based mechanism that facilitates the learning process, Saffran and others showed that transition
probabilities between syllable pairs alone can account for the performance of infants on word
segmentation. They suggest that similar statistical cues might play a role in the acquisition of other
aspects of language. To evaluate this claim using evidence from natural language acquisition,
the similarity between infant utterances and a reference corpus of adult dialogues is tracked over time
by comparing n-gram frequencies.

INTRODUCTION

Language acquisition
At the word level, natural languages are arrangements of discrete symbols according to
certain structural patterns. During language acquisition infants familiarize themselves with
these patterns and the vocabulary of the language.
The debate on the nature of the patterns has focused on the question whether they are rule-like
relationships between placeholders or more arbitrary statistical dependencies. N-gram
frequencies are one way to capture some of the patterns in such an arbitrary way.

N-grams
An n-gram is a contiguous subsequence of n words from a sentence. For example, one of the
sentences from the Brown CHILDES corpus [3],

let me shoot your bracelet

can be decomposed into the following series of bigrams (n = 2) or trigrams (n = 3):

{let, me}          {let, me, shoot}
{me, shoot}        {me, shoot, your}
{shoot, your}      {shoot, your, bracelet}
{your, bracelet}

N-grams are used in Markov models to predict the next word in a sentence depending only on
the n – 1 previous words. These models are used as an approximation of the actual underlying
language.
Distributions of n-grams have successfully been used to improve Bayesian spam filters and
plagiarism detection. In applications of this type, documents are compared to a reference
distribution of n-grams retrieved from classified target documents.

Similarity
Following the concepts behind these applications, one would expect sets of n-grams taken from
corpora of infants of different ages to become more and more similar to each other and to a
reference set taken from a corpus of adult conversation.
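
As a concrete illustration of the n-gram decomposition described in the N-grams section, the following is a minimal sketch in Python. The function name `ngrams` and the whitespace tokenization are assumptions made for this example; they are not taken from the original analysis.

```python
def ngrams(sentence, n):
    """Return the contiguous n-grams of a sentence as a list of word tuples."""
    # Assumption: words are separated by whitespace.
    words = sentence.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "let me shoot your bracelet"
bigrams = ngrams(sentence, 2)    # n = 2
trigrams = ngrams(sentence, 3)   # n = 3

print(bigrams)
print(trigrams)

# Collapsing the lists into sets gives the unordered n-gram sets
# that are compared between corpora.
bigram_set = set(bigrams)
```

A sentence of five words yields four bigrams and three trigrams, matching the example sentence above.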

Since the n-grams do not explicitly refer to variable positions and both sets are unordered,
testing this hypothesis addresses the question whether a statistical mechanism could at least in
principle account for natural language acquisition.

METHODS

Corpora
Infant corpora were taken from the Brown corpus. Utterances of the children Adam (55
sessions) and Sarah (139 sessions) are considered. The children were tracked when they
were approximately 2 to 4 years old.

The TüBa-E/S corpus of spoken English served as a reference, as did the mother's interaction
with her child in the Brown corpus itself.

Preprocessing
The Brown CHILDES corpora were segmented into words. Words were defined as sequences of
uppercase or lowercase characters, brackets and the apostrophe.
The TüBa-E/S corpus was stripped of its syntactic annotations. This resulted in 29672
sentences.

Similarity measures
To measure the similarity between the n-gram distributions, the notions of resemblance and
containment defined by Broder [4] are used. Resemblance is the number of matches
between the elements of two sets of n-grams, scaled by the size of the joint set:

$$R(A, B) = \frac{|N(A) \cap N(B)|}{|N(A) \cup N(B)|} \tag{1}$$

where N(A) and N(B) denote the sets of n-grams of the two corpora.
Because the reference set is much larger than the sets compiled from the infant
utterances, containment is used as an asymmetric measure scaled only by the size of the
reference set:

$$C(A, B) = \frac{|N(A) \cap N(B)|}{|N(B)|} \tag{2}$$

Containment thus gives the proportion of n-grams in the reference set that are also in the
compared set. This measure has been used before to compare two sets of n-grams by
Barrón-Cedeno and Rosso [5] in order to detect plagiarism, but the measure is sensitive to the
relative size of the infant n-gram set.

RESULTS

Context size
One would expect that the chance of finding common n-grams in two different corpora of
significant size decreases when n increases. Not only are there simply fewer n-grams of larger
size, they also include more different words, making them more and more context dependent.

On the other hand, recurring sequences of words might be more characteristic of the
underlying language than individual words are. During early language acquisition there might
thus be a tendency of increasing similarity while n increases.
The graphs in Figure 1 indeed show that Adam and Sarah have more bigrams (n = 2) than
unigrams in common when they are 3 years old, but more unigrams than bigrams when they are
2 years old (yellow line), despite the fact that there are in general more bigrams than unigrams
in all corpora with sentences longer than one word. Between the ages of 3 and 4 there are no
clear differences.
The measure of resemblance defined above (Formula 1) reflects this effect in the difference
between unigrams and bigrams, which decreases by approximately 5 percentage points (last
column of Table 1). Both measures show the increasing fit to an exponential drop with distance
that one would expect for Markov models.
There is a clear effect of the age of the individual children on the frequencies of longer n-grams.
The occurrence of higher values of n increases with age, lengthening the right tail of
the distribution.
Similarity of the distributions over n for the two children seems to be only slightly affected by
their age. The blue and red bars show the n-grams the infants do not share with each other.
The striking increase in visual resemblance is probably mainly caused by increasing symmetry
in set size (see subtotals in Table 1).

[Figure 1: three panels (2 years old, 3 years old, 4 years old); x-axis n = 1 to 9, left y-axis
# n-grams, right y-axis 0 to 0.4.]
Figure 1: Distribution of unique (bars) and common (yellow line) n-grams over n for Adam
(blue) and Sarah (red) from the Brown corpus. Dashed green line shows the resemblance
between the two sets.

Learning and acquisition
In order to investigate whether the increase in common n-grams with age could be explained by
acquisition of statistical patterns through familiarization with those of the learned
language, the graphs in Figure 2 plot the resemblance between Adam and the adult
corpora.
The graph on the left shows the relative amount of resemblance between mother and child that is
accounted for by sequences of varying length. The right one does the same for containment of
the Tuebingen spoken English corpus (TüBa-E/S).

Plotted this way, the distribution of common n-grams over n is depicted by the relative amount
of the various colors in a vertical slice of the graph. The trends in both graphs are alike,
showing that adults and infants share an increasing common set of longer sequences at the
expense of the share of unigrams between the infant's second and third year.

DISCUSSION
The results suggest that infants indeed familiarize themselves with frequently
recurring sequences in the linguistic input they are offered, and that they can do so (to some
extent) using only statistical cues.
Most of the action seems to take place between the ages of 2 and 3. For future work it might
therefore be interesting to increase temporal resolution between these ages by grouping the
sessions in smaller bins (e.g. half a year or a month).
A quick analysis of the rules used in the automatically annotated sets shows, however, that
rule length and complexity likewise grow in the same period, leaving the question formulated in
the title as yet unanswered.

Comparing the set distributions (including the n-gram frequencies) might shed some light
on the actual kind of n-grams learned during acquisition, providing a better informed
preliminary answer to this question, but is left for future investigation.

[Figure 2: two panels, R(Adam, mother) (left) and C(Adam, T-E/S) (right); y-axis 0% to 100%,
x-axis Age (years) 2 to 4; legend n = 1 to 9.]
Figure 2: Relative resemblance (Formula 1) between Adam and his mother (left) and containment
(Formula 2) of the reference set in that of Adam (right), plotted against age. The different
colors depict different sizes of n.

REFERENCES
[1] J. R. Saffran, R. N. Aslin, and E. L. Newport, “Statistical learning by 8-month-old infants,”
Science, vol. 274, no. 5294, p. 1926, 1996.
[2] G. F. Marcus, S. Vijayan, S. Bandi Rao, and P. M. Vishton, “Rule learning by seven-month-old
infants,” Science, vol. 283, no. 5398, pp. 77–80, Jan. 1999.
[3] R. Brown, “A first language: The early stages,” 1973.
[4] A. Z. Broder, “On the resemblance and containment of documents,” in Compression and
Complexity of Sequences 1997. Proceedings, pp. 21–29, 2002.
[5] A. Barrón-Cedeno and P. Rosso, “On automatic plagiarism detection based on n-grams
comparison,” Advances in Information Retrieval, pp. 696–700, 2009.

age    n    Adam   Sarah  common  R(Adam, Sarah)
2      1     865     519     576  0.29387
2      2    4634    2153     363  0.05076
2      3    3022    1377      33  0.00744
2      4    1163     413       4  0.00253
2      5     403      91       0  0
2      6     162      17       0  0
2      7      75       2       0  0
2      8      45       1       0  0
2      9      34       0       0  0
2      all 10403    4573     976
3      1    1340     843     940  0.30099
3      2    8916    4116    1683  0.11437
3      3   12515    4940     614  0.03398
3      4    9513    3165     122  0.00953
3      5    5537    1560      11  0.00154
3      6    2739     646       0  0
3      7    1229     241       0  0
3      8     528      91       0  0
3      9     239      34       0  0
3      all 42556   15636    3370
4      1     735     982     911  0.34665
4      2    5398    5616    1733  0.13595
4      3    7757    7577     731  0.0455
4      4    6300    5683     166  0.01366
4      5    3973    3316      22  0.003
4      6    2199    1702       5  0.00128
4      7    1120     825       1  0.00051
4      8     547     420       0  0
4      9     276     236       0  0
4      all 28305   26357    3569

Table 1: Unique (third and fourth column) and common (fifth column) n-grams for two infants
from the Brown corpus. The sixth column shows the resemblance between the two sets of
n-grams as defined in Formula 1. Rows marked "all" give the subtotals per age.
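
The resemblance and containment measures from the Methods section can be sketched on Python sets, and the resemblance column of Table 1 can be checked against the counts in the table. The function names are hypothetical, and the sketch assumes that the Adam and Sarah columns hold the n-grams that are not shared, so that the size of the union equals the sum of the two unique counts and the common count.

```python
def resemblance(a, b):
    """Shared n-grams divided by the size of the union of both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """Shared n-grams divided by the size of the reference set b."""
    return len(a & b) / len(b) if b else 0.0


def resemblance_from_counts(unique_a, unique_b, common):
    """Resemblance from table counts: union = unique_a + unique_b + common."""
    total = unique_a + unique_b + common
    return common / total if total else 0.0


# Toy sets of bigrams (illustrative only, not corpus data).
a = {("let", "me"), ("me", "shoot")}
b = {("me", "shoot"), ("shoot", "your"), ("your", "bracelet")}
print(resemblance(a, b), containment(a, b))

# Rows (age, n, Adam, Sarah, common, reported R) copied from Table 1.
rows = [
    (2, 1, 865, 519, 576, 0.29387),
    (3, 2, 8916, 4116, 1683, 0.11437),
    (4, 3, 7757, 7577, 731, 0.0455),
]
recomputed = [resemblance_from_counts(r[2], r[3], r[4]) for r in rows]
for row, value in zip(rows, recomputed):
    print(row[0], row[1], round(value, 5), row[5])
```

Under this reading of the columns, the recomputed values agree with the reported resemblance up to the last printed digit.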

n   age  C(Adam, T-E/S)    Adam  mother  common  R(Adam, mother)
1   2    0.1661             447     437     994  0.52928
1   3    0.24702            758     560    1522  0.53591
1   4    0.22351            766     271     880  0.45905
2   2    0.01473           4039    4663     958  0.09917
2   3    0.0543            7373    6029    3226  0.19401
2   4    0.04563           5595    2654    1536  0.15697
3   2    0.00067           2883    6581     172  0.01784
3   3    0.00874          11464   10882    1665  0.06934
3   4    0.00743           7831    4194     657  0.0518
4   2    0.00002           1146    5572      21  0.00311
4   3    0.00107           9162   10467     473  0.02352
4   4    0.00109           6270    3805     196  0.01908
5   2    0                  401    3929       2  0.00046
5   3    0.00014           5436    7958     112  0.00829
5   4    0.00012           3931    2817      64  0.00939
6   2    0                  162    2579       0  0
6   3    0                 2708    5525      31  0.00375
6   4    0                 2184    1933      20  0.00483
7   2    0                   75    1630       0  0
7   3    0                 1220    3725       9  0.00181
7   4    0                 1117    1287       4  0.00166
8   2    0                   45    1035       0  0
8   3    0                  527    2488       1  0.00033
8   4    0                  546     846       1  0.00071
9   2    0                   34     658       0  0
9   3    0                  239    1678       0  0
9   4    0                  276     559       0  0

Table 2: Containment (third column, Formula 2) of the reference set in that of Adam and
resemblance (Formula 1) between the sets of Adam and his mother (last column). Unique
n-grams in the sets are shown in the fourth (Adam) and fifth (mother) columns. The sixth
column shows the number of n-grams mother and child share with each other.
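
The relative view used in Figure 2 can be approximated from the last column of this table. The sketch below assumes that, for each age, the share of a given n is its resemblance divided by the sum of the resemblance values over all n. This normalization is an assumption about how the figure was drawn, and the function name `relative_shares` is hypothetical.

```python
# R(Adam, mother) for n = 1..9, copied from the last column of Table 2.
r_by_age = {
    2: [0.52928, 0.09917, 0.01784, 0.00311, 0.00046, 0, 0, 0, 0],
    3: [0.53591, 0.19401, 0.06934, 0.02352, 0.00829, 0.00375, 0.00181, 0.00033, 0],
    4: [0.45905, 0.15697, 0.0518, 0.01908, 0.00939, 0.00483, 0.00166, 0.00071, 0],
}


def relative_shares(values):
    """Scale the values so that they sum to 1 (one vertical slice of the plot)."""
    total = sum(values)
    return [v / total for v in values] if total else [0.0] * len(values)


shares = {age: relative_shares(values) for age, values in r_by_age.items()}
for age, slice_ in shares.items():
    # Print each slice as percentages per n.
    print(age, [round(100 * s, 1) for s in slice_])
```

Under this assumed normalization the unigram share drops between ages 2 and 3 while the bigram share grows, in line with the trend described for Figure 2.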
