Hielke Prins
6359973
The role of rules in language acquisition has been heavily debated in the past, as illustrated by the
papers of Saffran et al. [1] and Marcus et al. [2] on artificial language learning. Whereas Marcus argues
in favor of a rule-based mechanism that facilitates the learning process, Saffran and others showed that
transition probabilities between syllable pairs alone can account for the performance of infants on word
segmentation. They suggest that similar statistical cues might play a role in the acquisition of other
aspects of language. In order to evaluate this claim using evidence from natural language acquisition,
the similarity between infant utterances and a reference corpus of adult dialogues is tracked over time
by comparing n-gram frequencies.
Since the n-grams do not explicitly refer to variable positions and both sets are unordered, testing this hypothesis addresses the question of whether a statistical mechanism could, at least in principle, account for natural language acquisition.

METHODS

Corpora

Infant corpora were taken from the Brown corpus. Utterances of the children Adam (55 sessions) and Sarah (139 sessions) are considered. The children were tracked when they were approximately 2 to 4 years old.

The TüBa-E/S corpus of spoken English served as a reference, as did each mother's interaction with her child in the Brown corpus itself.

Preprocessing

The Brown CHILDES corpora were segmented into words. Words were defined as sequences of uppercase or lowercase characters, brackets and the apostrophe.

The TüBa-E/S corpus was stripped of its syntactic annotations. This resulted in 29672 sentences.

Similarity measures

To measure the similarity between the n-gram distributions, the notions of resemblance and containment defined by Broder [4] are used. Resemblance is the number of matches between the elements of two sets of n-grams, scaled by the size of their union:

$$R(A, B) = \frac{|N_A \cap N_B|}{|N_A \cup N_B|} \qquad (1)$$

Here N_A and N_B are the n-gram sets of the two corpora being compared. Because the reference set is much larger than the sets compiled from the infant utterances, containment is used as an asymmetric measure, scaled only by the size of the reference set:

$$C(A, B) = \frac{|N_A \cap N_B|}{|N_B|} \qquad (2)$$

Containment thus gives the proportion of n-grams in the reference set that are also in the compared set. This measure has been used before by Barrón-Cedeno and Rosso [5] to compare two sets of n-grams in order to detect plagiarism, but it is sensitive to the relative size of the infant n-gram set.

RESULTS

Context size

One would expect the chance of finding common n-grams in two different corpora of significant size to decrease as n increases. Not only are there simply fewer n-grams of larger size, they also include more different words, making them more and more context dependent.

On the other hand, reoccurring sequences of words might be more characteristic of the underlying language than individual words are. During early language acquisition there might thus be a tendency toward increasing similarity as n increases.

The graphs in Figure 1 indeed show that Adam and Sarah have more bigrams (n = 2) in common when they are 3 years old but more unigrams when they are 2 years old (yellow line), despite the fact that there are in general more bigrams than unigrams in all corpora with sentences longer than one word. Between the ages of 3 and 4 there are no clear differences.

The measure of resemblance defined above (Formula 1) reflects this effect in the difference between unigrams and bigrams, which decreases by approximately 5 percentage points (last column of Table 1). Both measures show an increasing fit to the exponential drop with distance that one would expect for Markov models.

There is a clear effect of the age of the individual children on the frequencies of longer n-grams. The occurrence of higher values of n increases with age, increasing the length of the right tail of the distribution.

The similarity of the distributions over n for the two children seems to be only slightly affected by their age. The blue and red bars show the n-grams the infants do not share with each other.
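
As an illustration of the measures used here, the following Python sketch builds n-gram sets from utterances and computes resemblance (Formula 1), containment (Formula 2), and unique and common counts of the kind plotted in Figure 1. It is not the code used for this study: the regular expression is one possible reading of the word definition given under Preprocessing, and the function names, the case-sensitive matching, the restriction of n-grams to single utterances, and the toy utterances are all assumptions made for the example.

```python
import re
from typing import Iterable, List, Set, Tuple

Ngram = Tuple[str, ...]

# Assumption: one reading of "sequences of uppercase or lowercase characters,
# brackets and the apostrophe" from the Preprocessing section.
WORD_RE = re.compile(r"[A-Za-z()\[\]']+")


def tokenize(utterance: str) -> List[str]:
    """Split an utterance into words (case is kept, which is an assumption)."""
    return WORD_RE.findall(utterance)


def ngram_set(utterances: Iterable[str], n: int) -> Set[Ngram]:
    """Return the set of distinct n-grams, never crossing utterance boundaries."""
    grams: Set[Ngram] = set()
    for utterance in utterances:
        words = tokenize(utterance)
        for i in range(len(words) - n + 1):
            grams.add(tuple(words[i:i + n]))
    return grams


def resemblance(a: Set[Ngram], b: Set[Ngram]) -> float:
    """Formula 1: shared n-grams scaled by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a: Set[Ngram], b: Set[Ngram]) -> float:
    """Formula 2: shared n-grams scaled by the size of the reference set b."""
    return len(a & b) / len(b) if b else 0.0


def split_counts(a: Set[Ngram], b: Set[Ngram]) -> Tuple[int, int, int]:
    """Counts of n-grams only in a, only in b, and common to both."""
    common = a & b
    return len(a - common), len(b - common), len(common)


# Invented toy utterances, not taken from any of the corpora.
child = ["I want cookie", "want more cookie"]
adult = ["do you want more", "I want a cookie"]

for n in (1, 2):
    a, b = ngram_set(child, n), ngram_set(adult, n)
    print(n, split_counts(a, b), resemblance(a, b), containment(a, b))
```

Working with sets rather than frequency counts mirrors the definitions in Formulas 1 and 2, where only the presence of an n-gram matters and not how often it occurs.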
The striking increase in visual resemblance is probably mainly caused by increasing symmetry in set size (see subtotals in Table 1).

[Figure 1: panels "2 years old" and "4 years old"; x-axis: n (1 to 9); y-axis: # n-grams]
Figure 1: Distribution of unique (bars) and common (yellow line) n-grams over n for Adam (blue) and Sarah (red) from the Brown corpus. Dashed green line shows the resemblance between the two sets.

Plotted this way, the distribution of common n-grams over n is depicted by the relative amount of the various colors in a vertical slice of the graph. The trends in both graphs are alike, showing that adults and infants share an increasing common set of longer sequences at the expense of the amount of shared unigrams and between the infants' second and third year.

DISCUSSION

The results suggest that infants indeed familiarize themselves with frequently reoccurring sequences in the linguistic input they are offered, and that they can do so (to some extent) using only statistical cues.

Most of the action seems to take place between the ages of 2 and 3. For future work it might therefore be interesting to increase the temporal resolution between these ages by grouping the sessions in smaller bins (e.g. half a year or a month).

A quick analysis of the rules used in the automatically annotated sets shows, however, that rule length and complexity likewise grow in the same period, leaving the question formulated in the title as yet unanswered.
Comparing the set distributions (including the n-gram frequencies) might shed some light on the actual kind of n-grams learned during acquisition, providing a better-informed preliminary answer to this question, but this is left for future investigation.
[Figure 2: left panel R(Adam, mother), right panel C(Adam, T-E/S); x-axis: Age (years), 2 to 4; y-axis: 0% to 100%; legend: n = 1 to 9]
Figure 2: Relative resemblance (Formula 1) between Adam and his mother (left) and containment
(Formula 2) of the reference set in that of Adam, plotted against age. The different colors depict
different sizes of n.
REFERENCES
[1] J. R. Saffran, R. N. Aslin, and E. L. Newport, “Statistical learning by 8-month-old infants,”
Science, vol. 274, no. 5294, p. 1926, 1996.
[2] G. F. Marcus, S. Vijayan, S. Bandi Rao, and P. M. Vishton, “Rule learning by seven-month-old
infants,” Science, vol. 283, no. 5398, pp. 77–80, Jan. 1999.
[3] R. Brown, “A first language: The early stages,” 1973.
[4] A. Z. Broder, “On the resemblance and containment of documents,” in Compression and
Complexity of Sequences 1997. Proceedings, pp. 21–29, 2002.
[5] A. Barrón-Cedeno and P. Rosso, “On automatic plagiarism detection based on n-grams
comparison,” Advances in Information Retrieval, pp. 696–700, 2009.
age  n  Adam (unique)  Sarah (unique)  common  R(Adam, Sarah)
2    1            865             519     576         0.29387
2    2           4634            2153     363         0.05076
2    3           3022            1377      33         0.00744
2    4           1163             413       4         0.00253
2    5            403              91       0               0
2    6            162              17       0               0
2    7             75               2       0               0
2    8             45               1       0               0
2    9             34               0       0               0
subtotal        10403            4573     976
3    1           1340             843     940         0.30099
3    2           8916            4116    1683         0.11437
3    3          12515            4940     614         0.03398
3    4           9513            3165     122         0.00953
3    5           5537            1560      11         0.00154
3    6           2739             646       0               0
3    7           1229             241       0               0
3    8            528              91       0               0
3    9            239              34       0               0
subtotal        42556           15636    3370
4    1            735             982     911         0.34665
4    2           5398            5616    1733         0.13595
4    3           7757            7577     731          0.0455
4    4           6300            5683     166         0.01366
4    5           3973            3316      22           0.003
4    6           2199            1702       5         0.00128
4    7           1120             825       1         0.00051
4    8            547             420       0               0
4    9            276             236       0               0
subtotal        28305           26357    3569
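
The resemblance column can be cross-checked against the counts. The sketch below assumes that the Adam and Sarah columns hold the n-grams unique to each child, so that the union in Formula 1 is the sum of the three count columns. This reading is an assumption, although it is in line with the description of the bars in Figure 1. The function name is hypothetical, and only the rows for age 2 with a nonzero common count are used.

```python
# Rows for age 2, copied from the table above:
# n -> (unique to Adam, unique to Sarah, common, listed resemblance)
AGE_2_ROWS = {
    1: (865, 519, 576, 0.29387),
    2: (4634, 2153, 363, 0.05076),
    3: (3022, 1377, 33, 0.00744),
    4: (1163, 413, 4, 0.00253),
}


def resemblance_from_counts(only_a: int, only_b: int, common: int) -> float:
    """Formula 1 with |union| = only_a + only_b + common (assumed column meaning)."""
    union = only_a + only_b + common
    return common / union if union else 0.0


for n, (adam, sarah, common, listed) in AGE_2_ROWS.items():
    computed = resemblance_from_counts(adam, sarah, common)
    print(f"n={n}: computed {computed:.5f}, listed {listed}")
```

Under this reading the computed values agree with the listed ones to within about one unit in the fifth decimal place.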
n  age  C(Adam, T-E/S)  Adam (unique)  mother (unique)  common  R(Adam, mother)
1  2            0.1661            447              437     994          0.52928
1  3           0.24702            758              560    1522          0.53591
1  4           0.22351            766              271     880          0.45905
2  2           0.01473           4039             4663     958          0.09917
2  3            0.0543           7373             6029    3226          0.19401
2  4           0.04563           5595             2654    1536          0.15697
3  2           0.00067           2883             6581     172          0.01784
3  3           0.00874          11464            10882    1665          0.06934
3  4           0.00743           7831             4194     657           0.0518
4  2           0.00002           1146             5572      21          0.00311
4  3           0.00107           9162            10467     473          0.02352
4  4           0.00109           6270             3805     196          0.01908
5  2                 0            401             3929       2          0.00046
5  3           0.00014           5436             7958     112          0.00829
5  4           0.00012           3931             2817      64          0.00939
6  2                 0            162             2579       0                0
6  3                 0           2708             5525      31          0.00375
6  4                 0           2184             1933      20          0.00483
7  2                 0             75             1630       0                0
7  3                 0           1220             3725       9          0.00181
7  4                 0           1117             1287       4          0.00166
8  2                 0             45             1035       0                0
8  3                 0            527             2488       1          0.00033
8  4                 0            546              846       1          0.00071
9  2                 0             34              658       0                0
9  3                 0            239             1678       0                0
9  4                 0            276              559       0                0
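
The Results describe the drop of resemblance with n as approximately exponential. One way to examine this is to fit a straight line to the logarithm of the resemblance values. The sketch below does so for R(Adam, mother) at each age, using the values from the table above and leaving out the zero entries, for which the logarithm is undefined. The function name and the choice of an ordinary least-squares fit are assumptions for illustration, not the analysis used in the study.

```python
import math

# R(Adam, mother) for n = 1, 2, ... copied from the table above;
# trailing zero entries are left out because log(0) is undefined.
R_BY_AGE = {
    2: [0.52928, 0.09917, 0.01784, 0.00311, 0.00046],
    3: [0.53591, 0.19401, 0.06934, 0.02352, 0.00829, 0.00375, 0.00181, 0.00033],
    4: [0.45905, 0.15697, 0.0518, 0.01908, 0.00939, 0.00483, 0.00166, 0.00071],
}


def log_linear_fit(values):
    """Least-squares fit of ln(value) = intercept + slope * n.

    Returns (slope, squared correlation between n and ln(value)).
    """
    xs = list(range(1, len(values) + 1))
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


for age, values in R_BY_AGE.items():
    slope, fit_quality = log_linear_fit(values)
    print(f"age {age}: slope {slope:.2f}, squared correlation {fit_quality:.3f}")
```

A squared correlation near 1 would be consistent with an exponential drop, in which case exp(slope) estimates the factor by which resemblance shrinks for each word added to the n-gram.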