
Statistische Methoden

Unix Tools - Tutorial

Sabine Schulte im Walde


with Matt Crocker and Enrico Lieblang

Computational Linguistics
Universität des Saarlandes

June 14, 2005

1. Show the frequency of each distinct vowel sequence in the file example,
e.g. 'a' or 'ie'. Do not distinguish upper and lower case.

tr 'A-Z' 'a-z' < example |
tr -sc 'aeiou' '\n' |
sort |
uniq -c

24 a
28 e
2 ea
1 ee
1 eo
17 i
1 io
14 o
1 oo
1 ou
8 u
3 ua
2 ui
1 uu
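How the pipeline works: -c complements the character set, so every non-vowel is mapped to a newline, and -s squeezes runs of newlines, leaving one vowel sequence per line. A tiny sketch on a made-up word rather than the example corpus:

echo 'Education' | tr 'A-Z' 'a-z' | tr -sc 'aeiou' '\n'
# should print the vowel sequences e, u, a, io, one per line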

2. What is the total number of word types for both corpora,
german and english, using the tokeniser above?

tr -sc 'a-zA-Z' '\n' < german |
sort | uniq | wc -w
18694

tr -sc 'a-zA-Z' '\n' < english |
sort | uniq | wc -w
3806
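Why wc -w and not wc -l: the tokeniser can emit a leading empty line (whenever the corpus does not start with a letter), which survives sort | uniq as a spurious "type"; an empty line contains no word, so wc -w ignores it. A sketch of an equivalent count over non-empty lines, assuming the same file names:

tr -sc 'a-zA-Z' '\n' < english | sort -u | grep -c .
# should reproduce the 3806 above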


3. Change the simple tokeniser so that it allows for numbers
as well as words. Now what is the total number of types
in both corpora?

tr -sc '0-9a-zA-Z' '\n' < german |
sort | uniq | wc -w
19066

tr -sc '0-9a-zA-Z' '\n' < english |
sort | uniq | wc -w
3872
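To inspect which types the extended tokeniser adds (numbers and mixed alphanumeric strings), one can diff the two sorted type lists; a sketch assuming bash process substitution and the same corpus files:

comm -13 <(tr -sc 'a-zA-Z' '\n' < german | sort -u) \
         <(tr -sc '0-9a-zA-Z' '\n' < german | sort -u) | head
# lines only in the second list, i.e. types that appear only once digits are allowed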


4. Build on the tokeniser above, so that all uppercase
characters are translated to lower case. For the German
corpus, compare the 20 most frequent words with the
20 most frequent words when case is left unchanged.
tr -sc '0-9A-Za-z' '\n' < german |
sort | uniq -c |
sort -nr | sed 20q

Case unchanged:
2559 der
2450 die
1762 und
1224 in
917 den
743 zu
699 das
668 von
616 mit
603 sich
596 ist
576 fuer
572 nicht
557 des
552 auf
535 dem
533 im
509 Die
470 ein
428 auch

tr 'A-Z' 'a-z' < german |
tr -sc '0-9a-z' '\n' |
sort | uniq -c |
sort -nr | sed 20q

Lowercased:
2960 die
2778 der
1840 und
1348 in
946 den
908 das
758 zu
702 von
675 mit
634 fuer
610 im
603 sich
597 ist
596 nicht
593 auf
557 des
547 ein
542 dem
507 auch
455 eine

5. What are the 10 most frequent English words
ending in 'ing'?

tr -sc '0-9A-Za-z' '\n' < english |
sort | uniq -c | sort -nr |
grep 'ing$' | head

169 going
78 something
67 meeting
47 thing
38 being
35 morning
32 saying
31 working
31 doing
28 reading
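Here grep 'ing$' filters the already-counted lines of uniq -c output; filtering the tokens before counting is an equivalent variant (not on the slides) and should yield the same ten lines:

tr -sc '0-9A-Za-z' '\n' < english | grep 'ing$' | sort | uniq -c | sort -nr | head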

6. The POS tagged Brown corpus is a bit of a mess. Tokenise in the usual way,
leaving tags attached to their words, and use sort/uniq to determine the
frequencies. Then use tr to put the tags for each token in a separate column,
and use grep to make sure you only process tagged words, and show the
10 most frequent words along with their tags.

tr -sc '/0-9A-Za-z' '\n' < brown_pos |
grep -v '^/$' |
sort | uniq -c |
grep '/' | tr '/' '\t' |
sort -nr | head

103 the DT
56 and CC
45 to TO
45 he PRP
44 a DT
36 of IN
35 was VBD
34 in IN
31 He PRP
29 I PRP
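The Brown tokens are assumed to look like word/TAG, e.g. barked/VBD; grep -v '^/$' discards stray lone slashes, grep '/' keeps only tokens that still carry a tag, and tr '/' '\t' moves the tag into its own column:

echo 'barked/VBD' | tr '/' '\t'
# prints the word and the tag separated by a tab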

7. Use sed to do simple stemming (i.e. remove: ly, ed, ing)
of the types in the English corpus, then determine the
new number of word types found in the corpus, and
compare this with your results from question 2.

tr -sc 'a-zA-Z' '\n' < english |
sort | uniq | wc -w
3806

tr -sc 'a-zA-Z' '\n' < english |
sed s/ly$//g | sed s/ed$//g | sed s/ing$//g |
sort -u | wc -w
3472
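Compared with question 2, the stripped list has 3806 - 3472 = 334 fewer types. The crude suffix removal also conflates unrelated words, which a toy input makes visible:

printf 'reading\nred\nthing\n' | sed s/ly$//g | sed s/ed$//g | sed s/ing$//g
# prints read, r, th: "red" and "thing" lose material that is not a suffix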


8. List the bigrams from the English sample in descending
order of frequency. What are the bigrams with a
frequency between 90 and 100?

tr -sc 'A-Za-z' '\n' < english > eng1

tail -n +2 eng1 > eng2

paste eng1 eng2 |
sort | uniq -c | sort -nr |
awk '90<=$1 && $1<=100'

99 will be
98 we re
95 to the
95 the President
92 on the
91 I don
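The shift-and-paste idea: eng2 is eng1 with the first token removed, so paste pairs every token with its successor. A toy token stream (throwaway file names t1 and t2 chosen here for illustration):

printf 'a\nb\nc\nd\n' > t1
tail -n +2 t1 > t2
paste t1 t2
# -> "a b", "b c", "c d", plus a final line pairing d with nothing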


9. For the German corpus, list all the bigrams with a
frequency of 4 or more which contain the word 'werden'.

tr -sc 'A-Za-z' '\n' < german > germ1

tail -n +2 germ1 > germ2

paste germ1 germ2 |
grep 'werden' |
sort | uniq -c | sort -nr |
awk '$1>=4'

16 werden die
8 zu werden
8 werden koennen
7 werden kann
7 werden Die
5 werden in
4 werden weil
4 werden sollen
4 werden musz
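Caveat: grep 'werden' also matches bigrams that merely contain the substring, e.g. geworden. Restricting the match to the whole word is possible with -w and should leave the list above unchanged as long as no such bigram reaches frequency 4:

paste germ1 germ2 | grep -w 'werden' | sort | uniq -c | sort -nr | awk '$1>=4'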


10. Find the POS bigrams in the Brown corpus, and
list the 10 most frequent.

tr -sc '/0-9A-Za-z' '\n' < brown_pos |
grep '/' | grep -v '^/$' |
tr '/' '\t' > word-tag

cut -f 2 word-tag > tag1

tail -n +2 tag1 > tag2

paste tag1 tag2 |
sort | uniq -c | sort -nr | head

143 DT NN
119 PRP VBD
87 IN DT
66 IN PRP
60 NN IN
47 NN PRP
45 VBD PRP
38 VBD RB
35 VBD DT
32 VBD IN
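The two grep filters keep only properly tagged tokens: grep '/' drops untagged strings and grep -v '^/$' drops stray lone slashes, so the resulting tag bigrams skip over anything that was dropped. A toy input (not from the corpus) shows the filtering:

printf 'the/DT\n/\nbroken\ndog/NN\n' | grep '/' | grep -v '^/$'
# keeps only the/DT and dog/NN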