
Exact Methods in the Study of Language and Text

Dedicated to Professor Gabriel Altmann On the Occasion of His 75th Birthday

Edited by Peter Grzybek & Reinhard Köhler

Mouton de Gruyter
Berlin · New York

Contents

Viribus Quantitatis
  Peter Grzybek and Reinhard Köhler
A diachronic study of the style of Longfellow
  Sergej N. Andreev
Zum Gebrauch des deutschen Identitätspronomens derselbe als funktionelles Äquivalent von Demonstrativ- und Personalpronomina aus historischer Sicht
  John Ole Askedal
Diversifikation bei Eigennamen
  Karl-Heinz Best
Bemerkungen zu den Formen des Namens Schmidt
  Hermann Bluhme
Statistical parameters of Ivan Franko's novel Perekhresni stežky (The Cross-Paths)
  Solomija Buk and Andrij Rovenchak
Some remarks on the generalized Hermite and generalized Gegenbauer probability distributions and their applications
  Mario Cortina-Borja
New approaches to cluster analysis of typological indices
  Michael Cysouw
Menzerath's law for the smallest grammars
  Łukasz Dębowski


Romanian online dialect atlas: Data capture and presentation
  Sheila Embleton, Dorin Uritescu, and Eric Wheeler
Die Ausdrucksmittel des Aspekts der tschechischen Verben
  Jeehyeon Eom
Quantifying the MULTEXT-East morphosyntactic resources
  Tomaž Erjavec
A corpus based quantitative study on the change of TTR, word length and sentence length of the English language
  Fan Fengxiang
On the universality of Zipf's law for word frequencies
  Ramon Ferrer i Cancho
Die Morrissche und die Bühlersche Triade – Probleme und Lösungsvorschläge
  Udo L. Figge
Die kognitive Semantik der Wahrheit
  Michael Fleischer, Michał Grech, und Agnieszka Książek
Kurzvorstellung der Korrelativen Dialektometrie
  Hans Goebl
A note on a systems theoretical model of usage
  Johannes Gordesch and Peter Kunsmann
Itemanalysen und Skalenkonstruktion in der Sprichwortforschung
  Rüdiger Grotjahn und Peter Grzybek
Do we have problems with Arens' law? A new look at the sentence-word relation
  Peter Grzybek and Ernst Stadlober
A language of thoughts is no longer an utopia
  Wolfgang Hilberg


Language subgrouping
  Hans J. Holm
Contextual word prominence
  Luděk Hřebíček
Das Menzerath-Gesetz in der Vulgata
  Marc Hug
Toward a theory of syntax and persuasive communication
  Julian Jamison
Grapheme und Laute des Russischen: Zwei Ebenen – ein Häufigkeitsmodell? Re-Analyse einer Untersuchung von A.M. Peškovskij
  Emmerich Kelih
Zur Zeitoptimierung der russischen Verbmorphologie
  Sebastian Kempgen
Ākāsha: between sphere and arrow – on the triple source for everything
  Walter A. Koch
Quantitative analysis of co-reference structures in texts
  Reinhard Köhler and Sven Naumann
Anthroponym – Pseudonym – Kryptonym: Zur Namensgebung in Erpresserschreiben
  Helle Körner
Quantitative linguistics within Czech contexts
  Jan Králík
Semantic components and metaphorization
  Viktor Krupa
Wortlängenhäufigkeit in J.W. v. Goethes Gedichten
  Ina Kühner


A general purpose ranking variable with applications to various ranking laws
  Daniel Lavalette
Wie schreibe ich einen Beitrag zu Gabriels Festschrift?
  Werner Lehfeldt und [Lösung im Text]
Bemerkungen zum Menzerath-Altmannschen Gesetz
  Edda Leopold
Die Stärkemessung des Zusammenhangs zwischen den Komponenten der Phraseologismen
  Viktor Levickij and Iryna Zadorožna
Pairs of corresponding discrete and continuous distributions: Mathematics behind, algorithms and generalizations
  Ján Mačutek
Linguistic numerology
  Grigorij Ja. Martynenko
Towards the measurement of nominal phrase grammaticality: contrasting definite-possessive phrases with definite phrases of 13th to 19th century Spanish
  Alfonso Medina-Urrea
A network perspective on intertextuality
  Alexander Mehler
Two semi-mathematical asides on Menzerath-Altmann's law
  Peter Meyer
Stylometric experiments in modern Greek: Investigating authorship in homogeneous newswire texts
  George K. Mikros
On script complexity and the Oriya script
  Panchanan Mohanty


Statistical analogs in DNA sequences and Tamil language texts: rank frequency distribution of symbols and their application to evolutionary genetics and historical linguistics
  Sundaresan Naranan and Vriddhachalam K. Balasubrahmanyan
Zur Diversifikation des Bedeutungsfeldes slowakischer verbaler Präfixe
  Emília Nemcová
Ord's criterion with word length spectra for the discrimination of texts, music and computer programs
  Michael P. Oakes
Indexes of lexical richness can be estimated consistently with knowledge of elasticities: some theoretical and empirical results
  Epaminondas E. Panas
Huffman coding trees and the quantitative structure of lexical fields
  Adam Pawłowski
Linguistic disorders and pathologies: synergetic aspects
  Rajmund G. Piotrowski and Dmitrij L. Spivak
Text ranking by the weight of highly frequent words
  Ioan-Iovitz Popescu
Frequency analysis of grammemes vs. lexemes in Taiwanese
  Regina Pustet
Are word senses reflected in the distribution of words in text?
  Reinhard Rapp
Humanities' tears
  Jeff Robbins
Wortlänge im Polnischen in diachroner Sicht
  Otto A. Rottmann


The Menzerath-Altmann law in translated texts as compared to the original texts
  Maria Roukk
Different translations of one original text in a qualitative and quantitative perspective
  Irma Sorvali
The effects of diversification and unification on the inflectional paradigms of German nouns
  Petra Steiner and Claudia Prün
Nicht ganz ohne . . .
  Thomas Stolz, Cornelia Stroh and Aina Urdze
Satz: stoisches axíōma oder peripatetischer lógos?
  Wolf Thümmel
Using Altmann-fitter for text analysis: An example from Czech
  Ludmila Uhlířová
Local grammars in word counting
  Duško Vitas and Cvetana Krstev
Fitting the development of periphrastic do in all sentence types
  Relja Vulanović and Harald Baayen
Language change in a communication network
  Eric S. Wheeler
Die Suche nach Invarianten und Harmonien im Bereich symbolischer Formen
  Wolfgang Wildgen
Applying an evenness index in quantitative studies of language and culture: a case study of women's shoe styles in contemporary Russia
  Andrew Wilson and Olga Mudraya


The weighted mid-P confidence interval for the difference of independent binomial proportions
  Viktor Witkovský and Gejza Wimmer
Gabriel Altmann: Complete bibliography of scholarly works (1960–2005)
Tabula Gratulatoria In Honor of Gabriel Altmann


Text ranking by the weight of highly frequent words

Ioan-Iovitz Popescu

"I am ill at these numbers..."
Hamlet, Act 2, Scene 2

Almost every scientist who orders published articles, their own or those of others, from the most to the least cited will conclude that only the head of the list is truly significant for, and visible to, the scientific community. I did this with my own papers when I posted them in descending order of citations on my website a few years ago (Popescu 2001). The question was whether a simple and objective cutoff for the head of the list exists. A proposal of this kind has only recently been put forward for quantifying the scientific output of individuals by a single, easily computable scientometric parameter (Hirsch 2005). This is the h-index, defined as the number h of papers with citation counts greater than or equal to h. For instance, a scientist with an h-index of h = 20 has published 20 papers that have received at least 20 citations each. The corresponding Hirsch point H(h, h) on the (rank, frequency) citation curve appears as a turning point, the point closest to the (rank, frequency) origin, as illustrated in Figure 1.

Figure 1: At Hirsch's point H(h, h) the frequency and rank (always positive integers) have the same (or the closest possible) value, frequency = rank = h being the h-index of the evaluated (rank, frequency) distribution

By construction, the (rank, frequency) citation distribution starts at rank one, corresponding to the most highly cited paper ("there's one in every crowd"), and ends at the rank equal to the total number of papers having at least one citation. Consequently, the total number of citations is given by the area under the (rank, frequency) citation curve. Hirsch also found that this area is proportional to h², i.e. Total Citation Count = ah², with the constant a ranging between 3 and 5 for papers in the field of physics. For university teachers in physics, Hirsch suggests that a value of h ≈ 12 would be a minimal threshold for an associate professor, while a value of at least h ≈ 18 is needed for advancement to full professor. At the very top of this scale there are scientists reaching up to about h = 100 in the physical sciences and almost h = 200 in the biological and biomedical sciences.

I will quote only the following relevant assertions from Hirsch's paper regarding the evaluation of scientific output by the h-index: "I argue that two individuals with similar h are comparable in terms of their overall scientific impact, even if their total number of papers or their total number of citations is very different. Conversely, that between two individuals (of the same scientific age) with similar number of total papers or of total citation count and very different h-values, the one with the higher h is likely to be the better scientist." To this I will add, however, that a fairer assessment criterion would perhaps be the cumulative citation percentage of the h most cited papers out of the overall number of citations.

The present work aims to provide empirical arguments for transferring the h-index concept from scientometrics to linguistics, in other words, for switching the problem from paper citation ranking to word frequency ranking. Three main classes of web text sources were used for this purpose, namely the Bible (Table 1), classical works (Table 2), and Nobel lectures (Table 3). More specifically, the (rank, frequency) word distributions of these widely known literary or scientific texts (see references) were produced with the help of word frequency counters available on the web (see references), then processed and cleaned of non-words with a Microsoft Excel program. Three important quantities describing the (rank, frequency) word distribution were worked out in this way and entered in Tables 1 to 3, as follows:

1. the text length or total word count (equivalent to the total citation count), representing the area under the (rank, frequency) word curve from the first rank (rank one) up to the last rank (given by the total number of unique words, i.e. the vocabulary), denoted in the table column headings as Total;
2. the h-index for words, which, by analogy with the index introduced by Hirsch (2005) for paper citations, indicates the width of the word distribution and is defined as the number h of unique words with counts greater than or equal to h;
3. the weight, or percentage, of the first h highly frequent words (hfw) out of the total word count (equivalent to the scientometric percentage of the first h highly cited papers).

Two other quantities fix the distribution scales but are left out of the tables so as not to overload them: (4) the vocabulary, i.e. the number of unique words, which gives the maximum value of the rank scale, and (5) the highest frequency of the word distribution, i.e. the number of occurrences of the rank-one word, which fixes the frequency scale.

A large variety of texts from various fields and of different sizes were compared by sorting the data by these indicators. Pasting the data of all three tables together, which gives a total of 151 texts, and sorting them by the first quantity (1), we see that the investigated text lengths range from a total word count of 53841 for Goethe's Faust 2 in Kline's English translation down to 295 for The Third Epistle of John. Likewise, sorting the data by the second quantity (2), the h-index, we find that its value ranges from h = 83 for the Books of Ezekiel and Jeremiah down to h = 6, again for 3 John.

As expected, the rankings by text length and by h-index are closely similar, inasmuch as the square of the h-index gives a fairly accurate estimate of the total number of words according to the relationship Total Word Count = ah² (the proportionality constant a, for the 151 tabulated texts, ranges between 4.5 and 9.5).

Last but not least, sorting the data by the weight of the first h highly frequent words (3), that is, by the normalized "hard core" of the word inventory, the joint listing reveals the top position of the Bible texts (with 15 Holy Books having an hfw weight from 65 to 60 per cent), followed by classical texts (hfw weight from Newton's 63 and Einstein's 55 down to Dante's 40 per cent) and, finally, by Nobel lecture texts (hfw weight from 47 to 27 per cent) and, almost within the same hfw bandwidth, current scientific papers, newspapers and random texts. In other words, the hfw criterion appears to be a consistent estimator of the ineffable "grace" under which the text has been created. The present text, for instance, excluding tables and figures, has an h-index of 13 and an hfw percentage of 33.
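The three tabulated quantities (Total, h, hfw) and the vocabulary can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used for the tables, which relied on web word counters and an Excel clean-up. The tokenizer (lowercase alphabetic strings with optional internal apostrophes) and all function names are assumptions introduced here.

```python
import re
from collections import Counter


def word_counts(text):
    """Count words using an assumed, simplified tokenizer (lowercase letters)."""
    return Counter(re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower()))


def h_index(frequencies):
    """Largest h such that h items have a frequency of at least h."""
    ranked = sorted(frequencies, reverse=True)
    h = 0
    for rank, freq in enumerate(ranked, start=1):
        if freq >= rank:
            h = rank
        else:
            break
    return h


def hfw_weight(frequencies):
    """Percentage of the total count covered by the first h ranked items."""
    ranked = sorted(frequencies, reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    return 100.0 * sum(ranked[:h_index(ranked)]) / total


def text_indicators(text):
    """Return total word count, vocabulary size, h-index and hfw weight."""
    freqs = list(word_counts(text).values())
    return {
        "total": sum(freqs),
        "vocabulary": len(freqs),
        "h": h_index(freqs),
        "hfw": hfw_weight(freqs),
    }
```

Applied to a real text, the values will depend on how words are delimited and which non-words are removed, so they need not match the tabulated figures exactly.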

Figure 2: Three grace rankings as revealed by the weight of highly frequent words

Figure 2 shows the separate hfw rankings of the texts tabulated above. Here again the ranking within and between the three text levels considered is evident. Clearly, more comparative research on similarities and differences of the word distributions is needed for a better understanding of the meaning of the hfw criterion and of the "divine art" of using the threads of highly frequent words in the text tapestry. Particularly striking is the hfw synergism of the parts of a text, as illustrated in Table 4 for Dostoevsky's Crime and Punishment. Although the six parts of the novel have almost the same hfw percentage when taken separately (51 to 53 per cent), this value increases significantly, to 64 per cent, when they are counted together. This and other related text features will be detailed elsewhere.

In summary, a simple and objective measure is proposed for evaluating a text by a single criterion, namely the percentage of the cumulated count of the first h words, ranked by decreasing frequency, out of the total word count. Any (electronic) text can be evaluated in this way in a matter of seconds. The highest hfw synergies found so far are The Bible (77 per cent), The Old Testament (76 per cent), The Pentateuch (72 per cent), The New Testament (70 per cent), The Four Gospels (67 per cent), Dickens' David Copperfield (68 per cent) and Great Expectations (65 per cent), Tolstoy's War and Peace (65 per cent), Dostoevsky's Crime and Punishment (64 per cent), Homer's Iliad (64 per cent) and Odyssey (64 per cent), and so on.

Acknowledgments. Thanks are due to linguist Professor Gabriel Altmann and to biophysicist Professor Daniel Lavalette for stimulating my interest in ranking matters, to physicists Professors Nicholas Ionescu-Pallas and Rudolf Emil Nistor for helpful discussions, and to chemist Professor Alexandru Balaban for pointing out Hirsch's recent scientometric paper.

Table 1: The books of the Bible sorted by decreasing weight of highly frequent words (Total = total word count, h = h-index, hfw = weight of the first h words in per cent)

ID  Text              Total   h  hfw
 3  Leviticus         24567  65   65
 2  Exodus            32692  75   64
24  Jeremiah          42671  83   64
26  Ezekiel           39428  83   64
 4  Numbers           32918  76   63
 5  Deuteronomy       28377  66   63
 9  1 Samuel          25051  71   63
 1  Genesis           38315  81   62
 6  Joshua            18858  58   60
11  1 Kings           24507  64   60
12  2 Kings           23521  61   60
14  2 Chronicles      26085  65   60
23  Isaiah            37037  71   60
42  Luke              25942  65   60
43  John              19116  62   60
 7  Judges            18953  59   59
10  2 Samuel          20598  60   59
19  Psalms            41551  74   59
40  Matthew           23696  63   58
13  1 Chronicles      20350  50   57
44  Acts              24262  66   57
20  Proverbs          15056  49   55
41  Mark              15157  50   55
46  1 Corinthians      9450  42   54
62  1 John             2506  23   54
66  Revelation        12001  45   54
18  Job               18107  51   53
15  Ezra               7445  33   52
27  Daniel            11588  42   52
38  Zechariah          6449  33   52
45  Romans             9417  42   52
17  Esther             5633  32   51
16  Nehemiah          10487  38   50
47  2 Corinthians      6061  33   50
21  Ecclesiastes       5586  32   49
28  Hosea              5181  30   49
30  Amos               4217  28   49
33  Micah              3156  25   45
49  Ephesians          3003  23   45
58  Hebrews            6902  33   45
25  Lamentations       3409  25   44
48  Galatians          3083  25   44
 8  Ruth               2567  22   43
36  Zephaniah          1617  18   43
39  Malachi            2567  22   43
50  Philippians        2155  21   43
52  1 Thessalonians    1834  19   43
29  Joel               2032  18   42
37  Haggai             1124  13   42
51  Colossians         1976  19   42
22  Song of Solomon    2656  21   41
60  1 Peter            2471  21   41
53  2 Thessalonians    1018  15   39
59  James              2304  20   39
31  Obadiah             665  11   38
34  Nahum              1278  13   37
32  Jonah              1321  15   36
54  1 Timothy          2244  19   36
35  Habakkuk           1463  15   35
55  2 Timothy          1661  17   34
61  2 Peter            1554  16   34
56  Titus               886  11   29
63  2 John              295   7   26
57  Philemon            423   8   24
65  Jude                608   8   24
64  3 John              295   6   22
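As a quick check of the relationship Total Word Count = ah² stated in the text, the constant a can be computed for any row as Total divided by h². The sketch below does this for three rows copied from Table 1; the function name and the choice of rows are illustrative assumptions.

```python
# (text, total word count, h-index) copied from Table 1
rows = [
    ("Leviticus", 24567, 65),
    ("Genesis", 38315, 81),
    ("3 John", 295, 6),
]


def proportionality_constant(total, h):
    """Return a in Total = a * h**2 for one text."""
    return total / h ** 2


constants = {name: proportionality_constant(total, h) for name, total, h in rows}

for name, a in constants.items():
    # The text reports that a lies between 4.5 and 9.5 for the tabulated texts.
    print(f"{name}: a = {a:.2f}")
```

For these three rows a comes out near 5.8, 5.8 and 8.2 respectively, inside the interval of 4.5 to 9.5 quoted in the text.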

Table 2: Classical works sorted by decreasing weight of highly frequent words (Total = total word count, h = h-index, hfw = weight of the first h words in per cent)

ID    Author [trans.]             Text                             Total   h  hfw
N1    Newton                      Principia (an excerpt)           35982  73   63
E1    Einstein [R.W. Lawson]      Relativity                       29368  63   55
N11   Newton                      Principia, Book III               7066  38   55
SC11  Shakespeare                 The merry wives of Windsor       23779  67   53
SC15  Shakespeare                 Twelfth night                    21483  64   52
ST07  Shakespeare                 Othello                          27939  69   51
SC07  Shakespeare                 Much ado about nothing           22579  62   51
D1I   Dante                       Divina Commedia 1 Inferno        22934  58   51
D1E1  Dante [H.W. Longfellow]     Divine Comedy 1 Hell             37031  69   51
SC06  Shakespeare                 Measure for measure              23137  63   51
ST03  Shakespeare                 Hamlet                           32223  73   51
SC03  Shakespeare                 As you like it                   22832  61   50
SC12  Shakespeare                 The taming of the shrew          22155  64   50
SC02  Shakespeare                 All's well that ends well        24368  64   50
ST02  Shakespeare                 Coriolanus                       29278  72   50
SH10  Shakespeare                 Richard III                      31426  71   49
SC16  Shakespeare                 Two gentlemen of Verona          18244  56   49
ST04  Shakespeare                 Julius Caesar                    20843  60   49
SC14  Shakespeare                 Troilus and Cressida             27614  73   49
SC17  Shakespeare                 Winter's tale                    25996  68   49
SC09  Shakespeare                 The comedy of errors             16181  50   49
SH01  Shakespeare                 Henry IV part 1                  26152  65   48
SC04  Shakespeare                 Cymbeline                        28985  72   48
SC10  Shakespeare                 The merchant of Venice           22210  61   48
SH02  Shakespeare                 Henry IV part 2                  27980  68   48
ST05  Shakespeare                 King Lear                        27803  70   48
ST01  Shakespeare                 Antony and Cleopatra             26963  68   48
SH06  Shakespeare                 Henry VI part 3                  25896  65   48
ST08  Shakespeare                 Romeo and Juliet                 25917  67   47
ST10  Shakespeare                 Titus Andronicus                 21723  64   47
D3E1  Dante [H.F. Cary]           Divine Comedy 3 Paradise         35345  70   47
SH07  Shakespeare                 Henry VIII                       25973  65   47
SH05  Shakespeare                 Henry VI part 2                  26806  66   47
SC05  Shakespeare                 Love's labour's lost             23048  63   47
G1E2  Goethe [G.M. Priest]        Faust 1                          32455  68   47
D1E2  Dante [H.F. Cary]           Divine Comedy 1 Hell             36476  69   46
SH03  Shakespeare                 Henry V                          27557  64   46
D2E1  Dante [H.F. Cary]           Divine Comedy 2 Purgatory        36560  70   46
SH09  Shakespeare                 Richard II                       23894  60   46
G1E1  Goethe [A.S. Kline]         Faust 1                          32874  68   46
SC01  Shakespeare                 A midsummer night's dream        17167  57   45
G2E1  Goethe [A.S. Kline]         Faust 2                          53841  78   45
SH08  Shakespeare                 King John                        21775  57   45
SH04  Shakespeare                 Henry VI part 1                  22846  62   45
SC08  Shakespeare                 Pericles, prince of Tyre         19560  59   45
SC13  Shakespeare                 The tempest                      17453  57   44
ST09  Shakespeare                 Timon of Athens                  19623  55   44
G1G   Goethe                      Faust 1                          30625  64   43
ST06  Shakespeare                 Macbeth                          18213  53   42
G2G   Goethe                      Faust 2                          44452  74   41
D2I   Dante                       Divina Commedia 2 Purgatorio     15400  42   40
D3I   Dante                       Divina Commedia 3 Paradiso        9577  36   40

Table 3: Nobel lectures sorted by decreasing weight of highly frequent words (Total = total word count, h = h-index, hfw = weight of the first h words in per cent)

Year and Field  Author                   Total   h  hfw
1965 Phys       Richard P. Feynman       11265  41   47
1908 Chem       Ernest Rutherford         5082  26   46
1938 Lit        Pearl Buck                9090  39   45
2004 Lit        Elfriede Jelinek          5746  33   45
1979 Peace      Mother Teresa             3822  26   44
1902 Phys       Hendrik A. Lorentz        7301  31   44
1911 Chem       Marie Curie               4319  25   43
1925 Med        Frederick G. Banting      8193  32   41
1925 Med        John Macleod              4862  24   41
1963 Peace      Linus Pauling             6246  28   41
1984 Lit        Jaroslav Seifert          5243  26   41
1920 Phys       Max Planck                5203  24   40
1970 Lit        Alexandr Solzhenitsyn     6516  32   40
1902 Phys       Pieter Zeeman             3480  21   39
1950 Lit        Bertrand Russell          5703  29   39
1973 Lit        Heinrich Böll             6094  28   39
1983 Peace      Lech Walesa               2587  19   39
1989 Peace      Dalai Lama                3601  23   39
1991 Peace      Mikhail Gorbachev         5693  26   39
1905 Med        Robert Koch               4283  24   38
1975 Econ       Leonid V. Kantorovich     3924  22   38
1989 Econ       Trygve Haavelmo           3186  21   38
1930 Lit        Sinclair Lewis            5007  25   37
1953 Peace      George C. Marshall        3249  19   37
1959 Lit        Salvatore Quasimodo       3698  21   37
1976 Lit        Saul Bellow               4775  26   37
1986 Econ       James M. Buchanan Jr.     4623  23   37
1975 Med        Renato Dulbecco           3675  22   36
1993 Lit        Toni Morrison             2972  22   36
1935 Chem       Irène Joliot-Curie        1105  12   32
1986 Peace      Elie Wiesel               2693  19   32
2002 Peace      Jimmy Carter              2330  16   30
1996 Lit        Wislawa Szymborska        1983  16   27

Table 4: Illustrating hfw synergism of various parts of Dostoevsky's Crime and Punishment (Total = total word count, h = h-index, hfw = weight of the first h words in per cent)

Text                    Total    h  hfw
Part 1                  35365   70   52
Part 2                  38653   76   52
Part 3                  29924   71   51
Part 4                  28342   67   51
Part 5                  28226   66   51
Part 6                  35900   74   53
Epilogue                 6336   30   44
Parts 1+2               74066  103   57
Parts 1+2+3            104028  123   59
Parts 1+2+3+4          132370  137   61
Parts 1+2+3+4+5        160617  154   62
Parts 1+2+3+4+5+6      196519  171   64
All Parts + Epilogue   202853  174   64
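Rows of the kind shown in Table 4 could be produced by merging the word counts of successive parts and recomputing h and hfw after each merge. The sketch below shows one possible way to do this with `collections.Counter`. The function names and the toy counts are hypothetical; they are not the novel's data and do not reproduce the tabulated values.

```python
from collections import Counter


def h_and_hfw(counts):
    """Return (h, hfw percentage) for a Counter of word frequencies."""
    ranked = sorted(counts.values(), reverse=True)
    # ranked is non-increasing, so freq >= rank holds exactly for ranks 1..h
    h = sum(1 for rank, freq in enumerate(ranked, start=1) if freq >= rank)
    total = sum(ranked)
    hfw = 100.0 * sum(ranked[:h]) / total if total else 0.0
    return h, hfw


def cumulative_indicators(parts):
    """Merge parts one by one; return (parts merged, total, h, hfw) per step."""
    merged = Counter()
    results = []
    for i, part in enumerate(parts, start=1):
        merged.update(part)  # adds the counts of this part to the running total
        h, hfw = h_and_hfw(merged)
        results.append((i, sum(merged.values()), h, hfw))
    return results


# Toy word counts for two hypothetical parts of a text
parts = [
    Counter({"the": 3, "a": 2, "of": 1}),
    Counter({"the": 3, "a": 2, "of": 2}),
]
results = cumulative_indicators(parts)
```

In this toy case the merged counts raise h from 2 to 3, which illustrates the mechanics only; whether and how much hfw grows for real texts is an empirical matter, as Table 4 shows for Crime and Punishment.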

References
Hirsch, Jorge E. 2005 An index to quantify an individual's scientific research output. In: arXiv:physics/0508025 v4, 23 Aug 2005. http://arxiv.org/PS_cache/physics/pdf/0508/0508025.pdf
Popescu, Ioan-Iovitz 2001 Cited Papers Ranked by Descending Citation Frequency. http://www.geocities.com/iipopescu/CITSH.htm

Main electronic text sources and tools used in this paper
The Bible (English King James Version). http://www.fourmilab.ch/etexts/www/Bible/
Shakespeare, The Complete Works. http://www-tech.mit.edu/Shakespeare/
Dante, Divina Commedia and Goethe's Faust. http://jollyroger.com/library/
Newton, The Principia. http://members.tripod.com/~gravitee/
Einstein, Relativity: The Special and General Theory. http://www.bartleby.com/173/
The Nobel Lectures. http://nobelprize.org/nobel/
Dostoevsky, Crime and Punishment. http://www.bartleby.com/318/
Word Frequency Counters:
http://www.georgetown.edu/faculty/ballc/webtools/web_freqs.html
http://www.writewords.org.uk/word_count.asp