Experiment No. 8

Lexical Diversity
14 February 2013


To create a vocabulary set for a given text collection and analyse its lexical diversity.


Python 2.7.3, module nltk

Theoretical Background:

Lexical diversity is a measure of vocabulary variation within a written text or a person's speech. The tokens of a text are all of its words, counted with repetition; a 100-word text therefore has 100 tokens. The types, or vocabulary, are the set of distinct tokens, in which each word appears exactly once. For example, if a text has 100 words but all of them are the same word, it has only one type; if all 100 words are different from each other, it has 100 types. The ratio of the number of tokens to the number of types is used here as the measure of the lexical diversity of the document.
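The two extreme cases described above can be checked directly with Python's built-in set and len. The following is a minimal sketch; the token lists are made up for illustration and are not taken from the experiment's corpus.

```python
# Hypothetical token lists illustrating the two extreme cases.
same_word = ["spam"] * 100                           # 100 tokens, all identical
all_different = ["word%d" % i for i in range(100)]   # 100 distinct tokens

for tokens in (same_word, all_different):
    types = set(tokens)                        # vocabulary: each word once
    ratio = float(len(tokens)) / len(types)    # average tokens per type
    print("%d tokens, %d types, ratio %.1f" % (len(tokens), len(types), ratio))
```

The first list gives a ratio of 100.0 (one word repeated 100 times) and the second a ratio of 1.0 (no word repeated).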

Algorithm and Data Structure design:

Input: A set of documents
Output: Vocabulary set for the collection of documents

Data structures: A list vocabulary

Steps:
1. Open module and select a textfile
2. Print list(textfile)
3. Print len(textfile)
4. Print set(textfile)
5. Print len(set(textfile))
6. Print float(len(textfile)) / len(set(textfile))
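The steps above can be sketched as a small function. This is an illustrative sketch only: the name lexical_diversity and the sample sentence are assumptions, and whitespace splitting stands in for the tokenised text file that the experiment selects from NLTK.

```python
def lexical_diversity(tokens):
    """Return (token count, type count, tokens per type) for a token list."""
    if not tokens:
        raise ValueError("token list is empty")
    vocabulary = set(tokens)            # duplicates removed
    n_tokens = len(tokens)
    n_types = len(vocabulary)
    # float() keeps the division exact under Python 2.7 as well.
    return n_tokens, n_types, float(n_tokens) / n_types


# Hypothetical sample; the experiment uses a text from the corpus instead.
sample = "the cat sat on the mat and the dog sat on the rug"
tokens = sample.split()
n_tokens, n_types, ratio = lexical_diversity(tokens)
print(sorted(set(tokens)))
print("%d tokens, %d types, ratio %.2f" % (n_tokens, n_types, ratio))
```

With NLTK installed and the Gutenberg corpus data downloaded, the token list could instead come from a call such as nltk.corpus.gutenberg.words('austen-emma.txt'); that substitution is not exercised in this sketch.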

Experimental setup:

The experiment is carried out by analyzing several texts from the Gutenberg corpus. The algorithm is implemented in Python. The following Python/NLTK tools are used for the implementation [?]:

len: Returns the length (the number of items) of an object. The argument may be a sequence (string, tuple or list) or a mapping (dictionary).

set: The built-in set type constructs and manipulates unordered collections of unique elements. Common uses include membership testing, removing duplicates from a sequence, and computing standard math operations on sets such as intersection, union, difference, and symmetric difference. Calling set() on a list eliminates the duplicate entries of the list. This property is used to create a vocabulary set from the tokens.

list: Returns a list whose items are the same and in the same order as the iterable's items. The iterable may be either a sequence, a container that supports iteration, or an iterator object.

The lexical richness of a given text is a count, on average, of how many times each word appears in the text. We get this by dividing the token count by the type count.
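As a small illustration of the built-ins listed above, the sketch below applies list, len and set to two made-up token lists and compares their vocabularies with the standard set operations. The variable names and sample text are assumptions, not part of the experiment.

```python
# Hypothetical token lists standing in for two texts.
text_a = list(("to", "be", "or", "not", "to", "be"))   # list() accepts any iterable
text_b = "to do or not to do".split()

vocab_a = set(text_a)   # duplicates removed
vocab_b = set(text_b)

print(len(text_a))                 # number of tokens in text_a
print(len(vocab_a))                # number of types in text_a
print("be" in vocab_a)             # membership test
print(sorted(vocab_a & vocab_b))   # intersection: words shared by both
print(sorted(vocab_a | vocab_b))   # union: combined vocabulary
print(sorted(vocab_a - vocab_b))   # difference: words only in text_a
print(sorted(vocab_a ^ vocab_b))   # symmetric difference
```

Sets are unordered, so sorted() is used here only to make the printed vocabularies easy to read.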

1. Since the number of types is in the denominator, a high token/type ratio indicates low diversity in the document.
2. As the size of the text increases, the vocabulary grows more slowly than the number of tokens.
3. The type-token ratio can act as an indicator of a person's vocabulary size.
4. The type-token ratio is also taken as an important indicator of an author's style.
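Observation 2 can be examined by counting types over growing prefixes of a token list. The sketch below is only an illustration: the function name vocabulary_growth and the toy sentence are hypothetical, and a meaningful check would use a full text from the corpus.

```python
def vocabulary_growth(tokens, step):
    """Return (prefix length, type count) pairs for growing prefixes."""
    seen = set()
    growth = []
    for i, token in enumerate(tokens, 1):
        seen.add(token)
        # Record a point every `step` tokens and at the end of the text.
        if i % step == 0 or i == len(tokens):
            growth.append((i, len(seen)))
    return growth


# Hypothetical toy text; repeated words make the type count lag behind.
tokens = ("the quick fox saw the lazy dog and the dog saw the quick fox "
          "and the fox ran and the dog ran").split()
for n_tokens, n_types in vocabulary_growth(tokens, 5):
    print("%2d tokens -> %d types" % (n_tokens, n_types))
```

In this toy text the token count reaches 22 while the type count levels off at 8, so the token/type ratio rises as the text gets longer.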

List of major references

1. Magnus Lie Hetland, Beginning Python: From Novice to Professional, 2008.
2. S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python, O'Reilly Media Inc., 2009.
3. Ivan V. Meza-Ruiz, Srinivasan C. Janarthanam, Bonnie Webber and Chris Gorgolewski, Introduction to NLTK, 16 October 2009.