Sie sind auf Seite 1von 16

The Python Papers

Volume 1, Issue 1
pypapers.cgpublisher.com
Journal Information

The Python Papers


ISSN: Under Application

Editors
Tennessee Leeuwenburg
Maurice Ling
Richard Jones

Referencing Information
Articles from this edition of this journal may be referenced as below:
Author, “Title” (2006) The Python Papers, Volume N, Issue M, pp. m:n

e.g. Maurice Ling, “An Anthological Review of Research Utilizing MontyLingua”, (2006) The Python Papers, Volume 1, Issue 1, pp. 7:15.

Copyright Statement

© Copyright 2006 The Python Papers and the individual authors


This work is copyright under the Creative Commons 2.5 license subject to Attribution, Noncommercial and
Attribution conditions. The full legal code may be found at http://creativecommons.org/licenses/by-nc-
sa/2.1/au/

First Published in 2006 in Melbourne, Australia.

Referees

1. Tennessee Leeuwenburg (tennessee@tennessee.id.au)


2. Hugo Liu, Massachusetts Institute of Technology
The Python Papers, Volume 1, Issue 1 1

The Python Papers


Volume One, Issue One : Spring 2006

In this Edition:

Articles and Editorials: Peer Reviewed Submissions:


Editorial... Page 2 MontyLingua... Page 5
Peer Reviewed Paper by Maurice Ling:
● Foreword
“MontyLingua, an integral part of
● Welcome to the first edition
ConceptNet which is currently the largest
● Introducing the team commonsense knowledge base, is an
English text processor developed using
Python programming language in MIT Media
Python Coding Idioms pt 1... Page 3 Lab. The main feature of MontyLingua is the
Tennessee Leeuwenburg: coverage for all aspects of English text
processing from raw input text to semantic
“The first things to come to grips with in any meanings and summary generation ... ”
programming language are its basics. This
can get you so far, but can leave you with a
series of ad hoc solutions which do not read This paper aims to review the use of and
naturally.” roles played by MontyLingua and its
components in research work published in
19 articles between October 2004 and
A discussion of using _methods() to weakly
August 2006. We had observed a diversified
indicate their intended use, and the conseq-
use of MontyLingua in many different areas,
uences this has for code re-use.
both generic and domain-specific. Although
the use of text summarizing component had
not been observe, we are optimistic that it
Python Events... Page 14 will have a crucial role in managing the
Tennessee Leeuwenburg: current trend of information overload in
future research.
● OSDC 2006 – Dec 6 to Dec 9
The Python Papers, Volume 1, Issue 1 2

Foreword: Introducing The Team


Welcome to the first edition
Tennessee Leeuwenburg – Editor-in-Chief
G'day pythonistas. Welcome to
the first edition of the The Tennessee Leeuwenburg is a software develo-
Python Papers. per working at the Australian Bureau of Met-
eorology on automatic text generation of
This journal came about from weather forecasts. Prior to this he spent time
the conspicuous absence of a working on meteorological data transfer
vehicle for quality-reviewed art- standards in the form of the OpenDAP
icles around which a comm- database system.
unity may be structured. In order to enc-
ourage people who may be using Python in an
amateur context, in their profession or Maurice Ling – Associate Editor
academically, a community is required such Maurice Ling is a PhD candidate in the
that current knowledge may be propagated department of Zoology of the University of
and contribution encouraged. Melbourne working on text analysis of bio-
Therefore the first task at hand is I think to logical literature for the purpose of under-
describe the purpose and scope of this standing hormone interactions in the mouse
magazine. mammary cell.

The Python Papers represents the efforts of


Australian (indeed Melbournian) Python en- Richard Jones – Associate Editor
thusiasts who are excited by the technology Richard is Common Ground's Senior Software
and want to help it flourish. Despite the Developer with 10 years' broad experience
common hometown of the editorial board, it is working with multiple languages and tools in
hoped that the journal will reach a worldwide web-based management systems, data ar-
audience. chive, meta-data systems, business systems,
In finding and including content, we are e-commerce and communications. He also
looking to mix in reader stories and feedback, runs the bi-annual PyWeek Game Program-
articles addressing specific areas of python ming Challenge and is involved with the
(such as recipes) etc, code reviews of problem organisation of the Open Source Developer's
code, and also peer-reviewed academic Conference in Melbourne, Australia.
articles.
As we are only a small team, we rely on Mike DeWhirst – Technical Assistant
reader support to drive content and take part Mike has been very helpful in providing
in bringing each edition of the magazine to mailing list facilities and collaboration tools
fruition. while setting up this publication.
This edition, perhaps understandably, draws
its content from those near to (and including!) Stephanie Chong – Design Assistant
the editorial team. It is hoped that this
offering will encourage others to contribute One of Stephanie's first comments on the The
articles so that we can expand our horizons. Python Papers was “how can we make it
prettier?” We hired her on the spot.
This journal offers a forum for both informal
and formal publication, so that whatever the Stephanie is currently studying Arts/Law at
level of interest, readers and authors can be the University of Melbourne.
engaged.
Oh, and don't forget to check out our blog for Contacting The Python Papers
regular announcements:
The editors may be contacted via email at:
http://pythonpapers.org/diary editor@pythonpapers.org
We welcome feedback in the blog comments. Article submissions gratefully accepted at:
submissions@pythonpapers.org
The Python Papers, Volume 1, Issue 1 3

Python Coding Idioms pt 1 – class interfaces


Tennessee Leeuwenburg

The first things to come to grips with in any programming


language are its basics – how to achieve tasks like calling id ·i ·om (n.)
a method, creating a class, reacting to user input and 2.2. The specific grammatical, syntactic,
and structural character of a given
creating data structures. language.
This can get you so far, but without some more subtle ...
tools in the box, it can leave you with a series of ad-hoc
5. A style of artistic expression
solutions which do not read naturally (by this I mean have characteristic of a particular individual,
obvious meaning) and, quite likely, result in inefficient school, period or medium: the idiom of
behavior. the French impressionists; the punk rock
idiom.
This article will concentrate on coding idiom, in particular
that of using consistent ways for declaring methods within Text 1: Source: dictionary.com
python code for indicating additional information about
their intended use. By adopting this idiom it is possible to tell at a glance which methods of a
class NewPayment:
class are intended for public consumption,
and which relate more to internal use.
def __init__(self, payment_info, callbacks):
self.payment_info = payment_info
self.callbacks = callbacks
Let's jump straight into an example (see Text
2).
def process(self):
'''
Processes the payment into the financial system Here we can see a class which represents a
''' payment in a fictional ordering system. The
self.recordPayment() payment class contains a trigger for its
self.despatchOrder()
self.sendConfirmation() processing and rules for performing that
self.concludePayment() processing.
def recordPayment():
''' What is not obvious at first glance is which of
Make a permanent record of the payment in the database
'''
those methods are intended for other aspects
pass of the system to interact with directly, and
def despatchOrder(): which of those methods are related to the
'''
Despatch order to the processing subsystem
internal processing logic. This article will
''' describe a habit which, if adopted, can
pass
provide an at-a-glance way to make that
def concludePayment():
'''
distinction. Such a distinction is useful in
Clean up object references, call any callbacks learning how to use a new piece of code (for
'''
example, extending and inheriting from an
for callback in self.callbacks:
callback.callback(self)
unfamiliar class) and also for documentation
purposes. The effects of this on the final code
Text 2: Initial ordering system without suggested notation will also be shown.
In
short:
What's a PEP ?
● 'Interface' methods are presented without a prefix A PEP is a Python Enhancement Proposal.
● 'Internal' methods which may be over-ridden are These documents make up what information
prefixed by a single underscore is available to the community, covering new
● 'Private' methods are prefixed by a double features, design decisions and other aspects
underscore of Python development. They are available
at http://www.python.org/dev/peps/

This is consistent with PEP-81, and is basically a subset of it. PEP-8's goals are to cover coding
style, which is a larger topic than this article can tackle.
To make a brief diversion, however, coding style is something which is often only appreciated
after some time spent programming. This is because its effects are seldom seen within small or

1 “Style Guide for Python Code”, Guido van Rossum: http://www.python.org/dev/peps/pep-0008/


The Python Papers, Volume 1, Issue 1 4

self-contained projects, but rather become apparent when examining code years later,
adapting someone else's code, or in small efficiency gains which come from having good
programming habits generally.

class NewPayment:

def __init__(self, payment_info, callbacks): The resulting code is shown (left, Text 3) and
self.payment_info = payment_info
self.callbacks = callbacks methods can be easily categorised into
def process(self):
interface and internal methods.
'''
Processes the payment into the financial system process() is an external method, intended for
'''
use by other classes and is part of the
self._recordPayment() interface.
self._despatchOrder()
self._sendConfirmation()
self._concludePayment() _recordPayment(), _despatchOrder() and
def _recordPayment():
_concludePayment() are all intended for
''' internal processing. They may still be
Make a permanent record of the payment in the database
''' overridden or accessed, as may all python
pass methods, however they are clearly
def _despatchOrder(): distinguished from the interface.
'''

'''
Despatch order to the processing subsystem
For a developer approaching this code for the
pass first time, it is immediately apparent which of
def _concludePayment(): the methods he should call in order to trigger
'''
Clean up object references, call any callbacks
processing of a payment, and which methods
''' can be left alone.
for callback in self.callbacks:
callback.callback(self) Similarly, should a bug be present in one
aspect of the order processing, it is easy to see
Text 3: New ordering system code with underscore notation
that it will lie within one of the internal
methods.
The Python Papers, Volume 1, Issue 1 5

An Anthological Review of Research Utilizing MontyLingua:


a Python-Based End-to-End Text Processor
Maurice HT Ling
Department of Zoology, The University of Melbourne, Australia

Correspondence: mauriceling@acm.org

Abstract
MontyLingua, an integral part of ConceptNet which is currently the largest commonsense
knowledge base, is an English text processor developed using Python programming language in
MIT Media Lab. The main feature of MontyLingua is the coverage for all aspects of English text
processing from raw input text to semantic meanings and summary generation, yet each
component in MontyLingua is loosely-coupled to each other at the architectural and code level,
which enabled individual components to be used independently or substituted. However, there
has been no review exploring the role of MontyLingua in recent research work utilizing it. This
paper aims to review the use of and roles played by MontyLingua and its components in
research work published in 19 articles between October 2004 and August 2006. We had
observed a diversified use of MontyLingua in many different areas, both generic and domain-
specific. Although the use of text summarizing component had not been observe, we are
optimistic that it will have a crucial role in managing the current trend of information overload
in future research.

Categories and Subject Descriptors


H.5.2 [User Interfaces]: Natural Language
I.2.7 [Natural Language Processing]: Language Parsing

1. Introduction
MontyLingua (web.media.mit.edu/~hugo/montylingua/) is a natural language processing engine
primarily developed by Hugo Liu in MIT Media Labs using the Python programming language,
which is entitled as “an end-to-end natural language processor with common sense ” (Liu,
2004). It is an entire suite of individual tools catering to all aspects of English text processing,
ranging from raw text to the extraction of semantic meanings and summary generation; thus,
end-to-end. Commonsense is incorporated into MontyLingua's part-of-speech (POS) tagger,
MontyTagger, as contextual rules.
MontyTagger was previously released by Hugo Liu as a standalone Brill-styled (Brill, 1995) POS
tagger in 2002 but is now packaged with other components forming MontyLingua. A Java
version of MontyLingua, built using Jython, had also been released. MontyLingua is also an
integral part of ConceptNet (Liu and Singh, 2004), presently the largest commonsense
knowledge base (Hsu and Chen, 2006), as a text processor and understander, as well as
forming an application programming interface (API) to ConceptNet. At the same time, it had
also been incorporated into Minorthird, a collection of Java classes for storing text, annotating
text, and learning to extract entities and categorize text, written by William W. Cohen in
Carnegie Mellon University (Cohen, 2004).
To date, there were only 2 modules specifically written to process English text using Python:
MontyLingua and NLTK (Loper and Bird, 2002). NLTK (Natural Language Toolkit) was developed
by Edward Loper (University of Pennsylvania) and Steven Bird (The University of Melbourne)
with the main purpose of teaching computational linguistics to computer science students
(Loper and Bird, 2002). Thus, NLTK is more of a text processing library from which text
The Python Papers, Volume 1, Issue 1 6

processing engines, such as MontyLingua, could be developed from, rather than a suite of
usable tools. This implied that MontyLingua could be re-implemented using NLTK but had not
been done. Another popular text processor is GATE (Cunningham, 2000), which was developed
in Java. The main difference between GATE and MontyLingua is that GATE is a template
processing engine rather than natural language processing.
ConceptNet and MontyLingua, as well as 15 applications of ConceptNet, had been previously
been described (Liu and Singh, 2004). However, there has not been any review since October
2004 updating the state-of-the-art use of either ConceptNet or MontyLingua. At the same time,
there has not been any review examining the roles played by MontyLingua and its components
in recent research work, especially post-October 2004. This paper aims to review the use of
and roles played by MontyLingua and its components in research work published between
October 2004 and August 2006.
The rest of this paper is organized as follows: Section 2 describes the distinctive feature and
main components of MontyLingua. In Section 3, we review 23 research publications, that were
published between October 2004 and August 2006, for the role played by MontyLingua and its
component in these research. Section 4 discusses some trends observed in these research.
However, it is not the aim of this paper to describe MontyLingua itself or the works using it, at
the source code level.

2. Distinctive Feature of MontyLingua


The distinctive feature of MontyLingua is the coverage for all aspects of English text processing
from raw input text to semantic meanings and summary generation, yet each component in
MontyLingua is loosely-coupled to each other at the architectural and code level. This had
enabled MontyLingua to be used in 3 different contexts: (1) as a suite of tools for processing
text to semantic meaning and summary generation; (2) decouple each component of
MontyLingua for individual use; (3) using MontyLingua as a baseline system and substituting
components to cater to specific applications. The end result of (2) and (3) may be the same but
the approaches are philosophically different. The rest of this section will focus on the individual
components making up MontyLingua and how (2) and (3) can be fulfilled.
MontyLingua consists of six components: MontyTokenizer, MontyTagger, MontyLemmatiser,
MontyREChunker, MontyExtractor, and MontyNLGenerator. MontyTokenizer, which is sensitive
to common abbreviations, separates the input English text into constituent words and
punctuations. Common contractions are resolved into their un-contracted form. For example,
“you're” is resolved to “you are”. MontyTagger is a Penn Treebank Tag Set (Marcus et al., 1993)
part-of-speech (POS) tagger based on Brill tagger (Brill, 1995) and enriched with commonsense
in the form of contextual rules. MontyLemmatiser strips any inflectional morphology from each
word. That is, verbs are reduced to infinite form and nouns to singular form. MontyREChunker
reads the POS sequence and identifies semantic phrases (adjective, noun, verb, prepositional)
using a series of Regular Expressions. MontyExtractor extracts phrases and subject-verb-object
triplets from the chunked text. Lastly, MontyNLGenerator uses the output of MontyExtractor to
generate text summaries.
At code level, each component resides in a file and is standalone. This feature enables each of
the six components to be used individually. In some of the research articles reviewed in Section
4 below, MontyTagger was used on its own. On the other hand, it also means that each of the
six components can be easily substituted to cater to specific applications. The simplest way to
do this is to modify the jist method in the class MontyLingua (file: MontyLingua.py) as follows:
The jist method illustrates the end-to-end process of MontyLingua.
def jist(self,text):
sentences = self.split_sentences(text)
tokenized = map(self.tokenize,sentences)
tagged = map(self.tag_tokenized,tokenized)
chunked = map(self.chunk_tagged,tagged)
# print "CHUNKED: " + string.join(chunked,'\n ')
extracted = map(self.extract_info,chunked)
return extracted
The Python Papers, Volume 1, Issue 1 7

The input text is tokenized, tagged, chunked by MontyTokenizer, MontyTagger, and


MontyREChunker respectively before phrase and subject-verb-object triplets are extracted by
MontyExtractor. Substituting each of these component is little more than re-directing the
execution to the substituted component and back.

3. Anthology of Applications Utilizing MontyLingua


Six research articles were retrieved from ACM Digital Library using “montylingua” as the search
term. A search using Google (search term: +montylingua +.pdf) added another 13 to the list;
consisting of 1 doctoral dissertation, 1 masters dissertation, 2 technical reports, and 9 articles.
This section will briefly describe the role of MontyLingua in each of these 19 publications
published between October 2004 and August 2006 in chronological order.

3.1. Chandrasekaran's Adaptive Multimodal Language Acquisition


(Chandrasekaran, 2004) attempted to develop a language acquisition system through
multimodal input. The system tries to initiate a dialog with the users to learn nouns, verbs, or
adjectives. Text input were POS tagged by MontyTagger to identify nouns, verbs, or adjectives.

3.2. ATHENS
ATHENS system (Skillicorn and Vats, 2004), developed in Queen's University, Canada, is a web-
mining tool for information discovery. A case study on extracting knowledge on terrorism was
presented. The authors extracted 9 clusters of information which summarized the events as of
September 12, 2001 using the search terms “al Qaeda” and “bin Laden”. After retrieving a list
of web-pages through Google WebAPI, MontyTagger was used to generate a list of nouns, which
was then filtered for a list of discriminatory nouns by comparison to the relative frequency in
British National Corpus (www.natcorp.ox.ac.uk). A page-page Jaccard similarity matrix (Bradeen
and Havey, 1995) was computed using the frequencies of discriminatory nouns on each page
which considered multiple search terms (2 search terms in this case). Finally, a 2-pass
clustering was performed – first on the entire set of retrieved web-pages, followed by clustering
within each of the top level clusters. A list of descriptive nouns were generated for each cluster.
Iterative search can be done using the list of descriptors for each cluster.

3.3. HyperPipes
Eisenstein and Davis (2004) attempted to develop a human gesture classifier, HyperPipes, into
4 categories (deictic, action, other, unknown) using only linguistics information. A set of
manually classified gestures with the corresponding transcribed speech were extracted from 9
persons (not physics or mechanically trained) describing 3 objects: a latchbox, a piston, and a
pinball machine. MontyLingua was used for POS tagging and stemming of the transcribed
speech. A number of features were extracted from MontyLingua-processed text, including
unigrams, bigrams and trigrams. Comparing a baseline classification where all gestures are
deictic (48.7% accurate), HyperPipes achieved an accuracy of 66%. This was compared to
Naïve Bayes (59%), C4.5 (56%) and SVM (56%). This was also compared to manual
classification with only audio information, that is, humans listening to the speech without
watching the video footage, which only achieved 45% accuracy.

3.4. Udani et. al.'s Noun Sense Induction


Word sense induction refers to inferring contextual senses of an ambiguous word (words with
multiple meanings) which is a crucial aspect of text understanding. Udani et. al. (2005)
attempted to advance this field by bootstrapping on the the large body of contextual
information available online for sense induction of nouns. MontyLingua was used to tag and
stem the first 500 research result titles and snippets from Google for clustering. The system
was evaluated on 5 terms and demonstrated 85.7% accuracy in noun sense induction as
compared to the random chance of 31.6% accuracy.
The Python Papers, Volume 1, Issue 1 8

3.5. MontyTagger as a Teaching Tool


Light et. al. (2005) observed increasing numbers of non-computer science student interested in
learning about natural language processing. However, these students had difficulty in
understanding programming and Unix to use computational linguistics tools effectively. Light
et. al. (2005) constructed a web-based interface to nine computational linguistic tools,
including MontyTagger.

3.6. Text Processing of Economics Literature


Nee Jan van Eck's masters dissertation at the Econometric Institute of Erasmus University
Rotterdam focused on text processing of economics literature for the purpose of extracting
economics-relevant terms and presenting it as a concept map linking these terms (van Eck,
2005, van Eck and van den Berg, 2005). MontyLingua was used to tokenize, POS tag, and stem
economics literature prior to linguistics and statistical filtering for relevant terms.

3.7. Metafor
Metafor was developed as a structure generation tool to convert everyday English language
into Python codes (Liu and Lieberman, 2005), which is a common task for programmers who
need to implement requirements into systems. MontyLingua was used to process input text
into subject-verb-object(s) triplets which were anaphorically dereferenced using ConceptNet
(Liu and Singh, 2004). Programmatic entities forming the core generated codes were performed
in three parts. Firstly, a set of semantic recognizers were used over the subject-verb-object(s)
triplets to identify code structures, such as lists, quotes, and if-else structures. Secondly,
actions or changes to the extracted code structures were identified which would be used to
form the class functions. Lastly, the context of the actions were identified. That is, which
actions affect which objects. These programmatic entities were then used to generate Python
codes. Although it is not likely that the generated Python code is executable, Metafor is likely to
be adopted as a brainstorming tool according to a case study done by the authors (Liu and
Lieberman, 2005).

3.8. Richardson and Fox's Concept Map Based Cross-Language Resource Learning
Concept map was described by Joseph Novak as “graphical representations of knowledge that
are comprised of concepts and the relationships between them” (Novak and Gowin, 1984)
which had been shown to facilitate a student's learning process (McNaught and Kennedy,
1997). Richardson and Fox (2005) examined the role of concept maps as a cross-language
learning resource by giving a set of articles written in Spanish and their English translations to
a control group of student, whereas the experimental group received the same materials as the
control group supplemented with concept maps produced by domain experts. The experimental
group performed significantly better than the control suggesting the advantage of having a
concept map. MontyTagger was used to extract nouns which were subsequently used to form
the nodes on the concept automatically in further experiments but the authors did not evaluate
the differences in the nodes of the concept maps produced by domain experts and that of
MontyTagger.

3.9. QABLe
QABLe (Question-Answering Behavior Learner) used prior learning and problem solving
strategies (Tadepalli and Natarajan, 1996) in text understanding for question and answer (Grois
and Wilkins, 2005b, Grois and Wilkins, 2005a). MontyTagger was used for both processing of
text, which was to be understood, and the questions. A prior system, Deep Read (Hirschman et
al., 1999), was evaluated using Remedia Corpus (a collection of 115 children's stories provided
by Remedia Publications). Using the same corpus, QABLe achieved 48% accuracy, compared to
36% by Deep Read (Grois and Wilkins, 2005b, Grois and Wilkins, 2005a).

3.10. Arizona State University BioAI group in TREC 200 5


The Text Retrieval Conference (TREC) Genomic Track 2005 is an ad-hoc document retrieval task
in 5 different areas of 10 instances each. The Arizona State University BioAI group (Yu et al.,
2005) chose to use Apache Lucene (lucene.apache.org) to retrieve abstracts from PubMed,
The Python Papers, Volume 1, Issue 1 9

which were POS tagged using MontyTagger and anaphorically resolved. Facts from the
processed abstracts were extracted by template matching. Evaluations by TREC were based on
the top 10 and 100 retrieved abstracts respectively. Yu et. al. (Yu et al., 2005) achieved 27%
precision and 11% precision on the top 10 and 100 abstracts respectively.

3.11. SkillSum
Reiter and Dale said that “the goal of many NLG [natural language generation] systems is to
produce documents which are as similar as possible to documents produced by human
experts” (Reiter and Dale, 2000). One of the difficulties is to decide what goes into the
generated document, the context selection rules, and it is also known that corpora of expert-
written text may not form the gold standards as expert may disagree or vary in opinions (Reiter
and Sripada, 2002). From a set of skills test results and authored evaluations, SkillSum
attempted to derive context rules (Williams and Reiter, 2005). MontyLingua was used to parse
authored evaluations to identify message types (Geldof, 2003), followed by Rhetorical Structure
Theory analysis. A trial by the authors suggested that users preferred SkillSum's report over
basic numerical test scores (Reiter and Dale, 2000).

3.12. Kennedy, Natsev and Chang's Query Class Induction for Multimodal Video Search
One of the more sophisticated forms of search techniques is multimodal search which assumes
the set of items to be searched takes on different roles and specific search techniques, when
applied, could improve overall retrieval performance. For example, a video clip in a collection
could be searched by title and subject classification (metadata), qualities of image or contents
of image (visual cues), dialogue or speech (audio cues), and subtitles (text). In multimodal
search, an important aspect is to be able to classify the search queries and studies in
multimodal video retrieval had used pre-defined classes (Chua et al., 2004, Yan et al., 2004).
Kennedy, Natsev and Chang proposed a framework for multimodal search without prior need
to define query classes by semantic analysis of the input query (Kennedy et al., 2005).
MontyLingua was used for POS tagging and stemming of the input query before constructing it
into an OKAPI query (Robertson et al., 1992). An improvement of 18% was realized over using
pre-defined query classes (Chua et al., 2004, Yan et al., 2004) by evaluating using TRECVID
2004 (Robertson et al., 1992).

3.13. Memsworldonline
Memsworldonline (Zhang et al., 2006a) was developed for information retrieval in domain-
specific digital libraries on microelectromechanical systems by using a combination of Formal
Concept Analysis (Priss, 1996) and information anchors. Information anchors are common
concepts in the field which allowed for examination into community dynamics (Troy et al.,
2006) or emerging trends (Kontostathis et al., 2003). For example, this paper is an information
anchor for MontyLingua (topic area). Other possible anchors are authors (related areas of
expertise) and institutions (research directions). Information anchors essentially consists of
keywords, key phrases, metadata, and inter-document relationships. MontyLingua was used in
Memsworldonline to extract nouns, noun phrases, and sub-phrases in documents as one of the
means to derive information anchors. These information anchors formed an ontology to classify
documents.

3.14. PEPURS
With increasing use of digital libraries comes the problem of author ambiguity (Torvik et al.,
2005), as author names could be written in various forms of initials and more than one
published authors may share the same initial. PEPURS attempted to advance the field of author
name clarification by analyzing author's websites for publication records and segmenting these
records into appropriate data fields (Zhang et al., 2006b). Each publication record is tagged
twice, once by a purpose-built tagger, and by MontyTagger. These were then used as input for
B-classifier and P-classifier running in parallel to segment the publication records before
merging the results from the classifiers using a decision tree (Mitchell, 1997). The three
classifiers ran as a stacked generalization procedure (Wolpert, 1992).
The Python Papers, Volume 1, Issue 1 10

3.15. Automatic Construction of Domain-Specific Concept Structures


Libo Chen's doctoral dissertation at Technischen Universitat Darmstadt focused on automatic
construction of domain-specific concept structures (Chen, 2006) in response to the problem of
vocabulary mismatch in web search (Blair, 1986, Furnas et al., 1987) by constructing domain-
specific concepts and linking these terms. MontyTagger was used to POS tag web-pages prior to
concept extraction.

3.16. Red Opal


Feature selection of online product reviews is an important aspect of online shopping (Liu et al.,
2005). Red Opal (Scaffidi, 2006) used a probability-based algorithm in feature selection, and
comparing that to a support-based algorithm (Liu et al., 2005). MontyLingua was used for POS
tagging and stemming of online product reviews before processing by each of the two
algorithms for feature selection. The speed of MontyLingua's POS tagging and stemming
averaged at 301 milliseconds per review, with the fastest being 250 milliseconds, on a single
3GHz Pentium 4 processor with 1GB of RAM, running Windows XP Professional SP 2 and Sun's
J2RE 1.4.2 with 250MB heap size.

3.17. Hsu and Chen's Commonsense Query Expansion for Image Retrieval
Hsu and Chen (2006) investigated the usefulness of commonsense knowledge in image
retrieval which had been used previously in query expansion (Liu and Lieberman, 2002).
MontyLingua was used for POS tagging and stemming of the initial query before commonsense
query expansion by ConceptNet (Liu and Singh, 2004). From the evaluation results using the
ImageCLEF 2005 test collection (Clough et al., 2005), the authors concluded that introducing
commonsense knowledge into the retrieval task is suitable for precision-oriented tasks (Hsu
and Chen, 2006).

4. Discussion
MontyLingua was released in 2004 (Liu, 2004) and was described in October 2004 with
ConceptNet (Liu and Singh, 2004). In the same paper, 15 applications of ConceptNet were
featured. Since then, the state-of-the-art use of either ConceptNet or MontyLingua and roles
played by MontyLingua and its components in recent research work had not been reviewed.
This paper aims to review the use of and roles played by MontyLingua and its components in
research work published between October 2004 to August 2006.
Of the 17 research reports reviewed, all had used MontyTagger for POS tagging, 8 of them had
used MontyLemmatiser for stemming, and only 2 (Metafor and Memsworldonline) had used
MontyREChunker and MontyExtractor. None of the reviewed work seems to have used
MontyNLGenerator for text summarization.
An interesting observation is the use of MontyTagger in a wide context, such as web-pages
(Skillicorn and Vats, 2004, Udani et al., 2005), transcribed human speech (Eisenstein and Davis,
2004), economics papers (van Eck, 2005, van Eck and van den Berg, 2005), and biomedical
papers (Yu et al., 2005), despite the fact that MontyTagger was generically trained using Wall
Street Journal corpus. This might suggest that MontyTagger could be used in various context,
which is reflected in daily life where a non-legally trained person might still be able to read
legal text intelligently despite some inability to grasp the total meaning as appear to a legally
trained person. However, it had been shown that a generically trained POS tagger will perform
inadequately on domain-specialized text, such as biomedical literature (Tateisi and Tsuji, 2004).
In spite of this, MontyTagger had been used in specialized sitting (van Eck, 2005, van Eck and
van den Berg, 2005, Yu et al., 2005) which might suggest that the numerical measurement of
POS tagging accuracy may not correlate with the “functional” POS tagging accuracy. For
example, the word “book” can be tagged as “noun, base form” (NN) or “noun, singular form”
(NNS) but may be treated as an error when calculating POS tagging accuracy as the quotient
between the number of correctly tagged tokens and the total number of tokens.
Only 2 of the systems had used MontyREChunker and MontyExtractor. Metafor had used them
to gain semantic understanding of daily written language while Memsworldonline used them to
The Python Papers, Volume 1, Issue 1 11

process domain-specific text. Despite a small sample size of 2, a supportive case could be
made for the use of MontyREChunker and MontyExtractor in both generic text (Liu and
Lieberman, 2005) and domain-specific text (Zhang et al., 2006a).
In this review, we did not observe any applications of MontyNLGenerator. However, it is likely
that text summary may have a role in future in managing the current trend of information
overload. It is plausible that future research will place greater emphasis on summary
generation of domain-specific libraries as a whole or in a time-striated fashion, as an extension
of Memsworldonline. Web search could use natural language generation techniques to
summarize the results on-the-fly. Natural language generation could extend Metafor (Liu and
Lieberman, 2005) to include automated generation source code documentation. This could
then be used to identify code architectures and algorithms which is one of the problems in
program optimization by algorithm replacement (Metzger and Wen, 2000).
In summary, we had reviewed 19 articles published between October 2004 and August 2006
for the roles played by MontyLingua or its components in these studies, thereby updating the
state-of-the-art utility of MontyLingua. We had observed a diversified use of MontyLingua in
many different areas, both generic and domain-specific. Although the use of the text
summarizing component had not been observed, we are optimistic that it will have a crucial
role in managing the current trend of information overload in future research.

5. Acknowledgement
The author will like to thank the reviewers for their invaluable comments.

6. References
BLAIR, D. C. (1986) Indetermincy in the subject access to documents. Information Processing
and Management, 22, 229-241.
BRADEEN, J. M. & HAVEY, M. J. (1995) Restriction Fragment Length Polymorphisms Reveal
Considerable Nuclear Divergence within a Well-Supported Maternal Clade in Allium
Section Cepa (Alliaceae). American Journal of Botany, 82, 1455-1462.
BRILL, E. (1995) Transformation-based error-driven learning and natural language processing: a
case study in part of speech tagging. Computational Linguistics, 21, 543-565.
CHANDRASEKARAN, R. (2004) Using language structure for adaptive multimodal language
acquisition. Sixth International Conference on Multimodal Interfaces (ICMI'04). State
College, Pennsylvania, USA, ACM Press.
CHEN, L. (2006) Automatic construction of domain-specific concept structures. Technischen
Universitat Darmstadt.
CHUA, T. S., NEO, S. Y., WANG, K. Y., SHI, R., ZHAO, M. & XU, H. (2004) TRECVID 2004 search
and feature extraction task by NUS PRIS. TRECVID 2004 Workshop.
CLOUGH, P., MULLER, H., DESELAERS, T., GRUBINGER, M., LEHMANN, T. M., JENSEN, J. & HERSH,
W. (2005) The CLEF 2005 cross-language image retrieval track. 2005 Cross Language
Evaluation Forum.
COHEN, W. W. (2004) Minorthird: Methods for Identifying Names and Ontological Relations in
Text using Heuristics for Inducing Regularities from Data,
http://minorthird.sourceforge.net
CUNNINGHAM, H. (2000) Software Architecture for Language Engineering. Department of
Computer Science. University of Sheffield.
EISENSTEIN, J. & DAVIS, R. (2004) Visual and linguistic information in gesture classification. 6th
international conference on Multimodal interfaces. State College, PA, USA, ACM Press.
FURNAS, G. W., LANDAUER, T. K., GOMEZ, L. M. & DUMAIS, S. T. (1987) The vocabulary problem
in human-system communication. Communications of the ACM, 30, 964-971.
GELDOF, S. (2003) Corpus analysis for NLG. IN REITER, E., HORACEK, H. & DEEMTER, K. V. (Eds.)
9th European Workshop on NLG.
GROIS, E. & WILKINS, D. (2005a) Learning strategies for story comprehension: a reinforcement
learning approach. 22nd International Conference on Machine Learning. Bonn, Germany.
GROIS, E. & WILKINS, D. C. (2005b) Learning Strategies for Open-Domain Natural Language
The Python Papers, Volume 1, Issue 1 12

Question Answering. International Joint Conference on Artificial Intelligence 2005


(IJCAI05).
HIRSCHMAN, L., LIGHT, M. & BURGER, J. (1999) Deep Read: A reading comprehension system.
Annual Meeting of the Association of Computational Linguistics 99.
HSU, M.-H. & CHEN, H.-H. (2006) Information retrieval with commonsense knowledge. 29th
annual international ACM SIGIR conference on Research and development in
information retrieval. Seattle, Washington, USA, ACM Press.
KENNEDY, L. S., NATSEY, A. & CHANG, S.-F. (2005) Automatic discovery of query-class-
dependent models for multimodal search. 13th annual ACM international conference on
Multimedia. Singapore, ACM Press.
KONTOSTATHIS, A., GALITSKY, L. M., POTTENGER, W. M., ROY, S. & PHELPS, D. J. (2003) A survey
of emerging trend detection in textual data mining. IN BERRY, M. (Ed.) A comprehensive
survey of text mining. Springer-Verlag.
LIGHT, M., ARENS, R. & LU, X. (2005) Web-based interfaces for natural language processing
tools. 43rd Annual Meeting of the Association for Computational Linguistics - Effective
Tools and Methodologies for Teaching Natural Language Processing and Computational
Linguistics Workshop.
LIU, B., HU, M. & CHENG, J. (2005) Opinion Observer: Analyzing and comparing opinions on the
web. 14th International Conference on World Wide Web (WWW'05). ACM Press.
LIU, H. (2004) MontyLingua: An end-to-end natural language processor with common sense.
LIU, H. & LIEBERMAN, H. (2002) Robust photo retrieval using world semantics. LREC 2002
Workshop on Creating and Using Semantics for Information Retrieval and Filtering.
Canary Islands.
LIU, H. & LIEBERMAN, H. (2005) Metafor: visualizing stories as code. 10th International
Conference on Intelligent User Interfaces. San Diego, California, USA.
LIU, H. & SINGH, P. (2004) ConceptNet: A Practical Commonsense Reasoning Toolkit. BT
Technology Journal, 22, 211-226.
LOPER, E. & BIRD, S. (2002) NLTK: The natural language toolkit. ACL Workshop on Effective
Tools and Methodologies for Teaching Natural Language Processing and Computational
Linguistics. Philadelphia, Association for Computational Linguistics.
MARCUS, M. P., SANTORINI, B. & MARCINKIEWICZ, M. A. (1993) Building a large annotated
corpus of English: the Penn Treebank. Computational Linguistics, 19, 313-330.
MCNAUGHT, C. & KENNEDY, D. (1997) Use of concept mapping in the design of learning tools
for interactive multimedia. Journal of Interactive Learning Research, 8, 389-406.
METZGER, R. & WEN, Z. (2000) Automatic algorithm recognition and replacement: A new
approach to program optimization, The MIT Press.
MITCHELL, T. (1997) Machine Learning, McGraw Hill.
NOVAK, J. D. & GOWIN, D. B. (1984) Learning how to learn, Cambridge, UK, Cambridge
University Press.
PRISS, U. (1996) Formal concept analysis in information science. Annual Review of Information
Science and Technology, 40.
REITER, E. & DALE, R. (2000) Building natural language generation systems, Cambridge,
Cambridge University Press.
REITER, E. & SRIPADA, S. (2002) Should corpora text be gold standards for NLG? International
Conference of Natural Language Generation 2002 (INLG02).
RICHARDSON, R. & FOX, E. A. (2005) Using concept maps in digital libraries as a cross-language
resource discovery tool. 5th ACM/IEEE-CS Joint Conference on Digital libraries. Denver,
CO, USA, ACM Press.
ROBERTSON, S. E., WALKER, S., HANCOCK-BEAULIEU, M., GULL, A. & LAU, M. (1992) Okapi at
TREC4. Text Retrieval Conference 1992.
SCAFFIDI, C. (2006) Application of a probability-based algorithm to extraction of product
features from online reviews. Pittsburg, PA, USA, Carnegie Mellon University.
SKILLICORN, D. B. & VATS, N. (2004) Novel information discovery for intelligence and
counterterrorism. Kingston, Ontario, Canada, School of Computing, Queen's University.
TADEPALLI, P. & NATARAJAN, B. (1996) A formal framweork for speedup learning from problems
and solutions. Journal of Artificial Intelligence Research, 4, 445-475.
TATEISI, Y. & TSUJI, J. I. (2004) Part-of-Speech Annotation of Biology Research Abstracts. 4th
International Conference on Language Resource and Evaluation (LREC2004).
TORVIK, V. I., WEEBER, M., SWANSON, D. R. & SMALHEISER, N. R. (2005) A probabilistic similarty
The Python Papers, Volume 1, Issue 1 13

metric for Medline records: a model for author name disambiguation. Journal of the
American Society for Information Science and Technology, 56, 140-158.
TROY, A. D., ZHANG, G. Q. & MEHREGANY, M. (2006) Evolution of the Hilton Head Workshop
research community. Education Digest of the 2006 Solid-State Sensor and Actuator
Workshop.
UDANI, G., DAVE, S., DAVIS, A. & SIBLEY, T. (2005) Noun sense induction using web search
results. 28th Annual International ACM SIGIR Conference on Research and Development
in Information Retrieval. ACM Press.
VAN ECK, N. J. (2005) Towards automatic knowledge discovery from scientific literature.
Econometric Institute, Faculty of Economics. Rotterdam, Erasmus University.
VAN ECK, N. J. & VAN DEN BERG, J. (2005) A novel algorithm for visualizing concept
associations. 16th International Workshop on Database and Expert System Applications
(DEXA'05).
WILLIAMS, S. & REITER, E. (2005) Deriving Content Selection Rules from a Corpus of Non-
naturally Occurring Documents for a Novel NLG Application. IN BELZ, A. & VARGES, S.
(Eds.) Corpus Linguistics 2005 Workshop on Using Corpora for Natural Language
Generation. Birmingham, UK.
WOLPERT, D. (1992) Stacked generalization. Neural Networks, 5, 241-259.
YAN, R., YANG, J. & HAUPTMANN, A. G. (2004) Learning query-class dependent weights in
automatic video retrieval. ACM Multimedia 2004. ACM Press.
YU, L., AHMED, S. T., GONZALEZ, G., LOGSDON, B., NAKAMURA, M., NIKKILA, S., SHAH, K., TARI,
L., WENDT, R., ZEIGLER, A. & BARAL, C. (2005) Genomic information retrieval through
selective extraction and tagging by the ASU-BioAI group. 14th Text Retrieval Conference
(TREC2005).
ZHANG, G.-Q., TROY, A. D. & BOURGOIN, K. (2006a) Bootstrapping ontology learning for
information retrieval using formal concept analysis and information anchors. 14th
International Conference on Conceptual Structures. Aalborg, Denmark.
ZHANG, W., YU, C., SMALHEISER, N. & TORVIK, V. (2006b) Segmentation of publication records
of authors from the web. 22nd IEEE International Conference on Data Engineering
(ICDE'06). Atlanta, Georgia, IEEE Press.
The Python Papers, Volume 1, Issue 1 14

Python Events
Tennessee Leeuwenburg

OSDC 2006: Melbourne, December 5th-8th 2006


“The Open Source Developers' Conference is a conference designed for developers, by
developers. It covers numerous programming languages across a range of operating systems.
Talks vary from introductory pieces through to the deeply technical. With three talks on at any
time, spanning over three days; there is bound to be something of interest to any developer.”

This is the major event in the year's Python calendar, with a major component of the
presentations being from Python developers. Anthony Baxter (Python's release manager) is a
keynote speaker and will be looking at Python 3.0, IronPython (mono) and PyPy.

From the list of papers, a few immediately grab attention. “An Introduction to Plone: An Open
Source Content Management System” and “A Rails/Django Comparison” should provide web-
oriented Python developers with a good overview of available technologies. “Python 3.0” will
be on everyone's menu, as we look forward to language changes which all will need to be
aware of in times to come.

For more information, visit http://www.osdc.com.au/

Presentatio n Authors
Automatic Text Generation and Weather Forecasting Leeuwenburg, Tennessee
A Rails/Django Comparison Green, Alan and
Askins, Ben
Development of Mono Applications with Agile Languages Rees, Mark
Accessing Relational Databases with Python Todd, Andy
RESTful Software Development and Maintenance Hyland-Wood, David
Python in Mozilla Hammond, Mark
What's New In Python: 2006 Edition Baxter, Anthony
The Planet Feed Reader: Better Living Through Gravity Gardiner, Mary
An Introduction to Plone: An Open Source Content Management System Aune, Nate
What's Old Is New Again Jones, Richard
Shiny, Pretty Things Jones, Richard
Overview of Python ctypes Holkner, Alex
Python 3.0 (keynote) Baxter, Anthony

Das könnte Ihnen auch gefallen