Open Domain Factoid Question
Answering System
By Amiya Patanaik
(05EG1008)
MAY 2009
CERTIFICATE
This is to certify that the thesis entitled Open Domain Factoid Question
Answering System is a bona fide record of authentic work carried out by Mr. Amiya
Patanaik under my supervision and guidance in fulfilment of the requirements for the
award of the degree of Bachelor of Technology (Honours) at the Indian Institute of
Technology, Kharagpur. The work incorporated in this thesis has not, to the best of my
knowledge, been submitted to any other University or Institute for the award of any degree or
diploma.
Acknowledgement
I would also like to express my heartfelt gratitude to my co-guide Dr. S. K. Das and all
the professors of Electrical and Computer Science Engineering Department for all the
guidance, education and necessary skill set they have endowed me with, throughout my
years of graduation.
Last but not least, I would like to thank my friends for their help during the course
of my work.
Date:
Amiya Patanaik
05EG1008
Department of Electrical Engineering
IIT Kharagpur - 721302
Dedicated to
my parents and friends
ABSTRACT
Contents
CERTIFICATE
ACKNOWLEDGEMENT
DEDICATION
ABSTRACT
CONTENTS
LIST OF FIGURES AND TABLES
Chapter 1: Introduction
1.1 History of Question Answering Systems
1.2 Architecture
1.3 Question answering methods
1.3.1 Shallow
1.3.2 Deep
1.4 Issues
1.4.1 Question classes
1.4.2 Question processing
1.4.3 Context and QA
1.4.4 Data sources for QA
1.4.5 Answer extraction
1.4.6 Answer formulation
1.4.7 Real time question answering
1.4.8 Multi-lingual (or cross-lingual) question answering
1.4.9 Interactive QA
1.4.10 Advanced reasoning for QA
1.4.11 User profiling for QA
1.5 A generic framework for QA
1.6 Evaluating QA Systems
1.6.1 End-to-End Evaluation
1.6.2 Mean Reciprocal Rank
1.6.3 Confidence Weighted Score
1.6.4 Accuracy and coverage
1.6.5 Traditional Metrics – Recall and Precision
Chapter 2: Question Analysis
2.1 Determining the Expected Answer Type
2.1.1 Question Classes
2.1.2 Manually Constructed rules for question classification
REFERENCES
List of Figures and Tables
Chapter 1: Introduction
In information retrieval, question answering (QA) is the task of automatically answering
a question posed in natural language. To find the answer to a question, a QA computer
program may use either a pre-structured database or a collection of natural language
documents (a text corpus such as the World Wide Web or some local collection).
QA research attempts to deal with a wide range of question types including fact, list,
definition, how, why, hypothetical, semantically constrained, and cross-lingual
questions. Search collections vary from small local document collections, to internal
organization documents, to compiled newswire reports, to the World Wide Web.
Open-domain question answering deals with questions about nearly everything, and
can only rely on general ontologies and world knowledge. On the other hand, these
systems usually have much more data available from which to extract the answer.
1.1 History of Question Answering Systems
Some of the early AI systems were question answering systems. Two of the most famous
QA systems of that time are BASEBALL and LUNAR, both of which were developed in
the 1960s. BASEBALL answered questions about the US baseball league over a period of
one year. LUNAR, in turn, answered questions about the geological analysis of rocks
returned by the Apollo moon missions. Both QA systems were very effective in their
chosen domains. In fact, LUNAR was demonstrated at a lunar science convention in
1971 and it was able to answer 90% of the questions in its domain posed by people
untrained on the system. Further restricted-domain QA systems were developed in the
following years. The common feature of all these systems is that they had a core
database or knowledge system that was hand-written by experts of the chosen domain.
Some of the early AI systems included question-answering abilities. Two of the most
famous early systems are SHRDLU and ELIZA. SHRDLU simulated the operation of a
robot in a toy world (the "blocks world"), and it offered the possibility to ask the robot
questions about the state of the world. Again, the strength of this system was the choice
of a very specific domain and a very simple world with rules of physics that were easy
to encode in a computer program. ELIZA, in contrast, simulated a conversation with a
psychologist. ELIZA was able to converse on any topic by resorting to very simple rules
that detected important words in the person's input. It had a very rudimentary way to
answer questions, and on its own it led to a series of chatterbots such as the ones that
participate in the annual Loebner Prize.
The 1970s and 1980s saw the development of comprehensive theories in computational
linguistics, which led to the development of ambitious projects in text comprehension
and question answering. One example of such a system was the Unix Consultant (UC), a
system that answered questions pertaining to the Unix operating system. The system
had a comprehensive hand-crafted knowledge base of its domain, and it aimed at
phrasing the answer to accommodate various types of users. Another project was
LILOG, a text-understanding system that operated on the domain of tourism
information in a German city. The systems developed in the UC and LILOG projects
never went past the stage of simple demonstrations, but they helped the development
of theories on computational linguistics and reasoning.
In the late 1990s the annual Text Retrieval Conference (TREC) included a question-
answering track which has been running until the present. Systems participating in this
competition were expected to answer questions on any topic by searching a corpus of
text that varied from year to year. This competition fostered research and development
in open-domain text-based question answering. The best system in the 2004
competition answered 77% of fact-based questions correctly.
In 2007 the annual TREC included a blog data corpus for question answering. The blog
data corpus contained both "clean" English and noisy text that includes badly formed
English and spam. The introduction of noisy text moved question answering
to a more realistic setting. Real-life data is inherently noisy as people are less careful
when writing in spontaneous media like blogs. In earlier years the TREC data corpus
consisted of only newswire data that was very clean.
An increasing number of systems include the World Wide Web as one more corpus of
text. Currently there is an increasing interest in the integration of question answering
with web search. Ask.com is an early example of such a system, and Google and
Microsoft have started to integrate question-answering facilities in their search engines.
One can only expect to see an even tighter integration in the near future.
1.2 Architecture
The first QA systems were developed in the 1960s and they were basically natural-
language interfaces to expert systems that were tailored to specific domains. In
contrast, current QA systems use text documents as their underlying knowledge source
and combine various natural language processing techniques to search for the answers.
Current QA systems typically include a question classifier module that determines the
type of question and the type of answer. After the question is analyzed, the system
typically uses several modules that apply increasingly complex NLP techniques on a
gradually reduced amount of text. Thus, a document retrieval module uses search
engines to identify the documents or paragraphs in the document set that are likely to
contain the answer. Subsequently, a filter preselects small text fragments that contain
strings of the same type as the expected answer. For example, if the question is "Who
invented Penicillin?", the filter returns text containing names of people. Finally, an
answer extraction module looks for further clues in the text to determine if the answer
candidate can indeed answer the question.
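The staged architecture described above can be sketched as a pipeline of three components. The interfaces and class names below are illustrative assumptions for the sketch, not the classes of any particular system:

```java
import java.util.List;

/** Illustrative sketch of the staged QA pipeline described above. */
public class QaPipeline {
    interface QuestionClassifier { String expectedAnswerType(String question); }
    interface DocumentRetriever  { List<String> retrieve(String irQuery, int topN); }
    interface AnswerExtractor    { List<String> extract(List<String> passages, String answerType); }

    private final QuestionClassifier classifier;
    private final DocumentRetriever retriever;
    private final AnswerExtractor extractor;

    public QaPipeline(QuestionClassifier c, DocumentRetriever r, AnswerExtractor e) {
        this.classifier = c; this.retriever = r; this.extractor = e;
    }

    /** Each stage applies more expensive NLP to a smaller amount of text. */
    public List<String> answer(String question) {
        String type = classifier.expectedAnswerType(question);  // e.g. "PERSON"
        List<String> docs = retriever.retrieve(question, 10);   // cheap IR over many documents
        return extractor.extract(docs, type);                   // expensive NLP over few passages
    }
}
```

The point of the structure is the funnel: the classifier and retriever are cheap and run over everything, while the extractor is expensive and runs only over the preselected fragments.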
QA is very dependent on a good search corpus - for without documents containing the
answer, there is little any QA system can do. It thus makes sense that larger collection
sizes generally lend themselves to better QA performance, unless the question domain is
orthogonal to the collection. The notion of data redundancy in massive collections, such
as the web, means that nuggets of information are likely to be phrased in many different
ways in differing contexts and documents, leading to two benefits:
(1) By having the right information appear in many forms, the burden on the QA
system to perform complex NLP techniques to understand the text is lessened.
(2) Correct answers can be filtered from false positives by relying on the correct
answer to appear more times in the documents than instances of incorrect ones.
1.3 Question answering methods
1.3.1 Shallow
When using massive collections with good data redundancy, some systems use
templates to find the final answer in the hope that the answer is just a reformulation of
the question. If you posed the question "What is a dog?", the system would detect the
substring "What is a X" and look for documents which start with "X is a Y". This often
works well on simple "factoid" questions seeking factual tidbits of information such as
names, dates, locations, and quantities.
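The reformulation idea can be sketched with two regular expressions; the patterns below are a minimal assumption covering only the "What is a X?" template from the example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of shallow template-based answering ("What is a X?" -> "X is a Y"). */
public class TemplateMatcher {
    private static final Pattern QUESTION =
        Pattern.compile("What is an? (\\w+)\\??", Pattern.CASE_INSENSITIVE);

    /** Returns a regex matching declarative reformulations of the question, or null. */
    public static Pattern reformulate(String question) {
        Matcher m = QUESTION.matcher(question.trim());
        if (!m.matches()) return null;
        // Look for "<X> is a <Y>" in retrieved text; capture the answer phrase Y.
        return Pattern.compile(m.group(1) + " is an? (\\w+)", Pattern.CASE_INSENSITIVE);
    }

    public static String findAnswer(String question, String text) {
        Pattern p = reformulate(question);
        if (p == null) return null;
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1) : null;
    }
}
```

A real system would need many such templates and would rely on data redundancy to make at least one of them match.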
1.3.2 Deep
However, in the cases where simple question reformulation or keyword techniques will
not suffice, more sophisticated syntactic, semantic and contextual processing must be
performed to extract or construct the answer. These techniques might include named-
entity recognition, relation detection, coreference resolution, syntactic alternations,
word sense disambiguation, logic form transformation, logical inferences (abduction)
and commonsense reasoning, temporal or spatial reasoning and so on. These systems
will also very often utilize world knowledge that can be found in ontologies such as
WordNet, or the Suggested Upper Merged Ontology (SUMO) to augment the available
reasoning resources through semantic connections and definitions.
Statistical QA, which introduces statistical question processing and answer extraction
modules, is also growing in popularity in the research community. Many of the lower-
level NLP tools used, such as part-of-speech tagging, parsing, named-entity detection,
sentence boundary detection, and document retrieval, are already available as
probabilistic applications.
1.4 Issues
1.4.1 Question classes
Different types of questions require the use of different strategies to find the answer.
Question classes are arranged hierarchically in taxonomies.
1.4.2 Question processing
The same information request can be expressed in various ways - some interrogative,
some assertive. A semantic model of question understanding and processing is needed,
one that would recognize equivalent questions, regardless of the speech act or of the
words, syntactic inter-relations or idiomatic forms. This model would enable the
translation of a complex question into a series of simpler questions, would identify
ambiguities and treat them in context or by interactive clarification.
1.4.3 Context and QA
Questions are usually asked within a context and answers are provided within that
specific context. The context can be used to clarify a question, resolve ambiguities or
keep track of an investigation performed through a series of questions.
1.4.4 Data sources for QA
Before a question can be answered, it must be known what knowledge sources are
available. If the answer to a question is not present in the data sources, no matter how
well we perform question processing, retrieval and extraction of the answer, we shall
not obtain a correct result.
1.4.5 Answer extraction
Answer extraction depends on the complexity of the question, on the answer type
provided by question processing, on the actual data where the answer is searched, on
the search method and on the question focus and context. Given that answer processing
depends on such a large number of factors, research for answer processing should be
tackled with a lot of care and given special importance.
1.4.7 Real time question answering
There is a need for developing Q&A systems that are capable of extracting answers
from large data sets in several seconds, regardless of the complexity of the question, the
size and multitude of the data sources or the ambiguity of the question.
1.4.8 Multi-lingual (or cross-lingual) question answering
The ability to answer a question posed in one language using an answer corpus in
another language (or even several). This allows users to consult information that they
cannot use directly. See also machine translation.
1.4.9 Interactive QA
It is often the case that the information need is not well captured by a QA system, as
the question processing part may fail to classify properly the question or the
information needed for extracting and generating the answer is not easily retrieved. In
such cases, the questioner might want not only to reformulate the question, but (s)he
might want to have a dialogue with the system.
1.4.10 Advanced reasoning for QA
More sophisticated questioners expect answers which are outside the scope of
written texts or structured databases. To upgrade a QA system with such capabilities,
we need to integrate reasoning components operating on a variety of knowledge bases,
encoding world knowledge and common-sense reasoning mechanisms as well as
knowledge specific to a variety of domains.
1.4.11 User profiling for QA
The user profile captures data about the questioner, comprising context data, domain
of interest, reasoning schemes frequently used by the questioner, common ground
established within different dialogues between the system and the user etc. The profile
may be represented as a predefined template, where each template slot represents a
different profile feature. Profile templates may be nested one within another.
1.5 A generic framework for QA
Figure: A generic QA framework - a question passes through question analysis, then
document retrieval over a corpus or document collection, and finally answer extraction
over the top n text segments or sentences, producing the answers.
It should be noted that while the three components address completely separate
aspects of question answering it is often difficult to know where to place the boundary
of each individual component. For example the question analysis component is usually
responsible for generating an IR query from the natural language question which can
then be used by the document retrieval component to select a subset of the available
documents. If, however, an approach to document retrieval requires some form of
iterative process to select good quality documents which involves modifying the IR
query, then it is difficult to decide if the modification should be classed as part of the
question analysis or document retrieval process.
relevant documents. The queries can then be used to make an evaluation based on
precision and recall. But this is not possible even for the smallest of document
collections, and with the size of corpora like AQUAINT, with approximately 1,000,000
articles, it is next to impossible.
The last answer evaluated will therefore be the one the system has least confidence in.
Given this ordering CWS is formally defined in Equation 1.2:
CWS = (1/|Q|) Σ_{i=1}^{|Q|} (number of correct in first i answers) / i    (1.2)
CWS therefore rewards systems which can not only provide correct exact answers to
questions but which can also recognise how likely an answer is to be correct and hence
place it early in the sorted list of answers. The main issue with CWS is that it is difficult
to get an intuitive understanding of the performance of a QA system given a CWS score
as it does not relate directly to the number of questions the system was capable of
answering.
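Equation 1.2 can be computed directly from the correctness judgements sorted by decreasing system confidence; a minimal sketch:

```java
/** Confidence Weighted Score (Equation 1.2): the input array holds per-question
 *  correctness judgements, sorted by decreasing system confidence. */
public class Cws {
    public static double score(boolean[] correctByConfidence) {
        int q = correctByConfidence.length;
        double sum = 0.0;
        int correctSoFar = 0;
        for (int i = 1; i <= q; i++) {
            if (correctByConfidence[i - 1]) correctSoFar++;
            sum += (double) correctSoFar / i;  // (# correct in first i answers) / i
        }
        return sum / q;
    }
}
```

For two questions with only the first (most confident) answer correct the score is (1/1 + 1/2)/2 = 0.75, while with only the second correct it is (0 + 1/2)/2 = 0.25, illustrating how CWS rewards placing correct answers early.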
The recall of an IR system S at rank n for a query q is the fraction of the relevant
documents A_{D,q} which have been retrieved:

recall_S(D, q, n) = |RD_{S,q,n} ∩ A_{D,q}| / |A_{D,q}|    (1.5)

The precision of an IR system S at rank n for a query q is the fraction of the retrieved
documents RD_{S,q,n} that are relevant:

precision_S(D, q, n) = |RD_{S,q,n} ∩ A_{D,q}| / |RD_{S,q,n}|    (1.6)
Clearly, given a set of queries Q, average recall and precision values can be calculated to
give a more representative evaluation of a specific IR system. Unfortunately these
evaluation metrics, although well founded and used throughout the IR community, suffer
from two problems when used in conjunction with the large document collections
utilized by QA systems. The first is determining the set of relevant documents within a
collection for a given query, A_{D,q}. The only accurate way to determine which documents
are relevant to a query is to read every single document in the collection and determine
its relevance. Clearly given the size of the collections over which QA systems are being
operated this is not a feasible proposition. The second problem is that just because a
relevant document is found does not automatically mean the QA system will be able to
identify and extract a correct answer. Therefore it is better to use recall and precision at
the document retrieval stage rather than for the complete system.
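Given judged document sets, Equations 1.5 and 1.6 reduce to set intersections; a small sketch:

```java
import java.util.HashSet;
import java.util.Set;

/** Recall and precision at rank n (Equations 1.5 and 1.6), over judged document sets. */
public class IrMetrics {
    private static <T> int intersectionSize(Set<T> a, Set<T> b) {
        Set<T> c = new HashSet<>(a);
        c.retainAll(b);  // c = a ∩ b
        return c.size();
    }

    /** |retrieved ∩ relevant| / |relevant| */
    public static double recall(Set<String> retrieved, Set<String> relevant) {
        return relevant.isEmpty() ? 0.0
             : (double) intersectionSize(retrieved, relevant) / relevant.size();
    }

    /** |retrieved ∩ relevant| / |retrieved| */
    public static double precision(Set<String> retrieved, Set<String> relevant) {
        return retrieved.isEmpty() ? 0.0
             : (double) intersectionSize(retrieved, relevant) / retrieved.size();
    }
}
```

The hard part in practice, as noted above, is obtaining the relevant set A_{D,q} at all, not computing the ratios.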
Figure: The relationship between the document collection, the relevant documents
A_{D,q}, and the retrieved documents RD_{S,q,n}.
Chapter 2: Question Analysis
As the first component in a QA system it could easily be argued that question analysis is
the most important part. Not only is the question analysis component responsible for
determining the expected answer type and for constructing an appropriate query for
use by an IR engine but any mistakes made at this point are likely to render useless any
further processing of a question. If the expected answer type is incorrectly determined
then it is highly unlikely that the system will be able to return a correct answer as most
systems constrain possible answers to only those of the expected answer type. In a
similar way a poorly formed IR query may result in no answer bearing documents being
retrieved and hence no amount of further processing by an answer extraction
component will lead to a correct answer being found.
Support vector machines (SVMs) are a set of related supervised learning methods used
for classification and regression. Viewing input data as two sets of vectors in an
n-dimensional space, an SVM will construct a separating hyperplane in that space, one
which maximizes the margin between the two data sets. To calculate the margin, two
parallel hyperplanes are constructed, one on each side of the separating hyperplane,
which are "pushed up against" the two data sets. Intuitively, a good separation is
achieved by the hyperplane that has the largest distance to the neighboring data points
of both classes, since in general the larger the margin the lower the generalization error
of the classifier.
Formally, we are given training data

D = {(x_i, y_i) | x_i ∈ ℝ^p, y_i ∈ {−1, 1}}, i = 1, …, n    (2.1)

where y_i is either 1 or −1, indicating the class to which the point x_i belongs. Each x_i
is a p-dimensional real vector. We want to find the maximum-margin hyperplane which
divides the points having y_i = 1 from those having y_i = −1. Any hyperplane can be
written as the set of points x satisfying

w ⋅ x − b = 0    (2.2)

where ⋅ denotes the dot product. The vector w is a normal vector: it is perpendicular to
the hyperplane. The parameter b/‖w‖ determines the offset of the hyperplane from the
origin along the normal vector w. We want to choose w and b to maximize the
margin, or distance between the parallel hyperplanes that are as far apart as possible
while still separating the data. These hyperplanes can be described by the equations

w ⋅ x − b = 1    (2.3)

and

w ⋅ x − b = −1    (2.4)

Note that if the training data are linearly separable, we can select the two hyperplanes
of the margin in a way that there are no points between them and then try to maximize
their distance. By using geometry, we find the distance between these two hyperplanes
is 2/‖w‖, so we want to minimize ‖w‖. As we also have to prevent data points falling
into the margin, we add the following constraint: for each i either

w ⋅ x_i − b ≥ 1    (2.5)

or

w ⋅ x_i − b ≤ −1    (2.6)
This can be rewritten as:

y_i (w ⋅ x_i − b) ≥ 1    (2.8)
If instead of the Euclidean inner product w ⋅ x one fed the QP solver a kernel function
K(w, x), the boundary between the two classes would then be

K(x, w) + b = 0    (2.9)

and the set of x ∈ ℝ^d on that boundary becomes a curved surface embedded in ℝ^d
when K is nonlinear.
Figure 2.1: The kernel trick, after transformation the data is linearly separable.
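The decision rules of Equations 2.2 and 2.9 are easy to state in code; the sketch below shows only the decision side (no training), and the polynomial kernel is one common choice assumed for illustration:

```java
/** Sketch of the SVM decision rule (Equations 2.2 and 2.9); training is not shown. */
public class SvmDecision {
    /** Linear decision value w.x - b; the predicted class is its sign (Eq. 2.2). */
    public static double linear(double[] w, double b, double[] x) {
        double dot = 0.0;
        for (int i = 0; i < w.length; i++) dot += w[i] * x[i];
        return dot - b;
    }

    /** A polynomial kernel (x.z + 1)^degree: one common kernel-trick choice. */
    public static double polyKernel(double[] x, double[] z, int degree) {
        double dot = 0.0;
        for (int i = 0; i < x.length; i++) dot += x[i] * z[i];
        return Math.pow(dot + 1.0, degree);
    }
}
```

Replacing the inner product in `linear` with `polyKernel` is exactly the substitution that turns the flat boundary of Equation 2.2 into the curved surface of Equation 2.9.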
Along with SVM, we also tried the Naïve Bayes classifier [6]. A naive Bayes classifier is a
simple probabilistic classifier based on applying Bayes' theorem with strong (naive)
independence assumptions. The underlying probability model is the conditional model
p(C | F_1, …, F_n), which by Bayes' theorem can be written as

p(C | F_1, …, F_n) = p(C) p(F_1, …, F_n | C) / p(F_1, …, F_n)    (2.10)
In practice we are only interested in the numerator of that fraction, since the
denominator does not depend on C and the values of the features F_i are given, so that
the denominator is effectively constant. The numerator is equivalent to the joint
probability model
and so forth. Now the "naive" conditional independence assumptions come into play:
assume that each feature F_i is conditionally independent of every other feature F_j for
j ≠ i. This means that

p(F_i | C, F_j) = p(F_i | C)    (2.13)

so the joint model can be expressed as

p(C, F_1, …, F_n) = p(C) p(F_1 | C) p(F_2 | C) ⋯ p(F_n | C) = p(C) ∏_{i=1}^{n} p(F_i | C)    (2.14)
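The product of Equation 2.14 is evaluated in log space in practice; the sketch below is a minimal multinomial Naive Bayes over bags of words with add-one smoothing (the smoothing variant reported later in this chapter), not the actual classifier implementation:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Minimal multinomial Naive Bayes sketch (Equation 2.14), in log space. */
public class NaiveBayes {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> wordTotals = new HashMap<>();
    private final Map<String, Integer> questionCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalQuestions = 0;

    public void train(String questionClass, List<String> words) {
        totalQuestions++;
        questionCounts.merge(questionClass, 1, Integer::sum);
        Map<String, Integer> counts =
            wordCounts.computeIfAbsent(questionClass, c -> new HashMap<>());
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
            wordTotals.merge(questionClass, 1, Integer::sum);
            vocabulary.add(w);
        }
    }

    /** argmax_C [ log p(C) + sum_i log p(F_i | C) ], with add-one smoothing. */
    public String classify(List<String> words) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String c : questionCounts.keySet()) {
            double score = Math.log((double) questionCounts.get(c) / totalQuestions);
            Map<String, Integer> counts = wordCounts.get(c);
            int total = wordTotals.getOrDefault(c, 0);
            for (String w : words) {
                int n = counts.getOrDefault(w, 0);
                score += Math.log((n + 1.0) / (total + vocabulary.size())); // add-one smoothing
            }
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best;
    }
}
```

Working in log space avoids underflow when the product of many small per-word probabilities is taken, and the add-one term keeps unseen words from zeroing out a class.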
2.1.7 Datasets
We used the publicly available training and testing datasets provided by Tagged
Question Corpus, Cognitive Computation Group at the Department of Computer Science,
University of Illinois at Urbana-Champaign (UIUC) [5]. All these datasets have been
manually labelled by UIUC [5] according to the coarse and fine grained categories in
Table 1.1. There are about 5,500 labelled questions randomly divided into 5 training
datasets of sizes 1,000, 2,000, 3,000, 4,000 and 5,500 respectively. The testing dataset
contains 2000 labelled questions from the TREC QA track. The TREC QA data is hand
labelled by us.
2.1.8 Features
For each question, we extract two kinds of features: bag-of-words or a mix of POS tags
and words. Every question is represented as feature vectors; the weight associated with
each word varies between 0 and 1. The following example demonstrates the different
feature sets considered for a given question and its POS parse.
Figure 2.2: Various feature sets extracted from the given question and its corresponding
part of speech tags.
The entropy of a discrete random variable X with probability mass function p(x) is
defined as H(X) = −Σ_x p(x) log p(x). The larger the entropy H(X) is, the more
uncertain the random variable X is. In
information retrieval many methods have been applied to evaluate term’s relevance to
documents, among which entropy-weighting, based on information theoretic ideas, is
proved the most effective and sophisticated. Let fit be the frequency of word i in
document t, ni the total number of occurrences of word i in document collection, N the
number of total documents in the collection, then the confusion (or entropy) of word i
can be measured as follows:

H_i = − Σ_{t=1}^{N} (f_it / n_i) log(f_it / n_i)

The larger the confusion of a word is, the less important it is. The confusion achieves its
maximum value log(N) if the word is evenly distributed over all documents, and its
minimum value 0 if the word occurs in only one document.
Keeping this in mind to calculate the entropy of a word, certain preprocessing is needed.
Let C be the set of question types. Without loss of generality, it is denoted by C = {1, . . .
,N}. Ci is a set of words extracted from questions of type i, that is to say, Ci represents a
word collection similar to documents. From the viewpoint of representation, each Ci is
the same as a document because both of which are just a collection of words. Therefore
we can also use the idea of entropy to evaluate a word's importance. Let a_i be the weight
of word i, f_it be the frequency of word i in C_t, and n_i be the total number of
occurrences of word i in all questions; then a_i is defined as:

a_i = 1 − H_i / log(N),  where  H_i = − Σ_{t=1}^{N} (f_it / n_i) log(f_it / n_i)
The weight of word i is inversely related to its entropy: the larger the entropy of word i
is, the less important it is to question classification. In other words, a smaller weight is
associated with word i. Consequently, a_i gets the maximum value of 1 if word i occurs in
only one question-type set, and the minimum value of 0 if the word is evenly
distributed over all sets. Note that if a word occurs in only one set, then for all other sets
f_ik is 0.
We use the convention that 0 log 0 = 0, which is easily justified since xlogx → 0 as x → 0.
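The weight described above, a_i = 1 − H_i/log(N) with H_i the entropy of word i's distribution over the N question-type sets, is consistent with the stated extremes (1 for a word in one set, 0 for a uniformly spread word); a sketch assuming that form:

```java
/** Entropy-based word weight: a_i = 1 - H_i / log(N), where H_i is the entropy of
 *  word i's distribution over the N question-type sets (assumes N >= 2). */
public class EntropyWeight {
    public static double weight(int[] countsPerClass) {
        int total = 0;
        for (int c : countsPerClass) total += c;
        double h = 0.0;
        for (int c : countsPerClass) {
            if (c == 0) continue;               // convention: 0 log 0 = 0
            double p = (double) c / total;
            h -= p * Math.log(p);
        }
        return 1.0 - h / Math.log(countsPerClass.length);
    }
}
```

A word occurring only in one class, e.g. counts {4, 0, 0}, gets weight 1; a word spread evenly, e.g. {1, 1, 1}, gets weight 0.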
It must be noted that the classifiers were NOT trained on TREC data. The classifier
classified questions into six coarse classes and fifty fine-grained classes. Therefore a
baseline (random) classifier is (1/50) = 2% accurate. We employed various smoothing
techniques with the Naive Bayes classifier. The performance without smoothing was too
low to be worth mentioning. While Witten-Bell smoothing worked well, simple add-one
smoothing outperformed it. The accuracies reported here are for the Naive Bayes
classifier employing add-one smoothing.
We implemented the weighted feature set SVM classifier as a cross-platform standalone
desktop application (shown below). The application will be made available to the public for
evaluation. Training was done on a set of 12788 questions provided by Cognitive
Computation Group at the Department of Computer Science, University of Illinois at
Urbana-Champaign.
Figure 2.3: Classification accuracy (%) of the baseline classifier, the Naïve Bayes
classifier using the bag-of-words and partitioned features, and the SVM classifier using
the bag-of-words feature.
Figure 2.4: JAVA Question Classifier, can be downloaded for evaluation from
http://www.cybergeeks.co.in/projects.php?id=10
It must be noted that query expansion is carried out internally by the APIs used to
retrieve documents from the web, although because of their proprietary nature their
workings are unknown and unpredictable.
i, a, about, an, are, as, at, be, by, com, de, en, for, from, how, in, is, it, la, of, on, or,
that, the, this, to, und, was, what, when, where, who, will, with, www
The list of stop words we obtained is much smaller than standard stop word lists
(although there is no definite list of stop words which all natural language processing
tools incorporate, most of these lists are very similar).
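Filtering query terms against this list is a simple set lookup; a sketch using the stop words derived above:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Filtering query terms against the stop-word list derived above. */
public class StopWordFilter {
    private static final Set<String> STOP_WORDS = Set.of(
        "i", "a", "about", "an", "are", "as", "at", "be", "by", "com", "de", "en",
        "for", "from", "how", "in", "is", "it", "la", "of", "on", "or", "that",
        "the", "this", "to", "und", "was", "what", "when", "where", "who",
        "will", "with", "www");

    /** Lower-cases, tokenizes on whitespace, and drops stop words. */
    public static List<String> filter(String query) {
        return Arrays.stream(query.toLowerCase().split("\\s+"))
                     .filter(t -> !STOP_WORDS.contains(t))
                     .collect(Collectors.toList());
    }
}
```

For example, "Who invented the telephone" reduces to the content terms "invented" and "telephone".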
The text collections over which a QA system works tend to be so large that it is
impossible to process the whole of them to retrieve the answer. The task of the document
retrieval module is to select a small set from the collection which can be practically
handled in the later stages. A good retrieval unit will increase precision while
maintaining good enough recall.
Documents are scored using the Okapi BM25 ranking function:

score(D, Q) = Σ_{i=1}^{n} IDF(q_i) ⋅ f(q_i, D) (k_1 + 1) / (f(q_i, D) + k_1 (1 − b + b |D| / avgdl))    (3.1)

where f(q_i, D) is q_i's term frequency in the document D, |D| is the length of the
document D in words, and avgdl is the average document length in the text collection
from which documents are drawn. k1 and b are free parameters, usually chosen as k1 =
2.0 and b = 0.75. IDF(qi) is the IDF (inverse document frequency) weight of the query
term q_i. It is usually computed as:

IDF(q_i) = log((N − n(q_i) + 0.5) / (n(q_i) + 0.5))    (3.2)
where N is the total number of documents in the collection, and n(qi) is the number of
documents containing qi. There are several interpretations for IDF and slight variations
on its formula. In the original BM25 derivation, the IDF component is derived from the
Binary Independence Model.
−log(n(q)/N) = log(N/n(q))    (3.3)
Now suppose we have two query terms q1 and q2. If the two terms occur in documents
entirely independently of each other, then the probability of seeing both q1 and q2 in a
randomly picked document D is:
n(q_1)/N ⋅ n(q_2)/N

and the information content of such an event is:

Σ_{i=1}^{2} log(N / n(q_i))
With a small variation, this is exactly what is expressed by the IDF component of BM25.
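The per-term BM25 score described above, with the parameter choices k1 = 2.0 and b = 0.75 given in the text, can be sketched as follows; the IDF variant used here is one standard form among the slight variations mentioned:

```java
/** BM25 term weighting as described above, with k1 = 2.0 and b = 0.75. */
public class Bm25 {
    static final double K1 = 2.0, B = 0.75;

    /** One standard IDF variant: log((N - n + 0.5) / (n + 0.5)),
     *  for a term occurring in n of N documents. */
    public static double idf(int totalDocs, int docsWithTerm) {
        return Math.log((totalDocs - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
    }

    /** Contribution of one query term with frequency tf in a document of length docLen;
     *  the full document score sums this over all query terms. */
    public static double termScore(double idf, int tf, int docLen, double avgDocLen) {
        double lengthNorm = K1 * (1.0 - B + B * docLen / avgDocLen);
        return idf * tf * (K1 + 1.0) / (tf + lengthNorm);
    }
}
```

The length normalization term penalizes matches in long documents, while the saturating tf ratio keeps a single very frequent term from dominating the score.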
educational purposes. The search APIs can return the top n documents for a given query.
We read the top n uniform resource locators (URLs) and build the collection of documents
to be used for answer retrieval. As reading URLs over the internet is an inherently slow
process, this stage is the most taxing one in terms of runtime. To accelerate the process
we employ multi-threaded URL readers so that multiple URLs can
be read simultaneously. Figure 3.1 shows the document retrieval framework.
Figure 3.1: The document retrieval framework - an IR query is sent to the search APIs,
and a multi-threaded reader module reads the returned URLs from the internet to build
a local corpus of the top n documents.
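The multi-threaded reader module can be sketched with a fixed thread pool; the fetcher function is injected (an assumption of this sketch) so that the actual HTTP code stays pluggable:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Sketch of a multi-threaded URL reader: fetch many URLs concurrently. */
public class ParallelUrlReader {
    public static List<String> readAll(List<String> urls,
                                       Function<String, String> fetcher,
                                       int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                futures.add(pool.submit(() -> fetcher.apply(url))); // fetch concurrently
            }
            List<String> documents = new ArrayList<>();
            for (Future<String> f : futures) {
                try {
                    documents.add(f.get());  // collect in input order
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
            return documents;
        } finally {
            pool.shutdown();
        }
    }
}
```

Because each fetch is dominated by network latency rather than CPU, running them on a small pool of threads cuts total download time roughly by the pool size.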
One of the main considerations when doing document retrieval for QA is the amount of
text to retrieve and process for each question. Ideally a system would retrieve a single
text unit that was just large enough to contain a single instance of the exact answer for
every question. Whilst the ideal is not attainable, the document retrieval stage can act as
a filter between the document collections/web and answer extraction components by
retrieving a relatively small text collection. Our target is therefore to increase coverage
with the least number of retrieved documents forming the text collection. Lowered
precision is penalized by higher average processing time in later stages. Therefore, the
criterion for selecting the right collection size depends on coverage and average
processing time. The table below shows percentage coverage, average processing time
at different ranks for Google and Yahoo search APIs. The results are obtained on a set of
30 questions (equally distributed over all question classes) from TREC 04 QA track [5].
%Coverage vs rank

Rank                     1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
Yahoo BOSS API           23 31 37 42 48 49 49 51 51 52 53 53 54 54 55
Google AJAX Search API   28 48 56 58 64 64 64 66 70 72 72 73 73 73 74
Figure: %Coverage vs average processing time (sec) for the two search APIs.
From the results it is clear that going up to rank 5 ensures a good coverage while
maintaining low processing time. Clearly Google outperforms Yahoo at all ranks.
The final stage in a QA system, and arguably the most important, is to extract and
present the answers to questions. We employ a named entity (NE) recognizer to select
those sentences which could potentially contain the answer to the given question. In
our system we have used GATE – A General Architecture for Text Engineering provided
by The Sheffield NLP group [15] as a tool to handle most of the NLP tasks including NE
recognition.
4.1.1 WordNet
WordNet [16] is the product of a research project at Princeton University which has
attempted to model the lexical knowledge of a native speaker of English. In WordNet
each unique meaning of a word is represented by a synonym set or synset. Each synset
has a gloss that defines the concept of the word. For example, the words car, auto,
automobile, and motorcar form a synset that represents the concept defined by the
gloss: four wheel motor vehicle, usually propelled by an internal combustion engine. Many glosses
have examples of usages associated with them, such as "he needs a car to get to work."
In addition to providing these groups of synonyms to represent a concept, WordNet
connects concepts via a variety of semantic relations. These semantic relations for
nouns include:
· Hyponym/Hypernym (IS-A/ HAS A)
· Meronym/Holonym (Part-of / Has-Part)
· Meronym/Holonym (Member-of / Has-Member)
· Meronym/Holonym (Substance-of / Has-Substance)
G(w_i, q_j) = x_{i,j}    (4.2)

where x_{i,j} ∈ [0, 1] is the value of the sense/semantic similarity between w_i ∈ W and
q_j ∈ Q.
Figure 4.2: A sense network formed between a sentence and a query.
The coefficients are fine-tuned depending on the type of corpus. Unlike newswire data,
most of the information found on the internet is badly formatted, grammatically
incorrect and often not well formed. So when the web is used as the knowledge
base we use the following coefficient values: α = 1.0, β = 1.0, γ = 0.25,
δ = 0.125 and noise decay factor a = 0.25, but when using a local corpus we reduce α to
0.5 and a to 0.1. Once we obtain the total score for each sentence, we sort them
according to these scores. We take the top t sentences and consider the plausible answers
within them. If an answer appears with frequency f in the sentence ranked r then that
answer gets a confidence score
C(ans) = (1 + ln(f)) / r    (4.12)
Again all answers are sorted according to confidence score and top J (=5 in our case)
answers are returned along with corresponding sentence and URL (figure 4.3).
Figure 4.3: A sample run for the question “Who performed the first human heart
transplant?”
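The scoring and ranking of equation 4.12 can be sketched as follows. Whether the original system accumulates one score per answer string or keeps the best rank only is an assumption of this sketch; the formula itself is as given above.

```java
import java.util.*;

// Sketch of answer-confidence scoring per equation 4.12:
// C(ans) = (1 + ln f) / r, where f is the frequency of the answer in the
// sentence ranked r.
public class AnswerRanker {
    static double confidence(int frequency, int rank) {
        return (1.0 + Math.log(frequency)) / rank;
    }

    // candidates: answer -> {frequency, rank}. Returns up to j answers
    // sorted by descending confidence.
    static List<String> topAnswers(Map<String, double[]> candidates, int j) {
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, double[]> e : candidates.entrySet())
            scored.add(Map.entry(e.getKey(),
                confidence((int) e.getValue()[0], (int) e.getValue()[1])));
        scored.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(j, scored.size()); i++)
            top.add(scored.get(i).getKey());
        return top;
    }
}
```

Note that an answer appearing once in the top-ranked sentence (C = 1.0) outscores one appearing once in the second sentence (C = 0.5), so rank dominates unless frequency differences are large.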
Our question answering module is written in Java, which makes the software
cross-platform and highly portable. It uses various third-party APIs for NLP and text
engineering: GATE, the Stanford parser, and the JSON and Lucene APIs, to name a few. Each
module is designed keeping space and time constraints in mind. The URL reader module is
multithreaded to keep download time to a minimum. Most of the pre-processing is done via
the GATE processing pipeline. More information is provided in appendix B.
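The multithreaded URL reader mentioned above can be sketched with a standard thread pool: fetching documents concurrently makes the total download time approach that of the slowest document rather than the sum. The fetch itself is stubbed out here; the real module reads the URL contents over HTTP.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of a multithreaded URL reader using a fixed thread pool.
public class UrlReader {
    // Stub: a real implementation would open the connection and read the body.
    static String fetch(String url) {
        return "<html>contents of " + url + "</html>";
    }

    // Fetch all URLs concurrently and return url -> document text,
    // preserving the input order.
    static Map<String, String> fetchAll(List<String> urls, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, Future<String>> pending = new LinkedHashMap<>();
            for (String url : urls)
                pending.put(url, pool.submit(() -> fetch(url)));
            Map<String, String> docs = new LinkedHashMap<>();
            for (Map.Entry<String, Future<String>> e : pending.entrySet())
                docs.put(e.getKey(), e.getValue().get());
            return docs;
        } catch (InterruptedException | ExecutionException ex) {
            throw new RuntimeException(ex);
        } finally {
            pool.shutdown();
        }
    }
}
```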
Figure 5.1: Various modules of the QnA system along with each one's basic task.
5.1 Results
The idea of building an easily accessible question answering system which uses the web
as a document collection is not new. Most of these systems are accessed via a web
browser. Later in this section we compare our system with other web QA
systems. The tests were performed on a small set of fifty web-based questions. The
reason we did not use questions from TREC QA is that the TREC questions are now
appearing quite frequently (sometimes with correct answers) in the results of web
search engines. This could have affected the results of any web-based study. For this
reason a new collection of fifty questions was assembled to serve as the test set. Moreover,
we do not have access to the AQUAINT corpus, which serves as the knowledge base for TREC QA systems.
The questions within the new test set were chosen to meet the following criteria:
1. Each question should be an unambiguous factoid question with only one known
answer. (A few of the questions chosen do turn out to have multiple answers, but this
is mainly due to incorrect answers appearing in some web documents.)
2. The answers to the questions should not be dependent upon the time at which
the question is asked. This explicitly excludes questions such as “Who is the
President of the US?”
These questions are provided in appendix A.
For each question in the set, the table below shows the minimum rank at which the answer
was obtained. Where the system fails to answer a question, we show the reason it failed.
The time spent on various tasks is also shown, which helps in determining the feasibility
of using the system in a real-time environment. We used the top 5 documents to construct
our corpus, which restricts our coverage to 64%; in effect, 64% is the accuracy upper
bound of our system.
As seen, most of the failures were caused by the limited NE recognizer. The
question classifier failed in only one instance. At rank 5 the system reached its accuracy
upper bound of 64%.
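The accuracy-at-rank evaluation used above can be sketched as follows: given, for each question, the minimum rank at which the correct answer appeared (0 meaning not found), accuracy@k is the fraction of questions answered at rank k or better. The convention of using 0 for unanswered questions is an assumption of this sketch.

```java
// Sketch of the accuracy@k metric used in the evaluation.
public class Accuracy {
    // minRanks[i] is the minimum rank at which question i was answered
    // correctly, or 0 if the system failed to answer it.
    static double accuracyAtK(int[] minRanks, int k) {
        int hits = 0;
        for (int r : minRanks)
            if (r >= 1 && r <= k) hits++;
        return (double) hits / minRanks.length;
    }
}
```

With top-5 document retrieval capping coverage at 64%, accuracy@5 can never exceed that bound regardless of how well the ranking performs.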
1. http://www.languagecomputer.com/demos/
2. http://misshoover.si.umich.edu/~zzheng/qa-new/
3. http://www.ionaut.com:8400
The questions from the web question set were presented to the five systems on the
same day, within as short a period of time as possible, so that the underlying
document collection (in this case the web) would be relatively static and no system
would benefit from subtle changes in the content of the collection.
It is clear from the graph that our system outperforms all but AnswerFinder at rank 1.
This is quite important, as the answer returned at rank 1 can be considered to be the
final answer provided by the system. At higher ranks it performs considerably better
than AnswerBus and IONAUT, while performing marginally worse than AnswerFinder and
PowerAnswer. The results are encouraging, but it should be noted that due to the small
number of test questions it is difficult to draw firm conclusions from these experiments.
From table 5.1 it is clear that the system cannot be used for real-time purposes as of
now. An average response time of 18.3 seconds is too high. But it must be noted that
document retrieval time will be significantly lower for an offline, local corpus. Moreover,
the post-processing of the corpus can be done offline, as it is independent of the
query. Once the corpus is pre-processed offline, the actual time to retrieve an answer
is quite low at 0.45 seconds. We believe that if we use our own crawler and pre-process
the documents beforehand, our system can retrieve answers fast enough to be used in
real-time systems. The chart below shows the percentage of time spent on different tasks.
Time distribution across tasks: pre-processing 61%, document retrieval 36%, answer extraction 3%.
5.4 Conclusion
The main motivation behind the work in this thesis was to consider, where possible,
simple approaches to question answering which can be both easily understood and
operate quickly. We observed that the performance of the system is limited by
its worst-performing module: if even a single module fails, the whole system cannot
answer. In our case the NE recognizer is the weakest link; it recognizes a limited set
of answer types, which is not enough to obtain good overall accuracy. We employed
machine learning techniques for question classification, whose performance is good
enough that further improvements there would yield little benefit. We also proposed
the Sense Net algorithm as a new way of ranking sentences and answers. Even with the
limited capability of the NE recognizer, the system is on par with state-of-the-art web
QA systems, which confirms the efficacy of the ranking algorithm. The time distribution
of the various modules shows that the system is quite fast at the answer extraction
stage; if used with a local corpus that is pre-processed offline, it can be adapted for
real-time applications. Finally, our current results are encouraging, but we acknowledge
that due to the small number of test questions it is difficult to draw firm conclusions
from these experiments.
Appendix A
Small Web Based Question Set
Q001: The chihuahua dog derives its name from a town in which country? Ans: Mexico
Q002: What is the largest planet in our Solar System? Ans: Jupiter
Q003: In which country does the wild dog, the dingo, live? Ans: Australia or America
Q004: Where would you find budgerigars in their natural habitat? Ans: Australia
Q005: How many stomachs does a cow have? Ans: Four or one with four parts
Q006: How many legs does a lobster have? Ans: Ten
Q007: Charon is the only satellite of which planet in the solar system? Ans: Pluto
Q008: Which scientist was born in Germany in 1879, became a Swiss citizen in 1901 and
later became a US citizen in 1940? Ans: Albert Einstein
Q009: Who shared a Nobel prize in 1945 for his discovery of the antibiotic penicillin?
Ans: Alexander Fleming, Howard Florey or Ernst Chain
Q010: Who invented penicillin in 1928? Ans: Sir Alexander Fleming
Q011: How often does Halley’s comet appear? Ans: Every 76 years or every 75 years
Q012: How many teeth make up a full adult set? Ans: 32
Q013: In degrees centigrade, what is the average human body temperature? Ans: 37, 38
or 37.98
Q014: Who discovered gravitation and invented calculus? Ans: Isaac Newton
Q015: Approximately what percentage of the human body is water? Ans: 80%, 66%,
60% or 70%
Q016: What is the sixth planet from the Sun in the Solar System? Ans: Saturn
Q017: How many carats are there in pure gold? Ans: 24
Q018: How many canine teeth does a human have? Ans: Four
Q019: In which year was the US space station Skylab launched? Ans: 1973
Q020: How many noble gases are there? Ans: 6
Q021: What is the normal colour of sulphur? Ans: Yellow
Q022: Who performed the first human heart transplant? Ans: Dr Christiaan Barnard
Q023: Callisto, Europa, Ganymede and Io are 4 of the 16 moons of which planet? Ans:
Jupiter
Q024: Which planet was discovered in 1930 and has only one known satellite called
Charon? Ans: Pluto
Q025: How many satellites does the planet Uranus have? Ans: 15, 17, 18 or 21
Q026: In computing, if a byte is 8 bits, how many bits is a nibble? Ans: 4
Q027: What colour is cobalt? Ans: blue
Q028: Who became the first American to orbit the Earth in 1962 and returned to Space
in 1997? Ans: John Glenn
Q029: Who invented the light bulb? Ans: Thomas Edison
Q030: How many species of elephant are there in the world? Ans: 2
Q031: In 1980 which electronics company demonstrated its latest invention, the
compact disc? Ans: Philips
Q032: Who invented the television? Ans: John Logie Baird
Q033: Which famous British author wrote ”Chitty Chitty Bang Bang”? Ans: Ian Fleming
Q034: Who was the first President of America? Ans: George Washington
Q035: When was Adolf Hitler born? Ans: 1889
Q036: In what year did Adolf Hitler commit suicide? Ans: 1945
Q037: Who did Jimmy Carter succeed as President of the United States? Ans: Gerald
Ford
Q038: For how many years did the Jurassic period last? Ans: 180 million, 195 – 140
million years ago, 208 to 146 million years ago, 205 to 140 million years ago, 205 to 141
million years ago or 205 million years ago to 145 million years ago
Q039: Who was President of the USA from 1963 to 1969? Ans: Lyndon B Johnson
Q040: Who was British Prime Minister from 1974-1976? Ans: Harold Wilson
Q041: Who was British Prime Minister from 1955 to 1957? Ans: Anthony Eden
Q042: What year saw the first flying bombs drop on London? Ans: 1944
Q043: In what year was Nelson Mandela imprisoned for life? Ans: 1964
Q044: In what year was London due to host the Olympic Games, but couldn’t because of
the Second World War? Ans: 1944
Q045: In which year did colour TV transmissions begin in Britain? Ans: 1969
Q046: For how many days were US TV commercials dropped after President Kennedy’s
death as a mark of respect? Ans: 4
Q047: What nationality was the architect Robert Adam? Ans: Scottish
Q048: What nationality was the inventor Thomas Edison? Ans: American
Q049: In which country did the dance the fandango originate? Ans: Spain
Q050: By what nickname was criminal Albert De Salvo better known? Ans: The Boston
Strangler.
Appendix B
Implementation Details
We have used JCreator (http://www.jcreator.com/) as the preferred IDE. The code uses
newer features like generics, which are not compatible with any version of Java prior to
1.5. The following third party APIs are used:
All experiments were performed on a Core 2 Duo 1.86 GHz system with 2 GB RAM. The default
heap size may not be sufficient to run the application; the heap should be
increased to at least 512 MB using the -Xmx512m command-line option. Some classes
present in the JWNL API conflict with GATE. To resolve the issue, the conflicting
libraries belonging to GATE must not be included in the classpath.