for $1 \le i \le 2$, $1 \le n \le 10$.
Here, $C_{n,k}$ is the similarity function and is defined as
$$C_{n,k} = G_{1,n} \otimes G_{2,k} = G_{2,k} \otimes G_{1,n}$$
The $\otimes$ operator matches the entities present in both of its operands to measure the similarity between them. Now, the $FinalScore_{i,n}$ of each passage is calculated as
$$FinalScore_{i,n} = w_1 \cdot CurrentScore_{i,n} + w_2 \cdot ECRScore_{i,n}, \qquad \text{where } w_1 + w_2 = 1$$
Here, $CurrentScore_{i,n}$ is the score of the passage obtained from the answer-selection phase; $w_1$ and $w_2$ are weights that incorporate the contribution of both modules and are chosen (empirically) in our system to be 0.7 and 0.3, respectively. Finally, the answer passages are ranked according to their FinalScore, and the top 5 passages are presented to the user.
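The weighted combination and ranking step above can be sketched as follows. This is a minimal illustration, not the system's implementation: the passage records, field names, and the use of set intersection as a stand-in for the entity-matching $\otimes$ operator are all assumptions.

```python
# Sketch of the final ranking step. The passage dictionaries and the
# set-intersection similarity below are illustrative assumptions only.

def similarity(entities_a, entities_b):
    """Count entities present in both operands (a stand-in for C)."""
    return len(set(entities_a) & set(entities_b))

def final_scores(passages, w1=0.7, w2=0.3):
    """Combine the answer-selection score with the ECR score; w1 + w2 = 1."""
    assert abs(w1 + w2 - 1.0) < 1e-9
    ranked = sorted(
        passages,
        key=lambda p: w1 * p["current_score"] + w2 * p["ecr_score"],
        reverse=True,
    )
    return ranked[:5]  # top 5 passages are shown to the user

passages = [
    {"id": 1, "current_score": 0.9, "ecr_score": 0.2},
    {"id": 2, "current_score": 0.5, "ecr_score": 0.9},
]
top = final_scores(passages)
```

With the weights 0.7 and 0.3 used in the paper, passage 1 scores 0.69 and passage 2 scores 0.62, so passage 1 is ranked first.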
BioinQA
Figure 2. Sample output for the question "What is the difference between glycoprotein and lipoprotein?"
5. SEMANTIC HETEROGENEITY RESOLUTION THROUGH METADATA
Bioinformatics is a multidisciplinary field whose users span all levels, from the general public and students to researchers and medical practitioners. To bridge the gap between the level of understanding of an experienced researcher and that of a novice, our system employs metadata information during answer extraction:
I. Utilization of scientific and general terminology: A non-biology student is not likely to access information by "homosapiens" but by "humans". The user decides whether to use the system for novice search or advanced-user search (see clip). We have developed the Advanced-and-Learner-Knowledge Adaptive (ALKA) Algorithm, which performs selective ranking (of the initial 10 passages) on these principles: researchers use scientific terms and domain jargon more frequently. These may also include equations, numeric data (numbers, percentage signs), and words of large length, such as Lymphadenopathy. Thus, documents relevant to researchers will include more of such terms with a higher frequency, because that is what fulfills the user's need, whereas those meant for the novice would include simple (short) words with fewer numbers or equations. The entity file of the corpus, constructed in the initial phase, is configured to classify the terms as either biological (e.g., Efavirenz), scientific (e.g., homosapiens), or general (e.g., human), using metadata information. If a passage contains many frequently occurring scientific terms, it is given a lower rank for the novice and a higher one for the advanced user.
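The selective-ranking principle above can be sketched as a small re-ranker. This is a hedged illustration, not the ALKA Algorithm itself: the term list, thresholds, and the complexity heuristic are hypothetical, whereas the real system classifies terms using the corpus entity file and metadata.

```python
# Minimal sketch of ALKA-style selective ranking. The SCIENTIFIC set and
# the length/digit thresholds are illustrative assumptions only.

SCIENTIFIC = {"homosapiens", "efavirenz", "lymphadenopathy"}

def complexity(passage):
    """Fraction of tokens that look 'scientific': known jargon, numeric
    data (digits, percent signs), or unusually long words."""
    tokens = passage.lower().split()
    hits = sum(
        1 for t in tokens
        if t in SCIENTIFIC or any(c.isdigit() for c in t)
        or "%" in t or len(t) > 12
    )
    return hits / max(len(tokens), 1)

def rerank(passages, user="novice"):
    """Novice users see simple passages first; advanced users the reverse."""
    return sorted(passages, key=complexity, reverse=(user == "advanced"))
```

A passage such as "homosapiens show lymphadenopathy in 40% of cases" scores higher in complexity than "humans need vitamins", so the two user modes receive opposite orderings.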
II. Use of acronyms: Acronyms are of great importance in a field like biomedicine, where precise scientific terms are used and any error introduced by the requirement of typing long names can be critical. Resolving acronyms not only saves the user's time but also relieves them of the burden of remembering long scientific names to the accuracy of a single character.
Manually built acronym lists have been employed to resolve the differences in meaning caused by the use of an acronym at one place and its full form at another. Many acronym lists have been compiled and published, and many are available on the Web (e.g., Acronym Finder and the Canonical Abbreviation/Acronym List). As the purpose of this study was to demonstrate the use of information about acronym expansions in enhancing the answers obtained from a question answering system, the use of a manually built acronym list is justified.
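A lookup over such a manually built list can be sketched as below. The dictionary entries and the convention of expanding only fully uppercase tokens are illustrative assumptions, not the system's actual list or matching rule.

```python
# Sketch of acronym resolution against a manually built list.
# The entries and the all-uppercase heuristic are assumptions.

ACRONYMS = {
    "TB": "Tuberculosis",
    "RNA": "Ribonucleic acid",
}

def expand_acronyms(question):
    """Append the full form next to each known acronym so that passages
    using either surface form can match the query."""
    out = []
    for word in question.split():
        key = word.strip("?.,")
        out.append(word)
        if key.isupper() and key in ACRONYMS:
            out.append(f"({ACRONYMS[key]})")
    return " ".join(out)
```

For example, "What is TB?" becomes "What is TB? (Tuberculosis)", letting the retrieval stage match documents that spell out the full term.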
III. Comprehending the implicit assumptions of the user: It is a common observation that a typical user question rarely contains the full information required to answer it. Rather, it essentially contains many unstated assumptions and also requires extending or narrowing the meaning of the question to broaden or restrict the search. This is actually the case in real life for humans, as their conversations hardly include full detail but leave many things for the listener to assume. For example, a user may ask "How does Tat enhance the ability of RNA polymerase to elongate?" It is then up to the system to decide among the 3 RNA polymerases (1, 2, and 3).
To perform in such circumstances, the system is built with a Concepts Relation Graph (CRG), which affects the search by enhancing it with the knowledge represented in the graph (CRG is a form of metadata information).

Figure 3. Different outputs for advanced and novice users, respectively, and a display of acronym expansion.

CRG is a one-to-many
relation-graph representation of concepts and data of the biomedical domain (the entities corresponding to the nodes of the graph are obtained by this relation). For example, in the above-mentioned question, the concept of RNA will be related to the three possible variants, namely RNA polymerases 1, 2, and 3. CRG is meant to fill in the missing information required to answer the question, or to remove the ambiguity from the question. Given an ambiguous question, as determined by CRG, the user can either be prompted to supply more information, or the system can still answer the question with the aid of CRG. Clearly, a general user is not likely to know everything about the searched concept at the beginning, so the latter approach is better. Hence, the system recognizes the keywords present in the question, as well as those in the CRG, augmenting the search with CRG entities. The user is then presented the answer, along with the knowledge of the CRG. The user can choose to take the help
provided by CRG, and thus can select the suitable answer without having to search again by supplying more information. This approach is general enough to solve a variety of problems. If more precise information about the background of the user is available, the system can be configured to provide a unique and unambiguous answer by selecting just one entity from the CRG. This approach paves the way for the development of a friendly QA system that saves the user from having to enter elaborate information in the question (although at the expense of accuracy). Figure 3 shows the difference in the levels of answers obtained for novice and advanced users, with acronym expansion (Tuberculosis retrieved from the word TB).
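The CRG-based augmentation described above can be sketched as a one-to-many lookup that expands an under-specified question. The graph contents and matching rule below are illustrative assumptions; the actual CRG is built from biomedical domain metadata.

```python
# Sketch of CRG-based query augmentation. The graph below is a toy
# one-to-many relation; its contents are assumptions for illustration.

CRG = {
    "RNA polymerase": [
        "RNA polymerase 1",
        "RNA polymerase 2",
        "RNA polymerase 3",
    ],
}

def augment(question_keywords):
    """Add related CRG entities so an under-specified question can still
    retrieve answers for every plausible reading of the concept."""
    joined = " ".join(question_keywords).lower()
    extra = []
    for concept, variants in CRG.items():
        if concept.lower() in joined:
            extra.extend(variants)
    return question_keywords + extra
```

For the Tat question, the keyword list is expanded with all three polymerase variants, so the search covers every reading and the user can then pick the suitable answer with CRG's help.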
6. EXPERIMENTATION
As a sample resource, abstracts were taken from PubMed to experiment on the system. Difference-seeking questions are not generally available for use as test questions, unlike in open-domain evaluations, where test questions can be mined from question logs (Encarta, Excite, AskJeeves); thus we had them constructed by one of the biomedical students.
To build a set of questions, we took 40 normal questions and 20 difference-seeking questions from general students by conducting a survey. The group comprised beginners as well as sophomores. This was to simulate use of the system by both novice and expert users. The questions thus received were of widely varying difficulty and covered various topics of the subject. For each question the system presents the top 5 answers to the user (and 3 for difference-seeking questions). A question counts as answered only if the answer is available in the text presented to the user (and not merely in the document from which the text is retrieved).
Comparison of BioinQA with the Google Search Engine
We compared our system with the most sophisticated search engine, Google. Questions were posed to Google, and 5 documents were checked for the presence of the answer.
Evaluation metrics
For general questions we used the popular metric
Mean Reciprocal Answer Rank (MRAR) suggested in TREC
[17] for the assessment of question answering systems, which
is defined as follows.
$$RR[i] = \frac{1}{rank(i)}, \qquad MRAR = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{rank(i)}$$
where $n$ is the number of questions and $RR$ is the Reciprocal Rank.
For the evaluation of comparison-based questions, no metric has been suggested in the literature. To evaluate BioinQA's performance on such questions, a novel metric was adopted, called Mean Correlational Reciprocal Rank (MCRR), which is defined as follows. Let $rank_1$ and $rank_2$ be the ranks of the correct answers given by the system for the two components, respectively. Then
$$MCRR = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{rank_1(i)\cdot rank_2(i)}$$
where $n$ is the number of questions.
If the answer to a question is not found in the passages presented to the user, then the rank of that question is assumed to be ∞, whose value is large compared to the number of passages. For the calculation of MRAR, ∞ is taken as a large value. To calculate MCRR, ∞ is taken as a much smaller value, as this avoids punishing the case where the system provided an answer to only one of the components; in our experiments we took ∞ as 10.
The use of MCRR, being very similar to MRAR, is justified because it is symmetric with respect to the objects being compared (it takes the difference between A and B and the difference between B and A to be the same), and because the answer to a comparison question is complete only when both components (e.g., lipoprotein and glycoprotein) are described, not just one. It therefore punishes answers in which only one component has been answered.
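The two metrics can be computed as below. This is a sketch under the stated assumptions: `INF` stands in for the rank assigned when no answer is found (a large value for MRAR, 10 for MCRR, per the text), and the MCRR formula uses the product form of the two component ranks.

```python
# Sketch of the MRAR and MCRR evaluation metrics. A rank of None means
# the answer was not found among the presented passages.

def mrar(ranks, inf=1000):
    """Mean Reciprocal Answer Rank: average of 1/rank over all questions.
    `inf` is an assumed large substitute for an unanswered question."""
    return sum(1.0 / (r if r is not None else inf) for r in ranks) / len(ranks)

def mcrr(rank_pairs, inf=10):
    """Mean Correlational Reciprocal Rank over (rank1, rank2) pairs.
    A smaller `inf` (10, as in the paper) avoids over-punishing the case
    where only one component of a comparison question was answered."""
    def val(r):
        return r if r is not None else inf
    return sum(1.0 / (val(r1) * val(r2)) for r1, r2 in rank_pairs) / len(rank_pairs)
```

For example, two questions answered at ranks 1 and 2 give MRAR = 0.75, while the comparison pairs (1, 1) and (2, unanswered) give MCRR = (1 + 1/20)/2 = 0.525.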
Results: We calculated MRAR and MCRR for our system and the Google search engine. The following table and graphs summarize the results of our experiments.

Table 1: Experimental Results of BioinQA and Google on the data set

          MRAR     MCRR
BioinQA   0.7333   0.3096
Google    0.6328   0.2195

Figure 4. Plot of (a) MRAR vs. % of questions asked; (b) MCRR vs. % of questions asked.

Evaluation of the results:
As opposed to BioinQA, where the answer passages provided to the user were taken as correct, for Google the authors had to manually search the whole document returned by it to check
whether it contained the answer somewhere. This makes the user effort exorbitantly large for Google. Moreover, this strategy completely fails for comparison-based questions if Google does not happen to find a direct answer in the same words as those used in the question. The following figure illustrates this ineffectiveness.
Figure 5. No answer from the Google search for the question "What is the difference between lipoprotein and glycoprotein?"
7. CONCLUSIONS AND FUTURE WORK
Our biomedical QA system uses the techniques of entity recognition and matching. The system is based on searching in context and utilizes syntactic information. BioinQA also answers comparison-type questions from multiple documents, a feature which contrasts sharply with existing search engines, which merely return answers from a single document or passage. The use of metadata to understand the implicit assumptions of the user, to accommodate acronyms, and to answer the question based on the expertise of the user (rather than giving fixed answers to every user irrespective of background) makes the system adapted to the needs of the user.
Our future work will focus on developing a systematic framework for image (jpeg, bmp, etc.) extraction and a method for its contextual presentation, along with the presentation of textual data, as the answer to any question, which will greatly enhance the understanding of the user. Along with images, the focus will be on incorporating audio lectures available in e-learning facilities, and other sources such as PubMed.
8. REFERENCES
[1]. http://www.ncbi.nlm.nih.gov/ - National Center for Biotechnology Information. Last accessed 27 September, 2007.
[2]. Stergos Afantenos, Vangelis Karkaletsis, Panagiotis
Stamatopoulos. Summarization from medical documents: a
survey. 13th April, 2005.
[3] Zweigenbaum P. Question answering in biomedicine.
Workshop on Natural Language Processing for Question
Answering, EACL 2003.
[4]. Schultz S., Honeck M., and Hahn H. Biomedical text
retrieval in languages with complex morphology. Proceedings
of the Workshop on Natural Language Processing in the
Biomedical domain, July 2002, pp. 61-68.
[5]. Song Y., Kim S., and Rim H.. Terminology indexing and
reweighting methods for biomedical text retrieval.
Proceedings of the SIGIR'04 workshop on search and
discovery in bioinformatics, ACM, Sheffield, UK, 2004.
[6]. Minsuk Lee, James Cimino, Hai Ran Zhu, Carl Sable, Vijay Shanker, John Ely, Hong Yu. Beyond Information Retrieval: Medical Question Answering. AMIA, 2006.
[7]. Jacquemart P. & Zweigenbaum P. Towards a medical
question-answering system: a feasibility study. In R. Baud, M.
Fieschi, P. Le Beux & P. Ruch, Eds., Proceedings Medical
Informatics Europe, volume 95 of Studies in Health
Technology and Informatics, pp. 463-468, Amsterdam: IOS Press (2003).
[8]. Ayache, C. Rapport final de la campagne EQueR-EVALDA, Évaluation en Question-Réponse, 2005. Technolangue web site: http://www.technolangue.net/article61.html. Last accessed 15th June 2007.
[9]. Rinaldi F., Dowdall J., Shneider G. & Persidis A.
Answering questions in the genomics domain. ACL2004 QA
Workshop, 2004.
[10]. P. Jacquemart, and P. Zweigenbaum, Towards a medical
question-answering system: a feasibility study, In
Proceedings Medical Informatics Europe, P. L. Beux, and R.
Baud, Eds., 2003, Amsterdam. IOS Press.
[11] J. Ely, J. A. Osheroff, M. H. Ebell, et al., Analysis of
questions asked by family doctors regarding patient care,
BMJ, vol. 319, 1999, pp. 358-361.
[12]. Lei Li, Roop G. Singh, Guangzhi Zheng, Art
Vandenberg, Vijay Vaishnavi, Sham Navathe. A
Methodology for Semantic Integration of Metadata in
Bioinformatics Data Sources. 43rd ACM Southeast
Conference, March 18-20, 2005, Kennesaw, GA, USA.
[13]. Chen, L., Jamil, H. M., and Wang, N. Automatic
Composite Wrapper Generation for Semi-Structured
Biological Data Based on Table Structure Identification.
SIGMOD Record 33(2):58-64, 2004.
[14] Stoimenov, L., Djordjevic, K., Stojanovic, D. Integration
of GIS Data Sources over the Internet Using Mediator and
Wrapper Technology. Proceedings of the 2000 10th
Mediterranean Electrotechnical Conference. Information
Technology and Electrotechnology for the Mediterranean
Countries (MeleCon 2000), pp. 334-336.
[15] Kumar P., Kashyap S., Mittal A., Gupta S. A Fully Automatic Question Answering System for intelligent search in E-Learning Documents. International Journal on E-Learning (2005) 4(1), 149-166.
[16] Owen de Kretser, Alistair Moffat. Needles and Haystacks: A Search Engine for Personal Information Collections. Australasian Computer Science Conference, p. 58, 2000.
[17] Giovanni Aloisio, Massimo Cafaro, Sandro Fiore, Maria Mirto. ProGenGrid: a Workflow Service Infrastructure for Composing and Executing Bioinformatics Grid Services. Proceedings of the 18th IEEE Symposium on Computer-Based Medical Systems (CBMS'05).
ABOUT THE AUTHORS
Dr. Ankush Mittal: Dr Ankush Mittal
is a faculty member at Indian Institute of
Technology Roorkee, India. He has
published many papers in the international
and national journals and conferences. He
has been an editorial board member, Int.
Journal on Recent Patents on Biomedical
Engineering and reviewer for IEEE
Transactions on Multimedia, IEEE Transactions on Circuits and Systems for Video Technology, IEEE Transactions on Image Processing, IEEE Transactions on Fuzzy Systems, IEEE TKDE, etc. He has been awarded the Young Scientist Award by The National Academy of Sciences, India, 2006, for contributions to E-learning in the country; the best paper award (with Rs. 10,000) at the IEEE ICISIP conference, 2005; and Star Performer, 2004-05, IIT Roorkee, based on overall performance (teaching, research, thesis supervision, etc.). His research interests include Image
Processing and Object Tracking, Bioinformatics, E-Learning,
Content-Based Retrieval, AI and Bayesian Networks.
Sparsh Mittal: Sparsh Mittal is a senior
undergraduate student of Electronics &
Communications Engineering Department at
Indian Institute of Technology Roorkee,
India. His research interests include natural
language processing, data mining, FPGA
implementation using VHDL and Verilog
and image processing.
Saket Gupta: Saket Gupta is a senior
undergraduate in Electronics and
Communication Engineering Department at
Indian Institute of Technology Roorkee,
India. He has worked on Content Based
Retrieval, QA Systems and other NLP
applications for e-learning. His current field
of research includes MIMO communication systems; Image
processing; and FPGA synthesis and design using VHDL. He
has been awarded many scholarships from IIT Roorkee and
from other institutions.
Sumit Bhatia: Sumit Bhatia is a senior
undergraduate student in Electrical
Engineering Department at Indian Institute
of Technology Roorkee, India. His current
research interests include Content Based
Information retrieval and Data Mining. In
the past, he has worked in the areas of
Digital Image processing and Remote Sensing.