
Journal of Biomedical Informatics 40 (2007) 236–251. doi:10.1016/j.jbi.2007.03.002

www.elsevier.com/locate/yjbin

Development, implementation, and a cognitive evaluation
of a definitional question answering system for physicians

Hong Yu a,*, Minsuk Lee a, David Kaufman b, John Ely c, Jerome A. Osheroff d, George Hripcsak b, James Cimino b

a Department of Health Sciences, University of Wisconsin-Milwaukee, Enderis Hall 939, 2400 E. Hartford Avenue, P.O. Box 413, Milwaukee, WI 53211, USA
b Department of Biomedical Informatics, Columbia University, USA
c Department of Family Medicine, University of Iowa College of Medicine, 200 Hawkins Drive, 01291-D PFP, Iowa City, IA 52242, USA
d Thomson Healthcare, 4819 Emperor Blvd, Suite 125, Durham, NC, USA

* Corresponding author. Fax: +1 414 229 5100. E-mail address: HongYu@uwm.edu (H. Yu).

Received 17 November 2005


Available online 12 March 2007

Abstract

The published medical literature and online medical resources are important sources to help physicians make patient treatment decisions. Traditional sources used for information retrieval (e.g., PubMed) often return a list of documents in response to a user's query. Frequently, the number of documents returned from large knowledge repositories is so large that information seeking is practical only after hours and not in the clinical setting. This study developed novel algorithms and designed, implemented, and evaluated a medical definitional question answering system (MedQA). MedQA automatically analyzes a large number of electronic documents to generate short and coherent answers in response to definitional questions (i.e., questions of the form "What is X?"). Our preliminary cognitive evaluation shows that MedQA out-performed three other online information systems (Google, OneLook, and PubMed) on two important efficiency criteria; namely, the time spent and the number of actions taken for a physician to identify a definition. It is our contention that question answering systems that aggregate pertinent information scattered across different documents have the potential to address clinical information needs within a timeframe that meets the demands of clinicians.
Published by Elsevier Inc.

Keywords: Question answering; Information retrieval; Question analysis; Text summarization; Machine-learning; Evaluation

1. Introduction

Physicians often have questions about the care of their patients. The published medical literature and online medical resources are important sources for answering physicians' questions [1-4] and, as a result, for enhancing the quality of patient care [5-7]. Although a number of annotated medical knowledge databases, including UpToDate and Thomson Micromedex, are available to physicians, studies have found that physicians often need to consult the primary literature for the latest information in patient care [2,8,9]. Information retrieval systems (e.g., PubMed) return lists of retrieved documents in response to user queries. Frequently, the number of retrieved documents is large. For example, querying PubMed about the drug celecoxib results in more than one thousand articles. Physicians usually have limited time to browse the retrieved information. Studies indicate that physicians spend on average two minutes or less seeking an answer to a question, and that if a search takes longer, it is likely to be abandoned [1,10-12]. An evaluation study showed that it takes an average of more than 30 min for a healthcare provider to search for an answer with the MEDLINE search engine, which means information seeking is practical only after hours and not in the clinical setting [13].


Question answering can automatically analyze a large number of articles and generate short text, ideally in less than a few seconds, to answer questions posed by physicians. Such a technique may provide a practical alternative that enables physicians to efficiently seek information at the point of patient care. This paper reports the research development, implementation, and a cognitive evaluation of a medical definitional question answering system (MedQA). Although it is our long-term goal to enable MedQA to answer all types of medical questions, we started with the definitional question type because it tends to be more clear-cut and constrained in the medical domain; this contrasts with many other types of clinical questions, which typically have large variations in what reasonable answers might be.

2. Background

Although the notion of computer-based question answering has been around since the 1970s (e.g., [14]), the actual research development is still a relatively young field. Question answering has been driven by the Text REtrieval Conference (TREC), which supports research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies. TREC introduced a question answering (QA) track in 1999. TREC (2004) reported that the best question answering system performed with 77% accuracy for answering factoid questions (e.g., "How many calories are there in a Big Mac?") and a 62.2% F-score for answering list questions (e.g., "List the names of chewing gums") [15]. In addition to factoid questions, TREC has since 2003 provided evaluation of scenario questions (e.g., the definitional question "What is X?") that require long and complex answers. Research development in scenario question answering (e.g., [16,17]) has been supported by the Advanced Research and Development Activity (ARDA)'s Advanced Question and Answering for Intelligence (AQUAINT) program since 2001.

However, fewer research groups are working on medical domain-specific question answering. Zweigenbaum [18,19] provided an in-depth analysis of the feasibility of question answering in the biomedical domain. Rinaldi and colleagues [20] adapted an open-domain question answering system to answer genomic questions (e.g., "where was spontaneous apoptosis observed?") with a focus on identifying term relations based on a linguistically rich full parser. Niu and colleagues [21] and Delbecque and colleagues [22] incorporated semantic information for term relation identification. Specifically, they mapped terms in a sentence to UMLS semantic classes (e.g., "Disease or Syndrome") and then combined the semantic classes with surface cues or shallow parsing to capture term relations. For example, in the sentence "the combination of aspirin plus streptokinase significantly increased mortality at 3 months," the word "plus" refers to the combination of two or more medications [21]. Huang et al. [23] manually evaluated whether medical questions can be formulated by problem/population, intervention, comparison, and outcome (PICO), the criteria recommended by the practice of evidence-based medicine. Yu [24] proposed a framework to answer biological questions with images. None of the systems described above, however, reported a fully implemented question answering system that generates answers to users' questions from a large text collection, such as the millions of MEDLINE records and the World Wide Web collection used in the study we report here.

Empirical studies characterizing physicians' information needs and questions can illuminate different facets of the question answering problem. Ely, D'Alessandro and colleagues collected thousands of medical questions in clinical settings [1,25-27] (the question collection is freely accessible at http://clinques.nlm.nih.gov/). Fig. 1 lists a subset of these medical questions.

Fig. 1. A subset of clinical questions collected by Ely and his associates [1]. The left bar represents generic question proportions. For example, "What," "How," "Do," and "Can" account for 2231 (or 48%), 697 (or 15%), 320 (or 7%), and 187 (or 4%) of all clinical questions, respectively. Question examples are on the right side of the bar.

Taxonomies were built to semantically organize medical questions into types. Ely and colleagues [1] studied the 1396 medical questions they collected in one study and mapped them to a set of 69 question types (e.g., "What is the cause of symptom X?" and "What is the dose of drug X?") and 63 medical topics (e.g., "drug" or "cardiology"). Additionally, they created an "evidence taxonomy" that incorporated five hierarchical categories [28]. The top categories were Clinical or Non-clinical; Clinical questions were further divided into General versus Specific; General questions were divided into Evidence and No-evidence; and Evidence questions were divided into Intervention versus No-intervention.

We developed supervised machine-learning approaches to automatically classify a medical question into a question type specified by the evidence taxonomy [28]. Using a total of 200 annotated questions, our leave-one-out tenfold cross-validation performance showed over 80% accuracy for capturing questions that are answerable with evidence [29,30].

Because a specific answer strategy can be developed for a specific question type, this paper reports research development in Document Retrieval, Answer Extraction, Summarization, and Answer Formulation and Presentation for answering one specific question type; namely, definitional questions.

To assess the efficacy of this system, we employed an empirical study informed by a cognitive engineering approach to the study of human-computer interaction [31,32]. This is an interdisciplinary and task-centered approach to the development of principles, methods, and tools to guide the analysis and design of computer-based systems [33]. The goal is to understand how different systems mediate task performance. Although outcome measures such as accuracy, efficiency, and ease of use are important in this analysis, the focus is also on the cognitive and behavioral processes involved in employing the system to execute a task. In this context, we are interested in how MedQA would affect question answering relative to a system such as Google or PubMed. Cognitive methods of evaluation have been used to study a range of information-seeking tasks [34-37]. Recently, Elhadad and colleagues [38] employed a video-analytic cognitive approach to study the cognitive consequences of physicians' use of an automated summarization system that generates a summary tailored to patient characteristics. The findings indicate that physicians using the tailored summarizer were better able to access relevant information with greater efficiency and accuracy relative to a generic summarizer.

3. Methods

Fig. 2 shows the overall MedQA architecture. In the following sections, we first describe the text collection from which answers are extracted. We then describe the development of Document Retrieval, Answer Extraction, Summarization, and Answer Formulation and Presentation. Finally, we describe the cognitive evaluation in which we compared MedQA with three other state-of-the-art information retrieval systems; namely, Google, OneLook, and PubMed.

Fig. 2. MedQA system architecture. MedQA takes in a question posed by either a physician or a biomedical researcher. Question Classification automatically classifies the posed question as a question type for which a specific answer strategy has been developed. Noun phrases are generated as query terms. Document Retrieval applies the query terms to retrieve documents from either World Wide Web documents or the locally indexed MEDLINE collection. Answer Extraction automatically identifies the sentences that provide answers to questions. Text Summarization condenses the text by removing redundant sentences. Answer Formulation generates a coherent summary. The summary is then presented to the user who posed the question.

3.1. Text collection

While the heterogeneous World Wide Web collection has been shown to be comprehensive for open-domain question answering, research has found that the information from Web documents is not sufficient for answering highly domain-specific biomedical questions [39-46]; on the other hand, studies indicate that the biomedical literature is a qualified resource that can answer questions posed by physicians [4,11,13,18]. MedQA is currently built upon both the World Wide Web collection and the 15 million MEDLINE records.

3.2. Document retrieval

Document retrieval returns a list of relevant documents in response to a user's query. We applied Google as the search engine to retrieve Web documents. We applied the indexing tool Lucene (http://lucene.apache.org) to index documents from our local text collections (i.e., MEDLINE abstracts); a recent evaluation [47] shows that Lucene out-performed another information retrieval tool, Indri [48]. Since noun phrases in general provide the important content of each question, we applied the shallow parser LT CHUNK [49] to extract noun phrases and then applied the noun phrases as query terms to retrieve documents. On a randomly selected set of 100 medical questions, LT CHUNK performed with 79.2% recall and 84.6% precision for correctly identifying noun phrases in medical questions [29].
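To make the question-analysis and query-generation step concrete, the following is a minimal, illustrative sketch in Perl (the platform named in Section 3.7). It is not MedQA's code and does not call LT CHUNK; the regular expression is a crude stand-in for shallow noun-phrase chunking, and the example questions are hypothetical.

use strict;
use warnings;

# Hypothetical helper, not MedQA's LT CHUNK wrapper: strip the
# "What is/are ... ?" scaffolding of a definitional question and treat
# the remainder as the noun-phrase query term.
sub definitional_query_term {
    my ($question) = @_;
    if ($question =~ /^\s*what\s+(?:is|are)\s+(?:a\s+|an\s+|the\s+)?(.+?)\s*\??\s*$/i) {
        my $np = $1;
        $np =~ s/\s+/ /g;   # normalize internal whitespace
        return $np;         # e.g., "vulvar vestibulitis syndrome"
    }
    return undef;           # not recognized as a definitional question
}

for my $q ('What is vulvar vestibulitis syndrome?', 'How should warfarin be dosed?') {
    my $term = definitional_query_term($q);
    my $msg  = defined $term
        ? qq{"$q" -> query term: "$term"}
        : qq{"$q" -> not a definitional question};
    print "$msg\n";
}

In MedQA itself, the noun phrases extracted by LT CHUNK are used as query terms against Google for the Web collection and against the Lucene index for the local MEDLINE collection.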

3.3. Answer extraction

Answer extraction identifies, from the retrieved documents, relevant sentences that answer the question. We developed multiple strategies to identify relevant sentences.

MedQA first classifies sentences into specific types. Biomedical articles that report original research normally follow the rhetorical structure known as IMRAD (Introduction, Methods, Results, and Discussion) [50-53]. Within each section, there is a well-structured rhetorical substructure. For example, the introduction of a scientific paper usually begins with general statements about the significance of the topic and its history in the field [51]. A previous user study found that physicians prefer the Results section to others for determining the relevance of an article [54]. We found that definitional sentences were more likely to appear in the Introduction and Background sections. We followed previous approaches [55,56] in applying supervised machine-learning methods (e.g., naive Bayes) to identify different sections (e.g., Introduction, Background, Methods, Results, and Conclusion) [57]. The training set was generated automatically from abstracts in which the sections were identified by the authors of the abstracts. The trained classifier was then used to automatically predict the classes of MEDLINE sentences. This provided a total of 1,004,053 sentences for training, including 86,971 Introduction, 70,850 Background, 248,630 Methods, 371,419 Results, 167,252 Conclusions, and 58,930 Others. Employing leave-one-out tenfold cross-validation with bag-of-words features, the classifier achieved 78.6% accuracy for assigning a sentence to one of the five sections.

In addition, MedQA further categorized sentences based on linguistic and syntactic features. For example, in scientific writing it is customary to use the past tense when reporting original work and the present tense when describing established knowledge [50,51]. The biomedical literature reports not only experimental results, but also hypotheses, tentative conclusions, hedges, and speculations. MedQA applied cue phrases (e.g., "suggest," "potential," "likely," "may," and "at least") identified by [58], an approach reported to out-perform machine-learning approaches, to separate facts from speculations. Factual sentences were selected for capturing definitions.

The definitional sentences were identified by lexico-syntactic patterns. For example, the pattern {query term, formative verb (e.g., "is" or "are"), noun phrase} can be used to identify a definitional sentence such as "vulvar vestibulitis syndrome (VVS) is a common form of dyspareunia in premenopausal women" to answer a question such as "What is vulvar vestibulitis syndrome?" In contrast with other state-of-the-art definitional question answering systems [59-62], which mostly captured lexico-syntactic patterns manually, our system automatically learned lexico-syntactic patterns from a large training collection that was created automatically.

We automatically learned the lexico-syntactic patterns from a large set of Google definitions. Specifically, we applied all of the terms included in the Unified Medical Language System (UMLS 2005AA) as candidate definitional terms and crawled the Web to search for definitions. We built our crawler on the Google:Definition service, which provides definitions that appear to come mostly from web glossaries. We found that 36,535 terms (from the total of 1 million) had definitions specified by Google. We therefore downloaded a total of 191,406 definitions; the average number of definitions for each definitional term was 5.2.

With this set of definitions, we then automatically identified the lexico-syntactic patterns that comprise the definitions. We applied the information extraction system AutoSlog-TS [63,64] to automatically learn the lexico-syntactic patterns. In the following, we describe AutoSlog-TS and how we applied it for lexico-syntactic pattern learning.

AutoSlog-TS is an information extraction system that automatically identifies extraction patterns for noun phrases by learning from two sets of unannotated texts. In our application, one collection of text incorporates relevant (definitional) sentences, and the other collection is irrelevant (background) because it incorporates sentences randomly selected from the MEDLINE collection.

AutoSlog-TS first performs part-of-speech tagging and shallow parsing, and then generates every possible lexico-syntactic pattern within a clause to extract every noun phrase in both the relevant and the irrelevant texts. It then computes statistics based on how often each pattern appears in the relevant texts versus the irrelevant texts, and produces a ranked list of extraction patterns coupled with statistics indicating how strongly each pattern is associated with relevant and irrelevant texts. For each extraction pattern, AutoSlog-TS computes two frequency counts: totalfreq, the number of times that the pattern appears anywhere in the corpus, and relfreq, the number of times that the pattern appears in the relevant texts. The conditional probability estimates the likelihood that the pattern is relevant:

P(relevant | pattern) = relfreq / totalfreq

The RlogF measure [64] balances a pattern's conditional probability estimate with the log of its frequency:

RlogF(pattern) = log2(relfreq) x P(relevant | pattern)

The RlogF measure has been shown to be robust in a number of information extraction tasks [65]. We therefore used the RlogF measure to rank the lexico-syntactic patterns generated by AutoSlog-TS, and we then implemented the patterns in MedQA to capture definitional sentences.
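The pattern-ranking step reduces to a few lines of arithmetic. The following Perl sketch recomputes P(relevant | pattern) and RlogF for a handful of invented pattern counts; the pattern strings and frequencies are made-up examples for illustration, not AutoSlog-TS output.

use strict;
use warnings;

# relfreq   = occurrences of a pattern in the relevant (definitional) texts
# totalfreq = occurrences of the pattern anywhere in the corpus
# The counts below are illustrative only.
my %counts = (
    '<term> is a <noun phrase>'   => { relfreq => 820, totalfreq => 1500 },
    '<term> is defined as <x>'    => { relfreq => 95,  totalfreq => 110  },
    '<term> was observed in <x>'  => { relfreq => 12,  totalfreq => 900  },
);

sub rlogf {
    my ($relfreq, $totalfreq) = @_;
    return 0 if $relfreq <= 0 || $totalfreq <= 0;
    my $p = $relfreq / $totalfreq;            # P(relevant | pattern)
    return ( log($relfreq) / log(2) ) * $p;   # log2(relfreq) * P(relevant | pattern)
}

my %score;
for my $pattern (keys %counts) {
    my $c = $counts{$pattern};
    $score{$pattern} = rlogf( $c->{relfreq}, $c->{totalfreq} );
}
for my $pattern ( sort { $score{$b} <=> $score{$a} } keys %score ) {
    printf "%-30s RlogF = %.3f\n", $pattern, $score{$pattern};
}

Patterns that occur often and occur mostly in the definitional texts rank first, while patterns that are frequent but indiscriminate (like the third example) fall to the bottom, which is the behavior the RlogF measure is designed to reward.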

3.4. Summarization and answer formulation

Summarization condenses a stream of text into a shorter version while preserving its information content [66]. Since we obtain sentences from multiple documents, many of the extracted sentences are frequently redundant. Summarization attempts to remove the redundant sentences and present to the user a short, coherent summary that maximizes the coverage of information content. Summarization has been an active research field in natural language processing (NLP). Open-domain summarization techniques consider word distribution [67], based on the intuition that the most frequent words represent the most important concepts of the text; linguistic information [68], which includes meta-linguistic markers (e.g., "in conclusion"); location [69], which relies on the intuition that headings and sentences at the beginning and end of the text may contain important information; rhetoric, which refers to the analysis of discourse structure for identifying the main claims of a text [70]; cohesion [66,71], which captures semantically related terms, co-reference, ellipsis, and conjunctions; and popularity [72], based on the intuition that the most popular content across multiple documents is the most important; as well as techniques for improving summarization coherence [73]. Information extraction [74,75] has been explored for summarizing medical, domain-specific texts; however, those systems were neither generalizable to the entire biomedical domain nor able to output summaries.

Since the focus of our summarization is to remove redundant sentences, we cluster similar sentences into groups based on the assumption that redundant sentences share certain lexical similarities. MedQA applies information retrieval techniques to summarize sentences. It clusters sentences based on TF*IDF-weighted cosine similarity [76] and selects the most representative sentence from each cluster. MedQA clusters sentences with hierarchical clustering algorithms, which are widely used for other tasks such as topical document clustering [77], and generates summaries using sentences from different clusters. Specifically, MedQA implements two clustering algorithms; namely, group-wise average and single-pass clustering. In the following, we describe both algorithms.

Group-wise average clustering starts with the entire set of N sentences to be summarized. It identifies pair-wise sentence similarity based on TF*IDF word features. It then merges the two sentences with the highest similarity into one cluster and re-evaluates pairs of sentences/clusters; two clusters are mergeable if the average similarity across all pairs of sentences within the two clusters is equal to or greater than a predefined threshold. The computational complexity of group-wise average clustering is 1/2 x N x (N - 1) pairwise similarity computations per pass, approximately N^3 over the full clustering.

Single-pass clustering starts with one sentence randomly selected from the entire set of N sentences to be summarized. When the second sentence is added, the algorithm calculates the similarity between the two sentences and clusters them together if the similarity is above a predefined threshold. It continues to add sentences one at a time. When a newly added sentence is compared with a cluster containing several sentences, the similarity is the average similarity of the new sentence with all of the sentences in the cluster. The computational complexity of single-pass clustering is 1/2 x N x (N - 1).

Single-pass clustering makes a decision for each sentence as soon as the sentence is first judged, and therefore clustering the entire sentence set is faster than with group-wise average clustering. On the other hand, group-wise average out-performs single-pass clustering in topical document clustering [78]. We balance the advantages of both algorithms by applying single-pass clustering when we have a large number of sentences (>150) and group-wise average clustering when we have a smaller number of sentences.

MedQA applies the centroid-based summarization technique [79] to select the most representative sentences and to generate a coherent summary. The method first selects from each cluster the one sentence that has the highest similarity to the rest of the sentences within the cluster. The selected sentences are then ordered based on their similarity to the rest of the selected sentences.

3.5. Incorporating Web definitions

We found that the Web is a rich resource for definitions. On the other hand, definitions from the Web can be of mixed quality. In particular, many online definitions are not related to the biomedical domain. For example, "heart" was defined both as "...one of the most successful female fronted bands in the annals of hard rock" and as "a hollow, muscular organ that pumps blood through the blood vessels by repeated, rhythmic contractions"; the latter is useful for MedQA, while the former is not relevant.

In order to filter out non-relevant definitions and separate them from biomedical domain-specific definitions, we developed a component that identifies definitions in World Wide Web documents and incorporates useful Web definitions as part of the answer output. We use Google:Definition to capture definitions from the World Wide Web, and we also draw on other online dictionaries, including Dorland's Illustrated Medical Dictionary. For each Web definition, we measure the similarity (i.e., TF*IDF) between the definition and the retrieved MEDLINE abstracts. We select a Web definition if it has the highest similarity score and the score is above an ad hoc threshold we defined.

3.6. Display

MedQA presents a user with both Web definitions and MEDLINE sentences. Since we found that Web definitions are in general clearly stated and targeted at a general population, we decided to display the Web definitions first, followed by the MEDLINE definitions. We chose to include the top five MEDLINE sentences in a summary. Since the clustering step frequently generated a large number of clusters, displaying only the summary would mean that MedQA potentially discarded the remaining clusters, which might also be important to biologists or physicians. MedQA therefore displays both the summary and other relevant sentences to the user. Fig. 3 shows an output generated by our system in response to the question "What is vestibulitis?" MedQA links each sentence to the original document source, either on the Web or in the locally indexed MEDLINE collection.
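To illustrate the clustering step of Section 3.4, the following Perl sketch groups a few sentences with single-pass clustering over TF*IDF-weighted cosine similarity. It is a simplified stand-in for MedQA's implementation, not the production code: the tokenization is naive, the three sentences are invented examples, and the 0.3 threshold is an arbitrary placeholder rather than the system's tuned value.

use strict;
use warnings;

my $THRESHOLD = 0.3;   # placeholder merge threshold

my @sentences = (
    'Vulvar vestibulitis syndrome is a common form of dyspareunia in premenopausal women',
    'Vulvar vestibulitis is a common cause of dyspareunia in premenopausal women',
    'Celecoxib is a selective COX-2 inhibitor',
);

# Term-frequency vector per sentence and document frequency per term.
my (@tf, %df);
for my $s (@sentences) {
    my %v;
    $v{ lc $_ }++ for $s =~ /\w+/g;
    $df{$_}++ for keys %v;
    push @tf, \%v;
}
my $n = scalar @sentences;

# TF*IDF weight of term $t in sentence vector $v.
sub weight {
    my ($v, $t) = @_;
    return $v->{$t} * log( $n / $df{$t} );
}

# Cosine similarity between two TF*IDF vectors.
sub cosine {
    my ($va, $vb) = @_;
    my ($dot, $na, $nb) = (0, 0, 0);
    for my $t (keys %$va) {
        $dot += weight($va, $t) * weight($vb, $t) if exists $vb->{$t};
        $na  += weight($va, $t) ** 2;
    }
    $nb += weight($vb, $_) ** 2 for keys %$vb;
    return ($na && $nb) ? $dot / ( sqrt($na) * sqrt($nb) ) : 0;
}

# Single-pass clustering: each sentence joins the first cluster whose average
# similarity to its members meets the threshold; otherwise it starts a new one.
my @clusters;   # each cluster is an array ref of sentence indices
for my $i (0 .. $#sentences) {
    my $placed = 0;
    for my $c (@clusters) {
        my $avg = 0;
        $avg += cosine( $tf[$i], $tf[$_] ) for @$c;
        $avg /= scalar @$c;
        if ($avg >= $THRESHOLD) { push @$c, $i; $placed = 1; last; }
    }
    push @clusters, [$i] unless $placed;
}

for my $k (0 .. $#clusters) {
    printf "cluster %d: sentences %s\n", $k + 1, join( ', ', @{ $clusters[$k] } );
}

With these inputs the two near-duplicate vestibulitis sentences fall into one cluster and the celecoxib sentence into its own; the centroid-based step described above would then pick one representative sentence per cluster for the summary.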

Fig. 3. MedQA's output for the question "What is vestibulitis?" The output displays an online definition that comes from Dorland's Illustrated Medical Dictionary, a summary that incorporates definitional sentences extracted from different PubMed records, and other relevant sentences. The parenthetical expression gives the last name of the first author and the year of publication (e.g., Sackett, 2001) and links to the PubMed record from which the preceding sentence was extracted.

3.7. Software environment

MedQA is implemented with Perl as the core platform and runs on a Macintosh PowerPC (dual 2 GHz CPU, 2 GB of physical memory, Mac OS X Server 10.4.2). We chose Perl in this prototype phase because it offers fast debugging, numerous modules that provide efficient implementations of many well-known data structures and algorithms, and easy integration of multiple components via its inter-process communication capability. The distribution of time spent among the different components was 6 s for Document Retrieval, 6 s for Answer Extraction, and 6 s for Text Summarization. The full parsing was done off-line.

3.8. Cognitive evaluation

We designed a randomized controlled cognitive evaluation in order to assess efficacy, accuracy, and perceived ease of use. MedQA was compared to three other systems; namely, Google, OneLook, and PubMed, all of which represent state-of-the-art web-based information retrieval systems. OneLook is a portal for numerous online dictionaries, including several medical editions (e.g., Dorland's). The study was approved by the Columbia University Institutional Review Board.

3.9. Question selection

The definitional questions were selected from a total of 138 definitional questions out of 4,653 questions posed by physicians in various clinical settings [1,25-27] (the question collection is freely accessible at http://clinques.nlm.nih.gov/). The 138 definitional questions are listed in Appendix A. We observed that the definitional questions in general fell into several categories, including Disease or Syndrome, Drug, Anatomy and Physiology, and Diagnostic Guideline. In order to maximize the evaluation coverage, we attempted to select questions that covered most of these categories.

After a preliminary examination, we found that many questions did not yield answers from two or more of the systems to be evaluated. For example, the question "what is proshield?" did not yield a meaningful answer from three systems (MedQA, OneLook, and PubMed). The objective was to compare different systems, and unanswerable questions present a problem for the analyses because they render such comparisons impossible. On the other hand, if we had screened the questions with the four systems themselves, this might have introduced bias through a selective exclusion process. We therefore employed an independent information retrieval system, BrainBoost (http://www.brainboost.com/), a web-based question answering engine that accepts natural language queries. BrainBoost was presented with questions randomly selected from the categories of definitional questions, and the first 12 questions that returned an answer were included in the study. The question selection task was performed by an unbiased assistant who was not privy to the reasons for doing the search. The 12 selected questions are marked in bold in Appendix A.

Table 1
Actions used to answer questions

Enter Query: Entering a search term in the search text box provided by the system
Find Document: An action that involves >10 s of time spent examining the retrieved list of documents (e.g., web documents or PubMed abstracts)
Select Linkout: An action that involves following another linkout (i.e., clicking on an embedded link) from the selected document
Query Modification: A user modifies the existing query or user interface (for example, changing to Scholar.Google)
Read Document: An action that involves the subject spending >10 s reading the selected document
Scroll-Down Document: Scrolling down the document in search of an answer
Search Text Box: A subject initiates the Find function to identify the related text
Select Document: The subject selects and opens a document to examine whether the answer appears in the document
Select Text as Answer: A subject selects the text as the answer to the question

3.10. Subjects and procedure

Four physicians (one male and three female, ages 30-60) with experience using information systems volunteered to participate in the study. Each physician was presented with all of the 12 questions selected for inclusion. For each question, the subjects were asked to evaluate two systems in succession. The order of the questions and the systems was counterbalanced. Each subject posed six questions to each of the four systems, in pairs of two. The four subjects therefore posed a total of 96 questions (12 x 4 x 2).

After consenting to participate in the study, participants were given written instructions on how to perform the task. They were presented with each question on a cue card and asked to find the text that best answered the question. The card also indicated the two systems to be used and their sequence. Once the text was located, they were asked to copy and paste it into a Word document. They were free to continue to search and paste text into the document until they were satisfied that they had found the best answer possible. There was a time limit of 5 min for each question/system event. Participants were asked to think aloud during the entire process. We chose 5 min as a cutoff because a previous study found that internet users successfully found health information to answer questions in an average of 5 min [42]. After completing each question evaluation comparing the two systems, they were asked to respond to two Likert items: (1) rate the quality of the answer and (2) rate the ease of use of the system. We employed a five-point rating scale (1-5) in which 1 represents the poorest and 5 represents the best. We used the Morae usability software system to record the screen activity and to audio-record each subject's comments for the entire session. Morae provides a video of all screen activity and logs a wide range of events and system interactions, including mouse clicks, text entries, web-page changes, and window dialogue events. It also provides the analyst with the capability to timestamp, code, and categorize a range of video events.

3.11. Analysis

On the basis of a cognitive task analysis [32,35], we identified goals and actions common to all systems. Table 1 shows the list of actions we defined. We also noted system responses (e.g., what was displayed after executing a search), analyzed comments thematically, and measured the response times. The protocols were coded by two analysts (H.Y. and D.K.). The total coding time for the four subjects was about 30 h.

3.12. Evaluation results

In the following sections, we present the results of the cognitive evaluation. The first part illustrates the processes of question answering; we also show the coding process used to characterize participants' actions. The second part focuses on a quantitative comparison of the four systems. We include both objective measures, such as actions and response latency, and subjective measures, namely participants' ratings of the quality of answers and their ease of use.

3.13. Illustration

The following two coding excerpts illustrate the process of question answering on two pairs of systems: OneLook and MedQA, and PubMed and MedQA. The excerpts are representative of task performance. The subject was an experienced physician with a master's degree in informatics who was well-versed in performing medical information-seeking tasks.

3.14. Excerpt 1: OneLook and MedQA

The subject had completed four questions and was a little more than 30 min into the session. The current question was "What is Best's Disease?" The systems used to find the answer were OneLook and MedQA, respectively. The entire segment was 5:12 min, of which 3:27 was used to search the OneLook system and 1:36 to search MedQA (Fig. 4).

3.14.1. Excerpt from a subject's protocol

31:23 ACTION (ENTER QUERY, OneLook): Best Disease
31:30 SYSTEM RESPONSE: 4 dictionaries, including 2 general and 2 medical (see Fig. 4)
31:32 (User) COMMENT: "So there are two dictionary matches... for medicine and two dictionary matches in general"
31:51 ACTION SELECT DOCUMENT
COMMENT: "Wikipedia is available and I would just take a guess that I'm going to get the best information from the Wikipedia... and I'm 100% wrong."
31:54 SYSTEM RESPONSE: Web page changes (Wikipedia)
COMMENT: "The only thing I get from the Wikipedia outlink is a one line definition of Best disease"

Fig. 4. A screen shot of OneLook's return for the search term "Best's Disease."

32:10 ACTION SELECT TEXT (pastes text into Word document)
COMMENT: "and I'll put in the system"
32:42 COMMENT: "So I am going to go back to OneLook and have a look at Dorland's Medical Dictionary"
32:10 ACTION SELECT DOCUMENT (clicks on link)
COMMENT: "Dorland's Medical Dictionary and see if I can get a better definition, looking at a specific area"
32:49 SYSTEM RESPONSE: Web page changes (Dorland's Medical Dictionary)
COMMENT: "And I get a... worse definition... Possibly, I better go back and check. I could have a different definition"
33:05 ACTION SELECT TEXT (pastes text into Word document)
COMMENT: "Let's see, I get a definition of congenital macular degeneration. I think that's the same I got in Wikipedia"
33:25 COMMENT: "Ah, I get different information, it does give you the idea that it is an inherited congenital disease and that it affects vision. It does not specifically talk about congenital macular degeneration, you'd have to outlink on congenital macular degeneration to get more information about that illness"
33:39 ACTION LINKOUT (clicks on congenital macular degeneration)
33:39 SYSTEM RESPONSE: Web page changes (Dorland's Medical Dictionary, MerckSource resource library)
COMMENT: "and there is... again minimal information based on the outlink"
33:59 ACTION SELECT TEXT (pastes text into Word document)
32:10 ACTION SELECT DOCUMENT (Online Medical Dictionary)
COMMENT: "So I will give it one more try and see if I can get better information. I will try the online medical dictionary"
34:19 SYSTEM RESPONSE: Web page changes (Online Dictionary Search)
34:24 ACTION: Mouses over links (does not select one)
COMMENT: "which is totally unhelpful"

The participant selected Wikipedia from the four choices and expressed dissatisfaction with the sparse response. He decided to continue to look for the answer in Dorland's Medical Dictionary and, much to his surprise, little detail was provided. A third try with the online medical dictionary yielded the statement "autosomal retinal degeneration in the first several years of life" with several of the terms hyperlinked (e.g., to define "retinal"). The information was similar to that offered by the other two sources and he stated that it was "totally unhelpful."

34:59 ACTION: Go to MedQA
COMMENT: "OK, so we are going to try MedQA"
34:59 SYSTEM RESPONSE: Web page changes (MedQA)
COMMENT: "I am waiting... it is searching Medline, it is searching the Web, done, done... identifying, filtering that return"
35:32 SYSTEM RESPONSE: Web page changes; MedQA returns an answer with Results and Conclusion (reads from display): "Summary from MedLine and Wikipedia"
35:33 Reviews results
35:55 COMMENT: "A much... much much better ah return"

35:59 ACTION SELECT TEXT (pastes text into Word document)
35:59 COMMENT: "With all of the annotations I would have had to have searched manually or pulled together, and it's all coordinated to give me a complete picture of what the disease is. So ok. So I would say that OneLook, the quality of the answers I would say, I would just say was poor for what I would need, and the ease of use of the system, that was also poor; it required me to make multiple guesses at which resources would provide me with ah... comprehensive definition and that could help me as a physician"
36:35 COMMENT: "MedQA: the quality of the answer was excellent and the ease of use was absolutely... I mean... perfect, just plain simple, I did not have to do anything extra and I got exactly what I needed"

Excerpt 2: PubMed and MedQA

The subject had completed five questions and was a little more than 40 min into the session. The question in this excerpt was "What is vestibulitis?" The systems used to find the answer were PubMed and MedQA, respectively. The entire segment lasted 6 min, of which 4:25 was used to search PubMed and 1:11 to search MedQA.

44:23 ACTION (ENTER QUERY, PubMed): vestibulitis
44:34 SYSTEM RESPONSE: 251 MEDLINE records returned
44:51 (User) COMMENT: "OK, I definitely got some answers that do not apply at all... I have no idea why the first set of returns are coming back with psychological problems, but maybe not true; as a physician I just make the assumption that ENT would be returned, but if I am a gynecologist, that probably is what I am looking for. Vulvar vestibulitis, I have no idea what it is. I guess I will go find out because I do not know"
45:22 ACTION SELECT DOCUMENT
45:23 ACTION SELECT FULL-TEXT OUT-LINK
45:24 SYSTEM RESPONSE: Out-link failed
45:25 ACTION SELECT FULL-TEXT OUT-LINK
45:26 SYSTEM RESPONSE: Out-link failed
45:33 ACTION FIND DOCUMENT
COMMENT: "No... I cannot find any definitions"
46:11 ACTION (QUERY MODIFICATION): vestibulitis
COMMENT: "Try vestibulitis only"
46:14 SYSTEM RESPONSE: 251 MEDLINE records returned
46:17 ACTION SELECT DOCUMENT
COMMENT: "Just try this one, surgical treatment of vulvar vestibulitis, this seems to be a good definition"
46:29 ACTION SELECT FULL-TEXT
46:39 SYSTEM RESPONSE: Out-link failed
46:40 ACTION SELECT LINKOUT (of the full-text article)
46:41 SYSTEM RESPONSE: Out-link failed
COMMENT: "It does not seem to have any outlink, it is only the abstract. The abstract does not give any characteristics of what the syndrome is"
47:10 ACTION SELECT TEXT AS ANSWER
47:49 ACTION FIND DOCUMENT
47:57 ACTION SELECT DOCUMENT
48:00 ACTION SELECT FULL-TEXT (PDF file)
48:02 ACTION READ DOCUMENT
COMMENT: "seems to get pain syndromes"
48:48 ACTION SELECT TEXT AS ANSWER
COMMENT: "OK, I am going to leave PubMed"
49:12 ACTION (ENTER QUERY, MedQA): What is vestibulitis?
COMMENT: "MedQA uses MEDLINE, probably will return the same information; hopefully, it will get other information as well."
49:52 SYSTEM RESPONSE: shown in Fig. 3
COMMENT: "OK, MedQA pulls back exactly the same information, nothing else"
50:23 ACTION SELECT TEXT AS ANSWER
GENERAL COMMENT: "I would say that PubMed, again all the information was there but was not held in a useful fashion, and I need to search all and I have to filter myself... and quality of answer was OK and ease of use is poor because I need to go through everything. MedQA: quality of answer is excellent and ease of use is excellent, I do not need to do anything"

MedQA returned an integrated summary drawing on Medline abstracts and text from the Web. Only a single act of search was required and the information provided was much more comprehensive. The participant commented on the difference in the quality of the answer and in the ease of use. These examples are not meant to demonstrate the superiority of MedQA over the other systems, but to characterize the differences in the question answering process and in the ways the systems support interaction. For example, subjects would iteratively search PubMed and OneLook until they found a satisfactory answer. As a consequence, they would examine multiple documents (necessitating find link and Linkout actions), only a few of which were relevant. The subjects typically searched for full-text articles as the Linkout actions. The iterative nature of the search was also evidenced by the number of actions pertaining to query modification, searching the text box, and document selection.

Table 2
A summary of comments on the different systems (D for disadvantages and A for advantages)

Google (D): retrieves a lot of links (for the question "What is cubital tunnel syndrome?"). Most of the links seem to relate to individual cases of the disease, not necessarily definitions
Google (D): One needs to search for and evaluate the definitions in Google
Google (A): retrieves both patient-centric (Google) and physician-centric (Scholar.Google) information
Google (A): Scholar.Google is much faster because it is the second link, while in PubMed the evaluator has to search through a lot of other articles
MedQA (D): needs the user to type in "What is" rather than a direct query
MedQA (D): takes considerably longer to respond than the other systems
MedQA (A): returns all the context that the evaluator would otherwise have to search for manually. It is only one step and gets exactly what is needed
MedQA (A): gives an answer (to the question "What is Popper?") that OneLook did not, which is that the drug is injectable, which is important for a physician to know
OneLook (D): pulls all links. It leaves the user to guess which link contains a comprehensive answer. Sometimes the links are broken. It is a matter of luck to get to the right links
OneLook (D): answer quality is poor. It has a terrible user interface. It shows two ugly photos
OneLook (A): the definition has more content than PubMed's
PubMed (D): is not a good resource for definitions
PubMed (D): is not useful. It takes forever to find information

Table 2 lists a summary of the comments made by subjects throughout the evaluation. Our results show that Google received more favorable comments than complaints. Both MedQA and OneLook received some good comments and some complaints. PubMed was generally criticized and was not given any favorable comments.

3.15. Quantitative evaluation

The results show that the subjects did not find answers to one question in Google ("dawn's syndrome"), three questions in OneLook ("epididymis appendix," "heel pain syndrome," and "ottawa knee rules"), three questions in MedQA ("epididymis appendix," "Ottawa knee rules," and "paregoric"), and two questions in PubMed ("epididymis appendix" and "paregoric"). Both MedQA and OneLook acknowledged that no results were found and returned no answers when such an event occurred, while PubMed and Google returned a list of documents from which subjects could not identify the definitions within the 5 min time limit.

Table 3 presents descriptive statistics of the subjective and objective measures based on all 12 evaluated questions. In general, Google was the preferred system, as reflected in both the quality of answer and the ease of use ratings. We observed that none of the subjects used Google:Definition as the service to identify definitions; instead, they applied the query terms in Google or Scholar.Google. MedQA achieved the second highest ratings on both measures. OneLook received the lowest ratings for quality of answer, and PubMed was rated the worst in terms of ease of use. All subjects gave the poorest score when no answer was found.

While the processing time to obtain an answer was almost instant for Google, OneLook, and PubMed, the average time MedQA took to generate an answer to the 10 answerable questions was 15.8 ± 7.1 s. Surprisingly, MedQA was on average the fastest system for a subject to obtain the definition. For measuring the average time spent, we excluded the cases in which MedQA and OneLook returned no answer.

Table 4
Average frequency of actions by category for each question

                        Google   MedQA   OneLook   PubMed
Find link                 0.4     0.1      0.1       1.2
Linkout                   0.2     0.1      1.2       1.5
Read document             0       0.3      0         0.8
Query modification        0.2     0.2      0.6       0.6
Scroll-down document      0.3     0.04     0.4       0.2
Search text box           0.2     0.2      0.1       1.4
Select document           1.7     0.5      2.4       3.1
Select text as answer     1.4     0.7      1.6       1.3

Table 3
Average score (standard deviation) for quality of answer and ease of use, and average time spent (in seconds) and number of actions taken

                        Google        MedQA         OneLook       PubMed
Time spent (s)          69.6 (6.9)    59.1 (57.7)   83.1 (63.6)   182.2 (85.8)
Number of actions       4.4 (3.0)     2.1 (2.0)     6.5 (7.7)     10.3 (5.7)
Quality of answer       4.90 (0.15)   2.92 (0.24)   2.77 (0.08)   2.92 (0.88)
Ease of use             4.75 (0.29)   4.0 (0.24)    3.9 (0.32)    2.36 (0.88)

The subjects, on average, spent more time searching PubMed than any of the other systems. In fact, the average PubMed search required more than three times the amount of time required to search MedQA. This is at least partly due to the complexity of the interaction, as borne out by the fact that participants needed more than 10 actions on average when using PubMed to answer a question, whereas they required only two actions on average when they used MedQA. PubMed provides a range of affordances (e.g., Limits, MeSH) that support iterative searching. Although this makes it a powerful tool, it also increases the complexity of the task and the user's cognitive load. MedQA offers the simplest mode of interaction because it eliminates several of the steps (e.g., uploading documents, searching text, and selectively accessing relevant information in a document) involved in searching for information. The results for the commercial search engines, Google and OneLook, fell in between MedQA and PubMed. However, as evidenced by the high standard deviations, there was significant variability between questions.

The frequency of actions by category is reported in Table 4. The pattern of actions employed by participants reflects the nature of the interactions supported by each system. For example, subjects would iteratively search PubMed until they found a satisfactory answer. As a consequence, they would examine multiple documents (necessitating find link and linkout actions), only a few of which were relevant to defining the query term. The subjects typically searched for full-text articles as the linkout actions. The iterative nature of the search was also evidenced by the number of actions pertaining to query modification, searching the text box, and document selection.

4. Discussion and future work

This paper reports the development and evaluation of a medical question answering system, MedQA. MedQA identifies noun phrases in a question and applies them as query terms to retrieve relevant documents. MedQA further identifies definitions based on linguistic features, removes redundant sentences with robust statistical hierarchical clustering, and generates a coherent summary with a centroid-based summarization method [79].

The contributions of this study are threefold. First, we automatically identified a set of lexico-syntactic patterns for capturing definitions, while most of the previous work in definitional question answering identified the patterns manually [59-62]. Second, MedQA builds upon four advanced techniques; namely, question analysis, information retrieval, answer extraction, and summarization, to generate a coherent answer to definitional questions; none of the previous domain-specific question answering systems reported the integration of summarization techniques. Third, we evaluated MedQA and three other state-of-the-art information systems using a cognitive model.

Our evaluation results show that MedQA out-performed OneLook and PubMed on most of the evaluation criteria. All subjects gave Google almost perfect subjective scores for quality of answer. The results demonstrate that Google may be a reliable resource for physicians in identifying definitions. Our results contrast with most previous reports, which found Internet information content to be of poor quality [44,80-88], although there are differences between our study and the others. First, we focused on a more general type of question; namely, definitional questions, while the other studies examined more specific medical questions (e.g., "What further testing should be ordered for an asymptomatic solitary thyroid nodule, with normal TFTs?" in [89]). Second, physicians would rate Google highly if they found answers from the sites, even if some of the sites did not provide answers to the questions; other studies, in contrast, evaluated whether answers were present in all Web sites or in selected Web sites. For example, one study [90] concluded that Google hits were of poor quality because only one link out of five led to a website with relevant information. Last, in our study the quality of answer was judged by the final collection of answers, which may be copied and pasted from multiple Web pages; other studies evaluated the completeness of each single Web page for answering a specific question. Such an evaluation will certainly lead to a much poorer rating of the Internet, because one evaluation study [35] concluded that information is scattered across the Internet: most Web pages incorporate information either in depth or in breadth, and few Web sites incorporate information both in depth and in breadth.

Although the evaluation results show that Google was the preferred system in both quality of answer and ease of use, the results also show that MedQA out-performed Google in time spent and number of actions, two important efficiency criteria for obtaining an answer. Our results show that although it took less than a second for Google to retrieve a list of relevant documents based on a query keyword, while it took an average of 16 s for MedQA to generate a summary, the average time spent for a subject to identify a definition was 59.1 ± 57.7 s with MedQA, which was faster than the 69.6 ± 6.9 s with Google. A recent study [35] evaluated high-quality, authoritative medical websites and found that information was scattered across the web: most web pages had only a few facts related to a specific topic, and relatively few web pages had many facts. A user may have to visit several sites before finding all of the pertinent information. Our study confirmed that a subject needed to visit multiple web pages in order to obtain the information he or she needed, because definitions were scattered across the web pages in a Google search. The ability of MedQA to integrate disparate pieces of text makes it superior to other search engines in this respect.

Throughout MedQA development, we identified a number of important research areas. Our current system implemented the shallow syntactic chunker LT CHUNK to capture noun phrases as query terms for answer extraction. However, we found that LT CHUNK made many mistakes. For example, LT CHUNK fails to identify "eating disorder" in the question "What is eating disorder?" The facts that LT CHUNK was trained on general English text, not medical, domain-specific text, and that LT CHUNK was mostly trained on regular sentences, not questions, have greatly undermined its capacity to efficiently capture noun phrases in medical questions. A comprehensive biomedical question answering system needs a robust and accurate parser that is specifically developed for the biomedical domain. Such a parser will also be useful for capturing lexico-syntactic patterns in answer extraction.

ical, domain-specic text, and that LT CHUNK was mostly to. . . and . . . caused by. . . We have empirically evalu-
trained on regular sentences, not questions have greatly ated the relations between semantics and lexico-syntactic
undermined the capacity of LT CHUNK to eciently cap- patterns [91]. Future work one may combine the semantics
ture noun phrases of medical questions. A comprehensive with the lexico-syntactic patterns to eciently identify sen-
biomedical question answering system needs a robust and tences for answer extraction.
accurate parser that is specically developed in the biomed-
ical domain. Such a parser will also be useful for capturing 5. Conclusions
lexico-syntactic patterns in answer extraction.
Another important area of a successful question answer- This study reports research development, implementa-
ing system is to nd ways to capture users intentions to tion, and a cognitive evaluation of a biomedical question
determine the scope of answers. For example, when a phy- answering system (MedQA). MedQA generates short-para-
sician asks What is the dawn phenomenon? he wants to graph-level texts to answer physicians and other biomedi-
know not only the denition of this term, but also how to cal researchers ad hoc questions. The contributions of this
diagnose it and manage it. Essentially, a denition question work include:
(i.e., What is X?) requires answers beyond denitions
(e.g., what causes X? and How to treat X?). One can (1) Automatic generation of lexico-syntactic patterns for
obtain the users intentions by working directly with the identifying denitions.
target users (e.g., physicians or biologists) throughout (2) The integration of document retrieval, answer extrac-
MedQA development. tion, and summarization into a working system that
Summarization is an important research area. Summari- generated a short paragraph-level answer to a deni-
zation incorporates three areas; namely, removing redun- tional question.
dant sentences, identifying important information, and (3) A cognitive evaluation that compared MedQA with
generating a coherent summary. Currently we implemented three other state-of-the-art online information sys-
a simple summarization that is based on statistical cluster- tems; namely, Google, OneLook, and PubMed.
ing. Future work one may combine statistical approaches
with both linguistic and semantic approaches such as the Our results show that MedQA in general out-performed
one proposed by [75]. OneLook and PubMed in the following four criteria: qual-
Finally, semantic information plays an important role ity of answer, ease of use, time spent, and actions taken for
for both answer extraction and summarization. For exam- obtaining an answer. Although the evaluation results show
ple, the following are denitions for heart and heart that Google was preferred system in quality of answer and
attack, in which we have mapped terms to the UMLS ease of use, the results showed that MedQA out-performed
semantic types in [Superscript]. We link multi-word terms Google in its time-spent and actions taken for obtaining an
(e.g., myocardial infarction) with underscore _ (e.g., answer; both advances showed the promise for MedQA to
myocardial_infarction). be useful in clinical settings.
Heart[Body Part, Organ, or Organ Component]: The hollow[Spatial It is important to point out the limitations of this work.
Concept]
muscular[Spatial Concept] organ[Body Part, Organ, This is a small scale involving four physicians who evalu-
or Organ Component, Tissue] [Spatial Concept]
located behind[Spatial ated 12 medical questions. These physicians may not be
Concept] [Body Part, Organ, or Organ Component]
the_sternum and representative of the broader population. In addition, the
between the lungs[Body Part, Organ, or Organ Component]. small sample size precludes the ability to measure the statis-
Heart attack[Disease or Syndrome]: also called myocar- tical dierences among search engines. Future work needs
dial_infarction[Disease or Syndrome]; damage[Functional Concept] to increase both the number of subjects and the number
to the heart_muscle[Tissue] due to insucient blood of the questions to be evaluated. Although MedQA is a
supply[Organ or Tissue Function] for an extended[Spatial Concept] work in progress, we can provisionally conclude that such
time_period[Temporal Concept]. a system has the potential to facilitate the process of infor-
We found that, given a definitional term (DT) with a semantic type S_DT, the terms that appear in the definition are statistically correlated with related semantic types. For example, when the semantic type S_DT of a definitional term is [Body Part, Organ, or Organ Component], the definition tends to incorporate terms with the same or related semantic types (e.g., [Body Part, Organ, or Organ Component] and [Spatial Concept]).

Additionally, the lexico-syntactic patterns of definitional sentences depend on the semantic type of the definitional term S_DT. When S_DT is [Body Part, Organ, or Organ Component], the lexico-syntactic patterns include "...located..."; when S_DT is [Disease or Syndrome], the patterns include "...due to...".
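The following sketch shows how such semantic-type-conditioned cue patterns could be applied to flag candidate definitional sentences. The two pattern lists mirror the cues mentioned above ("...located..." and "...due to..."); they are illustrative assumptions, not the patterns MedQA actually learned.

# Illustrative sketch: cue patterns keyed on the semantic type of the definitional term.
import re

PATTERNS_BY_SEMANTIC_TYPE = {
    "Body Part, Organ, or Organ Component": [r"\blocated\b"],
    "Disease or Syndrome": [r"\bdue to\b", r"\balso called\b"],
}

def looks_definitional(sentence: str, semantic_type: str) -> bool:
    """Return True if the sentence matches a cue pattern for the given semantic type."""
    patterns = PATTERNS_BY_SEMANTIC_TYPE.get(semantic_type, [])
    return any(re.search(p, sentence, re.IGNORECASE) for p in patterns)

if __name__ == "__main__":
    sentence = "A heart attack is damage to the heart muscle due to insufficient blood supply."
    print(looks_definitional(sentence, "Disease or Syndrome"))  # True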
We developed, implemented, and evaluated MedQA, a medical definitional question answering system. MedQA generates short paragraph-level texts to answer physicians' and other biomedical researchers' ad hoc questions. The contributions of this work include:

(1) Automatic generation of lexico-syntactic patterns for identifying definitions.
(2) The integration of document retrieval, answer extraction, and summarization into a working system that generates a short paragraph-level answer to a definitional question.
(3) A cognitive evaluation that compared MedQA with three other state-of-the-art online information systems; namely, Google, OneLook, and PubMed.

Our results show that MedQA in general out-performed OneLook and PubMed on the following four criteria: quality of answer, ease of use, time spent, and actions taken to obtain an answer. Although the evaluation results show that Google was the preferred system for quality of answer and ease of use, MedQA out-performed Google in time spent and actions taken to obtain an answer; both advantages suggest that MedQA could be useful in clinical settings.

It is important to point out the limitations of this work. This was a small-scale study involving four physicians who evaluated 12 medical questions. These physicians may not be representative of the broader population. In addition, the small sample size precludes measuring statistical differences among the search engines. Future work needs to increase both the number of subjects and the number of questions evaluated. Although MedQA is a work in progress, we can provisionally conclude that such a system has the potential to facilitate the process of information seeking in demanding clinical contexts.

Appendix A

Definitional questions that we found among over four thousand clinical questions collected by Ely, D'Alessandro and colleagues [1,25-27]. The 12 questions selected for evaluation are in bold.

1. What is cerebral palsy?
2. What is gemfibrozil?
3. What is D-dimer?
4. What is the Marsh score?
5. What is hemoglobin A0?
6. What is TCA (tetracaine, cocaine, alcohol)?
7. What is Vagisil?
8. What is Tamiflu (oseltamivir)?
9. What is midodrine?
10. What is Ambien (zolpidem)?
11. What is Terazol (terconazole)?
12. What is Maltsupex?
13. What is DDAVP (1-desamino-8-D-arginine vasopressin)?
14. What is Resaid?
15. What is droperidol (Inapsine)?
16. What is an appendix epididymis?
17. What is henox premer?
18. What is Lotrel?
19. What is Hytrin?
20. What is Cozaar?
21. What is ceftazidime?
22. What is Zolmitriptan (Zomig)?
23. What is octreotide (sandostatin) and somatostatin?
24. What is Proshield?
25. What is risperidone (Risperdal)?
26. What is Genora?
27. What is mepron (Atovaquone)?
28. What is Zoloft (sertraline)?
29. What is fluvoxamine (Luvox)?
30. What is Lunelle?
31. What is amantadine dosing?
32. What is cyclandelate?
33. What is clonazepam (Klonopin)?
34. What is westsoy formula?
35. What is Zyprexa?
36. What is Mexitil (mexiletine)?
37. What is paregoric?
38. What is Legatrin?
39. What is nimodipine (Nimotop)?
40. What is glatiramer (Copaxone)?
41. What is propafenone?
42. What is Ultravate cream (Halobetasol)?
43. What is cilostazol (Pletal) (for intermittent claudication)?
44. What is sotalol?
45. What is Norvasc (amlodipine)?
46. What is Uristat?
47. What is nabumetone?
48. What is Zofran (ondansetron)?
49. What is terbinafine (Lamisil)?
50. What is Cetirizine (Reactin)?
51. What is Serzone (nefazodone)?
52. What is Sansert?
53. What is Urised?
54. What is nedocromil sodium (Tilade)?
55. What is Ultram (tramadol)?
56. What is the Poland anomaly?
57. What is vestibulitis?
58. What is hemolytic uremic syndrome?
59. What is a Lisfranc fracture?
60. What is euthyroid sick syndrome?
61. What is Williams Syndrome?
62. What is senile tremor?
63. What is NARES (nonallergic rhinitis with eosinophilia syndrome)?
64. What is Walker-Warburg Syndrome?
65. What is Carnett's sign?
66. What is the oxygen dissociation curve?
67. What is the pivot-shift test?
68. What is Osler's sign?
69. What is nephrocalcinosis?
70. What is Vanderwoude syndrome?
71. What is dysfibrinogenemia?
72. What is Walker-Warburg syndrome?
73. What is the antibiotic dose?
74. What is bronchiolitis?
75. What is occipital neuralgia?
76. What is the cubital tunnel syndrome?
77. What is Klippel Feil Syndrome?
78. What is dyskinesia?
79. What is Sandifer syndrome?
80. What is heel pain syndrome?
81. What is a blue dome cyst?
82. What is central pontine myelinosis?
83. What is Kenny syndrome?
84. What is dysdiadokokinesis?
85. What is Wegener's granulomatosis?
86. What is melanosis coli?
87. What is lipoprotein A?
88. What is FISH (fluorescence in situ hybridization)?
89. What is schizoaffective disorder?
90. What is Charcot Marie Tooth Disease?
91. What is the adenosine-thallium exercise tolerance test?
92. What is Ogilvie's syndrome?
93. What is prealbumin?
94. What is serum sickness?
95. What is Kussmaul breathing?
96. What is Prader-Willi syndrome?
97. What is an Adie's pupil?
98. What is Noonan syndrome?
99. What is fetor hepaticus?
100. What is dissociative disorder?
101. What is the Jendrassik maneuver?
102. What is a tethered spinal cord?
103. What is the postcholecystectomy syndrome?
104. What is herpes gladiatorum?
105. What is Fragile X syndrome?
106. What is a high ankle sprain?
107. What is Peutz-Jeghers Syndrome?
108. What is an eccrine spiradenoma?
109. What is Fanconi's Syndrome?
110. What is Still's disease?
111. What is the dawn phenomenon?
112. What is BOOP (bronchiolitis obliterans and organizing pneumonia)?
113. What is Best's disease?
114. What is Rovsing's sign?
115. What is a sliding hiatus hernia?
116. What is biliary gastritis (bile gastritis)?
117. What is a Hepatolite scan (questionably the same as a PAPIDA scan)?
118. What is seronegative spondyloarthropathy?
119. What is Stickler Syndrome?
120. What is the Delphi technique?
121. What is Xalatan eye drops?
122. What is Betamol eye drops?
123. What is the binomial distribution?
124. What is a Roux-en-Y hepatoenterostomy?
125. What is the bivariate normal distribution?
126. What is Sperling's sign (Sperling's maneuver)?
127. What is an incidence rate ratio?
128. What are Alomide ophthalmic drops?
129. What is the urological surgery technique?
130. What are Acular ophthalmic drops? (good for allergic conjunctivitis?)
131. What are poppers?
132. What are Hawkins and Neer impingement signs (for shoulder pain, rotator cuff injury)?
133. What are the Ottawa knee rules?
134. What are endomysial antibodies?
135. What are preeclampsia labs (laboratory studies)?
136. What are Lewy bodies?
137. What are the heat injury syndromes?
138. What are pineal brain tumors?
References

[1] Ely JW, Osheroff JA, Ebell MH, Bergus GR, Levy BT, Chambliss ML, et al. Analysis of questions asked by family doctors regarding patient care. Br Med J 1999;319(7206):358-61.
[2] Straus S, Sackett D. Applying evidence to the individual patient. Ann Oncol 1999;10(1):29-32.
[3] Cimino JJ, Li J, Graham M, Currie LM, Allen M, Bakken S, et al. Use of online resources while using a clinical information system. AMIA Annu Symp Proc 2003:175-9.
[4] Alper BS, White DS, Ge B. Physicians answer more clinical questions and change clinical decisions more often with synthesized evidence: a randomized trial in primary care. Ann Fam Med 2005;3(6):507-13.
[5] Westbrook JI, Gosling AS, Coiera E. Do clinicians use online evidence to support patient care? A study of 55,000 clinicians. J Am Med Inform Assoc 2004;11(2):113-20.
[6] Westbrook JI, Coiera EW, Gosling AS. Do online information retrieval systems help experienced clinicians answer clinical questions? J Am Med Inform Assoc 2005;12(3):315-21.
[7] Gosling AS, Westbrook JI. Allied health professionals' use of online evidence: a survey of 790 staff working in the Australian public hospital system. Int J Med Inform 2004;73(4):391-401.
[8] Sackett D, Straus S, Richardson W, Rosenberg W, Haynes R. Evidence-based medicine: how to practice and teach EBM. Edinburgh: Harcourt Publishers Limited; 2000.
[9] Schilling LM, Steiner JF, Lundahl K, Anderson RJ. Residents' patient-specific clinical questions: opportunities for evidence-based learning. Acad Med 2005;80(1):51-6.
[10] Alper B, Stevermer J, White D, Ewigman B. Answering family physicians' clinical questions using electronic medical databases. J Fam Pract 2001;50(11):960-5.
[11] Jacquemart P, Zweigenbaum P. Towards a medical question-answering system: a feasibility study. Stud Health Technol Inform 2003;95:463-8.
[12] Takeshita H, Davis D, Straus S. Clinical evidence at the point of care in acute medicine: a handheld usability case study. Proceedings of the human factors and ergonomics society 46th annual meeting, 2002, p. 1409-13.
[13] Hersh WR, Crabtree MK, Hickam DH, Sacherek L, Friedman CP, Tidmarsh P, et al. Factors associated with success in searching MEDLINE and applying evidence to answer clinical questions. J Am Med Inform Assoc 2002;9(3):283-93.
[14] Waltz D. An English language question answering system for a large relational database. Commun ACM 1978;21(7):526-39.
[15] Voorhees E. Overview of the TREC 2004 question answering track. NIST Special Publication: SP 500-261, 2004.
[16] Yu H, Hatzivassiloglou V. Towards answering opinion questions: separating facts from opinions and identifying the polarity of opinion sentences. Proceedings of empirical methods in natural language processing (EMNLP), 2003.
[17] Bethard S, Yu H, Thornton A, Hatzivassiloglou V, Jurafsky D. Semantic analysis of propositional opinions. AAAI 2004 spring symposium on exploring attitude and affect in text: theories and applications, 2004.
[18] Zweigenbaum P. Question answering in biomedicine. EACL workshop on natural language processing for question answering, Budapest, 2003, p. 1-4.
[19] Zweigenbaum P. Question-answering for biomedicine: methods and state of the art. MIE 2005 Workshop, 2005.
[20] Rinaldi F, Dowdall J, Schneider G, Persidis A. Answering questions in the genomics domain. ACL 2004 workshop on question answering in restricted domains, 2004.
[21] Niu Y, Hirst G. Analysis of semantic classes in medical text for question answering. ACL 2004 workshop on question answering in restricted domains, 2004.
[22] Delbecque T, Jacquemart P, Zweigenbaum P. Indexing UMLS semantic types for medical question-answering. In: Engelbrecht R et al., editors. Connecting medical informatics and bio-informatics ENMI 2005, 2005.
[23] Huang X, Lin J, Demner-Fushman D. Evaluation of PICO as a knowledge representation for clinical questions. Annual symposium of the American medical informatics association, 2006.
[24] Yu H. Towards answering biological questions with experimental evidence: automatically identifying text that summarize image content in full-text articles. American medical informatics association, 2006.
[25] Ely JW, Osheroff JA, Ferguson KJ, Chambliss ML, Vinson DC, Moore JL. Lifelong self-directed learning using a computer database of clinical questions. J Fam Pract 1997;45(5):382-8.
[26] Ely JW, Osheroff JA, Chambliss ML, Ebell MH, Rosenbaum ME. Answering physicians' clinical questions: obstacles and potential solutions. J Am Med Inform Assoc 2005;12(2):217-24.
[27] D'Alessandro DM, Kreiter CD, Peterson MW. An evaluation of information-seeking behaviors of general pediatricians. Pediatrics 2004;113(1 Pt 1):64-9.
[28] Ely JW, Osheroff JA, Gorman PN, Ebell MH, Chambliss ML, Pifer EA, et al. A taxonomy of generic clinical questions: classification study. Br Med J 2000;321(7258):429-32.
[29] Yu H, Sable C. Being Erlang Shen: identifying answerable questions. Proceedings of the nineteenth international joint conference on artificial intelligence on knowledge and reasoning for answering questions, 2005.
[30] Yu H, Sable C, Zhu H. Classifying medical questions based on an evidence taxonomy. Proceedings of the AAAI 2005 workshop on question answering in restricted domains, 2005.
[31] Kushniruk A, Patel V. Cognitive and usability engineering methods for the evaluation of clinical information systems. J Biomed Inform 2004;37(1):56-76.
[32] Kaufman D, Patel R, Hilliman C, Morrin P, Pevzner J, Weinstock R, et al. Usability in the real world: assessing medical information technologies in patients' homes. J Biomed Inform 2003;36:45-60.
[33] Norman D. Cognitive engineering. In: User centered system design. Lawrence Erlbaum Associates; 1986. p. 31-61.
[34] Marchionini G. Information seeking in electronic environments. New York: Cambridge University Press; 1995.
[35] Bhavnani S. Why is it difficult to find comprehensive information? Implications of information scatter for search and design: research articles. J Am Soc Inf Sci Technol 2005;56(9):989-1003.
[36] Sutcliffe A, Ennis M. Towards a cognitive theory of information retrieval. Interacting with Computers 1998;10(3):321-51.
[37] Seol Y, Kaufman D, Mendonca E, Cimino J, Johnson S. Scenario-based assessment of physicians' information needs. MedInfo, 2004.
[38] Elhadad N, McKeown K, Kaufman D, Jordan D. Facilitating physicians' access to information via tailored text summarization. American medical informatics annual symposium, Washington, DC, 2005.
[39] Allen E, Burke J, Welch M. How reliable is science information on the web? Nature 1999;402:722.
[40] Culver J, Gerr F, Frumkin H. Medical information in the internet: a study of an electronic bulletin board. J Gen Intern Med 1997;12(8):466-71.
[41] Davison K. The quality of dietary information on the world wide web. Clin Perform Qual Health Care 1997;5:64-6.
[42] Eysenbach G, Kohler C. How do consumers search for and appraise health information on the world wide web? Qualitative study using focus groups, usability tests, and in-depth interviews. Br Med J 2002;324(7337):573-7.
[43] Eysenbach G, Powell J, Kuss O, Sa E. Empirical studies assessing the quality of health information for consumers on the world wide web: a systematic review. J Am Med Assoc 2002;287(20):2691-700.
[44] Griffiths K, Christensen H. Quality of web based information on treatment of depression: cross sectional survey. Br Med J 2000;321:1511-5.
[45] Impicciatore P, Pandolfini C, Casella N, Bonati M. Reliability of health information for the public on the world wide web. Br Med J 1997;314:1875-9.
[46] Thelwall M. Extracting macroscopic information from web links. J Am Soc Inf Sci Technol 2001;52(13):1157-68.
[47] Lin J. User simulations for evaluating answers to question series. Inform Process Manag 2007.
[48] Metzler D, Croft W. Combining the language model and inference network approaches to retrieval. Inform Process Manag 2004;40(5):735-50.
[49] Mikheev A. 1996. Available from: http://www.ltg.ed.ac.uk/software/chunk/.
[50] Day R. How to write and publish a scientific paper. Cambridge: Cambridge University Press; 1998.
[51] Gabbay I, Sutcliffe R. A qualitative comparison of scientific and journalistic texts from the perspective of extracting definitions. ACL 2004 workshop on question answering in restricted domains, 2004.
[52] Mizuta Y, Collier N. Zone identification in biology articles as a basis for information extraction. Natural language processing in biomedicine and its applications, post-COLING workshop, 2004.
[53] Mizuta Y, Collier N. An annotation scheme for a rhetorical analysis of biology articles. Proceedings of the fourth international conference on language resources and evaluation (LREC2004), 2004.
[54] Elhadad N, Kan M, Klavans J, McKeown K. Customization in a unified framework for summarizing medical literature. J Artif Intell Med 2004.
[55] Lin J, Karakos D, Demner-Fushman D, Khudanpur S. Generative content models for structural analysis of medical abstracts. HLT-NAACL BioNLP, New York City, 2006.
[56] McKnight L, Srinivasan P. Categorization of sentence types in medical abstracts. American medical informatics association 2003, Washington, DC, 2003.
[57] Zhu H, Yu H. A comparative study for automatic zone identification in MEDLINE (a poster). BioLINK SIG: linking literature, information and knowledge for biology, 2005.
[58] Light M, Qiu X, Srinivasan P. The language of bioscience: facts, speculations, and statements in between. HLT-NAACL 2004 workshop: BioLink 2004, linking biological literature, ontologies and databases, 2004.
[59] Klavans J, Muresan S. Evaluation of the DEFINDER system for fully automatic glossary construction. Proceedings of American medical informatics association symposium, 2001.
[60] Xu A, Weischedel R. TREC 2003 QA at BBN: answering definitional questions. TREC, 2003.
[61] Blair-Goldensohn S, McKeown K, Schlaikjer A. Answering definitional questions: a hybrid approach. New directions in question answering, 2004.
[62] Cui H, Kan M, Cua T. Generic soft pattern models for definitional question answering. The 28th annual international ACM SIGIR 2005, Salvador, Brazil, 2005.
[63] Riloff E, Philips W. An introduction to the Sundance and AutoSlog systems. University of Utah School of Computing, 2004.
[64] Riloff E. Automatically generating extraction patterns from untagged text. AAAI, 1996.
[65] Yu H, Agichtein E. Extracting synonymous gene and protein terms from biological literature. Bioinformatics 2003;19(Suppl. 1):i340-9.
[66] Barzilay R, Elhadad M. Using lexical chains for text summarization. ACL intelligent scalable text summarization workshop 1997, Madrid, 1997.
[67] Luhn H. The automatic creation of literature abstracts. IBM J Res Dev 1958;2(2).
[68] Edmundson H. New methods in automatic extracting. J ACM 1969;16(2):264-85.
[69] Hovy E, Lin C. Automated text summarization in SUMMARIST. ACL/EACL workshop on intelligent scalable text summarization 1997, Madrid, 1997.
[70] Marcu D. The rhetorical parsing, summarization, and generation of natural language texts. University of Toronto, 1997.
[71] Halliday M, Hasan R. Cohesion in English. Longman; 1976.
[72] Barzilay R, Elhadad M, McKeown K. Information fusion in the context of multi-document summarization. The 37th association for computational linguistics 1999, Maryland, 1999.
[73] Barzilay R, Elhadad M, McKeown K. Inferring strategies for sentence ordering in multidocument summarization. JAIR, 2002.
[74] McKeown K, Chang S, Cimino J, Feiner S, Gravano L, et al. PERSIVAL, a system for personalized search and summarization over multimedia healthcare information. Proceedings of the 1st ACM/IEEE-CS joint conference on digital libraries 2001, 2001.
[75] Fiszman M, Rindflesch T, Kilicoglu H. Abstraction summarization for managing the biomedical research literature. HLT-NAACL 2004: computational lexical semantic workshop, 2004.
[76] Salton G. A vector space model for information retrieval. CACM 1975;18(11):613-20.
[77] Lee M, Wang W, Yu H. Exploring supervised and unsupervised approaches to detect topics in biomedical text. BMC Bioinformatics 2006;7:140.
[78] Hatzivassiloglou V, Gravano L, Maganti A. An investigation of linguistic features and clustering algorithms for topical document clustering. SIGIR, 2000.
[79] Radev D, Jing H, Budzikowska M. Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies. ANLP/NAACL workshop on summarization, 2000.
[80] Purcell G. The quality of health information on the internet. Br Med J 2002;324(7337):557-8.
[81] Jadad AR, Gagliardi A. Rating health information on the Internet: navigating to knowledge or to Babel? JAMA 1998;279(8):611-4.
[82] Silberg WM, Lundberg GD, Musacchio RA. Assessing, controlling, and assuring the quality of medical information on the Internet: Caveant lector et viewor (let the reader and viewer beware). JAMA 1997;277(15):1244-5.
[83] Glennie E, Kirby A. The career of radiography: information on the web. J Diagn Radiogr Imaging 2006;6:25-33.
[84] Childs S. Judging the quality of internet-based health information. Perform Meas Metrics 2005;6(2):80-96.
[85] Cline RJ, Haynes KM. Consumer health information seeking on the Internet: the state of the art. Health Educ Res 2001;16(6):671-92.
[86] Benigeri M, Pluye P. Shortcomings of health information on the Internet. Health Promot Int 2003;18(4):381-6.
[87] Wyatt J. Commentary: measuring quality and impact of the world wide web. Br Med J 1997;314:1879.
[88] McClung HJ, Murray RD, Heitlinger LA. The Internet as a source for current patient information. Pediatrics 1998;101(6):E2.
[89] Berkowitz L. Review and evaluation of internet-based clinical reference tools for physicians: UpToDate, 2002.
[90] Berland GK, Elliott MN, Morales LS, Algazy JI, Kravitz RL, Broder MS, et al. Health information on the Internet: accessibility, quality, and readability in English and Spanish. JAMA 2001;285(20):2612-21.
[91] Yu H, Wei Y. The semantics of a definiendum constrains both the lexical semantics and the lexicosyntactic patterns in the definiens. HLT-NAACL BioNLP 2006, New York, 2006.