
AN ABSTRACT PROCESSING PROBLEM WITH APPLICATION TO MACHINE LEARNING

Author: Amulya Yadav (amulya@iitp.ac.in)
Mentor: Prof. Rakesh Verma (rverma@uh.edu)
Organization: University of Houston
Date: July 26, 2010

TECHNICAL REPORT
ABSTRACT: This project is based on machine learning. PubMed is a database of medical literature in which millions of research-paper abstracts are indexed every year. A recent need gave rise to the problem of identifying whether any given abstract in the database is relevant to chemical terrorism. An abstract is considered relevant if it contains phrases such as "chemical terrorism", "chemical", or "terrorism". We wanted to solve this problem with machine learning, and the project was planned in several phases. In the first phase, a vector was prepared for each abstract which would completely characterise that abstract. These vectors were to be used as input to a machine learning algorithm; the expectation was that, once trained, the machine could be fed an abstract without its vector characterisation and classify it as either relevant or irrelevant. The work of the author of this report was the entire first phase as described above, so the main topic of this report is the preparation of these vectors from the abstracts. The programming language used in preparing the vectors is Perl, as that language is well suited to text-processing applications.

INTRODUCTION: Initially, the entire text file containing the abstracts was to be tagged manually. Tags here mean annotations in the text file itself, such as <Y>, <N>, and <Maybe>. Based on the tagged abstracts, a universe was first created containing elements, or phrases in layman's terms, which indicated the relevance of the abstracts to chemical terrorism. This universe was then modified to remove redundancies. After this, the tagged text file was processed with the help of the universe, and a vector was created for each abstract.

Accuracy of the vectors was paramount because of the machine learning nature of the algorithm to which they were to be fed. Inaccurate vectors would cause the machine learning to be erratic and produce unexpected results. It was therefore necessary to make sure that the vectors were exactly what we wanted them to be.

BACKGROUND: The abstract text was to be tagged according to the following rules:

1. Any abstract which, according to the human tagger, was relevant to chemical terrorism was to be marked with a <Y> after the title of the abstract.

2. Any abstract which, according to the human tagger, was irrelevant to chemical terrorism was to be marked with an <N> after the title of the abstract.

3. In case the tagger was not sure, the abstract could be tagged with a <Maybe> or <M>.

4. Any phrase or word pertaining to chemical terrorism was to be enclosed in <T> and </T> tags.

5. Any name of a chemical was to be enclosed in <C> and </C> tags.

6. Just before the text of the abstract begins, a <TXT> tag should be made.

We were working with 400 abstracts from PubMed. Each of four people was given 100 abstracts to tag, and each person then checked the tagging of the others. This was done to reduce errors to a minimum.

STATEMENT OF PROBLEM: The problem in front of the author was this:

For the abstract processing program, the output would be in a spreadsheet format, as a CSV file:

Col 0: Year of publication
Col 1: Identifier for the training/non-training data (e.g., a unique identifier like the PubMed ID)
Col 2: If the data is training data, the correct classification: 1 for relevant to chemical terrorism, -1 for irrelevant; if non-training data, the entry will be 0
Col 3: Actual text of the abstract (no authors, journal name, title, etc.)
Col 4: Actual text of the title
Col 5 - Col n: Names of the features which have been selected

e.g.

year, PMID, relevancy, abstract, title, f1, f2, f3, f4 ..... (fi = feature i)
1998, xxxx, 1, text1, text2, 3, 2, 0, 1, .......

The numbers are the number of times the specific feature occurs in the abstract.
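The row layout above can be sketched as follows. This is an illustrative Python fragment (the report's actual program was written in Perl), using the example values shown:

```python
import csv
import io

# Emit one row in the column layout described above: year, PMID,
# relevancy, abstract text, title, then one count per feature.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow([1998, "xxxx", 1, "text1", "text2", 3, 2, 0, 1])
print(buf.getvalue().strip())  # -> 1998,xxxx,1,text1,text2,3,2,0,1
```

The csv module only adds quotes around fields that need them, which matches the quoting visible in the example row in the RESULTS section.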

These features were the ones in the universe of all the items which had been tagged and which showed whether an abstract was relevant to chemical terrorism or not. The program was to be written in such a manner that it accounted for human mistakes in tagging. For example, if a feature was present twice in an abstract but had been tagged only once by the human tagger, that feature's count was still to show up as 2 in the abstract's vector. In Col 2, the determination as to whether a particular abstract was training data or test data was modeled probabilistically: approximately 80% of the abstracts were to be training data, and only 20% were to be used as test data.
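The two requirements just described, counting features from the raw text so that an occurrence the tagger missed is still counted, and the probabilistic 80/20 split, might be sketched as follows in Python (the actual program was written in Perl; all names here are illustrative):

```python
import random
import re

def feature_counts(abstract, features):
    # Strip any annotation tags, then count raw occurrences, so that
    # a feature the human tagger missed is still counted correctly.
    text = re.sub(r"</?[A-Za-z]+>", "", abstract).lower()
    return [text.count(f.lower()) for f in features]

# A feature tagged once but present twice still gets a count of 2:
abstract = "A <T>chemical attack</T> occurred; a second chemical attack followed."
print(feature_counts(abstract, ["chemical attack"]))  # -> [2]

# Each abstract is independently marked as training data with
# probability 0.8, matching the 80/20 split described above.
rng = random.Random()
is_training = rng.random() < 0.8
```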
APPROACH:

First of all, after the tagged file was prepared, the entire file was scanned for occurrences of the <T> and </T> tags. Anything in the text file enclosed in <T> and </T> was to be picked up and stored in a separate file, as these were elements of the universe and would give us a degree of relevancy of the abstract to chemical terrorism. There was the problem of nested tags: for example, inside a relevant phrase marked by <T> and </T> there could be the name of a chemical marked by <C> and </C> tags. Our purpose was therefore to remove these <C> and </C> tags from the file of relevant items, which was accomplished using simple UNIX filters such as sed and grep. After this, we had the entire universe in a text file. Then the entire text file was searched for the relevant items as per the columns, and the vectors were prepared in CSV format. An interesting observation made while modeling the probability was that the two least significant digits of the numbers produced by random number generator functions were not at all random, so using them would not give good results. What was needed instead was to use the two most significant digits of the numbers to model the probability.
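The universe-extraction step above can be sketched as follows. This is a Python illustration of what the report does with Perl and sed/grep; the sample line is invented:

```python
import re

line = ("detonation of a <T>radiological dispersal</T> device, plus a "
        "<T>chemical <C>sarin</C> attack</T> scenario")

# Pick up everything enclosed in <T>...</T> (the universe elements),
# then strip nested <C> and </C> tags, as the sed/grep filters do.
elements = re.findall(r"<T>(.*?)</T>", line)
universe = [re.sub(r"</?C>", "", e).strip() for e in elements]
print(universe)  # -> ['radiological dispersal', 'chemical sarin attack']
```

The non-greedy `(.*?)` is what keeps each match from running past the nearest closing tag when a line contains several tagged phrases.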

The tagged text file was processed abstract by abstract. The beginning of an abstract was taken to be a line which begins with a number followed by a period and a whitespace; the year of publication of the abstract was found on this line. Following the beginning of the abstract came the title, followed by the relevance tag, which had to be noted and stored appropriately. Then came the authors of the paper whose abstract was under consideration, then the address of the authors, and after that the text of the abstract. After the text came the PMID number, which was also to be stored; the PMID marked the end of the abstract. After reading in the PMID, the program was ready to prepare the vector for that particular abstract. After all the vectors have been prepared, any tags remaining in the vectors can be removed using simple filters.
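The boundary heuristics above can be sketched as a few regular expressions. This is an illustrative Python rendering (the original program was in Perl), exercised on the first line of the sample abstract shown in the RESULTS section:

```python
import re

START = re.compile(r"^\d+\.\s")            # number, period, whitespace
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")   # four-digit year on that line
PMID = re.compile(r"^PMID:\s*(\d+)")       # marks the end of the abstract

first_line = "2. Health Phys. 2010 Jun;98(6):903-5."
assert START.match(first_line)
print(YEAR.search(first_line).group(0))                # -> 2010
print(PMID.match("PMID: 20445403 [PubMed]").group(1))  # -> 20445403
```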

PROBLEMS ENCOUNTERED:

The technical difficulties faced in this problem were of a similar nature to those found in natural language processing, along with a few others. The biggest problem was that the tagging was done by humans and hence was error prone. Different people had different interpretations of the tagging process and tagged accordingly, and each person's interpretation was in a sense logical, so there was an urgent need for one standard for tagging. For example, in the initial version of the program, the text of the abstract was being read from the actual text file while everything else was being read from the tagged file; obviously, we would want the two file pointers to stay in step. But while tagging, some people inadvertently introduced a new line in the tagged file where there was none in the actual text file. The entire program therefore had to be remodelled so that everything was read from the tagged file, removing the problem of synchronising file pointers. It was also found that in 95% of the abstracts the text was followed by the PMID, but in 5% of cases it was followed by the PMCID and only then the PMID. The program had to be made robust enough to handle such cases; the main difficulty was finding all the different formats possible for a PubMed abstract. A similar problem was that in roughly 2% of cases the title was followed not by the authors but by the language of the abstract. The program was made to account for these cases as well.
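One way to make the end-of-abstract detection tolerant of an intervening PMCID line is simply to scan forward until the PMID line is found. A minimal Python sketch (the function name and sample lines are illustrative):

```python
import re

def find_pmid(lines):
    # Scan forward past any PMCID line until the PMID line appears,
    # covering the ~5% of abstracts where a PMCID precedes the PMID.
    for line in lines:
        m = re.match(r"^PMID:\s*(\d+)", line)
        if m:
            return m.group(1)
    return None

print(find_pmid(["PMCID: PMC123456", "PMID: 20445403 [PubMed]"]))  # -> 20445403
```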

Also, sometimes the tagging was done in such a way that an entire line had been enclosed in <T> and </T> tags. Such cases turn out to be lethal for the vectors, because these features would be too specific and would apply only to one particular abstract. These tags therefore had to be searched for in the tagged file and removed.
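Detecting and removing such whole-line tags can be sketched as follows (an illustrative Python fragment; the report's filters were UNIX tools and Perl):

```python
import re

def strip_full_line_tag(line):
    # If <T>...</T> wraps the entire line, drop the tags: the feature
    # would otherwise be specific to a single abstract.
    m = re.match(r"^\s*<T>(.*)</T>\s*$", line)
    return m.group(1).strip() if m else line

print(strip_full_line_tag("<T>an entire tagged line about terrorism</T>"))
# -> an entire tagged line about terrorism
```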

Then there was a problem of a special kind. In many cases, phrases such as "chemical, biological, nuclear, radiological weapons" turned up in the file. Tagging the entire phrase would lead to the problem stated above, that of making this element of the universe too abstract-specific, so tagging the entire phrase was out of the question. What we wanted was for such a phrase to be interpreted as "chemical weapons", because that is what makes it relevant to chemical terrorism. To solve this problem, <R> tags were introduced in the tagging process.

These were to be interpreted as follows. In our example, the phrase was to be tagged as:

<T>chemical <R>,biological,nuclear,radiological weapons </T>

To make this work, the tagged file was preprocessed to remove the <R> tags and form new lines. The preprocessing is done in such a manner that the processed tagged file would not contain the above phrase as it is, but would instead contain <T>chemical weapons </T>: everything preceding the <R> tag is kept, and the word immediately preceding the </T> tag is concatenated onto it. Everything outside the <T> and </T> tags is kept just as it is. Thus we get a new line without any <R> tags.
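The <R>-tag preprocessing just described can be sketched as follows (an illustrative Python rendering; the actual preprocessing was done in Perl):

```python
import re

def remove_r_tags(line):
    # Inside each <T>...</T> span, keep everything before the <R> tag
    # and append only the final word before </T>, as described above.
    def repl(m):
        head, tail = m.group(1), m.group(2)
        last_word = tail.split()[-1]
        return "<T>" + head.rstrip() + " " + last_word + " </T>"
    return re.sub(r"<T>(.*?)<R>(.*?)</T>", repl, line)

line = "a <T>chemical <R>,biological,nuclear,radiological weapons </T> threat"
print(remove_r_tags(line))
# -> a <T>chemical weapons </T> threat
```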

These <R> tags also became victims of human misinterpretation; therefore, a standard way of doing things is really necessary.

It is my strong belief that, at least for the tagged abstracts that were given to us in the first phase, we are producing correct vectors. Thorough checking has been done.

RESULTS:

This is what we have achieved:

The actual text file looked something like this:

2. Health Phys. 2010 Jun;98(6):903-5.

NIAID/NIH radiation/nuclear medical countermeasures product research and
development program.

Hafer N, Cassatt D, Dicarlo A, Ramakrishnan N, Kaminski J, Norman MK, Maidment B,
Hatchett R.

Division of Allergy, National Institute of Allergy and Infectious Diseases,
National Institutes of Health, Bethesda, MD 20892, USA.

One of the greatest national security threats to the United States is the
detonation of an improvised nuclear device or a radiological dispersal device in
a heavily populated area. The U.S. Government has addressed these threats with a
two-pronged strategy of preventing organizations from obtaining weapons of mass
destruction and preparing in case an event occurs. The National Institute of
Allergy and Infectious Diseases (NIAID) contributes to these preparedness efforts
by supporting basic research and development for chemical, biological,
radiological, and nuclear countermeasures for civilian use. The Radiation
Countermeasures Program at NIAID has established a broad research agenda focused
on the development of new medical products to mitigate and treat acute and
long-term radiation injury, promote the clearance of internalized radionuclides,
and facilitate accurate individual dose and exposure assessment. This paper
reviews the recent work and collaborations supported by the Radiation
Countermeasures Program.

PMID: 20445403 [PubMed - indexed for MEDLINE]


The tagged file looked something like this:

2. Health Phys. 2010 Jun;98(6):903-5.

NIAID/NIH radiation/nuclear medical countermeasures product research and
development program. <N>

Hafer N, Cassatt D, Dicarlo A, Ramakrishnan N, Kaminski J, Norman MK, Maidment B,
Hatchett R.

Division of Allergy, National Institute of Allergy and Infectious Diseases,
National Institutes of Health, Bethesda, MD 20892, USA.

One of the greatest national security threats to the United States is the
detonation of an improvised nuclear device or a <T> radiological dispersal </T> device in
a heavily populated area. The U.S. Government has addressed these threats with a
two-pronged strategy of preventing organizations from obtaining weapons of mass
destruction and preparing in case an event occurs. The National Institute of
Allergy and Infectious Diseases (NIAID) contributes to these preparedness efforts
by supporting basic research and development for chemical, biological,
radiological, and nuclear countermeasures for civilian use. The Radiation
Countermeasures Program at NIAID has established a broad research agenda focused
on the development of new medical products to mitigate and treat acute and
long-term radiation injury, promote the clearance of internalized radionuclides,
and facilitate accurate individual dose and exposure assessment. This paper
reviews the recent work and collaborations supported by the Radiation
Countermeasures Program.

PMID: 20445403 [PubMed - indexed for MEDLINE]

The vector that comes out of the program for this abstract looks something like this:

2010,20445403,-1,"One of the greatest national security threats to the United States is the detonation
of an improvised nuclear device or a radiological dispersal device in a heavily populated area. The
U.S. Government has addressed these threats with a two-pronged strategy of preventing
organizations from obtaining weapons of mass destruction and preparing in case an event occurs. The
National Institute of Allergy and Infectious Diseases (NIAID) contributes to these preparedness efforts
by supporting basic research and development for chemical, biological, radiological, and nuclear
countermeasures for civilian use. The Radiation Countermeasures Program at NIAID has established a
broad research agenda focused on the development of new medical products to mitigate and treat
acute and long-term radiation injury, promote the clearance of internalized radionuclides, and
facilitate accurate individual dose and exposure assessment. This paper reviews the recent work and
collaborations supported by the Radiation Countermeasures Program. "," NIAID/NIH
radiation/nuclear medical countermeasures product research and development program.",0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,

NOTE: 1 represents <Y>, -1 represents <N>, and 0 represents <Maybe>; 9999 represents test data, and 8888 represents abstracts which have not been tagged for relevance at all, i.e., those the tagger did not tag (a common occurrence in the case of textless, or empty, abstracts).

SUMMARY AND CONCLUSION:

This report provides insight into a way to process abstracts of this kind in a very easy manner. I feel that the future of this project depends heavily on human tagging, which we simply cannot expect to be error free. A stop-gap arrangement would be to write a clear, precise and rigorous specification of the tagging process; but in the author's view even this would be only a marginal improvement over the current process, as the average size of the file each person has to tag ranges from 4000 to 4500 lines, and errors would still be made. Thus the only way out is to somehow automate the task of tagging, or these difficulties will persist. However, the author recognises that such a solution would require natural language processing ideas that have not yet been developed; hence, this could be an interesting problem to pursue.

