[Figure 2: The Hidden Markov Model. Nodes represent document positions (S, W); edges carry the transition probabilities P1-P6: P1 to the adjacent position (S, W+1); P2 to (S, W+n) (n >= 2) and P3 to (S, W-n) (n >= 1) within the same sentence; P4 and P5 to positions (S+i, W+j) and (S-i, W+j) in nearby sentences (1 < i < CONST); and P6 to sentences with i >= CONST.]
from sentence number S1 and word number W1.

To decompose a summary sentence, we must consider how humans are likely to generate it; we draw here on the operations we noted in Section 2. There are two general heuristic rules we can safely assume: first, humans are more likely to cut phrases for a summary than to cut single, isolated words; second, humans are more likely to combine nearby sentences into a single sentence than to combine sentences that are far apart. These two rules are our guidance in the decomposition process.

We translate the heuristic rules into the above bigram probability PROB(Ii+1 = (S2, W2) | Ii = (S1, W1)), where Ii and Ii+1 represent two adjacent words in the input summary sentence, as noted earlier. The probability is abbreviated as PROB(Ii+1 | Ii) in the following discussion. It is assigned in the following manner:

IF ((S1 = S2) and (W1 = W2 - 1)) (i.e., the words are in two adjacent positions in the document), THEN PROB(Ii+1 | Ii) is assigned the maximal value P1. For example, PROB(subcommittee = (2, 41) | communications = (2, 40)) in the example of Figure 1 will be assigned the maximal value. (Rule: Two adjacent words in a summary are most likely to come from two adjacent words in the document.)

IF ((S1 = S2) and (W1 < W2 - 1)), THEN PROB(Ii+1 | Ii) is assigned the second highest value P2. For example, PROB(of = (4, 16) | subcommittee = (4, 1)) will be assigned a high probability. (Rule: Adjacent words in a summary are highly likely to come from the same sentence in the document, retaining their relative precedence relation, as in sentence reduction. This rule can be further refined by adding restrictions on the distance between words.)

IF ((S1 = S2) and (W1 > W2)), THEN PROB(Ii+1 | Ii) is assigned the third highest value P3. For example, PROB(of = (2, 30) | subcommittee = (2, 41)). (Rule: Adjacent words in a summary are likely to come from the same sentence in the document but reverse their relative order, as in the case of sentence reduction with syntactic transformations.)

IF (S2 - CONST < S1 < S2), THEN PROB(Ii+1 | Ii) is assigned the fourth highest value P4. For example, PROB(of = (3, 5) | subcommittee = (2, 41)). (Rule: Adjacent words in a summary can come from nearby sentences in the document and retain their relative order, as in sentence combination. CONST is a small constant such as 3 or 5.)

IF (S2 < S1 < S2 + CONST), THEN PROB(Ii+1 | Ii) is assigned the fifth highest value P5. For example, PROB(of = (1, 10) | subcommittee = (2, 41)). (Rule: Adjacent words in a summary can come from nearby sentences in the document but reverse their relative order.)

IF (|S2 - S1| >= CONST), THEN PROB(Ii+1 | Ii) is assigned a small value P6. For example, PROB(of = (23, 43) | subcommittee = (2, 41)). (Rule: Adjacent words in a summary are not very likely to come from sentences far apart.)

Based on the above principles, we create a Hidden Markov Model, as shown in Figure 2. The nodes in the figure represent possible positions in the document, and the edges output the probability of going from one node to another. This HMM is used to find the most likely position sequence in the next step. Assigning values to P1-P6 is experimental. In our experiment, the maximal value is assigned 1 and the others are assigned evenly decreasing values 0.9, 0.8, and so on. We determined the order of the above rules based on observations over a set of summaries. These values, however, can be adjusted or even trained for different corpora.

3.3 The Viterbi Algorithm

To find the most likely sequence, we must find a sequence of positions which maximizes the probability PROB(I1, ..., IN). Using the bigram model, it can be approximated as:

PROB(I1, ..., IN) = Prod_{i=1}^{N-1} PROB(Ii+1 | Ii)

PROB(Ii+1 | Ii) has been assigned in the HMM. Therefore, we have all the information needed to solve the problem. We use the Viterbi Algorithm to find the most likely sequence. For an N-word sequence, supposing each word has a document frequency of M, the Viterbi Algorithm is guaranteed to find the most likely sequence using k x N x M^2 steps, for some constant k, compared to M^N for the brute-force search algorithm.

The Viterbi Algorithm [Viterbi1967] finds the most likely sequence incrementally. It first finds the most likely sequence for (I1 I2), for each possible position of I2. This information is then used to compute the most likely sequence for (I1 I2 I3), for each possible position of I3. The process repeats until all the words in the sequence have been considered.

We slightly revised the Viterbi algorithm for our application. In the initialization step, an equal chance is assumed for each possible document position of the first word in the sequence. In the iteration step, we take special measures to handle the case when a summary word does not appear in the document (and thus has an empty position list). We mark the word as non-existent in the original document and continue the computation as if it had not appeared in the sequence.

... follows:

The input summary sentence (also shown in Figure 3):

Arthur B. Sackler, vice president for law and public policy of Time Warner Inc. and a member of the Direct Marketing Association, told the communications subcommittee of the Senate Commerce Committee that legislation to protect children's privacy online could destroy the spontaneous nature that makes the Internet unique.

We first index the document, listing for each word its possible positions in the document. Stemming can be performed before indexing, although it is not used in this example. Augmenting each word with its possible document positions, we therefore have the input for the Viterbi program, as shown below:

Input to the Viterbi Program (words and their possible document positions):

arthur : 1,0
b : 1,1
sackler : 1,2 2,34 ... 15,6
...
the : 0,21 0,26 ... 23,44
internet : 0,27 1,39 ... 18,16
unique : 0,28

For this 48-word sentence, there are a total of 5.08 x 10^27 possible position sequences. Using the HMM in Figure 2, we run the Viterbi program to find the most likely position sequence. The intermediate output of the Viterbi program is shown as follows:

Intermediate output of the Viterbi program (each possible position of each word is attached with a score shown in brackets; the score indicates the maximal probability of the subsequence which ends with that position. For example, 1,39[0.0016] for the word internet means that, over all position sequences which end at the word internet and assume internet takes the document position (1,39), the highest score we can achieve is 0.0016):

arthur : 1,0[1]
b : 1,1[1]
sackler : 1,2[1] 2,34[0.6] ... 12,2[0.5]
...
the : 0,21[0.0019] 0,26[0.0027] ... 23,44[0.0014]
internet : 0,27[0.0027] 1,39[0.0016] ... 18,16[0.0014]
unique : 0,28[0.0027]
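The position-bigram scoring and the revised Viterbi search described above can be sketched in Python. This is a minimal illustration, not the authors' program: the function names, the concrete P1-P6 values (1.0 down to 0.5), and the toy positions are all assumptions.

```python
# Illustrative sketch of the decomposition search: position-bigram
# scores P1-P6 plus a Viterbi search over each word's candidate
# positions. Names and concrete values are assumptions.

CONST = 3  # small sentence-distance constant, e.g. 3 or 5

# Maximal value 1, others evenly decreasing, as described in the text.
P1, P2, P3, P4, P5, P6 = 1.0, 0.9, 0.8, 0.7, 0.6, 0.5

def transition(prev, cur):
    """PROB(I_{i+1} = cur | I_i = prev) assigned by the rules P1-P6."""
    (s1, w1), (s2, w2) = prev, cur
    if s1 == s2 and w1 == w2 - 1:
        return P1  # adjacent positions in the same document sentence
    if s1 == s2 and w1 < w2 - 1:
        return P2  # same sentence, relative order retained
    if s1 == s2 and w1 > w2:
        return P3  # same sentence, relative order reversed
    if s2 - CONST < s1 < s2:
        return P4  # nearby sentences, order retained (combination)
    if s2 < s1 < s2 + CONST:
        return P5  # nearby sentences, order reversed
    return P6      # sentences far apart

def index_document(sentences):
    """List, for each word, its possible (sentence, word) positions."""
    index = {}
    for s, sent in enumerate(sentences):
        for w, word in enumerate(sent.lower().split()):
            index.setdefault(word, []).append((s, w))
    return index

def viterbi(position_lists):
    """Most likely position sequence; k*N*M^2 steps instead of M^N."""
    # Initialization: equal chance for every position of the first word.
    best = {pos: (1.0, [pos]) for pos in position_lists[0]}
    for positions in position_lists[1:]:
        if not positions:   # summary word absent from the document:
            continue        # skip it, as in the revised algorithm
        step = {}
        for cur in positions:
            score, path = max(
                (p * transition(prev, cur), path)
                for prev, (p, path) in best.items()
            )
            step[cur] = (score, path + [cur])
        best = step
    return max(best.values())[1]
```

For the words "communications subcommittee of" with hypothetical candidate positions, the search keeps (2, 40) -> (2, 41) via P1 and then scores each candidate position of "of" by the remaining rules.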
tion. After that, we differentiate the components in the sentence by combining words coming from adjacent document positions.

After the sentence components have been identified, the program does simple post-editing to cancel certain mismatchings. The Viterbi program assigns each word in the input sequence a position in the document, as long as the word appears in the document at least once. In this step, if any document sentence contributes only stop words to the summary, the matching is cancelled, since the stop words are more likely to have been inserted by humans than to have come from the original document. Similarly, we remove document sentences providing only a single non-stop word.

Figure 3 shows the final result. The components in the summary are tagged as (FNUM:SNUM actual-text), where FNUM is the sequential number of the component and SNUM is the number of the document sentence where the component comes from. SNUM = -1 means that the component does not come from the original document. The borrowed components are tagged as (FNUM actual-text) in the document sentences.

In this example, the program correctly identified the four document sentences from which the summary sentence was combined; it correctly divided the summary sentence into components and pinpointed the exact document origin of each component. The components that were borrowed from the document range from a single word to long clauses. Certain borrowed phrases were also syntactically transformed. Despite these, the program successfully decomposed the sentence.

4 Evaluation

We carried out two evaluation experiments: a detailed, small-scale experiment and a second, larger-scale experiment. In the first experiment, we used 10 documents from the Ziff-Davis corpus that were selected and presented to 14 human judges, together with their human-written summaries, in an experiment done by Marcu [Marcu1999]. The human judges were instructed to extract sentences from the original document which are semantically equivalent to the summary sentences. Sentences selected by the majority of human judges were collected to build an extract of the document, which serves as the gold standard in evaluation.

Our program can provide a set of relevant document sentences for each summary sentence, as shown in Figure 3. Taking the union of the selected sentences, we can build an extract for the document. We compared this extract with the gold standard extract based on the majority of human judgments. We achieved an average 81.5% precision, 78.5% recall, and 79.1% f-measure for the 10 documents. The average performance of the 14 human judges is 88.8% precision, 84.4% recall, and 85.7% f-measure. The detailed result for each document is shown in Table 1. Precision, Recall, and F-measure are
computed as follows:

Prec = (# of sentences in the extract and also in the gold standard) / (total # of sentences in the extract)

Recall = (# of sentences in the extract and also in the gold standard) / (total # of sentences in the gold standard)

F-measure = (2 x Recall x Prec) / (Recall + Prec)

Docno          Prec   Recall  F-measure
ZF109-601-903  0.67   0.67    0.67
ZF109-685-555  0.75   1       0.86
ZF109-631-813  1      1       1
ZF109-712-593  0.86   0.55    0.67
ZF109-645-951  1      1       1
ZF109-714-915  0.56   0.64    0.6
ZF109-662-269  0.79   0.79    0.79
ZF109-715-629  0.67   0.67    0.67
ZF109-666-869  0.86   0.55    0.67
ZF109-754-223  1      1       1
Average        0.815  0.785   0.791

Table 1: The evaluation result for 10 documents

Further analysis indicates that there are two types of errors by the program. The first is that the program missed finding semantically equivalent sentences which have very different wordings. For example, it failed to find the correspondence between the summary sentence "Running Higgins is much easier than installing it" and the document sentence "The program is very easy to use, although the installation procedure is somewhat complex." This is not really an error, since the program is not designed to find such paraphrases. For sentence decomposition, the program only needs to indicate that the summary sentence is not produced by cutting and pasting text from the original document. The program correctly indicated this by returning no matching sentence.

The second problem is that the program may identify a non-relevant document sentence as relevant if it shares some common words with the summary sentence. This typically occurs when a summary sentence is not constructed by cutting and pasting text from the document but does share some words with certain document sentences. Our post-editing steps are designed to cancel such false matchings, although we cannot remove them completely.

The program also demonstrates some advantages. One of them is that it can capture duplications in the gold standard extract. The extract based on human judgments is not perfect. If two paraphrases from the document are chosen by an equal number of human subjects, then both will be included in the extract. This is exactly what happened in the extract of document ZF109-601-903. The program, in contrast, picked up only one of the paraphrases. However, this correct decision is penalized in the evaluation due to the mistake in the gold standard.

The program won perfect scores for 3 out of the 10 documents. We checked the 3 summaries and found that their texts were largely produced by cut-and-paste, compared to other summaries which might have new sentences written by humans. This indicates that when only the decomposition task is considered, the algorithm performs very well and finishes the task successfully.

In the second experiment, we selected 50 summaries from the corpus and ran the decomposition program on the documents. A human subject then read the decomposition results to judge whether they are correct. The program's answer is considered correct when all 3 questions we posed in the decomposition problem are correctly answered.[2] There are a total of 305 sentences in the 50 summaries. 18 (6.2%) sentences were wrongly decomposed, so we achieve an accuracy of 93.8%. Most of the errors occur when a summary sentence is not constructed by cut-and-paste but has many overlapping words with a certain sentence in the document. The accuracy rate here is much higher than the precision and recall results in the first experiment. An important factor is that here we do not require the program to find the semantically equivalent document sentence(s) if a summary sentence uses very different wordings.

[2] However, if a sentence is not constructed by cut-and-paste, the program only needs to answer the first question.

5 Corpus study

Using the decomposition program, we analyzed 300 human-written summaries of news articles. We collected the summaries from a free news service. The news articles come from various sections of a number of newspapers and cover a broad range of topics. The 300 summaries contain 1,642 sentences in total, ranging from 2 sentences per summary to 21 sentences per summary. The results show that 315 (19%) sentences do not have matching sentences in the document, 686 (42%) sentences match a single sentence in the document, 592 (36%) sentences match 2 or 3 sentences in the document, and only 49 (3%) sentences match more than 3 sentences in the document. These results suggest that a significant portion (78%) of summary sentences produced by humans are based on cut-and-paste.

6 Applications and related work

We have used the decomposition results in our development of a text generation system for domain-independent
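The precision, recall, and F-measure formulas above are a direct set computation over extracted versus gold-standard sentences. A minimal sketch, where the integer sentence IDs are made up for illustration:

```python
def prf(extract, gold):
    """Precision, recall, and F-measure over sentence sets,
    following the formulas given in the text."""
    overlap = len(set(extract) & set(gold))
    prec = overlap / len(extract)   # fraction of extract also in gold
    recall = overlap / len(gold)    # fraction of gold recovered
    f = 2 * recall * prec / (recall + prec) if overlap else 0.0
    return prec, recall, f

# Hypothetical 3-sentence extract vs. a 3-sentence gold standard
# sharing 2 sentences: precision = recall = f = 2/3.
prec, recall, f = prf([1, 2, 5], [1, 2, 4])
```

As a sanity check against Table 1, Prec = 0.75 and Recall = 1 give an F-measure of 2 x 1 x 0.75 / 1.75 = 0.857, matching the 0.86 reported for ZF109-685-555.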
summarization. The generation system mimics certain operations by humans in the cutting and pasting process, as discussed in Section 2. Two main modules in our generation system are the sentence reduction module and the sentence combination module. Using the decomposition program, we were able to collect a corpus of summary sentences which were constructed by humans using reduction operations. This corpus of reduction-based, human-written summary sentences is then used to train as well as evaluate our automatic sentence reduction module. Similarly, we collected a corpus of combination-based summary sentences, which reveals to us interesting techniques that humans frequently use to paste fragments of the original document into a coherent and informative sentence.

The task of decomposition is somewhat related to the summary alignment problem addressed in [Marcu1999]. However, his alignment algorithm operates at the sentence or clause level, while our decomposition program aligns phrases of various granularity. Furthermore, the methodologies of the two systems, ours using a Hidden Markov Model and his using an IR-based approach coupled with a discourse model, are very different.

We reduced the decomposition problem to the problem of finding the most likely document position for each word in the summary, which is in some sense similar to the problem of aligning parallel bilingual corpora [Brown et al.1991, Gale and Church1991]. While they align sentences in a parallel bilingual corpus, we align phrases in a summary with phrases in a document. While they use sentence length as a feature, we use word position as a feature. The approach we used for determining the probabilities in the Hidden Markov Model is also totally different from theirs.

...tion program, we are investigating methods to automatically learn such cut-and-paste rules from large corpora for fast and reliable generation of higher-quality summaries.

Acknowledgment

We thank Daniel Marcu for sharing with us his evaluation dataset. This material is based upon work supported by the National Science Foundation under Grants No. IRI 96-19124 and IRI 96-18797 and by a grant from Columbia University's Strategic Initiative Fund. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References

L. Baum. 1972. An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process. Inequalities, 3:1-8.

Peter F. Brown, J. C. Lai, and R. L. Mercer. 1991. Aligning sentences in parallel corpora. In Proceedings of the 29th Annual Meeting of the ACL, pages 169-176, Berkeley, California, June. Association for Computational Linguistics.

B. Endres-Niggemeyer and E. Neugebauer. 1995. Professional summarising: No cognitive simulation without observation. In Proceedings of the International Conference in Cognitive Science, San Sebastian, May.