
Adapting the Neural Encoder-Decoder Framework from Single to

Multi-Document Summarization

Logan Lebanoff, Kaiqiang Song and Fei Liu


Department of Computer Science
University of Central Florida, Orlando, FL 32816, USA
{loganlebanoff, kqsong}@knights.ucf.edu feiliu@cs.ucf.edu

arXiv:1808.06218v2 [cs.CL] 28 Aug 2018

Abstract

Generating a text abstract from a set of documents remains a challenging task. The neural encoder-decoder framework has recently been exploited to summarize single documents, but its success can in part be attributed to the availability of large parallel data automatically acquired from the Web. In contrast, parallel data for multi-document summarization are scarce and costly to obtain. There is a pressing need to adapt an encoder-decoder model trained on single-document summarization data to work with multiple-document input. In this paper, we present an initial investigation into a novel adaptation method. It exploits the maximal marginal relevance method to select representative sentences from multi-document input, and leverages an abstractive encoder-decoder model to fuse disparate sentences into an abstractive summary. The adaptation method is robust and itself requires no training data. Our system compares favorably to state-of-the-art extractive and abstractive approaches judged by automatic metrics and human assessors.

DATASET                  | SOURCE                            | SUMMARY               | #PAIRS
Gigaword (Rush et al., 2015)        | the first sentence of a news article | 8.3 words, title-like | 4 Million
CNN/Daily Mail (Hermann et al., 2015) | a news article                     | 56 words, multi-sent  | 312 K
TAC (08-11) (Dang et al., 2008)     | 10 news articles related to a topic  | 100 words, multi-sent | 728
DUC (03-04) (Over and Yen, 2004)    | 10 news articles related to a topic  | 100 words, multi-sent | 320

Table 1: A comparison of datasets available for sentence summarization (Gigaword), single-doc (CNN/DM) and multi-doc summarization (DUC/TAC). The labelled data for multi-doc summarization are far scarcer.

1 Introduction

Neural abstractive summarization has primarily focused on summarizing short texts written by single authors. For example, sentence summarization seeks to reduce the first sentence of a news article to a title-like summary (Rush et al., 2015; Nallapati et al., 2016; Takase et al., 2016; Song et al., 2018); single-document summarization (SDS) focuses on condensing a news article to a handful of bullet points (Paulus et al., 2017; See et al., 2017). These summarization studies are empowered by large parallel datasets automatically harvested from online news outlets, including Gigaword (Rush et al., 2015), CNN/Daily Mail (Hermann et al., 2015), NYT (Sandhaus, 2008), and Newsroom (Grusky et al., 2018).

To date, multi-document summarization (MDS) has not yet fully benefited from the development of neural encoder-decoder models. MDS seeks to condense a set of documents, likely written by multiple authors, to a short and informative summary. It has practical applications, such as summarizing product reviews (Gerani et al., 2014), student responses to post-class questionnaires (Luo and Litman, 2015; Luo et al., 2016), and sets of news articles discussing certain topics (Hong et al., 2014). State-of-the-art MDS systems are mostly extractive (Nenkova and McKeown, 2011). Despite their promising results, such systems cannot perform text abstraction, e.g., paraphrasing, generalization, and sentence fusion (Jing and McKeown, 1999). Further, annotated MDS datasets are often scarce, containing only hundreds of training pairs (see Table 1). The cost to create ground-truth summaries from multiple-document inputs can be prohibitive. The MDS datasets are thus too small to be used to train neural encoder-decoder models with millions of parameters without overfitting.

A promising route to generating an abstractive summary from a multi-document input is to apply a neural encoder-decoder model trained for single-document summarization to a "mega-document" created by concatenating all documents in the set at test time. Nonetheless, such a model may not scale well for two reasons. First, identifying important text pieces from a mega-document can be challenging for the encoder-decoder model, which is trained on single-document summarization data
where the summary-worthy content is often contained in the first few sentences of an article. This is not the case for a mega-document. Second, redundant text pieces in a mega-document can be repeatedly used for summary generation under the current framework. The attention mechanism of an encoder-decoder model (Bahdanau et al., 2014) is position-based and lacks an awareness of semantics. If a text piece has been attended to during summary generation, it is unlikely to be used again. However, the attention value assigned to a similar text piece in a different position is not affected. The same content can thus be repeatedly used for summary generation. These issues may be alleviated by improving the encoder-decoder architecture and its attention mechanism (Cheng and Lapata, 2016; Tan et al., 2017). However, in these cases the model has to be re-trained on large-scale MDS datasets that are not available at the current stage. There is thus an increasing need for a lightweight adaptation of an encoder-decoder model trained on SDS datasets to work with multi-document inputs at test time.

In this paper, we present a novel adaptation method, named PG-MMR, to generate abstracts from multi-document inputs. The method is robust and requires no MDS training data. It combines a recent neural encoder-decoder model (PG for Pointer-Generator networks; See et al., 2017) that generates abstractive summaries from single-document inputs with a strong extractive summarization algorithm (MMR for Maximal Marginal Relevance; Carbonell and Goldstein, 1998) that identifies important source sentences from multi-document inputs. The PG-MMR algorithm iteratively performs the following. It identifies a handful of the most important sentences from the mega-document. The attention weights of the PG model are directly modified to focus on these important sentences when generating a summary sentence. Next, the system re-identifies a number of important sentences, but the likelihood of choosing certain sentences is reduced based on their similarity to the partially-generated summary, thereby reducing redundancy. Our research contributions include the following:

• we present an investigation into a novel adaptation method of the encoder-decoder framework from single- to multi-document summarization. To the best of our knowledge, this is the first attempt to couple the maximal marginal relevance algorithm with pointer-generator networks for multi-document summarization;

• we demonstrate the effectiveness of the proposed method through extensive experiments on standard MDS datasets. Our system compares favorably to state-of-the-art extractive and abstractive summarization systems measured by both automatic metrics and human judgments.

2 Related Work

Popular methods for multi-document summarization have been extractive. Important sentences are extracted from a set of source documents and optionally compressed to form a summary (Daume III and Marcu, 2002; Zajic et al., 2007; Gillick and Favre, 2009; Galanis and Androutsopoulos, 2010; Berg-Kirkpatrick et al., 2011; Li et al., 2013; Thadani and McKeown, 2013; Wang et al., 2013; Yogatama et al., 2015; Filippova et al., 2015; Durrett et al., 2016). In recent years neural networks have been exploited to learn word/sentence representations for single- and multi-document summarization (Cheng and Lapata, 2016; Cao et al., 2017; Isonuma et al., 2017; Yasunaga et al., 2017; Narayan et al., 2018). These approaches remain extractive; despite encouraging results, summarizing a large quantity of texts still requires sophisticated abstraction capabilities such as generalization, paraphrasing and sentence fusion.

Prior to deep learning, abstractive summarization had been investigated (Barzilay et al., 1999; Carenini and Cheung, 2008; Ganesan et al., 2010; Gerani et al., 2014; Fabbrizio et al., 2014; Pighin et al., 2014; Bing et al., 2015; Liu et al., 2015; Liao et al., 2018). These approaches construct domain templates using a text planner or an open-IE system and employ a natural language generator for surface realization. Limited by the availability of labelled data, experiments are often performed on small domain-specific datasets.

Neural abstractive summarization utilizing the encoder-decoder architecture has shown promising results, but studies focus primarily on single-document summarization (Nallapati et al., 2016; Kikuchi et al., 2016; Chen et al., 2016; Miao and Blunsom, 2016; Tan et al., 2017; Zeng et al., 2017; Zhou et al., 2017; Paulus et al., 2017; See et al., 2017; Gehrmann et al., 2018). The pointing mechanism (Gulcehre et al., 2016; Gu et al., 2016) allows a summarization system to both copy words from the source text and generate new words from the vocabulary. Reinforcement learning is exploited to directly optimize evaluation metrics (Paulus et al., 2017; Kryściński et al., 2018; Chen and Bansal, 2018). These studies focus on summarizing single documents in part because the training data are abundant.

The work of Baumel et al. (2018) and Zhang et al. (2018) is related to ours. In particular, Baumel et al. (2018) propose to extend an abstractive summarization system to generate query-focused summaries; Zhang et al. (2018) add a document set encoder to their hierarchical summarization framework. With these few exceptions, little research has been dedicated to investigating the feasibility of extending the encoder-decoder framework to generate abstractive summaries from multi-document inputs, where available training data are scarce.

This paper presents some first steps towards the goal of extending the encoder-decoder model to a multi-document setting. We introduce an adaptation method combining the pointer-generator (PG) networks (See et al., 2017) and the maximal marginal relevance (MMR) algorithm (Carbonell and Goldstein, 1998). The PG model, trained on SDS data and detailed in Section 3, is capable of generating document abstracts by performing text abstraction and sentence fusion. However, if the model is applied at test time to summarize multi-document inputs, there will be limitations. Our PG-MMR algorithm, presented in Section 4, teaches the PG model to effectively recognize important content from the input documents, hence improving the quality of abstractive summaries, all without requiring any training on multi-document inputs.

3 Limits of the Encoder-Decoder Model

The encoder-decoder architecture has become the de facto standard for neural abstractive summarization (Rush et al., 2015). The encoder is often a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) converting the input text to a set of hidden states {h^e_i}, one for each input word, indexed by i. The decoder is a unidirectional LSTM that generates a summary by predicting one word at a time. The decoder hidden states are represented by {h^d_t}, indexed by t. For sentence and single-document summarization (Nallapati et al., 2016; Paulus et al., 2017; See et al., 2017), the input text is treated as a sequence of words, and the model is expected to capture the source syntax inherently.

  e_{t,i} = v^T tanh(W_e [h^d_t || h^e_i || α̃_{t,i}] + b_e)    (1)
  α_{t,i} = softmax(e_{t,i})    (2)
  α̃_{t,i} = Σ_{t'=0}^{t-1} α_{t',i}    (3)

The attention weight α_{t,i} measures how important the i-th input word is to generating the t-th output word (Eq. (1-2)). Following See et al. (2017), α_{t,i} is calculated by measuring the strength of interaction between the decoder hidden state h^d_t, the encoder hidden state h^e_i, and the cumulative attention α̃_{t,i} (Eq. (3)). α̃_{t,i} denotes the cumulative attention that the i-th input word receives up to time step t-1. A large value of α̃_{t,i} indicates the i-th input word has been used prior to time t and it is unlikely to be used again for generating the t-th output word.

A context vector (c_t) is constructed (Eq. (4)) to summarize the semantic meaning of the input; it is a weighted sum of the encoder hidden states. The context vector and the decoder hidden state ([h^d_t || c_t]) are then used to compute the vocabulary probability P_vcb(w), measuring the likelihood of a vocabulary word w being selected as the t-th output word (Eq. (5)).[1]

  c_t = Σ_i α_{t,i} h^e_i    (4)
  P_vcb(w) = softmax(W_y [h^d_t || c_t] + b_y)    (5)

In many encoder-decoder models, a "switch" is estimated (p_gen ∈ [0,1]) to indicate whether the system has chosen to select a word from the vocabulary or to copy a word from the input text (Eq. (6)). The switch is computed using a feedforward layer with σ activation over [h^d_t || c_t || y_{t-1}], where y_{t-1} is the embedding of the output word at time t-1. The attention weights (α_{t,i}) are used to compute the copy probability (Eq. (7)). If a word w appears once or more in the input text, its copy probability (Σ_{i: w_i=w} α_{t,i}) is the sum of the attention weights over all its occurrences. The final probability P(w) is a weighted combination of the vocabulary probability and the copy probability. A cross-entropy loss function can often be used to train the model end-to-end.

  p_gen = σ(w_z [h^d_t || c_t || y_{t-1}] + b_z)    (6)
  P(w) = p_gen P_vcb(w) + (1 − p_gen) Σ_{i: w_i=w} α_{t,i}    (7)

To thoroughly understand the aforementioned encoder-decoder model, we divide its model parameters into four groups. They include

• parameters of the encoder and the decoder;
• {w_z, b_z} for calculating the "switch" (Eq. (6));

[1] Here [·||·] represents the concatenation of two vectors. The pointer-generator networks (See et al., 2017) use two linear layers to produce the vocabulary distribution P_vcb(w). We use W_y and b_y to denote parameters of both layers.
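To make the per-step computation concrete, the following is a minimal NumPy sketch of one decoding step combining coverage-informed attention, the context vector, the generation/copy switch, and the final mixed distribution (Eqs. (1)-(7)). It is an illustration under assumed parameter shapes, not the authors' implementation; the LSTM states are taken as given.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pg_step(h_e, h_d, cum_attn, params, src_words, vocab_size):
    """One decoding step of a pointer-generator model (cf. Eqs. (1)-(7)).

    h_e: (n, d) encoder hidden states; h_d: (d,) decoder hidden state;
    cum_attn: (n,) cumulative attention up to step t-1 (Eq. (3));
    src_words: (n,) vocabulary ids of the input words.
    """
    n, _ = h_e.shape
    # Eq. (1)-(2): attention over input positions, informed by coverage.
    feats = np.concatenate(
        [np.tile(h_d, (n, 1)), h_e, cum_attn[:, None]], axis=1)
    e = np.tanh(feats @ params["W_e"] + params["b_e"]) @ params["v"]
    alpha = softmax(e)
    # Eq. (4): context vector as attention-weighted sum of encoder states.
    c = alpha @ h_e
    # Eq. (5): vocabulary distribution from [h_d || c].
    p_vcb = softmax(np.concatenate([h_d, c]) @ params["W_y"] + params["b_y"])
    # Eq. (6): scalar generation/copy switch via a sigmoid layer.
    z = np.concatenate([h_d, c, params["y_prev"]]) @ params["w_z"] + params["b_z"]
    p_gen = 1.0 / (1.0 + np.exp(-z))
    # Eq. (7): copy probability sums attention over all occurrences of a word.
    p_copy = np.zeros(vocab_size)
    np.add.at(p_copy, src_words, alpha)
    return p_gen * p_vcb + (1 - p_gen) * p_copy, alpha
```

Because both P_vcb and the copy distribution sum to one, the returned mixture is itself a valid distribution over the extended vocabulary.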
[Figure 1: system diagram — a neural encoder reads the concatenated documents; each source sentence carries an MMR score that is updated after every generated summary sentence; the K highest-scored sentences guide the neural decoder while the rest are muted]

Figure 1: System framework. The PG-MMR system uses the K highest-scored source sentences (in this case, K=2) to guide the PG model to generate a summary sentence. All other source sentences are "muted" in this process. Best viewed in color.

• {W_y, b_y} for calculating P_vcb(w) (Eq. (5));
• {v, W_e, b_e} for attention weights (Eq. (1)).

By training the encoder-decoder model on single-document summarization (SDS) data containing a large collection of news articles paired with summaries (Hermann et al., 2015), these model parameters can be effectively learned.

However, at test time, we wish for the model to generate abstractive summaries from multi-document inputs. This brings up two issues. First, the parameters are ineffective at identifying salient content from multi-document inputs. Humans are very good at identifying representative sentences from a set of documents and fusing them into an abstract. However, this capability is not supported by the encoder-decoder model. Second, the attention mechanism is based on input word positions but not their semantics. It can lead to redundant content in the multi-document input being repeatedly used for summary generation. We conjecture that both aspects can be addressed by introducing an "external" model that selects representative sentences from multi-document inputs and dynamically adjusts the sentence importance to reduce summary redundancy. This external model is integrated with the encoder-decoder model to generate abstractive summaries using selected representative sentences. In the following section we present our adaptation method for multi-document summarization.

4 Our Method

Maximal marginal relevance. Our adaptation method incorporates the maximal marginal relevance algorithm (MMR; Carbonell and Goldstein, 1998) into pointer-generator networks (PG; See et al., 2017) by adjusting the network's attention values. MMR is one of the most successful extractive approaches and, despite its straightforwardness, performs on par with state-of-the-art systems (Luo and Litman, 2015; Yogatama et al., 2015). At each iteration, MMR selects one sentence from the document (D) and includes it in the summary (S) until a length threshold is reached. The selected sentence (s_i) is the most important one amongst the remaining sentences and it has the least content overlap with the current summary. In the equation below, Sim_1(s_i, D) measures the similarity of the sentence s_i to the document. It serves as a proxy of sentence importance, since important sentences usually show similarity to the centroid of the document. max_{s_j ∈ S} Sim_2(s_i, s_j) measures the maximum similarity of the sentence s_i to each of the summary sentences, acting as a proxy of redundancy. λ is a balancing factor.

  argmax_{s_i ∈ D\S} [ λ Sim_1(s_i, D) − (1 − λ) max_{s_j ∈ S} Sim_2(s_i, s_j) ]
                         (importance)        (redundancy)

Our PG-MMR describes an iterative framework for summarizing a multi-document input to a summary consisting of multiple sentences. At each iteration, PG-MMR follows the MMR principle to select the K highest-scored source sentences; they serve as the basis for PG to generate a summary sentence. After that, the scores of all source sentences are updated based on their importance and redundancy. Sentences that are highly similar to the partial summary receive lower scores. Selecting K sentences via the MMR algorithm helps the PG system to effectively identify salient source content that has not been included in the summary.

Muting. To allow the PG system to effectively utilize the K source sentences without retraining the neural model, we dynamically adjust the PG attention weights (α_{t,i}) at test time. Let S_k represent a selected sentence. The attention weights of the words belonging to {S_k}_{k=1}^K are calculated as before (Eq. (2)). However, words in other sentences are forced to receive zero attention weights (α_{t,i} = 0), and all α_{t,i} are renormalized (Eq. (8)).

  α^new_{t,i} = { α_{t,i}   if i ∈ {S_k}_{k=1}^K
               { 0          otherwise                    (8)

It means that the remaining sentences are "muted" in this process. In this variant, the sentence importance does not affect the original attention weights, other than muting.

In an alternative setting, the sentence salience is multiplied with the word salience and renormalized (Eq. (9)). PG uses the reweighted alpha values to predict the next summary word.

  α^new_{t,i} = { α_{t,i} · MMR(S_k)   if i ∈ {S_k}_{k=1}^K
               { 0                     otherwise             (9)

Algorithm 1 The PG-MMR algorithm for summarizing multi-document inputs.
Input: SDS data; MDS source sentences {S_i}
1:  Train the PG model on SDS data
2:  ▷ I(S_i) and R(S_i) are the importance and redundancy scores of the source sentence S_i
3:  I(S_i) ← SVR(S_i) for all source sentences
4:  MMR(S_i) ← λ I(S_i) for all source sentences
5:  Summary ← {}
6:  t ← index of summary words
7:  while t < L_max do
8:    Find {S_k}_{k=1}^K with highest MMR scores
9:    Compute α^new_{t,i} based on {S_k}_{k=1}^K (Eq. (8))
10:   Run PG decoder for one step to get {w_t}
11:   Summary ← Summary + {w_t}
12:   if w_t is the period symbol then
13:     R(S_i) ← Sim(S_i, Summary), ∀i
14:     MMR(S_i) ← λ I(S_i) − (1 − λ) R(S_i), ∀i
15:   end if
16: end while
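The iteration in Algorithm 1 can be sketched compactly in Python. This is a simplified stand-in, not the released system: the PG decoder with muted attention is abstracted as a `decode_step` callable, importance scores I(S_i) are assumed precomputed, and a word-overlap ratio substitutes for the ROUGE-L-based redundancy measure.

```python
def pgmmr_summarize(sentences, importance, decode_step, K=7, lam=0.6,
                    max_words=120):
    """One pass of the PG-MMR loop (cf. Algorithm 1).

    sentences: tokenized source sentences (lists of words);
    importance: precomputed I(S_i) scores in [0, 1];
    decode_step: callable(selected_ids) -> next summary word, standing in
    for one PG decoder step with non-selected sentences muted (Eq. (8)).
    """
    mmr = [lam * imp for imp in importance]          # initial MMR(S_i)
    summary = []
    while len(summary) < max_words:
        # pick the K highest-scored sentences to guide the decoder
        ranked = sorted(range(len(sentences)), key=lambda i: -mmr[i])
        selected = ranked[:K]
        w = decode_step(selected)
        summary.append(w)
        if w == ".":                                 # sentence boundary
            for i, sent in enumerate(sentences):
                # redundancy proxy: fraction of the sentence's words
                # already in the summary (the paper uses ROUGE-L precision)
                r = len(set(sent) & set(summary)) / len(sent)
                mmr[i] = lam * importance[i] - (1 - lam) * r
    return " ".join(summary)
```

After each generated sentence, sources overlapping the partial summary are pushed down the ranking, so the next decoding round is guided by fresh, non-redundant content.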
Sentence Importance. To estimate sentence importance Sim_1(s_i, D), we introduce a supervised regression model in this work. Importantly, the model is trained on single-document summarization datasets where training data are abundant. At test time, the model can be applied to identify important sentences from multi-document input. Our model determines sentence importance based on four indicators, inspired by how humans identify important sentences from a document set. They include (a) sentence length, (b) its absolute and relative position in the document, (c) sentence quality, and (d) how close the sentence is to the main topic of the document set. These features are considered to be important indicators in previous extractive summarization frameworks (Galanis and Androutsopoulos, 2010; Hong et al., 2014).

Regarding the sentence quality (c), we leverage the PG model to build the sentence representation. We use the bidirectional LSTM encoder to encode any source sentence to a vector representation. [→h^e_N || ←h^e_1] is the concatenation of the last hidden states of the forward and backward passes. A document vector is the average of all sentence vectors. We use the document vector and the cosine similarity between the document and sentence vectors as indicator (d). A support vector regression model is trained on (sentence, score) pairs where the training data are obtained from the CNN/Daily Mail dataset. The target importance score is the ROUGE-L recall of the sentence compared to the ground-truth summary. Because our model architecture leverages neural representations of sentences and documents, it is data-driven and not restricted to a particular domain.

Sentence Redundancy. To calculate the redundancy of the sentence (max_{s_j ∈ S} Sim_2(s_i, s_j)), we compute the ROUGE-L precision, which measures the longest common subsequence between a source sentence and the partial summary (consisting of all sentences generated thus far by the PG model), divided by the length of the source sentence. A source sentence yielding a high ROUGE-L precision is deemed to have significant content overlap with the partial summary. It will receive a low MMR score and hence is less likely to serve as the basis for generating future summary sentences.

Alg. 1 provides an overview of the PG-MMR algorithm and Fig. 1 is a graphical illustration. The MMR scores of source sentences are updated after each summary sentence is generated by the PG model. Next, a different set of highest-scored sentences is used to guide the PG model to generate the next summary sentence. "Muting" the remaining source sentences is important because it helps the PG model to focus its attention on the most significant source content. The code for our model is publicly available to further MDS research.[2]

[2] https://github.com/ucfnlp/multidoc_summarization

5 Experimental Setup

Datasets. We investigate the effectiveness of the PG-MMR method by testing it on standard multi-document summarization datasets (Over and Yen,
2004; Dang and Owczarzak, 2008). These include DUC-03, DUC-04, TAC-08, TAC-10, and TAC-11, containing 30/50/48/46/44 topics respectively. The summarization system is tasked with generating a concise, fluent summary of 100 words or less from a set of 10 documents discussing a topic. All documents in a set are chronologically ordered and concatenated to form a mega-document serving as input to the PG-MMR system. Sentences that start with a quotation mark or do not end with a period are excluded (Wong et al., 2008). Each system summary is compared against 4 human abstracts created by NIST assessors. Following convention, we report results on the DUC-04 and TAC-11 datasets, which are standard test sets; DUC-03 and TAC-08/10 are used as a validation set for hyperparameter tuning.[3]

The PG model is trained for single-document summarization using the CNN/Daily Mail (Hermann et al., 2015) dataset, containing single news articles paired with summaries (human-written article highlights). The training set contains 287,226 articles. An article contains 781 tokens on average; and a summary contains 56 tokens (3.75 sentences). During training we use the hyperparameters provided by See et al. (2017). At test time, the maximum/minimum decoding steps are set to 120/100 words respectively, corresponding to the max/min lengths of the PG-MMR summaries. Because the focus of this work is on multi-document summarization (MDS), we do not report results for the CNN/Daily Mail dataset.

Baselines. We compare PG-MMR against a broad spectrum of baselines, including state-of-the-art extractive ('ext-') and abstractive ('abs-') systems. They are described below.[4]

• ext-SumBasic (Vanderwende et al., 2007) is an extractive approach assuming words occurring frequently in a document set are more likely to be included in the summary;
• ext-KL-Sum (Haghighi and Vanderwende, 2009) greedily adds source sentences to the summary if doing so leads to a decrease in KL divergence;
• ext-LexRank (Erkan and Radev, 2004) uses a graph-based approach to compute sentence importance based on eigenvector centrality in a graph representation;
• ext-Centroid (Hong et al., 2014) computes the importance of each source sentence based on its cosine similarity with the document centroid;
• ext-ICSISumm (Gillick et al., 2009) leverages the ILP framework to identify a globally-optimal set of sentences covering the most important concepts in the document set;
• ext-DPP (Taskar, 2012) selects an optimal set of sentences per the determinantal point processes that balance the coverage of important information and the sentence diversity;
• abs-Opinosis (Ganesan et al., 2010) generates abstractive summaries by searching for salient paths on a word co-occurrence graph created from source documents;
• abs-Extract+Rewrite (Song et al., 2018) is a recent approach that scores sentences using LexRank and generates a title-like summary for each sentence using an encoder-decoder model trained on Gigaword data;
• abs-PG-Original (See et al., 2017) introduces an encoder-decoder model that encourages the system to copy words from the source text via pointing, while retaining the ability to produce novel words through the generator.

[3] The hyperparameters for all PG-MMR variants are K=7 and λ=0.6; except for "w/ BestSummRec" where K=2.
[4] We are grateful to Hong et al. (2014) for providing the summaries generated by the Centroid, ICSISumm, and DPP systems. These are only available for the DUC-04 dataset.

System                                | R-1   | R-2   | R-SU4
SumBasic (Vanderwende et al., 2007)   | 29.48 | 4.25  | 8.64
KLSumm (Haghighi et al., 2009)        | 31.04 | 6.03  | 10.23
LexRank (Erkan and Radev, 2004)       | 34.44 | 7.11  | 11.19
Centroid (Hong et al., 2014)          | 35.49 | 7.80  | 12.02
ICSISumm (Gillick and Favre, 2009)    | 37.31 | 9.36  | 13.12
DPP (Taskar, 2012)                    | 38.78 | 9.47  | 13.36
Extract+Rewrite (Song et al., 2018)   | 28.90 | 5.33  | 8.76
Opinosis (Ganesan et al., 2010)       | 27.07 | 5.03  | 8.63
PG-Original (See et al., 2017)        | 31.43 | 6.03  | 10.01
PG-MMR w/ SummRec                     | 34.57 | 7.46  | 11.36
PG-MMR w/ SentAttn                    | 36.52 | 8.52  | 12.57
PG-MMR w/ Cosine (default)            | 36.88 | 8.73  | 12.64
PG-MMR w/ BestSummRec                 | 36.42 | 9.36  | 13.23

Table 2: ROUGE results on the DUC-04 dataset.

System                                | R-1   | R-2   | R-SU4
SumBasic (Vanderwende et al., 2007)   | 31.58 | 6.06  | 10.06
KLSumm (Haghighi et al., 2009)        | 31.23 | 7.07  | 10.56
LexRank (Erkan and Radev, 2004)       | 33.10 | 7.50  | 11.13
Extract+Rewrite (Song et al., 2018)   | 29.07 | 6.11  | 9.20
Opinosis (Ganesan et al., 2010)       | 25.15 | 5.12  | 8.12
PG-Original (See et al., 2017)        | 31.44 | 6.40  | 10.20
PG-MMR w/ SummRec                     | 35.06 | 8.72  | 12.39
PG-MMR w/ SentAttn                    | 37.01 | 10.43 | 13.85
PG-MMR w/ Cosine (default)            | 37.17 | 10.92 | 14.04
PG-MMR w/ BestSummRec                 | 40.44 | 14.93 | 17.61

Table 3: ROUGE results on the TAC-11 dataset.

6 Results

Having described the experimental setup, we next compare the PG-MMR method against the baselines on standard MDS datasets, evaluated by both automatic metrics and human assessors.

ROUGE (Lin, 2004). This automatic metric measures the overlap of unigrams (R-1), bigrams (R-2) and skip bigrams with a maximum distance of 4 words (R-SU4) between the system summary and a set of reference summaries. ROUGE scores of various systems are presented in Tables 2 and 3 respectively for the DUC-04 and TAC-11 datasets.

We explore variants of the PG-MMR method.
They differ in how the importances of source sentences are estimated and how the sentence importance affects word attention weights. "w/ Cosine" computes the sentence importance as the cosine similarity score between the sentence and document vectors, both represented as sparse TF-IDF vectors under the vector space model. "w/ SummRec" estimates the sentence importance as the predicted R-L recall score between the sentence and the summary. A support vector regression model is trained on sentences from the CNN/Daily Mail datasets (≈33K) and applied to DUC/TAC sentences at test time (see §4). "w/ BestSummRec" obtains the best estimate of sentence importance by calculating the R-L recall score between the sentence and reference summaries. It serves as an upper bound for the performance of "w/ SummRec." For all variants, the sentence importance scores are normalized to the range of [0,1]. "w/ SentAttn" adjusts the attention weights using Eq. (9), so that words in important sentences are more likely to be used to generate the summary. The weights are otherwise computed using Eq. (8).

As seen in Tables 2 and 3, our PG-MMR method surpasses all unsupervised extractive baselines, including SumBasic, KLSumm, and LexRank. On the DUC-04 dataset, ICSISumm and DPP show good performance, but these systems are trained directly on MDS datasets, which are not utilized by the PG-MMR method. PG-MMR exhibits superior performance compared to existing abstractive systems. It outperforms Opinosis and PG-Original by a large margin in terms of R-2 F-scores (5.03/6.03/8.73 for DUC-04 and 5.12/6.40/10.92 for TAC-11). In particular, PG-Original is the original pointer-generator network applied to multi-document inputs at test time. Compared to it, PG-MMR is more effective at identifying summary-worthy content from the input. "w/ Cosine" is used as the default PG-MMR and it shows better results than "w/ SummRec." This suggests that the sentence and document representations obtained from the encoder-decoder model (trained on CNN/DM) are suboptimal, possibly due to a vocabulary mismatch, where certain words in the DUC/TAC datasets do not appear in CNN/DM and their embeddings are thus not learned during training. Finally, we observe that "w/ BestSummRec" yields the highest performance on both datasets. This finding suggests that there is great potential for improvement of the PG-MMR method, as its "extractive" and "abstractive" components can be separately optimized.

[Figure 2: bar chart — median location (lower quartile, upper quartile) of the first occurrences of each summary sentence's n-grams in the input; PG-Original: (1, 2), (2, 7), (4, 48), (15, 81), (13, 71) and PG-MMR: (2, 44), (6, 53), (10, 59), (12, 77), (12, 62) for the 1st-5th summary sentences; x-axis: location in the multi-document input]

Figure 2: The median location of summary n-grams in the multi-document input (and the lower/higher quartiles). The n-grams come from the 1st/2nd/3rd/4th/5th summary sentence and the location is the source sentence index. (TAC-11)

System       | 1-grams | 2-grams | 3-grams | Sent
Extr+Rewrite | 89.37   | 54.34   | 25.10   | 6.65
PG-Original  | 99.64   | 96.28   | 88.83   | 47.67
PG-MMR       | 99.74   | 97.64   | 91.57   | 59.13
Human Abst.  | 84.32   | 45.22   | 18.70   | 0.23

Table 4: Percentages of summary n-grams (or entire sentences) that appear in the multi-document input. (TAC-11)

Location of summary content. We are interested in understanding why PG-MMR outperforms PG-Original at identifying summary content from the multi-document input. We ask the question: where, in the source documents, does each system tend to look when generating its summaries? Our findings indicate that PG-Original gravitates towards early source sentences, while PG-MMR searches beyond the first few sentences.

In Figure 2 we show the median location of the first occurrences of summary n-grams, where the n-grams can come from the 1st to 5th summary sentence. For PG-Original summaries, n-grams of the 1st summary sentence frequently come from the 1st and 2nd source sentences, corresponding to the lower/higher quartiles of source sentence indices. Similarly, n-grams of the 2nd summary sentence come from the 2nd to 7th source sentences. For PG-MMR summaries, the patterns are different. The n-grams of the 1st and 2nd summary sentences come from source sentences in the ranges (2, 44) and (6, 53), respectively. Our findings suggest that PG-Original tends to treat the input as a single document and identifies summary-worthy content from the beginning of the input, whereas PG-MMR can successfully search a broader range of the input for summary content. This capability is crucial for multi-document input where important content can come from any article in the set.

Degree of extractiveness. Table 4 shows the
                   Linguistic Quality           Rankings (%)
System            Fluency  Inform.  NonRed.    1st   2nd   3rd   4th
Extract+Rewrite    2.03     2.19     1.88      5.6  11.6  11.6  71.2
LexRank            3.29     3.36     3.30     30.0  28.8  32.0   9.2
PG-Original        3.20     3.30     3.19     29.6  26.8  32.8  10.8
PG-MMR             3.24     3.52     3.42     34.8  32.8  23.6   8.8

Table 5: Linguistic quality and rankings of system summaries. (DUC-04)
Human Abstract
• Boeing 737-400 plane with 102 people on board crashed into a mountain in the West Sulawesi province of Indonesia, on Monday, January 01, 2007, killing at least 90 passengers, with 12 possible survivors.
• The plane was Adam Air flight KI-574, departing at 12:59 pm from Surabaya on Java bound for Manado in northeast Sulawesi.
• The plane crashed in a mountainous region in Polewali, west Sulawesi province.
• There were three Americans on board, it is not known if they survived.
• The cause of the crash is not known at this time but it is possible bad weather was a factor.

PG-Original Summary
• Adam Air Boeing 737-400 crashed Monday after vanishing off air traffic control radar screens between the Indonesian islands of Java and Sulawesi.
• Up to 12 people were thought to have survived, with rescue teams racing to the crash site near Polewali in West Sulawesi, some 180 kilometres north of the South Sulawesi provincial capital Makassar.
• It was the worst air disaster since Sept. 5, 2005, when a Mandala Airline's Boeing 737-200 crashed shortly after taking off from the North Sumatra's airport, killing 103 people.
• Earlier on Friday, a ferry carrying 628 people sank off the Java coast.

Extract+Rewrite Summary
• Plane with 102 people on board crashes.
• Three Americans among 102 on board plane in Indonesia.
• Rescue team arrives in Indonesia after plane crash.
• Plane with 102 crashes in West Sulawesi, killing at least 90.
• No word on the fate of Boeing 737-400.
• Plane carrying 96 passengers loses contact with Makassar.
• Plane crashes in Indonesia, killing at least 90.
• Indonesian navy sends two planes to carry bodies of five.
• Indonesian plane carrying 102 missing.
• Indonesian lawmaker criticises slow deployment of plane.
• Hundreds of kilometers plane crash.

PG-MMR Summary
• The Adam Air Boeing 737-400 crashed Monday afternoon, but search and rescue teams only discovered the wreckage early Tuesday.
• The Indonesian rescue team arrived at the mountainous area in West Sulawesi province where a passenger plane with 102 people onboard crashed into a mountain in Polewali, West Sulawesi province.
• Air force rear commander Eddy Suyanto told Shinta radio station that the plane – operated by local carrier Adam Air – had crashed in a mountainous region in Polewali province on Monday.
• There was no word on the fate of the remaining 12 people on board the boeing 737-400.

Table 6: Example system summaries and human-written abstract. The sentences are manually de-tokenized for readability.
percentages of summary n-grams (or entire sentences) appearing in the multi-document input. PG-Original and PG-MMR summaries both show a high degree of extractiveness, and similar findings have been revealed by See et al. (2017). Because PG-MMR relies on a handful of representative source sentences and mutes the rest, it appears to be marginally more extractive than PG-Original. Both systems encourage generating summary sentences by stitching together source sentences: about 52% and 41% of the summary sentences do not appear in the source, but about 90% of the n-grams do. The Extract+Rewrite summaries (§5), generated by rewriting selected source sentences into title-like summary sentences, exhibit a high degree of abstraction, close to that of human abstracts.

Linguistic quality. To assess the linguistic quality of various system summaries, we employ Amazon Mechanical Turk human evaluators to judge the summary quality, including PG-MMR, LexRank, PG-Original, and Extract+Rewrite. A turker is asked to rate each system summary on a scale of 1 (worst) to 5 (best) based on three evaluation criteria: informativeness (to what extent is the meaning expressed in the ground-truth text preserved in the summary?), fluency (is the summary grammatical and well-formed?), and non-redundancy (does the summary successfully avoid repeating information?). Human summaries are used as the ground truth. The turkers are also asked to provide an overall ranking for the four system summaries. Results are presented in Table 5. We observe that the LexRank summaries are highest-rated on fluency. This is because LexRank is an extractive approach, where summary sentences are directly taken from the input. PG-MMR is rated as the best on both informativeness and non-redundancy. Regarding overall system rankings, PG-MMR summaries are frequently ranked as the 1st- and 2nd-best summaries, outperforming the others.

Example summaries. In Table 6 we present example summaries generated by various systems. PG-Original cannot effectively identify important content from the multi-document input. Extract+Rewrite tends to generate short, title-like sentences that are less informative and carry substantial redundancy. This is because the system is trained on the Gigaword dataset (Rush et al., 2015), where the target summary length is 7 words. PG-MMR generates summaries that effectively condense the important source content.
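The degree-of-extractiveness statistics discussed above (the share of summary n-grams found in the input, and the share of summary sentences copied verbatim) can be reproduced with a short script. The sketch below is illustrative only: the regex word tokenizer, the bigram order, and the example sentences are our assumptions, not necessarily the exact protocol used for Table 4.

```python
import re
from typing import List, Set, Tuple

def tokenize(text: str) -> List[str]:
    # Lowercase word tokenizer (illustrative; the paper's tokenization may differ).
    return re.findall(r"[a-z0-9]+", text.lower())

def ngram_set(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    # All n-grams of a token sequence, as a set for fast containment tests.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def extractiveness(summary_sents: List[str], source: str, n: int = 2):
    """Return (fraction of summary n-grams appearing in the source,
    fraction of summary sentences copied verbatim from the source)."""
    src = tokenize(source)
    src_ngrams = ngram_set(src, n)
    copied = total = verbatim = 0
    for sent in summary_sents:
        toks = tokenize(sent)
        grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
        total += len(grams)
        copied += sum(g in src_ngrams for g in grams)
        # A sentence counts as extractive if its token sequence
        # occurs contiguously in the source.
        if any(src[i:i + len(toks)] == toks
               for i in range(len(src) - len(toks) + 1)):
            verbatim += 1
    return copied / max(total, 1), verbatim / max(len(summary_sents), 1)

# Toy example (hypothetical sentences, not from the evaluation data).
source_doc = "The plane crashed into a mountain in West Sulawesi on Monday."
summary = ["The plane crashed into a mountain.",
           "Rescue teams arrived on Tuesday."]
gram_frac, sent_frac = extractiveness(summary, source_doc)
```

In this toy example, the first summary sentence is fully copied while the second is novel, so the sentence-level fraction is 0.5 and the bigram overlap is 5/9.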
7 Conclusion

We describe a novel adaptation method to generate abstractive summaries from multi-document inputs. Our method combines an extractive summarization algorithm (MMR) for sentence extraction and a recent abstractive model (PG) for fusing source sentences. The PG-MMR system demonstrates competitive results, outperforming strong extractive and abstractive baselines.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Regina Barzilay, Kathleen R. McKeown, and Michael Elhadad. 1999. Information fusion in the context of multi-document summarization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Tal Baumel, Matan Eyal, and Michael Elhadad. 2018. Query focused abstractive summarization: Incorporating query relevance, multi-document coverage, and summary length constraints into seq2seq models. arXiv preprint arXiv:1801.07704.

Taylor Berg-Kirkpatrick, Dan Gillick, and Dan Klein. 2011. Jointly learning to extract and compress. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Lidong Bing, Piji Li, Yi Liao, Wai Lam, Weiwei Guo, and Rebecca J. Passonneau. 2015. Abstractive multi-document summarization via phrase selection and merging. In Proceedings of ACL.

Ziqiang Cao, Wenjie Li, Sujian Li, and Furu Wei. 2017. Improving multi-document summarization via text classification. In Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI).

Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR).

Giuseppe Carenini and Jackie Chi Kit Cheung. 2008. Extractive vs. NLG-based abstractive summarization of evaluative text: The effect of corpus controversiality. In Proceedings of the Fifth International Natural Language Generation Conference (INLG).

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, and Hui Jiang. 2016. Distraction-based neural networks for document summarization. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI).

Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of ACL.

Hoa Trang Dang and Karolina Owczarzak. 2008. Overview of the TAC 2008 update summarization task. In Proceedings of Text Analysis Conference (TAC).

Hal Daume III and Daniel Marcu. 2002. A noisy-channel model for document compression. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Greg Durrett, Taylor Berg-Kirkpatrick, and Dan Klein. 2016. Learning-based single-document summarization with compression and anaphoricity constraints. In Proceedings of the Association for Computational Linguistics (ACL).

Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research.

Giuseppe Di Fabbrizio, Amanda J. Stent, and Robert Gaizauskas. 2014. A hybrid approach to multi-document summarization of opinions in reviews. In Proceedings of the 8th International Natural Language Generation Conference (INLG).

Katja Filippova, Enrique Alfonseca, Carlos Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with LSTMs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Dimitrios Galanis and Ion Androutsopoulos. 2010. An extractive supervised two-stage method for sentence compression. In Proceedings of NAACL-HLT.

Kavita Ganesan, ChengXiang Zhai, and Jiawei Han. 2010. Opinosis: A graph-based approach to abstractive summarization of highly redundant opinions. In Proceedings of the International Conference on Computational Linguistics (COLING).

Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018. Bottom-up abstractive summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Shima Gerani, Yashar Mehdad, Giuseppe Carenini, Raymond T. Ng, and Bita Nejat. 2014. Abstractive summarization of product reviews using discourse structure. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Dan Gillick and Benoit Favre. 2009. A scalable global model for summarization. In Proceedings of the NAACL Workshop on Integer Linear Programming for Natural Language Processing.

Dan Gillick, Benoit Favre, Dilek Hakkani-Tur, Berndt Bohnet, Yang Liu, and Shasha Xie. 2009. The ICSI/UTD summarization system at TAC 2009. In Proceedings of TAC.

Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of ACL.

Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL).

Aria Haghighi and Lucy Vanderwende. 2009. Exploring content models for multi-document summarization. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of Neural Information Processing Systems (NIPS).

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Kai Hong, John M. Conroy, Benoit Favre, Alex Kulesza, Hui Lin, and Ani Nenkova. 2014. A repository of state of the art and competitive baseline summaries for generic news summarization. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC).

Masaru Isonuma, Toru Fujino, Junichiro Mori, Yutaka Matsuo, and Ichiro Sakata. 2017. Extractive summarization using multi-task learning with document classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Hongyan Jing and Kathleen McKeown. 1999. The decomposition of human-written summary sentences. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR).

Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. 2016. Controlling output length in neural encoder-decoders. In Proceedings of EMNLP.

Wojciech Kryściński, Romain Paulus, Caiming Xiong, and Richard Socher. 2018. Improving abstraction in text summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Chen Li, Fei Liu, Fuliang Weng, and Yang Liu. 2013. Document summarization via guided sentence compression. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Kexin Liao, Logan Lebanoff, and Fei Liu. 2018. Abstract meaning representation for multi-document summarization. In Proceedings of the International Conference on Computational Linguistics (COLING).

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of ACL Workshop on Text Summarization Branches Out.

Fei Liu, Jeffrey Flanigan, Sam Thomson, Norman Sadeh, and Noah A. Smith. 2015. Toward abstractive summarization using semantic representations. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL).

Wencan Luo and Diane Litman. 2015. Summarizing student responses to reflection prompts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Wencan Luo, Fei Liu, Zitao Liu, and Diane Litman. 2016. Automatic summarization of student course feedback. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL).

Yishu Miao and Phil Blunsom. 2016. Language as a latent variable: Discrete generative models for sentence compression. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL).

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

Ani Nenkova and Kathleen McKeown. 2011. Automatic summarization. Foundations and Trends in Information Retrieval.

Paul Over and James Yen. 2004. An introduction to DUC-2004. National Institute of Standards and Technology.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Daniele Pighin, Marco Cornolti, Enrique Alfonseca, and Katja Filippova. 2014. Modelling events through memory-based, open-IE patterns for abstractive summarization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for sentence summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Evan Sandhaus. 2008. The New York Times annotated corpus. Linguistic Data Consortium.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Kaiqiang Song, Lin Zhao, and Fei Liu. 2018. Structure-infused copy mechanisms for abstractive summarization. In Proceedings of the International Conference on Computational Linguistics (COLING).

Sho Takase, Jun Suzuki, Naoaki Okazaki, Tsutomu Hirao, and Masaaki Nagata. 2016. Neural headline generation on abstract meaning representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. 2017. Abstractive document summarization with a graph-based attentional neural model. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

Alex Kulesza and Ben Taskar. 2012. Determinantal Point Processes for Machine Learning. Now Publishers Inc.

Kapil Thadani and Kathleen McKeown. 2013. Sentence compression with joint structural inference. In Proceedings of CoNLL.

Lucy Vanderwende, Hisami Suzuki, Chris Brockett, and Ani Nenkova. 2007. Beyond SumBasic: Task-focused summarization with sentence simplification and lexical expansion. Information Processing and Management, 43(6):1606–1618.

Lu Wang, Hema Raghavan, Vittorio Castelli, Radu Florian, and Claire Cardie. 2013. A sentence compression based framework to query-focused multi-document summarization. In Proceedings of ACL.

Kam-Fai Wong, Mingli Wu, and Wenjie Li. 2008. Extractive summarization using supervised and semi-supervised learning. In Proceedings of the International Conference on Computational Linguistics (COLING).

Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu, Ayush Pareek, Krishnan Srinivasan, and Dragomir Radev. 2017. Graph-based neural multi-document summarization. In Proceedings of the Conference on Computational Natural Language Learning (CoNLL).

Dani Yogatama, Fei Liu, and Noah A. Smith. 2015. Extractive summarization by maximizing semantic volume. In Proceedings of the Conference on Empirical Methods on Natural Language Processing (EMNLP).

David Zajic, Bonnie J. Dorr, Jimmy Lin, and Richard Schwartz. 2007. Multi-candidate reduction: Sentence compression as a tool for document summarization tasks. Information Processing and Management.

Wenyuan Zeng, Wenjie Luo, Sanja Fidler, and Raquel Urtasun. 2017. Efficient summarization with read-again and copy mechanism. In Proceedings of the International Conference on Learning Representations (ICLR).

Jianmin Zhang, Jiwei Tan, and Xiaojun Wan. 2018. Towards a neural network approach to abstractive multi-document summarization. arXiv preprint arXiv:1804.09010.

Qingyu Zhou, Nan Yang, Furu Wei, and Ming Zhou. 2017. Selective encoding for abstractive sentence summarization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).