Figure 1: System framework. The PG-MMR system uses K highest-scored source sentences (in this case, K=2) to guide the
PG model to generate a summary sentence. All other source sentences are “muted” in this process. Best viewed in color.
• {W_y, b_y} for calculating P_vcb(w) (Eq. (5));
• {v, W_e, b_e} for attention weights (Eq. (1)).

By training the encoder-decoder model on single-document summarization (SDS) data containing a large collection of news articles paired with summaries (Hermann et al., 2015), these model parameters can be effectively learned.

However, at test time, we wish for the model to generate abstractive summaries from multi-document inputs. This brings up two issues. First, the parameters are ineffective at identifying salient content from multi-document inputs. Humans are very good at identifying representative sentences from a set of documents and fusing them into an abstract; however, this capability is not supported by the encoder-decoder model. Second, the attention mechanism is based on input word positions but not their semantics, which can lead to redundant content in the multi-document input being repeatedly used for summary generation. We conjecture that both issues can be addressed by introducing an "external" model that selects representative sentences from multi-document inputs and dynamically adjusts the sentence importance to reduce summary redundancy. This external model is integrated with the encoder-decoder model to generate abstractive summaries using the selected representative sentences. In the following section we present our adaptation method for multi-document summarization.

4 Our Method

Maximal marginal relevance. Our adaptation method incorporates the maximal marginal relevance algorithm (MMR; Carbonell and Goldstein, 1998) into pointer-generator networks (PG; See et al., 2017) by adjusting the network's attention values. MMR is one of the most successful extractive approaches and, despite its straightforwardness, performs on par with state-of-the-art systems (Luo and Litman, 2015; Yogatama et al., 2015). At each iteration, MMR selects one sentence from the document (D) and includes it in the summary (S) until a length threshold is reached. The selected sentence (s_i) is the most important one amongst the remaining sentences, and it has the least content overlap with the current summary. In the equation below, Sim_1(s_i, D) measures the similarity of the sentence s_i to the document; it serves as a proxy for sentence importance, since important sentences usually show similarity to the centroid of the document. max_{s_j ∈ S} Sim_2(s_i, s_j) measures the maximum similarity of the sentence s_i to each of the summary sentences, acting as a proxy for redundancy. λ is a balancing factor:

    argmax_{s_i ∈ D\S} [ λ Sim_1(s_i, D) − (1 − λ) max_{s_j ∈ S} Sim_2(s_i, s_j) ]

where the first term measures importance and the second term measures redundancy.

Our PG-MMR is an iterative framework for summarizing a multi-document input into a summary consisting of multiple sentences. At each iteration, PG-MMR follows the MMR principle to select the K highest-scored source sentences; they serve as the basis for PG to generate a summary sentence. After that, the scores of all source sentences are updated based on their importance and redundancy. Sentences that are highly similar to the partial summary receive lower scores. Selecting K sentences via the MMR algorithm helps the PG system to effectively identify salient source content that has not been included in the summary.

Muting. To allow the PG system to effectively utilize the K source sentences without retraining the neural model, we dynamically adjust the PG attention weights (α_{t,i}) at test time. Let S_k represent a selected sentence. The attention weights of the words belonging to {S_k}_{k=1}^{K} are calculated as before (Eq. (2)). However, words in other sentences are forced to receive zero attention weights (α_{t,i} = 0), and all α_{t,i} are renormalized (Eq. (8)):

    α_{t,i}^{new} = α_{t,i}   if i ∈ {S_k}_{k=1}^{K};   0   otherwise.    (8)

This means that the remaining sentences are "muted" in this process. In this variant, the sentence importance does not affect the original attention weights, other than through muting.

In an alternative setting, the sentence salience is multiplied with the word salience and renormalized (Eq. (9)); PG uses the reweighted alpha values to predict the next summary word:

    α_{t,i}^{new} = α_{t,i} · MMR(S_k)   if i ∈ {S_k}_{k=1}^{K};   0   otherwise.    (9)

Algorithm 1: The PG-MMR algorithm for summarizing multi-document inputs.
Input: SDS data; MDS source sentences {S_i}
 1: Train the PG model on SDS data
 2: ▷ I(S_i) and R(S_i) are the importance and redundancy scores of the source sentence S_i
 3: I(S_i) ← SVR(S_i) for all source sentences
 4: MMR(S_i) ← λ I(S_i) for all source sentences
 5: Summary ← {}
 6: t ← index of summary words
 7: while t < L_max do
 8:     Find {S_k}_{k=1}^{K} with the highest MMR scores
 9:     Compute α_{t,i}^{new} based on {S_k}_{k=1}^{K} (Eq. (8))
10:     Run the PG decoder for one step to get {w_t}
11:     Summary ← Summary + {w_t}
12:     if w_t is the period symbol then
13:         R(S_i) ← Sim(S_i, Summary), ∀i
14:         MMR(S_i) ← λ I(S_i) − (1 − λ) R(S_i), ∀i
15:     end if
16: end while
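The scoring and muting steps of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the released implementation: the attention vector and the importance/redundancy scores are assumed to be given by the trained PG model and the SVR model, respectively.

```python
import numpy as np

def mmr_scores(importance, redundancy, lam=0.6):
    """MMR(S_i) = lam * I(S_i) - (1 - lam) * R(S_i), as in Alg. 1, line 14."""
    return lam * np.asarray(importance) - (1 - lam) * np.asarray(redundancy)

def select_top_k(scores, k):
    """Indices of the K highest-scored source sentences (Alg. 1, line 8)."""
    return np.argsort(-scores)[:k]

def mute_and_renormalize(attn, keep_word_ids):
    """Eq. (8): keep attention only on words inside the K selected
    sentences, zero it everywhere else, then renormalize to sum to 1."""
    new_attn = np.zeros_like(attn)
    new_attn[keep_word_ids] = attn[keep_word_ids]
    total = new_attn.sum()
    return new_attn / total if total > 0 else new_attn
```

Renormalizing after zeroing keeps the muted attention a valid probability distribution over the remaining word positions, so the decoder step itself needs no modification.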
Sentence Importance. To estimate sentence importance Sim_1(s_i, D), we introduce a supervised regression model in this work. Importantly, the model is trained on single-document summarization datasets where training data are abundant. At test time, the model can be applied to identify important sentences from the multi-document input. Our model determines sentence importance based on four indicators, inspired by how humans identify important sentences from a document set. They include (a) sentence length, (b) its absolute and relative position in the document, (c) sentence quality, and (d) how close the sentence is to the main topic of the document set. These features are considered important indicators in previous extractive summarization frameworks (Galanis and Androutsopoulos, 2010; Hong et al., 2014).

Regarding sentence quality (c), we leverage the PG model to build the sentence representation. We use the bidirectional LSTM encoder to encode any source sentence to a vector representation; [h→_N ; h←_1] is the concatenation of the last hidden states of the forward and backward passes. A document vector is the average of all sentence vectors. We use the document vector and the cosine similarity between the document and sentence vectors as indicator (d). A support vector regression model is trained on (sentence, score) pairs, where the training data are obtained from the CNN/Daily Mail dataset. The target importance score is the ROUGE-L recall of the sentence compared to the ground-truth summary. Because our model architecture leverages neural representations of sentences and documents, the features are data-driven and not restricted to a particular domain.

Sentence Redundancy. To calculate the redundancy of a sentence (max_{s_j ∈ S} Sim_2(s_i, s_j)), we compute the ROUGE-L precision, which measures the longest common subsequence between a source sentence and the partial summary (consisting of all sentences generated thus far by the PG model), divided by the length of the source sentence. A source sentence yielding a high ROUGE-L precision is deemed to have significant content overlap with the partial summary; it will receive a low MMR score and hence is less likely to serve as the basis for generating future summary sentences.

Alg. 1 provides an overview of the PG-MMR algorithm, and Fig. 1 is a graphical illustration. The MMR scores of source sentences are updated after each summary sentence is generated by the PG model. Next, a different set of highest-scored sentences is used to guide the PG model to generate the next summary sentence. "Muting" the remaining source sentences is important because it helps the PG model to focus its attention on the most significant source content. The code for our model is publicly available to further MDS research.[2]

[2] https://github.com/ucfnlp/multidoc_summarization

5 Experimental Setup

Datasets. We investigate the effectiveness of the PG-MMR method by testing it on standard multi-document summarization datasets (Over and Yen,
2004; Dang and Owczarzak, 2008). These include DUC-03, DUC-04, TAC-08, TAC-10, and TAC-11, containing 30/50/48/46/44 topics respectively. The summarization system is tasked with generating a concise, fluent summary of 100 words or less from a set of 10 documents discussing a topic. All documents in a set are chronologically ordered and concatenated to form a mega-document serving as input to the PG-MMR system. Sentences that start with a quotation mark or do not end with a period are excluded (Wong et al., 2008). Each system summary is compared against 4 human abstracts created by NIST assessors. Following convention, we report results on the DUC-04 and TAC-11 datasets, which are standard test sets; DUC-03 and TAC-08/10 are used as a validation set for hyperparameter tuning.[3]

[3] The hyperparameters for all PG-MMR variants are K=7 and λ=0.6, except for "w/ BestSummRec" where K=2.

The PG model is trained for single-document summarization using the CNN/Daily Mail (Hermann et al., 2015) dataset, containing single news articles paired with summaries (human-written article highlights). The training set contains 287,226 articles. An article contains 781 tokens on average, and a summary contains 56 tokens (3.75 sentences). During training we use the hyperparameters provided by See et al. (2017). At test time, the maximum/minimum decoding steps are set to 120/100 words respectively, corresponding to the max/min lengths of the PG-MMR summaries. Because the focus of this work is on multi-document summarization (MDS), we do not report results for the CNN/Daily Mail dataset.

Baselines. We compare PG-MMR against a broad spectrum of baselines, including state-of-the-art extractive ('ext-') and abstractive ('abs-') systems. They are described below.[4]

[4] We are grateful to Hong et al. (2014) for providing the summaries generated by the Centroid, ICSISumm, and DPP systems. These are only available for the DUC-04 dataset.

• ext-SumBasic (Vanderwende et al., 2007) is an extractive approach assuming words occurring frequently in a document set are more likely to be included in the summary;
• ext-KL-Sum (Haghighi and Vanderwende, 2009) greedily adds source sentences to the summary if it leads to a decrease in KL divergence;
• ext-LexRank (Erkan and Radev, 2004) uses a graph-based approach to compute sentence importance based on eigenvector centrality in a graph representation;
• ext-Centroid (Hong et al., 2014) computes the importance of each source sentence based on its cosine similarity with the document centroid;
• ext-ICSISumm (Gillick et al., 2009) leverages the ILP framework to identify a globally optimal set of sentences covering the most important concepts in the document set;
• ext-DPP (Taskar, 2012) selects an optimal set of sentences per the determinantal point processes that balance the coverage of important information and the sentence diversity;
• abs-Opinosis (Ganesan et al., 2010) generates abstractive summaries by searching for salient paths on a word co-occurrence graph created from source documents;
• abs-Extract+Rewrite (Song et al., 2018) is a recent approach that scores sentences using LexRank and generates a title-like summary for each sentence using an encoder-decoder model trained on Gigaword data;
• abs-PG-Original (See et al., 2017) introduces an encoder-decoder model that encourages the system to copy words from the source text via pointing, while retaining the ability to produce novel words through the generator.

System | R-1 | R-2 | R-SU4
SumBasic (Vanderwende et al., 2007) | 29.48 | 4.25 | 8.64
KLSumm (Haghighi et al., 2009) | 31.04 | 6.03 | 10.23
LexRank (Erkan and Radev, 2004) | 34.44 | 7.11 | 11.19
Centroid (Hong et al., 2014) | 35.49 | 7.80 | 12.02
ICSISumm (Gillick and Favre, 2009) | 37.31 | 9.36 | 13.12
DPP (Taskar, 2012) | 38.78 | 9.47 | 13.36
Extract+Rewrite (Song et al., 2018) | 28.90 | 5.33 | 8.76
Opinosis (Ganesan et al., 2010) | 27.07 | 5.03 | 8.63
PG-Original (See et al., 2017) | 31.43 | 6.03 | 10.01
PG-MMR w/ SummRec | 34.57 | 7.46 | 11.36
PG-MMR w/ SentAttn | 36.52 | 8.52 | 12.57
PG-MMR w/ Cosine (default) | 36.88 | 8.73 | 12.64
PG-MMR w/ BestSummRec | 36.42 | 9.36 | 13.23
Table 2: ROUGE results on the DUC-04 dataset.

System | R-1 | R-2 | R-SU4
SumBasic (Vanderwende et al., 2007) | 31.58 | 6.06 | 10.06
KLSumm (Haghighi et al., 2009) | 31.23 | 7.07 | 10.56
LexRank (Erkan and Radev, 2004) | 33.10 | 7.50 | 11.13
Extract+Rewrite (Song et al., 2018) | 29.07 | 6.11 | 9.20
Opinosis (Ganesan et al., 2010) | 25.15 | 5.12 | 8.12
PG-Original (See et al., 2017) | 31.44 | 6.40 | 10.20
PG-MMR w/ SummRec | 35.06 | 8.72 | 12.39
PG-MMR w/ SentAttn | 37.01 | 10.43 | 13.85
PG-MMR w/ Cosine (default) | 37.17 | 10.92 | 14.04
PG-MMR w/ BestSummRec | 40.44 | 14.93 | 17.61
Table 3: ROUGE results on the TAC-11 dataset.

6 Results

Having described the experimental setup, we next compare the PG-MMR method against the baselines on standard MDS datasets, evaluated by both automatic metrics and human assessors.

ROUGE (Lin, 2004). This automatic metric measures the overlap of unigrams (R-1), bigrams (R-2), and skip bigrams with a maximum distance of 4 words (R-SU4) between the system summary and a set of reference summaries. ROUGE scores of the various systems are presented in Tables 2 and 3, respectively, for the DUC-04 and TAC-11 datasets.

We explore variants of the PG-MMR method.
They differ in how the importances of source sentences are estimated and how the sentence importance affects word attention weights. "w/ Cosine" computes the sentence importance as the cosine similarity score between the sentence and document vectors, both represented as sparse TF-IDF vectors under the vector space model. "w/ SummRec" estimates the sentence importance as the predicted R-L recall score between the sentence and the summary; a support vector regression model is trained on sentences from the CNN/Daily Mail datasets (≈33K) and applied to DUC/TAC sentences at test time (see §4). "w/ BestSummRec" obtains the best estimate of sentence importance by calculating the R-L recall score between the sentence and the reference summaries; it serves as an upper bound for the performance of "w/ SummRec." For all variants, the sentence importance scores are normalized to the range of [0,1]. "w/ SentAttn" adjusts the attention weights using Eq. (9), so that words in important sentences are more likely to be used to generate the summary. The weights are otherwise computed using Eq. (8).

[Figure 2 plot. Quartile ranges of source sentence indices for the 1st–5th summary sentences: PG-Original (1, 2), (2, 7), (4, 48), (15, 81), (13, 71); PG-MMR (2, 44), (6, 53), (10, 59), (12, 77), (12, 62).]
Figure 2: The median location of summary n-grams in the multi-document input (and the lower/higher quartiles). The n-grams come from the 1st/2nd/3rd/4th/5th summary sentence and the location is the source sentence index. (TAC-11)

System | 1-grams | 2-grams | 3-grams | Sent
Extr+Rewrite | 89.37 | 54.34 | 25.10 | 6.65
PG-Original | 99.64 | 96.28 | 88.83 | 47.67
PG-MMR | 99.74 | 97.64 | 91.57 | 59.13
Human Abst. | 84.32 | 45.22 | 18.70 | 0.23
Table 4: Percentages of summary n-grams (or entire sentences) that appear in the multi-document input. (TAC-11)
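Several of these quantities reduce to a longest-common-subsequence (LCS) computation: ROUGE-L recall is used for the importance targets, and §4 uses ROUGE-L precision for redundancy. A minimal sketch over whitespace tokens (an illustration, not the official ROUGE toolkit):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ta in enumerate(a, 1):
        for j, tb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ta == tb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    """ROUGE-L precision and recall of `candidate` against `reference`:
    LCS length divided by the candidate and reference lengths, respectively."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    precision = lcs / len(c) if c else 0.0
    recall = lcs / len(r) if r else 0.0
    return precision, recall
```

Dividing the LCS length by the source-sentence length (precision) penalizes source sentences already covered by the partial summary, which is exactly the redundancy proxy of §4.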
As seen in Tables 2 and 3, our PG-MMR method surpasses all unsupervised extractive baselines, including SumBasic, KLSumm, and LexRank. On the DUC-04 dataset, ICSISumm and DPP show good performance, but these systems are trained directly on MDS datasets, which are not utilized by the PG-MMR method. PG-MMR also exhibits superior performance compared to existing abstractive systems. It outperforms Opinosis and PG-Original by a large margin in terms of R-2 F-scores (5.03/6.03/8.73 on DUC-04 and 5.12/6.40/10.92 on TAC-11). In particular, PG-Original is the original pointer-generator network applied to multi-document inputs at test time; compared to it, PG-MMR is more effective at identifying summary-worthy content from the input. "w/ Cosine" is used as the default PG-MMR, and it shows better results than "w/ SummRec." This suggests that the sentence and document representations obtained from the encoder-decoder model (trained on CNN/DM) are suboptimal, possibly due to a vocabulary mismatch, where certain words in the DUC/TAC datasets do not appear in CNN/DM and their embeddings are thus not learned during training. Finally, we observe that "w/ BestSummRec" yields the highest performance on both datasets. This finding suggests that there is great potential for improving the PG-MMR method, as its "extractive" and "abstractive" components can be separately optimized.

Location of summary content. We are interested in understanding why PG-MMR outperforms PG-Original at identifying summary content in the multi-document input. We ask the question: where, in the source documents, does each system tend to look when generating its summaries? Our findings indicate that PG-Original gravitates towards early source sentences, while PG-MMR searches beyond the first few sentences.

In Figure 2 we show the median location of the first occurrences of summary n-grams, where the n-grams can come from the 1st to the 5th summary sentence. For PG-Original summaries, n-grams of the 1st summary sentence frequently come from the 1st and 2nd source sentences, corresponding to the lower/higher quartiles of source sentence indices. Similarly, n-grams of the 2nd summary sentence come from the 2nd to 7th source sentences. For PG-MMR summaries, the patterns are different: the n-grams of the 1st and 2nd summary sentences come from source sentences in the ranges (2, 44) and (6, 53), respectively. Our findings suggest that PG-Original tends to treat the input as a single document and identifies summary-worthy content from the beginning of the input, whereas PG-MMR can successfully search a broader range of the input for summary content. This capability is crucial for multi-document inputs, where important content can come from any article in the set.

Degree of extractiveness. Table 4 shows the
       | Linguistic Quality          | Rankings (%)
System | Fluency | Inform. | NonRed. | 1st | 2nd | 3rd | 4th
Extract+Rewrite | 2.03 | 2.19 | 1.88 | 5.6 | 11.6 | 11.6 | 71.2
LexRank | 3.29 | 3.36 | 3.30 | 30.0 | 28.8 | 32.0 | 9.2
PG-Original | 3.20 | 3.30 | 3.19 | 29.6 | 26.8 | 32.8 | 10.8
PG-MMR | 3.24 | 3.52 | 3.42 | 34.8 | 32.8 | 23.6 | 8.8
Table 5: Linguistic quality of system summaries and their overall rankings (%).

Table 6: Example system summaries and human-written abstract. The sentences are manually de-tokenized for readability.
percentages of summary n-grams (or entire sentences) that appear in the multi-document input. PG-Original and PG-MMR summaries both show a high degree of extractiveness, and similar findings have been reported by See et al. (2017). Because PG-MMR relies on a handful of representative source sentences and mutes the rest, it appears to be marginally more extractive than PG-Original. Both systems encourage generating summary sentences by stitching together source sentences: about 52% and 41% of the summary sentences do not appear in the source, but about 90% of the n-grams do. The Extract+Rewrite summaries (§5), generated by rewriting selected source sentences into title-like summary sentences, exhibit a high degree of abstraction, close to that of human abstracts.

Linguistic quality. To assess the linguistic quality of the various system summaries, we employ Amazon Mechanical Turk human evaluators to judge the summary quality of PG-MMR, LexRank, PG-Original, and Extract+Rewrite. A turker is asked to rate each system summary on a scale of 1 (worst) to 5 (best) based on three evaluation criteria: informativeness (to what extent is the meaning expressed in the ground-truth text preserved in the summary?), fluency (is the summary grammatical and well-formed?), and non-redundancy (does the summary successfully avoid repeating information?). Human summaries are used as the ground truth. The turkers are also asked to provide an overall ranking for the four system summaries. Results are presented in Table 5. We observe that the LexRank summaries are rated highest on fluency; this is because LexRank is an extractive approach, where summary sentences are taken directly from the input. PG-MMR is rated best on both informativeness and non-redundancy. Regarding overall system rankings, PG-MMR summaries are frequently ranked as the 1st- and 2nd-best summaries, outperforming the others.

Example summaries. In Table 6 we present example summaries generated by the various systems. PG-Original cannot effectively identify important content in the multi-document input. Extract+Rewrite tends to generate short, title-like sentences that are less informative and carry substantial redundancy; this is because the system is trained on the Gigaword dataset (Rush et al., 2015), where the target summary length is 7 words. PG-MMR generates summaries that effectively condense the important source content.
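The extractiveness statistics of Table 4 amount to measuring how many summary n-grams also occur in the source input. A simplified sketch (counted over distinct n-grams here; the paper's exact counting protocol may differ):

```python
def ngram_set(tokens, n):
    """Distinct n-grams of a token list, as tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pct_ngrams_in_source(summary, source, n):
    """Percentage of distinct summary n-grams that also appear in the source."""
    summ = ngram_set(summary.split(), n)
    if not summ:
        return 0.0
    src = ngram_set(source.split(), n)
    return 100.0 * len(summ & src) / len(summ)
```

High values for large n (as PG-Original and PG-MMR show for 3-grams in Table 4) indicate that long source fragments are copied nearly verbatim, i.e. a high degree of extractiveness.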
7 Conclusion

We describe a novel adaptation method to generate abstractive summaries from multi-document inputs. Our method combines an extractive summarization algorithm (MMR) for sentence extraction and a recent abstractive model (PG) for fusing source sentences. The PG-MMR system demonstrates competitive results, outperforming strong extractive and abstractive baselines.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
Regina Barzilay, Kathleen R. McKeown, and Michael Elhadad. 1999. Information fusion in the context of multi-document summarization. In Proceedings of ACL.
Tal Baumel, Matan Eyal, and Michael Elhadad. 2018. Query focused abstractive summarization: Incorporating query relevance, multi-document coverage, and summary length constraints into seq2seq models. arXiv preprint arXiv:1801.07704.
Taylor Berg-Kirkpatrick, Dan Gillick, and Dan Klein. 2011. Jointly learning to extract and compress. In Proceedings of ACL.
Lidong Bing, Piji Li, Yi Liao, Wai Lam, Weiwei Guo, and Rebecca J. Passonneau. 2015. Abstractive multi-document summarization via phrase selection and merging. In Proceedings of ACL.
Ziqiang Cao, Wenjie Li, Sujian Li, and Furu Wei. 2017. Improving multi-document summarization via text classification. In Proceedings of AAAI.
Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of SIGIR.
Giuseppe Carenini and Jackie Chi Kit Cheung. 2008. Extractive vs. NLG-based abstractive summarization of evaluative text: The effect of corpus controversiality. In Proceedings of INLG.
Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, and Hui Jiang. 2016. Distraction-based neural networks for document summarization. In Proceedings of IJCAI.
Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In Proceedings of ACL.
Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of ACL.
Hoa Trang Dang and Karolina Owczarzak. 2008. Overview of the TAC 2008 update summarization task. In Proceedings of TAC.
Hal Daume III and Daniel Marcu. 2002. A noisy-channel model for document compression. In Proceedings of ACL.
Greg Durrett, Taylor Berg-Kirkpatrick, and Dan Klein. 2016. Learning-based single-document summarization with compression and anaphoricity constraints. In Proceedings of ACL.
Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research.
Giuseppe Di Fabbrizio, Amanda J. Stent, and Robert Gaizauskas. 2014. A hybrid approach to multi-document summarization of opinions in reviews. In Proceedings of INLG.
Katja Filippova, Enrique Alfonseca, Carlos Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with LSTMs. In Proceedings of EMNLP.
Dimitrios Galanis and Ion Androutsopoulos. 2010. An extractive supervised two-stage method for sentence compression. In Proceedings of NAACL-HLT.
Kavita Ganesan, ChengXiang Zhai, and Jiawei Han. 2010. Opinosis: A graph-based approach to abstractive summarization of highly redundant opinions. In Proceedings of COLING.
Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018. Bottom-up abstractive summarization. In Proceedings of EMNLP.
Shima Gerani, Yashar Mehdad, Giuseppe Carenini, Raymond T. Ng, and Bita Nejat. 2014. Abstractive summarization of product reviews using discourse structure. In Proceedings of EMNLP.
Dan Gillick and Benoit Favre. 2009. A scalable global model for summarization. In Proceedings of the NAACL Workshop on Integer Linear Programming for Natural Language Processing.
Dan Gillick, Benoit Favre, Dilek Hakkani-Tur, Berndt Bohnet, Yang Liu, and Shasha Xie. 2009. The ICSI/UTD summarization system at TAC 2009. In Proceedings of TAC.
Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of NAACL.
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of ACL.
Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In Proceedings of ACL.
Aria Haghighi and Lucy Vanderwende. 2009. Exploring content models for multi-document summarization. In Proceedings of NAACL.
Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of NIPS.
Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Kai Hong, John M. Conroy, Benoit Favre, Alex Kulesza, Hui Lin, and Ani Nenkova. 2014. A repository of state of the art and competitive baseline summaries for generic news summarization. In Proceedings of LREC.
Masaru Isonuma, Toru Fujino, Junichiro Mori, Yutaka Matsuo, and Ichiro Sakata. 2017. Extractive summarization using multi-task learning with document classification. In Proceedings of EMNLP.
Hongyan Jing and Kathleen McKeown. 1999. The decomposition of human-written summary sentences. In Proceedings of SIGIR.
Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. 2016. Controlling output length in neural encoder-decoders. In Proceedings of EMNLP.
Wojciech Kryściński, Romain Paulus, Caiming Xiong, and Richard Socher. 2018. Improving abstraction in text summarization. In Proceedings of EMNLP.
Chen Li, Fei Liu, Fuliang Weng, and Yang Liu. 2013. Document summarization via guided sentence compression. In Proceedings of EMNLP.
Kexin Liao, Logan Lebanoff, and Fei Liu. 2018. Abstract meaning representation for multi-document summarization. In Proceedings of COLING.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the ACL Workshop on Text Summarization Branches Out.
Fei Liu, Jeffrey Flanigan, Sam Thomson, Norman Sadeh, and Noah A. Smith. 2015. Toward abstractive summarization using semantic representations. In Proceedings of NAACL.
Wencan Luo and Diane Litman. 2015. Summarizing student responses to reflection prompts. In Proceedings of EMNLP.
Wencan Luo, Fei Liu, Zitao Liu, and Diane Litman. 2016. Automatic summarization of student course feedback. In Proceedings of NAACL.
Yishu Miao and Phil Blunsom. 2016. Language as a latent variable: Discrete generative models for sentence compression. In Proceedings of EMNLP.
Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of CoNLL.
Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of NAACL-HLT.
Ani Nenkova and Kathleen McKeown. 2011. Automatic summarization. Foundations and Trends in Information Retrieval.
Paul Over and James Yen. 2004. An introduction to DUC-2004. National Institute of Standards and Technology.
Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. In Proceedings of EMNLP.
Daniele Pighin, Marco Cornolti, Enrique Alfonseca, and Katja Filippova. 2014. Modelling events through memory-based, open-IE patterns for abstractive summarization. In Proceedings of ACL.
Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for sentence summarization. In Proceedings of EMNLP.
Evan Sandhaus. 2008. The New York Times annotated corpus. Linguistic Data Consortium.
Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of ACL.
Kaiqiang Song, Lin Zhao, and Fei Liu. 2018. Structure-infused copy mechanisms for abstractive summarization. In Proceedings of COLING.
Sho Takase, Jun Suzuki, Naoaki Okazaki, Tsutomu Hirao, and Masaaki Nagata. 2016. Neural headline generation on abstract meaning representation. In Proceedings of EMNLP.
Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. 2017. Abstractive document summarization with a graph-based attentional neural model. In Proceedings of ACL.
Lu Wang, Hema Raghavan, Vittorio Castelli, Radu Florian, and Claire Cardie. 2013. A sentence compression based framework to query-focused multi-document summarization. In Proceedings of ACL.
Kam-Fai Wong, Mingli Wu, and Wenjie Li. 2008. Extractive summarization using supervised and semi-supervised learning. In Proceedings of COLING.
Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu, Ayush Pareek, Krishnan Srinivasan, and Dragomir Radev. 2017. Graph-based neural multi-document summarization. In Proceedings of CoNLL.
Dani Yogatama, Fei Liu, and Noah A. Smith. 2015. Extractive summarization by maximizing semantic volume. In Proceedings of EMNLP.
David Zajic, Bonnie J. Dorr, Jimmy Lin, and Richard Schwartz. 2007. Multi-candidate reduction: Sentence compression as a tool for document summarization tasks. Information Processing and Management.
Wenyuan Zeng, Wenjie Luo, Sanja Fidler, and Raquel Urtasun. 2017. Efficient summarization with read-again and copy mechanism. In Proceedings of ICLR.
Jianmin Zhang, Jiwei Tan, and Xiaojun Wan. 2018. Towards a neural network approach to abstractive multi-document summarization. arXiv preprint arXiv:1804.09010.
Qingyu Zhou, Nan Yang, Furu Wei, and Ming Zhou. 2017. Selective encoding for abstractive sentence summarization. In Proceedings of ACL.