Introduction

Automatically generating summaries from large text corpora has long been studied in both information retrieval and natural language processing.
There are several types of text summarization tasks.
For example, if an input query is given, the generated summary can be query-specific, and otherwise it is generic.
Also, the number of documents to be summarized can vary from one to many.
The constituent sentences of a summary might be formed in a variety of different ways: summarization can be conducted using either extraction or abstraction, where the former selects only sentences from the original documents and the latter involves natural language generation.
In this paper, we address the problem of generating generic extractive summaries from clusters of related documents, commonly known as multi-document summarization.
In extractive text summarization, textual units (e.g., sentences) from a document set are extracted to form a summary, where grammaticality is assured at the local level.
Finding the optimal summary can be viewed as a combinatorial optimization problem which is NP-hard to solve (McDonald, 2007).
One of the standard methods for this problem is called Maximum Marginal Relevance (MMR), where a greedy algorithm selects the most relevant sentences, and at the same time avoids redundancy by removing sentences that are too similar to already selected ones.
One major problem of MMR is that it is non-optimal, because the decision is made based only on the scores at the current iteration.
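The MMR-style greedy loop can be sketched as follows. This is only an illustration: `relevance` and `sim` are placeholder scoring functions, and `lam` is a hypothetical trade-off weight, none of which come from the paper.

```python
# Hypothetical sketch of an MMR-style greedy selection.
# `relevance(s)` scores a sentence; `sim(a, b)` measures similarity.
def mmr_select(sentences, relevance, sim, k, lam=0.7):
    """Greedily pick k sentences, trading relevance against redundancy."""
    selected = []
    candidates = list(sentences)
    while candidates and len(selected) < k:
        def score(s):
            # penalty: similarity to the closest already-selected sentence
            redundancy = max((sim(s, t) for t in selected), default=0.0)
            return lam * relevance(s) - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

Note that each pick depends only on scores at the current iteration, which is exactly the source of MMR's non-optimality.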
McDonald proposed to replace the greedy search of MMR with globally optimal inference, where the basic framework can be expressed as knapsack packing and an integer linear programming (ILP) solver can be used to maximize the resulting objective function.
ILP solvers, however, can sometimes be expensive for large-scale problems, or might themselves only be heuristic, without associated theoretical approximation guarantees.
In this paper, we study graph-based approaches for multi-document summarization.
Indeed, several graph-based methods have been proposed for extractive summarization in the past.
Erkan and Radev (2004) introduced LexRank, a stochastic graph-based method for computing the relative importance of textual units for multi-document summarization. In LexRank, the importance of sentences is computed based on the concept of eigenvector centrality in the graph representation of sentences.
Mihalcea and Tarau (2004) also proposed an eigenvector centrality algorithm on weighted graphs for document summarization.
Mihalcea et al. later applied graph-based ranking algorithms to natural language processing tasks, including automatic keyword extraction, word sense disambiguation, and extractive summarization (Mihalcea et al., 2004).
Recent work (Lin et al., 2009) presents a graph-based approach in which an undirected weighted graph is built for the document set to be summarized: vertices represent the candidate sentences, and edge weights represent the similarity between sentences. The summary extraction procedure is then carried out by maximizing a submodular set function under a cardinality constraint.
Inspired by (Lin et al., 2009), we perform summarization by maximizing submodular functions under a budget constraint. A budget constraint is natural in a summarization task, as the length of the summary is often restricted. The length limitation reflects the real-world scenario where summaries are displayed using only limited computer screen real estate.
In general, the candidate units might not have identical costs (e.g., sentence lengths vary).
Since a cardinality constraint is a special case of a budget constraint (with unit costs), our approach is more general than that of (Lin et al., 2009).
Moreover, we propose a modified greedy algorithm, and we show both theoretically (Section 4.1) and empirically (Section 5.1) that the algorithm solves the problem near-optimally, thanks to submodularity.
Regarding summarization quality, experiments on the DUC'04 task show that our approach is superior to the best-performing method of the DUC'04 evaluation on ROUGE-1 scores (Section 5).
Background on Submodularity

Consider a set function f : 2^V → R, which maps subsets S ⊆ V of a finite ground set V to real numbers.
f is called normalized if f(∅) = 0, and monotone if f(S) ≤ f(T) whenever S ⊆ T. f is called submodular if for any S, T ⊆ V we have f(S ∪ T) + f(S ∩ T) ≤ f(S) + f(T).
An equivalent definition of submodularity is the property of diminishing returns, well known in the field of economics. That is, f is submodular if for any R ⊆ S ⊆ V and any s ∈ V \ S, f(S ∪ {s}) − f(S) ≤ f(R ∪ {s}) − f(R). This inequality states that the value of s never increases in the context of ever larger sets, which is exactly the property of diminishing returns.
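The diminishing-returns property can be checked concretely on a toy word-coverage function (an illustration chosen for this sketch, not an objective from the paper):

```python
# Toy check: word coverage f(S) = |union of words covered by the
# units in S| is monotone submodular.
def coverage(S, words_of):
    covered = set()
    for i in S:
        covered |= words_of[i]
    return len(covered)

words_of = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d"}}
R, S, s = {1}, {1, 2}, 3  # R is a subset of S, s is outside S
gain_S = coverage(S | {s}, words_of) - coverage(S, words_of)  # adds only "d"
gain_R = coverage(R | {s}, words_of) - coverage(R, words_of)  # adds "c" and "d"
assert gain_S <= gain_R  # diminishing returns holds
```

Adding unit 3 to the larger set S gains less than adding it to the smaller set R, because S already covers some of the same words.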
This phenomenon arises naturally in many other contexts as well.
For example, the Shannon entropy, viewed as a function of a set of random variables, is submodular. Submodularity is the discrete analog of convexity (Lovász, 1983). Just as convexity makes continuous functions more amenable to optimization, submodularity plays an essential role in combinatorial optimization.
Many combinatorial optimization problems can be solved optimally or near-optimally in polynomial time only when the underlying function is submodular.
It has been shown that any submodular function can be minimized in polynomial time (Iwata et al., 2001).
Maximization of a submodular function, by contrast, is an NP-hard optimization problem, but some submodular maximization problems can be solved near-optimally.
A famous result is that the maximization of a monotone submodular function under a cardinality constraint can be solved by a greedy algorithm (Nemhauser et al., 1978) that is guaranteed to be within a constant factor (1 − 1/e ≈ 0.63) of optimal.
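The greedy rule behind this result can be sketched as follows; the word-coverage objective used in the example is an illustrative assumption, not from the paper:

```python
# Sketch of the classic greedy algorithm of Nemhauser et al. (1978):
# repeatedly add the element with the largest marginal gain until k
# elements are chosen.  For a normalized monotone submodular f this
# achieves a (1 - 1/e) approximation guarantee.
def greedy_cardinality(V, f, k):
    S = set()
    while len(S) < k:
        best = max((v for v in V if v not in S),
                   key=lambda v: f(S | {v}) - f(S))
        S.add(best)
    return S

# Example with a (submodular) word-coverage objective:
words_of = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d"}, 4: {"d"}}
f = lambda S: len(set().union(*[words_of[i] for i in S]))
summary = greedy_cardinality([1, 2, 3, 4], f, k=2)
```

After picking unit 1, the greedy step skips unit 2 (its word "b" is already covered) in favor of unit 3, exactly because marginal gains, not raw scores, drive the selection.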
A constant-factor approximation algorithm has also been obtained for maximizing a monotone submodular function under a knapsack constraint (see Section 4.2). Feige et al. studied unconstrained maximization of arbitrary submodular functions (not necessarily monotone).
Later work proposed methods for optimally maximizing a submodular set function under a cardinality constraint, and Lee et al. studied maximization under matroid and knapsack constraints.
Problem Setup

In this paper, we study the problem of maximizing a submodular function under a budget constraint, stated formally as

    max { f(S) : S ⊆ V, Σ_{i∈S} c_i ≤ B },

where V is the ground set of all linguistic units (e.g., sentences) in the document set, S ⊆ V is the extracted summary, c_i is the non-negative cost of selecting unit i, B is our budget, and the submodular function f(·) scores the summary quality.
The budget constraint arises naturally since the summary must often be length-limited, as mentioned above. In particular, the budget B could be the maximum number of words allowed in any summary, or alternatively the maximum number of bytes, in which case c_i would be either the number of words or the number of bytes in sentence i.
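As a rough sketch of the budgeted setting, a basic cost-benefit greedy picks, at each step, the affordable unit with the best marginal gain per unit cost. This shows only the core idea under an illustrative coverage objective; the paper's modified greedy algorithm (Section 4.1) refines it further.

```python
# Generic cost-benefit greedy sketch for max f(S) s.t. sum of c_i <= B.
def budgeted_greedy(V, f, cost, B):
    S, spent = set(), 0
    candidates = set(V)
    while True:
        affordable = [v for v in candidates if spent + cost[v] <= B]
        if not affordable:
            return S
        # best marginal gain per unit of budget spent
        best = max(affordable, key=lambda v: (f(S | {v}) - f(S)) / cost[v])
        if f(S | {best}) - f(S) <= 0:
            return S  # no remaining unit improves the summary
        S.add(best)
        spent += cost[best]
        candidates.remove(best)

# Example: unit 1 covers the most words but is too expensive for B = 2.
words_of = {1: {"a", "b", "c"}, 2: {"a", "b"}, 3: {"d"}}
cov = lambda S: len(set().union(*[words_of[i] for i in S]))
picked = budgeted_greedy([1, 2, 3], cov, cost={1: 3, 2: 1, 3: 1}, B=2)
```

With varying costs, the cheap units 2 and 3 together fit the budget and cover more than any single affordable unit, which is precisely why the budgeted problem differs from the cardinality-constrained one.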
To benefit from these guarantees, the objective function measuring summary quality must be submodular.
In general, there are two ways to apply submodular optimization to any application domain.
