
A Context-Based Word Indexing Model

for Document Summarization


Pawan Goyal, Laxmidhar Behera, Senior Member, IEEE, and
Thomas Martin McGinnity, Senior Member, IEEE
Abstract: Existing models for document summarization mostly use the similarity between sentences in the document to extract the
most salient sentences. The documents as well as the sentences are indexed using traditional term indexing measures, which do not
take the context into consideration. Therefore, the sentence similarity values remain independent of the context. In this paper, we
propose a context sensitive document indexing model based on the Bernoulli model of randomness. The Bernoulli model of
randomness has been used to find the probability of the cooccurrences of two terms in a large corpus. A new approach using the
lexical association between terms to give a context sensitive weight to the document terms has been proposed. The resulting indexing
weights are used to compute the sentence similarity matrix. The proposed sentence similarity measure has been used with the
baseline graph-based ranking models for sentence extraction. Experiments have been conducted over the benchmark DUC data sets
and it has been shown that the proposed Bernoulli-based sentence similarity model provides consistent improvements over the
baseline IntraLink and UniformLink methods [1].
Index Terms: Lexical association, text summarization, document indexing

1 INTRODUCTION
Document summarization is an information retrieval task that aims at extracting a condensed version of
the original document [2]. A document summary is useful
since it can give an overview of the original document in a
shorter period of time. Readers may decide whether or not
to read the complete document after going through the
summary. For example, readers first look at the abstract of a
scientific article before reading the complete paper. Search
engines also use text summaries to help users make
relevance decisions [3].
The main goal of a summary is to present the main ideas
in a document/set of documents in a short and readable
paragraph. Summaries can be produced either from a single
document or from many documents [4]. The task of producing a summary from many documents is called multidocument
summarization [5], [6], [7], [8], [9], [10]. Summarization can
also be tailored to the information needs of the user, in which case it is called query-biased summarization [11], [12], [13]. For
instance, the QCS system (query, cluster, and summarize,
[12]) retrieves relevant documents in response to a query,
clusters these documents by topic and produces a summary
for each cluster. Opinion summarization [14], [15], [16], [17]
is another application of text summarization. Topic sum-
marization deals with the evolution of topics in addition to
providing the informative sentences [18].
This paper focuses on sentence extraction-based single
document summarization. Most of the previous studies on
the sentence extraction-based text summarization task use a
graph-based algorithm to calculate the saliency of each
sentence in a document and the most salient sentences
are extracted to build the document summary. The sentence
extraction techniques give an indexing weight to the
document terms and use these weights to compute the
sentence similarity [1] and/or document centroid [19] and
so on. The sentence similarity calculation remains central to
the existing approaches. The indexing weights of the
document terms are utilized to compute the sentence
similarity values. However, very elementary document
features are used to allocate an indexing weight to the
document terms, which include the term frequency, docu-
ment length, occurrence of a term in a background corpus
and so on. Therefore, the indexing weight remains
independent of the other terms appearing in the document
and the context in which the term occurs is overlooked in
assigning its indexing weight. This results in context
independent document indexing. To the authors' knowledge, no other work in the existing literature addresses the
problem of context independent document indexing for
the document summarization task.
A document contains both the content-carrying (topical)
terms as well as background (nontopical) terms. The
traditional indexing schemes cannot distinguish between these terms, and this is reflected in the sentence similarity
values. A context sensitive document indexing model gives
a higher weight to the topical terms as compared to the
nontopical terms and, thus, influences the sentence simi-
larity values in a positive manner.
In this paper, we address the problem of context
independent document indexing using the lexical associa-
tion between document terms. In a document, the content-
carrying words will be highly associated with each other,
while the background terms will have very low association
with the other terms in the document. The association
between terms is captured in this paper by the lexical
association, computed through a corpus analysis.
The main motivation behind using the lexical association
is the central assumption that the context in which a word
appears provides useful information about its meaning [20].
Cooccurrence measures observe the distributional patterns
of a term with other terms in the vocabulary and have
applications in many tasks pertaining to natural language
understanding such as word classification [21], knowledge
acquisition [22], word sense disambiguation [23], informa-
tion retrieval [24], sentence retrieval [25], and word
clustering [26]. In this paper, we derive a novel term
association metric using the Bernoulli model of random-
ness. Multivariate Bernoulli models have previously been
applied to document indexing and information retrieval
[27], [28]. We use the Bernoulli model of randomness to find
the probability of the cooccurrences of two terms in a
corpus and use the classical semantic information theory to
quantify the information contained in the cooccurrences of
these two terms.
The lexical association metric thus derived is used to
propose a context-sensitive document indexing model. The
idea is implemented using a PageRank-based algorithm
[29] to iteratively compute how informative each document term is. Sentence similarity calculated using the
context sensitive indexing should reflect the contextual
similarity between two sentences. This will allow two
sentences to have different similarity values depending on
the context. The hypothesis is that an improved sentence
similarity measure would lead to improvements in the
document summarization.
The text summarization experiments have been per-
formed on the single document summarization task over
the DUC01 and DUC02 data sets. It has been shown that the
proposed model consistently improves the performance of
the baseline sentence extraction algorithms under various
settings and, thus, can be used as an enhancement over the
baseline models. The theoretical foundations along with the
empirical results confirm that the proposed model advances
the state of the art in document summarization.
The main contributions of this paper are summarized
as follows:
1. We propose the novel idea of using the context-
sensitive document indexing to improve the sentence
extraction-based document summarization task.
2. We implement the idea by using the lexical associa-
tion between document terms in a PageRank-based
framework. A novel term association metric using the
Bernoulli model of randomness has been derived for
this purpose. Empirical evidence has been provided
to show that using the derived lexical association
metric, the average lexical association between the terms in a target summary is higher than that between the terms in the original document.
3. Experiments have been conducted over the bench-
mark document understanding conference (DUC)
data sets to empirically validate the effectiveness of
the proposed model.
The remainder of this paper has been organized as
follows: Section 2 discusses the related work in the field of
document summarization. The proposed lexical associa-
tion-based context sensitive indexing model is discussed in
Section 3 along with the derivation of the term association
metric. Experiments and results over the DUC data sets are
reported in Section 4, where the proposed approach is
compared to the baseline model in various settings.
Discussions with one specific document as an example are
reported in Section 5, where summaries obtained through
various approaches are shown. Conclusions and future
work are provided in Section 6.
2 RELATED WORK
Text summarization can be either abstractive or extractive. The abstraction-based models mostly provide the
summary by sentence compression and reformulation [30],
[31], [32], allowing summarizers to increase the overall
information without increasing the summary length [33],
[34]. However, these models require complex linguistic
processing. Sentence extraction models, on the other hand,
use various statistical features from the text to identify the
most central sentences in a document/set of documents.
Radev et al. [19] proposed a centroid-based summarization
model. They used the words having tf-idf scores
(indexing weights) above a threshold to define the centroid
as a pseudodocument. Those sentences containing more
words from the centroid were assumed to be central. Erkan
and Radev [35] proposed LexRank to compute sentence
importance based on the concept of eigenvector centrality
and degree centrality. They used the hypothesis that the
sentences that are similar to many of the other sentences in
a cluster are more salient to the document topic. Sentence
similarity based on the cosine measure was used to compute the adjacency matrix. Once the
document graph is constructed using the similarity values,
the degree centrality of a sentence $s_i$ is defined as the number of sentences similar to $s_i$ with a similarity value above a threshold. Eigenvector centrality is computed iteratively using the LexRank algorithm, which is an adaptation
of the PageRank algorithm. Mihalcea and Tarau [36]
proposed TextRank, another iterative graph-based ranking
framework for text summarization and showed that other
graph-based algorithms can be derived from this model.
Researchers have used a combination of statistical and
linguistic features, such as term frequency [37], sentence
position [38], [39], [40], [41], cue words [42], topic signature
[43], lexical chains [44] and so on for computing a saliency
score of the sentences. Ko and Seo [45] combined two
consecutive sentences into a pseudosentence (bigram). These
bigrams were supposed to be contextually informative. First, the salient bigrams are extracted using a sentence extraction module; the final sentences are then extracted from these bigrams in a second sentence extraction step. Alguliev and Alyguliev
[46] used quadratic-type integer linear programming to
cluster sentences and used the cluster to discover latent
topical sections and information rich sentences.
Wan and Xiao [1] proposed to use a small number of
neighborhood documents to improve the sentence extrac-
tion model. Given a document $D_i$, they used its neighboring documents to construct an expanded document set $E_i$. The
PageRank-based algorithm was applied and both the local
and global information was used to compute the saliency
scores of sentences. The model works in two steps.
1. Neighborhood construction.
2. Summary extraction using the neighborhood
knowledge.
Given a document $D_i \in \mathcal{D}$, the model finds $k$ neighboring documents for $D_i$. These documents constitute the neighborhood knowledge context for $D_i$, and $D_i$ is said to be expanded to a small document set $E_i$.
Using the expanded document, both the within-document
sentence relationships (local information) and cross-docu-
ment sentence relationships (global information) can be
used for the summarization process.
An adaptation of the graph-based ranking algorithm,
PageRank is used to compute the importance of a sentence
within a graph in a recursive manner, using the connectiv-
ity information of the graph. Given the expanded document
set $E_i$, an undirected graph $G = \langle S, E \rangle$ is used to reflect the connectivity between sentences in the document set. $S = \{s_i \mid 1 \le i \le |S|\}$ is the set of sentences in $E_i$ and $E$ is a matrix of size $|S| \times |S|$, such that an element $e_{j,k} \in E$ stores the similarity between sentences $s_j$ and $s_k$ in $S$. The similarity value is calculated using the cosine similarity measure.

An adjacency matrix $M = (M_{j,k})_{|S| \times |S|}$ is used to describe $G$ as

$$M_{j,k} = \begin{cases} \lambda \cdot sim(s_j, s_k), & j \ne k, \\ 0, & \text{otherwise}, \end{cases} \qquad (1)$$

where $\lambda$ denotes a confidence value, which is set to 1 if the link is a within-document link and to $sim_{doc}(D_l, D_i)$ if $s_j$ and $s_k$ come from different documents $D_l$ and $D_i$, where $sim_{doc}$ denotes the document similarity measure. $M$ is normalized to $\tilde{M}$ so that each row sums to 1. Based on the global affinity graph $G$, the importance score (denoting how informative a sentence is) $InfoScore(s_j)$ for sentence $s_j$ is calculated in a recursive manner using the PageRank-based algorithm as follows:

$$InfoScore(s_j) = \mu \cdot \sum_{k \ne j} InfoScore(s_k) \cdot \tilde{M}_{k,j} + \frac{1-\mu}{|S|}, \qquad (2)$$

where $\mu$ is a damping factor. Equation (2) is applied iteratively until convergence is achieved. The convergence criterion is the difference between the importance scores of the sentences in two successive iterations.
The above algorithm was named differently for different
settings, as described below:
UniformLink. Both the within-document and cross-docu-
ment relations are used.
InterLink. Only the cross-document relationships are used.
IntraLink. Only the within-document relationships are
used.
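The ranking procedure of (1) and (2) can be summarized in a short sketch. The following Python fragment is illustrative only and is not the authors' implementation: the function name, the NumPy-based representation, and the choice of tf-idf sentence vectors are assumptions, while the affinity matrix, row normalization, damping factor, and convergence test follow the description above.

import numpy as np

def rank_sentences(sent_vectors, doc_ids, doc_sim, mu=0.85, eps=1e-4):
    """Graph-based sentence ranking in the spirit of (1)-(2).
    sent_vectors: one term-weight vector per sentence (e.g., tf-idf);
    doc_ids: document index of each sentence; doc_sim: document-to-document similarity."""
    X = np.asarray(sent_vectors, dtype=float)
    S = X.shape[0]
    # Cosine similarity between every pair of sentences.
    unit = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    M = unit @ unit.T
    np.fill_diagonal(M, 0.0)
    # Confidence value: 1 for within-document links, document similarity otherwise (Eq. (1)).
    for j in range(S):
        for k in range(S):
            if doc_ids[j] != doc_ids[k]:
                M[j, k] *= doc_sim[doc_ids[j], doc_ids[k]]
    # Row-normalize M so that each row sums to 1.
    M_tilde = M / (M.sum(axis=1, keepdims=True) + 1e-12)
    # Iterate InfoScore(s_j) = mu * sum_{k != j} InfoScore(s_k) * M~_{k,j} + (1 - mu)/|S|  (Eq. (2)).
    score = np.ones(S) / S
    while True:
        new_score = mu * (M_tilde.T @ score) + (1.0 - mu) / S
        if np.abs(new_score - score).sum() < eps:
            return new_score
        score = new_score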
From (2), it is clear that the only input document feature
that this graph-based sentence extraction algorithm accepts
is the normalized sentence similarity matrix $\tilde{M}$. This matrix is constructed using the cosine similarity measure between the sentences. However, the sentence vectors are constructed using the tf-idf-based indexing scheme,
which is context independent and does not take the
topicality of the document words into account.
None of the models, as described in this section, address
the problem of context insensitive document indexing.
This paper proposes to use the knowledge derived from the
underlying corpus to give a context-sensitive indexing
weight to the document terms. Sentence similarity will be
calculated using the indexing weights thus obtained.
3 EXPLORING LEXICAL ASSOCIATION FOR TEXT
SUMMARIZATION
Given a document $D_i$, the terms encountered in it can either
be topical or nontopical. While it is difficult to decide about
the topicality of a term only on the basis of a single
document, as suggested by the distributional hypothesis
[20], the patterns of term cooccurrence over a larger data set
can be helpful. Lexical association measures use the term
cooccurrence knowledge extracted from a large corpus.
Nontopical terms appear randomly across all the documents, while topical terms appear in bursts. Therefore, computed
on a sufficiently large corpus, the lexical association value
between two topical terms should be higher than the lexical
association between two nontopical terms or a pair of
topical and nontopical terms.
To motivate the discussion, let us consider two arbitrary
documents $D_1$ and $D_2$. Two sentences from each document are shown below:

$D_1$: $S_{11}$ = {started, career, engineering},
      $S_{12}$ = {shifted, engineering, humanities}.

$D_2$: $S_{21}$ = {engineering, application, scientific, principles},
      $S_{22}$ = {engineering, design, build, machines}.
The first document is discussing a person's life. He started his career in engineering and, later on, shifted to humanities. The term "engineering" is not a content-carrying term in this document. The second document, on the other hand, talks about engineering. The two sentences, $S_{21}$ and $S_{22}$, attempt to define the field of engineering. The term "engineering" is clearly a topical term in $D_2$. By using any of the traditional indexing schemes, "engineering" will be given approximately the same indexing weight in both the documents. Therefore, the similarity between $S_{11}$ and $S_{12}$ will be nearly the same as the similarity between $S_{21}$ and $S_{22}$. However, "engineering" is topically related to $D_2$ and is a background term in $D_1$. If an indexing scheme can distinguish between the topical and nontopical terms, "engineering" will receive a much lower indexing weight in $D_1$ than in $D_2$, resulting in a decrease in the similarity value between $S_{11}$ and $S_{12}$ and an increase in the similarity value between $S_{21}$ and $S_{22}$.
Before developing an algorithm for the identification of
the topical and nontopical terms, it is important to
reflect upon how we decide that the term "engineering" is topical in $D_2$ and nontopical in $D_1$. In $D_2$, the term "engineering" appears with many other terms such as application, design, machines, scientific, and principles, which are associated with the term "engineering". In $D_1$, however, the term "engineering" seems to be only slightly related to the term career and not to other terms such as humanities, shifted, and started. Our knowledge of the word
association is based on the knowledge we have captured
about the world. For computational purposes, this knowledge can be discovered through corpus analysis.
We now present the underlying hypotheses of our
approach in this section.
- A document summary is centered around the topical terms (content-carrying terms) encountered in the document. In other words, $H_1$: The ratio of topical words is higher in a summary of a document than in the original document.
- Topical terms appear in bursts while the nontopical terms appear randomly across all the documents. $H_2$: For a carefully chosen lexical association metric, lexical association between two topical terms should be higher than the lexical association between two nontopical terms or a pair of topical and nontopical terms. This lexical association can be calculated using a large corpus.
- Once the lexical association is calculated, we can construct the document graph, with the terms appearing in the document as the vertices and the lexical association between these terms as the edges of the graph. $H_3$: A PageRank-based algorithm can be used to determine the context-sensitive indexing weights, resulting in performance improvement for a document summarization task.
The cooccurrence patterns in a corpus can be used to
derive the lexical association measure. Assuming that the
terms are distributed according to the Bernoulli distribu-
tion, divergence from the randomness behavior can provide
a measure of the lexical association. Before going into the
derivation, we define the notation.
3.1 Notations
We consider a set of $N$ documents. Let these documents contain $n$ unique words, which will be used to index these documents and are thus called index terms. Let $T = \{t_1, t_2, \ldots, t_n\}$ be the set of these index terms. Let the set of $N$ documents be $\mathcal{D} = \{D_1, D_2, \ldots, D_N\}$. Let $f_{ij}$ be the frequency with which term $t_j$ occurs in document $D_i$ and $N_j$ be the number of documents in which the term $t_j$ occurs at least once. $N_j$ is also called the document frequency of term $t_j$. We will denote the probability of term $t_i$ appearing in the corpus by $p_i$. Let $N_{ij}$ denote the number of documents in which terms $t_i$ and $t_j$ cooccur.
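Under the notation above, the corpus statistics can be collected in a single pass over the document collection. The following sketch is illustrative only (it is not the authors' code); it assumes `docs` is a list of tokenized documents with stop words already removed.

from collections import Counter
from itertools import combinations

def corpus_statistics(docs):
    """docs: list of tokenized documents (stop words removed).
    Returns N, the document frequencies N_i, the cooccurrence document
    frequencies N_ij, and the probabilities p_i = N_i / N of Eq. (3)."""
    N = len(docs)
    df = Counter()      # N_i: number of documents containing term t_i
    co_df = Counter()   # N_ij: number of documents containing both t_i and t_j
    for doc in docs:
        terms = set(doc)
        df.update(terms)
        co_df.update(combinations(sorted(terms), 2))
    p = {t: df[t] / N for t in df}
    return N, df, co_df, p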
3.2 Bernoulli Model of Randomness: Derivation of
the Term Association Metric
Let us consider the distribution of terms $t_i$ and $t_j$ in a corpus. The term $t_j$ appears in $N_j$ documents. Assuming the terms to be distributed as per the Bernoulli distribution, the probability $p_i$ of the term $t_i$ appearing in a document is given by

$$p_i = \frac{N_i}{N}. \qquad (3)$$

Consider the $N_j$ documents in which term $t_j$ occurs. Term $t_i$ occurs in $N_{ij}$ of these $N_j$ documents and does not occur in $N_j - N_{ij}$ documents. Therefore, the probability of $N_{ij}$ cooccurrences in $N_j$ documents is given by the Bernoulli distribution

$$Prob(N_{ij}) = B(N, N_j, N_{ij}) \qquad (4)$$
$$= \binom{N_j}{N_{ij}} p_i^{N_{ij}} q_i^{N_j - N_{ij}}, \qquad (5)$$

where $q_i = 1 - p_i$.
Equation (5) quantifies the probability that term $t_i$ has $N_{ij}$ cooccurrences in $N_j$ documents. As per classical semantic information theory, the quantity of information associated is equivalent to the logarithm of the reciprocal of this probability, expressed in bits [47]. Therefore, the information content in the $N_{ij}$ cooccurrences of term $t_i$ in $N_j$ documents can be expressed as

$$Inf(N_{ij}) = -\log_2 Prob(N_{ij}) \qquad (6)$$
$$= -\log_2 \left[ \binom{N_j}{N_{ij}} p_i^{N_{ij}} q_i^{N_j - N_{ij}} \right], \qquad (7)$$

where (6) is the quantification of the surprise of the $N_{ij}$ cooccurrences. Equation (7) requires the computation of factorials and, therefore, Stirling's approximation [48] can be used to approximate the factorials included in the computation. According to Stirling's approximation,

$$n! \approx \sqrt{2\pi n}\left(\frac{n}{e}\right)^n. \qquad (8)$$
Proceeding along the same lines as the derivation proposed by Amati and Rijsbergen [27],^1 we can derive the information content in the $N_{ij}$ cooccurrences as

$$Inf(N_{ij}) = 0.5\log_2\left(2\pi N_{ij}\left(1-\frac{N_{ij}}{N_j}\right)\right) + N_{ij}\log_2\frac{p_{co}}{p_i} + (N_j - N_{ij})\log_2\frac{1-p_{co}}{1-p_i}, \qquad (9)$$

where $p_i = \frac{N_i}{N}$ and $p_{co} = \frac{N_{ij}}{N_j}$.

Equation (9) quantifies the self-information of the $N_{ij}$ cooccurrences of term $t_i$ in $N_j$ documents. We propose to use this information as the Bernoulli lexical association measure to give a context-sensitive indexing weight to the document terms.

1. Amati and Rijsbergen [27] used the Bernoulli model of randomness to propose an indexing scheme, while the authors propose a cooccurrence measure. The derivation proceeds along similar lines but in a completely different context.
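A minimal sketch of how (9) could be computed from the corpus statistics of Section 3.1 is given below. It is illustrative only; the guard for the degenerate cases $N_{ij} = 0$ and $N_{ij} = N_j$ is an implementation choice, not part of the derivation.

import math

def bernoulli_association(ti, tj, N, df, co_df):
    """Self-information of the N_ij cooccurrences of t_i in the N_j documents
    containing t_j, using the Stirling-approximated form of Eq. (9)."""
    N_j = df[tj]
    N_ij = co_df.get(tuple(sorted((ti, tj))), 0)
    if N_ij == 0 or N_ij == N_j:
        return 0.0  # guard: the approximated form needs 0 < N_ij < N_j
    p_i = df[ti] / N          # prior probability of t_i, Eq. (3)
    p_co = N_ij / N_j         # observed cooccurrence rate
    return (0.5 * math.log2(2 * math.pi * N_ij * (1 - N_ij / N_j))
            + N_ij * math.log2(p_co / p_i)
            + (N_j - N_ij) * math.log2((1 - p_co) / (1 - p_i)))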
We return to our hypothesis $H_1$: The ratio of topical words is higher in a summary of a document than in the original document. Is this hypothesis empirically supported for the proposed measure, i.e., are the words in a human-generated summary of a document more lexically associated than the words in the original document? To investigate these
questions, empirical evidence was sought from the DUC
data sets, which are the benchmark data sets for the
evaluation of text summarization.
For each data set, the Bernoulli lexical association
between the indexing terms was calculated using (9). For
each document, the average lexical association between the
document terms was calculated (stop words were not used).
Similarly, average lexical association was computed for the
target summary (reference summary) provided in the corpus.^2 For example, if a document/summary has $W$ words (excluding the stop words), the average lexical association ($AvLex$^3) for the document/summary was calculated as

$$AvLex = \frac{\sum_{i=1}^{W} \sum_{j=1, j \ne i}^{W} a_{ij}}{W(W-1)}, \qquad (10)$$

where $a_{ij}$ corresponds to the lexical association between the $i$th and $j$th words in the document/summary, which was calculated a priori using the whole data set. For the DUC01 data set, the $AvLex$ for the target summary was 2.76 as compared to 2.49 for the original documents. Similarly, for the DUC02 data set, the $AvLex$ for the target summary was 3.29 as compared to 2.89 for the original documents. The difference was statistically significant to a large degree. Fig. 1 compares the distribution of the average lexical association of the document and target summary using the Bernoulli measure for the two DUC data sets. The sample size for each data set is the number of unique documents in that data set, i.e., 303 for DUC01 and 533 for DUC02.
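For reference, (10) amounts to averaging the pairwise association over all ordered pairs of distinct words. A minimal sketch, with `assoc` standing for any pairwise association function (e.g., the Bernoulli measure above), could look as follows.

def average_lexical_association(words, assoc):
    """AvLex of Eq. (10) for a document or summary given as a list of
    (non-stop-word) terms; assoc(ti, tj) is the pairwise association."""
    n = len(words)
    if n < 2:
        return 0.0
    total = sum(assoc(words[i], words[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))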
As is evident from Fig. 1, the distribution is shifted
toward the right for the target summary, when compared
with the original document. Though the lexical association
of summary documents shows a slight shift toward the left
also, it is small as compared to the shift toward higher
lexical association. We quantified the probability mass
corresponding to the left and right shift for both the data
sets. For DUC01, the probability mass corresponding to the
shift toward left is 0.192 and for the shift toward right is
0.439. For DUC02, the probability mass is 0.098 and 0.576
corresponding to the shift toward left and shift toward
right, respectively. Therefore, using the Bernoulli associa-
tion measure, a summary document obtains a higher
average lexical association than the original document in
most cases. Since a document summary is expected to
contain a higher proportion of topical terms, as compared to
the original document, it can be stated that the Bernoulli
model-based lexical association can distinguish between a
topical and nontopical term. Thus, the hypothesis H
2
can be
modified as H
2
: The Bernoulli model-based lexical
association measure can distinguish between a topical and
nontopical term and can be used to give a context-sensitive
indexing weight to the document terms.
At this point, we study the behavior of some other lexical
association measures to justify that the proposed Bernoulli
lexical association measure is a much better fit for $H_2$. The
following measures are considered:
- Point-wise mutual information (PMI) [49]:

$$a_{ij}(PMI) = \log\left(\frac{N \cdot N_{ij}}{N_i N_j}\right). \qquad (11)$$

- Mutual information (MI) [50]:

$$a_{ij}(MI) = \sum_{e_i \in \{0,1\}} \sum_{e_j \in \{0,1\}} p(e_i, e_j)\log\frac{p(e_i, e_j)}{p(e_i)\,p(e_j)}, \qquad (12)$$

where the binary variables $e_i$ and $e_j$ indicate the presence/absence of terms $t_i$ and $t_j$.

Fig. 1. Comparison of the distribution of average lexical association of document and target summary for the Bernoulli measure for (a) DUC01 and (b) DUC02 data sets.

TABLE 1. Average Lexical Association, Computed over the DUC01 Data Set Using Various Association Measures

TABLE 2. Average Lexical Association, Computed over the DUC02 Data Set Using Various Association Measures

2. If a document has more than one target summary, an average was taken over the average lexical association for each summary.
3. We will also use $AvLex_{Doc}$ and $AvLex_{Sum}$ to denote the corresponding measure for a document and a summary, respectively.
Tables 1 and 2 compare the average lexical association of
the documents and summaries, computed over the DUC01
and DUC02 data sets, respectively, for various lexical
association measures. The second and third columns
denote the average lexical association computed using
(10) over the document and target summary, respectively,
using the corresponding lexical association measures. The
Student's t-test and the Wilcoxon signed rank test were performed to determine whether the difference was statistically significant. The significance values (p-values) corresponding to the t-test and the Wilcoxon signed rank test are reported in the fourth and fifth columns of these tables, respectively. The last column shows the ratio of the average of the average lexical association of the target summaries ($(AvLex_{Sum})_{av}$) to the average of the average lexical association of the documents ($(AvLex_{Doc})_{av}$).
From Tables 1 and 2, it is clear that by using the PMI
measure, the lexical association between document terms
is higher than between the summary terms. Therefore, the
PMI measure may not be a suitable choice for the possible
application in document summarization, which will be
verified later in Section 4. Using the MI and Bernoulli
measure, on the other hand, the average lexical association
between the terms in human summary is higher than that in
the original document. As verified by the two different
statistical tests, the difference is statistically significant
using both these association measures and therefore, the
hypothesis holds true for both the MI and Bernoulli
measures. However, the significance level as well as the
ratio of average lexical association between the target
summary and original document is much higher for the
Bernoulli measure as compared to the MI measure. Thus,
the proposed Bernoulli measure is a better fit for $H_2$.
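For completeness, the two baseline measures of (11) and (12) can be computed from the same document-level counts $N$, $N_i$, and $N_{ij}$ used for the Bernoulli measure. The sketch below is illustrative only; the handling of zero counts is an implementation choice.

import math

def pmi(ti, tj, N, df, co_df):
    """Point-wise mutual information of Eq. (11)."""
    N_ij = co_df.get(tuple(sorted((ti, tj))), 0)
    if N_ij == 0:
        return float("-inf")  # PMI is undefined for terms that never cooccur
    return math.log((N * N_ij) / (df[ti] * df[tj]))

def mutual_information(ti, tj, N, df, co_df):
    """Mutual information of Eq. (12) over presence/absence of t_i and t_j."""
    N_ij = co_df.get(tuple(sorted((ti, tj))), 0)
    # Document counts for the four (e_i, e_j) presence/absence combinations.
    counts = {(1, 1): N_ij,
              (1, 0): df[ti] - N_ij,
              (0, 1): df[tj] - N_ij,
              (0, 0): N - df[ti] - df[tj] + N_ij}
    mi = 0.0
    for (ei, ej), c in counts.items():
        if c == 0:
            continue
        p_joint = c / N
        p_ei = (df[ti] if ei else N - df[ti]) / N
        p_ej = (df[tj] if ej else N - df[tj]) / N
        mi += p_joint * math.log(p_joint / (p_ei * p_ej))
    return mi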
3.3 Context-Based Word Indexing
Given the lexical association measure between two terms in a document from hypothesis $H_2$, the next task is to calculate the context sensitive indexing weight of each term in a document using hypothesis $H_3$. A graph-based iterative algorithm is used to find the context sensitive indexing weight of each term. Given a document $D_i$, a document graph $G$ is built. Let $G = \langle V, E \rangle$ be an undirected graph that reflects the relationships between the terms in the document $D_i$. $V = \{v_j \mid 1 \le j \le |V|\}$ denotes the set of vertices, where each vertex is a term appearing in the document. $E$ is a matrix of dimensions $|V| \times |V|$. Each edge $e_{jk} \in E$ corresponds to the lexical association value between the terms corresponding to the vertices $v_j$ and $v_k$. The lexical association between a term and itself is set to 0. To use the PageRank-based algorithm, $E$ is normalized as $\tilde{E} = (\tilde{E}_{j,k})_{|V| \times |V|}$ to make the sum of each row equal to 1. $\tilde{E}$ is defined as

$$\tilde{E}_{j,k} = \begin{cases} \dfrac{e_{jk}}{\sum_{k=1}^{|V|} e_{jk}}, & j \ne k, \\ 0, & \text{otherwise}. \end{cases} \qquad (13)$$

Based on the graph $G$ and the normalized association matrix $\tilde{E}$, the context-sensitive indexing weight of each word $v_j$ in a document $D_i$, denoted by $indexWt(v_j)$, is to be calculated. It can be found in a recursive way using the PageRank-based algorithm. Algorithm 3.1 describes the pseudocode of the proposed algorithm. $\mu$ is the damping factor. For implementation, $indexWt(v_j)$ is initialized to 1.0 for all the document terms. $prevWt(v_j)$ is a buffer that stores the indexing weights of the previous iteration. The convergence of the iterative algorithm is achieved when the difference between the scores computed at two successive iterations falls below a given threshold $\epsilon$.
Algorithm 3.1. CONTEXTBASEDWORDINDEXING($\tilde{E}$, $\mu$, $\epsilon$)
  initialize $indexWt(v_j) \leftarrow 1 \; \forall j$; $error \leftarrow 1$
  while $error \ge \epsilon$
    do
      $error \leftarrow 0$
      for $j \leftarrow 1$ to $|V|$
        do
          $prevWt(v_j) \leftarrow indexWt(v_j)$
          $indexWt(v_j) \leftarrow \mu \cdot \sum_{k \ne j} indexWt(v_k) \cdot \tilde{E}_{k,j} + \dfrac{1-\mu}{|V|}$
          $error \leftarrow error + (indexWt(v_j) - prevWt(v_j))^2$
      $error \leftarrow \sqrt{error}$
  return $indexWt$
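A compact rendering of Algorithm 3.1 in Python is given below for illustration. It is not the authors' implementation; the clamping of negative association values to zero is an added safeguard, while the row normalization of (13), the PageRank-style update, and the convergence test follow the pseudocode above.

import numpy as np

def context_based_word_indexing(terms, assoc, mu=0.85, eps=1e-4):
    """Context-sensitive indexing weights in the spirit of Algorithm 3.1.
    terms: the unique terms of one document; assoc(ti, tj): lexical association."""
    V = len(terms)
    E = np.zeros((V, V))
    for j in range(V):
        for k in range(V):
            if j != k:
                # Negative associations are clamped to zero (implementation choice).
                E[j, k] = max(assoc(terms[j], terms[k]), 0.0)
    # Row-normalized association matrix E~ of Eq. (13).
    E_tilde = E / (E.sum(axis=1, keepdims=True) + 1e-12)
    index_wt = np.ones(V)
    error = 1.0
    while error >= eps:
        prev_wt = index_wt
        # indexWt(v_j) = mu * sum_{k != j} indexWt(v_k) * E~_{k,j} + (1 - mu)/|V|
        index_wt = mu * (E_tilde.T @ prev_wt) + (1.0 - mu) / V
        error = float(np.sqrt(np.sum((index_wt - prev_wt) ** 2)))
    return dict(zip(terms, index_wt))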
3.4 Sentence Similarity Using the Context-Based
Indexing
The model described above gives a context-sensitive
indexing weight to each document term. The next step is
to use these indexing weights to calculate the similarity
between any two sentences. Given a sentence $s_j$ in the document $D_i$, the sentence vector is built using the $indexWt$ values computed as per Algorithm 3.1. The sentence vector $\vec{s}_j$ is constructed such that if a term $v_k$ appears in $s_j$, it is given the weight $indexWt(v_k)$; otherwise, it is given a weight of 0. The similarity between two sentences $s_j$ and $s_l$ is computed using the dot product, i.e., $sim(s_j, s_l) = \vec{s}_j \cdot \vec{s}_l$.
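Because a sentence vector carries $indexWt(v_k)$ for exactly the terms the sentence contains, the dot product reduces to a sum over the shared terms. A minimal sketch, assuming the indexing weights are available as a dictionary, is shown below.

def sentence_similarity(sent_a, sent_b, index_wt):
    """Dot product of the two sentence vectors built from indexWt: only the
    terms shared by both sentences contribute, each with weight indexWt^2."""
    shared = set(sent_a) & set(sent_b)
    return sum(index_wt.get(t, 0.0) ** 2 for t in shared)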
Besides using the new sentence similarity measure, the paradigm presented in Wan and Xiao [1] and described in Section 2 is used for calculating the scores of the sentences. The proposed method will be denoted by "bern", corresponding to the Bernoulli measure. Section 4 reports the
experiments using the proposed model.
4 EXPERIMENTS
4.1 Evaluation Setup
The benchmark data sets from the DUC are used to evaluate
the text summarization systems. These data sets are
distributed through TREC. The data sets from DUC01 [51]
and DUC02 [52] are used for the experiments (consistent with the experiments reported in Wan and Xiao [1]). The aim of
the tasks in DUC01 and DUC02 was to evaluate generic
summaries for a document with an approximate length of
100 words. The data sets consist of English news articles,
collected from TREC-9 for a single document summariza-
tion task. The DUC01 data set contains 303 unique
documents, which can broadly be categorized into 30 news
topics. The DUC02 data set contains 533 unique documents,
categorized into 59 news topics. These data sets also contain the reference summaries for each document.
While there are many approaches to evaluate the quality
of system generated summaries [53], [54], ROUGE [55], [56]
is the most widely used toolkit for the evaluation of the
system generated summaries. ROUGE measures the summary quality by counting overlapping units such as the n-gram, word sequences, and word pairs between the candidate summary (system generated) and the reference summary. An n-gram recall measure (ROUGE-N) computed through the ROUGE toolkit is given as

$$\text{ROUGE-N} = \frac{\sum_{S \in \{\text{ReferenceSummaries}\}} \sum_{\text{n-gram} \in S} Count_{match}(\text{n-gram})}{\sum_{S \in \{\text{ReferenceSummaries}\}} \sum_{\text{n-gram} \in S} Count(\text{n-gram})}, \qquad (14)$$

where $n$ refers to the length of the n-gram, $Count(\text{n-gram})$ is the number of n-grams in the reference summaries, and $Count_{match}(\text{n-gram})$ is the maximum number of n-grams cooccurring in the candidate summary and the reference summary. Of all the scores reported by the ROUGE toolkit, the ROUGE-1 (unigram-based) and ROUGE-2 (bigram-based) scores have been shown to be in close agreement with human judgment [55].
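As an illustration of (14), the following sketch computes the clipped n-gram recall for a candidate summary against one or more reference summaries. It is a simplified stand-in for the ROUGE toolkit, not a replacement for it.

from collections import Counter

def rouge_n(candidate, references, n=1):
    """Clipped n-gram recall in the spirit of Eq. (14); candidate and each
    reference are token lists."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate)
    matched, total = 0, 0
    for ref in references:
        ref_counts = ngrams(ref)
        total += sum(ref_counts.values())
        # Count_match: overlap clipped by the candidate's n-gram counts.
        matched += sum(min(c, cand.get(g, 0)) for g, c in ref_counts.items())
    return matched / total if total else 0.0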
In the DUC01 data set, only one reference summary is
provided for each document. The DUC02 data set, on the
other hand, contains multiple reference summaries for each
document. The "-l" option in the ROUGE toolkit is used to
truncate the system generated summaries to length 100. The
ROUGE-1 (unigram-based) and ROUGE-2 (bigram-based)
scores returned by the toolkit are reported.
4.2 Sensitivity Analysis
For the context-based word indexing algorithm, the parameters $\epsilon$ (threshold) and $\mu$ (damping factor) were determined through experiments over the DUC02 data set, with the Bernoulli lexical association measure used with the IntraLink model. The parameter $\epsilon$ was varied in the range $\{0.1, 0.01, 0.001, 0.0001\}$ and the parameter $\mu$ was varied in the range $[0.05, 0.95]$ with a step size of 0.1. The sensitivity of the results with respect to the parameter $\epsilon$ is shown in Table 3. Fig. 2 demonstrates the sensitivity of the ROUGE-1 and ROUGE-2 scores with respect to the parameter $\mu$.
From Table 3, the results were quite insensitive to the choice of $\epsilon$ in the range $\{0.1, 0.01, 0.001, 0.0001\}$. For the rest of the experiments, $\epsilon$ was set to 0.0001 as it gave the best results. For $\epsilon = 0.0001$, on average, 5.6 iterations were required per document for the proposed algorithm to converge. From Fig. 2, the results were not very sensitive to the parameter $\mu$ in the range $[0.65, 0.95]$; $\mu = 0.85$ was chosen for the rest of the experiments as it gave the best results.
4.3 Comparison of Various Systems
For comparison, only the IntraLink and UniformLink models
are used in this paper. IntraLink is the simplest model,
which uses a single document. UniformLink was shown to
perform better than InterLink by Wan and Xiao [1], while both UniformLink and InterLink are nearly the same in terms of computational complexity.
First, we compare the performance of various lexical
association measures for the document summarization.
Table 4 compares the performance of various lexical
association measures over the DUC01 and DUC02 data sets
with IntraLink as the baseline model.
From Table 4, it is clear that the proposed Bernoulli-based
lexical association measure outperforms the PMI and MI
measures. While the PMI measure gave slight improvements
over DUC02, the performance was worse than the baseline
for DUC01. It clearly corresponds to the empirical evidence,
as shown in Tables 1 and 2. The MI measure provided
Fig. 2. Sensitivity analysis with respect to the damping factor $\mu$ for DUC02 IntraLink+Bernoulli for (a) ROUGE-1 and (b) ROUGE-2 scores.
TABLE 3. Sensitivity Analysis with Respect to the Threshold $\epsilon$ for DUC02 IntraLink+Bernoulli
TABLE 4
Comparison of Various Lexical Association
Measures with the DUC Data Sets
improvements over the baseline for both data sets, with the
only exception being the ROUGE-2 score for the DUC01 data
set. However, the improvements were much higher for the
Bernoulli measure, clearly validating the choice of the
Bernoulli measure for the hypothesis $H_2$ in this paper.
For the UniformLink model, document similarities can
be calculated either via the cosine similarity or using the
indexing weights obtained using the association measure in
the proposed model. "+neB" will denote the computation of neighboring documents using the association measure; "-neB" will denote the use of cosine similarity for the neighborhood construction. The comparison with both the
DUC01 and DUC02 data sets is shown in Table 5. Please
note that the baseline model UniformLink is the same as
UniformLink-neB, since it uses the cosine similarity for
the computation of neighboring documents. We do not
report the experiments with the variation UniformLink+neB since it does not use the context-based word
indexing for the sentence similarity computation.
As is clear from Table 5, both the systems bern+neB and bern-neB outperform the baseline UniformLink model. The improvements are more visible on the ROUGE-2 score than on ROUGE-1. bern+neB will be selected as the
proposed enhancement for the UniformLink model as
the improvements obtained were higher. It also makes the
system consistent as only the context-based word indexing
weights will be used for all the computations.
Once the systems bern and bern+neB are selected as the enhancements over the IntraLink and UniformLink models, respectively, Tables 6 and 7 compare the proposed enhancements when applied with the IntraLink and UniformLink methods over the DUC01 and DUC02 data sets, respectively. For the UniformLink method, the number of neighboring documents is set to $k = 10$ for all experiments, as reported in [1]. The proposed model is denoted by adding "bern" and "bern+neB" to the corresponding baseline models IntraLink and UniformLink, respectively. For each system, the ROUGE-1 and ROUGE-2 scores, as returned by the ROUGE toolkit, are provided, along with the 95 percent confidence interval, shown in square brackets.
The results in Tables 6 and 7 lead to the following
observations:
- Using the word indexing by the Bernoulli cooccurrence measure always outperforms the corresponding baseline model.
- For the DUC01 data set, the IntraLink system performed better than the UniformLink system (the difference is visible on the ROUGE-1 scores). On the other hand, for the DUC02 data set, the UniformLink system achieved a better performance. Applying the Bernoulli model also gives the same results.
- For both data sets, the Bernoulli model applied in the simplest setting (IntraLink+bern) outperforms both the baseline systems (IntraLink/UniformLink) on both the ROUGE-1 and ROUGE-2 results.
- The improvements provided by the proposed model are much more visible on ROUGE-2 than on ROUGE-1. ROUGE-2 measures the bigram-based similarity and, therefore, more closely resembles the syntactic similarity between two summaries.
TABLE 5
Comparison Results with the DUC Data Sets for
the UniformLink Model
TABLE 8
Summary of Typical Participating Systems in DUC2002
TABLE 6
Comparison Results with DUC01
TABLE 7
Comparison Results with DUC02
We now compare the performance of our system to the
actual participating systems in the DUC2002. Table 8 gives a
short summary of the typical participating systems in
DUC2002 with high ROUGE scores. Table 9 shows the
comparison results of our system with these five systems.
From Table 9, we can see that our system performs comparably to the systems 31, 29, and 27, which were
among the best participating systems in DUC2002. Though
the performance of the systems 28 and 21 is better than our
system, it is to be noted that the systems 28 and 21 used the
supervised model for sentence extraction and the super-
vised techniques have been shown to perform better than
the unsupervised methods.
5 DISCUSSION
The experimental results shown in the previous section,
along with the empirical evidence shown in Fig. 1 validate
the hypotheses proposed in this paper. In this section, a
document from the DUC data sets is taken as an example to
show how the proposed method produces a better
summary than the baseline models. First, the original
document from the DUC data set is shown (see Fig. 3)
accompanied by the corresponding reference summary (see
Fig. 4), as also provided with the data set. Then the
summaries obtained by both the baseline models, IntraLink
(see Fig. 5) and UniformLink (see Fig. 6) are shown. Finally,
the summary obtained by applying the proposed Bernoulli
model over the simplest baseline (i.e., IntraLink+bern) is
shown (see Fig. 7).
TABLE 9
Comparison with Participating Systems in DUC2002
Fig. 3. The original document AP890704-0043 in DUC2001.
Fig. 4. The target summary for AP890704-0043.
Fig. 5. Summary produced by IntraLink for AP890704-0043.
The summaries shown above clearly reflect the advan-
tage offered by the proposed Bernoulli-based word indexing
model. The first two sentences in Fig. 7 are very similar to the first and third sentences in the manual
summary (see Fig. 4). However, these sentences do not
appear in the summary provided by IntraLink. In the
summary provided by UniformLink, only the first sentence
appears at the third position. These two sentences contain a
lot of contextual words such as communist, rebels,
violence, police, and so on. Since the proposed indexing
model gives an indexing weight using the lexical association
with all other words, the weights of the contextual words are increased, which is reflected in the sentence similarity values. Therefore, these sentences become more central in the IntraLink+Bernoulli method than in the baseline models, which do not use context-based word indexing.
6 CONCLUSIONS AND FUTURE WORK
In this paper, the Bernoulli model of randomness has been
used to develop a graph-based ranking algorithm for calculating how informative each of the document terms is. We proposed three hypotheses, which were used for the development. Hypothesis $H_1$ is based on the intuition that a document summary contains the most salient information in a text and, therefore, the terms in a summary should be more lexically associated with each other than in the original document. This hypothesis was translated into an empirical relation: the average lexical association between the terms in document summaries should be higher than the average lexical association between the terms in the original documents. The authors conjectured that if a lexical association measure follows $H_2$, the average lexical association will be higher in summaries than in the documents, i.e., it will follow $H_1$. The authors also conjectured that if a lexical association measure follows $H_2$, it will follow $H_3$, that is, it can be used to give a context sensitive indexing weight to the document terms. This hypothesis was correlated with the ROUGE scores.
The Bernoulli model of randomness was used to derive a novel lexical association measure, and it was compared with two different lexical association measures, PMI and MI. We use below the subscripts $H_1$, $H_2$, and $H_3$ on the different measures to denote the degree to which they satisfy the empirical relation associated with that particular hypothesis. From Tables 1, 2, and 4,

$$Bernoulli_{H_1} > MI_{H_1} > PMI_{H_1},$$
$$Bernoulli_{H_3} > MI_{H_3} > PMI_{H_3},$$

and thus, the hypotheses $H_1$ and $H_3$ correlate for the three association measures, and it looks rational enough to use this result as evidence that

$$Bernoulli_{H_2} > MI_{H_2} > PMI_{H_2}.$$
Although PMI does not support the empirical relation associated with $H_1$, the other two measures, MI and Bernoulli, do support this relation. From Table 4, we can see that the results with PMI for the empirical relation associated with $H_3$ were also not conclusive, and it gave worse performance over DUC01. Because the empirical relations associated with $H_1$ and $H_3$ are consistent, the fact that PMI does not support $H_1$ more likely indicates that PMI does not follow the hypothesis $H_2$.
Using the proposed Bernoulli association measure, the lexical association between the terms in a target summary is higher than the association between the terms in a document. Thus, the proposed measure satisfies Hypothesis $H_1$. It has been used along with the PageRank algorithm to give a context-sensitive indexing weight to the document terms to validate Hypothesis $H_3$. The indexing weights thus obtained have been used to calculate the sentence similarity values. The underlying assumption in $H_3$ was that the sentence similarity thus obtained would be context sensitive and, therefore, should provide improvements in the sentence extraction task for document summarization.
The concept of topical and nontopical terms was used to
modify the indexing weights of the document terms. Analysis of some of the documents and the corresponding summaries revealed the specific advantage offered by the proposed Bernoulli model-based context sensitive indexing.
The experiments performed using the benchmark DUC
data sets confirm that the new context-based word indexing
gives better performance than the baseline models. The
Fig. 7. Summary produced by IntraLink+Bernoulli for AP890704-0043.
Fig. 6. Summary produced by UniformLink for AP890704-0043.
interesting observation was that the Bernoulli model for
word indexing, when applied with the simplest IntraLink
model, performs better than all the baseline models. The
proposed model is general and, since the sentence similarity measure is central to any sentence extraction model, it is applicable to any of the sentence extraction techniques. It
was empirically verified that the proposed Bernoulli lexical
association measure outperforms the PMI and MI measures
for the context sensitive word indexing, used for the
document summarization.
As per the overall results shown in Tables 6 and 7,
although the maximum increase seen by any of the
techniques is around 1 percent absolute gain in ROUGE-1
and ROUGE-2, it is noteworthy that the improvements are
quite consistent. Using the IntraLink+bern model, around
6 percent relative gain in ROUGE-2 score was obtained for
both the DUC data sets. It is also to be noted that the
authors have proposed a general enhancement model
which can be used in combination with the existing
enhancement models to provide additional improvements.
Since the improvements are consistent (though not very
large), they can only have a positive impact on a human user of
the systems. As has been shown through an example
summary, the proposed model gives a higher weight to
the content-carrying terms and as a result, the sentences are
presented in such a way that the most informative sentences
appear on the top of the summary, making a positive
impact on the quality of the summary.
The model performed better than all the baseline models.
However, the cooccurrences were calculated only on the
given data set, which was actually very small (303 and
533 documents for DUC01 and DUC02, respectively).
Further work would calculate the lexical association over
a large corpus such as the BNC corpus and investigate
whether it leads to additional improvements. It will also be
instructive to perform experiments using this context
sensitive document indexing approach for the information
retrieval task. Additionally, it will be interesting to see the
applicability of the proposed Bernoulli lexical association
measure in other natural language applications such as
word classification, word sense disambiguation and knowl-
edge acquisition.
ACKNOWLEDGMENTS
This research is supported under the Centre of Excellence in
Intelligent Systems project, funded by the Northern Ireland
Integrated Development Fund and InvestNI.
REFERENCES
[1] X. Wan and J. Xiao, Exploiting Neighborhood Knowledge for
Single Document Summarization and Keyphrase Extraction,
ACM Trans. Information Systems, vol. 28, pp. 8:1-8:34, http://doi.
acm.org/10.1145/1740592.1740596, June 2010.
[2] K.S. Jones, Automatic Summarising: Factors and Directions,
Advances in Automatic Text Summarization, pp. 1-12, MIT Press,
1998.
[3] L.L. Bando, F. Scholer, and A. Turpin, Constructing Query-
Biased Summaries: A Comparison of Human and System
Generated Snippets, Proc. Third Symp. Information Interaction in
Context, pp. 195-204, http://doi.acm.org/10.1145/1840784.
1840813, 2010.
[4] X. Wan, Towards a Unified Approach to Simultaneous Single-
Document and Multi-Document Summarizations, Proc. 23rd Intl
Conf. Computational Linguistics, pp. 1137-1145, http://portal.
acm.org/citation.cfm?id=1873781.1873909, 2010.
[5] X. Wan, An Exploration of Document Impact on Graph-Based
Multi-Document Summarization, Proc. Conf. Empirical Methods in
Natural Language Processing, pp. 755-762, http://portal.acm.org/
citation.cfm?id=1613715.1613811, 2008.
[6] Q.L. Israel, H. Han, and I.-Y. Song, Focused Multi-Document
Summarization: Human Summarization Activity vs. Automated
Systems Techniques, J. Computing Sciences in Colleges, vol. 25,
pp. 10-20, http://portal.acm.org/citation.cfm?id=1747137.
1747140, May 2010.
[7] C. Shen and T. Li, Multi-Document Summarization via the
Minimum Dominating Set, Proc. 23rd Intl Conf. Computational
Linguistics, pp. 984-992, http://portal.acm.org/citation.cfm?id=
1873781.1873892, 2010.
[8] X. Wan and J. Yang, Multi-Document Summarization Using
Cluster-Based Link Analysis, Proc. 31st Ann. Intl ACM SIGIR
Conf. Research and Development in Information Retrieval, pp. 299-306,
http://doi.acm.org/10.1145/1390334.1390386, 2008.
[9] D. Wang, T. Li, S. Zhu, and C. Ding, Multi-Document
Summarization via Sentence-Level Semantic Analysis and Sym-
metric Matrix Factorization, Proc. 31st Ann. Intl ACM SIGIR Conf.
Research and Development in Information Retrieval, pp. 307-314,
http://doi.acm.org/10.1145/1390334.1390387, 2008.
[10] S. Harabagiu and F. Lacatusu, Using Topic Themes for Multi-
Document Summarization, ACM Trans. Information Systems,
vol. 28, pp. 13:1-13:47, http://doi.acm.org/10.1145/1777432.
1777436, July 2010.
[11] H. Daume III and D. Marcu, Bayesian Query-Focused Summar-
ization, Proc. 21st Intl Conf. Computational Linguistics and the 44th
Ann. meeting of the Assoc. for Computational Linguistics, pp. 305-312,
http://dx.doi.org/10.3115/1220175.1220214, 2006.
[12] D.M. Dunlavy, D.P. OLeary, J.M. Conroy, and J.D. Schlesinger,
QCS: A System for Querying, Clustering and Summarizing
Documents, Information Processing and Management, vol. 43,
pp. 1588-1605, http://portal.acm.org/citation.cfm?id=1284916.
1285163, Nov. 2007.
[13] R. Varadarajan, V. Hristidis, and T. Li, Beyond Single-Page Web
Search Results, IEEE Trans. Knowledge and Data Eng., vol. 20,
no. 3, pp. 411-424, Mar. 2008.
[14] L.-W. Ku, L.-Y. Lee, T.-H. Wu, and H.-H. Chen, Major Topic
Detection and Its Application to Opinion Summarization, Proc.
28th Ann. Intl ACM SIGIR Conf. Research and Development in
Information Retrieval, pp. 627-628, http://doi.acm.org/10.1145/
1076034.1076161, 2005.
[15] E. Lloret, A. Balahur, M. Palomar, and A. Montoyo, Towards
Building a Competitive Opinion Summarization System: Chal-
lenges and Keys, Proc. Human Language Technologies: The 2009
Ann. Conference of the North Am. Ch. Assoc. for Computational
Linguistics, Companion Vol. : Student Research Workshop and Doctoral
Consortium, pp. 72-77, http://portal.acm.org/citation.cfm?id=
1620932.1620945, 2009.
[16] J.G. Conrad, J.L. Leidner, F. Schilder, and R. Kondadadi, Query-
Based Opinion Summarization for Legal Blog Entries, Proc. 12th
Intl Conf. Artificial Intelligence and Law, pp. 167-176, http://doi.
acm.org/10.1145/1568234.1568253, 2009.
[17] H. Nishikawa, T. Hasegawa, Y. Matsuo, and G. Kikui, Opinion
Summarization with Integer Linear Programming Formulation for
Sentence Extraction and Ordering, Proc. 23rd Intl Conf. Computa-
tional Linguistics: Posters, pp. 910-918, http://portal.acm.org/
citation.cfm?id=1944566.1944671, 2010.
[18] C.C. Chen and M.C. Chen, TSCAN: A Content Anatomy
Approach to Temporal Topic Summarization, IEEE Trans.
Knowledge and Data Eng., vol. 24, no. 1, pp. 170-183, Jan. 2012.
[19] D.R. Radev, H. Jing, M. Stys, and D. Tam, Centroid-Based
Summarization of Multiple Documents, Information Processing
and Management, vol. 40, pp. 919-938, http://portal.acm.org/
citation.cfm?id=1036118.1036121, Nov. 2004.
[20] Z. Harris, Mathematical Structures of Language. Wiley, 1968.
[21] K. Morita, E.-S. Atlam, M. Fuketra, K. Tsuda, M. Oono, and J.-i.
Aoe, Word Classification and Hierarchy using Co-Occurrence
Word Information, Information Processing and Management,
vol. 40, pp. 957-972, http://portal.acm.org/citation.cfm?id=
1036118.1036123, Nov. 2004.
[22] T. Yoshinari, E.-S. Atlam, K. Morita, K. Kiyoi, and J.-i. Aoe,
Automatic Acquisition for Sensibility Knowledge Using Co-
Occurrence Relation, Intl J. Computer Applications in Technology,
vol. 33, pp. 218-225, http://portal.acm.org/citation.cfm?id=
1477782.1477797, Dec. 2008.
[23] B. Andreopoulos, D. Alexopoulou, and M. Schroeder, Word
Sense Disambiguation in Biomedical Ontologies with Term Co-
Occurrence Analysis and Document Clustering, Intl J. Data
Mining and Bioinformatics, vol. 2, pp. 193-215, http://portal.acm.
org/citation.cfm?id=1413934.1413935, Sept. 2008.
[24] P. Goyal, L. Behera, and T. McGinnity, Query Representation
Through Lexical Assoc. for Information Retrieval, IEEE Trans.
Knowledge and Data Eng., vol. 24, no. 12, pp. 2260-2273, Dec. 2011.
[25] K. Cai, C. Chen, and J. Bu, Exploration of Term Relationship for
Bayesian Network Based Sentence Retrieval, Pattern Recognition
Letters, vol. 30, no. 9, pp. 805-811, 2009.
[26] H. Li, Word Clustering and Disambiguation Based on Co-
Occurrence Data, Natl Language Eng., vol. 8, pp. 25-42, http://
portal.acm.org/citation.cfm?id=973860.973863, Mar. 2002.
[27] G. Amati and C.J. Van Rijsbergen, Probabilistic Models of
Information Retrieval Based on Measuring the Divergence from
Randomness, ACM Trans. Information Systems, vol. 20, pp. 357-
389, http://doi.acm.org/10.1145/582415.582416, Oct. 2002.
[28] D.E. Losada and L. Azzopardi, Assessing Multivariate Bernoulli
Models for Information Retrieval, ACM Trans. Information
Systems, vol. 26, pp. 17:1-17:46, http://doi.acm.org/10.1145/
1361684.1361690, June 2008.
[29] L. Page, S. Brin, R. Motwani, and T. Winograd, The PageRank
Citation Ranking: Bringing Order to the Web, technical report,
Stanford Digital Library Technologies Project, http://citeseer.ist.
psu.edu/page98pagerank.html, 1998.
[30] J. Turner and E. Charniak, Supervised and Unsupervised
Learning for Sentence Compression, Proc. 43rd Ann. Meeting on
Assoc. for Computational Linguistics, pp. 290-297, http://dx.
doi.org/10.3115/1219840.1219876, 2005.
[31] J. Clarke and M. Lapata, Discourse Constraints for Document
Compression, Computational Linguistics, vol. 36, pp. 411-441, 2010.
[32] H. Daumé III and D. Marcu, "A Noisy-Channel Model for Document Compression," Proc. 40th Ann. Meeting on Assoc. for Computational Linguistics, pp. 449-456, http://dx.doi.org/10.3115/1073083.1073159, 2002.
[33] C.-Y. Lin, "Improving Summarization Performance by Sentence Compression: A Pilot Study," Proc. Sixth Int'l Workshop Information Retrieval with Asian Languages, pp. 1-8, http://dx.doi.org/10.3115/1118935.1118936, 2003.
[34] D. Zajic, B.J. Dorr, J. Lin, and R. Schwartz, "Multi-Candidate Reduction: Sentence Compression as a Tool for Document Summarization Tasks," Information Processing and Management, vol. 43, pp. 1549-1570, http://portal.acm.org/citation.cfm?id=1284916.1285161, Nov. 2007.
[35] G. Erkan and D.R. Radev, "LexRank: Graph-Based Lexical Centrality as Salience in Text Summarization," J. Artificial Intelligence Research, vol. 22, pp. 457-479, http://portal.acm.org/citation.cfm?id=1622487.1622501, Dec. 2004.
[36] R. Mihalcea and P. Tarau, "A Language Independent Algorithm for Single and Multiple Document Summarization," Proc. Int'l Joint Conf. Natural Language Processing (IJCNLP), http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.136.1125, 2005.
[37] H.P. Luhn, "The Automatic Creation of Literature Abstracts," IBM J. Research and Development, vol. 2, pp. 159-165, http://dx.doi.org/10.1147/rd.22.0159, Apr. 1958.
[38] C.-Y. Lin and E. Hovy, "Identifying Topics by Position," Proc. Fifth Conf. Applied Natural Language Processing, pp. 283-290, http://dx.doi.org/10.3115/974557.974599, 1997.
[39] E. Hovy and C.-Y. Lin, "Automated Text Summarization and the Summarist System," Proc. Workshop Held at Baltimore, Maryland (TIPSTER '98), pp. 197-214, http://dx.doi.org/10.3115/1119089.1119121, 1998.
[40] R. Katragadda, P. Pingali, and V. Varma, "Sentence Position Revisited: A Robust Light-Weight Update Summarization Baseline Algorithm," Proc. Third Int'l Workshop Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies, pp. 46-52, http://portal.acm.org/citation.cfm?id=1572433.1572440, 2009.
[41] Y. Ouyang, W. Li, Q. Lu, and R. Zhang, "A Study on Position Information in Document Summarization," Proc. 23rd Int'l Conf. Computational Linguistics: Posters, pp. 919-927, http://portal.acm.org/citation.cfm?id=1944566.1944672, 2010.
[42] H.P. Edmundson, "New Methods in Automatic Extracting," J. ACM, vol. 16, pp. 264-285, http://doi.acm.org/10.1145/321510.321519, Apr. 1969.
[43] C.-Y. Lin and E. Hovy, "The Automated Acquisition of Topic Signatures for Text Summarization," Proc. 18th Conf. Computational Linguistics, pp. 495-501, http://dx.doi.org/10.3115/990820.990892, 2000.
[44] R. Barzilay and M. Elhadad, "Using Lexical Chains for Text Summarization," Proc. ACL Workshop Intelligent Scalable Text Summarization, pp. 10-17, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.29.6428, 1997.
[45] Y. Ko and J. Seo, "An Effective Sentence-Extraction Technique Using Contextual Information and Statistical Approaches for Text Summarization," Pattern Recognition Letters, vol. 29, pp. 1366-1371, http://portal.acm.org/citation.cfm?id=1371261.1371371, July 2008.
[46] R. Alguliev and R. Alyguliev, "Summarization of Text-Based Documents with a Determination of Latent Topical Sections and Information-Rich Sentences," Automatic Control and Computer Sciences, vol. 41, pp. 132-140, http://dx.doi.org/10.3103/S0146411607030030, 2007.
[47] J. Hintikka, "On Semantic Information," Physics, Logic, and History, pp. 147-172, Springer, 1970.
[48] G. Marsaglia and J.C.W. Marsaglia, "A New Derivation of Stirling's Approximation of n!," Am. Math. Monthly, vol. 97, pp. 826-829, http://portal.acm.org/citation.cfm?id=96077.96084, Sept. 1990.
[49] K.W. Church and P. Hanks, "Word Association Norms, Mutual Information, and Lexicography," Computational Linguistics, vol. 16, no. 1, pp. 22-29, 1990.
[50] M. Karimzadehgan and C. Zhai, "Estimation of Statistical Translation Models Based on Mutual Information for Ad Hoc Information Retrieval," Proc. 33rd Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 323-330, http://doi.acm.org/10.1145/1835449.1835505, 2010.
[51] P. Over, "Introduction to DUC-2001: An Intrinsic Evaluation of Generic News Text Summarization Systems," Proc. DUC Workshop Text Summarization, 2001.
[52] P. Over and W. Liggett, "Introduction to DUC: An Intrinsic Evaluation of Generic News Text Summarization Systems," Proc. DUC Workshop Text Summarization, 2002.
[53] I. Mani, G. Klein, D. House, L. Hirschman, T. Firmin, and B. Sundheim, "Summac: A Text Summarization Evaluation," Nat'l Language Eng., vol. 8, pp. 43-68, http://portal.acm.org/citation.cfm?id=973860.973864, Mar. 2002.
[54] A. Nenkova, R. Passonneau, and K. McKeown, "The Pyramid Method: Incorporating Human Content Selection Variation in Summarization Evaluation," ACM Trans. Speech Language Processing, vol. 4, no. 2, pp. 1-23, http://doi.acm.org/10.1145/1233912.1233913, May 2007.
[55] C.-Y. Lin and E. Hovy, "Automatic Evaluation of Summaries Using N-Gram Co-Occurrence Statistics," Proc. Conf. North Am. Ch. Assoc. for Computational Linguistics on Human Language Technology, pp. 71-78, http://dx.doi.org/10.3115/1073445.1073465, 2003.
[56] C.-Y. Lin, G. Cao, J. Gao, and J.-Y. Nie, "An Information-Theoretic Approach to Automatic Evaluation of Summaries," Proc. Main Conf. Human Language Technology Conf. North Am. Chapter of the Assoc. of Computational Linguistics, pp. 463-470, http://dx.doi.org/10.3115/1220835.1220894, 2006.
Pawan Goyal received the BTech degree in electrical engineering from the Indian Institute of Technology Kanpur, India, in 2007 and the PhD degree from the Intelligent Systems Research Centre, Faculty of Computing and Engineering, University of Ulster, Magee campus, in 2011. He is currently a postdoctoral researcher at INRIA Paris Rocquencourt. His research interests include information retrieval, data mining, and computational linguistics.
Laxmidhar Behera (S'92-M'03-SM'03) received the BSc and MSc degrees in engineering from NIT Rourkela, India, in 1988 and 1990, respectively, and the PhD degree from IIT Delhi, India. He is currently a professor in the Department of Electrical Engineering, IIT Kanpur, India. He joined the Intelligent Systems Research Center, University of Ulster, United Kingdom, as a reader on sabbatical from IIT Kanpur from 2007-2009. He was an assistant professor at BITS Pilani from 1995-1999 and pursued his postdoctoral studies at the German National Research Center for Information Technology, GMD, Sankt Augustin, Germany, from 2000-2001. He has also worked as a visiting researcher/professor at FHG, Germany, and ETH Zurich, Switzerland. He has more than 170 papers to his credit published in refereed journals and presented in conference proceedings. His research interests include intelligent control, robotics, neural networks, and cognitive modeling. He is a senior member of the IEEE.
Thomas Martin McGinnity (M'82-SM'10) received the first class honours degree in physics and a doctorate from the University of Durham. He is a chartered engineer. He holds the post of professor of intelligent systems engineering within the Faculty of Computing and Engineering, University of Ulster, United Kingdom. He is currently a director of the Intelligent Systems Research Centre, which encompasses the research activities of approximately 100 researchers. He was formerly an associate dean of the faculty and a director of both the university's technology transfer company, Innovation Ulster, and a spin-out company, Flex Language Services. His current research interests include the creation of bioinspired intelligent computational systems which explore and model biological signal processing, particularly in relation to cognitive robotics, and computational neuroscience modeling of neurodegeneration. He is the author or coauthor of more than 250 research papers, and has been awarded both a Senior Distinguished Research Fellowship and a Distinguished Learning Support Fellowship in recognition of his contribution to teaching and research. He is a senior member of the IEEE and a fellow of the IET.