
Automatic Evaluation of Topic Coherence

David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin
NICTA Victoria Research Laboratory, Australia
Dept of Computer Science, University of California, Irvine
Dept of Computer Science and Software Engineering, University of Melbourne, Australia
Dept of Information Systems, University of Melbourne, Australia
newman@uci.edu, depthchargex@gmail.com,
kgrieser@csse.unimelb.edu.au, tb@ldwin.net

Abstract

This paper introduces the novel task of topic coherence evaluation, whereby a set of words, as generated by a topic model, is rated for coherence or interpretability. We apply a range of topic scoring models to the evaluation task, drawing on WordNet, Wikipedia and the Google search engine, and existing research on lexical similarity/relatedness. In comparison with human scores for a set of learned topics over two distinct datasets, we show a simple co-occurrence measure based on pointwise mutual information over Wikipedia data is able to achieve results for the task at or nearing the level of inter-annotator correlation, and that other Wikipedia-based lexical relatedness methods also achieve strong results. Google produces strong, if less consistent, results, while our results over WordNet are patchy at best.

1 Introduction

There has traditionally been strong interest within computational linguistics in techniques for learning sets of words (aka topics) which capture the latent semantics of a document or document collection, in the form of methods such as latent semantic analysis (Deerwester et al., 1990), probabilistic latent semantic analysis (Hofmann, 2001), random projection (Widdows and Ferraro, 2008), and more recently, latent Dirichlet allocation (Blei et al., 2003; Griffiths and Steyvers, 2004). Such methods have been successfully applied to a myriad of tasks including word sense discrimination (Brody and Lapata, 2009), document summarisation (Haghighi and Vanderwende, 2009), areal linguistic analysis (Daume III, 2009) and text segmentation (Sun et al., 2008). In each case, extrinsic evaluation has been used to demonstrate the effectiveness of the learned topics in the application domain, but standardly, no attempt has been made to perform intrinsic evaluation of the topics themselves, either qualitatively or quantitatively. In machine learning, on the other hand, researchers have modified and extended topic models in a variety of ways, and evaluated intrinsically in terms of model perplexity (Wallach et al., 2009), but there has been less effort on qualitative understanding of the semantic nature of the learned topics.

This research seeks to fill the gap between topic evaluation in computational linguistics and machine learning, in developing techniques to perform intrinsic qualitative evaluation of learned topics. That is, we develop methods for evaluating the quality of a given topic, in terms of its coherence to a human. After learning topics from a collection of news articles and a collection of books, we ask humans to decide whether individual learned topics are coherent, in terms of their interpretability and association with a single over-arching semantic concept. We then propose models to predict topic coherence, based on resources such as WordNet, Wikipedia and the Google search engine, and methods ranging from ontological similarity to link overlap and term co-occurrence. Over topics learned from two distinct datasets, we demonstrate that there is remarkable inter-annotator agreement on what is a coherent topic, and additionally that our methods based on Wikipedia are able to achieve nearly perfect agreement with humans over the evaluation of topic coherence.

This research forms part of a larger research agenda on the utility of topic modelling in gisting and visualising document collections, and ultimately enhancing search/discovery interfaces over document collections (Newman et al., to appear-a).

Evaluating topic coherence is a component of the larger question of what makes a good topic, what characteristics of a document collection make it more amenable to topic modelling, and how the potential of topic modelling can be harnessed for human consumption (Newman et al., to appear-b).

2 Related Work

Most earlier work on intrinsically evaluating learned topics has been on the basis of perplexity results, where a model is learned on a collection of training documents, then the log probability of the unseen test documents is computed using that learned model. Usually perplexity is reported, which is the inverse of the geometric mean per-word likelihood. Perplexity is useful for model selection and adjusting parameters (e.g. number of topics T), and is the standard way of demonstrating the advantage of one model over another. Wallach et al. (2009) presented efficient and unbiased methods for computing perplexity and evaluating almost any type of topic model.

While statistical evaluation of topic models is reasonably well understood, there has been much less work on evaluating the intrinsic semantic quality of topics learned by topic models, which could have a far greater impact on the overall value of topic modeling for end-user applications. Some researchers have started to address this problem, including Mei et al. (2007), who presented approaches for automatic labeling of topics (which is core to the question of coherence and semantic interpretability), and Griffiths and Steyvers (2006), who applied topic models to word sense discrimination tasks. Misra et al. (2008) used topic modelling to identify semantically incoherent documents within a document collection (vs. coherent topics, as targeted in this research). Chang et al. (2009) presented the first human evaluation of topic models by creating a task where humans were asked to identify which word in a list of five topic words had been randomly switched with a word from another topic. This work showed some possibly counter-intuitive results, where in some cases humans preferred models with higher perplexity. This type of result shows the need for further exploring measures other than perplexity for evaluating topic models. In earlier work, we carried out preliminary experimentation using pointwise mutual information and Google results to evaluate topic coherence over the same set of topics as used in this research (Newman et al., 2009).

Part of this research takes inspiration from the work on automatic evaluation in machine translation (Papineni et al., 2002) and automatic summarisation (Lin, 2004). Here, the development of automated methods with high correlation with human subjects has opened the door to large-scale automated evaluation of system outputs, revolutionising the respective fields. While our aspirations are more modest, the basic aim is the same: to develop a fully-automated method for evaluating a well-grounded task, which achieves near-human correlation.

3 Topic Modelling

In order to evaluate topic modelling, we require a topic model and set of topics for a given document collection. While the evaluation methodology we describe generalises to any method which generates sets of words, all of our experiments are based on Latent Dirichlet Allocation (LDA, aka Discrete Principal Component Analysis), on the grounds that it is a state-of-the-art method for generating topics. LDA is a Bayesian graphical model for text document collections represented by bags-of-words (see Blei et al. (2003), Griffiths and Steyvers (2004), Buntine and Jakulin (2004)). In a topic model, each document in the collection of D documents is modelled as a multinomial distribution over T topics, where each topic is a multinomial distribution over W words. Typically, only a small number of words are important (have high likelihood) in each topic, and only a small number of topics are present in each document.

The collapsed Gibbs sampled topic model simultaneously learns the topics and the mixture of topics in documents by iteratively sampling the topic assignment z to every word in every document, using the Gibbs sampling update:

p(z_{id} = t \mid x_{id} = w, \mathbf{z}^{\neg id}) \propto \frac{N^{\neg id}_{wt} + \beta}{\sum_w N^{\neg id}_{wt} + W\beta} \cdot \frac{N^{\neg id}_{td} + \alpha}{\sum_t N^{\neg id}_{td} + T\alpha}

where z_{id} = t is the assignment of the ith word in document d to topic t, x_{id} = w indicates that the current observed word is w, and \mathbf{z}^{\neg id} is the vector of all topic assignments not including the current word. N_{wt} represents integer count arrays (with the subscripts denoting what is counted), and \alpha and \beta are Dirichlet priors.
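To make the update concrete, the following Python sketch (our illustration, not the authors' implementation) runs one sweep of collapsed Gibbs sampling over a corpus of word-id documents; since the per-document normaliser \sum_t N^{\neg id}_{td} + T\alpha is constant across topics for a given token, it is dropped when sampling.

import numpy as np

def gibbs_sweep(docs, z, Nwt, Ntd, Nt, alpha, beta, rng=None):
    """One sweep of collapsed Gibbs sampling for LDA.

    docs : list of documents, each a list of word ids in 0..W-1
    z    : current topic assignments, same nested shape as docs
    Nwt  : word-topic counts, shape (W, T)
    Ntd  : topic-document counts, shape (T, D)
    Nt   : per-topic totals (column sums of Nwt), shape (T,)
    """
    rng = rng or np.random.default_rng()
    W, T = Nwt.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t_old = z[d][i]
            # remove the current token from the counts to obtain the "not-id" counts
            Nwt[w, t_old] -= 1; Ntd[t_old, d] -= 1; Nt[t_old] -= 1
            # unnormalised conditional over topics; the document-length term is
            # constant in t and therefore omitted
            p = (Nwt[w, :] + beta) / (Nt + W * beta) * (Ntd[:, d] + alpha)
            t_new = rng.choice(T, p=p / p.sum())
            # add the token back under its newly sampled topic
            Nwt[w, t_new] += 1; Ntd[t_new, d] += 1; Nt[t_new] += 1
            z[d][i] = t_new

After a burn-in period of such sweeps, the topics p(w|t) can be read off the counts via the MAP estimate given next.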
The maximum a posteriori (MAP) estimates of the topics p(w|t), t = 1...T, are given by:

p(w \mid t) = \frac{N_{wt} + \beta}{\sum_w N_{wt} + W\beta}

We will follow the convention of representing a topic via its top-n words, ordered by p(w|t). Here, we use the top-ten words, as they usually provide sufficient detail to convey the subject of a topic, and distinguish one topic from another. For the remainder of this paper, we will refer to individual topics by their list of top-ten words, denoted by w = (w_1, ..., w_10).

4 Topic Evaluation Methods

We experiment with scoring methods based on WordNet (Section 4.1), Wikipedia (Section 4.2) and the Google search engine (Section 4.3). In the case of Google, we query for the entire topic, but with WordNet and Wikipedia, this takes the form of scoring each word pair in a given topic w based on the component words (w_1, ..., w_10). Given some (symmetric) word-similarity measure D(w_i, w_j), two straightforward ways of producing a combined score from the 45 (i.e. \binom{10}{2}) word-pair scores are: (1) the arithmetic mean, and (2) the median, as follows:

Mean-D-Score(w) = mean\{D(w_i, w_j) : i, j \in 1...10, i < j\}

Median-D-Score(w) = median\{D(w_i, w_j) : i, j \in 1...10, i < j\}

Intuitively, the median seems the more natural representation, as it is less affected by outlier scores, but we experiment with both, and fall back on empirical verification of which is the better combination method.

4.1 WordNet similarity

WordNet (Fellbaum, 1998) is a lexical ontology that represents word sense via synsets, which are structured in a hypernym/hyponym hierarchy (nouns) or hypernym/troponym hierarchy (verbs). WordNet additionally links both synsets and words via lexical relations including antonymy, morphological derivation and holonymy/meronymy.

In parallel with the development of WordNet, a number of computational methods for calculating the semantic relatedness/similarity between synset pairs (i.e. sense-specified word pairs) have been developed, as we outline below. These methods apply to synset rather than word pairs, so to generate a single score for a given word pair, we look up each word in WordNet and exhaustively generate scores for each sense pairing defined by them, and calculate their arithmetic mean.[1]

The majority of the methods (all methods other than HSO, VECTOR and LESK) are restricted to operating strictly over hierarchical links within a single hierarchy. As the verb and noun hierarchies are not connected (other than via derivational links), this means that it is generally not possible to calculate the similarity between noun and verb senses, for example. In such cases, we simply drop the synset pairing in question from our calculation of the mean.

The least common subsumer (LCS) is a feature common to a number of the measures, and is defined as the deepest node in the hierarchy that subsumes both of the synsets under question.

For all our experiments over WordNet, we use the WordNet::Similarity package.

[1] We also experimented with the median, and trialled filtering the set of senses in a variety of ways, e.g. using only the first sense (the sense with the highest prior) for a given word, or using only the word senses associated with the POS with the highest prior. In all cases, the overall trend was for the correlation with the human scores to drop relative to the mean, so we only present the numbers for the mean in this paper.

Path distance (PATH)

The simplest of the WordNet-based measures is to count the number of nodes visited while going from one word to another via the hypernym hierarchy. The path distance between two nodes is defined as the number of nodes that lie on the shortest path between two words in the hierarchy. This count of nodes includes the beginning and ending word nodes.
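As a rough illustration of this word-pair scoring strategy (a sketch of ours using NLTK's WordNet interface, rather than the Perl WordNet::Similarity package used in the paper), a PATH-style topic score can be computed by exhaustively pairing senses, averaging over sense pairings, and then combining the 45 word-pair scores by mean or median:

from itertools import combinations
from statistics import mean, median
from nltk.corpus import wordnet as wn  # requires the WordNet corpus to be downloaded

def word_pair_path(w1, w2):
    """PATH-style score for a word pair: average path similarity over all sense pairings."""
    scores = [s1.path_similarity(s2)
              for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
    scores = [s for s in scores if s is not None]  # drop disconnected (e.g. noun-verb) pairings
    return mean(scores) if scores else None

def topic_score(topic_words, pair_score=word_pair_path, combine=mean):
    """Combine the 45 word-pair scores for a 10-word topic by mean (or median)."""
    pair_scores = [pair_score(wi, wj) for wi, wj in combinations(topic_words, 2)]
    pair_scores = [s for s in pair_scores if s is not None]
    return combine(pair_scores)

# e.g. topic_score("space earth moon science scientist light nasa mission planet mars".split(),
#                  combine=median)  # Median-D-Score variant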
Leacock-Chodorow (LCH)

The measure of semantic similarity devised by Leacock et al. (1998) finds the shortest path between two WordNet synsets (sp(c_1, c_2)) using hypernym and synonym relationships. This path length is then scaled by the maximum depth of WordNet (D), and the log likelihood taken:

sim_{lch}(c_1, c_2) = -\log \frac{sp(c_1, c_2)}{2D}

Wu-Palmer (WUP)

Wu and Palmer (1994) proposed to scale the depth of the two synset nodes (depth_{c_1} and depth_{c_2}) by the depth of their LCS (depth(lcs_{c_1,c_2})):

sim_{wup}(c_1, c_2) = \frac{2 \cdot depth(lcs_{c_1,c_2})}{depth_{c_1} + depth_{c_2} + 2 \cdot depth(lcs_{c_1,c_2})}

The scaling means that specific terms (deeper in the hierarchy) that are close together are more semantically similar than more general terms, which have a short path distance between them. Only hypernym relationships are used in this measure, as the LCS is defined by the common member in the concepts' hypernym paths.

Hirst-St Onge (HSO)

Hirst and St-Onge (1998) define a measure of semantic similarity based on the length and tortuosity of the path between nodes. Hirst and St-Onge attribute directions (up, down and horizontal) to the larger set of WordNet relationships, and identify the path from one word to another utilising all of these relationships. The relatedness score is then computed by the weighted sum of the path length between the two words (len(c_1, c_2)) and the number of turns the path makes (turns(c_1, c_2)) to take this route:

rel_{hso}(c_1, c_2) = C - len(c_1, c_2) - k \cdot turns(c_1, c_2)

where C and k are constants. Additionally, a set of restrictions is placed on the path so that it may not be more than a certain length, may not contain more than a set number of turns, and may only take turns in certain directions.

Resnik Information Content (RES)

Resnik (1995) presents a method for weighting edges in WordNet (avoiding the assumption that all edges between nodes have equal importance), by weighting edges between nodes by their frequency of use in textual corpora.

Resnik found that the most effective measure of comparison using this methodology was to measure the Information Content (IC(c) = -log p(c)) of the subsumer with the greatest Information Content from the set of all concepts that subsume the two initial concepts (S(c_1, c_2)) being compared:

sim_{res}(c_1, c_2) = \max_{c \in S(c_1, c_2)} [-\log p(c)]

Lin (LIN)

Lin (1998) expanded on the Information Theoretic approach presented by Resnik by scaling the Information Content of each node by the Information Content of their LCS:

sim_{lin}(c_1, c_2) = \frac{2 \log p(lcs_{c_1,c_2})}{\log p(c_1) + \log p(c_2)}

This measure contrasts the joint content of the two concepts with the difference between them.

Jiang-Conrath (JCN)

Jiang and Conrath (1997) define a measure that utilises the components of the information content of the LCS in a different manner:

sim_{jcn}(c_1, c_2) = \frac{1}{IC(c_1) + IC(c_2) - 2 \cdot IC(lcs_{c_1,c_2})}

Instead of defining commonality and difference as with Lin's measure, the key determinant is the specificity of the two nodes compared with their LCS.
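The information-content-based measures above (RES, LIN and JCN) are also exposed by NLTK's WordNet interface; as an illustrative sketch (ours, using Brown-corpus IC counts, which may differ from the corpus WordNet::Similarity was configured with), a word-level score can be obtained by averaging over noun-sense pairings:

from statistics import mean
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic  # requires the 'wordnet_ic' data package

brown_ic = wordnet_ic.ic('ic-brown.dat')  # information content estimated from the Brown corpus

def ic_word_similarity(w1, w2, measure="lin"):
    """Average an IC-based synset similarity (res / lin / jcn) over all noun-sense pairings.

    The IC-based measures are only defined within a single hierarchy (here: nouns),
    so pairings outside it are simply dropped, as described in the paper.
    """
    scores = []
    for s1 in wn.synsets(w1, pos=wn.NOUN):
        for s2 in wn.synsets(w2, pos=wn.NOUN):
            sim = getattr(s1, f"{measure}_similarity")(s2, brown_ic)
            scores.append(sim)
    return mean(scores) if scores else None

# e.g. ic_word_similarity("engine", "valve", measure="jcn")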
Lesk (LESK)

Lesk (1986) proposed a significantly different approach to lexical similarity to that proposed in the methods presented above, using the lexical overlap in dictionary definitions (or glosses) to disambiguate word sense. The sense definitions that contain the most words in common indicate the most likely sense of the word given its co-occurrence with similar word senses. Banerjee and Pedersen (2002) adapted this method to utilise WordNet sense glosses rather than dictionary definitions, and to expand the definitions via ontological links, and it is this method we experiment with in this paper.

Vector (VECTOR)

Schütze (1998) uses the words surrounding a term in a piece of text to form a context vector that describes the context in which the word sense appears. For a set of words associated with a target sense, a context vector is computed as the centroid vector of these words. The centroid context vectors each represent a word sense. To compare word senses, the cosine similarity of the context vectors is used.

4.2 Wikipedia

In the last few years, there has been a surge of interest in using Wikipedia to calculate semantic similarity, using the Wikipedia article content, in-article links and document categories (Strube and Ponzetto, 2006; Gabrilovich and Markovitch, 2007; Milne and Witten, 2008). We present a selection of such methods below. There are a number of Wikipedia-based scoring methods which we do not present results for here (notably Strube and Ponzetto (2006) and Gabrilovich and Markovitch (2007)), due to their computational complexity and uncertainty about the full implementation details of the methods.

As with WordNet, a given word will often have multiple entries in Wikipedia, grouped in a disambiguation page. For MIW, RACO and DOCSIM, we apply the same strategy as we did with WordNet, in exhaustively calculating the pairwise scores between the sets of documents associated with each term, and averaging across them.

Milne-Witten (MIW)

Milne and Witten (2008) adapted the Resnik (1995) methodology to utilise the count of links pointing to an article. As Wikipedia is self-referential (articles link to related articles), no external data is needed to find the "referred-to-edness" of a concept. Milne and Witten use an adapted Information Content measure that weights the number of links from one article to another (c_1 -> c_2) by the total number of links to the second article:

w(c_1 \rightarrow c_2) = |c_1 \rightarrow c_2| \cdot \log \frac{|W|}{\sum_{x \in W} |x \rightarrow c_2|}

where x is an article in W, Wikipedia. This measure provides the similarity of one article to another; however, it is asymmetrical. The above metric is used to find the weights of all outlinks from the two articles being compared:

\vec{c}_1 = (w(c_1 \rightarrow l_1), w(c_1 \rightarrow l_2), \ldots, w(c_1 \rightarrow l_n))
\vec{c}_2 = (w(c_2 \rightarrow l_1), w(c_2 \rightarrow l_2), \ldots, w(c_2 \rightarrow l_n))

for the set of links l_1, ..., l_n that is the union of the sets of outlinks from both articles. The overall similarity of the two articles is then calculated by taking the cosine similarity of the two vectors.

Related Article Concept Overlap (RACO)

We also determine the category overlap of two articles by examining the outlinks of both articles, in the form of the Related Article Concept Overlap (RACO) measure. The concept overlap is given by the intersection of the two sets of categories collected from the outlinks of each article:

overlap(c_1, c_2) = \Big( \bigcup_{l \in ol(c_1)} cat(l) \Big) \cap \Big( \bigcup_{l \in ol(c_2)} cat(l) \Big)

where ol(c_1) is the set of outlinks from article c_1, and cat(l) is the set of categories of which the article at outlink l is a member. To account for article size (and differing numbers of outlinks), the Jaccard coefficient is used:

rel_{raco}(c_1, c_2) = \frac{\big| \big( \bigcup_{l \in ol(c_1)} cat(l) \big) \cap \big( \bigcup_{l \in ol(c_2)} cat(l) \big) \big|}{\big| \big( \bigcup_{l \in ol(c_1)} cat(l) \big) \cup \big( \bigcup_{l \in ol(c_2)} cat(l) \big) \big|}

Document Similarity (DOCSIM)

In addition to these two measures of semantic relatedness, we experiment with simple cosine similarity of the text of Wikipedia articles as a measure of semantic relatedness.

Term Co-occurrence (PMI)

Another variant is to treat Wikipedia as a single meta-document and score word pairs using term co-occurrence. Here, we calculate the pointwise mutual information (PMI) of each word pair, estimated from the entire corpus of over two million English Wikipedia articles (1 billion words). PMI has been studied variously in the context of collocation extraction (Pecina, 2008), and is one measure of the statistical independence of observing two words in close proximity. Using a sliding window of 10 words to identify co-occurrence, we compute the PMI of a given word pair (w_i, w_j), following Newman et al. (2009), as:

PMI(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)}
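As an indicative sketch (ours; the paper does not spell out the exact counting and normalisation used), PMI scores of this kind can be computed from a tokenised dump treated as one long token stream, counting co-occurrence within a 10-word sliding window:

import math
from collections import Counter

def pmi_scores(tokens, vocab, window=10):
    """PMI for word pairs over a token stream, with co-occurrence defined as
    appearing within `window` tokens of each other. A rough sketch: the exact
    probability estimates used in the paper are not specified here."""
    vocab = set(vocab)
    unigram = Counter(t for t in tokens if t in vocab)
    pair = Counter()
    for i, wi in enumerate(tokens):
        if wi not in vocab:
            continue
        for wj in tokens[i + 1:i + window]:
            if wj in vocab and wj != wi:
                pair[tuple(sorted((wi, wj)))] += 1
    n = sum(unigram.values())
    n_pairs = sum(pair.values())
    return {
        (a, b): math.log((c / n_pairs) / ((unigram[a] / n) * (unigram[b] / n)))
        for (a, b), c in pair.items()
    }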
Selected high-scoring topics (unanimous score = 3):
[NEWS]  space earth moon science scientist light nasa mission planet mars ...
[NEWS]  health disease aids virus vaccine infection hiv cases infected asthma ...
[BOOKS] steam engine valve cylinder pressure piston boiler air pump pipe ...
[BOOKS] furniture chair table cabinet wood leg mahogany piece oak louis ...

Selected low-scoring topics (unanimous score = 1):
[NEWS]  king bond berry bill ray rate james treas byrd key ...
[NEWS]  dog moment hand face love self eye turn young character ...
[BOOKS] soon short longer carried rest turned raised filled turn allowed ...
[BOOKS] act sense adv person ppr plant sax genus applied dis ...

Table 1: A selection of high-scoring and low-scoring topics

4.3 Search engine-based similarity

Finally, we present two search engine-based scoring methods, based on Newman et al. (2009). In this case the external data source is the entire World Wide Web, via the Google search engine. Unlike the methods presented above, here we query for the topic in its entirety,[2] meaning that we return a topic-level score rather than scores for individual word or word-sense pairs. In each case, we mark each search term with the advanced search option "+" to search for the terms exactly as is and prevent Google from using synonyms or lexical variants of the term. An example query is: +space +earth +moon +science +scientist +light +nasa +mission +planet +mars.

[2] All queries were run on 15/09/2009.

Google title matches (TITLES)

Firstly, we score topics by the relative occurrence of their component words in the titles of documents returned by Google:

Google-titles-match(w) = \sum_{i=1}^{10} \sum_{j=1}^{|V|} \mathbf{1}[w_i = v_j]

where v_j (j = 1, ..., |V|) are all the unique terms mentioned in the titles from the top-100 search results, and \mathbf{1} is the indicator function used to count matches. For example, in the top-100 results for our query above, there are 194 matches with the ten topic words, so Google-titles-match(w) = 194.

Google log hits (LOGHITS)

Second, we issue queries as above, but return the log of the number of hits for our query:

Google-log-hits(w) = \log_{10}(\text{number of results from the search for } w)

where w is the search string +w_1 +w_2 +w_3 ... +w_10. For example, our query above returns 171,000 results, so Google-log-hits(w) = 5.2, and the URL titles from the top-100 results include a total of 194 matches with the ten topic words, so for this topic Google-titles-match(w) = 194.
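A minimal sketch of the title-matching count, assuming the result titles have already been retrieved for the +-quoted topic query (we do not attempt to reproduce the querying itself, and this is only one plausible reading of the summation above, in which each title contributes the number of distinct topic words it mentions):

def google_titles_match(topic_words, result_titles):
    """Count matches between topic words and search-result titles.

    `result_titles` is assumed to hold the title strings from the top-100
    results for the topic query. Each title contributes the number of
    distinct topic words appearing in it.
    """
    topic = {w.lower() for w in topic_words}
    return sum(len(topic & set(title.lower().split())) for title in result_titles)

Google-log-hits is then simply the base-10 log of the hit count reported for the same query.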
5 Experimental Setup

We learned topics for two document collections: a collection of news articles, and a collection of books. These collections were chosen to produce sets of topics that have more variable quality than one typically observes when topic modeling highly uniform content. The collection of D = 55,000 news articles was selected from English Gigaword, and the collection of D = 12,000 books was downloaded from the Internet Archive. We refer to these collections as NEWS and BOOKS, respectively.

Standard procedures were used to tokenize each collection and create the bags-of-words. We learned topic models of NEWS and BOOKS using T = 200 and T = 400 topics, respectively. We randomly selected a total of 237 topics from the two collections for user scoring. We asked N = 9 users to score each of the 237 topics on a 3-point scale, where 3 = "useful" (coherent) and 1 = "useless" (less coherent).

We provided annotators with a rubric and guidelines on how to judge whether a topic was useful or useless. In addition to showing several examples of useful and useless topics, we instructed users to decide whether the topic was to some extent coherent, meaningful, interpretable, subject-heading-like, and something-you-could-easily-label. For our purposes, the usefulness of a topic can be thought of as whether one could imagine using the topic in a search interface to retrieve documents about a particular subject. One indicator of usefulness is the ease with which one could think of a short label to describe a topic.

Table 1 shows a selection of high- and low-scoring topics, as scored by the N = 9 users. The first topic illustrates the notion of labelling coherence, as "space exploration", e.g., would be an obvious label for the topic. The low-scoring topics display little coherence, and one would not expect them to be useful as categories or facets in a search interface. Note that the useless topics from both collections are not chance artifacts produced by the models, but are in fact stable and robust statistical features of the data sets.

Resource        Method    Median  Mean
WordNet         HSO       0.29    0.34
WordNet         JCN       0.08    0.22
WordNet         LCH       0.18    0.07
WordNet         LESK      0.38    0.37
WordNet         LIN       0.18    0.25
WordNet         PATH      0.19    0.11
WordNet         RES       0.10    0.13
WordNet         VECTOR    0.07    0.20
WordNet         WUP       0.03    0.10
Wikipedia       RACO      0.61    0.63
Wikipedia       MIW       0.69    0.60
Wikipedia       DOCSIM    0.45    0.50
Wikipedia       PMI       0.78    0.77
Google          TITLES        0.80
Google          LOGHITS       0.46
Gold-standard   IAA       0.79    0.73

Table 2: Spearman rank correlation values for the different scoring methods over the NEWS dataset (best-performing method for each resource underlined; best-performing method overall in boldface)

Resource        Method    Median  Mean
WordNet         HSO       0.15    0.59
WordNet         JCN       0.20    0.19
WordNet         LCH       0.31    0.15
WordNet         LESK      0.53    0.53
WordNet         LIN       0.09    0.28
WordNet         PATH      0.29    0.12
WordNet         RES       0.57    0.66
WordNet         VECTOR    0.08    0.27
WordNet         WUP       0.41    0.26
Wikipedia       RACO      0.62    0.69
Wikipedia       MIW       0.68    0.70
Wikipedia       DOCSIM    0.59    0.60
Wikipedia       PMI       0.74    0.77
Google          TITLES        0.51
Google          LOGHITS       0.19
Gold-standard   IAA       0.82    0.78

Table 3: Spearman rank correlation values for the different scoring methods over the BOOKS dataset (best-performing method for each resource underlined; best-performing method overall in boldface)

6 Results

The results for the different topic scoring methods over the NEWS and BOOKS collections are presented in Tables 2 and 3, respectively. In each table, we separate out the scoring methods into those based on WordNet (from Section 4.1), those based on Wikipedia (from Section 4.2), and those based on Google (from Section 4.3).

As stated in Section 4, we experiment with two methods for combining the word-pair scores (for all methods other than the two Google methods, which operate natively over a word set), namely the arithmetic mean and the median. We present the numbers for both methods in each table. In each case, we evaluate via Spearman rank correlation, reversing the sign of the calculated value for PATH (as it is the only instance of a distance metric, where the gold-standard is made up of similarity values).

We include the inter-annotator agreement (IAA) in the final row of each table, which we consider to be the upper bound for the task. This is calculated as the average Spearman rank correlation between each annotator and the mean/median of the remaining annotators for that topic.
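For concreteness, a minimal sketch of this leave-one-out agreement computation (ours, assuming the ratings are held in an annotators-by-topics array):

import numpy as np
from scipy.stats import spearmanr

def inter_annotator_agreement(scores, combine=np.median):
    """Mean leave-one-out Spearman correlation.

    scores : array of shape (n_annotators, n_topics) of topic ratings.
    Each annotator is correlated against the median (or mean) rating of the
    remaining annotators, and the correlations are averaged.
    """
    n_annotators = scores.shape[0]
    rhos = []
    for a in range(n_annotators):
        rest = combine(np.delete(scores, a, axis=0), axis=0)
        rhos.append(spearmanr(scores[a], rest).correlation)
    return float(np.mean(rhos))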
Encouragingly, there is relatively little difference in the IAA between the two datasets; the median-based calculation produces slightly higher values and is empirically the method of choice.[3]

[3] Note that the choice of mean or median for IAA is independent of that for the scoring methods, as they are combining different things: annotator scores on the one hand, and word/concept pair scores on the other.

Of all the topic scoring methods tested, PMI (term co-occurrence via simple pointwise mutual information) is the most consistent performer, achieving the best or near-best results over both datasets, and approaching or surpassing the inter-annotator agreement. This indicates both that the task of topic evaluation as defined in this paper is computationally tractable, and that word-pair-based co-occurrence is highly successful at modelling topic coherence.

Comparing the different resources, Wikipedia is far and away the most consistent performer, with PMI producing the best results, followed by MIW and RACO, and finally DOCSIM. There is relatively little difference in results between NEWS and BOOKS for the Wikipedia methods. Google achieves the best results over NEWS, for TITLES (actually slightly above the IAA), but the results fall away sharply over BOOKS. The reason for this can be seen in the sample topics in Table 1: the topics for BOOKS tend to be more varied in word class than those for NEWS, and contain fewer proper names; also, the genre of BOOKS is less well represented on the web. We hypothesise that Wikipedia's encyclopedic nature means that it has good coverage over both domains, and is thus more robust.

Turning to WordNet, the overall results are markedly better over BOOKS, again largely because of the relative sparsity of proper names in the resource. The results for individual methods are somewhat surprising. Whereas JCN and LCH have been shown to be two of the best-performing methods over lexical similarity tasks (Budanitsky and Hirst, 2005; Agirre et al., 2009), they perform abysmally at the topic scoring task. Indeed, the spread of results across the WordNet similarity methods (notably HSO, JCN, LCH, LIN, RES and WUP) is much greater than we had expected. The single most consistent method is LESK, which is based on lexical overlap in definition sentences and makes relatively modest use of the WordNet hierarchy. A supplementary evaluation where we filtered out all proper nouns from the topics (based on simple POS priors for each word, learned from an automatically-tagged version of the British National Corpus) led to a slight increase in results for the WordNet methods; the full results are omitted for reasons of space. In future work, we intend to carry out error analysis to determine why some of the methods performed so badly, or inconsistently across the two datasets.

There is no clear answer to the question of whether the mean or the median is the better method for combining the pair-wise scores.

7 Conclusions

We have proposed the novel task of topic coherence evaluation as a form of intrinsic topic evaluation with relevance to document search/discovery and visualisation applications. We constructed a gold-standard dataset of topic coherence scores over the output of a topic model for two distinct datasets, and evaluated a wide range of topic scoring methods over this dataset, drawing on WordNet, Wikipedia and the Google search engine. The single best-performing method was term co-occurrence within Wikipedia based on pointwise mutual information, which achieves results very close to the inter-annotator agreement for the task. Google was also found to perform well over one of the two datasets, while the results for the WordNet-based methods were overall surprisingly low.

Acknowledgements

NICTA is funded by the Australian government as represented by the Department of Broadband, Communications and the Digital Economy, and the Australian Research Council through the ICT Centre of Excellence programme. DN has also been supported by a grant from the Institute of Museum and Library Services, and a Google Research Award.

References

E Agirre, E Alfonseca, K Hall, J Kravalova, M Pasca, and A Soroa. 2009. A study on similarity and relatedness using distributional and WordNet-based approaches. In Proc. of HLT: NAACL 2009, pages 19-27, Boulder, Colorado.

S Banerjee and T Pedersen. 2002. An adapted Lesk algorithm for word sense disambiguation using WordNet. In Proc. of CICLing-02, pages 136-145.

DM Blei, AY Ng, and MI Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022.

S Brody and M Lapata. 2009. Bayesian word sense induction. In Proc. of EACL 2009, pages 103-111, Athens, Greece.

A Budanitsky and G Hirst. 2005. Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1):13-47.

WL Buntine and A Jakulin. 2004. Applying discrete PCA in data analysis. In Proc. of UAI 2004, pages 59-66.

J Chang, J Boyd-Graber, S Gerris, C Wang, and D Blei. 2009. Reading tea leaves: How humans interpret topic models. In Proc. of NIPS 2009.

H Daume III. 2009. Non-parametric Bayesian areal linguistics. In Proc. of HLT: NAACL 2009, pages 593-601, Boulder, USA.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6).

C Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, USA.

E Gabrilovich and S Markovitch. 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proc. of IJCAI-07, pages 1606-1611, Hyderabad, India.

T Griffiths and M Steyvers. 2004. Finding scientific topics. In Proc. of the National Academy of Sciences, volume 101, pages 5228-5235.

T Griffiths and M Steyvers. 2006. Probabilistic topic models. In Latent Semantic Analysis: A Road to Meaning.

A Haghighi and L Vanderwende. 2009. Exploring content models for multi-document summarization. In Proc. of HLT: NAACL 2009, pages 362-370, Boulder, USA.

G Hirst and D St-Onge. 1998. Lexical chains as representations of context for the detection and correction of malapropisms. In Fellbaum (1998), pages 305-332.

T Hofmann. 2001. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1):177-196.

JJ Jiang and DW Conrath. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of COLING-97, pages 19-33, Taipei, Taiwan.

C Leacock, GA Miller, and M Chodorow. 1998. Using corpus statistics and WordNet relations for sense identification. Computational Linguistics, 24(1):147-65.

M Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In Proc. of SIGDOC '86, pages 24-26, Toronto, Canada.

D Lin. 1998. Automatic retrieval and clustering of similar words. In Proc. of COLING/ACL '98, pages 768-774, Montreal, Canada.

C-Y Lin. 2004. ROUGE: a package for automatic evaluation of summaries. In Proc. of the ACL 2004 Workshop on Text Summarization Branches Out (WAS 2004), pages 74-81, Barcelona, Spain.

Q Mei, X Shen, and CX Zhai. 2007. Automatic labeling of multinomial topic models. In Proc. of KDD 2007, pages 490-499.

D Milne and IH Witten. 2008. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proc. of the AAAI Workshop on Wikipedia and Artificial Intelligence, pages 25-30, Chicago, USA.

H Misra, O Cappe, and F Yvon. 2008. Using LDA to detect semantically incoherent documents. In Proc. of CoNLL 2008, pages 41-48, Manchester, England.

D Newman, S Karimi, and L Cavedon. 2009. External evaluation of topic models. In Proc. of ADCS 2009, pages 11-18, Sydney, Australia.

D Newman, T Baldwin, L Cavedon, S Karimi, D Martinez, and J Zobel. to appear-a. Visualizing document collections and search results using topic mapping. Journal of Web Semantics.

D Newman, Y Noh, E Talley, S Karimi, and T Baldwin. to appear-b. Evaluating topic models for digital libraries. In Proc. of JCDL/ICADL 2010, Gold Coast, Australia.

K Papineni, S Roukos, T Ward, and W-J Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proc. of ACL 2002, pages 311-318, Philadelphia, USA.

P Pecina. 2008. Lexical Association Measures: Collocation Extraction. Ph.D. thesis, Charles University.

P Resnik. 1995. Using information content to evaluate semantic similarity in a taxonomy. In Proc. of IJCAI-95, pages 448-453, Montreal, Canada.

H Schütze. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1):97-123.

M Strube and SP Ponzetto. 2006. WikiRelate! Computing semantic relatedness using Wikipedia. In Proc. of AAAI-06, pages 1419-1424, Boston, USA.

Q Sun, R Li, D Luo, and X Wu. 2008. Text segmentation with LDA-based Fisher kernel. In Proc. of ACL-08: HLT, pages 269-272.

HM Wallach, I Murray, R Salakhutdinov, and DM Mimno. 2009. Evaluation methods for topic models. In Proc. of ICML 2009, page 139.

D Widdows and K Ferraro. 2008. Semantic Vectors: A scalable open source package and online technology management application. In Proc. of LREC 2008, Marrakech, Morocco.

Z Wu and M Palmer. 1994. Verb semantics and lexical selection. In Proc. of ACL-94, pages 133-138, Las Cruces, USA.

