[Figure 2 graphic: an input image with three detected regions (a: dog, b: person, c: sofa); per-region attribute scores (e.g. brown 0.32, striped 0.09, furry .26, wooden .2, feathered .04); per-pair preposition scores (e.g. near(a,b) 1, against(b,c) .67, beside(a,c) .5); the constructed CRF; the predicted labeling, e.g. <<null,person_b>,against,<brown,sofa_c>>, <<null,dog_a>,near,<null,person_b>>, <<null,dog_a>,beside,<brown,sofa_c>>; and the generated sentence: "This is a photograph of one person and one brown sofa and one dog. The person is against the brown sofa. And the dog is near the person, and beside the brown sofa."]
Figure 2. System flow for an example image: 1) object and stuff detectors find candidate objects, 2) each candidate region is processed by
a set of attribute classifiers, 3) each pair of candidate regions is processed by prepositional relationship functions, 4) A CRF is constructed
that incorporates the unary image potentials computed by 1-3, and higher order text based potentials computed from large document
corpora, 5) A labeling of the graph is predicted, 6) Sentences are generated based on the labeling.
documents are provided. The process of generation then becomes one of combining or summarizing relevant documents, in some cases driven by keywords estimated from the image content [13]. From the computer vision perspective these techniques might be analogous to first recognizing the scene shown in an image, and then retrieving a sentence based on the scene type. It is very unlikely that a retrieved sentence would be as descriptive of a particular image as the generated sentence in Fig. 1.

This paper pushes to make a tight connection between the particular image content and the sentence generation process. This is accomplished by detecting objects, modifiers (adjectives), and spatial relationships (prepositions) in an image, smoothing these detections with respect to a statistical prior obtained from descriptive text, and then using the smoothed results as constraints for sentence generation. Sentence generation is performed either using an n-gram language model [3, 22] or a simple template based approach [27, 4]. Overall, our approach can handle the potentially huge number of scenes that can be constructed by composing even a relatively small number of instances of several classes of objects in a variety of spatial relationships. Even for quite small numbers for each factor, the total number of such layouts is not possible to sample completely, and any set of images would have some particular bias. In order to avoid evaluating such a bias, we purposefully avoid whole-image features or scene/context recognition in our evaluation, although noting explicitly that it would be straightforward to include a scene node and appropriate potential functions in the model presented.

2. Related Work

Early work on connecting words and pictures for the purpose of automatic annotation and auto-illustration focused on associating individual words with image regions [2, 8]. In continuations of that work, and other work on image parsing and object detection, the spatial relationships between labeled parts (either detections or regions of images) were used to improve labeling accuracy, but the spatial relationships themselves were not considered outputs in their own right [24, 7, 16, 21, 15]. Estimates of spatial relationships between objects form an important part of the output of the computer vision aspect of our approach and are used to drive sentence generation.

There is a great deal of ongoing research on estimating attributes for use in computer vision [18, 9, 19, 14] that maps well to our process of estimating modifiers for objects in images. We use low-level features from Farhadi et al. [9] for modifier estimation. Our work combines priors for visually descriptive language with estimates of the modifiers based on image regions around object detections.

There is some recent work very close in spirit to our own. Yao et al. [26] look at the problem of generating text with a comprehensive system built on various hierarchical knowledge ontologies and using a human in the loop for hierarchical image parsing (except in specialized circumstances). In contrast, our work automatically mines knowledge about textual representation, and parses images fully automatically without a human operator and with a much simpler approach overall. Despite the simplicity of our framework, it is still a step toward more complex description generation compared to Farhadi et al.'s (also fully automatic) method based on parsing images into a meaning representation triple describing 1 object, 1 action, and 1 scene [10]. In their work, they use a single triple estimated for an image to retrieve sentences from a collection written to describe similar images. In contrast, our work detects multiple
[Figure 3 graphic: the CRF over Obj, Stuff, Attr, and Prep nodes (left) and its pairwise-converted form (right), with an auxiliary Z node introduced for each obj-prep-obj clique.]

across the image and create nodes for stuff categories with high scoring detections. Note that this means that the number of nodes in a graph constructed for an image depends on the number of object and stuff detections that fired in that image (something we have to correct for during parameter learning). For each object and stuff node we classify the appearance using a set of trained attribute classifiers and create
the amount of labeled image data that would be required is daunting. Instead we learn these from large text collections. By observing in text how people describe objects, attributes and prepositions between objects, we can model the relationships between node labels. Descriptions of the text based potentials are provided in Sec. 5.2.

4.1. Converting to Pairwise Potentials

Since preposition nodes describe the relationship between a preposition label and two object labels, they are most naturally modeled through trinary potential functions:

    ψ(obj_i, prep_ij, obj_j; textPr)    (6)

However, most CRF inference code accepts only unary and pairwise potentials. Therefore we convert this trinary potential into a set of unary and pairwise potentials through the introduction of an additional z node for each 3-clique of obj-prep-obj nodes (see Fig. 3). Each z node connecting two object nodes has domain O1 × P × O2, where O1 is the domain of object node 1, P is our set of prepositional relations, and O2 is the domain of object node 2. In this way the trinary potential is converted to a unary potential on z, ψ(z_ij; textPr), along with 3 pairwise potentials, one for each of object node 1, the preposition node, and object node 2, that enforce that the labels selected for each node are the same as the label selected for z:

    ψ(z_ij, obj_i) = 0 if z_ij(1) = obj_i, ∞ otherwise    (7)
    ψ(z_ij, prep_ij) = 0 if z_ij(2) = prep_ij, ∞ otherwise    (8)
    ψ(z_ij, obj_j) = 0 if z_ij(3) = obj_j, ∞ otherwise    (9)

4.2. CRF Learning

We take a factored learning approach to estimate the parameters of our CRF from 100 hand-labeled images. In our energy function (Eqns (1)-(5)), the α parameters represent the trade-off between image and text based potentials, the β parameters represent the weighting between image based potentials, and the γ parameters represent the weighting between text based potentials. In the first stage of learning we estimate the image parameters while ignoring the text based terms (by setting α1 to 0). To learn image potential weights we fix β0 to 1 and use grid search to find optimal values for β1 and β2. Next we fix the β parameters to their estimated values and learn the remaining parameters: the trade-off between image and text based potentials (the α parameters) and the weights for the text based potentials (the γ parameters). Here we set α0 and γ0 to 1 and use grid search over values of α1 and γ1 to find appropriate values.

It is important to score output labelings fairly for graphs with variable numbers of nodes (dependent on the number of object detections for an image). We use a scoring function that is graph size independent:

    obj_tf / N + (mod, obj)_tf / N + 2 (obj, prep, obj)_tf / (N (N − 1))

measuring the score of a predicted labeling as: a) the number of true obj labels minus the number of false obj labels, normalized by the number of objects, plus b) the number of true mod-obj label pairs minus the number of false mod-obj pairs, plus c) the number of true obj-prep-obj triples minus the number of false obj-prep-obj triples, normalized by the number of nodes and the number of pairs of objects (N choose 2).

4.3. CRF Inference

To predict the best labeling for an input image graph (both at test time and during parameter training) we utilize the sequential tree re-weighted message passing (TRW-S) algorithm introduced by Kolmogorov [17], which improves upon the original TRW algorithm from Wainwright et al. [25]. These algorithms are inspired by the problem of maximizing a lower bound on the energy. TRW-S modifies the TRW algorithm so that the value of the bound is guaranteed not to decrease. For our image graphs, the CRF constructed is relatively small (on the order of 10s of nodes). Thus, the inference process is quite fast, taking on average less than a second to run per image.

5. Potential Functions

In this section, we present our image based and descriptive language based potential functions. At a high level, the image potentials come from hand-designed detection strategies optimized on external training sets. In contrast, the text potentials are based on text statistics collected automatically from various corpora.

5.1. Image Based Potentials

ψ(obj_i; objDet) - Object and Stuff Potential

Object Detectors: We use an object detection system based on Felzenszwalb et al.'s mixtures of multi-scale deformable part models [12] to detect "thing" objects. We use the provided detectors for the 20 PASCAL 2010 object categories and train 4 additional non-PASCAL object categories for flower, laptop, tiger, and window. For the non-PASCAL categories, we train new object detectors using images and bounding box data from ImageNet [6]. The output scores of the detectors are used as potentials.

Stuff Detectors: Classifiers are trained to detect regions corresponding to non-part-based object categories. We train linear SVMs on the low level region features of [9] to recognize: sky, road, building, tree, water, and grass stuff categories. SVM outputs are mapped to probabilities. Training images and bounding box regions are taken from ImageNet. At test time, classifiers are evaluated on a coarsely
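The trinary-to-pairwise conversion of Sec. 4.1 (Eqns 6-9) can be sketched in a few lines. This is an illustrative Python sketch, not the authors' implementation; the potential tables are plain dictionaries and the hard consistency constraints use infinite energies.

```python
import itertools
import math

def trinary_to_pairwise(obj1_dom, prep_dom, obj2_dom, trinary):
    """Convert a trinary potential psi(o1, p, o2) into a unary potential on
    an auxiliary z node plus three hard pairwise consistency potentials."""
    # z ranges over the product domain O1 x P x O2.
    z_dom = list(itertools.product(obj1_dom, prep_dom, obj2_dom))
    # The unary potential on z simply copies the trinary potential (Eqn 6).
    unary_z = {z: trinary[z] for z in z_dom}

    # Pairwise potentials force each original node to agree with z (Eqns 7-9):
    # zero energy when the z component matches the node label, infinite otherwise.
    def consistent(z_component, label):
        return 0.0 if z_component == label else math.inf

    pair_obj1 = {(z, o): consistent(z[0], o) for z in z_dom for o in obj1_dom}
    pair_prep = {(z, p): consistent(z[1], p) for z in z_dom for p in prep_dom}
    pair_obj2 = {(z, o): consistent(z[2], o) for z in z_dom for o in obj2_dom}
    return unary_z, pair_obj1, pair_prep, pair_obj2
```

Minimizing the unary-on-z energy subject to the consistency potentials is then equivalent to minimizing the original trinary potential, which is what lets standard pairwise CRF solvers such as TRW-S be applied.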
[Figure 4 graphic: example images with generated descriptions, e.g. "This is a photograph of one sky, one road and one bus. The blue sky is above the gray road. The gray road is near the shiny bus. The shiny bus is near the blue sky." and "Here we see one person and one train. The black person is by the train."]
Figure 4. Results of sentence generation using our method with template based sentence generation. These are good results as judged by
human annotators.
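As a concrete illustration of the prepositional relationship functions described in Sec. 5.1, here is a hedged sketch of two of them. It assumes regions are axis-aligned bounding boxes (x1, y1, x2, y2) with y increasing downward; the paper's actual functions operate on detected regions, so this is an approximation, not the authors' code.

```python
import math

def above_score(a, b):
    """Fraction of box a's area inside the image strip above box b's top edge."""
    ax1, ay1, ax2, ay2 = a
    _, by1, _, _ = b
    area = (ax2 - ax1) * (ay2 - ay1)
    overlap_h = max(0.0, min(ay2, by1) - ay1)  # vertical extent of a above b
    return (ax2 - ax1) * overlap_h / area if area > 0 else 0.0

def near_score(a, b):
    """Minimum box-to-box distance divided by a's diagonal (smaller = nearer)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    dx = max(bx1 - ax2, ax1 - bx2, 0.0)  # horizontal gap, 0 if overlapping
    dy = max(by1 - ay2, ay1 - by2, 0.0)  # vertical gap, 0 if overlapping
    diag = math.hypot(ax2 - ax1, ay2 - ay1)
    return math.hypot(dx, dy) / diag
```

Note the asymmetry: above_score(a, b) and above_score(b, a) generally differ, which is why the paper computes both prep(a, b) and prep(b, a) for every preposition.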
sampled grid of overlapping square regions covering the images. Pixels in any region with a classification probability above a fixed threshold are treated as detections, and the max probability for a region is used as the potential value.

ψ(attr_i; attrCl) - Attribute Potential

Attribute Classifiers: We train visual attribute classifiers that are relevant for our object (and stuff) categories. Therefore, we mine our large text corpus of Flickr descriptions (described in Sec. 5.2) to find attribute terms commonly used with each object (and stuff) category, removing obviously non-visual terms. The resulting list consists of 21 visual attribute terms describing color (e.g. blue, gray), texture (e.g. striped, furry), material (e.g. wooden, feathered), general appearance (e.g. rusty, dirty, shiny), and shape (e.g. rectangular) characteristics. Training images for the attribute classifiers come from Flickr, Google, the attribute dataset provided by Farhadi et al. [9], and ImageNet [6]. An RBF kernel SVM is used to learn a classifier for each visual attribute term (up to 150 positive examples per class, with all other training examples as negatives). The outputs of the classifiers are used as potential values.

ψ(prep_ij; prepFuns) - Preposition Potential

Preposition Functions: We design simple prepositional functions that evaluate the spatial relationships between pairs of regions in an image and provide a score for each of 16 preposition terms (e.g. above, under, against, beneath, in, on, etc). For example, the score for above(a, b) is computed as the percentage of region_a that lies in the image rectangle above the bounding box around region_b. The potential for near(a, b) is computed as the minimum distance between region_a and region_b divided by the diagonal size of a bounding box around region_a. Similar functions are used for the other preposition terms. We include synonymous prepositions to encourage variation in sentence generation, but sets of synonymous prepositions share the same potential. Note that for each preposition we compute both prep(a, b) and prep(b, a), as either labeling order can be predicted in the output result.

5.2. Text Based Potentials

We use two potential functions calculated from large text corpora. The first is a pairwise potential on attribute-object label pairs, ψ(attr_i, obj_i; textPr), and the second is a trinary potential on object-preposition-object triples, ψ(obj_i, prep_ij, obj_j; textPr). These potentials are the probability of various attributes for each object (given the object) and the probabilities of particular prepositional relationships between object pairs (given the pair of objects). The conditional probabilities are computed from counts of word co-occurrence as described below.

Parsing Potentials: To generate counts for the attribute-object potential ψ_p(attr_i, obj_i; textPr) we collect a large set of Flickr image descriptions (similar to but less regulated than captions). For each object (or stuff) category we collect up to the minimum of 50,000 or all image descriptions by querying the Flickr API¹ with each object category term. Each sentence from this description set is parsed by the Stanford dependency parser [5] to generate the parse tree and dependency list for the sentence. We then collect statistics about the occurrence of each attribute and object pair using the adjectival modifier dependency amod(attribute, object). Counts for synonyms of object and attribute terms are merged together.

¹ http://www.flickr.com/services/api/
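The Parsing Potentials step above (counting amod(attribute, object) dependencies and normalizing them into a conditional prior p(attr | obj)) can be sketched as follows. The add-one style smoothing here is purely illustrative; the paper instead handles sparsity by mixing the parsing potentials with Google-derived potentials, and the dependency extraction itself (Flickr crawl, Stanford parser) is omitted.

```python
from collections import Counter

def attr_given_obj(amod_counts, smoothing=1.0):
    """amod_counts: Counter mapping (attribute, object) pairs to counts
    harvested from parsed descriptions. Returns {(attr, obj): p(attr | obj)}
    with additive smoothing over the attribute vocabulary."""
    attrs = {a for a, _ in amod_counts}
    objs = {o for _, o in amod_counts}
    # Total attribute mentions observed for each object category.
    obj_totals = Counter()
    for (a, o), c in amod_counts.items():
        obj_totals[o] += c
    return {
        (a, o): (amod_counts[(a, o)] + smoothing)
                / (obj_totals[o] + smoothing * len(attrs))
        for a in attrs for o in objs
    }
```

For each object category the returned values form a proper conditional distribution over attributes, which is what the CRF consumes as a text based potential.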
For generating the object-preposition-object potential ψ_p(obj_i, prep_ij, obj_j; textPr) we collect 1.4 million Flickr image descriptions by querying for pairs of object terms. Sentences containing at least 2 object (or stuff) categories and a preposition (~140k) are parsed using the Stanford dependency parser. We then collect statistics for the occurrence of each prepositional dependency between object categories. For a prepositional dependency occurrence, object1 is automatically picked as either the subject or object part of the prepositional dependency based on the voice (active or passive) of the sentence, while object2 is selected as the other. Counts include synonyms.

Google Potentials: Though we parse thousands of descriptions, the counts for some objects can be too sparse. Therefore, we also collect additional Google Search based potentials: ψ_g(attr_i, obj_i; textPr) and ψ_g(obj_i, prep_ij, obj_j; textPr). These potentials are computed from the number of search results approximated by Google for an exact string match query on each of our attribute-object pairs (e.g. "brown dog") and object-preposition-object triples (e.g. "dog on grass").

Smoothed Potentials: Our final potentials are computed as a smoothed combination of the parsing based potentials with the Google potentials: αψ_p + (1 − α)ψ_g.

6. Generation

The output of our CRF is a predicted labeling of the image. This labeling encodes three kinds of information: objects present in the image (nouns), visual attributes of those objects (modifiers), and spatial relationships between objects (prepositions). Therefore, it is natural to extract this meaning into a triple (or set of triples), e.g.:

    <<white, cloud>, in, <blue, sky>>

Based on this triple, we want to generate a complete sentence such as "There is a white cloud in the blue sky." We restrict generation so that: the set of words in the meaning representation is fixed and generation must make use of all given content words; and, generation may insert only gluing words (i.e., function words such as "there", "is", "the", etc). These restrictions could be lifted in future work.

[Figure 5 graphic: side-by-side example outputs, e.g. Templated Generation: "This is a photograph of one furry sheep." vs. Simple Decoding: "the furry sheep it."; and a multi-object scene where Templated Generation produces "Here we see three persons, one sky, one grass and one train. The first colorful person is underneath the clear sky, and beside the second colorful person, and within the shiny train. ..." while Simple Decoding produces fragments such as "the colorful person is underneath the clear sky. the colorful person who beside the colorful person. ..."]
Figure 5. Comparison of our two generation methods.

6.1. Decoding using Language Models

An N-gram language model is a conditional probability distribution P(x_i | x_{i−N+1}, ..., x_{i−1}) over N-word sequences (x_{i−N+1}, ..., x_i), such that the prediction of the next word depends only on the previous N−1 words. That is, with an (N−1)th-order Markov assumption, P(x_i | x_1, ..., x_{i−1}) = P(x_i | x_{i−N+1}, ..., x_{i−1}). Language models are shown to be simple but effective for improving machine translation and automatic grammar corrections.

In this work, we make use of language models to predict gluing words (i.e. function words) that put together words in the meaning representation. As a simple example, suppose we want to determine whether to insert a function word x between a pair of words α and β in the meaning representation. Then, we need to compare the length-normalized probability p̃(αxβ) with p̃(αβ), where p̃ takes the nth root of the probability p for n-word sequences, and p(αxβ) = p(α)p(x|α)p(β|x) using bigram (2-gram) language models. If considering more than two function words between α and β, dynamic programming can be used to find the optimal sequence of function words efficiently. Because the ordering of words in each triple of the meaning representation coincides with the typical ordering of words in English, we retain the original ordering for simplicity. Note that this approach composes a separate sentence for each triple, independently from all other triples.

6.2. Templates with Linguistic Constraints

Decoding based on language models is a statistically principled approach; however, it has two main limitations: (1) it is difficult to enforce grammatically correct sentences using language models alone; (2) it is ignorant of discourse structure (coherency among sentences), as each sentence is generated independently. We address these limitations by constructing templates with linguistically motivated constraints. This approach is based on the assumption that there are a handful of salient syntactic patterns in descriptive language that we can encode as templates.

7. Experimental Results & Conclusion

To construct the training corpus for language models, we crawled Wikipedia pages that describe objects our system can recognize. For evaluation, we use the UIUC PASCAL sentence dataset², which contains up to five human-generated sentences that describe 1000 images. From this set we evaluate results on 847 images³.

² http://vision.cs.uiuc.edu/pascal-sentences/
³ 153 were used to learn CRF and detection parameters.
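The Sec. 6.1 gluing-word decision, comparing the length-normalized probability of a sequence with and without an inserted function word under a bigram model, can be sketched as below. The toy probability tables are hypothetical, and real systems would work in log space with proper smoothing.

```python
def seq_prob(words, unigram, bigram):
    """p(w1..wn) = p(w1) * product of p(w_i | w_{i-1}) under a bigram model."""
    p = unigram[words[0]]
    for prev, cur in zip(words, words[1:]):
        p *= bigram.get((prev, cur), 1e-6)  # tiny floor for unseen bigrams
    return p

def should_insert(w1, x, w2, unigram, bigram):
    """Insert function word x between w1 and w2 iff the length-normalized
    (nth-root) probability of 'w1 x w2' beats that of 'w1 w2'."""
    with_x = seq_prob([w1, x, w2], unigram, bigram) ** (1.0 / 3.0)
    without = seq_prob([w1, w2], unigram, bigram) ** (1.0 / 2.0)
    return with_x > without
```

Extending this from a single inserted word to a short sequence of function words is the dynamic programming variant mentioned in Sec. 6.1.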
[Figure 6 graphic: failure cases grouped as missing detections, incorrect detections, incorrect attributes, "Counting is hard!", and "Just all wrong!"; e.g. "There are one road and one cat. The furry road is in the furry cat." and "This is a photograph of two sheeps and one grass. The first black sheep is by the green grass, and by the second black sheep. The second black sheep is by the green grass."]
Figure 6. Results of sentence generation using our method with template based sentence generation. These are bad results as judged by
human annotators.
7.1. Qualitative Results

The majority of our generated sentences look quite good. Example results on PASCAL images rated as good are shown in Fig. 4. In fact most of our results look quite good. Even bad results almost always look reasonable and are relevant to the image content (Fig. 6). Only for a small minority of the images are the generated descriptions completely unrelated to the image content (Fig. 6, two rightmost images). In cases where the generated sentence is not quite perfect, this is usually due to one of three problems: a failed object detection that misses an object, a detection that proposes the wrong object category, or an incorrect attribute prediction. However, because of our use of powerful vision systems (state of the art detectors and attribute methodologies), the results produced are often astonishingly good.

7.2. Conclusion

We have demonstrated a surprisingly effective, fully automatic system that generates natural language descriptions for images. The system works well and can produce results much more specific to the image content than previous automated methods. Human evaluation validates the quality of the generated sentences. One key to the success of our system was automatically mining and parsing large text collections to obtain statistical models for visually descriptive language. The other is taking advantage of state of the art vision systems and combining all of these in a CRF to produce input for language generation methods.

Acknowledgements

This work was supported in part by NSF Faculty Early Career Development (CAREER) Award #1054133.

References

[1] A. Aker and R. Gaizauskas. Generating image descriptions using dependency relational patterns. In Proc. ACL, pages 1250-1258, 2010.
[2] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. Jordan. Matching words and pictures. JMLR, 3:1107-1135, 2003.
[3] T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean. Large language models in machine translation. In EMNLP-CoNLL, 2007.
[4] S. Channarukul, S. W. McRoy, and S. S. Ali. DOGHED: a template-based generator for multimodal dialog systems targeting heterogeneous devices. In NAACL, 2003.
[5] M.-C. de Marneffe and C. D. Manning. Stanford typed dependencies manual.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[7] C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for multi-class object layout. In ICCV, 2009.
[8] P. Duygulu, K. Barnard, N. de Freitas, and D. Forsyth. Object recognition as machine translation. In ECCV, 2002.
[9] A. Farhadi, I. Endres, D. Hoiem, and D. A. Forsyth. Describing objects by their attributes. In CVPR, 2009.
[10] A. Farhadi, M. Hejrati, A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. A. Forsyth. Every picture tells a story: generating sentences for images. In ECCV, 2010.
[11] L. Fei-Fei, C. Koch, A. Iyer, and P. Perona. What do we see when we glance at a scene? Journal of Vision, 4(8), 2004.
[12] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Discriminatively trained deformable part models, release 4. http://people.cs.uchicago.edu/~pff/latent-release4/.
[13] Y. Feng and M. Lapata. How many words is a picture worth? Automatic caption generation for news images. In Proc. ACL, pages 1239-1249, 2010.
[14] V. Ferrari and A. Zisserman. Learning visual attributes. In NIPS, 2007.
[15] C. Galleguillos, A. Rabinovich, and S. J. Belongie. Object categorization using co-occurrence, location and appearance. In CVPR, 2008.
[16] A. Gupta and L. S. Davis. Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. In ECCV, 2008.
[17] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. TPAMI, 28, Oct. 2006.
[18] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and simile classifiers for face verification. In ICCV, 2009.
[19] C. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.
[20] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report, 2001.
[21] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 81:2-23, January 2009.
[22] H. Stehouwer and M. van Zaanen. Language models for contextual error detection and correction. In CLAGI, 2009.
[23] A. Torralba, R. Fergus, and W. Freeman. 80 million tiny images: a large dataset for non-parametric object and scene recognition. TPAMI, 30, 2008.
[24] A. Torralba, K. P. Murphy, and W. T. Freeman. Using the forest to see the trees: exploiting context for visual object detection and localization. Commun. ACM, 53, March 2010.
[25] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. MAP estimation via agreement on (hyper)trees: Message-passing and linear-programming approaches. IEEE Trans. Information Theory, 51:3697-3717, 2005.
[26] B. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu. I2T: Image parsing to text description. Proc. IEEE, 98(8), 2010.
[27] L. Zhou and E. Hovy. Template-filtered headline summarization. In Text Summarization Branches Out: Proc. ACL-04 Workshop, July 2004.