
Feature Article

Semantics in Visual Information Retrieval

Carlo Colombo, Alberto Del Bimbo, and Pietro Pala
University of Florence, Italy

A compositional approach increases the level of representation that can be automatically extracted and used in a visual information retrieval system. Visual information at the perceptual level is aggregated according to a set of rules. These rules reflect the specific context and transform perceptual words into phrases capturing pictorial content at a higher, and closer to the human, semantic level.

Visual information retrieval systems have entered a new era. First-generation systems allowed access to images and videos through textual data.1,2 Typical searches for these systems include, for example, all images of paintings of the Florentine school of the 15th century or all images by Cezanne with landscapes. Such systems expressed information through alphanumeric keywords or scripts. They employed representation schemes like relational models, frame models, and object-oriented models. On the other hand, current-generation retrieval systems support full retrieval by visual content.3,4 Access to visual information is not only performed at a conceptual level, using keywords as in the textual domain, but also at a perceptual level, using objective measurements of visual content. In these systems, image processing, pattern recognition, and computer vision constitute an integral part of the system's architecture and operation. They objectively analyze pixel distribution and extract the content descriptors automatically from raw sensory data. Image content descriptors are commonly represented as feature vectors, whose elements correspond to significant parameters that model image attributes. Therefore, visual attributes are regarded as points in a multidimensional feature space, where point closeness reflects feature similarity.

These advances (for comprehensive reviews of the field, see the Further Reading sidebar) have paved the way for third-generation systems, featuring full multimedia data management and networking support. Forthcoming standards such as MPEG-4 and MPEG-7 (see the Nack and Lindsay article in this issue) provide the framework for efficient representation, processing, and retrieval of visual information.

Yet many problems must still be addressed and solved before these technologies can emerge. An important issue is the design of indexing structures for efficient retrieval from large, possibly distributed, multimedia data repositories. To achieve this goal, image and video content descriptors can be internally organized and accessed through multidimensional index structures.5 A second key problem is to bridge the semantic gap between the system and users, that is, to devise representations capturing visual content at high semantic levels especially relevant for retrieval tasks. Specifically, automatically obtaining a representation of high-level visual content remains an open issue.

Virtually all the systems based on automatic storage and retrieval of visual information proposed so far use low-level perceptual representations of pictorial data, which have limited semantics. Building up a representation proves tantamount to defining a model of the world, possibly through a formal description language, whose semantics capture only a few significant aspects of the information content.6

Different languages and semantics induce diverse world representations. In text, for example, the meaning of single words is specific yet limited, and an aggregate of several words (a phrase) produces a higher degree of significance and expressivity. Hence, the rules for the syntactic composition of signs in a given language also generate a new world representation, offering richer semantics in the hierarchy of signification.

To avoid equivocation, a retrieval system should embed a semantic level reflecting as much as possible the one humans refer to during interrogation. The most common way to enrich a visual information retrieval system's semantics is to annotate pictorial information manually at storage time through a set of external keywords describing the pictorial content. Unfortunately, textual annotation has several problems:

1. It's too expensive to go through manual annotation with large databases.

2. Annotation is subjective (generally, the annotator and the user are different persons).


3. Keywords typically don't support retrieval by similarity.

Automatically increasing the semantic level of representation provides an alternative. Starting from perceptual features (the atomic elements of visual information) some intermediate semantic levels can be extracted using a suitable set of rules. Perceptual features represent the evidence upon which to build the interpretation of visual data. A process of syntactic construction called compositional semantics builds the semantic representation.

In this article, we discuss how to extract automatically (from raw image and video data) two distinct semantic levels and how to represent these levels through appropriate language rules. As we organize semantic levels according to a signification hierarchy, the corresponding description languages become stratified, allowing the composition of higher semantic levels according to syntactic rules that combine perceptual features and lower level signs. Since these languages directly depend on objective features, the approach naturally accommodates visual search by example and retrieval by similarity. We'll address two different visual contexts: art paintings and commercial videos. We'll also present retrieval examples showing that compositional semantics improves accordance with human judgment and expectation.

Further Reading

For a comprehensive introduction to visual information retrieval, see

A. Del Bimbo, Visual Information Retrieval, Academic Press, London, 1999.
P. Aigrain, H. Zhang, and D. Petkovic, "Content-Based Representation and Retrieval of Visual Media: A State-of-the-Art Review," Multimedia Tools and Applications, Vol. 3, No. 4, Dec. 1996, pp. 179-202.
A. Gupta and R. Jain, "Visual Information Retrieval," Comm. of the ACM, Vol. 40, No. 5, May 1997, pp. 71-79.

A review of the state of the art in visual information processing can be found in

B. Furht, S.W. Smoliar, and H.J. Zhang, Video and Image Processing in Multimedia Systems, Kluwer Academic Publishers, Boston, 1996.

A language-oriented approach
Here we discuss the compositional semantics framework we developed. We also provide background theories in art and advertising.

Compositional semantics framework
The compositional semantics framework involves a bottom-up analysis and processing of a visual message, starting from its perceptual features. For still images, these features are image colors and edges. For videos or image sequences, additional features include the presence of editing effects, the motion of objects within a scene, and so on. Without loss of generality, the perceptual properties of a visual message can be represented through a set of scores P = {p_i}, i = 1, ..., n, each score p_i ∈ [0, 1] representing the extent to which the i-th feature appears in the message.

We devised two distinct levels of the signification hierarchy, namely the expressive and the emotional levels, as plausible intermediate steps involved in the construction of meaning.

Semantic levels: expression and emotion. At the expressive level, perceptual data are organized into a group of new features (the expressive features) taking into account both spatial and temporal distributions of perceptual features. Expressive features reflect concepts that humans embody at a higher level of abstraction to achieve a more compact visual representation.

Combination rules are modeled as functions F acting over the perceptual feature set P and returning a score expressing the degree of truth by which the rule F holds. Hence, a rule F_j can be defined as

F_j : [0, 1]^n → [0, 1]

Operators of logical composition between rules extend the signification of the representation. We define these operators as

F_1 ∧ F_2 = min(F_1, F_2)        F_1 ∨ F_2 = max(F_1, F_2)

The expressive feature set F = {F_1, ..., F_m} qualifies the content of the visual message at the expressive level.
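To make the rule notation concrete, the following Python fragment is a minimal sketch of ours (not code from the system described in this article; feature names and values are illustrative). It scores two expressive rules over a set of perceptual scores and combines them with the min/max operators defined above.

```python
from typing import Callable, Dict

# Perceptual scores p_i in [0, 1], keyed by feature name (illustrative values).
P: Dict[str, float] = {"warmth_contrast": 0.8, "hue_contrast": 0.6, "slanted_lines": 0.3}

# An expressive rule F maps the perceptual score set to a degree of truth in [0, 1].
Rule = Callable[[Dict[str, float]], float]

def f_warmth_contrast(p: Dict[str, float]) -> float:
    return p["warmth_contrast"]

def f_hue_contrast(p: Dict[str, float]) -> float:
    return p["hue_contrast"]

# Logical composition of rules: conjunction as min, disjunction as max.
def f_and(f1: Rule, f2: Rule) -> Rule:
    return lambda p: min(f1(p), f2(p))

def f_or(f1: Rule, f2: Rule) -> Rule:
    return lambda p: max(f1(p), f2(p))

# A composite rule requiring both warmth and hue contrasts to hold.
f_chromatic_tension = f_and(f_warmth_contrast, f_hue_contrast)
print(f_chromatic_tension(P))  # min(0.8, 0.6) = 0.6
```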
A musical example lets us capture the distinction between the expressive and emotional levels. Assume that the laws of harmony and counterpoint characterize music composition at the expressive level. However, following these rules doesn't guarantee a pleasant result for the audience. In other words, musical fruition and understanding involves adopting aesthetic criteria that go beyond expression (that is, the syntax of meaning) and reach emotion as the ultimate semantics of meaning. (The art of J.S. Bach provides a remarkable example of how to reach musical beauty by a skillful adherence to the formal rules of 18th century music.)

We formally construct the emotional level, at the top of our signification hierarchy, from the levels below, namely the expressive and perceptual levels. With a notation similar to the one above, rules at the emotional level are represented through functions G acting over the set of perceptual and expressive features P ∪ F and returning a score that expresses the degree of truth by which the rule G holds. Hence, a rule G_k can be defined as

G_k : [0, 1]^(n+m) → [0, 1]

Operators of logical composition between rules can extend the representation's semantics. The set G = {G_k} qualifies the content of the visual message at the emotional level.

Perceptual, expressive, and emotional features qualify the meaning of a visual message at different levels of signification. For all these levels, construction rules depend on the specific data domain to which they refer (such as movies, commercials, and TV news for videos; paintings, photographs, and trademarks for still images). Specifically, the expressive level features objective properties that generally depend on collective cultural backgrounds. The emotional level, on the contrary, relies on subjective elements such as individual cultural background and psychological moods.

Figure 1. (a) Contrasts and colors in an art image. (b) The Itten sphere (geographic coordinates, external views, and equatorial and longitudinal sections). (c) The polygons generating three- and four-chromatic accordance on the pure color circle.

Background theories
Here we present theories that provide a reference framework for developing expressive and emotional rules in the domains of art images and commercial videos.

Expression in art images. Among the many authors who recently addressed the psychology of art images, Arnheim discussed the relationships between artistic form and perceptive processes,7 and Itten8 formulated a theory about use of color in art and about the semantics it induces. Itten observed that color combinations induce effects such as harmony, disharmony, calmness, and excitement that artists consciously exploit in their paintings. Most of these effects relate to high-level chromatic patterns rather than to physical properties of single points of color (see, for example, Figure 1a). Such rules describe art paintings at the expressive level. The theory characterizes colors according to the categories of hue, luminance, and saturation. Twelve fundamental hues are chosen and each of them is varied through five levels of luminance and three levels of saturation.
These colors are arranged in a chromatic sphere called the Itten sphere (Figure 1b), such that perceptually contrasting colors have opposite coordinates with respect to the center of the sphere.

Analyzing the polar reference system, we can identify four different types of color contrasts: pure colors, light-dark, warm-cold, and saturated-unsaturated (quality). Psychological studies have suggested that, in western culture, red-orange environments induce a sense of warmth (yellow through red-purple are warm colors). Conversely, green-blue conveys a sensation of cold (yellow-green through purple are cold colors). Cold sensations can be emphasized by the contrast with a warm color or dampened by its coupling with a highly cold tint. The concept of harmonic accordance combines hues and tones to generate a stability effect onto the human eye. Harmony can be achieved by creating color accordances, generated on the sphere by connecting locations through regular polygons (see Figure 1c).

Emotion in art images. Color combinations also induce emotional effects. The mapping of low-level color primitives into emotions is quite complex. It must consider theories about the use of colors and cognitive models, and involve cultural and anthropological backgrounds. In other words, people with different cultures will likely share the same expressive level representation but not perceive the same emotions in front of the same color pattern.

Artists use color combinations unconsciously and consciously to produce optical and psychological sensations.9 Warm colors attract the eye and grab the attention of the observer more than cold colors. Cold sensations provided by a large region of cold color can be emphasized by the contrast with a warm color or dampened by its coupling with a highly cold tint. Similarly, small, close cold regions emphasize large, warm regions. Red communicates happiness, dynamism, and power. Orange, the warmest color, resembles the color of fire and thus communicates glory. Green communicates calmness and relaxation, and is the color of hope. Blue, a cold color, improves the dynamism of warm colors, suggesting gentleness, fairness, faithfulness, and virtue. Purple, a melancholy color, sometimes communicates fear. Brown generally serves as the background color for relaxing scenes. Differences in lightness determine a sense of plasticity and the perception of different planes of depth. The absence of contrasting hues and the presence of a single dominant color region inspire a sense of uneasiness strengthened by the presence of dark yellow and purple colors. The presence of regions in harmonic accordance communicates a sense of stability and joy. Calmness and quiet can be conveyed through combining complementary colors. In the presence of two noncomplementary colors, the human eye looks for the complementary of the observed color. This rouses a sense of anguish.

Another basic contribution to the semantics of an image relates to the presence and characteristics of notable lines. In fact, line slope is a key feature through which the artist communicates different emotions. For example, an oblique slope communicates dynamism and action, while a flat slope, such as a horizon, communicates calmness and relaxation. Figure 2 shows an example, illustrating how saturated colors can combine with sloped lines to communicate a sensation of dynamism in the observer.

Figure 2. Some frames taken from a Sergio Tacchini commercial video. The use of colors and lines in each frame communicates dynamism. In this video, the effect is further enhanced by a high frequency of shots. (Courtesy of Mediaset)

Expression in commercials. Semiotics studies the meaning of symbols and signs. In semiotics, a sign represents anything that conveys meaning according to commonly accepted conventions. Semiotics therefore suggests that signs relate to their meaning according to the specific cultural background. Semioticians usually identify two distinct steps for the production of meaning:10
1. an abstract level formed by narrative structures, that is, structures including all those basic signs that create meaning and those values determined by sign combinations, and

2. a concrete level formed by discourse structures, describing the way in which the author uses narrative elements to create a story.

Representing the expressive and emotional level for commercial videos inherits some of the features introduced before for still images and adds new features related to the way frames are concatenated. Semiotics classifies advertising images (and specifically commercials) into four different categories (noted below) that relate to the narrative element.11 Directors will use the main narrative signs of the video (camera breaks, colors, editing effects, rhythm, shot angles, and lines) in a peculiar way, depending on the specific video typology considered.

- Practical commercials emphasize the qualities of a product according to a common set of values. The advertiser describes the product in a familiar environment so that the audience naturally perceives it as useful. Camera takes are usually frontal, and transitions take place in a smooth and natural way. This implies choosing long dissolves for merging shots, and the prevalence of horizontal and vertical lines, giving the impression of relaxation and solidity.

- Critical commercials introduce a hierarchy of reference values. The product serves as the subject of the story, which focuses on the product's qualities through an apparently objective description of its features. The scene has to appear more realistic than reality itself. For this reason, the commercial has a minimum number of camera breaks. Due to smooth camera motions, the ever-changing colors in the background draw the audience's attention to the (constant) color of the product.

- Utopic commercials provide evidence that the product can succeed in critical tests. Here, the story doesn't follow a realistic plot. Rather, situations appear as in a dream. Mythical scenarios are chosen to present the product, shown to succeed in critical conditions often in a fantastic and unrealistic way. The director creates a movie-like atmosphere, with a set of dominant colors defining a closed-chromatic world and with all the traditional editing effects (cuts, dissolves) possibly taking place.

- Playful commercials emphasize the accordance between users' needs and product qualities. Here a manifest parody of the other typologies of ads takes place. The commercial clearly states to the audience that they're watching advertising material. Situations and places visibly differ from everyday life, and they're deformed in such a caricatural and grotesque fashion that the agreement between product qualities and purchasers' needs is often shown in an ironic way (such as an old woman driving a Ferrari). The director emphasizes the presence of the camera in the ad and uses all possible effects to stimulate the active participation of the audience. Also, everything looks strange and false, with unnatural colors, improbable camera takes, and so on.

Emotion in commercials. The choice of a given combination of narrative signs affects both the expressive and emotional levels of signification. While dissolves communicate calmness and relaxation, cuts increase the video's dynamism (see again Figure 2). Sometimes in commercials, directors emphasize cuts by including a white frame between the end of the first frame and the beginning of the second, thus inducing a sense of shock in the observer. The semantics associated with editing effects thus relates to the use of cuts or dissolves and to the length of the shots in the video. If the video consists of a few long shots, the effect is that of calmness and relaxation. Alternatively, videos with many short shots separated by cuts induce a sense of action and dynamism in the audience.

Content representation and semantics-based retrieval
In the framework of our research, we used the theories presented in the previous section to qualify the content of a visual message at an intermediate semantic level. Specifically, we've exploited Itten's theory and semiotic principles to represent the visual content at the expressive level. Our work supports defining formal rules to qualify the effects of color, geometric, and dynamic feature combinations. We've also exploited psychological principles of visual communication to represent the visual content at the emotional level.
In the next section, we use the semantic descriptors automatically extracted from art images and commercial videos for retrieval purposes.

Content representation for art images
Here we discuss the rules needed to represent the content of art images.

Expressive level. Exploiting Itten's model to qualify an image's chromatic content at an expressive level requires

1. segmenting the image into regions characterized by uniform colors, and

2. representing chromatic and spatial features of image regions.

We segment images by looking for clusters in the color space, then back-projecting cluster centroids in the feature space onto the image. Segmentation occurs through selecting the appropriate color space so that small feature distances correspond to similar colors in the perceptual domain. Adopting the International Commission on Illumination (Commission Internationale de l'Eclairage, CIE) L*u*v* space12 accomplishes this.

Once segmented into regions, the image's region features can be described in terms of intra-region and inter-region properties. Intra-region properties include region color, warmth, hue, luminance, saturation, position, and size. Inter-region properties consist of hue, saturation, warmth, luminance contrast, and harmony. To manage the vagueness of chromatic properties, we use a fuzzy representation model to describe a generic property's value. Therefore, we can describe a generic property by introducing n reference values for that property. Then, considering n scores, one per reference value, the i-th score measures the extent to which the region conforms to the i-th reference value. For instance, if we introduce three reference values for the luminance of a region (corresponding to dark, medium, and bright), the descriptor (0.0, 0.1, 0.9) represents a very bright region's luminance.

According to Itten's rules, these abstract properties must be translated into language sentences. In formal language theory notation, region formulas (φ) that characterize chromatic and arrangement properties of color patches represent these sentences through

φ := region | hue = h | lum = l | sat = s | warmth = w | size = S | position = p |
     Contrast(φ1, φ2) | Harmony(φ1, ..., φn) | φ1 ∧ φ2 | φ1 ∨ φ2

where each value on the right-hand side of an equality is a feasible value for the corresponding measure, with h, l, s, and w denoting the attributes of hue, luminance, saturation, and warmth, respectively. We define semantic clauses in terms of a fulfillment relation |= of a generic formula φ on a region R. The degree of truth by which φ is verified over R is expressed by a value that's computed considering fuzzy descriptors of region properties. We code semantic clauses into a model-checking engine. Given a generic formula φ and an image I, the engine computes a score representing the degree of truth by which φ is verified over I (see the sidebar "Model-Checking Engine" on the next page).13

Emotional level. The psychological analysis of effects induced by images suggests that we can use only a limited subset of primary emotions to express the great variability of human emotions (secondary emotions). Explicitly defining the rules mapping expressive and perceptual features onto emotions would require us to build an overcomplicated model taking into account cultural and behavioral issues. Hence, we prefer to determine the relative relevance of each single feature through an adaptation process that fits a specific culture and fashion.

Explicitly, the definition of the rules mapping expressive and perceptual features onto emotions involves

1. Identifying a set of primary emotions. We identified four primary emotions, namely, action, relaxation, joy, and uneasiness, as the most relevant to express the interaction of humans with images. Referring to the general scheme presented above, secondary emotions such as fear and aggressiveness can be expressed as combinations of action and uneasiness.

2. Identifying for each primary emotion a set of plausible inputs. In our model, we expect contrasts of warmth and contrasts of hue to generate a sense of action and dynamism. The presence of lines with a high slope reinforces this.

3. Defining a training set for the model. Given a database of images, we use some images as templates of primary emotions. This training set adapts weights of perceptual features to determine the relative relevance of each feature for the emotion induced in subjects of a certain cultural context.
Model-Checking Engine

A model-checking engine decides the satisfaction of a formula φ over an image I with a two-step process. First, the engine recursively decomposes formula φ into subformulas in a top-down manner. This allows the representation of φ with a tree in which leaf nodes represent intra-region specifications. Afterwards, the engine labels regions in the image description with the subformulas they satisfy in a bottom-up approach. The engine decides the satisfaction of region formulas by directly referring to the chromatic fuzzy descriptors. This first labeling level is then exploited to decide the satisfaction of composition formulas. In this labeling process, the engine first checks if the region contains pixels that can satisfy the subformula. If it can, the engine labels the region with the subformula and with the degree of truth by which the region satisfies it. When the degree of truth falls under a given minimum threshold, the engine drops the image from the candidate list.

In Figure A, the engine computes the degree by which the formula φ = Contrast_w(hue = orange, size = large) holds over an image. The engine assumes that the image has just been segmented. For simplicity, the engine considers only four regions, namely R1, R2, R3, and R4. First, φ decomposes into subformulas:

φ1 : hue = orange
φ2 : size = large
φ3 : Contrast_w(φ1, φ2)

Then, according to a bottom-up approach, the engine labels regions with the subformulas they satisfy:

Step 1: Label R2 and R4 with φ1
Step 2: Label R1 with φ2
Step 3: Label R1, R2, and R4 with φ3

At the end of this process, the engine labels three regions with φ3, thus the original formula φ is satisfied over the image.

Figure A. The model-checking engine. From top to bottom: formula decomposition, segmented image, and original image.
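The labeling procedure can be pictured with a short sketch. The following Python fragment is not the engine described in the sidebar but a simplified illustration of ours: each segmented region carries fuzzy descriptors in [0, 1], leaf formulas are scored directly from those descriptors, and (as an assumption of this sketch) the degree of a contrast composition is taken as the minimum of the best scores of its subformulas.

```python
from dataclasses import dataclass
from typing import Dict, List

# Fuzzy descriptors per segmented region (illustrative values for R1..R4).
Region = Dict[str, float]
regions: List[Region] = [
    {"hue=orange": 0.1, "size=large": 0.9},   # R1: large region, not orange
    {"hue=orange": 0.8, "size=large": 0.2},   # R2: orange region
    {"hue=orange": 0.0, "size=large": 0.3},   # R3: satisfies neither atom
    {"hue=orange": 0.7, "size=large": 0.4},   # R4: orange region
]

@dataclass
class Atom:
    """Leaf node: an intra-region specification such as hue = orange."""
    prop: str
    def label(self, rs: List[Region]) -> Dict[int, float]:
        # Bottom-up step: label each region with its degree of truth for this atom.
        return {i: r.get(self.prop, 0.0)
                for i, r in enumerate(rs) if r.get(self.prop, 0.0) > 0.0}

@dataclass
class ContrastW:
    """Composition node: warmth contrast between two subformulas (simplified)."""
    phi1: Atom
    phi2: Atom
    def degree(self, rs: List[Region]) -> float:
        l1, l2 = self.phi1.label(rs), self.phi2.label(rs)
        if not l1 or not l2:
            return 0.0
        # Assumption: the contrast holds to the degree both subformulas
        # are satisfied somewhere in the image.
        return min(max(l1.values()), max(l2.values()))

# phi = Contrast_w(hue = orange, size = large)
phi = ContrastW(Atom("hue=orange"), Atom("size=large"))
print(phi.degree(regions))   # min(0.8, 0.9) = 0.8
```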

We summarize the dependence between perceptual/expressive features and emotions below. We measure the degree of action communicated by an image as

G_action^I = g_1^I(F_warmthc, F_huec, slanted, w_a1, w_a2, w_a3)

where F_warmthc and F_huec represent the expressive features measuring the presence of warmth and hue contrasts, and slanted the perceptual feature measuring the presence of slanted lines (the w's are the feature weights). We can measure the degree of relaxation an image communicates by

G_relax^I = g_2^I(F_lumc, brown, green, w_r1, w_r2, w_r3)

where F_lumc represents the expressive features measuring the presence of luminance contrasts and brown and green the perceptual features measuring the presence of brown and green regions. The degree of joy communicated by an image can be measured as

G_joy^I = g_3^I(F_harmony, w_j1)

where F_harmony represents the expressive features measuring the presence of regions in harmonic accordance. Finally, we measure the degree of uneasiness an image communicates as
G_uneas^I = g_4^I(F_nhuec, yellow, purple, w_u1, w_u2, w_u3)

where F_nhuec represents the expressive features measuring the absence of contrasts of hue, and yellow and purple the perceptual features measuring the presence of yellow and purple regions. Presently we use a linear mapping to model the g_i^I functions and a linear regression scheme to achieve adaptation of the weights w_i based on a set of training examples.
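As a concrete illustration of the linear mapping and regression-based weight adaptation just described, the sketch below (our own simplified rendering with synthetic data, not the system's code) scores the action emotion as a weighted combination of its input features and fits the weights by least squares.

```python
import numpy as np

# Each row holds the features entering G_action: warmth contrast, hue contrast,
# and the slanted-lines score (all in [0, 1]); synthetic example data.
X = np.array([
    [0.9, 0.8, 0.7],
    [0.2, 0.1, 0.0],
    [0.7, 0.6, 0.9],
    [0.1, 0.3, 0.2],
])
# Target "action" scores assigned to the training images (e.g., by annotators).
y = np.array([0.9, 0.1, 0.8, 0.2])

# Weight adaptation by linear regression (ordinary least squares).
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def g_action(features: np.ndarray, weights: np.ndarray) -> float:
    """Linear mapping: weighted combination of the features, clipped to [0, 1]."""
    return float(np.clip(features @ weights, 0.0, 1.0))

print(w)                                       # learned weights w_a1, w_a2, w_a3
print(g_action(np.array([0.8, 0.7, 0.5]), w))  # action score for a new image
```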

Content representation for commercial videos
Now we'll present the content representation rules for commercial videos.

Expressive level. Semiotic principles permit establishing a link between the set of perceptual features and each of the four semiotic categories of commercial videos (practical, playful, utopic, and critical). This supports organizing a database of commercials on the basis of their semiotic content. Once a video has been segmented into shots, all perceptual features are extracted. Each feature's value is represented according to a fuzzy representation model. A score in [0, 1] is introduced for each feature to qualify the feature's presence in the video. Inter-frame features address

- the presence of cuts (cuts) and dissolves (dissolves),

- the presence or absence of colors recurring in many frames (recurrent ~1 and recurrent ~0, respectively), and

- the presence or absence of editing effects (edit ~1 and edit ~0, respectively).

Intra-frame features address the presence of

- horizontal and vertical or slanted lines (hor/vert ~1 and hor/vert ~0, respectively) and

- saturated or unsaturated colors (saturated ~1 and saturated ~0, respectively).

Table 1 summarizes how perceptual features are mapped onto the four semiotic categories. We adopted a linear mapping where table entries indicate expected feature weights and – indicates irrelevance.

Table 1. Perceptual mapping onto semiotic categories.

Perceptual Features   Fpractical   Fplayful   Futopic   Fcritical
saturated             ~0           ~1         –         –
recurrent             –            –          ~1        ~0
hor/vert              ~1           ~0         –         ~1
cuts                  –            ~1         ~1        –
dissolves             ~1           –          ~1        –
edit                  –            –          –         ~0

– indicates irrelevance

We represent the video content through sentences qualifying the presence of critical, utopic, practical, and playful features. These sentences are expressed through formulas (ψ) defined as

ψ := Fpractical ≥ k1 | Fplayful ≥ k2 | Futopic ≥ k3 | Fcritical ≥ k4 | ψ1 ∧ ψ2 | ψ1 ∨ ψ2

where the ki's represent fixed threshold values.

Emotional level. High-level properties of a commercial relate to the feelings it inspires in the observer. We've organized these characteristics in a hierarchical fashion. A first classification separates commercials that induce action from those that induce quietness. Each class then further splits to specify the kind of action and quietness. Explicitly, action in a commercial can induce two different feelings, suspense and excitement. Similarly, quietness can be further specified as relaxation and happiness.

In the following, we describe the mapping between emotional features and low-level perceptual features reflecting reference principles of visual communication. (See also Table 2 on the next page.)

- Action: The degree of action G_action^V for a video can be improved by the presence of red and purple. Action videos are often characterized by framings exhibiting high slanting. Short sequences are joined by cuts. The main distinguishing feature of these videos lies in the presence of a high degree of motion in the scene.

- Excitement: Videos that communicate excitement (with degree G_excite^V) typically feature short sequences joined through cuts.

- Suspense: To improve the degree of suspense G_suspense^V in a video that communicates action, directors combine both long (long) and short (short) sequences in the video and join them through frequent cuts.
Table 2. Perceptual mapping onto emotional features.

Perceptual Features   Action   Excitement   Suspense   Quietness   Relaxation   Happiness
dissolves             –        –            –          ~1          ~1           ~1
cuts                  ~1       ~1           ~1         ~1          ~1           ~1
long                  –        –            ~1         ~1          ~1           ~1
short                 ~1       ~1           ~1         ~1          –            ~1
motion                ~1       ~1           ~1         –           ~0           ~1
hor/vert              ~0       ~0           ~0         ~1          ~1           ~1
red                   ~1       ~1           ~1         –           –            –
orange                –        –            –          ~1          ~1           ~1
green                 –        –            –          ~1          ~1           ~1
blue                  –        –            –          ~1          ~1           ~1
purple                ~1       ~1           ~1         ~0          ~0           ~0
white                 –        –            –          ~1          ~1           ~1
black                 –        –            –          ~0          ~0           ~0

– indicates irrelevance

- Quietness: The degree of quietness G_quiet^V for a video can be improved by the presence of blue, orange, green, and white colors, and lowered by the presence of black and purple. Quiet videos feature horizontal framings. A few long sequences might be present, possibly joined through dissolves.

- Relaxation: Videos that communicate relaxation (with degree G_relax^V) don't show relevant motion components.

- Happiness: Videos that communicate happiness (with degree G_happy^V) share the same features as quiet videos but also exhibit a relevant motion component.

Presently, we use a linear approximation to model emotional feature mapping, constructed by weight adaptation according to a linear regression scheme. Following the above scheme, the extent to which a video k conforms to one of the six classes is computed through six scores {G_j^V(k)}, j = 1, ..., 6, each representing a weighted combination of low-level video features. To achieve good discrimination between the six classes of commercials requires that for a generic video belonging to category j, the ratio G_j^V/G_m^V must be high, at least for all the categories m not a generalization of category j (we call these categories j-opponent categories).

Model validation and retrieval results
At the Visual Information Processing Lab of the University of Florence, we've used the compositional semantics framework expounded earlier to develop systems for retrieving still images and videos based on both expressive and emotional content.13,14 We discuss the method used in these systems for extracting perceptual features from raw visual data in the sidebar "Perceptual Features for Still Images and Videos."

In this section, we present examples of retrieving art images and commercial videos according to the expressive level. You can find retrieval based on emotional features elsewhere.15 We evaluated retrieval performance in terms of effectiveness, that is, a measure of the agreement between human evaluators and the system in ranking a test set of images according to their similarity to a query. Measuring effectiveness reliably requires a small image test set (typically 20 to 50 images). Given a sample query, we evaluated the agreement as the percentage of human evaluators who rank images in the same (or very close) position as the system does. We define effectiveness as

S_j(i) = Σ_{k = P_j(i) - Δ_j(i)}^{P_j(i) + Δ_j(i)} Q_j(i, k)

where i is an image from the test set, j denotes the sample query, and Δ_j(i) is the width of the window centered in the rank P_j(i) assigned by the system. Q_j(i, k) is the percentage of people who ranked the i-th image in a position between P_j(i) - Δ_j(i) and P_j(i) + Δ_j(i).
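The effectiveness measure translates directly into code. The sketch below is our illustration (the data layout is assumed): it sums the percentages of evaluators whose ranking of an image falls within the window centered on the system's rank.

```python
from typing import Dict

def effectiveness(P: int, delta: int, Q: Dict[int, float]) -> float:
    """S_j(i): sum of Q_j(i, k) for k in [P - delta, P + delta].

    P      -- rank assigned by the system to image i for query j
    delta  -- width of the agreement window around that rank
    Q[k]   -- fraction of evaluators who ranked image i at position k
    """
    return sum(Q.get(k, 0.0) for k in range(P - delta, P + delta + 1))

# Example: the system ranks the image 3rd; evaluators' rank distribution follows.
Q = {1: 0.05, 2: 0.20, 3: 0.40, 4: 0.25, 5: 0.10}
print(effectiveness(P=3, delta=1, Q=Q))   # 0.20 + 0.40 + 0.25, about 0.85
```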
Perceptual Features for Still Images and Videos

Here we discuss the perceptual features for still images and videos.

Still images
An image's semantics relates to its color content and the presence of elements such as lines that induce dynamism and action.

Colors
Color cluster analysis helps segment images into color regions. We can obtain clustering in 3D space by using an improved version of the standard K-means algorithm, which avoids converging with nonoptimal solutions. This algorithm uses competitive learning as the basic technique for grouping points in the color space. An image's chromatic content can be expressed using a set of eight numbers normalized in [0, 1], denoting one color out of the set {red, orange, yellow, green, blue, purple, white, and black}. Each number quantifies the presence in the image of a region exhibiting the i-th color.

Lines
Detecting significant line slopes in an image can be accomplished by using the Hough transform to generate a line slope histogram (see Figure B). The feature hor/vert ∈ [0, 1] gives the ratio of horizontal and vertical lines with respect to the overall number of lines in the image.

Videos
Video analysis primarily aims to segment video, that is, to identify each shot's start and end points and the video's characterization through its most representative keyframes. Once a video has been fragmented into shots and video editing features have been extracted, each shot's content can be internally described by segmenting each shot's keyframe as described in the previous section of this sidebar.

Cuts
Rapid motion in the scene and sudden changes in lighting yield low correlation between contiguous frames, especially in cases adopting a high temporal subsampling rate. To avoid false cut detection, we studied a metric insensitive to such variations while reliably detecting true cuts. We partition each frame into nine subframes and represent each subframe by considering the color histograms in the hue, saturation, intensity (HSI) color space. We detect cuts by considering the volume of the difference of subframe histograms in two consecutive frames. The presence of a cut at frame i can be detected by thresholding the average volume value over the nine subframes. After repeating the above procedure for all frames i = 1 ... #frames, we simply obtain the overall feature related to the presence of cuts in a video as cuts = #cuts/#frames, where cuts ∈ [0, 1].

Dissolves
The dissolve effect merges two sequences by partly overlapping them. Detecting dissolves in commercials proves particularly difficult because dissolves typically occur in a limited number of consecutive frames. Due to this peculiarity, existing approaches to detect dissolves (developed for movies) have shown poor performance. We use instead corner statistics to detect dissolves. During the editing effect, corners associated with the first sequence gradually disappear and those associated with the second sequence gradually appear. This yields a local minimum in the number of corners detected during the dissolve (see Figure C). An image corner is characterized by large and distinct values of the gradient auto-correlation matrix's eigenvalues. We evaluate the feature dissolves ∈ [0, 1] as dissolves = #dissolves/#frames.

Figure B. Perceptual features for still images. From left to right: video frame, computed edges, and line slope histogram.
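The cut detector just described can be sketched as follows. This is a simplified stand-in rather than the sidebar's implementation: OpenCV's HSV conversion replaces the HSI space, the histogram difference is a plain absolute difference, and the threshold value is an arbitrary assumption.

```python
import cv2
import numpy as np

def subframe_hists(frame: np.ndarray, grid: int = 3) -> list:
    """Color histograms for a 3x3 partition of the frame (nine subframes)."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    h, w = hsv.shape[:2]
    hists = []
    for r in range(grid):
        for c in range(grid):
            sub = hsv[r * h // grid:(r + 1) * h // grid,
                      c * w // grid:(c + 1) * w // grid]
            hist = cv2.calcHist([sub], [0, 1, 2], None, [8, 8, 8],
                                [0, 180, 0, 256, 0, 256])
            hists.append(cv2.normalize(hist, None).flatten())
    return hists

def cuts_feature(frames: list, threshold: float = 0.5) -> float:
    """Return cuts = #cuts / #frames, thresholding the average subframe difference."""
    n_cuts = 0
    prev = subframe_hists(frames[0])
    for frame in frames[1:]:
        cur = subframe_hists(frame)
        # Average histogram difference over the nine subframes.
        diff = np.mean([np.abs(a - b).sum() for a, b in zip(prev, cur)])
        if diff > threshold:
            n_cuts += 1
        prev = cur
    return n_cuts / len(frames)
```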
Figure C. Corners detected during a dissolve effect.

Motion
We analyze motion by tracking corners in a sequence. For each shot, we compute a feature motion ∈ [0, 1] that represents the average intensity of motion. motion = 0 means that motion is absent during the sequence. Higher values of motion indicate the presence of increasingly relevant motion components.

Inter-shot features
We represent color-related inter-shot features used in a video as recurrent ∈ [0, 1] (expressing the ratio between colors that recur in a high percentage of keyframes and the overall number of significant colors in the video sequence) and saturated (expressing the relative presence of saturated colors in the scene). Another relevant inter-shot characteristic, the rhythm of a sequence of shots, relates to shot duration and to the use of cuts and dissolves to join shots. We define the rhythm r(i1, i2) ∈ [0, 1] of a video sequence over a frame interval [i1, i2] as

r(i1, i2) = (#cuts + #dissolves) / (i2 - i1 + 1)

where #cuts and #dissolves are measured in the same interval. A simple feature measuring an entire sequence's internal rhythm is the average rhythm, which relates to the overall number of breaks: edit = r(1, #frames).
Image retrieval
For art image retrieval, we used a test set of 40 images, including Renaissance to contemporary painters. We measured retrieval effectiveness with reference to four different queries, addressing contrasts of luminance, warmth, saturation, and harmony.

We collected answers from 35 experts in the fine arts field. We asked them to rank database images according to each reference query. Figure 3 shows the test database images. Figure 4 shows the eight sets of top-ranked images the system retrieved in response to the four reference queries. (For more information about these images visit http://www.cineca.it/wm/.)

Figure 5 shows the plots of effectiveness S as a function of rankings. They show the agreement between the people interviewed and the system in ranking images according to the queries. Figure 5 shows only rankings from 1 to 8, since they represent agreement on the most representative images. The plots show a very large agreement between the experts and the system in assigning similarity rankings.

Figure 3. Images used to test system effectiveness for the representation of expressive content. (Courtesy of WebMuseum)
Figure 4. Top-ranked images according to queries for contrast of luminance (a), contrast of saturation (b), contrast of warmth (c), and harmonic accordance (d). (Courtesy of WebMuseum)

Figure 5. System effectiveness measured against experts for four different sample queries: contrast of luminance (a), contrast of saturation (b), contrast of warmth (c), and harmonic accordance (d).

Figure 6. Results of a query for images with two large regions showing contrasting luminance.

Figure 6 shows an example of retrieval according to the expressive level using a database of 477 images representing paintings from the 15th to the 20th century. Two dialog boxes define properties (hue and dimension) of the two sketched regions of Figure 6. The user may define the degree of truth (60 percent) assigned to the query. Figure 6 (right) shows the retrieved paintings. The 12 best-matched images all feature large regions with contrasting luminance. The top-ranked image represents an outstanding example of luminance contrast, featuring a black region over a white background. Images in the second, third, fifth, sixth, and seventh positions exemplify how contrast of luminance between large regions can be used to convey the perception of different planes of depth.
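Given per-image degrees of truth computed by the model-checking engine, retrieval by the expressive level reduces to thresholding and ranking. The fragment below is a hypothetical sketch of ours (the score callable and the 60 percent threshold mirror the example above but are not the system's code).

```python
from typing import Callable, List, Tuple

def retrieve(images: List[str], score: Callable[[str], float],
             min_truth: float = 0.6, top_k: int = 12) -> List[Tuple[str, float]]:
    """Rank images by the degree of truth of the query formula, dropping low scores."""
    scored = [(img, score(img)) for img in images]
    kept = [(img, s) for img, s in scored if s >= min_truth]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:top_k]

# Example with precomputed degrees of truth for three images.
scores = {"img_01": 0.92, "img_02": 0.55, "img_03": 0.74}
print(retrieve(list(scores), scores.get))   # [('img_01', 0.92), ('img_03', 0.74)]
```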
Video retrieval
We evaluated video retrieval effectiveness using a test set of 20 commercial videos. A team of five experts in the semiotic and marketing fields classified each video in terms of the four semiotic categories. We used the common representation of commercials' categories based on the semiotic square. This instrument, first introduced by Greimas,10 combines pairs of four semiotic objects with the same semantic level, according to three basic relationships: opposition, completion, and contradiction. Objects placed at opposite sides of the square are complementary. Figure 7a shows the practical-playful and critical-utopic diagonals of the semiotic square as coordinate axes. We asked the experts to classify the commercials by associating each commercial with a position on the square. To ease the classification task, 25 rectangular regions were identified in the square through a regular partition, which supports the definition of a video's three different degrees of conformity to a generic category. For instance, all videos classified in region 4 of Figure 7a feature a high degree of conformity to the utopic category and a medium conformity to the playful one.

Figure 7. Representation of database commercials' semiotic features (a) along with the experts' (b) and system's (c) classification of them.

Figure 7 shows how the experts (Figure 7b) and the system (Figure 7c) classified database commercials. Each region (cylinders of zero height aren't shown) in the square has a vertical cylinder, whose height is proportional to the percentage of commercials located in that region.

Figure 8. Plots of the agreement between system and experts' classification of commercials with reference to queries for purely practical (a), critical (b), utopic (c), and playful (d) commercials.

Figure 8 shows a more accurate measure of system effectiveness by displaying the agreement between the system and experts with reference to queries for purely practical, critical, utopic, and playful commercials. Each query considered the five top-ranked commercials. For a generic commercial, we measured the agreement between the system and experts as A = 1 - d/dmax, where d is the city-block distance between the two blocks in the semiotic square where the system and the experts located the commercial. dmax is the maximum value of d, here 8. The best average agreement corresponds to the query for playful commercials, evidencing the effectiveness of the features used to model this category. The worst performance corresponds to queries for practical commercials. In this case, the system classified many commercials as practical, while the experts rated them as critical. Such a classification mismatch originates from the recurrent presence in critical commercials of foreground views of the promoted product. The experts can easily detect and recognize a foreground view, but the system can't detect this feature.
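The agreement measure lends itself to a direct sketch: with the square partitioned into a 5 x 5 grid of blocks, the city-block distance between the block chosen by the system and the one chosen by the experts gives A = 1 - d/dmax, with dmax = 8 for a 5 x 5 grid. The code below is an illustrative rendering of ours under that assumption.

```python
def agreement(system_block: tuple, expert_block: tuple, d_max: int = 8) -> float:
    """A = 1 - d/dmax, with d the city-block distance between grid blocks (row, col)."""
    d = abs(system_block[0] - expert_block[0]) + abs(system_block[1] - expert_block[1])
    return 1.0 - d / d_max

# Example: system places a commercial in block (1, 2), experts in block (3, 3).
print(agreement((1, 2), (3, 3)))   # d = 3 -> A = 1 - 3/8 = 0.625
```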
Figures 9 through 12 show examples of video retrieval according to the expressive level using a database of 131 commercials. Users can specify queries by defining the degree by which the retrieved commercials have to conform to the four semiotic categories. Figure 9 shows the output of the retrieval system in response to a query for purely playful commercials. At the right of the list of retrieved items, a bar graph displays the degree by which each shot of the top-ranked video (the white, thin vertical lines represent cuts and dissolves) belongs to the playful category. The top-ranked videos in this category all advertise products for young and smart people (sportswear, sport watches, blue jeans, and so on). These results aren't surprising, since they reflect the common marketing practice of targeting a commercial to a specific audience.

Figure 9. Retrieval of playful commercials. (Courtesy of Mediaset)

Figure 10. Some of the keyframes for the first-ranked spot in Figure 9 (Sergio Tacchini).

Figures 10 and 11 report some of the keyframes for the two top-ranked spots in Figure 9. The first best-ranked spot (advertising sportswear) presents all the typical features of a playful commercial, including a very fast rhythm, unorthodox camera takes, situations (like skating in a tennis court) reflecting semantic issues at a higher level than that of our computer analysis, and very saturated colors. Similar features appear in the second-ranked commercial. In Figure 11 notice the presence of quasi-identical keyframes (the close-ups of the man) typical of a nonlinear story. In this spot the camera frenetically switches between close-up views of the man and his dogs, almost never alternating details and global views like a utopic spot would do.

Figure 11. Some of the keyframes for the second-ranked spot in Figure 9 (Audi A3). (Courtesy of RAI)

Figure 12 shows the output of the retrieval system in response to a query for purely critical commercials. In contrast to the previous example, the top-ranked videos obtained in response to the query all advertise typical home products.

Figure 12. Retrieval of critical commercials.
Figures 13 and 14 show some relevant frames for the two top-ranked commercials in Figure 12.

Figure 13. Some of the keyframes for the first-ranked spot in Figure 12 (Pasta Barilla). (Courtesy of RAI)

Figure 14. Some of the keyframes for the second-ranked spot in Figure 12 (Cera Emulsio). (Courtesy of Mediaset)

Conclusions and future work
If, as the adage says, an image is worth a thousand words, then every designer of multimedia retrieval systems is well aware that the converse also holds true. That is, many times a word can stand for a thousand images, since it represents a class of equivalence of objects, thus reflecting a higher semantic level than that of the objects themselves. We believe that future multimedia retrieval systems will have to support access to information at different semantic levels to reflect diverse application needs and user queries.

The research presented in this article attempts to introduce different levels of signification by a layered representation of visual knowledge. The most relevant insight we gained is that defining rules that capture visual meaning can be difficult, especially when dealing with general visual domains. As a good design practice, such rules should derive from specific domain characterizations, and possibly be refined and tailored to specific classes of users.

Future work will address experimenting with different application scenarios (such as TV news and movies), and complementing visual features and language sentences with textual and audio data. MM

Acknowledgments
We thank Bruno Bertelli and Laura Lombardi for useful discussions during the development of this work. We'd also like to acknowledge all those experts who provided their support in the performance tests.

References
1. T. Joseph and A. Cardenas, "PicQuery: A High-Level Query Language for Pictorial Database Management," IEEE Trans. on Software Engineering, Vol. 14, No. 5, May 1988, pp. 630-638.
2. N. Roussopolous, C. Faloutsos, and T. Sellis, "An Efficient Pictorial Database System for Pictorial Structured Query Language (PSQL)," IEEE Trans. on Software Engineering, Vol. 14, No. 5, May 1988, pp. 639-650.
3. M. Flickner et al., "Query by Image and Video Content: The QBIC System," Computer, Vol. 28, No. 9, Sept. 1995, pp. 310-315.
4. J.R. Smith and S.F. Chang, "VisualSeek: A Fully Automated Content-Based Image Query System," Proc. ACM Multimedia 96, ACM Press, New York, Nov. 1996.
5. D.A. White and R. Jain, "Similarity Indexing with the SS-tree," Proc. of the 12th Int'l Conf. on Data Engineering, IEEE Computer Society Press, Los Alamitos, Calif., Feb. 1996, pp. 516-523.
6. R.J. Brachman and H.J. Levesque, eds., Readings in Knowledge Representation, Morgan Kaufmann, Los Altos, Calif., 1985.
7. R. Arnheim, Art and Visual Perception: A Psychology of the Creative Eye, Regents of the University of California, Palo Alto, Calif., 1954.
8. J. Itten, Art of Color (Kunst der Farbe), Otto Maier Verlag, Ravensburg, Germany, 1961 (in German).
9. C.R. Haas, Advertising Practice (Pratique de la Publicité), Bordas, Paris, 1988 (in French).
10. A.J. Greimas, Structural Semantics (Sémantique Structurale), Larousse, Paris, 1966 (in French).
11. J.-M. Floch, Semiotics, Marketing, and Communication: Below the Signs, the Strategies (Sémiotique, Marketing et Communication: Sous les signes, les stratégies), University of France Press, Paris, 1990 (in French).
12. R.C. Carter and E.C. Carter, "CIELUV Color Difference Equations for Self-Luminous Displays," Color Research and Applications, Vol. 8, No. 4, 1983, pp. 252-553.
13. J.M. Corridoni, A. Del Bimbo, and P. Pala, "Sensations and Psychological Effects in Color Image Database," ACM Multimedia Systems, Vol. 7, No. 3, May 1999, pp. 175-183.
14. M. Caliani et al., "Commercial Video Retrieval by Induced Semantics," Proc. IEEE Int'l Workshop on Content-Based Access of Images and Video Databases, IEEE CS Press, Los Alamitos, Calif., Jan. 1998, pp. 72-80.
15. C. Colombo, A. Del Bimbo, and P. Pala, "Retrieval of Commercials by Video Semantics," Proc. IEEE Int'l Conf. on Computer Vision and Pattern Recognition (CVPR 98), IEEE CS Press, Los Alamitos, Calif., June 1998, pp. 572-577.

Carlo Colombo is an assistant professor in the Department of Systems and Informatics at the University of Florence, Italy. His main research activities are in the field of computer vision, with specific interests in image and video analysis, human-machine interfaces, robotics, and multimedia. He holds an MS in electronic engineering from the University of Florence, Italy (1992) and a PhD in robotics from the Sant'Anna School of University Studies and Doctoral Research, Pisa, Italy (1996). He is a member of IEEE and the International Association for Pattern Recognition (IAPR), and presently serves as secretary to the IAPR Italian Chapter.

Alberto Del Bimbo is a professor in the Department of Systems and Informatics at the University of Florence, Italy. His scientific interests cover image sequence analysis, shape and object recognition, image databases and multimedia, visual languages, and advanced man-machine interaction. He earned his MS in electronic engineering at the University of Florence, Italy in 1977. He is presently an associate editor of Pattern Recognition, Journal of Visual Languages and Computing, and IEEE Transactions on Multimedia. He is a member of IEEE and IAPR. He presently serves as the Chairman of the Italian Chapter of IAPR. He is the general chair of the IEEE Int'l Conference on Multimedia Computing and Systems (ICMCS 99).

Pietro Pala received his MS in electronic engineering at the University of Florence, Italy, in 1994. In 1998, he received his PhD in information science from the same university. Currently he is a research scientist at the Department of Systems and Informatics at the University of Florence. His current research interests include recognition, image databases, neural networks, and related applications.

Contact the authors at the Department of Systems and Informatics, University of Florence, Via Santa Marta 3, I-50139, Florence, Italy, e-mail {columbus, delbimbo, pala}@dsi.unifi.it.
