Semantics in Visual Information Retrieval

Carlo Colombo, Alberto Del Bimbo, and Pietro Pala
University of Florence, Italy

A compositional approach increases the level of representation that can be automatically extracted and used in a visual information retrieval system. Visual information at the perceptual level is aggregated according to a set of rules. These rules reflect the specific context and transform perceptual words into phrases capturing pictorial content at a higher, and closer to the human, semantic level.

Visual information retrieval systems have entered a new era. First-generation systems allowed access to images and videos through textual data.1,2 Typical searches for these systems include, for example, all images of paintings of the Florentine school of the 15th century, or all images by Cezanne with landscapes. Such systems expressed information through alphanumeric keywords or scripts. They employed representation schemes like relational models, frame models, and object-oriented models. On the other hand, current-generation retrieval systems support full retrieval by visual content.3,4 Access to visual information is performed not only at a conceptual level, using keywords as in the textual domain, but also at a perceptual level, using objective measurements of visual content. In these systems, image processing, pattern recognition, and computer vision constitute an integral part of the system's architecture and operation. They objectively analyze pixel distribution and extract the content descriptors automatically from raw sensory data. Image content descriptors are commonly represented as feature vectors, whose elements correspond to significant parameters that model image attributes. Therefore, visual attributes are regarded as points in a multidimensional feature space, where point closeness reflects feature similarity.

These advances (for comprehensive reviews of […]) paved the way for third-generation systems, featuring full multimedia data management and networking support. Forthcoming standards such as MPEG-4 and MPEG-7 (see the Nack and Lindsay article in this issue) provide the framework for efficient representation, processing, and retrieval of visual information.

Yet many problems must still be addressed and solved before these technologies can emerge. An important issue is the design of indexing structures for efficient retrieval from large, possibly distributed, multimedia data repositories. To achieve this goal, image and video content descriptors can be internally organized and accessed through multidimensional index structures.5 A second key problem is to bridge the semantic gap between the system and users, that is, to devise representations capturing visual content at high semantic levels especially relevant for retrieval tasks. Specifically, automatically obtaining a representation of high-level visual content remains an open issue. Virtually all the systems based on automatic storage and retrieval of visual information proposed so far use low-level perceptual representations of pictorial data, which have limited semantics.

Building up a representation proves tantamount to defining a model of the world, possibly through a formal description language, whose semantics capture only a few significant aspects of the information content.6 Different languages and semantics induce diverse world representations. In text, for example, the meaning of single words is specific yet limited, and an aggregate of several words (a phrase) produces a higher degree of significance and expressivity. Hence, the rules for the syntactic composition of signs in a given language also generate a new world representation, offering richer semantics in the hierarchy of signification.

To avoid equivocation, a retrieval system should embed a semantic level reflecting as much as possible the one humans refer to during interrogation. The most common way to enrich a visual information retrieval system's semantics is to annotate pictorial information manually at storage time through a set of external keywords describing the pictorial content. Unfortunately, textual annotation has several problems:

1. It's too expensive to go through manual annotation with large databases.

2. Annotation is subjective (generally, the anno- […]
[…] properties of a visual message can be represented through a set of scores P = {pi}, i = 1, …, n, each score pi ∈ [0, 1] representing the extent to which the i-th feature appears in the message. We devised two distinct levels of the signification hierarchy, namely the expressive and the emotional levels, as plausible intermediate steps involved in the construction of meaning.

[…] expressive level. However, following these rules doesn't guarantee a pleasant result for the audience. In other words, musical fruition and understanding involve adopting aesthetic criteria that go beyond expression (that is, the syntax of meaning) and reach emotion as the ultimate semantics of meaning. (The art of J.S. Bach provides a remarkable example of how to reach musical beauty by a skillful adherence to the formal rules of 18th-century music.) We formally construct the emotional level, at the top of our signification hierarchy, from the levels below, namely the expressive and perceptual levels. With a notation similar to the one above, rules at the emotional level are represented through functions G acting over the set of perceptual and expressive features P ∪ F and returning a score that expresses the degree of truth by which the rule G holds. Hence, a rule Gk can be defined as

Gk : [0, 1]^(n+m) → [0, 1]

Operators of logical composition between rules can extend the representation's semantics. The set G = {Gk} qualifies the content of the visual message at the emotional level.

Perceptual, expressive, and emotional features qualify the meaning of a visual message at different levels of signification. For all these levels, construction rules depend on the specific data domain to which they refer (such as movies, commercials, and TV news for videos; paintings, photographs, and trademarks for still images). Specifically, the expressive level features objective properties that generally depend on collective cultural backgrounds. The emotional level, on the contrary, relies on subjective elements such as individual cultural background and psychological moods.

Figure 1. (a) Contrasts and colors in an art image. (b) The Itten sphere (equatorial and longitudinal sections). (c) The polygons generating three- and four-chromatic accordance.

Background theories

Here we present theories that provide a reference framework for developing expressive and emotional rules in the domains of art images and commercial videos.
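The logical composition of emotional rules Gk over feature scores in [0, 1] can be sketched with standard fuzzy-logic operators (minimum for conjunction, maximum for disjunction). The feature names and the two rule bodies below are hypothetical illustrations, not the actual rule set of the system described in the article:

```python
# Sketch: emotional-level rules as functions over perceptual and
# expressive feature scores in [0, 1], composed with fuzzy operators.
# Feature names and rule definitions are invented for illustration.

def fuzzy_and(*scores):
    # Fuzzy conjunction: the minimum of the component scores.
    return min(scores)

def fuzzy_or(*scores):
    # Fuzzy disjunction: the maximum of the component scores.
    return max(scores)

def rule_action(features):
    # Hypothetical rule: action suggested by warmth contrast
    # occurring together with slanted lines.
    return fuzzy_and(features["warmth_contrast"], features["slanted_lines"])

def rule_uneasiness(features):
    # Hypothetical rule: uneasiness from absence of hue contrast
    # or from the presence of purple regions.
    return fuzzy_or(1.0 - features["hue_contrast"], features["purple"])

features = {"warmth_contrast": 0.8, "slanted_lines": 0.6,
            "hue_contrast": 0.3, "purple": 0.5}
G = {"action": rule_action(features),
     "uneasiness": rule_uneasiness(features)}
```

Each rule returns a degree of truth in [0, 1], so rule outputs can themselves be composed with the same operators, which is one way to read the article's remark that logical composition extends the representation's semantics.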
[…] of luminance and three levels of saturation. These colors are arranged in a chromatic sphere called the Itten sphere (Figure 1b), such that perceptually contrasting colors have opposite coordinates with respect to the center of the sphere.

Analyzing the polar reference system, we can identify four different types of color contrast: pure colors, light-dark, warm-cold, and saturated-unsaturated (quality). Psychological studies have suggested that, in western culture, red-orange environments induce a sense of warmth (yellow through red-purple are warm colors). Conversely, green-blue conveys a sensation of cold (yellow-green through purple are cold colors). Cold sensations can be emphasized by the contrast with a […]
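A minimal warm/cold classifier in the spirit of the Itten reading above can be sketched as follows. The numeric hue boundaries are our own rough assumptions (red at 0 degrees on an HSV-style wheel), not values given in the article:

```python
# Sketch: classifying hues as warm or cold following the text's reading
# of Itten (yellow through red-purple warm, yellow-green through purple
# cold). The degree boundaries below are assumed, not the paper's.

def is_warm(hue_degrees):
    """True for hues on the warm half of the color wheel (red at 0)."""
    h = hue_degrees % 360
    return h < 75 or h >= 285   # roughly red-purple .. red .. yellow

def warmth_contrast(hue_a, hue_b):
    """1.0 when one hue is warm and the other cold, else 0.0."""
    return 1.0 if is_warm(hue_a) != is_warm(hue_b) else 0.0

score = warmth_contrast(30, 210)   # orange vs. blue: a warm-cold contrast
```

A fuller implementation would work on the region descriptors produced by segmentation rather than on single hues, and would grade the contrast continuously instead of returning a binary score.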
[…]tinct steps for the production of meaning:10

1. an abstract level formed by narrative structures, that is, structures including all those basic signs that create meaning and those values determined by sign combinations, and

2. a concrete level formed by discourse structures, describing the way in which the author uses narrative elements to create a story.

Representing the expressive and emotional level for commercial videos inherits some of the features introduced before for still images and adds new features related to the way frames are concatenated. Semiotics classifies advertising images, and specifically commercials, into four different categories (noted below) that relate to the narrative element.11 Directors will use the main narrative signs of the video (camera breaks, colors, editing effects, rhythm, shot angles, and lines) in a peculiar way, depending on the specific video typology considered.

Utopic commercials provide evidence that the product can succeed in critical tests. Here, the story doesn't follow a realistic plot. Rather, situations appear as in a dream. Mythical scenarios are chosen to present the product, shown to succeed in critical conditions often in a fantastic and unrealistic way. The director creates a movie-like atmosphere, with a set of dominant colors defining a closed-chromatic world and with all the traditional editing effects (cuts, dissolves) possibly taking place.

Playful commercials emphasize the accordance between users' needs and product qualities. Here a manifest parody of the other typologies of ads takes place. The commercial clearly states to the audience that they're watching advertising material. Situations and places visibly differ from everyday life, and they're deformed in such a caricatural and grotesque fashion that the agreement between product qualities and purchasers' needs is often shown in an ironic way (such as an old woman driving a Ferrari). The director emphasizes the presence of the camera in the ad and uses all possible effects to stimulate the active participation of the audience. Also, everything looks strange and false: use of unnatural colors, improbable camera takes, and so on.

[…] Itten's theory and semiotic principles to represent the visual content at the expressive level. Our work supports defining formal rules to qualify the effects of color, geometric, and dynamic feature combinations. We've also exploited psychological principles of visual communication to represent the visual content at the emotional level. In the next section, we use the semantic descriptors automatically extracted from art images and commercial videos for retrieval purposes.

Content representation for art images

Here we discuss the rules needed to represent the content of art images.

Expressive level. Exploiting Itten's model to qualify an image's chromatic content at an expressive level requires

1. segmenting the image into regions characterized by uniform colors, and

2. representing chromatic and spatial features of image regions.

We segment images by looking for clusters in the color space, then back-projecting cluster centroids in the feature space onto the image. Segmentation occurs through selecting the appropriate color space so that small feature distances correspond to similar colors in the perceptual domain. Adopting the International Commission on Illumination (Commission Internationale de L'Eclairage, CIE) L*u*v* space12 accomplishes this.

Once segmented into regions, the image's region features can be described in terms of intra-region and inter-region properties. Intra-region properties include region color, warmth, hue, luminance, saturation, position, and size. Inter-region properties consist of contrasts of hue, saturation, warmth, and luminance, and harmony. To manage the vagueness of chromatic properties, we use a fuzzy representation model to describe a generic property's value. Therefore, we can describe a generic property by introducing n reference values for that property, and then considering n scores (one per reference value), where the i-th score measures the extent to which the region conforms to the i-th reference value. For instance, if we introduce three reference values for the luminance of a region (corresponding to dark, medium, and bright), the descriptor (0.0, 0.1, 0.9) represents a very bright region's luminance.

According to Itten's rules, these abstract properties must be translated into language sentences. In formal language theory notation, region formulas (φ) that characterize chromatic and arrangement properties of color patches represent these sentences through

φ := region | hue = h | lum = l | sat = s | warmth = w | size = S | position = p | Contrast(φ1, φ2) | Harmony(φ1, …, φn) | φ1 ∧ φ2 | φ1 ∨ φ2

where the right-hand sides are feasible values for the corresponding measures, with h, l, s, and w denoting the attributes of hue, luminance, saturation, and warmth, respectively. We define semantic clauses in terms of a fulfillment relation |= of a generic formula φ on a region R. The degree of truth by which φ is verified over R is expressed by a value that's computed considering fuzzy descriptors of region properties. We code semantic clauses into a model-checking engine. Given a generic formula φ and an image I, the engine computes a score representing the degree of truth by which φ is verified over I (see the sidebar "Model-Checking Engine" on the next page).13

Emotional level. The psychological analysis of effects induced by images suggests that we can use only a limited subset of primary emotions to express the great variability of human emotions (secondary emotions). Explicitly defining the rules mapping expressive and perceptual features onto emotions would require us to build an overcomplicated model taking into account cultural and behavioral issues. Hence, we prefer to determine the relative relevance of each single feature through an adaptation process that fits a specific culture and fashion.

Explicitly, the definition of the rules mapping expressive and perceptual features onto emotions involves

1. Identifying a set of primary emotions. We identified four primary emotions, namely action, relaxation, joy, and uneasiness, as the most relevant to express the interaction of humans with images. Referring to the general scheme presented above, secondary emotions such as fear and aggressiveness can be expressed as combinations of action and uneasiness.

2. Identifying for each primary emotion a set of plausible inputs. In our model, we expect contrasts of warmth and contrasts of hue to generate a sense of action and dynamism. The presence of lines with a high slope reinforces this.

3. Defining a training set for the model. Given a database of images, we use some images as templates of primary emotions. This training set adapts weights of perceptual features to

July–September 1999
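The fuzzy descriptors for region properties can be sketched as follows, using the text's own example of a very bright region with luminance descriptor (0.0, 0.1, 0.9). The dictionary layout and function names are our own illustrative assumptions:

```python
# Sketch: fuzzy descriptors for region properties and the degree of
# truth of an atomic region formula such as "lum = bright". Reference
# values follow the text's example; the data layout is assumed.

LUM_LEVELS = ("dark", "medium", "bright")   # n = 3 reference values

def degree_of_truth(region, prop, value, levels=LUM_LEVELS):
    """Score in [0, 1]: how well the region satisfies prop = value."""
    scores = region[prop]                   # fuzzy descriptor, one score
    return scores[levels.index(value)]      # per reference value

# A very bright region: descriptor (0.0, 0.1, 0.9), as in the text.
region = {"lum": (0.0, 0.1, 0.9)}
mu = degree_of_truth(region, "lum", "bright")
```

The same scheme extends to hue, saturation, warmth, size, and position by adding further reference-value tuples per property.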
Model-Checking Engine

A model-checking engine decides the satisfaction of a formula φ over an image I with a two-step process. First, the engine recursively decomposes formula φ into subformulas in a top-down manner. This allows the representation of φ with a tree in which leaf nodes represent intra-region specifications. Afterwards, the engine labels regions in the image description with the subformulas they satisfy in a bottom-up approach. The engine decides the satisfaction of region formulas by directly referring to the chromatic fuzzy descriptors. This first labeling level is then exploited to decide the satisfaction of composition formulas. In this labeling process, the engine first checks if the region contains pixels that can satisfy the subformula. If it can, the engine labels the region with the subformula and with the degree of truth by which the region satisfies it. When the degree of truth falls under a given minimum threshold, the engine drops the image from the candidate list.

In Figure A, the engine computes the degree by which the formula φ = Contrastw(hue = orange, size = large) holds over an image. The engine assumes that the image has just been segmented. For simplicity, the engine considers only four regions, namely R1, R2, R3, and R4. First, φ decomposes into subformulas:

φ3: Contrastw(φ1, φ2)
φ2: size = large
φ1: hue = orange
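The two-step check can be sketched as a small recursive evaluator. The region descriptors are toy data, and the min/max semantics below is a loose stand-in for the engine's actual per-region labeling and thresholding:

```python
# Sketch: top-down decomposition of a formula into subformulas, then
# bottom-up scoring against fuzzy region descriptors. The descriptors
# are invented, and min/max composition is a simplification of the
# engine's per-region labeling described in the sidebar.

def check(formula, regions):
    """Return a degree of truth in [0, 1] for `formula` over the image."""
    op = formula[0]
    if op == "atom":                 # leaf: intra-region specification
        _, prop, idx = formula
        # Best degree of truth over all regions for this atomic formula.
        return max(r[prop][idx] for r in regions.values())
    if op == "contrast":             # composition over two subformulas
        _, f1, f2 = formula
        # Fuzzy conjunction: both subformulas must be satisfied.
        return min(check(f1, regions), check(f2, regions))
    raise ValueError("unknown operator: %s" % op)

# phi = Contrast_w(hue = orange, size = large), as in Figure A; index 1
# of "hue" stands for orange, index 2 of "size" for large (assumed).
phi = ("contrast", ("atom", "hue", 1), ("atom", "size", 2))
regions = {
    "R1": {"hue": (0.1, 0.9, 0.0), "size": (0.0, 0.2, 0.8)},
    "R2": {"hue": (0.7, 0.2, 0.1), "size": (0.1, 0.8, 0.1)},
}
score = check(phi, regions)
```

A faithful engine would additionally track which region satisfies which subformula and drop the image once a score falls below the minimum threshold, as the sidebar describes.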
[…] where F_warmthc and F_huec represent the expressive features measuring the presence of warmth and hue contrasts, and slanted the perceptual feature measuring the presence of slanted lines (the wi's are the feature weights). We can measure the degree of relaxation an image communicates by […], and the degree of joy by

G^I_joy = g^I_3(F_harmony, w_j1)

where F_harmony represents the expressive feature measuring the presence of regions in harmonic accordance. Finally, we measure the degree of uneasiness an image communicates as

G^I_uneas = g^I_4(F_nhuec, yellow, purple, w_u1, w_u2, w_u3)

where F_nhuec represents the expressive feature measuring the absence of contrasts of hue, and yellow and purple the perceptual features measuring the presence of yellow and purple regions. Presently we use a linear mapping to model the g^I_i functions and a linear regression scheme to achieve adaptation of the weights wi based on a set of training examples.

Table 1. Perceptual mapping onto semiotic categories (column placement of the entries was lost in reproduction; blank entries indicate irrelevance).
  Semiotic categories: F_practical, F_playful, F_utopic, F_critical
  saturated: ~0, ~1
  recurrent: ~1, ~0
  hor/vert: ~1, ~0, ~1
  cuts: ~1, ~1
  dissolves: ~1, ~1
  edit: ~0
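The linear mapping with regression-based weight adaptation can be sketched as below, reduced to a single feature per emotion for clarity (a full implementation would fit all weights jointly, for example by least squares). The training pairs are invented:

```python
# Sketch: a linear mapping g_i with its weight fitted by least squares
# on (feature score, annotated emotion degree) training pairs. The
# single-feature reduction and the data are illustrative assumptions.

def fit_weight(samples):
    """Least-squares weight for score = w * feature (no intercept)."""
    num = sum(f * s for f, s in samples)
    den = sum(f * f for f, _ in samples)
    return num / den

# Hypothetical training pairs: (F_harmony score, annotated degree of joy)
# for template images chosen from the database.
training = [(0.9, 0.8), (0.2, 0.2), (0.6, 0.5), (0.1, 0.1)]
w_j1 = fit_weight(training)

def g_joy(f_harmony):
    # Clip so the emotion score stays a degree of truth in [0, 1].
    return min(1.0, max(0.0, w_j1 * f_harmony))

score = g_joy(0.7)
```

Refitting the weights on templates annotated by a different audience is what the article means by adapting the mapping to a specific culture and fashion.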
Table 2. Perceptual mapping onto emotional features (column placement of the entries was lost in reproduction; blank entries indicate irrelevance).
  Emotion categories: Action, Excitement, Suspense, Quietness, Relaxation, Happiness
  dissolves: ~1, ~1, ~1
  cuts: ~1, ~1, ~1, ~1, ~1, ~1
  long: ~1, ~1, ~1, ~1
  short: ~1, ~1, ~1, ~1, ~1
  motion: ~1, ~1, ~1, ~0, ~1
  hor/vert: ~0, ~0, ~0, ~1, ~1, ~1
  red: ~1, ~1, ~1
  orange: ~1, ~1, ~1
  green: ~1, ~1, ~1
  blue: ~1, ~1, ~1
  purple: ~1, ~1, ~1, ~0, ~0, ~0
  white: ~1, ~1, ~1
  black: ~0, ~0, ~0
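A table of target values like this can drive a simple conformity score: closeness of a video's feature vector to the category's profile, skipping features marked irrelevant. The profile and feature values below are illustrative assumptions, not the paper's table entries:

```python
# Sketch: scoring a video's conformity to an emotion category against a
# target profile in the style of Table 2. Entries set to None are
# treated as irrelevant. Profile and feature values are invented.

def conformity(features, profile):
    """Mean of 1 - |feature - target| over the relevant features."""
    relevant = [(features[k], t) for k, t in profile.items()
                if t is not None]
    return sum(1.0 - abs(f - t) for f, t in relevant) / len(relevant)

# Hypothetical quietness profile: many cuts-free horizontal framings,
# blue present, black and purple absent, motion irrelevant.
quietness_profile = {"cuts": 1.0, "hor_vert": 1.0, "motion": None,
                     "blue": 1.0, "black": 0.0, "purple": 0.0}
video = {"cuts": 0.9, "hor_vert": 0.8, "motion": 0.4,
         "blue": 0.7, "black": 0.1, "purple": 0.2}
q = conformity(video, quietness_profile)
```

The article's actual system uses an adapted linear mapping rather than this fixed closeness measure, but the profile-matching view makes the role of the ~0/~1 table entries concrete.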
Quietness: The degree of quietness G^V_quiet for a video can be improved by the presence of blue, orange, green, and white colors, and lowered by the presence of black and purple. Quiet videos feature horizontal framings. A few long sequences might be present, possibly joined through dissolves.

Relaxation: Videos that communicate relaxation (with degree G^V_relax) don't show relevant motion components.

Happiness: Videos that communicate happiness (with degree G^V_happy) share the same features as quiet videos but also exhibit a relevant motion component.

Presently, we use a linear approximation to model emotional feature mapping, constructed by weight adaptation according to a linear regression scheme. Following the above scheme, the extent to which a video k conforms to one of the six classes is computed through six scores {G^V_j(k)}, j = 1, …, 6, […] all the categories not a generalization of category j (we call these categories j-opponent categories).

Model validation and retrieval results

At the Visual Information Processing Lab of the University of Florence, we've used the compositional semantics framework expounded earlier to develop systems for retrieving still images and videos based on both expressive and emotional content.13,14 We discuss the method used in these systems for extracting perceptual features from raw visual data in the sidebar "Perceptual Features for Still Images and Videos."

In this section, we present examples of retrieving art images and commercial videos according to the expressive level. You can find retrieval based on emotional features elsewhere.15 We evaluated retrieval performance in terms of effectiveness, that is, a measure of the agreement between human evaluators and the system in ranking a test set of images according to their similarity to a query. Measuring effectiveness reliably requires a small image test set (typically 20 to 50 images). Given a sample query, we evaluated the agreement as the percentage of human evaluators who rank images in the same (or very close) position as the system does. We define effectiveness as

S = Σ_{k = P_j(i) − δ_j(i)}^{P_j(i) + δ_j(i)} Q_j(i, k)

a sum over a window of ranks centered in the rank P_j(i) assigned by the system. Q_j(i, k) is the percentage of people who ranked the i-th image in a position between P_j(i) − δ_j(i) and P_j(i) + δ_j(i).
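The effectiveness measure can be sketched directly from its definition: sum, over a rank window centered on the system's rank, the fraction of evaluators who placed the image there. The evaluator votes below are invented:

```python
# Sketch: effectiveness S as the fraction of human evaluators whose
# rank for an image falls within +/- delta of the system's rank.
# The vote data is made up for illustration.

def effectiveness(system_rank, human_ranks, delta=1):
    """S in [0, 1]: share of evaluators within the rank window."""
    lo, hi = system_rank - delta, system_rank + delta
    agreeing = sum(1 for r in human_ranks if lo <= r <= hi)
    return agreeing / len(human_ranks)

# 35 evaluators ranked one image; the system ranked it 3rd.
human_ranks = [3] * 20 + [2] * 6 + [4] * 5 + [8] * 4
S = effectiveness(3, human_ranks, delta=1)   # 31 of 35 agree
```

Summing the per-rank percentages Q_j(i, k) over the window, as in the formula, gives the same number as counting evaluators inside the window, which is what the sketch does.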
Perceptual Features for Still Images and Videos

Here we discuss the perceptual features for still images and videos.

Still images

An image's semantics relates to its color content and the presence of elements such as lines that induce dynamism and action.

Colors

Color cluster analysis helps segment images into color regions. We can obtain clustering in 3D space by using an improved version of the standard K-means algorithm, which avoids convergence to nonoptimal solutions. This algorithm uses competitive learning as the basic technique for grouping points in the color space. An image's chromatic content can be expressed using a set of eight numbers normalized in [0, 1], one per color in the set {red, orange, yellow, green, blue, purple, white, black}. The i-th number quantifies the presence in the image of a region exhibiting the i-th color.

Lines

Detecting significant line slopes in an image can be accomplished by using the Hough transform to generate a line slope histogram (see Figure B). The feature hor/vert ∈ [0, 1] gives the ratio of horizontal and vertical lines with respect to the overall number of lines in the image.

Videos

Video analysis primarily aims to segment video, that is, to identify each shot's start and end points, and to characterize the video through its most representative keyframes. Once a video has been fragmented into shots and video editing features have been extracted, each shot's content can be internally described by segmenting each shot's keyframe as described in the previous section of this sidebar.

Cuts

Rapid motion in the scene and sudden changes in lighting yield low correlation between contiguous frames, especially in cases adopting a high temporal subsampling rate. To avoid false cut detection, we studied a metric insensitive to such variations while reliably detecting true cuts. We partition each frame into nine subframes and represent each subframe by considering its color histograms in the hue, saturation, intensity (HSI) color space. We detect cuts by considering the volume of the difference of subframe histograms in two consecutive frames. The presence of a cut at frame i can be detected by thresholding the average volume value over the nine subframes. After repeating the above procedure for all frames i = 1 … #frames, we simply obtain the overall feature related to the presence of cuts in a video as cuts = #cuts/#frames, where cuts ∈ [0, 1].

Dissolves

The dissolve effect merges two sequences by partly overlapping them. Detecting dissolves in commercials proves particularly difficult because dissolves typically occur in a limited number of consecutive frames. Due to this peculiarity, existing approaches to detect dissolves (developed for movies) have shown poor performance. We use instead corner statistics to detect dissolves. During the editing effect, corners associated with the first sequence gradually disappear and those associated with the second sequence gradually appear.

Figure B. Perceptual features for still images. From left to right: video frame, computed edges, and line slope histogram.
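The cut detector can be sketched as below: per-subframe color histograms, a histogram difference between consecutive frames, and a threshold on the average over the nine subframes. Frames are plain nested lists of (h, s, i) pixels; the single-channel histogram, the bin count, and the threshold are simplifying assumptions:

```python
# Sketch of the cut detector: coarse per-subframe histograms, L1
# difference between consecutive frames, thresholded average over the
# nine subframes. Only the hue channel is histogrammed, for brevity;
# the data and the threshold value are invented.

def histogram(pixels, bins=8):
    """Normalized hue histogram for one subframe; hue in degrees."""
    h = [0] * bins
    for hue, _, _ in pixels:
        h[min(int(hue / 360 * bins), bins - 1)] += 1
    total = max(1, len(pixels))
    return [c / total for c in h]

def subframe_difference(prev, cur):
    """L1 distance between the two subframe histograms."""
    return sum(abs(a - b) for a, b in zip(histogram(prev), histogram(cur)))

def is_cut(prev_subframes, cur_subframes, threshold=0.5):
    """Average the nine subframe differences and threshold the result."""
    diffs = [subframe_difference(p, c)
             for p, c in zip(prev_subframes, cur_subframes)]
    return sum(diffs) / len(diffs) > threshold

# Two frames, each split into 9 subframes of uniform red / blue pixels.
red = [[(0.0, 1.0, 0.5)] * 4] * 9
blue = [[(240.0, 1.0, 0.5)] * 4] * 9
detected = is_cut(red, blue)   # a complete color change across frames
```

Running this over every consecutive frame pair and dividing the number of detections by the frame count yields the sidebar's overall feature cuts = #cuts/#frames.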
This yields a local minimum in the number of corners detected during the dissolve (see Figure C). An image corner is characterized by large and distinct values of the eigenvalues of the gradient auto-correlation matrix. We evaluate the feature dissolves ∈ [0, 1] as dissolves = #dissolves/#frames.

Figure C. Corners detected during a dissolve effect.

Motion

We analyze motion by tracking corners in a sequence. For each shot, we compute a feature motion ∈ [0, 1] that represents the average intensity of motion. motion = 0 means that motion is absent during the sequence. Higher values of motion indicate the presence of increasingly relevant motion components.

Inter-shot features

We represent color-related inter-shot features used in a video as recurrent ∈ [0, 1] (expressing the ratio between colors that recur in a high percentage of keyframes and the overall number of significant colors in the video sequence) and saturated (expressing the relative presence of saturated colors in the scene). Another relevant inter-shot characteristic, the rhythm of a sequence of shots, relates to shot duration and to the use of cuts and dissolves to join shots. We define the rhythm r(i1, i2) ∈ [0, 1] of a video sequence over a frame interval [i1, i2] as

r(i1, i2) = (#cuts + #dissolves) / (i2 − i1 + 1)

where #cuts and #dissolves are measured in the same interval. A simple feature measuring an entire sequence's internal rhythm is the average rhythm, which relates to the overall number of breaks: edit = r(1, #frames).
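The rhythm feature, and dissolve detection as a trough in the per-frame corner count, can be sketched as follows. The break positions and the corner-count sequence are invented, and the strict-local-minimum test is our own simplification of the corner-statistics approach:

```python
# Sketch: rhythm r(i1, i2) = (#cuts + #dissolves) / (i2 - i1 + 1), and
# dissolve candidates as strict local minima of the per-frame corner
# count. Break positions and corner counts are invented.

def rhythm(cut_frames, dissolve_frames, i1, i2):
    """Breaks per frame over the interval [i1, i2], in [0, 1]."""
    breaks = sum(1 for f in cut_frames + dissolve_frames if i1 <= f <= i2)
    return breaks / (i2 - i1 + 1)

def dissolve_candidates(corner_counts, window=2):
    """Frames whose corner count is a strict local minimum."""
    mins = []
    for i in range(window, len(corner_counts) - window):
        neighborhood = corner_counts[i - window:i + window + 1]
        if (corner_counts[i] == min(neighborhood)
                and neighborhood.count(corner_counts[i]) == 1):
            mins.append(i)
    return mins

# Average rhythm over a 40-frame clip with three cuts and one dissolve.
edit = rhythm(cut_frames=[3, 10, 17], dissolve_frames=[24], i1=0, i2=39)
# Corner counts dip while the first sequence's corners fade out and the
# second sequence's corners fade in.
troughs = dissolve_candidates([50, 48, 47, 30, 12, 28, 49, 50, 51])
```

A real detector would also require the dip to be deep and sustained relative to the surrounding shots before counting it as a dissolve.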
Image retrieval

For art image retrieval, we used a test set of 40 images, ranging from Renaissance to contemporary painters. We measured retrieval effectiveness with reference to four different queries, addressing contrasts of luminance, warmth, and saturation, and harmony.

We collected answers from 35 experts in the fine arts field. We asked them to rank database images according to each reference query. Figure 3 shows the test database images. Figure 4 shows the sets of eight top-ranked images the system retrieved in response to the four reference queries. (For more information about these images, visit http://www.cineca.it/wm/.)

Figure 3. Images used to test system effectiveness for the representation of expressive content. (Images courtesy of WebMuseum.)

Figure 5 shows the plots of effectiveness S as a function of ranking. They show the agreement between the people interviewed and the system in ranking images according to the queries. Figure 5 shows only rankings from 1 to 8, since they represent agreement on the most representative images. The plots show a very large agreement […]
Figure 4. Top-ranked images according to queries for contrast of luminance (a), contrast of saturation (b), contrast of warmth (c), and harmonic accordance (d). (Images courtesy of WebMuseum.)

[Figure 5 plots: agreement and average agreement (in percent) versus retrieval ranking, shown for rankings 1 through 8 for each of the four queries (a) through (d).]
Figure 6. Results of a query for images with two large regions showing contrasting luminance. The user may define the degree of truth (60 per-[…]fy how contrast of luminance between large regions can be used to convey the perception of different planes of depth.

Figure 7. Representation of database commercials' semiotic features (a) along with the experts' (b) and system's (c) classification of them.

Video retrieval

We evaluated video retrieval effectiveness using a test set of 20 commercial videos. A team of five experts in the semiotic and marketing fields classified each video in terms of the four semiotic categories. We used the common representation of commercial categories based on the semiotic square. This instrument, first introduced by Greimas,10 combines pairs of four semiotic objects with the same semantic level according to three basic relationships: opposition, completion, and contradiction. Objects placed at opposite sides of the square are complementary. Figure 7a shows the practical-playful and critical-utopic diagonals of the semiotic square as coordinate axes. We asked the experts to classify the commercials by associating each commercial with a position on the square. To ease the classification task, 25 rectangular regions were identified in the square through a regular partition, which supports the definition of a video's three different degrees of conformity to a generic category. For instance, all videos classified in region 4 of Figure 7a feature a high degree of conformity to the utopic category and a medium conformity to the playful one.

IEEE MultiMedia

Figure 7 shows how the experts (Figure 7b) and the system (Figure 7c) classified database commercials. Each region in the square has a vertical cylinder (cylinders of zero height aren't shown) whose height is proportional to the percentage of commercials located in that region.

Figure 8 shows a more accurate measure of system effectiveness by displaying the agreement between the system and experts with reference to queries for purely practical, critical, utopic, and playful commercials. Each query considered the five top-ranked commercials. For a generic commercial, we measured the agreement between the system and experts as A = 1 − d/dmax, where d is the city block distance between the two blocks in the semiotic square where the system and the experts located the commercial, and dmax is the maximum value of d, here 8. The best average agreement corresponds to the query for playful commercials, evidencing the effectiveness of the features used to model this category. The worst performance corresponds to queries for practical commercials. In this case, the system classified many commercials as practical, while the experts rated them as critical. Such a classification mismatch originates from the recurrent presence in critical commercials of foreground views of the promoted product. The experts can easily detect and recognize a foreground view, but the system can't detect this feature.

[Figure 8 plots: agreement and average agreement (in percent) versus retrieval ranking for the four queries (a) through (d).]

Figures 9 through 12 show examples of video retrieval according to the expressive level using a
database of 131 commercials. Users can specify queries by defining the degree by which the retrieved commercials have to conform to the four semiotic categories. Figure 9 shows the output of the retrieval system in response to a query for purely playful commercials. At the right of the list of retrieved items, a bar graph displays the degree by which each shot of the top-ranked video belongs to the playful category (the white, thin vertical lines represent cuts and dissolves). The top-ranked videos in this category all advertise products for young and smart people (sportswear, sport watches, blue jeans, and so on). These results aren't surprising, since they reflect the common marketing practice of targeting a commercial to a specific audience.

Figure 9. Retrieval of playful commercials. (Courtesy of Mediaset.)

Figure 10. Some of the keyframes for the first-ranked spot in Figure 9 (Sergio Tacchini).
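The agreement measure A = 1 − d/dmax used in the video evaluation above can be sketched directly on the 5 x 5 grid of semiotic-square regions. The example placements are invented:

```python
# Sketch: agreement A = 1 - d / d_max between the system's and the
# experts' placement of a commercial on the 5 x 5 semiotic square,
# with d the city-block distance between the two regions and
# d_max = 8 (the maximum city-block distance on a 5 x 5 grid).
# The example placements are invented.

D_MAX = 8

def agreement(system_region, expert_region):
    """A in [0, 1]; regions given as (row, col) on the 5 x 5 square."""
    d = (abs(system_region[0] - expert_region[0])
         + abs(system_region[1] - expert_region[1]))
    return 1.0 - d / D_MAX

perfect = agreement((1, 3), (1, 3))   # same region: full agreement
partial = agreement((0, 0), (2, 1))   # city-block distance d = 3
```

Averaging this score over the five top-ranked commercials of a query reproduces the per-category averages the article reports for the practical, playful, utopic, and critical queries.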
[…] class of equivalence of objects, thus reflecting a higher semantic level than that of the objects themselves. We believe that future multimedia retrieval systems will have to support access to information at different semantic levels to reflect diverse application needs and user queries.

The research presented in this article attempts to introduce different levels of signification by a layered representation of visual knowledge. The most relevant insight we gained is that defining rules that capture visual meaning can be difficult, especially when dealing with general visual domains. As a good design practice, such rules should derive from specific domain characterizations, and possibly be refined and tailored to specific classes of users.

Future work will address experimenting with different application scenarios (such as TV news and movies), and complementing visual features and language sentences with textual and audio data. MM

Figure 13. Some of the keyframes for the first-ranked spot in Figure 12 (Pasta Barilla). (Courtesy of RAI.)

Acknowledgments

We thank Bruno Bertelli and Laura Lombardi for useful discussions during the development of this work. We'd also like to acknowledge all those experts who provided their support in the performance tests.

References

1. T. Joseph and A. Cardenas, "PicQuery: A High-Level Query Language for Pictorial Database Management," IEEE Trans. on Software Engineering, Vol. 14, No. 5, May 1988, pp. 630-638.
2. N. Roussopolous, C. Faloutsos, and T. Sellis, "An Efficient Pictorial Database System for Pictorial Structured Query Language (PSQL)," IEEE Trans. on […]
7. […] the Creative Eye, Regents of the University of California, Palo Alto, Calif., 1954.
8. J. Itten, Art of Color (Kunst der Farbe), Otto Maier Verlag, Ravensburg, Germany, 1961 (in German).
9. C.R. Haas, Advertising Practice (Pratique de la Publicité), Bordas, Paris, 1988 (in French).
10. A.J. Greimas, Structural Semantics (Sémantique Structurale), Larousse, Paris, 1966 (in French).
11. J.-M. Floch, Semiotics, Marketing, and Communication: Below the Signs, the Strategies (Sémiotique, Marketing et Communication: Sous les signes, les stratégies), University of France Press, Paris, 1990 (in French).
12. R.C. Carter and E.C. Carter, "CIELUV Color Difference Equations for Self-Luminous Displays," Color Research and Applications, Vol. 8, No. 4, 1983, pp. 252-253.
13. J.M. Corridoni, A. Del Bimbo, and P. Pala, "Sensations and Psychological Effects in Color Image Database," ACM Multimedia Systems, Vol. 7, No. 3, May 1999, pp. 175-183.
14. M. Caliani et al., "Commercial Video Retrieval by Induced Semantics," Proc. IEEE Int'l Workshop on Content-Based Access of Images and Video Databases, IEEE CS Press, Los Alamitos, Calif., Jan. 1998, pp. 72-80.
15. C. Colombo, A. Del Bimbo, and P. Pala, "Retrieval of Commercials by Video Semantics," Proc. IEEE Int'l Conf. on Computer Vision and Pattern Recognition (CVPR 98), IEEE CS Press, Los Alamitos, Calif., June 1998, pp. 572-577.

Carlo Colombo is an assistant professor in the Department of Systems and Informatics at the University of Florence, Italy. His main research activities are in the field of computer vision, with specific interests in image and video analysis, human-machine interfaces, robotics, and multimedia. He holds an MS in electronic engineering from the University of Florence, Italy (1992) and a PhD in robotics from the Sant'Anna School of University Studies and Doctoral Research, Pisa, Italy (1996). He is a member of IEEE and the International Association for Pattern Recognition (IAPR), and presently serves as secretary to the IAPR Italian Chapter.

Alberto Del Bimbo is a professor in the Department of Systems and Informatics at the University of Florence, Italy. His scientific interests cover image sequence analysis, shape and object recognition, image databases and multimedia, visual languages, and advanced man-machine interaction. He earned his MS in electronic engineering at the University of Florence, Italy, in 1977. He is presently an associate editor of Pattern Recognition, Journal of Visual Languages and Computing, and IEEE Transactions on Multimedia. He is a member of IEEE and IAPR, presently serves as the chairman of the Italian Chapter of IAPR, and is the general chair of the IEEE Int'l Conference on Multimedia Computing and Systems (ICMCS 99).

Pietro Pala received his MS in electronic engineering at the University of Florence, Italy, in 1994. In 1998, he received his PhD in information science from the same university. Currently he is a research scientist at the Department of Systems and Informatics at the University of Florence. His current research interests include recognition, image databases, neural networks, and related applications.

Contact the authors at the Department of Systems and Informatics, University of Florence, Via Santa Marta 3, I-50139, Florence, Italy, e-mail {columbus,delbimbo,pala}@dsi.unifi.it.