
Automatic Semantic Content Extraction

in Videos Using a Fuzzy Ontology


and Rule-Based Model
Yakup Yildirim, Adnan Yazici, Senior Member, IEEE, and
Turgay Yilmaz, Student Member, IEEE
Abstract—Recent increase in the use of video-based applications has revealed the need for extracting the content in videos. Raw data and low-level features alone are not sufficient to fulfill the user's needs; that is, a deeper understanding of the content at the semantic level is required. Currently, manual techniques are used to bridge the gap between low-level representative features and high-level semantic content; they are inefficient, subjective, costly in time, and limit querying capabilities. Here, we
propose a semantic content extraction system that allows the user to query and retrieve objects, events, and concepts that are
extracted automatically. We introduce an ontology-based fuzzy video semantic content model that uses spatial/temporal relations in
event and concept definitions. This metaontology definition provides a wide-domain applicable rule construction standard that allows
the user to construct an ontology for a given domain. In addition to domain ontologies, we use additional rule definitions (without using
ontology) to lower spatial relation computation cost and to be able to define some complex situations more effectively. The proposed
framework has been fully implemented and tested on three different domains. We have obtained satisfactory precision and recall rates
for object, event and concept extraction.
Index Terms—Semantic content extraction, video content modeling, fuzziness, ontology

1 INTRODUCTION
The rapid increase in the available amount of video data
has caused an urgent need to develop intelligent
methods to model and extract the video content. Typical
applications in which modeling and extracting video
content are crucial include surveillance, video-on-demand
systems, intrusion detection, border monitoring, sport
events, criminal investigation systems, and many others.
The ultimate goal is to enable users to retrieve some desired
content from massive amounts of video data in an efficient
and semantically meaningful manner.
There are basically three levels of video content: raw video data, low-level features, and semantic content.
First, raw video data consist of elementary physical video
units together with some general video attributes such as
format, length, and frame rate. Second, low-level features are
characterized by audio, text, and visual features such as
texture, color distribution, shape, motion, etc. Third,
semantic content contains high-level concepts such as objects
and events. The first two levels on which content modeling
and extraction approaches are based use automatically
extracted data, which represent the low-level content of a
video, but they hardly provide semantics, which is much
more appropriate for users. Users are mostly interested in
querying and retrieving the video in terms of what the video
contains. Therefore, raw video data and low-level features
alone are not sufficient to fulfill the users' needs; that is, a
deeper understanding of the information at the semantic
level is required in many video-based applications.
However, it is very difficult to extract semantic content
directly from raw video data. This is because video is a
temporal sequence of frames without a direct relation to its
semantic content [1]. Therefore, many different representa-
tions using different sets of data such as audio, visual
features, objects, events, time, motion, and spatial relations
are partially or fully used to model and extract the semantic
content. No matter which type of data set is used, the process
of extracting semantic content is complex and requires
domain knowledge or user interaction.
There are many research works in this area. Most of
them use manual semantic content extraction methods.
Manual extraction approaches are tedious, subjective, and
time consuming [2], which limit querying capabilities.
Besides, the studies that perform automatic or semiauto-
matic extraction do not provide a satisfying solution.
Although there are several studies employing different
methodologies such as object detection and tracking,
multimodality and spatiotemporal derivatives, most of these studies propose techniques for specific event type extraction or work for specific cases and assumptions. In
[3], simple periodic events are recognized where the
success of event extraction is highly dependent on robust-
ness of tracking. The event recognition methods described
in [4] are based on a heuristic method that could not handle
multiple-actor events. Event definitions are made through
predefined object motions and their temporal behavior. The
. The authors are with the Department of Computer Engineering, Middle
East Technical University, Ankara 06531, Turkey.
E-mail: yy@alumni.bilkent.edu.tr, {yazici, turgay}@ceng.metu.edu.tr.
shortcoming of this study is its dependence on motion
detection. In [5], scenario events are modeled from shape
and trajectory features using a hierarchical activity repre-
sentation extended from [4]. Hakeem and Shah [6] propose
a method to detect events in terms of a temporally related
chain of directly measurable and highly correlated low-
level actions (subevents) by using only temporal relations.
Another key issue in semantic content extraction is the
representation of the semantic content. Many researchers
have studied this from different aspects. A simple
representation could relate the events with their low-level
features (shape, color, etc.) using shots from videos,
without any spatial or temporal relations. However, an
effective use of spatiotemporal relations is crucial to
achieve reliable recognition of events. Employing domain ontologies facilitates the use of relations applicable to a domain. There are no studies using both spatial relations
between objects, and temporal relations between events
together in an ontology-based model to support automatic
semantic content extraction. Studies such as BilVideo [7],
[8], extended-AVIS [9], multiView [10] and classView [11]
propose methods using spatial/temporal relations but do
not have ontology-based models for semantic content
representation. Bai et al. [12] present a semantic content
analysis framework based on a domain ontology that is
used to define semantic events with a temporal description
logic where event extraction is done manually and event
descriptions only use temporal information. Nevatia and
Natarajan [13] propose an ontology model using spatio-
temporal relations to extract complex events where the
extraction process is manual. In [14], each linguistic
concept in the domain ontology is associated with a
corresponding visual concept with only temporal relations
for soccer videos. Nevatia et al. [15] define an event
ontology that allows natural representation of complex
spatiotemporal events in terms of simpler subevents. A
Video Event Recognition Language (VERL) that allows
users to define the events without interacting with the low-
level processing is defined. VERL is intended to be a
language for representing events for the purpose of
designing an ontology of the domain, and, Video Event
Markup Language (VEML) is used to manually annotate
VERL events in videos. The lack of low-level processing
and using manual annotation are the drawbacks of this
study. Akdemir et al. [16] present a systematic approach to
address the problem of designing ontologies for visual
activity recognition. The general ontology design principles
are adapted to the specific domain of human activity
ontologies using spatial/temporal relations between con-
textual entities. However, most of the contextual entities
which are utilized as critical entities in spatial and
temporal relations must be manually provided for activity
recognition. Yildirim [17] provides a detailed survey of the
existing approaches for semantic content representation
and extraction.
Considering the above-mentioned needs for content-
based retrieval and the related studies in the literature,
methodologies are required for automatic semantic content
extraction applicable in wide-domain videos.
In this study, a new Automatic Semantic Content
Extraction Framework (ASCEF) for videos is proposed for
bridging the gap between low-level representative features
and high-level semantic content in terms of object, event,
concept, spatial and temporal relation extraction. In order
to address the modeling need for objects, events and
concepts during the extraction process, a wide-domain
applicable ontology-based fuzzy VIdeo Semantic COntent
Model (VISCOM) that uses objects and spatial/temporal
relations in event and concept definitions is developed.
VISCOM is a metaontology for domain ontologies and
provides a domain-independent rule construction standard.
It is also possible to give additional rule definitions
(without using ontology) for defining some special situa-
tions and for speeding up the extraction process. ASCEF
performs the extraction process by using these metaontol-
ogy-based and additional rule definitions, making ASCEF
wide-domain applicable.
In the automatic event and concept extraction process,
objects, events, domain ontologies, and rule definitions are
used. The extraction process starts with object extraction.
Specifically, a semiautomatic Genetic Algorithm-based
object extraction approach [18] is used for the object
extraction and classification needs of this study. For each
representative frame, objects and spatial relations between
objects are extracted. Then, objects extracted from consecu-
tive representative frames are processed to extract temporal
relations, which is an important step in the semantic content
extraction process. In these steps, spatial and temporal
relations among objects and events are extracted automatically, allowing for and making use of the uncertainty in the relation definitions. The event extraction process uses objects, spatial
relations between objects and temporal relations between
events. Similarly, objects and events are used in concept
extraction process.
This study proposes an automatic semantic content
extraction framework. This is accomplished through the
development of an ontology-based semantic content model
and semantic content extraction algorithms. Our work
differs from other semantic content extraction and repre-
sentation studies in many ways and contributes to semantic
video modeling and semantic content extraction research
areas. First of all, we propose a metaontology, a rule
construction standard which is domain independent, to
construct domain ontologies. Domain ontologies are en-
riched by including additional rule definitions. The success
of the automatic semantic content extraction framework is
improved by handling fuzziness in class and relation
definitions in the model and in rule definitions.
A domain-independent application for the proposed
system has been fully implemented and tested. As a proof
of wide-domain applicability, experiments have been con-
ducted for event and concept extraction for basketball,
football, and office surveillance videos. Satisfactory preci-
sion and recall rates in terms of object, event, and concept
extraction are obtained by the proposed framework. Our
results show that the system can be used in practical
applications. Our earlier work can be found in [19], [20].
The organization of the paper is as follows. In Section 2,
the proposed video semantic content model is described in
detail. The automatic semantic content extraction system is
explained in Section 3. In Section 4, the performed
experiments and the performance evaluation of the system
are given. Finally, in Section 5, our conclusions and future
research directions are discussed.
2 VIDEO SEMANTIC CONTENT MODEL
In this section, the proposed semantic video content model
and the use of special rules (without using ontology) are
described in detail.
2.1 Overview of the Model
Ontology provides many advantages and capabilities for
content modeling. Yet, a great majority of the ontology-
based video content modeling studies propose domain
specific ontology models, limiting their use to a specific
domain. Besides, generic ontology models provide solu-
tions for multimedia structure representations. In this
study, we propose a wide-domain applicable video content
model in order to model the semantic content in videos.
VISCOM is a well-defined metaontology for constructing
domain ontologies. It is an alternative to the rule-based and
domain-dependent extraction methods. Constructing rules
for extraction is a tedious task and is not scalable. Without
any standard on rule construction, different domains can
have different rules with different syntax. In addition to the
complexity of handling such differences, each rule structure can have weaknesses. In contrast, VISCOM provides a standardized rule construction capability with the help of its metaontology. It eases the rule construction process and makes its use on larger video data possible.
The rules that can be constructed via VISCOM ontology
can cover most of the event definitions for a wide variety
of domains. However, there can be some exceptional
situations that the ontology definitions cannot cover. To
handle such cases, VISCOM provides an additional rule-
based modeling capability without using ontology. Hence,
VISCOM provides a solution that is applicable on a wide
variety of domain videos.
Objects, events, concepts, spatial and temporal relations
are components of this generic ontology-based model.
Similar generic models such as [13], [21], [22] which use
objects and spatial and temporal relations for semantic
content modeling neither use ontology in content represen-
tation nor support automatic content extraction. To the best
of our knowledge, there is no domain-independent video
semantic content model which uses both spatial and
temporal relations between objects and which also supports
automatic semantic content extraction as our model does.
The starting point is identifying what video contains and
which components can be used to model the video content.
Keyframes are the elementary video units which are still
images, extracted from original video data that best
represent the content of shots in an abstract manner. Name,
domain, frame rate, length, format are examples of general
video attributes which form the metadata of video. Each
video instance, \
i
Video Database, is represented as:
\
i
= \
i
ictodoto
. \
i
/cy)ioic
_
, where \
i
/cy)ioic
is the set of keyframes
of \
i
and \
i
ictodoto
= \
i
ioic
. \
i
doioii
. \
i
)ioiciotc
. \
i
|ciqt/
. \
i
)oiiot
_
.
\
i
doioii
is an attribute of video metadata that represents the
domain of the video instance, where \
i
doioii
1. 1 =
1
0
. . . . . 1
i
is the set of all possible domains.
Each 1
r
1 contains semantically meaningful content
common for 1
r
, which can be represented with an ontology
O`T
r
, where O`T
r
O`T. O`T = O`T
0
. . . . . O`T
i
is
the set of all possible domain ontologies.
O`T
r
is a domain ontology and represented as O`T
r
=
`cto`odc|. C1
r
), where `cto`odc| is the model having
domain-independent content definitions in terms of types
and relations. In our case, these definitions are semantic
contents. C1
r
is the set of domain specific `cto`odc|
individuals for 1
r
.
The proposed model of this study is a $MetaModel$ and is represented with $VISCOM = \langle VC, DII \rangle$, where $VC$ is the set of VISCOM classes and $DII$ is the set of domain-independent VISCOM class individuals. Each $VC_x$ in $VC$ is represented as $VC_x = \langle VC_x^{name}, VC_x^{prop} \rangle$, where $VC_x^{name}$ is the name of the class and $VC_x^{prop}$ is the set of relations and properties of class $VC_x$. VISCOM has a number of classes representing semantically meaningful components of video, where $VC_x^{name} \in \{Component, Object, Event, Concept, Similarity, \ldots\}$.

Domain-independent VISCOM class individuals are grouped under movement, temporal, structural, and spatial relation types. $DII = MRT \cup TRT \cup OCTR \cup SRT$, where $MRT = \{down, up, right, left\}$ is the set of movement relation types, $TRT = \{before, meets, starts, finishes, overlaps, equal, during\}$ is the set of temporal relation types, $OCTR = \{composedOf, isA, partOf, substanceOf\}$ is the set of relation types used to define concept inclusion, membership and structural object relations, and $SRT = DSRT \cup PSRT \cup TSRT$ is the set of spatial relation types, where $TSRT = \{inside, partiallyInside, disjoint, touch\}$ is the set of topological, $PSRT = \{right, left, above, below\}$ is the set of positional, and $DSRT = \{far, near\}$ is the set of distance spatial relation types.
Each domain ontology is enriched with additional rule
definitions to be able to define some complex situations
more effectively. $R_x \in R$ represents the rule definitions for domain $D_x \in D$, where $R = \{R_0, \ldots, R_n\}$ represents all possible rule sets for all domains.
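To make the formalization above concrete, the following minimal Python sketch (ours, not part of the original framework; all field names are illustrative) mirrors the video instance and domain ontology definitions:

from dataclasses import dataclass, field
from typing import List

# Illustrative only: a plain-data mirror of V_i and ONT_x as defined above.

@dataclass
class VideoMetadata:
    name: str
    domain: str          # one of the known domains D = {D_0, ..., D_n}
    frame_rate: float
    length: int
    format: str

@dataclass
class VideoInstance:
    metadata: VideoMetadata
    keyframes: List[int] = field(default_factory=list)    # keyframe numbers

@dataclass
class DomainOntology:
    meta_model: str                                        # "VISCOM" for every domain ontology
    domain: str                                            # the domain D_x described
    individuals: List[str] = field(default_factory=list)  # CI_x: MetaModel individuals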
Both the ontology model and the semantic content extraction process are developed considering uncertainty issues. For the semantic content representation, the VISCOM ontology introduces fuzzy classes and properties. Spatial Relation Component, Event Definition, Similarity, Object Composed Of Relation and Concept Component classes are fuzzy classes, as they are intended to have fuzzy definitions. Object instances have membership values as an attribute, which represents the relevance of the given Minimum Bounding Rectangle (MBR) to the object type. Spatial relation calculations return fuzzy results, and Spatial Relation Component instances are extracted with fuzzy membership values.
2.2 Ontology-Based Modeling
The linguistic part of VISCOM contains classes and relations
between these classes. Some of the classes represent semantic
content types such as Object and Event while others are used
in the automatic semantic content extraction process.
Relations defined in VISCOM give ability to model events
and concepts related with other objects and events.
VISCOM is developed on an ontology-based structure
where semantic content types and relations between these
types are collected under VISCOM Classes, VISCOM Data
Properties which associate classes with constants and
VISCOM Object Properties which are used to define relations
between classes. In addition, there are some domain-
independent class individuals.
C-Logic [23] is used for the formal representation of
VISCOM classes and operations of the semantic content
extraction framework. C-Logic includes a representation
framework for entities, their attributes, and classes using
identities, labels, and types. VISCOM is represented with
the following C-logic formulation:
VISCOM : [class = {C_n},
          temporalRelInd = {before, meets, during, overlap, starts, finishes, equal},
          objectCompInd = {composedOf, isA, substanceOf, memberOf, partOf},
          movementInd = {down, up, right, left, stationary},
          spatialRelInd = {far, near, disjoint, inside, parInside, touch, above, below, left, right}]
where ind(C_n, VISCOMClass). (1)
where the predicate ind(Entity, Class) states that an entity is defined as an individual of a class in the formal representation of classes.
All of the VISCOM classes and relations are given in Fig. 1. Red-colored arrows represent is-a relations; blue-colored arrows represent has-a relations.
Below, the VISCOM classes are introduced with their
description, formal representation and important relation
(property) descriptions.
2.2.1 Component
VISCOM collects all of the semantic content under the class
of Component. A component can have synonym names and
similarity relations with other components. Component class
has three subclasses as Objects, Events, and Concepts and is
represented as
Component : [type = {O_i, E_j, C_k}, sim = {S_m}, synonym = [string]]
where ind(O_i, Object), ind(E_j, Event), ind(C_k, Concept), ind(S_m, Similarity), at least one of i, j, k > 0, (2)
where the property hasSimilarContext is used to associate, in a fuzzy manner, a component that is to be extracted with a similar component defined in the ontology.
2.2.2 Object
Objects correspond to existential entities. An object is the
starting point of the composition. An object has a name,
low-level features, and composed-of relations. Basketball
player, referee, ball and hoop are examples of objects for the
basketball domain
Object : [name = [string], lowLevelFeature = {F_m}, composedOf = {COR_j}]
where ind(F_m, LowLevelFeature), ind(COR_j, ComposedOfRelation), (3)
where property hasComposedOfObjectRelation is used to
define concept inclusion, membership, and structural object
relations such as part of, member of, substance of, is a, and
composed of. It has a relevance degree and a reference to an
object composed-of group individual in its definition.
2.2.3 Event
Events are long-term temporal objects and object relation
changes. They are described by using objects and spatial/
temporal relations between objects. Relations between
events and objects and/or their attributes indicate how
events are inferred from objects and/or object attributes. In
Fig. 1. VISCOM classes and relations.
addition, temporal event relations can also be used in event
definitions. An event has a name, a definition in terms of
temporal event relations or spatial/temporal object relations,
and role definitions of the objects taking part in the event.
Jump ball, rebound, and free throw are examples of events for
the basketball domain
Event : [name = [string], eventDef = {ED_m}, objectRole = {OR_j}, temporalEventComp = {TEC_k}]
where ind(ED_m, EventDefinition), ind(TEC_k, TemporalEventComponent), ind(OR_j, ObjectRole), at least one of m, k > 0, (4)
where property hasTemporalEventComponent is used to define
temporal relations between events which are used in the
definition of other events. hasEventDefinition is utilized to
associate events with event definitions. An event can be
expressed with more than one event definition.
2.2.4 Concept
Concepts are general definitions that contain related events and objects. Each concept has a relation with the components that are used for its definition. Attack and defense are examples of concepts for the basketball domain
Concept : [name = [string], conceptComp = {CC_m}]
where ind(CC_m, ConceptComponent), m > 0, (5)
where the property hasConceptComponent is used to define the relations that exist in the concept's meaning.
2.2.5 Spatial Relation
Spatial relations express the relative positions of two objects, such as above, inside, or far. The spatial relation types are grouped under three categories: topological, distance, and positional spatial relations. The individuals of this class are utilized by the individuals of the Spatial Relation Component class
SpatialRelation : [type = {T_i, P_j, D_k}]
where ind(T_i, TopologicalSpatialRelation), ind(P_j, PositionalSpatialRelation), ind(D_k, DistanceSpatialRelation), at least one of i, j, k > 0. (6)

TopologicalSpatialRelation : [type = {inside, touch, partiallyInside, disjoint}] (7)

PositionalSpatialRelation : [type = {rightSide, leftSide, above, below}] (8)

DistanceSpatialRelation : [type = {far, near}] (9)
2.2.6 Spatial Relation Component
Spatial Relation Component class is used to represent spatial
relations between object individuals. It takes two object
individuals and at most one spatial relation individual from
each subclass of Spatial Relation class. This class is utilized in
spatial change and event definition modeling. It is possible to
define imprecise relations by specifying the membership
value for the spatial relation individual used in its definition.
For the basketball domain, Player under Hoop is an example of
Spatial Relation Component class individuals
SpatialRelationComponent : [name = [string], object1 = O_i, object2 = O_j, spatialRel = SR_k, membership = [μ_l]]
where ind(O_i, Object), ind(O_j, Object), ind(SR_k, SpatialRelation), 0 ≤ μ_l ≤ 1, i ≠ j. (10)
2.2.7 Spatial Change
Spatial Change class is utilized to express spatial relation
changes between objects or spatial movements of objects in
order to model events.
Spatial regions representing objects have spatial relations
between each other. These relations change in time. This
information is utilized in event definitions. Temporal
relations between spatial changes are also used when more
than one spatial change is needed for definition. This
concept is explained under Temporal Relations and Event
Definition classes
SpatialChange : [name = [string], initialSRC = SRC_m, finalSRC = SRC_n, objectRole = OR_k, spatialMovement = SM_p]
where ind(SRC_m, SpatialRelationComponent), ind(SRC_n, SpatialRelationComponent), ind(OR_k, ObjectRole), ind(SM_p, SpatialMovement), exactly one of m, p > 0, if m > 0 then n > 0, (11)
where property hasSpatialMovementComponent is used to
define single object movements.
2.2.8 Spatial Change Period
Spatial changes have an interval that is designated by the
spatial relation individuals used in their definitions. Spatial
relations are momentary situations but periods of spatial
relations can be extracted from consecutive frames. When-
ever the temporal situation between Spatial Relation Compo-
nent individuals defined in a Spatial Change individual is
satisfied, the Spatial Change individual is extracted and
the Spatial Relation Component individuals' periods are utilized to calculate the Spatial Change individual's interval. According
to the meaning of the spatial change, periods of spatial
relations should be included or discarded in the calculation
of spatial change intervals. In order to address this need,
Spatial Change Period class is defined. It has four individuals
as startToEnd, startToStart, endToStart, and endToEnd
SpatialChangePeriod : [relType = {startToStart, startToEnd, endToStart, endToEnd}] (12)
2.2.9 Spatial Movement
The second alternative to define a spatial change is using spatial
movements. Spatial movements represent spatial changes of
single objects. This class is used to define movement types. It
has five individuals: moving to left, moving to right, moving up, moving down, and stationary
SpatialMovement : [relType = {movesLeft, movesUp, movesRight, movesDown, stationary}] (13)
2.2.10 Spatial Movement Component
Spatial Movement Component class is used to declare object
movement individuals. Ball moves left is an example of an
individual of this class
SpatialMovementComponent : [name = [string], object = O_i, spatialMov = SM_j]
where ind(O_i, Object), ind(SM_j, SpatialMovement). (14)
2.2.11 Temporal Relation
Temporal relations are used to order Spatial Changes or
Events in Event Definitions. Allen's temporal relationships
[24] are used to express parallelism and mutual exclusion
between components
TemporalRelation : [relType = {before, during, overlap, starts, equal, finishes, meet}] (15)
2.2.12 Temporal Event Component
Temporal Event Component class is used to define temporal
relations between Event individuals
TemporalEventComponent : [name = [string], initialE = E_i, finalE = E_j, tempRelation = TR_k]
where ind(E_i, Event), ind(E_j, Event), ind(TR_k, TemporalRelation), i ≠ j. (16)
2.2.13 Temporal Spatial Change Component
Temporal Spatial Change Component class is used to define
temporal relations between spatial changes in Event defini-
tions. For instance, the temporal relation after is used between the Ball hits Hoop and Player jumps Spatial Change individuals in the definition of the Rebound event
TemporalSpatialChange : [name = [string], tempRelation = TR_k, initialSC = SC_i, finalSC = SC_j]
where ind(SC_i, SpatialChange), ind(SC_j, SpatialChange), ind(TR_k, TemporalRelation), i ≠ j. (17)
2.2.14 Event Definition
An event can have several definitions where each definition
describes the event with a certainty degree. In other words,
each event definition has a membership value for the event it
defines that denotes the clarity of description. Event defini-
tions contain individuals of Spatial Change, Spatial Relation
Component or Temporal Spatial Change Component classes
EventDefinition : [name = [string], spatialRelComp = SRC_j, tempSpatialChange = TSCC_k, uniqueSpatialChange = USC_m, relevance = [μ_n], objectRole = OR_l]
where ind(SRC_j, SpatialRelationComponent), ind(OR_l, ObjectRole), ind(TSCC_k, TemporalSpatialChangeComponent), ind(USC_m, SpatialChange), exactly one of j, k, m > 0. (18)
Event definitions generally contain more than one Spatial Change individual, which are temporally related with each other. hasUniqueSpatialChange is used for cases where a single Spatial Change individual is enough to make the event definition. Property hasTemporalSpatialChangeComponent is used to model temporal spatial change relations. hasEventSpatialRelationComponent is used when a spatial relation between two objects is enough to make the event definition. Property hasEventRelevance is used to define the relevance of the definition to the event.
2.2.15 Concept Component
Concept Component class is used to associate components to
a concept semantically. This association is fuzzy and the
degree of it denotes the degree of inclusion. This class is
utilized in the concept extraction process
ConceptComponent : [name = [string], relevance = [μ_m], objectRole = OR_j, component = COM_n]
where ind(COM_n, Component), ind(OR_j, ObjectRole), 0 ≤ μ_m ≤ 1. (19)
2.2.16 Similarity
Similarity class is used to represent the relevance of a
component to another component in a fuzzy manner.
Whenever a component which has a similarity relation with
another component is extracted, the semantically related
component is automatically extracted by using this similar-
ity relation:
Similarity : [name = [string], relevance = [μ_m], simWith = COM_j]
where ind(COM_j, Component), 0 ≤ μ_m ≤ 1. (20)
2.3 Rule-Based Modeling
Additional rules are utilized to extend the modeling
capabilities. Each rule has two parts, body and head, where the body part contains any number of domain class or property individuals, and the head part contains only one individual with a value, $\mu$, representing the certainty with which the definition given in the body part represents the definition in the head part, where $0 \le \mu \le 1$. The basic syntax of rules allows parentheses and logical connectives (e.g., $\wedge$, $\vee$, $\neg$) in both body and head parts.
Rule definitions are used for two different purposes. The
first purpose is to lower the spatial relation computation
cost. A useful utilization is as follows: Inverse spatial
relations and spatial relations that can be described in terms
of other spatial relations can be expressed with rule
definitions. In the spatial relation extraction process, these
rules can be utilized to extract the content represented with
the head part of the rule definition automatically. The rule definition for the Near distance relation type is presented below as an example of this kind of rule usage.
NearRule : [hasObject(?SRC, ?O) ∧ hasSubject(?SRC, ?S) ∧
            (hasSpatialRelation(?SRC, topologicalInside) ∨
             hasSpatialRelation(?SRC, topologicalParInside) ∨
             hasSpatialRelation(?SRC, positionalTouch))
            ⇒ hasSpatialRelation(?NewSRC, distanceNear) ∧ hasValue(1.0)]
where ind(SRC, SpatialRelationComponent), ind(NewSRC, SpatialRelationComponent), ind(O, Object), ind(S, Object). (21)
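To illustrate how such a rule avoids the distance computation, here is a minimal Python sketch of applying the Near rule to an already extracted Spatial Relation Component; the record layout and function names are our assumptions, not the framework's API:

from dataclasses import dataclass
from typing import Optional

@dataclass
class SRC:
    """An extracted Spatial Relation Component instance (hypothetical layout)."""
    obj: str
    subj: str
    relation: str      # e.g., "topologicalInside", "distanceNear"
    membership: float

def apply_near_rule(src: SRC) -> Optional[SRC]:
    # If the two objects are inside, partially inside, or touching, the rule
    # derives a 'near' distance relation with hasValue(1.0), so the distance
    # computation of Section 3.2.2 can be skipped for this pair.
    if src.relation in ("topologicalInside", "topologicalParInside",
                        "positionalTouch"):
        return SRC(src.obj, src.subj, "distanceNear", 1.0)
    return None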
Besides, nearly every domain has a number of irregular
situations that cannot be represented with the relation sets
defined in the ontology. VISCOM is enriched with addi-
tional rule definitions where it is hard to define situations as
a natural part of ontology. The second purpose of additional
rules is to define such complex situations.
Rules can contain any class/property individual defined in the ontology. In fact, VISCOM is adequate to represent any kind of event definition in terms of spatial and/or temporal relations and similarity definitions. Rules give the opportunity to make event definitions which contain a set of events or other class individuals defined in the domain ontology. An example concept rule is as follows:
BusyRule : [hasConceptComponent(?X, walkToSit) ∧ hasConceptComponent(?X, walkToPrint)
            ⇒ Concept(busy) ∧ hasValue(0.65)]
where ind(X, Component). (22)
Fig. 2. Rebound event representation.
Consequently, it can be stated that rule definitions strengthen the framework in terms of both semantic content representation and semantic content extraction.
2.4 Domain Ontology Construction with VISCOM
VISCOM is utilized as a metamodel to construct domain
ontologies. Basically, domain specific semantic contents are
defined as individuals of VISCOM classes and properties
DomainOntology : [metaModel = VISCOM, domain = D_i, classInds = {CI_0, ..., CI_k}, dataPropInds = {DPI_0, ..., DPI_l}, objectPropInds = {OPI_0, ..., OPI_m}]
where CIs are domain class individuals, DPIs are domain data property individuals, OPIs are domain object property individuals, ind(D_i, Domain). (23)
Algorithm 1 presents the steps followed to construct a domain ontology by using VISCOM. For evaluation purposes, we have constructed an Office Surveillance Ontology, a Basketball Ontology, and a Football Ontology by using VISCOM. As an example, a small portion of the basketball ontology, for the Rebound event, is illustrated in Fig. 2.
Algorithm 1. Ontology Construction with VISCOM
Require: VISCOM
Ensure: Domain Ontology
1: define O, E and C individuals.
2: define all possible SRs occurring within an E.
3: define all possible OMs occurring within an E.
4: use SRs and OMs to define SCs.
5: describe temporal relations between SCs as TSCCs.
6: make EDs with SCs, SRs and TSCCs.
7: for all Es do
8:   if an event can be defined with an event def then
9:     define E in terms of EDs.
10:  end if
11:  if an event can be defined with temporal relations between other events then
12:    define Es in terms of ETRs.
13:  end if
14: end for
15: for all Cs do
16:   construct a relation with the C that can be placed in its meaning.
17: end for
18: define Ss.
In accordance with Algorithm 1, ontology construction starts with defining Rebound as the Event individual, and Hoop, Ball, Player and Basket as the Object individuals. The next step is to define all Spatial Relation Component individuals that occur during a Rebound event, such as Ball Above Player, Player Below Basket and Ball Far from Hoop. Then, sequences of Spatial Relation Component individuals are defined as Spatial Change individuals, such as Jump to Ball and Hit Hoop. One or more Spatial Change individuals may be used to create different Event Definition individuals. In the Rebound example, two Event Definition individuals are defined. Rebound Definition 1 has two Spatial Change individuals (Hit Hoop and Jump to Ball) in its definition, which have a temporal relation with each other, while Rebound Definition 2 has only one Spatial Change individual (Jump for Rebound). Each event definition uses different spatial and temporal relations between objects in order to define the event. The ontology developer always has the chance to add a new definition to cover cases where the existing definitions are not sufficient, and can add new individual definitions, modify them, or delete them at any time. All of the semantic content is defined as VISCOM class individuals in a similar manner.
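As an illustration of the constructed individuals, the following hypothetical Python encoding sketches the Rebound portion of the basketball ontology shown in Fig. 2; the exact initial/final pairs inside the spatial changes and the relevance values are assumed for illustration, not taken from the ontology:

# Hypothetical plain-data encoding of the Rebound fragment of Fig. 2.
spatial_relation_components = {
    "BallAbovePlayer":   {"object1": "Ball",   "object2": "Player", "relation": "above"},
    "PlayerBelowBasket": {"object1": "Player", "object2": "Basket", "relation": "below"},
    "BallFarFromHoop":   {"object1": "Ball",   "object2": "Hoop",   "relation": "far"},
}

spatial_changes = {
    # a Spatial Change pairs an initial and a final Spatial Relation Component
    # (the pairs below are assumptions)
    "HitHoop":    {"initial": "BallFarFromHoop",   "final": "BallAbovePlayer"},
    "JumpToBall": {"initial": "PlayerBelowBasket", "final": "BallAbovePlayer"},
}

event_definitions = {
    # Rebound Definition 1: two temporally related Spatial Changes
    "ReboundDef1": {"temporalSpatialChange": ("HitHoop", "before", "JumpToBall"),
                    "relevance": 0.9},   # assumed relevance
    # Rebound Definition 2: a single (unique) Spatial Change suffices
    "ReboundDef2": {"uniqueSpatialChange": "JumpForRebound",
                    "relevance": 0.7},   # assumed relevance
}

events = {"Rebound": ["ReboundDef1", "ReboundDef2"]}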
3 AUTOMATIC SEMANTIC CONTENT EXTRACTION
FRAMEWORK
The Automatic Semantic Content Extraction Framework is
illustrated in Fig. 3. The ultimate goal of ASCEF is to extract
all of the semantic content existing in video instances. In
order to achieve this goal, the automatic semantic content
extraction framework takes \
i
, O`T
i
, and 1
i
, where \
i
is a
video instance, O`T
i
is the domain ontology for domain 1
i
which \
i
belongs to, and 1
i
is the set of rules for domain 1
i
.
The output of the extraction process is a set of semantic
contents, named \ oC
i
, and represented as \ oC
i
= \
i
.
O1
i
. 11
i
. 11
i
). O1
i
= O1
i
0
. . . . . O1
i
i
is the set of object
instances occurring in \
i
, where an object instance is
represented as O1
i
,
= )ioicio. `11. j. tyjc ). `11 is the
minimum bounding rectangle surrounding the object in-
stance. j represents the certainty of the extraction, where
0 _ j _ 1. tyjc is an individual of a class C1
i
in ontology
O`T
i
. 11
i
= 11
i
0
. . . . . 11
i
i
is the set of event instances
Fig. 3. Automatic semantic content extraction framework.
occurring in $V_i$, where an event instance is represented as $EI_i^j = \langle startFrameNo, endFrameNo, \mu, type \rangle$. $\mu$ represents the certainty of the extraction, where $0 \le \mu \le 1$, and $type$ is an individual of a class $CI_i$ in ontology $ONT_i$. $CI_i = \{CI_i^0, \ldots, CI_i^n\}$ is the set of concept instances occurring in $V_i$, where a concept instance is represented as $CI_i^j = \langle startFrameNo, endFrameNo, \mu, type \rangle$. $\mu$ represents the certainty of the extraction, where $0 \le \mu \le 1$, and $type$ is an individual of a class $CI_i$ in ontology $ONT_i$

ASCEF : [input = {V_i, DO_i, R_i}, output = VSC_i]
where ind(V_i, Video), ind(DO_i, DomainOntology), ind(R_i, Rule), ind(VSC_i, VideoSemanticContent). (24)
Semantic contents are basically object, event, and concept
instances taking part in video instances
VideoSemanticContent : [video = V_i, objects = {OI_i}, events = {EI_i}, concepts = {CI_i}]
where ind(V_i, Video), ind(OI_i, ObjectInstance), ind(EI_i, EventInstance), ind(CI_i, ConceptInstance). (25)
There are two main steps followed in the automatic
semantic content extraction process. The first step is to
extract and classify object instances from representative
frames of shots of the video instances. The second step is to
extract events and concepts by using domain ontology and
rule definitions. A set of procedures is executed to extract
semantically meaningful components in the automatic event
and concept extraction process. The first semantically mean-
ingful components are spatial relation instances between
object instances. Then, the temporal relations are extracted
by using changes in spatial relations. Lastly, events and
concepts are extracted by using the spatial and temporal
relations. Details about these procedures are described in the following sections.
3.1 Object Extraction
Object extraction is one of the most crucial components in the framework, since the objects are used as the input for the extraction process. However, the object extraction process is not presented in detail here, considering that it is mostly in the scope of computer vision and image analysis techniques.
It can be argued that having a computer vision-based
object extraction component prevents the framework from being domain independent. However, object extraction techniques
use training data to learn object definitions, which are
usually shape, color, and texture features. These definitions
are mostly the same across different domains. Thus, using
training data in such object extraction techniques does not
necessarily make those techniques domain dependent. As
long as the object extraction technique can identify a large
number of different object types, such a technique is usable
in ASCEF.
In order to meet the object extraction and classification
need, a semiautomatic Genetic Algorithm-based object
extraction approach [18], [20] is utilized in this study. The
approach is a supervised learning approach utilizing eight
MPEG-7 descriptors to represent the objects.
For each representative keyframe in the video, the above-mentioned object extraction process is performed and a set of objects is extracted and
classified. The extracted object instances are stored with their
type, frame number, membership value, and Minimum
Bounding Rectangle data. Object instances are used as input
with the domain ontologies in the event and concept
extraction process
Membership : [μ = [float]]
where 0 ≤ μ ≤ 1. (26)

MBR : [x = [integer], y = [integer], width = [integer], length = [integer]] (27)

ObjectInstance : [frameNo = [number], minBoundingRectangle = MBR_m, membership = MeV_j, objectType = O_k]
where ind(O_k, Object), ind(MBR_m, MBR), ind(MeV_j, Membership). (28)
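A minimal Python sketch of the object instance record defined by formulas (26)-(28) could look as follows (field names are ours):

from dataclasses import dataclass

@dataclass
class MBR:
    """Minimum Bounding Rectangle, formula (27); (x, y) is the upper-left corner."""
    x: int
    y: int
    width: int
    length: int

@dataclass
class ObjectInstance:
    """Extracted object instance, formula (28)."""
    frame_no: int
    mbr: MBR
    membership: float   # certainty of extraction/classification, in [0, 1]
    object_type: str    # an Object individual of the domain ontology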
3.2 Spatial Relation Extraction
Object instances are represented with their MBRs. There can be $n$ object instances (as regions), represented with $R$, in a frame $F$, where $R = \{R_0, \ldots, R_n\}$. For each $R$, the upper left-hand corner point, represented with $R_{ul}$, and the length and width of $R$ are stored. The area inside $R_i$ is represented with $R_i^a$, and the edges of $R_i$ are represented with $R_i^e$.
Every spatial relation extraction is stored as a Spatial
Relation Component instance which contains the frame
number, object instances, type of the spatial relation, and a
fuzzy membership value of the relation. A Spatial Relation
Component instance is formally represented as
SpatialRelationInstance : [object = O_i, subject = S_j, relationType = R_k, frameNo = [number], membership = MeV_m]
where ind(O_i, ObjectInstance), ind(S_j, ObjectInstance), ind(R_k, SpatialRelation), ind(MeV_m, Membership). (29)
Spatial relations are fuzzy relations, and membership values for each relation type can be calculated according to the positions of the objects relative to each other. Below, we explain how the membership values ($\mu_{dis}$, $\mu_{top}$, $\mu_{pos}$) for each of the distance, topological, and positional relation categories are calculated.
3.2.1 Topological Relations
Topological relation types are inside, partially inside, touches,
and disjoint. The membership values for the topological
relation types are calculated by using
$$\mu_{top}(R_i, R_k) = \frac{area(R_i^a \cap R_k^a)}{area(R_k^a)} \quad [25]. \qquad (30)$$

$\mu_{top}(R_i, R_k) = 1$ means region $R_k$ is inside region $R_i$. $\mu_{top}(R_i, R_k) = 0 \wedge (R_i^e \cap R_k^e) = \emptyset$ means region $R_k$ is disjoint with region $R_i$. $0 < \mu_{top}(R_i, R_k) < 1$ means region $R_k$ is partially inside region $R_i$. $\mu_{top}(R_i, R_k) = 0 \wedge (R_i^e \cap R_k^e) \neq \emptyset$ means region $R_k$ touches region $R_i$.
3.2.2 Distance Relations
We use two distance relation types, far and near, for simplicity. The distance between the two nearest points of the regions is used in the calculation formulas of $\mu_{far}$ and $\mu_{near}$. When two regions have an inside, partially inside, or touch topological relation, the distance relation membership values are directly assigned as $\mu_{far}(R_i, R_k) = 0$ and $\mu_{near}(R_i, R_k) = 1$. When there is a disjoint topological relation, the fuzzy membership functions of $\mu_{far}$ and $\mu_{near}$ given in Fig. 4 are used. In a relation of two objects, the relation is near if the distance between the objects is less than or equal to the longest side of the smaller object's bounding rectangle. Also, the relation is far if the distance is greater than or equal to the longest side of the bigger object's bounding rectangle. If the distance is between these two values, the distance relation membership value is calculated according to the given membership function.
3.2.3 Positional Relations
$\mu_{pos}$ is calculated as $\mu_{pos}^{above}$, $\mu_{pos}^{below}$, $\mu_{pos}^{left}$, and $\mu_{pos}^{right}$ values for each positional relation type. The center points of the regions are used to calculate the membership values, as most of the studies such as [25], [26], [27] do. The center point of one of the regions is fixed as the origin (0,0). The sine of the angle ($\theta$) between the $x$-axis and the line between the two center points of the regions is calculated. This value is used to calculate the $\mu_{pos}$ values with the following formulas:

$$\mu_{pos}^{right}(R_i, R_k) = \begin{cases} \sin(\theta + 90), & 0 < \theta < 90 \ \vee \ 270 < \theta < 360 \\ 0, & otherwise \end{cases} \qquad (31)$$

$$\mu_{pos}^{left}(R_i, R_k) = \begin{cases} \sin(\theta - 90), & 90 < \theta < 270 \\ 0, & otherwise \end{cases} \qquad (32)$$

$$\mu_{pos}^{above}(R_i, R_k) = \begin{cases} \sin(\theta), & 0 < \theta < 180 \\ 0, & otherwise \end{cases} \qquad (33)$$

$$\mu_{pos}^{below}(R_i, R_k) = \begin{cases} \sin(\theta - 180), & 180 < \theta < 360 \\ 0, & otherwise. \end{cases} \qquad (34)$$
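The following Python sketch implements the three membership calculations of this section; the linear transition between the two thresholds of Fig. 4 and the image-plane angle convention (y grows downwards) are our assumptions:

import math

# MBRs are tuples (x, y, width, length) with (x, y) the upper-left corner.

def area(r):
    return r[2] * r[3]

def intersection_area(ri, rk):
    # overlap of the two rectangles, 0 if they are disjoint
    ix = max(0, min(ri[0] + ri[2], rk[0] + rk[2]) - max(ri[0], rk[0]))
    iy = max(0, min(ri[1] + ri[3], rk[1] + rk[3]) - max(ri[1], rk[1]))
    return ix * iy

def mu_top(ri, rk):
    # topological membership, formula (30): share of R_k covered by R_i
    return intersection_area(ri, rk) / area(rk)

def mu_near(ri, rk, dist):
    # distance membership (Fig. 4): 1 up to the longest side of the smaller
    # MBR, 0 beyond the longest side of the bigger MBR, linear in between
    # (the linear shape between the thresholds is assumed)
    small, big = sorted((ri, rk), key=area)
    lo, hi = max(small[2], small[3]), max(big[2], big[3])
    if dist <= lo:
        return 1.0
    if dist >= hi:
        return 0.0
    return (hi - dist) / (hi - lo)

def mu_far(ri, rk, dist):
    return 1.0 - mu_near(ri, rk, dist)

def mu_pos(ri, rk):
    # positional memberships, formulas (31)-(34), from the angle theta of the
    # line joining the MBR centers; image y grows downwards, hence the minus
    cxi, cyi = ri[0] + ri[2] / 2.0, ri[1] + ri[3] / 2.0
    cxk, cyk = rk[0] + rk[2] / 2.0, rk[1] + rk[3] / 2.0
    theta = math.degrees(math.atan2(-(cyk - cyi), cxk - cxi)) % 360
    t = math.radians(theta)
    return {
        "right": math.sin(t + math.pi / 2) if (theta < 90 or theta > 270) else 0.0,
        "left":  math.sin(t - math.pi / 2) if 90 < theta < 270 else 0.0,
        "above": math.sin(t) if 0 < theta < 180 else 0.0,
        "below": math.sin(t - math.pi) if 180 < theta < 360 else 0.0,
    }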
Rule definitions are utilized in our model in order to lower spatial relation computation costs. The rule definition for the distance relation near is given at the end of Section 2.3.
3.3 Temporal Relation Extraction
In the framework, temporal relations are utilized in order to temporally sequence the Spatial Change or Event individuals used in the definition of Event individuals.
One of the well-known formalisms proposed for temporal reasoning is Allen's temporal interval algebra [24], which describes a temporal representation that takes the notion of a temporal interval as primitive. Allen's algebra is used to express parallelism and mutual exclusion between the model components of VISCOM. Allen defined a set of 13 qualitative relations that may hold between two intervals $X = [x^-, x^+]$ and $Y = [y^-, y^+]$. Table 1 shows how Allen expressed these precise relations by means of constraints on the boundaries of the crisp intervals involved. The formulas given in the definition column of Table 1 are used to extract temporal relations between instances.
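A minimal Python sketch of classifying the Allen relation between two crisp intervals via such boundary comparisons (the relation naming is ours; inverse relations are suffixed with _i):

def allen_relation(x, y):
    """Return the Allen relation between intervals x = (x_minus, x_plus)
    and y = (y_minus, y_plus), decided purely by boundary comparisons
    in the spirit of Table 1."""
    xm, xp = x
    ym, yp = y
    if xp < ym:
        return "before"
    if yp < xm:
        return "before_i"
    if xp == ym:
        return "meets"
    if yp == xm:
        return "meets_i"
    if xm == ym and xp == yp:
        return "equal"
    if xm == ym:
        return "starts" if xp < yp else "starts_i"
    if xp == yp:
        return "finishes" if xm > ym else "finishes_i"
    if ym < xm and xp < yp:
        return "during"
    if xm < ym and yp < xp:
        return "during_i"
    return "overlaps" if xm < ym else "overlaps_i"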
3.4 Event Extraction
Event instances are extracted after a sequence of automatic
extraction processes. Each extraction process outputs
Fig. 4. Graph for distance relation membership function.
TABLE 1
Allen's Temporal Interval Relations
instances of a semantic content type defined as an
individual in the domain ontology. Algorithm 2 describes
the whole event extraction process. In addition, relations
between the extraction processes are illustrated in Fig. 5.
Algorithm 2. Event Extraction Algorithm
Require: Domain Ontology, Object Instances
Ensure: Event Instances
1: for all SRC individuals in the ontology do
2:   extract SRC instances that satisfy the individual def.
3:   execute SR rule defs.
4: end for
5: for all SMC individuals in the ontology do
6:   extract SMC instances that satisfy the individual def.
7: end for
8: for all SC individuals in the ontology do
9:   check if there are SRC or SMC instances that satisfy the individual def.
10: end for
11: for all TSC individuals in the ontology do
12:   extract SC instances that satisfy the individual def.
13: end for
14: for all ED individuals in the ontology do
15:   check if there are SC, SR or TSC instances that satisfy the individual def.
16: end for
17: for all E individuals in the ontology do
18:   check if there are ED instances that satisfy the individual def.
19: end for
20: for all Event individuals which have Temporal Event Component individuals do
21:   extract Event instances that satisfy the individual def.
22: end for
23: for all S individuals in the ontology do
24:   extract E instances that satisfy the individual def.
25: end for
26: execute all rules defined for E individuals to extract additional events.
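As a sketch of steps 11-16 of Algorithm 2, the following hypothetical Python fragment fires an event definition that requires one spatial change to occur before another; combining the component memberships with the definition's relevance by min and product is our assumption, not a rule stated in the paper:

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SCInstance:
    """An extracted Spatial Change instance (hypothetical layout)."""
    name: str       # the Spatial Change individual it instantiates, e.g. "HitHoop"
    start: int      # frame interval
    end: int
    membership: float

def fire_event_definition(ed: Tuple[str, str, str], relevance: float,
                          instances: List[SCInstance]):
    # ed = (first, "before", second), e.g. ("HitHoop", "before", "JumpToBall").
    # An event instance spans from the first change's start to the second
    # change's end; its membership combines the components' memberships
    # and the definition's relevance (min/product combination assumed).
    first, _, second = ed
    hits = []
    for a in instances:
        for b in instances:
            if a.name == first and b.name == second and a.end < b.start:
                mu = min(a.membership, b.membership) * relevance
                hits.append((a.start, b.end, mu))
    return hits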
During the extraction process, the semantic content is extracted with a certainty degree between 0 and 1. An extracted event instance is represented with a type, a frame set representing the event's interval, a membership value, and the roles of the objects taking part in the event. Frame Set is used to represent the frame interval of instances:

FrameSet : [start = [integer], end = [integer], video = V_i]
where ind(V_i, Video), start ≠ ∅, end ≠ ∅. (35)
EventInstance : [frameSet = FS_m, eventType = E_k, membership = MeV_j, objectRole = OR_n]
where ind(FS_m, FrameSet), ind(E_k, Event), ind(MeV_j, Membership), ind(OR_n, ObjectRole). (36)
3.5 Concept Extraction
In the concept extraction process, Concept Component individuals and extracted object, event, and concept instances are used. Concept Component individuals relate objects, events, and concepts with concepts. When an object or event that is used in the definition of a concept is extracted, the related concept instance is automatically extracted with the relevance degree given in its definition. In addition, Similarity individuals are utilized in order to extract more concepts from the extracted components. The last step in the concept extraction process is executing the concept rule definitions.
The Concept Extraction Algorithm, given as Algorithm 3, describes the whole concept extraction process. In addition, the relations between the concept extraction processes are illustrated in Fig. 6.
Algorithm 3. Concept Extraction Algorithm
Require: Domain Ontology, Object Instances, Event Instances
Ensure: Event Instances, Concept Instances
1: for all CC individuals in the ontology do
2:   check if there are O or E instances that satisfy the individual def.
3: end for
4: for all S individuals in the ontology do
5:   extract C instances that satisfy the individual def.
6: end for
7: execute all rules defined for C individuals.
Fig. 5. Event extraction process.
Similar to the event extraction, concepts are extracted with a membership value between 0 and 1. The following example explains how component membership values are used to calculate concept membership values. Event individual $E$ and Object individual $O$ are related components of the Concept individual $C$. Event $E$ and Object $O$ have relevance values for representing the concept $C$, namely $\mu_E^C$ and $\mu_O^C$, respectively. When an event $E$ instance is extracted with a membership value $\mu_E$ and an object $O$ instance is extracted with a membership value $\mu_O$, the membership value for the concept $C$ instance is calculated as $\mu_C = \max((\mu_E \cdot \mu_E^C), (\mu_O \cdot \mu_O^C))$.
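A minimal Python rendering of this max-product combination (names are ours):

def concept_membership(extracted, relevance):
    # extracted: component name -> membership of the extracted instance (mu_E, mu_O, ...)
    # relevance: component name -> relevance of that component for the concept (mu_E^C, ...)
    return max(extracted[name] * relevance[name]
               for name in extracted if name in relevance)

# e.g., an event extracted with 0.8 (relevance 0.7) and an object extracted
# with 0.9 (relevance 0.5) give mu_C = max(0.8 * 0.7, 0.9 * 0.5) = 0.56
print(concept_membership({"E": 0.8, "O": 0.9}, {"E": 0.7, "O": 0.5}))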
Concept instances use the frame interval of the events or objects that take part in their definition. A concept instance has a type, an interval as a frame set, a membership value which represents the possibility of the concept's realization in the extracted concept period, and the roles of the objects taking part in the concept:
ConceptInstance : [frameSet = FS_m, membership = MeV_j, conceptType = C_k, objectRole = OR_n]
where ind(FS_m, FrameSet), ind(MeV_j, Membership), ind(C_k, Concept), ind(OR_n, ObjectRole). (37)
After object extraction and classification, the extraction
algorithms defined in Sections 3.4 and 3.5 are applied with
relation calculations defined in Sections 3.2 and 3.3. Spatial,
temporal, and similarity relations defined in domain
ontologies, rule definitions, and extracted instances are used
together in the semantic extraction process.
4 EMPIRICAL STUDY
In this study, OWL is chosen as the semantic markup language and the Semantic Web Rule Language (SWRL) [28] is used to make rule definitions. In order to capture imprecision in rules, a fuzzy extension of SWRL is used. In this extension, OWL individuals include a specification of the degree (a truth value between 0 and 1) of confidence with which an individual is an instance of a given class or property. The Protege platform [29] is utilized as the ontology editor. Protege is tightly integrated with a number of libraries which handle ontology deployment and management issues. The Jena2 library [30], which is an open source Java framework for building semantic applications, is used in this study.
The experimental part of the system contains evaluation tests on office surveillance, basketball, and football videos.1
Precision and recall rates and the Boundary Detection Accuracy (BDA) score [31], which are important metrics for assessing the performance of retrieval systems, are used in this study to evaluate the success of the proposed framework. A semantic content is accepted as correctly extracted when its interval intersects with the manually extracted semantic content interval. In addition, precision and recall rates are calculated according to the detected content boundary/interval compared with the manually labeled boundary/interval, with the formulas given below:
$$Prec_{int} = \frac{t_{mb} \cap t_{db}}{t_{db}} \qquad (38)$$

$$Rec_{int} = \frac{t_{mb} \cap t_{db}}{t_{mb}} \qquad (39)$$

$$BDA = \frac{t_{mb} \cap t_{db}}{\max(t_{mb}, t_{db})}, \qquad (40)$$

where $t_{db}$ and $t_{mb}$ are the automatically detected event/concept interval and the manually labeled event/concept interval, respectively.
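For clarity, a small Python sketch of formulas (38)-(40) for a single detected/labeled interval pair (the interval representation is ours):

def interval_scores(detected, labeled):
    # detected = (start, end) is t_db; labeled = (start, end) is t_mb;
    # the intersection is measured in frames
    overlap = max(0, min(detected[1], labeled[1]) - max(detected[0], labeled[0]))
    t_db = detected[1] - detected[0]
    t_mb = labeled[1] - labeled[0]
    return (overlap / t_db,             # Prec_int, formula (38)
            overlap / t_mb,             # Rec_int, formula (39)
            overlap / max(t_mb, t_db))  # BDA, formula (40)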
Initially, the framework is tested with five office surveillance videos, each 10 minutes in length. In total, 1,026 keyframes are extracted and utilized in the extraction process. During this test, first the object extraction is done automatically and the whole semantic content extraction process is executed. Then, in order to see the effect of object extraction on the success of the system, the test is repeated by providing the objects manually. The number of VISCOM class individuals (ontology-based rule definitions) and
Fig. 6. Concept extraction process.
TABLE 2
Office Surveillance Ontology Facts
1. The OWL representation of VISCOM metamodel and the domain
ontologies constructed with the VISCOM can be accessed from the webpage
http://multimedia.ceng.metu.edu.tr/viscom.
the corresponding extraction results for each individual/definition are given in Table 2. The test results for the case with automatic object extraction are given in Table 3.
The videos in this test contain 50 semantic entities. Of the 50 retrieved entities, 45 are correctly extracted and five are wrongly extracted, while five of the actual entities are missed. The missed entities are the result of the automatic object extraction process, which misclassified or failed to extract some of the objects. The wrong extractions result from the sensitivity of the ontological rules to object positions (i.e., small movements of a person object that are not walking/casting are extracted as walking/casting) and from an unsuitable class individual definition in the ontological rules (i.e., the similarity definitions of typing). Both the precision and recall rates are calculated as 90.00 percent and the BDA score is calculated as 78.59 percent, which shows the success of our proposed framework.
When the objects are manually provided for the second
step of the test, the precision rate is 90.74 percent, the recall
rate is 98.00 percent, and the BDA score is 79.90 percent.
Obviously, this result is expected since the framework uses
the objects in the keyframes as the input and extracts events
and concepts by using the objects and VISCOM rules. When
a missing or wrong classification of object instances occurs
in the automatic object extraction process, the success of
event/concept extraction decreases.
Next, the framework is tested with three basketball videos, each 2 minutes in length. In total, 207 keyframes are extracted and utilized in the extraction process. The videos contain eight semantic entities; the extraction resulted in seven correct, one wrong, and one missed entity. The results are given in Table 4. For this test, manually annotated object instances are utilized, and the membership value for object instances is defined as 90 percent. The wrong extraction in this test is the result of the unsuitable similarity class definition for the rebound event.
The framework is also tested with five football videos, each 2 minutes in length. In total, 312 keyframes are extracted and utilized in the extraction process. Manually annotated object instances are utilized, and the membership values for object instances are defined as 0.90 for the football domain tests. Three event types are defined in the domain ontology. Test results for the football domain are given in Table 5.
The test results are also compared with the results of two recent studies. The first is a multimodal framework for semantic event extraction from basketball games based on webcasting text and broadcast video [32]. In that study, an unsupervised clustering-based method is proposed to automatically detect events from the webcasting text, and a statistical approach is proposed to detect event boundaries in the video. The second study proposes a method to detect events involving multiple agents in a video and to learn their structure in terms of a temporally related chain of subevents
YILDIRIM ET AL.: AUTOMATIC SEMANTIC CONTENT EXTRACTION IN VIDEOS USING A FUZZY ONTOLOGY AND RULE-BASED MODEL 59
TABLE 4
Scores for Basketball Videos
TABLE 5
Scores for Football Videos
TABLE 3
Scores for Office Surveillance Videos
TABLE 6
Comparison with Recent Semantic Content Extraction Studies
[6]. InTable 6, comparisons of the results of this study andthe
other mentioned studies are given. Both for basketball and
surveillance videos similar or better precision andrecall rates
and BDA scores have been obtained when compared with
the results of these studies. The only exception is the rebound
event because of the reasons given above.
In order to evaluate the effect of additional rule usage on semantic content extraction, two different sets of rules are defined. First, rules for the positional relations Below and Left and the distance relation Near are defined in order to observe the effect of rules on the spatial relation computation cost. The videos used in these tests contain a total of 10,342 spatial relation instances, 1,504 of which have Below, Left, or Near as the spatial relation type. The spatial relation instances having these types are extracted by using the rule definitions, as sketched below.
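One way such rule definitions can lower the computation cost, sketched below under stated assumptions, is to derive a relation as the inverse of one that is already computed geometrically (Below from Above, Left from Right), so the geometric step runs only once per object pair. The rule pairings and the Python encoding are illustrative; the actual rules in the system are SWRL definitions over the domain ontology [28].

```python
# Hypothetical sketch of derived spatial-relation rules (assumed pairings,
# not the paper's exact SWRL rules). Below/Left are derived as inverses of
# the computed Above/Right instances; Near is completed symmetrically.
from typing import List, NamedTuple

class Relation(NamedTuple):
    rtype: str          # e.g., "Above", "Right", "Near"
    subject: str        # object instance identifier
    target: str
    membership: float   # fuzzy membership degree in [0, 1]

DERIVED = {"Above": "Below", "Right": "Left", "Near": "Near"}

def apply_rules(computed: List[Relation]) -> List[Relation]:
    """Derive rule-based instances from computed ones, keeping the
    fuzzy membership of the antecedent relation."""
    return [Relation(DERIVED[r.rtype], r.target, r.subject, r.membership)
            for r in computed if r.rtype in DERIVED]

computed = [Relation("Above", "monitor1", "desk1", 0.95),
            Relation("Right", "person1", "desk1", 0.80)]
print(apply_rules(computed))
# Below(desk1, monitor1, 0.95) and Left(desk1, person1, 0.80) are derived
# without any additional geometric computation.
```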
Initially, the spatial relation computation time is calculated for the case where no rule definition is made. Then, the rules are defined one by one, and the computation times are calculated after adding each rule definition. As can be seen in Fig. 7, the spatial relation computation time decreases as the number of rule definitions increases.

Fig. 7. Rule effect on spatial relation computation.
As the second test, working and busy concept rules are defined and utilized in the concept extraction process. Two more concept instances, one working and one busy concept, are extracted by using the rule definitions together with the domain ontology. Some complex situations, such as the one in the busy concept rule definition, are easily expressed with rules that use the individuals defined in the domain ontology; a sketch of such a rule is given below. Defining such cases with the representation capabilities of VISCOM alone would require extra semisemantic individuals directly related to the busy concept, which increases both the execution time of the extraction process and the complexity of the ontology.
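As an illustration, a minimal sketch of such a concept rule follows. The antecedents chosen here, a working concept interval overlapping another event such as a phone call, and the min-combination of the fuzzy memberships are assumptions made for the example; they are not the authors' exact busy rule.

```python
# Hypothetical concept rule (illustrative only; the system expresses such
# rules in SWRL over the domain ontology). A busy instance is asserted for
# every temporal overlap of a working concept with another event, with a
# fuzzy membership taken as the minimum of the antecedent memberships.
def busy_rule(working, events):
    """working, events: lists of (start, end, membership) tuples."""
    instances = []
    for ws, we, wm in working:
        for es, ee, em in events:
            start, end = max(ws, es), min(we, ee)
            if start < end:                    # the intervals overlap
                instances.append((start, end, min(wm, em)))
    return instances

working = [(10, 120, 0.90)]
phone_calls = [(40, 70, 0.80)]                 # assumed second event type
print(busy_rule(working, phone_calls))         # [(40, 70, 0.8)]
```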
Some practical issues to consider in the evaluation of the proposed framework are the motion of objects in the third dimension, occlusion, and object tracking. In VISCOM, the spatial relation types are mostly defined in the image plane (far, near, left, right, etc.). After some preliminary tests on the system, it is observed that relations in the third dimension and the effect of perspective on object sizes cause inaccuracies in the extraction results. Therefore, the extraction process is updated to utilize a relative-distance approach. Besides, the system performs the extraction process temporally and evaluates consecutive keyframes in the video. Therefore, except for cases in which the objects are occluded throughout an event, using consecutive keyframes shields the system from the effects of object occlusion. In addition, such temporality enables tracking of objects quite well.
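A plausible form of such a relative-distance measure is sketched below; normalizing the pixel distance by the apparent sizes of the objects is our assumption about the approach, intended to keep near/far decisions stable as perspective shrinks distant objects.

```python
# Hypothetical sketch of a relative-distance measure (an assumed reading of
# the relative-distance approach described above, not the paper's formula).
import math

def relative_distance(box_a, box_b):
    """Boxes are (x, y, width, height) in image coordinates; the centroid
    distance is normalized by the mean diagonal of the two boxes."""
    ax, ay = box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3] / 2.0
    bx, by = box_b[0] + box_b[2] / 2.0, box_b[1] + box_b[3] / 2.0
    pixel_dist = math.hypot(ax - bx, ay - by)
    diag_a = math.hypot(box_a[2], box_a[3])
    diag_b = math.hypot(box_b[2], box_b[3])
    return pixel_dist / ((diag_a + diag_b) / 2.0)

# The same pixel separation counts as farther apart for small (distant)
# objects than for large (close) ones.
print(relative_distance((0, 0, 50, 100), (200, 0, 40, 80)))   # ~1.94
print(relative_distance((0, 0, 20, 40), (200, 0, 16, 32)))    # ~4.92
```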
All these tests show that the proposed ontology-based automatic semantic content extraction framework is successful for both event and concept extraction. Two points must be ensured to achieve this success. The first is to obtain object instances correctly: whenever a missing or misclassified object instance occurs in the object instance set used by the framework as input, the success of event and concept extraction decreases. The second is to use the proposed VISCOM metamodel effectively and to construct a well and correctly defined domain ontology: wrong, extra, or missing definitions in the constructed ontology can decrease the extraction success. In the tests, we encountered wrong extractions because of wrong Similarity class individual definitions for the typing event in the office domain.
5 CONCLUSION
The primary aim of this research is to develop a framework for an automatic semantic content extraction system for videos that can be utilized in various areas, such as surveillance, sport events, and news video applications. The novel idea here is to utilize domain ontologies generated with a domain-independent ontology-based semantic content metaontology model and a set of special rule definitions.
The Automatic Semantic Content Extraction Framework contributes in several ways to the semantic video modeling and semantic content extraction research areas. First of all, the semantic content extraction process is performed automatically. In addition, a generic ontology-based semantic metaontology model for videos (VISCOM) is proposed. Moreover, the semantic content representation capability and extraction success are improved by adding fuzziness to class, relation, and rule definitions. An automatic Genetic Algorithm-based object extraction method is integrated into the proposed system to capture semantic content. In every component of the framework, ontology-based modeling and extraction capabilities are used. The test results clearly show the success of the developed system.
As a further study, one can improve the model and the extraction capabilities of the framework for spatial relation extraction by considering the viewing angle of the camera and motion in the depth dimension.
ACKNOWLEDGMENTS
This work is supported by a research grant from TÜBİTAK EEEAG with grant number 109E014.
REFERENCES
[1] M. Petkovic and W. Jonker, An Overview of Data Models and
Query Languages for Content-Based Video Retrieval, Proc. Intl
Conf. Advances in Infrastructure for E-Business, Science, and Education
on the Internet, Aug. 2000.
[2] M. Petkovic and W. Jonker, Content-Based Video Retrieval by
Integrating Spatio-Temporal and Stochastic Recognition of
Events, Proc. IEEE Intl Workshop Detection and Recognition of
Events in Video, pp. 75-82, 2001.
[3] L.S. Davis, S. Fejes, D. Harwood, Y. Yacoob, I. Haratoglu, and M.J.
Black, Visual Surveillance of Human Activity, Proc. Third Asian
Conf. Computer Vision (ACCV), vol. 2, pp. 267-274, 1998.
[4] G.G. Medioni, I. Cohen, F. Bremond, S. Hongeng, and R. Nevatia,
Event Detection and Analysis from Video Streams, IEEE Trans.
Pattern Analysis Machine Intelligence, vol. 23, no. 8, pp. 873-889,
Aug. 2001.
[5] S. Hongeng, R. Nevatia, and F. Bremond, Video-Based Event
Recognition: Activity Representation and Probabilistic Recogni-
tion Methods, Computer Vision and Image Understanding, vol. 96,
no. 2, pp. 129-162, 2004.
[6] A. Hakeem and M. Shah, Multiple Agent Event Detection and
Representation in Videos, Proc. 20th Natl Conf. Artificial
Intelligence (AAAI), pp. 89-94, 2005.
[7] M.E. Dönderler, E. Saykol, U. Arslan, Ö. Ulusoy, and U. Güdükbay, Bilvideo: Design and Implementation of a Video Database Management System, Multimedia Tools Applications, vol. 27, no. 1, pp. 79-104, 2005.
[8] T. Sevilmis, M. Bastan, U. Güdükbay, and Ö. Ulusoy, Automatic Detection of Salient Objects and Spatial Relations in Videos for a Video Database System, Image Vision Computing, vol. 26, no. 10, pp. 1384-1396, 2008.
[9] M. Köprülü, N.K. Cicekli, and A. Yazici, Spatio-Temporal Querying in Video Databases, Information Sciences, vol. 160, nos. 1-4, pp. 131-152, 2004.
[10] J. Fan, W. Aref, A. Elmagarmid, M. Hacid, M. Marzouk, and X.
Zhu, Multiview: Multilevel Video Content Representation and
Retrieval, J. Electronic Imaging, vol. 10, no. 4, pp. 895-908, 2001.
[11] J. Fan, A.K. Elmagarmid, X. Zhu, W.G. Aref, and L. Wu,
Classview: Hierarchical Video Shot Classification, Indexing,
and Accessing, IEEE Trans. Multimedia, vol. 6, no. 1, pp. 70-86,
Feb. 2004.
[12] L. Bai, S.Y. Lao, G. Jones, and A.F. Smeaton, Video Semantic
Content Analysis Based on Ontology, IMVIP 07: Proc. 11th Intl
Machine Vision and Image Processing Conf., pp. 117-124, 2007.
[13] R. Nevatia and P. Natarajan, EDF: A Framework for Semantic
Annotation of Video, Proc. 10th IEEE Intl Conf. Computer Vision
Workshops (ICCVW 05), p. 1876, 2005.
[14] A.D. Bagdanov, M. Bertini, A. Del Bimbo, C. Torniai, and G. Serra,
Semantic Annotation and Retrieval of Video Events Using
Multimedia Ontologies, Proc. IEEE Intl Conf. Semantic Computing
(ICSC), Sept. 2007.
[15] R. Nevatia, J. Hobbs, and B. Bolles, An Ontology for Video Event
Representation, Proc. Conf. Computer Vision and Pattern Recogni-
tion Workshop, p. 119, http://ieeexplore.ieee.org/xpls/abs_all.
jsp?arnumber=1384914, 2004.
[16] U. Akdemir, P.K. Turaga, and R. Chellappa, An Ontology Based
Approach for Activity Recognition from Video, Proc. ACM Intl
Conf. Multimedia, A. El-Saddik, S. Vuong, C. Griwodz, A.D. Bimbo,
K.S. Candan, and A. Jaimes, eds., pp. 709-712, http://dblp.uni-
trier.de/db/conf/mm/mm2008.html#AkdemirTC08, 2008.
[17] Y. Yildirim, Automatic Semantic Content Extraction in Video
Using a Spatio-Temporal Ontology Model, PhD dissertation,
Computer Eng. Dept., METU, Turkey, 2009.
[18] T. Yilmaz, Object Extraction from Images/Videos Using a
Genetic Algorithm Based Approach, masters thesis, Computer
Eng. Dept., METU, Turkey, 2008.
[19] Y. Yildirim and A. Yazici, Ontology-Supported Video
Modeling and Retrieval, Proc. Fourth Intl Conf. Adaptive
Multimedia Retrieval: User, Context, and Feedback (AMR), pp. 28-
41, 2006.
[20] Y. Yildirim, T. Yilmaz, and A. Yazici, Ontology-Supported Object
and Event Extraction with a Genetic Algorithms Approach for
Object Classification, Proc. Sixth ACM Intl Conf. Image and Video
Retrieval (CIVR 07), pp. 202-209, 2007.
[21] V. Mezaris, I. Kompatsiaris, N.V. Boulgouris, and M.G. Strintzis,
Real-Time Compressed-Domain Spatiotemporal Segmentation
and Ontologies for Video Indexing and Retrieval, IEEE Trans.
Circuits Systems Video Technology, vol. 14, no. 5, pp. 606-621,
May 2004.
[22] D. Song, H.T. Liu, M. Cho, H. Kim, and P. Kim, Domain
Knowledge Ontology Building for Semantic Video Event Descrip-
tion, Proc. Intl Conf. Image and Video Retrieval (CIVR), pp. 267-275,
2005.
[23] W. Chen and D.S. Warren, C-logic of Complex Objects, PODS
89: Proc. Eighth ACM SIGACT-SIGMOD-SIGART Symp. Principles
of Database Systems, pp. 369-378, 1989.
[24] J.F. Allen, Maintaining Knowledge about Temporal Intervals,
Comm. ACM, vol. 26, no. 11, pp. 832-843, 1983.
[25] M.J. Egenhofer and J.R. Herring, A Mathematical Framework for
the Definition of Topological Relationships, Proc. Fourth Intl
Symp. Spatial Data Handling, pp. 803-813, 1990.
[26] M. Vazirgiannis, Uncertainty Handling in Spatial Relation-
ships, SAC 00: Proc. ACM Symp. Applied Computing, pp. 494-
500, 2000.
[27] P.-W. Huang and C.-H. Lee, Image Database Design Based on
9D-SPA Representation for Spatial Relations, IEEE Trans. Knowl-
edge and Data Eng., vol. 16, no. 12, pp. 1486-1496, Dec. 2004.
[28] I. Horrocks, P.F. Patel-Schneider, H. Boley, S. Tabet, B. Grosof, and
M. Dean, Swrl: A Semantic Web Rule Language, technical
report, W3C, http://www.w3.org/Submission/SWRL/, 2004.
[29] Protege Ontology Editor, http://protege.stanford.edu/, 2012.
[30] Jena: A Semantic Web Framework, http://www.hpl.hp.com/
semweb/, 2012.
[31] C. Xu, J. Wang, K. Wan, Y. Li, and L. Duan, Live Sports Event
Detection Based on Broadcast Video and Web-Casting Text,
MULTIMEDIA 06: Proc. 14th Ann. ACM Intl Conf. Multimedia,
pp. 221-230, 2006.
[32] Y. Zhang, C. Xu, Y. Rui, J. Wang, and H. Lu, Semantic Event
Extraction from Basketball Games Using Multi-Modal Analysis,
Proc. IEEE Intl Conf. Multimedia and Expo (ICME 07), pp. 2190-
2193, 2007.
Yakup Yildirim received the BS degree in
computer science and information engineering
from Bilkent University in 1997 and the MS
and PhD degrees in computer engineering
from Middle East Technical University (METU)
in 2000 and 2009, respectively. He has been
working as a senior scientist in the Capability
Development group at the NCI Agency since
2008. His research interests include multi-
media and video databases, ontology and
semantic content modeling.
Adnan Yazici received the PhD degree in
computer science from the Department of
Electrical Engineering and Computer Science,
Tulane University, New Orleans, in 1991, where
he also was a visiting professor between 1998
and 2000. He is a full professor in the Depart-
ment of Computer Engineering, METU. His
current research interests include intelligent
database systems, fuzzy database modeling,
spatiotemporal databases, multimedia and video
databases, and wireless multimedia sensor networks. He has published
more than 150 papers in refereed international journals and conferences
and coauthored two books. He is a senior member of the IEEE.
Turgay Yilmaz received the BS degree in
computer engineering from Bilkent University in
2004 and the MS degree in computer engineer-
ing from Middle East Technical University
(METU) in 2008, where he is currently working
toward the PhD degree under the supervision of
Prof. Dr. Adnan Yazici. Currently, he is a senior
software engineer at HAVELSAN, Inc. His
research interests include multimedia data-
bases, machine learning, intelligent systems,
fuzzy systems, soft computing and evolutionary computation. He is a
student member of the IEEE.