
TOPIC

By

XYZ

Supervisor

Dr
A thesis submitted in partial fulfillment of
the requirements for the degree of
Master's in Information Technology

In

Department of Information Technology


Pakistan

(July 2008)

APPROVAL
It is certified that the contents and form of the thesis submitted have been found
satisfactory for the requirements of the degree.

Advisor: __________________

Committee Member: _________________

Committee Member: _________________

Committee Member: _________________


IN THE NAME OF ALMIGHTY ALLAH


THE MOST BENEFICENT AND THE MOST MERCIFUL

TO MY PARENTS,
BROTHER AND SISTERS


CERTIFICATE OF ORIGINALITY
I hereby declare that this submission is my own work and to the best of my knowledge it
contains no materials previously published or written by another person, nor material which to a
substantial extent has been accepted for the award of any degree or diploma at BZU or at any
other educational institute, except where due acknowledgement has been made in the thesis. Any
contribution made to the research by others, with whom I have worked at BZU or elsewhere, is
explicitly acknowledged in the thesis.

I also declare that the intellectual content of this thesis is the product of my own work, except for
the assistance from others in the project's design and conception or in style, presentation and
linguistics, which has been acknowledged.

Author Name:

Signature: ______________


ACKNOWLEDGEMENTS
First of all, I am extremely thankful to Almighty Allah for giving me the courage and strength to
complete this challenging task and to compete with the international research community. I am also
grateful to my family, especially my parents, who have supported and encouraged me through
their prayers, which have always been with me.
I am highly thankful to my supervisor for his valuable suggestions and continuous guidance
throughout my research work. His foresight and critical analysis taught me a great deal about
research, which will be helpful to me in my practical life.
I would like to offer my gratitude to all the members of the research group and my close
colleagues who have encouraged me throughout my research work, especially Mr. Maruf Pasha.

TABLE OF CONTENTS

List of Figures
List of Tables
List of Abbreviations
Abstract

CHAPTER 1: INTRODUCTION
  1.1. Motivation
  1.2. Problem Definition
  1.3. Objective and Goals of Research
  1.4. Outline of Thesis

CHAPTER 2: BACKGROUND STUDIES
  2.1. Data Integration
  2.2. Issues in Data Integration
  2.3. Approaches to Data Integration
  2.4. Query Processing in Data Integration
  2.5. Ontology
  2.6. Ontology Modeling Languages
  2.7. Indexing

CHAPTER 3: LITERATURE SURVEY
  3.1. Query Reformulation
  3.2. State of the Art Techniques
  3.3. Critical Analysis

CHAPTER 4: PROPOSED ARCHITECTURE
  4.1. Proposed Architecture for the Relevance Reasoning
  4.2. Semantic Matching & Source Ranking for RDF Triples
  4.3. Proposed Semantic Matching Methodology
  4.4. Explanation of Proposed Methodology using a Case Study

CHAPTER 5: IMPLEMENTATION
  5.1. RDF Data / Ontologies in Oracle Database
  5.2. Setting up the Stage for Implementation
  5.3. Implementation of the Proposed Architecture for Relevance Reasoning

CHAPTER 6: RESULTS AND EVALUATION
  6.1. System Specification
  6.2. Evaluation Criteria
  6.3. Data Specification
  6.4. Test Queries
  6.5. Experiments for Response Time of Query Execution
  6.6. Experiments for System Accuracy

CHAPTER 7: CONCLUSION AND FUTURE DIRECTIONS
  7.1. Discussion
  7.2. Main Contribution of the Project
  7.3. Future Direction

REFERENCES

LIST OF FIGURES

Figure 1: Data Warehousing Architecture for Data Integration
Figure 2: Mediator Wrapper Architecture for Data Integration
Figure 3: RDF Triple as Directed Graph
Figure 4: Structure of a Bitmap Index
Figure 5: Proposed Architecture for Relevance Reasoning in Data Integration Systems
Figure 6: Sequence Diagram for Ontology Management Workflow
Figure 7: Pseudo-code for RDF Triple Registration of the Global Ontology
Figure 8: InverseOf SameAs Rule Inserted in the Rule-Base
Figure 9: TransitiveOf SameAs Rule Inserted in the Rule-Base
Figure 10: Pseudo-code for RDF Triple Creation of the Local Ontology
Figure 11: Pseudo-code for Bitmap Segment Creation
Figure 12: Pseudo-code for Bitmap Synchronization
Figure 13: Sequence Diagram for Source Registration Workflow
Figure 14: Sequence Diagram for Relevance Reasoning Workflow
Figure 15: Pseudo-code for Query Expansion in Relevance Reasoning Workflow
Figure 16: Pseudo-code for Source Selection in Relevance Reasoning Workflow
Figure 17: Snapshot of the Global Ontology
Figure 18: Concept & Relationship Hierarchies Managed using Semantic Operators over the Global Ontology
Figure 19: Database Schema to Store Ontology in Oracle NDM
Figure 20: Package Diagram of the Proposed Architecture for Relevance Reasoning
Figure 21: Time Complexity of System (Query with 3 Triples)
Figure 22: Time Complexity of System (Query with 6 Triples)
Figure 23: Time Complexity of System (Query with 9 Triples)
Figure 24: Performance Gain of the System with respect to Direct Ontology Traversal
Figure 25: Precision vs. Recall Comparison of the Proposed Methodology with the MiniCon Algorithm


LIST OF TABLES

Table 1: Relevance Levels and Scoring Strategy
Table 2: RDF Triples of the Global Ontology
Table 3: Structure of Bitmap Index
Table 4: RDF Triples of the Data Sources
Table 5: Structure of Bitmap Index after Sources are Registered
Table 6: Buckets Created for the RDF Triples
Table 7: Inferred RDF Triples for a User's Query Triple
Table 8: Semantic Similarity Calculation of a Data Source for a User Query Triple
Table 9: Semantic Similarity Calculation of a Data Source for a User Query


LIST OF ABBREVIATIONS

XML     Extensible Markup Language
WWW     World Wide Web
DAML    DARPA Agent Markup Language
OWL     Web Ontology Language
API     Application Programming Interface
DIS     Data Integration Systems
NDM     Network Data Model
RDF     Resource Description Framework
W3C     World Wide Web Consortium
URL     Uniform (Universal) Resource Locator
ICT     Information and Communication Technologies
AI      Artificial Intelligence
UMLS    Unified Medical Language System
IM      Information Manifold
GUID    Global Unique Identifier
LUID    Local Unique Identifier
SDS     Source Description Storage

ABSTRACT

Online data sources are autonomous, heterogeneous and geographically distributed. The data sources can join and leave a data integration system arbitrarily. Some sources may not contribute significantly to a user query because they are not relevant to it. Executing queries against all the available data sources consumes resources unreasonably, and consequently these queries become expensive.

Source selection is an approach to resolve this issue. The existing techniques of relevance reasoning for source selection take significant time in traversing the source descriptions. Consequently, query response time degrades as the number of available sources grows. Moreover, a simple matching process is unable to sort out the fine-grained semantic heterogeneities of data. Semantic heterogeneity of data sources makes relevance reasoning complex. These issues degrade the performance of data integration systems.

In this research, we have proposed an ontology-driven relevance reasoning architecture that identifies relevant data sources for a user query before its execution. The proposed methodology aligns source descriptions (i.e., local ontologies) with the domain ontology through a bitmap index. Instead of traversing the local ontologies, the methodology utilizes the bitmap index to perform relevance reasoning in order to improve query response. Semantic matching has been employed in relevance reasoning for the provision of semantic interoperability. Semantic operators, such as exactMatch, sameAs, equivalentOf, subClassOf, and disjointFrom, have been introduced to sort out fine-grained semantic heterogeneities among data sources. Quantitative scores are assigned to the operators, and data sources are ranked based on the similarity score they obtain.

A prototype system has been designed and implemented to validate the methodology. The evaluation criteria used are (a) query response time and (b) accuracy of relevant source selection. The prototype system has been compared with existing systems for evaluation. Query response time and accuracy of source selection, in terms of precision and recall, have been improved due to the incorporation of a bitmap index and ontology, respectively.

CHAPTER 1

INTRODUCTION
This chapter introduces the research work undertaken in this thesis. It includes the
motivation for and definition of the problem. Moreover, the objectives and goals of the
research are also discussed.
1.1. Motivation

The exponential growth in data sources on the Internet is due to advancements in information and communication technologies (ICT). Some data sources contain interrelated data that could answer a user query. Retrieving data from these interrelated data sources is a non-trivial task due to their properties, i.e., autonomy, heterogeneity and geographical distribution [1, 8, 11, 23]. The sources can be heterogeneous in terms of syntax, schema, or semantics. The task of a data integration system is to enable the interoperation of autonomous and distributed data sources for knowledge discovery through a centralized access point. It provides a uniform query interface that gives a user transparent access for querying data sources. However, the properties discussed above make integration among the sources a pervasive challenge and a crucial task [1, 8, 23].

A variety of approaches to data integration exists. These approaches can be generally classified into two major categories: (a) data warehousing and (b) mediation [1, 28]. In data warehousing, the required data is extracted from the sources and stored in a centralized repository after integration, while in mediation, data is gathered and integrated when a user query is submitted. Query execution is efficient and response time is predictable in warehousing, but results are stale. On the contrary, query execution is slower in mediation, but results are up to date [1, 21, 28].

The growth of online data sources requires a scalable data integration system because
the sources are unpredictable due to their autonomy. In other words, data sources can join
and leave the system arbitrarily. Thus, checking the availability of a data source before
executing a query is needed. Moreover, all the data sources may not have the required
information. Executing a query on all data sources is an expensive solution due to the fact
that an available source may not contribute any significant information to the user query
result [8, 20, 23]. In order to execute queries efficiently in these systems, we need to
identify relevant and effective data sources that are available at the time of execution.
This research work focuses on relevance reasoning for identifying relevant and effective
data sources in a scalable data integration system.
1.2. Problem Definition

Identifying relevant sources in a scalable data integration system faces problems due to semantic heterogeneity and lack of performance. We highlight these problems in depth in the following paragraphs.

Semantic Heterogeneity: Data sources are developed by independent organizations, so there might be semantic differences between their schemas [20]. In different data sources, the same concept may be represented with different names, such as instructor, teacher or lecturer. Similarly, different concepts in different data sources may be represented by the same name, such as bank, i.e., a bank can be a river bank or a financial institution.

Performance in Query Response Time: Some data sources may not contribute significantly to a user query because they are not relevant. Executing a query on all available data sources, without any estimation of their relevance for a user query, degrades the performance of the query. This leads to unreasonable wastage of the resources of the data integration systems.
1.3. Objective and Goals of Research

The goal of this research is to provide a mechanism for relevance reasoning in a scalable data integration system in general. In particular, our objective is to work on relevance reasoning in the following directions.

Provision of Semantic Interoperability in Relevance Reasoning: Ontology, initially developed by the artificial intelligence community for knowledge sharing and reuse, is a formal, explicit specification of a shared conceptualization [5]. Ontology is largely used for representing domain knowledge and can play a vital role in reconciling semantic heterogeneities due to its representational and expressive capabilities [3, 4]. In this research, we exploit the capabilities of domain ontology for the provision of semantic interoperability to handle source heterogeneities during relevance reasoning.

Optimization of the Relevance Reasoning Mechanism: Indexing structures are used in databases to access data efficiently [27, 28]. We have proposed semantic indexing using the bitmap technique to represent the metadata of data sources. A user query is executed through the bitmap index to identify relevant data sources. The index performs relevance reasoning more efficiently, thereby improving query response time.
1.4. Outline of Thesis

The rest of the document is organized as follows: Chapter 2 describes a data integration system and its various components; RDF is also explained as a language for developing ontologies and for storing source descriptions and semantic mappings. Chapter 3 discusses various algorithms for relevance reasoning along with their critical analysis. Chapter 4 presents the proposed system architecture and the proposed semantic matching process, along with the proposed methodology for relevance reasoning. Chapter 5 gives a complete overview of the implementation details. Chapter 6 presents the experimentation and comparative analysis conducted to validate the proposed architecture, and discusses the conducted experiments. Chapter 7 concludes the thesis and defines future research directions.

CHAPTER 2

BACKGROUND STUDIES
This chapter provides background literature in order to understand the context of this
research. Data integration and semantic heterogeneity are discussed. The details of
ontology and its design methodology, as well as indexing, have also been included.
2.1. Data Integration

Data sources on the Internet are growing exponentially in size and number over time. These data sources contain information about different topics such as the stock market, product information, real estate, and entertainment. The data from these sources can be used for answering complex user queries, and this might go beyond traditional searches. Advancements in information and communication technology have enabled users to access a wide array of data sources that are related in some way and to integrate the results to come up with useful information that might not be stored physically in a single place [1, 8, 12, 24].

Data integration enables the interoperability of the data sources for knowledge discovery through a centralized access point, and provides a uniform query interface that gives the user the illusion of querying a homogeneous system [2, 15, 19, 31]. In data integration, a user is provided with a unified interface for posing queries, which is based on a schema typically referred to as the global schema or mediated schema. Depending on the approach used to develop the data integration system, the user is provided with the appropriate result obtained from the underlying data sources, either from a centrally materialized repository or in real time.

2.2. Issues in Data Integration

Data sources in data integration are maintained by different organizations, are geographically distributed, and are managed autonomously. This scenario creates a variety of barriers to integrating data from the participating data sources. The most common issues include (a) autonomy and (b) semantic heterogeneity. In order to achieve scalable data integration, these issues need to be sorted out.
2.2.1. Autonomy: In data integration, autonomy indicates the ability of data sources to control their data and processing capabilities. The data sources retain their autonomy even after becoming part of a data integration system [24, 31]. This autonomy raises the following issues:

- The source data administrators might not be interested in, or may not have the resources for, helping the integrators understand how their site's schema relates to the schemas of the other sites being integrated.
- The source data administrators might change their site's schema without forewarning the integrators, which can lead the integration software to make invalid assumptions about the data source.
- The data source administrators might choose a schema that is very difficult to integrate with the other schemas in the integrated system.

2.2.2. Semantic Heterogeneity: In data integration, heterogeneities come from different programming and data models as well as from different conceptualizations of a real-world object. Among these heterogeneities is semantic heterogeneity [20]. A variety of semantic heterogeneities can be found across different data sources. A few of them are:

2.2.2.1. Synonym: The same concept may be represented with different names in different data sources, e.g., Course, Subject.

2.2.2.2. Homonym: Different concepts in different data sources may be represented by the same name, e.g., bear can be an animal or a verb meaning tolerate.

2.2.2.3. Degree of likelihood: Two concepts can be relevant to each other on the basis of a degree of likelihood. This does not mean equality of concepts as with synonyms, but rather relatedness, e.g., <:Teacher :isTeaching :Course> and <:TeachingAssistant :isAssisting :Course>; here teaching assistant and teacher are not the same concept but are relevant to each other with a certain degree of likelihood.
2.3. Approaches to Data Integration

A variety of approaches to data integration exists. These approaches can be generally classified into two major categories: (a) data warehousing and (b) mediation.

2.3.1. Warehouse: In data warehousing, the required data is extracted from the sources and stored in a centralized repository after integration [19, 24]. Users pose queries against the data model of the warehouse. This approach is also known as the eager approach or materialized view approach to data integration. Query execution is efficient and response time is predictable in this approach, but results can often be stale [1]. Figure 1 shows the data warehousing architecture [24].

Figure 1: Data Warehousing Architecture for Data Integration

2.3.2. Mediation: In the mediation approach, a user is given a unified schema, containing virtual relations, for posing a query. Data is not loaded into a central repository in advance in this approach; rather, queries are executed at run time [1, 19, 20, 24]. In order to answer a user query using the information sources, metadata is needed that describes the semantic relationships between the elements of the mediated schema and the schemas of the underlying data sources. This metadata is known as a source description. This approach is also known as the lazy approach or virtual view approach to data integration. Query efficiency is lower in mediation, but results are up to date [1, 21, 24]. Figure 2 depicts the mediation-based architecture for data integration [24].

Figure 2: Mediator Wrapper Architecture for Data Integration

2.4. Query Processing in Data Integration

The main objective of data integration is to facilitate access to a set of autonomous, heterogeneous and distributed data sources. The ability to efficiently and correctly execute a query over the integrated data lies at the heart of data integration. The main steps in processing a query in data integration are (1) query reformulation and (2) query planning and execution.

2.4.1. Query Reformulation: Query reformulation is the first step in query processing, where a user query, previously written in terms of a mediated schema, is reformulated using information about the sources into queries that refer directly to the schemas of the underlying data sources [1, 8, 10, 11, 19, 24]. Query reformulation is further divided into two steps: (a) source identification and (b) query rewriting.

2.4.1.1. Source identification: Before executing a user query, relevant and effective sources should be clearly identified to optimize query execution. Relevance reasoning is the process of identifying relevant sources and pruning irrelevant and redundant data sources. The main focus of our research is to propose an algorithm that can speed up the process of relevance reasoning.

2.4.1.2. Query rewriting: Once relevant sources have been identified, query rewriting is performed and source-specific queries are reformulated only for those sources that have been found relevant and can contribute some result to the user's query.

2.4.2. Query Planning and Execution: Query reformulation provides some optimization by pruning irrelevant and overlapping sources to avoid redundant computation. The reformulated queries are evaluated using different strategies, producing multiple execution plans during optimization [11, 12]. The query execution engine executes these queries using the best and cheapest execution plan and deals with the limitations and capabilities of the data sources [28]. During execution, an important issue is to minimize the time to return the first answers to the query, rather than minimizing the total amount of work to be done to execute the whole query [21, 24].
2.5. Ontology

Ontology is defined as an explicit and formal specification of a shared conceptualization [3, 4, 15]. In this definition, the term conceptualization refers to an abstract model of some domain knowledge that identifies the relevant concepts of the domain. The term shared indicates that ontology captures consensual knowledge that is accepted by a group of people and systems. The term explicit means that the concepts and the constraints on these concepts are explicitly defined. Finally, the term formal means that the ontology should be machine understandable [15]. Ontology was initially developed by the Artificial Intelligence (AI) community to facilitate knowledge sharing and reuse. Ontology carries the semantics of a particular domain and is hence used for representing domain knowledge. Ontology is widely used in data standardization and conceptualization. Ontologies have proven to be an essential element in many applications, including agent systems, knowledge management systems, and e-commerce systems. They can also generate natural-language-like queries, integrate intelligent information, and provide semantic-based access to the Internet [36]. An ontology can be a taxonomy (e.g., Yahoo categories), a domain-specific standard terminology (e.g., UMLS and the Gene Ontology), or an online lexical database (e.g., WordNet).


Ontology consists of concepts, properties, and individuals. A concept is a thing of significance in the real world. Concepts may be organized into a superclass and subclass hierarchy, also known as a taxonomy, where subclasses specialize their superclasses. Concepts in an ontology can be synonyms or disjoint. Properties represent relationships between two concepts. Properties may have a domain and a specified range. Properties may be inverse, functional, transitive, or symmetric. Individuals represent objects in the domain. An ontology needs a reasoner, which can check whether or not all of the statements and definitions in the ontology are mutually consistent and can also recognize which concepts fit under which definitions. The reasoner can help to maintain the hierarchy correctly.
2.6. Ontology Modeling Languages: In order to develop ontology-driven applications, a language is needed to facilitate the semantic representation of the information required by these applications. A number of research groups identified the need for a more powerful ontology modeling language, which led to joint initiatives for building such languages. Therefore, a number of ontology modeling languages are available and in use today [36]. The most common ontology modeling languages include XML Schema [35], DAML+OIL [37], RDF and RDFS [25], and OWL [38]. Among all these ontology languages, we are most interested in RDF and RDFS for their role in data integration and the semantic web [4, 6, 25, 26].
2.6.1. RDF and RDFS: The Resource Description Framework (RDF) is a standard, developed by the World Wide Web Consortium (W3C), for representing information about resources. RDF provides interoperability across resources due to its simple structure. RDF Schema (RDFS) is a language for describing vocabularies of RDF data in terms of primitives such as Class, Property, domain, and range. The machine-understandable format of RDF facilitates the automated processing of web resources [5, 6, 26]. In RDF, a pair of resources (nodes) connected by a property (edge) forms a statement: (resource, property, value), often called an RDF triple. A set of triples is known as a model or graph. The components of a triple are a subject, a predicate (or property), and an object. Each triple represents a complete and unique fact about a specific domain. It can be modeled as a link in a directed graph, as shown in Figure 3: the subject is the start node of the link, the object is the end node, and the direction of the link always points towards the object. A detailed description of the RDF language can be found in [25].

Figure 3: RDF Triple as Directed Graph (a subject node linked by a directed predicate edge to an object node)

Some of the important concepts of RDF are discussed below:

- A URI is a more generic form of the Uniform Resource Locator (URL). It allows us to locate a web resource without a specific network address (e.g., http://www.niit.edu.pk/delsa#Instructor).
- A blank node is used when the subject or object of a triple is unknown, or when the relationship between the subject and object is n-ary.
- A literal is a string which is used to represent names, dates, and numbers.
- A typed literal is a string combined with its data type (e.g., Smith^^http://www.w3.org/2001/XMLSchema#string).
- A container is a resource that is used to describe a group of things. Participants in a container are members of the group. Blank nodes are usually used to represent containers.
- Reification allows triples to be attached to other triples as properties. One of its major issues is its representational complexity; therefore it is sometimes termed "The Big Ugly".
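To make the triple structure concrete, here is a minimal sketch using the Python rdflib library (our choice for illustration only; the thesis implementation itself stores triples in the Oracle RDF Data Model, as discussed below). The namespace is the one from the URI example above; the properties and the literal are hypothetical.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

# Namespace from the URI example above; the properties below are hypothetical.
NIIT = Namespace("http://www.niit.edu.pk/delsa#")

g = Graph()

# One triple (subject, predicate, object): a complete, unique fact.
g.add((NIIT.Instructor, NIIT.isTeaching, NIIT.Course))

# A typed literal as the object of a triple.
g.add((NIIT.Instructor, NIIT.hasName, Literal("Smith", datatype=XSD.string)))

# Each line of the N-Triples serialization is one directed subject->object link.
print(g.serialize(format="nt"))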
A variety of RDF storage systems and browsers are available, such as Jena [33], Kowari [34], Sesame [35], Longwell [36], and the Oracle RDF Data Model [37, 40]. We have used the Oracle RDF Data Model for managing the global ontology and the source descriptions because it is efficient in terms of storage and is not hampered by slow performance. It provides a basic infrastructure for effectively managing RDF data in databases. At the same time, the RDF data can be readily integrated, managed and analyzed together with other enterprise data. A comparative analysis of RDF storage systems [26] showed that the Oracle RDF Data Model outperforms the other existing systems.
2.7. Indexing

Databases spend a lot of their time finding things, so lookups need to be performed as fast as possible to speed up the searching mechanism. Indexes provide the basis for both rapid random lookups and efficient ordered access to data. An index is associated with a search key, that is, one or more attributes of a relation for which the index provides fast access. The disk space required to store an index is typically less than the storage of the table. Indexes can be primary or secondary. A variety of indexing techniques are used in modern DBMSs, e.g., hash-based indexing, cluster indexing, tree-structured indexing, and bitmap indexing. The most efficient and compact indexing techniques for dealing with bulk data [26, 28] include (a) the B+tree index and (b) the bitmap index. In this thesis we use bitmap indexes due to their compact internal representation for bulk data.
13

2.7.1. Bitmap Index: Bitmap indexing is a specialized technique that is geared towards easy querying based on multiple search keys. In a bitmap index, attributes are stratified into a relatively small number of possible values and then queried based on that stratification. Internally, bitmap index entries are bitmap vectors of 0s and 1s. Figure 4 depicts the structure of a bitmap index. Bitmap indexing can benefit applications where ad hoc queries are executed on large amounts of data with a low level of concurrent transactions [26, 28]. The purpose of using a bitmap index in our approach is to provide pointers to RDF triples for efficient searching. Normal indexing could also achieve this functionality by storing an RDF triple with each index entry, but it consumes more space than bitmaps. In our bitmap index, a single bitmap vector represents the status of a whole source. Each bit in a bitmap vector corresponds to an RDF triple; if the bit is set, the source contains the corresponding RDF triple. A mapping function converts the bit position to an actual RDF triple. So the bitmap index provides the same functionality as a regular index even though it uses a different representation internally. The major benefits of bitmap indexing include:

2.7.1.1. Compact Storage and Reduced Response Time for Queries: Fully indexing an RDF repository with traditional indexes can be prohibitively expensive in terms of space, because an index can be several times larger than the actual RDF data. Bitmap indexes are only a fraction of the size of the data being indexed. This compact and concise representation helps to save space and reduce computation while searching for an RDF triple.

2.7.1.2. Very Efficient Parallel Data Manipulation and Loading: In our methodology, sources advertise their capabilities and contents in the form of RDF triples against the global ontology. A single source may contain bulks of RDF triples. Bitmap indexes are very efficient in bulk processing of data manipulation statements and data loading.
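The cross-tab structure shown in Figure 4 below can be illustrated with a small Python sketch of our own (the class and method names are hypothetical): each search key owns a bitmap vector, and each bit position corresponds to one indexed record, here an RDF triple.

class BitmapIndex:
    """Minimal bitmap index sketch: one bit vector per search key.

    Bit position i corresponds to record number i (in our setting, the
    RDF triple whose GUID maps to position i); a set bit means the key's
    source contains that triple.
    """

    def __init__(self) -> None:
        self.vectors: dict[str, int] = {}  # key -> bits packed into an int

    def set_bit(self, key: str, position: int) -> None:
        self.vectors[key] = self.vectors.get(key, 0) | (1 << position)

    def keys_with_bit(self, position: int) -> list[str]:
        # All search keys whose vector has the bit at `position` set.
        return [k for k, v in self.vectors.items() if v >> position & 1]


# Usage mirroring Figure 4: key "X" holds records 0..4, other keys hold none.
idx = BitmapIndex()
for pos in range(5):
    idx.set_bit("X", pos)
print(idx.keys_with_bit(3))  # -> ['X']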

Figure 4: Structure of a bitmap index (search keys mapped to bitmap vectors of 0s and 1s)

In a nutshell, we have discussed the data integration approaches that are widely used nowadays. Ontology and its modeling languages have been highlighted because they can help data integration systems cope with the semantic heterogeneities that exist in the domain of discourse. Finally, indexing has been discussed in general as a way to speed up the querying mechanism, and bitmap indexing in particular has been explained, which can be used to traverse semantic web metadata efficiently.


CHAPTER 3

LITERATURE SURVEY
Relevant data source selection during query reformulation in data integration systems has
attracted significant attention in the literature over the last few decades [5, 6, 7, 8, 11, 12,
19, 20, 21, 24]. This chapter discusses and evaluates the state-of-the-art algorithms used
in data integration systems for the identification of relevant data sources during query
reformulation.
3.1. Query Reformulation

In query reformulation, a user's query, previously written in terms of a mediated schema, needs to be reformulated or rewritten into queries that refer directly to the schemas of the underlying data sources [10, 11, 19, 24]. In the literature, query reformulation is further sub-divided into two steps: (a) relevant source selection and (b) query rewriting.

3.1.1. Relevant source identification: Before executing user queries, relevant and effective sources should be clearly identified, because not all of the available data sources may contribute significantly. Relevance reasoning is the process of identifying relevant sources and pruning irrelevant and redundant data sources.

3.1.2. Query rewriting: Once relevant sources have been identified, query rewriting is performed and source-specific queries are generated only for those sources that have been found relevant and can contribute some result to the user's query.
3.2. State of the Art Techniques

The main focus of this research is to propose an algorithm that can speed up the process of relevance reasoning. The following section elaborates the state-of-the-art algorithms that are used in different data integration systems for relevant source selection during query reformulation.
3.2.1. The Bucket Algorithm: This algorithm has been used in the Information Manifold (IM) [1, 20], a system for browsing and querying multiple networked information sources. IM provides a mechanism to describe the contents and the capabilities of data sources in source descriptions (which in our architecture are called source models). The Bucket algorithm uses the source descriptions to create query plans that can access several information sources to answer a query. The algorithm prunes irrelevant data sources using the source descriptions and reformulates source-specific queries only for the relevant data sources. In order to describe and reason about the contents of data sources, the relational model (augmented with certain object-oriented features) is used in IM. Technically, the algorithm constructs a number of buckets and checks a user query against each bucket to identify the relevant data sources. Once the relevant buckets for the sources have been identified, source-specific conjunctive queries are rewritten for each source.
3.2.2. The Inverse-Rules Algorithm: InfoMaster (http://infomaster.stanford.edu/) [19] is an information integration system that provides integrated access to multiple, distributed, and heterogeneous information sources on the Internet. InfoMaster creates a virtual data warehouse. The algorithm behind InfoMaster is the Inverse-Rules algorithm, which rewrites the definitions of data sources by constructing a set of rules. A set of rules is formulated to define the contents and the capabilities of each data source, and heterogeneities among the data sources are dealt with during rule construction. These rules guide the algorithm in computing records from data sources using the source definitions. The algorithm dynamically determines an efficient way to answer the user's query using as few sources as necessary. In simple words, it does not reformulate the query; rather, it reformulates the source definitions so that the original query can be easily answered over the reformulated rules.
3.2.3. The MiniCon Algorithm: The MiniCon algorithm [19, 21] improved upon the Bucket algorithm. The main motivation for developing the MiniCon algorithm was to pay attention to the performance aspects of query reformulation algorithms. The MiniCon algorithm finds the maximally contained rewriting of a conjunctive query using a set of conjunctive views. The Bucket algorithm completes in two steps: computing the buckets, and then reformulating the source-specific queries using the buckets of the relevant data sources. The main complexities involved in the Bucket algorithm are: (a) if the number of sound data sources is small, the Bucket algorithm may generate a large number of candidate solutions and then reject them; (b) the exponential conjunctive query containment test that is used to validate each candidate solution. The MiniCon algorithm pays attention to the interaction of the variables in the user query and in the source definitions to prune the sources that would otherwise be rejected later in the containment test. This timely detection of irrelevant data sources improves the performance of the MiniCon algorithm due to the smaller number of combinations to be checked.
3.2.4. The Shared-Variable-Bucket Algorithm: The design goal of this algorithm [38] is to address the deficiencies of the Bucket algorithm and develop an efficient algorithm for query reformulation. The key idea underlying this algorithm is to examine the shared variables and reduce the bucket contents in order to reduce view combinations. This reduction ultimately optimizes the second phase of the algorithm.
3.2.5. The CoreCover Algorithm: In this algorithm [39], views are materialized from source relations. The main aim of the algorithm is to find those rewritings which are guaranteed to produce an optimal physical plan. Its emphasis leans mostly towards query optimization; therefore, different cost models are also considered. The algorithm tries to find an equivalent rewriting rather than a contained rewriting.
3.3. Critical Analysis

The CoreCover algorithm [39] differs from the other query reformulation algorithms in the following respects. Firstly, it tries to find an equivalent rewriting, whereas all the other algorithms find a maximally-contained source-specific rewriting of the query. Secondly, it adopts the closed-world assumption to find an equivalent rewriting, whereas all the other algorithms take the open-world assumption. Thirdly, its reformulation stage of query processing has to guarantee an optimal plan for the query. The Bucket, MiniCon and Shared-Variable-Bucket algorithms construct buckets and then take the cartesian product of the buckets to produce source-specific rewritings. In the Bucket algorithm, the constructed buckets are large, which causes a lot of combinations to be computed and tested in the second phase; the MiniCon and Shared-Variable-Bucket algorithms avoid this deficiency. The MiniCon algorithm has been shown to outperform both the Bucket and the Inverse-Rules algorithms [21]. The Inverse-Rules algorithm is query independent: the rules are computed once and are applied to all queries, and they are easily extendable for functional dependencies [19]. However, this algorithm ignores the predicates during rewriting and requires an additional phase, added to the algorithm, to remove the irrelevant views [21].

None of the algorithms pays attention to fast and efficient traversal of the source descriptions. As the number of sources grows, their metadata also grows, which raises the question of how to reduce the search space of metadata during relevance reasoning to make the whole process more efficient. Answering this question ultimately leads to scalable data integration systems, where sources can join and leave the system arbitrarily and the query execution engine can synchronize itself with any change and submit sub-queries to the relevant and available data sources. Another deficiency of these algorithms is that most of them use relational models for source descriptions, whereas ontology-based models can represent fine-grained distinctions between the contents and capabilities of the different data sources. These fine-grained distinctions can help us reason about the data sources in a more precise and efficient manner.

In a nutshell, we have discussed the state-of-the-art algorithms used for query reformulation in data integration systems. These algorithms have been analyzed and compared with each other, and their features and deficiencies have been illustrated.


CHAPTER 4

PROPOSED ARCHITECTURE
In order to execute a user's query in the scalable data integration system proposed in [8], the query execution process needs to be optimized. We have proposed an ontology-driven relevance reasoning architecture to improve the response time of a user query during relevance reasoning. This chapter is organized into three major sections. In the first section, the components of the proposed relevance reasoning architecture are discussed. The second section explains the semantic matching process and the proposed scoring strategy. Finally, the proposed methodology for relevance reasoning is discussed in detail and elaborated through an example.
4.1. Proposed Architecture for the Relevance Reasoning

This section presents the proposed architecture designed for relevance reasoning for source selection in a data integration system. The proposed architecture, as shown in Figure 5, comprises several components, which are described as follows.
4.1.1. Global Ontology: The global ontology is the knowledge base of the proposed architecture. It helps in generating user queries and enabling semantic inference. The major components of the global ontology are: (1) the domain knowledge, which represents the domain of discourse in the form of RDF triples; each RDF triple is uniquely identified by a global unique identifier (GUID), and the GUIDs are used in the semantic indexing scheme for relevance reasoning; (2) the concept and relationship hierarchies, which represent semantic relationships among concepts and among relationships, respectively; these hierarchies help in resolving the semantic heterogeneities that exist in a domain; (3) the rule-base; a rule is an object that can be applied to deduce inferences from RDF triples; every rule is identified by its name and consists of two parts, (a) an antecedent, known as the body of the rule, and (b) a consequent, known as the head of the rule; the rule-base is an object that consists of rules; (4) the rules-index, which computes and maintains deduced inferences by applying a specific set of rule-bases in order to optimize reasoning.
4.1.2. Ontology Management Service: The ontology management service facilitates the creation and maintenance of the global ontology. It provides a set of application program interfaces (APIs) to perform the following functions: (1) publishing the domain knowledge in the form of RDF triples, assigning GUIDs to the RDF metadata triples and mapping the GUIDs over the bitmap index; (2) defining the semantic operators and constructing the concept and relationship hierarchies; (3) providing a mechanism to create and drop a rule-base and to modify the set of rules in a rule-base; (4) enabling the creation and maintenance of the rules-index and synchronizing it after rules are modified in the rule-base.

Figure 5: Proposed Architecture for Relevance Reasoning in Data Integration Systems


4.1.3. Source Descriptions Storage (SDS): A source description is the metadata of a data source. This metadata can be further classified into source metadata and content metadata. In order to make the source description of a data source interoperable in a heterogeneous environment, it is described in a conceptual model in the form of a local ontology [8]. The metadata of a data source is expressed as RDF triples in the local ontology. These RDF triples are assigned local unique identifiers (LUIDs) using a sequence-generating object of each data source. In a nutshell, the source descriptions storage is a set of local ontologies.

4.1.4. Source Registration Service: The source registration service facilitates the creation and maintenance of a local ontology for a data source in the source descriptions storage. It provides a set of application program interfaces (APIs) to perform the following functions: (1) creating a unique sequence-number-generating object for the incoming data source; (2) creating a local ontology to hold the RDF triples advertised by the data source; (3) registering the local ontology in the source descriptions storage; (4) inserting the RDF triples of the data source into its corresponding local ontology.
4.1.5. Bitmap Index Storage: A bitmap index is a cross-tab structure of bits [26, 28]. We employ a bitmap index for efficient traversal during relevance reasoning. The bitmap index is divided into bitmap segments; internally, the data in a bitmap segment is represented in the form of bits. Each data source retains one bitmap segment over the bitmap index. In the proposed architecture, data sources are represented on the vertical side of the index, whereas the RDF triples of the global ontology are represented on the horizontal side. A bit is unset (i.e., 0) if the data source does not contain the corresponding RDF triple, and set (i.e., 1) if it does. A sequence-number-generating object is used to assign a unique identifier to each bitmap segment.
4.1.6. Index Management Service: The index management service facilitates the creation and maintenance of a bitmap segment for a data source in the bitmap index storage. It provides a set of application program interfaces (APIs) to perform the following functions: (1) bitmap segment creation creates the bitmap segment for an incoming data source and initializes all bits of the segment to 0 (unset); (2) bitmap synchronization keeps the bitmap segment of a data source consistent with its local ontology; (3) shuffle bit shuffles the bits of a bitmap segment during synchronization.

4.1.7. Index Lookup Service: The index lookup service facilitates efficient traversal of the bitmap index. It provides a set of application program interfaces (APIs) to perform the following functions: (1) relevant source identification traverses the bitmap index against an RDF triple and identifies the bitmap segments where the bit is set; (2) irrelevant source pruning traverses the bitmap index against an RDF triple and identifies the irrelevant bitmap segments where the bit is unset.
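A hedged sketch of the two lookup operations in Python (our own illustration; the function names are hypothetical): given the bitmap position reserved for a query triple's GUID, the relevant sources are the segments with that bit set, and all remaining segments are pruned.

def relevant_sources(segments: dict[str, int], triple_pos: int) -> set[str]:
    """Bitmap segments (source -> bit vector) whose bit for the triple is set."""
    return {src for src, bits in segments.items() if bits >> triple_pos & 1}


def prune_irrelevant(segments: dict[str, int], triple_pos: int) -> set[str]:
    """Complement of relevant_sources: segments whose bit is unset."""
    return set(segments) - relevant_sources(segments, triple_pos)


# Three registered sources; the query triple occupies bit position 2.
segments = {"S1": 0b10110, "S2": 0b00001, "S3": 0b00100}
print(relevant_sources(segments, 2))  # -> {'S1', 'S3'}
print(prune_irrelevant(segments, 2))  # -> {'S2'}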
4.1.8. Ontology Reasoning Service: The ontology reasoning service provides the reasoning and inference capabilities of the proposed architecture. It provides a set of application program interfaces (APIs) to perform the following functions: (1) semantic matching is the process of finding semantic similarity among different terms (concepts and relationships) in order to resolve the semantic heterogeneities; (2) inference and reasoning supports the semantic matching process by incorporating the rules, rule-base, and rules-index; (3) semantic query generation generates queries against the global ontology using the semantic operators during semantic matching. Note that these queries are different from the user query and should not be confused with it.

4.1.9. Relevance Reasoning Service: The relevance reasoning service identifies relevant and effective data sources for a query from the bitmap index using the index lookup service. It provides a set of application program interfaces (APIs) to perform the following functions: (1) semantic query expansion expands a user query to semantically relevant RDF triples; (2) relevance reasoning identifies relevant and effective data sources for a given user's query; (3) relevance ranking ranks the data sources for a given user query based on the semantic similarity score obtained.
4.2. Semantic Matching & Source Ranking for RDF Triples

4.2.1. Relevance Levels and Proposed Scoring Strategy: During semantic matching, the terms of the user's query triples are matched with the terms of the source triples. As a result, one of five relevance levels can be obtained for each term. These relevance levels are given numeric scores for the purpose of quantification, which helps us rank a source for a given query. The following are the definitions and explanations of the relevance levels and the operators used in the semantic matching process.
4.2.1.1. Exact Matching: A term is an exact match of another term if and only if both are lexically equal to each other. For example, the term nust:Instructor is an exact match of niit:Instructor. A numeric score of 1.0 is assigned to exact matching terms wherever they appear in an RDF triple.

4.2.1.2. Synonym Matching: It is unrealistic to assume that the same name will be used for a concept throughout a domain, so an explicit specification of synonyms using some operator is required. Synonyms are terms that are lexically different but have the same meaning. For example, the term nust:Instructor is a synonym of the term niit:Teacher. A numeric score of 0.8 is assigned to synonym matching terms wherever they appear in an RDF triple. We use the owl:sameAs operator for specifying these mappings in the rule-base of the global ontology.

4.2.1.3. Subclass Matching: In some scenarios, taxonomies might be used for knowledge representation, where generic concepts subsume specific concepts. In order to cope with the subsumption relationship, an operator is required for explicit specification. A term is a subclass of another term if and only if it is subsumed by that term. For example, nust:Employee might subsume niit:Instructor. A numeric score of 0.6 is assigned to subclass matching terms wherever they appear in an RDF triple. We use the rdfs:subClassOf operator for specifying these mappings in the rule-base of the global ontology.

4.2.1.4. Degree of Likelihood: In some situations, data sources might contain concepts that are not totally disjoint or different, but rather related to some other term with some degree of likelihood. For example, the term nust:Instructor might be relevant to nust:TeacherAssistant with some degree of likelihood. This type of mapping cannot be specified using the previously defined operators. A numeric score of 0.5 is assigned to likelihood-based similar terms wherever they appear in an RDF triple. We use the owl:equivalentOf operator for specifying these mappings in the rule-base of the global ontology.

4.2.1.5. Disjoint: A term is disjoint from another term if and only if they are different from each other. For example, the term nust:Instructor is disjoint from nust:Student. A numeric score of 0.0 is assigned to disjoint terms wherever they appear in any component of an RDF triple. These relevance levels and their scoring strategies are summarized in Table 1 below:


Table 1: Relevance levels and scoring strategy

Level    Operator        Score
1        exactMatch      1.0
2        sameAs          0.8
3        subClassOf      0.6
4        equivalentOf    0.5
5        disjointFrom    0.0

4.2.2. Term Similarity: We use the same semantic matching strategy for both concepts and relationships. We maintain a concept hierarchy and a relationship hierarchy; terms include both concepts and relationships. We extract the relationship between the query and source terms using their respective hierarchies and then assign the standard relevance score defined in Table 1. An RDF triple contains a subject, a predicate, and an object. The subject and object are considered concepts, so their similarity is computed using the concept hierarchy, whereas the relationship hierarchy is used to calculate the predicate similarity.

4.2.3. RDF Triple Similarity: To calculate the relevance between the user query and the source RDF triples, we combine both aspects of term similarity (i.e., concepts and relationships). The overall RDF triple similarity can be calculated as shown in Equation 1:

sim(q_T, s) = \sum_{i} \sum_{j} sim_t(qt_j, st_{ij})    (1)

where q_T denotes the query triple and s denotes the source triples; qt and st are the query and source terms, respectively, that are to be matched; and sim(q_T, s) is the overall similarity of a single query triple for a given source. Here i and j range over the source RDF triples and the query triple terms, respectively.
4.2.4. Source Ranking: The user query and the source RDF triples are matched to find the similarity of each query triple with the data source triples. Once the RDF triple similarity has been computed, the source score for the whole query is computed using the formula given in Equation 2. Based on the score obtained for a query, the data sources are ranked.

sim_{src} = \prod_{i=0}^{n} sim(q_i, s)    (2)

In the above equation, sim_{src} is the total score of a source s for a user query, obtained by multiplying the similarity scores of all the query triples; q_i denotes the query triples and n denotes the total number of query triples.
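To tie Table 1 and Equations 1 and 2 together, the following simplified Python sketch (our own; it looks term relations up in a flat dictionary instead of the concept and relationship hierarchies) scores one hypothetical source against a one-triple query.

# Scores from Table 1, keyed by the semantic operator relating two terms.
SCORES = {"exactMatch": 1.0, "sameAs": 0.8, "subClassOf": 0.6,
          "equivalentOf": 0.5, "disjointFrom": 0.0}


def term_similarity(q_term, s_term, relations):
    """Relevance score of a source term for a query term (Table 1)."""
    if q_term == s_term:
        return SCORES["exactMatch"]
    return SCORES.get(relations.get((q_term, s_term)), 0.0)


def triple_similarity(q_triple, s_triples, relations):
    """Equation 1: sum term scores over source triples and query terms."""
    return sum(term_similarity(qt, st, relations)
               for s_triple in s_triples
               for qt, st in zip(q_triple, s_triple))


def source_score(query, s_triples, relations):
    """Equation 2: multiply the per-triple scores to rank the source."""
    score = 1.0
    for q_triple in query:
        score *= triple_similarity(q_triple, s_triples, relations)
    return score


# Hypothetical relations drawn from the examples of Section 4.2.1.
relations = {("Instructor", "Teacher"): "sameAs", ("Course", "Subject"): "sameAs"}
query = [("Instructor", "isTeaching", "Course")]
source = [("Teacher", "isTeaching", "Subject")]
print(source_score(query, source, relations))  # 0.8 + 1.0 + 0.8 = 2.6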
4.3. Proposed Semantic Matching Methodology

This section discusses our proposed methodology for relevance reasoning to identify the most relevant and effective data sources using a bitmap index. The proposed methodology can be divided into three main workflows, which help to explain the intricacies of the proposed architecture. Below is a detailed discussion of each workflow.
4.3.1. Ontology Management Workflow: The ontology management workflow manages the global ontology in the architecture. The ontology management service plays a prominent part in this workflow. The major activities carried out by the ontology management workflow include:

- Domain knowledge representation
- Concept & relationship hierarchy representation
- Rules & rules-base management
- Rules-index management

Figure 6 shows all the activities that are performed during the ontology management workflow using a sequence diagram.

Figure 6: Sequence Diagram for Ontology Management Workflow


Domain knowledge representation is the registration of the RDF triples of the global ontology. These RDF triples are stored in the global ontology, and GUIDs are assigned to them using a unique sequence number generator object. The GUIDs are allocated positions over the bitmap index, and the transactions are permanently recorded to the global ontology. The snippet in Figure 7 shows pseudo-code for the insertion of an RDF triple into the global ontology; its implementation issues and details are discussed in the following chapter.

Pseudo-Code for Domain Knowledge Registration
For each RDF triple of the global ontology:
    Assign a GUID to the RDF triple
    Add the RDF triple to the global ontology
    Extend the bitmap index:
        Increase the length of the bitmap pattern by one
        Assign the reserved location over the bitmap index to the RDF triple
Commit to apply the changes persistently to the global ontology


Figure 7: Pseudo-code for RDF triple registration of the global ontology
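For readers who prefer running code, here is a hedged Python rendering of Figure 7 (our sketch only; the actual implementation lives inside the Oracle database, and all names here are hypothetical): each global-ontology triple receives a GUID from a sequence counter, and the shared bitmap pattern grows by one position reserved for that triple.

import itertools

guid_counter = itertools.count(1)  # stand-in for the sequence generator object
global_ontology = {}               # GUID -> RDF triple
triple_position = {}               # GUID -> reserved bit position in the pattern
bitmap_pattern_length = 0          # one bit position per registered triple


def register_domain_triple(triple):
    """Register one RDF triple of the global ontology (cf. Figure 7)."""
    global bitmap_pattern_length
    guid = next(guid_counter)                      # assign a GUID
    global_ontology[guid] = triple                 # add the triple to the ontology
    triple_position[guid] = bitmap_pattern_length  # reserve a bitmap location
    bitmap_pattern_length += 1                     # extend the pattern by one
    return guid


guid = register_domain_triple(("Instructor", "isTeaching", "Course"))
print(guid, triple_position[guid])  # -> 1 0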

Concept & relationship hierarchy representation involves the definition of the semantic operators and the use of these operators to build the respective hierarchies. These operators include sameAs, equivalentOf, subClassOf, and disjointFrom, as explained in the previous section. RDF triples are added to the global ontology to represent the concept and relationship hierarchies. The bitmap index is not maintained for these RDF triples.

Rules & rules-base management involves the creation of the rules-base and the insertion of rules into it. In order to reduce the mappings among the hierarchies and increase the inference capabilities of the rule-base, two rules are inserted for each semantic operator: InverseOf<operator> and TransitiveOf<operator>. The InverseOf<operator> rule tells the rule-base that if a term A is related to another term B with relation R, then B is related to A using R⁻¹. Figure 8 shows the N3 representation of the InverseOf rule for the sameAs operator in the semantic web rule language.

: Def-InverseOfSameAs@swrl((?x sameAs ?y) -> (?y sameAs ?x))

Figure 8 InverseOf SameAs rule inserted in the rule-base

The TransitiveOf<operator> rule tells the rule-base that if a term A is related to another term B with some relation R, and the same term B is further related to another term C using the relation R, it implies that the term A is related to term C using the same relation R. Figure 9 shows the N3 representation of the TransitiveOf rule for the sameAs operator in the Semantic Web Rule Language.

: Def-TransitiveOfSameAs@swrl((?x sameAs ?y) (?y sameAs ?z) -> (?x sameAs ?z))

Figure 9 TransitiveOf SameAs rule inserted in the rule-base

Rules-index management involves the creation and management of the rules-index
for a rules-base. Once the rules are inserted into the rules-base, the corresponding rules-index is refreshed to pre-compute RDF triples.
4.3.2. Source Registration Workflow: The source registration workflow registers the data sources in the data integration system. Three major activities carried out by this workflow include:

- Local ontology creation
- Bitmap segment creation
- Bitmap synchronization

Local ontology creation involves creating a local ontology and a unique sequence number generator object for an incoming data source, along with the insertion of RDF triples into the created ontology. The source registration service plays a prominent part in local ontology creation. An ontology is created for the incoming data source and is registered with the source descriptions storage. The RDF triples advertised by the data source are assigned local unique identifiers (LUIDs) and are added to the local ontology. Transactions are permanently recorded to the source descriptions storage. The snippet in Figure 10 shows pseudo-code for local ontology creation and RDF triple insertion; its implementation details and issues are discussed in the following chapter.

Pseudo-Code for Local Ontology Creation
Create ontology for incoming source in Source Descriptions Storage
Create unique sequence generator for incoming source RDF triples
For each RDF triple advertised by the source
    Assign LUID to the RDF triple
    Add RDF triple to the local ontology in Source Descriptions Storage
Perform commit to apply changes persistently to Source Descriptions Storage

Figure 10 Pseudo-code for RDF triple creation of local ontology
Bitmap segment creation involves cloning the bitmap pattern and creating a bitmap segment for the incoming data source over the bitmap index. The index management service plays a prominent role in bitmap segment creation. The bitmap pattern is stored with the global ontology and is cloned for the newly created bitmap segment. Initially all the bits are unset, i.e., 0. A unique identifier is assigned to the bitmap segment, which is then added to the bitmap index. The snippet in Figure 11 shows pseudo-code for bitmap segment creation; its implementation details and issues are discussed in the following chapter.

Pseudo-Code for Bitmap Segment Creation
Check whether bitmap segment exists for the incoming source
If (no)
    Clone bitmap pattern from global ontology RDF triples
    Initialize bits to zero (0)
    Assign a unique number to the bitmap segment
    Add bitmap segment to the bitmap index for incoming source
    Perform commit to apply changes persistently in index

Figure 11 Pseudo-Code for Bitmap Segment Creation
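A minimal PL/SQL sketch of bitmap segment creation follows, assuming the BITMAP_INDX table and s_bitmap_segment_id sequence of Chapter 5, and assuming the bitmap pattern is held as a VARCHAR2 string of '0'/'1' characters, one per metadata triple of the global ontology.

Sketch: Creating a Bitmap Segment for an Incoming Source
DECLARE
  v_len NUMBER;
BEGIN
  -- clone the pattern length from the global ontology's metadata triples
  SELECT COUNT(*) INTO v_len FROM global_rdf_data WHERE triple_typ = 'M';
  INSERT INTO bitmap_indx (segment_id, segment_source, bitmap_pattern)
  VALUES (s_bitmap_segment_id.NEXTVAL,
          'http://www.eme.edu.pk',   -- URI of the incoming data source
          RPAD('0', v_len, '0'));    -- all bits initially unset
  COMMIT;
END;
/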

Bitmap synchronization involves plotting the RDF triples of a data source consistently and correctly by shuffling the bits in its bitmap segment. The index management service plays a prominent role by spawning a listener process that listens for any invalidation (changes in a local ontology that have not yet been propagated and plotted over the bitmap index) in the source descriptions storage. If any invalidation is found, it starts index synchronization. During synchronization, the RDF triples of the data source are fetched. Every RDF triple is decomposed into its terms (subject, predicate, and object) and given to the ontology reasoning service. The ontology reasoning service performs reasoning and inference that helps the index management service to extract GUIDs for the corresponding RDF triple. The positions of the GUIDs are identified over the bitmap index and the bits are shuffled accordingly. The snippet in Figure 12 shows pseudo-code for bitmap synchronization; its implementation details and issues are discussed in the following chapter.
Pseudo-Code for Bitmap Synchronization
For each incoming RDF triple advertised by a data source
    Decompose RDF triple into its components
    Perform reasoning for semantic similarity
    Extract GUID for the corresponding RDF triple
    Identify its position over the bitmap index
    Fetch the bitmap segment for the data source
    Shuffle the bit to 1 at the corresponding position in the bitmap segment
Perform commit to apply changes persistently in index

Figure 12 Pseudo-Code for Bitmap Synchronization
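Under the same assumed string representation of a bitmap segment, the bit shuffle of Figure 12 reduces to simple string surgery; the following procedure is a sketch under those assumptions, not the thesis's exact implementation.

Sketch: Setting a Bit in a Source's Bitmap Segment
CREATE OR REPLACE PROCEDURE set_segment_bit (
  p_source_uri IN VARCHAR2,   -- source whose segment is updated
  p_pos        IN NUMBER      -- 1-based bitmap position of the triple's GUID
) IS
BEGIN
  UPDATE bitmap_indx
  SET bitmap_pattern = SUBSTR(bitmap_pattern, 1, p_pos - 1) || '1' ||
                       SUBSTR(bitmap_pattern, p_pos + 1)
  WHERE segment_source = p_source_uri;
  COMMIT;  -- apply changes persistently in the index
END set_segment_bit;
/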

Figure 13 shows all the activities that are performed during the source registration workflow using a sequence diagram.

Figure 13: Sequence Diagram for Source Registration Workflow
4.3.3. Relevance Reasoning Workflow: The relevance reasoning workflow includes the steps that are carried out to identify the relevant and effective data sources for the user's query. The relevance reasoning service plays a prominent part in this workflow. It cooperates with the index lookup service and the ontology reasoning service during relevance reasoning to perform the following activities:

- Semantic query expansion
- Source selection
- Source ranking

Figure 14 shows all the activities that are performed during the relevance reasoning workflow using a sequence diagram.

Figure 14: Sequence Diagram for Relevance Reasoning Workflow
Semantic query expansion: A user submits the query in RDF, which is passed to the relevance reasoning service. The RDF triples entered by the user in a query are called asserted query triples. A user can submit queries in global ontology terms as well as in the local ontology terms of the underlying data sources. The relevance reasoning service expands the user query to all possible combinations using the ontology reasoning service. Every term of the query triple is expanded using the semantic operators for synonyms, lexical variants, subsumption, and degree of likelihood. This expansion results in the addition of some extra triples to the user query; these RDF triples are called inferred query triples. The snippet in Figure 15 shows pseudo-code for semantic query expansion; its implementation details and issues are discussed in the following chapter.

Pseudo-code for Query Expansion in Relevance Reasoning
InferredTriplesList = {}
For each RDF triple in AssertedTripleList of user's query
    Isolate subject, object, and property of current RDF triple
    Calculate semantic similarity and add relevant terms for the subject of RDF triple
    Calculate semantic similarity and add relevant terms for the property of RDF triple
    Calculate semantic similarity and add relevant terms for the object of RDF triple
    Take Cartesian product of terms
    Populate InferredTriplesList with the Cartesian product
Return InferredTriplesList

Figure 15 Pseudo-Code for Query Expansion in Relevance Reasoning Workflow
Source Selection: Once the query has been expanded with semantically relevant RDF triples, the GUIDs are reconciled from the global ontology. GUIDs help to find the positions of the RDF triples over the bitmap index. These positions are passed to the index lookup service, which traverses the bitmap segment of each source at the corresponding positions and identifies the data sources for which the bits are set. The snippet in Figure 16 shows pseudo-code for source selection; its implementation details and issues are discussed in the following chapter.

Pseudo-code for Source Selection in Relevance Reasoning
RelevantSourceList = {}
For each RDF triple in user's query [Asserted + Inferred]
    Reconcile GUID for incoming RDF triple from global ontology
    Identify bitmap location of the RDF triple using GUID
    Pass bitmap location to index lookup service
    Traverse bitmap segments at corresponding location to identify relevant sources
    Add sources to RelevantSourceList
Return RelevantSourceList

Figure 16 Pseudo-Code for Source Selection in Relevance Reasoning Workflow
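With the same assumed VARCHAR2 representation of the bitmap segments, the lookup itself is a single SQL statement; a sketch:

Sketch: Identifying Sources Whose Bit Is Set at a Position
SELECT segment_source
FROM   bitmap_indx
WHERE  SUBSTR(bitmap_pattern, :guid_pos, 1) = '1';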


Source Ranking: The identified data sources are ranked according to their relevance to the user query. Table 1 shows our scoring scheme. Initially, term similarity is computed for each component of a query RDF triple in a given source. Once term similarity has been computed, it is used in equation 1 to compute RDF triple similarity. Finally, source similarity is computed by equation 2, and the sources are ranked according to the score obtained for the given user query.
4.4. Explanation of Proposed Methodology using a Case Study

We use a portion of the well-known university ontology as an example. In the scenario, we have a global ontology named NUST_DB, as shown in Figure 17, and three data sources named EME_DB, NIMS_DB, and NIIT_DB. The RDF triples of the global ontology are shown in Table 2.
[Figure 17 depicts the global ontology: the concepts Instructor, Student, Course, Department, and TeachingAssistant, connected by the relationships isTeaching, isAdvisorOf, isRegisteredIn, hasMajor, worksIn, and isAssisting.]

Figure 17 Snapshot of the Global Ontology
Table 2 RDF triples of the Global Ontology

NUST_RDF_DATA
GUID            RDF Triple
nust-1000001    < nust:Instructor, nust:isTeaching, nust:Course >
nust-1000002    < nust:Instructor, nust:isAdvisorOf, nust:Student >
nust-1000003    < nust:Student, nust:isRegisteredIn, nust:Course >
nust-1000004    < nust:Student, nust:hasMajor, nust:Department >
nust-1000005    < nust:Instructor, nust:worksIn, nust:Department >
nust-1000006    < nust:TeachingAssistant, nust:isAssisting, nust:Course >
The RDF triples of the global ontology form the basis for the bitmap indexing in our proposed architecture. The pattern of the index can be illustrated as shown in Table 3.

Table 3 Structure of Bitmap Index

Source-segment    position-1      position-2      position-3      position-4      position-5      position-6
(Bitmap Pattern)  nust-1000001    nust-1000002    nust-1000003    nust-1000004    nust-1000005    nust-1000006
In order to manage the concept and relationship hierarchies, the semantic matching operators sameAs, equivalentOf, subClassOf, and disjointFrom are defined. A concept like nust:Instructor is mapped to the concept niit:Lecturer using the subClassOf operator in order to specify subsumption relationships. The term nust:Course is mapped to the term nust:Subject using the sameAs operator in order to specify synonyms and lexical variants. Similarly, nust:Instructor is mapped to nust:TeachingAssistant using the equivalentOf operator in order to specify a degree of likelihood, and so on. Relationship hierarchies are managed accordingly. These hierarchies are illustrated in Figure 18.
[Figure 18 depicts the managed hierarchies: in the concept hierarchy, Professor, Lecturer, and Teacher are linked to Instructor via subClassOf, Prof is linked to Professor via sameAs, Course is linked to Subject via sameAs, and TeachingAssistant also appears in the hierarchy; in the relationship hierarchy, isTeaching is linked to Teaching and Teaches via sameAs, alongside isAssisting.]

Figure 18 Concept & Relationship Hierarchies Managed using Semantic Operators over Global Ontology
Three local ontologies are created for the data sources, with a naming convention like <DataSource>_RDF_Data. There are semantic heterogeneities between the contents of the data sources. Table 4 describes the RDF triples of the sources stored in their respective ontologies.
Table 4 RDF triples of the data sources

EME_RDF_DATA
Local Link-ID    RDF Triple
eme-1011         < eme:Professor, eme:Teaches, eme:Subject >
eme-1012         < eme:Professor, eme:Advises, eme:Student >
eme-1013         < eme:Student, eme:RegisteredIn, eme:Subject >

NIMS_RDF_DATA
Local Link-ID    RDF Triple
nims-2011        < nims:Teacher, nims:isAdvisorOf, nims:Student >
nims-2012        < nims:Teacher, nims:WorksIn, nims:Department >
nims-2013        < nims:Student, nims:hasMajor, nims:Department >

NIIT_RDF_DATA
Local Link-ID    RDF Triple
niit-3011        < niit:Lecturer, niit:isTeaching, niit:Course >
niit-3012        < niit:TeachingAssistant, niit:isAssisting, niit:Course >
The prefixes nust, niit, eme, and nims refer to the URLs http://www.nust.edu.pk, http://www.niit.edu.pk, http://www.eme.edu.pk, and http://www.nims.edu.pk respectively. Once the local ontologies have been created, the index management service comes into play: it creates the bitmap segments in the bitmap index for the data sources and plots (synchronizes) the RDF triples of the data sources in their respective bitmap segments. During synchronization, the index management service also resolves the semantic heterogeneities. The structure of the bitmap index is illustrated in Table 5.
Table 5 Structure of Bitmap Index after sources are registered

Source-segment    nust-1000001    nust-1000002    nust-1000003    nust-1000004    nust-1000005    nust-1000006
EME-DB            1               1               1               0               0               0
NIMS-DB           0               1               0               1               1               0
NIIT-DB           1               0               1               0               0               1
Suppose a user query contains the RDF triple <Instructor isTeaching Course>. The relevance reasoning service decomposes this triple into its terms and creates three buckets: one for the subject, one for the property, and one for the object. Each term is given to the ontology reasoning service to calculate its semantic similarity in the respective hierarchy and find relevant terms. The buckets are populated as shown in Table 6.
Table 6 Buckets created for the terms of the query RDF triple

Semantic Operator Used   Subject Bucket for Instructor        Property Bucket for isTeaching   Object Bucket for Course
                         (Terms Deduced)                      (Terms Deduced)                  (Terms Deduced)
exactMatch               Instructor                           isTeaching                       Course
sameAs                   NULL                                 Teaching, Teaches                Subject
subClassOf               Professor, Prof, Lecturer, Teacher   NULL                             NULL
equivalentOf             TeachingAssistant                    isAssisting                      NULL
The Cartesian product of the subject, property, and object buckets is taken to construct the inferred triple list. Table 7 shows their Cartesian product.

Table 7 Inferred RDF triples for a user's query triple (expansion of the RDF triple using the ontology reasoning service)

<Instructor, isTeaching, Course>
<Instructor, Teaching, Course>
<Instructor, Teaches, Course>
<Instructor, isAssisting, Course>
...
<Instructor, isAssisting, Subject>
<Professor, isTeaching, Course>
<Professor, Teaching, Course>
<Professor, Teaches, Course>
<Professor, isAssisting, Subject>
<Prof, isTeaching, Course>
<Prof, Teaching, Course>
<Prof, isAssisting, Subject>
<Lecturer, isTeaching, Course>
<Lecturer, Teaching, Course>
<Lecturer, Teaches, Course>
<Teacher, Teaching, Course>
<Teacher, Teaches, Course>
<Teacher, isAssisting, Subject>
<TeachingAssistant, isTeaching, Course>
<TeachingAssistant, Teaching, Course>
<TeachingAssistant, Teaches, Course>
<TeachingAssistant, isAssisting, Course>
<TeachingAssistant, isAssisting, Subject>
In order to execute a query over the bitmap index, GUIDs are needed. An RDF triple is rejected if no GUID is available for it in the global ontology. In this example, the GUIDs nust-1000001 and nust-1000006 are fetched from the global ontology. These GUIDs are passed to the index lookup service to identify relevant and effective data sources. The index lookup service traverses the bitmap index for only these GUIDs and returns all bitmap segments where the bits are set, i.e., EME-DB and NIIT-DB.
In order to sort the data sources based on their relevance to the query triples, semantic similarity scoring is incorporated as shown in Table 1. First, term similarity is computed for the query triples against the data source triples using the concept and relationship hierarchies.
EME-DB scores 0.6 for matching the subject of the query triple, Instructor, with the subject of the source triple, Professor; the concept hierarchy returns a subClassOf relationship between these terms. Next, the properties of the query and source triples are matched, scoring 0.8 for isTeaching and Teaches because they are connected by a sameAs relationship. Finally, the objects of the query and source triples are matched, scoring 0.8 for Course and Subject.
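Multiplying these term scores as in equation 1 gives the triple similarity 0.6 x 0.8 x 0.8 = 0.384, the value listed in Table 8.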
NIIT-DB scores 0.6 for matching the subject of the query triple, Instructor, with the subject of the source triple, Lecturer; the concept hierarchy returns a subClassOf relationship for this match. The data source scores 1 for matching the source property isTeaching with the query property isTeaching, and 1 for matching the objects Course and Course. NIIT-DB also contains a triple that is relevant to the query triple with some degree of likelihood, i.e., nust-1000006.
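The corresponding triple similarities are 0.6 x 1 x 1 = 0.6 for nust-1000001 and 0.5 x 0.5 x 1 = 0.25 for nust-1000006, as listed in Table 8.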
The relevance of a data source for every query triple is calculated by putting the term similarity scores into equation 1, as shown in Table 8.
Table 8: Semantic Similarity Calculation of a Data Source for a User Query Triple

Relevant       GUIDs           sim        sim          sim        Source Similarity for
Data Source                    (subject)  (property)   (object)   Query Triple (qT)
EME-DB         nust-1000001    0.6        0.8          0.8        0.384
NIIT-DB        nust-1000001    0.6        1            1          0.6
               nust-1000006    0.5        0.5          1          0.25
Finally, the overall similarity score of a data source for a user's query is calculated using equation 2, as shown in Table 9. The sorted sources are given to the query rewriting component.

Table 9: Semantic Similarity Calculation of a Data Source for a User Query

Relevant       GUIDs           Source Similarity for Query Triple (qT)
Data Source
EME-DB         nust-1000001:   0.384
               Total Source Similarity for User Query (simEME) = 0.384
NIIT-DB        nust-1000001:   0.6
               nust-1000006:   0.25
               Total Source Similarity for User Query (simNIIT) = 0.85
In a nutshell, we have explained our proposed architecture of relevance reasoning for source selection in data integration. The different workflows have been highlighted, and the semantic matching methodology has been explained using a case study.
CHAPTER 5

IMPLEMENTATION
This chapter discusses our implementation strategy and issues for the proposed architecture. The first section discusses in detail the Oracle implementation of the ontologies and RDF data. The second section discusses the implementation details of our proposed architecture for relevance reasoning.

5.1. RDF Data / Ontologies in Oracle Database
In Oracle Database 10g Release 2, a new data model has been developed for storing RDF and OWL data. This functionality builds on the Oracle Spatial Network Data Model (NDM), the Oracle solution for managing graphs within the Oracle Database. The RDF data model supports three types of database objects: model or ontology (an RDF graph consisting of a set of triples), rule-base (a set of rules), and rule index (an entailed RDF graph).
5.1.1. RDF Data Model or Ontology: There is a single universe for all RDF data stored
in the database. All RDF triples are parsed and stored in the system under the MDSYS
schema as shown in Figure 19. An RDF triple (subject, predicate, and object) is treated as
one database object. A single RDF document that contains multiple triples, therefore,
results in many database objects.
RDF_MODEL$ is a system-level table created to store information about all of the RDF and OWL ontologies in a database. Whenever a new ontology is created, a new MODEL_ID is automatically generated for it and an entry is made into the RDF_MODEL$ table.


The RDF_NODE$ table stores the VALUE_ID for text values that participate in
subjects or objects of statements. The NODE_ID is the same as the VALUE_ID.
NODE_ID values are stored once, regardless of the number of subjects or objects they
participate in. The node table allows RDF data to be exposed to all of the analytical
functions and APIs available in the core NDM.
The RDF_LINK$ table stores the triples for all of the RDF models in the database; the MODEL_ID logically partitions this table. Selecting all of the links for a specified MODEL_ID returns the RDF network for that particular ontology.
The RDF_VALUE$ table stores the text values, i.e. the Uniform Resource Identifiers
or literals for each part of the triple. Each text value is stored only once, and a unique
VALUE_ID is generated for the text entry. URIs, blank nodes, plain literals and typed
literals are all possible VALUE_TYPE entries.

Figure 19 Database Schema to store ontology in Oracle NDM


Blank nodes are used to represent unknown objects, and when the relationship between a subject node and an object node is n-ary. New blank nodes are automatically generated whenever blank nodes are encountered in triples. However, it is possible for users to re-use blank nodes, for example when inserting data into containers or collections. The RDF_BLANK_NODE$ table stores the original names of blank nodes that are to be reused when encountered in triples.
To represent a reified statement, a resource is created using the LINK_ID of the triple. The resource can then be used as the subject or object of a statement. To process a reification statement, a triple is first entered with the reified statement's resource as subject, rdf:type as property, and rdf:Statement as object. A triple is then entered for each assertion about the reified statement. However, each reified statement will have only one rdf:type to rdf:Statement associated with it, regardless of the number of assertions made using this resource.
The Oracle RDF data model supports containers and collections. A container or collection has an rdf:type to rdf:container_name or rdf:collection_name associated with it, and a LINK_TYPE of RDF_MEMBER.
Two object types have been defined for RDF-modeled data: SDO_RDF_TRIPLE serves as the triple representation of RDF data, whilst SDO_RDF_TRIPLE_S is defined to store persistent data in the database. The GET_RDF_TRIPLE() function can be used to return an SDO_RDF_TRIPLE type.
5.1.2. Rule-base: Oracle supplies both an RDF rule-base that implements the RDF entailment rules, and an RDF Schema (RDFS) rule-base that implements the RDFS entailment rules. Both rule-bases are automatically created when RDF support is added to the database. It is also possible to create a user-defined rule-base for additional specialized inference capabilities. For each rule-base, a system table is created to hold the rules in the rule-base, along with a system view of the rule-base. The view is used to insert, delete, and modify rules in the rule-base. Information about all rule-bases is maintained in the rule-base information view.
For example, the rule that the head of a department (HoD) is also a faculty member of that department could be represented as follows (inserted through the rule-base's view; the rule-base name univ_rb is assumed here for illustration):

INSERT INTO mdsys.rdfr_univ_rb VALUES(
  'HeadOfDepartRule',                  -- rule name
  '(?p :HoDOf ?d)',                    -- IF side pattern
  NULL,                                -- filter condition
  '(?p :FacultyMemberOf ?d)',          -- THEN side pattern
  SDO_RDF_Aliases(SDO_RDF_Alias('', 'http://www.seecs.edu.pk/univontology/')));
In this case the rule does not have a filter condition, so that component of the representation is NULL. Note that a THEN side pattern with more than one triple can be used to infer multiple triples for each IF side match.
5.1.3. Rules Index: A rules index is an object containing pre-computed triples that can be inferred from applying a specified set of rule-bases to a specified set of ontologies. If a graph query refers to any rule-bases, a rules index must exist for each rule-base and ontology combination in the query.
When a rules index is created, a view of the RDF triples associated with the index is also created under the MDSYS schema. This view is visible only to the owner of the rules index and to users with suitable privileges. Information about all rules indexes is maintained in the rules index information view. Information about all database objects related to rules indexes, such as ontologies and rule-bases, is maintained in the rules index datasets view.

5.1.4. Querying RDF Data: The SDO_RDF_MATCH table function has been designed to meet most of the requirements identified by W3C in SPARQL for graph querying. A Java API is also provided for network representation and network analysis. Analysis capabilities include the ability to find a path between two resources, or to find a path between two resources when the links are of a specified type.
Use of the SDO_RDF_MATCH table function allows a graph query to be embedded in a SQL query. It has the ability to search for an arbitrary pattern against the RDF data, including inference based on RDF, RDFS, and user-defined rules. It can automatically resolve multiple representations of the same point in value space (e.g., "10"^^xsd:integer and "10"^^xsd:positiveInteger).
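As an illustration, the following query (a sketch re-using the global ontology, rule-base, and namespace alias created later in this chapter) retrieves every subject and object connected by :isTeaching, with RDFS and user-defined rules applied:

SELECT t.x, t.c
FROM TABLE(SDO_RDF_MATCH(
       '(?x :isTeaching ?c)',                             -- graph pattern
       SDO_RDF_Models('global_ontology'),                 -- model(s) to search
       SDO_RDF_Rulebases('RDFS', 'global_ontology_rb'),   -- entailment rules
       SDO_RDF_Aliases(SDO_RDF_Alias('',
         'http://www.niit.edu.pk/Research/Delsa/')),      -- default namespace
       NULL)) t;                                          -- no filter condition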
5.2. Setting up the Stage for Implementation

The implementation of the different components of the architecture is discussed in the following subsections.
5.2.1. Enabling and Disabling RDF Support in the Database: Before using RDF support in an Oracle database, we need to enable this feature. The CREATE_RDF_NETWORK() procedure of the SDO_RDF package is used to enable RDF support in the database. This procedure creates the system tables and other database objects used for RDF support. One must connect to the database as a user with DBA privileges in order to call this procedure, and should call it only once for the database. To remove RDF support from the database, call the SDO_RDF.DROP_RDF_NETWORK procedure. The following example enables RDF support in the database.
Enabling the Semantic Network
BEGIN
SDO_RDF.CREATE_RDF_NETWORK('rdf_tblspace');
END;


5.2.2. Creating the Global Ontology: The table used to store the RDF triples of the global ontology is shown below. The name of the table is GLOBAL_RDF_DATA.

Column Name   Data type           Description
GUID          NUMBER              GUID assigned to an incoming RDF triple of the global ontology.
TRIPLE        SDO_RDF_TRIPLE_S    Stores the subject, predicate, and object of the RDF triple.
TRIPLE_TYP    VARCHAR2            Distinguishes whether the RDF triple is a rule-base (R) or metadata (M) triple.
BIT_POS       NUMBER              If the RDF triple type is M, stores the position of the GUID over the bitmap index.
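A possible DDL for this table is sketched below; the column sizes are assumptions.

CREATE TABLE global_rdf_data (
  guid        NUMBER PRIMARY KEY,   -- GUID of the triple
  triple      SDO_RDF_TRIPLE_S,     -- subject, predicate, and object
  triple_typ  VARCHAR2(1),          -- 'R' rule-base or 'M' metadata triple
  bit_pos     NUMBER                -- bitmap position (metadata triples only)
);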

A unique sequence generating object is used to assign GUIDs to the incoming RDF
triples. The example below shows the creation of the sequence generator object.
Creating the Sequence Generator for GUIDs
CREATE SEQUENCE s_global_rdf_data_id
START WITH 1000
INCREMENT BY 1
NOCACHE
ORDER;

Once the global ontology table has been created, we then create the global ontology
using the CREATE_RDF_MODEL() procedure of the SDO_RDF package. The example
below creates the global ontology.
Creating the Global Ontology
BEGIN
SDO_RDF.CREATE_RDF_MODEL('global_ontology', 'global_rdf_data', 'triple');
END;

This procedure adds the global ontology to the MDSYS.RDF_MODEL$ table. To delete an ontology, use the SDO_RDF.DROP_RDF_MODEL procedure.
5.2.3. Creating the Bitmap Index: The table used to store the bitmap segments is shown below. The name of the table is BITMAP_INDX.

Column Name      Data type   Description
SEGMENT_ID       NUMBER      A unique identifier assigned to the bitmap segment created for an incoming data source.
SEGMENT_SOURCE   URI         Stores the URI of the data source.
BITMAP_PATTERN   VARCHAR2    Stores the bits that represent the RDF triples of a data source.
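A possible DDL for this table is sketched below; representing the source URI as a VARCHAR2 column, and the column sizes, are assumptions.

CREATE TABLE bitmap_indx (
  segment_id      NUMBER PRIMARY KEY,   -- unique segment identifier
  segment_source  VARCHAR2(256),        -- URI of the data source
  bitmap_pattern  VARCHAR2(4000)        -- one '0'/'1' character per global metadata triple
);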

A unique sequence generating object is created to assign segment identifiers to newly created bitmap segments. The example below shows the creation of the sequence generator object.
Creating Sequence Generator for Bitmap Segments
CREATE SEQUENCE s_bitmap_segment_id
START WITH 1000
INCREMENT BY 1
NOCACHE
ORDER;

5.2.4. Defining Semantic Operators and Creating Hierarchies: The semantic operators exactMatch, sameAs, equivalentOf, and subClassOf have also been defined over the global ontology. The following example shows the SQL to define the sameAs operator; the same syntax is used to define the other operators.

Defining the sameAs operator
INSERT INTO global_rdf_data (guid, triple)
VALUES(s_global_rdf_data_id.NEXTVAL,
  SDO_RDF_TRIPLE_S('global_ontology',
    'http://www.niit.edu.pk/Research/Delsa/sameAs',
    'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
    'http://www.w3.org/1999/02/22-rdf-syntax-ns#Property'));
Once the semantic operators have been defined, they are used to manage the concept and relationship hierarchies. The code in the following example links the concept Course with Subject using the sameAs operator to represent synonyms.

Managing Hierarchies
INSERT INTO global_rdf_data (guid, triple)
VALUES(s_global_rdf_data_id.NEXTVAL,
  SDO_RDF_TRIPLE_S('global_ontology',
    'http://www.niit.edu.pk/Research/Delsa/Course',
    'http://www.niit.edu.pk/Research/Delsa/sameAs',
    'http://www.niit.edu.pk/Research/Delsa/Subject'));
5.2.5. Creating Rules, the Rule-base, and the Rules Index: In order to create a user-defined rule-base, the CREATE_RULEBASE() procedure of the SDO_RDF_INFERENCE package is used. The following example creates a rule-base for the global ontology with the name global_ontology_rb.
Creating Global Ontology Rulebase
BEGIN
SDO_RDF_INFERENCE.CREATE_RULEBASE('global_ontology_rb');
END;

After creating the rule-base, rules can be added to it. To cause the rules in the rule-base to be applied in a query of RDF data, one can specify the rule-base in the call to the SDO_RDF_MATCH table function. Inverse and transitive rules have been inserted for each semantic operator. The following examples show the implementation of these rules for the sameAs operator.
Inverse Rule for sameAs Operator
INSERT INTO mdsys.rdfr_global_ontology_rb
VALUES('InverseOfSameAs',
'(?x :sameAs ?y)', NULL,
'(?y :sameAs ?x)',
SDO_RDF_ALIASES(SDO_RDF_ALIAS('','http://www.niit.edu.pk/Research/Delsa/')));

Transitive Rule for sameAs Operator
INSERT INTO mdsys.rdfr_global_ontology_rb
VALUES('TransitiveOfSameAs',
'(?x :sameAs ?y) (?y :sameAs ?z)', NULL,
'(?x :sameAs ?z)',
SDO_RDF_ALIASES(SDO_RDF_ALIAS('','http://www.niit.edu.pk/Research/Delsa/')));
Whenever rules are inserted, updated, or deleted from the rule-base, the rules index must be refreshed. The following example creates the rules index for the global ontology rule-base.
Rules Index Creation
BEGIN
SDO_RDF_INFERENCE.CREATE_RULES_INDEX (
'rdfs_rix_global_ontology',
SDO_RDF_Models('global_ontology'),
sdo_rdf_rulebases('RDFS','global_ontology_rb'));
END;


5.3. Implementation of the Proposed Architecture for Relevance Reasoning

Figure 20 shows the package diagram of the proposed architecture for relevance reasoning in a scalable data integration system. The remainder of this section discusses the functionality provided by each of these packages along with a brief description.

Figure 20 Package Diagram of the Proposed Architecture for Relevance Reasoning
5.3.1. PACKAGE Source_Registration_Service: This package manages local ontologies for the incoming data sources. It provides two procedures for this purpose.
5.3.1.1. REGISTER_SOURCE(): This procedure accepts the name along with the contents of an incoming data source and creates the local ontology for it in the source descriptions storage.
Parameter Name      Data type        Description
p_incoming_source   VARCHAR2         Name of the incoming data source. This name must be unique.
p_list_of_triples   TRIPLE_TAB_TYP   List of triples expressing the contents and capabilities of the incoming data source.
5.3.1.2. UNREGISTER_SOURCE(): This procedure accepts the name of the data source and deletes its local ontology from the source descriptions storage.

Parameter Name      Data type   Description
p_deleting_source   VARCHAR2    Name of the data source to be deleted. This name must be unique.
5.3.2. PACKAGE Ontology_Management_Service: This package manages the global ontology. It provides three main subprograms to perform various tasks.
5.3.2.1. REGISTER_GLOBAL_TRIPLE(): This procedure helps in publishing domain knowledge in terms of RDF triples. It assigns a GUID to the incoming triple, reserves its position on the bitmap index, and adds it to the global ontology.

Parameter Name           Data type        Description
p_incoming_triple        SDO_RDF_TRIPLE   RDF triple describing the domain knowledge.
p_incoming_triple_type   VARCHAR2         Type of the RDF triple.
5.3.2.2. RECONCILE_GUID(): This function returns the GUID for the specified RDF triple. It interacts with the ontology reasoning service to semantically expand the RDF triple and identify its GUID.

Parameter Name      Data type        Description
p_incoming_triple   SDO_RDF_TRIPLE   RDF triple for which the GUID has to be identified.
5.3.2.3. IDENTIFY_BITMAP_POSITION(): This function accepts a GUID and returns the bitmap position for the specified RDF triple.

Parameter Name           Data type   Description
p_incoming_triple_GUID   NUMBER      GUID of the RDF triple for which the bitmap position has to be identified.
5.3.3. PACKAGE Index_Management_Service: This package helps in the management of the bitmap index in the proposed architecture. Following are the three main procedures of this package.
5.3.3.1. MANAGE_BITMAP_PATTERN(): This procedure manages the bitmap pattern for the index whenever domain knowledge is published in terms of RDF triples.

Parameter Name           Data type   Description
p_incoming_triple_GUID   NUMBER      GUID of the RDF triple that has to be published in the global ontology.
5.3.3.2. CONSTRUCT_BITMAP_SEGMENT(): This procedure helps in the construction of the bitmap segment for an incoming data source. It assigns a unique identifier to each bitmap segment. Initially, all bits in the bitmap pattern are initialized to 0.

Parameter Name      Data type   Description
p_incoming_source   VARCHAR2    URI of the incoming data source for which the bitmap segment has to be created.
5.3.3.3. SYNCH_BITMAP_SEGMENT(): This procedure helps in the synchronization of the local ontology RDF triples with the bitmap segment of a specified data source. It shuffles the bits according to the RDF triples of the local ontology.

Parameter Name     Data type   Description
p_source_segment   VARCHAR2    Unique identifier assigned to the bitmap segment of the data source.
GUID_POS           NUMBER      Position of the bit on the bitmap segment that needs to be shuffled.
BIT_STATE          VARCHAR2    SET means 1, and UNSET means 0.
5.3.4. PACKAGE Index_Lookup_Service: This package traverses the bitmap segments in the index for a specified RDF triple. It contains the single function shown below.
5.3.4.1. TRAVERSE_BITMAP_SEGMENT(): This function accepts a position and traverses the bitmap index at that position to identify the bitmap segments where the bit is set.

Parameter Name   Data type   Description
GUID_POS         NUMBER      Position of the bit on the bitmap segment that needs to be traversed.
5.3.5. PACKAGE Ontology_Reasoning_Service: This package helps the architecture to perform ontological inferencing and calculate the semantic similarity among different terms. It contains the following functions.
5.3.5.1. GENERATE_SEMANTIC_QUERY(): This function provides the simple semantic searching behaviour of the proposed architecture. It accepts a term (concept or relationship) and formulates a semantic query that checks for synonyms, lexical variants, and subclass operators in their respective hierarchies over the global ontology.

Parameter Name    Data type   Description
p_incoming_term   VARCHAR2    Term for which the simple semantic query has to be generated.
5.3.5.2. GENERATE_SEMANTIC_QUERY_DOL(): This function extends the simple semantic searching behaviour of the proposed architecture and formulates a semantic query that additionally checks for terms that are relevant with some degree of likelihood.

Parameter Name    Data type   Description
p_incoming_term   VARCHAR2    Term for which the extended semantic query has to be generated.
5.3.5.3. FETCH_RELEVANT_TERMS(): This function executes the query generated by the GENERATE_SEMANTIC_QUERY() function and returns a list of relevant terms for the term being reasoned.

Parameter Name    Data type   Description
p_incoming_term   VARCHAR2    Term for which semantic similarity has to be computed.

5.3.5.4. FETCH_RELEVANT_TERMS_DOL(): This function executes the query generated by the GENERATE_SEMANTIC_QUERY_DOL() function and returns a list of relevant terms for the term being reasoned.
5.3.6. PACKAGE Relevance_Reasoning_Service: This package accepts the RDF triples of a user query and identifies the most effective and relevant data sources.
5.3.6.1. IDENTIFY_RELEVANT_SOURCES(): This function interacts with the ontology reasoning service and draws inferences from it to expand the query triples. It also interacts with the index lookup service to identify the most effective and relevant data sources for the inferred RDF triples.

Parameter Name        Data type   Description
p_incoming_subject    VARCHAR2    Subject of the query RDF triple.
p_incoming_property   VARCHAR2    Property of the query RDF triple.
p_incoming_object     VARCHAR2    Object of the query RDF triple.
5.3.6.2. IDENTIFY_RELEVANT_SOURCES_DOL(): This function interacts with the ontology reasoning service and draws inferences based on the degree of likelihood to expand the query triples. It also interacts with the index lookup service to identify the most effective data sources that are relevant with a certain degree of likelihood.
5.3.6.3. RANK_RELEVANT_SOURCE(): This function ranks the selected data sources based on the score obtained for the user's query.

Parameter Name      Data type   Description
p_incoming_source   VARCHAR2    Relevant data sources that are to be ranked.
p_ranking_order     VARCHAR2    DESC/ASC means descending/ascending.

In this chapter we have presented the Oracle implementation of the ontologies and RDF data, and discussed in detail the design and implementation of the proposed architecture along with the related issues.
CHAPTER 6

RESULTS AND EVALUATION


In this chapter we evaluate the results of the developed prototype system discussed in Chapter 5. We identify the main evaluation criteria, the details of the data set, the query structure, the system specification, and the results of the experiments carried out with the system.
6.1. System Specification

System Processor    Pentium-IV 2.4 GHz
RAM                 1 GB
HDD                 80 GB
Operating System    Windows 2003 (with Service Pack 2)
Tool                Oracle Spatial 10g Release 2 NDM
Language            PL/SQL
6.2. Evaluation Criteria

The main aim of this evaluation is to validate whether the proposed architecture for relevance reasoning can scale up to a large number of data sources and complex queries. In order to quantitatively measure the performance of the relevance reasoning, different evaluation measures have been used, which are discussed in the subsequent sections. The evaluation criteria for evaluating our system are listed below:
6.2.1. Response Time of Query Execution: to ensure that the manipulation of RDF triples does not degrade query response time during relevance reasoning as the number of sources in the system increases.
6.2.2. Accuracy of the Relevant Source Selection: to ensure that the provision of semantics does not affect the accuracy of the proposed methodology, checked by calculating the precision and recall of the system for relevance reasoning. Precision can be defined as the ratio of relevant data sources to the number of retrieved data sources [41]:

    Precision = |relevant sources ∩ retrieved sources| / |retrieved sources|

whereas recall can be defined as the proportion of relevant data sources that are retrieved [41]:

    Recall = |relevant sources ∩ retrieved sources| / |relevant sources|
6.3. Data Specification

The experiment has been carried out with a corpus of 100 manually generated data sources. Each data source contains 30-50 RDF triples. The well-known university ontology has been used in the experiment as the domain ontology [1, 42].
6.4. Test Queries

We executed 35 different queries related to students, faculty, and research associates. We performed the accuracy test of the proposed architecture over these test queries, comparatively analyzing our system against the MiniCon algorithm [1] by observing the precision and recall of both systems. Among these 35 queries, we selected 3 queries, having 3, 6, and 9 RDF triples respectively, to test the system efficiency by measuring query response time. These queries are given below.

Query 1: Find the names of all instructors who are teaching a course to the same students whom they advise.

RDF Pattern of Query 1
(?instructor :isTeaching :Course) (?student :isRegisteredIn :Course) (?instructor :isAdvisorOf ?student).
Query 2: Find the instructor name, gender, and area of specialization of all instructors, whether they are staff or students.

RDF Pattern of Query 2
(?instructor :hasName ?name) (?instructor :hasGender ?gender) (?instructor :hasArea ?area)
UNION
(?student :isAssisting :Course) (?student :hasGender ?gender) (?student :hasMajor ?depart).
Query 3: Find the instructor name, gender, and area of specialization of all instructors, whether they are staff or students, where the student's major department is not the same as the advisor's working department.

RDF Pattern of Query 3
((?instructor :hasName ?name) (?instructor :hasGender ?gender) (?instructor :hasArea ?area)
UNION
(?student :isAssisting ?Course) (?student :hasGender ?gender) (?student :hasMajor ?depart))
MINUS
(?instructor :isAdvisorOf ?student) (?student :hasMajor ?depart) (?instructor :hasWorkingDepart ?depart)
6.5. Experiments for Response Time of Query Execution

In the experiment for evaluating performance, we evaluated the query response time of the system along three dimensions. First, queries were executed directly against the local ontologies of the data sources in the source descriptions storage, and we assessed the time taken by the relevance reasoner to traverse the local ontologies for relevant source selection. Second, since our proposed methodology employs a bitmap index in which source descriptions are mapped semantically into bitmap segments as bits, we submitted the queries to the relevance reasoner using the bitmap index and assessed the time taken. Finally, we extended the bitmap index, implemented function-based indexing over it, and analyzed the performance of the system. Figures 21, 22, and 23 illustrate the performance of the system for the 3 queries shown in the preceding section.
Figure 21 Time Complexity of System (Query with 3 Triples)

Figure 22 Time Complexity of System (Query with 6 Triples)


Figure 23 Time Complexity of System (Query with 9 Triples)

The observations showed a comparative performance gain when running queries against source descriptions with the bitmap index rather than running them directly against the source descriptions, while a significant performance gain over both previously discussed approaches was observed when searching for relevant sources using the extended bitmap index. Figure 24 shows the performance gain of the extended bitmap index compared with the simple bitmap index.

Figure 24 Performance gain of the system with respect to direct ontology traversal

6.6. Experiments for System Accuracy

In the experiment for evaluating the accuracy of the system, we calculated the precision and recall of our proposed methodology and made a comparison with the MiniCon algorithm [1]. As the MiniCon algorithm directly traverses the source descriptions, we did not implement it as such; rather, we used the same approach to develop code that traverses the local ontologies. As our proposed semantic matching process also searches for synonyms, lexical variants, subclasses, and degree of likelihood, the comparison showed an increase in both precision and recall with respect to the MiniCon algorithm.

Figure 25: Precision vs. Recall comparison of the proposed methodology with the MiniCon algorithm

We have evaluated the results of the developed prototype system in this chapter. Different evaluation criteria were identified for the system evaluation, and the results of the prototype system were compared with existing systems. The comparison showed that the system has better query response time and accuracy of source selection compared to the existing systems.
CHAPTER 7

CONCLUSION AND FUTURE DIRECTIONS


In this chapter we conclude the research thesis. It provides an analysis of results and
future directions where the thesis work can be extended. The chapter is of vital
importance because it provides a birds eye-view of the methodology and gives future
directions for new researchers.
7.1. Discussion

An exponential growth in online data sources, due to advancements in information and communication technologies (ICT), requires semantically-enabled, robust, and scalable data integration. Keeping in view the cited objectives, we have proposed an ontology-driven relevance reasoning architecture that identifies the most effective and relevant data sources for a user's query before executing it. In our proposed methodology, we plotted the local ontologies of the data sources over a bitmap index. Instead of traversing the local ontologies during relevance reasoning, we use the bitmap index to perform the relevance reasoning.
The proposed methodology has three workflows: (1) the ontology management workflow, (2) the source registration workflow, and (3) the relevance reasoning workflow. This division helps in understanding the functionality of the various components of the methodology along with their inter-dependence. The ontology management workflow and the source registration workflow set the stage for relevance reasoning in the proposed architecture.
The ontology management workflow publishes the domain knowledge in the form of RDF in the global ontology. It creates the concept and relationship hierarchies using the semantic operators. It also creates the rule-base to define rules, and manages the rules index to perform inference and reasoning during the semantic matching process. The source registration workflow manages the local ontologies of the data sources in the source descriptions storage. As new sources enter and leave the system, the index management service synchronizes the bitmap index to reflect the new status of the source descriptions storage. In order to answer queries precisely, the bitmap index needs to be kept synchronized with the source descriptions storage.
The relevance reasoning workflow takes the user's query, formulated in RDF triples, and identifies the most effective and relevant data sources for the given query. During relevance reasoning, queries are expanded using the inferences drawn from the ontology reasoning service. The workflow calculates the semantic similarity between the query and source RDF triples and identifies the relevant and effective data sources. The relevant data sources are ranked based on the similarity score they obtain for the user query. The sorted list of relevant and effective data sources is returned to the query rewriting component, which reformulates the queries for these relevant data sources.
7.2. Contributions of the Project

The first contribution of the proposed methodology is that it provides for semantic interoperability during the process of relevance reasoning. Semantic operators are introduced to sort out fine-grained heterogeneities among the contents of different data sources; the system checks for exact matches, lexical variants, synonyms, subclasses, and degree of likelihood during semantic matching. The ontology, rule-bases, and rules indexes are used for semantic matching and inference during relevance reasoning. The accuracy tests of the system showed improved precision and recall compared with the MiniCon algorithm [1].
The second contribution of the proposed methodology is the provision for optimization during relevance reasoning with the help of a bitmap index. Previously, the community used bitmap indexes for managing bulk data in the warehouses of relational models; we instead use a bitmap index to represent RDF models. The bitmap index is used during relevance reasoning and improves the whole process by traversing the plotted RDF data efficiently. The time complexity tests showed that bitmap indexing performs the relevance reasoning in a comparatively shorter time.
7.3. Future Directions

Currently our focus is on centralized bitmap indexing in data integration systems, where a single global ontology resides on some node and queries are reformulated over it. As P2P DBMSs are evolving and data integration is becoming popular in these domains, this methodology can in future be extended to meet the requirements of P2P data integration: index partitions may reside on each peer, and collectively they will all participate in relevance reasoning during query processing.
REFERENCES

[1] Alon Halevy, Anand Rajaraman, Joann Ordille. Data Integration: The Teenage Years. Proceedings of the 32nd International Conference on VLDB, pages 9-16, September 2006.

[2] Yaser A. Bishr. Overcoming the Semantic and Other Barriers to GIS Interoperability. International Journal of Geographical Information Science, 12(4):229-314, 1998.

[3] Thomas R. Gruber and Gregory R. Olsen. An Ontology for Engineering Mathematics. Proceedings of the 4th International Conference on Principles of Knowledge Representation and Reasoning (KR 1994), pages 258-269, 1994.

[4] Tom R. Gruber. A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, pages 199-220, 1993.

[5] Natalya F. Noy. Semantic Integration: A Survey of Ontology-Based Approaches. SIGMOD Record, Vol. 33, pages 65-70, December 2004.

[6] Isabel F. Cruz and H. Xiao. The Role of Ontologies in Data Integration. Journal of Engineering Intelligent Systems, pages 245-252, December 2005.

[7] M. Jamadhvaja, Twittie Senivgee. An Integration of Data Sources with UML Class Models Based on Ontological Analysis. Pages 1-8, November 2005, ACM, Bremen, Germany.

[8] S. Khan and F. Marvon. Identifying Relevant Sources in Query Reformulation. In Proceedings of the 8th International Conference on Information Integration and Web-based Applications & Services (iiWAS2006), Yogyakarta, Indonesia, December 2006.

[9] H. Wache, T. Vogele, et al. Ontology-Based Integration of Information: A Survey of Existing Approaches. In the Seventeenth International Joint Conference on Artificial Intelligence, Seattle, Washington, USA, 2001.

[10] Y. Arens, C.N. Hsu, et al. Query Processing in the SIMS Information Mediator. In Readings in Agents, Morgan Kaufmann Publishers Inc., pages 82-90, San Francisco, USA, 1997.

[11] E. Mena, A. Illarmendi. OBSERVER: An Approach for Query Processing in Global Information Systems Based on Interoperation across Pre-existing Ontologies. IEEE, pages 19-21, 1996.

[12] F. Naumann, U. Leser, and J.C. Freytag. Quality-driven Integration of Heterogeneous Information Systems. Proceedings of the 25th International Conference on VLDB, pages 447-458, Scotland, September 1999.

[13] Isabel F. Cruz, Huiyong Xiao, and Feihong Hsu. An Ontology-based Framework for Semantic Interoperability between XML Sources. In Proceedings of the 8th International Database Engineering and Applications Symposium (IDEAS), pages 217-226, July 2004. IEEE Computer Society, 2004.

[14] Nicola Guarino. Formal Ontology and Information Systems. In Proceedings of the 1st International Conference on Formal Ontologies in Information Systems (FOIS 1998), pages 3-15, 1998.

[15] Alon Y. Halevy. Answering Queries Using Views: A Survey. The VLDB Journal, pages 270-294, 2001.

[16] Alon Y. Halevy, Anand Rajaraman, Joann J. Ordille. Querying Heterogeneous Information Sources Using Source Descriptions. In Proceedings of the International Conference on Very Large Databases (VLDB), 1996.

[17] Rachel Pottinger and Alon Halevy. MiniCon: A Scalable Algorithm for Answering Queries Using Views. The VLDB Journal, 2001.

[18] G. Wiederhold. Mediators in the Architecture of Future Information Systems. IEEE Computer, pages 38-49, March 1992.

[19] J. Zhong, H. Zhu, et al. Conceptual Graph Matching for Semantic Search. In Proceedings of the 10th International Conference on Conceptual Structures (ICCS), LNCS 2393, pages 92-106, Bulgaria, July 2002. Springer.

[20] A.H. Levy. Why Your Data Won't Mix: Semantic Heterogeneity. ACM Queue 3, pages 50-58, 2005.

[21] RDF Primer. W3C Recommendation, 10 February 2004, http://www.w3c.org/RDF/

[22] Waris Ali, Sharifullah Khan. Global Query Generation over Diverse Data Sources Using Ontology. In 1st International Conference on Information and Communication Technologies, June 2007, Bannu, N.W.F.P, Pakistan.

[23] Nicole Alexander, Siva Ravada. RDF Object Type and Reification in the Database. In Proceedings of the 22nd International Conference on Data Engineering (ICDE 2006). IEEE Computer Society, 2006.

[24] R. Smith, T. Connolly. Data Integration Service. Book chapter in Information Management in Large Scale Enterprises, 3rd Edition.

[25] Mediator-Wrapper, http://www.objs.com/survey/wrap.htm

[26] S. Khan, F. Movan. Scalable Integration of Biomedical Sources. In Proceedings of the 8th International Conference on Information Integration and Web-based Applications & Services (iiWAS2006), Yogyakarta, Indonesia, December 2006.

[27] Jacob Kohler, Stephan Philippi, Michael Specht, Alexander Ruegg. Ontology-based Text Indexing and Querying for the Semantic Web. Knowledge-Based Systems 19 (2006), pages 744-754.

[28] X. Li, F. Bian, H. Zhang, C. Diot, R. Govindan, G. Iannaccone. MIND: A Distributed Multi-Dimensional Indexing System for Network Monitoring. IEEE Infocom 2006, Barcelona, April 2006.

[29] XML Vocabulary Description Language 1.1: XML Schema, W3C Recommendation, May 2001, http://www.w3.org/XML/Schema

[30] The DARPA Agent Markup Language Home Page, August 2000, http://www.daml.org/

[31] Web Ontology Language, W3C Recommendation, 06 September 2007, http://www.w3.org/2004/OWL/

[32] B-Tree and Bitmap Indexing. Oracle Developer Guide 10g Release 2, Part No. A969505-01, Oracle Corporation, March 2002.

[33] Jena: A Semantic Web Framework for Java, http://jena.sourceforge.net/

[34] Kowari Metastore for OWL and RDF Metadata, http://www.kowari.org/

[35] Jose Kahan, Marja-Riitta Koivunen, Eric Prud'hommeaux, Ralph R. Swick. Annotea: An Open RDF Infrastructure for Shared Web Annotations. Proceedings of the 10th International WWW Conference, Hong Kong, May 2001.

[36] Longwell: A Web-based RDF Browser, http://simile.mit.edu/wiki/Longwell

[37] Oracle Semantic Technologies Network, Spatial Technology Using Network Data Model, http://www.oracle.com/technology/tech/semantic_technlogies/index.html

[38] P. Mitra. Algorithms for Answering Queries Efficiently Using Views. Technical Report, Infolab, Stanford University, September 1999.

[39] F.N. Afrati, C. Li, and J.D. Ullman. Generating Efficient Plans for Queries Using Views. In ACM SIGMOD International Conference on Management of Data, Santa Barbara, CA, May 2001.

[40] E.I. Chong, S. Das, G. Eadon, J. Srinivasan. An Efficient SQL-based RDF Querying Scheme. Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005.

[41] Giannis Varelas, Epimenidis Voutsakis, Paraskevi Raftopoulou. Semantic Similarity Methods in WordNet and their Application to Information Retrieval on the Web. 7th ACM International Workshop on Web Information and Data Management, November 2005.