By
XYZ
Supervisor
Dr
A thesis submitted in partial fulfillment of
the requirements for the degree of
Masters in Information Technology
In
(July 2008)
APPROVAL
It is certified that the contents and form of the thesis submitted have been found
satisfactory for the requirements of the degree.
Advisor: __________________
TO MY PARENTS,
BROTHER AND SISTERS
CERTIFICATE OF ORIGINALITY
I hereby declare that this submission is my own work and to the best of my knowledge it
contains no materials previously published or written by another person, nor material which to a
substantial extent has been accepted for the award of any degree or diploma at BZU or at any
other educational institute, except where due acknowledgement has been made in the thesis. Any
contribution made to the research by others, with whom I have worked at BZU or elsewhere, is
explicitly acknowledged in the thesis.
I also declare that the intellectual content of this thesis is the product of my own work, except for
the assistance of others in the project's design and conception or in style, presentation, and
linguistics, which has been acknowledged.
Author Name:
Signature: ______________
ACKNOWLEDGEMENTS
First of all, I am extremely thankful to Almighty Allah for giving me the courage and strength to
complete this challenging task and to compete with the international research community. I am
also grateful to my family, especially my parents, who have supported and encouraged me
through their prayers, which have always been with me.
I am highly thankful to for his valuable suggestions and continuous guidance throughout my
research work. His foresightedness and critical analysis of things taught me a lot about valuable
research which will be more helpful to me in my practical life.
I would like to offer my gratitude to all the members of the research group and my close
colleagues who have been encouraging me throughout my research work especially Mr Maruf
Pasha.
TABLE OF CONTENTS

List of Figures

CHAPTER 1: INTRODUCTION
1.1. Motivation
1.2. Problem Definition
1.3.
1.4. Outline of Thesis

CHAPTER 2: BACKGROUND STUDIES
2.1. Data Integration
2.2. Semantic Heterogeneity
2.3. Data Integration Approaches
2.4. Query Processing
2.5. Ontology
2.6. Indexing

CHAPTER 3: LITERATURE SURVEY
3.1. Query Reformulation
3.2. Relevance Reasoning Algorithms
3.3. Critical Analysis

CHAPTER 4: PROPOSED ARCHITECTURE
4.1. Components of the Proposed Architecture
4.2. Semantic Matching Process and Scoring Strategy
4.3.
4.4.

CHAPTER 5: IMPLEMENTATION
5.1.
5.2.
5.3.

CHAPTER 6: RESULTS AND EVALUATION
6.1. System Specification
6.2. Evaluation Criteria
6.3. Data Specification
6.4. Test Queries
6.5.
6.6.

CHAPTER 7: CONCLUSION AND FUTURE DIRECTIONS
7.1. Discussion
7.2.
7.3. Future Direction

REFERENCES
LIST OF FIGURES

Figure 1: Data Warehousing Architecture for Data Integration
Figure 2: Mediator Wrapper Architecture for Data Integration
Figure 3: RDF Triple as Directed Graph
Figure 4: Structure of a Bitmap Index
Figure 5: Proposed Architecture for Relevance Reasoning in Data Integration Systems
Figure 6: Sequence Diagram for Ontology Management Workflow
Figure 7: Pseudo-code for RDF Triple Registration of Global Ontology
Figure 8: InverseOf SameAs Rule Inserted in the Rule-base
Figure 9: TransitiveOf SameAs Rule Inserted in the Rule-base
Figure 10: Pseudo-code for RDF Triple Creation of Local Ontology
Figure 11: Pseudo-code for Bitmap Segment Creation
Figure 12: Pseudo-code for Bitmap Synchronization
Figure 13: Sequence Diagram for Source Registration Workflow
Figure 14: Sequence Diagram for Relevance Reasoning Workflow
Figure 15: Pseudo-code for Query Expansion in Relevance Reasoning Workflow
Figure 16: Pseudo-code for Source Selection in Relevance Reasoning Workflow
Figure 17: Snapshot of the Global Ontology
Figure 18: Concept & Relationship Hierarchies Managed Using Semantic Operators over Global Ontology
Figure 19: Database Schema to Store Ontology in Oracle NDM
Figure 20: Package Diagram of the Proposed Architecture for Relevance Reasoning
Figure 21: Time Complexity of System (Query with 3 Triples)
Figure 22: Time Complexity of System (Query with 6 Triples)
Figure 23: Time Complexity of System (Query with 9 Triples)
Figure 24: Performance Gain of the System with Respect to Direct Ontology Traversal
Figure 25: Precision vs. Recall Comparison of the Proposed Methodology with the MiniCon Algorithm
List of Tables

Table 1: Relevance Levels and Scoring Strategy
Table 2: RDF Triples of the Global Ontology
Table 3: Structure of Bitmap Index
Table 4: RDF Triples of the Data Sources
Table 5: Structure of Bitmap Index after Sources Are Registered
Table 6: Buckets Created for the RDF Triples
Table 7: Inferred RDF Triples for a User's Query Triple
Table 8: Semantic Similarity Calculation of a Data Source for a User Query Triple
Table 9: Semantic Similarity Calculation of a Data Source for User Query
LIST OF ABBREVIATIONS

XML     Extensible Markup Language
WWW     World Wide Web
DAML    DARPA Agent Markup Language
OWL     Web Ontology Language
API     Application Program Interface
DIS     Data Integration System
NDM     Network Data Model
RDF     Resource Description Framework
W3C     World Wide Web Consortium
URL     Uniform Resource Locator
ICT     Information and Communication Technology
AI      Artificial Intelligence
UMLS    Unified Medical Language System
IM      Information Manifold
GUID    Global Unique Identifier
LUID    Local Unique Identifier
SDS     Source Descriptions Storage
ABSTRACT
Online data sources are autonomous, heterogeneous, and geographically distributed. Data
sources can join and leave a data integration system arbitrarily. Some sources may not
contribute significantly to a user query because they are not relevant to it. Executing queries
against all the available data sources consumes resources unreasonably, and consequently these
queries become expensive.
Source selection is an approach to resolve this issue. The existing techniques of relevance
reasoning for source selection take significant time in traversing the source descriptions.
Consequently, query response time degrades as the number of available sources grows.
Moreover, a simple matching process is unable to sort out the fine-grained semantic
heterogeneities of data. Semantic heterogeneity of data sources makes relevance reasoning
complex. These issues degrade the performance of data integration systems.
In this research, we have proposed an ontology-driven relevance reasoning architecture
that identifies relevant data sources for a user query before its execution. The proposed
methodology aligns source descriptions (i.e., local ontologies) with the domain ontology through
a bitmap index. Instead of traversing the local ontologies, the methodology utilizes the bitmap
index to perform relevance reasoning in order to improve query response time. Semantic matching has
been employed in relevance reasoning for the provision of semantic interoperability. Semantic
operators, such as exactMatch, sameAs, equivalentOf, subClassOf, and disjointFrom, have been
introduced to sort out fine-grained semantic heterogeneities among data sources. Quantitative
scores are assigned to the operators. Data sources are ranked based on the similarity score
obtained by them.
A prototype system has been designed and implemented to validate the methodology. The
evaluation criteria used are (a) query response time and (b) accuracy of relevant source selection.
The prototype system has been compared with existing systems for evaluation. Query
response time and accuracy of source selection, in terms of precision and recall, have been
improved due to the incorporation of the bitmap index and the ontology respectively.
CHAPTER 1
INTRODUCTION
This chapter introduces the research work that has been undertaken in this thesis. It
includes the motivation and the definition of the problem. Moreover, the objectives and
goals are also discussed.
1.1. Motivation
The exponential growth in data sources on the Internet is due to advancements in
information and communication technology (ICT). This growth of online data sources
requires a scalable data integration system, because the sources are unpredictable due to
their autonomy. In other words, data sources can join and leave the system arbitrarily.
Thus, checking the availability of a data source before executing a query is needed.
Moreover, not all the data sources may have the required information. Executing a query
on all data sources is an expensive solution, because an available source may not
contribute any significant information to the user query result [8, 20, 23]. In order to
execute queries efficiently in these systems, we need to identify relevant and effective
data sources that are available at the time of execution. This research work focuses on
relevance reasoning for identifying relevant and effective data sources in a scalable data
integration system.
1.2. Problem Definition
Identifying relevant sources in a scalable data integration system is a challenging
problem. Executing a query against data sources that are not relevant to it degrades the
performance of the query and leads to unreasonable wastage of the resources of the data
integration system.
1.3.
1.4. Outline of Thesis
The rest of the document is organized as follows: Chapter 2 describes a data integration
system and its various components. RDF is also explained as a language for developing
ontologies, storing source descriptions and semantic mappings. Chapter 3 discusses
various algorithms for relevance reasoning and their critical analysis. Chapter 4
highlights the proposed system architecture, proposed semantic matching process along
with the proposed methodology for relevance reasoning. Chapter 5 gives a complete
overview of the implementation details. Chapter 6 highlights the experimentation and
comparative analysis carried out to validate the proposed architecture. The conducted
experiments are also discussed. Chapter 7 concludes the thesis and defines future research
directions.
CHAPTER 2
BACKGROUND STUDIES
This chapter provides background literature in order to set the context of this
research. Data integration and semantic heterogeneity are discussed. Details of
ontology and its design methodology, and of indexing, are also included.
2.1. Data Integration
Data sources on the Internet are growing exponentially in size and number over time.
These data sources contain information about different topics such as the stock market,
product information, real estate, and entertainment. The data from these sources can be
used for answering complex user queries, which might go beyond traditional searches.
Advancements in information and communication technology have enabled users to
access a wide array of data sources that are related in some way and to integrate the
results to come up with useful information that might not be stored physically in a single
place [1, 8, 12, 24].
Data integration enables the interoperability of data sources for knowledge
discovery through a centralized access point, and provides a uniform query interface that
gives the user the illusion of querying a homogeneous system [2, 15, 19, 31]. In data
integration a user is provided with a unified interface for posing queries, which is based
on a schema typically referred to as the global schema or mediated schema. Depending on
the approach used to develop the data integration system, results obtained from the
underlying data sources are provided to the user either from a centrally materialized
repository or in real time.
2.2. Semantic Heterogeneity
2.2.2.2. Same name: Two different concepts can be represented by the same name, e.g.,
bear can be an animal or a property meaning 'tolerate'.
2.2.2.3. Degree of likelihood: Two concepts can be relevant to each other on the basis of
a degree of likelihood. This does not mean equality of concepts, as with synonyms, but
rather relatedness, e.g., <:Teacher :isTeaching :Course> and <:TeachingAssistant
:isAssisting :Course>; here teaching assistant and teacher are not the same concept but
are relevant to each other with a certain degree of likelihood.
2.3. Data Integration Approaches
Approaches to data integration can be classified into two major categories: (a) data
warehousing and (b) mediation.
2.3.1. Warehouse: In data warehousing, the required data is extracted from the sources
and stored in a centralized repository after integration [19, 24]. Users pose queries against
the data model of the warehouse. This approach is also known as the eager or
materialized view approach to data integration. Query execution is efficient and response
time is predictable in this approach, but the result can often be stale [1]. Figure 1 shows
the data warehousing architecture [24].
2.3.2. Mediation: In the mediation approach, a user is given a unified schema, containing
virtual relations, for posing a query. Data is not loaded into a central repository in
advance; rather, queries are executed at run time [1, 19, 20, 24]. In order to answer a user
query using the information sources, metadata is needed that describes the semantic
relationship between the elements of the mediated schema and the schemas of the
underlying data sources. This metadata is known as a source description. This approach is
also known as the lazy or virtual view approach to data integration. Query execution is
slower in mediation but the result is up to date [1, 21, 24]. Figure 2 depicts the
mediation-based architecture for data integration [24].
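The contrast between the two approaches can be illustrated with a small sketch. The following Python fragment (class and source names are ours, purely illustrative) mimics the lazy, mediator/wrapper style of query answering:

```python
# Hypothetical sketch of the mediator/wrapper (lazy) approach:
# the mediator holds no data; at query time it forwards the user
# query to each wrapper, which answers from its own source.

class Wrapper:
    """Stands in for a wrapper over one remote data source."""
    def __init__(self, name, data):
        self.name = name
        self.data = data          # in-memory stand-in for a remote source

    def answer(self, predicate):
        # A real wrapper would rewrite and ship a source-specific query;
        # here we simply filter in-memory tuples.
        return [row for row in self.data if predicate(row)]

class Mediator:
    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, predicate):
        # Lazy integration: results are fetched at run time, so they
        # are always up to date (at the cost of query latency).
        results = []
        for w in self.wrappers:
            results.extend(w.answer(predicate))
        return results

w1 = Wrapper("stocks", [("AAPL", 190), ("ORCL", 120)])
w2 = Wrapper("funds",  [("ORCL", 121), ("MSFT", 410)])
m = Mediator([w1, w2])
print(m.query(lambda row: row[0] == "ORCL"))  # tuples from both sources
```

A warehouse would instead copy all tuples into one repository up front, trading freshness for predictable response time.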
2.4. Query Processing
A data integration system executes user queries over heterogeneous and distributed data
sources. The ability to efficiently and correctly execute a query over the integrated data
lies at the heart of data integration. The main steps in processing a query in data
integration are (1) query reformulation and (2) query planning and execution.
2.4.1. Query Reformulation: Query reformulation is the first step in query processing,
where a user query, previously written in terms of the mediated schema, is reformulated,
using information about the sources, into queries that refer directly to the schemas of the
underlying data sources [1, 8, 10, 11, 19, 24]. Query reformulation is further divided into
two steps: (a) source identification and (b) query rewriting.
2.4.1.1. Source identification: Before executing a user query, relevant and
effective sources should be clearly identified to optimize query execution. Relevance
reasoning is the process of identifying relevant sources and pruning irrelevant and
redundant data sources. The main focus of our research is to propose an algorithm that
can speed up the process of relevance reasoning.
2.4.1.2. Query rewriting: Once relevant sources have been identified, query
rewriting is performed and source-specific queries are reformulated only for those sources
that have been found relevant and can contribute some result to the user's query.
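The two reformulation steps can be illustrated with a toy sketch. All names and mappings below are hypothetical; the fragment only shows the shape of the process:

```python
# Illustrative sketch (names are ours, not from the thesis) of the two
# query-reformulation steps: (a) identify relevant sources, then
# (b) rewrite the mediated query only for those sources.

MEDIATED_QUERY = {"concept": "Instructor"}

# Toy source descriptions: which mediated concepts each source covers,
# and how the concept is named locally at that source.
SOURCES = {
    "src1": {"Instructor": "teacher"},
    "src2": {"Course": "module"},
    "src3": {"Instructor": "faculty_member"},
}

def identify_sources(query, sources):
    # Step (a): prune sources that cannot contribute to the query.
    return [s for s, mapping in sources.items() if query["concept"] in mapping]

def rewrite(query, sources, relevant):
    # Step (b): produce one source-specific query per relevant source.
    return {s: {"concept": sources[s][query["concept"]]} for s in relevant}

relevant = identify_sources(MEDIATED_QUERY, SOURCES)
print(relevant)                                   # ['src1', 'src3']
print(rewrite(MEDIATED_QUERY, SOURCES, relevant))
```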
2.4.2. Query Planning and Execution: Query reformulation provides some
optimization by pruning irrelevant and overlapping sources to avoid redundant
computation. The reformulated queries are evaluated using different strategies, producing
multiple execution plans during the optimization [11, 12]. The query execution engine
executes these queries using the best and cheapest execution plan and deals with the
limitations and capabilities of the data sources [28]. During execution, an important issue
is to minimize the time to return the first answers to the query rather than minimizing the
total amount of work done to execute the whole query [21, 24].
2.5. Ontology
Ontology is defined as an explicit and formal specification of shared knowledge.
Ontology is widely used in data standardization and integration. The Resource
Description Framework (RDF) is a language, developed by the World Wide Web
Consortium (W3C), for representing information about resources. RDF provides
interoperability across resources due to its simple structure. RDF Schema (RDFS) is a
language for describing vocabularies of RDF data in terms of classes and properties.
[Figure 3: RDF Triple as Directed Graph]
The building blocks of RDF include the following:
- A URI reference identifies a web resource without a specific network address
(http://www.niit.edu.pk/delsa#Instructor).
- A blank node is used when either the subject or object of a triple is unknown or the
relationship between the subject and object is n-ary.
- A literal is a string which is used to represent names, dates, and numbers.
- A typed literal is a string combined with its data type
(e.g. Smith^^http://www.w3.org/2001/XMLSchema#string).
- A container is a resource that is used to describe a group of things. Participants of a
container are members of the group. Blank nodes are usually used to represent
containers.
- Reification allows triples to be attached to other triples as properties. One of the major
issues with reification is its representational complexity; therefore it is sometimes termed
"The Big Ugly".
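For illustration, the RDF constructs listed above can be modeled with plain Python tuples. The URIs and names below are illustrative only; this is a sketch, not an RDF library:

```python
# Minimal sketch of the RDF building blocks described above, using
# plain (subject, predicate, object) tuples. URIs/names are examples.

NS = "http://www.niit.edu.pk/delsa#"

# A triple whose subject, predicate, and object are all URI references.
uri_triple = (NS + "Smith", NS + "isTeaching", NS + "Databases")

# A blank node stands in for an unknown subject (local id only).
blank_triple = ("_:b0", NS + "isAssisting", NS + "Databases")

# A plain literal and a typed literal as objects.
literal_triple = (NS + "Smith", NS + "hasName", "Smith")
typed_triple = (NS + "Smith", NS + "hasName",
                ("Smith", "http://www.w3.org/2001/XMLSchema#string"))

# Reification: a triple about another triple (e.g. its provenance).
reified = (uri_triple, NS + "statedBy", NS + "src1")

graph = [uri_triple, blank_triple, literal_triple, typed_triple]
# Look up every triple whose subject is :Smith.
print([t for t in graph if t[0] == NS + "Smith"])
```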
A variety of RDF storage systems and browsers are available, such as Jena [33],
Kowari [34], Sesame [35], Longwell [36], and the Oracle RDF Data Model [37, 40]. We
have used the Oracle RDF Data Model for managing the global ontology and source
descriptions because it is efficient in terms of storage and does not suffer from slow
performance. It provides a basic infrastructure for effectively managing RDF data in
databases. At the same time, RDF data can be readily integrated, managed, and analyzed
with other enterprise data. A comparative analysis of RDF storage systems [26] showed
that the Oracle RDF Data Model outperforms other existing RDF storage systems.
2.7. Indexing
Databases spend much of their time finding things, so lookups need to be
performed as fast as possible to speed up the searching mechanism. Indexes provide the
basis for both rapid random lookups and efficient ordered access to data. An index is
associated with a search key, that is, one or more attributes of a relation for which the
index provides fast access. The disk space required to store an index is typically less than
that required for the table. Indexes can be primary or secondary. A variety of indexing
techniques are used in modern DBMSs, e.g., hash-based indexing, cluster indexing,
tree-structured indexing, and bitmap indexing. The most efficient and compact indexing
techniques for dealing with bulk data [26, 28] are (a) the B+tree index and (b) the bitmap
index. In this thesis we use bitmap indexes due to their compact internal representation of
bulk data.
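A minimal sketch of how a bitmap index answers lookups with bitwise operations (real DBMS bitmap indexes add compression such as run-length encoding, omitted here):

```python
# Toy bitmap index over one attribute: one bit vector per distinct
# search key; bit i is 1 iff row i carries that key value.

rows = ["X", "G", "X", "T", "G", "X"]

def build_bitmap_index(values):
    index = {}
    for i, v in enumerate(values):
        index.setdefault(v, [0] * len(values))[i] = 1
    return index

index = build_bitmap_index(rows)
print(index["X"])           # [1, 0, 1, 0, 0, 1]

# Lookups become cheap bitwise operations, e.g. "key = X OR key = G":
x_or_g = [a | b for a, b in zip(index["X"], index["G"])]
print(x_or_g)               # [1, 1, 1, 0, 1, 1]
```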
In the proposed methodology, sources advertise their capabilities and contents in the
form of RDF triples to the global ontology. A single source may contain a large number
of RDF triples. Bitmap indexes are very efficient in bulk processing of data manipulation
statements and data loading.
[Figure 4: Structure of a bitmap index — search keys (e.g., A, X, Y, G, T, U, V, Z) on
one axis and bitmap vectors of 0/1 bits on the other.]
In a nutshell, we have discussed different data integration approaches that are widely
used nowadays. Ontology and its modeling languages have been highlighted because
they can help data integration systems cope with the semantic heterogeneities that exist
in the domain of discourse. Finally, indexing has been discussed in general to speed up
the querying mechanism; in particular, bitmap indexing has been explained, which can be
used to traverse Semantic Web metadata efficiently.
CHAPTER 3
LITERATURE SURVEY
Relevant data source selection in query reformulation for data integration systems has
attracted significant attention in the literature over the last few decades [5, 6, 7, 8, 11, 12,
19, 20, 21, 24]. This chapter starts with the discussion and evaluation of state-of-the-art
algorithms used in data integration systems for the identification of relevant data sources
during query reformulation.
3.1. Query Reformulation
In query reformulation, a user's query, previously written in terms of a mediated
schema, needs to be reformulated or rewritten into queries that refer directly to the
schemas of the underlying data sources [10, 11, 19, 24]. In the literature, query
reformulation is further sub-divided into two steps: (a) relevant source selection and
(b) query rewriting.
3.1.1. Relevant source identification: Before executing user queries, relevant and
effective sources should be clearly identified, because not all the available data sources
may contribute significantly. Relevance reasoning is the process of identifying relevant
sources and pruning irrelevant and redundant data sources.
3.1.2. Query rewriting: Once relevant sources have been identified, query rewriting
is performed and source-specific queries are generated only for those sources that have
been found relevant and can contribute some result to the user's query.
3.2. Relevance Reasoning Algorithms
Various algorithms have been proposed to speed up the process of relevance reasoning.
The following section elaborates the state-of-the-art algorithms that are used in different
data integration systems for relevant source selection during query reformulation.
3.2.1. The Bucket Algorithm: This algorithm has been used in the Information
Manifold (IM) [1, 20], a system for browsing and querying multiple networked
information sources. IM provides a mechanism to describe the contents and the
capabilities of data sources in source descriptions (which in our architecture are called
source models). The Bucket algorithm uses source descriptions to create query plans that
can access several information sources to answer a query. The algorithm prunes irrelevant
data sources using the source descriptions and reformulates source-specific queries only
for the relevant data sources. In order to describe and reason about the contents of data
sources, the relational model (augmented with certain object-oriented features) is used in
IM. Technically, the algorithm constructs a number of buckets and checks a user query
against each bucket for the identification of relevant data sources. Once the relevant
buckets for the sources have been identified, source-specific conjunctive queries are
rewritten for each source.
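The bucket-construction phase can be sketched as follows. The view definitions and the encoding are our own toy example, not IM's actual data structures:

```python
# Rough sketch of the Bucket algorithm's first phase: for each subgoal
# of the query, collect the views (sources) whose definitions mention
# the subgoal's relation.

# Each view is described here simply by the relations it exports.
VIEWS = {
    "V1": {"teaches", "course"},
    "V2": {"course"},
    "V3": {"teaches", "enrolled"},
}

def build_buckets(query_subgoals, views):
    buckets = {}
    for subgoal in query_subgoals:
        # A view lands in the bucket if it covers the subgoal's relation.
        buckets[subgoal] = [v for v, rels in views.items() if subgoal in rels]
    return buckets

# Query: q(X, C) :- teaches(X, C), course(C)
buckets = build_buckets(["teaches", "course"], VIEWS)
print(buckets)   # {'teaches': ['V1', 'V3'], 'course': ['V1', 'V2']}

# Phase two (not shown) combines one view per bucket into candidate
# rewritings and checks each candidate for containment in the query.
```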
3.2.2. The Inverse-Rules Algorithm: InfoMaster is an information integration system1
[19] that provides integrated access to multiple, distributed, and heterogeneous
information sources on the Internet. InfoMaster creates a virtual data warehouse. The
algorithm behind InfoMaster is the Inverse-Rules algorithm. The Inverse-Rules
algorithm rewrites the definitions of data sources by constructing a set of rules. A set of
rules is formulated defining the contents and the capabilities of each data source.
During rule construction, heterogeneities among the data sources are dealt with. These
rules guide the algorithm in how to compute records from data sources using source
1 http://infomaster.stanford.edu/
17
definitions. The algorithm dynamically determines an efficient way to answer the user's
query using as few sources as necessary. In simple words, the algorithm does not
reformulate the query; rather, it reformulates the source definitions so that the original
query can be easily answered on the reformulated rules.
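Rule inversion can be illustrated schematically. The encoding below is a simplification of our own: variables that do not appear in the view head are replaced by Skolem terms, and one inverse rule is produced per body atom:

```python
# Schematic illustration of rule inversion: a view definition
#   v(X) :- p(X, Y), q(Y)
# is inverted into one rule per body atom, with existential variables
# (those not exported by the view head) replaced by Skolem terms.

def invert(view_head, body_atoms):
    head_name, head_vars = view_head
    inverse_rules = []
    for name, args in body_atoms:
        new_args = []
        for a in args:
            # Head variables survive; other variables become Skolem terms.
            new_args.append(a if a in head_vars
                            else f"f_{a}({','.join(head_vars)})")
        inverse_rules.append(((name, new_args), view_head))
    return inverse_rules

rules = invert(("v", ["X"]), [("p", ["X", "Y"]), ("q", ["Y"])])
for head, body in rules:
    print(head, ":-", body)
# ('p', ['X', 'f_Y(X)']) :- ('v', ['X'])
# ('q', ['f_Y(X)']) :- ('v', ['X'])
```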
3.2.3. The MiniCon Algorithm: The MiniCon algorithm [19, 21] improved on the
Bucket algorithm. The main focus in developing the MiniCon algorithm was to pay
attention to the performance aspects of query reformulation algorithms. The MiniCon
algorithm finds the maximally contained rewriting of a conjunctive query using a set of
conjunctive views. The Bucket algorithm completes in two steps: computing the buckets,
and then reformulating the source-specific queries using the buckets of those data sources
which are relevant. The main complexities involved in the Bucket algorithm are: (a) even
if the number of sound data sources is small, the Bucket algorithm may generate a large
number of candidate solutions and then reject them; (b) the exponential conjunctive query
containment test that is used to validate each candidate solution. The MiniCon algorithm
pays attention to the interaction of the variables in the user query and in the source
definitions, to prune the sources that would be rejected later in the containment test. This
timely detection of irrelevant data sources improves the performance of the MiniCon
algorithm due to the smaller number of combinations to be checked.
3.2.4. The Shared-Variable-Bucket Algorithm: The design goal of this algorithm [38]
is to overcome the deficiencies of the Bucket algorithm and develop an efficient
algorithm for query reformulation. The key idea underlying this algorithm is to examine
the shared variables and reduce the bucket contents in order to reduce view combinations.
This reduction ultimately optimizes the second phase of the algorithm.
3.2.5. The CoreCover Algorithm: In this algorithm [39], views are materialized from
source relations. The main aim of this algorithm is to find those rewritings which are
guaranteed to produce an optimal physical plan. Its divergence is mostly towards query
optimization; therefore different cost models are also considered in this algorithm. The
algorithm tries to find an equivalent rewriting rather than a contained rewriting.
3.3. Critical Analysis
The CoreCover algorithm [39] is different from the other query reformulation
algorithms discussed above. Most of these algorithms are not well suited to scalable data
integration systems, where sources can join and leave the system arbitrarily and the query
execution engine must synchronize itself with any change and submit sub-queries to the
relevant and available data sources. Another deficiency of these algorithms is that most of
them use relational models for source descriptions, whereas ontology-based models can
represent fine-grained distinctions between the contents and capabilities of the different
data sources. These fine-grained distinctions can help us reason about the data sources in
a more precise and efficient manner.
In a nutshell, we have discussed state-of-the-art algorithms used for query
reformulation in data integration systems. These algorithms have been analyzed and
compared with each other. The features and deficiencies of these algorithms have also
been illustrated.
CHAPTER 4
PROPOSED ARCHITECTURE
In order to execute a user's query in the scalable data integration system proposed in
[8], the query execution process needs to be optimized. We have proposed an
ontology-driven relevance reasoning architecture to improve the response time of user
queries during relevance reasoning. This chapter is organized into three major sections.
In the first section, the components of the proposed relevance reasoning architecture are
discussed. The second section of the chapter explains the semantic matching process and
the proposed scoring strategy. Finally, the proposed methodology for relevance reasoning
is discussed in detail and elaborated through an example.
4.1. Components of the Proposed Architecture
We have proposed an architecture for relevance reasoning and relevant source selection
in a data integration system. The proposed architecture, as shown in Fig. 5, comprises
different components. These are described as follows.
4.1.1. Global Ontology: The global ontology is the knowledge-base of the proposed
architecture. It helps in generating user queries and enabling semantic inference. The
major components of the global ontology are: (1) domain knowledge, which represents
the domain of discourse in the form of RDF triples; each RDF triple is uniquely identified
by a global unique identifier (GUID), and GUIDs are used in the semantic indexing
scheme for relevance reasoning; (2) concept and relationship hierarchies, which represent
semantic relationships among concepts and among relationships respectively; these
hierarchies help in resolving semantic heterogeneities that exist in a domain; (3) the
rule-base; a rule is an object that can be applied to deduce inferences from RDF triples;
every rule is identified by its name and consists of two parts: (a) an antecedent, known as
the body of the rule, and (b) a consequent, known as the head of the rule; the rule-base is
an object that consists of rules; (4) the rules-index, which computes and maintains
deduced inferences by applying a specific set of rule-bases in order to optimize reasoning.
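The antecedent/consequent structure of a rule can be illustrated with a small sketch; the concrete rule (transitivity of subClassOf) and the triples are our own example:

```python
# Sketch of one rule applied to RDF triples, in the spirit of the
# rule-base above. Body (antecedent): (A p B), (B p C); head
# (consequent): (A p C). Iterated to a fixpoint.

def apply_transitivity(triples, predicate):
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        pairs = [t for t in inferred if t[1] == predicate]
        for (a, _, b) in pairs:
            for (b2, _, c) in pairs:
                if b == b2 and (a, predicate, c) not in inferred:
                    inferred.add((a, predicate, c))   # fire the rule head
                    changed = True
    return inferred

triples = {("Professor", "subClassOf", "Teacher"),
           ("Teacher", "subClassOf", "Person")}
result = apply_transitivity(triples, "subClassOf")
print(("Professor", "subClassOf", "Person") in result)   # True
```

A rules-index would cache such inferred triples so they need not be recomputed at query time.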
4.1.2. Ontology Management Service: The ontology management service facilitates the
creation and maintenance of the global ontology. It provides a set of application program
interfaces (APIs) to perform the following functionalities: (1) publishes the domain
knowledge in the form of RDF triples by assigning GUIDs to the RDF metadata triples
and mapping the GUIDs over the bitmap index; (2) defines semantic operators and
constructs the concept and relationship hierarchies; (3) provides a mechanism to create
and drop a rule-base and to modify the set of rules in a rule-base; (4) enables the creation
and maintenance of the rules-index and synchronizes it after rules are modified in the
rule-base.
4.1.3. Source Descriptions Storage (SDS): A source description is the metadata of a data
source. This metadata can be further classified into source metadata and content
metadata. In order to make the source description of a data source interoperable in a
heterogeneous environment, it is described in a conceptual model in the form of a local
ontology [8]. The metadata of a data source is expressed as RDF triples in the local
ontology. These RDF triples are assigned local unique identifiers (LUIDs) using a
sequence-generating object of each data source. In a nutshell, the source descriptions
storage is a set of local ontologies.
4.1.4. Source Registration Service: The source registration service facilitates the
creation and maintenance of a local ontology for a data source in the source descriptions
storage. It provides a set of application program interfaces (APIs) to perform the
following functionalities: (1) creates a unique sequence-number-generating object for the
incoming data source; (2) creates a local ontology to hold the RDF triples advertised by
the data source; (3) registers the local ontology in the source descriptions storage; (4)
inserts the RDF triples of the data source into its corresponding local ontology.
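The registration workflow can be sketched as follows (class and method names are ours, not the thesis APIs):

```python
# Toy sketch of source registration: each incoming source gets its own
# sequence object, and its advertised RDF triples are stored in a
# per-source local ontology keyed by LUIDs.

import itertools

class SourceDescriptionsStorage:
    def __init__(self):
        self.local_ontologies = {}     # source name -> {LUID: triple}
        self.sequences = {}            # source name -> LUID generator

    def register(self, source, triples):
        # (1) per-source sequence object; (2)+(3) create and register
        # the local ontology for the source.
        self.sequences[source] = itertools.count(1)
        self.local_ontologies[source] = {}
        # (4) insert the advertised triples under fresh LUIDs.
        for t in triples:
            luid = next(self.sequences[source])
            self.local_ontologies[source][luid] = t
        return sorted(self.local_ontologies[source])

sds = SourceDescriptionsStorage()
luids = sds.register("src1", [("Teacher", "teaches", "Course"),
                              ("Course", "hasCode", "CS101")])
print(luids)   # [1, 2]
```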
4.1.5. Bitmap Index Storage: A bitmap index is a cross-tab structure of bits [26, 28].
We employ a bitmap index for efficient traversal during relevance reasoning. The bitmap
index is divided into bitmap segments. Internally, data in a bitmap segment is
represented in the form of bits. Each data source retains one bitmap segment in the
bitmap index. In the proposed architecture, data sources are represented on the vertical
side of the index, whereas RDF triples of the global ontology are represented on the
horizontal side. A bit is unset (i.e., 0) if a data source does not contain the corresponding
RDF triple, and set (i.e., 1) if it does. A sequence-number-generating object is used to
assign a unique identifier to each bitmap segment.
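A toy sketch of the bitmap index storage (class and field names are ours): one segment per data source, one bit position per GUID of the global ontology:

```python
# Sketch of bitmap segment creation and synchronization as described
# above: a segment starts with all bits unset, and synchronization
# sets the bit of every global-ontology triple the source holds.

class BitmapIndex:
    def __init__(self, num_guids):
        self.num_guids = num_guids
        self.segments = {}             # source name -> list of bits

    def create_segment(self, source):
        # All bits start unset (0): the source holds no triples yet.
        self.segments[source] = [0] * self.num_guids

    def synchronize(self, source, guids_in_source):
        # Set the bit for every GUID advertised by the source.
        for g in guids_in_source:
            self.segments[source][g] = 1

index = BitmapIndex(num_guids=6)
index.create_segment("src1")
index.synchronize("src1", [0, 2, 5])
print(index.segments["src1"])          # [1, 0, 1, 0, 0, 1]
```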
4.1.6. Index Management Service: The index management service facilitates the
creation and maintenance of a bitmap segment for a data source in the bitmap index
storage. It provides a set of application program interfaces (APIs) to perform the
following functionalities: (1) bitmap segment creation creates the bitmap segment for an
incoming data source and initializes all bits of the segment to 0 (unset); (2) bitmap
synchronization updates the bitmap segment of a data source to keep it consistent with its
local ontology; (3) shuffle bit shuffles the bits of a bitmap segment during
synchronization.
4.1.7. Index Lookup Service: The index lookup service facilitates efficient traversal of
the bitmap index. It provides a set of application program interfaces (APIs) to perform the
following functionalities: (1) relevant source identification traverses the bitmap index
against an RDF triple and identifies the bitmap segments where the bit is set; (2)
irrelevant source pruning traverses the bitmap index against an RDF triple and identifies
the irrelevant bitmap segments where the bit is unset.
4.1.8. Ontology Reasoning Service: Ontology Reasoning Service enables the reasoning
and inference capabilities to the proposed architecture. It provides a set of application
program interfaces (APIs) to perform the following functionalities. (1) Semantic
Matching: is the process of finding semantic similarity among the different terms
(concepts and relation-ships) in order to resolve the semantic heterogeneities. (2)
Inference and Reasoning: provides reasoning and inference to the semantic matching
process by incorporating rules, rules-base, and rules-index. (3) Semantic Query
Generation: generates queries against the global ontology using semantic operators
during the semantic matching. Note that these queries are different from the user query so
these should not be inter-mixed or confused.
4.2.1. Relevance Levels and Proposed Scoring Strategies: During the semantic
matching, the terms of the user's query triples are matched with the terms of the source
triples. As a result, one of five relevance levels is obtained for each term. These
relevance levels are given numeric scores for the purpose of quantification, which helps
us rank a source for a given query. Following are the definitions of the relevance levels
and the operators used in the semantic matching process.
4.2.1.1. Exact Match: A term is an exact match of another term if
both are lexically equal to each other. For example, the term nust:Instructor is an exact
match of niit:Instructor. A numeric score of 1.0 is assigned to any exactly matching terms as
soon as they appear in an RDF triple.
4.2.1.2. Synonym: A term is a synonym of another term if both refer to the same concept
or relationship. A numeric score of 0.8 is assigned to any synonym terms as
soon as they appear in an RDF triple. We are using the owl:sameAs operator for specifying
these mappings in the rule-base of the global ontology.
4.2.1.3. Subsumption: A term subsumes another term if it is a generalization of it in the
concept or relationship hierarchy. A numeric score of 0.6 is assigned to any
subsumption-related terms as soon as they appear in an RDF triple. We are using the
rdfs:subClassOf operator for specifying these mappings in the rule-base of the global ontology.
4.2.1.4. Likelihood: Some concepts are not totally disjoint or different; rather, they are
related to some other term with some degree of likelihood. For example, the term
nust:Instructor might be relevant to nust:TeacherAssistant with some degree of likelihood.
This type of mapping cannot be specified using the previously defined operators. A numeric
score of 0.5 is assigned to any likelihood-based similar terms as soon as they appear in an
RDF triple. We are using the owl:equivalentOf operator for specifying these mappings in the
rule-base of the global ontology.
4.2.1.5. Disjoint: A term is disjoint from another term if and only if they are
different from each other. For example, the term nust:Instructor is disjoint from
nust:Student. A numeric score of 0.0 is assigned to any disjoint terms as soon as they appear
in any component of an RDF triple. These relevance levels and their scoring strategies are
summarized in Table 1 below:
Table 1: Relevance levels and their scoring strategies

Semantic Operator    Score
exact match          1.0
sameAs               0.8
subClassOf           0.6
equivalentOf         0.5
disjointFrom         0.0
4.2.2. Term Similarity: We use the same semantic matching strategy for both concepts
and relationships. We maintain a concept hierarchy and a relationship hierarchy; terms
include both concepts and relationships. We extract the relationship between the query and
source terms using their respective hierarchies and then assign the standard relevance score
defined in Table 1. An RDF triple contains a subject, predicate, and object. The subject
and object are treated as concepts, so their similarity is computed using the concept
hierarchy, whereas the predicate similarity is calculated using the relationship hierarchy.
4.2.3. RDF Triple Similarity: To calculate the relevance between user query and source
RDF triples, we combine both aspects of term similarity (i.e., concepts and relationships).
The overall RDF triple similarity can be calculated as shown in equation 1:

Sim(qT, st_i) = Π_j sim(qt_j, st_i,j)    (1)

where qT denotes the query triple and st_i denotes the i-th source triple; qt_j and st_i,j
are the query and source terms that are to be matched; and Sim(qT, st_i) is the overall
similarity of a single query triple for a given source triple. Here i and j index the
source RDF triples and the query triple terms (subject, predicate, object) respectively.
4.2.4. Source Ranking: A user query and source RDF triples are matched to find the
similarity of each query triple with the data source triples. Once the RDF triple similarity
has been computed, the source score for the whole query is computed using the formula
given in equation 2. Based on the score obtained for a query, data sources are ranked.

sim_src = Σ (i = 0 to n) Sim(q_i, s)    (2)

In the above equation, sim_src is the total score of a source (s) for a user query,
obtained by summing the similarity scores of all matched query triples; q_i denotes the
query triples and n denotes the total number of query triples.
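Equations 1 and 2 can be checked numerically. This is an illustrative Python sketch (the thesis implementation is in PL/SQL); the term scores come from the worked example in Section 4.4, Tables 8 and 9.

```python
# Equation (1): a triple's similarity is the product of its subject,
# predicate, and object term similarities.
def triple_similarity(term_scores):
    product = 1.0
    for score in term_scores:
        product *= score
    return product

# Equation (2): a source's score for the query is the sum over all of its
# matched triples.
def source_score(triple_similarities):
    return sum(triple_similarities)

sim_eme = triple_similarity([0.6, 0.8, 0.8])          # EME-DB: 0.384
sim_niit = source_score([
    triple_similarity([0.6, 1.0, 1.0]),               # nust-1000001 -> 0.6
    triple_similarity([0.5, 0.5, 1.0]),               # nust-1000006 -> 0.25
])                                                    # NIIT-DB: 0.85
print(round(sim_eme, 3), round(sim_niit, 2))
```

With these values NIIT-DB (0.85) outranks EME-DB (0.384), matching the ranking derived later in the worked example.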
4.3. Proposed Workflows: Our proposed methodology identifies
the most relevant and effective data sources using a bitmap index. The proposed
methodology can be divided into three main workflows. These workflows help to
understand the intricacies of the proposed architecture. Below is a detailed discussion of
each workflow.
4.3.1. Ontology Management Workflow: Ontology management workflow manages
the global ontology in the architecture. Ontology management service plays a prominent
part in this workflow. Five major activities carried out by ontology management
workflow include:
Domain knowledge representation is the registration of the RDF triples over the
global ontology. These RDF triples are stored in the global ontology and GUIDs are
assigned using a unique sequence number generator object. GUIDs are allocated
positions over the bitmap index. Transactions are permanently recorded to the global
ontology. The snippet in Figure 7 shows pseudo-code for the insertion of an RDF triple into
the global ontology. Its implementation issues and details are discussed in the following
chapter.
Pseudo-Code for Domain Knowledge Registration
For each RDF triple of global ontology
Assign GUID to RDF triple
Add RDF triple to the global ontology
Extend bitmap index
Increase the length of bitmap pattern by one
Assign location to the RDF triple reserved over the bitmap index
Perform commit to apply changes persistently to global ontology
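The registration pseudo-code above can be sketched as follows. This is an illustrative Python sketch, not the thesis' PL/SQL implementation; the variable names and the in-memory stores are assumptions made for the example.

```python
# Domain knowledge registration: each incoming triple gets a GUID from a
# sequence, is stored in the global ontology, and is given a reserved
# position over the bitmap index pattern.
import itertools

guid_sequence = itertools.count(1000001)   # stands in for the sequence object
global_ontology = {}                       # GUID -> RDF triple
bitmap_pattern = []                        # one reserved position per triple

def register_triple(subject, predicate, obj):
    guid = f"nust-{next(guid_sequence)}"
    global_ontology[guid] = (subject, predicate, obj)
    bitmap_pattern.append(guid)            # extend the pattern by one position
    return guid, len(bitmap_pattern) - 1   # GUID and its reserved position

guid, pos = register_triple("nust:Instructor", "nust:isTeaching", "nust:Course")
print(guid, pos)   # nust-1000001 0
```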
Figure 8: InverseOf SameAs rule inserted in the rule-base

The TransitiveOf<operator> rule tells the rule-base that if a term A is related to another
term B with some relation R, and the same term B is further related to another term C
using the relation R, it implies that the term A is related to term C using the same relation
R. Fig. 9 shows the N3 representation of the TransitiveOf rule for the sameAs operator in the
semantic web rule language.

Def-TransitiveOfSameAs@swrl:
((?x sameAs ?y) and (?y sameAs ?z)) -> (?x sameAs ?z)

Rules-index management involves the creation and management of the rules-index
for a rules-base. Once the rules are inserted into the rules-base, the corresponding
rules-index is refreshed to pre-compute inferred RDF triples.
4.3.2. Source Registration Workflow: Source registration workflow registers the data
sources in the data integration system. Three major activities carried out by source
registration workflow include
Local ontology creation
Bitmap segment creation
Bitmap synchronization
Local ontology creation involves the creation of a local ontology and a unique sequence
number generator object for the incoming data source, along with the insertion of its RDF
triples into the created ontology. Source registration service plays a prominent part in
local ontology creation. Ontology is created for the incoming data source and is
registered with the source descriptions storage. The RDF triples, advertised by the data
source, are assigned unique identifiers (LUIDs) and are added to the local ontology.
Transactions are permanently recorded to the source descriptions storage. The snippet in
Figure 10 shows pseudo-code for local ontology creation and its RDF triple insertion. Its
implementation issues and details are discussed in the following chapter.
Pseudo-Code for Local Ontology Creation
Creating ontology for incoming source in Source Descriptions Storage
Creating unique sequence generator for incoming source RDF triples
Assign LUIDs to the RDF triples
Add RDF triple to the local ontology in Source Descriptions Storage
Perform commit to apply changes persistently to Source Descriptions Storage
Bitmap segment creation involves the cloning of bitmap pattern and the creation of
bitmap segment for incoming data sources over the bitmap index. The index management
service plays a prominent role in bitmap segment creation. The bitmap pattern is stored
over the global ontology. It is cloned for the newly created bitmap segment. Initially all
the bits are initialized to unset i.e., 0. A unique identifier is assigned to the bitmap
segment and is added to the bitmap index. The snippet in Figure 11 shows pseudo-code
for bitmap segment creation. Its implementation issues and details are discussed in the
following chapter.
Pseudo-Code for Bitmap Segment Creation
Check whether bitmap segment exists for the incoming source
If (no)
Clone bitmap pattern from global ontology RDF triples
Initialize bits to zero (0)
Assign a unique number to the bitmap segment
Add bitmap segment to the bitmap index for incoming source
Perform commit to apply changes persistently in index
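The segment-creation pseudo-code above can be sketched as follows. This is an illustrative Python sketch, not the thesis' PL/SQL implementation; the in-memory index and the function name are assumptions made for the example.

```python
# Bitmap segment creation: clone the bitmap pattern for a new source,
# initialize all bits to 0, and register the segment under a unique number.
import itertools

segment_sequence = itertools.count(1)
bitmap_index = {}                            # segment id -> (source, bits)

def create_bitmap_segment(source, bitmap_pattern):
    if any(src == source for src, _ in bitmap_index.values()):
        return None                          # segment already exists
    bits = [0] * len(bitmap_pattern)         # cloned pattern, all bits unset
    segment_id = next(segment_sequence)      # unique number for the segment
    bitmap_index[segment_id] = (source, bits)
    return segment_id

pattern = ["nust-1000001", "nust-1000002", "nust-1000003"]
sid = create_bitmap_segment("EME-DB", pattern)
print(sid, bitmap_index[sid])   # 1 ('EME-DB', [0, 0, 0])
```

Synchronization would then flip individual bits to 1 as the source's local-ontology triples are matched against the global ontology.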
Figure 13 shows all the activities performed during the source registration workflow
as a sequence diagram.
Figure 13: Sequence Diagram for Source Registration Workflow
4.3.3. Relevance Reasoning Workflow: This workflow comprises the
steps that are carried out to identify the relevant and effective data sources for the user's
query. Relevance reasoning service plays a prominent part in this workflow. It
cooperates with the index lookup service and ontology reasoning service during
relevance reasoning to perform the following activities.
Semantic Query Expansion
Source Selection
Source Ranking
Figure 14 shows all the activities performed during the relevance reasoning workflow
as a sequence diagram.
Figure 14: Sequence Diagram for Relevance Reasoning Workflow
Semantic query expansion: A user submits the query in RDF, which is passed to the
relevance reasoning service. The RDF triples entered by the user into a query are
called asserted query triples. A user can submit queries in global ontology terms as well
as in the local ontology terms of the underlying data sources. The relevance reasoning
service expands the user query to all possible combinations using the ontology reasoning
service. Every term of the query triple is expanded using semantic operators for synonyms,
lexical variants, subsumption, and degree of likelihood. This expansion results in the
addition of some extra triples to the user query. These RDF triples are called inferred
query triples. The snippet in Figure 15 shows pseudo-code for the semantic query
expansion. Its implementation issues and details are discussed in the following chapter.
Pseudo-code for Query Expansion in Relevance Reasoning
InferredTriplesList = empty list
For each RDF triple in AssertedTripleList of user's query
Isolate subject, object, and property of current RDF triple
Calculate semantic similarity and add relevant terms for the subject of RDF triple
Calculate semantic similarity and add relevant terms for the property of RDF triple
Calculate semantic similarity and add relevant terms for the object of RDF triple
Take Cartesian product of terms
Populate InferredTriplesList with the Cartesian product
Return InferredTriplesList
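The Cartesian-product step of the expansion can be sketched as follows. This is an illustrative Python sketch; the buckets reproduce the terms deduced for the query triple <Instructor, isTeaching, Course> in the worked example (Table 6), and the variable names are assumptions.

```python
# Query expansion: combine the subject, property, and object buckets of
# semantically relevant terms into inferred query triples.
from itertools import product

subject_bucket = ["Instructor", "Professor", "Prof", "Lecturer", "Teacher",
                  "TeachingAssistant"]
property_bucket = ["isTeaching", "Teaching", "Teaches", "isAssisting"]
object_bucket = ["Course", "Subject"]

inferred_triples = list(product(subject_bucket, property_bucket, object_bucket))
print(len(inferred_triples))   # 6 * 4 * 2 = 48 candidate triples
print(inferred_triples[0])     # ('Instructor', 'isTeaching', 'Course')
```

Only the candidates that have a GUID in the global ontology survive the next step; the rest are rejected before the index is consulted.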
Source Selection: Once the query is expanded with semantically relevant RDF triples,
the GUIDs are reconciled from the global ontology. GUIDs help to find out the position
of RDF triples over the bitmap index. These positions are passed to the index lookup
service which traverses the bitmap segments of each source at the corresponding
positions and identifies the data sources for which the bits are set. The snippet in Figure
16 shows pseudo-code for the source selection. Its implementation issues and details
are discussed in the following chapter.
Source Ranking: The identified data sources are ranked according to their relevance
to the user query. Table 1 shows our scoring scheme. First, term similarity is computed
for each component of a query RDF triple in a given source. The term similarities are then
used in equation 1 to compute the RDF triple similarity. Finally, the source similarity is
computed by equation 2, and sources are ranked according to the score obtained for the
given user query.
4.4. Case Study: In this
scenario, we have a global ontology named NUST_DB, as shown in Figure 17, and
three data sources named EME_DB, NIMS_DB, and NIIT_DB. The RDF triples of the
global ontology are shown in Table 2.
Figure 17: Global ontology NUST_DB, with concepts Instructor, Department, Course,
Student, and TeachingAssistant, and relationships isTeaching, worksIn, hasMajor,
isAssisting, isRegisteredIn, and isAdvisorOf
Table 2: RDF triples of the global ontology (NUST_RDF_DATA)

GUID           RDF Triple
nust-1000001   < nust:Instructor, nust:isTeaching, nust:Course >
nust-1000002   < nust:Instructor, nust:isAdvisorOf, nust:Student >
nust-1000003   < nust:Student, nust:isRegisteredIn, nust:Course >
nust-1000004   < nust:Student, nust:hasMajor, nust:Department >
nust-1000005   < nust:Instructor, nust:worksIn, nust:Department >
nust-1000006   < nust:TeacherAssistant, nust:isAssisting, nust:Course >
The RDF triples of the global ontology form the basis for the bitmap indexing in our
proposed architecture. The pattern of the index can be illustrated as shown in Table 3.
Table 3: Structure of Bitmap Index

Source-segment   position-1     position-2     position-3     position-4     position-5     position-6
                 nust-1000001   nust-1000002   nust-1000003   nust-1000004   nust-1000005   nust-1000006
Bitmap Pattern   xxxxxxxxxxxxxx
Figure 18: Concept & Relationship Hierarchies Managed using Semantic Operators over
Global Ontology (e.g., Professor, Prof, Lecturer, and Teacher related to Instructor via
sameAs and subClassOf; Subject sameAs Course; Teaching and Teaches sameAs isTeaching;
TeachingAssistant and isAssisting as related terms)
Three local ontologies are created for the data sources, with a naming
convention like <DataSource>_RDF_Data. There are semantic heterogeneities between
the contents of the data sources. Table 4 lists the RDF triples of the sources stored in
their respective ontologies.
Table 4: RDF triples of the data sources

EME_RDF_DATA
Local Link-ID   RDF Triple
eme-1011
eme-1012
eme-1013

NIMS_RDF_DATA
Local Link-ID   RDF Triple
nims-2011       < nims:Teacher, nims:isAdvisorOf, nims:Student >
nims-2012       < nims:Teacher, nims:worksIn, nims:Department >
nims-2013       < nims:Student, nims:hasMajor, nims:Department >

NIIT_RDF_DATA
Local Link-ID   RDF Triple
niit-3011       < niit:Lecturer, niit:isTeaching, niit:Course >
niit-3012       < niit:TeachingAssistant, niit:isAssisting, niit:Course >
The prefixes nust, niit, nims, and eme refer to the URLs http://www.nust.edu.pk,
http://www.niit.edu.pk, http://www.nims.edu.pk, and http://www.eme.edu.pk respectively.
Once the local ontologies have been created, the index management service comes into
play: it creates the bitmap segments in the bitmap index for the data sources and plots
(synchronizes) the RDF triples of the data sources in their respective bitmap segments.
During synchronization, the index management service also resolves the semantic
heterogeneities. The structure of the bitmap index can be illustrated as shown in Table 5.
Table 5: Bitmap index after source registration

Source-segment   nust-1000001   nust-1000002   nust-1000003   nust-1000004   nust-1000005   nust-1000006
EME-DB           1              1              1              0              0              0
NIMS-DB          0              1              0              1              1              0
NIIT-DB          1              0              1              0              0              1
Suppose a user query contains the RDF triple <Instructor, isTeaching, Course>.
The relevance reasoning service decomposes this triple into its terms and creates three
buckets: one for the subject, one for the property, and one for the object. Each term is
given to the ontology reasoning service, which calculates its semantic similarity in the
respective hierarchy to find relevant terms. The buckets are populated as shown in Table 6.
Table 6: Terms deduced for each bucket

Semantic Operator Used   Subject Bucket                       Property Bucket     Object Bucket
exactMatch               Instructor                           isTeaching          Course
sameAs                   NULL                                 Teaching, Teaches   Subject
subClassOf               Professor, Prof, Lecturer, Teacher   NULL                NULL
equivalentOf             TeachingAssistant                    isAssisting         NULL
The Cartesian product of the subject, property, and object buckets is taken to construct
the inferred triples list. Table 7 shows their Cartesian product.
Table 7: Inferred RDF triples for a user's query triple (expansion of the RDF triple
using the ontology reasoning service)

<Instructor, isTeaching, Course>
<Instructor, Teaching, Course>
<Instructor, Teaches, Course>
<Instructor, isAssisting, Course>
.            .            .
<Instructor, isAssisting, Subject>
<Professor, isTeaching, Course>
<Professor, Teaching, Course>
<Professor, Teaches, Course>
<Professor, isAssisting, Subject>
<Prof, isTeaching, Course>
<Prof, Teaching, Course>
<Prof, isAssisting, Subject>
<Lecturer, isTeaching, Course>
<Lecturer, Teaching, Course>
<Lecturer, Teaches, Course>
<Teacher, Teaching, Course>
<Teacher, Teaches, Course>
<Teacher, isAssisting, Subject>
<TeachingAssistant, isTeaching, Course>
<TeachingAssistant, Teaching, Course>
<TeachingAssistant, Teaches, Course>
<TeachingAssistant, isAssisting, Course>
<TeachingAssistant, isAssisting, Subject>
In order to execute a query over the bitmap index, GUIDs are needed. An RDF triple
is rejected if no GUID is available for it in the global ontology. In this example, the GUIDs
nust-1000001 and nust-1000006 are fetched from the global ontology. These GUIDs are
passed to the index lookup service to identify relevant and effective data sources. The
index lookup service traverses the bitmap index for only these GUIDs and returns all
bitmap segments where the bits are set, i.e., EME-DB and NIIT-DB.
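This lookup can be checked against the running example. The sketch below is illustrative Python (not the PL/SQL implementation); the segment bits follow the source triples of Table 4 plotted over the bitmap index, and the function name is an assumption.

```python
# Index lookup for the running example: the set bits at the positions of
# GUIDs nust-1000001 and nust-1000006 identify EME-DB and NIIT-DB.

positions = {"nust-1000001": 0, "nust-1000002": 1, "nust-1000003": 2,
             "nust-1000004": 3, "nust-1000005": 4, "nust-1000006": 5}
segments = {
    "EME-DB":  [1, 1, 1, 0, 0, 0],
    "NIMS-DB": [0, 1, 0, 1, 1, 0],
    "NIIT-DB": [1, 0, 1, 0, 0, 1],
}

def relevant_sources(guids):
    hits = set()
    for guid in guids:
        pos = positions[guid]
        hits.update(s for s, bits in segments.items() if bits[pos] == 1)
    return sorted(hits)

print(relevant_sources(["nust-1000001", "nust-1000006"]))  # ['EME-DB', 'NIIT-DB']
```

NIMS-DB is pruned here without any semantic matching at query time, which is exactly the saving the bitmap index is meant to provide.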
In order to sort the data sources based on their relevance to the query triples, semantic
similarity scoring is incorporated as shown in Table 1. First the term similarity is
computed for the query triples with data source triples using the concept and relationship
hierarchies.
EME-DB scores 0.6 for matching the subject of the query triple, Instructor, with the
subject of the source triple, Professor; the concept hierarchy returns a subClassOf
relationship between these terms. Next, the properties of the query and source triples are
matched, scoring 0.8 for isTeaching and Teaches because they are connected by a sameAs
relationship. Finally, the objects of the query and source triples are matched, scoring 0.8
for Course and Subject.
NIIT-DB scores 0.6 for matching the subject of the query triple, Instructor, with the
subject of the source triple, Lecturer; the concept hierarchy returns a subClassOf
relationship for this match. This data source scores 1 for matching the property
isTeaching with the query property isTeaching, and 1 for matching the respective
objects Course and Course. NIIT-DB also contains a triple that is relevant to the query
triple with some degree of likelihood, i.e., nust-1000006.
The relevance of a data source for every query triple is calculated by putting the term
similarity scores into equation 1, and is shown in Table 8.
Table 8: Semantic Similarity Calculation of a Data Source for a User Query Triple

Relevant Data Source   GUIDs          sim (subject)   sim (property)   sim (object)   Source Similarity for Query Triple (qT)
EME-DB                 nust-1000001   0.6             0.8              0.8            0.384
NIIT-DB                nust-1000001   0.6             1                1              0.6
                       nust-1000006   0.5             0.5              1              0.25
Finally, the overall similarity score of a data source for a user's query is calculated
using equation 2, and is shown in Table 9. The sources are sorted and given to the query
rewriting component.
Table 9: Semantic Similarity Calculation of a Data Source for a User Query

Relevant Data Source   GUIDs                 Total Source Similarity for User Query
EME-DB                 nust-1000001: 0.384   simEME = 0.384
NIIT-DB                nust-1000001: 0.6     simNIIT = 0.85
                       nust-1000006: 0.25
CHAPTER 5
IMPLEMENTATION
This chapter discusses our implementation strategy and issues for the proposed
architecture. The first section discusses in detail the Oracle implementation of the
ontologies and RDF data. The second section discusses the implementation details of our
proposed architecture for relevance reasoning.
5.1. Oracle RDF Data Model: Oracle provides support for storing and querying
RDF and OWL data. This functionality builds on the recent Oracle Spatial Network Data
Model (NDM), which is the Oracle solution for managing graphs within the Oracle
Database. The RDF Data Model supports three types of database objects: model or
ontology (an RDF graph consisting of a set of triples), rule-base (a set of rules), and rule
index (an entailed RDF graph).
5.1.1. RDF Data Model or Ontology: There is a single universe for all RDF data stored
in the database. All RDF triples are parsed and stored in the system under the MDSYS
schema as shown in Figure 19. An RDF triple (subject, predicate, and object) is treated as
one database object. A single RDF document that contains multiple triples, therefore,
results in many database objects.
RDF_MODEL$ is a system level table created to store information on all of the RDF
and OWL ontologies in a database. Whenever a new ontology is created, new
MODEL_ID is automatically generated for it. An entry is made into the RDF_MODEL$
table.
The RDF_NODE$ table stores the VALUE_ID for text values that participate in
subjects or objects of statements. The NODE_ID is the same as the VALUE_ID.
NODE_ID values are stored once, regardless of the number of subjects or objects they
participate in. The node table allows RDF data to be exposed to all of the analytical
functions and APIs available in the core NDM.
The RDF_LINK$ table stores the triples for all of the RDF models in the database.
Therefore, the MODEL_ID logically partitions the RDF_LINK$ table. Selecting all of
the links for a specified MODEL_ID returns the RDF network for that particular
ontology.
The RDF_VALUE$ table stores the text values, i.e. the Uniform Resource Identifiers
or literals for each part of the triple. Each text value is stored only once, and a unique
VALUE_ID is generated for the text entry. URIs, blank nodes, plain literals and typed
literals are all possible VALUE_TYPE entries.
Blank nodes are used to represent unknown objects, and when the relationship
between a subject node and an object node is n-ary. New blank nodes are automatically
generated whenever blank nodes are encountered in triples. However, it is possible for
users to re-use blank nodes, for example when inserting data into containers or
collections. The RDF_BLANK_NODE$ table stores the original names of blank nodes
that are to be reused when encountered in triples.
To represent a reified statement, a resource is created using the LINK_ID of the triple.
The resource can then be used as the subject or object of a statement. To process a
reification statement, a triple is first entered with the reified statement's resource as
subject, rdf:type as property, and rdf:Statement as object. A triple is then entered for each
assertion about the reified statement. However, each reified statement will have only one
rdf:type to rdf:Statement associated with it, regardless of the number of assertions made
using this resource.
The Oracle RDF Data Model supports containers and collections. A container or
collection will have an rdf:type to rdf:container_name or rdf:collection_name associated
with it, and a LINK_TYPE of RDF_MEMBER.
Two new object types have been defined for RDF-modeled data. SDO_RDF_TRIPLE
serves as the triple representation of RDF data, whilst SDO_RDF_TRIPLE_S is defined
to store persistent data in the database. The GET_RDF_TRIPLE() function can be used to
return an SDO_RDF_TRIPLE type.
5.1.2. Rule-base: Oracle supplies both an RDF rule-base that implements the RDF
entailment rules, and an RDF Schema (RDFS) rule-base that implements the RDFS
entailment rules. Both rule-bases are automatically created when RDF support is added to
the database.
5.1.4. Querying RDF Data: The SDO_RDF_MATCH function has been designed to
meet most of the requirements identified by W3C in SPARQL for graph querying. A Java
API is also provided for network representation and network analysis. Analysis
capabilities include the ability to find a path between two resources, or to find a path
between two resources when the links are of a specified type.
Use of the SDO_RDF_MATCH table function allows a graph query to be embedded
in a SQL query. It has the ability to search for an arbitrary pattern against the RDF data,
including inference, based on RDF, RDFS, and user-defined rules. It can automatically
resolve multiple representations of the same point in value space (e.g.,
"10"^^xsd:Integer and "10"^^xsd:PositiveInteger).
5.2. Implementation of the Proposed Architecture: The implementation details are
discussed in the following subsections.
5.2.1. Enabling and Disabling the RDF Support in Database: Before using the RDF
support in an Oracle database, we need to enable this feature. A procedure named
CREATE_RDF_NETWORK() of the SDO_RDF package is used to enable RDF support
in the database. This procedure creates system tables and other database objects used for
RDF support. One must connect to the database as a user with DBA privileges in order to
call this procedure, and should call the procedure only once for the database. To remove
RDF support from the database, call the SDO_RDF.DROP_RDF_NETWORK procedure.
The following example enables the RDF support into the database.
Enabling the Semantic Network
BEGIN
SDO_RDF.CREATE_RDF_NETWORK('rdf_tblspace');
END;
5.2.2. Creating the Global Ontology: The table used to store the RDF triples of the
global ontology is shown below. The name of the table is GLOBAL_RDF_DATA.
Column Name   Data Type          Description
GUID          NUMBER
TRIPLE        SDO_RDF_TRIPLE_S
TRIPLE_TYP    VARCHAR2
BIT_POS       NUMBER
A unique sequence generating object is used to assign GUIDs to the incoming RDF
triples. The example below shows the creation of the sequence generator object.
Creating the Sequence Generator for GUIDs
CREATE SEQUENCE s_global_rdf_data_id
START WITH 1000
INCREMENT BY 1
NOCACHE
ORDER;
Once the global ontology table has been created, we then create the global ontology
using the CREATE_RDF_MODEL() procedure of the SDO_RDF package. The example
below creates the global ontology.
Creating the Global Ontology
BEGIN
SDO_RDF.CREATE_RDF_MODEL('global_ontology', 'global_rdf_data', 'triple');
END;
Column Name      Data Type   Description
SEGMENT_ID       NUMBER
SEGMENT_SOURCE   VARCHAR2    URI of the data source.
BITMAP_PATTERN
Once the semantic operators have been defined, they are used to manage the concepts
and relationship hierarchies. The code in following example links the concept Course
with Subject using sameAs operator to represent synonyms.
Managing Hierarchies
INSERT INTO global_rdf_data
VALUES(s_global_rdf_data_id.NEXTVAL,
SDO_RDF_TRIPLE_S('global_ontology',
'http://www.niit.edu.pk/Research/Delsa/Course',
'http://www.niit.edu.pk/Research/Delsa/sameAs',
'http://www.niit.edu.pk/Research/Delsa/Subject'));
5.2.5. Creating Rules, Rule-base and Rule Index: In order to create a user-defined
rule-base, the CREATE_RULEBASE() procedure of the SDO_RDF_INFERENCE package is
used. The following example creates a rule-base for the global ontology with the name
global_ontology_rb.
Creating Global Ontology Rulebase
BEGIN
SDO_RDF_INFERENCE.CREATE_RULEBASE('global_ontology_rb');
END;
After creating the rule-base, rules can be added to it. To cause the rules in the rule-base
to be applied in a query of RDF data, one can specify the rule-base in the call to the
SDO_RDF_MATCH table function. Inverse and transitive rules have been inserted for
each semantic operator. The following example explains the implementation of these
rules for sameAs operator.
Inverse Rule for sameAs Operator
INSERT INTO mdsys.rdfr_global_ontology_rb
VALUES('InverseOfSameAs',
'(?x :sameAs ?y)', NULL,
'(?y :sameAs ?x)',
SDO_RDF_ALIASES(SDO_RDF_ALIAS('','http://www.niit.edu.pk/Research/Delsa/')));
Whenever rules are inserted, updated, or deleted from the rule-base, the rules index must
be refreshed. The following example creates the rules index for the global ontology
rule-base.
Rules Index Creation
BEGIN
SDO_RDF_INFERENCE.CREATE_RULES_INDEX (
'rdfs_rix_global_ontology',
SDO_RDF_Models('global_ontology'),
sdo_rdf_rulebases('RDFS','global_ontology_rb'));
END;
5.3. Implementation Packages: We have implemented a set of PL/SQL packages that
provide relevance reasoning in a scalable data integration system. The remaining section
discusses the functionality provided by each of these packages along with a brief
description.
Parameter Name      Data Type        Description
p_incoming_source   VARCHAR2
p_list_of_triples   TRIPLE_TAB_TYP
5.3.1.2. This procedure deregisters a data source and deletes the local ontology for it
from the source description storage.

Parameter Name      Data Type   Description
p_deleting_source   VARCHAR2
5.3.2.1. This procedure registers the domain knowledge in terms of RDF triples. It
assigns a GUID to the incoming triple, reserves its position on the bitmap index, and adds
it to the global ontology.

Parameter Name           Data Type        Description
p_incoming_triple        SDO_RDF_TRIPLE
p_incoming_triple_type   VARCHAR2
5.3.2.2. This procedure interacts with the ontology reasoning service to semantically
expand an RDF triple and identify its GUID.

Parameter Name      Data Type        Description
p_incoming_triple   SDO_RDF_TRIPLE
5.3.2.3.

Parameter Name           Data Type   Description
p_incoming_triple_GUID   NUMBER
5.3.3. PACKAGE Index_Management_Service: This package helps in the
management of the bitmap index in the proposed architecture. Following are the three
main procedures of this package.
5.3.3.1.
MANAGE_BITMAP_PATTERN(): This procedure manages the bitmap
pattern for the index whenever domain knowledge is published in terms of the RDF
triples.
Parameter Name
Data type
Description
p_incoming_triple_GUID
NUMBER
5.3.3.2. This procedure helps in the construction of the bitmap segment for an
incoming data source. It assigns a unique identifier to each bitmap segment. Initially all
bits are initialized to 0 in the bitmap pattern.

Parameter Name      Data Type   Description
p_incoming_source   VARCHAR2    URI of the incoming data source for which the bitmap
                                segment has to be created.
5.3.3.3. SYNCH_BITMAP_SEGMENT(): This procedure helps in the synchronization of
the local ontology RDF triples with the bitmap segment of a specified data source. It
shuffles the bits according to the RDF triples of the local ontology.

Parameter Name     Data Type   Description
p_source_segment   VARCHAR2
GUID_POS           NUMBER
BIT_STATE          VARCHAR2
Parameter Name   Data Type   Description
GUID_POS         NUMBER

Parameter Name    Data Type   Description
P_incoming_term   VARCHAR2
5.3.5.2. This procedure adds simple semantic searching behavior to the proposed
architecture. It accepts a term (concept or relationship) and formulates a semantic query
that checks for synonyms, lexical variants, and subclass operators in their respective
hierarchies over the global ontology.
Parameter Name    Data Type   Description
P_incoming_term   VARCHAR2
5.3.5.3.

Parameter Name    Data Type   Description
P_incoming_term   VARCHAR2
5.3.5.4.

Parameter Name        Data Type   Description
p_incoming_subject    VARCHAR2
p_incoming_property   VARCHAR2
p_incoming_object     VARCHAR2
5.3.6.2. This procedure interacts with the ontology reasoning service and draws
inferences based on degree of likelihood from it to expand the query triples. It also
interacts with the index lookup service to identify the most effective data sources that are
also relevant with a certain degree of likelihood.
5.3.6.3. This procedure ranks the relevant data sources based on the score obtained for the user's query.

Parameter Name | Data type | Description
p_incoming_source | VARCHAR2 | Relevant data sources that are to be ranked.
p_ranking_order | VARCHAR2 | Sort order: DESC (descending) or ASC (ascending).
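A hedged Python sketch of the ranking behavior implied by the parameter table above; the function name and the sample scores are assumptions, not taken from the thesis.

```python
def rank_sources(scored_sources, p_ranking_order="DESC"):
    """Sort (source_uri, similarity_score) pairs by score.
    p_ranking_order: 'DESC' for descending, 'ASC' for ascending."""
    descending = p_ranking_order.upper() == "DESC"
    return sorted(scored_sources, key=lambda pair: pair[1], reverse=descending)

ranked = rank_sources([("src:A", 0.4), ("src:B", 0.9), ("src:C", 0.7)])
# with DESC, the highest-scoring source comes first
```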
We have highlighted the Oracle implementation of the ontologies and RDF data. The design and implementation of the proposed architecture, along with their issues, have been discussed in detail.
CHAPTER 6
System Specification
System Processor: Pentium-IV 2.4 GHz
RAM: 1 GB
HDD: 80 GB
Operating System: Windows 2003 (with Service Pack 2)
Tool: Oracle Spatial 10g Release 2 NDM
Language: PL/SQL

6.2. Evaluation Criteria
The main aim of this evaluation is to validate whether the proposed architecture for relevance reasoning can scale up to a large number of data sources and complex queries. To quantitatively measure the performance of relevance reasoning, different evaluation measures have been used, which are discussed in the subsequent sections. The evaluation criteria for our system are listed below:

6.2.1. Response Time of Query Execution: to ensure that the manipulation of RDF triples does not degrade query response time during relevance reasoning as the number of sources in the system increases.
6.2.2. Accuracy of the Relevant Source Selection: to ensure that the provision of semantics does not affect the accuracy of the proposed methodology. This can be checked by calculating the precision and recall of the system for relevance reasoning. Precision is defined as the ratio of relevant data sources retrieved to the total number of retrieved data sources [41]:

Precision = |relevant sources ∩ retrieved sources| / |retrieved sources|

whereas recall is defined as the proportion of relevant data sources that are retrieved [41]:

Recall = |relevant sources ∩ retrieved sources| / |relevant sources|
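The two measures can be computed directly as set ratios; the following Python snippet illustrates them on invented source sets (the data is not from the experiment):

```python
def precision(relevant, retrieved):
    """Relevant sources retrieved / total retrieved sources."""
    return len(relevant & retrieved) / len(retrieved)

def recall(relevant, retrieved):
    """Relevant sources retrieved / total relevant sources."""
    return len(relevant & retrieved) / len(relevant)

relevant  = {"src1", "src2", "src3", "src4"}
retrieved = {"src2", "src3", "src5"}
p = precision(relevant, retrieved)  # 2 of 3 retrieved are relevant
r = recall(relevant, retrieved)     # 2 of 4 relevant are retrieved
```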
6.3. Data Specification

The experiment has been carried out with a corpus of 100 manually generated data sources. Each data source contains 30-50 RDF triples. The well-known university ontology has been used as the domain ontology in the experiment [1, 42].
6.4. Test Queries

We executed 35 different queries related to the students, faculty, and research associates data, and performed an accuracy test of the proposed architecture over these test queries. We comparatively analyzed our system against the MiniCon algorithm [1], observing the precision and recall of both systems. From these 35 queries, we selected 3 queries, having 3, 6, and 9 RDF triples respectively, to test system efficiency by measuring query response time. These queries are given below:

Query 1: Find the names of all instructors who are teaching a course to the same student whom they advise.

RDF Pattern of Query 1:
(?instructor :isTeaching ?course) (?student :isRegisteredIn ?course) (?instructor :isAdvisorOf ?student)
6.5.

We analyzed the query response time from three dimensions. First, queries were executed against the local ontologies of the data sources in the source description storage, and we assessed the time taken by the relevance reasoner to traverse the local ontologies for relevant source selection. Second, as our proposed methodology employs a bitmap index in which source descriptions are mapped semantically into bitmap segments as bits, we submitted the queries to the relevance reasoner using the bitmap index and assessed the time taken. Finally, we extended the bitmap index, implemented function-based indexing over it, and then analyzed the performance of the system. Figures 21, 22, and 23 illustrate the performance of the system with the 3 queries shown in the preceding section.
Figure 24 Performance gain of the system with respect to direct ontology traversal
6.6.

We measured the precision and recall of our proposed methodology and compared them with the MiniCon algorithm [1]. As the MiniCon algorithm directly traverses the source descriptions, we did not implement it; rather, we used the same approach to develop code that traverses the local ontologies. Since our proposed semantic matching process also searches for synonyms, lexical variants, subclasses, and degree of likelihood, the comparison showed an increase in both precision and recall with respect to the MiniCon algorithm.
Figure 25: Precision vs. Recall comparison of the proposed methodology with the MiniCon algorithm
In this chapter we have presented the evaluation results of the developed prototype system. Different evaluation criteria were identified for system evaluation, and we compared the results of the prototype system with existing systems. The comparison showed that the system has better query response time and accuracy of source selection compared with the existing systems.
CHAPTER 7
Discussion
An exponential growth in online data sources due to advancements in information and
semantic operators. It also creates the rule-base to define rules and manages the rules index to perform inference and reasoning during the semantic matching process. The source registration workflow manages the local ontologies of data sources in the source description storage. As new sources enter and leave the system, the index management service synchronizes the bitmap index to reflect the new status of the source description storage. In order to answer queries precisely, the bitmap index needs to be synchronized/updated with the source description storage.
The query execution workflow takes the user's query, formulated in RDF triples, and identifies the most effective and relevant data sources for the given query. During relevance reasoning, queries are expanded using inferences drawn from the ontology reasoning service. The workflow calculates the semantic similarity between the query and source RDF triples and identifies the relevant and effective data sources. Relevant data sources are ranked based on the similarity score they obtain for the user's query. The sorted list of relevant and effective data sources is returned to the query rewriting component, which reformulates the queries for these relevant data sources.
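The scoring step of this workflow can be sketched as follows, assuming expanded query triples are represented by their bit positions in the global bitmap pattern. The similarity used here (overlap fraction) is a deliberate simplification of the thesis's semantic similarity, and all names and data are illustrative.

```python
def similarity(query_positions, segment_bits):
    """Fraction of the expanded query's triple positions that are set
    in a source's bitmap segment."""
    if not query_positions:
        return 0.0
    hits = sum(1 for pos in query_positions if segment_bits[pos] == 1)
    return hits / len(query_positions)

segments = {
    "src:students": [1, 1, 0, 1],  # bitmap segments per source (invented)
    "src:courses":  [0, 1, 0, 0],
}
query_positions = [0, 1, 3]  # bit positions of the expanded query triples
scores = {uri: similarity(query_positions, bits) for uri, bits in segments.items()}
```

Sources with a score of zero can be discarded immediately; the rest are passed to the ranking procedure described in Section 5.3.6.3.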
7.2.

reasoning. The accuracy tests of the system showed improved precision and recall compared with the MiniCon algorithm [1].
The second contribution of the proposed methodology is the provision of optimization during relevance reasoning with the help of a bitmap index. Previously, the community used bitmap indexes for managing bulk data in the data warehouses of relational models, whereas we use the bitmap index to represent RDF models. The bitmap index is used during relevance reasoning and improves the whole process by traversing the mapped RDF data efficiently. The time complexity tests showed that bitmap indexing performs relevance reasoning in comparatively shorter time.
7.3. Future Direction

Currently our focus is on centralized bitmap indexing in data integration systems, where a single global ontology resides on one node and queries are reformulated over it. As P2P DBMSs are evolving and data integration is gaining popularity in these domains, this methodology can in future be extended to meet the requirements of P2P data integration. Index partitions may reside on each peer, and collectively they will all participate in relevance reasoning during query processing.
REFERENCES
[1] A. Halevy, A. Rajaraman, J. Ordille. Data Integration: The Teenage Years. In Proceedings of the 32nd International Conference on VLDB, pages 9-16, September 2006.
[2]
[3]
[4]
[5]
[6] I. F. Cruz and H. Xiao. The Role of Ontologies in Data Integration. Journal of Engineering Intelligent Systems, pages 245-252, December 2005.
[7]
[8]
[9]
[10] Y. Arens, C. N. Hsu, et al. Query Processing in the SIMS Information Mediator. In Readings in Agents, pages 82-90, Morgan Kaufmann Publishers Inc., San Francisco, USA, 1997.
[11]
[12]
[13]
[14]
[15] A. Y. Halevy. Answering Queries Using Views: A Survey. The VLDB Journal, pages 270-294, 2001.
[16]
[17]
[18]
[19] J. Zhong, H. Zhu, et al. Conceptual Graph Matching for Semantic Search. In Proceedings of the 10th International Conference on Conceptual Structures (ICCS), LNCS 2393, pages 92-106, Bulgaria, July 2002. Springer.
[20] A. Y. Halevy. Why Your Data Won't Mix: Semantic Heterogeneity. ACM Queue 3, pages 50-58, 2005.
[21] RDF Primer. W3C. http://www.w3c.org/RDF/
[22] Waris Ali, Sharifullah Khan. Global Query Generation over Diverse Data Sources Using Ontology. In 1st International Conference on Information and Communication Technologies, 9 June 2007, Bannu, N.W.F.P., Pakistan.
[23] Nicole Alexander, Siva Ravada. RDF Object Type and Reification in the Database. In Proceedings of the 22nd International Conference on Data Engineering (ICDE 2006). IEEE Computer Society, 2006.
[24]
[25] Recommendation, 10th February 2004.
[26]
[27]
[28]
[29]
W3C
[30]
2000,
[31]
[32] B-Tree and Bitmap Indexing. Oracle Developer Guide 10g Release 2, Part No. A969505-01, Oracle Corporation, March 2002.
[33]
[34]
[35]
[36]
[37]
[38]
[39] F. N. Afrati, C. Li, and J. D. Ullman. Generating Efficient Plans for Queries Using Views. In ACM SIGMOD International Conference on Management of Data, Santa Barbara, CA, May 2001.
Markup Language Home Page. August
[40]
[41]