
Storing, Managing and Query Processing in Graph

Databases
Isha Arora and Prudhvini Putta
1. ABSTRACT
We live in a world where technology, and with it the amount of data about everything, is growing at a very fast pace. Take the example of an online shopping website, where the sellers of a product are grouped and the details of every seller who has that product in stock are listed on the site. Such a site has to store a huge amount of data for each product and link it to further data holding the seller details. If we try to use a relational database management system to store and manage this data, we encounter many difficulties in changing the schema and in retrieving the data. Hence the need for a technique different from an RDBMS to store and query the data efficiently: managing the huge number of joins and the data manipulation becomes tedious and challenging because each product is related to many sellers, and similarly for every other product.
One solution for managing data in which relationships are central is a graph database management system. A growing number of companies have started using graph databases to solve problems such as fraud detection, access control management and drug preparation. Current research mainly involves two kinds of graph databases. The first kind contains a single very large graph, such as a web graph or a social network; query processing here involves finding the best path between two nodes and mostly means processing subgraph queries. The second kind consists of a large number of small graphs and is popular in scientific domains. In this paper we explain how to store, manage and query the second kind of graph database.
2. INTRODUCTION
Graph Database: A graph database is a collection of nodes and edges. Each node represents an entity and each edge represents the relationship between two nodes. A node is the smallest storage unit in a graph database, and the edges are the relationships among these small storage units. Graph databases are very efficient when the relationships between data items are first-class citizens. They provide index-free adjacency, which means that every node has a direct link to its adjacent elements. As the amount of data generated by companies like Amazon and Google is beyond the capacity of relational databases, graph databases provide an efficient alternative for storing and querying it. When the data to be stored is irregular, i.e., it varies from one node to another and can be represented in the form of a graph, we can use graph databases efficiently.
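To make the index-free adjacency idea concrete, here is a minimal sketch of a property-graph store in which each node holds direct references to its own edges, so traversal never consults a global index. All class and field names are illustrative assumptions, not any real graph database's API; the example reuses the product/seller scenario from the abstract.

```python
# Minimal property-graph sketch with index-free adjacency:
# each node keeps its own edge list, so finding neighbors is a
# direct pointer walk, not a lookup in a separate index.

class Node:
    def __init__(self, node_id, **properties):
        self.id = node_id
        self.properties = properties
        self.out_edges = []            # direct links to adjacent elements

class Edge:
    def __init__(self, rel_type, source, target):
        self.rel_type = rel_type
        self.source, self.target = source, target
        source.out_edges.append(self)  # index-free adjacency

def neighbors(node, rel_type=None):
    """Follow the node's own edge list -- no global index consulted."""
    return [e.target for e in node.out_edges
            if rel_type is None or e.rel_type == rel_type]

# A product linked to two sellers, as in the shopping-site scenario.
product = Node("p1", name="camera")
s1, s2 = Node("s1", name="SellerA"), Node("s2", name="SellerB")
Edge("SOLD_BY", product, s1)
Edge("SOLD_BY", product, s2)
print([n.properties["name"] for n in neighbors(product, "SOLD_BY")])
```

In an RDBMS the same lookup would be a join over a product-seller table; here it is a constant-time walk of the product node's edge list.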
When we talk about graphs, there are two types:
Static graphs: the nodes in the graph do not change with time, but the relationships do. For example, in an access management system each user has access to a set of resources; the user remains the same over time, but the access granted to their profile may change.
Dynamic graphs: new data is added to the graph over time, almost every second, and with this new data the relationships of the existing nodes also grow.

Managing such dynamically changing data is challenging: we have to take input every second, modify the storage based on it, and still support efficient query processing.
The following are a few applications of graph databases:

Social networking
Recommending products based on a user's purchase history
Inferring relationships in case of social networks
Access control systems in large companies

Though graph databases are efficient at retrieving data from large datasets, it is important to consider how to manage large-scale graph data that changes dynamically. For query processing in graph databases, highly efficient query processing and indexing techniques are essential, as graph databases deal with large amounts of data.
2.1 Managing dynamic graph databases
To date we have many algorithms for static graphs, but they cannot be used for dynamically changing graphs, which have one more dimension to take into account: time.
Because the nodes and relationships of a dynamic graph change over time, the data grows unevenly and at a very fast rate. The central requirement in managing a dynamically changing graph is managing the indexing, since it is not permanent: one index has to be linked to another so that related nodes always stay connected. In addition, the data must be stored and distributed over the different computing machines in a way that reduces search complexity.
We also need a means of growing the graph data such that redistributing it over the machines again and again remains feasible. The distribution should optimize both the overall output and the performance of each individual machine.
There are three problems in managing graph databases:

Distributing the data over the nodes of a cluster so that computation takes less time and every node's performance is optimal.
How to efficiently answer the reachability of a node from other nodes.
How a subgraph inside a huge dynamic graph can be accessed to answer queries with low latency.

2.2 Query Processing in Dynamic Graph Databases


There are several ways in which graph databases can be queried. Since many different types of applications use different types of graph databases in distinct ways, there are multiple ways to query a graph database. Many query languages have been proposed over the past 25 years, but there is no standard language for this purpose.
G, proposed in the 1980s, was the first graph query language. Several others, such as UNQL and SoSQL, have appeared since, but an efficient way to query the database remains essential. These query languages take different types of input and generate different graph queries.
Due to the high complexity of graph data, the cost of query processing in graph databases is still far from the best achievable cost. It is also expensive because it involves subgraph isomorphism, which is an NP-complete problem, and there are difficulties in processing supergraph queries, subgraph queries and similarity queries alike. Querying a graph database involves two steps: generating a candidate set and verifying the candidates. To reduce the cost, we can build indexes that cut down the number of subgraph isomorphism tests. Though this reduces the number of false results, the candidate sets that remain must still be verified. Several indexing techniques exist that improve query processing; among them, FG-index notably reduces the time and memory consumption of the candidate verification process. However, FG-index has a few problems, which FG*-index tries to solve. We present FG*-index, an index designed by researchers to minimize the cost of subgraph query processing.
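The two-step "filter then verify" pattern described above can be sketched in a few lines. As a hedged simplification, graphs are represented as frozensets of labeled edges so that "q is a subgraph of g" becomes subset containment; a real system would run true subgraph isomorphism tests in the verification step. All names and the toy database are illustrative.

```python
# Filter-then-verify: an index prunes the database down to a small
# candidate set, and only the candidates undergo the (expensive)
# containment check standing in for subgraph isomorphism.

def build_edge_index(database):
    """Map each distinct edge label to the ids of graphs containing it."""
    index = {}
    for gid, graph in database.items():
        for edge in graph:
            index.setdefault(edge, set()).add(gid)
    return index

def query(q, database, index):
    # Filtering: a candidate must contain every edge of q (cheap set ops).
    candidates = None
    for edge in q:
        ids = index.get(edge, set())
        candidates = ids if candidates is None else candidates & ids
    # Verification: confirm real containment on surviving candidates only.
    return sorted(g for g in (candidates or set())
                  if q <= database[g])

db = {1: frozenset({"A-B", "B-C"}),
      2: frozenset({"A-B"}),
      3: frozenset({"B-C", "C-D"})}
idx = build_edge_index(db)
print(query(frozenset({"A-B"}), db, idx))        # graphs containing A-B
print(query(frozenset({"A-B", "B-C"}), db, idx))
```

Indexes such as FG-index refine exactly this filtering step so that fewer candidates survive to the costly verification phase.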
3. MOTIVATION
A traditional relational data model is not suitable for data in which the relationships between items are very important; a graph database can be used as an alternative when relationships matter more. But even though graph databases are efficient, storing and querying a huge amount of related data is still a big task, and efficient storage management and query processing techniques must be used to deal with it. This paper deals with the problems of storing and querying large sets of graphs and with solutions to those problems.
4. METHODS TO SOLVE EXISTING PROBLEMS
4.1 Methods to Solve Problems in Managing Dynamic Graph Databases
For a graph database it is necessary to store and distribute the data over a set of machines to make computation tractable. Two things must be taken care of in this distribution: first, the cost that arises when one node communicates with another over the connecting links; second, how much each node is utilized in every communication that takes place during query processing. Node utilization covers both the case where a node communicates with another node in the same subgraph and the case where it is linked to and communicates with a node in a different subgraph or in the parent graph.
For distribution, we first look at why some techniques are unsuitable, and then at techniques that work well with graph databases. The first technology that comes to mind for distributed computing is MapReduce, a parallel computing technique in which different processors communicate and a separate thread runs on each machine; many threads run at the same time to produce the output of the parallel computation. But MapReduce is a poor fit for graph databases: we would need to read and write the fine-grained state of the graph in the map step and compute counts or other aggregates in the reduce step. Using a map this way leads to many I/Os for every small operation, because the nodes are linked to each other in all possible ways and, in dynamic graphs, the data changes every second.
Node utilization can be kept high by randomly distributing the nodes of the graph, but this leads to a lot of messages being exchanged between nodes, whether within the same subgraph or with adjacent sub- or supergraphs. To reduce communication time we can instead divide the graph into highly connected subgraphs, so that fewer messages need to cross machines to produce a result, but this increases the idle time of the nodes. In both scenarios, the bulk-partitioning style of MapReduce is not helpful.
What is needed is a distribution method that spreads the data among the nodes while taking the additional dimension, time, into account in the case of dynamic graphs. An approach that partitions the graph so that the number of exchanged messages drops while every node's utilization rises would be one of the best ways to deal with this type of data.
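The two costs being traded off here can be made concrete with a small sketch: cut edges (edges whose endpoints land on different machines, each implying message exchange) versus how evenly the nodes are spread. This is purely illustrative; real partitioners optimize both objectives jointly and incrementally.

```python
# Sketch of the two partitioning costs: cross-machine messages
# (cut edges) and per-machine load balance.

def cut_edges(edges, assignment):
    """Edges whose endpoints live on different machines -> messages."""
    return [(u, v) for u, v in edges if assignment[u] != assignment[v]]

def load_per_machine(assignment):
    load = {}
    for node, machine in assignment.items():
        load[machine] = load.get(machine, 0) + 1
    return load

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")]
# Placing the tightly connected a, b, c together cuts fewer edges than
# a random split, at the price of an uneven load -- exactly the
# trade-off described above.
clustered = {"a": 0, "b": 0, "c": 0, "d": 1}
print(len(cut_edges(edges, clustered)), load_per_machine(clustered))
```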
Partitioning the components of a subgraph requires repartitioning whenever nodes are added or deleted. Repartitioning incurs a huge cost, so it should be done with utmost care: only once a node crosses the communication and utilization thresholds within its subgraph should it be repartitioned and moved to another subgraph.
One repartitioning approach observes the communication pattern between two nodes: if they start communicating, a copy of the node is added to the other subgraph it is communicating with, reducing the number of messages exchanged per query. The concept is similar to that of B-trees, which maintain a replica of a node elsewhere when it is linked from another part of the structure. Just as a copy of a deleted element is kept at all levels of a B-tree except the lowest child level so that the links can be maintained, this approach can be used in a graph database to lower the risk of data loss and keep a node's related links untangled.
Reachability query in dynamic graphs
Reachability refers to the path of links by which one node can be reached from another node in a graph. In dynamic graphs, the data is stored and analyzed by taking snapshots of the graph at different time intervals; the start and end times of a snapshot define the state of the graph in that period.
Take a graph G of which snapshots G1, G2, ..., Gn are captured at n different time intervals. Alongside the reachability query itself, we need the difference between two snapshots, i.e., the operations and changes applied to the same nodes of G between them.
Let Df denote the function that computes the difference between two snapshots; for snapshots G5 and G6, the difference is Df(G5, G6).
For the 5th snapshot, check whether node a is reachable from node b, denoted Gab(a, b, 5), and likewise the reachability at the 6th snapshot, Gab(a, b, 6). Df(G5, G6) then represents the change between the two snapshots. A threshold can be set on this difference: if the difference exceeds the threshold, node a can be repartitioned.
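The snapshot-based scheme above can be sketched as follows: each snapshot G_i is an edge set, reach(G_i, a, b) answers the reachability query at snapshot i via BFS, and Df counts the edges added or removed between two snapshots. The concrete snapshots and the threshold value are illustrative assumptions.

```python
from collections import deque

def reach(snapshot, a, b):
    """BFS over one snapshot's edge set: is b reachable from a?"""
    adj = {}
    for u, v in snapshot:
        adj.setdefault(u, []).append(v)
    seen, frontier = {a}, deque([a])
    while frontier:
        u = frontier.popleft()
        if u == b:
            return True
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                frontier.append(v)
    return False

def Df(g_i, g_j):
    """Difference between two snapshots: edges added plus edges removed."""
    return len(g_i ^ g_j)

G5 = {(1, 2), (2, 3), (3, 4)}                    # 5th snapshot
G6 = {(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)}    # 6th snapshot
print(reach(G5, 1, 4), reach(G5, 1, 6), reach(G6, 1, 6))

REPARTITION_THRESHOLD = 1                        # illustrative value
if Df(G5, G6) > REPARTITION_THRESHOLD:
    pass  # repartition the affected nodes, as described above
```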
Modelling dynamically changing graph databases.
Minute details of a graph database are studied by capturing the state of the graph between two time intervals; each captured state is called a frame. A frame describes how a node in the database is linked to other nodes and how the two interact or relate to each other.
Each node has a sequence of frames that evolves over time. The frames are ordered from first to last, and a next pointer leads from each frame to the following one in the list. Some nodes act as actors: they are connected to the frames, and the actions recorded in a frame are performed through these actor nodes. Each interaction between two nodes is denoted by a distance, which represents the relationship between the two nodes.
The attributes of the actor and interaction nodes are added to each frame, so the moment a snapshot is taken we have the complete details of the participants of that graph.
The actor nodes have a sequence of their own, analogous to the frame sequence, which records the links between actor nodes; the interaction nodes likewise have such a sequence. All of these are marked with links showing the order of the frames, the actors and the interactions between them within a time interval.
Modelling a graph this way makes all nodes highly connected and thereby reduces the time needed to query and manage the database.
When a new node is added in a frame, all the links of the actors and interactions in that time interval are modified and then analyzed, so a small frame is always a consistent part of the overall graph database. Similarly, when a node is deleted from the graph, all its related links can be removed using the frame concept. This approach reduces communication time and improves the response time of queries.
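The frame model described above can be sketched as a linked list of frames, each recording which actor nodes interacted and at what distance. The class and field names are illustrative assumptions about the model, not a fixed specification.

```python
# Illustrative sketch of the frame model: each captured state is a
# Frame with a `next` pointer to the following frame in time, and each
# frame records (actor, actor, distance) interactions.

class Frame:
    def __init__(self, label):
        self.label = label
        self.next = None            # pointer to the next frame in time
        self.interactions = []      # (actor_a, actor_b, distance)

    def add_interaction(self, actor_a, actor_b, distance):
        self.interactions.append((actor_a, actor_b, distance))

def link(frames):
    """Chain frames first-to-last via their next pointers."""
    for earlier, later in zip(frames, frames[1:]):
        earlier.next = later
    return frames[0]

f1, f2 = Frame("Frame 1"), Frame("Frame 2")
f1.add_interaction("Actor 1", "Actor 2", distance=1)
f2.add_interaction("Actor 1", "Actor 3", distance=2)
head = link([f1, f2])

# Walk the frame sequence to replay how the graph evolved over time.
frame, history = head, []
while frame:
    history.append((frame.label, frame.interactions))
    frame = frame.next
print(history)
```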

Data model of dynamically changing graph

[Figure: bar chart of actor-node utilization (vertical axis 0-6) across four captured states, Frame 1 to Frame 4.]

Here, Frame 1 to Frame 4 are the different states of the graph captured over different time intervals. Actor 1, Actor 2 and Actor 3 are shown as differently colored bars, and the utilization of each actor node is given by the height of its bar in each frame.

4.2 Methods to Solve Problems in Querying Graph databases


4.2.1 Definitions and Notations
Subgraph query: given a query graph q, the answer set of a subgraph query consists of all graphs g in the database such that g is a supergraph of q (equivalently, q is a subgraph of g).
Frequent Subgraph (FG): given a database D containing a set of graphs, a graph g is called a frequent subgraph if the number of graphs in D that are supergraphs of g is greater than or equal to a frequency threshold; otherwise it is a non-frequent subgraph (NFG). For a query graph q there will be both frequent and non-frequent subgraphs.
Closed Frequent Subgraph (CFG): a graph g from the set of FGs is a CFG if there exists no other graph in the set of FGs that is a supergraph of g and whose frequency equals that of g.
Maximal Frequent Subgraph (MFG): a graph g from the set of FGs is an MFG if there exists no other graph in the set of FGs that is a supergraph of g.
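The definitions above can be checked on a toy database. As a hedged simplification, graphs are represented as frozensets of edge labels so that "subgraph of" becomes subset containment, standing in for real subgraph isomorphism; the database and threshold are invented for illustration.

```python
# Classify candidate graphs as FG, CFG and MFG under the definitions
# above, using subset containment as a stand-in for subgraph
# isomorphism.

def frequency(g, database):
    """Number of database graphs that are supergraphs of g."""
    return sum(1 for G in database if g <= G)

def classify(candidates, database, min_freq):
    fgs = [g for g in candidates if frequency(g, database) >= min_freq]
    # CFG: no supergraph in the FG set with the SAME frequency.
    cfgs = [g for g in fgs
            if not any(g < h and
                       frequency(h, database) == frequency(g, database)
                       for h in fgs)]
    # MFG: no supergraph in the FG set at all.
    mfgs = [g for g in fgs if not any(g < h for h in fgs)]
    return fgs, cfgs, mfgs

db = [frozenset({"A", "B"}), frozenset({"A", "B", "C"}), frozenset({"A"})]
cands = [frozenset({"A"}), frozenset({"A", "B"}), frozenset({"A", "B", "C"})]
fgs, cfgs, mfgs = classify(cands, db, min_freq=2)
print(fgs, mfgs)
```

With threshold 2, {A} (frequency 3) and {A, B} (frequency 2) are frequent; {A, B} is maximal because no frequent supergraph of it exists, while {A} is closed but not maximal.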
4.2.2 Idea behind FG*-Index
In the FG-indexing technique, we collect the subgraphs of the query graph q, generate answer sets for the FGs, and perform verification on the candidate sets generated for the NFGs. The number of FGs generated depends on the minimum frequency threshold σ, whose value lies between 0 and 1; FG-indexing partitions the graph space into clusters based on the frequency tolerance factor δ and forms a tree structure. If σ is very small, the number of FGs generated is very large. FG*-index overcomes this problem by introducing two indexes, the feature index and the FAQ-index. The idea behind the feature index is to define a set of features that extract structural information from the subgraphs; since the feature index can find a matching subgraph without performing subgraph isomorphism, the number of candidate verifications required is reduced. Conversely, if σ is very high, the number of NFGs increases, and with it the number of candidate verifications required. To overcome this problem, FG*-index introduces the FAQ-index: the set of queries is modeled as a stream, FAQs (Frequently Asked non-Frequent subgraph Queries) are found in a dynamically changing sliding window, and an FAQ-index is created on the set of FAQs. With the help of the FAQ-index, the number of candidate sets to be verified is reduced. Thus the two drawbacks of FG-indexing are solved by FG*-indexing.
4.2.3 FG-indexing
Given a query graph q and a database D of graphs, q is an FG query if the frequency of q is greater than or equal to the product of σ (the minimum frequency threshold) and the number of graphs in D; otherwise it is a non-FG (non-frequent subgraph) query. FG-index uses δ (the frequency tolerance factor) to cluster the frequent subgraphs into a structured tree index. FG-index uses two indexes, the core index and the edge index, to process FG (frequent subgraph) queries and non-FG (non-frequent subgraph) queries respectively. For FG queries, a core index is formed based on δ and the frequency of each graph in the set of FGs, yielding a tree-structured index built on the set of FG queries. The core index cannot be used for non-FG queries, so FG-index uses the edge index, built on all the infrequent distinct edges, for those.
Steps involved in FG-indexing:
1. Mine the given database and retrieve all frequent subgraphs based on σ.
2. Generate δ-tolerant Closed Frequent Subgraphs.
3. Create the core index.
4. Create the edge index.


If the query graph q is an FG query, FG-indexing probes the core index built on all FGs and collects the IDs of all FGs that have the same edges as q. Core indexing uses an inverted graph index at the root node; in FG-index the root node is kept in main memory while all the remaining nodes are stored on disk, which incurs I/O cost. If q is not an FG query, FG-indexing takes the set of all distinct frequent edges and maintains, for each, a list of IDs of the NFGs containing that edge. Every NFG is assigned a score depending on the number of frequent edges matched, and the NFGs with the highest match scores are compared with the query graph to generate the candidate set. Though FG-index is one of the best indexing techniques available, it has a few problems. The response time when FG-index is used is:
Response time = mining time + (number of candidate answers × I/O access time) + (number of candidate answers × verification time).
The major limitation of FG-indexing is that queries must be frequent subgraphs in order to be answered efficiently.
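The edge-index path for non-FG queries described above can be sketched as follows: each distinct edge maps to the IDs of graphs containing it, each candidate is scored by how many of the query's edges it matches, and the highest-scoring candidates are verified first. The data and function names are illustrative assumptions, not FG-index's actual structures.

```python
# Sketch of edge-list scoring for non-FG queries: candidates with more
# matched edges are ranked first for verification.

def build_edge_lists(graphs):
    """edge label -> ids of graphs containing that edge."""
    lists = {}
    for gid, edges in graphs.items():
        for e in edges:
            lists.setdefault(e, []).append(gid)
    return lists

def score_candidates(query_edges, edge_lists):
    scores = {}
    for e in query_edges:
        for gid in edge_lists.get(e, []):
            scores[gid] = scores.get(gid, 0) + 1
    # Highest edge-match count first; these are verified before others.
    return sorted(scores, key=scores.get, reverse=True)

graphs = {10: {"A-B", "B-C"}, 11: {"A-B"}, 12: {"C-D"}}
lists = build_edge_lists(graphs)
print(score_candidates({"A-B", "B-C"}, lists))  # best-matching ids first
```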
4.2.4 Implementing FG*-index
FG*-index is an improvement over FG-index. The cost of index probing and the cost of candidate verification are the limitations of FG-index that FG*-index resolves: it uses the feature index to reduce the cost of index probing by reducing the number of subgraph isomorphism tests required, and the FAQ-index to reduce the number of candidate-set verifications by indexing the non-FG queries that are asked frequently.
Feature index: a set of features is used to build the feature index, and the selection of features depends on domain knowledge. Features must be chosen so that their number is not large and so that they capture structural information about supergraphs of the query graph. The feature index maintains an FHI (Feature Hash Index) and a set of IFIs (Inverted Feature Indexes). The FHI maps every feature by its name into a hash table. For a given set of features and set of graphs, each IFI maintains a graph array and a feature array, and each feature keeps an array of the IDs of the graphs that contain it. Every subgraph is given a score based on the number of features it has in common with the query graph and is indexed accordingly.
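A hedged sketch of the two structures just described: a hash table (FHI) that maps feature names to features, and an inverted index (IFI) that maps each feature to the IDs of graphs containing it, so that candidates can be ranked by shared-feature count with hash probes only, no isomorphism tests. Feature names and the toy data are invented for illustration.

```python
# Feature-index sketch: FHI for name lookup, IFI for feature -> graph
# ids, and a ranking of graphs by features shared with the query.

def build_feature_index(graph_features):
    fhi = {}   # feature name -> feature (here the name itself)
    ifi = {}   # feature -> array of ids of graphs containing it
    for gid, feats in graph_features.items():
        for f in feats:
            fhi[f] = f
            ifi.setdefault(f, []).append(gid)
    return fhi, ifi

def rank_by_common_features(query_feats, fhi, ifi):
    counts = {}
    for f in query_feats:
        if f in fhi:                  # hash probe, no isomorphism test
            for gid in ifi[f]:
                counts[gid] = counts.get(gid, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)

feats = {1: {"triangle", "star"}, 2: {"star"}, 3: {"path4"}}
fhi, ifi = build_feature_index(feats)
print(rank_by_common_features({"triangle", "star"}, fhi, ifi))
```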
FAQ-index: for non-FG queries there should be a way to reduce the number of candidate-set verifications required, so FG*-index uses the FAQ-index for processing these non-exact-matching queries. Since the FAQ-index cannot consider the whole database for generating non-frequent subqueries, the non-exact-matching queries are treated as a stream of queries, and a sliding-window model controls the index size. The FAQ-index consists of an FAQ hash index and an inverted FAQ index (QHI), and it is updated as the sliding window moves. The resulting answer set is filtered according to a parameter called the Maximum Average Frequency.
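The sliding-window idea behind the FAQ-index can be sketched as follows: non-FG queries arrive as a stream, only the last W queries are counted, and a query becomes a FAQ once its in-window frequency passes a cutoff. The window size and cutoff here are illustrative parameters, not values from FG*-index.

```python
from collections import Counter, deque

class FaqWindow:
    """Count queries over a sliding window of the most recent arrivals."""

    def __init__(self, window_size, faq_cutoff):
        self.window = deque(maxlen=window_size)
        self.counts = Counter()
        self.faq_cutoff = faq_cutoff

    def observe(self, query):
        if len(self.window) == self.window.maxlen:
            # The oldest query slides out of the window; decay its count.
            self.counts[self.window[0]] -= 1
        self.window.append(query)
        self.counts[query] += 1

    def faqs(self):
        """Queries frequent enough in the current window to be indexed."""
        return {q for q, c in self.counts.items() if c >= self.faq_cutoff}

w = FaqWindow(window_size=4, faq_cutoff=2)
for q in ["q1", "q2", "q1", "q3", "q1"]:
    w.observe(q)
print(w.faqs())
```

As the window slides, old queries decay out and the FAQ set changes with it, which is why the FAQ-index must be updated as the window moves.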
FG*-index combines the FG-index, the feature index and the FAQ-index. The response time using FG*-index is as follows:
If there is at least one feature in common with q:
Response time = number of graphs with the feature × time per I/O operation.
If no feature is in common with q:
Response time = time taken to search the graphs + number of graphs with the feature × time per I/O operation.
5. EXPERIMENTS AND RESULTS
This section explains the results of experiments conducted on real data sets. The database is taken in chunks of 100K-1000K graphs, and queries are processed over them using different indexes, including FG*-index and FG-index. The experiments are first conducted using the feature index but without the FAQ-index, to test the impact of the feature index on FG*-index. Four metrics are used to evaluate the performance of FG*-index: the number of subgraph isomorphism tests required to process one query in the probing phase, the average response time, the amount of memory consumed, and the size of the candidate set. When a small number of features is used for finding frequent subgraphs, the number of non-FG queries increases and with it the response time; when many features are used, most of the time is spent finding the features themselves. FG*-index is observed to perform best when a moderate, well-chosen number of features is used. Without the FAQ-index, however, the memory and time consumption of FG*-index are the same as those of FG-index.
Further experiments were conducted with the FAQ-index. Since a window of items is used in building the FAQ-index, its memory and time consumption depend on the amount of memory allocated to the sliding window and on the number of items in it. Beyond a certain memory limit, FAQ-index performance does not improve even if the memory size is increased further. The FAQ-index reduced the number of candidate sets generated for non-FG queries and thereby the response time. With the FAQ-index included, FG*-index is observed to be five times faster than FG-index and two times faster than FG*-index without the FAQ-index.
6. CONCLUSION
In this paper we have discussed the problems in managing large graph databases. Even though there have been some developments and research in this field that solve these problems to an extent, there is still a need for better indexing and partitioning techniques, as the complexity and size of the data keep growing by orders of magnitude.
We have also presented FG*-index, an indexing technique proposed by researchers. Graph databases involve subgraph isomorphism, an NP-complete problem, so while processing subgraph queries FG-index is used to reduce the number of subgraph isomorphism tests required as well as the cost of candidate verification. FG*-index is a modified FG-indexing technique that further improves query processing by adding a feature index and an FAQ-index to the existing FG-index. FG*-index builds on a streaming algorithm for frequent itemset generation over a sample window, which helps it reduce the response time. From the experiments conducted we observed that FG*-indexing is many times faster, and more scalable, than other existing indexing techniques.
7. FUTURE RESEARCH
7.1 Future research in managing dynamic subgraphs
We have discussed partitioning the data for ease of computation and lower query latency. This research needs to be extended with the following in mind: putting a heavy load on each node adds the risk that if one node leaves the graph, all the data related to it is badly disturbed, and there should be an approach to fix this immediately; without one, all the other nodes in the graph may halt and the system slow down. Efficient search techniques and the storage and management of large amounts of data as vertices and edges require a high level of connectivity between nodes, together with low overhead on each node for maintaining its identity both as an individual and as part of the main graph.
7.2 Future research in processing subgraph queries
We discussed FG*-index, an indexing technique for querying graph databases that improves on FG-index. FG*-index can be improved further by improving the feature-index and FAQ-index algorithms. When the database is updated very frequently, FG*-index takes more time because it needs to rebuild both the feature and FAQ indexes, which increases its response time. Instead of using two indexes, FG*-index could be modified to use a single indexing technique that takes care of both features and candidate sets, thereby reducing memory consumption and response time.

8. REFERENCES
[1] J. Cheng, Y. Ke and W. Ng. 2008. Efficient query processing on graph databases. ACM Transactions on Database Systems, Vol. V, No. N.
[2] J. Cheng, Y. Ke, A. Lu and W. Ng. 2007. FG-index: towards verification-free query processing on graph databases. In SIGMOD Conference, 857-872.
[3] Graph Databases. O'Reilly Media.
[4] P. Barceló, University of Chile. Querying graph databases.
[5] P. T. Wood, University of London. Query languages for graph databases.
[6] A. Fard, Amir, Lakshmish and J. A. Miller, University of Georgia. Towards efficient query processing on massive time-evolving graphs.
[7] P. Wayner. NoSQL standouts: new databases for new applications (article).
[8] Representing time-dependent graphs in Neo4j (GitHub).
