
ARMY INSTITUTE OF TECHNOLOGY

2018-2019

SEMINAR PRESENTATION
ON
OPTIMIZING PHYLOGENETIC QUERIES FOR
PERFORMANCE

Presented By: Srishti Sachan


(T150224305)
Guided By: Prof. Praveen Hore
ABSTRACT

• Visual query language, called PhyQL


• Developed a range of Pruning aids
• Hybrid Optimization
• “Fail Soonest” strategy
Introduction to Query Optimization
• Query: A query is a request for information from a database.
• Query Plans: A query plan (or query execution plan) is an ordered set of steps used to
access data in a SQL relational database management system.
• Query Optimization: A single query can be executed through different algorithms or
re-written in different forms and structures. Hence, the question of query
optimization comes into the picture – Which of these forms or pathways is the most
optimal? The query optimizer attempts to determine the most efficient way to
execute a given query by considering the possible query plans.
There are broadly two ways a query can be optimized:
1. Analyze and transform equivalent relational
expressions: Try to minimize the tuple and column counts
of the intermediate and final query results.

2. Use different algorithms for each operation: These
underlying algorithms determine how tuples are accessed
from the data structures they are stored in (indexing,
hashing, data retrieval) and hence influence the number of
disk and block accesses.
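The first strategy can be illustrated with a small Python sketch. It is only a toy under stated assumptions: the relations `SPECIES` and `SAMPLES`, their columns, and the two plan functions are hypothetical and not taken from any real system. Both plans return the same rows, but filtering before the join keeps the intermediate result much smaller.

```python
# Hypothetical toy relations; names and data are illustrative only.
SPECIES = [{"id": i, "genus": "Homo" if i % 10 == 0 else "Other"} for i in range(100)]
SAMPLES = [{"species_id": i % 100, "reading": i} for i in range(300)]


def join(left, right, left_key, right_key):
    """Nested-loop equi-join that returns merged rows."""
    return [{**l, **r} for l in left for r in right if l[left_key] == r[right_key]]


def plan_join_then_filter(species, samples):
    """Plan A: join everything, then apply the selection."""
    joined = join(species, samples, "id", "species_id")
    result = [row for row in joined if row["genus"] == "Homo"]
    return result, len(joined)  # rows, intermediate tuple count


def plan_filter_then_join(species, samples):
    """Plan B: push the selection below the join."""
    filtered = [row for row in species if row["genus"] == "Homo"]
    result = join(filtered, samples, "id", "species_id")
    return result, len(filtered)  # rows, intermediate tuple count


result_a, intermediate_a = plan_join_then_filter(SPECIES, SAMPLES)
result_b, intermediate_b = plan_filter_then_join(SPECIES, SAMPLES)
print(len(result_a), intermediate_a, intermediate_b)
```

The two plans are equivalent relational expressions; an optimizer would prefer Plan B because its intermediate relation holds far fewer tuples.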
INTRODUCTION TO PHYLOGENETIC
QUERIES
• Phylogenetic queries take as input a phylogenetic tree and
attempt to locate trees in the database that match it in some
specified way.
• A phylogenetic tree is a diagram that represents evolutionary
relationships among organisms. Phylogenetic trees are
hypotheses, not definitive facts.
• Alongside PQL and Crimson, PhyQL is one of a handful of
languages that allow declarative querying, and, owing to its
declarative foundation, the only one that supports composable
structure and pattern queries visually.
• A set of homogeneous phylogenies (trees) is
called a collection (or forest), and a
PhyloBase database is essentially a set of
such collections.
EXAMPLE OF PHYLOGENETIC TREES
A MODEL FOR REPRESENTING AND QUERYING
PHYLOGENIES

1. PhyloBase Data Model


PhyloBase is capable of modeling phylogenies as sets of multi-modal
trees with hybridization or horizontal transfers across nodes
within a phylogeny.
L (PhyQL) => (I,V,L,C,λ)

2. Persistent Storage Model


One simple model is to store each edge of a tree as a binary pair,
possibly indexed with a tree identifier to recognize its membership
in a tree.
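A minimal Python sketch of this edge-based storage model follows. The row layout `(tree_id, parent, child)`, the sample data, and the helper names are assumptions made for illustration; the slides do not specify the actual PhyloBase schema.

```python
from collections import defaultdict

# Each stored row is one edge tagged with its tree identifier (hypothetical data).
EDGES = [
    ("t1", "r", "x"), ("t1", "x", "a"), ("t1", "x", "b"), ("t1", "r", "c"),
    ("t2", "r", "a"), ("t2", "r", "b"),
]


def build_index(edges):
    """Group edges by tree id, mapping each parent to its list of children."""
    index = defaultdict(lambda: defaultdict(list))
    for tree_id, parent, child in edges:
        index[tree_id][parent].append(child)
    return index


def root_of(index, tree_id):
    """The root is the only node that is a parent but never a child."""
    tree = index[tree_id]
    children = {c for kids in tree.values() for c in kids}
    return sorted(set(tree) - children)[0]


def leaves_of(index, tree_id):
    """Leaves are nodes that appear as a child but never as a parent."""
    tree = index[tree_id]
    children = {c for kids in tree.values() for c in kids}
    return sorted(children - set(tree))


index = build_index(EDGES)
print(root_of(index, "t1"), leaves_of(index, "t1"))
```

Tagging every edge with a tree identifier lets one flat edge table hold a whole collection while still allowing each tree to be reassembled on demand.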

3. PhyQL Visual Query Language


Once stored, phylogenies in PhyloBase can be retrieved and
manipulated through a user interface that allows writing queries
with visual icons and supports powerful operations, much as
SQL does in relational databases and XQuery does in XML
databases.
4. PhyQL Syntax and Semantics
The syntax of PhyQL supports three icons for three types of
nodes: Root (a white square), (internal) Node (a gray circle),
and Leaf (a green leaf); three wildcard icons to support query
flexibility: Any (a starred pink circle), LCA (a question mark
inside a mustard circle) and Subtree (a pink and blue tree); and
two edge icons to capture node relationships: Edge (blue edge)
and Hedge (red edge). These icons have predefined meanings
and are implemented as first-order predicates.
• PhyloBase user query interface for PhyQL
REQUIREMENTS

1. Java 1.6.
2. A stable release of eXist-DB 2.2, chosen for its superior
performance and suitability in our current setting.
3. A virtual machine equipped with a four-threaded Intel
Xeon 3.00 GHz CPU and 16 GB RAM, running
Windows Server 2008 (64-bit).
4. A stable version of SWI-Prolog 7.2.3 for Windows.
5. Metadata.
Pruning Aids

• Used to avoid processing candidates that cannot possibly match.

• A query hub must be a substructure of the database
hub and must match all the label constraints to be
considered a match.
• Assigning unique IDs to all the nodes in
PhyloBase helps to compute the query much faster.
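The idea of discarding hopeless candidates early can be sketched in Python. This is not the hub structure itself, which the slides do not define in detail; it substitutes a simpler necessary condition based only on the label constraints. The function name `label_prune` and the sample forest are hypothetical.

```python
def label_prune(query_labels, forest_labels):
    """Keep only trees whose label set covers every label the query requires.

    A tree that lacks a required label can never match, so it is dropped
    before any structural matching is attempted.
    """
    required = set(query_labels)
    return [tree_id for tree_id, labels in forest_labels.items() if required <= labels]


# Hypothetical collection: tree id -> set of node labels in that tree.
FOREST = {
    "t1": {"human", "chimp", "gorilla"},
    "t2": {"human", "mouse"},
    "t3": {"chimp", "gorilla", "orangutan"},
}

candidates = label_prune({"chimp", "gorilla"}, FOREST)
print(candidates)
```

Only the surviving candidates would then be passed to the more expensive structural check.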
Hub representation
Performance Improvement
LCA COMPUTATION ALTERNATIVES

LCA queries can be computed in many different ways. Although more efficient
procedural approaches probably exist, a rule-based deductive evaluation is
probably the most intuitive and computationally simple.

If a node k is a common ancestor of nodes i and j, and so is another node l, and
k is also an ancestor of l, then k cannot be the least common ancestor of i and j.
The predicate nlca(I,J,K) captures this: it holds when there exist two
common ancestors of I and J, namely K and L, such that K is an ancestor of L.
The rule lca(I,J,K) then states that K is a least common ancestor of I and J
when ca(I,J,K) holds and nlca(I,J,K) does not, i.e., there is no intervening L
below K for which ca(I,J,L) also holds.

% edge(P,C): P is the parent of C. ancs(A,D): A is a strict ancestor of D.
ancs(I,J) :- edge(I,J).
ancs(I,J) :- edge(I,K), ancs(K,J).

% ca(I,J,K): K is a common ancestor of I and J.
ca(I,I,I).
ca(I,J,K) :- ancs(K,I), ancs(K,J).

% nlca(I,J,K): K is a common ancestor of I and J, but not the lowest one.
nlca(I,J,K) :- ca(I,J,K), ca(I,J,L), ancs(K,L).

% lca(I,J,K): K is a common ancestor and no lower common ancestor exists.
lca(I,J,K) :- ca(I,J,K), \+ nlca(I,J,K).

(The ancs rules define the ancestor relation, and the lca rule returns the least
common ancestor K of nodes I and J in a phylogeny. In SWI-Prolog syntax,
variables are capitalized and \+ denotes negation as failure.)
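For readers less familiar with rule notation, the same logic can be mirrored in Python. This is a sketch under assumptions: edges are `(parent, child)` pairs as implied by the ancs rules, each node has at most one parent (no hybridization), and the names `ancestors_of` and `lca` are illustrative only.

```python
def ancestors_of(edges):
    """Map every node to the set of its strict ancestors (the ancs relation)."""
    parent = {child: par for par, child in edges}
    nodes = set(parent) | set(parent.values())
    result = {}
    for node in nodes:
        seen, current = set(), node
        while current in parent:
            current = parent[current]
            seen.add(current)
        result[node] = seen
    return result


def lca(edges, i, j):
    """Return the least common ancestor of i and j, or None if there is none."""
    anc = ancestors_of(edges)
    # ca(i,j,k): k is a strict ancestor of both i and j ...
    common = anc.get(i, set()) & anc.get(j, set())
    # ... plus the fact ca(i,i,i).
    if i == j:
        common = common | {i}
    # nlca(i,j,k): some other common ancestor l lies below k.
    lowest = [k for k in common if not any(k in anc.get(l, set()) for l in common)]
    return lowest[0] if lowest else None


EDGES = [("r", "x"), ("x", "a"), ("x", "b"), ("r", "c")]
print(lca(EDGES, "a", "b"), lca(EDGES, "a", "c"))
```

As in the rules, every common ancestor is generated first and the non-minimal ones are then filtered out, which is where the overhead of this formulation comes from.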

Unfortunately, there are several problems with these rules that lead to
unusual computational overheads.
We can use the rules below to compute the ancestor list for each node, where edge(X,Y)
means Y is the parent of X, and root(R) represents the root node of the tree T. Note that
this edge orientation is the reverse of the one implied by the ancs rules above.

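:- use_module(library(lists)).   % provides member/2

% lca(X,Y,H): H is the first node on X's ancestor list that also lies on Y's.
lca(X,Y,H) :- root(R),
    alist(X,R,[X],P1), alist(Y,R,[Y],P2), intersect(P1,P2,[H|_]).

% alist(Start,End,Visited,Path): Path leads from Start up to End along edge/2.
alist(Node,Node,_,[Node]).
alist(Start,End,Visited,[Start|Path]) :-
    edge(Start,X), \+ member(X,Visited),
    alist(X,End,[X|Visited],Path).

% intersect(L1,L2,L): L holds the elements of L1 that occur in L2, in L1's order.
intersect(_,[],[]).
intersect([],_,[]).
intersect([H1|T1],L2,[H1|L]) :-
    member(H1,L2), intersect(T1,L2,L), !.
intersect([_|T1],L2,L) :- intersect(T1,L2,L).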
• The LCA rule uses the alist and intersect rules to return the head of the
intersection list as the LCA. To make this rule work for a set of n nodes, we
need to invoke the intersection rule n−1 times and the ancestor list rule n times.
While it is possible to design smarter rules that compute the intersection and avoid
further membership tests once the first one has failed, we still need to compute the
ancestor lists for all n nodes, which alone is as expensive, and the cost of computing
the intersections comes on top. We believe that the analytical discussion
presented above is reason enough in favor of our choice.
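To make the cost argument concrete, the ancestor-list approach for a set of n nodes can be sketched in Python. The `parent` mapping (child to parent, matching edge(X,Y) with Y the parent of X) and the function names are assumptions for illustration, and the tree is assumed to have a single parent per node.

```python
def ancestor_list(parent, node):
    """Path from node up to the root, node first (the role of alist)."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def lca_of_set(parent, nodes):
    """LCA of n nodes: n ancestor lists followed by n-1 intersections."""
    if not nodes:
        return None
    lists = [ancestor_list(parent, n) for n in nodes]  # n ancestor lists
    common = lists[0]
    for other in lists[1:]:  # n-1 intersections
        members = set(other)
        common = [x for x in common if x in members]  # keeps bottom-up order
    return common[0] if common else None


# Hypothetical tree: r has children x and c; x has children a and b.
PARENT = {"x": "r", "a": "x", "b": "x", "c": "r"}
print(lca_of_set(PARENT, ["a", "b"]), lca_of_set(PARENT, ["a", "b", "c"]))
```

Even with a fast intersection, every ancestor list is built all the way to the root before any comparison happens, which is the cost the discussion above points to.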
Reachability Index for Candidate Pruning

• What is less explored is how advances in reachability research can be leveraged
to expedite tree query processing in phylogenetics.

• For the tree in figure below, if we already knew that node a, or any other
member in the LCA subquery list, is not reachable from r, we could fail the
subquery involving LCA without computing it, and thus need not compute the
entire query.
• For the same reason we could fail any subquery involving the operator any.

• But for all other operators, we could leverage the idea of k-hop reachability to
see if two nodes are connected via exactly k nodes.

• More research, however, is required to study an effective way forward.
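One way such a reachability check could work is sketched below in Python. It uses interval (enter/exit) labeling, a standard tree-indexing technique that is swapped in here as an assumption; the slides do not say which index PhyloBase would use. All function names and the sample tree are hypothetical.

```python
def interval_labels(children, root):
    """Assign each node an (enter, exit) pair using an iterative depth-first walk."""
    labels, counter = {}, 0
    stack = [(root, False)]
    while stack:
        node, finished = stack.pop()
        if finished:
            labels[node] = (labels[node][0], counter)
        else:
            labels[node] = (counter, None)
            stack.append((node, True))
            for child in reversed(children.get(node, [])):
                stack.append((child, False))
        counter += 1
    return labels


def reachable(labels, ancestor, node):
    """True if node lies in the subtree of ancestor (constant-time interval test)."""
    if ancestor not in labels or node not in labels:
        return False
    a_in, a_out = labels[ancestor]
    n_in, n_out = labels[node]
    return a_in <= n_in and n_out <= a_out


def lca_subquery_can_match(labels, query_root, members):
    """Fail soonest: reject if any member of the LCA list is unreachable."""
    return all(reachable(labels, query_root, m) for m in members)


CHILDREN = {"r": ["x", "c"], "x": ["a", "b"]}
LABELS = interval_labels(CHILDREN, "r")
print(lca_subquery_can_match(LABELS, "x", ["a", "c"]))
```

With the labels precomputed once per tree, a subquery whose members cannot all be reached from its root is rejected without computing any LCA.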


CONCLUSION
We have presented a query processing engine for an intuitive, flexible and
declarative phylogeny query language called PhyQL. Its implementation and
query processing rely on a deductive reasoner and thus allow significant
query optimization opportunities.
Resources
[1] H. M. Jamil, “A visual interface for querying heterogeneous phylogenetic
databases,” IEEE/ACM TCBB, vol. 14, no. 1, pp. 131–144, 2017.

[2] S. Yin, A. Hameurlain, and F. Morvan, “SLA Definition for Multi-tenant
DBMS and its Impact on Query Optimization.”

[3] M. Joshi and P. R. Srivastava, “Query Optimization.”

[4] A. Boc, A. B. Diallo, and V. Makarenkov, “T-REX: a web server for
inferring, validating and visualizing phylogenetic trees and networks,”
Nucleic Acids Research, 2012.
QUESTIONS?
THANK YOU