Anda di halaman 1dari 10

Scalability of Databases for Digital Libraries +

John Chmura, Nattakarn Ratprasartporn, Gultekin Ozsoyoglu

Department of Electrical Engineering and Computer Science

Case Western Reserve University, Cleveland, Ohio 44106
(jlc18, nxr27, tekin)

Abstract. Search engines of main-stream literature digital libraries such as

ACM Digital Library, Google Scholar, and PubMed employ file-based systems,
and provide users with a basic boolean keyword search functionalities. As a re-
sult, new and powerful querying capabilities are not easy to implement on top
of such systems, and not provided. In comparison, query languages of database
systems traditionally have high expressive power. This paper evaluates the scal-
ability of the approach of deploying relational databases as backend systems to
digital libraries, and, thus, making use of the query languages and the query
processing capabilities of database query engines for literature digital libraries.
To evaluate our approach, we built a scalable prototype digital library built on
top of a relational database management system, and its advanced query inter-
face which allows users to specify dynamic text and path queries in an intuitive,
hierarchical manner. This paper evaluates the scalability of two search query
processing approaches, namely, ad-hoc queries, pre-compiled queries (stored-
procedures). We demonstrate that, with reasonably priced hardware, we are able
to build an RDBMS-based digital library search engine that can scale to handle
millions of queries per day.
Keywords. Scalability, Database, Metadata, Path Query, Query Interface

1 Introduction
Main-stream literature digital libraries such as ACM Digital Library [12], CiteSeer
[13], and PubMed [14] traditionally employ file-based systems with indexes, and pro-
vide users with a basic Boolean keyword search functionality. As more and more re-
searchers find themselves dependent on these digital libraries, there is a need for more
advanced query capabilities. Consider the query “find papers of authors who pub-
lished in ACM SIGMOD conferences and wrote papers whose titles are similar to
‘data mining’ with a score of above 0.7” or the query “find papers on “web data min-
ing” and on a citation path of distance of length at least 3 starting with paper P”.
Presently, there are no main-stream search systems that allow users to specify such
queries. New and powerful querying capabilities, such as path queries and dynamic
text queries with approximate similarity predicates and text joins, are not easy to im-
plement on top of file-based systems. At the other end, database systems traditionally
provide query languages with high expressive power. In this paper, we evaluate the
hypothesis that relational database query engines have now become efficient and ef-
fective, and that they can scale for use as backends to literature digital libraries.
To evaluate our hypothesis, we have built Case (Anthology) Explorer [1, 10], a
scalable prototype digital library built on top of a relational database management

+ This research is supported by the US National Science Foundation grant ITR-0312200.

system (RDBMS). Case Explorer is populated with metadata from approximately
15,000 papers from the ACM SIGMOD Anthology [2] and 415,000 paper titles from
DBLP bibliography [3]. To specify powerful new queries dynamically, we have de-
signed the advanced query interface (AQI) of Case Explorer, which is browser-based,
and allows users to specify dynamic text and path queries in an intuitive and hierar-
chical manner.
To execute queries generated by AQI, we propose (and evaluate (a) and (b)) five
approaches: (a) ad-hoc queries, (b) pre-compiled queries (stored-procedures) (c) di-
vide-and-conquer approach, (d) user-defined functions, and (e) an adaptive-dynamic
query interface. Each approach has performance or flexibility implications.
Finally, we present experiments that test the scalability of Case Explorer from sev-
eral aspects. By executing queries of increasing complexity and under increasing load
conditions, we are able to test the overall performance of the system in high load
situations. Our experiments also test system performance in the absence of database
indexes and caching. Overall, our experiments demonstrate that, with reasonably
priced hardware, we are able to build an RDBMS-based digital library search engine
that can scale to handle millions of queries per day.
Section 2 presents the design and implementation of the Case Explorer interface. In
section 3, we discuss text-similarity search design considerations. Section 4 lists sev-
eral query execution methods for achieving balanced scalability and flexibility. In
section 5, we evaluate two of the proposed query optimization methods as well as the
overall performance of Case Explorer. Section 6 concludes.

2 Advanced Query Interface (AQI) and Path Queries

Case Explorer database [10] is designed to store metadata extracted from papers and
research articles. The database contains information about papers both in text form,
TF-IDF vectors, and Microsoft Full text Search (MSFT) [7] form. Similarity of two
papers is measured [4] based on different sections of a paper:
RelatedToPapers(A, B) = ∑ w c • Sim(A c ,Bc ) (1)
Where A, B are papers, Comp is the set of paper components {Title, Authors, Ab-
stract, Index Terms, Body, References}, and wc is the component’s weight. Indices are
built on the metadata [10].
Case Explorer provides a basic search screen that allows users to search by key-
word, author name, paper year, and publication venue. The results of a search and
query statistics are displayed on the “results screen” (Fig. 1).

Fig. 1. Search Results Screen

In addition to basic search, the advanced query interface (AQI), whose design is
inspired by the Pathway Explorer [16], allows users to specify and expand a given
query dynamically. Due to the inherent hierarchical relationships, the AQI is able to
provide multiple different hierarchical views to represent nested predicate types. Dif-
ferent types of predicates, as added to the AQI, form a tree structure. Fig. 2 illustrates
one of the possible hierarchical views.

Fig. 2. Example Hierarchy of Anthology Explorer Data

From the initial state (Fig. 3), queries are designed by selecting one of the three
main predicate types (publication venue, author, or paper) and creating its instances.
The user then selects which relations to display in the output (by clicking to any entity
type, and the AQI coloring the entity), and executes the query. In Fig. 4, the Publica-
tion Venue relation has been selected for inclusion into the output.

Fig. 3. AQI Initial State Fig. 4. Designating an Output Relation

Each relation in the AQI has one or more search fields available (Fig. 5). When left
blank, no constraints are placed on the query. The AQI also allows the user to specify
a minimal similarity threshold (with respect to a pre-specified similarity measure, e.g.,
cosine, jaccard, dice etc.). In Fig. 6, the user is searching for papers that are about
“data mining” with a minimal similarity threshold of 0.5.

Fig. 5. Multiple Constraints Fig. 6. Minimal Similarity Threshold in AQI

Fig. 7,8, and 9 illustrate, respectively, examples of (i) the simple query “Get the ti-
tles of papers that were published in year x”, (ii) the intermediate query “Get the pa-
per titles and publication venue names for papers that are about ‘data mining’”, and
(iii) the complex query “List authors who have written papers about ‘caching’ (and
list the titles of those papers), and have also been published in a publication venue
that contained papers about ‘data mining’ in the year 1995 or 2002”.
Path queries of Case Explorer allow users to locate papers using predefined path
queries. This is a highly useful feature that is expensive to offer by conventional re-
search paper search engines, such as Citeseer [13]. Path queries involve citation rela-
tionships between papers, e.g., “Find papers on a citation path of length at least (X)
starting with the paper (Y)”, “Find the authors of papers on a citation path of distance
of length at least (X) ending with the paper (Y)”, etc. Because these queries are recur-
sive by nature, in addition to DBMS, we use file-based indexes for scalability. The
index file is a hash file where the key for each record in a bucket is a paper-id, and
values stored are all papers in the citation paths of length 1, 2, and 3 starting or ending
with each key. We keep only the paths of length up to 3 because longer paths usually
loose context and become less relevant. Fig. 10 illustrates the query “find papers that
are about ‘caching’, and cite a paper in SIGMOD 1997 within length 2”.

Fig. 7. Simple Query (QS) Design in AQI Fig. 8. Intermediate Query (QI) in AQI

Fig. 9. Complex Query (QC) Design in AQI Fig. 10. Example of Path Query in AQI

To answer a path query, the index file is used to retrieve all papers in the citation
path of a specified length starting or ending with a given paper. The results are then
transferred to the DBMS to retrieve paper contents, such as full title, publication, and
year, and details are output.
Path queries are integrated with the Paper Details page and the AQI. From the Pa-
per Details page (of paper Y), users can choose to view citation paths of length 1, 2,
or 3 starting or ending with the paper Y. From the AQI, path queries (citation or cited
by) are added to the tree structure (but not as the root of the tree).

3 Text-Similarity Search Design Considerations

In the experiments of this paper, to perform the text-similarity search in both the
Basic Search and the AQI, we have used the Jaccard similarity measure provided by
the Microsoft Full-Text Search component as opposed to TF-IDF-based similarity
measures implemented as user-defined functions (UDF). The reason is the present
nonscalability of the UDF function, which we refer to as sva_selection [6, 15]. In ear-
lier work, we have advocated [6, 15] the integration of “sva operators” into DBMS
query engines for highly powerful and efficient computations of text-based similarity
measures, and ranking query outputs, which we believe, when implemented, will out-
perform the presently employed “SQL optimization+MSFT processing” approach.
However, in the absence of such an integrated approach, when we implemented the
sva_selection UDF through a series of selections from the database relations as well
as insertions and updates of temporary tables, sva_selection did not scale, and we had
to abandon it.
The Microsoft Full-Text Search (MSFT) [7] component is an external full-text
search service included with SQL Server that is specifically designed to provide text-
search capabilities from within SQL queries. After scanning database relations to in-
dex the content, MSFT stores its vector data in the form of a compressed, inverted in-
dex structure. When performing a text-search, MSFT computes the paper scores using
this vector data and a version of the Jaccard measure [8]. In addition, MSFT caches its
indexes in the main memory providing fast search results. The disadvantage to this
approach is that flexibility is reduced as we can not customize and integrate the index-
ing methods, scores, or query formulas for our implementation.

4 Form Query Optimization

Ad-hoc SQL queries can be easily generated by the AQI. However, they incur addi-
tional overhead as the database engine creates and compiles the query execution plan
each time they are executed.
Pre-compiled queries as stored procedures have several advantages over ad-hoc
queries. First, stored procedures do not incur the query plan preparation and compila-
tion expense with each execution. Second, to execute the ad-hoc query, the entire
SQL Text must be transferred to the DBMS. The stored procedure version however,
only needs to transfer the EXEC command; a much smaller amount of text to transfer.
However, stored procedures cannot easily provide the dynamic query capabilities that
the AQI requires without placing constraints on the interface.
The third option is a divide-and-conquer approach, breaking each entity in the
query into separate stored procedures which will insert their results into a temporary
relation. An ad-hoc query will then be executed to select all the data from this tempo-
rary relation. While the DBMS still must compile a query plan for this ad-hoc query,
it will be simple as it will only contain a few joins. The negatives here are the over-
head associated with the temporary relation. The DBMS can potentially write the con-
tents to the disk. Furthermore, using a global temporary relation across multiple
stored procedures will cause the DMBS to recompile the stored procedure more often
[11]. Therefore, temporary relations are not a suitable solution for a high-load situa-
The fourth approach is to use the User Defined Functions (UDF). Continuing with
the divide-and-conquer approach, the query for each entity in the AQI can be pre-
compiled into a user defined function (UDF) that returns a table value, allowing it to
be used in a query by joining it to other relations. Again, a dynamic query will be
used to join the proper UDFs and project the output. Similar to stored procedures,
UDFs provide the benefits of being compiled. However, in order to return its contents
in the form of a table that can be used in a join operation, UDFs use table variables. A
table variable still can potentially write to disk in a low-memory situation. Moreover,
when performing the natural joins on the table outputs, the DBMS can not take advan-
tage of the indexes already on the original relations. The drawbacks of UDFs there-
fore outweigh the performance gains of being pre-compiled.
Finally, the AQI can be turned into an adaptive-dynamic query interface. When a
query structure is seen for the first time, the interface will execute the query as if it is
an ad-hoc query while it asynchronously creates a stored procedure for the query. In-
formation about the structure, cost estimates, age information, parameters, and the
stored procedure name will then be stored in a central repository. A daemon would be
responsible for monitoring the stored procedures, the statistics, and the age informa-
tion, cleaning up infrequently-used queries or alerting system administrators to overly
expensive ones. Subsequent queries of similar structure will then use the compiled
stored procedures to execute. While in the end this would potentially enumerate every
possible combination of queries, there is an advantage over pre-compiling every com-
bination. First, the system builds the code itself, so generating and managing the
stored-procedures is not necessary. In addition, queries can still be executed in an ad-
hoc fashion. For most queries, it is likely that users will execute a similarly-structured
query more than once, thus benefiting from the dynamically created stored procedure.
This method, while incurring slight overhead, offers the most flexibility with the least
amount of constraints placed on the AQI.
Table 1 provides an overview of the proposed query execution methods.
Table 1. Proposed Query Execution Methods
Method Advantages Disadvantages
Ad-Hoc Queries Flexibility. No constraints need Incurs query compilation overhead.
to be placed on the AQI.
Stored Procedures Compiled queries reduce execu- Requires constraints to be placed
tion overhead. RPC call reduces on AQI so that all combinations of
network bandwidth. queries can be enumerated and
code generated.
Stored Procedures with Tem- Retains AQI flexibility. Breaks Temporary Relations will incur
porary Relations queries into smaller, more man- creation/drop overhead. Most
(not evaluated in this paper) ageable stored procedures. likely to write to disk, drastically
reducing performance.
Ad-Hoc using Table-Valued Precompiled Queries gives ad- Table-Valued UDFs creates table
User-Defined Functions vantages of stored procedures, variables in main memory that can
(not evaluated in this paper) while retaining AQI flexibility. potentially write to disk. Can not
utilize the main relations’ indexes
when performing natural joins.
Adaptive-Dynamic Compiled queries provide bene- First execution of a new query will
(not evaluated in this paper) fits of Stored Procedure. Re- be ad-hoc, incurring extra expense.
tains full AQI flexibility by al- Additional expense to create the
lowing the system to generate stored procedure. Can potentially
code. produce an extremely large number
of stored procedures.

5 Experiments
In all of the experiments, Case Explorer is run on a Dell PowerEdge 2850 server with
two Intel Xeon 3.2GHz processors and 6.0 GB of main memory. The Enterprise Edi-
tion of Windows 2003 Server is used as the operating system. Each experiment is run
using Microsoft Application Test Center (MATC) [9]. The MATC opens multiple
connections to the server and automatically generates HTTP requests, simulating the
behavior of an actual user. The time-to-last-byte is used as a measure for the amount
of delay from a user’s point of view. The bandwidth utilization is measured in kilo-
bytes per second per HTTP response.
To measure the performance of the AQI and path queries, we have created queries
of increasing complexity and ran an increasing number of these queries simultane-
ously to emulate multiple users accessing the system concurrently. A summary of all
queries used in our experiments is given in table 2.
Table 2. List of queries used in the experiments
Query Name Description
Simple QS Selection with one join
Simple query with no relations cached in main memory.
Simple with No Cache Q SNC
Simple query with no indexes
Simple with No Index Q SNI
Intermediate QI Selection with one join, one projection, and full-text search
Complex QC Selection with four joins, two full-text searches, and one projection
Simple Path Query PS Path query that starts with a paper
Complex Path Query PC Path query that starts with an author or a publication venue

Queries that are used in the experiments are as follows.

1. QS: “Get titles of papers that were published in year x”
SELECT paper.title FROM paper
INNER JOIN publication_venue ON paper.pub_id = publication_venue.pub_id
WHERE publication_venue.year = x
2. QI: “Get paper titles and publication venue names for papers that are about ‘data mining’ ”
SELECT paper.title, FROM paper
INNER JOIN publication_venue ON paper.pub_id = publication_venue.pub_id
WHERE CONTAINS(paper.title, ‘data mining’)
3. QC: “List authors with papers about ‘caching’ (and list the titles of those papers), and pub-
lished in a publication venue with papers about ‘data mining’ in years y1 or y2.”
SELECT, paper1.title FROM author
INNER JOIN AuthorToPaper AtP on author.author_id = AtP.author_id
INNER JOIN Paper Paper1 on AtP.paper_id = Paper1.paper_id
INNER JOIN AuthorToPubVenue AtPub on author.author_id=AtPub.author_id
INNER JOIN Paper Paper2 on AtoPub.pub_id = Paper2.pub_id
WHERE CONTAINS(Paper1.title, ‘caching’) AND CONTAINS(Paper2.title, ‘data mining’)
AND ( Paper2.pub_year = year or Paper2.pub_year = year )
4. PS: “List papers in the citation path of length at least 1, 2, and 3 starting with paper x”
5. PC: “List papers that cite a paper written by author x within length 1, 2, and 3”
The paper relation used in experiments currently holds approximately 430,000 tu-
ples. Case Explorer will eventually contain over 7.0 million papers (presently, it is be-
ing populated with freely-available PubMed papers from biomedical sciences). There-
fore, to more accurately simulate the amount of disk IO required to scan the paper
relation, we have padded each tuple with extra 48 bytes. The extra 48 bytes increases
the total amount of space to approximately 26.2 MB; a closer representation of the fi-
nal dataset. Also, the size of the hash file used in the path query experiments is in-
creased to 500 MB (the actual size is 2MB for the current database).
Each test run is defined by the number c of concurrent connections. The experi-
ments begin by running one connection to establish a baseline of a query’s running
time. The no. of concurrent connections is then increased to 5, 10, 25, 50, and 100.
(a) (b) (c)

(d) (e) (f)

(g) (h)
Fig. 11. Experimental Results (ad-hoc approach): (a, c, e, g) are response times and (b, d, f,
h) are bandwidth utilizations for QS, QI and QC, PS, and PC experiments, respectively.
Simple Query Experiments (Fig. 11a, 11b):
These queries involve simple query QS, Q SNI (no indexing), and Q SNC (no caching).
By default, the DBMS caches as much data into main memory as possible. How-
ever, in the future, it may become necessary to run queries against live data that is
very large, and will not fit into the main memory. The first experiment tests how
much load, and at what level of performance, our system can handle without caching
any data in the main memory. To run this experiment, the DBMS is forced to clear all
buffer pages after each query is run. Although overall performance of Q SNC is poor,
the system can fulfill at least 775,000 queries per day with little noticeable delay.
In the Q SNI experiments, the pub_id and paper_id indexes are removed from the
paper relation to force the database engine to scan the entire relation. We use the Q SNI
query to measure performance in a worst case scenario. The amount of time required
to perform a full scan reduces performance beyond that of the Q SNC query. The maxi-
mum practical number of requests handled by this experiment is approximately 5 per
second, or 432,000 per day.
The QS experiment fulfilled 25 requests per second, or 1.9 million per day with a
response time of less than one second. We also run the QS query from an offsite loca-
tion to test the effects of internet latency on the query response times. From our re-
mote location, with a single connection, the remote client received the full response
from the Case Explorer server in 176.04 milliseconds compared to the 84.29 millisec-
ond response time when run directly from the server. We attribute approximately 90
milliseconds of transfer time to internet latency.
All of the above experiments were executed through an ad-hoc approach. With a
simple query experiment, the pre-compiled approach can not be show a significant
gain in performance over the ad-hoc approach (19% gain in requests per second, but
only 1% gain in performance with respect to response time).
Observation: 1) Although the QS experiment demonstrates a 51.4% gain in performance over Q SNC and
74.3% gain over Q SNI , the system can fulfill 775,000 and 432,000 queries per day for Q SNC and Q SNI . 2)
Internet latency adds an average of 90 milliseconds to the server’s response. 3) Overall, the gain in per-
formance from pre-compiled queries over the ad-hoc queries is negligible.

Intermediate Query (QI) Experiments (Fig. 11c, 11d):

The response times of this query were fast. A single connection of this ad-hoc
query provided a 48.7% gain in response time over the simple ad-hoc query. At a
maximum usable rate of 47.4 requests per second, the ad-hoc QI can fulfill 4 million
queries per day. One side effect of this type of query is that the increased number of
connection runs per second greatly increase the bandwidth used. Running at its peak
of 47.4 requests per second, the server was transferring data at a speed of 267 KB/sec,
a significant increase over other experiments.
Similar to simple queries, the gain in performance from the pre-compiled over the
ad-hoc approach is not significant. Below we summarize the results.
Observation: 1) The added predicates in QI reduce the amount of data to be processed, and significantly
increase the response time (a 53% gain in performance over QS). 2) At a maximum usable rate of 47.4 re-
quests per second, the ad-hoc QI can fulfill 4 million queries per day. 3) QI does not contain enough com-
plexity to benefit significantly from a pre-compiled approach.

Complex Query (QC) Experiments (Fig. 11c, 11d):

A single QC took 385 milliseconds to complete, significantly slower than QS and
QI. The system was able to sustain an average rate of 5.45 requests per second before
performance began to degrade, or approximately 470,000 complex queries per day.
The pre-compiled query performed 8% better both in queries per second and re-
sponse time. The average number of usable queries also increased 8% to 5.93 queries
per second, or 512,000 per day. As load became extreme, the pre-compiled produced
a 16% gain in response times. Because the pre-compiled produces better throughput,
the bandwidth utilization was proportionally higher than that of ad-hoc queries.
Observation: 1) The ad-hoc QC is significantly more complex than QS and QI. 2) The request rate of the
ad-hoc QC is 5.45 requests per second, or approximately 470,000 queries per day. 3) The pre-compiled ap-
proach produced the most significant gain in performance (16%) under extreme load conditions.

Path Query Experiments:

In path query experiments, citation paths of length 1, 2, and 3 are tested.
a) Simple Path Query (PS) Experiments (Fig. 11e, 11f):
The PS query can fulfill 15 requests per second for the citation path of length at least
3, or 1.3 million queries per day, and can fulfill 38 requests per second for the path of
length 2, or 3.2 million queries per day. For the path of length 1, the number of re-
quests per second is more than 100, or at least 8.6 million queries per day. For all path
lengths, average bandwidth utilization is about 5,000 KB/sec.
b) Complex Path Query (PC) Experiments (Fig. 11g, 11h):
The PC query can fulfill 6 requests per second for the citation path of length at least 3,
or 259,200 queries per day, and 8 requests per second, or 691,200 queries per day for
the path of length 2. For the path of length 1, the number of requests per second is 20,
or 1.7 million queries per day. Bandwidth used is around 3,000 to 3,500 KB/sec.
Observation: 1) Path query experiments can fulfill from 259,200 to 8.6 millions queries per day depending
on its complexity. 2) The bandwidth utilizations of both experiments (about 4,000 KB/sec) are high when
compared to those of regular query experiments because of a larger amount of text transferred to DBMS.

10 Conclusions
In order to provide the AQI functionality and remain scalable, we have devised sev-
eral strategies for implementing the queries behind the AQI. Ad-hoc queries allow the
greatest flexibility while suffering a performance hit. Pre-compiled queries offer
greater performance, however limit flexibility. We have discussed benefits and disad-
vantages of each of these options in detail. We evaluated the scalability of our system
using both ad-hoc and pre-compiled queries. Through our experiments, we estab-
lished that Case Explorer could sufficiently support user’s queries in a high-load envi-
ronment. In addition, pre-compiled queries perform only slightly better than ad-hoc
queries. Our tests demonstrate that Case Explorer will be able to fulfill from 200 thou-
sand to about four million queries per day based on the complexity of queries.
1. Case Explorer,
2. ACM SIGMOD Anthology,
3. DBLP bibliography,
4. Li, L., “Metadata Extraction: RelatedToPapers and its Use in Web Resource Querying”,
MS Thesis, EECS Dept, CWRU, 2003.
5. Al-Hamdani, A., “Querying web resources with metadata in a database”, PhD Thesis,
EECS Dept., CWRU, 2004.
6. Ozsoyoglu G., Al-Hamdani, A., Altingovde, I. S., Ozel, S. A., Ulusoy, O., and Ozsoyoglu,
Z. M., “Sideway Value Algebra for Object-Relational Databases”. In Proc. of VLDB
2002, Hong Kong, August 2002.
7. Microsoft Full Text Search,
8. Kowalski, G., “Information retrieval systems: theory and implementation”, Kluwer Aca-
demic Publishers, 1997.
9. Microsoft Application Center Test,
10. Chmura, J., “Scalable Web Data Source Search Engine Using an RDBMS”, MS Thesis,
EECS Dept, CWRU, 2005.
12. ACM Digital Library,
13. CiteSeer,
14. Pubmed,
15. Ozsoyoglu, G., Altingovde, I.S., Al-Hamdani, A., Ozel, S.A., Ulusoy, O., Ozsoyoglu,
Z.M., “Querying Web metadata: Native Score Management and Text Support in Data-
bases”, ACM Transactions on Database Systems, Dec. 2004.
16. Newman, S., Ozsoyoglu M., “A Tree-Structured Query Interface for Querying Semi-
Structured Data”. SSDBM 2004.