1 Introduction
Dictionary encoding is a simple compression method used by a wide range of
RDF [17] applications to reduce the memory footprint of the program. A dic-
tionary encoder usually provides two basic operations: one for replacing strings
with short numerical IDs (encoding), and one for translating IDs back to the
original strings (decoding). This technique effectively reduces the memory foot-
print, because numerical values are typically much smaller than string terms. It
also boosts the general performance since comparing or copying numerical values
is more efficient than the corresponding operations on strings.
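As an illustration, a conventional two-table dictionary encoder can be sketched as follows. This is a minimal Python sketch of the general technique, not the implementation of any particular system; class and method names are ours.

```python
# A minimal two-table dictionary encoder: encode() maps each distinct term
# to a small integer ID, decode() translates an ID back to its term.
class DictionaryEncoder:
    def __init__(self):
        self.term_to_id = {}   # encoding table: string -> ID
        self.id_to_term = []   # decoding table: ID -> string

    def encode(self, term: str) -> int:
        # Reuse the existing ID if the term has been seen before.
        if term in self.term_to_id:
            return self.term_to_id[term]
        new_id = len(self.id_to_term)
        self.term_to_id[term] = new_id
        self.id_to_term.append(term)
        return new_id

    def decode(self, term_id: int) -> str:
        return self.id_to_term[term_id]
```

Note that the two tables both hold a reference to every string, which is exactly the duplication that the approach described below avoids.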
Dictionary encoding relies on a bi-directional map, which we call a dictionary, to store the associations between the numeric IDs and the textual terms. If the input
contains many unique tokens, then the size of the dictionary can saturate main
memory and start hampering the functioning of the program. Given the increas-
ing size of RDF datasets, this is becoming a frequent scenario, e.g. [19] reports
cases where the size of the dictionary becomes even larger than the resulting
encoded data. This becomes particularly problematic for applications that need
to keep the dictionary in memory while processing the data.
An additional challenge comes from the dynamic nature of the Web, which
demands that the application access and/or update the dictionary with high
frequency (for example, when processing high-velocity streams of RDF data).
This requirement precludes the usage of most existing dictionary compression
techniques (e.g. [10, 19]), since these sacrifice update performance in order to
maximize compression, which was a rational trade-off when processing static
RDF data. To the best of our knowledge, there is no method to store dictionaries
of RDF data that is space-efficient and allows frequent updates.
The goal of this paper is to fill this gap by proposing a novel approach, called
RDFVault, to maintain a large RDF dictionary in main memory. The design of
RDFVault contains two main novelties: First, it exploits the high degree of similarity
between RDF terms [9] and compresses the common prefixes with a novel
variation of a Trie [11]. Tries are often used for this type of problem, but standard
implementations are memory inefficient when loaded with skewed data [13], as
is the case with RDF [16]. To address this last issue, we present a Trie variation
based on a List Trie [7], which addresses the well-known limitations of List Tries
with a number of optimizations that exploit characteristics of RDF data.
Second, inspired by symbol tables in compilers, our approach unifies the two
independent tables that are normally used for encoding and decoding into a
single table. Our unified approach maps the strings not to a counter ID (as
is usually the case), but to a memory address from where the string can be
reconstructed again. The advantage of this design is that it removes the need for
an additional mapping from IDs back to strings.
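The idea can be sketched in a few lines of Python. This is our own illustrative approximation, using node object references to stand in for memory addresses; the actual design of RDFVault is presented in Section 5.

```python
# Sketch of a unified dictionary: encode() returns the trie node itself
# (standing in for a memory address), and decode() rebuilds the string by
# walking parent pointers, so no second (ID -> string) table is needed.
class Node:
    def __init__(self, ch, parent):
        self.ch = ch          # character on the edge from the parent
        self.parent = parent
        self.children = {}

class UnifiedDictionary:
    def __init__(self):
        self.root = Node("", None)

    def encode(self, term):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, Node(ch, node))
        return node           # the node reference itself acts as the ID

    def decode(self, node):
        parts = []
        while node.parent is not None:   # bottom-up traversal to the root
            parts.append(node.ch)
            node = node.parent
        return "".join(reversed(parts))
```

Encoding the same term twice returns the same node, so IDs can still be compared for equality in constant time.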
To support our contribution, we present an empirical analysis of the per-
formance and memory consumption of RDFVault over realistic datasets. Our
experiments show that our technique saves 50–59% of memory compared to an
uncompressed hash-based dictionary, while maintaining competitive encoding
speed and decoding performance that is at most 2.5 times slower. Given that
decoding in a conventional hash-based dictionary is very fast (a single hash
table lookup), we believe that the decoding speed of RDFVault is still reasonably
good, and that in many cases this is a fair price to pay for the improved memory
consumption.
The rest of this paper is organized as follows: In Section 2 we report some
initial experiments that illustrate the potential saving that we can obtain with
redundancy-aware techniques. In Section 3 we overview related work on the
structure of existing dictionaries, and briefly discuss some of the existing efforts
to reduce their memory consumption. In this section we also introduce the Trie
data structure which will be the basis of our method. In Section 4 we focus on
a number of existing Trie variants and analyze their strengths and weaknesses
when applied to RDF data. Then, we present our method in Section 5 and an
empirical evaluation of its performance in Section 6. Finally, Section 7 concludes
the paper and discusses possible directions for future work.
                     #Terms (M)                #Unique Terms (M)
Datasets             IRI      Literal  All     IRI     Literal  All     Triples (M)
BioPortal [22]       112.17   17.80    130     3.32    4.11     7.44    43.33
Freebase [12]        237.59   60.50    300     19.06   10.48    29.77   100
BTC2014 [14]         228.33   20.79    300     11.47   3.42     17.97   100
DBPedia (EN) [3]     280.07   19.22    300     50.01   1.85     51.87   100
Table 1: Number and type of terms in examined datasets
3 Existing Approaches
Fig. 1: (a) The collective amount of common prefixes of any length categorized by type
of terms. (b) The disk space occupied by HDT dictionary versus unique string terms.
shows that common prefixes do not occur only in IRIs, but also in literals, which
are outside the scope of this optimization.
There are also approaches that offer dictionary compression over static datasets;
e.g., HDT [10] applies PPM [6] to compress its D (dictionary) part (FourSectionDictionary) 4 . Our experiments presented in Figure 1(b) compare the disk space
occupied by unique strings in datasets presented in Table 1 versus that of HDT
dictionary. The Figure clearly shows that in almost all cases, the whole HDT
dictionary (strings and IDs) occupies less than half the space of the uncompressed
strings. Another similar approach [19] is a compact dictionary which
partly relies on partitioning terms based on the role they play in the datasets
to achieve a dictionary compression level of 20-64%. Although both approaches
effectively compact the dictionary, they require the whole dataset to be available
at compression time, and they both function under the assumption that the data
rarely changes after the dictionary creation. As a result, they support relatively
efficient decoding (on the order of microseconds in our experiments) but support no
new encoding after the dictionary is created. Thus, these techniques are great if
the data is static, but inapplicable for the dynamic and streaming data sources used
in many real-time use cases such as RDF stream processing.
[4] proposes an order-preserving in-memory dictionary based on a single data
structure that supports dynamic updates; however, the approach is vulnerable
to memory wastage for highly skewed data with many duplicates (like RDF
data). [5, 24] propose approaches that address the scalability issue of massive
RDF compression by resorting to distributed processing.
Trie. If we look at the data structures that are normally used inside the dictionary, then we notice that B+-trees are often chosen if the dictionary is stored on
disk, while arrays, hash tables, or memory mapped files are normally preferred
if the dictionary is supposed to reside in main memory [19]. Regardless of the
data structure, in general existing approaches do not attempt to minimize the
storage of common prefixes, and therefore consume significant space.
4 https://code.google.com/p/hdt-java/
A Trie [11] (also known as a radix or prefix tree) is a special multi-way tree that
was initially proposed as an alternative to binary trees [15] for indexing strings
of variable length. In a Trie, each node represents the string that is spelled out
by concatenating the edge labels on the path from the root. A string stored in
the Trie is represented by a terminal node, while each internal node represents a
string prefix. The children of a node are identified by the character on their edge
labels; so, the fastest Trie implementation stores an array of |Σ| child pointers
in each node, where Σ is the alphabet. For instance, if a Trie should store ASCII
strings, then the arrays would need to have 128 entries.
To better illustrate the functioning of a Trie, we show in Figure 2(a) a small
example of a standard Trie that supports the uppercase English alphabet and
contains three simple keys (ABCZ, ABCA, and XYZ). The example shows
that no node in the Trie stores the key associated with that node. Instead, it is
the position of the node in the Trie that determines the key associated with that
node. In other words, the edges followed to reach a node determine the key
associated with that node. In this example we also see that the strings ABCZ
and ABCA share the part of the Trie that represents their common prefix.
Because of this, Tries have the following desirable properties:
– All strings that share the same prefix are stored using the same nodes;
  therefore common prefixes are stored only once;
– Keys can be quickly reconstructed via a bottom-up traversal of the Trie;
– The time complexity of insertions and lookups is proportional only to the
  length of the key, and not to the number of elements in the Trie.
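The standard array-based Trie described above can be sketched as follows. This is a generic illustrative implementation (assuming an ASCII alphabet, so |Σ| = 128), not tied to any particular system.

```python
# Standard Trie with a fixed array of |alphabet| child pointers per node.
ALPHABET_SIZE = 128  # ASCII

class TrieNode:
    def __init__(self):
        self.children = [None] * ALPHABET_SIZE  # one slot per character
        self.is_terminal = False                # marks the end of a stored key

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key: str) -> None:
        node = self.root
        for ch in key:
            idx = ord(ch)
            if node.children[idx] is None:
                node.children[idx] = TrieNode()
            node = node.children[idx]
        node.is_terminal = True

    def contains(self, key: str) -> bool:
        node = self.root
        for ch in key:
            node = node.children[ord(ch)]
            if node is None:
                return False
        return node.is_terminal
```

Note that each node allocates 128 pointer slots regardless of how many are used; this is precisely the sparsity problem discussed next.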
Our experiments in Section 2 show that a storage strategy that minimizes
string redundancy has the potential of being very effective in terms of resource
consumption. Therefore, Tries can potentially be an ideal data structure for the
compression of RDF terms in memory. Unfortunately, the most serious drawback
of Tries is that if the input is skewed, and the alphabet is large, the Trie nodes
become sparse [13] and cause low memory efficiency. In recent years, this limitation
has received considerable attention, and a number of papers have proposed
interesting solutions. In the next section we discuss the most prominent ones and
analyze how they perform in our specific use case.
Path compression Single-descendant nodes that do not lead to leaves are omitted,
and the skipped characters are either stored in the (multi-descendant)
nodes, or only the number of skipped characters is stored in the nodes and the
entire string is stored in the leaves to ensure correctness.
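A path-compressed Trie of the first kind (skipped characters stored on the edges) can be sketched as follows. This is an illustrative Python sketch of the general technique; names and representation choices are ours.

```python
# Path-compressed trie: each edge carries a whole substring instead of a
# single character, so chains of single-descendant nodes collapse into one edge.
class PCNode:
    def __init__(self):
        self.children = {}       # first char of label -> (edge_label, child)
        self.is_terminal = False

def pc_insert(root, key):
    node, i = root, 0
    while i < len(key):
        first = key[i]
        if first not in node.children:
            # No edge starts with this character: attach the whole remainder.
            leaf = PCNode()
            leaf.is_terminal = True
            node.children[first] = (key[i:], leaf)
            return
        label, child = node.children[first]
        # Length of the common prefix of the edge label and the remaining key.
        j = 0
        while j < len(label) and i + j < len(key) and label[j] == key[i + j]:
            j += 1
        if j == len(label):
            node, i = child, i + j       # edge fully matched: descend
        else:
            # Partial match: split the edge at position j.
            mid = PCNode()
            mid.children[label[j]] = (label[j:], child)
            node.children[first] = (label[:j], mid)
            node, i = mid, i + j
    node.is_terminal = True

def pc_contains(root, key):
    node, i = root, 0
    while i < len(key):
        entry = node.children.get(key[i])
        if entry is None:
            return False
        label, child = entry
        if key[i:i + len(label)] != label:
            return False
        node, i = child, i + len(label)
    return node.is_terminal
```

Compared to the standard Trie, a key like "ABCZ" occupies a single edge until another key forces a split, drastically reducing the node count.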
Burst Trie and HAT Trie A Burst Trie [13] is a hybrid data structure comprising
a standard Trie, called the access Trie, whose leaves are containers that can be
any data structure (linked lists by default). HAT-Trie [2] improves performance
by using hash tables instead of linked lists. Initially strings are only organized
in containers, but once the algorithm detects a container is inefficient, it bursts
the container into a Trie node, with multiple smaller containers as its leaves.
An advantage of this hybrid design is that it is more resistant to memory
wastage for skewed data. However, this data structure is not attractive for saving
common prefixes because: a) it does not minimize the storage of all common
prefixes, but only of those that are in the access Trie; b) the burst Trie does not
Adaptive Radix Tree. The Adaptive Radix Tree (ART) [18] further improves memory
efficiency, not only by applying the lazy expansion and path compression
optimizations, but also by adaptively changing the size of the pointer array in
nodes to minimize the number of unused pointers. To this end, ART uses nodes
with variable length, which grow in size when there is not enough space in their
arrays. We measured the effect of this optimization for RDF data, and report
the results in the third column of Table 2. As we can see from the table, the
adaptive node policy significantly boosts the memory efficiency compared to the
Compact Trie. Nevertheless, more than half of the pointer array entries are still
left unused. Therefore, for large Tries with many nodes, the memory waste
is still unacceptably high.
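The adaptive node policy can be sketched as follows. This is our illustrative approximation of the idea (a real ART also switches between different physical node layouts, which we omit); the capacity sequence matches the node sizes described in [18].

```python
# Sketch of ART-style adaptive nodes: a node starts with a tiny child array
# and grows to the next size class only when it runs out of slots, instead
# of always reserving |alphabet| pointers.
class AdaptiveNode:
    GROWTH = [4, 16, 48, 256]           # node size classes, as in ART

    def __init__(self):
        self.keys = []                   # occupied child characters
        self.children = []               # parallel list of child pointers
        self.capacity = AdaptiveNode.GROWTH[0]
        self.is_terminal = False

    def find(self, ch):
        # Linear scan; real ART uses SIMD / binary search for larger nodes.
        for k, child in zip(self.keys, self.children):
            if k == ch:
                return child
        return None

    def add(self, ch, child):
        if len(self.keys) == self.capacity:
            # Grow to the next size class before inserting.
            idx = AdaptiveNode.GROWTH.index(self.capacity)
            self.capacity = AdaptiveNode.GROWTH[idx + 1]
        self.keys.append(ch)
        self.children.append(child)
```

Even with this policy, a node that holds 5 children must reserve 16 slots, which is the residual waste observed in Table 2.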
List Trie. The last type of Trie that we considered is the List Trie [7]. This
Trie organizes the children pointers of each node in linked lists instead of arrays.
The advantage is that, unlike arrays, linked lists are not vulnerable to sparsity.
However, the price to pay is that linked lists do not support random accesses.
Therefore, in generic cases the performance of a List Trie is significantly lower
than that of other Trie variants. Our experiments (not shown because of space
limitations) show that a Standard Trie is more than two times faster than a List
Trie at storing English words in a dictionary, though the List Trie consumes 6.3
times less memory than the Standard Trie to do so.
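The List Trie organization can be sketched as follows. This is a generic illustrative implementation of the structure from [7], with names chosen by us.

```python
# List Trie: the children of a node form a singly linked sibling list instead
# of a pointer array, trading lookup speed for memory density.
class ListTrieNode:
    def __init__(self, ch):
        self.ch = ch               # character on the edge from the parent
        self.first_child = None    # head of this node's child list
        self.next_sibling = None   # next child of the same parent
        self.is_terminal = False

def _find_child(node, ch):
    child = node.first_child
    while child is not None and child.ch != ch:   # linear scan of siblings
        child = child.next_sibling
    return child

def lt_insert(root, key):
    node = root
    for ch in key:
        child = _find_child(node, ch)
        if child is None:
            child = ListTrieNode(ch)
            child.next_sibling = node.first_child  # prepend to the child list
            node.first_child = child
        node = child
    node.is_terminal = True

def lt_contains(root, key):
    node = root
    for ch in key:
        node = _find_child(node, ch)
        if node is None:
            return False
    return node.is_terminal
```

Each node now stores only the children that actually exist, but every descent step costs a linear scan of the sibling list.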
In the previous section we analyzed the existing Trie variants and showed why
none of them is ideal for RDF. In fact, while Tries remove the problem of string
redundancy, array-based Tries are still memory inefficient because of the excessive
number of unused pointers (Table 2), and list-based Tries cannot guarantee
good performance in generic cases.
To address these limitations, we propose a new variant of a List Trie and
use it as an optimized in-memory dictionary named RDFVault for dynamic and
streaming RDF data. There are three important factors that differentiate our
solution from existing methods. First, our Trie variant uses linked lists (despite
their general suboptimality) and improves the performance by introducing a
move-to-front policy. Second, our dictionary encoding approach removes the need
for a dedicated decoding data structure by using as ID the memory location of
the Trie node that represents the string. Finally, it further optimizes memory
usage by using two different types of nodes in the construction of the Trie. The
remainder of this section describes each of these points in more detail.
6 Evaluation
formance, we ran each experiment 10 times and report the average value to
minimize the overhead of garbage collection on the comparability of our results.
Data Structure Overhead. Figure 5(a) shows that in all cases the overhead of
RDFVault is lower than that of the conventional dictionary. It also shows that,
due to the string compression in RDFVault, the strings sometimes consume much
less space than the data structure (compared to the conventional dictionary).
This means that in some cases the data structure becomes the main source of
memory consumption in RDFVault. This is the observation that motivated our
last optimization, which introduces different data structures to implement the
nodes. Further optimization in this direction might be very effective in reducing
the overall memory consumption.
Note that although most existing dictionaries consist of two tables, it may
also be possible to confine a conventional dictionary to a single hash table.
Nonetheless, because strings are shared between the two tables, omitting one
table can at best halve the data structure overhead, while the memory consumption
of the strings remains intact. Thus, RDFVault still offers better
memory efficiency in all cases (see Figures 5 and 6).
Fig. 6: Memory usage considering both IRIs (left) and Literals (right). DS represents
the space occupied by the data structure, while STR is the space taken by the strings.
outperforms the baseline. On the other hand, our technique is less effective in
compressing the literals, even though in this case, too, our method always
outperforms the baseline.
We now focus our attention on the impact of our method during encoding and
decoding. To this end, we measure the time necessary to encode and decode all
terms in the datasets in the order they appear in the publicly available
serialization of the data (hereafter, input order). To be more specific, we ran the
same experiment once on IRIs and once on Literals. Terms were encoded and
decoded one after another, and the average encoding/decoding runtime per term
is reported in Figure 7.
As we can see from the left graph, in the worst case it takes about 650 ns to
encode a term, and about 450 ns to decode it. In general, the encoding
performance of our approach is comparable to that of the Trove hash map, and
in two cases (BioPortal, BTC 2014) the runtimes are even better. The figure
shows that encoding literals is often more expensive in both the conventional
dictionary and RDFVault. This suggests that the lower encoding performance
of literals could be because they are longer than IRIs. Similar reasoning applies
to the slow encoding speed of IRIs in the case of the DBPedia (EN) dataset:
because this dataset uses long IRIs, both the conventional dictionary and
RDFVault show slower encoding performance than average.
The right graph of Figure 7 presents the average decoding runtime of the RDF
terms. The results show that a hash map achieves up to 2.5 times better decoding
performance, even though in some cases the margin with RDFVault is
minimal. In theory, the time complexity of both approaches is proportional to
the length of the string (hash code calculation for the conventional dictionary,
and string reconstruction in RDFVault), but in practice RDFVault needs to
follow multiple references during the bottom-up traversal and concatenate the
substrings to reconstruct the original one. Therefore, it usually needs to execute
more instructions
Fig. 7: Encoding (left) and decoding (right) runtime of our approach against the baseline.
than the conventional dictionary. The positive result is that strings do not have
(on average) an excessive length. Therefore, the price that we pay for compress-
ing our input in terms of decoding speed remains rather limited.
8 Acknowledgment
This project was partially funded by the COMMIT project, and by the NWO
VENI project 639.021.335.
References
1. java.sizeof. http://sizeof.sourceforge.net/.
2. N. Askitis and R. Sinha. HAT-trie: a cache-conscious trie-based data structure
for strings. In Proceedings of the Thirtieth Australasian Conference on Computer
Science - Volume 62, pages 97–105. Australian Computer Society, Inc., 2007.
3. S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia:
A Nucleus for a Web of Open Data. Springer, 2007.
4. C. Binnig, S. Hildenbrand, and F. Färber. Dictionary-based order-preserving string
compression for main memory column stores. In SIGMOD. ACM, 2009.
5. L. Cheng, A. Malik, S. Kotoulas, T. E. Ward, and G. Theodoropoulos. Efficient
parallel dictionary encoding for RDF data.
6. J. G. Cleary and I. Witten. Data compression using adaptive coding and partial
string matching. IEEE Transactions on Communications, 32(4):396–402, 1984.
7. R. De La Briandais. File searching using variable length keys. In Papers Presented
at the March 3-5, 1959, Western Joint Computer Conference. ACM, 1959.
8. O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In Networked
Knowledge - Networked Media, pages 7–24. Springer, 2009.
9. J. D. Fernández, C. Gutierrez, and M. A. Martínez-Prieto. RDF compression: basic
approaches. In WWW, pages 1091–1092. ACM, 2010.
10. J. D. Fernández, M. A. Martínez-Prieto, and C. Gutierrez. Compact representation
of large RDF data sets for publishing and exchange. In ISWC. Springer, 2010.
11. E. Fredkin. Trie memory. Communications of the ACM, 3(9):490–499, 1960.
12. Google. Freebase data dumps. http://download.freebase.com/datadumps.
13. S. Heinz, J. Zobel, and H. E. Williams. Burst tries: a fast, efficient data structure
for string keys. ACM TOIS, 20(2):192–223, 2002.
14. T. Käfer and A. Harth. Billion Triples Challenge data set. Downloaded from
http://km.aifb.kit.edu/projects/btc-2014/, 2014.
15. D. E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching.
Addison-Wesley, 1998.
16. S. Kotoulas, E. Oren, and F. Van Harmelen. Mind the data skew: distributed
inferencing by speeddating in elastic regions. In WWW. ACM, 2010.
17. O. Lassila and R. R. Swick. Resource Description Framework (RDF) model and
syntax specification. 1999.
18. V. Leis, A. Kemper, and T. Neumann. The Adaptive Radix Tree: ARTful indexing
for main-memory databases. In ICDE, 2013 IEEE 29th International Conference
on, pages 38–49. IEEE, 2013.
19. M. A. Martínez-Prieto, J. D. Fernández, and R. Cánovas. Querying RDF dictionaries
in compressed space. ACM SIGAPP Applied Computing Review, 12(2):64–77, 2012.
20. D. R. Morrison. PATRICIA - practical algorithm to retrieve information coded in
alphanumeric. JACM, 15(4):514–534, 1968.
21. T. Neumann and G. Weikum. RDF-3X: a RISC-style engine for RDF. VLDB,
1(1):647–659, 2008.
22. N. F. Noy, N. H. Shah, P. L. Whetzel, B. Dai, M. Dorf, N. Griffith, C. Jonquet, D. L.
Rubin, M.-A. Storey, C. G. Chute, et al. BioPortal: ontologies and integrated data
resources at the click of a mouse. Nucleic Acids Research, 37:W170–W173, 2009.
23. E. H. Sussenguth Jr. Use of tree structures for processing files. Communications
of the ACM, 6(5):272–279, 1963.
24. J. Urbani, J. Maassen, and H. Bal. Massive semantic web data compression with
MapReduce. In HPDC, pages 795–802. ACM, 2010.
25. P. Yuan, P. Liu, B. Wu, H. Jin, W. Zhang, and L. Liu. TripleBit: a fast and
compact system for large scale RDF data. VLDB, 6(7):517–528, 2013.