Vladimir Weinstein
vweinste@us.ibm.com
Introduction
A lot of data for each code point
Need appropriate data structures
Unicode version 3.1 introduced code points into supplementary space; the addressable range grew to more than a million code points
Repetitive data
Sparsely populated range, especially the supplementary space
21st International Unicode Conference Dublin, Ireland, May 2002
Data Structures
Arrays
Advantages: very fast access time, fast write time
Disadvantage: unacceptable memory consumption
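To put a number on the memory cost, here is a back-of-the-envelope calculation. It assumes one 32-bit value per code point, which is an assumption of this example and not a figure from the slides.

```python
CODE_POINTS = 0x110000        # U+0000..U+10FFFF
BYTES_PER_VALUE = 4           # assumption: one 32-bit value per code point

# Size of a flat array that holds one value for every code point.
flat_bytes = CODE_POINTS * BYTES_PER_VALUE
print(f"{CODE_POINTS:,} entries x {BYTES_PER_VALUE} bytes = {flat_bytes / 2**20:.2f} MiB")
```

That cost applies to each property stored this way, even though most of the range holds the same default value.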
Hash tables
Advantages: easy to use, reasonably fast, general
Disadvantages: high overhead, complicated sequential access, slower than array lookup, data within ranges is not shared
Tries
A trie is a structure with one or more indexes and one data storage.
The name comes from "information retrieval"
Shares repetitive data
Good compaction
Not appropriate for frequently changing data
Single-Index Trie
A trie structure with an index array and a data array.
Advantages: excellent size; very good access performance (two array accesses, a shift, a mask and an addition)
Disadvantages: not appropriate for frequently changing data; the index array gets too big when dealing with supplementary code points
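The following is a minimal sketch of a single-index trie, not ICU's implementation. The block size of 32, the function names and the toy property are all assumptions made for illustration.

```python
SHIFT = 5                  # assumption: 2**5 = 32 code points per data block
BLOCK = 1 << SHIFT
MASK = BLOCK - 1

def build_single_index_trie(values):
    """Split `values` into fixed-size blocks and store each distinct block once."""
    assert len(values) % BLOCK == 0
    index, data, seen = [], [], {}
    for start in range(0, len(values), BLOCK):
        block = tuple(values[start:start + BLOCK])
        if block not in seen:          # first time we see this block: store it
            seen[block] = len(data)
            data.extend(block)
        index.append(seen[block])      # offset of the block in the data array
    return index, data

def lookup(index, data, c):
    # Two array accesses, a shift, a mask and an addition.
    return data[index[c >> SHIFT] + (c & MASK)]

# Toy property over the BMP: 1 for ASCII digits and U+0370..U+03FF, else 0.
values = [0] * 0x10000
for c in list(range(0x30, 0x3A)) + list(range(0x370, 0x400)):
    values[c] = 1

index, data = build_single_index_trie(values)
print(len(index), "index entries,", len(data), "data entries")
```

With this toy input the 65,536 values collapse into four distinct data blocks, while the index still needs one entry per block of the covered range. That second point is why the index grows too large once the supplementary range is included.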
[Figure: single-index trie lookup. Labels: Index, Data Array, Data, Block]
Double-Index Trie
Two index arrays and a data block
Compared to single-index trie:
1. Provides better compression of the index array
2. Worse performance, but still very fast
3. Feasible for supplementary code points
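The comparison can be sketched by applying the same block-sharing step twice: once to the data and once to the index itself. The shift widths, names and toy data below are assumptions for illustration, not values taken from ICU.

```python
DATA_SHIFT = 4             # assumption: 16 code points per data block
INDEX_SHIFT = 6            # assumption: 64 entries per second-level index block
DATA_BLOCK = 1 << DATA_SHIFT
INDEX_BLOCK = 1 << INDEX_SHIFT

def compact(items, block_len):
    """Store each distinct block of `items` once; return (offsets, storage)."""
    offsets, storage, seen = [], [], {}
    for start in range(0, len(items), block_len):
        block = tuple(items[start:start + block_len])
        if block not in seen:
            seen[block] = len(storage)
            storage.extend(block)
        offsets.append(seen[block])
    return offsets, storage

def build_double_index_trie(values):
    stage2, data = compact(values, DATA_BLOCK)       # one entry per data block
    index1, index2 = compact(stage2, INDEX_BLOCK)    # compress the index as well
    return index1, index2, data

def lookup(index1, index2, data, c):
    # Three array accesses instead of two.
    i2 = index1[c >> (DATA_SHIFT + INDEX_SHIFT)] + ((c >> DATA_SHIFT) & (INDEX_BLOCK - 1))
    return data[index2[i2] + (c & (DATA_BLOCK - 1))]

# Toy property over the full range, including one supplementary block.
values = [0] * 0x110000
for c in list(range(0x30, 0x3A)) + list(range(0x370, 0x400)) + list(range(0x1D400, 0x1D800)):
    values[c] = 1

index1, index2, data = build_double_index_trie(values)
print(len(index1) + len(index2), "index entries in total")
```

A single index at the same data block size would need one entry for each of the 69,632 data blocks. Here the two index arrays together are far smaller, at the price of one extra array access per lookup.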
Folded Trie
Fast access for BMP code points
Slower access for supplementary code points, but these are far less frequent
Compacts the supplementary index
Needs additional build-time processing
Fast addressing with UTF-16 code units: no need to construct the code point
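A simplified sketch of the folding idea follows. For clarity it keeps the per-lead-surrogate index offset in a separate list; that layout, the block size and all names are assumptions of this example and do not describe ICU's actual data format.

```python
SHIFT = 5                      # assumption: 32 values per data block
BLOCK = 1 << SHIFT
MASK = BLOCK - 1

def build_folded_trie(values, default=0):
    """values: list of 0x110000 entries. Returns (index, data, lead_offset)."""
    index, data, seen = [], [], {}

    def add_blocks(start, limit):
        for s in range(start, limit, BLOCK):
            block = tuple(values[s:s + BLOCK])
            if block not in seen:
                seen[block] = len(data)
                data.extend(block)
            index.append(seen[block])

    add_blocks(0, 0x10000)                  # BMP part of the index: 2048 entries
    lead_offset = [0] * 0x400               # 0 means "all default for this lead"
    for i in range(0x400):                  # one slot per lead surrogate
        base = 0x10000 + (i << 10)          # first code point under this lead
        if any(v != default for v in values[base:base + 0x400]):
            lead_offset[i] = len(index)     # where this lead's 32 entries start
            add_blocks(base, base + 0x400)
    return index, data, lead_offset

def lookup_bmp(trie, unit):
    index, data, _ = trie
    return data[index[unit >> SHIFT] + (unit & MASK)]

def lookup_pair(trie, lead, trail, default=0):
    # Works directly on the two UTF-16 code units; no code point is assembled.
    index, data, lead_offset = trie
    offset = lead_offset[lead - 0xD800]
    if offset == 0:
        return default
    t = trail & 0x3FF
    return data[index[offset + (t >> SHIFT)] + (t & MASK)]

values = [0] * 0x110000
values[0x41] = 3
for c in range(0x1D400, 0x1D800):
    values[c] = 7

trie = build_folded_trie(values)
print(lookup_bmp(trie, 0x41), lookup_pair(trie, 0xD835, 0xDC00))
```

Only lead surrogates that have non-default data get index entries, so the supplementary part of the index stays small. The extra pass over the supplementary range is the additional build-time processing.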
[Figure: folded trie lookup flow. Labels: Index + Data, Data, Final Data]
Range Enumeration
Allows enumerating over a set of contiguous maximal ranges of same data elements
Elements can be preprocessed by an additional callback
Saves time when processing the whole Unicode range by efficiently walking the trie structure
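A sketch of range enumeration over a single-index trie is shown below, under the assumption of a 32-entry block size. The names `enum_ranges` and `transform` are hypothetical. The walk skips a whole block when it is the same uniform block that was just processed.

```python
SHIFT = 5                  # assumption: 32 code points per data block
BLOCK = 1 << SHIFT
_UNSET = object()          # sentinel: no value seen yet

def build_trie(values):
    """Minimal single-index trie: each distinct block is stored once."""
    index, data, seen = [], [], {}
    for start in range(0, len(values), BLOCK):
        block = tuple(values[start:start + BLOCK])
        if block not in seen:
            seen[block] = len(data)
            data.extend(block)
        index.append(seen[block])
    return index, data

def enum_ranges(index, data, transform=lambda v: v):
    """Yield (start, limit, value) for maximal runs of equal transformed values."""
    start, prev_value = 0, _UNSET
    prev_block = -1        # offset of the last block known to be uniform
    c = 0
    for block in index:
        if block == prev_block:
            c += BLOCK     # same uniform block again: skip it without scanning
            continue
        prev_block = block
        for j in range(BLOCK):
            value = transform(data[block + j])
            if value != prev_value:
                if start < c:
                    yield start, c, prev_value
                if j > 0:
                    prev_block = -1    # value changed mid-block: not uniform
                start, prev_value = c, value
            c += 1
    if start < c:
        yield start, c, prev_value

# Toy property over the BMP: 1 for ASCII digits and U+0370..U+03FF, else 0.
values = [0] * 0x10000
for cp in list(range(0x30, 0x3A)) + list(range(0x370, 0x400)):
    values[cp] = 1

index, data = build_trie(values)
for first, limit, value in enum_ranges(index, data):
    print(f"U+{first:04X}..U+{limit - 1:04X}: {value}")
```

Passing a `transform` callback merges ranges whose raw values differ but map to the same result, which corresponds to the preprocessing step mentioned above.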
[Figure: a maximal range from start to limit-1 whose code points all map to Element 2; the code point at limit maps to Element 3]
Conclusion
The UTrie data structure provides good compression with fast access
The main constraint for usage is the nature of the data that needs to be stored
Designed for repetitive and sparse data
Q&A
In ICU4J, provide a subclass with a method that detects special lead surrogates
Summary
Introduction: Storing Unicode data
Types of data structures
Tries
Single-index trie
Double-index trie
Folded trie
Usage of folded trie in normalization
Usage of folded trie for character properties