Folded Trie: Efficient Data Structure For All of Unicode

Vladimir Weinstein
vweinste@us.ibm.com
Globalization Center of Competency, San Jose, CA

21st International Unicode Conference Dublin, Ireland, May 2002
Introduction
- A lot of data for each code point
- Need appropriate data structures
- Unicode version 3.1 introduced code points into the supplementary space
- Addressable range grew to more than a million code points
- Repetitive data
- Sparsely populated range, especially the supplementary space
Data Structures
Arrays
- Advantages: very fast access time, fast write time
- Disadvantage: unacceptable memory consumption
Hash tables
- Advantages: easy to use, reasonably fast, general
- Disadvantages: high overhead, complicated sequential access, slower than array lookup, data within ranges is not shared
Data Structures (continued)

Inversion Maps
- Advantages: simple, very compact, fast boolean operations
- Disadvantages: worse access time than arrays and possibly hash tables
For more details see Bits of Unicode at

http://www.macchiato.com/slides/Bits_of_Unicode.ppt
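A minimal sketch of an inversion-map lookup, assuming a sorted list of range start points with one value per range. The table contents and the names `STARTS`, `VALUES` and `inversion_lookup` are hypothetical and are not taken from the presentation. The binary search illustrates why access is slower than a plain array lookup.

```python
from bisect import bisect_right

# Hypothetical inversion map: sorted range start points, and the value that
# applies from each start point up to (but not including) the next one.
STARTS = [0x0000, 0x0041, 0x005B, 0x0061, 0x007B]
VALUES = ["other", "upper", "other", "lower", "other"]


def inversion_lookup(code_point):
    """Find the range containing code_point by binary search (O(log n))."""
    i = bisect_right(STARTS, code_point) - 1
    return VALUES[i]


print(inversion_lookup(ord("Q")), inversion_lookup(ord("q")))
```

Only five entries describe the whole code point range here, which is where the compactness comes from.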
Tries
- A trie is a structure with one or more indexes and one data storage
- Name comes from "information retrieval"
- Shares repetitive data
- Good compaction
- Not appropriate for frequently changing data
Single-Index Trie
A trie structure with an index array and a data array.
Advantages
- Excellent size
- Very good access performance (two array accesses, shift, mask and addition)
Disadvantages
- Not appropriate for frequently changing data
- Index array gets too big when dealing with supplementary code points
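Single-Index Trie Diagram

[Figure: a BMP code point (bits 15..0) is split into an upper part (UPPER_WIDTH bits) and a lower part (LOWER_WIDTH bits, extracted with LOWER_MASK). The upper part selects an entry in the Index, which points to a block in the Data Array; the lower part selects the data within that block.]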
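The lookup described for the single-index trie can be sketched as follows. This is an illustration under stated assumptions, not the ICU code: the block size (`LOWER_WIDTH = 5`) and the function names `build_single_index_trie` and `lookup` are hypothetical. Identical data blocks are stored once, which is where the compression comes from.

```python
LOWER_WIDTH = 5                         # assumed: 32 values per data block
LOWER_MASK = (1 << LOWER_WIDTH) - 1
BLOCK_SIZE = 1 << LOWER_WIDTH


def build_single_index_trie(values):
    """Split values (one per BMP code point) into blocks; share identical blocks."""
    index, data, seen = [], [], {}
    for start in range(0, len(values), BLOCK_SIZE):
        block = tuple(values[start:start + BLOCK_SIZE])
        if block not in seen:           # first time this block content is seen
            seen[block] = len(data)
            data.extend(block)
        index.append(seen[block])       # offset of the block in the data array
    return index, data


def lookup(index, data, code_point):
    # Two array accesses, a shift, a mask and an addition.
    return data[index[code_point >> LOWER_WIDTH] + (code_point & LOWER_MASK)]


# Hypothetical property: 1 for A..Z, 0 for everything else in the BMP.
values = [0] * 0x10000
values[0x41:0x5B] = [1] * 26
index, data = build_single_index_trie(values)
print(len(index), len(data), lookup(index, data, ord("Q")))
```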
Double-Index Trie
- Two index arrays and a data block
- Compared to single-index trie:
  1. Provides better compression of the index array
  2. Worse performance, but still very fast
  3. Feasible for supplementary code points
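Double-Index Trie Diagram

[Figure: a code point (bits 20..0) is split into upper, middle and lower parts (UPPER_WIDTH, MIDDLE_WIDTH, LOWER_WIDTH; the middle and lower parts are extracted with MIDDLE_MASK and LOWER_MASK). The upper part selects an entry in Index 1; that entry plus the middle part selects an entry in Index 2; that entry plus the lower part selects the value in a Data block.]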
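A minimal sketch of the double-index lookup, assuming widths of 4 (lower) and 6 (middle) bits. The widths and the names `share_blocks`, `build_double_index_trie` and `lookup` are hypothetical choices for illustration. The same block-sharing step is applied twice: once to the data, and once to the index that points into the data.

```python
LOWER_WIDTH = 4                         # assumed widths, for illustration only
MIDDLE_WIDTH = 6
LOWER_MASK = (1 << LOWER_WIDTH) - 1
MIDDLE_MASK = (1 << MIDDLE_WIDTH) - 1


def share_blocks(items, block_size):
    """Return (offsets, pool) where each distinct block is stored once in pool."""
    offsets, pool, seen = [], [], {}
    for start in range(0, len(items), block_size):
        block = tuple(items[start:start + block_size])
        if block not in seen:
            seen[block] = len(pool)
            pool.extend(block)
        offsets.append(seen[block])
    return offsets, pool


def build_double_index_trie(values):
    """values holds one entry per code point, 0..0x10FFFF."""
    flat_index2, data = share_blocks(values, 1 << LOWER_WIDTH)
    index1, index2 = share_blocks(flat_index2, 1 << MIDDLE_WIDTH)
    return index1, index2, data


def lookup(index1, index2, data, code_point):
    upper = code_point >> (LOWER_WIDTH + MIDDLE_WIDTH)
    middle = (code_point >> LOWER_WIDTH) & MIDDLE_MASK
    lower = code_point & LOWER_MASK
    # Three array accesses instead of two: slower, but still constant time.
    return data[index2[index1[upper] + middle] + lower]
```

Because the supplementary range is sparse, most entries of the second index are identical blocks, so sharing them keeps the total index small even for the full 21-bit range.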
Folded Trie
- Fast access for BMP code points
- Slower access for supplementary code points, but far less frequent
- Compacts supplementary index
- Needs additional build time processing
- Fast addressing with UTF-16 code units
  - no need to construct code point
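Folded Trie Supplementary Access Diagram

[Figure: the lead surrogate (a 16-bit code unit, top bits 110110) is looked up in the folded trie. If there is no data for its surrogate block, the result is the same for the whole surrogate block. Otherwise, the lead surrogate data and the low bits (9..0) of the trail surrogate (top bits 110111) form a pseudo code point, which goes through the index and data arrays to give the final data.]

BMP code point access is the same as with the single-index trie.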
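The supplementary access path can be sketched as below. This is a simplified model, not the UTrie implementation: it assumes integer data values, a block size of 32, and that the value stored for a lead surrogate is the position of its folded index block (0 meaning "no data"). It also overwrites any data for the lead surrogate code points themselves, which a real implementation would have to handle separately. All names are hypothetical.

```python
SHIFT = 5                               # assumed: 32 values per data block
BLOCK = 1 << SHIFT
MASK = BLOCK - 1
BMP_INDEX_LENGTH = 0x10000 >> SHIFT


def build_folded_trie(values, default=0):
    """Build (index, data) from a dict {code point: int value}."""
    bmp = [values.get(cp, default) for cp in range(0x10000)]
    folded = []                         # blocks indexed right above the BMP index
    for lead in range(0xD800, 0xDC00):
        first = 0x10000 + ((lead - 0xD800) << 10)
        chunk = [values.get(cp, default) for cp in range(first, first + 0x400)]
        if all(v == default for v in chunk):
            bmp[lead] = 0               # 0 marks "no data for this lead surrogate"
        else:
            bmp[lead] = BMP_INDEX_LENGTH + len(folded)
            folded.extend(chunk[i:i + BLOCK] for i in range(0, 0x400, BLOCK))

    index, data, seen = [], [], {}
    blocks = [bmp[i:i + BLOCK] for i in range(0, 0x10000, BLOCK)] + folded
    for block in blocks:
        key = tuple(block)
        if key not in seen:             # identical blocks are stored once
            seen[key] = len(data)
            data.extend(key)
        index.append(seen[key])
    return index, data


def lookup_bmp(trie, unit):
    index, data = trie
    return data[index[unit >> SHIFT] + (unit & MASK)]


def lookup_supplementary(trie, lead, trail, default=0):
    """Look up a surrogate pair without constructing the code point."""
    index, data = trie
    offset = lookup_bmp(trie, lead)     # lead surrogate goes through the BMP path
    if offset == 0:
        return default                  # same value for the whole surrogate block
    low = trail & 0x3FF                 # bits 9..0 of the trail surrogate
    return data[index[offset + (low >> SHIFT)] + (low & MASK)]
```

In this model only lead surrogates that actually carry data add entries to the index, so an empty supplementary range costs nothing beyond the BMP index.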
ICU Implementation: UTrie

- ICU implementation is called UTrie
- Stores either 16 bit or 32 bit wide data (extensible in the future)
- Up to 256K different data elements
- Can be frozen and reused as memory mapped image for fast startup
- Using UTrie requires custom code
More about ICU at the end of presentation

Range Enumeration
- Allows enumerating over a set of contiguous maximal ranges of same data elements
- Elements can be preprocessed by additional callback
- Saves time when processing the whole Unicode range by efficiently walking the trie structure

[Figure: a range from start to limit-1 in which every code point maps to Element 2; start-1 maps to Element 1 and limit maps to Element 3.]
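A minimal sketch of range enumeration over a single-index layout (an `index` list of block offsets into a `data` list). The function name `enumerate_ranges` and its `transform` parameter, which stands in for the preprocessing callback, are hypothetical. The walk skips a block outright when it is the same uniform block that was just processed, which is one way to avoid visiting every code point.

```python
_UNSET = object()                       # sentinel: no value seen yet


def enumerate_ranges(index, data, block_size, transform=lambda v: v):
    """Yield (start, limit, value) for maximal runs of equal transformed values."""
    start, prev_value, uniform_block = 0, _UNSET, None
    cp = 0
    for offset in index:
        if offset == uniform_block:
            cp += block_size            # same uniform block again: skip it whole
            continue
        uniform_block = offset
        for j in range(block_size):
            value = transform(data[offset + j])
            if value != prev_value:
                if start < cp:
                    yield start, cp, prev_value
                if j > 0:
                    uniform_block = None    # value changed mid-block: not uniform
                start, prev_value = cp, value
            cp += 1
    if start < cp:
        yield start, cp, prev_value


# Tiny hand-made trie: blocks of 4 values, the first block is reused three times.
index = [0, 0, 4, 0]
data = [5, 5, 5, 5, 5, 6, 6, 6]
for start, limit, value in enumerate_ranges(index, data, 4):
    print(start, limit, value)
```

Each reported range is half-open: it covers `start` up to `limit - 1`, matching the start/limit labels in the figure.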
Latin-1 Fast Path

- Build time option
- Allows direct array access for the Latin-1 range (0x00-0xFF)
- Latin-1 range is not compressed if this option is used
- Appropriate when access for Latin-1 range is critical
  - collation
Example: Normalization Data

- Normalization data is stored using UTries
- For example, main data has the following format:
  Bits 31-16: Extra data index. Can be either:
    - index to variable length data
    - first part of supplementary lookup value
    - special handling indicator (Hangul, Jamo)
  Bits 15-8: Combining class
  Bit 7: BCK (combines back)
  Bit 6: FWD (combines forward)
  Bits 5-4: QC_MAYBE (values for normalization quick check)
  Bits 3-0: QC_NO (values for normalization quick check)
Variable-length data contains composition and decomposition info
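As an illustration of how such a packed word might be taken apart, here is a small sketch. The field boundaries are an assumption read off the bit labels on the slide (31, 15, 7, 6, 5, 3, 0), and the function and key names are hypothetical rather than ICU identifiers.

```python
def decode_norm_word(word):
    """Split a 32-bit normalization data word into its assumed fields."""
    return {
        "extra_index": (word >> 16) & 0xFFFF,   # bits 31..16
        "combining_class": (word >> 8) & 0xFF,  # bits 15..8
        "combines_back": bool(word & 0x80),     # bit 7 (BCK)
        "combines_forward": bool(word & 0x40),  # bit 6 (FWD)
        "qc_maybe": (word >> 4) & 0x3,          # bits 5..4
        "qc_no": word & 0xF,                    # bits 3..0
    }


# Made-up word: extra index 0x1234, combining class 230, combines back.
fields = decode_norm_word((0x1234 << 16) | (230 << 8) | 0x80 | 0x02)
print(fields)
```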

Example: Character Properties Data

- The result of UTrie lookup is an index
- Double indexing allows for even better compression, since many code points have the same property value
- UTrie data width is 16 bits (thousands of data entries), while the property data width is 32 bits (a few hundred unique data words)

[Figure: the folded trie (index and 16-bit data) yields an index into the 32-bit property data.]
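The double indexing can be sketched as a deduplication step: each distinct 32-bit property word is stored once, and the trie stores only a small index to it. The function name `intern_property_words` and the sample words are hypothetical.

```python
def intern_property_words(words):
    """Return (indexes, table): table holds each distinct word once,
    and indexes[i] is the position in table of words[i]."""
    table, position, indexes = [], {}, []
    for word in words:
        if word not in position:
            position[word] = len(table)
            table.append(word)
        indexes.append(position[word])
    return indexes, table


# Made-up 32-bit property words for four consecutive code points.
words = [0x10000001, 0x10000001, 0x20000005, 0x10000001]
indexes, table = intern_property_words(words)
# The small indexes are what the trie would store as 16-bit data;
# the final property word is table[trie_value].
print(indexes, [hex(w) for w in table])
```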
International Components for Unicode

- International Components for Unicode (ICU) is a library that provides robust and full-featured Unicode support
- Several library services use the common UTrie implementation
- Wide variety of supported platforms
- Open source (X license, non-viral)
- C/C++ and Java versions
- http://oss.software.ibm.com/icu/
Conclusion
- UTrie data structure provides good compression with fast access
- The main constraint for usage is the nature of the data that needs to be stored
- Designed for repetitive and sparse data
Q&A
Folding and Surrogate Access

- Folding process compacts the index for supplementaries and moves it right above the BMP index
- Access in ICU4C:
  - Define a C callback, invoked when special lead surrogate is detected
  - Manually detect special lead surrogates
- In ICU4J, provide a subclass with a method that detects special lead surrogates
Summary
- Introduction: Storing Unicode data
- Types of data structures
- Tries
- Single-index trie
- Double-index trie
- Folded trie
- Usage of folded trie in normalization
- Usage of folded trie for character properties
