The Multilingual Lion: TeX learns to speak Unicode
Jonathan Kew
SIL International
April 7, 2005
27th Internationalization and Unicode Conference, Berlin, Germany, April 2005

Background
TeX: free typesetting system with a 25-year history
    stable, reliable, flexible, widely implemented
    experienced user community
    rich collection of supporting tools
Originally designed for English typesetting
    support for accents and other European characters
    language support extended via custom fonts, macros, and preprocessors

Traditional TeX input conventions
Input text is ASCII (or 8-bit codepage)

Source text      Typeset output   Notes
\'{a}            á                typical accent command
\c{c}            ç
\aa              å
---              —                ligature in typical TeX fonts
$\alpha$         α                math mode symbol
{\dn acchaa}     अच्छा             using custom preprocessor
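
To make the contrast with Unicode input concrete, the accent commands in the table can be read as "base letter plus combining mark". The sketch below is only an illustration in Python, not part of TeX: the `ACCENTS` table and the `accent_to_unicode` helper are hypothetical names, and only the two accent commands from the table are covered.

```python
import unicodedata

# Hypothetical lookup table: TeX accent command -> Unicode combining mark.
ACCENTS = {
    "'": "\u0301",  # \'{a}: combining acute accent
    "c": "\u0327",  # \c{c}: combining cedilla
}

def accent_to_unicode(command, base):
    """Compose base letter and accent into a single precomposed character (NFC)."""
    return unicodedata.normalize("NFC", base + ACCENTS[command])

print(accent_to_unicode("'", "a"))  # á (U+00E1)
print(accent_to_unicode("c", "c"))  # ç (U+00E7)
```

With Unicode input, the author simply types the composed character and no such escape layer is needed.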

Multilingual typesetting with TeX
Text input
    Escape sequences for non-ASCII characters
    Multiple 8-bit codepages
    Preprocessors for complex scripts
Font support
    Fonts limited to 256 glyphs
    Custom-encoded fonts with specific glyph sets
All tied together via complex TeX macros
    Difficult to understand and extend
    Difficult to integrate with other packages

Towards a cleaner solution
Unicode: all required characters directly represented
    no need for escape sequences to access characters not included in the current codepage
    no need to switch between codepages according to the language/script being typeset
    characters rendered via standard access codes
Character/glyph model and modern font rendering technologies
    complex script handling moved out of the domain of the text data stream

Typesetting Unicode text with XeTeX
Accented characters
\halign{#\hfil\quad&
#\hfil\cr
dan&dan\cr
dubok&dubok\cr
dabe&ak\cr
din&dabe\cr
Din&din\cr
ak&Din\cr
Evropa&Evropa\cr}
dan dan
dubok dubok
dabe ak
din dabe
Din din
ak Din
Evropa Evropa

Typesetting Unicode text with XeTeX
CJK ideographs
\font\han="STSong" at 16pt
\font\rom="Gentium" at 8pt
\def\hc#1#2{\vtop{\hbox{\han#1}
\hbox{\kern10pt\rom#2}}}
\vtop{\hc{}{ka-ku}
\hc{}{motto-mo}
\hc{}{sai-go}
\hc{}{hatara-ku}
\hc{}{umi}}

ka-ku

motto-mo

sai-go

hatara-ku

umi

Typesetting Unicode text with XeTeX
Complex scripts
\c1
\s
\p
\v1
\v2
\v3
[Typeset output of the marked-up sample in a complex script; the script text itself did not survive extraction]

Key changes from TeX to XeTeX
Unicode as the text encoding
    directly use Unicode input text, Unicode-encoded fonts
Fonts and rendering technologies
    use any fonts available in the host computer
    use existing smart-font rendering systems
Additional features for multilingual typesetting
    optional font features
    line breaking for Asian scripts
Backward compatibility issues
    support for legacy TeX fonts and documents

From 8 to 16 bits
Character type in TeX code was 8-bit value
    one option: process text as UTF-8
Character codes used to index a number of tables
    character category, case pairs, etc.
Decision to use 16-bit character codes
    all 256-element tables enlarged to 65,536 elements to match the extended character set
    extended TeX commands that refer to character codes
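
The idea of per-character tables indexed by character code can be sketched as follows. This is a toy model in Python, not the engine's actual code: `make_catcode_table` and `catcode` are hypothetical helpers, and only the category codes 11 (letter) and 12 (other), which are TeX's real values, are modelled.

```python
LETTER, OTHER = 11, 12  # TeX category codes for letters and "other" characters

def make_catcode_table(size):
    """Build a category-code table with one entry per character code."""
    table = [OTHER] * size
    for code in range(ord("A"), ord("Z") + 1):
        table[code] = LETTER
    for code in range(ord("a"), ord("z") + 1):
        table[code] = LETTER
    return table

def catcode(table, ch):
    """Look up a character's category; fail if its code is outside the table."""
    code = ord(ch)
    if code >= len(table):
        raise ValueError(f"U+{code:04X} does not fit a {len(table)}-entry table")
    return table[code]

narrow = make_catcode_table(256)     # 8-bit engine
wide = make_catcode_table(65536)     # 16-bit engine
wide[0x03B1] = LETTER                # mark Greek alpha as a letter

print(catcode(wide, "\u03b1"))       # 11
```

With the 256-entry table the same lookup for U+03B1 fails, which is why every such table had to grow along with the character type.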

From 8 to 16 bits and beyond?
Unicode does not fit in 16 bits either!
XeTeX handles non-BMP characters as UTF-16 surrogate pairs
    properties of individual characters cannot be set
    unlikely to matter for typesetting usage: all surrogate codes can be treated as simple printable characters
    keeps size of internal tables moderate, without extensive restructuring
Using UTF-16 happens to match the font rendering APIs that XeTeX uses
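
For readers unfamiliar with surrogate pairs, the standard UTF-16 arithmetic is short enough to show. The sketch below is illustrative Python; the function name `to_utf16_units` is hypothetical, and it checks itself against Python's built-in UTF-16 codec.

```python
def to_utf16_units(ch):
    """Return the UTF-16 code units (one or two 16-bit values) for one character."""
    code = ord(ch)
    if code < 0x10000:
        return [code]                      # BMP character: a single 16-bit unit
    offset = code - 0x10000                # 20 bits remain to be split
    high = 0xD800 + (offset >> 10)         # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> low surrogate
    return [high, low]

units = to_utf16_units("\U0001D11E")       # U+1D11E, outside the BMP
print([hex(u) for u in units])             # ['0xd834', '0xdd1e']

# Cross-check against the built-in codec (big-endian, no byte order mark).
encoded = "\U0001D11E".encode("utf-16-be")
assert encoded == b"".join(u.to_bytes(2, "big") for u in units)
```

Each surrogate is an ordinary 16-bit code to the engine's tables, which is why per-character properties cannot be assigned to a non-BMP character as a whole.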

Implementing the character/glyph model
Required for support of complex scripts in Unicode
Significant change from traditional TeX model
    TeX regards a specific character code in a specific font as the fundamental unit of text to be typeset
    assumes such a character has known, fixed dimensions
    provision for ligatures by character substitutions
    a paragraph consists of a sequence of character nodes, to be precisely placed, and intervening glue nodes
A Unicode character may not map to a single, known glyph
    many scripts require contextual selection of glyphs
    must measure characters in context, not in isolation

Implementing the character/glyph model
Initial implementation using ATSUI on Mac OS X
    typesetting process collects runs of characters (words)
    calls ATSUI text layout APIs to measure width
    a XeTeX paragraph consists of a sequence of word nodes separated by glue
Typesetting engine positions words, not glyphs
    this is the job of the font rendering engine
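
The word-node approach can be sketched as below. This is a simplified Python model under stated assumptions: `WordNode`, `GlueNode`, and `build_paragraph` are hypothetical names, and `measure_run` is a stand-in with a made-up fixed advance per character, where the real engine would ask the font rendering system (ATSUI in the initial implementation) for the width of the whole run.

```python
from dataclasses import dataclass

@dataclass
class WordNode:
    text: str      # the run of characters, kept together as one unit
    width: float   # width reported by the layout engine for the whole run

@dataclass
class GlueNode:
    width: float   # natural width of the inter-word space

def measure_run(text, font_size):
    """Stand-in for the layout engine: assumes a fixed advance per character."""
    return 0.5 * font_size * len(text)

def build_paragraph(text, font_size=10.0):
    """Turn a line of text into word nodes separated by glue nodes."""
    nodes = []
    for i, word in enumerate(text.split()):
        if i > 0:
            nodes.append(GlueNode(width=font_size / 3))
        nodes.append(WordNode(word, measure_run(word, font_size)))
    return nodes

for node in build_paragraph("the quick fox"):
    print(node)
```

The line breaker only ever sees the widths of these nodes; which glyphs make up each word, and where they sit, stays inside the rendering engine.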

Implementing the character/glyph model
Nodes in a TeX paragraph          Corresponding nodes in XeTeX
[Diagram: on the TeX side, one char node per letter, with word-space glue nodes between the words; on the XeTeX side, one word node per word, separated by the same glue nodes]

Implementing the character/glyph model
OpenType Layout support using ICU library
    alternative font layout engine
    provides support for OpenType features in Latin fonts
    supports a number of complex (Indic/Asian) scripts
XeTeX uses either ATSUI or ICU according to layout tables found in fonts
    overall typesetting process is independent of font technology in use
    distinction required only at lowest level of measuring a run of text in a given font
    documents may freely mix AAT and OT fonts
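
Choosing an engine "according to layout tables found in fonts" could look roughly like the sketch below. The table tags are the standard ones for each technology (`mort`/`morx` for AAT, `GSUB`/`GPOS` for OpenType), but everything else is an assumption made for illustration: the function name `choose_engine`, the precedence when a font carries both kinds of tables, and the fallback for fonts with neither are not taken from the talk and may not match XeTeX's actual logic. The `requested` argument models an explicit override such as the `/ICU` and `/AAT` font-name suffixes used in the font examples on other slides.

```python
AAT_TABLES = {"mort", "morx"}        # AAT glyph metamorphosis tables
OPENTYPE_TABLES = {"GSUB", "GPOS"}   # OpenType substitution and positioning tables

def choose_engine(table_tags, requested=None):
    """Pick a layout engine for a font from its table tags (illustrative only)."""
    if requested == "AAT":
        return "ATSUI"
    if requested == "ICU":
        return "ICU"
    tags = set(table_tags)
    if tags & AAT_TABLES:            # assumed precedence: AAT tables win
        return "ATSUI"
    if tags & OPENTYPE_TABLES:
        return "ICU"
    return "ATSUI"                   # assumed fallback for plain fonts

print(choose_engine(["cmap", "glyf", "GSUB", "GPOS"]))   # ICU
print(choose_engine(["cmap", "glyf", "morx"]))           # ATSUI
```

Because the decision is made per font, a single document can mix fonts that end up on different engines.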

Implementing the character/glyph model
ATSUI APIs used in typesetting
    ATSUCreateStyle, ATSUSetAttributes
    ATSUCreateTextLayout, ATSUSetTextPointerLocation, ATSUSetRunStyle
    ATSUGetUnjustifiedBounds, ATSUDrawText
ICU APIs used in typesetting
    ubidi_open, ubidi_close, ubidi_setPara, ubidi_getDirection, ubidi_countRuns, ubidi_getVisualRun
    LayoutEngine::layoutChars, getGlyphs, getGlyphPositions

Hyphenation support
Paragraphs formed of lists of word boxes
    treated as indivisible units in the token list
    allows TeX to remain unaware of low-level details
If acceptable line breaks not found, hyphenation required
    extract text characters from word nodes
    find hyphen positions using TeX's algorithm
    repackage words as word fragments and discretionary break nodes
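
Hyphenation support
Modifying the node list to allow hyphenation
    before: [Two] glue [different] glue [foxes]
    after:  [Two] glue [dif] hyphen [fer] hyphen [ent] glue [foxes]
Problem: unused hyphen points break rendering
    [Two] glue [dif] [fer] hyphen / [ent] glue [foxes]
    output: "Two differ- / ent foxes", with "dif" and "fer" set as separate fragments, so the "ff" ligature across the unused hyphen point is lost
Need to re-merge word nodes after choosing breaks
    [Two] glue [differ-] / [ent] glue [foxes]
    output: "Two differ- / ent foxes", with the "ff" ligature restored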
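
The split-then-re-merge step can be sketched as below. This is an illustrative Python model, not the engine's code: `split_word` and `remerge` are hypothetical helpers, and the hyphen positions are passed in by hand rather than found by TeX's hyphenation algorithm.

```python
def split_word(word, hyphen_points):
    """Cut a word into fragments at the given character offsets."""
    fragments, start = [], 0
    for point in hyphen_points:
        fragments.append(word[start:point])
        start = point
    fragments.append(word[start:])
    return fragments

def remerge(fragments, chosen_break):
    """Rejoin fragments once the line breaker has decided.

    chosen_break is the index of the fragment after which the line is
    broken, or None if no hyphen point in this word was used.
    """
    if chosen_break is None:
        return ["".join(fragments)]          # whole word goes back to one node
    head = "".join(fragments[:chosen_break + 1]) + "-"
    tail = "".join(fragments[chosen_break + 1:])
    return [head, tail]

fragments = split_word("different", [3, 6])
print(fragments)               # ['dif', 'fer', 'ent']
print(remerge(fragments, 1))   # ['differ-', 'ent']
print(remerge(fragments, None))  # ['different']
```

After re-merging, each resulting piece is measured and rendered as a single run again, so ligatures and contextual forms inside it are formed as usual.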

Advanced font features
OpenType language systems
\font\Doulos="Doulos SIL/ICU"
\font\DoulosViet="Doulos SIL/ICU:language=VIT"
    \Doulos:     Unicode cung cấp một con số duy nhất cho mỗi ký tự (default glyphs)
    \DoulosViet: Unicode cung cấp một con số duy nhất cho mỗi ký tự (language-specific glyph variants)
\font\Brioso="Brioso Pro"
\font\BriosoTrk="Brioso Pro:language=TRK"
    \Brioso:     gelen firmalar tarafından (with "fi" ligature)
    \BriosoTrk:  gelen firmalar tarafından (no "fi" ligature)

Advanced font features
Custom AAT features
\font\Doulos="Doulos SIL/AAT"
\font\DoulosAlt="Doulos SIL/AAT:
  Alternate forms=Literacy alternates,
  Small v-hook straight style;
  Uppercase Eng alternates=Capital N with tail"
Xsee na Mose o
utitotokeke la anyi,
eye wna wohl u e
trutiwo u bene dla
si atsr' ggbeviwo la
nagaw nuvevi Israel
viwo ya o.
[The same text set in \DoulosAlt, showing the alternate glyph forms; not representable in the extracted text]

East Asian languages
Line breaking without word spaces
    TeX normally breaks lines at glue arising from spaces
    Chinese, Japanese, Thai, etc. do not use word spaces
    [Sample paragraph set without locale-aware line breaking; text lost in extraction]
Use ICU line-break: \XeTeXlinebreaklocale "th"
    [The same paragraph set with ICU line breaking; text lost in extraction]
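
Backward compatibility
Legacy TeX fonts, especially for math mode
    supported via TeX font metrics and Type 1 font files
    allow many existing TeX documents to work
    not Unicode-compliant!
$$\biggl(\int_{-\infty}^{\infty} e^{-x^2}\,dx\biggr)^2
  = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2+y^2)}\,dx\,dy
  = \int_0^{2\pi}\int_0^{\infty} e^{-r^2}\,r\,dr\,d\theta
  = \int_0^{2\pi}\biggl(-{e^{-r^2}\over 2}\bigg|_{r=0}^{r=\infty}\biggr)\,d\theta
  = \pi.$$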

Backward compatibility
Non-Unicode input text
    by default, input read as Unicode (UTF-8 or UTF-16)
    legacy codepages supported via ICU converters
    set codepage of current input file:
        \XeTeXinputencoding "charset-name"
    set initial codepage for newly-opened input files:
        \XeTeXdefaultencoding "charset-name"
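
What an input converter does with a legacy codepage can be illustrated as follows. The sketch is Python and uses Python's own codecs as a stand-in for the ICU converters named above; `read_source` is a hypothetical helper, and `cp1252` is just one example of a charset name.

```python
def read_source(data, encoding="utf-8"):
    """Decode raw input bytes to Unicode text using the declared encoding."""
    return data.decode(encoding)

legacy_bytes = "café".encode("cp1252")     # 8-bit codepage: é is the single byte 0xE9

# Declaring the right codepage recovers the intended text.
print(read_source(legacy_bytes, "cp1252"))  # café

# Treating the same bytes as UTF-8 (the default) fails, because 0xE9 alone
# is not a valid UTF-8 sequence.
try:
    read_source(legacy_bytes)
except UnicodeDecodeError as err:
    print("not valid UTF-8:", err.reason)
```

Once decoded, the text is ordinary Unicode and the rest of the pipeline does not need to know which codepage the file was stored in.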

Backward compatibility
Support for legacy keying practices
    typical input: ``\TeX''---a typesetting system
    generates: ``TeX''---a typesetting system
Font mapping for compatibility
; TECkit mapping for TeX input conventions
U+002D U+002D <> U+2013 ; -- -> en dash
U+002D U+002D U+002D <> U+2014 ; --- -> em dash
U+0027 <> U+2019 ; ' -> right single quote
U+0027 U+0027 <> U+201D ; '' -> right double quote
U+0022 > U+201D ; " -> right double quote
    generates: “TeX”—a typesetting system
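
The effect of such a mapping can be approximated with a longest-match substitution over the input text. The sketch below is illustrative Python, not TECkit itself: `RULES` transcribes only the five rules listed on the slide, and `apply_mapping` is a hypothetical helper that ignores TECkit features such as contexts and reverse mapping.

```python
# Input sequence -> replacement, transcribed from the mapping on the slide.
RULES = {
    "---": "\u2014",  # em dash
    "--": "\u2013",   # en dash
    "''": "\u201d",   # right double quote
    "'": "\u2019",    # right single quote
    '"': "\u201d",    # right double quote
}

def apply_mapping(text):
    """Replace input sequences left to right, always trying the longest rule first."""
    keys = sorted(RULES, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(RULES[key])
                i += len(key)
                break
        else:                      # no rule matched at this position
            out.append(text[i])
            i += 1
    return "".join(out)

print(apply_mapping("pages 3--5"))
print(apply_mapping("TeX''---a typesetting system"))
```

Trying the longest sequence first is what keeps `---` from being read as an en dash followed by a stray hyphen.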
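
More fun with font mappings
\def\SampleText{Unicode - это уникальный
  код для любого символа,\\
  независимо от платформы,\\
  независимо от программы,\\
  независимо от языка.}
\font\gen="Gentium"
\gen\SampleText
\bigskip
\font\gentrans="Gentium:mapping=cyr-lat-iso9"
\gentrans\SampleText
Output with \gen:
    Unicode - это уникальный
    код для любого символа,
    независимо от платформы,
    независимо от программы,
    независимо от языка.
Output with \gentrans:
    Unicode - èto unikal'nyj
    kod dlâ lûbogo simvola,
    nezavisimo ot platformy,
    nezavisimo ot programmy,
    nezavisimo ot âzyka.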

XeTeX and other TeX extensions
TeXGX
    a direct ancestor of XeTeX, but now obsolete
e-TeX
    basis of current XeTeX implementation
    provides a number of features, especially bidi support
Omega, Aleph
    ambitious project to extend TeX to all scripts
    complex configuration, no direct smart-font support
pdfTeX
    widely-used extension providing rich PDF support
    no native Unicode or smart-font support

For more information
XeTeX web site and mailing list
    http://scripts.sil.org/xetex
    http://tug.org/mailman/listinfo/xetex
Contact information
    mailto:jonathan_kew@sil.org
Questions and answers?
[Background graphic: "What is Unicode?" in a number of languages and scripts, e.g. "Hva er Unicode?"]