TeX learns to speak Unicode
Jonathan Kew
SIL International
April 7, 2005
27th Internationalization and Unicode Conference, Berlin, Germany, April 2005
The Multilingual Lion: TeX learns to speak Unicode
Background
TeX: free typesetting system with a 25-year history
stable, reliable, flexible, widely implemented
experienced user community
rich collection of supporting tools
Originally designed for English typesetting
support for accents and other European characters
language support extended via custom fonts, macros, and preprocessors
Traditional TeX input conventions
Input text is ASCII (or 8-bit codepage)
Source text      Typeset output     Notes
\'{a}            á                  typical accent command
\c{c}            ç
\aa              å
---              em dash (U+2014)   ligature in typical TeX fonts
$\alpha$         α                  math mode symbol
{\dn acchaa}     अच्छा               using custom preprocessor
Multilingual typesetting with TeX
Text input
Escape sequences for non-ASCII characters
Multiple 8-bit codepages
Preprocessors for complex scripts
Font support
Fonts limited to 256 glyphs
Custom-encoded fonts with specific glyph sets
All tied together via complex TeX macros
Difficult to understand and extend
Difficult to integrate with other packages
Towards a cleaner solution
Unicode: all required characters directly represented
no need for escape sequences to access characters not included in the current codepage
no need to switch between codepages according to the language/script being typeset
characters rendered via standard access codes
Character/glyph model and modern font rendering technologies
complex script handling moved out of the domain of the text data stream
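As a minimal sketch of what direct Unicode input looks like in practice, assuming a UTF-8 source file processed with XeTeX and an installed font named Gentium (the font used on later slides; any Unicode font with the needed coverage should behave the same way):

```tex
% Plain XeTeX sketch; save the file as UTF-8 and run it with `xetex`.
% Assumption: a system font called "Gentium" is installed.
\font\body="Gentium" at 12pt
\body
% Non-ASCII characters are typed directly: no escape sequences
% and no codepage switching between languages or scripts.
façade, naïve, Ångström, Dvořák, Ελληνικά, русский
\bye
```

The sample words are illustrative only; what matters is that Latin, Greek, and Cyrillic text share one input encoding and one font declaration.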
Typesetting Unicode text with XeTeX
Accented characters
\halign{#\hfil\quad&
#\hfil\cr
dan&dan\cr
dubok&dubok\cr
dabe&ak\cr
din&dabe\cr
Din&din\cr
ak&Din\cr
Evropa&Evropa\cr}
dan dan
dubok dubok
dabe ak
din dabe
Din din
ak Din
Evropa Evropa
Typesetting Unicode text with XeTeX
CJK ideographs
\font\han="STSong" at 16pt
\font\rom="Gentium" at 8pt
\def\hc#1#2{\vtop{\hbox{\han#1}
\hbox{\kern10pt\rom#2}}}
\vtop{\hc{}{ka-ku}
\hc{}{motto-mo}
\hc{}{sai-go}
\hc{}{hatara-ku}
\hc{}{umi}}
ka-ku
motto-mo
sai-go
hatara-ku
umi
Typesetting Unicode text with XeTeX
Complex scripts
\c1
\s
\p
\v1
.
\v2
.
\v3
..
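$$\left(\int_{-\infty}^{\infty} e^{-x^2}\,dx\right)^2
= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2+y^2)}\,dx\,dy
= \int_0^{2\pi}\int_0^{\infty} e^{-r^2}\,r\,dr\,d\theta
= \int_0^{2\pi}\left(-\frac{e^{-r^2}}{2}\bigg|_{r=0}^{r=\infty}\right)d\theta
= \pi.$$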
Backward compatibility
Non-Unicode input text
by default, input read as Unicode (UTF-8 or UTF-16)
legacy codepages supported via ICU converters
set codepage of current input file:
\XeTeXinputencoding "charset-name"
set initial codepage for newly-opened input files:
\XeTeXdefaultencoding "charset-name"
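A short sketch of how the two primitives might be combined when a Unicode document pulls in an older 8-bit file. The file name legacy.tex and the choice of ISO-8859-1 are assumptions made for illustration; the charset name is passed to the converter library, so which names are accepted depends on the installation.

```tex
% Sketch only: "legacy.tex" is a hypothetical file saved in ISO-8859-1.
% This main file itself is assumed to be UTF-8.
\XeTeXdefaultencoding "iso-8859-1" % files opened from here on are read as Latin-1
\input legacy
\XeTeXdefaultencoding "utf8"       % files opened later are read as UTF-8 again
\bye
```

If the legacy text lives in the main file instead, placing \XeTeXinputencoding "iso-8859-1" near the top of that file is the corresponding per-file setting.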
Backward compatibility
Support for legacy keying practices
typical input:
``\TeX''---a typesetting system
generates: ``TeX''---a typesetting system
Font mapping for compatibility
; TECkit mapping for TeX input conventions
U+002D U+002D <> U+2013 ; -- -> en dash
U+002D U+002D U+002D <> U+2014 ; --- -> em dash
U+0027 <> U+2019 ; ' -> right single quote
U+0027 U+0027 <> U+201D ; '' -> right double quote
U+0022 > U+201D ; " -> right double quote
generates: “TeX”—a typesetting system
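To see the effect of such a mapping side by side, the following sketch loads the same font twice, once without and once with a mapping. It assumes the Gentium font is installed and that a compiled mapping named tex-text is available, as is the case in common XeTeX distributions; the mapping shown on the slide may have been packaged under a different name.

```tex
% Assumptions: "Gentium" is installed and a compiled TECkit mapping
% called tex-text can be found by XeTeX.
\font\plain="Gentium" at 12pt
\font\mapped="Gentium:mapping=tex-text" at 12pt

% Without a mapping, the ASCII quote and hyphen characters are typeset as typed.
\plain ``\TeX''---a typesetting system
\par
% With the mapping, the same input is converted to curly quotes and a dash
% before the font sees it.
\mapped ``\TeX''---a typesetting system
\bye
```

Because the mapping is attached to the font declaration, the source text itself stays unchanged and legacy documents can keep their existing keying.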
More fun with font mappings
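\def\SampleText{Unicode - это уникальный
код для любого символа,\\
независимо от платформы,\\
независимо от программы,\\
независимо от языка.}
\font\gen="Gentium"
\gen\SampleText
\bigskip
\font\gentrans="Gentium:mapping=cyr-lat-iso9"
\gentrans\SampleText
Unicode - это уникальный
код для любого символа,
независимо от платформы,
независимо от программы,
независимо от языка.
Unicode - èto unikal'nyj
kod dlâ lûbogo simvola,
nezavisimo ot platformy,
nezavisimo ot programmy,
nezavisimo ot âzyka.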
XeTeX and other TeX extensions
TeXGX
a direct ancestor of XeTeX, but now obsolete
e-TeX
basis of current XeTeX implementation
provides a number of features, especially bidi support
Omega, Aleph
ambitious project to extend TeX to all scripts
complex configuration, no direct smart-font support
pdfTeX
widely-used extension providing rich PDF support
no native Unicode or smart-font support
For more information
XeTeX web site and mailing list
http://scripts.sil.org/xetex
http://tug.org/mailman/listinfo/xetex
Contact information
mailto:jonathan_kew@sil.org
Questions and answers?
[Decorative text: "What is Unicode?" in multiple languages and scripts, e.g. "Hva er Unicode?"]