
Tamil All Character Encoding

Tamil All Character Encoding (TACE16) is a 16-bit Unicode-based character encoding scheme for the Tamil language.[1][2]

Keyboard drivers and fonts


The keyboard driver for this encoding scheme is available free of charge on the Tamil Virtual University website.[3][4] It uses the Tamil99 and Tamil Typewriter keyboard layouts, which are approved by the Tamil Nadu Government, and maps input keystrokes to the corresponding characters of the TACE16 scheme.[2] To read files created using the TACE16 scheme, the corresponding Unicode Tamil fonts are also available on the same website.[3][4] These fonts map glyphs not only for characters in the TACE16 format but also for the present Unicode encoding of both ASCII and Tamil characters, so they provide backward compatibility for reading existing files created with the present Unicode encoding scheme for Tamil.

Character set
All characters of this encoding scheme
are located in the private use area of the
Basic Multilingual Plane of Unicode's
Universal Character Set.
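
As a quick illustration, Python's standard unicodedata module reports the general category Co (private use) for code points in the private use area of the Basic Multilingual Plane, U+E000 to U+F8FF. The sketch below checks a few TACE16 code points that appear in this article; the helper name is_bmp_private_use is hypothetical and not part of any TACE16 tool.

```python
import unicodedata

# BMP private use area as defined by the Unicode Standard: U+E000..U+F8FF
PUA_START, PUA_END = 0xE000, 0xF8FF


def is_bmp_private_use(code_point):
    """Return True if code_point lies in the BMP private use area."""
    return PUA_START <= code_point <= PUA_END


# TACE16 code points mentioned elsewhere in this article
samples = [0xE100, 0xE180, 0xE203, 0xE210, 0xE213, 0xE38C]

for cp in samples:
    # unicodedata gives general category "Co" for private-use code points
    print("U+%04X" % cp, is_bmp_private_use(cp), unicodedata.category(chr(cp)))
```

Because these are private-use code points, Unicode itself assigns them no meaning; a TACE16-aware font or application has to supply it.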

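Tamil All Character Encoding (TACE16) Character Set

Columns give the first three hexadecimal digits of the code point (consonants run across); rows give the last digit (vowels run down).

| | E10 | E18 | E1A | E1F | E20 | E21 | E22 | E23 | E24 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ௳ | ௦ | அரைக்கால் | ◌் | | க் | ங் | ச் | ஞ் |
| 1 | ௴ | ௧ | கால் | | அ | க | ங | ச | ஞ |
| 2 | ௵ | ௨ | அரை | ◌ா | ஆ | கா | ஙா | சா | ஞா |
| 3 | ௶ | ௩ | முக்கால் | ◌ி | இ | கி | ஙி | சி | ஞி |
| 4 | ௷ | ௪ | … | ◌ீ | ஈ | கீ | ஙீ | சீ | ஞீ |
| 5 | ௸ | ௫ | … | ◌ு | உ | கு | ஙு | சு | ஞு |
| 6 | ௹ | ௬ | … | ◌ூ | ஊ | கூ | ஙூ | சூ | ஞூ |
| 7 | ௺ | ௭ | அரைமா | ◌ெ | எ | கெ | ஙெ | செ | ஞெ |
| 8 | பௌர்ணமி | ௮ | ஒருமா | ◌ே | ஏ | கே | ஙே | சே | ஞே |
| 9 | அமாவாசை | ௯ | இரண்டுமா | ◌ை | ஐ | கை | ஙை | சை | ஞை |
| A | கார்த்திகை | ௰ | … | ◌ொ | ஒ | கொ | ஙொ | சொ | ஞொ |
| B | ராஜ | ௱ | … | ◌ோ | ஓ | கோ | ஙோ | சோ | ஞோ |
| C | ௐ | ௲ | முந்திரி | ◌ௌ | ஔ | கௌ | ஙௌ | சௌ | ஞௌ |
| D | | | அரைக்காணி | | ஃ | | | | |
| E | | | காணி | | | | | | |
| F | | | … | | | | | | |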
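Note: In the original chart, cell shading distinguishes three groups: characters that are newly added (not present in Unicode 6.3), code points allocated for research (NLP), and code points reserved for future use.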

Analysis of TACE16 over present Unicode standard for Tamil language

Issues with the present Unicode for Tamil language

The present Unicode standard for Tamil is considered inadequate for efficient and effective use of Tamil on computers, for the following reasons:[1]
1. Unicode Tamil has code positions for only 31 of the 247 Tamil characters. These 31 characters comprise 12 vowels, 18 agara-uyirmey and one aytham. Five Grantha agara-uyirmey are also given code space in Unicode Tamil. The other Tamil characters have to be rendered using separate software. Only about 10% of the Tamil characters are given code space in the present Unicode Tamil; about 90% of the Tamil characters used in general text interchange are not.
2. The uyir-meys that are left out of the present Unicode Tamil are simple characters, just as A, B, C and D are characters in English. Uyir-meys are not glyphs, ligatures or conjunct characters, as Unicode assumes. ka, kA, ki, kI, etc., are characters in Tamil.
3. In any plain Tamil text, vowel-consonants (uyir-meys) make up 64 to 70%, vowels (uyir) 5 to 6% and consonants (meys) 25 to 30%. Breaking high-frequency letters such as vowel-consonants into glyphs is highly inefficient.
4. An encoding that requires a rendering engine to realize a character is not suitable for applications such as system software development in Tamil, searching and sorting, and natural language processing (NLP) in Tamil. It consumes extra time and space, making computing highly inefficient. Such applications require a Level-1 implementation, in which every character of the language has its own code position in the encoding, as in English.
5. The encoding is based on ISCII (1988), so the characters are not in their natural order of sequence. A complex collation algorithm is required to arrange them in that order.
6. It uses multiple code points to render single characters. Multiple code points lead to security vulnerabilities and ambiguous combinations, and require the use of normalization.
7. Simple operations such as counting letters, sorting and searching are inefficient.
8. It requires hidden characters such as ZWJ and ZWNJ.
9. It needs an exception table to prevent illegal combinations of code points.
10. The Unicode Indic block is built on an enormous, complex, error-prone edifice, based on an encoding that is not built to last.
11. The very first code point is annotated "Tamil Sign Anusvara - Not used in Tamil".
12. The collation was assumed to be the same as that of Devanagari, and ambiguous encoding is incorrectly used to render the same character.
13. It encodes 23 vowel-consonants (23 consonants + அ) and calls them consonants, contrary to Tamil grammar.
14. It is unnatural for speech-to-text and text-to-speech.
15. It is inefficient to store, transmit and retrieve (for example, in file reading and writing and over the Internet).
16. Complex processing hinders development.
17. Normalization is needed for string comparison.
18. A sequence of characters may correspond to a single glyph, for example ச + ◌ெ + ◌ா = சொ. Characters are not graphemes: according to Unicode, சொ is a grapheme, while ச, ◌ெ and ◌ா are characters.
19. It requires dynamic composition, in which a text element is encoded as a sequence of a base character followed by one or more combining marks.
20. There are two methods of rendering the vowel-consonants, which leads to ambiguity in rendering characters.
21. The present Unicode is not efficient for parsing. For example, consider counting the letters in the name திருவள்ளுவர். Even a Tamil child in primary school can say that this name has seven letters. According to Unicode, it has twelve characters: த ◌ி ர ◌ு வ ள ◌் ள ◌ு வ ர ◌்
22. To count the letters in this name properly, an expert developer had to write a complex program and present it as a technical paper at a Tamil computing conference. By comparison, counting letters in an English word is an exercise left to a beginning programmer. Such problems arise because a simple script such as Tamil is treated as a complex script by Unicode. For example, in the Python library open-tamil,[5] which uses the present Unicode standard for Tamil, counting the Tamil letters in a given text requires first parsing the text into a list with the function tamil.utf8.get_letters and then taking the length of that list as the letter count.[6] This kind of extra programming logic, or an additional framework layer, is needed whenever a simple script such as Tamil is treated as a complex script.
23. The Unicode standard policy is to encode only characters, not glyphs. However,[7] the Unicode Tamil standard includes the vowel signs as combining characters. These signs, which have no meaning on their own to a Tamil reader, are displayed as-is by character shaping engines that detect a blank space between them and a base character. Thus Unicode introduces the dotted circle as a Tamil character.
24. Unicode Tamil is not fully supported on many platforms, primarily because Tamil is treated as a complex script that requires complex processing.
25. Since all the above-mentioned inefficiencies consume more processor cycles (and therefore more electricity) than necessary, they increase the overall lifetime power usage of a machine that processes Unicode Tamil and might reduce the lifetime of that machine. For example, to process a single Tamil character kI (கீ), the machine has to process both the consonant and the vowel modifier, which doubles the processor cycles consumed (and in turn the electricity consumed). Taken across all the machines and servers worldwide that process Unicode Tamil characters, the extra power consumption would be huge.
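
Items 6, 17 and 21 can be illustrated with Python's standard unicodedata module. This is a minimal sketch, not the method of any particular library: it treats every code point whose general category starts with M (a combining mark) as part of the preceding letter, which is enough for the example name but is an assumption that does not cover conjuncts such as க்ஷ. The function name count_letters is hypothetical.

```python
import unicodedata


def count_letters(text):
    """Count letters by skipping combining marks (general category M*)."""
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))


# The name from item 21, written out code point by code point
name = "\u0BA4\u0BBF\u0BB0\u0BC1\u0BB5\u0BB3\u0BCD\u0BB3\u0BC1\u0BB5\u0BB0\u0BCD"
print(len(name))            # code points: 12
print(count_letters(name))  # letters under the assumption above: 7

# The same visible syllable can be stored in two different ways
two_part = "\u0B9A\u0BC6\u0BBE"  # consonant ca + vowel sign e + vowel sign aa
one_part = "\u0B9A\u0BCA"        # consonant ca + vowel sign o
print(two_part == one_part)                                # False: raw comparison
print(unicodedata.normalize("NFC", two_part) == one_part)  # True: after NFC
```

The second half shows why string comparison on Unicode Tamil generally has to be preceded by normalization.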

Analysis of TACE16 over Unicode Tamil

The following data compare the current Unicode encoding for Tamil with TACE16 in e-governance and browsing applications:[1]

1. TACE16 is more efficient than Unicode Tamil by about 5.46 to 11.94 percent for data storage applications.
2. TACE16 is more efficient than Unicode Tamil by about 18.69 to 22.99 percent for sorting index data.
3. TACE16 is more efficient than Unicode Tamil by about 25.39% when the data is entirely in Tamil. However, the default (binary) collation sequence obtained from the code space values in the New TACE16 does not follow Tamil dictionary order. Some of the uyir-meys (agara-uyirmeys) take precedence over vowels and other uyir-meys in the New TACE16, the vowels and agara-uyirmeys being in the 0B80 - 0B8F block and the other uyir-meys being in 0800 to 08FF. For this reason, sorted Unicode data looks better than sorted TACE16 data.
4. TACE16 is faster than Unicode Tamil at sorting by about 0.31 to 16.96 percent.
5. Index creation on TACE16 data is 36.7% faster than on Unicode data.
6. For full-key search on indexed fields, TACE16 performed better than Unicode Tamil by up to 24.07%. On non-indexed fields, TACE16 also performed better than Unicode Tamil, by up to 20.9%.
7. Rendering of static Tamil data was fine with TACE16.

Advantages of TACE16 over Unicode Tamil

The TACE16 character encoding scheme not only overcomes the issues with the present Unicode encoding standard for Tamil listed above, but also offers major performance improvements in both processing time and processing space, which are the main factors affecting the efficient and speedy execution of any computer program. The system has the following additional advantages:[1]

1. The encoding is universal, since it encompasses all characters that are found in general Tamil text interchange.
2. Collation is sequential, in accordance with the code value.
3. The encoding is unambiguous.
4. Any given code point always represents the same character.
5. There is no ambiguity as in the present Unicode Tamil.
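
A small sketch of advantage 2, using three code values from the character set table (E210 = க், E211 = க, E212 = கா). It assumes that the table order, with the pure consonant first, is the intended collation order; the variable names are illustrative only.

```python
# TACE16 code value -> the same letter as a Unicode Tamil string
unicode_forms = {
    0xE210: "\u0B95\u0BCD",  # pure consonant k: ka + virama
    0xE211: "\u0B95",        # ka
    0xE212: "\u0B95\u0BBE",  # kaa: ka + vowel sign aa
}

# Sorting by TACE16 code value keeps the order of the table
tace16_order = [unicode_forms[code] for code in sorted(unicode_forms)]

# Sorting the Unicode strings by code point moves the pure consonant
# last, because the virama (U+0BCD) is greater than the sign aa (U+0BBE)
unicode_order = sorted(unicode_forms.values())

print(tace16_order)
print(unicode_order)
```

With Unicode Tamil, getting the first ordering requires a collation algorithm rather than a plain comparison of code values.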
Citing the issues with the Unicode Tamil encoding, a proposal to re-encode Tamil was submitted to Unicode.[8] Unicode rejected it, stating that re-encoding would be damaging and that there was no convincing evidence that the Unicode Tamil encoding is deficient.[9]

This system has the following advantages for computer programming:

• The basic software design needed to accommodate Tamil characters and their processing is simplified.
• Sorting and searching are very simple.
• TACE16 takes fewer processor cycles (and in turn less electricity) than Unicode Tamil. In that sense, TACE16 is greener than Unicode Tamil.
• TACE16 allows programming based on Tamil grammar, which is not easy in Unicode Tamil (it needs extra framework development).
• The encoding is very efficient to parse: characters can be parsed with simple arithmetic operations. Of the two methods below, the second is very efficient in terms of performance over a large character set. Both methods follow the basic Tamil grammar rule Consonant + Vowel = Vowel-Consonant (UyirMei), which is not followed in Unicode Tamil.

Method 1 (by simple arithmetic operations):
    க் + இ = கி
    E210 (க்) + E203 (இ) - E200 (constant) = E213 (கி)
Method 2 (by bitwise operations):
    க் (E210) + இ (E203) = கி (E213)
    E210 (க்) | (E203 (இ) & 000F (constant)) = E213 (கி)
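
The two methods can be sketched in Python as follows. The function and constant names are hypothetical; the code values are the ones used in the example above, and the consonant range E210 to E380 is taken from the range checks given later in this article.

```python
VOWEL_BASE = 0xE200  # the "constant" of Method 1; vowels occupy E201..E20C


def compose_arithmetic(consonant, vowel):
    """Method 1: consonant + vowel - E200."""
    return consonant + vowel - VOWEL_BASE


def compose_bitwise(consonant, vowel):
    """Method 2: consonant | (vowel & 000F)."""
    return consonant | (vowel & 0x000F)


k = 0xE210  # pure consonant k
i = 0xE203  # vowel i

print(hex(compose_arithmetic(k, i)))  # 0xe213
print(hex(compose_bitwise(k, i)))     # 0xe213
```

Both functions assume that the first argument is a pure consonant (low four bits zero) and the second a vowel; neither validates its input.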

• It is very efficient to split a vowel-consonant (UyirMei) character into its corresponding vowel and consonant, which matters for performance over large data.

    /* To get the vowel */
    E213 (கி) & F20F (constant) = E203 (இ)

    /* To get the consonant */
    E213 (கி) & FFF0 (constant) = E210 (க்)
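
A minimal Python sketch of the same split, using the two masks given above. The function name split_uyirmei is hypothetical, and the input is assumed to be a valid vowel-consonant code in the range E211 to E38C.

```python
VOWEL_MASK = 0xF20F      # keeps the vowel row (E20x) and the vowel index
CONSONANT_MASK = 0xFFF0  # clears the vowel index


def split_uyirmei(code):
    """Return (consonant, vowel) for a TACE16 vowel-consonant code."""
    return code & CONSONANT_MASK, code & VOWEL_MASK


consonant, vowel = split_uyirmei(0xE213)
print(hex(consonant), hex(vowel))  # 0xe210 0xe203

# Recombining with a bitwise OR gives back the original code
print(hex(consonant | (vowel & 0x000F)))  # 0xe213
```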

• It is very efficient to find out whether a character is a vowel, a consonant, a vowel-consonant (UyirMei) or a number.

    /* |  - bitwise OR
     * &  - bitwise AND
     * !  - logical NOT
     * ^  - bitwise XOR
     * || - conditional OR
     * && - conditional AND
     */
    c = the TACE16 code for a Tamil character

    /* To check whether a character is a vowel */
    /* Method 1 */
    ((c >= 0xE201) && (c <= 0xE20C)) == true          // => Vowel
    /* Method 2 - if code positions E200, E20E and E20F are not used for any other purpose */
    (((c & 0xE20F) == c) && (c != 0xE20D)) == true    // => Vowel
    ((!((c & 0xE20F) ^ c)) && (c != 0xE20D)) == true  // => Vowel

    /* To check whether a character is a consonant or a vowel-consonant (UyirMei) */
    x = (c & 0x000F)  // if c is a vowel or vowel-consonant, x is the unique number of its vowel, starting from 1
    (((c >= 0xE210) && (c <= 0xE38C)) && (x == 0)) == true                 // => Consonant
    (((c >= 0xE210) && (c <= 0xE38C)) && ((x >= 1) && (x <= 12))) == true  // => Vowel-consonant (UyirMei)

    /* To check whether a character is a Tamil number */
    /* Method 1 */
    ((c >= 0xE180) && (c <= 0xE18C)) == true  // => Tamil number
    /* Method 2 - if code positions E18D-E18F are not used for any other purpose */
    /* Note: the mask test below also holds for E100-E10F, so it assumes c is not a symbol */
    (c & 0xE18F) == c                         // => Tamil number
    (!((c & 0xE18F) ^ c)) == true             // => Tamil number
    /* If code positions E18D-E18F are used for any other purpose, use either Method 1 or the check below */
    ((!((c & 0xE18F) ^ c)) && ((c & 0x000F) <= 12)) == true  // => Tamil number
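
The range-based checks (Method 1 in each case) can be combined into one small Python function. This is a sketch under the code ranges quoted above; the function name classify and the returned labels are hypothetical.

```python
def classify(code):
    """Classify a TACE16 code using the range checks given above."""
    vowel_index = code & 0x000F  # 0 for a pure consonant, 1..12 for a vowel
    if 0xE201 <= code <= 0xE20C:
        return "vowel"
    if 0xE180 <= code <= 0xE18C:
        return "number"
    if 0xE210 <= code <= 0xE38C:
        if vowel_index == 0:
            return "consonant"
        if 1 <= vowel_index <= 12:
            return "vowel-consonant"
    return "other"


for code in (0xE203, 0xE210, 0xE213, 0xE185, 0xE20D):
    print(hex(code), classify(code))
```

Range checks are used here rather than the mask tests of Method 2 because they do not depend on which neighbouring code positions are unused.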

• It is very easy to convert numbers to Tamil numbers (new Tamil number format) and vice versa (same as Unicode Tamil).

    /* To convert a number to the new Tamil number format and vice versa, direct digit-to-digit conversion is enough */
    /* To convert a number to the new Tamil number format */
    n = single-digit number (0-9)
    /* Method 1 */
    (n + 0xE180)  // => Tamil number
    /* Method 2 */
    (n | 0xE180)  // => Tamil number

    /* To convert a new-format Tamil number to a number */
    c = single-digit Tamil number character (௦-௯)
    (c & 0x000F)  // => Number
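
Applied digit by digit, the conversion can be sketched in Python as below. The function names are hypothetical, and the sketch assumes a non-negative integer written positionally in the new Tamil number format, with digits at E180 to E189.

```python
DIGIT_BASE = 0xE180  # TACE16 code of the Tamil digit zero


def to_tace16_digits(number):
    """Convert a non-negative integer to a list of TACE16 digit codes."""
    return [DIGIT_BASE | int(digit) for digit in str(number)]


def from_tace16_digits(codes):
    """Convert a list of TACE16 digit codes back to an integer."""
    return int("".join(str(code & 0x000F) for code in codes))


codes = to_tace16_digits(2024)
print([hex(code) for code in codes])  # ['0xe182', '0xe180', '0xe182', '0xe184']
print(from_tace16_digits(codes))      # 2024
```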

Alternative Claims
Open-Tamil

The open-tamil project[10] provides many of the common operations, e.g. extracting letters from a UTF-8 encoded Unicode string, sorting and searching. Even though the project claims Level-1 compliance for Tamil text processing without using TACE16, it is still written on top of the extra programming logic that the present Unicode standard for Tamil requires.

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    import tamil.utf8 as utf8

    # Split the text into Tamil letters, then print each letter
    # and write it to the file 'singl', separated by spaces
    with open('singl', 'w', encoding='utf-8') as ff:
        letters = utf8.get_letters("கூவிளம் என்பது என்ன சீர்")
        for letter in letters:
            ff.write(letter)
            print(letter)
            ff.write(' ')

This generates the output: கூ வி ள ம் எ ன் ப து எ ன் ன சீ ர்

See also
TSCII (Tamil Script Code for Information Interchange)

References
1. Report on the final recommendations of the task force on TACE16
2. Tamil Nadu Government's Tender Document for development of Tamil fonts and Tamil keyboard driver for 16-bit encodings (Unicode and TACE16)
3. http://www.tamilvu.org/tkbd/index.htm
4. Tamil Nadu Government's Order (G.O.), Keyboard Drivers and Fonts
5. https://github.com/arcturusannamalai/open-tamil open-tamil
6. https://ezhillang.wordpress.com/2014/01/26/open-tamil-text-processing-%E0%AE%89%E0%AE%B0%E0%AF%88-%E0%AE%AA%E0%AE%95%E0%AF%81%E0%AE%AA%E0%AF%8D%E0%AE%AA%E0%AE%BE%E0%AE%AF%E0%AF%8D%E0%AE%B5%E0%AF%81/ tamil.utf8.get_letters
7. https://ezhillang.wordpress.com/2014/01/26/open-tamil-text-processing-%E0%AE%89%E0%AE%B0%E0%AF%88-%E0%AE%AA%E0%AE%95%E0%AF%81%E0%AE%AA%E0%AF%8D%E0%AE%AA%E0%AE%BE%E0%AE%AF%E0%AF%8D%E0%AE%B5%E0%AF%81/
8. https://www.unicode.org/L2/L2012/12033-tamil-presentation.pdf
9. http://unicode.org/alloc/nonapprovals.html
10. https://pypi.org/project/Open-Tamil/ open-tamil project

Retrieved from "https://en.wikipedia.org/w/index.php?title=Tamil_All_Character_Encoding&oldid=838429227"


Content is available under CC BY-SA 3.0 unless otherwise noted.
