Unicode and Legacy Representations of Emoji
IUC 36
David Yonge-Mallo, i18n Engineer, Google

Oct. 24, 2012
ver. 2012-10-23 14:00

"Bit rot"
09:15-10:00 KEYNOTE PRESENTATION -
"Bit Rot" – A Disaster Waiting to Happen
Presenter: Dr. Vinton G. Cerf, Vice President and Chief Internet Evangelist, Google
Dr. Cerf will discuss the problem of curating digital content on the order of centuries. Unicode has a role to play although there are very complex issues relating to format and structure of digital objects, interpretation of content, intellectual property management, perhaps even patents and other legal framework questions. The problems are both technical and legal.
Outline
● A brief history of emoji
● Encoding: Shift JIS and Unicode
● Mapping and unification
● Emoji in Unicode 6
● Problems:
○ variation selectors
○ regional indicators
○ counting
● Best practices
Emoji down the ages
What if you were tasked with preserving the following texts
to be passed down for posterity?
awesome! :-)
yay! ☺
i know how much you [emoji] hiking

What is an emoji (絵文字)?
絵文字 = picture (絵) + character/letter (文字)
What are they?

● pictures (representational)
● includes facial expressions (smileys)
○ but not restricted to them
● stored and transmitted as encoded characters
○ used in email and SMS
History:
● popularised on Japanese mobile devices
● extension of Japanese character sets
● carrier-specific standards
"Early" history in Japan
Three major cell phone operators supported emoji:
● NTT DoCoMo
● au/EZweb by KDDI
● SoftBank
Problems:
● each operator had its own set of emoji
● they were encoded differently
● no interoperability between them
Examples of emoji
[Figures: DoCoMo emoji palette (above); DoCoMo Foma P902i, c. 2005 (right)]

Examples of emoji
[Figures: subset of KDDI emoji; subset of SoftBank emoji]

Number of supported emoji
Source: Emoji in Unicode, IUC 33

Encoding - Shift JIS
This is one of the most popular encodings for Japanese.
The "JIS" part refers to Japanese Industrial Standards.

ISO-2022-JP is also known as the "JIS" encoding.
The "shift" part comes from how the double-byte characters are encoded.
0x00 - 0x7F : matches ASCII (except for 2 characters)

0x81 - 0x9F : first byte of a double-byte character
0xA1 - 0xDF : half-width katakana
0xE0 - 0xEF : first byte of a double-byte character
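
The byte ranges above are enough to walk through a Shift JIS byte string one character at a time. The following is a minimal sketch in Python, not a full decoder: the function names are hypothetical, and only the four ranges listed on this slide are handled.

```python
from typing import List


def classify_shift_jis_byte(b: int) -> str:
    """Classify a byte that starts a character, using the ranges on the slide."""
    if 0x00 <= b <= 0x7F:
        return "single-byte (ASCII-like)"
    if 0xA1 <= b <= 0xDF:
        return "single-byte (half-width katakana)"
    if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF:
        return "lead byte of a double-byte character"
    return "outside the ranges listed on the slide"


def split_shift_jis(data: bytes) -> List[bytes]:
    """Split a Shift JIS byte string into one chunk of bytes per character."""
    chunks = []
    i = 0
    while i < len(data):
        is_lead = classify_shift_jis_byte(data[i]).startswith("lead byte")
        width = 2 if is_lead else 1
        chunks.append(data[i:i + width])
        i += width
    return chunks


# 'a', one double-byte character, and one half-width katakana byte.
sample = b"a\x82\xa0\xb1"
print([chunk.hex() for chunk in split_shift_jis(sample)])
```

Lead bytes above 0xEF are deliberately left unhandled here. The carrier emoji shown later (for example, 0xF89F - 0xF9FC) live in that region, which is one reason a generic Shift JIS decoder may not understand them.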
Encoding - Shift JIS
Source: modified from Wikipedia

Encoding - Unicode PUA
Unicode has a number of private use areas (PUAs).
PUA range in the Basic Multilingual Plane (BMP):

0xE000 - 0xF8FF
Supplementary PUA-A:
0xF0000 - 0xFFFFD
Supplementary PUA-B:
0x100000 - 0x10FFFD
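
As a rough illustration, a code point can be tested against the three private use areas with a small lookup. This is a sketch with hypothetical names; the range ends follow the Unicode Standard, in which planes 15 and 16 each stop at ...FFFD because the last two code points of every plane are noncharacters.

```python
import unicodedata

# Private use areas as defined by the Unicode Standard.
PUA_RANGES = {
    "BMP PUA": (0xE000, 0xF8FF),
    "Supplementary PUA-A": (0xF0000, 0xFFFFD),
    "Supplementary PUA-B": (0x100000, 0x10FFFD),
}


def private_use_area(code_point: int):
    """Return the name of the PUA containing code_point, or None."""
    for name, (low, high) in PUA_RANGES.items():
        if low <= code_point <= high:
            return name
    return None


for cp in (0x263A, 0xE000, 0xF0000, 0x10FFFD):
    # General category "Co" is how the Unicode database marks private use.
    print(f"U+{cp:04X}", private_use_area(cp), unicodedata.category(chr(cp)))
```

Private use code points carry no agreed meaning, so two systems can only exchange them reliably if both sides share the same private agreement.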
Encoding is carrier-specific
Each carrier used different values to encode emoji. For
example...
NTT DoCoMo:
● Shift JIS: 0xF89F - 0xF9FC
● Unicode: 0xE63E - 0xE757 (BMP PUA)
● JIS points for e-mail
... and similarly for the other two carriers.

Mojibake (文字化け)
Mojibake is what happens when encoded text is displayed
using the wrong encoding.
[Figure: the same message as sent and as displayed]
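
Mojibake is easy to reproduce. The sketch below encodes a string as Shift JIS and then decodes the same bytes with a different encoding; Latin-1 is chosen only because it never raises an error, and the function name is hypothetical.

```python
def mojibake(text: str, actual: str, assumed: str) -> str:
    """Encode text with `actual`, then decode it as if it were `assumed`."""
    return text.encode(actual).decode(assumed)


original = "文字化け"
garbled = mojibake(original, "shift_jis", "latin-1")
intact = mojibake(original, "shift_jis", "shift_jis")

print(repr(garbled))  # unreadable: one Latin-1 character per Shift JIS byte
print(intact)         # the original text, because the encodings agree
```

The bytes are never damaged in transit. Only the interpretation is wrong, which is why decoding again with the right encoding recovers the text.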
Carrier-to-carrier mapping
[Table: emoji mapping with columns SoftBank, Disney, au by KDDI, DoCoMo]
Source: SoftBank
Emoji support spreads...
Emoji began to be supported in web mail and other
devices:
● Yahoo! Japan Web Mail (2006)
● Gmail (2008)
● iPhone 2.2 (2008)
● Android apps (2009)
Google emoji
Provides a unified representation of the three emoji sets:
● union of all the emoji characters
● cross-mapping
○ combine same character
○ a few dozen: existing Unicode
● about 700 new characters
○ using PUA
○ outside BMP (U+FExxx)
Idea:
● support legacy systems by converting between other encodings and Unicode
[Diagram labels: KDDI, SoftBank, DoCoMo]
Google PUA mapping table
Converting at boundaries
[Diagram: Gmail (Google PUA) connected to KDDI, DoCoMo, and SoftBank]
Convert to/from Unicode
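
Converting at the boundary can be sketched as a pair of table lookups. Everything in this example is an assumption made for illustration: the table rows are placeholders rather than values copied from the real mapping table, the function names are hypothetical, and "[?]" is used as the fallback only because the slides mention it being displayed in Gmail.

```python
# Hypothetical excerpt. The real table has hundreds of rows, and these
# pairings are placeholders, not values taken from it.
DOCOMO_TO_INTERNAL = {
    0xE63E: 0xFE000,
    0xE63F: 0xFE001,
}
INTERNAL_TO_DOCOMO = {v: k for k, v in DOCOMO_TO_INTERNAL.items()}

# Internal emoji code points, following the "U+FExxx" note on the slide.
INTERNAL_EMOJI = range(0xFE000, 0xFF000)
FALLBACK = "[?]"


def from_docomo(text: str) -> str:
    """Inbound boundary: carrier PUA code points become internal ones."""
    return "".join(chr(DOCOMO_TO_INTERNAL.get(ord(ch), ord(ch))) for ch in text)


def to_docomo(text: str) -> str:
    """Outbound boundary: map back, or substitute a fallback if unmappable."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp in INTERNAL_TO_DOCOMO:
            out.append(chr(INTERNAL_TO_DOCOMO[cp]))
        elif cp in INTERNAL_EMOJI:
            out.append(FALLBACK)  # emoji this carrier cannot represent
        else:
            out.append(ch)
    return "".join(out)


message = "hi \ue63e"
assert to_docomo(from_docomo(message)) == message  # round trip
print(to_docomo("\U000fe123"))  # no carrier equivalent in this excerpt
```

Keeping one internal representation means each carrier needs only one table in each direction, rather than one table per pair of carriers.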
Emoji in Gmail
Uses the mapping table to convert between PUA and carrier encodings.
Emoji are displayed using images. In some places, "[?]" is displayed instead.
Right: mobile Gmail on iPhone
Below: desktop Gmail compose window

Making it official
In 2007, the Unicode Technical Committee agreed to
encode most of the emoji characters, for the purpose of
interoperability between systems.
Unicode proposals (a joint effort by Google and Apple), 2009:
● N3582 "Proposal for Encoding Emoji Symbols"
● N3583 "Emoji Symbols Proposed for New Encoding"
Authors:
● Markus Scherer, Mark Davis, Kat Momoi, Darick Tong (Google)
● Yasuo Kida, Peter Edberg (Apple)
The Proposal
Source: N3583 "Emoji Symbols Proposed for New Encoding"

Emoji in Unicode 6
Goal:
● Encode superset of emoji in Unicode, allowing for
roundtrip and fallback mappings
Restrictions:
● Source separation rule (strict rule)
● Reuse existing Unicode symbols
● Separate generic symbols
● Abstract characters (no specific colours or animation)
● Unify semantically identical symbols, but:
disunify visually similar but semantically different
symbols
● Unify Unicode with least-marked most-common symbol
Source: Unicode Technical Committee Subcommittee on Encoding of Symbols
Proposal accepted
In 2010, the new emoji were accepted into Unicode 6.
These consisted of:

● 625 emoji mapped 1:1 to characters new in Unicode 6
● 103 emoji unified 1:1 with existing characters
● 11 keycaps represented as sequences: one of [0-9#] followed by U+20E3 COMBINING ENCLOSING KEYCAP
● 10 'flag' emoji represented as sequences of regional indicator symbols
● 65 emoji logos were not added
In addition, Unicode 6 added many other symbols which are similar in nature to emoji, such as playing cards, plants, and transportation symbols.
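
To make the keycap representation concrete, here is a small sketch that builds the eleven sequences. The helper name is hypothetical; the combining character used is U+20E3 COMBINING ENCLOSING KEYCAP.

```python
import unicodedata

KEYCAP = "\u20e3"          # COMBINING ENCLOSING KEYCAP
KEYCAP_BASES = "0123456789#"


def keycap(base: str) -> str:
    """Return the two-code-point keycap sequence for a digit or '#'."""
    if len(base) != 1 or base not in KEYCAP_BASES:
        raise ValueError(f"not a keycap base: {base!r}")
    return base + KEYCAP


print(unicodedata.name(KEYCAP))
for base in KEYCAP_BASES:
    sequence = keycap(base)
    # One visible symbol, but two code points.
    print(base, [f"U+{ord(ch):04X}" for ch in sequence])
```

Each keycap looks like a single symbol on screen but occupies two code points, which matters for the counting problems discussed later.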
Unified and new emoji
Unified emoji: New emoji:

New problems introduced
Since Gmail was already using the unified PUA, it looks like
all that needs to be done to bring it up to spec is to replace
the PUA code points with the official ones...
Not so fast -- it's not that simple!
Recall that one of the goals in creating the proposal was:

● Reuse existing Unicode symbols
Also, the new emoji include:

● keycaps and flags represented by sequences of
characters
What could possibly go wrong?

Can you spot the problems?
Variation selectors
Source: Unicode Standardized Variants
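
One practical consequence of variation selectors is that two strings can look the same yet compare unequal. The sketch below assumes the two presentation selectors, U+FE0E (text style) and U+FE0F (emoji style), and uses a hypothetical helper that removes them before comparison. Whether stripping is appropriate depends on the application, since it discards the requested presentation.

```python
VS15 = "\ufe0e"  # VARIATION SELECTOR-15: request text presentation
VS16 = "\ufe0f"  # VARIATION SELECTOR-16: request emoji presentation


def strip_presentation_selectors(text: str) -> str:
    """Remove VS15/VS16 so that strings can be compared by base character."""
    return text.replace(VS15, "").replace(VS16, "")


plain = "\u263a"          # WHITE SMILING FACE
emoji_style = "\u263a" + VS16

print(plain == emoji_style)                                # differ by an invisible code point
print(strip_presentation_selectors(emoji_style) == plain)  # equal once the selector is removed
print(len(plain), len(emoji_style))                        # code point counts differ
```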

Regional Indicator symbols
The combined carrier emoji contained ten national flags.
(PRC, Germany, Spain, France, UK, Italy, Japan, Korea, Russia, USA)
US proposal (Google and Apple):

● encode as "emoji compatibility symbols"
Germany/Ireland counter-proposal:
● encode 256 characters for ISO 3166 country codes
Compromise:
● encode twenty-six "regional indicator symbols" (A-Z)
● spell out the two-letter country codes
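
Spelling out a country code with regional indicator symbols is a fixed offset from the letters A-Z. This is a minimal sketch with a hypothetical function name. The first symbol, REGIONAL INDICATOR SYMBOL LETTER A, is U+1F1E6.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag_sequence(country_code: str) -> str:
    """Spell a two-letter country code using regional indicator symbols."""
    code = country_code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError(f"expected two letters A-Z, got {country_code!r}")
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code
    )


for code in ("JP", "US", "FR"):
    sequence = flag_sequence(code)
    print(code, [f"U+{ord(ch):04X}" for ch in sequence])
```

Whether a pair is drawn as a flag is up to the font and platform. The encoding itself only records the two letters.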
Possible ambiguity
We have "regional indicators" to [image].
But what if the middle of a string looked like this?
... [image] ...
Is this ... [image] ...
or ... [image] ...?
What about CN/NC, KRUS/RUSK, BB...BBFRUSBB...?
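
The KRUS/RUSK case can be reproduced directly. The sketch below pairs regional indicators greedily from wherever reading starts, which is an assumed strategy chosen for illustration and not something the standard is claimed to mandate here; all names are hypothetical. Starting one symbol later yields a different flag.

```python
RI_A, RI_Z = 0x1F1E6, 0x1F1FF  # regional indicator symbols A to Z


def to_regional_indicators(letters: str) -> str:
    """Convert ASCII letters A-Z to regional indicator symbols."""
    return "".join(chr(RI_A + ord(c) - ord("A")) for c in letters)


def read_pairs(text: str):
    """Pair indicators greedily from the left.

    Returns (list of two-letter codes, leftover unpaired letter or "").
    Assumes the indicators in `text` form one contiguous run.
    """
    letters = [
        chr(ord(ch) - RI_A + ord("A")) for ch in text if RI_A <= ord(ch) <= RI_Z
    ]
    pairs = ["".join(letters[i:i + 2]) for i in range(0, len(letters) - 1, 2)]
    leftover = letters[-1] if len(letters) % 2 else ""
    return pairs, leftover


run = to_regional_indicators("KRUS")
print(read_pairs(run))      # read from the true start of the run
print(read_pairs(run[1:]))  # read from one symbol in: a different country
```

Any operation that cuts a string in the middle of a run, such as truncation or cursor movement, can change which flags appear.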

Be careful how you count!
Counting the wrong thing is a major source of bugs:
● Java's String.length() lies about Unicode supplementary code points (UCS-2 vs. UTF-16): it counts UTF-16 code units, so use String.codePointCount() instead
● masking with "[?]" changes the length
● changing encoding changes the length
The above problems existed prior to Unicode 6. But now:

● variation selectors are invisible
● some emoji are represented by sequences (of
supplementary code points)
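
The different "lengths" of the same text can be compared side by side. This sketch uses Python rather than Java: Python's len() counts code points (comparable to Java's String.codePointCount()), while the UTF-16 code unit count corresponds to what Java's String.length() reports. The function name is hypothetical.

```python
def counts(text: str) -> dict:
    """Count the same text three ways."""
    return {
        "code_points": len(text),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


samples = {
    "flag (two regional indicators)": "\U0001F1EF\U0001F1F5",
    "smiley + variation selector": "\u263a\ufe0f",
    "keycap sequence": "1\u20e3",
}

for label, text in samples.items():
    # Each sample is drawn as a single symbol, yet no count below is 1.
    print(label, counts(text))
```

None of these counts matches what a user sees as one symbol, so limits and cursor positions should be defined in terms of whichever unit the surrounding system actually uses.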
Best practices
Strive for the following goals:
● use a Unicode encoding rather than Shift JIS or other legacy encodings
● use official Unicode code points instead of PUA
● choose wisely whether to use text or image
● convert to/from Unicode at boundaries
● be aware that Unicode has emoji-like symbols beyond
the Japanese carrier sets, and conversion to the carrier
Shift JIS encodings may not be possible for these
● follow Postel's principle
○ "be liberal in what you accept,
but conservative in what you send"
The End
Thank you!
Q&A
