Chap10 DMMCharactersAndFonts

Characters & Fonts
Rawesak Tanawongsuwan
ccrtw@mahidol.ac.th
10 314–315
Text
• Dual nature
• Visual representation of language (content)

• Need to relate bit patterns to symbols of
a written language
• Graphic element (appearance)

• Precise shapes of characters, spacing and
layout (typography)
• Each abstract character may have many

different graphic representations
10 315–316
Character Sets
• Abstract characters are grouped into alphabets
• Any set of distinct symbols, usually forming

the basis of some written language
• A character set is a mapping between the

characters of some alphabet (its character
repertoire) and bit patterns
• For each character in its repertoire, a

character set defines a code value,
belonging to its set of code points
10
Character Sets
• Code values can be used in editing, searching, and
comparing of text
• Any mapping of character set to code values is ok,

but the mapping should provide a unique mapping
value
• It is useful to use integers within a small range that

can be efficiently manipulated by a computer
• Consecutive letters should be represented by

consecutive code values to simplify operations on
text, i.e. sorting
10
Standards
• Transferring text between different computers
needs a standard character code
• Standardization is a real-world problem in

many situations
10 317
ASCII
• American Standard Code for Information Interchange
• The dominant character set since 1970s
• 7 bits for each code value, hence 128 code points
• Character repertoire only comprises 95 characters (other 33

values are used for control characters)
• 0 to 31 and 127 are assigned to control characters
• Character repertoire is only really adequate for American English
• ASCII was adopted as an ISO standard in 1972
• ISO 646 is ASCII with national variants (accented letters,

currency symbols)
10 318
8-bit Character Sets

• Easy to double the number of code points by
using the eighth bit of a byte
• Maintain backward compatibility by keeping

lower half (0–127) identical with US-ASCII
• Use code points 128–255 for accented letters,

math symbols, extra punctuation
• 256 code points still insufficient for all

languages, so must still use different variants
('code pages')
10 319
ISO 8859
• Incompatible 8-bit extensions to ASCII were
originally developed by manufacturers
• Standardization required
• ISO 8859 is a multi-part standard (defined

during 1980s) which defines a collection of
character sets, each designed to accommodate
the needs of a group of related languages
• ISO 8859-1, known as ISO-Latin1, covers

most Western European languages
10 320
Multi-byte Character Sets

• 256 code points is not sufficient for
ideographically based alphabets or for using
more than one language at a time
• 16-bit (2-byte) character set with 65,536 code

points can accommodate 256 8-bit character
sets simultaneously, and so on with 24- and
32-bit sets
• ISO 10646 is structured as a hypercube

comprising 256 groups (cubes) each of 256
planes of 256 rows, each with 256 characters
10 320
Structure of ISO 10646

• Each code point can be written as a quadruple
(g, p, r, c) – group, plane, row, character
• Using * to mean all values from 0–255, can

also use quadruples to identify subsets
• (0, 0, 0, *) is the subset with all but lowest-

order byte set to 0
• In ISO 10646, (0, 0, 0, *) is identical with

ISO Latin1
10 320–321
Unicode
• 16-bit character set developed by an industry
consortium
• Unicode uses CJK consolidation to fit all characters

required for Chinese, Korean and Japanese into 16
bits
• Characters that look the same have the same

position, even if they are in fact different
• ISO 10646 Basic Multilingual Plane (0, 0, *, *) is

identical to Unicode (even though ISO 10646 doesn't
really need to use CJK consolidation)
10 322–323
Encodings
• Mapping from code values to a sequence of bytes
• charset specification in a MIME type identifies

encoding and character set
• e.g. text/html; charset = ISO-8859-1
• Obvious encoding of ISO 10646 uses four bytes

for each 32-bit value – UCS 4
• For values on BMP drop zero bytes – UCS 2
• UCS 2 is therefore identical to Unicode
Universal Character Set

10 323
Unicode Transformation Format
UTFs
• UCS Transformation Formats can be applied to
Unicode (UCS 2) values
• UTF-8: ASCII characters encoded as

themselves, values > 127 encoded as a string
of up to six bytes with highest bit set to 1
• UTF-7 further encodes UTF-8 as 7-bit values to

avoid problems with older protocols
• UTF-16 allows pairs of 16-bit values to be

combined into a single 32-bit value, extending
Unicode beyond BMP (additional 15 planes)
10 324–326
Fonts
• Visual representation of a character is called a
glyph A A A A A A
• Must replace characters with glyphs for display
• Glyphs are arranged into collections called
fonts
• Fonts are stored in specified locations on a
computer system, may be embedded in
documents
• If font is not embedded, document may not
display properly on systems where that font
is not installed
10 327–330
Classification of Fonts
Courier Times
• Spacing: monospaced (fixed width)/proportional
• Serifs: serifed/sans serif H H

• Serifs are the small strokes added to the ends
of character shapes in conventional book
fonts
• Shape: upright/italic/slanted H H
• Slant is a vertical shear effect, italic uses
different glyph shapes with a slant
• Weight: bold/normal/light H H
10 331–332
Choice of Fonts
• Text fonts – suitable for continuous text (e.g.
body of a book or article)
• Must be unobtrusive, easy to read
• Display fonts – suitable for isolated pieces of

short text (e.g. headings, signs or slogans)
• Intention is to get a short message across,

so eye-catching design that would be
inappropriate for continuous text is OK
10 332–333
Fonts for Multimedia

• Text fonts may be problematical
• Low resolution of computer displays leads to

loss of details (e.g. fine serifs) and
distortion of letter shapes
• Use larger sizes than in print, prefer sans

serif, use fonts such as Arial and Verdana
designed to be readable at low resolution
• Display fonts work better and may be suitable

for small pieces of continuous text
10 334–335
Font Measurement
• Units
• Points: 1pt = 1/72" = 0.3528mm
• Exact size is not standard; 1/72" is
invariably used by computer systems
• Picas: 1pc = 12pt or 1/6 in or 4.2333 mm
• Font's body size is not necessarily the size of
any particular character
• e.g. 10pt Times Roman
• Smallest height that could accommodate
every symbol in the font
10 337
Font Terminology
• Baseline – the line on which the bases of
characters are arranged
• Leading – the distance between successive

baselines
• x-height – the distance between the baseline

and the top of a lower-case letter x
• Ascenders/descenders – strokes that rise

above the x-height/drop below the baseline
10 336
10 336–337
Relative Units
• Used to express measurements relative to font size
• 1 ex = font's x-height
• 1 em = body size
• Traditionally the width of an upper-case M
• Long dashes — known as em-dashes (1em long)
• 1 en = 0.5em
• Shorter dashes – known as en-dashes (1en long)

10 337–338
Spacing
• Kerning – adjustment of space between
certain pairs of letters (e.g. AV) to make
them look more uniform
• Kerning pairs for a font are defined by

its designer, stored with the font
http://www.webopedia.com/TERM/K/kerning.html
metrics
• Ligatures – single composite characters

used to replace pairs of letters that don't
look right next to each other (e.g. fi)
• Ligatures are stored as extra characters

in the font
http://en.wikipedia.org/wiki/Ligature_(typography)
10 339
Digital Fonts
• Glyphs are just images, so we can have
bitmapped or vector (outline) fonts
• Bitmapped fonts don't scale well or

reproduce at different resolutions
• Outline font formats:
• PostScript Type 1
• TrueType
• OpenType
10
Bitmap fonts
• An assortment of bitmap fonts from the first

version of the Macintosh operating system
10
Bitmap fonts
• For each variant of the font, there is a
complete set of glyph images, with each set
containing an image for each character
• For example, if a font has three sizes, and any

combination of bold and italic, then there must
be 12 complete sets of images.
10
Bitmap fonts
• Advantages of bitmap fonts include:
• Extremely fast and simple to render
• Unscaled bitmap fonts always give exactly
the same output
• Easier to create than other kinds
• The primary disadvantage of bitmap fonts
• The visual quality tends to be poor when
scaled or otherwise transformed, compared
to outline and stroke fonts.
10 340–341
Outline Fonts
• Type 1
• Character shapes are based on Bézier
curves
• Fonts can contain hints used by rendering
programs to improve appearance at low res
• TrueType
• Character shapes are based on quadratic
curves
• Instructions specify how features of a
character are rendered at different
resolutions
10 340
OpenType Fonts
• New cross-platform format that unifies Type 1
and TrueType
• More than 256 characters in each font
• Type 1 and TrueType both limited to 256
• Encoding based on Unicode
• Support for extended range of ligatures, old-

style numerals, swash capitals, fractions

Chap10 DMMCharactersAndFonts

Diunggah oleh

Informasi Dokumen

Deskripsi Asli:

Judul Asli

Hak Cipta

Format Tersedia

Bagikan dokumen Ini

Bagikan atau Tanam Dokumen

Opsi Berbagi

Apakah menurut Anda dokumen ini bermanfaat?

Apakah konten ini tidak pantas?

Hak Cipta:

Format Tersedia

Chap10 DMMCharactersAndFonts

Diunggah oleh

Hak Cipta:

Format Tersedia

Characters & Fonts

• Visual representation of language (content)

• Graphic element (appearance)

• Each abstract character may have many

• Any set of distinct symbols, usually forming

• A character set is a mapping between the

• For each character in its repertoire, a

• Any mapping of character set to code values is ok,

• It is useful to use integers within a small range that

• Consecutive letters should be represented by

• Standardization is a real-world problem in

• The dominant character set since 1970s

• 7 bits for each code value, hence 128 code points

• Character repertoire only comprises 95 characters (other 33

• 0 to 31 and 127 are assigned to control characters

• Character repertoire is only really adequate for American English

• ASCII was adopted as an ISO standard in 1972

• ISO 646 is ASCII with national variants (accented letters,

8-bit Character Sets

• Maintain backward compatibility by keeping

• Use code points 128–255 for accented letters,

• 256 code points still insufficient for all

• ISO 8859 is a multi-part standard (defined

• ISO 8859-1, known as ISO-Latin1, covers

Multi-byte Character Sets

• 16-bit (2-byte) character set with 65,536 code

• ISO 10646 is structured as a hypercube

Structure of ISO 10646

• Using * to mean all values from 0–255, can

• (0, 0, 0, *) is the subset with all but lowest-

• In ISO 10646, (0, 0, 0, *) is identical with

• Unicode uses CJK consolidation to fit all characters

• Characters that look the same have the same

• ISO 10646 Basic Multilingual Plane (0, 0, *, *) is

• charset specification in a MIME type identifies

• e.g. text/html; charset = ISO-8859-1

• Obvious encoding of ISO 10646 uses four bytes

• For values on BMP drop zero bytes – UCS 2

• UCS 2 is therefore identical to Unicode

Universal Character Set

• UTF-8: ASCII characters encoded as

• UTF-7 further encodes UTF-8 as 7-bit values to

• UTF-16 allows pairs of 16-bit values to be

• Serifs: serifed/sans serif H H

• Must be unobtrusive, easy to read

• Display fonts – suitable for isolated pieces of

• Intention is to get a short message across,

Fonts for Multimedia

• Low resolution of computer displays leads to

• Use larger sizes than in print, prefer sans

• Display fonts work better and may be suitable

• Leading – the distance between successive

• x-height – the distance between the baseline

• Ascenders/descenders – strokes that rise

• Traditionally the width of an upper-case M

• Long dashes — known as em-dashes (1em long)

• Shorter dashes – known as en-dashes (1en long)

• Kerning pairs for a font are defined by

• Ligatures – single composite characters

• Ligatures are stored as extra characters

• Bitmapped fonts don't scale well or

• Outline font formats:

• ISO 10646 Basic Multilingual Plane (0, 0, , ) is