
Part 2 CHARACTER ENCODING: How do computers deal with multiple languages?

by Tze Wei Sim simtzewei@mail.soasalumni.org

Content
1. Basic Computing Knowledge
2. Binary, Decimal and Hexadecimal Numbers
3. Unicode Character Set
4. Character Encoding
5. Language Input Software
6. Fonts
7. Glyphs

Data Communication

In order for computers to understand each other, they have to speak and understand the same language.

In computing terms, they must have the same encoding (speaking) and decoding (understanding) protocol.


Every time we press a key on a keyboard, it generates a sequence of high and low voltages that represents a binary number.

These sequences of data are saved in memory or transmitted to another computer via a network. For the recipient to understand (decode) what the sender said (encoded), both must share the same understanding (encoding) of what that string of binary numbers means.

Numeral Systems

Computer data is represented in binary numbers (base-2 numeral system), as opposed to the decimal numbers (base-10 numeral system) we use in daily life. Hexadecimal numbers (base-16 numeral system) are a more compact way of writing the same binary values.

Decimal  Binary  Hexadecimal
0        0       0
1        1       1
2        10      2
3        11      3
4        100     4
5        101     5
6        110     6
7        111     7
8        1000    8
9        1001    9
10       1010    A
11       1011    B
12       1100    C
13       1101    D
14       1110    E
15       1111    F
16       10000   10
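The table above can be reproduced with a few lines of Python. This is only an illustrative sketch; the helper name to_bases is made up for this example and relies on Python's built-in format function.

```python
def to_bases(n: int) -> tuple:
    """Return n written in decimal, binary and upper-case hexadecimal."""
    return (format(n, "d"), format(n, "b"), format(n, "X"))


# Print the same 17 rows as the table: decimal, binary, hexadecimal.
for n in range(17):
    dec, binary, hexa = to_bases(n)
    print(f"{dec:<8} {binary:<7} {hexa}")

# Going the other way: int() parses a string in a given base.
fifteen = int("1111", 2)
sixteen = int("10", 16)
```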

Common Character Sets

ASCII (American Standard Code for Information Interchange)
- originally based on the English language
- encodes 128 characters: the digits 0-9, the letters a-z and A-Z, some basic punctuation symbols, and some control codes
- each character is stored in 7 binary digits (bits)

Key        Binary Representation  Decimal Number
A          1000001                65
B          1000010                66
C          1000011                67
!          0100001                33
?          0111111                63
$          0100100                36
Backspace  0001000                8
Escape     0011011                27
Delete     1111111                127
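As a quick check of the table above, the sketch below looks up the ASCII number of a character and writes it as 7 bits. The function name ascii_row is hypothetical; ord and format are Python built-ins.

```python
def ascii_row(ch: str) -> tuple:
    """Return (7-bit binary string, decimal number) for one ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not an ASCII character")
    return (format(code, "07b"), code)


# Printable keys, followed by the control codes Backspace, Escape and Delete.
for ch in ["A", "B", "C", "!", "?", "$", "\x08", "\x1b", "\x7f"]:
    binary, decimal = ascii_row(ch)
    print(repr(ch), binary, decimal)
```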




Most early computers kept data in an 8-bit byte system. With an 8-bit byte, not only is it possible to store every possible ASCII character, but there is also one whole bit spare.

Byte = the smallest addressable unit of memory in many computer architectures

Because a byte has room for eight bits, many people had their own ideas of what should go in the space from 10000000₂ (or 128₁₀) to 11111111₂ (or 255₁₀). For example, on some American PCs the character code 10000010₂ (or 130₁₀) would display as é, but on computers in Israel it was the Hebrew letter Gimel (ג), so when Americans sent their résumés to Israel they arrived as rגsumגs.
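The résumé mix-up can be reproduced in Python. The text does not name the character sets involved, so this sketch assumes the American PC used code page 437 and the Israeli one code page 862, both of which ship with Python's standard codecs.

```python
# One byte, value 130, interpreted under two different 8-bit character sets.
raw = bytes([130])
american = raw.decode("cp437")   # assumed American PC code page
israeli = raw.decode("cp862")    # assumed Hebrew PC code page

# The sender encodes with one table, the recipient decodes with another.
sent = "résumés".encode("cp437")
received = sent.decode("cp862")

print(american, israeli, received)
```

The bytes never change in transit; only the table used to interpret them differs.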



Unicode: A group of ambitious people came up with the idea of creating a single character set that includes every reasonable writing system in the world, covering 110,181 characters from the world's alphabets, ideograph sets, and symbol collections. Scripts in current use such as Amharic and Tamil, and even scripts that are no longer in common use, such as Baybayin (the old Filipino writing system) and Chữ Nôm (the old Vietnamese characters), are all assigned numeric codes (known as code points) to prevent confusion between computers.

Unicode

The code assigned to a specific character in the Unicode Standard is called a code point.

A binary number for a character can be very long.


The Chinese character 𤭢 is represented by the binary number 100100101101100010 (or 150370₁₀).

Note: To make the code point more concise, it is expressed in this format: U+hexadecimal number. Thus, U+24B62.
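The U+ notation can be derived directly from the binary number. The sketch below is illustrative only; the helper name code_point_label is made up for this example.

```python
def code_point_label(ch: str) -> str:
    """Format a character's code point in the U+hexadecimal style."""
    return f"U+{ord(ch):04X}"


# The binary number from the slide, read as an integer.
cp = int("100100101101100010", 2)
character = chr(cp)   # the character assigned to that code point

print(cp, code_point_label(character))
```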

UTF-8 Encoding

The code point has to be encoded and segmented into several 8-bit bytes so that it can be stored in computer memory, transmitted across communication networks, and deciphered correctly by other computers. UTF-8 is an encoding method widely used on the internet and increasingly used as the default character encoding in operating systems, programming languages, and software applications.
No. of Bytes  First Code Point  Last Code Point  No. of Bits  1st Byte  2nd Byte  3rd Byte  4th Byte  5th Byte  6th Byte
1             U+0000            U+007F           7 bits       0xxxxxxx
2             U+0080            U+07FF           11 bits      110xxxxx  10xxxxxx
3             U+0800            U+FFFF           16 bits      1110xxxx  10xxxxxx  10xxxxxx
4             U+10000           U+1FFFFF         21 bits      11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
5             U+200000          U+3FFFFFF        26 bits      111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
6             U+4000000         U+7FFFFFFF       31 bits      1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx

Note: The 5-byte and 6-byte forms belong to the original UTF-8 design. The current standard (RFC 3629) stops at U+10FFFF, so in practice a character takes at most 4 bytes.
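The first rows of the table translate into a simple range check. This is a minimal sketch with a hypothetical function name, utf8_length. It follows the current UTF-8 limit of U+10FFFF (4 bytes at most) rather than the historical 5-byte and 6-byte rows, and it ignores the surrogate range for brevity.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for a code point (modern limit)."""
    if code_point < 0:
        raise ValueError("code points are non-negative")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x10FFFF:
        return 4
    raise ValueError("beyond U+10FFFF: not encodable in modern UTF-8")


for cp in (0x41, 0x627, 0x7A81, 0x24B62):
    # Compare the table lookup with Python's own encoder.
    print(f"U+{cp:04X}", utf8_length(cp), len(chr(cp).encode("utf-8")))
```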

To encode the Chinese character 𤭢, which is represented by the binary number 100100101101100010, the encoder performs the following steps:

1. Since the code point lies between U+10000 and U+1FFFFF, it takes 4 bytes to encode.

   No. of Bits  No. of Bytes Required  1st Byte  2nd Byte  3rd Byte  4th Byte
   21 bits      4                      11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

2. Three leading zeros are added in front of 100100101101100010 to make it 000100100101101100010, so that it fills all 21 x positions.

3. The character is now a 4-byte (32-bit) binary sequence, ready to be saved or transmitted to another computer: 11110000 10100100 10101101 10100010

Note: This lengthy binary number can be written concisely in hexadecimal: F0 A4 AD A2
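The three steps above can be sketched by hand in Python and compared with the built-in encoder. The function name encode_4_byte is hypothetical, and the sketch handles only the 4-byte case.

```python
def encode_4_byte(code_point: int) -> bytes:
    """Encode one code point in the 4-byte UTF-8 form, step by step."""
    # Step 1: confirm the code point belongs to the 4-byte range.
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not in the 4-byte range")
    # Step 2: pad with leading zeros to exactly 21 bits.
    bits = format(code_point, "021b")
    # Step 3: distribute the bits over the x positions of the template.
    groups = [
        "11110" + bits[0:3],
        "10" + bits[3:9],
        "10" + bits[9:15],
        "10" + bits[15:21],
    ]
    return bytes(int(group, 2) for group in groups)


encoded = encode_4_byte(0x24B62)
print(encoded.hex(" ").upper())
```

Python's own str.encode("utf-8") produces the same four bytes for this character.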

Decoding

When the recipient receives the 32-bit string 11110000 10100100 10101101 10100010, its decoder removes the marker bits (the leading 11110 of the first byte and the leading 10 of each following byte). This leaves the 21-bit number 000100100101101100010, which is the original code point 100100101101100010 with its three padding zeros. A computer program equipped with a suitable font can now render this code point as a picture (better known as a glyph in typography).

There are other encoding methods, such as UTF-16 and UTF-32, to suit different types of computer architectures.
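Decoding reverses the procedure: strip the marker bits and join what remains. The sketch below handles only the 4-byte case, and the name decode_4_byte is made up for illustration.

```python
def decode_4_byte(data: bytes) -> int:
    """Recover the code point from a 4-byte UTF-8 sequence."""
    if len(data) != 4:
        raise ValueError("expected exactly 4 bytes")
    bits = [format(byte, "08b") for byte in data]
    # The first byte must start with 11110, the others with 10.
    if not bits[0].startswith("11110"):
        raise ValueError("first byte is not a 4-byte lead byte")
    if any(not b.startswith("10") for b in bits[1:]):
        raise ValueError("continuation byte expected")
    # Drop the marker bits and join the remaining 3 + 6 + 6 + 6 bits.
    payload = bits[0][5:] + "".join(b[2:] for b in bits[1:])
    return int(payload, 2)


code_point = decode_4_byte(bytes.fromhex("F0A4ADA2"))
print(f"U+{code_point:04X}")
```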

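Why do computers need encoding and decoding? So that the receiver can make sense of the seemingly random signal. For example, it knows that a new 4-byte character is starting when it detects a byte beginning with 11110, and that a byte beginning with 10 continues the current character.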
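This boundary detection can be illustrated with a small sketch: every byte that does not start with the bits 10 begins a new character. The function names are hypothetical, and the code assumes its input is already valid UTF-8.

```python
def is_lead_byte(byte: int) -> bool:
    """True unless the byte has the continuation pattern 10xxxxxx."""
    return (byte & 0b11000000) != 0b10000000


def split_characters(data: bytes) -> list:
    """Group a UTF-8 byte stream into one chunk of bytes per character."""
    chunks = []
    for byte in data:
        if is_lead_byte(byte) or not chunks:
            chunks.append(bytes([byte]))      # a new character starts here
        else:
            chunks[-1] += bytes([byte])       # continuation of the current one
    return chunks


# A 1-byte, a 2-byte and a 4-byte character in one stream.
stream = "A\u0627\U00024B62".encode("utf-8")
for chunk in split_characters(stream):
    print(chunk.hex(" ").upper())
```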

Language Input Software

Typing the English language is relatively straightforward in computing. The keyboard generates the binary number 1000001₂ (or 65₁₀ or 41₁₆) when A is pressed.

To type non-English languages, the computer needs language input software that converts a 7-bit ASCII binary number to a Unicode binary number.
To type the Arabic letter Alif (ا), we essentially press the H key. The keyboard generates the binary number 1001000₂ (or 72₁₀ or 48₁₆). The language input software then converts 1001000₂ (or 72₁₀ or 48₁₆) to 11000100111₂ (or 1575₁₀ or 627₁₆).
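In the simplified model used here, language input software behaves like a lookup table from key codes to Unicode code points. The sketch below is purely illustrative: the one-entry table ARABIC_LAYOUT and the function translate_key are invented for this example and do not describe any real input method.

```python
# Hypothetical layout table: key code 48 (hex) maps to Alif, 627 (hex).
ARABIC_LAYOUT = {0x48: 0x0627}


def translate_key(key_code: int, layout: dict) -> str:
    """Return the character an input method would produce for a key code."""
    # Keys missing from the layout pass through unchanged.
    return chr(layout.get(key_code, key_code))


alif = translate_key(0x48, ARABIC_LAYOUT)
print(format(0x48, "b"), "->", format(ord(alif), "b"))
```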


To encode the Arabic letter Alif (ا), which is represented by the 11-bit binary number 11000100111, the encoder performs the following steps:

1. Since the code point lies between U+0080 and U+07FF, it takes 2 bytes to encode.

   No. of Bits  No. of Bytes Required  1st Byte  2nd Byte
   11 bits      2                      110xxxxx  10xxxxxx

2. The 11-bit binary number fills all the x positions.

3. The character is now a 2-byte (16-bit) binary sequence, ready to be saved or transmitted to another computer: 11011000 10100111

Note: This lengthy binary number can be written concisely in hexadecimal: D8 A7
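The 2-byte case can also be written with bit operations instead of string slicing. This is a sketch under the same assumptions as before; encode_2_byte is a made-up name.

```python
def encode_2_byte(code_point: int) -> bytes:
    """Encode one code point in the 2-byte UTF-8 form using bit operations."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("not in the 2-byte range")
    first = 0b11000000 | (code_point >> 6)          # 110 + top 5 bits
    second = 0b10000000 | (code_point & 0b111111)   # 10 + low 6 bits
    return bytes([first, second])


alif_bytes = encode_2_byte(0x627)
print(" ".join(format(b, "08b") for b in alif_bytes), alif_bytes.hex(" ").upper())
```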

Font

A font is a file that maps strings of binary data (code points) to the designated pictorial glyphs to be shown on a computer screen.

The most common font types are:


1. OpenType Fonts
2. TrueType Fonts
3. PostScript Fonts

Fonts can be developed with software such as FontLab, Adobe FDK, RoboFont, Glyphs, and DTL FontMaster.


Fonts are files managed with Universal Type Client or Font Book (Mac), or kept in the Fonts folder (Windows).

Font vs. Glyph


A font file contains a collection of glyphs (pictures), each assigned a number.



A glyph is the design of a character, a symbol or even an object.

Character vs. Glyph


Unicode follows the principle of assigning a code point to a character, not to a glyph. The same character can be drawn as different glyphs (designs), and all of those glyphs share the same code point.

Unicode leaves the design of glyphs to type designers so type designers have the liberty to design the appearance of glyphs.
Things get more complicated when the distinction between a character and a glyph is not clear-cut, especially in regions where logographic writing systems are used, such as China, Japan, Korea, and Vietnam (abbr. CJKV).

Some glyphs are considered the same character and share one code point. For example, the CJKV character 突 (abrupt) and its variant form used to share the code point U+7A81. The variant form was later added to Unicode and given its own code point, U+2592E.

But some glyphs are considered different characters and have different code points, for example the character meaning "to listen" and the character meaning "to hit".

Which Font is Better?

Well-developed fonts usually have many glyphs and therefore are able to support many languages.

Less-developed fonts have fewer glyphs and therefore are less versatile in coping with different languages.

Arial Unicode MS

MHeiHK-S

Data Processing

This is the usual process of encoding a non-ASCII character.

Computer A: Keyboard -> Language Input Software -> UTF-8 Encoder

Computer B: UTF-8 Decoder -> Unicode Font

Input Software-induced Discordance

Some language input software developers prefer to use U+807C (a rare character) over U+807D (the more common one), for example in Microsoft Pinyin New Experience Input Style.

Computer A: Keyboard -> Language Input Software -> 807C₁₆ or 807D₁₆ -> UTF-8 Encoder

Computer B: UTF-8 Decoder -> Unicode Font
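The practical effect of this choice can be sketched as follows: the two code points look alike to a reader but are different data, so a search for one does not find the other. The helper contains is hypothetical and stands in for any byte-level search.

```python
rare = "\u807C"     # the form some input software produces
common = "\u807D"   # the form most people expect

stored = rare.encode("utf-8")   # what ends up saved or transmitted


def contains(data: bytes, text: str) -> bool:
    """Search UTF-8 data for a piece of text, byte for byte."""
    return text.encode("utf-8") in data


print(stored.hex(" ").upper(), common.encode("utf-8").hex(" ").upper())
print(contains(stored, rare), contains(stored, common))
```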

Font-induced Discordance

This is the usual computing process to type the Arabic character Alif.
Computer A: Keyboard -> 48₁₆ (ASCII) -> Language Input Software -> 627₁₆ (Unicode) -> UTF-8 Encoder -> D8 A7₁₆

Computer B: D8 A7₁₆ -> UTF-8 Decoder -> 627₁₆ (Unicode) -> Unicode Font

However, some font developers skip the language input software and UTF-8 encoding by creating non-Unicode fonts, such as Kruti Dev 010 and MHeiHK-S.
Keyboard -> 48₁₆ (ASCII) -> Non-Unicode Font

Non-Unicode Fonts

Some font developers create fonts that assign characters to code points that have already been taken by other characters.


These fonts are called non-Unicode fonts. Text typed with these fonts cannot be displayed correctly with other fonts.
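A toy model shows why. Here each font is represented as a dictionary from code points to glyph descriptions. Both tables and the render function are invented for illustration and do not reflect the contents of any real font file.

```python
# Hypothetical Unicode font: each code point has its standard glyph.
UNICODE_FONT = {0x48: "glyph: Latin H", 0x0627: "glyph: Arabic Alif"}

# Hypothetical non-Unicode font: the code point of H is reused for Alif.
NON_UNICODE_FONT = {0x48: "glyph: Arabic Alif"}


def render(code_points: list, font: dict) -> list:
    """Look up the glyph a font would draw for each stored code point."""
    return [font.get(cp, "glyph: missing") for cp in code_points]


# Text typed with the non-Unicode font is stored as plain 48 (hex).
stored = [0x48]
print(render(stored, NON_UNICODE_FONT))   # what the author saw
print(render(stored, UNICODE_FONT))       # what a reader with another font sees
```

The stored data is the same in both cases; only the font's private mapping made it look like Arabic.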