CHARACTER ENCODING: How Do Computers Deal With Multiple Languages?

Part 2 CHARACTER ENCODING: How do computers deal with multiple languages?
by Tze Wei Sim simtzewei@mail.soasalumni.org
Content
Basic Computing Knowledge
Binary, Decimal and Hexadecimal Numbers
Unicode Character Set
Character Encoding
Language Input Software
Fonts
Glyphs
Data Communication
In order for computers to understand each other, they have to speak and understand the same language.
In computing terms, they must have the same encoding (speaking) and decoding (understanding) protocol.
Every time we press a button on a keyboard, it generates a sequence of high and low voltages which resemble binary numbers.
These sequences of data are saved in memory or transmitted to another computer via a network. For the recipient to understand (decode) what the sender said (encoded), both of them must share the same understanding (encoding) of what that string of binary numbers means.
Numeral Systems
Computer data is represented in binary numbers (base-2 numeral system) as opposed to decimal numbers (base-10 numeral system) we use in daily life.
Decimal   Binary   Hexadecimal
0         0        0
1         1        1
2         10       2
3         11       3
4         100      4
5         101      5
6         110      6
7         111      7
8         1000     8
9         1001     9
10        1010     A
11        1011     B
12        1100     C
13        1101     D
14        1110     E
15        1111     F
16        10000    10
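The table above can be reproduced with a few lines of Python. This is only a sketch; the helper name `to_bases` is made up for illustration and is not part of the original slides.

```python
def to_bases(n: int) -> tuple[str, str, str]:
    """Return n written in decimal, binary and hexadecimal (no prefixes)."""
    return (format(n, "d"), format(n, "b"), format(n, "X"))

# Print the numbers 0 to 16 in all three numeral systems.
for n in range(17):
    dec, binary, hexa = to_bases(n)
    print(f"{dec:<9} {binary:<8} {hexa}")

# Going the other way: int() parses a string written in a given base.
sixteen_from_binary = int("10000", 2)
sixteen_from_hex = int("10", 16)
```

Both `sixteen_from_binary` and `sixteen_from_hex` hold the same value, which shows that the base only changes how a number is written, not the number itself.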
Common Character Sets
ASCII (American Standard Code for Information Interchange) is a character set originally based on the English language. It encodes 128 characters (the digits 0-9, the letters a-z and A-Z, some basic punctuation symbols, and some control codes), all stored in 7 binary digits (bits).
Key         Binary Representation   Decimal Number
A           1000001                 65
B           1000010                 66
C           1000011                 67
!           0100001                 33
?           0111111                 63
$           0100100                 36
Backspace   0001000                 8
Escape      0011011                 27
Delete      1111111                 127
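The rows of this table can be checked in Python with the built-in `ord` function. The sketch below uses a hypothetical helper, `ascii_row`, that returns the 7-bit pattern and decimal number for one character.

```python
def ascii_row(ch: str) -> tuple[str, int]:
    """Return (7-bit binary string, decimal number) for one ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not an ASCII character")
    return format(code, "07b"), code

# Printable keys from the table.
for ch in "ABC!?$":
    bits, number = ascii_row(ch)
    print(ch, bits, number)

# Control codes have no printable form, so they are labelled by name.
for name, ch in [("Backspace", "\b"), ("Escape", "\x1b"), ("Delete", "\x7f")]:
    bits, number = ascii_row(ch)
    print(name, bits, number)
```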

Most early computers kept data in an 8-bit byte system. With an 8-bit byte, not only is it possible to store every possible ASCII character, but there is also one whole bit spare.
Byte = the smallest addressable unit of memory in many computer architectures
Because bytes have room for up to eight bits, many people had their own ideas of what should go where in the space from 10000000 in binary (128 in decimal) to 11111111 in binary (255 in decimal). For example, on some American PCs the character code 10000010 in binary (130 in decimal) would display as é, but on computers in Israel it was the Hebrew letter Gimel (ג), so when Americans sent their résumés to Israel they arrived as rגsumגs.
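The same mismatch can be reproduced in Python. The slides do not name the code pages involved; the sketch below assumes code page 437 for the American PC and code page 862 for the Israeli one, both of which ship with Python as the codecs `cp437` and `cp862`. The function name `send` is made up for illustration.

```python
raw = bytes([0b10000010])  # the single byte 130

as_american = raw.decode("cp437")  # assumed American PC code page
as_hebrew = raw.decode("cp862")    # assumed Israeli PC code page

def send(text: str, sender: str = "cp437", receiver: str = "cp862") -> str:
    """Encode text with the sender's code page, decode it with the receiver's."""
    return text.encode(sender).decode(receiver)

garbled = send("r\u00e9sum\u00e9s")  # "résumés" typed on the American PC
print(as_american, as_hebrew, garbled)
```

The byte never changes in transit; only the table used to interpret it differs, which is exactly the problem Unicode set out to solve.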

Unicode
A group of ambitious people came up with the idea of creating a single character set that included every reasonable writing system in the world, covering 110,181 characters from the world's alphabets, ideograph sets, and symbol collections. Scripts such as Amharic and Tamil, and even old scripts that are no longer in common use, such as Baybayin (the old Filipino writing system) and Chữ Nôm (the old Vietnamese characters), are assigned binary codes (also known as code points) to prevent confusion between computers.
The code assigned to a specific character in Unicode Standard is called a code point.
A binary number for a character can be very long.

The Chinese character 𤭢 is represented by the binary number 100100101101100010 (150370 in decimal).
Note: To make the code point more concise, it is expressed in this format: U+hexadecimal number. Thus, U+24B62.
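In Python, `ord` returns a character's code point and `chr` does the reverse, so the numbers above can be checked directly. The helper `code_point_label` is a hypothetical name used only in this sketch.

```python
def code_point_label(ch: str) -> str:
    """Format a character's code point in the U+hexadecimal notation."""
    return f"U+{ord(ch):04X}"

ch = chr(0x24B62)                  # build the character from its code point
decimal_value = ord(ch)            # 150370
binary_value = format(ord(ch), "b")
label = code_point_label(ch)

print(decimal_value, binary_value, label)
```

By convention the hexadecimal part is padded to at least four digits, which is why the letter A is written U+0041 rather than U+41.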
UTF-8 Encoding

The string of numbers has to be encoded and segmented into several 8-bit bytes so that it can be stored in computer memory, transmitted across communication networks, and deciphered correctly by other computers. UTF-8 is an encoding method widely used on the internet and increasingly used as the default character encoding in operating systems, programming languages, and software applications.
No. of Bytes   First Code Point   Last Code Point   No. of Bits   1st Byte   2nd Byte   3rd Byte   4th Byte   5th Byte   6th Byte
1              U+0000             U+007F            7 bits        0xxxxxxx
2              U+0080             U+07FF            11 bits       110xxxxx   10xxxxxx
3              U+0800             U+FFFF            16 bits       1110xxxx   10xxxxxx   10xxxxxx
4              U+10000            U+1FFFFF          21 bits       11110xxx   10xxxxxx   10xxxxxx   10xxxxxx
5              U+200000           U+3FFFFFF         26 bits       111110xx   10xxxxxx   10xxxxxx   10xxxxxx   10xxxxxx
6              U+4000000          U+7FFFFFFF        31 bits       1111110x   10xxxxxx   10xxxxxx   10xxxxxx   10xxxxxx   10xxxxxx
Note: The 5- and 6-byte forms belong to the original UTF-8 design. The current standard (RFC 3629) limits Unicode to U+10FFFF, so UTF-8 today uses at most 4 bytes per character.
To encode the Chinese character 𤭢, which is represented by the binary number 100100101101100010, the encoder performs the following steps:
1. Since it is between U+10000 and U+1FFFFF, this will take 4 bytes to encode.

No. of Bits   No. of Bytes Required   1st Byte   2nd Byte   3rd Byte   4th Byte
21 bits       4                       11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

2. Three leading zeros are added in front of 100100101101100010 to make it 000100100101101100010, so that it fills all 21 x positions.
3. The character is now made up of 4 bytes (32 bits), ready to be saved and transmitted to another computer: 11110000 10100100 10101101 10100010
Note: This lengthy binary number can be written concisely in hexadecimal: F0 A4 AD A2
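The steps above can be sketched as a small Python function. This is an illustrative hand-written encoder, not how Python itself is implemented; the name `utf8_encode` is an assumption, and it covers only the 1- to 4-byte forms that are valid in current UTF-8.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (1 to 4 bytes)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp <= 0x7F:        # 7 bits: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    if cp <= 0xFFFF:      # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0b111111),
                  0b10000000 | ((cp >> 6) & 0b111111),
                  0b10000000 | (cp & 0b111111)])

encoded = utf8_encode(0x24B62)
print(" ".join(format(b, "08b") for b in encoded))
print(encoded.hex(" ").upper())
```

The shifts and masks do the same job as padding with leading zeros and filling the x positions by hand. In practice the built-in `str.encode("utf-8")` gives the same bytes.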
Decoding

When the recipient receives the 32 bits of data, 11110000 10100100 10101101 10100010, the recipient's decoder removes the marker bits (the leading 11110 of the first byte and the leading 10 of each following byte) and recovers the 21-bit binary number 000100100101101100010, that is, the original 100100101101100010. It is now ready to be opened in a computer program equipped with a font that can render the binary number as a picture (better known as a glyph in typography).
There are other encoding methods, such as UTF-16 and UTF-32, to suit different types of computer architecture.
Why do computers need encoding and decoding? So that the receiver can make sense of the seemingly random signal. For example, it knows that a new 4-byte character is being received when it detects a byte that begins with 11110.
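A decoder can be sketched in the same spirit: read the first byte to learn how many bytes belong to the character, then strip the marker bits and join what is left. The function name `utf8_decode_one` is made up for this sketch, and it deliberately skips some checks a production decoder performs (for example, rejecting overlong encodings).

```python
def utf8_decode_one(data: bytes) -> int:
    """Decode the first UTF-8 character in data and return its code point."""
    lead = data[0]
    if lead >> 7 == 0b0:          # 0xxxxxxx: 1 byte
        length, cp = 1, lead
    elif lead >> 5 == 0b110:      # 110xxxxx: 2 bytes
        length, cp = 2, lead & 0b11111
    elif lead >> 4 == 0b1110:     # 1110xxxx: 3 bytes
        length, cp = 3, lead & 0b1111
    elif lead >> 3 == 0b11110:    # 11110xxx: 4 bytes
        length, cp = 4, lead & 0b111
    else:
        raise ValueError("not a valid first byte")
    if len(data) < length:
        raise ValueError("truncated character")
    for b in data[1:length]:
        if b >> 6 != 0b10:        # every following byte must start with 10
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0b111111)
    return cp

decoded = utf8_decode_one(bytes.fromhex("F0A4ADA2"))
print(f"U+{decoded:04X}", format(decoded, "b"))
```

Because first bytes and continuation bytes start with different bit patterns, a receiver that joins a stream midway can always find where the next character begins.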
Language Input Software
Typing the English language is relatively straightforward in computing. The keyboard generates the binary number 1000001 (65 in decimal, 41 in hexadecimal) when A is pressed.
To type non-English languages, the computer needs language input software that converts a 7-bit ASCII binary number to a Unicode binary number.
To type the Arabic letter Alif (ا), we essentially press the h key. The keyboard generates the binary number 1001000 (72 in decimal, 48 in hexadecimal). The language input software then converts it to 11000100111 (1575 in decimal, 627 in hexadecimal).
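At its simplest, the conversion performed by language input software can be pictured as a lookup table. The sketch below is a toy model only: `ARABIC_LAYOUT` and `input_method` are hypothetical names, and the table holds just the single key mentioned above, whereas real input software covers the whole keyboard and often handles multi-key sequences.

```python
# Toy layout table: ASCII code from the keyboard -> Unicode code point.
# Only the one mapping described in the text is included.
ARABIC_LAYOUT = {0x48: 0x0627}  # the h key -> Arabic letter Alif

def input_method(key_code: int) -> int:
    """Translate a key code, passing unknown keys through unchanged."""
    return ARABIC_LAYOUT.get(key_code, key_code)

alif = input_method(0x48)
print(format(alif, "b"), alif, format(alif, "X"))
```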
To encode the Arabic letter Alif (ا), which is represented by the 11-bit binary number 11000100111, the encoder performs the following steps:
1. Since it is between U+0080 and U+07FF, this will take 2 bytes to encode.

No. of Bits   No. of Bytes Required   1st Byte   2nd Byte
11 bits       2                       110xxxxx   10xxxxxx

2. The 11-bit binary number fills all the x positions.
3. The character is now made up of 2 bytes (16 bits), ready to be saved and transmitted to another computer:
11011000 10100111
Note: This lengthy binary number can be written concisely in hexadecimal: D8 A7
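The three steps can be followed literally in Python by treating the bits as text, which mirrors the slide more closely than bit-shifting does. The variable names are chosen for this sketch only.

```python
cp = 0b11000100111             # Arabic letter Alif, U+0627

bits = format(cp, "011b")      # the 11 bits as a string
high5, low6 = bits[:5], bits[5:]

byte1 = int("110" + high5, 2)  # fill 110xxxxx
byte2 = int("10" + low6, 2)    # fill 10xxxxxx
encoded = bytes([byte1, byte2])

print(format(byte1, "08b"), format(byte2, "08b"))
print(encoded.hex(" ").upper())
```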
Font
A font is a file that maps strings of binary data to designated pictorial glyphs to be shown on a computer screen.
The most common font types are:

1. OpenType Fonts
2. TrueType Fonts
3. PostScript Fonts
Fonts can be developed with software such as FontLab, Adobe FDK, RoboFont, Glyphs, and DTL FontMaster.
Fonts are files kept in Universal Type Client or Font Book on a Mac, and in the Fonts folder on Windows.
Font vs. Glyph

A font file contains a collection of glyph (picture) files, each assigned a number.

A glyph is the design of a character, a symbol or even an object.
Character vs. Glyph

Unicode follows the principle of assigning a code point to a character, not to a glyph. Two different designs of the same letter are the same character but different glyphs, and both glyphs share the same code point.
Unicode leaves the design of glyphs to type designers so type designers have the liberty to design the appearance of glyphs.
Things get more complicated when the definitions of a character and a glyph are not clear-cut, especially in regions where logographic writing systems are used, i.e. China, Japan, Korea, and Vietnam (abbr. CJKV).
Some glyphs are considered the same character and share the same code point. For example, the character 突 (abrupt) and its variant form used to share the code point U+7A81 in CJKV. The variant form was later added to Unicode and given the code point U+2592E.
But some glyphs are considered different characters and have different code points, for example the characters meaning "to listen" and "to hit".
Which Font is Better?
Well-developed fonts usually have many glyphs and therefore are able to support many languages.
Less-developed fonts have fewer glyphs and therefore are less versatile in coping with different languages.
Arial Unicode MS
MHeiHK-S
Data Processing
This is the usual process of encoding a non-ASCII character.
[Diagram: Computer A (Keyboard -> UTF-8 Encoder) -> Computer B (UTF-8 Decoder -> Unicode Font)]
Input Software-induced Disconcordance
Some language input developers prefer to use U+807C (a rare character) over U+807D (the more common one), e.g. Microsoft Pinyin New Experience Input Style.
[Diagram: Computer A (Keyboard -> 807C in hexadecimal -> UTF-8 Encoder) -> Computer B (UTF-8 Decoder -> 807C or 807D in hexadecimal -> Unicode Font)]
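A short Python sketch shows why this choice matters: the two code points look alike to a reader, but to a computer they are different characters with different bytes, so a search for one will not find the other. The variable names are illustrative only.

```python
rare = "\u807c"    # the code point some input software produces
common = "\u807d"  # the code point most users expect

rare_bytes = rare.encode("utf-8")
common_bytes = common.encode("utf-8")
print(rare_bytes.hex(" ").upper())    # bytes stored for U+807C
print(common_bytes.hex(" ").upper())  # bytes stored for U+807D

# A document typed with the rare code point does not match a search
# for the common one, even though the two look similar on screen.
document = "abc" + rare + "def"
found = common in document
print(found)
```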
Font-induced Disconcordance
This is the usual computing process to type the Arabic character Alif.
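[Diagram: Computer A (Keyboard -> 48 in hexadecimal, ASCII -> 627 in hexadecimal, Unicode -> UTF-8 Encoder -> D8 A7 in hexadecimal) -> Computer B (D8 A7 in hexadecimal -> UTF-8 Decoder -> 627 in hexadecimal, Unicode -> Unicode Font)]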
However, some font developers skip the language input software and UTF-8 encoding by creating non-Unicode fonts, e.g. Kruti Dev 010 and MHeiHK-S.
[Diagram: Keyboard -> 48 in hexadecimal, ASCII -> Non-Unicode Font]
Non-Unicode Fonts
Some font developers create fonts that assign characters to code points that have already been taken by other characters. These fonts are called non-Unicode fonts. Text typed with these fonts cannot be read correctly when it is displayed in other fonts.
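The problem can be modelled with two toy glyph tables. Everything in this sketch is hypothetical: `UNICODE_FONT`, `NON_UNICODE_FONT` and `render` are invented names, and real font files are far more complex than a dictionary. The point is only that the stored data is still the code for H, whatever the font draws.

```python
# Hypothetical glyph tables: code point -> description of the picture drawn.
UNICODE_FONT = {0x0048: "Latin capital H", 0x0627: "Arabic Alif"}
NON_UNICODE_FONT = {0x0048: "Arabic Alif"}  # reuses the slot that belongs to H

def render(text: str, font: dict[int, str]) -> list[str]:
    """Look up the glyph each stored code point would be drawn with."""
    return [font.get(ord(ch), "missing glyph") for ch in text]

typed = "H"  # what is stored when the h key is pressed with no input software

with_original_font = render(typed, NON_UNICODE_FONT)
with_other_font = render(typed, UNICODE_FONT)
print(with_original_font, with_other_font)
```

With the non-Unicode font the reader sees an Alif, but as soon as the same data is shown in a Unicode font it turns back into the letter H.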
