Content
- Basic Computing Knowledge
- Binary, Decimal and Hexadecimal Numbers
- Unicode Character Set
- Character Encoding
- Language Input Software
- Fonts
- Glyphs
Data Communication
In order for computers to understand each other, they have to speak and understand the same language.
In computing terms, they must have the same encoding (speaking) and decoding (understanding) protocol.
Every time we press a key on the keyboard, it generates a sequence of high and low voltages that represent binary numbers.
These sequences of data are saved in memory or transmitted to another computer via a network. For the recipient to understand (decode) what the sender said (encoded), both have to share the same understanding (encoding) of what that string of binary numbers means.
Numeral Systems
Computer data is represented in binary numbers (the base-2 numeral system), as opposed to the decimal numbers (base-10 numeral system) we use in daily life.
Decimal Numbers   Binary Numbers   Hexadecimal Numbers
0                 0                0
1                 1                1
2                 10               2
3                 11               3
4                 100              4
5                 101              5
6                 110              6
7                 111              7
8                 1000             8
9                 1001             9
10                1010             A
11                1011             B
12                1100             C
13                1101             D
14                1110             E
15                1111             F
16                10000            10
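These conversions can be checked with a short Python snippet, using the built-in bin(), hex(), and int() functions:

```python
# Convert decimal numbers to their binary and hexadecimal notations.
for n in [10, 15, 16]:
    print(n, bin(n), hex(n))

# Parse binary and hexadecimal strings back into decimal integers.
print(int("1010", 2))   # 10
print(int("F", 16))     # 15
```

The "0b" and "0x" prefixes are Python's way of marking base-2 and base-16 literals.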
ASCII (American Standard Code for Information Interchange) - originally based on the English language - encodes 128 characters: the numbers 0-9, the letters a-z and A-Z, basic punctuation symbols, and some control codes - all stored in 7 binary digits (bits).
Key         Binary Representation   Decimal Number
A           1000001                 65
B           1000010                 66
C           1000011                 67
!           0100001                 33
?           0111111                 63
$           0100100                 36
Backspace   0001000                 8
Escape      0011011                 27
Delete      1111111                 127
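The ASCII values above can be reproduced in Python with the built-in ord() and chr() functions:

```python
# Look up the ASCII code of a key with ord(), and format it as 7-bit binary.
for key in ["A", "B", "!", "$"]:
    code = ord(key)
    print(key, code, format(code, "07b"))

# chr() goes the other way: from the code back to the character.
print(chr(65))       # A
print(ord("\x7f"))   # the Delete control code -> 127
```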
Because bytes have room for up to eight bits, many people had their own ideas of what should go where in the space from 10000000₂ (or 128₁₀) to 11111111₂ (or 255₁₀). For example, on some American PCs the character code 10000010₂ (or 130₁₀) would display as é, but on computers sold in Israel it was the Hebrew letter Gimel (ג), so when Americans sent their résumés to Israel they arrived as rגsumגs.
Unicode
The code assigned to a specific character in Unicode Standard is called a code point.
Note: To make the code point more concise, it is written as U+ followed by the hexadecimal number. Thus, U+24B62.
UTF-8 Encoding
The string of numbers has to be encoded and segmented into several 8-bit bytes so that it can be stored in computer memory, transmitted across communication networks, and deciphered correctly by other computers. UTF-8 is an encoding method widely used on the internet and increasingly used as the default character encoding in operating systems, programming languages, and software applications.
Code Point Range        No. of Bits   No. of Bytes Required   1st Byte   2nd Byte   3rd Byte   4th Byte   5th Byte   6th Byte
U+0000 - U+007F         7             1                       0xxxxxxx
U+0080 - U+07FF         11            2                       110xxxxx   10xxxxxx
U+0800 - U+FFFF         16            3                       1110xxxx   10xxxxxx   10xxxxxx
U+10000 - U+1FFFFF      21            4                       11110xxx   10xxxxxx   10xxxxxx   10xxxxxx
U+200000 - U+3FFFFFF    26            5                       111110xx   10xxxxxx   10xxxxxx   10xxxxxx   10xxxxxx
U+4000000 - U+7FFFFFFF  31            6                       1111110x   10xxxxxx   10xxxxxx   10xxxxxx   10xxxxxx   10xxxxxx
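As a sketch, the table can be expressed as a small function that returns how many bytes a given code point needs. The function name is my own; the ranges follow the table above (the original six-row UTF-8 design):

```python
def utf8_byte_count(code_point: int) -> int:
    """How many bytes UTF-8 needs for a code point, per the original design."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x1FFFFF:
        return 4
    if code_point <= 0x3FFFFFF:
        return 5
    return 6

print(utf8_byte_count(0x41))     # 1 -> the letter A
print(utf8_byte_count(0x627))    # 2 -> Arabic Alif
print(utf8_byte_count(0x24B62))  # 4 -> the Chinese character at U+24B62
```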
UTF-8 Encoding
To encode the Chinese character 𤭢 (U+24B62), which is represented by the binary number 100100101101100010, the encoder performs the following protocol: 1. Since it is between U+10000 and U+1FFFFF, it will take 4 bytes to encode.
No. of Bits: 21    No. of Bytes Required: 4
1st Byte: 11110xxx   2nd Byte: 10xxxxxx   3rd Byte: 10xxxxxx   4th Byte: 10xxxxxx
2. Three leading zeros are added in front of 100100101101100010 to make it 000100100101101100010, so that it fills all 21 variable bits (x). 3. The character is now a 4-byte binary number (32 bits), ready to be saved and transmitted to another computer: 11110000 10100100 10101101 10100010
Note: This lengthy binary number can be written concisely in hexadecimal: F0 A4 AD A2
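Python's built-in UTF-8 encoder performs the same steps internally, so the worked example can be verified directly:

```python
# The character at code point U+24B62, written as a Unicode escape.
char = "\U00024B62"

# Encode it with UTF-8 and show the four bytes in hexadecimal.
encoded = char.encode("utf-8")
print(encoded.hex(" ").upper())  # F0 A4 AD A2
```

The separator argument to bytes.hex() requires Python 3.8 or later.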
Decoding
When the recipient receives the 32-bit string 11110000 10100100 10101101 10100010, the decoder removes the marker bits (the leading 11110 and each 10 prefix) to recover the original binary number 100100101101100010. It is now ready to be opened by a computer program equipped with a font that can render that number into a picture (better known as a glyph in typography). There are other encoding methods, such as UTF-16 and UTF-32, to suit different types of computer architectures.
Why do computers need encoding and decoding? So that the receiver can make sense of the seemingly random signal: it knows that a new 4-byte character is starting when it detects 11110.
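The decoding steps can be sketched as a small Python function. The name decode_utf8_4byte is hypothetical, and a real decoder would also handle 1-, 2-, and 3-byte sequences and reject malformed input:

```python
def decode_utf8_4byte(data: bytes) -> int:
    """Strip the 11110/10 marker bits from a 4-byte UTF-8 sequence
    and reassemble the 21-bit code point."""
    assert data[0] >> 3 == 0b11110       # lead byte announces a 4-byte sequence
    bits = data[0] & 0b00000111          # 3 payload bits from the lead byte
    for b in data[1:]:
        assert b >> 6 == 0b10            # each continuation byte starts with 10
        bits = (bits << 6) | (b & 0b00111111)  # append its 6 payload bits
    return bits

code = decode_utf8_4byte(bytes([0b11110000, 0b10100100, 0b10101101, 0b10100010]))
print(hex(code))  # 0x24b62
```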
Typing the English language is relatively straightforward in computing. The keyboard generates the binary number 1000001₂ (or 65₁₀ or 41₁₆) when A is pressed.
To type non-English languages, the computer needs language input software that converts the 7-bit ASCII binary number to a Unicode binary number.
To type the Arabic Alif (ا), we essentially press the h key. The keyboard generates the binary number 1001000₂ (or 72₁₀ or 48₁₆). The language input software then converts 1001000₂ (or 72₁₀ or 48₁₆) to 11000100111₂ (or 1575₁₀ or 627₁₆).
To encode the Arabic Alif (ا), which is represented by the 11-bit binary number 11000100111, the encoder performs the following protocol:
1. Since it is between U+0080 and U+07FF, this will take 2 bytes to encode.
No. of Bits: 11    No. of Bytes Required: 2
1st Byte: 110xxxxx   2nd Byte: 10xxxxxx

2. The 11 bits fill all the variable bits (x): 110[11000] 10[100111].
3. The character is now a 2-byte binary number (16 bits), ready to be saved and transmitted to another computer:
11011000 10100111
Note: This lengthy binary number can be written concisely in hexadecimal: D8 A7
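The whole chain for Alif - ASCII key code, Unicode code point, UTF-8 bytes - can be checked in Python:

```python
# The h key produces ASCII 72 (48 in hexadecimal); 'H' is the character at that code.
assert ord("H") == 72 == 0x48

# The language input software maps it to the Unicode code point U+0627, Arabic Alif.
alif = chr(0x627)

# The UTF-8 encoder turns the code point into the two bytes D8 A7.
print(alif.encode("utf-8").hex(" ").upper())  # D8 A7
```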
Font
A font is a file that maps strings of binary data to designated pictorial glyphs shown on the computer screen.
Fonts can be developed with software such as Fontlab, Adobe FDK, RoboFont, Glyphs, and DTL Font Master.
Fonts are files kept in a font manager such as Universal Type Client, in Font Book on the Mac, or in the Fonts folder on Windows.
Unicode leaves the design of glyphs to type designers, who have the liberty to decide how glyphs appear.
Things get more complicated when the definitions of a character and a glyph are not clear-cut, especially in regions where logographic writing systems are used, e.g. China, Japan, Korea, and Vietnam (abbr. CJKV).
Some glyphs are considered the same character and share one code point. For example, variant glyphs of the character 突 ("abrupt") in CJKV used to share the code point U+7A81. One variant was later added to Unicode and given its own code point, U+2592E.
But some glyphs are considered different characters, and they have different code points, for example one meaning "to listen" and another meaning "to hit".
Well-developed fonts (e.g. Arial Unicode MS) usually have many glyphs and are therefore able to support many languages.
Less-developed fonts (e.g. MHeiHK-S) have fewer glyphs and are therefore less versatile in coping with different languages.
Data Processing
Computer A: Keyboard → UTF-8 Encoder → (transmission) → Computer B: UTF-8 Decoder → Unicode Font
Some language input developers prefer U+807C (a rare character) over U+807D (the more common one), e.g. Microsoft Pinyin New Experience Input Style.
Computer A: Keyboard → 807C₁₆ → UTF-8 Encoder → (transmission) → Computer B: UTF-8 Decoder → 807C₁₆ (or 807D₁₆) → Unicode Font
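Both code points are valid Unicode, so each encodes to its own UTF-8 byte sequence; which one ends up in the transmitted data depends on the input software's choice, not on UTF-8 itself:

```python
# The rare (U+807C) and common (U+807D) characters mentioned in the text.
rare, common = chr(0x807C), chr(0x807D)

# Each has a distinct 3-byte UTF-8 sequence.
print(rare.encode("utf-8").hex(" ").upper())    # E8 81 BC
print(common.encode("utf-8").hex(" ").upper())  # E8 81 BD
```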
Font-induced Discordance
This is the usual computing process to type the Arabic character Alif.
Computer A: Keyboard → 48₁₆ (ASCII) → 627₁₆ (Unicode) → UTF-8 Encoder → D8 A7₁₆ → (transmission) → Computer B: D8 A7₁₆ → UTF-8 Decoder → 627₁₆ (Unicode) → Unicode Font
However, some font developers skip the language input software and UTF-8 encoding by creating non-Unicode fonts, e.g. Kruti Dev 010, MHeiHK-S.
Keyboard → 48₁₆ (ASCII) → Non-Unicode Font
Non-Unicode Fonts
Some font developers create fonts that assign characters to code points that have already been taken by other characters.
These fonts are called non-Unicode fonts. Data typed with these fonts cannot be read correctly with other fonts.