
Japanese Character Code Sets & Encodings:

In the English-speaking world, the one-byte character code set "ASCII" is used,
which includes the alphanumeric characters (0-9, A-Z, a-z), special symbols, and
control characters. However, this code set cannot represent any Japanese. To
display Japanese kanji and kana characters on your computer, you need programs
that can handle one of the three basic Japanese encoding methods: JIS,
Shift-JIS, or EUC (Extended UNIX Code).

History:
The first encoding to become widely used was JIS X 0201, a single-byte
encoding that covers only a near-ASCII 7-bit set (JIS Roman) plus half-width
katakana in the upper half of the 8-bit range. It was widely used in systems
that had neither the processing power nor the storage to handle kanji
(including DOS and old embedded equipment such as cash registers).
The development of kanji encodings was the beginning of the split. Shift
JIS was developed to be completely backward compatible with JIS X 0201, and
is therefore used in Windows (for backward compatibility with DOS) and in many
embedded electronic devices.
Complications arose because the original Internet e-mail standards
supported only 7-bit transfer. The JIS encoding was therefore developed for
sending and receiving e-mail.
Not all required characters may be included in a character set standard such
as JIS, so gaiji (外字, "external characters") are sometimes used to supplement
the character set. Gaiji may come in the form of external font packs, where normal
characters have been replaced with new characters, or the new characters have
been added to unused character positions. However, gaiji are not practical in
Internet environments since the font set must be transferred with text to use the
gaiji. As a result, such characters are written with similar or simpler characters in
place, or the text may need to be written using a larger character set (such as
Unicode) that supports the required character.
Unicode is supposed to solve all encoding problems in all languages of the
world. For Japanese, the kanji characters have been unified with Chinese: a
character considered to be the same in both Japanese and Chinese has been
given a single code number in Unicode, even if the two forms look a little
different. This process is called Han unification. Unicode use is growing slowly,
helped by better support in US-made software, but most home pages in Japanese
still use Shift-JIS.

Specification of the character code sets and system of encoding:
The standard methods of encoding Japanese characters for use on a
computer, such as JIS, Shift-JIS, EUC, and Unicode, all contain the "ASCII"
codes, which makes it possible to display both Japanese and English. Despite
efforts, none of the encoding schemes has become the de facto standard, and
multiple encoding standards are still in use today.
For example, most Japanese e-mails are in JIS encoding and most web
pages in Shift-JIS, yet mobile phones in Japan usually use some form of Extended
Unix Code. If a program fails to determine which encoding scheme was used, the
text is displayed as unreadable garbage.
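
As a rough illustration of the point above (not part of the original text),
the following Python sketch encodes one short string with the three Japanese
encodings and then decodes Shift-JIS bytes as if they were EUC. The codec names
are those of Python's standard library; the sample string is an arbitrary
choice.

```python
# Sketch: the same text yields different bytes in each encoding, and
# decoding with the wrong encoding produces unreadable output.
text = "日本語"  # arbitrary sample string

encoded = {
    "JIS (ISO-2022-JP)": text.encode("iso2022_jp"),
    "Shift-JIS": text.encode("shift_jis"),
    "EUC-JP": text.encode("euc_jp"),
}
for name, data in encoded.items():
    print(f"{name:18} {data.hex(' ')}")

# A program that wrongly assumes EUC-JP when the data is Shift-JIS.
# errors="replace" substitutes a marker for each undecodable byte
# instead of raising an exception.
garbled = encoded["Shift-JIS"].decode("euc_jp", errors="replace")
print(garbled)
```

Decoding each byte string with its own codec returns the original text; only
the mismatched pairing is garbled.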
[ASCII code]
ASCII stands for American Standard Code for Information Interchange, and
its specification is given in ISO 646:1991. It is both a symbol set and a code
space (the specific encoding of a symbol). This code is used for English. It is
a one-byte code that uses 7 bits, giving 128 code positions. Of these, 94 are
printable characters such as letters, digits, and symbols, and the remaining 34
are the space and the control characters. This code set cannot display any
Japanese.

[ISO-8859]

This code uses 8 bits (ASCII plus one bit) and defines extended ASCII
character sets that can display the accented characters used in other European
languages, Cyrillic characters, or Arabic characters. The standard is divided
into several parts, each supporting one script or group of languages, such as
Latin, Cyrillic, or Arabic. Between them, the parts cover most European
languages. This code set cannot display any Japanese.

[ISO-2022-JP (JIS) encoding]

JIS stands for "Japanese Industrial Standard." As an organization, JIS is
roughly the Japanese counterpart of ANSI (American National Standards Institute)
and UL (Underwriters Laboratories) combined. The JIS code is the most popular
character code for the Japanese language on the Internet.
Like ASCII, this is a seven-bit code. It is usually used for e-mail and
e-news, since many networks do not preserve the eighth bit of 8-bit bytes.
Because the code uses the same code area (21H-7EH) as ASCII, escape sequences
in the text identify whether the bytes that follow are Japanese or English. To
differentiate ASCII from Japanese, Japanese text is placed between a "shift-in"
sequence (ESC $ B) and a "shift-out" sequence (ESC ( J). Escape sequences are
special codes that indicate a switch between character sets. Each escape
sequence begins with the "escape" character ($1b). There are many registered
escape sequences for different character sets and languages; ISO-2022-JP
recognizes a subset of these escape sequences relevant to Japanese.

sequence    hex values     effect

Esc ( B     $1b $28 $42    switch to ASCII
Esc ( J     $1b $28 $4a    switch to JIS Roman (JIS X 0201-1976)

JIS Roman runs from 0 to $7f and is identical to ASCII except for a few minor
differences (notably, the backslash at 92 is instead a yen symbol, and the tilde at
126 is replaced by an overbar). For most practical purposes, JIS Roman and ASCII
can be considered the same, so both these escape sequences can be treated as a
switch to ASCII.

sequence    hex values     effect

Esc $ @     $1b $24 $40    switch to JIS C 6226-1978
Esc $ B     $1b $24 $42    switch to JIS X 0208-1983

Both JIS C 6226-1978 and JIS X 0208-1983 are earlier versions of JIS X 0208-
1990. For most practical purposes, both these escape sequences can be treated
as a switch to JIS X 0208-1990.
Typically, then, Japanese text appears enclosed by two escape sequences:
either Esc $ @ or Esc $ B at the beginning (a three-byte opening designator), and
either Esc ( B or Esc ( J at the end (a three-byte closing designator). The text
itself (which designates the kanji or kana to use) between the escape sequences
consists of pairs of plain 7-bit bytes in the printable range from $21 to $7e,
simply formed by splitting the JIS value into two bytes, also known as "raw JIS".
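
The framing described above can be observed with Python's standard
"iso2022_jp" codec. This is only a sketch: the sample string is an arbitrary
choice, and the slicing assumes a single Japanese run in the text.

```python
ESC = b"\x1b"

text = "abc日本語xyz"  # arbitrary sample mixing ASCII and Japanese
data = text.encode("iso2022_jp")
print(data)

# Every byte is 7-bit, so the data survives a 7-bit mail transport.
assert all(b < 0x80 for b in data)

# The Japanese run sits between Esc $ B and Esc ( B (three bytes each).
start = data.index(ESC + b"$B")
end = data.index(ESC + b"(B")
raw_jis = data[start + 3:end]  # pairs of bytes in the range $21..$7e

# Split the run into two-byte "raw JIS" values, one per character.
pairs = [raw_jis[i:i + 2] for i in range(0, len(raw_jis), 2)]
print([p.hex() for p in pairs])
```

The three pairs printed at the end are the JIS X 0208 values of the three
kanji; the ASCII letters before and after the run are left untouched.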
Because the data itself matches the original JIS character numbers, the
ISO-2022-JP encoding method is also known as "JIS encoding" (not to be
confused with the "JIS character set"!). The most widely used Japanese
character set is known as JIS X 0208-1990. It includes 6879 characters, among
which are the hiragana and katakana syllabaries, 6355 kanji, the Roman, Greek,
and Cyrillic alphabets, the numerals, and a number of typographic symbols. The
characters are arranged in a 94-by-94 grid, usually addressed by a row number
from 33 to 126 and a column number from 33 to 126. In most common discussion,
"JIS" when not followed by a particular standard number refers to the
JIS X 0208-1990 character set. ISO-2022-JP, defined in RFC 1468, is this JIS code.
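
The grid addressing can be sketched in a few lines of Python. The helper
names below are hypothetical, and the raw JIS bytes are obtained by stripping
the escape sequences from the output of the standard "iso2022_jp" codec.
Counting rows and columns from 1 instead of 33 is a common convention that is
assumed here, not something stated in the text above.

```python
def jis_bytes(ch: str) -> bytes:
    """Return the two raw JIS bytes of a single JIS X 0208 character."""
    data = ch.encode("iso2022_jp")
    # Drop the three-byte opening and closing escape sequences.
    return data[3:-3]


def grid_position(ch: str) -> tuple[int, int]:
    """Row and column in the grid, both counted from 1."""
    first, second = jis_bytes(ch)
    # Byte values 33..126 (94 values) map to positions 1..94.
    return first - 32, second - 32


for ch in "語あ":
    print(ch, list(jis_bytes(ch)), grid_position(ch))
```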
Japanese e-mail clients on a Japanese operating system automatically
convert messages into JIS and back. Most modern browsers recognize all
three encoding types ("Auto-Detect"), but only JIS announces itself: its escape
sequences tell the browser explicitly when to switch to Japanese. ISO-2022-JP
(JIS) encoding defines a standard way to send data in multiple character sets
when the transmission medium supports only 7-bit bytes.
Overall it is probably the best choice for communication purposes, being a
7-bit code that uses escape sequences and ASCII characters to encode the
Japanese characters. One is strongly encouraged to use JIS if one can.

[Shift JIS code]

Shift-JIS (also known as SJIS, X-SJIS or MS Kanji) is a Microsoft-developed
character encoding for the Japanese language, standardized as JIS X 0208
Appendix 1. It is based on character sets defined within JIS standards JIS X
0201:1997 (for the single-byte characters) and JIS X 0208:1997 (for the
double-byte characters).

Shift-JIS is widely used on PCs and Macs for internal character coding.
Since the first byte of a Japanese character does not fall in the ASCII code
area, no escape sequence is necessary, and Japanese characters can be freely
intermixed with ASCII characters. It uses 8-bit bytes, resulting in double-byte
dependencies: a given byte may be a single-byte ASCII character meant to stand
alone, or it may be the second byte of a 2-byte character, meant to be read
together with the preceding byte. The first byte of a 2-byte Shift-JIS character
never matches any 7-bit ASCII character. For Japanese, the most significant bit
of the first byte is always "1" (the first byte lies in 81H-9FH or E0H-EFH), and
this bit identifies a character as Japanese or English.
Shift JIS requires an 8-bit medium for transmission. It is fully backwards
compatible with the legacy JIS X 0201 single-byte encoding, meaning it supports
half-width katakana and that any valid JIS X 0201 string is also a valid Shift JIS
string. However, Shift JIS only guarantees that the first byte of a two-byte
character has its high bit set; the second byte can be either high or low, and
may fall in the ASCII range. This makes reliable Shift JIS detection difficult.
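
A minimal sketch of how a reader must walk Shift JIS data byte by byte. The
function name is hypothetical. The lead-byte ranges used ($81-$9f and $e0-$fc)
are the commonly documented ones, including the extension area; they are an
assumption added here rather than something taken from the text above.

```python
def split_shift_jis(data: bytes) -> list[bytes]:
    """Split Shift JIS data into one bytes object per character."""
    chars = []
    i = 0
    while i < len(data):
        lead = data[i]
        # $81-$9f and $e0-$fc start a two-byte character. Anything else
        # (ASCII / JIS Roman, half-width katakana $a1-$df) is one byte.
        width = 2 if (0x81 <= lead <= 0x9F or 0xE0 <= lead <= 0xFC) else 1
        chars.append(data[i:i + width])
        i += width
    return chars


data = "aソb".encode("shift_jis")
print(data)
print(split_shift_jis(data))

# The second byte of this katakana is $5c, the same value as an ASCII
# backslash, so a byte-wise search finds a "backslash" that is not there.
print(b"\\" in data)
```

Because the second byte of a character can look like ASCII, the data can only
be interpreted safely by scanning from a known character boundary, as the loop
does.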
Many different versions of Shift JIS exist. There are two areas for expansion.
Firstly, JIS X 0208 does not fill the whole 94×94 space encoded for it in Shift JIS,
so there is room for more characters here; these are really extensions to
JIS X 0208 rather than to Shift JIS itself. The most popular extension of this
kind is Windows-31J (otherwise known as Code page 932), an encoding popularized
by Microsoft and registered separately from Shift JIS (Microsoft calls that
variation "shift_jis"). Secondly, Shift JIS has more encoding space than is
needed for JIS X 0201 and JIS X 0208, and this space can be, and is, used for
yet more characters. The space with lead bytes 0xF5 to 0xF9 is used by Japanese
mobile phone operators for pictographs for use in e-mail. IBM CCSID 943 has the
same extensions as Code Page 932.
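
The difference between plain Shift JIS and the Code page 932 variant can be
seen with Python's standard codecs, where "shift_jis" follows JIS X 0208 and
"cp932" includes the Microsoft extensions. The test character, a circled digit
one, is an arbitrary choice of a well-known vendor extension.

```python
ch = "①"  # circled digit one: a vendor extension, not part of JIS X 0208

# Code page 932 (Windows-31J) has a two-byte code for it.
print(ch.encode("cp932").hex())

# The plain Shift JIS codec has no code for it and refuses to encode.
try:
    ch.encode("shift_jis")
    in_plain_shift_jis = True
except UnicodeEncodeError:
    in_plain_shift_jis = False
print(in_plain_shift_jis)

# Ordinary JIS X 0208 characters are encoded identically by both.
print("語".encode("cp932") == "語".encode("shift_jis"))
```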

[EUC code (formerly X-EUC-JP)]

EUC-JP is one instance of a more general (but underused) method for
encoding many of the world's languages. EUC (Extended Unix Code), much like
the other formats, is a multibyte format. It was designed by AT&T and was
supported on System V to represent the Asian character sets. It is used for
Japanese-language systems on UNIX workstations, and web pages that reside
on UNIX systems are often encoded in EUC. Its structure follows the ISO 2022
standard. EUC is very similar to JIS without the escape sequences, with the 8th
bit of each encoded byte turned on to distinguish Japanese characters from
ASCII. In other words, it is an eight-bit code that uses the area left vacant by
ASCII (bytes whose most significant bit is 1), and so takes advantage of media
that support 8-bit bytes. EUC-JP is often recommended for use with PHP and
MySQL. XML documents can declare EUC-JP as their encoding, although XML
parsers are only required to support UTF-8 and UTF-16.

EUC defines a variable-length multibyte encoding intended primarily for
interchange, and a fixed-length encoding intended primarily for processing. EUC
does not use opening or closing designators, and its byte values do not coincide
with those of JIS encoding or Unicode. The 8-bit format EUC-JP does not
represent half-width katakana as single bytes (each takes two bytes, the first
being $8e), and allows a much cleaner and more direct conversion to and from
JIS X 0208 code points, as every byte with the high bit set is part of a
multibyte character and every byte without it is a single-byte character.

It's a very simple and straightforward solution: to distinguish Japanese
characters from ASCII, simply add 128 to each JIS value by setting the highest
bit of each byte on. If j and k are the original JIS values and e and f are the
transmitted EUC bytes, then:

    e = j + 128
    f = k + 128

This pushes all the EUC codes up into the top half of the 8-bit range. They land
from $a1 to $fe, where they have no chance of getting confused with ASCII codes
from 0 to $7f. Nice and easy.
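
The rule can be checked against Python's standard "euc_jp" codec. In this
sketch the function name is hypothetical, and the raw JIS bytes are taken from
the output of the "iso2022_jp" codec with its escape sequences stripped.

```python
def jis_to_euc(j: int, k: int) -> bytes:
    """Apply e = j + 128, f = k + 128 to one pair of raw JIS bytes."""
    return bytes([j + 128, k + 128])


ch = "語"  # arbitrary sample kanji
# Raw JIS bytes: ISO-2022-JP output minus the two 3-byte escape sequences.
j, k = ch.encode("iso2022_jp")[3:-3]
euc = jis_to_euc(j, k)

print(hex(j), hex(k), "->", euc.hex())
print(euc == ch.encode("euc_jp"))
```

Adding 128 is the same as setting the top bit, so `j | 0x80` would work equally
well for values in the range $21 to $7e.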

[Unicode]

Unicode is an up-and-coming standard that is not yet widely supported,
although it is probably the best known of the character encoding formats.
Unicode was designed by the Unicode Consortium and can be used to represent
most of the world's languages.
It was originally designed around a 16-bit space in which over 65000
characters can be contained, and has since been extended beyond 16 bits to make
room for more. Unlike the others, its specification is universal, and aims to
include all written languages on earth in use today, plus space reserved for
future use, user-specified characters, and a compatibility region.
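
As a small sketch of the 16-bit space described above, the following Python
lines print the Unicode code number of a few sample characters and their byte
forms. The characters are arbitrary choices, and the UTF-16 and UTF-8 byte
encodings shown are additions for illustration; the text above does not discuss
them.

```python
samples = "語あA"  # a kanji, a hiragana and an ASCII letter

rows = []
for ch in samples:
    code = ord(ch)  # the character's Unicode code number
    rows.append((ch, code, ch.encode("utf-16-be"), ch.encode("utf-8")))

for ch, code, utf16, utf8 in rows:
    print(f"{ch}  U+{code:04X}  UTF-16: {utf16.hex()}  UTF-8: {utf8.hex()}")

# All three samples fit in the original 16-bit space.
print(all(code < 0x10000 for _, code, _, _ in rows))
```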
Though it is powerful, it is not the most popular. Much of the conflict arises
from the fact that this symbol set and code space aims to unify hanzi (Chinese
characters), kanji (Japanese characters), and hanja (Korean characters) into one
symbol set. The three main languages that use Han characters (Chinese, Japanese,
and, in part, Korean) are thus merged. However, a Chinese Unicode font cannot be
used to print Japanese and vice versa, because of stylistic differences that
would be objectionable to native speakers. This has created much political
debate, and has frustrated input method developers, as the arrangement renders
many common search methods useless.

