In the English-speaking world, the one-byte character code set "ASCII" is used,
which includes the alphanumerics (0-9, A-Z, a-z), special symbols and control
characters. However, this code set cannot display any Japanese. In order to
display Japanese kanji and kana characters on your computer, you need programs
that can accommodate one of the three basic Japanese encoding methods: JIS,
Shift-JIS, or EUC (Extended UNIX Code).
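To make the difference between the three methods concrete, the following minimal sketch encodes one hiragana character with each of them. It assumes Python and its standard codec names "iso2022_jp" (for JIS), "shift_jis" and "euc_jp"; the helper names encode_all and ascii_can_encode are invented for this illustration.

```python
# Encode one character with each of the three Japanese encodings.
# The codec names are the ones shipped with Python's standard library;
# the function names are hypothetical helpers for this sketch.

SAMPLE = "あ"  # HIRAGANA LETTER A, chosen only as an example


def encode_all(text: str) -> dict[str, bytes]:
    """Return the bytes each legacy Japanese codec produces for text."""
    codecs = ("iso2022_jp", "shift_jis", "euc_jp")
    return {name: text.encode(name) for name in codecs}


def ascii_can_encode(text: str) -> bool:
    """Report whether plain ASCII is able to represent text."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for name, data in encode_all(SAMPLE).items():
        print(f"{name:12s} {data.hex(' ')}")
    print("ASCII can encode it:", ascii_can_encode(SAMPLE))
```

The same character comes out as a different byte sequence under each method, which is why a program has to know which encoding it is reading.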
History:
The first encoding to become widely used was JIS X 0201, which is a
single-byte encoding that only covers standard 7-bit ASCII characters with half-
width katakana extensions. This was widely used in systems that were neither
powerful enough nor had the storage to handle kanji (including DOS and old
embedded equipment such as cash registers).
The development of kanji encodings was the beginning of the split. Shift
JIS was developed to be completely backward compatible with JIS X 0201, and
thus is used in Windows (for backwards compatibility with DOS), and in much
embedded electronic equipment.
Complications arise because the original Internet e-mail standards only
support 7-bit transfer protocols. Thus JIS encoding was developed for sending
and receiving e-mails.
Not all required characters may be included in a character set standard such
as JIS, so gaiji (外字, "external characters") are sometimes used to supplement
the character set. Gaiji may come in the form of external font packs, where normal
characters have been replaced with new characters, or the new characters have
been added to unused character positions. However, gaiji are not practical in
Internet environments since the font set must be transferred with text to use the
gaiji. As a result, such characters are written with similar or simpler characters in
place, or the text may need to be written using a larger character set (such as
Unicode) that supports the required character.
Unicode is supposed to solve all encoding problems in all languages of the
world. For Japanese, the kanji characters have been unified with Chinese; that is,
characters considered to be the same in both Japanese and Chinese have been
given one and the same code number in Unicode, even if they look a little different.
This process is called Han unification. Unicode use is slowly growing because it is
better supported by US-made software, but most Japanese home pages still use
Shift-JIS.
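Han unification can be illustrated with a small sketch. It assumes Python's standard "shift_jis" and "gb2312" codecs as stand-ins for a Japanese and a Chinese legacy encoding; the function names are made up for the example. The character 中 has different bytes in each legacy encoding, yet both decode to a single Unicode code number.

```python
# Illustration of Han unification: one Unicode code point for a character
# that has separate codes in Japanese and Chinese legacy encodings.
# legacy_bytes and unified_code_points are hypothetical helper names.

LEGACY_CODECS = ("shift_jis", "gb2312")


def legacy_bytes(ch: str) -> dict[str, bytes]:
    """Encode ch in a Japanese and a Chinese legacy encoding."""
    return {codec: ch.encode(codec) for codec in LEGACY_CODECS}


def unified_code_points(ch: str) -> dict[str, int]:
    """Decode each legacy form again and report the Unicode code point."""
    return {
        codec: ord(raw.decode(codec))
        for codec, raw in legacy_bytes(ch).items()
    }


if __name__ == "__main__":
    for codec, raw in legacy_bytes("中").items():
        print(codec, raw.hex(" "))
    for codec, cp in unified_code_points("中").items():
        print(codec, f"U+{cp:04X}")
```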
[ ISO-8859 ]
This code uses 8 bits (ASCII plus one bit) and defines extended ASCII
character sets that can display the accented characters used in other
European languages, Cyrillic characters, or Arabic characters. The standard is
divided into several parts (ISO-8859-1, ISO-8859-2 and so on), each supporting
Latin, Cyrillic, Arabic or another script. Most European languages can be
displayed by one of these parts. This code set cannot display any
Japanese.
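As a quick check of what the ISO-8859 parts can and cannot hold, here is a small sketch. It assumes Python's standard codecs "iso8859_1" (Latin-1) and "iso8859_5" (Cyrillic); try_encode is a hypothetical helper name.

```python
# Each ISO-8859 part is a one-byte code covering one script family.
# try_encode is an illustrative helper, not part of any library.
from typing import Optional


def try_encode(text: str, codec: str) -> Optional[bytes]:
    """Return the encoded bytes, or None if the codec lacks a character."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    print(try_encode("é", "iso8859_1"))   # accented Latin letter: one byte
    print(try_encode("Я", "iso8859_5"))   # Cyrillic letter: one byte
    print(try_encode("Я", "iso8859_1"))   # wrong part: None
    print(try_encode("あ", "iso8859_1"))  # Japanese: None
```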
JIS Roman runs from 0 to $7f and is identical to ASCII except for a few minor
differences (notably, the backslash at 92 is instead a yen symbol, and the tilde at
126 is replaced by an overbar). For most practical purposes, JIS Roman and ASCII
can be considered the same, so both the Esc (B and Esc (J escape sequences can
be treated as a switch to ASCII.
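For the cases where the difference does matter, the two positions mentioned above can be handled with a tiny lookup table. This is a hand-written sketch, not a library codec; the names JIS_ROMAN_OVERRIDES and decode_jis_roman are invented for it, and the Unicode targets U+00A5 (yen sign) and U+203E (overline) are the usual choices.

```python
# Decode JIS Roman bytes: identical to ASCII except at 92 and 126.
# Table and function names are hypothetical, for illustration only.

JIS_ROMAN_OVERRIDES = {
    0x5C: "\u00a5",  # 92: yen symbol instead of backslash
    0x7E: "\u203e",  # 126: overbar instead of tilde
}


def decode_jis_roman(data: bytes) -> str:
    """Decode 7-bit JIS Roman bytes into a Python string."""
    out = []
    for b in data:
        if b > 0x7F:
            raise ValueError(f"byte {b:#04x} is outside the 7-bit range")
        out.append(JIS_ROMAN_OVERRIDES.get(b, chr(b)))
    return "".join(out)


if __name__ == "__main__":
    print(decode_jis_roman(b"C:\\dir~"))
```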
Both JIS C 6226-1978 and JIS X 0208-1983 are earlier versions of JIS X 0208-1990.
For most practical purposes, both escape sequences that open Japanese text
(Esc $ @ and Esc $ B) can be treated as a switch to JIS X 0208-1990.
Typically, then, Japanese text appears enclosed by two escape sequences, each
made of the Esc byte followed by two designator bytes: either Esc $ @ or Esc $ B
at the beginning, and either Esc (B or Esc (J at the end. The text itself between
the escape sequences, which selects the kanji or kana to use, consists of
pairs of plain 7-bit bytes in the printable range from $21 to $7e, simply formed by
splitting the JIS value into two bytes. This is also known as "raw JIS".
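The framing just described can be walked through with a short parser. This is a minimal sketch that assumes only the four escape sequences named in the text and that a stream starts in ASCII; the names DESIGNATORS and split_segments are invented for the example, and a full ISO-2022-JP decoder would need more care.

```python
# Split an ISO-2022-JP byte string into runs, using the escape sequences
# Esc $ @, Esc $ B, Esc ( B and Esc ( J. Names here are hypothetical.

DESIGNATORS = {
    b"\x1b$@": "JIS C 6226-1978",
    b"\x1b$B": "JIS X 0208-1983",
    b"\x1b(B": "ASCII",
    b"\x1b(J": "JIS Roman",
}


def split_segments(data: bytes) -> list[tuple[str, bytes]]:
    """Return (character set, payload bytes) for each run in data."""
    segments = []
    current = "ASCII"  # assumption: text begins in ASCII
    start = 0
    i = 0
    while i < len(data):
        if data[i] == 0x1B:  # Esc starts a three-byte designator
            seq = data[i:i + 3]
            if seq not in DESIGNATORS:
                raise ValueError(f"unknown escape sequence {seq!r}")
            if i > start:
                segments.append((current, data[start:i]))
            current = DESIGNATORS[seq]
            i += 3
            start = i
        else:
            i += 1
    if start < len(data):
        segments.append((current, data[start:]))
    return segments


if __name__ == "__main__":
    encoded = "abc日本語xyz".encode("iso2022_jp")
    for charset, payload in split_segments(encoded):
        print(f"{charset:16s} {payload.hex(' ')}")
```

The middle run holds two bytes per character, and every one of those bytes stays inside the printable range $21 to $7e.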
Because the data itself matches the original JIS character numbers, the
ISO-2022-JP encoding method is also known as "JIS encoding" (not to be
confused with the "JIS character set"!). The most popularly-used Japanese
character set is known as JIS X 0208-1990. It includes 6879 characters, among
which are the hiragana and katakana syllabaries, 6355 kanji, the Roman, Greek,
and Cyrillic alphabets, the numerals, and a number of typographic symbols. The
characters are arranged in a 94-by-94 grid, which usually becomes a row number
from 33 to 126 and a column number from 33 to 126. In most common discussion,
"JIS" when not followed by a particular standard number refers to the JIS X 0208-
1990 character set. ISO-2022-JP defined in RFC-1468 is this JIS code.
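The grid position of a character can be recovered from its raw JIS bytes. The sketch below assumes Python's "iso2022_jp" codec frames a single character as a three-byte escape, two data bytes and a three-byte escape; the 1-to-94 row and cell numbering (byte value minus 32) is a common convention added here for illustration, and the function names are hypothetical.

```python
# Recover the raw JIS bytes and the grid position of one character.
# jis_bytes and grid_position are illustrative names for this sketch.


def jis_bytes(ch: str) -> bytes:
    """Return the two raw JIS X 0208 bytes for a single character."""
    encoded = ch.encode("iso2022_jp")
    # Expected layout: Esc $ B, two data bytes, Esc ( B.
    if len(encoded) != 8 or not encoded.startswith(b"\x1b$B"):
        raise ValueError("not a single JIS X 0208 character")
    return encoded[3:5]


def grid_position(ch: str) -> tuple[int, int]:
    """Return (row, cell), each counted from 1 to 94."""
    first, second = jis_bytes(ch)
    return first - 0x20, second - 0x20


if __name__ == "__main__":
    for ch in "あ亜":
        raw = jis_bytes(ch)
        print(ch, raw.hex(" "), grid_position(ch))
```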
Japanese email clients on a Japanese operating system will automatically
convert messages into JIS and back. While most modern browsers recognize all
three encoding types ("Auto-Detect"), only JIS, through its escape sequences,
explicitly alerts the browser to switch to Japanese. ISO-2022-JP (JIS) encoding
defines a standard way to send data in multiple character sets when the
transmission medium supports only 7-bit bytes.
Because it is a 7-bit code that uses escape sequences and ASCII characters to
encode the Japanese characters, it is probably the best choice for
communication purposes. One is strongly encouraged to use JIS if one can.
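Whether an encoding is safe for a 7-bit medium is easy to test: no byte may have its top bit set. A minimal sketch, assuming Python's standard codec names; is_seven_bit and seven_bit_report are names made up for the example.

```python
# Check which encodings produce only 7-bit bytes for a given text.
# Function names are hypothetical helpers for this sketch.


def is_seven_bit(data: bytes) -> bool:
    """True if every byte fits in 7 bits (value below 128)."""
    return all(b < 0x80 for b in data)


def seven_bit_report(text: str) -> dict[str, bool]:
    """Encode text three ways and report which results are 7-bit clean."""
    codecs = ("iso2022_jp", "shift_jis", "euc_jp")
    return {codec: is_seven_bit(text.encode(codec)) for codec in codecs}


if __name__ == "__main__":
    for codec, clean in seven_bit_report("日本語 text").items():
        print(f"{codec:12s} 7-bit clean: {clean}")
```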
Shift JIS is widely used on PCs and Macs for internal computer coding. Since this
code doesn't use the same code area as the ASCII code, the escape sequence is not
necessary. It allows Japanese characters to be intermixed with ASCII characters.
It uses 8-bit bytes, resulting in double-byte dependencies: a given byte may be a
single-byte ASCII character meant to stand alone, or it may be the second byte of
a 2-byte character, meant to be read together with the other byte. The first byte
of a 2-byte Shift JIS character does not match any 7-bit ASCII character. For
Japanese, the most significant bit of the first byte is always "1" (one), e.g.
81H-9FH and E0H-EFH, and this bit can identify a character as Japanese or
English.
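The first-byte rule is enough to cut a Shift JIS byte stream into characters. The sketch below assumes the standard lead-byte ranges 0x81-0x9F and 0xE0-0xEF and treats every other byte as a one-byte character (ASCII or half-width katakana); vendor extensions with other lead bytes are deliberately ignored, and the function names are hypothetical.

```python
# Split Shift JIS bytes into characters by looking at the first byte.
# is_lead_byte and split_shift_jis are illustrative names only.


def is_lead_byte(b: int) -> bool:
    """True if b opens a two-byte character (standard ranges only)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF


def split_shift_jis(data: bytes) -> list[bytes]:
    """Return the byte sequence of each character in data."""
    chars = []
    i = 0
    while i < len(data):
        if is_lead_byte(data[i]):
            if i + 1 >= len(data):
                raise ValueError("lead byte at end of data")
            chars.append(data[i:i + 2])
            i += 2
        else:
            chars.append(data[i:i + 1])
            i += 1
    return chars


if __name__ == "__main__":
    # ASCII letter, hiragana, half-width katakana (U+FF71)
    sample = "Aあ\uff71".encode("shift_jis")
    print([c.hex() for c in split_shift_jis(sample)])
```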
Shift JIS requires an 8-bit medium for transmission. It is fully backwards
compatible with the legacy JIS X 0201 single-byte encoding, meaning it supports
half-width katakana and that any valid JIS X 0201 string is also a valid Shift JIS
string. However, Shift JIS only guarantees that the first byte will be in the upper
half of the 8-bit range; the value of the second byte can be either high or low. This
makes reliable Shift JIS detection difficult.
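A well-known consequence of the low second byte is that it can collide with an ASCII character. In the sketch below, the katakana ソ is used as the example: its second byte has the same value as the backslash. The two function names are invented for the illustration.

```python
# The second byte of a Shift JIS character may look like ASCII.
# naive_has_backslash and decoded_has_backslash are hypothetical names.

BACKSLASH = 0x5C


def naive_has_backslash(data: bytes) -> bool:
    """Byte-level search that ignores character boundaries."""
    return BACKSLASH in data


def decoded_has_backslash(data: bytes) -> bool:
    """Decode first, then search among whole characters."""
    return "\\" in data.decode("shift_jis")


if __name__ == "__main__":
    raw = "ソ".encode("shift_jis")
    print(raw.hex(" "))
    print("byte search finds a backslash:", naive_has_backslash(raw))
    print("decoded text has a backslash: ", decoded_has_backslash(raw))
```

Software that scans Shift JIS text byte by byte, for example when looking for path separators or escape characters, has to track character boundaries to avoid this kind of false match.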
Many different versions of Shift JIS exist. There are two areas for expansion.
Firstly, JIS X 0208 does not fill the whole 94×94 space encoded for it in Shift JIS,
so there is room for more characters here; these are really extensions to
JIS X 0208 rather than to Shift JIS itself. The most popular extension here is
Windows-31J (otherwise known as Code page 932), an encoding popularized by
Microsoft and registered separately from Shift JIS (Microsoft calls that variation
"shift_jis"). Secondly, Shift JIS has more encoding space than is needed for JIS X
0201 and JIS X 0208, and this space can be and is used for yet more characters. The
space with lead bytes 0xF5 to 0xF9 is used by Japanese mobile phone operators
for pictographs for use in e-mail. IBM CCSID 943 has the same extensions as
Code Page 932.
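The gap between plain Shift JIS and the Windows variant shows up with characters that exist only in the extension. The sketch below assumes Python's "shift_jis" codec covers only the standard repertoire and that "cp932" is its Code page 932 codec; the circled digit ① is used as an example of an extension character, and can_encode is a hypothetical helper.

```python
# Compare plain Shift JIS with the Code page 932 extension.
# can_encode is an illustrative helper name.


def can_encode(text: str, codec: str) -> bool:
    """True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for codec in ("shift_jis", "cp932"):
        print(f"{codec:10s} can encode ①: {can_encode('①', codec)}")
    print("①".encode("cp932").hex(" "))
```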
EUC is a subset of a more wide-ranging (but underused) method for
encoding many of the world's languages. EUC (Extended UNIX Code),
much like the other formats, is a multibyte format. It was designed by AT&T and was
supported on System V to represent the Asian character sets. This code is used
for the Japanese language systems of UNIX workstations. Web pages that reside
on UNIX systems are often encoded in EUC. Its structure is based on an ISO
standard (ISO 2022).
EUC is very similar to JIS without the escape sequences, and with the 8th bit (of
each byte) turned on in encoded bytes to distinguish Japanese characters from
ASCII. The code system is eight-bit, using the area that ASCII leaves vacant
(bytes whose most significant bit is 1). It is highly recommended to use EUC-JP
together with PHP and MySQL. XML documents can also use EUC-JP, provided the
encoding is named in the XML declaration and the parser supports it. It takes
advantage of media that support 8-bit bytes.
Each EUC byte f is obtained from the corresponding JIS byte k by adding 128:
f = k + 128
This pushes all the EUC codes up into the top half of the 8-bit range. They land
from $a1 to $fe, where they have no chance of getting confused with ASCII codes
from 0 to $7f. Nice and easy.
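The rule f = k + 128 translates directly into code. This minimal sketch applies it to a pair of raw JIS bytes and can be compared against Python's "euc_jp" codec; jis_to_euc is a name invented for the example, and only two-byte JIS X 0208 characters are handled.

```python
# Convert a raw JIS byte pair to EUC by adding 128 to each byte.
# jis_to_euc is a hypothetical helper name.


def jis_to_euc(jis_pair: bytes) -> bytes:
    """Apply f = k + 128 to both bytes of a raw JIS character."""
    if len(jis_pair) != 2:
        raise ValueError("expected exactly two JIS bytes")
    if not all(0x21 <= k <= 0x7E for k in jis_pair):
        raise ValueError("JIS bytes must lie between $21 and $7e")
    return bytes(k + 128 for k in jis_pair)


if __name__ == "__main__":
    euc = jis_to_euc(b"\x24\x22")  # raw JIS bytes of the hiragana あ
    print(euc.hex(" "), euc.decode("euc_jp"))
```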
[Unicode]