
Web Internationalization – Standards and Practice

Web Internationalization
Standards and Practice

29th Internationalization and Unicode Conference

Tex Texin (XenCraft/Yahoo)
Yves Savourel (ENLASO Corporation)
Copyright © 2002-2006 Tex Texin and Yves Savourel.

Objectives
• Describe the standards that define the architecture & principles for I18N on the web
• Scope limited to markup languages
• Provide practical advice for working with international data on the web, including the design and implementation of multilingual web sites and localization considerations
• Be introductory level
Web Internationalization – Standards and Practice Slide 2

Legend For This Presentation

Icons used to indicate current product support:
Internet Explorer 6, Firefox 1, Opera 8.5
Supported:
Partially supported:
Not supported:

Caution
Highlights a note for users or developers to be careful.

Web Internationalization Agenda

• Part 1 – Character Processing
• Coffee Break
• Part 2
  – Layout and Typography
  – Designing International Web Sites

Web I18n Part 1- Character Processing

Character Encodings
Character Encoding Negotiation
Reference Processing Model
Character Escaping
Unicode in Markup
Normalization
Identifiers

A Simple HTML Example Page


A Simple HTML Example Page

Here is how the same HTML looks in Japan

The browser has no information about the encoding of the web page. It uses a default value which, in this case, is very wrong and even confuses the markup (see beauté).

A Simple HTML Example Page

Here is how the same HTML looks in Japan

Some of the problems may not be obvious to the reader. Changing the euro symbol to a bullet might cause a significant financial error.

Character Encodings

Encoding disagreement is one problem for text.
We also consider the following problems and solutions.
(See: Character Model for the World Wide Web http://www.w3.org/TR/charmod/ )

Problem                  Solution
Encoding disagreement    Encoding negotiation
Encoding diversity       Reference Processing Model
Encoding limitations     Character escaping
Unicode vs. Markup       Markup preferred on the web
String Matching          Early Uniform Normalization
String Indexing          Character counting guidelines

Character Encodings

First though: What are character encodings?

The Character Encoding Model – Unicode Tech. Report 17

ACR   𣎴             ≠        A+˚
CCS   U+233B4       U+2260   U+0041 U+030A
CEF   D84C DFB4     2260     0041 030A
CES   D8 4C DF B4   22 60    00 41 03 0A

Character Encodings

ACR = Abstract Character Repertoire

The set of characters you want to be able to represent (aka Character Set).
• Characters can be alphabetic, punctuation, arithmetic, or specific to a discipline or vertical market (e.g. proofreading symbols § ¶, business symbols © ®)
• Characters may be composable. E.g. Å = A + ˚


Character Encodings

CCS = Coded Character Set

ACR   𣎴         ≠        A+˚
CCS   U+233B4   U+2260   U+0041 U+030A

Maps each character to a non-negative unique number.
  – Note this example uses hexadecimal numbers.
  – The “U+” indicates use of Unicode’s numbering.
  – The grapheme Å consists of two characters A + ˚
  – Unicode calls these “Unicode Scalar Values”

Character Encodings

CEF = Character Encoding Form

ACR          𣎴           ≠        A+˚
CCS          U+233B4     U+2260   U+0041 U+030A
CEF 16-bit   D84C DFB4   2260     0041 030A

Note the relationship between the CEFs is not so simple.
Map CCS to fixed width units (e.g. 32, 16, or 8-bit)

Character Encodings

CEF = Character Encoding Form

ACR          𣎴             ≠          A+˚
CCS          U+233B4       U+2260     U+0041 U+030A
CEF 16-bit   D84C DFB4     2260       0041 030A
CEF 8-bit    F0 A3 8E B4   E2 89 A0   41 CC 8A

Map CCS to fixed width units (e.g. 32, 16, or 8-bit)
Note the relationship between the CEFs is not so simple.

Character Encodings

CES = Character Encoding Scheme

ACR            𣎴             ≠        A+˚
CCS            U+233B4       U+2260   U+0041 U+030A
CEF 16-bit     D84C DFB4     2260     0041 030A
CES UTF-16BE   D8 4C DF B4   22 60    00 41 03 0A

CES: Mapping the CEF(s) to serialization of bytes
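The CCS, CEF and CES rows above can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper names (code_points, utf16_code_units, hex_bytes) are made up for this example and are not part of the presentation.

```python
# Illustrative sketch: rebuild the slide's CCS / CEF / CES rows for one string.
# The string is the slide's example: U+233B4, U+2260, U+0041, U+030A.
text = "\U000233B4\u2260A\u030A"


def code_points(s):
    """CCS view: one 'U+XXXX' label per Unicode scalar value."""
    return [f"U+{ord(ch):04X}" for ch in s]


def utf16_code_units(s):
    """16-bit CEF view: each 16-bit code unit as four hex digits."""
    raw = s.encode("utf-16-be")
    return [raw[i:i + 2].hex().upper() for i in range(0, len(raw), 2)]


def hex_bytes(s, encoding):
    """Byte view: the bytes produced by the named encoding, as hex."""
    return " ".join(f"{b:02X}" for b in s.encode(encoding))


print(code_points(text))
print(utf16_code_units(text))
print(hex_bytes(text, "utf-8"))      # the 8-bit CEF row
print(hex_bytes(text, "utf-16-be"))  # the UTF-16BE CES row
```

Note how the supplementary character U+233B4 becomes two 16-bit code units (a surrogate pair) but four 8-bit units, which is why the relationship between the CEFs is not simple.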

Character Encodings

• Many character sets exist and are in popular use
• Many encoding schemes, even for 1 character set
    ISO 8859-1 ≈ IBM 850
    ISO-2022-JP, Shift_JIS, EUC-JP (JIS X-0208-1997)
    UTF-8 = UTF-16 = UTF-32
• Given just bytes, the character set and the encoding scheme can be indeterminate.

How can a browser know how to decode a web page?

Encoding Identification

• Given just bytes, encoding is indeterminate.
• How can an encoding be identified?
• There are 2 requirements:
  – Agreement on names for encodings
  – Mechanisms for labeling text with encoding
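A small Python sketch of why bytes alone are indeterminate: the same byte sequence decodes without complaint under more than one encoding, and only one result is what the author meant. The helper name decode_as is an assumption made for this illustration.

```python
# Illustrative sketch: one text, two encodings, and what happens when the
# receiver guesses the wrong one.
original = "beauté"
utf8_bytes = original.encode("utf-8")          # b'beaut\xc3\xa9'
latin1_bytes = original.encode("iso-8859-1")   # b'beaut\xe9'


def decode_as(data, encoding):
    """Decode bytes under an assumed encoding, replacing undecodable input."""
    return data.decode(encoding, errors="replace")


print(decode_as(utf8_bytes, "utf-8"))         # correct label
print(decode_as(utf8_bytes, "iso-8859-1"))    # wrong label: two characters for é
print(decode_as(latin1_bytes, "utf-8"))       # wrong label: replacement character
```

Nothing in the bytes themselves says which reading is right, which is why agreed names and a labeling mechanism are both needed.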


Character Encoding Names

IANA (Internet Assigned Numbers Authority)
  – Maintains registry of official names for character sets (actually encodings) used on the internet and in MIME (mail)
• Registry Names
  – ASCII, printable characters
  – Case-insensitive
  – Maximum length 40 characters
  – Aliases (alternative names) are also registered
  – The preferred name is indicated
www.iana.org/assignments/character-sets

Unregistered Encoding Names

• Conventions for Unregistered Character Encoding Names
  – Name begins with “x-”
  – Example: “x-Tex-Yves-encoding”
  – Useful for private encodings or very new encodings
  – Not useful on the web, except for private exchange

Character Encoding Names

• IANA Name and Alias Examples
  – ISO_8859-1:1987 (ISO_8859-1, ISO-8859-1, latin1, L1, IBM819, CP819, csISOLatin1)
  – Windows-1252, GB2312, BIG5, BIG5-HKSCS
  – SHIFT_JIS, HP-Legal
  – Extended_UNIX_Code_Packed_Format_for_Japanese
  – Adobe-standard-encoding
  – UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32
• Note- Registry contains many useless names
• Note- Preferred names indicated. Use them.

Markup and Encoding Names

• HTTP
• HTML
• XML
• CSS
• Links
  – HTML <LINK>
  – HTML <… HREF>
  – XML <… HREF>
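Alias handling for encoding names can be tried out with Python's codecs module. Note the assumption: Python keeps its own alias table, which overlaps with the IANA registry but is not the registry itself, so this is only a sketch of the idea of case-insensitive names and aliases. The helper names are invented for the example.

```python
import codecs


def same_encoding(name_a, name_b):
    """True if Python resolves both labels to the same codec."""
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name


def is_known(name):
    """True if Python has a codec registered under this label."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


print(same_encoding("ISO-8859-1", "latin1"))        # aliases of one encoding
print(same_encoding("csISOLatin1", "IBM819"))       # more aliases, any case
print(same_encoding("ISO-8859-1", "windows-1252"))  # different encodings
print(is_known("x-Tex-Yves-encoding"))              # private name, unknown here
```

The last call illustrates why private “x-” names are of little use outside a private exchange: a receiver that has never heard of the name cannot decode the text.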

HTTP and Encoding Names

Mechanism for labeling HTTP with encoding

HTTP Response

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
--- Blank Line
document
...

HTML, XML & Encoding Names

HTML
<META HTTP-EQUIV="Content-type"
CONTENT="text/html; charset=UTF-8">
  – HTML does not specify a default.

XML
<?xml version="1.0" encoding="UTF-8" ?>
  – Alternative declaration: Begin with Byte Order Mark (U+FEFF), for UTF-16 or UTF-8
  – Note UTF-16 MUST begin with a BOM
  – The default encoding is UTF-8.


CSS2 and Encoding Name

CSS2
  – Only used in the first line of external style sheets
@charset "UTF-8";
• New!
  – CSS 2.1 added Unicode Byte Order Mark (BOM, U+FEFF) as an encoding indicator.
  – Encoding is unspecified if BOM and @charset conflict.

LINKs and Encoding Name

Declaring the charset of a LINKed document
• HTML
<LINK title="Arabic text"
type="text/html" charset="ISO-8859-6"
rel="alternate" href="arabic.html">

<A href="http://www.unicode.org" charset="UTF-8">Unicode</A>
• XML
<?xml-stylesheet href="…" type="…" charset="UTF-16"?>

Notes: Declaring Encoding Names

• Charset on links can be incorrect if the document’s encoding on the server changes
• The encoding for the <META… charset=…> is unknown until the statement is processed.
  – So ASCII is recommended for this statement
  – Place it as early as possible in the document.
  – Else, prior statements may be decoded incorrectly.
• Note:
  – Transcoders do not generally correct charset ID

HTML Encoding Priorities

Prioritization is used to resolve conflicts.
• From high to low priority, HTML uses the encoding of:
  1. HTTP “Content-Type” charset
  2. <META http-equiv “Content-Type” charset>
  3. LINK or other syntax for external documents
  4. Charset-detecting heuristics
• Many user agents (browsers) support a user override for charset (highest priority)
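The HTML priority order can be sketched as a simple first-match lookup. The function name, parameter names and source labels below are hypothetical; real browsers are considerably more involved, so treat this as an illustration of the ordering only.

```python
def resolve_html_charset(user_override=None, http_charset=None,
                         meta_charset=None, link_charset=None, sniffed=None):
    """Return (charset, source) for the highest-priority label that is present.

    Order follows the list above, with the user override placed first.
    All parameter names are hypothetical, chosen for this sketch.
    """
    candidates = [
        ("user override", user_override),
        ("HTTP Content-Type", http_charset),
        ("META http-equiv", meta_charset),
        ("LINK charset", link_charset),
        ("heuristics", sniffed),
    ]
    for source, charset in candidates:
        if charset:
            # Encoding names are case-insensitive, so normalize the case.
            return charset.lower(), source
    return None, "undetermined"


# The HTTP header wins over a conflicting META declaration.
print(resolve_html_charset(http_charset="UTF-8", meta_charset="ISO-8859-1"))
# With no HTTP charset, the META declaration is used.
print(resolve_html_charset(meta_charset="Shift_JIS", sniffed="EUC-JP"))
```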

CSS2 Encoding Priorities

Prioritization is used to resolve conflicts.
• From high to low priority, CSS 2.1 external style sheets use the encoding of:
  1. HTTP “Content-Type” charset
  2. BOM/@charset rule in the style sheet
  3. LINK or other syntax in referencing document
  4. Charset of the referencing document
  5. Assume UTF-8

XML Encoding Priorities

• Encoding name processing is more carefully specified for XML.
• As with HTML, protocol or external information can supersede declaration, BOM or default of UTF-8.
• XML Appendix E (non-normative): Prioritization should be specified by protocols.
  – Recommends use of BOM or encoding declaration for files (rather than an external source).
  – Refers to RFC 3023
    • RFC 3023 specifies several encoding scenarios based on MIME media type: text/xml, application/xml, etc.


Web I18n Part 1- Character Processing

Character Encodings
Character Encoding Negotiation
Reference Processing Model
Character Escaping
Unicode in Markup
Normalization
Identifiers

Character Encoding Negotiation

[Diagram: Unix user (GB2312, html) and Windows user (1252, html)]

Typical Browser-Server HTTP Sequence

1. Browser issues GET URL
2. Server sends RESPONSE
3. Browser displays document in RESPONSE
4. Browser POSTs Form with user data (text)
5. Web Server receives data, database application stores text.

Which encoding is sent by the server?
Which encoding is returned by the browser?

Character Encoding Negotiation

[Diagram: Browser → Server: Get URL, Accept Charset x,y,z; the server holds the HTML pages]

GET / HTTP/1.1
Accept-Language: en-us,en,hr;q=0.5
Accept-Charset: iso-8859-1,utf-8;q=0.75,*;q=0.5

The browser’s HTTP GET request can list the languages and the encodings it can make use of, to guide the server.
• “q” is a relative measure of the usefulness (quality) of an entry.

The above example indicates:
• US English preferred, other English, Croatian are also ok.
• ISO 8859-1 preferred, then UTF-8, then anything else.
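How a server might read the q values can be sketched in Python as follows. This is a simplified parser written for this example (the name parse_accept is invented); it assumes well-formed input and ignores parameters other than q.

```python
def parse_accept(header):
    """Parse an Accept-Charset or Accept-Language value into (name, q) pairs.

    Entries without an explicit q default to 1.0. The result is sorted from
    most to least preferred; ties keep the order used in the header.
    Simplified sketch: assumes well-formed input.
    """
    entries = []
    for position, item in enumerate(header.split(",")):
        parts = [part.strip() for part in item.split(";")]
        name = parts[0]
        if not name:
            continue
        q = 1.0
        for param in parts[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                q = float(value.strip())
        entries.append((name, q, position))
    entries.sort(key=lambda entry: (-entry[1], entry[2]))
    return [(name, q) for name, q, _ in entries]


print(parse_accept("iso-8859-1,utf-8;q=0.75,*;q=0.5"))
print(parse_accept("en-us,en,hr;q=0.5"))
```

For the slide's Accept-Charset line this gives iso-8859-1 first, then utf-8, then the wildcard, matching the reading given above.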

Character Encoding Negotiation

• Most browsers let you set your language preferences and priorities
• Encoding capabilities are not settable (since they are software dependent).
  – Microsoft IE doesn’t send ACCEPT-CHARSET.
  – (U.S.) NS 7: ISO-8859-1, UTF-8;q=0.66, *;q=0.66
  – Opera 6.0 sends:
    Windows-1252;q=1.0, UTF-8;q=1.0, UTF-16;q=1.0, iso-8859-1;q=0.6, *;q=0.1

Character Encoding Negotiation

[Diagram: Browser → Server: Get URL, Accept Charset x,y,z (server holds HTML pages); Server → Browser: Response, CHARSET=x]

HTTP/1.1 200 OK
Content-Type: text/html; charset=iso-8859-1
--- Blank Line
HTML document
...

The server returns a document.
The encoding is declared in the RESPONSE header.
(Web administrators or content authors need to inform the server about document encodings.)


Character Encoding Negotiation

[Diagram: Browser → Server: Get URL, Accept Charset x,y,z (server holds HTML pages); Server → Browser: Response, CHARSET=x; Browser: O/S Charset=z]

The browser adapts the document for operating system display.

Character Encoding Negotiation

[Diagram: as above, plus Browser → Server: Form Data Set]

The browser also accepts user data in HTML <FORM> and can send it to the server as a Form Data Set.

A Form Data Set is a series of control name/current value pairs, for “successful” controls.

There are 3 ways browsers submit form data sets.

Form Data Set

<form name="input" method="GET"
action="http://www.xencraft.com/cgitest"
enctype="application/x-www-form-urlencoded">
Name: <input type="text" name="Name" size="10" />
<input type="radio" name="sex" value="m"> Male
<input type="radio" name="sex" value="f"> Female
<input type="submit" value="Send">
</form>

Form Data Set = Control Name/Current Value Pairs
Name/Tex
sex/m

Form Data Set Submission

3 Submission Methods
• GET + HTTP URI
  Form Data Set appended to URI + “?”, encoded as
  – “application/x-www-form-urlencoded”
• POST + HTTP URI
  Form Data Set sent in body, encoded as either
  1) “application/x-www-form-urlencoded” or
  2) “multipart/form-data” (MIME, RFC 2045)

Form Data Set- GET Method Submission

<form name="input" method="GET"
action="http://www.xencraft.com/cgitest"
enctype="application/x-www-form-urlencoded">
Name: <input type="text" name="Name" size="10" />
<input type="radio" name="sex" value="m"> Male
<input type="radio" name="sex" value="f"> Female
<input type="submit" value="Send">
</form>

This simple form will submit an HTTP GET with:
http://www.xencraft.com/cgitest?Name=Tex&sex=m

Form Data Set Encoding

Application/x-www-form-urlencoded
Name=Value&Name2=Value2&Name3=Value3
  – Control names/current values listed in the order they appear in the document.
  – Names separated from values by =
  – Name/value pairs separated by &
  – Spaces replaced by +
  – Line breaks represented as CR LF: %0D%0A
  – Non-alphanumeric and non-ASCII characters and ‘+’, ‘&’, ‘=’, are replaced by %HH
  – Browsers map current encoding byte values to %HH
  – If the server doesn’t know browser’s character encoding, it may decode form data incorrectly.


Form Data Set Encoding

Application/x-www-form-urlencoded

Example comparing two character encodings:

Charset=ISO-8859-1
Name=Fran%E7ois+Ren%E9+Strau%DF

Charset=UTF-8
Name=Fran%C3%A7ois+Ren%C3%A9+Strau%C3%9F

Character Encoding Negotiation

[Diagram: Browser → Server: Get URL, Accept Charset x,y,z (server holds HTML pages); Server → Browser: Response, CHARSET=x; Browser (O/S Charset=z) → Server: Submit Form (GET or POST), CHARSET=x, encoding=x-www-form-urlencoded]

Modern browsers send x-www-form-urlencoded data to the server in the CHARSET that was determined to be that of the *form*, however that determination was made (HTTP, <meta>, default, user override).
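The two encoded forms of the name can be reproduced with Python's urllib.parse, which also shows what a server sees when it assumes the wrong charset. The variable names are chosen for this sketch only.

```python
from urllib.parse import parse_qs, urlencode

fields = {"Name": "François René Strauß"}

# The same form data set, percent-encoded from two different byte encodings.
latin1_query = urlencode(fields, encoding="iso-8859-1")
utf8_query = urlencode(fields, encoding="utf-8")
print(latin1_query)
print(utf8_query)

# A server that knows the browser's charset recovers the text...
print(parse_qs(utf8_query, encoding="utf-8")["Name"][0])

# ...while one that guesses wrong silently decodes it incorrectly.
misread = parse_qs(utf8_query, encoding="iso-8859-1")["Name"][0]
print(misread)
```

The %HH sequences carry bytes, not characters, so the query string alone does not tell the server which of the two decodings is the intended one.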

Character Encoding Negotiation

Returning data in the encoding received
• Generally works in principle
• Document “charset” must be correctly identified (and has often been wrong)
• Fails with multiple encodings handled by a single CGI
• Fails with transcoding proxies (not allowed to change URIs).
• Recommend using UTF-8 in both directions

Character Encoding Negotiation

[Diagram: Browser → Server: Get URL, Accept Charset x,y,z (server holds HTML pages); Server → Browser: Response, CHARSET=x; Browser (O/S Charset=z) → Server: Submit Form (POST), CHARSET=x, multipart/form-data (MIME)]

Each control name/current value pair is a separate part. Each part can be a different charset or content-type encoding.
Supports file uploading (RFC1867).

Character Encoding Negotiation

Multipart/form-data
• More efficient than x-www-form-urlencoded for non-ASCII data, binary data, and files
• Does not have the length limit that browsers impose on URLs (can be as low as 250 for some devices)
• Is now well supported
• Recommended for POST of all form data

Character Encoding Negotiation

Other solutions to identifying encodings:
• XFORMS fixes the failure cases:
  http://www.w3.org/MarkUp/Forms/
  http://www.w3.org/TR/xforms/ (Rec. Oct. 2003)
  – Not generally supported

Used with older browsers:
• Hidden fields containing encoding name or carefully chosen text (tracks transcodings). CGI script performs analysis.
  – e.g. Microsoft’s _CHARSET_


Web I18n Part 1- Character Processing

Character Encodings
Character Encoding Negotiation
Reference Processing Model
Character Escaping
Unicode in Markup
Normalization
Identifiers

Reference Processing Model

• Different encoding schemes require different decoding/parsing/processing methods
  – Single, and Multi-byte character sets (e.g. EUC)
  – Character encoding-switching schemes (ISO 2022)
  – Forward combining (accent-base letter)
  – Backward combining (base letter-accent)
  – Logical ordering/Visual ordering
• Variety bothers implementers and spec writers
• Adopting a single universal encoding obsoletes most of the existing data
• Instead, use a character abstraction

Reference Processing Model

• Logically, characters are Unicode characters
  – Specifications are in terms of Unicode characters
  – Implementations do NOT have to use Unicode, only behave as if they did
• Benefits
  – Removes ambiguity, simplifies specifications
  – Allows flexibility for common local encodings
  – Backward compatible for older HTML browsers
  – Supports internationalization (large character set)
  – Removes dependencies/orientation on byte values

Reference Processing Model

[Diagram: any encoding in/out on the wire; HTML, CSS and XML sit on an abstraction layer using internal Unicode; any encoding for internal implementation]

Reference Processing Model

• Examples using Reference Processing Model
  – HTML 4.0 declares Unicode as its SGML Document Character Set
  – CSS “sequence of characters from UCS”
  – XML “A character is an atomic unit of text as specified by ISO/IEC 10646”
• Any encoding can be used internally, but Unicode often makes the most sense.
• XML requires parsers to accept UTF-8 and UTF-16, making Unicode best internal choice
• Some Recommendations require Unicode
  – e.g. DOM requires UTF-16

Web I18n Part 1- Character Processing

Character Encodings
Character Encoding Negotiation
Reference Processing Model
Character Escaping
Unicode in Markup
Normalization
Identifiers


Character Escaping

Mechanisms to represent characters
• Numeric Character References (NCRs)
  – HTML and XML
    Hexadecimal: &#xhhhhhh;
    Decimal: &#dddd;
  – CSS2
    “\hh ” (note terminating space), \hhhhhh
• Character Entity References (HTML only)
  &aring; &Aring; (note case-sensitivity)

Character Escaping

• Useful for:
  – syntax-significant characters
    e.g. &lt; (<), &gt; (>), &amp; (&), &quot; (")
  – characters outside current encoding
  – eliminating visual or other ambiguity
    &#x00AD; (soft-hyphen),
    &#x002D; (hyphen-minus)
    &#x0020; (space)
    &#x00A0; (no-break space)

Character Escaping

• Relies on Reference Processing Model
  – Always references Unicode scalar value
    • Same value regardless of encoding
    • Same value for UTF-8, UTF-16, UTF-32
    • One value for supplementary characters, not two
      E.g. &#x12345; not &#xD808;&#xDF45;
  – Simplifies transcoding (no parsing or conversion)
  – Allows any Unicode character in any document (if it is legal in the language of the document)

Character Escaping

• Don’t use Windows 1252 code points instead of Unicode, for values 128-160.
  • e.g. Euro is &#8364; or &euro; not &#128;
  • www.i18nguy.com/markup/ncrs.html
• Don’t simulate characters with special fonts (e.g. Symbol), or you can get erroneous:
  • Display, depending on font availability
  • Font fallbacks
  • Searches by Search engines
  • Behavior from Style sheets
  • Database contents
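Numeric character references can be produced and expanded with the Python standard library, as sketched below. The helper name to_ascii_with_ncrs is made up for this example; the error handler and the html functions are standard.

```python
import html


def to_ascii_with_ncrs(text):
    """Replace every non-ASCII character by a decimal NCR.

    The reference is always the Unicode scalar value, whatever encoding the
    text originally came from.
    """
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(to_ascii_with_ncrs("Prix: 10 €"))      # euro becomes &#8364;
print(to_ascii_with_ncrs("\U00012345"))      # one NCR, not a surrogate pair

# Expansion goes the other way, for numeric and named references alike.
print(html.unescape("&#x12345;") == "\U00012345")
print(html.unescape("&euro;"))

# Syntax-significant characters are a separate case from non-ASCII ones.
print(html.escape("<a & b>"))
```

Because the reference names a scalar value, the escaped text survives transcoding between encodings unchanged.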

Selecting A Character Encoding

• Choose an encoding that minimizes the need to escape characters.
  – Unicode is always a candidate.
  – Unicode is supported by all but the oldest browsers.
  – Is the largest character set, and can be expanded.
  – Therefore it is often the best choice both for minimizing escapes and anticipating future character requirements.
    • e.g. New currency symbols

Web I18n Part 1- Character Processing

Character Encodings
Character Encoding Negotiation
Reference Processing Model
Character Escaping
Unicode in Markup
Normalization
Identifiers


Unicode Vs. Markup

• 96,000+ characters as of Unicode 4.0
  – Should we use them all?
  – Are there any we shouldn’t use?
  – Do Unicode’s capabilities, needed for plain text, interfere with markup?
• Markup can do some things better than character codes. Not all Unicode characters are needed.

Unicode Vs. Markup

Potential problem areas
• Redundancies impact searching
  – “Å” A-ring, “A+˚” A+ring, “Å” Angstrom
• Formatting characters vs. Markup
  – E.g. Bidi controls, interlinear annotation characters
• Characters with style vs. Markup
  – E.g. Superscript, subscript
• Object Replacement Character vs. Markup
  – Better to use markup to include an image

Unicode Vs. Markup

Solution types
• Restrict characters so they cannot be used
• Replace redundancies (normalization)
• Replace with Markup
  – Extensible
  – presentation can be separate from content

Joint W3C and Unicode recommendations in:
“Unicode in XML and other Markup Languages”
http://www.w3.org/TR/unicode-xml/
http://www.unicode.org/unicode/reports/tr20/

Web I18n Part 1- Character Processing

Character Encodings
Character Encoding Negotiation
Reference Processing Model
Character Escaping
Unicode in Markup
Normalization
Identifiers

String Indexing and Normalization

• Representing data in more than 1 way leads to errors
• E.g. The Mars Climate Orbiter mission was disastrous. Information expected to be metric was sent in English units
• Solution- Adopt a standard representation- Normalize

String Indexing

• Which units should be used for counting?

Units        Example                          Count
Graphemes    𣎴  ≠  A+˚                        3
Characters   U+233B4 U+2260 U+0041 U+030A     4
Code units   D84C DFB4 2260 0041 030A         5
Bytes        D8 4C DF B4 22 60 00 41 03 0A    10
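The four counts can be checked in Python for the same example string. One assumption to flag: the standard library has no grapheme segmentation, so the grapheme figure below is a rough approximation that simply skips combining marks, not a full implementation of Unicode text segmentation. The function name count_units is invented for this sketch.

```python
import unicodedata

# The slide's example string: U+233B4, U+2260, U+0041, U+030A.
text = "\U000233B4\u2260A\u030A"


def count_units(s):
    """Count one string in the four units discussed above (UTF-16BE bytes)."""
    utf16 = s.encode("utf-16-be")
    return {
        # Rough approximation only: characters that are not combining marks.
        "graphemes_approx": sum(1 for ch in s if unicodedata.combining(ch) == 0),
        "characters": len(s),               # Python strings index by code point
        "utf16_code_units": len(utf16) // 2,
        "utf16be_bytes": len(utf16),
    }


print(count_units(text))
```

An index such as "position 3" therefore points at different places depending on the unit, which is why interfaces need to state which unit they count in.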


String Indexing

Character Model recommendations
• Character counting is recommended for most programming interfaces (e.g. XML Path)
• Code unit counting may be used for internal efficiency (e.g. DOM)
• Graphemes may be useful for user interaction, once a suitable definition exists
• Avoid creating API with single unit arguments
  e.g. “SS” = Uppercase(“ß”)

Early Uniform Normalization

Unicode characters can have more than 1 representation
• Canonical equivalence
  – Indistinguishable, fundamental equivalence
  – E.g. combining sequences, singletons
  – “Å” U+00C5 (A-ring pre-composed)
  – “A+˚” U+0041 + U+030A (A + combining ring above)
  – “Å” U+212B (Angstrom)
• Compatibility equivalence
  – E.g. Formatting differences, ligatures
  – “ｶ” U+FF76, “カ” U+30AB (KA half and full width)
  – “ﬁ” U+FB01 (ligature fi)
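Both kinds of equivalence can be observed with Python's unicodedata module. The wrapper names nfc and nfkc are just shorthand for this sketch.

```python
import unicodedata


def nfc(s):
    """Canonical composition (Normalization Form C)."""
    return unicodedata.normalize("NFC", s)


def nfkc(s):
    """Compatibility composition (Normalization Form KC)."""
    return unicodedata.normalize("NFKC", s)


# Canonical equivalents all become the pre-composed A-ring under NFC.
print(nfc("A\u030A") == "\u00C5")   # A + combining ring above
print(nfc("\u212B") == "\u00C5")    # Angstrom sign (a singleton)

# Compatibility equivalents are left alone by NFC...
print(nfc("\uFB01") == "\uFB01")    # ligature fi stays a ligature
print(nfc("\uFF76") == "\uFF76")    # halfwidth KA stays halfwidth

# ...and are only folded by the compatibility forms.
print(nfkc("\uFB01") == "fi")
print(nfkc("\uFF76") == "\u30AB")
```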

Early Uniform Normalization

• Unicode Consortium has defined canonical and compatibility decomposition formats and 4 different sets of rules for normalization:
  “Unicode Normalization Forms”
  http://www.unicode.org/unicode/reports/tr15/
• The W3C Character Model recommends Normalization Form C (NFC)
  – Brings canonical equivalences to composed form
  – Leaves compatibility forms as distinct
  – Most legacy text is composed, and is unchanged

Early Uniform Normalization

When to normalize?
• Late Normalization burdens receivers to have “smart” compare functions
• Early normalization burdens producers to create normalized text

Early Uniform Normalization

Basic principles
• Without agreement on text representation, binary matching and string indexing fail
• Consequences are significant
  – E.g. Comparison of contracts, security
• Corrected implementations are complex
• Sufficient resources may not be available on very small web components

Early Uniform Normalization

Basic principles (continued)
• Existing receivers do not check normalization
• Most existing text is composed
• There are many more receivers than producers
• Encrypted strings require normalization first
• Often producers have information about the strings they create, simplifying normalization.

Conclusion: Early Normalization is lowest total cost


Early Uniform Normalization

Text on the web SHOULD be Fully Normalized.
This is text that is either:
1. Unicode text in Normalization Form NFC, and
2. Does not contain character escapes or includes that upon expansion would undo point 1, and
3. Does not begin with a composing character.
or:
1. Legacy encoded text, which transcoded to Unicode satisfies the above.

Early Uniform Normalization

• Examples of Fully Normalized Text
  “suçon”, “su&#xE7;on”,
  “sub¸on”, “sub&#x0327;on”
  Note- Unicode does not have a composed b-cedilla.
• Examples that are not Fully Normalized
  “suc¸on”, “suc&#x0327;on”
  Reason: should use composed character “ç”
  “¸on”, “&#x0327;on”
  Reason: should not begin with combining character
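The three conditions can be approximated in Python as below. This is a simplified sketch, not the Character Model's full definition: it only expands HTML/XML character references (not other kinds of includes), and it treats "composing character" as "character with a non-zero canonical combining class", which is narrower than the official term. The function name is invented for the example.

```python
import html
import unicodedata


def is_fully_normalized(text):
    """Approximate check of the three conditions listed above."""
    expanded = html.unescape(text)           # expand character escapes
    if not expanded:
        return True
    # Condition 3 (approximation): must not start with a combining mark.
    if unicodedata.combining(expanded[0]) != 0:
        return False
    # Conditions 1 and 2: NFC both before and after expanding the escapes.
    return (unicodedata.normalize("NFC", text) == text
            and unicodedata.normalize("NFC", expanded) == expanded)


for sample in ["su&#xE7;on", "sub&#x0327;on", "suc&#x0327;on", "&#x0327;on"]:
    print(sample, is_fully_normalized(sample))
```

The b-cedilla case passes because NFC has nothing to compose it into, while the c-cedilla case fails because the escape expands to a sequence that NFC would replace with “ç”.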

Web I18n Part 1- Character Processing

Character Encodings
Character Encoding Negotiation
Reference Processing Model
Character Escaping
Unicode in Markup
Normalization
Identifiers

Identifiers in Markup Languages

Identifiers are element and attribute names, CSS selectors, and properties, etc.
• HTML- restricted to a subset of ASCII
  (-_:.A-Za-z0-9), case-insensitive
• CSS2- few restrictions
  (-A-Za-z0-9) + all chars >U+00A0, case insensitive
• XML 1.0- subset of Unicode 2.0, case-sensitive
• XML 1.1- Unicode 4.0, case-sensitive
• XHTML- same set as HTML
  lowercase, case-sensitive, (-_:.A-Za-z0-9)

Identifiers in Markup Languages

XML Naming conventions suggested in Appendix I in XML 1.1, www.w3.org/TR/xml11/
  – Exclude all control characters, enclosing nonspacing marks, non-decimal numbers, private-use characters, punctuation characters (with some exceptions), symbol characters, unassigned codepoints, and white space characters.
  – 1st character: Unicode General Category of Ll, Lu, Lo, Lm, Lt, or Nl, or else be '_' U+005F
  – Remainder: Unicode General Category of Ll, Lu, Lo, Lm, Lt, Mc, Mn, Nl, Nd, Pc, Cf, or be one of: '-' '.' ':' or '·' (U+00B7 middle dot).

Identifiers in Markup Languages

XML Naming conventions suggested in Appendix I in XML 1.1, www.w3.org/TR/xml11/
Don’t use:
  – Ideographs with canonical decompositions
  – Characters with compatibility decompositions
  – Combining characters meant for use with symbols only
  – Interlinear annotation characters (U+FFF9-FFFB)
  – Variation selector characters
  – Names which are nonsensical, unpronounceable, hard to read, or easily confusable with other names
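The general-category part of these suggestions can be sketched in Python with unicodedata.category. The function and constant names are hypothetical, and the check deliberately covers only the category rules for the first and remaining characters, not the separate "Don't use" list.

```python
import unicodedata

# General categories allowed for the first character and for the remainder,
# as listed above. Names of these constants are chosen for this sketch.
FIRST_CATEGORIES = {"Ll", "Lu", "Lo", "Lm", "Lt", "Nl"}
REST_CATEGORIES = FIRST_CATEGORIES | {"Mc", "Mn", "Nd", "Pc", "Cf"}
EXTRA_REST_CHARS = {"-", ".", ":", "\u00B7"}   # last one is the middle dot


def follows_naming_suggestions(name):
    """Check only the general-category suggestions for an XML 1.1 name."""
    if not name:
        return False
    first, rest = name[0], name[1:]
    if first != "_" and unicodedata.category(first) not in FIRST_CATEGORIES:
        return False
    return all(
        ch in EXTRA_REST_CHARS or unicodedata.category(ch) in REST_CATEGORIES
        for ch in rest
    )


for candidate in ["título", "_id-1", "1st", "a b", "prix€"]:
    print(candidate, follows_naming_suggestions(candidate))
```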


Resource Identifiers: URI, IRI, IDNA

URIs encode bytes, not characters
  – Most ASCII bytes expressed as ASCII
  – Non-ASCII are %HH, which is ambiguous
    • Character encoding not taken into consideration
• IRI-Internationalized Resource Identifiers
  – Transcode to UTF-8, then encode as URI
• Adopters: XLink, XPointer, URN, XML, XML Schema
• IE, Firefox, Opera, Safari, and others
  – http://www.w3.org/International/O-URL-and-ident.html

Domain Names: IDNA Architecture

[Diagram: the Application passes http://日本語.jp through three steps, then the Resolver sends http://xn--wgv71a119e.jp to the DNS Servers and Servers]
1. Convert to Unicode
2. Nameprep: Case fold, Mapping, NFKC, Removal
3. ACE (Punycode, profile of Bootstring): Convert to ASCII, Prepend “xn--”

RFC: IDNA 3454, 3490-3492, URI 3986, IRI 3987
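Both conversions can be tried with the Python standard library. The helper names are invented for this sketch, and one caveat applies: Python's built-in "idna" codec implements the IDNA RFCs cited above (3490-3492), not later revisions of IDNA.

```python
from urllib.parse import quote


def iri_path_to_uri(path):
    """IRI to URI for a path: encode as UTF-8, then percent-encode the bytes."""
    return quote(path, safe="/")


def host_to_ace(host):
    """Domain name to its ASCII-compatible encoding (Nameprep + Punycode)."""
    return host.encode("idna").decode("ascii")


print(iri_path_to_uri("/日本語"))
print(host_to_ace("日本語.jp"))
```

The host name takes the Punycode route because DNS servers expect ASCII labels, while the path takes the UTF-8 plus %HH route defined for IRIs.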

Questions

Coffee Break

Web Internationalization Agenda

• Part 1 – Character Processing
• Coffee Break
• Part 2
– Layout and Typography
– Designing International Web Sites

Language Identification

Same mechanism for all:
• Identifiers defined by RFC 3066. Based on 2-letter and 3-letter language codes (ISO 639) with an optional 2-letter country code (ISO 3166), separated by the character '-' (not '_').
• Value is not case sensitive (even in XML).
• In markup: the language attribute is inherited by the children of the element where the attribute is defined.
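A minimal sketch of these rules follows. It handles only the simple language or language-country form described above; RFC 3066 also permits other subtags (registered and user-defined tags), which this pattern deliberately rejects. The names TAG_RE, parse_tag and same_tag are hypothetical.

```python
import re

# 2- or 3-letter language code, optionally followed by '-' and a
# 2-letter country code. An underscore separator is not accepted.
TAG_RE = re.compile(r"^([A-Za-z]{2,3})(?:-([A-Za-z]{2}))?$")


def parse_tag(tag: str):
    """Split a simple language tag into (language, country-or-None)."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a simple language tag: {tag!r}")
    language, country = match.groups()
    # Values are not case sensitive, so normalize before comparing.
    return language.lower(), (country.upper() if country else None)


def same_tag(a: str, b: str) -> bool:
    """Compare two tags without regard to case."""
    return parse_tag(a) == parse_tag(b)


print(parse_tag("EN-gb"))
print(same_tag("en-US", "EN-us"))
```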


Language Identification

RFC 3066 Rules:
• 3-letter codes should be used only for the languages that have no 2-letter code.
• Always use the Terminological form of the 3-letter codes, not the Bibliographical form.
• In addition, as much as possible, avoid user-defined codes (x-myCode)

Language Identification

Some Problems:
• RFC 3066 does not cover all needs.
– e.g. Latin-Amer. Spanish, Script distinctions, etc.
– Now being addressed through registrations
• There is no clear distinction of the identifiers of a "language" and a "locale".
– (See past IUC locales talks for more information.)
• Work in progress to address those issues: IETF, ISO TC37, SIL, W3C, etc.
– RFC 3066bis

Language Identification

• RFC 3066bis
– replaces RFC 3066. New number to be assigned soon
– language-country becomes language-script-country
– Registry expanded to include all valid entries
– New matching rules proposed in a separate RFC
– See Addison Phillips' talk for more info

Language Identification

• HTTP: Content-Language header
• HTML: LANG attribute (e.g. in <html>)
• XML: xml:lang attribute
• XHTML 1.0: Both lang and xml:lang
    <p xml:lang="la" lang="la">Verba.</p>
• XHTML 1.1: xml:lang attribute

Language Identification

The lang() function in XPath:
• True if the selected node has xml:lang set to the given language code (the attribute may be inherited from an ancestor).
• Match is done from the start of the value, on a '-' boundary: 'en' matches 'en' and 'en-us'.
• Match is case insensitive: 'en' matches 'EN', 'En-us', etc.
→ Example: Input, Languages.xsl, Output.

Language Identification – Input file

<?xml version="1.0" encoding="iso-8859-1" ?>
<?xml-stylesheet type="text/xsl" href="Languages.xsl"?>
<MyData>
 <Msg id="100">
  <Text xml:lang="en">Message 100 in English.</Text></Msg>
 <Msg id="200">
  <Text xml:lang="en-us">Message 200 <span xml:lang="fr">
  [insertion in French]</span> in American
  English.</Text>
  <Text xml:lang="fr-CA">Message 200 en Québecquois.</Text>
 </Msg>
 <Msg id="300">
  <Text xml:lang="fr">Message 300 en français.</Text></Msg>
 <Msg id="400">
  <Text xml:lang="EN-GB">Message 400 in British
  English.</Text> </Msg>
</MyData>
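The lang() matching rule can be restated as a small function and applied to the xml:lang values of the input file. This is an illustrative sketch: lang_matches is a hypothetical helper, not an XPath API, and the list below simply copies the message ids and language values from the input file.

```python
def lang_matches(requested: str, xml_lang) -> bool:
    """Mimic XPath lang(): case-insensitive match on the whole value,
    or on a prefix that is followed by '-'."""
    if xml_lang is None:
        return False
    want = requested.lower()
    have = xml_lang.lower()
    return have == want or have.startswith(want + "-")


# (message id, xml:lang value) pairs taken from the input file.
messages = [
    ("100", "en"),
    ("200", "en-us"),
    ("200", "fr-CA"),
    ("300", "fr"),
    ("400", "EN-GB"),
]

selected = [msg_id for msg_id, lang in messages if lang_matches("en", lang)]
print(selected)  # ['100', '200', '400']
```

The prefix must end at a '-' boundary, so a request for 'en' would not match a hypothetical value such as 'eng'.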


Language Identification – Style-sheet

<?xml version="1.0" ?>
<xsl:stylesheet
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 version="1.0">

 <xsl:param name="Language">en</xsl:param>

 <xsl:template match="text()"/>

 <xsl:template match="Text">
  <xsl:if test="lang($Language)">
   <p><xsl:value-of select="."/>
   (<xsl:value-of select="@xml:lang"/>)</p>
  </xsl:if>
 </xsl:template>

</xsl:stylesheet>

Language Identification – IE Output

Message 100 in English. (en)
Message 200 [insertion in French] in American English. (en-us)
Message 400 in British English. (EN-GB)

Language Identification – FF Output

Message 100 in English. (en)Message 200 [insertion in French] in American English. (en-us)Message 400 in British English. (EN-GB)

Language Identification – Opera Output

Message 100 in English. Message 200 [insertion in French] in American English. Message 200 en Québecquois. Message 300 en français. Message 400 in British English.

Language Identification – CSS

There are two methods to refer to the language attribute in CSS:
• The lang pseudo-class.
    *:lang(zh) { font-family:SimSun }
• The attribute selector.
    *[lang|=fr] { font-weight:bold }
• Both use the same matching mechanism as the lang() function in XPath. Note that :lang() also applies to elements that inherit their language, while the attribute selector matches only elements that carry the attribute themselves.
→ Example: LanguagesCSS.htm

Language Identification – CSS

<html lang="en">
<head>
<style>
*:lang(en-us) { font-weight: bold; }
*[lang|=fr] { font-style: italic; color: red; }
</style>
<title>Test Language and CSS</title>
</head>
<body>
<p>Text in English.</p>
<p lang="en-us">Text in American English.</p>
<p lang="en">Text in generic English.</p>
<p lang="fr-ca">Texte en québecquois.</p>
<p lang="fr">Texte en français.</p>
<p lang="en-gb">Text in British English.</p>
</body>
</html>


Language Identification – FF Output

Text in English.
Text in American English.
Text in generic English.
Texte en québecquois.
Texte en français.
Text in British English.

Fonts

• In HTML: Use styles rather than <FONT> (easier to change for each language, if needed).
CSS Fonts:
• Provide fallback fonts.
• unicode-range specifies characters used.
  unicode-range removed from CSS 2.1

@font-face {font-family: "Texfnt";
  src:url(ftp://ftp.i18nguy.com/Texfnt.ttf);
  unicode-range: U+??, U+900-97f; }
body { font-family: Texfnt, Tahoma, Arial,
  sans-serif; }

Quotes – HTML

• The <q> element for in-line quotations (auto-quotation marks expected).
• The <blockquote> element for paragraph-type quotations (indented, and no auto-quotation marks expected).
→ Example: Input, Output: Quotes.htm.

Quotes – Using CSS

• CSS allows control of the type of quote to use according to the language.
    *[lang|=fr] { quotes:'\ab\a0' '\a0\bb' }
    qo:before { content:open-quote }
    qo:after { content:close-quote }
• Examples
  → HTML: Input, CSS, Output: QuotesWithCSS.htm.
  → XML: Input, CSS File, Output: Quotes.xml.

Quotes – HTML Input

...
<body>
<p lang="en">English text with <q>English quoted
text</q>.</p>
<p lang="fr">Text en Français avec <q>English quoted
text</q>.</p>
<p lang="fr">Text en Français avec <q lang="en">English
quoted text containing a <q>quote</q> itself</q>.</p>
<p lang="fi"><q>Quotes</q> in Finnish.</p>
<p lang="pl"><q>Quotes</q> in Polish.</p>
<p lang="ja"><q>Quotes</q> in Japanese.</p>
<p lang="de"><q>Quotes</q> in German.</p>
<p lang="nl"><q>Quotes</q> in Dutch.</p>
<blockquote lang="fr">A paragraph using
blockquote.</blockquote>
</body>
</html>

Some Unicode Characters

U+2018  ‘   Left Single Quotation Mark
U+2019  ’   Right Single Quotation Mark
U+201C  “   Left Double Quotation Mark
U+201D  ”   Right Double Quotation Mark
U+201E  „   Double Low 9 Quotation Mark
U+201F  ‟   Double High Reversed 9 Q. M.
U+300C  「  Left Corner Bracket
U+300D  」  Right Corner Bracket
U+00AB  «   Left Pointing Double Angle Q. M.
U+00BB  »   Right Pointing Double Angle Q. M.
U+00A0      No Break Space


Quotes – CSS Style-sheet

q:before { content: open-quote; }
q:after { content: close-quote; }
blockquote:before { content: open-quote; }
blockquote:after { content: close-quote; }
[lang|='en'] > * { /* English */ quotes: "\201C" "\201D" }
[lang|='fr'] > * { /*guillemets*/ quotes: "\AB\A0" "\A0\BB" }
[lang|='fi'] > * {/*same direction*/ quotes: "\201D" "\201D" }
[lang|='de'] > * { /* German */ quotes: "\201E" "\201C" }
[lang|='ja'] > * { /* Japanese */ quotes: "\300C" "\300D" }
[lang|='nl'] > * { /* Dutch */ quotes: "\2018" "\2019" }
[lang|='pl'] > * { /* Polish */ quotes: "\201E" "\201D" }

Quotes – Opera Output

English text with “English quoted text”.
Text en Français avec « English quoted text ».
Text en Français avec « English quoted text containing a ”quote" itself ».
”Quotes” in Finnish.
„Quotes” in Polish.
「Quotes」 in Japanese.
„Quotes“ in German.
‘Quotes’ in Dutch.
“A paragraph using blockquote.”
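When quotation marks have to be generated outside the browser (for example in server-side text), the same per-language table can be kept in code. The sketch below mirrors the quote pairs of the style sheet above; the names QUOTES and quote_text are hypothetical, and the fallback to English for unlisted languages is an assumption.

```python
# Opening and closing marks per primary language code,
# copied from the style sheet above.
QUOTES = {
    "en": ("\u201C", "\u201D"),
    "fr": ("\u00AB\u00A0", "\u00A0\u00BB"),  # guillemets with no-break spaces
    "fi": ("\u201D", "\u201D"),              # same mark on both sides
    "de": ("\u201E", "\u201C"),
    "ja": ("\u300C", "\u300D"),
    "nl": ("\u2018", "\u2019"),
    "pl": ("\u201E", "\u201D"),
}


def quote_text(text: str, lang: str, default: str = "en") -> str:
    """Wrap `text` in the quotation marks of the given language tag."""
    primary = lang.lower().split("-")[0]  # 'fr-CA' falls back to 'fr'
    open_mark, close_mark = QUOTES.get(primary, QUOTES[default])
    return f"{open_mark}{text}{close_mark}"


for tag in ("en", "fr-CA", "de", "ja"):
    print(tag, quote_text("Quotes", tag))
```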

Casing

• CSS2 provides the property text-transform with 5 values: uppercase, lowercase, capitalize, none, and inherit.
• CSS2 allows user agents to ignore it for non Latin-1 characters and for unusual case conversion (making it useless from an i18n viewpoint). CSS3 (working draft) forces Unicode casing conformance. This property is deprecated in XSL 1.0.
→ Example: Source, Output: TextTransform.htm.

Casing TextTransform.htm

Style:

<style>
.upper { text-transform: uppercase}
.lower { text-transform: lowercase}
.cap { text-transform: capitalize}
</style>

Casing TextTransform.htm

<p>Original = This text should be all uppercased.<br>
Transformed = <span class="upper">This text should be all
uppercased. </span></p>
<p>Original = THIS TEXT SHOULD BE ALL LOWERCASED.<br>
Transformed = <span class="lower">THIS TEXT SHOULD BE ALL
LOWERCASED. </span></p>
<p>Original 1 = tHIS tEXT sHOULD bE cAPITALIZED.<br>
Transformed = <span class="cap">tHIS tEXT sHOULD bE
cAPITALIZED. </span><br>
Original 2 = this text should be capitalized.<br>
Transformed = <span class="cap">this text should be
capitalized. </span></p>
<p lang="de">[de] Original = ß (sharp-s), ö (o-diaeresis)<br>
Transformed = <span class="upper">ß (sharp-s), ö (o-diaeresis) </span></p>
<p lang="tr">[tr] Original = i (i-with-dot)<br>
Transformed = <span class="upper">i (i-with-dot)</span></p>

Casing – IE and Opera Output

Original = This text should be all uppercased.
Transformed = THIS TEXT SHOULD BE ALL UPPERCASED.

Original = THIS TEXT SHOULD BE ALL LOWERCASED.
Transformed = this text should be all lowercased.

Original 1 = tHIS tEXT sHOULD bE cAPITALIZED.
Transformed = THIS TEXT SHOULD BE CAPITALIZED.
Original 2 = this text should be capitalized.
Transformed = This Text Should Be Capitalized.

[de] Original = ß (sharp-s), ö (o-diaeresis)
Transformed = ß (SHARP-S), Ö (O-DIAERESIS)

[tr] Original = i (i-with-dot)
Transformed = I (I-WITH-DOT)


Casing – Firefox Output

Original = This text should be all uppercased.
Transformed = THIS TEXT SHOULD BE ALL UPPERCASED.

Original = THIS TEXT SHOULD BE ALL LOWERCASED.
Transformed = this text should be all lowercased.

Original 1 = tHIS tEXT sHOULD bE cAPITALIZED.
Transformed = THIS TEXT SHOULD BE CAPITALIZED.
Original 2 = this text should be capitalized.
Transformed = This Text Should Be Capitalized.

[de] Original = ß (sharp-s), ö (o-diaeresis)
Transformed = SS (SHARP-S), Ö (O-DIAERESIS)

[tr] Original = i (i-with-dot)
Transformed = I (I-WITH-DOT)

Casing – Clipboard Copy (unchanged)

Original = This text should be all uppercased.
Transformed = This text should be all uppercased.

Original = THIS TEXT SHOULD BE ALL LOWERCASED.
Transformed = THIS TEXT SHOULD BE ALL LOWERCASED.

Original 1 = tHIS tEXT sHOULD bE cAPITALIZED.
Transformed = tHIS tEXT sHOULD bE cAPITALIZED.
Original 2 = this text should be capitalized.
Transformed = this text should be capitalized.

[de] Original = ß (sharp-s), ö (o-diaeresis)
Transformed = ß (sharp-s), ö (o-diaeresis)

[tr] Original = i (i-with-dot)
Transformed = i (i-with-dot)
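The German and Turkish rows are the interesting ones: full Unicode uppercasing turns ß into SS, and Turkish expects the dotted capital İ (U+0130) for i, which none of the outputs shown produce. A minimal sketch of the difference follows; upper_for_language is a hypothetical helper, and it covers only the Turkish tailoring, not every language-specific casing rule.

```python
def upper_for_language(text: str, lang: str) -> str:
    """Uppercase `text`, applying the Turkish dotted/dotless i tailoring."""
    primary = lang.lower().split("-")[0]
    if primary == "tr":
        # Turkish: i -> İ (U+0130) and ı (U+0131) -> I, handled before
        # the language-neutral mapping below.
        text = text.replace("i", "\u0130").replace("\u0131", "I")
    # str.upper() applies the default Unicode mapping, including ß -> SS.
    return text.upper()


print(upper_for_language("ß (sharp-s), ö (o-diaeresis)", "de"))
print(upper_for_language("i (i-with-dot)", "tr"))
print(upper_for_language("i (i-with-dot)", "en"))
```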

Numbered Lists

With CSS2
• CSS2 offers the list-style-type property to specify the type of numbers for lists. Supports only a limited set of pre-defined styles (e.g. has Armenian but not Thai).
→ Example NumberedLists.htm

Numbered Lists NumberedLists.htm

...<head>
<style>
.list_heb {list-style-type:hebrew}
.list_geo {list-style-type:georgian}
.list_arm {list-style-type:armenian}
.list_cjk {list-style-type:cjk-ideographic}
</style>
</head>
<body>...

Numbered Lists NumberedLists.htm

... <body>
<p>List numbered in Hebrew:</p>
<ol class="list_heb"><li>Item 1</li>...<li>Item 6</li>
</ol>
<p>List numbered in Georgian:</p>
<ol class="list_geo"><li>Item 1</li>...<li>Item 6</li>
</ol>
<p>List numbered in Armenian:</p>
<ol class="list_arm"><li>Item 1</li>...<li>Item 6</li>
</ol>
<p>List numbered in Han character
(<code>cjk-ideographic</code>):</p>
<ol class="list_cjk"><li>Item 1</li>...<li>Item 6</li>
</ol>
</body>
</html>

Numbered Lists – Firefox Output


Numbered Lists

With XSL
• XSL provides more flexibility as the format and the type of the numbers can be changed using <xsl:number/>.
→ Example: Input, ListNumbers.xsl, Output.

Number Formatting

The function format-number() in XSL allows the formatting of numbers based on a given pattern.
• Uses same patterns as Java 1.1 java.text.DecimalFormat patterns.
• Use <xsl:decimal-format/> to override the default symbols (i.e. decimal separator, grouping separator, etc.).
→ Example: Input, XSL File, Output.
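The idea behind overriding the default symbols can be illustrated outside XSL. The sketch below is not format-number() itself: it formats with a fixed pattern roughly equivalent to "#,##0.00" and then substitutes the separators, much as a decimal-format declaration would. The function name and its parameters are hypothetical.

```python
def format_number(value: float, decimal_sep: str = ".",
                  grouping_sep: str = ",", decimals: int = 2) -> str:
    """Format `value` with thousands grouping and custom separator symbols."""
    # Format with the default symbols first: ',' for grouping, '.' for decimals.
    text = f"{value:,.{decimals}f}"
    # Replace both symbols in a single pass so that they cannot clobber
    # each other when the two are simply swapped.
    swap = {",": grouping_sep, ".": decimal_sep}
    return "".join(swap.get(ch, ch) for ch in text)


print(format_number(1234567.891))                    # default symbols
print(format_number(1234567.891, ",", "."))          # e.g. German conventions
print(format_number(1234567.891, ",", "\u00A0"))     # e.g. French conventions
```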

Text Flow

Bi-directional Text in HTML
• The dir attribute:
– dir="ltr" (default), dir="rtl"
– Affects the default value of align.
– Inherited (use it in <html> to set the base for the whole document).
• The <bdo> element:
– Overrides implicit directional properties of content.
– Requires the dir attribute.
→ Example: BidiText.htm

Text Flow

Bi-directional Text for XML (CSS2)
• Use the direction and unicode-bidi properties. The unicode-bidi property specifies the behavior for inline-level elements (15 maximum levels of embedding).
• Based on Unicode bidi algorithm (UAX#9)
    para.bidi { direction:rtl;
      unicode-bidi:embed }
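When the base direction is not known in advance, a common heuristic is to take it from the first strongly directional character, using the bidi classes defined for UAX#9. The sketch below shows only that heuristic; it is not the full bidi algorithm, and base_direction is a hypothetical helper whose result could feed a dir attribute.

```python
import unicodedata


def base_direction(text: str, default: str = "ltr") -> str:
    """Guess the base direction from the first strong character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":            # strong left-to-right (e.g. Latin)
            return "ltr"
        if bidi_class in ("R", "AL"):    # strong right-to-left (Hebrew, Arabic)
            return "rtl"
    # Only digits, punctuation or spaces: no strong character was found.
    return default


hebrew_word = "\u05D7\u05D1\u05E8\u05EA"  # the first word of the bidi example
print(base_direction(hebrew_word + " Pepper Creek LLC"))
print(base_direction("Pepper Creek LLC " + hebrew_word))
print(base_direction("550"))
```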

Text Flow – Bidi Example Source (1/2)

<p style="direction:rtl; unicode-bidi:embed">
Using CSS:<br/>
חברת Pepper Creek LLC,
שנוסדה זה-עתה, מונה יותר מ-550 עובדים.</p>

<p dir="rtl">
Using dir="rtl":<br/>
חברת Pepper Creek LLC,
שנוסדה זה-עתה, מונה יותר מ-550 עובדים.</p>

Text Flow – Bidi Example Source (2/2)

<p dir="rtl">
<span dir="ltr">Using dir="ltr-span":</span> <br/>
חברת Pepper Creek LLC,
שנוסדה זה-עתה, מונה יותר מ-550 עובדים.</p>

<p dir="ltr">Using dir="ltr" (wrong):<br/>
חברת Pepper Creek LLC,
שנוסדה זה-עתה, מונה יותר מ-550 עובדים.</p>


Text Flow – Bidi Output

Text Flow

Vertical Text
• Use the writing-mode property (CSS3).
• For example, to display top-to-bottom, and right-to-left text use:
    div.vertical { writing-mode:tb-rl }
→ Example in HTML, and in SVG.

Text Flow – Vertical, HTML

<p style="writing-mode: rl-tb">
Example of horizontal text (rl-tb).</p>
<p style="writing-mode: tb-rl">
Example of vertical text (tb-rl).</p>
<p style="writing-mode: tb-rl">
Example of vertical text with
<span style="writing-mode: lr-tb">
horizontal</span> insert.</p>

Text Flow – Vertical, HTML Output

Text Flow – Vertical, SVG

<?xml version="1.0" ?>
<svg xmlns="http://www.w3.org/2000/svg"
 width="330" height="330"
 viewBox="0 0 330 330">
<g style="font-size:24;">
<text x="20" y="26" style="writing-mode: lr;">
Horizontal Text</text>
<text x="20" y="56" style="writing-mode: tb;">
Example of vertical text</text>
</g>
</svg>

Text Flow – Vertical, SVG in HTML

<html>
<body>
<p>
<object data="Vertical.svg"
 type="image/svg+xml"
 width="330" height="330" />
</p>
</body>
</html>


Text Flow – Vertical, SVG Output

Ruby Annotation

Annotation in smaller characters running above or below a base text.
• Used in Japanese for pronunciation of Kanji characters (Furigana).
• W3C Ruby Module: <ruby> element with <rb> for the base text, <rt> for the ruby text. <rbc> and <rtc> for complex annotations.
→ Example: Ruby.htm

Ruby Annotation – HTML (1/2)

<p>Simple Ruby test:</p>
<ruby>
 <rb>日本語</rb>
 <rt>にほんご</rt>
</ruby>
<p>Ruby with parenthesis text, used if ruby is not
implemented: </p>
<ruby>
 <rb>日本語</rb>
 <rp>[[</rp><rt>にほんご</rt><rp>]]</rp>
</ruby>

Ruby Annotation – HTML (2/2)

<p>Ruby complex:</p>
<ruby>
 <rbc>
  <rb>10</rb> <rb>31</rb> <rb>2002</rb>
 </rbc>
 <rtc>
  <rt>Month</rt> <rt>Day</rt> <rt>Year</rt>
 </rtc>
 <rtc>
  <rt rbspan="3">Expiration Date</rt>
 </rtc>
</ruby>

Ruby Annotation – IE Output

Ruby Annotation – Firefox Output


Combined Text

Runs of characters grouped together.
• Available in CSS3 with text-combine.
• Combination in blocks (Kumimoji)
• Combination in lines (Warichu)
    span.kumimoji { text-combine: letters }
    span.warichu { text-combine: lines }

Sorting

XSL offers the <xsl:sort/> element to collate lists of items.
• Use lang (not xml:lang) to specify the language to use for the sorting rules.
• Results depend on the implementation of the XSL engine.
→ Example: Sorting.xml sorted for English and Norwegian. (Sorting_EN.xsl and Sorting_NO.xsl).
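Why the language matters for sorting can be shown with a deliberately simplified collation key. The sketch below is an assumption-laden illustration, not what an XSL engine does: it ranks letters by their position in the Norwegian alphabet, where æ, ø and å come after z, and ignores accents, case levels and every other refinement of real collation. The sample words are made up.

```python
import unicodedata

# Norwegian alphabet order: the three extra letters sort after 'z'.
NORWEGIAN_ALPHABET = unicodedata.normalize("NFC", "abcdefghijklmnopqrstuvwxyzæøå")
RANK = {letter: index for index, letter in enumerate(NORWEGIAN_ALPHABET)}


def norwegian_key(word: str):
    """Sort key: the alphabet position of each letter, unknown characters last."""
    normalized = unicodedata.normalize("NFC", word.lower())
    return [RANK.get(ch, len(RANK)) for ch in normalized]


words = ["ål", "zebra", "ære", "apple", "øl"]
by_norwegian = sorted(words, key=norwegian_key)
print(by_norwegian)
```

A plain code-point sort of the same list would place "ål" before "ære", which is the wrong order for Norwegian readers.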

Sorting

Version 2.0 of XSL has new features for <xsl:sort>
http://www.w3.org/TR/xslt20/#dt-collation
• case-order attribute specifies whether to sort uppercase or lowercase first.
• collation attribute names an implementer-defined collation to use.
– if given, lang and case-order are ignored.

Web Internationalization Agenda

• Part 1 – Character Processing
• Coffee Break
• Part 2
– Layout and Typography
– Designing International Web Sites

Requirements

• Two sets of requirements, not always compatible:
– Business requirements, e.g. to have the Web site ranked high in search engines; to have a single look and feel across sites in different languages; etc.
– Localization requirements, e.g. to avoid changing links in localized pages; to have locale-specific content; etc.
• Solutions depend on technologies used (static Web site, client-side scripting, server-side scripting, databases, multiple addresses, etc).

Domain Names

Easier if each language has its own domain name: www.xyzcorp.fr, www.xyzcorp.de, etc.
→ One domain = One language.

Unfortunately:
– Most sites have only one address for many languages.
– Even 'country-specific' sites may have several languages: www.xyzcorp.ca
  → English, French, Inuktitut.


Directories and Files

One possible solution:
• Home page of the 'main' language is the entry point of the directory structure. (e.g. index.html)
• Language home pages are also at the root and have a language identifier in their name. (e.g. index_fr.html)
• Other pages have identical names across languages, but are in different language directories.

Directories and Files

\
+----- index.html
+----- index_fr.html
|
+----- en
|      +----- about.html
|      +----- products.html
|      +----- menubar.png
|
+----- fr
|      +----- about.html
|      +----- products.html
|      +----- menubar.png
|
+----- common
       +----- logo.png
       +----- background.jpg
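With this layout, the link to "the same page in another language" can be computed instead of hand-edited. The sketch below assumes the directory tree shown above, with English as the main language; MAIN_LANGUAGE, LANGUAGES, home_page and switch_language are hypothetical names.

```python
from pathlib import PurePosixPath

MAIN_LANGUAGE = "en"          # assumption: English is the 'main' language
LANGUAGES = {"en", "fr"}      # language directories present in the tree


def home_page(lang: str) -> str:
    """Home pages live at the root; only non-main ones carry a language suffix."""
    return "index.html" if lang == MAIN_LANGUAGE else f"index_{lang}.html"


def switch_language(path: str, new_lang: str) -> str:
    """Map a page such as 'en/about.html' to its counterpart in `new_lang`."""
    parts = PurePosixPath(path).parts
    if not parts or parts[0] not in LANGUAGES:
        raise ValueError(f"not inside a language directory: {path!r}")
    # Page names are identical across languages; only the directory changes.
    return str(PurePosixPath(new_lang, *parts[1:]))


print(home_page("fr"))
print(switch_language("en/about.html", "fr"))
```

Shared files under common/ are rejected on purpose, since they have no per-language counterpart.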

Directories and Files

• Allow search engines to retrieve meaningful information (but emphasis for the main language).
• Maximize the use of relative URLs (no link change, except to the home page). If scripting is available, you can have the links resolved at run-time.
• Allows room for locale-specific content if necessary.

Directories and Files

• Use cookies if you want to remember the preferred language of the user and redirect him/her to the relevant set of files.
• Use common directory for shared files.
• Use meaningful directory and file names.
• Avoid translating directory and file names.
• Treat the source language just like another language as much as possible.
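The cookie advice can be sketched on the server side as follows. The cookie name "lang", the supported languages and the default are all assumptions made for illustration; the redirect itself is left to whatever framework serves the site.

```python
from http.cookies import SimpleCookie

SUPPORTED = ("en", "fr")   # assumption: languages available on the site
DEFAULT = "en"             # assumption: language used when nothing is stored


def preferred_language(cookie_header) -> str:
    """Read the hypothetical 'lang' cookie from a Cookie request header."""
    cookie = SimpleCookie()
    cookie.load(cookie_header or "")
    morsel = cookie.get("lang")
    # Never trust the stored value blindly: fall back if it is not supported.
    if morsel is not None and morsel.value in SUPPORTED:
        return morsel.value
    return DEFAULT


print(preferred_language("lang=fr; theme=dark"))
print(preferred_language("theme=dark"))
```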

Language Selection

• List box of language names in native language
– Make sure characters display correctly (fonts)
– Graphics are always displayed correctly.
• Destination Choice
– The same page in the new language.
– The main page in the new language. (for country-specific sites, etc.)

Good Practices – IDs

IDs are VERY useful for re-use of translation, and for re-use of text across documents.
– in HTML IDs can be set for all elements containing text, except the <title> element.
– Make sure to provide an ID attribute for the translatable elements of your XML vocabularies, so it can be utilized for re-use, leveraging, etc.


Good Practices – Attributes

When creating new XML vocabularies: Avoid using attributes for storing translatable text.
– Impossible to add needed bidi tags in an attribute.
– Causes segmentation issues in many tools.
– Much more difficult to have metadata for attributes than for elements.
– You cannot set different languages for two attributes in the same element.
– More tricky to set unique IDs for attributes.

Good Practices – Embedded Data

Data that are not text content (e.g. scripts, SQL queries, etc.).
– Keep them outside of the document if possible (e.g. using include mechanisms).
– At least, make sure elements with such data are identified for the localizer (who might need to apply a process different than for the rest of the document content).
– Internationalize your scripts/queries/etc.

Good Practices – Use Style-sheets

• Separate the function of a term (a title, a link, an important term) from its display (bolded, underlined, in 12-point Courier, etc.)
– Type of display for the target language(s) may be different than for the source language.
– Forces author/developer to think about the structure of the document.
• Avoid <br/> -like elements when possible: Use styles to format, not tags.

Good Practices – CDATA Sections

Avoid CDATA sections if possible.
– Translation tools do not handle CDATA well.
– Keeping track of inline CDATA leads to meaningless inline codes in segments (and can affect leveraging).
– NCRs are not recognized in CDATA. This may cause problems if the document is converted to an encoding where some characters need to be written as NCRs.
By the way: CDATA does NOT preserve spaces.

Additional Resources

• W3C Internationalization Work Group
  http://www.w3.org/International
• Unicode in XML and other Markup Languages
  http://www.w3.org/TR/unicode-xml
• Character Model for the World Wide Web
  http://www.w3.org/TR/charmod
• Richard Ishida's paper on "Localisation Considerations in DTD Design"
  http://www.w3.org/People/Ishida/writing.html#dtd
• XML Internationalization FAQ
  http://www.opentag.com/xmli18nfaq.htm

Conclusion

Implementation of standards is still a little behind in practice.

But today, Web-related technologies are among the best ways to store, manipulate and represent data in different languages.

Any questions?
