Web I18N Tutorial

Web Internationalization – Standards and Practice
29th Internationalization and Unicode Conference
Tex Texin (XenCraft/Yahoo)
Yves Savourel (ENLASO Corporation)
Copyright © 2002-2006 Tex Texin and Yves Savourel.

Objectives
• Describe the standards that define the architecture & principles for I18N on the web
• Scope limited to markup languages
• Provide practical advice for working with international data on the web, including the design and implementation of multilingual web sites and localization considerations
• Be introductory level
Legend For This Presentation
Icons used to indicate current product support:
Internet Explorer 6, Firefox 1, Opera 8.5
Supported:
Partially supported:
Not supported:
Caution
Highlights a note for users or developers to be careful.

Web Internationalization Agenda
• Part 1 – Character Processing
• Coffee Break
• Part 2
– Layout and Typography
– Designing International Web Sites
Web I18n Part 1- Character Processing
Character Encodings
Character Encoding Negotiation
Reference Processing Model
Character Escaping
Unicode in Markup
Normalization
Identifiers

A Simple HTML Example Page

A Simple HTML Example Page
Here is how the same HTML looks in Japan

A Simple HTML Example Page
Here is how the same HTML looks in Japan
The browser has no information about the encoding of the web page. It uses a default value which, in this case, is very wrong and even confuses the markup (see beauté).
A Simple HTML Example Page
Here is how the same HTML looks in Japan
Some of the problems may not be obvious to the reader. Changing the euro symbol to a bullet might cause a significant financial error.

Character Encodings
Encoding disagreement is one problem for text.
We also consider the following problems and solutions.
(See: Character Model for the World Wide Web http://www.w3.org/TR/charmod/ )

Problem                  Solution
Encoding disagreement    Encoding negotiation
Encoding diversity       Reference Processing Model
Encoding limitations     Character escaping
Unicode vs. Markup       Markup preferred on the web
String Matching          Early Uniform Normalization
String Indexing          Character counting guidelines
Character Encodings
First though: What are character encodings?

ACR   𣎴  ≠  A + ˚
CCS   U+233B4  U+2260  U+0041  U+030A
CEF   D84C DFB4  2260  0041  030A
CES   D8 4C DF B4  22 60  00 41  03 0A
The Character Encoding Model – Unicode Tech. Report 17

ACR = Abstract Character Repertoire
The set of characters you want to be able to represent (aka Character Set).
• Characters can be alphabetic, punctuation, arithmetic, or specific to a discipline or vertical market (e.g. proofreading symbols § ¶, business symbols © ®)
• Characters may be composable. E.g. Å = A + ˚

CCS = Coded Character Set
CCS   U+233B4  U+2260  U+0041  U+030A
Maps each character to a non-negative unique number.
– Note this example uses hexadecimal numbers.
– The “U+” indicates use of Unicode’s numbering.
– The grapheme Å consists of two characters A + ˚
– Unicode calls these “Unicode Scalar Values”

CEF = Character Encoding Form
Map CCS to fixed width units (e.g. 32, 16, or 8-bit)
CCS          U+233B4  U+2260  U+0041  U+030A
16-bit CEF   D84C DFB4  2260  0041  030A
8-bit CEF    F0 A3 8E B4  E2 89 A0  41  CC 8A
Note the relationship between the CEFs is not so simple.

CES = Character Encoding Scheme
CES: Mapping the CEF(s) to serialization of bytes
CCS              U+233B4  U+2260  U+0041  U+030A
16-bit CEF       D84C DFB4  2260  0041  030A
CES (UTF-16BE)   D8 4C DF B4  22 60  00 41  03 0A
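
As an illustration of the CCS, CEF and CES layers above, here is a minimal Python sketch that reproduces the slide's example values. The helper names (code_points, utf16_code_units, byte_string) are made up for this example; only the standard str.encode call is relied on.

```python
# The slide's example string: U+233B4, U+2260, U+0041, U+030A
text = "\U000233B4\u2260A\u030A"

def code_points(s):
    """CCS level: one Unicode scalar value per character."""
    return [f"U+{ord(ch):04X}" for ch in s]

def utf16_code_units(s):
    """CEF level (16-bit): split the UTF-16 form into 16-bit code units."""
    raw = s.encode("utf-16-be")
    return [raw[i:i + 2].hex().upper() for i in range(0, len(raw), 2)]

def byte_string(s, encoding):
    """CES level: the serialized bytes for a named encoding scheme."""
    return " ".join(f"{b:02X}" for b in s.encode(encoding))

print(code_points(text))               # CCS
print(utf16_code_units(text))          # 16-bit CEF
print(byte_string(text, "utf-16-be"))  # CES: UTF-16BE
print(byte_string(text, "utf-8"))      # 8-bit form, serialized as UTF-8
```

Note how the supplementary character U+233B4 takes two 16-bit code units but four 8-bit units, which is why the relationship between the encoding forms is not a simple one.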
Character Encodings
• Many character sets exist and are in popular use
• Many encoding schemes, even for 1 character set
ISO 8859-1 ≈ IBM 850
ISO-2022-JP, Shift_JIS, EUC-JP (JIS X-0208-1997)
UTF-8 = UTF-16 = UTF-32
• Given just bytes, the character set and the encoding scheme can be indeterminate.
How can a browser know how to decode a web page?

Encoding Identification
• Given just bytes, encoding is indeterminate.
• How can an encoding be identified?
• There are 2 requirements:
– Agreement on names for encodings
– Mechanisms for labeling text with encoding
Character Encoding Names
IANA (Internet Assigned Numbers Authority)
– Maintains registry of official names for character sets (actually encodings) used on the internet and in MIME (mail)
• Registry Names
– ASCII, printable characters
– Case-insensitive
– Maximum length 40 characters
– Aliases (alternative names) are also registered
– The preferred name is indicated
www.iana.org/assignments/character-sets

Unregistered Encoding Names
• Conventions for Unregistered Character Encoding Names
– Name begins with “x-”
– Example: “x-Tex-Yves-encoding”
– Useful for private encodings or very new encodings
– Not useful on the web, except for private exchange

Markup and Encoding Names
• HTTP
• HTML
• XML
• CSS
• Links
– HTML <LINK>
– HTML <… HREF>
– XML <… HREF>

Character Encoding Names
• IANA Name and Alias Examples
– ISO_8859-1:1987 (ISO_8859-1, ISO-8859-1, latin1, L1, IBM819, CP819, csISOLatin1)
– Windows-1252, GB2312, BIG5, BIG5-HKSCS
– SHIFT_JIS, HP-Legal
– Extended_UNIX_Code_Packed_Format_for_Japanese
– Adobe-standard-encoding
– UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32
• Note- Registry contains many useless names
• Note- Preferred names indicated. Use them.
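
To make the alias idea concrete, the following sketch uses Python's own codec registry, which recognizes many (not all) IANA names and aliases. It is not the IANA registry itself, and the helper name same_encoding is an assumption for this example.

```python
import codecs

def same_encoding(name_a, name_b):
    """True if Python resolves both labels to the same codec.

    Lookup is case-insensitive, mirroring the registry rule above.
    Raises LookupError for a name Python does not know.
    """
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name

# Several aliases of ISO_8859-1:1987 from the list above
for alias in ("latin1", "L1", "IBM819", "CP819", "csISOLatin1"):
    print(alias, same_encoding(alias, "ISO-8859-1"))

# A private "x-" name is unknown to anyone who has not agreed on it
try:
    codecs.lookup("x-Tex-Yves-encoding")
except LookupError as err:
    print("unknown:", err)
```

The LookupError case shows why unregistered names are only useful for private exchange: the receiver has no way to resolve them.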
HTTP and Encoding Names
Mechanism for labeling HTTP with encoding
HTTP Response
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
--- Blank Line
document
...

HTML, XML & Encoding Names
HTML
<META HTTP-EQUIV="Content-type"
CONTENT="text/html; charset=UTF-8">
– HTML does not specify a default.
XML
<?xml version="1.0" encoding="UTF-8" ?>
– Alternative declaration: Begin with Byte Order Mark (U+FEFF), for UTF-16 or UTF-8
– Note UTF-16 MUST begin with a BOM
– The default encoding is UTF-8.
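
A receiver has to pull the charset label out of the Content-Type value, whether it arrives in the HTTP header or in the META element. One possible way to do that in Python is sketched below, using the standard library's MIME header parser; the function name charset_from_content_type is hypothetical.

```python
from email.message import Message

def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None.

    The result is lower-cased by the library, which is harmless
    because encoding names are case-insensitive.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()

print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))  # no label: None
```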

CSS2 and Encoding Name
CSS2
– Only used in the first line of external style sheets
@charset "UTF-8";
• New!
– CSS 2.1 added Unicode Byte Order Mark (BOM, U+FEFF) as an encoding indicator.
– Encoding is unspecified if BOM and @charset conflict.

LINKs and Encoding Name
Declaring the charset of a LINKed document
• HTML
<LINK title="Arabic text"
type="text/html" charset="ISO-8859-6"
rel="alternate" href="arabic.html">
<A href="http://www.unicode.org" charset="UTF-8">Unicode</A>
• XML
<?xml-stylesheet href="…" type="…"
charset="UTF-16"?>
Notes: Declaring Encoding Names
• Charset on links can be incorrect if the document’s encoding on the server changes
• The encoding for the <META… charset=…> is unknown until the statement is processed.
– So ASCII is recommended for this statement
– Place it as early as possible in the document.
– Else, prior statements may be decoded incorrectly.
• Note:
– Transcoders do not generally correct charset ID

HTML Encoding Priorities
Prioritization is used to resolve conflicts.
• From high to low priority, HTML uses the encoding of:
1. HTTP “Content-Type” charset
2. <META http-equiv “Content-Type” charset>
3. LINK or other syntax for external documents
4. Charset-detecting heuristics
• Many user agents (browsers) support a user override for charset (highest priority)
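
The HTML priority order above can be sketched as a simple first-match selection. This is only an illustration of the ordering, not how any particular browser is implemented; the function and parameter names are invented for the example.

```python
def choose_html_encoding(user_override=None, http_charset=None,
                         meta_charset=None, link_charset=None,
                         sniffed_charset=None):
    """Return the first available charset label, highest priority first.

    Order follows the slide: user override, HTTP Content-Type,
    META http-equiv, LINK/other external syntax, heuristics.
    Returns None when nothing is known.
    """
    candidates = (user_override, http_charset, meta_charset,
                  link_charset, sniffed_charset)
    for candidate in candidates:
        if candidate:
            return candidate
    return None

# HTTP header and META disagree: the HTTP header wins
print(choose_html_encoding(http_charset="iso-8859-1", meta_charset="UTF-8"))
```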
CSS2 Encoding Priorities
Prioritization is used to resolve conflicts.
• From high to low priority, CSS 2.1 external style sheets use the encoding of:
1. HTTP “Content-Type” charset
2. BOM/@charset rule in the style sheet
3. LINK or other syntax in referencing document
4. Charset of the referencing document
5. Assume UTF-8

XML Encoding Priorities
• Encoding name processing is more carefully specified for XML.
• As with HTML, protocol or external information can supersede declaration, BOM or default of UTF-8.
• XML Appendix E (non-normative): Prioritization should be specified by protocols.
– Recommends use of BOM or encoding declaration for files (rather than an external source).
– Refers to RFC 3023
• RFC 3023 specifies several encoding scenarios based on MIME media type: text/xml, application/xml, etc.

Web I18n Part 1- Character Processing
Character Encodings
Character Encoding Negotiation
Character Escaping
Unicode in Markup
Normalization
Identifiers

Character Encoding Negotiation
[Diagram: a Unix user (GB2312) and a Windows user (1252) each working with html]

Typical Browser-Server HTTP Sequence
1. Browser issues GET URL
2. Server sends RESPONSE
3. Browser displays document in RESPONSE
4. Browser POSTs Form with user data (text)
5. Web Server receives data, database application stores text.
Which encoding is sent by the server?
Which encoding is returned by the browser?

Character Encoding Negotiation
[Diagram: Browser sends Get URL with Accept Charset x,y,z to the Server, which holds the HTML Pages]
GET / HTTP/1.1
Accept-Language: en-us,en,hr;q=0.5
Accept-Charset: iso-8859-1,utf-8;q=0.75,*;q=0.5
The browser’s HTTP GET request can list the languages and the encodings it can make use of, to guide the server.
• “q” is a relative measure of the usefulness (quality) of an entry.
The above example indicates:
• US English preferred, other English, Croatian are also ok.
• ISO 8859-1 preferred, then UTF-8, then anything else.
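
A server that wants to honor Accept-Charset or Accept-Language has to order the entries by their q values. The sketch below shows one simple way to do that; the function name parse_accept is hypothetical, entries without a q parameter are given the HTTP default of 1.0, and treating an unparsable q as 0 is an assumption made for this example.

```python
def parse_accept(header_value):
    """Parse an Accept-* header into (name, q) pairs, best first.

    Ties keep the order in which the browser listed them.
    """
    entries = []
    for position, item in enumerate(header_value.split(",")):
        parts = [part.strip() for part in item.split(";")]
        name = parts[0].lower()
        if not name:
            continue
        q = 1.0  # HTTP default when no q parameter is given
        for param in parts[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: unparsable q means "not wanted"
        entries.append((name, q, position))
    entries.sort(key=lambda entry: (-entry[1], entry[2]))
    return [(name, q) for name, q, _ in entries]

print(parse_accept("iso-8859-1,utf-8;q=0.75,*;q=0.5"))
print(parse_accept("en-us,en,hr;q=0.5"))
```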
Character Encoding Negotiation
• Most browsers let you set your language preferences and priorities
• Encoding capabilities are not settable (since they are software dependent).
– Microsoft IE doesn’t send ACCEPT-CHARSET.
– (U.S.) NS 7: ISO-8859-1, UTF-8;q=0.66, *;q=0.66
– Opera 6.0 sends:
Windows-1252;q=1.0, UTF-8;q=1.0, UTF-16; q=1.0, iso-8859-1;q=0.6, *;q=0.1

Character Encoding Negotiation
[Diagram: Browser sends Get URL with Accept Charset x,y,z; Server returns Response with CHARSET=x]
HTTP/1.1 200 OK
Content-Type: text/html; charset=iso-8859-1
--- Blank Line
HTML document
...
The server returns a document.
The encoding is declared in the RESPONSE header.
(Web administrators or content authors need to inform the server about document encodings.)

[Diagram: as above, with the browser converting the Response to O/S Charset = z]
The browser adapts the document for operating system display.

[Diagram: as above, with the browser sending a Form Data Set to the server]
The browser also accepts user data in HTML <FORM> and can send it to the server as a Form Data Set.
A Form Data Set is a series of control name/current value pairs, for “successful” controls.
There are 3 ways browsers submit form data sets.
Form Data Set
<form name="input" method="GET"
action="http://www.xencraft.com/cgitest"
enctype="application/x-www-form-urlencoded">
Name: <input type="text" name="Name" size="10" />
<input type="radio" name="sex" value="m"> Male
<input type="radio" name="sex" value="f"> Female
<input type="submit" value="Send">
</form>
Form Data Set = Control Name/Current Value Pairs
Name/Tex
sex/m

Form Data Set Submission
3 Submission Methods
• GET + HTTP URI
Form Data Set appended to URI + “?”, encoded as
– “application/x-www-form-urlencoded”
• POST + HTTP URI
Form Data Set sent in body, encoded as either
1) “application/x-www-form-urlencoded” or
2) “multipart/form-data” (MIME, RFC 2045)

Form Data Set- GET Method Submission
<form name="input" method="GET"
action="http://www.xencraft.com/cgitest"
enctype="application/x-www-form-urlencoded">
Name: <input type="text" name="Name" size="10" />
<input type="radio" name="sex" value="m"> Male
<input type="radio" name="sex" value="f"> Female
<input type="submit" value="Send">
</form>
This simple form will submit an HTTP GET with:
http://www.xencraft.com/cgitest?Name=Tex&sex=m

Form Data Set Encoding
application/x-www-form-urlencoded
Name=Value&Name2=Value2&Name3=Value3
– Control names/current values listed in the order they appear in the document.
– Names separated from values by =
– Name/value pairs separated by &
– Spaces replaced by +
– Line breaks represented as CR LF: %0D%0A
– Non-alphanumeric and non-ASCII characters and ‘+’, ‘&’, ‘=’, are replaced by %HH
– Browsers map current encoding byte values to %HH
– If the server doesn’t know browser’s character encoding, it may decode form data incorrectly.

Form Data Set Encoding
application/x-www-form-urlencoded
Example comparing two character encodings:
Charset=ISO-8859-1
Name=Fran%E7ois+Ren%E9+Strau%DF
Charset=UTF-8
Name=Fran%C3%A7ois+Ren%C3%A9+Strau%C3%9F

Character Encoding Negotiation
[Diagram: Browser (O/S Charset = z) receives Response with CHARSET=x, then submits the form (GET or POST) with encoding=x-www-form-urlencoded]
Modern browsers send x-www-form-urlencoded data to the server in the CHARSET that was determined to be that of the *form*, however that determination was made (HTTP, <meta>, default, user override).
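
The two encoded forms of the example name can be reproduced with Python's standard urllib.parse module. This sketch also shows what a server sees when it decodes with the wrong charset; the variable names are made up for the example.

```python
from urllib.parse import urlencode, parse_qs

fields = {"Name": "François René Strauß"}

# What a browser would send for a form whose charset is ISO-8859-1 or UTF-8
latin1_query = urlencode(fields, encoding="iso-8859-1")
utf8_query = urlencode(fields, encoding="utf-8")
print(latin1_query)
print(utf8_query)

# Server side: decoding only works if the charset assumption is right
right = parse_qs(latin1_query, encoding="iso-8859-1")
wrong = parse_qs(latin1_query, encoding="utf-8", errors="replace")
print(right["Name"][0])
print(wrong["Name"][0])  # contains U+FFFD replacement characters
```

The %HH bytes carry no charset label of their own, which is why the server must know (or guess) the encoding of the form that produced them.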

Returning data in the encoding received
• Generally works in principle
• Document “charset” must be correctly identified (and has often been wrong)
• Fails with multiple encodings handled by a single CGI
• Fails with transcoding proxies (not allowed to change URIs).
• Recommend using UTF-8 in both directions

Character Encoding Negotiation
[Diagram: Browser (O/S Charset = z) receives Response with CHARSET=x, then submits the form (POST) as multipart/form-data (MIME)]
Each control name/current value pair is a separate part. Each part can be a different charset or content-type encoding.
Supports file uploading (RFC1867).

Multipart/form-data
• More efficient than x-www-form-urlencoded for non-ASCII data, binary data, and files
• Does not have the length limit that browsers impose on URLs (can be as low as 250 for some devices)
• Is now well supported
• Recommended for POST of all form data

Other solutions to identifying encodings:
• XFORMS fixes the failure cases:
http://www.w3.org/MarkUp/Forms/
http://www.w3.org/TR/xforms/ (Rec. Oct. 2003)
– Not generally supported
Used with older browsers:
• Hidden fields containing encoding name or carefully chosen text (tracks transcodings). CGI script performs analysis.
– e.g. Microsoft’s _CHARSET_

Web I18n Part 1- Character Processing
Character Encodings
Character Encoding Negotiation
Reference Processing Model
Character Escaping
Unicode in Markup
Normalization
Identifiers

Reference Processing Model
• Different encoding schemes require different decoding/parsing/processing methods
– Single, and Multi-byte character sets (e.g. EUC)
– Character encoding-switching schemes (ISO 2022)
– Forward combining (accent-base letter)
– Backward combining (base letter-accent)
– Logical ordering/Visual ordering
• Variety bothers implementers and spec writers
• Adopting a single universal encoding obsoletes most of the existing data
• Instead, use a character abstraction

Reference Processing Model
• Logically, characters are Unicode characters
– Specifications are in terms of Unicode characters
– Implementations do NOT have to use Unicode, only behave as if they did
• Benefits
– Removes ambiguity, simplifies specifications
– Allows flexibility for common local encodings
– Backward compatible for older HTML browsers
– Supports internationalization (large character set)
– Removes dependencies/orientation on byte values

Reference Processing Model
[Diagram: any encoding in/out on the wire; an abstraction layer using Unicode for HTML, CSS, XML; any encoding for internal implementation]

Reference Processing Model
• Examples using Reference Processing Model
– HTML 4.0 declares Unicode as its SGML Document Character Set
– CSS “sequence of characters from UCS”
– XML “A character is an atomic unit of text as specified by ISO/IEC 10646”
• Any encoding can be used internally, but Unicode often makes the most sense.
• XML requires parsers to accept UTF-8 and UTF-16, making Unicode best internal choice
• Some Recommendations require Unicode
– e.g. DOM requires UTF-16

Web I18n Part 1- Character Processing
Character Encodings
Character Encoding Negotiation
Character Escaping
Unicode in Markup
Normalization
Identifiers
Character Escaping
Mechanisms to represent characters
• Numeric Character References (NCRs)
– HTML and XML
Hexadecimal: &#xhhhhhh;
Decimal: &#dddd;
– CSS2
“\hh ” (note terminating space), \hhhhhh
• Character Entity References (HTML only)
&aring; &Aring; (note case-sensitivity)

Character Escaping
• Useful for:
– syntax-significant characters
– e.g. &lt; (<), &gt; (>), &amp; (&), &quot; (")
– characters outside current encoding
– eliminating visual or other ambiguity
&shy; (soft-hyphen), - (hyphen-minus), " " (space), &nbsp; (no-break space)

Character Escaping
• Relies on Reference Processing Model
– Always references Unicode scalar value
• Same value regardless of encoding
• Same value for UTF-8, UTF-16, UTF-32
• One value for supplementary characters, not two
E.g. &#x12345; not &#xD808;&#xDF45;
– Simplifies transcoding (no parsing or conversion)
– Allows any Unicode character in any document (if it is legal in the language of the document)

Character Escaping
• Don’t use Windows 1252 code points instead of Unicode, for values 128-160.
• e.g. Euro is &#x20AC; or &#8364; not &#128;
• www.i18nguy.com/markup/ncrs.html
• Don’t simulate characters with special fonts (e.g. Symbol), or you can get erroneous:
• Display, depending on font availability
• Font fallbacks
• Searches by Search engines
• Behavior from Style sheets
• Database contents
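
As a small illustration of numeric character references, the sketch below builds NCRs from Unicode scalar values and escapes only the characters that a target encoding cannot hold. The helper names to_ncr and escape_unencodable are hypothetical; xmlcharrefreplace is Python's built-in error handler, which emits decimal NCRs.

```python
import html

def to_ncr(ch):
    """Hexadecimal NCR for one character, from its Unicode scalar value."""
    return f"&#x{ord(ch):X};"

def escape_unencodable(text, encoding):
    """Replace only the characters the target encoding cannot represent."""
    return text.encode(encoding, errors="xmlcharrefreplace").decode(encoding)

# One reference for a supplementary character, never a surrogate pair
print(to_ncr("\U00012345"))

# The euro sign does not exist in ISO 8859-1, so it is escaped by
# its Unicode value (8364), not by the Windows-1252 byte value 128
print(escape_unencodable("5 €", "iso-8859-1"))

# Expanding a reference gives the same character whatever the encoding
print(html.unescape("&#x20AC;") == "€")
```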
Selecting A Character Encoding
• Choose an encoding that minimizes the need to escape characters.
– Unicode is always a candidate.
– Unicode is supported by all but the oldest browsers.
– Is the largest character set, and can be expanded.
– Therefore it is often the best choice both for minimizing escapes and anticipating future character requirements.
• e.g. New currency symbols

Web I18n Part 1- Character Processing
Character Encodings
Character Encoding Negotiation
Reference Processing Model
Character Escaping
Unicode in Markup
Normalization
Identifiers
Unicode Vs. Markup
• 96,000+ characters as of Unicode 4.0
– Should we use them all?
– Are there any we shouldn’t use?
– Do Unicode’s capabilities, needed for plain text, interfere with markup?
• Markup can do some things better than character codes. Not all Unicode characters are needed.

Unicode Vs. Markup
Potential problem areas
• Redundancies impact searching
– “Å” A-ring “A+˚ ” A+ring “Å” Angstrom
• Formatting characters vs. Markup
– E.g. Bidi controls, interlinear annotation characters
• Characters with style vs. Markup
– E.g. Superscript, subscript
• Object Replacement Character vs. Markup
– Better to use markup to include an image

Unicode Vs. Markup
Solution types
• Restrict characters so they cannot be used
• Replace redundancies (normalization)
• Replace with Markup
– Extensible
– presentation can be separate from content
Joint W3C and Unicode recommendations in:
“Unicode in XML and other Markup Languages”
http://www.w3.org/TR/unicode-xml/
http://www.unicode.org/unicode/reports/tr20/

Web I18n Part 1- Character Processing
Character Encodings
Character Encoding Negotiation
Reference Processing Model
Character Escaping
Unicode in Markup
Normalization
Identifiers
String Indexing and Normalization
• Representing data in more than 1 way leads to errors
• E.g. The Mars Climate Orbiter mission was disastrous. Information expected to be metric, was sent in English units
• Solution- Adopt a standard representation- Normalize

String Indexing
• Which units should be used for counting?

Unit         Example                           Count
Graphemes    𣎴  ≠  A+˚                         3
Characters   U+233B4 U+2260 U+0041 U+030A      4
Code units   D84C DFB4 2260 0041 030A          5
Bytes        D8 4C DF B4 22 60 00 41 03 0A     10
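
The four counts for the example string can be checked with a short Python sketch. The function name count_units is made up, and the grapheme count is only an approximation (it counts characters that are not combining marks), since a full definition of graphemes is beyond this example.

```python
import unicodedata

# The slide's example: U+233B4, U+2260, U+0041 + U+030A
text = "\U000233B4\u2260A\u030A"

def count_units(s):
    """Count one string in the four units discussed above."""
    utf16 = s.encode("utf-16-be")
    return {
        # Approximation: every non-combining character starts a grapheme
        "graphemes": sum(1 for ch in s if unicodedata.combining(ch) == 0),
        "characters": len(s),          # Unicode scalar values
        "code_units": len(utf16) // 2, # 16-bit code units
        "bytes": len(utf16),           # UTF-16BE bytes
    }

print(count_units(text))
```

An index or length is only meaningful once both sides agree on which of these units is being counted.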

String Indexing
Character Model recommendations
• Character counting is recommended for most programming interfaces (e.g. XML Path)
• Code unit counting may be used for internal efficiency (e.g. DOM)
• Graphemes may be useful for user interaction, once a suitable definition exists
• Avoid creating API with single unit arguments
e.g. “SS” = Uppercase(“ß”)

Early Uniform Normalization
Unicode characters can have more than 1 representation
• Canonical equivalence
– Indistinguishable, fundamental equivalence
– E.g. combining sequences, singletons
– “Å” U+00C5 (A-ring pre-composed)
– “A+˚ ” U+0041 + U+030A (A + combining ring above)
– “Å” U+212B (Angstrom)
• Compatibility equivalence
– E.g. Formatting differences, ligatures
– “ｶ” U+FF76 “カ” U+30AB (KA half and full width)
– “ﬁ” U+FB01 (ligature fi)
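
The difference between canonical and compatibility equivalence can be observed with Python's unicodedata module. This is a minimal sketch using the examples listed above; the helper names nfc and nfkc are assumptions for the example.

```python
import unicodedata

def nfc(s):
    """Canonical composition (Normalization Form C)."""
    return unicodedata.normalize("NFC", s)

def nfkc(s):
    """Compatibility composition (Normalization Form KC)."""
    return unicodedata.normalize("NFKC", s)

a_ring = "\u00C5"       # pre-composed A-ring
combining = "A\u030A"   # A + combining ring above
angstrom = "\u212B"     # Angstrom sign (a singleton)

# Canonical equivalents all become the same string under NFC
print(nfc(combining) == a_ring, nfc(angstrom) == a_ring)

# Compatibility equivalents are left distinct by NFC, folded by NFKC
print(nfc("\uFF76") == "\u30AB", nfkc("\uFF76") == "\u30AB")
print(nfkc("\uFB01"))   # the fi ligature becomes the two letters "fi"
```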
Early Uniform Normalization
• Unicode Consortium has defined canonical and compatibility decomposition formats and 4 different sets of rules for normalization:
“Unicode Normalization Forms”
http://www.unicode.org/unicode/reports/tr15/
• The W3C Character Model recommends Normalization Form C (NFC)
– Brings canonical equivalences to composed form
– Leaves compatibility forms as distinct
– Most legacy text is composed, and is unchanged

Early Uniform Normalization
When to normalize?
• Late Normalization burdens receivers to have “smart” compare functions
• Early normalization burdens producers to create normalized text
Basic principles
• Without agreement on text representation, binary matching and string indexing fail
• Consequences are significant
– E.g. Comparison of contracts, security
• Corrected implementations are complex
• Sufficient resources may not be available on very small web components

Basic principles (continued)
• Existing receivers do not check normalization
• Most existing text is composed
• There are many more receivers than producers
• Encrypted strings require normalization first
• Often producers have information about the strings they create, simplifying normalization.
Conclusion: Early Normalization is lowest total cost

Text on the web SHOULD be Fully Normalized.
This is text that is either:
1. Unicode text in Normalization Form NFC, and
2. Does not contain character escapes or includes that upon expansion would undo point 1, and
3. Does not begin with a composing character.
or:
1. Legacy encoded text, which transcoded to Unicode satisfies the above.

• Examples of Fully Normalized Text
“suçon”, “su&#xE7;on”,
“sub̧on”, “sub&#x327;on”
Note- Unicode does not have a composed b-cedilla.
• Examples that are not Fully Normalized
“suçon” (c + combining cedilla), “suc&#x327;on”
Reason: should use composed character “ç”
“̧on”, “&#x327;on”
Reason: should not begin with combining character
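
The first and third conditions of full normalization can be checked mechanically. The sketch below is a simplified test under two assumptions: character escapes have already been expanded before the check, and "composing character" is approximated by a non-zero canonical combining class. The function name is hypothetical.

```python
import unicodedata

def is_fully_normalized(text):
    """Simplified check: text is in NFC and does not start with a
    combining mark. Escapes are assumed to be expanded already."""
    if not text:
        return True
    if unicodedata.normalize("NFC", text) != text:
        return False
    return unicodedata.combining(text[0]) == 0

print(is_fully_normalized("su\u00E7on"))   # composed c-cedilla
print(is_fully_normalized("sub\u0327on"))  # no composed b-cedilla exists
print(is_fully_normalized("suc\u0327on"))  # should use the composed form
print(is_fully_normalized("\u0327on"))     # begins with a combining mark
```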
Web I18n Part 1- Character Processing
Character Encodings
Character Encoding Negotiation
Reference Processing Model
Character Escaping
Unicode in Markup
Normalization
Identifiers

Identifiers in Markup Languages
Identifiers are element and attribute names, CSS selectors, and properties, etc.
• HTML- restricted to a subset of ASCII
(-_:.A-Za-z0-9), case-insensitive
• CSS2- few restrictions
(-A-Za-z0-9) + all chars >U+00A0, case insensitive
• XML 1.0- subset of Unicode 2.0, case-sensitive
• XML 1.1 Unicode 4.0, case-sensitive
• XHTML- same set as HTML
lowercase, case-sensitive, (-_:.A-Za-z0-9)
Identifiers in Markup Languages
XML Naming conventions suggested in Appendix I in XML 1.1, www.w3.org/TR/xml11/
– Exclude all control characters, enclosing nonspacing marks, non-decimal numbers, private-use characters, punctuation characters (with some exceptions), symbol characters, unassigned codepoints, and white space characters.
– 1st character: Unicode General Category of Ll, Lu, Lo, Lm, Lt, or Nl, or else be '_' U+005F
– Remainder: Unicode General Category of Ll, Lu, Lo, Lm, Lt, Mc, Mn, Nl, Nd, Pc, Cf, or be one of: '-' '.' ':' or '·' (U+00B7 middle dot).

Identifiers in Markup Languages
XML Naming conventions suggested in Appendix I in XML 1.1, www.w3.org/TR/xml11/
Don’t use:
– Ideographs with canonical decompositions
– Characters with compatibility decompositions
– Combining characters meant for use with symbols only
– Interlinear annotation characters (U+FFF9-FFFB)
– Variation selector characters
– Names which are nonsensical, unpronounceable, hard to read, or easily confusable with other names

Resource Identifiers: URI, IRI, IDNA
URIs encode bytes, not characters
– Most ASCII bytes expressed as ASCII
– Non-ASCII are %HH, which is ambiguous
• Character encoding not taken into consideration
• IRI-Internationalized Resource Identifiers
– Transcode to UTF-8, then encode as URI
• Adopters: XLink, XPointer, URN, XML, XML Schema
• IE, Firefox, Opera, Safari, and others
– http://www.w3.org/International/O-URL-and-ident.html
RFC: IDNA 3454, 3490-3492, URI 3986, IRI 3987

Domain Names: IDNA Architecture
[Diagram: Application, then Resolver, then DNS Servers]
http://日本語.jp
1. Convert to Unicode
2. Nameprep: Case fold, Mapping, NFKC, Removal
3. ACE (Punycode, profile of Bootstring): Convert to ASCII, Prepend “xn--”
http://xn--wgv71a119e.jp
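
Both conversions can be tried with the Python standard library, as in the sketch below. Python's built-in "idna" codec follows the RFC 3490 family cited above; the variable names are made up for the example.

```python
from urllib.parse import quote

host = "日本語.jp"

# IDNA: Nameprep plus Punycode, with the "xn--" prefix on each
# non-ASCII label
ace_host = host.encode("idna").decode("ascii")
print(ace_host)

# The ASCII form converts back to the Unicode form
print(ace_host.encode("ascii").decode("idna"))

# IRI to URI for a path: transcode to UTF-8, then %HH-escape the bytes
print(quote("/日本語", safe="/"))
```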

Questions

Coffee Break

Web Internationalization Agenda
• Part 1 – Character Processing
• Coffee Break
• Part 2
– Layout and Typography
– Designing International Web Sites

Language Identification
Same mechanism for all:
• Identifiers defined by RFC 3066.
Based on 2-letter and 3-letter language codes (ISO-639) with an optional 2-letter country code (ISO-3166) separated by a character ‘-’ (not ‘_’).
• Value is not case sensitive (even in XML).
• In mark-up: the language attribute is inherited by the children of the element where the attribute is defined.
Language Identification
RFC 3066 Rules:
• 3-letter codes should be used only for the languages that have no 2-letter code.
• Always use the Terminological form of the 3-letter codes, not the Bibliographical form.
• In addition, as much as possible, avoid user-defined codes (x-myCode)

Language Identification
Some Problems:
• RFC 3066 does not cover all needs.
– e.g. Latin-Amer. Spanish, Script distinctions, etc.
– Now being addressed through registrations
• There is no clear distinction of the identifiers of a “language” and a “locale”.
– (See past IUC locales talks for more information.)
• Work in progress to address those issues: IETF, ISO TC37, SIL, W3C, etc.
– RFC 3066bis

Language Identification
• RFC 3066bis
– replaces RFC 3066. New number to be assigned soon
– language-country becomes language-script-country
– Registry expanded to include all valid entries
– New matching rules proposed in a separate RFC
– See Addison Phillips’ talk for more info

Language Identification
• HTTP: Content-Language header
• HTML: LANG attribute (e.g. in <html>)
• XML: xml:lang attribute
• XHTML 1.0: Both lang and xml:lang
<p xml:lang="la" lang="la">Verba.</p>
• XHTML 1.1: xml:lang attribute
Language Identification
The lang() function in XPath:
• True if the selected node has xml:lang set to the given language code.
• Match is done as a sub-string from the start of the value:
'en' matches 'en', and 'en-us'.
• Match is case insensitive:
'en' matches 'EN', 'En-us', etc.
→ Example: Input, Languages.xsl, Output.

Language Identification – Input file
<?xml version="1.0" encoding="iso-8859-1" ?>
<?xml-stylesheet type="text/xsl" href="Languages.xsl"?>
<MyData>
<Msg id="100">
<Text xml:lang="en">Message 100 in English.</Text></Msg>
<Msg id="200">
<Text xml:lang="en-us">Message 200 <span xml:lang="fr">
[insertion in French]</span> in American
English.</Text>
<Text xml:lang="fr-CA">Message 200 en Québecquois.</Text>
</Msg>
<Msg id="300">
<Text xml:lang="fr">Message 300 en français.</Text></Msg>
<Msg id="400">
<Text xml:lang="EN-GB">Message 400 in British
English.</Text> </Msg>
</MyData>
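
The matching rule of lang() can be sketched in a few lines of Python. In XPath 1.0 the prefix match only succeeds when the remainder of the value starts with '-', so 'en' matches 'en-us' but not a different code that merely begins with the same letters. The function name lang_matches is hypothetical.

```python
def lang_matches(tag, wanted):
    """True if an xml:lang value matches the wanted language code.

    Case-insensitive; matches the whole value, or a prefix that is
    followed by '-'.
    """
    tag = tag.lower()
    wanted = wanted.lower()
    return tag == wanted or tag.startswith(wanted + "-")

# The xml:lang values from the input file above, tested against 'en'
for value in ("en", "en-us", "fr-CA", "fr", "EN-GB"):
    print(value, lang_matches(value, "en"))
```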

Language Identification – Style-sheet
<?xml version="1.0" ?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:param name="Language">en</xsl:param>
<xsl:template match="text()"/>
<xsl:template match="Text">
<xsl:if test="lang($Language)">
<p><xsl:value-of select="."/>
(<xsl:value-of select="@xml:lang"/>)</p>
</xsl:if>
</xsl:template>
</xsl:stylesheet>

Language Identification – IE Output
Message 100 in English. (en)
Message 200 [insertion in French] in American English. (en-us)
Message 400 in British English. (EN-GB)

Language Identification – FF Output
Message 100 in English. (en)Message 200 [insertion in French] in American English. (en-us)Message 400 in British English. (EN-GB)

Language Identification- Opera Output
Message 100 in English. Message 200 [insertion in French] in American English. Message 200 en Québecquois. Message 300 en français. Message 400 in British English.
Language Identification – CSS
There are two methods to refer to the language attribute in CSS:
• The lang pseudo-class.
*:lang(zh) { font-family:SimSun }
• The attribute selector.
*[lang|=fr] { font-weight:bold }
• Both use the same matching mechanism as the lang() function in XPath.
→ Example: LanguagesCSS.htm

Language Identification – CSS
<html lang="en">
<head>
<style>
*:lang(en-us) { font-weight: bold; }
*[lang|=fr] { font-style: italic; color: red; }
</style>
<title>Test Language and CSS</title>
</head>
<body>
<p>Text in English.</p>
<p lang="en-us">Text in American English.</p>
<p lang="en">Text in generic English.</p>
<p lang="fr-ca">Texte en québecquois.</p>
<p lang="fr">Texte en français.</p>
<p lang="en-gb">Text in British English.</p>
</body>
</html>
Language Identification – FF Output
Text in English.
Text in American English.
Text in generic English.
Texte en québecquois.
Texte en français.
Text in British English.

Fonts
• In HTML: Use styles rather than <FONT> (easier to change for each language, if needed).
CSS Fonts:
• Provide fallback fonts.
• unicode-range specifies characters used.
unicode-range removed from CSS 2.1
@font-face {font-family: "Tex";
src:url(ftp://ftp.i18nguy.com/Texfnt.ttf);
unicode-range: U+??, U+900-97f; }
body { font-family: Texfnt, Tahoma, Arial,
sans-serif; }
Quotes – HTML
• The <q> element for in-line quotations (auto-quotation marks expected).
• The <blockquote> element for paragraph-type quotations (indented, and no auto-quotation marks expected).
→ Example: Input, Output: Quotes.htm.

Quotes – Using CSS
• CSS allows control of the type of quote to use according to the language.
*[lang|=fr] { quotes:'\ab\a0' '\a0\bb' }
qo:before { content:open-quote }
qo:after { content:close-quote }
• Examples
→ HTML: Input, CSS, Output: QuotesWithCSS.htm.
→ XML: Input, CSS File, Output: Quotes.xml.
Quotes – HTML Input
...
<body>
<p lang="en">English text with <q>English quoted
text</q>.</p>
<p lang="fr">Text en Français avec <q>English quoted
text</q>.</p>
<p lang="fr">Text en Français avec <q lang="en">English
quoted text containing a <q>quote</q> itself</q>.</p>
<p lang="fi"><q>Quotes</q> in Finnish.</p>
<p lang="pl"><q>Quotes</q> in Polish.</p>
<p lang="ja"><q>Quotes</q> in Japanese.</p>
<p lang="de"><q>Quotes</q> in German.</p>
<p lang="nl"><q>Quotes</q> in Dutch.</p>
<blockquote lang="fr">A paragraph using
blockquote.</blockquote>
</body>
</html>

Some Unicode Characters
U+2018 ‘ Left Single Quotation Mark
U+2019 ’ Right Single Quotation Mark
U+201C “ Left Double Quotation Mark
U+201D ” Right Double Quotation Mark
U+201E „ Double Low 9 Quotation Mark
U+201F ‟ Double High Reversed 9 Q. M.
U+300C 「 Left Corner Bracket
U+300D 」 Right Corner Bracket
U+00AB « Left Pointing Double Angle Q. M.
U+00BB » Right Pointing Double Angle Q. M.
U+00A0 No Break Space

Quotes – CSS Style-sheet
q:before { content: open-quote; }
q:after { content: close-quote; }
blockquote:before { content: open-quote; }
blockquote:after { content: close-quote; }
[lang|='en'] > * { /* English */ quotes: "\201C" "\201D" }
[lang|='fr'] > * { /*guillemets*/ quotes: "\AB\A0" "\A0\BB" }
[lang|='fi'] > * {/*same direction*/ quotes: "\201D" "\201D" }
[lang|='de'] > * { /* German */ quotes: "\201E" "\201C" }
[lang|='ja'] > * { /* Japanese */ quotes: "\300C" "\300D" }
[lang|='nl'] > * { /* Dutch */ quotes: "\2018" "\2019" }
[lang|='pl'] > * { /* Polish */ quotes: "\201E" "\201D" }

Quotes – Opera Output
English text with “English quoted text”.
Text en Français avec « English quoted text ».
Text en Français avec « English quoted text containing a ”quote" itself ».
”Quotes” in Finnish.
„Quotes” in Polish.
「Quotes」 in Japanese.
„Quotes“ in German.
‘Quotes’ in Dutch.
“A paragraph using blockquote.”
Casing
• CSS2 provides the property text-transform with 5 values: uppercase, lowercase, capitalize, none, and inherit.
• CSS2 allows user agents to ignore it for non Latin-1 characters and for unusual case conversion (making it useless from an i18n viewpoint). CSS3 (working draft) forces Unicode casing conformance. This property is deprecated in XSL 1.0.
→ Example: Source, Output: TextTransform.htm.

Casing TextTransform.htm
Style:
<style>
.upper { text-transform: uppercase}
.lower { text-transform: lowercase}
.cap { text-transform: capitalize}
</style>
Casing – TextTransform.htm

<p>Original = This text should be all uppercased.<br>
Transformed = <span class="upper">This text should be all uppercased. </span></p>
<p>Original = THIS TEXT SHOULD BE ALL LOWERCASED.<br>
Transformed = <span class="lower">THIS TEXT SHOULD BE ALL LOWERCASED. </span></p>
<p>Original 1 = tHIS tEXT sHOULD bE cAPITALIZED.<br>
Transformed = <span class="cap">tHIS tEXT sHOULD bE cAPITALIZED. </span><br>
Original 2 = this text should be capitalized.<br>
Transformed = <span class="cap">this text should be capitalized. </span></p>
<p lang="de">[de] Original = ß (sharp-s), ö (o-diaeresis)<br>
Transformed = <span class="upper">ß (sharp-s), ö (o-diaeresis) </span></p>
<p lang="tr">[tr] Original = i (i-with-dot)<br>
Transformed = <span class="upper">i (i-with-dot)</span></p>

Casing – IE and Opera Output

Original = This text should be all uppercased.
Transformed = THIS TEXT SHOULD BE ALL UPPERCASED.
Original = THIS TEXT SHOULD BE ALL LOWERCASED.
Transformed = this text should be all lowercased.
Original 1 = tHIS tEXT sHOULD bE cAPITALIZED.
Transformed = THIS TEXT SHOULD BE CAPITALIZED.
Original 2 = this text should be capitalized.
Transformed = This Text Should Be Capitalized.
[de] Original = ß (sharp-s), ö (o-diaeresis)
Transformed = ß (SHARP-S), Ö (O-DIAERESIS)
[tr] Original = i (i-with-dot)
Transformed = I (I-WITH-DOT)

Casing – Firefox Output

Original = This text should be all uppercased.
Transformed = THIS TEXT SHOULD BE ALL UPPERCASED.
Original = THIS TEXT SHOULD BE ALL LOWERCASED.
Transformed = this text should be all lowercased.
Original 1 = tHIS tEXT sHOULD bE cAPITALIZED.
Transformed = THIS TEXT SHOULD BE CAPITALIZED.
Original 2 = this text should be capitalized.
Transformed = This Text Should Be Capitalized.
[de] Original = ß (sharp-s), ö (o-diaeresis)
Transformed = SS (SHARP-S), Ö (O-DIAERESIS)
[tr] Original = i (i-with-dot)
Transformed = I (I-WITH-DOT)

Casing – Clipboard Copy (unchanged)

Original = This text should be all uppercased.
Transformed = This text should be all uppercased.
Original = THIS TEXT SHOULD BE ALL LOWERCASED.
Transformed = THIS TEXT SHOULD BE ALL LOWERCASED.
Original 1 = tHIS tEXT sHOULD bE cAPITALIZED.
Transformed = tHIS tEXT sHOULD bE cAPITALIZED.
Original 2 = this text should be capitalized.
Transformed = this text should be capitalized.
[de] Original = ß (sharp-s), ö (o-diaeresis)
Transformed = ß (sharp-s), ö (o-diaeresis)
[tr] Original = i (i-with-dot)
Transformed = i (i-with-dot)
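
For comparison with the browser outputs, the sketch below shows what language-aware uppercasing could look like outside a style sheet. It relies on Python's built-in str.upper(), which applies the full Unicode mapping (ß becomes SS) but is not locale-sensitive, so the Turkish dotted and dotless i are handled by hand. The function name upper_for_lang() and the choice to cover only "tr" and "az" are assumptions for illustration.

```python
def upper_for_lang(text: str, lang: str) -> str:
    """Uppercase text, special-casing the Turkish/Azerbaijani dotted and dotless i."""
    primary = lang.lower().split("-")[0]
    if primary in ("tr", "az"):
        # i -> İ (U+0130) and ı (U+0131) -> I, applied before the generic mapping.
        text = text.replace("i", "\u0130").replace("\u0131", "I")
    # str.upper() uses the full Unicode case mapping, e.g. ß -> SS.
    return text.upper()


print(upper_for_lang("ß (sharp-s), ö (o-diaeresis)", "de"))
print(upper_for_lang("i (i-with-dot)", "tr"))
```

With lang="tr" the result starts with İ (capital I with dot above), which none of the browser outputs recorded on the slides produce.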
Numbered Lists

With CSS2
• CSS2 offers the list-style-type property to specify the type of numbers for lists. Supports only a limited set of pre-defined styles (e.g. has Armenian but not Thai).
→ Example NumberedLists.htm

Numbered Lists – NumberedLists.htm

...<head>
<style>
.list_heb {list-style-type:hebrew}
.list_geo {list-style-type:georgian}
.list_arm {list-style-type:armenian}
.list_cjk {list-style-type:cjk-ideographic}
</style>
</head>
<body>...
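
When a numbering style is not available in the user agent, the item markers can be generated ahead of time. As a rough sketch only, the function below produces Hebrew alphabetic numerals for 1 to 99, including the customary substitutions for 15 and 16. The function name and the limited range are assumptions, and this is not claimed to reproduce any browser's exact algorithm.

```python
HEBREW_UNITS = ["", "א", "ב", "ג", "ד", "ה", "ו", "ז", "ח", "ט"]   # 1-9
HEBREW_TENS = ["", "י", "כ", "ל", "מ", "נ", "ס", "ע", "פ", "צ"]    # 10-90


def hebrew_numeral(n: int) -> str:
    """Return the Hebrew alphabetic numeral for 1 <= n <= 99 (illustrative range)."""
    if not 1 <= n <= 99:
        raise ValueError("this sketch only covers 1 to 99")
    # 15 and 16 are traditionally written 9+6 and 9+7 instead of 10+5 and 10+6.
    if n == 15:
        return "טו"
    if n == 16:
        return "טז"
    tens, units = divmod(n, 10)
    return HEBREW_TENS[tens] + HEBREW_UNITS[units]


for i in range(1, 7):
    print(hebrew_numeral(i), f"Item {i}")
```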
Numbered Lists – NumberedLists.htm

... <body>
<p>List numbered in Hebrew:</p>
<ol class="list_heb"><li>Item 1</li>...<li>Item 6</li>
</ol>
<p>List numbered in Georgian:</p>
<ol class="list_geo"><li>Item 1</li>...<li>Item 6</li>
</ol>
<p>List numbered in Armenian:</p>
<ol class="list_arm"><li>Item 1</li>...<li>Item 6</li>
</ol>
<p>List numbered in Han character
(<code>cjk-ideographic</code>):</p>
<ol class="list_cjk"><li>Item 1</li>...<li>Item 6</li>
</ol>
</body>
</html>

Numbered Lists – Firefox Output

Numbered Lists

With XSL
• XSL provides more flexibility as the format and the type of the numbers can be changed using <xsl:number/>.
→ Example: Input, ListNumbers.xsl, Output.

Number Formatting

The function format-number() in XSL allows the formatting of numbers based on a given pattern.
• Uses the same patterns as Java 1.1 java.text.DecimalFormat.
• Use <xsl:decimal-format/> to override the default symbols (e.g. decimal separator, grouping separator, etc.).
→ Example: Input, XSL File, Output.
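
The effect of overriding the decimal and grouping symbols can be illustrated outside XSL. The sketch below is not format-number() and does not parse DecimalFormat patterns. It only shows the symbol substitution step, with a hypothetical function name and parameters.

```python
def format_number(value: float, decimals: int = 2,
                  decimal_sep: str = ".", grouping_sep: str = ",") -> str:
    """Format value with 3-digit grouping, then swap in the requested symbols."""
    plain = f"{value:,.{decimals}f}"  # e.g. "1,234,567.89"
    # Go through a placeholder so the two separators do not clobber each other.
    return (plain.replace(",", "\x00")
                 .replace(".", decimal_sep)
                 .replace("\x00", grouping_sep))


print(format_number(1234567.891))                                       # default symbols
print(format_number(1234567.891, decimal_sep=",", grouping_sep="."))    # German-style symbols
print(format_number(1234567.891, decimal_sep=",", grouping_sep="\u00A0"))  # no-break space grouping
```

Grouping sizes other than three digits would need more than a symbol swap.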

Text Flow

Bi-directional Text in HTML
• The dir attribute:
– dir="ltr" (default), dir="rtl"
– Affects the default value of align.
– Inherited (use it in <html> to set the base for the whole document).
• The <bdo> element:
– Overrides implicit directional properties of content.
– Requires the dir attribute.

Text Flow

Bi-directional Text for XML (CSS2)
• Use the direction and unicode-bidi properties. The unicode-bidi property specifies the behavior for inline-level elements (15 maximum levels of embedding).
• Based on Unicode bidi algorithm (UAX#9)
para.bidi { direction:rtl;
unicode-bidi:embed }
→ Example: BidiText.htm
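
Choosing the value of dir (or direction) for a block of text whose language is not known in advance is often done by looking for the first strongly directional character, in the spirit of the paragraph-level rules of UAX#9. The sketch below is a simplified illustration of that heuristic using Python's standard unicodedata module. It ignores embeddings and isolates, and the function name and the "ltr" default are assumptions.

```python
import unicodedata


def base_direction(text: str, default: str = "ltr") -> str:
    """Guess a base direction from the first strong bidi character in text."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":            # strong left-to-right (Latin, etc.)
            return "ltr"
        if bidi_class in ("R", "AL"):    # strong right-to-left (Hebrew, Arabic)
            return "rtl"
    # No strong character found (digits, punctuation only): use the default.
    return default


print(base_direction("חברת Pepper Creek LLC"))
print(base_direction("Pepper Creek LLC"))
```

A first-strong guess can be wrong for right-to-left text that begins with a Latin product name, which is why an explicit dir attribute remains preferable when the language is known.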
Text Flow – Bidi Example Source (1/2)

<p style="direction:rtl; unicode-bidi:embed">
Using CSS:<br/>
חברת Pepper Creek LLC,
שנוסדה זה-עתה, מונה יותר מ-550 עובדים.</p>

<p dir="rtl">
Using dir="rtl":<br/>
חברת Pepper Creek LLC,
שנוסדה זה-עתה, מונה יותר מ-550 עובדים.</p>

Text Flow – Bidi Example Source (2/2)

<p dir="rtl">
<span dir="ltr">Using dir="ltr-span":</span> <br/>
חברת Pepper Creek LLC,
שנוסדה זה-עתה, מונה יותר מ-550 עובדים.</p>

<p dir="ltr">Using dir="ltr" (wrong):<br/>
חברת Pepper Creek LLC,
שנוסדה זה-עתה, מונה יותר מ-550 עובדים.</p>

Text Flow – Bidi Output

Text Flow

Vertical Text
• Use the writing-mode property (CSS3).
• For example, to display top-to-bottom, and right-to-left text use:
div.vertical { writing-mode:tb-rl }
→ Example in HTML, and in SVG.
Text Flow – Vertical, HTML

<p style="writing-mode: rl-tb">
Example of horizontal text (rl-tb).</p>
<p style="writing-mode: tb-rl">
Example of vertical text (tb-rl).</p>
<p style="writing-mode: tb-rl">
Example of vertical text with
<span style="writing-mode: lr-tb">
horizontal</span> insert.</p>

Text Flow – Vertical, HTML Output
Text Flow – Vertical, SVG

<?xml version="1.0" ?>
<svg xmlns="http://www.w3.org/2000/svg" width="330" height="330"
viewBox="0 0 330 330">
<g style="font-size:24;">
<text x="20" y="26" style="writing-mode: lr;">
Horizontal Text</text>
<text x="20" y="56" style="writing-mode: tb;">
Example of vertical text</text>
</g>
</svg>

Text Flow – Vertical, SVG in HTML

<html>
<body>
<p>
<object data="Vertical.svg"
type="image/svg+xml"
width="330" height="330" />
</p>
</body>
</html>

Text Flow – Vertical, SVG Output

Ruby Annotation

Annotation in smaller characters running above or below a base text.
• Used in Japanese for pronunciation of Kanji characters (Furigana).
• W3C Ruby Module: <ruby> element with <rb> for the base text, <rt> for the ruby text. <rbc> and <rtc> for complex annotations.
→ Example: Ruby.htm
Ruby Annotation – HTML (1/2)

<p>Simple Ruby test:</p>
<ruby>
<rb>日本語</rb>
<rt>にほんご</rt>
</ruby>
<p>Ruby with parenthesis text, used if ruby is not implemented: </p>
<ruby>
<rb>日本語</rb>
<rp>[[</rp><rt>にほんご</rt><rp>]]</rp>
</ruby>

Ruby Annotation – HTML (2/2)

<p>Ruby complex:</p>
<ruby>
<rbc>
<rb>10</rb> <rb>31</rb> <rb>2002</rb>
</rbc>
<rtc>
<rt>Month</rt> <rt>Day</rt> <rt>Year</rt>
</rtc>
<rtc>
<rt rbspan="3">Expiration Date</rt>
</rtc>
</ruby>
Ruby Annotation – IE Output

Ruby Annotation – Firefox Output

Combined Text

Runs of characters grouped together.
• Available in CSS3 with text-combine.
• Combination in blocks (Kumimoji)
• Combination in lines (Warichu)
span.kumimoji { text-combine: letters }
span.warichu { text-combine: lines }

Sorting

XSL offers the <xsl:sort/> element to collate lists of items.
• Use lang (not xml:lang) to specify the language to use for the sorting rules.
• Results depend on the implementation of the XSL engine.
→ Example: Sorting.xml sorted for English and Norwegian. (Sorting_EN.xsl and Sorting_NO.xsl).
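
To see why the sorting language matters, the sketch below compares plain code-point order with an ordering based on the Norwegian alphabet, where æ, ø and å follow z. It is a deliberately small illustration with a hand-written alphabet table, not a substitute for a real collation library, and the word list and helper name are made up.

```python
# Norwegian alphabet: the three extra letters come after z, in this order.
NORWEGIAN_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
_RANK = {ch: i for i, ch in enumerate(NORWEGIAN_ALPHABET)}


def norwegian_key(word: str):
    """Sort key ranking letters by their position in the Norwegian alphabet."""
    # Characters outside the alphabet sort last, ordered by code point.
    return [(_RANK.get(ch, len(_RANK)), ch) for ch in word.lower()]


words = ["ørret", "zebra", "apple", "ål", "ære"]
print(sorted(words))                     # code-point order
print(sorted(words, key=norwegian_key))  # Norwegian alphabet order
```

In code-point order å sorts before æ and ø, while the Norwegian alphabet places it last. A production system would normally delegate this to an implementation of the Unicode Collation Algorithm.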
Sorting

Version 2.0 of XSLT has new features for <xsl:sort>
http://www.w3.org/TR/xslt20/#dt-collation
• case-order attribute specifies whether to sort uppercase or lowercase first.
• collation attribute names an implementer-defined collation to use.
– if given, lang and case-order are ignored.

Web Internationalization Agenda

• Part 1 – Character Processing
• Coffee Break
• Part 2
– Layout and Typography
– Designing International Web Sites
Requirements

• Two sets of requirements, not always compatible:
– Business requirements, e.g. to have the Web site ranked high in search engines; to have a single look and feel across sites in different languages; etc.
– Localization requirements, e.g. to avoid changing links in localized pages; to have locale-specific content; etc.
• Solutions depend on technologies used (static Web site, client-side scripting, server-side scripting, databases, multiple addresses, etc).

Domain Names

Easier if each language has its own domain name: www.xyzcorp.fr, www.xyzcorp.de, etc.
→ One domain = One language.

Unfortunately:
– Most sites have only one address for many languages.
– Even ‘country-specific’ sites may have several languages: www.xyzcorp.ca
→ English, French, Inuktitut.

Directories and Files

One possible solution:
• Home page of the ‘main’ language is the entry point of the directory structure. (e.g. index.html)
• Language home pages are also at the root and have a language identifier in their name. (e.g. index_fr.html)
• Other pages have identical names across languages, but are in different language directories.

Directories and Files

\
+----- index.html
+----- index_fr.html
|
+----- en
|      +----- about.html
|      +----- products.html
|      +----- menubar.png
|
+----- fr
|      +----- about.html
|      +----- products.html
|      +----- menubar.png
|
+----- common
       +----- logo.png
       +----- background.jpg
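
Because page names are identical across language directories, the path of a page in another language can be computed rather than stored. The helpers below are a sketch of that mapping for the layout above. Their names, and the choice of "en" as the main language, are assumptions for illustration.

```python
from pathlib import PurePosixPath

MAIN_LANGUAGE = "en"  # assumption: English is the 'main' language of the site


def home_page(lang: str) -> str:
    """Home pages live at the root: index.html, index_fr.html, ..."""
    return "index.html" if lang == MAIN_LANGUAGE else f"index_{lang}.html"


def localized_page(page: str, lang: str) -> str:
    """Other pages keep their name and live in the language directory."""
    return str(PurePosixPath(lang) / page)


def switch_language(path: str, new_lang: str) -> str:
    """Map 'en/about.html' to 'fr/about.html'.

    Assumes path starts with a language directory; files under 'common'
    are shared and should not be passed to this function.
    """
    parts = PurePosixPath(path).parts
    return str(PurePosixPath(new_lang, *parts[1:]))


print(home_page("fr"))
print(localized_page("products.html", "fr"))
print(switch_language("en/about.html", "fr"))
```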
Directories and Files

• Allows search engines to retrieve meaningful information (but with emphasis on the main language).
• Maximizes the use of relative URLs (no link change, except to the home page). If scripting is available, you can have the links resolved at run-time.
• Allows room for locale-specific content if necessary.

Directories and Files

• Use cookies if you want to remember the preferred language of the user and redirect him/her to the relevant set of files.
• Use a common directory for shared files.
• Use meaningful directory and file names.
• Avoid translating directory and file names.
• Treat the source language just like another language as much as possible.
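
The cookie advice can be sketched on the server side with Python's standard http.cookies module. The cookie name "lang", the list of supported languages and the default are all assumptions made for this example.

```python
from http.cookies import SimpleCookie

SUPPORTED = ("en", "fr")  # hypothetical set of site languages


def preferred_language(cookie_header: str, default: str = "en") -> str:
    """Return the language stored in the 'lang' cookie, or the default."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)       # parse the raw Cookie request header
    morsel = cookie.get("lang")      # cookie name is an assumption
    if morsel is not None and morsel.value in SUPPORTED:
        return morsel.value
    return default


print(preferred_language("lang=fr; theme=dark"))
print(preferred_language("theme=dark"))
```

The returned value would then select the language directory to redirect to. Unsupported or missing values fall back to the default rather than producing a broken link.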
Language Selection

• List box of language names in native language
– Make sure characters display correctly (fonts)
– Graphics are always displayed correctly.
• Destination Choice
– The same page in the new language.
– The main page in the new language. (for country-specific sites, etc.)

Good Practices – IDs

IDs are VERY useful for re-use of translation, and for re-use of text across documents.
– In HTML, IDs can be set for all elements containing text, except the <title> element.
– Make sure to provide an ID attribute for the translatable elements of your XML vocabularies, so it can be utilized for re-use, leveraging, etc.

Good Practices – Attributes

When creating new XML vocabularies: Avoid using attributes for storing translatable text.
– Impossible to add needed bidi tags in an attribute.
– Cause segmentation issues in many tools.
– Much more difficult to have metadata for attributes than for elements.
– You cannot set different languages for two attributes in the same element.
– More tricky to set unique IDs for attributes.

Good Practices – Embedded Data

Data that are not text content (e.g. scripts, SQL queries, etc.).
– Keep them outside of the document if possible (e.g. using include mechanisms).
– At least, make sure elements with such data are identified for the localizer (who might need to apply a process different than for the rest of the document content).
– Internationalize your scripts/queries/etc.
Good Practices – Use Style-sheets

• Separate the function of a term (a title, a link, an important term) from its display (bolded, underlined, in 12-point Courier, etc.)
– Type of display for the target language(s) may be different than for the source language.
– Force author/developer to think about the structure of the document.
• Avoid <br/> -like elements when possible: Use styles to format, not tags.

Good Practices – CDATA Sections

Avoid CDATA sections if possible.
– Translation tools do not handle CDATA well.
– Keeping track of inline CDATA leads to meaningless inline codes in segments (and can affect leveraging).
– NCRs are not allowed in CDATA. This may cause problems if the document is converted to an encoding where some characters need to be written as NCRs.
By the way: CDATA does NOT preserve spaces.
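
The point about NCRs can be checked with any XML parser. The short sketch below uses Python's standard xml.etree.ElementTree. The element name and sample text are made up for the demonstration.

```python
import xml.etree.ElementTree as ET

# Outside CDATA the numeric character reference is resolved by the parser.
escaped = ET.fromstring("<p>caf&#233;</p>")

# Inside CDATA the same characters are plain text: no reference is recognized.
in_cdata = ET.fromstring("<p><![CDATA[caf&#233;]]></p>")

print(escaped.text)
print(in_cdata.text)
```

The second element's text is the literal string "caf&#233;", so a character that cannot be represented in the target encoding has no escape route inside a CDATA section.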
Additional Resources

• W3C Internationalization Work Group
http://www.w3.org/International
• Unicode in XML and other Markup Languages
http://www.w3.org/TR/unicode-xml
• Character Model for the World Wide Web
http://www.w3.org/TR/charmod
• Richard Ishida’s paper on “Localisation Considerations in DTD Design”
http://www.w3.org/People/Ishida/writing.html#dtd
• XML Internationalization FAQ
http://www.opentag.com/xmli18nfaq.htm

Conclusion

Implementation of standards is still a little behind in practice.

But today, Web-related technologies are among the best ways to store, manipulate and represent data in different languages.

Any questions?
