Whitespace Character: Definition and Ambiguity

Whitespace character
In computer programming, whitespace is any character or series of characters that represents horizontal or vertical space in typography. When rendered, a
whitespace character does not correspond to a visible mark, but typically does occupy an area on a page. For example, the common whitespace symbol
U+0020 SPACE (also ASCII 32) represents a blank space punctuation character in text, used as a word divider in Western scripts.
Contents
Overview
Definition and ambiguity
Unicode
Substitutes
Whitespace and digital typography

On-screen display
Variable-width general-purpose space
Hair spaces around dashes
Formatting values of quantities
Computing applications
Programming languages
Command line user interfaces
Markup languages
File names
See also
References
External links
Overview
With many keyboard layouts, a horizontal whitespace character may be entered with the spacebar. Horizontal whitespace may also be entered on many
keyboards with the Tab ↹ key, although the length of the space may vary. Vertical whitespace is more varied in how it is encoded, but the most obvious
in typing is the result of the ↵ Enter key, which creates a 'newline' code sequence in application programs. Older keyboards might instead say Return,
abbreviating the typewriter term 'carriage return', which generated an electromechanical return to the left stop (CR, ASCII hex 0D) and a line feed,
or move to the next line (LF, ASCII hex 0A). In some applications these codes were used independently to draw text-cell-based displays on monitors or
to print on tractor-guided printers, which might also accept reverse-motion and positioning code sequences that allowed text-based output devices to
produce more sophisticated output. Many early computer games used such codes to draw a screen (e.g. Kingdom of Kroz), and word processing software
used them to produce printed effects such as bold, underline, and strikeout.
[Figure: Relative widths of various spaces in Unicode]
The term "whitespace" is based on the resulting appearance on ordinary paper. However they are coded inside an application, whitespace characters can
be processed like any other character code, and programs can take whatever action is defined for the context in which they occur.
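The CR and LF codes mentioned above can be inspected like any other character. The following is a minimal Python sketch, not taken from the article; the helper name normalize_newlines is hypothetical, and it assumes the goal is to map CR LF pairs and lone CRs to a single LF.

```python
CR = "\r"  # carriage return, ASCII hex 0D
LF = "\n"  # line feed, ASCII hex 0A


def normalize_newlines(text: str) -> str:
    """Convert CR LF pairs and lone CRs to a single LF (hypothetical helper)."""
    # Replace the two-character sequence first so it does not become two LFs.
    return text.replace(CR + LF, LF).replace(CR, LF)


print(hex(ord(CR)), hex(ord(LF)))
print(repr(normalize_newlines("one\r\ntwo\rthree\n")))
```

Replacing the CR LF pair before the lone CR matters: doing it in the other order would turn each pair into two line feeds.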
Definition and ambiguity

The most common whitespace characters may be typed via the space bar or the tab key. Depending on context, a line-break generated by the return or enter
key may be considered whitespace as well.
Unicode
The table below lists the twenty-five characters defined as whitespace ("WSpace=Y", "WS") characters in the Unicode Character Database.[1] Seventeen use
a definition of whitespace consistent with the algorithm for bidirectional writing ("Bidirectional Character Type=WS") and are known as "Bidi-WS"
characters. The remaining characters may also be used, but are not of this "Bidi" type.
Note: Depending on the browser and fonts used to view the following table, not all spaces may be displayed properly.
Unicode characters with White_Space property[a][b]
Name | Code point | Decimal | May break? | In IDN? | Script | Block | General category | Notes
CHARACTER TABULATION | U+0009 | 9 | Yes | No | Common | Basic Latin | Other, control | HT, Horizontal Tab. HTML/XML named entity: &Tab;, LaTeX: '\tab'
LINE FEED | U+000A | 10 | Is a line-break | | Common | Basic Latin | Other, control | LF, Line feed. HTML/XML named entity: &NewLine;
LINE TABULATION | U+000B | 11 | Is a line-break | | Common | Basic Latin | Other, control | VT, Vertical Tab
FORM FEED | U+000C | 12 | Is a line-break | | Common | Basic Latin | Other, control | FF, Form feed
CARRIAGE RETURN | U+000D | 13 | Is a line-break | | Common | Basic Latin | Other, control | CR, Carriage return
SPACE | U+0020 | 32 | Yes | No | Common | Basic Latin | Separator, space | Most common (normal ASCII space)
NEXT LINE | U+0085 | 133 | Is a line-break | | Common | Latin-1 Supplement | Other, control | NEL, Next line
NO-BREAK SPACE | U+00A0 | 160 | No | No | Common | Latin-1 Supplement | Separator, space | Non-breaking space: identical to U+0020, but not a point at which a line may be broken. HTML/XML named entity: &nbsp;, LaTeX: '\ '
OGHAM SPACE MARK | U+1680 | 5760 | Yes | No | Ogham | Ogham | Separator, space | Used for interword separation in Ogham text. Normally a vertical line in vertical text or a horizontal line in horizontal text, but may also be a blank space in "stemless" fonts. Requires an Ogham font.
EN QUAD | U+2000 | 8192 | Yes | No | Common | General Punctuation | Separator, space | Width of one en. U+2002 is canonically equivalent to this character; U+2002 is preferred.
EM QUAD | U+2001 | 8193 | Yes | No | Common | General Punctuation | Separator, space | Also known as "mutton quad". Width of one em. U+2003 is canonically equivalent to this character; U+2003 is preferred.
EN SPACE | U+2002 | 8194 | Yes | No | Common | General Punctuation | Separator, space | Also known as "nut". Width of one en. U+2000 En Quad is canonically equivalent to this character; U+2002 is preferred. HTML/XML named entity: &ensp;, LaTeX: '\enspace'
EM SPACE | U+2003 | 8195 | Yes | No | Common | General Punctuation | Separator, space | Also known as "mutton". Width of one em. U+2001 Em Quad is canonically equivalent to this character; U+2003 is preferred. HTML/XML named entity: &emsp;, LaTeX: '\quad'
THREE-PER-EM SPACE | U+2004 | 8196 | Yes | No | Common | General Punctuation | Separator, space | Also known as "thick space". One third of an em wide. HTML/XML named entity: &emsp13;
FOUR-PER-EM SPACE | U+2005 | 8197 | Yes | No | Common | General Punctuation | Separator, space | Also known as "mid space". One fourth of an em wide. HTML/XML named entity: &emsp14;
SIX-PER-EM SPACE | U+2006 | 8198 | Yes | No | Common | General Punctuation | Separator, space | One sixth of an em wide. In computer typography, sometimes equated to U+2009.
FIGURE SPACE | U+2007 | 8199 | No | No | Common | General Punctuation | Separator, space | Figure space. In fonts with monospaced digits, equal to the width of one digit. HTML/XML named entity: &numsp;
PUNCTUATION SPACE | U+2008 | 8200 | Yes | No | Common | General Punctuation | Separator, space | As wide as the narrow punctuation in a font, i.e. the advance width of the period or comma.[2] HTML/XML named entity: &puncsp;
THIN SPACE | U+2009 | 8201 | Yes | No | Common | General Punctuation | Separator, space | One-fifth (sometimes one-sixth) of an em wide. Recommended for use as a thousands separator for measures made with SI units. Unlike U+2002 to U+2008, its width may get adjusted in typesetting.[3] HTML/XML named entity: &thinsp;, LaTeX: '\,'
HAIR SPACE | U+200A | 8202 | Yes | No | Common | General Punctuation | Separator, space | Thinner than a thin space. HTML/XML named entity: &hairsp; (does not work in all browsers)
LINE SEPARATOR | U+2028 | 8232 | Is a line-break | | Common | General Punctuation | Separator, line |
PARAGRAPH SEPARATOR | U+2029 | 8233 | Is a line-break | | Common | General Punctuation | Separator, paragraph |
NARROW NO-BREAK SPACE | U+202F | 8239 | No | No | Common | General Punctuation | Separator, space | Narrow no-break space. Similar in function to U+00A0 No-Break Space. When used with Mongolian, its width is usually one third of the normal space; in other contexts, its width sometimes resembles that of the Thin Space (U+2009).
MEDIUM MATHEMATICAL SPACE | U+205F | 8287 | Yes | No | Common | General Punctuation | Separator, space | MMSP. Used in mathematical formulae. Four-eighteenths of an em.[4] In mathematical typography, the widths of spaces are usually given in integral multiples of an eighteenth of an em, and 4/18 em may be used in several situations, for example between the a and the + and between the + and the b in the expression a + b.[5]
IDEOGRAPHIC SPACE | U+3000 | 12288 | Yes | No | Common | CJK Symbols and Punctuation | Separator, space | As wide as a CJK character cell (fullwidth). Used, for example, in tai tou.
Related Unicode characters without White_Space property
Name | Code point | Decimal | May break? | In IDN? | Script | Block | General category | Notes
MONGOLIAN VOWEL SEPARATOR | U+180E | 6158 | Yes | No | Mongolian | Mongolian | Other, Format | MVS. A narrow space character, used in Mongolian to cause the final two characters of a word to take on different shapes.[6] It is no longer classified as a space character (i.e. in the Zs category) in Unicode 6.3.0, even though it was in previous versions of the standard.
ZERO WIDTH SPACE | U+200B | 8203 | Yes | No | ? | General Punctuation | Other, Format | ZWSP, zero-width space. Used to indicate word boundaries to text processing systems when using scripts that do not use explicit spacing. It is similar to the soft hyphen, with the difference that the latter is used to indicate syllable boundaries, and should display a visible hyphen when the line breaks at it. HTML/XML named entity: &NegativeMediumSpace;
ZERO WIDTH NON-JOINER | U+200C | 8204 | Yes | Context-dependent[7] | ? | General Punctuation | Other, Format | ZWNJ, zero-width non-joiner. When placed between two characters that would otherwise be connected, a ZWNJ causes them to be printed in their final and initial forms, respectively. HTML/XML named entity: &zwnj;
ZERO WIDTH JOINER | U+200D | 8205 | Yes | Context-dependent[8] | ? | General Punctuation | Other, Format | ZWJ, zero-width joiner. When placed between two characters that would otherwise not be connected, a ZWJ causes them to be printed in their connected forms. HTML/XML named entity: &zwj;
WORD JOINER | U+2060 | 8288 | No | No | ? | General Punctuation | Other, Format | WJ, word joiner. Similar to U+200B, but not a point at which a line may be broken. HTML/XML named entity: &NoBreak;
ZERO WIDTH NON-BREAKING SPACE | U+FEFF | 65279 | No | No | ? | Arabic Presentation Forms-B | Other, Format | Zero-width non-breaking space. Used primarily as a Byte Order Mark. Use as an indication of non-breaking is deprecated as of Unicode 3.2; see U+2060 instead.
a. White_Space is a binary Unicode property.[9]
b. "Unicode 12.0 UCD: PropList.txt" (https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt). 2019-01-22. Retrieved 2019-03-05.
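The counts quoted for the table (twenty-five White_Space characters, seventeen of them with bidirectional type WS) can be cross-checked against a Unicode database. The sketch below uses Python's standard unicodedata module; the code point list is transcribed by hand from the table rather than read from PropList.txt, and the name bidi_ws is a hypothetical helper. Results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Code points transcribed from the White_Space table (an assumption of this
# sketch; the authoritative list is PropList.txt in the Unicode Character Database).
WHITE_SPACE = [
    0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0020, 0x0085, 0x00A0,
    0x1680, *range(0x2000, 0x200B), 0x2028, 0x2029, 0x202F, 0x205F, 0x3000,
]


def bidi_ws(code_points):
    """Return the code points whose bidirectional class is WS."""
    return [cp for cp in code_points
            if unicodedata.bidirectional(chr(cp)) == "WS"]


for cp in WHITE_SPACE:
    ch = chr(cp)
    # Control characters have no name in the database, so supply a default.
    name = unicodedata.name(ch, "<control>")
    print(f"U+{cp:04X} {unicodedata.category(ch)} "
          f"{unicodedata.bidirectional(ch):>2} {name}")

print(len(WHITE_SPACE), len(bidi_ws(WHITE_SPACE)))
```

The second column of the output is the general category (Zs, Zl, Zp, or Cc), which corresponds to the "General category" column of the table.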
Substitutes
Unicode also provides some visible characters that can be used to represent whitespace:
Unicode space-illustrating characters (visible)

Code | Decimal | Name | Block | Display | Description
U+00B7 | 183 | Middle dot | Latin-1 Supplement | · | Interpunct. Named entity: &middot;
U+237D | 9085 | Shouldered open box | Miscellaneous Technical | ⍽ | Used to indicate a NBSP
U+2420 | 9248 | Symbol for space | Control Pictures | ␠ |
U+2422 | 9250 | Blank symbol | Control Pictures | ␢ | aka "substitute blank",[10] used in BCDIC,[10] EBCDIC,[10] ASCII-1963[10][11] etc. as word separator
U+2423 | 9251 | Open box | Control Pictures | ␣ | Used in block letter handwriting at least since the 1980s when it is necessary to explicitly indicate the number of space characters (e.g. when programming with pen and paper). Used in a textbook (published 1982, 1984, 1985, 1988 by Springer-Verlag) on Modula-2,[12] a programming language where space codes require explicit indication. Also used in the keypad[n 1] of the Texas Instruments' TI-8x series of graphing calculators. Named entity: &blank;
1. Above the zero "0" or negative "(‒)" key.
Non-space blanks
The Braille Patterns Unicode block contains U+2800 ⠀ BRAILLE PATTERN BLANK (HTML &#10240;), a Braille pattern with no dots raised.
Some fonts display the character as a fixed-width blank; however, the Unicode standard explicitly states that it does not act as a space.
Exact space
The Cambridge Z88 provided a special "exact space" (code point 160, i.e. 0xA0), invokable by the key shortcut ⌑ + SPACE[13] and displayed as
"…" by the operating system's display driver.[14][15] It was therefore also known as "dot space" in conjunction with BBC BASIC.[14][15]
Under code point 224 (0xE0) the computer also provided a special three-character-cells-wide SPACE symbol "SPC" (analogous to
Unicode's single-cell-wide U+2420).[14][15]
Whitespace and digital typography
On-screen display
Text editors, word processors, and desktop publishing software differ in how they represent whitespace on the screen, and how they represent spaces at the
ends of lines longer than the screen or column width. In some cases, spaces are shown simply as blank space; in other cases they may be represented by an
interpunct or other symbols. Many different characters (described above) can be used to produce spaces, and non-character functions (such as margins and
tab settings) can also affect whitespace.
Variable-width general-purpose space

In computer character encodings, there is a normal general-purpose space (Unicode character U+0020) whose width will vary according to the design of the
typeface. Typical values range from 1/5 em to 1/3 em (in digital typography an em is equal to the nominal size of the font, so for a 10-point font the space will
probably be between 2 and 3.3 points). Sophisticated fonts may have differently sized spaces for bold, italic, and small-caps faces, and often compositors will
manually adjust the width of the space depending on the size and prominence of the text.
In addition to this general-purpose space, it is possible to encode a space of a specific width. See the Unicode table above for a complete list.
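The arithmetic behind the 10-point example can be written out as a short Python sketch. The function name space_width_range is hypothetical, and the 1/5 em and 1/3 em bounds are simply the typical values quoted above, not limits imposed by any font format.

```python
from fractions import Fraction


def space_width_range(font_size_pt):
    """Typical width range of U+0020 in points, assuming 1 em = font size."""
    size = Fraction(font_size_pt)
    return size * Fraction(1, 5), size * Fraction(1, 3)


low, high = space_width_range(10)
print(f"{float(low):.1f} pt to {float(high):.1f} pt")  # 2.0 pt to 3.3 pt
```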
Hair spaces around dashes

Em dashes used as parenthetical dividers, and en dashes when used as word joiners, are usually set continuous with the text.[16] However, such a dash can
optionally be surrounded with a hair space, U+200A, or thin space, U+2009. The hair space can be written in HTML by using the numeric character
references &#x200A; or &#8202;, or the named entity &hairsp;, but as of 2016 it is not universally supported in browsers. The thin space has the named
entity &thinsp; and the numeric references &#x2009; or &#8201;. These spaces are much thinner than a normal space (except in a monospaced, i.e.
non-proportional, font), with the hair space being the thinner of the two.
Normal space versus hair and thin spaces

(as rendered by your browser)
Normal space left right
Normal space with em dash left — right
Thin space with em dash left — right
Hair space with em dash left — right
No space with em dash left—right
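The relationship between a code point and its HTML character references can be sketched in Python. The helper char_refs is hypothetical; html.unescape is the standard-library decoder and is used here only to confirm that each reference resolves to the intended character.

```python
import html

HAIR_SPACE = 0x200A
THIN_SPACE = 0x2009


def char_refs(code_point):
    """Return the hexadecimal and decimal numeric character references."""
    return f"&#x{code_point:04X};", f"&#{code_point};"


for cp in (HAIR_SPACE, THIN_SPACE):
    hex_ref, dec_ref = char_refs(cp)
    # Both forms decode to the same single character.
    assert html.unescape(hex_ref) == html.unescape(dec_ref) == chr(cp)
    print(hex_ref, dec_ref)

# Named entities decode the same way; \u2014 is the em dash.
decoded = html.unescape("left&hairsp;&mdash;&hairsp;right")
print(decoded == "left\u200a\u2014\u200aright")
```

Whether a browser renders the resulting character at the expected width still depends on the font in use.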
Formatting values of quantities

The International System of Units (SI) prescribes inserting a space between a number and a unit of measurement and between units in compound units. A thin
space should be used as a thousands separator. See unit symbols and numbers.
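A minimal Python sketch of this convention follows. The function name format_quantity is hypothetical, it handles integers only, and it uses an ordinary U+0020 between the number and the unit; a no-break space might be preferable in running text.

```python
THIN_SPACE = "\u2009"


def format_quantity(value: int, unit: str) -> str:
    """Group digits in threes with thin spaces, then append the unit."""
    # Let Python insert commas, then swap them for U+2009 THIN SPACE.
    grouped = f"{value:,}".replace(",", THIN_SPACE)
    return f"{grouped} {unit}"


print(format_quantity(1234567, "m"))
```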
Computing applications
Programming languages
In programming language syntax, spaces are frequently used to explicitly separate tokens. In most languages multiple whitespace characters are treated the
same as a single whitespace character (outside of quoted strings); such languages are called free-form. In a few languages, including Haskell, occam, ABC,
and Python, whitespace and indentation are used for syntactical purposes. In the satirical language called Whitespace, whitespace characters are the only valid
characters for programming, while any other characters are ignored.
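Python itself shows both behaviours: within a line, extra spaces between tokens do not change the parse, while leading indentation is part of the syntax. The sketch below uses the standard ast module; the helper parses is hypothetical.

```python
import ast


def parses(source: str) -> bool:
    """Return True if the source is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


# Spacing between tokens on one line does not affect the syntax tree.
same_tree = ast.dump(ast.parse("x=1+2")) == ast.dump(ast.parse("x  =  1 +   2"))

# Indentation does: the body of the if statement must be indented.
indented = "if True:\n    print('hi')\n"
not_indented = "if True:\nprint('hi')\n"

print(same_tree, parses(indented), parses(not_indented))
```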
Excessive use of whitespace, especially trailing whitespace at the end of lines, is considered a nuisance. However, correct use of whitespace can make
code easier to read and help group related logic.
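Trailing whitespace is easy to remove mechanically. The following Python sketch is one possible approach; the function name is hypothetical, and it strips only spaces and tabs so that other characters at the end of a line are left alone.

```python
def strip_trailing_whitespace(text: str) -> str:
    """Remove spaces and tabs at the end of every line."""
    # Split on LF only, so the line structure of the input is preserved.
    return "\n".join(line.rstrip(" \t") for line in text.split("\n"))


print(repr(strip_trailing_whitespace("a  \n\tb\t \n")))
```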
Most languages only recognize ASCII characters as whitespace, or in some cases Unicode newlines as well, but not most of the characters listed above. The C
language defines whitespace characters to be "space, horizontal tab, new-line, vertical tab, and form-feed".[17] The HTTP network protocol requires different
types of whitespace to be used in different parts of the protocol, such as: only the space character in the status line, CRLF at the end of a line, and "linear
whitespace" in header values.[18]
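The gap between ASCII-only and Unicode-aware whitespace handling can be seen in Python's re module, where \s on text patterns matches Unicode whitespace by default and only ASCII whitespace under the re.ASCII flag. This is a sketch of one language's behaviour, not a general rule; the helper is_ws and the sample selection are assumptions of the example.

```python
import re

# The five characters the C standard names as white space.
C_WHITESPACE = " \t\n\v\f"

SAMPLES = {
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00a0",
    "EM SPACE": "\u2003",
    "ZERO WIDTH SPACE": "\u200b",
}


def is_ws(ch: str, ascii_only: bool) -> bool:
    """Does the regex class \\s match this single character?"""
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(r"\s", ch, flags) is not None


for name, ch in SAMPLES.items():
    print(f"{name}: unicode={is_ws(ch, False)} ascii={is_ws(ch, True)}")

# Every character in the C list is matched even in ASCII-only mode.
print(all(is_ws(ch, True) for ch in C_WHITESPACE))
```

ZERO WIDTH SPACE is matched in neither mode, consistent with its absence from the White_Space table.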
Command line user interfaces

In commands processed by command processors, whether in scripts or typed interactively, the space character can cause problems because it has two possible
functions: as part of a command or parameter, or as a parameter or name separator. Ambiguity can be prevented either by prohibiting embedded spaces, or by
enclosing a name with embedded spaces between quote characters.
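Python's standard shlex module, which follows POSIX shell quoting rules, gives a quick way to see both roles of the space. The file name used here is made up for the example.

```python
import shlex

# Unquoted, the space inside the file name acts as a separator.
unquoted = shlex.split("cp my file.txt backup")

# Quoted, the space is part of a single argument.
quoted = shlex.split('cp "my file.txt" backup')

# shlex.quote adds the quoting needed to pass the name as one argument.
safe = shlex.quote("my file.txt")

print(unquoted)
print(quoted)
print(safe)
```

Other command processors, such as those on Windows, use different quoting rules, so this sketch applies to POSIX-style shells only.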
Markup languages
Some markup languages, such as SGML, preserve whitespace as written.
Web markup languages such as XML and HTML treat whitespace characters specially, including space characters, for programmers' convenience. One or
more space characters read by conforming display-time processors of those markup languages are collapsed to 0 or 1 space, depending on their semantic
context. For example, double (or more) spaces within text are collapsed to a single space, and spaces which appear on either side of the "=" that separates an
attribute name from its value have no effect on the interpretation of the document. Element end tags can contain trailing spaces, and empty-element tags in
XML can contain spaces before the "/>". In these languages, unnecessary whitespace increases the file size, and so may slow network transfers. On the other
hand, unnecessary whitespace can also inconspicuously mark code, similar to, but less obvious than comments in code. This can be desirable to prove an
infringement of license or copyright that was committed by copying and pasting.
In XML attribute values, each whitespace character (space, tab, carriage return, or line feed) is normalized to a space character when the document is
read by a parser; for attributes whose declared type is not CDATA, leading and trailing spaces are also discarded and runs of spaces are collapsed to a
single space.[19] Whitespace in XML element content is not changed in this way by the parser, but an application receiving information from the parser
may choose to apply similar rules to element content. An XML document author can use the xml:space="preserve" attribute on an element to signal that
downstream applications should not alter whitespace in that element's content.
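The different treatment of attribute values and element content can be observed with Python's standard xml.etree.ElementTree parser. This is an illustrative sketch; the element and attribute names are invented, and the document has no DTD, so the attribute is treated as CDATA.

```python
import xml.etree.ElementTree as ET

# The same text, containing a literal tab and a literal newline, appears
# once as an attribute value and once as element content.
doc = '<item label="a\tb\nc">a\tb\nc</item>'
elem = ET.fromstring(doc)

attr_value = elem.get("label")  # whitespace normalized by the parser
content = elem.text             # whitespace passed through unchanged

print(repr(attr_value))
print(repr(content))
```

To keep a literal tab or newline in an attribute value, an author can write it as a character reference such as &#9; or &#10;, which is not subject to this normalization.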
In most HTML elements, a sequence of whitespace characters is treated as a single inter-word separator, which may manifest as a single space character
when rendering text in a language that normally inserts such space between words.[20] Conforming HTML renderers are required to apply a more literal
treatment of whitespace within a few prescribed elements, such as the pre tag and any element for which CSS has been used to apply pre-like whitespace
processing. In such elements, space characters will not be "collapsed" into inter-word separators.
In both XML and HTML, the non-breaking space character, along with other non-"standard" spaces, is not treated as collapsible "whitespace", so it is not
subject to the rules above.
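A simplified model of this collapsing can be sketched in Python. It assumes the collapsible set is space, tab, line feed, form feed, and carriage return, and it ignores everything a real renderer also considers (element boundaries, CSS white-space settings, line starts and ends); the function name collapse is hypothetical.

```python
import re

# Assumed collapsible set: space, tab, LF, FF, CR. U+00A0 is deliberately absent.
HTML_COLLAPSIBLE = re.compile(r"[ \t\n\f\r]+")


def collapse(text: str) -> str:
    """Replace each run of collapsible whitespace with one space."""
    return HTML_COLLAPSIBLE.sub(" ", text)


print(repr(collapse("a  \n\t b\u00a0\u00a0c")))
```

The two no-break spaces survive, which is why they are sometimes used to force extra horizontal space in HTML text.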
File names
Multiword file names written for operating systems and applications that are confused by embedded space codes often use an underscore (_) as a word
separator instead, as_in_this_phrase.
Another symbol used to mark an explicit space was U+2422 ␢ BLANK SYMBOL. This was used in the early years of computer programming when writing on
coding forms. Keypunch operators immediately recognized the symbol as an "explicit space".[10] It was used in BCDIC,[10] EBCDIC,[10] and ASCII-1963.[10]
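The underscore convention for file names is a one-line transformation. The Python sketch below is a minimal illustration; the function name is hypothetical and it replaces only U+0020, leaving tabs and other whitespace untouched.

```python
def underscore_name(name: str) -> str:
    """Replace each space in a file name with an underscore."""
    return name.replace(" ", "_")


print(underscore_name("as in this phrase"))
```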
See also
Carriage return
Form feed
Indent style
Line feed
Newline
Programming style
Prosigns for Morse code
Regular expression#Character classes for the white-space character class.
Space bar
Space (punctuation)
Tab key
Trimming (computer programming)
Whitespace (programming language)
Zero-width space
References
1. "The Unicode Standard" (http://unicode.org/versions/latest/). Unicode Consortium.
2. "Character design standards – space characters" (https://web.archive.org/web/20100314135826/https://www.microsoft.com/typography/developers/fdsspec/spaces.htm). Character design standards. Microsoft. 1998–1999. Archived from the original (http://www.microsoft.com/typography/developers/fdsspec/spaces.htm) on August 23, 2000. Retrieved 2009-05-18.
3. The Unicode Standard 5.0, printed edition, p. 205.
4. "General Punctuation" (https://www.unicode.org/charts/PDF/U2000.pdf) (PDF). The Unicode Standard 5.1. Unicode Inc. 1991–2008. Retrieved 2009-05-13.
5. Sargent, Murray III (2006-08-29). "Unicode Nearly Plain Text Encoding of Mathematics (Version 2)" (https://www.unicode.org/notes/tn28/tn28-2.html). Unicode Technical Note #28. Unicode Inc. pp. 19–20. Retrieved 2009-05-19.
6. Gillam, Richard (2002). Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard. Addison-Wesley. ISBN 0-201-70052-2.
7. Faltstrom, P., ed. (August 2010). "Zero Width Non-Joiner" (https://tools.ietf.org/html/rfc5892#appendix-A.1). The Unicode Code Points and Internationalized Domain Names for Applications (IDNA) (https://tools.ietf.org/html/rfc5892). IETF. sec. A.1. doi:10.17487/RFC5892 (https://doi.org/10.17487%2FRFC5892). RFC 5892. Retrieved September 4, 2019.
8. Faltstrom, P., ed. (August 2010). "Zero Width Joiner" (https://tools.ietf.org/html/rfc5892#appendix-A.2). The Unicode Code Points and Internationalized Domain Names for Applications (IDNA) (https://tools.ietf.org/html/rfc5892). IETF. sec. A.2. doi:10.17487/RFC5892 (https://doi.org/10.17487%2FRFC5892). RFC 5892. Retrieved September 4, 2019.
9. "Unicode Standard Annex #44, Unicode Character Database" (http://www.unicode.org/reports/tr44/#White_Space).
10. Mackenzie, Charles E. (1980). Coded Character Sets, History and Development (https://books.google.com/books?id=6-tQAAAAMAAJ). The Systems Programming Series (1 ed.). Addison-Wesley Publishing Company, Inc. pp. 41, 47, 52, 102–103, 117, 119, 130, 132, 141, 148, 150–151, 212, 424. ISBN 978-0-201-14460-4. LCCN 77-90165 (https://lccn.loc.gov/77-90165). Retrieved 2016-05-22. [1] (https://web.archive.org/web/20160526172151/https://textfiles.meulie.net/bitsaved/Books/Mackenzie_CodedCharSets.pdf)
11. "American Standard Code for Information Interchange, ASA X3.4-1963" (http://worldpowersystems.com/archives/codes/X3.4-1963/index.html). American Standards Association (ASA). 1963-06-17. Archived (https://web.archive.org/web/20160526195837/http://worldpowersystems.com/archives/codes/X3.4-1963/index.html) from the original on 2016-05-26. Retrieved 2014-05-23.
12. Niklaus Wirth, Programming in Modula-2 (https://link.springer.com/content/pdf/bfm%3A978-3-642-83565-0%2F1.pdf)
13. "Cambridge Z88 User Guide" (https://cambridgez88.jira.com/wiki/display/UG/The+keyboard). 4.7 (4th ed.). Cambridge Computer Limited. 2016 [1987]. Basic concepts - The keyboard. Archived (https://web.archive.org/web/20161212173159/https://cambridgez88.jira.com/wiki/display/UG/The+keyboard) from the original on 2016-12-12. Retrieved 2016-12-12.
14. "Cambridge Z88 User Guide" (https://cambridgez88.jira.com/wiki/display/UG40/Appendix+D+-+Character+set). 4.0 (4th ed.). Cambridge Computer Limited. 1987. Appendix D. Archived (https://web.archive.org/web/20161212173345/https://cambridgez88.jira.com/wiki/display/UG40/Appendix+D+-+Character+set) from the original on 2016-12-12. Retrieved 2016-12-12.
15. "Cambridge Z88 User Guide" (https://cambridgez88.jira.com/wiki/display/UG/Appendix+D+-+Character+set). 4.7 (4th ed.). Cambridge Computer Limited. 2015 [1987]. Appendix D. Archived (https://web.archive.org/web/20161212173256/https://cambridgez88.jira.com/wiki/display/UG/Appendix+D+-+Character+set) from the original on 2016-12-12. Retrieved 2016-12-12.
16. Usage of the different dash types is illustrated, e.g., in The Chicago Manual of Style, §§ 6.80, 6.83–6.86.
17. http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1548.pdf Section 6.4, paragraph 3.
18. Fielding, R.; et al., "2.2 Basic Rules", Hypertext Transfer Protocol -- HTTP/1.1, RFC 2616 (https://tools.ietf.org/html/rfc2616)
19. "3.3.3 Attribute-Value Normalization" (http://www.w3.org/TR/REC-xml/#AVNormalize). Extensible Markup Language (XML) 1.0 (Fifth Edition). World Wide Web Consortium.
20. "9.1 Whitespace" (http://www.w3.org/TR/html4/struct/text.html#h-9.1). W3C HTML 4.01 Specification. World Wide Web Consortium.
External links
Property List of Unicode Character Database (http://unicode.org/Public/UNIDATA/PropList.txt)
Retrieved from "https://en.wikipedia.org/w/index.php?title=Whitespace_character&oldid=924029247"
This page was last edited on 1 November 2019, at 10:22 (UTC).
Text is available under the Creative Commons Attribution-ShareAlike License; additional terms may apply. By using this site, you agree to the
Terms of Use and Privacy Policy. Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc., a non-profit organization.
