
JavaScript – Character Set

JavaScript programs are written using the Unicode character set.
Unicode is a superset of ASCII and Latin-1 and supports virtually
every written language currently used on the planet. ECMAScript 3
requires JavaScript implementations to support Unicode version 2.1 or
later, and ECMAScript 5 requires implementations to support Unicode 3 or later. See the sidebar in
Text for more about Unicode and
JavaScript.

Case Sensitivity

JavaScript is a case-sensitive language. This means that
language keywords, variables, function names, and other
identifiers must always be typed with a
consistent capitalization of letters. The while keyword, for example, must be typed
“while,” not “While” or “WHILE.” Similarly, online, Online, OnLine, and ONLINE are four distinct variable
names.
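
As a small illustration of the point above, the following sketch declares the four variables from the text and checks that they hold separate values. It also shows that a miscapitalized keyword is rejected; the variable names sum and whileIsRejected are made up for this example, and the use of the Function constructor to trigger the syntax error is just one convenient way to demonstrate it.

```javascript
// Four distinct variables whose names differ only in capitalization.
var online = 1, Online = 2, OnLine = 3, ONLINE = 4;
var sum = online + Online + OnLine + ONLINE;   // => 10

// "While" is not the while keyword. It is parsed as an ordinary
// identifier, so using it like a loop header is a syntax error.
var whileIsRejected = false;
try {
    new Function("While (false) {}");
} catch (e) {
    whileIsRejected = e instanceof SyntaxError;   // => true
}
```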

Note, however, that HTML is not case-sensitive (although XHTML
is). Because of its close association with client-side JavaScript,
this difference can be confusing. Many client-side JavaScript
objects and properties have the same names as the HTML tags and
attributes they represent. While these tags and attribute names can
be typed in any case in HTML, in JavaScript they typically must be
all lowercase. For example, the HTML onclick event handler attribute is
sometimes specified as onClick in
HTML, but it must be specified as onclick in JavaScript code (or in XHTML
documents).

Whitespace, Line Breaks, and Format Control Characters

JavaScript ignores spaces that appear between tokens in
programs. For the most part, JavaScript also ignores line breaks
(but see Optional Semicolons for an exception).
Because you can use spaces and newlines freely in your programs, you
can format and indent your programs in a neat and consistent way
that makes the code easy to read and understand.

In addition to the regular space character (\u0020), JavaScript also recognizes the
following characters as whitespace: tab (\u0009), vertical tab (\u000B), form feed
(\u000C), nonbreaking space (\u00A0), byte order mark (\uFEFF), and any character in
Unicode category Zs. JavaScript recognizes the following characters as line
terminators: line feed (\u000A), carriage return (\u000D), line separator (\u2028),
and paragraph separator (\u2029). A carriage return, line feed sequence is treated
as a single line terminator.
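
One way to see these rules in action is to build small programs as strings and evaluate them. The sketch below assumes an engine that follows ECMAScript 5 or later; the names whitespace, terminators, allIgnored, and allTerminate are invented for the example, and eval() is used only because it lets the test characters be written as escapes.

```javascript
// Each of these characters should be ignored between tokens.
var whitespace = ["\u0020", "\u0009", "\u000B", "\u000C", "\u00A0", "\uFEFF"];
var allIgnored = whitespace.every(function (ws) {
    return eval("1" + ws + "+" + ws + "2") === 3;
});

// Each of these should end a line. A // comment runs only to the end of
// its line, so the expression after the terminator is still evaluated.
var terminators = ["\u000A", "\u000D", "\u2028", "\u2029", "\u000D\u000A"];
var allTerminate = terminators.every(function (lt) {
    return eval("// comment" + lt + "1 + 2") === 3;
});
```

If a character in the second list were not a line terminator, the whole program would be a comment and eval() would return undefined instead of 3.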

Unicode format control characters (category Cf), such as
RIGHT-TO-LEFT MARK (\u200F) and LEFT-TO-RIGHT MARK (\u200E),
control the visual presentation of the text they occur in. They are
important for the proper display of some non-English languages and
are allowed in JavaScript comments, string literals, and regular
expression literals, but not in the identifiers (e.g., variable
names) of a JavaScript program. As a special case, ZERO WIDTH JOINER
(\u200D) and ZERO WIDTH NON-JOINER (\u200C) are allowed
in identifiers, but not as the first character. As noted above, the
byte order mark format control character (\uFEFF) is treated as a space
character.
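
The identifier rules above can be checked with a short experiment. This is a sketch that assumes an ECMAScript 5 or later engine; the helper name parses and the result variables are hypothetical, and the programs are passed to eval() as strings so that the invisible characters can be written as visible escapes.

```javascript
// Returns true if the given source text evaluates without a SyntaxError.
function parses(source) {
    try {
        eval(source);
        return true;
    } catch (e) {
        if (e instanceof SyntaxError) return false;
        throw e;   // anything else is unexpected
    }
}

var zwjInside = parses("var a\u200Db = 1;");        // ZERO WIDTH JOINER inside a name
var zwnjInside = parses("var a\u200Cb = 1;");       // ZERO WIDTH NON-JOINER inside a name
var zwjFirst = parses("var \u200Dab = 1;");         // joiner as the first character
var rlmInside = parses("var a\u200Fb = 1;");        // RIGHT-TO-LEFT MARK inside a name
var rlmInString = parses("var s = 'a\u200Fb';");    // format control in a string literal
```

Under the rules described in the text, the first two and the last program should be accepted, and the third and fourth should be rejected.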

Unicode Escape Sequences

Some computer hardware and software cannot display or input
the full set of Unicode characters. To support programmers using
this older technology, JavaScript defines special sequences of six
ASCII characters to represent any 16-bit Unicode code point. These
Unicode escapes begin with the characters \u and are followed by exactly four
hexadecimal digits (using uppercase or lowercase letters A–F).
Unicode escapes may appear in JavaScript string literals, regular
expression literals, and in identifiers (but not in language
keywords). The Unicode escape for the character é, for example, is
\u00E9, and the following two
JavaScript strings are identical:

"café" === "caf\u00e9"   // => true

Unicode escapes may also appear in comments, but since
comments are ignored, they are treated as ASCII characters in that
context and not interpreted as Unicode.
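
Beyond string literals, the other places where escapes are permitted can be sketched as follows. The example assumes the source file is saved as UTF-8 so that the literal é is read correctly; the variable names are invented for illustration, and the keyword check goes through eval() only so that the expected syntax error can be caught.

```javascript
// An escape in an identifier names the same variable as the literal character.
var caf\u00e9 = "coffee";
var sameIdentifier = (café === "coffee");        // => true

// An escape in a regular expression literal matches the literal character.
var regexMatches = /caf\u00e9/.test("café");     // => true

// Escapes are not allowed in keywords: v\u0061r is not accepted as "var".
var keywordRejected = false;
try {
    eval("v\\u0061r x = 1;");
} catch (e) {
    keywordRejected = e instanceof SyntaxError;  // => true
}
```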

Normalization

Unicode allows more than one way of encoding the same
character. The string “é”, for example, can be encoded as the single
Unicode character \u00E9 or as a
regular ASCII e followed by the acute accent combining mark \u0301. These two encodings may look
exactly the same when displayed by a text editor, but they have
different binary encodings and are considered different by the
computer. The Unicode standard defines the preferred encoding for
all characters and specifies a normalization procedure to convert
text to a canonical form suitable for comparisons. JavaScript
assumes that the source code it is interpreting has already been
normalized and makes no attempt to normalize identifiers, strings,
or regular expressions itself.
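
The difference between the two encodings is easy to observe. The sketch below also uses String.prototype.normalize(), which was added in ECMAScript 6 and so postdates the language versions discussed here; treat its use as an assumption about the runtime. The variable names are made up for the example.

```javascript
var precomposed = "caf\u00E9";    // é as a single character
var decomposed = "cafe\u0301";    // e followed by a combining acute accent

var lengths = [precomposed.length, decomposed.length];   // => [4, 5]
var rawEqual = (precomposed === decomposed);              // => false

// Converting both strings to the same normalization form (NFC here)
// makes them compare equal.
var normalizedEqual =
    precomposed.normalize("NFC") === decomposed.normalize("NFC");   // => true
```

Because JavaScript does not normalize on its own, code that compares user-supplied text may need to normalize both sides explicitly before comparing.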
