JavaScript – Text

A string is an immutable ordered sequence
of 16-bit values, each of which typically represents a Unicode
character—strings are JavaScript’s type for representing text. The
length of a string is the number of 16-bit values
it contains. JavaScript’s strings (and its arrays) use zero-based
indexing: the first 16-bit value is at position 0, the second at
position 1 and so on. The empty string is the
string of length 0. JavaScript does not have a special type that
represents a single element of a string. To represent a single 16-bit
value, simply use a string that has a length of 1.
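
The points above can be checked with a short sketch. The variable names (word, empty, first) are hypothetical and chosen only for this illustration:

```javascript
// Hypothetical sample values, used only to illustrate the paragraph above.
var word = "text";
var empty = "";
var first = word.charAt(0);   // the element at position 0

word.length;                  // => 4: four 16-bit values
empty.length;                 // => 0: the empty string
first;                        // => "t"
typeof first;                 // => "string": there is no separate character type
first.length;                 // => 1: a single element is a string of length 1
```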

Characters, Codepoints, and JavaScript Strings

JavaScript uses the UTF-16 encoding of the Unicode character
set, and JavaScript strings are sequences of unsigned 16-bit values.
The most commonly used Unicode characters (those from the “basic
multilingual plane”) have codepoints that fit in 16 bits and can be represented by a
single element of a string. Unicode characters whose codepoints do
not fit in 16 bits are encoded following the rules of UTF-16 as a
sequence (known as a “surrogate pair”) of two 16-bit values. This
means that a JavaScript string of length 2 (two 16-bit values) might
represent only a single Unicode character:

var p = "π"; // π is 1 character with 16-bit codepoint 0x03c0
var e = "𝑒"; // 𝑒 is 1 character with 17-bit codepoint 0x1d452
p.length     // => 1: p consists of 1 16-bit element
e.length     // => 2: UTF-16 encoding of 𝑒 is 2 16-bit values: "\ud835\udc52"

The various string-manipulation methods defined by JavaScript
operate on 16-bit values, not on characters. They do not treat
surrogate pairs specially, perform no normalization of the string,
and do not even ensure that a string is well-formed UTF-16.
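
Because the built-in methods count 16-bit values, code that needs to count characters has to recognize surrogate pairs itself. The following is a minimal sketch of one way to do that. countCodePoints is a hypothetical helper written for this example, not a built-in method, and it makes no attempt to validate or normalize its input:

```javascript
// Count Unicode characters (codepoints) rather than 16-bit values.
// countCodePoints is a hypothetical helper, not part of JavaScript itself.
function countCodePoints(s) {
    var count = 0;
    for (var i = 0; i < s.length; i++) {
        var unit = s.charCodeAt(i);
        // A high surrogate (0xD800-0xDBFF) followed by a low surrogate
        // (0xDC00-0xDFFF) encodes one character: skip the second half.
        if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < s.length) {
            var next = s.charCodeAt(i + 1);
            if (next >= 0xDC00 && next <= 0xDFFF) i++;
        }
        count++;
    }
    return count;
}

var mathE = "\ud835\udc52";       // U+1D452 written as its surrogate pair
mathE.length;                     // => 2: two 16-bit values
countCodePoints(mathE);           // => 1: one character
countCodePoints("π" + mathE);     // => 2
mathE.charAt(0);                  // => "\ud835": half of the pair, not a character
```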

String Literals

To include a string literally in a JavaScript program, simply
enclose the characters of the string within a matched pair of single
or double quotes (' or "). Double-quote characters may be
contained within strings delimited by single-quote characters, and
single-quote characters may be contained within strings delimited by
double quotes. Here are examples of string literals:

""  // The empty string: it has zero characters
'testing'
"3.14"
'name="myform"'
"Wouldn't you prefer O'Reilly's book?"
"This string\nhas two lines"
"π is the ratio of a circle's circumference to its diameter"

In ECMAScript 3, string literals must be written on a single
line. In ECMAScript 5, however, you can break a string literal
across multiple lines by ending each line but the last with a
backslash (\). Neither the
backslash nor the line terminator that follows it is part of the
string literal. If you need to include a newline character in a
string literal, use the character sequence \n (documented below):

"two\nlines"   // A string representing 2 lines written on one line
// A one-line string written on 3 lines. ECMAScript 5 only.
// Each backslash must be the last character on its line.
"one\
 long\
 line"

Note that when you use single quotes to delimit your strings,
you must be careful with English contractions and possessives, such
as can’t and O’Reilly’s.
Since the apostrophe is the same as the single-quote character, you
must use the backslash character (\) to “escape” any apostrophes that appear
in single-quoted strings (escapes are explained in the next
section).

In client-side JavaScript programming, JavaScript code may
contain strings of HTML code, and HTML code may contain strings of
JavaScript code. Like JavaScript, HTML uses either single or double
quotes to delimit its strings. Thus, when combining JavaScript and
HTML, it is a good idea to use one style of quotes for JavaScript
and the other style for HTML. In the following example, the string
“Thank you” is single-quoted within a JavaScript expression, which
is then double-quoted within an HTML
event-handler attribute:

<button onclick="alert('Thank you')">Click Me</button>
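
The same convention works in the other direction, when JavaScript code contains a string of HTML. The sketch below is only an illustration: the markup and the variable names label and html are hypothetical.

```javascript
// Single quotes delimit the JavaScript string, so the double quotes
// used for the HTML attribute need no escaping.
var label = "Click Me";                          // hypothetical button text
var html = '<button type="button">' + label + '</button>';

html;   // => '<button type="button">Click Me</button>'
```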

Escape Sequences in String Literals

The backslash character (\)
has a special purpose in JavaScript strings. Combined with the
character that follows it, it represents a character that is not
otherwise representable within the string. For example, \n is an escape
sequence that represents a newline character.

Another example, mentioned above, is the \' escape, which represents the single
quote (or apostrophe) character. This escape sequence is useful when
you need to include an apostrophe in a string literal that is
contained within single quotes. You can see why these are called
escape sequences: the backslash allows you to escape from the usual
interpretation of the single-quote character. Instead of using it to
mark the end of the string, you use it as an apostrophe:

'You\'re right, it can\'t be a quote'

Table 3-1 lists the JavaScript escape
sequences and the characters they represent. Two escape sequences
are generic and can be used to represent any character by specifying
its Latin-1 or Unicode character code as a hexadecimal number. For
example, the sequence \xA9
represents the copyright symbol, which has the Latin-1 encoding
given by the hexadecimal number A9. Similarly, the \u escape represents an arbitrary Unicode
character specified by four hexadecimal digits; \u03c0 represents the character π, for example.
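
A brief sketch of the two generic escapes follows. The variable names copyright and pi are chosen only for this example:

```javascript
var copyright = "\xA9";      // Latin-1 code A9
var pi = "\u03c0";           // Unicode code 03C0

copyright === "©";           // => true
copyright === "\u00A9";      // => true: the same character written with \u
pi === "π";                  // => true
pi.length;                   // => 1: one 16-bit value
copyright.charCodeAt(0).toString(16);   // => "a9"
```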

Table 3-1. JavaScript escape sequences

Sequence    Character represented
\0          The NUL character (\u0000)
\b          Backspace (\u0008)
\t          Horizontal tab (\u0009)
\n          Newline (\u000A)
\v          Vertical tab (\u000B)
\f          Form feed (\u000C)
\r          Carriage return (\u000D)
\"          Double quote (\u0022)
\'          Apostrophe or single quote (\u0027)
\\          Backslash (\u005C)
\xXX        The Latin-1 character specified by the two hexadecimal digits XX
\uXXXX      The Unicode character specified by the four hexadecimal digits XXXX

If the \ character precedes
any character other than those shown in Table 3-1, the backslash is simply ignored
(although future versions of the language may, of course, define new
escape sequences). For example, \# is the same as #. Finally, as noted above, ECMAScript 5
allows a backslash before a line break to break a string literal
across multiple lines.
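
The following sketch checks a few of these escapes by looking at the character codes they produce. The variable names are hypothetical and serve only this illustration:

```javascript
var tab = "\t";
var newline = "\n";
var verticalTab = "\v";
var backslash = "\\";
var hash = "\#";             // not a defined escape: the backslash is ignored

tab.charCodeAt(0);           // => 9   (\u0009)
newline.charCodeAt(0);       // => 10  (\u000A)
verticalTab.charCodeAt(0);   // => 11  (\u000B)
backslash.length;            // => 1: two characters in the source, one in the string
hash === "#";                // => true
```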

Working with Strings

One of the built-in features of JavaScript is the ability to
concatenate strings. If you use the + operator with numbers, it adds them. But
if you use this operator on strings, it joins them by appending the
second to the first. For example:

msg = "Hello, " + "world";   // Produces the string "Hello, world"
greeting = "Welcome to my blog," + " " + name;
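
The difference between the two behaviors of + is easiest to see side by side. In this sketch, visitor is a hypothetical value standing in for the name variable used above:

```javascript
var sum = 1 + 2;             // => 3: numbers are added
var joined = "1" + "2";      // => "12": strings are concatenated

var visitor = "Ada";         // hypothetical value for illustration
var welcome = "Welcome to my blog," + " " + visitor;
welcome;                     // => "Welcome to my blog, Ada"
```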

To determine the length of a string—the number of 16-bit
values it contains—use the length
property of the string. Determine the length of a string s like this:

s.length

In addition to this length
property, there are a number of methods you can invoke on strings
(as always, see the reference section for complete details):

var s = "hello, world"        // Start with some text.
s.charAt(0)                   // => "h": the first character.
s.charAt(s.length-1)          // => "d": the last character.
s.substring(1,4)              // => "ell": the 2nd, 3rd and 4th characters.
s.slice(1,4)                  // => "ell": same thing
s.slice(-3)                   // => "rld": last 3 characters
s.indexOf("l")                // => 2: position of first letter l.
s.lastIndexOf("l")            // => 10: position of last letter l.
s.indexOf("l", 3)             // => 3: position of first "l" at or after 3
s.split(", ")                 // => ["hello", "world"] split into substrings
s.replace("h", "H")           // => "Hello, world": replaces the first instance only
s.toUpperCase()               // => "HELLO, WORLD"

Remember that strings are immutable in JavaScript. Methods
like replace() and toUpperCase() return new strings: they do not
modify the string on which they are invoked.
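
A short sketch makes the immutability visible. The variable names are hypothetical:

```javascript
var original = "hello, world";
var upper = original.toUpperCase();
var replaced = original.replace("h", "J");

upper;        // => "HELLO, WORLD": a new string
replaced;     // => "Jello, world": another new string
original;     // => "hello, world": the original is unchanged
```

To keep a result, assign the returned string to a variable, as the sketch does with upper and replaced.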

In ECMAScript 5, strings can be treated like read-only arrays,
and you can access individual characters (16-bit values) from a
string using square brackets instead of the charAt() method:

s = "hello, world";
s[0]                  // => "h"
s[s.length-1]         // => "d"

Mozilla-based web browsers such as Firefox have allowed
strings to be indexed in this way for a long time. Most modern
browsers (with the notable exception of IE) followed Mozilla’s lead
even before this feature was standardized in ECMAScript 5.
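
The "read-only" part of this description can be sketched as follows. The variable names are hypothetical. The try/catch is there because the attempted assignment is silently ignored in non-strict code but throws a TypeError in strict code:

```javascript
var salutation = "hello";

try {
    salutation[0] = "J";     // ignored in non-strict code, TypeError in strict code
} catch (err) {
    // Reached only when this code runs in strict mode.
}

salutation;                              // => "hello": the string is unchanged
salutation[1];                           // => "e"
salutation[salutation.length];           // => undefined: no such element
salutation.charAt(salutation.length);    // => "": charAt() returns an empty string instead
```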

Pattern Matching

JavaScript defines a RegExp() constructor for creating objects
that represent textual patterns. These patterns are described with
regular expressions, and JavaScript adopts
Perl’s syntax for regular expressions. Both strings and RegExp
objects have methods for performing pattern matching and
search-and-replace operations using regular expressions.

RegExps are not one of the fundamental types of JavaScript.
Like Dates, they are simply a specialized kind of object, with a
useful API. The regular expression grammar is complex and the API is
nontrivial. They are documented in detail in Chapter 10. Because RegExps are powerful and commonly used
for text processing, however, this section provides a brief
overview.

Although RegExps are not one of the fundamental data types in
the language, they do have a literal syntax and can be encoded
directly into JavaScript programs. Text between a pair of slashes
constitutes a regular expression literal. The second slash in the
pair can also be followed by one or more letters, which modify the
meaning of the pattern. For example:

/^HTML/              // Match the letters H T M L at the start of a string
/[1-9][0-9]*/        // Match a non-zero digit, followed by any # of digits
/\bjavascript\b/i    // Match "javascript" as a word, case-insensitive
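
The same pattern can also be built with the RegExp() constructor mentioned earlier. In this sketch (the variable names are hypothetical), note that backslashes must be doubled when the pattern is written as a string literal, and that the modifier letters are passed as a second argument:

```javascript
var literal = /\bjavascript\b/i;
var constructed = new RegExp("\\bjavascript\\b", "i");

literal.source === constructed.source;          // => true: the same pattern text
constructed.ignoreCase;                         // => true: the i modifier is set
literal.test("Learning JavaScript is fun");     // => true
constructed.test("Learning JavaScript is fun"); // => true
literal.test("javascripts");                    // => false: no word boundary after "javascript"
```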

RegExp objects define a number of useful methods, and strings
also have methods that accept RegExp arguments. For example:

var text = "testing: 1, 2, 3";   // Sample text
var pattern = /\d+/g             // Matches all instances of one or more digits
pattern.test(text)               // => true: a match exists
text.search(pattern)             // => 9: position of first match
text.match(pattern)              // => ["1", "2", "3"]: array of all matches
text.replace(pattern, "#");      // => "testing: #, #, #"
text.split(/\D+/);               // => ["","1","2","3"]: split on non-digits
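
The g modifier matters for replace(). As a small sketch (list is a hypothetical sample string), compare a string pattern, a RegExp without g, and a RegExp with g:

```javascript
var list = "a-b-c";

var firstOnly = list.replace("-", "+");     // => "a+b-c": a string pattern replaces the first match only
var noFlag = list.replace(/-/, "+");        // => "a+b-c": so does a RegExp without g
var everyMatch = list.replace(/-/g, "+");   // => "a+b+c": g replaces every match
```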
