JavaScript – Defining Regular Expressions

In JavaScript, regular expressions are represented by RegExp
objects. RegExp objects may be created with the RegExp() constructor, of course, but they
are more often created using a special literal syntax. Just as string
literals are specified as characters within quotation marks, regular
expression literals are specified as characters within a pair of slash
(/) characters. Thus, your
JavaScript code may contain lines like this:

var pattern = /s$/;

This line creates a new RegExp object and assigns it to the
variable pattern. This particular
RegExp object matches any string that ends with the letter “s.” This
regular expression could have equivalently been defined with the
RegExp() constructor like
this:

var pattern = new RegExp("s$");
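As a minimal sketch of that equivalence, the two forms can be compared side by side. The variable names are illustrative, and the standard test() method and source property are used here only to check the result; the section itself has not introduced them.

```javascript
// Two ways to build the same pattern: a literal and the RegExp() constructor.
var fromLiteral = /s$/;
var fromConstructor = new RegExp("s$");

// Both objects hold the same pattern text...
var samePattern = fromLiteral.source === fromConstructor.source;   // true

// ...and agree on which strings end with "s".
var literalMatches = fromLiteral.test("regular expressions");      // true
var constructorMatches = fromConstructor.test("regular expressions"); // true
var literalRejects = fromLiteral.test("regexp");                   // false

console.log(samePattern, literalMatches, constructorMatches, literalRejects);
```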

RegExp Literals and Object Creation

Literals of primitive type, like strings and numbers, evaluate
(obviously) to the same value each time they are encountered in a
program. Object literals (or initializers) such as {} and [] create a new object each time they are
encountered. If you write var a = [] in the body of a loop, for example, each iteration of
the loop will create a new empty array.

Regular expression literals are a special case. The ECMAScript
3 specification says that a RegExp literal is converted to a RegExp
object when the code is parsed, and each evaluation of the code
returns the same object. The ECMAScript 5 specification reverses
this and requires that each evaluation of a RegExp literal return a new
object. IE has always implemented the ECMAScript 5 behavior and most
current browsers have now switched to it, even before they fully
implement the standard.
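One way to observe the ECMAScript 5 behavior is to evaluate the same literal twice and compare the results. This is a sketch only: makePattern is a hypothetical helper, and the result shown assumes an engine that follows ECMAScript 5 or later.

```javascript
// Hypothetical helper: each call evaluates the same RegExp literal again.
function makePattern() {
    return /s$/;
}

var first = makePattern();
var second = makePattern();

// Under ECMAScript 5 semantics each evaluation yields a distinct object,
// even though both objects describe the same pattern.
var sameObject = first === second;                 // false in an ES5 engine
var samePatternText = first.source === second.source; // true

console.log(sameObject, samePatternText);
```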

Regular-expression pattern specifications consist of a series of
characters. Most characters, including all alphanumeric characters,
simply describe characters to be matched literally. Thus, the regular
expression /java/ matches any
string that contains the substring “java”. Other characters in regular
expressions are not matched literally but have special significance.
For example, the regular expression /s$/ contains two characters. The first,
“s”, matches itself literally. The second, “$”, is a special
metacharacter that matches the end of a string. Thus, this regular
expression matches any string that contains the letter “s” as its last
character.
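The difference between a purely literal pattern and one that ends in a metacharacter can be checked with a short sketch. The variable names are illustrative, and test() is used only to report whether a match exists.

```javascript
var containsJava = /java/;  // literal characters: matches anywhere in the string
var endsWithS = /s$/;       // "s" followed by the end-of-string metacharacter

var javaAnywhere = containsJava.test("I write javascript"); // true
var javaWrongCase = containsJava.test("JAVA");  // false: matching is case-sensitive by default
var sAtEnd = endsWithS.test("regular expressions"); // true
var sNotAtEnd = endsWithS.test("said");         // false: the "s" is not the last character

console.log(javaAnywhere, javaWrongCase, sAtEnd, sNotAtEnd);
```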

The following sections describe the various characters and
metacharacters used in JavaScript regular expressions.

Literal Characters

As noted earlier, all alphabetic characters and digits match
themselves literally in regular expressions. JavaScript
regular-expression syntax also supports certain nonalphabetic
characters through escape sequences that begin with a backslash
(\). For example, the sequence
\n matches a literal newline
character in a string. Table 10-1 lists
these characters.

Table 10-1. Regular-expression literal characters

Character               Matches
Alphanumeric character  Itself
\0                      The NUL character (\u0000)
\t                      Tab (\u0009)
\n                      Newline (\u000A)
\v                      Vertical tab (\u000B)
\f                      Form feed (\u000C)
\r                      Carriage return (\u000D)
\xnn                    The Latin character specified by the hexadecimal number nn; for example, \x0A is the same as \n
\uxxxx                  The Unicode character specified by the hexadecimal number xxxx; for example, \u0009 is the same as \t
\cX                     The control character ^X; for example, \cJ is equivalent to the newline character \n
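The equivalences listed in the table can be spot-checked with a few one-line tests. This is an illustrative sketch; the variable names are made up, and the subject strings use ordinary JavaScript string escapes to produce the characters being matched.

```javascript
// Each pattern on the left should match the character built on the right.
var hexIsNewline = /\x0A/.test("\n");        // true: \x0A is the same as \n
var unicodeIsTab = /\u0009/.test("\t");      // true: \u0009 is the same as \t
var controlJIsNewline = /\cJ/.test("\n");    // true: \cJ is the newline character
var nulMatches = /\0/.test("\u0000");        // true: \0 is the NUL character
var verticalTab = /\v/.test("\u000B");       // true

console.log(hexIsNewline, unicodeIsTab, controlJIsNewline, nulMatches, verticalTab);
```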

A number of punctuation characters have special meanings in
regular expressions. They are:

^ $ . * + ? = ! : | \ / ( ) [ ] { }

The meanings of these characters are discussed in the sections
that follow. Some of these characters have special meaning only
within certain contexts of a regular expression and are treated
literally in other contexts. As a general rule, however, if you want
to include any of these punctuation characters literally in a
regular expression, you must precede them with a \. Other punctuation characters, such as
quotation marks and @, do not
have special meaning and simply match themselves literally in a
regular expression.

If you can’t remember exactly which punctuation characters
need to be escaped with a backslash, you may safely place a
backslash before any punctuation character. On the other hand, note
that many letters and numbers have special meaning when preceded by
a backslash, so any letters or numbers that you want to match
literally should not be escaped with a backslash. To include a
backslash character literally in a regular expression, you must
escape it with a backslash, of course. For example, the following
regular expression matches any string that includes a backslash:
/\\/.
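A short sketch of these escaping rules follows. The sample strings and variable names are illustrative; note that inside a JavaScript string literal a backslash must itself be written as "\\".

```javascript
// A literal backslash in the pattern must be escaped.
var hasBackslash = /\\/.test("C:\\temp");       // true: the string is C:\temp
var noBackslash = /\\/.test("C:/temp");         // false

// "$" is special: escaped it matches a dollar sign, unescaped it anchors to the end.
var escapedDollar = /\$5/.test("costs $5");     // true
var bareDollar = /$5/.test("costs $5");         // false: nothing can follow the end of the string

// "@" has no special meaning, so escaping it is harmless but unnecessary.
var escapedAt = /\@/.test("user@example.com");  // true
var plainAt = /@/.test("user@example.com");     // true

console.log(hasBackslash, noBackslash, escapedDollar, bareDollar, escapedAt, plainAt);
```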

Character Classes

Individual literal characters can be combined into
character classes by placing them within square
brackets. A character class matches any one character that is
contained within it. Thus, the regular expression /[abc]/ matches any one of the letters a,
b, or c. Negated character classes can also be defined; these match
any character except those contained within the brackets. A negated
character class is specified by placing a caret (^) as the first character inside the left
bracket. The regexp /[^abc]/
matches any one character other than a, b, or c. Character classes
can use a hyphen to indicate a range of characters. To match any one
lowercase character from the Latin alphabet, use /[a-z]/ and to match
any letter or digit from the Latin alphabet, use /[a-zA-Z0-9]/.

Because certain character classes are commonly used, the
JavaScript regular-expression syntax includes special characters and
escape sequences to represent these common classes. For example,
\s matches the space character,
the tab character, and any other Unicode whitespace character;
\S matches any character that is
not Unicode whitespace. Table 10-2 lists these characters and summarizes
character-class syntax. (Note that several of these character-class
escape sequences match only ASCII characters and have not been
extended to work with Unicode characters. You can, however,
explicitly define your own Unicode character classes; for example,
/[\u0400-\u04FF]/ matches any one
Cyrillic character.)

Table 10-2. Regular expression character classes

Character  Matches
[...]      Any one character between the brackets.
[^...]     Any one character not between the brackets.
.          Any character except newline or another Unicode line terminator.
\w         Any ASCII word character. Equivalent to [a-zA-Z0-9_].
\W         Any character that is not an ASCII word character. Equivalent to [^a-zA-Z0-9_].
\s         Any Unicode whitespace character.
\S         Any character that is not Unicode whitespace. Note that \w and \S are not the same thing.
\d         Any ASCII digit. Equivalent to [0-9].
\D         Any character other than an ASCII digit. Equivalent to [^0-9].
[\b]       A literal backspace (special case).

Note that the special character-class escapes can be used
within square brackets. \s
matches any whitespace character, and \d matches any digit, so /[\s\d]/ matches any one whitespace
character or digit. Note that there is one special case. As you’ll
see later, the \b escape has a
special meaning. When used within a character class, however, it
represents the backspace character. Thus, to represent a backspace
character literally in a regular expression, use the character class
with one element: /[\b]/.
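The character-class rules above can be exercised with a handful of checks. This is a sketch with made-up sample strings; the results for \w and \s assume the ASCII-only and Unicode-aware behavior described in this section.

```javascript
var oneOfAbc = /[abc]/.test("xbz");             // true: "b" is in the class
var noneOutside = /[^abc]/.test("abc");         // false: every character is a, b, or c
var lowercaseOnly = /[a-z]/.test("ABC");        // false: ranges are case-sensitive
var cyrillic = /[\u0400-\u04FF]/.test("\u0416"); // true: U+0416 is a Cyrillic letter

// Class escapes work inside square brackets.
var spaceOrDigit = /[\s\d]/.test("a b");        // true: the space matches
var neither = /[\s\d]/.test("ab");              // false

// \w is ASCII-only, \s covers Unicode whitespace, and [\b] is a backspace.
var accentedWord = /\w/.test("\u00E9");         // false: U+00E9 is not an ASCII word character
var noBreakSpace = /\s/.test("\u00A0");         // true: no-break space is Unicode whitespace
var backspace = /[\b]/.test("\u0008");          // true

console.log(oneOfAbc, noneOutside, lowercaseOnly, cyrillic, spaceOrDigit,
            neither, accentedWord, noBreakSpace, backspace);
```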

Repetition

With the regular expression syntax you’ve learned so far, you
can describe a two-digit number as /\d\d/ and a four-digit number as /\d\d\d\d/. But you don’t have any way to
describe, for example, a number that can have any number of digits
or a string of three letters followed by an optional digit. These
more complex patterns use regular-expression syntax that specifies
how many times an element of a regular expression may be
repeated.

The characters that specify repetition always follow the
pattern to which they are being applied. Because certain types of
repetition are quite commonly used, there are special characters to
represent these cases. For example, + matches one or more occurrences of the
previous pattern. Table 10-3 summarizes the
repetition syntax.

Table 10-3. Regular expression repetition characters

Character  Meaning
{n,m}      Match the previous item at least n times but no more than m times.
{n,}       Match the previous item n or more times.
{n}        Match exactly n occurrences of the previous item.
?          Match zero or one occurrences of the previous item. That is, the previous item is optional. Equivalent to {0,1}.
+          Match one or more occurrences of the previous item. Equivalent to {1,}.
*          Match zero or more occurrences of the previous item. Equivalent to {0,}.

The following lines show some examples:

/\d{2,4}/    // Match between two and four digits
/\w{3}\d?/   // Match exactly three word characters and an optional digit
/\s+java\s+/ // Match "java" with one or more whitespace characters before and after
/[^(]*/      // Match zero or more characters that are not open parentheses

Be careful when using the *
and ? repetition characters.
Since these characters may match zero instances of whatever precedes
them, they are allowed to match nothing. For example, the regular
expression /a*/ actually matches
the string “bbbb” because the string contains zero occurrences of
the letter a!
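To see what these repetition patterns actually consume, the sketch below applies them to sample strings. The strings and variable names are illustrative, and it relies on the standard String match() method, which returns an array whose element 0 is the matched text; that method is not described in this section.

```javascript
var digits = "abc 12345".match(/\d{2,4}/)[0];        // "1234": at most four digits
var wordAndDigit = "abc1".match(/\w{3}\d?/)[0];       // "abc1": the optional digit is present
var wordOnly = "abcd".match(/\w{3}\d?/)[0];           // "abc": the optional digit is absent
var padded = "I like java a lot".match(/\s+java\s+/)[0]; // " java "
var beforeParen = "f(x)".match(/[^(]*/)[0];           // "f"

// A pattern that may match nothing succeeds even when the item never occurs.
var matchesNoA = /a*/.test("bbbb");                   // true
var emptyMatch = "bbbb".match(/a*/)[0];               // "": zero occurrences of "a"

console.log(digits, wordAndDigit, wordOnly, padded, beforeParen, matchesNoA, emptyMatch);
```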

Nongreedy repetition

The repetition characters listed in Table 10-3 match as many times as possible while
still allowing any following parts of the regular expression to
match. We say that this repetition is “greedy.” It is also
possible to specify that repetition should be done in a nongreedy
way. Simply follow the repetition character or characters with a
question mark: ??, +?, *?, or even {1,5}?. For example, the regular
expression /a+/ matches one or
more occurrences of the letter a. When applied to the string
“aaa”, it matches all three letters. But /a+?/ matches one or more occurrences of
the letter a, matching as few characters as necessary. When
applied to the same string, this pattern matches only the first
letter a.

Using nongreedy repetition may not always produce the
results you expect. Consider the pattern /a+b/, which matches one or more a’s,
followed by the letter b. When applied to the string “aaab”, it
matches the entire string. Now let’s use the nongreedy version:
/a+?b/. This should match the
letter b preceded by the fewest number of a’s possible. When
applied to the same string “aaab”, you might expect it to match
only one a and the last letter b. In fact, however, this pattern
matches the entire string, just like the greedy version of the
pattern. This is because regular-expression pattern matching is
done by finding the first position in the string at which a match
is possible. Since a match is possible starting at the first
character of the string, shorter matches starting at subsequent
characters are never even considered.
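The greedy and nongreedy cases just described can be compared directly. As before, this sketch uses the standard match() method, which this section has not covered, to show the matched text; the variable names are illustrative.

```javascript
var greedy = "aaa".match(/a+/)[0];         // "aaa": as many as possible
var nongreedy = "aaa".match(/a+?/)[0];     // "a": as few as necessary

// Both versions match the whole string here, because matching starts at the
// first position where any match is possible.
var greedyWithB = "aaab".match(/a+b/)[0];      // "aaab"
var nongreedyWithB = "aaab".match(/a+?b/)[0];  // "aaab"

console.log(greedy, nongreedy, greedyWithB, nongreedyWithB);
```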

Alternation, Grouping, and References

The regular-expression grammar includes special characters for
specifying alternatives, grouping subexpressions, and referring to
previous subexpressions. The |
character separates alternatives. For example, /ab|cd|ef/ matches the string “ab” or the
string “cd” or the string “ef”. And /\d{3}|[a-z]{4}/ matches either three
digits or four lowercase
letters.

Note that alternatives are considered left to right until a
match is found. If the left alternative matches, the right
alternative is ignored, even if it would have produced a “better”
match. Thus, when the pattern /a|ab/ is applied to the string “ab”, it
matches only the first letter.
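The left-to-right rule is easy to confirm in a small sketch. The reordered pattern /ab|a/ is an illustrative addition, not taken from the text, and match() is used only to display what was matched.

```javascript
var anyOfThree = "xx cd".match(/ab|cd|ef/)[0];            // "cd"
var digitsOrLetters = "abcd 123".match(/\d{3}|[a-z]{4}/)[0]; // "abcd": the earliest position wins

// The left alternative is tried first, even when the right one is longer.
var leftFirst = "ab".match(/a|ab/)[0];                    // "a"
var reordered = "ab".match(/ab|a/)[0];                    // "ab": longer alternative listed first

console.log(anyOfThree, digitsOrLetters, leftFirst, reordered);
```

Listing the longer alternative first is one way to get the longer match when two alternatives share a prefix.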

Parentheses have several purposes in regular expressions. One
purpose is to group separate items into a single subexpression so
that the items can be treated as a single unit by |, *,
+, ?, and so on. For example, /java(script)?/ matches “java” followed by
the optional “script”. And /(ab|cd)+|ef/ matches either the string
“ef” or one or more repetitions of either of the strings “ab” or
“cd”.
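A brief sketch of grouping for repetition and alternation follows; the sample strings and variable names are illustrative, and match() is used only to display the matched text.

```javascript
// The group makes "script" optional as a unit.
var shortName = "java".match(/java(script)?/)[0];           // "java"
var longName = "javascript".match(/java(script)?/)[0];      // "javascript"

// The + applies to the whole group, so "ab" and "cd" may repeat in any order.
var repeated = "abcdab".match(/(ab|cd)+|ef/)[0];            // "abcdab"
var alternative = "ef".match(/(ab|cd)+|ef/)[0];             // "ef"

console.log(shortName, longName, repeated, alternative);
```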

Another purpose of parentheses in regular expressions is to
define subpatterns within the complete pattern. When a regular
expression is successfully matched against a target string, it is
possible to extract the portions of the target string that matched
any particular parenthesized subpattern. (You’ll see how these
matching substrings are obtained later in the chapter.) For example,
suppose you are looking for one or more lowercase letters followed
by one or more digits. You might use the pattern /[a-z]+\d+/. But suppose you only really
care about the digits at the end of each match. If you put that part
of the pattern in parentheses (/[a-z]+(\d+)/), you can extract the
digits from any matches you find, as explained later.
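As a preview, and only as a sketch, the standard String match() method is one way to reach a parenthesized subpattern: for a non-global pattern it returns an array whose element 0 is the whole match and whose later elements are the captured groups. The sample string and variable names are illustrative.

```javascript
var result = "abc123".match(/[a-z]+(\d+)/);

var wholeMatch = result[0];   // "abc123": letters followed by digits
var digitsOnly = result[1];   // "123": text matched by the parenthesized (\d+)

console.log(wholeMatch, digitsOnly);
```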

A related use of parenthesized subexpressions is to allow you
to refer back to a subexpression later in the same regular
expression. This is done by following a \ character by a digit or digits. The
digits refer to the position of the parenthesized subexpression
within the regular expression. For example, \1 refers back to the first subexpression,
and \3 refers to the third. Note
that, because subexpressions can be nested within others, it is the
position of the left parenthesis that is counted. In the following
regular expression, for example, the nested subexpression ([Ss]cript) is referred to as \2:

/([Jj]ava([Ss]cript)?)\sis\s(fun\w*)/

A reference to a previous subexpression of a regular
expression does not refer to the pattern for
that subexpression but rather to the text that matched the pattern.
Thus, references can be used to enforce a constraint that separate
portions of a string contain exactly the same characters. For
example, the following regular expression matches zero or more
characters within single or double quotes. However, it does not
require the opening and closing quotes to match (i.e., both single
quotes or both double quotes):

/['"][^'"]*['"]/

To require the quotes to match, use a reference:

/(['"])[^'"]*\1/

The \1 matches whatever the
first parenthesized subexpression matched. In this example, it
enforces the constraint that the closing quote match the opening
quote. This regular expression does not allow single quotes within
double-quoted strings or vice versa. It is not legal to use a
reference within a character class, so you cannot write:

/(['"])[^\1]*\1/

Later in this chapter, you’ll see that this kind of reference
to a parenthesized subexpression is a powerful feature of
regular-expression search-and-replace operations.
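The effect of the \1 reference on the two quote patterns can be checked with a short sketch. The names loose and strict and the sample strings are illustrative.

```javascript
var loose = /['"][^'"]*['"]/;    // any opening quote, any closing quote
var strict = /(['"])[^'"]*\1/;   // closing quote must repeat the opening one

var looseAcceptsMixed = loose.test("'mismatched\"");    // true
var strictRejectsMixed = strict.test("'mismatched\"");  // false
var strictSingle = strict.test("'single'");             // true
var strictDouble = strict.test('"double"');             // true

console.log(looseAcceptsMixed, strictRejectsMixed, strictSingle, strictDouble);
```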

It is also possible to group items in a regular expression
without creating a numbered reference to those items. Instead of
simply grouping the items within ( and ), begin the group with (?: and end it with ). Consider the following pattern, for
example:

/([Jj]ava(?:[Ss]cript)?)\sis\s(fun\w*)/

Here, the subexpression (?:[Ss]cript) is used simply for grouping,
so the ? repetition character can
be applied to the group. These modified parentheses do not produce a
reference, so in this regular expression, \2 refers to the text matched by (fun\w*).
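The change in group numbering can be observed by running both versions of the pattern against the same text. This sketch assumes the standard match() method, whose result array holds the whole match at index 0 and one entry per numbered group after it; the sample sentence and variable names are illustrative.

```javascript
var text = "JavaScript is funny";

// All three pairs of parentheses are numbered: groups 1, 2, and 3.
var capturing = text.match(/([Jj]ava([Ss]cript)?)\sis\s(fun\w*)/);
// capturing[1] is "JavaScript", capturing[2] is "Script", capturing[3] is "funny"

// With (?:...) the inner group is not numbered, so (fun\w*) becomes group 2.
var nonCapturing = text.match(/([Jj]ava(?:[Ss]cript)?)\sis\s(fun\w*)/);
// nonCapturing[1] is "JavaScript", nonCapturing[2] is "funny"

console.log(capturing.length, nonCapturing.length); // 4 and 3
```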

Table 10-4 summarizes the
regular-expression alternation, grouping, and referencing
operators.

Table 10-4. Regular expression alternation, grouping, and reference characters

Character  Meaning
|          Alternation. Match either the subexpression to the left or the subexpression to the right.
(...)      Grouping. Group items into a single unit that can be used with *, +, ?, |, and so on. Also remember the characters that match this group for use with later references.
(?:...)    Grouping only. Group items into a single unit, but do not remember the characters that match this group.
\n         Match the same characters that were matched when group number n was first matched. Groups are subexpressions within (possibly nested) parentheses. Group numbers are assigned by counting left parentheses from left to right. Groups formed with (?: are not numbered.

Specifying Match Position

As described earlier, many elements of a regular expression
match a single character in a string. For example, \s matches a single character of
whitespace. Other regular expression elements match the positions
between characters, instead of actual characters. \b, for example, matches a word
boundary—the boundary between a \w (ASCII word character) and a \W (nonword character), or the boundary
between an ASCII word character and the beginning or end of a
string.[18] Elements such as \b
do not specify any characters to be used in a matched string; what
they do specify, however, are legal positions at which a match can occur.
Sometimes these elements are called regular-expression anchors because
they anchor the pattern to a specific position in the search string.
The most commonly used anchor elements are ^, which ties the pattern to the beginning
of the string, and $, which
anchors the pattern to the end of the string.

For example, to match the word “JavaScript” on a line by
itself, you can use the regular expression /^JavaScript$/. If you want to search for
“Java” as a word by itself (not as a prefix, as it is in
“JavaScript”), you can try the pattern /\sJava\s/, which requires a space before
and after the word. But there are two problems with this solution.
First, it does not match “Java” at the beginning or the end of a
string, but only if it appears with space on either side. Second,
when this pattern does find a match, the matched string it returns
has leading and trailing spaces, which is not quite what’s needed.
So instead of matching actual space characters with \s, match (or anchor to) word boundaries
with \b. The resulting expression
is /\bJava\b/. The element
\B anchors the match to a
location that is not a word boundary. Thus, the pattern /\B[Ss]cript/ matches “JavaScript” and
“postscript”, but not “script” or “Scripting”.
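The anchor examples in this passage can be verified with a few checks. This is a sketch; the sample sentences and variable names are illustrative, and match() is used only to show the matched text.

```javascript
var wholeLine = /^JavaScript$/.test("JavaScript");           // true
var notWholeLine = /^JavaScript$/.test("JavaScript rocks");  // false

// Matching real whitespace includes it in the match and fails at the string edges.
var withSpaces = "I like Java a lot".match(/\sJava\s/)[0];   // " Java "
var aloneFails = /\sJava\s/.test("Java");                    // false

// Word boundaries match positions, not characters.
var withBoundaries = "I like Java a lot".match(/\bJava\b/)[0]; // "Java"
var aloneWorks = /\bJava\b/.test("Java");                    // true
var prefixRejected = /\bJava\b/.test("JavaScript");          // false

// \B requires that the match not start at a word boundary.
var insideWord = /\B[Ss]cript/.test("JavaScript");           // true
var atWordStart = /\B[Ss]cript/.test("Scripting");           // false

console.log(wholeLine, notWholeLine, withSpaces, aloneFails, withBoundaries,
            aloneWorks, prefixRejected, insideWord, atWordStart);
```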

You can also use arbitrary regular expressions as anchor
conditions. If you include an expression within (?= and ) characters, it is a lookahead assertion,
and it specifies that the enclosed characters must match, without
actually matching them. For example, to match the name of a common
programming language, but only if it is followed by a colon, you
could use /[Jj]ava([Ss]cript)?(?=\:)/. This pattern
matches the word “JavaScript” in “JavaScript: The Definitive Guide”,
but it does not match “Java” in “Java in a Nutshell”, because it is
not followed by a colon.

If you instead introduce an assertion with (?!, it is a negative lookahead assertion,
which specifies that the following characters must not match. For
example, /Java(?!Script)([A-Z]\w*)/ matches “Java”
followed by a capital letter and any number of additional ASCII word
characters, as long as “Java” is not followed by “Script”. It
matches “JavaBeans” but not “Javanese”, and it matches “JavaScrip”
but not “JavaScript” or “JavaScripter”.
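Both kinds of lookahead can be tried against the strings mentioned above. In this sketch the variable names are illustrative, and match() is used only to show that the text inside the assertion is not part of the match.

```javascript
var beforeColon = /[Jj]ava([Ss]cript)?(?=\:)/;
var title = "JavaScript: The Definitive Guide".match(beforeColon)[0]; // "JavaScript", colon excluded
var noColon = beforeColon.test("Java in a Nutshell");                 // false

var notScript = /Java(?!Script)([A-Z]\w*)/;
var beans = "JavaBeans".match(notScript)[0];       // "JavaBeans"
var lowercaseNext = notScript.test("Javanese");    // false: "n" is not a capital letter
var almostScript = notScript.test("JavaScrip");    // true: "Scrip" is not "Script"
var exactScript = notScript.test("JavaScript");    // false
var scriptPrefix = notScript.test("JavaScripter"); // false

console.log(title, noColon, beans, lowercaseNext, almostScript, exactScript, scriptPrefix);
```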

Table 10-5 summarizes
regular-expression anchors.

Table 10-5. Regular-expression anchor characters

Character  Meaning
^          Match the beginning of the string and, in multiline searches, the beginning of a line.
$          Match the end of the string and, in multiline searches, the end of a line.
\b         Match a word boundary. That is, match the position between a \w character and a \W character or between a \w character and the beginning or end of a string. (Note, however, that [\b] matches backspace.)
\B         Match a position that is not a word boundary.
(?=p)      A positive lookahead assertion. Require that the following characters match the pattern p, but do not include those characters in the match.
(?!p)      A negative lookahead assertion. Require that the following characters do not match the pattern p.

Flags

There is one final element of regular-expression grammar.
Regular-expression flags specify high-level pattern-matching rules.
Unlike the rest of regular-expression syntax, flags are specified
outside the / characters; instead
of appearing within the slashes, they appear following the second
slash. ECMAScript 5 defines three flags (later editions of the language add others). The i flag specifies that pattern matching
should be case-insensitive. The g
flag specifies that pattern matching should be global—that is, all
matches within the searched string should be found. The m flag performs pattern matching in
multiline mode. In this mode, if the string to be searched contains
newlines, the ^ and $ anchors match the beginning and end of a
line in addition to matching the beginning and end of a string. For
example, the pattern /java$/im
matches “java” as well as “Java\nis fun”.

These flags may be specified in any combination. For example,
to do a case-insensitive search for the first occurrence of the word
“java” (or “Java”, “JAVA”, etc.), you can use the case-insensitive
regular expression /\bjava\b/i.
And to find all occurrences of the word in a string, you can add the
g flag: /\bjava\b/gi.
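The three flags can be seen at work in a short sketch. The sample text and variable names are illustrative; the sketch relies on the standard behavior of match(), which returns an array of every matched substring when the pattern has the g flag.

```javascript
// m lets $ match at the end of a line, not just the end of the string.
var endOfString = /java$/im.test("java");              // true
var endOfLine = /java$/im.test("Java\nis fun");        // true
var withoutM = /java$/i.test("Java\nis fun");          // false: "Java" is not at the end of the string

var text = "Java, java, JAVA, javascript";

// Without g, only the first occurrence is reported.
var firstOnly = text.match(/\bjava\b/i)[0];            // "Java"

// With g, every whole-word occurrence is found; "javascript" is excluded by \b.
var allMatches = text.match(/\bjava\b/gi);             // ["Java", "java", "JAVA"]

console.log(endOfString, endOfLine, withoutM, firstOnly, allMatches.join(","));
```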

Table 10-6 summarizes these
regular-expression flags. Note that you’ll see more about the
g flag later in this chapter,
when the String and RegExp methods are used to actually perform
matches.

Table 10-6. Regular-expression flags

Character  Meaning
i          Perform case-insensitive matching.
g          Perform a global match; that is, find all matches rather than stopping after the first match.
m          Multiline mode. ^ matches beginning of line or beginning of string, and $ matches end of line or end of string.


[18] Except within a character class (square brackets), where
\b matches the backspace
character.
