Chapter 9. Functions and Operators
Table 9-15. Regular Expression Constraints
Constraint    Description
^             matches at the beginning of the string
$             matches at the end of the string
(?=re)        positive lookahead matches at any point where a substring matching re begins (AREs only)
(?!re)        negative lookahead matches at any point where no substring matching re begins (AREs only)
Lookahead constraints cannot contain back references (see Section 9.7.3.3), and all parentheses within
them are considered non-capturing.
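The lookahead constraints in Table 9-15 can be tried out with a few queries. This is a minimal sketch, assuming any PostgreSQL session; the sample strings are made up for illustration, and the comments show the results the definitions above imply.

```sql
-- Positive lookahead: foo matches only where bar follows; bar itself is not consumed.
SELECT substring('foobar' from 'foo(?=bar)');   -- foo
SELECT 'foobaz' ~ 'foo(?=bar)';                 -- false

-- Negative lookahead: foo matches only where bar does not follow.
SELECT 'foobaz' ~ 'foo(?!bar)';                 -- true
SELECT 'foobar' ~ 'foo(?!bar)';                 -- false
```

Because the lookahead matches an empty string, the substring returned by the first query is only foo.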
9.7.3.2. Bracket Expressions
A bracket expression is a list of characters enclosed in []. It normally matches any single character
from the list (but see below). If the list begins with ^, it matches any single character not from the
rest of the list. If two characters in the list are separated by -, this is shorthand for the full range
of characters between those two (inclusive) in the collating sequence, e.g., [0-9] in ASCII matches
any decimal digit. It is illegal for two ranges to share an endpoint, e.g., a-c-e. Ranges are very
collating-sequence-dependent, so portable programs should avoid relying on them.
To include a literal ] in the list, make it the first character (after ^, if that is used). To include a
literal -, make it the first or last character, or the second endpoint of a range. To use a literal - as
the first endpoint of a range, enclose it in [. and .] to make it a collating element (see below).
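The placement rules for ranges, ^, ] and - can be sketched with a few queries. The patterns and test strings here are illustrative assumptions, not taken from the original text.

```sql
-- A range and a digit class written as a range.
SELECT 'b7' ~ '^[a-c][0-9]$';   -- true
-- A leading ^ negates the rest of the list.
SELECT 'x' ~ '^[^a-c]$';        -- true
-- A literal ] must come first (after ^, if that is used).
SELECT ']' ~ '^[]a]$';          -- true
SELECT ']' ~ '^[^]a]$';         -- false
-- A literal - can be placed last.
SELECT '-' ~ '^[a-]$';          -- true
```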
With the exception of these characters, some combinations using [ (see next paragraphs), and escapes
(AREs only), all other special characters lose their special significance within a bracket expression.
In particular, \ is not special when following ERE or BRE rules, though it is special (as introducing
an escape) in AREs.
Within a bracket expression, a collating element (a character, a multiple-character sequence that
collates as if it were a single character, or a collating-sequence name for either) enclosed in [. and .]
stands for the sequence of characters of that collating element. The sequence is treated as a single
element of the bracket expression's list. This allows a bracket expression containing a multiple-character
collating element to match more than one character, e.g., if the collating sequence includes a ch
collating element, then the RE [[.ch.]]*c matches the first five characters of chchcc.
Note: PostgreSQL currently does not support multi-character collating elements. This information
describes possible future behavior.
Within a bracket expression, a collating element enclosed in [= and =] is an equivalence class,
standing for the sequences of characters of all collating elements equivalent to that one, including itself. (If
there are no other equivalent collating elements, the treatment is as if the enclosing delimiters were
[. and .].) For example, if o and ^ are the members of an equivalence class, then [[=o=]], [[=^=]],
and [o^] are all synonymous. An equivalence class cannot be an endpoint of a range.
Within a bracket expression, the name of a character class enclosed in [: and :] stands for the list
of all characters belonging to that class. Standard character class names are: alnum, alpha, blank,
cntrl, digit, graph, lower, print, punct, space, upper, xdigit. These stand for the character
classes defined in ctype. A locale can provide others. A character class cannot be used as an endpoint
of a range.
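As an illustration of the character-class syntax, the following sketch uses a few of the standard class names. The sample strings are assumptions chosen for the example; regexp_replace is called with the g flag to replace every match.

```sql
-- [[:alnum:]] accepts letters and digits, but not a space.
SELECT 'abc123' ~ '^[[:alnum:]]+$';                          -- true
SELECT 'abc 123' ~ '^[[:alnum:]]+$';                         -- false
-- [[:xdigit:]] accepts hexadecimal digits in either case.
SELECT 'ff0A' ~ '^[[:xdigit:]]+$';                           -- true
-- Collapse each run of white space into a single underscore.
SELECT regexp_replace('a b  c', '[[:space:]]+', '_', 'g');   -- a_b_c
```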
There are two special cases of bracket expressions: the bracket expressions [[:<:]] and [[:>:]]
are constraints, matching empty strings at the beginning and end of a word respectively. A word is
defined as a sequence of word characters that is neither preceded nor followed by word characters.
A word character is an alnum character (as defined by ctype) or an underscore. This is an extension,
compatible with but not specified by POSIX 1003.2, and should be used with caution in software
intended to be portable to other systems. The constraint escapes described below are usually preferable;
they are no more standard, but are easier to type.
9.7.3.3. Regular Expression Escapes
Escapes are special sequences beginning with \ followed by an alphanumeric character. Escapes
come in several varieties: character entry, class shorthands, constraint escapes, and back references.
A \ followed by an alphanumeric character but not constituting a valid escape is illegal in AREs. In
EREs, there are no escapes: outside a bracket expression, a \ followed by an alphanumeric character
merely stands for that character as an ordinary character, and inside a bracket expression, \ is an
ordinary character. (The latter is the one actual incompatibility between EREs and AREs.)
Character-entry escapes exist to make it easier to specify non-printing and other inconvenient
characters in REs. They are shown in Table 9-16.
Class-shorthand escapes provide shorthands for certain commonly-used character classes. They are
shown in Table 9-17.
A constraint escape is a constraint, matching the empty string if specific conditions are met, written
as an escape. They are shown in Table 9-18.
A back reference (\n) matches the same string matched by the previous parenthesized subexpression
specified by the number n (see Table 9-19). For example, ([bc])\1 matches bb or cc but not bc or cb.
The subexpression must entirely precede the back reference in the RE. Subexpressions are
numbered in the order of their leading parentheses. Non-capturing parentheses do not define
subexpressions.
Note: Keep in mind that an escape's leading \ will need to be doubled when entering the pattern
as an SQL string constant. For example:
'123' ~ E'^\\d{3}'    true
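The back-reference example from the text can be written as SQL with the backslash doubled inside an E'' string constant. This is a small sketch; the ^ and $ anchors and the last query's test string are additions for illustration.

```sql
-- ([bc])\1 requires the same character twice.
SELECT 'bb' ~ E'^([bc])\\1$';          -- true
SELECT 'cc' ~ E'^([bc])\\1$';          -- true
SELECT 'bc' ~ E'^([bc])\\1$';          -- false
-- Non-capturing parentheses are not numbered, so \1 refers to (ab).
SELECT 'abab' ~ E'^(?:x)?(ab)\\1$';    -- true
```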
Table 9-16. Regular Expression Character-entry Escapes
Escape        Description
\a            alert (bell) character, as in C
\b            backspace, as in C
\B            synonym for backslash (\) to help reduce the need for backslash doubling
\cX           (where X is any character) the character whose low-order 5 bits are the same as those of X, and whose other bits are all zero
\e            the character whose collating-sequence name is ESC, or failing that, the character with octal value 033
\f            form feed, as in C
\n            newline, as in C
\r            carriage return, as in C
\t            horizontal tab, as in C
\uwxyz        (where wxyz is exactly four hexadecimal digits) the character whose hexadecimal value is 0xwxyz
\Ustuvwxyz    (where stuvwxyz is exactly eight hexadecimal digits) the character whose hexadecimal value is 0xstuvwxyz
\v            vertical tab, as in C
\xhhh         (where hhh is any sequence of hexadecimal digits) the character whose hexadecimal value is 0xhhh (a single character no matter how many hexadecimal digits are used)
\0            the character whose value is 0 (the null byte)
\xy           (where xy is exactly two octal digits, and is not a back reference) the character whose octal value is 0xy
\xyz          (where xyz is exactly three octal digits, and is not a back reference) the character whose octal value is 0xyz
Hexadecimal digits are 0-9, a-f, and A-F. Octal digits are 0-7.
Numeric character-entry escapes specifying values outside the ASCII range (0-127) have meanings
dependent on the database encoding. When the encoding is UTF-8, escape values are equivalent to
Unicode code points, for example \u1234 means the character U+1234. For other multibyte
encodings, character-entry escapes usually just specify the concatenation of the byte values for the character.
If the escape value does not correspond to any legal character in the database encoding, no error will
be raised, but it will never match any data.
The character-entry escapes are always taken as ordinary characters. For example, \135 is ] in ASCII,
but \135 does not terminate a bracket expression.
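A few of the character-entry escapes from Table 9-16 can be checked directly. This sketch sticks to ASCII values so that the result does not depend on the database encoding; the patterns are written as E'' constants with doubled backslashes, and the test strings are illustrative.

```sql
-- \t in the pattern matches a tab character in the data.
SELECT E'a\tb' ~ E'^a\\tb$';    -- true
-- Hexadecimal, octal, and Unicode spellings of the letter A (value 65).
SELECT 'A' ~ E'^\\x41$';        -- true
SELECT 'A' ~ E'^\\101$';        -- true
SELECT 'A' ~ E'^\\u0041$';      -- true
```

The octal form \101 is read as a character entry here because the pattern contains no capturing parentheses that it could refer back to.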
Table 9-17. Regular Expression Class-shorthand Escapes
Escape        Description
\d            [[:digit:]]
\s            [[:space:]]
\w            [[:alnum:]_] (note underscore is included)
\D            [^[:digit:]]
\S            [^[:space:]]
\W            [^[:alnum:]_] (note underscore is included)
Within bracket expressions, \d, \s, and \w lose their outer brackets, and \D, \S, and \W are illegal.
(So, for example, [a-c\d] is equivalent to [a-c[:digit:]]. Also, [a-c\D], which is equivalent
to [a-c^[:digit:]], is illegal.)
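The class-shorthand escapes can be exercised as follows. This is a sketch with made-up test strings; patterns are E'' constants, so each backslash is doubled.

```sql
SELECT '123' ~ E'^\\d{3}$';        -- true: three digits
SELECT 'foo_bar1' ~ E'^\\w+$';     -- true: \w includes the underscore
SELECT 'foo bar' ~ E'^\\S+$';      -- false: the space is not matched by \S
-- Inside a bracket expression, \d behaves like [:digit:].
SELECT 'b7' ~ E'^[a-c\\d]+$';      -- true
```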
Table 9-18. Regular Expression Constraint Escapes
Escape        Description
\A            matches only at the beginning of the string (see Section 9.7.3.5 for how this differs from ^)
\m            matches only at the beginning of a word
\M            matches only at the end of a word
\y            matches only at the beginning or end of a word
\Y            matches only at a point that is not the beginning or end of a word
\Z            matches only at the end of the string (see Section 9.7.3.5 for how this differs from $)
A word is defined as in the specification of [[:<:]] and [[:>:]] above. Constraint escapes are
illegal within bracket expressions.
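The word-boundary escapes from Table 9-18 are convenient for whole-word matching. The following sketch uses an invented sample sentence; patterns are E'' constants with doubled backslashes.

```sql
-- Replace cat only where it stands as a whole word.
SELECT regexp_replace('cat concat cat', E'\\mcat\\M', 'dog', 'g');   -- dog concat dog
-- Inside concat, cat does not start at a word boundary.
SELECT 'concat' ~ E'\\ycat\\y';    -- false
-- But it does end at one.
SELECT 'concat' ~ E'cat\\M';       -- true
-- \A and \Z anchor to the whole string.
SELECT 'abc' ~ E'\\Aabc\\Z';       -- true
```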
Table 9-19. Regular Expression Back References
Escape        Description
\m            (where m is a nonzero digit) a back reference to the m'th subexpression
\mnn          (where m is a nonzero digit, and nn is some more digits, and the decimal value mnn is not greater than the number of closing capturing parentheses seen so far) a back reference to the mnn'th subexpression
Note: There is an inherent ambiguity between octal character-entry escapes and back references,
which is resolved by the following heuristics, as hinted at above. A leading zero always indicates
an octal escape. A single non-zero digit, not followed by another digit, is always taken as a back
reference. A multi-digit sequence not starting with a zero is taken as a back reference if it comes
after a suitable subexpression (i.e., the number is in the legal range for a back reference), and
otherwise is taken as octal.
9.7.3.4. Regular Expression Metasyntax
In addition to the main syntax described above, there are some special forms and miscellaneous
syntactic facilities available.
An RE can begin with one of two special director prefixes. If an RE begins with ***:, the rest of
the RE is taken as an ARE. (This normally has no effect in PostgreSQL, since REs are assumed to be
AREs; but it does have an effect if ERE or BRE mode had been specified by the flags parameter to
a regex function.) If an RE begins with ***=, the rest of the RE is taken to be a literal string, with all
characters considered ordinary characters.
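The effect of the two director prefixes can be seen by comparing a pattern with and without them. This is a minimal sketch with illustrative strings.

```sql
-- Without a director, . matches any character.
SELECT 'axb' ~ 'a.b';            -- true
-- With ***=, the rest is a literal string, so . matches only a period.
SELECT 'axb' ~ '***=a.b';        -- false
SELECT 'a.b' ~ '***=a.b';        -- true
-- With ***:, the rest is an ARE, so ARE-only syntax such as lookahead is available.
SELECT 'ab' ~ '***:a(?=b)';      -- true
```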
An ARE can begin with embedded options: a sequence (?xyz) (where xyz is one or more alphabetic
characters) specifies options affecting the rest of the RE. These options override any previously
determined options; in particular, they can override the case-sensitivity behavior implied by a regex
operator, or the flags parameter to a regex function. The available option letters are shown in Table
9-20. Note that these same option letters are used in the flags parameters of regex functions.
Table 9-20. ARE Embedded-option Letters
Option        Description
b             rest of RE is a BRE
c             case-sensitive matching (overrides operator type)
e             rest of RE is an ERE
i             case-insensitive matching (see Section 9.7.3.5) (overrides operator type)
m             historical synonym for n
n             newline-sensitive matching (see Section 9.7.3.5)
p             partial newline-sensitive matching (see Section 9.7.3.5)
q             rest of RE is a literal ("quoted") string, all ordinary characters
s             non-newline-sensitive matching (default)
t             tight syntax (default; see below)
w             inverse partial newline-sensitive ("weird") matching (see Section 9.7.3.5)
x             expanded syntax (see below)
Embedded options take effect at the ) terminating the sequence. They can appear only at the start of
an ARE (after the ***: director if any).
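A short sketch of embedded options overriding the behavior implied by the operator. The strings are illustrative; ~ is the case-sensitive match operator and ~* the case-insensitive one.

```sql
-- ~ is case-sensitive by default.
SELECT 'ABC' ~ 'abc';          -- false
-- (?i) switches to case-insensitive matching.
SELECT 'ABC' ~ '(?i)abc';      -- true
-- (?c) forces case-sensitive matching even with ~*.
SELECT 'ABC' ~* '(?c)abc';     -- false
-- (?q) treats the rest of the RE as a literal string.
SELECT 'abc' ~ '(?q)a.c';      -- false
SELECT 'a.c' ~ '(?q)a.c';      -- true
```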
In addition to the usual (tight) RE syntax, in which all characters are significant, there is an expanded
syntax, available by specifying the embedded x option. In the expanded syntax, white-space characters
in the RE are ignored, as are all characters between a # and the following newline (or the end of the
RE). This permits paragraphing and commenting a complex RE. There are three exceptions to that
basic rule:
• a white-space character or # preceded by \ is retained
• white space or # within a bracket expression is retained
• white space and comments cannot appear within multi-character symbols, such as (?:
For this purpose, white-space characters are blank, tab, newline, and any character that belongs to the
space character class.
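The expanded syntax can be sketched with a commented date pattern. The pattern and the sample dates are assumptions made for illustration; the RE is an E'' constant, so backslashes are doubled, and the string spans several lines so that each # comment ends at a newline.

```sql
-- With (?x), white space and # comments in the pattern are ignored,
-- leaving the equivalent of ^\d{4}-\d{2}-\d{2}$.
SELECT '2024-01-15' ~ E'(?x)
    ^ \\d{4}     # year
    - \\d{2}     # month
    - \\d{2} $   # day
';                                  -- true

SELECT '2024-1-15' ~ E'(?x)
    ^ \\d{4}     # year
    - \\d{2}     # month
    - \\d{2} $   # day
';                                  -- false
```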
Finally, in an ARE, outside bracket expressions, the sequence (?#ttt) (where ttt is any text not
containing a )) is a comment, completely ignored. Again, this is not allowed between the characters of
multi-character symbols, like (?:. Such comments are more a historical artifact than a useful facility,
and their use is deprecated; use the expanded syntax instead.
None of these metasyntax extensions is available if an initial ***= director has specified that the user's
input be treated as a literal string rather than as an RE.
9.7.3.5. Regular Expression Matching Rules
In the event that an RE could match more than one substring of a given string, the RE matches the
one starting earliest in the string. If the RE could match more than one substring starting at that point,
either the longest possible match or the shortest possible match will be taken, depending on whether
the RE is greedy or non-greedy.
Whether an RE is greedy or not is determined by the following rules:
• Most atoms, and all constraints, have no greediness attribute (because they cannot match variable
amounts of text anyway).
• Adding parentheses around an RE does not change its greediness.
• A quantified atom with a fixed-repetition quantifier ({m} or {m}?) has the same greediness
(possibly none) as the atom itself.
• A quantified atom with other normal quantifiers (including {m,n} with m equal to n) is greedy
(prefers longest match).
• A quantified atom with a non-greedy quantifier (including {m,n}? with m equal to n) is non-greedy
(prefers shortest match).
• A branch (that is, an RE that has no top-level | operator) has the same greediness as the first
quantified atom in it that has a greediness attribute.
• An RE consisting of two or more branches connected by the | operator is always greedy.
The above rules associate greediness attributes not only with individual quantified atoms, but with
branches and entire REs that contain quantified atoms. What that means is that the matching is done
in such a way that the branch, or whole RE, matches the longest or shortest possible substring as
a whole. Once the length of the entire match is determined, the part of it that matches any
particular subexpression is determined on the basis of the greediness attribute of that subexpression, with
subexpressions starting earlier in the RE taking priority over ones starting later.
An example of what this means:
SELECT SUBSTRING('XY1234Z', 'Y*([0-9]{1,3})');
Result: 123
SELECT SUBSTRING('XY1234Z', 'Y*?([0-9]{1,3})');
Result: 1
In the first case, the RE as a whole is greedy because Y* is greedy. It can match beginning at the Y,
and it matches the longest possible string starting there, i.e., Y123. The output is the parenthesized
part of that, or 123. In the second case, the RE as a whole is non-greedy because Y*? is non-greedy.
It can match beginning at the Y, and it matches the shortest possible string starting there, i.e., Y1. The
subexpression [0-9]{1,3} is greedy but it cannot change the decision as to the overall match length;
so it is forced to match just 1.
In short, when an RE contains both greedy and non-greedy subexpressions, the total match length is
either as long as possible or as short as possible, according to the attribute assigned to the whole RE.
The attributes assigned to the subexpressions only affect how much of that match they are allowed to
“eat” relative to each other.
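The basic greediness rules can also be observed on simpler patterns. This sketch uses invented strings; the comments give the results that follow from the rules listed above.

```sql
-- A greedy quantifier takes the longest match starting at the earliest point.
SELECT substring('aaa' from 'a+');         -- aaa
-- A non-greedy quantifier takes the shortest.
SELECT substring('aaa' from 'a+?');        -- a
-- An RE with top-level | is always greedy, so the longer branch wins
-- even though the shorter one is listed first.
SELECT substring('abcd' from 'ab|abcd');   -- abcd
```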
The quantifiers {1,1} and {1,1}? can be used to force greediness or non-greediness, respectively,
on a subexpression or a whole RE. This is useful when you need the whole RE to have a greediness
attribute different from what's deduced from its elements. As an example, suppose that we are trying