13.1 Introduction
Regular expressions are a powerful tool for matching and manipulating text. While not as fast
as plain-vanilla string matching, regular expressions are extremely flexible; they allow you to
construct patterns to match almost any conceivable combination of characters with a simple,
albeit terse and somewhat opaque syntax.
In PHP, you can use regular expression functions to find text that matches certain criteria.
Once you've located the matching substrings, you can modify or replace all or part of them. For
example, this regular expression turns plain-text email addresses into mailto: hyperlinks:
$html = preg_replace('/[^@\s]+@([-a-z0-9]+\.)+[a-z]{2,}/i',
'<a href="mailto:$0">$0</a>', $text);
As you can see, regular expressions are handy when transforming plain text into HTML and
vice versa. Luckily, since these are such popular subjects, PHP has many built-in functions to
handle these tasks. Recipe 9.9 tells how to escape HTML entities, Recipe 11.12 covers
stripping HTML tags, and Recipe 11.10 and Recipe 11.11 show how to convert ASCII to HTML
and HTML to ASCII, respectively. For more on matching and validating email addresses, see
Recipe 13.7.
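As a quick check of the mailto: pattern, the following sketch runs it against a made-up
sentence. The sample text and the address in it are assumptions chosen purely for illustration.

```php
<?php
// Hypothetical input; any text containing an email address would do.
$text = 'Write to alice@example.com for details.';

// $0 in the replacement string refers to the entire matched address.
$html = preg_replace('/[^@\s]+@([-a-z0-9]+\.)+[a-z]{2,}/i',
                     '<a href="mailto:$0">$0</a>', $text);

print $html . "\n";
// Write to <a href="mailto:alice@example.com">alice@example.com</a> for details.
```

The text around the address is left untouched; only the matched substring is wrapped in the link.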
Over the years, the functionality of regular expressions has grown from its basic roots to
incorporate increasingly useful features. As a result, PHP offers two different sets of
regular-expression functions. The first set includes the traditional (or POSIX) functions, all
beginning with ereg (for extended regular expressions; the ereg functions themselves are
already an extension of the original feature set). The other set includes the Perl family of
functions, prefaced with preg (for Perl-compatible regular expressions).
The preg functions use a library that mimics the regular expression functionality of the Perl
programming language. This is a good thing because Perl allows you to do a variety of handy
things with regular expressions, including nongreedy matching, forward and backward
assertions, and even recursive patterns.
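To make those three features concrete, here is a minimal sketch of each one using
preg_match( ). The sample strings and variable names are assumptions made up for this
illustration, and the patterns are only one way to demonstrate each feature.

```php
<?php
// Nongreedy matching: .+? stops at the first closing quote it can.
preg_match('/".+?"/', 'say "hi" and "bye"', $nongreedy);
print $nongreedy[0] . "\n";   // "hi"

// Forward (lookahead) assertion: digits only if followed by " dollars".
preg_match('/\d+(?= dollars)/', 'costs 20 dollars', $ahead);
print $ahead[0] . "\n";       // 20

// Backward (lookbehind) assertion: digits only if preceded by a dollar sign.
preg_match('/(?<=\$)\d+/', 'costs $35', $behind);
print $behind[0] . "\n";      // 35

// Recursive pattern: (?R) re-enters the whole pattern to handle nesting.
preg_match('/\((?:[^()]+|(?R))*\)/', 'f(a(b)c) d', $nested);
print $nested[0] . "\n";      // (a(b)c)
```

Note that the assertions check the surrounding text without including it in the match, which
is why only the digits come back.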
In general, there's no longer any reason to use the ereg functions. They offer fewer features,
and they're slower than the preg functions. However, the ereg functions existed in PHP for
many years prior to the introduction of the preg functions, so many programmers still use them
because of legacy code or out of habit. Thankfully, the prototypes for the two sets of functions
are nearly identical, so it's easy to switch back and forth from one to another in your mind
without too much confusion. (We explain how to do this while avoiding the major gotchas in
Recipe 13.2.)
The basics of regular expressions are simple to understand. You combine a sequence of
characters to form a pattern. You then compare strings of text to this pattern and look for
matches. In the pattern, most characters represent themselves. So, to find if a string of HTML
contains an image tag, do this:
if (preg_match('/<img /', $html)) {
// found an opening image tag
}
The preg_match( ) function compares the pattern "<img " against the contents of $html. If it
finds a match, it returns 1; if it doesn't, it returns 0. The / characters are called pattern
delimiters; they set off the start and end of the pattern.
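The two return values are easy to see side by side. This sketch uses two hypothetical
snippets of markup, one with an image tag and one without:

```php
<?php
$withImage    = '<p><img src="logo.png"></p>';   // hypothetical markup
$withoutImage = '<p>No pictures here.</p>';

$found   = preg_match('/<img /', $withImage);     // 1: the pattern occurs
$missing = preg_match('/<img /', $withoutImage);  // 0: no match

var_dump($found, $missing);   // int(1), then int(0)
```

Because 1 and 0 behave as true and false in a condition, you can use the return value
directly in an if statement.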
A few characters, however, are special. The special nature of these characters is what takes
regular expressions beyond the feature set of strstr( ) and strpos( ). These characters are
called metacharacters. The most frequently used metacharacters include the period (.),
asterisk (*), plus (+), and question mark (?). To match an actual metacharacter, precede the
character with a backslash (\).
· The period matches any character, so the pattern /.at/ matches bat, cat, and even rat.
· The asterisk means match 0 or more of the preceding object. (Right now, the only
  objects we know about are characters.)
· The plus is similar to the asterisk, but it matches 1 or more instead of 0 or more. So,
  /.+at/ matches brat, sprat, and even catastrophe, but not at. To match at, replace the
  + with a *.
· The question mark matches 0 or 1 objects.
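The behavior of these four metacharacters, plus the backslash escape, can be checked with a
small table of cases. The colou?r and 3\.14 patterns are examples invented for this sketch;
the others come from the list above.

```php
<?php
// Each entry is [pattern, subject]; preg_match() returns 1 for a match, 0 otherwise.
$cases = [
    ['/.at/',     'bat'],     // 1: . stands in for the b
    ['/.at/',     'at'],      // 0: . needs one character before "at"
    ['/.+at/',    'sprat'],   // 1: one or more characters, then "at"
    ['/.+at/',    'at'],      // 0: + requires at least one character
    ['/.*at/',    'at'],      // 1: * is satisfied by zero characters
    ['/colou?r/', 'color'],   // 1: ? makes the u optional
    ['/colou?r/', 'colour'],  // 1
    ['/3\.14/',   '3.14'],    // 1: the escaped period matches a literal .
    ['/3\.14/',   '3x14'],    // 0: an x is not a literal period
];

$results = [];
foreach ($cases as $case) {
    $result    = preg_match($case[0], $case[1]);
    $results[] = $result;
    printf("%-10s %-8s => %d\n", $case[0], $case[1], $result);
}
```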
To apply * and + to objects greater than one character, place the sequence of characters
inside parentheses. Parentheses allow you to group characters for more complicated matching
and also capture the part of the pattern that falls inside them. A captured sequence can be
referenced in preg_replace( ) to alter a string, and all captured matches can be stored in
an array that's passed as a third parameter to preg_match( ) and preg_match_all( ).
The preg_match_all( ) function is similar to preg_match( ), but it finds all possible
matches inside a string, instead of stopping at the first match. Here are some examples:
if (preg_match('/<title>.+<\/title>/', $html)) {
// page has a title
}
if (preg_match_all('/<li>/', $html, $matches)) {
print 'Page has ' . count($matches[0]) . " list items\n";
}
// turn bold into italic
$italics = preg_replace('/(<\/?)b(>)/', '$1i$2', $bold);
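To see what actually ends up in the capture array, here is a runnable variation on those
examples. The $html string is a hypothetical page, and the title pattern adds a pair of
parentheses (not present in the example above) so that the title text is captured.

```php
<?php
$html = '<html><head><title>My Page</title></head>'
      . '<body><ul><li>one</li><li>two</li></ul></body></html>';

// $m[0] holds the whole match; $m[1] holds what the parentheses captured.
if (preg_match('/<title>(.+)<\/title>/', $html, $m)) {
    print "Title: {$m[1]}\n";              // Title: My Page
}

// preg_match_all() returns the number of matches it found.
$count = preg_match_all('/<li>/', $html, $matches);
print "Page has $count list items\n";      // Page has 2 list items

// $1 and $2 in the replacement refer to the two parenthesized groups.
$italics = preg_replace('/(<\/?)b(>)/', '$1i$2', '<b>bold</b>');
print $italics . "\n";                     // <i>bold</i>
```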
If you want to match strings with a specific set of letters, create a character class with the
letters you want. A character class is a sequence of characters placed inside square brackets.
The caret (^) and the dollar sign ($) anchor the pattern at the beginning and the end of the
string, respectively. Without them, a match can occur anywhere in the string. So, to match
only vowels, make a character class containing a, e, i, o, and u; start your pattern with ^;
and end it with $:
preg_match('/^[aeiou]+$/', $string); // only vowels
If it's easier to define what you're looking for by its complement, use that. To make a
character class match the complement of what's inside it, begin the class with a caret. A caret
outside a character class anchors a pattern at the beginning of a string; a caret inside a
character class means "match everything except what's listed in the square brackets":
preg_match('/^[^aeiou]+$/', $string); // only non-vowels
Note that the opposite of [aeiou] isn't [bcdfghjklmnpqrstvwxyz]. The character class
[^aeiou] also matches uppercase vowels such as AEIOU, numbers such as 123, URLs such
as http://www.cnpq.br/, and even emoticons such as :).
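A short sketch shows how the two anchored classes treat a handful of inputs. The test
strings other than AEIOU, 123, and :) are assumptions picked for illustration.

```php
<?php
$onlyVowels    = '/^[aeiou]+$/';
$onlyNonVowels = '/^[^aeiou]+$/';

$vowelResults = [
    preg_match($onlyVowels, 'aeiou'),       // 1
    preg_match($onlyVowels, 'audio'),       // 0: the d is not in the class
    preg_match($onlyVowels, 'AEIOU'),       // 0: the class lists lowercase only
];

$nonVowelResults = [
    preg_match($onlyNonVowels, 'rhythm'),   // 1
    preg_match($onlyNonVowels, 'AEIOU'),    // 1: uppercase vowels aren't excluded
    preg_match($onlyNonVowels, '123'),      // 1
    preg_match($onlyNonVowels, ':)'),       // 1
    preg_match($onlyNonVowels, 'php.net'),  // 0: contains a lowercase e
];

print implode(' ', $vowelResults) . "\n";     // 1 0 0
print implode(' ', $nonVowelResults) . "\n";  // 1 1 1 1 0
```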
The vertical bar (|), also known as the pipe, specifies alternatives. For example:
// find a gif or a jpeg
preg_match('/(gif|jpeg)/', $images);
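Because the alternatives sit inside parentheses, you can also find out which one matched.
This sketch applies the same pattern to a few hypothetical filenames:

```php
<?php
$files = ['photo.jpeg', 'icon.gif', 'notes.txt'];  // hypothetical filenames

$extensions = [];
foreach ($files as $file) {
    // $m[1] holds whichever alternative matched.
    if (preg_match('/(gif|jpeg)/', $file, $m)) {
        $extensions[$file] = $m[1];
    }
}

print_r($extensions);  // photo.jpeg => jpeg, icon.gif => gif; notes.txt is skipped
```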
Besides metacharacters, there are also metasymbols. Metasymbols are like metacharacters,
but are longer than one character. Some useful metasymbols are \w (match any word
character, [a-zA-Z0-9_]); \d (match any digit, [0-9]); \s (match any whitespace character);
and \b (match a word boundary). Here's how to find all numbers that aren't part of another
word:
// find digits not touching other words
preg_match_all('/\b\d+\b/', $html, $matches);
This matches 123, the 76 in 76!, and the 38 in 38-years-old, but not the 2 in 2nd.
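You can confirm this by running the pattern over a string that contains all four cases (the
string itself is just an assumption assembled from those examples):

```php
<?php
$text = '123 76! 38-years-old 2nd';

preg_match_all('/\b\d+\b/', $text, $matches);

// Punctuation and hyphens count as word boundaries; the letter n in 2nd does not.
print implode(', ', $matches[0]) . "\n";   // 123, 76, 38
```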
Here's a pattern that is the regular expression equivalent of trim( ):
// delete leading whitespace or trailing whitespace
$trimmed = preg_replace('/(^\s+)|(\s+$)/', '', $string);
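A quick comparison against trim( ) itself, using a hypothetical string padded with spaces, a
tab, and a newline:

```php
<?php
$string  = "  \t hello world \n";   // hypothetical padded input

$trimmed = preg_replace('/(^\s+)|(\s+$)/', '', $string);

var_dump($trimmed);                    // string(11) "hello world"
var_dump($trimmed === trim($string));  // bool(true)
```

For this input the two approaches agree. In everyday code, trim( ) is the simpler choice; the
pattern is mainly useful as an illustration of anchors and alternation.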
Finally, there are pattern modifiers. Modifiers affect the entire pattern, not just a character
or group of characters. Pattern modifiers are placed after the trailing pattern delimiter. For
example, the letter i makes a regular expression pattern case-insensitive:
// strict match lower-case image tags only (XHTML compliant)
if (preg_match('/<img[^>]+>/', $html)) {
...
}
// match both upper and lower-case image tags
if (preg_match('/<img[^>]+>/i', $html)) {
...
}
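The difference shows up as soon as the tag is written in uppercase. In this sketch, $html is
a hypothetical tag used only to compare the two patterns:

```php
<?php
$html = '<IMG SRC="logo.png">';   // hypothetical uppercase tag

$strict  = preg_match('/<img[^>]+>/', $html);    // 0: case must match exactly
$relaxed = preg_match('/<img[^>]+>/i', $html);   // 1: i ignores case

var_dump($strict, $relaxed);   // int(0), then int(1)
```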
We've covered just a small subset of the world of regular expressions. We provide some
additional details in later recipes, but the PHP web site also has some very useful information
on POSIX regular expressions at http://www.php.net/regex and on Perl-compatible regular
expressions at http://www.php.net/pcre. The links from this last page to "Pattern Modifiers"
and "Pattern Syntax" are especially detailed and informative.
The best books on this topic are Mastering Regular Expressions by Jeffrey Friedl, and
Programming Perl by Larry Wall, Tom Christiansen, and Jon Orwant, both published by
O'Reilly. (Since the Perl-compatible regular expressions are based on Perl's regular
expressions, we don't feel too bad suggesting a book on Perl.)
Recipe 13.2 Switching From ereg to preg
13.2.1 Problem
You want to convert from using ereg functions to preg functions.
13.2.2 Solution
First, you have to add delimiters to your patterns:
preg_match('/pattern/', 'string');
For eregi( ) case-insensitive matching, use the /i modifier instead:
preg_match('/pattern/i', 'string');
When using integers instead of strings as patterns or replacement values, convert the number
to hexadecimal and specify it using an escape sequence:
$hex = dechex($number);
preg_match("/\x$hex/", 'string');
13.2.3 Discussion
There are a few major differences between ereg and preg. First, when you use preg
functions, the pattern isn't just the string pattern; it also needs delimiters, as in Perl, so
it's /pattern/ instead. (Or {}, <>, ||, ##, or whatever your favorite delimiters are. PHP
supports them all.) So:
ereg('pattern', 'string');
becomes:
preg_match('/pattern/', 'string');
When choosing your pattern delimiters, don't put your delimiter character inside the
regular-expression pattern, or you'll close the pattern early. If you can't find a way to avoid
this problem, you need to escape any instances of your delimiters using the backslash. Instead
of doing this by hand, call addcslashes( ).
For example, if you use / as your delimiter:
$ereg_pattern = '<b>.+</b>';
$preg_pattern = addcslashes($ereg_pattern, '/');
The value of $preg_pattern is now <b>.+<\/b>.
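Here is that conversion carried through to an actual match. The subject string is an
assumption added for this sketch:

```php
<?php
$ereg_pattern = '<b>.+</b>';

// Escape every / so it can't be mistaken for the closing delimiter.
$preg_pattern = addcslashes($ereg_pattern, '/');
print $preg_pattern . "\n";   // <b>.+<\/b>

// Wrap the escaped pattern in / delimiters before handing it to preg_match().
$matched = preg_match('/' . $preg_pattern . '/', 'some <b>bold</b> text');
var_dump($matched);           // int(1)
```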
The preg functions don't have a parallel series of case-insensitive functions. They have a
case-insensitive modifier instead. To convert, change:
eregi('pattern', 'string');
to:
preg_match('/pattern/i', 'string');
Adding the i after the closing delimiter makes the change.
Finally, there is one last obscure difference. If you use a number (not a string) as a pattern
or replacement value in ereg_replace( ), it's assumed you are referring to the ASCII value
of a character. Therefore, since 9 is the ASCII representation of tab (i.e., \t), this code
inserts tabs at the beginning of each line:
$tab = 9;
$replaced = ereg_replace('^', $tab, $string);
Here's how to convert linefeeds (ASCII 10) to form feeds (ASCII 12):
$converted = ereg_replace(10, 12, $text);
To avoid this feature in ereg functions, pass the number as a string instead:
$tab = '9';
On the other hand, preg_replace( ) treats the number 9 as the number 9, not as a tab
substitute. To convert these character codes for use in preg_replace( ), convert them to
hexadecimal and prefix them with \x. For example, 9 becomes \x9 or \x09, and 12 becomes
\x0c. Alternatively, you can use \t, \r, and \n for tabs, carriage returns, and
linefeeds, respectively.
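One way to carry out that conversion is sketched below. It is not the only approach:
sprintf( ) is used here in place of dechex( ) so the hex value is always zero-padded to two
digits, chr( ) is used for the replacement value, and the m modifier (so that ^ matches at
the start of every line) is an assumption added for the example.

```php
<?php
$tab = 9;   // ASCII code for a tab

// Build the \x escape for the pattern; single quotes keep the backslash literal.
$pattern = '/' . sprintf('\x%02x', $tab) . '/';
print $pattern . "\n";                           // /\x09/
$hasTab = preg_match($pattern, "name\tvalue");   // 1

// For a replacement value, chr() turns the code into the character itself.
$indented = preg_replace('/^/m', chr($tab), "one\ntwo");
var_dump($indented === "\tone\n\ttwo");          // bool(true)
```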
13.2.4 See Also
Documentation on ereg( ) at http://www.php.net/ereg, preg_match( ) at
http://www.php.net/preg-match, and addcslashes( ) at http://www.php.net/addcslashes.
Recipe 13.3 Matching Words
13.3.1 Problem
You want to pull out all words from a string.
13.3.2 Solution
The key to this is carefully defining what you mean by a word. Once you've created your
definition, use the special character types to create your regular expression:
/\S+/ // everything that isn't whitespace
/[A-Z'-]+/i // all upper and lowercase letters, apostrophes, and hyphens
13.3.3 Discussion
The simple question "what is a word?" is surprisingly complicated. While the Perl-compatible
regular expressions have a built-in word character type, specified by \w, it's important to
understand exactly how PHP defines a word. Otherwise, your results may not be what you
expect.
Normally, because it comes directly from Perl's definition of a word, \w encompasses all
letters, digits, and underscores; this means a_z is a word, but the email address
php@example.com is not.
In this recipe, we only consider English words, but other languages use different alphabets.
Because Perl-compatible regular expressions use the current locale to define their settings,
altering the locale can switch the definition of a letter, which then redefines the meaning of a
word.
To combat this, you may want to explicitly enumerate the characters belonging to your words
inside a character class. To add a nonstandard character, use \ddd, where ddd is a
character's octal code.
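The definitions discussed in this recipe give noticeably different results on the same input.
The following sketch compares them; the sample sentence is an assumption chosen to include
an apostrophe, a hyphen, and punctuation.

```php
<?php
$text = "Don't over-think it, O'Reilly!";   // hypothetical sample sentence

// Everything that isn't whitespace: punctuation stays attached to the words.
preg_match_all('/\S+/', $text, $byWhitespace);
print implode(' | ', $byWhitespace[0]) . "\n";  // Don't | over-think | it, | O'Reilly!

// Letters, apostrophes, and hyphens only: punctuation is dropped.
preg_match_all("/[A-Z'-]+/i", $text, $byLetters);
print implode(' | ', $byLetters[0]) . "\n";     // Don't | over-think | it | O'Reilly

// \w treats a_z as one word but splits the email address at @ and the period.
preg_match_all('/\w+/', 'a_z php@example.com', $byWordChars);
print implode(' | ', $byWordChars[0]) . "\n";   // a_z | php | example | com
```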
13.3.4 See Also
Recipe 16.3 for information about setting locales.
Recipe 13.4 Finding the nth Occurrence of a Match
13.4.1 Problem
You want to find the nth word match instead of the first one.
13.4.2 Solution
Use preg_match_all( ) to pull all the matches into an array; then pick out the specific
matches you're interested in:
preg_match_all("/$pattern/$modifiers", $string, $matches);
foreach($matches[1] as $match) {
print "$match\n";
}
13.4.3 Discussion
Unlike in Perl, PHP's Perl-compatible regular expressions don't support the /g modifier that
allows you to loop through the string one match at a time. You need to use
preg_match_all( ) instead of preg_match( ).
The preg_match_all( ) function fills $matches with a two-dimensional array. The first
element holds an array of matches of the complete pattern. The second element also holds an
array of matches, but of the parenthesized submatches within each complete match. So, to get
the third potato, you access the third element of the second element of the $matches array:
$potatoes = 'one potato two potato three potato four';
preg_match_all("/(\w+)\s+potato\b/", $potatoes, $matches);
print $matches[1][2];
three
Instead of producing an array divided into full matches and then submatches,
preg_match_all( ) can produce an array divided by matches, with each submatch inside. To
trigger this, pass PREG_SET_ORDER in as the fourth argument. Now, three isn't in
$matches[1][2], as previously, but in $matches[2][1].
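The two layouts are easiest to compare side by side. This sketch reuses the potato string;
the variable names $byPattern and $bySet are assumptions introduced to keep the two result
arrays apart.

```php
<?php
$potatoes = 'one potato two potato three potato four';

// Default layout: [0] holds the full matches, [1] holds the first submatches.
$count = preg_match_all('/(\w+)\s+potato\b/', $potatoes, $byPattern);
print $byPattern[1][2] . "\n";   // three

// PREG_SET_ORDER: one entry per match, each holding the full match and its submatches.
preg_match_all('/(\w+)\s+potato\b/', $potatoes, $bySet, PREG_SET_ORDER);
print $bySet[2][1] . "\n";       // three
print $bySet[2][0] . "\n";       // three potato
```

PREG_SET_ORDER tends to be more convenient when you loop over matches one at a time, since
everything about a single match lives in one sub-array.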
Check the return value of preg_match_all( ) to find the number of matches:
print preg_match_all("/(\w+)\s+potato\b/", $potatoes, $matches);
3
Note that there are only three matches, not four, because there's no trailing potato after the
word four in the string.
13.4.4 See Also
Documentation on preg_match_all( ) at http://www.php.net/preg-match-all.
Recipe 13.5 Choosing Greedy or Nongreedy Matches
13.5.1 Problem
You want your pattern to match the smallest possible string instead of the largest.
13.5.2 Solution
Place a ? after a quantifier to alter that portion of the pattern:
// find all bolded sections
preg_match_all('#<b>.+?</b>#', $html, $matches);
Or, use the U pattern modifier to invert all quantifiers from greedy to nongreedy:
// find all bolded sections
preg_match_all('#<b>.+</b>#U', $html, $matches);
13.5.3 Discussion
By default, all regular expressions in PHP are what's known as greedy. This means a quantifier
always tries to match as many characters as possible.
For example, take the pattern p.*, which matches a p and then 0 or more characters, and
match it against the string php. A greedy regular expression finds one match, because after it
grabs the opening p, it continues on and also matches the hp. A nongreedy regular
expression, on the other hand, finds a pair of matches. As before, it matches the opening p,
but because 0 characters are enough to satisfy the quantifier, it stops there and leaves the h
and the final p unmatched. A second match then goes ahead and takes the closing letter.
The following code shows that the greedy match finds only one hit; the nongreedy ones find
two:
print preg_match_all('/p.*/', "php"); // greedy
print preg_match_all('/p.*?/', "php"); // nongreedy
print preg_match_all('/p.*/U', "php"); // nongreedy
1
2
2
Greedy matching is also known as maximal matching and nongreedy matching can be called
minimal matching, because these options match either the maximum or minimum number of
characters possible.
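The difference matters most when a delimiter appears more than once in the subject. The
sketch below applies the three patterns from the Solution to a hypothetical fragment of HTML
with two bolded sections:

```php
<?php
$html = '<b>one</b> and <b>two</b>';   // hypothetical markup

// Greedy: .+ runs to the last </b>, so both sections collapse into one match.
$greedyCount = preg_match_all('#<b>.+</b>#', $html, $greedy);
print $greedyCount . ': ' . $greedy[0][0] . "\n";
// 1: <b>one</b> and <b>two</b>

// Nongreedy with ?: each section is matched separately.
$lazyCount = preg_match_all('#<b>.+?</b>#', $html, $lazy);
print $lazyCount . ': ' . implode(' | ', $lazy[0]) . "\n";
// 2: <b>one</b> | <b>two</b>

// Nongreedy with the U modifier: same result as the ? version.
$ungreedyCount = preg_match_all('#<b>.+</b>#U', $html, $ungreedy);
print $ungreedyCount . ': ' . implode(' | ', $ungreedy[0]) . "\n";
// 2: <b>one</b> | <b>two</b>
```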
Initially, all regular expressions were strictly greedy. Therefore, you can't use this syntax
with ereg( ) or ereg_replace( ). Nongreedy matching isn't supported by the older engine
that powers these functions; instead, you must use Perl-compatible functions.