116
282
Using Backreferences in The Regular Expression
Backreferences can not only be used after a match has been found, but also during the match. Suppose you
want to match a pair of opening and closing HTML tags, and the text in between. By putting the opening tag
into a backreference, we can reuse the name of the tag for the closing tag. Here’s how: «
<([A-Z][A-Z0-
9]*)\b[^>]*>.*?</\1>
» . This regex contains only one pair of parentheses, which capture the string
matched by «
[A-Z][A-Z0-9]*
» into the first backreference. This backreference is reused with «
\1
»
(backslash one). The «
/
» before it is simply the forward slash in the closing HTML tag that we are trying to
match.
To figure out the number of a particular backreference, scan the regular expression from left to right and
count the opening round brackets. The first bracket starts backreference number one, the second number
two, etc. Non-capturing parentheses are not counted. This fact means that non-capturing parentheses have
another benefit: you can insert them into a regular expression without changing the numbers assigned to the
backreferences. This can be very useful when modifying a complex regular expression.
You can reuse the same backreference more than once. «
([a-c])x\1x\1
» will match „
axaxa
µ, „
bxbxb
µ
and „
cxcxc
µ.
Looking Inside The Regex Engine
Let’s see how the regex engine applies the above regex to the string ´
Testing <B><I>bold
italic</I></B> text
µ. The first token in the regex is the literal «
<
». The regex engine will traverse the
string until it can match at the first „
<
µ in the string. The next token is «
[A-Z]
». The regex engine also takes
note that it is now inside the first pair of capturing parentheses. «
[A-Z]
» matches „
B
µ. The engine advances
to «
[A-Z0-9]
» and ´
>
µ. This match fails. However, because of the star, that’s perfectly fine. The position in
the string remains at ´
>
µ. The position in the regex is advanced to «
[^>]
».
This step crosses the closing bracket of the first pair of capturing parentheses. This prompts the regex engine
to store what was matched inside them into the first backreference. In this case, „
B
µ is stored.
After storing the backreference, the engine proceeds with the match attempt. «
[^>]
» does not match „
>
µ.
Again, because of another star, this is not a problem. The position in the string remains at ´
>
µ, and position
in the regex is advanced to «
>
». These obviously match. The next token is a dot, repeated by a lazy star.
Because of the laziness, the regex engine will initially skip this token, taking note that it should backtrack in
case the remainder of the regex fails.
The engine has now arrived at the second «
<
» in the regex, and the second ´
<
µ in the string. These match.
The next token is «
/
». This does not match ´
I
µ, and the engine is forced to backtrack to the dot. The dot
matches the second „
<
µ in the string. The star is still lazy, so the engine again takes note of the available
backtracking position and advances to «
<
» and ´
I
µ. These do not match, so the engine again backtracks.
The backtracking continues until the dot has consumed „
<I>bold italic
µ. At this point, «
<
» matches the
third „
<
µ in the string, and the next token is «
/
» which matches ´
/
µ. The next token is «
\1
». Note that the
token is the backreference, and not «
B
». The engine does not substitute the backreference in the regular
expression. Every time the engine arrives at the backreference, it will read the value that was stored. This
means that if the engine had backtracked beyond the first pair of capturing parentheses before arriving the
second time at «
\1
», the new value stored in the first backreference would be used. But this did not happen
156
283
here, so „
B
µ it is. This fails to match at ´
I
µ, so the engine backtracks again, and the dot consumes the third
´
<
µ in the string.
Backtracking continues again until the dot has consumed „
<I>bold italic</I>
µ. At this point, «
<
»
matches „
<
µ and «
/
» matches „
/
µ. The engine arrives again at «
\1
». The backreference still holds „
B
µ. «
B
»
matches „
B
µ. The last token in the regex, «
>
» matches „
>
µ. A complete match has been found: „
<B><I>bold
italic</I></B>
µ.
Backtracking Into Capturing Groups
You may have wondered about the word boundary «
\b
» in the «
<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>
»
mentioned above. This is to make sure the regex won’t match incorrectly paired tags such as
´
<boo>bold</b>
µ. You may think that cannot happen because the capturing group matches „
boo
µ which
causes «
\1
» to try to match the same, and fail. That is indeed what happens. But then the regex engine
backtracks.
Let’s take the regex «
<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>
» without the word boundary and look inside
the regex engine at the point where «
\1
» fails the first time. First, «
.*?
» continues to expand until it has
reached the end of the string, and «
</\1>
» has failed to match each time «
.*?
» matched one more character.
Then the regex engine backtracks into the capturing group. «
[A-Z0-9]*
» has matched „
oo
µ, but would just
as happily match „
o
µ or nothing at all. When backtracking, «
[A-Z0-9]*
» is forced to give up one character.
The regex engine continues, exiting the capturing group a second time. Since [A-Z][A-Z0-9]* has now
matched „
bo
µ, that is what is stored into the capturing group, overwriting „
boo
µ that was stored before.
«
[^>]*
» matches the second „
o
µ in the opening tag. «
>.*?</
» matches „
>bold<
µ. «
\1
» fails again.
The regex engine does all the same backtracking once more, until «
[A-Z0-9]*
» is forced to give up another
character, causing it to match nothing, which the star allows. The capturing group now stores just „
b
µ.
«
[^>]*
» now matches „
oo
µ. «
>.*?</
» once again matches „
>bold<
µ. «
\1
» now succeeds, as does «
>
» and
an overall match is found. But not the one we wanted.
There are several solutions to this. One is to use the word boundary. When «
[A-Z0-9]*
» backtracks the first
time, reducing the capturing group to „
bo
µ, «
\b
» fails to match between ´
o
µ and ´
o
µ. This forces «
[A-Z0-
9]*
» to backtrack again immediately. The capturing group is reduced to „
b
µ and the word boundary fails
between ´
b
µ and ´
o
µ. There are no further backtracking positions, so the whole match attempt fails.
The reason we need the word boundary is that we’re using «
[^>]*
» to skip over any attributes in the tag. If
your paired tags never have any attributes, you can leave that out, and use «
<([A-Z][A-Z0-
9]*)>.*?</\1>
». Each time «
[A-Z0-9]*
» backtracks, the «
>
» that follows it will fail to match, quickly
ending the match attempt.
If you didn’t expect the regex engine to backtrack into capturing groups, you can use an atomic group. The
regex engine always backtracks into capturing groups, and never captures atomic groups. You can put the
capturing group inside an atomic group to get an atomic capturing group: «
(?>(atomic capture))
». In
this case, we can put the whole opening tag into the atomic group: «
(?><([A-Z][A-Z0-
9]*)[^>]*>).*?</\1>
». The tutorial section on atomic grouping has all the details.
101
284
Backreferences to Failed Groups
The previous section applies to all regex flavors, except those few that don’t support capturing groups at all.
Flavors behave differently when you start doing things that don’t fit the ´match the text matched by a
previous capturing groupµ job description.
There is a difference between a backreference to a capturing group that matched nothing, and one to a
capturing group that did not participate in the match at all. The regex «
(q?)b\1
» will match „
b
µ. «
q?
» is
optional and matches nothing, causing «
(q?)
» to successfully match and capture nothing. «
b
» matches „
b
µ
and «
\1
» successfully matches the nothing captured by the group.
The regex «
(q)?b\1
» however will fail to match ´
b
µ. «
(q)
» fails to match at all, so the group never gets to
capture anything at all. Because the whole group is optional, the engine does proceed to match «
b
». However,
the engine now arrives at «
\1
» which references a group that did not participate in the match attempt at all.
This causes the backreference to fail to match at all, mimicking the result of the group. Since there’s no «
?
»
making «
\1
» optional, the overall match attempt fails.
The only exception is JavaScript. According to the official ECMA standard, a backreference to a non-
participating capturing group must successfully match nothing just like a backreference to a participating
group that captured nothing does. In other words, in JavaScript, «
(q?)b\1
» and «
(q)?b\1
» both match „
b
µ.
Forward References and Invalid References
Modern flavors, notably JGsoft, .NET, Java, Perl, PCRE and Ruby allow forward references. That is: you can
use a backreference to a group that appears later in the regex. Forward references are obviously only useful if
they’re inside a repeated group. Then there can be situations in which the regex engine evaluates the
backreference after the group has already matched. Before the group is attempted, the backreference will fail
like a backreference to a failed group does.
If forward references are supported, the regex «
(\2two|(one))+
» will match „
oneonetwo
µ. At the start of
the string, «
\2
» fails. Trying the other alternative, „
one
µ is matched by the second capturing group, and
subsequently by the first group. The first group is then repeated. This time, «
\2
» matches „
one
µ as captured
by the second group. «
two
» then matches „
two
µ. With two repetitions of the first group, the regex has
matched the whole subject string.
A nested reference is a backreference inside the capturing group that it references, e.g. «
(\1two|(one))+
».
This regex will give exactly the same behavior with flavors that support forward references. Some flavors that
don’t support forward references do support nested references. This includes JavaScript.
With all other flavors, using a backreference before its group in the regular expression is the same as using a
backreference to a group that doesn’t exist at all. All flavors discussed in this tutorial, except JavaScript and
Ruby, treat backreferences to undefined groups as an error. In JavaScript and Ruby, they always result in a
zero-width match. For Ruby this is a potential pitfall. In Ruby, «
(a)(b)?\2
» will fail to match ´
a
µ, because
«
\2
» references a non-participating group. But «
(a)(b)?\7
» will match „
a
µ. For JavaScript this is logical, as
backreferences to non-participating groups do the same. Both regexes will match „
a
µ.
76
285
Repetition and Backreferences
As I mentioned in the above inside look, the regex engine does not permanently substitute backreferences in
the regular expression. It will use the last match saved into the backreference each time it needs to be used. If
a new match is found by capturing parentheses, the previously saved match is overwritten. There is a clear
difference between «
([abc]+)
» and «
([abc])+
». Though both successfully match „
cab
µ, the first regex will
put „
cab
µ into the first backreference, while the second regex will only store „
b
µ. That is because in the
second regex, the plus caused the pair of parentheses to repeat three times. The first time, „
c
µ was stored.
The second time „
a
µ and the third time „
b
µ. Each time, the previous value was overwritten, so „
b
µ remains.
This also means that «
([abc]+)=\1
» will match „
cab=cab
µ, and that «
([abc])+=\1
» will not. The reason
is that when the engine arrives at «
\1
», it holds «
b
» which fails to match ´
c
µ. Obvious when you look at a
simple example like this one, but a common cause of difficulty with regular expressions nonetheless. When
using backreferences, always double check that you are really capturing what you want.
Useful Example: Checking for Doubled Words
When editing text, doubled words such as ´the theµ easily creep in. Using the regex «
\b(\w+)\s+\1\b
» in
your text editor, you can easily find them. To delete the second word, simply type in ´
\1
µ as the replacement
text and click the Replace button.
Parentheses and Backreferences Cannot Be Used Inside Character Classes
Round brackets cannot be used inside character classes, at least not as metacharacters. When you put a round
bracket in a character class, it is treated as a literal character. So the regex «
[(a)b]
» matches „
a
µ, „
b
µ, „
(
µ
and „
)
µ.
Backreferences also cannot be used inside a character class. The \1 in regex like «
(a)[\1b]
» will be
interpreted as an octal escape in most regex flavors. So this regex will match an „
a
µ followed by either «
\x01
»
or a «
b
».
Documents you may be interested
Documents you may be interested