71
1. What I really wanted was to match
'ROAD'
when it was at theend of the string and it was its own word
(and not a part of some larger word). To express this in a regular expression, you use
\b
,which means “a
word boundary must occur right here.” In Python, this is complicated by thefact that the
'\'
character in a
string must itself be escaped. This is sometimes referred to as the backslash plague, and it is onereason why
regular expressions are easier in Perl than in Python. On thedown side, Perl mixes regular expressions with
other syntax, so if you have a bug, it may be hard to tell whether it’s a bug in syntax or a bugin your
regular expression.
2. To work around the backslash plague, you can usewhat is called a raw string, by prefixing the string with the
letter
r
.This tells Python that nothing in this string should be escaped;
'\t'
is a tab character, but
r'\t'
is
really the backslash character
\
followed by the letter
t
.I recommend always using raw strings when dealing
with regular expressions; otherwise, things get tooconfusing too quickly (and regular expressions are
confusing enough already).
3. *sigh* Unfortunately, I soon found more cases that contradicted my logic. In this case, the street address
contained theword
'ROAD'
as a whole word by itself, but it wasn’t at theend, because the address had an
apartment number after thestreet designation. Because
'ROAD'
isn’t at thevery end of thestring, it doesn’t
match, so the entire call to
re.sub()
ends up replacing nothing at all, and you get the original string back,
which is not what you want.
4. To solve this problem, I removed the
$
character and added another
\b
.Now the regular expression reads
“match
'ROAD'
when it’s a whole word by itself anywhere in thestring,” whether at the end, the beginning,
or somewherein the middle.
⁂
5.3. C
ASE
S
TUDY
:R
OMAN
N
UMERALS
You’ve most likely seen Roman numerals, even if you didn’t recognize them. You may have seen them in
copyrights of old movies and television shows (“Copyright
MCMXLVI
”instead of “Copyright
1946
”), or on the
dedication walls of libraries or universities (“established
MDCCCLXXXVIII
”instead of “established
1888
”). You
may also have seen them in outlines and bibliographical references. It’s a system of representing numbers
that really does dateback to the ancient Roman empire (hencethename).
130