3. How to Find or Validate an Email Address
The regular expression I receive the most feedback on, not to mention “bug” reports, is the one you’ll find
right in the tutorial’s introduction: «\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b». This regular
expression, I claim, matches any email address. Most of the feedback I get refutes that claim by showing one
email address that this regex doesn’t match. Usually, the “bug” report also includes a suggestion to make the
regex “perfect”.
As I explain below, my claim only holds true when one accepts my definition of what a valid email address
really is, and what it’s not. If you want to use a different definition, you’ll have to adapt the regex. Matching a
valid email address is a perfect example showing that (1) before writing a regex, you have to know exactly
what you’re trying to match, and what not; and (2) there’s often a trade-off between what’s exact, and what’s
practical.
The virtue of my regular expression above is that it matches 99% of the email addresses in use today. All the
email addresses it matches can be handled by 99% of all email software out there. If you’re looking for a quick
solution, you only need to read the next paragraph. If you want to know all the trade-offs and get plenty of
alternatives to choose from, read on.
If you want to use the regular expression above, there are two things you need to understand. First, long
regexes make it difficult to nicely format paragraphs. So I didn’t include «a-z» in any of the three character
classes. This regex is intended to be used with your regex engine’s “case insensitive” option turned on. (You’d
be surprised how many “bug” reports I get about that.) Second, the above regex is delimited with word
boundaries, which makes it suitable for extracting email addresses from files or larger blocks of text. If you
want to check whether the user typed in a valid email address, replace the word boundaries with start-of-string
and end-of-string anchors, like this: «^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$».
The previous paragraph also applies to all following examples. You may need to change word boundaries into
start/end-of-string anchors, or vice versa. And you will need to turn on the case insensitive matching option.
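The two variants described above can be sketched in Python. This is only an illustration: the names `EMAIL_CORE`, `find_emails` and `is_valid_email` are made up for this example, and `re.IGNORECASE` plays the role of the case insensitive option.

```python
import re

# The tutorial's pattern without boundaries or anchors (hypothetical name).
EMAIL_CORE = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"

# Word boundaries: suitable for pulling addresses out of larger text.
FIND_EMAIL = re.compile(r"\b" + EMAIL_CORE + r"\b", re.IGNORECASE)

# Anchors: suitable for checking one input string as a whole.
# Note: in Python, $ also matches just before a trailing newline.
VALID_EMAIL = re.compile(r"^" + EMAIL_CORE + r"$", re.IGNORECASE)


def find_emails(text):
    """Return every substring of text that looks like an email address."""
    return FIND_EMAIL.findall(text)


def is_valid_email(value):
    """Return True if the whole input matches the anchored regex."""
    return VALID_EMAIL.match(value) is not None
```

Without `re.IGNORECASE`, both patterns would reject lowercase addresses, which is the most common mistake when copying this regex.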
Trade-Offs in Validating Email Addresses
Yes, there are a whole bunch of email addresses that my pet regex doesn’t match. The most frequently quoted
example is addresses on the .museum top level domain, which is longer than the 4 letters my regex allows
for the top level domain. I accept this trade-off because the number of people using .museum email
addresses is extremely low. I’ve never had a complaint that the order forms or newsletter subscription forms
on the JGsoft websites refused a .museum address (which they would, since they use the above regex to
validate the email address).
To include .museum, you could use «^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$». However, then
there’s another trade-off. This regex will match “john@mail.office”. It’s far more likely that John forgot
to type in the .com top level domain rather than having just created a new .office top level domain
without ICANN’s permission.
This shows another trade-off: do you want the regex to check if the top level domain exists? My regex
doesn’t. Any combination of two to four letters will do, which covers all existing and planned top level
domains except .museum. But it will match addresses with invalid top-level domains like
“asdf@asdf.asdf”. By not being overly strict about the top-level domain, I don’t have to update the regex
each time a new top-level domain is created, whether it’s a country code or generic domain.
«^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)$»
could be used to allow any two-letter country code top level domain, and only specific generic top level domains. By
the time you read this, the list might already be out of date. If you use this regular expression, I recommend
you store it in a global constant in your application, so you only have to update it in one place. You could list
all country codes in the same manner, even though there are almost 200 of them.
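One way to keep the domain list in a single place is sketched below in Python. The constant and function names are hypothetical, and the list simply mirrors the one in the text, so it is very likely out of date.

```python
import re

# Single place to update when top level domains change (hypothetical name).
GENERIC_TLDS = ("com", "org", "net", "edu", "gov", "mil", "biz", "info",
                "mobi", "name", "aero", "asia", "jobs", "museum")

# Any two-letter country code, or one of the listed generic domains.
EMAIL_KNOWN_TLD = re.compile(
    r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:[A-Z]{2}|"
    + "|".join(GENERIC_TLDS)
    + r")$",
    re.IGNORECASE,
)


def has_known_tld(address):
    """Return True if address matches and ends in an accepted domain."""
    return EMAIL_KNOWN_TLD.match(address) is not None
```

Building the alternation from a tuple means adding a domain is a one-word change, at the cost of the regex no longer being readable as a single literal.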
Email addresses can be on servers on a subdomain, e.g. “john@server.department.company.com”. All
of the above regexes will match this email address, because I included a dot in the character class after the @
symbol. However, the above regexes will also match “john@aol...com” which is not valid due to the
consecutive dots. You can exclude such matches by replacing «[A-Z0-9.-]+\.» with «(?:[A-Z0-9-]+\.)+»
in any of the above regexes. I removed the dot from the character class and instead repeated the
character class and the following literal dot. E.g. «\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}\b»
will match “john@server.department.company.com” but not “john@aol...com”.
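The difference between the two domain patterns can be checked with a short Python sketch. The names `LOOSE`, `STRICT` and `matches_whole` are invented for this example.

```python
import re

# Dot allowed inside the character class: consecutive dots slip through.
LOOSE = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b",
                   re.IGNORECASE)

# Dot removed from the class; each label must be followed by one dot.
STRICT = re.compile(r"\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}\b",
                    re.IGNORECASE)


def matches_whole(pattern, address):
    """Return True if the pattern matches the entire address."""
    return pattern.fullmatch(address) is not None
```

Both patterns accept the subdomain address, but only `LOOSE` accepts the one with consecutive dots.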
Another trade-off is that my regex only allows English letters, digits and a few special symbols. The main
reason is that I don’t trust all my email software to be able to handle much else. Even though
John.O'Hara@theoharas.com is a syntactically valid email address, there’s a risk that some software will
misinterpret the apostrophe as a delimiting quote. E.g. blindly inserting this email address into a SQL
statement will cause it to fail if strings are delimited with single quotes. And of course, it’s been many
years already that domain names can include non-English characters. Most software and even domain name
registrars, however, still stick to the 37 characters they’re used to.
The conclusion is that to decide which regular expression to use, whether you’re trying to match an email
address or something else that’s vaguely defined, you need to start by considering all the trade-offs. How
bad is it to match something that’s not valid? How bad is it not to match something that is valid? How
complex can your regular expression be? How expensive would it be if you had to change the regular
expression later? Different answers to these questions will require a different regular expression as the
solution. My email regex does what I want, but it may not do what you want.
Regexes Don’t Send Email
Don’t go overboard in trying to eliminate invalid email addresses with your regular expression. If you have to
accept .museum domains, allowing any 6-letter top level domain is often better than spelling out a list of all
current domains. The reason is that you don’t really know whether an address is valid until you try to send an
email to it. And even that might not be enough. Even if the email arrives in a mailbox, that doesn’t mean
somebody still reads that mailbox.
The same principle applies in many situations. When trying to match a valid date, it’s often easier to use a bit
of arithmetic to check for leap years, rather than trying to do it in a regex. Use a regular expression to find
potential matches or check if the input uses the proper syntax, and do the actual validation on the potential
matches returned by the regular expression. Regular expressions are a powerful tool, but they’re far from a
panacea.
The Official Standard: RFC 2822
Maybe you’re wondering why there’s no “official” fool-proof regex to match email addresses. Well, there is
an official definition, but it’s hardly fool-proof.
The official standard is known as RFC 2822. It describes the syntax that valid email addresses must adhere to.
You can (but you shouldn’t--read on) implement it with this regular expression:
«(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])»
This regex has two parts: the part before the @, and the part after the @. There are two alternatives for the
part before the @: it can either consist of a series of letters, digits and certain symbols, including one or more
dots. However, dots may not appear consecutively or at the start or end of the email address. The other
alternative requires the part before the @ to be enclosed in double quotes, allowing any string of ASCII
characters between the quotes. Whitespace characters, double quotes and backslashes must be escaped with
backslashes.
The part after the @ also has two alternatives. It can either be a fully qualified domain name (e.g. regular-
expressions.info), or it can be a literal Internet address between square brackets. The literal Internet address
can either be an IP address, or a domain-specific routing address.
The reason you shouldn’t use this regex is that it only checks the basic syntax of email addresses.
john@aol.com.nospam would be considered a valid email address according to RFC 2822. Obviously, this
email address won’t work, since there’s no “nospam” top-level domain. It also doesn’t guarantee your email
software will be able to handle it. Not all applications support the syntax using double quotes or square
brackets. In fact, RFC 2822 itself marks the notation using square brackets as obsolete.
We get a more practical implementation of RFC 2822 if we omit the syntax using double quotes and square
brackets. It will still match 99.99% of all email addresses in actual use today.
«[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?»
A further change you could make is to allow any two-letter country code top level domain, and only specific
generic top level domains. This regex filters dummy email addresses like asdf@adsf.adsf. You will need to
update it as new top-level domains are added.
«[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)\b»
So even when following official standards, there are still trade-offs to be made. Don’t blindly copy regular
expressions from online libraries or discussion forums. Always test them on your own data and with your
own applications.
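Testing a regex on your own data can be as simple as the Python sketch below, which runs the practical RFC 2822 regex (the variant without double quotes and square brackets) against a few sample addresses. The pattern is split into named pieces purely for readability; `LOCAL`, `LABEL` and `is_syntactically_valid` are hypothetical names.

```python
import re

# Part before the @: runs of allowed characters separated by single dots.
LOCAL = r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*"

# One domain label: starts and ends with a letter or digit.
LABEL = r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"

PRACTICAL_RFC2822 = re.compile(
    LOCAL + r"@(?:" + LABEL + r"\.)+" + LABEL,
    re.IGNORECASE,
)


def is_syntactically_valid(address):
    """Check syntax only; says nothing about whether mail can be delivered."""
    return PRACTICAL_RFC2822.fullmatch(address) is not None
```

Running it on john@aol.com.nospam returns True, which illustrates the point made above: correct syntax does not mean the address works.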
4. Matching a Valid Date
«^(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$» matches a date in
yyyy-mm-dd format from between 1900-01-01 and 2099-12-31, with a choice of four separators. The anchors
make sure the entire variable is a date, and not a piece of text containing a date. The year is matched by
«(19|20)\d\d». I used alternation to allow the first two digits to be 19 or 20. The round brackets are
mandatory. Had I omitted them, the regex engine would go looking for 19 or the remainder of the regular
expression, which matches a date between 2000-01-01 and 2099-12-31. Round brackets are the only way to
stop the vertical bar from splitting up the entire regular expression into two options.
The month is matched by «0[1-9]|1[012]», again enclosed by round brackets to keep the two options
together. By using character classes, the first option matches a number between 01 and 09, and the second
matches 10, 11 or 12.
The last part of the regex consists of three options. The first matches the numbers 01 through 09, the second
10 through 29, and the third matches 30 or 31.
Smart use of alternation allows us to exclude invalid dates such as 2000-00-00 that could not have been
excluded without using alternation. To be really perfectionist, you would have to split up the month into
various options to take into account the length of the month. The above regex still matches 2003-02-31,
which is not a valid date. Making leading zeros optional could be another enhancement.
If you want to require the delimiters to be consistent, you could use a backreference.
«^(19|20)\d\d([- /.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])$» will match “1999-01-01” but not
“1999/01-01”.
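A quick Python sketch shows the effect of the backreference. The names `DATE_ANY_SEP` and `DATE_SAME_SEP` are invented for this example; the patterns are the two regexes from the text.

```python
import re

# Each separator is chosen independently.
DATE_ANY_SEP = re.compile(
    r"^(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$")

# Group 2 captures the first separator; \2 requires the same one again.
DATE_SAME_SEP = re.compile(
    r"^(19|20)\d\d([- /.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])$")


def uses_consistent_separators(text):
    """Return True if text is a date whose two separators are identical."""
    return DATE_SAME_SEP.match(text) is not None
```

Note that neither pattern knows how long each month is, so 2003-02-31 still passes both.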
Again, how complex you want to make your regular expression depends on the data you are using it on, and
how big a problem it is if an unwanted match slips through. If you are validating the user’s input of a date in a
script, it is probably easier to do certain checks outside of the regex. For example, excluding February 29th
when the year is not a leap year is far easier to do in a scripting language. It is far easier to check if a year is
divisible by 4 (and not divisible by 100 unless divisible by 400) using simple arithmetic than using regular
expressions.
Here is how you could check a valid date in Perl. I also added round brackets to capture the year into a
backreference.
sub isvaliddate {
  my $input = shift;
  if ($input =~ m!^((?:19|20)\d\d)[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$!) {
    # At this point, $1 holds the year, $2 the month and $3 the day of the date entered
    if ($3 == 31 and ($2 == 4 or $2 == 6 or $2 == 9 or $2 == 11)) {
      return 0; # 31st of a month with 30 days
    } elsif ($3 >= 30 and $2 == 2) {
      return 0; # February 30th or 31st
    } elsif ($2 == 2 and $3 == 29 and not ($1 % 4 == 0 and ($1 % 100 != 0 or $1 % 400 == 0))) {
      return 0; # February 29th outside a leap year
    } else {
      return 1; # Valid date
    }
  } else {
    return 0; # Not a date
  }
}
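The same split between syntax check and calendar check can be sketched in Python. This variant is not a translation of the Perl code: it swaps the hand-written month and leap year arithmetic for the standard library’s `datetime.date`, which raises `ValueError` for impossible dates. The names `DATE_SYNTAX` and `is_valid_date` are hypothetical.

```python
import re
from datetime import date

# Same regex as the Perl example; re.ASCII keeps \d limited to 0-9.
DATE_SYNTAX = re.compile(
    r"^((?:19|20)\d\d)[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$",
    re.ASCII,
)


def is_valid_date(text):
    """Regex checks the syntax; datetime.date checks the calendar."""
    match = DATE_SYNTAX.match(text)
    if match is None:
        return False  # Not a date
    year, month, day = (int(group) for group in match.groups())
    try:
        date(year, month, day)
    except ValueError:
        return False  # e.g. April 31st or February 29th outside a leap year
    return True
```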
To match a date in mm/dd/yyyy format, rearrange the regular expression to
«^(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d$». For dd-mm-yyyy format, use
«^(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.](19|20)\d\d$». You can find additional
variations of these regexes in RegexBuddy’s library.
5. Finding or Verifying Credit Card Numbers
With a few simple regular expressions, you can easily verify whether your customer entered a valid credit card
number on your order form. You can even determine the type of credit card being used. Each card issuer has
its own range of card numbers, identified by the first 4 digits.
You can use a slightly different regular expression to find credit card numbers, or number sequences that
might be credit card numbers, within larger documents. This can be very useful to prove in a security audit
that you’re not improperly exposing your clients’ financial details.
We’ll start with the order form.
Stripping Spaces and Dashes
The first step is to remove all non-digits from the card number entered by the customer. Physical credit cards
have spaces within the card number to group the digits, making it easier for humans to read or type in. So
your order form should accept card numbers with spaces or dashes in them.
To remove all non-digits from the card number, simply use the “replace all” function in your scripting
language to search for the regex «[^0-9]+» and replace it with nothing. If you only want to replace spaces
and dashes, you could use «[ -]+». If this regex looks odd, remember that in a character class, the hyphen is
a literal when it occurs right before the closing bracket (or right after the opening bracket or negating caret).
If you’re wondering what the plus is for: that’s for performance. If the input has consecutive non-digits, e.g.
“1===2”, then the regex will match the three equals signs at once, and delete them in one replacement.
Without the plus, three replacements would be required. In this case, the savings are only a few
microseconds. But it’s a good habit to keep regex efficiency in the back of your mind. Though the savings are
minimal here, so is the effort of typing the extra plus.
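In Python, the “replace all” function is `re.sub`. A minimal sketch of both variants follows; the function names are made up for this example.

```python
import re


def strip_non_digits(card_input):
    """Remove everything that is not a digit 0-9."""
    return re.sub(r"[^0-9]+", "", card_input)


def strip_spaces_and_dashes(card_input):
    """Remove only spaces and dashes; other characters are kept."""
    return re.sub(r"[ -]+", "", card_input)
```

The second function leaves “1===2” untouched, so stray characters can still be reported to the customer as an error.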
Validating Credit Card Numbers on Your Order Form
Validating credit card numbers is the ideal job for regular expressions. They’re just a sequence of 13 to 16
digits, with a few specific digits at the start that identify the card issuer. You can use the specific regular
expressions below to alert customers when they try to use a kind of card you don’t accept, or to route orders
using different cards to different processors. All these regexes were taken from RegexBuddy’s library.
Visa: «^4[0-9]{12}(?:[0-9]{3})?$» All Visa card numbers start with a 4. New cards have 16
digits. Old cards have 13.
MasterCard: «^5[1-5][0-9]{14}$» All MasterCard numbers start with the numbers 51 through 55.
All have 16 digits.
American Express: «^3[47][0-9]{13}$» American Express card numbers start with 34 or 37 and
have 15 digits.
Diners Club: «^3(?:0[0-5]|[68][0-9])[0-9]{11}$» Diners Club card numbers begin with 300
through 305, 36 or 38. All have 14 digits. There are Diners Club cards that begin with 5 and have 16
digits. These are a joint venture between Diners Club and MasterCard, and should be processed like a
MasterCard.
Discover: «^6(?:011|5[0-9]{2})[0-9]{12}$» Discover card numbers begin with 6011 or 65.
All have 16 digits.
JCB: «^(?:2131|1800|35\d{3})\d{11}$» JCB cards beginning with 2131 or 1800 have 15 digits.
JCB cards beginning with 35 have 16 digits.
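The six regexes above can be put in a lookup table to determine the card brand. The sketch below is in Python and assumes the input has already been stripped of spaces and dashes; `CARD_PATTERNS` and `card_brand` are hypothetical names.

```python
import re

# Brand name mapped to the regex listed above for that brand.
CARD_PATTERNS = {
    "Visa": re.compile(r"^4[0-9]{12}(?:[0-9]{3})?$"),
    "MasterCard": re.compile(r"^5[1-5][0-9]{14}$"),
    "American Express": re.compile(r"^3[47][0-9]{13}$"),
    "Diners Club": re.compile(r"^3(?:0[0-5]|[68][0-9])[0-9]{11}$"),
    "Discover": re.compile(r"^6(?:011|5[0-9]{2})[0-9]{12}$"),
    "JCB": re.compile(r"^(?:2131|1800|35\d{3})\d{11}$"),
}


def card_brand(digits):
    """Return the first brand whose regex matches, or None."""
    for brand, pattern in CARD_PATTERNS.items():
        if pattern.match(digits):
            return brand
    return None
```

A return value of `None` can be used to tell the customer the card type is not accepted, and the brand name can be used to route the order to a processor.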
If you just want to check whether the card number looks valid, without determining the brand, you can
combine the above six regexes into
«^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|6(?:011|5[0-9][0-9])[0-9]{12}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11}|(?:2131|1800|35\d{3})\d{11})$».
You’ll see I’ve simply alternated all the regexes, and used a
non-capturing group to put the anchors outside the alternation. You can easily delete the card types you don’t
accept from the list.
These regular expressions will easily catch numbers that are invalid because the customer entered too many or
too few digits. They won’t catch numbers with incorrect digits. For that, you need to follow the Luhn
algorithm, which cannot be done with a regex. And of course, even if the number is mathematically valid, that
doesn’t mean a card with this number was issued or that there’s money in the account. The benefit of the
regular expression is that you can put it in a bit of JavaScript to instantly check for obvious errors, instead of
making the customer wait 30 seconds for your credit card processor to fail the order. And if your card
processor charges for failed transactions, you’ll really want to implement both the regex and the Luhn
validation.
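For reference, the Luhn check itself is only a few lines of code. The Python sketch below assumes the input has already been stripped of spaces and dashes; the function name `luhn_valid` is made up for this example.

```python
import re


def luhn_valid(digits):
    """Return True if the digit string passes the Luhn checksum."""
    if re.fullmatch(r"[0-9]+", digits) is None:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9  # same as adding the two digits of the product
        total += value
    return total % 10 == 0
```

A reasonable order of checks is: strip separators, match the brand regex, then run the Luhn check, and only then contact the card processor.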
Finding Credit Card Numbers in Documents
With two simple modifications, you could use any of the above regexes to find card numbers in larger
documents. Simply replace the caret and dollar with a word boundary, e.g.: «\b4[0-9]{12}(?:[0-9]{3})?\b».
If you’re planning to search a large document server, a simpler regular expression will speed up the search.
Unless your company uses 16-digit numbers for other purposes, you’ll have few false positives. The regex
«\b\d{13,16}\b» will find any sequence of 13 to 16 digits.
When searching a hard disk full of files, you can’t strip out spaces and dashes first like you can when
validating a single card number. To find card numbers with spaces or dashes in them, use
«\b(?:\d[ -]*?){13,16}\b». This regex allows any amount of spaces and dashes anywhere in the number. This is really
the only way. Visa and MasterCard put digits in sets of 4, while Amex and Discover use groups of 4, 5 and 6
digits. People typing in the numbers may have different ideas yet.
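A document scan along these lines could look like the Python sketch below. It finds candidate numbers with the regex above and strips the separators afterwards, so the results can be passed on to stricter checks. The names `CARD_CANDIDATE` and `find_card_candidates` are hypothetical.

```python
import re

# 13 to 16 digits, each optionally followed by spaces or dashes.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]*?){13,16}\b")


def find_card_candidates(text):
    """Return digit-only strings for every candidate found in text."""
    return [re.sub(r"[ -]+", "", match.group())
            for match in CARD_CANDIDATE.finditer(text)]
```

The matches are only candidates: any run of 13 to 16 digits qualifies, so expect false positives such as order numbers or phone numbers written next to other digits.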