REGULAR EXPRESSIONS
3.2.6 Problem: Identifying URLs and email addresses in texts
A neat feature of many Internet and news clients is their automatic identification of
resources that the applications can act upon. For URL resources, this usually means
making the links “clickable”; for an email address it usually means launching a new
letter to the person at the address. Depending on the nature of an application, you could
perform other sorts of actions for each identified resource. For a text processing
application, the use of a resource is likely to be something more batch-oriented: extraction,
transformation, indexing, or the like.
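As a rough illustration of the batch-oriented case, the sketch below extracts URL-like strings from a text and, as a simple transformation, wraps each one in angle brackets. The pattern `simple_url` and the helper names `extract_urls` and `mark_urls` are hypothetical and deliberately simplified; they are assumptions made for this example, not the utility developed in this section.

```python
import re

# Hypothetical, deliberately simplified pattern: a scheme, "://", then a
# run of characters that are not whitespace, quotes, or angle brackets,
# ending in a word character or a slash (so trailing punctuation is dropped).
simple_url = re.compile(r'(?:http|ftp)://[^\s"<>]+[\w/]')

def extract_urls(text):
    """Return every URL-like substring of text, in order of appearance."""
    return simple_url.findall(text)

def mark_urls(text):
    """Return text with each URL-like substring wrapped in angle brackets."""
    return simple_url.sub(lambda m: '<' + m.group(0) + '>', text)

sample = 'See http://gnosis.cx/TPiP/ and ftp://example.org/pub/file.txt, please.'
print(extract_urls(sample))
print(mark_urls(sample))
```

Requiring the final character to be a word character or slash is what keeps the comma after `file.txt` out of the second match; the same idea appears, in a more careful form, in the patterns that follow.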
Fully and precisely implementing RFC 822 (for email addresses) or RFC 1738 (for
URLs) is possible within regular expressions. But doing so is probably even more work
than is really needed to identify 99% of resources. Moreover, a significant number of
resources in the “real world” are not strictly compliant with the relevant RFCs—most
applications give a certain leeway to “almost correct” resource identifiers. The utility
below tries to strike approximately the same balance as other well-implemented and
practical applications: get almost everything that was intended to look like a resource,
and almost nothing that was intended not to:
find_urls.py
# Functions to identify and extract URLs and email addresses
import re, fileinput
pat_url = re.compile(r'''(?x)(   # verbose identify URLs within text
    (http|ftp|gopher)     # make sure we find a resource type
    ://                   # ...needs to be followed by colon-slash-slash
    (\w+[:.]?){2,}        # at least two domain groups, e.g. (gnosis.)(cx)
    (/?|                  # could be just the domain name (maybe w/ slash)
    [^ \n\r"]+            # or stuff up to a space, newline, return, or quote
    [\w/])                # resource name ends in alphanumeric or slash
    (?=[\s\.,>)'"\]])     # assert: followed by white or clause ending
    )                     # end of match group
    ''')
pat_email = re.compile(r'''(?xm) # verbose identify email addresses in text (and multiline)
    (?=^.{11}             # Mail header matcher
    (?<!Message-ID:|      # rule out Message-ID's as best possible
        In-Reply-To))     # ...and also In-Reply-To
    (.*?)(                # must grab text up to the address to allow prior lookbehind
    ([A-Za-z0-9-]+\.)?    # maybe an initial part: DAVID.mertz@gnosis.cx
    [A-Za-z0-9-]+         # definitely some local user: MERTZ@gnosis.cx
    @                     # ...needs an at sign in the middle
    (\w+\.?){2,}          # at least two domain groups, e.g. (gnosis.)(cx)
    (?=[\s\.,>)'"\]])     # assert: followed by white or clause ending