WEB ACCESSING AND DATA EXTRACTION TOOLS
To access a webpage, we usually use a web browser, such as Internet Explorer, Firefox, Google Chrome, etc.
Web browsers have many functions, but they are designed for human interaction, such as browsing and clicking
a link to go to the next page; they are not designed for automatic data extraction. On the other hand, if you are a programmer,
you might know tools such as the cURL command line program and the LWP (library for WWW in Perl) package.
Unlike web browsers, these programs are for automated Internet access. Although they are not browsers themselves,
they can mimic most web browser functions. While cURL is mainly for web access, the LWP package, combined with
the powerful Perl regular expression functions for information extraction, brings both web access and
data extraction together, and thus can be an ideal tool for these purposes. Both cURL and LWP are
freely available. cURL can be downloaded from its project website. LWP is obtained by installing the Perl
programming language, which includes LWP as a default package.
SAS itself provides several Internet access tools, mainly through the FILENAME statement. The two
main ones are the FILENAME URL and FILENAME SOCKET statements. FILENAME URL is for basic access
to webpages using the GET request method. It can handle both HTTP and HTTPS, as well as
webpages that require authentication (such as a username and password). However, FILENAME URL cannot handle the
POST request method. FILENAME SOCKET is for more complicated webpages and can handle both the GET and
POST methods. To use the FILENAME SOCKET statement, you need to understand the HTTP header. Helf (2005)
provides a detailed explanation of how to use this statement. The other FILENAME statements are FILENAME
FTP and FILENAME EMAIL. As their names suggest, these two statements are for FTP
and email access only.
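As a minimal sketch of the FILENAME URL approach described above, the step below reads a page with a GET request and stores each line of the returned HTML as one observation. The URL, the fileref WEBPAGE, and the data set name WORK.PAGE are placeholders chosen for illustration, not names from this paper.

```sas
/* Hypothetical URL, used for illustration only */
filename webpage url "http://www.example.com/";

/* Read the returned HTML one line at a time into a data set */
data work.page;
    infile webpage lrecl=32767 truncover;
    input line $char32767.;
run;

/* Release the fileref when done */
filename webpage clear;
```

For a page that requires authentication, the FILENAME URL statement also accepts the USER= and PASS= options. Lines longer than 32,767 characters would be truncated in this sketch.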
Many web access tasks can be handled by these SAS FILENAME statements. But in some situations, especially
with webpages using the POST request method, it is difficult or too complicated to use these statements,
and we would rather use the cURL program or the LWP package. Fortunately, SAS provides two mechanisms to
integrate external programs easily: the X statement and the FILENAME PIPE statement. The X statement lets
you run a command line program (such as cURL) and then returns to SAS after the command line program
finishes execution. You can use the NOXWAIT system option to return to SAS automatically. The FILENAME
PIPE statement lets you run the external program and feed the results from that program directly into SAS through an
unnamed pipe. You can think of this unnamed pipe as reading from a local text file, except that the content of this file
is dynamically generated by the external program. With the X statement, if there are any results that need
to be processed by SAS, you need to save them in a local file and then use a FILENAME statement to read this file.
With the FILENAME PIPE statement, there is no need to save the results in an intermediate file. The following
examples illustrate how to use these two statements; with the X statement, the results are saved in a local
file called savedfile.txt.
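A minimal sketch of both approaches follows. It assumes that cURL is installed and on the system path, that the SAS session is allowed to run operating system commands, and that the X command and SAS share the same working directory. The URL and the data set names are placeholders chosen for illustration.

```sas
/* Approach 1: X statement. cURL saves the page to a local file, */
/* and a regular FILENAME statement then reads that file.        */
options noxwait;   /* return to SAS automatically (Windows)      */
x 'curl -s -o savedfile.txt "http://www.example.com/"';

filename saved "savedfile.txt";
data work.page_x;
    infile saved lrecl=32767 truncover;
    input line $char32767.;
run;

/* Approach 2: FILENAME PIPE. The output of cURL is read directly */
/* through an unnamed pipe, with no intermediate file.            */
filename curlout pipe 'curl -s "http://www.example.com/"';
data work.page_pipe;
    infile curlout lrecl=32767 truncover;
    input line $char32767.;
run;
```

Both data sets should hold the same lines of HTML; the pipe version simply avoids leaving savedfile.txt on disk.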
For data extraction, SAS has many string functions and call routines. We can use functions such as FIND(),
INDEX(), and INDEXC() to locate the position of the target information, and then use functions such as SCAN() and
SUBSTR() to extract that information into a data set. Starting with Version 9, SAS provides a group of functions
and call routines for Perl regular expressions, all beginning with the prefix PRX: PRXCHANGE() for replacement,
PRXMATCH() for searching, CALL PRXNEXT for repeated extraction, etc. With the versatile Perl regular
expressions, these functions give SAS much more flexible and powerful data extraction capabilities. Interested
readers should check out the SAS documentation for these functions.
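As a small sketch of the PRX functions, the step below uses PRXPARSE() to compile a pattern, PRXMATCH() to search each line, and PRXPOSN() to pull out the captured text. The two HTML lines in the DATALINES block and the pattern itself are made up for illustration; a real page would need a pattern written for its own markup.

```sas
data work.extracted;
    length line $200 value $20;
    retain re;
    infile datalines truncover;
    /* Compile the pattern once; it captures a decimal number in a <td> cell */
    if _n_ = 1 then re = prxparse('/<td>(\d+\.\d+)<\/td>/');
    input line $char200.;
    /* Keep only lines that match, and extract capture buffer 1 */
    if prxmatch(re, line) then do;
        value = prxposn(re, 1, line);
        output;
    end;
    keep value;
    datalines;
<tr><td>2012-01-01</td><td>3.85</td></tr>
<tr><td>no number here</td></tr>
;
run;
```

With these two sample lines, only the first one matches, so WORK.EXTRACTED contains a single observation whose VALUE is 3.85.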
EXAMPLE 1: DOWNLOAD .CSV FILE
Suppose you want to download Moody’s seasoned AAA corporate bond yield from the FRED (Federal Reserve
Economic Data) website. The URL for this time series is
SAS Global Forum 2012