11.1 Introduction
Most of the time, PHP is part of a web server, sending content to browsers. Even when you
run it from the command line, it usually performs a task and then prints some output. PHP can
also be useful, however, playing the role of a web browser — retrieving URLs and then
operating on the content. Most recipes in this chapter cover retrieving URLs and processing
the results, although there are a few other tasks in here as well, such as using templates and
processing server logs.
There are four ways to retrieve a remote URL in PHP. Choosing one method over another
depends on your needs for simplicity, control, and portability. The four methods are to use
fopen( ), fsockopen( ), the cURL extension, or the HTTP_Request class from PEAR.
Using fopen( ) is simple and convenient. We discuss it in Recipe 11.2. The fopen( ) function
automatically follows redirects, so if you use this function to retrieve the directory
http://www.example.com/people and the server redirects you to
http://www.example.com/people/, you'll get the contents of the directory index page, not a
message telling you that the URL has moved. The fopen( ) function also works with both
HTTP and FTP. The downsides to fopen( ) are that it can handle only HTTP GET requests
(not HEAD or POST), you can't send additional headers or any cookies with the request, and
you can retrieve only the response body with it, not response headers.
Using fsockopen( ) requires more work but gives you more flexibility. We use fsockopen( )
in Recipe 11.3. After opening a socket with fsockopen( ), you need to print the
appropriate HTTP request to that socket and then read and parse the response. This lets you
add headers to the request and gives you access to all the response headers. However, you
need to have additional code to properly parse the response and take any appropriate action,
such as following a redirect.
If you have access to the cURL extension or PEAR's HTTP_Request class, you should use
those rather than fsockopen( ). cURL supports a number of different protocols (including
HTTPS, discussed in Recipe 11.6) and gives you access to response headers. We use cURL in
most of the recipes in this chapter. To use cURL, you must have the cURL library installed,
available at http://curl.haxx.se. Also, PHP must be built with the --with-curl configuration
option.
PEAR's HTTP_Request class, which we use in Recipe 11.3, Recipe 11.4, and Recipe 11.5,
doesn't support HTTPS, but does give you access to headers and can use any HTTP method. If
this PEAR module isn't installed on your system, you can download it from
http://pear.php.net/get/HTTP_Request. As long as the module's files are in your
include_path, you can use it, making it a very portable solution.
Recipe 11.7 helps you go behind the scenes of an HTTP request to examine the headers in a
request and response. If a request you're making from a program isn't giving you the results
you're looking for, examining the headers often provides clues as to what's wrong.
Once you've retrieved the contents of a web page into a program, use Recipe 11.8 through
Recipe 11.12 to help you manipulate those page contents. Recipe 11.8 demonstrates how to
mark up certain words in a page with blocks of color. This technique is useful for highlighting
search terms, for example. Recipe 11.9 provides a function to find all the links in a page. This
is an essential building block for a web spider or a link checker. Converting between plain
ASCII and HTML is covered in Recipe 11.10 and Recipe 11.11. Recipe 11.12 shows how to
remove all HTML and PHP tags from a web page.
Another kind of page manipulation is using a templating system. Discussed in Recipe 11.13,
templates give you freedom to change the look and feel of your web pages without changing
the PHP plumbing that populates the pages with dynamic data. Similarly, you can make
changes to the code that drives the pages without affecting the look and feel. Recipe 11.14
discusses a common server administration task: parsing your web server's access log files.
Two sample programs use the link extractor from Recipe 11.9. The program in Recipe 11.15
scans the links in a page and reports which are still valid, which have been moved, and which
no longer work. The program in Recipe 11.16 reports on the freshness of links. It tells you
when a linked-to page was last modified and if it's been moved.
Recipe 11.2 Fetching a URL with the GET Method
11.2.1 Problem
You want to retrieve the contents of a URL. For example, you want to include part of one web
page in another page's content.
11.2.2 Solution
Pass the URL to fopen( ) and get the contents of the page with fread( ):
$page = '';
$fh = fopen('http://www.example.com/robots.txt','r') or die($php_errormsg);
while (! feof($fh)) {
    $page .= fread($fh,1048576);
}
fclose($fh);
You can use the cURL extension:
$c = curl_init('http://www.example.com/robots.txt');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
$page = curl_exec($c);
curl_close($c);
You can also use the HTTP_Request class from PEAR:
require 'HTTP/Request.php';
$r = new HTTP_Request('http://www.example.com/robots.txt');
$r->sendRequest();
$page = $r->getResponseBody();
11.2.3 Discussion
You can put a username and password in the URL if you need to retrieve a protected page. In
this example, the username is david, and the password is hax0r. Here's how to do it with
fopen( ):
$page = '';
$fh = fopen('http://david:hax0r@www.example.com/secrets.html','r')
    or die($php_errormsg);
while (! feof($fh)) {
    $page .= fread($fh,1048576);
}
fclose($fh);
Here's how to do it with cURL:
$c = curl_init('http://www.example.com/secrets.html');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_USERPWD, 'david:hax0r');
$page = curl_exec($c);
curl_close($c);
Here's how to do it with HTTP_Request:
$r = new HTTP_Request('http://www.example.com/secrets.html');
$r->setBasicAuth('david','hax0r');
$r->sendRequest();
$page = $r->getResponseBody();
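With HTTP Basic authentication, the credentials travel in an Authorization request header. As a rough sketch of what that header contains, the hypothetical helper basic_auth_header( ) below (not part of PHP, cURL, or PEAR) builds the value by hand:

```php
<?php
// Hypothetical helper (not part of PHP or PEAR): build the value of an
// Authorization header for HTTP Basic authentication.
function basic_auth_header($user, $password) {
    // Basic auth is just "user:password", base64-encoded
    return 'Basic ' . base64_encode($user . ':' . $password);
}

$header = basic_auth_header('david', 'hax0r');
// $header can be sent as "Authorization: $header"
```

Base64 is an encoding, not encryption, so anyone who can see the request can recover the password. Prefer a secure URL (see Recipe 11.6) when sending credentials this way.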
While fopen( ) follows redirects in Location response headers, HTTP_Request does not.
cURL follows them only when the CURLOPT_FOLLOWLOCATION option is set:
$c = curl_init('http://www.example.com/directory');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_FOLLOWLOCATION, 1);
$page = curl_exec($c);
curl_close($c);
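When a client doesn't follow redirects for you, your own code has to notice the redirect and request the new location. The sketch below assumes you already have the response status code and an array of response headers keyed by header name; redirect_target( ) is a hypothetical helper, not part of cURL or HTTP_Request:

```php
<?php
// Hypothetical helper: given a status code and an array of response
// headers (name => value), return the redirect target or null.
function redirect_target($code, $headers) {
    // status codes that signal a redirect with a Location header
    $redirect_codes = array(301, 302, 303, 307, 308);
    if (! in_array((int) $code, $redirect_codes, true)) {
        return null;
    }
    // header names are case-insensitive, so normalize before comparing
    foreach ($headers as $name => $value) {
        if (strtolower($name) == 'location') {
            return $value;
        }
    }
    return null;
}
```

If the function returns a URL, make a new request to it. A Location value may be relative, in which case it has to be resolved against the original URL first; this sketch doesn't do that.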
cURL can do a few different things with the page it retrieves. If the
CURLOPT_RETURNTRANSFER option is set, curl_exec( ) returns a string containing the
page:
$c = curl_init('http://www.example.com/files.html');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
$page = curl_exec($c);
curl_close($c);
To write the retrieved page to a file, open a file handle for writing with fopen( ) and set the
CURLOPT_FILE option to the file handle:
$fh = fopen('local-copy-of-files.html','w') or die($php_errormsg);
$c = curl_init('http://www.example.com/files.html');
curl_setopt($c, CURLOPT_FILE, $fh);
curl_exec($c);
curl_close($c);
// close the file so everything cURL wrote is flushed to disk
fclose($fh);
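To pass the cURL resource and the contents of the retrieved page to a function, set the
CURLOPT_WRITEFUNCTION option to the name of the function. cURL may call the function
more than once, each time with the next piece of the page, and the function must return
the number of bytes it handled:
// save the URL and each piece of the page contents in a database
function save_page($c,$page) {
    $info = curl_getinfo($c);
    mysql_query("INSERT INTO pages (url,page) VALUES ('" .
                mysql_escape_string($info['url']) . "', '" .
                mysql_escape_string($page) . "')");
    // tell cURL how many bytes were handled; otherwise it aborts the transfer
    return strlen($page);
}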
$c = curl_init('http://www.example.com/files.html');
curl_setopt($c, CURLOPT_WRITEFUNCTION, 'save_page');
curl_exec($c);
curl_close($c);
If none of CURLOPT_RETURNTRANSFER, CURLOPT_FILE, or CURLOPT_WRITEFUNCTION is
set, cURL prints out the contents of the returned page.
The fopen( ) function and the include and require directives can retrieve remote files only
if URL fopen wrappers are enabled. URL fopen wrappers are enabled by default and are
controlled by the allow_url_fopen configuration directive. On Windows, however, include
and require can't retrieve remote files in versions of PHP earlier than 4.3, even if
allow_url_fopen is on.
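A script that depends on remote fopen( ) can check the setting before relying on it. This is a minimal sketch; url_fopen_enabled( ) is a hypothetical helper name, and it assumes a PHP version that provides filter_var( ):

```php
<?php
// Hypothetical helper: report whether URL fopen wrappers are enabled.
function url_fopen_enabled() {
    // ini_get() returns the setting as a string such as "1", "0", or ""
    return filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
}

$can_fopen_urls = url_fopen_enabled();
// if $can_fopen_urls is false, fall back to cURL or HTTP_Request
```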
11.2.4 See Also
Recipe 11.3 for fetching a URL with the POST method; Recipe 8.13 discusses opening remote
files with fopen( ); documentation on fopen( ) at http://www.php.net/fopen, include at
http://www.php.net/include, curl_init( ) at http://www.php.net/curl-init, curl_setopt( )
at http://www.php.net/curl-setopt, curl_exec( ) at http://www.php.net/curl-exec, and
curl_close( ) at http://www.php.net/curl-close; the PEAR HTTP_Request class at
http://pear.php.net/package-info.php?package=HTTP_Request.
Recipe 11.3 Fetching a URL with the POST Method
11.3.1 Problem
You want to retrieve a URL with the POST method, not the default GET method. For example,
you want to submit an HTML form.
11.3.2 Solution
Use the cURL extension with the CURLOPT_POST option set:
$c = curl_init('http://www.example.com/submit.php');
curl_setopt($c, CURLOPT_POST, 1);
curl_setopt($c, CURLOPT_POSTFIELDS, 'monkey=uncle&rhino=aunt');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
$page = curl_exec($c);
curl_close($c);
If the cURL extension isn't available, use the PEAR HTTP_Request class:
require 'HTTP/Request.php';
$r = new HTTP_Request('http://www.example.com/submit.php');
$r->setMethod(HTTP_REQUEST_METHOD_POST);
$r->addPostData('monkey','uncle');
$r->addPostData('rhino','aunt');
$r->sendRequest();
$page = $r->getResponseBody();
11.3.3 Discussion
Sending a POST method request requires special handling of any arguments. In a GET
request, these arguments are in the query string, but in a POST request, they go in the
request body. Additionally, the request needs a Content-Length header that tells the server
the size of the content to expect in the request body.
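To make the request body and its length concrete, here is a small sketch that encodes the two form fields from the Solution and computes the matching headers. It assumes a PHP version that provides http_build_query( ); the variable names are illustrative only:

```php
<?php
// Build the body of a form-style POST request and the two headers
// that describe it. The field names are the ones used in the Solution.
$fields = array('monkey' => 'uncle', 'rhino' => 'aunt');

// http_build_query() URL-encodes each name and value; the explicit '&'
// avoids depending on the arg_separator.output setting
$body = http_build_query($fields, '', '&');

$headers = array(
    'Content-Type: application/x-www-form-urlencoded',
    'Content-Length: ' . strlen($body),
);
```

Content-Length counts bytes in the encoded body, which is why it is computed with strlen( ) after encoding rather than from the original values.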
Because of the argument handling and additional headers, you can't use fopen( ) to make a
POST request. If neither cURL nor HTTP_Request is available, use the pc_post_request( )
function, shown in Example 11-1, which makes the connection to the remote web server
with fsockopen( ).
Example 11-1. pc_post_request( )
function pc_post_request($host,$url,$content='') {
    $timeout = 2;
    $a = array();
    if (is_array($content)) {
        foreach ($content as $k => $v) {
            array_push($a,urlencode($k).'='.urlencode($v));
        }
    }
    $content_string = join('&',$a);
    $content_length = strlen($content_string);
    // HTTP lines end in CRLF, and a blank line separates headers from body
    $request_body = "POST $url HTTP/1.0\r\n" .
        "Host: $host\r\n" .
        "Content-type: application/x-www-form-urlencoded\r\n" .
        "Content-length: $content_length\r\n" .
        "\r\n" .
        $content_string;

    // fsockopen() fills in $errno and $errstr itself if the connection fails
    $sh = fsockopen($host,80,$errno,$errstr,$timeout)
        or die("can't open socket to $host: $errno $errstr");

    fputs($sh,$request_body);
    $response = '';
    while (! feof($sh)) {
        $response .= fread($sh,16384);
    }
    fclose($sh) or die("Can't close socket handle: $php_errormsg");

    list($response_headers,$response_body) =
        explode("\r\n\r\n",$response,2);

    $response_header_lines = explode("\r\n",$response_headers);

    // first line of headers is the HTTP response code
    $response_code = null;
    $http_response_line = array_shift($response_header_lines);
    if (preg_match('@^HTTP/[0-9]\.[0-9] ([0-9]{3})@',$http_response_line,
                   $matches)) {
        $response_code = $matches[1];
    }

    // put the rest of the headers in an array
    $response_header_array = array();
    foreach ($response_header_lines as $header_line) {
        list($header,$value) = explode(': ',$header_line,2);
        $response_header_array[$header] = $value;
    }

    return array($response_code,$response_header_array,$response_body);
}
Call pc_post_request( ) like this:
list($code,$headers,$body) =
pc_post_request('www.example.com','/submit.php',
array('monkey' => 'uncle',
'rhino' => 'aunt'));
Retrieving a URL with POST instead of GET is especially useful if the URL is very long, more
than 200 characters or so. The HTTP 1.1 specification in RFC 2616 doesn't place a maximum
length on URLs, so behavior varies among different web and proxy servers. If you retrieve
URLs with GET and receive unexpected results or results with status code 414 ("Request-URI
Too Long"), convert the request to a POST request.
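One way to convert such a request is to move the query string out of the URL and into the POST data. The sketch below uses a hypothetical helper, split_url_for_post( ), and assumes the server accepts the same arguments by POST, which isn't true of every application:

```php
<?php
// Hypothetical helper: split a GET URL into the bare URL and the query
// string, so the query string can be sent as POST data instead.
function split_url_for_post($url) {
    // drop any #fragment; it is never sent to the server
    $parts = explode('#', $url, 2);
    $url = $parts[0];
    $pos = strpos($url, '?');
    if ($pos === false) {
        return array($url, '');
    }
    return array(substr($url, 0, $pos), (string) substr($url, $pos + 1));
}

list($post_url, $post_fields) =
    split_url_for_post('http://www.example.com/submit.php?monkey=uncle&rhino=aunt');
// $post_url goes to curl_init(); $post_fields goes to CURLOPT_POSTFIELDS
```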
11.3.4 See Also
Recipe 11.2 for fetching a URL with the GET method; documentation on curl_setopt( ) at
http://www.php.net/curl-setopt and fsockopen( ) at http://www.php.net/fsockopen; the
PEAR HTTP_Request class at http://pear.php.net/package-info.php?package=HTTP_Request;
RFC 2616 is available at http://www.faqs.org/rfcs/rfc2616.html.
Recipe 11.4 Fetching a URL with Cookies
11.4.1 Problem
You want to retrieve a page that requires a cookie to be sent with the request for the page.
11.4.2 Solution
Use the cURL extension and the CURLOPT_COOKIE option:
$c = curl_init('http://www.example.com/needs-cookies.php');
curl_setopt($c, CURLOPT_VERBOSE, 1);
curl_setopt($c, CURLOPT_COOKIE, 'user=ellen; activity=swimming');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
$page = curl_exec($c);
curl_close($c);
If cURL isn't available, use the addHeader( ) method in the PEAR HTTP_Request class:
require 'HTTP/Request.php';
$r = new HTTP_Request('http://www.example.com/needs-cookies.php');
$r->addHeader('Cookie','user=ellen; activity=swimming');
$r->sendRequest();
$page = $r->getResponseBody();
11.4.3 Discussion
Cookies are sent to the server in the Cookie request header. The cURL extension has a
cookie-specific option, but with HTTP_Request, you have to add the Cookie header just as
with other request headers. Multiple cookie values are sent in a semicolon-delimited list. The
examples in the Solution send two cookies: one named user with value ellen and one
named activity with value swimming.
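If the cookies start out in an array, the semicolon-delimited list can be assembled in a few lines. This is a sketch only: build_cookie_string( ) is a hypothetical helper, and URL-encoding the values is an assumption that suits servers that decode cookies the way PHP does.

```php
<?php
// Hypothetical helper: turn an array of cookie names and values into
// the string expected by CURLOPT_COOKIE or a Cookie header.
function build_cookie_string($cookies) {
    $pairs = array();
    foreach ($cookies as $name => $value) {
        // assumption: the server URL-decodes values, as PHP itself does
        $pairs[] = $name . '=' . urlencode($value);
    }
    return join('; ', $pairs);
}

$cookie_string = build_cookie_string(array('user' => 'ellen',
                                           'activity' => 'swimming'));
```

The resulting string can be passed to CURLOPT_COOKIE or used as the value of the Cookie header with addHeader( ).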
To request a page that sets cookies and then make subsequent requests that include those
newly set cookies, use cURL's "cookie jar" feature. On the first request, set
CURLOPT_COOKIEJAR to the name of a file to store the cookies in. On subsequent requests,
set CURLOPT_COOKIEFILE to the same filename, and cURL reads the cookies from the file
and sends them along with the request. This is especially useful for a sequence of requests in
which the first request logs into a site that sets session or authentication cookies, and then the
rest of the requests need to include those cookies to be valid:
$cookie_jar = tempnam('/tmp','cookie');
// log in
$c = curl_init('https://bank.example.com/login.php?user=donald&password=b1gmoney$');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_COOKIEJAR, $cookie_jar);
$page = curl_exec($c);
curl_close($c);
// retrieve account balance
$c = curl_init('http://bank.example.com/balance.php?account=checking');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_COOKIEFILE, $cookie_jar);
$page = curl_exec($c);
curl_close($c);
// make a deposit
$c = curl_init('http://bank.example.com/deposit.php');
curl_setopt($c, CURLOPT_POST, 1);
curl_setopt($c, CURLOPT_POSTFIELDS, 'account=checking&amount=122.44');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_COOKIEFILE, $cookie_jar);
$page = curl_exec($c);
curl_close($c);
// remove the cookie jar
unlink($cookie_jar) or die("Can't unlink $cookie_jar");
Be careful where you store the cookie jar. It needs to be in a place your web server has write
access to, but if other users can read the file, they may be able to poach the authentication
credentials stored in the cookies.
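On Unix-like systems, one way to limit exposure is to create the cookie jar with permissions that allow access only to the user the script runs as. A minimal sketch, in which make_private_cookie_jar( ) is a hypothetical helper name:

```php
<?php
// Hypothetical helper: create an empty cookie jar file that only the
// current user can read or write.
function make_private_cookie_jar() {
    $cookie_jar = tempnam(sys_get_temp_dir(), 'cookie');
    if ($cookie_jar === false) {
        return false;
    }
    // restrict access explicitly instead of relying on defaults
    chmod($cookie_jar, 0600);
    return $cookie_jar;
}

$cookie_jar = make_private_cookie_jar();
```

This doesn't protect the file from other scripts running as the same user, which is common on shared web hosts, so still remove the jar as soon as you're done with it.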
11.4.4 See Also
Documentation on curl_setopt( ) at http://www.php.net/curl-setopt; the PEAR
HTTP_Request class at http://pear.php.net/package-info.php?package=HTTP_Request.
Recipe 11.5 Fetching a URL with Headers
11.5.1 Problem
You want to retrieve a URL that requires specific headers to be sent with the request for the
page.
11.5.2 Solution
Use the cURL extension and the CURLOPT_HTTPHEADER option:
$c = curl_init('http://www.example.com/special-header.php');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_HTTPHEADER, array('X-Factor: 12', 'My-Header: Bob'));
$page = curl_exec($c);
curl_close($c);
If cURL isn't available, use the addHeader( ) method in HTTP_Request:
require 'HTTP/Request.php';
$r = new HTTP_Request('http://www.example.com/special-header.php');
$r->addHeader('X-Factor',12);
$r->addHeader('My-Header','Bob');
$r->sendRequest();
$page = $r->getResponseBody();
11.5.3 Discussion
cURL has special options, CURLOPT_REFERER and CURLOPT_USERAGENT, for setting the
Referer and User-Agent request headers:
$c = curl_init('http://www.example.com/submit.php');
curl_setopt($c, CURLOPT_VERBOSE, 1);
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_REFERER, 'http://www.example.com/form.php');
curl_setopt($c, CURLOPT_USERAGENT, 'CURL via PHP');
$page = curl_exec($c);
curl_close($c);
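The two approaches in the Solution take headers in different shapes: CURLOPT_HTTPHEADER wants a list of complete header lines, while addHeader( ) takes a name and a value. If your headers are stored as an array keyed by name, a small conversion like this sketch bridges the gap; format_curl_headers( ) is a hypothetical helper, not a cURL function:

```php
<?php
// Hypothetical helper: convert array('Name' => 'value') into the list
// of "Name: value" strings that CURLOPT_HTTPHEADER expects.
function format_curl_headers($headers) {
    $lines = array();
    foreach ($headers as $name => $value) {
        $lines[] = $name . ': ' . $value;
    }
    return $lines;
}

$header_lines = format_curl_headers(array('X-Factor' => 12,
                                          'My-Header' => 'Bob'));
```

Pass $header_lines as the value of CURLOPT_HTTPHEADER.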
11.5.4 See Also
Recipe 11.14 explains why "referrer" is often misspelled "referer" in web programming
contexts; documentation on curl_setopt( ) at http://www.php.net/curl-setopt; the PEAR
HTTP_Request class at http://pear.php.net/package-info.php?package=HTTP_Request.
Recipe 11.6 Fetching an HTTPS URL
11.6.1 Problem
You want to retrieve a secure URL.
11.6.2 Solution
Use the cURL extension with an HTTPS URL:
$c = curl_init('https://secure.example.com/accountbalance.php');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
$page = curl_exec($c);
curl_close($c);
11.6.3 Discussion
To retrieve secure URLs, the cURL extension needs access to an SSL library, such as OpenSSL.
This library must be available when PHP and the cURL extension are built. Aside from this
additional library requirement, cURL treats secure URLs just like regular ones. You can provide
the same cURL options to secure requests, such as changing the request method or adding
POST data.
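To find out whether a particular installation can handle secure URLs, you can ask the cURL extension which protocols it was built with. This is a sketch under the assumption that the array returned by curl_version( ) includes a protocols list; curl_supports_https( ) is a hypothetical helper name:

```php
<?php
// Hypothetical helper: report whether the installed cURL extension
// was built with HTTPS support.
function curl_supports_https() {
    if (! function_exists('curl_version')) {
        return false;   // the cURL extension isn't loaded at all
    }
    $info = curl_version();
    // 'protocols' lists the URL schemes this build of cURL can handle
    return in_array('https', $info['protocols'], true);
}

$https_ok = curl_supports_https();
```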
11.6.4 See Also
The OpenSSL Project at http://www.openssl.org/.
Recipe 11.7 Debugging the Raw HTTP Exchange
11.7.1 Problem
You want to analyze the HTTP request a browser makes to your server and the corresponding
HTTP response. For example, your server doesn't supply the expected response to a particular
request so you want to see exactly what the components of the request are.
11.7.2 Solution
For simple requests, connect to the web server with telnet and type in the request headers:
% telnet www.example.com 80
Trying 10.1.1.1...
Connected to www.example.com.
Escape character is '^]'.
GET / HTTP/1.0
Host: www.example.com

HTTP/1.1 200 OK
Date: Sat, 17 Aug 2002 06:10:19 GMT
Server: Apache/1.3.26 (Unix) PHP/4.2.2 mod_ssl/2.8.9 OpenSSL/0.9.6d
X-Powered-By: PHP/4.2.2
Connection: close
Content-Type: text/html

// ... the page body ...
11.7.3 Discussion
When you type in request headers, the web server doesn't know that it's just you typing and
not a web browser submitting a request. However, some web servers have timeouts on how
long they'll wait for a request, so it can be useful to pretype the request and then just paste it
into telnet. The first line of the request contains the request method (GET), a space and the
path of the file you want (/), and then a space and the protocol you're using (HTTP/1.0). The
next line, the Host header, tells the server which virtual host to use if many are sharing the
same IP address. A blank line tells the server that the request is over; it then spits back its
response: first headers, then a blank line, and then the body of the response.
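The same layout can be expressed in code. The following sketch builds a minimal request like the one typed into telnet and splits a raw response at the first blank line. Both functions are hypothetical helpers, and the sample response is made up to resemble the one in the Solution:

```php
<?php
// Hypothetical helpers illustrating the layout of an HTTP exchange.

// Build a minimal GET request: request line, Host header, blank line.
function build_get_request($host, $path) {
    return "GET $path HTTP/1.0\r\n" .
           "Host: $host\r\n" .
           "\r\n";
}

// Split a raw response into status line, header lines, and body.
function split_response($response) {
    // the first blank line separates the headers from the body
    $parts = explode("\r\n\r\n", $response, 2);
    $header_lines = explode("\r\n", $parts[0]);
    $status_line = array_shift($header_lines);
    $body = isset($parts[1]) ? $parts[1] : '';
    return array($status_line, $header_lines, $body);
}

$request = build_get_request('www.example.com', '/');

// a made-up response, shaped like the one in the Solution
$sample = "HTTP/1.1 200 OK\r\nConnection: close\r\n" .
          "Content-Type: text/html\r\n\r\n<html>...</html>";
list($status, $headers, $body) = split_response($sample);
```

Printing $request is a convenient way to prepare text for pasting into telnet.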
Pasting text into telnet can get tedious, and it's even harder to make requests with the POST
method that way. If you make a request with HTTP_Request, you can retrieve the response
headers and the response body with the getResponseHeader( ) and getResponseBody( )
methods:
require 'HTTP/Request.php';
$r = new HTTP_Request('http://www.example.com/submit.php');
$r->setMethod(HTTP_REQUEST_METHOD_POST);
$r->addPostData('monkey','uncle');
$r->sendRequest();
$response_headers = $r->getResponseHeader();
$response_body = $r->getResponseBody();