Step 1: Get the Command Line Arguments and Request the Search Page
Before coding anything, you first need to know the URL of the search result page. By looking at the browser's address bar after doing a Google search, you can see that the result page has a URL like https://www.google.com/search?q=SEARCH_TERM_HERE. The requests module can download this page, and then you can use Beautiful Soup to find the search result links in the HTML. Finally, you'll use the webbrowser module to open those links in browser tabs.
Make your code look like the following:
#! python3
# lucky.py - Opens several Google search results.
import requests, sys, webbrowser, bs4
print('Googling...') # display text while downloading the Google page
res = requests.get('http://google.com/search?q=' + ' '.join(sys.argv[1:]))
res.raise_for_status()
# TODO: Retrieve top search result links.
# TODO: Open a browser tab for each result.
The user will specify the search terms using command line arguments when they launch the program. These arguments will be stored as strings in a list in sys.argv.
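For example, if you ran the program with lucky Beautiful Soup on the command line, sys.argv would hold something like ['lucky.py', 'Beautiful', 'Soup'] (the first item varies with how the script is launched), and joining everything after the script name rebuilds the search phrase. You can try this in the interactive shell with a list standing in for sys.argv:

>>> ' '.join(['lucky.py', 'Beautiful', 'Soup'][1:])
'Beautiful Soup'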
Step 2: Find All the Results
Now you need to use Beautiful Soup to extract the top search result links from your downloaded HTML. But how do you figure out the right selector for the job? For example, you can't just search for all <a> tags, because there are lots of links you don't care about in the HTML. Instead, you must inspect the search result page with the browser's developer tools to try to find a selector that will pick out only the links you want.
After doing a Google search for Beautiful Soup, you can open the browser's developer tools and inspect some of the link elements on the page. They look incredibly complicated, something like this:

<a href="/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&amp;ved=0CCgQFjAA&url=http%3A%2F%2Fwww.crummy.com%2Fsoftware%2FBeautifulSoup%2F&ei=LHBVU_XDD9KVyAShmYDwCw&usg=AFQjCNHAxwplurFOBqg5cehWQEVKi-TuLQ&amp;sig2=sdZu6WVlBlVSDrwhtworMA" onmousedown="return rwt(this,'','','','1','AFQjCNHAxwplurFOBqg5cehWQEVKi-TuLQ','sdZu6WVlBlVSDrwhtworMA','0CCgQFjAA','','',event)" data-href="http://www.crummy.com/software/BeautifulSoup/"><em>Beautiful Soup</em>: We called him Tortoise because he taught us.</a>
It doesn't matter that the element looks incredibly complicated. You just need to find the pattern that all the search result links have. But this <a> element doesn't have anything that easily distinguishes it from the nonsearch result <a> elements on the page.
If you look up a little from the <a> element, though, there is an element like this: <h3 class="r">. Looking through the rest of the HTML source, it looks like the r class is used only for search result links. You don't have to know what the CSS class r is or what it does. You're just going to use it as a marker for the <a> element you are looking for. You can create a BeautifulSoup object from the downloaded page's HTML text and then use the selector '.r a' to find all <a> elements that are within an element that has the r CSS class.

Make your code look like the following:

#! python3
# lucky.py - Opens several Google search results.
import requests, sys, webbrowser, bs4

--snip--

# Retrieve top search result links.
soup = bs4.BeautifulSoup(res.text, 'html.parser')
linkElems = soup.select('.r a')

# TODO: Open a browser tab for each result.
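To see how the descendant selector works on its own, here is a small interactive example using a hand-written snippet of HTML (Google's real markup is far messier and changes over time, so this is just an illustration):

>>> import bs4
>>> html = '<h3 class="r"><a href="/url?q=http://example.com">A result</a></h3><a href="/preferences">Not a result</a>'
>>> soup = bs4.BeautifulSoup(html, 'html.parser')
>>> [elem.get('href') for elem in soup.select('.r a')]
['/url?q=http://example.com']

Only the <a> element nested inside an element with the r class matches; the other link is ignored.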
Step 3: Open Web Browsers for Each Result
Finally, we’ll tell the program to open web browser tabs for our results. Add
the following to the end of your program:
#! python3
# lucky.py - Opens several Google search results.
import requests, sys, webbrowser, bs4

--snip--

# Retrieve top search result links.
soup = bs4.BeautifulSoup(res.text, 'html.parser')
linkElems = soup.select('.r a')

# Open a browser tab for each result.
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    webbrowser.open('http://google.com' + linkElems[i].get('href'))
By default, you open the first five search results in new tabs using the webbrowser module. However, the user may have searched for something that turned up fewer than five results. The soup.select() call returns a list of all the elements that matched your '.r a' selector, so the number of tabs you want to open is either 5 or the length of this list (whichever is smaller).

The built-in Python function min() returns the smallest of the integer or float arguments it is passed. (There is also a built-in max() function that returns the largest argument it is passed.) You can use min() to find out whether there are fewer than five links in the list and store the number of links to open in a variable named numOpen. Then you can run through a for loop by calling range(numOpen).
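For example, enter the following into the interactive shell:

>>> min(5, 3)
3
>>> min(5, 12)
5
>>> max(5, 12)
12

So if a search turns up only three links, min(5, 3) evaluates to 3, and the loop opens only three tabs.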
On each iteration of the loop, you use webbrowser.open() to open a new tab in the web browser. Note that the href attribute values in the returned <a> elements do not have the initial http://google.com part, so you have to concatenate that to the href attribute's string value.
Now you can instantly open the first five Google results for, say, Python programming tutorials by running lucky python programming tutorials on the command line! (See Appendix B for how to easily run programs on your operating system.)
Ideas for Similar Programs
The benefit of tabbed browsing is that you can easily open links in new tabs
to peruse later. A program that automatically opens several links at once
can be a nice shortcut to do the following:
• Open all the product pages after searching a shopping site such as
Amazon
• Open all the links to reviews for a single product
• Open the result links to photos after performing a search on a photo
site such as Flickr or Imgur
Project: Downloading All XKCD Comics
Blogs and other regularly updating websites usually have a front page with
the most recent post as well as a Previous button on the page that takes you
to the previous post. Then that post will also have a Previous button, and so
on, creating a trail from the most recent page to the first post on the site.
If you wanted a copy of the site’s content to read when you’re not online,
you could manually navigate over every page and save each one. But this is
pretty boring work, so let’s write a program to do it instead.
XKCD is a popular geek webcomic with a website that fits this structure
(see Figure 11-6). The front page at http://xkcd.com/ has a Prev button that
guides the user back through prior comics. Downloading each comic by
hand would take forever, but you can write a script to do this in a couple of
minutes.
Here’s what your program does:
• Loads the XKCD home page.
• Saves the comic image on that page.
• Follows the Previous Comic link.
• Repeats until it reaches the first comic.
Figure 11-6: XKCD, “a webcomic of romance, sarcasm, math, and language”
This means your code will need to do the following:
• Download pages with the requests module.
• Find the URL of the comic image for a page using Beautiful Soup.
• Download and save the comic image to the hard drive with iter_content().
• Find the URL of the Previous Comic link, and repeat.
Open a new file editor window and save it as downloadXkcd.py.
Step 1: Design the Program
If you open the browser’s developer tools and inspect the elements on the
page, you’ll find the following:
• The URL of the comic's image file is given by the src attribute of an <img> element.
• The <img> element is inside a <div id="comic"> element.
• The Prev button has a rel HTML attribute with the value prev.
• The first comic's Prev button links to the http://xkcd.com/# URL, indicating that there are no more previous pages.
Make your code look like the following:
#! python3
# downloadXkcd.py - Downloads every single XKCD comic.
import requests, os, bs4
url = 'http://xkcd.com' # starting url
os.makedirs('xkcd', exist_ok=True) # store comics in ./xkcd
while not url.endswith('#'):
    # TODO: Download the page.
    # TODO: Find the URL of the comic image.
    # TODO: Download the image.
    # TODO: Save the image to ./xkcd.
    # TODO: Get the Prev button's url.

print('Done.')
You'll have a url variable that starts with the value 'http://xkcd.com' and repeatedly update it (in a while loop) with the URL of the current page's Prev link. At every step in the loop, you'll download the comic at url. You'll know to end the loop when url ends with '#'.
You will download the image files to a folder in the current working directory named xkcd. The call os.makedirs() ensures that this folder exists, and the exist_ok=True keyword argument prevents the function from throwing an exception if this folder already exists. The rest of the code is just comments that outline the rest of your program.
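For example, the effect of exist_ok looks like this in the interactive shell (the exact FileExistsError message varies by operating system):

>>> import os
>>> os.makedirs('xkcd', exist_ok=True)    # creates the folder
>>> os.makedirs('xkcd', exist_ok=True)    # no complaint the second time
>>> os.makedirs('xkcd')
Traceback (most recent call last):
  --snip--
FileExistsError: [Errno 17] File exists: 'xkcd'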
Step 2: Download the Web Page
Let’s implement the code for downloading the page. Make your code look
like the following:
#! python3
# downloadXkcd.py - Downloads every single XKCD comic.
import requests, os, bs4
url = 'http://xkcd.com' # starting url
os.makedirs('xkcd', exist_ok=True) # store comics in ./xkcd
while not url.endswith('#'):
    # Download the page.
    print('Downloading page %s...' % url)
    res = requests.get(url)
    res.raise_for_status()

    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # TODO: Find the URL of the comic image.
    # TODO: Download the image.
    # TODO: Save the image to ./xkcd.
    # TODO: Get the Prev button's url.

print('Done.')
First, print url so that the user knows which URL the program is about to download; then use the requests module's requests.get() function to download it. As always, you immediately call the Response object's raise_for_status() method to throw an exception and end the program if something went wrong with the download. Otherwise, you create a BeautifulSoup object from the text of the downloaded page.
Step 3: Find and Download the Comic Image
Make your code look like the following:
#! python3
# downloadXkcd.py - Downloads every single XKCD comic.
import requests, os, bs4

--snip--

    # Find the URL of the comic image.
    comicElem = soup.select('#comic img')
    if comicElem == []:
        print('Could not find comic image.')
    else:
        comicUrl = comicElem[0].get('src')
        # Download the image.
        print('Downloading image %s...' % (comicUrl))
        res = requests.get(comicUrl)
        res.raise_for_status()

        # TODO: Save the image to ./xkcd.

    # TODO: Get the Prev button's url.

print('Done.')
From inspecting the XKCD home page with your developer tools, you know that the <img> element for the comic image is inside a <div> element with the id attribute set to comic, so the selector '#comic img' will get you the correct <img> element from the BeautifulSoup object.
A few XKCD pages have special content that isn't a simple image file. That's fine; you'll just skip those. If your selector doesn't find any elements, then soup.select('#comic img') will return a blank list. When that happens, the program can just print an error message and move on without downloading the image.

Otherwise, the selector will return a list containing one <img> element. You can get the src attribute from this <img> element and pass it to requests.get() to download the comic's image file.
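Here is a quick interactive illustration of both cases, using made-up HTML snippets rather than the real XKCD page:

>>> import bs4
>>> page = '<div id="comic"><img src="http://imgs.xkcd.com/comics/example.png"></div>'
>>> soup = bs4.BeautifulSoup(page, 'html.parser')
>>> comicElem = soup.select('#comic img')
>>> comicElem[0].get('src')
'http://imgs.xkcd.com/comics/example.png'
>>> bs4.BeautifulSoup('<div id="comic">Interactive page!</div>', 'html.parser').select('#comic img')
[]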
Step 4: Save the Image and Find the Previous Comic
Make your code look like the following:
#! python3
# downloadXkcd.py - Downloads every single XKCD comic.
import requests, os, bs4

--snip--

        # Save the image to ./xkcd.
        imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
        for chunk in res.iter_content(100000):
            imageFile.write(chunk)
        imageFile.close()

    # Get the Prev button's url.
    prevLink = soup.select('a[rel="prev"]')[0]
    url = 'http://xkcd.com' + prevLink.get('href')

print('Done.')
At this point, the image file of the comic is stored in the res variable. You need to write this image data to a file on the hard drive.

You'll need a filename for the local image file to pass to open(). The comicUrl will have a value like 'http://imgs.xkcd.com/comics/heartbleed_explanation.png', which you might have noticed looks a lot like a file path. And in fact, you can call os.path.basename() with comicUrl, and it will return just the last part of the URL, 'heartbleed_explanation.png'. You can use this as the filename when saving the image to your hard drive. You join this name with the name of your xkcd folder using os.path.join() so that your program uses backslashes (\) on Windows and forward slashes (/) on OS X and Linux. Now that you finally have the filename, you can call open() to open a new file in 'wb' "write binary" mode.
Remember from earlier in this chapter that to save files you've downloaded using Requests, you need to loop over the return value of the iter_content() method. The code in the for loop writes out chunks of the image data (at most 100,000 bytes each) to the file, and then you close the file. The image is now saved to your hard drive.
Afterward, the selector 'a[rel="prev"]' identifies the <a> element with the rel attribute set to prev, and you can use this <a> element's href attribute to get the previous comic's URL, which gets stored in url. Then the while loop begins the entire download process again for this comic.
The output of this program will look like this:
Downloading page http://xkcd.com...
Downloading image http://imgs.xkcd.com/comics/phone_alarm.png...
Downloading page http://xkcd.com/1358/...
Downloading image http://imgs.xkcd.com/comics/nro.png...
Downloading page http://xkcd.com/1357/...
Downloading image http://imgs.xkcd.com/comics/free_speech.png...
Downloading page http://xkcd.com/1356/...
Downloading image http://imgs.xkcd.com/comics/orbital_mechanics.png...
Downloading page http://xkcd.com/1355/...
Downloading image http://imgs.xkcd.com/comics/airplane_message.png...
Downloading page http://xkcd.com/1354/...
Downloading image http://imgs.xkcd.com/comics/heartbleed_explanation.png...
--snip--
This project is a good example of a program that can automatically
follow links in order to scrape large amounts of data from the Web. You
can learn about Beautiful Soup’s other features from its documentation
at http://www.crummy.com/software/BeautifulSoup/bs4/doc/.
Ideas for Similar Programs
Downloading pages and following links are the basis of many web crawling
programs. Similar programs could also do the following:
• Back up an entire site by following all of its links.
• Copy all the messages off a web forum.
• Duplicate the catalog of items for sale on an online store.
The requests and BeautifulSoup modules are great as long as you can figure out the URL you need to pass to requests.get(). However, sometimes this isn't so easy to find. Or perhaps the website you want your program to navigate requires you to log in first. The selenium module will give your programs the power to perform such sophisticated tasks.
Controlling the Browser with the selenium Module
The selenium module lets Python directly control the browser by programmatically clicking links and filling in login information, almost as though there were a human user interacting with the page. Selenium allows you to interact with web pages in a much more advanced way than Requests and Beautiful Soup; but because it launches a web browser, it is a bit slower and harder to run in the background if, say, you just need to download some files from the Web.
Appendix A has more detailed steps on installing third-party modules.
Starting a Selenium-Controlled Browser
For these examples, you’ll need the Firefox web browser. This will be the
browser that you control. If you don’t already have Firefox, you can down-
load it for free from http://getfirefox.com/.
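As a preview, starting a Selenium-controlled browser takes only a couple of lines. This sketch assumes the selenium module is installed; with newer versions of Selenium and Firefox, you may also need Mozilla's geckodriver program on your PATH:

from selenium import webdriver

browser = webdriver.Firefox()            # opens a Firefox window that Python controls
browser.get('http://getfirefox.com/')    # navigates that window to a URL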