Chapter 11
You can also load an HTML file from your hard drive by passing a File object
to bs4.BeautifulSoup(). Enter the following into the interactive shell
(make sure the example.html file is in the working directory):
>>> exampleFile = open('example.html')
>>> exampleSoup = bs4.BeautifulSoup(exampleFile, 'html.parser')
>>> type(exampleSoup)
<class 'bs4.BeautifulSoup'>
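If you want the file closed as soon as it has been parsed, a with statement is one way to do it. The following is only a sketch: the HTML string is a made-up stand-in for example.html, and it is written to a temporary folder so the code runs on its own. The 'html.parser' argument names Python's built-in parser explicitly.

```python
import os
import tempfile

import bs4

# Hypothetical stand-in for example.html, made up for this sketch.
html = '<html><head><title>Demo</title></head><body><p>Hello</p></body></html>'

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'example.html')

    # Write the stand-in page to disk so there is a file to load.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(html)

    # Pass the open File object to bs4.BeautifulSoup(); the file is
    # closed automatically when the with block ends.
    with open(path, encoding='utf-8') as exampleFile:
        exampleSoup = bs4.BeautifulSoup(exampleFile, 'html.parser')

# The parsed document remains usable after the file is closed.
print(type(exampleSoup))
print(exampleSoup.title.string)
```

The whole file is read while the BeautifulSoup object is being created, so nothing later in the program depends on the file staying open.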
Once you have a BeautifulSoup object, you can use its methods to locate
specific parts of an HTML document.
Finding an Element with the select() Method
You can retrieve a web page element from a BeautifulSoup object by calling
the select() method and passing a string of a CSS selector for the element you
are looking for. Selectors are like regular expressions: They specify a pattern
to look for, in this case, in HTML pages instead of general text strings.
A full discussion of CSS selector syntax is beyond the scope of this
book (there’s a good selector tutorial in the resources at
http://nostarch.com/automatestuff/), but here’s a short introduction to
selectors. Table 11-2 shows examples of the most common CSS selector patterns.
Table 11-2: Examples of CSS Selectors

Selector passed to the select() method   Will match
--------------------------------------   ----------------------------------------
soup.select('div')                       All elements named <div>
soup.select('#author')                   The element with an id attribute of author
soup.select('.notice')                   All elements that use a CSS class attribute
                                         named notice
soup.select('div span')                  All elements named <span> that are within
                                         an element named <div>
soup.select('div > span')                All elements named <span> that are directly
                                         within an element named <div>, with no
                                         other element in between
soup.select('input[name]')               All elements named <input> that have a
                                         name attribute with any value
soup.select('input[type="button"]')      All elements named <input> that have an
                                         attribute named type with value button
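To see the patterns in Table 11-2 at work, you can run each one against a small page and count the matches. This is a rough sketch rather than anything from example.html: the HTML below is invented for illustration, and the name soup simply mirrors the table.

```python
import bs4

# Hypothetical HTML, made up for this sketch.
html = """
<div id="main">
  <span id="author">Jane Doe</span>
  <p class="notice">Read <span>this</span> first.</p>
  <input name="q" type="text">
  <input type="button" value="Go">
</div>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')

# Count how many elements each selector from Table 11-2 matches.
selectors = ['div', '#author', '.notice', 'div span', 'div > span',
             'input[name]', 'input[type="button"]']
counts = {}
for selector in selectors:
    counts[selector] = len(soup.select(selector))
    print(selector, '->', counts[selector])
```

In this page, 'div span' matches both <span> elements, while 'div > span' matches only the one that sits directly inside the <div>; the other is nested inside the <p>.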
The various selector patterns can be combined to make sophisticated
matches. For example, soup.select('p #author') will match any element that
has an id attribute of author, as long as it is also inside a <p> element.
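A short sketch of that combined pattern, using invented HTML (the name and text are placeholders): the element with the id of author sits inside a <p>, so 'p #author' finds it, while the same id pattern scoped to a <div> finds nothing.

```python
import bs4

# Hypothetical HTML, made up for this sketch.
html = """
<p>Written by <span id="author">Jane Doe</span></p>
<div><span id="editor">Someone else</span></div>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')

# Matches: the id="author" element is inside a <p> element.
inside_p = soup.select('p #author')

# No match: there is no id="author" element inside a <div>.
inside_div = soup.select('div #author')

print(len(inside_p))
print(len(inside_div))
```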
The select() method will return a list of Tag objects, which is how
Beautiful Soup represents an HTML element. The list will contain one Tag
object for every match in the BeautifulSoup object’s HTML. Tag values
can be passed to the str() function to show the HTML tags they represent.
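As a minimal illustration of that return value, the sketch below selects two elements from an invented one-line page and converts the first match back to HTML with str(). The variable names are placeholders chosen for this example.

```python
import bs4

# Hypothetical HTML, made up for this sketch.
html = '<p class="notice">First</p><p class="notice">Second</p>'
soup = bs4.BeautifulSoup(html, 'html.parser')

elems = soup.select('.notice')

# One Tag object per match, held in a list (or a list subclass,
# depending on the installed Beautiful Soup version).
print(len(elems))
print(isinstance(elems[0], bs4.element.Tag))

# str() shows the HTML that the Tag represents.
first_html = str(elems[0])
print(first_html)
```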