Use an HTML parser to scrape websites

Although regular expressions are great for pattern matching in general, sometimes it’s easier to use an HTML parser that is designed specifically for piecing apart HTML pages. There are a number of Python tools written for this purpose, but the most popular (and easiest to learn) is named Beautiful Soup.

To set up Beautiful Soup on Windows or OS X, if you have easy_install set up, you can type the following command into your command prompt or Terminal:

easy_install beautifulsoup4

Otherwise, download the compressed .tar.gz file, unzip it, then install Beautiful Soup using the setup.py script from the command line or Terminal as described in the chapter on installing packages.

For Debian-based Linux distributions, just type:

sudo apt-get install python-bs4

Once you have Beautiful Soup installed, you can now import the bs4 module and pass a string of HTML to BeautifulSoup to begin parsing:

from bs4 import BeautifulSoup
from urllib2 import urlopen  # Py 3: use urllib.request

myAddress = "http://RealPython.com/practice/dionysus.html"
htmlPage = urlopen(myAddress)
htmlText = htmlPage.read()  # Py 3: decode
mySoup = BeautifulSoup(htmlText)
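As a side note, newer versions of Beautiful Soup 4 will issue a warning when no parser is named explicitly; if you see that warning, you can silence it by passing a parser as a second argument, for example:

mySoup = BeautifulSoup(htmlText, "html.parser")  # use Python's built-in HTML parser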

From here, we can parse data out of mySoup in various useful ways, depending on what information we want. For instance, BeautifulSoup includes a get_text() method for extracting just the text from a document, removing any HTML tags automatically:

>>> print mySoup.get_text()

There are a lot of extra blank lines left, but these can always be taken out using the string replace() method. If we only want to get specific text from an HTML document, using Beautiful Soup to extract the text first and then using find() is sometimes easier than working with regular expressions.
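As a minimal sketch of one way to do that, this snippet just collapses runs of blank lines until none remain:

>>> text = mySoup.get_text()
>>> while "\n\n" in text:
...     text = text.replace("\n\n", "\n")
...
>>> print text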

However, sometimes the HTML tags are actually the elements that point out the data we want to retrieve. For instance, perhaps we want to retrieve links for all the images on the page, which will appear in <img> HTML tags. In this case, we can use the find_all() method to return a list of all instances of that particular tag:
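The call itself looks like the following; the middle of the output is elided here, since exactly how much HTML gets swallowed up depends on the rest of the page (the reason is explained below):

>>> mySoup.find_all("img")
[<img src="dionysus.jpg"/>, <img src="grapes.png">
...the rest of the page's HTML...
</img>]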

This wasn’t exactly what we expected to see, but it happens quite often in the real world; the first element of the list, <img src="dionysus.jpg"/>, is a “self-closing” HTML image tag that doesn’t require a closing </img> tag. Unfortunately, whoever wrote the sloppy HTML for this page never added a closing forward slash to the second HTML image tag, <img src="grapes.png">, and didn’t include a </img> tag either. So BeautifulSoup ended up grabbing a fair amount of HTML after that image tag as well, before inserting a </img> on its own to correct the HTML.

Fortunately, this still doesn’t have much bearing on how we can parse information out of the image tags with Beautiful Soup. This is because these HTML tags are stored as Tag objects, and we can easily extract certain information out of each Tag. In our example, assume for simplicity that we know to expect two images in our list so that we can pull two Tag objects out of the list:

>>> image1, image2 = mySoup.find_all("img")

We now have two Tag objects, image1 and image2. Each of these Tag objects has a name, which is just the type of HTML tag to which it corresponds:

>>> print image1.name
img

These Tag objects also have various attributes, which can be accessed in the same way as a dictionary. The HTML tag <img src="dionysus.jpg"/> has a single attribute “src” that takes on the value “dionysus.jpg” (much like a key: value pair in a dictionary). Likewise, an HTML tag such as <a href="http://RealPython.com" target="_blank"> would have two attributes: a “href” attribute that is assigned the value “http://RealPython.com” and a “target” attribute that has the value “_blank”.
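In fact, each Tag stores all of its attributes in an actual dictionary, which you can inspect through the Tag’s attrs attribute (the exact formatting of the output may vary between versions of Python and Beautiful Soup):

>>> image1.attrs
{u'src': u'dionysus.jpg'}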

We can therefore pull the image source (the link that we wanted to parse) out of each image tag using standard dictionary notation to get the value assigned to the image’s “src” attribute:

>>> print image1["src"]
dionysus.jpg

Even though the second image tag had a lot of extra HTML code associated with it, we could still pull out the value of the image “src” without any trouble because of the way Beautiful Soup organizes HTML tags into Tag objects.

In fact, if we only want to grab a particular tag, we can identify it by the corresponding name of the Tag object in our soup:
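For instance, mySoup.title gives us the page’s <title> tag directly; on the practice page, this should print something like:

>>> print mySoup.title
<title>Profile: Dionysus</title>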

Notice how the HTML <title> tags have automatically been cleaned up by Beautiful Soup. Furthermore, if we want to extract only the string of text out of the <title> tags (without including the tags themselves), we can use the string attribute stored by the title:

>>> print mySoup.title.string
Profile: Dionysus

We can even search for specific kinds of tags whose attributes match certain values. For instance, if we wanted to find all of the <img> tags that had a “src” attribute equal to the value “dionysus.jpg”, we could provide the following additional argument to the find_all() method:

>>> mySoup.find_all("img", src="dionysus.jpg")
[<img src="dionysus.jpg"/>]

In this case, the example is somewhat arbitrary since we only returned a list that contained a single image tag, but we will use this technique in a later section in order to help us find a specific HTML tag buried in a vast sea of other HTML tags.

Although Beautiful Soup is still used frequently today, the code is no longer being maintained and updated by its creator. A similar toolkit, lxml, is somewhat trickier to get started with, but it offers all of the same functionality as Beautiful Soup and more. Once you are comfortable with the basics of Beautiful Soup, you should move on to learning how to use lxml for more complicated HTML parsing tasks.
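To give a taste of the comparison, here is a minimal lxml sketch (assuming lxml is installed) that pulls the same image sources out of the page with a single XPath expression:

from lxml import html

tree = html.fromstring(htmlText)
print tree.xpath("//img/@src")  # e.g. ['dionysus.jpg', 'grapes.png']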

NOTE: HTML parsers like Beautiful Soup can (and often do) save a lot of time and effort when it comes to locating specific data in webpages. However, sometimes HTML is so poorly written and disorganized that even a sophisticated parser like Beautiful Soup doesn’t really know how to interpret the HTML tags properly. In this case, you’re often left to your own devices (namely, find() and regex) to try to piece out the information you need.

Review exercises:

1. Write a script that grabs the full HTML from the page profiles.html

2. Parse out a list of all the links on the page using Beautiful Soup by looking for HTML tags with the name “a” and retrieving the value taken on by the “href” attribute of each tag

3. Get the HTML from each of the pages in the list by adding the full path to the file name, and display the text (without HTML tags) on each page using Beautiful Soup’s get_text() method
