Processing HTML Files (Dmitry Zinoviev, Data Science Essentials in Python, The Pragmatic Programmers)

The first type of structured text document you’ll look at is HTML, a markup language commonly used on the web for human-readable representation of information. An HTML document consists of text and predefined tags (enclosed in angle brackets, <>) that control the presentation and interpretation of the text. The tags may have attributes. The following table shows some HTML tags and their attributes.

Table 3—Some Frequently Used HTML Tags and Attributes

HTML is a precursor to XML, which is not a language but rather a family of markup languages with a similar structure, intended primarily for machine-readable documents. Users like us define XML tags and their attributes as needed.

XML ≠ HTML

Though XML and HTML look similar, a typical HTML document is in general not a valid XML document, and an XML document is not an HTML document.
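
One way to see the difference is to feed the same content to a strict XML parser. The sketch below uses Python’s standard xml.etree.ElementTree module; the two fragments and the helper name is_well_formed_xml are made up for illustration and do not come from the book.

```python
import xml.etree.ElementTree as ET

# A fragment that browsers accept as HTML: <br> has no closing tag.
html_fragment = "<html><body><p>Line one<br>Line two</p></body></html>"

# The same content written as well-formed XML: <br/> is self-closed.
xml_fragment = "<html><body><p>Line one<br/>Line two</p></body></html>"

def is_well_formed_xml(markup):
    """Return True if a strict XML parser accepts the markup (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

html_ok = is_well_formed_xml(html_fragment)
xml_ok = is_well_formed_xml(xml_fragment)
print(html_ok, xml_ok)
```

The unclosed <br> is legal in HTML but makes the first fragment ill-formed as XML, so the strict parser rejects it.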

XML tags are application-specific. Any alphanumeric string can be a tag, as long as it follows some simple rules (enclosed in angle brackets and so on). XML tags don’t control the presentation of the text, only its interpretation. XML is frequently used in documents not intended directly for human eyes. Another language, Extensible Stylesheet Language Transformations (XSLT), transforms XML to HTML, and yet another language, Cascading Style Sheets (CSS), adds style to the resulting HTML documents.

The module BeautifulSoup is used for parsing, accessing, and modifying HTML and XML documents. You can construct a BeautifulSoup object from a markup string, a markup file, or a URL of a markup document on the web:

from bs4 import BeautifulSoup
from urllib.request import urlopen

# Construct soup from a string
soup1 = BeautifulSoup("<HTML><HEAD>«headers»</HEAD>«body»</HTML>")

# Construct soup from a local file
soup2 = BeautifulSoup(open("myDoc.html"))

# Construct soup from a web document
# Remember that urlopen() does not add "http://"!
soup3 = BeautifulSoup(urlopen("http://www.networksciencelab.com/"))

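The optional second argument to the object constructor is the markup parser, a Python component that is in charge of extracting HTML tags and entities. BeautifulSoup supports four parsers. Only "html.parser" ships with Python; the others require the separately installed lxml or html5lib packages. If you omit the argument, BeautifulSoup picks the best parser that is installed, so name the parser explicitly when you need reproducible results.

• "html.parser" (built into Python, fast, not very lenient; used for “simple” HTML documents)

• "lxml" (very fast, lenient)

• "xml" (for XML files only)

• "html5lib" (very lenient, slow)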
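
As a minimal sketch of passing the parser explicitly, the example below uses "html.parser", since it needs no extra packages. The markup string is made up for illustration.

```python
from bs4 import BeautifulSoup

markup = ("<html><head><title>My document</title></head>"
          "<body><p>Main text.</p></body></html>")

# Name the parser explicitly so the result does not depend on
# which optional parser packages happen to be installed.
soup = BeautifulSoup(markup, "html.parser")

title_text = soup.title.string
print(title_text)
```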

When the soup is ready, you can pretty print the original markup document with the function soup.prettify().
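
A short sketch of prettify() on a one-line document follows; the markup is an assumption chosen for illustration. The function returns a string in which each tag and each piece of text sits on its own indented line.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p>Main text.</p></body></html>",
                     "html.parser")

# prettify() returns a str; it does not modify the soup itself.
pretty = soup.prettify()
print(pretty)
```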

The function soup.get_text() returns the text part of the markup document with all tags removed. Use this function to convert markup to plain text when it’s the plain text you’re interested in.

htmlString = '''
<HTML>
<HEAD><TITLE>My document</TITLE></HEAD>
<BODY>Main text.</BODY></HTML>
'''
soup = BeautifulSoup(htmlString)
soup.get_text()

➾ '\nMy document\nMain text.\n'
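
If the leftover newlines are unwanted, get_text() also accepts a separator and a strip flag. The sketch below reuses the document from the example above, with "html.parser" named explicitly as an assumption; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html_string = """
<HTML>
<HEAD><TITLE>My document</TITLE></HEAD>
<BODY>Main text.</BODY></HTML>
"""
soup = BeautifulSoup(html_string, "html.parser")

# strip=True trims each text fragment and drops the empty ones;
# the first argument is the separator placed between fragments.
compact = soup.get_text(" ", strip=True)
print(compact)
```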

Often markup tags are used to locate certain file fragments. For example, you might be interested in the first row of the first table. Plain text alone is not helpful in getting there, but tags are, especially if they have class or id attributes.

BeautifulSoup uses a consistent approach to all vertical and horizontal relations between tags. The relations are expressed as attributes of the tag objects and resemble a file system hierarchy. The soup title, soup.title, is an attribute of the soup object. The name of the title’s parent element is soup.title.parent.name, and the first cell in the first row of the first table is probably soup.body.table.tr.td.

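Any tag t has a name t.name, a string value (t.string with the original content, which is None when the tag has more than one child, and the generator t.stripped_strings, which yields the text fragments with surrounding whitespace removed), the parent t.parent, the next and previous elements in document order t.next_element and t.previous_element (t.next_sibling and t.previous_sibling for elements at the same nesting level), and zero or more children t.children (tags within tags).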
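
The navigation attributes can be tried on a small document. The sketch below is illustrative only: the markup, the cell contents, and the variable names are assumptions, and the parser is named explicitly.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

markup = """
<html>
<head><title>My document</title></head>
<body>
<table>
<tr><td>r1c1</td><td>r1c2</td></tr>
<tr><td>r2c1</td><td>r2c2</td></tr>
</table>
</body>
</html>
"""
soup = BeautifulSoup(markup, "html.parser")

# Name of the tag that encloses <title>
parent_name = soup.title.parent.name

# Text of the first cell in the first row of the first table
first_cell = soup.body.table.tr.td.string

# The cell immediately to its right (same nesting level)
next_cell = soup.body.table.tr.td.next_sibling.string

# .children yields direct children; strings between tags may appear
# among them, so keep only the Tag objects.
first_row = soup.table.tr
child_names = [c.name for c in first_row.children if isinstance(c, Tag)]

# .stripped_strings is a generator of whitespace-trimmed text fragments.
cells = list(soup.table.stripped_strings)

print(parent_name, first_cell, next_cell, child_names, cells)
```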

BeautifulSoup provides access to HTML tag attributes through a Python dictionary interface. If the object t represents a hyperlink (such as <a href="foobar.html">), then the destination of the hyperlink is t["href"], which is already a string. Note that HTML tag names are case-insensitive; BeautifulSoup’s HTML parsers convert them to lowercase.
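
A minimal sketch of the dictionary interface follows. The hyperlink, its class attribute, and the variable names are made up for illustration; the mixed-case markup is there to show that "html.parser" stores tag and attribute names in lowercase.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<A HREF="foobar.html" Class="ext ref">Foo</A>',
                     "html.parser")
link = soup.a                 # found under the lowercase name

destination = link["href"]    # an ordinary string
classes = link["class"]       # class is multi-valued: a list of strings
all_attrs = link.attrs        # the whole attribute dictionary

print(link.name, destination, classes, sorted(all_attrs))
```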

Perhaps the most useful soup functions are soup.find() and soup.find_all(), which find the first instance or all instances of a certain tag. Here’s how to find things:

• All instances of the tag <H2> (the HTML parsers store tag names in lowercase, so search for "h2"):

level2headers = soup.find_all("h2")

• All bold or italic formats:

formats = soup.find_all(["i", "b", "em", "strong"])

• The first tag that has a certain attribute value (for example, id="link3"):

soup.find(id="link3")

• All hyperlinks and also the destination URL of the first link, using either the dictionary notation or the tag.get() function:

links = soup.find_all("a")
firstLink = links[0]["href"]
# Or:
firstLink = links[0].get("href")
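
The searches above can be tried together on a small document. This is a sketch under stated assumptions: the markup, the id value, and the variable names are invented for illustration, and tag names are written in lowercase because that is how "html.parser" stores them.

```python
from bs4 import BeautifulSoup

markup = """
<html><body>
<h2>First</h2>
<p>Some <b>bold</b> and <em>emphasized</em> text.</p>
<h2>Second</h2>
<a id="link3" href="page3.html">Page 3</a>
<a name="top">No destination</a>
</body></html>
"""
soup = BeautifulSoup(markup, "html.parser")

# find_all() returns a list of every match, in document order.
headers = [h.string for h in soup.find_all("h2")]
formats = [t.name for t in soup.find_all(["i", "b", "em", "strong"])]

# find() returns the first match, or None when nothing matches.
link3 = soup.find(id="link3")
missing = soup.find("table")

# href=True keeps only the <a> tags that carry an href attribute.
with_href = soup.find_all("a", href=True)

print(headers, formats, link3["href"], missing, len(with_href))
```

Because find() can return None, check its result before accessing attributes of the tag it returns.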

By the way, the dictionary notation in the last example raises a KeyError if the attribute is not present, while tag.get() returns None. You can use the tag.has_attr() function to check the presence of an attribute before you extract it. The following expression combines BeautifulSoup and list comprehension to extract all links and their respective URLs and labels (useful for recursive web crawling):

with urlopen("http://www.networksciencelab.com/") as doc:
    soup = BeautifulSoup(doc)

links = [(link.string, link["href"])
         for link in soup.find_all("a")
         if link.has_attr("href")]

The value of links is a list of tuples:

➾ [('Network Science Workshop',
➾   'http://www.slideshare.net/DmitryZinoviev/workshop-20212296'),
➾  «...»,
➾  ('Academia.edu',
➾   'https://suffolk.academia.edu/DmitryZinoviev'),
➾  ('ResearchGate',
➾   'https://www.researchgate.net/profile/Dmitry_Zinoviev')]
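
The same pattern can be tried offline. The sketch below shows how the two access styles treat a missing attribute, then resolves relative links against the page address with urllib.parse.urljoin, a step a crawler would typically need. The base URL, the markup, and the variable names are hypothetical.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "http://www.example.com/docs/index.html"  # hypothetical page address
markup = """
<a href="intro.html">Intro</a>
<a name="anchor-only">No href here</a>
<a href="https://www.example.org/">Elsewhere</a>
"""
soup = BeautifulSoup(markup, "html.parser")
anchors = soup.find_all("a")

# A missing attribute: get() returns None, the dictionary notation raises.
no_href = anchors[1]
value_from_get = no_href.get("href")
try:
    no_href["href"]
    raised = False
except KeyError:
    raised = True

# Keep only anchors with an href and turn relative URLs into absolute ones.
links = [(a.string, urljoin(base_url, a["href"]))
         for a in anchors if a.has_attr("href")]
print(value_from_get, raised, links)
```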

The versatility of HTML/XML is its strength, but this versatility is also its curse, especially when it comes to tabular data. Fortunately, you can store tabular data in rigid but easy-to-process CSV files, which you’ll look at in the next unit.

In document (the Pragmatic Programmers) Dmitry Zinoviev-Data Science Essentials in Python_ Collect - Organize - Explore - Predict - Value-Pragmatic Bookshelf (2016) (Page 46-50)