REMEMBER Most function/method calls have self or a module name in front of them, followed by a dot. When you're reading a program, if you see a bare function or method (or class)—one without self or a module name—it usually means one of these:
• It was defined in the module (like get_page() and find_links() are in this program).
• It was imported by using the from… import syntax (like StringIO).
• It is a built-in function (like sorted(), down at the bottom).
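The three cases can be seen side by side in a short sketch. The shout() function is made up for illustration and isn't part of spider.py, and the import line assumes Python 3, where StringIO lives in the io module.

```python
from io import StringIO   # imported with from...import, so it is used bare


def shout(text):
    # Defined in this module, so it is called without a prefix
    # (hypothetical helper, not part of spider.py).
    return text.upper() + "!"


buffer = StringIO()            # bare class name, thanks to the import above
buffer.write(shout("hello"))   # bare function defined in this module
letters = sorted("spider")     # bare built-in function

print(buffer.getvalue())
print(letters)
```

Notice that buffer.write() still has a name and a dot in front of it, because write() is a method of the StringIO object.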
REMEMBER Functions and methods are essentially the same thing: a code block that performs an action and returns a result. Here's the difference:
• A function is by itself in a module.
• A method is part of a class.
Chapters 3, 11, and 13 describe how to use functions, methods, and classes.
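Here's a minimal sketch of the difference. The names double(), Counter, and increment() are invented for this example and don't appear in spider.py.

```python
def double(number):
    # A function: defined by itself at the top level of a module.
    return number * 2


class Counter(object):
    def __init__(self):
        self.count = 0

    def increment(self):
        # A method: defined inside a class and called on an instance.
        self.count += 1
        return self.count


tally = Counter()
print(double(21))          # a function call, no object in front
print(tally.increment())   # a method call, made through an instance
```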
Looping around
The spider.py script contains three loops:
• Two for loops
REMEMBER Python programmers generally favor for loops, because for loops can both assign values and provide one element at a time.
• One while loop
Tip A while loop is often better when you are both adding and deleting elements, so we used a while loop in the section where the run() method deletes elements from the self._links_to_process list (via the list method pop()) and also adds elements to the list (via process_page()).
Chapter 10 shows you more ways to use loops.
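The pattern described in the Tip above, a while loop over a list that shrinks and grows as it goes, might look something like this. The site dict is a made-up stand-in for real Web pages, and the names are simplified from the ones in spider.py.

```python
# Hypothetical link graph standing in for real Web pages.
site = {
    "index": ["about", "news"],
    "about": ["index"],
    "news": ["about", "archive"],
    "archive": [],
}

links_to_process = ["index"]
seen = set(links_to_process)
visited = []

while links_to_process:              # keep going until the list is empty
    page = links_to_process.pop()    # removes an element from the list
    visited.append(page)
    for link in site[page]:
        if link not in seen:
            seen.add(link)
            links_to_process.append(link)   # adds to the same list

print(visited)
```

A for loop over links_to_process wouldn't be a safe choice here, because changing a list while a for loop is stepping through it can skip or repeat elements.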
Collections of data
Sets, lists, and dicts are Python's data types for dealing with collections of data, especially if the data change while the program is running.
Lists
Lists (described more fully in Chapter 8) are efficient, especially for adding and removing elements at the end, which is one reason we used a list (self._links_to_process) to keep track of all the links we're processing. Lists are also good when you need to
• Maintain elements in a particular order.
• Allow duplicate elements.
Sets
Sets are a good way of handling data when you want to ignore or avoid duplicates.
Tip We used a set for the primary URL data in the Spider class (self.URLs) because we wanted only one copy of each URL. Chapter 9 shows more ways to use sets.
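A quick sketch shows how a list and a set treat the same data. The links here are made up for illustration.

```python
found = ["/a.html", "/b.html", "/a.html"]   # made-up links

as_list = list(found)   # keeps the order and the duplicates
as_set = set(found)     # keeps one copy of each; order isn't guaranteed

print(len(as_list))
print(len(as_set))
print("/a.html" in as_set)
```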
Dicts
Dicts are good for data that will be stored and accessed by keys rather than ordered alphabetically or numerically.
Tip If we had wanted to associate some data with each URL in the Web page, we would have used a dict. Chapter 9 shows you how.
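As a rough idea of what that could look like, here's a dict that associates a title with each URL. The URLs, the titles, and the name page_titles are all invented for this example.

```python
page_titles = {}   # maps a URL (the key) to some data about it (the value)

page_titles["http://example.com/"] = "Home"
page_titles["http://example.com/about"] = "About us"

print(page_titles["http://example.com/about"])       # look up by key
print("http://example.com/missing" in page_titles)   # test for a key
```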
Naming names
If a good name is one that helps you understand what the named object is doing, then there are some good names and some not-so-good names in spider.py. An early draft of our program had functions named get_links() and find_links(). Those names don't really make clear the differences between the two functions, so we renamed get_links() to get_page().
Tip Programmers sometimes choose terse and not very explanatory names on purpose to indicate that you shouldn't pay much attention to the name because it's just a temporary name used to convey information (for example, it's used as an argument to a function or method). Sometimes a temporary name makes a few lines of code easier to read. Take, for example, the lines of code that use the name f in the find_links() function:
f = formatter.AbstractFormatter(writer)
parser = htmllib.HTMLParser(f)
We could have gotten rid of f by writing the code this way instead:
parser = htmllib.HTMLParser(formatter.AbstractFormatter(writer))
But that's kind of long and hard to read, so we decided it was better to split the lines and use a temporary name.
REMEMBER Give users of your modules information about which attributes, functions, classes, and methods they should avoid accessing directly, passing to other functions, subclassing, or rewriting. (Or, more colloquially, "Das ist nicht für gefingerpoken!") By convention, this information is often conveyed by using a single underscore character as the first character in a name, which signals that the object is private. (See Chapter 13 for more about private attributes.) For example, we chose to make self._links_to_process a private name because it's valid only inside the Spider class. We could have made url_in_site() a private name for the same reason, but we didn't, in order to send the message that it's suitable for overriding in a subclass.
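The following sketch shows both conventions at work. This Spider is a stripped-down stand-in rather than the class from spider.py: the constructor argument and the body of url_in_site() are assumptions, and SecureOnlySpider is a hypothetical subclass.

```python
class Spider(object):
    def __init__(self, start_url):
        self.start_url = start_url
        # Leading underscore: private, meant for use inside this class only.
        self._links_to_process = [start_url]

    def url_in_site(self, link):
        # No underscore: subclasses are welcome to override this method.
        return link.startswith(self.start_url)


class SecureOnlySpider(Spider):   # hypothetical subclass
    def url_in_site(self, link):
        # Adds one extra rule, then defers to the original method.
        return link.startswith("https://") and Spider.url_in_site(self, link)


spider = SecureOnlySpider("https://example.com/")
print(spider.url_in_site("https://example.com/docs/"))
print(spider.url_in_site("http://example.com/docs/"))
```

Python doesn't enforce the underscore. It's a signal to other programmers, not a lock.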
Managing strings
There are a lot of ways you can work with strings in Python:
• Python strings have many methods built in.
For example, we use the startswith() string method in the url_in_site() method. (Unsurprisingly, startswith() checks whether a string starts with a particular substring.)
• Many Python modules include additional string-handling functions.
We use one of these in the process_page() method: urlparse.urljoin(url, link). The urljoin() function (as the name suggests; see the benefits of naming things well?) sticks together two parts of a URL, which we pass in as two strings, url and link.
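Both tools can be tried out in a few lines. This sketch assumes Python 3, where urljoin() has moved from the urlparse module to urllib.parse; the URL and link values are made up.

```python
from urllib.parse import urljoin   # Python 3 home of urljoin()

url = "http://example.com/docs/index.html"   # the page we're on
link = "../images/logo.png"                  # a relative link found on it

full = urljoin(url, link)
print(full)
print(full.startswith("http://example.com/"))
```

Notice that urljoin() does more than glue two strings together: it resolves the relative link against the page's URL.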
REMEMBER Before you start writing some special functionality to process strings, check whether someone else has already done the work for you. The Cheat Sheet lists the most commonly used string methods.
Handling errors
One error-handling tool in the spider.py program is this block of code in the get_page() function:
try:
    page = urllib2.urlopen(url)
except urllib2.URLError:
    log("Error retrieving: " + url)
    return ''
This is called a try / except block (see Chapter 10). Its purpose is to catch errors from the urllib2.urlopen() function, which tries to open a remote URL.
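To watch the except branch run without needing a broken network connection, you can try a sketch like this one. It assumes Python 3, where urlopen() and URLError live in urllib.request and urllib.error. The opener parameter and the always_fails() function are additions made for this example only, and print() stands in for the program's log() function.

```python
from urllib.error import URLError
from urllib.request import urlopen


def get_page(url, opener=urlopen):
    # opener is a hypothetical parameter, added so the error path
    # can be exercised without a network connection.
    try:
        page = opener(url)
    except URLError:
        print("Error retrieving: " + url)
        return ''
    return page.read()


def always_fails(url):
    # Pretends to be urlopen() on a day when nothing works.
    raise URLError("simulated failure")


print(repr(get_page("http://example.com/", opener=always_fails)))
```

Because the error is caught, the caller gets back an empty string and the program keeps running.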
What spider.py doesn't have
The spider.py program works as-is, but it's missing a few elements that are necessary to make it a fully functional Python program that follows the conventions of good programming:
• Docstrings for each function and class method.
• Error checking. For example, if you try to run the program from the command line without specifying a URL, the program fails messily. Here's what happens:
% python spider.py
Traceback (most recent call last):
  File "spider.py", line 91, in <module>
    startURL = sys.argv[1]
IndexError: list index out of range
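One way to avoid that failure is to check the length of sys.argv before using it. This is only a sketch: the function names, the usage message, and the return codes are assumptions, not the code from the complete version of the program.

```python
import sys


def get_start_url(argv):
    # argv[0] is the script name, so a URL makes the length at least 2.
    if len(argv) < 2:
        return None
    return argv[1]


def main(argv):
    start_url = get_start_url(argv)
    if start_url is None:
        sys.stderr.write("usage: python spider.py URL\n")
        return 2   # a nonzero value signals an error to the caller
    print("Starting at " + start_url)
    return 0


# Simulate running the program with no URL on the command line.
print(main(["spider.py"]))
```

In a real script, you'd pass sys.argv to main() and hand its result to sys.exit().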
Why are we showing a program without these features when we reiterate ad nauseam in this book that you should include these features in your program? Because lots of programmers talk about the benefits of comments, documentation, and error-checking, but lots of programs (some written by those same programmers) don't do as good a job on those things as they should. This is a "Do as we say, not as we do" situation.
A complete version of the program with documentation and error checking is on our Web site:
http://www.pythonfood.com/
REMEMBER It's good practice to use try with I/O functions and user input or other external input. For example, we could improve our program by moving the body = page.read() line into the try block, because even if you can open a Web page, you might still have trouble reading it. Similarly, garbage in the HTML can cause the htmllib module's parser to choke, so the line parser.feed(html) should be in a try block as well, like this:
try:
    parser.feed(html)
except htmllib.HTMLParseError:
    log("Error finding links: " + url)
    return []
finally:
    parser.close()
return parser.anchorlist
The sidebar, "What spider.py doesn't have," describes another spot where error-checking would be useful.
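If you're wondering what the finally clause buys you, this sketch shows that it runs whether or not an error occurs. FakeParser is a hypothetical stand-in for the htmllib parser, and ValueError stands in for the parser's own exception, so that the example runs by itself.

```python
class FakeParser(object):
    # Hypothetical stand-in for htmllib's parser, for illustration only.
    def __init__(self):
        self.anchorlist = []
        self.closed = False

    def feed(self, html):
        if "<<" in html:
            raise ValueError("cannot parse")   # simulated garbage in the HTML
        self.anchorlist.append("/found.html")

    def close(self):
        self.closed = True


def find_links(html, parser):
    try:
        parser.feed(html)
    except ValueError:
        return []
    finally:
        parser.close()   # runs on both the normal and the error path
    return parser.anchorlist


good, bad = FakeParser(), FakeParser()
print(find_links("<a href='/found.html'>", good), good.closed)
print(find_links("<<garbage", bad), bad.closed)
```

The return statement after the block is reached only when no error occurred, and the parser gets closed either way.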
Chapter 5: Working Like a Programmer
Overview
Professional programmers spend as little as 10 percent of their working time writing code. This chapter focuses on what they do the rest of the time. These practices generally consume about 60 percent of a programmer's time on a project:
• Analyzing problems
• Designing solutions and documenting decisions
• Debugging
• Maintaining and improving code
Warning The final 30 percent of a programmer's time is taken by meetings and wasted time. (Sometimes, there's no difference between meetings and wasted time.)