We usually retrieve information from the Internet by sending requests for webpages. A module like urllib2 serves us well for this purpose, since it offers a very simple way of returning the HTML from individual webpages. Sometimes, however, we need to send information back to a page - for instance, submitting our information on a login form. For this, we need an actual browser. There are a number of web browsers built for Python, and one of the most popular and easiest to use is in a module called mechanize. (At the time of writing, mechanize is not supported in Python 3; an alternative is the newer lxml, although it doesn’t have all the same functionality and it’s harder to get started with.) Essentially, mechanize is an alternative to urllib2 that can do all of the same things but adds functionality that allows us to talk back to webpages.
Windows: If you have easy_install set up, you can type “easy_install mechanize” into your command prompt. Otherwise, download the .zip file, decompress it, and install the package by running the setup.py script from the command line as described in the section on installing packages.
OS X: If you have easy_install set up, you can type “sudo easy_install mechanize” into your Terminal. Otherwise, download the .tar.gz file, decompress it, and install the package by running the setup.py script from your Terminal as described in the section on installing packages.
Debian/Linux: As usual: sudo apt-get install python-mechanize
You may need to close and restart your IDLE session for mechanize to load and be recognized after it’s been installed.
Getting mechanize to create a new Browser object and use it to open a webpage is as easy as saying:
import mechanize

myBrowser = mechanize.Browser()
myBrowser.open("http://RealPython.com/practice/aphrodite.html")
We now have various information that the website returned to us stored in our mechanize browser as a “response” which we can return by calling the response() method. This response also has various methods that help us piece out information returned from the website:
>>> print myBrowser.response().geturl()
>>> print myBrowser.response().get_data()
Here we used the geturl() method to return the URL (i.e., the full address) of the webpage and the get_data() method to return the actual HTML code in the same way that we used read() with urllib2. We could then use Beautiful Soup or regular expressions to parse out the information we want from the string returned by the get_data() method.
But what if we have to submit information to the website? For instance, what if the information we want is behind a login page such as login.php (http://www.realpython.com/practice/login.php)? If we are trying to do things automatically, then we will need a way to automate the login process as well.
First, let’s take a look at the HTML response provided by login.php:
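import mechanize

myBrowser = mechanize.Browser()
myBrowser.open("http://RealPython.com/practice/login.php")
print myBrowser.response().get_data()  # display the HTML of the login page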
This returns the following form (which you should take a look at in a regular browser as well to see how it appears):
<h2>Please log in to access Mount Olympus:</h2>
The code we see is HTML, but the page itself is written in another language called PHP. In this case, the PHP is creating the HTML that we see based on the information we provide. For instance, try logging into the page with an incorrect username and password, and you will see that the same page now includes a line of text to let you know: “Wrong username or password!” However, if you provide the correct login information (username of “zeus” and password of “ThunderDude”), you will be redirected to the profiles.html page.
For our purposes, the important section of HTML code is the login form, i.e., everything inside the <form> tags. We can see that there is a submission <form> named “login” that includes two <input> tags, one named “user” and the other named “pwd”. The third <input> is the actual “Submit” button.
Now that we know the underlying structure of the form, we can return to mechanize to automate the login process.
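import mechanize

myBrowser = mechanize.Browser()
myBrowser.open("http://RealPython.com/practice/login.php")

# select the form and fill in its fields
myBrowser.select_form("login")
myBrowser["user"] = "zeus"
myBrowser["pwd"] = "ThunderDude"

myResponse = myBrowser.submit()  # submit form
print myResponse.geturl()  # make sure we were redirected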
We used the browser’s select_form() method to select the form, passing in the name “login” that we discovered by reading the HTML of the page. Once we had that form selected in the browser, we could then assign values to the different fields the same way we would assign dictionary values; we passed the value “zeus” to the field named “user” and the value “ThunderDude” to the field named “pwd”; once again, we knew what to call these fields by reading the “name” attribute assigned within each <input> HTML tag.
Once we filled in the form, we could submit this information to the webpage by using the browser’s submit() method, which returns a response in the same way that calling open() on a webpage would do. We displayed the URL of this response to make sure that our login submission worked; if we had provided an incorrect username or password then we would have been sent back to login.php, but we see that we were successfully redirected to profiles.html as planned.
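The redirect check described above can be wrapped in a small helper. This is only a minimal sketch: the function name login_succeeded and its landing_page parameter are hypothetical (they are not part of mechanize), and the helper relies on nothing but the response's geturl() method. It is written so that it runs under both Python 2 and Python 3.

```python
def login_succeeded(response, landing_page="profiles.html"):
    """Return True if the response's final URL ends with the expected page.

    `response` is any object with a geturl() method, such as the value
    returned by a mechanize browser's submit() or open() call.
    `landing_page` is an assumption about where a successful login leads.
    """
    return response.geturl().endswith(landing_page)
```

After a call such as myResponse = myBrowser.submit(), login_succeeded(myResponse) would distinguish a redirect to profiles.html from being sent back to login.php.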
NOTE: We are always being encouraged to use long passwords with many different types of characters in them, and now you know the main reason: automated scripts like the one we just designed can be used by hackers to “brute force” logins by rapidly trying to log in with many different usernames and passwords until they find a working combination. Besides being highly illegal, this is also futile: almost all websites these days (including my practice form) will lock you out and report your IP address if they see you making too many failed requests, so don’t try it!
We were able to retrieve the webpage form by name because mechanize includes its own HTML parser. We can use this parser through various browser methods as well to easily obtain other types of HTML elements. The links() method will return all the links appearing on the browser’s current page as Link objects, which we can then loop over to obtain their addresses. For instance, if our browser is still on the profiles.html page, we could say:
>>> for link in myBrowser.links():
...     print link.absolute_url, link.text
Each Link object has a number of attributes, including an absolute_url attribute that represents the address of the webpage (i.e., the “href” value) and a text attribute that represents the actual text that appears as a link on the webpage.
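Because each Link object carries both of these attributes, the links on a page can also be gathered into a lookup table instead of being printed. The following is a rough sketch under stated assumptions: the helper name links_by_text is hypothetical, and it expects only that the browser's links() method yields objects with text and absolute_url attributes. It runs under both Python 2 and Python 3.

```python
def links_by_text(browser):
    """Map each link's visible text to its absolute URL.

    `browser` is any object whose links() method yields objects with
    `text` and `absolute_url` attributes, as a mechanize Browser does.
    If two links share the same text, the later one wins.
    """
    addresses = {}
    for link in browser.links():
        addresses[link.text] = link.absolute_url
    return addresses
```

With a browser that currently has profiles.html open, links_by_text(myBrowser) would return a dictionary keyed by the text of each link on that page.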
The mechanize browser provides many other methods to offer us the full functionality of a standard web browser. For instance, the browser’s back() method simply takes us back one page, similar to hitting the “Back” button in an ordinary browser. We can also “click” on links in a page to follow them using the browser’s follow_link() method.
Let’s follow each of the links on the profiles.html page using the browser’s follow_link() and back() methods, displaying the title of each webpage we visit by using the browser’s title() method:
import mechanize

myBrowser = mechanize.Browser()
myBrowser.open("http://RealPython.com/practice/profiles.html")

for nextLink in myBrowser.links():  # follow each link on profiles.html
    myBrowser.follow_link(nextLink)
    print "Page title:", myBrowser.title()  # display current page title
    myBrowser.back()  # return to profiles.html
    print "Page title:", myBrowser.title()  # back to profiles.html
By using the follow_link() method to open links, we avoid the trouble of having to add the rest of the URL path to each link. For instance, the first link on the page simply links to a page named “aphrodite.html” without including the rest of the URL; this is a relative URL, and mechanize wouldn’t know how to load this webpage if we passed it to the open() method. To open() this address, we would have to create an absolute URL by adding the rest of the web address, i.e., “http://RealPython.com/practice/aphrodite.html”.
The follow_link() method takes care of this problem for us since the Link objects all store absolute URL paths. Of course, in this case it still may have been more sensible to open() each of these links directly since this wouldn’t require loading as many pages. However, sometimes we need the ability to move back and forth between pages.
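For cases where we do want to open() a relative link ourselves, the standard library can build the absolute URL for us. The sketch below uses urljoin, which lives in the urlparse module in Python 2 and in urllib.parse in Python 3; the variable names are purely illustrative, and the base address is assumed to be the page on which the relative link appears.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

# The page we are currently on, and a relative link found on that page.
base_url = "http://RealPython.com/practice/profiles.html"
relative_url = "aphrodite.html"

# urljoin resolves the relative link against the base page's address,
# replacing the last path segment (profiles.html) with the relative one.
absolute_url = urljoin(base_url, relative_url)
print(absolute_url)
```

The resulting string is a complete address that could be passed to a browser's open() method.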
But how did our link-following code do? The output appears as follows:
>>>
Page title: Profile: Dionysus</title /> </head> <body bgcolor="yellow">
<center> <br><br> <img src="dionysus.jpg"> <h2>Name: Dionysus</h2> <img
src="grapes.png"><br><br> Hometown: Mount Olympus <br><br> Favorite animal:
Leopard <br> <br> Favorite Color: Wine </center> </body> </html>

Page title: All Profiles

>>>
Unfortunately, mechanize’s HTML parser couldn’t find the closing title tag for Dionysus because of the errant forward slash at the end, and we ended up gathering the rest of the page. When one parsing method fails us (in this case, the mechanize browser’s title() method) because of poorly written HTML, the best course of action is to turn to another parsing method - in this case, creating a BeautifulSoup object and extracting the webpage’s title from it. For instance:
import mechanize
from bs4 import BeautifulSoup

myBrowser = mechanize.Browser()
htmlPage = myBrowser.open("http://www.RealPython.com/practice/dionysus.html")
htmlText = htmlPage.get_data()

mySoup = BeautifulSoup(htmlText)
print mySoup.title.string
Notice that we didn’t have to revisit using urllib2 since mechanize already provides the functionality of opening webpages and retrieving their HTML for us.
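A regular expression is another possible fallback when title() trips over malformed markup. The sketch below is an illustration under stated assumptions, not a general HTML parser: the helper name extract_title is hypothetical, the pattern only tolerates the stray slash seen on the Dionysus page (</title />), and the HTML is assumed to be available as a text string such as the one returned by get_data(). It runs under both Python 2 and Python 3.

```python
import re

# Accept a normal closing tag (</title>) as well as the malformed
# variant with a stray slash (</title />).
TITLE_PATTERN = re.compile(r"<title[^>]*>(.*?)</title\s*/?\s*>",
                           re.IGNORECASE | re.DOTALL)


def extract_title(html_text):
    """Return the page title found in html_text, or None if there is none."""
    match = TITLE_PATTERN.search(html_text)
    if match is None:
        return None
    return match.group(1).strip()


# Fragment modeled on the malformed markup in the Dionysus output;
# the opening <head> and <title> tags are assumed for illustration.
sample = '<head><title>Profile: Dionysus</title /> </head> <body bgcolor="yellow">'
print(extract_title(sample))  # prints: Profile: Dionysus
```

A pattern like this is brittle compared with Beautiful Soup, so it is best kept for small, known quirks such as this one.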
Review exercises:
1. Use mechanize to provide the correct username “zeus” and password “ThunderDude” to the login page submission form located at: http://RealPython.com/practice/login.php
2. Using Beautiful Soup, display the title of the current page to determine that you have been redirected to profiles.html
3. Use mechanize to return to login.php by going “back” to the previous page
4. Provide an incorrect username and password to the login form, then search the HTML of the returned webpage for the text “Wrong username or password!” to determine that the login process failed