Sometimes we want to be able to fetch real-time data from a website that offers continually updated information. In the dark days before you learned Python programming, you would have been forced to sit in front of a browser, clicking the “Refresh” button to reload the page each time you want to check if updated content is available. Instead, we can easily automate this process using the reload() method of the mechanize browser.
As an example, let’s create a script that periodically checks Yahoo! Finance for a current stock quote for the “YHOO” symbol. The first step with any web scraping task is to figure out exactly what infor-mation we’re seeking. In this case, the webpage URL ishttp://finance.yahoo.com/q?s=yhoo. Cur-rently, the stock price (as I see it) is 40.01, and so we view the page’s HTML source code and search for this number. Fortunately, it only occurs once in the code:
1 <span class="time_rtq_ticker"><span id="yfs_l84_yhoo">15.72</span></span>
Next, we check that the tag<span id="yfs_l84_yhoo"> also only occurs once in the webpage, since we will need a way to uniquely identify the location of the current price. If this is the only<span> tag with an id attribute equal to “yfs_l84_yhoo”, then we know we’ll be able to find it on the webpage later and extract the information we need from this particular pair of<span> tags.
We’re in luck (this is the only<span> tag with this id), so we can use mechanize and Beautiful Soup to find and display the current price, like so:
1 import mechanize
13 # return a list of all the <span> tags where the id is 'yfs_184_yhoo'
14
15 myTags = mySoup.find_all("span", id="yfs_l84_yhoo")
16
17 # take the BeautifulSoup string out of the first (and only) <span> tag
18
19 myPrice = myTags[0].string
20
21 print "The current price of YHOO is: {}".format(myPrice)
We needed to useformat() to print the price because myPrice is not just an ordinary string; it’s
actu-ally a more complex BeautifulSoup string that has other information associated with it, and Beautiful Soup would try to display this other information as well if we had only said print myPrice.
Now, in order to repeatedly get the newest stock quote available, we’ll need to create a loop that either uses the reload() method or loads the page in the browser each time. But first, we should check the Yahoo! Financeterms of useto make sure that this isn’t in violation of their acceptable use policy.
The terms state that we should not “use the Yahoo! Finance Modules in a manner that exceeds rea-sonable request volume [or] constitutes excessive or abusive usage,” which seems rearea-sonable enough.
Of course, reasonable and excessive are entirely subjective terms, but the general rules of Internet etiquette suggest that you don’t ask for more data than you need. Sometimes, the amount of data you “need” for a particular use might still be considered excessive, but following this rule is a good place to start.
In our case, an infinite loop that grabs stock quotes as quickly as possible is definitely more than we need, especially since it appears that Yahoo! only updates its stock quotes once per minute. Since we’ll only be using this script to make a few webpage requests as a test, let’s wait one minute in between each request. We can pause the functioning of a script by passing a number of seconds to the sleep() method of Python’s time module, like so:
1 from time import sleep
2
3 print "I'm about to wait for five seconds..."
4
5 sleep(5)
6
7 print "Done waiting!"
Although we won’t explore them here, Python’s timemodule also includes various ways to get the current time in case we wished to add a “time stamp” to each price.
Using thesleep() method, we can now repeatedly obtain real-time stock quotes:
1 from time import sleep
7 # create a Browser object
8
9 myBrowser = mechanize.Browser()
10
11 # obtain 1 stock quote per minute for the next 3 minutes
12
18
19 mySoup = BeautifulSoup(htmlText)
20
21 myTags = mySoup.find_all("span", id="yfs_l84_yhoo")
22
23 myPrice = myTags[0].string
24
25 print"The current price of YHOO is: {}".format(myPrice)
26
27 if i<2: # wait a minute if this isn't the last request
28
29 sleep(60)
If we had originally loaded the page once outside of the loop so that myBrowser already pointed to that URL, we also could have them called myBrowser.reload() instead of re-loading the page each time in order to refresh the page. This can mainly be useful when you want a piece of code to refresh a current page without having to worry about what the original URL of the page was.
Review exercises:
1. Repeat the example in this section to scrape YHOO stock quotes, but additionally include the current time of the quote as obtained from the Yahoo! Finance webpage; this time can be taken from part of a string inside another span tag that appears shortly after the actual stock price in the webpage’s HTML