Now we can build a scraper that fetches all of the headlines from Google News. We will do this by extracting all of the <a></a> tags in Google News’s HTML. As we saw in our HTML example, each <a></a> tag has an attribute called href, e.g., <a href="theselftaughtprogrammer.io"></a>. We are going to extract the href attribute from every <a></a> tag on Google News’s website. In other words, we are going to collect all of the URLs Google News is linking to at the time we run our program. We will use the BeautifulSoup library for parsing our HTML (converting HTML to Python objects), so first install it with:
pip install beautifulsoup4==4.4.1
Once BeautifulSoup is installed, we can get Google News’s HTML using Python’s built-in urllib2 library for working with URLs. Start by importing urllib2 and BeautifulSoup:
"""https://github.com/calthoff/tstp/blob/master/part_III/lets_read_some_code/lets_read_some_code.py"""
import urllib2
from bs4 import BeautifulSoup
Next, we create a Scraper class:
class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        pass
Our __init__ method takes a website to scrape, and our class has a method called scrape, which we are going to call whenever we want to scrape data from the website we passed in.
Now we can start defining our scrape method.
def scrape(self):
    response = urllib2.urlopen(self.site)
    html = response.read()
The urlopen() function makes a request to Google News and returns a response object, which contains Google News’s HTML. We save the response in our response variable and assign the result of response.read(), which returns the HTML from Google News, to the variable html. All of the HTML from Google News is now saved in html. This is all we need in order to extract data from Google News. However, we still need to parse the HTML. Parsing HTML means reading it into our program and giving it structure with our code, such as turning each HTML tag into a Python object, which we can do using the Beautiful Soup library. First, we create a BeautifulSoup object and pass in our html variable and the string 'html.parser' as parameters to let Beautiful Soup know we are parsing HTML:
def scrape(self):
    response = urllib2.urlopen(self.site)
    html = response.read()
    soup = BeautifulSoup(html, 'html.parser')
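To try the request-and-read step without depending on Google News being reachable, here is a minimal sketch. It assumes Python 3, where the functions of urllib2 live in urllib.request. The data: URL and the sample HTML are assumptions made purely for illustration, so that urlopen() builds its response from a hard-coded snippet instead of making a network request.

```python
from urllib.parse import quote
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen

# Hypothetical HTML snippet standing in for a real page.
sample_html = '<a href="http://example.com/story.html">A headline</a>'

# A data: URL carries its content inside the URL itself, so no network is needed.
# quote() percent-encodes characters such as spaces and angle brackets.
site = "data:text/html," + quote(sample_html)

response = urlopen(site)
html = response.read()  # read() returns the body as bytes, not str

print(type(html).__name__)
print(html.decode("utf-8"))
```

With a real site you would pass its address to urlopen() instead. Beautiful Soup accepts the bytes returned by read() directly, so decoding is only needed if you want to work with the text yourself.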
Moving forward, we can now print out the links from Google News with:
def scrape(self):
    response = urllib2.urlopen(self.site)
    html = response.read()
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.find_all('a'):
        url = tag.get('href')
        if url and 'html' in url:
            print("\n" + url)
find_all() is a method we can call on BeautifulSoup objects. It takes a string representing an HTML tag as a parameter ('a' representing <a></a>), and returns a ResultSet object containing all the Tag objects it found. The ResultSet is similar to a list: you can iterate through it. We never save the ResultSet in a variable; it is simply the value returned by soup.find_all('a'). Each time through the loop, the variable tag refers to the next Tag object. We call the method get() on the Tag object, passing in the string 'href' as a parameter (href is the part of an HTML <a href="url"></a> tag which holds the URL), and it returns the URL as a string, or None if the tag has no href. We store the result in the variable url.
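To see find_all() and get() in isolation, here is a small sketch that parses a hard-coded HTML string rather than a live page. The HTML and the example.com URLs are made up for illustration. It assumes Beautiful Soup is installed as described earlier.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: two links with an href and one <a> tag without.
html = """
<a href="http://example.com/first.html">First</a>
<a href="http://example.com/second.html">Second</a>
<a name="anchor-only">No link here</a>
"""

soup = BeautifulSoup(html, "html.parser")

hrefs = []
for tag in soup.find_all("a"):
    # get() returns the attribute's value, or None if the tag lacks it.
    hrefs.append(tag.get("href"))

print(hrefs)
# ['http://example.com/first.html', 'http://example.com/second.html', None]
```

The None at the end of the list comes from the third tag, which has no href.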
The last thing we do is check that url is not None with if url, because we don’t want to print the URL if it is missing or empty. We also make sure 'html' is in the URL, because we don’t want to print Google’s internal links. If the URL passes both of these tests, we print a newline ('\n') followed by the URL. Here is our full program:
import urllib2
from bs4 import BeautifulSoup


class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        response = urllib2.urlopen(self.site)
        html = response.read()
        soup = BeautifulSoup(html, 'html.parser')
        for tag in soup.find_all('a'):
            url = tag.get('href')
            # Skip tags without an href and links that lack 'html'.
            if url and 'html' in url:
                print("\n" + url)


# The site is passed to the constructor; scrape() takes no arguments.
Scraper('https://news.google.com/').scrape()
Run the scraper, and you should see a result similar to this:
https://www.washingtonpost.com/world/national-security/in-foreign-bribery-cases-leniency-offered-to-companies-that-turn-over-employees/2016/04/05/d7a24d94-fb43-11e5-9140-e61d062438bb_story.html
http://www.appeal-democrat.com/news/unit-apartment-complex-proposed-in-marysville/article_bd6ea9f2-fac3-11e5-bfaf-4fbe11089e5a.html
http://www.appeal-democrat.com/news/injuries-from-yuba-city-bar-violence-hospitalize-groom-to-be/article_03e46648-f54b-11e5-96b3-5bf32bfbf2b5.html
...
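If you would rather experiment with the loop without requesting Google News each time, the filtering logic can be sketched as a function that works on any HTML string. This is only an illustration: extract_links and the sample HTML are hypothetical, not part of the program above, and the function returns the URLs in a list instead of printing them.

```python
from bs4 import BeautifulSoup


def extract_links(html):
    """Return the href of every <a> tag whose URL contains 'html'."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for tag in soup.find_all("a"):
        url = tag.get("href")
        # Same two checks as in scrape(): the href exists and contains 'html'.
        if url and "html" in url:
            links.append(url)
    return links


# Hypothetical page: one article link, one internal link, one tag with no href.
sample_html = """
<a href="http://example.com/news/story.html">A story</a>
<a href="/settings">Settings</a>
<a>No href at all</a>
"""

links = extract_links(sample_html)
print(links)  # ['http://example.com/news/story.html']
```

Returning a list rather than printing makes the results easier to reuse, for example to count them or pass them to another function.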
Now that you have all of Google News’s headlines available in your program, the possibilities are limitless. You could write a program to analyze the most used words in the headlines, and build a word cloud to visualize them. You could build a program to analyze the sentiment of the headlines, and see if it correlates with the stock market. As you get better at web scraping, all of the information in the world will be open to you, and I hope that excites you as much as it excites me.
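As a rough sketch of the first idea, counting the most used words, the snippet below uses Counter from Python’s standard library. The headlines are invented for illustration, and splitting on whitespace is a deliberately simple assumption. A real analysis would probably also strip punctuation and ignore common words such as "the".

```python
from collections import Counter

# Hypothetical headlines standing in for scraped data.
headlines = [
    "Local team wins the championship",
    "Championship parade draws a record crowd",
    "Record rainfall closes local roads",
]

words = []
for headline in headlines:
    # Lowercase so "Record" and "record" count as the same word.
    words.extend(headline.lower().split())

counts = Counter(words)

# most_common(3) returns the three most frequent (word, count) pairs.
top = counts.most_common(3)
for word, count in top:
    print(word, count)
```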
Challenge
Modify the Google Scraper to save the headlines in a file.
Chapter 23. Practice
Exercises
1. Build a scraper for another website.
2. Write a program and then revert to an earlier version of it using PyCharm.
3. Download pylint using pip, read the documentation for it and try it out.
Read
1. http://www.tutorialspoint.com/python/python_reg_expressions.htm