Running the Script - Scraping Google Results

Hack 44 Scraping Google Results

44.2 Running the Script

Run the script from the command line, specifying the result's HTML filename and name of the CSV file you wish to create or to which you wish to append additional results. For example, using results.html as input and results.csv as output:

$ perl google2csv.pl < results.html > results.csv Leaving off the > and CSV filename sends the results to the screen for your perusal:

$ perl google2csv.pl < results.html

44.3 The Results

Here's a sample run on the results of a search for Mac OS X: $ perl google2csv.pl < results.html "title","url","size","domain suffix"

"Apple - Mac OS X","http://www.apple.com/macosx/","","com" "Apple - Software - Mac OS X

Server","http://www.apple.com/server/", "29k","com"

"Mac OS X

Development","http://developer.apple.com/macosx/","28k","com "

"Mac OS X Hints - Get the most from X!","http://www.macosxhints.com/", "","com"

"Mac OS X Apps - The Source For Mac OS X Software", "http://www.macosxapps.com/","39k","com"

"VersionTracker.com - free Macintosh software downloads for Mac

OS ...

","http://www.versiontracker.com/macosx/","101k","com" "O'Reilly Mac OS X Conference",

"http://conferences.oreillynet.com/macosx2002/","25k","com" "MacNN | OS X","http://osx.macnn.com/","94k","com"

"???? - Mac OS

X","http://www.apple.co.jp/macosx/","43k","jp" "Apple - Support - Mac OS X",

"http://www.info.apple.com/usen/macosx/","36k","com" You'll see that the program records four attributes to the CSV file: title, URL, size (when

available), and top-level domain. The "snippet" of web page usually included with a Google result was omitted, because it's difficult to read in a spreadsheet format.

So why include the page size and domain? Research. If you're generating a set of results to be referred to later, it's handy to be able to sort them by suffix. "edu" results tend to be different from "org" results, whic h tend to be different from "com" results, and so on. Not to mention differing result sets by country, .uk versus .jp, for instance. And if you're generating links to contact later (to ask for a reciprocal link, for example), it's handy to be able to set apart the less-commercial

Hack 45 Scraping Google AdWords

Scrape the AdWords from a saved Google results page into a form suitable for importing into a spreadsheet or database.

Google's AdWords™—the text ads that appear to the right of the regular search results—are delivered on a cost-per-click basis, and purchasers of the AdWords are allowed to set a ceiling on the amount of money they spend on their ad. This means if even if you run a search for the same query word multiple times, you won't necessarily get the same set of ads each time.

If you're considering using Google AdWords to run ads, you might want to gather up and save the ads that are running for the query words you're interested in. Google AdWords are not provided by the Google API; of course you can't automatically scrape Google's results outside the Google API, because it's against Google's Terms of Service.

This hack will let you scrape the AdWords from a saved Google results page and export them to a CSV (comma-separated value) file, which you can then import into Excel or your favorite

spreadsheet program.

This hack requires an additional Perl module, HTML::TokeParser

(http://search.cpan.org/search?query=htmL%3A%3Atokeparser&mode=all). You'll need to install it before the hack will run.

45.1 The Code

#!/usr/bin/perl

# usage: perl adwords.pl results.html use strict;

use HTML::TokeParser;

die "I need at least one file: $!\n" unless @ARGV;

my @Ads;

for my $file (@ARGV){

# skip if the file doesn't exist

# you could add more fil e testing here. # errors go to STDERR so they won't # pollute our csv file

unless (-e $file) {

warn "What??: $file -- $! \n-- skipping --\n"; next;

}

# now parse the file

my $p = HTML::TokeParser->new($file); # $p is a kind of iterator and everything

# in the given file is a token. We are going to

# iterate through them all but we might throw them away # if they aren't what we are looking for.

while(my $token = $p->get_token) {

# look for a start token whose name is 'td' # and has an attribute named 'id' and that # attribute's value is 'taw' followed by one # or more digits.

next unless $token ->[0] eq 'S' and $token->[1] eq 'td'

and $token->[2]{id} =~ /taw\d+/; # $ad is a hash ref that will hold our # data for this ad.

my $ad;

# if we are here we found the td tag. It also has # the url we want

# we strip off the 'go to' stuff

($ad->{url}) = $token->[2]{onmouseover} =~ /go to ([^']+)'/;

# now go directly to the next anchor tag my $link = $p ->get_tag('a');

# grab the href attribute and clean it up $ad->{href} = $link->[1]{href};

$ad->{href} =~ s|/url\?q=||;

# the adwords are the text upto the closing </a> tag $ad->{adwords} = $p->get_trimmed_text('/a');

# Now look at every token looking for text. # Unless the text matches 'Interest:' it is

# description text so we add it to the description. # If it is the 'Interest:' token then

# we want to move to the next img token # and grab the 'width' attribute's value while( my $token = $p->get_token ) {

# this will skip all the <br> and <font> tags next unless $token->[0] eq 'T';

unless($token->[1] =~ /Interest:/) { $ad->{desc} .= ' ' . $token->[1]; next; } my $img = $p->get_tag('img'); $ad->{interest} = $img->[1]{width}; last; # we are done

}

# the url is also in this description but # we don't need it. We already found it. $ad->{desc} =~ s/$ad->{url}.*//;

# change two or more whitespace characters into one space.

$ad->{desc} =~ s/\s{2,}/ /g;

# there is nothing else to look for so # we add this ad to our list of ads. push(@Ads,$ad);

} }

print quoted( qw( AdWords HREF Description URL Interest ) ); for my $ad (@Ads) {

print quoted( @$ad{qw( adwords href desc url interest )} );

}

# we want a csv (comma separated values) # so excel will open it without asking

# any questions. So we have to print quote marks sub quoted {

return join( ",", map { "'$_'" } @_ )."\n"; }

45.2 How It Works

Call this script on the command line, providing the name of the saved Google results page and a file in which to put the CSV results:

% perl adword input.html > output.csv

input.html is the name of the Google results page you've saved. output.csv is the name of the comma-delimited file to which you want to save your results. You can also provide multiple input files on the command line if you'd like:

% perl adword input.html input2.html > output.csv

45.3 The Results

The results will appear in a comma-delimited format that looks like this: "AdWords","HREF","Description","URL","Interest" "Free Blogging Site","http://www.1sound.com/ix", " The ultimate blog spot Start your journal now ","www.1sound.com/ix","40"

"New Webaga Blog","http://www.webaga.com/blog.php", " Fully customizable. Fairly inexpensive.

","www.webaga.com","24"

"Blog this","http://edebates.e-thepeople.org/a- national/article/10245/view&",

" Will online diarists rule the Net strewn with failed dotcoms? ",

"e-thePeople.org","26" "Ford - Ford

Cars","http://quickquote.forddirect.com/FordDirect.jsp", " Build a Ford online here and get a price quote from your local dealer! ",

"www.forddirect.com","40" "See Ford Dealer's

Invoice","http://buyingadvice.com/search/",

" Save $1,400 in hidden dealership profits on your next new car. ",

"buyingadvice.com","28"

"New Ford Dealer Prices","http://www.pricequotes.com/", " Compare Low Price Quotes on a New Ford from Local Dealers and Save! ",

(Each line was prematurely broken for the purposes of publication.)

You'll see that the hack returns the AdWords headline, the link URL, the description in the ad, the URL on the ad (this is the URL that appears in the ad text, while the HREF is what the URL links to), and the Interest, which is the size of the Interest bar on the text ad. The Interest bar gives an idea of how many click-throughs an ad has had, showing how popular it is.

45.4 Hacking the Hack

You might find that the hack as it's written provides too much information. Instead of the information above, you might want a little less information, or you might want it in a different order.

The code you'll need to change is in one section.

my @headers = qw( AdWords HREF Description URL Interest ); print '"',join('","',@headers),'"',"\n";

for my $ad (@Ads) {

print '"', join('","', $ad->{adwords}, $ad->{href}, $ad->{desc}, $ad->{url}, $ad->{interest}),'"',"\n";

The first part you'll need to change is the lower part, beginning with print "", join. As you see, each line corresponds to part of the data written to the comma-delimited file. Simply rearrange the lines in the order you want them, omitting the information you don't want.

For example, you might want the Adwords title, the URL, and the description, in that order. Your code would look like this:

print '"',join('","',@headers),'"',"\n"; for my $ad (@Ads) {

print '"', join('","',

$ad->{adwords}, $ad->{url},

$ad->{desc}),'"',"\n";

Once you've made the changes to that you'll have to change the "header line" that tells Excel what each field is. That's at the top of the code snippet:

my @headers = qw( AdWords HREF Description URL Interest); You'll need to rearrange the words in parentheses to match the information that you're outputting to the CSV file. In the case above, where I'm saving just the AdWords title, URL, and description, the line would look like this:

my @headers = qw( AdWords URL Description);

45.5 See Also

In document oreilly google hacks pdf (Page 145-152)