#61 Extracting URLs from a Web Page - Wicked Cool Shell Scripts

A straightforward shell script application of lynx is to extract a list of URLs on a given web page, which can be quite helpful in a variety of situations.

The Code

#!/bin/sh

# getlinks - Given a URL, returns all of its internal and
#   external links.

if [ $# -eq 0 ] ; then
  echo "Usage: $0 [-d|-i|-x] url" >&2
  echo "-d=domains only, -i=internal refs only, -x=external only" >&2
  exit 1
fi

if [ $# -gt 1 ] ; then
  case "$1" in
    -d) lastcmd="cut -d/ -f3 | sort | uniq"
        shift
        ;;
    -i) basedomain="http://$(echo $2 | cut -d/ -f3)/"
        lastcmd="grep \"^$basedomain\" | sed \"s|$basedomain||g\" | sort | uniq"
        shift
        ;;
    -x) basedomain="http://$(echo $2 | cut -d/ -f3)/"
        lastcmd="grep -v \"^$basedomain\" | sort | uniq"
        shift
        ;;
     *) echo "$0: unknown option specified: $1" >&2; exit 1
  esac
else
  lastcmd="sort | uniq"
fi

lynx -dump "$1" | \
  sed -n '/^References$/,$p' | \
  grep -E '[[:digit:]]+\.' | \
  awk '{print $2}' | \
  cut -d\? -f1 | \
  eval $lastcmd

exit 0

How It Works

When displaying a page, lynx shows the text of the page, formatted as best it can, followed by a list of all hypertext references, or links, found on that page. This script extracts just the links: a sed invocation prints everything from the "References" line to the end of the lynx output, then grep, awk, and cut reduce those lines to bare URLs, and finally the list is processed as needed based on the user-specified flags.
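To make those filtering stages concrete, here is a minimal sketch that runs the same sed, grep, awk, and cut steps over a small hand-written sample standing in for lynx -dump output. The sample text and the names sample_dump and extract_links are assumptions made for illustration; real lynx output will vary from page to page.

```shell
#!/bin/sh
# Sketch: the link-extraction stages from getlinks, applied to sample text.
# sample_dump and extract_links are hypothetical names used for illustration;
# the sample only imitates the general shape of lynx -dump output.

sample_dump() {
cat <<'EOF'
   Welcome to the example page.
   1. This numbered line is page text, not a link.

References

   1. http://www.example.com/index.html
   2. http://www.example.com/search.cgi?q=shell
   3. http://www.example.org/
EOF
}

extract_links() {
  sed -n '/^References$/,$p' |   # keep everything from the References line on
    grep -E '[[:digit:]]+\.' |   # keep only the numbered reference lines
    awk '{print $2}' |           # the second field is the URL
    cut -d\? -f1                 # drop any query string
}

sample_dump | extract_links
```

Notice that the numbered line in the page text never reaches grep, because sed has already discarded everything above "References".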

The one interesting technique demonstrated by this script is the way the variable lastcmd is set to filter the list of links that it extracts according to the flags specified by the user. Once lastcmd is set, the amazingly handy eval command is used to force the shell to interpret the content of the variable as if it were a command, not a variable.
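If you're wondering why eval is needed at all, the following small sketch may help. It stores the same kind of pipeline in lastcmd and runs it both ways; the input data is made up for the demonstration.

```shell
#!/bin/sh
# Sketch: why getlinks runs its stored filter through eval.
# The variable name lastcmd comes from the script; the input data is made up.

lastcmd="sort | uniq"

# Without eval, the expanded words are handed to sort as arguments, so sort
# looks for files named "|" and "uniq" and reports an error.
printf 'b\na\nb\n' | $lastcmd 2>/dev/null || echo "plain expansion failed"

# With eval, the string is parsed again as a command line, so "|" becomes a
# real pipe and the data is sorted and de-duplicated.
printf 'b\na\nb\n' | eval $lastcmd
```

The shell recognizes pipe symbols before it expands variables, which is why the first attempt cannot work and the second does. The flip side is that eval executes whatever the string contains, so it is best reserved for strings the script builds itself.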

Running the Script

By default, this script outputs a list of all links found on the specified web page, not just those that are prefaced with http:. There are three optional command flags that can be specified to change the results, however: -d produces just the domain names of all matching URLs, -i produces a list of just the internal references (that is, those references that are found on the same server as the current page), and -x produces just the external references, those URLs that point to a different server.
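Here is a rough sketch of what each flag's filter does, run against a handful of invented links rather than a live page. The URLs, the helper name sample_links, and the hard-coded basedomain value are all assumptions for illustration.

```shell
#!/bin/sh
# Sketch: what each getlinks flag does to a list of links.
# The URLs and the helper name sample_links are invented for illustration.

sample_links() {
cat <<'EOF'
http://www.example.com/about.html
http://www.example.com/index.html
http://www.example.org/
mailto:someone@example.com
EOF
}

# In getlinks this value is derived from the URL given on the command line.
basedomain="http://www.example.com/"

echo "-d (domains only):"
sample_links | cut -d/ -f3 | sort | uniq

echo "-i (internal references, base URL stripped):"
sample_links | grep "^$basedomain" | sed "s|$basedomain||g" | sort | uniq

echo "-x (external references):"
sample_links | grep -v "^$basedomain" | sort | uniq
```

One quirk worth knowing: cut passes through any line that contains no slash at all, so the mailto: link shows up unchanged in the -d output.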

The Results

A simple request is a list of all links on a specified website home page:

$ getlinks http://www.trivial.net/
http://www.intuitive.com/
http://www.trivial.net/kudos/index.html
http://www.trivial.net/trivial.cgi
mailto:[email protected]

Another possibility is to request a list of all domain names referenced at a specific site. This time let's first use the standard Unix tool wc to check how many links are found overall:

$ getlinks http://www.amazon.com/ | wc -l
136

Amazon has 136 links on its home page. Impressive! Now, how many different domains does that represent? Let's generate a full list with the -d flag:

$ getlinks -d http://www.amazon.com/
s1.amazon.com
www.amazon.com

As you can see, Amazon doesn't tend to point anywhere else. Other sites are different, of course. As an example, here's a list of all external links in my weblog:

$ getlinks -x http://www.intuitive.com/blog/
LYNXIMGMAP:http://www.intuitive.com/blog/#headermap
http://blogarama.com/in.php
http://blogdex.media.mit.edu/
http://booktalk.intuitive.com/
http://chris.pirillo.com/
http://cortana.typepad.com/rta/
http://dylan.tweney.com/
http://fx.crewtags.com/blog/
http://geourl.org/near/
http://hosting.verio.com/index.php/vps.html
http://imajes.info/
http://jake.iowageek.com/
http://myst-technology.com/mysmartchannels/public/blog/214/
http://smattering.org/dryheat/
http://www.101publicrelations.com/blog/
http://www.APparenting.com/
http://www.backupbrain.com/
http://www.bloghop.com/
http://www.bloghop.com/ratemyblog.htm
http://www.blogphiles.com/webring.shtml
http://www.blogshares.com/blogs.php
http://www.blogstreet.com/blogsqlbin/home.cgi
http://www.blogwise.com/
http://www.gnome-girl.com/
http://www.google.com/search/
http://www.icq.com/
http://www.infoworld.com/
http://www.mail2web.com/
http://www.movabletype.org/
http://www.nikonusa.com/usa_product/product.jsp
http://www.onlinetonight.net/ethos/
http://www.procmail.org/
http://www.ringsurf.com/netring/
http://www.spamassassin.org/
http://www.tryingreallyhard.com/
http://www.yahoo.com/r/p2

Hacking the Script

You can see where getlinks could be quite useful as a site analysis tool. Stay tuned: Script #77, checklinks, is a logical follow-on to this script, allowing a quick link check to ensure that all hypertext references on a site are valid.

#62 Defining Words Online

In addition to grabbing information off web pages, a shell script can also feed certain information to a website and scrape the data that the web page spits back. An excellent example of this technique is to implement a command that looks up the specified word in an online dictionary and returns its definition. There are a number of dictionaries online, but we'll use the WordNet lexical database that's made available through the Cognitive Science Department of Princeton University.

Learn more: You can read up on the WordNet project (it's quite interesting) by visiting its website directly at http://www.cogsci.princeton.edu/~wn/

The Code

#!/bin/sh

# define - Given a word, returns its definition.

url="http://www.cogsci.princeton.edu/cgi-bin/webwn1.7.1?stage=1&word="

if [ $# -ne 1 ] ; then
  echo "Usage: $0 word" >&2
  exit 1
fi

# Note: ${line:0:3} is a substring expansion supported by bash and ksh.
# If /bin/sh on your system is a strict POSIX shell, run this with bash.

lynx -source "$url$1" | \
  grep -E '(^[[:digit:]]+\.| has [[:digit:]]+$)' | \
  sed 's/<[^>]*>//g' |
( while read line
  do
    if [ "${line:0:3}" = "The" ] ; then
      part="$(echo $line | awk '{print $2}')"
      echo ""
      echo "The $part $1:"
    else
      echo "$line" | fmt | sed 's/^/ /g'
    fi
  done
)

exit 0

How It Works

Because you can't simply pass fmt an input stream as structurally complex as a word definition without completely ruining the structure of the definition, the while loop attempts to make the output as attractive and readable as possible. Another solution would be a version of fmt that wraps long lines but never merges lines, treating each line of input distinctly, as shown in script #33, toolong.
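The line-by-line approach can be tried without touching the network. The sketch below feeds two made-up lines, shaped like what the loop expects, through a function that mirrors it. The name format_definitions is hypothetical, and the case pattern is a substitution made here so the sketch also runs under strict POSIX shells; the script itself tests ${line:0:3} instead.

```shell
#!/bin/sh
# Sketch: the per-line formatting loop, fed with made-up input lines.
# format_definitions is a hypothetical name. The case pattern replaces the
# script's ${line:0:3} test; that swap is this sketch's choice, made only
# for portability, and is not the original code.

format_definitions() {
  word="$1"
  while read -r line
  do
    case "$line" in
      The*) part="$(echo "$line" | awk '{print $2}')"   # e.g. "verb"
            echo ""
            echo "The $part $word:"
            ;;
      *)    echo "$line" | fmt | sed 's/^/  /'          # wrap, then indent
            ;;
    esac
  done
}

printf '%s\n' 'The verb limn has 2' \
              '1. delineate, limn, outline -- (trace the shape of)' |
  format_definitions limn
```

Because fmt only ever sees one definition at a time, it can wrap a long entry without merging it into its neighbors.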

Worthy of note is the sed command that strips out all the HTML tags from the web page source code:

sed 's/<[^>]*>//g'

This command removes all patterns that consist of an open angle bracket (<) followed by any combination of characters other than a close angle bracket (>), finally followed by the close angle bracket. It's an example of an instance in which learning more about regular expressions can pay off handsomely when working with shell scripts.
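A quick way to convince yourself of this is to run the expression on a line of HTML you've typed in by hand. The markup below is invented, and strip_tags is just a hypothetical wrapper name for the same sed command.

```shell
#!/bin/sh
# Sketch: the tag-stripping sed expression applied to a made-up HTML line.
# strip_tags is a hypothetical helper name, not part of the define script.

strip_tags() {
  sed 's/<[^>]*>//g'
}

echo '<li><b>1.</b> <a href="x.html">delineate</a>, limn, outline</li>' | strip_tags
# prints: 1. delineate, limn, outline
```

Keep in mind that sed works one line at a time, so a tag that is split across two lines of source won't be removed, and character entities such as &amp; are left as they are.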

Running the Script

This script takes one and only one argument: a word to be defined.

The Results

$ define limn

The verb limn:

1. delineate, limn, outline -- (trace the shape of)

2. portray, depict, limn -- (make a portrait of; "Goya wanted to portray his mistress, the Duchess of Alba")

$ define visionary

The noun visionary:

1. visionary, illusionist, seer -- (a person with unusual powers of foresight)

The adjective visionary:

1. airy, impractical, visionary -- (not practical or realizable; speculative; "airy theories about socioeconomic improvement"; "visionary schemes for getting rich")

Hacking the Script

WordNet is just one of the many places online where you can look up words in an automated fashion. If you're more of a logophile, you might appreciate tweaking this script to work with the online Oxford English Dictionary, or even the venerable Webster's. A good starting point for learning about online dictionaries (and encyclopedias, for that matter) is the wonderful Open Directory Project. Try http://dmoz.org/Reference/Dictionaries/ to get started.
