
us to straight away ignore most of Wikipedia’s GUI, but there are still lots of elements that we don’t want to parse. We remedy this by iterating through the list ‘undesirables’ that we created

01 Install Beautiful Soup & HTML5Lib

Before we can start writing code, we need to install the libraries we’ll be using for the program (Beautiful Soup, HTML5Lib, Six). The installation process is fairly standard: grab the libraries from their respective links, then unzip them. In the terminal, enter the unzipped directory and run python setup.py install for each library. They will now be ready for use.
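Once the installs finish, a quick check along the following lines can confirm that Python can see each library. This is only a sketch: it assumes the import names bs4, html5lib and six (Beautiful Soup 4 installs as bs4), and the helper name check_libraries is made up for illustration.

```python
import importlib.util

# Import names are assumptions: Beautiful Soup 4 is imported as "bs4".
REQUIRED = ["bs4", "html5lib", "six"]


def check_libraries(names):
    """Return a dict mapping each module name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_libraries(REQUIRED).items():
        print(name, "OK" if found else "MISSING")
```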

02 Creating some useful variables

These variables will keep track of the links we’ve accessed while the script has been running: addresses is a list containing every link we’ve accessed; deepestAddresses holds the links of the pages that were furthest down the link tree from our starting point; storeFolder is where we will save the HTML files we create; and maxLevel is the maximum depth to which we will follow links from our starting page.
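As a rough illustration, the four variables might be declared like this. The names come from the text above; the starting values for storeFolder and maxLevel are assumptions.

```python
# Every link we have requested so far while the script runs
addresses = []

# Links to the pages found at the maximum depth from the starting page
deepestAddresses = []

# Folder where the saved HTML files go (this value is an assumption)
storeFolder = "Wikistore/"

# How many links deep we follow from the starting page (assumed default)
maxLevel = 2
```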

03 Handling the user’s input

In the first few lines of this function, we’re just creating a helper statement. Afterwards, we parse any arguments passed into the program on execution, looking for a -URL flag and a -levels flag. The -levels flag is optional, as we already have a preset depth that we’ll follow the links to, but we need a link to start from, so if the -URL flag is missing we’ll prompt the user and exit. If we have a link, we quickly check whether we have a directory to store files in (creating it if we don’t) and then fire off the function to get that page. Finally, we register a handler for when the script tries to exit. We’ll get to that bit later.
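The flow described here could be sketched as follows. It uses argparse from the standard library, which may not be how the tutorial's own script reads its arguments. The fetch parameter, the default values and the empty cleanUp body are placeholders, and the sketch returns None where the real script would exit.

```python
import argparse
import atexit
import os

storeFolder = "Wikistore/"  # assumed value
maxLevel = 2                # assumed preset depth


def parse_args(argv=None):
    """Read the -URL and -levels flags; -levels falls back to maxLevel."""
    parser = argparse.ArgumentParser(description="Save a portable chunk of Wikipedia.")
    parser.add_argument("-URL", help="the Wikipedia page to start from")
    parser.add_argument("-levels", type=int, default=maxLevel,
                        help="how many links deep to follow")
    return parser.parse_args(argv)


def cleanUp():
    """Placeholder for the exit handler covered in a later step."""


def init(argv=None, fetch=None):
    args = parse_args(argv)
    if args.URL is None:
        # The real script prompts the user and exits at this point.
        print("Please pass a link to start from with the -URL flag")
        return None
    # Create the storage directory if it does not exist yet.
    os.makedirs(storeFolder, exist_ok=True)
    if fetch is not None:
        fetch(args.URL, 0)  # hypothetical page-grabbing function
    # Run cleanUp when the interpreter is about to exit.
    atexit.register(cleanUp)
    return args
```

Calling parse_args(["-URL", "https://en.wikipedia.org/wiki/Linux", "-levels", "3"]) gives an object whose URL and levels attributes hold the two values.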

WIKI-EVERYTHING

Wikipedia has so many different services that interlink with each other; however, we don’t want to grab those pages, so we’ve got quite a lengthy conditional statement to stop that. It’s pretty good at making sure we only get links from Wikipedia.
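The tutorial's actual conditional is not reproduced here, but a check of this kind might look like the sketch below. The rules it applies (the link must start with /wiki/, must not contain a namespace colon, and must not be the main page) are assumptions, and is_article_link is a made-up name.

```python
def is_article_link(href):
    """Return True if href looks like an ordinary Wikipedia article link."""
    if not href or not href.startswith("/wiki/"):
        # Rules out external sites, sister projects and in-page anchors.
        return False
    title = href[len("/wiki/"):]
    if ":" in title:
        # Rules out namespaced pages such as File:, Special: or Help:.
        # A few real article titles contain a colon and are skipped too.
        return False
    return title != "Main_Page"
```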

INFINITE LINKS

Wikipedia has a lot of links, and when you start following links to links to links, the number of pages you have to parse can grow exponentially, depending on the subject matter. By passing through the levels value, we put a cap on how far we follow them, and so on the number of pages we can grab, although the number of files stored can still vary greatly. Use it wisely.
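To see how the levels value limits the crawl, here is a small sketch that walks a made-up link graph held in a dictionary rather than the real site. The function name crawl and the sample data are illustrative only.

```python
def crawl(page, links, max_level, level=0, seen=None):
    """Follow links depth-first, but never deeper than max_level."""
    if seen is None:
        seen = []
    if page in seen:
        return seen           # already visited, do not fetch it twice
    seen.append(page)
    if level >= max_level:
        return seen           # deepest allowed level: store and stop
    for target in links.get(page, []):
        crawl(target, links, max_level, level + 1, seen)
    return seen


# A tiny stand-in for Wikipedia: each page maps to the pages it links to.
links = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"], "D": ["G"]}

print(crawl("A", links, 1))  # ['A', 'B', 'C']
print(crawl("A", links, 2))  # ['A', 'B', 'D', 'E', 'C', 'F']
```

Raising the cap from 1 to 2 doubles the number of pages even in this tiny graph, which is why the depth is worth choosing carefully.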

“Wikipedia has a lot of nodes that we don’t want to parse”

once we’ve received that page, we’re going to pass the content through to Beautiful Soup with the soup variable. This gives us access to the methods we’re going to call as we parse the document.

06 Grabbing the links

By calling content.find_all('a') we get a list of every <a> in the document. We can iterate through this and check whether there is a valid Wikipedia link in the <a>’s href. If the link is valid, we quickly check how far down the link tree we are from the original page. If we’ve reached the maximum depth we can go, we’ll store this page and call it quits; otherwise we’ll start looking for links that we can grab within it. For every page we request, we append its URL

07 Writing to file

Now we create a file to store the newly parsed document in for later reading. We change any ‘/’ in the filename to ‘_’ so the script doesn’t try to write to a random folder. We also do a quick check to see how many links we’ve followed since the first page. If it’s the max level, we’ll add it to the deepestAddresses list. We’ll use this a little bit later.
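A minimal sketch of this step, assuming the page title is used as the filename and that files get an .html extension; the helper names, the extension and the sample data are not taken from the tutorial's code.

```python
import os
import tempfile


def local_file_name(title):
    """Swap '/' for '_' so the title cannot point into another folder."""
    return title.replace("/", "_") + ".html"


def save_page(store_folder, title, html, level, max_level, deepest):
    """Write the parsed page to disk and note it if it sits at the max depth."""
    path = os.path.join(store_folder, local_file_name(title))
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(html)
    if level == max_level:
        deepest.append(title)
    return path


# Example run in a throwaway folder.
folder = tempfile.mkdtemp()
deepestAddresses = []
saved = save_page(folder, "AC/DC", "<p>Hello</p>", 2, 2, deepestAddresses)
print(os.path.basename(saved))  # AC_DC.html
```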

08 Tying up loose ends

After our script has iterated through every link on every page to the maximum level of depth that it can, it will try to exit. On line 34 of the code (on the disc and online) in the init function, we registered the function cleanUp to execute when the program tries to exit; cleanUp’s job is to go through the documents that we’ve downloaded and check that every link we’ve left in the pages does in fact link to a file that we have available. If it can’t match the link in the href to a file in the addresses list, it will remove it. Once we’re done, we will have a fully portable chunk of Wikipedia we can take with us.
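The idea can be sketched with the standard library alone. Note that this simplified version uses a regular expression instead of Beautiful Soup, keeps the link text when it drops a dead link, and leaves cleanUp as an empty placeholder. All of those choices, along with the sample data, are assumptions rather than the tutorial's code.

```python
import atexit
import re

# Files we managed to save (example data).
addresses = ["Linux.html", "Python.html"]

# Matches a simple <a ... href="...">text</a> pair.
LINK = re.compile(r'<a\s[^>]*href="([^"]*)"[^>]*>(.*?)</a>', re.DOTALL)


def strip_dead_links(html, available):
    """Keep links whose href is in available; replace the rest with their text."""
    def keep_or_unwrap(match):
        href, text = match.group(1), match.group(2)
        return match.group(0) if href in available else text
    return LINK.sub(keep_or_unwrap, html)


def cleanUp():
    """In a full script this would run strip_dead_links over every saved file."""


# Ask Python to call cleanUp when the interpreter is about to exit.
atexit.register(cleanUp)

page = '<p><a href="Linux.html">Linux</a> and <a href="Missing.html">gone</a></p>'
print(strip_dead_links(page, addresses))
```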

STYLING

Currently, the HTML page will use the browser’s built-in styles when rendering. If you like, you can include the style sheet from the tutorial resources to make it look a little nicer. To use it, you can minify the style sheet and include it inside a <style> tag in the head string on line 102, or you can rewrite the head string to something like: head = "<head><meta charset=\"UTF-8\" /><title>" + fileName + "</title><style>" + str(open("/PATH/TO/STYLES", 'r').read()) + "</style></head>"

“Beautiful Soup is a fast, elegant framework that works with a number of Python HTML parsers”

earlier on in the document. For each div, section or node that we don’t want, we call Beautiful Soup’s find_all() method and use the extract() method to remove that node from the document. At the end of the undesirables loop, most of the content we don’t want will be gone. We also look for the ‘also’ element in the Wiki page. Generally, everything after this div is of no use to us. By calling the find_all_next() method on the also node, we can get a list of every other element we can remove from that point on.
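The stripping step might look something like the sketch below. It assumes Beautiful Soup 4 and, to keep the example self-contained, uses Python's built-in html.parser rather than HTML5Lib. The sample HTML, the layout of the undesirables list and the way the also node is looked up by id are illustrative assumptions.

```python
from bs4 import BeautifulSoup

html = """
<div id="content">
  <p>Article text.</p>
  <div class="navbox">Navigation</div>
  <span class="editsection">[edit]</span>
  <h2 id="also">See also</h2>
  <ul><li>Other page</li></ul>
</div>
"""

# Each entry names a tag and the attributes that identify unwanted nodes.
undesirables = [
    {"element": "div", "attr": {"class": "navbox"}},
    {"element": "span", "attr": {"class": "editsection"}},
]

soup = BeautifulSoup(html, "html.parser")

# Remove every node that matches an entry in the undesirables list.
for item in undesirables:
    for node in soup.find_all(item["element"], item["attr"]):
        node.extract()

# Everything after the 'also' node is dropped, then the node itself.
also = soup.find(id="also")
if also is not None:
    for node in also.find_all_next():
        node.extract()
    also.extract()

print(soup.get_text().strip())
```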

to the addresses list; to make sure we don’t call the same page twice for each link we find, we check if we’ve already stored it. If we have, then we’ll skip over the rest of the loop, but if we haven’t, we’ll add it to the list of URLs that we’ve requested and fire off a request. Once that check is done, we do a quick string replace on that link so that it points to the local directory, not to the subfolder /wiki/ that it’s looking for.
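The duplicate check and the string replace could be sketched as follows. The helper names and the .html suffix on the local link are assumptions made for illustration.

```python
addresses = []  # every link we have already requested


def should_request(href):
    """Return True the first time we see a link, and record it."""
    if href in addresses:
        return False          # already requested: skip the rest of the loop
    addresses.append(href)
    return True


def to_local_link(href):
    """Point the link at the local directory instead of the /wiki/ subfolder."""
    return href.replace("/wiki/", "", 1) + ".html"


for href in ["/wiki/Linux", "/wiki/Python", "/wiki/Linux"]:
    if should_request(href):
        print("requesting", href, "->", to_local_link(href))
```

The third link in the loop is skipped because /wiki/Linux is already in the addresses list.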


In document Linux Tips Tricks Apps & Hacks Vol 2 - 2014 (Page 160-162)