Web Crawling Tools - 1 Hacking Exposed 3 pdf

Here are our favorite tools to help automate the grunt work of the application survey. They are basically spiders: once you point one at a URL, you can sit back and watch it create a mirror of the site on your system. Remember, this will not be a functional replica of the target site with ASP source code and database calls; it is simply a collection of every page and file the tool can reach by following links within the application. These tools perform most of the grunt work of collecting files.

We’ll discuss holistic web application assessment tools, which include crawling functionality, in Chapter 10.

Lynx

Lynx is a text-based web browser found on many UNIX systems. It provides a quick way to navigate a site, although extensive JavaScript will inhibit it. We find that one of its best uses is for downloading specific pages.

The -dump option is useful for its “References” section. This option instructs lynx to dump the formatted page to standard output and exit, so you can redirect the output to a file. This might not seem useful at first, but lynx appends a numbered list of all links embedded in the page’s HTML source. This is helpful for enumerating links and finding URLs with long argument strings.

[root@meddle]# lynx -dump https://www.victim.com > homepage
[root@meddle]# cat homepage
...text removed for brevity...
References
   1. http://www.victim.com/signup?lang=en
   2. http://www.victim.com/help?lang=en
   3. http://www.victim.com/faq?lang=en
   4. http://www.victim.com/menu/
   5. http://www.victim.com/preferences?anon
   6. http://www.victim.com/languages
   7. http://www.victim.com/images/
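The References list lends itself to further filtering. As a rough sketch (the function names list_refs and urls_with_args are ours, and the awk pattern assumes the usual lynx layout of a References heading followed by numbered entries), the links in a saved dump such as homepage can be pulled out and narrowed to those that carry arguments:

```shell
# list_refs FILE: print every URL listed under the "References" heading
# of a file saved with "lynx -dump".
list_refs() {
    awk '/^ *References *$/ { in_refs = 1; next }
         in_refs && $1 ~ /^[0-9]+\.$/ { print $2 }' "$1"
}

# urls_with_args FILE: keep only the unique URLs that contain a query string.
urls_with_args() {
    list_refs "$1" | grep -F '?' | sort -u
}

# Example: urls_with_args homepage
```

Run against the dump above, urls_with_args homepage would print the faq, help, preferences, and signup URLs, since those are the entries that contain a ? delimiter.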

If you want to see the HTML source instead of the formatted page, then use the -source option. Two other options, -crawl and -traversal, will gather the formatted HTML and save it to files. However, this is not a good method for creating a mirror of the site because the saved files do not contain the HTML source code.
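Because -source returns raw HTML, its saved output can be mined for links that the formatted view hides, such as frame sources and form targets. The following is a minimal sketch, not a real HTML parser: extract_links is a hypothetical helper that assumes double-quoted attribute values and a grep that supports the -o option (GNU and BSD grep do).

```shell
# extract_links FILE: print the unique values of href, src, and action
# attributes found in a saved HTML file. This is regex-based, so it misses
# unquoted or single-quoted values and anything built by JavaScript.
extract_links() {
    grep -Eio '(href|src|action)="[^"]*"' "$1" |
        sed -e 's/^[^=]*="//' -e 's/"$//' |
        sort -u
}

# Example: lynx -source https://www.victim.com > homepage.html
#          extract_links homepage.html
```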

Lynx is still an excellent tool for capturing single URLs. Its major advantage over the getit scripts is the ability to perform HTTP basic authentication using the -auth option:

[root@meddle]# lynx -source https://www.victim.com/private/index.html
Looking up www.victim.com
Making HTTPS connection to 192.168.201.2
Secure 168-bit TLSv1/SSLv3 (EDH-RSA-DES-CBC3-SHA) HTTP connection
Sending HTTP request.
HTTP request sent; waiting for response.
Alert!: Can't retry with authorization! Contact the server's WebMaster.
Can't Access `https://192.168.201.2/private/index.html'
Alert!: Unable to access document.
lynx: Can't access startfile
[root@meddle]# lynx -source -auth=user:pass \
> https://63.142.201.2/private/index.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 FINAL//EN">
<HTML>
<HEAD>
<TITLE>Private Intranet</TITLE>
<FRAMESET BORDER=0 FRAMESPACING=0 FRAMEBORDER=0 ROWS="129,*">
<FRAME NAME="header" SRC="./header_home.html" SCROLLING=NO
MARGINWIDTH="2" MARGINHEIGHT="1" FRAMEBORDER=NO BORDER="0" NORESIZE>
<FRAME NAME="body" SRC="./body_home.html" SCROLLING=AUTO
MARGINWIDTH=2 MARGINHEIGHT=2>
</FRAMESET>
</HEAD>
</HTML>
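It helps to remember what -auth actually sends. Under HTTP basic authentication the client joins the username and password with a colon, Base64-encodes the result, and places it in an Authorization header; the credentials are encoded, not encrypted. A small illustration, assuming a base64 utility is installed (basic_auth_header is a name chosen here for the example, not part of lynx):

```shell
# basic_auth_header USER PASS: print the Authorization header that an HTTP
# client would send for these basic authentication credentials.
basic_auth_header() {
    printf 'Authorization: Basic %s\n' "$(printf '%s:%s' "$1" "$2" | base64)"
}

basic_auth_header user pass
```

For user:pass this prints Authorization: Basic dXNlcjpwYXNz, which anyone who captures the request can decode, so basic authentication is only as private as the transport carrying it.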

Wget

Wget (www.gnu.org/software/wget/wget.html) is a command-line tool for Windows and UNIX that will download the contents of a web site. Its usage is simple:

[root@meddle]# wget -r www.victim.com
--18:17:30-- http://www.victim.com/
=> `www.victim.com/index.html'
Connecting to www.victim.com:80... connected!
HTTP request sent, awaiting response... 200 OK
Length: 21,924 [text/html]
0K ... ... . 100% @ 88.84 KB/s
18:17:31 (79.00 KB/s) - `www.victim.com/index.html' saved [21924/21924]
Loading robots.txt; please ignore errors.
--18:17:31-- http://www.victim.com/robots.txt
=> `www.victim.com/robots.txt'
Connecting to www.victim.com:80... connected!
HTTP request sent, awaiting response... 200 OK
Length: 458 [text/html]
0K 100% @ 22.36 KB/s
...(continues for entire site)...

The -r or --recursive option instructs wget to follow every link on the home page. This will create a www.victim.com directory and populate that directory with every HTML file and directory wget finds for the site. A major advantage of wget is that it follows every link possible. Thus, it will download the output for every argument that the application passes to a page. For example, the viewer.asp file for a site might be downloaded several times, once for each argument value:

• viewer.asp@ID=7
• viewer.asp@ID=42
• viewer.asp@ID=23

The @ symbol represents the ? delimiter in the original URL. The ID is the first argument (parameter) passed to the viewer.asp file.

Some sites may require more advanced options such as support for proxies and HTTP basic authentication. Sites protected by basic authentication can be spidered as follows:

[root@meddle]# wget -r --http-user=dwayne --http-password=woodelf \
> https://www.victim.com/secure/
--20:19:11-- https://www.victim.com/secure/
=> `www.victim.com/secure/index.html'
Connecting to www.victim.com:443... connected!
HTTP request sent, awaiting response... 200 OK
Length: 251 [text/html]
0K 100% @ 21.19 KB/s
...continues for entire site...

Wget has a single purpose: to retrieve files from a web site. Sifting through the results requires other simple command-line tools, such as grep, that are available on any UNIX system or under Cygwin on Windows.
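As one example of such sifting, the saved file names themselves reveal the application's parameters. The sketch below assumes the mirror was saved with @ standing in for ?, as in the viewer.asp listing above; mirror_urls and mirror_params are illustrative names, not wget features.

```shell
# mirror_urls DIR: print each mirrored file as a URL path, turning the "@"
# that wget wrote into the file name back into the "?" delimiter.
mirror_urls() {
    ( cd "$1" && find . -type f | sed -e 's|^\./||' -e 's|@|?|' | sort )
}

# mirror_params DIR: print the unique parameter names seen across the mirror.
mirror_params() {
    mirror_urls "$1" |
        grep -F '?' |
        sed -e 's|^[^?]*?||' |   # keep only the query string
        tr '&' '\n' |            # one name=value pair per line
        cut -d= -f1 |            # drop the value
        sort -u
}

# Example: mirror_params www.victim.com
```

A short list of parameter names is a convenient starting point for deciding which pages deserve closer inspection.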

Burp Suite Spider

Burp Suite is a set of attack tools that includes a utility for mapping applications. Rather than your having to follow links, submit forms, and parse the responses manually, the Burp Spider will automatically gather this information to help identify potentially vulnerable functionality in the web application. Add the site to be crawled to the current target scope, enable the Spider feature, and then simply browse the application through the Burp proxy. Further options can be configured via the Options tab.

Teleport Pro

Of course, for Windows users there is always a GUI option. Teleport Pro (www.tenmax.com/teleport/pro/home.htm) brings a graphical interface to the function of wget and adds sifting tools for gathering information.

With Teleport Pro, you can specify any part of a URL to start spidering, control the depth and types of files it indexes, and save copies locally. The major drawback of this tool is that it saves the mirrored site in a Teleport Pro Project file. This TPP file cannot be searched with tools such as grep. Teleport Pro is shown in Figure 2-8.

Black Widow

Black Widow extends the capability of Teleport Pro by providing an interface for searching and collecting specific information. The other benefit of Black Widow is that you can download the files to a directory on your hard drive. Such a directory is easier to search with tools like grep and findstr. Black Widow is shown in Figure 2-9.

Offline Explorer Pro

Offline Explorer Pro is a commercial Win32 application that allows an attacker to download an unlimited number of her favorite web and FTP sites for later offline viewing, editing, and browsing. It also supports HTTPS and multiple authentication protocols, including NTLM (simply use the domain\username syntax in the authentication configuration page under File | Properties | Advanced | Passwords for a given Project). We discuss Offline Explorer Pro throughout this book, since it’s one of our favorite automated crawling tools.
