
2.2 Sources of Information on the Web

2.2.3 Areas on the Web

The Web can be divided into at least three major areas: the Visible Web, the Deep Web, and the Semantic Web. One resource on the Web can belong to multiple areas. In the literature, there is much confusion about these areas and how to describe them properly. The following paragraphs explain each of the three Web areas in more detail.

Visible Web

All resources that can be accessed using the link structure of the Web are considered to be on the Visible Web. The Visible Web is sometimes also referred to as the Indexable Web or the Surface Web. It contains at least eight billion Web pages³ at the time of this writing. Assuming that the average Web page is about 25 kilobytes (King, 2009), the size of the Visible Web is larger than 465 terabytes.
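The size estimate is a simple product of page count and average page size. A minimal Python sketch of this arithmetic follows; the function names and the choice of binary units (1 kilobyte = 1,024 bytes, 1 terabyte = 1,024^4 bytes) are our assumptions for illustration and are not taken from the cited sources.

```python
KIB = 1024       # bytes per kilobyte (binary units, an assumption)
TIB = 1024 ** 4  # bytes per terabyte (binary units, an assumption)


def estimate_size_tib(num_documents: int, avg_kib: float) -> float:
    """Estimated corpus size in terabytes for a given document count."""
    return num_documents * avg_kib * KIB / TIB


def documents_for_size(size_tib: float, avg_kib: float) -> float:
    """Inverse calculation: document count that yields a given size."""
    return size_tib * TIB / (avg_kib * KIB)


if __name__ == "__main__":
    # Eight billion pages at 25 kilobytes each.
    print(f"{estimate_size_tib(8_000_000_000, 25):.1f} terabytes")
    # Page count implied by a 465-terabyte corpus at the same average size.
    print(f"{documents_for_size(465, 25) / 1e9:.1f} billion pages")
```

Under these assumptions, eight billion pages of 25 kilobytes each come to roughly 186 terabytes, while a size of 465 terabytes corresponds to about 20 billion pages of the same average size. The estimate is therefore quite sensitive to the page count that is assumed.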

Document Extension        Number of Documents
HTML, XHTML, HTM, XHTM    34,537,940,000
PDF                        1,840,000,000
XML                        1,080,000,000
DOC, DOCX                    882,210,000
TXT, TEXT                    591,120,000
XLS, XLSX                    183,130,000
RDF (+RSS)                   102,300,000
RTF                           56,700,000
PS                            38,800,000
CSV                           26,900,000
PPT, PPTX                     26,690,000
TEX                           10,800,000
Atom                           5,490,000
ODT                            2,830,000
MW                             1,160,000
OWL                            1,050,000

Table 2.3: Estimated Distribution of Document File Formats on the Web in Absolute Numbers as of April 2010

Deep Web

The Deep Web contains all resources on the Web that cannot be reached by following the link structure of the Web. More than 50 % of these resources are behind forms that provide access to topic-specific databases. After a database has been queried, Web pages are created dynamically depending on the given query (Bergman, 2001). The content might also be secured and accessible only to registered users of a website, but 95 % of it is publicly accessible. Automatically exploring and retrieving information from the Deep Web is much more difficult than from the Visible Web. The “entry points”, usually Web forms, are sparsely distributed over the Web (Barbosa and Freire, 2007) and need to be discovered by specialized crawlers. After a set of forms has been found, it is a challenging task to submit them with meaningful queries in order to retrieve the hidden documents. Ntoulas et al. (2005) and Madhavan et al. (2008) study the problem of automatically generating queries for these forms. Another way to access the Deep Web is through APIs. These interfaces are provided by the content owner and can be queried by a program to return data from the underlying database. The Deep Web is also referred to as Deepnet, Invisible Web, Dark Web, or Hidden Web (Olston and Najork, 2010). In 2001, it was estimated that the Deep Web was 400 to 550 times bigger than the Visible Web (Bergman, 2001), which would amount to over 93,000 terabytes of textual data.
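Submitting a form and calling an API both come down to sending a parameterized request to the front end of a database. The following Python sketch, which uses only the standard library, builds one such request URL per candidate keyword. The endpoint `https://example.org/search`, the field name `q`, and the keywords are hypothetical placeholders, and the sketch does not implement the query generation strategies of the cited papers.

```python
from urllib.parse import urlencode


def build_query_urls(form_action: str, field: str, keywords: list[str]) -> list[str]:
    """Build one GET request URL per candidate keyword for a search form.

    form_action is the URL the form submits to and field is the name of
    its text input; both are hypothetical here and would normally be
    read from the form's HTML by a crawler.
    """
    return [f"{form_action}?{urlencode({field: kw})}" for kw in keywords]


if __name__ == "__main__":
    urls = build_query_urls(
        "https://example.org/search",  # hypothetical form endpoint
        "q",                           # hypothetical input field name
        ["solar panel", "laptop"],     # hypothetical candidate queries
    )
    for url in urls:
        print(url)
```

A crawler would fetch each of these URLs and treat every returned page as one dynamically generated Deep Web document. Choosing keywords that actually return useful results is the hard part and is not addressed by this sketch.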


Semantic Web

The Semantic Web, also called the Web of Data or Linked Data⁴, is an initiative of the World Wide Web Consortium (Berners-Lee et al., 2001). It comprises all resources on the Web that can be parsed and “understood” by machines, and it aims to enable a new set of applications that compute and reason on this semantic information. The Semantic Web contains at least 13 billion triples⁵ at the time of this writing. A triple consists of a subject, a predicate, and an object, and can be represented in many different ways, such as RDF/XML, N3, or Turtle. Let us assume that one triple consists of 200 characters (200 bytes in ASCII). With these estimated values, we can calculate that the size of the Semantic Web is bigger than 2.3 terabytes.
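To make the notion of a triple and the size estimate concrete, the following Python sketch models a triple and writes it in N-Triples syntax, a line-based format that is not among those listed above but that, for a simple statement like this one, is also valid Turtle. The `Triple` class, the function names, the example resource `http://example.org/TimBL`, and the use of binary terabytes are our assumptions for illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # IRI of the resource being described
    predicate: str  # IRI of the property
    obj: str        # plain string literal (this sketch supports no IRI objects)


def to_ntriples(t: Triple) -> str:
    """Serialize a triple with a string literal object as one N-Triples line.

    Only backslashes and double quotes are escaped; other characters that
    need escaping (such as newlines) are not handled in this sketch.
    """
    literal = t.obj.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{t.subject}> <{t.predicate}> "{literal}" .'


def triples_size_tib(num_triples: int, bytes_per_triple: int) -> float:
    """Estimated size in terabytes (1 terabyte = 1,024^4 bytes, an assumption)."""
    return num_triples * bytes_per_triple / 1024 ** 4


if __name__ == "__main__":
    example = Triple(
        "http://example.org/TimBL",        # hypothetical subject IRI
        "http://xmlns.com/foaf/0.1/name",  # FOAF name property
        "Tim Berners-Lee",
    )
    print(to_ntriples(example))
    print(f"{triples_size_tib(13_000_000_000, 200):.2f} terabytes")
```

The example line is well under 200 characters, so the assumed 200 bytes per triple leaves room for longer IRIs and literals. With 13 billion triples, the sketch yields a value slightly above 2.3 terabytes.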

To explain the differences among the three areas, we consider the five dimensions shown in Figure 2.3. On all axes of the graphic, values range from one to three, except the size axis, which is measured in terabytes and is logarithmic. The five dimensions are:

1. The size in terabytes of the sum of all documents.

2. The degree of human accessibility. About 95 % of the Deep Web data is freely accessible (Bergman, 2001), but it is hidden behind forms and not accessible through multi-purpose search engines such as Google. Humans can, however, easily access the data behind the forms. The Semantic Web is primarily made for machines and is therefore not as easy to access as the Visible Web, which is primarily made to be consumed by humans.

3. The degree of human understandability. The Deep Web and the Visible Web are often encoded similarly and aim to be consumed by humans. Semantic Web documents are not encoded in a way that humans can easily understand.

4. The degree of machine accessibility. The Deep Web is hidden behind forms and therefore is not easy to access for machines. The Semantic Web is primarily made for machines and is therefore easier to access than the Deep Web. The Visible Web can be almost as easily accessed by machines as by humans.

5. The degree of machine understandability. The Deep Web and the Visible Web are often encoded similarly and aim to be consumed by humans, not machines. Semantic Web documents are meant to be parsed and “understood” by machines.

We have seen that the Web can be broadly divided into three major areas: the Visible Web, the Deep Web, and the Semantic Web. In this thesis, we will focus on extracting information from the Visible Web and the Semantic Web. We choose not to investigate the retrieval of documents from the Deep Web since this part of the Web is much more difficult to access. Ntoulas et al. (2005) and Madhavan et al. (2008) studied how to retrieve Web pages from the Deep Web, but we focus on extraction rather than retrieval.

⁴ We treat these terms as synonyms since there are no tangible differences; Tim Berners-Lee himself said several times that “Linked Data is the Semantic Web done right” (see LDOW workshop 2008, http://blog.dbtune.org/post/2008/05/12/LDOW-and-WWW-2008, last accessed on 25th of March 2012).

⁵ http://esw.w3.org/TaskForces/CommunityProjects/LinkingOpenData/DataSets/Statistics, last accessed on 18th of June 2012.

Figure 2.3: Comparison of Web Areas