
4.4 Applying the Knowledge Generator

4.4.1 Method for coverage estimation

The goal of the coverage estimation method is to assign an MBBOX to a Web page. First, the heuristics used are described, and then some details are presented. Finally, some disadvantages of the developed method are discussed.

The coverage estimation method consists of two heuristics: a content–based heuristic (H3) and a host–based heuristic (Hhip). H3 estimates the coverage by analysing geographic information found within different elements of Web pages (mainly the geocoded place names). Hhip is used when H3 has not been successful. Hhip infers a country code (ISO 3166–1 alpha–2 code (ISO, 2007a)) from the host (i.e. host name or IP), and then the code is geocoded to the MBBOX. Apart from the MBBOX, the final coverage estimation method (H3 + Hhip) returns a textual representation of the geographic scope, a code and some provenance information (see Section 4.4.2).

Figure 4.3 shows an overview of the coverage estimation method, with emphasis on the code attribute. It can be observed that the final value of the code might be POINT or ESTIMATED. First, the content–based heuristic tries to identify the gMeta within the header metadata that provide latitude and longitude. However, this information is rarely provided (see Section 4.4.3). Therefore, the content–based heuristic focuses on the toponyms found within the Web page. The task of estimating the coverage from a text comprises three general steps:

1. Toponym recognition. This step produces a list of candidate place names (Lner).

2. Toponym resolution. This step identifies the geographic entity (entityg, an element of a simple territorial ontology) to which each place name in the Lner refers, and it produces a set of geographic entities (Lge).

3. Geo–scope estimation. This step tries to estimate the MBBOX that best represents the extracted set of geographic entities.

Here, the task of estimating the representative geographic entity from a set of toponyms found in a Web page is called EntitygEstimation. Two external tools are used in this task: a NER tool and a geocoder tool. The first one is used to create a Lner list. The developed heuristic treats place names recognised in different elements of the Web page separately. According to the processed element, the following Lner lists can be created:

Figure 4.3: Overview of the coverage estimation method.

1. PngMeta, that is a Lner extracted from the text values of the gMeta,

2. Pnmeta, that is a Lner extracted from the header element (other than gMeta) and the title element of the Web page,

3. Pnbody, that is a Lner of the Web page body, i.e. the visible text (including links) and the invisible tags of images.

The geocoder module is used to create the Lge from a Lner. The geocoder produces a ranked list of geographic entity proposals for each item of the Lner. The geographic entity returned is encoded in XML, and the data model used allows the identification of the related concept from a territorial ontology. A simple territorial ontology has been used in this work, which is the result of the analysis of three existing standard models: the FIPS 10–4 standard for countries, dependencies, areas of special sovereignty and their principal administrative divisions developed by the United States Federal Government (National Institute of Standards and Technology, 1995); the ISO 3166 Codes for the representation of names of countries and their subdivisions (ISO, 2007b); and the Nomenclature of Territorial Units for Statistics (NUTS) developed by the EU (EC, 2003). In this simple ontology, geographic entities are the concepts, and the only relationships of interest are the spatial aggregations, i.e., has–part or part–of. It is a modification of the Administrative Unit domain ontology proposed in López-Pellicer et al. (2008). Natural phenomena and towns have been considered as well. The resultant ontology gathers geographic entities of the following types:

1. Feature (FT) that represents a natural phenomenon, for example “Danube” (river) and “Alps” (mountain range),

2. Earth region (ERT) that defines international organisations, for example “European Union” and “United Nations”,

3. Country (CT) that represents countries in the world,

4. Region (RT) that represents the top level administrative divisions of a country,

5. Sub–region (SRT) that represents the administrative divisions of a country lower than the top ones,

6. Town (TT) that refers to cities.

For example, in the case of the “Barcelona” toponym, the expected entityg is “Barcelona” (TT) in the “province of Barcelona” (SRT) in “Catalonia” (RT) of “Spain” (CT). The ERT entities are related to the countries they gather (has–part), and FT entities are related to the countries they belong to (part–of). The Lge is created by assigning to each item in the Lner the first entityg from the ranked list.
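The part–of chain of the “Barcelona” example can be illustrated with a minimal sketch. The Entity class and the at_level helper are hypothetical names introduced here for illustration only; they are not part of the data model used in this work, and only the TT, SRT, RT and CT types are covered.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entity:
    """Hypothetical geographic entity with a single part-of link."""
    name: str
    level: str  # one of "TT", "SRT", "RT", "CT"
    part_of: Optional["Entity"] = None


def at_level(entity, level):
    """Follow part-of links upward until an entity of the given level is found."""
    current = entity
    while current is not None:
        if current.level == level:
            return current
        current = current.part_of
    return None  # the entity has no representation at that level


# The chain described in the text for the "Barcelona" toponym.
spain = Entity("Spain", "CT")
catalonia = Entity("Catalonia", "RT", spain)
province = Entity("province of Barcelona", "SRT", catalonia)
barcelona = Entity("Barcelona", "TT", province)
```

With these definitions, at_level(barcelona, "RT") yields the “Catalonia” entity, whereas at_level(spain, "TT") yields None because a country has no town–level representation.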

The geo–scope estimation procedure uses a Lge to calculate frequencies of the geographic entities for different levels of accuracy in the following order: TT, SRT, RT, CT, FT, ERT and EARTH. Each Lge item is represented via the entityg to which it is related at the accuracy level that is being calculated (e.g. “Barcelona” (TT) will be represented by “Catalonia” at the RT level of accuracy). The method returns the entityg of maximum frequency and the ESTIMATED code. If the method cannot estimate the coverage (e.g. it fails if the Lge is empty), the “Global” entityg and the ASSIGNED code are returned.
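The frequency calculation can be sketched as follows. The text does not state how a level is accepted, so the rule used here (the most frequent entity must represent more than a min_share fraction of the Lge items) and the min_share parameter itself are assumptions made only for illustration. The sketch also simplifies the EARTH level to the “Global”/ASSIGNED fallback. Each Lge item is modelled as a dictionary from accuracy level to entity name, which is a hypothetical representation.

```python
from collections import Counter

# Accuracy levels from finest to coarsest (EARTH is handled by the fallback).
LEVELS = ["TT", "SRT", "RT", "CT", "FT", "ERT"]


def estimate_geo_scope(lge, min_share=0.5):
    """Return (entity_name, code) for a list of resolved entities.

    Each item maps an accuracy level to the name of the entity that
    represents the item at that level. min_share is an assumed
    acceptance threshold, not a value taken from the text.
    """
    if not lge:
        return "Global", "ASSIGNED"
    for level in LEVELS:
        counts = Counter(item[level] for item in lge if level in item)
        if not counts:
            continue
        name, freq = counts.most_common(1)[0]
        if freq / len(lge) > min_share:
            return name, "ESTIMATED"
    return "Global", "ASSIGNED"


barcelona = {"TT": "Barcelona", "SRT": "province of Barcelona",
             "RT": "Catalonia", "CT": "Spain"}
girona = {"TT": "Girona", "SRT": "province of Girona",
          "RT": "Catalonia", "CT": "Spain"}
madrid = {"TT": "Madrid", "SRT": "province of Madrid",
          "RT": "Community of Madrid", "CT": "Spain"}

result = estimate_geo_scope([barcelona, girona, madrid])
```

In this example no town or sub–region is shared by most items, so under the assumed rule the procedure settles at the RT level, where “Catalonia” represents two of the three items.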

The final heuristic (H3 + Hhip) is performed as follows. First, the content–based heuristic is run (see Figure 4.4). The gMeta are checked, and if the gMeta provide a point, it is used to create the MBBOX and the result code has the POINT value. If no spatial object has been distinguished, the text values are analysed to create PngMeta and then the corresponding Lge. If the geo–scope estimation procedure fails (the code has the ASSIGNED value), a weighted list is created by joining the PngMeta and the Pnmeta, and the procedure is run again. If it fails again, the Pnbody is added, new weights are assigned (w3(PngMeta), w2(Pnmeta), w1(Pnbody), where wi = i) and the geo–scope estimation procedure is run again.

Figure 4.4: Overview of the content–based heuristic.

The host–based heuristic is used only when the heuristic H3 fails to estimate the coverage (i.e. it returns the ASSIGNED code), which usually happens due to the lack of metadata and poor NER results. The heuristic Hhip tries to extract the ISO country code from the host name of the analysed Web page (HName) and, if it is not successful, its IP is georeferenced to an ISO country code (HIP). Then, the ISO code is geocoded to an MBBOX, which is returned together with the ESTIMATED code. Table 4.2 shows some examples of Web pages whose coverage has been estimated by the host–based heuristic.
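The order of the two host–based steps can be sketched as below. This is only an illustration under stated assumptions: the set of country codes is a small illustrative subset of ISO 3166–1 alpha–2, URLs are assumed to carry a scheme such as http://, and ip_lookup is a hypothetical caller–supplied function standing in for the IP georeferencing service, which the text does not name.

```python
from urllib.parse import urlsplit

# Illustrative subset of ISO 3166-1 alpha-2 codes (assumption, not exhaustive).
ISO_CODES = {"CZ", "ES", "FR", "PL"}


def code_from_host_name(url):
    """HName step: read a country code from the last label of the host name."""
    host = urlsplit(url).hostname or ""
    last_label = host.rsplit(".", 1)[-1].upper()
    return last_label if last_label in ISO_CODES else None


def host_based_code(url, ip_lookup):
    """Hhip: try the host name first, then fall back to the IP lookup (HIP)."""
    code = code_from_host_name(url)
    if code is None:
        host = urlsplit(url).hostname or ""
        code = ip_lookup(host)  # hypothetical georeferencing callback
    return code


def stub_lookup(host):
    """Stand-in for a real IP georeferencing service; always answers ES."""
    return "ES"


cz_code = host_based_code("http://bnhelp.cz", stub_lookup)
es_code = host_based_code("http://b5m.gipuzkoa.net", stub_lookup)
```

For bnhelp.cz the code is read directly from the host name, while for b5m.gipuzkoa.net the generic .net suffix carries no country code and the stubbed lookup supplies the answer.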

URL               Manual estimation             H3 Code   HName (Hhip)  HIP (Hhip)  Hhip  H3 + Hhip Code
bnhelp.cz         CZ                            ASSIGNED  CZ            –           CZ    ESTIMATED
b5m.gipuzkoa.net  Gipuzkoa, Basque Country, ES  ASSIGNED  –             ES          ES    ESTIMATED

Table 4.2: Example of the Hhip heuristic results.

The developed content–based heuristic is simple and has several problems. First, the candidate place names are stripped of their context when passed to the geocoder. For example, the geocoder does not consider other place names from the same Lner that have been identified near the searched place name within the text. Second, the procedure for the creation of the Lge delegates the ranking to the geocoder as well. The algorithm that creates the Lge could, for example, re–rank the geocoding list according to other items within the Lge. Nevertheless, the results of the experiments performed show that this straightforward approach can be satisfactory in the context of this work.