4.4 Conclusions
5.1.2 Resources for Scope-Based Experiments
This subsection describes how we obtained the resources needed to carry out experiments in the Spanish Geography domain in Spanish. These resources were: the question corpus (validation and test), the document collection required by the Knowledge-Based offline ODQA Passage Retrieval, and the geographical scope-based resources. Finally, the experiments performed are described.
5.1.2.1 Language and Scope Based Geographical Question Corpus
A corpus of geographical questions was obtained from Albayzin, a speech corpus (Diaz et al., 1998) that contains a geographical subcorpus with utterances of questions, in Spanish, about the geography of Spain. A set of 6,887 question patterns was obtained from Albayzin. This set was analyzed and the following types of questions were extracted: Partial Direct, Partial Indirect, and Imperative Interrogative factoid questions with a simple level of difficulty (e.g. questions without nested questions). A set of 2,287 question patterns was selected. To create the question corpus, a random process selected 177 question patterns from the previous selection (see Table 5.4). These patterns were randomly instantiated with geographical NEs from the Albayzin corpus. Then, the answers were searched for on the Web and in the Spanish Wikipedia (SW). The results of this process were: 123 questions with an answer in both the SW and the Web, 33 questions without an answer in the SW but with an answer on the Web, and 21 questions without an answer, because some instantiated questions cannot be answered (e.g. which sea bathes the coast of Madrid?). The 123 questions with an answer in the SW were divided into two sets: 61 questions for development (setting thresholds and other parameters) and 62 for test (see these questions in Table 5.5).
124 Chapter 5. Geographical Question Answering Approaches
¿A qué comunidad autónoma pertenece el <PICO>?
To which state does <PEAK> belong?
¿Cuál es el capital de <COMUNIDAD>?
Which is the capital of <STATE>?
¿Cuál es la comunidad en la que desemboca el <RíO>?
What is the state in which <RIVER> flows into the sea?
¿Cuál es la extensión de <COMUNIDAD>?
Which is the extension of <STATE>?
Longitud del río <RíO>.
Length of river <RIVER>.
¿Cuántos habitantes tiene la <COMUNIDAD>?
How many people does <STATE> have?
Table 5.4: Some question patterns from Albayzin.
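The instantiation of patterns such as those in Table 5.4, and the later division of the answerable questions into development and test sets, can be sketched as follows. This is only an illustrative sketch: the gazetteer contents, the function names, the fixed seed, and the use of a random shuffle for the 61/62 split are assumptions, not details given in the text.

```python
import random
import re

# Hypothetical gazetteer: slot name -> candidate named entities.
GAZETTEER = {
    "PICO": ["Mulhacén", "Teide"],
    "COMUNIDAD": ["Andalucía", "Navarra"],
    "RíO": ["Segura", "Llobregat"],
}

# A slot is any text enclosed in angle brackets, e.g. <PICO>.
SLOT = re.compile(r"<([^<>]+)>")


def instantiate(pattern, gazetteer, rng):
    """Replace every <SLOT> in a pattern with a randomly chosen entity."""
    return SLOT.sub(lambda m: rng.choice(gazetteer[m.group(1)]), pattern)


def split_dev_test(questions, n_dev, rng):
    """Shuffle a copy of the questions and cut it into development and test sets."""
    shuffled = list(questions)
    rng.shuffle(shuffled)
    return shuffled[:n_dev], shuffled[n_dev:]


rng = random.Random(0)  # fixed seed, only to make the sketch repeatable
patterns = [
    "¿A qué comunidad autónoma pertenece el <PICO>?",
    "¿Cuál es la extensión de <COMUNIDAD>?",
    "Longitud del río <RíO>.",
]
questions = [instantiate(p, GAZETTEER, rng) for p in patterns]

# Stand-in identifiers for the 123 questions with an answer in the SW.
dev, test = split_dev_test(range(123), 61, rng)
```

Random instantiation can produce unanswerable questions (the text gives "which sea bathes the coast of Madrid?" as an example), which is why the answers had to be checked afterwards.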
1 ¿A qué comunidad autónoma pertenece el Puigcampana?
To what autonomous community does the Puigcampana belong?
2 ¿A qué comunidad pertenece El Ferrol?
To what community does El Ferrol belong?
3 ¿A qué comunidad pertenece la isla La Gomera?
To what community does the island of La Gomera belong?
4 ¿A qué mar desemboca la ría de Betanzos?
Into what sea does the Betanzos estuary flow?
5 ¿Cuál es el sistema de la comunidad autónoma Canaria?
What is the mountain range of the Canary autonomous community?
6 ¿Cuál es el capital de Andalucía?
What is the capital of Andalusia?
7 ¿Cuál es el nombre de la comunidad autónoma en la que se encuentra Cullera?
What is the name of the autonomous community in which Cullera is located?
8 ¿Cuál es la capital Navarra?
What is the capital of Navarre?
9 ¿Cuál es la capital de las islas Las Canarias?
What is the capital of the Canary Islands?
10 ¿Cuál es la comunidad en la que desemboca el Guadalentín?
What is the community in which Guadalentín ends?
11 ¿Cuál es la extensión de la comunidad Madrileña?
What is the extension of the Madrid community?
12 ¿Cuál es la extensión de la comunidad autónoma donde está el golfo de Vizcaya?
What is the extension of the autonomous community in which the Bay of Biscay is located?
13 ¿Cuál es la extensión de la comunidad de Castilla y León?
What is the extension of the community of Castilla y León?
14 ¿Cuántos habitantes tiene la comunidad autónoma de Castilla?
How many inhabitants does the autonomous community of Castile have?
15 ¿Cómo se llama la capital de la comunidad autónoma de La Rioja?
What is the capital of the autonomous community of La Rioja called?
16 ¿Cómo se nombra el río que pasa por Granada?
How is the river that passes through Granada named?
17 Dime a qué sistema pertenece el pico Teide?
Tell me to which system the Teide peak belongs?
18 Dime a qué comunidad pertenece el cabo de La Nao?
Tell me to which community the Cape of La Nao belongs?
19 Dime a qué comunidad pertenece la ría de Vigo?
Tell me to which community the Vigo estuary belongs?
20 Dime el mar en que desemboca el Llobregat?
Tell me the sea into which the Llobregat flows?
21 Dime el mar que baña las islas Canarias?
Tell me the sea that bathes the Canary Islands?
5.1. GeoTALP-QA Geographical Question Answering Approach 125
Tell me in what system is the river Aragón born?
23 Dime en qué comunidad autónoma se encuentra Manacor?
Tell me in what autonomous community is Manacor?
24 Dime en qué comunidad autónoma se encuentra la ciudad de Barbastro?
Tell me in what autonomous community is the city of Barbastro?
25 Dime en qué comunidad desemboca el Llobregat?
Tell me in what community the Llobregat ends?
26 Dime en qué mar está la isla de Conejera?
Tell me in what sea is the island of Conejera?
27 Dime la población de la comunidad autónoma de Murcia?
Tell me the population of the autonomous community of Murcia?
28 Dime qué extensión tiene la isla de Hierro?
Tell me the extent of the island of Hierro?
29 ¿Dónde está la isla de Gran Canaria?
Where is the island of Gran Canaria?
30 ¿Dónde está la ría Ribadeo?
Where is the Ribadeo estuary?
31 ¿En qué archipiélago se encuentra Mallorca?
In what archipelago is Mallorca located?
32 ¿En qué ciudad desemboca el río Segura?
In what city does the Segura River end?
33 ¿En qué comunidad autónoma está el Cantábrico?
In what autonomous community is the Cantabrian Sea located?
34 ¿En qué comunidad autónoma está el Mulhacén?
In what autonomous community is the Mulhacen located?
35 ¿En qué comunidad autónoma está el cabo Tarifa?
In what autonomous community is the Tarifa Cape located?
36 ¿En qué comunidad autónoma está situada la Sierra de Gúdar?
In what autonomous community is the Sierra of Gúdar located?
37 ¿En qué comunidad autónoma están los Picos de Europa?
In what autonomous community are the Picos de Europa located?
38 ¿En qué comunidad autónoma se encuentra la isla de La Gomera?
In what autonomous community is the island of La Gomera located?
39 ¿En qué comunidad autónoma se encuentra la Sierra del Maestrazgo?
In what autonomous community is the Sierra of Maestrazgo located?
40 ¿En qué comunidad está la sierra de Somosierra?
In what autonomous community is the sierra of Somosierra located?
41 ¿En qué comunidad nace el río Guadarrama?
In what community does the Guadarrama river rise?
42 ¿En qué comunidad se encuentra el cabo San Adrián?
In what autonomous community is the San Adrián Cape located?
43 ¿En qué comunidad se encuentran los Pirineos?
In what autonomous community are the Pyrenees located?
44 ¿En qué mar está situado el golfo de Cádiz?
In what sea is the Gulf of Cádiz located?
45 ¿En qué mar se encuentra la ría de Camariñas?
In what sea is the Camariñas estuary?
46 La comunidad en la que nace el río Guadalbullón?
The community in which the river Guadalbullón is born?
47 Me gustaría saber la extensión de la comunidad Vasca?
I would like to know the extension of the Basque community?
48 Nombre de la capital de Andalucía?
Name of the capital of Andalusia?
49 Nombre de la capital de la comunidad autónoma de Andalucía?
Name of the capital of the Autonomous Community of Andalusia?
50 Nombre de la comunidad donde nace el río Eresma?
Name of the community where the river Eresma is born?
51 Podría decirme el número de habitantes de Figueras?
Can you tell me the number of inhabitants of Figueras?
52 Quiero que me digas la capital de la comunidad autónoma de Canarias?
I want you to tell me the capital of the autonomous community of the Canary Islands?
53 Quisiera saber el mar en donde está situada La Gomera?
I would like to know the sea where La Gomera is located?
54 ¿Qué capital tiene Castilla?
What capital does Castilla have?
55 ¿Qué extensión tiene La Gomera?
56 ¿Qué extensión tiene la comunidad autónoma Asturiana?
What is the extension of the Asturian Autonomous Community?
57 ¿Qué mar baña el golfo de Onteniente?
What sea bathes the Gulf of Onteniente?
58 ¿Qué mar baña la comunidad autónoma Murciana?
What sea bathes the Murcian autonomous community?
59 ¿Qué mar es el que baña a la comunidad de Murcia?
What sea bathes the Murcian community?
60 ¿Qué número de habitantes tiene Castilla la Mancha?
What is the number of inhabitants of Castilla la Mancha?
61 ¿Qué número de habitantes tiene Astorga?
What is the number of inhabitants of Astorga?
62 ¿Qué río pasa por Salamanca?
Which river passes through Salamanca?
Table 5.5: Test set of 62 instantiated question patterns from Albayzin (in Spanish).
5.1.2.2 Document Collection for the Knowledge-Based ODQA Passage Retrieval
In order to test our ODQA Passage Retrieval system we need a document collection with enough geographical information to answer the questions of the Albayzin corpus. We used a filtered version of the Spanish Wikipedia. First, we obtained the original set of documents (26,235 files). Then, we selected two sets of 120 documents, one about the Spanish geography domain and one about the non-Spanish geography domain. Using these sets we obtained a set of Topic Signatures (TS) (C.-Y. Lin and E. Hovy, 2000) for the Spanish geography domain and another set of TS for the non-Spanish geography domain. Then, we used these TS to filter the documents from Wikipedia, and we obtained a set of 8,851 documents belonging to the Spanish geography domain. These documents were pre-processed and indexed.
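The filtering step can be illustrated with a small sketch. It weights terms with a log-likelihood ratio, the statistic commonly associated with Topic Signatures, and keeps a document when the in-domain signature scores higher than the out-of-domain one. The function names, the `top_n` cut-off, the toy documents, and the decision rule itself are assumptions made for illustration; the text does not specify how the signatures were applied.

```python
import math
from collections import Counter


def _log_l(p, k, n):
    """Binomial log-likelihood k*log(p) + (n-k)*log(1-p), taking 0*log(0) as 0."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - p)
    return total


def topic_signature(relevant_docs, other_docs, top_n=50):
    """Return the top_n terms over-represented in relevant_docs, with LLR weights.

    Both arguments are non-empty lists of tokenized documents (lists of strings).
    """
    rel = Counter(tok for doc in relevant_docs for tok in doc)
    oth = Counter(tok for doc in other_docs for tok in doc)
    n1, n2 = sum(rel.values()), sum(oth.values())
    weights = {}
    for term, k1 in rel.items():
        k2 = oth.get(term, 0)
        p1, p2, p = k1 / n1, k2 / n2, (k1 + k2) / (n1 + n2)
        if p1 <= p2:
            continue  # keep only terms more frequent in the relevant set
        weights[term] = 2.0 * (
            _log_l(p1, k1, n1) + _log_l(p2, k2, n2)
            - _log_l(p, k1, n1) - _log_l(p, k2, n2)
        )
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_n])


def in_domain(doc, domain_ts, other_ts):
    """Assumed decision rule: compare the summed signature weights of the tokens."""
    score_domain = sum(domain_ts.get(tok, 0.0) for tok in doc)
    score_other = sum(other_ts.get(tok, 0.0) for tok in doc)
    return score_domain > score_other


# Toy seed sets standing in for the two hand-selected document sets.
spanish_geo = [["río", "ebro", "aragón", "españa"], ["sierra", "nevada", "granada", "españa"]]
other = [["river", "thames", "london", "england"], ["película", "director", "actor", "españa"]]

domain_ts = topic_signature(spanish_geo, other)
other_ts = topic_signature(other, spanish_geo)
keep = in_domain(["el", "río", "ebro"], domain_ts, other_ts)
drop = in_domain(["thames", "london"], domain_ts, other_ts)
```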
5.1.2.3 Geographical Scope-Based Resources
A Knowledge Base (KB) of Spanish Geography has been built using four resources:
• GNS: A set of 32,222 non-ambiguous place names of Spain.
• Albayzin Gazetteer: a set of 758 places.
• A Grammar for creating NE aliases. We created patterns for the summit and state classes (the ones with more variety of forms), and we expanded these patterns using the entries of Albayzin.
• A lexicon of 462 trigger words.
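As an illustration of how an alias grammar of this kind might be expanded over gazetteer entries, the following sketch generates surface forms for the state and summit classes and maps each form back to its entry. The templates, the class labels, and the sample entries are hypothetical; the actual grammar is not reproduced in the text.

```python
# Hypothetical alias templates per class; "{name}" is filled with the entry name.
ALIAS_PATTERNS = {
    "state": ["{name}", "comunidad de {name}", "comunidad autónoma de {name}"],
    "summit": ["{name}", "pico {name}", "monte {name}"],
}


def expand_aliases(entries, patterns):
    """Map every generated alias (lowercased) to the entry name it refers to.

    `entries` is a list of (name, class) pairs; classes without templates
    contribute only the bare name.
    """
    aliases = {}
    for name, cls in entries:
        for template in patterns.get(cls, ["{name}"]):
            aliases[template.format(name=name).lower()] = name
    return aliases


# Sample entries standing in for the Albayzin gazetteer.
entries = [("Murcia", "state"), ("Teide", "summit"), ("Segura", "river")]
aliases = expand_aliases(entries, ALIAS_PATTERNS)
```

A lookup such as `aliases["comunidad autónoma de murcia"]` then resolves the longer form to the entry `Murcia`.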
A set of 7,632 groups of place names was obtained by applying the grouping process to GNS. These groups contain a total of 17,617 place names, with an average of 2.51 place names per group. See Figure 5.3 for an example of a group where the canonical term appears underlined.
{Cordillera Pirenaica, Pireneus, Pirineos, Pyrenaei Montes, Pyrénées, Pyrene, Pyrenees}
Figure 5.3: Example of a group obtained from GNS.
In addition, a set of the most common trigger phrases in the domain has been obtained from the GNS gazetteer (see Table 5.6).
Top-ranked trigger phrases by geographical scope:
Spain | UK
TRIGGER de NE | NE TRIGGER
TRIGGER NE | TRIGGER NE
TRIGGER del NE | TRIGGER of NE
TRIGGER de la NE | TRIGGER a’ NE
TRIGGER de las NE | TRIGGER na NE
Table 5.6: Sample of the top-ranked trigger phrases automatically obtained from the GNS gazetteer for the geography of Spain and the UK.
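One way trigger-phrase patterns like those in Table 5.6 could be derived from gazetteer entries is sketched below: each place name is abstracted by replacing trigger words with TRIGGER, keeping linking words, and collapsing the remaining tokens into NE; the resulting patterns are then ranked by frequency. The linker list, the sample names, the tiny trigger lexicon, and the frequency ranking are assumptions for illustration, not the procedure documented in the text.

```python
from collections import Counter

# Assumed linking words that are kept verbatim in a pattern.
LINKERS = {"de", "del", "la", "las", "los", "of", "na"}


def phrase_pattern(place_name, triggers):
    """Abstract a place name into a pattern such as 'TRIGGER de NE'.

    Returns None when the name contains no trigger word.
    """
    out = []
    for tok in place_name.lower().split():
        if tok in triggers:
            label = "TRIGGER"
        elif tok in LINKERS:
            label = tok
        else:
            label = "NE"
        if label == "NE" and out and out[-1] == "NE":
            continue  # collapse a multiword name into a single NE
        out.append(label)
    return " ".join(out) if "TRIGGER" in out else None


def rank_patterns(place_names, triggers):
    """Count the patterns of all names that contain a trigger word."""
    patterns = (phrase_pattern(name, triggers) for name in place_names)
    return Counter(p for p in patterns if p is not None)


# Sample names and a small stand-in for the trigger-word lexicon.
triggers = {"sierra", "golfo", "ría", "río"}
names = [
    "Sierra de Gúdar",
    "Golfo de Cádiz",
    "Ría de Vigo",
    "Río Segura",
    "Sierra del Maestrazgo",
    "La Gomera",
]
ranking = rank_patterns(names, triggers)
top_pattern, top_count = ranking.most_common(1)[0]
```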