Finding Structured Data from Unstructured Data for Question Answering


2014 International Conference on Information, Communication Technology and System (ICTS 2014), Surabaya, Indonesia. 978-1-4799-6858-9/14/$31.00 ©2014 IEEE


Dewi W. Wardani¹, Titik Musyarofah²

¹²Informatics Department, Sebelas Maret University, Surakarta, Central Java, Indonesia

¹[email protected], ²[email protected]

Abstract— Most research on automatic Question Answering systems uses unstructured data as the data source, and the use of structured data has been largely neglected. Unstructured data yields answers in the form of a sentence snippet or a list of snippets, whereas structured data can produce a precise and concise answer. Structured data is of good quality and carries non-trivial information, so it is very useful for obtaining a precise answer. In our view, structured data can be found within unstructured data and then used. Using it for Question Answering is a fairly novel idea. Therefore, we propose a new approach for finding structured data in unstructured data for Indonesian-language Question Answering in order to obtain precise and concise answers. Our approach achieved an accuracy of 85.37%.

Keywords— question answering, structured data, unstructured data.

I. INTRODUCTION

Most research on automatic Question Answering uses unstructured data such as web pages as the data source and returns answers in the form of a sentence snippet or a list of snippets. Bag-of-words retrieval is popular among developers of automatic Question Answering systems [1]. According to [2], a Question Answering system produces the final answer for the user automatically. Question Answering aims at retrieving precise information from a large collection of documents [3].

However, processing unstructured data is recognized as one of the problems in information technology [4]. As the example in Figure 1 shows, the result of a question posed to Wikipedia (Indonesian language) is a bag of words, not a precise answer; we have to read further to reach the precise answer ourselves. Further processing of unstructured data is needed to make it more useful, particularly since about 80% to 85% of data is stored in unstructured format [5].

Precise and concise answers are needed to satisfy the user, and such answers can probably only be given by structured data. This is because structured data usually contains high-quality and non-trivial information, whereas unstructured data returns snippets or a bag of words, and users must read those snippets to find the precise answer. Our previous research [6] demonstrated that using structured data in information retrieval returns more relevant and more highly ranked results than bag-of-words on a sentence retrieval task. One disadvantage of structured data is that retrieving information from it is neither easy nor flexible. Only an expert user who knows the schema and knows how to write a formal query language can search structured data.

Fig 1. Result of question on unstructured data

In our view, structured data can be obtained from within unstructured information itself. Since most information and data is unstructured, this idea is quite interesting. We do not need a large amount of ready-made structured data such as tables or databases, nor do we need to build a knowledge base, which takes a lot of time. In this work, we propose a new idea for finding structured data in unstructured data (documents). We use documents in the Indonesian language, also for Question Answering. We limit the input to questions that refer to a date, in formal or informal form.

The approach uses pronoun attributes and a word-similarity measure to calculate the similarity between the question and the snippets. A structured approach is also used: structured data is found within the unstructured data and then used as the data source.

II. RELATED WORKS

Our previous research [6] integrated structured and unstructured data. Structured data, which has been neglected by many Question Answering systems and search engines, is very useful; it is used to provide precise and concise answers. Users do not care which kind of resource the relevant information comes from; they just want better answers to their questions [6].

Question Answering for the Indonesian language has been studied widely [7, 2, 8, 9]. [10] proposed a pattern-based approach to an Indonesian Question Answering system; the pattern-based approach is a form of rule-based approach to categorizing questions. [7] proposed a machine learning approach for an Indonesian Question Answering system, applying SVM as the machine learning algorithm. A question with “kapan” (when) as the interrogative word is always a date question [7]. [8] developed a Question Answering system using a rule-based method on the text of the Quran (the holy book of Islam) in the Indonesian language. Question Answering on structured data has also been done by [11]; using an ontology and textual entailment, that system can be used within one language and across languages. In other research, [1] presented an approach to retrieval for Question Answering that applies structured retrieval techniques to the types of text annotations that Question Answering systems use. [6] developed a new idea to improve the accuracy of complex questions by integrating structured data in the form of simple relational databases and unstructured data in the form of web pages. In general, a Question Answering system architecture is composed of six phases, including Question Analysis, Document Collection Preprocessing, Candidate Document Analysis, Answer Extraction, and finally Response Generation [12].

III. CONSIDERED PROBLEM AND IDEA

Automatic Question Answering systems use unstructured data, hence the answer is returned as a list of snippets. Only structured data can provide a precise answer, yet structured data is not widely used in question answering. Our main idea is that structured data can be found within unstructured data (documents). We want to provide both structured and unstructured data from the same resource. The unstructured form is simply the document, and the structured data is obtained by our proposed approach.

Fig 2. Annotation structured data from unstructured data

Figure 2 is an example of obtaining structured data from unstructured data. The annotated date “31 Desember 1799” is explained by the information around it, “VOC resmi dibubarkan pemerintah Belanda…”. In this initial work, we consider only structured information that refers to dates, and likewise only date questions. We handle not only the question word “kapan” (when) but also question words often used in informal form, such as “kapankah” and “tanggal berapa” (what date), and question words that begin with a preposition, such as “pada tanggal berapa” (on what date). Most previous research handles only prefix question words. In this research we handle not only prefix question words but also suffix and confix question words. For the date-related experiments we used documents from the history domain because they contain many dates.

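For example:

a. Kapankah Indonesia merdeka? (Prefix question; "When did Indonesia become independent?")

b. Indonesia merdeka pada tanggal berapa? (Suffix question; "Indonesia became independent on what date?")

c. Sejak kapan Indonesia merdeka? (Confix question; "Since when has Indonesia been independent?")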

IV. METHODOLOGY

The proposed approach consists of three main steps: providing the resource, question analysis, and finding the answer (candidate selection and matching snippet rows).

A. Providing The Resource

The main task of this part is obtaining structured data from unstructured data. When tokenizing, we do not split at abbreviations for names, titles, forms of address, positions, or ranks that are followed by a period (.), as listed in Table 1.

TABLE 1. INDONESIAN LANGUAGE ABBREVIATION

| Abbreviation | Example |
|---|---|
| Name | M., Moh., Muh., etc |
| Title Name | Dr., Drs., Ir., Prof., etc |
| Address | Mr., Mrs., Ms., etc |
| Position or Rank | KH., R., etc |
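The abbreviation handling can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `split_sentences`, the whitespace-based tokenization, and the sample sentence are assumptions made for illustration, and the abbreviation set contains only the entries shown in Table 1.

```python
# Abbreviations taken from Table 1; a real system would likely need a longer list.
ABBREVIATIONS = {"M.", "Moh.", "Muh.", "Dr.", "Drs.", "Ir.", "Prof.",
                 "Mr.", "Mrs.", "Ms.", "KH.", "R."}


def split_sentences(text):
    """Split text into sentences, ignoring the period of a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        # A token closes a sentence only if it ends with terminal punctuation
        # and is not one of the listed abbreviations.
        if token.endswith((".", "?", "!")) and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without terminal punctuation
        sentences.append(" ".join(current))
    return sentences


# Made-up input for illustration only (not taken from the dataset).
text = "Dr. Sutomo mendirikan Budi Utomo. Organisasi itu berdiri tahun 1908."
print(split_sentences(text))
```

With this input, the period after "Dr." does not end a sentence, so the text is split into two sentences rather than three.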

In this initial work, we consider only structured data that refers to date information. Since the data domain is history, the date is one of the most important pieces of information in history question answering. This structured data differs from a knowledge base: it is obtained from the data itself, whereas a knowledge base obtains its knowledge from an expert and takes much time to create. Figure 3 shows an example of obtaining structured data that refers to a date from unstructured data.

To generate structured information that answers a “when” question, the date formats listed in Table 2 need to be recognized. We consider a non-pronoun approach and a pronoun approach.

Pronoun attributes are required in the discovery of structured information. If a sentence contains a pronoun such as those in Table 3, the sentence is considered to be related to the previous sentence, so the previous sentence is also taken into the snippet. For example, in Figure 3 we first recognize the date format in the unstructured data and then obtain the related snippet from the sentence that contains the date and, using the pronoun approach, from the sentences around it. First we recognize the date “29 Oktober 1945”, and then we obtain the snippet that contains the date.

We will obtain structured data as follows:

Date : “29 Oktober 1945”

Snippet : “Mereka juga minta TKR mengosongkan kota Bandung bagian utara, paling lambat tanggal 29 Oktober 1945”


In the next paragraph, we will obtain the structured data:

Date : 2 Mei 1889

Snippet : RM Suwardi Suryaningrat lahir pada tanggal 2 Mei 1889 di Yogyakarta

Fig 3. Example of finding structured data in unstructured data using the non-pronoun approach (above) and the pronoun approach (below)

In another example, the recognized date is “3 Juli 1922”, and the snippet that contains the date also contains the pronoun “Beliau”. In Indonesian, “Beliau” is a respectful pronoun for him or her: “Beliau mendirikan Taman Siswa tanggal 3 Juli 1922 dengan tujuan memajukan pendidikan bangsa Indonesia”. Here “Beliau” refers to Ki Hajar Dewantoro, who is mentioned in the previous sentence. This means that we must also consider the previous sentence, “Namanya lebih dikenal dengan Ki Hajar Dewantoro”. In the end we obtain the structured data:

Date : 3 Juli 1922

Snippet : Namanya lebih dikenal dengan Ki Hajar Dewantoro. Beliau mendirikan Taman Siswa tanggal 3 Juli 1922 dengan tujuan memajukan pendidikan bangsa Indonesia.

The pseudocode of this approach is as follows:

Input : Sentence[i]
Output : Snippet
Step :
Begin
    if there is a “persona pronoun” in Sentence[i] then
        if there is a “pointer pronoun” in Sentence[i+1] then
            Snippet = Sentence[i-1] + Sentence[i] + Sentence[i+1];
        else
            Snippet = Sentence[i-1] + Sentence[i];
        end if
    else
        if there is a “pointer pronoun” in Sentence[i+1] then
            Snippet = Sentence[i] + Sentence[i+1];
        else
            Snippet = Sentence[i];
        end if
    end if
End
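A runnable Python sketch of the pseudocode above may make the four branches easier to follow. It rests on several assumptions: the personal pronoun set holds only the free-standing forms of Table 3 (clitics such as -nya are not handled), the "pointer" pronoun set (`ini`, `itu`, `tersebut`) is a guess because the paper does not list those words, and the bounds checks for the first and last sentence are added here.

```python
import re

# Free-standing personal pronouns from Table 3 (clitics like ku-, -mu, -nya are skipped).
PERSONAL_PRONOUNS = {"saya", "aku", "daku", "kami", "kita", "engkau", "kamu",
                     "anda", "dikau", "kalian", "ia", "dia", "beliau", "mereka"}
# ASSUMPTION: demonstrative ("pointer") words; the paper does not enumerate them.
POINTER_PRONOUNS = {"ini", "itu", "tersebut"}


def _contains(sentence, words):
    """Return True if any lower-cased token of the sentence is in `words`."""
    return bool(set(re.findall(r"\w+", sentence.lower())) & words)


def build_snippet(sentences, i):
    """Build the snippet for sentence i, following the pseudocode branches."""
    parts = []
    # Personal pronoun in Sentence[i]: pull in the previous sentence.
    if i > 0 and _contains(sentences[i], PERSONAL_PRONOUNS):
        parts.append(sentences[i - 1])
    parts.append(sentences[i])
    # Pointer pronoun in Sentence[i+1]: pull in the next sentence.
    if i + 1 < len(sentences) and _contains(sentences[i + 1], POINTER_PRONOUNS):
        parts.append(sentences[i + 1])
    return " ".join(parts)


sentences = [
    "Namanya lebih dikenal dengan Ki Hajar Dewantoro.",
    "Beliau mendirikan Taman Siswa tanggal 3 Juli 1922 dengan tujuan "
    "memajukan pendidikan bangsa Indonesia.",
]
print(build_snippet(sentences, 1))
```

For the second sentence, "Beliau" triggers the inclusion of the previous sentence, which reproduces the snippet shown for the date 3 Juli 1922.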

To generate structured information that refers to dates, the date formats used in the Indonesian language need to be recognized, as listed in Table 2.

TABLE 2. DATE FORMATS

| Format | Example |
|---|---|
| [year] | 1512 |
| [month] [year] | Maret 1830 |
| [day] [month] | 17 Agustus |
| [day] [month] [year] | 1 Januari 1800 |
| [year]-[year] | 1825-1830 |
| [year] sampai [year] | 1247 sampai 1248 |
| [year] sampai dengan [year] | 1518 sampai dengan 1521 |
| [day]-[day] [month] [year] | 27-28 Oktober 1928 |
| [day] [month] sampai [day] [month] [year] | 23 Agustus sampai 2 November 1949 |
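The date formats of Table 2 can be recognized with regular expressions. The following Python sketch is one possible encoding, not the authors' recognizer: the pattern order (longest format first), the assumption that years have four digits, and the function name `find_dates` are all choices made here for illustration.

```python
import re

# Indonesian month names.
MONTH = (r"(?:Januari|Februari|Maret|April|Mei|Juni|Juli|Agustus|"
         r"September|Oktober|November|Desember)")
DAY = r"\d{1,2}"
YEAR = r"\d{4}"  # ASSUMPTION: years are written with four digits

# One pattern per row of Table 2, longest first so that a full date is not
# cut down to a shorter format such as [day] [month] or [year].
DATE_PATTERNS = [
    rf"{DAY} {MONTH} sampai {DAY} {MONTH} {YEAR}",
    rf"{DAY}-{DAY} {MONTH} {YEAR}",
    rf"{DAY} {MONTH} {YEAR}",
    rf"{MONTH} {YEAR}",
    rf"{DAY} {MONTH}",
    rf"{YEAR} sampai(?: dengan)? {YEAR}",
    rf"{YEAR}-{YEAR}",
    YEAR,
]
DATE_RE = re.compile(r"\b(?:" + "|".join(DATE_PATTERNS) + r")\b")


def find_dates(text):
    """Return all date expressions found in the text, in order of appearance."""
    return DATE_RE.findall(text)


snippet = ("Mereka juga minta TKR mengosongkan kota Bandung bagian utara, "
           "paling lambat tanggal 29 Oktober 1945")
print(find_dates(snippet))
```

Because the alternation is tried in order, the snippet above yields the full date "29 Oktober 1945" rather than the bare year.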


The pronouns in the Indonesian language are listed in Table 3.

TABLE 3. PRONOUNS IN INDONESIAN LANGUAGE [13]

| Persona | Singular | Plural |
|---|---|---|
| First | Saya, aku, daku, ku-, -ku | Kami, kita |
| Second | Engkau, kamu, anda, dikau, kau-, -mu | Kalian, kamu (sekalian), Anda (sekalian) |
| Third | Ia, dia, beliau, -nya | Mereka, -nya |

B. Question Analysis

Several forms of question that refer to time are listed in Table 4, and Table 5 gives some example questions that refer to “when” in the Indonesian language.

TABLE 4. QUESTION TYPES THAT REFER TO “WHEN” IN INDONESIAN LANGUAGE

| Type | Question pattern |
|---|---|
| Prefix | Kapan (kah) […] |
| | Tanggal berapa (kah) […] |
| | Tahun berapa (kah) […] |
| | Pada tanggal berapa (kah) […] |
| | Pada tahun berapa (kah) […] |
| | Setiap tanggal berapa (kah) […] |
| Suffix and Confix | […] kapan / […] kapan […] |
| | […] tanggal / […] tanggal […] |
| | […] tahun / […] tahun […] |
| | […] tanggal berapa / […] tanggal berapa […] |
| | […] tahun berapa / […] tahun berapa […] |
| | […] pada tanggal / […] pada tanggal […] |
| | […] pada tanggal berapa / […] pada tanggal berapa […] |
| | […] pada tahun / […] pada tahun […] |
| | […] pada tahun berapa / […] pada tahun berapa […] |
| | […] pada bulan / […] pada bulan […] |
| | […] pada / […] pada […] |
| | […] setiap tanggal / […] setiap tanggal […] |
| | […] tahun berapakah / […] tahun berapakah […] |

TABLE 5. EXAMPLES OF QUESTION TYPES THAT REFER TO “WHEN” IN INDONESIAN LANGUAGE

| No | Question |
|---|---|
| 1 | Kapan Tugu Proklamasi didirikan? |
| 2 | Kapankah VOC mulai bangkrut? |
| 3 | Tanggal berapakah BPUPKI melakukan sidang kedua? |
| 4 | Pada tahun berapakah Bangsa Portugis datang ke Indonesia? |
| 5 | Pertempuran Medan Area terjadi pada tanggal? |
| 6 | Serangan Umum Satu Maret terjadi pada? |
| 7 | Taman Siswa didirikan kapan? |
| 8 | Setiap tanggal berapakah diperingati hari lahir Pancasila? |
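The question patterns of Table 4 can be turned into a small classifier. The sketch below is only an illustration under stated assumptions: the phrase list is read off Table 4, the rule "question word at the start is prefix, at the end is suffix, otherwise confix" follows the three examples given earlier in the paper, and the function name `classify_question` is hypothetical.

```python
import re

# "When" question phrases read off Table 4 (with and without the -kah particle).
QUESTION_PHRASES = [
    "setiap tanggal berapakah", "setiap tanggal berapa", "setiap tanggal",
    "pada tanggal berapakah", "pada tanggal berapa", "pada tanggal",
    "pada tahun berapakah", "pada tahun berapa", "pada tahun", "pada bulan",
    "tanggal berapakah", "tanggal berapa", "tahun berapakah", "tahun berapa",
    "kapankah", "kapan", "tanggal", "tahun", "pada",
]
# Longest phrases first so "pada tanggal berapa" is preferred over "pada".
QT_RE = re.compile(
    r"\b(?:" + "|".join(sorted(QUESTION_PHRASES, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)


def classify_question(question):
    """Return 'prefix', 'suffix', 'confix', or None if no question word is found."""
    body = question.strip().rstrip("?").strip()
    matches = list(QT_RE.finditer(body))
    if not matches:
        return None
    if matches[0].start() == 0:           # [qt] + ... + ?
        return "prefix"
    if matches[-1].end() == len(body):    # ... + [qt] + ?
        return "suffix"
    return "confix"                       # ... + [qt] + ... + ?


for q in ["Kapankah Indonesia merdeka?",
          "Indonesia merdeka pada tanggal berapa?",
          "Sejak kapan Indonesia merdeka?"]:
    print(q, "->", classify_question(q))
```

The three questions are the paper's own prefix, suffix and confix examples, and the sketch assigns them to those same classes.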

The general forms of those questions are:

[qt] + … + ?

… + [qt] + ?

… + [qt] + … + ?

where qt is a “when” question word.

C. Finding Answer

There are two main steps in finding the answer. The first is a simple matching of the question tokens against a list of index words to find candidate snippets. Each snippet containing at least one token of the question becomes a candidate.
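Candidate selection by token matching can be sketched with an inverted index, as below. The paper does not describe its index structure, so the dictionary-of-sets index, the `\w+` tokenizer, and the sample question are assumptions for illustration; the two snippets are the ones shown earlier in this section.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(snippets):
    """Map each token to the set of snippet ids that contain it."""
    index = defaultdict(set)
    for snippet_id, snippet in enumerate(snippets):
        for token in tokenize(snippet):
            index[token].add(snippet_id)
    return index


def candidate_snippets(question, index):
    """Return ids of snippets sharing at least one token with the question."""
    candidates = set()
    for token in tokenize(question):
        candidates |= index.get(token, set())
    return sorted(candidates)


snippets = [
    "RM Suwardi Suryaningrat lahir pada tanggal 2 Mei 1889 di Yogyakarta",
    "Mereka juga minta TKR mengosongkan kota Bandung bagian utara, "
    "paling lambat tanggal 29 Oktober 1945",
]
index = build_index(snippets)
# Hypothetical question, written for this example.
print(candidate_snippets("Kapan RM Suwardi Suryaningrat lahir?", index))
```

Note that this sketch does not filter question words or stop words, so a question containing a common word such as "tanggal" would select both snippets as candidates; the similarity step then has to separate them.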

Similarity is then measured between each candidate snippet and the question using a simple approach. Formula 1 is derived from cosine similarity, a well-known method that is often used to calculate the similarity of documents [14].

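$$\mathrm{sim}(q, s_i) = \frac{\sum_{j} w_{q,j}\, w_{i,j}}{\sqrt{\sum_{j} w_{q,j}^{2}}\,\sqrt{\sum_{j} w_{i,j}^{2}}} \tag{1}$$

where $q$ is the question, $s_i$ is snippet $i$, $w_{q,j}$ is the weight of term $j$ in the question, and $w_{i,j}$ is the weight of term $j$ in snippet $i$.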
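A minimal Python sketch of a cosine similarity between a question and a snippet is given below. The paper does not state how the term weights are computed, so raw term frequency is used here as an assumption; the tokenizer and the function names are also illustrative.

```python
import math
import re
from collections import Counter


def term_weights(text):
    """ASSUMPTION: the weight of a term is its raw frequency in the text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(question, snippet):
    """Cosine of the angle between the question and snippet term vectors."""
    w_q, w_s = term_weights(question), term_weights(snippet)
    # Numerator: sum over terms of w_{q,j} * w_{i,j} (missing terms count as 0).
    dot = sum(weight * w_s[term] for term, weight in w_q.items())
    norm_q = math.sqrt(sum(v * v for v in w_q.values()))
    norm_s = math.sqrt(sum(v * v for v in w_s.values()))
    if norm_q == 0 or norm_s == 0:
        return 0.0  # an empty question or snippet matches nothing
    return dot / (norm_q * norm_s)


question = "Kapan RM Suwardi Suryaningrat lahir?"  # hypothetical question
snippet = "RM Suwardi Suryaningrat lahir pada tanggal 2 Mei 1889 di Yogyakarta"
print(cosine_similarity(question, snippet))
```

A snippet that shares no term with the question scores 0, and the score grows as more of the question's terms appear in the snippet.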

The second step sorts the candidate snippets; the snippet with the highest similarity value among all candidates is taken as the answer to the question. The highest-scoring snippet changes according to the given question. A simple, general SQL statement is used to display the ranked candidate snippets and to display the exact answer (the date) from the snippet with the highest similarity value.
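The ranking step can be sketched with Python's built-in `sqlite3` module. The paper does not give its schema or queries, so the table name `candidate`, its columns, and the similarity values in the usage example are all hypothetical placeholders.

```python
import sqlite3


def rank_candidates(rows):
    """Rank (date, snippet, similarity) rows and return (ranked rows, best date)."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per candidate snippet with its score.
    conn.execute(
        "CREATE TABLE candidate (date TEXT, snippet TEXT, similarity REAL)"
    )
    conn.executemany("INSERT INTO candidate VALUES (?, ?, ?)", rows)
    # Display the candidate snippets in descending order of similarity.
    ranked = conn.execute(
        "SELECT date, snippet, similarity FROM candidate "
        "ORDER BY similarity DESC"
    ).fetchall()
    # The exact answer is the date of the top-ranked snippet.
    best = conn.execute(
        "SELECT date FROM candidate ORDER BY similarity DESC LIMIT 1"
    ).fetchone()
    conn.close()
    return ranked, (best[0] if best else None)


rows = [  # similarity values are placeholders, not measured results
    ("29 Oktober 1945", "Mereka juga minta TKR mengosongkan kota Bandung "
     "bagian utara, paling lambat tanggal 29 Oktober 1945", 0.25),
    ("2 Mei 1889", "RM Suwardi Suryaningrat lahir pada tanggal 2 Mei 1889 "
     "di Yogyakarta", 0.75),
]
ranked, answer = rank_candidates(rows)
print(answer)
```

Storing the date in its own column is what allows the system to return the date alone instead of the whole snippet.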

V. EXPERIMENT

As the dataset we use electronic history books from [15]. From these eBooks we obtain more than 200 snippets. We asked a human expert to create factoid questions together with their answers based on the eBooks, yielding 123 questions that cover all the question-type variations described in the previous section.

In the experiment, we use a correctness metric to measure the accuracy of the answers by comparing the system answer with the human expert's answer, as described in Formula 2.

$$\mathrm{accuracy} = \frac{\sum s_r}{\sum s} \times 100\% \tag{2}$$

where accuracy is expressed as a percentage, $s_r$ is the number of relevant answers, and $s$ is the number of all questions.

Over all tests, we obtained 105 relevant answers and 18 non-relevant answers, giving an accuracy of about 85.37%. This is a promising figure and supports our proposal that discovering structured data within unstructured data can be quite useful for question answering. Table 6 shows examples of the experimental results, both correct and incorrect answers.
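The reported figure can be checked directly from the counts given above. The short Python sketch below applies Formula 2 to the 105 relevant and 18 non-relevant answers; the function name `accuracy` is only illustrative.

```python
def accuracy(relevant, total):
    """Accuracy in percent: relevant answers divided by all questions."""
    return 100.0 * relevant / total


relevant, not_relevant = 105, 18
total = relevant + not_relevant           # 123 questions
print(round(accuracy(relevant, total), 2))  # 85.37
```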


Fig 4. Framework of our approach

TABLE 6. THE EXAMPLE OF EXPERIMENTAL RESULT

| Question | Expert | Proposed Approach | T/F |
|---|---|---|---|
| Kapan dilakukan penyerangan Kedua untuk mengusir Portugis di Kerajaan Demak? | 1513 | 1629 | F |
| Kapan terjadi perlawanan kaum Padri yang dipimpin oleh Tuanku Imam Bonjol? | 1821 | 1821 | T |
| Tanggal berapa Sekutu membebaskan tentaranya yang ditawan di kamp-kamp Belanda? | 10 Oktober 1945 | 10 Oktober 1945 | T |
| Tahun berapa Belanda melakukan penarikan iuran kepada masyarakat Indonesia? | 1913 | 1913 | T |
| Pada tanggal berapa TKR mengepung Ambarawa? | 12 Desember 1945 | 12 Desember 1945 | T |
| Pada tahun berapakah Tuanku Imam Bonjol Wafat? | 1864 | 1864 | T |
| Sekutu mendarat di Belawan, Medan tanggal? | 9 Oktober 1946 | 9 Oktober 1946 | T |
| Serikat Islam melakukan berbagai gerakan pemogokan tahun? | 1923 | 1923 | T |
| Pusat Tenaga Rakyat (Putera) didirikan tanggal berapa? | 1 Maret 1943 | 1 Maret 1943 | T |
| Perang Dunia II terjadi pada tahun? | 1942 | 1653 | F |

VI. CONCLUSION AND FUTURE WORKS

We have proposed a way to obtain structured data from unstructured data and to use both kinds of data in Indonesian-language Question Answering in the history domain. The accuracy reaches 85.37%, which is a promising result. This research is still limited in the style of questions it handles: as initial work, we restrict the questions to those that refer to a date. In future work, we expect to extend the questions beyond those that refer to a date (“when” questions) to all kinds of question, such as “what”, “where”, “who”, “why” and “how”.

ACKNOWLEDGMENT

We would like to thank the Ministry of Education of Indonesia for providing [15] and allowing the authors to use its data (electronic books), and [16] for allowing the authors to use the open-source Xpdf. We also thank Puji Darwati from the History Department, our historian, for providing questions in the Indonesian history domain.

REFERENCES

[1] M. W. Bilotti et al., “Structured Retrieval for Question Answering”, SIGIR'07 Proceedings, pages 351-358, Amsterdam, 2007.

[2] H. Toba, “Analisis Semantik dengan Representasi ‘First Order Logic’ dalam Sistem Tanya Jawab”, University of Indonesia, Bandung, 2010.

[3] M. R. Kangavari, S. Ghandchi, and M. Golpour, “Information Retrieval: Improving Question Answering Systems by Query Reformulation and Answer Validation”, World Academy of Science, Engineering and Technology, pages 303-310, Iran, 2008.

[4] R. Blumberg and S. Atre, “The Problem with Unstructured Data”, Information Management Magazine, 2003. http://www.information-management.com/issues/20030201/6287-1.html, accessed 30 April 2011.

[5] A. Harbison and P. Ryan, “The Problem of Analysing Unstructured Data”, Grant Thornton International Ltd, Ireland, 2009.

[6] D. W. Wardani, “Finding structured and unstructured features to improve the search result of complex question”, National Cheng-Kung University, 2009.

[7] A. Purwarianti, M. Tsuchiya, and S. Nakagawa, “A Machine Learning Approach for Indonesian Question Answering System”, University of Technology, Japan.

[8] M. D. Anggaeny, “Implementasi Question Answering System dengan Metode Rule Based pada Terjemahan Al Qur’an Surat Al Baqarah”, Institut Pertanian Bogor, 2007.

[9] R. Mahendra, S. D. Larasati, and R. Manurung, “Extending an Indonesian Semantic Analysis-based Question Answering System with Linguistic and World Knowledge Axioms”, 2008.

[10] H. Toba and M. Adriani, “Pattern Based Approach in Indonesian Question Answering System”, University of Indonesia, Bandung.

[11] B. Sacaleanu et al., “Entailment-based Question Answering for Structured Data”, Companion volume: Posters and Demonstrations, pages 173-176, Manchester, 2008.

[12] D. Grossman, “Information Retrieval”, 2007. http://ir.iit.edu/~dagr/cs529/files/ir_book/, accessed 26 April 2011.

[13] A. Moeliono, “Tata Bahasa Baku Bahasa Indonesia”, Jakarta: Balai Pustaka, 1988.

[14] D. L. Lee, H. Chuang, and K. Seamons, “Document Ranking and the Vector-Space Model”, IEEE Software, 1997.

[15] http://www.bse.kemdiknas.go.id

[16] http://ww.foolabs.com
