Construction of a Prototype - An integrated architecture for semantic search

In order to build a prototype for the proposed framework, the author has chosen some ex- isting tools (wherever they are available) to fit in the process. The prototype construction

56_{http://www.bl.uk - British Library.}

57_{http://www.bl.uk/bibliographic/datafree.html - British Library Datasets.}

58_{http://creativecommons.org/publicdomain/zero/1.0/- Creative Common Universal Public Domain} License.

59_{http://wiki.dbpedia.org/about/about-dbpedia/facts-figures - Facts and Figures for DBPedia dataset} [accessed January 2016].

10. PROTOTYPE AND EVALUATION

required the development of six major modules of SIRF: (i) Data Parser, (ii) Ontology Processor, (iii) Query Optimiser, (iv) Query Formatter, (v) Result Optimiser and (vi) User Interface.

10.3.1 Implementation of the Data Parser

The Data Parser currently deals with RDF data only (as described in Chapter 6 Section

6.3). This module has been written using Java language. The author has chosen Java to build the framework’s Knowledge Base for two reasons: (i) Java is capable of reading large files (e.g. files greater than 2MB which is a normal lower limit of RDF dataset files) and (ii) there are many related Java APIs available e.g. RDF API, OWL API and WordNET etc. Also many leading NLP components are already implemented in Java [Finlayson,

2014].

The RDF Parser (in the Data Parser) has been written using Apache Jena RDF API60_.

[Carroll et al., 2004] introduced Jena as an integrated solution for the implementation of the Semantic Web recommendations i.e. RDF and OWL. The RDF Parser parses RDF files and saves instances and Named Entities in the ‘Bag of Keywords’ (as defined in Chapter 6 Section6.3). The author has used a MySQL61 database as a cache repository.

MySQL is an extremely well supported open source relational database which can be readily interfaced with languages such as Java. The complete implementation code for the Data Parser can be found in Appendix I (Listing 10). The Data Parser utilises external libraries to map terms to their plural form and synonyms. To map nouns to their plural forms, the author has used Java Inflector library62 _{from JBoss DNA}63_{. Likewise, for}

mapping synonyms, the author has utilised Java API for WordNet Searching (JAWS)64_.

10.3.2 Implementation of the Ontology Processor

The Ontology Processor utilises Apache Jena Ontology API65to parse ontologies for differ-

ent domains. This module has also been written in Java. The code for the implementation of this module can be found in Appendix I (Listing 8). The parsed ontology concepts

60_{http://jena.apache.org/documentation/rdf/index.html - A free and open source Java framework for} building Semantic Web and Linked Data applications.

61_{https://www.mysql.com/ - MySQL is an open-source relational database management system.} 62_{http://modeshape.jboss.org/downloads/downloads-jboss-dna - JBOSS DNA.}

63_{JBoss DNA is available under GNU Lesser General Public License (details can be found at} http://www.gnu.org/licenses/old-licenses/lgpl-2.1.html)

64_{http://lyle.smu.edu/ tspell/jaws/index.html - Java API for WordNet Searching.}

65_{http://jena.apache.org/documentation/ontology/- Apache Jena Ontology API is a tool that parses} OWL ontologies.

10. PROTOTYPE AND EVALUATION

(i.e. classes and properties) are saved in the Ontology Cache (this has been implemented using a MySQL database).

10.3.3 Implementation of the Query Optimiser

The Query Optimiser module works at runtime to process a user query and so has been written using a server side scripting language. The author has chosen PHP language for this purpose since server side scripting using PHP syntax is efficient in response to requests received by a server [Proctor et al., 2014]. The Syntax Analyser (sub module of the Query Optimiser) uses the PHP library for Stanford Parser (PHP-Stanford-NLP66₎

to process natural language queries. The parsed natural language tokens are processed to generate multiple linked pairs (using Algorithm 3 (Syntactic Processing) in Chapter 7). The implementation of Algorithm 3 has been provided in Appendix II (Listings 16 and

25).

10.3.4 Implementation of the Query Formatter and API

The Query Formatter has also been developed using PHP. This module receives tagged syntactic pairs from the Query Optimiser, finds CPI relations, and generates a matching SPARQL query. The implementation of the Query Formatter algorithm can be found in Appendix II (Listing 14, 19 and 20). The Application Program Interface (API) in SIRF uses an open source SPARQL library67 _{by Christopher Gutteridge. The API is}

responsible for connecting to a SPARQL endpoint, running queries and retrieving results (if available).

10.3.5 Implementation of the Result Optimiser

The Result Optimiser module also works at the server level, and so it has been implemented using PHP. This module makes use of result schemas. SIRF saves schemas based on the most used attributes for a class e.g. schema for a class Person will be its most used attributes (at least one) e.g. name, family name or given name. Currently, the schema attributes have been generated manually.

66_{https://github.com/agentile/PHP-Stanford-NLP - PHP library to utilise Stanford-NLP parser.} 67_{http://graphite.ecs.soton.ac.uk/sparqllib/ - SPARQL library.}

10. PROTOTYPE AND EVALUATION

10.3.6 Implementation of the User Interface

As described in Chapter 5, the User Interface of SIRF acts in two ways: (i) as an entry point for direct user queries or (ii) as an API to receive user queries from other search systems. The functionality of the user interface as an API could be tested by using any HTTP client. The author has chosen the REST68 _{client since it provides a common}

structure that makes the code reusable [Upadhyaya, 2014]. The testing of the REST client with the proposed system will be explained in Section 10.8.

In document An integrated architecture for semantic search (Page 138-141)