Gauri Rao et al./(IJCSE) International Journal on Computer Science andEngineering Vol. 02,No. 02,2010,219-223
...
-NATURAL LANGUAGE QUERY
PROCESSING USING SEMANTIC
GRAMMAR
'Gauri
Rao,
2Chanchal
Agarwal,
3Snehal Chaudhry,
4Nikita Kulkarni, ,
5Dr. S.H. Patil
ILecturer department 0f Computer Engineering BVUCOE, Pune 2Lecturer department 0 fComputer Engineering BVUCOE, Pune 3Lecturer department 0 fInformation Technology BVUCOE Pune
4Lecturer department 0 fComputer Engineering BVUCOE Pune
5Professor and Head department 0 fComputer Engineering Bharati Vidyapeeth University College of Engineering, Pune, India
AbstrKt: TIle fidel of •••• raI ~ prottUillf:
(NU) lias .at a draaatic sIIift _ bodt I"a'Cafd
dirutioII a.t ~ • tile past KYeraiyears. III
tile past, __ ~ _,COla•••••••••• ~ tmdaI
to pIU'dyS)WboIic ..edtods.. Ikcntly. _ore
a.t..ore on is ~ toward Itybrid _dIIods tItat
m.biIIe _ e.piriaI eorw-... ~
~ tile me of probabilistic a.t ••~ ••••
tIIeordic tuIuIiqIIes. wid! ~ S)WboIic ..edtods..
The main purpose of atural Language Query
Processing isfor an English sentence to be interpreted
by the computer and appropriate action taken. Asking
questions to databases in natural language is a very
convenient and easy method of data access, especially
for casual users who do not understand complicated database query languages such as SQL.
This paper proposes the architecture for translating English Query into SQLusing Semantic Grammar
1.0 INTROD
cno
aturaI language processing is becoming one of the most active areas in HIIIII3D-COIIIpU Jnteractioo.. It is a branch of AI which includeslnfunnatioo Retrieval, Machine Translation and Language Analysis. The goal of NLP is to enable communication between people and computers without resorting to JDeIIIOrization of complex commands and proa:dures. In other words. NLP is a technique which can make the computer undeI'sIand the languages naturally used by humans. While natura1 language may be the easiest symbol system for people to learn and use, it has proved to be the banIest for a computer to master.
Despite the cbaIleoges, natural language processing is widely reganled as a promising and critically important endeavor in the field of computer research.
The general goal for most computational linguists is to instiU the computer with the abili1;yto undersaand and generate natura1 language so that eveotually people can address their computers througb text as though they were addressing another person. The applications that wiD be possIDIe when NLP capabilities are fully realized are impressive computers would be able to process natural language, translating languages accurately and in real time, or extracting and summarizing information from a variety of data sources, depending on the users' requests.
2.0 RELATED WORK
The very first attempts at NLP database interfaces are just as old as any other NLP research. In fact database NLP may be one of the most important successes inNLP since it began. Asking questions to databases in natural language is a very convenient and easy method of data access. especially fur casuaJ users who do not undeI'SIand complicated database qlJel}' languages such as SQL. The success in this area is partly because of the real-world benefits that can come fiom database P systems, and partly because NLP works very well in a singIe-datahase domain. Databases usually provide smaU enough domains that ambiguity problems in natura1 language can be resolved successfully. Here are some examples of database NLP systems:
LUNAR (Woods. 1973) involved a system that answered questions about rock samples brought back from the DlOOIL Two databases were used, the chemical analyses and the literature references. The
Gauri Rao et al./(UCSE)International JournalonComputerScienceandEngineering Vol.02,No.02,2010,219-223 program used an Augmented Transition Network
(ATN)parser and Woods'Procedural Semantics.The system was informally demonstrated at the Second
AnnualLunar Science Conference in 1971. [1]
LIFERILADDER wasone ofthe first good database
NLP systems. It was designed asa natural language interfacetoa database of information aboutUS Navy ships. This system, as described in a paper by Hendrix (1978), used a semantic grammar to parse questions and query a distributed database. The LIFERILADDER system could only support simple one-table queries or multiple table queries with easy join conditions. [4]
3.0SYSTEM DESCRIPTIO
A brief description of the system is as follows: Suppose we consider"a database say ORACLE
-Within this oracle database 1 have placed certain
tables, which are properly normalized. 0 if the
userwishes toaccess the data from the table be/sbe
hasto be technically proficient inthe SQL language
to make a query for the ORACLE database. Our
system eliminates this partand enables the end user to accessthe tables in his/her language.
Letustake anexarnple..
Suppose if we want to view information of a
particular employee from EMP table then we are
supposedtousethe foUowingquety:
SELECf
*
FRO EMP WHEREe_name=
'
ABC';But a person, who doesn't IrnowORACLE, wiD not
be able to access the database unless belsbe Irnows the syntax and semantics of firing a query to the
database. But using P,this task:of accessing the
database will be much simpler. So the abo equery willberewritten usingNLP as:
Give the infunnation of employee whose name is
ABC.
80th the SQL statement andNLPstatement toaccess
the EMP table would result in the same output the
only diffi:reoce being, a normal person who doesn't
Irnow anything about SQL can easily access the
ORACLE database.
4.0 SCOPE OF THE SYSTEM
The scopeoftheproposed system isas follows:
»
To work with anyRDBIMS one sbouId Irnowthe syntax of the commands of that particular
database software (Microsoft SQL. Oracle,
etc.),
»
Here the aturaJlanguage processing is doneonEnglish i.e, the input statements have to be
inEoglish.
Input from the user is taken in the form of
questions (wb- form like what, who, where,
etc).
ISSN : 0975-3397
>
A limited Data Dictionary is used where all possible words related to a particular systemwill be included. The Data Dictionary of the
system must be regularly updated with words
that arespecifictotheparticularsystem
>
Ambiguity amongthe words will be takencare of whileprocessingthenatural language.>
All the names in the input natural languagestatement have tobein double quotes.
>
Data Dictionary used will be: -EMP,DEPTand PROJECT
Thetablesusedareshownbelow>
Employee departmel project
id id id name name name salary location empid edptid
cap
a
c
i
t
y
Fig ITables used in the System 4.1 System Requirements 4.1.1 Hardware Requirements PI (Pentium 1)01"bigherprocessor RAM- minimum 64MB RAM HDD - minimum 500MB of free H3nJ..disk spaceVGADisplay Adapter- withbigherresolution
4.1.2 Software Reqllireme.ts
Operating System- WlNDOWS 9S01"abo e Datahase-SQL SERVER2000orACCESS Java,MS isualC++compiler5.016.0
5.0
SYSTEM ARCBJTECnJRE Generally P has following steps:->
Morphological Analysis: Individual words areanalyzed into their components and non word
tokens such aspunctuation are separated from
thewords..
~ Syntactic Analysis: Linear sequences of words
are transformed into structures that sbow bow
the wordsrelatetoeach other.
~ Semantic Analysis: The structures created by
the syntactic analyzer-areassigned meanings.
>
Discourse integration: The meaning of an individual sentence may depend on the seoteoces that precede it andmayinfluence the meanings ofthesentences that fullow it..Pragmatic Analysis:The structure representing
what was said is reinterpreted to determine what was acIualIymeant
Gauri Rao et al./(IJCSE) International Journal onComputer Science and Engineering
Vol. 02,No. 02,2010,219-223
Fig.2 Stages inNatural Language lnterpretation Process
Thesystem includes the following modules:
G : Designing the front end 01" the user
interf3ce where the user .J)enter the query in atura.ILanguage.
Parsing: Derives the Semantics of the atura.I Query given by the user-and parses it in its k:dmical form,
QueryGeneration: After the successfuJ parsing
of the statement given by the user, the system
generates a query against the user statement in
SQL and further gives it to the back.end
database.
Data Collection: This module collects the
output of the SQL statement and places it in the set"Interface Screen as a result form,
6.0 SE
tc
GRAMMARThereare two main partsof a semantic grammar. The
first is alexicon that stores all the possible words that
the grammar is aware of. A simple entry in the
lexicon might look likethis:
-(customer ->customer patron member) (customers ->customers patrons members)
On the left-hand side of the '->' the word 'customer'
defines a symbol that can be used in the grammar.
When 'customer' isused in the grammar, it refers to
the English words on the right-hand side of the '->', that is 'customer', 'patron' or 'member'. Similarly,
the plurals of these words are shown in the next line.
OnlysingleEnglish words.or terminal symbols,can appear in the lexicon.
The other part of the semantic grammar in 01 es
I"IIIes to combine the tenninaJ symbols in the lexicon to form phrases01"sentences in a specific way.
FOI"example,the rule
(AIT_TAPE_CUSTOMER -> RENTED by
customer IT_ER)
Demonstrates bo the TAPE. STOMER attribute
can be referred to in English.On theleft-hand side is the non-terminal representation of the phrase.
This can be used in otber rules to refer to this phrase.
On the right hand side is a combination of non
-tenninal symbols&om other rules (in PPER CASE)
and from the lexicon (in lower case). RENfED represents another rule foe synonyms of the verb
•rented' ,such as taken out,bonowed etc. ote that TED is not in the lexicon because taken out' is a two-word phrase,and so must be defined as a rule. The words by and customer appear in thelexicon,
and the AIT_ ER symbol refers to a number
attnDute. 0 the abo e rulecan be used to parse the
phrase"borrowed by member number 12",01""taken
out by customer 14"etc.
7.00VERVIE OF PROCESS
Firstly, the English input (in the fcrm of a list) is
parsed by the semantic grammar. then a post
-processor matches table and attribute names and joins up tables if the query in 01 es more than one table.
After that the post-processor can construct the
resulting SQL queryand output it.
english query
semantic
representation
SQLqueoy
Fig.3Database Semantics
The question "What is employee name with salary greater than 2000" is processed by the system as given below.
The English text is converted into a list, so that the
system can process it.The sentence becomes:
(WHAT IS EMPLOYEE AME WIlli SALARY
GREATER mAN 2000)
Now this list isparsed bythe semantic grammar. The
Gauri Rao et al. /(IJCSE) International Journal onComputer Science and Engineering
Vol. 02,No. 02,2010,219-223
sentence, using the semantic grammar. This sentence
only has one possible parse, so only one semantic
representation isreturned:
«SELECT NAME! FROM (EMP (WHERE
(SALARY >= 2000»»)
The semantics already look similar to an SQL query.
The brackets in the semantics indicate how the
phrases from the sentence relate to each other. For
ex~ple, the (EMP (WHERE (SALARY>= 2000»)
section shows that the 'born in or after 1920'
condition relates to the director. The 'NAME!'
symbol, after SELECT, isa special representation for
both the first and lastnames.
This semantic representation of the sentence is now
taken by a post-processor to transform it into an SQL
query. The'NAME!'symbol is expanded to FNAME
and L AMEand these are identified as attributes in
the EMP table.The DEPT attribute isalso matched to
theEMP table because the WHERE clause directly
refers to the EMP table. After- processing the
semantics, there is only one table needed for this
query, so only the EMP table is included in the
FROM clause. and there is no need to add join
conditions..
The aim of a natural language interfuce is to f3cilitate
the user to computer in a natural way. For this
purpose. we have designed a domain specific
dictionaryto keep the sytlOII}'Imof the columns and
tables ames. The addition of synonyms makes it
possible fur the user to write a sentence in different
natural ways. We call this dictionary as semantic
dictionary, because there is nosyntactic information
fortokeos..
To build an appropriate interface for inputting
database semantics information we need to develop a
framework for this information.This will allow us to
identifYthe different areas needed and easily obIain
the infonnation from the 1JSeI'The first thing we need.
to know about the database iswhat the tables and
attributes represent in the real world The second
thing is ho they relate to each other. Finally we
need to know the relationships from attrIbutes to the
tablesinwhich theyare contained Attnbutes that are
foreign keys should not be considered in this phase,
because they link tables to other tables, and the
appropriate phrases should have been covered by the
table-table relationships.
Translating from one language to another-always
requires some kind of paring. Parsing is the process
of identifying sIructwe in data..It determines whether
the input data bassome pre-detennined sIructwe and
respond accordingly.
Parsing requires a set of grammars to be defined
These grammars are set of rules, which define how
the language is sIructwed Rules are specific pattern
ISSN :0975-3397
of data, which appears in the input. These rules can
further reference to other rules, which are called sub
rules. Rules can also be recursive if they somehow
refer back to themselves. The rules consist of
products, which are the different ways in which the
rule can be satisfied. The actual data is referred to as
terminals. Parser may also contain actions. To
translate natural language queries, system needs to
have a parser, which contains all these grammar
rules. The parser also has to know the table names
column names, any relationships between tables and
related words for the table and column names. Only
the most common grarnmar isincluded in the parser.
8.0 ALGORITHM
»
Tokeoization(scanning)• Split the Queryin tokens
• Give order-number to each tokenidentified
Split Queryand extractpatterns
• Look for sentence connectorsIcriteria
words
• Break Query on the basis of
connectorIaiteria tokens.
• Use criteria tokens to speciJY
condition in query.
• Find attributes and values after criteria
token
ap value for identified aItn"bute and
corresponding table
Replace synonyms with proper attribute
names
»
Get intermediate funn ofQueryTransfunn itinto SQL
9.0 RESULTS
The graphical User Interlace isasshown below. The
IJSeI' bas to first login and then connect to the
database A database setting is required to access the
database after getting the information about host
name,database name,IJSeI'name and the password
Once the database is setup, English Query can
translate very complex English queries to SQL with
the capability of searching multiple tables and
multiple fidds.
Sfqls fullowed to get the res
uIt-»
Type the natural language queryin thedialogue box given.
»
Click on"Proceed" button.~ The systemwill ask the IJSeIfor the'
expected meaning..In caseof ambiguities
the IJSeI'bas to selectthedesired query.
Click on"Generate SQL"button.
~ The systemwill generateSQLquery.
Click on"Run Query".
Gauri Rao et al./(UCSE) International Journal onComputer Science andEngineering
Vol. 02,No. 02,2010,219-223
}>- The result of thequay will bedisplayed in
the Output box.
Fig.4.Login screen
Following example demonstrate SQL statement
translatedby EnglishQuer
y:-Output :- 12000
lo.OF1JTUllE ENHANCEMENT
ore new grammar can be added to the parser-to
increaseeffeeti eDeSS.
Adding a tbesaurus is another suggestion, which
could belp automating therelatedwords fortableand column names.
With the belp of a tbesaurus, the user- input can be
pre-processed tosubsIitute related words with table
01"column namesand alsoremove unwantedwords.
So fur. this system considers selection and a few simpleaggregations. Thenext step of theresearch is, to accommodatemore complexqueries.
11.0CO CL SIO
aturaI Language Processing can bring powerful eobancements to virtualJy any computer"program interface.. This system is cum:otIy capable of
handling simple queries with standard join
conditions.BecausenotallformsofSQL queriesare supported, further-development would be required before the system can be used within NIDBl
AJternatives fOl" integrating a database P
component into the OBI were considered and
assessed.
12.0 References
Huangj.Guiaog Zangi, I'billip Coy Shcu ~A NatmaI
Language daIabase lntaface based onpubabilislic context freegt3IIIID3I". IEEE lnIanaIionaI ~ on SeIrIamic Compding andsystans 2008
Akama, S.(Ed) Logic.1onguage and compukIlion. Kulm:w
-Academic:publisbcrs. pp.7-11.1997.
ELF Software CO. DlIIroI-Longuage DoIDbase inll!ifaca
froM ELF SofIlrIlR Co.cik:d oycmber 1999. available
Iiom Inkmc:t:
Hendrix. G.G, Saa:nIoti. E.D.•~wic:z. D.•Slocum. J.
"'Developing anabnI Janguag,c ioIJ:rlaa: to complex data=.
inAeM Transoctiotu on dalDbase sy.tII!Im. 3(2). pp. 105
-147.1978.
[5) .Jo5qIh. S.W, AIeIiwas,. R.-A knowkdge-bascd subsystJ:m
focaoatwaI ~ inferface toadatabase that predicts andexplains qUCl)' faiJIRS". in IEEE Ctt; pp.80-87. 1991.
[6] Mitrovic; A.AIarow~ Il!Ot:IriRg $)lSlIeJIf forSQL.
Univemty of CanteJbuJy. 1993..Moore. J.D. ~
~ forim1r~ appI~ maine ~
tuIOrSmore like humans"'. inp,oceedJirgs AJ-ED. pp.36-42, 1995.
[7] Sub"K..S,Ptrtins,. w.e,> '"The effi:cts of a system echo in a n:stricH:d oaIurlII language database inlerl3ce fiI novice users'".in IEEE
S
.
,-
sc:i#!nt%s, 4.pp.594-599> 1994.[8] Wbt:obua, W, Dilts, D.N..'"lnIegTaIing diverse CIM daIa bases: tbe role of nabmII language ink:rlitce~. in IEEE
TnmsacIions 011 .1)1SII!JIU,IIItDf. and q-bemetia, 22(6). pp.
1331-1347.1992.
[9] Dan Klein,Christopher D.Manning: Corpus-Based Induction
of Syntactic Structure: Models of Dependency and Constituency. ACL 2004: 478-485. ____ w ~._ [I]
hltll
I
L
na
J
I
;
a
~
J
~
On
e
1
~~
t
l
:
!f
.
l
t
o(
[2]r
~
[3]-
[4J .,.'
''-.
.
•
.
~....":.. ~~~:"!:.~:••...-Fig.5 Snapshot ofquery: What isthe salary of "Gauri"
English Query: -What isthe salary of"GauriRao"
Meaning of Query:- salary of employee with name gauri rao
SQL Query:- Select salary From emp where empnarne ="gaurirao"