Backing off, NLP and Indexing Methods for String Matching in Query Execution

(1)

International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459,ISO 9001:2008 Certified Journal, Volume 3, Issue 1, January 2013)

305

Backing off, NLP and Indexing Methods for String Matching in

Query Execution

Vandana Jindal

1

, Anil Kumar Verma

2

, Seema Bawa

3 1,2,3

Thapar Univ., Patiala

Abstract- Steps involved in converting a query into its equivalent internal form include generation of logically equivalent expressions, using equivalence rules, secondly providing enough information for extracting alternate solutions and finally optimizing the query. Queries written in high level languages are transformed into lower level forms by making use of relational algebra.

Keywords - Query Processing, Query optimization, Indexing methods, NLP technique, backing off technique.

I. INTRODUCTION

Query is an ad-hoc reporting tool that allows you to retrieve data that is stored in the application. It gives users the ability to produce a report of information from a database. The word „Query‟ may be defined as retrieval of meaningful information. The language used to provide various manipulations like retrieval of information that is already available within the database, insertion of new information into the database, deletion of information from the database and modification of data stored in the database is referred to as „DML’ (Data Manipulation Language). A subset of the DML that is used for writing a query is called a „Query Language‟. „Query Processing‟ is the method by which we can obtain the best plan, used in implementing the database request. In search, true success comes from understanding what the user is asking from their query.

II. QUERY TYPES

i. Select queries: These are used to retrieve data from one or more tables and display the result.

ii. Parameter queries: For creating on-the-fly queries which prompt the user for criteria at the time the query is run.

iii. Crosstab queries: These are used to summarize data from one field and group it in tabular form according to two criteria.

iv. Action queries: Queries that make changes to the records in a table. There are four types of action queries:

Delete queries – remove records from the tables.

Update queries – make global changes to a group of records in a table.

Append queries – add records from one or more tables to the end of one or more tables.

Make table queries – create a new table from all or part of

the data in an existing table.

v. SQL queries: A query created using SQL, which is highly advanced query language.

vi. Contextual queries:

Search context – Contextual query language is a formal language used for representing queries to information retrieval systems. E.g. web indexes, bibliographic catalogs etc. The contextual query language has a single or multiple search clause(s) joined by Boolean operators. They are also associated with keywords, which may either be prefixed or may follow the clause.

Taxonomy – A taxonomy γ is a tree of categories where each node represents a predefined category. Each category is defined by the labels along the path from the root to the corresponding node.

Level-N category, ancestor category and sibling category:

for a category c in taxonomy γ, c is called a level-n category if the node at c is located at nth level of γ.

III. STRATEGY FOR QUERY PROCESSING

Query submitted by the user is not in a standard form that the system may understand. It has to be converted into computer understandable form. The query processor [1] transforms the query into a standard internal form like relational calculus, relational algebra, object graph, operator graph or tableau.

(2)

International Journal of Emerging Technology and Advanced Engineering

306

Fig.1 Query processing [1]

When a query is translated simply into its equivalent relational algebraic expression, it is form of non-procedural query language, but when the same is represented with a sequence of operations, it is said to be in a procedural form. This relational algebraic expression is represented as a Query Treeor a Query Graph.

The query which is to be processed is passed through three stages: Parsing and translation, Optimization and finally Evaluation.

a)Parsing and translating a query

The first step is to convert the query submitted into a form understood by the query processing engine. The basic job of the parser is to extract the tokens (e.g. keywords, operators, operands, literal strings etc.) into their corresponding internal data elements (i.e., relational algebra operations and operands) and structures (i.e. query tree, query graph). Parser also verifies the validity and syntax of the query string.

b)Optimizing the query

Query processor applies rules to the internal data structures of the query to transform these structures into their equivalent but more efficient representations. Rules may be based on various mathematical models and heuristics. Selection of application of rules – WHEN and HOW? is the basic function of the query optimization engine.

Fig.2 Optimization Sequence Process [11]

c)Evaluating the Query

The last step in query processing is the evaluation phase. The best evaluation plan that a user generates by optimization engine is selected and then executed (There may exist various methods for executing the same query).

Fig.3 Internals of the Query processing “black-box” [12]

IV. METHODS FOR QUERY PROCESSING

(3)

International Journal of Emerging Technology and Advanced Engineering

307

A. Indexing Methods

Indexing is a technique for improving the database performance. The significant property used is - that they eliminate the need to examine every entry while running a query. In vast databases this leads to reduction in time/cost. Indexes affect the performance and not the results. Given a particular query, the DBMS’s query optimizer devises the most efficient strategy for finding matching data. The optimizer decides which index/indices to use.

Limitations:

 Speeds up the data access, but uses a lot of space in the database.

 Must be updated each time the data is altered. Various indexing methods proposed are as follows:

1. DataGuide [2]: It is summary based and total summary index. It means that it is possible to answer queries directly by the index without consulting the original XML data.

2.Forward and backward index [3]: It is a summary indexing method and covers like the DataGuide.

Limitation:

 It was proposed as a memory index but its size was too large.

 Query processing not fully optimized.

3. APEX [4]: It is a summary based index having 2 structures namely a graph and a hash table. Graph structure is a structural summary of XML data while the hash table keeps the information for frequently used path and the nodes that are in that paths.

4. BUS [5] Bottom Up Scheme Index is an indexing method that includes ranking. All nodes that match a query will be ordered with the nodes at top where possible query keywords occur most.

5. Ctree [6]: provides an indexing structure based on 2 levels i.e., path summary and detailed-element level relationships.

Path summary is a tree that is distracted from the original data. In this new tree all equivalent paths are taken together like the DataGuide method does.

Limitation:

 Works well for a single-path queries, unable to answer branching queries (as it does not contain hierarchical relationships between individual elements).

To overcome this, Ctree also contains the detailed parent-child links. Because of the 2 level index structure the Ctree is a summary index as well as a structural join index.

6. XISS [7]: It is a structural join indexing method. It stands for XML indexing and storage system. It is based on numbering scheme for elements. Because of this scheme, it is able to determine fast ancestor-descendant relationships.

7. ViST [8] Virtual Suffix Tree is a dynamic indexing method, which is classified as a sequence based indexing technique. To answer queries ViST first converts XML data as well as query to structure-encoded sequences. Then sequence matching algorithms search for XML documents containing the query. The index structure is based on B+-trees.

8. Prufer Sequences [9]: It is sequence based and uses a tree encoding method different from ViST. The encoding is based on deleting one element at a time.

B. Natural Language Processing (NLP) Technique

It is a field of computer science, artificial intelligence and linguistics concerned with the interactions between computers and human (natural) languages. It facilitates access to a database by facilitating language translation with a user friendly interface. The criteria followed are exact string matches for the query in a document find to be under discriminatory in many cases. Using NLP to increase precision of results by utilizing observations about discourse structure, noun phrases and proximity of matches of query terms. NLP steps performed were executed within EAGLE framework system [10]. The words were disambiguated based on their part of speech (noun/ verb). Semantically, irrelevant words were eliminated that added clutter (prepositions/ conjunctions/pronouns). Though the above mentioned words are vital for proper analysis of complete sentence, these can be ignored for models using single-word or noun phrase size tokens.

(4)

International Journal of Emerging Technology and Advanced Engineering

308

Various models used are:

1. Vector Space model: A mathematical model that finds out how similar retrieved documents are to the user’s query by constructing on N-dimensional token space(N- no. of tokens in the query).

2. Proximity model: an alternative ranking system to VS model, based on the assumptions:

 Given any 2 words from the query, the closer they are to the document, the more relevant the document is.

 More frequently the pair appears in the document, the more relevant the document.

 Word pairs that appear in the same document in the same order as in the query (forward pairs), should rank higher than the ones that appear in the reversed order (backward pairs).

Various applications of NLP are: language to language translation, converting English like commands into machine understandable form (i.e. compliers and interpreters) and code to code converters are employed in word processing software. The challenges faced are: inability to understand the intent behind the query, not understanding the emotive dimension, no linguistic communication and unaware of communicating partners’ knowledge and experience.

C. Backing off Technique

Based on the assumption that NLP algorithms are imperfect and some information is not important for determining relevance of a document. It is a 5 stage process:

1.Match query terms exactly.

2.Use only root and part of speech information.

3.Account for NLP mistakes in part of speech assignments (software confuses adjective with proper nouns).

4.Match word roots only, ignoring morphology and POS Tags.

5.Match to a synonym or hyponym. V. QUERY MEASUREMENT

Query processing is measured in terms of QPS (query per second) and by the perceived relevancy of the results.

These measurements are affected by: Hardware used in the search matrix, limits imposed by licensing (e.g. how many search needs), the linguistic features available in the index, the query’s complexity and the query-specific features invoked. The challenges faced during information retrieval are: Drafting better questions and giving “easy-to-judge” results so that results are minimized (what the user has to read through).

VI. APPROACHES FOR ENHANCING THE SEARCH

Take a phased approach. This will help in identifying what works well and isolate the areas for improvement. It is advisable to work maximum in the case of the search application. Increasing the number of search rows enhances the speed of processing and boosts scalability. You should understand the result volume. This is a trade off between speed and satisfying the need for all the information. The more data returned the more time it takes to stream it back to the search client. Result relevancy is very important factor. Users usually need only a small subset of the entire result set. They should understand the impact and cost of features that operate on an extended result set – features such as result clustering. Content indexing results in enhanced performance. Users should take into consideration the impact of blended searches such as mixed relevance or throughput.

VII. CONCLUSION

(5)

International Journal of Emerging Technology and Advanced Engineering

309

REFERENCES

[1 ] http://sce.uhcl.edu/yang/teaching/csci5333fall04/csci5333ch18.htm [2 ] F.Weigel,” Content aware DataGuides for indexing semi-structured

data”, University of Munich, Germany 2003.

[3 ] W. Wary et al.,” Efficient processing of XML Path queries using the Disk-based F&B index”, Proc. Int’l Conf. Very large Database (VLDB), Morgan Kaufmann 2005.

[4 ] C.W.Chung et al.,”APEX: An Adaptive path index for XML data”, Proc. Int’l Conf. Management od data (ACM Sigmod), ACM Press, 2002, pp. 121-132.

[5 ] D.W. Shin, H.C. Jang and H.L. Jin, “BUS: An Effective Indexing and Retrieval Scheme in”, In Proc. of digital libraries, pp 235-243, Pittsburgh, 1998.

[6 ] Q. Zou, S. Liu and W. Chu, “Ctree: A Compact Tree for Indexing XML Data”, In UCLA Computer Science.

[7 ] Quanzhong Li and Bongki Moon, “Indexing and Querying XML Data for Regular Path Expressions”, In Proc. of the 27th Int'l Conference on Very Large DataBases (VLDB), pp 361-370, Rome, Italy, September 2001.

[8 ] H. Wang et al., “ViST: A Dynamic Index Method for Quering XML Data by Tree Structures”, Proc. Int’l Conf. Management of Data (ACM Sigmod), ACM Press, 2003, pp. 110–121.

[9 ] P.R. Raw and B. Moon, “PRIX: Indexing and Querying XML Using Prüfer Sequences”, Proc. Int’l Conf. Data Eng. (ICDE), IEEE CS Press, 2004, pp. 288–300.

[10 ]Baldwin, Breck, Christine Doran, Jeffrey C. Reynar, Michael Niv, B. Srinivas, and Mark Wasson. 1997. EAGLE: An extensible architecture for general linguistic engineering.

[11 ]docs.oracle.com/cd/E28271_01/doc.1111/e10792/c04_settings.htm [12 ]jeff.familyyu.net/2012/10/19/back-to-basics-dbms-1/

[13 ]hit.ac.il/staff/leonidm/information-systems/ch68.html

Vandana Jindal, B.Tech, MCA, M.Phil. is an Assistant Professor and Head of Department of Computer Science at D.A.V College, Bathinda and a Doctoral student in the Department of Computer Science, Thapar University, Patiala. Since January 2009, she has been associated with Thapar University and has been working on Wireless Sensor Networks. Apart from WSNs, her research interests include Database Management Systems. She is a member of IEEE and IEI.

A K Verma is currently working as Assistant Professor in the department of Computer Science and Engineering at Thapar University, Patiala in Punjab (INDIA). He received his B.S. and M.S. in 1991 and 2001 respectively, majoring in Computer Science and Engineering. He has worked as Lecturer at M.M.M. Engg. College, Gorakhpur from 1991 to 1996. From 1996 he is associated with the Thapar University. He has been a visiting faculty to many institutions. He has published over 100 papers in referred journals and conferences (India and Abroad). He is

member of various program committees for different

International/National Conferences and is on the review board of various journals. He is a senior member (ACM), MIEEE, LMCSI (Mumbai), GMAIMA (New Delhi). He is a certified software quality auditor by MoCIT, Govt. of India. His main areas of interests are: Bioinformatics, database management systems and Computer Networks. His research interests include wireless networks, routing algorithms, securing ad hoc networks and

design/develop applications for other fields based on IT.

Seema Bawa holds M.Tech (Computer Science) degree from IIT Kharagpur and Ph.D. from Thapar Institute of Engineering & Technology, Patiala. She is currently Professor and Dean of Students' affairs (DOSA). Her areas of interest include Parallel and distributed computing, Grid

computing, VLSI Testing and network