Distributed Graph Query Management using Semantic Cluster based Graph Pattern Views


Ms. P. Revathi,

MPhil - Research Scholar

Ms. C. Thangamani,

Associate Professor, Department of Computer Science

P.K.R. Arts College for Women, Gobichettipalayam, Tamilnadu, India

Abstract

Graph Pattern Views (GPV) are used to process graph queries. Pattern containment supports the graph pattern matching process, and minimal and minimum pattern containments are used to prepare the views for query processing. Statistical weight based term clustering methods are applied for the partitioning process. The MatchJoin, minimal, minimum and containment algorithms are adapted for the graph pattern query process. The maximally contained rewriting mechanism is applied to update query pattern views. Graph pattern matching using views is still in its infancy. One issue is to decide which views to cache so that a set of frequently used pattern queries can be answered using the views. A practical method for querying big social data requires combining techniques such as view based and distributed processing. Hence a distributed semantic cluster based GPV model is proposed.

The distributed graph query management system is developed as two applications: an XML server and a client application. The XML server manages the XML documents. All query submission and data search activities are initiated from the client application. An Ontology is a repository deployed to analyze concept and term relationships. The XML documents are analyzed for semantic relationships using the Ontology. The semantic clustering tasks are performed using the K-Means clustering algorithm. The Semantic Cluster based Graph Pattern Views (SCGPV) model is built using the semantic weight values. The graph comparison operations are carried out using the refined pattern containments. All query values are recorded in log files, and the log files are analyzed to update the cache contents. Response delay, query relevance rate and accuracy level are analyzed to evaluate the performance of the graph retrieval process.

Index Terms: Graph Pattern Matching, Semantic Analysis, Pattern Query and Cluster models

1. Introduction

The volume of semi structured documents has been increasing rapidly in recent times in areas such as social networking and e-commerce, and discovering new knowledge from them is essential. Semi structured data arises when the source or the environment does not impose a proper structure on the data and when the data is combined from several heterogeneous sources. Semi structured databases are constructed to manage tree or hierarchical data elements. XML documents are one of the major sources of semi structured data. Such data is modeled by rooted graphs, in which the vertices represent objects and the edges represent the relationships among objects. Graph pattern queries are adapted to retrieve data from graph databases. The graph retrieval process is constructed to perform graph based query processing. Views are incorporated to maintain the query results.

Graph pattern matching is a routine process in a variety of applications, e.g., computer vision, knowledge discovery, biology, cheminformatics, dynamic network traffic, intelligence analysis and social networks. It is often defined in terms of subgraph isomorphism, graph simulation or bounded simulation. Given a pattern graph GP and a data graph G, graph pattern matching is to find the set M(GP, G) of matches in G for GP. For subgraph isomorphism, M(GP, G) is the set of all subgraphs of G that are isomorphic to the pattern GP. For simulation, M(GP, G) consists of a unique maximum match, a relation defining edge-to-edge mappings.
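The simulation semantics described above can be illustrated with a minimal sketch. It assumes node labeled graphs stored as plain dictionaries and edge lists (a representation chosen here for illustration, not taken from the paper) and computes the node-level maximum simulation relation by repeatedly discarding candidates that have no matching successor. The function name max_simulation is hypothetical.

```python
def max_simulation(pattern_nodes, pattern_edges, data_nodes, data_edges):
    """Return the maximum simulation as {pattern node: set of data nodes}.

    pattern_nodes / data_nodes: dict mapping node id -> label.
    pattern_edges / data_edges: iterable of (source, target) pairs.
    """
    # Successor sets of the data graph.
    succ = {v: set() for v in data_nodes}
    for s, t in data_edges:
        succ[s].add(t)

    # Start from every label-compatible candidate.
    sim = {u: {v for v, lab in data_nodes.items() if lab == pattern_nodes[u]}
           for u in pattern_nodes}

    # Refine until a fixpoint is reached.
    changed = True
    while changed:
        changed = False
        for u, u2 in pattern_edges:
            # v may match u only if some successor of v matches u2.
            keep = {v for v in sim[u] if succ[v] & sim[u2]}
            if keep != sim[u]:
                sim[u] = keep
                changed = True

    # The match is empty if any pattern node is left without candidates.
    if any(not cands for cands in sim.values()):
        return {u: set() for u in pattern_nodes}
    return sim


pattern_nodes = {"A": "A", "B": "B"}
pattern_edges = [("A", "B")]
data_nodes = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
data_edges = [("a1", "b1")]
match = max_simulation(pattern_nodes, pattern_edges, data_nodes, data_edges)
```

In this small example a2 is dropped because it has no B-labeled successor, while both b1 and b2 remain matches of B, which reflects the many-to-many nature of simulation. The edge-to-edge mapping can be derived from the node relation by pairing each pattern edge with the data edges whose endpoints are in the corresponding candidate sets.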


new query without finding the rewritings. Furthermore, all previous works require us to materialize all answers to the view and evaluate the MCR over all such answers. Since an answer to a tree pattern is a subtree of the original data tree, and some answers may be subtrees of other answers, it is likely that we do not need to evaluate the MCR over all of the view answers. In other words, some view answers may be redundant in that any answers that can be found by evaluating the MCR over them can also be found by evaluating the MCR over other view answers. In our experiments we found that on average 36.91% of view answers for the XMark dataset are redundant, 73.67% of view answers for the BIOML dataset are redundant, and up to 69.48% of view answers can be redundant for the Treebank dataset. Identifying view answers which do not contribute to the answering of new queries will help us minimize view maintenance and speed up query evaluation.

2. Related Work

In this section we compare our work with the previous works that are most closely related to ours. The first paper on MCR of tree pattern queries using views appears to be the one that proposed the technique of useful embeddings for queries and views in P{/,//,[]}. That paper also proposed an algorithm to find the MCR under a non-recursive and non-disjunctive DTD, which can be represented as an acyclic graph. The basic idea is to reduce the original problem to one without a DTD by chasing the tree patterns repeatedly using constraints that can be derived from the DTD. Recently, [1] proposed a method for finding the MCR for queries that have no *-nodes connected to a //-edge and no leaf node u such that u and the parent of u are both labeled *. The method is based on the concepts of trap embedding and trap relay, where a trap embedding is a mapping from a tree pattern to a tree, and a trap relay is a mapping from a pattern to another pattern.

It can be shown that the induced pattern of a trap embedding from Q to a pattern Vi in modL*Q+1(V) is the same as a CAT if the attach point of Vi is sn(Vi).

More recently, [7] studied the evaluation of MCRs, and [2] gave an algorithm for identifying redundant contained rewritings, which is orthogonal to this work. It should be noted that the redundant view answers discussed in this paper are with respect to the union of all CATs, and they may not be redundant with respect to equivalent rewritings or with respect to a subset of CATs.

There have been works on equivalent rewritings (ERs), where an equivalent rewriting is a special contained rewriting Q′ which satisfies the condition Q′ ◦ V = Q. Among them, it was shown that if V and Q are in P{/,//,[]}, P{/,[],*} or they are normalized linear patterns, there is an equivalent rewriting of Q using V iff Qk is an equivalent rewriting, where k is the position of sn(V) on the selection path of V, and Qk is the subtree of Q rooted at the k-th node on the selection path of Q. [4] extended the above result to Q, V ∈ P{/,//,*,[]} and showed that for many common special cases, there is an equivalent rewriting of Q using V iff either Qk or Qk// is an equivalent rewriting, where Qk// is the pattern obtained from Qk by changing all edges connected to the root to //-edges. Thus in those special cases, to find an equivalent rewriting we only need to test whether Qk or Qk// is an equivalent rewriting. Another work assumed that both Q and V are minimized and reduced tree pattern matching to string matching, allowing a more efficient algorithm for finding equivalent rewritings. Its authors further proposed a way to organize the materialized views in the cache to enable efficient view selection and cache look-up. [6] investigated equivalent rewritings using a single view or multiple views for queries and views in P{/,//,[]}, where the set of views is represented by grammar-like rules called query set specifications. That work can be seen as an extension of an earlier work [5] which investigated the same problem for explicit views.

None of the previous works dealt with the redundant view answers problem or the approach of finding answers to a new TP using annotated view answers. Finally, this work borrows techniques from earlier work, and some of our results on query containment are extensions of earlier results.

3. Answering Pattern Query Processing Views


matching using views provides an effective method to query such big data.

Graph pattern matching is conducted by capitalizing on available views. Answering queries using views has been extensively studied for relational data, XML and semi structured data. Given a query Q and a set V = {V1, . . . , Vn} of views, the idea is to find another query A such that A is equivalent to Q and A only refers to views in V. If such a query A exists, then given a database D, one can compute the answer Q(D) to Q in D by using A, which uses only the data in the materialized views Vi(D), without accessing D. This is particularly effective when D is “big” and/or distributed. Indeed, views have been advocated for scale independence, to query big data “independent of” the size of the underlying data. They are also useful in data integration, data warehousing, semantic caching and access control.

The graph pattern query system is focused on graph pattern matching defined in terms of graph simulation, since it is commonly used in social community detection, biological analysis and mobile network analysis. Conventional subgraph isomorphism often fails to capture meaningful matches. Graph simulation fits into emerging applications with its “many-to-many” matching semantics. It is more challenging since graph simulation is “recursively defined” and has poor data locality.

Graph pattern queries can be answered using views based on graph simulation with pattern containment, which extends traditional query containment to deal with a set of views. A pattern query Qs and a set V of view definitions are utilized for the query process. Qs can be answered using V if and only if Qs is contained in V.

Efficient algorithms are adopted for checking pattern containment, minimal containment and minimum containment. Algorithms that run in cubic time in the sizes of the query Qs and the view definitions V are provided for containment and minimal containment; Qs and V are much smaller than the graph G in practice. An approximation algorithm with performance guarantees is provided for minimum containment. When exact answers of a query Qs cannot be computed using views V, i.e., when Qs is not contained in V, one wants to find the maximal part of Qs that can be answered using V. Maximally contained rewriting is initiated to update the containment with recent query information that is not included in the views.
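The containment checks above can be sketched under a simplifying assumption: each view is already summarized by the set of pattern edges of Qs that it covers (the view_cover mapping below is a hypothetical input, not an interface from the paper). Under that assumption, containment amounts to an edge coverage test, the part of Qs answerable from the views is the covered edge set, and a small covering subset of views can be picked with a greedy set-cover heuristic, which is one standard approximation technique and is used here for illustration only.

```python
def covered_part(query_edges, view_cover):
    """Edges of the query that at least one view covers."""
    covered = set().union(*view_cover.values())
    return set(query_edges) & covered


def is_contained(query_edges, view_cover):
    """The query is treated as contained when every edge is covered."""
    return covered_part(query_edges, view_cover) == set(query_edges)


def greedy_minimum_views(query_edges, view_cover):
    """Greedily pick views until all query edges are covered.

    Returns a list of view names, or None when the views cannot
    cover the query (i.e. the query is not contained).
    """
    uncovered = set(query_edges)
    chosen = []
    while uncovered:
        if not view_cover:
            return None
        # Pick the view that covers the most still-uncovered edges;
        # sorting the names makes tie-breaking deterministic.
        best = max(sorted(view_cover),
                   key=lambda name: len(view_cover[name] & uncovered))
        gain = view_cover[best] & uncovered
        if not gain:
            return None
        chosen.append(best)
        uncovered -= gain
    return chosen


query_edges = {("PM", "DBA"), ("DBA", "PRG"), ("PM", "PRG")}
view_cover = {
    "V1": {("PM", "DBA")},
    "V2": {("DBA", "PRG"), ("PM", "PRG")},
    "V3": {("PM", "PRG")},
}
selected = greedy_minimum_views(query_edges, view_cover)
```

Here V2 is selected first because it covers two edges, and V1 then covers the remaining edge, so V3 is never needed. If V2 were unavailable, covered_part would return the two edges that V1 and V3 can still answer.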

The maximally contained rewriting Qs’ of Qs w.r.t. V can be found in cubic-time. The query-driven approximation scheme is provided by treating Qs’(G) as approximate query answers to Qs in a big graph G. Alternatively, one can compute exact answers Qs(G) by using Qs’(G) and additionally, accessing a small fraction of G, along the same lines as the scale independence approach. The effectiveness, efficiency and accuracy of the view-based matching method are verified with different data models. The matching algorithm scales well with data size and pattern size. The algorithm can compute maximally contained rewriting Qs’ efficiently.

4. Problem Statement

Graphs are constructed to manage semi structured data with nodes and edges. Data retrieval in graphs is carried out with graph queries, and graph relationships are analyzed with graph pattern queries that are evaluated using graph simulation. The pattern containments are built with matched views. The minimal and minimum containments of pattern queries are determined and updated. The MatchJoin algorithm is used to match the graph values. The minimal and minimum algorithms are adapted to fetch the containment of the pattern queries. The contain algorithm is applied to build the pattern containments with views. The maximally contained rewriting mechanism is applied to fetch approximate results from the graph and view information. Node labeled graphs and edge labeled graphs can both be used in pattern query processing. The graph pattern matching operations are carried out on journal details built using the XML database model. The following problems are identified in the current pattern query models: semantic query models are not supported; cache management schemes are not integrated with the pattern views in the query process; pattern refinement and weight analysis tasks are not supported; and distributed data analysis operations are not supported.


to improve the query response speed with minimum computational complexity. The system is designed to manage XML document search operations on document server. Semantic analysis models are used to improve the document indexing and ranking process. Query submission process is supported with semantic and prefix methods. The system is divided into six modules. They are XML server, weight assignment, pattern containment, query assistant, query log management and document retrieval.

The XML server maintains the XML documents and Ontology. The weight assignment module is designed to assign keyword weights. Documents are clustered under pattern containment module. The query assistant module is designed to assist the user for query submission. Query log management module is designed to improve the user queries with semantic and history analysis. The document retrieval module fetches the relevant documents from the document server.

5.1. XML Document Server

The XML server provides the XML documents to the clients. The XML documents are maintained in a separate folder on the document server machine. The client queries are managed by the XML server. The XML document list form shows the XML documents maintained on the document server, with the name and size of each document. The XML document view form is designed to display the content of the XML documents. The user can view any XML document by selecting its name from the XML document list. The XML document is displayed with tag and data details. The content details form is used to show the parsed contents of XML documents. The parsing process separates the tag and data values. The data values are assigned to the proper fields to provide readability for the user. The XML documents are composed in a tree structured manner and are referred to as a semi structured database.

An XML document provides the field names in tags, and the data values are enclosed within the tags. The tags are treated as paths of the document tree. The tree levels are analyzed in the path analysis process. The document contents are referenced as path tree elements. Path availability is represented in the table as a numeric value. Path information is extracted by the document analysis process. Path alignment is initiated to remove infrequent paths, and infrequent tags are removed from the list. Author, table and figure details appear infrequently in the documents.

5.2. Weight Assignment

The weight assignment process assigns the keyword weights for XML documents. Statistical and semantic analysis schemes are used for the weight assignment process. The term weights are estimated using statistical analysis mechanism. Document preprocess is carried out under the weight estimation process. The stopword elimination and stemming process tasks are performed under the document preprocess. Stop word elimination process is performed to remove common words that are used in all documents. The system uses a separate stop word data collection for the stop word elimination process. The suffix analysis is performed in the stemming process. Porter stemming algorithm is used in the stemming process.

Term weights are estimated using Term Frequency (TF) and Inverse Document Frequency (IDF) values. The term frequency is calculated from the term count and the total number of terms in the document. Term frequency is a probability value that represents the importance of the term within the document. The document count is used in the inverse document frequency value, which represents the importance of the term across the documents. The term weight is calculated by multiplying the term frequency by the inverse document frequency. The term weight is calculated for all the terms in the XML documents.
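The term weight computation can be sketched as follows. The term frequency follows the description above (term count divided by the number of terms in the document). The paper does not state the exact IDF formula, so the common form log(N / df) is assumed here, where N is the number of documents and df is the number of documents containing the term. The function name tf_idf and the tokenized input format are illustrative.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF weights for a list of tokenized documents.

    documents: list of term lists (already stop-word filtered and stemmed).
    Returns one {term: weight} dictionary per document.
    """
    n_docs = len(documents)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for terms in documents:
        doc_freq.update(set(terms))

    weights = []
    for terms in documents:
        counts = Counter(terms)
        total = len(terms)
        weights.append({
            # TF = count / total terms; IDF = log(N / df) (assumed form).
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [["graph", "query", "graph"], ["graph", "view"]]
weights = tf_idf(docs)
```

With this IDF form, a term that occurs in every document (such as "graph" in the example) receives a weight of zero, so only terms that discriminate between documents contribute to indexing and clustering.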


has the weight value 1. The synonym is assigned with the weight value of 0.6 and the meronym is assigned with the weight value of 0.4. Finally the hypernym is assigned with the weight value of 0.2. The weight values and the concept category frequency are used in the semantic weight estimation process. Concept weight is calculated using the summation of all semantic weight values. The semantic weight values are updated with term and concept relationship details in the database. Logical part and type of relationships are analyzed in the semantic analysis.
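A minimal sketch of the concept weight computation follows. It assumes that each semantic weight is the relation weight multiplied by the frequency of that relation category, and that the concept weight is their sum; the product form is an assumption, since the text only states that both quantities are used. The name of the relation with weight 1 is not recoverable from the text, so the key "primary" below is a placeholder.

```python
# Relation weights as listed in the text. "primary" is a placeholder
# key: the source's name for the weight-1 relation is not available.
RELATION_WEIGHTS = {
    "primary": 1.0,
    "synonym": 0.6,
    "meronym": 0.4,
    "hypernym": 0.2,
}


def concept_weight(relation_frequencies):
    """Sum of (relation weight x category frequency) over all relations.

    relation_frequencies: {relation name: number of occurrences}.
    The product form is an assumption made for this sketch.
    """
    return sum(RELATION_WEIGHTS[relation] * frequency
               for relation, frequency in relation_frequencies.items())


example = concept_weight({"primary": 2, "synonym": 1, "hypernym": 3})
```

For the example input the weight is 2 x 1.0 + 1 x 0.6 + 3 x 0.2, which is approximately 3.2.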

5.3. Pattern Containment Process

The graph pattern queries are executed under graph simulation. Graph Pattern Views (GPV) are used to retrieve the graph data values. Materialized views are prepared for the query values, and the query responses are fetched from the materialized views. The pattern containment is built with the selected views. Minimal and minimum containment information is identified from the graphs, and the pattern containment is built on the minimal and minimum patterns together with the clustering results. The indexing process is designed to arrange XML documents based on the weight values. Term and semantic weight based cluster models are used in the system. The K-means clustering algorithm is applied for the clustering process, with similarity analysis used to assign documents to clusters.

The pattern clustering process is divided into two types: term pattern clustering and semantic pattern clustering. The term clusters are estimated using the term weight values of the XML documents. The semantic pattern clusters are calculated using the concept weight values and are also prepared by the K-means clustering algorithm. The clustering process form shows the list of clusters and the number of documents associated with each cluster. The cluster details form shows the documents that are assigned to each cluster. The refined pattern containment is prepared with the semantic weight values and term information. Semantic cluster results are used for the refined pattern containment process.
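The clustering step can be sketched with a small K-means implementation. It assumes each XML document is represented as a numeric vector of term or concept weights and uses squared Euclidean distance as the similarity measure; the paper does not specify the measure or the initialization, so both are assumptions. For reproducibility the first k points serve as initial centroids, whereas practical implementations usually randomize this choice.

```python
def k_means(points, k, iterations=20):
    """Cluster weight vectors with K-means.

    points: list of equal-length numeric tuples (one per document).
    Returns (assignment, centroids), where assignment[i] is the
    cluster index of points[i].
    """
    # Deterministic initialization: the first k points (an assumption).
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)

    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[c])))

        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]

    return assignment, centroids


vectors = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9)]
assignment, centroids = k_means(vectors, k=2)
```

The same routine can be applied to term weight vectors for term pattern clustering and to concept weight vectors for semantic pattern clustering.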

5.4. Query Assistant

The client application is designed to search XML documents using keyword query values under the distributed environment. The query keyword is collected from the client and passed to the XML document server. The XML document server sends the response to the client. The query assistant helps the user to type the correct query value without errors. The query assistant provides a list of suggestions for the user query value. The query assistant is integrated with the query submission user interface. Typographical errors are automatically corrected by the query assistant. Prefix and phrase based query suggestions are provided by the query assistant. Keyword weights are used to produce query suggestions in an ordered way. The system supports both prefix based Graph Pattern Views (GPV) query assistance and concept based Semantic Cluster based Graph Pattern Views (SCGPV) query assistance.
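Prefix based suggestion ordered by keyword weight can be sketched as below. The function name suggest, the keyword_weights mapping and the limit parameter are illustrative assumptions; the typographical error correction mentioned above is not covered by this sketch.

```python
def suggest(prefix, keyword_weights, limit=5):
    """Return up to `limit` keywords starting with `prefix`.

    keyword_weights: {keyword: weight}. Suggestions are ordered by
    descending weight, with ties broken alphabetically.
    """
    prefix = prefix.lower()
    matches = [(weight, keyword)
               for keyword, weight in keyword_weights.items()
               if keyword.lower().startswith(prefix)]
    matches.sort(key=lambda pair: (-pair[0], pair[1]))
    return [keyword for _, keyword in matches[:limit]]


keyword_weights = {"graph": 0.9, "grammar": 0.4, "query": 0.7, "graphics": 0.9}
suggestions = suggest("gra", keyword_weights)
```

A linear scan is adequate for a small keyword list; a trie or sorted index would be a natural replacement when the vocabulary grows.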

5.5. Query Log Management

The query log management process is carried out under the document server environment. The query values are collected from the clients and recorded in the query logs on the server. The query log details form shows the list of query values submitted by the clients, with the query term, host name, IP address and submission time of each. The query summary form shows the list of query terms and their frequency values. The query optimizer improves the user queries with semantic analysis and query logs. The user can select the required query assistance mechanism, and the query optimization is carried out according to the user's choice. Concept relationship based query suggestions are produced in the semantic analysis model. User queries and their hit rates are used in the query log based suggestion model. The user can update the query values with the suggestion details.

The query summary is used to update the cache contents. The query values that have high frequency levels are moved to the cache with their query response details, and subsequent results for those queries are fetched from the cache. The maximally contained rewriting mechanism is used in the pattern containment and view update process. The refined pattern containment is used in the response preparation process. The graph retrieval is handled in a distributed manner.
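The frequency driven cache update can be sketched as follows. The class name QueryCache, the min_frequency threshold and the evaluate callback are assumptions introduced for illustration; the paper does not state the threshold or the eviction policy, and no eviction is modeled here.

```python
from collections import Counter


class QueryCache:
    """Cache responses of queries that occur often in the query log."""

    def __init__(self, min_frequency=3):
        self.min_frequency = min_frequency  # hypothetical threshold
        self.log = Counter()                # query -> submission count
        self.cache = {}                     # query -> cached response

    def answer(self, query, evaluate):
        """Return (response, served_from_cache) for a query.

        evaluate: callable that computes the response on the server.
        """
        self.log[query] += 1
        if query in self.cache:
            return self.cache[query], True
        response = evaluate(query)
        # Promote the query once it has been seen often enough.
        if self.log[query] >= self.min_frequency:
            self.cache[query] = response
        return response, False


evaluated = []


def evaluate(query):
    evaluated.append(query)
    return [query.upper()]


cache = QueryCache(min_frequency=2)
first = cache.answer("graph views", evaluate)
second = cache.answer("graph views", evaluate)
third = cache.answer("graph views", evaluate)
```

In this run the first two submissions are evaluated on the server, the second one promotes the query into the cache, and the third is served from the cache without calling evaluate again.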

5.6. Document Retrieval


The user query is submitted to the server, and the server responds with a document list. Query results are listed in the query submission form. The query submission process is assisted with query suggestions. The user can view any XML document by selecting the document and pressing the view button. The content of the XML document is displayed in a separate form.

6. Conclusion and Future Work

The distributed graph query management scheme is built to perform graph data retrieval in a distributed environment. Graph Pattern Views (GPV) are used to execute the graph pattern queries. Pattern containment is utilized in the graph matching process, and minimal and minimum pattern containments are prepared for the pattern matching process. The Semantic Cluster based Graph Pattern Views (SCGPV) scheme is developed to support the distributed query process with semantic relationship and cache management models. The refined pattern containments are prepared to control the computational complexity. The graph patterns are clustered with the K-means clustering algorithm. The XML server manages the XML documents and query processing operations. The client submits the query values to the XML server. The query responses are ranked with reference to the weight values. The system can be enhanced in two ways: the graph pattern query model can be extended with privacy preserving data retrieval, and it can be improved to support graph retrieval on data streams.

REFERENCES

[1] J. Tang and A. W.-C. Fu. Query rewritings using views for XPath queries, framework, and methodologies. Inf. Syst., 35(3):315–334, 2010.

[2] J. Wang, K. Wang, and J. Li. Finding irredundant contained rewritings of tree pattern queries using views. In APWeb/WAIM, pages 113–125, 2009.

[3] J. Wang and J. X. Yu. XPath rewriting using multiple views. In DEXA, pages 493–507, 2008.

[4] F. N. Afrati, R. Chirkova, M. Gergatsoulis, B. Kimelfeld, V. Pavlaki, and Y. Sagiv. On rewriting XPath queries using views. In EDBT, 2009.

[5] B. Cautis, A. Deutsch, and N. Onose. XPath Rewriting using multiple views: Achieving completeness and efficiency. In WebDB, 2008.

[6] B. Cautis, A. Deutsch, N. Onose, and V. Vassalos. Efficient rewriting of XPath queries using query set specifications. PVLDB, 2(1):301–312, 2009.