A Flexible Retrieval Mechanism for Structural Data Using Multiple Vector Spaces


Academic year: 2021


Relevant Topic list:

Special Purpose Database −− Multidimensional databases

Access Methods

Advanced Search, Query and Approximation

A Flexible Retrieval Mechanism for Structural Data Using Multiple

Vector Spaces

Srinath Srinivasa, Sumit Acharya, Rajat Khare, Himanshu Agrawal

Indian Institute of Information Technology,

International Technology Park,

Whitefield Road,

Bangalore 560066

India

Primary contact: Srinath Srinivasa

Email: [email protected]

Tel: +91−80−8410627

Fax: +91−80−8410636

Paper−id: 316

Topic area: Core database technology

Category: Research


sri@iiitb.ac.in

May 7, 2002

Abstract

This paper presents GRACE, a graph database system based on the concept of a graph space. Graph databases store a set of member graphs which are retrieved based on an input query graph. The retrieval may be based on a set of (non-structural) attributes, or on the graph structure. GRACE provides retrieval mechanisms based on both attributes and structure. This is performed by vectorizing member graphs onto points in hyperspace. The query graph is mapped onto a region, and the query results would be points lying in that region. For vectorizing graph structures, GRACE incorporates a concept of multiple vector spaces. These depict the same graph from different views and/or hierarchical levels. A member graph may be mapped onto unique points in different vector spaces. The vector spaces are generated automatically during the initialization phase. At any point in time, more vector spaces may be added to the system, or existing ones deleted. During a retrieval process, vector spaces can also be enabled or disabled to respectively prefer accuracy over speed of retrieval, or vice versa.

Keywords: Graph database, Subgraph isomorphism, Vector spaces, Graph vectorization, Information retrieval.

1 Introduction

A number of database applications need to store and retrieve structural information, usually in the form of undirected graphs. A collection of such structural data is called a graph database.

Graph databases are used in different application domains. Examples include databases of the following: protein structures in biological applications, electrical circuits in CAD applications, images in image retrieval applications, and citations or hyperlink structures in digital library applications. For the most part, each of the above and related communities has independently pursued the design and maintenance of structural data. Some example graph databases include: ASES by Shasha et al. [19] and SUBDUE by Cook et al. [5] for molecular modeling; Messmer and Bunke's database [17] for CAD and image retrieval applications; and Petrakis and Faloutsos's [18] approach for medical imaging databases. In this work, our endeavor is to address graph databases as a concept in itself, with generic retrieval algorithms that can be customized to any of the actual application domains.

The main challenge in graph databases is retrieval of graphs based on structure. Matching structures is an instance of the subgraph isomorphism problem, in which the query graph is matched against a candidate member graph to determine whether the query structure occurs in the member. In graph databases, this problem is compounded by the fact that a database contains a large number of member graphs against which structure matching should be performed.

Hence, issues in graph databases are twofold: (a) retrieving a set of candidate graphs from the set of all member graphs, and (b) matching the query graph against the candidate graphs. This paper concentrates on the first step. For retrieving candidate graphs, there are currently two main approaches: (a) index based approaches, and (b) vector based approaches. Index based approaches maintain a hierarchical index of member graphs, which is traversed in response to a query. During the traversal, a distance metric between the query graph and the current index element is calculated. When the distance metric falls below a set threshold, the current index element is used to retrieve candidate graphs. Examples of index based approaches include [1, 17] and [19]. Vector based approaches consider member graphs as a vector of features, and transform each graph onto a feature space. An example of a vector based approach is by Petrakis and Faloutsos [18] for medical imaging databases. Usually, vectorization is performed on attributes of the graph. In this process, the structural properties that show how graph attributes are interlinked get neglected (cf. [1]).

Vectorizing structure is more complicated than vectorizing a set of features. Structure has at least two properties which are important for vectorization: views and abstraction levels. Structure may be described by different views, each of which is vastly different from the others even though they describe the same entity. Secondly, structure may be described at different abstraction levels, with lower levels possibly (but not necessarily) being contained within higher levels. This makes it difficult to codify structural properties within a single set of dimensions.

This paper proposes a retrieval mechanism that is based on vectorization of structure. Starting from a few sample graphs, a set of structural properties is extracted from different views and abstraction levels, and vector spaces are created in which graphs appear as vectors.

The approach presented here is being designed under a graph database project called GRACE, which supports storage and retrieval of undirected graphs. GRACE stands for graph space, which is an abstract space that describes the graph database. The graph space contains one or more vector spaces.¹ Attributes and structural properties of member graphs appear as unique vectors in these vector spaces. A query is mapped onto regions in one or more vector spaces, and query results are all the points in the region.

Retrieval by vectorization is fast, but inexact. The accuracy of the retrieval process is limited by the number of vector spaces and their dimensions. Usually, more accuracy requires the addition of more vector spaces and/or dimensions. In GRACE, vector spaces can be incrementally added (or enabled) to increase accuracy, or deleted (or disabled) to increase speed of retrieval.

2 Literature Survey

Structure matching of graphs is a problem that has been addressed at least since the 1970s. Some examples are Corneil and Gottlieb [6] in 1970, Berztiss [2] in 1973, and Ullmann in 1976 [20].

A graph database involves more than structure matching. There are a number of member graphs which have to be compared against an input query graph. The complexity here is twofold: searching for candidate member graphs to match, and matching the query against each of the candidate graphs. The SUBDUE graph database [5, 13] uses a hierarchical index of substructures to identify candidate graphs. Retrieval of candidate graphs is based on a concept of "best compression," that identifies a set of substructures which compress the query graph in the best possible way. Messmer and Bunke [17] use a decision tree based indexing of member graphs for a graph database. A query graph traverses through the decision tree till it finds a member graph which satisfies the query. While the tree size is invariant with respect to the number of member graphs, it grows exponentially with the number of vertices in member graphs.

¹ As would be apparent in later sections, a "vector space" of structure is not a vector space in the mathematical sense of the term. It is simply a collection of ordered tuples which characterize different structural properties. The vector space does not support vector arithmetic operations like distance calculation and addition of vectors.

Shasha et al. [19] address inexact or approximate searches on graphs. They present a system called Approximate-Tree-by-Example (ATBE) which allows inexact matching of tree graphs. The underlying concept is based on a simplification of edit-distance. Edit-distance between two graphs is the minimum number of label or node transformations required to convert one graph to another. Since calculation of edit-distance is NP-hard, a simplification called 2-degree distance is introduced which considers edits only until a distance of 2 from the node under comparison. Luo and Hancock [15] propose structural graph matching using an expectation maximization (EM) algorithm. Their method is structure based and relies only on the edge or connectivity structure of the graph. Berretti et al. [1] propose a method to index graph models for content-based retrieval. The index mechanism is based on mutual distances between member graphs. The Available Chemicals Database (ACD) of Daylight CIS Inc. [7] stores a database of chemical compound structures. This is done by labeling each node in a chemical compound with a set of spatial coordinates and then indexing them according to the coordinates.

Graphs and structure matching have been addressed in other contexts as well. The $G^+$ language proposed by Mendelzon and Wood [16] searches for paths in a relational database based on an input regular expression. XPath from W3C [4] proposes a language comprising expressions that describe tree structures.

Various representation mechanisms exist for storing a database of graphs. One of the widely used representations for storing image data is Attributed Relational Graphs (ARGs). Other models include GRAS (Graph Oriented Database Systems) [14] and Graphviz [10]. Deo [9] illustrates many more application areas and storage techniques for graph data.

In GRACE, the design criterion is oriented towards storage and retrieval of undirected graphs. Only logical structures of graphs are stored. The physical rendering of member graphs would depend on which visualization technique is used. The vectorization process in GRACE is similar to feature based vectorization (for example, as proposed by [18]). But GRACE also vectorizes graph structures from various views and abstraction levels. For obtaining different abstraction levels, a compression based technique similar to SUBDUE [13] is used. However, GRACE does not have a concept of best compression, and all compressed graphs may be utilized for vectorization.

Figure 1: The Dominance Continuum

Attribute Dominance: mostly uniform structures, many node attributes. Example: MRI scan database.

Comparable diversity in structure and attributes. Example: database of electrical circuits.

Structure Dominance: vastly diverse structures, few node attributes. Example: database of chemical compounds.

3 Dominance of Structure and Attributes

If a graph database has to be designed in a generic fashion, we need to identify peculiarities of its many application domains. The peculiarities are summarized by what may be called a dominance continuum, as shown in Figure 1.

Attribute Dominance: Some application areas require graph databases with attribute dominance. This means that the member graphs are mostly similar with respect to structure, with slight variations if any. But the number of possible node and edge attributes for each graph would be large. The retrieval of such graphs would involve matching different combinations of node and edge attributes over substructures. For example, a medical database of MRI scans of the human face would contain graphs which are more or less similar. Each graph would contain many "labeled" or "expected" objects like "eyes", "nose", etc. Some graphs may contain slight variations.

Structural Dominance: On the other end of the continuum are applications whose databases have structural dominance. In these databases, variation in structure is greater than variation in attributes. There are likely to be a large number of different graph structures, with relatively few attributes. An example is a database of organic chemical compounds used in chemical industries. Such a database would contain a large number of compounds, each having vastly different structural properties. However, they are made up of relatively few types of elements like carbon, hydrogen, oxygen, etc.

There may be databases that fall somewhere midway in the continuum.² An example is a database of electrical circuits. Circuits are stored in the form of graphs where nodes are electrical components and edges are connections between components. Here, the database would consist of a large number of graph structures as well as a substantial number of node attributes. Both attributes and structure are equally important in the retrieval process.

The position of a graph database in the dominance continuum would determine the overall strategy for indexing and retrieval. In existing literature, attribute dominated graphs are addressed by Petrakis and Faloutsos [18]. They consider a medical image graph to consist of a fixed number of "expected" attributes and vectorize the graph based on these attributes. Here, graph structures are considered to be largely similar with occasional slight variations. Structure domination is addressed in [20, 19, 17, 6, 5]. Most of these approaches are based either on traversal algorithms or on concepts of edit distances.

In GRACE, the design criterion is to address the whole continuum under a single umbrella. In order to support this, the database system has two kinds of spaces:

Structure space is a set of one or more vector spaces that describe structures of member graphs from different views and abstraction levels.

Attribute space is a single set of dimensions that vectorizes a finite set of attributes associated with graph nodes and edges.

² It is our conjecture that dominance is a continuum, in that the degree of dominance matters for the overall design. It could well be the case that dominance may be described by just three cases: attribute over structure, structure over attribute, and no dominance.

In attribute dominated applications, some approaches also apply structure matching after attribute based retrieval in order to increase accuracy. Examples are [12, 1]. Similarly, some approaches in molecular databases (which are structure dominated databases) augment structure matching with attribute matching [5, 19].

Based on the same principles, GRACE addresses dominance as follows:

For attribute dominated databases, first search in the attribute space. Then, using the results of this search, perform a search in the structure space.

For databases with no dominance (where attribute and structure are equally important), first perform a search in the structure space. Based on the search results, search the attribute space.

For structure dominated databases, perform a search in the structure space. Depending on the richness of the attributes in the query, use the search results to search in the attribute space.

Determining the search strategy is a one-time process. This is encoded during installation, based on whether the application domain has attribute domination or structure domination.
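The three dominance cases can be read as a fixed ordering of the two searches. The following is a minimal sketch of that reading; the names `Dominance`, `search_order` and `query_has_rich_attributes` are illustrative assumptions and do not come from GRACE itself.

```python
from enum import Enum


class Dominance(Enum):
    """Hypothetical labels for the three regions of the dominance continuum."""
    ATTRIBUTE = "attribute"
    NONE = "none"
    STRUCTURE = "structure"


def search_order(dominance, query_has_rich_attributes=True):
    """Return the order in which the two spaces would be searched."""
    if dominance is Dominance.ATTRIBUTE:
        return ["attribute", "structure"]
    if dominance is Dominance.NONE:
        return ["structure", "attribute"]
    # Structure dominated: the attribute search is only a refinement step
    # that depends on how many attributes the query carries.
    order = ["structure"]
    if query_has_rich_attributes:
        order.append("attribute")
    return order
```

Since the strategy is fixed at installation time, a real system would presumably store the chosen value once rather than pass it with every query.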

4 Graph Vectorization

4.1 The GRACE model

Let $G$ be a graph database consisting of a set of undirected graphs. The graph space of $G$, denoted by $GS(G)$, is the range onto which every $G_i \in G$ is mapped. The graph space $GS(G)$ is described by $GS(G) = \langle AS(G), SS(G) \rangle$, where $AS(G)$ is the attribute space and $SS(G)$ is the structure space.

Figure 2: The GRACE Model (labels in the figure: member graph, vector space, query graph, structure space)

$AS(G)$ is a set of attribute classes or dimensions. Each attribute class $A_i \in AS(G)$ is defined by a set of possible instances. The dimensionality of $AS(G)$ is the number of attribute classes it contains.

$SS(G)$ is a poset consisting of $n$ vector spaces $VS_0$ to $VS_{n-1}$, where $n$ is at least 1, and a partial order $<$ among the elements of $SS(G)$. Each vector space $VS_i \in SS(G)$ is defined by a set of dimensions. The partial order $<$ on the vector spaces defines a lattice. For any two vector spaces $VS_i$ and $VS_j$, if $VS_i < VS_j$, then $VS_j$ is said to be at a higher level of abstraction than $VS_i$ in describing $GS(G)$. If neither $VS_i < VS_j$ nor $VS_j < VS_i$, then both $VS_i$ and $VS_j$ are said to be at the same level of abstraction in describing $GS(G)$.

Figure 2 schematically depicts the GRACE model. It shows a structure space consisting of various vector spaces and the partial order among them. A member graph is mapped onto unique points in multiple vector spaces. A query graph is mapped onto regions.

For describing member graphs, GRACE uses a modified form of the graphviz [10] language. In graphviz, node attributes are declared in the form: n0 [name=value]; where n0 is the name of a node and [name=value] is the attribute associated with the node. Paths in the graph are described in the form: n0--n1--n2 [name=value]; where n0, n1 and n2 are nodes of the graph. The [name=value] describes edge attributes that are applicable to all the edges in the path.

    graph benzene {
        h [name=Hydrogen];
        c [name=Carbon];
        h_0--c_0; c_0--c_1 [type=2];
        h_1--c_1--c_2;
        h_2--c_2; c_2--c_3 [type=2];
        h_3--c_3--c_4;
        h_4--c_4; c_4--c_5 [type=2];
        h_5--c_5--c_0;
    }

Figure 3: Description of a benzene molecule

GRACE introduces two extensions to the above descriptions. A node is described in the form type_instance. This indicates that the node is of type type and the instance name of the node is instance. Similarly, a link is described by two kinds of attributes: a qualitative type attribute, and a quantitative weight attribute. These are declared as type=value and num=value directives in the edge attribute area. For example, C_0--C_1 [type=2]; indicates a double bond between two carbon atoms. An edge directive of the form author_Sumit--journal_JACM [num=3]; indicates a citation graph where an author named Sumit has 3 publications in a journal named JACM. The difference between type and num attributes lies in the way they are processed during vectorization. Figure 3 shows an example description of a benzene molecule.
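To make the directive format concrete, here is a small sketch of how one path directive could be split into edge records. It is not the GRACE parser: the function names are hypothetical, attributes are assumed to be comma separated, a missing type is represented by an empty string, and a missing num is taken to mean one occurrence.

```python
def split_node(name):
    """Split 'type_instance' into (type, instance); the instance may be empty."""
    node_type, _, instance = name.partition("_")
    return node_type, instance


def parse_directive(text):
    """Parse one directive such as 'h_1--c_1--c_2 [type=2];' into edge records."""
    text = text.strip().rstrip(";").strip()
    attrs = {}
    if text.endswith("]"):
        # Everything inside the trailing [...] applies to all edges of the path.
        path, _, attr_text = text[:-1].partition("[")
        for item in attr_text.split(","):
            key, _, value = item.partition("=")
            attrs[key.strip()] = value.strip()
    else:
        path = text
    nodes = [node.strip() for node in path.split("--")]
    edges = []
    for left, right in zip(nodes, nodes[1:]):
        edges.append({
            "ends": (split_node(left), split_node(right)),
            "type": attrs.get("type", ""),       # qualitative attribute
            "num": int(attrs.get("num", "1")),   # quantitative weight
        })
    return edges
```

For instance, `parse_directive("h_1--c_1--c_2;")` yields two edge records, one per consecutive node pair in the path.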

4.2 Vectorizing member graphs

In order to create a new graph database, GRACE adopts a two-phase process. The phases are:

Initialization phase: In the initialization phase, GRACE creates $AS(G)$ and $SS(G)$. This involves determining the dimensions that make up each of the constituent vector spaces. The database administrator presents a set of sample graphs, which are a representative sample of the different kinds of member graphs. Vector spaces are created based on these sample graphs.

Operational phase: In the operational phase, member graphs may be added to (or deleted from) the database. When a new member graph is to be added, GRACE parses the member graph using the same parsing techniques as in the initialization phase. However, once the graph is parsed, any new feature that is found which is not present in any of the vector spaces is silently ignored. Hence, the representative sample chosen in the initialization phase is important. It has to contain at least one occurrence of all important features that have to be indexed.

In the remainder of this section, the initialization and operational phases are explained in detail. However, the explanations concentrate only on the formation of $SS(G)$. The formation of $AS(G)$ is not addressed in this paper.

Initialization phase: The initialization phase involves parsing the sample graphs provided by the database administrator. Based on these graphs, GRACE determines a set of dimensions that make up the different vector spaces in $SS(G)$. To explain the formation of $SS(G)$, we take a running example of a structure dominated application: that of storing structures of organic chemicals.

The vectorization process in the initialization phase is of two kinds: (a) vectorizing the first sample graph; and (b) vectorizing the subsequent sample graphs.

Vectorization of the first sample graph is described by the following steps:

At any abstraction level $l$, do the following:

1. Read input sample graph $G_{in}$.

2. Create a vector space $VS_i$ by identifying a set of dimensions by pairs of node types and edge types.

3. Create a vector that describes $G_{in}$ based on the identified dimensions.

4. Order the dimensions in decreasing order of their components in $G_{in}$.

5. Compress $G_{in}$ by these different dimensions to create a set of $m$ output graphs $G^1_{out}$ to $G^m_{out}$. Discard all nodes in the compressed graphs that belong to level $l$.

6. Repeat from step 1 for level $l+1$, using each of the compressed graphs $G^1_{out}$ to $G^m_{out}$ as input graph $G_{in}$.

Figure 4: Vectorization Example. (a) Sample input graph of a chemical compound with C, H and O atoms. (b) Its vector: C:H = 6, C:C1 = 4, C:C2 = 3, C:O2 = 1.

Steps 2 to 5 are described in more detail below. Consider a sample input graph $G_{in}$ of a chemical compound as shown in Figure 4(a). The vectorization process first derives a set of dimensions from $G_{in}$. This is done by combining the node types and edge types described in $G_{in}$. For example, a C=C double bond would be described as C_0--C_1 [type=2]; in graphviz. Using this, a dimension called C:C2 is derived. If a C-H bond is described by C_0--H_0; then a dimension called C:H is derived. Once a set of dimensions is derived, $G_{in}$ is vectorized along these dimensions by projecting the number of different edge occurrences onto the dimensions. Figure 4(b) shows the graph vector along the four identified dimensions. If an edge has a num attribute, it is used as a count of the number of occurrences of the edge. For example, a declaration of C_2--H [num=3]; describes 3 occurrences of a C-H bond.
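The counting step described above can be sketched in a few lines. This is an illustration under assumptions, not GRACE code: `dimension_name` and `vectorize` are hypothetical names, edges are given as (node type, node type, edge type, num) tuples, and the edge list is made up so that its counts agree with Figure 4(b).

```python
from collections import Counter


def dimension_name(type_a, type_b, edge_type=""):
    """Name a dimension from an unordered node-type pair plus the edge type."""
    first, second = sorted((type_a, type_b))  # H:C and C:H become the same name
    return f"{first}:{second}{edge_type}"


def vectorize(edges):
    """Count edge occurrences per dimension; `num` acts as a multiplicity."""
    vector = Counter()
    for type_a, type_b, edge_type, num in edges:
        vector[dimension_name(type_a, type_b, edge_type)] += num
    return vector


# Hypothetical edge list whose counts agree with Figure 4(b).
sample_edges = [
    ("H", "C", "", 6),   # six C-H bonds, written with num=6
    ("C", "C", "1", 4),  # four single C-C bonds
    ("C", "C", "2", 3),  # three double C=C bonds
    ("C", "O", "2", 1),  # one C=O bond
]
sample_vector = vectorize(sample_edges)
ordered_dimensions = [dim for dim, _ in sample_vector.most_common()]
```

`Counter.most_common()` returns the dimensions in decreasing order of their counts, which is the ordering that step 4 asks for.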

Once a vector is formed, $G_{in}$ is compressed based on the identified dimensions. Compression is done as follows: First, dimensions are ordered in decreasing order of their occurrence in $G_{in}$. Starting from the highest dimension, $G_{in}$ is compressed by replacing each matching edge with a node having the dimension name. If the replaced edge connected nodes $v_i$ and $v_j$, then all other edges incident on $v_i$ or $v_j$ are now connected to the new node. Once it is no longer possible to compress the graph further, the topmost dimension is discarded and the process is started again. From each of the compressed graphs thus created, all nodes at lower abstraction levels are also discarded. Now each node in the compressed graph would be at the same (higher) abstraction level.

Figure 5: Vectorization Example. Panels (a), (b) and (c) show compressed graphs with nodes labeled CH, CC1, CC2 and CO2, alongside the vector C:H = 6, C:C1 = 4, C:C2 = 3, C:O2 = 1.

Figure 5 shows three compressed graphs $G^1_{out}$, $G^2_{out}$ and $G^3_{out}$ that are created after vectorization of $G_{in}$. The vectorization created 4 dimensions: C:H, C:C1, C:C2 and C:O2; and $G_{in}$ had respectively 6, 4, 3 and 1 points in the four dimensions. For compression, the dimensions are placed in decreasing order and compressions are progressively applied from the top. First $G_{in}$ is compressed using C:H. Once this is done, it is no longer possible to apply any other compression factor, and the resulting graph $G^1_{out}$ is shown in Figure 5(a). C:H is then discarded and the compression process is repeated to obtain the next compressed graph.
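The compression passes can be sketched as repeated edge contraction. The code below is one possible reading of the procedure, not the authors' implementation: an edge is contracted only while both of its endpoints are still uncontracted level-$l$ nodes, and the result depends on the order in which edges are listed. All function names and the edge list are hypothetical; the edge list is chosen only so that its counts agree with Figure 4(b).

```python
def dimension_name(type_a, type_b, edge_type=""):
    """Unordered node-type pair plus edge type, e.g. ('H', 'C', '') -> 'C:H'."""
    first, second = sorted((type_a, type_b))
    return f"{first}:{second}{edge_type}"


def compress(nodes, edges, dimensions):
    """One compression pass over dimensions given in decreasing order.

    nodes: dict node id -> node type; edges: list of (u, v, edge_type).
    Returns (new_nodes, new_edges) at the next abstraction level.
    """
    merged = {}     # old node id -> id of the node that replaced its edge
    new_nodes = {}  # new node id -> new node type (dimension name without ':')
    for dim in dimensions:
        for u, v, edge_type in edges:
            if u in merged or v in merged:
                continue  # an endpoint was already absorbed by another edge
            if dimension_name(nodes[u], nodes[v], edge_type) == dim:
                label = dim.replace(":", "")
                new_id = f"{label}_{len(new_nodes)}"
                new_nodes[new_id] = label
                merged[u] = merged[v] = new_id
    # Re-attach the remaining edges to the new nodes. Edges that touch an
    # uncontracted node are dropped, which discards the lower-level nodes.
    new_edges = set()
    for u, v, edge_type in edges:
        a, b = merged.get(u), merged.get(v)
        if a is not None and b is not None and a != b:
            new_edges.add((min(a, b), max(a, b), edge_type))
    return new_nodes, sorted(new_edges)


def compression_passes(nodes, edges, ordered_dimensions):
    """Run compress() once per pass, discarding the topmost dimension each time."""
    return [compress(nodes, edges, ordered_dimensions[k:])
            for k in range(len(ordered_dimensions))]


# Hypothetical molecule consistent with the counts in Figure 4(b).
nodes = {f"c{i}": "C" for i in range(7)}
nodes.update({f"h{i}": "H" for i in range(1, 7)})
nodes["o"] = "O"
edges = [
    ("c0", "c1", "2"), ("c1", "c2", "1"), ("c2", "c3", "2"),
    ("c3", "c4", "1"), ("c4", "c5", "2"), ("c5", "c0", "1"),
    ("c0", "c6", "1"), ("c6", "o", "2"), ("c6", "h6", ""),
] + [(f"c{i}", f"h{i}", "") for i in range(1, 6)]
passes = compression_passes(nodes, edges, ["C:H", "C:C1", "C:C2", "C:O2"])
```

Under this reading the first three passes produce graphs made of CH nodes, of CC1 and CO2 nodes, and of CC2 and CO2 nodes respectively, while the last pass leaves a single CO2 node, the kind of result the text treats as uninteresting.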

Compression creates graphs at a higher abstraction level $l+1$. The different graphs that are obtained depict different structural views of the original graph.

Each compressed graph is then passed again through the vectorization process to create dimensions at level $l+1$. These dimensions are created in different vector spaces at this level. Thus, if $G_{in}$ was parsed at level 0 and created vector space $VS_0$, the three compressed graphs create vector spaces $VS_1$, $VS_2$ and $VS_3$ at level 1. The database administrator may set a limit $k$ on the maximum number of vector spaces at any abstraction level. If the number of compressed graphs is more than $k$, then some of the "uninteresting" graphs are discarded. The uninterestingness may be defined based on properties like lack of connectedness, too few nodes, etc. In the earlier example, compressing $G_{in}$ by the dimension C:O2 results in a graph with a single node. Such a graph is an example of an uninteresting graph.

For vectorizing subsequent sample graphs, the procedure is slightly different. It is described by the following steps:

For any abstraction level $l$, do the following:

1. Read input sample graph $G_{in}$.

2. Identify a set of dimensions by pairs of node types and edge types.

3. Create a vector that describes $G_{in}$ based on the identified dimensions.

4. Add the set of dimensions and $G_{in}$ to any existing vector space $VS_j$ at level $l$, according to the following criteria:

(a) $G_{in}$ is not already available in $VS_j$, and

(b) The overlap between the dimensions identified in $G_{in}$ and in $VS_j$ is maximal.

5. Order the dimensions in decreasing order of their components in $G_{in}$.

6. Compress $G_{in}$ by these different dimensions to create a set of $m$ output graphs $G^1_{out}$ to $G^m_{out}$. Discard all nodes in the compressed graphs that belong to level $l$.

7. Repeat from step 1 for level $l+1$, using each of the compressed graphs $G^1_{out}$ to $G^m_{out}$ as input graph $G_{in}$.

Subsequent sample graphs are fitted into existing vector spaces wherever possible. New vector spaces are created only if the sample graph generates all new dimensions not encountered thus far. Hence the order in which sample graphs are presented is also important. Sample graphs should not only be representative of the different kinds of member graphs, but they also have to be presented in an order such that changes in structure occur gradually. This would ensure a best fit of dimensions into vector spaces.

Figure 6: Vectorization Example. (a) Second sample graph, with C, H and O atoms. (b) Its vector: C:H = 8, C:C1 = 2, C:O1 = 2, C:C3 = 1.

Figure 7: Vectorization Example. Panels (a) and (b) show the compressed graphs of the second sample graph, with nodes labeled CH, CC1 and CO1.

For example, consider that the graph in Figure 6(a) is the second sample graph at level 0. In the first pass, the following dimensions are identified: C:H, C:C1, C:O1 and C:C3. At level 0, presently there is only one vector space $VS_0$. $VS_0$ matches the new graph in two dimensions, C:H and C:C1. This graph is then placed into $VS_0$ and the dimensionality of $VS_0$ is increased to include the new dimensions discovered.

When the graph is compressed, two interesting graphs result. This is shown in Figure 7. The first compressed graph yields a dimension CH:CH which is in common with a dimension in $VS_1$ at level 1. Hence this graph is placed in $VS_1$. The second compressed graph yields dimensions CC1:CO1 and CO1:CC13. These do not match with dimensions of any existing vector spaces at level 1. Hence a new vector space $VS_4$ is created.
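The fitting rule for subsequent sample graphs (step 4) can be sketched as a maximal-overlap choice among the vector spaces of one level. This is a simplified illustration: `place_sample` is a hypothetical name, each vector space is reduced to its set of dimension names, the caller supplies the name of a new space, and criterion (a) about graphs already present in the space is not modelled.

```python
def place_sample(sample_dimensions, level_spaces, new_name):
    """Fit a sample graph's dimensions into the best matching space of a level.

    level_spaces: dict space name -> set of dimension names (modified in place).
    Returns the name of the space that received the sample.
    """
    sample_dimensions = set(sample_dimensions)
    best = max(level_spaces,
               key=lambda name: len(level_spaces[name] & sample_dimensions),
               default=None)
    if best is None or not level_spaces[best] & sample_dimensions:
        # No shared dimension anywhere at this level: start a new vector space.
        best = new_name
        level_spaces[best] = set()
    # The chosen space grows to include any newly discovered dimensions.
    level_spaces[best] |= sample_dimensions
    return best


level0 = {"VS0": {"C:H", "C:C1", "C:C2", "C:O2"}}
chosen = place_sample({"C:H", "C:C1", "C:O1", "C:C3"}, level0, "VS_new")
```

With the dimensions of the two sample graphs from the running example, the second graph lands in VS0, whose dimension set then grows to six entries.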

Two aspects of the above procedure need more motivation:

Creation of multiple vector spaces: The reason for creating multiple vector spaces is to make each vector space as dense as possible. This aspect is revisited in more detail further down in this section. If the set of all dimensions identified from different views at an abstraction level were combined into a single vector space, the following issues result:

1. The tuples that describe member graphs would likely have a lot of zeroes, especially if the views differ substantially from each other.

2. Since multiple views are stored in the same space, a member graph may occur in multiple points in the same space. This prevents the graph from being uniquely identified by its tuple.

3. Multiple vector spaces in an abstraction level can be searched in parallel, which is not possible if vector spaces are combined.

Compression of graphs: In the present compression scheme, compression factors are applied in descending order. Once a compressed graph is created, the topmost compression factor is discarded. However, it is evident that different permutations of the compression factors might result in a greater number of compressed graphs. We do not adopt such an approach simply to reduce the number of compression passes. If permutations of compression factors are applied, the number of passes increases factorially with the number of compression factors. If it is guaranteed that the number of compression factors will be small (typically less than 5), then the permutation based approach may be considered.

Operational phase: The operational phase involves insertion and deletion of graphs in addition to query. Deletion is straightforward; however, insertion is more involved. The process for inserting a graph $H_{in}$ during the operational phase is described below.

For any abstraction level $l$, do the following:

1. Read input graph $H_{in}$.

2. Identify dimensions of $H_{in}$ using the procedures described earlier.

3. Identify the vector space $VS_i$ at level $l$ that has a maximal match with the identified dimensions.

4. Generate a vector for $H_{in}$ based on all the dimensions of $VS_i$, ignoring the rest of the dimensions in $H_{in}$.

5. Compress $H_{in}$ using the procedures described above.

6. Repeat the process for level $l+1$ for each of the compressed graphs.
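Steps 3 and 4 of the insertion procedure amount to picking a vector space and projecting the member graph onto its fixed dimensions. A minimal sketch follows; `insert_member` is a hypothetical name, each vector space is reduced to an ordered list of dimension names, and the member vector is an invented example rather than data from the paper.

```python
def insert_member(vector, level_spaces):
    """Map a member graph's dimension counts to a point in one vector space.

    vector: dict dimension name -> count for the member graph.
    level_spaces: dict space name -> ordered list of dimension names.
    Returns (space name, point), or (None, None) if nothing matches.
    """
    best, best_overlap = None, 0
    for name, dims in level_spaces.items():
        overlap = len(set(dims) & set(vector))
        if overlap > best_overlap:
            best, best_overlap = name, overlap
    if best is None:
        return None, None
    # Dimensions of the graph that the space does not know are silently ignored;
    # dimensions of the space that the graph lacks get a zero component.
    point = tuple(vector.get(dim, 0) for dim in level_spaces[best])
    return best, point


spaces = {"VS0": ["C:H", "C:C1", "C:C2", "C:C3", "C:O1", "C:O2"]}
member = {"C:H": 3, "C:N1": 1, "H:N": 2}  # invented graph with unseen dimensions
space_name, point = insert_member(member, spaces)
```

Here the two dimensions involving nitrogen are dropped, which illustrates why the sample graphs of the initialization phase need to cover every feature that should be indexed.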

4.3 Density of vector spaces

The rationale for generating multiple vector spaces lies in the requirement for dense vector spaces. A vector space is said to be dense if all points in the vector space are equally probable. If some points in the vector space cannot exist because of design flaws or due to application domain peculiarities, the vector space would not be dense. Vector space density is addressed in GRACE based on the following axiom:

Lemma: A vector space formed by pairwise combinations of nodes is dense if its dimensions are not correlated.

If there is a correlation between any two dimensions, the space would be skewed. For example, in chemical compounds, dimensions C:H and H:C are correlated because the graph is undirected. A vector space formed by C:H and H:C will contain points only on a single line in the space. All other points in the space will never be filled. Hence, correlations between dimensions should be avoided.

Two kinds of correlations between dimensions occur: direct and indirect. They are explained below:

Direct: Dimensions in a vector space represent substructures. If for any two dimensions $v_i$ and $v_j$, there exists a homomorphism from $v_i$ to $v_j$ or from $v_j$ to $v_i$, the dimensions are said to be correlated.

Indirect: There may be indirect correlations between dimensions based on the application domain. For example, if it is known that a chemical compound has at least one C-C bond, it can be inferred that it has no more than three C-H bonds. Thus if C:H and C:C1 were dimensions of the vector space, the point $\langle 4, 1 \rangle$ cannot exist.

In GRACE, only direct correlations are addressed. Indirect correlations are specific to nuances of the application domain, and so are not addressed. The following lemmas illustrate how direct correlations are avoided in GRACE.

Lemma: For any vector space in GRACE, no two dimensions are isomorphic to each other.

Proof: Dimensions are created by node pairs which have the following properties: (a) node pairs are unordered, so that C:H and H:C are mapped onto the same dimension; and (b) node pairs of different dimensions are distinct.

Lemma: For every vector space in GRACE, no dimension is wholly contained within another dimension.

Proof: This can be proved by induction. In level 0, dimensions are formed by distinct pairs of node types. Hence no dimension is wholly contained within another. Suppose that the lemma holds at level $i$. At abstraction level $i+1$, nodes of level $i$ are discarded. Nodes of level $i+1$ contain exactly two level $i$ nodes and are formed out of distinct node pairs. Hence no dimension at level $i+1$ is contained wholly within another.

5 Query Resolution and Retrieval

Query resolution in GRACE is performed by mapping a query onto regions in one or more vector spaces. Query results are computed by a ranked union of all the points lying in the query region in all vector spaces. Query resolution and retrieval hence involves two issues: mapping a query onto regions, and ranking query results.

The query itself is in the form of a graph. Query results should be the set of all member graphs which contain the query graph as a substructure. While GRACE addresses retrieval based on both structure and attributes, this section addresses only structure based retrieval.

5.1 Mapping queries onto regions

The process of mapping a query onto regions is very similar to the operational phase process of mapping member graphs onto points in vector spaces. This is explained in more detail below.

For every abstraction level $l$, do the following:

1. Read the input query graph $Q_{in}$.

2. Identify a set of dimensions by pairs of node types and edge types.

3. Identify enabled vector spaces $VS_i, VS_j, \ldots$ at level $l$ which have at least one dimension in common with that of $Q_{in}$.

4. Create regions in $VS_i, VS_j, \ldots$ that describe $Q_{in}$.

5. Order the dimensions of $Q_{in}$ in decreasing order of their occurrence.

6. Compress $Q_{in}$ by these different dimensions to create a set of $m$ output graphs $Q^1_{out}$ to $Q^m_{out}$. Discard all nodes in the compressed graphs that belong to level $l$.

7. Repeat from step 1 for level $l+1$, using each of the compressed graphs $Q^1_{out}$ to $Q^m_{out}$ as input graph $Q_{in}$.

Figure 8: Query Example. (a) Query graph with two C and two H atoms. (b) Its region: C:H [2,*], C:C2 [1,*].

Step 4 in the above process needs more explanation. During query time, the vectorization process is used to generate a region rather than a point.

Creating regions is based on a concept called window size $w$. Window size is a parameter that defines the width of a region along any dimension. If vectorizing a query graph results in point $p$ in dimension C, then the query is defined by the closed interval $[p, p+w]$ along dimension C. The query region is the intersection of all its interval definitions along all dimensions.

LetVS

i

beoneof theidentied vetorspaes for

query graph Q

in

. For VS

i

to be hosen, it has to

satisfythefollowingriteria: VS i

should beinstate enabled andithasto haveatleast onedimensionin

ommonwiththeidentieddimensionsofQ

in

. Eah

vetorspae anbe in one oftwostates: enabled or

disabled. Queries an be mapped to vetor spaes

onlyiftheyareenabled.

The set of all common dimensions between $Q_{in}$ and $VS_i$ is given by $Q_{in} \cap VS_i$. This has to be of non-zero cardinality. The set of all dimensions in the query but not in the vector space is given by $Q_{in} - VS_i$. This set is silently ignored. The set of all dimensions present in the vector space but not in the query is given by $VS_i - Q_{in}$. For all elements in this set, $Q_{in}$ is assigned a closed interval $[0, w]$.
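The three cases above (common dimensions, query-only dimensions, space-only dimensions) can be made concrete with a short sketch. The function name, the dictionary representation of a vectorized query, and the sample values are assumptions made for illustration only.

```python
import math


def build_region(query_point, space_dims, w):
    """Map a vectorized query onto a region of one vector space.

    query_point: dict dimension -> value obtained by vectorizing the query.
    space_dims:  ordered list of the dimensions of the vector space.
    w:           window size (math.inf for an unbounded window).
    Returns None when the space shares no dimension with the query.
    """
    if not set(query_point) & set(space_dims):
        return None  # no common dimension: the space is not chosen
    region = {}
    for dim in space_dims:
        # Dimensions only in the space start at 0; dimensions only in the
        # query never appear in space_dims and are therefore ignored.
        low = query_point.get(dim, 0)
        region[dim] = (low, low + w)
    return region


# Hypothetical query with values on A, C and D, mapped onto a space with
# dimensions A, B and C using window size 10. D is ignored, B gets [0, w].
region = build_region({"A": 1, "C": 3, "D": 7}, ["A", "B", "C"], 10)
print(region)
```

With an unbounded window (`math.inf`), every interval simply becomes a lower bound, which corresponds to the `[2,*]` style of interval in Figure 8(b).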

Example: Consider a graph space consisting of two levels and a total of four vector spaces as generated in Figures 4, 5, 6 and 7. Now consider a query graph as shown in Figure 8(a). Let the window size be unbounded, written $*$ as in Figure 8(b). The query then has the dimensions C:H and C:C2. At level 1, $VS_0$ can be identified for the query, since $VS_0$ has the following dimensions: C:H, C:C1, C:C2, C:C3, C:O1 and C:O2. The query tuple would thus be $\langle [2,*], [0,*], [1,*], [0,*], [0,*], [0,*] \rangle$.

When the query is compressed, we get one interesting graph CH--CH [type=2]. Parsing this graph gives a dimension CH:CH1 which occurs in vector space $VS_1$ at level 1. $VS_1$ has two dimensions: CH:CH1 and CH:CH2. The query could thus be mapped onto the region $\langle [1,*], [0,*] \rangle$ in $VS_1$.

5.2 Ranking query results

Query results are computed from a ranked union of all candidate graphs which have matched in all vector spaces. Ranking is based on the following criteria:

• A candidate graph matching at a higher level of abstraction gets a higher ranking than one matching at only lower levels of abstraction. (Note that if a graph maps to a vector space at level $l$, it has to map to at least one vector space at all levels below $l$. Hence, this criterion can also be stated as: a candidate graph that matches at more levels gets a higher ranking than one that matches at fewer levels.)

• A candidate graph matching in many vector spaces at the same level is given a higher ranking than one which matches in only a few vector spaces.

Based on the above criteria, the rank of a candidate graph $C$ is computed as: $\mathrm{rank}(C) = \sum_{i=1}^{l} i \, w_i$, where $l$ is the number of abstraction levels, and $w_i$ is the number of different vector spaces at level $i$ where the candidate graph lies within the query region.
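A small sketch of the rank formula may help to see how the two criteria interact. The candidate names and match counts below are invented for illustration; only the formula itself comes from the text.

```python
def rank(matches_per_level):
    """rank(C) = sum over levels i of i * w_i.

    matches_per_level[i - 1] is w_i: the number of vector spaces at
    level i in which the candidate lies inside the query region.
    """
    return sum(i * w_i for i, w_i in enumerate(matches_per_level, start=1))


# Hypothetical candidates over two abstraction levels.
candidates = {
    "g1": [1, 1],  # one space at level 1, one at level 2: 1*1 + 2*1
    "g2": [2, 0],  # two spaces at level 1 only:           1*2 + 2*0
    "g3": [1, 0],  # one space at level 1 only:            1*1 + 2*0
}
ranked = sorted(candidates, key=lambda g: rank(candidates[g]), reverse=True)
print(ranked)
```

Because each match is weighted by its level, `g1` outranks `g2` even though both match in two vector spaces, and `g2` outranks `g3` because it matches in more spaces at the same level.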

5.3 Space and time complexities

Number of abstraction levels: Vector spaces at any abstraction level are created by pairwise combination of edges from the lower level. The vector spaces in the lowest abstraction level are made of the edges of member graphs. Let $n$ be the number of distinct node types from all the sample graphs, and $\epsilon$ the total number of distinct edge types. The number of possible dimensions is $O(n^2 \epsilon)$. If $\epsilon$ is far smaller than, and unrelated to, $n$, the space complexity of $VS_0$ is $O(n^2)$. Since abstraction levels are formed by combining pairs of dimensions, the number of abstraction levels is $O(\lceil \log(n^2) \rceil) = O(\lceil \log n \rceil)$.

Number of vector spaces at an abstraction level: Let $n$ be the number of distinct node types from all the sample graphs and $\epsilon$ the number of distinct edge types. In the worst case, the sample graphs are presented in such a way that every sample graph gets placed into a different vector space and each vector space has just one dimension. This way there can be $n^2 \epsilon = O(n^2)$ vector spaces. For our purposes, however, we shall use $k$ as the maximum number of vector spaces at a level, where $k$ is the limit on the number of vector spaces set by the database administrator.

Complexity of the mapping process: Mapping member graphs onto points and mapping query graphs onto regions involve algorithms of the same complexity.

Let the number of nodes in the input member or query graph be $v$. The graph would contain $e = O(v^2)$ edges. Vectorizing the input graph requires a scan of all the edges. This is of complexity $O(e)$. Compression of the input graph involves combining edges (dimensions) pairwise. The number of times compression can be done is $O(\lceil \log(e) \rceil)$. After each compression, there may be a maximum of $k$ compressed graphs. Hence, the complexity of mapping is $O(e \, k \, \lceil \log(e) \rceil)$.

The complexity of retrieval is the sum of the complexities of mapping the query graph onto regions, and of computing the set of all points that lie in these regions. The complexity of the latter depends on the indexing mechanism used by the database to index regions in vector spaces.

6 Prototype Implementation

A proof-of-concept prototype for GRACE is currently under development. Ideally, implementation would be based on a multi-dimensional database using

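Figure 9: Implementation Architecture. The client talks to the GRACE server, which sits in front of a MySQL database holding the attribute space table AS (columns a1, a2, a3, f) and the vector space tables VS0, VS1, VS2, ..., VSn (dimension columns d1, d2, ... and a file column f).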

However, for the prototype, we have used a layered architecture using a MySQL database with MyISAM indexing. Figure 9 depicts the layered architecture. The user interacts with GRACE using a client, which in turn interacts with the GRACE server. The GRACE server mediates between the client and a MySQL DBMS storing a graph database. A graphviz description sent by a client is converted by the GRACE server into a set of tuples. One of these tuples goes into the attribute space. The rest go into different vector spaces in the structure space.

Each vector space is modeled as a table in the MySQL database. The column names of each table are the dimensions (d1, d2, ..., dn) of the vector space. The last column (f) of each table is the name of the file that maintains the graph structure and description.

As mentioned in Section 4, GRACE has two phases of activity: initialization phase and operational phase.

In the initialization phase, the GRACE server creates a new database on MySQL. It also creates a set of tables corresponding to an attribute space and the set of all vector spaces for structure. Once the dimensions of each of these vector spaces are known, the corresponding columns are created in these tables by issuing SQL CREATE commands.
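The table layout described above can be sketched as follows. This is a minimal illustration, not the prototype's code: it uses Python's built-in `sqlite3` module as a stand-in for MySQL so that it runs without a server, and the helper name, column types, and the file name `graph_001.dot` are assumptions.

```python
import sqlite3


def create_space_table(conn, table, dimensions):
    """Create one table per vector space: integer dimension columns plus f."""
    # Identifiers cannot be bound as SQL parameters, so they are quoted here;
    # the names are assumed to come from the server, not from user input.
    columns = ", ".join(f'"{d}" INTEGER NOT NULL DEFAULT 0' for d in dimensions)
    conn.execute(f'CREATE TABLE "{table}" ({columns}, "f" TEXT NOT NULL)')


conn = sqlite3.connect(":memory:")
create_space_table(conn, "VS0", ["A", "B", "C"])

# Insert one member graph as a point, together with the name of the file
# that stores its structure and description.
conn.execute('INSERT INTO "VS0" ("A", "B", "C", "f") VALUES (?, ?, ?, ?)',
             (2, 0, 5, "graph_001.dot"))
conn.commit()
```

Defaulting each dimension to 0 is a design choice of this sketch: a member graph that has no edges along a dimension then naturally sits at the origin of that axis.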

The operational phase involves converting client requests into SQL INSERT, DELETE and SELECT commands. Insertion and deletion are straightforward. For queries, SELECT statements are constructed based on the window size parameter. A query of the form $\langle [1,11], [0,10], [3,13] \rangle$ on a vector space $VS_0$ having dimensions A, B and C is converted to the following SQL statement:

SELECT f FROM VS0 WHERE A BETWEEN 1 AND 11

AND B BETWEEN 0 AND 10

AND C BETWEEN 3 AND 13;
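One way such a statement could be generated from a region is sketched below. The function name and the dictionary representation of a region are hypothetical. The `?` placeholders follow the style of Python's `sqlite3` module; MySQL drivers typically use a different placeholder style, so this is an assumption of the sketch rather than a description of the prototype.

```python
def region_to_select(table, region):
    """Build a parameterized SELECT for a region.

    region: dict dimension -> (low, high) closed interval.
    Returns (sql, params); table and dimension names are assumed trusted.
    """
    if not region:
        raise ValueError("a region needs at least one dimension")
    clauses = [f"{dim} BETWEEN ? AND ?" for dim in region]
    # Flatten the intervals in the same order as the clauses.
    params = [bound for interval in region.values() for bound in interval]
    sql = f"SELECT f FROM {table} WHERE " + " AND ".join(clauses)
    return sql, params


sql, params = region_to_select(
    "VS0", {"A": (1, 11), "B": (0, 10), "C": (3, 13)})
print(sql)
print(params)
```

Keeping the interval bounds as bound parameters rather than formatting them into the string leaves value quoting to the database driver.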

Based on the query results, the GRACE server applies the ranking algorithm and returns appropriate results to the client.

7 Conclusions

Querying structural data is finding increasing application in different fields. In many application areas, the requirement is for fast retrieval of structural data, rather than intricate structure matching. GRACE is designed to address such needs. This contrast is along similar lines to the contrast between database search and information retrieval.

Structural retrieval needs to be augmented with calibration models which compute values of precision and recall. This enables comparison of different structural retrieval methods.

A number of unaddressed issues in GRACE are planned to be taken up in the foreseeable future. Some of these include the following: a mechanism for a user to manually create vector spaces by specifying their dimensions, a mechanism for checking correlations among dimensions in such vector spaces, and advanced query mechanisms that can support expressions over structure.

