Relevant Topic list:
Special Purpose Database -- Multidimensional databases
Access Methods
Advanced Search, Query and Approximation
A Flexible Retrieval Mechanism for Structural Data Using Multiple
Vector Spaces
Srinath Srinivasa, Sumit Acharya, Rajat Khare, Himanshu Agrawal
Indian Institute of Information Technology,
International Technology Park,
Whitefield Road,
Bangalore 560066
India
Primary contact: Srinath Srinivasa
Email: [email protected]
Tel: +91-80-8410627
Fax: +91-80-8410636
Paper-id: 316
Topic area: Core database technology
Category: Research
sri@iiitb.ac.in
May 7, 2002
Abstract
This paper presents GRACE, a graph database system based on the concept of a graph space. Graph databases store a set of member graphs which are retrieved based on an input query graph. The retrieval may be based on a set of (non-structural) attributes, or on the graph structure. GRACE provides retrieval mechanisms based on both attributes and structure. This is performed by vectorizing member graphs onto points in hyperspace. The query graph is mapped onto a region, and the query results are the points lying in that region. For vectorizing graph structures, GRACE incorporates a concept of multiple vector spaces. These depict the same graph from different views and/or hierarchical levels. A member graph may be mapped onto unique points in different vector spaces. The vector spaces are generated automatically during the initialization phase. At any point in time, more vector spaces may be added to the system, or existing ones deleted. During a retrieval process, vector spaces can also be enabled or disabled to respectively prefer accuracy over speed of retrieval, or vice versa.
Keywords: Graph database, Subgraph isomorphism, Vector spaces, Graph vectorization, Information retrieval.
1 Introduction
A number of database applications need to store and retrieve structural information, usually in the form of undirected graphs. A collection of such structural data is called a graph database.
Graph databases are used in different application domains. Examples include databases of the following: protein structures in biological applications, electrical circuits in CAD applications, images in image retrieval applications, and citations or hyperlink structures in digital library applications. For the most part, each of the above and related communities has independently pursued the design and maintenance of structural data. Some example graph databases include: ASES by Shasha et al. [19] and SUBDUE by Cook et al. [5] for molecular modeling; Messmer and Bunke's database [17] for CAD and image retrieval applications; and Petrakis and Faloutsos's [18] approach for medical imaging databases. In this work, our endeavor is to address graph databases as a concept in itself, with generic retrieval algorithms that can be customized to any of the actual application domains.
The main challenge in graph databases is retrieval of graphs based on structure. Matching structures is an instance of the subgraph isomorphism problem [...] candidate member graph to determine whether the query structure occurs in the member. In graph databases, this problem is compounded by the fact that a database contains a large number of member graphs against which structure matching should be performed.
Hence, issues in graph databases are twofold: (a) retrieving a set of candidate graphs from the set of all member graphs, and (b) matching the query graph against the candidate graphs. This paper concentrates on the first step. For retrieving candidate graphs, currently there are two main approaches: (a) index based approaches, and (b) vector based approaches. Index based approaches maintain a hierarchical index of member graphs, which is traversed in response to a query. During the traversal, a distance metric between the query graph and the current index element is calculated. When the distance metric falls below a set threshold, the current index element is used to retrieve candidate graphs. Examples of index based approaches include [1, 17] and [19]. Vector based approaches consider member graphs as a vector of features, and transform each graph onto a feature space. An example of a vector based approach is by Petrakis and Faloutsos [18] for medical imaging databases. Usually, vectorization is performed on attributes of the graph. In this process, the structural properties that show how graph attributes are interlinked get neglected (cf. [1]).
Vectorizing structure is more complicated than vectorizing a set of features. Structure has at least two properties which are important for vectorization: views and abstraction levels. Structure may be described by different views, each of which is vastly different from the others even though they describe the same entity. Secondly, structure may be described from different abstraction levels, with lower levels possibly (but not necessarily) being contained within higher levels. This makes it difficult to codify structural properties within a single set of dimensions.
This paper proposes a retrieval mechanism that is based on vectorization of structure. Starting from a few sample graphs, a set of structural properties are extracted from different views and abstraction levels. [...] spaces are created in which graphs appear as vectors.
The approach presented here is being designed under a graph database project called GRACE, which supports storage and retrieval of undirected graphs. GRACE stands for graph space, which is an abstract space that describes the graph database. The graph space contains one or more vector spaces¹. Attributes and structural properties of member graphs appear as unique vectors in these vector spaces. A query is mapped onto regions in one or more vector spaces, and query results are all the points in the region.
Retrieval by vectorization is fast, but inexact. The accuracy of the retrieval process is limited by the number of vector spaces and their dimensions. Usually, more accuracy requires the addition of more vector spaces and/or dimensions. In GRACE, vector spaces can be incrementally added (or enabled) to increase accuracy, or deleted (or disabled) to increase speed of retrieval.
2 Literature Survey
Structure matching of graphs is a problem that has been addressed at least since the 1970s. Some examples are Corneil and Gottlieb [6] in 1970, Berztiss [2] in 1973, and Ullmann [20] in 1976.
A graph database involves more than structure matching. There are a number of member graphs which have to be compared against an input query graph. The complexity here is twofold: searching for candidate member graphs to match; and matching the query against each of the candidate graphs. The SUBDUE graph database [5, 13] uses a hierarchical index of substructures to identify candidate graphs. Retrieval of candidate graphs is based on a concept of "best compression," that identifies a set of substructures which compress the query graph in the best possible way. Messmer and Bunke [17] use a decision tree based indexing of member graphs for a graph database. A query graph traverses through the decision tree till it finds a member graph which satisfies the query. While the tree size is invariant with respect to the number of member graphs, it grows exponentially with the number of vertices in member graphs.
¹ As would be apparent in later sections, a "vector space" of structure is not a vector space in the mathematical sense of the term. It is simply a collection of ordered tuples which characterize different structural properties. The vector space does not support vector arithmetic operations like distance calculation and addition of vectors.
Shasha et al. [19] address inexact or approximate searches on graphs. They present a system called Approximate-Tree-by-Example (ATBE) which allows inexact matching of tree graphs. The underlying concept is based on a simplification of edit-distance. Edit-distance between two graphs is the minimum number of label or node transformations required to convert one graph to another. Since calculation of edit-distance is NP-hard, a simplification called 2-degree distance is introduced which considers edits only until a distance of 2 from the node under comparison. Luo and Hancock [15] propose structural graph matching using an expectation maximization (EM) algorithm. Their method is structure based and relies only on the edge or connectivity structure of the graph. Berretti et al. [1] propose a method to index graph models for content-based retrieval. The index mechanism is based on mutual distances between member graphs. The Available Chemicals Database (ACD) of Daylight CIS Inc. [7] stores a database of chemical compound structures. This is done by labeling each node in a chemical compound with a set of spatial coordinates and then indexing them according to the coordinates.
Graphs and structure matching have been addressed in other contexts as well. The G+ language proposed by Mendelzon and Wood [16] searches for paths in a relational database based on an input regular expression. XPath from W3C [4] proposes a language comprising expressions that describe tree structures.
Various representation mechanisms exist for storing a database of graphs. One of the widely used representations for storing image data is Attributed Relational Graphs (ARGs). Other models include GRAS (Graph Oriented Database Systems) [14] and Graphviz [10]. Deo [9] illustrates many more application areas and storage techniques for graph data.
In GRACE, the design criterion is oriented towards storage and retrieval of undirected graphs. Only logical structures of graphs are stored. The physical rendering of member graphs would depend on which visualization technique is used. The vectorization process in GRACE is similar to feature based vectorization (for example, as proposed by [18]). But GRACE also vectorizes graph structures from various views and abstraction levels. For obtaining different abstraction levels, a compression based technique similar to SUBDUE [13] is used. However, GRACE does not have a concept of best compression, and all compressed graphs may be utilized for vectorization.
[Figure 1: The Dominance Continuum. Graph databases range from attribute dominance (mostly uniform structures, many node attributes; example: MRI scan database), through comparable diversity in structure and attributes (example: database of electrical circuits), to structure dominance (vastly diverse structures, few node attributes; example: database of chemical compounds).]
3 Dominance of Structure and Attributes
If a graph database has to be designed in a generic fashion, we need to identify peculiarities of its many application domains. The peculiarities are summarized by what may be called a dominance continuum, as shown in Figure 1.
Attribute Dominance: Some application areas require graph databases with attribute dominance. This means that the member graphs are mostly similar with respect to structure, with slight variations if any. But the number of possible node and edge attributes for each graph would be large. The retrieval of such graphs would involve matching different combinations of node and edge attributes over substructures. For example, a medical database of MRI scans of the human face would contain graphs which are more or less similar. Each graph would contain many "labeled" or "expected" objects like "eyes", "nose", etc. Some graphs may contain slight [...]
Structural Dominance: On the other end of the continuum are applications whose databases have structural dominance. In these databases, variations in structure are greater than variations in attributes. There are likely to be a large number of different graph structures, with relatively few attributes. An example is a database of organic chemical compounds used in chemical industries. Such a database would contain a large number of compounds, each having vastly different structural properties. However, they are made up of relatively few types of elements like carbon, hydrogen, oxygen, etc.
There may be databases that fall somewhere midway in the continuum². An example is a database of electrical circuits. Circuits are stored in the form of graphs where nodes are electrical components and edges are connections between components. Here, the database would consist of a large number of graph structures as well as a substantial number of node attributes. Both attributes and structure are equally important in the retrieval process.
The position of a graph database in the dominance continuum would determine the overall strategy for indexing and retrieval. In existing literature, attribute dominated graphs are addressed by Petrakis and Faloutsos [18]. They consider a medical image graph to consist of a fixed number of "expected" attributes and vectorize the graph based on these attributes. Here, graph structures are considered to be largely similar with occasional slight variations. Structure domination is addressed in [20, 19, 17, 6, 5]. Most of these approaches are based either on traversal algorithms or on concepts of edit distances.
In GRACE, the design criterion is to address the whole continuum under a single umbrella. In order to support this, the database system has two kinds of spaces:
Structure space is a set of one or more vector spaces that describe structures of member [...] levels.
Attribute space is a single set of dimensions that vectorizes a finite set of attributes associated with graph nodes and edges.
² It is our conjecture that dominance is a continuum, in that the degree of dominance matters for the overall design. It could well be the case that dominance may be described by just three cases: attribute over structure, structure over attribute, and no dominance.
In attribute dominated applications, some approaches also apply structure matching after attribute based retrieval in order to increase accuracy. Examples are [12, 1]. Similarly, some approaches in molecular databases (which are structure dominated databases) augment structure matching with attribute matching [5, 19].
Based on the same principles, GRACE addresses dominance as follows:
For attribute dominated databases, first search in the attribute space. Then, using results of this search, perform a search in the structure space.
For databases with no dominance (where attribute and structure are equally important), first perform a search in the structure space. Based on the search results, search the attribute space.
For structure dominated databases, perform a search in the structure space. Depending on the richness of the attributes in the query, use the search results to search in the attribute space.
Determining the search strategy is a one-time process. This is encoded during installation, based on whether the application domain has attribute domination or structure domination.
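The three strategies above can be sketched as a small lookup that is fixed once at installation time. This is only an illustrative sketch: the names `Dominance` and `search_plan`, and the boolean flag `query_has_rich_attributes`, are assumptions introduced here and are not part of GRACE as described.

```python
from enum import Enum


class Dominance(Enum):
    """Position of the application domain on the dominance continuum."""
    ATTRIBUTE = "attribute"
    NONE = "none"
    STRUCTURE = "structure"


def search_plan(dominance, query_has_rich_attributes=True):
    """Return the ordered list of spaces to search (hypothetical helper)."""
    if dominance is Dominance.ATTRIBUTE:
        return ["attribute", "structure"]
    if dominance is Dominance.NONE:
        return ["structure", "attribute"]
    # Structure dominated: the attribute space is consulted only when the
    # query carries enough attributes to make the second search worthwhile.
    plan = ["structure"]
    if query_has_rich_attributes:
        plan.append("attribute")
    return plan
```

How "richness of the attributes" would be measured is left open in the text, so the flag above simply stands in for that decision.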
4 Graph Vectorization
4.1 The GRACE model
Let $G$ be a graph database consisting of a set of undirected graphs. The graph space of $G$, denoted by $GS(G)$, is the range onto which every $G_i \in G$ is mapped. The graph space $GS(G)$ is described by $GS(G) = \langle AS(G), SS(G) \rangle$, where $AS(G)$ is the attribute space and $SS(G)$ is the structure space of $G$.
[Figure 2: The GRACE Model. A member graph is mapped onto points, and a query graph onto regions, in the vector spaces of the structure space.]
$AS(G)$ is a set of attribute classes or dimensions. Each attribute class $A_i \in AS(G)$ is defined by a set of possible instances. The dimensionality of $AS(G)$ is the number of attribute classes it contains.
$SS(G)$ is a poset of $n$ vector spaces $VS_0$ to $VS_{n-1}$, where $n$ is at least 1, and a partial order $<$ among the elements of $SS(G)$. Each vector space $VS_i \in SS(G)$ is defined by a set of dimensions. The partial order $<$ on the vector spaces defines a lattice. For any two vector spaces $VS_i$ and $VS_j$, if $VS_i < VS_j$, then $VS_j$ is said to be at a higher level of abstraction than $VS_i$ in describing $GS(G)$. If neither $VS_i < VS_j$ nor $VS_j < VS_i$, then both $VS_i$ and $VS_j$ are said to be at the same level of abstraction in describing $GS(G)$.
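One possible in-memory shape for this model is sketched below. It assumes, purely for illustration, that the partial order is induced by an integer abstraction level stored on each vector space; the class and function names (`VectorSpace`, `GraphSpace`, `lower_than`, `same_level`) are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class VectorSpace:
    name: str
    level: int                                      # assumed integer abstraction level
    dimensions: list = field(default_factory=list)  # e.g. ["C:H", "C:C1"]
    enabled: bool = True


@dataclass
class GraphSpace:
    attribute_space: list = field(default_factory=list)  # AS(G): attribute classes
    structure_space: list = field(default_factory=list)  # SS(G): vector spaces


def lower_than(a, b):
    """a < b in the assumed order: b is at a higher abstraction level."""
    return a.level < b.level


def same_level(a, b):
    """Neither a < b nor b < a, so both describe the same abstraction level."""
    return not lower_than(a, b) and not lower_than(b, a)


vs0 = VectorSpace("VS0", 0, ["C:H", "C:C1", "C:C2", "C:O2"])
vs1 = VectorSpace("VS1", 1, ["CH:CH1", "CH:CH2"])
vs2 = VectorSpace("VS2", 1)
gs = GraphSpace(structure_space=[vs0, vs1, vs2])
```

A level-induced order is the simplest choice that satisfies the two statements above; a richer lattice would need an explicit relation between spaces.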
Figure 2 schematically depicts the GRACE model. It shows a structure space consisting of various vector spaces and the partial order among them. A member graph is mapped onto unique points in multiple vector spaces. A query graph is mapped onto regions.
For describing member graphs, GRACE uses a modified form of the graphviz [10] language. In graphviz, node attributes are declared in the form: n0 [name=value]; where n0 is the name of a node and [name=value] is the attribute associated with the node. Paths in the graph are described in the form: n0--n1--n2 [name=value]; where n0, n1 and n2 are nodes of the graph. The [name=value] describes edge attributes that are applicable to all the edges in the path.
    graph benzene {
        h [name=Hydrogen];
        c [name=Carbon];
        h_0--c_0;
        c_0--c_1 [type=2];
        h_1--c_1--c_2;
        h_2--c_2;
        c_2--c_3 [type=2];
        h_3--c_3--c_4;
        h_4--c_4;
        c_4--c_5 [type=2];
        h_5--c_5--c_0;
    }
Figure 3: Description of a benzene molecule
GRACE introduces two extensions to the above descriptions. A node is described in the form type_instance. This indicates that the node is of type type and the instance name of the node is instance. Similarly, a link is described by two kinds of attributes: a qualitative type attribute, and a quantitative weight attribute. These are declared as type=value and num=value directives in the edge attribute area. For example, C_0--C_1 [type=2]; indicates a double bond between two carbon atoms. An edge directive of the form author_Sumit--journal_JACM [num=3]; indicates a citation graph where an author named Sumit has 3 publications in a journal named JACM. The difference between type and num attributes lies in the way they are processed during vectorization. Figure 3 shows an example description of a benzene molecule.
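A path directive in this extended notation can be split into typed edges with a few lines of string handling. The sketch below is not the GRACE parser; `parse_directive` and `split_node` are hypothetical helpers, and it assumes that attributes are comma separated and that the type and instance name are joined by the first underscore.

```python
import re


def split_node(token):
    """Split 'type_instance' into (type, instance) at the first underscore."""
    node_type, _, instance = token.partition("_")
    return (node_type, instance)


def parse_directive(line):
    """Parse one path directive into a list of (node_a, node_b, attrs) edges."""
    line = line.strip().rstrip(";").strip()
    attrs = {}
    match = re.search(r"\[(.*)\]\s*$", line)
    if match:
        for item in match.group(1).split(","):
            if "=" in item:
                key, value = item.split("=", 1)
                attrs[key.strip()] = value.strip()
        line = line[:match.start()].strip()
    nodes = [split_node(token.strip()) for token in line.split("--")]
    # The attribute block applies to every edge along the path.
    return [(a, b, dict(attrs)) for a, b in zip(nodes, nodes[1:])]


double_bond = parse_directive("C_0--C_1 [type=2];")
citation = parse_directive("author_Sumit--journal_JACM [num=3];")
path = parse_directive("h_1--c_1--c_2;")
```

Attribute values are kept as strings here; converting `type` and `num` to numbers is left to the vectorization step, where the two are treated differently.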
4.2 Vectorizing member graphs
In order to create a new graph database, GRACE adopts a two-phase process. The phases are:
Initialization phase: In the initialization phase, GRACE creates $AS(G)$ and $SS(G)$. This involves determining the dimensions that make up each of the constituent vector spaces. The database administrator presents a set of sample graphs, which are a representative sample of the different kinds of member graphs. Vector space [...] sample graphs.
Operational phase: In the operational phase, member graphs may be added to (or deleted from) the database. When a new member graph is to be added, GRACE parses the member graph using the same parsing techniques as in the initialization phase. However, once the graph is parsed, any new feature that is found which is not present in any of the vector spaces is silently ignored. Hence, the representative sample chosen in the initialization phase is important. It has to contain at least one occurrence of all important features that have to be indexed.
In the remainder of this section, the initialization and operational phases are explained in detail. However, the explanations concentrate only on the formation of $SS(G)$. The formation of $AS(G)$ is not addressed in this paper.
Initialization phase: The initialization phase involves parsing the sample graphs provided by the database administrator. Based on these graphs, GRACE determines a set of dimensions that make up the different vector spaces in $SS(G)$. To explain the formation of $SS(G)$, we take a running example of a structure dominated application: that of storing structures of organic chemicals.
The vectorization process in the initialization phase is of two kinds: (a) vectorizing the first sample graph; and (b) vectorizing the subsequent sample graphs.
Vectorization of the first sample graph is described by the following steps:
At any abstraction level $l$, do the following:
1. Read input sample graph $G_{in}$.
2. Create a vector space $VS_i$ by identifying a set of dimensions by pairs of node types and edge types.
3. Create a vector that describes $G_{in}$ based on the identified dimensions.
4. Order the dimensions in decreasing order of their components in $G_{in}$.
5. Compress $G_{in}$ by these different dimensions to create a set of $m$ output graphs $G^1_{out}$ to $G^m_{out}$. Discard all nodes in the compressed graphs that belong to level $l$.
6. Repeat from step 1 for level $l+1$, using each of the compressed graphs $G^1_{out}$ to $G^m_{out}$ as input graph $G_{in}$.
[Figure 4: Vectorization Example. (a) A sample chemical compound graph made of C, H and O nodes. (b) Its vector: C:H = 6, C:C1 = 4, C:C2 = 3, C:O2 = 1.]
Steps 2 to 5 are described in more detail below. Consider a sample input graph $G_{in}$ of a chemical compound as shown in Figure 4(a). The vectorization process first derives a set of dimensions from $G_{in}$. This is done by combining the node types and edge types described in $G_{in}$. For example, a C=C double bond would be described as C_0--C_1 [type=2]; in graphviz. Using this, a dimension called C:C2 is derived. If a C-H bond is described by C_0--H_0; then a dimension called C:H is derived. Once a set of dimensions is derived, $G_{in}$ is vectorized along these dimensions by projecting the number of different edge occurrences onto the dimensions. Figure 4(b) shows the graph vector along the four identified dimensions. If an edge has a num attribute, it is used as a count of the number of occurrences of the edge. For example, a declaration of C_2--H [num=3]; describes 3 occurrences of a C-H bond.
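The counting step described above amounts to a histogram over canonical dimension names. The following is a minimal sketch under stated assumptions: `dimension_name` and `vectorize` are hypothetical helpers, edges are given as `(type_a, type_b, edge_type, num)` tuples, and the edge type is appended to the name only when one is declared (the text does not say how untyped edges are named in general).

```python
from collections import Counter


def dimension_name(type_a, type_b, edge_type=None):
    """Canonical name for an unordered node-type pair plus optional edge type."""
    a, b = sorted((type_a, type_b))  # unordered: H-C and C-H share a name
    suffix = "" if edge_type is None else str(edge_type)
    return f"{a}:{b}{suffix}"


def vectorize(edges):
    """Count edge occurrences per dimension; num acts as a multiplicity."""
    vector = Counter()
    for type_a, type_b, edge_type, num in edges:
        vector[dimension_name(type_a, type_b, edge_type)] += num
    return vector


# Benzene as in Figure 3: six C-H bonds, three single and three double C-C bonds.
benzene_edges = (
    [("H", "C", None, 1)] * 6
    + [("C", "C", 1, 1)] * 3
    + [("C", "C", 2, 1)] * 3
)
benzene_vector = vectorize(benzene_edges)
ordered_dims = [dim for dim, _ in benzene_vector.most_common()]
```

`most_common()` gives the dimensions in decreasing order of their components, which is the ordering that the compression step starts from.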
Once a vector is formed, $G_{in}$ is compressed based on the identified dimensions. Compression is done as follows: First, dimensions are ordered in decreasing order of their occurrence in $G_{in}$. Starting from the highest dimension, $G_{in}$ is compressed by replacing each matching edge with a node having the dimension name. If the replaced edge connected nodes $v_i$ and $v_j$, then all other edges incident on $v_i$ or $v_j$ are now connected to the new node. Once it is no longer possible to compress the graph further, the topmost dimension is discarded and the process is started again. From each of the compressed graphs thus created, all nodes at lower abstraction levels are also discarded. Now each node in the compressed graph would be at the same (higher) abstraction level.
[Figure 5: Vectorization Example. Panels (a), (b) and (c) show compressed graphs whose nodes are labeled CH, CC1, CC2 and CO2, alongside the vector C:H = 6, C:C1 = 4, C:C2 = 3, C:O2 = 1.]
Figure 5 shows three compressed graphs $G^1_{out}$, $G^2_{out}$ and $G^3_{out}$ that are created after vectorization of $G_{in}$. The vectorization created 4 dimensions: C:H, C:C1, C:C2 and C:O2; and $G_{in}$ had respectively 6, 4, 3 and 1 points in the four dimensions. For compression, the dimensions are placed in decreasing order and compressions are progressively applied from the top. First, $G_{in}$ is compressed using C:H. Once this is done, it is no longer possible to apply any other compression factor, and the resulting graph $G^1_{out}$ is shown in Figure 5(a). C:H is then discarded and the compression process is repeated to obtain the next compressed graph.
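One compression pass can be sketched as a greedy edge contraction. The code below is an illustration, not the GRACE implementation: `compress` and `dimension_name` are hypothetical, and the sketch assumes that each node is absorbed into at most one new node per pass, a detail the text leaves open. Benzene (Figure 3) is used as input rather than the compound of Figure 4, whose edge list is not given.

```python
def dimension_name(type_a, type_b, edge_type=None):
    """Canonical name for an unordered node-type pair plus optional edge type."""
    a, b = sorted((type_a, type_b))
    return f"{a}:{b}" + ("" if edge_type is None else str(edge_type))


def compress(nodes, edges, dims):
    """Contract edges by dimension, trying the dimensions in the given order.

    nodes: {node_id: node_type}; edges: [(u, v, edge_type)].
    Returns (new_nodes, new_edges) for the next abstraction level.
    """
    merged = {}     # old node id -> id of the new node that absorbed it
    new_nodes = {}  # new node id -> label such as "CH" or "CC2"
    for dim in dims:
        for u, v, edge_type in edges:
            if u in merged or v in merged:
                continue  # assumption: a node is absorbed at most once
            if dimension_name(nodes[u], nodes[v], edge_type) == dim:
                new_id = f"n{len(new_nodes)}"
                new_nodes[new_id] = dim.replace(":", "")
                merged[u] = merged[v] = new_id
    # Keep edges between two different new nodes. Edges touching a node that
    # was never absorbed belong to the lower level and are discarded.
    new_edges = [
        (merged[u], merged[v], edge_type)
        for u, v, edge_type in edges
        if u in merged and v in merged and merged[u] != merged[v]
    ]
    return new_nodes, new_edges


# Benzene ring: six carbons with alternating double/single bonds, one H each.
nodes = {f"c{i}": "C" for i in range(6)}
nodes.update({f"h{i}": "H" for i in range(6)})
edges = [(f"h{i}", f"c{i}", None) for i in range(6)]
edges += [(f"c{i}", f"c{(i + 1) % 6}", 2 if i % 2 == 0 else 1) for i in range(6)]

by_ch = compress(nodes, edges, ["C:H", "C:C2", "C:C1"])  # first pass
by_cc2 = compress(nodes, edges, ["C:C2", "C:C1"])        # after discarding C:H
```

Calling `compress` repeatedly while dropping the first entry of `dims` mirrors the "discard the topmost dimension and start again" loop; each call yields one candidate graph for the next level.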
Compression creates graphs at a higher abstraction level $l+1$. The different graphs that are obtained depict different structural views of the original graph.
Each compressed graph is then passed again through the vectorization process to create dimensions at level $l+1$. These dimensions are created in different vector spaces at this level. Thus, if $G_{in}$ was parsed at level 0 and created vector space $VS_0$, the three compressed graphs create vector spaces $VS_1$, $VS_2$ and $VS_3$ at level 1. The database administrator may set a limit $k$ on the maximum number of vector spaces at any abstraction level. If the number of compressed graphs is more than $k$, then some of the "uninteresting" graphs are discarded. Uninterestingness may be defined based on properties like lack of connectedness, too few nodes, etc. In the earlier example, compressing $G_{in}$ by the dimension C:O2 results in a graph with a single node. Such a graph is an example of an uninteresting graph.
For vectorizing subsequent sample graphs, the procedure is slightly different. It is described by the following steps:
For any abstraction level $l$, do the following:
1. Read input sample graph $G_{in}$.
2. Identify a set of dimensions by pairs of node types and edge types.
3. Create a vector that describes $G_{in}$ based on the identified dimensions.
4. Add the set of dimensions and $G_{in}$ to any existing vector space $VS_j$ at level $l$, according to the following criteria:
(a) $G_{in}$ is not already available in $VS_j$, and
(b) the overlap between the dimensions identified in $G_{in}$ and in $VS_j$ is maximal.
5. Order the dimensions in decreasing order of their components in $G_{in}$.
6. Compress $G_{in}$ by these different dimensions to create a set of $m$ output graphs $G^1_{out}$ to $G^m_{out}$. Discard all nodes in the compressed graphs that belong to level $l$.
7. Repeat from step 1 for level $l+1$, using each of the compressed graphs $G^1_{out}$ to $G^m_{out}$ as input graph $G_{in}$.
Subsequent sample graphs are fitted into existing vector spaces. [...] New vector spaces are created only if the sample graph generates all new dimensions not encountered thus far. Hence the order in which sample graphs are presented is also important. Sample graphs should not only be representative of the different kinds of member graphs, but they also have to be presented in an order such that changes in structure occur gradually. This would ensure a best fit of dimensions into vector spaces.
[Figure 6: Vectorization Example. (a) A second sample chemical compound graph made of C, H and O nodes. (b) Its vector: C:H = 8, C:C1 = 2, C:O1 = 2, C:C3 = 1.]
[Figure 7: Vectorization Example. Panels (a) and (b) show the two compressed graphs, whose nodes are labeled CH, CC1 and CO1.]
For example, consider that the graph in Figure 6(a) is the second sample graph at level 0. In the first pass, the following dimensions are identified: C:H, C:C1, C:O1 and C:C3. At level 0, presently there is only one vector space, $VS_0$. $VS_0$ shares two dimensions, C:H and C:C1, with the new graph. This graph is then placed into $VS_0$ and the dimensionality of $VS_0$ is increased to include the new dimensions discovered.
When the graph is compressed, two interesting graphs result. This is shown in Figure 7. The first compressed graph yields a dimension CH:CH which is in common with a dimension in $VS_1$ at level 1. Hence this graph is placed in $VS_1$. The second compressed graph yields dimensions CC1:CO1 and CO1:CC13. These do not match with dimensions of any existing vector spaces at level 1. Hence a new vector space $VS_4$ is created.
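The placement rule used in this example (join the space with the largest dimension overlap, otherwise open a new one) can be sketched as follows. The function `place_sample` and the dictionary layout of a vector space are assumptions made for illustration; points are stored sparsely so that a space can gain dimensions without rewriting earlier points.

```python
def place_sample(vector_spaces, graph_id, vector):
    """Place a sample graph's vector into the best-overlapping space.

    vector_spaces: list of {"dims": [...], "points": {graph_id: {dim: count}}}
    for one abstraction level. Returns the space that received the graph.
    """
    best, best_overlap = None, 0
    for space in vector_spaces:
        if graph_id in space["points"]:
            continue  # criterion (a): the graph is not already in the space
        overlap = len(set(vector) & set(space["dims"]))
        if overlap > best_overlap:  # criterion (b): maximal overlap
            best, best_overlap = space, overlap
    if best is None:
        # No shared dimension anywhere: open a new vector space.
        best = {"dims": [], "points": {}}
        vector_spaces.append(best)
    for dim in vector:
        if dim not in best["dims"]:
            best["dims"].append(dim)  # grow the dimensionality of the space
    best["points"][graph_id] = dict(vector)
    return best


# Level 0 after the first sample graph (vector from Figure 4(b)).
level0 = [{"dims": ["C:H", "C:C1", "C:C2", "C:O2"],
           "points": {"g1": {"C:H": 6, "C:C1": 4, "C:C2": 3, "C:O2": 1}}}]
# Second sample graph (vector from Figure 6(b)).
place_sample(level0, "g2", {"C:H": 8, "C:C1": 2, "C:O1": 2, "C:C3": 1})

# Level 1: a compressed graph with no shared dimension opens a new space.
level1 = [{"dims": ["CH:CH"], "points": {}}]
place_sample(level1, "g2_b", {"CC1:CO1": 1})
```

Ties between equally overlapping spaces are resolved here by taking the first one found, which is an arbitrary choice.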
Two aspects of the above procedure need more motivation:
Creation of multiple vector spaces: The reason for creating multiple vector spaces is to make each vector space as dense as possible. This aspect is revisited in more detail further down in this section. If the set of all dimensions identified from different views at an abstraction level were combined into a single vector space, the following issues result:
1. The tuples that describe member graphs would likely have a lot of zeroes, especially if the views differ substantially from each other.
2. Since multiple views are stored in the same space, a member graph may occur in multiple points in the same space. This prevents the graph from being uniquely identified by its tuple.
3. Multiple vector spaces in an abstraction level can be searched in parallel, which is not possible if vector spaces are combined.
Compression of graphs: In the present compression scheme, compression factors are applied in descending order. Once a compressed graph is created, the topmost compression factor is discarded. However, it is evident that different permutations of the compression factors might result in a greater number of compressed graphs. We do not adopt such an approach simply to reduce the number of compression passes. If permutations of compression factors are applied, the number of passes increases factorially with the number of compression factors. If it is guaranteed that the number of compression factors will be small (typically less than 5), then the [...]
Operational phase: The operational phase involves insertion and deletion of graphs in addition to query. Deletion is straightforward; however, insertion is more involved. The process for inserting a graph $H_{in}$ during the operational phase is described below.
For any abstraction level $l$ do the following:
1. Read input graph $H_{in}$.
2. Identify dimensions of $H_{in}$ using the procedures described earlier.
3. Identify the vector space $VS_i$ at level $l$ that has a maximal match with the identified dimensions.
4. Generate a vector for $H_{in}$ based on all the dimensions of $VS_i$, ignoring the rest of the dimensions in $H_{in}$.
5. Compress $H_{in}$ using the procedures described above.
6. Repeat the process for level $l+1$ for each of the compressed graphs.
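Steps 3 and 4 of the insertion procedure differ from the initialization phase in that the chosen vector space is never extended. A minimal sketch, assuming the same hypothetical dictionary layout for a vector space (`dims`, `points`) and a hypothetical helper name `insert_member`:

```python
def insert_member(vector_spaces, graph_id, vector):
    """Project a member graph onto the best-matching space at one level.

    Dimensions of the graph that the chosen space does not have are ignored;
    dimensions of the space that the graph lacks are filled with zero.
    Returns the stored point, or None when no space shares a dimension.
    """
    def overlap(space):
        return len(set(vector) & set(space["dims"]))

    if not vector_spaces:
        return None
    best = max(vector_spaces, key=overlap)
    if overlap(best) == 0:
        return None
    point = tuple(vector.get(dim, 0) for dim in best["dims"])
    best["points"][graph_id] = point
    return point


spaces = [
    {"dims": ["C:H", "C:C1", "C:C2"], "points": {}},
    {"dims": ["N:O1"], "points": {}},
]
# C:N1 is unknown to every space and is therefore silently dropped.
stored = insert_member(spaces, "h1", {"C:H": 4, "C:C1": 1, "C:N1": 2})
```

Returning `None` when nothing overlaps is a choice made for this sketch; the text does not say what happens to such a graph.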
4.3 Density of vector spaces
The rationale for generating multiple vector spaces lies in the requirement for dense vector spaces. A vector space is said to be dense if all points in the vector space are equally probable. If some points in the vector space cannot exist because of design flaws or due to application domain peculiarities, the vector space would not be dense. Vector space density is addressed in GRACE based on the following axiom:
Lemma: A vector space formed by pairwise combinations of nodes is dense, if its dimensions are not correlated.
If there is a correlation between any two dimensions, the space would be skewed. For example, in chemical compounds, dimensions C:H and H:C are correlated because the graph is undirected. A vector space formed by C:H and H:C will contain points only on a single line in the space. All other points in the space will never be filled. Hence, correlations between dimensions should be avoided.
Two kinds of correlations between dimensions occur: direct and indirect. They are explained below:
Direct: Dimensions in a vector space represent substructures. If for any two dimensions $v_i$ and $v_j$, there exists a homomorphism from $v_i$ to $v_j$ or from $v_j$ to $v_i$, the dimensions are said to be correlated.
Indirect: There may be indirect correlations between dimensions based on the application domain. For example, if it is known that a chemical compound has at least one C-C bond, it can be inferred that it has no more than three C-H bonds. Thus, if C:H and C:C1 were dimensions of the vector space, the point $\langle 4, 1 \rangle$ cannot exist.
In GRACE, only direct correlations are addressed. Indirect correlations are specific to nuances of the application domain, and so are not addressed. The following lemmas illustrate how direct correlations are avoided in GRACE.
Lemma: For any vector space in GRACE, no two dimensions are isomorphic to each other.
Proof: Dimensions are created by node pairs which have the following properties: (a) node pairs are unordered, so that C:H and H:C are mapped onto the same dimension; and (b) node pairs of different dimensions are distinct.
Lemma: For every vector space in GRACE, no dimension is wholly contained within another dimension.
Proof: This can be proved by induction. In level 0, dimensions are formed by distinct pairs of node types. Hence no dimension is wholly contained within another. Suppose that the lemma holds at level $i$. At abstraction level $i+1$, nodes of level $i$ are discarded. Nodes of level $i+1$ contain exactly two level $i$ nodes and are formed out of distinct node pairs. Hence no dimension at level $i+1$ is contained wholly within another.
5 Query Resolution and Retrieval
Query resolution in GRACE is performed by mapping a query onto regions in one or more vector spaces. Query results are computed by a ranked union of all the points lying in the query region in all vector spaces. Query resolution and retrieval hence involves two issues: mapping a query onto regions, and ranking query results.
The query itself is in the form of a graph. Query results should be the set of all member graphs which contain the query graph as a substructure. While GRACE addresses retrieval based on both structure and attributes, this section addresses only structure based retrieval.
5.1 Mapping queries onto regions
The process of mapping a query onto regions is very similar to the operational phase process of mapping member graphs onto points in vector spaces. This is explained in more detail below.
For every abstraction level $l$ do the following:
1. Read the input query graph $Q_{in}$.
2. Identify a set of dimensions by pairs of node types and edge types.
3. Identify enabled vector spaces $VS_i, VS_j, \ldots$ at level $l$ which have at least one dimension in common with that of $Q_{in}$.
4. Create regions in $VS_i, VS_j, \ldots$ that describe $Q_{in}$.
5. Order the dimensions of $Q_{in}$ in decreasing order of their occurrence.
6. Compress $Q_{in}$ by these different dimensions to create a set of $m$ output graphs $Q^1_{out}$ to $Q^m_{out}$. Discard all nodes in the compressed graphs that belong to level $l$.
7. Repeat from step 1 for level $l+1$, using each of the compressed graphs $Q^1_{out}$ to $Q^m_{out}$ as input graph $Q_{in}$.
[Figure 8: Query Example. (a) A query graph made of two C and two H nodes. (b) Its query region: C:H = [2,*], C:C2 = [1,*].]
Step 4 in the above process needs more explanation. During query time, the vectorization process is used to generate a region rather than a point.
Creating regions is based on a concept called window size $w$. Window size is a parameter that defines the width of a region along any dimension. If vectorizing a query graph results in point $p$ in dimension C, then the query is defined by the closed interval $[p, p+w]$ along dimension C. The query region is the intersection of all its interval definitions along all dimensions.
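The window-size rule can be sketched directly: each query coordinate becomes a closed interval, and a member point matches when it falls inside every interval. The helpers `query_region` and `in_region` are hypothetical names, and an unbounded window is modeled with `math.inf` as an assumption of this sketch.

```python
import math


def query_region(query_vector, window):
    """Map each query coordinate p to the closed interval [p, p + window]."""
    return {dim: (p, p + window) for dim, p in query_vector.items()}


def in_region(point, region):
    """True if the point lies inside every interval of the region."""
    return all(lo <= point.get(dim, 0) <= hi
               for dim, (lo, hi) in region.items())


# Query of Figure 8: two C-H bonds and one C=C double bond, unbounded window.
region = query_region({"C:H": 2, "C:C2": 1}, window=math.inf)
benzene = {"C:H": 6, "C:C1": 3, "C:C2": 3}
too_small = {"C:H": 1, "C:C2": 3}
```

With an unbounded window the region accepts every graph that has at least as many edges of each kind as the query, which fits the goal of retrieving members that contain the query as a substructure; a finite window trades recall for a tighter candidate set.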
Let $VS_i$ be one of the identified vector spaces for query graph $Q_{in}$. For $VS_i$ to be chosen, it has to satisfy the following criteria: $VS_i$ should be in state enabled and it has to have at least one dimension in common with the identified dimensions of $Q_{in}$. Each vector space can be in one of two states: enabled or disabled. Queries can be mapped to vector spaces only if they are enabled.
The set of all common dimensions between $Q_{in}$ and $VS_i$ is given by $Q_{in} \cap VS_i$. This has to be of non-zero cardinality. The set of all dimensions in the query but not in the vector space is given by $Q_{in} - VS_i$. This set is silently ignored. The set of all dimensions present in the vector space but not in the query is given by $VS_i - Q_{in}$. For all elements in this set, $Q_{in}$ is assigned a closed interval $[0, w]$.
Example: Consider a graph space consisting of two levels and a total of four vector spaces as generated in Figures 4, 5, 6 and 7. Now consider a query graph as shown in Figure 8(a). Let the window size [...] C:C2. At level 0, $VS_0$ can be identified for the query, since $VS_0$ has the following dimensions: C:H, C:C1, C:C2, C:C3, C:O1 and C:O2. The query tuple would thus be $\langle [2,*], [0,*], [1,*], [0,*], [0,*], [0,*] \rangle$.
When the query is compressed, we get one interesting graph CH--CH [type=2]. Parsing this graph gives a dimension CH:CH1 which occurs in vector space $VS_1$ at level 1. $VS_1$ has two dimensions: CH:CH1 and CH:CH2. The query could thus be mapped onto the region $\langle [1,*], [0,*] \rangle$ in $VS_1$.
5.2 Ranking query results
Query results are computed from a ranked union of all candidate graphs which have matched in all vector spaces. Ranking is based on the following criteria:
• A candidate graph matching at a higher level of abstraction gets a higher ranking than one matching at only lower levels of abstraction. (Note that if a graph maps to a vector space at level l, it has to map to at least one vector space at all levels below l. Hence, this criterion can also be stated as: A candidate graph that matches at more levels gets a higher ranking than one that matches at fewer levels.)
• A candidate graph matching in many vector spaces at the same level is given a higher ranking than one which matches in only a few vector spaces.
Based on the above criteria, the rank of a candidate graph C is computed as: $\mathrm{rank}(C) = \sum_{i=1}^{l} i \cdot w_i$, where l is the number of abstraction levels, and $w_i$ is the number of different vector spaces at level i where the candidate graph lies within the query region.
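The rank formula is simple enough to state directly in code. The sketch below assumes, for illustration only, that each candidate comes with a list of its per-level match counts; the names `rank` and `ranked_union` are hypothetical.

```python
def rank(matches_per_level):
    """Compute rank(C) as the sum over levels i = 1..l of i * w_i.

    matches_per_level[i - 1] holds w_i: the number of vector spaces at
    level i in which the candidate lies within the query region.
    """
    return sum(i * w_i for i, w_i in enumerate(matches_per_level, start=1))


def ranked_union(candidates):
    """Return candidate ids ordered from highest to lowest rank.

    candidates: dict of candidate id -> list of w_i values (assumed layout).
    """
    return sorted(candidates, key=lambda c: rank(candidates[c]), reverse=True)
```

With this weighting, a candidate matching two vector spaces at level 1 and one at level 2 outranks a candidate matching three vector spaces at level 1 only, which reflects the first criterion.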
5.3 Space and time complexities
Number of abstraction levels: Vector spaces at any abstraction level are created by pairwise combination of edges from the lower level. The vector spaces in the lowest abstraction level are made of the edges of member graphs.
Let n be the number of distinct node types from all the sample graphs, and $\epsilon$ the total number of distinct edge types. The number of possible dimensions is $O(n^2 \epsilon)$. If $\epsilon$ is far smaller than and unrelated to n, the space complexity of $VS_0$ is $O(n^2)$. Since abstraction levels are formed by combining pairs of dimensions, the number of abstraction levels is $O(\lceil \log(n^2) \rceil) = O(\lceil \log(n) \rceil)$.
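The final equality holds because squaring n only doubles the logarithm, and the constant factor is absorbed by the O-notation. A short worked step, assuming base-2 logarithms (the text does not state the base; base 2 is the natural choice if each pairwise combination roughly halves the number of dimensions):

```latex
\[
\lceil \log_2 (n^2) \rceil
  = \lceil 2 \log_2 n \rceil
  \le 2 \lceil \log_2 n \rceil
\quad\Longrightarrow\quad
O(\lceil \log(n^2) \rceil) = O(\lceil \log n \rceil).
\]
```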
Number of vector spaces at an abstraction level: Let n be the number of distinct node types from all the sample graphs and $\epsilon$ the number of distinct edge types. In the worst case, the sample graphs are presented in a way such that every sample graph gets placed into a different vector space and each vector space has just one dimension. This way there can be $n^2 \epsilon = O(n^2)$ vector spaces. But for our purposes, we shall be using k as the maximum number of vector spaces at a level. k is the limit on the number of vector spaces set by the database administrator.
Complexity of the mapping process: Mapping member graphs onto points or query graphs onto regions involves algorithms of the same complexity.
Let the number of nodes in the input member or query graph be v. The graph would contain $e = O(v^2)$ edges. Vectorizing the input graph requires a scan of all the edges. This is of complexity $O(e)$. Compression of the input graph involves combining edges (dimensions) pairwise. The number of times compression can be done is $O(\lceil \log(e) \rceil)$. After each compression, there may be a maximum of k compressed graphs. Hence, the complexity of mapping is $O(e \, k \lceil \log(e) \rceil)$.
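The O(e) vectorization scan can be illustrated with a small sketch. It assumes, purely for illustration, that a graph is given as a list of (node type, node type, edge type) triples and that a dimension is named by the two node types followed by the edge type, loosely mirroring names such as C:C2 from the example in Section 5.1. The actual GRACE parsing and naming rules may differ.

```python
from collections import Counter


def vectorize(edges):
    """Count edges per dimension in one pass over the edge list, O(e).

    edges: iterable of (type_a, type_b, edge_type) triples (assumed layout).
    Node types are sorted so that an undirected edge always maps to the
    same dimension, whichever endpoint is listed first.
    """
    point = Counter()
    for type_a, type_b, edge_type in edges:
        a, b = sorted((type_a, type_b))
        point[f"{a}:{b}{edge_type}"] += 1
    return dict(point)
```

Each edge is visited exactly once, so the cost grows linearly with the number of edges; compression would then repeat a similar pass on each coarser graph.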
The complexity of retrieval is the sum of the complexities of mapping the query graph onto regions, and of computing the set of all points that lie in these regions. Complexity of the latter is based on the indexing mechanism used by the database to index regions in vector spaces.
6 Prototype Implementation
A proof of concept prototype for GRACE is currently under development. Ideally, implementation would be based on a multi-dimensional database using
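[Figure 9: Implementation Architecture. A client communicates with the GRACE server, which sits on top of a MySQL database holding the vector space tables VS0, VS1, VS2, ..., VSn (dimension columns d1, d2, ... and a file column f) and the attribute space table AS (columns a1, a2, a3 and f).]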
However, for the prototype, we have used a layered architecture using a MySQL database with MyISAM indexing. Figure 9 depicts the layered architecture. The user interacts with GRACE using a client, which in turn interacts with the GRACE server. The GRACE server mediates between a MySQL DBMS storing a graph database, and the client. A graphviz description sent by a client is converted by the GRACE server into a set of tuples. One of these tuples goes into the attribute space. The rest go into different vector spaces in the structure space.
Each vector space is modeled as a table in the MySQL database. The column names of the tables are the dimensions (d1, d2, ..., dn) of the vector space. The last column (f) of each table is the file name that maintains the graph structure and description.
As mentioned in Section 4, GRACE has two phases of activity: initialization phase and operational phase.
In the initialization phase, the GRACE server creates a new database on MySQL. It also creates a set of tables corresponding to an attribute space and the set of all vector spaces for structure. Once the dimensions of each of these vector spaces are known, the corresponding columns are created in these tables by issuing SQL CREATE commands.
The operational phase involves converting a client request into SQL INSERT, DELETE or SELECT commands. Insertion and deletion are straightforward. For queries, SELECT statements are constructed based on the window size parameter. A query of the form $\langle [1,11], [0,10], [3,13] \rangle$ on a vector space $VS_0$ having dimensions A, B and C is converted to the following SQL statement:
SELECT f FROM VS0 WHERE A BETWEEN 1 AND 11
AND B BETWEEN 0 AND 10
AND C BETWEEN 3 AND 13;
Based on the query results, the GRACE server applies the ranking algorithm and returns appropriate results to the client.
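The translation from a query region to a SELECT statement can be sketched as below. The sketch uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server; the helper name `region_to_select`, the table contents and the file names are hypothetical.

```python
import sqlite3


def region_to_select(table, dims, region):
    """Build a parameterized SELECT for a region of closed intervals.

    Table and column names are interpolated directly into the SQL text,
    so they must come from trusted metadata such as the vector space
    definition, never from user input.
    """
    where = " AND ".join(f"{d} BETWEEN ? AND ?" for d in dims)
    params = [bound for interval in region for bound in interval]
    return f"SELECT f FROM {table} WHERE {where}", params


# Assumed toy data: one table per vector space, last column f is the file name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE VS0 (A INTEGER, B INTEGER, C INTEGER, f TEXT)")
conn.executemany(
    "INSERT INTO VS0 VALUES (?, ?, ?, ?)",
    [(2, 1, 5, "g1.dot"), (20, 0, 4, "g2.dot"), (1, 10, 13, "g3.dot")],
)

sql, params = region_to_select("VS0", ["A", "B", "C"], [(1, 11), (0, 10), (3, 13)])
matches = sorted(row[0] for row in conn.execute(sql, params))
```

Because BETWEEN is inclusive at both ends, points lying exactly on the boundary of the region count as matches, in line with the closed intervals used to define the query region.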
7 Conclusions
Querying structural data is gaining increased applications in different fields. In many application areas, the requirement is for fast retrieval of structural data, rather than intricate structure matching. GRACE is designed to address such needs. This contrast is along similar lines as the contrast between database search and information retrieval.
It is required to augment structural retrievals with calibration models which compute values of precision and recall. This enables comparison of different structural retrieval methods.
A number of unaddressed issues in GRACE are planned to be taken up in the foreseeable future. Some of these include the following: a mechanism for a user to manually create vector spaces by specifying their dimensions, a mechanism for checking correlations among dimensions in such vector spaces, and advanced query mechanisms that can support expressions over structure.