3.7 Subgraph Isomorphism Search
4.2.2 Exchanging Data Vertices
The distributed subgraph enumeration solutions that choose to exchange data vertices instead of intermediate results include [1][13]. By saying exchanging data vertices, we mean exchanging the adjacency-list of those vertices. Both [1] and [13] follow a single- round style where, for any specific machine, it has to fetch a large set of data vertices from other machines. A fatal issue of these solutions is that the graph data held within each machine easily drains the memory of a single machine.
Similar to [1][13], PADS chooses to exchange data vertices. Differently, PADS uses a multi-round approach so that the data vertices exchanged are divided into multiple- rounds. Instead of coping data vertices blindly, we also introduce maximum-caching strategy which releases previously fetched/used data vertices when it nearly drains the memory or has reached a given threshold. Those strategies grant us the ability to control the peak number of data vertices held in memory during the process.
Moreover, for non-spanning tree query edges, to determine whether some particular data edges (which cannot be verified locally) can be matched to them, we do not fetch related data vertices to the local machine. Instead, we send those edges to be verified to other machines and get the verification results from them. Sending particular edges and verification results is more lightweight than sending data vertices. This technique significantly reduces the workload that needs network communication.
4.3 PADS 77
4.2.3 Compression
Aiming to solve the output-crisis of the subgraph enumeration, [40] introduced a com- pression strategy to compress the output embeddings. By pre-indexing all the cliques of the data graph, [40] also proposed a distributed enumeration solution. Although [40] achieved some success in speeding up the subgraph enumeration process, it has to admit that its index is quite large (as shown in Table 4.1.), which not only takes a large storage space but also increases the overhead of maintaining, especially when the data graph is frequently updated. This issue dramatically drops the practicality of [40]. In contrast, our approach does not need any auxiliary index except for the original data graph.
Table 4.1 Illustration of the Index Size
Dataset(G) Original File Size Index File Size
DBLP 13M 210M
RoadNet 87M 569M
LiveJournal 501M 6.5G
4.3 PADS
4.3.1 Architecture Overview
Given an unlabelled data graph G and m machines {M1, . . . , Mm} in a distributed environ-
ment, a partition of G is denoted as {G1, . . . ,Gm} where EG= {EG1∪ EG2· · · ∪ EGm} and Gt
is the graph partition located in the tt h machine Mt. PADS finds all subgraphs of G that
are isomorphic to P from all the partitions. In this chapter, we assume each partition is stored as a adjacency-list. For any data vertex v, we assume its adjacency-list is stored in a single machine Mt and we say v is owned by Mt (or resides in Mt) (denoted as v ∈ VGt).
We say a data edge e is owned by (or resides in) Mt (denoted as e ∈ EGt) if either end
vertex of e resides in Gt. Note that an edge can reside in two machines.
The architecture of PADS is given in Figure 4.1.
Given an unlabelled query pattern P , PADS processes subgraph enumeration by exe- cuting the following two threads within each machine:
• Daemon Thread listens to requests from other machines and supports two function- alities: verifyE is to return the edge verification results for a given request consisting of vertex pairs. For example, given a request {(v0, v1), (v2, v3)} posted to M1, M1will
Fig. 4.1 PADS Architecture
return {t r ue, f al se} if (v0, v1) is an edge in G1while (v2, v3) is not. fetchV is to return
the adjacency lists of requested vertices of the data graph.
• SubEnum Thread is the core subgraph enumeration thread which applies a new ap- proach R-Meef (region-grouped multi-round expand-verify-filter). When necessary, SubEnum thread sends undetermined edges and remote vertex requests to the Dae- mon threads located in other machines.
We use PADS when referring to our distributed subgraph enumeration system as a whole and use SubEnum to refer to the core subgraph enumeration thread.
Utilizing the above framework, PADS elegantly solves the aforementioned issues that are preventing existing solutions from being practically applied. To be specific, PADS achieves
(1) Asynchronism The core subgraph enum thread of PADS runs separately and asyn-
chronously where there is no waiting latency among them. This leads the overall perfor- mance of the process to be dramatically improved.
(2) Index Free Except for the adjacency-list of the data graph, PADS does not need any
auxiliary index. This characteristic significantly improved the practicality of PADS espe- cially when the data graphs needs frequent update.
(3) Lightweight Communication Unlike [54][31][32], PADS does not send any intermedi-
ate embeddings through the network. The requests and responses for both verifyE and fetchV are much more lightweight compared to previous approaches.
4.3 PADS 79
(4) Relieved Intermediate Result Storage Latency Instead of join-oriented method, the core
subgraph enumeration algorithm follows a novel multi-round verify-expand-filter idea which is similar to the incremental verification strategy of backtracking algorithms. It fil- ters out unpromising embedding candidates by edge verification as early as possible. In this way, it effectively reduces the number of intermediate results that are needed to be stored in memory. Therefore the intermediate result storage latency is relieved in PADS.
(5) Memory Control The memory usage of PADS is controlled by a set of new techniques:
a compression strategy to compress intermediate results, a grouping strategy to split the workload and a maximum-data caching mechanism to reduce communication.
4.3.2 R-Meef
The architecture of our distributed subgraph enumeration system PADS has already been presented in the above subsection. The functions within the daemon thread are very intuitive to understand, therefore we omit any further details about them. In this section, we focus on the explanation of the SubEnum thread, which is the core component of PADS.
Before we present the algorithmic details of SubEnum, let us first explain its underly- ing principles which can be abstracted as a region-grouped multi-round expand-verify-
filter(R-Meef) approach where the following definitions are important.
Definition 9. Given a graph partition Gt of data graph G located in machine Mt and a query pattern P , the injective function fGt: VP → VGis an embedding candidate of P in G
w.r.t Gt such that for any edge (ui, uj) ∈ EP, if either fGt(ui) ∈ VGt or fGt(uj) ∈ VGt, there
exists an edge ( fGt(ui), fGt(uj)) ∈ EGt.
The definition of embedding candidate is similar to that of partial embedding while the embedding candidate is restrictedly defined within a given graph partition Gt. For
any query edge, it only requires that the corresponding mapped data edge e must exist if either vertex of e is owned by Gt. We useRGt(P ) to denote a set of embedding candidates
of P in G w.r.t Gt. fG−1
t (v) is used to denote the query vertex mapped to data vertex v in fGt.
Definition 10. Given an embedding candidate fGt of query pattern P , for any e = (v, v′)
where ( fG−1
t (v), f
−1 Gt (v
′)) ∈ E
P, e is called an undetermined edge of fGt if neither v nor v′
Obviously if we want to determine whether fGt is actually an embedding of the query
pattern, we have to verify its undetermined edges in other machines. For any undeter- mined edge e, if its two vertices reside in two different machines, we can use any of them to verify whether e ∈ EGor not.
Example 17. Consider a random graph partition Gt of a data graph G and a triangle query pattern P where VP = {u0, u1, u2}, fGt={(u0, v0), (u0, v1), (u0, v2)} is an embedding
candidate of P in G w.r.t Gt if v0∈ VGt, v1∈ Ad j (v0) and v2∈ Ad j (v0) while neither v1
nor v2resides in Gt. (v1, v2) is an undetermined edge of fGt.
Next let us assume that we have an identification strategy such that each fGt can be
looked up by a unique ID. (We will discuss our identification strategy in Section 4.5.) Definition 11. Given an embedding candidate set ReGt(P ), the edge verification index (EVI) is a key-value map where for any tuple {e, I D s},
• the key e is a vertex pair (v, v′).
• the value I D s represent the set of IDs of the embedding candidates ofReGt(P ).
and (1) if the I D of any fGt ∈ eRGt(P ) is in I D s, e is an undetermined edge of fGt. (2) for any
undetermined edge e of fGt ∈ eRGt(P ), there exists a tuple inEVI with e as the key and the
ID of fGt as the value.
Intuitively, the edge verification index is a structure to organize the undetermined edges of a set of embedding candidates. It is straightforward to see:
Lemma 2. Given a data graph G, query pattern P and an edge verification indexEVI, for any {e, I D s} ∈EVI, if e ∉ EG, then none of the embedding candidates corresponding to I D s can be an embedding of P in G.
We have explained fGt,RGt(P ) andEVI, which are key structures of the data flowing
in the workflow of R-Meef. Now we are ready to define the execution plan which guides the workflow of R-Meef. Before that, we present the definition of our query decomposi- tion.
Definition 12. A decomposition of query pattern P = {VP, EP} is a set of decomposition unitsDE= {d p0, . . . , d pl} where for every d pi∈DE, d pi= {upi vot, Vl ea f} and
4.3 PADS 81 (2) Vd pi = {d pi.upi vot}S Vl ea f, and Ed pi= E st ar d pi S E si b d piwhere E st ar d pi = S u′∈Vl ea f {(d pi.upi vot, u′)}, and Ed psi b i = S u,u′∈d pi.Vl ea f {(u, u′) ∈ EP}. (3) S
d pi∈DE(Vd pi) = VP, and for any two d pi, d pj, Ed piT Ed pj = ;.
Similar to the star decomposition within the left-deep join of [31], our decomposition units cover all the vertices of P and share no common edges with each other. Each of the decomposition units can be treated as a subgraph of P . While differently, our unit is not restricted to star since it allows edges among leaves. We use Ed pst ar
i represent the edges
from pivot to leaves and Ed psi b
i represents the edges among leaves. Also it is worth noting
thatS
d pi∈DE(Ed pi) is only a subset of the EP.
Based on the above decomposition, we define the execution plan as following: Definition 13. Given a query pattern P , the execution plan is an ordered list of decompo-
sition units, denoted asPL= {d p0, . . . , d pl} where for any d pi∈PL, (1) Pi= {VPi, EPi} where VPi= S pj ≤i∈PL(Vpj) and EPi = S pj ≤i∈PL(Epj). (2) d pi.upi vot ∈ VP(i −1)(i > 0) and d pi.Vl ea f∩ VP(i −1)= ;.
It is easy to see that Pi is the union of the first i units. As per the order of execution
plan, it requires that the pivot-vertex of d pi must appear in some units before d pi and
any of the leaf-vertex of unit d pishould never appear in any unit before d pi. Considering
the condition (3) of Definition 12, it is obvious that VP= VPl.
For any d pi of a decomposition planPL, let us define: Ed panc i = [ u∈d pi.Vl ea f,u′∈VPi−1 (u, u′) ∈ EP We call Ed pst ar
i expansion query edges and call the union of E
si b d pi and E
anc
d pi verification
query edges. The Ed panc
0 is an empty set. We have the following lemma:
Lemma 3. Given an execution planPL= {d p0, . . . , d pn} of query pattern P , it holds that
|S
d pi∈PL(E
st ar
d pi )| = (|VP| − 1).
Intuitively, the union of the expansion edges of all the units forms a spanning tree of pattern P . The verification query edges are non-spanning tree edges of the pattern P . Example 18. Consider the query pattern in Figure 4.2(a), we have an execution planPL= {d p0, d p1, d p2, d p3} where d p0.upi vot = u0, d p0.Vl ea f = {u1, u2, u7}, d p1.upi vot = u1, d p1.Vl ea f = {u3, u4}, d p2.upi vot= u0, d p2.Vl ea f = {u8, u9} and d p3.upi vot= u2, d p3.Vl ea f =
Fig. 4.2 Running Example
Definition 14. Given an execution planPLand a list of candidate vertices C (d p0.upi vot) in Mt that can be mapped to d p0.upi vot, we divide C (d p0.upi vot) into region groups, de- noted asRG= {r g0, . . . , r gh} where each region group is a set of data vertices and
(1) for any two r gi, r gj∈RG, r gi∩ r gj= ;. (2) C (d p0.upi vot) =Sr gi∈RGr gi.
With the above concepts, we are ready to present the principles of our R-Meef ap- proach.
Given query pattern P , data graph G and its partition Gtwithin a machine Mt, R-Meef
finds a set of embeddings of P in Gt based on the following:
(1) Region-Grouped From the vertices residing in Mt, R-Meef divides those which can
be mapped to the first query vertex of the execution plan into different region groups. Then it process each group sequentially and separately. The idea is to divide the work- load into multiple independent processes so that the maximum intermediate results cached in the memory can be significantly decreased, therefore memory crash can be effectively avoided.
(2 Multi-round For each region group, R-Meef processes one unit at a round based on the execution planPL. In it hround, the workflow can be illustrated in Figure 4.3.
Fig. 4.3 R-Meef workflow
In Figure 4.3, theRGt(Pi −1) represents the embedding generated and cached from last
4.3 PADS 83
v is the candidate vertex that can be mapped to d p0.upi vot. By expandingRGt(Pi −1),
we get all the embedding candidates ReGt(Pi) of Pi w.r.t Mt. After verification and filtering, we get all the embedding of Pi within Mt, which are partial embeddings of P
in G.
In each round, the expand and verify & filter work as follows:
• Expand According to our execution plan, given an embedding f from last round,
d pi.upi vot has already been matched to a data vertex v by f . By searching the
neighbourhood of v, we expand f and find all the embedding candidates of d pi
containing (d pi.upi vot, v) w.r.t Mt. It is worth noting that if v does not reside
in Mt, we have to fetch its adjacency-list from other machines. Different embed-
dings have different vertices to fetch in order to expand. For all the embeddings from last round, we gather all the vertices that need to be fetched and then fetch their adjacency-lists together.
• Verify & Filter Upon having a set of embedding candidatesReGt(Pi), we can easily build anEVI from them. Then we send the keys ofEVI to other machines to verify their existence. One important assumption here is that each machine has a record of the ownership information (say which machine a particular v resides at) of all the vertices. This record can be constructed offline as a map whose size is |V |. When constructing the record, each machine broadcasts the IDs of their local vertices to other machines, whose time cost is very small. After we get the verification results, we filter out the failed embedding candidates fromReGt(P(i )). And the remaining ones areRGt(Pi). The output of the final round is the set of
embeddings of query pattern P in G within Mt.
4.3.3 Algorithm of SubEnum
Next we look at the implementation of R-Meef as shown in Algorithm 11. Algorithm 11 is executed in each machine Mt simultaneously and asynchronously.
(1) Data Structure
• Within each machine, we first compute an execution planPLbased on the given query pattern P (Line 2). ThePLs computed in every machine are the same. • We group the candidate data vertices of d p0.upi votwithin Mtinto into region groups
(Line 3). For each region group r g , a multi-round mapping process is conducted (Line 4 to 21).
Algorithm 11: SUBENUMFRAMEWORK
Input: Query pattern P and Data graph G partitioned into {G1,G2, . . . ,Gm} located
at m machines from M1to Mm
Output: All embeddings of Q in G
1 for each machine Mt(1 ≤ t ≤ m) simultaneously do
2 PL← execut i onPl an(P )
3 RG={r g0. . . r gk} ← r eg i onGr oups¡C (d p0.upi vot, Mt)¢ 4 for each region group r g ∈RGdo
5 init inverted trieIT with size |VP|
6 init edge verification indexEVI
7 for each data v ∈ r g do 8 f ← (d p0.upi vot, v)
9 exp and I nver t ed Tr i e( f , Mt, d p0,IT,EVI) 10 R← ver i f yFor ei g nE(EVI)
11 f i l t er F ai l ed Embed (R,IT) 12 for Round i = 1 to |PL| do
13 clearEVI
14 f et chF or ei g nV (i )
15 for each f ∈IT do
16 exp and I nver t ed Tr i e( f , Mt, d pi,IT,EVI)
17 R← ver i f yFor ei g nE(EVI)
18 f i l t er F ai l ed Embed (R,IT) 19 return the embeddings withinIT
• In Line 5, We use an inverted trieIT to save the generated intermediate results. We use intermediate results for short in the rest of the paper . We will give more details