3.7 Subgraph Isomorphism Search
3.8.5 Query Processing Time
In this subsection, we present the comparisons of the performances ofMQOand sequen- tial query processing (SQO). We report the results from three perspectives. (1) The scal- ability of the query processing. For each query set, we generated 10 query sets, with set size ranging from 100 to 1000. The family size was set to 10. The grouping factor thresh- old was set to 0.5 for Yeast, 0.6 for Human and 0.9 for Wordnet. The partition width is set to 3 for all data sets. The results are given in Figure 3.11. (2) The effects of query similarity over the query processing time. We used different family sizes (from 1 to 10) when gen- erating the queries. The query set size is 1000. (3) The effects of different grouping factor thresholds over the query processing time. We used query sets with 1000 queries and the family size was set to 10. Due to space limit, we only present the results for Human for experiments (2) and (3) in Figure 3.12.
As shown in Figure 3.11, our MQOapproach achieved significant improvement over
SQO for all three datasets. Compared with Yeast, both Human and Wordnet achieved larger improvement. There are two possible reasons for this:(1) Yeast is a small graph where the average query processing time is short, hence the space for time savings is not as big as for the other data sets. (2) Yeast contains many labels which makes the queries harder to share common subgraphs. The improvement ofMQOoverSQOfor TurboIso is larger than that for TurboIsoBoosted for all datasets in terms of absolute time saved (note the different time units in the figures).
As shown in Figure 3.12(a), with the increment of the family size, the performance ofMQOshow larger improvement overSQO. We used two horizontal lines to represent the averageSQOprocessing time over all queries for TurboIso and TurboBoosted respectively
3.8 Experiments 71
(a) TurboIso over Yeast (b) TurboBoosted over Yeast
(c) TurboIso over Human (d) TurboBoosted over Human
(e) TurboIso over Wordnet (f ) TurboBoosted over Wordnet Fig. 3.11 Performance Comparison and Scalability Test
(a) Effects of query similarity (b) Effects of threshold Fig. 3.12 Effects of query similarity and GF threshold
(note that the family size does not affect the average time cost ofSQO). As can be seen, when the family size is 1 (which means the queries have little overlap), the performance of ourMQOis only slightly worse than that ofSQO.
As shown in Figure 3.12(b), neither a very large nor a very small grouping factor thresh- old can achieve the best performance. The former leads to many useful MCSs not being detected, and the latter leads to relatively longPCMbuilding time with many small com- mon subgraphs being detected. ThePCMbuilding time cannot be easily paid back in such cases.
3.9 Conclusion
We presented our solution of MQO for subgraph isomorphism search in this chapter. Our experiments show that, using our techniques, the overall query processing time when multiple queries are processed together can be significantly shorter than if the queries are processed separately when the queries have many overlaps. Furthermore, the larger the data set or the more time it takes for an average query, the more savings we can achieve. When the queries have no or little overlap, our filtering technique can detect it quickly, resulting in only a slight overhead compared withSQO. To the best of our knowl- edge, our work is the first on MQO for general subgraph isomorphism search. It demon- strates that MQO for subgraph isomorphism search is not only feasible, but can also be highly effective.
Chapter 4
Distributed Subgraph Enumeration: A
Practical Asynchronous System
In this chapter, we present a Practical Asynchronous Distributed Subgraph enumeration system (PADS). Section 4.1 presents some practical motivations of distributed subgraph enumeration and discusses some problems existing in previous research. We further dis- cuss the techniques proposed by previous research in Section 4.2. In Section 4.3, we give the architecture of PADS and present the principles of R-Meef based on which we de- signed SubEnum, the core of PADS. The algorithm of building explanation plan is given in Section 4.4. In Section 4.5, we present an inverted trie data structure to compress in- termediate results. After that, two memory control strategies are given in Section 4.6. Then we present our experimental study in Section 4.7. At last we conclude this chapter in Section 5.1.
4.1 Motivations and Problems
In the real world, the data graphs are often fragmented and distributed across different sites. For example, a social network graph maybe distributed across different servers and data centers for performance and security considerations [22][38]. This phenomenon highlights the importance of distributed systems for subgraph enumeration. Also, the increasing size of modern graph makes it hard to load the whole graph into memory, which further strengthens the requirement for distributed systems that could share the workload among the computing resources of a cluster.
Assuming the data graph is partitioned over multiple machines, many distributed processing approaches of subgraph enumeration have been proposed [1][13][54][31][32][13].
However they are facing the following problems:
• Memory Crisis A fatal issue of the single-round join-oritented solutions [1][13] is that the graph partition held within each machine easily outnumbers the available memory of a single machine. The reason is that, for any particular machine, those solutions have to copy a large part of data graphs from other machines to the memory of this machine. With the ability of avoiding maintaining a large part of the data graph in memory, the algorithms [31][32][54] have achieved some success by following a multi- round style. However, current multi-round algorithms suffer from the fact that a huge number of intermediate results are generated in each round, causing severe burden not only on memory but also on network communication.
• Heavy Index Aiming to solve the output crisis of finally found embeddings, [40] pro- posed a compression strategy based on which it also provided a distributed subgraph enumeration framework. However, [40] needs to pre-compute and index all the cliques of the data graph. Although this solution achieved some speed-up, the size of the in- dexed cliques can be much larger than the graph itself (see Section 4.7) in our exper- iments. For instance, the original data graph of liveJournal takes 1G space, while the index of [40] takes more than 15G space. Maintaining such a heavy index generates huge overhead especially when the graph needs to be updated frequently.
• Intermediate Result Storage Latency Most of the existing distributed approaches are based on the join-oriented method where the embeddings of two small subgraphs are first loaded in memory and then joined together to get the embeddings of the union of these two subgraphs [43][31][32]. It is known that CPU is faster than RAM access, therefore, besides the memory crisis aforementioned, the latency generated by saving a large size of intermediate results in memory can also be a serious problem for these join-oriented approaches. To our best knowledge, this issue has never been noted in early research.
• Synchronization Delay Most existing multi-round distributed approaches [54][31][32] are following a synchronous model where machines are synchronized within each round. To be specific, before any machine proceeds to the next round, the embeddings from every machine of the current round have to be shuffled across the network. There- fore, those machines that have completed the current round have to wait for those that have not. The time cost of each round will highly depend on the worst machine. This
4.2 Related Work 75
synchronization delay can be a big problem when the datasets are large and when the computing power of a cluster is not evenly distributed.
4.2 Related Work
In this section, we extend the discussion of existing distributed subgraph enumeration solutions in Section 4.1 and highlight the differences between them and our work. We classify them into three categories by their main techniques: Multi-round Join-oriented,
Exchanging Data Vertices, and Compression.