
CHAPTER 3. ELASTIC PUBLIC-SUBSCRIBE IN CLOUD USING DATA

3.6 Related Work

3.6.1 Publish-subscribe in cloud

Ye and Kim [78] proposed BlueDove, a scalable and elastic publish-subscribe service in the cloud. Our study differs from [78] in three ways: (1) we consider semi-structured data for both messages and queries, whereas Ye and Kim consider structured data as messages; (2) we consider structure-based queries, whereas they focus on attribute-based queries; and (3) most significantly, we design an elastic application, whereas they focus on using the elasticity offered by the cloud.

Barazzutti et al. [17] propose E-STREAMHUB, an elastic publish-subscribe system. They study dynamic scaling of stateful and stateless publish-subscribe operators, and propose local and global elasticity policies to ensure high system utilization and stable operation latencies.

Belyaev and Ray [19] study the dissemination and filtering of XML data streams, and propose a subscriber-centric XML filtering approach for the replication and distribution of XML streams. Their approach allows for selective filtering on more efficient nodes.

Tran et al. propose EQS, an elastic and scalable message queue for the cloud [137]. They also propose a message queue-based adaptable scaling algorithm.

Ma et al. propose SREM, a scalable and reliable matching service for content-based publish/subscribe systems in cloud settings [79]. A distributed overlay is proposed to achieve low routing latency and reliability. The authors also propose a hybrid space partitioning technique, HPartition, which maps large-scale skewed subscriptions into multiple subspaces, resulting in high matching throughput.

3.6.2 Tree encoding and data filtering techniques

Several transformation techniques have been proposed in the literature. A Node Encoded Tree Sequence (NETS) is proposed in [124]. NETS models an XML document as a rooted, ordered, labeled tree in which each node corresponds to an element tag, attribute, or value, and edges represent the structural relationships between nodes. PRIX [119] and FiST [69] use Prüfer sequence encoding for twig filtering. FiST first transforms the XML document and the queries into their Prüfer sequences, and then carries out subsequence matching to find matches. FiST supports ordered query matching.
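To make the Prüfer-sequence idea concrete, the following is a minimal sketch, not the implementation of PRIX or FiST. It assumes nodes are numbered in postorder and that leaf deletion continues until only the root remains (giving n - 1 entries; the classical construction stops at two nodes). The function names, the `tree`/`labels` dictionaries, and the toy document are hypothetical. The subsequence test shown is only a necessary condition for a match; any refinement steps are omitted.

```python
import heapq

def postorder_numbers(tree, root):
    """Number nodes 1..n in postorder; `tree` maps a node id to its ordered children."""
    numbers = {}
    def visit(node):
        for child in tree.get(node, []):
            visit(child)
        numbers[node] = len(numbers) + 1
    visit(root)
    return numbers

def prufer_sequences(tree, root, labels):
    """Return (numbered, labeled) sequences: repeatedly delete the
    smallest-numbered leaf and record its parent's number and label."""
    numbers = postorder_numbers(tree, root)
    node_of = {num: node for node, num in numbers.items()}
    parent = {child: node for node, children in tree.items() for child in children}
    pending = {node: len(tree.get(node, [])) for node in numbers}
    leaves = [numbers[node] for node, count in pending.items() if count == 0]
    heapq.heapify(leaves)
    numbered, labeled = [], []
    while len(numbered) < len(numbers) - 1:
        leaf = node_of[heapq.heappop(leaves)]
        up = parent[leaf]
        numbered.append(numbers[up])
        labeled.append(labels[up])
        pending[up] -= 1
        if pending[up] == 0:  # the parent has become a leaf
            heapq.heappush(leaves, numbers[up])
    return numbered, labeled

def is_subsequence(needle, haystack):
    """True if `needle` appears in `haystack` in order (not necessarily contiguously)."""
    it = iter(haystack)
    return all(item in it for item in needle)

# Hypothetical document <a><b><d/><e/></b><c/></a> and query a/b/d.
doc_tree = {"a": ["b", "c"], "b": ["d", "e"]}
query_tree = {"a": ["b"], "b": ["d"]}
tags = {t: t for t in "abcde"}  # node ids double as tag names in this toy example

_, doc_lps = prufer_sequences(doc_tree, "a", tags)
_, query_lps = prufer_sequences(query_tree, "a", tags)
print(doc_lps, query_lps, is_subsequence(query_lps, doc_lps))
```

With postorder numbering, every node's descendants carry smaller numbers than the node itself, so the smallest-numbered leaf is always the next node in postorder; the heap is kept only to mirror the textbook definition.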

ViST [146] transforms an XML tree pattern into a structure-encoded sequence and uses root-path encoding, in which each node is expressed as the sequence of intermediate nodes from the root to itself. The worst-case storage requirement can be higher than linear in the total number of elements of the XML documents. A key difference between our tree encoding approach and [146] is that we encode only selected nodes, and the storage requirement in our case is O(n).
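The superlinear storage of root-path encoding can be seen in a small sketch. This is an illustration under stated assumptions rather than ViST's actual data structure: each node is emitted in preorder as a (label, root-path) pair, and the names `root_path_encoding`, `tree`, and `labels` are hypothetical.

```python
def root_path_encoding(tree, root, labels):
    """Emit (label, path) pairs in preorder, where `path` lists the labels
    of all ancestors from the root down to the node's parent."""
    encoded = []
    def visit(node, prefix):
        encoded.append((labels[node], prefix))
        for child in tree.get(node, []):
            visit(child, prefix + (labels[node],))
    visit(root, ())
    return encoded

# Worst case: a chain of n nested elements, e0 > e1 > ... > e4 (n = 5).
n = 5
chain = {i: [i + 1] for i in range(n - 1)}
names = {i: f"e{i}" for i in range(n)}

encoded = root_path_encoding(chain, 0, names)
stored_symbols = sum(len(path) for _, path in encoded)
print(encoded[-1])      # deepest node carries the full path from the root
print(stored_symbols)   # 0 + 1 + ... + (n - 1) = n(n - 1)/2
```

For a chain of n nodes the stored path symbols sum to n(n - 1)/2, which is quadratic in n; shallow, bushy documents stay much closer to linear.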

Dewey ID labeling, which represents XML order in the relational data model, is proposed in [135]. In that work, the authors also apply the Dewey ID labeling scheme to preserve document order during XML query processing. ORDPATH, a variation of prefix labeling, is proposed in [105]; it deals with the insertion of XML nodes in the database. Region encoding is introduced in [26], where the focus is on searching and indexing text databases.
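As a rough illustration of prefix (Dewey-style) labeling, the sketch below assigns each node its parent's label extended by the node's 1-based position among its siblings. It is a simplified reading of the general idea, not the scheme of [135] or ORDPATH as published; the root label `(1,)`, the function names, and the sample tree are assumptions.

```python
def dewey_labels(tree, root):
    """Label each node with its parent's label plus its 1-based sibling position."""
    labels = {root: (1,)}
    stack = [root]
    while stack:
        node = stack.pop()
        for position, child in enumerate(tree.get(node, []), start=1):
            labels[child] = labels[node] + (position,)
            stack.append(child)
    return labels

def is_ancestor(a, b):
    """Label `a` is an ancestor of label `b` iff `a` is a proper prefix of `b`."""
    return len(a) < len(b) and b[:len(a)] == a

# Hypothetical document: <book><title/><author/><chapter><section/></chapter></book>
tree = {"book": ["title", "author", "chapter"], "chapter": ["section"]}
labels = dewey_labels(tree, "book")

# Sorting by label (tuple comparison) recovers document order.
print(sorted(labels, key=labels.get))
print(is_ancestor(labels["book"], labels["section"]))
```

Because labels compare lexicographically, document order and ancestor tests need no tree traversal. The cost is that inserting a node between siblings forces relabeling of later siblings and their subtrees in this naive form.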

A mixed hardware/software approach for filtering simple XPath queries is proposed in [144]; it handles only queries with the parent-child axis. The authors of [90] proposed a pure hardware approach for XML filtering, which cannot handle recursion in XML documents. The work in [96] uses FPGAs for holistic twig filtering. This FPGA-based approach provides high throughput, averaging more than 200 MB/s, but is severely limited in the number of queries it can filter.

YFilter [35] is an FSM-based approach that uses a Non-deterministic Finite Automaton (NFA) to represent the user profiles. YFilter operates by breaking a twig into root-to-leaf paths and building a unified NFA over the set of paths. A lazy Deterministic Finite Automaton (DFA) based filtering approach has also been proposed in [45]. A key difference between our data shaping solution and the state-of-the-art dynamic programming based approach is that we process nodes selectively instead of comparing every node in the tree against all the nodes in the query set. By being selective, we reduce the computational complexity of the dynamic programming based approach from O(m × n) to O(m′ × n), where m′ can be a fraction of m and can be as small as zero.

Another proposal [95] uses a GPU to filter XML data streams. It filters only path queries and does not address the twig filtering problem, and it uses a dynamic programming approach to filter the path queries. The study in [95] records 1 MBps throughput for filtering 8K path queries over a 50 MB DBLP document using 240 streaming processors (SPs). The research in [128] leverages a GPU to accelerate the processing of a single query in XML databases.

Roy, Teubner, and Alonso [122] used hybrid systems for processing data sets encoded in XML. Our work differs from [122] in several ways: (1) we do not make any assumption regarding the distribution of item sets in the input data; (2) our focus is on processing larger documents, in the range of hundreds of megabytes, as opposed to filtering out the frequent items and reducing the workload of the data mining algorithm; and (3) we adopt a platform-centric approach and design a general filtering system that can be used to filter data streams, whereas [122] seeks to accelerate data mining problems.

3.7 Conclusion

Trends in the Internet of Things (IoT) and Big Data require effective publish-subscribe systems to facilitate the exchange of information between interested entities. At the core of a publish-subscribe system is a data filtering algorithm, which determines whether a document (message) matches queries (profiles). Cloud computing offers cost-effective and easy access to vast computing resources, which often comprise heterogeneous computing platforms. Existing publish-subscribe solutions do not effectively leverage the heterogeneity prevalent in the Cloud, which limits their operational effectiveness, and they are not elastic in a heterogeneous Cloud environment.

In this chapter, we proposed an elastic publish-subscribe system. Specifically, we first developed a data shaping technique to reorganize semi-structured data in a manner amenable to processing on parallel architecture machines and applied it to a publish-subscribe system. Then, we leveraged the data shaping-based publish-subscribe system to facilitate elastic and portable computing in a heterogeneous Cloud environment. Our data shaping technique enables a large number of work units (kernels), resulting in efficient utilization of the compute resources of highly parallel architecture machines. Experiments using real datasets on multicore processors and GPGPUs demonstrate that our data shaping-based approach delivers a scalable and high-throughput publish-subscribe system.

Table 3.4: Kernel Generation using Data Shaping

Query set size (Q)    B       K (= ⌈N/B⌉)    Total Kernels (= Q*B*K)

4                     32      2650           339200
4                     64      1325           339200
4                     128     663            339200
4                     256     332            339200
4                     512     166            339200
4                     1024    83             339200
8                     32      2650           678400
8                     64      1325           678400
8                     128     663            678400
8                     256     332            678400
8                     512     166            678400
8                     1024    83             678400
16                    32      2650           1356800
16                    64      1325           1356800
16                    128     663            1356800
16                    256     332            1356800
16                    512     166            1356800
16                    1024    83             1356800
32                    32      2650           2713600
32                    64      1325           2713600
32                    128     663            2713600
32                    256     332            2713600
32                    512     166            2713600
32                    1024    83             2713600
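The arithmetic behind Table 3.4 can be checked with a short sketch. It assumes N = 84,800, a value inferred from the K column and the totals rather than stated in this section, and the function names are hypothetical. Under that assumption, Q*B*K equals the tabulated total exactly when B divides N (B = 32 and 64); for larger B the last block is only partly filled, and the tabulated total corresponds to Q*N.

```python
def blocks_needed(num_items, block_size):
    """K = ceil(N / B), computed with integer arithmetic."""
    return -(-num_items // block_size)

def launched_kernels(num_queries, num_items, block_size):
    """Q * B * K: work items launched, including padding in the last block."""
    return num_queries * block_size * blocks_needed(num_items, block_size)

N = 84_800  # ASSUMPTION: inferred from the table, not given in the text

for q in (4, 8, 16, 32):
    for b in (32, 64, 128, 256, 512, 1024):
        k = blocks_needed(N, b)
        print(q, b, k, launched_kernels(q, N, b), q * N)
```

The last two printed columns agree whenever B divides N and differ by the padding of the final block otherwise.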

CHAPTER 4. DUPLICATE DETECTION IN SEMI-STRUCTURED