This thesis presents template B+ trees for distributed append-only stores. Template B+ trees support high throughput tuple insertion and efficient key-based retrieval. Using tem- plate B+ trees enables higher utilization of CPU resources while performing concurrent tuple insertions. Our experimental results show that compared to traditional B+ trees, template B+ trees improve insertion throughput by 40% to 57%, compared to traditional B+ tree in- dexes. Template B+ trees also beat the state-of-the-art system HBase, with higher insertion throughput and lower query processing latency. These initial results are very promising, and suggest a number of questions for future work to explore.
1. Domain Partitioning: The architecture proposed in this thesis does not employ any form of domain partitioning between data servers. A domain partitioning scheme divides the domain of key values into non-overlapping sub-domains such that the total load is evenly balanced between data servers. It also helps in query processing since, with domain partitioning, queries can be selectively forwarded to only those data servers which are handling the appropriate sub-domains. Our architecture achieves load balancing between data servers by forwarding tuples to data servers in a round- robin fashion. But this scheme does not help during query processing as queries have to be forwarded to all data servers. It may be possible to design an adaptive domain partitioning scheme which keeps the input load between data servers balanced at all times. The major challenge in creating such a scheme is that the template used by a template B+ tree in a data server is closely tied to the sub-domain that the data server is handling. If the domain partitioning protocol changes the sub-domain assigned to
that data server, the template has to change with it. This requires some degree of synchronisation between dispatcher servers and data servers, which is hard to achieve. 2. Quality of template B+ trees: In this thesis, we have shown that template B+ trees perform well when the overall distribution of tuples does not drift much with time. It will be valuable to fully characterize the quality of template B+ trees when subjected to a sudden change or a gradual drift in overall distribution of tuples. In this thesis, we used the percentage of highly stretched leaf nodes as a proxy for the quality of a template B+ tree. This helps us to judge the quality of template B+ trees only in a relative sense. We think that it should be possible to derive a quality metric which compares the template B+ tree with the theoretically best possible B+ tree that can be grown using the same batch of tuples. The absolute value of such a metric will have a firmer conceptual grounding and can be used to determine whether the template of a template B+ tree needs to be updated.
3. Updating the template of template B+ trees: In this thesis, we do not propose any scheme to update the template of a template B+ tree. As the overall distribution of tuples gradually drifts over time, the template needs to be updated along with it. In our present proposal, the only way to update a template is to create a new B+ tree from scratch and then flush its leaf nodes, which is not efficient. It should be possible to efficiently generate the new template directly from the previous template, using some statistical information from the most recent buffer of tuples to generate a new template.
4. Fault tolerance: We do not explore the fault tolerance mechanisms in our architecture and rely on the fault tolerance of Storm in our implementation. Without the fault- tolerance of Storm, our architecture would not be tolerant against data server failures which lead to potential data loss. In case of a data server failure, all the tuples that have been stored in the local buffer so far (and their corresponding index), but have not
been flushed to HDFS, would be lost forever. An interesting open question is the design of techniques that will make our architecture robust against failures of data servers as well as dispatchers. Replication and creating disk-based (or Zookeeper-based) logs are potential candidate schemes for this purpose.
5. Performance comparison with LSM trees and other alternatives: With promising initial performance results for template B+ trees in hand, an in-depth comparison with other indexing alternatives is a natural next step. In particular, it will be useful to compare template B+ trees to alternatives such as LSM trees in a neutral environment. Our ex- periments used LSM trees in their native HBase implementation, which includes many other architectural choices that affect performance and are different from Storm’s.
References
[1] Apache HBase, https://hbase.apache.org/. [2] Apache Storm, https://storm.apache.org/.
[3] Spark Streaming, http://spark.apache.org/streaming/. [4] Yourkit Java Profiler, https://www.yourkit.com/docs/.
[5] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. BigTable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 26(2), 2008.
[6] S. Chen, B. C. Ooi, and Z. Zhang. An adaptive updating protocol for reducing moving object database workload. Proceedings of the VLDB Endowment, 3(1-2):735–746, 2010. [7] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC ’10, pages 143–154, New York, NY, USA, 2010. ACM.
[8] G. Graefe. Write-Optimized B-Trees. In VLDB, pages 672–683, 2004.
[9] G. Graefe. B-tree indexes for high update rates. SIGMOD Record, 35(1):39–44, 2006. [10] A. Lakshman and P. Malik. Cassandra: A Decentralized Structured Storage System.
SIGOPS Oper. Syst. Rev., 44(2):35–40, Apr. 2010.
[11] V. C. Liang, R. T. B. Ma, W. S. Ng, L. Wang, M. Winslett, H. Wu, S. Ying, and Z. Zhang. Mercury: Metro density prediction with recurrent neural network on stream- ing CDR data. In ICDE, 2016.
[12] P. O’Neil, E. Cheng, D. Gawlick, and E. O’Neil. The log-structured merge-tree (LSM- tree). Acta Inf., 33(4):351–385, June 1996.
[13] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), MSST ’10, pages 1–10, Washington, DC, USA, 2010. IEEE Computer Society.
[14] W. Tan, S. Tata, Y. Tang, and L. L. Fong. Diff-index: Differentiated index in distributed log-structured data stores. In EDBT, pages 700–711, 2014.
[15] S. Wu, G. Chen, X. Zhou, Z. Zhang, A. K. H. Tung, and M. Winslett. PABIRS: A data access middleware for distributed file systems. In ICDE, pages 113–124, 2015.
[16] Z. Zhang, M. Hadjieleftheriou, B. C. Ooi, and D. Srivastava. Bed-tree: an all-purpose index structure for string similarity search based on edit distance. In SIGMOD, pages 915–926, 2010.
[17] Z. Zhang, H. Shu, Z. Chong, H. Lu, and Y. Yang. C-cube: Elastic continuous clustering in the cloud. In ICDE, pages 577–588, 2013.