Chapter 4 PERFORMANCE OF DEADLINE-DRIVEN MAPRE-
4.5 Conclusions and Future Work
This chapter presented our initial study toward the design and implementation of a tool that can provide latency guarantees for MapReduce jobs in the presence of failures. The tool combines two strategies that mask the latency effect of failures at different levels. The first strategy attempts to mask the latency effect within the single parallel job that suffers the failure. When the first strategy fails, the second is applied: it masks the added latency by increasing the resources allocated to the parallel jobs in the subsequent phases of the job DAG.
For each strategy, we discussed a completion-time estimation model. In future work, we plan to use these models to formulate an inverse problem: given the deadlines of MapReduce jobs, reactively estimate the resource allocation they require after a failure occurs.
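The inverse problem can be illustrated with a minimal sketch. It does not use the estimation models of this chapter; as a stand-in assumption, it uses the generic makespan upper bound for greedy assignment of n tasks to k slots, (n - 1) * avg / k + max. All function and parameter names below are hypothetical. Because the estimate does not increase as slots are added, the smallest allocation that meets a deadline can be found by binary search.

```python
def estimated_completion_time(num_tasks, avg_task_time, max_task_time, slots):
    """Upper bound on completion time for greedy assignment of tasks to slots.

    Stand-in model (an assumption, not this chapter's model):
    (n - 1) * avg / k + max.
    """
    if slots < 1:
        raise ValueError("slots must be at least 1")
    return (num_tasks - 1) * avg_task_time / slots + max_task_time


def min_slots_for_deadline(num_tasks, avg_task_time, max_task_time,
                           deadline, max_slots):
    """Invert the model: smallest slot count whose estimate meets the deadline.

    Returns None when even max_slots cannot meet the deadline.
    """
    def meets(slots):
        return estimated_completion_time(
            num_tasks, avg_task_time, max_task_time, slots) <= deadline

    if not meets(max_slots):
        return None

    # The estimate is non-increasing in the slot count, so binary search
    # finds the smallest feasible allocation.
    low, high = 1, max_slots
    while low < high:
        mid = (low + high) // 2
        if meets(mid):
            high = mid
        else:
            low = mid + 1
    return low
```

After a failure, such a routine would be called again with the remaining task count and the time left until the deadline, for example `min_slots_for_deadline(101, 10.0, 20.0, deadline=70.0, max_slots=100)`. A result of `None` signals that the deadline cannot be met within the available slots.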
We believe that our solutions will result in better overall resource utilization for MapReduce frameworks.