Chapter 9: Conclusion and Future Work

9.3. Future Work Directions

• To conduct ongoing experimental research on the tool for further improvements.

• To pursue more advanced NLP and to analyze users’ answers in different languages.

• To develop the tool as a plug-in that runs before data collection and analytics start, so that it provides the relevant data, recommended tool, and data source for timely decision-making and process improvement.

REFERENCES

[1] S. B. Siewert (Feb. 2013), "Social media analytics: Making customer insights actionable." IBM Corporation.

[2] P. Gundecha and H. Liu (2014), "Mining Social Media: A Brief Introduction". Cambridge University Press, UK.

[3] A. Nelson (2013), "How to Use Social Media Data for Customer Insight". [Online]. Available: http://www.exacttarget.com/blog/social-media-data/. [Accessed 12 February 2014].

[4] (n.d) (2015), "Wholesale Fraud Management". Enghouse Network, Ontario, Canada.

[5] A. Semenov (May 31, 2013), “Principles of Social Media Monitoring and Analysis Software”. Jyväskylä, Finland: University of Jyväskylä.

[6] G. Lotan, E. Graeff, M. Ananny, D. Gaffney, I. Pearce and d. boyd (2011), "The Revolutions Were Tweeted: Information Flows during the 2011 Tunisian and Egyptian Revolutions". International Journal of Communications 5, Pages 42-60.

[7] META Group (2011), "Big Data, What it is and why it matters". SAS, [Online]. Available: http://www.sas.com/en_us/insights/big-data/what-is-big-data.html. [Accessed 4 March 2014].

[8] P. Breuer, L. Forina and J. Moulton (2013), "Beyond the hype: Capturing value from big data and advanced analytics". McKinsey & Company.

[9] B. Thalheim and Y. Kiyoki (2012), “Analysis-Driven Data Collection, Integration and Preparation for Visualization.” Frontiers in Artificial Intelligence and Applications EJC 2012. Pages 142-160.

[10] D. Zeng, H. Chen, R. Lusch and S.-H. Li (Nov.-Dec. 2010), "Social Media Analytics

[11] I. Kompatsiaris, D. Gatica-Perez, X. Xie and J. Luo (2013), "Special Section on Social Media as Sensors". Multimedia, IEEE Transactions, vol. 15, no. 6, pp. 1229-1230.

[12] T. Chardonnens (June 2013), “Big Data analytics on high velocity streams”. Switzerland: University of Fribourg.

[13] C. Regina, M. Beyer, M. Adrian, T. Friedman and D. Logan (2013), "Top 10 Technology Trends Impacting Information Infrastructure 2013". Gartner.

[14] S. Singh and N. Singh (2012), "Big Data analytics". in Communication, Information & Computing Technology (ICCICT), International Conference, Pages 12-18.

[15] S. Madden (2012), "From Databases to Big Data". IEEE Internet Computing, vol. 16, no. 3, pp. 4–6.

[16] S. Kaisler, F. Armour, J. A. Espinosa and W. Money (2013), "Big Data: Issues and Challenges Moving Forward". In Proceedings of the 46th Hawaii International Conference on System Sciences, HICSS ’13, pp. 995–1004.

[17] T. Kraska (2013), "Finding the Needle in the Big Data Systems Haystack". IEEE Internet Computing, vol. 17, no. 1, pp. 84–86.

[18] M. Markus (Oct. 2013), “Towards a Big Data Reference Architecture”. Eindhoven, Netherlands: Department of Mathematics and Computer Science, Eindhoven University of Technology.

[19] IBM Corp. (2013), "The Big Data and Analytic Hub" IBM, [Online]. Available: http://www.ibmbigdatahub.com/infographic/four-vs-big-data. [Accessed 14 Feb. 2014].

[20] A. De Mauro, M. Greco and M. Grimaldi (September 2014), "What is big data? A consensual definition and a review of key research topics". American Institute of Physics and related Sciences AIP Conference Proceedings, pp. 97-104, 5–8.

[21] D. M. Boyd and N. B. Ellison (2007), "Social Network Sites: Definition, History,

210–230.

[22] C. C. Aggarwal (2011), “Social Network Data Analytics”, New York: Springer.

[23] H. Ting (2008), "Web Mining Techniques for On-line Social Networks Analysis". in Service Systems and Service Management, 2008 International Conference, pp. 212-224.

[24] D. M. Boyd and N. B. Ellison (2007), "Social Network Sites: Definition, History, and Scholarship". University of California-Berkeley, Michigan State University, USA.

[25] N. Sharma (2011), "Sphere of Influence, The Importance of Social Network Analysis". Solutions for Enabling Lifetime Customer Relationships, Pitney Bowes Software.

[26] A. Katal, M. Wazid and R. H. Goudar (8-10 Aug. 2013), "Big data: Issues, challenges, tools and Good practices". in Contemporary Computing (IC3) Sixth International Conference, Noida, pp. 98-116.

[27] G. Guest, K. M. MacQueen and E. E. Namey (2012), "Applied Thematic Analysis". Sage Publications, Thousand Oaks, Calif.

[28] I. Sommerville (2010), “Software Engineering”, 9th Edition, Addison Wesley.

[29] D. Damian, J. Chisan, L. Vaidyanathasamy and Y. Pal (2005), "Requirements Engineering and Downstream Software Development: Findings from a Case Study". Empirical Software Engineering, vol. 10, no. 3, pp. 255-283.

[30] S. W. Hermansen (2012), "Reducing Big Data to Manageable Portions". in Southeastern SAS User's Group (SESUG) Conference, USA, p.12-25.

[31] Z. Guo and J. Wang (2011), "Information retrieval from large data sets via multiple-winners-take-all". International Symposium on Circuits and Systems (ISCAS) Conference, Rio De Janeiro, pp. 2669-2672.

[32] M. Saunders, P. Lewis and A. Thornhill (2012), “Research Methods for Business

[33] A. Dahanayake and B. Thalheim (2014), “W*H: The conceptual Model for Services”. ESF 2014 workshop on "Correct software for web application", Springer-Verlag.

[34] C. Otero and A. Peter (2015), "Research Directions for Engineering Big Data Analytics Software". Intelligent Systems, IEEE, vol. 30, no. 1, pp. 13-19.

[35] D. Mysore, S. Khupat and S. Jain (2013), “Big Data architecture and patterns, Part 1: Introduction to Big Data classification and architecture”. IBM Corp.

[36] C. Spencer “Big Data scenarios and case studies” IBM Corp.

[37] M. Alswilmi, N. Alnajran and A. Dahanayake (2014), “Conceptual Framework for Big Data Analytics Solutions”. Proceedings of 24th International Conference on Information Modelling and Knowledge Bases (EJC 2014), pp. 111-123.

[38] T. Morzy, M. Wojciechowski and M. Zakrzewicz (2002), "Efficient Constraint-Based Sequential Pattern Mining Using Dataset Filtering Techniques". in Poznan University of Technology, Institute of Computing Science, Poland.

[39] F. Neck and G. A. David (2006), "Challenges and Opportunities in Internet Data Mining". in Carnegie Mellon University, Pittsburgh, PA 15213-3890.

[40] H.-H. Lee and W.-S. Lee (2010), "Consistent collective evaluation of multiple continuous queries for filtering heterogeneous data streams". Knowledge and Information Systems, vol. 22, no. 2, pp. 185-210.

[41] A. Claire and B. Brisset (2013), "Managing Semantic Big Data for Intelligence" Central EuropeSTIDS - CEUR Workshop Proceedings, vol. 1097, pp. 41-47.

[42] F. Daniel (2012), “Extract, Transform, and Load Big Data with Apache Hadoop” White paper. Intel.

[43] Y. W. Zhao, W.-J. van den Heuvel and X. Ye (2013), "Exploring big data in small forms: A multi-layered knowledge extraction of social networks". Big Data, 2013 IEEE International Conference, pp. 60-67.

System,", Social Intelligence and Technology (SOCIETY), 2013 International Conference, pp. 64-71.

[45] M. Nguyen, T. Ho and P. Do (2013), "Social Networks Analysis Based on topic modeling". Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), 2013 IEEE RIVF International Conference, pp. 119-122.

[46] M. Thelwall, D. Wilkinson and S. Uppal (2010), "Data Mining Emotion in Social Network Communication: Gender differences in MySpace". Journal of the American Society for Information Science and Technology, vol. 61, no. 1, pp. 190–199.

[47] D. Hansen, D. Rotman, E. Bonsignore, N. Milic-Frayling, E. Rodrigues, M. Smith and B. Shneiderman (2012), "Do You Know the Way to SNA?: A Process Model for Analyzing and Visualizing Social Media Network Data". in Social Informatics (SocialInformatics), 2012 International Conference, pp. 14-16.

[48] R. Colbaugh and K. Glass (2013), "Analyzing Social Media Content for Security Informatics". in Intelligence and Security Informatics Conference (EISIC), 2013 European, pp. 10-24.

[49] R. T. Khasawneh, H. A. Wahsheh, M. N. Al-Kabi and I. M. Alsmadi (2013), "Sentiment analysis of Arabic social media content: a comparative study". in Internet Technology and Secured Transactions (ICITST), 2013 8th International Conference.

[50] D. Galin (2004), “Software Quality Assurance: From Theory to Practice”. England: Pearson.

[51] S. Pfleeger and J. Atlee (2010), “Software Engineering: Theory and Practice”, 4th edition, Pearson Education.

[52] W. Westfall (2005), “Software Requirements Engineering: What, Why, Who, When, and How”. Software Quality Professional, Vol. 7, No. 4, pages 17-26.

[53] N. Alnajran (2015), "Big Data Analytics and Scenario-based Big Data Collection". Master Thesis, Prince Sultan University, Riyadh, Saudi Arabia.

[55] J. Abuin, J. Pichel, T. Pena, P. Gamallo and M. García (2014), "Perldoop: Efficient execution of Perl scripts on Hadoop clusters". Big Data (Big Data), 2014 IEEE International Conference, pp. 766-771.

[56] T. White (2012), “Hadoop: The Definitive Guide”. UK: O'Reilly.

[57] Apache Hadoop. “The Apache Software Foundation”. Retrieved from http://hadoop.apache.org/. [Accessed 26 April 2014].

[58] J. Dean and S. Ghemawat (2008), “MapReduce: Simplified Data Processing on Large Clusters”. Communications of the ACM, Vol. 51, No. 1, Pages 107-113.

[59] S. Kurazumi, T. Tsumura, S. Saito and H. Matsuo (2012), "Dynamic Processing Slots Scheduling for I/O Intensive Jobs of Hadoop MapReduce". Networking and Computing (ICNC), Third International Conference, Pages 288,292, 5-7.

[60] P. H. Carns, W. B. Ligon III, R. B. Ross and R. Thakur (2000), “PVFS: A parallel file system for Linux clusters”. in Proceedings of 4th Annual Linux Showcase and Conference, Pages 317–327.

[61] M. K. McKusick, and S. Quinlan (2009), “GFS: Evolution on Fast-forward” ACM Queue, Vol. 7, No. 7, Page 10.

[62] K. Shvachko, H. Kuang, S. Radia and R. Chansler (2010), "The Hadoop Distributed File System". Mass Storage Systems and Technologies (MSST), IEEE 26th Symposium, Pages 1,10, 3-7.

[63] X. Qin, H. Wang, F. Li, B. Zhou, Y. Cao, C. Li, H. Chen, X. Zhou, X. Du and S. Wang (2012), "Beyond Simple Integration of RDBMS and MapReduce -- Paving the Way toward a Unified System for Big Data Analytics: Vision and Progress". Cloud and Green Computing (CGC), Second International Conference, Pages 716,725, 1-3.

[64] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard and D. McClosky (2014), "The Stanford CoreNLP Natural Language Processing Toolkit." Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics:

[65] M. Leiter (2014), "How to Choose Social Media Platforms," Melissa Leiter, San Francisco.

[66] P. Meier (2013), "Classification of Social Media Platforms" Delvalle Institute Knowledge Base.

[67] A. Mayfield (2008), “What Is Social Media”. US: iCrossing.

[68] A. DuVander (2013), "How Top Social APIs Use Social Media". Programmable web.

[69] E. Ravenscraft (2013), "Which Social Network Should I Use?". LifeHacker. [Online]. Available: http://lifehacker.com/which-social-network-should-i-use-894808717.

[70] A. Dean (2009), "WhatIs". TechTarget. [Online]. Available: http://whatis.techtarget.com/definition/Facebook. [Accessed 23 May 2014].

[71] A. Bozzuto (2012), "The Difference Between Facebook, Twitter, Linkedin, Google+, Youtube, Pinterest". IMPACT Branding and Design.

[72] J. Taylor (2014), "Choosing a social media monitoring tool". Our Social Time. [Online]. Available: http://oursocialtimes.com/choosing-a-social-media-monitoring-tool/.

[73] M. E. Mármol (2013), "How to choose a Social Media Monitoring tool". eDigital - Digital Marketing Consultants and Trainers in Sydney, Sydney.

[74] J. Baer (2012), "Clearing Clouds of Confusion – the 5 Categories of Social Media Software". Convince and Convert - Digital Marketing Advisors. [Online]. Available: http://www.convinceandconvert.com/social-media-tools/clearing-clouds-of-confusion-the-5-categories-of-social-media-software/.

[75] M. M. Berg (2014), “Modelling of Natural Dialogues in the Context of Speech-based Information and Control Systems”. Kiel: Christian-Albrechts University.

[76] C. D. Ennis (1986), "Conceptual Frameworks as a Foundation for the Study of

25-39.

[77] J. A. Maxwell (2012), “Qualitative Research Design: An Interactive Approach”. 3rd Edition, Sage.

[78] D. Mills (n.d.), “Problem Domain”. Cunningham & Cunningham, Inc. [Online]. Available: http://c2.com/cgi/wiki?ProblemDomain. [Accessed 10 April 2014].

[79] A. Smeda (2010), "A formal definition of software architecture behavioral concepts". Research Challenges in Information Science (RCIS), 2010 Fourth International Conference, pp. 247-256, 19-21. Nice, France.

[80] P. Kruchten (1995), "Architectural Blueprints—The “4+1” View Model of Software

Architecture". IEEE Software, pp. 42-50.

[81] L. Bass, P. Clements and R. Kazman (2003), “Software Architecture in Practice”. Second Edition, Addison Wesley.

[82] N. Rozanski and E. Woods (2011), "Applying Viewpoint and Views to Software Architecture". Addison-Wesley Professional.

[83] M. J. Bates and M. N. Maack (2010), “Encyclopedia of Library and Information Science”. 3rd Edition: Marcel Dekker, Inc.

[84] F. Hogenboom, F. Frasincar and U. Kaymak (2010), "An Overview of Approaches to Extract Information from Natural Language Corpora". 10th Dutch-Belgian Information Retrieval Workshop, pp. 112-126.

[85] F. Ricci, L. Rokach and B. Shapira (2011), “Recommender Systems Handbook”. New York: Springer.

[86] K. Toutanova and C. D. Manning (2000), "Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger". Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), pp. 63-70.

[87] K. Toutanova, D. Klein, C. Manning and Y. Singer (2003), "Feature-Rich Part-of-

259.

[88] J. R. Finkel, T. Grenager and C. Manning (2005), "Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling." 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370.

[89] D. Chen and C. D. Manning (2014), “A Fast and Accurate Dependency Parser using Neural Networks.” Proceedings of EMNLP 2014.

[90] M. Recasens, M. D. Marneffe and C. Potts (2013), “The Life and Death of Discourse Entities: Identifying Singleton Mentions.” In Proceedings of NAACL 2013.

[91] H. Lee, A. Chang, Y. Peirsman, N. Chambers, M. Surdeanu and D. Jurafsky (2013), “Deterministic coreference resolution based on entity-centric, precision-ranked rules.” Computational Linguistics 39(4).

[92] H. Lee, Y. Peirsman, A. Chang, N. Chambers, M. Surdeanu and D. Jurafsky (2011), “Stanford's Multi-Pass Sieve Coreference Resolution System at the CoNLL-2011 Shared Task.” In Proceedings of the CoNLL-2011 Shared Task.

[93] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. Manning, A. Ng and C. Potts (2013), “Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank.” Conference on Empirical Methods in Natural Language Processing.

[94] S. Gupta and C. D. Manning (2014), “Improved Pattern Learning for Bootstrapped Entity Extraction.” In Proceedings of the Eighteenth Conference on Computational Natural Language Learning (CoNLL).

[95] The Stanford Natural Language Processing Group (2011), "Stanford Log-linear Part-Of-Speech Tagger," Stanford University. [Online]. Available: http://nlp.stanford.edu/software/tagger.shtml. [Accessed 4 January 2015].

[96] Natural Language Processing Musings (2011), "Part of Speech Tags," Stanford University. [Online]. Available: http://www.monlp.com/2011/11/08/part-of-speech-tags/. [Accessed 4 January 2015].

Recognizer (NER)". Stanford University. [Online]. Available: http://nlp.stanford.edu/software/CRF-NER.shtml. [Accessed 3 March 2015].

[98] (2011), "A Saving Lincoln Case Study," Hootsuite Inc. [Online]. Available: https://hootsuite.com/resources/case-study/storytelling-with-saving-lincoln.

[99] Centers for Disease Control and Prevention (2014), "MERS-CoV" Ministry of Health. [Online]. Available: http://www.cdc.gov/coronavirus/index.html. [Accessed Feb 2015].

[100] Command and Control Center (2014), "Ministry of Health Institutes New Standards for Reporting of MERS-CoV". Ministry of Health. [Online]. Available: http://www.moh.gov.sa/en/CCC/News/Pages/News-2014-06-03-001.aspx. [Accessed 28 Feb 2015].

[101] D. I. Khan (2014), "Pakistan Taliban splinter group vows allegiance to Islamic State". Reuters.

[102] Shafaqna (2015), "Mufti of Saudi Arabia: Daash is a terrorist not related to Islam". Shafaqna Online News.

[103] Casey (2014), "Daash plans to enter Saudi Arabia during the pilgrimage season". worldanalysis.

[104] J. P. Cavano and M. A. James (1978), "A framework for the measurement of software quality". ACM SIGMETRICS Performance, vol. 7, pp. 3-4.

[105] W. S. Humphrey (2000), "The Personal Software Process". Software Engineering Institute, Pittsburgh.

[106] T. Hill and P. Lewicki (2007), "Designing an Experiment, Power Analysis" in STATISTICS: Methods and Applications, Tulsa, Oklahoma, StatSoft, Inc.

[107] J.M. (2011), "Percentage Change | Increase and Decrease," SkillsYouNeed. [Online]. Available: http://www.skillsyouneed.com/about.html. [Accessed 12 March 2015].

Statistics, Vol.32, No. 7, P. 685-694.

[109] J.C.F. Winter (2013), “Using the Student’s t-test with extremely small sample sizes”, Practical Assessment, Research & Evaluation Journal, Vol.18, No. 10.

[110] M. Bamberger, J. Rugh and L. Mabry (2012), "Chapter 3: Not Enough Money: Addressing Budget Constraints in RealWorld evaluation: working under budget, time, data and political constraints." in RealWorld Evaluation Working Under

APPENDICES

APPENDIX A. GLOSSARY

Architecture: The structure of a software-containing system, including the software and hardware components that make up the system and the interfaces and relationships between those components.

Business Requirements: A high-level business objective of the organization that builds a product or of a customer who procures it.

Constraint: A restriction imposed on the choices available to the user and/or developer for the use, design, and construction of a product.

Functional Requirement: A statement of a piece of required functionality or a behaviour that a system will exhibit under specific conditions.

IBM: The International Business Machines Corporation (IBM) is an American multinational technology and consulting corporation, with headquarters in Armonk, New York. IBM manufactures and markets computer hardware and software, and offers infrastructure, hosting and consulting services in areas ranging from mainframe computers to nanotechnology.

IEEE: The Institute of Electrical and Electronics Engineers. A professional society that maintains a set of standards for managing and executing software and system engineering projects.

Gartner: Gartner, Inc. is an American information technology research and advisory firm, headquartered in Stamford, Connecticut, United States, that provides technology-related insight.

Non-functional Requirement: A description of a property or characteristic that software must exhibit, or a constraint that it must respect, other than an observable system behaviour.

Paper Prototype: A non-executable mock-up of a software system’s user interface using inexpensive, low-tech screen sketches.

Prototype: A partial, preliminary, or possible implementation of a program. Used to explore and validate requirements and design approaches.

Quality Attribute: A kind of non-functional requirement that describes a quality or property of a system. Examples include usability and portability. It describes the extent to which a software product demonstrates desired characteristics, not what the product does.

Requirements: A statement of a customer need or objective, or of a condition or capability that a product must possess to satisfy such a need or objective.

Requirement Attribute: Descriptive information about a requirement that enriches its definition beyond the statement of intended functionality.

Requirement Allocation: The process of apportioning system requirements among various architectural subsystems and components.

Requirement Elicitation: The process of identifying software or system requirements from various sources through interviews, workshops, workflow and task analysis, document analysis and other mechanisms.

Software Development Lifecycle: A sequence of activities by which a software product is designed, defined, built, and verified.

SAS: SAS Institute is an American developer of analytics software based in Cary, North Carolina. SAS develops and markets a suite of analytics software (also called SAS), which helps manage, access, analyse and report on data to aid in decision-making.

Validation: The process of evaluating a work product to determine whether it satisfies customer requirements.

Verification: The process of evaluating a work product to determine whether it satisfies the specifications and conditions imposed on it at the beginning of the development phase during which it was created.

APPENDIX B. HADOOP COMPONENTS

Hadoop Distributed File System (HDFS)

HDFS is the file system component of Hadoop, designed for storing very large files with streaming data access patterns and running on clusters of commodity hardware [56]. HDFS stores file system metadata and application data separately. As in other distributed file systems, such as PVFS [60], Lustre and GFS [61], HDFS stores metadata on a dedicated server called the NameNode. Application data are stored on other servers called DataNodes. All servers are fully connected and communicate with each other using TCP-based protocols [62].
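The separation between metadata and application data can be illustrated with a small in-memory sketch. This is a toy model, not the Hadoop API: the class names `NameNode` and `DataNode` are borrowed from the text, while `write_file`, `read_file`, the 4-byte block size, the replication factor of 2, and the round-robin placement are assumptions made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Holds application data only: block id -> bytes."""
    name: str
    blocks: dict = field(default_factory=dict)


@dataclass
class NameNode:
    """Holds metadata only: which blocks form a file and where they live."""
    files: dict = field(default_factory=dict)      # path -> [block ids]
    locations: dict = field(default_factory=dict)  # block id -> [DataNode names]


def write_file(namenode, datanodes, path, data, block_size=4, replication=2):
    """Split data into blocks and store each block on `replication` DataNodes."""
    block_ids = []
    for index, offset in enumerate(range(0, len(data), block_size)):
        block_id = f"{path}#{index}"
        chunk = data[offset:offset + block_size]
        # Hypothetical round-robin placement, chosen for simplicity.
        targets = [datanodes[(index + r) % len(datanodes)]
                   for r in range(replication)]
        for node in targets:
            node.blocks[block_id] = chunk
        namenode.locations[block_id] = [node.name for node in targets]
        block_ids.append(block_id)
    namenode.files[path] = block_ids


def read_file(namenode, datanodes, path):
    """Ask the NameNode for block locations, then fetch bytes from DataNodes."""
    by_name = {node.name: node for node in datanodes}
    parts = []
    for block_id in namenode.files[path]:
        first_replica = namenode.locations[block_id][0]
        parts.append(by_name[first_replica].blocks[block_id])
    return b"".join(parts)
```

In this sketch the NameNode never stores file contents, and a DataNode never knows which file a block belongs to, which mirrors the division of responsibility described above.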

YARN (MapReduce 2.0)

MapReduce was created by Google mainly to process enormous volumes of unstructured data. MapReduce is a general execution engine that is ignorant of storage layouts and data schemas. The runtime system automatically parallelizes computations across a large cluster of machines, handles failures and manages disk and network efficiency. The user only needs to provide a map function and a reduce function. The map function is applied to all input rows of the dataset and produces an intermediate output that is aggregated by the reduce function later to produce the final result [63].
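The map and reduce roles described above can be sketched with a single-process word count. This is a minimal illustration under stated assumptions, not Hadoop code: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, and the parallelization and failure handling that the real runtime provides are left out.

```python
from collections import defaultdict


def map_fn(row):
    """Emit an intermediate (word, 1) pair for every word in one input row."""
    for word in row.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Aggregate all intermediate values that share one key."""
    return key, sum(values)


def run_mapreduce(rows, map_fn, reduce_fn):
    """Apply map_fn to every row, group by key, then reduce each group."""
    intermediate = defaultdict(list)
    for row in rows:
        for key, value in map_fn(row):
            # Grouping by key stands in for the shuffle step.
            intermediate[key].append(value)
    return dict(reduce_fn(key, values) for key, values in intermediate.items())
```

For example, `run_mapreduce(["big data", "Big Data analytics"], map_fn, reduce_fn)` returns `{"big": 2, "data": 2, "analytics": 1}`. The user supplies only the two functions; everything inside `run_mapreduce` corresponds to work the framework would do.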

In 2010, a group at Yahoo! began to design the next generation of MapReduce. The result was YARN, short for Yet Another Resource Negotiator. YARN addresses the scalability shortcomings of “classic” MapReduce. YARN is more general than MapReduce, and in fact MapReduce is just one type of YARN application. The beauty of YARN’s design is that different YARN applications can co-exist on the same cluster, so a MapReduce application can run at the same time as an MPI (Message Passing Interface) application. It performs the resource management function in Hadoop 2.0 and extends MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models. On the whole, it offers greater benefits for manageability and cluster utilization [56].
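The idea of different application types sharing one cluster can be sketched as a toy resource negotiator. This is a deliberately simplified assumption-laden model and not the YARN API: the `ResourceManager` class below tracks only a single pool of memory, and the `request` method, the application names, and the memory figures are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceManager:
    """Tracks free memory on a shared cluster and grants containers from it."""
    free_mb: int
    granted: dict = field(default_factory=dict)  # application -> containers held

    def request(self, app, containers, mb_each):
        """Grant as many of the requested containers as free memory allows."""
        affordable = min(containers, self.free_mb // mb_each)
        self.free_mb -= affordable * mb_each
        self.granted[app] = self.granted.get(app, 0) + affordable
        return affordable


# A MapReduce job and an MPI job draw on the same pool of cluster memory.
rm = ResourceManager(free_mb=8192)
mapreduce_containers = rm.request("mapreduce-wordcount", containers=4, mb_each=1024)
mpi_containers = rm.request("mpi-simulation", containers=6, mb_each=1024)
```

Here the MapReduce job receives its 4 containers and the MPI job receives the remaining 4 of the 6 it asked for. The negotiator does not need to know which programming model an application uses, which is the sense in which MapReduce becomes just one kind of application on the cluster.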

Other Hadoop Components.

Type of Service: Core

HDFS: Provides scalable and reliable storage for massive amounts of data (data blocks are distributed across cluster nodes) for further processing. It is suitable for applications with large, multi-structured data sets (e.g., web and social data, human-generated logs, and biometric data) that support predictive analysis and pattern recognition. It can be used with batch data processing as well as with data from real-time events (sensors or fraud), even before that data lands on HDFS.

MapReduce: Framework for writing applications that process large amounts of structured and unstructured data in parallel. It decomposes a massive job into smaller tasks and a massive data set into smaller partitions, so that each task processes a different partition in parallel on commodity hardware, reliably and in a fault-tolerant manner.

YARN: Framework for Hadoop data processing that supports MapReduce and other programming models. It handles resource management, security, etc., and allows multiple ways to interact with the data in HDFS (batch with MapReduce, streaming with Apache Storm, interactive SQL with Apache Hive and Apache Tez).

Tez: Generalizes MapReduce to support near real-time processing. It can scale with request load and meet demands for fast response times, providing a suitable framework for near real-time processing systems.

Type of Service: Data

Pig: Platform paired with MapReduce and HDFS for processing large data sets. It performs all of the data processing by compiling its Pig Latin scripts into sequences of MapReduce programs.

Hive: Data warehouse that enables easy data summarization and ad-hoc queries. It also provides a mechanism for structuring semi-structured data (customer logs) and unstructured data (machine-generated and transaction data) and for querying it using an SQL-like language called HiveQL. Hive resides on top of MapReduce and next to Pig.

HBase: A distributed, scalable Big Data store with random, real-time read/write access. An RDBMS is not adequate for storing huge amounts of unstructured data: as the data sets grow, scaling issues arise, since relational databases were not designed to be distributed. HBase, a column-based Not Only SQL (NoSQL) database that allows low-latency, quick lookups in Hadoop, fills this need as a non-relational data store that supports data consistency, scalability and excellent performance.

HCatalog: Provides a centralized way for data processing systems to understand the

In document A Requirements Acquisition Tool Architecture for the Decision Back Approach to Social Media Big Data Capturing (Page 145-170)