A Requirements Specification Framework for Big Data Collection and Capture


A Thesis

Submitted in Partial Fulfillment of the Requirements for the Degree of Masters of Science in Software Engineering

At the College of Computer and Information Sciences at Prince Sultan University

By Nouf A. Al-Najran

February 2015


Acknowledgements

All praises and thanks be to Allah for helping me in the completion of this thesis.

My profound gratitude is expressed to my esteemed and kind supervisor, Prof. Dr. Ajantha Dahanayake. She has given liberally of her time and freely of her expert knowledge, to the great benefit of this research. Both her patience and critical interest at every stage have been a source of great encouragement.

Particular thanks are due to Dr. Areej Al-Wabil and Dr. Musaad bin Muqbil for their support, encouragement, and insightful suggestions.

Also, I would like to thank my competent and skillful teachers who have brought me to the point I am at. Truly, I have been blessed to be taught by the very best.

My greatest thanks and appreciation go to my family, who taught me the value of hard work and dedication. Their belief in me never wavers, and they always lift my spirit with their constant love and support.

I should not neglect to thank my dear husband for his understanding and for the unceasing support he provided throughout my study.


Abstract

This thesis introduces ‘big data scenarios’ to the domain of data collection, because the ad hoc data-gathering processes currently used by most organizations are proving inadequate in a digital world whose volume of information keeps expanding. As a consequence, users are often unable to obtain specific, relevant information from large-scale data collections. Today’s practice tends to collect bulk data that most often (1) contains large portions of useless data and (2) leads to longer analysis time frames and thus a longer time to insight. As this has implications for real-time decision support, it is vital for businesses, organizations, and associations to implement better approaches for scenario-relevant data collection. The premise of this thesis is that big data analytics can only be successful when it is able to digest captured data and deliver valuable information. Therefore, this thesis develops a conceptual model for well-defined, scenario-based big data collection processes through a Requirements Specification Framework. The effectiveness of this framework in improving the data collection process has been validated through an experiment that provides quantitative measures of the relevancy of collected feeds. In the conducted experiment, the ad hoc process of data collection generated 8.5% relevant feeds and 91.5% irrelevant feeds, whereas scenario-based data collection generated 92.5% relevant feeds and 7.5% irrelevant feeds.

Hence, in a time of mass content creation, the Requirements Specification Framework contributes to: (1) the Requirements Engineering domain, based on scenario-based big data collection; (2) collecting data according to scenarios of interest for analysis in (real-time) decision support; (3) the reduction of unnecessary or garbage data collection, which is a huge problem for big data in terms of storage, transportation and analysis time for (real-time) decision support. Therefore, this research mainly contributes to a paradigm shift in big data collection.

Key words: Big Data Scenarios, Information Filtering, Big Data Collection, Big Data Analytics, Scenario-based Data Collection, Software Requirements Engineering.


Research Abstract (translated from Arabic)

This thesis introduces ‘big data scenarios’ to the field of data collection, because the traditional data collection process used by organizations has proven inadequate in a digital world that keeps multiplying and expanding with the unlimited growth of information. As a result, users are often unable to obtain specific, relevant data from large-scale data collections. Instead, they tend to collect masses of data that often (1) contain large portions of useless data and (2) lead to longer data analysis time frames and thus a longer time to reach the desired insights. This can surface irrelevant or unwanted information and thereby harm timely decision making. It is therefore necessary for businesses, organizations, and companies to adopt better methods for obtaining data relevant to a specific scenario. This thesis rests on the premise that the success of big data analytics techniques depends on their ability to digest the captured data and deliver valuable information. Accordingly, it contributes a conceptual model for a well-defined, scenario-based big data collection process through a specification framework. The effectiveness of this framework in improving the data collection process has been validated through an experiment, which provides quantitative measures of the relevance of the collected data. In the experiment, the unplanned data collection process generated 8.5% relevant data and 91.5% irrelevant data, whereas scenario-based data collection generated 92.5% relevant data and 7.5% irrelevant data. Hence, in a time of mass content creation, this framework contributes to: (1) the requirements engineering domain, on the basis of scenario-based big data collection; (2) collecting data according to the scenarios required for (timely) analysis in support of decision making; (3) reducing the collection of unnecessary or garbage data, which is a major problem for big data in terms of storage, transport, and analysis time for timely decision support.

Therefore, this research mainly contributes to a paradigm shift in big data collection.

Keywords: big data scenarios, information filtering, big data collection, big data analytics, scenario-based data collection, software requirements engineering.

LIST OF TABLES

TABLE 2.1: SOFTWARE ENGINEERING APPLICATIONS AND BIG DATA COLLECTION
TABLE 2.2: STUDIES AROUND BIG DATA SCENARIOS AND BIG DATA COLLECTION
TABLE 4.1: BUSINESS DOMAIN AND CORRESPONDING SCENARIOS
TABLE 4.2: CATEGORIES OF BIG DATA ANALYZING TECHNIQUES AND APPLICABILITY
TABLE 6.1: FRAMEWORK APPLICATION ON US PRESIDENTIAL ELECTIONS SCENARIO
TABLE 6.2: FRAMEWORK APPLICATION ON THE EBOLA SCARE IN SA SCENARIO
TABLE 6.3: FRAMEWORK APPLICATION ON AUTO TRAFFIC MANAGEMENT SCENARIO
TABLE 7.1: KEYWORDS AND THEIR OCCURRENCES IN THE AD HOC DATA COLLECTION
TABLE 7.2: KEYWORDS AND THEIR OCCURRENCES IN THE SCENARIO-BASED DATA COLLECTION
TABLE 8.1: THE CONCEPTUAL MODEL FOR SCENARIO-BASED DATA COLLECTION

LIST OF FIGURES

FIGURE 1.1: THESIS STRUCTURE
FIGURE 2.1: HORTONWORKS DATA PLATFORM
FIGURE 3.1: RESEARCH METHOD
FIGURE 4.1: OVERLAPPING SCENARIOS IN A DOMAIN
FIGURE 4.2: POSSIBLE SCENARIOS IN THE BUSINESS AND MEDICAL DOMAINS
FIGURE 4.3: REQUIREMENTS ENGINEERING PHASE IN A BIG DATA SOFTWARE LIFE CYCLE
FIGURE 4.4: DETERMINING BIG DATA CAPTURING TECHNIQUES
FIGURE 5.1: THE REQUIREMENTS SPECIFICATION FRAMEWORK FOR SCENARIO-BASED BIG DATA COLLECTION
FIGURE 6.1: DATA FLOWING THROUGH FLUME CHANNEL
FIGURE 6.2: DATA FLOWING THROUGH SQOOP AND FLUME INTO HDFS
FIGURE 7.1: AD HOC DATA COLLECTION RESULT IN BAR CHART GRAPH
FIGURE 7.2: AD HOC DATA COLLECTION RESULT IN PIE CHART
FIGURE 7.3: A SNAPSHOT OF THE AD HOC DATA COLLECTION FEEDS (I)
FIGURE 7.4: A SNAPSHOT OF THE AD HOC DATA COLLECTION FEEDS (II)
FIGURE 7.5: SCENARIO-BASED DATA COLLECTION RESULT IN BAR CHART GRAPH
FIGURE 7.6: SCENARIO-BASED DATA COLLECTION RESULT IN PIE CHART
FIGURE 7.7: A SNAPSHOT OF THE SCENARIO-BASED DATA COLLECTION FEEDS (I)
FIGURE 7.8: A SNAPSHOT OF THE SCENARIO-BASED DATA COLLECTION FEEDS (II)
FIGURE 8.1: COMPARISON OF THE EXPERIMENTATION RESULTS
FIGURE 9.1: AD HOC PROCESS OF DATA COLLECTION (I)

TABLE 1B: HADOOP ECOSYSTEM COMPONENTS
TABLE 1C: MAPPING W*H MODEL QS TO SCENARIO-BASED DATA COLLECTION QS

FIGURE 1C: THE W*H INQUIRY BASED CONCEPTUAL MODEL FOR SERVICES
FIGURE 2C: THE REQUIREMENTS SPECIFICATION FRAMEWORK FOR SCENARIO-BASED BIG DATA COLLECTION
FIGURE 1D: DISCOVERTEXT DASHBOARD
FIGURE 2D: START A NEW PROJECT
FIGURE 3D: NAME YOUR PROJECT
FIGURE 4D: IMPORT DATA
FIGURE 5D: DATA SOURCES
FIGURE 6D: TWITTER FEED (I)
FIGURE 7D: TWITTER FEED (II)
FIGURE 8D: TWITTER FEED (III)
FIGURE 9D: ARCHIVE MANAGEMENT
FIGURE 10D: LIST OF TWEET FETCHES

LIST OF ABBREVIATIONS

SDLC Software Development Life Cycle
RE Requirements Engineering
NIC National Information Center
HDFS Hadoop Distributed File System
HDP Hortonworks Data Platform
YARN Yet Another Resource Negotiator
NoSQL Not Only SQL
ETL Extract, Transform, and Load
CF Collaborative Filtering
NLP Natural Language Processing
DOM Document Object Model
MIDIS Multi-Intelligence Data Integration Services
DTO Data Transformation Operations
BGP Basic Graph Pattern
BFS Breadth First Search
SNAP Stanford Network Analysis Platform
FIFO First In First Out
MVC Model View Controller
ICT Information and Communication Technology
CDR Call Detail Records
POS Part Of Speech
EVD Ebola Virus Disease
MOH Ministry of Health
RFID Radio Frequency Identification
ITS Intelligent Transportation Systems
MOI Ministry Of Interior

CHAPTER 1: INTRODUCTION

“We have more data than we have skills to turn it into useful knowledge”

1.1 INTRODUCTION

Today, with the explosion of digital data in social media, marketing, healthcare, national security, weather forecasting, and other fields, most enterprises, organizations, and governments are unable to effectively filter and analyze these massive collections of data for timely and informed decision making [2]. This is because separating the relevant and meaningful information, which can reveal hidden patterns, from the available universe of data is a non-trivial task [3]. Therefore, the public as well as the private sector may not be able to cope with the velocity of data collection and may fail to make use of the data for purposes such as real-time decision support [2]. Hence, it is important to move away from the ad hoc process of data collection and develop a better strategy for capturing the useful data that can yield valuable information and insights in a timely manner.

Just as these data volumes require different capturing techniques, they also need special analyzing techniques that make the data meaningful in a way that helps reduce data noise and store only what is needed to answer the useful questions [4]. A structured and effective way to bring the right data to the right analytical technique is to derive a framework that collects data according to the requirements of the analysis. Such a framework is based on a model that helps users ask the right set of questions to define the required outcome of data analytics prior to attempting to capture the data [5].

Therefore, this thesis provides a requirements specification framework leading to a set of questions, organized as a requirements engineering model for data collection, that identify the properties of the data analytics environment in order to yield a meaningful data collection process.

1.2 AREA OF RESEARCH

Due to the diversity and heterogeneity of data structures and formats found in various data sources, such as healthcare, national security, and weather forecasting systems, there is a strong need for a sound data collection approach that aids people in capturing and extracting useful knowledge from different data sources. Hence, there is a need for a user-friendly framework that provides the right questions for capturing useful information from all the available data. The framework introduced in this thesis is inspired by the work of [6], which follows the Zachman framework as an inquiry system for information systems engineering, and the framework of Hermagoras of Temnos used in legal inquiry [7]. Hermagoras of Temnos was a Greek rhetor who established a classical rhetorical heuristic for identifying the crucial issue in a given case, based on a sequence of 6Ws + 1H (who, what, when, where, why, how, and by what means) [6].

Therefore, from the point of view of software requirements engineering for big data capture and collection, this thesis looks into several research fields, such as Big Data Analytical Techniques, Big Data Capturing Techniques, Search Patterns, and Information Retrieval.
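To make the idea of an inquiry-based question set concrete, the following Python sketch records the answers to the seven classical questions for one scenario and reports which questions are still open. It is only an illustration under stated assumptions: the class name `CollectionSpec`, its methods, and the sample answers are hypothetical and are not taken from the framework developed later in this thesis.

```python
from dataclasses import dataclass, field

# The seven classical inquiry questions named in the text.
W_H_QUESTIONS = ("who", "what", "when", "where", "why", "how", "by what means")


@dataclass
class CollectionSpec:
    """Hypothetical container for the answers given for one scenario."""
    scenario: str
    answers: dict = field(default_factory=dict)

    def answer(self, question, text):
        # Reject anything that is not one of the seven questions.
        if question not in W_H_QUESTIONS:
            raise ValueError(f"unknown question: {question!r}")
        self.answers[question] = text

    def unanswered(self):
        """Questions that are still open, in the classical order."""
        return [q for q in W_H_QUESTIONS if not self.answers.get(q)]

    def is_complete(self):
        return not self.unanswered()


# Illustrative answers only; a real specification would be elicited from users.
spec = CollectionSpec("US presidential elections")
spec.answer("what", "public tweets mentioning the candidates")
spec.answer("where", "Twitter")
print(spec.unanswered())  # ['who', 'when', 'why', 'how', 'by what means']
```

A checklist of this kind makes it visible, before any data is captured, which properties of the scenario have not yet been specified.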

1.3 MOTIVATION

In the age of information overload, this thesis is motivated by the vision of ensuring access to the most valuable sources with the least resources. Recently, in addition to other data sources, much attention has been given to user-generated content produced through what is known as Web 2.0 [8]. Decision makers often need to utilize the relevant structured, semi-structured and unstructured data to drive their strategy [9]. Two facts derived from studies have been a source of encouragement and motivation to conduct this research [10]:

• Studies have shown that more data does not necessarily mean more knowledge. In fact, a lot of data can be overwhelming and sometimes leads to the wrong decision.

• You can never work with all the available data. Running after every piece of data will consume a great deal of your resources and leave you paralyzed. So focus on the data that will best serve the scenario involved.

Therefore, reversing the process of data collection, by analyzing the required output (business scenario) in order to determine the relevant input, looks at data capturing from a different angle and contributes to science. It adds a new dimension of knowledge and technology for dealing with big data capture. Thus, this study emphasizes the demand for a well-defined mechanism for developing effective processes that take the maximum value from the available data and bring decision makers closer to extracting value from big data. The need to develop a value-added data collection framework that assists users in understanding what they need to know before attempting to collect the data is the main motivation of this research.

1.4 WHAT IS BIG DATA?

In the digital world, everyone deals with data in one way or another. People communicate through social networks and generate content such as blog posts, photos and videos. Wireless sensors and RFID readers create signals, and servers continuously log messages about what they are doing. Scientists conduct experiments and create detailed measurements, and marketers record information about sales, suppliers, operations, customers, and so on. This rapid growth of data is the reason behind the evolution of big data [11].

According to the leading IT industry research group Gartner [12], big data is defined as: “Big Data are high-Volume, high-Velocity, and/or high-Variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization”. Yet there are two more equally important characteristics to consider, which are Veracity and Value [13].

1.4.1 CHARACTERISTICS OF BIG DATA

Big data is characterized by the following five elements:

• Volume: How big the data is growing. Many factors contribute to the increase in data volume: transaction-based data stored through the years, multi-structured data streaming from social media and mobile devices, and increasing amounts of sensor and machine-to-machine data, with new sources of data emerging every year. This rapid growth of data causes the digital universe to double in size every two years [14].

• Velocity: How fast the data is being generated. Data is streaming in at unimaginable speed. For example, according to eMorpis Technologies [15], every minute we upload 100 hours of video to YouTube. In addition, every minute over 200 million emails are sent, around 20 million photos are viewed and 30,000 uploaded on Flickr, almost 300,000 tweets are sent, and almost 2.5 million queries are performed on Google.

• Variety: Data is no longer only structured data that can be represented in rows and columns. The variation in data today covers source, format, and structure [16]. Data comes from different sources in all types of formats. Structured data, unstructured text documents, email, video, audio, stock ticker data and financial transactions are some examples.

• Veracity: Trustworthiness, validity and quality of the data. According to IBM [17], veracity refers to how much of the data can be trusted when key decisions need to be made on such large volumes collected at high rates of velocity and variety. Paul Miller [18] reported that “a good process will, typically, make bad decisions if based upon bad data”.

• Value: The benefit that big data brings to businesses in terms of better and faster management decisions and improved financial performance [19].
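The Volume and Velocity figures above lend themselves to quick arithmetic. The following Python sketch scales two of the per-minute rates cited in the text to daily totals and applies the two-year doubling claim. The function names are illustrative, and the calculation assumes that the cited rates and the doubling period stay constant, which is a simplification.

```python
MINUTES_PER_DAY = 24 * 60


def per_day(per_minute):
    """Scale a per-minute rate to a daily total, assuming a constant rate."""
    return per_minute * MINUTES_PER_DAY


def growth_factor(years, doubling_period_years=2.0):
    """How many times larger the data is after `years`, if it doubles every period."""
    return 2 ** (years / doubling_period_years)


print(per_day(100))       # hours of video uploaded per day: 144000
print(per_day(300_000))   # tweets per day: 432000000
print(growth_factor(10))  # size multiple after ten years: 32.0
```

Under these assumptions, a doubling period of two years means a 32-fold increase in a decade, which illustrates why collecting everything quickly becomes impractical.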

1.4.2 GENERAL PROBLEMS OF BIG DATA

Big data can provide big opportunities for success [11]. However, as with most emerging technologies, several characteristics of big data make it technically challenging. These general problems or challenges of big data can be grouped in three categories: data, process, and management [16].

Data Challenges

• Volume: The problem is how to deal with the sheer volume of big data in terms of processing and storage.

• Velocity: The problem is how to respond to the flood of information in real time, or at least within the time frame required by the application.

• Variety: The problem is how to deal with the multiplicity of data sources, formats, and structures.

• Veracity: As this is a critical challenge, there are several problems associated with it [16]:
  ◦ How can you cope with the invalidity, untruths, missing values or uncertainty of the data being analyzed?
  ◦ How broad is the coverage of the data available for analysis?
  ◦ How timely are the readings of the values?
  ◦ How can you discover high-quality data among the high volumes of data that are available?

Process Challenges

According to Laura Haas (IBM Research), process challenges include [18]:

• Collecting the data. This challenge is illustrated in more detail later in this chapter, as this research is centered on addressing and enhancing the process of big data capture.

• Integrating the data from multiple sources

• Transforming the data into a format that is feasible for analysis

• Modeling the data

• Visualizing the results of data analysis and sharing the output

Management Challenges

“Many data warehouses contain sensitive data such as personal data. There are legal and ethical concerns with accessing such data. So the data must be secured and access controlled as well as logged for audits” — Michael Blaha [20]. The main management challenges include [18]:

• Data privacy

• Security

• Governance

• Ethics

The associated problems are how to track and ensure that the data is used correctly; how the data is being used, transformed, and derived; and how to manage its lifecycle.

These are the general problems of big data, categorized in three dimensions: data, process, and management.

1.5 DATA COLLECTION

According to [21], the process of data collection is defined as: “The process of gathering and measuring information on variables of interest, in an established systematic fashion that enables one to answer stated research questions, test hypotheses, and evaluate outcomes”.

This research applies the generic definition of data collection to the collection of ‘big’ data. However, just as big data requires different analytical tools and techniques, it also requires non-ad hoc capturing approaches and technologies [4]. Clearly, those huge volumes of continuously generated data are more than conventional technologies can sustain. Hence, a lack of effective processes for information collection and management in organizations adapting to big data solutions can have a negative impact, both financially and in terms of reputation [4].

1.5.1 PROBLEMS OF BIG DATA COLLECTION

Acquiring the data that holds useful information from the tremendous amounts of available data, given the rapid increase of online information, is a non-trivial task [3]. Collecting scenario-relevant data from all the available information sources poses several challenges [16]:

• Integrating multi-disciplinary methods to locate useful data in the large-volume, messy, often schema-less, and complex world of big data.

• Understanding big data analyzing techniques as well as big data capturing techniques in order to select the right one for the scenario being processed, and consolidating the factors that can help reduce unwanted data.

• Developing a simple yet comprehensible and powerful approach to guide and streamline the data collection process, based on the properties of the given scenario.

• Finding experts with the technical knowledge to map the right data to the right analytical technique and execute complex data queries.

1.6 USEFULNESS OF BIG DATA TO ORGANIZATIONS

Today, with the expansion of the adoption of big data, organizations are using big data analytics to benefit their businesses [9]. They are taking advantage of the vast amounts of available information to enhance their process of decision making and performance. Following is an illustration of how some companies are taking advantage of big data [22]:

• Amazon uses big data to build and power its recommender system, which suggests products to people based on their purchase history and clickstream data.

• Samsung Inc. uses big data on its new smart TVs to enhance its content recommendation engine and thus provide customers with more accurate and user-specific recommendations.

• Progressive Insurance Inc. relies on big data to decide on competitive pricing and capture customer driving behavior.

• LexisNexis Risk Solutions Inc. uses big data to help financial organizations and other clients detect and reduce fraud by identifying individuals, including family relationships.

These are some examples of how organizations are using big data to improve their performance.

1.7 REQUIREMENTS ENGINEERING FOR BIG DATA CAPTURE AND COLLECTION

The volume, velocity, variety and veracity of big data have grown tremendously in recent years due to the wide spread of software systems as well as social behaviors [23]. Social behaviors refer to people communicating through social media applications such as Facebook, Myspace, Twitter, Digg, YouTube, and Flickr to express their thoughts and opinions and to connect to each other anytime and anywhere [24]. This exponential growth and the activities around big data in social applications and IT software systems have intensified the need for a well-structured Software Engineering approach to identify the requirements of the big data collection process [25]. One important aspect of Software Engineering when it comes to big data is the capture and collection of relevant data for software systems. Seeking to collect as much data as possible creates a significant processing challenge for data analysts [9].

Therefore, we need to invest in a Requirements Engineering approach that specifies the requirements and structure for gathering and collecting only the needed data according to scenarios, and discarding irrelevant and useless data. The research into a Specification Framework for Big Data Collection and Capture therefore constitutes the Requirements Engineering phase for the big data collection process of a big data software solution [26]. The guiding questions in the framework provide a structured process for system analysts to elicit the big data collection requirements in a more effective and user-friendly manner. This approach to Requirements Engineering is one of the main principles of the Software Development Life Cycle (SDLC) [27].

1.8 PROBLEM STATEMENT

Today, organizations and individuals use computers to solve complex problems for business and many other purposes, which contributes to generating large volumes of digital data. However, these huge volumes of information are growing at a great pace, making it very difficult to retrieve the relevant and valuable information needed for decisions [2].

In the past, excessive data volume was a storage issue, but with decreasing storage costs, organizations tend to acquire and store all the available data through data streaming, whether it matches their organizational needs or not [28]. This creates another issue: datasets become so large that efficiency is a big challenge for current data analytical technologies [14]. This is unfortunate because analysts will consume a lot of time trying to find matching patterns in the data and may not be able to answer important questions in a timely manner. Organizations will be stuck with an ever-growing volume of data and may miss opportunities to act on critical business decisions [2]. Technology allows you to fetch every bit and byte, but not all of the data out there is relevant or useful. Organizations need to separate the meaningful information from the chatter and focus on what counts. Thus, the real issue behind big data value is not only the acquisition and storage of massive volumes of data; rather, it lies in acquiring only what is suspected of being relevant for further analysis [16]. When the amount of data to be analyzed is reduced, storing, merging, analyzing, and governing the different varieties of data is expected to be simpler and more controllable [29].

The size of the web grew from more than 800 million pages in 1999 to 11.5 billion in 2005, and probably exceeds 30 billion today [30]. The amount of information continues to increase at an enormous rate. Therefore, it is imperative that businesses, organizations, and associations find better approaches to information filtering that effectively decrease the information overload and improve the precision of results [29].

Big Data analytics can only be effective when the underlying data collection processes are able to deliver the information relevant to a particular scenario [31], thus improving the usefulness of the analysis results. Therefore, a more powerful mechanism of data capture guidance is needed to avoid wasting time and resources on analyzing irrelevant data.
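One simple way to quantify how relevant a collection is to a scenario is to count the share of collected feeds that mention at least one scenario keyword. The Python sketch below illustrates this idea; it is an assumption-laden simplification and not necessarily the measurement procedure used in the experiment of this thesis. The sample feeds and the keyword are invented for illustration.

```python
def is_relevant(feed, keywords):
    """True if the feed text mentions at least one scenario keyword."""
    text = feed.lower()
    return any(keyword in text for keyword in keywords)


def relevance_share(feeds, keywords):
    """Percentage of feeds that mention at least one keyword."""
    if not feeds:
        return 0.0
    hits = sum(is_relevant(feed, keywords) for feed in feeds)
    return 100.0 * hits / len(feeds)


# Invented sample feeds, for illustration only.
feeds = [
    "Ebola case confirmed, officials say",
    "Best pizza in town tonight",
    "New EBOLA screening at the airport",
    "Traffic is terrible today",
]
keywords = {"ebola"}  # assumed scenario keyword, given in lowercase

print(relevance_share(feeds, keywords))  # 50.0
```

Substring matching is deliberately naive here; a real measure would need to handle word boundaries, synonyms, and context.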

1.9 RESEARCH QUESTIONS AND OBJECTIVES

The study will examine the structure of the current process of data collection and its inadequacy for the huge world of digital data. It will raise significant points that question:

“How can we improve the ad hoc process of data collection that hinders the efficiency of extracting value from large datasets in a timely manner?”

This thesis endeavors to answer this research question through the introduction of a requirements specification framework which can play a significant and potentially profitable role for big data collection processes.

The research objectives are:

a) To provide a problem-centric and user-centric approach that improves data collection in the big data domain over the ad hoc data collection process, which collects huge volumes of data, most of which is irrelevant to the particular business or organizational scenario and inefficient for creating value from big data analytics. It is therefore necessary to have an approach that collects only the data relevant to the scenario under investigation [32].

b) To examine how “Scenario-based Data Collection” can help businesses and organizations make better real-time decisions.

c) To define an approach for analysis-driven data collection based on business scenarios, by determining what output is needed in order to determine the relevant input (Backward Analysis).

d) To develop a framework that provides well-structured processes to locate the appropriate data and increase the precision of the results.
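The backward analysis named in objective (c) can be pictured as starting from the question to be answered, deriving a capture filter from it, and applying that filter while the data streams in, so that irrelevant records are never stored. The Python sketch below shows one possible shape for this. The `Scenario` fields, the `capture` function, and the sample records are hypothetical and serve only to illustrate the direction of reasoning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical scenario description, written before any data is captured."""
    question: str        # the output the analyst needs
    keywords: frozenset  # terms assumed to signal relevance
    sources: frozenset   # sources assumed to be worth capturing


def capture(stream, scenario):
    """Yield only records that match the scenario; the rest are never stored."""
    for record in stream:
        if record["source"] not in scenario.sources:
            continue
        text = record["text"].lower()
        if any(keyword in text for keyword in scenario.keywords):
            yield record


scenario = Scenario(
    question="Where is road congestion building up right now?",
    keywords=frozenset({"traffic", "congestion", "accident"}),
    sources=frozenset({"twitter", "road_sensor"}),
)

# Invented records, for illustration only.
stream = [
    {"source": "twitter", "text": "Heavy traffic on the ring road"},
    {"source": "twitter", "text": "Lovely weather this morning"},
    {"source": "blog", "text": "Traffic laws explained"},
    {"source": "road_sensor", "text": "Accident reported at exit 5"},
]

kept = list(capture(stream, scenario))
print(len(kept), "of", len(stream), "records captured")  # 2 of 4 records captured
```

Because the filter is derived from the required output, changing the question changes what is captured, without touching the capture loop itself.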

Therefore, the goal of this thesis is to define coherent processes for acquiring, from all the available data, only the data relevant to the business question. Data analytics can then be done in smaller time frames, allowing decisions to be made faster and with higher precision. Improving the current data capturing process, so that accurate and useful conclusions can be drawn from it, will contribute to changing the way people collect data and thereby transform decision making in a way that gives businesses the required advantage.

1.10 SCOPE OF THE THESIS

As with most technologies, extracting value from the available universe of information has a core body of processing stages. In terms of big data, these stages are data collection, processing, storage, and analytics [11]. Logically, the data input into later stages of processing is affected by the amount of data acquired and how relevant it is to the scenario being investigated. Therefore, the focus of this research is directed to the primary phase of a big data solution. This phase is inspired by and is similar to the primary phase of a Software Development Life Cycle (SDLC), which is Requirements Engineering (RE). In an SDLC process, RE is used to collect the requirements of a software system from the stakeholders [27]. This research follows RE in collecting the requirements of a data collection process and provides a requirements specification framework that improves on the current ad hoc process of data collection.

Digging into big data storage capacities, processing facilities and different big data mining and analytical techniques is beyond the scope of this research.

1.11 OUTLINE OF THE THESIS

Apart from this introduction, the rest of the thesis is structured in six chapters as outlined in Figure 1.1: Literature Review, Methodology, Scenario-based Data Collection, The Data Collection Requirements Modelling, Case Study, Experiment and Validation, Discussion, and Conclusion,

Chapter 2 consists the literature review. It discusses the related works including some available data reduction approaches, highlighting the innovativeness of this research. Additionally, an overview of the work that has been a source of main inspiration is presented.

Chapter 3 consists the research methodology. It provides a demonstration of the adapted methodology to conduct this research.

Chapter 4 introduces important research concepts and provides a mechanism for planning the data collection process. The framework developed and proposed as the

(26)

12 | P a g e

core of this research is presented in chapter 5 along with supporting materials. Chapter 6 provides an application of the framework on three case studies covering the three big data formats. The framework has been validated through an experiment to prove its effectiveness in chapter 7. Afterwards, the research analysis, a discussion on the framework, its validation, and the conceptual model is provided in chapter 8. Finally, chapter 9 contains the conclusion, limitations of this research and further research directions.

FIGURE 1.1THESIS STRUCTURE

This chapter presented an introduction and overview of this thesis. It mainly provided a brief glimpse of the research areas and motivation behind this study, it introduced the phenomena of big data and the general associated problems. It identified the process of data collection in relation to big data, and the challenges of big data collection. Moreover, a view on uses of big data in some organizations is presented, the big data capture and collection from the angle of Software Engineering has been discussed, and the problem statement was illustrated along with the research question, objectives and the scope.

[Figure 1.1 chapter sequence: Thesis Introduction, Literature Review, Research Methods, Scenario-based Data Collection, Requirements Modelling, Case Study, Experiment & Validation, Discussion, Conclusion]


CHAPTER 2: LITERATURE REVIEW

“There is no data on the future”


2.1 INTRODUCTION

Software Engineering and its applications through information technology are a subject of intense discussion around the globe, and a large number of scientific studies in this discipline have been published on the Web [34]. Nothing seems to stand still in this area: as soon as one work is developed, another comes out to supplant it. In terms of ‘big data collection’, much research has been conducted, but there is no clear and sufficient information on how to determine relevancy within structured, semi-structured and unstructured data across the whole available universe of information [11].

In this chapter, an overview of the related work is presented, highlighting the value and worthiness behind this research and how it differs from other contributions. It states the research innovation and main source of inspiration in conducting the core research of this thesis.

2.2 RELATED WORKS

2.2.1 RELATION OF SOFTWARE ENGINEERING APPLICATIONS TO BIG DATA COLLECTION

Application: Software Requirements Engineering
Description: Organizations will not get the software they need if the software requirements are not right from the very beginning [35]. “The hardest part of building a software system is deciding precisely what to build.” This illustrates why Requirements Engineering is so important [35].
Relation to Big Data Collection: The proposed Requirements Specification Framework for scenario-based big data collection contributes to Requirements Engineering, as it provides a structured set of questions to assist users in identifying the requirements of the data collection process (in the Big Data domain) based on the scenario of interest, and therefore in collecting the right data.

Application: Reverse Engineering
Description: Reverse Engineering can be applied to re-specify a system for re-implementation [37]. The system’s specifications may be reverse engineered and provided as an input to the requirements specification process for system replacement. In re-engineering, the system may be restructured and re-documented without changing its functionality, in order to support frequent maintenance [37].
Relation to Big Data Collection: The proposed Requirements Specification Framework for scenario-based big data collection contributes to Reverse Engineering through Backward Analysis in data collection. It provides a means of analyzing the properties of the scenario (business problem) of interest and determining, prior to the actual data collection process, the relevant elements which, when collected, will probably reveal hidden patterns.

Application: Software Process Improvement
Description: Software, in its development process, requires continuous improvement in order to ensure quality products [38]. In a competitive industry, companies tend to hire professionals with multiple skills, implement new technologies and adopt new methods, standards and techniques to improve their processes.
Relation to Big Data Collection: In big data applications, domain experts make use of all the available data to make informed decisions and leverage their business strategy [38]. The proposed Requirements Specification Framework for scenario-based big data collection contributes to Process Improvement, as it improves the process of decision making. Applying the proposed framework generates datasets that are relevant to the scenario of interest, which require less processing and analysis time and therefore less time to insights (real-time decision support).

TABLE 2.1. SOFTWARE ENGINEERING APPLICATIONS AND BIG DATA COLLECTION

2.2.2 HADOOP – THE BIG DATA MANAGEMENT FRAMEWORK

This section provides an overview of Apache Hadoop as a ‘big data processing framework’, some of whose components are revisited in later chapters (see Appendix B for more information on Hadoop’s core components). The aspects are explained here in a highly simplified manner. A detailed description of them can be found in [39-50].

APACHE HADOOP

Hadoop is the name that the son of its creator, Doug Cutting, gave to his stuffed toy elephant. Cutting was looking for a name that was easy to say and stood for nothing in particular [39].


Hadoop provides a distributed file system (HDFS) and a framework for capturing, processing and transforming very large data sets using the MapReduce [42] paradigm. The important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, and the execution of application computations in parallel, close to their data. Hadoop is an Apache project; all components are available via the Apache open source license. Yahoo! has developed and contributed to 80% of the core of Hadoop [44].

Although Hadoop is best known for MapReduce and HDFS, the term is also used for a family of related projects that fall under the umbrella of infrastructure for distributed computing and large-scale data processing [39]. This section briefly explains the core components of the Hadoop ecosystem: HDFS (storage), and MapReduce 2.0 or YARN (resource management and data processing). The other components will be summarized in a table at the end of this section. The components are described as packaged in the Hortonworks Data Platform (HDP) [45], an open source distribution powered by Apache Hadoop. HDP provides the actual Apache-released versions of the components, with all the bug fixes necessary to make the components interoperable in a production environment (see Figure 2.1).

FIGURE 2.1. HORTONWORKS DATA PLATFORM [45]

2.2.2.1 Hadoop Distributed File System (HDFS)

HDFS is the file system component of Hadoop, designed for storing very large files with streaming data access patterns and running on clusters of commodity hardware [39]. HDFS stores file system metadata and application data separately. As in other distributed file systems, such as PVFS [46], Lustre [47] and GFS [48], HDFS stores metadata on a dedicated server, called the NameNode. Application data are stored on other servers called DataNodes. All servers are fully connected and communicate with each other using TCP-based protocols [49].
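The separation between metadata and application data can be illustrated with a minimal in-memory sketch in Python. This is a toy model and not the HDFS API: the class names `ToyNameNode` and `ToyDataNode`, the four-byte block size and the round-robin placement are assumptions made purely for illustration, and replication and network communication are left out.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataNode:
    """Holds application data only: block id -> bytes."""
    name: str
    blocks: dict = field(default_factory=dict)


@dataclass
class ToyNameNode:
    """Holds metadata only: file path -> ordered (block id, DataNode name) pairs."""
    block_size: int = 4  # hypothetical tiny block size, chosen for readability
    files: dict = field(default_factory=dict)
    next_block_id: int = 0

    def write(self, path, data, datanodes):
        """Split data into blocks, place them on DataNodes, record the locations."""
        locations = []
        for offset in range(0, len(data), self.block_size):
            node = datanodes[len(locations) % len(datanodes)]  # round-robin placement
            block_id = self.next_block_id
            self.next_block_id += 1
            node.blocks[block_id] = data[offset:offset + self.block_size]
            locations.append((block_id, node.name))
        self.files[path] = locations

    def read(self, path, datanodes):
        """Look up block locations in the metadata, then fetch bytes from DataNodes."""
        by_name = {node.name: node for node in datanodes}
        return b"".join(by_name[name].blocks[block_id]
                        for block_id, name in self.files[path])


nodes = [ToyDataNode("dn1"), ToyDataNode("dn2")]
namenode = ToyNameNode()
namenode.write("/feeds/sample.txt", b"hello hdfs", nodes)
restored = namenode.read("/feeds/sample.txt", nodes)
```

In this sketch the NameNode object never stores file contents; it only knows which block lives on which DataNode, which mirrors the division of responsibilities described above.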

2.2.2.2 YARN (MapReduce 2.0)

MapReduce was created by Google mainly to process large volumes of unstructured data. MapReduce is a general execution engine that is ignorant of storage layouts and data schemas. The runtime system automatically parallelizes computations across a large cluster of machines, handles failures and manages disk and network efficiency. The user only needs to provide a map function and a reduce function. The map function is applied to all input rows of the dataset and produces an intermediate output that is later aggregated by the reduce function to produce the final result [50].
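The division of labour between the user-supplied functions and the runtime can be sketched with a word count in plain Python. This is a sequential illustration of the programming model only, not Hadoop code: the names `map_fn`, `reduce_fn` and `run_mapreduce` are hypothetical, and the parallelization and failure handling that a real runtime provides are omitted.

```python
from collections import defaultdict


def map_fn(row):
    """Map: emit a (word, 1) pair for every word in one input row."""
    return [(word.lower(), 1) for word in row.split()]


def reduce_fn(key, values):
    """Reduce: aggregate all intermediate values that share a key."""
    return key, sum(values)


def run_mapreduce(rows, map_fn, reduce_fn):
    """Sequential stand-in for the runtime: map, group by key, then reduce."""
    grouped = defaultdict(list)
    for row in rows:
        for key, value in map_fn(row):
            grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


rows = ["big data collection", "big data capture"]
counts = run_mapreduce(rows, map_fn, reduce_fn)
```

Only `map_fn` and `reduce_fn` are specific to the word-count problem; `run_mapreduce` plays the role of the engine, which is why the same engine can serve many different jobs.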

In 2010, a group at Yahoo! began to design the next generation of MapReduce. The result was YARN, which addresses the scalability shortcomings of “classic” MapReduce. YARN is more general than MapReduce; in fact, MapReduce is just one type of YARN application. The beauty of YARN’s design is that different YARN applications can co-exist on the same cluster, so a MapReduce application can run at the same time as an MPI (Message Passing Interface) application [39]. YARN performs the resource management function in Hadoop 2.0 and extends MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models [45], which brings great benefits for manageability and cluster utilization [39].

2.2.3 INNOVATIVE BIG DATA COLLECTION APPROACHES

In this section, several related works and innovative approaches for data collection and integration are reviewed. The aim is to provide an understanding of big data challenges, the current state of the art in data collection, and how this research contributes to a shift in the domain of big data collection and analytics.

In [51], the authors emphasized that data analysis based on spatial and temporal relationships yields new knowledge discovery in multi-database environments. They developed a novel approach to data analysis by turning the analysis task upside down: in this approach, the analysis task drives the features of the data collectors. These collectors are small databases which collect the data of interest. To illustrate their idea, they surveyed the processes and tools used to analyze the traffic behavior of passengers in the Tokyo metropolitan railway environment. Moreover, they presented a data integration method for heterogeneous legacy databases that combines equality, similarity, topological relationships, directional relationships and distance relationships for spatial and temporal data.

A.-C. Boury-Brisset [52] proposed a framework that facilitates the integration of heterogeneous unstructured and structured data, enabling Hard/Soft fusion and preparing for various kinds of analytics exploitation. It also provides timely and relevant information to the analyst through intuitive search and discovery mechanisms. The work describes the design and implementation of a prototype for a scalable MIDIS, based on a flexible data integration approach that makes use of Semantic Web and Big Data technologies.

The white paper published by Intel [53] walks through the challenge of extracting big data from multiple sources. It explains how the Hadoop infrastructure can contribute to the process of big data ETL, and it illustrates, from a technical point of view, the process of loading different data formats from multiple data sources into Hadoop’s warehouse. However, it does not address the idea of reducing useless data capture or of producing real-time management decisions.

IBM [54] provides a means of classifying big data business problems according to specified criteria, together with a pattern-based approach that facilitates the task of defining an overall big data architecture.

Their idea of classifying data in order to map each problem with its suitable solution pattern provides an understanding of how a structured classification approach can lead to an analysis of the need and a clear vision of what needs to be captured.

Moreover, IBM has presented several real-life samples of big data case studies in [55]. From the two previous contributions of IBM in the field of big data, the idea of scenario analysis for a structured approach to big data collection has emerged.


The authors in [56] studied different big data types and problems. They developed a conceptual framework that classifies big data problems according to the format of the data that must be processed. It maps the big data types to the appropriate combinations of data processing components. These components are the processing and analytic tools needed to generate useful patterns from each type of data.

In [32], Nakanishi emphasizes that most current data analytics and data mining methods are insufficient for the big data environment. The author therefore designed and proposed a model that creates axes for correlation measurement in big data analytics. This model maps the Bayesian network to measure correlation mutually in the coordination axes. It contributes to a shift in the domain of big data analytics.

2.2.4 DATA REDUCTION APPROACHES

Several approaches and technologies discussed in the literature may help to limit or reduce unwanted data. Some of these are:

• Visualization and manual Data Collection [57]. However, several challenges emerge from this process. These include the possibility of misses, false alarms and errors in categorizing the data, and the process can be very time consuming.

• Machine Learning and Data Mining techniques [58]. However, data mining can only be applied to structured data that can be stored in a relational database.

• Collaborative Filtering (CF) is a common web technique for providing personalized recommendations, such as the ones generated by Amazon (based on the user profile and transaction history). In spite of the technique’s effectiveness, it raises privacy issues, as some customers prefer not to have their preferences or habits widely known, along with other associated challenges such as data sparsity, scalability, and synonymy [59].

• The Contextual Approach uses semantic technologies such as NLP, annotation, and classification to handle information integration (depending on the context of the web page at that moment in time) and the querying of distributed data. For query representation, the SPARQL language is specifically designed for semantic technology and enables the construction of sophisticated queries to search for different types of data [60]. This approach is efficient in terms of its high precision in controlling unwanted data, as it takes into account important factors such as keywords, synonyms and antonyms. However, it requires a different infrastructure and highly skilled experts to deal with the complicated technology.
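As an illustration of the Collaborative Filtering item in the list above, the following is a minimal sketch of user-based CF in Python. It is one common variant (cosine similarity with a similarity-weighted average) and is not taken from [59]; the ratings, user names and function names are hypothetical.

```python
from math import sqrt

# Hypothetical ratings (user -> item -> score); not data from any cited study.
ratings = {
    "alice": {"book": 5, "lamp": 1, "pen": 4},
    "bob": {"book": 4, "lamp": 1, "pen": 5, "desk": 5},
    "carol": {"book": 1, "lamp": 5, "desk": 1},
}


def cosine(u, v):
    """Cosine similarity over the items that both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    pairs = [(cosine(ratings[user], other), other[item])
             for name, other in ratings.items()
             if name != user and item in other]
    total = sum(sim for sim, _ in pairs)
    return sum(sim * score for sim, score in pairs) / total if total else None


score = predict("alice", "desk")
```

Because alice's ratings resemble bob's far more than carol's, the predicted score for "desk" is pulled towards bob's rating. When no other user has rated an item, `predict` returns `None`, which is a small-scale view of the data sparsity challenge mentioned above.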

More studies on big data scenarios and big data collection approaches are described in Table 2.2.

TABLE 2.2. STUDIES AROUND BIG DATA SCENARIOS AND BIG DATA COLLECTION

Authors: W. C. Wesley, B. J. David and K. Hooshang [61]
Study Description: This research describes the ineffectiveness of general queries in addressing scenario-specific information gathering. It calls for a scenario-based approach for information retrieval.
Tools, Languages, Approach: Scenario-based proxies, context-sensitive navigation and matching, content correlation of documents and user models.
Findings: Propose a medical digital library that supports scenario-specific and user-tailored information retrieval.

Authors: V. Sitalakshmi and K. Sadhana [62]
Study Description: Addresses the challenge of retrieving text, barcodes and images (unstructured data) that is relevant, pertinent and novel.
Tools, Languages, Approach: Intelligent Image Retrieval components; Intelligent Information Retrieval components; Recommender component.
Findings: The development of a recommender system framework that combines data relevance from multiple sources. The framework has been evaluated and proved highly effective.

Authors: R. Sanjay [63]
Study Description: An approach that aims to acquire and store unstructured data by utilizing Hadoop components.
Tools, Languages, Approach: Hadoop components.
Findings: The development of a big data management system that includes data acquisition, organization and analysis.

Authors: Z. Z. A. Siti, M.D. Noorazida and H. H. Azizul [64]
Study Description: This research contributes to the approach of classifying and capturing unstructured web data and to the efficiency of a multimedia database in storing this sort of data.
Tools, Languages, Approach: Document Object Model (DOM) tree for the classification process; XML for data transmission from the web into the multimedia database.
Findings: The development of an interface that allows people to extract meaningful multimedia data. The tool will extract useful information from the specified URL.

Authors: A. K. Craig and S. Pedro [65]
Study Description: Described an approach to building and executing integration and restructuring plans to support analysis and visualization tools on very large and diverse datasets.
Tools, Languages, Approach: Built a comprehensive set of Data Transformation Operations (DTO) including structured information and semi-structured data.
Findings: The proposed approach will enable developers to rapidly and correctly prepare data for analysis and visualization tools and link the output of one tool to the input of the next, all within the big data environment.

Authors: H. Olaf, B. Christian and C. F. Johann [66]
Study Description: Introduced an approach to discover data that might be relevant for answering a query during the query execution itself.
Tools, Languages, Approach: SPARQL query language in the context of Basic Graph Pattern (BGP) matching over a fixed set of RDF graphs.
Findings: The more links exist, the more complete the results that can be expected, because more relevant data might be discovered.

Authors: S. A. Catanese, P. De Meo, E. Ferrara, G. Fiumara and A. Provetti [67]
Study Description: Developed a set of tools to analyze specific properties of social-network graphs, among others degree distribution, centrality measures, scaling laws and distribution of friendship.
Tools, Languages, Approach: Breadth-first search (BFS) using a FIFO queue.
Findings: The analysis of the collected datasets was conducted by exploiting the functionalities of the Stanford Network Analysis Platform library (SNAP), which provides general purpose network analysis functions.

Authors: T. Cao and Q. Nguyen [68]
Study Description: The authors proposed a semantic approach for searching tourist information and generating a travel itinerary.
Tools, Languages, Approach: An ontological model for the representation of tourist resources as well as the traveler’s profile; an algorithm for generating a travel itinerary that combines semantic matching with the ant colony optimization technique.
Findings: An experiment was conducted to show that the proposed algorithm generates a travel itinerary relevant to both criteria of itinerary length and user interest.

Authors: B. Thalheim and Y. Kiyoki [51]
Study Description: Developed a novel approach to realize dynamic data integration and analysis among heterogeneous databases.
Tools, Languages, Approach: Model-view-controller (MVC).
Findings: Data analysis based on spatial and temporal relationships leads to new knowledge discovery in multi-database environments.

Authors: A.-C. Boury-Brisset [52]
Study Description: This research makes use of big data technologies, ontological models and semantic-based analysis to address the challenge of transforming the overwhelming amounts of sensed data into useful, actionable intelligence in a timely manner.
Tools, Languages, Approach: R&D intelligence data integration platform MIDIS.
Findings: Presented ongoing work on the development of a scalable and flexible platform through experimenting with recent big data technologies.

As one of the most important elements of successful data collection is the ability to deliver relevant information and services in real time, a simple and comprehensible approach that relies on the properties of the need is essential to control the data collection process.

2.3 RESEARCH INNOVATION

Indeed, big data scenarios play a vital role in the process of collecting the relevant data. Much research has been conducted around big data scenarios and around data collection [55], [52]. However, there is no clear and sufficient information that links the two fields together. Therefore, the innovativeness of this research lies in the development of a scenario-based big data collection framework that serves as the Requirements Engineering phase in a software life cycle. The framework links the two aspects together to provide a well-defined approach for identifying the properties of the scenario context in which the data collection process will take place. This research studies the requirements specification of the big data collection process and tailors it more closely to the business needs, in order to decrease the analysis time and increase the value of the results by enabling faster management decisions.

2.4 RESEARCH INSPIRATION

The main inspiration of this thesis comes from the W*H Conceptual Model for Services [6]. In their research, the authors studied the concept of ‘services’ as a design artifact. They aimed to bridge the gap between the main service design initiatives and their abstraction level interpretations. To address their research goal, the authors developed an inquiry-based conceptual model for service systems design. This model formulates the right questions that specify service systems innovation, design and development.

This chapter provided an overview of some Software Engineering applications as well as the big data framework Hadoop. It reviewed multiple related works in the area of ‘big data collection’ and, in doing so, highlighted the uniqueness of this thesis's contribution: an approach that bases big data collection on the analysis of the big data scenario being addressed. The main inspiration for the development of the core part of this thesis was also presented.


CHAPTER 3: RESEARCH METHODS

“Data is the new oil. We need to find it, extract it, refine it, distribute it and monetize it”


3.1 INTRODUCTION

In this chapter, the research method is outlined. The research design, philosophy, and strategy are presented. The techniques and data analysis methods used in analyzing the data and providing results are illustrated as well. Moreover, the instruments and procedures used to conduct the experimental work are discussed.

3.2 RESEARCH METHODS

3.2.1 RESEARCH DESIGN

This research is designed to develop a Requirements Specification Framework for scenario-based big data collection, based on conceptual modeling that follows the principles of design science [70]. The outcome of this research is a requirements specification framework, the result of a comprehensive analysis of subjective information and the categorization of data. The research also relies on quantitative analysis in order to validate the proposed framework. Thus, a mixed research design combining qualitative and quantitative methods of investigation is followed in this research.

3.2.2 RESEARCH PHILOSOPHY

This research is associated with an interpretive philosophy [71]. This is because it needs to make sense of the subjective and socially constructed meanings expressed about the concepts under study [71]. It commences with an inductive approach, where the data gathered is analyzed and used to develop a richer theoretical perspective than what already exists in the literature.

3.2.3 RESEARCH STRATEGY

The research is exploratory in nature. It explores the subject and allows for the development of knowledge.

• It is cross-sectional in nature.

• It supports gathering a more in-depth contextual understanding of the proposed framework in order to address the research question and meet the objectives.

• Real-life case studies are carried out to evaluate and examine the proposed framework and its applicability in fulfilling its purpose.

• An experiment is conducted in order to evaluate and validate the effectiveness of the framework in providing relevant data according to the given scenario, compared to an ad hoc process of data collection. The statistical results of the experiment establish the validity of the research in its answer to the research question.
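The comparison described in the last point comes down to a simple proportion: the share of collected feeds judged relevant under each collection process. The following minimal Python sketch shows this measure, assuming that each feed has already been labelled as relevant or irrelevant by a reviewer. The function name `relevance_share` and the sample labels are hypothetical and are not the data of the experiment.

```python
def relevance_share(labels):
    """Return the percentage of feeds labelled relevant (True)."""
    if not labels:
        raise ValueError("no feeds were collected")
    return 100.0 * sum(labels) / len(labels)


# Hypothetical labels for two small collections (True = relevant feed).
ad_hoc = [True, False, False, False, False, False, False, False, False, False]
scenario_based = [True, True, True, True, True, True, True, True, True, False]

ad_hoc_share = relevance_share(ad_hoc)
scenario_share = relevance_share(scenario_based)
```

The share of irrelevant feeds is the complement, `100.0 - relevance_share(labels)`, so one number per process is enough to compare the two.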

3.2.4 RESEARCH TECHNIQUES AND DATA ANALYSIS

This is mixed methods research. It uses a variety of data collection techniques and analytical procedures to develop and validate the framework. In order to maximize the validity and trustworthiness of the findings, the research uses a hybrid access type to gather a richer set of data on the related works. The hybrid access data collection method refers to collecting the data and materials through different access types, such as traditional access and internet access [71].

• The primary source of data is literature exploration through in-depth internet access, going through various relevant publications and white papers.

• Supporting data is collected through traditional access and conversations with interested participants at local as well as international conferences. In addition, observations of several companies and meetings with experts such as Mr. Joseph Kambourakis¹, through unstructured verbal interviews, took place in March 2014 (see Appendix A for a brief description of the unstructured interview). The choice of companies was determined by ease of accessibility, reputation, and level of involvement in this field.

3.3 RESEARCH INSTRUMENTS AND PROCEDURES

This research attempts to provide a requirements specification framework for scenario-based data collection. In order for this framework to be validated for effectiveness, some tools need to be available to aid in the framework evaluation process.

¹ Mr. Joseph Kambourakis, EMC data scientist (‘E’, ‘M’, ‘C’ are the initials of the corporation founders,


3.3.1 EXPERIMENT TOOLS

DiscoverText

DiscoverText is a powerful and reliable web-based software application launched by Texifter. It enables the collection of text from social media and a variety of other sources. In addition to data collection, the software is designed to improve standard research, government and business processes. It provides collaborative text analytics solutions tailored to the user’s specific needs [72].

With DiscoverText it is possible to ingest hundreds of thousands of items from social media, email and electronic document repositories. This advanced social search, which leverages metadata, networks, credentials and filters, will change the way users interact with text data over time [72]. It helps organizations to aggregate customer feedback from many public and private sources and to generate key insights for better business processes. DiscoverText has many other text analysis and storage features to sift and sort textual data. However, the experiment in this research is interested only in the data collection feature of this tool and does not go through its other text mining capabilities.

Reasons behind the selection of DiscoverText for performing the validation:

• It generates reliable and accurate results in terms of data analysis, pattern matching and consistency [72].

• Among information retrieval and text analytics tools, DiscoverText provides a simple and user-friendly interface that does not require intensive training or technical expertise.

• Re-inventing the wheel by implementing a module to import and aggregate data would require considerable time and effort. The effort lies in [73]:

1. Understanding and applying specialized programming languages and data formats, such as Python, R and JSON.

2. Integrating with different social media infrastructures such as Twitter, Facebook and Path, and handling security issues such as acquiring authentication tokens to fetch feeds.
