
7. Evaluation

7.1 Validation against the Requirements

The developed artefact is evaluated against the defined requirements by analysing how well it satisfies each of them. After implementation, each requirement is marked with a status indicating whether it has been completed, partially completed, or not completed. Each requirement is examined to justify its status and to identify what needs to be improved.

Table 4. Evaluation against requirements

Requirement ID = Status: Completed (C) / Partially completed (PC) / Incomplete (I)

R-1 = C R-2 = C R-3 = PC R-4 = C R-5 = C

R-6 = C R-7 = C R-8 = C R-9 = C R-10 = PC

R-1: The included data sources should be available for crawling, i.e. they should be open data.

Status: Completed

Evaluation: The identified data sources are open data, which means that they can be reused for any purpose. The data from these sources is provided either through APIs or web pages. The access points of the data sources are also public, enabling the user to download the data at any time. The data from these sources can be scraped for further use; however, this requires extra effort to develop custom scripts.

R-2: The artefact to be implemented for this study should allow the user to collect data from data providers into the platform, irrespective of their format.

Status: Completed

Evaluation: The developed platform is capable of collecting data from heterogeneous sources with diverse representation formats. The current version of the platform allows the user to develop custom scripts to extract data programmatically. Moreover, data sources with APIs can be configured with a push-based mechanism to collect data as soon as it is updated.

R-3: Given the different update frequencies of the data sources in this study and the network exceptions that may occur, the chosen data collection technologies should be fault-tolerant. Moreover, the platform should be available at all times for data collection.

Status: Partially completed

Evaluation: The chosen technological solution, Apache Flume, which accumulates data from heterogeneous sources with different acquisition frequencies, is reliable and fault-tolerant. In case of failure, lost data can be recovered from a checkpoint created by the Flume agent. The custom scripts that extract data from the data providers are scheduled with cron and store files in a spool directory, from where Flume agents fetch the data and write it to HDFS. However, the accuracy of the data still cannot be guaranteed: for example, if a data provider changes the structure of its data, advanced techniques and solutions are required to address this challenge. The current approach to data ingestion can be improved in the custom scripts so that the user is notified if the structure of the data changes. Moreover, adding Apache Kafka could add value to the current implementation through its schema validation functionalities.
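The improvement suggested above, notifying the user when the structure of the incoming data changes, could look roughly like the following sketch. It is not the script used in this study: the field names in EXPECTED_FIELDS, the function names, the separate staging directory, and the use of print as the notification channel are all assumptions made for illustration. The file is written to the staging directory first and then moved, because the Flume spooling directory source expects files to be complete and unchanged once they appear in the spool directory.

```python
import json
import os
import tempfile
import time

# Hypothetical field names; the real set depends on the data provider.
EXPECTED_FIELDS = {"station_id", "timestamp", "value"}


def find_schema_problems(records, expected_fields=EXPECTED_FIELDS):
    """Return a list of problems; an empty list means the structure matches."""
    problems = []
    for index, record in enumerate(records):
        fields = set(record)
        missing = expected_fields - fields
        unexpected = fields - expected_fields
        if missing:
            problems.append(f"record {index}: missing fields {sorted(missing)}")
        if unexpected:
            problems.append(f"record {index}: unexpected fields {sorted(unexpected)}")
    return problems


def write_to_spool(records, spool_dir, staging_dir, notify=print):
    """Validate records, then move a finished file into the spool directory.

    Returns the final file path, or None if the structure check failed.
    """
    problems = find_schema_problems(records)
    if problems:
        # Assumed notification channel; could be e-mail or a log alert instead.
        notify("Data structure changed, file not ingested:\n" + "\n".join(problems))
        return None

    # Write to staging first so the Flume agent never sees a partial file.
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")

    # A time-based name avoids reusing a file name inside the spool directory.
    final_path = os.path.join(spool_dir, f"batch-{time.time_ns()}.json")
    os.replace(tmp_path, final_path)  # atomic when both paths share a filesystem
    return final_path
```

A cron-scheduled extraction script would call write_to_spool with the records it has just fetched; a batch whose fields differ from the expected set is held back and reported instead of being written to HDFS.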

R-4: The platform should allow the user to store data in the platform in raw format, both through data ingestion tools and manually.

Status: Completed

Evaluation: The current version of the platform allows the user to ingest data into HDFS using Apache Flume. Moreover, the platform also supports integrating other data ingestion tools such as Apache Kafka and Apache NiFi. To accommodate the requirement of inserting data manually, Hue was configured. Its user-friendly interface supports adding files to HDFS regardless of their format. Both approaches were tested, and the data currently in the platform was ingested using them.

R-5: As the data in this study will grow over time, the developed platform should be scalable.

Status: Completed

Evaluation: The current storage capacity of the platform is sufficient to accommodate the data for this study. However, as the data grows over time, or if more data is to be integrated in the future, the platform can be scaled in terms of storage, computational capacity, and other resources.

R-6: The platform should allow pre-processing of the data stored in the platform, including cleaning, transformation, and others.

Status: Completed

Evaluation: The added data processing and analysis solution, Apache Spark, allows various kinds of pre-processing to be performed, such as feature extraction, data cleaning, data transformation, data normalization, and others. To verify the integrated solution, the data included in this study was processed using its user-friendly Python-based libraries, which fulfilled the requirements. Moreover, the platform is capable of integrating other processing solutions.

R-7: The platform should be developed considering the need for both batch and real-time processing and should enable the use of different standard libraries for performing the analysis.

Status: Completed

Evaluation: The integrated data analysis solution, Apache Spark, supports both batch and real-time data analysis. Its ML libraries allow the user to analyse various kinds of problems, such as classification, pattern recognition, clustering, and regression. The predictive analysis in this study was performed using its machine learning pipelines, which allow data to be processed and analysed in one go.

R-8: The platform should provide support for data visualization.

Status: Completed

Evaluation: Support for data visualization in this study is provided through Hue and Matplotlib. Both tools were tested, and they fulfil the requirements.

R-9: The platform should provide support for developing custom applications.

Status: Completed

Evaluation: The platform supports the development of custom applications. Through the use of REST APIs, the data and results can be delivered to third-party users depending on their needs.
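As an illustration of how results might be exposed to third-party users over REST, the following minimal sketch uses only the Python standard library. It does not describe the API of the developed platform: the /results/<name> path layout, the in-memory RESULTS dictionary with its placeholder values, and all function and class names are assumptions made for this example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical placeholder results; a real service would read stored analysis output.
RESULTS = {"forecast": [{"label": "example", "value": 1.0}]}


def build_response(path, results=RESULTS):
    """Map a request path such as /results/forecast to (status, JSON body)."""
    parts = [part for part in path.split("/") if part]
    if len(parts) == 2 and parts[0] == "results" and parts[1] in results:
        return 200, json.dumps(results[parts[1]])
    return 404, json.dumps({"error": "not found"})


class ResultsHandler(BaseHTTPRequestHandler):
    """Answers GET requests with the JSON produced by build_response."""

    def do_GET(self):
        status, body = build_response(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(host="127.0.0.1", port=8000):
    """Start the sketch server; host and port are arbitrary example values."""
    HTTPServer((host, port), ResultsHandler).serve_forever()
```

Calling serve() starts the server, after which a client could request /results/forecast and receive the corresponding JSON. Keeping the routing in build_response, separate from the handler, makes it possible to check the routing without opening a network socket.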

R-10: The developed platform should be secure and capable of handling any intrusions.

Status: Partially completed

Evaluation: To ensure the security of the platform, only users with whitelisted IPs can access it. Further research is still required to integrate more advanced techniques for securing the platform.
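The IP whitelisting check can be sketched as follows. The text does not state where the restriction is enforced (for example in a firewall or in the application), so this is only an application-level illustration; the addresses are taken from the ranges reserved for documentation, and the names ALLOWED_NETWORKS and is_allowed are hypothetical.

```python
import ipaddress

# Hypothetical allowlist built from documentation-only address ranges.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/28"),     # an example office subnet
    ipaddress.ip_network("198.51.100.7/32"),  # a single example host
]


def is_allowed(client_ip, allowed_networks=ALLOWED_NETWORKS):
    """Return True only if client_ip parses and falls inside an allowed network."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        # Malformed addresses are rejected rather than raising an error.
        return False
    return any(address in network for network in allowed_networks)
```

A request handler would call is_allowed with the address of the connecting client and refuse the connection when it returns False. A check of this kind limits who can reach the platform but does not by itself authenticate users or detect intrusions, which is consistent with the partially completed status of this requirement.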
