4.2 Software Engineering KA
4.2.4 Software Testing
According to SWEBOK, software testing consists of the dynamic verification that a program provides expected behaviors on a finite set of test cases, suitably selected from the usually infinite execution domain [17].
Ten research studies were categorized under the Software Testing KA. The micro categories identified under this KA were:
1. Test Techniques
2. Test Related Measures
3. Test Process
4. Software Testing Tools
Table 4.6: Software Testing Micro Categories
Software Testing Micro Categories   Papers                                               Count
Test Techniques                     [42], [91], [95], [98], [139], [146], [147], [169]   8
Test Related Measures               [147]                                                1
Test Process                        –
Software Testing Tools              [147]                                                1
4.2.4.1 Analysis
Eight research studies dealt with test techniques for big data, and one each with test related measures and software testing tools. B. Li et al. proposed a novel approach to protecting the databases used in big data applications: the database is minimized and sanitized before the parent organization sends its big data to outsourcing vendors for testing [91]. Their argument is that the testing of data-intensive software systems is outsourced to test centers in order to keep costs low and quality high. Because the data sets contain sensitive information, and privacy laws vary across the locations where vendors may be based, sharing raw data is risky. Anonymizing the information avoids these legal and ethical issues. For big data systems, however, shrinking a data set by removing data to make anonymization easier is not an option, because large data sets contain useful patterns. The authors proposed an approach called Protecting and mInimizing databases for Software TestIng taSks (PISTIS), which sanitizes and minimizes data using a weight-based data clustering algorithm that partitions the data.
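The general idea of partitioning records and releasing only one aggregate per partition can be illustrated with a minimal sketch. This is not the PISTIS algorithm from [91]: the weighted score, the fixed group size, and the names `weighted_score` and `minimize_and_sanitize` are assumptions made purely for illustration, and a simple sort-and-chunk step stands in for the clustering.

```python
from statistics import mean


def weighted_score(record, weights):
    """Collapse a numeric record into one weighted value (illustrative stand-in)."""
    return sum(w * x for w, x in zip(weights, record))


def minimize_and_sanitize(records, weights, group_size):
    """Partition records into groups of similar weighted score and
    replace each group by its column-wise mean.

    The output is smaller (one row per group) and no original row is
    released verbatim unless a group contains a single record, which can
    happen for the last group when len(records) is not a multiple of
    group_size.
    """
    ordered = sorted(records, key=lambda r: weighted_score(r, weights))
    groups = [ordered[i:i + group_size]
              for i in range(0, len(ordered), group_size)]
    return [tuple(mean(column) for column in zip(*group)) for group in groups]


# Hypothetical (age, salary) records; the weights put both columns on a similar scale.
records = [(52, 90000), (25, 30000), (55, 95000), (27, 32000)]
weights = (1.0, 0.001)
reduced = minimize_and_sanitize(records, weights, group_size=2)
```

Here four records are reduced to two aggregate rows, one per group of similar records. A real approach would also have to handle categorical attributes and check how well the patterns in the data survive the reduction.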
N. Li et al. focused on a method for generating small, representative data sets from very large data sets, in order to reduce the cost of processing large amounts of data, which restricts continuous integration and delivery in agile environments [95]. They introduced a novel scalable big data test framework that uses Amazon Web Services (AWS) offerings such as Amazon Elastic MapReduce (EMR), Amazon Simple Storage Service (S3), and Redshift to test ETL applications that use big data techniques. The representative data sets are generated using the characteristics of domain-specific constraints, business constraints, referential constraints, statistical distributions, and other constraints.
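Two of the constraints named above, statistical distribution and referential constraints, can be illustrated with a minimal sketch. It is not the framework described in [95]: the table layout, the column names, and the helpers `stratified_sample` and `keep_referenced` are hypothetical, and plain stratified sampling stands in for whatever selection method the authors use.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every stratum so that the
    distribution of `key` in the sample mirrors the full data set."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for group in strata.values():
        # Keep at least one row per stratum so rare values are not lost.
        n = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, n))
    return sample


def keep_referenced(parents, children, parent_key, foreign_key):
    """Keep only the parent rows that sampled child rows refer to,
    so every foreign key in the sample still resolves."""
    wanted = {child[foreign_key] for child in children}
    return [parent for parent in parents if parent[parent_key] in wanted]


# Hypothetical source tables: 200 orders referencing 10 customers.
orders = [
    {"order_id": i, "customer_id": i % 10, "region": "EU" if i % 4 == 0 else "US"}
    for i in range(200)
]
customers = [{"customer_id": c} for c in range(10)]

small_orders = stratified_sample(orders, "region", 0.1)
small_customers = keep_referenced(customers, small_orders,
                                  "customer_id", "customer_id")
```

The sample keeps the 1:3 ratio of EU to US orders, and every sampled order still points at a customer that exists in the reduced customer table.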
Ding et al. developed a test framework that uses an iterative metamorphic testing technique to test scientific software and to validate machine learning algorithms [42]. Sneed et al. proposed an automatic testing process for big data, because the sheer size of the data sets makes testing difficult for developers [139].
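Metamorphic testing checks relations between the outputs of related runs instead of comparing a single output against a known expected value, which is useful when no such oracle exists. The sketch below shows only basic metamorphic testing, not the iterative technique of [42]. The program under test (`mean`), the three relations, and the helper `check_metamorphic_relations` are assumptions chosen for illustration.

```python
import math
import random


def mean(values):
    """Stand-in for the program under test."""
    return sum(values) / len(values)


def check_metamorphic_relations(program, source, seed=0):
    """Run the program on a source input and on three follow-up inputs,
    and report whether each expected relation between the outputs holds."""
    rng = random.Random(seed)
    base = program(source)

    shuffled = source[:]
    rng.shuffle(shuffled)                   # reordering must not change the result
    shifted = [v + 10.0 for v in source]    # adding 10 must add 10 to the result
    scaled = [v * 3.0 for v in source]      # tripling must triple the result

    def close(a, b):
        return math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9)

    return {
        "permutation": close(program(shuffled), base),
        "shift": close(program(shifted), base + 10.0),
        "scale": close(program(scaled), base * 3.0),
    }


data = [1.5, 2.0, 7.25, 4.0]
results = check_metamorphic_relations(mean, data)


def buggy_mean(values):
    """Seeded fault: divides by len - 1 instead of len."""
    return sum(values) / (len(values) - 1)


buggy_results = check_metamorphic_relations(buggy_mean, data)
```

The correct `mean` satisfies all three relations, while the seeded fault in `buggy_mean` violates the shift relation even though the test never states what the right mean of `data` is.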
The only research study that covered both the Test Related Measures and Software Testing Tools micro categories discussed the requirements challenges of testing big data systems and proposed a factory model for big data systems with testing strategies, testing tools, principles and matrices of testing [147].
4.2.4.2 Open Research Challenges
A major issue with testing big data systems is the infeasibility of replicating the exact production big data environment in a test environment [99]. Big data systems are huge and complicated; hence, replicating them requires substantial resources in the form of technical skills and hardware cost. Another factor is the dynamic nature of the data involved: replicating the source data so that it behaves exactly as it would in production could turn out to be very complex. The most obvious approach available is to scale down the resources, both storage and computational, needed by the production system to fit the needs of testing it. But no strategy has been developed to decide how much scaling down would be
Another factor that requires scaling is the huge amount of data in big data systems. An approach proposed by N. Li et al. reduces large data sets to a smaller representative form [95], but it was designed specifically for ETL applications. The real challenge is how the same techniques for creating small representative data sets can be applied to other types of big data systems, especially ones that deal with different kinds of multimedia data such as text, audio, speech, and video.
The only micro category in the Software Testing KA for which no research studies were found was Test Process.
1. Test Process is the collection of testing concepts, techniques and measures for testing a software system [17]. Many factors contribute to the formulation of a test process, such as the attitudes of the programmers/developers, test documentation, and the cost and effort budget allocated to testing the system. Big data system development requires a variety of technically skilled personnel, from statisticians to front-end UI developers and business analysts, and the attitudes of everyone involved may not be flexible or trusting enough to work with non-established, customized test processes or to agree on the standards and terminology used in test documentation.
Open research challenges for the software testing KA include:
• Traceability: Tracing the functions and behaviour of the different system components of a big data system is vital to ensure that each component functions as desired or expected. How to devise unit test cases for complex and intricate big data technology components? How to keep track of the dynamic nature of the software components of a big data system while testing it?
• Test Environment: Having a dedicated test environment that is a replica of the production environment is the best option for testing and understanding the functioning of a software system in order to fix any actual and potential errors. In big data systems, implementing this may not be economically viable. How to scale a production big data system in order to replicate it in a dedicated test environment? How to prioritize which system components need exact replication and which components can be scaled to a minimum to keep hardware costs low? How to duplicate the resource-intensive tasks and workflow of a production big data system in order to test it? How to capture production workloads and replay them when testing big data systems?
• Test Cases: Devising test cases for the code base of a big data system requires the input and processing logic to be similar to those of the production environment. How to replicate the behaviour of the data sources of a big data system in order to keep the test cases as similar as possible to the actual use cases of the system? How to model the behaviour of system components which cannot be replicated? How to create representative data sets of input data that can be used for testing a big data system? How to define test cases that are representative of big data systems while using only a subset of the resources used by the system?