BIG DATA TESTING APPROACH

Validate data quality by employing a structured testing technique

Because we are dealing with very large data volumes and executing on multiple nodes, there is a high chance of bad data and data quality issues at each stage of the process. Functional testing of the data is performed to identify issues caused by coding errors or node configuration errors.

Testing should be performed at each of the three phases of Big Data processing to ensure that data is processed without errors. Functional testing includes (i) validation of pre-Hadoop processing; (ii) validation of Hadoop Map Reduce process data output; and (iii) validation of data extract, and load into EDW. Apart from these functional validations, non-functional testing, including performance testing and failover testing, needs to be performed.

Figure 2 shows a typical Big data architecture diagram and highlights the areas where testing should be focused.

Validation of Pre-Hadoop Processing

Data from various sources like weblogs, social network sites, call logs, transactional data etc., is extracted based on the requirements and loaded into HDFS before processing it further.

Issues: Some of the issues which we face during this phase of the data moving from source systems to Hadoop are incorrect data captured from source systems, incorrect storage of data, incomplete or incorrect replication.

[Figure 1: Big Data Testing Focus Areas. Source: Infosys Research. Steps shown: (1) Loading source data files into HDFS; (2) Perform Map Reduce operations; (3) Extract the output results from HDFS.]

Validations: Some high level scenarios that need to be validated during this phase include:

1. Comparing input data file against source systems data to ensure the data is extracted correctly

2. Validating the data requirements and ensuring the right data is extracted

3. Validating that the files are loaded into HDFS correctly

4. Validating the input files are split, moved and replicated in different data nodes.
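Scenarios 1 and 3 above can be sketched as a record-level comparison between the source extract and the copy read back from HDFS. This is a minimal illustration, not a prescribed tool: the function names `fingerprint` and `compare_extract` and the sample records are hypothetical, and the lines are assumed to have already been read from the source file and from HDFS by whatever client the project uses.

```python
import hashlib
from collections import Counter


def fingerprint(lines):
    """Return (record_count, multiset of per-record SHA-256 digests)."""
    digests = Counter(
        hashlib.sha256(line.rstrip("\r\n").encode("utf-8")).hexdigest()
        for line in lines
    )
    return sum(digests.values()), digests


def compare_extract(source_lines, loaded_lines):
    """Compare a source extract with the copy read back from HDFS.

    Record order is ignored, because a distributed load may reorder rows.
    """
    src_count, src_digests = fingerprint(source_lines)
    dst_count, dst_digests = fingerprint(loaded_lines)
    return {
        "source_count": src_count,
        "loaded_count": dst_count,
        # Counter subtraction keeps only positive differences.
        "missing_in_hdfs": sum((src_digests - dst_digests).values()),
        "unexpected_in_hdfs": sum((dst_digests - src_digests).values()),
    }


# Hypothetical sample data standing in for a source file and its HDFS copy.
source = ["1,alice,10\n", "2,bob,20\n", "3,carol,30\n"]
loaded = ["1,alice,10\n", "3,carol,30\n", "2,bob,20\n"]
report = compare_extract(source, loaded)
```

A load is considered clean in this sketch when both counts agree and both difference figures are zero. Replication across data nodes (scenario 4) is a cluster-level property and is not covered by a content comparison like this one.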

Validation of Hadoop Map Reduce Process

Once the data is loaded into HDFS, the Hadoop map-reduce process is run to process the data coming from different sources.

Issues: Some issues that we face during this phase of the data processing are coding issues in map-reduce jobs, jobs that work correctly on a standalone node but incorrectly when run on multiple nodes, incorrect aggregations, incorrect node configurations, and incorrect output format.

Validations: Some high level scenarios that need to be validated during this phase include:

1. Validating that data processing is completed and output file is generated

2. Validating the business logic on standalone node and then validating after running against multiple nodes

3. Validating the map reduce process to verify that key value pairs are generated correctly

4. Validating the aggregation and consolidation of data after reduce process

5. Validating the output data against the source files and ensuring the data processing is completed correctly

6. Validating the output data file format and ensuring that the format is per the requirement.

[Figure 2: Big Data architecture. Source: Infosys Research. Components shown: data sources (Web Logs, Streaming Data, Social Data, Transactional Data (RDBMS)); Data Load using Sqoop; Hadoop with HDFS (Hadoop Distributed File System), Map Reduce (Job Execution), HBase (NoSQL DB), Pig and HIVE; Processed Data; ETL Process; Enterprise Data Warehouse; Big Data Analytics; Reporting using BI Tools. Testing focus areas marked: Pre-Hadoop process validation (1), Map-Reduce process validation (2), ETL process validation (3), Reports Testing, and Non-Functional Testing (Performance, Fail over testing).]
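Scenarios 2 to 5 above can be illustrated with a small in-process job. The sketch below is an assumption-laden stand-in, not Hadoop itself: `mapper`, `reducer`, `run_job`, `run_partitioned` and `expected_counts` are hypothetical names, the page-hit count is an invented business rule, and the multi-node run is only simulated by splitting the input into partitions and merging the partial results.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def mapper(record):
    """Emit (page, 1) for each web-log record of the form 'user,page'."""
    _user, page = record.strip().split(",")
    yield page, 1


def reducer(key, values):
    """Sum the counts for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle (sort and group by key) and reduce in one process."""
    pairs = [kv for rec in records for kv in mapper(rec)]
    pairs.sort(key=itemgetter(0))
    return dict(
        reducer(key, (value for _, value in group))
        for key, group in groupby(pairs, key=itemgetter(0))
    )


def run_partitioned(records, n_parts):
    """Simulate several nodes: run the job per partition, then merge."""
    merged = defaultdict(int)
    for i in range(n_parts):
        for key, value in run_job(records[i::n_parts]).items():
            merged[key] += value
    return dict(merged)


def expected_counts(records):
    """Independent reference computed straight from the source records."""
    counts = defaultdict(int)
    for rec in records:
        counts[rec.strip().split(",")[1]] += 1
    return dict(counts)


# Hypothetical sample input.
records = ["u1,/home", "u2,/cart", "u1,/cart", "u3,/home", "u2,/home"]
output = run_job(records)
standalone_matches_partitioned = output == run_partitioned(records, 3)
output_matches_source = output == expected_counts(records)
```

The two boolean results correspond to the standalone versus multi-node check and to the output versus source check. On a real cluster the same comparisons would be made against the job's actual output files rather than an in-memory dictionary.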

Validation of Data Extract, and Load into EDW

Once the map-reduce process is completed and data output files are generated, this processed data is moved to the enterprise data warehouse or any other transactional systems depending on the requirement.

Issues: Some issues that we face during this phase include incorrectly applied transformation rules, incorrect load of HDFS files into EDW and incomplete data extract from HDFS.

Validations: Some high level scenarios that need to be validated during this phase include:

1. Validating that transformation rules are applied correctly

2. Validating that there is no data corruption by comparing target table data against HDFS files data

3. Validating the data load in target system

4. Validating the aggregation of data

5. Validating the data integrity in the target system.
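Scenarios 2 to 4 above amount to reconciling the target table with the HDFS output. The sketch below shows one way this could look, with loud assumptions: an in-memory SQLite database stands in for the EDW, the table `page_hits` and the tab-separated row layout are hypothetical, and `parse` and `reconcile` are illustrative helper names.

```python
import sqlite3

# Hypothetical reducer output as it might be read back from HDFS.
hdfs_rows = ["/cart\t2", "/home\t3"]

# In-memory SQLite database used here only as a stand-in for the EDW.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_hits (page TEXT PRIMARY KEY, hits INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO page_hits VALUES (?, ?)", [("/cart", 2), ("/home", 3)]
)


def parse(rows):
    """Turn tab-separated 'page<TAB>hits' lines into a set of typed tuples."""
    parsed = set()
    for row in rows:
        page, hits = row.split("\t")
        parsed.add((page, int(hits)))
    return parsed


def reconcile(conn, hdfs_rows):
    """Compare HDFS output rows with the rows loaded into the target table.

    Sets are sufficient here because 'page' is a primary key, so neither
    side is expected to contain duplicate rows.
    """
    source = parse(hdfs_rows)
    target = set(conn.execute("SELECT page, hits FROM page_hits"))
    return {
        "missing_in_target": source - target,
        "unexpected_in_target": target - source,
        "hits_total_matches": (
            sum(h for _, h in source) == sum(h for _, h in target)
        ),
    }


result = reconcile(conn, hdfs_rows)
```

An empty `missing_in_target` and `unexpected_in_target` together with a matching total suggest that the load neither dropped nor corrupted rows. Transformation rules (scenario 1) would need their own expected-value checks, which depend on the project's rules.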

Validation of Reports

Analytical reports are generated using reporting tools by fetching the data from EDW or running queries on Hive.

Issues: Some of the issues faced while generating reports are report definitions not set as per the requirement, report data issues, and layout and format issues.

Validations: Some high level validations performed during this phase include:

Reports Validation: Reports are tested after ETL/transformation workflows are executed for all the source systems and the data is loaded into the DW tables. The metadata layer of the reporting tool provides an intuitive business view of data available for report authoring.

Checks are performed by writing queries to verify whether the views are getting the exact data needed for the generation of the reports.

Cube Testing: Cubes are tested to verify that dimension hierarchies with pre-aggregated values are calculated correctly and displayed in the report.

Dashboard Testing: Dashboard testing consists of testing the individual web parts and reports placed in a dashboard. Testing involves ensuring that all objects are rendered properly and that the resources on the webpage are current. The data fetched from various web parts is validated against the databases.
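The cube check described above, that pre-aggregated values in a dimension hierarchy are calculated correctly, can be sketched by recomputing each roll-up from leaf-level facts. Everything in this example is hypothetical: the year, quarter, month hierarchy, the sales figures, and the `rollup` and `check_cube` helpers. A real cube would be queried through the reporting tool's own interface.

```python
from collections import defaultdict

# Hypothetical leaf-level facts: (year, quarter, month) -> sales.
facts = {
    (2012, "Q1", "Jan"): 100,
    (2012, "Q1", "Feb"): 120,
    (2012, "Q2", "Apr"): 90,
}

# Hypothetical pre-aggregated values as a cube might expose them,
# keyed by a prefix of the hierarchy.
cube = {
    (2012,): 310,
    (2012, "Q1"): 220,
    (2012, "Q2"): 90,
}


def rollup(facts, depth):
    """Sum leaf facts up to the given hierarchy depth (1 = year, 2 = quarter)."""
    totals = defaultdict(int)
    for key, value in facts.items():
        totals[key[:depth]] += value
    return dict(totals)


def check_cube(facts, cube, depths=(1, 2)):
    """Return (key, expected, actual) for every aggregate that disagrees."""
    problems = []
    for depth in depths:
        for key, expected in rollup(facts, depth).items():
            actual = cube.get(key)
            if actual != expected:
                problems.append((key, expected, actual))
    return problems


problems = check_cube(facts, cube)
```

An empty `problems` list means every year and quarter total in the cube agrees with the sum of its leaf records.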

In document Infosys Labs Briefings (Page 68-71)