Validate data quality by employing a structured testing technique
BIG DATA TESTING APPROACH
Because we are dealing with huge volumes of data and executing on multiple nodes, there is a high chance of bad data and data quality issues at each stage of the process. Data functional testing is performed to identify data issues caused by coding errors or node configuration errors.
Testing should be performed at each of the three phases of Big data processing to ensure that data is processed without errors. Functional testing includes (i) validation of pre-Hadoop processing; (ii) validation of Hadoop Map Reduce process data output; and (iii) validation of data extract and load into EDW. Apart from these functional validations, non-functional testing, including performance testing and failover testing, needs to be performed.
Figure 2 shows a typical Big data architecture diagram and highlights the areas where testing should be focused.
Validation of Pre-Hadoop Processing
Data from various sources such as weblogs, social network sites, call logs and transactional data is extracted based on the requirements and loaded into HDFS before it is processed further.
Issues: Some of the issues which we face during this phase of the data moving from source systems to Hadoop are incorrect data captured from source systems, incorrect storage of data, and incomplete or incorrect replication.
[Figure 1: Big Data Testing Focus Areas: (1) loading source data files into HDFS; (2) performing Map Reduce operations; (3) extracting the output results from HDFS. Source: Infosys Research]
Validations: Some high level scenarios that need to be validated during this phase include:
1. Comparing the input data file against source systems data to ensure the data is extracted correctly
2. Validating the data requirements and ensuring the right data is extracted
3. Validating that the files are loaded into HDFS correctly
4. Validating that the input files are split, moved and replicated in different data nodes.
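The first and third checks above can be approximated by comparing the record count and an order-independent checksum of the source extract against the file read back from HDFS. The following is a minimal Python sketch under stated assumptions: `file_fingerprint` and `compare_extracts` are hypothetical helper names, and both sets of records are assumed to be available as lines of text (for example, the HDFS copy read back with `hdfs dfs -cat`).

```python
import hashlib
from typing import Dict, Iterable, Tuple


def file_fingerprint(lines: Iterable[str]) -> Tuple[int, str]:
    """Return (record_count, order-independent digest) for a set of records."""
    count = 0
    combined = 0
    for line in lines:
        record = line.rstrip("\r\n")
        digest = hashlib.sha256(record.encode("utf-8")).digest()
        # Adding per-record hashes modulo 2**256 makes the result independent
        # of record order (which may change during a distributed load) while
        # still reacting to missing, altered, or duplicated records.
        combined = (combined + int.from_bytes(digest, "big")) % (1 << 256)
        count += 1
    return count, f"{combined:064x}"


def compare_extracts(source_lines: Iterable[str],
                     loaded_lines: Iterable[str]) -> Dict[str, object]:
    """Compare the source extract with the copy read back from HDFS."""
    src_count, src_digest = file_fingerprint(source_lines)
    dst_count, dst_digest = file_fingerprint(loaded_lines)
    return {
        "source_count": src_count,
        "loaded_count": dst_count,
        "counts_match": src_count == dst_count,
        "content_match": src_digest == dst_digest,
    }
```

A count mismatch points to an incomplete load, while matching counts with a content mismatch points to corrupted or altered records. Checking how files are split and replicated across data nodes is a cluster-level check and is outside the scope of this sketch.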
Validation of Hadoop Map Reduce Process
Once the data is loaded into HDFS, the Hadoop map-reduce process is run to process the data coming from different sources.
Issues: Some issues that we face during this phase of data processing are coding issues in map-reduce jobs, jobs that work correctly on a standalone node but incorrectly on multiple nodes, incorrect aggregations, node configuration errors, and incorrect output format.
Validations: Some high level scenarios that need to be validated during this phase include:
1. Validating that data processing is completed and output file is generated
2. Validating the business logic on standalone node and then validating after running against multiple nodes
3. Validating the map reduce process to verify that key value pairs are generated correctly
4. Validating the aggregation and consolidation of data after reduce process
5. Validating the output data against the source files and ensuring the data processing is completed correctly
6. Validating the output data file format and ensuring that the format is per the requirement.
[Figure 2: Big Data architecture. Components shown: web logs, streaming data, social data, transactional data (RDBMS), data load using Sqoop, HDFS (Hadoop Distributed File System), Map Reduce (job execution), HBase (NoSQL DB), Pig, HIVE, ETL process, processed data, enterprise data warehouse, big data analytics, and reporting using BI tools. Testing focus areas: (1) pre-Hadoop process validation; (2) map-reduce process validation; (3) ETL process validation; (4) reports testing; plus non-functional testing (performance, failover testing). Source: Infosys Research]
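One way to approach the key-value, aggregation and output checks listed for the map-reduce phase is to run the same map and reduce logic in a single local process on a small sample and compare the result with the job's output. The Python sketch below illustrates the idea only; the page-hit job, `hits_mapper`, `hits_reducer` and the other function names are hypothetical, and the job output is assumed to be text lines of the form key, tab, integer value.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, Optional, Tuple


def run_local_map_reduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, list], int],
) -> Dict[str, int]:
    """Single-process reference run: map, group by key, then reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


def parse_job_output(lines: Iterable[str]) -> Dict[str, int]:
    """Parse 'key<TAB>value' lines (assumed output format) into a dict."""
    result = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        key, value = line.split("\t", 1)
        result[key] = int(value)  # raises ValueError on a malformed value
    return result


def diff_outputs(
    expected: Dict[str, int], actual: Dict[str, int]
) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    """Return {key: (expected, actual)} for every key that disagrees."""
    mismatches = {}
    for key in sorted(expected.keys() | actual.keys()):
        if expected.get(key) != actual.get(key):
            mismatches[key] = (expected.get(key), actual.get(key))
    return mismatches


# Hypothetical job used for illustration: count hits per URL
# from "user,url" log records.
def hits_mapper(record: str) -> Iterator[Tuple[str, int]]:
    _user, url = record.split(",")
    yield url, 1


def hits_reducer(_key: str, values: list) -> int:
    return sum(values)
```

An empty result from `diff_outputs` means the multi-node output agrees with the standalone reference run on that sample; any entry identifies a key whose aggregation differs or is missing.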
Validation of Data Extract, and Load into EDW
Once the map-reduce process is completed and data output files are generated, the processed data is moved to the enterprise data warehouse or to other transactional systems, depending on the requirement.
Issues: Some issues that we face during this phase include incorrectly applied transformation rules, incorrect load of HDFS files into EDW and incomplete data extract from Hadoop HDFS.
Validations: Some high level scenarios that need to be validated during this phase include:
1. Validating that transformation rules are applied correctly
2. Validating that there is no data corruption by comparing target table data against HDFS files data
3. Validating the data load in target system
4. Validating the aggregation of data
5. Validating the data integrity in the target system.
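The comparison of target table data against HDFS file data can be sketched as a row-level reconciliation. The example below is an illustration under assumptions, not a prescribed method: it uses Python's built-in `sqlite3` module as a stand-in for the EDW, assumes the HDFS output is comma-separated text, and the function name `reconcile` is hypothetical.

```python
import csv
import sqlite3
from collections import Counter
from typing import Dict, Iterable, List, Sequence


def reconcile(
    hdfs_lines: Iterable[str],
    conn: sqlite3.Connection,
    table: str,
    columns: Sequence[str],
) -> Dict[str, List[tuple]]:
    """Compare rows in the HDFS output with rows in the target table."""
    # Counter (a multiset) is used so that duplicated rows are also detected.
    expected = Counter(tuple(row) for row in csv.reader(hdfs_lines) if row)

    # Table and column names are assumed to come from trusted test
    # configuration; they are not safe to take from untrusted input.
    query = f"SELECT {', '.join(columns)} FROM {table}"
    # Values are compared as strings, so numeric formatting in the file must
    # match the database representation (for example "10" versus "10.0").
    actual = Counter(tuple(str(v) for v in row) for row in conn.execute(query))

    return {
        "missing_in_target": sorted((expected - actual).elements()),
        "unexpected_in_target": sorted((actual - expected).elements()),
    }
```

Two empty lists indicate that every row in the HDFS file reached the target and nothing extra was loaded. Transformation rules, if any, would need to be applied to the expected rows before comparison.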
Validation of Reports
Analytical reports are generated using reporting tools by fetching the data from EDW or running queries on Hive.
Issues: Some of the issues faced while generating reports are report definitions not set as per the requirement, report data issues, and layout and format issues.
Validations: Some high level validations performed during this phase include:
Reports Validation: Reports are tested after ETL/transformation workflows are executed for all the source systems and the data is loaded into the DW tables. The metadata layer of the reporting tool provides an intuitive business view of data available for report authoring.
Checks are performed by writing queries to verify whether the views are getting the exact data needed for the generation of the reports.
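Such a check can be sketched as comparing what a reporting view returns with an independently written query against the underlying DW table. The example below is a minimal illustration using Python's built-in `sqlite3` module as a stand-in for the warehouse; the table `dw_sales`, both views, and the helper names are hypothetical.

```python
import sqlite3
from collections import Counter


def view_matches_reference(
    conn: sqlite3.Connection, view_name: str, reference_sql: str
) -> bool:
    """Return True if the view yields exactly the rows of the reference query."""
    # view_name is assumed to come from trusted test configuration.
    view_rows = Counter(conn.execute(f"SELECT * FROM {view_name}").fetchall())
    reference_rows = Counter(conn.execute(reference_sql).fetchall())
    # Comparing multisets ignores row order but not duplicates.
    return view_rows == reference_rows


def build_demo_warehouse() -> sqlite3.Connection:
    """Create an in-memory stand-in warehouse with one correct and one faulty view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dw_sales (region TEXT, amount INTEGER);
        INSERT INTO dw_sales VALUES ('EMEA', 10), ('EMEA', 15), ('APAC', 7);
        CREATE VIEW v_sales_by_region AS
            SELECT region, SUM(amount) AS total FROM dw_sales GROUP BY region;
        -- Faulty view: silently drops every region except EMEA.
        CREATE VIEW v_sales_emea_only AS
            SELECT region, SUM(amount) AS total FROM dw_sales
            WHERE region = 'EMEA' GROUP BY region;
        """
    )
    return conn


# The tester's independent statement of what the report should contain.
REFERENCE_SQL = "SELECT region, SUM(amount) FROM dw_sales GROUP BY region"
```

With this setup, `view_matches_reference(conn, "v_sales_by_region", REFERENCE_SQL)` passes, while the same call against `v_sales_emea_only` fails because the APAC row is missing.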
Cube Testing: Cubes are tested to verify that dimension hierarchies with pre-aggregated values are calculated correctly and displayed in the report.
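For an additive measure, one simple form of this check is to confirm that each pre-aggregated parent value equals the sum of its children in the dimension hierarchy. The Python sketch below assumes a sum-based measure and a hierarchy supplied as a parent-to-children mapping; the function name `check_rollups` and the sample member names are hypothetical.

```python
from typing import Dict, List, Tuple


def check_rollups(
    values: Dict[str, float],
    children: Dict[str, List[str]],
    tolerance: float = 1e-9,
) -> List[Tuple[str, float, float]]:
    """Return (member, stored_value, recomputed_value) for each bad roll-up.

    Assumes an additive measure: a parent's pre-aggregated value should
    equal the sum of its direct children's values.
    """
    errors = []
    for parent, kids in children.items():
        recomputed = sum(values[kid] for kid in kids)
        if abs(values[parent] - recomputed) > tolerance:
            errors.append((parent, values[parent], recomputed))
    return errors


# Illustrative hierarchy: All -> (EMEA, APAC), EMEA -> (UK, DE)
hierarchy = {"All": ["EMEA", "APAC"], "EMEA": ["UK", "DE"]}
cube_values = {"All": 100, "EMEA": 60, "APAC": 40, "UK": 25, "DE": 35}
```

Non-additive measures such as averages or distinct counts would need a different recomputation rule than a plain sum.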
Dashboard Testing: Dashboard testing consists of testing the individual web parts and reports placed in a dashboard. Testing involves ensuring that all objects are rendered properly and that the resources on the webpage are current. The data fetched from various web parts is validated against the databases.