• No results found

Keywords: Data Warehouse, Data Warehouse testing, Lifecycle based testing, performance testing.

N/A
N/A
Protected

Academic year: 2021

Share "Keywords: Data Warehouse, Data Warehouse testing, Lifecycle based testing, performance testing."

Copied!
5
0
0

Loading.... (view fulltext now)

Full text

(1)

DOI 10.4010/2016.493 ISSN2321 3361 © 2015 IJESC

Performance Testing Of Data Warehouse Lifecycle

Surekha.M1, Dr. Sanjay Srivastava2, Dr. Vineeta Khemchandani3

IV Sem, MTech1, Professor & HOD2, Associate professor3 Department of CSE1, 2

MGM College of Engineering, Noida1, 2

JSSATE, Noida Uttar Pradesh Technical University, India3

[email protected], [email protected], [email protected]

Abstract:

Testing is an essential part of design lifecycle of any software product. Testing of Data warehouse is very important in projects because users need to trust the quality of information they access. The reasons for this are increase in Enterprise Mergers & Acquisitions, Compliance Regulations, Increased focus on data by Management and data driven decision makings. Data warehouse is a collection of large amount of data which is used by management for making strategic decisions. In this paper, we introduce a performance testing of Data warehouse Lifecycle and implement the same for a dataset using Naïve Bayes and K-Nearest Neighbour algorithm and made comparison between these two discuss how it can help solve some of the challenges in data warehouse testing. This paper also proposes implementation details and future roadmap for model based data warehouse testing.

Keywords: Data Warehouse, Data Warehouse testing, Lifecycle based testing, performance testing.

I. Introduction to Data Warehouse

In Computing Data warehouse is a system used for Reporting and data analysis. DWs are central repositories of integrated data from one or more desperate sources. They store historical and current data are used for creating analytical reports for knowledge workers throughout the enterprise. It is also called enterprise data Ware house (EDW).

A Data warehouse is a Subject-oriented, time variant, integrated and nonvolatile collection of data in support of management’s decision making process. A data warehouse is a copy of transaction data specifically structured for analysis and query. Data warehouse testing is carried out to check the quality of data stored. Testing should focus on Transformation of data, quality and performance of data and completeness of data. This paper is organized as follows section 2 briefly explains the survey of different DWT techniques. Section 3 explains proposed work. Section 4 explains advantages of lifecycle. Section 5 represents results and performance evaluation. Section 6 contains conclusion

Data Warehouse Testing

DWT itself has multiple phases and is staggered throughout the lifecycle of DW implementation. DWT is focuses on quality, consistency, completeness and validation of data. Testing implementation undergoes the cycle of unit testing Integration testing, System testing and usability testing and also focus on test cases generation requirement gathering and for the system.

Data Warehouse Testing VS Software Testing

 DWT has a broader scope than software testing as it focus upon the usefulness and correctness of data.  DWT always deals with huge volume of Data.

 DWT is post placement activity where as software testing is prior placement activity of software  DWT test cases are unlimited. Software testing test

cases are limited.

 DWT is System triggered whereas software testing is user triggered.

 DWT focus on validating the data whereas software testing focuses on the code.

Challenges in Data Warehouse Testing

As it is accepted that DW is different from other System, Hence there are some challenges in DW testing.

 Different types of formatted data.  Data of Heterogeneous sources.  Huge volume of Data.

 Lack of testing tools knowledge  Lack of requirements clarity

 Missing values in large volume of data

 One of the core challenges of testing DWS or providing technique for testing DW is its flexible architecture

 DW system could have different architectures.

II. RELATED WORK

In [1] Naveen ElGamal proposed to show the workflow of the framework when it is put in to operation. The key player of the DW testing process is the Test manager, who feeds the system with the DW architecture under test and current state of DW.DW architecture Analyzer component then studies the received data and compare it with the dependency graph with the assistance of test dependency manager component and passes the data to test recommender to generate an

(2)

abstract test plan, then executed validation and verification phase. Validation manager involves the business experts and system experts where as verification manager involves system tester and DB administrator for assistance in process of Test Case preparation.

In[2] Kuldeep Deshpande proposed model based testing is a technique for automatic generation and execution of test cases based on formal models of system under test(SUT).Model based testing can be applied at various levels of testing i.e. unit testing, integration test and system testing.

In [3] Syntel white paper proposed object oriented framework for DW conceptual design. Framework is divided in to two levels namely-Requirement level and Design level. At the requirement level, the requirements are gathered from different users and a thorough analysis is made.

In design level UML designs helps in Extracting major objects and classes from data gathered and construction of UML class diagrams.

In [4] Rajini Jindal and Shweta Taneja proved testing technique in large DW projects. Paper suggests that Industry best practices in DWH testing are data completeness and quality check, BI report data tasting, Performance validation of ETL and reports. Critical success factors for testing are Referential integrity of facts and Dimensions, Risk based Testing, data abfuscation, effective defect management and Focus on automation.

In [5] Naveen ElGamal, Ali Ei Bastawissy and Galal Edeen proposed DW matrices which is categorized as ‘what’, ‘where’ and ‘when’ testing. These categories will result in a three dimensional matrix. As shown in table1, the rows represent the ‘where’ dimension, the columns represent the ‘what’ dimension, and later on the ‘when’ dimension is represented in color. Each cell of this table will consist of the group of test routines that addresses combination of dimension members.This matrix is used to compare the existing DW testing approaches.

Table 1: DW Testing Matrices

In [6] Manoj Philip Mathen proposed two focus points for DW testing. At a high level, Any strategy should focus on the two main aspects mentioned as underlying data and DW components. Underlying data includes Data coverage and data complying with the Transformation logic in accordance with the Business Rules.

Whereas in DW Components he mentioned: performance, scalability, component orchestration tests and regression testing.

In [7] Execution-MiH explained categories of DW testing which includes different stages of the process. The testing is done on individual and end to end basis. Testing will include extraction testing transformation testing loading testing, end user testing end user browsing and OLAP testing, Adhoc Query testing, Stress and volume testing and parallel testing.

In [8] Golfarelli M. and Rizzi S introduced data warehouse testing activities framed within a DW development methodology. They stated that, The components needs to be tested: conceptual schema, Logical Schema, ETL procedures, Database and frontend.

Table2: DW components vs testing types[8]

III.PROPOSED WORK

The testing activities in data warehousing projects begin with the requirement gathering phase and carried out in an iterative manner. In data warehousing testing, every component of the project needs to be tested. Figure 1 shows the proposed lifecycle for data warehouse testing containing following phases.

(3)

We are briefly described the above phases as follows: 1. DATA GATHERING PHASE :

All the data is collected from different sources like SQL server, access sheets. In this paper we used Data from Banking Domain as shown in figure1 Reviewed data. 2. REVIEW

The data collected is reviewed for any correction present in the data. All data which is to be tested should be included in the data gathering.

Figure 1: Reviewed Data 3. PLANNING

In this phase, test data is prepared for testing. In this phase how testing is to be carried out is planned. Decision for selecting of which algorithm is made here. In this paper selected testing algorithms are Naive bays and K-nearest neighbor.

4. REQUIREMENT TESTING

In this emphasis is given defining business rules and requirement stated should be complete, clear, consistent and understandable. Interface review must be done to check the usability of the system. Tools must be decided in this phase on which testing must be performed.

5. TEST CASE GENERATION

Different test cases are generated by using different combinations and according tools are set to operate. 6. TESTING

Different testing techniques are mentioned:

 Testing phase involves selection of algorithm to be implemented to test DW performance.

Selected algorithm in our case are Naive Bayes and

KNN (K-Nearest Neighbor) Naïve Bayes:

Briefly explaining Naïve Bayes classifier, In simple terms, a Naive Bayes classifier assumes that the value of a particular feature is unrelated to the presence or absence of any other feature, given the class variable. For example, In banking

data if parameters are age, sex, income etc in Naive Bayes classifier considers each of these features contribute independently to the probability of data being in an existing data, regardless of the presence or absence of the other features.

K-Nearest Neighbor:

K nearest neighbors is a simple algorithm that stores all available cases and Classifies new cases based on a similarity measure (e.g., distance functions).

Algorithm

A case is classified by a majority vote of its neighbors, with the case being assigned to the class most common amongst its K nearest neighbors measured by a distance function. If K = 1, then the case is simply assigned to the class of its nearest neighbor.

It should also be noted that all three distance measures are only valid for Continuous variables. In the instance of Categorical variables the Hamming Distance must be used.

It also brings up the issue of standardization of the numerical variables between 0 and 1 when there is a mixture of numerical and Categorical variables in the dataset.

Unit Testing: In this white box testing is performed. The developer loads the data from data source, data modules is tested individually.

Integration testing: the process of combining and testing the components together one by one to check their integrity issues and to make sure they perform well working together.

System Testing: this testing executes at developer site to make sure the system performs well. All the components run well together.

Usability testing: this is a black box testing. This checks for fulfillment of the requirements and ensure the validation of data.

7. PERFORMANCE TEST

The proposed lifecycle executed well in implementation as we are getting validated data as a result of querying the data through .NET interface. In WEKA data is validated. Then unit testing is done on single units. Then proceed to integration testing by integrating WEKA, SQL SERVER and UI together to check whether following components worked well. Data is tested against the accuracy of the data sets and performance is judged on the basis of different data mining attributes like precision, accuracy and time taken.

(4)

IV. FEATURES OF PROPOSED TESTING LIFECYCLE FOR DATA WAREHOUSE

1. Test planning is done prior to the test cases development.

2. Stress on Requirement testing is given as it is important for the need of development of the system and tools is studied for testing the data. 3. Strategy is prepared as all the test data should be

covered in test cases so that defects can be recognized at the early stages.

4. Testing the data on the basis of accuracy is the major part as in data mining there are many algorithms to choose, the best algorithm according to the need is chosen.

5. Testing is made iterative to make it easy for testers to change the system according to the requirements or add on any strategy without any hindrance.

V. IMPLEMENTATION RESULTS

 Testing of data is done using the proposed lifecycle. Firstly, all data is collected and extracted by using ETL (Extracted Tested Loaded) process. All the data is transformed according to the business rules, which is then loaded into the WEKA tool and algorithm is loaded and results are captured. Data is validated using this process. Cost/benefit and threshold curve algorithm used for data analysis shows good result of graph.

 Secondly, GUI (Graphical user Interface) is made in which data sets is loaded and queried and result is displayed so as to ensure whether the validated data is producing correct results.

Thirdly Comparison is made using proposed algorithms

Figure 2: threshold graph for naive bayes

Figure 3: cost Benefit graph for Naïve Bayes

Figure 4: Threshold curve for KNN

Figure 5: Threshold curve for KNN

Figure 6: Threshold curve for KNN

VI. CONCLUSION

From our work, we have concluded that. Any bug found in later stages can affect the analysis of data. Hence it become very important to validate before presenting it to GUI, It can help the data analyst to do analysis at accurate rate and in simplified manner. This paper presents the iterative DWT life cycle. The paper shows 94.2% accuracy achieved with KNN algorithm in comparison with Naïve Bayes, which shows 89.5% accuracy which concludes that KNN gives better performance than Naïve Bayes.

VII. FUTURE WORK ROADMAP

In future, we plan to conduct the studies on various data gathering techniques and would work on data gathering and will focus more on building a data warehouse with more improved testing strategies.

ACKNOWLEGMENT

I thanks to all the experts who have contributed towards this work. I would also like to dedicate my acknowledgment of gratitude towards the following significant advisors and contributors:

(5)

I would like to thank Mr. Sanjay Srivastava for reading my research paper and providing valuable advices and for reproofing the paper.

I sincerely thank to my parents, family, and friends, who provide the advice and financial support. The product of this research paper would not be possible without all of them.

REFERENCES

[1].Naveen ElGamal,” Data Warehouse Testing”, EDBT/ICDT, March 2013

[2].Kuldeep Deshpande,” Model Based testing of Data warehouse”, International Journal of Computer Science Issues (IJCSI), vol 10, Issue 2, March 2013

[3].Syntel , “ Proven Testing Techniques In Large Data Warehousing Projects : A white paper “ , Syntel 2012 [4].Rajini Jindal and Shweta Taneja,”Comparitive Study Of Data Warehouse Design Approaches: A survey”, International Jornal of Database Management System (IJDMS), vol February 2012

[5].Naveen ElGamal, Ali Ei Bastawissy and Galal Edeen, “ Towards A Data warehouse esting Framework”, Ninth International Conference On ICT ad Knowledge Engineering, IEEE 2011

[6].Manoj Philip Mathen, “ Data Warehouse Testing : a white paper” , Infosys, March 2010

[7].Execution-MiH,”Data Warehouse Testing is different”. [8].Golfarelli M. and Rizzi S., 2009, “A Comprehensive Approach to Data Warehouse Testing”, in ACM 12th international workshop on Data Warehousing and OLAP (DOLAP’09), Hong Kong, China.

[9] Muhammad Shahan Ali Khan and Ahmad ElMadi, Data Warehouse Testing an Exploratory Study, MS Thesis, School of Computing, Blekinge Institute of Technology, Karlskrona, Sweden, 2011.

[10] Muhammad Shafique and Yvan Labiche, A Systematic Review of Model Based Testing Tool Support, Carleton University, Technical Report, 2010.

[11] Tanuška, P., Moravčík, O., Važan, P. and Miksa, F.The Proposal of the Essential Strategies of Data Warehouse Testing. in 19th Central European Conference on Information and Intelligent Systems (CECIIS), (2008), 63-67.

[12].Pooniah, P., 2001, “Data Warehousing Fundamentals – A Comprehensive Guide for IT Professionals”, John Wiley & Sons, Inc

[13].Mookerjea A. and Malisetty P., 2008, “Data Warehouse/ETL Testing: Best Practices”, www.pureconferences.com.

[14].Weka tutorials: http://sentimentmining.net/weka/ [15].www.wikipedia.com

[16].Sneed M. Harry, 2006, “Testing a Data Warehouse – an Industrial Challenge”, in proceedings of the Testing: Academic & Industrial Conference on Practice and Research Techniques, IEEE Computer, p. 203-210

Author Details

Mrs. Surekha.M. MTech, Dept of CSE, MGM College of Engineering, Noida. Uttar Pradesh Technical University.

Dr. Sanjay Srivastava Professor & HOD, Dept of CSE., MGM College of Engineering, Noida. Uttar Pradesh Technical University.

Dr.Vineeta Khemchandani Assossiate professor, JSSATE,Noida Uttar Pradesh Technical University.

References

Related documents

Among the further tasks to be done we find: improving the efficiency of the PCG by adaptive selection of φ, the number of terms in the preconditioner, using the estimation of ρ by

Importantly, this welfare improvement is driven by a selection effect that arises when firms are heterogenous in their productivity levels: The subsidy allows the host country

1 Legal and Tax Advice rendered by associated partner

The implementation of public key infrastructure (PKI) and secure cryptographic hash functions to sign the contractual documents and the messages communicated across the Internet

In these cases, directors and teachers are able to build partnerships with parents to meet the needs of young children with challenging behavior, with support from the mental

Table 1: Mean ultrasonographic measurements of cross sectional area at various levels of the median nerve in both carpal tunnel syndrome patients and controls... Neurology Asia

Feminist and lesbian feminist theorists have offered an ideological critique of hetero- sexuality in relation to the gendered and sexual oppression of women; in this