International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 4, Issue 6, June 2014)
Comparative Analysis of Bitmap Indexing Techniques in Data Warehouse
Firdous Kausar, Shoroq Odah Al Beladi, Kholoud AL Shammari
Department of Computer Science, College of Computer and Information Sciences, Imam University Riyadh, Saudi Arabia
Abstract: The Data Warehouse (DW) is one of the most important elements of Business Intelligence (BI) and is used to improve strategic decision making. A DW is a massive, historical, nonvolatile, and subject-oriented database that serves analytical queries, which tend to have long response times. Several well-known techniques are used to improve DW query processing performance, such as indexes, materialized views, and data fragmentation. In addition, emerging NoSQL technology can improve data storage and query processing through its main characteristics, and it can serve as an alternative to relational databases. This paper analyzes and compares findings drawn from earlier work on bitmap indexing for data warehouses. Those works are reviewed one by one.
Keywords: NoSQL, data warehouse bitmap join indexes, query processing, relational databases.
I. INTRODUCTION
A Data Warehouse (DW) extracts data from operational systems and makes it available, as historical snapshots, for scheduled reporting and ad-hoc queries. DW data differ from the data in the operational environment in several ways: warehouse data are organized so that related data are grouped together for easy access, multiple copies of the data from different points in time are kept together, and data are not updated once they have been inserted into the DW. Instead, the historical snapshots stored in the DW are refreshed periodically with data from the operational databases [1].
A DW provides many advantages to end users [2]:
- Improved end-user access to a wide range of data.
- Increased data consistency.
- More data documentation.
- Reduced computing costs and increased productivity.
- An appropriate place to collect related data from different sources.
- A computing infrastructure that can support changes in business structures and computer systems.
- The ability for end users to run ad-hoc queries or reports at any level without affecting the performance of operational systems.
End users have little time to generate reports and ad-hoc queries, and this is the key problem that the DW addresses. Several factors contribute to it [3]:
- Large amounts of data are stored in ADABAS, which makes end-user access difficult.
- The stored data are designed for transaction processing, not for ad-hoc reporting.
- Obtaining a report or data usually means waiting for a programmer to write a dedicated download program or to develop the report.
- The data as a whole may not be consistent at a single point in time.
- The copies of stored data used for historical reporting in "Operating Systems (OS)" may not be sufficient.
- End users have no knowledge of the contents available in the data stores.
The basic bitmap index was first used and implemented in the Model 204 Database Management System (DBMS). It consists of C bitmap vectors, where C is the number of distinct values of the indexed attribute, and one vector is created for each distinct value. The i-th bit of the bitmap vector representing value v is set to 1 if the i-th record of the indexed table contains v [4].
For an equality query, the bitmap vector for the value given in the query condition is read into memory. For a membership query of the form "Y in {v1, v2, ..., vn}", n bitmap vectors are read and the Boolean OR operation is performed n-1 times on them. A sparsity problem arises if the basic bitmap index is built on a high-cardinality attribute: the index then needs additional space to build and additional query processing time to answer queries [4].
Figure 1 Example of Simple Bitmap Index on Attribute Y (C=15), with columns RID, Y, and one bitmap vector per distinct value (IA, IB, ..., IO) [4].
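The equality and membership lookups described above can be sketched in a few lines of Python. This is a minimal illustration, not the Model 204 implementation: the function names are hypothetical, each bitmap vector is held in a Python integer, and the sample column reuses the ten visible (RID, Y) rows of the example table.

```python
from functools import reduce


def build_simple_bitmap_index(column):
    """Return {value: bitmap}; bit i of a bitmap is 1 iff column[i] == value."""
    index = {}
    for i, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << i)
    return index


def equality_query(index, value):
    """Y = value: read the single bitmap vector stored for that value."""
    return index.get(value, 0)


def membership_query(index, values):
    """Y in {v1, ..., vn}: OR together the bitmap vectors of the listed values."""
    return reduce(lambda acc, v: acc | index.get(v, 0), values, 0)


def matching_rids(bitmap, rids):
    """Translate the set bits of a result bitmap back into record identifiers."""
    return [rid for i, rid in enumerate(rids) if (bitmap >> i) & 1]


# Assumed sample data: the ten visible rows of the example table.
rids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
y = ["H", "D", "L", "C", "D", "B", "N", "A", "G", "F"]
index = build_simple_bitmap_index(y)

print(matching_rids(equality_query(index, "D"), rids))                # [2, 5]
print(matching_rids(membership_query(index, ["F", "H", "D"]), rids))  # [1, 2, 5, 100]
```

The index holds one integer per distinct value, which makes the sparsity problem visible: with a high-cardinality attribute the dictionary grows by one mostly-zero vector per value.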
The scatter bitmap index on attribute Y consists of about 2√C bitmap vectors (8 vectors for C = 15). Each bitmap vector represents a group of distinct values of the indexed attribute, so each attribute value is represented by two bitmap vectors. The values of the indexed attribute are divided into Z-groups and L-groups. The bit at position i of a bitmap vector is set to 1 if the record at position i of the indexed table is a member of that group of values. The grouping scheme of the scatter bitmap index on attribute Y with m = 5 is shown in Figure 2, and an example of this index is shown in Figure 3.
Figure 2 Grouping Scheme of Scatter Bitmap Index on Attribute Y (C=15) with m=5 [5].
RID   Y   L3  L2  L1  Z4  Z3  Z2  Z1  Z0
1     H   1   0   0   0   0   1   0   0
2     D   1   0   0   0   0   0   1   0
3     L   1   0   0   0   1   0   0   0
4     C   0   1   0   0   0   0   1   0
5     D   1   0   0   0   0   0   1   0
6     B   0   0   1   0   0   0   1   0
7     N   0   0   1   1   0   0   0   0
8     A   0   0   0   0   0   0   1   1
9     G   0   1   0   0   0   1   0   0
...
100   F   0   0   1   0   0   1   0   0
Figure 3 Example of Scatter Bitmap Index on Attribute Y (C=15) [5].
The index consists of eight bitmap vectors: {Z0, Z1, Z2, Z3, Z4, L1, L2, L3}. For records with Y = 'A', the corresponding positions in bitmap vectors Z1 and Z0 are set to 1. For records with Y = 'B', the corresponding positions in bitmap vectors L1 and Z1 are set to 1, and likewise for the other values. For an equality query, the two bitmap vectors that represent the attribute value are identified, and the Boolean AND operation is then performed on them [5].
For example, to answer a query with Y = 'D', bitmap vectors Z1 and L3 are read. The answer consists of the records for which Z1 ∧ L3 equals 1 (i.e., records 2 and 5). To answer a membership query of the form "Y in {v1, v2, ..., vn}", each value v is resolved using the equality algorithm above. The Boolean OR operation is then performed n-1 times on the intermediate results to produce the answer. In the worst case, 2n bitmap vectors are read, and n Boolean AND operations and n-1 Boolean OR operations are performed. For example, to answer a query with the selection condition Y in (F, H, D, L, N), the retrieval function is (Z1 ∧ L3) ∨ (Z2 ∧ L1) ∨ (Z2 ∧ L3) ∨ (Z3 ∧ L3) ∨ (Z4 ∧ L1), and the result is records 1, 2, 3, 5, 7, and 100. A good grouping scheme is therefore key for the scatter bitmap index, because it reduces the number of bitmap vectors that must be accessed when answering membership queries. A data clustering technique can be used to find such a grouping scheme, which improves query processing time [5].
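The two-vector lookup can be illustrated with a short Python sketch. It is only an approximation of the scheme in [5]: the value-to-vector assignment in GROUPS is an assumption read off the example rows and the retrieval function above, and the function names are hypothetical.

```python
# Assumed assignment of each attribute value to its two bitmap vectors,
# taken from the example rows and the retrieval function in the text.
GROUPS = {
    "A": ("Z0", "Z1"), "B": ("Z1", "L1"), "C": ("Z1", "L2"),
    "D": ("Z1", "L3"), "F": ("Z2", "L1"), "G": ("Z2", "L2"),
    "H": ("Z2", "L3"), "L": ("Z3", "L3"), "N": ("Z4", "L1"),
}
VECTORS = ["Z0", "Z1", "Z2", "Z3", "Z4", "L1", "L2", "L3"]


def build_scatter_index(column):
    """Set bit i in both vectors assigned to the value stored in record i."""
    index = {name: 0 for name in VECTORS}
    for i, value in enumerate(column):
        for name in GROUPS[value]:
            index[name] |= 1 << i
    return index


def equality_query(index, value):
    """Y = value: AND the two bitmap vectors that represent the value."""
    first, second = GROUPS[value]
    return index[first] & index[second]


def membership_query(index, values):
    """Y in {v1, ..., vn}: OR together the n equality results."""
    result = 0
    for value in values:
        result |= equality_query(index, value)
    return result


def matching_rids(bitmap, rids):
    """Translate the set bits of a result bitmap back into record identifiers."""
    return [rid for i, rid in enumerate(rids) if (bitmap >> i) & 1]


rids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
y = ["H", "D", "L", "C", "D", "B", "N", "A", "G", "F"]
index = build_scatter_index(y)

print(matching_rids(equality_query(index, "D"), rids))                            # [2, 5]
print(matching_rids(membership_query(index, ["F", "H", "D", "L", "N"]), rids))    # [1, 2, 3, 5, 7, 100]
```

In this sketch the membership query touches five distinct vectors (Z1, Z2, Z3, Z4, L1, L3 minus repeats would be six at most), which is why a grouping that places frequently co-queried values in shared vectors reduces the number of vectors read.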
Data clustering is a data mining technique used in fields such as image analysis, bioinformatics, and pattern recognition.
Moreover, this technique splits a dataset into clusters using a distance measure that determines the proximity or similarity of two data elements. Many data clustering algorithms have been proposed, such as grid-based, density-based, hierarchical, and partitioning clustering [6].
This paper focuses on previous work on bitmap indexing for data warehouses. The remainder of the paper is organized as follows: Section II (Techniques Description) describes techniques proposed by different researchers, and Section III compares those techniques.
II. TECHNIQUES DESCRIPTION
A. Determining Bitmap Join Indexes through Data Mining Techniques
An approach for selecting bitmap join indexes with data mining techniques is presented in [7]. The approach aims to choose bitmap join indexes that reduce query processing cost while meeting storage constraints. First, the candidate attributes for the bitmap join index are generated with an Extended Frequent Pattern (EFP) tree. Then the bitmap join index is selected and the final index is built. The foreign key attributes of the fact table and the key attributes of the dimension tables are removed from any closed frequent item-set that contains them. If a closed frequent item-set contains both key and non-key attributes, the candidate non-key attributes are obtained by removing the key attributes. The frequent item-sets themselves are generated with the EFP tree, which is based on the FP-growth algorithm, and item-sets related to key attributes are removed.
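The key-attribute filtering step can be sketched as follows. This is a simplified illustration and not the EFP-tree algorithm of [7]: it assumes the frequent item-sets have already been mined, and the function name and the attribute names in the example are hypothetical.

```python
def candidate_attribute_sets(frequent_itemsets, key_attributes):
    """Strip key attributes from each item-set; drop item-sets left empty."""
    keys = set(key_attributes)
    candidates = set()
    for itemset in frequent_itemsets:
        non_key = frozenset(itemset) - keys
        if non_key:  # item-sets made up only of key attributes disappear
            candidates.add(non_key)
    return candidates


# Hypothetical star-schema attributes, for illustration only.
keys = {"sales.cust_id", "customer.cust_id", "sales.prod_id"}
itemsets = [
    {"sales.cust_id", "customer.cust_id"},
    {"sales.cust_id", "customer.city"},
    {"customer.city", "product.category", "sales.prod_id"},
]

for candidate in sorted(candidate_attribute_sets(itemsets, keys), key=sorted):
    print(sorted(candidate))
```

The first item-set contains only key attributes and is discarded; the other two reduce to their non-key attributes, which become the candidates for the bitmap join index.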
B. FPmax-Based Approach
An approach based on the FP-max algorithm, which extends the selection of bitmap join indexes, was proposed in [8].
Join indexes are mainly used to pre-compute the joins between the fact table and the dimension table(s) of a relational DW modeled with a star schema. Selecting bitmap join indexes is, however, a very complex problem. The proposed approach consists of three steps, which are illustrated in Figure 4. In the first step, the extraction context is built from the available workload for use by the FP-max algorithm.
The possible candidate attributes are written to a text file in which each line represents a query (Qi) and each column represents an attribute (Aj) referenced by that query. In the second step, the FP-max algorithm is applied to produce the candidate indexes. In the third step, the resulting indexes are built to improve workload response time.
FIGURE 4 THE FP-MAX ALGORITHM [8].
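The first two steps can be illustrated with a small Python sketch. It builds the query-by-attribute extraction context and then finds the maximal frequent attribute sets. Note the substitution: a brute-force enumeration stands in for the FP-max algorithm of [8], which reaches the same kind of result far more efficiently. The workload, attribute names, and support threshold are assumptions made for the example.

```python
from itertools import combinations


def extraction_context(workload, attributes):
    """One row per query, one column per attribute; 1 if the query uses it."""
    return [[1 if a in used else 0 for a in attributes] for used in workload]


def frequent_itemsets(context, attributes, min_support):
    """Brute-force search for attribute sets used by >= min_support queries."""
    n = len(attributes)
    frequent = []
    for size in range(1, n + 1):
        for combo in combinations(range(n), size):
            support = sum(all(row[j] for j in combo) for row in context)
            if support >= min_support:
                frequent.append(frozenset(attributes[j] for j in combo))
    return frequent


def maximal_itemsets(frequent):
    """Keep only frequent sets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < t for t in frequent)]


# Hypothetical workload: the set of attributes referenced by each query.
attributes = ["city", "category", "year", "gender"]
workload = [
    {"city", "category"},
    {"city", "category", "year"},
    {"city", "year"},
    {"gender"},
]

context = extraction_context(workload, attributes)
maximal = maximal_itemsets(frequent_itemsets(context, attributes, min_support=2))
for itemset in sorted(maximal, key=sorted):
    print(sorted(itemset))
```

Each maximal set is a candidate bitmap join index; the third step of the approach would then decide which of these candidates to build.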
C. Scatter Bitmap Index Optimization through Data Clustering
Scatter Bitmap Index Optimization using Data Clustering (SBIOC) was presented in [9]. SBIOC consists of three phases: data preparation, attribute clustering, and scatter bitmap index optimization. The following figure illustrates these phases.
FIGURE 5 THE STRUCTURE OF SBIOC [9].
In terms of space-time efficiency, SBIOC and the scatter bitmap index clearly outperform the other techniques. When the scatter bitmap index is used, the grouping scheme should be well defined. The researchers applied k-modes clustering to group the values of the indexed attribute, which helped minimize the number of bitmap vectors accessed.
D. Enhanced Encoded Bitmap Index
An indexing method called the Enhanced Encoded Bitmap Index (E-EBI) was proposed in [10] to improve the equality query performance of the traditional Encoded Bitmap Index. An example is shown in the figure below.
RID    ...  X    ...    E3  E2  E1  E0
1      ...  5    ...    0   1   0   1
2      ...  2    ...    0   0   1   0
3      ...  3    ...    0   0   1   1
4      ...  10   ...    1   0   1   0
5      ...  1    ...    0   0   0   1
6      ...  0    ...    0   0   0   0
7      ...  3    ...    0   0   1   1
8      ...  15   ...    1   1   1   1
9      ...  13   ...    1   1   0   1
10     ...  1    ...    0   0   0   1
...
1000   ...  14   ...    1   1   1   0
(a) Table T             (b) E-EBI

E3  E2  E1  E0  ANS_VECTOR
1   0   0   1   0
1   1   1   0   0
1   1   1   1   1
0   1   1   0   0
1   1   0   1   0
1   1   0   0   0
1   1   1   1   1
0   0   1   1   0
0   0   0   1   0
1   1   0   1   0
...
0   0   1   0   0
(c) The result of query "X=3" (E3 and E2 are shown after inversion).
FIGURE 6 EXAMPLE OF E-EBI AND ITS QUERY RESULT [10].
The E-EBI algorithm for answering an equality query consists of eight steps. In the first step, all bits of the result bitmap vector (ANS_VECTOR) are set to '1'. In the second step, the encoded bit representation of the value specified in the query condition (i.e., h PATTERN) is calculated from the binary encoding.
In each following step i, the i-th bitmap vector is combined with ANS_VECTOR using the Boolean AND operation if Ei is equal to '1' in the encoded pattern. Otherwise, each bit in the bitmap vector is inverted before the Boolean AND operation with ANS_VECTOR is performed. In the last step, ANS_VECTOR is output as the final result. For example, if the selection condition is "X=3", the binary encoding represents '3' as '0011'. Therefore, E3 and E2 need to be inverted before the Boolean AND operation with ANS_VECTOR, while E1 and E0 can be combined with ANS_VECTOR using the Boolean AND operation directly. The result of the above equality query is the 3rd and 7th tuples. E-EBI is a new method for improving the equality query performance of the Encoded Bitmap Index. It minimizes the number of bitmap vectors and the query time by using low-cost Boolean operations to answer equality queries [10].
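The equality lookup just described can be sketched in Python. This is a minimal illustration under stated assumptions rather than the implementation from [10]: the function names are hypothetical, each bitmap vector E0..E3 is stored as a Python integer, and the sample column holds the first ten X values of the example table.

```python
def build_encoded_index(column, n_bits):
    """vectors[b] collects bit b of every value: bit i of vectors[b] is bit b of column[i]."""
    vectors = [0] * n_bits
    for i, x in enumerate(column):
        for b in range(n_bits):
            if (x >> b) & 1:
                vectors[b] |= 1 << i
    return vectors


def equality_query(vectors, value, n_rows):
    """Answer X = value by ANDing each vector, or its inverse, into the answer."""
    all_ones = (1 << n_rows) - 1
    ans = all_ones                       # ANS_VECTOR starts with every bit set
    for b, vec in enumerate(vectors):
        if (value >> b) & 1:
            ans &= vec                   # pattern bit is 1: AND the vector directly
        else:
            ans &= ~vec & all_ones       # pattern bit is 0: invert, then AND
    return ans


def matching_rids(bitmap, rids):
    """Translate the set bits of a result bitmap back into record identifiers."""
    return [rid for i, rid in enumerate(rids) if (bitmap >> i) & 1]


# First ten rows of the example table.
rids = list(range(1, 11))
x = [5, 2, 3, 10, 1, 0, 3, 15, 13, 1]
vectors = build_encoded_index(x, n_bits=4)

print(matching_rids(equality_query(vectors, 3, len(x)), rids))   # [3, 7]
```

Only four vectors are stored for sixteen possible values, at the price of touching every vector on each equality query.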
E. Horizontal Format Data Mining
The Horizontal Format Data Mining with Extended Bitmaps algorithm was proposed in [11]. The algorithm mines a dataset in horizontal format by transforming it into vertical format, using bitmap indexes to store the item-set data. The steps of the algorithm are listed below:
- Parse the dataset to create the first item-sets and bitmaps.
- Remove redundant associated items.
- Prune the item-sets according to the Apriori property.
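The horizontal-to-vertical transformation and the support-based pruning can be sketched as below. This is not the algorithm of [11], only an illustration of the general idea; the function names, the sample transactions, and the support threshold are assumptions.

```python
from itertools import combinations


def to_vertical_bitmaps(transactions):
    """Horizontal rows (one item set per transaction) -> one bitmap per item."""
    bitmaps = {}
    for tid, items in enumerate(transactions):
        for item in items:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << tid)
    return bitmaps


def support(bitmaps, itemset):
    """Count the transactions containing every item: AND the bitmaps, count 1s."""
    items = iter(itemset)
    acc = bitmaps[next(items)]
    for item in items:
        acc &= bitmaps[item]
    return bin(acc).count("1")


def frequent_pairs(bitmaps, min_support):
    """Pair up only frequent single items, then keep the frequent pairs."""
    singles = sorted(i for i in bitmaps if support(bitmaps, [i]) >= min_support)
    pairs = {}
    for pair in combinations(singles, 2):
        count = support(bitmaps, pair)
        if count >= min_support:
            pairs[pair] = count
    return pairs


# Hypothetical transactions in horizontal format.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter"},
    {"bread", "milk", "eggs"},
]
bitmaps = to_vertical_bitmaps(transactions)
print(frequent_pairs(bitmaps, min_support=2))   # {('bread', 'milk'): 3}
```

Skipping "eggs" before any pair is formed is the pruning step: an item that is infrequent on its own cannot appear in a frequent pair.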
F. CasAB Encoding Scheme
CasAB, a novel bitmap encoding scheme, was proposed in [12]. CasAB was introduced to obtain a completely false-positive-free AB and to compress bitmaps effectively while maintaining direct access. The idea comes from Bloomier filters, which can associate arbitrary values with the inserted elements. This can be done with a cascade of Bloom filters, but those filters can produce false positives. CasAB resembles Bloomier filters in that both use a cascade of Bloom filters. The proposed scheme, however, removes false positives, because it checks for them against the original dataset ahead of query time, at a small overhead in time and space. Experimental results show that the scheme answers queries accurately. Its execution time and space are larger than those of the AB scheme, which confirms the theoretical analysis. The figure below illustrates the CasAB state diagram.
FIGURE 7 THE CASAB STATE DIAGRAM [12].
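The idea of removing false positives ahead of query time can be illustrated with a generic cascade of Bloom filters. This sketch is not the CasAB algorithm of [12]; it is a common cascading construction, shown under the assumption that the full universe of bitmap cells is known when the index is built. All class, function, and parameter names are hypothetical, and the exact-set fallback at max_levels is an addition that guarantees the build loop terminates.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter over hashable items, stored in a Python integer."""

    def __init__(self, items, salt, bits_per_item=10, n_hashes=3):
        items = list(items)
        self.size = max(8, bits_per_item * len(items))
        self.salt = salt
        self.n_hashes = n_hashes
        self.bits = 0
        for item in items:
            for pos in self._positions(item):
                self.bits |= 1 << pos

    def _positions(self, item):
        for k in range(self.n_hashes):
            digest = hashlib.sha256(f"{self.salt}:{k}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def build_cascade(members, universe, max_levels=16):
    """Each level stores the false positives of the level before it."""
    include = set(members)
    exclude = set(universe) - include
    levels = []
    while include:
        if len(levels) == max_levels:
            levels.append(frozenset(include))   # exact fallback, no false positives
            break
        level = BloomFilter(include, salt=len(levels))
        levels.append(level)
        # Items wrongly accepted by this level must be caught by the next one.
        include, exclude = {x for x in exclude if x in level}, include
    return levels


def cascade_contains(levels, item):
    """Exact membership test, valid for items drawn from the build universe."""
    for depth, level in enumerate(levels):
        if item not in level:
            return depth % 2 == 1
    return len(levels) % 2 == 1


# Hypothetical bitmap cells: (row, value) is a member when the bit is 1.
column = ["A", "B", "A", "C", "D", "B", "A", "C", "D", "A"]
members = {(i, v) for i, v in enumerate(column)}
universe = {(i, v) for i in range(len(column)) for v in "ABCD"}
levels = build_cascade(members, universe)
print(all(cascade_contains(levels, c) == (c in members) for c in universe))
```

The extra levels are the cost of exactness: they take additional space and lookup time compared with a single approximate filter, which is the same trade-off the section reports for CasAB against AB.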
G. Evaluating Iceberg Queries through the Compressed Bitmap Index
A compressed bitmap index was used to evaluate iceberg queries in [13]. That work aims to develop a new dynamic bitmap pruning algorithm based on a deferred strategy, in order to compute iceberg queries efficiently with a better evaluation time. A bit pruning strategy was developed that delays XOR operations so that bitmap vectors are pruned before the bitwise-XOR operations are applied, which speeds up query evaluation.
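For readers unfamiliar with iceberg queries, the sketch below shows how one can be answered with uncompressed bitmaps and a simple static pruning rule. It does not implement the deferred XOR strategy or the compression used in [13]; it only shows why pruning vectors before combining them saves work. The function names and sample columns are assumptions.

```python
def popcount(bitmap):
    """Number of 1 bits in a bitmap held as a Python integer."""
    return bin(bitmap).count("1")


def build_index(column):
    """Return {value: bitmap}; bit i of a bitmap is 1 iff column[i] == value."""
    index = {}
    for i, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << i)
    return index


def iceberg_query(col_a, col_b, threshold):
    """Return (a, b) groups whose row count is at least the threshold."""
    # Prune first: a vector with fewer than `threshold` ones cannot
    # contribute to any group that reaches the threshold.
    index_a = {v: bm for v, bm in build_index(col_a).items() if popcount(bm) >= threshold}
    index_b = {v: bm for v, bm in build_index(col_b).items() if popcount(bm) >= threshold}
    result = {}
    for a, bm_a in index_a.items():
        for b, bm_b in index_b.items():
            count = popcount(bm_a & bm_b)    # rows holding both a and b
            if count >= threshold:
                result[(a, b)] = count
    return result


# Hypothetical two-attribute relation.
col_a = ["x", "x", "y", "x", "y", "z", "x"]
col_b = ["p", "p", "q", "p", "q", "p", "q"]
print(iceberg_query(col_a, col_b, threshold=2))   # {('x', 'p'): 3, ('y', 'q'): 2}
```

Here the vector for "z" is dropped before any AND is computed, so two of the six candidate pairs are never evaluated.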
III. COMPARATIVE ANALYSIS OF TECHNIQUES
A. Description of Analyzed Techniques
The technique for determining bitmap join indexes through data mining uses the EFP-tree algorithm, which is based on FP-growth, to produce the frequent item-sets. It works in three steps. It helps minimize the number of candidate attributes for the bitmap join index and produces bitmap join indexes with low storage requirements and low cost. However, it cannot handle large volumes of data and cannot adapt the bitmap join index dynamically as the DW evolves.
The scatter bitmap index optimization technique uses k-modes clustering to group the values of the indexed attribute, which helps minimize the number of bitmap vectors accessed. The scatter bitmap index is well suited to processing complex DW queries. This approach improves the processing time and efficiency of membership queries, but it suffers from data inequality and inconsistency.
The Enhanced Encoded Bitmap Index (E-EBI) method is used to improve the equality query performance of the encoded bitmap index. It relies on the encoded bitmap index and low-cost Boolean operations. It improves on the performance of the traditional Encoded Bitmap Index, but it also suffers from data inequality and inconsistency.
The horizontal format data mining technique uses extended bitmaps. Its algorithm is implemented in the Java programming language; performance might be better with C or another lower-level language. The technique uses memory more effectively, but the algorithm may not cope with large datasets that have many items per transaction.
The CasAB encoding scheme uses a cascade of Bloom filters. Experimental results show that the scheme answers queries accurately and removes false positives. However, compared with the AB scheme, its execution time and space are larger.
The method for evaluating iceberg queries uses a compressed bitmap index. A bit pruning strategy was developed that delays XOR operations so that bitmap vectors are pruned before the bitwise-XOR operations are applied. This method speeds up query evaluation, but it cannot remove unexpected bitwise-AND operations.
B. Techniques Comparison
Each algorithm is based on, or is an extended version of, some other technique or method, and each proceeds through a number of steps or phases. Each has associated strengths and weaknesses. The similarities/differences, strengths, and weaknesses of each technique are presented in Table 1.
IV. CONCLUSION
Table 1
A Comparison between All Techniques

Technique: Determining Bitmap Join Indexes through Data Mining Techniques
Similarity/Difference: Uses EFP-Tree and FP-Growth; process goes through only three steps
Strength: Minimizes the number of candidate attributes for the bitmap join index
Weakness: Unable to handle large volumes of data

Technique: FPmax-Based Approach
Similarity/Difference: Uses the FPmax algorithm; process consists of three steps
Strength: Helps improve the quality of the candidate indexes; the generated indexes improve workload performance
Weakness: Difficult to implement; unable to choose the final index configuration

Technique: Scatter Bitmap Index Optimization through Data Clustering
Similarity/Difference: Uses k-modes clustering; process consists of three steps
Strength: Improves the processing time and efficiency of membership queries
Weakness: Data inequality and inconsistency

Technique: Enhanced Encoded Bitmap Index
Similarity/Difference: Uses encoded bitmap indexing and low-cost Boolean operations; process consists of eight steps
Strength: Can improve on the performance of the traditional Encoded Bitmap Index
Weakness: Data inequality and inconsistency

Technique: Horizontal Format Data Mining
Similarity/Difference: Uses extended bitmaps; process consists of only three steps
Strength: Better memory utilization
Weakness: May not cope with large datasets that have many items per transaction

Technique: CasAB Encoding Scheme
Similarity/Difference: Uses a cascade of Bloom filters; process consists of only three steps
Strength: Can remove false positives
Weakness: Execution time and space are larger than those of the AB scheme

Technique: Evaluating Iceberg Queries through the Compressed Bitmap Index
Similarity/Difference: Uses the compressed bitmap index; process consists of only three steps
Strength: Speeds up query evaluation
Weakness: Unable to remove unexpected bitwise-AND operations
REFERENCES
[1] Stabno, M and Wrembel, R, "RLH: Bitmap Compression Technique Based on Run-Length and Huffman Encoding", Information Systems, 34, pp. 400-414, 2009.
[2] Wu, K, Otoo, J, and Shoshani, A, "Optimizing bitmap indices with efficient compression", ACM Transactions on Database Systems (TODS), 31, pp. 1-38, 2006.
[3] Wattanakitrungroj, N and Vanichayobon, S, "Dual Bitmap Index: Space-Time Efficient Bitmap Index for Equality and Membership Queries", Proceedings of the International Symposium on Communications and Information Technologies (ISCIT'06), pp. 568-573, 2006.
[4] Chan, C and Ioannidis, Y, "An Efficient Bitmap Encoding Scheme for Selection Queries", Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pp. 215-226, 1999.
[5] Vanichayobon, S, Manfuekphan, J and Gruenwald, L, "Scatter Bitmap: Space-time Efficient Bitmap Indexing for Equality and Membership Queries", Proceedings of the IEEE International Conference on Cybernetics and Intelligent Systems, pp. 1-6, 2006.
[6] Rapševičius, V, Juozapavičius, A and Brazauskas, A, "Clustering descriptive-textual data on Silurian rocks of Lithuania", 55, pp. 48-57, 2006.
[7] Geun, H and Jin, J, "A Study on the Selection of Bitmap Joins Index Using Data Mining Techniques", Strategic Technology (IFOST), pp. 1-5, 2012.
[8] Ziani, B and Ouinten, Y, "Mining Maximal Frequent Item-Sets: a Java Implementation of FPMAX Algorithm", Proceedings of the 6th International Conference on Innovations in Information Technology (IIT'09), pp. 11-15, 2011.
[9] Weahama, W, Vanichayobon, S and Manfuekphan, J, "Using Data Clustering to Optimize Scatter Bitmap Index for Membership Queries", International Conference on Computer and Automation Engineering, 2009.
[10] Keawibal, A, Wattanakitrungroj, N and Vanichayobon, S, "Enhanced Encoded Bitmap Index for equality query", Deepdyve journal Institute of Electrical and Electronics Engineers, 2012.
[11] Alwis, D, Malinga, S, Pradeeban, K, Weerasiri, D and Perera, S, "Horizontal Format Data Mining with Extended Bitmaps", The International Journal of Computer Information Systems and Industrial Management Applications, 2010.
[12] Wang, Z, "CasAB: Building Precise Bitmap Indices via Cascaded Bloom filters", Fourth International Conference on Internet Computing for Science and Engineering, pp. 85-92, 2009.