Vol 1, No 1 (2013)


ISSN: 1694-2108 | Vol. 1, No. 1. MAY 2013 1

An Efficient Rough Set Approach in Querying Covering Based Relational Databases

P. Prabhavathy

School of Information Technology and Engineering Vellore Institute of Technology, Vellore, India

Dr. B. K. Tripathy

School of Computing Science and Engineering Vellore Institute of Technology, Vellore, India

ABSTRACT

Handling uncertainty and incompleteness of knowledge is a challenging task in information systems. Rough set theory enhances databases by allowing for the management of uncertainty. Owing to their versatility, rough sets can be integrated into an underlying database model, such as the relational or object-oriented model, and can also be used in the design and querying of databases. Beaubouef and Petry extended relational databases to introduce rough relational databases. Rough Relational Databases (RRDB) are databases that can have multivalued attributes. Querying data from these databases is difficult because the multivalued attributes are indiscernible. In the past, the concept of rough sets has been used to query data from RRDB. In this paper, we introduce second type covering-based rough sets to query data by using a cover set instead of the conventional equivalence class. This increases the number of possible data retrievals. We also encode multivalued attributes into a simplified binary code, which makes data querying more efficient. Finally, we present a comparative study of classical rough sets and second type covering-based rough sets for querying data.

Keywords

Rough Sets, Relational Databases, Query, Covering.

1. INTRODUCTION


[...] has proved itself as an effective tool for handling uncertainty in information systems. Rough set theory was incorporated into classical relational databases, and the result is termed the "rough relational database model" for uncertain information systems. The rough relational database (RRDB) and the relational database (RDB) share several features; the primary difference is that in the relational database model (RDM) the attribute values are atomic and singleton, whereas in the rough relational database model (RRDM) [2] the attribute values are multivalued. Previous work on RRDB includes its architecture, rough information entropy, rough relational operations, rough functional dependency, rough normalization and the theory of rough data querying [4], [5], [6]. Rough set theory [7] is a mechanism that can be used for rough data management as well as query handling. Data querying to fetch attribute values is divided into two types: certain data querying and possible data querying. RRDBs use rough sets to query indiscernible data, with equivalence classes determining the lower and upper approximation regions.

In the following sections, we introduce second type covering-based rough sets to increase the number of possible data retrievals. This is done by involving a cover set instead of the conventional equivalence class. Also, we use the encoding technique for querying data [9] to maximize the efficiency. This is achieved by means of a function mapping to establish a relationship between the attribute values and the elements of the covering set.

Section 2 describes related work. Section 3 reviews some basic concepts of rough set theory, covering based rough sets and the rough relational database. Sections 4 and 5 discuss encoding using rough sets and covering-based sets, respectively. Section 6 discusses querying data using the encoded values, with algorithms for both rough sets and second type covering based rough sets. Section 7 gives the comparative study. Finally, we conclude our work in Section 8.

2. RELATED WORKS


[...] RRDB is decomposed into standard relational tables according to the semantics of the query data, and SQL and rough relational operators are then used to get the results. This approach requires decomposing the RRDB and querying according to the semantics of the query data, which wastes time and storage space. The rough relational database transformation approach [12] decomposes the data of a rough relational database and transforms it into a relational database according to the characteristics of both models, by means of the multiplication principle and the Cartesian product, a basic operation of relational algebra. Redundant data are then deleted and the resulting RDB is optimized. Finally, a translation algorithm is designed, applied to a soil example, and shown to be valid. However, the data value items of the transformed and optimized RDB are repeated.

Rough data querying has also been discussed on the basis of granular computing [10]. That approach calculates the lower and upper approximations of every atomic value in an attribute's domain and obtains the final results by rough set operation principles. However, calculating the lower and upper approximations of one atomic value requires scanning all tuples of a table, so calculating them for all atomic values takes a very long time. It also requires processing the semantics of the query data. Covering based rough sets [11], [13] provide generality as well as better modelling power than basic rough sets. This model also unifies many other extensions of the basic rough set model. SQL has also been used in rough data querying: in [4], SQL was extended so that results are obtained by comparing equivalence classes rather than values. However, comparing equivalence classes means comparing the sets that make them up, which is inefficient.

3. BASIC CONCEPTS

3.1 Rough set theory

Definition 1


3.2 Covering based rough sets

Let (U, C) be a covering approximation space.

Definition 2 (Covering)

Let U be a universe and C a family of subsets of U. C is called a cover of U if no subset in C is empty and ∪C = U. The ordered pair (U, C) is called a covering approximation space if C is a cover of U.

It is clear that a partition of U is certainly a covering of U, so the concept of a covering is an extension of a partition. In the following discussion, unless stated to the contrary, the coverings are considered to be finite, that is, coverings consist of a finite number of sets in them. First, we list some definitions about coverings to be used in this paper.

3.2.1 Second type of covering based rough sets

For the second type of covering-based rough set model, the lower approximation is the same as in the first type of covering based rough set model.

Definition 3 (SL)

By the second type of lower approximation of a set X ⊆ U in the space (U, C) we mean the set:

∀X⊆U, SL(X) = ∪ {K|K ∈C, K ⊆ X}

We define second type of covering based upper approximation operation.

Definition 4 (SH)

Let C be a covering of U. The second type of covering upper approximation operation is defined as follows:

∀X ⊆ U, SH(X) = ∪ {K | K ∈ C, K ∩ X ≠ ∅}

The concept and properties of the second type of covering based rough sets have now been discussed sufficiently. We next discuss how these concepts are incorporated in rough relational databases.
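As an illustration of the two approximations, the following minimal Python sketch computes both for a small covering. It assumes the usual form of the second type upper approximation, namely the union of all cover elements that intersect X, which matches the intersection test used by the query algorithms in Section 6. The function names `lower_approx` and `upper_approx` and the toy universe are hypothetical and serve only as an example.

```python
def lower_approx(cover, x):
    """SL(X): union of the cover elements K that are fully contained in X."""
    result = set()
    for k in cover:
        if k <= x:          # K is a subset of X
            result |= k
    return result


def upper_approx(cover, x):
    """SH(X), assumed form: union of the cover elements K with K ∩ X non-empty."""
    result = set()
    for k in cover:
        if k & x:           # K shares at least one element with X
            result |= k
    return result


# Toy covering of U = {a, b, c, d, e}; the blocks overlap, so it is not a partition.
cover = [frozenset("ab"), frozenset("bc"), frozenset("cd"), frozenset("de")]
x = {"a", "b", "c"}

lower = lower_approx(cover, x)
upper = upper_approx(cover, x)
print(sorted(lower), sorted(upper))
```

Because the blocks of a covering may overlap, the upper approximation can pull in elements (here `d`) through a block that only partly meets X.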

3.3 Rough relational database model (RRDM)


[...] containing tuples. The integration of rough sets with the traditional database model is defined as the rough relational database model (Beaubouef, Petry, & Buckles, 1995). It is a logical database model in which domains are partitioned into equivalence classes. The domains of the attributes are partitioned by an equivalence relation designated by the database user. Within each domain, a group of values that is considered indiscernible forms an equivalence class.

These relations are considered as sets whose elements are the tuples of the relation; like the elements of sets in general, the tuples are unordered and non-duplicated. In the ordinary relational database, a tuple ti takes the form (di1, di2, …, dim), where Dj is a domain set and dij ∈ Dj.

In a rough relational database, however, dij ⊆ Dj; although dij is not required to be a singleton, dij ≠ ∅ (the model includes non-first normal form relations). Let P(Di) denote powerset(Di) − {∅}.

3.4 Rough relational database model definition (RRDB)

A rough relational database is defined as follows:

S = (U, A, D, R), where U is the set of all tuples, A is the attribute set, D is the set of attribute domains, and R is the set of equivalence classes on D.

In an RRDB, for an attribute Ai ∈ A, DAi is the domain of Ai and RAi is the set of equivalence classes of Ai; for a tuple r ∈ U, r(Ai) is the value of r on attribute Ai, and r(Ai) ⊆ DAi.

In fact, RRDB is a special kind of Multi-valued information system according to the definition of information system.

Definition 5: A rough relation is a subset of the set cross product P(D1)× P(D2) ×... × P(Dm).

Definition 6: An interpretation α = (a1, a2, …, am) of a rough tuple ti = (di1, di2, …, dim) is any value assignment such that aj ∈ dij for all 1 ≤ j ≤ m; aj is called a sub-interpretation of dij.
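Definition 6 can be illustrated with a short Python sketch that enumerates every interpretation of a rough tuple as the Cartesian product of its multivalued components. The helper name `interpretations` is hypothetical, and the sample tuple reuses the values of ROW4 from Table 1 purely as an example.

```python
from itertools import product


def interpretations(rough_tuple):
    """Return every interpretation (a1, ..., am) with aj taken from dij."""
    # Sort each component so that the enumeration order is deterministic.
    return list(product(*(sorted(d) for d in rough_tuple)))


# A rough tuple with two multivalued attributes (values of ROW4 in Table 1).
row4 = ({"Pre-teen", "Teenage", "Young Adult"}, {"Rom-com", "Sitcom"})

interps = interpretations(row4)
for alpha in interps:
    print(alpha)
```

A rough tuple whose components have sizes n1, …, nm therefore has n1 × … × nm interpretations, which is why querying directly on the multivalued data becomes expensive.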


Table 1: TV Shows data

| ROW ID | Person Age Groups | TV Shows |
|---|---|---|
| ROW1 | {Toddler} | {Baby Shows} |
| ROW2 | {Kid, Pre-teen, Teenage, Toddlers} | {Cartoons, Educational} |
| ROW3 | {Pre-teen, teenage} | {Teenage comedy shows} |
| ROW4 | {Pre-teen, Teenage, Young Adult} | {Rom-com, Sitcom} |
| ROW5 | {Young Adult, Adult} | {Sitcom, Serials, Music videos} |
| ROW6 | {Pre-teen, Teenage, Young Adult, Adult, Young Middle Aged, Middle Aged, V. Young SC} | {Movies, Documentaries} |
| ROW7 | {Pre-teen, Teenage, Young Adult, Adult, Young Middle Aged} | {Reality Shows} |
| ROW8 | {Adult, Young Middle Aged, Middle Aged, V. Young SC, Young SC, SC, Old SC} | {News} |
| ROW9 | {Kid, Pre-teen, Teenage, Young Adult, Young Middle Aged, Middle Aged, V. Young SC, Young SC} | {Sports Shows, Infotainment} |
| ROW10 | {V. Young SC, Young SC, SC, Old SC} | {Old Classics, Serials} |
| ROW11 | {SC, Old SC} | {Religious Shows, Serials} |

Table 1 shows the relationship between TV shows and their viewers, who belong to different age groups.

The different age groups are categorized as follows:

Toddler: 3-5 years; Kid: 6-9 years; Pre-teen: 10-12 years; Teenage: 13-17 years; Young Adult: 18-20 years; Adult: 21-39 years; Young Middle Aged: 40-49 years; Middle Aged: 50-54 years; Very Young Senior Citizen (V. Young SC): 55-64 years; Young Senior Citizen (Young SC): 65-74 years; Senior Citizen (SC): 75-84 years; Old Senior Citizen (Old SC): 85+ years.

4. ENCODING USING ROUGH SETS


[...] Yi of RPAG; = 0 otherwise.

Consider the multivalued attribute “Person Age Groups” in the rough relational database depicted in Table 1.

DPAG = {Toddler, Kid, Pre-teen, Teenage, Young Adult, Adult, Young Middle Aged, Middle Aged, Very Young Senior Citizen, Young Senior Citizen, Senior Citizen, Old Senior Citizen}

RPAG= {Y1, Y2, Y3, Y4, Y5} = {[Toddler, Kid] [Pre-teen, Teenage] [Young Adult, Adult] [Young Middle Aged, Middle Aged] [Very Young Senior Citizen, Young Senior Citizen, Senior Citizen, Old Senior Citizen]}

Now, for instance, to encode the arbitrary value K = {Kid, Pre-teen, Teenage, Toddlers} of the multivalued attribute “Person Age Groups” in ROW2, K is compared with each equivalence class of RPAG. K intersects the first two equivalence classes, so the first two bits are 1 and the remaining bits are 0. Therefore, the PAG_CODE for ROW2 is 11000, as shown in Table 2.
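The encoding step for ROW2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `encode` and the variable `R_PAG` are hypothetical, and "Toddlers" is normalised to the domain value "Toddler" so that it matches DPAG.

```python
def encode(value, classes):
    """Return e1...en as a bit string: ei = '1' if value shares an element with Yi."""
    return "".join("1" if value & y else "0" for y in classes)


# Equivalence classes Y1..Y5 of the attribute "Person Age Groups" (RPAG).
R_PAG = [
    {"Toddler", "Kid"},
    {"Pre-teen", "Teenage"},
    {"Young Adult", "Adult"},
    {"Young Middle Aged", "Middle Aged"},
    {"Very Young Senior Citizen", "Young Senior Citizen",
     "Senior Citizen", "Old Senior Citizen"},
]

# Value of ROW2; "Toddlers" is written as "Toddler" (assumed to be the same group).
row2 = {"Kid", "Pre-teen", "Teenage", "Toddler"}

code = encode(row2, R_PAG)
print(code)
```

The bit string has one position per equivalence class, so its length depends only on the partition and not on how many values a tuple holds.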

Similarly, consider the multivalued attribute “TV Shows” in Table 1:

DTV_Shows = {Baby Shows, Cartoons, Educational, Teenage Comedy Shows, Rom Com, Sitcom, Serials, Music Videos, Movies, Documentaries, Reality Shows, News, Sports Shows, Infotainment, Old Classics, Religious Shows}

RTV_Shows= {[Baby Shows, Cartoons, Educational] [Teenage Comedy Shows, Rom coms, Sitcoms] [Serials, Reality Shows] [Movies, Old Classics, Music Videos] [Documentaries, Infotainment, News] [Sports, News] [Religious Shows]}.


Table 2: TV Shows data after encoding

| ROW ID | Person Age Groups | PAG_CODE (rough set-based) | PAG_CODE (covering set-based) | TV Shows | TVS_CODE (rough set-based) | TVS_CODE (covering set-based) |
|---|---|---|---|---|---|---|
| ROW1 | {Toddler} | 10000 | 10000 | {Baby Shows} | 1000000 | 100000 |
| ROW2 | {Kid, Pre-teen, Teenage, Toddlers} | 11000 | 11000 | {Cartoons, Educational} | 1000000 | 100000 |
| ROW3 | {Pre-teen, teenage} | 01000 | 01000 | {Teenage comedy shows} | 0100000 | 110000 |
| ROW4 | {Pre-teen, Teenage, Young Adult} | 01100 | 01100 | {Rom-com, Sitcom} | 0100000 | 110000 |
| ROW5 | {Young Adult, Adult} | 00100 | 01100 | {Sitcom, Serials, Music videos} | 0111000 | 011100 |
| ROW6 | {Pre-teen, Teenage, Young Adult, Adult, Young Middle Aged, Middle Aged, V. Young SC} | 01111 | 01110 | {Movies, Documentaries} | 0001100 | 001110 |
| ROW7 | {Pre-teen, Teenage, Young Adult, Adult, Young Middle Aged} | 01110 | 01110 | {Reality Shows} | 0010000 | 011000 |
| ROW8 | {Adult, Young Middle Aged, Middle Aged, V. Young SC, Young SC, SC, Old SC} | 00111 | 00111 | {News} | 0000100 | 000110 |
| ROW9 | {Kid, Pre-teen, Teenage, Young Adult, Young Middle Aged, Middle Aged, V. Young SC, Young SC} | 11111 | 11110 | {Sports Shows, Infotainment} | 0000010 | 000111 |
| ROW10 | {V. Young SC, Young SC, SC, Old SC} | 00001 | 00011 | {Old classics, Serials} | 0011000 | 011100 |
| ROW11 | {SC, Old SC} | 00001 | 00001 | {Religious Shows, Serials} | 0010001 | 011001 |


5. ENCODING USING COVERING-BASED SET

Let the domain and the covering set of the multivalued attribute “a” be represented by Dac and Rac respectively. Let Kc be an arbitrary value of this multivalued attribute for which the encoding is to be done.

Let the encoded data be represented by Ec = e1c e2c … enc, where n is the number of covering sets, and eic (i = 1, 2, …, n) = 1 if Kc partly or wholly belongs to the covering set Zi of Rac; = 0 otherwise.

Consider the multivalued attribute “Person Age Groups” in the relational database depicted in Table 1.

DPAGc = {Toddler, Kid, Pre-teen, Teenage, Young Adult, Adult, Young Middle Aged, Middle Aged, Very Young Senior Citizen, Young Senior Citizen, Senior Citizen, Old Senior Citizen}

RPAGc = {Z1c Z2c Z3c Z4c Z5c}={[Toddler, Kid][Kid, Pre-teen, Teenage, Young Adult] [Young Adult, Adult, Young Middle Aged] [Young Middle Aged, Middle Aged, Very Young Senior Citizen, Young Senior Citizen, Senior Citizen, Old Senior Citizen] [Senior Citizen, Old Senior Citizen]}

To encode the arbitrary value Kc = {Kid, Pre-teen, Teenage, Toddlers} of the multivalued attribute “Person Age Groups” in ROW2, Kc is compared with each covering set of RPAGc. Kc intersects the first two covering sets, so the first two bits are 1 and the remaining bits are 0. The PAG_CODE is therefore 11000, as shown in Table 2.
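The same comparison can be sketched for the covering RPAGc. The Python fragment below first checks that the family is a cover in the sense of Definition 2 and then encodes a few rows of Table 1. The names `is_cover`, `encode`, `D_PAG` and `R_PAG_C` are hypothetical and chosen only for this illustration.

```python
def is_cover(family, universe):
    """Definition 2: no block is empty and the union of the blocks is the universe."""
    return all(family) and set().union(*family) == universe


def encode(value, family):
    """Bit string with '1' at position i if value intersects the i-th covering set."""
    return "".join("1" if value & z else "0" for z in family)


D_PAG = {
    "Toddler", "Kid", "Pre-teen", "Teenage", "Young Adult", "Adult",
    "Young Middle Aged", "Middle Aged", "Very Young Senior Citizen",
    "Young Senior Citizen", "Senior Citizen", "Old Senior Citizen",
}

# Covering sets Z1..Z5; neighbouring sets overlap, unlike equivalence classes.
R_PAG_C = [
    {"Toddler", "Kid"},
    {"Kid", "Pre-teen", "Teenage", "Young Adult"},
    {"Young Adult", "Adult", "Young Middle Aged"},
    {"Young Middle Aged", "Middle Aged", "Very Young Senior Citizen",
     "Young Senior Citizen", "Senior Citizen", "Old Senior Citizen"},
    {"Senior Citizen", "Old Senior Citizen"},
]

row5 = {"Young Adult", "Adult"}
row9 = {"Kid", "Pre-teen", "Teenage", "Young Adult", "Young Middle Aged",
        "Middle Aged", "Very Young Senior Citizen", "Young Senior Citizen"}

code5 = encode(row5, R_PAG_C)
code9 = encode(row9, R_PAG_C)
print(is_cover(R_PAG_C, D_PAG), code5, code9)
```

For ROW5 the overlap between Z2 and Z3 on "Young Adult" sets one more bit than the partition-based code does, which is the effect the covering encoding relies on.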

Similarly, DTV_Shows = {Baby Shows, Cartoons, Educational, Teenage Comedy Shows, Rom Com, Sitcom, Serials, Music Videos, Movies, Documentaries, Reality Shows, News, Sports Shows, Infotainment, Old Classics, Religious Shows}.

RTVSc = {[Baby Shows, Cartoons, Educational, Teenage Comedy Shows, Rom com, Sitcom] [Teenage Comedy Shows, Rom com, Sitcom, Serials, Reality Shows] [Serials, Reality Shows, Movies, OldClassics, MusicVideos] [Movies, OldClassics, MusicVideos, Documentaries, Infotainment, News] [Documentaries, Infotainment, News, SportShows] [SportShows, ReligiousShows]}


6. QUERYING DATA USING THE ENCODED VALUES

Let us consider the target data A = {Young Adult, Adult, Middle Aged}, A ⊆ DPAG. We determine all the TV Shows that apply to one or more of the age groups in the target data A.

Similarly, taking the target data B = {Teenage Comedy Shows, Serials}, B ⊆ DTVS, we find the corresponding Person Age Groups. For target sets similar to A and B, we calculate retrievals using the rough and the second type covering encodings separately.

6.1 Algorithm for Rough Set

1. Establish multivalued attributes a and b.
2. Establish a target set X and set increment i to 1.
3. For each x ∈ X do
4. While i is less than or equal to the number of EQUIVALENCE CLASSES
4.1. Establish Y_i ∈ R_a.
4.2. If ({x} ∩ Y_i) is not equal to ∅
4.2.1. Add Y_i to the upper approximation of X.
5. Establish a retrieved data set Result.
6. Set increment k to 1 and set increment j to 1.
7. While k is less than or equal to the number of tuples
7.1. Establish E_k ∈ a_code.
7.2. While j is less than or equal to the number of EQUIVALENCE CLASSES
7.2.1. Establish e_j ∈ E_k.
7.2.2. Set increment l to 1.
7.2.3. While l is less than or equal to the number of terms in the upper approximation of X
7.2.3.1. Establish Y_i ∈ the upper approximation of X.
7.2.3.2. For each Y_i do
7.2.3.2.1. If (j equals i) and (e_i equals 1)
7.2.3.2.1.1. Add b_k to Result.


6.2 Algorithm for 2nd Type Covering-Based Rough Set

1. Establish multivalued attributes a and b.
2. Establish a target set X.
3. Set increment i to 1.
4. For each x ∈ X do
5. While i is less than or equal to the number of COVER_SET
5.1. Establish Y_i ∈ R_a.
5.2. If ({x} ∩ Y_i) is not equal to ∅
5.2.1. Add Y_i to the upper approximation of X.
6. Establish a retrieved data set Result.
7. Set increment k to 1.
8. Set increment j to 1.
9. While k is less than or equal to the number of tuples
9.1. Establish E_k ∈ a_code.
9.2. While j is less than or equal to the number of COVER_SET
9.2.1. Establish e_j ∈ E_k.
9.2.2. Set increment l to 1.
9.2.3. While l is less than or equal to the number of terms in the upper approximation of X
9.2.3.1. Establish Y_i ∈ the upper approximation of X.
9.2.3.2. For each Y_i do
9.2.3.2.1. If (j equals i) and (e_i equals 1)
9.2.3.2.1.1. Add b_k to Result.
9.2.3.2.1.2. Goto 8.

Steps 1 to 5 of the covering-based algorithm (steps 1 to 4 of the rough set algorithm) give the upper approximation for the target set X. The remaining steps retrieve the possible data.
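The two phases described above can be sketched compactly in Python by treating each code as a bit mask. This is an illustrative reading of the algorithms rather than the authors' implementation: the upper approximation of the target set becomes a mask with one bit per covering set, and a tuple is retrieved when its code shares at least one set bit with that mask. The names `upper_mask`, `query`, `COVER` and `PAG_CODES` are hypothetical; the codes are the covering set-based PAG codes of Table 2.

```python
def upper_mask(target, family):
    """Phase 1: mark every block of the family that intersects the target set."""
    bits = "".join("1" if target & block else "0" for block in family)
    return int(bits, 2)


def query(codes, mask):
    """Phase 2: return the row ids whose code has a 1 where the mask has a 1."""
    return [row for row, code in codes.items() if int(code, 2) & mask]


# Covering sets Z1..Z5 of "Person Age Groups" (Section 5).
COVER = [
    {"Toddler", "Kid"},
    {"Kid", "Pre-teen", "Teenage", "Young Adult"},
    {"Young Adult", "Adult", "Young Middle Aged"},
    {"Young Middle Aged", "Middle Aged", "Very Young Senior Citizen",
     "Young Senior Citizen", "Senior Citizen", "Old Senior Citizen"},
    {"Senior Citizen", "Old Senior Citizen"},
]

# Covering set-based PAG codes copied from Table 2.
PAG_CODES = {
    "ROW1": "10000", "ROW2": "11000", "ROW3": "01000", "ROW4": "01100",
    "ROW5": "01100", "ROW6": "01110", "ROW7": "01110", "ROW8": "00111",
    "ROW9": "11110", "ROW10": "00011", "ROW11": "00001",
}

target_a = {"Young Adult", "Adult", "Middle Aged"}
mask = upper_mask(target_a, COVER)      # bits for Z2, Z3 and Z4
rows = query(PAG_CODES, mask)
print(format(mask, "05b"), rows)
```

A single bitwise AND per tuple replaces the nested loops over blocks, which is one way to read the efficiency argument for the binary encoding.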

Consider A

1. The rough encoding upper approximation of A is given by {Y3, Y4}.

2. The second type covering-based upper approximation of A is given by {Z2, Z3, Z4}.

Now, putting 1s at the 2nd and/or 3rd and/or 4th positions of the covering encoding data {ec1, ec2, ec3, ec4, ec5}, we get {01000}, {00100}, {00010}, {01100}, {00110}, {01010} and {01110}. All tuples whose covering encoding has a 1 in at least one of these positions are retrieved. For example, from Table 1, ROW2, ROW3, ROW4, ROW5, ROW6, ROW7, ROW8, ROW9 and ROW10 are fetched. Here, the total number of tuples is 9.

Consider B

1. The rough encoding upper approximation of B is given by {Y2, Y3}.

Now, putting 1s at the 2nd and/or 3rd positions of the rough encoding data {e1, e2, e3, e4, e5, e6, e7}, we get {0100000}, {0010000} and {0110000}. All tuples whose rough encoding has a 1 in at least one of these positions are retrieved. For example, from Table 1, ROW3, ROW4, ROW5, ROW7, ROW10 and ROW11 are fetched. Here, the total number of tuples is 6.

2. The second type covering-based upper approximation of B is given by {Z1, Z2, Z3}.

Now, putting 1s at the 1st and/or 2nd and/or 3rd positions of the covering encoding data {ec1, ec2, ec3, ec4, ec5, ec6}, we get {100000}, {010000}, {001000}, {110000}, {011000}, {101000} and {111000}. All tuples whose covering encoding has a 1 in at least one of these positions are retrieved. For example, from Table 1, ROW1, ROW2, ROW3, ROW4, ROW5, ROW6, ROW7, ROW10 and ROW11 are fetched. Here, the total number of tuples is 9.

7. PERFORMANCE ANALYSIS


In order to fetch the maximum data from the table for a target X, we calculate upper approximations for both the rough and the second type covering encodings and compare the number of retrievals. We take 10 such target sets and plot them against the number of retrievals, as shown in Figure 1 and Figure 2.

The sample target data set taken for Figure 1:

{YoungAdult, Adult, MiddleAged}, {Pre-teen, Adult} , {V.YoungSC, OldSC}, {YoungMiddleAged, OldSC}, {SeniorSC, OldSC}, {Toddler, Kid, YoungAdult},{Kid, YoungMiddleAged},{Kid, Pre-teen}, {YoungMiddleAged, MiddleAged}, {Toddler, OldSC}.

The sample target data set taken for Figure 2:

{TeenageComedyShows, Serials}, {RealityShows, OldClassics}, {BabyShows, Cartoons}, {ReligiousShows}, {RomCom, Movies}, {MusicVideos, News, SportShows}, {ReligiousShows, Serials}, {OldClassics, Movies, Documentaries}, {SportShows, News}, {Educational, Infotainment}.
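The comparison behind the two figures can be approximated with a short Python sketch that counts retrievals for a target set under each encoding. It is a rough illustration under stated assumptions: the age groups are abbreviated with hypothetical labels (T, K, PT, TE, YA, A, YMA, MA, VYSC, YSC, SC, OSC), the codes are the PAG codes of Table 2, only the first three target sets for Figure 1 are used, and a tuple counts as retrieved when its code shares a set bit with the upper approximation mask.

```python
# Equivalence classes Y1..Y5 and covering sets Z1..Z5, with abbreviated labels.
ROUGH = [{"T", "K"}, {"PT", "TE"}, {"YA", "A"}, {"YMA", "MA"},
         {"VYSC", "YSC", "SC", "OSC"}]
COVER = [{"T", "K"}, {"K", "PT", "TE", "YA"}, {"YA", "A", "YMA"},
         {"YMA", "MA", "VYSC", "YSC", "SC", "OSC"}, {"SC", "OSC"}]

# PAG codes of ROW1..ROW11 from Table 2.
ROUGH_CODES = ["10000", "11000", "01000", "01100", "00100", "01111",
               "01110", "00111", "11111", "00001", "00001"]
COVER_CODES = ["10000", "11000", "01000", "01100", "01100", "01110",
               "01110", "00111", "11110", "00011", "00001"]


def mask(target, family):
    """Upper approximation of the target as a bit mask over the family."""
    return int("".join("1" if target & b else "0" for b in family), 2)


def retrievals(target, family, codes):
    """Number of tuples whose code has a 1 in a position marked by the mask."""
    m = mask(target, family)
    return sum(1 for code in codes if int(code, 2) & m)


targets = [{"YA", "A", "MA"}, {"PT", "A"}, {"VYSC", "OSC"}]
counts = [(retrievals(t, ROUGH, ROUGH_CODES), retrievals(t, COVER, COVER_CODES))
          for t in targets]

for t, (n_rough, n_cover) in zip(targets, counts):
    print(sorted(t), "rough:", n_rough, "covering:", n_cover)
```

Each pair holds the rough count followed by the covering count, so the pairs can be plotted side by side in the manner of Figure 1.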


Figure 2: Using TV Shows to retrieve Person Age Groups

From Figure 1 and Figure 2, we see that querying data using second type covering increases the number of possibilities, irrespective of the covering set chosen.

8. CONCLUSIONS


REFERENCES

[1] Beaubouef, T., 1994, 'Uncertainty processing in a relational database model via a rough set representation', PhD dissertation, University Microfilms International, A Bell & Howell Information Company, pp. 67-76.

[2] Beaubouef, T., Petry, F. and Buckles, B., 1995, 'Extension of the relational database and its algebra with rough set techniques', Computational Intelligence, vol. 11, no. 2, pp. 233-245.

[3] Beaubouef, T., Petry, F. and Arora, G., 1998, 'Information theoretic measures of uncertainty for rough sets and rough relational databases', Information Sciences, vol. 109, pp. 185-195.

[4] Cao, F. and Liang, J., 2004, 'The Rough Data Query Based on SQL Language', Computer Science, vol. 31, no. 2.

[5] Hu, Xing lei, Hong, Xiaoguang and Yuan, Yu, 2007, 'A high efficiency approach to querying rough data', Fourth International Conference on Fuzzy Systems and Knowledge Discovery.

[6] Nakata, M. and Murai, T., 2001, 'Data Dependencies over Rough Relational Expressions', IEEE Intl. Fuzzy Systems Conf., pp. 1543-1546.

[7] Pawlak, Z., 1982, 'Rough Sets', International Journal of Computer and Information Sciences, vol. 11, no. 5, pp. 341-356.

[8] Pawlak, Z., 1991, 'Rough sets - Theoretical aspects of reasoning about data', Dordrecht: Kluwer Academic Publishers, pp. 68-162.

[9] Qiusheng, A., Wang, G., Shen, J. and Xu, J., 2003, 'Querying Data from RRDB Based on Rough Sets Theory', Springer-Verlag, pp. 342-345.

[10] Qiusheng, A., Yusheng, Z. and Wenxiu, Z., 2005, 'The study of rough relational database based on granular computing', IEEE International Conference on Granular Computing, vol. 1, pp. 108-111.

[11] Tripathy, B.K. and Patro, V.M., 2009, 'Covering Based Rough set approach to uncertainty management in databases', IIM Ahmadabad.

[12] Wei, Ling-ling and Zhang, Z., 2009, 'A method for rough relational database transformed into relational database', IITA International Conference on Services Science, Management and Engineering.