Rough Text Model for SMS Spam Detection


International Journal of Advanced Engineering Science and Technological Research (IJAESTR) ISSN: 2321-1202, www.aestjournal.org @2015 All rights reserved.


Richa Arora

M.Tech(IS) student, COE department, NSIT, Delhi University [email protected]

Abstract: Spamming has become one of the most difficult problems to tackle on the web. Spam refers to an unsolicited or unwanted message, text or pop-up being displayed. Email spam is the form with which most people are familiar. Spam has also spread on the web in the form of links in search results, and nowadays it is common in mobile networks on a large scale. Many techniques have been implemented to detect SMS spam, but it remains very hard to confirm whether an incoming message is spam or not. In this paper, we apply Rough Set Theory to SMS spam detection using a newly introduced Decision System. Since the task involves Text Mining and Rough Sets are applied to it, the approach is named the Rough Text Model for SMS Spam Detection.

Keywords: Rough sets, spam detection, Rough Text, SMS spamming, spam, sms spam.

I. INTRODUCTION

Spam refers to an unsolicited and unwanted message sent to a large number of recipients, for most of whom the message is irrelevant. Although the sender does not have the recipients' permission to send such messages, he sends them for his own benefit. These messages are entirely outside the user’s interest. Hence an approach is required that groups such messages together using decision rules which decide whether a message is relevant or not, i.e. ham or spam respectively.

Clustering is the process of partitioning a data set of n points in an m-dimensional space into k clusters such that the elements within a cluster are indiscernible from one another while elements of different clusters are discernible, i.e. similar elements are grouped into a single set called a cluster. Generally, data clustering is done using a centroid method or a Euclidean distance method. These can be used when the information is exact and complete: exact in the sense that the sets of attribute values have sharp boundaries, and complete in the sense that no attribute value is missing for any object. Today, knowledge bases often contain obscure facts, missing values and erroneous data. Rough Set Theory (RST), introduced by Z. Pawlak in [1], was developed to deal with such data.

“A vague fact may be more perfidious than erroneous reasoning”, wrote Paul Valery [2]. That is, bugs in reasoning can be identified and rectified, but a vague fact cannot be rectified, since it is not known what is missing from it or what can be drawn from it; if we combine it with certain facts, everything is put in doubt. RST is a mathematical approach for dealing with obscure facts which combine with correct ones to create imperfect knowledge. It is applied to facts which cannot be ignored, because exactness lies amid their obscurity. Imperfect data includes not only vague facts but also data with missing values, which must be dealt with as well. Clustering can be done using RST by discerning the objects and categorizing them based on several approximations. The resulting categories are named clusters, and unlabelled objects can be categorized and labelled based on their similarity to these clusters, which can be computed using RST.

Section II discusses the rudiments of RST. Section III explains the Rough Text Model as given in [7] and the proposed Augmented Rough Text Model. Section IV gives the Experimental Results as applied on the SMS Spam Collection dataset taken from the UCI repository.

II. RST BASICS

An Information System is represented as $IS = (U, C)$, where $U$ is the universal set of objects and $C$ is a set of condition attributes. Here, we deal with a Decision System, which is represented as $DS = (U, C \cup \{d\})$, where $d$ is a decision attribute. An Indiscernibility Relation is defined on a subset $B \subseteq C$ as

$$IND(B) = \{(x, y) \in U \times U : a(x) = a(y) \ \forall a \in B\},$$

where $a(x)$ is the value of object $x$ for attribute $a$. The set $U$ is partitioned into different sets based on the decision classes of the decision attribute, and the equivalence classes are obtained based on $B$. Let there be $k$ decision classes, $d_1, d_2, \ldots, d_k$. The equivalence classes based on the decision attribute $d$ are represented as $[x]_d$. Clearly, $[x]_d$ is a subset of $U$. Let $[x]_d$ be denoted as $X$, i.e. $X = [x]_d$. Let the equivalence classes obtained from the Indiscernibility Relation be denoted by $[x]_B$.
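As an illustration of the partition induced by the Indiscernibility Relation, the following minimal Python sketch groups objects that agree on every attribute in a chosen subset B. The toy table, the object names s1 to s5 and the helper name `partition` are assumptions made for this example; they are not taken from the dataset used in this paper.

```python
from collections import defaultdict


def partition(universe, attributes):
    """Group objects that share the same values on `attributes`.

    `universe` maps an object id to a dict of attribute values.
    Returns a list of equivalence classes (sets of object ids).
    """
    blocks = defaultdict(set)
    for obj, row in universe.items():
        key = tuple(row[a] for a in attributes)
        blocks[key].add(obj)
    return list(blocks.values())


# Toy decision system (hypothetical values, for illustration only).
U = {
    "s1": {"sender_known": 1, "free": 0, "spam": 0},
    "s2": {"sender_known": 0, "free": 1, "spam": 1},
    "s3": {"sender_known": 0, "free": 1, "spam": 1},
    "s4": {"sender_known": 0, "free": 0, "spam": 0},
    "s5": {"sender_known": 0, "free": 0, "spam": 1},
}

B = ["sender_known", "free"]
classes_B = partition(U, B)         # equivalence classes w.r.t. B
classes_d = partition(U, ["spam"])  # decision classes
```

Here s4 and s5 are indiscernible on B although they carry different decisions, which is the situation the approximations of the next subsection are designed to handle.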

A. Approximations and Regions [20]

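Lower Approximation of $X$ w.r.t. $B$ is defined as the set of all the elements which surely belong to $X$. Mathematically,

$$\underline{B}X = \{x \in U : [x]_B \subseteq X\},$$

where $\underline{B}X$ is the lower approximation of $X$ w.r.t. $B$ and $X \subseteq U$.

Upper Approximation of $X$ w.r.t. $B$ is defined as the set of all the elements which possibly belong to $X$. Mathematically,

$$\overline{B}X = \{x \in U : [x]_B \cap X \neq \emptyset\},$$

where $\overline{B}X$ is the upper approximation of $X$ w.r.t. $B$ and $X \subseteq U$.

Accuracy measure of $X$ w.r.t. $B$ is defined as the quotient of the cardinalities of the lower approximation and the upper approximation.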


Positive Region is defined as the union of the lower approximations of all the decision classes.

Boundary Region is defined as the set of elements which cannot be classified with certainty as belonging to $X$ or not. Mathematically, it is the set difference of the upper and lower approximations,

$$BN_B(X) = \overline{B}X \setminus \underline{B}X,$$

where $BN_B(X)$ is the boundary region of $X$ w.r.t. $B$ and $\setminus$ is the set difference operator.
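The definitions of this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the toy table and the function names `partition`, `approximations` and `positive_region` are hypothetical and are not part of ROSE2 or of the dataset used later.

```python
from collections import defaultdict


def partition(universe, attributes):
    """Equivalence classes of objects that agree on `attributes`."""
    blocks = defaultdict(set)
    for obj, row in universe.items():
        blocks[tuple(row[a] for a in attributes)].add(obj)
    return list(blocks.values())


def approximations(universe, attributes, target):
    """Return (lower, upper) approximations of `target` w.r.t. `attributes`."""
    lower, upper = set(), set()
    for block in partition(universe, attributes):
        if block <= target:   # block lies entirely inside the target set
            lower |= block
        if block & target:    # block overlaps the target set
            upper |= block
    return lower, upper


def positive_region(universe, attributes, decision):
    """Union of the lower approximations of all decision classes."""
    region = set()
    for decision_class in partition(universe, [decision]):
        lower, _ = approximations(universe, attributes, decision_class)
        region |= lower
    return region


# Toy decision system (hypothetical values, for illustration only).
U = {
    "s1": {"sender_known": 1, "free": 0, "spam": 0},
    "s2": {"sender_known": 0, "free": 1, "spam": 1},
    "s3": {"sender_known": 0, "free": 1, "spam": 1},
    "s4": {"sender_known": 0, "free": 0, "spam": 0},
    "s5": {"sender_known": 0, "free": 0, "spam": 1},
}
B = ["sender_known", "free"]
X = {obj for obj, row in U.items() if row["spam"] == 1}

lower, upper = approximations(U, B, X)
boundary = upper - lower                 # upper minus lower approximation
accuracy = len(lower) / len(upper)       # assumes a non-empty upper approximation
positive = positive_region(U, B, "spam")
```

In this toy table s4 and s5 share the same condition values but differ in the decision, so both fall into the boundary region and the accuracy of the spam class drops below 1.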

B. Rough Set Definability

TABLE I

ROUGH SET DEFINABILITY

$\underline{B}X = \emptyset$   $\overline{B}X = U$   Type
NO                             NO                    Roughly B-definable
NO                             YES                   Externally B-Undefinable
YES                            NO                    Internally B-Undefinable
YES                            YES                   Totally B-Undefinable

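$X$ is Roughly B-definable means that it can be decided for some elements of $U$ that they belong to $X$ and for some that they belong to its complement $U \setminus X$.

$X$ is Internally B-Undefinable means that it can be decided for some elements that they belong to $U \setminus X$, but it is undecidable for any element whether it belongs to $X$.

$X$ is Externally B-Undefinable means that it can be decided for some elements that they belong to $X$, but it is undecidable for any element whether it belongs to $U \setminus X$.

$X$ is Totally B-Undefinable means that it is undecidable for any element of $U$ whether it belongs to $X$ or to $U \setminus X$.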
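The four cases of Table I can be expressed as a small decision function. The sketch below assumes that the two YES/NO columns of Table I record whether the lower approximation is empty and whether the upper approximation equals the whole universe; the function name `definability` and the extra "crisp" case (equal approximations, which Table I does not list) are additions made for this illustration.

```python
def definability(lower, upper, universe):
    """Classify a set by its lower and upper approximations (cf. Table I)."""
    lower, upper, universe = set(lower), set(upper), set(universe)
    if lower == upper:
        # Not a rough set at all: both approximations coincide.
        return "B-definable (crisp)"
    lower_empty = not lower
    upper_is_universe = upper == universe
    if not lower_empty and not upper_is_universe:
        return "roughly B-definable"
    if not lower_empty and upper_is_universe:
        return "externally B-undefinable"
    if lower_empty and not upper_is_universe:
        return "internally B-undefinable"
    return "totally B-undefinable"


universe = {"a", "b", "c"}
kind = definability({"a"}, {"a", "b"}, universe)
```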

C. Rough Set Membership

For classical sets, the membership of an element is black and white, i.e. an element either belongs to a set or does not. In other words, the membership is binary, where 1 represents that the element belongs to the set and 0 that it does not; the membership is defined by the characteristic function. The view is different in RST. The Rough Membership function quantifies the degree of overlap between the set $X$ and the equivalence class $[x]_B$ of an object $x$ w.r.t. the set of conditions $B$. The Rough Membership function is given by:

$$\mu_X^B(x) = \frac{|[x]_B \cap X|}{|[x]_B|} \qquad (1)$$

The lower and upper approximations can be computed by taking an arbitrary rough membership value (say, $\pi \in (1/2, 1]$) as a threshold value, as follows:

$$\underline{B}_{\pi}X = \{x \mid \mu_X^B(x) \geq \pi\} \qquad (2)$$

$$\overline{B}_{\pi}X = \{x \mid \mu_X^B(x) > 1 - \pi\} \qquad (3)$$

With $\pi = 1$ these reduce to the classical lower and upper approximations. The Rough Membership formulae help to generalize the approximations and can be utilized to find the similarity measure, which will be discussed later in the paper.
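A short Python sketch may make the role of the threshold concrete. It computes the rough membership of every object and then the threshold-based approximations. Two assumptions are made here and should be read as such: the lower approximation uses the non-strict comparison (membership at least π), so that π = 1 recovers the classical lower approximation, and the toy table and function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def rough_membership(universe, attributes, target):
    """Map each object x to |[x]_B ∩ X| / |[x]_B| as an exact fraction."""
    blocks = defaultdict(set)
    for obj, row in universe.items():
        blocks[tuple(row[a] for a in attributes)].add(obj)
    mu = {}
    for block in blocks.values():
        value = Fraction(len(block & target), len(block))
        for obj in block:
            mu[obj] = value
    return mu


def threshold_approximations(mu, pi):
    """Lower/upper approximations for a threshold pi in (1/2, 1]."""
    if not (Fraction(1, 2) < pi <= 1):
        raise ValueError("pi must lie in (1/2, 1]")
    lower = {x for x, m in mu.items() if m >= pi}     # assumed non-strict
    upper = {x for x, m in mu.items() if m > 1 - pi}
    return lower, upper


# Toy table (hypothetical): three equivalence classes on (sender_known, call).
U = {
    "s1": {"sender_known": 1, "call": 0},
    "s2": {"sender_known": 0, "call": 1},
    "s3": {"sender_known": 0, "call": 1},
    "s4": {"sender_known": 0, "call": 0},
    "s5": {"sender_known": 0, "call": 0},
    "s6": {"sender_known": 0, "call": 0},
    "s7": {"sender_known": 0, "call": 0},
}
X = {"s2", "s3", "s4", "s5", "s6"}  # the target set, e.g. the spam class

mu = rough_membership(U, ["sender_known", "call"], X)
classical = threshold_approximations(mu, 1)
relaxed = threshold_approximations(mu, Fraction(3, 4))
```

With π = 1 the class {s4, s5, s6, s7}, three quarters of which lies in X, stays in the boundary; lowering the threshold to 3/4 moves the whole class into the lower approximation.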

III. TEXT MINING EMPLOYING ROUGH SETS

Text mining is the process of extracting knowledge from amorphous text. It is a burgeoning field of Data Mining, and distilling knowledge from large chunks of unstructured and fuzzy text is a challenging task. RST is a methodology that deals with sets with fuzzy boundaries and can thus be applied in Data Mining for Text Mining tasks. The application of RST in Text Mining is referred to as Rough Text in [7]. The intent of this paper is to extend the approach of RST used in [7] from a complete Information System to an incomplete Information System of classified documents, in order to classify and label new documents and categorize them into the given clusters. If any of the documents do not relate to the given clusters, new clusters will be formed correspondingly.

ROUGH TEXT Model in [7]

The Rough Text model as given in [7] for document clustering is a Decision System with $m$ documents as the objects, $U = \{D_1, D_2, \ldots, D_m\}$, $n$ terms as the set of attributes, $\{T_1, T_2, \ldots, T_n\}$, and cluster as the decision attribute $d$. The attribute values are the term frequencies, such that $tf(T_i, D_j)$ is the term frequency of the $i$th term in the $j$th document. Term frequency is the number of times the term appears in the document divided by the maximum number of times any term appears in the document.

Mathematically, it is given by:

$$tf(T_i, D_j) = \frac{f(T_i, D_j)}{\max_{l} f(T_l, D_j)}$$

where $f(T_i, D_j)$ is the number of times the $i$th term occurs in the $j$th document.

Here, each document is categorized into a specific cluster given by the decision attribute cluster. Each cluster is such that:

$C_k$ = $k$th cluster to which a document is categorized, $C_k \in$ cluster
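The term frequency described above can be sketched as follows. The sketch assumes whitespace tokenization after lower-casing and takes the maximum over all tokens in the document; neither choice is specified in [7] or in this paper, and the function name `term_frequencies` is hypothetical.

```python
from collections import Counter


def term_frequencies(document, terms):
    """Count of each term divided by the highest count of any token in the document."""
    counts = Counter(document.lower().split())
    max_count = max(counts.values(), default=0)
    if max_count == 0:  # empty document: every frequency is zero
        return {term: 0.0 for term in terms}
    return {term: counts[term] / max_count for term in terms}


tf = term_frequencies("free call free prize call free", ["free", "call", "win"])
```

In the example the token "free" occurs three times, which is the maximum, so its term frequency is 1, "call" gets two thirds and the absent term "win" gets 0.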

TABLE II

ROUGH TEXT MODEL FROM [19]

O     T1          T2    ...   Tn          Cluster
D1    tf(T1,D1)   ...   ...   tf(Tn,D1)   C1
D2    tf(T1,D2)   ...   ...   ...         C2
...   ...         ...   ...   ...         ...
Dm    tf(T1,Dm)   ...   ...   tf(Tn,Dm)   Ck

For our consideration, the Rough Text model of [7] is augmented to a Decision System with the attribute values in the form of ranges of the term frequencies.

A. Augmented ROUGH TEXT Model

We propose a model which contains not only the terms but also some additional attributes of the SMS that help determine whether it is spam or not. To keep it simple, we take only the presence or absence of an attribute as its value. The values are thus binary, 0 or 1. A value of 1 designates that an attribute is present, e.g. that a term is present, or stands for yes for other attributes: for sender_known, it shows that the sender of the SMS is known; for web_msg, it shows that the SMS is a web message; and so on. Thus the decision depends not only on the text of the SMS but on many other important attributes too.

In this augmented model, the Decision System is $DS = (U, C \cup \{d\})$. Here, each attribute’s domain is binary, i.e. either 0 or 1. The decision attribute $d$ is Spam. Its value shows whether a message is spam or not: value 1 for Spam shows that the message is spam and 0 shows that the message is ham.

TABLE III Proposed Decision system

S     sender_known   ...   T1    T2    ...   Tn    Spam
S1    1              ...   1     ...   ...   ...   0
S2    1              ...   0     ...   ...   ...   1
...   0              ...   0     ...   ...   ...   1
...   ...            ...   ...   ...   ...   ...   ...
Sm    ...            ...   ...   ...   ...   ...   ...
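To make the binary encoding concrete, the sketch below turns one SMS into a row of such a decision table. It is only an illustration: the paper does not state how sender_known or web_msg were determined, so the contact-list lookup, the URL pattern, the reduced term list and the function name `encode_sms` are all assumptions of this example.

```python
import re

# A few of the term attributes used later in the paper (subset, for brevity).
TERMS = ["thanks", "win", "free", "sorry", "urgent", "call"]


def encode_sms(text, sender, contacts):
    """Encode an SMS as a dict of binary condition attributes."""
    lowered = text.lower()
    tokens = set(re.findall(r"[a-z]+", lowered))
    row = {
        # Assumption: a sender is "known" if it appears in the contact list.
        "sender_known": int(sender in contacts),
        # Assumption: a web message is one containing a URL-like substring.
        "web_msg": int(bool(re.search(r"https?://|www\.", lowered))),
    }
    for term in TERMS:
        row[term] = int(term in tokens)  # 1 if the term is present, else 0
    return row


row = encode_sms(
    "URGENT! Call now to win a FREE prize www.example.com",
    sender="+440000000000",
    contacts={"+441111111111"},
)
```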

IV. EXPERIMENTAL RESULTS

The SMS Spam Collection dataset from the UCI Machine Learning Repository is used. This dataset is converted into .isf format, in the Decision System form shown in Section III, and the ROSE2 software is run on it for the RST implementation.

The dataset used here consists of 800 SMS messages. The attributes of the decision system are shown in Table IV below.

TABLE IV ATTRIBUTES

S.No.  Attribute Name   Attribute Type  Domain         Explanation
1      sender_known     Condition       Boolean {0,1}  Whether sender is known or not.
2      web_msg          Condition       Boolean {0,1}  Whether web message (or link) is present or not.
3      long_string      Condition       Boolean {0,1}  Whether a long string containing alphanumeric data or symbols is present or not.
4      thanks           Condition       Boolean {0,1}  Whether term “thanks” is present or not.
5      congratulations  Condition       Boolean {0,1}  Whether term “congratulations” or related verb forms are present or not.
6      win              Condition       Boolean {0,1}  Whether term “win” is present or not.
7      free             Condition       Boolean {0,1}  Whether term “free” is present or not.
8      sorry            Condition       Boolean {0,1}  Whether term “sorry” is present or not.
9      urgent           Condition       Boolean {0,1}  Whether term “urgent” is present or not.
10     private          Condition       Boolean {0,1}  Whether term “private” is present or not.
11     please           Condition       Boolean {0,1}  Whether term “please” is present or not.
12     finally          Condition       Boolean {0,1}  Whether term “finally” is present or not.
13     service          Condition       Boolean {0,1}  Whether term “service” is present or not.
14     offer            Condition       Boolean {0,1}  Whether term “offer” is present or not.
15     great            Condition       Boolean {0,1}  Whether term “great” is present or not.
16     oops             Condition       Boolean {0,1}  Whether term “oops” is present or not.
17     reminder         Condition       Boolean {0,1}  Whether term “reminder” is present or not.
18     call             Condition       Boolean {0,1}  Whether term “call” is present or not.
19     spam             Decision        Boolean {0,1}  Decision whether spam or ham.

The decision system is built with these attributes as a file in .isf format, which was then input to the ROSE2 software. The results are delineated below:

TABLE V ROSE2 APPROXIMATIONS

Class  Number of objects  Lower approximation  Upper approximation  Accuracy
0      674                664                  703                  0.9445
1      126                97                   136                  0.7132
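The figures reported by ROSE2 can be cross-checked with a few lines of arithmetic. The sketch below uses only counts stated in this paper (800 objects, the lower approximations of both classes and the upper approximation of the spam class) together with the standard definition of the quality of classification as the size of the positive region divided by the number of objects; the variable names are chosen for this example.

```python
# Counts taken from Table V and the dataset description.
total_objects = 800
lower_ham = 664                  # lower approximation of class 0
lower_spam, upper_spam = 97, 136  # approximations of class 1

# Accuracy of the spam class: |lower| / |upper|.
accuracy_spam = lower_spam / upper_spam

# Boundary of the spam class: objects that are possibly but not surely spam.
boundary_spam = upper_spam - lower_spam

# Quality of classification: |positive region| / |U|, where the positive
# region is the union of the lower approximations of both classes.
quality = (lower_ham + lower_spam) / total_objects
```

The accuracy evaluates to about 0.7132 and the quality to 0.95125, in line with the values 0.7132 and 0.9513 reported by ROSE2. The boundary holds 39 objects, which equals the number of objects covered by the two approximate rules (30 and 9) in the ROSE2 rule listing.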


TABLE VI ROSE2 CORE VIEWER

TABLE VII REDUCTS

RULES

# ModLEM with Entropy

# C:\Program Files\ROSE2\examples\smsspam800.isf

# objects = 800

# attributes = 19

# decision = spam

# classes = {0, 1}

# Thu May 08 19:42:51 2014

# 0 s

rule 1. (sender_known = 1) & (sorry = 0) => (spam = 0); [656, 656, 97.33%, 100.00%][656, 0]

[{1, 2, 4, 5, 7, 8, 14, 15, 17, 18, 19, 21, 22, 23, 24, 26, 30, 35, 37, 38, 40, 41, 44, 45, 46, 51, 53, 54, 56, 57, 58, 59, 60, 62, 63, 64, 65, 67, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 95, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 116, 117, 119, 120, 123, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 137, 138, 139, 141, 142, 143, 144, 145, 146, 147, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 162, 163, 164, 167, 169, 170, 171, 172, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 190, 191, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 225, 227, 229, 230, 231, 232, 233, 235, 237, 238, 239, 240, 242, 243, 244, 245, 246, 247, 248, 249, 250, 252, 253, 254, 255, 256, 257, 258, 259, 261, 262, 263, 264, 266, 267, 268, 270, 272, 273, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291,

292, 293, 294, 295, 296, 298, 299, 300, 301, 302, 303, 304, 305, 307, 308, 309, 311, 312, 314, 315, 316, 317, 318, 319, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 335, 337, 338, 339, 341, 342, 343, 344, 345, 346, 347, 348, 349, 351, 352, 353, 355, 356, 357, 360, 361, 362, 363, 364, 365, 366, 367, 370, 371, 372, 373, 374, 375, 377, 378, 379, 380, 381, 382, 383, 384, 385, 387, 388, 389, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 417, 418, 420, 422, 424, 426, 427, 428, 429, 430, 431, 432, 434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 457, 458, 459, 460, 461, 462, 463, 466, 467, 468, 469, 470, 471, 473, 474, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 489, 490, 491, 495, 496, 497, 499, 500, 501, 502, 503, 504, 505, 507, 508, 509, 510, 511, 512, 513, 514, 515, 517, 520, 521, 522, 523, 524, 525, 527, 529, 531, 533, 534, 535, 536, 537, 538, 539, 540, 541, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 578, 579, 580, 582, 583, 584, 585, 586, 587, 591, 592, 593, 594, 596, 597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 610, 611, 612, 614, 615, 616, 617, 618, 619, 620, 621, 622, 623, 624, 625, 627, 628, 629, 630, 633, 634, 635, 636, 638, 639, 640, 641, 642, 643, 644, 645, 646, 647, 648, 649, 651, 652, 653, 654, 655, 656, 657, 658, 659, 660, 662, 663, 664, 665, 666, 667, 668, 669, 670, 671, 672, 675, 676, 677, 678, 679, 680, 681, 682, 683, 684, 686, 687, 688, 689, 690, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701, 702, 703, 704, 705, 706, 707, 708, 709, 712, 713, 715, 716, 717, 719, 720, 721, 722, 723, 724, 725, 726, 727, 728, 729, 730, 731, 733, 734, 735, 736, 737, 738, 740, 741, 742, 743, 744, 746, 747, 748, 750, 751, 754, 755, 756, 757, 758, 759, 760, 761, 763, 765, 
766, 768, 770, 771, 772, 773, 774, 775, 776, 777, 778, 779, 780, 781, 782, 783, 784, 786, 787, 788, 791, 792, 793, 794, 795, 796, 797, 799, 800}, {}]

rule 2. (great = 1) => (spam = 0); [12, 12, 1.78%, 100.00%][12, 0]

[{1, 38, 43, 294, 325, 342, 351, 406, 441, 463, 467, 593}, {}]

rule 3. (sender_known = 1) & (offer = 1) => (spam = 0); [3, 3, 0.45%, 100.00%][3, 0]

[{27, 182, 400}, {}]

rule 4. (sender_known = 1) & (call = 1) => (spam = 0); [38, 38, 5.64%, 100.00%][38, 0]

[{75, 76, 81, 82, 86, 130, 133, 138, 149, 173, 177, 206, 227, 248, 289, 290, 314, 340, 341, 388, 398, 422, 444, 460, 465, 494, 496, 521, 541, 567, 575, 587, 681, 689, 703, 727, 765, 769}, {}]

rule 5. (sender_known = 0) & (free = 1) => (spam = 1); [29, 29, 23.02%, 100.00%][0, 29]

[{}, {3, 6, 10, 13, 39, 96, 140, 148, 189, 228, 269, 271, 297, 358, 368, 386, 402, 419, 456, 464, 488, 493, 581, 595, 609, 631, 785, 790, 798}]

rule 6. (sender_known = 0) & (offer = 1) => (spam = 1); [8, 8, 6.35%, 100.00%][0, 8]

[{}, {260, 297, 368, 464, 528, 581, 637, 798}]

rule 7. (sender_known = 0) & (urgent = 1) => (spam = 1); [10, 10, 7.94%, 100.00%][0, 10]

[{}, {13, 32, 36, 68, 122, 168, 425, 526, 718, 764}]

rule 8. (sender_known = 0) & (thanks = 1) => (spam = 1); [2, 2, 1.59%, 100.00%][0, 2]

[{}, {25, 376}]

rule 9. (sender_known = 0) & (sorry = 1) => (spam = 1); [4, 4, 3.17%, 100.00%][0, 4]

[{}, {50, 251, 691, 711}]

rule 10. (long_string = 1) => (spam = 1); [4, 4, 3.17%, 100.00%][0, 4]

[{}, {16, 20, 48, 711}]

rule 11. (service = 1) => (spam = 1); [18, 18, 14.29%, 100.00%][0, 18]

[{}, {33, 39, 61, 94, 140, 160, 166, 189, 269, 369, 376, 416, 423, 595, 661, 739, 749, 753}]

rule 12. (sender_known = 0) & (win = 1) => (spam = 1); [18, 18, 14.29%, 100.00%][0, 18]

[{}, {12, 94, 115, 135, 168, 189, 274, 313, 320, 336, 358, 390, 506, 565, 577, 588, 718, 767}]

QUALITY OF CLASSIFICATION
For all condition attributes: 0.9513
For condition attributes in core: 0.9500

ATTRIBUTES IN CORE
sender_known, long_string, win, free, sorry, urgent, service, offer, great, call

S.no.  Length  Reduct
1      12      sender_known, web_msg, long_string, win, free, sorry, urgent, please, service, offer, great, call
2      12      sender_known, web_msg, long_string, thanks, win, free, sorry, urgent, service, offer, great, call

rule 13. (sender_known = 0) & (call = 1) => (spam = 1); [58, 58, 46.03%, 100.00%][0, 58]

[{}, {9, 10, 13, 28, 29, 31, 32, 33, 34, 36, 39, 52, 55, 61, 66, 94, 115, 118, 121, 122, 124, 160, 168, 189, 241, 260, 297, 320, 334, 386, 390, 423, 425, 433, 456, 464, 472, 493, 506, 526, 528, 532, 565, 581, 589, 590, 595, 632, 650, 661, 691, 711, 718, 732, 762, 764, 785, 798}]

# Approximate rules

rule 14. (sender_known = 0) & (long_string = 0) & (thanks = 0) & (win = 0) & (free = 0) & (sorry = 0) & (urgent = 0) & (service = 0) & (offer = 0) & (call = 0) => (spam = 0) OR (spam = 1); [30, 30, 76.92%, 100.00%][2, 28]

[{11, 49}, {42, 69, 136, 161, 165, 192, 226, 236, 265, 306, 310, 350, 359, 421, 475, 516, 518, 519, 530, 542, 613, 673, 674, 685, 710, 714, 752, 789}]

rule 15. (sender_known = 1) & (sorry = 1) & (offer = 0) & (great = 0) & (call = 0) => (spam = 0) OR (spam = 1); [9, 9, 23.08%, 100.00%][8, 1]

[{47, 193, 224, 234, 354, 492, 498, 745}, {626}]

**END
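The certain rules listed above (rules 1 to 13) can be applied to a new message once it has been encoded as binary attributes. The following Python sketch transcribes those rules as data and returns every decision proposed by a matching rule. It is not how ROSE2 itself classifies objects; treating missing attributes as 0 and reporting conflicts as a set of decisions are assumptions of this illustration.

```python
# Certain rules transcribed from the ROSE2 output: (conditions, decision).
RULES = [
    ({"sender_known": 1, "sorry": 0}, 0),   # rule 1
    ({"great": 1}, 0),                      # rule 2
    ({"sender_known": 1, "offer": 1}, 0),   # rule 3
    ({"sender_known": 1, "call": 1}, 0),    # rule 4
    ({"sender_known": 0, "free": 1}, 1),    # rule 5
    ({"sender_known": 0, "offer": 1}, 1),   # rule 6
    ({"sender_known": 0, "urgent": 1}, 1),  # rule 7
    ({"sender_known": 0, "thanks": 1}, 1),  # rule 8
    ({"sender_known": 0, "sorry": 1}, 1),   # rule 9
    ({"long_string": 1}, 1),                # rule 10
    ({"service": 1}, 1),                    # rule 11
    ({"sender_known": 0, "win": 1}, 1),     # rule 12
    ({"sender_known": 0, "call": 1}, 1),    # rule 13
]


def classify(row):
    """Return the set of decisions (0 = ham, 1 = spam) of all matching rules.

    An empty set means no certain rule fired; a set with both values
    means the matching rules disagree on this message.
    """
    decisions = set()
    for conditions, decision in RULES:
        # Assumption: attributes missing from `row` are treated as 0.
        if all(row.get(attr, 0) == value for attr, value in conditions.items()):
            decisions.add(decision)
    return decisions


spam_like = classify({"sender_known": 0, "free": 1, "call": 1})
ham_like = classify({"sender_known": 1})
undecided = classify({"sender_known": 0})
```

A message from an unknown sender with none of the listed terms matches no certain rule, which corresponds to the condition part of approximate rule 14, where the training data contain both ham and spam.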

V. CONCLUSION

SMS spam detection has also been implemented using other approaches, such as the Naïve Bayes algorithm. An advantage of RST over such approaches is that it yields explicit decision rules, which decide whether a given SMS is spam or not. Thus, we directly obtain the category to which an SMS belongs. This paper shows how SMS spam detection can be implemented using Rough Set Theory.

VI. FUTURE WORK

The implementation was done on a machine with a Core i3 processor. It can be extended to run on a mobile device to predict whether an SMS is spam or not. Further, this work can be extended to build an application for Android or Windows that detects spam SMS.

ACKNOWLEDGEMENT

I would like to thank Dr. Shampa Chakraverty, Prof. and HOD, COE department, NSIT, Delhi University, for imparting knowledge of Rough Set Theory to me. The detailed study of this subject has helped me to work on its application to SMS spam data.

REFERENCES

[1] Zdzisław Pawlak, “Rough Sets”, International Journal of Computer and Information Sciences, 11, 341-356, 1982.

[2] R.Slowinski, “Obituary/Fuzzy Sets and Systems”, in ScienceDirect, 157, 2419-2422, 2006.

[3] Z.Pawlak and R. Slowinski, “Rough Set approach to multi-attribute decision analysis”, European J. Oper. Res. 72, 443-459, 1994.

[4] R. Slowinski and D. Vanderpooten, “A Generalized Definition of Rough Approximations based on Similarity”, IEEE Transactions on Knowledge and Data Engineering, 12(2), 331-336, 2000.

[5] Zdzisław Pawlak and Andrzej Skowron, “Rudiments of Rough Sets”, Information Sciences, 177, 3-27, 2007

[6] R. Raghavan and B. K. Tripathy, “On Some Topological Properties of Rough Sets”, Pelagia Research Library, Advances in Applied Science Research, 2(3), 2011.

[7] Leticia Arco, Rafael Bello, Yaile Caballero, and Rafael Falcon, “Rough Text Assisting Text Mining: Focus on Document Clustering Validity”, Springer, 224, 229-248.

[8] Jan Komorowski, Lech Polkowski and Andrzej Skowron, “Rough Sets: A Tutorial”.

[9] Zbigniew Suraj, “An Introduction to Rough Set Theory and its Applications: A Tutorial”, ICENCO’2004, December 27-30, 2004.

[10] Fuyuan Cao and Jiye Liang, “A Data Labelling Method for Clustering Categorical Data”, in ELSEVIER, ScienceDirect, Expert Systems with Applications 38, 2381-2385, 2011.

[11] Y. Y. Yao, “A Comparative Study of Fuzzy Sets and Rough Sets”, Information Sciences, Vol. 109, No. 1-4, 227-242, 1998.

[12] Noemí Pérez-Díaz, David Ruano-Ordás, José R. Méndez, Juan F. Gálvez, Florentino Fdez-Riverola, “Rough Sets for Spam Filtering: Selecting appropriate decision rules for boundary email classification”, Applied Soft Computing, Vol. 12, 3671-3682, 2012.

[13] Sarah Jane Delany, Mark Buckley, Derek Greene, “SMS Spam Filtering: Methods and Data”, Expert Systems with Applications, Vol. 39, 9899-9908, 2012.