International Journal of Advanced Engineering Science and Technological Research (IJAESTR) ISSN: 2321-1202, www.aestjournal.org @2015 All rights reserved.
Rough Text Model for SMS Spam Detection
Richa Arora
M.Tech(IS) student, COE department, NSIT, Delhi University [email protected]
Abstract: Spamming has become one of the most difficult problems to tackle on the web. Spam refers to an unsolicited or unwanted message, text, or pop-up being displayed. Email spam is the form most people are aware of. Spam has also spread on the web in the form of links in search results, and nowadays it has been introduced into mobile networks on a large scale. Many techniques have been implemented to detect SMS spam, but it is very hard to confirm whether an incoming message is spam or not. In this paper, we apply Rough Set Theory to SMS spam detection using a newly introduced Decision System. Since this is a Text Mining task to which Rough Sets are applied, we call it the Rough Text Model for SMS Spam Detection.
Keywords: Rough sets, spam detection, Rough Text, SMS spamming, spam, SMS spam.
I. INTRODUCTION
Spam refers to an unsolicited and unwanted message sent to a large number of recipients, which makes the message irrelevant to most of them. Although the sender does not have permission to send such messages, he sends them for his own benefit. These messages are entirely outside the user’s interest. Hence an approach is required to cluster messages of this kind together using decision rules that decide whether a message is relevant or not, i.e. ham or spam respectively.
Clustering is the process of partitioning a data set of n points into k clusters in an m-dimensional space such that the elements of a cluster are indiscernible from one another and the elements of different clusters are discernible, i.e. clustering groups similar elements into a single set, called a cluster. Generally, data clustering is done using a centroid method or a Euclidean distance method. These can be used when the information is exact and complete: exact in the sense that the attribute value sets have sharp boundaries, and complete in the sense that no object has a missing value for any attribute. Today, knowledge bases change so quickly that they contain obscure facts, missing values and erroneous data. Rough Set Theory (RST), introduced by Z. Pawlak in [1], was developed to deal with such data.
“A vague fact may be more perfidious than erroneous reasoning” (Paul Valery [2]), i.e. bugs in reasoning can be found and rectified, but a vague fact cannot be rectified, since it is not known what is missing from it or what can be drawn from it, and if we coalesce it with certain facts, everything is put in doubt. RST is a mathematical approach for dealing with obscure facts which combine with correct ones to create imperfect knowledge. RST is applied to facts that cannot be ignored, because there is exactness amid their obscurity. Imperfect data includes not only vague facts but also data with missing values, which must be dealt with as well. Clustering can be done using RST by discerning the objects and categorizing them based on several approximations. These categories are the named clusters, and unlabelled objects can be categorized and labelled based on their similarity with these clusters, which can be computed using RST.
Section II discusses the rudiments of RST. Section III explains the Rough Text Model as given in [7] and the proposed Augmented Rough Text Model. Section IV gives the Experimental Results as applied on the SMS Spam Collection dataset taken from the UCI repository.
II. RST BASICS
An Information System is represented as $IS = (U, C)$, where $U$ is the universal set of objects and $C$ is a set of condition attributes. Here, we deal with a Decision System, which is represented as $DS = (U, C \cup \{d\})$, where $d$ is a decision attribute. An Indiscernibility Relation is defined on a subset $B \subseteq C$ as $IND(B) = \{(x, y) \in U \times U : a(x) = a(y)\ \forall a \in B\}$, where $a(x)$ is the value of object $x$ for attribute $a$. The set $U$ is partitioned into different sets based on the decision classes of the decision attribute, and the equivalence classes are obtained based on $B$. Let there be $k$ decision classes, $d_1, d_2, \ldots, d_k$. The equivalence classes based on the decision attribute $d$ are represented as $[x]_d$. Clearly, $[x]_d$ is a subset of $U$. Let $[x]_d$ be denoted as $X$, i.e. $X \subseteq U$. Let the equivalence classes obtained from the Indiscernibility Relation be denoted by $[x]_B$.
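As an illustration of the Indiscernibility Relation, the following minimal sketch groups objects that agree on every attribute of a chosen subset. The toy table, the function name `equivalence_classes` and the object identifiers are hypothetical and are not taken from the dataset used later in the paper.

```python
from collections import defaultdict

def equivalence_classes(table, attributes):
    """Partition object ids into classes that agree on all given attributes."""
    groups = defaultdict(set)
    for obj_id, row in table.items():
        key = tuple(row[a] for a in attributes)  # attribute-value signature
        groups[key].add(obj_id)
    return list(groups.values())

# Hypothetical toy decision system (illustration only).
toy = {
    "S1": {"sender_known": 1, "free": 0, "spam": 0},
    "S2": {"sender_known": 0, "free": 1, "spam": 1},
    "S3": {"sender_known": 0, "free": 1, "spam": 1},
    "S4": {"sender_known": 1, "free": 0, "spam": 1},
}

# Classes induced by the condition attributes B = {sender_known, free}.
classes_B = equivalence_classes(toy, ["sender_known", "free"])
# Classes induced by the decision attribute d = spam.
classes_d = equivalence_classes(toy, ["spam"])
```

Here S1 and S4 are indiscernible w.r.t. the two condition attributes even though their decision values differ, which is exactly the situation the approximations are designed to handle.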
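A. Approximations and Regions [20]
Lower Approximation of $X$ w.r.t. $B$ is defined as the set of all the elements which surely belong to $X$. Mathematically,
$\underline{B}X = \{x \in U : [x]_B \subseteq X\}$, where $\underline{B}X$ is the lower approximation of $X$ w.r.t. $B$ and $X \subseteq U$.
Upper Approximation of $X$ w.r.t. $B$ is defined as the set of all the elements which possibly belong to $X$. Mathematically,
$\overline{B}X = \{x \in U : [x]_B \cap X \neq \emptyset\}$, where $\overline{B}X$ is the upper approximation of $X$ w.r.t. $B$ and $X \subseteq U$.
Accuracy measure of $X$ w.r.t. $B$ is defined as the quotient of the cardinalities of the lower approximation and the upper approximation, $\alpha_B(X) = |\underline{B}X| / |\overline{B}X|$.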
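The lower approximation, upper approximation and accuracy measure can be sketched in a few lines. This is a minimal illustration on a hypothetical four-object table, not the implementation used by ROSE2; the names `partition` and `approximations` are assumptions of this sketch.

```python
from collections import defaultdict

def partition(table, attributes):
    """Equivalence classes of the indiscernibility relation on `attributes`."""
    groups = defaultdict(set)
    for obj_id, row in table.items():
        groups[tuple(row[a] for a in attributes)].add(obj_id)
    return list(groups.values())

def approximations(table, attributes, target):
    """Return (lower, upper) approximations of `target` w.r.t. `attributes`."""
    lower, upper = set(), set()
    for block in partition(table, attributes):
        if block <= target:   # block lies entirely inside the target set
            lower |= block
        if block & target:    # block overlaps the target set
            upper |= block
    return lower, upper

# Hypothetical toy decision system (illustration only).
toy = {
    "S1": {"sender_known": 1, "free": 0, "spam": 0},
    "S2": {"sender_known": 0, "free": 1, "spam": 1},
    "S3": {"sender_known": 0, "free": 1, "spam": 1},
    "S4": {"sender_known": 1, "free": 0, "spam": 1},
}

spam_set = {obj for obj, row in toy.items() if row["spam"] == 1}
lower, upper = approximations(toy, ["sender_known", "free"], spam_set)
accuracy = len(lower) / len(upper)  # quotient of the two cardinalities
```

In this toy table S1 and S4 are indiscernible but carry different decisions, so they fall into the upper approximation of the spam class without entering its lower approximation.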
Positive Region is defined as the union of the lower approximations of all the decision classes.
Boundary Region is defined as the set of elements which cannot be classified with certainty as belonging to $X$. Mathematically, it is the set difference of the upper and lower approximations,
$BN_B(X) = \overline{B}X \setminus \underline{B}X$, where $BN_B(X)$ is the boundary region of $X$ w.r.t. $B$ and $\setminus$ is the set difference operator.
B. Rough Set Definability
TABLE I
ROUGH SET DEFINABILITY
$\underline{B}X = \emptyset$ | $\overline{B}X = U$ | Type
NO | NO | Roughly B-definable
NO | YES | Externally B-Undefinable
YES | NO | Internally B-Undefinable
YES | YES | Totally B-Undefinable
$X$ is Roughly B-definable: it is decidable for some elements of $U$ that they belong to $X$ and for some elements that they belong to $U \setminus X$.
$X$ is Internally B-Undefinable: it is decidable for some elements that they belong to $U \setminus X$, but it is undecidable for any element whether it belongs to $X$.
$X$ is Externally B-Undefinable: it is decidable for some elements that they belong to $X$, but it is undecidable for any element whether it belongs to $U \setminus X$.
$X$ is Totally B-Undefinable: it is undecidable for any element of $U$ whether it belongs to $X$ or to $U \setminus X$.
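The four cases of Table I can be expressed as a small lookup. The sketch below assumes that the two YES/NO columns of the table ask whether the lower approximation is empty and whether the upper approximation equals the whole universe; the function name and the example sets are hypothetical. Crisp sets, whose lower and upper approximations coincide, are not distinguished here because Table I does not list them.

```python
def definability_type(lower, upper, universe):
    """Return the Table I label for a set with the given approximations."""
    lower_empty = not lower
    upper_is_universe = set(upper) == set(universe)
    if lower_empty and upper_is_universe:
        return "Totally B-Undefinable"
    if lower_empty:
        return "Internally B-Undefinable"
    if upper_is_universe:
        return "Externally B-Undefinable"
    return "Roughly B-definable"

# Hypothetical universe of six objects (illustration only).
universe = {1, 2, 3, 4, 5, 6}
label = definability_type({1, 2}, {1, 2, 3, 4}, universe)
```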
C. Rough Set Membership
For classical sets, the membership of an element is black and white, i.e. an element either belongs to a set or does not. In other words, the membership is binary, where 1 represents that the element belongs to the set and 0 that it does not, so for classical sets membership is defined by the characteristic function. The view is different in RST. The Rough Membership function quantifies the degree of overlap between the set $X$ and the equivalence class $[x]_B$ of an object w.r.t. the set of conditions $B$. The Rough Membership function is given by:
$\mu_X^B(x) = \frac{|[x]_B \cap X|}{|[x]_B|}$ ………..(1)
The lower and upper approximations can be computed by taking an arbitrary rough membership value (say, $\pi \in (1/2, 1]$) as a threshold value, as follows:
$\underline{B}_\pi X = \{x \mid \mu_X^B(x) \geq \pi\}$ ………..(2)
$\overline{B}_\pi X = \{x \mid \mu_X^B(x) > 1 - \pi\}$ ………..(3)
The Rough Membership formulae help to generalize the approximations and can be utilized to find the similarity measure which will be discussed later in the paper.
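A minimal sketch of the rough membership function and the thresholded approximations follows. It assumes a non-strict comparison for the lower approximation, so that a threshold of 1 recovers the classical lower approximation; the toy table and function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

def rough_membership(table, attributes, target):
    """Map each object to |[x]_B ∩ X| / |[x]_B| for the target set X."""
    blocks = defaultdict(set)
    for obj_id, row in table.items():
        blocks[tuple(row[a] for a in attributes)].add(obj_id)
    mu = {}
    for block in blocks.values():
        value = Fraction(len(block & target), len(block))
        for obj_id in block:
            mu[obj_id] = value
    return mu

def threshold_approximations(mu, pi):
    """Thresholded lower and upper approximations for pi in (1/2, 1]."""
    lower = {x for x, m in mu.items() if m >= pi}     # assumption: non-strict
    upper = {x for x, m in mu.items() if m > 1 - pi}
    return lower, upper

# Hypothetical toy data: S1-S4 are indiscernible, three of them are spam.
toy = {
    "S1": {"free": 1, "call": 1}, "S2": {"free": 1, "call": 1},
    "S3": {"free": 1, "call": 1}, "S4": {"free": 1, "call": 1},
    "S5": {"free": 0, "call": 0}, "S6": {"free": 0, "call": 0},
    "S7": {"free": 1, "call": 0},
}
spam_set = {"S1", "S2", "S3", "S7"}

mu = rough_membership(toy, ["free", "call"], spam_set)
relaxed = threshold_approximations(mu, Fraction(7, 10))
classical = threshold_approximations(mu, Fraction(1))
```

With the relaxed threshold the block S1 to S4, whose membership is 3/4, enters the lower approximation; with a threshold of 1 it stays in the boundary.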
III. TEXT MINING EMPLOYING ROUGH SETS
Text mining is the process of extracting knowledge from amorphous text. It is a burgeoning field of Data Mining, and distilling knowledge from large chunks of unstructured and fuzzy text is a challenging task. RST is a methodology that deals with sets with fuzzy boundaries and can thus be applied in Data Mining for Text Mining tasks. The application of RST in Text Mining is referred to as Rough Text in [7]. The intent of this paper is to extend the approach of RST used in [7] from a complete Information System to an incomplete Information System of classified documents, in order to classify and label new documents and categorize them into the given clusters. If any of the documents do not relate to the given clusters, new clusters are formed correspondingly.
ROUGH TEXT Model in [7]
The Rough Text model as given in [7] for document clustering is a Decision System with $m$ documents as the objects, $U = \{D_1, D_2, \ldots, D_m\}$, $n$ terms as the set of attributes, $C = \{T_1, T_2, \ldots, T_n\}$, and $d$ as the decision attribute, where $d = cluster$. The attribute values are the term frequencies, such that $tf(T_i, D_j)$ is the term frequency of the $i$th term in the $j$th document. Term frequency is the number of times the term appears in the document divided by the maximum number of times any term appears in the document.
Mathematically, it is given by:
$tf(T_i, D_j) = \frac{f(T_i, D_j)}{\max_{l} f(T_l, D_j)}$
where $f(T_i, D_j)$ is the number of times the $i$th term occurs in the $j$th document.
Here, each document is categorized into a specific cluster from the decision attribute cluster. Each cluster is such that:
$C_k$ = $k$th cluster to which a document is categorized, $C_k \in cluster$
TABLE II
ROUGH TEXT MODEL FROM [7]
O | T1 | T2 | ... | Tn | Cluster
D1 | tf(T1,D1) | ... | ... | tf(Tn,D1) | C1
D2 | tf(T1,D2) | ... | ... | ... | C2
... | ... | ... | ... | ... | ...
Dm | tf(T1,Dm) | ... | ... | tf(Tn,Dm) | Ck
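The term-frequency weighting that fills the cells of the Rough Text table can be sketched as follows. The sketch assumes whitespace tokenisation and takes the maximum count over all tokens of the document, not only over the chosen vocabulary; the function name and the sample message are hypothetical.

```python
from collections import Counter

def term_frequencies(tokens, vocabulary):
    """Count of each vocabulary term divided by the largest count in the document."""
    counts = Counter(tokens)
    max_count = max(counts.values(), default=0)
    if max_count == 0:  # empty document: every term frequency is zero
        return {term: 0.0 for term in vocabulary}
    return {term: counts[term] / max_count for term in vocabulary}

# Hypothetical document (illustration only).
doc = "free entry win free prize call now free".split()
tf = term_frequencies(doc, ["free", "win", "thanks"])
```

In this example "free" occurs three times and is the most frequent term, so its term frequency is 1, while "win" occurs once and receives 1/3.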
For our consideration, the Rough Text model of [7] is augmented to a Decision System with the attribute values in the form of ranges of the term frequencies.
A. Augmented ROUGH TEXT Model
We propose a model which contains not only the terms but also some additional attributes of the SMS that help determine whether it is spam or not. To keep it simple, we take only the presence or absence of each attribute as its value, so the values are binary, 0 or 1. A value of 1 designates that the attribute is present, e.g. that a term is present, or "yes" for attributes such as sender_known, where it shows that the sender of the SMS is known, and web_msg, where it shows that the SMS is a web message, and so on. Thus the decision depends not only on the text of the SMS but on other important attributes too.
In this augmented model, the Decision System is $DS = (U, C \cup \{d\})$, where $U = \{S_1, S_2, \ldots, S_m\}$ is the set of SMS messages. Here, each attribute’s domain is binary, i.e. either 0 or 1. The decision attribute $d$ is Spam. Its value shows whether a message is spam or not: value 1 shows that the message is spam and 0 shows that the message is ham.
TABLE III PROPOSED DECISION SYSTEM
S | sender_known | ... | T1 | T2 | ... | Tn | Spam
S1 | 1 | ... | 1 | ... | ... | ... | 0
S2 | 1 | ... | 0 | ... | ... | ... | 1
... | 0 | ... | 0 | ... | ... | ... | 1
... | ... | ... | ... | ... | ... | ... | ...
Sm | ... | ... | ... | ... | ... | ... | ...
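One row of such a binary decision table could be built along the following lines. This is only a sketch: the paper does not describe how the attributes were extracted, so the URL pattern used for web_msg, the shortened term list and the function name `to_binary_row` are all assumptions, and sender_known is supplied by the caller because it cannot be derived from the message text.

```python
import re

# Hypothetical subset of the term attributes (illustration only).
TERMS = ["thanks", "win", "free", "sorry", "urgent", "offer", "call"]

def to_binary_row(text, sender_known):
    """Encode one SMS as presence/absence (1/0) values of each attribute."""
    lowered = text.lower()
    words = set(re.findall(r"[a-z]+", lowered))
    row = {"sender_known": int(sender_known)}
    # Assumed heuristic: any URL-like token marks the SMS as a web message.
    row["web_msg"] = int(bool(re.search(r"https?://|www\.", lowered)))
    for term in TERMS:
        row[term] = int(term in words)  # exact word match only
    return row

row = to_binary_row(
    "URGENT! You have won a FREE prize, call www.example.com",
    sender_known=False,
)
```

Exact word matching is a deliberate simplification: "won" does not set the win attribute here, whereas a fuller implementation might match related word forms.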
IV. EXPERIMENTAL RESULTS
The SMS Spam Collection dataset from the UCI Machine Learning Repository is used. This dataset is converted into .isf format in the decision system form shown in Section III, and the ROSE2 software is run on it for the RST implementation.
The dataset used consists of 800 SMS messages. The attributes of the decision system are shown in Table IV below.
TABLE IV ATTRIBUTES
S.No. | Attribute Name | Explanation | Attribute Type (Condition/Decision) | Domain
1 | sender_known | Whether sender is known or not. | Condition | Boolean {0,1}
2 | web_msg | Whether web message (or link) is present or not. | Condition | Boolean {0,1}
3 | long_string | Whether a long string containing alphanumeric data or symbols is present or not. | Condition | Boolean {0,1}
4 | thanks | Whether term “thanks” is present or not. | Condition | Boolean {0,1}
5 | congratulations | Whether term “congratulations” or related verb forms are present or not. | Condition | Boolean {0,1}
6 | win | Whether term “win” is present or not. | Condition | Boolean {0,1}
7 | free | Whether term “free” is present or not. | Condition | Boolean {0,1}
8 | sorry | Whether term “sorry” is present or not. | Condition | Boolean {0,1}
9 | urgent | Whether term “urgent” is present or not. | Condition | Boolean {0,1}
10 | private | Whether term “private” is present or not. | Condition | Boolean {0,1}
11 | please | Whether term “please” is present or not. | Condition | Boolean {0,1}
12 | finally | Whether term “finally” is present or not. | Condition | Boolean {0,1}
13 | service | Whether term “service” is present or not. | Condition | Boolean {0,1}
14 | offer | Whether term “offer” is present or not. | Condition | Boolean {0,1}
15 | great | Whether term “great” is present or not. | Condition | Boolean {0,1}
16 | oops | Whether term “oops” is present or not. | Condition | Boolean {0,1}
17 | reminder | Whether term “reminder” is present or not. | Condition | Boolean {0,1}
18 | call | Whether term “call” is present or not. | Condition | Boolean {0,1}
19 | spam | Decision whether spam or ham. | Decision | Boolean {0,1}
The decision system is built with these attributes as an .isf format file, which is then input to the ROSE2 software. The results are delineated below:
TABLE V ROSE2 APPROXIMATIONS
Class | Number of objects | Lower approximation | Upper approximation | Accuracy
0 | 674 | 664 | 703 | 0.9445
1 | 126 | 97 | 136 | 0.7132
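The reported accuracies and quality of classification can be cross-checked from the class sizes and lower approximations alone. The sketch below assumes that, with only two decision classes, both classes share one boundary region, so each upper approximation equals the lower approximation plus that boundary; the variable names are illustrative.

```python
total_objects = 800
lower = {0: 664, 1: 97}  # lower approximation sizes per decision class

# Objects outside every lower approximation form the shared boundary region.
boundary = total_objects - sum(lower.values())

# Upper approximation = lower approximation + boundary (two-class assumption).
upper = {cls: size + boundary for cls, size in lower.items()}

# Accuracy = |lower| / |upper| for each class.
accuracy = {cls: lower[cls] / upper[cls] for cls in lower}

# Quality of classification = size of the positive region / number of objects.
quality = sum(lower.values()) / total_objects
```

Under this assumption the boundary holds 39 objects, the upper approximations are 703 and 136, the accuracies are about 0.9445 and 0.7132, and the quality of classification is 761/800, about 0.9513.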
TABLE VI ROSE2 CORE VIEWER
QUALITY OF CLASSIFICATION
For all condition attributes | 0.9513
For condition attributes in core | 0.9500
ATTRIBUTES IN CORE
sender_known, long_string, win, free, sorry, urgent, service, offer, great, call
TABLE VII REDUCTS
S.No. | Reduct | Length
1 | sender_known, web_msg, long_string, win, free, sorry, urgent, please, service, offer, great, call | 12
2 | sender_known, web_msg, long_string, thanks, win, free, sorry, urgent, service, offer, great, call | 12
RULES
# ModLEM with Entropy
# C:\Program Files\ROSE2\examples\smsspam800.isf
# objects = 800
# attributes = 19
# decision = spam
# classes = {0, 1}
# Thu May 08 19:42:51 2014
# 0 s
rule 1. (sender_known = 1) & (sorry = 0) => (spam = 0); [656, 656, 97.33%, 100.00%][656, 0]
[{1, 2, 4, 5, 7, 8, 14, 15, 17, 18, 19, 21, 22, 23, 24, 26, 30, 35, 37, 38, 40, 41, 44, 45, 46, 51, 53, 54, 56, 57, 58, 59, 60, 62, 63, 64, 65, 67, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 95, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 116, 117, 119, 120, 123, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 137, 138, 139, 141, 142, 143, 144, 145, 146, 147, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 162, 163, 164, 167, 169, 170, 171, 172, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 190, 191, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 225, 227, 229, 230, 231, 232, 233, 235, 237, 238, 239, 240, 242, 243, 244, 245, 246, 247, 248, 249, 250, 252, 253, 254, 255, 256, 257, 258, 259, 261, 262, 263, 264, 266, 267, 268, 270, 272, 273, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291,
292, 293, 294, 295, 296, 298, 299, 300, 301, 302, 303, 304, 305, 307, 308, 309, 311, 312, 314, 315, 316, 317, 318, 319, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 335, 337, 338, 339, 341, 342, 343, 344, 345, 346, 347, 348, 349, 351, 352, 353, 355, 356, 357, 360, 361, 362, 363, 364, 365, 366, 367, 370, 371, 372, 373, 374, 375, 377, 378, 379, 380, 381, 382, 383, 384, 385, 387, 388, 389, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 417, 418, 420, 422, 424, 426, 427, 428, 429, 430, 431, 432, 434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 457, 458, 459, 460, 461, 462, 463, 466, 467, 468, 469, 470, 471, 473, 474, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 489, 490, 491, 495, 496, 497, 499, 500, 501, 502, 503, 504, 505, 507, 508, 509, 510, 511, 512, 513, 514, 515, 517, 520, 521, 522, 523, 524, 525, 527, 529, 531, 533, 534, 535, 536, 537, 538, 539, 540, 541, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 578, 579, 580, 582, 583, 584, 585, 586, 587, 591, 592, 593, 594, 596, 597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 610, 611, 612, 614, 615, 616, 617, 618, 619, 620, 621, 622, 623, 624, 625, 627, 628, 629, 630, 633, 634, 635, 636, 638, 639, 640, 641, 642, 643, 644, 645, 646, 647, 648, 649, 651, 652, 653, 654, 655, 656, 657, 658, 659, 660, 662, 663, 664, 665, 666, 667, 668, 669, 670, 671, 672, 675, 676, 677, 678, 679, 680, 681, 682, 683, 684, 686, 687, 688, 689, 690, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701, 702, 703, 704, 705, 706, 707, 708, 709, 712, 713, 715, 716, 717, 719, 720, 721, 722, 723, 724, 725, 726, 727, 728, 729, 730, 731, 733, 734, 735, 736, 737, 738, 740, 741, 742, 743, 744, 746, 747, 748, 750, 751, 754, 755, 756, 757, 758, 759, 760, 761, 763, 765,
766, 768, 770, 771, 772, 773, 774, 775, 776, 777, 778, 779, 780, 781, 782, 783, 784, 786, 787, 788, 791, 792, 793, 794, 795, 796, 797, 799, 800}, {}]
rule 2. (great = 1) => (spam = 0); [12, 12, 1.78%, 100.00%][12, 0]
[{1, 38, 43, 294, 325, 342, 351, 406, 441, 463, 467, 593}, {}]
rule 3. (sender_known = 1) & (offer = 1) => (spam = 0); [3, 3, 0.45%, 100.00%][3, 0]
[{27, 182, 400}, {}]
rule 4. (sender_known = 1) & (call = 1) => (spam = 0); [38, 38, 5.64%, 100.00%][38, 0]
[{75, 76, 81, 82, 86, 130, 133, 138, 149, 173, 177, 206, 227, 248, 289, 290, 314, 340, 341, 388, 398, 422, 444, 460, 465, 494, 496, 521, 541, 567, 575, 587, 681, 689, 703, 727, 765, 769}, {}]
rule 5. (sender_known = 0) & (free = 1) => (spam = 1); [29, 29, 23.02%, 100.00%][0, 29]
[{}, {3, 6, 10, 13, 39, 96, 140, 148, 189, 228, 269, 271, 297, 358, 368, 386, 402, 419, 456, 464, 488, 493, 581, 595, 609, 631, 785, 790, 798}]
rule 6. (sender_known = 0) & (offer = 1) => (spam = 1); [8, 8, 6.35%, 100.00%][0, 8]
[{}, {260, 297, 368, 464, 528, 581, 637, 798}]
rule 7. (sender_known = 0) & (urgent = 1) => (spam = 1); [10, 10, 7.94%, 100.00%][0, 10]
[{}, {13, 32, 36, 68, 122, 168, 425, 526, 718, 764}]
rule 8. (sender_known = 0) & (thanks = 1) => (spam = 1); [2, 2, 1.59%, 100.00%][0, 2]
[{}, {25, 376}]
rule 9. (sender_known = 0) & (sorry = 1) => (spam = 1); [4, 4, 3.17%, 100.00%][0, 4]
[{}, {50, 251, 691, 711}]
rule 10. (long_string = 1) => (spam = 1); [4, 4, 3.17%, 100.00%][0, 4]
[{}, {16, 20, 48, 711}]
rule 11. (service = 1) => (spam = 1); [18, 18, 14.29%, 100.00%][0, 18]
[{}, {33, 39, 61, 94, 140, 160, 166, 189, 269, 369, 376, 416, 423, 595, 661, 739, 749, 753}]
rule 12. (sender_known = 0) & (win = 1) => (spam = 1); [18, 18, 14.29%, 100.00%][0, 18]
[{}, {12, 94, 115, 135, 168, 189, 274, 313, 320, 336, 358, 390, 506, 565, 577, 588, 718, 767}]
rule 13. (sender_known = 0) & (call = 1) => (spam = 1); [58, 58, 46.03%, 100.00%][0, 58]
[{}, {9, 10, 13, 28, 29, 31, 32, 33, 34, 36, 39, 52, 55, 61, 66, 94, 115, 118, 121, 122, 124, 160, 168, 189, 241, 260, 297, 320, 334, 386, 390, 423, 425, 433, 456, 464, 472, 493, 506, 526, 528, 532, 565, 581, 589, 590, 595, 632, 650, 661, 691, 711, 718, 732, 762, 764, 785, 798}]
# Approximate rules
rule 14. (sender_known = 0) & (long_string = 0) & (thanks = 0) &
(win = 0) & (free = 0) & (sorry = 0) & (urgent = 0) & (service = 0) &
(offer = 0) & (call = 0) => (spam = 0) OR (spam = 1); [30, 30, 76.92%, 100.00%][2, 28]
[{11, 49}, {42, 69, 136, 161, 165, 192, 226, 236, 265, 306, 310, 350, 359, 421, 475, 516, 518, 519, 530, 542, 613, 673, 674, 685, 710, 714, 752, 789}]
rule 15. (sender_known = 1) & (sorry = 1) & (offer = 0) & (great = 0)
& (call = 0) => (spam = 0) OR (spam = 1); [9, 9, 23.08%, 100.00%][8, 1]
[{47, 193, 224, 234, 354, 492, 498, 745}, {626}]
**END
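The certain rules in the listing (rules 1 to 13) can be applied to a new message once it has been encoded as binary attributes. The following sketch assumes a first-match strategy and treats missing attributes as 0; both are assumptions of this illustration, since the way ROSE2 resolves overlapping rules is not described here. Messages matched by no certain rule, such as those covered only by the approximate rules 14 and 15, are left undecided.

```python
# Certain rules 1-13 from the listing: (conditions, decision on spam).
RULES = [
    ({"sender_known": 1, "sorry": 0}, 0),
    ({"great": 1}, 0),
    ({"sender_known": 1, "offer": 1}, 0),
    ({"sender_known": 1, "call": 1}, 0),
    ({"sender_known": 0, "free": 1}, 1),
    ({"sender_known": 0, "offer": 1}, 1),
    ({"sender_known": 0, "urgent": 1}, 1),
    ({"sender_known": 0, "thanks": 1}, 1),
    ({"sender_known": 0, "sorry": 1}, 1),
    ({"long_string": 1}, 1),
    ({"service": 1}, 1),
    ({"sender_known": 0, "win": 1}, 1),
    ({"sender_known": 0, "call": 1}, 1),
]

def classify(row):
    """Return 1 (spam), 0 (ham), or None when no certain rule matches."""
    for conditions, decision in RULES:
        # Attributes absent from the row are treated as 0 (assumption).
        if all(row.get(attr, 0) == value for attr, value in conditions.items()):
            return decision
    return None

spam_example = classify({"sender_known": 0, "free": 1})
ham_example = classify({"sender_known": 1})
undecided_example = classify({"sender_known": 0})
```

An unknown sender offering something free triggers rule 5, a known sender with no "sorry" triggers rule 1, and an unknown sender with no other attribute set matches no certain rule.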
V. CONCLUSION
Spam SMS detection has also been implemented using other approaches, such as the Naïve Bayes algorithm. RST offers an advantage over these approaches in that it gives explicit decision rules. These rules decide whether a given SMS is spam or not, so the category to which an SMS belongs is obtained directly. This paper reviews how SMS spam detection can be implemented using Rough Set Theory.
VI. FUTURE WORK
The implementation was done on a Core i3 processor. It can be extended to run on a mobile device to predict whether an SMS is spam or not. Further, this work can be extended to build an Android or Windows application to detect spam SMS.
ACKNOWLEDGEMENT
I would like to thank Dr. Shampa Chakraverty, Professor and HOD, COE department, NSIT, Delhi University, for imparting knowledge of Rough Set Theory to me. The detailed study of this subject has helped me to work on its application to SMS spam data.
REFERENCES
[1] Zdzisław Pawlak, “Rough Sets”, International Journal of Computer and Information Sciences, 11, 341-356, 1982.
[2] R. Slowinski, “Obituary/Fuzzy Sets and Systems”, in ScienceDirect, 157, 2419-2422, 2006.
[3] Z. Pawlak and R. Slowinski, “Rough Set approach to multi-attribute decision analysis”, European J. Oper. Res., 72, 443-459, 1994.
[4] R. Slowinski and D. Vanderpooten, “A Generalized Definition of Rough Approximations based on Similarity”, IEEE Transactions on Knowledge and Data Engineering, 12(2), 331-336, 2000.
[5] Zdzisław Pawlak and Andrzej Skowron, “Rudiments of Rough Sets”, Information Sciences, 177, 3-27, 2007.
[6] R. Raghavan and B. K. Tripathy, “On Some Topological Properties of Rough Sets”, Pelagia Research Library, Advances in Applied Science Research, 2(3), 2011.
[7] Leticia Arco, Rafael Bello, Yaile Caballero, and Rafael Falcon, “Rough Text Assisting Text Mining: Focus on Document Clustering Validity”, Springer, 224, 229-248.
[8] Jan Komorowski, Lech Polkowski and Andrzej Skowron, “Rough Sets: A Tutorial”.
[9] Zbigniew Suraj, “An Introduction to Rough Set Theory and its Applications: A Tutorial”, ICENCO’2004, December 27-30, 2004.
[10] Fuyuan Cao and Jiye Liang, “A Data Labelling Method for Clustering Categorical Data”, in ELSEVIER, ScienceDirect, Expert Systems with Applications, 38, 2381-2385, 2011.
[11] Y. Y. Yao, “A Comparative Study of Fuzzy Sets and Rough Sets”, Information Sciences, Vol. 109, No. 1-4, 227-242, 1998.
[12] Noemí Pérez-Díaz, David Ruano-Ordás, José R. Méndez, Juan F. Gálvez, Florentino Fdez-Riverola, “Rough Sets for Spam Filtering: Selecting appropriate decision rules for boundary email classification”, Applied Soft Computing, Vol. 12, 3671-3682, 2012.
[13] Sarah Jane Delany, Mark Buckley, Derek Greene, “SMS Spam Filtering: Methods and Data”, Expert Systems with Applications, Vol. 39, 9899-9908, 2012.