?rnal*
?dzo
,ffi,
rc,*Q
International
Review on Computers and Soliware (I.RE.CO.S.), Vol. 9,l,l.
7ISSN
1828-6003
Julv 2014A Framework
for
SMS Spam and Phishing Detection
in Malay
Language:
a Case
Study
Cik
Feresa
Mohd
Foozy,
RabiahAhmad, Faizal M.
A.
Abstract
-
Short Message Service (SMS) spamand
SMS phishing has been ina'eese nov,adavsespecially in
Malal'
language whichis
the Jirst langtage Jbr Malaysia country. Currently,nnnl'
SMS spamin
others language has heen proposed, however nolyetJbr
Maluy- language and we are the Jirst to propose these.In addition,
this paper also analysl on several Jramev'orksof
SMS spamJiltering.[or onr
SMS spanrand
phishing detection.framevork. From
theana{uis,
the chosenframev,ork
has been enhanced.fbr Malav- SMS spamand
phishing. The enhancement has been doneon
classiJication phase *'here our.frameu'ork proposeddual
clussification. The t'lassificationI
w,itl
classit'ythe
SMSinto
ham anclscam
SMS.For
classification2,
thescam
SMSwill
bec'lassified again
into
SMS spamand
SMS phishing.Aller
dual classifications phase completed, theMalay
SMS Aas been examinedasing iVaii'e
Ba3,esand
J48
unsupervised Machine Learningtec'hniques. The result shows high aecurdcy
in
detectingMalay
SMS ham, spamand
phishing. Copyright @ 2014 Praise Worthy Prize S,r.I. -AII
rights reserved.K eyw o rds
:
D e t e c t i o n, F i I teri n g, P h i s h i n g, .Securi4t. SMS, SpazrI.
Introdnction
SMS is one
of
the alternative communication services nowadays.Since
the SMS
chargesare
in
reasonably priced per SMSin
many countries. these gain interest to many users thatdon't
have mobile data or Intemet to usethis
service
as
communication
tool
and
sending information such as advertising marketing, spread news and etc. There aretools
and softwarefor
saleto filter
SMS. However,the SMS
spam and phishing attack isstill
increased. Spam and SMS phishing istwo
different typesof
attack. Spam messagewill
contain advertising[l]
and marketing. However,
for
phishing attack
the messagewill
trick
usersby
announce user i5 a winner or get freegift,
These phishing tactics is for user to responsethe
message.According
to
[2]
the
SMS
phishing nessagescan
chargedfees
to
mobile
device
ou'ner silentlyif
the user response to the SMS. Moreover, fromthe replied SMS phishing,
the
phishescan
get
moreinfbrmation about
the mobile
device such as
mobile device version. contact number and etc.Boodae
[3]
identify,
mobile
device usersare
three times morelikely to
enter a'*'eb-based phishing attack than desli:top users and a security reportby
Lookout [4]found three
out
oi
ten Lookout's
users interested toclicking on an
unsafelink
per
year
by
using
mobiledevice. Recently, there are variety of mobile applications have been developed and have been dorn'nload on mobile device
by
many user. Thisis
oneof
the reason phishes revolutionize their attack strategy by embedding malwareURL
links into
the SMSfbr
userto
clicks or reply
the SMS as accepting the rules and regulations to download or install the mobile applications.Moreover,
if
user clickedor
replied the trigger SMS, the malwarewill
connectwith
the mobile device and themobile
device owner has potentialto
losing money andinfbrmation privacy.
Joeet
al. [5]
found
the unu'antedincoming
SMS
makes
the
respondentfeel the
SMS violate their personal privacy and Sfv{iShing usually u"ill influence the SMS recipientto
entertheir
fraud contest, asking money transactionsinto their
bank account" paythcir bills,
traud
advertisingand
etc.
Since
there aremany disadvantages of SMiShing attack. few numbers
of
studies on SMS spam filtering has been proposed.Traditionally. phishes
will
send SMiShing attackin
ahigh
volume. Howevcr, nowadays phishes interested to send SMiShing attackin
a small quantity to observe the security levelof
the mobile device, thusit
is possible to detect the SMS phishing based on the SMS quantity and the distribution of sending message pattern.In
addition. SMS languageor
abbreviation words ate frequently used when sending the SMS instead of proper language becausethe limited text
length
fo
vrite
the message-This
SMS compose l'eatures on mobile phonewill
make
user
compress
the
messageby
using abbreviationwords as
long
asthe
recipients read the message. However, too much abbreviation words in SMSwill
be detected as spamby
someof
the SMSfiltering
tools.In
this
paper
we
proposeda
I'rameworkto
detectMalay SMS
spam
and
phishing
using
classificationtechniques. Malay SMS corpus, have been collected
tiom
the
contributors suchas
web
site,
fi'iends,family
and unknou'n respondents.The
text
collection methodis
by
using n'anscription and our online form.Cop-vright @ 2014 Praise Worrhy Prize S.r.l. - All rights reserved
This
paperis
organizedas follows.
In
Section
2, Literature ret'iew on SMS phishing detection.In
Section 3,will
explained the pre-processing SMSdatasets,
features selection
development
and
the experiments toidentiff
the accuracy offeatures selection using data mining tools namedWEKA.
Section 4, shows the result and findings based on the experiment that has been done andfinally
is conclusion and fufurewort
were given.II.
Related
Work
Mobile device phishing adack has been categorized
by
Dunhamet
al.[6]
into
four
types such
as
Bluetooth phishing,Short
Message Service(SMS) phishing
andVoice
over
IP
Phishing(vishing). Example
Bluetoothphishing attack has been discuss
bV
[6].
the
Bluetooth phishing attack workswhen
user connectto
the Wi-Fi
hotspot and the attacker can steal the data when the user connects to theWi-Fi.
For vishing, this attackwill
attackcustomers'
service
organization
to
get the
private information of customers-Mobile web
application
phishing
is
similar
with
desktop web phishing attack. However, the solution
for
this
attack mostly
basedon
client-server method anddepends on the mobile architecture itself because mobile
device
has
few
limitation such as CPU
processing, batterylife
and memory size" thus the existing solution isby
usingantivirus
or
anti-phishingwhich is
purposely developfor
lightweight mobile device. Thus, this paper, weu.ill
focus on SMS spam and phishing attack solution that has been increased recentlv.il.1.
SMS Corpus Ctillection Method Several methods can be used to collect text corpus. According to Tablel,
there is varietyof
text size that has been collected.Most SMS
languages that has been collected are in English SMS and text[7]
contributor are trom unknown, known contributor and participant.For unknown SMS contributor are usually SMS taken lrom soltware and websites.
For
known arefrom family
and friends. Few studies not mention the process of SMS collection clearly.However, Hard
af
Segerstad[8],
Ogle[9],
Fairon andPaumier
[0],
Choudhury
[ ll,
Herring
and Zelenkauskaite[l2],
Bach and Gunnaruson [13] has state the method that has be used to collect SMS.SMiShing attack has been discussed
by
[6]
and [39]-According toK.
Dunham[6].
SMiShingis
a new tacticto
spread malwarcby
adding theURL link in
SMS and inJluence the recipient toclick
on the URL.In
addition
O.
Salem
et al.
[39],
ideutity
theSMiShing
atnck
is on the rise atrack cun'ently on mobile device andXu et al. [40]
agreedSMS
spammingis
aserious attack
for
SMS nowadays. J. W.Yoon
et al. [41] andH.
Peizhou et al.[42]
studies on SMS content-basedfiltering
andQ.Xu
et al. [40] proposed SMSfiltering
on non content-based.Copl'7if6t 'g 2014 Praise ltortlry Prize S.r.l. - .4ll rights resen'ed
Cik
Fercsa MohdFoon.
Rabiah Ahmad. Faizal M. A.For
this
paper,the Malay SMS
phishing and spamwill
be classified based on the generic features such asTotal words t431, t441,
[45]
and
[46],
Number of
Character bi-grams[43]
antl [45],
Numberof
Character n'i-grams[43] and[45],
Average numberof
length [44] and [a6] and Average number of word [aa] and [46].We also add additional f'eatules fbr this study such as
advertisement, contest, ut3enr SMS, ask money, asked to response SMS, telephone,
URL
and arurounce user getfree
gift
or win.
Thesecriteria
is
basedoo
spam and phishing attack.Total
f'eatures appliedin
this
study are14 fearures.
t1.2.
ilfala1t l-6s1g11age Detet'tion Framev:orkMalay
language has been applied as case shrdyfor
several research areas.:However, there
is
still
lack
lbr
information security detection study especiallyfor
SMS spam and phishing study which is not yet been examined based on the literature review in SMS spam and phishingstudy. The
exampleof
Malay
detection research area such as speech derection[47], language detection[48],[7] and email spam detection [a9].Moreover,
speech
detection
by
Lim
et
al.[47]purposed
to
producehigh quality
synthetic speechin
Malay
language.In
addition,for
language detection by Tsai et al. [48].The,researchers aimed
to
reducethe
English, Malayand
Chinese
languages documents redundancy. The framework has threedifferent
preprocessingwhich
suit'
for
threedifferent
languages andfinally
will
applyin
adetection
process
such
as
Sentence
Segmentation.Sentence Level and Novel Rate computation.
A.
Kwee
et
al.
[7].
proposedEnglish
and
Malaylanguage detection and the processes
for
Malay language detection consistof
Language Translation. Stop Word Removal, Word Stemming and Detection.In
spam detection
research
area
by
T. Subramaniamet
al.
[49],
theMalay
language has beenapplied as case study using Bayesian
filtering
techniques. The results showthis
technique successfully classity spam and non-spamof
Malay language emailwith
96%accuracy.
The
frameworks consist
of
Collection of
emails, Tokenization,
Features
reduction,
Featuresselection, Training and Testing^
Thus,
in
this
paper,
Malay
language
has
beenexamined
as
casestudy
fbr
SMS
spam and phishing detection. The basic detecfion process has been appliedin
this
framework
and
the
enhancementof
dual classification.*ith two
additionalnew
features such asadvertisement and contest feafures has been proposed. The framework and result
will
discussed further in the next section.ILs.
SMS Fihering and Detection Framework Generally,filtering
method consistof
tokenization, lemmatization and stop word removal, representation andclassifierby
[50].httenatktnal Reviev: on Comuulers and ktftu'are, rbL 9, N. 7
Cik Fercsa Mohd Foozy, Rabiah Ahmad, Faizal fuL A.
TABLE I
LJTERATURE ON TEXT MESSAGING CoRPUs
References Tevt Size
TertLensuaec
TcxtContributor Text Collcction l\{ethodPietrini
ll4l
Schlobinski et al. [151
Shortis l16l Doring [r7]
Hard ef Setgetstad
ll$l
Kasesnicmi and Rautiainen
flel
Grinter and Eldridge[201
Thurlorv end Brown
[2ll
Ogl"Yljue How and Ken l22l Feiron end Paumler
ll0l
Choudhury [lll
Rich Ling [231
Rcttie (2007)
Zic Fuchs and Tudman
Vukorid l24l
Gibbon and Kul
l2sl
Deunert and Oscar
Masinyana [261
Hutchby erd Tannr l27l
Welkowske [281
Herring end
Zelenkauckrite
ll2l
Teggl2el
Ehis J30l
Barasa [3ll
Bech end Gunnarsson [31
Bodomo l32l
Liu and Weng
133l
Sotillo I34l
Ditrscheid and Stark [351
Lexender [361
Elizondo f3?l
Chcn end Ken l38l
Italian Germany English German Swedish Finnish English English Englisir Englislr French English Norwegian English Croation Folish
English, isiXhosa
English
Polish lralian
English English. French, etc. ,English. Kiswahili. etc. Swedish, English. etc.'
English. Chinese
Chinese
Elglish
Germany, French. etc French English
English and
Mandarin Chinese. 292 312 500 1500 207 100{) | 152 7800 4'7'l 5M 97
l0l l7
30000
lo00
867 278 6000
| 250 I700 1452
l062ri 600
2730
3l 52
853
85870
6629 2398?
l5-35 Years Old
StDdents
I Male Student liicnds and Family
200 partrcipans
I I 2 trom an anonymous webpage, 252
nrcssagcs foru'ardcd from voluntccrs
and ?88 fiorn family and friends
Teenagers ( l3-18 Years Old)
l0 Teenagers i15-16 Years Old)
135 Freshmen Nightclubs
?003 Respondents
166 Univexity Students
3.200 Contributos
Rendomly
3l contributors
University students,
family and iiiends
Uniuersity studenrs and
friends
22 young adulls
-30 young pmfessionals
(?0-35 years old)
2(X) contriburors
Audiences of an iTV
progrum
16 tbmily and friends
72 universitv students
and lecturers
84 univercity studcnts
and 37 young prottssionals
I I conlributors
87 youngsrcrs
Real volunteers
-59 participanrs
2.6f7 vtrluntcer:
I 5 young people
12 volunteers
University Students
Unknoun
Unknown Transcript Unknou'n
From webpage, rrolunteers,
family and liicnds
Transcript
Tmnscripr Trarrscript
Subscribe SMS Promotion
Of Nightclub
Transcript For*,ard
Search The SMS
From The Website
Transcript Unknown Unknown
Unknown
Trdnscript. Forward
Transcript
Fonpard, Softu'arc
Online SMS archives
Transcript Forward Fonvard Soliware Transcript Unknown Software Fors'nrd Transcript Transcript Unknown 496 357 71000
However, there are
multiple
frameworks
for
spam filtering and detection such as below:t
ToolsCormack et al.
[45]
on SMStiltering
using 5 fypesof
spamfilter
tools
tofilter
the SMS. Before applied tools on SMS, there are pre-processing data that has been doneto
come outwith
four (4)
mainfeatures.
Moreover, to detectSMS protocol
in
real time
Rafique
et al.
[51]applied Hidden Markov
Model
(MIm)
which
thearchitecture
consists
of
sniffer.
feafi.rre
extraction, classifier andmles
decision. Que and Farooq[52]
also applyMHH
on byte level distributionof
SMS that haveCopyright 'O 201 4 Praise Worthj, Prize S.r.l. - .4ll rights resen ed
four
processes such as identify suitable representationof
an
SMS,
build
spam models, classihcation
and determined the spam.ln
addition, SMS-Watchdog SMS detection schemeby Yan
etal.
[53] has tlu'ee processesof
monitoring, anomaly detection
and alefi
handling using SMS services..
Contenl-Based Filte,'ingJ.
W.
Yoon, et
al. [41]
proposedhybrid
framewor* that implement content-based techniquewith
challenge-response scheme.The SMS
classifiedinto
ham, spam and uncertain, then the challenge responsewill
classifu uncertaininto
ham and
spamby
matchingthe
senderl25t)
Cik Feresa Mohd Foozy, Rabiah Ahmad, Faizal M. A.
response.
Gdmez
Hidalgo
et al.
[54] have
proposed content-basedSMS filtering
for
English
and
SpanishSMS
spam
using
Bayesian
filtering that
consist
of
preprocessing, feature selection
and
learning.t
Mac:hine LearningXiang et
al. [55]
proposed SupportVector
Machine technique tofilter
the mobile spam.Moreover,
Cai et al.[56]
improved the spamtilter
using traditional balancedWinnow
algorithmwhich
applied pre-processor, feature selection,texts
representationand winnow
algorithmmodule.
An
independentmobile
device
filtering
by
Taufiq Numzzaman et
al. [44]
applied several processesin
their SMS independent sparnfiltering
such as data setand running
environment, feature extraction,
vectorcreation and
filtering
processof
Naive
Bayesor
SVM and updatefiltering
system. Yadav[46], [57]
had three processin
the their SMS filtering
such
as
Bayesianfiltering
algorithm,
mobile
application
and synchronization service on server.t
Vatching PatternWu et al. [58] has proposed SMS
filtering flow
such as SMS screening. bayesian learning. keyword SMS and Pinyin Fuzzed keyword matching.Moreover,
the
ChineseSMS
filtering by
Jie
et
al, [59] has pre-processing, lbatures selection. modeling, and classifier.In
addition. Najadatet al. [60].
frameworksinvolve
of
three
proccssesof
data collection.
pre-processing,
text mining,
testing, evaluation metrics and implementation.t
Artift'ial
I ntmttneSlsternT.
M.
Mahmoud andA.
M.
Mahfouz
[61]
appliedarrifrcial
immune system methodfilter
SMS
spam thatcontain
analysis engine, tokenize
word, stop
word,dataset.
training
andAIS
engine. Chamindaet
al.
[62] proposeda
hybrid solution
ot
neural network
and Bayesianfiltering
where theSMS
filtering
process are sender identification module, spam folder, SMS contentextractor, tokenizer. Bayesian filter..
categorization,training and inbox.
.
Ctyptograph)'In
Cryptography area, Saxena[63] proposeda
secure SMS protocolfor
SMS tr.an$mission and a cryptographic algorithmin
theSIM
card. The processosnf
frameworkare request to send SMS and authenticate sending
SMS-In
addition, Pereira
et
al.[64]
also
proposed
alightweight cryptography algorithm to mitigate the SMS
security
issues, protocols,
pror,iding
encryption, authentication and signature services.In
addition. Choit65l
applied Common
Public
Kuy
Cryptography techniquefor
SMS
communication efl'eciency whichcontaint
of
initialization
for
aulhenticate,encrypt.
or decrypt and communication phuseslbr
sending SMS.As
summary,*re
basic architecmre frameworks aredata
collection,
pre-processing, f'eatures
selection, training and testing.Cop)'right
O
2014 Praise Worthl,Prize $.r.!. - 4!! rights resseryedWhich is mostly has been applied in SMS shrdtes. These explained the SMS spam
filtering
study already applied var-iousfiltering
and detecrion techniqueswith
different SMS language except for Malay SMS language.
Additional,
oneof the
difl'erent befween frameu'orks is the technique applied but yet the result isstill
good.Varieties
of
fiamework
presentedfor
spamfiltering
and
detection.However,
for
SMS
spam and phishing attacknot yet
available. Thus, the proposed SMS spam and phishing frameworkwill
do
some enhancement on the generic lramework of SMS spamtiltering
by [50].The
enhancement
framework
will
have
dual classificationtbr
SMS Malay language.The reason to have dual classification is to identify the SMS
collection
has been classifiedconectly. After
thefirst
classification done, the second classification proeesswill
classity the scam SMS into spam and phishing. The frameworkwill
discuss further in next section.UI.
Methodology
This
section,explain about
Malay
SMS
spam andphishing
detectionframework
development.The
SMSspam and detection
will
focus
on
features based on previous studies. As mention before, this study is thelirst
to collect Malay SMS for detecting spam and phishing. Thus" there are
no
SMS spam and phishing datasetsavailable
in
Malay
Language. SMS spam and phishing datasets nced to be prepared for this srudy.For
datasets preparation, a collectionof
Malay
SMShas
been
done from
website, unknown
respondents. friends and fomily. The proposed framework is based on Guzella and Caminhas[50]
which
havefour (4)
main stepsin
filtering
spam nressage suchas
tokenization, lemmatization, representation and classifier.For this framework, four main steps
will
be applied. However. additionalclassifier
will
be
addedin
thisflamework which
calleddual classifier
in
Malay
SMS spam and phishing detection tlamework. .The reason we need dual classifier compare to a single classification process because
we
collect SMS ham and scamSMS
from
website.friend,
family
and unknownrespondents.
The
respondents
usually
have
basicknowledge about
SMS
spamand phishing and
some doesn't know anything about these attacks.After
Malay SMS collection have done SMS harnmd
SMS
scamwill
be
tokenizing, lemmatizing and
stop word removal, representation and classification I-Atter
get the result
liom
classitication
1,
second classification processwill
be proceed to classified againSMS
scaminto
SMS
spam and phishing.The
similar method has been appliedby
J.
W.
Yoon,
etal.
[41]
to classiiy uncertain SMS into spam and ham class. FigureI
and the process below listed the process of Malay SMS spam and phishing datasets and detection development:i.
Collect SMS ham and
scam
SMS from
rvebsite,friend,
family
and
respondents
and
do
first
classiltcation.ii.
Tokenization.Inkrnational Re'-ie*- on Computers and Sttflu'are, Vol- 9, N- 7
Cik Fet'esa Mohd Foo4,, Robiah Ahmad, Faizal M. A.
lnuoming
Ham and
Scam SMS
si*;
sEiir!sMs
xl
Tokenization Lemmatizationand stop u'ordremoval
I
Clessilkr I Representation
(Features Selection)
Scam
sMs
[Gr*"ur
l2
-_Ir*
______u__.
S.mm or Phiql
Fig. l. SMS Spam and Phishing Detection in Malay Language Framework
il|.1.
Matav
SMS Corpus Collection MethodAs preliminary study in collecting Malay SMS corpus, Malay SMS has been collected using methods in Table I.
The
SMS collection
methods
are from
website,personal
SMS tbrwarding,
transcriptions
and
onlinefbrm.
The
SMS
contributors
are from
respondent,website,
family
and {'riends.After
the SMS
collections are done,all
SMS
are transcriptinto
Microsofl Office
Excel 2007
for
tokenization; lemmatization
and representation process.III.2.
TokenizationTokenization
is
a processto
divide
the sentence intoword,
The
purpose tokenizationshave
beendone for
calculating
the word
for
f'eatures
selection
andclassificarion process.
Fig. 2
is
an exampleof
the SMS tokenization. There are 179 SMS has besn tokenize and total word after-tokenization are 21694.iii.
Lemmatization as remove redundancy and noise.iv.
Representation Srrings into nominal datasets.v-
FeaturesSelection-vi.
Second classificationto
ScamSMS
into
spam and phishing.vii.Examine the result
Malay
SMS datasets using Naive Bayes and J48 Technique.However,
SMS
usually
will
contain
manyabbreviation
words.
It
is
difficult
to
group
similar meaningfor
variety
of
words
suchas
in
Malay
SMS abbreviation,&e
word
ThankYou
can be typed as TQ,thank
Q,
thanks
or
tengkiu. Thus,
for
this
study,
all words in ttris SMSwill
be calculated the occurrences andwill
be identified as different words.The calculation
of
SMS word occurrences process aredone
by
using
JAVA
programmingto
identitied
the unique wordsin
theseMalay
SMS collection. There are80? words in SMS after lemmatization.
lII.4.
Fedtures Selec'tionSMS
representationin
this
paper
is
applying
the features based on the previous studies. The features arc Total words, Numberof
Character bi-grams, Numberof
Charactertri-grams. Average number
of
length,
and Average number of word.For
this
study, additional features based on the spamand
phishing
characteristicsalso
included such
asAdvertisement
or
announcement,Contest,
Malicious URL, Telephone Number, Winning or Freegift.
SMS askhelp'to
get money and SMS ask to respondor
subscribe sen'ices.il1.5.
Class(icatiotrfitere
aretwo
classil'rcations processes proposed inthis
framework.
The raw
data collection has
beenclassified
into
SMS ham
and
SMS
scam.
After
classification process
l,
the
classificationresult
shou'shigh
accuracy.
Afier
that,
the
second
classification process,classify
the
SMS
scaminto SMS
spam andphishing.
The
reason
dual
classifications
are
donebecause
this
is thefirst
study to proposed framervork in detecting SMS spam and phishing. Thus, to ensure goodresult
in
classification accuracy.this
dual classification has been proposed and the resultsrvill
be discussed intle
next section.IV.
Analysis
and
Findings
An
experiment has been doneto
examined 179of
SMS ham, spam and phishing class usingWEKA
a data mining tools to test the classified accuracy, truc positive,true
false
on Malay
spamand phishing
corpus usingNaive
Bayes and J48.The
reason these technique hasbeen
applied
to
testedthe
classifrcarion accuracy rate becausethese
techniquesis
one
of
the
well
known supervised method in machine leaming techniques.Table
II
is a classification result between4l
SMS ham and82 SMS
scam. Thetesult
shows Nai've Bayes andJ48
is
100%. TableIII
is a
classificationis
result for
Scam SMS that has been classified into4l
SMS phishing and4l
SMS
spamMalay
SMS. The result also showsNarve
Bayes and J48get
100%.
The
final
result
lbr
ternary classificationof
41 SMSham,41
SMS phishing and4l
SMS spam show 1007o accuracy.Fig. 2. Atter SMS Tokcnization Process
IIL3.
LemmatizationLemmatization
is a
process
to
group
the
same meaning wortls.Copyright Q 2014 Praise Worthy Prize S.r,t. - .4ll rights resen;ed
r252
[image:5.593.47.526.75.428.2]TAELE IV
TEnwent ClessrFtcATlo^- RESr"rlr Fon N'lrlav SMS Heu, Slera
ANDPr{rsHrNc
Peramcter l\eive Baves
Harn Phishinq Spam Ham Phishing Spam
True Positive
False Posirive
Correctly Classified
Iocon'ectiy Classified
0%
The higher
percentage(%)
of
co.rectly
classifiedparameter
is
better.Meanwhile the Tnre Positive
is
I showing that the datasets is classifiedcorectly
and false positive is 0 means that none of SMS are in wrong class.:
V.
Conclusion
SMS phishing
is still
arising. However, only.SMS
spam corpus available
on the
Internet., Basedon
the studieson
spamand phishing
in
inlbrmation
securityuea.
spam and phishing attack have different definition and understanding.Thus,
we
initiate
to
collect
SMSphishing
in
Malay languageas
altemative mitigation
to
overcome SMS spam and phishing in Malay language.Based
on
the
result,
\tr'epmve
thal this
corpusi successfully been classifiedinto
three classes such as SMS ham, spam and phishing usingdual
classification techniques.As
conclusion,by
applying classification techniquesusing
machinelearning
techniquealso can
detect andfilter
scam SMS.There are
many
techniquescan
be
appliedfor
this researchirea and
this technique is oneofthe
establishedtechnique
for
SMS
hltering
and
detection
sinceprocessing times
by
using Naive
Bayesand J48
only takes only 0 seconds to classity and get 100-o/o accuracy.Acknowledg€ments
The authors
would
like to
thankUniversili
Teknikal Malaysia Melaka(UTeM), Universiti Tun
Hussein OnnMalaysiaflJTHM)
and Minisrry
of
Higher
Educarion Malaysia for supporting this research.Copyright,g 201,1 Praise Wortlry Prize S.r.l. - .4ll rights resened
Cik Feresa Mohd
Foon'.
Rabiuh Ahmad. Faizal M. A.This
researchis
funded
bv
MOHE
Research Grant
100 ?;
0 9'6
TABLE II
B]NARY CLASSITICATION I RTSUTT FoR MALAY SMS HAM AND SCAM
Parameter Neil'c
Beyes
J48Ham Scam Ham
ScamTrue
Positive
I
IFalse
Positive
0
0Corectly
Classified
100 7olnconectlv
Classified t
Y"TABUENI
BiNARY CL{ss$IcATIoN 2 TIISULT FoR MALA,Y SMS SPAM
AND PHTSHING
Parrmcter
Phishint Spam
NaiveBaves
J48Phishinq Spam
Truc Positivc
False Positive
Correctly Classified
Inconectlv Classified
LRGS/20 I I iFTMK/TKO I / IROOOO2
References
S. S. Chandtrrn and S. Murugappan, "Spam detecdon and
eliminarion of messager; fiom twiner;" Intetnational Review an
Comlnters und Safnturz, vo]. I, pp. 2438-2443, 2013.
F-Sccure, "N{obilc Threat Repon Q3 20 | 2," F-Secure Labs20 | 2.
M. Boodac, "l\{obilc Uscrs Thrcc Timcs Morc Vulncrablc to
Plr.ishing Acacks," ln h'wleer vol. 1012, cd. !01 t.
I. Lookout, "Lookour Mobile Tlreat Repon August 201 I," 201 l.
l.
Joc and H. Shim, "An S1\lS Spam Filtcring Systcrn UsingSupport Vector Machine."
in
Future Generation hrlormationTet:hnoLtg:. vol, 6485. T.-h. Kim, er o/.. Eds., ed: Springel Berlin
Heidelberg. 201 0. pp. 577-584.
K. Dunbarn, "Chapter 6 - Phishing. SMishing. and Vishing," in
Mobile hfalnutr ,4tta<:ks cnd Dclinse. D. Ken. Ed., cd Boston:
Syngress. 2009. pp. 125-l 96.
A- Kwee, t"t a,/., "sentence-Level Novelty Detection ia Fnslish
and Mafay."
in
.irlvaaersin
Ktrotrlcdge l)isror,ery and Data,Vnrirg. vol. 5476. T. Theeramunkong. et
al.
Eds., cd: Springcr'Berlin Heidelberg. 2009, pp. 40-5 l.
Y. Hird af Segerstad. t/sc und Adupltttion
(t
Writltn Lunguuge t{ttht'
Conditionsof
Computer-ldediated Com*runication'.Univesity of Cothenburg, 200?.
tgl
T. Ogle, "Crealive Uses of lnlbrmation Extracted from SMSMessages." Undergraduate, Computer Science, The Universiry of
Sheft-ield.2005.
[|0]
.Cddnck Fainrn and S. Paumier. "A Translated Corpus of 30,00t)French SMS,'
ln
ln
Prorcetlings of Language Resourtes and Evaluatktn.,20Q6.il 11 M. Choudhury. et al. "lnvestigation and modeling of the stnrcturc
of terting l4nguage." Inr. J. Dot. .4nal. Re'tognit.. vol. 10, pp.
r 57-l ?4. 200?.
[2]
S. C. Herring and A. Zelenkauskaite, "Symbolic capitalin
avirtual heterosexual market abbreviation and insertion in Italian
iTV Slt'ls.' ,yrittett Communicarion. vol. 26, pp. 5-3 l, 2009.
[I3]
C. Bach and J. (iunnarsson. "Exlraction of trends is SMS tcxt,"l{ashtrs thtsis, Lund Uni}ersr,1. 2010.
I
l4]
D. Pietrini, "X'6:-(?": The sms and the triumph of infbrmality andludic w{ting," Italknisch, vol.46. pp. 92-101,2001.
Il5l
P. Schlobinski, rr a/., "Sintsen. Eine Pilotstudie zu sprirchhchenund kommunikativen Aspekten
in
der SMS-Komnrunikadon," i\it'trttttx 22. Online-Publikatiottttr zum Themu Sprache und Kommunikation im Internet, 2001.[16]
T.
Shortis, "'New Literacies' and Emerging Foms: Textlr'lessaging on Mobile Phones," presented at the International
Literacy and Research Network Cont'erence on Leaming., 2001.
[7] N.
Doring."l
bread. sausage,5
bagsof
apples LL.Y"-comrnunicative functions of text mcssages (SMS\," ZeitschiJtJiir
!{ edie np sy c h obgi e 3. 2(X)2..
!81
Y, Hard af Sergerstad, Ux: und .lda1ttuttun ol fii'itren l-anguageto the
Conelitionsal'
ContPuter'Mediated Communication:University of Gothenburg. 2002.
[l9]
E.-L. Kasesniemi and P. Rautiainen. "Mobile culrure of chilfuen and temgers in Finland"" ia Petpetml contact, ed'. CambridgeUniversity Prcss, 2002, pp. 170-192.
[20]
R.
Grintcranrl
M.
Eldridgc, "Wan2dk?: cvlYyday tcxt messaging," presentedat
the Proceedingsof
thc
SIGCHIConference
on
Human Factorsin
Conputing Systenls, Ft.Lauderdale, Florida. USA, 2003.
[21] C.
a..{.
B. Thurlow, "Gcncration Txfl Thc sociolinguistics ofyoung people's rext messagiug," Discoutse Analysis Online
lll),
Jr.. 2003_
[22] Yijue Horv and M.-Y. Kan, "Opdmizing Predictive Text Entry lbr
Short lt{cssagc Sc'nicc on Mobilc Phoncs," prosr-ntcd at thc In
Procecdings of HCII, 2005.
[23] Rich Ling md N. S^ Baron, "Text Messaging and lM: Linguistic
Comparison of .{merican Collegs Data," 2007.
under
LongScheme
ll
00
100%
0'
tll
t2l
t3l
t41
t5l
i6l
t?l llll
0000
100 o;ir
0%
I
l
I
I
I
I
t8l000000
100
%
1000.;0?o
r25l
[2a]
M.
Zic
Fuchs andN.
Tudman Vukovii, "Communicationtechnologies and theil influence on l:rngurge: Reshuffling tenses
in Croatirur SMS text messaging,r' Jezikoslovlje, pp. 109-122,
2008.
[25] D. Gibbon and
l!{.
Kul, "Economy Strategiesin
ResrictedCommunicadon Channcls.
A
studyof
Polish shon toxt messages," f,008.[26] A- Deumert and S. Oscar Masinyana. "Mobile language choices
The use
of
English and isiXhosain
text messages (SMS)Eviclcncc tiont a bilingual South Afiican sanrplc," English
World-Wide, vol. 29, pp. I 1?-147, 2008.
[27] I. Hutcl$y and V. Tanna, "Aspecrs of sequential organizadon in
text message cxchange." Diseourse & Cotnmunication, vol. 2, pp.
r43-r64,2008.
[28] J. Walkowska, "Gathering and Analysis of a Corpus of Polish
SMS Dialogucs," Challenging Pnthlems oJ Science. Computer
Science- Rtcent Advances in Intdligent l4formation.ilsreor.s. pp.
r45-r57.:009.
[29] C. Tagg, "A Corpus Linguistics Srudy of SMS Text Messaging."
Doctor of Philosophy" Department of English. The University of
Birmingham. Birmingham.
2009-[30] F. W. Elvis. "The sociolinguistics of] motrile phone sms usage in
cameroon and nigeria," nrc btternalional Journal of Languuge
Society and Culture. vol. 28. pp. 25-40,2009.
[31] S. N. Barasa, langroge, mobile phones antl internet: a studr
rt
Sl/S lertrirg, entail,
bi
ancl SNS chats in campuler netlialedcomnunication (Cl\{C) in Kenya, 2410.
[32]
A. B.
Bodomo. "Thc Gmmmarof
Mobile Phone Written[.anguage," Chaprer, vol. 7. pp. I l0-198,2010.
[3]l
W. Liu and T. Wang. "lndex-based online text classification lbrsnrs spam filtering." Journul rdCompurers, vol. 5. pp.844-851.
2010.
[34]
S.
Sotillo.'SMS
Texting Practicesand
CommunicaliveIntcntion," t)hapter, vol- I 6. pp. 252-265, 2010.
[35] C. Dirscheid and E. Srark. "SMS4science:
An
internstionalcorpus-based texting pmject and the specilic challenges for
multilingual Switzerland." Digitul Dis<'ourse: Languagr in the
Nc*- .Vcdia: Languuge in the Neu' i\'ledia. p. 299. 201l.
[36] K. V" Lerander. "Names U ma puce: multilingual texting in
Senegal," Working paper20l l.
[3?] J. Elizondo, "Not 2 Cryptic 2 DCode: Paralinguistic Restitution.
Delction. and Nonstandard Orthography in Text Messages,"
Ph-D. thqsis. Swanhmore College.20l
l
[3E] T. Chen and M.-Y. Kan, "Creating a live, public short mcssage
service corpus: the NUS SMS corpus." Lunguagc Rcxturc:<'s and
Eva luation. vol. 47. pp. 299-335. 20 I il06/0 I 20 | 3.
[39] O. Salem" er
al,
"Awareness Program andAI
based Tool toReduce Risk of Phishing Attacks," in Computer und Infbrmarion
TechnologS. (CID. 2010 IEEE IOth International Conference on,
2010. pp. l4l8-14?3.
[a0l Q. Xu, e, a/., "SMS Spam Detection using ContentJess Features."
hxelligent System-s,
/fEf.
vol. PP, pp. l-1,_2012.t4ll J- W.
Yoon.cl
a/..
'Hybrid spam filteringftrr
mobileeommunication." (lompaters &: Securit.r'. tol. ?9. pp.
446-459, t0lo.
[42] H. Peizhou. a a/., nA Novel Method for Filtering Group Sending
Short Message Spam," in Convergence und H1-hrid Inl-ormation
Tcc h no logr, 2 {n8. I C HIT'08. I n ttr n at i on a l Co n ft t"t' nca' on, 2008. pp. 60-65.
[43] G. V. Cormack, er a/., "Content bascd SMS spam filtering,"
presented at the Proceedings of the 2006 ACM symposium on
Document engineering, Amsterdam, The Netherlands, 2006.
[4a] Ir,f. Taufiq Nrnuzzaman, er a/., "Simplc SMS spam filtcring on
independent mobile phone," Spcuri{'
and
CommunicatiortNeru.nrrLr, vol. 5. pp. ll09-l:20,2012.
[45] G. V. Cormack, el
al.
"Fearure engineering tbr mobile (SMS)spanr filtoing," p'escntcd at thc Procccdiltgs of the 30th annual
intemationai
ACM
SIGIR conferenceon
Research anddevelopment
in
informarion retrierai, Anrsterdarn, Tlre Nerherlands,200?-[46J K. Yadav. e/
,/.,
"SN,lsAssassin: crou.Gourcilg drivenmobilc-based system l.or SMS spam tiltering," prcserrted
at
theProceedings
of
the
l.lth
Workshop on Mobile ComputingSystems and Applications, Phoenix, Arizona. 20 I I .
Copyright,g 2014 Praise Worthy Prize S.r.l. - .4ll rights resemed
Cik Felesa Mohd Foo4', RabiahAhmad. Faizal M. A.
[4?] Y. C. Llm, ct al., "Application of Genetic Algorithm in unit
selection for Malal' speech synthesis system," Expet ,S)rst€ms
with Applications, vol. 39, pp. 53?6--5383,2012.
[48] F. S. Tsai, er rrl., "Multilingual novelty detection." Eryert S]srerfls
with Applications vol. 38, pp. 652-658. 201
l-[49] T. Subramaniam, et a/., "Naivc Baycsian ,{nti-spam Filtcring
Technique for Malay Language."
[50] T, S. Guzella and W.
M.
Caminhas,"A
revierv of rnachineleaming approaches ro spam fihering," Lrpert Systcnts v:ilh
,4pplicatiorts. vol- 36, pp. I 0206- l 0222, 2009.
t51] M. Z. Rafique, el a/.,'Applicarion of evoludonary algorithrns rn
detecting SMS spam
at
access layer," presentedat
theProceedings
of
rhe l3th annual confbrence on Genetic andcvolutionary conrputalion, Dublin, Irclancl 201 l.
l52l M. Z. R. que and M. Farooq, "SMS Spam Detection By Opemting
On B1'te-Level Distributions Using Hidden Markov Models
{HMMS)." prcsented at the Virus Bulletin Contbrence September
20r0.10r0.
[53] C. Yan, er a/-, "SMS-Watchdog: Profiling Social Bchaviors of
SMS Users for Anomaly Detection
Recent Advaace-s in Intrusion Detection." vol- .57-58. E. Kirdu et al.,
Eds.. ed: Springer Berlin Heidelbery.2009. pp. 202-223.
[54] J. M. G. Hidalgo, et sl.. "Content based SMS spam filtering,"
presented at the Ptoceedings of the 2006 ACM symposium on
Dorcument engineering. Amsterdam. The Netherlands, 2006.
[55]
Y.
Xiang. c't aL, "Filiering nrobile spam by suppo{ \'ectotmachirie
"
presented at thc Conferencc on (lomputer Sciences,Softwarc Engineeiing lnformation Technology. E-Business md
Applications (3rd: 20M : Cairo, Eg1'pt). Cairo. Egypt. 2004.
{561 C- Jie er c/., "Spam Filter fbr Short Me*sages Using Wirtnow," in
Atlvaneetl Language Processirg
and
Web InJbrmationTechnolog;, 2008- ALPTT '08. Into nalional Cotlfbr<nce on, 2008,
pp.454-459.
:[57] K. Yadav. et a/.. "Take Control of Your SMSes: Designing an
Usable Spam
SMS
Filtering System,"in
lt{obile Dantlfianagement (MD!L{), 2012 IEEE ISth lnternatiilnal Conference
on,2012,pp- l5:-355.
[58] W. Ningning. ar a/., "Real-time monitoring and tiltering systenr
for mobile SMS," in la<Justial Electronics and appliutions,
20A8. rc1E.4 300tt. -lrd IEEE Conleren<r' on. 2008, pp.
l3l9-1 324.
[59] J. Huang, et
d.,
"A Bayesian Approach for Text Filter on 3GNetwork." in ll'ireless Communicatiotts Nr:nrurl'rng and Mobile Conl>uting {WiL'.Olrl),
}0lA
6& Internationill Conference on, 1010, pp, l -.5.[60] H. Najadat. er al, "Mobile SMS Spam Filtering based on l\{ixing
Classitiers."
[{rl]
T. M. Mahmoud andA.
M. Mahfouz, "SMS Spam FilteringTechnique Based
ol
Anificial
Immune Systern." IJCI9|Inlernational .Iour,tdl oJ Camputcr Scr'arca' /-rsl<'s, vctl. 9, 20 I 2.
[62] T. Charninda, et al.. "Clontent based hybrid srns spam filtcring
system." 20l4.
[6,1] N. Saxena and N. S. Chaudhari, "SecureSMS: A secure SMS
protocof for VAS and other applications," Journal ofSysleils and
Soliv.are. vol. 90. pp. 138-150.2014.
[64]
G. C. C. F.
Pereira,er
a/.. "SMSCT]?Io:A
lightrveightcryptographic tiarnework for secure SMS transmission," ,Journol
ry' S.1sr,nr.s ond Sr/irrzn'. vol. 86. pp. 698-706. 20 I 3.
[65] J. Choi and H. Kim,
"A
Novel Approach for SMS sccurity."Intenntional Journd oJ Security & Ils Applicalions, I'ol. 6,2012.
Authors'
information
Cik Fcrcse Mohd Foozl is cunendy working
with Universiti Tun Hussein Onn Malal'sia
(UTHIVI), Malal'sia. Feresa holds a l\{a$er's
degree
in
Computer SciencefinformadonSccuriry) fronr Universiri Tel:nologi Malaysia
Malaysia and a Bachelor's degree in Intbrmation
Tectmology and l\'lultimedia tlom Universiti
Tun Hussein Onn Malaysia
(UTln[,
Malaysia.She is crurently pwsuing her PhD at the Universiri Teknikal Malaysia
Melaka. Malaysia.
r254