• No results found

A framework for SMS spam and phishing detection in Malay Language: a case study

N/A
N/A
Protected

Academic year: 2020

Share "A framework for SMS spam and phishing detection in Malay Language: a case study"

Copied!
7
0
0

Loading.... (view fulltext now)

Full text

(1)

?rnal*

?dzo

,ffi,

rc,*Q

International

Review on Computers and Soliware (I.RE.CO.S.), Vol. 9,

l,l.

7

ISSN

1828-6003

Julv 2014

A Framework

for

SMS Spam and Phishing Detection

in Malay

Language:

a Case

Study

Cik

Feresa

Mohd

Foozy,

RabiahAhmad, Faizal M.

A.

Abstract

-

Short Message Service (SMS) spam

and

SMS phishing has been ina'eese nov,adavs

especially in

Malal'

language which

is

the Jirst langtage Jbr Malaysia country. Currently,

nnnl'

SMS spam

in

others language has heen proposed, however nol

yetJbr

Maluy- language and we are the Jirst to propose these.

In addition,

this paper also analysl on several Jramev'orks

of

SMS spam

Jiltering.[or onr

SMS spanr

and

phishing detection

.framevork. From

the

ana{uis,

the chosen

framev,ork

has been enhanced.fbr Malav- SMS spam

and

phishing. The enhancement has been done

on

classiJication phase *'here our.frameu'ork proposed

dual

clussification. The t'lassification

I

w,itl

classit'y

the

SMS

into

ham ancl

scam

SMS.

For

classification

2,

the

scam

SMS

will

be

c'lassified again

into

SMS spam

and

SMS phishing.

Aller

dual classifications phase completed, the

Malay

SMS Aas been examined

asing iVaii'e

Ba3,es

and

J48

unsupervised Machine Learning

tec'hniques. The result shows high aecurdcy

in

detecting

Malay

SMS ham, spam

and

phishing. Copyright @ 2014 Praise Worthy Prize S,r.I. -

AII

rights reserved.

K eyw o rds

:

D e t e c t i o n, F i I teri n g, P h i s h i n g, .Securi4t. SMS, Spazr

I.

Introdnction

SMS is one

of

the alternative communication services nowadays.

Since

the SMS

charges

are

in

reasonably priced per SMS

in

many countries. these gain interest to many users that

don't

have mobile data or Intemet to use

this

service

as

communication

tool

and

sending information such as advertising marketing, spread news and etc. There are

tools

and software

for

sale

to filter

SMS. However,

the SMS

spam and phishing attack is

still

increased. Spam and SMS phishing is

two

different types

of

attack. Spam message

will

contain advertising

[l]

and marketing. However,

for

phishing attack

the message

will

trick

users

by

announce user i5 a winner or get free

gift,

These phishing tactics is for user to response

the

message.

According

to

[2]

the

SMS

phishing nessages

can

charged

fees

to

mobile

device

ou'ner silently

if

the user response to the SMS. Moreover, from

the replied SMS phishing,

the

phishes

can

get

more

infbrmation about

the mobile

device such as

mobile device version. contact number and etc.

Boodae

[3]

identify,

mobile

device users

are

three times more

likely to

enter a'*'eb-based phishing attack than desli:top users and a security report

by

Lookout [4]

found three

out

oi

ten Lookout's

users interested to

clicking on an

unsafe

link

per

year

by

using

mobile

device. Recently, there are variety of mobile applications have been developed and have been dorn'nload on mobile device

by

many user. This

is

one

of

the reason phishes revolutionize their attack strategy by embedding malware

URL

links into

the SMS

fbr

user

to

clicks or reply

the SMS as accepting the rules and regulations to download or install the mobile applications.

Moreover,

if

user clicked

or

replied the trigger SMS, the malware

will

connect

with

the mobile device and the

mobile

device owner has potential

to

losing money and

infbrmation privacy.

Joe

et

al. [5]

found

the unu'anted

incoming

SMS

makes

the

respondent

feel the

SMS violate their personal privacy and Sfv{iShing usually u"ill influence the SMS recipient

to

enter

their

fraud contest, asking money transactions

into their

bank account" pay

thcir bills,

traud

advertising

and

etc.

Since

there are

many disadvantages of SMiShing attack. few numbers

of

studies on SMS spam filtering has been proposed.

Traditionally. phishes

will

send SMiShing attack

in

a

high

volume. Howevcr, nowadays phishes interested to send SMiShing attack

in

a small quantity to observe the security level

of

the mobile device, thus

it

is possible to detect the SMS phishing based on the SMS quantity and the distribution of sending message pattern.

In

addition. SMS language

or

abbreviation words ate frequently used when sending the SMS instead of proper language because

the limited text

length

fo

vrite

the message-

This

SMS compose l'eatures on mobile phone

will

make

user

compress

the

message

by

using abbreviation

words as

long

as

the

recipients read the message. However, too much abbreviation words in SMS

will

be detected as spam

by

some

of

the SMS

filtering

tools.

In

this

paper

we

proposed

a

I'ramework

to

detect

Malay SMS

spam

and

phishing

using

classification

techniques. Malay SMS corpus, have been collected

tiom

the

contributors such

as

web

site,

fi'iends,

family

and unknou'n respondents.

The

text

collection method

is

by

using n'anscription and our online form.

Cop-vright @ 2014 Praise Worrhy Prize S.r.l. - All rights reserved

(2)

This

paper

is

organized

as follows.

In

Section

2, Literature ret'iew on SMS phishing detection.

In

Section 3,

will

explained the pre-processing SMS

datasets,

features selection

development

and

the experiments to

identiff

the accuracy offeatures selection using data mining tools named

WEKA.

Section 4, shows the result and findings based on the experiment that has been done and

finally

is conclusion and fufure

wort

were given.

II.

Related

Work

Mobile device phishing adack has been categorized

by

Dunham

et

al.[6]

into

four

types such

as

Bluetooth phishing,

Short

Message Service

(SMS) phishing

and

Voice

over

IP

Phishing(vishing). Example

Bluetooth

phishing attack has been discuss

bV

[6].

the

Bluetooth phishing attack works

when

user connect

to

the Wi-Fi

hotspot and the attacker can steal the data when the user connects to the

Wi-Fi.

For vishing, this attack

will

attack

customers'

service

organization

to

get the

private information of customers

-Mobile web

application

phishing

is

similar

with

desktop web phishing attack. However, the solution

for

this

attack mostly

based

on

client-server method and

depends on the mobile architecture itself because mobile

device

has

few

limitation such as CPU

processing, battery

life

and memory size" thus the existing solution is

by

using

antivirus

or

anti-phishing

which is

purposely develop

for

lightweight mobile device. Thus, this paper, we

u.ill

focus on SMS spam and phishing attack solution that has been increased recentlv.

il.1.

SMS Corpus Ctillection Method Several methods can be used to collect text corpus. According to Table

l,

there is variety

of

text size that has been collected.

Most SMS

languages that has been collected are in English SMS and text

[7]

contributor are trom unknown, known contributor and participant.

For unknown SMS contributor are usually SMS taken lrom soltware and websites.

For

known are

from family

and friends. Few studies not mention the process of SMS collection clearly.

However, Hard

af

Segerstad

[8],

Ogle

[9],

Fairon and

Paumier

[0],

Choudhury

[ ll,

Herring

and Zelenkauskaite

[l2],

Bach and Gunnaruson [13] has state the method that has be used to collect SMS.

SMiShing attack has been discussed

by

[6]

and [39]-According to

K.

Dunham

[6].

SMiShing

is

a new tactic

to

spread malwarc

by

adding the

URL link in

SMS and inJluence the recipient to

click

on the URL.

In

addition

O.

Salem

et al.

[39],

ideutity

the

SMiShing

atnck

is on the rise atrack cun'ently on mobile device and

Xu et al. [40]

agreed

SMS

spamming

is

a

serious attack

for

SMS nowadays. J. W.

Yoon

et al. [41] and

H.

Peizhou et al.

[42]

studies on SMS content-based

filtering

and

Q.Xu

et al. [40] proposed SMS

filtering

on non content-based.

Copl'7if6t 'g 2014 Praise ltortlry Prize S.r.l. - .4ll rights resen'ed

Cik

Fercsa Mohd

Foon.

Rabiah Ahmad. Faizal M. A.

For

this

paper,

the Malay SMS

phishing and spam

will

be classified based on the generic features such as

Total words t431, t441,

[45]

and

[46],

Number of

Character bi-grams[43]

antl [45],

Number

of

Character n'i-grams[43] and

[45],

Average number

of

length [44] and [a6] and Average number of word [aa] and [46].

We also add additional f'eatules fbr this study such as

advertisement, contest, ut3enr SMS, ask money, asked to response SMS, telephone,

URL

and arurounce user get

free

gift

or win.

These

criteria

is

based

oo

spam and phishing attack.

Total

f'eatures applied

in

this

study are

14 fearures.

t1.2.

ilfala1t l-6s1g11age Detet'tion Framev:ork

Malay

language has been applied as case shrdy

for

several research areas.:However, there

is

still

lack

lbr

information security detection study especially

for

SMS spam and phishing study which is not yet been examined based on the literature review in SMS spam and phishing

study. The

example

of

Malay

detection research area such as speech derection[47], language detection[48],[7] and email spam detection [a9].

Moreover,

speech

detection

by

Lim

et

al.[47]

purposed

to

produce

high quality

synthetic speech

in

Malay

language.

In

addition,

for

language detection by Tsai et al. [48].

The,researchers aimed

to

reduce

the

English, Malay

and

Chinese

languages documents redundancy. The framework has three

different

preprocessing

which

suit

'

for

three

different

languages and

finally

will

apply

in

a

detection

process

such

as

Sentence

Segmentation.

Sentence Level and Novel Rate computation.

A.

Kwee

et

al.

[7].

proposed

English

and

Malay

language detection and the processes

for

Malay language detection consist

of

Language Translation. Stop Word Removal, Word Stemming and Detection.

In

email

spam detection

research

area

by

T. Subramaniam

et

al.

[49],

the

Malay

language has been

applied as case study using Bayesian

filtering

techniques. The results show

this

technique successfully classity spam and non-spam

of

Malay language email

with

96%

accuracy.

The

frameworks consist

of

Collection of

emails, Tokenization,

Features

reduction,

Features

selection, Training and Testing^

Thus,

in

this

paper,

Malay

language

has

been

examined

as

case

study

fbr

SMS

spam and phishing detection. The basic detecfion process has been applied

in

this

framework

and

the

enhancement

of

dual classification

.*ith two

additional

new

features such as

advertisement and contest feafures has been proposed. The framework and result

will

discussed further in the next section.

ILs.

SMS Fihering and Detection Framework Generally,

filtering

method consist

of

tokenization, lemmatization and stop word removal, representation and

classifierby

[50].

httenatktnal Reviev: on Comuulers and ktftu'are, rbL 9, N. 7

(3)

Cik Fercsa Mohd Foozy, Rabiah Ahmad, Faizal fuL A.

TABLE I

LJTERATURE ON TEXT MESSAGING CoRPUs

References Tevt Size

TertLensuaec

TcxtContributor Text Collcction l\{ethod

Pietrini

ll4l

Schlobinski et al. [151

Shortis l16l Doring [r7]

Hard ef Setgetstad

ll$l

Kasesnicmi and Rautiainen

flel

Grinter and Eldridge[201

Thurlorv end Brown

[2ll

Ogl"

Yljue How and Ken l22l Feiron end Paumler

ll0l

Choudhury [l

ll

Rich Ling [231

Rcttie (2007)

Zic Fuchs and Tudman

Vukorid l24l

Gibbon and Kul

l2sl

Deunert and Oscar

Masinyana [261

Hutchby erd Tannr l27l

Welkowske [281

Herring end

Zelenkauckrite

ll2l

Teggl2el

Ehis J30l

Barasa [3ll

Bech end Gunnarsson [31

Bodomo l32l

Liu and Weng

133l

Sotillo I34l

Ditrscheid and Stark [351

Lexender [361

Elizondo f3?l

Chcn end Ken l38l

Italian Germany English German Swedish Finnish English English Englisir Englislr French English Norwegian English Croation Folish

English, isiXhosa

English

Polish lralian

English English. French, etc. ,English. Kiswahili. etc. Swedish, English. etc.'

English. Chinese

Chinese

Elglish

Germany, French. etc French English

English and

Mandarin Chinese. 292 312 500 1500 207 100{) | 152 7800 4'7'l 5M 97

l0l l7

30000

lo00

867 278 6000

| 250 I700 1452

l062ri 600

2730

3l 52

853

85870

6629 2398?

l5-35 Years Old

StDdents

I Male Student liicnds and Family

200 partrcipans

I I 2 trom an anonymous webpage, 252

nrcssagcs foru'ardcd from voluntccrs

and ?88 fiorn family and friends

Teenagers ( l3-18 Years Old)

l0 Teenagers i15-16 Years Old)

135 Freshmen Nightclubs

?003 Respondents

166 Univexity Students

3.200 Contributos

Rendomly

3l contributors

University students,

family and iiiends

Uniuersity studenrs and

friends

22 young adulls

-30 young pmfessionals

(?0-35 years old)

2(X) contriburors

Audiences of an iTV

progrum

16 tbmily and friends

72 universitv students

and lecturers

84 univercity studcnts

and 37 young prottssionals

I I conlributors

87 youngsrcrs

Real volunteers

-59 participanrs

2.6f7 vtrluntcer:

I 5 young people

12 volunteers

University Students

Unknoun

Unknown Transcript Unknou'n

From webpage, rrolunteers,

family and liicnds

Transcript

Tmnscripr Trarrscript

Subscribe SMS Promotion

Of Nightclub

Transcript For*,ard

Search The SMS

From The Website

Transcript Unknown Unknown

Unknown

Trdnscript. Forward

Transcript

Fonpard, Softu'arc

Online SMS archives

Transcript Forward Fonvard Soliware Transcript Unknown Software Fors'nrd Transcript Transcript Unknown 496 357 71000

However, there are

multiple

frameworks

for

spam filtering and detection such as below:

t

Tools

Cormack et al.

[45]

on SMS

tiltering

using 5 fypes

of

spam

filter

tools

to

filter

the SMS. Before applied tools on SMS, there are pre-processing data that has been done

to

come out

with

four (4)

main

features.

Moreover, to detect

SMS protocol

in

real time

Rafique

et al.

[51]

applied Hidden Markov

Model

(MIm)

which

the

architecture

consists

of

sniffer.

feafi.rre

extraction, classifier and

mles

decision. Que and Farooq

[52]

also apply

MHH

on byte level distribution

of

SMS that have

Copyright 'O 201 4 Praise Worthj, Prize S.r.l. - .4ll rights resen ed

four

processes such as identify suitable representation

of

an

SMS,

build

spam models, classihcation

and determined the spam.

ln

addition, SMS-Watchdog SMS detection scheme

by Yan

et

al.

[53] has tlu'ee processes

of

monitoring, anomaly detection

and alefi

handling using SMS services.

.

Contenl-Based Filte,'ing

J.

W.

Yoon, et

al. [41]

proposed

hybrid

framewor* that implement content-based technique

with

challenge-response scheme.

The SMS

classified

into

ham, spam and uncertain, then the challenge response

will

classifu uncertain

into

ham and

spam

by

matching

the

sender

l25t)

(4)

Cik Feresa Mohd Foozy, Rabiah Ahmad, Faizal M. A.

response.

Gdmez

Hidalgo

et al.

[54] have

proposed content-based

SMS filtering

for

English

and

Spanish

SMS

spam

using

Bayesian

filtering that

consist

of

preprocessing, feature selection

and

learning.

t

Mac:hine Learning

Xiang et

al. [55]

proposed Support

Vector

Machine technique to

filter

the mobile spam.

Moreover,

Cai et al.

[56]

improved the spam

tilter

using traditional balanced

Winnow

algorithm

which

applied pre-processor, feature selection,

texts

representation

and winnow

algorithm

module.

An

independent

mobile

device

filtering

by

Taufiq Numzzaman et

al. [44]

applied several processes

in

their SMS independent sparn

filtering

such as data set

and running

environment, feature extraction,

vector

creation and

filtering

process

of

Naive

Bayes

or

SVM and update

filtering

system. Yadav

[46], [57]

had three process

in

the their SMS filtering

such

as

Bayesian

filtering

algorithm,

mobile

application

and synchronization service on server.

t

Vatching Pattern

Wu et al. [58] has proposed SMS

filtering flow

such as SMS screening. bayesian learning. keyword SMS and Pinyin Fuzzed keyword matching.

Moreover,

the

Chinese

SMS

filtering by

Jie

et

al, [59] has pre-processing, lbatures selection. modeling, and classifier.

In

addition. Najadat

et al. [60].

frameworks

involve

of

three

proccsses

of

data collection.

pre-processing,

text mining,

testing, evaluation metrics and implementation.

t

Artift'ial

I ntmttneSlstern

T.

M.

Mahmoud and

A.

M.

Mahfouz

[61]

applied

arrifrcial

immune system method

filter

SMS

spam that

contain

analysis engine, tokenize

word, stop

word,

dataset.

training

and

AIS

engine. Chaminda

et

al.

[62] proposed

a

hybrid solution

ot

neural network

and Bayesian

filtering

where the

SMS

filtering

process are sender identification module, spam folder, SMS content

extractor, tokenizer. Bayesian filter..

categorization,

training and inbox.

.

Ctyptograph)'

In

Cryptography area, Saxena[63] proposed

a

secure SMS protocol

for

SMS tr.an$mission and a cryptographic algorithm

in

the

SIM

card. The processos

nf

framework

are request to send SMS and authenticate sending

SMS-In

addition, Pereira

et

al.[64]

also

proposed

a

lightweight cryptography algorithm to mitigate the SMS

security

issues, protocols,

pror,iding

encryption, authentication and signature services.

In

addition. Choi

t65l

applied Common

Public

Kuy

Cryptography technique

for

SMS

communication efl'eciency which

containt

of

initialization

for

aulhenticate,

encrypt.

or decrypt and communication phuses

lbr

sending SMS.

As

summary,

*re

basic architecmre frameworks are

data

collection,

pre-processing, f'eatures

selection, training and testing.

Cop)'right

O

2014 Praise Worthl,Prize $.r.!. - 4!! rights resseryed

Which is mostly has been applied in SMS shrdtes. These explained the SMS spam

filtering

study already applied var-ious

filtering

and detecrion techniques

with

different SMS language except for Malay SMS language.

Additional,

one

of the

difl'erent befween frameu'orks is the technique applied but yet the result is

still

good.

Varieties

of

fiamework

presented

for

spam

filtering

and

detection.

However,

for

SMS

spam and phishing attack

not yet

available. Thus, the proposed SMS spam and phishing framework

will

do

some enhancement on the generic lramework of SMS spam

tiltering

by [50].

The

enhancement

framework

will

have

dual classification

tbr

SMS Malay language.

The reason to have dual classification is to identify the SMS

collection

has been classified

conectly. After

the

first

classification done, the second classification proeess

will

classity the scam SMS into spam and phishing. The framework

will

discuss further in next section.

UI.

Methodology

This

section,

explain about

Malay

SMS

spam and

phishing

detection

framework

development.

The

SMS

spam and detection

will

focus

on

features based on previous studies. As mention before, this study is the

lirst

to collect Malay SMS for detecting spam and phishing. Thus" there are

no

SMS spam and phishing datasets

available

in

Malay

Language. SMS spam and phishing datasets nced to be prepared for this srudy.

For

datasets preparation, a collection

of

Malay

SMS

has

been

done from

website, unknown

respondents. friends and fomily. The proposed framework is based on Guzella and Caminhas

[50]

which

have

four (4)

main steps

in

filtering

spam nressage such

as

tokenization, lemmatization, representation and classifier.

For this framework, four main steps

will

be applied. However. additional

classifier

will

be

added

in

this

flamework which

called

dual classifier

in

Malay

SMS spam and phishing detection tlamework. .

The reason we need dual classifier compare to a single classification process because

we

collect SMS ham and scam

SMS

from

website.

friend,

family

and unknown

respondents.

The

respondents

usually

have

basic

knowledge about

SMS

spam

and phishing and

some doesn't know anything about these attacks.

After

Malay SMS collection have done SMS harn

md

SMS

scam

will

be

tokenizing, lemmatizing and

stop word removal, representation and classification I

-Atter

get the result

liom

classitication

1,

second classification process

will

be proceed to classified again

SMS

scam

into

SMS

spam and phishing.

The

similar method has been applied

by

J.

W.

Yoon,

et

al.

[41]

to classiiy uncertain SMS into spam and ham class. Figure

I

and the process below listed the process of Malay SMS spam and phishing datasets and detection development:

i.

Collect SMS ham and

scam

SMS from

rvebsite,

friend,

family

and

respondents

and

do

first

classiltcation.

ii.

Tokenization.

Inkrnational Re'-ie*- on Computers and Sttflu'are, Vol- 9, N- 7

(5)

Cik Fet'esa Mohd Foo4,, Robiah Ahmad, Faizal M. A.

lnuoming

Ham and

Scam SMS

si*;

sEiir!

sMs

xl

Tokenization Lemmatizationand stop u'ord

removal

I

Clessilkr I Representation

(Features Selection)

Scam

sMs

[Gr*"ur

l2

-_Ir*

______u__.

S.mm or Phiql

Fig. l. SMS Spam and Phishing Detection in Malay Language Framework

il|.1.

Matav

SMS Corpus Collection Method

As preliminary study in collecting Malay SMS corpus, Malay SMS has been collected using methods in Table I.

The

SMS collection

methods

are from

website,

personal

SMS tbrwarding,

transcriptions

and

online

fbrm.

The

SMS

contributors

are from

respondent,

website,

family

and {'riends.

After

the SMS

collections are done,

all

SMS

are transcript

into

Microsofl Office

Excel 2007

for

tokenization; lemmatization

and representation process.

III.2.

Tokenization

Tokenization

is

a process

to

divide

the sentence into

word,

The

purpose tokenizations

have

been

done for

calculating

the word

for

f'eatures

selection

and

classificarion process.

Fig. 2

is

an example

of

the SMS tokenization. There are 179 SMS has besn tokenize and total word after-tokenization are 21694.

iii.

Lemmatization as remove redundancy and noise.

iv.

Representation Srrings into nominal datasets.

v-

Features

Selection-vi.

Second classification

to

Scam

SMS

into

spam and phishing.

vii.Examine the result

Malay

SMS datasets using Naive Bayes and J48 Technique.

However,

SMS

usually

will

contain

many

abbreviation

words.

It

is

difficult

to

group

similar meaning

for

variety

of

words

such

as

in

Malay

SMS abbreviation,

&e

word

Thank

You

can be typed as TQ,

thank

Q,

thanks

or

tengkiu. Thus,

for

this

study,

all words in ttris SMS

will

be calculated the occurrences and

will

be identified as different words.

The calculation

of

SMS word occurrences process are

done

by

using

JAVA

programming

to

identitied

the unique words

in

these

Malay

SMS collection. There are

80? words in SMS after lemmatization.

lII.4.

Fedtures Selec'tion

SMS

representation

in

this

paper

is

applying

the features based on the previous studies. The features arc Total words, Number

of

Character bi-grams, Number

of

Character

tri-grams. Average number

of

length,

and Average number of word.

For

this

study, additional features based on the spam

and

phishing

characteristics

also

included such

as

Advertisement

or

announcement,

Contest,

Malicious URL, Telephone Number, Winning or Free

gift.

SMS ask

help'to

get money and SMS ask to respond

or

subscribe sen'ices.

il1.5.

Class(icatiotr

fitere

are

two

classil'rcations processes proposed in

this

framework.

The raw

data collection has

been

classified

into

SMS ham

and

SMS

scam.

After

classification process

l,

the

classification

result

shou's

high

accuracy.

Afier

that,

the

second

classification process,

classify

the

SMS

scam

into SMS

spam and

phishing.

The

reason

dual

classifications

are

done

because

this

is the

first

study to proposed framervork in detecting SMS spam and phishing. Thus, to ensure good

result

in

classification accuracy.

this

dual classification has been proposed and the results

rvill

be discussed in

tle

next section.

IV.

Analysis

and

Findings

An

experiment has been done

to

examined 179

of

SMS ham, spam and phishing class using

WEKA

a data mining tools to test the classified accuracy, truc positive,

true

false

on Malay

spam

and phishing

corpus using

Naive

Bayes and J48.

The

reason these technique has

been

applied

to

tested

the

classifrcarion accuracy rate because

these

techniques

is

one

of

the

well

known supervised method in machine leaming techniques.

Table

II

is a classification result between

4l

SMS ham and

82 SMS

scam. The

tesult

shows Nai've Bayes and

J48

is

100%. Table

III

is a

classification

is

result for

Scam SMS that has been classified into

4l

SMS phishing and

4l

SMS

spam

Malay

SMS. The result also shows

Narve

Bayes and J48

get

100

%.

The

final

result

lbr

ternary classification

of

41 SMS

ham,41

SMS phishing and

4l

SMS spam show 1007o accuracy.

Fig. 2. Atter SMS Tokcnization Process

IIL3.

Lemmatization

Lemmatization

is a

process

to

group

the

same meaning wortls.

Copyright Q 2014 Praise Worthy Prize S.r,t. - .4ll rights resen;ed

r252

[image:5.593.47.526.75.428.2]
(6)

TAELE IV

TEnwent ClessrFtcATlo^- RESr"rlr Fon N'lrlav SMS Heu, Slera

ANDPr{rsHrNc

Peramcter l\eive Baves

Harn Phishinq Spam Ham Phishing Spam

True Positive

False Posirive

Correctly Classified

Iocon'ectiy Classified

0%

The higher

percentage

(%)

of

co.rectly

classified

parameter

is

better.

Meanwhile the Tnre Positive

is

I showing that the datasets is classified

corectly

and false positive is 0 means that none of SMS are in wrong class.

:

V.

Conclusion

SMS phishing

is still

arising. However, only.SMS

spam corpus available

on the

Internet., Based

on

the studies

on

spam

and phishing

in

inlbrmation

security

uea.

spam and phishing attack have different definition and understanding.

Thus,

we

initiate

to

collect

SMS

phishing

in

Malay language

as

altemative mitigation

to

overcome SMS spam and phishing in Malay language.

Based

on

the

result,

\tr'e

pmve

thal this

corpusi successfully been classified

into

three classes such as SMS ham, spam and phishing using

dual

classification techniques.

As

conclusion,

by

applying classification techniques

using

machine

learning

technique

also can

detect and

filter

scam SMS.

There are

many

techniques

can

be

applied

for

this research

irea and

this technique is one

ofthe

established

technique

for

SMS

hltering

and

detection

since

processing times

by

using Naive

Bayes

and J48

only takes only 0 seconds to classity and get 100-o/o accuracy.

Acknowledg€ments

The authors

would

like to

thank

Universili

Teknikal Malaysia Melaka

(UTeM), Universiti Tun

Hussein Onn

MalaysiaflJTHM)

and Minisrry

of

Higher

Educarion Malaysia for supporting this research.

Copyright,g 201,1 Praise Wortlry Prize S.r.l. - .4ll rights resened

Cik Feresa Mohd

Foon'.

Rabiuh Ahmad. Faizal M. A.

This

research

is

funded

bv

MOHE

Research Grant

100 ?;

0 9'6

TABLE II

B]NARY CLASSITICATION I RTSUTT FoR MALAY SMS HAM AND SCAM

Parameter Neil'c

Beyes

J48

Ham Scam Ham

Scam

True

Positive

I

I

False

Positive

0

0

Corectly

Classified

100 7o

lnconectlv

Classified t

Y"

TABUENI

BiNARY CL{ss$IcATIoN 2 TIISULT FoR MALA,Y SMS SPAM

AND PHTSHING

Parrmcter

Phishint Spam

Naive

Baves

J48

Phishinq Spam

Truc Positivc

False Positive

Correctly Classified

Inconectlv Classified

LRGS/20 I I iFTMK/TKO I / IROOOO2

References

S. S. Chandtrrn and S. Murugappan, "Spam detecdon and

eliminarion of messager; fiom twiner;" Intetnational Review an

Comlnters und Safnturz, vo]. I, pp. 2438-2443, 2013.

F-Sccure, "N{obilc Threat Repon Q3 20 | 2," F-Secure Labs20 | 2.

M. Boodac, "l\{obilc Uscrs Thrcc Timcs Morc Vulncrablc to

Plr.ishing Acacks," ln h'wleer vol. 1012, cd. !01 t.

I. Lookout, "Lookour Mobile Tlreat Repon August 201 I," 201 l.

l.

Joc and H. Shim, "An S1\lS Spam Filtcring Systcrn Using

Support Vector Machine."

in

Future Generation hrlormation

Tet:hnoLtg:. vol, 6485. T.-h. Kim, er o/.. Eds., ed: Springel Berlin

Heidelberg. 201 0. pp. 577-584.

K. Dunbarn, "Chapter 6 - Phishing. SMishing. and Vishing," in

Mobile hfalnutr ,4tta<:ks cnd Dclinse. D. Ken. Ed., cd Boston:

Syngress. 2009. pp. 125-l 96.

A- Kwee, t"t a,/., "sentence-Level Novelty Detection ia Fnslish

and Mafay."

in

.irlvaaers

in

Ktrotrlcdge l)isror,ery and Data

,Vnrirg. vol. 5476. T. Theeramunkong. et

al.

Eds., cd: Springcr'

Berlin Heidelberg. 2009, pp. 40-5 l.

Y. Hird af Segerstad. t/sc und Adupltttion

(t

Writltn Lunguuge t{t

tht'

Conditions

of

Computer-ldediated Com*runication'.

Univesity of Cothenburg, 200?.

tgl

T. Ogle, "Crealive Uses of lnlbrmation Extracted from SMS

Messages." Undergraduate, Computer Science, The Universiry of

Sheft-ield.2005.

[|0]

.Cddnck Fainrn and S. Paumier. "A Translated Corpus of 30,00t)

French SMS,'

ln

ln

Prorcetlings of Language Resourtes and Evaluatktn.,20Q6.

il 11 M. Choudhury. et al. "lnvestigation and modeling of the stnrcturc

of terting l4nguage." Inr. J. Dot. .4nal. Re'tognit.. vol. 10, pp.

r 57-l ?4. 200?.

[2]

S. C. Herring and A. Zelenkauskaite, "Symbolic capital

in

a

virtual heterosexual market abbreviation and insertion in Italian

iTV Slt'ls.' ,yrittett Communicarion. vol. 26, pp. 5-3 l, 2009.

[I3]

C. Bach and J. (iunnarsson. "Exlraction of trends is SMS tcxt,"

l{ashtrs thtsis, Lund Uni}ersr,1. 2010.

I

l4]

D. Pietrini, "X'6:-(?": The sms and the triumph of infbrmality and

ludic w{ting," Italknisch, vol.46. pp. 92-101,2001.

Il5l

P. Schlobinski, rr a/., "Sintsen. Eine Pilotstudie zu sprirchhchen

und kommunikativen Aspekten

in

der SMS-Komnrunikadon," i\it'trttttx 22. Online-Publikatiottttr zum Themu Sprache und Kommunikation im Internet, 2001.

[16]

T.

Shortis, "'New Literacies' and Emerging Foms: Text

lr'lessaging on Mobile Phones," presented at the International

Literacy and Research Network Cont'erence on Leaming., 2001.

[7] N.

Doring.

"l

bread. sausage,

5

bags

of

apples LL.Y"

-comrnunicative functions of text mcssages (SMS\," ZeitschiJtJiir

!{ edie np sy c h obgi e 3. 2(X)2..

!81

Y, Hard af Sergerstad, Ux: und .lda1ttuttun ol fii'itren l-anguage

to the

Conelitions

al'

ContPuter'Mediated Communication:

University of Gothenburg. 2002.

[l9]

E.-L. Kasesniemi and P. Rautiainen. "Mobile culrure of chilfuen and temgers in Finland"" ia Petpetml contact, ed'. Cambridge

University Prcss, 2002, pp. 170-192.

[20]

R.

Grintcr

anrl

M.

Eldridgc, "Wan2dk?: cvlYyday tcxt messaging," presented

at

the Proceedings

of

thc

SIGCHI

Conference

on

Human Factors

in

Conputing Systenls, Ft.

Lauderdale, Florida. USA, 2003.

[21] C.

a..{.

B. Thurlow, "Gcncration Txfl Thc sociolinguistics of

young people's rext messagiug," Discoutse Analysis Online

lll),

Jr.. 2003_

[22] Yijue Horv and M.-Y. Kan, "Opdmizing Predictive Text Entry lbr

Short lt{cssagc Sc'nicc on Mobilc Phoncs," prosr-ntcd at thc In

Procecdings of HCII, 2005.

[23] Rich Ling md N. S^ Baron, "Text Messaging and lM: Linguistic

Comparison of .{merican Collegs Data," 2007.

under

Long

Scheme

ll

00

100%

0'

tll

t2l

t3l

t41

t5l

i6l

t?l llll

0000

100 o;ir

0%

I

l

I

I

I

I

t8l

000000

100

%

1000.;

0?o

r25l

(7)

[2a]

M.

Zic

Fuchs and

N.

Tudman Vukovii, "Communication

technologies and theil influence on l:rngurge: Reshuffling tenses

in Croatirur SMS text messaging,r' Jezikoslovlje, pp. 109-122,

2008.

[25] D. Gibbon and

l!{.

Kul, "Economy Strategies

in

Resricted

Communicadon Channcls.

A

study

of

Polish shon toxt messages," f,008.

[26] A- Deumert and S. Oscar Masinyana. "Mobile language choices

The use

of

English and isiXhosa

in

text messages (SMS)

Eviclcncc tiont a bilingual South Afiican sanrplc," English

World-Wide, vol. 29, pp. I 1?-147, 2008.

[27] I. Hutcl$y and V. Tanna, "Aspecrs of sequential organizadon in

text message cxchange." Diseourse & Cotnmunication, vol. 2, pp.

r43-r64,2008.

[28] J. Walkowska, "Gathering and Analysis of a Corpus of Polish

SMS Dialogucs," Challenging Pnthlems oJ Science. Computer

Science- Rtcent Advances in Intdligent l4formation.ilsreor.s. pp.

r45-r57.:009.

[29] C. Tagg, "A Corpus Linguistics Srudy of SMS Text Messaging."

Doctor of Philosophy" Department of English. The University of

Birmingham. Birmingham.

2009-[30] F. W. Elvis. "The sociolinguistics of] motrile phone sms usage in

cameroon and nigeria," nrc btternalional Journal of Languuge

Society and Culture. vol. 28. pp. 25-40,2009.

[31] S. N. Barasa, langroge, mobile phones antl internet: a studr

rt

Sl/S lertrirg, entail,

bi

ancl SNS chats in campuler netlialed

comnunication (Cl\{C) in Kenya, 2410.

[32]

A. B.

Bodomo. "Thc Gmmmar

of

Mobile Phone Written

[.anguage," Chaprer, vol. 7. pp. I l0-198,2010.

[3]l

W. Liu and T. Wang. "lndex-based online text classification lbr

snrs spam filtering." Journul rdCompurers, vol. 5. pp.844-851.

2010.

[34]

S.

Sotillo.

'SMS

Texting Practices

and

Communicalive

Intcntion," t)hapter, vol- I 6. pp. 252-265, 2010.

[35] C. Dirscheid and E. Srark. "SMS4science:

An

internstional

corpus-based texting pmject and the specilic challenges for

multilingual Switzerland." Digitul Dis<'ourse: Languagr in the

Nc*- .Vcdia: Languuge in the Neu' i\'ledia. p. 299. 201l.

[36] K. V" Lerander. "Names U ma puce: multilingual texting in

Senegal," Working paper20l l.

[3?] J. Elizondo, "Not 2 Cryptic 2 DCode: Paralinguistic Restitution.

Delction. and Nonstandard Orthography in Text Messages,"

Ph-D. thqsis. Swanhmore College.20l

l

[3E] T. Chen and M.-Y. Kan, "Creating a live, public short mcssage

service corpus: the NUS SMS corpus." Lunguagc Rcxturc:<'s and

Eva luation. vol. 47. pp. 299-335. 20 I il06/0 I 20 | 3.

[39] O. Salem" er

al,

"Awareness Program and

AI

based Tool to

Reduce Risk of Phishing Attacks," in Computer und Infbrmarion

TechnologS. (CID. 2010 IEEE IOth International Conference on,

2010. pp. l4l8-14?3.

[a0l Q. Xu, e, a/., "SMS Spam Detection using ContentJess Features."

hxelligent System-s,

/fEf.

vol. PP, pp. l-1,_2012.

t4ll J- W.

Yoon.

cl

a/..

'Hybrid spam filtering

ftrr

mobile

eommunication." (lompaters &amp: Securit.r'. tol. ?9. pp.

446-459, t0lo.

[42] H. Peizhou. a a/., nA Novel Method for Filtering Group Sending

Short Message Spam," in Convergence und H1-hrid Inl-ormation

Tcc h no logr, 2 {n8. I C HIT'08. I n ttr n at i on a l Co n ft t"t' nca' on, 2008. pp. 60-65.

[43] G. V. Cormack, er a/., "Content bascd SMS spam filtering,"

presented at the Proceedings of the 2006 ACM symposium on

Document engineering, Amsterdam, The Netherlands, 2006.

[4a] Ir,f. Taufiq Nrnuzzaman, er a/., "Simplc SMS spam filtcring on

independent mobile phone," Spcuri{'

and

Communicatiort

Neru.nrrLr, vol. 5. pp. ll09-l:20,2012.

[45] G. V. Cormack, el

al.

"Fearure engineering tbr mobile (SMS)

spanr filtoing," p'escntcd at thc Procccdiltgs of the 30th annual

intemationai

ACM

SIGIR conference

on

Research and

development

in

informarion retrierai, Anrsterdarn, Tlre Nerherlands,

200?-[46J K. Yadav. e/

,/.,

"SN,lsAssassin: crou.Gourcilg driven

mobilc-based system l.or SMS spam tiltering," prcserrted

at

the

Proceedings

of

the

l.lth

Workshop on Mobile Computing

Systems and Applications, Phoenix, Arizona. 20 I I .

Copyright,g 2014 Praise Worthy Prize S.r.l. - .4ll rights resemed

Cik Felesa Mohd Foo4', RabiahAhmad. Faizal M. A.

[4?] Y. C. Llm, ct al., "Application of Genetic Algorithm in unit

selection for Malal' speech synthesis system," Expet ,S)rst€ms

with Applications, vol. 39, pp. 53?6--5383,2012.

[48] F. S. Tsai, er rrl., "Multilingual novelty detection." Eryert S]srerfls

with Applications vol. 38, pp. 652-658. 201

l-[49] T. Subramaniam, et a/., "Naivc Baycsian ,{nti-spam Filtcring

Technique for Malay Language."

[50] T, S. Guzella and W.

M.

Caminhas,

"A

revierv of rnachine

leaming approaches ro spam fihering," Lrpert Systcnts v:ilh

,4pplicatiorts. vol- 36, pp. I 0206- l 0222, 2009.

t51] M. Z. Rafique, el a/.,'Applicarion of evoludonary algorithrns rn

detecting SMS spam

at

access layer," presented

at

the

Proceedings

of

rhe l3th annual confbrence on Genetic and

cvolutionary conrputalion, Dublin, Irclancl 201 l.

l52l M. Z. R. que and M. Farooq, "SMS Spam Detection By Opemting

On B1'te-Level Distributions Using Hidden Markov Models

{HMMS)." prcsented at the Virus Bulletin Contbrence September

20r0.10r0.

[53] C. Yan, er a/-, "SMS-Watchdog: Profiling Social Bchaviors of

SMS Users for Anomaly Detection

Recent Advaace-s in Intrusion Detection." vol- .57-58. E. Kirdu et al.,

Eds.. ed: Springer Berlin Heidelbery.2009. pp. 202-223.

[54] J. M. G. Hidalgo, et sl.. "Content based SMS spam filtering,"

presented at the Ptoceedings of the 2006 ACM symposium on

Dorcument engineering. Amsterdam. The Netherlands, 2006.

[55]

Y.

Xiang. c't aL, "Filiering nrobile spam by suppo{ \'ectot

machirie

"

presented at thc Conferencc on (lomputer Sciences,

Softwarc Engineeiing lnformation Technology. E-Business md

Applications (3rd: 20M : Cairo, Eg1'pt). Cairo. Egypt. 2004.

{561 C- Jie er c/., "Spam Filter fbr Short Me*sages Using Wirtnow," in

Atlvaneetl Language Processirg

and

Web InJbrmation

Technolog;, 2008- ALPTT '08. Into nalional Cotlfbr<nce on, 2008,

pp.454-459.

:

[57] K. Yadav. et a/.. "Take Control of Your SMSes: Designing an

Usable Spam

SMS

Filtering System,"

in

lt{obile Dant

lfianagement (MD!L{), 2012 IEEE ISth lnternatiilnal Conference

on,2012,pp- l5:-355.

[58] W. Ningning. ar a/., "Real-time monitoring and tiltering systenr

for mobile SMS," in la<Justial Electronics and appliutions,

20A8. rc1E.4 300tt. -lrd IEEE Conleren<r' on. 2008, pp.

l3l9-1 324.

[59] J. Huang, et

d.,

"A Bayesian Approach for Text Filter on 3G

Network." in ll'ireless Communicatiotts Nr:nrurl'rng and Mobile Conl>uting {WiL'.Olrl),

}0lA

6& Internationill Conference on, 1010, pp, l -.5.

[60] H. Najadat. er al, "Mobile SMS Spam Filtering based on l\{ixing

Classitiers."

[{rl]

T. M. Mahmoud and

A.

M. Mahfouz, "SMS Spam Filtering

Technique Based

ol

Anificial

Immune Systern." IJCI9|

Inlernational .Iour,tdl oJ Camputcr Scr'arca' /-rsl<'s, vctl. 9, 20 I 2.

[62] T. Charninda, et al.. "Clontent based hybrid srns spam filtcring

system." 20l4.

[6,1] N. Saxena and N. S. Chaudhari, "SecureSMS: A secure SMS

protocof for VAS and other applications," Journal ofSysleils and

Soliv.are. vol. 90. pp. 138-150.2014.

[64]

G. C. C. F.

Pereira,

er

a/.. "SMSCT]?Io:

A

lightrveight

cryptographic tiarnework for secure SMS transmission," ,Journol

ry' S.1sr,nr.s ond Sr/irrzn'. vol. 86. pp. 698-706. 20 I 3.

[65] J. Choi and H. Kim,

"A

Novel Approach for SMS sccurity."

Intenntional Journd oJ Security & Ils Applicalions, I'ol. 6,2012.

Authors'

information

Cik Fcrcse Mohd Foozl is cunendy working

with Universiti Tun Hussein Onn Malal'sia

(UTHIVI), Malal'sia. Feresa holds a l\{a$er's

degree

in

Computer Sciencefinformadon

Sccuriry) fronr Universiri Tel:nologi Malaysia

Malaysia and a Bachelor's degree in Intbrmation

Tectmology and l\'lultimedia tlom Universiti

Tun Hussein Onn Malaysia

(UTln[,

Malaysia.

She is crurently pwsuing her PhD at the Universiri Teknikal Malaysia

Melaka. Malaysia.

r254

Figure

Fig. l. SMS Spam and Phishing Detection in Malay LanguageFramework

References

Related documents

From analog communication of 1 st generation today we talk about digital communication in 4 th generation, from the generation of single transmitting antenna and single

Using a WebCT classroom shell for all online courses will give each course a consistent look and feel, which will make the DVC distance learning program look more unified

Jakob Roland Munch and Jan Rose Skaksen, 2002, Product Market Integration and Wages in Unionized Countries, Scandinavian Journal of Economics, 104, 289-2991. Mette Yde Skaksen and

Additional device such as NT-1 if using Quad BRI, CSU if using PRI, or encryption equipment if using V.35/RS-449/RS-530 Network interface module External power supply (If

As the coming chapters will reveal, every sensory encounter provides us a different way to engage with this artistry: Our interaction with music, whether listening or singing

A summary of the adverse events reported in ≥1% of NEBILET patients or placebo patients in therapeutic dose (5 mg) hypertension clinical trials, are provided in Table 1... Table 1:

Social psychologists have developed a theoretical framework for concealable stigmatized identities (CSIs) that can help characterize the experiences of those with stigmatized

More specifically, it focuses on the creation of the European University Institute (EUI) and studies the extent to which European institutions succeeded in orientating