TEXT DOCUMENT CLUSTERING USING ARTIFICIAL BEE COLONY WITH BISECTING K-MEANS ALGORITHM

International Journal of Advanced Research in Computer Science
Volume 9, No. 1, January-February 2018
RESEARCH PAPER
Available Online at www.ijarcs.info
ISSN No. 0976-5[…]
DOI: http://dx.doi.org/10.26483/ijarcs.v9i1.5359

Mrs. R.Janani
Ph.D Research Scholar
Dept. of Computer Science and Engineering
Bharathiar University
Coimbatore, India

Dr. S.Vijayarani
Assistant Professor
Dept. of Computer Science and Engineering
Bharathiar University
Coimbatore, India

Abstract: Recently, document clustering with optimization techniques has gained the attention of many researchers, especially those who are dealing with a huge volume of documents. The main goal of document clustering is to place the documents with similar content in one group and the documents with dissimilar contents in another group. Document clustering with an optimization algorithm achieves the global optimal solution. The main aim of this research work is to cluster the documents based on their content. In order to perform this task, this research proposes a new hybrid algorithm called Artificial Bee Colony with Bisecting K-Means (ABC-BK). The proposed algorithm was verified on the benchmark datasets in contrast to the widely used document clustering algorithms. Experimental results show that the proposed algorithm gives a better performance compared to the standard ABC clustering algorithm, K-means and the Bisecting K-means algorithm.

Keywords: Document Clustering, ABC Algorithm, K-means, Bisecting K-means, ABC-BK

I. INTRODUCTION

Document clustering is most commonly used to retrieve collections of documents with similar contents in the area of information retrieval, in order to enrich the document retrieval process. It is the procedure of dividing documents into groups, which are called clusters, so that the documents in one group have related contents [1]. In this research work, the most popular clustering algorithms are taken for experimentation.

Optimization algorithms may be either deterministic or stochastic in nature. The former techniques, when used to solve optimization problems, require massive computational effort and generally tend to fail as the problem size gradually increases [2]. This is the motivation for adopting bio-inspired stochastic optimization algorithms as computationally efficient replacements for deterministic methods [3]. Bio-inspired algorithms are used to develop global models to solve such problems. Among these, swarm algorithms imitate the flocking of birds and the schooling of fish. The Artificial Bee Colony (ABC) algorithm imitates the foraging behavior of honey bees.

This paper is organized as follows: section II reviews the literature and section III presents the methodology of this research work. Experimental results are given in section IV and section V describes the conclusion of this research work.

II. LITERATURE REVIEW

In [4] the authors analyzed the performance of the basic Artificial Bee Colony, Harmony Search, Bees algorithm and Improved Bees algorithm. These algorithms were compared on unimodal and multimodal benchmark problems. The performance of the […] algorithm is similar to the stated algorithm, and […] algorithm works effectively to find solutions for engineering problems with higher dimensionality.

In [5] the authors addressed clustering center optimization and improving the accuracy. The initial cluster centers of the customary bisecting K-means algorithm are chosen haphazardly. In their work, they proposed an enhanced bisecting K-means algorithm which depends on the automatically determined K value and the improvement of the cluster centers. The experimental results on the […] database demonstrated that the algorithm can successfully avoid noise points and outliers, and enhance the exactness of the clustering results.

In [6] the authors presented a novel transform[…] algorithm, called artificial bee colony (ABC), to enhance the capacity of k-means for finding globally optimal clusters in nonlinearly partitioned clustering problems. The proposed technique combines the k-means and ABC algorithms, called kA[…], and found the better cluster partitions. The simulation results demonstrate that the combination of ABC and the k-means method has greater capacity to search for global solutions and greater capacity for pa[…] neighborhood.

III. METHODOLOGY

The main objective of this research work is to retrieve the documents which have the higher similarity. In order to perform this clustering task, this research work proposes a new hybrid algorithm called Artificial Bee Colony with Bisecting K-Means (ABC-BK). Figure 1 shows the architecture of this research work.

R. Janani et al, International Journal of Advanced Research in Computer Science, 9 (1), Jan-Feb 2018, 619-623

A. Document Corpus


The collection of text documents is referred to as a document corpus. In this research work, the existing and proposed algorithms are verified with the following data sets: the Reuters dataset, the 20 newsgroup dataset and the BBC dataset. These document datasets have been widely used for evaluating the performance of document clustering and classification. They are collected from different sources and contain newspaper articles, newsgroup posts and news from the BBC news channel. A summary of the data sets used for experimentation is given in Table I.

Figure 1: Methodology

Table I: Summary of Dataset

Dataset   Source         Number of Documents   Number of classes   Number of Words
re0       Reuters        1504                  13                  11465
20news    20news group   18828                 20                  28553
BBC       BBC News       2225                  5                   39558

B. Preprocessing

Document preprocessing is an important step in the process of document classification, clustering, topic identification, etc. In information retrieval, preprocessing techniques are applied to the given dataset to extract the significant information from unstructured documents and to reduce the size of the dataset [7]. This process increases the efficiency of the document clustering system. In this research work, stemming, stop word removal, and number and punctuation removal techniques are used to extract the significant knowledge.
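The preprocessing steps named above can be illustrated with a small Python sketch. It is only an approximation under stated assumptions: the stop-word list is a tiny hypothetical sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer (the paper does not say which stemmer or stop-word list was used).

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much larger).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "with"}

# Suffixes removed by the naive stemmer below, tried in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop numbers and punctuation, remove stop words, then stem."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # removes digits and punctuation
    tokens = text.split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("The 3 bees are clustering 120 documents, quickly!"))
```

The resulting token lists would then be turned into numeric feature vectors (for example term frequencies) before clustering; that step is outside this sketch.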

C. Clustering Algorithms

Clustering algorithms are commonly used in the field of text mining and machine learning. In text mining, clustering algorithms are used to group documents based on the similarity of their content [1]. In this research work, the most popular clustering algorithms are taken for experimentation. They are the k-means, bisecting k-means and ABC clustering algorithms.

a. K-Means

The k-means clustering algorithm is a well-known and efficient partitioning algorithm which is used for large document collections. The main objective of this algorithm is to categorize the documents into a predefined number of clusters based on the document attributes or features. This algorithm has recently been combined with other optimization techniques to improve the clustering accuracy [2]. Assuming a fixed number of clusters, k-means is used to discover a partition of the documents based on a set of frequent features and a specified distance metric. This metric defines how the clusters are determined.

Algorithm: K-means

Input: A set of n documents D = {d1, d2, …, dn}, the predefined number of clusters k, and the model vectors Vi

Output: k clusters of documents

Repeat

Step 1: Assign each document dn to the nearest model vector Vi; cluster Ci contains the documents dn which are closest to Vi.

Step 2: Update the centroids according to $V_i = \frac{1}{|C_i|} \sum_{d_n \in C_i} d_n$

Until convergence is achieved.
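The assignment and update steps of k-means can be sketched in plain Python. This is a minimal sketch, not the authors' implementation: it assumes the documents are already numeric feature vectors, uses squared Euclidean distance, and draws the initial model vectors at random. The names `kmeans`, `max_iter` and `seed` are illustrative.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(docs, k, max_iter=100, seed=0):
    """Return (centroids, labels) for a list of numeric document vectors."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(docs, k)]
    labels = [-1] * len(docs)
    for _ in range(max_iter):
        # Step 1: assign each document to its nearest model vector.
        new_labels = [
            min(range(k), key=lambda i: squared_distance(d, centroids[i]))
            for d in docs
        ]
        if new_labels == labels:  # convergence: assignments stopped changing
            break
        labels = new_labels
        # Step 2: recompute each centroid as the mean of its members.
        for i in range(k):
            members = [d for d, lab in zip(docs, labels) if lab == i]
            if members:  # keep the old centroid if a cluster became empty
                centroids[i] = mean_vector(members)
    return centroids, labels


if __name__ == "__main__":
    docs = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
    print(kmeans(docs, k=2))
```

Because the result depends on the random initial model vectors, different seeds can give different partitions on harder data, which is the sensitivity the hybrid approach in this paper aims to reduce.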

b. Bisecting K-Means

The bisecting k-means algorithm is an enhanced form of the k-means clustering algorithm. The essential idea of bisecting k-means is, in order to obtain C clusters, to partition the set of all points into two clusters c1 and c2, then choose one of these clusters (c1 or c2) to split, and so on, until C clusters have been produced [8]. It is a powerful method to reduce the dimensionality of feature vectors for classification and clustering. There are various ways to choose which cluster to split [9]. In this research work, a criterion based on both the cluster size and the lowest sum of squared error is used.

Algorithm: Bisecting K-means

Input: C - Number of clusters, D - Set of documents.

Output: C clusters of documents

Step 1: Initialize the list of clusters (Ls(c)) to contain the cluster consisting of all points, and set n = 1, the number of iterations

Step 2: Repeat

Step 3: Choose the appropriate cluster from Ls(c)

Step 4: For n = 1 to C do

Step 5: Bisect the chosen cluster using basic k-means, Bisect(c)

Step 6: End for

Step 7: Select the two clusters with the lowest total sum of squared error (SSE)

Step 8: Add Bisect(c) -> Ls(c)

Step 9: Until Ls(c) contains C clusters
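The bisecting procedure can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes numeric document vectors, always splits the largest cluster (one possible reading of the size criterion), and keeps the trial bisection with the lowest total SSE. The parameter `trials` (number of trial bisections per split) and all function names are hypothetical.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]


def sse(points):
    """Sum of squared distances of the points to their own centroid."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points)


def two_means(points, rng, max_iter=50):
    """Basic k-means with k = 2; returns a pair of point lists."""
    centers = rng.sample(points, 2)
    halves = ([], [])
    for _ in range(max_iter):
        new_halves = ([], [])
        for p in points:
            nearer = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            new_halves[nearer].append(p)
        # Stop when the split is stable or one side became empty.
        if new_halves == halves or not all(new_halves):
            return new_halves
        halves = new_halves
        centers = [centroid(h) for h in halves]
    return halves


def bisecting_kmeans(docs, n_clusters, trials=5, seed=0):
    """Split clusters one at a time until n_clusters clusters exist."""
    rng = random.Random(seed)
    clusters = [list(docs)]
    while len(clusters) < n_clusters:
        target = max(clusters, key=len)  # choose the largest cluster to split
        if len(target) < 2:
            break  # nothing left to split
        best = None
        for _ in range(trials):
            left, right = two_means(target, rng)
            if not left or not right:
                continue  # degenerate bisection, ignore this trial
            total = sse(left) + sse(right)
            if best is None or total < best[0]:
                best = (total, left, right)
        if best is None:
            break  # e.g. all points in the target are identical
        clusters.remove(target)
        clusters.extend([best[1], best[2]])
    return clusters


if __name__ == "__main__":
    docs = [[0.0], [1.0], [10.0], [11.0], [100.0], [101.0]]
    print(bisecting_kmeans(docs, 3))
```

Other split criteria, such as choosing the cluster with the largest SSE, only require changing the line that selects `target`.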


c. Artificial Bee Colony (ABC)

The Artificial Bee Colony algorithm is a swarm intelligence algorithm proposed by Karaboga in 2005. It is inspired by the foraging behavior of honey bee colonies [10]. Several strategies are integrated into the algorithm itself to find solutions to optimization problems. The fitness of a solution is represented by the nectar amount of a food source [11]. The ABC algorithm uses three kinds of bees: employed bees, onlooker bees and scout bees. Employed bees exploit the food sources they have visited before and share quality information about these food sources with the onlooker bees. Onlooker bees use this information to decide which food source to exploit. Scout bees search the environment randomly in order to find new food sources. In this algorithm, half of the colony consists of employed bees and the other half consists of onlooker bees [12,13].

Algorithm: Artificial Bee Colony

Input: $l_n$, $u_n$ are the lower and upper bounds of the nth parameter

K is the randomly selected food source

$\phi$ is a random number within the range [-1,1]

N is the new food source on dimension d

Step 1: Initialize the food sources

$x_{m,n} = l_n + \text{rand}(0,1) \times (u_n - l_n)$

Step 2: Evaluate the fitness of food sources

// Employed Bees

Step 3: Each bee in this phase yields the new food sources.

$N_{m,d} = x_{m,d} + \phi \, (x_{m,d} - x_{K,d})$

Step 4: Calculate the fitness function using Step 2

Step 5: For selection process, greedy search method is applied

Step 6: Calculate the probability of the food sources.

$p_m = \frac{fit_m}{\sum_{j} fit_j}$

// Onlooker Bees

Step 7: Choose the food sources based on the probability values computed from the employed bees

Step 8: Produce new food sources

Step 9: Calculate the fitness function using Step 2

Step 10: For selection process, greedy search method is applied

// Scout Bees

Step 11: Replace the abandoned food sources with new ones.

Step 12: A new source is produced and its counter is reset to 0

Step 13: Select the best food source.

Step 14: These three bee phases are repeated until the best source is found.
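The three bee phases can be sketched as a small continuous optimizer. This is a generic ABC sketch under stated assumptions, not the authors' implementation: it minimizes a cost function, maps cost to fitness with the commonly used 1/(1 + cost) rule (the paper does not give its fitness formula), and uses a hypothetical abandonment threshold `limit`. All function and parameter names are illustrative.

```python
import random


def fitness(cost):
    """Common ABC mapping from cost to fitness (higher is better)."""
    return 1.0 / (1.0 + cost) if cost >= 0 else 1.0 + abs(cost)


def abc_minimize(cost_fn, lower, upper, n_sources=10, limit=20, cycles=200, seed=0):
    rng = random.Random(seed)
    dim = len(lower)

    def random_source():
        # Uniform random position between the lower and upper bounds.
        return [lower[n] + rng.random() * (upper[n] - lower[n]) for n in range(dim)]

    sources = [random_source() for _ in range(n_sources)]
    costs = [cost_fn(s) for s in sources]
    counters = [0] * n_sources
    best, best_cost = min(zip(sources, costs), key=lambda sc: sc[1])
    best = list(best)

    def try_neighbour(m):
        """Perturb one dimension of source m; keep the better one (greedy)."""
        k = rng.choice([j for j in range(n_sources) if j != m])
        d = rng.randrange(dim)
        phi = rng.uniform(-1.0, 1.0)
        candidate = list(sources[m])
        candidate[d] += phi * (sources[m][d] - sources[k][d])
        candidate[d] = min(max(candidate[d], lower[d]), upper[d])
        c = cost_fn(candidate)
        if c < costs[m]:  # lower cost means higher fitness
            sources[m], costs[m], counters[m] = candidate, c, 0
        else:
            counters[m] += 1

    for _ in range(cycles):
        for m in range(n_sources):  # employed bees
            try_neighbour(m)
        fits = [fitness(c) for c in costs]
        total = sum(fits)
        for _ in range(n_sources):  # onlooker bees pick sources by probability
            m = rng.choices(range(n_sources), weights=[f / total for f in fits])[0]
            try_neighbour(m)
        for m in range(n_sources):  # scout bees replace exhausted sources
            if counters[m] > limit:
                sources[m] = random_source()
                costs[m] = cost_fn(sources[m])
                counters[m] = 0
        m = min(range(n_sources), key=costs.__getitem__)
        if costs[m] < best_cost:  # remember the best source seen so far
            best, best_cost = list(sources[m]), costs[m]
    return best, best_cost


if __name__ == "__main__":
    sphere = lambda x: sum(v * v for v in x)
    print(abc_minimize(sphere, [-5.0, -5.0], [5.0, 5.0]))
```

For clustering, the cost function would be a clustering error such as the sum of squared distances, and each food source would hold a full set of centroids.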

d. ABC-BK

As stated in the above sections, the existing clustering algorithms may converge to a locally optimal solution [14]. These algorithms depend heavily on the initialization of the centroids and converge after a number of iterations. Hence, in this research work we propose a combined algorithm (ABC-BK) which uses the features of the bisecting k-means algorithm and the ABC algorithm to solve clustering problems [15]. The proposed algorithm is independent of the initial cluster centroids and leads toward the global optimal solution. In this proposed algorithm, a food source in the search space represents a set of cluster centroids, that is, a candidate solution for document clustering. The position $P_x$ of the xth food source is assembled as

$P_x = \{V_{x,1}, V_{x,2}, \ldots, V_{x,k}\}$ (1)

where k is the total number of clusters and $V_{x,y}$ is the yth cluster centroid of the xth food source. The proposed algorithm can be briefly explained as follows,

• Initialize the food source positions randomly and use the bisecting k-means algorithm to find the document clustering solution for every food source position.

• Calculate the fitness value of each cluster using the fitness formula.

• Examine new food sources and update the positions of the food sources in the employed bees phase.

• Apply the bisecting k-means algorithm and a greedy search to estimate the new fitness values. These values are compared with the original fitness values, and the best food sources are passed to the onlooker bees phase.

• Determine the probability estimates of the food sources and update their positions. Again, the bisecting k-means algorithm and a greedy search are applied to perform the document clustering.

• Then compute the new fitness values and update them.

• Check the counter of each food source and create a new food source in the search space.
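The food-source encoding of equation (1) and a fitness evaluation for it can be sketched as below. The paper does not state its fitness formula, so this sketch assumes a fitness of 1 / (1 + SSE), where SSE is the sum of squared distances from each document to its nearest centroid; the function names are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(docs, food_source):
    """Label each document with the index y of its nearest centroid."""
    return [
        min(range(len(food_source)), key=lambda y: sq_dist(d, food_source[y]))
        for d in docs
    ]


def clustering_sse(docs, food_source):
    """Sum of squared distances from documents to their nearest centroid."""
    labels = assign(docs, food_source)
    return sum(sq_dist(d, food_source[y]) for d, y in zip(docs, labels))


def fitness(docs, food_source):
    """Assumed fitness: higher when the centroids fit the documents better."""
    return 1.0 / (1.0 + clustering_sse(docs, food_source))


if __name__ == "__main__":
    docs = [[0, 0], [0, 2], [10, 0], [10, 2]]
    good = [[0, 1], [10, 1]]  # one food source: a list of k = 2 centroids
    bad = [[5, 0], [5, 2]]
    print(fitness(docs, good), fitness(docs, bad))
```

With this encoding, the employed and onlooker phases can perturb individual centroid coordinates of a food source and keep the change only when the fitness improves.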

Algorithm: ABC-BK

Step 1: Initialize the food sources; send employed bees to current food sources

Counter = 0;

Step 2: Calculate the fitness function

// Employed Bees Phase

Step 3: Apply the bisecting k-means and greedy search algorithm.

Step 4: Compute the new fitness function.

Step 5: Estimate the probability value of food sources

// Onlooker Bees Phase

Step 6: Select the food sources based on the probability value

Step 7: Again, apply the bisecting k-means and greedy search algorithm.

Step 8: Compute the new fitness function.

// Scout Bees Phase

Step 9: Check the limit of the parameter.

Step 10: Produce the new food sources

Step 11: Counter = counter + 1

IV. RESULTS AND DISCUSSION

In order to perform this clustering task, four performance measures are used in this research work: accuracy, precision, recall and f-measure [16, 17]. The experimental results establish that the proposed algorithm gives improved clustering results compared with the existing algorithms.

• Accuracy: A higher value of cluster accuracy indicates better cluster predictive discrimination.

• Precision: It is considered as the fraction of pairs properly situated in the same cluster.

$\text{Precision} = \frac{|R_1 \cap R_2|}{|R_2|}$ (2)

Where R1 is relevant documents and R2 is retrieved documents.

• Recall: It is computed as the part of actual pairs that were recognized.

$\text{Recall} = \frac{|R_1 \cap R_2|}{|R_1|}$ (3)

Where R1 is relevant documents and R2 is retrieved documents.

• F-measure: It is the harmonic mean of precision and recall. It is calculated as follows:

$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ (4)

Table II shows the recall values for all the four algorithms. From this, we infer that the proposed algorithm gives better results when compared to the other existing algorithms.

Table II: Recall for Three Datasets

Dataset   K-means   Bisecting k-means   ABC     ABC-BK
re0       30.09     34.12               31.91   36.18
20news    24.14     24.57               25.14   26.07
BBC       27.36     29.78               29.12   34.15

Figure 2: Recall

Table III shows the precision values for all the four algorithms. From this, we conclude that the proposed algorithm gives better results when compared to the other existing algorithms.

Table III: Precision for Three Datasets

Dataset   K-means   Bisecting k-means   ABC     ABC-BK
re0       47.15     49.78               49.10   54.18
20news    60.25     59.47               61.27   62.00
BBC       49.28     49.89               50.17   50.94

Figure 3: Precision

Table IV shows the f-measure values for all the four algorithms. From this, we conclude that the proposed algorithm gives better results when compared to the other existing algorithms.

Table IV: F-Measure for Three Datasets

Dataset   K-means   Bisecting k-means   ABC     ABC-BK
re0       36.73     40.48               38.68   43.38
20news    34.46     34.77               35.65   36.70
BBC       35.18     37.29               38.85   40.88

Figure 4: F-Measure

Table V shows the clustering accuracy for all the four algorithms. From this, we infer that the proposed algorithm gives better accuracy when compared to the other existing algorithms.

Table V: Accuracy for Three Datasets

Dataset   K-means   Bisecting k-means   ABC     ABC-BK
re0       69.21     70.18               71.25   80.9[…]
20news    67.14     70.98               76.68   89.4[…]
BBC       78.36     74.56               80.72   88.34
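The precision, recall and F-measure used in this section can be computed over document pairs, as sketched below. The pair-based reading is an assumption made for illustration: R1 is taken to be the set of document pairs that share a true class (relevant), and R2 the set of pairs placed in the same cluster (retrieved). The function names are hypothetical.

```python
from itertools import combinations


def same_group_pairs(labels):
    """Set of index pairs (i, j), i < j, whose items share a label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_scores(true_classes, cluster_labels):
    """Return (precision, recall, f_measure) over document pairs."""
    relevant = same_group_pairs(true_classes)     # R1
    retrieved = same_group_pairs(cluster_labels)  # R2
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    print(pairwise_scores([0, 0, 0, 1, 1], [0, 0, 1, 1, 1]))
```

As a quick consistency check of the harmonic mean, the K-means values reported for re0 (precision 47.15, recall 30.09) give 2 × 47.15 × 30.09 / (47.15 + 30.09) ≈ 36.74, in line with the reported F-measure of 36.73.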

Figure 5: Accuracy

V. CONCLUSION

Document clustering plays an important role in unsupervised document organization, topic extraction, topic identification and information retrieval. This research work has developed a hybrid algorithm to attain high clustering accuracy for handling unstructured documents. We explored the performance of the k-means, bisecting k-means, ABC and ABC-BK algorithms. Based on the performance factors, the proposed algorithm converges to better optimal solutions. The algorithm has been implemented and verified on several real datasets. Also, the ABC-BK algorithm has fewer control parameters to be altered with respect to the particular bees. In future, more effective algorithms are to be developed to handle huge volumes of text documents and to obtain proper clustering accuracy.

VI. REFERENCES

[1] Nicholas O. Andrews and Edward A. Fox, "Recent developments in document clustering," Technical report published by citeseer, pp. 1-25, Oct. 2007.

[2] Jia, R., & Song, J. (2016). K-means optimal clustering number determination method based on clustering center optimization. Microelectronics and Computer, 5.

[3] Ji, X., Han, Z., & Li, K., et al. (2016). Application of the improved K means clustering algorithm based on density in the division of distribution network. Journal of Shandong University (Engineering Science Edition), 4, 41-46.

[4] D. Karaboga. An idea based on honey bee swarm for numerical optimization. Technical Report TR06, Erciyes University, Engineering Faculty, Computer Engineering Department, 2005.

[5] B. Basturk and D. Karaboga. An artificial bee colony (abc) algorithm for numeric function optimization. In IEEE Swarm Intelligence Symposium 2006, Indianapolis, Indiana, USA, May 2006.

[6] D. Karaboga and B. Basturk. A powerful and efficient algorithm for numerical function optimization: Artificial bee colony (abc) algorithm. Journal of Global Optimization, 39(3):459-471, 2007.

[7] Berry, M. (Ed.). "Survey of Text Mining: Clustering, Classification, and Retrieval". Springer, New York (2003)

[8] Liu, G., Huang, T., & Chen, H. (2015). Improved bisecting K-means clustering algorithm. Computer Applications and Software, 2.

[9] D. Karaboga and B. Basturk. On the performance of artificial bee colony (abc) algorithm. Applied Soft Computing, 8(1):687-697, 2008.

[10] K. R. Zalik, "An efficient k-means clustering algorithm," Pattern Recognition Letters, vol. 29, pp. 1385-1391, July 2008.

[11] C. Zhang, D. Ouyang, and J. Ning, "An artificial bee colony approach for clustering," Expert Systems with Applications, vol. 37, pp. 4761-4767, July 2010.

[12] W. Zou, Y. Zhu, H. Chen, and X. Sui, "A clustering approach using cooperative artificial bee colony algorithm," Discrete Dynamics in Nature and Society, vol. 2010, pp. 16, October 2010.

[13] Pham, D.T. and Afify, A.A.: Clustering techniques and their applications in engineering. The Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science (2006)

[14] Nihal M. AbdelHamid, M.B. AbdelHalim, and M.W. Fakhr: Document clustering using Bees Algorithm. International Conference of Information Technology, IEEE, Indonesia (March 2013)

[15] Goldberg D.E.: Genetic Algorithms-in Search, Optimization and Machine Learning. Addison-Wesley Publishing Company Inc., London (1989)

[16] Rekha Baghel and Dr. Renu Dhir, "A Frequent Concepts Based Document Clustering Algorithm," International Journal of Computer Applications, vol. 4, No.5, pp. 0975 - 8887, Jul. 2010

[17] A. Huang, "Similarity measures for text document clustering," In Proc. of the Sixth New Zealand Computer Science Research Student Conference NZCSRSC, pp. 49-56, 2008.