D
© Abstr
dealin and t solut propo the b gives Keyw D the c of in proc grou in on work expe Opti stoch solve comp beca the optim repla algor the p the r Artif searc This revie of th secti resea I basic algor DOI: http://dx.
© 2015-19, IJARC
TEXT D
Dept. of C
ract: Recently, ng with a huge the documents ion. The main a oses a new hyb benchmark data s a better perfor
words: Documen
Document clu collection of d nformation re ess. It is the p up, which is c
ne group have k, the most p erimentation. mization alg hastic in esse e optimizat putational re ause the proble
inspiration mization alg acements to rithms are us problems. In rush of feath ficial Bee C ching conduct paper is org ew of literatur his research ion IV and se arch work.
I
n [4] the auth c Artificial
rithms and Im
.doi.org/10.264
Intern
CS All Rights Re
DOCUME
Mrs. R Ph.D Rese Computer Scie Bharathiar Coimba document clus e volume of doc with dissimila aim of this rese brid algorithm c aset in contrast rmance comparnt Clustering, A
I.INTRO
ustering is mo documents wit etrieval to en procedure of alled clusters e the related opular cluster
orithms may entiality. For tion issues esolves, whic
em size will g for retainin gorithms as
deterministic sed to develop
this, the swar hered creature
olony (ABC) t of honey bee ganized as fol re and section
work. Experi ection V desc
II.LITERATU
hors were anal Bee Colony mproved Bees 483/ijarcs.v9i1
Vol
national Jo
A
eservedENT CLU
BISEC
R.Janani arch Scholar ence and Engi r University atore,Indiastering with op cuments. The m ar contents in a earch work is to called Artificia
to the widely ed to the standa
ABC Algorithm
ODUCTION
ost probably u th similar con nrich the doc dividing docu , in order that contents [1]. ring algorithm
be both de mer techniqu requires ch generally gradually incr ng bio insp computatio c method [3]
p the global m rm algorithm es and school ) calculation es.
llows, section III presents th imental resul cribes the con
UREREVIE
lyzed the perf y, Harmony
algorithm. Th
1.5359
lume 9, No
ournal of A
RE
Available O
STERING
CTING K
ineering ptimization tech main goal of do another group. o cluster the do al Bee Colony wused document ard ABC cluste
m, K-means, Bis
used to retrie ntents in the ar cument retriev
uments into o t the documen
In this resear ms are taken f
eterministic a ues are used
the massi tend to fl rease [2]. This ired stochas onally efficie
]. Bio inspir models to sol ms are imagini l of fishes. T imagining t
n II explains t he methodolo ts are given nclusion of th
W
formances of t Search, Be hese algorithm
o. 1, Januar
Advanced R
ESEARCH P
Online at w
G USING
K-MEANS
hniques has gai ocument cluster Document clus cuments based with Bisecting t clustering alg ring algorithm, secting K-mean eve rea val one nts rch for and to ive lop s is stic ent red lve ing The the the ogy in his the ees ms w m al al en In an fo ha en th of da av th al ca no is fo ab m ar ne th pe ne B ar
ry-Februar
Research in
PAPER
www.ijarcs.
G ARTIFI
S ALGOR
Deptined the attenti ring is to place stering with op
on their conten K-Means (ABC gorithms. Exper
K-means and t
ns, ABC-BK
were compared multimodal ben lgorithm is s lgorithm is eff ngineering pro n [5] they have nd improving ocuses of cu aphazardly c nhanced bisec he consequentl f the group fo atabase demon void the comm he exactness o
In [6] the lgorithm, calle apacity of k-onlinear partit the stage of ound the bette bout demonst method has g rrangements eighborhood.
The main o he documents erform this cl ew hybrid al isecting K-M rchitecture of
ry 2018
n Compute
.info
ICIAL BE
RITHM
Dr. S Assist . of Computer Bharath Coimon of many res the documents ptimization algo nt. In order to p C-BK). The pro rimental results the Bisecting K
d and given nchmark prob similar to th ffectively work
oblems with h e presented th g the accurac ustomary bise hosen. In th cting K-mean
ly conclusive ocus. The exa nstrated that t motion focuse f clustering co ey have disp
ed artificial be -means for fi tioned cluster k-means and er group part trating that th greater capac
and grea
III.MET
objective of th which have t lustering task, lgorithm calle Means (ABC this research w
er Science
ISSN
EE COLO
S.Vijayarani
tant Professor r Science and hiar University mbatore,India
searchers, espe s with similar c orithm achieve perform this tas oposed algorith s show that the K-means algorith
the solutions blems. The per he stated alg
king to find t higher dimensi he clustering c cy. In this th ecting K-mea heir work, t ns algorithm w
K esteem and amination com
the calculatio es and anoma omes about. played a nov
ee colony (AB inding global
issues. The p ABC algorith titions. The r he mix of A city to scan ater capacity
THODOLOG
his research w the higher sim
, this research ed Artificial C-BK). Figur
work.
619 N No. 0976‐5
ONY WIT
Engineering y
ecially those wh content in one g es the global op k, this research hm was verified e proposed algo
hm.
on unimoda rformance of gorithm and the solution fo
ionality. center optimiz he initialclust ans algorithm they propose which depend d the improve mes about on on can success alies, and enha
vel transform BC), to enhanc l ideal cluste proposed techn
hms, called kA reenactment c ABC and k-m
for global y for pa
GY
work is to ret milarity. In ord h work propo Bee Colony re 1 shows
R.Jananiet al, International Journal of Advanced Research in Computer Science, 9 (1), Jan-Feb 2018,619-623
© 2015-19, IJARCS All Rights Reserved 620
A. Document Corpus
[image:2.595.41.276.213.426.2]The collection of text documents is denoted as a document corpus. In this research work, the existing and proposed algorithms are verified with the following data set. They are Reuter’s dataset, 20 newsgroup dataset and BBC dataset. These document datasets have been mostly used for evaluating the performance of document clustering and classification. These datasets are collected from different sources which contains newspaper articles, newsgroup posts and the remaining being academic news from the BBC news channel. The comprehensive summary of the data set used for experimentation is given in Table I.
Figure 1: Methodology
Table I: Summary of Dataset
Dataset Source Number of Documents
Number of classes
Number of Words
re0 Reuters 1504 13 11465
20news 20news group
18828 20 28553
BBC BBC
News
2225 5 39558
B. Preprocessing
Document preprocessing is an important step in the process of document classification, clustering, topic identification, etc., Ininformationretrievalthepreprocessing techniques are applied to the precise dataset to extract the significant information from unstructured documents and to moderate the size of the dataset [7]. This process will increase the proficiency of the document clustering system. In this research work, stemming, stop word removal, numbers and punctuation removal techniques are used to extract the significant knowledge.
C. Clustering Algorithms
Clustering algorithms are commonly used in the field of text mining and machine learning. In text mining, the clustering algorithms are used to group the documents based on their content similarity of documents [1]. In this research work, the most popular clustering algorithms are taken for
experimentation. They are k-means, BisectingK-means and ABC clustering algorithms.
a. K-Means
The k-means clustering algorithm is well known and competent partition algorithm which is used for large document collections. The main objective of this algorithm is to categorize the documents in a predefined number of clusters based on the document attributed or its features. This algorithm has been improved with other optimization techniques recently to improve the clustering accuracy [2]. Assume a fixed number of clusters; k-means are used to discover the partition of the documents established in a set of frequent features specified by the distance metrics. These metrics are used to define how the clusters are determined.
Algorithm: K-means
Input: A documents of n elements D = {d1, d2, …,dn} and k
the number of predefined clusters, Vi the model vector
Output: k clusters of document Repeat
Step 1: Assign each dnto the nearest model vector Vi, cluster
k contains documents dnwhich are closest to Vi.
Step 2: Amend the new centroids rendering to, Vi=| |∑
Unless convergence is achieved.
b. Bisecting K-Means
The bisecting k-means algorithm is an enhanced form of the k-means clustering algorithm. The essential thought of bisecting k-means is to achieve the quantity of C clusters, partitioned the arrangement of all focuses into two groups c1, c2 and choose one of these clusters (c1or c2) to part and so on, preceding C groups has been made [8].It is the powerful method to diminish the dimensionality of featured vector for classification and clustering. To choose which cluster to split, there are various ways are available [9]. In this research work, the criterion established on both size and the lowest sum of squared error method used.
Algorithm: Bisecting K-means
Input: C - Number of Clusters, D – Number of Documents. Output: C Clusters of documents
Step 1: Initialize the list of clusters (Ls(c)) to contain the
cluster consisting of all points and n=1 the number of iterations
Step 2: Repeat
Step 3: Choose the appropriate cluster from the Ls(c)
Step 4: For n= 1 to Cdo
Step 5: Bisect the number of clusters using basic k-means Bisect(c)
Step 6: End for
Step 7: Select the two clusters from Ls(c) with the lowest
total sum of the squared error (SSE) Step 8: add Bisect(c) ->Ls(c)
R.Jananiet al, International Journal of Advanced Research in Computer Science, 9 (1), Jan-Feb 2018,619-623
© 2015-19, IJARCS All Rights Reserved 621
c. Artificial Bee Colony (ABC)
The swarm intelligence algorithm used to be proposed by Karaboga among the year 2005. This algorithm is stimulated by means of scavenging behavior over the bee colonies [10]. To consign the solution regarding optimization problems, at that place is quite a few strategies are integrated along the algorithm itself. The solution about the fitness function is signified namely the fluid quantity regarding a food sources [11]. In accordance with this ABC algorithm, it consists of at that place are three kinds of bees such as, employed bees, onlooker bees and scout bees. Employed bees realize the unique food sources those have travelled earlier than and provide the virtue information as regards the food sources to the onlooker bees. Onlooker bees are adoption the information about the food sources and according to decide a food source after exploit based over the facts given with the aid of the employed bees. Scouts bees are ancient after search randomly between the environment in rule in accordance with find out a new food source. In that algorithm, half of the colony encompasses engaged bees and other half includes the onlooker bees [12,13].
Algorithm: Artificial Bee Colony
Input: , is the lower and upper bounds of nth parameter
K is the randomly selected food sources is a random number within the range [-1,1] N is the new food source on dimension d Step 1: Initialize the food sources
, = 0,1
Step 2: Evaluate the fitness of food sources
, , , ,
// Employed Bees
Step 3: Each bee in this phase yields the new food sources. Step 4: Calculate the fitness function using Step 2
Step 5: For selection process, greedy search method is applied
Step 6: Calculate the probability of the food sources.
∑
// Onlooker Bees
Step 7: Choose the food sources based on the probability value based on employed bees
Step 8: Produce new food sources
Step 9: Calculate the fitness function using Step 2
Step 10: For selection process, greedy search method is applied
// Scout Bees
Step 11: Swap the new food sources.
Step 12: New source will be produced and the counter will be 0
Step 13: Select the best food source.
Step 14: These three bees will be repeated until the best source will be found.
d. ABC-BK
As stated in the above sections, the existing clustering algorithms can generate the local optimal solution [14]. Those algorithms are extremely depending on the initialization of centroids and it converges after the number of iterations. Hence, in this research work we propose a
combinational algorithm (ABC-BK) which uses the facts of bisecting algorithm and ABC algorithm for giving the solution of clustering problems [15]. This proposed algorithm is an independent of cluster centroids and also it leads the global optimal solution. In this proposed algorithm, the food particles in the pursuit foundation implies the group centroids or it signifies the solution for document clustering. The position Pi of the food source are
assembledas
, , … … . , (1)
where k is the total number of clusters and is the yth cluster centroid of xth food source. The proposed algorithm can briefly explain as follows,
Initialize the food source position randomly and use thebisecting k-means algorithm to find out the solution for document clusteringfor all the position of food sources.
To calculate the fitness value of each group usingthe fitness formula
Examinenew food sources and appraise the position of the food sources by using the employed bees phase.
Apply the bisecting k-means algorithm and a greedy search to estimatenew fitness value. These values are to be compared with original fitness values. Best food sources will be circulated to onlooker bee’s phase.
Determine the likelihoodestimationsof food sources and amend their position. Again, the bisecting k-means algorithm and a greedy search algorithm can be smeared to perform the document clustering.
Then figure it the new fitness values and also update it.
Review the counter of food sources and creates a new food source in the particular search space.
Algorithm: ABC-BK
Step 1: Initialize the food sources; send employed bees to current food sources
Counter = 0;
Step 2: Calculate the fitness function // Employed Bees Phase
Step 3: Apply the bisecting k-means and greedy search algorithm.
Step 4: Compute the new fitness function.
Step 5: Estimate the probability value of food sources // Onlooker Bee’s Phase
Step 6: Select the food sources based on the probability value
Step 7: Again, apply the bisecting k-means and greedy search algorithm.
Step 8: Compute the new fitness function. // Scout Bee’s Phase
Step 9: Check the limit of the parameter. Step 10: Produce the new food sources Step 11: Counter = counter +1
© I perfo are a expe give algor
•
•
Whe docu
•
Whe docu
•
T algor give algor
Dat
re 20n BB
T algor give algor
© 2015-19, IJARC IV
n order to pe ormance meas accuracy, prec erimental resu
s the improv rithms.
Accuracy: H better cluster Precision: It properly situa
ere R1 is re
uments.
Recall: It is were
ere R1 is re
uments.
F-measure: I recal F
Table II sho rithms. From s the better a rithms.
taset K-mea
0 30.09
news 24.14
BC 27.36
Table III sho rithms. From s the better a rithms.
R.Jananiet a
CS All Rights Re .RESULTS AN
erform this cl sures are used cision, recall ult establishes ved clusterin
Higher value o predictive dis t is consider ate in the same
Precision =
elevant docum
computed as e recognized.
Recall =
elevant docum
It is the harm ll. It calculate F-measure =
ows the reca this, we infe accuracy when
Table II: Reca
ans Bisecti means
9 34.12
4 24.57
6 29.78
Figu
ws the preci this, we conc accuracy when
al, International
eserved
D DISCUSSION
lustering task d in this resea and f-measur that the prop ng results tha
of cluster acc scrimination.
ed as the fr e cluster.
(2)
ments and R
the part of a
ments and R
monic mean o d as following
all values fo erred the prop n compared t
all for Three Data
ing k- ABC
2 31.9
7 25.1 29.1
ure 2: Recall
sion values f luded the prop n compared t
Journal of Adva
N
k, there are fo arch work. Th re [16, 17]. T posed algorith an the existi
curacy indicat
raction of pa
R2 is retriev
actual pairs th
(3)
R2 is retriev
of precision a g
(4)
or all the fo posed algorith o other existi
asets
ABC-BK
91 36.18 4 26.07 2 34.15
for all the fo posed algorith o other existi
anced Research
our hey The hm ing
tes
airs
ved
hat
ved
and
our hm ing
K
our hm ing
Da
r 20 B
al gi al
Da
re 20 B
al gi al
Da
r 20 B
in Computer Sc
ataset
K-e0 47
0news 60
BC 49
Table IV s lgorithms. Fro ives the bette lgorithms.
ataset
K-e0 36
0news 34
BC 35
Table V sh lgorithms. Fro ives the bette lgorithms.
ataset
K-e0 69
0news 67
BC 78
cience, 9 (1), Jan
Table III: Pr
-means Bis mea
7.15 49.
0.25 59.
9.28 49.
Fi
shows the f-m om this, we co
r accuracy w
Table IV: F-M
-means Bis mea
6.73 40.
4.46 34.
5.18 37.
Fig
hows the clust om this, we i
r accuracy w
Table V: Acc
-means Bis mea
9.21 7.14 8.36
n-Feb 2018,619-6
recision for Three
ecting k-ans
AB
.78 4
.47 6
.89 5
gure 3: Precision
measure value oncluded the p when compared
Measure for Thre
ecting k-ans
AB
.48 3
.77 3
.29 3
gure 4: F-Measure
tering accurac inferred the p when compared
curacy for Thr
ecting k-ans
A
70.18 7
70.98 7
74.56 8
623
622 e Datasets
BC
ABC-9.10 54.1 1.27 62.0 0.17 50.94
n
es for all the proposed algo d to other exi
ee Datasets
BC
ABC-8.68 4
5.65 3
8.85 4
e
cy for all the proposed algo d to other exi
ree Datasets
ABC
ABC-1.25 80.9 6.68 89.4 0.72 88.34
2 -BK
8 0 4
e four rithm isting
-BK
43.38 36.70 40.88
e four rithm isting
-BK
© F
D unsu ident has clust We mean perfo bette impl ABC alter effec volu accu
[1]
[2]Jia
[3]Ji,
© 2015-19, IJARC Figure 5: Accurac
Document clus upervised docu tification and developed a tering accurac explored the ns, ABC an ormance facto er optimal lemented and C-BK algorith red with respe ctive algorithm ume of text do uracy.
Nicholas O. developments published by ci a, R., & Song, determination m Mcroelectronic , X., Han, Z.,
improved K m the division o University (En
R.Jananiet a
CS All Rights Re cy
V.CONC
stering plays a ument organiz information a hybrid alg
cy for handli e performance nd ABC-BK ors, the propo solutions. T verified on s hm takes the ect to the par ms to be de cumentsand t
VI.REF
Andrews and in document iteseer, pp. 1-25
J. (2016). K-m method based o cs and Compute & Li, K., et means clustering of distribution gineering Scien
al, International
eserved CLUSION
an importan zation, topic e retrieval. Thi gorithm to a ing unstructu es of k-mean
algorithms. osed algorithm The algorith everal real da less control p rticular bees.
veloped to h o obtain the p
FERENCES
d Edward A clustering,” 5, Oct. 2007 means optimal c
on clustering ce er, 5.
al. (2016). A g algorithm ba network. Jour nce Edition), 4,
Journal of Adva
nt role extraction, top s research wo attain the hi ured documen ns, bisecting Based on t m converge t hm has be atasets. Also t parameter to In future, mo handle the hu proper clusteri
A. Fox, “Rec Technical rep
clustering numb enter optimizati
Application of sed on density rnal of Shando
41-46.
anced Research
in pic ork igh nts. k-the the een the be ost uge ing
ent port
ber on.
the y in ong
[4
[5
[6
[7
[8
[9
[1
[1
[1
[1
[1
[1
[1
[1
in Computer Sc
4]D. Karaboga. optimizatio Engineerin 2005. 5] B. Basturk a
algorithm f Intelligenc May 2006. 6]D. Karaboga a for numer (abc) algor 471, 2007. 7]Berry, M. (
Classificat 8] Liu, G., Huan
means clu Software, 2 9]D. Karaboga
bee colon 8(1):687–6 0] K. R. Zali Pattern Re 2008. 1] C. Zhang, D
approach f vol. 37, pp 2]W. Zou, Y. Z
using coop Dynamics 2010. 3]Pham, D.T.
application Engineers, Science (2 4] Nihal M. A
Document Conference (March 20 5] Goldberg D
and Mach Company I 6]RekhaBaghe Document Computer 2010 7] A. Huang, “
In Proc. Research S
cience, 9 (1), Jan
An idea based on. Technical ng Faculty, Co
and D. Karabo for numeric fun ce Symposium
.
and B. Basturk. ical function o rithm. Journal o
(Ed.). "Survey ion, and Retriev ng, T., & Chen ustering algor 2.
and B. Basturk ny (abc) algo 697, 2008. ik, “An efficie ecognition Lett
D. Ouyang, and for clustering,” p. 4761–4767, Ju
Zhu, H. Chen, perative artifici in Nature and
and Afify, A.A ns in engineeri
Part C: Jou 006)
AbdelHamid, M clustering usi e of Informati 13)
D.E.: Genetic A hine Learning Inc., London (1 el and Dr. Renu Clustering Al Applications, v
“Similarity mea of the Sixth Student Confere
n-Feb 2018,619-6
on honey bee Report TR06, omputer Engin
oga. An artifici nction optimiza
2006, Indianap
. A powerful an optimization: A of Global Optim
y of Text M val". Springer, n, H. (2015). Im rithm.Computer
k. On the perfo rithm. Applied
ent k-means cl ters, vol. 29, p
d J. Ning, “An Expert System uly 2010. and X. Sui, “A ial bee colony Society, vol. 2
A.: Clustering ing. The Institu urnal of Mech
M.B. AbdelHalim ing Bees Algo ion Technolog
Algorithms-in S g. Addison- 1989)
uDhir, “A Frequ lgorithm,” Inte vol. 4, No.5, p
asures for text d New Zealand ence NZCSRSC
623
623 swarm for num Erciyes Unive neering Depart
ial bee colony ation. In IEEE S polis, Indiana,
nd efficient algo Artificial bee c mization, 39(3)
Mining: Clust New York (200 mproved bisecti
r Applications
formance of art d Soft Comp
lustering algori pp. 1385–1391
artificial bee c ms with Applica
A clustering app algorithm,” Di 010, pp. 16, Oc
techniques and ution of Mech hanical Engine
m, and M.W. F orithm. Interna gy, IEEE, Indo
Search, Optimi Wesley Publ
uent Concepts B ernational Journ
p. 0975 – 8887
document cluste Computer Sc C, pp. 49—56, 2
3 merical
ersity, tment,
(abc) Swarm USA,
orithm colony ):459–
tering, 03) ing K-s and
tificial puting,
ithm,” , July
colony ations,
proach iscrete ctober
d their hanical eering
Fakhr: ational onesia
zation ishing
Based nal of 7, Jul.