Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys

Instance selection of linear complexity for big data

Álvar Arnaiz-González, José-Francisco Díez-Pastor, Juan J. Rodríguez, César García-Osorio

University of Burgos, Spain

Article info

Article history: Received 18 December 2015; Revised 3 April 2016; Accepted 30 May 2016; Available online 30 May 2016.

Keywords: Nearest neighbor, Data reduction, Instance selection, Hashing, Big data

Abstract

Over recent decades, database sizes have grown considerably. Larger sizes present new challenges, because machine learning algorithms are not prepared to process such large volumes of information. Instance selection methods can alleviate this problem when the size of the data set is medium to large. However, even these methods face similar problems with very large-to-massive data sets.

In this paper, two new algorithms with linear complexity for instance selection purposes are presented. Both algorithms use locality-sensitive hashing to find similarities between instances. While the complexity of conventional methods (usually quadratic, O(n²), or log-linear, O(n log n)) means that they are unable to process large-sized data sets, the new proposal shows competitive results in terms of accuracy. Even more remarkably, it shortens execution time, as the proposal manages to reduce complexity and make it linear with respect to the data set size. The new proposal has been compared with some of the best known instance selection methods for testing and has also been evaluated on large data sets (up to a million instances).

© 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license ( http://creativecommons.org/licenses/by/4.0/).

1. Introduction

The k nearest neighbor classifier (kNN) [11], despite its age, is still widely used in machine learning problems [9,17,20]. Its simplicity, straightforward implementation and good performance in many domains mean that it is still in use, despite some of its flaws [37]. The kNN algorithm belongs to the family of instance-based learning, in particular to the lazy learners, as it does not build a classification model but just stores all the training set [8]. Its classification rule is simple: for each new instance, assign the class according to the majority vote of its k nearest neighbors in the training set; if k = 1, the algorithm only takes the nearest neighbor into account [45]. This feature means that it requires a lot of memory and processing time in the classification phase [48]. Traditionally, two paths have been followed to speed up the process: either accelerate the calculation of the closest neighbors [3,4], or decrease training set size by strategically selecting only a small portion of instances or features [38].
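As an illustration of the classification rule just described, the following minimal sketch (ours, not the authors' code; all names are hypothetical) shows a brute-force kNN vote:

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    """Classify x by majority vote among its k nearest training instances (lazy learning: no model is built)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance from x to every stored instance
    neighbors = np.argsort(dists)[:k]             # indices of the k closest instances
    votes = Counter(y_train[i] for i in neighbors)
    return votes.most_common(1)[0][0]             # majority class; k = 1 reduces to the nearest neighbor
```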

Regarding the acceleration of algorithms, perhaps one of the most representative approaches is to approximate nearest neighbors, a broadly researched technique in which the nearest neighbor search is done over a sub-sample of the whole data set [56]. In this field, many algorithms have been proposed for approximate nearest neighbor problems [3,4,30,34,39].

Corresponding author. Fax: +34 947 258 910.

E-mail addresses: alvarag@ubu.es (Á. Arnaiz-González), jfdpastor@ubu.es (J.-F. Díez-Pastor), jjrodriguez@ubu.es (J.J. Rodríguez), cgosorio@ubu.es (C. García-Osorio).

The focus of this paper is on the second path, the reduction of data set size. The reason is that this reduction is beneficial for most methods rather than only those based on nearest neighbors. Although we will only consider the reduction of instances (instance selection) in this paper, the reduction could also be applied to attributes (feature selection), or even both at the same time [51]. The problem is that the fastest conventional instance selection algorithms have a computational complexity of at least O(n log n) and others are of even greater complexity.

The need for rapid methods for instance selection is even more relevant nowadays, given the growing sizes of data sets in all fields of machine learning applications (such as medicine, marketing or finance [43]), and the fact that the most commonly used data mining algorithms were developed when common databases contained at most a few thousand records. Currently, millions of records are the most common scenario, so most data mining algorithms encounter serious difficulties in their application. Thus, a new term has emerged, "Big Data", in reference to those data sets that, by volume, variability and speed, make the application of classical algorithms difficult [44]. With regard to instance selection, the solutions that have appeared so far to deal with big data problems adopt the 'divide and conquer' approach [13,22]. The algorithms proposed in the present paper offer a different approach, just a sequential but very quick and simple processing of each instance in the data set.

http://dx.doi.org/10.1016/j.knosys.2016.05.056


In particular, the major contribution of this paper is the use of Locality-Sensitive Hashing (LSH) to design two new algorithms, which offer two main advantages:

• Linear complexity: the use of LSH means a dramatic reduction in the execution time of the instance selection process. Moreover, these methods are able to deal with huge data sets due to their linear complexity.

• On-the-fly processing: one of the new methods is able to tackle the instances in one step. It is not necessary for all instances to fit in memory: a characteristic that offers a remarkable advantage in relation to big data.

The paper is organized as follows: Section 2 presents the background on reduction techniques, with special emphasis on the instance selection methods used in the experimental validation; Section 3 introduces the concept of locality-sensitive hashing, the basis of the proposed methods, which are presented in Section 4; Section 5 presents and analyzes the results of the experiments and, finally, Sections 6 and 7 set out the conclusions and future research, respectively.

2. Reduction techniques

Available data sets are progressively becoming larger in size. As a consequence, many systems have difficulties processing such data sets to obtain exploitable knowledge [23]. The high execution times and storage requirements of the current classification algorithms make them unusable when dealing with these huge data sets [28]. These problems can be decisive, if a lazy learning algorithm such as the nearest neighbor rule is used, and can even prevent results from being obtained. However, reducing the size of the data set by selecting a representative subset has two main advantages: it reduces the memory required to store the data and it accelerates the classification algorithms [19].

In the scientific literature, the term "reduction techniques" includes [61]: prototype generation [32]; prototype selection [52] (when the classifier is based on kNN); and (for other classifiers) instance selection [8]. While prototype generation replaces the original instances with new artificial ones, instance selection and prototype selection attempt to find a representative subset of the initial training set that does not lessen the predictive power of the algorithms trained with such a subset [45]. Prototype generation is not addressed in this paper; however, a complete review of it can be found in [57].

2.1. Instance selection

The aforementioned term "instance selection" brings together different procedures and algorithms that target the selection of a representative subset of the initial training set. There are numerous instance selection methods for classification, a complete review of which may be found in [21]. Instance selection has also been applied to both regression [2,33] and time series prediction [26,55].

According to the order in which instances are processed, instance selection methods can be classified into five categories [21]. If they begin with an empty set and add instances to the selected subset, by means of analyzing the instances in the training set, they are called incremental. The decremental methods, on the contrary, start with the original training data set and remove those instances that are considered superfluous or unnecessary. Batch methods are those in which no instance is removed until all of them have been analyzed; instances are simply marked for removal if the algorithm determines that they are not needed, and at the end of the process only the unmarked instances are kept. Mixed algorithms start with a preselected set of instances. The process then decides either to add or to delete the instances. Finally, fixed methods are a sub-family of mixed ones, in which the number of additions and removals is the same. This approach allows them to maintain a fixed number of instances (more frequent in prototype generation).

Table 1
Summary of state-of-the-art instance selection methods used in the experimental setup (taxonomy from [21]; computational complexity from [31] and authors' papers).

Strategy        Direction     Algorithm   Complexity    Year   Reference
Condensation    Incremental   CNN         O(n³)         1968   [27]
Condensation    Incremental   PSC         O(n log n)    2010   [46]
Condensation    Decremental   RNN         O(n³)         1972   [25]
Condensation    Decremental   MSS         O(n²)         2002   [6]
Hybrid          Decremental   DROP1-5     O(n³)         2000   [60]
Hybrid          Batch         ICF         O(n²)         2002   [8]
Hybrid          Batch         HMN-EI      O(n²)         2008   [41]
Hybrid          Batch         LSBo        O(n²)         2015   [37]

Considering the type of selection, three categories may be distinguished. This criterion is mainly correlated with the points that they remove: either border points, central points, or otherwise. Condensation techniques try to retain border points. Their underlying idea is that internal points do not affect classification, because the boundaries between classes are the keystone of the classification process. Edition methods may be considered the opposite of condensation techniques, as their aim is to remove those instances that are not well classified by their nearest neighbors. The edition process achieves smoother boundaries as well as noise removal. In the middle of those approaches are hybrid algorithms, which try to maintain or even to increase the accuracy capability of the data set, by removing both internal and border points [21].

Evolutionary approaches for instance selection have shown remarkable results in both reduction and accuracy. A complete survey of them can be found in [16]. However, the main limitation of those methods is their computational complexity [36]. This drawback is the reason why they are not taken into account in this study, because the methods it proposes are oriented towards large data sets.

In the remaining part of this section, we give further details of the most representative methods used in the experimental setup. A summary of the methods considered in the study can be seen in Table 1.

2.1.1. Condensation

The algorithm of Hart, Condensed Nearest Neighbor (CNN) [27], is considered the first formal proposal of instance selection for the nearest neighbor rule. The concept of training set consistency is important in this algorithm and is defined as follows: given a non-empty set X (X ≠ ∅), a subset S of X (S ⊆ X) is consistent with respect to X if, using the subset S as training set, the nearest neighbor rule can correctly classify all instances in X. Following this definition of consistency, if we consider the set X as the training set, a condensed subset should have the properties of being consistent and, ideally, smaller than X. After CNN appeared, other condensation methods emerged with the aim of decreasing the size of the condensed data set, e.g. Reduced Nearest Neighbor (RNN) [25]. One of the latest is Prototype Selection by Clustering (PSC) [46], which uses clustering to speed up the selection process. So, the use of clustering gives a high efficiency to PSC, if compared against state-of-the-art methods, and better accuracy than other clustering-based methods such as CLU [40].
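The consistency idea behind CNN can be made concrete with a short sketch. The code below is one common rendering of Hart's condensation loop (a simplified sketch we add for illustration, not the exact original formulation):

```python
import numpy as np

def cnn_condense(X, y):
    """Hart's CNN (simplified sketch): keep adding instances misclassified by the current subset."""
    S_idx = [0]                                  # seed the condensed set with one instance
    changed = True
    while changed:                               # repeat until a full pass adds nothing
        changed = False
        for i in range(len(X)):
            if i in S_idx:
                continue
            # 1NN prediction of X[i] using only the current condensed subset S
            d = np.linalg.norm(X[S_idx] - X[i], axis=1)
            if y[S_idx][np.argmin(d)] != y[i]:   # misclassified by S, so it must be kept
                S_idx.append(i)
                changed = True
    return np.array(S_idx)
```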

In [6], the authors proposed a modification to the definition of a selective subset [54], for a better approximation to decision borders. The selective subset can be thought of as similar to the idea of the condensed algorithm of Hart, but applying a condition stronger than the condition of consistency. The aim is to find the selected instances in an easier way, which is less sensitive to the random initialization of S and the order of exploration of X in Hart's algorithm. The subset obtained in this way is called the selective subset (SS).

A subset S of the training set X is a selective subset (SS) if it satisfies the following conditions:

1. S is consistent (as in Hart's algorithm).

2. All instances in the original training set, X, are closer to a selective neighbor (a member of S) of the same class than to any instance of a different class in X.

Then, the authors present a greedy algorithm which attempts to find selective instances starting with those instances of the training set that are close to the decision boundary of the nearest neighbor classifier. The algorithm presented is an efficient alternative to the Selective algorithm [54] and it is usually able to select better instances (the ones closer to the boundaries).

2.1.2. Hybrid

One problem that arises when condensation methods are used is their noise sensitivity, while hybrid methods have specific mechanisms to make them more robust to noise [37].

The DROP (Decremental Reduction Optimization Procedure) [60] family of algorithms comprises some of the best instance selection methods for classification [8,47,50]. The instance removal criterion is based on two relations: associates and nearest neighbors. The associate relation is the inverse of nearest neighbors: those instances p that have q as one of their nearest neighbors are called associates of q. The set of nearest neighbors of one instance is called the neighborhood of the instance. For each instance, its list of associates is a list of all the instances that have that particular instance in their neighborhood.
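The associate relation can be made concrete with a short sketch (illustrative only, with hypothetical names): given each instance's list of nearest neighbors, the associates of q are simply the instances that list q.

```python
from collections import defaultdict

def build_associates(neighbors):
    """Invert a neighborhood map: associates[q] = instances that have q among their nearest neighbors."""
    associates = defaultdict(set)
    for p, neighbor_list in neighbors.items():
        for q in neighbor_list:
            associates[q].add(p)
    return associates

# Example: instance 0 has neighbors {1, 2}, so 0 is an associate of both 1 and 2.
print(build_associates({0: [1, 2], 1: [0], 2: [0]}))
```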

Marchiori proposed a new graph-based representation of the data set called Hit Miss Networks (HMN) [41]. The graph has a directed edge from each instance to its nearest neighbor in each of the different classes, with one edge per class. The information in the graph was used to define three new hybrid algorithms: HMN-E, HMN-C and HMN-EI. A couple of years later, HMNs were used to define a new information-theoretic instance scoring method and, with it, a new instance selection method called Class Conditional Nearest Neighbor (CCIS) [42]. According to [21], HMN-EI is able to achieve more accurate data sets than CCIS, which is the reason why HMN-EI was used in the experimental setup.

The local-set (LS) concept, proposed for the very first time in [7], is a powerful tool for some machine learning tasks, including instance selection. The local set of an instance x contains all those instances which are closer to x than its nearest neighbor of a different class, its nearest enemy. The selection rule of the Iterative Case Filtering algorithm (ICF) [7] uses local sets to build two sets: coverage and reachability. These two concepts are closely related to the neighborhood and associate list used in DROP algorithms. The coverage of an instance is its LS, which can be seen as a neighborhood of the instance that, instead of considering a fixed number k of neighbors, includes all instances closer to the instance than its closest enemy. The reachable set of an instance is its set of associates; the coverage set of an instance is its neighborhood. The deletion rule is as follows: an instance is removed from the data set if its reachable set (its set of associates) is bigger than its coverage (its 'neighborhood'). This rule means that the algorithm removes an instance if another object exists that generalizes its information. To address the problem of noisy data sets, both ICF and DROP3 begin with a noise-filter stage. Recently, Leyva et al. [37] presented three new instance selection methods based on LSs. Their hybrid approach, which offers a good balance between reduction and accuracy, is called Local Set Border Selector (LSBo) and it uses a heuristic criterion: the instances in the boundaries between classes tend to have greater LSs. As is usual in hybrid methods, LSBo starts with a noise filtering algorithm presented in the same paper, called LSSm.
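The ICF deletion rule described above (remove an instance when its reachable set is larger than its coverage) can be sketched as follows. This is our schematic reading of the rule, not the full ICF algorithm; the `local_set` mapping is assumed to be precomputed.

```python
def icf_deletion_pass(instances, local_set):
    """One ICF-style pass: coverage = local set; reachability = inverse (associate) relation."""
    coverage = {x: set(local_set[x]) for x in instances}
    reachable = {x: set() for x in instances}
    for x in instances:
        for y in coverage[x]:
            reachable[y].add(x)   # y is in the local set of x, so x belongs to the reachable set of y
    # Mark for removal every instance whose information is generalized by other instances.
    return {x for x in instances if len(reachable[x]) > len(coverage[x])}
```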

2.2. Scaling up instance selection

The main drawback of instance selection methods is their complexity, which is quadratic, O(n²), where n is the number of instances [22] or, at best, log-linear, O(n log n); thus, the majority of them are not applicable to data sets with hundreds or even many thousands of instances [15]. Table 1 summarizes the computational complexity of the instance selection methods used in the experimental section.

One approach to deal with massive data sets is to divide the original problem into smaller subsets of instances, known as stratification. The underlying idea of these methods is to split the original data set into disjointed subsets, and then an instance selection algorithm is applied to each subset [10,13,22,24]. This approach is used in [22], where a method was proposed that addressed the splitting process, using Grand Tour [5] theory, to achieve linear complexity.

The problem known as big data refers to the challenges and difficulties that arise when huge amounts of data are processed. One way to accelerate instance selection methods and to be able to cope with massive data sets is to adapt them to parallel environments [49]. To do so, the way that algorithms work has to be redesigned. The MapReduce paradigm offers a robust framework with which to process huge data sets over clusters of machines. Following up on this idea, a new proposal was presented recently by Triguero et al. [58].

3. Locality-sensitive hashing

Locality-sensitive hashing (LSH) is an efficient method for checking similarity between elements. It makes particular use of hash functions that, unlike those used in other applications of hashing,¹ seek to allocate similar items to the same bucket with a high probability, and at the same time to greatly reduce the probability of assigning dissimilar items to the same bucket [35].

LSH is commonly used to increase the efficiency of nearest neighbor calculation [3,21]. An indirect benefit of LSH for instance selection algorithms is the speeding up of the nearest neighbor calculation required in most of these sorts of algorithms. However, the complexity of the algorithms remains unchanged, since the loop nesting and structures of the algorithms remain the same. It is only the kNN step that is improved.

What we propose in this paper is a novel use of LSH, not merely as support for the calculation of nearest neighbors, but as an operation that defines the nature of the new instance selection algorithms. Basically, the idea is to make the instance selection on each of the buckets that will be obtained by LSH when applied to all instances. This process permits the selection of instances using a unique processing loop of the data set, thereby giving it linear complexity. So a reasonable question arises: when a classifier is trained with a selected subset obtained by this approach, will its prediction capabilities decrease? This article offers an experimental response to this question.² But before giving the details of our proposal, let us look at a brief introduction to the underlying theory of LSH.

¹ The aim of conventional cryptographic hash functions is to avoid the collision of items in the same bucket.

² In any case, note that even a certain degradation of classifier performance would be acceptable, if the new algorithm achieved a substantial acceleration or reduction in storage [9]; it is often better to gain a quick approximation within a reasonable time than an optimal solution when it is too late to use it.

Fig. 1. Expected behavior of the probabilities of a (d1, d2, p1, p2)-sensitive function. The function will assign the same value to two instances with a probability greater than p1 if their distance is shorter than d1. The function will assign the same value to two instances with a probability lower than p2 if their distance is greater than d2. For distances between d1 and d2, there is no restriction regarding the values the function can assign to the instances.

3.1. Locality-sensitive functions

In this section we follow [35] to formally define the concept of local sensitivity and the process of amplifying a locality-sensitive family of functions.

Given a set of objects S and a distance measure D, a family of hash functions H = {h : S → U} is said to be (d1, d2, p1, p2)-sensitive if the following properties hold for all functions h in the family H:

• For all x, y in S, if D(x, y) ≤ d1, then the probability that h(x) = h(y) is at least p1.

• For all x, y in S, if D(x, y) > d2, then the probability that h(x) = h(y) is at most p2.

In this definition, nothing refers to what happens when the distance between the objects is between d1 and d2 (see the representation in Fig. 1). The distances d1 and d2 can be as close as possible, but the cost will be that p1 and p2 are also closer. However, as shown below, it is possible to combine families of hash functions that separate the probabilities p1 and p2 without modifying the distances d1 and d2.

Given a (d1, d2, p1, p2)-sensitive family of hash functions H, it is possible to obtain a new family H′ using the following amplification operations:

AND-construction. The functions h′ in H′ are obtained by combining a fixed number r of functions {h1, h2, ..., hr} in H. Now, h′(x) = h′(y) if and only if hi(x) = hi(y) for all i. If the independence of the functions in H can be guaranteed, the new family of functions H′ will be (d1, d2, (p1)^r, (p2)^r)-sensitive.

OR-construction. The functions h′ in H′ are obtained by combining a fixed number b of functions {h1, h2, ..., hb} in H. Now, h′(x) = h′(y) if and only if hi(x) = hi(y) for at least one i. If the independence of the functions in H can be guaranteed, the new family of functions H′ will be (d1, d2, 1 − (1 − p1)^b, 1 − (1 − p2)^b)-sensitive.

The AND-construction decreases the probabilities and the OR-construction increases them. However, if r and b are properly chosen, then with the chaining of constructions the probability p1 may be brought closer to 1, while the probability p2 will stay reasonably close to 0.

Fig. 2. Two points (A, B) at a distance d greater than the bucket width w have a small chance of being hashed to the same bucket.

In the experimental setup, the hash functions in the base family were obtained using the following equation [12]:

h_{a,b}(x) = ⌊(a · x + b) / w⌋    (1)

where a is a random vector (Gaussian distribution with mean 0 and standard deviation 1), b is a random real value from the interval [0, w] and w is the width of each bucket in the hash table.
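A minimal sketch of this base hash family (our illustration of Eq. (1), assuming the standard p-stable construction of [12]; all names are ours):

```python
import numpy as np

def make_base_hash(dim, w, rng=np.random.default_rng()):
    """Return one h_{a,b}(x) = floor((a . x + b) / w), with a ~ N(0, 1)^dim and b ~ U[0, w]."""
    a = rng.standard_normal(dim)        # random projection direction
    b = rng.uniform(0.0, w)             # random offset inside one bucket
    return lambda x: int(np.floor((np.dot(a, x) + b) / w))

h = make_base_hash(dim=2, w=4.0)
print(h(np.array([0.3, 0.7])))          # integer bucket index for this instance
```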

This equation gives a (w/2, 2w, 1/2, 1/3)-sensitive family. The reason for these numbers is as follows (suppose, for simplicity, a 2-dimensional Euclidean space). If the distance d between two points is exactly w/2 (half the width of the buckets), the smallest probability of the two points falling in the same segment occurs for θ = 0, and in this case the probability is 0.5, since d is exactly w/2. For angles greater than 0, this probability will be even higher; in fact, it will be 1 for θ = 90. And for distances shorter than w/2, the probability equally increases. So the lower boundary for this probability is 1/2. If the distance d is exactly 2w (twice the width of the bucket), the only chance for both points to fall in the same bucket is that their distance, once projected on the segment, is lower than w, which means that cos θ must be lower than 0.5, since the projected distance is d cos θ and d is exactly 2w. For θ in the interval 0 to 60, cos θ is greater than 0.5, so the only chance of cos θ being lower than 0.5 is that θ is in the interval [60, 90], and the chance of that happening is at most 1/3. For distances greater than 2w, the probabilities are even lower. So the upper boundary of this probability is 1/3. This reasoning is reflected in Fig. 2.
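The two bounds can be checked numerically with a quick Monte Carlo simulation under the Gaussian projection of Eq. (1). This is an illustrative check that we add here, not part of the original study:

```python
import numpy as np

def collision_rate(dist, w=1.0, trials=20_000, rng=np.random.default_rng(0)):
    """Estimate P[h(x) = h(y)] for two 2-D points at the given distance."""
    x = np.zeros(2)
    y = np.array([dist, 0.0])
    hits = 0
    for _ in range(trials):
        a = rng.standard_normal(2)       # random direction, as in Eq. (1)
        b = rng.uniform(0.0, w)          # random offset
        hits += np.floor((a @ x + b) / w) == np.floor((a @ y + b) / w)
    return hits / trials

print(collision_rate(0.5))   # distance w/2: estimate stays above the 1/2 lower bound
print(collision_rate(2.0))   # distance 2w: estimate stays below the 1/3 upper bound
```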

By using the (w/2, 2w, 1/2, 1/3)-sensitive family previously described, we have computed the probabilities p1 and p2 for the AND-OR construction with a number of functions from 1 to 10. Fig. 3 shows the probabilities p1 (a) and p2 (b) for the case of the chaining of an OR-construction just after an AND-construction, and the difference between these two probabilities (c). The row number indicates the number of functions used in the AND-construction, while the column number indicates the number of functions used in the OR-construction.
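The values plotted in Fig. 3 follow directly from the amplification formulas of Section 3.1; the short sketch below (ours) reproduces that computation for a base collision probability p:

```python
def amplified(p, r, b):
    """Collision probability after an AND-construction of r base functions chained with an OR-construction of b."""
    return 1.0 - (1.0 - p ** r) ** b

# Grid analogous to Fig. 3: rows r = 1..10 (AND), columns b = 1..6 (OR), for p1 = 1/2 and p2 = 1/3.
p1_grid = [[amplified(1 / 2, r, b) for b in range(1, 7)] for r in range(1, 11)]
p2_grid = [[amplified(1 / 3, r, b) for b in range(1, 7)] for r in range(1, 11)]
print(p1_grid[9][4], p2_grid[9][4])   # example: the AND=10, OR=5 configuration
```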

4. New instance selection algorithms based on hashing

This section presents the algorithms proposed in this work: LSH-IS-S and LSH-IS-F. The first completes the selection process in a single pass, analyzing each instance consecutively. It processes instances in one pass, so not all instances need to fit in memory. The second performs two passes: in the initial one, it counts the instances in each bucket; in the second, it completes the instance selection with this information. The complexity of both algorithms is linear, O(n) (note that this is true even for the second algorithm, i.e. although two passes are performed).

Fig. 3. Probabilities p1 and p2 and the difference between them. The darker the color the higher the value. Each cell gives the value for the chaining of an OR-construction (number of combined basic functions on the x-axis) after an AND-construction (on the y-axis the number of combined functions).

Both algorithms can be seen as incremental methods, due to the fact that the selected data set is formed by successive additions to the empty set. However, the second one conforms more closely to batch processing because it analyzes the impact of the removal on the whole data set.

The main advantage of the presented methods is the drastic reduction in execution time. The experimental results show a significant difference when they are compared against state-of-the-art instance selection algorithms.

Algorithm 1: LSH-IS-S – Instance selection algorithm by hashing in one-pass processing.

Input: A training set X = {(x1, y1), ..., (xn, yn)}, set G of hash function families
Output: The set of selected instances S ⊆ X

1  S = ∅
2  foreach instance x ∈ X do
3      foreach function family g ∈ G do
4          u ← bucket assigned to x by family g
5          if there is no other instance of the same class of x in u then
6              Add x to S
7              Add x to u
8  return S

4.1. LSH-IS-S: one-pass processing

As shown in Algorithm 1, the inputs of the LSH-IS-S method are: a set of instances to select and a set of families of hash functions. The loop processes each instance x of X, using the function families to determine the bucket u to which the instance belongs.³ If in the bucket u assigned to the instance there is no other instance of the same class of x, x is selected and added to S and to the bucket u. The algorithm ends when all instances in X have been processed. Note that each instance is processed only once, which grants an extremely fast performance at the expense of not analyzing in detail the instances that are selected in each bucket. Instances are analyzed in sequence without needing information on other instances. This process means that the method may be used in a single-pass process, without requiring the whole data set to fit in the memory.

³ The bucket identifier u given to an instance by a family g ∈ G can be thought of as the concatenation of all bucket identifiers given by the hash functions in g, since the function families in G are obtained by using an AND-construction on base functions obtained using Eq. (1). The OR-construction is implemented in the foreach loop at line 3 of Algorithm 1.
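A compact Python rendering of Algorithm 1 is sketched below. It is our illustration under the stated assumptions: `hash_families` is a list of callables, each mapping an instance to a bucket identifier (e.g. the tuple of AND-combined base hashes).

```python
from collections import defaultdict

def lsh_is_s(X, y, hash_families):
    """LSH-IS-S sketch: single pass over X, keeping one instance per class in every bucket."""
    selected = set()                                      # indices of the retained instances (the set S)
    buckets = [defaultdict(set) for _ in hash_families]   # per family g: bucket id -> classes already stored
    for i, (x, label) in enumerate(zip(X, y)):
        for g, table in zip(hash_families, buckets):
            u = g(x)                                      # bucket assigned to x by family g
            if label not in table[u]:                     # no other instance of the same class in u
                selected.add(i)                           # add x to S ...
                table[u].add(label)                       # ... and record its class in the bucket
    return selected
```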

4.2. LSH-IS-F: a more informed selection

The algorithm explained in the previous section is remarkably fast and allows instances to be processed as they arrive, in one pass. On the other hand, because of how it works, it does not use all the information that may be relevant to decide which instances to choose. For example, the algorithm has no control over the number of instances of each class that go to each bucket, because once an instance of a class is selected, it discards other instances of the same class that may come later.

Algorithm 2: LSH-IS-F – Instance selection algorithm by hashing with two passes.

Input: A training set X = {(x1, y1), ..., (xn, yn)}, set G of hash function families
Output: The set of selected instances S ⊆ X

1  S = ∅
2  foreach instance x ∈ X do
3      foreach function family g ∈ G do
4          u ← bucket assigned to x by family g
5          Add x to u
6  foreach function family g ∈ G do
7      foreach bucket u of g do
8          foreach class y with some instance in u do
9              Iy ← all instances of class y in u
10             if |Iy| > 1 then
11                 Add to S one random instance of Iy
12 return S

LSH-IS-F (see Algorithm 2) is an evolution of LSH-IS-S. In this method, one-pass processing is replaced by a more informed selection. The first loop is similar to LSH-IS-S but, instead of directly selecting instances, it first records the bucket to which each instance belongs. When there is only one instance of a class, the instance is rejected; otherwise, if two or more instances of the same class are present, one of them is randomly chosen. The idea here is to give the algorithm some tolerance of the presence of noise in the input data set.

Fig. 4. Example to illustrate the behavior of both algorithms. (a) Initial instances, two buckets are identified by LSH and the line shows the boundary. (b) Instances selected by LSH-IS-S. (c) Instances selected by LSH-IS-F.

The execution time of this method is not much larger than that of the previous method, since the number of buckets is much lower than the number of instances, although the differences in execution time increase with the number of OR functions.
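In the same spirit, a sketch of Algorithm 2 (again our illustration, with the same hypothetical `hash_families` interface as above):

```python
import random
from collections import defaultdict

def lsh_is_f(X, y, hash_families, rng=random.Random(0)):
    """LSH-IS-F sketch: the first pass fills the buckets, the second keeps one random instance per class,
    discarding classes represented by a single (possibly noisy) instance in a bucket."""
    buckets = [defaultdict(lambda: defaultdict(list)) for _ in hash_families]
    for i, (x, label) in enumerate(zip(X, y)):             # pass 1: record bucket membership only
        for g, table in zip(hash_families, buckets):
            table[g(x)][label].append(i)
    selected = set()
    for table in buckets:                                  # pass 2: one random representative per class
        for u, per_class in table.items():
            for label, idxs in per_class.items():
                if len(idxs) > 1:                          # singletons are treated as noise and dropped
                    selected.add(rng.choice(idxs))
    return selected
```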

4.3. Behavior of proposed methods

The aim of this section is to try to shed some light on the behavior of the new algorithms proposed in the paper. As previously stated, LSH-IS-S makes the selection of instances in one pass, selecting one instance of each class in each bucket. On the other hand, LSH-IS-F tries to avoid retaining noisy instances. The more informed selection criterion of LSH-IS-F allows it to remove instances that can be harmful for the classification.

In Fig. 4 we show an example with nine instances: four of one class (crosses) and five of the other (circles). The LSH algorithm uses two buckets, and the line represents the boundary between them. LSH-IS-S selects one instance of each class in each bucket (b), while LSH-IS-F does not select the instance of the cross class because it is identified as noise.

Fig. 5 illustrates the effect of the algorithms on the XOR data set [42], formed by 400 instances, 200 per class. An outlier was added and highlighted with a gray square. As indicated above, LSH-IS-S retains the instance, while LSH-IS-F removes it.

With the aim of illustrating the behavior of the proposed instance selection methods when the number of hash functions increases, we used the Banana data set. It has two numeric features and two classes; the size of the data set is 5300 instances (see Table 3). Fig. 6(a) shows the original data set. Despite the fact that two clusters can be easily identified, a high overlap exists between the two classes in some regions [21]. Table 2 summarizes the number of instances selected by both algorithms when only one OR function is used and the number of AND functions changes. LSH-IS-F retains fewer instances, because those instances of one class isolated in one bucket with instances of the other class are identified as noise and therefore deleted.

Table 2
Number of Banana data set instances that the proposed algorithms retain by the number of AND functions.

Algorithm   Number of AND functions
            2     4     6     8     10
LSH-IS-S    25    123   249   466   684
LSH-IS-F    24    110   225   423   627

Figs. 6 and 7 show the instances retained by both algorithms, LSH-IS-S and LSH-IS-F respectively, when only one OR function is used and different numbers of AND functions: 2, 4, 6, 8 and 10. The number of retained instances increases with the number of functions. This behavior is quite interesting, because it enables the user to choose whether more or fewer instances are retained, by varying the number of functions that are used.

5. Experimental study

This section presents the experimental study performed to evaluate the new proposed methods. We compared them against seven well-known state-of-the-art instance selection algorithms in a study performed in Weka [62]. The instance selection methods included in the experiments were: CNN, ICF, MSS, DROP3, PSC, HMN-EI, LSBo and the two approaches based on hashing. The parameters selected for the algorithms were those recommended by the authors: the number of nearest neighbors used in ICF and DROP3 was set to k = 3, and the number of clusters for PSC was set to 6r (where r is the number of classes of the data set). Evolutionary algorithms were not included in the experiments, due to their high computational cost.

For the experiments, we used 30 data sets from the Keel repository [1] that have at least 1000 instances. Table 3 summarizes the data sets: name, number of features, number of instances and the accuracy given by two classifiers (using ten-fold cross-validation): the nearest neighbor classifier with k = 1 and J48, a classifier tree (the Weka implementation of C4.5 [53]). The last five data sets are huge (below the dashed line), with more than 299,000 instances; the traditional instance selection methods are unable to address them. The only transformation carried out was the normalization of all input features, to set their values at between 0 and 1.

We used the nearest neighbor classifier (1NN), as most instance selection methods have been designed for that classifier (k = 1 in kNN) [37,51]. Moreover, we used the J48 classifier tree, to evaluate the extent to which the instances selected by the algorithms were suitable for training other classifiers.

Fig. 5. (a) Original. (b) LSH-IS-S. (c) LSH-IS-F.

Fig. 6. The number of instances selected by LSH-IS-S as the number of functions increases: (a) Banana (Orig.), (b) AND=2, (c) AND=4, (d) AND=6, (e) AND=8, (f) AND=10.

Fig. 7. The instances retained by LSH-IS-F as the number of AND functions increases: (a) Banana (Orig.), (b) AND=2, (c) AND=4, (d) AND=6, (e) AND=8, (f) AND=10.

Table 3
Summary of data sets characteristics: name, number of features, number of instances and accuracy (1NN). The last five data sets, below the dashed line, are huge problems.

                            #attributes                          Accuracy
    Data set                Continuous  Nominal   #instances     1NN      J48
 1  German                  7           13        1000           72.90    71.80
 2  Flare                   0           11        1066           73.26    73.55
 3  Contraceptive           9           0         1473           42.97    53.22
 4  Yeast                   8           0         1484           52.22    56.74
 5  Wine-quality-red        11          0         1599           64.85    62.04
 6  Car                     0           6         1728           93.52    92.36
 7  Titanic                 3           0         2201           79.06    79.06
 8  Segment                 19          0         2310           97.23    96.62
 9  Splice                  0           60        3190           74.86    94.17
10  Chess                   0           35        3196           72.12    81.85
11  Abalone                 7           1         4174           19.84    20.72
12  Spam                    0           57        4597           91.04    92.97
13  Wine-quality-white      11          0         4898           65.40    58.23
14  Banana                  2           0         5300           87.21    89.04
15  Phoneme                 5           0         5404           90.19    86.42
16  Page-blocks             10          0         5472           95.91    97.09
17  Texture                 40          0         5500           99.04    93.13
18  Optdigits               63          0         5620           98.61    90.69
19  Mushroom                0           22        5644           100.00   100.00
20  Satimage                37          0         6435           90.18    86.28
21  Marketing               13          0         6876           28.74    31.06
22  Thyroid                 21          0         7200           92.35    99.71
23  Ring                    20          0         7400           75.11    90.95
24  Twonorm                 20          0         7400           94.81    85.12
25  Coil2000                85          0         9822           90.62    93.95
26  Penbased                16          0         10,992         99.39    96.53
27  Nursery                 0           8         12,960         98.13    97.13
28  Magic                   10          0         19,020         80.95    85.01
29  Letter                  16          0         20,000         96.04    87.98
30  KR vs. K                0           6         28,058         73.05    56.58
--------------------------------------------------------------------------------
31  Census                  7           30        299,285        92.70    95.42
32  KDDCup99                33          7         494,021        99.95    99.95
33  CovType                 54          0         581,012        94.48    94.64
34  KDDCup99 1M             33          7         1,000,000      99.98    99.98
35  Poker                   5           5         1,025,010      50.61    68.25

As shown in Section 3, there are ways of combining the hash function families that appear more promising than others (see Fig. 3). However, it is unclear which of them will achieve the best results in combination with the proposed algorithms. Therefore, we conducted a study with 60 combinations: AND-constructions combining between 1 and 10 hash functions, and OR-constructions combining 1 to 6 functions obtained by the previous AND-construction, avoiding constructions with too many functions and, consequently, reducing the computational cost.

The subsets selected by the algorithms were used to build a classifier (1NN), and the average rank [14] was computed over the accuracy of all 60 combinations. Average ranks were calculated as follows: the results of the experiments were sorted, one for the best method, two for the second, and so on. In the case of a tie, the values of the ranks were added up and divided by the number of methods that tied. When the ranking for each data set had been calculated, the average for each method was computed; better methods have rankings closer to one (a sketch of this ranking procedure is given after the list below). The results of the rankings are shown in Fig. 8. Each cell represents the ranking value for a specific combination of AND-OR-constructions, where the number of functions in the OR-constructions is shown by the x-axis and the number of functions in the AND-constructions is shown by the y-axis. The darker the cell the higher its ranking (lower values are better). The best configuration is different for each algorithm:

• LSH-IS-S: the best configuration is one that uses OR-constructions of six functions obtained using an AND-construction on ten functions of the base family (functions obtained using Eq. (1)).

• LSH-IS-F: the best results were obtained using OR-constructions of five functions obtained by combining by AND-construction ten functions of the base family.
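The ranking procedure described above can be written down in a few lines; this is our sketch of the standard average-rank computation with tie handling [14], not the authors' code:

```python
def average_ranks(results):
    """results[d][m] = accuracy of method m on data set d; returns the mean rank of each method (1 = best)."""
    methods = list(next(iter(results.values())).keys())
    ranks = {m: 0.0 for m in methods}
    for per_dataset in results.values():
        ordered = sorted(methods, key=lambda m: per_dataset[m], reverse=True)   # higher accuracy -> lower rank
        i = 0
        while i < len(ordered):
            j = i
            while j < len(ordered) and per_dataset[ordered[j]] == per_dataset[ordered[i]]:
                j += 1                                      # group of tied methods occupies positions i..j-1
            tie_rank = (i + 1 + j) / 2                      # average of the ranks i+1 .. j
            for m in ordered[i:j]:
                ranks[m] += tie_rank
            i = j
    return {m: ranks[m] / len(results) for m in methods}

print(average_ranks({"d1": {"A": 0.9, "B": 0.8, "C": 0.8}}))   # A -> 1.0; B and C tie -> 2.5 each
```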

Fig. 9 shows how the execution time increases, on average, for the proposed algorithms: LSH-IS-S (gray) and LSH-IS-F (black). The higher the number of AND functions, the bigger the gap between LSH-IS-S and LSH-IS-F. This behavior is explained because LSH-IS-F has one more loop than LSH-IS-S (see pseudocodes 1 and 2), which is used to go through all the buckets counting the number of instances of each class. The number of buckets searched increases with the number of hash functions.

Ten-fold cross-validation was applied to the instance selection methods under study. The performance measures were as follows:

• accuracy achieved by 1NN and J48 classifiers trained with the selected subset;

• filtering time taken by instance selection;

• reduction achieved by instance selection methods (size of the selected subset).


Fig. 8. Average rank over accuracy of the proposed methods for the different configurations of AND-OR constructions: (a) LSH-IS-S, (b) LSH-IS-F. The darker the cell, the higher the ranking (lower is better). Each cell represents an AND-OR-construction where the column is the number of functions in the OR-construction and the row is the number of functions in the AND-construction.

Fig. 9. Average execution time of the proposed algorithms as the number of AND-functions increases. LSH-IS-S is represented in gray and LSH-IS-F in black, with different lines and marks for the different numbers of OR-functions.

Table 4
Average ranks and Hochberg procedure over accuracy: 1NN.

Algorithm   Ranking   p_Hoch
HMN-EI      2.92      –
LSBo        3.85      0.1869
LSH-IS-F    4.45      0.0602
MSS         4.58      0.0553
LSH-IS-S    4.98      0.0139
DROP3       5.03      0.0138
CNN         5.17      0.0088
ICF         5.75      0.0004
PSC         8.27      0.0000

According to the accuracy of the 1NN classifier (see Table 4), the best four algorithms were HMN-EI followed by LSBo, LSH-IS-F and MSS. According to the Hochberg procedure [29], differences between them were not significant at a 0.05 significance level. However, differences between LSBo and the other methods were significant. When J48 was used as the classifier (see Table 5), the best six algorithms were LSH-IS-F followed by LSH-IS-S, HMN-EI, LSBo, MSS and CNN; the differences between them were not significant at 0.05. Furthermore, as can be seen, the least accurate model is PSC for both classifiers.

Table 5
Average ranks and Hochberg procedure over accuracy: J48.

Algorithm   Ranking   p_Hoch
LSH-IS-F    3.63      –
LSH-IS-S    3.88      0.7237
HMN-EI      4.10      0.7237
LSBo        4.57      0.5606
MSS         5.03      0.1909
CNN         5.28      0.0981
ICF         5.67      0.0242
DROP3       5.90      0.0094
PSC         6.93      0.0000

Table 6 shows the average ranks over compression. DROP3 is the best method at a 0.05 significance level. Furthermore, the average reduction rate for each method is also shown. The proposed methods are the most conservative although, as previously stated, a higher compression could have been achieved using fewer functions in the LSH process.

Table 6
Average ranks and Hochberg procedure over storage reduction, and average reduction rate.

Algorithm   Ranking   p_Hoch   Reduction rate
DROP3       1.67      –        0.896
ICF         3.10      0.0427   0.813
LSBo        3.70      0.0081   0.737
PSC         4.70      0.0001   0.762
CNN         5.43      0.0000   0.658
MSS         6.00      0.0000   0.665
HMN-EI      6.10      0.0000   0.577
LSH-IS-F    6.62      0.0000   0.455
LSH-IS-S    7.68      0.0000   0.405

The third relevant feature of instance selection methods is the time required by the algorithms to calculate the selected subset. Table 7 shows the average rank over filtering time. The fastest methods were LSH-IS-F, LSH-IS-S and PSC; between them, differences were not significant at a 0.05 significance level. The differences were significant from MSS and the following methods; the slowest was CNN. It is worth noting that PSC achieved, according to the significance tests, a really competitive execution time. Nevertheless, the shortcoming of PSC was its poor accuracy, as it obtained the worst results of all of the methods under analysis, as noted in Tables 4 and 5.

Table 7
Average ranks and Hochberg procedure over filtering time.

Algorithm   Ranking   p_Hoch
LSH-IS-F    1.53      –
LSH-IS-S    1.57      0.9624
PSC         3.00      0.0761
MSS         4.20      0.0005
ICF         5.33      0.0000
LSBo        6.73      0.0000
HMN-EI      6.70      0.0000
DROP3       7.67      0.0000
CNN         8.27      0.0000

It might be surprising that LSH-IS-F was faster than LSH-IS-S, because this contradicts Fig. 9. The reason for this was the number of functions used by the algorithms; as commented on at the beginning of this section, LSH-IS-S was launched using six OR functions, while LSH-IS-F was launched with only five.

Fig. 10 shows the filtering time as a function of the number of instances. The results obtained with the data sets of the experiments were used to draw these figures. Since there were no results available for all possible values of numbers of instances, the available results were used to draw Bezier lines and to show the general trend of the algorithms. Although other methods (CNN, HMN-EI, LSBo, DROP3, MSS and ICF) have an execution time that increases swiftly, our algorithms based on hashing are at the bottom of the figures together with PSC. However, in the logarithmic scale, the growth of PSC is visibly greater.

As a summary of the experimentation with medium size data sets, we can highlight that the proposed methods achieved competitive results in terms of accuracy. Considering the reduction rate, DROP3 achieved the maximum compression, while our methods were the worst in terms of compression. Finally, considering execution time, the methods presented in the paper were able to compute the selected subset much faster than the other state-of-the-art algorithms. Exceptionally, PSC worked surprisingly fast, although slower than our proposals, and with the shortcoming of poor accuracy when its selected subsets were used for training the classifiers.

5.1. Huge problems

Due to the fact that instance selection methods are not able to face huge problems, in the experimental study performed over Census, CovType, KDDCup99, KDDCup99 1M⁴ and Poker (see Table 3) the proposed algorithms were tested against Democratic Instance Selection (DIS) [22]. Although PSC showed competitive results in terms of execution time, it was not included in the study of huge problems because of its poor accuracy.

As in the previous experiments, ten-fold cross-validation was performed on LSH-IS-S and LSH-IS-F in Weka. Testing error, using the 1-NN classifier, and storage reduction were reported and compared against the results published in [22]. Execution times were not compared because different implementations and machines would not have allowed a fair comparison.

Table 8
Average ranks over accuracy for huge problems.

Algorithm   Ranking
LSH-IS-F    2.0
DIS.RNN     2.2
LSH-IS-S    2.6
DIS.DROP3   3.6
DIS.ICF     4.6

Table 9
Average ranks over storage reduction for huge problems.

Algorithm   Ranking
DIS.RNN     1.8
DIS.DROP3   2.4
LSH-IS-F    3.2
DIS.ICF     3.4
LSH-IS-S    4.2

The main conclusion of the experiments was that our methods can face huge problems. Results of average ranks over accuracy are shown in Table 8. The accuracy of the methods under study is similar to DIS, though the most accurate method is LSH-IS-F. As in the medium size experiment, LSH-IS-F improved the accuracy with regard to LSH-IS-S. On the other hand, Table 9 shows the average ranks over storage reduction. In terms of compression, the best method was DIS.RNN, as proved in medium size data sets, while the methods based on hashing were too conservative. However, the number of instances that they retain can be adjusted by the number of functions used. The success of the proposed methods is even more remarkable when compared against scalable approaches. The simple idea of using LSH overcomes the democratization methods and opens the way to their use in huge data sets and big data.

6. Conclusions

The paper has introduced a novel approach to the use of families of locality-sensitive functions (LSH) for instance selection. Using this approach, two new algorithms of linear complexity have been designed. In one approach, the data are processed in one pass, which allows the algorithm to make the selection without requiring the whole data set to fit in memory. The other approach needs two passes: one processes each instance of the data set, and the second processes the buckets of the families of hash functions. Their speed and low memory consumption mean that they are suitable for big data processing.

The experiments have shown that the strength of our methods is their speed, which is achieved through a small decrease in accuracy and, more remarkably, in the reduction rate. Although the best methods according to accuracy differ depending on the classifier that is used, the proposed methods offer a competitive performance. Moreover, the reduction rate can be adjusted by increasing or by decreasing the number of hash functions that are used.

⁴ We divided the original KDDCup 1999 data set into two sets with different numbers of instances: KDDCup99 has 10% of the original instances and KDDCup99 1M has a million.

Fig. 10. Computational cost of tested methods, in linear (a) and logarithmic (b) scale on the y axis. In (a) CNN was not plotted because its growth was so high that it was not possible to appreciate the differences between the other methods. The dots are the results on the available data sets; lines (denoted by "–b" in the legend) are the Bezier lines built with these dots to show the general trend of the algorithms.

Furthermore, the proposed methods were evaluated on huge problems and compared against Democratic Instance Selection, a linear complexity method. Experimental results on accuracy showed how our methods outperformed DIS, even though our methods were conceptually much simpler.

7. Future work

In their current version, the way the algorithms make the instance selection is very simple and quite "naive". The selected subset could be improved using additional information about the instances assigned to each bucket, and not just the count of instances of each class. A future research line could be to store additional information on the instances assigned to each bucket: for example, simple statistics such as the incremental average of the instances in the bucket, or the percentage of instances of each class in the bucket. This information might mean that the instance selection process would be better informed, without excessively penalizing run-time. Although prototype generation has not been analyzed in the paper, the generation of a new instance, or group of them, for each bucket is one of the future lines of research. This idea could be developed using LSH-IS-F, going through each of the buckets to build or to create a new set of instances, by selecting the medoids or centroids of the instances in the buckets.

According to [63], one of the most challenging problems in data mining research is mining data streams in extremely large databases. Accurate and fast processes able to work on streams are required, without any assumption that information can be stored in large databases and repeatedly accessed. One of the problems that arises in those environments is called concept drift, which appears when changes in the context take place. In the management of concept drift, three basic approaches can be distinguished: ensemble learning, instance weighting and instance selection [59]. A comparison of the proposed methods in a streaming benchmark would be made to test whether LSH-IS-S can beat the state-of-the-art algorithms that are able to deal with streaming data [18].

Many more research approaches can be considered, but the principal one for us is to adapt the new methods to a big data environment. We are working on the adaptation of this idea to a MapReduce framework, which offers a robust environment to face up to the processing of huge data sets over clusters of machines [58].

Acknowledgements

Supported by the Research Projects TIN 2011-24046 and TIN 2015-67534-P from the Spanish Ministry of Economy and Competitiveness.

References

[1] J. Alcalá-Fdez, L. Sánchez, S. García, M. del Jesus, S. Ventura, J. Garrell, J. Otero, C. Romero, J. Bacardit, V.M. Rivas, J. Fernández, F. Herrera, KEEL: a software tool to assess evolutionary algorithms for data mining problems, Soft Comput. 13 (3) (2009) 307–318, doi:10.1007/s00500-008-0323-y.

[2] Á. Arnaiz-González, J.F. Díez-Pastor, J.J. Rodríguez, C.I. García-Osorio, Instance selection for regression by discretization, Expert Syst. Appl. 54 (2016) 340–350, doi:10.1016/j.eswa.2015.12.046.

[3] A. Andoni, P. Indyk, Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions, Commun. ACM 51 (1) (2008) 117–122, doi:10.1145/1327452.1327494.

[4] S. Arya, D.M. Mount, N.S. Netanyahu, R. Silverman, A.Y. Wu, An optimal algorithm for approximate nearest neighbor searching fixed dimensions, J. ACM 45 (6) (1998) 891–923, doi:10.1145/293347.293348.

[5] D. Asimov, The grand tour: a tool for viewing multidimensional data, SIAM J. Sci. Stat. Comput. 6 (1) (1985) 128–143, doi:10.1137/0906011.

[6] R. Barandela, F.J. Ferri, J.S. Sánchez, Decision boundary preserving prototype selection for nearest neighbor classification, IJPRAI 19 (6) (2005) 787–806.

[7] H. Brighton, C. Mellish, On the consistency of information filters for lazy learning algorithms, in: J. Żytkow, J. Rauch (Eds.), Principles of Data Mining and Knowledge Discovery, Lecture Notes in Computer Science, volume 1704, Springer Berlin Heidelberg, 1999, pp. 283–288, doi:10.1007/978-3-540-48247-5_31.

[8] H. Brighton, C. Mellish, Advances in instance selection for instance-based learning algorithms, Data Min. Knowl. Discov. 6 (2) (2002) 153–172, doi:10.1023/A:1014043630878.

[9] J. Calvo-Zaragoza, J.J. Valero-Mas, J.R. Rico-Juan, Improving kNN multi-label classification in prototype selection scenarios using class proposals, Pattern Recognit. 48 (5) (2015) 1608–1622, doi:10.1016/j.patcog.2014.11.015.

[10] J.R. Cano, F. Herrera, M. Lozano, Stratification for scaling up evolutionary prototype selection, Pattern Recognit. Lett. 26 (7) (2005) 953–963, doi:10.1016/j.patrec.2004.09.043.

[11] T. Cover, P. Hart, Nearest neighbor pattern classification, IEEE Trans. Inf. Theory 13 (1) (1967) 21–27, doi:10.1109/TIT.1967.1053964.

[12] M. Datar, N. Immorlica, P. Indyk, V.S. Mirrokni, Locality-sensitive hashing scheme based on p-stable distributions, in: Proceedings of the Twentieth Annual Symposium on Computational Geometry, SCG '04, ACM, New York, NY, USA, 2004, pp. 253–262, doi:10.1145/997817.997857.

[13] A. de Haro-García, N. García-Pedrajas, A divide-and-conquer recursive approach for scaling up instance selection algorithms, Data Min. Knowl. Discov. 18 (3) (2009) 392–418, doi:10.1007/s10618-008-0121-2.

[14] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.

[15] J. Derrac, S. García, F. Herrera, Stratified prototype selection based on a steady-state memetic algorithm: a study of scalability, Memetic Comput. 2 (3) (2010) 183–199, doi:10.1007/s12293-010-0048-1.

[16] J. Derrac, S. García, F. Herrera, A survey on evolutionary instance selection and generation, Int. J. Appl. Metaheuristic Comput. 1 (1) (2010) 60–92, doi:10.4018/jamc.2010102604.

[17] J. Derrac, N. Verbiest, S. García, C. Cornelis, F. Herrera, On the use of evolutionary feature selection for improving fuzzy rough set based prototype selection, Soft Comput. 17 (2) (2013) 223–238, doi:10.1007/s00500-012-0888-3.

[18] G. Ditzler, M. Roveri, C. Alippi, R. Polikar, Learning in nonstationary environments: a survey, IEEE Comput. Intell. Mag. 10 (4) (2015) 12–25.

[19] F. Dornaika, I.K. Aldine, Decremental sparse modeling representative selection for prototype selection, Pattern Recognit. 48 (11) (2015) 3714–3727, doi:10.1016/j.patcog.2015.05.018.

[20] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, John Wiley & Sons, New York City, United States, 2012.

[21] S. García, J. Derrac, J. Cano, F. Herrera, Prototype selection for nearest neighbor classification: taxonomy and empirical study, IEEE Trans. Pattern Anal. Mach. Intell. 34 (3) (2012) 417–435, doi:10.1109/TPAMI.2011.142.

[22] C. García-Osorio, A. de Haro-García, N. García-Pedrajas, Democratic instance selection: a linear complexity instance selection algorithm based on classifier ensemble concepts, Artif. Intell. 174 (5-6) (2010) 410–441, doi:10.1016/j.artint.2010.01.001.

[23] N. García-Pedrajas, A. de Haro-García, Boosting instance selection algorithms, Knowl.-Based Syst. 67 (2014) 342–360, doi:10.1016/j.knosys.2014.04.021.

[24] N. García-Pedrajas, J.A. Romero del Castillo, D. Ortiz-Boyer, A cooperative coevolutionary algorithm for instance selection for instance-based learning, Mach. Learn. 78 (3) (2010) 381–420, doi:10.1007/s10994-009-5161-3.

[25] G. Gates, The reduced nearest neighbor rule (corresp.), IEEE Trans. Inf. Theory 18 (3) (1972) 431–433, doi:10.1109/TIT.1972.1054809.

[26] A. Guillén, L. Herrera, G. Rubio, H. Pomares, A. Lendasse, I. Rojas, New method for instance or prototype selection using mutual information in time series prediction, Neurocomputing 73 (10-12) (2010) 2030–2038, doi:10.1016/j.neucom.2009.11.031. Subspace Learning / Selected papers from the European Symposium on Time Series Prediction.

[27] P. Hart, The condensed nearest neighbor rule (corresp.), IEEE Trans. Inf. Theory 14 (3) (1968) 515–516.

[28] P. Hernández-Leal, J. Carrasco-Ochoa, J. Martínez-Trinidad, J. Olvera-López, Instance rank based on borders for instance selection, Pattern Recognit. 46 (1) (2013) 365–375, doi:10.1016/j.patcog.2012.07.007.

[29] Y. Hochberg, A sharper Bonferroni procedure for multiple tests of significance, Biometrika 75 (4) (1988) 800–802, doi:10.1093/biomet/75.4.800.

[30] P. Indyk, R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, in: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC '98, ACM, New York, NY, USA, 1998, pp. 604–613, doi:10.1145/276698.276876.

[31] N. Jankowski, M. Grochowski, Comparison of instances selection algorithms I. Algorithms survey, in: L. Rutkowski, J. Siekmann, R. Tadeusiewicz, L. Zadeh (Eds.), Artificial Intelligence and Soft Computing - ICAISC 2004, Lecture Notes in Computer Science, volume 3070, Springer Berlin Heidelberg, 2004, pp. 598–603, doi:10.1007/978-3-540-24844-6_90.

[32] S.-W. Kim, J.B. Oommen, A brief taxonomy and ranking of creative prototype reduction schemes, Pattern Anal. Appl. 6 (3) (2003) 232–244, doi:10.1007/s10044-003-0191-0.

[33] M. Kordos, S. Białka, M. Blachnik, Instance selection in logical rule extraction for regression problems, in: L. Rutkowski, M. Korytkowski, R. Scherer, R. Tadeusiewicz, L.A. Zadeh, J.M. Zurada (Eds.), Artificial Intelligence and Soft Computing, Lecture Notes in Computer Science, volume 7895, Springer Berlin Heidelberg, 2013, pp. 167–175, doi:10.1007/978-3-642-38610-7_16.

[34] E. Kushilevitz, R. Ostrovsky, Y. Rabani, Efficient search for approximate nearest neighbor in high dimensional spaces, in: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC '98, ACM, New York, NY, USA, 1998, pp. 614–623, doi:10.1145/276698.276877.

[35] J. Leskovec, A. Rajaraman, J.D. Ullman, Mining of Massive Datasets, third ed., Cambridge University Press, Cambridge, United Kingdom, 2014.

[36] E. Leyva, A. González, R. Pérez, Knowledge-based instance selection: a compromise between efficiency and versatility, Knowl.-Based Syst. 47 (2013) 65–76, doi:10.1016/j.knosys.2013.04.005.

[37] E. Leyva, A. González, R. Pérez, Three new instance selection methods based on local sets: a comparative study with several approaches from a bi-objective perspective, Pattern Recognit. 48 (4) (2015) 1523–1537, doi:10.1016/j.patcog.2014.10.001.

[38] J. Li, Y. Wang, A new fast reduction technique based on binary nearest neighbor tree, Neurocomputing 149, Part C (2015) 1647–1657, doi:10.1016/j.neucom.2014.08.028.

[39] T. Liu, A.W. Moore, K. Yang, A.G. Gray, An investigation of practical approximate nearest neighbor algorithms, in: Advances in Neural Information Processing Systems, MIT Press, 2004, pp. 825–832.

[40] A. Lumini, L. Nanni, A clustering method for automatic biometric template selection, Pattern Recognit. 39 (3) (2006) 495–497, doi:10.1016/j.patcog.2005.11.004.

[41] E. Marchiori, Hit miss networks with applications to instance selection, J. Mach. Learn. Res. 9 (2008) 997–1017.

[42] E. Marchiori, Class conditional nearest neighbor for large margin instance selection, IEEE Trans. Pattern Anal. Mach. Intell. 32 (2) (2010) 364–370, doi:10.1109/TPAMI.2009.164.

[43] E. Merelli, M. Pettini, M. Rasetti, Topology driven modeling: the IS metaphor, Natural Comput. 14 (3) (2014) 421–430, doi:10.1007/s11047-014-9436-7.

[44] C. Moretti, K. Steinhaeuser, D. Thain, N.V. Chawla, Scaling up classifiers to cloud computers, in: Data Mining, 2008. ICDM '08. Eighth IEEE International Conference on, 2008, pp. 472–481, doi:10.1109/ICDM.2008.99.

[45] L. Nanni, A. Lumini, Prototype reduction techniques: a comparison among different approaches, Expert Syst. Appl. 38 (9) (2011) 11820–11828, doi:10.1016/j.eswa.2011.03.070.

[46] J. Olvera-López, J. Carrasco-Ochoa, J. Martínez-Trinidad, A new fast prototype selection method based on clustering, Pattern Anal. Appl. 13 (2) (2009) 131–141, doi:10.1007/s10044-008-0142-x.

[47] J. Olvera-López, J. Martínez-Trinidad, J. Carrasco-Ochoa, J. Kittler, Prototype selection based on sequential search, Intell. Data Anal. 13 (4) (2009) 599–631.

[48] S. Ougiaroglou, G. Evangelidis, D.A. Dervos, FHC: an adaptive fast hybrid method for k-NN classification, Logic J. IGPL (2015).

[49] D. Peralta, S. del Río, S. Ramírez-Gallego, I. Triguero, J.M. Benítez, F. Herrera, Evolutionary feature selection for big data classification: a MapReduce approach, Math. Probl. Eng. 501 (2015) 246139.

[50] J. Pérez-Benítez, J. Pérez-Benítez, J. Espina-Hernández, Novel data condensing method using a prototype's front propagation algorithm, Eng. Appl. Artif. Intell. 39 (2015) 181–197.

[51] J. Pérez-Rodríguez, A.G. Arroyo-Peña, N. García-Pedrajas, Simultaneous instance and feature selection and weighting using evolutionary computation: proposal and study, Appl. Soft Comput. 37 (2015) 416–443, doi:10.1016/j.asoc.2015.07.046.

[52] E. Pękalska, R.P. Duin, P. Paclík, Prototype selection for dissimilarity-based classifiers, Pattern Recognit. 39 (2) (2006) 189–208, doi:10.1016/j.patcog.2005.06.012. Part Special Issue: Complexity Reduction.

[53] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.

[54] G. Ritter, H. Woodruff, S. Lowry, T. Isenhour, An algorithm for a selective nearest neighbor decision rule, IEEE Trans. Inf. Theory 21 (6) (1975) 665–669.

[55] M.B. Stojanović, M.M. Božić, M.M. Stanković, Z.P. Stajić, A methodology for training set instance selection using mutual information in time series prediction, Neurocomputing 141 (2014) 236–245, doi:10.1016/j.neucom.2014.03.006.

[56] H. Tran-The, V.N. Van, M.H. Anh, Fast approximate near neighbor algorithm by clustering in high dimensions, in: Knowledge and Systems Engineering (KSE), 2015 Seventh International Conference on, 2015, pp. 274–279, doi:10.1109/KSE.2015.72.

[57] I. Triguero, J. Derrac, S. García, F. Herrera, A taxonomy and experimental study on prototype generation for nearest neighbor classification, IEEE Trans. Syst. Man Cybern. Part C 42 (1) (2012) 86–100, doi:10.1109/TSMCC.2010.2103939.

[58] I. Triguero, D. Peralta, J. Bacardit, S. García, F. Herrera, MRPR: a MapReduce solution for prototype reduction in big data classification, Neurocomputing 150, Part A (2015) 331–345, doi:10.1016/j.neucom.2014.04.078.

[59] A. Tsymbal, The problem of concept drift: definitions and related work, Comput. Sci. Department, Trinity College Dublin 106 (2004) 2.

[60] D.R. Wilson, T.R. Martinez, Instance pruning techniques, in: Machine Learning: Proceedings of the Fourteenth International Conference (ICML '97), Morgan Kaufmann, 1997, pp. 404–411.

[61] D.R. Wilson, T.R. Martinez, Reduction techniques for instance-based learning algorithms, Mach. Learn. 38 (3) (2000) 257–286, doi:10.1023/A:1007626913721.

[62] I.H. Witten, E. Frank, M.A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, third ed., Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2011.

[63] Q. Yang, X. Wu, 10 challenging problems in data mining research, Int. J. Inf. Technol. Decis. Mak. 05 (04) (2006) 597–604, doi:10.1142/S0219622006002258.

83–95 ScienceDirect www.elsevier.com/locate/knosys ( http://creativecommons.org/licenses/by/4.0/ 0323- 10.1016/j.eswa.2015.12.046 1145/1327452.1327494 10.1145/293347.293348 10.1137/0906011 787–806 31 1023/A:1014043630878 10.1016/j.patcog.2014.11.015 10.1016/j. 10.1109/TIT.1967.1053964 10.1145/997817.997857 0121- 7 10.1007/s12293-010-0048-1 10.4018/ 0888- 12–25 1016/j.patcog.2015.05.018 2012 10.1109/TPAMI.2011.142 2010.01.001 10.1016/j.knosys.2014.04.021 07/s10994-0 10.1109/TIT.1972.1054809 2030–2038, 515–516 10.1016/j.patcog.2012.07.007 10.1093/biomet/75.4.800 10.1145/276698.276876 90 044-0 16 10.1145/276698.276877 2014 10.1016/j.knosys.2013.04.005 2014.10.001 2014.08.028 825–832 10.1016/j.patcog.2005.11. 997–1017 1109/TPAMI.2009.164 7 10.1109/ICDM.2008.99 11820–11828, 0142- 599–631 [48] 246139 181–197 046 10.1016/j.patcog.2005.06. 1993 665–669 10.1016/j.neucom.2014.03.006 2015.72 10.1109/TSMCC.2010.2103939 10.1016/j.neucom.2014.04.078 106 404–411 1007626913721 2011 02258

References

Related documents

Using the constant diameter assumption, the numerical modelling is assessed by comparing through instants of the hybrid phase, the instantaneous spatial averaged fuel regression

Figure 1 illustrates the national vehicle fleet composition of each country in terms of type of vehicle (columns) and by type of engine fuel (lines). The percentages of

By virtue of the Ricardian Equivalence Theorem, Government bonds do not represent net wealth therefore household savings will increase to offset the government policy.. The

Finally, analysing the effects of windfall gains along various personal characteristics, we find that: (i) at younger and older ages, the effect of windfall gains on labour supply

Of the four SMS components; Safety Policy, Safety Risk Management, Safety Assurance, and Safety Promotion, “Where does the Aviation Safety Investigation organization fit?” Some may

Treatment of post-dural puncture headache with bilateral greater occipital nerve block. Akin Takmaz S, Unal Kantekin C, Kaymak C,

Consequentially, the effort to change the initial intention from browsing into a behavior in which a MOOC- taker participates in a single learning activity is a smaller step

The retrofit concept is based on energy efficiency measures (reduction of transmission, infiltration and ventilation losses), on a high ratio of renewable energy sources and on