Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys

Instance selection of linear complexity for big data

Álvar Arnaiz-González, José-Francisco Díez-Pastor, Juan J. Rodríguez, César García-Osorio

University of Burgos, Spain

Article info

Article history: Received 18 December 2015; Revised 3 April 2016; Accepted 30 May 2016; Available online 30 May 2016.

Keywords: Nearest neighbor, Data reduction, Instance selection, Hashing, Big data

Abstract

Over recent decades, database sizes have grown considerably. Larger sizes present new challenges, because machine learning algorithms are not prepared to process such large volumes of information. Instance selection methods can alleviate this problem when the size of the data set is medium to large. However, even these methods face similar problems with very large-to-massive data sets.

In this paper, two new algorithms with linear complexity for instance selection purposes are presented. Both algorithms use locality-sensitive hashing to find similarities between instances. While the complexity of conventional methods (usually quadratic, O(n²), or log-linear, O(n log n)) means that they are unable to process large-sized data sets, the new proposal shows competitive results in terms of accuracy. Even more remarkably, it shortens execution time, as the proposal manages to reduce complexity and make it linear with respect to the data set size. The new proposal has been compared with some of the best known instance selection methods for testing and has also been evaluated on large data sets (up to a million instances).

© 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license ( http://creativecommons.org/licenses/by/4.0/).

1. Introduction

The k nearest neighbor classifier (kNN) [11], despite its age, is still widely used in machine learning problems [9,17,20]. Its simplicity, straightforward implementation and good performance in many domains mean that it is still in use, despite some of its flaws [37]. The kNN algorithm belongs to the family of instance-based learning, in particular to the lazy learners, as it does not build a classification model but just stores all the training set [8]. Its classification rule is simple: for each new instance, assign the class according to the majority vote of its k nearest neighbors in the training set; if k = 1, the algorithm only takes the nearest neighbor into account [45]. This feature means that it requires a lot of memory and processing time in the classification phase [48]. Traditionally, two paths have been followed to speed up the process: either accelerate the calculation of the closest neighbors [3,4], or decrease training set size by strategically selecting only a small portion of instances or features [38].
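As an illustration of the classification rule just described, the following minimal sketch (ours, not the authors' code; all names are hypothetical) shows a brute-force kNN vote:

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    """Classify x by majority vote among its k nearest training instances (lazy learning: no model is built)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance from x to every stored instance
    neighbors = np.argsort(dists)[:k]             # indices of the k closest instances
    votes = Counter(y_train[i] for i in neighbors)
    return votes.most_common(1)[0][0]             # majority class; k = 1 reduces to the nearest neighbor
```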

Regarding the acceleration of algorithms, perhaps one of the most representative approaches is to approximate nearest neighbors, a broadly researched technique in which the nearest neighbor search is done over a sub-sample of the whole data set [56]. In this field, many algorithms have been proposed for approximate nearest neighbor problems [3,4,30,34,39].

Corresponding author. Fax: +34 947 258 910.

E-mail addresses: alvarag@ubu.es (Á. Arnaiz-González), jfdpastor@ubu.es (J.-F. Díez-Pastor), jjrodriguez@ubu.es (J.J. Rodríguez), cgosorio@ubu.es (C. García-Osorio).

The focus of this paper is on the second path, the reduction of data set size. The reason is that this reduction is beneficial for most methods rather than only those based on nearest neighbors. Although we will only consider the reduction of instances (instance selection) in this paper, the reduction could also be applied to attributes (feature selection), or even both at the same time [51]. The problem is that the fastest conventional instance selection algorithms have a computational complexity of at least O(n log n) and others are of even greater complexity.

The need for rapid methods for instance selection is even more relevant nowadays, given the growing sizes of data sets in all fields of machine learning applications (such as medicine, marketing or finance [43]), and the fact that the most commonly used data mining algorithms were developed when common databases contained at most a few thousand records. Currently, millions of records are the most common scenario, so most data mining algorithms encounter serious difficulties in their application. Thus, a new term has emerged, "Big Data", in reference to those data sets that, by volume, variability and speed, make the application of classical algorithms difficult [44]. With regard to instance selection, the solutions that have appeared so far to deal with big data problems adopt the 'divide and conquer' approach [13,22]. The algorithms proposed in the present paper offer a different approach, just a sequential but very quick and simple processing of each instance in the data set.

http://dx.doi.org/10.1016/j.knosys.2016.05.056


In particular, the major contribution of this paper is the use of Locality-Sensitive Hashing (LSH) to design two new algorithms, which offer two main advantages:

• Linear complexity: the use of LSH means a dramatic reduction in the execution time of the instance selection process. Moreover, these methods are able to deal with huge data sets due to their linear complexity.

• On-the-fly processing: one of the new methods is able to tackle the instances in one step. It is not necessary for all instances to fit in memory: a characteristic that offers a remarkable advantage in relation to big data.

The paper is organized as follows: Section 2 presents the background on reduction techniques, with special emphasis on the instance selection methods used in the experimental validation; Section 3 introduces the concept of locality-sensitive hashing, the basis of the proposed methods, which are presented in Section 4; Section 5 presents and analyzes the results of the experiments and, finally, Sections 6 and 7 set out the conclusions and future research, respectively.

2. Reduction techniques

Available data sets are progressively becoming larger in size. As a consequence, many systems have difficulties processing such data sets to obtain exploitable knowledge [23]. The high execution times and storage requirements of the current classification algorithms make them unusable when dealing with these huge data sets [28]. These problems can be decisive, if a lazy learning algorithm such as the nearest neighbor rule is used, and can even prevent results from being obtained. However, reducing the size of the data set by selecting a representative subset has two main advantages: it reduces the memory required to store the data and it accelerates the classification algorithms [19].

In the scientific literature, the term "reduction techniques" includes [61]: prototype generation [32]; prototype selection [52] (when the classifier is based on kNN); and (for other classifiers) instance selection [8]. While prototype generation replaces the original instances with new artificial ones, instance selection and prototype selection attempt to find a representative subset of the initial training set that does not lessen the predictive power of the algorithms trained with such a subset [45]. Prototype generation is not addressed in this paper; however, a complete review of it can be found in [57].

2.1. Instance selection

The aforementioned term "instance selection" brings together different procedures and algorithms that target the selection of a representative subset of the initial training set. There are numerous instance selection methods for classification, a complete review of which may be found in [21]. Instance selection has also been applied to both regression [2,33] and time series prediction [26,55].

According to the order in which instances are processed, instance selection methods can be classified into five categories [21]. If they begin with an empty set and add instances to the selected subset, by means of analyzing the instances in the training set, they are called incremental. The decremental methods, on the contrary, start with the original training data set and remove those instances that are considered superfluous or unnecessary. Batch methods are those in which no instance is removed until all of them have been analyzed; instances are simply marked for removal if the algorithm determines that they are not needed, and at the end of the process only the unmarked instances are kept. Mixed algorithms start with a preselected set of instances. The process then decides either to add or to delete the instances. Finally, fixed methods are a sub-family of mixed ones, in which the number of additions and removals is the same. This approach allows them to maintain a fixed number of instances (more frequent in prototype generation).

Table 1
Summary of state-of-the-art instance selection methods used in the experimental setup (taxonomy from [21]; computational complexity from [31] and authors' papers).

Strategy        Direction     Algorithm   Complexity    Year   Reference
Condensation    Incremental   CNN         O(n³)         1968   [27]
Condensation    Incremental   PSC         O(n log n)    2010   [46]
Condensation    Decremental   RNN         O(n³)         1972   [25]
Condensation    Decremental   MSS         O(n²)         2002   [6]
Hybrid          Decremental   DROP1-5     O(n³)         2000   [60]
Hybrid          Batch         ICF         O(n²)         2002   [8]
Hybrid          Batch         HMN-EI      O(n²)         2008   [41]
Hybrid          Batch         LSBo        O(n²)         2015   [37]

Considering the type of selection, three categories may be distinguished. This criterion is mainly correlated with the points that they remove: either border points, central points, or otherwise. Condensation techniques try to retain border points. Their underlying idea is that internal points do not affect classification, because the boundaries between classes are the keystone of the classification process. Edition methods may be considered the opposite of condensation techniques, as their aim is to remove those instances that are not well classified by their nearest neighbors. The edition process achieves smoother boundaries as well as noise removal. In the middle of those approaches are hybrid algorithms, which try to maintain or even to increase the accuracy capability of the data set, by removing both internal and border points [21].

Evolutionary approaches for instance selection have shown remarkable results in both reduction and accuracy. A complete survey of them can be found in [16]. However, the main limitation of those methods is their computational complexity [36]. This drawback is the reason why they are not taken into account in this study, because the methods it proposes are oriented towards large data sets.

In the remaining part of this section, we give further details of the most representative methods used in the experimental setup. A summary of the methods considered in the study can be seen in Table 1.

2.1.1. Condensation

The algorithm of Hart, Condensed Nearest Neighbor (CNN) [27], is considered the first formal proposal of instance selection for the nearest neighbor rule. The concept of training set consistency is important in this algorithm and is defined as follows: given a non-empty set X (X ≠ ∅), a subset S of X (S ⊆ X) is consistent with respect to X if, using the subset S as training set, the nearest neighbor rule can correctly classify all instances in X. Following this definition of consistency, if we consider the set X as the training set, a condensed subset should have the properties of being consistent and, ideally, smaller than X. After CNN appeared, other condensation methods emerged with the aim of decreasing the size of the condensed data set, e.g. Reduced Nearest Neighbor (RNN) [25]. One of the latest is Prototype Selection by Clustering (PSC) [46], which uses clustering to speed up the selection process. So, the use of clustering gives a high efficiency to PSC, if compared against state-of-the-art methods, and better accuracy than other clustering-based methods such as CLU [40].
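The consistency idea behind CNN can be made concrete with a short sketch. The code below is one common rendering of Hart's condensation loop (a simplified sketch we add for illustration, not the exact original formulation):

```python
import numpy as np

def cnn_condense(X, y):
    """Hart's CNN (simplified sketch): keep adding instances misclassified by the current subset."""
    S_idx = [0]                                  # seed the condensed set with one instance
    changed = True
    while changed:                               # repeat until a full pass adds nothing
        changed = False
        for i in range(len(X)):
            if i in S_idx:
                continue
            # 1NN prediction of X[i] using only the current condensed subset S
            d = np.linalg.norm(X[S_idx] - X[i], axis=1)
            if y[S_idx][np.argmin(d)] != y[i]:   # misclassified by S, so it must be kept
                S_idx.append(i)
                changed = True
    return np.array(S_idx)
```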

In [6], the authors proposed a modification to the definition of a selective subset [54], for a better approximation to decision borders. The selective subset can be thought of as similar to the idea of the condensed algorithm of Hart, but applying a condition stronger than the condition of consistency. The aim is to find the selected instances in an easier way, which is less sensitive to the random initialization of S and the order of exploration of X in Hart's algorithm. The subset obtained in this way is called the selective subset (SS).

A subset S of the training set X is a selective subset (SS) if it satisfies the following conditions:

1. S is consistent (as in Hart's algorithm).

2. All instances in the original training set, X, are closer to a selective neighbor (a member of S) of the same class than to any instance of a different class in X.

Then, the authors present a greedy algorithm which attempts to find selective instances starting with those instances of the training set that are close to the decision boundary of the nearest neighbor classifier. The algorithm presented is an efficient alternative to the Selective algorithm [54] and it is usually able to select better instances (the ones closer to the boundaries).

2.1.2. Hybrid

One problem that arises when condensation methods are used is their noise sensitivity, while hybrid methods have specific mechanisms to make them more robust to noise [37].

The DROP (Decremental Reduction Optimization Procedure) [60] family of algorithms comprises some of the best instance selection methods for classification [8,47,50]. The instance removal criterion is based on two relations: associates and nearest neighbors. The associate relation is the inverse of nearest neighbors: those instances p that have q as one of their nearest neighbors are called associates of q. The set of nearest neighbors of one instance is called the neighborhood of the instance. For each instance, its list of associates is a list of all the instances that have that particular instance in their neighborhood.
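The associate relation can be made concrete with a short sketch (illustrative only, with hypothetical names): given each instance's list of nearest neighbors, the associates of q are simply the instances that list q.

```python
from collections import defaultdict

def build_associates(neighbors):
    """Invert a neighborhood map: associates[q] = instances that have q among their nearest neighbors."""
    associates = defaultdict(set)
    for p, neighbor_list in neighbors.items():
        for q in neighbor_list:
            associates[q].add(p)
    return associates

# Example: instance 0 has neighbors {1, 2}, so 0 is an associate of both 1 and 2.
print(build_associates({0: [1, 2], 1: [0], 2: [0]}))
```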

Marchiori proposed a new graph-based representation of the data set called Hit Miss Networks (HMN) [41]. The graph has a directed edge from each instance to its nearest neighbor in each of the different classes, with one edge per class. The information in the graph was used to define three new hybrid algorithms: HMN-E, HMN-C and HMN-EI. A couple of years later, HMNs were used to define a new information-theoretic instance scoring method and, with it, a new instance selection method called Class Conditional Nearest Neighbor (CCIS) [42]. According to [21], HMN-EI is able to achieve more accurate data sets than CCIS, which is the reason why HMN-EI was used in the experimental setup.

The local-set (LS) concept, proposed for the very first time in [7], is a powerful tool for some machine learning tasks, including instance selection. The local set of an instance x contains all those instances which are closer to x than its nearest neighbor of a different class, its nearest enemy. The selection rule of the Iterative Case Filtering algorithm (ICF) [7] uses local sets to build two sets: coverage and reachability. These two concepts are closely related to the neighborhood and associate list used in DROP algorithms. The coverage of an instance is its LS, which can be seen as a neighborhood of the instance that, instead of considering a fixed number k of neighbors, includes all instances closer to the instance than its closest enemy. The reachable set of an instance is its set of associates; the coverage set of an instance is its neighborhood. The deletion rule is as follows: an instance is removed from the data set if its reachable set (its set of associates) is bigger than its coverage (its 'neighborhood'). This rule means that the algorithm removes an instance if another object exists that generalizes its information. To address the problem of noisy data sets, both ICF and DROP3 begin with a noise-filter stage. Recently, Leyva et al. [37] presented three new instance selection methods based on LSs. Their hybrid approach, which offers a good balance between reduction and accuracy, is called Local Set Border Selector (LSBo) and it uses a heuristic criterion: the instances in the boundaries between classes tend to have greater LSs. As is usual in hybrid methods, LSBo starts with a noise filtering algorithm presented in the same paper, called LSSm.
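The ICF deletion rule described above (remove an instance when its reachable set is larger than its coverage) can be sketched as follows. This is our schematic reading of the rule, not the full ICF algorithm; the `local_set` mapping is assumed to be precomputed.

```python
def icf_deletion_pass(instances, local_set):
    """One ICF-style pass: coverage = local set; reachability = inverse (associate) relation."""
    coverage = {x: set(local_set[x]) for x in instances}
    reachable = {x: set() for x in instances}
    for x in instances:
        for y in coverage[x]:
            reachable[y].add(x)   # y is in the local set of x, so x belongs to the reachable set of y
    # Mark for removal every instance whose information is generalized by other instances.
    return {x for x in instances if len(reachable[x]) > len(coverage[x])}
```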

2.2. Scaling up instance selection

The main drawback of instance selection methods is their complexity, which is quadratic, O(n²), where n is the number of instances [22] or, at best, log-linear, O(n log n); thus, the majority of them are not applicable to data sets with hundreds or even many thousands of instances [15]. Table 1 summarizes the computational complexity of the instance selection methods used in the experimental section.

One approach to deal with massive data sets is to divide the original problem into smaller subsets of instances, known as stratification. The underlying idea of these methods is to split the original data set into disjointed subsets, and then an instance selection algorithm is applied to each subset [10,13,22,24]. This approach is used in [22], where a method was proposed that addressed the splitting process, using Grand Tour [5] theory, to achieve linear complexity.

The problem known as big data refers to the challenges and difficulties that arise when huge amounts of data are processed. One way to accelerate instance selection methods and to be able to cope with massive data sets is to adapt them to parallel environments [49]. To do so, the way that algorithms work has to be redesigned. The MapReduce paradigm offers a robust framework with which to process huge data sets over clusters of machines. Following up on this idea, a new proposal was presented recently by Triguero et al. [58].

3. Locality-sensitive hashing

Locality-sensitive hashing (LSH) is an efficient method for checking similarity between elements. It makes particular use of hash functions that, unlike those used in other applications of hashing,¹ seek to allocate similar items to the same bucket with a high probability, and at the same time to greatly reduce the probability of assigning dissimilar items to the same bucket [35].

LSH is commonly used to increase the efficiency of nearest neighbor calculation [3,21]. An indirect benefit of LSH for instance selection algorithms is the speeding up of the nearest neighbor calculation required in most of these sorts of algorithms. However, the complexity of the algorithms remains unchanged, since the loop nesting and structures of the algorithms remain the same. It is only the kNN step that is improved.

What we propose in this paper is a novel use of LSH, not merely as support for the calculation of nearest neighbors, but as an operation that defines the nature of the new instance selection algorithms. Basically, the idea is to make the instance selection on each of the buckets that will be obtained by LSH when applied to all instances. This process permits the selection of instances using a unique processing loop of the data set, thereby giving it linear complexity. So a reasonable question arises: when a classifier is trained with a selected subset obtained by this approach, will its prediction capabilities decrease? This article offers an experimental response to this question.² But before giving the details of our proposal, let us look at a brief introduction to the underlying theory of LSH.

¹ The aim of conventional cryptographic hash functions is to avoid the collision of items in the same bucket.

² In any case, note that even a certain degradation of classifier performance would be acceptable, if the new algorithm achieved a substantial acceleration or reduction in storage [9]; it is often better to gain a quick approximation within a reasonable time than an optimal solution when it is too late to use it.

Fig. 1. Expected behavior of the probabilities of a (d1, d2, p1, p2)-sensitive function. The function will assign the same value to two instances with a probability greater than p1 if their distance is shorter than d1. The function will assign the same value to two instances with a probability lower than p2 if their distance is greater than d2. For distances between d1 and d2, there is no restriction regarding the values the function can assign to the instances.

3.1. Locality-sensitive functions

In this section we follow [35] to formally define the concept of local sensitivity and the process of amplifying a locality-sensitive family of functions.

Given a set of objects S and a distance measure D, a family of hash functions H = {h : S → U} is said to be (d1, d2, p1, p2)-sensitive if the following properties hold for all functions h in the family H:

• For all x, y in S, if D(x, y) ≤ d1, then the probability that h(x) = h(y) is at least p1.

• For all x, y in S, if D(x, y) > d2, then the probability that h(x) = h(y) is at most p2.

In this definition, nothing refers to what happens when the distance between the objects is between d1 and d2 (see the representation in Fig. 1). The distances d1 and d2 can be as close as possible, but the cost will be that p1 and p2 are also closer. However, as shown below, it is possible to combine families of hash functions that separate the probabilities p1 and p2 without modifying the distances d1 and d2.

Given a (d1, d2, p1, p2)-sensitive family of hash functions H, it is possible to obtain a new family H′ using the following amplification operations:

AND-construction. The functions h′ in H′ are obtained by combining a fixed number r of functions {h1, h2, ..., hr} in H. Now, h′(x) = h′(y) if and only if hi(x) = hi(y) for all i. If the independence of the functions in H can be guaranteed, the new family of functions H′ will be (d1, d2, (p1)^r, (p2)^r)-sensitive.

OR-construction. The functions h′ in H′ are obtained by combining a fixed number b of functions {h1, h2, ..., hb} in H. Now, h′(x) = h′(y) if and only if hi(x) = hi(y) for at least one i. If the independence of the functions in H can be guaranteed, the new family of functions H′ will be (d1, d2, 1 − (1 − p1)^b, 1 − (1 − p2)^b)-sensitive.

The AND-construction decreases the probabilities and the OR-construction increases them. However, if r and b are properly chosen, then with the chaining of constructions the probability p1 may be brought closer to 1, while the probability p2 will stay reasonably close to 0.

Fig. 2. Two points (A, B) at a distance d greater than the bucket width w have a small chance of being hashed to the same bucket.

In the experimental setup, the hash functions in the base family were obtained using the following equation [12]:

h_{a,b}(x) = ⌊(a · x + b) / w⌋    (1)

where a is a random vector (Gaussian distribution with mean 0 and standard deviation 1), b is a random real value from the interval [0, w] and w is the width of each bucket in the hash table.
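A minimal sketch of this base hash family (our illustration of Eq. (1), assuming the standard p-stable construction of [12]; all names are ours):

```python
import numpy as np

def make_base_hash(dim, w, rng=np.random.default_rng()):
    """Return one h_{a,b}(x) = floor((a . x + b) / w), with a ~ N(0, 1)^dim and b ~ U[0, w]."""
    a = rng.standard_normal(dim)        # random projection direction
    b = rng.uniform(0.0, w)             # random offset inside one bucket
    return lambda x: int(np.floor((np.dot(a, x) + b) / w))

h = make_base_hash(dim=2, w=4.0)
print(h(np.array([0.3, 0.7])))          # integer bucket index for this instance
```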

This equation gives a (w/2, 2w, 1/2, 1/3)-sensitive family. The reason for these numbers is as follows (suppose, for simplicity, a 2-dimensional Euclidean space). If the distance d between two points is exactly w/2 (half the width of the buckets), the smallest probability of the two points falling in the same segment occurs for θ = 0, and in this case the probability is 0.5, since d is exactly w/2. For angles greater than 0, this probability will be even higher; in fact, it will be 1 for θ = 90. And for distances shorter than w/2, the probability equally increases. So the lower boundary for this probability is 1/2. If the distance d is exactly 2w (twice the width of the bucket), the only chance for both points to fall in the same bucket is that their distance, once projected on the segment, is lower than w, which means that cos θ must be lower than 0.5, since the projected distance is d cos θ and d is exactly 2w. For θ in the interval 0 to 60, cos θ is greater than 0.5, so the only chance of cos θ being lower than 0.5 is that θ is in the interval [60, 90], and the chance of that happening is at most 1/3. For distances greater than 2w, the probabilities are even lower. So the upper boundary of this probability is 1/3. This reasoning is reflected in Fig. 2.
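The two bounds can be checked numerically with a quick Monte Carlo simulation under the Gaussian projection of Eq. (1). This is an illustrative check that we add here, not part of the original study:

```python
import numpy as np

def collision_rate(dist, w=1.0, trials=20_000, rng=np.random.default_rng(0)):
    """Estimate P[h(x) = h(y)] for two 2-D points at the given distance."""
    x = np.zeros(2)
    y = np.array([dist, 0.0])
    hits = 0
    for _ in range(trials):
        a = rng.standard_normal(2)       # random direction, as in Eq. (1)
        b = rng.uniform(0.0, w)          # random offset
        hits += np.floor((a @ x + b) / w) == np.floor((a @ y + b) / w)
    return hits / trials

print(collision_rate(0.5))   # distance w/2: estimate stays above the 1/2 lower bound
print(collision_rate(2.0))   # distance 2w: estimate stays below the 1/3 upper bound
```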

By using the (w/2, 2w, 1/2, 1/3)-sensitive family previously described, we have computed the probabilities p1 and p2 for the AND-OR construction with a number of functions from 1 to 10. Fig. 3 shows the probabilities p1 (a) and p2 (b) for the case of the chaining of an OR-construction just after an AND-construction, and the difference between these two probabilities (c). The row number indicates the number of functions used in the AND-construction, while the column number indicates the number of functions used in the OR-construction.
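The values plotted in Fig. 3 follow directly from the amplification formulas of Section 3.1; the short sketch below (ours) reproduces that computation for a base collision probability p:

```python
def amplified(p, r, b):
    """Collision probability after an AND-construction of r base functions chained with an OR-construction of b."""
    return 1.0 - (1.0 - p ** r) ** b

# Grid analogous to Fig. 3: rows r = 1..10 (AND), columns b = 1..6 (OR), for p1 = 1/2 and p2 = 1/3.
p1_grid = [[amplified(1 / 2, r, b) for b in range(1, 7)] for r in range(1, 11)]
p2_grid = [[amplified(1 / 3, r, b) for b in range(1, 7)] for r in range(1, 11)]
print(p1_grid[9][4], p2_grid[9][4])   # example: the AND=10, OR=5 configuration
```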

4. New instance selection algorithms based on hashing

This section presents the algorithms proposed in this work: LSH-IS-S and LSH-IS-F. The first completes the selection process in a single pass, analyzing each instance consecutively. It processes instances in one pass, so not all instances need to fit in memory. The second performs two passes: in the initial one, it counts the instances in each bucket; in the second, it completes the instance selection with this information. The complexity of both algorithms is linear, O(n) (note that this is true even for the second algorithm, i.e. although two passes are performed).

Fig. 3. Probabilities p1 and p2 and the difference between them. The darker the color the higher the value. Each cell gives the value for the chaining of an OR-construction (number of combined basic functions on the x-axis) after an AND-construction (on the y-axis the number of combined functions).

Both algorithms can be seen as incremental methods, due to the fact that the selected data set is formed by successive additions to the empty set. However, the second one conforms more closely to batch processing because it analyzes the impact of the removal on the whole data set.

The main advantage of the presented methods is the drastic reduction in execution time. The experimental results show a significant difference when they are compared against state-of-the-art instance selection algorithms.

Algorithm 1: LSH-IS-S – Instance selection algorithm by hashing in one-pass processing.

Input: A training set X = {(x1, y1), ..., (xn, yn)}, set G of hash function families
Output: The set of selected instances S ⊆ X

1  S = ∅
2  foreach instance x ∈ X do
3      foreach function family g ∈ G do
4          u ← bucket assigned to x by family g
5          if there is no other instance of the same class of x in u then
6              Add x to S
7              Add x to u
8  return S

4.1. LSH-IS-S: one-pass processing

As shown in Algorithm 1, the inputs of the LSH-IS-S method are: a set of instances to select and a set of families of hash functions. The loop processes each instance x of X, using the function families to determine the bucket u to which the instance belongs.³ If in the bucket u assigned to the instance there is no other instance of the same class of x, x is selected and added to S and to the bucket u. The algorithm ends when all instances in X have been processed. Note that each instance is processed only once, which grants an extremely fast performance at the expense of not analyzing in detail the instances that are selected in each bucket. Instances are analyzed in sequence without needing information on other instances. This process means that the method may be used in a single-pass process, without requiring the whole data set to fit in the memory.

³ The bucket identifier u given to an instance by a family g ∈ G can be thought of as the concatenation of all bucket identifiers given by the hash functions in g, since the function families in G are obtained by using an AND-construction on base functions obtained using Eq. (1). The OR-construction is implemented in the foreach loop at line 3 of Algorithm 1.
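A compact Python rendering of Algorithm 1 is sketched below. It is our illustration under the stated assumptions: `hash_families` is a list of callables, each mapping an instance to a bucket identifier (e.g. the tuple of AND-combined base hashes).

```python
from collections import defaultdict

def lsh_is_s(X, y, hash_families):
    """LSH-IS-S sketch: single pass over X, keeping one instance per class in every bucket."""
    selected = set()                                      # indices of the retained instances (the set S)
    buckets = [defaultdict(set) for _ in hash_families]   # per family g: bucket id -> classes already stored
    for i, (x, label) in enumerate(zip(X, y)):
        for g, table in zip(hash_families, buckets):
            u = g(x)                                      # bucket assigned to x by family g
            if label not in table[u]:                     # no other instance of the same class in u
                selected.add(i)                           # add x to S ...
                table[u].add(label)                       # ... and record its class in the bucket
    return selected
```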

4.2. LSH-IS-F: a more informed selection

The algorithm explained in the previous section is remarkably fast and allows instances to be processed as they arrive, in one pass. On the other hand, because of how it works, it does not use all the information that may be relevant to decide which instances to choose. For example, the algorithm has no control over the number of instances of each class that go to each bucket, because once an instance of a class is selected, it discards other instances of the same class that may come later.

Algorithm 2: LSH-IS-F – Instance selection algorithm by hashing with two passes.

Input: A training set X = {(x1, y1), ..., (xn, yn)}, set G of hash function families
Output: The set of selected instances S ⊆ X

1  S = ∅
2  foreach instance x ∈ X do
3      foreach function family g ∈ G do
4          u ← bucket assigned to x by family g
5          Add x to u
6  foreach function family g ∈ G do
7      foreach bucket u of g do
8          foreach class y with some instance in u do
9              Iy ← all instances of class y in u
10             if |Iy| > 1 then
11                 Add to S one random instance of Iy
12 return S

LSH-IS-F (see Algorithm 2) is an evolution of LSH-IS-S. In this method, one-pass processing is replaced by a more informed selection. The first loop is similar to LSH-IS-S but, instead of directly selecting instances, it first records the bucket to which each instance belongs. When there is only one instance of a class, the instance is rejected; otherwise, if two or more instances of the same class are present, one of them is randomly chosen. The idea here is to give the algorithm some tolerance of the presence of noise in the input data set.

Fig. 4. Example to illustrate the behavior of both algorithms. (a) Initial instances, two buckets are identified by LSH and the line shows the boundary. (b) Instances selected by LSH-IS-S. (c) Instances selected by LSH-IS-F.

The execution time of this method is not much larger than that of the previous method, since the number of buckets is much lower than the number of instances, although the differences in execution time increase with the number of OR functions.
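In the same spirit, a sketch of Algorithm 2 (again our illustration, with the same hypothetical `hash_families` interface as above):

```python
import random
from collections import defaultdict

def lsh_is_f(X, y, hash_families, rng=random.Random(0)):
    """LSH-IS-F sketch: the first pass fills the buckets, the second keeps one random instance per class,
    discarding classes represented by a single (possibly noisy) instance in a bucket."""
    buckets = [defaultdict(lambda: defaultdict(list)) for _ in hash_families]
    for i, (x, label) in enumerate(zip(X, y)):             # pass 1: record bucket membership only
        for g, table in zip(hash_families, buckets):
            table[g(x)][label].append(i)
    selected = set()
    for table in buckets:                                  # pass 2: one random representative per class
        for u, per_class in table.items():
            for label, idxs in per_class.items():
                if len(idxs) > 1:                          # singletons are treated as noise and dropped
                    selected.add(rng.choice(idxs))
    return selected
```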

4.3. Behavior of proposed methods

The aim of this section is to try to shed some light on the behavior of the new algorithms proposed in the paper. As previously stated, LSH-IS-S makes the selection of instances in one pass, selecting one instance of each class in each bucket. On the other hand, LSH-IS-F tries to avoid retaining noisy instances. The more informed selection criterion of LSH-IS-F allows it to remove instances that can be harmful for the classification.

In Fig. 4 we show an example with nine instances: four of one class (crosses) and five of the other (circles). The LSH algorithm uses two buckets, and the line represents the boundary between them. LSH-IS-S selects one instance of each class in each bucket (b), while LSH-IS-F does not select the instance of the cross class because it is identified as noise.

Fig. 5 illustrates the effect of the algorithms on the XOR data set [42], formed by 400 instances, 200 per class. An outlier was added and highlighted with a gray square. As indicated above, LSH-IS-S retains the instance, while LSH-IS-F removes it.

With the aim of illustrating the behavior of the proposed instance selection methods when the number of hash functions increases, we used the Banana data set. It has two numeric features and two classes; the size of the data set is 5300 instances (see Table 3). Fig. 6(a) shows the original data set. Despite the fact that two clusters can be easily identified, a high overlap exists between the two classes in some regions [21]. Table 2 summarizes the number of instances selected by both algorithms when only one OR function is used and the number of AND functions changes. LSH-IS-F retains fewer instances, because those instances of one class isolated in one bucket with instances of the other class are identified as noise and therefore deleted.

Table 2
Number of Banana data set instances that the proposed algorithms retain by the number of AND functions.

Algorithm   Number of AND functions
            2     4     6     8     10
LSH-IS-S    25    123   249   466   684
LSH-IS-F    24    110   225   423   627

Figs. 6 and 7 show the instances retained by both algorithms, LSH-IS-S and LSH-IS-F respectively, when only one OR function is used and different numbers of AND functions: 2, 4, 6, 8 and 10. The number of retained instances increases with the number of functions. This behavior is quite interesting, because it enables the user to choose whether more or fewer instances are retained, by varying the number of functions that are used.

5. Experimental study

This section presents the experimental study performed to evaluate the new proposed methods. We compared them against seven well-known state-of-the-art instance selection algorithms in a study performed in Weka [62]. The instance selection methods included in the experiments were: CNN, ICF, MSS, DROP3, PSC, HMN-EI, LSBo and the two approaches based on hashing. The parameters selected for the algorithms were those recommended by the authors: the number of nearest neighbors used in ICF and DROP3 was set to k = 3, and the number of clusters for PSC was set to 6r (where r is the number of classes of the data set). Evolutionary algorithms were not included in the experiments, due to their high computational cost.

For the experiments, we used 30 data sets from the Keel repository [1] that have at least 1000 instances. Table 3 summarizes the data sets: name, number of features, number of instances and the accuracy given by two classifiers (using ten-fold cross-validation): the nearest neighbor classifier with k = 1 and J48, a classifier tree (the Weka implementation of C4.5 [53]). The last five data sets are huge (below the dashed line), with more than 299,000 instances; the traditional instance selection methods are unable to address them. The only transformation carried out was the normalization of all input features, to set their values at between 0 and 1.

We used the nearest neighbor classifier (1NN), as most instance selection methods have been designed for that classifier (k = 1 in kNN) [37,51]. Moreover, we used the J48 classifier tree, to evaluate the extent to which the instances selected by the algorithms were suitable for training other classifiers.

Fig. 5. (a) Original. (b) LSH-IS-S. (c) LSH-IS-F.

Fig. 6. The number of instances selected by LSH-IS-S as the number of functions increases: (a) Banana (Orig.), (b) AND=2, (c) AND=4, (d) AND=6, (e) AND=8, (f) AND=10.

Fig. 7. The instances retained by LSH-IS-F as the number of AND functions increases: (a) Banana (Orig.), (b) AND=2, (c) AND=4, (d) AND=6, (e) AND=8, (f) AND=10.

Table 3
Summary of data sets characteristics: name, number of features, number of instances and accuracy (1NN). The last five data sets, below the dashed line, are huge problems.

                            #attributes                          Accuracy
    Data set                Continuous  Nominal   #instances     1NN      J48
 1  German                  7           13        1000           72.90    71.80
 2  Flare                   0           11        1066           73.26    73.55
 3  Contraceptive           9           0         1473           42.97    53.22
 4  Yeast                   8           0         1484           52.22    56.74
 5  Wine-quality-red        11          0         1599           64.85    62.04
 6  Car                     0           6         1728           93.52    92.36
 7  Titanic                 3           0         2201           79.06    79.06
 8  Segment                 19          0         2310           97.23    96.62
 9  Splice                  0           60        3190           74.86    94.17
10  Chess                   0           35        3196           72.12    81.85
11  Abalone                 7           1         4174           19.84    20.72
12  Spam                    0           57        4597           91.04    92.97
13  Wine-quality-white      11          0         4898           65.40    58.23
14  Banana                  2           0         5300           87.21    89.04
15  Phoneme                 5           0         5404           90.19    86.42
16  Page-blocks             10          0         5472           95.91    97.09
17  Texture                 40          0         5500           99.04    93.13
18  Optdigits               63          0         5620           98.61    90.69
19  Mushroom                0           22        5644           100.00   100.00
20  Satimage                37          0         6435           90.18    86.28
21  Marketing               13          0         6876           28.74    31.06
22  Thyroid                 21          0         7200           92.35    99.71
23  Ring                    20          0         7400           75.11    90.95
24  Twonorm                 20          0         7400           94.81    85.12
25  Coil2000                85          0         9822           90.62    93.95
26  Penbased                16          0         10,992         99.39    96.53
27  Nursery                 0           8         12,960         98.13    97.13
28  Magic                   10          0         19,020         80.95    85.01
29  Letter                  16          0         20,000         96.04    87.98
30  KR vs. K                0           6         28,058         73.05    56.58
--------------------------------------------------------------------------------
31  Census                  7           30        299,285        92.70    95.42
32  KDDCup99                33          7         494,021        99.95    99.95
33  CovType                 54          0         581,012        94.48    94.64
34  KDDCup99 1M             33          7         1,000,000      99.98    99.98
35  Poker                   5           5         1,025,010      50.61    68.25

As shown in Section 3, there are ways of combining the hash function families that appear more promising than others (see Fig. 3). However, it is unclear which of them will achieve the best results in combination with the proposed algorithms. Therefore, we conducted a study with 60 combinations: AND-constructions combining between 1 and 10 hash functions, and OR-constructions combining 1 to 6 functions obtained by the previous AND-construction, avoiding constructions with too many functions and, consequently, reducing the computational cost.

The subsets selected by the algorithms were used to build a classifier (1NN), and the average rank [14] was computed over the accuracy of all 60 combinations. Average ranks were calculated as follows: the results of the experiments were sorted, one for the best method, two for the second, and so on. In the case of a tie, the values of the ranks were added up and divided by the number of methods that tied. When the ranking for each data set had been calculated, the average for each method was computed; better methods have rankings closer to one (a sketch of this ranking procedure is given after the list below). The results of the rankings are shown in Fig. 8. Each cell represents the ranking value for a specific combination of AND-OR-constructions, where the number of functions in the OR-constructions is shown by the x-axis and the number of functions in the AND-constructions is shown by the y-axis. The darker the cell the higher its ranking (lower values are better). The best configuration is different for each algorithm:

• LSH-IS-S: the best configuration is one that uses OR-constructions of six functions obtained using an AND-construction on ten functions of the base family (functions obtained using Eq. (1)).

• LSH-IS-F: the best results were obtained using OR-constructions of five functions obtained by combining by AND-construction ten functions of the base family.
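The ranking procedure described above can be written down in a few lines; this is our sketch of the standard average-rank computation with tie handling [14], not the authors' code:

```python
def average_ranks(results):
    """results[d][m] = accuracy of method m on data set d; returns the mean rank of each method (1 = best)."""
    methods = list(next(iter(results.values())).keys())
    ranks = {m: 0.0 for m in methods}
    for per_dataset in results.values():
        ordered = sorted(methods, key=lambda m: per_dataset[m], reverse=True)   # higher accuracy -> lower rank
        i = 0
        while i < len(ordered):
            j = i
            while j < len(ordered) and per_dataset[ordered[j]] == per_dataset[ordered[i]]:
                j += 1                                      # group of tied methods occupies positions i..j-1
            tie_rank = (i + 1 + j) / 2                      # average of the ranks i+1 .. j
            for m in ordered[i:j]:
                ranks[m] += tie_rank
            i = j
    return {m: ranks[m] / len(results) for m in methods}

print(average_ranks({"d1": {"A": 0.9, "B": 0.8, "C": 0.8}}))   # A -> 1.0; B and C tie -> 2.5 each
```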

Fig. 9 shows how the execution time increases, on average, for the proposed algorithms: LSH-IS-S (gray) and LSH-IS-F (black). The higher the number of AND functions, the bigger the gap between LSH-IS-S and LSH-IS-F. This behavior is explained because LSH-IS-F has one more loop than LSH-IS-S (see pseudocodes 1 and 2), which is used to go through all the buckets counting the number of instances of each class. The number of buckets searched increases with the number of hash functions.

Ten-fold cross-validation was applied to the instance selection methods under study. The performance measures were as follows:

• accuracy achieved by 1NN and J48 classifiers trained with the selected subset;

• filtering time taken by instance selection;

• reduction achieved by instance selection methods (size of the selected subset).


Fig. 8. Average rank over accuracy of the proposed methods for the different configurations of AND-OR constructions: (a) LSH-IS-S, (b) LSH-IS-F. The darker the cell, the higher the ranking (lower is better). Each cell represents an AND-OR-construction where the column is the number of functions in the OR-construction and the row is the number of functions in the AND-construction.

Fig. 9. Average execution time of the proposed algorithms as the number of AND-functions increases. LSH-IS-S is represented in gray and LSH-IS-F in black, with different lines and marks for the different numbers of OR-functions.

Table 4
Average ranks and Hochberg procedure over accuracy: 1NN.

Algorithm   Ranking   p_Hoch
HMN-EI      2.92      –
LSBo        3.85      0.1869
LSH-IS-F    4.45      0.0602
MSS         4.58      0.0553
LSH-IS-S    4.98      0.0139
DROP3       5.03      0.0138
CNN         5.17      0.0088
ICF         5.75      0.0004
PSC         8.27      0.0000

According to the accuracy of the 1NN classifier (see Table 4), the best four algorithms were HMN-EI followed by LSBo, LSH-IS-F and MSS. According to the Hochberg procedure [29], differences between them were not significant at a 0.05 significance level. However, differences between LSBo and the other methods were significant. When J48 was used as the classifier (see Table 5), the best six algorithms were LSH-IS-F followed by LSH-IS-S, HMN-EI, LSBo, MSS and CNN; the differences between them were not significant at 0.05. Furthermore, as can be seen, the least accurate model is PSC for both classifiers.

Table 5
Average ranks and Hochberg procedure over accuracy: J48.

Algorithm   Ranking   p_Hoch
LSH-IS-F    3.63      –
LSH-IS-S    3.88      0.7237
HMN-EI      4.10      0.7237
LSBo        4.57      0.5606
MSS         5.03      0.1909
CNN         5.28      0.0981
ICF         5.67      0.0242
DROP3       5.90      0.0094
PSC         6.93      0.0000

Table 6 shows the average ranks over compression. DROP3 is the best method at a 0.05 significance level. Furthermore, the average reduction rate for each method is also shown. The proposed methods are the most conservative although, as previously stated, a higher compression could have been achieved using fewer functions in the LSH process.

Table 6
Average ranks and Hochberg procedure over storage reduction, and average reduction rate.

Algorithm   Ranking   p_Hoch   Reduction rate
DROP3       1.67      –        0.896
ICF         3.10      0.0427   0.813
LSBo        3.70      0.0081   0.737
PSC         4.70      0.0001   0.762
CNN         5.43      0.0000   0.658
MSS         6.00      0.0000   0.665
HMN-EI      6.10      0.0000   0.577
LSH-IS-F    6.62      0.0000   0.455
LSH-IS-S    7.68      0.0000   0.405

The third relevant feature of instance selection methods is the time required by the algorithms to calculate the selected subset. Table 7 shows the average rank over filtering time. The fastest methods were LSH-IS-F, LSH-IS-S and PSC; between them, differences were not significant at a 0.05 significance level. The differences were significant from MSS and the following methods; the slowest was CNN. It is worth noting that PSC achieved, according to the significance tests, a really competitive execution time. Nevertheless, the shortcoming of PSC was its poor accuracy, as it obtained the worst results of all of the methods under analysis, as noted in Tables 4 and 5.

Table 7
Average ranks and Hochberg procedure over filtering time.

Algorithm   Ranking   p_Hoch
LSH-IS-F    1.53      –
LSH-IS-S    1.57      0.9624
PSC         3.00      0.0761
MSS         4.20      0.0005
ICF         5.33      0.0000
LSBo        6.73      0.0000
HMN-EI      6.70      0.0000
DROP3       7.67      0.0000
CNN         8.27      0.0000

It might be surprising that LSH-IS-F was faster than LSH-IS-S, because this contradicts Fig. 9. The reason for this was the number of functions used by the algorithms; as commented on at the beginning of this section, LSH-IS-S was launched using six OR functions, while LSH-IS-F was launched with only five.

Fig. 10 shows the filtering time as a function of the number of instances. The results obtained with the data sets of the experiments were used to draw these figures. Since there were no results available for all possible values of numbers of instances, the available results were used to draw Bezier lines and to show the general trend of the algorithms. Although other methods (CNN, HMN-EI, LSBo, DROP3, MSS and ICF) have an execution time that increases swiftly, our algorithms based on hashing are at the bottom of the figures together with PSC. However, in the logarithmic scale, the growth of PSC is visibly greater.

As a summary of the experimentation with medium size data sets, we can highlight that the proposed methods achieved competitive results in terms of accuracy. Considering the reduction rate, DROP3 achieved the maximum compression, while our methods were the worst in terms of compression. Finally, considering execution time, the methods presented in the paper were able to compute the selected subset much faster than the other state-of-the-art algorithms. Exceptionally, PSC worked surprisingly fast, although slower than our proposals, and with the shortcoming of poor accuracy when its selected subsets were used for training the classifiers.

5.1. Huge problems

Due to the fact that instance selection methods are not able to face huge problems, in the experimental study performed over Census, CovType, KDDCup99, KDDCup99 1M⁴ and Poker (see Table 3) the proposed algorithms were tested against Democratic Instance Selection (DIS) [22]. Although PSC showed competitive results in terms of execution time, it was not included in the study of huge problems because of its poor accuracy.

As in the previous experiments, ten-fold cross-validation was performed on LSH-IS-S and LSH-IS-F in Weka. Testing error, using the 1-NN classifier, and storage reduction were reported and compared against the results published in [22]. Execution times were not compared because different implementations and machines would not have allowed a fair comparison.

Table 8
Average ranks over accuracy for huge problems.

Algorithm   Ranking
LSH-IS-F    2.0
DIS.RNN     2.2
LSH-IS-S    2.6
DIS.DROP3   3.6
DIS.ICF     4.6

Table 9
Average ranks over storage reduction for huge problems.

Algorithm   Ranking
DIS.RNN     1.8
DIS.DROP3   2.4
LSH-IS-F    3.2
DIS.ICF     3.4
LSH-IS-S    4.2

The main conclusion of the experiments was that our methods can face huge problems. Results of average ranks over accuracy are shown in Table 8. The accuracy of the methods under study is similar to DIS, though the most accurate method is LSH-IS-F. As in the medium size experiment, LSH-IS-F improved the accuracy with regard to LSH-IS-S. On the other hand, Table 9 shows the average ranks over storage reduction. In terms of compression, the best method was DIS.RNN, as proved in medium size data sets, while the methods based on hashing were too conservative. However, the number of instances that they retain can be adjusted by the number of functions used. The success of the proposed methods is even more remarkable when compared against scalable approaches. The simple idea of using LSH overcomes the democratization methods and opens the way to their use in huge data sets and big data.

6. Conclusions

The paper has introduced a novel approach to the use of families of locality-sensitive functions (LSH) for instance selection. Using this approach, two new algorithms of linear complexity have been designed. In one approach, the data are processed in one pass, which allows the algorithm to make the selection without requiring the whole data set to fit in memory. The other approach needs two passes: one processes each instance of the data set, and the second processes the buckets of the families of hash functions. Their speed and low memory consumption mean that they are suitable for big data processing.

The experiments have shown that the strength of our methods is their speed, which is achieved through a small decrease in accuracy and, more remarkably, in the reduction rate. Although the best methods according to accuracy differ depending on the classifier that is used, the proposed methods offer a competitive performance. Moreover, the reduction rate can be adjusted by increasing or by decreasing the number of hash functions that are used.

⁴ We divided the original KDDCup 1999 data set into two sets with different numbers of instances: KDDCup99 has 10% of the original instances and KDDCup99 1M has a million.

Fig. 10. Computational cost of tested methods, in linear (a) and logarithmic (b) scale on the y axis. In (a) CNN was not plotted because its growth was so high that it was not possible to appreciate the differences between the other methods. The dots are the results on the available data sets; lines (denoted by "–b" in the legend) are the Bezier lines built with these dots to show the general trend of the algorithms.

Furthermore, the proposed methods were evaluated on huge problems and compared against Democratic Instance Selection, a linear complexity method. Experimental results on accuracy showed how our methods outperformed DIS, even though our methods were conceptually much simpler.

7. Future work

In their current version, the way the algorithms make the instance selection is very simple and quite "naive". The selected subset could be improved using additional information about the instances assigned to each bucket, and not just the count of instances of each class. A future research line could be to store additional information on the instances assigned to each bucket: for example, simple statistics such as the incremental average of the instances in the bucket, or the percentage of instances of each class in the bucket. This information might mean that the instance selection process would be better informed, without excessively penalizing run-time. Although prototype generation has not been analyzed in the paper, the generation of a new instance, or group of them, for each bucket is one of the future lines of research. This idea could be developed using LSH-IS-F, going through each of the buckets to build or to create a new set of instances, by selecting the medoids or centroids of the instances in the buckets.

According to [63], one of the most challenging problems in data mining research is mining data streams in extremely large databases. Accurate and fast processes able to work on streams are required, without any assumption that information can be stored in large databases and repeatedly accessed. One of the problems that arises in those environments is called concept drift, which appears when changes in the context take place. In the management of concept drift, three basic approaches can be distinguished: ensemble learning, instance weighting and instance selection [59]. A comparison of the proposed methods in a streaming benchmark would be made to test whether LSH-IS-S can beat the state-of-the-art algorithms that are able to deal with streaming data [18].

Many more research approaches can be considered, but the principal one for us is to adapt the new methods to a big data environment. We are working on the adaptation of this idea to a MapReduce framework, which offers a robust environment to face up to the processing of huge data sets over clusters of machines [58].

Acknowledgements

Supported by the Research Projects TIN 2011-24046 and TIN 2015-67534-P from the Spanish Ministry of Economy and Competitiveness.

References

[1] J. Alcalá-Fdez, L. Sánchez, S. García, M. del Jesus, S. Ventura, J. Garrell, J. Otero, C. Romero, J. Bacardit, V.M. Rivas, J. Fernández, F. Herrera, KEEL: a software tool to assess evolutionary algorithms for data mining problems, Soft Comput. 13 (3) (2009) 307–318, doi:10.1007/s00500-008-0323-y.

[2] Á. Arnaiz-González, J.F. Díez-Pastor, J.J. Rodríguez, C.I. García-Osorio, Instance selection for regression by discretization, Expert Syst. Appl. 54 (2016) 340–350, doi:10.1016/j.eswa.2015.12.046.

[3] A. Andoni, P. Indyk, Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions, Commun. ACM 51 (1) (2008) 117–122, doi:10.1145/1327452.1327494.

[4] S. Arya, D.M. Mount, N.S. Netanyahu, R. Silverman, A.Y. Wu, An optimal algorithm for approximate nearest neighbor searching fixed dimensions, J. ACM 45 (6) (1998) 891–923, doi:10.1145/293347.293348.

[5] D. Asimov, The grand tour: a tool for viewing multidimensional data, SIAM J. Sci. Stat. Comput. 6 (1) (1985) 128–143, doi:10.1137/0906011.

[6] R. Barandela, F.J. Ferri, J.S. Sánchez, Decision boundary preserving prototype selection for nearest neighbor classification, IJPRAI 19 (6) (2005) 787–806.

[7] H. Brighton, C. Mellish, On the consistency of information filters for lazy learning algorithms, in: J. Żytkow, J. Rauch (Eds.), Principles of Data Mining and Knowledge Discovery, Lecture Notes in Computer Science, volume 1704, Springer Berlin Heidelberg, 1999, pp. 283–288, doi:10.1007/978-3-540-48247-5_31.

[8] H. Brighton, C. Mellish, Advances in instance selection for instance-based learning algorithms, Data Min. Knowl. Discov. 6 (2) (2002) 153–172, doi:10.1023/A:1014043630878.

[9] J. Calvo-Zaragoza, J.J. Valero-Mas, J.R. Rico-Juan, Improving kNN multi-label classification in prototype selection scenarios using class proposals, Pattern Recognit. 48 (5) (2015) 1608–1622, doi:10.1016/j.patcog.2014.11.015.

[10] J.R. Cano, F. Herrera, M. Lozano, Stratification for scaling up evolutionary prototype selection, Pattern Recognit. Lett. 26 (7) (2005) 953–963, doi:10.1016/j.patrec.2004.09.043.

[11] T. Cover, P. Hart, Nearest neighbor pattern classification, IEEE Trans. Inf. Theory 13 (1) (1967) 21–27, doi:10.1109/TIT.1967.1053964.

[12] M. Datar, N. Immorlica, P. Indyk, V.S. Mirrokni, Locality-sensitive hashing scheme based on p-stable distributions, in: Proceedings of the Twentieth Annual Symposium on Computational Geometry, SCG '04, ACM, New York, NY, USA, 2004, pp. 253–262, doi:10.1145/997817.997857.

[13] A. de Haro-García, N. García-Pedrajas, A divide-and-conquer recursive approach for scaling up instance selection algorithms, Data Min. Knowl. Discov. 18 (3) (2009) 392–418, doi:10.1007/s10618-008-0121-2.

[14] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.

[15] J. Derrac, S. García, F. Herrera, Stratified prototype selection based on a steady-state memetic algorithm: a study of scalability, Memetic Comput. 2 (3) (2010) 183–199, doi:10.1007/s12293-010-0048-1.

[16] J. Derrac, S. García, F. Herrera, A survey on evolutionary instance selection and generation, Int. J. Appl. Metaheuristic Comput. 1 (1) (2010) 60–92, doi:10.4018/jamc.2010102604.

[17] J. Derrac, N. Verbiest, S. García, C. Cornelis, F. Herrera, On the use of evolutionary feature selection for improving fuzzy rough set based prototype selection, Soft Comput. 17 (2) (2013) 223–238, doi:10.1007/s00500-012-0888-3.

[18] G. Ditzler, M. Roveri, C. Alippi, R. Polikar, Learning in nonstationary environments: a survey, IEEE Comput. Intell. Mag. 10 (4) (2015) 12–25.

[19] F. Dornaika, I.K. Aldine, Decremental sparse modeling representative selection for prototype selection, Pattern Recognit. 48 (11) (2015) 3714–3727, doi:10.1016/j.patcog.2015.05.018.

[20] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, John Wiley & Sons, New York City, United States, 2012.

[21] S. García, J. Derrac, J. Cano, F. Herrera, Prototype selection for nearest neighbor classification: taxonomy and empirical study, IEEE Trans. Pattern Anal. Mach. Intell. 34 (3) (2012) 417–435, doi:10.1109/TPAMI.2011.142.

[22] C. García-Osorio, A. de Haro-García, N. García-Pedrajas, Democratic instance selection: a linear complexity instance selection algorithm based on classifier ensemble concepts, Artif. Intell. 174 (5-6) (2010) 410–441, doi:10.1016/j.artint.2010.01.001.

[23] N. García-Pedrajas, A. de Haro-García, Boosting instance selection algorithms, Knowl.-Based Syst. 67 (2014) 342–360, doi:10.1016/j.knosys.2014.04.021.

[24] N. García-Pedrajas, J.A. Romero del Castillo, D. Ortiz-Boyer, A cooperative coevolutionary algorithm for instance selection for instance-based learning, Mach. Learn. 78 (3) (2010) 381–420, doi:10.1007/s10994-009-5161-3.

[25] G. Gates, The reduced nearest neighbor rule (corresp.), IEEE Trans. Inf. Theory 18 (3) (1972) 431–433, doi:10.1109/TIT.1972.1054809.

[26] A. Guillén, L. Herrera, G. Rubio, H. Pomares, A. Lendasse, I. Rojas, New method for instance or prototype selection using mutual information in time series prediction, Neurocomputing 73 (10-12) (2010) 2030–2038, doi:10.1016/j.neucom.2009.11.031. Subspace Learning / Selected papers from the European Symposium on Time Series Prediction.

[27] P. Hart, The condensed nearest neighbor rule (corresp.), IEEE Trans. Inf. Theory 14 (3) (1968) 515–516.

[28] P. Hernández-Leal, J. Carrasco-Ochoa, J. Martínez-Trinidad, J. Olvera-López, Instance rank based on borders for instance selection, Pattern Recognit. 46 (1) (2013) 365–375, doi:10.1016/j.patcog.2012.07.007.

[29] Y. Hochberg, A sharper Bonferroni procedure for multiple tests of significance, Biometrika 75 (4) (1988) 800–802, doi:10.1093/biomet/75.4.800.

[30] P. Indyk, R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, in: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC '98, ACM, New York, NY, USA, 1998, pp. 604–613, doi:10.1145/276698.276876.

[31] N. Jankowski, M. Grochowski, Comparison of instances selection algorithms I. Algorithms survey, in: L. Rutkowski, J. Siekmann, R. Tadeusiewicz, L. Zadeh (Eds.), Artificial Intelligence and Soft Computing - ICAISC 2004, Lecture Notes in Computer Science, volume 3070, Springer Berlin Heidelberg, 2004, pp. 598–603, doi:10.1007/978-3-540-24844-6_90.

[32] S.-W. Kim, J.B. Oommen, A brief taxonomy and ranking of creative prototype reduction schemes, Pattern Anal. Appl. 6 (3) (2003) 232–244, doi:10.1007/s10044-003-0191-0.

[33] M. Kordos, S. Białka, M. Blachnik, Instance selection in logical rule extraction for regression problems, in: L. Rutkowski, M. Korytkowski, R. Scherer, R. Tadeusiewicz, L.A. Zadeh, J.M. Zurada (Eds.), Artificial Intelligence and Soft Computing, Lecture Notes in Computer Science, volume 7895, Springer Berlin Heidelberg, 2013, pp. 167–175, doi:10.1007/978-3-642-38610-7_16.

[34] E. Kushilevitz, R. Ostrovsky, Y. Rabani, Efficient search for approximate nearest neighbor in high dimensional spaces, in: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC '98, ACM, New York, NY, USA, 1998, pp. 614–623, doi:10.1145/276698.276877.

[35] J. Leskovec, A. Rajaraman, J.D. Ullman, Mining of Massive Datasets, third ed., Cambridge University Press, Cambridge, United Kingdom, 2014.

[36] E. Leyva, A. González, R. Pérez, Knowledge-based instance selection: a compromise between efficiency and versatility, Knowl.-Based Syst. 47 (2013) 65–76, doi:10.1016/j.knosys.2013.04.005.

[37] E. Leyva, A. González, R. Pérez, Three new instance selection methods based on local sets: a comparative study with several approaches from a bi-objective perspective, Pattern Recognit. 48 (4) (2015) 1523–1537, doi:10.1016/j.patcog.2014.10.001.

[38] J. Li, Y. Wang, A new fast reduction technique based on binary nearest neighbor tree, Neurocomputing 149, Part C (2015) 1647–1657, doi:10.1016/j.neucom.2014.08.028.

[39] T. Liu, A.W. Moore, K. Yang, A.G. Gray, An investigation of practical approximate nearest neighbor algorithms, in: Advances in Neural Information Processing Systems, MIT Press, 2004, pp. 825–832.

[40] A. Lumini, L. Nanni, A clustering method for automatic biometric template selection, Pattern Recognit. 39 (3) (2006) 495–497, doi:10.1016/j.patcog.2005.11.004.

[41] E. Marchiori, Hit miss networks with applications to instance selection, J. Mach. Learn. Res. 9 (2008) 997–1017.

[42] E. Marchiori, Class conditional nearest neighbor for large margin instance selection, IEEE Trans. Pattern Anal. Mach. Intell. 32 (2) (2010) 364–370, doi:10.1109/TPAMI.2009.164.

[43] E. Merelli, M. Pettini, M. Rasetti, Topology driven modeling: the IS metaphor, Natural Comput. 14 (3) (2014) 421–430, doi:10.1007/s11047-014-9436-7.

[44] C. Moretti, K. Steinhaeuser, D. Thain, N.V. Chawla, Scaling up classifiers to cloud computers, in: Data Mining, 2008. ICDM '08. Eighth IEEE International Conference on, 2008, pp. 472–481, doi:10.1109/ICDM.2008.99.

[45] L. Nanni, A. Lumini, Prototype reduction techniques: a comparison among different approaches, Expert Syst. Appl. 38 (9) (2011) 11820–11828, doi:10.1016/j.eswa.2011.03.070.

[46] J. Olvera-López, J. Carrasco-Ochoa, J. Martínez-Trinidad, A new fast prototype selection method based on clustering, Pattern Anal. Appl. 13 (2) (2009) 131–141, doi:10.1007/s10044-008-0142-x.

[47] J. Olvera-López, J. Martínez-Trinidad, J. Carrasco-Ochoa, J. Kittler, Prototype selection based on sequential search, Intell. Data Anal. 13 (4) (2009) 599–631.

[48] S. Ougiaroglou, G. Evangelidis, D.A. Dervos, FHC: an adaptive fast hybrid method for k-NN classification, Logic J. IGPL (2015).

[49] D. Peralta, S. del Río, S. Ramírez-Gallego, I. Triguero, J.M. Benítez, F. Herrera, Evolutionary feature selection for big data classification: a MapReduce approach, Math. Probl. Eng. 501 (2015) 246139.

[50] J. Pérez-Benítez, J. Pérez-Benítez, J. Espina-Hernández, Novel data condensing method using a prototype's front propagation algorithm, Eng. Appl. Artif. Intell. 39 (2015) 181–197.

[51] J. Pérez-Rodríguez, A.G. Arroyo-Peña, N. García-Pedrajas, Simultaneous instance and feature selection and weighting using evolutionary computation: proposal and study, Appl. Soft Comput. 37 (2015) 416–443, doi:10.1016/j.asoc.2015.07.046.

[52] E. Pękalska, R.P. Duin, P. Paclík, Prototype selection for dissimilarity-based classifiers, Pattern Recognit. 39 (2) (2006) 189–208, doi:10.1016/j.patcog.2005.06.012. Part Special Issue: Complexity Reduction.

[53] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.

[54] G. Ritter, H. Woodruff, S. Lowry, T. Isenhour, An algorithm for a selective nearest neighbor decision rule, IEEE Trans. Inf. Theory 21 (6) (1975) 665–669.

[55] M.B. Stojanović, M.M. Božić, M.M. Stanković, Z.P. Stajić, A methodology for training set instance selection using mutual information in time series prediction, Neurocomputing 141 (2014) 236–245, doi:10.1016/j.neucom.2014.03.006.

[56] H. Tran-The, V.N. Van, M.H. Anh, Fast approximate near neighbor algorithm by clustering in high dimensions, in: Knowledge and Systems Engineering (KSE), 2015 Seventh International Conference on, 2015, pp. 274–279, doi:10.1109/KSE.2015.72.

[57] I. Triguero, J. Derrac, S. García, F. Herrera, A taxonomy and experimental study on prototype generation for nearest neighbor classification, IEEE Trans. Syst. Man Cybern. Part C 42 (1) (2012) 86–100, doi:10.1109/TSMCC.2010.2103939.

[58] I. Triguero, D. Peralta, J. Bacardit, S. García, F. Herrera, MRPR: a MapReduce solution for prototype reduction in big data classification, Neurocomputing 150, Part A (2015) 331–345, doi:10.1016/j.neucom.2014.04.078.

[59] A. Tsymbal, The problem of concept drift: definitions and related work, Comput. Sci. Department, Trinity College Dublin 106 (2004) 2.

[60] D.R. Wilson, T.R. Martinez, Instance pruning techniques, in: Machine Learning: Proceedings of the Fourteenth International Conference (ICML '97), Morgan Kaufmann, 1997, pp. 404–411.

[61] D.R. Wilson, T.R. Martinez, Reduction techniques for instance-based learning algorithms, Mach. Learn. 38 (3) (2000) 257–286, doi:10.1023/A:1007626913721.

[62] I.H. Witten, E. Frank, M.A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, third ed., Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2011.

[63] Q. Yang, X. Wu, 10 challenging problems in data mining research, Int. J. Inf. Technol. Decis. Mak. 05 (04) (2006) 597–604, doi:10.1142/S0219622006002258.

83–95 ScienceDirect www.elsevier.com/locate/knosys ( http://creativecommons.org/licenses/by/4.0/ 0323- 10.1016/j.eswa.2015.12.046 1145/1327452.1327494 10.1145/293347.293348 10.1137/0906011 787–806 31 1023/A:1014043630878 10.1016/j.patcog.2014.11.015 10.1016/j. 10.1109/TIT.1967.1053964 10.1145/997817.997857 0121- 7 10.1007/s12293-010-0048-1 10.4018/ 0888- 12–25 1016/j.patcog.2015.05.018 2012 10.1109/TPAMI.2011.142 2010.01.001 10.1016/j.knosys.2014.04.021 07/s10994-0 10.1109/TIT.1972.1054809 2030–2038, 515–516 10.1016/j.patcog.2012.07.007 10.1093/biomet/75.4.800 10.1145/276698.276876 90 044-0 16 10.1145/276698.276877 2014 10.1016/j.knosys.2013.04.005 2014.10.001 2014.08.028 825–832 10.1016/j.patcog.2005.11. 997–1017 1109/TPAMI.2009.164 7 10.1109/ICDM.2008.99 11820–11828, 0142- 599–631 [48] 246139 181–197 046 10.1016/j.patcog.2005.06. 1993 665–669 10.1016/j.neucom.2014.03.006 2015.72 10.1109/TSMCC.2010.2103939 10.1016/j.neucom.2014.04.078 106 404–411 1007626913721 2011 02258

References

Related documents

Using the constant diameter assumption, the numerical modelling is assessed by comparing through instants of the hybrid phase, the instantaneous spatial averaged fuel regression

Figure 1 illustrates the national vehicle fleet composition of each country in terms of type of vehicle (columns) and by type of engine fuel (lines). The percentages of

By virtue of the Ricardian Equivalence Theorem, Government bonds do not represent net wealth therefore household savings will increase to offset the government policy.. The

Finally, analysing the effects of windfall gains along various personal characteristics, we find that: (i) at younger and older ages, the effect of windfall gains on labour supply

Of the four SMS components; Safety Policy, Safety Risk Management, Safety Assurance, and Safety Promotion, “Where does the Aviation Safety Investigation organization fit?” Some may

Treatment of post-dural puncture headache with bilateral greater occipital nerve block. Akin Takmaz S, Unal Kantekin C, Kaymak C,

Consequentially, the effort to change the initial intention from browsing into a behavior in which a MOOC- taker participates in a single learning activity is a smaller step

The retrofit concept is based on energy efficiency measures (reduction of transmission, infiltration and ventilation losses), on a high ratio of renewable energy sources and on