Machine-Learning for Big Data:
Sampling and Distributed On-Line Algorithms
Stéphan Clémençon
LTCI UMR CNRS No. 5141 - Telecom ParisTech
Goals of Statistical Learning Theory
• Statistical issues cast as M-estimation problems:
  • Classification
  • Regression
  • Density level set estimation
  • ... and their variants
• Minimal assumptions on the distribution
• Build realistic M-estimators for special criteria
• Questions:
  • Optimal elements
  • Consistency
  • Non-asymptotic excess risk bounds
  • Fast rates of convergence
Main Example: Classification
• $(X, Y)$ random pair with unknown distribution $P$
  • $X \in \mathcal{X}$ observation vector
  • $Y \in \{-1, +1\}$ binary label/class
• A posteriori probability ~ regression function
  $\forall x \in \mathcal{X},\ \eta(x) = \mathbb{P}\{Y = 1 \mid X = x\}$
• $g : \mathcal{X} \to \{-1, +1\}$ classifier
• Performance measure = classification error
  $L(g) = \mathbb{P}\{g(X) \neq Y\} \to \min_g$
• Solution: Bayes rule
  $\forall x \in \mathcal{X},\ g^*(x) = 2\,\mathbb{I}\{\eta(x) > 1/2\} - 1$
• Bayes error $L^* = L(g^*)$
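Empirical Risk Minimization
• Sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ of i.i.d. copies of $(X, Y)$
• Class $\mathcal{G}$ of classifiers
• Empirical Risk Minimization principle
  $\hat{g}_n = \arg\min_{g \in \mathcal{G}} L_n(g)$, where $L_n(g) := \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{g(X_i) \neq Y_i\}$
• Best classifier in the class
  $\bar{g} = \arg\min_{g \in \mathcal{G}} L(g)$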
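As an illustration of the ERM principle, here is a minimal sketch that minimizes the empirical risk over a small finite class of threshold classifiers on the real line. The class of stumps, the toy sample and the names `empirical_risk` and `erm_stumps` are assumptions made for the example, not part of the slides.

```python
import numpy as np

def empirical_risk(g, X, y):
    """L_n(g): fraction of sample points misclassified by g."""
    return float(np.mean(g(X) != y))

def erm_stumps(X, y, thresholds):
    """Return the threshold t minimizing L_n over g_t(x) = +1 if x > t else -1."""
    risks = [empirical_risk(lambda x, t=t: np.where(x > t, 1, -1), X, y)
             for t in thresholds]
    best = int(np.argmin(risks))
    return thresholds[best], risks[best]

# Toy sample: the label is +1 exactly when x > 0.5, so the stump at 0.5 has zero risk.
X = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 0.9])
y = np.array([-1, -1, -1, 1, 1, 1])
t_hat, risk_hat = erm_stumps(X, y, thresholds=np.array([0.0, 0.3, 0.5, 0.7]))
```

Every candidate requires a full pass over the sample, which is the cost that becomes problematic when $n$ is very large.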
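Empirical Processes in Classification
• Bias-variance decomposition
  $L(\hat{g}_n) - L^* \le \left(L(\hat{g}_n) - L_n(\hat{g}_n)\right) + \left(L_n(\bar{g}) - L(\bar{g})\right) + \left(L(\bar{g}) - L^*\right) \le 2 \sup_{g \in \mathcal{G}} |L_n(g) - L(g)| + \left(\inf_{g \in \mathcal{G}} L(g) - L^*\right)$
• Concentration inequality: with probability at least $1 - \delta$,
  $\sup_{g \in \mathcal{G}} |L_n(g) - L(g)| \le \mathbb{E} \sup_{g \in \mathcal{G}} |L_n(g) - L(g)| + \sqrt{\frac{2 \log(1/\delta)}{n}}$
Classification Theory - Main Results
1. Bayes risk consistency and rate of convergence
   Complexity control: $\mathbb{E} \sup_{g \in \mathcal{G}} |L_n(g) - L(g)| \le C \sqrt{\frac{V}{n}}$ if $\mathcal{G}$ is a VC class with VC dimension $V$.
2. Fast rates of convergence
   Under variance control: rate faster than $n^{-1/2}$
3. Convex risk minimization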
Big Data? Big Challenge!
Now, it is much easier
• to collect data, massively and in real time: ubiquity of sensors (cell phones, internet, embedded systems, social networks, ...)
• to store and manage Big (and Complex) Data (distributed file systems, NoSQL)
• to implement massively parallelized and distributed computational algorithms (MapReduce, clouds)
The three features of Big Data analysis
• Velocity: process data in quasi-real time (on-line algorithms)
• Volume: scalability (parallelized, distributed algorithms)
• Variety: complex data (text, signal, image, graph)
How to apply ERM to Big Data?
• Suppose that $n$ is too large to evaluate the empirical risk $L_n(g)$
• Common sense: run your preferred learning algorithm using a subsample of "reasonable" size $B \ll n$, e.g. by drawing with replacement in the original training data set...
• ... but of course, statistical performance is downgraded!
Survey designs: a solution to Big Data learning?
• Framework: massive original sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ viewed as a superpopulation
• Survey plan $R_n$ = probability distribution on the set of all nonempty subsets of $\{1, \ldots, n\}$
• Let $S \sim R_n$ and set $\epsilon_i = 1$ if $i \in S$, $\epsilon_i = 0$ otherwise. The vector $(\epsilon_1, \ldots, \epsilon_n)$ fully describes $S$
• First and second order inclusion probabilities:
  $\pi_i(R_n) = \mathbb{P}\{i \in S\}$ and $\pi_{i,j}(R_n) = \mathbb{P}\{(i, j) \in S^2\}$
• Do not rely on the empirical risk based on the survey sample $\{(X_i, Y_i) : i \in S\}$:
  $\frac{1}{\#S} \sum_{i \in S} \mathbb{I}\{g(X_i) \neq Y_i\}$
Horvitz-Thompson theory
• Consider the Horvitz-Thompson estimator of the risk
  $L_n^{R_n}(g) = \frac{1}{n} \sum_{i=1}^{n} \frac{\epsilon_i}{\pi_i} \mathbb{I}\{g(X_i) \neq Y_i\}$
• And the Horvitz-Thompson empirical risk minimizer
  $g_n^{\epsilon} = \arg\min_{g \in \mathcal{G}} L_n^{R_n}(g)$
• It may work if $\sup_{g \in \mathcal{G}} \left| L_n^{R_n}(g) - L_n(g) \right|$ is small
• In general, due to the dependence structure, not much can be said about the fluctuations of this supremum
The Poisson case: the $\epsilon_i$'s are independent
• In this case, $L_n^{R_n}(g)$ is a simple average of independent r.v.'s ⇒ back to empirical process theory
• One recovers the same learning rate as if all data had been used, e.g. in the finite VC dimension case
  $\mathbb{E}\left[L(g_n^{\epsilon}) - L^*\right] \le \left(\kappa_n \sqrt{2} + 4\right) \sqrt{\frac{V \log(n + 1) + \log 2}{n}}$
  where $\kappa_n = \sqrt{\sum_{i=1}^{n} (1/\pi_i^2)}$ (the $\pi_i$'s should not be too small...)
• The upper bound is optimal in the minimax sense.
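The Poisson case can be sketched numerically as follows: inclusion indicators are drawn independently, and the Horvitz-Thompson weights $1/\pi_i$ keep the risk estimate centred on the full empirical risk. The synthetic 0/1 losses, the constant inclusion probability 0.05 and the function names are assumptions made for this illustration.

```python
import numpy as np

def horvitz_thompson_risk(errors, incl_probs, selected):
    """(1/n) * sum_i (eps_i / pi_i) * errors_i, with eps_i = 1 if i was sampled."""
    n = errors.shape[0]
    return float(np.sum(selected * errors / incl_probs) / n)

def poisson_sample(incl_probs, rng):
    """Independent Bernoulli(pi_i) inclusion indicators."""
    return (rng.random(incl_probs.shape[0]) < incl_probs).astype(float)

rng = np.random.default_rng(0)
n = 10_000
errors = (rng.random(n) < 0.3).astype(float)   # synthetic 0/1 losses I{g(X_i) != Y_i}
pi = np.full(n, 0.05)                          # hypothetical inclusion probabilities
full_risk = float(errors.mean())
ht_draws = [horvitz_thompson_risk(errors, pi, poisson_sample(pi, rng))
            for _ in range(200)]
ht_mean = float(np.mean(ht_draws))
```

Each draw touches only about 5% of the sample, while the average of the draws stays close to `full_risk`; setting all $\pi_i = 1$ recovers the full empirical risk exactly.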
• Can be extended to more general sampling plans $Q_n$ provided you are able to control
  $d_{TV}(R_n, Q_n) \overset{\text{def}}{=} \sum_{S \in \mathcal{P}(U_n)} |Q_n(S) - R_n(S)|$.
• A coupling technique (Hajek, 1964) can be used to show that it works for rejective sampling, Rao-Sampford sampling, ...
Beyond Empirical Processes: U-Statistics as Performance Criteria
• In various situations, the performance criterion is not a basic sample mean statistic any more
• Examples:
  • Clustering: within cluster point scatter related to a partition $\mathcal{P}$
    $\frac{2}{n(n-1)} \sum_{i<j} D(X_i, X_j) \sum_{C \in \mathcal{P}} \mathbb{I}\{(X_i, X_j) \in C^2\}$
  • Graph inference (link prediction)
  • Ranking
  • ...
• The empirical criterion is an average over all possible $k$-tuples: a U-statistic of degree $k \ge 2$
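The clustering criterion above is an average over all pairs, which the following sketch computes directly. Taking $D$ to be the Euclidean distance and encoding the partition as a list of cluster labels are assumptions made for the example.

```python
import numpy as np
from itertools import combinations

def within_cluster_scatter(X, labels):
    """2/(n(n-1)) * sum_{i<j} D(X_i, X_j) * I{i and j share a cluster}, D = Euclidean."""
    n = len(X)
    total = 0.0
    for i, j in combinations(range(n), 2):
        if labels[i] == labels[j]:
            total += float(np.linalg.norm(X[i] - X[j]))
    return 2.0 * total / (n * (n - 1))

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 7.0]])
# Natural partition: pairs (0, 1) and (2, 3) contribute distances 1 and 2.
good = within_cluster_scatter(X, labels=[0, 0, 1, 1])
# Partition mixing the two groups: only distant pairs are counted.
bad = within_cluster_scatter(X, labels=[0, 1, 0, 1])
```

The loop visits $n(n-1)/2$ pairs, so the cost already grows quadratically with $n$ for this degree-2 criterion.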
Example: Ranking
• Data with ordinal label: $(X_1, Y_1), \ldots, (X_n, Y_n) \in \left(\mathcal{X} \times \{1, \ldots, K\}\right)^{\otimes n}$
• Goal: rank $X_1, \ldots, X_n$ through a scoring function $s : \mathcal{X} \to \mathbb{R}$ s.t. $s(X)$ and $Y$ tend to increase/decrease together with high probability
• Quantitative formulation: maximize the criterion
  $L(s) = \mathbb{P}\{s(X^{(1)}) < \ldots < s(X^{(K)}) \mid Y^{(1)} = 1, \ldots, Y^{(K)} = K\}$
• Observations: $n_k$ i.i.d. copies of $X$ given $Y = k$, $X_1^{(k)}, \ldots, X_{n_k}^{(k)}$
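Example: Ranking
• A natural empirical counterpart of $L(s)$ is
  $\hat{L}_n(s) = \frac{\sum_{i_1=1}^{n_1} \cdots \sum_{i_K=1}^{n_K} \mathbb{I}\left\{s(X_{i_1}^{(1)}) < \ldots < s(X_{i_K}^{(K)})\right\}}{n_1 \times \cdots \times n_K}$,
• But the number of terms to be summed, $n_1 \times \ldots \times n_K$, is prohibitive!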
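To make the cost concrete, the empirical ranking criterion can be computed by brute-force enumeration of all $K$-tuples, as in the sketch below. The tiny score arrays and the function name `ranking_criterion` are assumptions for illustration; the enumeration is only feasible because the samples are very small.

```python
import numpy as np
from itertools import product

def ranking_criterion(scores_by_class):
    """Fraction of K-tuples (one score per class) that increase strictly with the class."""
    hits, total = 0, 0
    for tup in product(*scores_by_class):
        total += 1
        hits += all(a < b for a, b in zip(tup, tup[1:]))
    return hits / total, total

# Scores s(X_i^(k)) for K = 3 classes with n_1 = 2, n_2 = 3, n_3 = 2 observations.
scores = [np.array([0.1, 0.4]), np.array([0.3, 0.5, 0.6]), np.array([0.55, 0.9])]
value, n_terms = ranking_criterion(scores)
```

Here the sum has $2 \times 3 \times 2 = 12$ terms; with thousands of observations per class the same loop would need billions of kernel evaluations.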
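Generalized U-statistics
• $K \ge 1$ samples and degrees $(d_1, \ldots, d_K) \in \mathbb{N}^{*K}$
• $(X_1^{(k)}, \ldots, X_{n_k}^{(k)})$, $1 \le k \le K$: $K$ independent i.i.d. samples drawn from $F_k(dx)$ on $\mathcal{X}_k$ respectively
• Kernel $H : \mathcal{X}_1^{d_1} \times \cdots \times \mathcal{X}_K^{d_K} \to \mathbb{R}$, square integrable w.r.t. $\mu = F_1^{\otimes d_1} \otimes \cdots \otimes F_K^{\otimes d_K}$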
Generalized U-statistics
Definition
The $K$-sample U-statistic of degrees $(d_1, \ldots, d_K)$ with kernel $H$ is
$U_n(H) = \frac{\sum_{I_1} \cdots \sum_{I_K} H\left(X_{I_1}^{(1)}; X_{I_2}^{(2)}; \ldots; X_{I_K}^{(K)}\right)}{\binom{n_1}{d_1} \times \cdots \times \binom{n_K}{d_K}}$,
where $\sum_{I_k}$ refers to summation over all $\binom{n_k}{d_k}$ subsets $X_{I_k}^{(k)} = (X_{i_1}^{(k)}, \ldots, X_{i_{d_k}}^{(k)})$ related to a set $I_k$ of $d_k$ indexes $1 \le i_1 < \ldots < i_{d_k} \le n_k$.
It is said to be symmetric when $H$ is permutation symmetric in each set of $d_k$ arguments $X_{I_k}^{(k)}$.
Generalized U-statistics
• Unbiased estimator of $\theta(H) = \mathbb{E}\left[H\left(X_1^{(1)}, \ldots, X_{d_1}^{(1)}, \ldots, X_1^{(K)}, \ldots, X_{d_K}^{(K)}\right)\right]$ with minimum variance
• Asymptotically Gaussian as $n_k / n \to \lambda_k > 0$ for $k = 1, \ldots, K$
• Its computation requires the summation of $\prod_{k=1}^{K} \binom{n_k}{d_k}$ terms
• $K$-partite ranking: $d_k = 1$ for $1 \le k \le K$
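The number of terms in the complete statistic can be evaluated directly for a given configuration. The sample sizes below are hypothetical and only meant to give an order of magnitude.

```python
from math import comb, prod

def n_terms(sizes, degrees):
    """Number of kernel evaluations in the complete statistic: prod_k C(n_k, d_k)."""
    return prod(comb(n, d) for n, d in zip(sizes, degrees))

# Tripartite ranking (d_k = 1) with 10,000 observations per class.
tripartite = n_terms(sizes=[10_000, 10_000, 10_000], degrees=[1, 1, 1])
# One-sample statistic of degree 2 (e.g. a pairwise criterion) with 100,000 points.
pairs = n_terms(sizes=[100_000], degrees=[2])
```

Both counts are far beyond what can be summed at every step of an optimization routine, which motivates the incomplete version.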
Incomplete U-statistics
• Replace $U_n(H)$ by an incomplete version, involving far fewer terms
• Build a set $\mathcal{D}_B$ of cardinality $B$ by sampling with replacement in the set $\Lambda$ of indexes
  $\left((i_1^{(1)}, \ldots, i_{d_1}^{(1)}), \ldots, (i_1^{(K)}, \ldots, i_{d_K}^{(K)})\right)$ with $1 \le i_1^{(k)} < \ldots < i_{d_k}^{(k)} \le n_k$, $1 \le k \le K$
• Compute the Monte-Carlo version based on $B$ terms
  $\tilde{U}_B(H) = \frac{1}{B} \sum_{(I_1, \ldots, I_K) \in \mathcal{D}_B} H\left(X_{I_1}^{(1)}, \ldots, X_{I_K}^{(K)}\right)$
• An incomplete U-statistic is NOT a U-statistic
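A minimal sketch of the Monte-Carlo version for the $K$-partite ranking kernel (all $d_k = 1$), compared with the complete statistic on samples small enough to enumerate. The Gaussian scores, the sample sizes and the value $B = 20{,}000$ are assumptions chosen for the illustration.

```python
import numpy as np

def incomplete_ranking_criterion(scores_by_class, B, rng):
    """Average the ranking kernel over B index tuples drawn uniformly with replacement."""
    # With all degrees equal to 1, a uniform draw from Lambda is one independent
    # index per class.
    draws = np.column_stack(
        [s[rng.integers(0, len(s), size=B)] for s in scores_by_class])
    increasing = np.all(draws[:, :-1] < draws[:, 1:], axis=1)
    return float(increasing.mean())

rng = np.random.default_rng(0)
# Synthetic scores: class k is centred at k, so higher classes tend to score higher.
scores = [rng.normal(loc=k, scale=1.0, size=200) for k in range(3)]
s0, s1, s2 = scores
# Complete statistic via broadcasting: 200**3 = 8 million kernel evaluations.
complete = float(np.mean((s0[:, None, None] < s1[None, :, None])
                         & (s1[None, :, None] < s2[None, None, :])))
approx = incomplete_ranking_criterion(scores, B=20_000, rng=rng)
```

The incomplete version uses 400 times fewer kernel evaluations here, and its Monte-Carlo error is of order $1/\sqrt{B}$ around the complete statistic.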
ERM based on incomplete U-statistics
• Replace the criterion by a tractable incomplete version based on $B = O(n)$ terms
  $\min_{H \in \mathcal{H}} \tilde{U}_B(H)$
• This leads to investigate the maximal deviations
  $\sup_{H \in \mathcal{H}} \left| \tilde{U}_B(H) - U_n(H) \right|$
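Main Result
Theorem
Let $\mathcal{H}$ be a VC major class of bounded symmetric kernels of finite VC dimension $V < +\infty$. Set $M_{\mathcal{H}} = \sup_{(H, x) \in \mathcal{H} \times \mathcal{X}} |H(x)|$. Then,
(i) $\mathbb{P}\left\{\sup_{H \in \mathcal{H}} \left| \tilde{U}_B(H) - U_n(H) \right| > \eta \right\} \le 2 (1 + \#\Lambda)^V \times e^{-B \eta^2 / M_{\mathcal{H}}^2}$
(ii) for all $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have:
$\frac{1}{M_{\mathcal{H}}} \sup_{H \in \mathcal{H}} \left| \tilde{U}_B(H) - \mathbb{E}\left[\tilde{U}_B(H)\right] \right| \le 2 \sqrt{\frac{2 V \log(1 + \kappa)}{\kappa}} + \sqrt{\frac{\log(2/\delta)}{\kappa}} + \sqrt{\frac{V \log(1 + \#\Lambda) + \log(4/\delta)}{B}}$,
where $\kappa = \min\{\lfloor n_1/d_1 \rfloor, \ldots, \lfloor n_K/d_K \rfloor\}$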
Consequences
• Empirical risk sampling with $B = O(n)$ yields a rate bound of the order $O(\sqrt{\log n / n})$
• One suffers no loss in terms of learning rate, while drastically reducing computational cost
Example: Ranking
Empirical ranking performance for SVMrank based on 1%, 5%, 10%, ...
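Sketch of Proof
• Set $\epsilon = \left((\epsilon_k(I))_{I \in \Lambda}\right)_{1 \le k \le B}$, where $\epsilon_k(I)$ is equal to 1 if the tuple $I = (I_1, \ldots, I_K)$ has been selected at the $k$-th draw and to 0 otherwise
• The $\epsilon_k$'s are i.i.d. random vectors
• For all $(k, I) \in \{1, \ldots, B\} \times \Lambda$, the r.v. $\epsilon_k(I)$ has a Bernoulli distribution with parameter $1/\#\Lambda$
• With these notations,
  $\tilde{U}_B(H) - U_n(H) = \frac{1}{B} \sum_{k=1}^{B} Z_k(H)$, where $Z_k(H) = \sum_{I \in \Lambda} \left(\epsilon_k(I) - 1/\#\Lambda\right) H(X_I)$
• Freezing the $X_I$'s, by virtue of Sauer's lemma: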
Sketch of Proof (continued)
• Conditioned upon the $X_I$'s, $Z_1(H), \ldots, Z_B(H)$ are independent
• The first assertion is thus obtained by applying Hoeffding's inequality combined with the union bound
• Set
  $\kappa V_H\left(X_1^{(1)}, \ldots, X_{n_1}^{(1)}, \ldots, X_1^{(K)}, \ldots, X_{n_K}^{(K)}\right) = H\left(X_1^{(1)}, \ldots, X_{d_1}^{(1)}, \ldots, X_1^{(K)}, \ldots, X_{d_K}^{(K)}\right) + H\left(X_{d_1+1}^{(1)}, \ldots, X_{2 d_1}^{(1)}, \ldots, X_{d_K+1}^{(K)}, \ldots, X_{2 d_K}^{(K)}\right) + \ldots + H\left(X_{\kappa d_1 - d_1 + 1}^{(1)}, \ldots, X_{\kappa d_1}^{(1)}, \ldots, X_{\kappa d_K - d_K + 1}^{(K)}, \ldots, X_{\kappa d_K}^{(K)}\right)$,
Sketch of Proof (continued)
• The proof of the second assertion is based on the Hoeffding decomposition
  $U_n(H) = \frac{1}{n_1! \cdots n_K!} \sum_{\sigma_1 \in \mathfrak{S}_{n_1}, \ldots, \sigma_K \in \mathfrak{S}_{n_K}} V_H\left(X_{\sigma_1(1)}^{(1)}, \ldots, X_{\sigma_K(n_K)}^{(K)}\right)$,
• The concentration result is then obtained in a classical manner
  • Convexity (Chernoff's bound)
  • Symmetrization
  • Randomization
Beyond finite VC dimension
• Challenge: develop probabilistic tools and complexity assumptions to investigate the concentration properties of collections of sums of weighted binomials
  $\tilde{U}_B(H) - U_n(H) = \frac{1}{B} \sum_{k=1}^{B} Z_k(H)$, with $Z_k(H) = \sum_{I \in \Lambda} \left(\epsilon_k(I) - 1/\#\Lambda\right) H(X_I)$
Some references
• Maximal Deviations of Incomplete U-statistics with Applications to Empirical Risk Sampling. S. Clémençon, S. Robbiano and J. Tressou (2013). In the Proceedings of the SIAM International Conference on Data-Mining, Austin (USA).
• Empirical processes in survey sampling. P. Bertail, E. Chautru and S. Clémençon (2013). Submitted.
• A statistical view of clustering performance through the theory of U-processes. S. Clémençon (2014). In Journal of Multivariate Analysis.
• On Survey Sampling and Empirical Risk Minimization. P. Bertail, E. Chautru and S. Clémençon (2014). ISAIM 2014, Fort Lauderdale (USA).
Introduction
Investigate the binary classification problem in a statistical learning context
• Data are not stored in a central unit but processed by independent agents (processors)
• Aim: not to find a consensus on a common classifier, but to find how to combine the local ones efficiently
• Solution: implement the learning procedure in an on-line and distributed manner
Outline
• Background
• Proposed algorithm
• Theoretical results
• Improvement of agents selection
• Numerical experiments
Learning problem
$X \in \mathcal{X} \subset \mathbb{R}^n$ (r.v. observation) → $\mathrm{sign}(H(X))$ → $\hat{Y} \in \{-1, +1\}$ (r.v. binary output)
Given a training dataset $(X, Y) = (X_i, Y_i)_{i=1,\ldots,n}$ in high dimension $n$ and with unknown joint distribution...
... find the best prediction rule $\mathrm{sign}(H^\star)$, where the classifier function $H^\star$ minimizes the probability of error $P_e$:
$H^\star = \arg\min_H P_e(H)$ where $P_e(H) = \mathbb{P}[-Y H(X) > 0] = \mathbb{E}\left[\mathbb{1}_{\{-Y H(X) > 0\}}\right]$
⚠ But the indicator function $\mathbb{1}(x)$ is not a differentiable function!
MajorizeE⇥1{ YH(X)>0}⇤ by a convex function: Convex Surrogate E⇥1{ YH(X)>0}⇤E[j( YH(X))]
How ? Use acostfunction with appropiate properties
Example: use the quadratic functionj(u) =(u+21)2 :R![0,+•)
Learning problem
sign(H(X))
r.v. observation r.v. binary output
X 2X⇢Rn ! ! Yˆ 2{ 1,+1}
Given training dataset(X,Y) = (Xi,Yi)i=1,...,n in a high dimensionn
and withunknownjoint distribution....
...find the best prediction rule sign(H?)such the classifier functionH(x): H?= min
H Rj(H) where Rj(H) =E[j( YH(X))]
minimizesthe risk functionRj(H)
4whenj(u) =(u+21)2 !H?coincides with the naive Bayes classifier !
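The link between the quadratic surrogate and the Bayes rule can be checked with a short pointwise computation. The sketch below uses the regression function $\eta(x) = \mathbb{P}\{Y = 1 \mid X = x\}$ from the classification slides and writes $h$ for the value $H(x)$; it assumes the minimization runs over all measurable functions $H$.

```latex
\begin{align*}
\mathbb{E}\left[\varphi(-Y h) \mid X = x\right]
  &= \eta(x)\,\frac{(1-h)^2}{2} + \bigl(1-\eta(x)\bigr)\,\frac{(1+h)^2}{2},\\
\frac{\partial}{\partial h}\,\mathbb{E}\left[\varphi(-Y h) \mid X = x\right]
  &= -\eta(x)(1-h) + \bigl(1-\eta(x)\bigr)(1+h) = h - \bigl(2\eta(x)-1\bigr),\\
H^\star(x) &= 2\eta(x) - 1 = \mathbb{E}[Y \mid X = x],
\qquad \operatorname{sign}\bigl(H^\star(x)\bigr) = +1 \iff \eta(x) > 1/2 .
\end{align*}
```

The sign of the minimizer therefore agrees with $g^*(x) = 2\,\mathbb{I}\{\eta(x) > 1/2\} - 1$ wherever $\eta(x) \neq 1/2$.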
Aggregation of local classifiers
Consider a classification device composed of a set $V$ of $N$ connected agents
Each agent $v \in V$:
• has access to $\{(X_{1,v}, Y_{1,v}), \ldots, (X_{n_v,v}, Y_{n_v,v})\}$, i.e. $n_v$ independent copies of $(X, Y)$
• selects a local soft classifier function from a parametric class $\{h_v(\cdot, \theta_v)\}$
Set $\theta_v = (\alpha_v, \beta_v)$; the global soft classifier is:
$H(x, \boldsymbol{\theta}) = \sum_{v \in V} h_v(x, \theta_v)$
where $h_v(x, \theta_v) = \alpha_v h_v(x, \beta_v)$ and $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_N)^T$
Problem statement
The problem can be summarized as follows:
• given an observation $X$
• obtain the best estimated label $\hat{Y}$ as $\mathrm{sign}(H(X, \boldsymbol{\theta}))$
• where $\boldsymbol{\theta}$ is computed from the training data $(X, Y) = (X_i, Y_i)_{i=1,\ldots,n}$ by solving the optimization problem
  $\min_{\boldsymbol{\theta} \in \Theta} R_\varphi(H(X, \boldsymbol{\theta}))$
Problem statement
Approaches
1. Agreement on a common decision rule [Tsitsiklis-84', Agarwal-10']: consensus approach
  • find an average consensus solution: $\boldsymbol{\theta} = (\theta, \ldots, \theta)$
  • each agent uses the global classifier $H(X, \boldsymbol{\theta})$
2. Mixture of experts: cooperative approach
  • find the best aggregation solution: $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_N)$
  • each agent uses its local classifier $h_v(x, \theta_v)$
→ Example (mixture of experts): set $\beta_v = 0$, $\alpha_v \ge 0$ and $h_v : \mathcal{X} \to \{-1, +1\}$ a weak classifier, so that
$h_v(x, \theta_v) = \alpha_v h_v(x)$
High rate distributed learning
Solve the minimization problem of the parametric risk function:
$\min_{\boldsymbol{\theta} \in \Theta} R_\varphi(H(X, \boldsymbol{\theta}))$
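High rate distributed learning
A standard distributed gradient descent iterative approach:
• generates a vector sequence of estimated parameters $(\boldsymbol{\theta}_t)_{t \ge 1} = (\theta_{t,1}, \ldots, \theta_{t,N})_{t \ge 1}$
• at each agent $v$ the update step writes:
  $\theta_{t+1,v} = \theta_{t,v} + \gamma_t \, \mathbb{E}\left[Y \nabla_v h_v(X, \theta_{t,v}) \, \varphi'(-Y H(X, \boldsymbol{\theta}_t))\right]$
⚠ The expectation cannot be computed: the joint distribution is unknown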
High rate distributed learning
A standard distributed and on-line gradient descent iterative approach is:
• generate a vector sequence of estimated parameters $(\boldsymbol{\theta}_t)_{t \ge 1} = (\theta_{t,1}, \ldots, \theta_{t,N})_{t \ge 1}$
• each agent $v$ observes a pair $(X_{t+1,v}, Y_{t+1,v})$
• at each agent $v$ the update step writes (the expectation is replaced by its empirical version):
  $\theta_{t+1,v} = \theta_{t,v} + \gamma_t \, Y_{t+1,v} \nabla_v h_v(X_{t+1,v}, \theta_{t,v}) \, \varphi'(-Y_{t+1,v} H(X_{t+1,v}, \boldsymbol{\theta}_t))$
⚠ Evaluating $H(X_{t+1,v}, \boldsymbol{\theta}_t)$ is required at each $t$ and $v$!
High rate distributed learning
Example ($N = 4$ agents, shown from the point of view of node 1)
• At iteration $t$, each agent $v \in V$ has $(X_{t,v}, \theta_{t,v})$ and evaluates its local $h_v(X_{t,v}, \theta_{t,v})$
• Each node $v$ sends its observation $X_{t,v}$ to all the other nodes
• Each node $v$ obtains the evaluation of $h_w(X_{t,v}, \theta_{t,w})$ from all the other nodes: node 1 receives $\{h_2(X_{t,1}, \theta_{t,2}), h_3(X_{t,1}, \theta_{t,3}), h_4(X_{t,1}, \theta_{t,4})\}$
• ... and computes the global: $H(X_{t,1}, \boldsymbol{\theta}_t) = \sum_{w=1}^{4} h_w(X_{t,1}, \theta_{t,w})$
⚠ $N(N-1)$ communications per iteration: $N = 4$ → 12!
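Proposed distributed learning: OLGA algorithm
→ Replace the global $H(X_{t+1,v}, \boldsymbol{\theta}_t)$ by a local estimate $\hat{Y}_{t+1,v}^{(V)}$ at each $v \in V$ such that:
$\mathbb{E}\left[\hat{Y}_{t+1,v}^{(V)} \mid X_{t+1,v}, \boldsymbol{\theta}_t\right] = H(X_{t+1,v}, \boldsymbol{\theta}_t)$
How? Sparse communications with sparsity ratio $p$...
On-line Learning Gossip Algorithm (OLGA)
... for each $v \in V$ at time $t$, the local gradient descent update writes:
$\theta_{t+1,v} = \theta_{t,v} + \gamma_t \, Y_{t+1,v} \nabla_v h_v(X_{t+1,v}, \theta_{t,v}) \, \varphi'(-Y_{t+1,v} \hat{Y}_{t+1,v}^{(V)})$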
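A minimal sketch of one local update, under two simplifying assumptions that are not in the slides: the local soft classifier is linear, $h_v(x, \theta_v) = \langle \theta_v, x \rangle$, and the surrogate is the quadratic $\varphi(u) = (u+1)^2/2$, so that $\varphi'(u) = u + 1$. The function name `olga_step` is hypothetical.

```python
import numpy as np

def olga_step(theta_v, x, y, y_hat, gamma):
    """theta_v + gamma * y * grad h_v(x, theta_v) * phi'(-y * y_hat) for a linear h_v."""
    grad_h = x                        # gradient of <theta_v, x> with respect to theta_v
    phi_prime = -y * y_hat + 1.0      # phi(u) = (u + 1)**2 / 2  =>  phi'(u) = u + 1
    return theta_v + gamma * y * grad_h * phi_prime

theta = np.array([0.0, 0.0])
x = np.array([1.0, 2.0])
# The local estimate y_hat stands for the gossip-based estimate of the global output.
new_theta = olga_step(theta, x, y=1.0, y_hat=0.0, gamma=0.1)
```

With labels in $\{-1, +1\}$ the step reduces to $\gamma_t (Y - \hat{Y}) x$, so no update occurs when the estimate already equals the label.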
Proposed distributed learning: OLGA algorithm
Example ($N = 4$ agents, $p = 1/3$, shown from the point of view of node 1)
• At iteration $t$, each agent $v \in V$ has $(X_{t,v}, \theta_{t,v})$ and evaluates its local $h_v(X_{t,v}, \theta_{t,v})$
• Each node $v$ sends its observation $X_{t,v}$ to randomly selected nodes with probability $p = 1/3$ (here node 1 sends $X_{t,1}$ to node 4)
• Each node $v$ obtains the evaluation of $h_w(X_{t,v}, \theta_{t,w})$ from the randomly selected nodes: node 1 receives $\{h_4(X_{t,1}, \theta_{t,4})\}$
• ... and computes its local estimate: $\hat{Y}_{t,1}^{(V)} = h_1(X_{t,1}, \theta_{t,1}) + \frac{1}{p} h_4(X_{t,1}, \theta_{t,4})$
⚠ $p N (N-1)$ communications per iteration: $N = 4$, $p = 1/3$ → 4 (a 67% reduction)!
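The sparsified estimate can be simulated as below: each peer is contacted independently with probability $p$ and its output is reweighted by $1/p$, which keeps the estimate unbiased for the global sum. The numerical values of the local outputs and the function name `gossip_estimate` are assumptions made for the illustration.

```python
import numpy as np

def gossip_estimate(local_outputs, v, p, rng):
    """Own output plus (1/p) times the outputs of the peers that were contacted."""
    n_agents = len(local_outputs)
    contacted = [w for w in range(n_agents) if w != v and rng.random() < p]
    estimate = local_outputs[v] + sum(local_outputs[w] for w in contacted) / p
    return estimate, len(contacted)

rng = np.random.default_rng(0)
outputs = np.array([0.4, -0.2, 0.1, 0.3])   # hypothetical values h_w(X_{t,1}, theta_{t,w})
exact = float(outputs.sum())                # global H(X_{t,1}, theta_t)
draws = [gossip_estimate(outputs, v=0, p=1/3, rng=rng) for _ in range(30_000)]
mean_estimate = float(np.mean([d[0] for d in draws]))
mean_messages = float(np.mean([d[1] for d in draws]))
```

On average node 1 contacts $p(N-1) = 1$ peer, which gives $pN(N-1) = 4$ communications over the four nodes, while the average estimate stays close to the exact global output. The price is extra variance, which is the subject of the performance analysis.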
Performance analysis
⚠ What is the effect of sparsification?...
... study the behaviour of the vector sequence $\boldsymbol{\theta}_t$ as $t \to \infty$:
• the consistency of the final solution given by the algorithm
• quantify the excess error variance due to the sparsity
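Asymptotic behaviour of OLGA
Under suitable assumptions, we prove the following results:
1. Consistency:
   $(\boldsymbol{\theta}_t)_{t \ge 1} \xrightarrow{\text{a.s.}} \boldsymbol{\theta}^\star \in \mathcal{L} = \{\nabla R_\varphi(\boldsymbol{\theta}) = 0\}$
2. CLT: conditioned on the event $\{\lim_{t \to \infty} \boldsymbol{\theta}_t = \boldsymbol{\theta}^\star\}$,
   $\frac{1}{\sqrt{\gamma_t}} (\boldsymbol{\theta}_t - \boldsymbol{\theta}^\star) \xrightarrow{\mathcal{L}} \mathcal{N}(0, \Sigma(\Gamma^\star))$
   where:
   $\Gamma^\star = \underbrace{\mathbb{E}\left[(H(X, \boldsymbol{\theta}^\star) - Y)^2 \, \nabla_v h_v(X, \theta_v^\star) \nabla_v^T h_v(X, \theta_v^\star)\right]}_{\text{estimation error in a centralized case}} + \underbrace{\frac{1-p}{p} \sum_{w \neq v} \mathbb{E}\left[h_w(X, \theta_w^\star)^2 \, \nabla_v h_v(X, \theta_v^\star) \nabla_v^T h_v(X, \theta_v^\star)\right]}_{\text{additional noise term induced by the distributed setting}}$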
A best agents selection approach
When...
⚠ the number of agents $N$ grows → difficult to implement
⚠ some agents are redundant → avoid similar outputs
... include distributed agent selection!
How? Add an $\ell_1$-penalization term with tuning parameter $\lambda$:
$\min_{\boldsymbol{\theta} \in \Theta} R_\varphi(H(X, \boldsymbol{\theta})) + \lambda \sum_v |\alpha_v|$
where:
• the weight $\alpha_v = 0$ for an idle agent and $\alpha_v > 0$ when it is active
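The slides do not say how the penalized problem is solved. One common way to handle an $\ell_1$ term in an on-line gradient scheme is a proximal (soft-thresholding) step applied to the weights after each gradient update; the sketch below illustrates that generic technique and is not claimed to be the authors' selection rule. The weights, the threshold and the function names are hypothetical.

```python
import numpy as np

def soft_threshold(alpha, tau):
    """Proximal operator of tau * |.|: shrink towards zero and clip at zero."""
    return np.sign(alpha) * np.maximum(np.abs(alpha) - tau, 0.0)

def active_set(alphas):
    """Indices of agents whose weight is non-zero."""
    return np.flatnonzero(alphas)

alphas = np.array([0.80, 0.05, 0.00, 0.30])   # hypothetical weights after a gradient step
shrunk = soft_threshold(alphas, tau=0.1)      # tau plays the role of gamma_t * lambda
S_t = active_set(shrunk)
```

Agents whose weight falls below the threshold are set exactly to zero and become idle, which is how an $\ell_1$ penalty produces a sparse set of active agents.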
Including best agents selection in the OLGA algorithm
Introduce an update step at each time $t$ of OLGA to seek the time-varying set of active nodes $S_t \subset V$
Including best agents selection in the OLGA algorithm
The extended algorithm is summarized as follows, at time $t$:
1. obtain the active nodes $S_t$ from the sequence of updated weights $(\alpha_{t,1}, \ldots, \alpha_{t,N})$
2. apply OLGA to the set of active agents $v \in S_t$:
  i) estimate the local $\hat{Y}_{t+1,v}^{(S_t)}$ from a random selection among the current active nodes
  ii) update by local gradient descent
     $\theta_{t+1,v} = \theta_{t,v} + \gamma_t \, Y_{t+1,v} \nabla_v h_v(X_{t+1,v}, \theta_{t,v}) \, \varphi'(-Y_{t+1,v} \hat{Y}_{t+1,v}^{(S_t)})$
Example with simulated data
Binary classification of (+) and (o) data samples with $N = 60$ agents using weak linear classifiers (-). When using distributed selection, it reduces to 25 active classifiers.
[Figure: two scatter plots over $[-6, 6] \times [-6, 6]$: (a) OLGA; (b) OLGA with distributed selection]
Comparison with real data
Binary classification of the benchmark dataset banana using weak linear classifiers when increasing $N$.
[Figure: error rate (0.2 to 0.55) versus number of weak learners (5 to 35), for OLGA (p = 0.6) and GentleBoost]
Figure: Comparison between a centralized and sequential approach (GentleBoost) and our distributed and on-line algorithm (OLGA).
Conclusions
• A fully distributed and on-line algorithm is proposed for binary classification of big datasets, run by $N$ processors
  → the algorithm is then adapted to select useful classifiers, which reduces $N$
• We obtain theoretical results from the asymptotic analysis of the sequence estimated by OLGA
• Numerical results show a behaviour comparable to that of a centralized, batch and sequential approach (GentleBoost)