
Machine-Learning for Big Data:

Sampling and Distributed On-Line Algorithms

Stéphan Clémençon

LTCI UMR CNRS No. 5141 - Telecom ParisTech

(2)

Goals of Statistical Learning Theory

• Statistical issues cast as M-estimation problems:
  • Classification
  • Regression
  • Density level set estimation
  • ... and their variants
• Minimal assumptions on the distribution
• Build realistic M-estimators for special criteria
• Questions:
  • Optimal elements
  • Consistency
  • Non-asymptotic excess risk bounds
  • Fast rates of convergence

(3)

Main Example: Classification

• $(X, Y)$ random pair with unknown distribution $P$
• $X \in \mathcal{X}$ observation vector
• $Y \in \{-1, +1\}$ binary label/class
• A posteriori probability ~ regression function: $\forall x \in \mathcal{X},\ \eta(x) = P\{Y = 1 \mid X = x\}$
• $g : \mathcal{X} \to \{-1, +1\}$ classifier
• Performance measure = classification error: $L(g) = P\{g(X) \neq Y\} \to \min_g$
• Solution: Bayes rule: $\forall x \in \mathcal{X},\ g^*(x) = 2\,\mathbb{I}\{\eta(x) > 1/2\} - 1$
• Bayes error $L^* = L(g^*)$

(4)

Empirical Risk Minimization

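• Sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ with i.i.d. copies of $(X, Y)$
• Class $\mathcal{G}$ of classifiers
• Empirical Risk Minimization principle:
  $$\hat{g}_n = \arg\min_{g \in \mathcal{G}} L_n(g), \qquad L_n(g) := \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{g(X_i) \neq Y_i\}$$
• Best classifier in the class:
  $$\bar{g} = \arg\min_{g \in \mathcal{G}} L(g)$$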
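The ERM principle above can be illustrated on a toy problem. The sketch below is not from the slides: it assumes a hypothetical class of one-dimensional decision stumps and synthetic data with 10% label noise, and simply searches the class for the minimizer of $L_n$.

```python
import random

def empirical_risk(g, sample):
    """L_n(g): fraction of pairs (x, y) in the sample with g(x) != y."""
    return sum(1 for x, y in sample if g(x) != y) / len(sample)

def make_stump(threshold, sign):
    """Classifier x -> sign if x > threshold else -sign, with sign in {-1, +1}."""
    return lambda x: sign if x > threshold else -sign

def erm_stump(sample):
    """Return (threshold, sign, risk) minimizing L_n over stumps whose
    thresholds sit at the observed points (plus one below all of them)."""
    xs = sorted(x for x, _ in sample)
    candidates = [xs[0] - 1.0] + xs
    best = None
    for t in candidates:
        for s in (-1, +1):
            r = empirical_risk(make_stump(t, s), sample)
            if best is None or r < best[2]:
                best = (t, s, r)
    return best

# Toy data (hypothetical): Y = +1 when X > 0.3, with 10% of the labels flipped.
rng = random.Random(0)
sample = []
for _ in range(200):
    x = rng.uniform(-1.0, 1.0)
    y = 1 if x > 0.3 else -1
    if rng.random() < 0.1:
        y = -y
    sample.append((x, y))

threshold, sign, risk = erm_stump(sample)
print(threshold, sign, risk)
```

The exhaustive search costs on the order of $n^2$ indicator evaluations here, which is the kind of cost that becomes an issue when $n$ is very large.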

(5)

Empirical Processes in Classification

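• Bias-variance decomposition
  $$L(\hat{g}_n) - L^* \le \big(L(\hat{g}_n) - L_n(\hat{g}_n)\big) + \big(L_n(\bar{g}) - L(\bar{g})\big) + \big(L(\bar{g}) - L^*\big) \le 2 \sup_{g \in \mathcal{G}} |L_n(g) - L(g)| + \Big(\inf_{g \in \mathcal{G}} L(g) - L^*\Big)$$
• Concentration inequality: with probability $1 - \delta$,
  $$\sup_{g \in \mathcal{G}} |L_n(g) - L(g)| \le E \sup_{g \in \mathcal{G}} |L_n(g) - L(g)| + \sqrt{\frac{2 \log(1/\delta)}{n}}$$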
(6)

Classification Theory - Main Results

1. Bayes risk consistency and rate of convergence

   Complexity control: $E \sup_{g \in \mathcal{G}} |L_n(g) - L(g)| \le C \sqrt{V/n}$ if $\mathcal{G}$ is a VC class with VC dimension $V$.

2. Fast rates of convergence

   Under variance control: rate faster than $n^{-1/2}$

3. Convex risk minimization

(9)

Big Data? Big Challenge!

Now, it is much easier

• to collect data, massively and in real-time: ubiquity of sensors (cell phones, internet, embedded systems, social networks, ...)
• to store and manage Big (and Complex) Data (distributed file systems, NoSQL)
• to implement massively parallelized and distributed computational algorithms (MapReduce, clouds)

The three features of Big Data analysis

• Velocity: process data in quasi-real time (on-line algorithms)
• Volume: scalability (parallelized, distributed algorithms)
• Variety: complex data (text, signal, image, graph)

(10)

How to apply ERM to Big Data?

• Suppose that $n$ is too large to evaluate the empirical risk $L_n(g)$
• Common sense: run your preferred learning algorithm using a subsample of "reasonable" size $B \ll n$, e.g. by drawing with replacement in the original training data set...
• ... but of course, statistical performance is downgraded!

(12)

Survey designs: a solution to Big Data learning?

• Framework: massive original sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ viewed as a superpopulation
• Survey plan $R_n$ = probability distribution on the ensemble of all nonempty subsets of $\{1, \ldots, n\}$
• Let $S \sim R_n$ and set $\epsilon_i = 1$ if $i \in S$, $\epsilon_i = 0$ otherwise. The vector $(\epsilon_1, \ldots, \epsilon_n)$ fully describes $S$
• First and second order inclusion probabilities: $\pi_i(R_n) = P\{i \in S\}$ and $\pi_{i,j}(R_n) = P\{(i, j) \in S^2\}$
• Do not rely on the empirical risk based on the survey sample $\{(X_i, Y_i) : i \in S\}$:
  $$\frac{1}{\#S} \sum_{i \in S} \mathbb{I}\{g(X_i) \neq Y_i\}$$

(13)

Horvitz-Thompson theory

• Consider the Horvitz-Thompson estimator of the risk
  $$L_n^{R_n}(g) = \frac{1}{n} \sum_{i=1}^{n} \frac{\epsilon_i}{\pi_i}\, \mathbb{I}\{g(X_i) \neq Y_i\}$$
• And the Horvitz-Thompson empirical risk minimizer
  $$g_n^{\epsilon} = \arg\min_{g \in \mathcal{G}} L_n^{R_n}(g)$$
• It may work if $\sup_{g \in \mathcal{G}} |L_n^{R_n}(g) - L_n(g)|$ is small
• In general, due to the dependence structure, not much can be said about the fluctuations of this supremum

(14)

The Poisson case: the $\epsilon_i$'s are independent

• In this case, $L_n^{R_n}(g)$ is a simple average of independent r.v.'s ⇒ back to empirical process theory
• One recovers the same learning rate as if all data had been used, e.g. in the finite VC dimension case
  $$E[L(g_n^{\epsilon}) - L^*] \le (\kappa_n \sqrt{2} + 4) \sqrt{\frac{V \log(n + 1) + \log 2}{n}}$$
  where $\kappa_n = \sqrt{\sum_{i=1}^{n} (1/\pi_i^2)}$ (the $\pi_i$'s should not be too small...)
• The upper bound is optimal in the minimax sense.

(15)

The Poisson case: the $\epsilon_i$'s are independent

• Can be extended to more general sampling plans $Q_n$ provided you are able to control
  $$d_{TV}(R_n, Q_n) \overset{def}{=} \sum_{S \in \mathcal{P}(U_n)} |Q_n(S) - R_n(S)|.$$
• A coupling technique (Hájek, 1964) can be used to show that it works for rejective sampling, Rao-Sampford sampling, ...

(16)

Beyond Empirical Processes: U-Statistics as Performance Criteria

• In various situations, the performance criterion is not a basic sample mean statistic any more
• Examples:
  • Clustering: within cluster point scatter related to a partition $\mathcal{P}$
    $$\frac{2}{n(n-1)} \sum_{i<j} D(X_i, X_j) \sum_{C \in \mathcal{P}} \mathbb{I}\{(X_i, X_j) \in C^2\}$$
  • Graph inference (link prediction)
  • Ranking
  • ...
• The empirical criterion is an average over all possible $k$-tuples: U-statistic of degree $k \ge 2$
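To make the degree-2 structure concrete, here is a small sketch of the within cluster point scatter as an average over pairs. The points, the partition (encoded as one label per point) and the distance are hypothetical toy inputs.

```python
from itertools import combinations

def within_cluster_scatter(points, labels, dist):
    """2 / (n (n - 1)) * sum over pairs i < j of D(x_i, x_j) * 1{same cell}.

    labels[i] identifies the cell of the partition containing points[i].
    """
    n = len(points)
    total = sum(dist(points[i], points[j])
                for i, j in combinations(range(n), 2)
                if labels[i] == labels[j])
    return 2.0 * total / (n * (n - 1))

points = [0.0, 1.0, 10.0, 12.0]
dist = lambda a, b: abs(a - b)

good = within_cluster_scatter(points, [0, 0, 1, 1], dist)  # tight cells
bad = within_cluster_scatter(points, [0, 1, 0, 1], dist)   # mixed cells
print(good, bad)
```

The sum runs over all $n(n-1)/2$ pairs, so its cost grows quadratically with the sample size.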

(17)

Example: Ranking

• Data with ordinal label: $(X_1, Y_1), \ldots, (X_n, Y_n) \in (\mathcal{X} \times \{1, \ldots, K\})^{\otimes n}$
• Goal: rank $X_1, \ldots, X_n$ through a scoring function $s : \mathcal{X} \to \mathbb{R}$ s.t. $s(X)$ and $Y$ tend to increase/decrease together with high probability
• Quantitative formulation: maximize the criterion
  $$L(s) = P\{s(X^{(1)}) < \ldots < s(X^{(K)}) \mid Y^{(1)} = 1, \ldots, Y^{(K)} = K\}$$
• Observations: $n_k$ i.i.d. copies of $X$ given $Y = k$, $X_1^{(k)}, \ldots, X_{n_k}^{(k)}$

(21)

Example: Ranking

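• A natural empirical counterpart of $L(s)$ is
  $$\hat{L}_n(s) = \frac{\sum_{i_1=1}^{n_1} \cdots \sum_{i_K=1}^{n_K} \mathbb{I}\{s(X_{i_1}^{(1)}) < \ldots < s(X_{i_K}^{(K)})\}}{n_1 \times \cdots \times n_K}$$
• But the number of terms to be summed is prohibitive: $n_1 \times \ldots \times n_K$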
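A brute-force sketch of this empirical criterion, assuming a toy example with $K = 3$ classes and two points per class (hypothetical values). It enumerates every tuple with one observation per class, which is exactly the $n_1 \times \cdots \times n_K$ cost mentioned above.

```python
from itertools import product

def ranking_criterion(score, samples):
    """Fraction of tuples (one point per class, classes in increasing order)
    whose scores are strictly increasing."""
    scored = [[score(x) for x in cls] for cls in samples]
    hits = 0
    total = 0
    for tup in product(*scored):
        total += 1
        if all(a < b for a, b in zip(tup, tup[1:])):
            hits += 1
    return hits / total

# samples[k] holds the observations with label k + 1 (toy values).
samples = [[0.1, 0.4], [0.3, 0.5], [0.6, 0.2]]
value = ranking_criterion(lambda x: x, samples)
print(value)
```

With $n_k$ in the thousands and $K = 3$, the loop would already run over billions of tuples.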

(24)

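Generalized U-statistics

• $K \ge 1$ samples and degrees $(d_1, \ldots, d_K) \in \mathbb{N}^{*K}$
• $(X_1^{(k)}, \ldots, X_{n_k}^{(k)})$, $1 \le k \le K$, $K$ independent i.i.d. samples drawn from $F_k(dx)$ on $\mathcal{X}_k$ respectively
• Kernel $H : \mathcal{X}_1^{d_1} \times \cdots \times \mathcal{X}_K^{d_K} \to \mathbb{R}$, square integrable w.r.t. $\mu = F_1^{\otimes d_1} \otimes \cdots \otimes F_K^{\otimes d_K}$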

(25)

Generalized U-statistics

Definition

The $K$-sample U-statistic of degrees $(d_1, \ldots, d_K)$ with kernel $H$ is
$$U_n(H) = \frac{\sum_{I_1} \cdots \sum_{I_K} H(X_{I_1}^{(1)}; X_{I_2}^{(2)}; \ldots; X_{I_K}^{(K)})}{\binom{n_1}{d_1} \times \cdots \times \binom{n_K}{d_K}},$$
where $\sum_{I_k}$ refers to summation over all $\binom{n_k}{d_k}$ subsets $X_{I_k}^{(k)} = (X_{i_1}^{(k)}, \ldots, X_{i_{d_k}}^{(k)})$ related to a set $I_k$ of $d_k$ indexes $1 \le i_1 < \ldots < i_{d_k} \le n_k$.

It is said to be symmetric when $H$ is permutation symmetric in each set of $d_k$ arguments $X_{I_k}^{(k)}$.

(26)

Generalized U-statistics

• Unbiased estimator of
  $$\theta(H) = E[H(X_1^{(1)}, \ldots, X_{d_1}^{(1)}, \ldots, X_1^{(K)}, \ldots, X_{d_K}^{(K)})]$$
  with minimum variance
• Asymptotically Gaussian as $n_k/n \to \lambda_k > 0$ for $k = 1, \ldots, K$
• Its computation requires the summation of $\prod_{k=1}^{K} \binom{n_k}{d_k}$ terms
• $K$-partite ranking: $d_k = 1$ for $1 \le k \le K$

(27)

Incomplete U-statistics

• Replace $U_n(H)$ by an incomplete version, involving far fewer terms
• Build a set $D_B$ of cardinality $B$ by sampling with replacement in the set $\Lambda$ of indexes
  $$((i_1^{(1)}, \ldots, i_{d_1}^{(1)}), \ldots, (i_1^{(K)}, \ldots, i_{d_K}^{(K)}))$$
  with $1 \le i_1^{(k)} < \ldots < i_{d_k}^{(k)} \le n_k$, $1 \le k \le K$
• Compute the Monte-Carlo version based on $B$ terms
  $$\tilde{U}_B(H) = \frac{1}{B} \sum_{(I_1, \ldots, I_K) \in D_B} H(X_{I_1}^{(1)}, \ldots, X_{I_K}^{(K)})$$
• An incomplete U-statistic is NOT a U-statistic
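A sketch of the Monte-Carlo approximation in the simplest setting, assuming $K = 2$ samples, degrees $(1, 1)$ and the kernel $H(x, y) = \mathbb{I}\{x < y\}$ (hypothetical choices). In that case drawing a tuple uniformly in $\Lambda$ with replacement amounts to drawing one index per sample independently.

```python
import random
from itertools import product

def complete_u(kernel, xs, ys):
    """Two-sample U-statistic of degrees (1, 1): average over all n1 * n2 pairs."""
    return sum(kernel(x, y) for x, y in product(xs, ys)) / (len(xs) * len(ys))

def incomplete_u(kernel, xs, ys, B, rng):
    """Average of the kernel over B index pairs drawn with replacement."""
    total = 0.0
    for _ in range(B):
        i = rng.randrange(len(xs))
        j = rng.randrange(len(ys))
        total += kernel(xs[i], ys[j])
    return total / B

rng = random.Random(4)
xs = [rng.gauss(0.0, 1.0) for _ in range(300)]
ys = [rng.gauss(1.0, 1.0) for _ in range(300)]
kernel = lambda x, y: 1.0 if x < y else 0.0

u_full = complete_u(kernel, xs, ys)                     # 90,000 kernel evaluations
u_inc = incomplete_u(kernel, xs, ys, B=2000, rng=rng)   # 2,000 kernel evaluations
print(u_full, u_inc)
```

The two values typically agree to within a few hundredths here, at a small fraction of the cost of the complete sum.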

(28)

ERM based on incomplete U-statistics

• Replace the criterion by a tractable incomplete version based on $B = O(n)$ terms
  $$\min_{H \in \mathcal{H}} \tilde{U}_B(H)$$
• This leads to investigate the maximal deviations
  $$\sup_{H \in \mathcal{H}} |\tilde{U}_B(H) - U_n(H)|$$

(29)

Main Result

Theorem

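Let $\mathcal{H}$ be a VC major class of bounded symmetric kernels of finite VC dimension $V < +\infty$. Set $M_{\mathcal{H}} = \sup_{(H, x) \in \mathcal{H} \times \mathcal{X}} |H(x)|$. Then,

(i) $$P\Big\{\sup_{H \in \mathcal{H}} |\tilde{U}_B(H) - U_n(H)| > \eta\Big\} \le 2 (1 + \#\Lambda)^V e^{-B \eta^2 / M_{\mathcal{H}}^2}$$

(ii) for all $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have:
$$\frac{1}{M_{\mathcal{H}}} \sup_{H \in \mathcal{H}} \big|\tilde{U}_B(H) - E[\tilde{U}_B(H)]\big| \le 2 \sqrt{\frac{2 V \log(1 + \kappa)}{\kappa}} + \sqrt{\frac{\log(2/\delta)}{\kappa}} + \sqrt{\frac{V \log(1 + \#\Lambda) + \log(4/\delta)}{B}},$$
where $\kappa = \min\{\lfloor n_1/d_1 \rfloor, \ldots, \lfloor n_K/d_K \rfloor\}$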
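To get a feel for the orders of magnitude in assertion (ii), the right-hand side can be evaluated numerically. The sketch below assumes the bound reads as the sum of three square-root terms in $\kappa$, $\#\Lambda$ and $B$ as stated in the theorem, with $\#\Lambda = \prod_k \binom{n_k}{d_k}$; the sample sizes, $V$ and $\delta$ are hypothetical.

```python
import math

def deviation_bound(ns, ds, V, B, delta, M=1.0):
    """Right-hand side of (ii), multiplied by M (the sup-norm bound on kernels)."""
    kappa = min(n // d for n, d in zip(ns, ds))
    card_lambda = math.prod(math.comb(n, d) for n, d in zip(ns, ds))
    term_u = 2.0 * math.sqrt(2.0 * V * math.log(1 + kappa) / kappa)
    term_conf = math.sqrt(math.log(2.0 / delta) / kappa)
    term_mc = math.sqrt((V * math.log(1 + card_lambda) + math.log(4.0 / delta)) / B)
    return M * (term_u + term_conf + term_mc)

# Two samples of size 1000, degrees (1, 1): the complete sum has 10**6 terms.
ns, ds = (1000, 1000), (1, 1)
b_small = deviation_bound(ns, ds, V=3, B=1000, delta=0.05)
b_large = deviation_bound(ns, ds, V=3, B=100000, delta=0.05)
print(b_small, b_large)
```

Only the last term depends on $B$, so increasing $B$ beyond the order of $\kappa$ brings diminishing returns.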

(30)

Consequences

• Empirical risk sampling with $B = O(n)$ yields a rate bound of the order $O(\sqrt{\log n / n})$
• One suffers no loss in terms of learning rate, while drastically reducing computational cost

(31)

Example: Ranking

Empirical ranking performance for SVMrank based on 1%, 5%, 10%, ...

(32)

Sketch of Proof

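• Set $\epsilon = ((\epsilon_k(I))_{I \in \Lambda})_{1 \le k \le B}$, where $\epsilon_k(I)$ is equal to 1 if the tuple $I = (I_1, \ldots, I_K)$ has been selected at the $k$-th draw and to 0 otherwise
• The $\epsilon_k$'s are i.i.d. random vectors
• For all $(k, I) \in \{1, \ldots, B\} \times \Lambda$, the r.v. $\epsilon_k(I)$ has a Bernoulli distribution with parameter $1/\#\Lambda$
• With these notations,
  $$\tilde{U}_B(H) - U_n(H) = \frac{1}{B} \sum_{k=1}^{B} Z_k(H), \quad \text{where} \quad Z_k(H) = \sum_{I \in \Lambda} (\epsilon_k(I) - 1/\#\Lambda)\, H(X_I)$$
• Freezing the $X_I$'s, by virtue of Sauer's lemma: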

(33)

Sketch of Proof (continued)

• Conditioned upon the $X_I$'s, $Z_1(H), \ldots, Z_B(H)$ are independent
• The first assertion is thus obtained by applying Hoeffding's inequality combined with the union bound
• Set
  $$V_H\big(X_1^{(1)}, \ldots, X_{n_1}^{(1)}, \ldots, X_1^{(K)}, \ldots, X_{n_K}^{(K)}\big) = \frac{1}{\kappa} \Big[ H\big(X_1^{(1)}, \ldots, X_{d_1}^{(1)}, \ldots, X_1^{(K)}, \ldots, X_{d_K}^{(K)}\big) + H\big(X_{d_1+1}^{(1)}, \ldots, X_{2d_1}^{(1)}, \ldots, X_{d_K+1}^{(K)}, \ldots, X_{2d_K}^{(K)}\big) + \ldots + H\big(X_{\kappa d_1 - d_1 + 1}^{(1)}, \ldots, X_{\kappa d_1}^{(1)}, \ldots, X_{\kappa d_K - d_K + 1}^{(K)}, \ldots, X_{\kappa d_K}^{(K)}\big) \Big]$$

(34)

Sketch of Proof (continued)

• The proof of the second assertion is based on the Hoeffding decomposition
  $$U_n(H) = \frac{1}{n_1! \cdots n_K!} \sum_{\sigma_1 \in \mathfrak{S}_{n_1}, \ldots, \sigma_K \in \mathfrak{S}_{n_K}} V_H\big(X_{\sigma_1(1)}^{(1)}, \ldots, X_{\sigma_K(n_K)}^{(K)}\big)$$
• The concentration result is then obtained in a classical manner
  • Convexity (Chernoff's bound)
  • Symmetrization
  • Randomization

(35)

Beyond finite VC dimension

• Challenge: develop probabilistic tools and complexity assumptions to investigate the concentration properties of collections of sums of weighted binomials
  $$\tilde{U}_B(H) - U_n(H) = \frac{1}{B} \sum_{k=1}^{B} Z_k(H), \quad \text{with} \quad Z_k(H) = \sum_{I \in \Lambda} (\epsilon_k(I) - 1/\#\Lambda)\, H(X_I)$$

(36)

Some references

• Maximal Deviations of Incomplete U-statistics with Applications to Empirical Risk Sampling. S. Clémençon, S. Robbiano and J. Tressou (2013). In the Proceedings of the SIAM International Conference on Data-Mining, Austin (USA).

• Empirical processes in survey sampling. P. Bertail, E. Chautru and S. Clémençon (2013). Submitted.

• A statistical view of clustering performance through the theory of U-processes. S. Clémençon (2014). In Journal of Multivariate Analysis.

• On Survey Sampling and Empirical Risk Minimization. P. Bertail, E. Chautru and S. Clémençon (2014). ISAIM 2014, Fort Lauderdale (USA).

(37)

Introduction

Investigate the binary classification problem in a statistical learning context

• Data not stored in a central unit but processed by independent agents (processors)
• Aim: not to find a consensus on a common classifier but to find how to combine the local ones efficiently
• Solution: implement in an on-line and distributed manner

(38)

Outline

• Background
• Proposed algorithm
• Theoretical results
• Improvement of agents selection
• Numerical experiments

(40)

Learning problem

r.v. observation $X \in \mathcal{X} \subset \mathbb{R}^n$ → sign(H(X)) → r.v. binary output $\hat{Y} \in \{-1, +1\}$

Given a training dataset $(X, Y) = (X_i, Y_i)_{i=1,\ldots,n}$ in a high dimension $n$ and with unknown joint distribution...

...find the best prediction rule $\mathrm{sign}(H^\star)$ such that the classifier function $H(x)$:

$$H^\star = \arg\min_H P_e(H) \quad \text{where} \quad P_e(H) = P[-Y H(X) > 0] = E\big[\mathbb{1}_{\{-Y H(X) > 0\}}\big]$$

minimizes the probability of error $P_e$

Warning: but $\mathbb{1}(x)$ is not a differentiable function!

(41)

Learning problem

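Majorize $E[\mathbb{1}_{\{-Y H(X) > 0\}}]$ by a convex function: Convex Surrogate

$$E\big[\mathbb{1}_{\{-Y H(X) > 0\}}\big] \le E[\varphi(-Y H(X))]$$

How? Use a cost function with appropriate properties

Example: use the quadratic function $\varphi(u) = \frac{(u+1)^2}{2} : \mathbb{R} \to [0, +\infty)$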

(42)

Learning problem

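r.v. observation $X \in \mathcal{X} \subset \mathbb{R}^n$ → sign(H(X)) → r.v. binary output $\hat{Y} \in \{-1, +1\}$

Given a training dataset $(X, Y) = (X_i, Y_i)_{i=1,\ldots,n}$ in a high dimension $n$ and with unknown joint distribution...

...find the best prediction rule $\mathrm{sign}(H^\star)$ such that the classifier function $H(x)$:

$$H^\star = \arg\min_H R_\varphi(H) \quad \text{where} \quad R_\varphi(H) = E[\varphi(-Y H(X))]$$

minimizes the risk function $R_\varphi(H)$

Note: when $\varphi(u) = \frac{(u+1)^2}{2}$, $\mathrm{sign}(H^\star)$ coincides with the Bayes classifier!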
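The link between the quadratic cost and the Bayes rule can be checked by hand. A short derivation, assuming the minimum is taken over all measurable functions $H$ and using only $Y^2 = 1$:

```latex
R_\varphi(H) = E\big[\varphi(-Y H(X))\big]
             = \tfrac{1}{2}\, E\big[(1 - Y H(X))^2\big]
             = \tfrac{1}{2}\, E\big[(Y - H(X))^2\big]
             % since (1 - Y H)^2 = Y^2 (Y - H)^2 and Y^2 = 1
\quad\Longrightarrow\quad
H^\star(x) = E[Y \mid X = x] = 2\, P\{Y = 1 \mid X = x\} - 1 ,
\qquad
\mathrm{sign}(H^\star(x)) = +1 \iff P\{Y = 1 \mid X = x\} > 1/2 .
```

The last condition is the rule that predicts $+1$ exactly when the a posteriori probability of the positive class exceeds $1/2$.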

(43)

Aggregation of local classifiers

Consider a classification device composed of a set $V$ of $N$ connected agents

Each agent $v \in V$:

• disposes of $\{(X_{1,v}, Y_{1,v}), \ldots, (X_{n_v,v}, Y_{n_v,v})\}$, i.e. $n_v$ independent copies of $(X, Y)$
• selects a local soft classifier function from a parametric class $\{h_v(\cdot, \theta_v)\}$

Set $\theta_v = (\alpha_v, \beta_v)$, the global soft classifier is:

$$H(x, \boldsymbol{\theta}) = \sum_{v \in V} h_v(x, \theta_v)$$

where: $h_v(x, \theta_v) = \alpha_v h_v(x, \beta_v)$ and $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_N)^T$

(44)

Problem statement

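The problem can be summarized as follows:

• given an observed data point $X$
• obtain the best estimated label $\hat{Y}$ as $\mathrm{sign}(H(X, \boldsymbol{\theta}))$
• where $\boldsymbol{\theta}$ is computed from the optimization problem using the training data $(X, Y) = (X_i, Y_i)_{i=1,\ldots,n}$ as:
  $$\min_{\boldsymbol{\theta} \in \Theta} R_\varphi(H(X, \boldsymbol{\theta}))$$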

(45)

Problem statement

Approaches

1. Agreement to a common decision rule [Tsitsiklis-84', Agarwal-10']: consensus approach
   • find an average consensus solution: $\boldsymbol{\theta} = (\theta, \ldots, \theta)$
   • each agent uses the global classifier $H(X, \boldsymbol{\theta})$
2. Mixture of experts: cooperative approach
   • find the best aggregation solution: $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_N)$
   • each agent uses its local classifier $h_v(x, \theta_v)$

(46)

Problem statement

Example (cooperative approach): set $\beta_v = 0$, $\alpha_v \ge 0$ and $h_v : \mathcal{X} \to \{-1, +1\}$, the weak classifier:

$$h_v(x, \theta_v) = \alpha_v h_v(x)$$

(48)

High rate distributed learning

Solve the minimization problem of the parametric risk function:

$$\min_{\boldsymbol{\theta} \in \Theta} R_\varphi(H(X, \boldsymbol{\theta}))$$

(49)

High rate distributed learning

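A standard distributed gradient descent iterative approach:

• generates a vector sequence of the estimated parameter $(\boldsymbol{\theta}_t)_{t \ge 1} = (\theta_{t,1}, \ldots, \theta_{t,N})_{t \ge 1}$
• at each agent $v$ the update step writes:
  $$\theta_{t+1,v} = \theta_{t,v} + \gamma_t\, E\big[Y\, \nabla_v h_v(X, \theta_{t,v})\, \varphi'(-Y H(X, \boldsymbol{\theta}_t))\big]$$

Warning: the expectation cannot be computed, since the joint distribution is unknown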

(50)

High rate distributed learning

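A standard distributed and on-line gradient descent iterative approach is:

• generate a vector sequence of the estimated parameter $(\boldsymbol{\theta}_t)_{t \ge 1} = (\theta_{t,1}, \ldots, \theta_{t,N})_{t \ge 1}$
• each agent $v$ observes a pair $(X_{t+1,v}, Y_{t+1,v})$
• at each agent $v$ the update step writes (the expectation is replaced by its empirical version):
  $$\theta_{t+1,v} = \theta_{t,v} + \gamma_t\, Y_{t+1,v}\, \nabla_v h_v(X_{t+1,v}, \theta_{t,v})\, \varphi'(-Y_{t+1,v} H(X_{t+1,v}, \boldsymbol{\theta}_t))$$

Warning: evaluating $H(X_{t+1,v}, \boldsymbol{\theta}_t)$ is required at each $t$ and $v$!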
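The on-line update can be sketched for a simple special case. The code below assumes the quadratic cost $\varphi(u) = (u+1)^2/2$ (so $\varphi'(u) = u + 1$), weak classifiers $h_v(x, \theta_v) = \alpha_v h_v(x)$ given by one-dimensional stumps, synthetic data with $Y = \mathrm{sign}(X)$, and a step size $\gamma_t = 1/(t + 20)$. These are illustrative choices, not the authors' experimental setting.

```python
import random

def phi_prime(u):
    """Derivative of the quadratic cost phi(u) = (u + 1)**2 / 2."""
    return u + 1.0

def make_weak(threshold):
    """Weak classifier x -> +1 if x > threshold else -1."""
    return lambda x: 1.0 if x > threshold else -1.0

def global_score(x, alphas, weak):
    """H(x, theta) = sum_v alpha_v * h_v(x): needs every agent's evaluation."""
    return sum(a * h(x) for a, h in zip(alphas, weak))

def online_step(alphas, weak, observations, gamma):
    """One synchronous update; observations[v] is the pair (x, y) seen by agent v."""
    new_alphas = []
    for v, (x, y) in enumerate(observations):
        score = global_score(x, alphas, weak)   # full communication
        grad_hv = weak[v](x)                    # d/d alpha_v of alpha_v * h_v(x)
        new_alphas.append(alphas[v] + gamma * y * grad_hv * phi_prime(-y * score))
    return new_alphas

rng = random.Random(2)
weak = [make_weak(t) for t in (-0.5, 0.0, 0.5, 0.2)]   # N = 4 agents
alphas = [0.0] * len(weak)
for t in range(2000):
    observations = []
    for _ in weak:
        x = rng.uniform(-1.0, 1.0)
        observations.append((x, 1.0 if x > 0 else -1.0))
    alphas = online_step(alphas, weak, observations, gamma=1.0 / (t + 20))
print(alphas)
```

Every call to `global_score` stands for one agent collecting the evaluations of all the others, which is the communication burden pointed out in the warning above.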

(51)

High rate distributed learning

Example

At iteration $t$, each agent $v \in V$ has $(X_{t,v}, \theta_{t,v})$...

[Diagram: four agents 1, 2, 3, 4; agent $v$ holds $(X_{t,v}, \theta_{t,v})$]

...and evaluates its local $h_v(X_{t,v}, \theta_{t,v})$

Each node $v$ sends its observation $X_{t,v}$ to all the other nodes...

[Diagram: node 1 sends $X_{t,1}$ to nodes 2, 3 and 4]

Each node $v$ obtains the evaluation of $h_w(X_{t,v}, \theta_{t,w})$ from all the other nodes...

[Diagram: node 1 holds $h_1(X_{t,1}, \theta_{t,1})$ and receives $\{h_2(X_{t,1}, \theta_{t,2}), h_3(X_{t,1}, \theta_{t,3}), h_4(X_{t,1}, \theta_{t,4})\}$]

...and computes the global: $H(X_{t,1}, \boldsymbol{\theta}_t) = \sum_{w=1}^{4} h_w(X_{t,1}, \theta_{t,w})$

Warning: $N(N-1)$ communications per iteration; $N = 4$ gives 12!

(55)

Proposed distributed learning : OLGA algorithm

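Replace the global $H(X_{t+1,v}, \boldsymbol{\theta}_t)$ by a local estimate $\hat{Y}_{t+1,v}^{(V)}$ at each $v \in V$ such that:

$$E\big[\hat{Y}_{t+1,v}^{(V)} \mid X_{t+1,v}, \boldsymbol{\theta}_t\big] = H(X_{t+1,v}, \boldsymbol{\theta}_t)$$

How? Sparse communications with sparsity ratio $p$... On-line Learning Gossip Algorithm (OLGA)

...for each $v \in V$ at time $t$, the local gradient descent update writes:

$$\theta_{t+1,v} = \theta_{t,v} + \gamma_t\, Y_{t+1,v}\, \nabla_v h_v(X_{t+1,v}, \theta_{t,v})\, \varphi'\big(-Y_{t+1,v}\, \hat{Y}_{t+1,v}^{(V)}\big)$$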

(56)

Proposed distributed learning : OLGA algorithm

Example

At iteration $t$, each agent $v \in V$ has $(X_{t,v}, \theta_{t,v})$...

[Diagram: four agents 1, 2, 3, 4; agent $v$ holds $(X_{t,v}, \theta_{t,v})$]

...and evaluates its local $h_v(X_{t,v}, \theta_{t,v})$

Each node $v$ sends its observation $X_{t,v}$ to randomly selected nodes with probability $p = \frac{1}{3}$...

[Diagram: node 1 sends $X_{t,1}$ to node 4 only]

Each node $v$ obtains the evaluation of $h_w(X_{t,v}, \theta_{t,w})$ from the randomly selected nodes...

[Diagram: node 1 holds $h_1(X_{t,1}, \theta_{t,1})$ and receives $\{h_4(X_{t,1}, \theta_{t,4})\}$]

...and computes its local estimate: $\hat{Y}_{t,1}^{(V)} = h_1(X_{t,1}, \theta_{t,1}) + \frac{1}{p} h_4(X_{t,1}, \theta_{t,4})$

Warning: $pN(N-1)$ communications per iteration; $N = 4$, $p = \frac{1}{3}$ gives 4 (reduction of 67%)!
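The sparsified estimate in this example can be simulated directly. The sketch below assumes each other agent is contacted independently with probability $p$ and that its contribution is rescaled by $1/p$, as in the expression for $\hat{Y}_{t,1}^{(V)}$ above; the weights and stumps are hypothetical. It checks empirically that the estimate is centered on the global score and that about $p(N-1)$ agents are contacted.

```python
import random

def make_weak(threshold):
    """Weak classifier x -> +1 if x > threshold else -1."""
    return lambda x: 1.0 if x > threshold else -1.0

def olga_estimate(v, x, alphas, weak, p, rng):
    """Local estimate at agent v: own term plus (1/p) times the terms of the
    agents selected independently with probability p.
    Returns (estimate, number of agents contacted)."""
    estimate = alphas[v] * weak[v](x)
    contacted = 0
    for w in range(len(alphas)):
        if w != v and rng.random() < p:
            estimate += alphas[w] * weak[w](x) / p
            contacted += 1
    return estimate, contacted

rng = random.Random(3)
weak = [make_weak(t) for t in (-0.5, 0.0, 0.5, 0.2)]   # N = 4 agents
alphas = [0.5, -0.2, 0.3, 0.8]                          # hypothetical weights
x, p = 0.1, 1.0 / 3.0
exact = sum(a * h(x) for a, h in zip(alphas, weak))     # global H(x, theta)

draws = [olga_estimate(0, x, alphas, weak, p, rng) for _ in range(20000)]
mean_estimate = sum(e for e, _ in draws) / len(draws)
mean_contacted = sum(c for _, c in draws) / len(draws)
print(exact, mean_estimate, mean_contacted)
```

A single estimate can be far from the global score; the price of the sparse communication is extra variance, which grows as $p$ decreases.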

(60)

Performance analysis

Warning: what is the effect of sparsification?...

...study the behaviour of the vector sequence $\boldsymbol{\theta}_t$ as $t \to \infty$:

• the consistency of the final solution given by the algorithm
• quantify the excess error variance due to the sparsity

(62)

Asymptotic behaviour of OLGA

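Under suitable assumptions, we prove the following results:

1. Consistency:
   $$(\boldsymbol{\theta}_t)_{t \ge 1} \xrightarrow{a.s.} \boldsymbol{\theta}^\star \in \mathcal{L} = \{\nabla R_\varphi(\boldsymbol{\theta}) = 0\}$$
2. CLT: conditioned on the event $\{\lim_{t \to \infty} \boldsymbol{\theta}_t = \boldsymbol{\theta}^\star\}$,
   $$\frac{1}{\sqrt{\gamma_t}} (\boldsymbol{\theta}_t - \boldsymbol{\theta}^\star) \xrightarrow{\mathcal{L}} \mathcal{N}(0, \Sigma(\Gamma^\star))$$
   where:
   $$\Gamma^\star = E\big[(H(X, \boldsymbol{\theta}^\star) - Y)^2\, \nabla_v h_v(X, \theta_v^\star)\, \nabla_v^T h_v(X, \theta_v^\star)\big] + \frac{1 - p}{p} \sum_{w \neq v} E\big[h_w(X, \theta_w^\star)^2\, \nabla_v h_v(X, \theta_v^\star)\, \nabla_v^T h_v(X, \theta_v^\star)\big]$$
   The first term is the estimation error in a centralized case; the second is the additional noise term induced by the distributed setting.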

(64)

A best agents selection approach

When...

• the number of agents $N$ increases → difficult to implement
• there are redundant agents → avoid similar outputs

... include distributed agent selection!

How? Add an $\ell_1$-penalization term with tuning parameter $\lambda$:

$$\min_{\boldsymbol{\theta} \in \Theta} R_\varphi(H(X, \boldsymbol{\theta})) + \lambda \sum_{v} |\alpha_v|$$

where:

• the weight $\alpha_v = 0$ for an idle agent and $\alpha_v > 0$ when it is active
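The slides do not say how the $\ell_1$ term is handled numerically. One common way to obtain exact zeros, shown here purely as an assumption, is a soft-thresholding step applied to the weights; the sketch also evaluates the penalized objective and reads off the active agents ($\alpha_v > 0$). All numerical values are hypothetical.

```python
def soft_threshold(a, tau):
    """Shrink a towards 0 by tau and set it to exactly 0 when |a| <= tau.
    (Assumed technique: the slides do not specify the solver.)"""
    if a > tau:
        return a - tau
    if a < -tau:
        return a + tau
    return 0.0

def penalized_objective(risk_value, alphas, lam):
    """Risk plus the l1 penalty lam * sum_v |alpha_v|."""
    return risk_value + lam * sum(abs(a) for a in alphas)

def active_agents(alphas):
    """Indices v with alpha_v > 0, i.e. the agents that are not idle."""
    return [v for v, a in enumerate(alphas) if a > 0]

alphas = [1.0, 0.125, 0.5, 0.0]
shrunk = [soft_threshold(a, 0.25) for a in alphas]
print(shrunk, active_agents(shrunk))
print(penalized_objective(0.5, alphas, lam=2.0))
```

Agents whose weight falls below the threshold become idle, so a larger tuning parameter leaves fewer active agents.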

(65)

Including best agents selection to OLGA algorithm

Introduce an update step at each time $t$ of OLGA to seek the time-varying set of active nodes $S_t \subset V$

(66)

Including best agents selection to OLGA algorithm

The extended algorithm is summarized as follows, at time $t$:

1. obtain the active nodes $S_t$ from the sequence of updated weights $(\alpha_{t,1}, \ldots, \alpha_{t,N})$
2. apply OLGA to the set of active agents $v \in S_t$ as:
   i) estimate the local $\hat{Y}_{t+1,v}^{(S_t)}$ from a random selection among the current active nodes
   ii) update the local gradient descent
   $$\theta_{t+1,v} = \theta_{t,v} + \gamma_t\, Y_{t+1,v}\, \nabla_v h_v(X_{t+1,v}, \theta_{t,v})\, \varphi'\big(-Y_{t+1,v}\, \hat{Y}_{t+1,v}^{(S_t)}\big)$$

(68)

Example with simulated data

Binary classification of (+) and (o) data samples with $N = 60$ agents using weak linear classifiers (-). When using distributed selection, it reduces to 25 active classifiers.

[Figure: two scatter plots, both axes ranging from -6 to 6: (a) OLGA; (b) OLGA with distributed selection]

(69)

Comparison with real data

Binary classification of the available benchmark dataset banana using weak linear classifiers when increasing $N$.

[Plot: error rate (0.2 to 0.55) against the number of weak learners (5 to 35), for OLGA (p = 0.6) and GentleBoost]

Figure: Comparison between a centralized and sequential approach (GentleBoost) and our distributed and on-line algorithm (OLGA).

(70)

Conclusions

• A fully distributed and on-line algorithm is proposed for binary classification of big datasets solved by $N$ processors
  • the algorithm is then adapted to select useful classifiers, which reduces $N$
• We obtain theoretical results from the asymptotic analysis of the sequence estimated by OLGA
• Numerical results are illustrated showing a comparable behaviour to a centralized, batch and sequential approach (GentleBoost)
