Machine-Learning for Big Data: Sampling and Distributed On-Line Algorithms. Stéphan Clémençon

(1)

LTCI UMR CNRS No. 5141 - Telecom ParisTech

(2)

Goals of Statistical Learning Theory

• Statistical issues cast as M-estimation problems:
  • Classification
  • Regression
  • Density level set estimation
  • ... and their variants

• Minimal assumptions on the distribution

• Build realistic M-estimators for special criteria

• Questions:
  • Optimal elements
  • Consistency
  • Non-asymptotic excess risk bounds
  • Fast rates of convergence

(3)

Main Example: Classification

• $(X, Y)$ random pair with unknown distribution $P$

• $X \in \mathcal{X}$ observation vector

• $Y \in \{-1, +1\}$ binary label/class

• A posteriori probability ~ regression function: $\forall x \in \mathcal{X},\ \eta(x) = P\{Y = 1 \mid X = x\}$

• $g : \mathcal{X} \to \{-1, +1\}$ classifier

• Performance measure = classification error: $L(g) = P\{g(X) \neq Y\} \to \min_g$

• Solution: Bayes rule: $\forall x \in \mathcal{X},\ g^*(x) = 2\,\mathbb{I}\{\eta(x) > 1/2\} - 1$

• Bayes error $L^* = L(g^*)$

(4)

Empirical Risk Minimization

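• Sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ of i.i.d. copies of $(X, Y)$

• Class $\mathcal{G}$ of classifiers

• Empirical Risk Minimization principle: $\hat g_n = \arg\min_{g \in \mathcal{G}} L_n(g)$, where $L_n(g) := \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{g(X_i) \neq Y_i\}$

• Best classifier in the class: $\bar g = \arg\min_{g \in \mathcal{G}} L(g)$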
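A minimal sketch of the ERM principle on a toy problem may help fix ideas. Everything specific here is an assumption made for illustration only: the data distribution, the finite class of threshold classifiers ("stumps") and the helper names `empirical_risk`, `make_stump` and `erm` do not come from the talk.

```python
import random

def empirical_risk(g, sample):
    """L_n(g): fraction of pairs (x, y) in the sample misclassified by g."""
    return sum(g(x) != y for x, y in sample) / len(sample)

def make_stump(threshold):
    """Classifier g(x) = +1 if x > threshold, -1 otherwise."""
    return lambda x: 1 if x > threshold else -1

def erm(thresholds, sample):
    """Return the threshold whose stump minimizes the empirical risk."""
    return min(thresholds, key=lambda t: empirical_risk(make_stump(t), sample))

random.seed(0)
# Hypothetical toy distribution: X uniform on [0, 1] and
# eta(x) = P(Y = +1 | X = x) = 0.9 if x > 0.5, 0.1 otherwise.
sample = []
for _ in range(2000):
    x = random.random()
    eta = 0.9 if x > 0.5 else 0.1
    y = 1 if random.random() < eta else -1
    sample.append((x, y))

grid = [k / 20 for k in range(21)]   # finite class G of 21 stumps
t_hat = erm(grid, sample)            # empirical risk minimizer over G
risk_hat = empirical_risk(make_stump(t_hat), sample)
print(t_hat, risk_hat)
```

In this toy setting the Bayes rule is the stump at 0.5, so the selected threshold should land close to it when n is large.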

(5)

Empirical Processes in Classification

(6)

Classification Theory - Main Results

1. Bayes risk consistency and rate of convergence

   Complexity control: $E \sup_{g \in \mathcal{G}} |L_n(g) - L(g)| \le C \sqrt{V/n}$ if $\mathcal{G}$ is a VC class with VC dimension $V$.

2. Fast rates of convergence

   Under variance control: rate faster than $n^{-1/2}$

3. Convex risk minimization

(9)

Big Data? Big Challenge!

Now, it is much easier

• to collect data, massively and in real-time: ubiquity of sensors (cell phones, internet, embedded systems, social networks, ...)

• to store and manage Big (and Complex) Data (distributed file systems, NoSQL)

• to implement massively parallelized and distributed computational algorithms (MapReduce, clouds)

The three features of Big Data analysis

• Velocity: process data in quasi-real time (on-line algorithms)

• Volume: scalability (parallelized, distributed algorithms)

• Variety: complex data (text, signal, image, graph)

(10)

How to apply ERM to Big Data?

• Suppose that $n$ is too large to evaluate the empirical risk $L_n(g)$

• Common sense: run your preferred learning algorithm on a subsample of "reasonable" size $B \ll n$, e.g. drawn with replacement from the original training data set...

• ... but of course, statistical performance is downgraded!

(12)

Survey designs: a solution to Big Data learning?

• Framework: massive original sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ viewed as a superpopulation

• Survey plan $R_n$ = probability distribution on the ensemble of all nonempty subsets of $\{1, \ldots, n\}$

• Let $S \sim R_n$ and set $\epsilon_i = 1$ if $i \in S$, $\epsilon_i = 0$ otherwise. The vector $(\epsilon_1, \ldots, \epsilon_n)$ fully describes $S$

• First and second order inclusion probabilities: $\pi_i(R_n) = P\{i \in S\}$ and $\pi_{i,j}(R_n) = P\{(i, j) \in S^2\}$

• Do not rely on the empirical risk based on the survey sample $\{(X_i, Y_i) : i \in S\}$: $\frac{1}{\#S} \sum_{i \in S} \mathbb{I}\{g(X_i) \neq Y_i\}$

(13)

Horvitz-Thompson theory

• Consider the Horvitz-Thompson estimator of the risk: $L_n^{R_n}(g) = \frac{1}{n} \sum_{i=1}^{n} \frac{\epsilon_i}{\pi_i} \mathbb{I}\{g(X_i) \neq Y_i\}$

• And the Horvitz-Thompson empirical risk minimizer: $g_n^{\epsilon} = \arg\min_{g \in \mathcal{G}} L_n^{R_n}(g)$

• It may work if $\sup_{g \in \mathcal{G}} |L_n^{R_n}(g) - L_n(g)|$ is small

• In general, due to the dependence structure, not much can be said about the fluctuations of this supremum
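The Horvitz-Thompson reweighting can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: the toy data, the unequal inclusion probabilities and the function names `poisson_sample`, `ht_risk` and `full_risk` are hypothetical, and the sampling plan used is the Poisson one (independent inclusion indicators).

```python
import random

def full_risk(g, data):
    """Empirical risk L_n(g) computed on the whole sample."""
    return sum(g(x) != y for x, y in data) / len(data)

def poisson_sample(pis, rng):
    """Independent inclusion indicators eps_i ~ Bernoulli(pi_i)."""
    return [1 if rng.random() < pi else 0 for pi in pis]

def ht_risk(g, data, eps, pis):
    """Horvitz-Thompson risk: (1/n) * sum_i (eps_i / pi_i) * 1{g(X_i) != Y_i}."""
    n = len(data)
    return sum(e / pi * (g(x) != y) for (x, y), e, pi in zip(data, eps, pis)) / n

def g(x):
    """Fixed classifier under evaluation (hypothetical stump at 0.5)."""
    return 1 if x > 0.5 else -1

rng = random.Random(1)
data = []
for _ in range(20000):
    x = rng.random()
    y = 1 if rng.random() < (0.9 if x > 0.5 else 0.1) else -1
    data.append((x, y))

# Unequal first order inclusion probabilities (assumed known by design).
pis = [0.1 if x <= 0.5 else 0.3 for x, _ in data]
eps = poisson_sample(pis, rng)

l_full = full_risk(g, data)        # uses all n observations
l_ht = ht_risk(g, data, eps, pis)  # uses only the sampled ones
print(sum(eps), l_full, l_ht)
```

Dividing each sampled term by its inclusion probability is what keeps the estimate unbiased for $L_n(g)$ even though the units are not sampled with equal probability.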

(14)

The Poisson case: the $\epsilon_i$'s are independent

• In this case, $L_n^{R_n}(g)$ is a simple average of independent r.v.'s $\Rightarrow$ back to empirical process theory

• One recovers the same learning rate as if all data had been used, e.g. in the finite VC dimension case:

$E[L(g_n^{\epsilon}) - L^*] \le (\kappa_n \sqrt{2} + 4) \sqrt{\frac{V \log(n+1) + \log 2}{n}}$

where $\kappa_n = \sqrt{\sum_{i=1}^{n} (1/\pi_i^2)}$ (the $\pi_i$'s should not be too small...)

• The upper bound is optimal in the minimax sense.

(15)

The Poisson case: the $\epsilon_i$'s are independent

• Can be extended to more general sampling plans $Q_n$ provided you are able to control

$d_{TV}(R_n, Q_n) \overset{def}{=} \sum_{S \in \mathcal{P}(U_n)} |Q_n(S) - R_n(S)|.$

• A coupling technique (Hajek, 1964) can be used to show that it works for rejective sampling, Rao-Sampford sampling, ...

(16)

Beyond Empirical Processes: U-Statistics as Performance Criteria

• In various situations, the performance criterion is not a basic sample mean statistic any more

• Examples:
  • Clustering: within cluster point scatter related to a partition $\mathcal{P}$: $\frac{2}{n(n-1)} \sum_{i<j} D(X_i, X_j) \sum_{C \in \mathcal{P}} \mathbb{I}\{(X_i, X_j) \in C^2\}$
  • Graph inference (link prediction)
  • Ranking
  • ...

• The empirical criterion is an average over all possible $k$-tuples: a U-statistic of degree $k \ge 2$
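As an illustration of a degree-2 criterion, the within cluster point scatter can be computed by averaging over all pairs. The sketch below is a hypothetical example: the one-dimensional points, the labelings and the function name `within_cluster_scatter` are assumptions chosen for readability.

```python
from itertools import combinations

def within_cluster_scatter(points, labels, dist):
    """2 / (n (n - 1)) * sum over pairs i < j in the same cell of D(X_i, X_j)."""
    n = len(points)
    total = sum(
        dist(points[i], points[j])
        for i, j in combinations(range(n), 2)
        if labels[i] == labels[j]   # indicator that both points lie in one cell C
    )
    return 2.0 * total / (n * (n - 1))

def abs_dist(a, b):
    return abs(a - b)

points = [0.0, 1.0, 10.0, 11.0]
good = within_cluster_scatter(points, [0, 0, 1, 1], abs_dist)  # natural partition
bad = within_cluster_scatter(points, [0, 1, 0, 1], abs_dist)   # mixed partition
print(good, bad)
```

The number of pairs grows like $n^2$, which is the computational issue that sampling pairs is meant to address.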

(17)

Example: Ranking

• Data with ordinal label: $(X_1, Y_1), \ldots, (X_n, Y_n) \in (\mathcal{X} \times \{1, \ldots, K\})^{\otimes n}$

• Goal: rank $X_1, \ldots, X_n$ through a scoring function $s : \mathcal{X} \to \mathbb{R}$ s.t. $s(X)$ and $Y$ tend to increase/decrease together with high probability

• Quantitative formulation: maximize the criterion

$L(s) = P\{s(X^{(1)}) < \ldots < s(X^{(K)}) \mid Y^{(1)} = 1, \ldots, Y^{(K)} = K\}$

• Observations: $n_k$ i.i.d. copies of $X$ given $Y = k$, $X_1^{(k)}, \ldots, X_{n_k}^{(k)}$

(21)

Example: Ranking

• A natural empirical counterpart of $L(s)$ is

$\hat L_n(s) = \frac{\sum_{i_1=1}^{n_1} \cdots \sum_{i_K=1}^{n_K} \mathbb{I}\{s(X_{i_1}^{(1)}) < \ldots < s(X_{i_K}^{(K)})\}}{n_1 \times \cdots \times n_K},$

• But the number of terms to be summed is prohibitive: $n_1 \times \ldots \times n_K$
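The empirical ranking criterion can be computed by brute force on a tiny example, which also makes the cost visible. The samples, the identity scoring function and the name `ranking_criterion` below are assumptions for illustration only.

```python
from itertools import product

def ranking_criterion(score, samples):
    """Fraction of tuples (x_1, ..., x_K), one point per class, such that
    score(x_1) < ... < score(x_K). Cost: n_1 * ... * n_K kernel evaluations."""
    scored = [[score(x) for x in sample] for sample in samples]
    total = 0
    hits = 0
    for tup in product(*scored):
        total += 1
        hits += all(a < b for a, b in zip(tup, tup[1:]))
    return hits / total

# Hypothetical data with K = 3 classes: samples[k] holds the points with label k + 1.
samples = [[0.1, 0.4], [0.3, 0.5, 0.9], [0.6, 0.8]]
value = ranking_criterion(lambda x: x, samples)
n_terms = len(samples[0]) * len(samples[1]) * len(samples[2])
print(value, n_terms)
```

Here 6 of the 12 tuples are correctly ordered, so the criterion equals 0.5; with thousands of points per class the same loop would need billions of terms.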

(24)

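Generalized U-statistics

• $K \ge 1$ samples and degrees $(d_1, \ldots, d_K) \in \mathbb{N}^{*K}$

• $(X_1^{(k)}, \ldots, X_{n_k}^{(k)})$, $1 \le k \le K$, $K$ independent i.i.d. samples drawn from $F_k(dx)$ on $\mathcal{X}_k$ respectively

• Kernel $H : \mathcal{X}_1^{d_1} \times \cdots \times \mathcal{X}_K^{d_K} \to \mathbb{R}$, square integrable w.r.t. $\mu = F_1^{\otimes d_1} \otimes \cdots \otimes F_K^{\otimes d_K}$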

(25)

Generalized U-statistics

Definition

The $K$-sample U-statistic of degrees $(d_1, \ldots, d_K)$ with kernel $H$ is

$U_n(H) = \frac{\sum_{I_1} \cdots \sum_{I_K} H(X_{I_1}^{(1)}; X_{I_2}^{(2)}; \ldots; X_{I_K}^{(K)})}{\binom{n_1}{d_1} \times \cdots \times \binom{n_K}{d_K}},$

where $\sum_{I_k}$ refers to summation over all $\binom{n_k}{d_k}$ subsets $X_{I_k}^{(k)} = (X_{i_1}^{(k)}, \ldots, X_{i_{d_k}}^{(k)})$ related to a set $I_k$ of $d_k$ indexes $1 \le i_1 < \ldots < i_{d_k} \le n_k$.

It is said to be symmetric when $H$ is permutation symmetric in each set of $d_k$ arguments $X_{I_k}^{(k)}$.

(26)

Generalized U-statistics

• Unbiased estimator of $\theta(H) = E[H(X_1^{(1)}, \ldots, X_{d_1}^{(1)}, \ldots, X_1^{(K)}, \ldots, X_{d_K}^{(K)})]$ with minimum variance

• Asymptotically Gaussian as $n_k / n \to \lambda_k > 0$ for $k = 1, \ldots, K$

• Its computation requires the summation of $\prod_{k=1}^{K} \binom{n_k}{d_k}$ terms

• $K$-partite ranking: $d_k = 1$ for $1 \le k \le K$

(27)

Incomplete U-statistics

• Replace $U_n(H)$ by an incomplete version, involving far fewer terms

• Build a set $\mathcal{D}_B$ of cardinality $B$ by sampling with replacement in the set $\Lambda$ of indexes $((i_1^{(1)}, \ldots, i_{d_1}^{(1)}), \ldots, (i_1^{(K)}, \ldots, i_{d_K}^{(K)}))$ with $1 \le i_1^{(k)} < \ldots < i_{d_k}^{(k)} \le n_k$, $1 \le k \le K$

• Compute the Monte-Carlo version based on $B$ terms: $\tilde U_B(H) = \frac{1}{B} \sum_{(I_1, \ldots, I_K) \in \mathcal{D}_B} H(X_{I_1}^{(1)}, \ldots, X_{I_K}^{(K)})$

• An incomplete U-statistic is NOT a U-statistic
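The Monte-Carlo construction can be sketched for the K-partite ranking kernel (all degrees $d_k = 1$), where drawing one index uniformly in each sample amounts to a uniform draw in $\Lambda$. The Gaussian toy samples, the sizes and the function names are assumptions made for this illustration.

```python
import random
from itertools import product

def kernel(tup):
    """Ranking kernel: 1 if the scores are strictly increasing, 0 otherwise."""
    return 1.0 if all(a < b for a, b in zip(tup, tup[1:])) else 0.0

def complete_u(samples):
    """Complete U-statistic: average of the kernel over all index tuples."""
    tuples = list(product(*samples))
    return sum(kernel(t) for t in tuples) / len(tuples)

def incomplete_u(samples, B, rng):
    """Incomplete version: average over B tuples drawn with replacement."""
    total = 0.0
    for _ in range(B):
        tup = tuple(rng.choice(sample) for sample in samples)  # uniform draw in Lambda
        total += kernel(tup)
    return total / B

rng = random.Random(0)
# Hypothetical scores for K = 3 classes, 40 points each, with increasing means.
samples = [[rng.gauss(mean, 1.0) for _ in range(40)] for mean in (0.0, 1.0, 2.0)]

u_full = complete_u(samples)             # 40 ** 3 = 64000 terms
u_mc = incomplete_u(samples, 5000, rng)  # 5000 terms
print(u_full, u_mc)
```

Conditionally on the data, the incomplete version is an average of B i.i.d. bounded terms with mean $U_n(H)$, which is why the two values should be close.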

(28)

ERM based on incomplete U-statistics

• Replace the criterion by a tractable incomplete version based on $B = O(n)$ terms: $\min_{H \in \mathcal{H}} \tilde U_B(H)$

• This leads to investigate the maximal deviations $\sup_{H \in \mathcal{H}} |\tilde U_B(H) - U_n(H)|$

(29)

Main Result

Theorem

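Let $\mathcal{H}$ be a VC major class of bounded symmetric kernels of finite VC dimension $V < +\infty$. Set $M_{\mathcal{H}} = \sup_{(H, x) \in \mathcal{H} \times \mathcal{X}} |H(x)|$. Then,

(i) $P\left\{ \sup_{H \in \mathcal{H}} |\tilde U_B(H) - U_n(H)| > \eta \right\} \le 2 (1 + \#\Lambda)^V \times e^{-B \eta^2 / M_{\mathcal{H}}^2}$

(ii) for all $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have:

$\frac{1}{M_{\mathcal{H}}} \sup_{H \in \mathcal{H}} \left| \tilde U_B(H) - E[\tilde U_B(H)] \right| \le 2 \sqrt{\frac{2 V \log(1 + \kappa)}{\kappa}} + \sqrt{\frac{\log(2/\delta)}{\kappa}} + \sqrt{\frac{V \log(1 + \#\Lambda) + \log(4/\delta)}{B}},$

where $\kappa = \min\{\lfloor n_1/d_1 \rfloor, \ldots, \lfloor n_K/d_K \rfloor\}$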

(30)

Consequences

• Empirical risk sampling with $B = O(n)$ yields a rate bound of the order $O(\sqrt{\log n / n})$

• One suffers no loss in terms of learning rate, while drastically reducing computational cost
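The order of magnitude can be checked numerically by evaluating the right-hand side of assertion (ii) of the theorem with B = n. The setting below is an assumption chosen for illustration (K = 2 samples of size n/2, degrees 1, so that kappa = n/2 and the index set has n²/4 elements), as are the values V = 3 and delta = 0.05 and the function name `deviation_bound`.

```python
import math

def deviation_bound(V, kappa, card_lambda, B, delta):
    """Right-hand side of assertion (ii), expressed in units of M_H."""
    return (
        2.0 * math.sqrt(2.0 * V * math.log(1.0 + kappa) / kappa)
        + math.sqrt(math.log(2.0 / delta) / kappa)
        + math.sqrt((V * math.log(1.0 + card_lambda) + math.log(4.0 / delta)) / B)
    )

V, delta = 3, 0.05
ratios = []
bounds = []
for n in (10**3, 10**4, 10**5, 10**6):
    kappa = n // 2             # min over k of floor(n_k / d_k), with n_k = n / 2
    card_lambda = (n // 2) ** 2  # number of index pairs
    bound = deviation_bound(V, kappa, card_lambda, B=n, delta=delta)
    bounds.append(bound)
    ratios.append(bound / math.sqrt(math.log(n) / n))
print(ratios)
```

The ratio of the bound to $\sqrt{\log n / n}$ stays roughly constant as n grows, which is consistent with the stated rate.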

(31)

Example: Ranking

Empirical ranking performance for SVMrank based on 1%, 5%, 10%,

(32)

Sketch of Proof

• Set $\epsilon = ((\epsilon_k(I))_{I \in \Lambda})_{1 \le k \le B}$, where $\epsilon_k(I)$ is equal to 1 if the tuple $I = (I_1, \ldots, I_K)$ has been selected at the $k$-th draw and to 0 otherwise

• The $\epsilon_k$'s are i.i.d. random vectors

• For all $(k, I) \in \{1, \ldots, B\} \times \Lambda$, the r.v. $\epsilon_k(I)$ has a Bernoulli distribution with parameter $1/\#\Lambda$

• With these notations, $\tilde U_B(H) - U_n(H) = \frac{1}{B} \sum_{k=1}^{B} Z_k(H)$, where $Z_k(H) = \sum_{I \in \Lambda} (\epsilon_k(I) - 1/\#\Lambda) H(X_I)$

• Freezing the $X_I$'s, by virtue of Sauer's lemma:

(33)

Sketch of Proof (continued)

• Conditioned upon the $X_I$'s, $Z_1(H), \ldots, Z_B(H)$ are independent

• The first assertion is thus obtained by applying Hoeffding's inequality combined with the union bound

• Set

$V_H(X_1^{(1)}, \ldots, X_{n_1}^{(1)}, \ldots, X_1^{(K)}, \ldots, X_{n_K}^{(K)}) = \frac{1}{\kappa} \Big[ H(X_1^{(1)}, \ldots, X_{d_1}^{(1)}, \ldots, X_1^{(K)}, \ldots, X_{d_K}^{(K)}) + H(X_{d_1+1}^{(1)}, \ldots, X_{2 d_1}^{(1)}, \ldots, X_{d_K+1}^{(K)}, \ldots, X_{2 d_K}^{(K)}) + \ldots + H(X_{\kappa d_1 - d_1 + 1}^{(1)}, \ldots, X_{\kappa d_1}^{(1)}, \ldots, X_{\kappa d_K - d_K + 1}^{(K)}, \ldots, X_{\kappa d_K}^{(K)}) \Big],$

(34)

Sketch of Proof (continued)

• The proof of the second assertion is based on the Hoeffding decomposition

$U_n(H) = \frac{1}{n_1! \cdots n_K!} \sum_{\sigma_1 \in \mathfrak{S}_{n_1}, \ldots, \sigma_K \in \mathfrak{S}_{n_K}} V_H\left(X_{\sigma_1(1)}^{(1)}, \ldots, X_{\sigma_K(n_K)}^{(K)}\right),$

• The concentration result is then obtained in a classical manner:
  • Convexity (Chernoff's bound)
  • Symmetrization
  • Randomization

(35)

Beyond finite VC dimension

• Challenge: develop probabilistic tools and complexity assumptions to investigate the concentration properties of collections of sums of weighted binomials

$\tilde U_B(H) - U_n(H) = \frac{1}{B} \sum_{k=1}^{B} Z_k(H)$, with $Z_k(H) = \sum_{I \in \Lambda} (\epsilon_k(I) - 1/\#\Lambda) H(X_I)$

(36)

Some references

• Maximal Deviations of Incomplete U-statistics with Applications to Empirical Risk Sampling. S. Clémençon, S. Robbiano and J. Tressou (2013). In the Proceedings of the SIAM International Conference on Data-Mining, Austin (USA).

• Empirical processes in survey sampling. P. Bertail, E. Chautru and S. Clémençon (2013). Submitted.

• A statistical view of clustering performance through the theory of U-processes. S. Clémençon (2014). In Journal of Multivariate Analysis.

• On Survey Sampling and Empirical Risk Minimization. P. Bertail, E. Chautru and S. Clémençon (2014). ISAIM 2014, Fort Lauderdale (USA).

(37)

Introduction

Investigate the binary classification problem in a statistical learning context

• Data not stored in a central unit but processed by independent agents (processors)

• Aim: not to find a consensus on a common classifier but to find how to combine the local ones efficiently

• Solution: implement in an on-line and distributed manner

(38)

Outline

Background

Proposed algorithm

Theoretical results

Improvement of agents selection

Numerical experiments

(40)

Learning problem

$X \in \mathcal{X} \subset \mathbb{R}^n$ (r.v. observation) $\to$ sign(H(X)) $\to$ $\hat Y \in \{-1, +1\}$ (r.v. binary output)

Given a training dataset $(X, Y) = (X_i, Y_i)_{i=1,\ldots,n}$ in a high dimension $n$ and with unknown joint distribution...

...find the best prediction rule $\mathrm{sign}(H^*)$, where the classifier function

$H^* = \arg\min_H P_e(H)$, with $P_e(H) = P[-Y H(X) > 0] = E[\mathbb{1}_{\{-Y H(X) > 0\}}]$,

minimizes the probability of error $P_e$

Warning: the indicator $\mathbb{1}_{\{x > 0\}}$ is not a differentiable function!

(41)

Learning problem

Majorize $E[\mathbb{1}_{\{-Y H(X) > 0\}}]$ by a convex function (convex surrogate): $E[\mathbb{1}_{\{-Y H(X) > 0\}}] \le E[\varphi(-Y H(X))]$

How? Use a cost function with appropriate properties

Example: use the quadratic function $\varphi(u) = \frac{(u+1)^2}{2} : \mathbb{R} \to [0, +\infty)$

(42)

Learning problem

$X \in \mathcal{X} \subset \mathbb{R}^n$ (r.v. observation) $\to$ sign(H(X)) $\to$ $\hat Y \in \{-1, +1\}$ (r.v. binary output)

Given a training dataset $(X, Y) = (X_i, Y_i)_{i=1,\ldots,n}$ in a high dimension $n$ and with unknown joint distribution...

...find the best prediction rule $\mathrm{sign}(H^*)$, where the classifier function

$H^* = \arg\min_H R_\varphi(H)$, with $R_\varphi(H) = E[\varphi(-Y H(X))]$,

minimizes the risk function $R_\varphi(H)$

Remark: when $\varphi(u) = \frac{(u+1)^2}{2}$, $\mathrm{sign}(H^*)$ coincides with the Bayes classifier!
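A small numerical check of the quadratic surrogate, under assumptions stated here: the grid search, the value eta = 0.8 and the helper names are illustrative only. Two facts are verified. First, at a point x where P(Y = +1 | X = x) = eta, the conditional surrogate risk is minimized at h = 2 eta - 1, whose sign is the Bayes decision. Second, with the 1/2 normalization the indicator is dominated by the surrogate only up to a factor 2 (that is, 1{u > 0} <= (u + 1)^2 = 2 phi(u), while phi(0.1) is about 0.6); this constant does not change the minimizer.

```python
def zero_one(u):
    """Indicator 1{u > 0} appearing in the probability of error."""
    return 1.0 if u > 0 else 0.0

def phi(u):
    """Quadratic surrogate phi(u) = (u + 1)^2 / 2."""
    return (u + 1.0) ** 2 / 2.0

def surrogate_risk_at_x(h, eta):
    """E[phi(-Y h) | X = x] when P(Y = +1 | X = x) = eta."""
    return eta * phi(-h) + (1.0 - eta) * phi(h)

eta = 0.8
grid = [k / 1000.0 for k in range(-1000, 1001)]
h_star = min(grid, key=lambda h: surrogate_risk_at_x(h, eta))

us = [k / 100.0 for k in range(-300, 301)]
dominated_up_to_2 = all(zero_one(u) <= 2.0 * phi(u) for u in us)
print(h_star, dominated_up_to_2, phi(0.1))
```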

(43)

Aggregation of local classifiers

Consider a classification device composed of a set $V$ of $N$ connected agents

Each agent $v \in V$:

• disposes of $\{(X_{1,v}, Y_{1,v}), \ldots, (X_{n_v,v}, Y_{n_v,v})\}$, i.e. $n_v$ independent copies of $(X, Y)$

• selects a local soft classifier function from a parametric class $\{h_v(\cdot, \theta_v)\}$

Set $\theta_v = (\alpha_v, \beta_v)$; the global soft classifier is: $H(x, \boldsymbol{\theta}) = \sum_{v \in V} h_v(x, \theta_v)$

where: $h_v(x, \theta_v) = \alpha_v h_v(x, \beta_v)$ and $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_N)^T$

(44)

Problem statement

The problem can be summarized as follows:

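• given an observation $X$

• obtain the best estimated label $\hat Y$ as $\mathrm{sign}(H(X, \boldsymbol{\theta}))$

• where $\boldsymbol{\theta}$ is computed from the optimization problem using the training data $(X, Y) = (X_i, Y_i)_{i=1,\ldots,n}$ as:

$\min_{\boldsymbol{\theta} \in \Theta} R_\varphi(H(X, \boldsymbol{\theta}))$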

(45)

Problem statement

Approaches

1. Agreement on a common decision rule [Tsitsiklis-84, Agarwal-10]: consensus approach

   • find an average consensus solution: $\boldsymbol{\theta} = (\theta, \ldots, \theta)$
   • each agent uses the global classifier $H(X, \boldsymbol{\theta})$

2. Mixture of experts: cooperative approach

   • find the best aggregation solution: $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_N)$
   • each agent uses its local classifier $h_v(x, \theta_v)$

Example: set $\beta_v = 0$, $\alpha_v \ge 0$ and $h_v : \mathcal{X} \to \{-1, +1\}$, the weak classifier:

$h_v(x, \theta_v) = \alpha_v h_v(x)$

(48)

High rate distributed learning

Solve the minimization problem of the parametric risk function: $\min_{\boldsymbol{\theta} \in \Theta} R_\varphi(H(X, \boldsymbol{\theta}))$

(49)

High rate distributed learning

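A standard distributed gradient descent iterative approach:

• generates a vector sequence of estimated parameters $(\boldsymbol{\theta}_t)_{t \ge 1} = (\theta_{t,1}, \ldots, \theta_{t,N})_{t \ge 1}$

• at each agent $v$ the update step writes:

$\theta_{t+1,v} = \theta_{t,v} + \gamma_t\, E[Y \nabla_v h_v(X, \theta_{t,v})\, \varphi'(-Y H(X, \boldsymbol{\theta}_t))]$

Warning: the expectation cannot be computed, since the joint distribution is unknown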

(50)

High rate distributed learning

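A standard distributed and on-line gradient descent iterative approach:

• generate a vector sequence of estimated parameters $(\boldsymbol{\theta}_t)_{t \ge 1} = (\theta_{t,1}, \ldots, \theta_{t,N})_{t \ge 1}$

• each agent $v$ observes a pair $(X_{t+1,v}, Y_{t+1,v})$

• at each agent $v$ the update step writes (the expectation is replaced by its empirical version):

$\theta_{t+1,v} = \theta_{t,v} + \gamma_t\, Y_{t+1,v} \nabla_v h_v(X_{t+1,v}, \theta_{t,v})\, \varphi'(-Y_{t+1,v} H(X_{t+1,v}, \boldsymbol{\theta}_t))$

Warning: evaluating $H(X_{t+1,v}, \boldsymbol{\theta}_t)$ is required at each $t$ and $v$!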
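The on-line update with full communication can be sketched on a toy problem. Several choices below are assumptions rather than the setting of the talk: three agents, local classifiers of the form h_v(x, theta_v) = theta_v * f_v(x) with hypothetical fixed feature maps f_v, labels given by the sign of x_0 + x_1, the quadratic cost with derivative u + 1, and the step size 1 / (t + 10).

```python
import random

def features(x):
    # Hypothetical local maps f_v: agent v uses h_v(x, theta_v) = theta_v * f_v(x).
    return [1.0, x[0], x[1]]

def global_h(x, theta):
    """Global soft classifier H(x, theta) = sum over agents of h_v(x, theta_v)."""
    return sum(th * f for th, f in zip(theta, features(x)))

def phi_prime(u):
    """Derivative of the quadratic cost phi(u) = (u + 1)^2 / 2."""
    return u + 1.0

def draw(rng):
    """One observation pair (X, Y) from the toy distribution."""
    x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
    y = 1 if x[0] + x[1] > 0 else -1
    return x, y

def online_step(theta, observations, gamma):
    """One iteration: agent v updates theta_v from its own pair (X_v, Y_v)."""
    new_theta = list(theta)
    for v, (x, y) in enumerate(observations):
        h_glob = global_h(x, theta)   # needs h_w(x, theta_w) from every agent w
        grad_v = features(x)[v]       # gradient of h_v with respect to theta_v
        new_theta[v] = theta[v] + gamma * y * grad_v * phi_prime(-y * h_glob)
    return new_theta

rng = random.Random(0)
theta = [0.0, 0.0, 0.0]
for t in range(3000):
    obs = [draw(rng) for _ in theta]  # each agent observes its own pair
    theta = online_step(theta, obs, gamma=1.0 / (t + 10))

test_set = [draw(rng) for _ in range(2000)]
accuracy = sum((1 if global_h(x, theta) > 0 else -1) == y for x, y in test_set) / len(test_set)
messages_per_iteration = len(theta) * (len(theta) - 1)  # N (N - 1) exchanges
print(theta, accuracy, messages_per_iteration)
```

Each call to `global_h` inside `online_step` stands for a round of messages to all other agents, which is the cost the sparse scheme aims to reduce.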

(51)

Example

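At iteration $t$, each agent $v \in V$ has $(X_{t,v}, \theta_{t,v})$...

[Figure: network of four nodes 1 to 4; node $v$ holds $(X_{t,v}, \theta_{t,v})$ and node 1 evaluates $h_1(X_{t,1}, \theta_{t,1})$]

...and evaluates its local $h_v(X_{t,v}, \theta_{t,v})$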

(52)

High rate distributed learning

Example

Each node $v$ sends its observation $X_{t,v}$ to all the other nodes...

[Figure: node 1 holds $(X_{t,1}, \theta_{t,1})$ and sends $X_{t,1}$ to nodes 2, 3 and 4]

(54)

Example

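Each node $v$ obtains the evaluation of $h_w(X_{t,v}, \theta_{t,w})$ from all the other nodes...

[Figure: node 1 holds $h_1(X_{t,1}, \theta_{t,1})$ and receives $\{h_2(X_{t,1}, \theta_{t,2}), h_3(X_{t,1}, \theta_{t,3}), h_4(X_{t,1}, \theta_{t,4})\}$ from nodes 2, 3 and 4]

...and computes the global: $H(X_{t,1}, \boldsymbol{\theta}_t) = \sum_{w=1}^{4} h_w(X_{t,1}, \theta_{t,w})$

Warning: $N(N-1)$ communications per iteration; $N = 4 \to 12$!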

(55)

Proposed distributed learning : OLGA algorithm

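Replace the global $H(X_{t+1,v}, \boldsymbol{\theta}_t)$ by a local estimate $\hat Y^{(V)}_{t+1,v}$ at each $v \in V$ such that:

$E[\hat Y^{(V)}_{t+1,v} \mid X_{t+1,v}, \boldsymbol{\theta}_t] = H(X_{t+1,v}, \boldsymbol{\theta}_t)$

How? Sparse communications with sparsity ratio $p$... On-line Learning Gossip Algorithm (OLGA)

...for each $v \in V$ at time $t$, the local gradient descent update writes:

$\theta_{t+1,v} = \theta_{t,v} + \gamma_t\, Y_{t+1,v} \nabla_v h_v(X_{t+1,v}, \theta_{t,v})\, \varphi'(-Y_{t+1,v} \hat Y^{(V)}_{t+1,v})$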

Proposed distributed learning : OLGA algorithm

Example

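At iteration $t$, each agent $v \in V$ has $(X_{t,v}, \theta_{t,v})$...

[Figure: network of four nodes 1 to 4; node $v$ holds $(X_{t,v}, \theta_{t,v})$ and node 1 evaluates $h_1(X_{t,1}, \theta_{t,1})$]

...and evaluates its local $h_v(X_{t,v}, \theta_{t,v})$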

(57)

Proposed distributed learning : OLGA algorithm

Example

Each node $v$ sends its observation $X_{t,v}$ to randomly selected nodes with probability $p = \frac{1}{3}$...

[Figure: node 1 holds $(X_{t,1}, \theta_{t,1})$ and sends $X_{t,1}$ to node 4 only]

(59)

Example

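Each node $v$ obtains the evaluation of $h_w(X_{t,v}, \theta_{t,w})$ from the chosen nodes...

[Figure: node 1 holds $h_1(X_{t,1}, \theta_{t,1})$ and receives $\{h_4(X_{t,1}, \theta_{t,4})\}$ from node 4]

...and computes its local estimate: $\hat Y^{(V)}_{t,1} = h_1(X_{t,1}, \theta_{t,1}) + \frac{1}{p} h_4(X_{t,1}, \theta_{t,4})$

Warning: $p N (N-1)$ communications per iteration; $N = 4$, $p = \frac{1}{3} \to 4$ (reduction of 67%)!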
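The sparse estimate can be checked by simulation. In the sketch below, each other node is contacted independently with probability p and its output is reweighted by 1/p, which matches the example of the slide (own output plus 1/p times the output of node 4). The output values, the number of draws and the function name `gossip_estimate` are hypothetical.

```python
import random

def gossip_estimate(v, outputs, p, rng):
    """Local estimate at node v: own output plus (1/p) times the outputs of
    the other nodes, each contacted independently with probability p."""
    est = outputs[v]
    n_msgs = 0
    for w, h_w in enumerate(outputs):
        if w != v and rng.random() < p:
            est += h_w / p
            n_msgs += 1
    return est, n_msgs

rng = random.Random(0)
outputs = [0.5, -0.2, 0.3, 0.8]   # hypothetical values of h_w(X_{t,1}, theta_{t,w})
p = 1 / 3
draws = [gossip_estimate(0, outputs, p, rng) for _ in range(100000)]

mean_estimate = sum(e for e, _ in draws) / len(draws)
mean_msgs = sum(m for _, m in draws) / len(draws)
global_value = sum(outputs)       # H(X_{t,1}, theta_t) with full communication
expected_total_msgs = p * len(outputs) * (len(outputs) - 1)
print(mean_estimate, global_value, mean_msgs, expected_total_msgs)
```

On average the estimate equals the global value (unbiasedness), at the price of extra variance, while node 1 contacts one node per iteration on average instead of three.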

(60)

Performance analysis

What is the effect of sparsification?...

...study the behaviour of the vector sequence $\boldsymbol{\theta}_t$ as $t \to \infty$:

• the consistency of the final solution given by the algorithm

• quantify the excess error variance due to the sparsity

(62)

Asymptotic behaviour of OLGA

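Under suitable assumptions, we prove the following results:

1. Consistency: $(\boldsymbol{\theta}_t)_{t \ge 1} \xrightarrow{a.s.} \boldsymbol{\theta}^* \in \mathcal{L} = \{\nabla R_\varphi(\boldsymbol{\theta}) = 0\}$

2. CLT: conditionally on the event $\{\lim_{t \to \infty} \boldsymbol{\theta}_t = \boldsymbol{\theta}^*\}$,

$\frac{1}{\sqrt{\gamma_t}} (\boldsymbol{\theta}_t - \boldsymbol{\theta}^*) \xrightarrow{\mathcal{L}} \mathcal{N}(0, \Sigma(\Gamma^*))$

where:

$\Gamma^* = E[(H(X, \boldsymbol{\theta}^*) - Y)^2\, \nabla_v h_v(X, \theta_v^*) \nabla_v^T h_v(X, \theta_v^*)] + \frac{1-p}{p} \sum_{w \neq v} E[h_w(X, \theta_w^*)^2\, \nabla_v h_v(X, \theta_v^*) \nabla_v^T h_v(X, \theta_v^*)]$

The first term is the estimation error in a centralized case; the second is the additional noise term induced by the distributed setting.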

(64)

A best agents selection approach

When...

• the number of agents $N$ increases $\to$ difficult to implement

• agents are redundant $\to$ avoid similar outputs

... include distributed agent selection!

How? Add an $\ell_1$-penalization term with tuning parameter $\lambda$:

$\min_{\boldsymbol{\theta} \in \Theta} R_\varphi(H(X, \boldsymbol{\theta})) + \lambda \sum_{v} |\alpha_v|$

where:

• the weight $\alpha_v = 0$ for an idle agent and $\alpha_v > 0$ when it is active

Including best agents selection to OLGA algorithm

Introduce an update step at each time $t$ of OLGA to seek the time-varying set of active nodes $S_t \subset V$

(66)

Including best agents selection to OLGA algorithm

The extended algorithm is summarized as follows, at timet:

1. obtain the active nodes $S_t$ from the sequence of updated weights $(\alpha_{t,1}, \ldots, \alpha_{t,N})$

2. apply OLGA to the set of active agents $v \in S_t$:

   i) estimate the local $\hat Y^{(S_t)}_{t+1,v}$ from a random selection among the current active nodes

   ii) update by local gradient descent:

   $\theta_{t+1,v} = \theta_{t,v} + \gamma_t\, Y_{t+1,v} \nabla_v h_v(X_{t+1,v}, \theta_{t,v})\, \varphi'(-Y_{t+1,v} \hat Y^{(S_t)}_{t+1,v})$

(68)

Example with simulated data

Binary classification of (+) and (o) data samples with $N = 60$ agents using weak linear classifiers (-). When using distributed selection, it reduces to 25 active classifiers.

[Figure: two scatter-plot panels, both axes ranging from -6 to 6: (a) OLGA; (b) OLGA with distributed selection]

(69)

Comparison with real data

Binary classification of the available benchmark dataset banana using weak linear classifiers when increasing $N$.

[Figure: error rate (from 0.2 to 0.55) versus number of weak learners (from 5 to 35); curves: OLGA (p = 0.6) and GentleBoost]

Figure: Comparison between a centralized and sequential approach (GentleBoost) and our distributed and on-line algorithm (OLGA).

(70)

Conclusions

• A fully distributed and on-line algorithm is proposed for binary classification of big datasets processed by $N$ processors

  The algorithm is then adapted to select useful classifiers $\to$ $N$ decreases

• We obtain theoretical results from the asymptotic analysis of the sequence estimated by OLGA

• Numerical results show a behaviour comparable to that of a centralized, batch and sequential approach (GentleBoost)