A Randomized Gossip-based Algorithm for Classification on Peer-to-Peer Networks

Haimonti Dutta

The Center for Computational Learning Systems (CCLS) Columbia University, New York, NY 10115.

Abstract

Peer-to-Peer (P2P) networks are distributed systems in which nodes of equal roles and capabilities exchange information and services directly with one another. In recent years, they have become a popular way to share large amounts of data. Such architectures, however, complicate the process of knowledge discovery and data mining since algorithms must deal with distributed (and often dynamic) sources of data and computing. In this paper, we present a distributed algorithm for learning linear classifiers in P2P networks. The problem is posed as a linear program such that each peer has its own constraints, but needs to solve a global objective function. A randomized gossip-based approximate algorithm is presented which reduces communication cost in the network significantly while ensuring convergence of the algorithm.

1 Introduction

Peer-to-Peer (P2P) systems such as Gnutella [13] and Kazaa [14] are likely to play a very important role in the next generation of data-driven collaborative systems [12]. Their popularity stems from the fact that they can function, scale and self-organize in the presence of a highly transient population of nodes without the overhead of a central server for coordination. In addition, they are collectively capable of storing a huge amount of data of different modalities. Extracting patterns from this data is a challenging task. Most of the data is inherently distributed, and merging remote data at a central site to perform data mining will result in unnecessary communication overhead. An alternative approach is to mine data locally on individual peers and combine the resulting models or results, which can provide a more cost-effective solution.

In this paper, we present a decentralized, approximate algorithm for supervised learning on a P2P network. Classification is posed as a linear program – each peer has local constraints and needs to solve a global objective function. The goal is to solve the optimization without collecting the constraints from all the neighbors at a central site. This work builds on prior research ([15], [16], [17], [18]) where it has been shown that distributed linear programming can be used effectively for resource management in Peer-to-Peer networks.

2 Classifier Design by Linear Programming

Let $D$ be a data set with $P$ instances. Each instance (denoted by $x$) contains a vector of $N$ features and a categorical target variable (denoted by $y$). It is assumed that there exists an underlying function $f$ such that $y = f(x)$ for each instance $(x, y)$ in the training set. The goal of a supervised learning algorithm is to find an approximation $H$ of $f$ that can be used to predict values of unseen instances in the test set. In order to find an approximation, one possibility is to obtain a weighting of the feature vector which is positive for patterns in the positive class and negative for patterns in the other class (assuming a binary classification problem) [2]. Thus, if the $k$th instance is represented by $x_k = [x_{1k}, x_{2k}, \cdots, x_{Nk}]$, $k = 1, 2, \cdots, P$, and $W$ is the weight vector of $N$ weights represented by $W = [w_1, w_2, \cdots, w_N]^T$, then we are interested in a $\hat{W}$ such that

$$x_k \hat{W} \geq 0 \qquad (1)$$

for all instances in the positive category and negative for instances in the other class. Let $e_k$ represent the error associated with instance $x_k$ and $\pi_k$ represent the weighting coefficient. Then the total error over $P$ instances can be expressed as:

$$\hat{e} = \sum_{k=1}^{P} \pi_k e_k \qquad (2)$$

and the error function for a given instance is obtained by estimating $e_k = -(x_k W - d)$ when $x_k W < d$, and $e_k = 0$ when $x_k W \geq d$, where the scalar $d$ represents a scale factor and is taken as unity throughout the formulation in this paper. In order to find the desired $\hat{W}$, the error function $\hat{e}$ should be minimized. Smith [2] shows that the above can be formulated as a linear programming problem, with the constraint matrix as follows:

$$\chi W + E = \delta + S \qquad (3)$$

where $\chi = [x_1, x_2, \cdots, x_P]^T$, $E = (e_1, e_2, \cdots, e_P)^T$ is the error vector, $\delta = (d, d, \cdots, d)^T$ where $T$ represents the transpose, and $S = (s_1, s_2, \cdots, s_P)$ is the vector of slack variables which allow the inequalities to be made equalities. The objective function of the linear program can be written as:

$$\hat{e} = \Pi^T E \qquad (4)$$

where $\Pi = (\pi_1, \pi_2, \cdots, \pi_P)$. To minimize $\hat{e}$, it is convenient to transform the minimization to an equivalent maximization problem: $z = \Pi^T \delta - \Pi^T E$, which, after substituting $E = \delta + S - \chi W$ from Equation (3), can be further simplified as $z = \Pi^T \chi W - \Pi^T S$. The above linear program can be solved by using the Simplex Algorithm¹ [3]. Note that in this paper we are primarily concerned with problems that are linearly separable. Extension of the distributed algorithm for quadratic programming problems is left for future research. If this linear program needs to be solved in a distributed peer-to-peer environment, it is not worthwhile to transfer all the constraints to a central site and run the optimization, since the communication cost incurred may be substantially high. In the following section, we present a distributed simplex algorithm that can be run on each node, with the results updated based on information from neighbors.
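The relationship between the per-instance errors, the slack variables and the two forms of the objective can be checked numerically. The following is a minimal sketch, not the author's implementation: the data, the weight vector and the helper names (`errors_and_slacks`, `dot`) are hypothetical, and every row of `X` is assumed to be an instance that should satisfy $x_k W \geq d$.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def errors_and_slacks(X, W, d=1.0):
    """Return (E, S) such that x_k.W + e_k = d + s_k with e_k, s_k >= 0."""
    E, S = [], []
    for x in X:
        score = dot(x, W)
        if score < d:
            E.append(d - score)  # e_k = -(x_k W - d) when x_k W < d
            S.append(0.0)
        else:
            E.append(0.0)        # no error when x_k W >= d
            S.append(score - d)  # slack turns the inequality into an equality
    return E, S


# Hypothetical toy data: three instances with two features each.
X = [[2.0, 1.0], [0.5, 0.25], [1.0, 3.0]]
W = [0.5, 0.5]
PI = [1.0, 1.0, 1.0]  # weighting coefficients pi_k
d = 1.0               # scale factor, taken as unity as in the text

E, S = errors_and_slacks(X, W, d)
total_error = dot(PI, E)                            # Equation (2) / (4)
z_from_errors = dot(PI, [d] * len(X)) - dot(PI, E)  # z = Pi^T delta - Pi^T E
z_from_slacks = dot(PI, [dot(x, W) for x in X]) - dot(PI, S)  # z = Pi^T chi W - Pi^T S
```

For this toy input only the second instance falls short of $d$, and both expressions for $z$ agree, as the substitution from Equation (3) requires.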

3 The Randomized Gossip-based Linear Programming Algorithm

In a Peer-to-Peer network, each node has its dataset from which local constraint matrices are constructed as described in Section 2. However, solving for the global objective function based on local constraints does not ensure that constraints at other nodes are satisfied. Thus nodes need to communicate with one another to ensure that a global optimum is reached. This is achieved by the use of a randomized gossip-based protocol. The distributed linear classification algorithm has two main steps: (1) a pre-processing step for obtaining the canonical representation of a linear system and (2) obtaining the solution to the objective function.

3.1 Distributed Canonical Representation of the Linear System

An important pre-processing step for the distributed linear programming algorithm involves developing the canonical representation. Each node needs to know how many basic variables [3] should be added. A convergecast-based approach is developed to solve this problem. Let $s$ be an initiator node which builds a minimum spanning tree on all nodes in the network. A message is sent by $s$ to all its neighbors asking how many local constraints each node has. A neighbor, on receiving this message, either forwards it to its neighbors (if there are any) or sends back a reply. At the end of this procedure, Node $s$ has the correct value of the total number of constraints in the system, $T_c$.

¹ We refrain from giving an extensive description of the simplex algorithm due to space restrictions.

Next, Node $s$ sets an internal variable count-Constraint to the number of its local constraints. It traverses the minimum spanning tree and informs each node visited of the number of constraints seen so far. Let $T$ represent the value of count-Constraint at node $i$. Then node $i$ must add $T_c$ basic variables to each of its constraints. At the end of this procedure, all nodes have added the relevant basic variables. Note that this procedure creates exactly the same canonical form as would have been obtained if all the constraints were centralized on a single machine. It must be noted that the Distributed Canonical Representation algorithm needs to be run only once at the time of initialization. Thereafter, each node just updates its tableau [3] depending on the pivots chosen at the corresponding iteration.
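The two passes described above (a convergecast that totals the constraints, followed by a traversal that tells each node how many constraints were seen before it) can be sketched as follows. This is an illustrative reading rather than the author's code: the tree, the constraint counts and the function names are hypothetical, and the interpretation of count-Constraint as a column offset for each node's basic variables is an assumption.

```python
def subtree_counts(tree, local, node, parent=None, out=None):
    """Convergecast: each node reports the constraint count of its subtree."""
    if out is None:
        out = {}
    total = local[node]
    for child in tree[node]:
        if child != parent:
            subtree_counts(tree, local, child, node, out)
            total += out[child]
    out[node] = total
    return out


def assign_offsets(tree, local, root):
    """Walk the tree from the root, recording constraints seen before each node."""
    offsets, seen = {}, 0
    stack = [(root, None)]
    while stack:
        node, parent = stack.pop()
        offsets[node] = seen
        seen += local[node]
        for child in sorted(tree[node], reverse=True):
            if child != parent:
                stack.append((child, node))
    return offsets, seen


def slack_block(offset, n_local, total):
    """Basic-variable columns for one node (assumed layout: 1 at offset + row)."""
    return [[1 if col == offset + r else 0 for col in range(total)]
            for r in range(n_local)]


# Hypothetical spanning tree rooted at the initiator "s", with local counts.
tree = {"s": ["a", "b"], "a": ["s", "c"], "b": ["s"], "c": ["a"]}
local = {"s": 2, "a": 3, "b": 1, "c": 4}

T_c = subtree_counts(tree, local, "s")["s"]       # total constraints known at s
offsets, seen = assign_offsets(tree, local, "s")  # per-node column offsets
```

Under this layout, stacking the blocks of all nodes in offset order yields a $T_c \times T_c$ identity, which is the same basic-variable structure a centralized tableau would have.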

3.2 Notation and Preliminaries

Let $P_1, P_2, \cdots, P_\eta$ be a set of nodes connected to one another via an underlying communication tree such that each node $P_i$ knows its neighbors $N_i$. Let $|V| = \eta$ represent the total number of nodes, $|E| = m$ the total number of edges, $d_i$ the degree of node $P_i$ $(1 \leq i \leq \eta)$, and $d_{max}$ the maximum degree of all nodes in the network. Assume that $T = [t_{ij}]$ represents the transition probability matrix, where $0 \leq t_{ij} \leq 1$ is the probability of moving from node $P_i$ to node $P_j$. If there is no direct link between nodes $P_i$ and $P_j$, then $t_{ij} = 0$, and the sum of probabilities for a node is one, i.e. $\sum_{j=1}^{\eta} t_{ij} = 1$ for a node $P_i$. Note that each node $P_i$ can be thought of as representing the state $X_t$ at discrete time $t$ in a finite space. A random walk on the network induces a finite stochastic process which can be modeled by a Markov Chain as follows:

$$p_{ij} = Pr(X_{t+1} = j \mid X_0 = i_0, X_1 = i_1, \cdots, X_{t-1} = i_{t-1}, X_t = i) = Pr(X_{t+1} = j \mid X_t = i) \qquad (5)$$

This implies that in each state, the probability of moving from state $i$ to state $j$ only depends on node $i$ and is independent of $t$. This is well-known as the memoryless Markov property.

Each node $P_i$ has its own local constraints which may change from time to time depending on the resources available at that node. The constraints at node $i$ have the form $A_i X_i = b_i$ where $A_i$ represents an $m \times n$ matrix, $X_i$ is an $n \times 1$ vector and $b_i$ is an $m \times 1$ vector. Thus at each node, we are interested in solving the following linear programming problem: Find $X_i \geq 0$ and Min $z_i$ satisfying $c_1 x_1 + c_2 x_2 + \cdots + c_n x_n = z_i$ subject to the constraints $A_i X_i = b_i$. The global linear program (if all the constraint matrices could be centralized) can be written as follows: Find $X \geq 0$ and Min $z$ satisfying $c_1 x_1 + c_2 x_2 + \cdots + c_n x_n = z$ subject to constraints $AX = B$ where $A = \bigcup_{i=1}^{\eta} A_i$ and $B = \bigcup_{i=1}^{\eta} b_i$.

3.3 The Algorithm

At the beginning of iteration $l$, a node $P_i$ has its own constraint matrix and the objective function. The column pivot, henceforth referred to as $\text{col-pivot}_i$, is that column of the tableau corresponding to the most negative indicator of $c_1, c_2, \cdots, c_n$. Following this, each node forms the row ratios ($r^i_j$, $1 \leq j \leq m$) for each row, i.e. it divides $b^i_j$, $1 \leq j \leq m$, by the corresponding number in the pivot column of that row. Let the minimum of the $r^i_j$'s be denoted by $\text{row-pivot}_i$. This is stored in the history table of node $P_i$ corresponding to iteration $l$. Now the node must participate in the randomized algorithm for determination of the approximate minimum row ratio, i.e.

$$\text{Minimum}(\text{row-pivot}_i), \quad i \in N_i. \qquad (6)$$
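The local part of this step, choosing the pivot column and forming the row ratios, can be sketched as below. The function names and the small tableau are hypothetical. Restricting the ratio test to rows with a positive pivot-column entry is the standard simplex rule and is an assumption here, since the text only says that $b^i_j$ is divided by the corresponding pivot-column entry.

```python
from fractions import Fraction


def choose_col_pivot(indicators):
    """Index of the most negative indicator, or None if none is negative."""
    col = min(range(len(indicators)), key=lambda j: indicators[j])
    return col if indicators[col] < 0 else None


def local_row_pivot(A, b, col):
    """Smallest ratio b_j / A[j][col] over rows with a positive pivot entry.

    Returns (ratio, row_index), or None when no row qualifies.
    """
    best = None
    for j, row in enumerate(A):
        if row[col] > 0:
            ratio = Fraction(b[j]) / Fraction(row[col])
            if best is None or ratio < best[0]:
                best = (ratio, j)
    return best


# Hypothetical local tableau at one node: two constraints, four columns.
indicators = [-3, -5, 0, 0]
A = [[1, 2, 1, 0],
     [3, 1, 0, 1]]
b = [8, 9]

col_pivot = choose_col_pivot(indicators)   # column 1 has the most negative indicator
row_pivot = local_row_pivot(A, b, col_pivot)
```

Exact rational arithmetic is used so that ties between ratios at different nodes are not broken by floating-point noise. The value `row_pivot[0]` is what the node would contribute to the network-wide minimum of Equation (6).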

Random Walk and Markov Chains: In order to obtain the approximate minimum row ratio in the network for a particular iteration of the simplex algorithm, a random walk has to be initiated on the Peer-to-Peer network. If uniform diffusion in the network is assumed, then the probability of moving from node $P_i$ to node $P_j$ is given by $t_{ij} = \frac{1}{d_i}$. This approach has its origin in the distributed load balancing literature [1, 4, 5] and has also been studied further in [10]. However, research [6, 11] has shown that since the degrees of nodes vary widely in a P2P network, the sampling distribution obtained by this approach is non-uniform. To solve this problem, and to achieve an adaptation of uniform sampling in the P2P network, the classical Metropolis-Hastings algorithm [7, 8, 9] was proposed. This algorithm was used to develop the Random Weight Distribution Algorithm [11], which is a distributed algorithm that enables efficient uniform sampling in large unstructured non-uniform networks. We briefly describe this algorithm here, since we adapt it for obtaining the transition matrices at each node.
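The degree bias of the simple walk with $t_{ij} = 1/d_i$ can be seen on a small example. The sketch below is illustrative only: the four-node graph and the function names are hypothetical, and the stationary distribution is approximated by plain power iteration.

```python
def simple_walk_matrix(adj):
    """Transition matrix with t_ij = 1/d_i for each neighbor j of i."""
    n = len(adj)
    return [[1.0 / len(adj[i]) if j in adj[i] else 0.0 for j in range(n)]
            for i in range(n)]


def stationary(T, steps=2000):
    """Approximate the stationary distribution by repeated multiplication."""
    n = len(T)
    dist = [1.0 / n] * n
    for _ in range(steps):
        dist = [sum(dist[i] * T[i][j] for i in range(n)) for j in range(n)]
    return dist


# Hypothetical graph: a triangle (0, 1, 2) with a pendant node 3 attached to 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

pi = stationary(simple_walk_matrix(adj))
degree_share = [len(adj[i]) / 8.0 for i in range(4)]  # d_i / (2 * number of edges)
```

On this graph the walk visits node 2 (degree 3) three times as often as node 3 (degree 1), so sampling by walk endpoint is proportional to degree rather than uniform.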

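In the initialization state, each node sets the local transition probabilities as follows:

$$p_{ij} = \begin{cases} \frac{1}{\rho}, & \text{if } i \neq j \text{ and } j \in N_i, \text{ where } \rho \geq d_{max} \\ 1 - \frac{d_i}{\rho}, & \text{if } i = j \\ 0, & \text{otherwise.} \end{cases} \qquad (7)$$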
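A short sketch of the initialization in Equation (7) follows. The graph and the function name `initial_transition_matrix` are hypothetical, and $\rho$ is assumed to be an integer so that exact fractions can be used. The point of the example is that the resulting matrix is symmetric with unit row sums, so the uniform distribution is stationary for it.

```python
from fractions import Fraction


def initial_transition_matrix(adj, rho=None):
    """Build p_ij as in Equation (7); rho defaults to the maximum degree."""
    n = len(adj)
    d_max = max(len(nbrs) for nbrs in adj.values())
    if rho is None:
        rho = d_max
    if rho < d_max:
        raise ValueError("rho must be at least the maximum degree")
    P = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        for j in adj[i]:
            P[i][j] = Fraction(1, rho)             # 1/rho on every link
        P[i][i] = 1 - Fraction(len(adj[i]), rho)   # leftover mass stays at i
    return P


# Hypothetical graph: a triangle (0, 1, 2) with a pendant node 3 attached to 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
P = initial_transition_matrix(adj)

n = len(adj)
uniform = [Fraction(1, n)] * n
after_one_step = [sum(uniform[i] * P[i][j] for i in range(n)) for j in range(n)]
```

Low-degree nodes keep a large self-transition probability (here node 3 keeps 2/3), which is the mass that the redistribution step described next tries to move onto links.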

After initialization, each node tries to redistribute its local self-transition probability randomly and symmetrically among its neighbors, which are tracked by means of a neighbor list $N_i$. Node $i$ selects a neighbor $j$ with equal probability and sends an INCREASE message to Node $j$. Node $j$, on receiving this message, checks whether its self-transition probability is greater than $\delta$, where $\delta$ is a global parameter set at the beginning of the algorithm (not to be confused with the vector $\delta$ of Section 2). If it is, Node $j$ will decrease its self-transition probability by $\delta$ and increase the transition probability on the link $(j, i)$ by $\delta$. Node $i$ is notified of the success by an ACK message. When Node $i$ receives this message, it lowers its self-transition probability and increases it on the $(i, j)$ link. If instead the self-transition probability at Node $j$ is not greater than $\delta$, it sends a NACK message to Node $i$.

When a node $j$ accepts a message from node $i$, the transition probability matrix is modified. At time $t$, node $i$ selects the node $j$ which has the maximum transition probability, and this will be the state to which node $i$ will now transition. In order to do so, it participates in the minimum selection protocol called Push-Min (Algorithm 1).
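The INCREASE / ACK / NACK exchange can be sketched as a pair of handlers acting on a shared matrix. This is a single-process illustration under stated assumptions, not the protocol implementation: the function names and the three-node path graph are hypothetical, and the check that the sender itself holds more than $\delta$ before asking is an added assumption that keeps all probabilities non-negative.

```python
import random
from fractions import Fraction


def handle_increase(P, i, j, delta):
    """Node j's reaction to an INCREASE message from node i."""
    if P[j][j] > delta:
        P[j][j] -= delta  # j gives up delta of its self-transition mass
        P[j][i] += delta  # and moves it onto the link (j, i)
        return "ACK"
    return "NACK"


def on_ack(P, i, j, delta):
    """Node i mirrors the change on its side of the link after an ACK."""
    P[i][i] -= delta
    P[i][j] += delta


def redistribute(P, adj, i, delta, rng):
    """One attempt by node i to shift delta onto a randomly chosen link."""
    if P[i][i] <= delta:      # assumption: sender needs spare mass itself
        return "SKIP"
    j = rng.choice(adj[i])    # neighbor chosen with equal probability
    reply = handle_increase(P, i, j, delta)
    if reply == "ACK":
        on_ack(P, i, j, delta)
    return reply


# Hypothetical path graph 0 - 1 - 2, initialised as in Equation (7) with rho = 4.
quarter = Fraction(1, 4)
adj = {0: [1], 1: [0, 2], 2: [1]}
P = [[Fraction(3, 4), quarter, Fraction(0)],
     [quarter, Fraction(1, 2), quarter],
     [Fraction(0), quarter, Fraction(3, 4)]]
delta = Fraction(1, 8)
rng = random.Random(0)

first_reply = redistribute(P, adj, 0, delta, rng)  # node 0 can only ask node 1
```

Because both endpoints change the shared link by the same amount, the matrix stays symmetric and every row keeps summing to one, which is what preserves the uniform stationary distribution.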

Algorithm 1 Protocol Push-Min

1. Let $\{\hat{m}^i_t\}$ be the current minimum value seen by node $i$.

2. Let $\{\hat{m}^p_t\}$ be the current minimum value seen by node $p$, where Node $p$ represents the node with highest transition probability among neighbors $N_i$ of $i$.

3. Let $m_{t,i} = \min(\{\hat{m}^i_t\}, \{\hat{m}^p_t\})$.

4. $m_{t,i}$ is the estimate of the minimum at node $i$ and $p$ in step $t$.
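A single-process simulation of Push-Min, read literally from the four steps above, might look as follows. The graph, the starting values and the function names are hypothetical. Breaking ties between equally likely neighbors uniformly at random is an assumption; without some randomness a fixed choice of partner could keep the minimum from spreading.

```python
import random


def pick_partner(i, P, adj, rng):
    """Neighbor of i with the highest transition probability (random tie-break)."""
    best = max(P[i][q] for q in adj[i])
    candidates = [q for q in adj[i] if P[i][q] == best]
    return rng.choice(candidates)


def push_min(values, P, adj, rounds, seed=0):
    """Run gossip rounds; i and its partner p both keep the smaller estimate."""
    rng = random.Random(seed)
    estimates = dict(values)  # copy so the caller's values are untouched
    for _ in range(rounds):
        for i in adj:
            p = pick_partner(i, P, adj, rng)
            m = min(estimates[i], estimates[p])  # step 3: m_{t,i}
            estimates[i] = m                     # step 4: estimate at i and p
            estimates[p] = m
    return estimates


# Hypothetical graph and local row-pivot values (one per node).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
third = 1.0 / 3.0
P = [[third if j in adj[i] else 0.0 for j in range(4)] for i in range(4)]
for i in range(4):
    P[i][i] = 1.0 - len(adj[i]) * third  # self-transition as in Equation (7)

row_pivots = {0: 4.0, 1: 3.5, 2: 9.0, 3: 6.0}
estimates = push_min(row_pivots, P, adj, rounds=200)
```

Estimates never increase and never drop below the true minimum, so on a connected graph they settle on the smallest `row_pivots` value once it has been gossiped to every node.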

Once the Push-Min protocol converges, the node containing the minimum will send the row corresponding to this minimum in the simplex tableau to the neighbor. Next, Node $i$ updates its local tableau with respect to the extra row it received. Finally, the Constraint Sharing Protocol described in Algorithm 2 is performed. Completion of one round of the CS-Protocol ensures that one iteration of the distributed simplex algorithm is over.

Algorithm 2 Constraint Sharing Protocol (CS-Protocol)

1. Node $P_i$ performs protocol Push-Min until termination.

2. On convergence to the approximate minimum, $\text{row-pivot}_i$ is known to Node $P_i$.

3. All the nodes use the row obtained in Step 2 to perform Gauss Jordan elimination on the local tableau.

4. At the end of Step 3, each node locally has the updated tableau and completes the current iteration of the simplex algorithm.
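Step 3 of the CS-Protocol amounts to an ordinary Gauss Jordan pivot in which the pivot row may have come from another node. A minimal sketch under that reading is given below. The function names and the tableau rows are hypothetical, and the last entry of each row is assumed to hold the right-hand side.

```python
from fractions import Fraction


def normalise_pivot_row(row, col):
    """Scale the pivot row so that its entry in the pivot column becomes 1."""
    pivot = Fraction(row[col])
    if pivot == 0:
        raise ValueError("pivot entry must be non-zero")
    return [Fraction(v) / pivot for v in row]


def eliminate(local_rows, pivot_row, col):
    """Zero out column `col` in every local row using the shared pivot row."""
    unit = normalise_pivot_row(pivot_row, col)
    return [[Fraction(v) - Fraction(r[col]) * u for v, u in zip(r, unit)]
            for r in local_rows]


# Hypothetical rows: the pivot row received via gossip, then a local constraint
# row and the local copy of the objective (indicator) row.
pivot_row = [1, 2, 1, 0, 8]
local_rows = [[3, 1, 0, 1, 9],
              [-3, -5, 0, 0, 0]]

updated = eliminate(local_rows, pivot_row, col=1)
```

The node that owns the pivot row would replace it with the normalised row instead of eliminating it. After the update the pivot column is zero in every other row, which completes one simplex iteration locally.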

4 Conclusion and Future Work

Peer-to-Peer (P2P) networks (such as Gnutella, Napster and Kazaa) are playing a very important role in the next generation of distributed systems. They are capable of storing large amounts of data of different modalities. Resource management and extraction of patterns from this data present formidable challenges. In this paper, we present a novel approximate distributed algorithm for linear programming based on the simplex algorithm. Nodes in the network have access to local constraints but must solve a global objective function. A randomized gossip-based technique enables sharing of constraints and significantly reduces communication cost. Future work will involve demonstrating the empirical performance of the algorithm on large real-life peer-to-peer networks and extending the framework to other types of machine learning problems such as boosting, ranking and regression.

References

[1] Y. Rabani, A. Sinclair, and R. Wanka, “Local Divergence of Markov Chains and the Analysis of Iterative Load-Balancing Schemes”, IEEE Proc. of the 39th Symp. on Foundations of Computer Science (FOCS), 1998, pgs 694 – 703.

[2] F. W. Smith, “Pattern classifier design by linear programming”, IEEE Transactions on Computers, C-17(4):367–372, April 1968.

[3] G. B. Dantzig, “Linear Programming and Extensions”, Princeton University Press, NJ, 1963.

[4] J. E. Boillat, “Load balancing and Poisson equation in a graph”, Concurrency: Practice and Experience, Vol 2(4), pgs 289 - 313, Dec 1990.

[5] G. Cybenko, “Dynamic Load Balancing for Distributed Memory Multiprocessors”, Journal of Parallel and Distributed Computing, Vol 7, pgs 279 - 301, 1989.

[6] S. Datta and H. Kargupta, “Uniform Data Sampling from a Peer-to-Peer Network”, Proceedings of the 27th International Conference on Distributed Computing Systems, Washington D.C., pgs 50–, 2007.

[7] W. K. Hastings, “Monte Carlo Sampling Methods Using Markov Chains and Their Applications”, Biometrika, Vol 57, No 1, pgs 97 - 109, 1970.

[8] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller and E. Teller, “Equation of State Calculations by Fast Computing Machines”, Journal of Chemical Physics, Vol 21, No 6, June 1953.

[9] S. Chib and E. Greenberg, “Understanding the Metropolis-Hastings Algorithm”, The American Statistician, Vol 49, No 4, pgs 327 – 335, 1995.

[10] D. Kempe and F. McSherry, “A decentralized algorithm for spectral analysis”, Journal of Computer and System Sciences, Vol 74, Issue 1, Feb 2008, pgs 70 - 83.

[11] A. Awan, R. A. Ferreira, S. Jagannathan, A. Grama, “Distributed Uniform Sampling in Unstructured Peer-to-Peer Networks”, Proceedings of the 39th Hawaii International Conference on System Sciences, Kauai, Hawaii, 2006.

[12] S. Datta, K. Bhaduri, C. Giannella, R. Wolff, and H. Kargupta, “Distributed Data Mining in Peer-to-Peer Networks”, In IEEE Internet Computing, Special Issue On Distributed Data Mining, volume 10, pages 18–26, 2006.

[13] http://www.gnutella.com

[14] http://www.kazaa.com/us/index.htm

[15] H. Dutta, “Empowering Scientific Discovery by Distributed Data Mining on the Grid Infrastructure”, PhD. Thesis, University of Maryland, Baltimore County, 2007.

[16] H. Dutta and H. Kargupta, “Distributed Linear Programming and Resource Management for Data Mining in Distributed Environments”, In ICDM Workshops, pages 543–552, Pisa, Italy, 2008.

[17] H. Dutta and A. Matthur, “Distributed Optimization Strategies for Mining on Peer-to-Peer networks”, In Proc. of Seventh International Conference on Machine Learning and Applications (ICMLA08), pages 350–355, San Diego, CA, Dec. 2008.

[18] H. Dutta, X. Zhu, T. Mahule, H. Kargupta, K. Borne, C. Lauth, F. Holz, and G. Heyer, “TagLearner: A P2P Classifier Learning System from Collaboratively Tagged Text Documents”, In Proceedings of the International Conference on Data Mining (ICDM), Workshop on Mining Multiple Information Sources (MMIS-09), Miami, FL, USA.