Distributed Structured Prediction for Big Data
A. G. Schwing ETH Zurich
[email protected]
T. Hazan TTI Chicago
M. Pollefeys ETH Zurich
R. Urtasun TTI Chicago
Abstract
The biggest limitations of learning structured predictors from big data are the com- putation time and the memory demands. In this paper, we propose to handle those big data problems efficiently by distributing and parallelizing the resource require- ments. We present a distributed structured prediction learning algorithm for large scale models that cannot be effectively handled by a single cluster node. Impor- tantly, convergence and optimality guarantees of recently developed algorithms are preserved while keeping between node communication low.
1 Introduction
In the past few years, structured models have become an important tool in domains such as nat- ural language processing, computer vision and computational biology. The growing variability within data sets, requires an increasing expressiveness that is achieved by modeling the influence of more and more variables. Hence memory and computational limits of desktop computers are reached quickly. In computer vision, for example, uncompressed full HD video streams produce 150 megabytes of data per second.
Several structured prediction frameworks have been developed in the past. Notable examples are Conditional Random Fields (CRFs) [2], structured support vector machines (SSVMs) [7, 8] and their generalizations [1]. All three frameworks aim at minimizing a regularized surrogate loss.
While CRFs and SSVMs are the method of choice for tree-structured or sub-modular models, ap- proximations, e.g., [1] are in general required. Note that all three approaches are inherently parallel in the training data.
But none of the aforementioned frameworks address the underlying memory limitations of large scale models arising from real-world problems. This is important since nowadays big data tasks of increasing volume, variety and velocity call for large models. Hence we are interested in making structured prediction algorithms practical for large scale scenarios. We present an algorithm which distributes and parallelizes the computation and memory requirements while reducing communica- tion between cluster nodes and conserving convergence and optimality guarantees. Our approach is based on the principle of dual decomposition, i.e., computation is done in parallel by partitioning the model and imposing agreement on independent variables that are required to be consistent. Thus, we split the graph-based optimization program into several local optimization problems solved in parallel, and cluster nodes exchange information occasionally to enforce consistency.
2 A Review on Structured Prediction
Let us first consider a setting where X denotes the input space (e.g., a video or a document) and S is a structured label space (e.g., a video segmentation or a set of parse trees). Further, let φ : X × S → R F denote a mapping from the input and label space to an F -dimensional feature space. When using structured prediction approaches, we are commonly interested in finding the parameters w ∈ R F of a log-linear model p w (s | x) ∝ exp w > φ(x, s)/ with covariance , which best describes the possible labeling s ∈ S of x ∈ X .
For training, we are given a data set D = {(x i , s i ) N i=1 } containing N pairs, each composed by an
input space object x ∈ X and a label space object s ∈ S. In order to find the model parameters w
that best describe the annotations, we are often able to construct a task loss ` (x,s) (ˆ s) which measures the fitness of any labeling ˆ s ∈ S. The vector v = P
(x,s)∈D φ(x, s) denotes the empirical mean and we commonly assume independent and identically distributed data in addition to a prior p(w) ∝ exp(−kwk p p ). During learning we minimize the negative loss-augmented data-log-posterior, i.e.,
min w
X
(x,s)∈D
ln X
ˆ s∈S
exp ` (x,s) (ˆ s) + w > φ(x, ˆ s)
!
− v > w + C
p kwk p p . (1) Note that the covariance = 1 recovers the CRF objective [2] while → 0 smoothly approximates the max-function, hence recovering the SSVM formulation [7, 8].
Due to the sum over all label space configurations ˆ s ∈ S being generally exponential in size, the unconstrained minimization problem given in Eq. (1) is NP-hard in general. Elements φ r
of the feature vector φ often describe interactions between subsets of random variables, i.e., φ r (x, s) = P
i∈V r,x φ r,i (x, s i ) + P
α∈E r,x φ r,α (x, s α ). Note that a labeling s = (s i ) i∈V ∈ S is a tuple subsuming |V | variables, each having S i discrete states. The sparse interactions induced by the feature functions φ r (x, s) are visually depicted by a factor graph G r,x with the individual variables i ∈ V r,x of sample (x, s) being vertices that are connected to factors α ∈ E r,x iff vertex i is a neighbor of factor α ∈ E r,x . The union graph G x = S
r G r,x describes the relationship over all features r and we say that vertex i ∈ V x = S
r V r,x is a neighbor to factor α ∈ E x = S
r E r,x
if variable s i is part of the variable set s α in any of the features of sample (x, s), i.e., i ∈ N (α).
Conversely, all factors that variable i participates in are referred to by α ∈ N (i).
Approximations [1] are one way to deal with the previously outlined intractability. The dual to the program given in Eq. (1) is described by means of joint distributions ranging, for each data sam- ple (x, s), over the label space S. We describe this probability by its variable and factor marginals b (x,s),i (s i ) and b (x,s),α (s α ) and approximate the entropies of those joint distributions by its marginal entropies H(b (x,s),i ) and H(b (x,s),α ) using chosen counting numbers c i and c α for better approxi- mation accuracy. To ensure consistency, we require the beliefs to fulfill marginalization constraints corresponding to the structure of the graph G x while maximizing the approximated dual cost func- tion
X
(x,s)
X
i
c i H(b (x,s),i )+ X
α
c α H(b (x,s),α )+ X
i,ˆ s i
b (x,s),i (ˆ s i )` (x,s),i (ˆ s i )+ X
α,ˆ s α
b (x,s),α (ˆ s α )` (x,s),α (ˆ s α )
−
− C 1−q q
X
r
X
(x,s),i∈V r,x ,ˆ s i
b (x,s),i (ˆ s i )φ r,i (x, ˆ s i ) + X
(x,s),α∈E r,x ,ˆ s α
b (x,s),α (ˆ s α )φ r,α (x, ˆ s α ) − v r
q
, (2)
with 1/p + 1/q = 1. The sum ranging over the training samples being the first term in both the original primal (Eq. (1)) and the approximated dual (Eq. (2)) suggests that computation of the gradient is inherently parallel in the data set elements. With real-world models G x often being too large for the resources provided by a single cluster node we next discuss a possibility to partition the optimization task while preserving the original convergence properties.
3 Distributed Structured Prediction
To cope with current model size needs we are interested in an algorithm to maximize Eq. (2) while leveraging the sparsity given by the graph structure G x . In addition, we partition the vertices of the model such that each of the distributed cluster nodes solves an independent program defined on a subgraph induced by the variables of each partition (Fig. 1(a)). To ensure consistency for the global model, the distributed solutions are combined by exchanging information between connected sub- graphs. The distributed structured prediction algorithm extends existing frameworks by introducing a high-level factor graph (Fig. 1(b)) describing the cluster node interactions. Occasional exchange of information corresponds to messages being sent on this factor graph. It is important to note that we do not require an exchange of information at every iteration.
More concretely, let P x be a partition of all the vertices i ∈ V x for sample (x, s) into disjunct sub-
sets n x ∈ P x each containing the variables i ∈ n x that are assigned to the cluster node n x . The
vertices assigned to node n x ∈ P x induce a subgraph G x,n x . As before, this subgraph describes the
(x 1 , s 1 )
(x 2 , s 2 )
(a) (b)
200 400 600 800 1000 4.815
4.82 4.825 4.83 4.835
4.84x 106
Iterations
Dual Energy
1 5 10 20 50 100
(c)
102 4.815
4.82 4.825 4.83 4.835
4.84x 106
Time [s]
Dual Energy
1 5 10 20 50 100