Distributed Structured Prediction for Big Data

(1)

Distributed Structured Prediction for Big Data

A. G. Schwing ETH Zurich

[email protected]

T. Hazan TTI Chicago

M. Pollefeys ETH Zurich

R. Urtasun TTI Chicago

Abstract

The biggest limitations of learning structured predictors from big data are the com- putation time and the memory demands. In this paper, we propose to handle those big data problems efficiently by distributing and parallelizing the resource require- ments. We present a distributed structured prediction learning algorithm for large scale models that cannot be effectively handled by a single cluster node. Impor- tantly, convergence and optimality guarantees of recently developed algorithms are preserved while keeping between node communication low.

1 Introduction

In the past few years, structured models have become an important tool in domains such as nat- ural language processing, computer vision and computational biology. The growing variability within data sets, requires an increasing expressiveness that is achieved by modeling the influence of more and more variables. Hence memory and computational limits of desktop computers are reached quickly. In computer vision, for example, uncompressed full HD video streams produce 150 megabytes of data per second.

Several structured prediction frameworks have been developed in the past. Notable examples are Conditional Random Fields (CRFs) [2], structured support vector machines (SSVMs) [7, 8] and their generalizations [1]. All three frameworks aim at minimizing a regularized surrogate loss.

While CRFs and SSVMs are the method of choice for tree-structured or sub-modular models, ap- proximations, e.g., [1] are in general required. Note that all three approaches are inherently parallel in the training data.

But none of the aforementioned frameworks address the underlying memory limitations of large scale models arising from real-world problems. This is important since nowadays big data tasks of increasing volume, variety and velocity call for large models. Hence we are interested in making structured prediction algorithms practical for large scale scenarios. We present an algorithm which distributes and parallelizes the computation and memory requirements while reducing communica- tion between cluster nodes and conserving convergence and optimality guarantees. Our approach is based on the principle of dual decomposition, i.e., computation is done in parallel by partitioning the model and imposing agreement on independent variables that are required to be consistent. Thus, we split the graph-based optimization program into several local optimization problems solved in parallel, and cluster nodes exchange information occasionally to enforce consistency.

2 A Review on Structured Prediction

Let us first consider a setting where X denotes the input space (e.g., a video or a document) and S is a structured label space (e.g., a video segmentation or a set of parse trees). Further, let φ : X × S → R ^F denote a mapping from the input and label space to an F -dimensional feature space. When using structured prediction approaches, we are commonly interested in finding the parameters w ∈ R ^F of a log-linear model p w (s | x) ∝ exp w ^> φ(x, s)/ with covariance , which best describes the possible labeling s ∈ S of x ∈ X .

For training, we are given a data set D = {(x i , s i ) ^N _i=1 } containing N pairs, each composed by an

input space object x ∈ X and a label space object s ∈ S. In order to find the model parameters w

(2)

that best describe the annotations, we are often able to construct a task loss ` (x,s) (ˆ s) which measures the fitness of any labeling ˆ s ∈ S. The vector v = P

(x,s)∈D φ(x, s) denotes the empirical mean and we commonly assume independent and identically distributed data in addition to a prior p(w) ∝ exp(−kwk ^p _p ). During learning we minimize the negative loss-augmented data-log-posterior, i.e.,

min w

X

(x,s)∈D

ln X

ˆ s∈S

exp ` _(x,s) (ˆ s) + w ^> φ(x, ˆ s)

!

− v ^> w + C

p kwk ^p _p . (1) Note that the covariance = 1 recovers the CRF objective [2] while → 0 smoothly approximates the max-function, hence recovering the SSVM formulation [7, 8].

Due to the sum over all label space configurations ˆ s ∈ S being generally exponential in size, the unconstrained minimization problem given in Eq. (1) is NP-hard in general. Elements φ r

of the feature vector φ often describe interactions between subsets of random variables, i.e., φ _r (x, s) = P

i∈V r,x φ _r,i (x, s _i ) + P

α∈E r,x φ _r,α (x, s _α ). Note that a labeling s = (s i ) _i∈V ∈ S is a tuple subsuming |V | variables, each having S i discrete states. The sparse interactions induced by the feature functions φ r (x, s) are visually depicted by a factor graph G r,x with the individual variables i ∈ V _r,x of sample (x, s) being vertices that are connected to factors α ∈ E _r,x iff vertex i is a neighbor of factor α ∈ E r,x . The union graph G x = S

r G r,x describes the relationship over all features r and we say that vertex i ∈ V x = S

r V r,x is a neighbor to factor α ∈ E x = S

r E r,x

if variable s i is part of the variable set s α in any of the features of sample (x, s), i.e., i ∈ N (α).

Conversely, all factors that variable i participates in are referred to by α ∈ N (i).

Approximations [1] are one way to deal with the previously outlined intractability. The dual to the program given in Eq. (1) is described by means of joint distributions ranging, for each data sam- ple (x, s), over the label space S. We describe this probability by its variable and factor marginals b _(x,s),i (s i ) and b (x,s),α (s α ) and approximate the entropies of those joint distributions by its marginal entropies H(b _(x,s),i ) and H(b (x,s),α ) using chosen counting numbers c i and c α for better approxi- mation accuracy. To ensure consistency, we require the beliefs to fulfill marginalization constraints corresponding to the structure of the graph G x while maximizing the approximated dual cost func- tion

X

(x,s)



 X

i

c i H(b _(x,s),i )+ X

α

c α H(b _(x,s),α )+ X

i,ˆ s i

b _(x,s),i (ˆ s i )` _(x,s),i (ˆ s i )+ X

α,ˆ s α

b _(x,s),α (ˆ s α )` _(x,s),α (ˆ s α )



 −

− C ^1−q q

X

r

X

(x,s),i∈V _r,x ,ˆ s _i

b _(x,s),i (ˆ s _i )φ _r,i (x, ˆ s _i ) + X

(x,s),α∈E _r,x ,ˆ s _α

b _(x,s),α (ˆ s _α )φ _r,α (x, ˆ s _α ) − v _r

q

, (2)

with 1/p + 1/q = 1. The sum ranging over the training samples being the first term in both the original primal (Eq. (1)) and the approximated dual (Eq. (2)) suggests that computation of the gradient is inherently parallel in the data set elements. With real-world models G x often being too large for the resources provided by a single cluster node we next discuss a possibility to partition the optimization task while preserving the original convergence properties.

3 Distributed Structured Prediction

To cope with current model size needs we are interested in an algorithm to maximize Eq. (2) while leveraging the sparsity given by the graph structure G x . In addition, we partition the vertices of the model such that each of the distributed cluster nodes solves an independent program defined on a subgraph induced by the variables of each partition (Fig. 1(a)). To ensure consistency for the global model, the distributed solutions are combined by exchanging information between connected sub- graphs. The distributed structured prediction algorithm extends existing frameworks by introducing a high-level factor graph (Fig. 1(b)) describing the cluster node interactions. Occasional exchange of information corresponds to messages being sent on this factor graph. It is important to note that we do not require an exchange of information at every iteration.

More concretely, let P x be a partition of all the vertices i ∈ V x for sample (x, s) into disjunct sub-

sets n x ∈ P x each containing the variables i ∈ n x that are assigned to the cluster node n x . The

vertices assigned to node n x ∈ P x induce a subgraph G x,n x . As before, this subgraph describes the

(3)

(x ₁ , s ₁ )

(x 2 , s 2 )

(a) (b)

200 400 600 800 1000 4.815

4.82 4.825 4.83 4.835

4.84x 10⁶

Iterations

Dual Energy

1 5 10 20 50 100

(c)

10² 4.815

4.82 4.825 4.83 4.835

4.84x 10⁶

Time [s]

Dual Energy

1 5 10 20 50 100

(d)

Figure 1: (a): 2 samples each distributed on 2 cluster nodes (color). (b): The cluster node factor graph for consistency messages. (c),(d): Convergence of the inference task w.r.t. iterations and time.

marginalization constraints required to be enforced on cluster node n x for its assigned variable be- liefs b ⁿ _(x,s),i ^x (ˆ s _i ) ∀(x, s), i ∈ n _x , ˆ s _i and the factor beliefs b ⁿ _(x,s),α ^x (ˆ s _α ) ∀(x, s), i ∈ n _x , α ∈ N (i), ˆ s _α , i.e., P

ˆ

s _α \ˆ s _i b ⁿ _(x,s),α ^x (ˆ s α ) = b ⁿ _(x,s),i ^x (ˆ s i ).

A factor α that is assigned to multiple subgraphs G x,n _x , corresponds to a set of beliefs b ⁿ _(x,s),α ^x each of them optimized independently on the cluster nodes n x ∈ N _P _x (α). Since these distributed beliefs originate from a single b _(x,s),α in Eq. (2) we are required to ensure consistency. Formally, we construct a factor graph G _P _x with cluster nodes n _x being the vertices that are connected to shared factors α iff n x ∈ N P _x (α). Conversely, we denote by N P _x (n x ) all factors α that are shared between multiple nodes, one of them being n x . To keep the shared beliefs consistent, we add the constraints b ⁿ _(x,s),α ^x (ˆ s _α ) = b _(x,s),α (ˆ s _α ) ∀(x, s), α, n _x ∈ N P x (α), ˆ s _α . To ensure optimization of the cost function given in Eq. (2), we further need to balance the entropy H(b _(x,s),α ), the loss ` (x,s),α

and the features φ r,α for those factors α, distributed onto different cluster nodes. To this end, we let ˆ

c _α = c _α /|N _P _x (α)|, ˆ ` _(x,s),α = ` _(x,s),α /|N _P _x (α)| and ˆ φ _(x,s),α = φ _(x,s),α /|N _P _x (α)| for all shared factors. For the remaining factors the variables augmented by the hat symbol ‘ˆ·’ correspond to the original variables. Consequently, we obtain the following maximization, equivalent to Eq. (2):

X

(x,s),n x ∈P x



 X

i∈G _x,nx

c i H(b ⁿ _(x,s),i ^x ) + X

α∈G _x,nx

ˆ c α H(b ⁿ _(x,s),α ^x ) + X

i∈G _x,nx ,ˆ s _i

b ⁿ _(x,s),i ^x (ˆ s i )` _(x,s),i (ˆ s i )+

X

α∈G _x,nx ,ˆ s α

b ⁿ _(x,s),α ^x (ˆ s α )ˆ ` _(x,s),α (ˆ s α )



 − C ^1−q

q kz − vk ^q _q , (3)

with marginalization constraints P

ˆ

s α \ˆ s i b ⁿ _(x,s),α ^x (ˆ s α ) = b ⁿ _(x,s),i ^x (ˆ s i ) ∀(x, s), n x , i, ˆ s i , α ∈ N (i), consistency constraints b ⁿ _(x,s),α ^x (ˆ s _α ) = b _(x,s),α (ˆ s _α ) ∀(x, s), n _x , α ∈ N _P _(x,s) _(n _x ₎ , ˆ s _α and variable z r = P

(x,s),n _x ,i,ˆ s _i b ⁿ _(x,s),i ^x (ˆ s i )φ r,i (x, ˆ s i ) + P

(x,s),s,α,ˆ s _α b ⁿ _(x,s),α ^x (ˆ s α ) ˆ φ r,α (x, ˆ s α ) ∀r = {1, . . . , F }.

We would like to utilize the structure of the graph to obtain memory efficient and fast algorithms.

Since the structure is employed to express the marginalization constraints, the dual program of Eq. (3), with its Lagrange multipliers λ _(x,s),i→α (ˆ s i ) corresponding to the marginalization con- straints and ν _(x,s),n _x _→α (ˆ s α ) originating from the consistency constraints between different cluster nodes is our preferred task. The dual program to Eq. (3) is given by the following claim.

Claim 1. Set ν _(x,s),n _x _→α = 0 for every α 6∈ G _P _x and enforce P

n x ∈N _P(x,s) (α) ν _(x,s),n _x _→α (ˆ s α ) = 0

∀(x, s), α, ˆ s α . With ˆ φ (x,s),i (ˆ s i ) = ` (x,s),i (ˆ s i ) + P

r:i∈V _r,x,nx w r φ r,i (x, ˆ s i ) and ˆ φ (x,s),α (ˆ s α ) =

` ˆ _(x,s),α (ˆ s α ) + P

r:α∈E _r,x,nx w r φ ˆ r,α (x, ˆ s α ) the dual program of the approximated structured predic- tion dual in Eq. (3) reads as

g = X

(x,s),n x ,i∈G _x,nx

c _i ln X

ˆ s _i

exp

φ ˆ _(x,s),i (ˆ s _i ) − P

α∈N (i) λ _(x,s),i→α (ˆ s _i )

c i

!

− v ^> w + C

p kwk ^p _p +

X

(x,s),n _x ,α∈G _x,nx

ˆ c α ln X

ˆ s α

exp

φ ˆ (x,s),α (ˆ s α ) + P

i∈N (α)∩s λ (x,s),i→α (ˆ s i ) + ν (x,s),n _x →α (ˆ s α )

ˆ c α

! . (4)

Proof: Follows [1, 5].

Looking at the distributed approximated primal given in Eq. (4) more closely, we note that both

terms involving the two types of Lagrange multipliers are now preceded by sums ranging over the

samples as well as the compute nodes n x .

(4)

To derive an efficient algorithm we perform block-coordinate descent on this approximated primal.

Fixing the consistency messages ν _(x,s),n _x _→α (ˆ s α ), the optimal λ (x,s),i→α (ˆ s i ) is computed ∀i ∈ G _x,n _x without considering current information from other cluster nodes. A status update in form of consistency messages ν (x,s),n _x →α (ˆ s α ) is analytically computed by synchronizing messages between the different machines. The Armijo-Iterations performed to optimize w r require computation of the beliefs as well as the primal cost function value, which is done on the distributed nodes before another synchronization. The resulting block-coordinate descent and gradient steps are given by the following claim.

Claim 2. With µ _(x,s),α→i (ˆ s _i ) = ˆ c _α ln P

ˆ

s α \ˆ s i exp(( ˆ φ _(x,s),α (ˆ s _α ) + P

j∈N (α)∩s\i λ _(x,s),j→α (ˆ s _j ) + ν _(x,s),n _x _→α (ˆ s α ))/(ˆ c α )) the gradient steps in λ, ν and the gradient in w r are:

λ _(x,s),i→α (ˆ s _i ) ∝ ˆ c _α c i + P

α∈N (i) ˆ c α



 φ ˆ _(x,s),i (ˆ s _i ) + X

β∈N (i)

µ _(x,s),β→i (ˆ s _i )



 − µ (x,s),α→i (ˆ s _i ),

ν _(x,s),n _x _→α (ˆ s _α ) ∝ 1

|N P _(x,s) (α)|

X

i∈N (α)

λ _(x,s),i→α (ˆ s _i ) − X

i∈N (α)∩s

λ _(x,s),i→α (ˆ s _i ),

∂g

∂w r

= X

(x,s),n _x ,i,ˆ s _i

b ⁿ _(x,s),i ^x (ˆ s _i )φ _r,i (x, ˆ s _i ) + X

(x,s),n _x ,α,ˆ s _α

b ⁿ _(x,s),α ^x φ ˆ _r,α (x, ˆ s _α ) − v _r + C|w _r | ^p−1 sgn(w _r ).

Proof: Follows [1, 5].

Since the order of the block-coordinate descent steps does not impact convergence guarantees, we iteratively update the λ messages within a cluster node and the model parameters w r , before ex- changing information between machines in form of consistency messages. Note, that updating model parameters requires cluster nodes to only exchange numbers, while the size of the consis- tency messages depends on the size of the shared factors being commonly larger than a single real value.

4 Related Work and Discussion

Data parallel frameworks, like MapReduce, simplify implementation of large-scale data processing but do not naturally support development of efficient learning algorithms. One of the most notable publicly available engines working towards efficient distributed algorithms is GraphLAB which, originally supporting only shared-memory environments [3], was recently extended to distributed environments [4]. However, minimization of communication overhead between cluster nodes is not considered, which potentially reduces computational performance.

Our recent work on a parallel inference tasks that explicitly minimizes the communication overhead was presented in [5]. Fig. 1(c) and Fig. 1(d) from [5] show the convergence of an inference task w.r.t.

iterations and time when communicating between machines every 1,5,10,...,100 iterations. Although convergence in terms of iterations is best when transmitting information frequently, communication overhead reduces wall-clock performance when exchanging variables often. The drop in perfor- mance depends on the graph connectivity and clique size (e.g., a common pairwise 4-connected grid in our case) and the cluster infrastructure (LAN or InfiniBand connection). Since learning involves inference a similar time dependence is expected.

5 Conclusion

We have presented a distributed structured prediction algorithm that is able to process models that exceed the resource restrictions of a single cluster node. Our approach divides computation and memory requirements onto multiple machines while convergence and optimality guarantees are pre- served by introducing a new type of consistency message. Our algorithm benefits particularly from the availability of multiple cluster nodes but it is also useful on a single machine since we derive explicit rules for swapping parts of the model between memory and hard disk.

Extensions towards latent variable models [6] and towards automatically finding an effective parti-

tioning of graphical models are subject to future research.

(5)

References

[1] T. Hazan and R. Urtasun. A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction. In Proc. NIPS, 2010.

[2] J. Lafferty, A. McCallum, and F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. ICML, 2001.

[3] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A New Parallel Framework for Machine Learning. In Proc. UAI, 2010.

[4] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A Framework for Machine Learning in the Cloud. In Proc. Very Large Data Bases, 2012.

[5] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Distributed Message-Passing for Large-Scale Graphical Models. In Proc. CVPR, 2011.

[6] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Efficient Structured Prediction with Latent Vari- ables for General Graphical Models. In Proc. ICML, 2012.

[7] B. Taskar, C. Guestrin, and D. Koller. Max-Margin Markov Networks. In Proc. NIPS, 2003.

[8] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Learning for Interdependent and

Structured Output Spaces. In Proc. ICML, 2004.

Distributed Structured Prediction for Big Data

Distributed Structured Prediction for Big Data

A. G. Schwing ETH Zurich

[email protected]

T. Hazan TTI Chicago

M. Pollefeys ETH Zurich

R. Urtasun TTI Chicago

Abstract

1 Introduction

Several structured prediction frameworks have been developed in the past. Notable examples are Conditional Random Fields (CRFs) [2], structured support vector machines (SSVMs) [7, 8] and their generalizations [1]. All three frameworks aim at minimizing a regularized surrogate loss.

While CRFs and SSVMs are the method of choice for tree-structured or sub-modular models, ap- proximations, e.g., [1] are in general required. Note that all three approaches are inherently parallel in the training data.

2 A Review on Structured Prediction

For training, we are given a data set D = {(x i , s i ) N i=1 } containing N pairs, each composed by an

input space object x ∈ X and a label space object s ∈ S. In order to find the model parameters w

that best describe the annotations, we are often able to construct a task loss ` (x,s) (ˆ s) which measures the fitness of any labeling ˆ s ∈ S. The vector v = P

(x,s)∈D φ(x, s) denotes the empirical mean and we commonly assume independent and identically distributed data in addition to a prior p(w) ∝ exp(−kwk p p ). During learning we minimize the negative loss-augmented data-log-posterior, i.e.,

min w

X

(x,s)∈D

 ln X

ˆ s∈S

exp ` (x,s) (ˆ s) + w > φ(x, ˆ s)



!

− v > w + C

p kwk p p . (1) Note that the covariance  = 1 recovers the CRF objective [2] while  → 0 smoothly approximates the max-function, hence recovering the SSVM formulation [7, 8].

Due to the sum over all label space configurations ˆ s ∈ S being generally exponential in size, the unconstrained minimization problem given in Eq. (1) is NP-hard in general. Elements φ r

of the feature vector φ often describe interactions between subsets of random variables, i.e., φ r (x, s) = P

i∈V r,x φ r,i (x, s i ) + P

r G r,x describes the relationship over all features r and we say that vertex i ∈ V x = S

r V r,x is a neighbor to factor α ∈ E x = S

r E r,x

if variable s i is part of the variable set s α in any of the features of sample (x, s), i.e., i ∈ N (α).

Conversely, all factors that variable i participates in are referred to by α ∈ N (i).

X

(x,s)



 X

i

c i H(b (x,s),i )+ X

α

c α H(b (x,s),α )+ X

i,ˆ s i

b (x,s),i (ˆ s i )` (x,s),i (ˆ s i )+ X

α,ˆ s α

b (x,s),α (ˆ s α )` (x,s),α (ˆ s α )



 −

− C 1−q q

X

r

X

(x,s),i∈V r,x ,ˆ s i

b (x,s),i (ˆ s i )φ r,i (x, ˆ s i ) + X

(x,s),α∈E r,x ,ˆ s α

b (x,s),α (ˆ s α )φ r,α (x, ˆ s α ) − v r

q

, (2)

3 Distributed Structured Prediction

More concretely, let P x be a partition of all the vertices i ∈ V x for sample (x, s) into disjunct sub-

sets n x ∈ P x each containing the variables i ∈ n x that are assigned to the cluster node n x . The

vertices assigned to node n x ∈ P x induce a subgraph G x,n x . As before, this subgraph describes the

(x 1 , s 1 )

(x 2 , s 2 )

(a) (b)

(c)

(d)

Figure 1: (a): 2 samples each distributed on 2 cluster nodes (color). (b): The cluster node factor graph for consistency messages. (c),(d): Convergence of the inference task w.r.t. iterations and time.

marginalization constraints required to be enforced on cluster node n x for its assigned variable be- liefs b n (x,s),i x (ˆ s i ) ∀(x, s), i ∈ n x , ˆ s i and the factor beliefs b n (x,s),α x (ˆ s α ) ∀(x, s), i ∈ n x , α ∈ N (i), ˆ s α , i.e., P

ˆ

s α \ˆ s i b n (x,s),α x (ˆ s α ) = b n (x,s),i x (ˆ s i ).

and the features φ r,α for those factors α, distributed onto different cluster nodes. To this end, we let ˆ

X

(x,s),n x ∈P x



 X

i∈G x,nx

c i H(b n (x,s),i x ) + X

α∈G x,nx

ˆ c α H(b n (x,s),α x ) + X

For training, we are given a data set D = {(x i , s i ) ^N _i=1 } containing N pairs, each composed by an

(x,s)∈D φ(x, s) denotes the empirical mean and we commonly assume independent and identically distributed data in addition to a prior p(w) ∝ exp(−kwk ^p _p ). During learning we minimize the negative loss-augmented data-log-posterior, i.e.,

ln X

exp ` _(x,s) (ˆ s) + w ^> φ(x, ˆ s)

− v ^> w + C

p kwk ^p _p . (1) Note that the covariance = 1 recovers the CRF objective [2] while → 0 smoothly approximates the max-function, hence recovering the SSVM formulation [7, 8].

of the feature vector φ often describe interactions between subsets of random variables, i.e., φ _r (x, s) = P

i∈V r,x φ _r,i (x, s _i ) + P

c i H(b _(x,s),i )+ X

c α H(b _(x,s),α )+ X

b _(x,s),i (ˆ s i )` _(x,s),i (ˆ s i )+ X

b _(x,s),α (ˆ s α )` _(x,s),α (ˆ s α )

− C ^1−q q

(x,s),i∈V _r,x ,ˆ s _i

b _(x,s),i (ˆ s _i )φ _r,i (x, ˆ s _i ) + X

(x,s),α∈E _r,x ,ˆ s _α

b _(x,s),α (ˆ s _α )φ _r,α (x, ˆ s _α ) − v _r

(x ₁ , s ₁ )

marginalization constraints required to be enforced on cluster node n x for its assigned variable be- liefs b ⁿ _(x,s),i ^x (ˆ s _i ) ∀(x, s), i ∈ n _x , ˆ s _i and the factor beliefs b ⁿ _(x,s),α ^x (ˆ s _α ) ∀(x, s), i ∈ n _x , α ∈ N (i), ˆ s _α , i.e., P

s _α \ˆ s _i b ⁿ _(x,s),α ^x (ˆ s α ) = b ⁿ _(x,s),i ^x (ˆ s i ).

i∈G _x,nx

c i H(b ⁿ _(x,s),i ^x ) + X

α∈G _x,nx

ˆ c α H(b ⁿ _(x,s),α ^x ) + X

i∈G _x,nx ,ˆ s _i

b ⁿ _(x,s),i ^x (ˆ s i )` _(x,s),i (ˆ s i )+

α∈G _x,nx ,ˆ s α

b ⁿ _(x,s),α ^x (ˆ s α )ˆ ` _(x,s),α (ˆ s α )

 − C ^1−q

q kz − vk ^q _q , (3)

s α \ˆ s i b ⁿ _(x,s),α ^x (ˆ s α ) = b ⁿ _(x,s),i ^x (ˆ s i ) ∀(x, s), n x , i, ˆ s i , α ∈ N (i), consistency constraints b ⁿ _(x,s),α ^x (ˆ s _α ) = b _(x,s),α (ˆ s _α ) ∀(x, s), n _x , α ∈ N _P _(x,s) _(n _x ₎ , ˆ s _α and variable z r = P

(x,s),n _x ,i,ˆ s _i b ⁿ _(x,s),i ^x (ˆ s i )φ r,i (x, ˆ s i ) + P

(x,s),s,α,ˆ s _α b ⁿ _(x,s),α ^x (ˆ s α ) ˆ φ r,α (x, ˆ s α ) ∀r = {1, . . . , F }.

Claim 1. Set ν _(x,s),n _x _→α = 0 for every α 6∈ G _P _x and enforce P

n x ∈N _P(x,s) (α) ν _(x,s),n _x _→α (ˆ s α ) = 0

r:i∈V _r,x,nx w r φ r,i (x, ˆ s i ) and ˆ φ (x,s),α (ˆ s α ) =

` ˆ _(x,s),α (ˆ s α ) + P

r:α∈E _r,x,nx w r φ ˆ r,α (x, ˆ s α ) the dual program of the approximated structured predic- tion dual in Eq. (3) reads as

(x,s),n x ,i∈G _x,nx

c _i ln X

ˆ s _i

φ ˆ _(x,s),i (ˆ s _i ) − P

α∈N (i) λ _(x,s),i→α (ˆ s _i )

c i

− v ^> w + C

p kwk ^p _p +

(x,s),n _x ,α∈G _x,nx

ˆ c α ln X

i∈N (α)∩s λ (x,s),i→α (ˆ s i ) + ν (x,s),n _x →α (ˆ s α )

ˆ c α

Claim 2. With µ _(x,s),α→i (ˆ s _i ) = ˆ c _α ln P

s α \ˆ s i exp(( ˆ φ _(x,s),α (ˆ s _α ) + P

j∈N (α)∩s\i λ _(x,s),j→α (ˆ s _j ) + ν _(x,s),n _x _→α (ˆ s α ))/(ˆ c α )) the gradient steps in λ, ν and the gradient in w r are:

λ _(x,s),i→α (ˆ s _i ) ∝ ˆ c _α c i + P

 φ ˆ _(x,s),i (ˆ s _i ) + X

µ _(x,s),β→i (ˆ s _i )

 − µ (x,s),α→i (ˆ s _i ),