Distributed Discrete-time Optimization with Coupling Constraints Based on Dual Proximal Gradient Method in Multi-agent Networks


Jianzheng Wang, Guoqiang Hu, Senior Member, IEEE

Abstract—In this paper, we aim to solve a distributed optimization problem with coupling constraints based on proximal gradient method in a multi-agent network, where the cost function of the agents is composed of smooth and possibly non-smooth parts.

To solve this problem, we resort to the dual problem by deriving the Fenchel conjugate, resulting in a consensus-based constrained optimization problem. Then, we propose a fully distributed dual proximal gradient algorithm, where the agents make decisions using only local parameters and the information of immediate neighbours. Moreover, provided that the non-smooth parts of the primal cost functions have simple structures, only the dual variables need to be updated, by simple operations, which reduces the overall computational complexity. An analytical convergence rate of the proposed algorithm is derived, and its efficacy is numerically verified on a social welfare optimization problem in the electricity market.

Index Terms—Multi-agent network; proximal gradient method; distributed optimization; Fenchel conjugate; dual problem.

I. INTRODUCTION

A. Background and Motivation

Decentralized optimization has become an active topic in recent years for solving various engineering problems, such as detection and localization in sensor networks [1], machine learning and regression problems [2], and economic dispatch in power systems [3]. In a typical optimization architecture, each agent maintains an individual cost function, and the global optimal solution is attained through multiple rounds of communication and decision-making. In this paper, we focus on a class of composite optimization problems, where the cost functions are composed of smooth (differentiable) and possibly non-smooth (non-differentiable) parts. Such problems arise in various fields, such as resource allocation problems [4], Lasso regressions [5], and support vector machines [6]. To solve these problems, widely discussed techniques include the alternating direction method of multipliers [7], primal-dual subgradient methods [8], and proximal gradient methods [9].

Most existing works on decentralized optimization assume that the agents are fully connected to ensure the correctness of the optimization results, which limits their usage in large-scale distributed networks [10, 11]. To overcome this issue, a valid alternative is applying graph theory in modelling the communication links, leading to the distributed setup where the agents only communicate with their immediate neighbours [12]. However, with the increasing demand for computational efficiency in various fields, more exploration of algorithm development for distributed optimization problems (DOPs) is needed [13]. Noticing that proximal gradient methods are usually numerically more stable than their subgradient-based counterparts in composite optimization [14], in this work, we aim to develop an efficient distributed optimization algorithm based on the proximal gradient method.

Jianzheng Wang and Guoqiang Hu are with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, 639798 e-mail: ([email protected], [email protected]).

B. Literature Review

Many distributed algorithms for solving DOPs can be found in the existing works. To adapt to large-scale distributed networks, consensus-based DOPs without coupling constraints were studied in [15–19], where the agents make decisions with local variables and agreement on the optimal solution is achieved only through local communications.

Alternatively, we focus on optimizing a class of composite DOPs subject to coupling affine constraints via dual proximal gradient method. In this work, dual proximal gradient method corresponds to the proximal gradient method applied to the dual as also discussed in [4, 20, 21], where, however, no coupling constraint is considered. To the best knowledge of the authors, this work incorporates dual proximal gradient method in distributed setups with general coupling affine constraints for the first time, which enriches the existing algorithms for constrained DOPs.

To develop a fully distributed algorithm for the problem of interest, we propose a distributed dual proximal gradient (DDPG) algorithm. To highlight the new features and advantages of this work, the comparisons with some state-of-the-art works with similar problem setups are listed as follows.

• One distinct feature of the proposed DDPG algorithm is that, by resorting to the dual problem, we only need to update the dual variables by some simple operations, e.g., basic proximal mappings and simple iterations, provided that the proximal mapping of the non-smooth parts in the primal cost functions can be explicitly derived. This is more efficient than the existing distributed algorithms with possibly costly computations of the primal variables or other auxiliary variables, e.g., those provided in [22–31].¹

arXiv:2108.10652v1 [math.OC] 24 Aug 2021

• In [22, 24–31], some common fixed or varying global step-sizes are required. By contrast, the proposed fully distributed DDPG algorithm allows for heterogeneous step-sizes determined by local information, e.g., private objective functions and local parameters in the global constraints, which provides more flexibility for the initialization process and is more adaptive to large-scale distributed networks.

• The consensus-based distributed optimization algorithms studied in [23, 26, 27, 31] require the updating of some weighted running averages of variables or gradients, which increases the computational complexity and requires more memory capacity for the auxiliary variables.

• An explicit convergence rate is derived for the proposed DDPG algorithm, which is not provided in [24, 26–28, 30]. In addition, the algorithms in [22, 24–28, 30, 31] assume some compact local constraints to ensure the convergence of the algorithms. By contrast, this work focuses on dual sequences without a boundedness requirement on the primal variables.

The contributions of this work are summarized as follows.

• We consider a class of composite DOPs with local convex and coupling affine constraints. A DDPG algorithm is proposed by deriving the dual problem based on Fenchel conjugate, where the optimal solution can be attained when the agents execute updates only with the dual information of immediate neighbours and locally determined step-sizes, leading to a fully distributed computation environment.

• Different from the existing research works with similar problem setups, the proposed DDPG algorithm only requires the update of dual variables by some simple operations if the non-smooth parts of the objective functions are simple-structured, which can reduce the overall computational complexity. In addition, the proposed DDPG algorithm requires some widely used assumptions on the primal problems and explicit convergence rate is provided.

C. Paper Structure and Notations

The remainder of this paper is organized as follows. Section II provides some fundamental definitions and mathematical properties employed by this work. Section III formulates the optimization problem of interest and introduces the assumptions. In Section IV, the dual problem is formulated and the DDPG is proposed therein. The convergence analysis is conducted in Section V. The efficacy of the proposed DDPG algorithm is demonstrated in Section VI with a social welfare optimization problem in the electricity market. Section VII concludes this paper.

¹ For DOPs with smooth cost functions, some existing works on dual algorithms, e.g., [32], can avoid the update of primal variables. However, directly extending their results to non-smooth cases can be costly in the sense that the computation of the gradient of the formulated dual function requires an additional nontrivial optimization process. Therefore, the contribution of this work to computational efficiency is established for possibly non-smooth cost functions.

$\mathbb{N}$ and $\mathbb{N}_+$ denote the non-negative and positive integer spaces, respectively. Let $|\mathcal{A}|$ be the size of set $\mathcal{A}$. $\mathbb{R}^n_+$ denotes the $n$-dimensional Euclidean space with only non-negative real elements. Operator $(\cdot)^\top$ represents the transpose of a matrix. $\mathcal{A}_1 \times \mathcal{A}_2$ denotes the Cartesian product of sets $\mathcal{A}_1$ and $\mathcal{A}_2$. $\mathrm{relint}\,\mathcal{A}$ represents the relative interior of set $\mathcal{A}$. $\|\cdot\|$ and $\|\cdot\|_1$ refer to the $l_2$- and $l_1$-norms, respectively. Define $\|u\|^2_X = u^\top X u$ with $X$ a square matrix. $\otimes$ is the Kronecker product. $I_n$ is an $n$-dimensional identity matrix and $O_{n \times m}$ is an $(n \times m)$-dimensional zero matrix. $\mathbf{1}_n$ and $\mathbf{0}_n$ denote the $n$-dimensional column vectors with all elements 1 and 0, respectively. Define

$$D^m_{\mathcal{A}}[u_n] = \begin{bmatrix} u_1 I_m & & O \\ & \ddots & \\ O & & u_{|\mathcal{A}|} I_m \end{bmatrix} \in \mathbb{R}^{m|\mathcal{A}| \times m|\mathcal{A}|}. \tag{1}$$
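As a quick illustration of the block-diagonal operator in (1), the following Python sketch builds it with a Kronecker product. The function name `block_scaling` and the sample values are illustrative assumptions, not notation from the paper.

```python
import numpy as np

def block_scaling(u, m):
    """Block-diagonal matrix with blocks u_n * I_m, as in definition (1)."""
    u = np.asarray(u, dtype=float)
    # diag(u) (x) I_m places u_n * I_m in the n-th diagonal block
    return np.kron(np.diag(u), np.eye(m))

# Two agents (|A| = 2) with scalars 2 and 5, block size m = 2
D = block_scaling([2.0, 5.0], m=2)
```

With heterogeneous step-sizes, this is the matrix that scales each agent's block of the stacked variable by that agent's own scalar.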

II. PRELIMINARIES

Some frequently used definitions and relevant properties of graph theory, proximal mapping, and Fenchel conjugate are provided in this section.

A. Graph Theory

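Define an undirected graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ for a multi-agent network, where $\mathcal{V} = \{1, 2, ..., N\}$ is the set of vertices and $\mathcal{E} \subseteq \{(i, j) \mid i, j \in \mathcal{V} \text{ and } i \neq j\}$ is the set of edges with $(i, j) \in \mathcal{E}$ unordered. $\mathcal{G}$ is connected if any two distinct vertices are linked by at least one path. $\mathcal{V}_i = \{j \mid (i, j) \in \mathcal{E}\}$ is the neighbour set of agent $i$. Let $L \in \mathbb{R}^{N \times N}$ be the Laplacian matrix of $\mathcal{G}$. Then, the $(i, j)$th element of $L$, denoted by $d_{ij}$, satisfies $d_{ij} = -1$ if $(i, j) \in \mathcal{E}$, $d_{ij} = 0$ if $(i, j) \notin \mathcal{E}$ and $i \neq j$, and $d_{ii} = |\mathcal{V}_i|$ [33].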
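The Laplacian definition above can be sketched in a few lines of Python. The helper name `laplacian`, the 0-based indexing and the 4-agent path graph are assumptions chosen only for illustration.

```python
import numpy as np

def laplacian(num_agents, edges):
    """Graph Laplacian: -1 on edges, node degree on the diagonal."""
    L = np.zeros((num_agents, num_agents))
    for i, j in edges:            # undirected edge (i, j), 0-based indices
        L[i, j] = L[j, i] = -1.0
        L[i, i] += 1.0
        L[j, j] += 1.0
    return L

# Path graph 0 - 1 - 2 - 3 (illustrative)
L = laplacian(4, [(0, 1), (1, 2), (2, 3)])
# For an undirected graph, connectivity is equivalent to the second-smallest
# eigenvalue of L being strictly positive.
algebraic_connectivity = np.linalg.eigvalsh(L)[1]
```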

B. Proximal Mapping

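A proximal mapping of a proper, convex, and closed function $\psi : \mathbb{R}^n \to (-\infty, +\infty]$ is defined by $\mathrm{prox}^{\alpha}_{\psi}[v] = \arg\min_u \big(\psi(u) + \frac{1}{2\alpha}\|u - v\|^2\big)$, $\alpha > 0$, $v \in \mathbb{R}^n$.² A generalized version of the proximal mapping can be defined as

$$\mathrm{prox}^{X}_{\psi}[v] = \arg\min_u \Big(\psi(u) + \frac{1}{2}\|u - v\|^2_{X^{-1}}\Big), \tag{2}$$

with $X \in \mathbb{R}^{n \times n}$ a positive definite matrix [20].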
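To make definition (2) concrete, consider the special case where the function is the $l_1$-norm and $X$ is diagonal and positive definite. The minimization then separates per coordinate and reduces to soft thresholding with coordinate-wise thresholds. The sketch below assumes this special case; the names `prox_l1_diag` and `objective` and the sample data are hypothetical.

```python
import numpy as np

def prox_l1_diag(v, x_diag):
    """Generalized prox of the l1-norm for X = diag(x_diag), x_diag > 0.

    Solves argmin_u ||u||_1 + 0.5 * (u - v)^T X^{-1} (u - v); coordinate k
    is soft-thresholded with threshold x_diag[k].
    """
    v = np.asarray(v, dtype=float)
    x_diag = np.asarray(x_diag, dtype=float)
    return np.sign(v) * np.maximum(np.abs(v) - x_diag, 0.0)

def objective(u, v, x_diag):
    """Objective minimized in (2) for the l1-norm and diagonal X."""
    return np.sum(np.abs(u)) + 0.5 * np.sum((u - v) ** 2 / x_diag)

v = np.array([3.0, -0.5, 1.2])
x_diag = np.array([1.0, 1.0, 0.5])
u_star = prox_l1_diag(v, x_diag)
```

With $X = \alpha I_n$ this recovers the scalar-parameter proximal mapping defined at the start of this subsection.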

C. Fenchel Conjugate

Let $\psi : \mathbb{R}^n \to (-\infty, +\infty]$ be a proper function. Then, the Fenchel conjugate of $\psi$ is defined by $\psi^\diamond(v) = \sup_u \{v^\top u - \psi(u)\}$, which is convex [34, Sec. 3.3].

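Lemma 1. (Extended Moreau Decomposition [35, Thm. 6.45]) Let $\psi : \mathbb{R}^n \to (-\infty, +\infty]$ be a proper, convex, and closed function and $\psi^\diamond$ be its Fenchel conjugate. Then,

$$v = \alpha\, \mathrm{prox}^{1/\alpha}_{\psi^\diamond}\Big[\frac{v}{\alpha}\Big] + \mathrm{prox}^{\alpha}_{\psi}[v], \tag{3}$$

$\forall v \in \mathbb{R}^n$, $\alpha > 0$.

² The proximal mapping can be equivalently written as $\mathrm{prox}_{\alpha\psi}$, as in some other works.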
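The decomposition in Lemma 1 can be checked numerically on a one-dimensional example. The sketch below assumes the function is the indicator of an interval, so its proximal mapping is a projection and its Fenchel conjugate is the support function of the interval. All numerical values are illustrative, and the grid search serves only as an independent check.

```python
import numpy as np

lo, hi = -1.0, 2.0          # psi is the indicator of the interval [lo, hi]
alpha, v = 0.5, 3.0

# The prox of an indicator is the projection onto the set, for any alpha > 0
prox_psi = min(max(v, lo), hi)

# Lemma 1 rearranged: prox of the conjugate at v/alpha with parameter 1/alpha
prox_conj_via_moreau = (v - prox_psi) / alpha

# Independent brute-force evaluation of the same prox: the conjugate of the
# indicator of [lo, hi] is the support function max(lo * u, hi * u)
grid = np.linspace(-10.0, 10.0, 200001)
w, beta = v / alpha, 1.0 / alpha
values = np.maximum(lo * grid, hi * grid) + (grid - w) ** 2 / (2.0 * beta)
prox_conj_brute = grid[np.argmin(values)]
```

The practical point is that the proximal mapping of a conjugate never has to be derived separately when the proximal mapping of the original function is available.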


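Lemma 2. [20, Lemma V.7] Let $\psi : \mathbb{R}^n \to (-\infty, +\infty]$ be a proper, closed, $\sigma$-strongly convex function and $\psi^\diamond$ be its Fenchel conjugate, $\sigma > 0$. Then,

$$\arg\max_u \big(v^\top u - \psi(u)\big) = \nabla_v \psi^\diamond(v) \tag{4}$$

and $\nabla_v \psi^\diamond(v)$ is $\frac{1}{\sigma}$-Lipschitz continuous.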
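A small numerical illustration of Lemma 2, assuming a strongly convex quadratic $\psi(u) = \frac{1}{2}u^\top Q u + c^\top u$ with $Q$ symmetric positive definite (the matrix `Q`, the vector `c` and the function name are hypothetical). In this case the maximizer in (4) solves a linear system, and the strong convexity modulus is the smallest eigenvalue of $Q$.

```python
import numpy as np

Q = np.array([[2.0, 0.5], [0.5, 1.0]])   # symmetric positive definite
c = np.array([1.0, -1.0])
sigma = np.linalg.eigvalsh(Q)[0]          # strong convexity modulus of psi

def grad_conjugate(v):
    """Maximizer of v^T u - psi(u); stationarity gives Q u + c = v."""
    return np.linalg.solve(Q, v - c)

# Empirical Lipschitz ratios of the conjugate gradient on random pairs
rng = np.random.default_rng(0)
ratios = []
for _ in range(100):
    v, w = rng.normal(size=2), rng.normal(size=2)
    num = np.linalg.norm(grad_conjugate(v) - grad_conjugate(w))
    ratios.append(num / np.linalg.norm(v - w))
max_ratio = max(ratios)                   # should not exceed 1 / sigma
```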

III. PROBLEM FORMULATION

The problem formulation and relevant assumptions are provided as follows.

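Let $F(x) = \sum_{i \in \mathcal{V}} F_i(x_i)$ be the global cost function of a multi-agent network $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$, $x_i \in \mathbb{R}^M$, $x = [x_1^\top, ..., x_N^\top]^\top \in \mathbb{R}^{NM}$. Agent $i$ maintains the private cost function $F_i(x_i) = f_i(x_i) + g_i(x_i)$. Let $\mathcal{X}_i \subseteq \mathbb{R}^M$ be the feasible region of $x_i$. Then, the feasible region of $x$ can be defined by $\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2 \times ... \times \mathcal{X}_N \subseteq \mathbb{R}^{NM}$. An affine-constrained optimization problem of $\mathcal{G}$ can then be given by

$$\text{(P1)} \quad \min_{x \in \mathcal{X}} \sum_{i \in \mathcal{V}} F_i(x_i) \quad \text{subject to } Ax = b,$$

$A \in \mathbb{R}^{B \times NM}$, $b \in \mathbb{R}^B$, which is equivalent to

$$\text{(P2)} \quad \min_{x} \sum_{i \in \mathcal{V}} \big(F_i(x_i) + \mathbb{I}_{\mathcal{X}_i}(x_i)\big) \quad \text{subject to } Ax = b,$$

with

$$\mathbb{I}_{\mathcal{X}_i}(x_i) = \begin{cases} 0, & \text{if } x_i \in \mathcal{X}_i, \\ +\infty, & \text{otherwise} \end{cases}$$

[36].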

Remark 1. Note that for an inequality constraint $Ax \preceq b$, one can formulate an equality constraint $Ax + y = b$ with $y \in \mathbb{R}^B_+$ being a slack variable. Then, the inequality-constrained problem can be equivalently written as

$$\text{(P1+)} \quad \min_{x \in \mathcal{X},\, y \in \mathbb{R}^B_+} \sum_{i \in \mathcal{V}} F_i(x_i) \quad \text{subject to } Ax + y = b.$$

To realize decentralized computations, $y$ can be decomposed and assigned to the agents. Hence, the structure of Problem (P1+) complies with that of Problem (P1).

Assumption 1. $\mathcal{G}$ is connected and undirected.

Assumption 2. $f_i : \mathbb{R}^M \to (-\infty, +\infty]$ and $g_i : \mathbb{R}^M \to (-\infty, +\infty]$ are both proper, convex, and closed extended real-valued functions. In addition, $f_i$ is differentiable and $\sigma_i$-strongly convex, $\sigma_i > 0$, $i \in \mathcal{V}$.

Assumptions similar to Assumption 2 can be found in [4, 20, 37–41].

Assumption 3. $\mathcal{X}_i$ is non-empty, convex and closed, $i \in \mathcal{V}$; there exists an $\breve{x} \in \mathrm{relint}\,\mathcal{X}$ such that $A\breve{x} = b$.

Remark 2. By Assumption 3, $\mathbb{I}_{\mathcal{X}_i}$ is proper, convex, and closed [36], which complies with the assumption on $g_i$. Therefore, it is also feasible to omit the discussion on $\mathbb{I}_{\mathcal{X}_i}$ in Problem (P2) as in [4, 20, 41]. In this work, we highlight the existence of $\mathcal{X}_i$ for more detailed discussions.

IV. DISTRIBUTED DUAL PROXIMAL GRADIENT ALGORITHM DEVELOPMENT

In this section, we propose a fully distributed DDPG algorithm for solving the problem of interest.

A. Dual Problem

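The dual problem of Problem (P2) is formulated in this subsection. By decoupling the objective function of Problem (P2), we have

$$\text{(P3)} \quad \min_{x, z} \sum_{i \in \mathcal{V}} \big(f_i(x_i) + (g_i + \mathbb{I}_{\mathcal{X}_i})(z_i)\big) \quad \text{subject to } Ax = b, \; x_i = z_i, \; \forall i \in \mathcal{V},$$

where $z = [z_1^\top, ..., z_N^\top]^\top \in \mathbb{R}^{NM}$ with $z_i \in \mathbb{R}^M$ a slack vector. The Lagrangian function of Problem (P3) can be given by

$$\begin{aligned} \mathcal{L}(x, z, \theta, \mu) &= \sum_{i \in \mathcal{V}} \big(f_i(x_i) + (g_i + \mathbb{I}_{\mathcal{X}_i})(z_i) + \mu_i^\top (x_i - z_i)\big) + \theta^\top (Ax - b) \\ &= \sum_{i \in \mathcal{V}} \big(f_i(x_i) + x_i^\top (A_i^\top \theta + \mu_i) + (g_i + \mathbb{I}_{\mathcal{X}_i})(z_i) - z_i^\top \mu_i\big) - b^\top \theta, \end{aligned} \tag{5}$$

where $\mu_i \in \mathbb{R}^M$ and $\theta \in \mathbb{R}^B$ are the Lagrangian multiplier vectors associated with constraints $x_i = z_i$ and $Ax = b$, respectively, $\mu = [\mu_1^\top, ..., \mu_N^\top]^\top \in \mathbb{R}^{NM}$, and $A_i \in \mathbb{R}^{B \times M}$ is the $i$th column sub-block of $A$ with $A = [A_1, ..., A_i, ..., A_N]$.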

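Then, the dual function can be obtained by minimizing $\mathcal{L}(x, z, \theta, \mu)$ with respect to $(x, z)$, which is

$$\begin{aligned} D(\theta, \mu) &= \min_{x, z} \sum_{i \in \mathcal{V}} \big(f_i(x_i) + x_i^\top (A_i^\top \theta + \mu_i) + (g_i + \mathbb{I}_{\mathcal{X}_i})(z_i) - z_i^\top \mu_i\big) - b^\top \theta \\ &= \min_{x, z} \sum_{i \in \mathcal{V}} \big(f_i(x_i) - x_i^\top H_i \lambda_i + (g_i + \mathbb{I}_{\mathcal{X}_i})(z_i) - z_i^\top \mathbf{F} \lambda_i - \kappa_i \mathbf{E} \lambda_i\big) \\ &= \sum_{i \in \mathcal{V}} \big(-f_i^\diamond(H_i \lambda_i) - \kappa_i \mathbf{E} \lambda_i - (g_i + \mathbb{I}_{\mathcal{X}_i})^\diamond(\mathbf{F} \lambda_i)\big), \end{aligned} \tag{6}$$

where $H_i = [-A_i^\top, -I_M] \in \mathbb{R}^{M \times (M+B)}$, $\lambda_i = [\theta^\top, \mu_i^\top]^\top \in \mathbb{R}^{M+B}$, $\mathbf{F} = [O_{M \times B}, I_M] \in \mathbb{R}^{M \times (M+B)}$, $\mathbf{E} = [b^\top, \mathbf{0}_M^\top] \in \mathbb{R}^{1 \times (M+B)}$, and $\sum_{i \in \mathcal{V}} \kappa_i = 1$. The last equality in (6) employs the definition of the Fenchel conjugate, and $(g_i + \mathbb{I}_{\mathcal{X}_i})^\diamond$ denotes the Fenchel conjugate of $g_i + \mathbb{I}_{\mathcal{X}_i}$. Hence, the dual problem of Problem (P3) can be formulated as

$$\text{(P4)} \quad \min_{\lambda} \; P(\lambda) + R(\lambda),$$

where $\lambda = [\lambda_1^\top, ..., \lambda_N^\top]^\top \in \mathbb{R}^{NB+NM}$, $P(\lambda) = \sum_{i \in \mathcal{V}} \big(f_i^\diamond(H_i \lambda_i) + \kappa_i \mathbf{E} \lambda_i\big)$ and $R(\lambda) = \sum_{i \in \mathcal{V}} (g_i + \mathbb{I}_{\mathcal{X}_i})^\diamond(\mathbf{F} \lambda_i)$.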
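As a worked instance of (P4), assume (purely for illustration, this instance is not taken from the paper) that $f_i(x_i) = \frac{\sigma_i}{2}\|x_i\|^2$, $g_i(x_i) = \|x_i\|_1$ and $\mathcal{X}_i = \mathbb{R}^M$, so that the indicator term vanishes. Both conjugates are then available in closed form, and the two parts of the dual objective read as follows, with $H_i$, $\mathbf{F}$ and $\mathbf{E}$ the matrices defined after (6):

```latex
\begin{aligned}
f_i^{\diamond}(y) &= \sup_{u}\Big\{y^{\top}u - \tfrac{\sigma_i}{2}\|u\|^2\Big\}
  = \tfrac{1}{2\sigma_i}\|y\|^2,
  \qquad \nabla f_i^{\diamond}(y) = \tfrac{1}{\sigma_i}\,y, \\
g_i^{\diamond}(y) &= \sup_{u}\big\{y^{\top}u - \|u\|_1\big\}
  = \begin{cases} 0, & \|y\|_{\infty}\le 1, \\ +\infty, & \text{otherwise}, \end{cases} \\
P(\lambda) &= \sum_{i\in\mathcal{V}}\Big(\tfrac{1}{2\sigma_i}\|H_i\lambda_i\|^2
  + \kappa_i \mathbf{E}\lambda_i\Big),
  \qquad
R(\lambda) = \sum_{i\in\mathcal{V}} g_i^{\diamond}(\mathbf{F}\lambda_i).
\end{aligned}
```

In this instance the smooth part of the dual is a quadratic in $\lambda_i$, and the non-smooth part simply restricts each $\mu_i = \mathbf{F}\lambda_i$ to the unit $l_\infty$-ball.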


B. Distributed Dual Proximal Gradient Algorithm

In this subsection, we aim to solve Problem (P4) in a distributed manner based on proximal gradient method.

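In Problem (P4), the variables of $f_i^\diamond(H_i \lambda_i)$ are coupled in terms of the common component $\theta$ in $\lambda_i$, but those of $(g_i + \mathbb{I}_{\mathcal{X}_i})^\diamond(\mathbf{F} \lambda_i)$ are decoupled since $\mathbf{F} \lambda_i = \mu_i$. In the following, with a slight abuse of notation, we redefine $\lambda_i = [\theta_i^\top, \mu_i^\top]^\top \in \mathbb{R}^{M+B}$, where $\theta_i$ is the local estimate of the common $\theta$. Then, Problem (P4) can be equivalently rewritten as

$$\text{(P5)} \quad \min_{\lambda} \; P(\lambda) + R(\lambda) \quad \text{subject to } K \lambda_j = K \lambda_l, \; \forall (j, l) \in \mathcal{E}, \tag{7}$$

where $K = [I_B, O_{B \times M}]$. Constraint (7) ensures the partial consistency among $\lambda_i$ in terms of component $\theta_i$, i.e., $\theta_i = K \lambda_i$. Let $\lambda^* = [(\lambda_1^*)^\top, ..., (\lambda_N^*)^\top]^\top$ be the optimal solution to Problem (P5) with $\lambda_i^* = [(\theta_i^*)^\top, (\mu_i^*)^\top]^\top$.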

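In the following, we assume that the range of $\theta_i^*$ is estimable, i.e., $\theta_i^* \in \mathcal{S}_i$ with $\mathcal{S}_i \subset \mathbb{R}^B$ being the estimated non-empty, convex and compact zone. For the convenience of the following discussion, we define $\Gamma = \max_{i \in \mathcal{V}} \sup_{\theta_i \in \mathcal{S}_i} \|\theta_i\|$.

Note that considering constraint $\theta_i \in \mathcal{S}_i$ is equivalent to accommodating an indicator function $\mathbb{I}_{\mathcal{S}_i}(\theta_i)$ into the non-smooth part [36]. Then, Problem (P5) can be modified into

$$\text{(P6)} \quad \min_{\lambda} \; \Phi(\lambda) \quad \text{subject to } K \lambda_j = K \lambda_l, \; \forall (j, l) \in \mathcal{E}, \tag{8}$$

where $\Phi(\lambda) = P(\lambda) + Q(\lambda)$, $P(\lambda) = \sum_{i \in \mathcal{V}} p_i(\lambda_i)$, $Q(\lambda) = \sum_{i \in \mathcal{V}} q_i(\lambda_i)$, $p_i(\lambda_i) = f_i^\diamond(H_i \lambda_i) + \kappa_i \mathbf{E} \lambda_i$, and $q_i(\lambda_i) = (g_i + \mathbb{I}_{\mathcal{X}_i})^\diamond(\mu_i) + \mathbb{I}_{\mathcal{S}_i}(\theta_i) = (g_i + \mathbb{I}_{\mathcal{X}_i})^\diamond(\mathbf{F} \lambda_i) + \mathbb{I}_{\mathcal{S}_i}(K \lambda_i)$. Note that (8) can be represented by a compact equation $\mathbf{M} \lambda = \mathbf{0}$, where $\mathbf{M} = L \otimes K \in \mathbb{R}^{NB \times (NB+NM)}$. It can be checked that $\mathbf{M} \lambda = \mathbf{L} \hat{\theta}$, where $\mathbf{L} = L \otimes I_B \in \mathbb{R}^{NB \times NB}$ is an augmented Laplacian matrix of $\mathcal{G}$ and $\hat{\theta} = [\theta_1^\top, ..., \theta_N^\top]^\top \in \mathbb{R}^{NB}$.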
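The compact form of the consensus constraint can be verified numerically. The following sketch builds the Kronecker-product constraint matrix for a small path graph and checks that applying it to the stacked dual variable equals applying the augmented Laplacian to the stacked local estimates of $\theta$. The dimensions and variable names (`M_mat`, `M_dim`, `theta_hat`) are assumptions for illustration.

```python
import numpy as np

N, B, M_dim = 3, 2, 4                      # agents, dim of theta_i, dim of mu_i
L = np.array([[1.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 1.0]])           # Laplacian of the path graph 1 - 2 - 3
K = np.hstack([np.eye(B), np.zeros((B, M_dim))])   # K = [I_B, O_{B x M}]
M_mat = np.kron(L, K)                      # constraint matrix L (x) K

rng = np.random.default_rng(1)
lam = rng.normal(size=N * (B + M_dim))     # stacked lambda_i = [theta_i; mu_i]
theta_hat = lam.reshape(N, B + M_dim)[:, :B].reshape(-1)

lhs = M_mat @ lam
rhs = np.kron(L, np.eye(B)) @ theta_hat    # augmented Laplacian times theta_hat

# A stacked variable whose theta-blocks all agree satisfies the constraint
lam_consensus = np.tile(np.concatenate([np.ones(B), rng.normal(size=M_dim)]), N)
residual = M_mat @ lam_consensus
```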

Then, the Lagrangian function of Problem (P6) can be given by

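$$\mathcal{L}(\lambda, \xi) = P(\lambda) + Q(\lambda) + \xi^\top \mathbf{M} \lambda, \tag{9}$$

where $\xi = [\xi_1^\top, ..., \xi_N^\top]^\top \in \mathbb{R}^{NB}$ is the collection of Lagrangian multipliers. Let $\mathcal{C}$ be the set of the saddle points of $\mathcal{L}(\lambda, \xi)$. Then, any saddle point $(\lambda^*, \xi^*) \in \mathcal{C}$ satisfies [42]

$$\mathcal{L}(\lambda, \xi^*) \geq \mathcal{L}(\lambda^*, \xi^*) \geq \mathcal{L}(\lambda^*, \xi), \tag{10}$$

$\forall \lambda \in \mathbb{R}^{NB+NM}$, $\forall \xi \in \mathbb{R}^{NB}$. We aim to seek a saddle point of $\mathcal{L}(\lambda, \xi)$, which can be characterized by the Karush-Kuhn-Tucker (KKT) conditions [43]

$$\mathbf{0} \in \nabla_\lambda P(\lambda^*) + \partial_\lambda Q(\lambda^*) + \mathbf{M}^\top \xi^*, \tag{11}$$

$$\mathbf{M} \lambda^* = \mathbf{0}. \tag{12}$$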

Based on the previous discussion, the DDPG algorithm for solving Problem (P6) is designed as

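$$\lambda(t+1) = \mathrm{prox}^{D^{M+B}_{\mathcal{V}}[c_l]}_{Q}\Big[\lambda(t) - D^{M+B}_{\mathcal{V}}[c_l]\big(\nabla_\lambda P(\lambda(t)) + \mathbf{M}^\top \xi(t)\big)\Big], \tag{13}$$

$$\xi(t+1) = \xi(t) + D^{B}_{\mathcal{V}}[\gamma_l]\, \mathbf{M} \lambda(t+1), \tag{14}$$

which means

$$\begin{aligned} \lambda_i(t+1) &= \mathrm{prox}^{c_i}_{q_i}\Big[\lambda_i(t) - c_i\big(\nabla_{\lambda_i} p_i(\lambda_i(t)) + \mathbf{M}_i^\top \xi(t)\big)\Big] \\ &= \mathrm{prox}^{c_i}_{q_i}\Big[\lambda_i(t) - c_i\Big(\nabla_{\lambda_i} p_i(\lambda_i(t)) + \sum_{j \in \mathcal{V}_i} K^\top (\xi_i(t) - \xi_j(t))\Big)\Big], \end{aligned} \tag{15}$$

$$\begin{aligned} \xi_i(t+1) &= \xi_i(t) + \gamma_i \mathbf{M}_i^\sharp \lambda(t+1) \\ &= \xi_i(t) + \gamma_i \sum_{j \in \mathcal{V}_i} K (\lambda_i(t+1) - \lambda_j(t+1)), \end{aligned} \tag{16}$$

due to the separability of $P$ and $Q$, $\forall i \in \mathcal{V}$, $t \in \mathbb{N}$. $\mathbf{M}_i \in \mathbb{R}^{NB \times (B+M)}$ and $\mathbf{M}_i^\sharp \in \mathbb{R}^{B \times (NB+NM)}$ are the $i$th column and $i$th row sub-blocks of $\mathbf{M}$, respectively, i.e., $\mathbf{M} = [\mathbf{M}_1, ..., \mathbf{M}_i, ..., \mathbf{M}_N] = [(\mathbf{M}_1^\sharp)^\top, ..., (\mathbf{M}_i^\sharp)^\top, ..., (\mathbf{M}_N^\sharp)^\top]^\top$. $c_i, \gamma_i > 0$ are step-sizes.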

Remark 3. The estimated $\mathcal{S}_i$ bounds the range of $\theta_i$, which, as we will see later, facilitates the convergence analysis of the DDPG algorithm. A similar treatment can be found in [24]. In practice, the estimation of $\mathcal{S}_i$ relies on experience with the specific problem. For example, in some social welfare optimization problems in the electricity market, the optimal dual variables can be the settled energy prices [44], whose range can be easily estimated from historical prices.

Remark 4. From (15) and (16), it can be seen that each agent only needs the information of its neighbours and updates with locally determined step-size (K contains the dimension information of primal and dual variables without other shared global information), which results in a fully distributed computation fashion of the DDPG algorithm.

The detailed computation procedure of DDPG algorithm is stated in Algorithm 1.

Algorithm 1 Distributed Dual Proximal Gradient Algorithm

1: Initialize $\lambda(0)$, $\xi(0)$. Determine step-sizes $c_i, \gamma_i > 0$, $\forall i \in \mathcal{V}$.
2: for $t = 0, 1, 2, ...$ do
3:     for $i = 1, 2, ..., N$ do (in parallel)
4:         Update $\lambda_i$ by (15).
5:         Update $\xi_i$ by (16).
6:     end for
7:     Obtain an output $(\lambda^{out}, \xi^{out})$ under certain convergence criterion.
8: end for
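The procedure in Algorithm 1 can be sketched on a toy instance. Everything problem-specific below is an assumption made for illustration, not the paper's setup: three agents on a path graph, scalar variables ($M = B = 1$), quadratic costs $f_i(x_i) = \frac{a_i}{2}(x_i - r_i)^2$, no non-smooth term ($g_i = 0$, unconstrained $x_i$), a single coupling constraint $x_1 + x_2 + x_3 = b$, and hand-picked step-sizes. The hypothetical hook `prox_primal` marks where the proximal mapping of the non-smooth primal part would enter; in this toy case it is the identity, so every $\mu_i$ stays at zero. Note that all agents finish their $\lambda_i$ update before any $\xi_i$ update, since (16) uses the neighbours' new values.

```python
import numpy as np

# Toy data (illustrative assumptions, not from the paper)
a = np.array([1.0, 2.0, 1.0])      # f_i(x) = a_i/2 * (x - r_i)^2, so sigma_i = a_i
r = np.array([1.0, 2.0, 3.0])
b = 3.0                            # coupling constraint x_1 + x_2 + x_3 = b (A_i = 1)
neighbours = {0: [1], 1: [0, 2], 2: [1]}   # path graph, 0-based indices
N = len(a)
kappa = np.full(N, 1.0 / N)        # any weights summing to one
Gamma = 10.0                       # assumed zone S_i = [-Gamma, Gamma]

c = a / 2.0                        # c_i = 1/h_i with h_i = ||H_i||^2 / sigma_i = 2 / a_i
gamma = np.full(N, 0.05)           # small hand-picked dual step-sizes

def grad_f_conj(i, y):
    """Gradient of the conjugate of f_i, i.e. argmax_x (y * x - f_i(x))."""
    return r[i] + y / a[i]

def prox_primal(i, v, alpha):
    """Prox of g_i + indicator(X_i); identity here since g_i = 0 and X_i = R."""
    return v

theta, mu, xi = np.zeros(N), np.zeros(N), np.zeros(N)
for t in range(3000):
    # H_i lambda_i = -theta_i - mu_i when A_i = 1
    x = np.array([grad_f_conj(i, -theta[i] - mu[i]) for i in range(N)])
    theta_new, mu_new = np.empty(N), np.empty(N)
    for i in range(N):
        grad_theta = -x[i] + kappa[i] * b          # d p_i / d theta_i
        grad_mu = -x[i]                            # d p_i / d mu_i
        disagreement = sum(xi[i] - xi[j] for j in neighbours[i])
        varrho = theta[i] - c[i] * (grad_theta + disagreement)
        rho = mu[i] - c[i] * grad_mu
        theta_new[i] = np.clip(varrho, -Gamma, Gamma)       # projection onto S_i
        # Moreau decomposition: prox of the conjugate via the primal prox
        mu_new[i] = rho - c[i] * prox_primal(i, rho / c[i], 1.0 / c[i])
    theta, mu = theta_new, mu_new
    # xi-update with the freshly updated neighbour estimates
    xi = np.array([xi[i] + gamma[i] * sum(theta[i] - theta[j] for j in neighbours[i])
                   for i in range(N)])

# Primal recovery from the dual iterate
x = np.array([grad_f_conj(i, -theta[i] - mu[i]) for i in range(N)])
theta_closed_form = (r.sum() - b) / (1.0 / a).sum()   # multiplier of this toy problem
```

For this smooth toy instance the local estimates agree on the closed-form multiplier and the recovered $x$ satisfies the coupling constraint. With non-trivial $g_i$ or $\mathcal{X}_i$, only `prox_primal` changes.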

C. Computational Complexity of DDPG Algorithm

To apply (15), one needs to compute (i) $\nabla p_i$ and (ii) the proximal mapping of $q_i$, $i \in \mathcal{V}$. For (i), $\nabla p_i$ can be efficiently obtained given that $f_i$ is simple-structured and, consequently, $\nabla f_i^\diamond$ can be analytically derived, e.g., $f_i$ is a quadratic function [36, Sec. 3.3.1]. For (ii), some feasible methods for different cases are introduced as follows.


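1) Case 1: If the proximal mapping of $g_i + \mathbb{I}_{\mathcal{X}_i}$ can be easily obtained,³ by $q_i(\lambda_i) = (g_i + \mathbb{I}_{\mathcal{X}_i})^\diamond(\mu_i) + \mathbb{I}_{\mathcal{S}_i}(\theta_i)$, we have $\mathrm{prox}^{c_i}_{q_i} = \mathrm{prox}^{c_i}_{\mathbb{I}_{\mathcal{S}_i}} \times \mathrm{prox}^{c_i}_{(g_i + \mathbb{I}_{\mathcal{X}_i})^\diamond}$ [35, Thm. 6.6], where $\mathrm{prox}^{c_i}_{\mathbb{I}_{\mathcal{S}_i}}$ is a Euclidean projection onto $\mathcal{S}_i$ [9, Sec. 1.2] and $\mathrm{prox}^{c_i}_{(g_i + \mathbb{I}_{\mathcal{X}_i})^\diamond}$ can be obtained by calculating $\mathrm{prox}^{1/c_i}_{g_i + \mathbb{I}_{\mathcal{X}_i}}$ with Lemma 1. Then, (15) can be modified into

$$\varrho_i(t) = \theta_i(t) - c_i\Big(\nabla_{\theta_i} p_i(\lambda_i(t)) + \sum_{j \in \mathcal{V}_i} (\xi_i(t) - \xi_j(t))\Big), \tag{17}$$

$$\rho_i(t) = \mu_i(t) - c_i \nabla_{\mu_i} p_i(\lambda_i(t)), \tag{18}$$

$$\theta_i(t+1) = \mathrm{prox}^{c_i}_{\mathbb{I}_{\mathcal{S}_i}}[\varrho_i(t)] = \Pi_{\mathcal{S}_i}[\varrho_i(t)], \tag{19}$$

$$\mu_i(t+1) = \mathrm{prox}^{c_i}_{q_{1,i}}[\rho_i(t)] = \rho_i(t) - c_i\, \mathrm{prox}^{1/c_i}_{q_{1,i}^\diamond}\Big[\frac{\rho_i(t)}{c_i}\Big], \tag{20}$$

where $q_{1,i} = (g_i + \mathbb{I}_{\mathcal{X}_i})^\diamond$, $q_{1,i}^\diamond = (g_i + \mathbb{I}_{\mathcal{X}_i})^{\diamond\diamond} = g_i + \mathbb{I}_{\mathcal{X}_i}$ due to the convexity and lower semi-continuity of $g_i + \mathbb{I}_{\mathcal{X}_i}$, $(g_i + \mathbb{I}_{\mathcal{X}_i})^{\diamond\diamond}$ is the biconjugate of $g_i + \mathbb{I}_{\mathcal{X}_i}$ [36, Sec. 3.3.2], and $\Pi_{\mathcal{S}_i}[\cdot]$ is a Euclidean projection onto $\mathcal{S}_i$.⁴ Essentially, (17)-(20) are obtained by decomposing $\lambda_i(t+1)$ and using the above-mentioned properties. With the above arrangement, the calculation of the proximal mapping of $(g_i + \mathbb{I}_{\mathcal{X}_i})^\diamond$ can be avoided as shown in (20), leading to the reduction of the computational complexity if the proximal mapping of $g_i + \mathbb{I}_{\mathcal{X}_i}$ is easier to obtain. For instance, in a Lasso problem with penalty $g_i(x_i) = \|x_i\|_1$ and $\mathcal{X}_i = \mathbb{R}^M$, the proximal mapping of the $l_1$-norm is a soft thresholding operator with analytical solution [35, Sec. 6.3].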
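For the Lasso instance mentioned above, update (20) can be written out explicitly. The sketch below assumes the penalty is the plain $l_1$-norm with no local constraint; the function names and test vector are hypothetical. It shows that the update built from soft thresholding coincides with clipping to the unit $l_\infty$-ball, which is the projection one would obtain by working with the conjugate directly.

```python
import numpy as np

def soft_threshold(v, tau):
    """Prox of the l1-norm with parameter tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def mu_update_lasso(rho, c):
    """Update (20) when g_i is the l1-norm and X_i is the whole space.

    Only the prox of the primal penalty (soft thresholding) is used;
    the prox of its conjugate is never formed explicitly.
    """
    return rho - c * soft_threshold(rho / c, 1.0 / c)

rho = np.array([2.5, -0.3, 0.9, -4.0])
mu_next = mu_update_lasso(rho, c=0.4)
mu_direct = np.clip(rho, -1.0, 1.0)   # projection onto the unit l_inf ball
```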

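2) Case 2: Take advantage of the structure of $g_i$ in some specific problems. For example, consider a regularization problem, where the penalty is a Euclidean $e$-norm: $g_i(x_i) = \|x_i\|_e$, $\mathcal{X}_i = \mathbb{R}^M$. Then, we can have

$$\begin{aligned} q_i(\lambda_i) &= g_i^\diamond(\mu_i) + \mathbb{I}_{\mathcal{S}_i}(\theta_i) \\ &= \mathbb{I}_{\mathcal{W}_i}(\mu_i) + \mathbb{I}_{\mathcal{S}_i}(\theta_i) \\ &= \begin{cases} 0, & \text{if } \mu_i \in \mathcal{W}_i \text{ and } \theta_i \in \mathcal{S}_i, \\ +\infty, & \text{otherwise}, \end{cases} \\ &= \begin{cases} 0, & \text{if } \lambda_i \in \mathcal{Y}_i, \\ +\infty, & \text{otherwise}, \end{cases} \\ &= \mathbb{I}_{\mathcal{Y}_i}(\lambda_i), \end{aligned} \tag{21}$$

where $\mathcal{W}_i = \{v \in \mathbb{R}^M \mid \|v\|_{e*} \leq 1\}$ (convex zone) with $\|\cdot\|_{e*}$ the dual norm of $\|\cdot\|_e$, and $\mathcal{Y}_i = \mathcal{S}_i \times \mathcal{W}_i$ (convex zone). The second equality holds by computing the conjugate of a norm [36, Sec. 3.3.1]. Then, the proximal mapping of $q_i$ is a Euclidean projection onto $\mathcal{Y}_i$ [9, Sec. 1.2].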
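As a minimal sketch of Case 2, assume the penalty is the $l_2$-norm (which is its own dual norm) and that the estimated zone for $\theta_i$ is a Euclidean ball of radius $\Gamma$. Both choices, and the function names, are assumptions made for illustration. The proximal mapping of $q_i$ then projects the two blocks of $\lambda_i$ separately.

```python
import numpy as np

def project_l2_ball(v, radius):
    """Euclidean projection onto the ball {u : ||u|| <= radius}."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

def prox_q_case2(lam, B, Gamma):
    """Projection of lambda_i = [theta_i; mu_i] onto S_i x W_i.

    S_i: ball of radius Gamma (assumed shape of the estimated zone);
    W_i: unit ball of the dual norm, here the l2 unit ball.
    """
    theta, mu = lam[:B], lam[B:]
    return np.concatenate([project_l2_ball(theta, Gamma),
                           project_l2_ball(mu, 1.0)])

lam = np.array([3.0, 4.0, 3.0, 4.0])          # B = 2, M = 2
lam_proj = prox_q_case2(lam, B=2, Gamma=10.0)
```

Because the set is a Cartesian product, the projection costs no more than two norm evaluations per agent and iteration.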

³ This assumption is based on $g_i + \mathbb{I}_{\mathcal{X}_i}$ having a simple structure, which is often the basic assumption in the works on proximal gradient methods. See some frequently used formulas in [35, Sec. 6.3] and applications in [9, Sec. 7].

⁴ $\varrho_i$ and $\rho_i$ are intermediate variables. (17) and (18) can be included in (19) and (20), respectively, to generate a one-step formula for $\theta_i$ and $\mu_i$.

3) Case 3: If $q_i$ has a complicated structure, as a general method, we can construct a strongly convex non-smooth $g_i$ (e.g., shift a strongly convex component of the smooth part to $g_i$). Then, rewrite (15) by the definition of proximal mapping, which gives

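$$\lambda_i(t+1) = \arg\min_{\lambda_i} \Big(q_i(\lambda_i) + \frac{1}{2c_i}\Big\|\lambda_i - \lambda_i(t) + c_i\Big(\nabla_{\lambda_i} p_i(\lambda_i(t)) + \sum_{j \in \mathcal{V}_i} K^\top (\xi_i(t) - \xi_j(t))\Big)\Big\|^2\Big). \tag{22}$$

To solve (22), one can utilize a subgradient descent method by computing

$$\begin{aligned} \partial_{\lambda_i} q_i(\lambda_i) &= \nabla_{\lambda_i} (g_i + \mathbb{I}_{\mathcal{X}_i})^\diamond(\mathbf{F} \lambda_i) + \partial_{\lambda_i} \mathbb{I}_{\mathcal{S}_i}(K \lambda_i) \\ &= \mathbf{F}^\top \nabla_{\mathbf{F} \lambda_i} (g_i + \mathbb{I}_{\mathcal{X}_i})^\diamond(\mathbf{F} \lambda_i) + K^\top \partial_{K \lambda_i} \mathbb{I}_{\mathcal{S}_i}(K \lambda_i), \end{aligned} \tag{23}$$

where $\nabla_{\mathbf{F} \lambda_i} (g_i + \mathbb{I}_{\mathcal{X}_i})^\diamond(\mathbf{F} \lambda_i) = \arg\max_u \big((\mathbf{F} \lambda_i)^\top u - (g_i + \mathbb{I}_{\mathcal{X}_i})(u)\big)$ by Lemma 2.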

In Cases 1 and 2, each agent only needs to update $\lambda_i$ and $\xi_i$ with basic proximal mappings and simple iterations without any other costly computation on primal variables or other auxiliary variables as discussed in [22–31], which reduces the computational complexity. In Case 3, the updating of $\lambda_i$ requires an inner-loop optimization process to compute the subgradient of $q_i$, which can be completed only with local information.

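Remark 5. (Extension of assumption on $f_i$) In the case that the structure of $f_i$ is complicated (can be non-smooth but still strongly convex), (15) can also be implemented by computing

$$\begin{aligned} \nabla_{\lambda_i} p_i(\lambda_i) &= \nabla_{\lambda_i} f_i^\diamond(H_i \lambda_i) + \kappa_i \mathbf{E}^\top \\ &= H_i^\top \nabla_{H_i \lambda_i} f_i^\diamond(H_i \lambda_i) + \kappa_i \mathbf{E}^\top, \end{aligned} \tag{24}$$

where $\nabla_{H_i \lambda_i} f_i^\diamond(H_i \lambda_i) = \arg\max_u \big((H_i \lambda_i)^\top u - f_i(u)\big)$ by Lemma 2. However, similar to Case 3, (24) requires a higher computational complexity since an inner-loop optimization process for computing $\nabla f_i^\diamond$ is involved.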

V. CONVERGENCE ANALYSIS

The convergence analysis of the proposed DDPG algorithm is conducted in this section.

Lemma 3. With Assumption 2, the Lipschitz constant of $\nabla_\lambda P(\lambda)$ is given by $h = \sqrt{\sum_{i \in \mathcal{V}} h_i^2}$, where $h_i = \frac{\|H_i\|^2}{\sigma_i}$. See the proof in Appendix A.
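The constants in Lemma 3 are cheap to evaluate from local data. The sketch below computes them for two hypothetical agents with $B = 1$ and $M = 2$, using the spectral norm; the matrices, moduli and function name are illustrative assumptions. The resulting local constants also give the largest step-sizes $c_i = 1/h_i$ admitted by the convergence result.

```python
import numpy as np

def dual_lipschitz_constant(A_i, sigma_i):
    """h_i = ||H_i||^2 / sigma_i with H_i = [-A_i^T, -I_M] (spectral norm)."""
    M = A_i.shape[1]
    H_i = np.hstack([-A_i.T, -np.eye(M)])
    return np.linalg.norm(H_i, 2) ** 2 / sigma_i

# Hypothetical local constraint blocks (B = 1, M = 2) and convexity moduli
A_blocks = [np.array([[1.0, 2.0]]), np.array([[0.0, 1.0]])]
sigmas = [3.0, 2.0]

h = [dual_lipschitz_constant(A, s) for A, s in zip(A_blocks, sigmas)]
c_max = [1.0 / h_i for h_i in h]                 # largest admissible c_i
h_global = np.sqrt(sum(h_i ** 2 for h_i in h))   # constant h of Lemma 3
```

Since $H_i H_i^\top = A_i^\top A_i + I_M$, the squared spectral norm equals $1 + \|A_i\|^2$, so each agent needs only its own constraint block and strong convexity modulus.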

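Lemma 4. Suppose that Assumptions 1-3 hold. Based on Algorithm 1, for any $(\lambda^*, \xi^*) \in \mathcal{C}$ and $t \in \mathbb{N}$, we have

$$\Phi(\lambda(t+1)) - \Phi(\lambda^*) + (\xi^*)^\top \mathbf{M} \lambda(t+1) \in [0, \Psi_t],$$

where

$$\begin{aligned} \Psi_t ={} & \|\lambda^* - \lambda(t)\|^2_{D^{M+B}_{\mathcal{V}}[\frac{1}{2c_l}]} - \|\lambda^* - \lambda(t+1)\|^2_{D^{M+B}_{\mathcal{V}}[\frac{1}{2c_l}]} \\ & + \|\xi^* - \xi(t)\|^2_{D^{B}_{\mathcal{V}}[\frac{1}{2\gamma_l}]} - \|\xi^* - \xi(t+1)\|^2_{D^{B}_{\mathcal{V}}[\frac{1}{2\gamma_l}]} \\ & - \|\lambda(t) - \lambda(t+1)\|^2_{D^{M+B}_{\mathcal{V}}[\frac{1}{2c_l} - \frac{h_l}{2}]} - \|\xi(t) - \xi(t+1)\|^2_{D^{B}_{\mathcal{V}}[\frac{1}{2\gamma_l}]} \\ & + \|\mathbf{M} \lambda(t+1)\|^2_{D^{B}_{\mathcal{V}}[\gamma_l]}. \end{aligned} \tag{25}$$

See the proof in Appendix B.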

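Theorem 1. Suppose that Assumptions 1-3 hold. Let $c_i \leq \frac{1}{h_i}$, $\forall i \in \mathcal{V}$. By Algorithm 1, for any $(\lambda^*, \xi^*) \in \mathcal{C}$, we have

$$|\Phi(\bar{\lambda}(T+1)) - \Phi(\lambda^*)| \leq \frac{\Theta(c_1, ..., c_N, \gamma_1, ..., \gamma_N)}{T+1} + O(\gamma_{\max}), \tag{26}$$

$$\|\xi^*\| \|\mathbf{M} \bar{\lambda}(T+1)\| \leq \frac{\Theta(c_1, ..., c_N, \gamma_1, ..., \gamma_N)}{T+1} + O(\gamma_{\max}), \tag{27}$$

where $\Theta(c_1, ..., c_N, \gamma_1, ..., \gamma_N) = \|\lambda^* - \lambda(0)\|^2_{D^{M+B}_{\mathcal{V}}[\frac{1}{2c_l}]} + \|\xi^*\|^2_{D^{B}_{\mathcal{V}}[\frac{4}{\gamma_l}]} + \|\xi(0)\|^2_{D^{B}_{\mathcal{V}}[\frac{1}{\gamma_l}]}$, $\gamma_{\max} = \max_{l \in \mathcal{V}} \gamma_l$, $\bar{\lambda}(T+1) = \frac{1}{T+1} \sum_{t=0}^{T} \lambda(t+1)$, $O(\gamma_{\max}) = \gamma_{\max} N \|L\|^2 \Gamma^2$, $T \in \mathbb{N}_+$. See the proof in Appendix C.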

Remark 6. $O(\gamma_{\max})$ characterizes the upper bound of the stationary error of the algorithm. A larger estimated zone $\mathcal{S}_i$ may lead to a larger $\Gamma$, and hence a larger stationary error bound. Meanwhile, a smaller $\gamma_{\max}$ can reduce the stationary error bound but may sacrifice the convergence speed.

VI. NUMERICAL RESULT

In this section, we demonstrate the performance of the DDPG algorithm by solving a social welfare optimization problem in an electricity market.

A. Simulation Setup

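Define $\mathcal{V}_{UC}$ and $\mathcal{V}_{user}$ as the sets of utility companies (UCs) and users, respectively. Define $x = [x_1^{UC}, ..., x_{|\mathcal{V}_{UC}|}^{UC}, x_1^{user}, ..., x_{|\mathcal{V}_{user}|}^{user}]^\top$, where $x_i^{UC}$ is the energy generation quantity of UC $i$ and $x_j^{user}$ is the demand of user $j$. $\phi_i(x_i^{UC})$ and $\omega_j(x_j^{user})$ are the cost function of UC $i$ and the utility function of user $j$, respectively, $i \in \mathcal{V}_{UC}$, $j \in \mathcal{V}_{user}$. Then, the social welfare optimization problem of the market can be formulated as

$$\text{(P7)} \quad \min_{x} \sum_{i \in \mathcal{V}_{UC}} \phi_i(x_i^{UC}) - \sum_{j \in \mathcal{V}_{user}} \omega_j(x_j^{user})$$

$$\text{subject to} \quad \sum_{i \in \mathcal{V}_{UC}} x_i^{UC} = \sum_{j \in \mathcal{V}_{user}} x_j^{user}, \tag{28}$$

$$x_i^{UC} \in \mathcal{X}_i^{UC}, \; \forall i \in \mathcal{V}_{UC}, \tag{29}$$

$$x_j^{user} \in \mathcal{X}_j^{user}, \; \forall j \in \mathcal{V}_{user}, \tag{30}$$

where

$$\phi_i(x_i^{UC}) = \delta_i (x_i^{UC})^2 + \vartheta_i x_i^{UC} + \beta_i, \tag{31}$$

$$\omega_j(x_j^{user}) = \begin{cases} \tau_j x_j^{user} - \pi_j (x_j^{user})^2, & x_j^{user} \leq \frac{\tau_j}{2\pi_j}, \\ \frac{\tau_j^2}{4\pi_j}, & x_j^{user} > \frac{\tau_j}{2\pi_j}, \end{cases} \tag{32}$$

with $\delta_i, \vartheta_i, \beta_i, \tau_j, \pi_j$ being parameters, whose values are set in Table I [45], $\forall i \in \mathcal{V}_{UC}$, $\forall j \in \mathcal{V}_{user}$. (28) is the supply-demand balance constraint. $\mathcal{X}_i^{UC} = [0, x_{i,\max}^{UC}]$ and $\mathcal{X}_j^{user} = [0, x_{j,\max}^{user}]$ with $x_{i,\max}^{UC}, x_{j,\max}^{user} > 0$. Define $A = [\mathbf{1}^\top_{|\mathcal{V}_{UC}|}, -\mathbf{1}^\top_{|\mathcal{V}_{user}|}]$. Then, (28) is equivalent to $Ax = 0$.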

Fig. 1. Communication topology of the market (nodes: UC 1, UC 2, user 1, user 2, user 3).

TABLE I
PARAMETERS OF UCS AND ENERGY USERS

      |                 UCs                  |              Users
i/j   | δ_i      ϑ_i    β_i   x^UC_{i,max}   | τ_j     π_j      x^user_{j,max}
1     | 0.0031   8.71   0     150            | 17.17   0.0935   91.79
2     | 0.0074   3.53   0     150            | 12.28   0.0417   147.29
3     | -        -      -     -              | 18.42   0.1007   91.41

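Similar to the derivation procedure of (5), the Lagrangian function of Problem (P7) can be obtained as

$$\begin{aligned} \mathcal{L}(x, z, \theta, \mu) ={} & \sum_{i \in \mathcal{V}_{UC}} \big(\phi_i(x_i^{UC}) + \mathbb{I}_{\mathcal{X}_i^{UC}}(z_i^{UC})\big) + \sum_{j \in \mathcal{V}_{user}} \big(-\omega_j(x_j^{user}) + \mathbb{I}_{\mathcal{X}_j^{user}}(z_j^{user})\big) + \theta A x \\ & + \sum_{i \in \mathcal{V}_{UC}} \mu_i^{UC} (x_i^{UC} - z_i^{UC}) + \sum_{j \in \mathcal{V}_{user}} \mu_j^{user} (x_j^{user} - z_j^{user}), \end{aligned} \tag{33}$$

where $z = [z_1^{UC}, ..., z_{|\mathcal{V}_{UC}|}^{UC}, z_1^{user}, ..., z_{|\mathcal{V}_{user}|}^{user}]^\top$ is a slack vector, and $\theta$ and $\mu = [\mu_1^{UC}, ..., \mu_{|\mathcal{V}_{UC}|}^{UC}, \mu_1^{user}, ..., \mu_{|\mathcal{V}_{user}|}^{user}]^\top$ are dual vectors. Define $\hat{\theta} = [\theta_1^{UC}, ..., \theta_{|\mathcal{V}_{UC}|}^{UC}, \theta_1^{user}, ..., \theta_{|\mathcal{V}_{user}|}^{user}]^\top$, which contains the local estimates of $\theta$. Let $\xi = [\xi_1^{UC}, ..., \xi_{|\mathcal{V}_{UC}|}^{UC}, \xi_1^{user}, ..., \xi_{|\mathcal{V}_{user}|}^{user}]^\top$ be the Lagrangian multiplier vector associated with the constraint on $\hat{\theta}$ as indicated in (9). With some direct calculations, the optimal solution to Problem (P7) is $x^* = [0, 150, 48.5, 50.2, 51.3]^\top$.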
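The stated optimum can be reproduced from the parameter values of Table I with a centralized check, which is not the distributed algorithm itself. The sketch assumes the usual market-clearing interpretation: for a given price, each UC and each user solves its own one-dimensional problem over its box constraint, and the price is found by bisection so that supply equals demand. The function names, the bisection bracket and the identification of the price with the negative of the optimal multiplier of (28) are assumptions of this sketch.

```python
# Parameters copied from Table I
delta = [0.0031, 0.0074]
vartheta = [8.71, 3.53]
x_uc_max = [150.0, 150.0]
tau = [17.17, 12.28, 18.42]
pi_ = [0.0935, 0.0417, 0.1007]
x_user_max = [91.79, 147.29, 91.41]

def clip(v, lo, hi):
    return max(lo, min(hi, v))

def supply(price):
    # UC i minimizes its quadratic cost minus revenue price * x over [0, x_max]
    return [clip((price - vt) / (2.0 * d), 0.0, xm)
            for d, vt, xm in zip(delta, vartheta, x_uc_max)]

def demand(price):
    # User j maximizes its utility minus payment price * x over [0, x_max]
    return [clip((t - price) / (2.0 * p), 0.0, xm)
            for t, p, xm in zip(tau, pi_, x_user_max)]

def excess(price):
    return sum(supply(price)) - sum(demand(price))

# Bisection on the market-clearing price (excess supply is non-decreasing in price)
lo, hi = 0.0, 20.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if excess(mid) < 0.0:
        lo = mid
    else:
        hi = mid
price = 0.5 * (lo + hi)

x_star = supply(price) + demand(price)
theta_star = -price                      # multiplier of the balance constraint
# Multipliers of the active generation bounds, from stationarity of (33) in x
mu_uc = [-(2.0 * d * x + vt + theta_star)
         for d, vt, x in zip(delta, vartheta, supply(price))]
```

Under these assumptions the clearing price is close to 8.09, which rounds to the reported $x^*$ and matches the dual values reported in the simulation results.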

B. Simulation Result

To demonstrate the performance of Algorithm 1, we consider the communication topology shown in Fig. 1. The simulation result is shown in Figs. 2 to 4. Fig. 2 shows the dynamics of dual variables $\hat{\theta}$ and $\mu$. It can be seen that all the elements in $\hat{\theta}$ converge to $\theta^* = -8.1$ while $\mu$ converges to $\mu^* = [-0.61, 2.34, 0, 0, 0]^\top$. One can check that the optimal solution at the saddle point of $\mathcal{L}$ is $x^* = \arg\min_x \mathcal{L}(x, z, \theta^*, \mu^*) = [0, 150, 48.5, 50.2, 51.3]^\top$,⁵ which means the lower bound of $x_1^{UC}$ and the upper bound of $x_2^{UC}$ are activated, respectively, while other variables reach interior optimal solutions. Fig. 3 depicts the dynamics of $\xi$. Fig. 4 shows that the value of the dual function $\Phi(\lambda)$ (as defined in Problem (P6)) is decreased to around 756.53.

⁵ The minimization with respect to $x$ is independent of $z$ since $x$ and $z$ are decoupled in (33).

Fig. 2. Dynamics of $\hat{\theta}$ and $\mu$ (horizontal axis: iterations, 0 to 15000; vertical axis: value; curves for UC 1, UC 2, user 1, user 2, user 3).

Fig. 3. Dynamics of $\xi$ (horizontal axis: iterations, 0 to 15000; vertical axis: value; curves for UC 1, UC 2, user 1, user 2, user 3).

Fig. 4. Dynamics of $\Phi(\lambda)$ (horizontal axis: iterations, 0 to 15000; vertical axis: value).

VII. CONCLUSION

In this work, we considered solving a composite DOP with both local convex and coupling affine constraints. A fully distributed DDPG algorithm was proposed for solving this problem by resorting to the dual problem. As a distinct feature compared with the existing research works with similar problem setups, we showed that if the non-smooth parts of the objective functions have simple structures, one only needs to update dual variables by some simple operations, leading to the reduction of the overall computational complexity.

APPENDIX

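A. Proof of Lemma 3

By Lemma 2, $\nabla f_i^\diamond$ is $\frac{1}{\sigma_i}$-Lipschitz continuous, which means

$$\begin{aligned} \|\nabla_v f_i^\diamond(H_i v) - \nabla_u f_i^\diamond(H_i u)\| &= \|H_i^\top \nabla_{H_i v} f_i^\diamond(H_i v) - H_i^\top \nabla_{H_i u} f_i^\diamond(H_i u)\| \\ &\leq \|H_i^\top\| \|\nabla_{H_i v} f_i^\diamond(H_i v) - \nabla_{H_i u} f_i^\diamond(H_i u)\| \\ &\leq \frac{\|H_i^\top\|}{\sigma_i} \|H_i v - H_i u\| \\ &\leq \frac{\|H_i\|^2}{\sigma_i} \|v - u\| = h_i \|v - u\|, \end{aligned} \tag{34}$$

$\forall v, u \in \mathbb{R}^{M+B}$, which means $\nabla_{\lambda_i} f_i^\diamond(H_i \lambda_i)$ is $h_i$-Lipschitz continuous and, therefore, $\nabla_{\lambda_i} p_i(\lambda_i) = \nabla_{\lambda_i} f_i^\diamond(H_i \lambda_i) + \kappa_i \mathbf{E}^\top$ is also $h_i$-Lipschitz continuous.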

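On the other hand, due to the separability of $P(\lambda)$, $\nabla_\lambda P(\lambda)$ can be decoupled with respect to each $\lambda_i$, i.e.,

$$\nabla_\lambda P(\lambda) = \begin{bmatrix} \nabla_{\lambda_1} p_1(\lambda_1) \\ \vdots \\ \nabla_{\lambda_N} p_N(\lambda_N) \end{bmatrix}. \tag{35}$$

By using the Euclidean $l_2$-norm, the Lipschitz constant of $\nabla_\lambda P(\lambda)$ can be obtained as $h$.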

B. Proof of Lemma 4

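By the first-order optimality condition of (13) in terms of (2), we have

$$\begin{aligned} \mathbf{0} \in{} & \partial_\lambda Q(\lambda(t+1)) + D^{M+B}_{\mathcal{V}}[\tfrac{1}{c_l}](\lambda(t+1) - \lambda(t)) + \nabla_\lambda P(\lambda(t)) + \mathbf{M}^\top \xi(t) \\ ={} & \partial_\lambda Q(\lambda(t+1)) - D^{M+B}_{\mathcal{V}}[\tfrac{1}{c_l}](\lambda(t) - \lambda(t+1)) + \nabla_\lambda P(\lambda(t)) + \mathbf{M}^\top \xi(t+1) \\ & - \mathbf{M}^\top D^{B}_{\mathcal{V}}[\gamma_l] \mathbf{M} \lambda(t+1), \end{aligned} \tag{36}$$

where $D^{M+B}_{\mathcal{V}}[\frac{1}{c_l}] = (D^{M+B}_{\mathcal{V}}[c_l])^{-1}$. From the convexity of $Q(\lambda)$, we have

$$\begin{aligned} Q(\lambda) - Q(\lambda(t+1)) \geq{} & (\lambda - \lambda(t+1))^\top D^{M+B}_{\mathcal{V}}[\tfrac{1}{c_l}](\lambda(t) - \lambda(t+1)) \\ & - (\lambda - \lambda(t+1))^\top \nabla_\lambda P(\lambda(t)) \\ & - (\lambda - \lambda(t+1))^\top \mathbf{M}^\top \xi(t+1) \\ & + (\lambda - \lambda(t+1))^\top \mathbf{M}^\top D^{B}_{\mathcal{V}}[\gamma_l] \mathbf{M} \lambda(t+1). \end{aligned} \tag{37}$$

From the convexity and $h_i$-Lipschitz continuous differentiability of $p_i$, we have

$$\begin{aligned} (\lambda - \lambda(t+1))^\top \nabla_\lambda P(\lambda(t)) ={} & \sum_{i \in \mathcal{V}} (\lambda_i - \lambda_i(t))^\top \nabla_{\lambda_i} p_i(\lambda_i(t)) + \sum_{i \in \mathcal{V}} (\lambda_i(t) - \lambda_i(t+1))^\top \nabla_{\lambda_i} p_i(\lambda_i(t)) \\ \leq{} & \sum_{i \in \mathcal{V}} \big(p_i(\lambda_i) - p_i(\lambda_i(t))\big) + \sum_{i \in \mathcal{V}} \big(p_i(\lambda_i(t)) - p_i(\lambda_i(t+1))\big) + \|\lambda(t) - \lambda(t+1)\|^2_{D^{M+B}_{\mathcal{V}}[\frac{h_l}{2}]} \\ ={} & P(\lambda) - P(\lambda(t+1)) + \|\lambda(t) - \lambda(t+1)\|^2_{D^{M+B}_{\mathcal{V}}[\frac{h_l}{2}]}. \end{aligned} \tag{38}$$

By (14), we have

$$\mathbf{0} = D^{B}_{\mathcal{V}}[\tfrac{1}{\gamma_l}](\xi(t) - \xi(t+1)) + \mathbf{M} \lambda(t+1), \tag{39}$$

where $D^{B}_{\mathcal{V}}[\frac{1}{\gamma_l}] = (D^{B}_{\mathcal{V}}[\gamma_l])^{-1}$. Therefore, by multiplying both sides of (39) by $(\xi - \xi(t+1))^\top$, we have

$$(\xi - \xi(t+1))^\top D^{B}_{\mathcal{V}}[\tfrac{1}{\gamma_l}](\xi(t) - \xi(t+1)) + (\xi - \xi(t+1))^\top \mathbf{M} \lambda(t+1) = 0. \tag{40}$$

By adding (37) and (38) together from both sides, we have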

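$$\begin{aligned} & \Phi(\lambda(t+1)) - \Phi(\lambda) \\ \leq{} & -(\lambda - \lambda(t+1))^\top D^{M+B}_{\mathcal{V}}[\tfrac{1}{c_l}](\lambda(t) - \lambda(t+1)) + (\lambda - \lambda(t+1))^\top \mathbf{M}^\top \xi(t+1) \\ & + \|\lambda(t) - \lambda(t+1)\|^2_{D^{M+B}_{\mathcal{V}}[\frac{h_l}{2}]} - (\lambda - \lambda(t+1))^\top \mathbf{M}^\top D^{B}_{\mathcal{V}}[\gamma_l] \mathbf{M} \lambda(t+1) \\ ={} & -(\lambda - \lambda(t+1))^\top D^{M+B}_{\mathcal{V}}[\tfrac{1}{c_l}](\lambda(t) - \lambda(t+1)) - (\xi - \xi(t+1))^\top D^{B}_{\mathcal{V}}[\tfrac{1}{\gamma_l}](\xi(t) - \xi(t+1)) \\ & - (\xi - \xi(t+1))^\top \mathbf{M} \lambda(t+1) + (\xi(t+1))^\top \mathbf{M} \lambda - (\xi(t+1))^\top \mathbf{M} \lambda(t+1) \\ & + \|\lambda(t) - \lambda(t+1)\|^2_{D^{M+B}_{\mathcal{V}}[\frac{h_l}{2}]} - (\lambda - \lambda(t+1))^\top \mathbf{M}^\top D^{B}_{\mathcal{V}}[\gamma_l] \mathbf{M} \lambda(t+1) \\ ={} & \|\lambda - \lambda(t)\|^2_{D^{M+B}_{\mathcal{V}}[\frac{1}{2c_l}]} - \|\lambda - \lambda(t+1)\|^2_{D^{M+B}_{\mathcal{V}}[\frac{1}{2c_l}]} - \|\lambda(t) - \lambda(t+1)\|^2_{D^{M+B}_{\mathcal{V}}[\frac{1}{2c_l}]} \\ & + \|\xi - \xi(t)\|^2_{D^{B}_{\mathcal{V}}[\frac{1}{2\gamma_l}]} - \|\xi - \xi(t+1)\|^2_{D^{B}_{\mathcal{V}}[\frac{1}{2\gamma_l}]} - \|\xi(t) - \xi(t+1)\|^2_{D^{B}_{\mathcal{V}}[\frac{1}{2\gamma_l}]} \\ & + (\xi(t+1))^\top \mathbf{M} \lambda - \xi^\top \mathbf{M} \lambda(t+1) + \|\lambda(t) - \lambda(t+1)\|^2_{D^{M+B}_{\mathcal{V}}[\frac{h_l}{2}]} \\ & - (\lambda - \lambda(t+1))^\top \mathbf{M}^\top D^{B}_{\mathcal{V}}[\gamma_l] \mathbf{M} \lambda(t+1) \\ ={} & \|\lambda - \lambda(t)\|^2_{D^{M+B}_{\mathcal{V}}[\frac{1}{2c_l}]} - \|\lambda - \lambda(t+1)\|^2_{D^{M+B}_{\mathcal{V}}[\frac{1}{2c_l}]} - \|\lambda(t) - \lambda(t+1)\|^2_{D^{M+B}_{\mathcal{V}}[\frac{1}{2c_l} - \frac{h_l}{2}]} \\ & + \|\xi - \xi(t)\|^2_{D^{B}_{\mathcal{V}}[\frac{1}{2\gamma_l}]} - \|\xi - \xi(t+1)\|^2_{D^{B}_{\mathcal{V}}[\frac{1}{2\gamma_l}]} - \|\xi(t) - \xi(t+1)\|^2_{D^{B}_{\mathcal{V}}[\frac{1}{2\gamma_l}]} \\ & + (\xi(t+1))^\top \mathbf{M} \lambda - \xi^\top \mathbf{M} \lambda(t+1) + \|\mathbf{M} \lambda(t+1)\|^2_{D^{B}_{\mathcal{V}}[\gamma_l]} - (\mathbf{M} \lambda)^\top D^{B}_{\mathcal{V}}[\gamma_l] \mathbf{M} \lambda(t+1), \end{aligned} \tag{41}$$

where we use (40) in the first equality and the second equality holds with $v^\top u = \frac{1}{2}(\|v\|^2 + \|u\|^2 - \|v - u\|^2)$, $\forall v, u \in \mathbb{R}^{MN+BN}$.

Let $\xi = \xi^*$ and $\lambda = \lambda^*$ and rearrange (41), then we have

$$\begin{aligned} & \Phi(\lambda(t+1)) - \Phi(\lambda^*) + (\xi^*)^\top \mathbf{M} \lambda(t+1) \\ \leq{} & \|\lambda^* - \lambda(t)\|^2_{D^{M+B}_{\mathcal{V}}[\frac{1}{2c_l}]} - \|\lambda^* - \lambda(t+1)\|^2_{D^{M+B}_{\mathcal{V}}[\frac{1}{2c_l}]} \\ & + \|\xi^* - \xi(t)\|^2_{D^{B}_{\mathcal{V}}[\frac{1}{2\gamma_l}]} - \|\xi^* - \xi(t+1)\|^2_{D^{B}_{\mathcal{V}}[\frac{1}{2\gamma_l}]} \\ & - \|\lambda(t) - \lambda(t+1)\|^2_{D^{M+B}_{\mathcal{V}}[\frac{1}{2c_l} - \frac{h_l}{2}]} - \|\xi(t) - \xi(t+1)\|^2_{D^{B}_{\mathcal{V}}[\frac{1}{2\gamma_l}]} + \|\mathbf{M} \lambda(t+1)\|^2_{D^{B}_{\mathcal{V}}[\gamma_l]}, \end{aligned} \tag{42}$$

where KKT condition (12) is used. By combining (9), (10) and (12), $\forall \lambda \in \mathbb{R}^{NM+NB}$, we have

$$\Phi(\lambda) - \Phi(\lambda^*) + (\xi^*)^\top \mathbf{M} \lambda \geq 0. \tag{43}$$

Based on (42) and (43), the proof is completed.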

C. Proof of Theorem 1

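Note that (41) holds for all $\lambda \in \mathbb{R}^{NB+NM}$ and $\xi \in \mathbb{R}^{NB}$. The proof is conducted by discussing the following two scenarios.

1) Scenario 1: If $\mathbf{M} \bar{\lambda}(T+1) \neq \mathbf{0}$, by letting $\lambda = \lambda^*$ and $\xi = 2\|\xi^*\| \frac{\mathbf{M} \bar{\lambda}(T+1)}{\|\mathbf{M} \bar{\lambda}(T+1)\|}$ in (41), we have

$$\begin{aligned} & \Phi(\lambda(t+1)) - \Phi(\lambda^*) + 2\|\xi^*\| \frac{(\mathbf{M} \bar{\lambda}(T+1))^\top}{\|\mathbf{M} \bar{\lambda}(T+1)\|} \mathbf{M} \lambda(t+1) \\ \leq{} & \|\lambda^* - \lambda(t)\|^2_{D^{M+B}_{\mathcal{V}}[\frac{1}{2c_l}]} - \|\lambda^* - \lambda(t+1)\|^2_{D^{M+B}_{\mathcal{V}}[\frac{1}{2c_l}]} \\ & + \Big\|2\|\xi^*\| \frac{\mathbf{M} \bar{\lambda}(T+1)}{\|\mathbf{M} \bar{\lambda}(T+1)\|} - \xi(t)\Big\|^2_{D^{B}_{\mathcal{V}}[\frac{1}{2\gamma_l}]} - \Big\|2\|\xi^*\| \frac{\mathbf{M} \bar{\lambda}(T+1)}{\|\mathbf{M} \bar{\lambda}(T+1)\|} - \xi(t+1)\Big\|^2_{D^{B}_{\mathcal{V}}[\frac{1}{2\gamma_l}]} \\ & + \|\mathbf{M} \lambda(t+1)\|^2_{D^{B}_{\mathcal{V}}[\gamma_l]}, \end{aligned} \tag{44}$$

where $c_i \leq \frac{1}{h_i}$ is considered. Summing up (44) over $t = 0, 1, ..., T$ gives

$$\begin{aligned} & (T+1)\big(\Phi(\bar{\lambda}(T+1)) - \Phi(\lambda^*) + 2\|\xi^*\| \|\mathbf{M} \bar{\lambda}(T+1)\|\big) \\ \leq{} & \sum_{t=0}^{T} \big(\Phi(\lambda(t+1)) - \Phi(\lambda^*) + 2\|\xi^*\| \|\mathbf{M} \bar{\lambda}(T+1)\|\big) \\ \leq{} & \Big\|2\|\xi^*\| \frac{\mathbf{M} \bar{\lambda}(T+1)}{\|\mathbf{M} \bar{\lambda}(T+1)\|} - \xi(0)\Big\|^2_{D^{B}_{\mathcal{V}}[\frac{1}{2\gamma_l}]} + \|\lambda^* - \lambda(0)\|^2_{D^{M+B}_{\mathcal{V}}[\frac{1}{2c_l}]} + \sum_{t=0}^{T} \|\mathbf{M} \lambda(t+1)\|^2_{D^{B}_{\mathcal{V}}[\gamma_l]} \end{aligned}$$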