Distributed Fixed Point Method for Solving Systems of Linear Algebraic Equations

(1)

Distributed Fixed Point Method for Solving Systems of Linear Algebraic Equations

Duˇsan Jakovetić^∗, Nataˇsa Krejić^∗, Nataˇsa Krklec Jerinkić^∗, Greta Malaspina^∗, Alessandra Micheletti^†

Abstract

We present a class of iterative fully distributed fixed point methods to solve a system of linear equations, such that each agent in the network holds oneor severalof the equations of the system. Under a generic directed, strongly connected network, we prove a convergence result analogous to the one for fixed point methods in the classical, centralized, framework: the proposed method converges to the solution of the system of linear equations at a linear rate. We further explicitly quantify the rate in terms of the linear system and network parameters. Next, we show that the algorithm provably works under time-varying directed networks provided that the underlying graph is connected over bounded iteration intervals, and we establish a linear convergence rate for this setting as well. A set of numerical results is presented, demonstrating practical benefits of the method over exist- ing alternatives.

Key words: distributed optimization; systems of linear equations; fixed point methods; consensus; kriging.

1 Introduction

{derivation}

The problem we consider is

Ay = b (1) {eq:ls}{eq:ls}

∗Department of Mathematics and Informatics, Faculty of Sciences, Uni- versity of Novi Sad, Trg Dositeja Obradovi´ca 4, 21000 Novi Sad, Serbia, Email: [email protected], [email protected], [email protected], [email protected].

†Department of Environmental Science and Policy - ESP, Universit`a degli Studi di Milano, Via Saldini 50, 20133 Milano, Italy, Email: [email protected]

(2)

where A = [aij] ∈ R^n×n and b = [bi] ∈ Rⁿare given, and y ∈ Rⁿis the vector of the unknowns. The matrix A is assumed to be nonsingular, so that the problem has a unique solution. We also assume that the problem needs to be solved in a distributed computational framework determined by a set of connected computational nodes which can communicate through a generic sequence of graphs. Let A_i∈ R^1×nand b_i∈ R be the i-th row of A and the i- th component of b respectively. It is assumed that each computational node i knows the corresponding A_i and b_i and that each node needs to obtain the solution y^∗ through an iterative, distributed algorithm. This assumption is later relaxed in the sense that each node can hold several rows of the matrix A and the number of nodes is N ≤ n, see Subsection 2.1, but for the sake of simplicity let us continue within the framework of N = n and single equation per node.

The considered problem is important as linear systems appear natu- rally in a number of applications. To start with, the estimation problems on graphs are described in [1] and such problems can be stated as linear systems of equations with symmetric positive definite matrix. This class includes localization problems, time synchronisation and motion consensus.

Parameter identification in wireless sensor networks [4] also fits into this framework. A particularly important example of application is Ordinary Kriging (OK) [5, 6, 13, 17], an optimal linear prediction technique of the expected value of a spatial random field Z(s), s ∈ Rⁿ. Ordinary Kriging can be applied when the random field under study is isotropic and intrinsically stationary, that is, the expected value E(Z(s)) = m is constant and the variance V ar(Z(s) − Z(s + h)) = 2γ(h) depends only on h. In this case the model parameter estimation relies on the solution of a linear system like (1) (see equations (3.2.13)-(3.2.15) in [6] and the example in Section 5). When the semivariogram γ(h) of the random field has a sill, it can be assumed that there is a range ¯h over which the covariance Cov(Z(s), Z(s + h)) = 0, when |h| > ¯h. In this case the matrix A of the Ordinary Kriging linear system becomes sparse, since its elements are the estimates of γ(h) and each sampled node of the random field Z needs to memorize only the information brought by its neighbours at a distance lower than ¯h to estimate the model parameters. When the mean m of the random field is known the problem simplifies into what is called Simple Kriging. In Section 5 we will use Simple Kriging as an example of application of our method.

There is a vast literature devoted to solving systems of linear equations in the conventional centralized computational environment [11,20], as well as a number of results that cover parallelization of classical iterative methods which are applicable to the case of fully connected distributed computational

(3)

environment, [9]. Our interest in this paper is the class of fixed point methods [11, 20] and their extensions to the distributed framework, as described above. In other words, we develop a class of novel, fully distributed, iterative fixed point methods to solve (1), wherein each node can exchange messages only with the ones in its neighborhood in the communication graph, and each node obtains the estimate of the solution y^∗ of problem (1). It is well known that (1) can be transformed into an equivalent fixed point problem

y = M y + d, (2) {fp}{fp}

and one can apply the Banach contraction principle and define the fixed point iterative method of the form y^k+1 = M y^k+ d, for suitable choices of M ∈ R^n×n and d ∈ Rⁿ (see Section 2 for the details).

The sufficient and necessary condition for the convergence of such iterative sequence is ρ(M ) < 1, where ρ(M ) is the spectral radius of M. Furthermore, a sufficient condition for the convergence of {y^k} is given by kM k < 1 for an arbitrary matrix norm k·k. Clearly, there is a number of suitable ways to define the iterative matrix M in such way that either ρ(M ) < 1 or kM k < 1 for many matrix classes, like symmetric positive definite matrices, M-matrices, H-matrices, etc [2]. Typical examples of this type of methods are the Ja- cobi and Gauss - Seidel method as well as their modifications like Jacobi Overrelaxation (JOR), Successive Overelaxation (SOR), Symmetric Succes- sive Overelaxation (SSOR) method and so on [11, 20]. The convergence of fixed point methods is linear and the convergence factor is determined by the spectral radius or the norm of M. The main idea of relaxation methods is to introduce a parameter that reduces the norm (or the spectral radius) of the corresponding iterative matrix and ensures faster convergence.

There is a rich literature on parallelization of fixed point iterative methods, where the computational nodes communicate in an all-to-all fashion [9], [10], [8], [3]. In the case of very large dimensions one needs to split the computational effort between different computational nodes to speed up the algorithm. In this type of computational environment, the total cost of solving the problem of interest is mainly dictated by the corresponding computational cost and the communication cost of exchanging messages between the parallelized nodes (processes) along iterations. Usually, major bottlenecks include waiting for the slowest node to complete an iteration, or latency incurred by the time to communicate a message. For this reason asynchronous methods, which allow for latency in communication and non- uniform distribution of computational work, are also considered, [10]. The methods of this type are convergent under different communication latency conditions [10].

(4)

The framework we consider in this paper for solving systems of linear equations, in more detail, assumes a network of computational nodes which communicate through a generic directed graph, which can depend on time.

Thus the results in [3, 8–10] are not applicable. The same framework is also considered in [15, 16, 18, 25, 26], and a survey of the methods is presented in [24]. The focus of these methods is to ensure convergence of the local approximations to the global solution, in the presence of time-varying communication graphs. In the context of these algorithms, convergence is defined in two possible ways. In [18, 26] each node holds a local approxima- tion of a subset of the variables and convergence of these local variables to the corresponding part of the solution is required. In [15, 16, 25] every node contains a vector of the same size as the unknown vector of the linear system, and the convergence of each local vector to the full solution in ensured.

We are interested in the second scenario. The method presented in [9] is applicable to a general problem of the type (1) with loose restrictions on the matrix A and can be used to solve the linear least squares problem as well.

In this paper, we propose a novel distributed method to solve (1), which we refer to as DFIX (Distributed Fixed Point). DFIX assumes the same computational framework as [15, 16, 25] but differs significantly from the above mentioned methods. Starting from a centralized fixed point method for solving linear systems we decentralize the method directly and extend the convergence theory of centralized fixed point methods to the distributed case in the sense of sufficient conditions. That is, we demonstrate that the sufficient condition kM k∞ < 1 continues to work in the distributed environment. The main convergence result is completely analogous to the centralized case - given an iterative matrix with the infinity norm smaller than 1, the iterative sequence is convergent for an arbitrary starting point.

The theory presented here thus covers a large class of linear systems. We prove linear convergence of DFIX under directed strongly connected networks and explicitly quantify the corresponding convergence factor in terms of network and linear system parameters. As detailed below, numerical simulations demonstrate advantages of DFIX over some state of the art methods.

With respect to the underlying graph, representing the connection among the computational agents, we consider both the case when the graph is fixed (i.e., the connectivity among the nodes is the same at any time during the execution of the algorithm) and the case when the network changes at every iteration. In the fixed graph case we prove that convergence holds if the network is strongly connected, while in the time-varying graph case we give suitable assumptions over the sequence of networks. We prove that the

(5)

time-independent case is a particular case of the time-varying case but for the sake of clarity we first present and analyse the algorithm assuming the network is fixed, and then we generalize the analysis to the time-varying case.

Any system of linear equation (1) with symmetric matrix A can be considered as the first order optimality condition of an unconstrained optimization problem with cost function ¹₂x^TAx − b^Tx. It is therefore of interest to compare the approach of solving (1) applying some distributed optimization method [14, 19, 21, 22] to the minimization of the quadratic function

1

2x^tAx − b^tx with DFIX. We thus compare computational and communication costs of DFIX with the state-of-the-art optimization method from these papers and show that the computational costs with DFIX are significantly lower, while the communication costs are comparable or go in favor of DFIX, depending on the connectivity of the underlying graph. Thus the numerical efficiency of DFIX is also shown. A comparison with the method from [15]

is also presented in Section 5, demonstrating the clear advantage of DFIX.

The contributions of this paper are the following: we define a novel fixed point iterative method for solving linear systems in distributed computational environment and present convergence analysis that is analogous to the classical centralized case showing that the sufficient convergence condition is the same - the norm of iterative matrix smaller than one; the convergence factor depends on the norm of iterative matrix, the diameter of underlying communication graph and the weight matrix. The results are then extended to the case of directed communication graph and the time-varying graph under reasonable connectivity conditions. Extensive numerical tests are performed and the results confirm theoretical findings.

This paper is organized as follows. Section 2 contains the description of the computational framework together with a brief overview of fixed point iterative methods that will be used further on. The method DFIX is defined and analysed in Section 3 for the fixed graph case. In Section 4 we present the time-varying case. Numerical results that illustrate theoretical analysis as well as an application of DFIX to a Kriging problem are presented in Section 5. Some conclusions are drawn in Section 6.

(6)

2 Preliminaries

Let us first briefly recall the theory of fixed point iterative methods for systems of linear equations. Given a generic¹ method of type (2)

x^k+1= M x^k+ d, (3) {fpiter}{fpiter}

we know that the method is convergent if ρ(M ) < 1, where we recall that ρ(M ) is the spectral radius of M , i.e., the largest eigenvalue of M in modulus.

This condition is both necessary and sufficient for convergence. Given any matrix norm k · k one can also state the sufficient convergence condition as kM k < 1. There are many ways of transforming (1) to the fixed point form (2), depending on the properties of A, with Jacobi and Gauss - Seidel methods, as well as their relaxation versions being the most studied methods.

To fix the idea before defining the distributed method we recall here the Jacobi and Jacobi Overrelaxation, JOR, method, keeping in mind that we will consider a generic M in the next section.

Assume that A is a nonsigular matrix with nonzero diagonal entries.

Using the splitting A = D − P, with D being the diagonal matrix, D = diag(a₁₁, . . . , a_nn), the Jacobi iterative method is defined by (3) with

M = D⁻¹P := M_J.

In other words, given d = D⁻¹b and denoting by x^k = (x^k₁, . . . , x^k_n) the estimate of solution to (1) at iteration k, the new iteration is defined by

x^k+1_i = − 1 aii

n

X

j=1,j6=i

a_ijx^k_j + d_i, i = 1, . . . , n.

The method is linearly convergent for many classes of matrices, for example strictly diagonally dominant matrices, symmetric positive definite matrices etc [11, 20], and the rate of convergence is determined by ρ(MJ). To speed up convergence and extend the class of matrices for which the method is convergent, one can introduce the relaxation parameter ω ∈ R and define

M = ωD⁻¹P + (1 − ω)I.

In other words, the JOR iteration is given by x^k+1_i = (1 − ω)x^k_i − w

a_ii(

n

X

j=1,j6=i

a_ijx^k_j + b_i), i = 1, . . . , n. (4) {JOR}{JOR}

1the relation between the method in (3) and (1) is described further ahead

(7)

If A is a symmetric positive definite matrix, the JOR method converges for ω ∈ (0, 2

ρ(MJ)), see [11, 20].

Assuming that each node can communicate directly with every other node, the method can be applied in parallel and asynchronous manner and the convergence follows from the results of [3, 10].

Let us now define precisely the computational environment we consider.

Assume that the network of nodes is a directed network G = (V, E ), where V is the set of nodes and E is the set of all edges, i.e., all pairs (i, j) of nodes where node i can send information to node j through a communication link.

Definition 1. The graph G = (V, E ) is strongly connected if for every couple of nodes i, j there exists an oriented path from i to j in G. That is, if there exist s1, . . . , sl such that (i, s1), (s1, s2), . . . , (sl, j) ∈ E .

Assumption A1. The network G = (V, E ) is directed, strongly connected, with self-loops at every node.

Remark 2.1. The case of undirected network G can be seen as the particular case of directed graph where G is symmetric. That is, (i, j) ∈ E if and only if (j, i) ∈ E . In this case, the hypothesis that G is strongly connected is equivalent to G connected.

Let us denote by O_i the in-neighborhood of node i, that is, the set of nodes that can send information to node i directly. Since the graph has self loops at each node, then i ∈ Oi for every i. We associate with G an n × n matrix W , such that the elements of W are all nonnegative and each row sums up to one. More precisely, we assume the following.

Assumption A2. The matrix W ∈ R^n×n is row stochastic with elements w_ij such that

wij > 0 if j ∈ Oi, wij = 0 if j /∈ O_i

Let us denote by w_min a constant such that all nonzero elements of W satisfy wij ≥ w_min > 0. Under the previously stated assumptions we know that such constant exists. Moreover, we have wmin ∈ (0, 1). Therefore, for all elements of W we have

w_ij 6= 0 ⇒ w_ij ≥ w_min. (5) {wmin}{wmin}

The diameter of a network is defined as the largest distance between two nodes in the graph. Let us denote with δ the diameter of G.

(8)

3 DFIX method

We consider now a generic fixed-point method for solving (1) by the fixed point iterative method (3), with M = [mij] ∈ R^n×n, d = [di] ∈ Rⁿ defined in such a way that node i contains the i-th row Mi ∈ R^1×n and di ∈ R.

Moreover, we assume that the fixed point y^∗ of (2) is a solution of (1). The algorithm is designed in such way that each node has its own estimate of the solution y^∗. Thus at the iteration k each node i has its own estimate x^k_i ∈ Rⁿwith components x^k_ij, j = 1, . . . , n. The DFIX method is presented in the algorithm below.

Algorithm DFIX

Step 0 Initialization: Set k = 0. Each node chooses x⁰_i ∈ Rⁿ. Step 1 Each node i computes

ˆ x^k+1_ii =

n

X

j=1

mijx^k_ij+ di, ˆ

x^k+1_ij = ˆx^k_ij, i 6= j.

(6) {eq:step1}{eq:step1}

Step 2 Each node i updates its solution estimate x^k+1_i =

n

X

j=1

wijxˆ^k+1_j (7) {eq:step2}{eq:step2}

and sets k = k + 1.

Notice that at Step 1 each node i updates only the i-th component of its solution estimate and leaves all other components unchanged, while in Step 2 all nodes perfom a consensus step [7, 12, 23] using the set of vector estimates ˆx^k+1_j . Defining the global variable at iteration k as

X^k =

x^k₁; . . . ; x^k_n

∈ Rⁿ²,

Algorithm DFIX can be stated in the condensed form using X^k and the following notation

Mc_i =





 1

. ..

m_i1 . . . m_ii . . . m_in . ..

1







∈ R^n×n, db_i =





 0

... d_i

... 0







∈ Rⁿ.

(9)

More precisely, matrix cMi has the i-th row equal to M , the rest of diagonal elements are equal to 1 and the remaining elements are equal to 0. Vector db_i has only one nonzero element in the i-th row which is equal to d_i. Now, Step 1 can be rewritten as

ˆ

x^k+1_i = cMix^k_i + ˆdi, and we can rewrite the Steps 1-2 in matrix form as

X^k+1 = (W ⊗ I)(MX^k+ ˆd) (8) {eq:globalit}{eq:globalit}

where M = diag

Mc₁, . . . , cM_n

∈ Rⁿ²^×n², bd = ˆd₁; . . . ; ˆd_n

∈ Rⁿ² and ⊗ denotes the Kronecker product of matrices. We remark here that equation (8) is only theoretical, in the sense that since each agent has access only to partial information, the global vector X^k, the matrix M and the vector bd are not computed at any node. We derived equation (8) to get a compact representation of Algorithm 1 and to use it in the convergence analysis.

The following theorem shows that for every i ∈ {1, . . . , n} the local sequence {x^k_i} converges to the fixed point y^∗ of (2). Denote

X^∗ = (y^∗; . . . ; y^∗) ∈ Rⁿ².

{Th1}

Theorem 1. Let Assumptions A1 and A2 hold, kM k∞ = µ < 1 and let {X^k} be a sequence generated by (8). Then the global error E^k= X^k− X^∗ satisfies

kE^k+1k_∞≤ τ kE^k−δ+1k_∞, (9) where τ = 1 − w^δ_min(1 − µ) and δ denotes the diameter of the underlying computational graph G.

Proof. Since W is assumed to be row stochastic there holds (W ⊗I)X^∗ = X^∗. Moreover, using the fact that ˆd = (I ⊗ I − M)X^∗, we obtain the following recursion

E^k+1 = (W ⊗ I)ME^k. (10) {novo1}{novo1}

Notice that k(W ⊗ I)Mk∞≤ 1, so we have

kE^k+1k_∞≤ kE^kk_∞. (11) {eq:novo2}{eq:novo2}

Now, denoting by e^k_i the i-th block of E^k (the local error corresponding to node i) and by e^k_ij its j-th component, from (10) we obtain the following

e^k+1_ij = wijMje^k_j +X

s6=j

wise^k_sj. (12) {eq:errorcomp}{eq:errorcomp}

(10)

We prove the thesis by proving that if the distance between j and i in the graph is equal to l, then for every k

|e^k+1_ij | ≤ τ kE^k−l+1k_∞, for a constant τ = (1 − w^l_min(1 − µ)). (13) {eq:distl}{eq:distl}

We proceed by induction over the distance l. If l = 1, that is, if there is an edge from j to i, then wij ≥ w_min> 0. By (23) we get

|e^k+1_ij | ≤ w_ij|M_je^k_j| +X

s6=j

w_is|e^k_sj| ≤ w_ijµkE^kk_∞+ kE^kk_∞X

s6=j

w_is≤

≤ 1 − w_ij(1 − µ)kE^kk_∞≤ 1 − w_min(1 − µ)kE^kk_∞, and defining τ⁰ = 1 − wmin(1 − µ) < 1, we get

|e^k+1_ij | ≤ τ⁰kE^kk_∞. (14) {eq:tau1}{eq:tau1}

Assume now that (13) holds for distance equal to l −1, and let us prove it for l. Let (j, s_l−1, s_l−2, . . . , s₁, i) be a path of length l from j to i. In particular we have that wis1 > 0 and thus

|e^k+1_ij | ≤ w_is₁|e^k_s

1j| + X

s6=s1

w_is|e^k_sj|. (15) {eq:s1}{eq:s1}

For each of the terms |e^k_sj| in the sum, by (11), we have

|e^k_sj| ≤ kE^kk_∞≤ kE^k−l+1k_∞. (16) {eq:1}{eq:1}

Let us now consider the term |e^k_s

1j|. Since (j, s_l−1, s_l−2, . . . , s₁, i) is a path of length l from j to i and the distance between j and i is equal to l, we have that the distance between j and s1 is equal to l − 1 and therefore, by inductive hypothesis

|e^k_s

1j| ≤ τ⁰kE^k−(l−1)k_∞= τ⁰kE^k−l+1k_∞, for τ⁰ = (1 − w_min^l−1(1 − µ)). (17) {eq:2}{eq:2}

Replacing (16) and (17) in (15), we get

|e^k+1_ij | ≤ w_is₁τ⁰kE^k−l+1k_∞+ X

s6=s1

w_iskE^k−l+1k_∞=

= 1 − w_s₁_j(1 − τ⁰) kE^k−l+1k_∞≤

≤ 1 − w_min(1 − τ⁰) kE^k−l+1k_∞

(18)

and defining τ := (1 − w_min(1 − τ⁰)) we get (13). Now the thesis follows directly from the fact that the distance between any two nodes is smaller or equal than the diameter δ of the graph.

(11)

The previous analysis shows that the global error in nonexpanding and that we have a decrease after δ iterations at most where δ is the diameter of the underlying graph. Next we quantify the R-linear convergence factor which is derived directly from the previous result.

Corollary 1. Suppose that the assumptions of Theorem 1 are satisfied.

Then each node’s solution estimate x^k_i converges to the solution y^∗ of the problem (2) R-linearly with the factor γ = τ^1/δ, i.e., for each i ∈ {1, 2, ..., N } there holds

kx^k_i − y^∗k∞= O(γ^k), where γ =

1 − w_min^δ (1 − µ)1/δ

.

Proof. Denote ξ_k:= kX^k− X^∗k_∞= kE^kk_∞. Notice that (11) implies that ξk+1 ≤ ξ_k for every k. Moreover, every iteration k can be represented as k = sδ + c, where s, c ∈ N0 and c < δ. Then,

ξk≤ ξ_k−c ≤ τ^sξ0 = τ^(k−c)/δξ0≤ τ^k/δτ⁻¹ξ0:= γ^kC, where

γ = τ^1/δ =

1 − w^δ_min(1 − µ)1/δ

and C = ξ₀

τ = kX⁰− X^∗k_∞ 1 − w^δ_min(1 − µ). Clearly, having in mind the definition of the global variables X^kand X^∗, for every i ∈ {1, 2, ..., N } we have kx^k_i − y^∗k∞≤ ξ_k, and the result follows.

3.1 DFIX Method - multirow case

The previous result can be generalized in such a way that each node holds more than one row as follows. We consider now a generic fixed-point method for solving (1) by the fixed point iterative method (3), with N nodes and M = [mij] ∈ R^n×n, d = [di] ∈ Rⁿ defined in such a way that each node i ∈ {1, 2, ..., N } contains several rows of M and d. Let us denote the rows assigned to node i by R_i ⊂ {1, 2, ..., n}. We assume R_iT R_j = ∅ for i 6= j and SN

i=1R_i = {1, 2, ..., n}. More precisely, each node i holds M_j ∈ R^1×n and di ∈ R, for all j ∈ Ri. Assume that the fixed point y^∗ of (2) is a solution of (1). The algorithm is again designed in such way that each node has its own estimate of the solution y^∗. Thus at the iteration k each node i has its own estimate x^k_i ∈ Rⁿ with components x^k_ij, j = 1, . . . , n.

Algorithm DFIXM

(12)

Step 0 Initialization: Set k = 0. Each node chooses x⁰_i ∈ Rⁿ. Step 1 Each node i computes

ˆ x^k+1_ij =

n

X

l=1

m_jlx^k_il+ d_j, j ∈ R_i, ˆ

x^k+1_ij = ˆx^k_ij, j /∈ R_i.

(19) {eq:step1}{eq:step1}

Step 2 Each node i exchanges ˆx^k+1_i with the direct neighboors and sets

x^k+1_i =

N

X

j=1

w_ijxˆ^k+1_j (20) {eq:step2}{eq:step2}

and sets k = k + 1.

Notice that at Step 1 each node i updates only the components j ∈ Ri

of its solution estimate and leaves all other components unchanged, while in Step 2 all nodes perform a consensus step [7, 12, 23] using the set of vector estimates ˆx^k+1_j obtained from the immediate neighboors.

Defining the global variable at iteration k as X^k=

x^k₁; . . . ; x^k_n

∈ R^{N n}.

Notice that this variable is only needed for the analysis and it is not actually constructed within the DFIXM algorithm at any point. Using this notation, Algorithm DFIXM can be stated in the condensed form using X^k and the following notation

Mc_i=





 1

. ..

mj,1 . . . mj,j . . . mj,n

m_j+1,1 . . . m_j+1,j+1 . . . m_j+1,n ...

m_j+q_i_,1 . . . m_j+q_i_,j+q_i . . . m_j+q_i_,n . ..

1







∈ R^n×n, bd_i =





 0

... dj

d_j+1 ... d_j+q_i

... 0







∈ Rⁿ,

where j, ..., j + q_i∈ R_i. More precisely, matrix cM_i has the j-th row equal to the j-th row of M for all j ∈ R_i, the rest of diagonal elements are equal to 1

(13)

and the remaining elements are equal to 0. Vector bdi has zero elements on rows j /∈ R_i while the elements on the remaining rows j ∈ R_i coincide with the relevant rows of vector d. Now, Step 1 can be rewritten as

ˆ

x^k+1_i = cM_ix^k_i + ˆd_i, and we can rewrite the Steps 1-2 in matrix form as

X^k+1 = (W ⊗ I)(MX^k+ ˆd) (21) {eq:globalit1}{eq:globalit1}

where M = diag

Mc1, . . . , cMn

∈ R^{N n×N n}, bd = ˆd1; . . . ; ˆdn

∈ R^{N n} and ⊗ denotes the Kronecker product of matrices. We remark here that equation (8) is only theoretical, in the sense that since each agent has access only to partial information, the global vector X^k, the matrix M and the vector bd are not computed at any node. We derived equation (8) to get a compact representation of Algorithm 1 and to use it in the convergence analysis.

The following theorem shows that for every i ∈ {1, . . . , N } the local sequence {x^k_i} converges to the fixed point y^∗ of (2). Denote

X^∗ = (y^∗; . . . ; y^∗) ∈ Rⁿ².

{Tnew}

Theorem 2. Let Assumptions A1 and A2 hold, kM k∞ = µ < 1 and let {X^k} be a sequence generated by (21). Then, for every k, the global error E^k= X^k− X^∗ satisfies

kE^k+1k_∞≤ τ kE^k−δ+1k_∞, (22) where δ denotes the diameter of the underlying computational graph G and

τ = 1 − w_min^δ (1 − µ) ∈ (0, 1).

Proof. The proof is essentially the same as the proof of Theorem 1 with some technical changes. Following the same reasoning we get the error expression

e^k+1_ij = wijMhe^k_j +X

s6=j

wise^k_sj, (23) {eq:errorcomp}{eq:errorcomp}

where h depends on i and j. As in the previous case, we prove the thesis by proving that if the distance between j and i in the graph is equal to l, then for every k

|e^k+1_ij | ≤ τ kE^k−l+1k∞, for a constant τ = 1 − w_min^l (1 − µ) ∈ (0, 1). (24) {eq:distl1}{eq:distl1}

(14)

We proceed by induction over the distance l. If l = 1, that is, if there is an edge from j to i, then w_ij ≥ w_min> 0. By (23) we get

|e^k+1_ij | ≤ w_ij|M_he^k_j| +X

s6=j

wis|e^k_sj| ≤ w_ijkE^kk_∞

n

X

l=1

|m_hl| + kE^kk_∞X

s6=j

wis

≤ w_ijµkE^kk_∞+ kE^kk_∞(1 − wij)

≤ 1 − w_ij(1 − µ)kE^kk_∞≤ 1 − w_min(1 − µ)kE^kk_∞, and defining τ⁰ = 1 − w_min(1 − µ) < 1, we get

|e^k+1_ij | ≤ τ⁰kE^kk_∞. (25) {eq:tau1}{eq:tau1}

The rest of the proof is completely analogous to the proof of Theorem 1 and omitted here.

Analogously, we can quantify the convergence factor in the same way and the corollary below holds.

{cor2}

Corollary 2. Suppose that the assumptions of Theorem 2 are satisfied.

Then each node’s solution estimate x^k_i converges to the solution y^∗ of the problem (2) R-linearly with the factor γ = τ^1/δ, i.e., for each i ∈ {1, 2, ..., N } there holds

kx^k_i − y^∗k_∞= O(γ^k), where γ =

1 − w_min^δ (1 − µ)1/δ

.

DFIXM is a generalization of DFIX that might be of practical impor- tance as it allows us to solve an n dimensional linear system with an arbitrary number of nodes N ≤ n which might be the case in many applications.

However we will continue with DFIX method for time-varying networks in the next Section to avoid notation cluttering and to facilitate reading. The changes in the proofs are minor and of the same type as above.

4 Time-varying Network

The method discussed in the previous sections is valid only if the graph representing the communication among the agents is the same at each iteration. If some failure of the communication link between two agents occurs during the execution of the algorithm, the underlying network changes, and Theorem 1 does not apply anymore. To deal with these possible changes we consider the case where the network is given, possibly different, at each

(15)

iteration. We extend DFIX to this framework and we give assumptions on the sequence of graphs that yield a convergence result analogous to Theo- rem 1. In particular we show that, in order to achieve convergence, strong connectivity is not necessary at any time.

Assume that a sequence of directed graphs {G_k}_k is given, such that at iteration k, G_k represents the network of nodes. That is, at iteration k, each node can communicate with its neighbours in Gk. The DFIX algorithm described by equations (19) and (20) can be applied in this case if we replace (20) with

x^k+1_i =

n

X

j=1

w_ij^kxˆ^k+1_j (26) {eq:step2_time}{eq:step2_time}

where W^kis the consensus matrix associated with the graph G_k, that is, W^k satisfies Assumption A2 with G = Gk. With this modification, the equation describing the global iteration becomes

X^k+1= (W^k⊗ I)(MX^k+ ˆd). (27) {eq:globalit_time}{eq:globalit_time}

We will prove a convergence result for a class of sequences of graphs. We first present and analyze the assumptions on such sequence.

Definition 2. Given G₁, G₂ graphs with G_i = (V, E_i), the composition of G₁ and G2 is defined as G2◦ G₁= (V, E ) where

E := {(j, i) ∈ V² | ∃ s ∈ V such that (j, s) ∈ E₁, (s, i) ∈ E2}. (28) That is, there is an edge from j to i in G₂◦ G₁ if we can find a path from j to i such that the first edge of the path is in G1 and the second edge is in G2. This definition can be extended to finite sequences of graphs of arbitrary length.

{edgescomposition}

Remark 4.1. Let us consider a generic set of graphs G₁, . . . , G_m. It is easy to see that if for every index j the graph G_j has self-loops at every node then the set of edges of the composition G1◦ · · · ◦ G_m contains the set of edges of G_j for every j. In particular, if there exists an index ˆ ∈ {1, . . . , m} such that G_ˆ_ is fully connected, then G₁◦ · · · ◦ G_m is also fully connected.

Definition 3. Given an infinite sequence of networks {G_k}_k and a positive integer ¯m, we say that the sequence is jointly fully (respectively, strongly) connected for sequences of length ¯m if for every index k, the composition G_k◦ G_k+1◦ · · · ◦ G_{k+ ¯}_m−1 is fully (respectively, strongly) connected.

(16)

Definition 4. Given an infinite sequence of networks {Gk}_kand two integers τ₀, l, we say that the sequence is repeteadly jointly strongly connected with constants τ₀, l, if for every index k, the composition G_τ₀_+kl◦ G_τ₀_+kl+1◦ · · · ◦ G_τ₀_+(k+1)l is strongly connected.

Definition 5. Given two vertices i, j we say that there is a joint path of length l from i to j in G_k, . . . , G_{k+ ¯}_m−1 if there exist s1, . . . , s_l−1 such that (i, s₁) ∈ E_{k+ ¯}_m−1, (s₁, s₂) ∈ E_{k+ ¯}_m−2, . . . , (s_l−1, j) ∈ E_{k+ ¯}_m−l, and we say that i, j have joint distance l in G_k, . . . , G_{k+ ¯}_m−1 if the shortest joint path from i to j is of length l.

Our analysis is based on the following assumption.

Assumption A3. {G_k} is a sequence of directed graphs, with self-loops at every node, jointly fully connected for sequences of length ¯m, for some positive integer ¯m.

The algorithm presented in [16] works for time-varying network in a similar framework. Formally, the hypothesis on {G_k} in [16] is the following.

Assumption A3’. {G_k} is a sequence of directed graphs, with self-loops at every node, jointly strongly connected for sequences of length ¯p, for some positive integer ¯p.

We show now that Assumptions A3 and A3’ are equivalent, in the sense specified by Proposition 1. In the following, given an integer m, we denote with G^m the composition of m copies of G.

Lemma 1. If G is a directed strongly connected graph with self-loops at every node and diameter δ, then G^δ is fully connected.

Proof. By definition of composition we have that (i, j) is an edge in G^δ if and only if

∃s₁, . . . , s_δ−1 ∈ V such that (i, s₁), (s₁, s₂), . . . , (s_δ−1, j) ∈ G. (29) {compositioncondition}{compositioncondition}

We want to prove that for every i, j ∈ V a sequence of nodes s_h as in (29) exists.

Since G is fully connected with diameter δ, there exists a path in G from i to j of length l ≤ δ. That is, there exist a set of nodes v₁, . . . , v_l−1such that

(17)

(i, v1), (v1, v2), . . . , (vl−1, j) are edges in G and therefore a sequence satisfying (29) is given by

s_h=

(v_h h = 1 : l − 1 j h = l : δ.

Proposition 1. Let {G_k} be a sequence of graphs where, for each k, G_k = (V, E_k) is a directed graph with self-loops at every node. The following are equivalent:

(1) there exist τ₀, l ∈ N such that {Gk} is repeatedly jointly strongly connected with constants τ0, l

(2) there exists ¯p ∈ N such that {Gk} is strongly connected for sequences of length ¯p

(3) there exists ¯m ∈ N such that {Gk} is fully connected for sequences of length ¯m

Proof. It is easy to see that (2) ⇒ (1) with τ₀ = 0 and l = ¯p and since full connectivity clearly implies strong connectivity, we have that (3) ⇒ (2) with

¯ p = ¯m.

We now prove that (1) ⇒ (2) with ¯p = 2l. That is, we prove that if (1) holds, then for every index s the composition Gs◦ · · · ◦ G_s+2l−1 is strongly connected. Given an index s, we denote with ¯r the remainder of the division of (s − τ₀) by l, we define ¯h := l⁻¹(s − τ₀+ l − ¯r). By definition of ¯r and ¯h and applying (1) with k = ¯h we have that the graph

H : = G_s+l−¯_r◦ · · · ◦ G_s+2l−¯_r−1 =

= G_τ₀_+¯_hl◦ · · · ◦ G_τ

0+(¯h+1)l−1

is strongly connected and thus

G_s◦ · · · ◦ G_s+2l−2 = G_s◦ · · · ◦ G_s+l−¯_r−1◦ H ◦ G_s+2l−¯_r◦ · · · ◦ G_s+2l−1 is strongly connected. Since 2l − ¯r ∈ l + 1, . . . , 2l we have the thesis.

Finally, we prove that (2) ⇒ (3). Since the size of V is finite, there exists a finite number of graphs with vertices V. In particular, there exists a finite integer L equal to the number of strongly connected graphs with vertices V. We denote with H₁, . . . H_L such graphs, with δ_j the diameter of H^j and

(18)

with ¯δ := max δj. Given any index k, we consider (¯δ − 1)L + 1 sequences of length ¯p as follows:

S1 = Gk◦ G_k+1· · · ◦ G_{k+ ¯}_p−1 S₂ = G_{k+ ¯}_p◦ G_{k+ ¯}_p+1· · · ◦ G_{k+2 ¯}_p−1 ...

S_(¯_δ−1)L+1= G_k+(¯_{δ−1)L ¯}_p◦ G_k+(¯_{δ−1)L ¯}_p+1· · · ◦ G_k+(¯_{δ−1)L ¯}_{p+ ¯}_p−1.

Statement (2) implies that, for every j ∈ {1, . . . , (¯δ − 1)L + 1}, Sj ∈ {H₁, . . . H_L} and thus there exists an index ˆı ∈ {1, . . . , L} such that at least ¯δ elements of {S₁, . . . , S_(¯_δ−1)L+1} are equal to H_ˆ_ı. Using the fact that, by Lemma 1, H_ˆ_ı^δ^ˆ^ı is fully connected and Remark 4.1, we have

G_k◦ G_k+1◦ · · · ◦ G_k+(¯_{δ−1)L ¯}_{p+ ¯}_p−1= S1◦ · · · ◦ S_(¯_δ−1)L+1 fully connected, and thus (3) holds with ¯m = (¯δ − 1)L¯p + ¯p.

To conclude the considerations on the sequence of networks we remark that, since we are assuming that the linear system (1) has unique solution and that each node contains exactly one row of the coefficient matrix, the D-connectivity hypothesis introduced in [15] is equivalent to Assumption A3’ and thus, by Proposition 1, to Assumption A3.

Theorem 3. Assume that a sequence of networks {Gk}_k is given, satisfying Assumption A3, and that for every index k the corresponding consensus matrix W^k satisfies Assumption A2. Let {X^k} be a sequence generated by (27) with kM k∞= µ < 1. There exists a constant σ < 1 such that for every k ∈ N the global error E^k= X^k− X^∗ satisfies

kE^k+1k_∞≤ σkE^{k− ¯}^m+1k_∞, (30) where ¯m is the constant given by Assumption A3.

Proof. We follow the proof of Theorem 1. For every index k, the matrix W^k is row stochastic and k(W^k⊗ I)Mk_∞≤ 1, so we get

E^k+1= (W^k⊗ I)ME^k. (31) {novo1_time}{novo1_time}

and

kE^k+1k_∞≤ kE^kk_∞. (32) {novo2_time}{novo2_time}

(19)

For every node i, j and for every iteration index k, we have e^k+1_ij = w_ij^kM_je^k_j +X

s6=j

w^k_ise^k_sj. (33)

We now prove that if the joint distance between j and i in G_{k− ¯}_m+1, G_{k− ¯}_m+2, . . . , Gk is equal to l, then for every k

|e^k+1_ij | ≤ σ⁰kE^k−l+1k∞, for σ⁰< 1. (34) {eq:distltime}{eq:distltime}

We proceed by induction over the joint distance l. If l = 1, that is, if w_ij^k > 0, proceeding as in the derivation of (3.1) we get

|e^k+1_ij | ≤ 1 − w^k_ij(1 − µ)kE^kk_∞≤ 1 − w_min(1 − µ)kE^kk_∞=: σkE^kk_∞. We assume now that (34) holds for distance equal to l − 1 and we prove it for l. Let (j, s_l−1, s_l−2, . . . , s1, i) be a joint path of length l from j to i in G_{k− ¯}_m+1, G_{k− ¯}_m+2, . . . , G_k In particular we have that w^k_is

1 > 0 and thus

|e^k+1_ij | ≤ w_is₁|e^k_s

1j| + X

s6=s1

wis|e^k_sj|. (35) {eq:s1time}{eq:s1time}

Using the fact that (j, s_l−1, s_l−2, . . . , s1) is a joint path of length l − 1 from j to s₁ in G_{k− ¯}_m+1, G_{k− ¯}_m+2, . . . , G_k−1, applying the inductive hypothesis and proceeding as in the proof of the previous theorem, we get

|e^k+1_ij | ≤ 1 − w_min(1 − σ⁰) kE^k−l+1k_∞ (36) with σ⁰given by (34) for distance l−1, and defining σ := (1 − w_min(1 − σ⁰)) <

1 we get (34) for distance equal to l.

Since the sequence {Gk} is fully connected for sequences of length ¯m we have that for every couple of nodes i, j the joint distance between j and i in G_{k− ¯}_m+1, G_{k− ¯}_m+2, . . . , G_k is smaller or equal than ¯m and we get the thesis.

Lemma 1 shows that if we consider the time-independent case as the particular instance of the time-varying case where each of the graphs Gk is equal to G with diameter δ, then Assumption 3 holds with ¯m = δ and the two theorems give the same inequality for the error vectors.

(20)

5 Numerical results

In this section we present testing results for the DFIX method. The DFIX is compared with the state-of-the-art distributed optimization algorithms from [14, 19, 21, 22] referred to here as DIGing, EXTRA and SVL respectively, and the method for solving systems of linear equations presented in [15], abbreviated here as Projection. The test set consists of two types of problems: Simple Kriging problems and linear systems with strictly diagonally dominant coefficient matrix. In Subsection 5.1 we study how the computational and communication cost of DFIX is influenced by the connectivity of the underlying network and we compare DFIX with the mentioned methods on a simpleKriging problem. In Subsection 5.2 we repeat the comparison considering a randomly generated linear system. We also consider the case where each node can hold more than one equation and we analyse what happens, for a fixed linear system, when the number of nodes in the network increases. In subsections 5.3 and 5.4 we consider the case of directed and time-varying networks, respectively.

The results demonstrate that DFIX, analogously to the classical results, outperforms the corresponding optimization method for solving the unconstrained quadratic problem both in terms of computational and communication costs. With respect to the method from [15] the comparison is again favorable for DFIX, in the case of the iterative matrix with suitable properties. Clearly, the method from [15] is designed for a wider class of problems, but its efficiency is significantly lower than DFIX efficiency in the case of unique solution and a suitable iterative matrix.

In the following, the DFIX method we consider is defined using Jacobi Overrelaxation, as specified in Section 2, as underlying fixed point method.

The iteration k of the resulting method at each node is given by

ˆ

x^k+1_ii = (1 − α)x^k_ii− α aii



 X

j6=i

aijx^k_ij− b_i



, ˆx^k+1_ij = x^k_ij for j 6= i, (37) {eq:JORstep1}{eq:JORstep1}

and

x^k+1_i =

n

X

j=1

w_ijxˆ^k+1_j . (38) {eq:JORstep2}{eq:JORstep2}

In the rest of the section we refer to the method defined by equations (37), (38) as DFIX - JOR, and we choose the relaxation parameter α in (37) as

(21)

2

kD⁻¹Ak∞ where D = diag(a11, . . . , ann). The methods for distributed optimization DIGing, EXTRA and SVL are applied to solve the unconstrained problem with quadratic objective function given by ¹₂x^TAx − b^Tx, which is equivalent to finding a solution of (1). The step-size parameter for DIGing and EXTRA is chosen as η = _3L¹ where L = maxi=1:n2kAik²₂, while the parameters for SVL method are computed through the procedure described in [22]. We remark that while these choices of the relaxation parameter for DFIX and the step-size η for DIGing and EXTRA can be easily computed in the distributed framework, the computation of the optimal parameters for SVL requires nowledge of the extremal eigenvalues of the matrix A and the spectral gap of the consensus matrix W. Finally, Projection method deals with the linear system (1) directly and it does not require the computation of any additional parameter, but it requires for each agent i to define the local initial vector x⁰_i as a solution of the equations that the node holds.

5.1 Simple Kriging problem

The first problem we consider is Simple Kriging [6]. Let us consider a phys- ical process modeled as a spatial random field and assume that a network of sensors is given in the region of interest, taking measurements of the field.

The goal is to estimate the field in any given point of the region. Assuming that the field is Gaussian and stationary, and that the expected value and covariance function are known at any point, this kind of problem can be solved by Simple Kriging method.

Denote with Z(s) the value of the random field at the point s, and with µ(s) its expected value, which is assumed to be known. Moreover, by the stationarity assumption, we have that the covariance between the value of Z at two points is given by

Cov(Z(s₁), Z(s₂)) = K(ks₁− s₂k₂)

for some nonnegative function K. Given {s₁, . . . , s_n} ⊂ R² the positions in space of the n sensors of the network, let {Z(s₁), . . . , Z(s_n)} be the sampled values at those points and define the covariance matrix A = [aij] ∈ R^n×n as

a_ij = K(ks_i− s_jk₂).

Now, given a point ¯s where we want to estimate the field, we define the vector b ∈ Rⁿ as

bi= K(ksi− ¯sk2).

(22)

The predicted value of Z(¯s) is then given by ˆ

p(s) := µ(¯s) +

n

X

i=1

x_i(Z(s_i) − µ(s_i))

where (x₁, . . . , x_n) is the approximate solution of the linear system

Ax = b. (39) {eq:System}{eq:System}

Clearly, the matrix W plays an important role in the DFIX - JOR method. So let us first illustrate the influence of connectivity within the network in terms of communication traffic and computational cost for the above described kriging problem, with covariance function given by

K(t) := exp(−5t²). (40) {eq:covariance}{eq:covariance}

We assume a set {s1, . . . , s100} ⊂ [−30, 30]² of agents is given and for any m ∈ {2, 4, . . . , 48, 50} we take the m-regular graph with vertices {s1, . . . , s100}.

That is, given the value of m, we define the network so that each node has degree m. The matrix W is defined using the Metropolis weights [27] which in the m-regular case are given by

wij =

((m + 1)⁻¹ if j = i or j ∈ O_i

0 otherwise (41) {eq:metropolisregular}{eq:metropolisregular}

For every value of the degree m we apply DFIX-JOR method to solve Ax = b.

At Figure 1 and 2 we plot the number of iterations performed by the method and the total communication cost, respectively, until the stopping criterion

i=1,...,nmax kAx^k_i − bk ≤ 10⁻⁴ (42) {termcond}{termcond}

is satisfied, for graphs of increasing degree. In other words we are asking that each node solves the system with the residual tolerance of 10⁻⁴. The communication cost is computed as follows. At each iteration, Step 1 does not require any communication between the agents, while in Step 2 node i shares x^k_i with all the agents in its neighbourhood. The per-iteration traffic is thus given by n²m = 2|E |n, where E is the set of edges of the underlying network and m is the degree. Note that here we implicitly assume that there is a dedicated communication link between any pair of agents. This reflects practical scenarios where dedicated peer-to-peer channels are ensured, e.g.,

(23)

through frequency division multiple access or similar schemes. In the other tests that we present we will also consider the broadcasting scenario: in that case, the per-iteration communication cost is independent to the number of edges of the network and it is given by the number of nodes times the size of the shared vectors, thus it is proportional to the number of iterations performed.

From Figures 1 and 2 we can see that, as the degree of the network increases, the number of iterations required to satisfy (42) decreases, while the total communication traffic first decreases then increases again. This behaviour can be explained as follows. As the connectivity of the graph improves, the local information is distributed through the network more efficiently, and a smaller number of iterations is necessary. On the other hand, if the degree is larger, the consensus step (20) of the algorithm requires each node to share its local vector with a larger number of neighbours, yielding a higher communication traffic at each iteration. The fact that the overall communication traffic (Figure 2) is nonmonotone suggests that for large values of the degree, the decrease in the number of iterations in not enough to balance the higher per-iteration traffic.

0 10 20 30 40 50

degree 0

0.5 1 1.5 2 2.5

iterations

10⁴

DJOR

Figure 1: Number of iterations Figure 2: Communication cost

Let us now compare the DFIX - JOR withDIGing [14,19], EXTRA [21], SVL [22] and Projection method [15]. We consider a 10 × 10 grid of nodes located at {s1, . . . , s100} ⊂ [−3, 3]²and, given a communication radius R > 0 we define the network so that nodes i and j are neighbours if and only if their distance is smaller than R. The linear system that we consider is derived by the kriging problem described at the beginning of this section. That is, we consider again Ax = b with

aij = K(ksi− s_jk₂), bi = K(ksi− ¯sk2) (43) {kriging2}{kriging2}