Preprint Article

Distributed Jacobi-Proximal ADMM for Consensus Convex Optimization

This version is not peer-reviewed. Submitted: 30 January 2024. Posted: 31 January 2024.
Abstract
In this paper, a distributed algorithm is proposed to solve a consensus convex optimization problem. It is a Jacobi-proximal alternating direction method of multipliers with a damping parameter $\gamma$ in the iteration of multiplier. Compared with existing algorithms, it has the following nice properties: (1) The restriction on proximal matrix is relaxed substantively, thus alleviating the weight of the proximal term. Therefore, the algorithm has a faster convergence speed. (2) The convergence analysis of the algorithm is established for any damping parameter $\gamma\in(0,2]$, which is larger ones in the literature. In addition, some numerical experiments and an application to a logistic regression problem are provided to validate the effectiveness and the characteristics of the proposed algorithm.
Subject: Computer Science and Mathematics  -   Computational Mathematics

1. Introduction

Consider the following consensus convex optimization problem:
$$\min_{y}\ \sum_{i=1}^{n} f_i(y), \qquad (1.1)$$
where $y\in\mathbb{R}^m$ is the global optimization variable, n is the number of agents in the multi-agent system and $f_i:\mathbb{R}^m\to\mathbb{R}$ $(i=1,\dots,n)$ are convex functions. Each $f_i$ is known only by agent i, and the agents cooperatively solve the consensus optimization problem. Many problems encountered in machine learning [1] and power networks [2] can be posed in the form of model (1.1).
There are two types of distributed algorithms for solving problem (1.1): continuous-time algorithms [3,4,5,6] and discrete-time algorithms; the latter can be divided into primal algorithms and dual algorithms. In primal algorithms, each agent takes a (sub)gradient-related step and averages its local solution with those of its neighbors [7,8,9]. One great advantage of these methods is their low computation burden, but slow convergence and low accuracy are two strikes against them. Typical dual algorithms include the augmented Lagrangian method [10] and the alternating direction method of multipliers (ADMM) [11,12,13,14,15,16], in which each agent solves a subproblem at each iteration, which is responsible for a high computation burden. However, their ability to converge quickly to exact optimal solutions makes up for it.
The ADMM algorithm has attracted significant research interest in recent years. With regard to distributed ADMM algorithms, almost all developments begin with transforming problem (1.1) into an equivalent form by introducing a local copy $x_i$ for each agent $i=1,2,\dots,n$ and enforcing the consensus $x_1=x_2=\dots=x_n$. For star networks, the reformulation of problem (1.1) can be written as follows:
$$\min_{x}\ f(x):=\sum_{i=1}^{n} f_i(x_i) \quad \text{subject to} \quad x_i=\bar{x},\ \forall i,$$
where $x=[x_1^T,\dots,x_n^T]^T$ and $\bar{x}$ is the so-called consensus variable. Considerable attention has been paid to this formulation; see [11,12] for details.
A central agent is required in the star network, and thus the algorithms in [11,12] have a high communication burden and low fault tolerance. This has led to growing research interest in general connected networks, for which the consensus optimization problem (1.1) can be rewritten in the following compact form:
$$\min_{x}\ f(x):=\sum_{i=1}^{n} f_i(x_i) \quad \text{subject to} \quad Ax=0 \ \ \text{or} \ \ Ax+Bz=0,$$
where $x=[x_1^T,\dots,x_n^T]^T$, A and B are matrices related to the network structure and z is a slack variable. For this kind of problem, Wei and Ozdaglar [13] proposed a distributed Gauss-Seidel ADMM algorithm and proved that its convergence rate is $O(1/k)$ when the objective functions $f_i$ $(i=1,\dots,n)$ are convex. In that algorithm, agents can only update in order. To save the waiting time of the agents in [13], Yan [14] proposed a parallel ADMM algorithm, which adopts the Jacobi iteration. Besides, distributed ADMM algorithms for nonconvex but differentiable problems are also established in [15,16].
In addition to the algorithms in [11,12,13,14,15,16], several other ADMM algorithms can also solve problem (1.1). These algorithms were originally designed to solve multi-block separable problems, which can be cast as
$$\min_{x}\ \sum_{i=1}^{n} f_i(x_i) \quad \text{subject to} \quad A_1x_1+\dots+A_nx_n=c,$$
where $x=[x_1^T,\dots,x_n^T]^T$. A wide variety of proximal ADMM algorithms have been proposed for this formulation. Research on these algorithms mainly focuses on the proximal matrix $P_i$ and the damping parameter $\gamma$. Deng et al. [17] presented a parallel ADMM algorithm in which the proximal matrix $P_i$ is required to satisfy $P_i \succeq \big(\frac{n}{2-\gamma}-1\big)\rho A_i^TA_i$, where $0<\gamma<2$. There are two specific choices for the proximal matrix $P_i$ in [18]: (1) the standard proximal matrix $P_i=\tau_i I$; (2) the linearized proximal matrix $P_i=\tau_i I-\rho A_i^TA_i$. Therefore, the condition in [17] reduces to
$$P_i=\tau_i I,\ \ \tau_i > \Big(\frac{n}{2-\gamma}-1\Big)\rho\|A_i\|^2; \qquad P_i=\tau_i I-\rho A_i^TA_i,\ \ \tau_i > \frac{n}{2-\gamma}\,\rho\|A_i\|^2.$$
Afterwards, Sun and Sun [19] came up with an improved proximal ADMM algorithm with partially parallel splitting, where $P_i=\tau_i I-\rho A_i^TA_i$ and a lower bound of the proximal parameter is given by $\tau_i > \frac{4+\max\{1-\gamma,\ \gamma^2-\gamma\}}{5}(n-1)\rho\|A_i\|^2$, where $0<\gamma<\frac{1+\sqrt{5}}{2}$.
Inspired by the works in [13,14,17,19], this paper puts forward a distributed Jacobi-proximal ADMM algorithm to solve the consensus convex optimization problem (1.1) over a general connected network. Compared with state-of-the-art algorithms, the proposed one has the following outstanding features.
(1) Compared with the algorithm in [13], the optimization variables of all agents can be updated simultaneously. Hence, the waiting time is saved.
(2) Compared with [14], only half as many dual variables are used in the proposed algorithm. Therefore, the communication burden among agents and the storage cost of each agent are reduced.
(3) The proximal matrix $P_i$ of the presented algorithm is smaller than those in [17,19]. Thus, the distributed Jacobi-proximal ADMM algorithm is favorable in light of the general principle given by Fazel et al. [20] that the proximal matrix $P_i$ should be as small as possible. Besides, the range of the damping parameter $\gamma$ in the proposed algorithm is larger than that of [19].
The rest of this paper is organized as follows. In Section 2, the equivalent form of the consensus convex optimization problem (1.1) is introduced. In addition, based on this equivalent form, a distributed Jacobi-proximal ADMM algorithm is proposed. Section 3 supplies the convergence analysis of the algorithm. In Section 4, extensive numerical experiments are provided to verify the effectiveness of the proposed algorithm. Moreover, the impacts of the penalty parameter, damping parameter and connectivity ratio on the algorithm are investigated. In Section 5, the proposed algorithm is applied to a logistic regression problem and its numerical results are compared with those in [17]. Finally, the conclusions of this paper are presented in Section 6.

2. Problem Formulation and Distributed Jacobi-Proximal ADMM Algorithm

In this section, some notations related to the network are introduced, and the consensus convex optimization problem (1.1) is reformulated so that it can be solved by ADMM.
The network topology of the multi-agent system is assumed to be a general undirected connected graph $G=\{V,E\}$, where V denotes the set of agents, E denotes the set of edges, $|V|=n$ and $|E|=l$. The agents are labeled from 1 to n. The edge between agents i and j with $i<j$ is represented by $(i,j)$ or $e_{ij}$, and $(i,j)\in E$ means that agents i and j can exchange data with each other. The neighbors of agent i are denoted by $N(i):=\{j\in V \mid (i,j)\in E \text{ or } (j,i)\in E\}$ and $d_i=|N(i)|$.
The edge-node incidence matrix of the network G is denoted by $\tilde{A}\in\mathbb{R}^{l\times n}$. The row of $\tilde{A}$ that corresponds to the edge $e_{ij}$ is denoted by $[\tilde{A}]_{e_{ij}}$, which is defined by
$$[\tilde{A}]_{e_{ij},k} = \begin{cases} 1, & \text{if } k=i,\\ -1, & \text{if } k=j,\\ 0, & \text{otherwise}. \end{cases}$$
Here, the edges of the network are sorted by the order of their corresponding agents. For instance, the edge-node incidence matrix of the network G in Fig. 1 is given by
$$\tilde{A} = \begin{bmatrix} 1 & -1 & 0 & 0\\ 1 & 0 & -1 & 0\\ 0 & 1 & -1 & 0\\ 0 & 1 & 0 & -1\\ 0 & 0 & 1 & -1 \end{bmatrix}.$$
Figure 1. An example of the network G.
According to the edge-node incidence matrix, the extended edge-node incidence matrix A of the network G is given by
$$A := \tilde{A}\otimes I_m = \begin{bmatrix} \tilde{a}_{11}I_m & \cdots & \tilde{a}_{1n}I_m\\ \vdots & & \vdots\\ \tilde{a}_{l1}I_m & \cdots & \tilde{a}_{ln}I_m \end{bmatrix} \in \mathbb{R}^{ml\times mn},$$
where $\otimes$ denotes the Kronecker product. Obviously, A is a block matrix with $l\times n$ blocks of size $m\times m$.
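These definitions are easy to check numerically. The sketch below is illustrative only: the edge list is assumed to match the graph of Fig. 1, and the block size m = 2 is arbitrary. It builds $\tilde{A}$ and $A=\tilde{A}\otimes I_m$ and verifies the Kronecker identity $A^TA=(\tilde{A}^T\tilde{A})\otimes I_m$:

```python
import numpy as np

def incidence_matrix(n, edges):
    """Edge-node incidence matrix: the row for edge e_ij has +1 in
    column i and -1 in column j (edges listed with i < j)."""
    A = np.zeros((len(edges), n))
    for r, (i, j) in enumerate(edges):
        A[r, i], A[r, j] = 1.0, -1.0
    return A

# Edge list assumed to match the 4-agent, 5-edge graph of Fig. 1 (0-based labels).
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
A_tilde = incidence_matrix(4, edges)

m = 2                                    # dimension of each local variable
A = np.kron(A_tilde, np.eye(m))          # extended edge-node incidence matrix
assert A.shape == (len(edges) * m, 4 * m)
# The Kronecker structure gives A^T A = (A~^T A~) (x) I_m.
assert np.allclose(A.T @ A, np.kron(A_tilde.T @ A_tilde, np.eye(m)))
```

Each row of $\tilde{A}$ sums to zero, which is why the all-ones consensus vector always lies in the null space of A.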
By introducing a separate decision variable $x_i$ for each agent $i=1,2,\dots,n$, the consensus convex optimization problem (1.1) takes the following form:
$$\min_{x}\ f(x):=\sum_{i=1}^{n} f_i(x_i) \quad \text{subject to} \quad x_i=x_j,\ \forall (i,j)\in E, \qquad (2.1)$$
where $x=[x_1^T,x_2^T,\dots,x_n^T]^T\in\mathbb{R}^{nm}$. Clearly, problem (2.1) is equivalent to problem (1.1) if G is connected.
With the help of the extended edge-node incidence matrix A, the problem (2.1) can be rewritten in the following compact form:
$$\min_{x}\ f(x) \quad \text{subject to} \quad Ax=0. \qquad (2.2)$$
Divide the neighbors $N(i)$ of agent i into two sets: the predecessors $P(i):=\{j\in V \mid (j,i)\in E\}$ and the successors $S(i):=\{j\in V \mid (i,j)\in E\}$. The distributed Jacobi-proximal ADMM (DJP-ADMM) algorithm is described as Algorithm 1.
Algorithm 1: Distributed Jacobi-proximal ADMM Algorithm (DJP-ADMM)
For each agent i at iteration k, Algorithm 1 performs the updates
$$x_i^{k+1} := \arg\min_{x_i}\ f_i(x_i) + \frac{\rho}{2}\sum_{j\in P(i)}\Big\|x_j^k-x_i-\frac{1}{\rho}\lambda_{e_{ji}}^k\Big\|^2 + \frac{\rho}{2}\sum_{j\in S(i)}\Big\|x_i-x_j^k-\frac{1}{\rho}\lambda_{e_{ij}}^k\Big\|^2 + \frac{1}{2}\|x_i-x_i^k\|_{P_i}^2,$$
$$\lambda_{e_{ij}}^{k+1} := \lambda_{e_{ij}}^k - \gamma\rho\,(x_i^{k+1}-x_j^{k+1}), \quad j\in S(i).$$
Remark 1.
The parallel ADMM algorithm presented in [14] is shown as follows:
$$x_i^{k+1} := \arg\min_{x_i}\ f_i(x_i) + \frac{\rho}{2}\sum_{j\in N(i)}\Big\|x_j^k-x_i-\frac{1}{\rho}\lambda_{e_{ji}}^k\Big\|^2 + \frac{\rho}{2}\sum_{j\in N(i)}\Big\|x_i-x_j^k-\frac{1}{\rho}\lambda_{e_{ij}}^k\Big\|^2,$$
$$\lambda_{e_{ji}}^{k+1} := \lambda_{e_{ji}}^k - \rho\,(x_j^k-x_i^{k+1}), \quad j\in N(i). \qquad (2.3)$$
It is clear that the number of dual variables in (2.3) is twice that in DJP-ADMM. Thus, the communication burden among agents and the storage cost of each agent in Algorithm 1 are smaller than those in [14].

3. Convergence Analysis

In this section, some important notations and technical lemmas are given. Then, the convergence analysis of Algorithm 1 is investigated.
Let
$$\tilde{L} = \tilde{A}^T\tilde{A} \in \mathbb{R}^{n\times n}.$$
Remark 2.
Hong et al. [16] have pointed out that L ˜ is the sign Laplace matrix of the graph G.
The extended degree matrix and extended sign Laplace matrix of the network G are denoted by
$$D := \tilde{D}\otimes I_m \in \mathbb{R}^{mn\times mn}, \qquad L := \tilde{L}\otimes I_m \in \mathbb{R}^{mn\times mn},$$
where $\tilde{D}$ is the degree matrix of the graph G.
To simplify the notation, let
$$H = \mathrm{blkdiag}\Big(\frac{1}{2}(P_1+P_1^T),\ \dots,\ \frac{1}{2}(P_n+P_n^T)\Big) \in \mathbb{R}^{mn\times mn},$$
and
$$Q = H + \rho\,(\bar{A}\otimes I_m),$$
where A ¯ is the adjacency matrix of the graph G. To ensure the convergence of Algorithm 1, it is necessary to make an assumption about the matrix Q, which is shown below.
Assumption 1.
The matrix Q is a positive definite matrix.
Remark 3.
If the proximal matrices $P_i$ $(i=1,\dots,n)$ are symmetric, then Assumption 1 reduces to requiring that $P+\rho\,(\bar{A}\otimes I_m)$ is positive definite, where $P=\mathrm{blkdiag}(P_1,\dots,P_n)$. Therefore, $P=\rho D=\rho\,\tilde{D}\otimes I_m$ is a feasible choice, where $\tilde{D}$ is the degree matrix of the graph G. In this case, $P_i=\rho d_i I_m$.
Remark 4.
By the definition of Q, the matrix Q is symmetric positive definite under Assumption 1, and thus, there exists a matrix M such that
$$Q = M^TM. \qquad (3.6)$$
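Remarks 3 and 4 can be checked numerically. The sketch below uses illustrative assumptions (the Fig. 1 graph and $\rho=1$): it forms $Q=P+\rho(\bar{A}\otimes I_m)$ with the feasible choice $P_i=\rho d_iI_m$ and recovers a factor M with $Q=M^TM$ via a Cholesky factorization:

```python
import numpy as np

# Illustrative graph (assumed to match Fig. 1) and parameters.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
n, m, rho = 4, 2, 1.0

Abar = np.zeros((n, n))                  # adjacency matrix of G
for i, j in edges:
    Abar[i, j] = Abar[j, i] = 1.0
D_tilde = np.diag(Abar.sum(axis=1))      # degree matrix of G

P = rho * np.kron(D_tilde, np.eye(m))    # P_i = rho * d_i * I_m (Remark 3)
Q = P + rho * np.kron(Abar, np.eye(m))   # Q = H + rho * (Abar (x) I_m); here H = P

# For this graph Q is symmetric positive definite, so Cholesky succeeds.
M = np.linalg.cholesky(Q).T              # upper factor: Q = M^T M (Remark 4)
assert np.allclose(M.T @ M, Q)
```

Note that `np.linalg.cholesky` returns the lower-triangular factor, so its transpose plays the role of M in (3.6).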
According to the convexity of the objective function, we have the following result.
Lemma 1.
Assume that $\{(x^k,\lambda^k)\}$ is the sequence produced by Algorithm 1 for the problem (2.2), where $x^k=[(x_1^k)^T,(x_2^k)^T,\dots,(x_n^k)^T]^T$ and $\lambda^k=[\lambda_{e_{ij}}^k]_{e_{ij}\in E}$. Then one has
$$f(x) - f(x^{k+1}) - (x-x^{k+1})^TA^T\lambda^{k+1} + (x-x^{k+1})^TQ(x^{k+1}-x^k) - \rho(\gamma-1)(x-x^{k+1})^TLx^{k+1} \ge 0, \quad \forall x\in\mathbb{R}^{mn}. \qquad (3.7)$$
Proof. 
Define $g_i^k:\mathbb{R}^m\to\mathbb{R}$ $(i=1,\dots,n)$ by
$$g_i^k(x_i) := \frac{\rho}{2}\sum_{j\in P(i)}\Big\|x_j^k-x_i-\frac{1}{\rho}\lambda_{e_{ji}}^k\Big\|^2 + \frac{\rho}{2}\sum_{j\in S(i)}\Big\|x_i-x_j^k-\frac{1}{\rho}\lambda_{e_{ij}}^k\Big\|^2 + \frac{1}{2}\|x_i-x_i^k\|_{P_i}^2.$$
Using the iteration of x in Algorithm 1, one can conclude that $x_i^{k+1}$ is the minimizer of $f_i+g_i^k$, i.e.,
$$x_i^{k+1} := \arg\min_{x_i}\ f_i(x_i)+g_i^k(x_i).$$
Therefore, there exists a subgradient $h(x_i^{k+1})\in\partial f_i(x_i^{k+1})$ such that $h(x_i^{k+1})+\nabla g_i^k(x_i^{k+1})=0$. Then
$$(x_i-x_i^{k+1})^T\big(h(x_i^{k+1})+\nabla g_i^k(x_i^{k+1})\big) = 0, \quad \forall x_i\in\mathbb{R}^m. \qquad (3.8)$$
Due to the convexity of $f_i$, we have
$$f_i(x_i) \ge f_i(x_i^{k+1}) + (x_i-x_i^{k+1})^Th(x_i^{k+1}).$$
This together with (3.8) implies that
$$f_i(x_i) - f_i(x_i^{k+1}) + (x_i-x_i^{k+1})^T\nabla g_i^k(x_i^{k+1}) \ge 0.$$
Substituting the gradient $\nabla g_i^k$ of the function $g_i^k$ into the above inequality, we have
$$f_i(x_i) - f_i(x_i^{k+1}) + (x_i-x_i^{k+1})^T\bigg[-\rho\sum_{j\in P(i)}\Big(x_j^k-x_i^{k+1}-\frac{1}{\rho}\lambda_{e_{ji}}^k\Big) + \rho\sum_{j\in S(i)}\Big(x_i^{k+1}-x_j^k-\frac{1}{\rho}\lambda_{e_{ij}}^k\Big) + \frac{1}{2}(P_i+P_i^T)(x_i^{k+1}-x_i^k)\bigg] \ge 0.$$
From the iteration of the multipliers, one can obtain that
$$\begin{aligned} -\rho\sum_{j\in P(i)}\Big(x_j^k-x_i^{k+1}-\frac{1}{\rho}\lambda_{e_{ji}}^k\Big) &= \sum_{j\in P(i)}\big[\lambda_{e_{ji}}^k - \rho(x_j^k-x_i^{k+1})\big] \\ &= \sum_{j\in P(i)}\big[\lambda_{e_{ji}}^k - \gamma\rho(x_j^{k+1}-x_i^{k+1}) + \gamma\rho(x_j^{k+1}-x_i^{k+1}) - \rho(x_j^k-x_i^{k+1})\big] \\ &= \sum_{j\in P(i)}\big[\lambda_{e_{ji}}^{k+1} + \gamma\rho(x_j^{k+1}-x_i^{k+1}) - \rho(x_j^{k+1}-x_i^{k+1}) + \rho(x_j^{k+1}-x_i^{k+1}) - \rho(x_j^k-x_i^{k+1})\big] \\ &= \sum_{j\in P(i)}\big[\lambda_{e_{ji}}^{k+1} + \rho(\gamma-1)(x_j^{k+1}-x_i^{k+1}) + \rho(x_j^{k+1}-x_j^k)\big]. \end{aligned}$$
Similarly,
$$\rho\sum_{j\in S(i)}\Big(x_i^{k+1}-x_j^k-\frac{1}{\rho}\lambda_{e_{ij}}^k\Big) = \sum_{j\in S(i)}\big[-\lambda_{e_{ij}}^{k+1} + \rho(\gamma-1)(x_j^{k+1}-x_i^{k+1}) + \rho(x_j^{k+1}-x_j^k)\big].$$
Hence,
$$f_i(x_i) - f_i(x_i^{k+1}) + (x_i-x_i^{k+1})^T\Big(\sum_{j\in P(i)}\lambda_{e_{ji}}^{k+1} - \sum_{j\in S(i)}\lambda_{e_{ij}}^{k+1}\Big) + (x_i-x_i^{k+1})^T\bigg[\rho(\gamma-1)\sum_{j\in N(i)}(x_j^{k+1}-x_i^{k+1}) + \rho\sum_{j\in N(i)}(x_j^{k+1}-x_j^k) + \frac{1}{2}(P_i+P_i^T)(x_i^{k+1}-x_i^k)\bigg] \ge 0.$$
By the definition of the matrix A, we simplify the above inequality as follows:
$$f_i(x_i) - f_i(x_i^{k+1}) - (x_i-x_i^{k+1})^T[A]_i^T\lambda^{k+1} + (x_i-x_i^{k+1})^T\bigg[\rho(\gamma-1)\sum_{j\in N(i)}(x_j^{k+1}-x_i^{k+1}) + \rho\sum_{j\in N(i)}(x_j^{k+1}-x_j^k) + \frac{1}{2}(P_i+P_i^T)(x_i^{k+1}-x_i^k)\bigg] \ge 0.$$
Summing over i then gives
$$\sum_{i=1}^n f_i(x_i) - \sum_{i=1}^n f_i(x_i^{k+1}) - \sum_{i=1}^n (x_i-x_i^{k+1})^T[A]_i^T\lambda^{k+1} + \sum_{i=1}^n (x_i-x_i^{k+1})^T\bigg[\rho(\gamma-1)\sum_{j\in N(i)}(x_j^{k+1}-x_i^{k+1}) + \rho\sum_{j\in N(i)}(x_j^{k+1}-x_j^k) + \frac{1}{2}(P_i+P_i^T)(x_i^{k+1}-x_i^k)\bigg] \ge 0.$$
By the definition of the matrices A and D, we have
$$\sum_{i=1}^n (x_i-x_i^{k+1})^T[A]_i^T\lambda^{k+1} = (x-x^{k+1})^TA^T\lambda^{k+1}, \qquad (3.9)$$
$$\sum_{i=1}^n d_i\,(x_i-x_i^{k+1})^Tx_i^{k+1} = (x-x^{k+1})^TDx^{k+1}.$$
In addition,
$$\sum_{i=1}^n (x_i-x_i^{k+1})^T\sum_{j\in N(i)}x_j^{k+1} = \big[(x_1-x_1^{k+1})^T,\dots,(x_n-x_n^{k+1})^T\big]\Big[\sum_{j\in N(1)}(x_j^{k+1})^T,\dots,\sum_{j\in N(n)}(x_j^{k+1})^T\Big]^T = (x-x^{k+1})^T(\bar{A}\otimes I_m)x^{k+1},$$
where $\bar{A}$ is the adjacency matrix of the graph G. The above two relations indicate that
$$\sum_{i=1}^n (x_i-x_i^{k+1})^T\Big(\sum_{j\in N(i)}x_j^{k+1} - d_ix_i^{k+1}\Big) = (x-x^{k+1})^T(\bar{A}\otimes I_m - D)x^{k+1}.$$
Therefore, by the definition of the extended sign Laplace matrix L, one can conclude that
$$\sum_{i=1}^n (x_i-x_i^{k+1})^T\sum_{j\in N(i)}(x_j^{k+1}-x_i^{k+1}) = \sum_{i=1}^n (x_i-x_i^{k+1})^T\Big(\sum_{j\in N(i)}x_j^{k+1} - d_ix_i^{k+1}\Big) = -(x-x^{k+1})^TLx^{k+1}. \qquad (3.10)$$
Analogously,
$$\sum_{i=1}^n (x_i-x_i^{k+1})^T\sum_{j\in N(i)}(x_j^{k+1}-x_j^k) = (x-x^{k+1})^T(\bar{A}\otimes I_m)(x^{k+1}-x^k). \qquad (3.11)$$
Besides, by the definition of the matrix Q, we have
$$\sum_{i=1}^n (x_i-x_i^{k+1})^T\bigg[\frac{1}{2}(P_i+P_i^T)(x_i^{k+1}-x_i^k) + \rho\sum_{j\in N(i)}(x_j^{k+1}-x_j^k)\bigg] = (x-x^{k+1})^TQ(x^{k+1}-x^k). \qquad (3.12)$$
Thus, recalling (3.9)-(3.12), inequality (3.7) holds. □
The non-negativity of the norm is very important in the subsequent convergence analysis. To this end, certain terms in Lemma 1 will be converted into norm form. To simplify some expressions in the proofs of the following lemmas, denote $V^k$ by
$$V^k = \frac{1}{2\rho\gamma}\|\lambda^k\|^2 + \frac{1}{2}\|M(x^k-x^*)\|^2, \qquad (3.13)$$
where M is defined in (3.6).
Under Assumption 1, we can get the following lemma.
Lemma 2.
Assume that $\{(x^k,\lambda^k)\}$ is the sequence produced by Algorithm 1 for the problem (2.2), where $x^k=[(x_1^k)^T,(x_2^k)^T,\dots,(x_n^k)^T]^T$ and $\lambda^k=[\lambda_{e_{ij}}^k]_{e_{ij}\in E}$. Then under Assumption 1, one has the following equality:
$$(x^{k+1})^TA^T\lambda^{k+1} + (x^*-x^{k+1})^TQ(x^{k+1}-x^k) = V^k - V^{k+1} - \frac{\rho\gamma}{2}\|Ax^{k+1}\|^2 - \frac{1}{2}\|M(x^{k+1}-x^k)\|^2, \qquad (3.14)$$
where $Q=M^TM$.
Proof. 
To prove (3.14), we first claim that
$$(x^{k+1})^TA^T\lambda^{k+1} = \frac{1}{2\rho\gamma}\big(\|\lambda^k\|^2-\|\lambda^{k+1}\|^2\big) - \frac{\rho\gamma}{2}\|Ax^{k+1}\|^2, \qquad (3.15)$$
$$(x^*-x^{k+1})^TQ(x^{k+1}-x^k) = \frac{1}{2}\big(\|M(x^k-x^*)\|^2-\|M(x^{k+1}-x^*)\|^2\big) - \frac{1}{2}\|M(x^{k+1}-x^k)\|^2. \qquad (3.16)$$
Indeed, by the iteration of the multiplier, $\lambda^{k+1}=\lambda^k-\gamma\rho Ax^{k+1}$, we know
$$(x^{k+1})^TA^T\lambda^{k+1} = (x^{k+1})^TA^T\lambda^k - \rho\gamma\|Ax^{k+1}\|^2, \qquad (3.17)$$
and
$$\frac{1}{2\rho\gamma}\big(\|\lambda^k\|^2-\|\lambda^{k+1}\|^2\big) = (x^{k+1})^TA^T\lambda^k - \frac{\rho\gamma}{2}\|Ax^{k+1}\|^2. \qquad (3.18)$$
Therefore, equalities (3.17) and (3.18) indicate that equality (3.15) is valid. In addition, by rearranging some of the terms, we obtain
$$\|M(x^k-x^*)\|^2 - \|M(x^{k+1}-x^*)\|^2 = \|Mx^k\|^2 - \|Mx^{k+1}\|^2 + 2(Mx^*)^TM(x^{k+1}-x^k),$$
and
$$2(x^*-x^{k+1})^T(M^TM)(x^{k+1}-x^k) = 2(Mx^*)^TM(x^{k+1}-x^k) - 2\|Mx^{k+1}\|^2 + 2(Mx^k)^TMx^{k+1}.$$
Combining the above two equalities yields
$$\begin{aligned} 2(x^*-x^{k+1})^T(M^TM)(x^{k+1}-x^k) &= \|M(x^k-x^*)\|^2 - \|M(x^{k+1}-x^*)\|^2 + 2(Mx^k)^TMx^{k+1} - \big(\|Mx^k\|^2+\|Mx^{k+1}\|^2\big) \\ &= \|M(x^k-x^*)\|^2 - \|M(x^{k+1}-x^*)\|^2 - \|M(x^{k+1}-x^k)\|^2. \end{aligned}$$
Taking into account Q = M T M , we can get the equality (3.16). Consequently, by (3.15) and (3.16), the equality (3.14) holds. □
With the help of the preceding two lemmas, the convergence result of Algorithm 1 can be established.
Theorem 1.
Assume that $\{(x^s,\lambda^s)\}$ is the sequence produced by Algorithm 1, where $x^s=[(x_1^s)^T,(x_2^s)^T,\dots,(x_n^s)^T]^T$ and $\lambda^s=[\lambda_{e_{ij}}^s]_{e_{ij}\in E}$. Let $y^k=\frac{1}{k}\sum_{s=0}^{k-1}x^{s+1}$ be the ergodic average of $x^s$ from step 1 to k, and let $x^*$ be the optimal solution of the problem (2.2). Then under Assumption 1, the following relation holds for any $k\ge 1$ and for $0<\gamma\le 2$:
$$0 \le f(y^k)-f(x^*) \le \frac{V^0}{k}, \qquad (3.19)$$
where $V^0$ given in (3.13) is a non-negative term. Furthermore,
$$\lim_{k\to+\infty}\ f(y^k)-f(x^*) = 0,$$
with the rate $O(1/k)$.
Proof. 
It follows from the optimality of $x^*$ that the first inequality in (3.19) is clearly true. Letting $x=x^*$ in inequality (3.7), one has
$$f(x^*) - f(x^{s+1}) - (x^*-x^{s+1})^TA^T\lambda^{s+1} + (x^*-x^{s+1})^TQ(x^{s+1}-x^s) - \rho(\gamma-1)(x^*-x^{s+1})^TLx^{s+1} \ge 0.$$
Taking into consideration that $L=A^TA$ and $Ax^*=0$, the above inequality can be rewritten as
$$f(x^*) - f(x^{s+1}) + (x^{s+1})^TA^T\lambda^{s+1} + (x^*-x^{s+1})^TQ(x^{s+1}-x^s) - \rho(1-\gamma)\|Ax^{s+1}\|^2 \ge 0.$$
By Lemma 2, one has
$$f(x^*) - f(x^{s+1}) + V^s - V^{s+1} \ge \frac{\rho\gamma}{2}\|Ax^{s+1}\|^2 + \rho(1-\gamma)\|Ax^{s+1}\|^2 + \frac{1}{2}\|M(x^{s+1}-x^s)\|^2,$$
and then, summing over $s=0,\dots,k-1$,
$$kf(x^*) - \sum_{s=0}^{k-1}f(x^{s+1}) + V^0 - V^k \ge \frac{1}{2}\sum_{s=0}^{k-1}\|M(x^{s+1}-x^s)\|^2 + \frac{\rho}{2}(2-\gamma)\sum_{s=0}^{k-1}\|Ax^{s+1}\|^2.$$
Since $V^k\ge 0$ for any k, the following inequality holds for $0<\gamma\le 2$:
$$kf(x^*) - \sum_{s=0}^{k-1}f(x^{s+1}) + V^0 \ge 0.$$
Since the function f is convex, $\sum_{s=0}^{k-1}f(x^{s+1}) \ge kf\big(\frac{1}{k}\sum_{s=0}^{k-1}x^{s+1}\big)$, and then, using $y^k=\frac{1}{k}\sum_{s=0}^{k-1}x^{s+1}$, we have
$$kf(x^*) - kf(y^k) + V^0 \ge 0,$$
i.e.,
$$f(y^k) - f(x^*) \le \frac{V^0}{k}. \qquad (3.22)$$
Therefore, inequality (3.19) stands. Furthermore, inequality (3.22) implies that
$$\limsup_{k\to+\infty}\ f(y^k)-f(x^*) \le 0.$$
On the other hand, from the optimality of $x^*$, we have
$$f(y^k)-f(x^*) \ge 0 \quad \text{for all } k.$$
As a result, lim k + f ( y k ) f ( x * ) = 0 and the proof is completed. □
Remark 5.
Theorem 1 gives a theoretical upper bound for $f(y^k)-f^*$, which provides an error estimate for the optimal value $f^*$ at each iteration k. The upper bound consists of two additive terms, both of which approach zero at the rate $O(1/k)$. In addition, Theorem 1 implies that $f(y^k)$ converges to the optimal value $f^*$ asymptotically. Furthermore, if at least one function $f_i$ is strongly convex, then the optimal solution $x^*$ is unique, and thus $x^k$ asymptotically approaches $x^*$.
Remark 6.
When solving the consensus optimization problem (1.1), the convergence condition of Algorithm 1 is less conservative than that in [17]: by Remark 3, the convergence of Algorithm 1 can be guaranteed if $P_i$ is symmetric and $P_i\succeq\rho d_iI_m$, while the algorithm in [17] requires that $P_i$ be a symmetric positive semi-definite matrix with $P_i\succeq\big(\frac{n}{2-\gamma}-1\big)\rho A_i^TA_i=\big(\frac{n}{2-\gamma}-1\big)\rho d_iI_m$ $(0<\gamma<2)$.

4. Numerical Experiments

In this section, some numerical experiments are provided to show the validity of Algorithm 1. First, the convergence property of Algorithm 1 is verified. Then the impacts of the penalty parameter $\rho$, the damping parameter $\gamma$ and the connectivity ratio d on Algorithm 1 are investigated.
Each edge of the connected network G is generated randomly. The connectivity ratio of the network G is denoted by $d=\frac{2l}{n(n-1)}$. Consider the following consensus optimization problem given in [21]:
$$\min_{y}\ \frac{1}{2}\sum_{i=1}^{n}(y-\theta_i)^2, \qquad (4.1)$$
where $y\in\mathbb{R}$. Apparently, the optimal solution of this problem is $y^*=\bar{\theta}=\frac{1}{n}\sum_{i=1}^{n}\theta_i$. The consensus optimization problem (4.1) can be reformulated into a distributed version:
$$\min_{x}\ f(x)=\frac{1}{2}\sum_{i=1}^{n}(x_i-\theta_i)^2 \quad \text{subject to} \quad x_i=x_j,\ \forall (i,j)\in E, \qquad (4.2)$$
where $x=[x_1,x_2,\dots,x_n]^T\in\mathbb{R}^n$ and f is convex. Therefore, Algorithm 1 can be used to solve the consensus optimization problem. For problem (4.2), each $\theta_i$ is randomly generated from the normal distribution N(0, 1).
The proximal matrix of Algorithm 1 is set to $P_i=\rho d_iI$. In this case, the iteration of x has the following closed-form solution:
$$x_i^{k+1} = \frac{\rho d_ix_i^k + \rho\sum_{j\in N(i)}x_j^k + \sum_{j\in S(i)}\lambda_{ij}^k - \sum_{j\in P(i)}\lambda_{ji}^k + \theta_i}{1+2\rho d_i},$$
where d i is the number of neighbors of the agent i.
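The closed-form update above makes the whole method easy to simulate. The following sketch is illustrative only: it uses a small 4-agent graph (assumed to match Fig. 1) with $\rho=\gamma=1$ rather than the paper's 50-agent setup, runs DJP-ADMM on problem (4.2) with $P_i=\rho d_iI$, and checks that every $x_i$ approaches the average $\bar{\theta}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Small illustrative network: 4 agents, 5 edges, each edge (i, j) with i < j.
n = 4
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
nbrs = [[] for _ in range(n)]
for i, j in edges:
    nbrs[i].append(j)
    nbrs[j].append(i)
d = [len(nbrs[i]) for i in range(n)]

theta = rng.normal(size=n)
rho, gamma = 1.0, 1.0
x = np.zeros(n)
lam = {e: 0.0 for e in edges}            # one multiplier per edge

for _ in range(5000):
    x_old = x.copy()                     # Jacobi update: all agents use x^k
    x = np.empty(n)
    for i in range(n):
        s = rho * d[i] * x_old[i] + rho * sum(x_old[j] for j in nbrs[i])
        s += sum(lam[(i, j)] for j in nbrs[i] if i < j)   # successors S(i)
        s -= sum(lam[(j, i)] for j in nbrs[i] if j < i)   # predecessors P(i)
        x[i] = (s + theta[i]) / (1.0 + 2.0 * rho * d[i])
    for i, j in edges:                   # damped multiplier iteration
        lam[(i, j)] -= gamma * rho * (x[i] - x[j])

assert np.allclose(x, theta.mean(), atol=1e-6)
```

The final assertion checks that all local variables reach consensus at the average $\bar{\theta}$, the known optimal solution of (4.1).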
A. Convergence Property
To illustrate the convergence property of Algorithm 1 for the consensus optimization problem (4.2), ten networks are generated. Each network has n = 50 agents, and the connectivity ratios of these networks are set to d = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 and 1.0, respectively. The algorithm parameters are set to $\rho=1$ and $\gamma=1$. The algorithm is stopped once $\|x^k-x^*\|$ reaches $10^{-16}$ or the number of iterations k reaches 3500, where $x^*$ is the optimal solution of problem (4.1).
Fig. 2 and Fig. 3 respectively depict how the relative error $\frac{\|x^k-x^*\|}{\|x^*\|}$ and the constraint violation $\|Ax^k\|$ vary with the iteration k. One can find that Algorithm 1 has high accuracy, since the relative error can achieve $10^{-13}$ and the constraint violation can achieve $10^{-16}$.
Figure 2. Relative error versus iteration.
Figure 3. Constraint violation versus iteration.
B. Algorithm Parameters ρ and γ
In this part, the impacts of the algorithm parameters $\rho$ and $\gamma$ on the convergence speed of Algorithm 1 are discussed. The networks are generated in the same way as in Part A. To explore the influence of the parameters $\rho$ and $\gamma$ on Algorithm 1, the convergence speed is measured by $\xi_{\varepsilon_0}=1/k_0$, where $\varepsilon_0>0$ and $k_0$ is the number of iterations required to achieve $\|x^{k_0}-x^*\|\le\varepsilon_0$. Here, the accuracy is set to $\varepsilon_0=10^{-6}$.
Choosing the damping parameter $\gamma=1$ and selecting different penalty parameters to solve problem (4.2), one obtains the relationship between the convergence speed $\xi_{\varepsilon_0}$ and the parameter $\rho$ displayed in Fig. 4. Obviously, if the penalty parameter $\rho$ is too large or too small, the convergence speed of the algorithm is slow. The penalty parameter $\rho$ can be selected from (0.01, 2). In general, a smaller connectivity ratio leads to a larger optimal penalty parameter $\rho^*$. As a consequence, when the network is sparse, it is better to select a larger penalty parameter, and when it is dense, a smaller penalty parameter is a good choice.
Figure 4. Convergence speed versus ρ .
To explore the influence of the parameter $\gamma$ on Algorithm 1, the penalty parameter is set to $\rho=1$ and the damping parameter is set to 60 different values. The numerical results are shown in Fig. 5. Obviously, the convergence speed of Algorithm 1 increases with the damping parameter and then remains roughly constant. Therefore, $\gamma=2$ is a good choice.
Figure 5. Convergence speed versus γ .
C. Connectivity Ratio
In this part, the effect of the connectivity ratio d on the convergence speed of Algorithm 1 is explored. From Fig. 4, one can find that the impact of the connectivity ratio on the convergence speed differs with the value of the penalty parameter $\rho$. Therefore, the penalty parameter is set to six different values: $\rho$ = 0.005, 0.01, 0.05, 0.1, 1 and 2.
We generate 30 networks with n = 50 agents, whose connectivity ratios are set to 30 different values: 1/30, 2/30, ..., 1. From Fig. 6, one can find that when the penalty parameter takes a smaller value, such as $\rho$ = 0.005, 0.01 or 0.05, the convergence speed of Algorithm 1 generally slows down as the connectivity ratio increases; Fig. 7 shows that the opposite is true when the penalty parameter takes a larger value, such as $\rho$ = 0.1, 1 or 2. It is worth noting that when the network is very sparse, for example d = 0.05, the convergence speed is slow no matter what value the penalty parameter takes. Therefore, on the premise of ensuring network connectivity, a few edges can be added to increase the information exchange between agents.
Figure 6. Convergence speed versus d.
Figure 7. Convergence speed versus d.

5. Application to A Logistic Regression Problem

In this section, the proposed distributed Jacobi-proximal ADMM algorithm is applied to a logistic regression problem, which is a widely used machine learning model[22,23].
The network G = {V, E} is generated with n = 50 agents. The connectivity ratio is set to d = 0.3 and the edges are generated randomly. The generated network is shown in Fig. 8. Each agent has $n_i$ training samples, denoted by $\{w_{ij},y_{ij}\}_{j=1}^{n_i}$, where $w_{ij}\in\mathbb{R}^p$ and $y_{ij}\in\{-1,1\}$.
Figure 8. The network of problem (5.1).
The distributed logistic regression problem is described as follows:
$$\min_{x}\ f(x) = \frac{1}{2}\|x\|^2 + \frac{1}{N}\sum_{i=1}^{n}\sum_{j=1}^{n_i}\log\big(1+e^{-y_{ij}w_{ij}^Tx}\big), \qquad (5.1)$$
where $N=\sum_{i=1}^{n}n_i$ is the total number of samples. The feature dimension is set to p = 3, the number of samples $n_i$ is generated from the uniform distribution U(1, 20), and the feature vector $w_{ij}$ is generated from the normal distribution N(0, 1). The generation rule of the label $y_{ij}$ is as follows:
$$y_{ij} = \begin{cases} 1, & \text{if } u_{ij}\ge 0.5,\\ -1, & \text{if } u_{ij}<0.5, \end{cases}$$
where $u_{ij}$ is generated from the uniform distribution U(0, 1).
The distributed logistic regression problem (5.1) can be formulated as
$$\min_{x}\ f(x)=\sum_{i=1}^{n}f_i(x_i) \quad \text{subject to} \quad x_i=x_j,\ \forall (i,j)\in E, \qquad (5.2)$$
where $x=[x_1^T,x_2^T,\dots,x_n^T]^T$ and $f_i(x_i)=\frac{1}{2n}\|x_i\|^2+\frac{1}{N}\sum_{j=1}^{n_i}\log\big(1+e^{-y_{ij}w_{ij}^Tx_i}\big)$. Obviously, problem (5.2) can be solved by Algorithm 1.
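The local objectives are easy to set up in code. The sketch below uses shrunken, illustrative data sizes (n = 5 agents instead of the paper's 50) as stand-ins for the Section 5 data model; it generates the data as described and checks the decomposition $f(x)=\sum_i f_i(x)$ at $x=0$, where every sample contributes $\log(2)/N$, so $f(0)=\log 2$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Shrunken, illustrative instance of the Section 5 data model.
n, p = 5, 3                               # agents, feature dimension
n_i = rng.integers(1, 21, size=n)         # samples per agent from U{1, ..., 20}
W = [rng.normal(size=(ni, p)) for ni in n_i]
u = [rng.uniform(size=ni) for ni in n_i]
y = [np.where(ui >= 0.5, 1.0, -1.0) for ui in u]   # label rule of Section 5
N = int(n_i.sum())

def f_local(i, x):
    """f_i(x) = ||x||^2 / (2n) + (1/N) * sum_j log(1 + exp(-y_ij w_ij^T x))."""
    margins = -y[i] * (W[i] @ x)
    return x @ x / (2.0 * n) + np.logaddexp(0.0, margins).sum() / N

x0 = np.zeros(p)
f0 = sum(f_local(i, x0) for i in range(n))
assert np.isclose(f0, np.log(2.0))        # each of the N samples adds log(2)/N
```

Using `np.logaddexp(0, m)` for $\log(1+e^m)$ avoids overflow for large margins, which matters once the subproblems are solved iteratively.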
The convergence path of Algorithm 1 is compared with that of the Jacobi-Proximal ADMM (JP-ADMM) algorithm in [17]. To investigate the performance of the two algorithms, the penalty parameter is set to $\rho$ = 0.01, 0.1 and 1, respectively. In addition, the damping parameter is set to two different values, $\gamma=1$ and $\gamma=\frac{3}{2}$. The proximal matrices of Algorithm 1 and of the algorithm in [17] are set to $P_i=\rho d_iI$ and $P_i=\big[\big(\frac{n}{2-\gamma}-1\big)\rho d_i+1\big]I$, respectively. Each algorithm is stopped once $\|x^k-x^{k-1}\|$ reaches $10^{-5}$ or the number of iterations k reaches 1000. From Fig. 9 and Fig. 10, one can find that the convergence speed of Algorithm 1 is significantly faster than that of the algorithm in [17].
Figure 9. Objective value $f^k$ ($\gamma=1$).
Figure 10. Objective value $f^k$ ($\gamma=\frac{3}{2}$).

6. Conclusions

In this paper, a distributed ADMM algorithm is put forward to solve a consensus convex optimization problem over a connected network. The proposed algorithm is a Jacobi-proximal ADMM algorithm whose proximal matrix is smaller than those of existing algorithms. The convergence of the algorithm is proved and its convergence rate is $O(1/k)$. Extensive numerical experiments are provided to verify the convergence of the algorithm. Besides, the impacts of the penalty parameter, damping parameter and connectivity ratio on the proposed algorithm are investigated. Finally, the proposed algorithm is applied to a logistic regression problem and its performance is compared with that of the ADMM algorithm in [17].

Acknowledgments

This research was supported by the National Natural Science Foundation of China (Grant number: 11801051) and the Natural Science Foundation of Chongqing (Grant number: cstc2019jcyj-msxmX0075).

References

  1. Y.L. Pan, Distributed optimization and statistical learning for large-scale penalized expectile regression, J. Korean Stat. Soc. 50 (2021) 290-314. [CrossRef]
  2. G. Chen, J.Y. Li, A fully distributed ADMM-based dispatch approach for virtual power plant problems, Appl. Math. Model. 58 (2018) 300-312. [CrossRef]
  3. G. Droge, H. Kawashima, M.B. Egerstedt, Continuous-time proportional-integral distributed optimisation for networked systems, J. Control Decis. 1 (2014) 191-213. [CrossRef]
  4. B. Gharesifard, J. Cortés, Continuous-time distributed convex optimization on weight-balanced digraphs, IEEE Trans. Autom. Control 59 (2014) 781-786.
  5. Y.N. Zhu, W.W. Yu, G.H. Wen, G.R. Chen, W. Ren, Continuous-time distributed subgradient algorithm for convex optimization with general constraints, IEEE Trans. Autom. Control 64 (2019) 1694-1701. [CrossRef]
  6. W. Zhu, H.B. Tian, Distributed convex optimization via proportional-integral-differential algorithm, Meas. Control 55 (2021) 13-20. [CrossRef]
  7. A. Nedic, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization, IEEE Trans. Autom. Control 54 (2009) 48-61. [CrossRef]
  8. C. Xi, U.A. Khan, Distributed subgradient projection algorithm over directed graphs, IEEE Trans. Autom. Control 62 (2017) 3986-3992. [CrossRef]
  9. S. Liu, Z.R. Zhang, L.H. Xie, Convergence rate analysis of distributed optimization with projected subgradient algorithm, Automatica 83 (2017) 162-169. [CrossRef]
  10. D. Jakovetić, J. Xavier, J.M.F. Moura, Cooperative convex optimization in networked systems: augmented Lagrangian algorithms with directed gossip communication, IEEE Trans. Signal Process. 59 (2011) 3889-3902.
  11. S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Found. Trends Mach. Learn. 3 (2010) 1-122.
  12. R. Zhang, J.T. Kwok, Asynchronous distributed ADMM for consensus optimization, in: Proceedings of the 31st International Conference on Machine Learning, 2014, pp. 3689-3697.
  13. E. Wei, A. Ozdaglar, Distributed alternating direction method of multipliers, in: Proceedings of the IEEE Conference on Decision and Control, 2012, pp. 5445-5450.
  14. J.Q. Yan, F.H. Guo, C.Y. Wen, G.Q. Li, Parallel alternating direction method of multipliers, Inf. Sci. 507 (2020) 185-196. [CrossRef]
  15. W. Shi, Q. Ling, K. Yuan, G. Wu, W. Yin, On the linear convergence of the ADMM in decentralized consensus optimization, IEEE Trans. Signal Process. 62 (2014) 1750-1761. [CrossRef]
  16. M. Hong, D. Hajinezhad, M. Zhao, Prox-PDA: The proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks, in: Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 2402-2433.
  17. W. Deng, M.J. Lai, Z.M. Peng, W.T. Yin, Parallel multi-block ADMM with o(1/k) convergence, J. Sci. Comput. 71 (2017) 712-736.
  18. W. Deng, W.T. Yin, On the global and linear convergence of the generalized alternating direction method of multipliers, J. Sci. Comput. 66 (2016) 889-916. [CrossRef]
  19. M. Sun, H.C. Sun, Improved proximal ADMM with partially parallel splitting for multi-block separable convex programming, Appl. Math. Comput. 58 (2018) 151-181. [CrossRef]
  20. M. Fazel, T.K. Pong, D.F. Sun, P. Tseng, Hankel matrix rank minimization with applications to system identification and realization, SIAM J. Matrix Anal. Appl. 34 (2013) 946-977. [CrossRef]
  21. M. Rabbat, R. Nowak, Distributed optimization in sensor networks, in: Proceedings of the third International Symposium on Information Processing in Sensor Networks, 2004, pp. 20-27.
  22. L.J. Wang, M. Guo, K. Sawada, J. Lin, J.C. Zhang, A comparative study of landslide susceptibility maps using logistic regression, frequency ratio, decision tree, weights of evidence and artificial neural network, Geosci. J. 20 (2016) 117-236. [CrossRef]
  23. B.Y. Kim, S.J. Shin, Principal weighted logistic regression for sufficient dimension reduction in binary classification, J. Korean Stat. Soc. 48 (2019) 194-206. [CrossRef]