Preprint
Article

Convergence Rate Analysis of Non-I.I.D. SplitFed Learning with Partial Worker Participation and Auxiliary Networks

This version is not peer-reviewed

Submitted: 02 September 2024
Posted: 04 September 2024

Abstract
In conventional Federated Learning (FL), clients work together to train a model managed by a central server, intending to speed up the learning process. However, this approach imposes significant computational and communication burdens on clients, particularly with complex models. Additionally, while FL strives to protect client privacy, the server's access to local and global models raises security concerns. To address these challenges, Split Learning (SL) separates the model into parts handled by the client and the server, though it suffers from inefficiencies due to sequential client participation. To overcome these issues, SplitFed Learning (SFL) was proposed, which combines the parallelism of FL with the model-splitting strategy of SL, enabling simultaneous training by multiple clients. Our main contribution is the theoretical analysis of SFL, which, for the first time, includes non-i.i.d. datasets, non-convex loss functions, and both full and partial client participation. We provide convergence proofs for a state-of-the-art SFL algorithm based on conventional convergence analysis assumptions for FL. Our results prove that we can recover the linear convergence rate of conventional FL for the SFL algorithm with the distinction that increasing the number of local steps or clients may not speed up the convergence in SFL.
Keywords: 
Subject: Computer Science and Mathematics - Computer Science

1. Introduction

In conventional Federated Learning (FL), several clients jointly train a model in parallel under the supervision of a server, which speeds up the learning process [1]. Given a central server and $m$ clients participating in the training, FL solves an optimization problem of the form:
\[
\min_{\tilde{x} \in \mathbb{R}^{\tilde{d}}} f(\tilde{x}) \triangleq \min_{\tilde{x} \in \mathbb{R}^{\tilde{d}}} \frac{1}{m} \sum_{i=0}^{m-1} F_i(\tilde{x})
\]
Here, $F_i(\tilde{x}) \triangleq \mathbb{E}_{\xi \sim \mathcal{D}_i}\left[F_i(\tilde{x}, \xi)\right]$ can be a non-convex loss function, and $\xi$ is a random sample from $\mathcal{D}_i$, the local dataset of client $i$. In FL training, the $m$ clients train on their local datasets. However, FL faces the challenge that clients must train the entire model, which places a considerable computational burden on them, particularly for complex, large-scale models. Additionally, gathering all client updates and broadcasting the aggregated model at each round can cause substantial communication overhead. While one of the principal aims of FL is to safeguard clients' privacy, the server retains access to both the clients' local models and the global model, prompting security concerns [2]. To address the computational limitations and further protect the client-side model, [3] pioneered Split Learning (SL), dividing the ML model into two parts: the client trains one portion of the model, while the server trains the remainder. However, according to [2], this method incurs notable training-time overhead, since only one client can engage in SL at any given time, leaving the others idle. To address this issue, they proposed SplitFed Learning (SFL), which integrates the parallel computation of clients from FL with the split-model benefits of SL. The convergence theory of the SFL framework, however, has not been thoroughly explored in the existing literature. Our primary contribution lies in establishing the theoretical underpinnings of SFL under the general assumptions of traditional FL, covering non-convex loss functions, non-independently-and-identically-distributed (non-i.i.d.) datasets, and both full and partial client participation in the SFL training process. Our proof applies to the state-of-the-art SFL algorithm developed by [4] under conventional assumptions in FL settings. We demonstrate that SFL can still recover the linear convergence rate of conventional FL.
However, changing the number of clients or local steps does not speed up the convergence.
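As a toy illustration of the objective above (not of the SFL algorithm itself), the following sketch minimizes $f(x) = \frac{1}{m}\sum_i F_i(x)$ with FedAvg-style rounds. The quadratic local losses, client count, and all names here are illustrative assumptions, chosen only so that the global minimizer is known in closed form.

```python
import numpy as np

# Minimal sketch of the FL objective f(x) = (1/m) * sum_i F_i(x), using
# quadratic local losses F_i(x) = 0.5 * ||x - c_i||^2 as a stand-in for the
# (generally non-convex) client objectives. Everything here is illustrative.
rng = np.random.default_rng(0)
m, d = 4, 3                        # m clients, model dimension d
centers = rng.normal(size=(m, d))  # c_i: each client's local optimum (non-i.i.d.)

def local_loss(x, i):
    return 0.5 * np.sum((x - centers[i]) ** 2)

def local_grad(x, i):
    return x - centers[i]

def global_loss(x):
    return np.mean([local_loss(x, i) for i in range(m)])

# One FedAvg-style round: each client runs K local gradient steps from the
# broadcast model, then the server averages the resulting models.
def fedavg_round(x_bar, K=5, lr=0.1):
    updated = []
    for i in range(m):
        x = x_bar.copy()
        for _ in range(K):
            x -= lr * local_grad(x, i)
        updated.append(x)
    return np.mean(updated, axis=0)

x = np.zeros(d)
for t in range(50):
    x = fedavg_round(x)
# For these quadratic losses the minimizer of f is the mean of the c_i.
assert np.allclose(x, centers.mean(axis=0), atol=1e-3)
```

The averaging step mirrors the server-side aggregation that SFL inherits from FL; the split and auxiliary networks discussed later change what each client trains, not this aggregation pattern.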

2. Related work

  • SL and FL
    The reference [5] introduces a personalized SL framework to address issues such as data leakage and non-i.i.d. datasets in decentralized learning. It proposes an optimal cut-layer selection method using multiplayer bargaining and the Kalai-Smorodinsky bargaining solution (KSBS). This approach efficiently balances training time, energy usage, and data privacy. Each device tailors its own model for non-i.i.d. datasets, while a common server-side model ensures robustness through generalization. Simulation results validate the framework's effectiveness in achieving optimal utility and addressing decentralized-learning challenges. However, the work does not address the communication overhead caused by transmitting the forward-propagation results at each local step. The reference [6] provides a convergence analysis for Sequential Split Learning (SSL), a variant of SL in which training is conducted sequentially, with clients trained one after another on heterogeneous data. It compares SSL with Federated Averaging (FedAvg), showing SSL's superiority on extremely heterogeneous data. In practice, however, if the data heterogeneity is mild, FedAvg outperforms SSL. Moreover, SSL still suffers from large communication overheads between the server and the clients.
  • SplitFed learning
    The reference [7] presents AdaSFL, a method designed to optimize model-training efficiency by controlling the local update frequency and batch size. Its theoretical analysis establishes convergence rates, which enable an adaptive algorithm for adjusting the update frequency and batch sizes of heterogeneous workers. However, clients must obtain back-propagation results from the server at each local update. Meanwhile, [8] recommends updating the client- and server-side models concurrently, using local-loss-based training and auxiliary networks designed specifically for split learning. This parallel training approach effectively reduces latency and eliminates the need for server-to-client communication. The paper includes a latency analysis for optimal model partitioning and offers guidelines for model splitting. Specifically, [4] developed a communication- and storage-efficient SFL approach in which each client trains a portion of the model and computes its local loss function using an auxiliary network, reducing communication overhead. Furthermore, the server model is trained on the sequence of forward-propagation results from the clients, so that only one copy of the server model is maintained at any given time. [8] suggested a similar framework, albeit with a key difference: each client possesses its own separate server model, and these models are aggregated to construct the global server model.
  • Auxiliary networks
    Neural-network training with back-propagation is hindered by the update-locking issue: layers must await the complete propagation of signals through the network before updating [9]. To address this, [9] proposed Decoupled Greedy Learning (DGL), a simpler training approach that greedily relaxes the joint training objective, showing significant effectiveness for CNNs in large-scale image classification. This method optimizes the training objective using auxiliary modules or replay buffers to reduce the communication delays caused by waiting for backward propagation. [10] addressed the backward update-lock constraint by introducing models that decouple modules through predictions of future computations within the network graph. These models use local information to predict the outcomes of subgraphs, focusing in particular on error gradients. By using synthetic gradients instead of true back-propagated gradients, subgraphs can update independently and asynchronously, realizing decoupled neural interfaces. A similar approach has been adopted for training in SFL by [4,8]: an auxiliary model replaces the server model during local updates. This line of research demonstrates that an auxiliary model of relatively small dimension compared to the server model performs sufficiently well as a replacement.
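The local-loss idea above can be sketched in a few lines: the client-side block is updated from a loss computed by a small auxiliary head, so it never waits for the server's back-propagated gradient. This is a hypothetical minimal setup (linear layers, squared loss, hand-derived gradients), not the architecture of [4,8,9].

```python
import numpy as np

# Sketch of local-loss training with an auxiliary head, in the spirit of
# decoupled greedy learning: the client block W_c is updated using only a
# locally computed auxiliary loss. All shapes and names are illustrative.
rng = np.random.default_rng(1)
d_in, d_cut, d_out = 8, 4, 2
W_c = rng.normal(scale=0.1, size=(d_cut, d_in))   # client-side block (x_C)
W_a = rng.normal(scale=0.1, size=(d_out, d_cut))  # auxiliary head (x_A)

def local_step(x, y, lr=0.05):
    """One client update using only the auxiliary loss (no server signal)."""
    global W_c, W_a
    z = W_c @ x                         # cut-layer activation z_f
    y_hat = W_a @ z                     # auxiliary prediction
    err = y_hat - y                     # d(loss)/d(y_hat) for 0.5*||y_hat - y||^2
    grad_Wa = np.outer(err, z)
    grad_Wc = np.outer(W_a.T @ err, x)  # chain rule through the auxiliary head
    W_a -= lr * grad_Wa
    W_c -= lr * grad_Wc
    return 0.5 * np.sum(err ** 2)

x, y = rng.normal(size=d_in), rng.normal(size=d_out)
losses = [local_step(x, y) for _ in range(200)]
assert losses[-1] < losses[0]           # the auxiliary loss decreases
```

The point of the sketch is structural: `local_step` touches no server-side parameters, which is what removes the update lock.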

3. SplitFed Learning Scenario

In this section, we introduce the SFL framework, encompassing both the client- and server-side models, and present the CSE-SFL algorithm designed by [4] to mitigate communication overhead. We split the model as $\tilde{x} \triangleq (x_C, x_S)$, where $x_C$ denotes the client-side model and $x_S$ the server-side model. We further introduce $x \triangleq (x_C, x_A)$ as the client-side model including the auxiliary network, where $x_A$ denotes the auxiliary-network model.
The client-side non-convex loss function in the SFL setting is given by:
\[
F_i^c(x) \triangleq \mathbb{E}_{\xi \sim \mathcal{D}_i}\left[F_i^c(x; \xi)\right]
\]
Also, the server-side non-convex loss function in the SFL setting is defined by:
\[
F^s(x_S; z_f, y) \triangleq \frac{1}{m} \sum_{i=0}^{m-1} F_i^s(x_S; z_{f,i}, y_i)
\]
We denote by $z_{f,i}(x_C; \xi)$ the output of the forward propagation of client $i$'s model, $x_{C,i}$, on its local random data sample $\xi \sim \mathcal{D}_i$; this output is transmitted to the server at specific intervals, together with the true labels $y_i$ corresponding to the sample. Note that the client shares only the true labels with the server, never the sampled data itself. Similarly, $z_{b,i}(x_S; z_{f,i}, y_i)$ denotes the server's backward-propagation result for client $i$, and $\hat{z}_{b,i}(x_A; z_{f,i}, y_i)$ corresponds to the backward-propagation result obtained from the auxiliary network. In more detail, client $i$ performs forward propagation up to the splitting layer and transmits the output of this layer, along with the true labels, to the server. The server then continues the forward propagation through to the final layer and computes the loss function. Subsequently, the server back-propagates the error and sends the gradients of its first layer back to the client. We write $\bar{x}_C^t$ for the aggregated model at each global round $t \in [T]$, where $[T] = \{0, \ldots, T-1\}$ and $\bar{x}_C^t = \frac{1}{m}\sum_i x_{C,i}^t$. Throughout this paper, $[S] = \{0, \ldots, m-1\}$ denotes the set of clients, indexed by $i$. We employ two strategies for client participation. In the first, all clients participate in the learning process. In the second, the server randomly samples a subset $S_t$ of $n$ clients with replacement, following a uniform distribution. We assume the $\mathcal{D}_i$ are non-i.i.d. The derivatives of the local loss function of client $i$ in the SFL setting with respect to $x_C$ and $x_A$ are denoted $\nabla F_i^c(x_C)$ and $\nabla F_i^c(x_A)$, respectively. For the server-side model, the derivative of the loss function with respect to $x_S$ is $\nabla F^s(x_S)$.
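One forward/backward exchange at the cut layer can be sketched as follows. This is a hypothetical minimal setup with linear layers and a squared loss; the shapes and names are assumptions used only to make the message pattern concrete: the client sends the cut-layer activation $z_f$ and the label $y$ (never the raw sample), and the server returns the gradient with respect to $z_f$.

```python
import numpy as np

# Sketch of one split-learning exchange at the cut layer.
rng = np.random.default_rng(2)
d_in, d_cut, d_out = 8, 4, 2
W_c = rng.normal(scale=0.1, size=(d_cut, d_in))   # client-side model (x_C)
W_s = rng.normal(scale=0.1, size=(d_out, d_cut))  # server-side model (x_S)

xi = rng.normal(size=d_in)       # local sample, stays on the client
y = rng.normal(size=d_out)       # true label, shared with the server

# --- client: forward propagation up to the cut layer ---
z_f = W_c @ xi                   # sent to the server, together with y

# --- server: finish the forward pass, compute loss, backpropagate ---
y_hat = W_s @ z_f
err = y_hat - y                  # gradient of 0.5 * ||y_hat - y||^2
grad_Ws = np.outer(err, z_f)     # used to update the server model
z_b = W_s.T @ err                # gradient w.r.t. z_f, sent back to the client

# --- client: finish backpropagation with the returned z_b ---
grad_Wc = np.outer(z_b, xi)

# Sanity check: the split gradient equals end-to-end backprop through W_s @ W_c.
full_grad_Wc = np.outer(W_s.T @ ((W_s @ W_c @ xi) - y), xi)
assert np.allclose(grad_Wc, full_grad_Wc)
```

The check at the end illustrates why splitting is lossless for gradient computation: the exchanged messages carry exactly the quantities end-to-end backpropagation would have produced at the cut.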
The stochastic counterpart of each of the aforementioned gradients is distinguished by a tilde, e.g., $\tilde{\nabla} F_i^c(x_C) = \nabla F_i^c(x_C; \xi)$, where $\xi \sim \mathcal{D}_i$ is a random sample from client $i$'s dataset. Note that $\mu_L$ and $\mu$ are the learning rates of the client-side and server-side models, respectively. Client $i$ trains $x_{C,i}$ on its local dataset and feeds the forward-propagation result $z_{f,i}$ to the auxiliary network at each local step $k$, receiving $\hat{z}_{b,i}$ in response; here $k \in [K]$ indexes the local steps. Additionally, the client sends $z_{f,i}$ to the server at each global round $t$ such that $t \equiv 0 \bmod l$, where $l$ is a parameter determining the frequency of this process. A single server performs the model aggregation at each global round, completes the forward propagation of the clients, and updates the server model at the designated global rounds. Algorithm 1 details the procedure proposed by [4].
Algorithm 1 CSE-SFL [4]
 1: At Server:
 2:     Initialize $x_C^0$, $x_A^0$, and $x_S^0$
 3:     for $t = 0, 1, \ldots, T-1$ do
 4:         Sample a subset $S_t$ of $n$ clients out of $m$ clients
 5:         Receive $x_{C,i}^t, x_{A,i}^t \;\; \forall i \in [S_t]$
 6:         Let $\bar{x}_C^t \leftarrow \frac{1}{m} \sum_{i \in [S_t]} x_{C,i}^t$ and $\bar{x}_A^t \leftarrow \frac{1}{m} \sum_{i \in [S_t]} x_{A,i}^t$
 7:         Broadcast $\bar{x}_C^t$ and $\bar{x}_A^t$ to the clients
 8:         if $t \equiv 0 \bmod l$ and $t \neq 0$ then
 9:             for each client $i \in [S_t]$ in sequence do
10:                 $z_{f,i}, y_i \leftarrow$ Client($i, z_f, y$)
11:                 Complete forward propagation with $z_{f,i}$ and $x_S^0$
12:                 Compute $\hat{y}_i$, the prediction of $y_i$
13:                 Compute the loss function $F_i^s(x_S^0; z_{f,i}, y_i)$
14:                 Complete backward propagation
15:                 Send $z_{b,i}$ to the client
16:                 Update the server model: $x_S^0 \leftarrow x_S^0 - \frac{\mu}{m} \nabla F_i^s(x_S^0; z_{f,i}, y_i)$
17:             end for
18:         end if
19:     end for
20:     Concatenate $x_C$ and $x_S$
21: At Clients:
22:     for all clients $i \in [S_t]$ in parallel at round $t$ do
23:         $x_{C,i}^0 \leftarrow$ Server($\bar{x}_C^t$)
24:         if $t \equiv 0 \bmod l$ and $t \neq 0$ then
25:             $z_{f,i} \leftarrow$ ForwardPass($x_{C,i}^0; \xi$)
26:             Send $z_{f,i}$ and $y_i$ to the server
27:             $z_{b,i}^t \leftarrow$ Server($z_b^t$)
28:             Complete backward propagation with $z_{b,i}^t$
29:             Client update: $x_{C,i}^1 \leftarrow x_{C,i}^0 - \mu_L \nabla F_i^c(x_{C,i}^0)$
30:             Auxiliary update: $x_{A,i}^1 \leftarrow x_{A,i}^0$
31:             for local step $k = 1, \ldots, K-1$ do
32:                 Compute the forward propagation with $x_{C,i}^k$ and $x_A^t$
33:                 Compute the local loss $F_i^c(x_i^k; \xi_k)$
34:                 Client update: $x_{C,i}^{k+1} \leftarrow x_{C,i}^k - \mu_L \nabla F_i^c(x_{C,i}^k)$
35:                 Auxiliary update: $x_{A,i}^{k+1} \leftarrow x_{A,i}^k - \mu_L \nabla F_i^c(x_{A,i}^k)$
36:             end for
37:         else
38:             for local step $k = 0, \ldots, K-1$ do
39:                 Compute the forward propagation with $x_{C,i}^k$ and $x_A^t$
40:                 Compute the local loss $F_i^c(x_i^k; \xi_k)$
41:                 Client update: $x_{C,i}^{k+1} \leftarrow x_{C,i}^k - \mu_L \nabla F_i^c(x_{C,i}^k)$
42:                 Auxiliary update: $x_{A,i}^{k+1} \leftarrow x_{A,i}^k - \mu_L \nabla F_i^c(x_{A,i}^k)$
43:             end for
44:         end if
45:         Return $x_{C,i}^K$ to the server
46:     end for
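The round structure of Algorithm 1 can be sketched schematically: clients always run $K$ local steps against their auxiliary networks, but the activations $z_f$ reach the server (and the single server-model copy is updated) only every $l$-th round. The counters below only illustrate this communication pattern under assumed toy values of $T$, $K$, $l$, and $n$; no model is actually trained.

```python
# Schematic sketch of the communication pattern in Algorithm 1 (CSE-SFL).
def simulate_rounds(T, K, l, n):
    uplinks_to_server = 0   # z_f / label transmissions to the server
    local_steps = 0         # local client updates driven by the auxiliary loss
    server_updates = 0      # sequential updates of the single server-model copy
    for t in range(T):
        # The server aggregates and broadcasts x_C, x_A every round (not counted).
        if t % l == 0 and t != 0:
            for _ in range(n):          # clients are served in sequence
                uplinks_to_server += 1
                server_updates += 1
        for _ in range(n):              # clients run in parallel
            local_steps += K
    return uplinks_to_server, local_steps, server_updates

up, steps, srv = simulate_rounds(T=100, K=5, l=10, n=8)
assert up == srv == 8 * 9        # rounds 10, 20, ..., 90 -> 9 server rounds
assert steps == 100 * 5 * 8      # every client trains locally at every round
```

Doubling $l$ halves the number of uplink transmissions while leaving the amount of local computation unchanged, which is the communication/convergence trade-off the analysis in Section 4 quantifies.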

4. Convergence rate analysis

The following assumptions for the convergence rate evaluation have been made:
Assumption 1. (L-Lipschitz continuous gradient) Both the client- and server-side models are $L$-smooth non-convex functions; i.e., there is a constant $L > 0$ such that $\forall x_C, y_C \in \mathbb{R}^{d_c}$ and $\forall x_S, y_S \in \mathbb{R}^{d_s}$:
\[
\left\|\nabla F^c(x_C) - \nabla F^c(y_C)\right\| \leq L \left\|x_C - y_C\right\| \quad \text{and} \quad \left\|\nabla F^s(x_S) - \nabla F^s(y_S)\right\| \leq L \left\|x_S - y_S\right\|
\]
Assumption 2. (Unbiased local gradient estimator) We assume that $\forall i \in [S]$,
\[
\mathbb{E}_{\xi \sim \mathcal{D}_i}\left[\nabla F_i^c(x_C; \xi)\right] = \nabla F_i^c(x_C),
\]
i.e., the local gradient estimator of the client-side model is unbiased; the expectation is over the local dataset of client $i$. We make the analogous assumption for the server-side model, $\forall i \in [S]$:
\[
\mathbb{E}_{\xi \sim \mathcal{D}_i}\left[\nabla F_i^s(x_S; z_{f,i}(x_C; \xi))\right] = \nabla F_i^s(x_S)
\]
Assumption 3. (Bounded local and global variance) The stochastic gradients have bounded local and global variance for both the client- and server-side models; i.e., there exist positive constants $\sigma_L$ and $\sigma_G$ such that
\[
\mathbb{E}\left\|\nabla F_i^c(x_C; \xi) - \nabla F_i^c(x_C)\right\|^2 \leq \sigma_L^2 \quad \text{and} \quad \mathbb{E}\left\|\nabla F_i^c(x_C) - \nabla F^c(x_C)\right\|^2 \leq \sigma_G^2,
\]
\[
\mathbb{E}\left\|\nabla F_i^s(x_S; \xi) - \nabla F_i^s(x_S)\right\|^2 \leq \sigma_L^2 \quad \text{and} \quad \mathbb{E}\left\|\nabla F_i^s(x_S) - \nabla F^s(x_S)\right\|^2 \leq \sigma_G^2.
\]
Assumptions 1, 2, and 3 are standard in non-convex optimization and FL; see, e.g., [7,11,12,13,14,15]. Figure 1 gives an illustrative overview of the communication- and storage-efficient federated split learning (CSE-FSL) algorithm.
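The two variance notions of Assumption 3 can be illustrated empirically: $\sigma_L^2$ bounds per-client stochastic-gradient noise, while $\sigma_G^2$ bounds client heterogeneity (non-i.i.d.-ness). The synthetic linear-regression clients below are a hypothetical toy setup; all names and distributions are assumptions for illustration only.

```python
import numpy as np

# Toy illustration of the local (sigma_L^2) and global (sigma_G^2) variance
# quantities in Assumption 3, estimated empirically on synthetic clients.
rng = np.random.default_rng(3)
m, d, n_samples = 5, 3, 200
# Each client's data comes from a different ground-truth w_i -> non-i.i.d.
w_true = rng.normal(size=(m, d))
X = rng.normal(size=(m, n_samples, d))
Y = np.einsum('mnd,md->mn', X, w_true) + 0.1 * rng.normal(size=(m, n_samples))

x = np.zeros(d)  # current model

def stoch_grad(i, idx):  # gradient of 0.5 * (x.w - y)^2 on one sample
    xi, yi = X[i, idx], Y[i, idx]
    return (xi @ x - yi) * xi

full_grads = np.array([np.mean([stoch_grad(i, j) for j in range(n_samples)], axis=0)
                       for i in range(m)])
global_grad = full_grads.mean(axis=0)

# Empirical local variance: E||grad_i(x; xi) - grad_i(x)||^2, averaged over clients.
local_var = np.mean([np.mean([np.sum((stoch_grad(i, j) - full_grads[i]) ** 2)
                              for j in range(n_samples)]) for i in range(m)])
# Empirical global variance: E||grad_i(x) - grad(x)||^2 across clients.
global_var = np.mean(np.sum((full_grads - global_grad) ** 2, axis=1))

assert local_var > 0 and global_var > 0  # both noise sources are present
```

In an i.i.d. setting `global_var` would shrink toward zero as the clients' datasets grow, while `local_var` would persist; under non-i.i.d. data both remain bounded away from zero, which is exactly the regime the analysis targets.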

4.1. Client-Side Model Convergence

We examine the convergence rate when $t \equiv 0 \bmod l$, because it is during these rounds that the server-side model is also updated. This lets us study the impact of $l$ on the convergence rate and the communication overhead.
Theorem 1. 
Under Assumptions 1, 2, and 3 and full participation of clients, if $\mu_L \leq \frac{1}{l L K^2 (1.15^l + 1.85)}$ and $t \equiv 0 \bmod l$, the convergence rate of the client-side model of Algorithm 1 satisfies:
\[
\min_{t \in [T]} \mathbb{E}\left\|\nabla F^c(\bar{x}_C^t)\right\|^2 \leq \frac{2\left(F^c(\bar{x}_C^0) - F^c(\bar{x}_C^*)\right)}{(1 - \Gamma \lambda_2)\, \mu_L K T} + \Phi_1 + \frac{\Gamma \lambda_1}{1 - \Gamma \lambda_2}
\]
where
\[
\Phi_1 = \frac{5 K \mu_L^2 L^2 \left(\sigma_L^2 + 6K \sigma_G^2\right)}{1 - \Gamma \lambda_2} + \frac{4\left(L \mu_L + \mu_L^2 K L^2 (l-1)\right)}{1 - \Gamma \lambda_2} \left[\left(K l + 5 L^2 K^2 l \mu_L^2\right) \sigma_L^2 + \left(K l + 30 L^2 K^3 l \mu_L^2\right) \sigma_G^2\right],
\]
\[
\lambda_1 = B \sum_{j=0}^{l-1} \frac{A^j - 1}{A - 1}, \qquad \lambda_2 = \frac{A^l - 1}{A - 1},
\]
\[
B = 8 L^2 \mu_L^2 \left[\left(K^2 + 5 L^2 K^3 \mu_L^2\right) \sigma_L^2 + \left(K^2 + 30 L^2 K^4 \mu_L^2\right) \sigma_G^2\right], \qquad A = 8 L^2 \mu_L^2 \left(K^2 + 30 L^2 K^3 \mu_L^2\right) + 2,
\]
\[
\Gamma = 4\left(L \mu_L + \mu_L^2 K L^2 (l-1)\right)\left(K + 30 L^2 K^2 \mu_L^2\right) + 30 K^2 \mu_L^2 L^2, \qquad \bar{x}_C^* = \operatorname*{argmin}_{\bar{x}_C^t,\, t \in [T]} \mathbb{E}\left\|\nabla F^c(\bar{x}_C^t)\right\|^2.
\]
Corollary 1. 
Let $\mu_L \leq \frac{1}{l L K^2 (1.15^l + 1.85) \sqrt{T}}$. Then, the convergence rate of the client-side model in Algorithm 1 is
\[
\min_{t \in [T]} \mathbb{E}\left\|\nabla F^c(\bar{x}_C^t)\right\|^2 \leq \mathcal{O}\!\left(\frac{l}{\sqrt{T}} + \frac{1}{T\sqrt{T}}\right).
\]
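The shape of this rate can be read off numerically. The sketch below evaluates the bound's asymptotic form with constants dropped (an illustrative simplification, not the full bound): the $l/\sqrt{T}$ term dominates, so quadrupling $T$ roughly halves the bound, while doubling $l$ roughly doubles it.

```python
import math

# Illustrative evaluation of the rate shape l/sqrt(T) + 1/(T*sqrt(T)),
# with all problem-dependent constants dropped.
def rate(l, T):
    return l / math.sqrt(T) + 1.0 / (T * math.sqrt(T))

assert rate(2, 4000) / rate(2, 1000) < 0.51       # ~1/2 when T is quadrupled
assert 1.9 < rate(4, 1000) / rate(2, 1000) < 2.1  # ~2x when l is doubled
```

This makes the trade-off of Section 3 explicit: a larger $l$ saves uplink communication by the same factor that it inflates the dominant term of the convergence bound.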
Theorem 2. 
Under Assumptions 1, 2, and 3 and partial participation of clients under the client-sampling strategy of Section 3, if $\mu_L \leq \frac{1}{l L K^2 (1.15^l + 1.85)}$ and $t \equiv 0 \bmod l$, the convergence rate of the client-side model of Algorithm 1 satisfies:
\[
\begin{aligned}
\min_{t \in [T]} \mathbb{E}\left\|\nabla F^c(\bar{x}_C^t)\right\|^2 &\leq \frac{2\left(F^c(\bar{x}_C^0) - F^c(\bar{x}_C^*)\right)}{\mu_L K T} + \left[5 K \mu_L^2 L^2 + 4 \mu_L^2 L^2 \left(K^2 l^2 + 5 L^2 K^3 l^2 \mu_L^2\right) + L \mu_L \left(\frac{1}{n} + 15 K^2 L^2 \mu_L^2\right)\right] \sigma_L^2 \\
&\quad + \left[30 K^2 \mu_L^2 L^2 + L \mu_L \left(90 K^3 L^2 \mu_L^2 + 3K\right) + 4 \mu_L^2 L^2 \left(K^2 l^2 + 30 L^2 K^4 l^2 \mu_L^2\right)\right] \sigma_G^2 + \frac{\Gamma \lambda_1}{1 - \Gamma \lambda_2}
\end{aligned}
\]
where
\[
\Gamma = 4 \mu_L^2 L^2 \left(K^2 l + 30 L^2 K^3 l \mu_L^2\right) + L \mu_L l \left(90 l K^3 L^2 \mu_L^2 + 3K\right) + 30 K^2 \mu_L^2 L^2, \qquad \bar{x}_C^* = \operatorname*{argmin}_{\bar{x}_C^t,\, t \in [T]} \mathbb{E}\left\|\nabla F^c(\bar{x}_C^t)\right\|^2.
\]
Corollary 2. 
Let $\mu_L \leq \frac{1}{l L K^2 (1.15^l + 1.85) \sqrt{T}}$. Then, the convergence rate of the client-side model in Algorithm 1 is
\[
\min_{t \in [T]} \mathbb{E}\left\|\nabla F^c(\bar{x}_C^t)\right\|^2 \leq \mathcal{O}\!\left(\frac{l}{\sqrt{T}} + \frac{1}{T\sqrt{T}}\right).
\]

4.2. Server-Side Model Convergence

Theorem 3. 
Under Assumptions 1, 2, and 3 and full participation of clients, if $\mu \leq \frac{1}{2L}$ and $t \equiv 0 \bmod l$, the convergence rate of the server-side model of Algorithm 1 satisfies:
\[
\min_{t \in [T]} \mathbb{E}\left\|\nabla F^s(x_S^t)\right\|^2 \leq \frac{2 l \left(F^s(x_S^0) - F^s(x_S^*)\right)}{\mu (2m - 3) T} + \frac{L \mu m^2}{2m - 3} \left(9.2\, \sigma_L^2 + 13.2\, \sigma_G^2\right)
\]
where $x_S^* = \operatorname*{argmin}_{x_S^t,\, t \in [T]} \mathbb{E}\left\|\nabla F^s(x_S^t, z_f^t)\right\|^2$.
Corollary 3. 
Let $\mu \leq \frac{1}{2L\sqrt{T}}$. Then the convergence rate of the server-side model is:
\[
\min_{t \in [T]} \mathbb{E}\left\|\nabla F^s(x_S^t)\right\|^2 \leq \mathcal{O}\!\left(\frac{l}{\sqrt{T}}\right)
\]
Theorem 4. 
Under Assumptions 1, 2, and 3 and partial participation of clients under the client-sampling strategy of Section 3, if $\mu \leq \frac{1}{8 L^2 m^2}$ and $t \equiv 0 \bmod l$, the convergence rate of the server-side model of Algorithm 1 satisfies:
\[
\min_{t \in [T]} \mathbb{E}\left\|\nabla F^s(x_S^t)\right\|^2 \leq \frac{l \left(F^s(x_S^0) - F^s(x_S^*)\right)}{\mu (m - 2) T} + \frac{L \mu m^2}{m - 2} \left(7 \sigma_L^2 + 7 \sigma_G^2\right)
\]
where $x_S^* = \operatorname*{argmin}_{x_S^t} \mathbb{E}\left\|\nabla F^s(x_S^t)\right\|^2$.
Corollary 4. 
Let $\mu \leq \frac{1}{L^2 m^2 \sqrt{T}}$. Then the convergence rate of the server-side model is:
\[
\min_{t \in [T]} \mathbb{E}\left\|\nabla F^s(x_S^t)\right\|^2 \leq \mathcal{O}\!\left(\frac{l}{\sqrt{T}}\right)
\]

5. Discussion and Conclusions

In this paper, we presented theoretical convergence proofs for the state-of-the-art SplitFed Learning algorithm CSE-FSL, which is designed to improve the convergence of both the client-side and server-side models by leveraging the parallelism of Federated Learning (FL), and to reduce server storage through its one-copy-at-a-time policy for the server model. Our analysis builds on several assumptions that are conventional in FL to establish the theoretical foundations of CSE-FSL convergence. We prove convergence for non-i.i.d. datasets and non-convex loss functions under both full and partial client-participation scenarios.

5.1. Summary of Contributions

  • Convergence Analysis: We clearly formulated the CSE-FSL algorithm developed by [4]. We conducted a comprehensive convergence rate analysis under both full and partial client participation scenarios given the non-i.i.d. dataset and non-convex loss function. The convergence guarantees are derived under several assumptions, including L-smoothness of the objective functions, unbiased gradient estimators, and bounded gradient variances which are natural in conventional FL convergence analysis.
  • Key Results:
    Client-Side Model: We demonstrated that, under full client participation, the client-side model converges at a rate of $\mathcal{O}\!\left(\frac{l}{\sqrt{T}} + \frac{1}{T\sqrt{T}}\right)$. This result highlights the effectiveness of the algorithm in achieving linear convergence rates while accommodating the federated setting's constraints and the sequential update of the server model. Increasing $l$ lengthens the convergence time, which is expected, since it means the server model is updated after more global rounds.
    Server-Side Model: For the server-side model, we established a convergence rate of $\mathcal{O}\!\left(\frac{l}{\sqrt{T}}\right)$ under both full and partial client participation. This result underscores the robustness of the algorithm in ensuring effective learning even when clients participate only partially. It also shows that, in contrast to FL settings, increasing the number of clients or their local steps does not speed up convergence.

5.2. Implications

Our findings underscore the importance of efficient communication and gradient estimation (auxiliary networks) techniques in SplitFed Learning (SFL). The derived convergence rates demonstrate that the CSE-FSL algorithm achieves a balance between computational efficiency and convergence performance, making it a viable solution for practical federated learning applications where the privacy of clients is of high importance.
The theoretical guarantees provided by our convergence analysis offer valuable insights into how the algorithm performs under various conditions, thus guiding practitioners in optimizing federated learning systems. Future work could extend these results to explore more complex scenarios and refine the algorithm further for enhanced performance in real-world applications. For example, considering stragglers, elimination of label sharing by clients, and determining the optimal cut layer seem to be promising avenues for further research.
In summary, the CSE-FSL algorithm represents a significant advancement in FL, providing a robust framework for effective model training that leverages the parallelism of FL, auxiliary networks, and sequential updates of the server-side model, which reduce storage on the server side. It recovers the linear convergence rate of FL while providing more privacy, since only forward-propagation outputs and labels, rather than trained parameters, are transmitted between the clients and the server.

Appendix A. Proofs

Appendix A.1. Client-Side Model Convergence

We examine the convergence rate when $t \equiv 0 \bmod l$, because it is during these rounds that the server-side model is also updated. This analysis allows us to investigate the influence of $l$ on both the convergence rate and the communication overhead.
Theorem A1. 
Under Assumptions 1, 2, and 3 and full participation of clients, if $\mu_L \leq \frac{1}{l L K^2 (1.15^l + 1.85)}$ and $t \equiv 0 \bmod l$, the convergence rate of the client-side model of Algorithm 1 satisfies:
\[
\min_{t \in [T]} \mathbb{E}\left\|\nabla F^c(\bar{x}_C^t)\right\|^2 \leq \frac{2\left(F^c(\bar{x}_C^0) - F^c(\bar{x}_C^*)\right)}{(1 - \Gamma \lambda_2)\, \mu_L K T} + \Phi_1 + \frac{\Gamma \lambda_1}{1 - \Gamma \lambda_2}
\]
where
\[
\Phi_1 = \frac{5 K \mu_L^2 L^2 \left(\sigma_L^2 + 6K \sigma_G^2\right)}{1 - \Gamma \lambda_2} + \frac{4\left(L \mu_L + \mu_L^2 K L^2 (l-1)\right)}{1 - \Gamma \lambda_2} \left[\left(K l + 5 L^2 K^2 l \mu_L^2\right) \sigma_L^2 + \left(K l + 30 L^2 K^3 l \mu_L^2\right) \sigma_G^2\right],
\]
\[
\lambda_1 = B \sum_{j=0}^{l-1} \frac{A^j - 1}{A - 1}, \qquad \lambda_2 = \frac{A^l - 1}{A - 1},
\]
\[
B = 8 L^2 \mu_L^2 \left[\left(K^2 + 5 L^2 K^3 \mu_L^2\right) \sigma_L^2 + \left(K^2 + 30 L^2 K^4 \mu_L^2\right) \sigma_G^2\right], \qquad A = 8 L^2 \mu_L^2 \left(K^2 + 30 L^2 K^3 \mu_L^2\right) + 2,
\]
\[
\Gamma = 4\left(L \mu_L + \mu_L^2 K L^2 (l-1)\right)\left(K + 30 L^2 K^2 \mu_L^2\right) + 30 K^2 \mu_L^2 L^2, \qquad \bar{x}_C^* = \operatorname*{argmin}_{\bar{x}_C^t,\, t \in [T]} \mathbb{E}\left\|\nabla F^c(\bar{x}_C^t)\right\|^2.
\]
Corollary A1. 
Let $\mu_L \leq \frac{1}{l L K^2 (1.15^l + 1.85) \sqrt{T}}$. Then, the convergence rate of the client-side model in Algorithm 1 is
\[
\min_{t \in [T]} \mathbb{E}\left\|\nabla F^c(\bar{x}_C^t)\right\|^2 \leq \mathcal{O}\!\left(\frac{l}{\sqrt{T}} + \frac{1}{T\sqrt{T}}\right).
\]
Proof. 
In this proof, all gradients are taken with respect to $x_C$. By Assumption 1, for any $\bar{x}_C^{t+l}$ and $\bar{x}_C^t$ with $t \in [T]$, we can write:
\[
F^c(\bar{x}_C^{t+l}) \leq F^c(\bar{x}_C^t) + \left\langle \nabla F^c(\bar{x}_C^t),\, \bar{x}_C^{t+l} - \bar{x}_C^t \right\rangle + \frac{L}{2} \left\|\bar{x}_C^{t+l} - \bar{x}_C^t\right\|^2 \tag{A2}
\]
Particularly, we consider the case when t 0 mod l from now on.
Also, note the global aggregation and client update rule in the Algorithm 1,
\[
\bar{x}_C^{t+l} = \frac{1}{m}\sum_{i=0}^{m-1} x_{C,i}^{t+l} = \frac{1}{m}\sum_{i=0}^{m-1}\left[x_{C,i}^t - \mu_L \sum_{j=0}^{l-1}\sum_{k=0}^{K-1} \tilde{\nabla} F_i^c(x_{C,i}^{t+j,k})\right] = \bar{x}_C^t - \frac{\mu_L}{m}\sum_{i=0}^{m-1}\sum_{j=0}^{l-1}\sum_{k=0}^{K-1} \tilde{\nabla} F_i^c(x_{C,i}^{t+j,k}) \tag{A3}
\]
Thus,
\[
\bar{x}_C^{t+l} - \bar{x}_C^t = -\frac{\mu_L}{m}\sum_{i=0}^{m-1}\sum_{j=0}^{l-1}\sum_{k=0}^{K-1} \tilde{\nabla} F_i^c(x_{C,i}^{t+j,k}) \tag{A4}
\]
Taking the expectation of $F^c(\bar{x}_C^{t+l})$ with respect to the randomness up to round $t+l-1$, i.e., $\xi^{[t+l-1]} \triangleq [\xi_i^\tau]_{i \in [S],\, \tau \in [t+l-1]}$, and plugging A3 into A2, note that:
\[
\begin{aligned}
\mathbb{E}\left[F^c(\bar{x}_C^{t+l})\right] &\leq F^c(\bar{x}_C^t) + \left\langle \nabla F^c(\bar{x}_C^t),\, \mathbb{E}\left[\bar{x}_C^{t+l} - \bar{x}_C^t\right] \right\rangle + \frac{L}{2}\,\mathbb{E}\left\|\bar{x}_C^{t+l} - \bar{x}_C^t\right\|^2 \\
&= F^c(\bar{x}_C^t) + \mu_L \underbrace{\left\langle \nabla F^c(\bar{x}_C^t),\, \mathbb{E}\left[\frac{1}{m}\sum_{i=0}^{m-1}\sum_{j=0}^{l-1}\sum_{k=0}^{K-1}\left(\tilde{\nabla} F_i^c(x_{C,i}^{t+j,k}) - \nabla F_i^c(\bar{x}_C^{t+j})\right)\right] \right\rangle}_{A_1} \\
&\quad + \underbrace{\left(-\mu_L \left\langle \nabla F^c(\bar{x}_C^t),\, \mathbb{E}\left[\frac{K}{m}\sum_{i=0}^{m-1}\sum_{j=0}^{l-1} \nabla F_i^c(\bar{x}_C^{t+j})\right] \right\rangle\right)}_{A_2} + \frac{L\mu_L^2}{2}\, \underbrace{\mathbb{E}\left\|\frac{1}{m}\sum_{i=0}^{m-1}\sum_{j=0}^{l-1}\sum_{k=0}^{K-1} \tilde{\nabla} F_i^c(x_{C,i}^{t+j,k})\right\|^2}_{A_3} \tag{A5}
\end{aligned}
\]
We bound the term $A_1$ as follows:
\[
\begin{aligned}
A_1 &= \left\langle \nabla F^c(\bar{x}_C^t),\, \mathbb{E}\left[\frac{1}{m}\sum_{i=0}^{m-1}\sum_{j=0}^{l-1}\sum_{k=0}^{K-1}\left(\tilde{\nabla} F_i^c(x_{C,i}^{t+j,k}) - \nabla F_i^c(\bar{x}_C^{t+j})\right)\right] \right\rangle \\
&= \Big\langle \nabla F^c(\bar{x}_C^t),\, \mathbb{E}\,\underbrace{\Big[\frac{1}{m}\sum_{i=0}^{m-1}\sum_{j=0}^{l-1}\sum_{k=0}^{K-1}\left(\nabla F_i^c(x_{C,i}^{t+j,k}) - \nabla F_i^c(\bar{x}_C^{t+j})\right)\Big]}_{y_1} \Big\rangle \\
&\stackrel{(a1)}{=} \frac{K}{2}\left\|\nabla F^c(\bar{x}_C^t)\right\|^2 + \frac{1}{2Km^2}\,\mathbb{E}\Big\|\sum_{i,j,k}\left(\nabla F_i^c(x_{C,i}^{t+j,k}) - \nabla F_i^c(\bar{x}_C^{t+j})\right)\Big\|^2 \\
&\qquad - \frac{1}{2Km^2}\,\mathbb{E}\Big\|\sum_{i,j,k}\left(\nabla F_i^c(x_{C,i}^{t+j,k}) - \nabla F_i^c(\bar{x}_C^{t+j})\right) + K\nabla F^c(\bar{x}_C^t)\Big\|^2 \\
&\stackrel{(a2)}{\leq} \frac{K}{2}\left\|\nabla F^c(\bar{x}_C^t)\right\|^2 + \frac{1}{2Km^2}\,\mathbb{E}\Big\|\sum_{i,j,k}\left(\nabla F_i^c(x_{C,i}^{t+j,k}) - \nabla F_i^c(\bar{x}_C^{t+j})\right)\Big\|^2 \\
&\stackrel{(a3)}{\leq} \frac{K}{2}\left\|\nabla F^c(\bar{x}_C^t)\right\|^2 + \frac{l}{2m}\sum_{i,j,k}\mathbb{E}\left\|\nabla F_i^c(x_{C,i}^{t+j,k}) - \nabla F_i^c(\bar{x}_C^{t+j})\right\|^2 \\
&\stackrel{(a4)}{\leq} \frac{K}{2}\left\|\nabla F^c(\bar{x}_C^t)\right\|^2 + \frac{l L^2}{2m}\sum_{i,j,k}\underbrace{\mathbb{E}\left\|x_{C,i}^{t+j,k} - \bar{x}_C^{t+j}\right\|^2}_{y_2} \tag{A6}
\end{aligned}
\]
For any two vectors $a$ and $b$, we have
\[
\langle a, b \rangle = \frac{1}{2}\left(\|a\|^2 + \|b\|^2 - \|a - b\|^2\right). \tag{A7}
\]
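This identity follows from a one-line expansion of the squared norm:

```latex
% Expanding the squared norm verifies the identity used for (a1):
\|a - b\|^2 = \|a\|^2 - 2\langle a, b\rangle + \|b\|^2
\;\Longrightarrow\;
\langle a, b\rangle = \tfrac{1}{2}\left(\|a\|^2 + \|b\|^2 - \|a - b\|^2\right).
```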
Thus, setting $a = \sqrt{K}\,\nabla F^c(\bar{x}_C^t)$ and $b = \frac{1}{\sqrt{K}}\, y_1$ yields equality (a1). Inequality (a2) follows from eliminating a strictly negative term. Next, since
\[
\mathbb{E}\Big\|\sum_{i=1}^{n} z_i\Big\|^2 \leq n \sum_{i=1}^{n} \mathbb{E}\left\|z_i\right\|^2 \tag{A8}
\]
for any random variables $z_i$, inequality (a3) holds.
Inequality (a4) follows from Assumption 1. An upper bound for the term $y_2$ is provided by [13]; to keep the work self-contained, we include it here as well:
\[
\begin{aligned}
\mathbb{E}\left\|x_{C,i}^{t+j,k} - \bar{x}_C^{t+j}\right\|^2 &= \mathbb{E}\left\|x_{C,i}^{t+j,k-1} - \bar{x}_C^{t+j} - \mu_L \tilde{\nabla} F_i^c(x_{C,i}^{t+j,k-1})\right\|^2 \\
&= \mathbb{E}\Big\|x_{C,i}^{t+j,k-1} - \bar{x}_C^{t+j} - \mu_L\Big(\tilde{\nabla} F_i^c(x_{C,i}^{t+j,k-1}) - \nabla F_i^c(x_{C,i}^{t+j,k-1}) + \nabla F_i^c(x_{C,i}^{t+j,k-1}) - \nabla F_i^c(\bar{x}_C^{t+j}) \\
&\qquad\quad + \nabla F_i^c(\bar{x}_C^{t+j}) - \nabla F^c(\bar{x}_C^{t+j}) + \nabla F^c(\bar{x}_C^{t+j})\Big)\Big\|^2 \\
&\stackrel{(a5)}{\leq} \left(1 + \frac{1}{2K-1}\right)\mathbb{E}\left\|x_{C,i}^{t+j,k-1} - \bar{x}_C^{t+j}\right\|^2 + \mathbb{E}\left\|\mu_L\left(\tilde{\nabla} F_i^c(x_{C,i}^{t+j,k-1}) - \nabla F_i^c(x_{C,i}^{t+j,k-1})\right)\right\|^2 \\
&\qquad + 6K\,\mathbb{E}\left\|\mu_L\left(\nabla F_i^c(x_{C,i}^{t+j,k-1}) - \nabla F_i^c(\bar{x}_C^{t+j})\right)\right\|^2 + 6K\,\mathbb{E}\left\|\mu_L\left(\nabla F_i^c(\bar{x}_C^{t+j}) - \nabla F^c(\bar{x}_C^{t+j})\right)\right\|^2 \\
&\qquad + 6K\,\mathbb{E}\left\|\mu_L \nabla F^c(\bar{x}_C^{t+j})\right\|^2 \\
&\stackrel{(a6)}{\leq} \left(1 + \frac{1}{K-1}\right)\mathbb{E}\left\|x_{C,i}^{t+j,k-1} - \bar{x}_C^{t+j}\right\|^2 + \mu_L^2 \sigma_L^2 + 6K \mu_L^2 \sigma_G^2 + 6K \mu_L^2\,\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t+j})\right\|^2
\end{aligned}
\]
Inequality (a5) follows from the fact that $\mathbb{E}\left\|\sum_i z_i\right\|^2 = \sum_i \mathbb{E}\left\|z_i\right\|^2$ holds for independent random variables $z_i$ with zero mean.
 
The step (a6) is due to Assumption 3. Finally, unrolling the recursion and simplifying, we have:
\[
\frac{1}{m}\sum_{i=0}^{m-1} \mathbb{E}\left\|x_{C,i}^{t+j,k} - \bar{x}_C^{t+j}\right\|^2 \leq 5K\mu_L^2\left(\sigma_L^2 + 6K\sigma_G^2\right) + 30 K^2 \mu_L^2\, \mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t+j})\right\|^2 \tag{A11}
\]
Now, we continue by substituting A11 into A6:
\[
A_1 \leq \frac{K}{2}\left\|\nabla F^c(\bar{x}_C^t)\right\|^2 + \frac{5K^2\mu_L^2 l L^2}{2}\left(\sigma_L^2 + 6K\sigma_G^2\right) + 15 K^3 \mu_L^2 l L^2 \sum_{j=0}^{l-1} \mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t+j})\right\|^2 \tag{A12}
\]
The above inequality, A12, is an upper bound for the term $A_1$. We continue by bounding $A_2$ as follows:
\[
\begin{aligned}
A_2 &= -\mu_L \left\langle \nabla F^c(\bar{x}_C^t),\, \mathbb{E}\left[\frac{K}{m}\sum_{i=0}^{m-1}\sum_{j=0}^{l-1} \nabla F_i^c(\bar{x}_C^{t+j})\right] \right\rangle \\
&\stackrel{(a7)}{=} -\mu_L K \left\langle \nabla F^c(\bar{x}_C^t),\, \mathbb{E}\left[\sum_{j=0}^{l-1} \nabla F^c(\bar{x}_C^{t+j})\right] \right\rangle \\
&= -\mu_L K \left\|\nabla F^c(\bar{x}_C^t)\right\|^2 - \mu_L K \left\langle \nabla F^c(\bar{x}_C^t),\, \mathbb{E}\left[\sum_{j=1}^{l-1} \nabla F^c(\bar{x}_C^{t+j})\right] \right\rangle \\
&\stackrel{(a8)}{\leq} -\mu_L K \left\|\nabla F^c(\bar{x}_C^t)\right\|^2 - \mu_L K (l-1) \left\langle \nabla F^c(\bar{x}_C^t),\, \mathbb{E}\left[\nabla F^c(\bar{x}_C^{t+j^*})\right] \right\rangle \\
&\stackrel{(a9)}{=} -\frac{\mu_L K (l+1)}{2} \left\|\nabla F^c(\bar{x}_C^t)\right\|^2 - \underbrace{\frac{\mu_L K (l-1)}{2}\, \mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t+j^*})\right\|^2}_{y_3} + \frac{\mu_L K (l-1)}{2}\, \mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t+j^*}) - \nabla F^c(\bar{x}_C^t)\right\|^2 \\
&\stackrel{(a10)}{\leq} -\frac{\mu_L K (l+1)}{2} \left\|\nabla F^c(\bar{x}_C^t)\right\|^2 + \frac{\mu_L K (l-1)}{2}\, \mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t+j^*}) - \nabla F^c(\bar{x}_C^t)\right\|^2 \\
&\stackrel{(a11)}{\leq} -\frac{\mu_L K (l+1)}{2} \left\|\nabla F^c(\bar{x}_C^t)\right\|^2 + \frac{\mu_L K L^2 (l-1)}{2}\, \mathbb{E}\left\|\bar{x}_C^{t+j^*} - \bar{x}_C^t\right\|^2 \\
&\stackrel{(a12)}{\leq} -\frac{\mu_L K (l+1)}{2} \left\|\nabla F^c(\bar{x}_C^t)\right\|^2 + \frac{\mu_L^3 K L^2 (l-1)}{2}\, \underbrace{\mathbb{E}\left\|\frac{1}{m}\sum_{i=0}^{m-1}\sum_{j=0}^{j^*}\sum_{k=0}^{K-1} \tilde{\nabla} F_i^c(x_{C,i}^{t+j,k})\right\|^2}_{y_4} \tag{A13} \\
&\stackrel{(a13)}{\leq} -\frac{\mu_L K (l+1)}{2} \left\|\nabla F^c(\bar{x}_C^t)\right\|^2 + \frac{\mu_L^3 K L^2 (l-1)}{2}\, \underbrace{\mathbb{E}\left\|\frac{1}{m}\sum_{i=0}^{m-1}\sum_{j=0}^{l-1}\sum_{k=0}^{K-1} \tilde{\nabla} F_i^c(x_{C,i}^{t+j,k})\right\|^2}_{A_3} \tag{A14}
\end{aligned}
\]
The equality (a7) follows from the definition of the global aggregation in Algorithm 1. In inequality (a8), we assume there exists a $j^*$ such that $j^* = \operatorname*{argmin}_{1 \leq j \leq l-1} \left\langle \nabla F^c(\bar{x}_C^t),\, \mathbb{E}\left[\nabla F^c(\bar{x}_C^{t+j})\right] \right\rangle$. The equality (a9) follows from A7 with $a = \nabla F^c(\bar{x}_C^t)$ and $b = \nabla F^c(\bar{x}_C^{t+j^*})$. Inequality (a10) holds because the term $y_3$ is negative and can therefore be safely eliminated. Inequality (a11) is due to Assumption 1, and inequality (a12) follows from equation A3. The term $y_4$ in A13 is upper-bounded by its value at $j^* = l-1$ due to A15; hence, setting $j^* = l-1$ yields inequality (a13). We proceed by bounding $A_3$ as follows.
\[
\begin{aligned}
A_3 &= \mathbb{E}\left\|\frac{1}{m}\sum_{i=0}^{m-1}\sum_{j=0}^{l-1}\sum_{k=0}^{K-1} \tilde{\nabla} F_i^c(x_{C,i}^{t+j,k})\right\|^2 \\
&= \frac{1}{m^2}\,\mathbb{E}\Big\|\sum_{i,j,k}\Big(\tilde{\nabla} F_i^c(x_{C,i}^{t+j,k}) - \nabla F_i^c(x_{C,i}^{t+j,k}) + \nabla F_i^c(x_{C,i}^{t+j,k}) - \nabla F_i^c(\bar{x}_C^{t+j}) \\
&\qquad\qquad + \nabla F_i^c(\bar{x}_C^{t+j}) - \nabla F^c(\bar{x}_C^{t+j}) + \nabla F^c(\bar{x}_C^{t+j})\Big)\Big\|^2 \\
&\stackrel{(a14)}{\leq} \frac{4Kl}{m}\sum_{i,j,k}\Big(\mathbb{E}\left\|\tilde{\nabla} F_i^c(x_{C,i}^{t+j,k}) - \nabla F_i^c(x_{C,i}^{t+j,k})\right\|^2 + \mathbb{E}\left\|\nabla F_i^c(x_{C,i}^{t+j,k}) - \nabla F_i^c(\bar{x}_C^{t+j})\right\|^2 \\
&\qquad\qquad + \mathbb{E}\left\|\nabla F_i^c(\bar{x}_C^{t+j}) - \nabla F^c(\bar{x}_C^{t+j})\right\|^2 + \mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t+j})\right\|^2\Big) \\
&\stackrel{(a15)}{\leq} \frac{4Kl}{m}\sum_{i,j,k}\left(\sigma_L^2 + L^2\,\mathbb{E}\left\|x_{C,i}^{t+j,k} - \bar{x}_C^{t+j}\right\|^2 + \sigma_G^2 + \mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t+j})\right\|^2\right) \\
&\stackrel{(a16)}{\leq} 4\left(K^2 l^2 + 5 L^2 K^3 l^2 \mu_L^2\right)\sigma_L^2 + 4\left(K^2 l^2 + 30 L^2 K^4 l^2 \mu_L^2\right)\sigma_G^2 + 4\left(K^2 l + 30 L^2 K^3 l \mu_L^2\right)\sum_{j=0}^{l-1}\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t+j})\right\|^2 \tag{A15}
\end{aligned}
\]
Inequality (a14) holds due to A8, and inequality (a15) follows from Assumptions 1 and 3. Inequality (a16) follows from the bound on the client drift, A11. Substituting A12, A14, and A15 into A5, observe that:
\[
\begin{aligned}
\mathbb{E}\left[F^c(\bar{x}_C^{t+l})\right] &\leq F^c(\bar{x}_C^t) + \mu_L\left[\frac{K}{2}\left\|\nabla F^c(\bar{x}_C^t)\right\|^2 + \frac{5K^2\mu_L^2 l L^2}{2}\left(\sigma_L^2 + 6K\sigma_G^2\right) + 15 K^3 \mu_L^2 l L^2 \sum_{j=0}^{l-1}\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t+j})\right\|^2\right] \\
&\quad - \frac{\mu_L K (l+1)}{2}\left\|\nabla F^c(\bar{x}_C^t)\right\|^2 + \left(2L\mu_L^2 + 2\mu_L^3 K L^2 (l-1)\right)\Big[\left(K^2 l^2 + 5 L^2 K^3 l^2 \mu_L^2\right)\sigma_L^2 \\
&\quad + \left(K^2 l^2 + 30 L^2 K^4 l^2 \mu_L^2\right)\sigma_G^2 + \left(K^2 l + 30 L^2 K^3 l \mu_L^2\right)\sum_{j=0}^{l-1}\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t+j})\right\|^2\Big]
\end{aligned}
\]
Rearranging and simplifying the terms,
\[
\begin{aligned}
\mathbb{E}\left[F^c(\bar{x}_C^{t+l})\right] &\leq F^c(\bar{x}_C^t) - \frac{\mu_L K l}{2}\left\|\nabla F^c(\bar{x}_C^t)\right\|^2 + \frac{5K^2\mu_L^3 l L^2}{2}\left(\sigma_L^2 + 6K\sigma_G^2\right) \\
&\quad + \left(2L\mu_L^2 + 2\mu_L^3 K L^2 (l-1)\right)\left[\left(K^2 l^2 + 5 L^2 K^3 l^2 \mu_L^2\right)\sigma_L^2 + \left(K^2 l^2 + 30 L^2 K^4 l^2 \mu_L^2\right)\sigma_G^2\right] \\
&\quad + \left[\left(2L\mu_L^2 + 2\mu_L^3 K L^2 (l-1)\right)\left(K^2 l + 30 L^2 K^3 l \mu_L^2\right) + 15 K^3 \mu_L^3 l L^2\right]\sum_{j=0}^{l-1}\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t+j})\right\|^2
\end{aligned}
\]
Iterating over the rounds $t \equiv 0 \bmod l$ and telescoping, note that:
\[
\begin{aligned}
\sum_{t \,\equiv\, 0 \bmod l} \mathbb{E}\left\|\nabla F^c(\bar{x}_C^t)\right\|^2 &\leq \frac{2}{\mu_L K l}\left(F^c(\bar{x}_C^0) - F^c(\bar{x}_C^*)\right) + \frac{T}{l}\, 5K\mu_L^2 L^2\left(\sigma_L^2 + 6K\sigma_G^2\right) \\
&\quad + \frac{4T}{l}\left(L\mu_L + \mu_L^2 K L^2 (l-1)\right)\left[\left(K l + 5 L^2 K^2 l \mu_L^2\right)\sigma_L^2 + \left(K l + 30 L^2 K^3 l \mu_L^2\right)\sigma_G^2\right] \\
&\quad + \underbrace{\left[4\left(L\mu_L + \mu_L^2 K L^2 (l-1)\right)\left(K + 30 L^2 K^2 \mu_L^2\right) + 30 K^2 \mu_L^2 L^2\right]}_{\Gamma}\, \sum_{t \,\equiv\, 0 \bmod l} \underbrace{\sum_{j=0}^{l-1}\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t+j})\right\|^2}_{y_5} \tag{A16}
\end{aligned}
\]
We bound the term $y_5$ as follows, starting with a bound on $\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t+j})\right\|^2$ for a particular $t$ and $j$:
E F c ( x ¯ C t + j ) 2 = E F c ( x ¯ C t + j ) 2 E F c ( x ¯ C t + j 1 ) 2 + E F c ( x ¯ C t + j 1 ) 2 ( a 17 ) 2 E F c ( x ¯ C t + j ) F c ( x ¯ C t + j 1 ) 2 + 2 E F c ( x ¯ C t + j 1 ) 2 ( a 18 ) 2 L 2 E x ¯ C t + j x ¯ C t + j 1 2 + 2 E F c ( x ¯ C t + j 1 ) 2 ( a 19 ) 8 L 2 μ L 2 K 2 + 5 L 2 K 3 μ L 2 σ L 2 + K 2 + 30 L 2 K 4 μ L 2 σ G 2 B + 8 L 2 μ L 2 K 2 + 30 L 2 K 3 μ L 2 + 2 A E F c ( x ¯ C t + j 1 ) 2
 
Inequality $(a_{17})$ is a consequence of $\mathbb{E}\|a\|^2 - \mathbb{E}\|b\|^2 \le 2\,\mathbb{E}\|a-b\|^2 + \mathbb{E}\|b\|^2$, which holds for any random vectors a and b; here $a = \nabla F^c(\bar{x}_C^{t+j})$ and $b = \nabla F^c(\bar{x}_C^{t+j-1})$. Inequality $(a_{18})$ follows from Assumption 1, and inequality $(a_{19})$ follows from A3 and A15, with $l=1$ in this case. Thus:
$$\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t+j})\right\|^2 \le B + A\,\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t+j-1})\right\|^2$$
Unrolling the recursion on j, we achieve the following:
$$\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t+j})\right\|^2 \le B\,\frac{A^{j}-1}{A-1} + A^{j}\,\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t})\right\|^2$$
Iterating over j, we have:
$$\sum_{j=0}^{l-1}\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t+j})\right\|^2 \le \underbrace{B\sum_{j=0}^{l-1}\frac{A^{j}-1}{A-1}}_{\lambda_1} + \underbrace{\frac{A^{l}-1}{A-1}}_{\lambda_2}\,\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t})\right\|^2$$
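As a quick numerical sanity check (illustrative, not part of the proof), the unrolled recursion and the geometric-sum forms of $\lambda_1$ and $\lambda_2$ can be verified for arbitrary positive constants; the values of A, B, and the initial gradient norm below are arbitrary stand-ins:

```python
# Check the unrolled recursion g_j <= B + A*g_{j-1}: running it at equality
# must reproduce the closed form B*(A^j - 1)/(A - 1) + A^j * g_0, and the
# summed bound's lambda_2 must equal the geometric sum of A^j.
A, B, g0, l = 1.7, 0.3, 2.0, 8

g = g0
for j in range(1, l):
    g = B + A * g                       # recursion taken with equality

closed = B * (A ** (l - 1) - 1) / (A - 1) + A ** (l - 1) * g0
assert abs(g - closed) < 1e-9

lam2 = (A ** l - 1) / (A - 1)           # coefficient of E||grad F(x_bar^t)||^2
assert abs(lam2 - sum(A ** j for j in range(l))) < 1e-9

lam1 = B * sum((A ** j - 1) / (A - 1) for j in range(l))
assert abs(lam1 - B * (lam2 - l) / (A - 1)) < 1e-9   # equivalent closed form
```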
Substituting A19 into A16:
$$\begin{aligned}
\sum_{t\in[T]}\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t})\right\|^2 \le\;& \frac{2}{\mu_L K l}\left(F^c(\bar{x}_C^{0})-F^c(\bar{x}_C^{*})\right) + \frac{T}{l}\,5K\mu_L^2L^2\left(\sigma_L^2+6K\sigma_G^2\right) \\
&+ \frac{4T}{l}\left(L\mu_L+\mu_L^2KL^2(l-1)\right)\left[\left(Kl+5L^2K^2l\mu_L^2\right)\sigma_L^2 + \left(Kl+30L^2K^3l\mu_L^2\right)\sigma_G^2\right] \\
&+ \frac{T}{l}\,\Gamma\lambda_1 + \Gamma\lambda_2\sum_{t}\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t})\right\|^2
\end{aligned}$$
Choosing a proper $\mu_L \le \frac{1}{lLK^2\sqrt{1.15\,l+1.85}}$, we have $\Gamma\lambda_2 < 1$. Thus:
$$\begin{aligned}
\min_{t\in[T]}\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t})\right\|^2 \le\;& \frac{2\left(F^c(\bar{x}_C^{0})-F^c(\bar{x}_C^{*})\right)}{(1-\Gamma\lambda_2)\,\mu_L K T} + \frac{5K\mu_L^2L^2\left(\sigma_L^2+6K\sigma_G^2\right)}{1-\Gamma\lambda_2} \\
&+ \frac{4L\mu_L+4\mu_L^2KL^2(l-1)}{1-\Gamma\lambda_2}\left[\left(Kl+5L^2K^2l\mu_L^2\right)\sigma_L^2 + \left(Kl+30L^2K^3l\mu_L^2\right)\sigma_G^2\right] + \frac{\Gamma\lambda_1}{1-\Gamma\lambda_2}
\end{aligned}$$
Theorem A2. 
Under Assumptions 1, 2, 3, and partial participation of clients due to strategy one, if $\mu_L \le \frac{1}{lLK^2\sqrt{1.15\,l+1.85}}$ and $t \equiv 0 \pmod{l}$ in Algorithm 1, the convergence rate of the client-side model of Algorithm 1 satisfies:
$$\begin{aligned}
\min_{t\in[T]}\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t})\right\|^2 \le\;& \frac{2\left(F^c(\bar{x}_C^{0})-F^c(\bar{x}_C^{*})\right)}{\mu_L K T} + \left[5K\mu_L^2L^2 + 4\mu_L^2L^2\left(K^2l^2+5L^2K^3l^2\mu_L^2\right) + L\mu_L\left(\tfrac{1}{n}+15K^2L^2\mu_L^2\right)\right]\sigma_L^2 \\
&+ \left[30K^2\mu_L^2L^2 + L\mu_L\left(90K^3L^2\mu_L^2+3K\right) + 4\mu_L^2L^2\left(K^2l^2+30L^2K^4l^2\mu_L^2\right)\right]\sigma_G^2 + \frac{\Gamma\lambda_1}{1-\Gamma\lambda_2}
\end{aligned}$$
where
$$\Gamma = 4\mu_L^2L^2\left(K^2l+30L^2K^3l\mu_L^2\right) + L\mu_L l\left(90lK^3L^2\mu_L^2+3K\right) + 30K^2\mu_L^2L^2, \qquad \bar{x}_C^{*} = \operatorname*{argmin}_{\bar{x}_C^{t},\; t\in[T]}\;\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t})\right\|^2$$
Corollary A2. 
Let $\mu_L \le \frac{1}{lLK^2\sqrt{(1.15\,l+1.85)\,T}}$. Then, the convergence rate of the client-side model in Algorithm 1 is
$$\min_{t\in[T]}\mathbb{E}\left\|\nabla f^c(\bar{x}_C^{t})\right\|^2 \le \mathcal{O}\left(\frac{l}{\sqrt{T}} + \frac{1}{T\sqrt{T}}\right).$$
Proof. 
In this proof, all the gradients are w.r.t. x C . Due to Assumption 1, for any x ¯ C t + l and x ¯ C t such that t [ T ] , we can write:
$$F^c(\bar{x}_C^{t+l}) \le F^c(\bar{x}_C^{t}) + \left\langle\nabla F^c(\bar{x}_C^{t}),\; \bar{x}_C^{t+l}-\bar{x}_C^{t}\right\rangle + \frac{L}{2}\left\|\bar{x}_C^{t+l}-\bar{x}_C^{t}\right\|^2$$
Particularly, we consider the case when t 0 mod l from now on.
Also, note the global aggregation and client update rule in Algorithm 1 with partial worker participation,
$$\bar{x}_C^{t+l} = \frac{1}{n}\sum_{i\in[S^{t+l}]} x_{C,i}^{t+l} = \frac{1}{n}\sum_{i\in[S^{t}]} x_{C,i}^{t} - \frac{\mu_L}{n}\sum_{j=0}^{l-1}\sum_{i\in[S^{t+j}]}\sum_{k=0}^{K-1}\widetilde{\nabla}F_i^c(x_{C,i}^{t+j,k}) = \bar{x}_C^{t} - \frac{\mu_L}{n}\sum_{j=0}^{l-1}\sum_{i\in[S^{t+j}]}\sum_{k=0}^{K-1}\widetilde{\nabla}F_i^c(x_{C,i}^{t+j,k})$$
Thus,
$$\bar{x}_C^{t+l} - \bar{x}_C^{t} = -\frac{\mu_L}{n}\sum_{j=0}^{l-1}\sum_{i\in[S^{t+j}]}\sum_{k=0}^{K-1}\widetilde{\nabla}F_i^c(x_{C,i}^{t+j,k})$$
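The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: in each of the l rounds, n clients are sampled with replacement, each runs K local SGD steps from the current aggregate, and the sampled clients' models are averaged. The quadratic local losses $F_i(x)=\tfrac12\|x-c_i\|^2$ and all constants are hypothetical stand-ins for the real client objectives:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, K, l, mu_L, d = 10, 4, 3, 5, 0.05, 2
centers = rng.normal(size=(m, d))           # optima of the synthetic local losses

x_bar = np.zeros(d)                         # aggregated client-side model
for _ in range(l):
    sampled = rng.integers(0, m, size=n)    # n indices, sampled with replacement
    local_models = []
    for i in sampled:
        x = x_bar.copy()
        for _ in range(K):                  # K local SGD steps on client i
            grad = x - centers[i]           # gradient of 0.5*||x - c_i||^2
            x -= mu_L * grad
        local_models.append(x)
    x_bar = np.mean(local_models, axis=0)   # average over the sampled clients

# Distance of the aggregate to the population mean of the local optima:
print(np.linalg.norm(x_bar - centers.mean(axis=0)))
```

Because every local step is a convex combination of the iterate and a local optimum, the aggregate provably stays inside the convex hull of the starting point and the $c_i$'s, which mirrors the boundedness used throughout the analysis.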
In the case of partial worker participation, there are two sources of randomness. One stems from the stochastic gradient computation, while the other arises from randomly sampling the clients at round t.
Taking the expectation of $F^c(\bar{x}_C^{t+l})$ w.r.t. both types of randomness at round $t+l-1$, and plugging A24 into A23, note that:
$$\begin{aligned}
\mathbb{E}\,F^c(\bar{x}_C^{t+l}) &\le F^c(\bar{x}_C^{t}) + \left\langle\nabla F^c(\bar{x}_C^{t}),\; \mathbb{E}\left[\bar{x}_C^{t+l}-\bar{x}_C^{t}\right]\right\rangle + \frac{L}{2}\,\mathbb{E}\left\|\bar{x}_C^{t+l}-\bar{x}_C^{t}\right\|^2 \\
\mathbb{E}\,F^c(\bar{x}_C^{t+l}) &\le F^c(\bar{x}_C^{t}) \underbrace{-\, \mu_L\left\langle\nabla F^c(\bar{x}_C^{t}),\; \mathbb{E}\left[\frac{1}{n}\sum_{j=0}^{l-1}\sum_{i\in[S^{t+j}]}\sum_{k=0}^{K-1}\left(\widetilde{\nabla}F_i^c(x_{C,i}^{t+j,k})-\nabla F_i^c(\bar{x}_C^{t+j})\right)\right]\right\rangle}_{A_1'} \\
&\quad \underbrace{-\, \mu_L\left\langle\nabla F^c(\bar{x}_C^{t}),\; \mathbb{E}\left[\frac{K}{n}\sum_{j=0}^{l-1}\sum_{i\in[S^{t+j}]}\nabla F_i^c(\bar{x}_C^{t+j})\right]\right\rangle}_{A_2'} + \underbrace{\frac{L\mu_L^2}{2}\,\mathbb{E}\left\|\frac{1}{n}\sum_{j=0}^{l-1}\sum_{i\in[S^{t+j}]}\sum_{k=0}^{K-1}\widetilde{\nabla}F_i^c(x_{C,i}^{t+j,k})\right\|^2}_{A_3'}
\end{aligned}$$
According to [15, Lemma 1], the terms $A_1'$ and $A_2'$ possess the same bounds as those of $A_1$ and $A_2$. Thus:
$$\begin{aligned}
A_1' &\le \frac{K}{2}\left\|\nabla F^c(\bar{x}_C^{t})\right\|^2 + \frac{5K^2\mu_L^2 lL^2}{2}\left(\sigma_L^2+6K\sigma_G^2\right) + 15K^3\mu_L^2 lL^2\sum_{j=0}^{l-1}\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t+j})\right\|^2 \\
A_2' &\le -\frac{\mu_L K(l+1)}{2}\left\|\nabla F^c(\bar{x}_C^{t})\right\|^2 + \frac{\mu_L^3KL^2(l-1)}{2}\Big[\left(4K^2l^2+5L^2K^3l^2\mu_L^2\right)\sigma_L^2 \\
&\qquad + \left(4K^2l^2+30L^2K^4l^2\mu_L^2\right)\sigma_G^2 + \left(4K^2l+30L^2K^3l\mu_L^2\right)\sum_{j=0}^{l-1}\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t+j})\right\|^2\Big]
\end{aligned}$$
Note that $[S^t] = \{q_1^t, \dots, q_n^t\}$ is the index set of the sampled clients, which may contain duplicate elements, as the sampling is with replacement. We now proceed to bound the term $A_3'$ following [15]:
$$\begin{aligned}
A_3' &= \mathbb{E}\left\|\frac{1}{n}\sum_{j=0}^{l-1}\sum_{i\in[S^{t+j}]}\sum_{k=0}^{K-1}\widetilde{\nabla}F_i^c(x_{C,i}^{t+j,k})\right\|^2 = \frac{1}{n^2}\,\mathbb{E}\left\|\sum_{j=0}^{l-1}\sum_{i=1}^{n}\sum_{k=0}^{K-1}\widetilde{\nabla}F_{q_i^{t+j}}^c\big(x_{C,q_i^{t+j}}^{t+j,k}\big)\right\|^2 \\
&\overset{(a_1)}{=} \frac{1}{n^2}\,\mathbb{E}\left\|\sum_{j=0}^{l-1}\sum_{i=1}^{n}\sum_{k=0}^{K-1}\left(\widetilde{\nabla}F_{q_i^{t+j}}^c\big(x_{C,q_i^{t+j}}^{t+j,k}\big)-\nabla F_{q_i^{t+j}}^c\big(x_{C,q_i^{t+j}}^{t+j,k}\big)\right)\right\|^2 + \frac{1}{n^2}\,\mathbb{E}\left\|\sum_{j=0}^{l-1}\sum_{i=1}^{n}\sum_{k=0}^{K-1}\nabla F_{q_i^{t+j}}^c\big(x_{C,q_i^{t+j}}^{t+j,k}\big)\right\|^2 \\
&\overset{(a_2)}{\le} \frac{lK\sigma_L^2}{n} + \frac{1}{n^2}\,\mathbb{E}\left\|\sum_{j=0}^{l-1}\sum_{i=1}^{n}\sum_{k=0}^{K-1}\nabla F_{q_i^{t+j}}^c\big(x_{C,q_i^{t+j}}^{t+j,k}\big)\right\|^2
\end{aligned}$$
The equality $(a_1)$ follows from the fact that $\mathbb{E}\|z\|^2 = \mathbb{E}\|z-\mathbb{E}[z]\|^2 + \|\mathbb{E}[z]\|^2$, and the inequality $(a_2)$ is due to Assumption 3 and the explanation provided in A10. Now, let $t_i^j = \sum_{k=0}^{K-1}\nabla F_i^c(x_{C,i}^{t+j,k})$; then:
$$\begin{aligned}
\mathbb{E}\left\|\sum_{j=0}^{l-1}\sum_{i=1}^{n}\sum_{k=0}^{K-1}\nabla F_{q_i^{t+j}}^c\big(x_{C,q_i^{t+j}}^{t+j,k}\big)\right\|^2 &= \mathbb{E}\left\|\sum_{j=0}^{l-1}\sum_{i=1}^{n} t_{q_i^{t+j}}^{j}\right\|^2 = \mathbb{E}\left[\sum_{j=0}^{l-1}\sum_{i=1}^{n}\left\|t_{q_i^{t+j}}^{j}\right\|^2 + \sum_{j=0}^{l-1}\sum_{\substack{i\ne z \\ q_i^{t+j},\,q_z^{t+j}\in[S^{t+j}]}}\left\langle t_{q_i^{t+j}}^{j},\, t_{q_z^{t+j}}^{j}\right\rangle\right] \\
&\overset{(a_3)}{=} \mathbb{E}\left[n\sum_{j=0}^{l-1}\left\|t_{q_1^{t+j}}^{j}\right\|^2 + n(n-1)\sum_{j=0}^{l-1}\left\langle t_{q_1^{t+j}}^{j},\, t_{q_2^{t+j}}^{j}\right\rangle\right] \\
&= \frac{n}{m}\sum_{j=0}^{l-1}\sum_{i\in[S]}\left\|t_i^{j}\right\|^2 + \frac{n(n-1)}{m^2}\sum_{j=0}^{l-1}\sum_{i,z\in[S]}\left\langle t_i^{j},\, t_z^{j}\right\rangle \\
&= \frac{n}{m}\sum_{j=0}^{l-1}\sum_{i\in[S]}\left\|t_i^{j}\right\|^2 + \frac{n(n-1)}{m^2}\sum_{j=0}^{l-1}\left\|\sum_{i\in[S]}t_i^{j}\right\|^2 \overset{(a_4)}{\le} \frac{n^2}{m}\,\underbrace{\sum_{j=0}^{l-1}\sum_{i\in[S]}\left\|t_i^{j}\right\|^2}_{A_4}
\end{aligned}$$
Note that the equality $(a_3)$ is due to independent sampling with replacement, as outlined by [15], and $(a_4)$ follows from A8. Now, we bound the term $A_4$ as follows:
$$\begin{aligned}
\sum_{j=0}^{l-1}\sum_{i\in[S]}\left\|t_i^{j}\right\|^2 &= \sum_{j=0}^{l-1}\sum_{i\in[S]}\left\|\sum_{k=0}^{K-1}\nabla F_i^c(x_{C,i}^{t+j,k})\right\|^2 \overset{(a_5)}{\le} K\sum_{j=0}^{l-1}\sum_{i\in[S]}\sum_{k=0}^{K-1}\left\|\nabla F_i^c(x_{C,i}^{t+j,k})\right\|^2 \\
&= K\sum_{j=0}^{l-1}\sum_{i\in[S]}\sum_{k=0}^{K-1}\left\|\nabla F_i^c(x_{C,i}^{t+j,k})-\nabla F_i^c(\bar{x}_C^{t+j})+\nabla F_i^c(\bar{x}_C^{t+j})-\nabla F^c(\bar{x}_C^{t+j})+\nabla F^c(\bar{x}_C^{t+j})\right\|^2 \\
&\overset{(a_6)}{\le} 3KL^2\sum_{j=0}^{l-1}\sum_{i\in[S]}\sum_{k=0}^{K-1}\left\|x_{C,i}^{t+j,k}-\bar{x}_C^{t+j}\right\|^2 + 3mlK^2\sigma_G^2 + 3mK^2\sum_{j=0}^{l-1}\left\|\nabla F^c(\bar{x}_C^{t+j})\right\|^2 \\
&\overset{(a_7)}{\le} 15mlK^3L^2\mu_L^2\left(\sigma_L^2+6K\sigma_G^2\right) + 3mlK^2\sigma_G^2 + \left(90mlK^4L^2\mu_L^2+3mK^2\right)\sum_{j=0}^{l-1}\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t+j})\right\|^2
\end{aligned}$$
The inequality $(a_5)$ follows from A8, inequality $(a_6)$ stems from A8 and Assumption 1, and inequality $(a_7)$ is due to A11. Now, plugging A31 into A30, and A30 into A29, we have the following bound on $A_3'$:
$$\mathbb{E}\left\|\frac{1}{n}\sum_{j=0}^{l-1}\sum_{i\in[S^{t+j}]}\sum_{k=0}^{K-1}\widetilde{\nabla}F_i^c(x_{C,i}^{t+j,k})\right\|^2 \le \left(\frac{lK}{n}+15lK^3L^2\mu_L^2\right)\sigma_L^2 + \left(90lK^4L^2\mu_L^2+3lK^2\right)\sigma_G^2 + \left(90lK^4L^2\mu_L^2+3K^2\right)\sum_{j=0}^{l-1}\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t+j})\right\|^2$$
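Two elementary facts used in the derivation above — the variance decomposition behind $(a_1)$ and the with-replacement cross-term computation behind $(a_3)$ — can be checked exactly on small synthetic vectors. The snippet below is purely illustrative; the dimensions and distributions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)

# (a1): E||z||^2 = E||z - E[z]||^2 + ||E[z]||^2. Exact on an empirical
# distribution, where z is uniform over finitely many outcomes.
z = rng.normal(size=(500, 3))               # 500 equally likely outcomes of z
mean = z.mean(axis=0)
lhs = (np.linalg.norm(z, axis=1) ** 2).mean()
rhs = (np.linalg.norm(z - mean, axis=1) ** 2).mean() + np.linalg.norm(mean) ** 2
assert abs(lhs - rhs) < 1e-9

# (a3): for independent uniform indices q1, q2 over m clients,
# E<t_{q1}, t_{q2}> = ||(1/m) * sum_i t_i||^2. Enumerating all m^2 index
# pairs computes the expectation exactly.
m = 6
t = rng.normal(size=(m, 4))                 # per-client vectors t_i (illustrative)
cross = np.mean([t[i] @ t[j] for i in range(m) for j in range(m)])
assert abs(cross - np.linalg.norm(t.mean(axis=0)) ** 2) < 1e-12
```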
Plugging A27, A28 and A32 into A23, with rearrangement and simplification, observe that:
$$\begin{aligned}
\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t})\right\|^2 \le\;& \frac{2}{\mu_L K l}\left(-\mathbb{E}\,F^c(\bar{x}_C^{t+l}) + F^c(\bar{x}_C^{t})\right) \\
&+ \left[5K\mu_L^2L^2 + 4\mu_L^2L^2\left(K^2l^2+5L^2K^3l^2\mu_L^2\right) + L\mu_L\left(\tfrac{1}{n}+15K^2L^2\mu_L^2\right)\right]\sigma_L^2 \\
&+ \left[30K^2\mu_L^2L^2 + L\mu_L\left(90K^3L^2\mu_L^2+3K\right) + 4\mu_L^2L^2\left(K^2l^2+30L^2K^4l^2\mu_L^2\right)\right]\sigma_G^2 \\
&+ \underbrace{\left[4\mu_L^2L^2\left(K^2l+30L^2K^3l\mu_L^2\right) + L\mu_L l\left(90lK^3L^2\mu_L^2+3K\right) + 30K^2\mu_L^2L^2\right]}_{\Gamma}\sum_{j=0}^{l-1}\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t+j})\right\|^2
\end{aligned}$$
By iterating over t, note that,
$$\begin{aligned}
\sum_{t\in[T]}\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t})\right\|^2 \le\;& \frac{2}{\mu_L K l}\left(-\mathbb{E}\,F^c(\bar{x}_C^{*}) + F^c(\bar{x}_C^{0})\right) \\
&+ \frac{T}{l}\left[5K\mu_L^2L^2 + 4\mu_L^2L^2\left(K^2l^2+5L^2K^3l^2\mu_L^2\right) + L\mu_L\left(\tfrac{1}{n}+15K^2L^2\mu_L^2\right)\right]\sigma_L^2 \\
&+ \frac{T}{l}\left[30K^2\mu_L^2L^2 + L\mu_L\left(90K^3L^2\mu_L^2+3K\right) + 4\mu_L^2L^2\left(K^2l^2+30L^2K^4l^2\mu_L^2\right)\right]\sigma_G^2 \\
&+ \Gamma\sum_{t}\sum_{j=0}^{l-1}\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t+j})\right\|^2
\end{aligned}$$
Due to A19, observe that:
$$\begin{aligned}
\sum_{t\in[T]}\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t})\right\|^2 \le\;& \frac{2}{\mu_L K l}\left(-\mathbb{E}\,F^c(\bar{x}_C^{*}) + F^c(\bar{x}_C^{0})\right) \\
&+ \frac{T}{l}\left[5K\mu_L^2L^2 + 4\mu_L^2L^2\left(K^2l^2+5L^2K^3l^2\mu_L^2\right) + L\mu_L\left(\tfrac{1}{n}+15K^2L^2\mu_L^2\right)\right]\sigma_L^2 \\
&+ \frac{T}{l}\left[30K^2\mu_L^2L^2 + L\mu_L\left(90K^3L^2\mu_L^2+3K\right) + 4\mu_L^2L^2\left(K^2l^2+30L^2K^4l^2\mu_L^2\right)\right]\sigma_G^2 \\
&+ \frac{T}{l}\,\Gamma\lambda_1 + \Gamma\lambda_2\sum_{t}\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t})\right\|^2
\end{aligned}$$
We let $\mu_L \le \frac{1}{lLK^2\sqrt{1.15\,l+1.85}}$; thus:
$$\begin{aligned}
\min_{t\in[T]}\mathbb{E}\left\|\nabla F^c(\bar{x}_C^{t})\right\|^2 \le\;& \frac{2\left(F^c(\bar{x}_C^{0})-F^c(\bar{x}_C^{*})\right)}{\mu_L K T} + \left[5K\mu_L^2L^2 + 4\mu_L^2L^2\left(K^2l^2+5L^2K^3l^2\mu_L^2\right) + L\mu_L\left(\tfrac{1}{n}+15K^2L^2\mu_L^2\right)\right]\sigma_L^2 \\
&+ \left[30K^2\mu_L^2L^2 + L\mu_L\left(90K^3L^2\mu_L^2+3K\right) + 4\mu_L^2L^2\left(K^2l^2+30L^2K^4l^2\mu_L^2\right)\right]\sigma_G^2 + \frac{\Gamma\lambda_1}{1-\Gamma\lambda_2}
\end{aligned}$$
This completes the proof.

Appendix A.2. Server-Side Model Convergence

Theorem A3. 
Under Assumptions 1, 2, 3, and full participation of clients, if $\mu \le \frac{1}{8Lm^2}$ and $t \equiv 0 \pmod{l}$, the convergence rate of the server model of Algorithm 1 satisfies:
$$\min_{t}\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2 \le \frac{2l\left(F^s(x_S^{0})-F^s(x_S^{*})\right)}{\mu(2m-3)T} + \frac{L\mu m^2}{2m-3}\left(9.2\,\sigma_L^2 + 13.2\,\sigma_G^2\right)$$
Corollary A3. 
Let $\mu \le \frac{1}{Lm^2\sqrt{T}}$; then the convergence rate of the server-side model is:
$$\min_{t\in[T]}\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2 \le \mathcal{O}\left(\frac{l}{\sqrt{T}}\right)$$
Proof. 
Due to Assumption 1, for any x S t + l and x S t , it can be written that:
$$F^s(x_S^{t+l}) \le F^s(x_S^{t}) + \left\langle\nabla F^s(x_S^{t}),\; x_S^{t+l}-x_S^{t}\right\rangle + \frac{L}{2}\left\|x_S^{t+l}-x_S^{t}\right\|^2$$
Also, note the client forward-propagation and server model update rules in Algorithm 1,
$$x_{S,i+1}^{t} = x_{S,i}^{t} - \mu\,\widetilde{\nabla}F_i^s(x_{S,i}^{t})$$
Thus, putting x S t = x S , 0 t and x S t + l = x S , m t , note that:
$$x_S^{t+l} - x_S^{t} = -\mu\sum_{i=0}^{m-1}\widetilde{\nabla}F_i^s(x_{S,i}^{t})$$
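The server-side rule above can be sketched directly: within one aggregation interval the server model is updated sequentially, once per client, so the total displacement is minus $\mu$ times the sum of the per-client stochastic gradients. The quadratic per-client losses and constants below are hypothetical placeholders, not the paper's actual model:

```python
import numpy as np

rng = np.random.default_rng(4)
m, mu, d = 8, 0.1, 3
targets = rng.normal(size=(m, d))       # stand-ins for per-client server losses

x0 = np.zeros(d)                        # x_S^t at the start of the interval
x = x0.copy()
steps = []
for i in range(m):                      # sequential pass over the m clients
    g = x - targets[i]                  # gradient of 0.5*||x - targets[i]||^2
    steps.append(g.copy())
    x = x - mu * g                      # x_{S,i+1} = x_{S,i} - mu * grad_i

# The displacement matches x_S^{t+l} - x_S^t = -mu * sum_i grad_i(x_{S,i}^t):
assert np.allclose(x - x0, -mu * np.sum(steps, axis=0))
```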
Taking expectation with respect to the randomness at round t, i.e., $\xi^{[t]} \triangleq \{\xi_i^{\tau}\}_{i\in[N],\,\tau\in[t]}$, and plugging A38 into A37, note that:
$$\begin{aligned}
\mathbb{E}\,F^s(x_S^{t+l}) &\le F^s(x_S^{t}) - \mu\left\langle\nabla F^s(x_S^{t}),\; \mathbb{E}\left[\sum_{i=0}^{m-1}\widetilde{\nabla}F_i^s(x_{S,i}^{t})\right]\right\rangle + \frac{L\mu^2}{2}\,\mathbb{E}\left\|\sum_{i=0}^{m-1}\widetilde{\nabla}F_i^s(x_{S,i}^{t})\right\|^2 \\
\mathbb{E}\,F^s(x_S^{t+l}) &\le F^s(x_S^{t}) \underbrace{-\,\mu\left\langle\nabla F^s(x_S^{t}),\; \mathbb{E}\left[\sum_{i=0}^{m-1}\left(\widetilde{\nabla}F_i^s(x_{S,i}^{t})-\nabla F_i^s(x_S^{t})\right)\right]\right\rangle}_{B_1} \underbrace{-\,\mu\left\langle\nabla F^s(x_S^{t}),\; \mathbb{E}\left[\sum_{i=0}^{m-1}\nabla F_i^s(x_S^{t})\right]\right\rangle}_{B_2} + \underbrace{\frac{L\mu^2}{2}\,\mathbb{E}\left\|\sum_{i=0}^{m-1}\widetilde{\nabla}F_i^s(x_{S,i}^{t})\right\|^2}_{B_3}
\end{aligned}$$
The term B 1 will be bounded as follows:
$$\begin{aligned}
-\mu&\left\langle\nabla F^s(x_S^{t}),\; \mathbb{E}\left[\mathbb{E}\left[\sum_{i=0}^{m-1}\left(\widetilde{\nabla}F_i^s(x_{S,i}^{t})-\nabla F_i^s(x_S^{t})\right)\,\middle|\,\xi\right]\right]\right\rangle \\
&\overset{(b_1)}{=} \frac{\mu}{2}\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2 + \frac{\mu}{2}\,\mathbb{E}\left\|\sum_{i=0}^{m-1}\left(\nabla F_i^s(x_{S,i}^{t})-\nabla F_i^s(x_S^{t})\right)\right\|^2 - \frac{\mu}{2}\,\mathbb{E}\left\|\sum_{i=0}^{m-1}\left(\nabla F_i^s(x_{S,i}^{t})-\nabla F_i^s(x_S^{t})\right)+\nabla F^s(x_S^{t})\right\|^2 \\
&\overset{(b_2)}{\le} \frac{\mu}{2}\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2 + \frac{\mu}{2}\,\mathbb{E}\left\|\sum_{i=0}^{m-1}\left(\nabla F_i^s(x_{S,i}^{t})-\nabla F_i^s(x_S^{t})\right)\right\|^2 \\
&\overset{(b_3)}{\le} \frac{\mu}{2}\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2 + \frac{m\mu}{2}\sum_{i=0}^{m-1}\mathbb{E}\left\|\nabla F_i^s(x_{S,i}^{t})-\nabla F_i^s(x_S^{t})\right\|^2 \overset{(b_4)}{\le} \frac{\mu}{2}\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2 + \frac{m\mu L^2}{2}\sum_{i=0}^{m-1}\mathbb{E}\left\|x_{S,i}^{t}-x_S^{t}\right\|^2
\end{aligned}$$
The equality $(b_1)$ is due to $\langle a,b\rangle = \frac{1}{2}\left(\|a\|^2+\|b\|^2-\|a-b\|^2\right)$ for any two vectors a and b. The inequality $(b_2)$ is clear, as we dropped a negative term; inequality $(b_3)$ stems from the fact that $\mathbb{E}\left\|\sum_{i}^{n} z_i\right\|^2 \le n\sum_i\mathbb{E}\|z_i\|^2$ holds for any random variables $z_i$; and the inequality $(b_4)$ is due to Assumption 1.
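Both facts invoked for $(b_1)$ and $(b_3)$ are elementary and can be checked numerically; the snippet below (with arbitrary illustrative vectors) is not part of the proof:

```python
import numpy as np

rng = np.random.default_rng(5)

# (b1): polarization identity <a, b> = (||a||^2 + ||b||^2 - ||a - b||^2) / 2.
a, b = rng.normal(size=4), rng.normal(size=4)
assert abs(a @ b - 0.5 * (a @ a + b @ b - (a - b) @ (a - b))) < 1e-12

# (b3): ||sum_i z_i||^2 <= n * sum_i ||z_i||^2 for any n vectors z_i
# (Cauchy-Schwarz / Jensen; the n = 2 case is the bound used at (a17) earlier).
z = rng.normal(size=(7, 4))             # n = 7 arbitrary vectors
lhs = np.linalg.norm(z.sum(axis=0)) ** 2
rhs = len(z) * (np.linalg.norm(z, axis=1) ** 2).sum()
assert lhs <= rhs + 1e-12
```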
The term B 2 will be bounded as follows :
$$-\mu\left\langle\nabla F^s(x_S^{t}),\; \mathbb{E}\left[\sum_{i=0}^{m-1}\nabla F_i^s(x_S^{t})\right]\right\rangle \overset{(b_5)}{=} -m\mu\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2$$
Note that the equality $(b_5)$ holds based on the definition of the global objective $F^s$. The term $B_3$ will be bounded as below:
$$\begin{aligned}
\mathbb{E}\left\|\sum_{i=0}^{m-1}\widetilde{\nabla}F_i^s(x_{S,i}^{t})\right\|^2 &\overset{(b_6)}{\le} m\sum_{i=0}^{m-1}\mathbb{E}\left\|\widetilde{\nabla}F_i^s(x_{S,i}^{t})\right\|^2 = m\sum_{i=0}^{m-1}\mathbb{E}\left\|\widetilde{\nabla}F_i^s(x_{S,i}^{t})-\nabla F_i^s(x_{S,i}^{t})+\nabla F_i^s(x_{S,i}^{t})\right\|^2 \\
&\overset{(b_7)}{\le} 2m\sum_{i=0}^{m-1}\mathbb{E}\left\|\widetilde{\nabla}F_i^s(x_{S,i}^{t})-\nabla F_i^s(x_{S,i}^{t})\right\|^2 + 2m\sum_{i=0}^{m-1}\mathbb{E}\left\|\nabla F_i^s(x_{S,i}^{t})\right\|^2 \\
&\overset{(b_8)}{\le} 2m^2\sigma_L^2 + 2m\sum_{i=0}^{m-1}\mathbb{E}\left\|\nabla F_i^s(x_{S,i}^{t})-\nabla F^s(x_{S,i}^{t})+\nabla F^s(x_{S,i}^{t})-\nabla F^s(x_S^{t})+\nabla F^s(x_S^{t})\right\|^2 \\
&\overset{(b_9)}{\le} 2m^2\sigma_L^2 + 6m\sum_{i=0}^{m-1}\mathbb{E}\left\|\nabla F_i^s(x_{S,i}^{t})-\nabla F^s(x_{S,i}^{t})\right\|^2 + 6m\sum_{i=0}^{m-1}\mathbb{E}\left\|\nabla F^s(x_{S,i}^{t})-\nabla F^s(x_S^{t})\right\|^2 + 6m\sum_{i=0}^{m-1}\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2 \\
&\overset{(b_{10})}{\le} 2m^2\sigma_L^2 + 6m^2\sigma_G^2 + 6mL^2\,\underbrace{\sum_{i=0}^{m-1}\mathbb{E}\left\|x_{S,i}^{t}-x_S^{t}\right\|^2}_{B_4} + 6m^2\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2
\end{aligned}$$
The inequalities $(b_6)$, $(b_7)$, and $(b_9)$ hold for the same reason $(b_3)$ holds above. The inequalities $(b_8)$ and $(b_{10})$ hold due to Assumption 3. The term $B_4$ is bounded similarly to [13, Lemma 3]:
$$\begin{aligned}
\mathbb{E}\left\|x_{S,i}^{t}-x_S^{t}\right\|^2 &\overset{(b_{11})}{=} \mathbb{E}\left\|x_{S,i-1}^{t}-x_S^{t}-\mu\widetilde{\nabla}F_{i-1}^s(x_{S,i-1}^{t})\right\|^2 \\
&= \mathbb{E}\left\|x_{S,i-1}^{t}-x_S^{t}\right\|^2 + \mathbb{E}\left\|\mu\widetilde{\nabla}F_{i-1}^s(x_{S,i-1}^{t})\right\|^2 - 2\,\mathbb{E}\left\langle x_{S,i-1}^{t}-x_S^{t},\; \mu\widetilde{\nabla}F_{i-1}^s(x_{S,i-1}^{t})\right\rangle \\
&\overset{(b_{12})}{\le} \left(1+\frac{1}{2m-1}\right)\mathbb{E}\left\|x_{S,i-1}^{t}-x_S^{t}\right\|^2 + (1+2m)\,\mathbb{E}\left\|\mu\widetilde{\nabla}F_{i-1}^s(x_{S,i-1}^{t})\right\|^2 \\
&= \left(1+\frac{1}{2m-1}\right)\mathbb{E}\left\|x_{S,i-1}^{t}-x_S^{t}\right\|^2 + (1+2m)\mu^2\,\mathbb{E}\Big\|\widetilde{\nabla}F_{i-1}^s(x_{S,i-1}^{t})-\nabla F_{i-1}^s(x_{S,i-1}^{t})+\nabla F_{i-1}^s(x_{S,i-1}^{t})-\nabla F_{i-1}^s(x_S^{t}) \\
&\qquad\qquad +\nabla F_{i-1}^s(x_S^{t})-\nabla F^s(x_S^{t})+\nabla F^s(x_S^{t})\Big\|^2 \\
&\overset{(b_{13})}{\le} \left(1+\frac{1}{2m-1}\right)\mathbb{E}\left\|x_{S,i-1}^{t}-x_S^{t}\right\|^2 + 4(1+2m)\mu^2\,\mathbb{E}\left\|\widetilde{\nabla}F_{i-1}^s(x_{S,i-1}^{t})-\nabla F_{i-1}^s(x_{S,i-1}^{t})\right\|^2 \\
&\qquad + 4(1+2m)\mu^2\,\mathbb{E}\left\|\nabla F_{i-1}^s(x_{S,i-1}^{t})-\nabla F_{i-1}^s(x_S^{t})\right\|^2 + 4(1+2m)\mu^2\,\mathbb{E}\left\|\nabla F_{i-1}^s(x_S^{t})-\nabla F^s(x_S^{t})\right\|^2 + 4(1+2m)\mu^2\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2 \\
&\overset{(b_{14})}{\le} \left(1+\frac{1}{2m-1}+4(1+2m)\mu^2L^2\right)\mathbb{E}\left\|x_{S,i-1}^{t}-x_S^{t}\right\|^2 + 4(1+2m)\mu^2\left(\sigma_L^2+\sigma_G^2\right) + 4(1+2m)\mu^2\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2
\end{aligned}$$
Inequality $(b_{11})$ follows from A38. The inequality $(b_{12})$ holds based on $2\langle a,b\rangle \le \frac{1}{n-1}\|a\|^2 + n\|b\|^2$ for any two vectors a, b and positive number n. The inequality $(b_{13})$ follows from the fact previously mentioned at inequality $(b_3)$, and the inequality $(b_{14})$ is based on Assumptions 3 and 1. Given $\mu \le \frac{1}{2L(2m+1)}$ and by averaging over the clients, observe that:
$$\begin{aligned}
\frac{1}{m}\sum_{i=0}^{m-1}\mathbb{E}\left\|x_{S,i}^{t}-x_S^{t}\right\|^2 &\le \left(1+\frac{4m}{4m^2-1}\right)\frac{1}{m}\sum_{i=1}^{m-1}\mathbb{E}\left\|x_{S,i-1}^{t}-x_S^{t}\right\|^2 + 4(1+2m)\mu^2\left(\sigma_L^2+\sigma_G^2\right) + 4(1+2m)\mu^2\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2 \\
&\le \left(1+\frac{1}{m-1}\right)\frac{1}{m}\sum_{i=1}^{m-1}\mathbb{E}\left\|x_{S,i-1}^{t}-x_S^{t}\right\|^2 + 4(1+2m)\mu^2\left(\sigma_L^2+\sigma_G^2\right) + 4(1+2m)\mu^2\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2
\end{aligned}$$
Unrolling the recursion, following [13, Lemma 3], it is inferred that:
$$\begin{aligned}
\frac{1}{m}\sum_{i=0}^{m-1}\mathbb{E}\left\|x_{S,i}^{t}-x_S^{t}\right\|^2 &\le \sum_{j=0}^{m-1}\left(1+\frac{1}{m-1}\right)^{j}\left[4(1+2m)\mu^2\left(\sigma_L^2+\sigma_G^2\right) + 4(1+2m)\mu^2\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2\right] \\
&\le (m-1)\left(1+\frac{1}{m-1}\right)^{m-1}\left[4(1+2m)\mu^2\left(\sigma_L^2+\sigma_G^2\right) + 4(1+2m)\mu^2\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2\right] \\
&\le 16\left(m+2m^2\right)\mu^2\left(\sigma_L^2+\sigma_G^2\right) + 16\left(m+2m^2\right)\mu^2\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2
\end{aligned}$$
Note that in the above inequality, $\left(1+\frac{1}{m-1}\right)^{m-1} \le 4$ for $m > 1$.
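Both elementary facts used in the last two steps — the Young-type bound behind $(b_{12})$ and the constant 4 above — can be checked numerically; the snippet is illustrative only:

```python
import numpy as np

# (b12): 2<a,b> <= ||a||^2/(n-1) + n*||b||^2, which follows from
# 0 <= ||a/sqrt(n-1) - sqrt(n-1)*b||^2 together with ||b||^2 >= 0.
rng = np.random.default_rng(6)
for n in (2, 3, 10):
    for _ in range(200):
        a, b = rng.normal(size=5), rng.normal(size=5)
        assert 2 * (a @ b) <= (a @ a) / (n - 1) + n * (b @ b) + 1e-12

# Constant bound: (1 + 1/(m-1))^(m-1) increases toward e < 4 as m grows,
# so it is at most 4 for every m > 1.
vals = [(1 + 1 / (m - 1)) ** (m - 1) for m in range(2, 1000)]
assert all(v <= 4 for v in vals)
assert max(vals) < 2.718281828459046    # the sequence stays below e
```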
Plugging A41, A42, A43, and A44 into A40, observe that:
$$\begin{aligned}
\mathbb{E}\,F^s(x_S^{t+l}) \le\;& F^s(x_S^{t}) + \frac{1}{2}\left(16L^2\mu^3m^3(1+2m) + 2L\mu^2m^2 + 96L^3\mu^4m^3(1+2m)\right)\sigma_L^2 \\
&+ \frac{1}{2}\left(16L^2\mu^3m^3(1+2m) + 6L\mu^2m^2 + 96L^3\mu^4m^3(1+2m)\right)\sigma_G^2 \\
&+ \frac{1}{2}\left(-2\mu m + 6Lm^2\mu^2 + \mu + 96L^3\mu^4m^3(1+2m) + 16L^2\mu^3m^3(1+2m)\right)\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2 \\
\overset{(b_{15})}{\le}\;& F^s(x_S^{t}) + \frac{\mu(3-2m)}{2}\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2 + \frac{L\mu^2m^2}{2}\left(9.2\,\sigma_L^2 + 13.2\,\sigma_G^2\right)
\end{aligned}$$
Rearranging the terms and summing over t, observe that:
$$\sum_{t}\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2 \le \frac{2\left(F^s(x_S^{0})-F^s(x_S^{*})\right)}{\mu(2m-3)} + \sum_{t}\frac{L\mu m^2}{2m-3}\left(9.2\,\sigma_L^2 + 13.2\,\sigma_G^2\right)$$
Assuming there are T global rounds overall,
$$\min_{t}\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2 \le \frac{2l\left(F^s(x_S^{0})-F^s(x_S^{*})\right)}{\mu(2m-3)T} + \frac{L\mu m^2}{2m-3}\left(9.2\,\sigma_L^2 + 13.2\,\sigma_G^2\right)$$
Theorem A4. 
Under Assumptions 1, 2, 3, and partial participation of clients, if $\mu \le \frac{1}{8L^2m^2}$ and $t \equiv 0 \pmod{l}$, the convergence rate of the server model of Algorithm 1 satisfies:
$$\min_{t}\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2 \le \frac{l\left(F^s(x_S^{0})-F^s(x_S^{*})\right)}{\mu(m-2)T} + \frac{L\mu m^2}{m-2}\left(7\sigma_L^2 + 7\sigma_G^2\right)$$
Corollary A4. 
Let $\mu \le \frac{1}{L^2m^2\sqrt{T}}$; then the convergence rate of the server-side model is:
$$\min_{t\in[T]}\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2 \le \mathcal{O}\left(\frac{l}{\sqrt{T}}\right)$$
Proof. 
Due to Assumption 1, for any x S t + l and x S t , it can be written that:
$$F^s(x_S^{t+l}) \le F^s(x_S^{t}) + \left\langle\nabla F^s(x_S^{t}),\; x_S^{t+l}-x_S^{t}\right\rangle + \frac{L}{2}\left\|x_S^{t+l}-x_S^{t}\right\|^2$$
Also, note the client forward-propagation and server model update rules in Algorithm 1,
$$x_{S,i+1}^{t} = x_{S,i}^{t} - \mu\,\widetilde{\nabla}F_i^s(x_{S,i}^{t})$$
Thus, putting x S t = x S , 0 t and x S t + l = x S , m t , note that:
$$x_S^{t+l} - x_S^{t} = -\mu\sum_{i=0}^{m-1}\widetilde{\nabla}F_i^s(x_{S,i}^{t})$$
Taking expectation w.r.t. both types of randomness, i.e., the randomness due to stochastic gradients and due to the sampling of clients at round t, and plugging A48 into A47, note that:
$$\begin{aligned}
\mathbb{E}\,F^s(x_S^{t+l}) &\le F^s(x_S^{t}) - \mu\left\langle\nabla F^s(x_S^{t}),\; \mathbb{E}\left[\sum_{i\in[S^{t}]}\widetilde{\nabla}F_i^s(x_{S,i}^{t})\right]\right\rangle + \frac{L\mu^2}{2}\,\mathbb{E}\left\|\sum_{i\in[S^{t}]}\widetilde{\nabla}F_i^s(x_{S,i}^{t})\right\|^2 \\
\mathbb{E}\,F^s(x_S^{t+l}) &\le F^s(x_S^{t}) \underbrace{-\,\mu\left\langle\nabla F^s(x_S^{t}),\; \mathbb{E}\left[\sum_{i\in[S^{t}]}\left(\widetilde{\nabla}F_i^s(x_{S,i}^{t})-\nabla F_i^s(x_S^{t})\right)\right]\right\rangle}_{B_1} \underbrace{-\,\mu\left\langle\nabla F^s(x_S^{t}),\; \mathbb{E}\left[\sum_{i\in[S^{t}]}\nabla F_i^s(x_S^{t})\right]\right\rangle}_{B_2} + \underbrace{\frac{L\mu^2}{2}\,\mathbb{E}\left\|\sum_{i\in[S^{t}]}\widetilde{\nabla}F_i^s(x_{S,i}^{t})\right\|^2}_{B_3}
\end{aligned}$$
Note that [ S t ] = { q 1 t , . . . , q n t } is the index set demonstrating the sampled clients, which might contain duplicate elements, as the sampling is with replacement. The term B 1 will be bounded as follows:
$$\begin{aligned}
-\mu&\left\langle\nabla F^s(x_S^{t}),\; \mathbb{E}\left[\mathbb{E}\left[\sum_{i\in[S^{t}]}\left(\widetilde{\nabla}F_i^s(x_{S,i}^{t})-\nabla F_i^s(x_S^{t})\right)\,\middle|\,\xi\right]\right]\right\rangle \\
&\overset{(b_1)}{=} \frac{\mu}{2}\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2 + \frac{\mu}{2}\,\mathbb{E}\left\|\sum_{i\in[S^{t}]}\left(\nabla F_i^s(x_{S,i}^{t})-\nabla F_i^s(x_S^{t})\right)\right\|^2 - \frac{\mu}{2}\,\mathbb{E}\left\|\sum_{i\in[S^{t}]}\left(\nabla F_i^s(x_{S,i}^{t})-\nabla F_i^s(x_S^{t})\right)+\nabla F^s(x_S^{t})\right\|^2 \\
&\overset{(b_2)}{\le} \frac{\mu}{2}\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2 + \frac{\mu}{2}\,\mathbb{E}\left\|\sum_{i\in[S^{t}]}\left(\nabla F_i^s(x_{S,i}^{t})-\nabla F_i^s(x_S^{t})\right)\right\|^2 \\
&\overset{(b_3)}{=} \frac{\mu}{2}\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2 + \frac{\mu}{2}\,\mathbb{E}\left\|\sum_{i=1}^{n}\left(\nabla F_{q_i^t}^s\big(x_{S,q_i^t}^{t}\big)-\nabla F_{q_i^t}^s(x_S^{t})\right)\right\|^2 \\
&\overset{(b_4)}{\le} \frac{\mu}{2}\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2 + \frac{n\mu L^2}{2}\sum_{i=1}^{n}\mathbb{E}\left\|x_{S,q_i^t}^{t}-x_S^{t}\right\|^2
\end{aligned}$$
The equality $(b_1)$ is due to $\langle a,b\rangle = \frac{1}{2}\left(\|a\|^2+\|b\|^2-\|a-b\|^2\right)$ for any two vectors a and b. The inequality $(b_2)$ is clear, as we dropped a negative term; the equality $(b_3)$ rewrites the sum over the sampled index set; and the inequality $(b_4)$ stems from the fact that $\mathbb{E}\left\|\sum_{i}^{n} z_i\right\|^2 \le n\sum_i\mathbb{E}\|z_i\|^2$ holds for any random variables $z_i$, together with Assumption 1.
The term B 2 will be bounded as follows:
$$-\mu\left\langle\nabla F^s(x_S^{t}),\; \mathbb{E}\left[\sum_{i\in[S^{t}]}\nabla F_i^s(x_S^{t})\right]\right\rangle \overset{(b_5)}{=} -n\mu\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2$$
Note that the equality $(b_5)$ holds based on the definition of the global objective $F^s$ and the uniform sampling of clients. The term $B_3$ will be bounded as below:
$$\mathbb{E}\left\|\sum_{i\in[S^{t}]}\widetilde{\nabla}F_i^s(x_{S,i}^{t})\right\|^2 \overset{(b_6)}{\le} \mathbb{E}\left\|\sum_{i\in[S^{t}]}\nabla F_i^s(x_{S,i}^{t})\right\|^2 + \mathbb{E}\left\|\sum_{i\in[S^{t}]}\left(\widetilde{\nabla}F_i^s(x_{S,i}^{t})-\nabla F_i^s(x_{S,i}^{t})\right)\right\|^2 \overset{(b_7)}{\le} n\sigma_L^2 + \mathbb{E}\left\|\sum_{i=1}^{n}\nabla F_{q_i^t}^s\big(x_{S,q_i^t}^{t}\big)\right\|^2$$
Now, let’s consider t i = F i s ( x S , i t ) following the ideas of [15] for this part:
$$\begin{aligned}
\mathbb{E}\left\|\sum_{i=1}^{n}\nabla F_{q_i^t}^s\big(x_{S,q_i^t}^{t}\big)\right\|^2 &= \mathbb{E}\left\|\sum_{i=1}^{n} t_{q_i^t}\right\|^2 = \mathbb{E}\left[\sum_{i=1}^{n}\left\|t_{q_i^t}\right\|^2 + \sum_{\substack{i\ne z \\ q_i^t,\,q_z^t\in[S^{t}]}}\left\langle t_{q_i^t},\, t_{q_z^t}\right\rangle\right] \\
&\overset{(b_8)}{=} \mathbb{E}\left[n\left\|t_{q_1^t}\right\|^2 + n(n-1)\left\langle t_{q_1^t},\, t_{q_2^t}\right\rangle\right] = \frac{n}{m}\sum_{i\in[S]}\left\|t_i\right\|^2 + \frac{n(n-1)}{m^2}\sum_{i,z\in[S]}\left\langle t_i,\, t_z\right\rangle \\
&= \frac{n}{m}\sum_{i\in[S]}\left\|t_i\right\|^2 + \frac{n(n-1)}{m^2}\left\|\sum_{i\in[S]}t_i\right\|^2 \overset{(b_9)}{\le} \frac{n^2}{m}\,\underbrace{\sum_{i\in[S]}\left\|t_i\right\|^2}_{B_4}
\end{aligned}$$
Note that the equality b 8 is due to independent sampling with replacement as outlined by [15]. The inequality b 9 follows from A8. We bound the term B 4 as follows:
$$\begin{aligned}
\sum_{i\in[S]}\left\|t_i\right\|^2 &= \sum_{i=0}^{m-1}\mathbb{E}\left\|\nabla F_i^s(x_{S,i}^{t})\right\|^2 = \sum_{i=0}^{m-1}\mathbb{E}\left\|\nabla F_i^s(x_{S,i}^{t})-\nabla F^s(x_{S,i}^{t})+\nabla F^s(x_{S,i}^{t})-\nabla F^s(x_S^{t})+\nabla F^s(x_S^{t})\right\|^2 \\
&\le 3\sum_{i=0}^{m-1}\mathbb{E}\left\|\nabla F_i^s(x_{S,i}^{t})-\nabla F^s(x_{S,i}^{t})\right\|^2 + 3\sum_{i=0}^{m-1}\mathbb{E}\left\|\nabla F^s(x_{S,i}^{t})-\nabla F^s(x_S^{t})\right\|^2 + 3\sum_{i=0}^{m-1}\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2 \\
&\overset{(b_{10})}{\le} 3m\sigma_G^2 + 3L^2\,\underbrace{\sum_{i=0}^{m-1}\mathbb{E}\left\|x_{S,i}^{t}-x_S^{t}\right\|^2}_{B_5} + 3m\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2
\end{aligned}$$
The inequality $(b_{10})$ holds due to Assumption 3.
The term B 5 is bounded similar to [13, Lemma 3]:
$$\begin{aligned}
\mathbb{E}\left\|x_{S,i}^{t}-x_S^{t}\right\|^2 &\overset{(b_{11})}{=} \mathbb{E}\left\|x_{S,i-1}^{t}-x_S^{t}-\mu\widetilde{\nabla}F_{i-1}^s(x_{S,i-1}^{t})\right\|^2 \\
&= \mathbb{E}\left\|x_{S,i-1}^{t}-x_S^{t}\right\|^2 + \mathbb{E}\left\|\mu\widetilde{\nabla}F_{i-1}^s(x_{S,i-1}^{t})\right\|^2 - 2\,\mathbb{E}\left\langle x_{S,i-1}^{t}-x_S^{t},\; \mu\widetilde{\nabla}F_{i-1}^s(x_{S,i-1}^{t})\right\rangle \\
&\overset{(b_{12})}{\le} \left(1+\frac{1}{2m-1}\right)\mathbb{E}\left\|x_{S,i-1}^{t}-x_S^{t}\right\|^2 + (1+2m)\,\mathbb{E}\left\|\mu\widetilde{\nabla}F_{i-1}^s(x_{S,i-1}^{t})\right\|^2 \\
&= \left(1+\frac{1}{2m-1}\right)\mathbb{E}\left\|x_{S,i-1}^{t}-x_S^{t}\right\|^2 + (1+2m)\mu^2\,\mathbb{E}\Big\|\widetilde{\nabla}F_{i-1}^s(x_{S,i-1}^{t})-\nabla F_{i-1}^s(x_{S,i-1}^{t})+\nabla F_{i-1}^s(x_{S,i-1}^{t})-\nabla F_{i-1}^s(x_S^{t}) \\
&\qquad\qquad +\nabla F_{i-1}^s(x_S^{t})-\nabla F^s(x_S^{t})+\nabla F^s(x_S^{t})\Big\|^2 \\
&\overset{(b_{13})}{\le} \left(1+\frac{1}{2m-1}\right)\mathbb{E}\left\|x_{S,i-1}^{t}-x_S^{t}\right\|^2 + 4(1+2m)\mu^2\,\mathbb{E}\left\|\widetilde{\nabla}F_{i-1}^s(x_{S,i-1}^{t})-\nabla F_{i-1}^s(x_{S,i-1}^{t})\right\|^2 \\
&\qquad + 4(1+2m)\mu^2\,\mathbb{E}\left\|\nabla F_{i-1}^s(x_{S,i-1}^{t})-\nabla F_{i-1}^s(x_S^{t})\right\|^2 + 4(1+2m)\mu^2\,\mathbb{E}\left\|\nabla F_{i-1}^s(x_S^{t})-\nabla F^s(x_S^{t})\right\|^2 + 4(1+2m)\mu^2\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2 \\
&\overset{(b_{14})}{\le} \left(1+\frac{1}{2m-1}+4(1+2m)\mu^2L^2\right)\mathbb{E}\left\|x_{S,i-1}^{t}-x_S^{t}\right\|^2 + 4(1+2m)\mu^2\left(\sigma_L^2+\sigma_G^2\right) + 4(1+2m)\mu^2\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2
\end{aligned}$$
Inequality $(b_{11})$ follows from A48. The inequality $(b_{12})$ holds based on $2\langle a,b\rangle \le \frac{1}{n-1}\|a\|^2 + n\|b\|^2$ for any two vectors a, b and positive number n. The inequality $(b_{13})$ follows from the fact previously mentioned at inequality $(b_3)$, and the inequality $(b_{14})$ is based on Assumptions 3 and 1.
Given $\mu \le \frac{1}{2L(2m+1)}$ and by averaging over the clients, observe that:
$$\begin{aligned}
\frac{1}{m}\sum_{i=0}^{m-1}\mathbb{E}\left\|x_{S,i}^{t}-x_S^{t}\right\|^2 &\le \left(1+\frac{4m}{4m^2-1}\right)\frac{1}{m}\sum_{i=1}^{m-1}\mathbb{E}\left\|x_{S,i-1}^{t}-x_S^{t}\right\|^2 + 4(1+2m)\mu^2\left(\sigma_L^2+\sigma_G^2\right) + 4(1+2m)\mu^2\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2 \\
&\le \left(1+\frac{1}{m-1}\right)\frac{1}{m}\sum_{i=1}^{m-1}\mathbb{E}\left\|x_{S,i-1}^{t}-x_S^{t}\right\|^2 + 4(1+2m)\mu^2\left(\sigma_L^2+\sigma_G^2\right) + 4(1+2m)\mu^2\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2
\end{aligned}$$
Unrolling the recursion, following [13, Lemma 3], it is inferred that:
$$\begin{aligned}
\frac{1}{m}\sum_{i=0}^{m-1}\mathbb{E}\left\|x_{S,i}^{t}-x_S^{t}\right\|^2 &\le \sum_{j=0}^{m-1}\left(1+\frac{1}{m-1}\right)^{j}\left[4(1+2m)\mu^2\left(\sigma_L^2+\sigma_G^2\right) + 4(1+2m)\mu^2\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2\right] \\
&\le (m-1)\left(1+\frac{1}{m-1}\right)^{m-1}\left[4(1+2m)\mu^2\left(\sigma_L^2+\sigma_G^2\right) + 4(1+2m)\mu^2\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2\right] \\
&\le 16\left(m+2m^2\right)\mu^2\left(\sigma_L^2+\sigma_G^2\right) + 16\left(m+2m^2\right)\mu^2\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2
\end{aligned}$$
Note that in the above inequality, $\left(1+\frac{1}{m-1}\right)^{m-1} \le 4$ for $m > 1$. Plugging A51, A52, A55, and A57 into A50, observe that:
$$\begin{aligned}
\mathbb{E}\,F^s(x_S^{t+l}) \le\;& F^s(x_S^{t}) + \frac{1}{2}\left(16L^4\left(n^3+2n^4\right)\mu^3 + Ln\mu^2 + 48L^3n^2\mu^4\left(m+2m^2\right)\right)\sigma_L^2 \\
&+ \frac{1}{2}\left(16L^4\left(n^3+2n^4\right)\mu^3 + 3L\mu^2n^2 + 48L^3n^2\mu^4\left(m+2m^2\right)\right)\sigma_G^2 \\
&+ \frac{1}{2}\left(\mu - 2n\mu + 16L^4\left(n^3+2n^4\right)\mu^3 + 3L\mu^2n^2 + 48L^3n^2\mu^4\left(m+2m^2\right)\right)\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2 \\
\overset{(b_{15})}{\le}\;& F^s(x_S^{t}) + \frac{\mu(4-2m)}{2}\,\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2 + \frac{L\mu^2m^2}{2}\left(14\sigma_L^2 + 14\sigma_G^2\right)
\end{aligned}$$
After simplification, the inequality $(b_{15})$ holds since $\mu \le \frac{1}{8L^2m^2}$. Rearranging the terms and summing over t, observe that:
$$\sum_{t}\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2 \le \frac{F^s(x_S^{0})-F^s(x_S^{*})}{\mu(m-2)} + \sum_{t}\frac{L\mu m^2}{m-2}\left(7\sigma_L^2 + 7\sigma_G^2\right)$$
Assuming there are T global rounds overall,
$$\min_{t}\mathbb{E}\left\|\nabla F^s(x_S^{t})\right\|^2 \le \frac{l\left(F^s(x_S^{0})-F^s(x_S^{*})\right)}{\mu(m-2)T} + \frac{L\mu m^2}{m-2}\left(7\sigma_L^2 + 7\sigma_G^2\right)$$
This concludes the proof. □

References

  1. McMahan, H.B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-Efficient Learning of Deep Networks from Decentralized Data. arXiv preprint, arXiv:1602.05629.
  2. Thapa, C.; Chamikara, M.A.P.; Camtepe, S. SplitFed: When Federated Learning Meets Split Learning. arXiv preprint 2020, arXiv:2004.12088.
  3. Gupta, O.; Raskar, R. Distributed learning of deep neural network over multiple agents. Journal of Network and Computer Applications 2018, 116, 1–8. [Google Scholar] [CrossRef]
  4. Mu, Y.; Shen, C. Communication and Storage Efficient Federated Split Learning. arXiv preprint 2023, arXiv:2302.05599. [Google Scholar]
  5. Kim, M.; DeRieux, A.; Saad, W. A bargaining game for personalized, energy efficient split learning over wireless networks. 2023 IEEE Wireless Communications and Networking Conference (WCNC). IEEE, 2023, pp. 1–6.
  6. Li, Y.; Lyu, X. Convergence Analysis of Sequential Split Learning on Heterogeneous Data. arXiv preprint 2023, arXiv:2302.01633. [Google Scholar]
  7. Liao, Y.; Xu, Y.; Xu, H.; Yao, Z.; Wang, L.; Qiao, C. Accelerating federated learning with data and model parallelism in edge computing. IEEE/ACM Transactions on Networking 2023. [CrossRef]
  8. Han, D.J.; Bhatti, H.I.; Lee, J.; Moon, J. Accelerating federated learning with split learning on locally generated losses. ICML 2021 workshop on federated learning for user privacy and data confidentiality. ICML Board, 2021.
  9. Belilovsky, E.; Eickenberg, M.; Oyallon, E. Decoupled greedy learning of cnns. International Conference on Machine Learning. PMLR, 2020, pp. 736–745.
  10. Jaderberg, M.; Czarnecki, W.M.; Osindero, S.; Vinyals, O.; Graves, A.; Silver, D.; Kavukcuoglu, K. Decoupled neural interfaces using synthetic gradients. International conference on machine learning. PMLR, 2017, pp. 1627–1635.
  11. Ghadimi, S.; Lan, G. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 2013, 23, 2341–2368. [Google Scholar] [CrossRef]
  12. Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning 2021, 14, 1–210. [Google Scholar] [CrossRef]
  13. Reddi, S.; Charles, Z.; Zaheer, M.; Garrett, Z.; Rush, K.; Konečnỳ, J.; Kumar, S.; McMahan, H.B. Adaptive federated optimization. arXiv preprint 2020, arXiv:2003.00295. [Google Scholar]
  14. Reisizadeh, A.; Mokhtari, A.; Hassani, H.; Jadbabaie, A.; Pedarsani, R. Fedpaq: A communication-efficient federated learning method with periodic averaging and quantization. International conference on artificial intelligence and statistics. PMLR, 2020, pp. 2021–2031.
  15. Yang, H.; Fang, M.; Liu, J. Achieving linear speedup with partial worker participation in non-iid federated learning. arXiv preprint 2021, arXiv:2101.11203. [Google Scholar]
Figure 1. CSE-FSL pipeline
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.