1. Introduction
Recent scientific research has witnessed remarkable strides in deep learning methodologies and neural network architectures. These advances have garnered considerable attention from researchers due to their extensive applicability in the academic and commercial domains [
1]. Deep learning techniques have demonstrated efficacy across diverse domains such as computer vision, natural language processing, medical imaging, and communication systems. The key to this success lies in specialized network models that exploit large datasets and substantial computing resources. Therefore, both software- and hardware-based applications have been the subject of extensive research aimed at enhancing the prospects of deep learning and facilitating its widespread adoption.
Deep learning is a process of learning complex input data using simple parts [
1]. In common machine learning and deep learning applications, these input data are typically represented by Euclidean structures. In contrast, graph data with complex relationships, such as physical phenomena, chemical bonds, protein structures, and diseases, are represented by non-Euclidean data and require specialized models [
2]. Current machine learning methods for non-Euclidean data have limited performance due to their high computational cost and implementation inflexibility. Although deep learning models have proven successful on vector-based inputs, Graph Neural Networks (GNNs) have attracted the attention of researchers due to their ability to learn complicated relationships between nodes and edges in a graph [
1]. Research has shown that GNNs outperform classical deep learning models when dealing with non-Euclidean data [
2].
GNNs are high-accuracy neural network models that can be trained on graph data and used for inference. Graph datasets often have complex relationships between their nodes, and GNNs can use these relationships to successfully learn local and global information. The unique structure of GNNs demonstrates impressive performance in social networks [
3], friend recommendations, molecular bonding [
4], e-commerce [
5], and product recommendation systems [
6]. Successful results have been achieved in various fields, including particle physics [
7,
8], natural language processing [
9], traffic applications [
10], anomaly detection [
11], and many academic studies [
12,
13,
14]. In addition, GNN technology has attracted the attention of technology companies and major corporations such as Google [
15], Amazon [
16], Facebook [
17], and Alibaba [
6], which have started to include it in their strategic plans. Current research shows that GNNs will have wide-ranging applications in various areas of life, and it is foreseen that this technology will play an important role in various applications such as information processing analysis, science, industry, and daily life.
Compared to the Euclidean data used in classical neural network models, graph data require more computationally intensive resources [
1]. The massive size of the data and the complexity of the connections increase the effectiveness of the GNN on graph datasets [
18,
19]. In addition, the irregular nature and instability of graph data pose several computational challenges [
20]. Current GNN implementations for learning these complex datasets use frameworks such as the Deep Graph Library (DGL) [
16], PyTorch Geometric (PyG) [
21], and TensorFlow GNN [
22]. These frameworks provide software support for graph neural networks on CPUs and GPUs, which are often used for Convolutional Neural Networks (CNNs) and other well-known methods [
23,
24]. However, for embedded device applications such as edge devices, IoT, and mobile applications, generic frameworks are insufficient in terms of integration and efficiency. To overcome this problem, there are several works in the literature to build lightweight and fast GNN models. While these studies cover both traditional and embedded hardware, GPU and CPU implementations form a large part of the overall topic. Although the use of low-bit precision and FPGA-based accelerators to derive low-dimensional and real-time models is a well-known research topic in Neural Networks (NNs) [
25] and Convolutional Neural Networks [
26], there has been insufficient research on GNNs [
20,
24,
27,
28]. Therefore, the study of quantized GNN models for accelerators is a promising research topic.
Developments in GNNs have increased the range of applications of the models, and specialized GNN implementations require specialized hardware techniques [
20]. In previous neural network studies, hardware accelerators based on FPGAs have often been the preferred solution when classical neural networks face challenging situations. In the same spirit, these specialized accelerators are seen as crucial in improving the performance of GNN models and making them more suitable for use on embedded devices [
29]. However, there has not yet been enough research conducted on this topic in GNNs [
30]. Due to their high computational power, FPGAs are known as ideal candidates for efficient processing of complex algorithms and fast application execution. Furthermore, the flexibility of FPGA hardware components allows customization to meet the unique requirements of users, enabling neural network models to be optimized to suit specific demands, ultimately leading to increased performance. In addition, specially designed and optimized hardware reduces power consumption by significantly reducing computational costs [
31]. Low power consumption can increase the energy efficiency of embedded devices, making it advantageous for mobile applications. FPGAs offer flexibility, which opens up innovative possibilities for researchers and developers [
32]. The research community aims to utilize FPGA-based accelerators to address issues such as load imbalance, memory requirements, and computing power [
30]. Improved performance in GNN-based applications can be achieved through the effective optimization of FPGAs. However, for larger models, additional approaches are required to achieve performance improvements due to limited storage and computing power.
Neural network quantization constitutes an essential technique for the scaling of network models, as well as the efficient computation on hardware such as FPGAs. It serves as a means to enhance computational efficiency and alleviate memory demands by representing the model parameters with reduced bit precision [
33]. The core objective underlying quantization is the minimization of bit usage, achieved through the mapping of real numbers to lower-precision equivalents while upholding accuracy standards [
34]. This process holds significant promise for facilitating high-performance applications in the future by enabling a streamlined representation of the parameters crucial for matrix multiplication operations. A notable advantage of quantization is its capacity to accommodate lightweight neural networks by employing fewer bits for both activation and model weights. Integer-based representations, which are central to quantization, offer enhanced processing efficiency on FPGAs compared with floating-point numbers [
20]. Although floating-point calculations often entail intricate operations, integer computations can be executed more straightforwardly and with greater efficiency [
35]. This attribute is particularly advantageous for the development of efficient systems tailored for embedded devices, especially when coupled with specialized hardware, such as FPGAs. The significance of quantization techniques transcends mere computational efficiency; they play a pivotal role in optimizing the performance of FPGA-based hardware accelerators in GNN systems.
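To make the mapping described above concrete, the sketch below applies uniform affine quantization to a small weight tensor in Python; the 8-bit width and random data are illustrative assumptions rather than settings taken from any of the surveyed works.

```python
import numpy as np

def quantize_uniform(x, num_bits=8):
    """Map float values onto an unsigned integer grid with a scale and zero-point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)        # width of one integer step
    zero_point = int(round(qmin - x.min() / scale))    # integer that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover an approximate float tensor from the integer representation."""
    return scale * (q.astype(np.float32) - zero_point)

weights = np.random.randn(4, 4).astype(np.float32)    # illustrative weight tensor
q, s, z = quantize_uniform(weights)
error = np.abs(weights - dequantize(q, s, z)).max()
print(q.dtype, error)                                  # uint8, error on the order of one step
```

Matrix multiplications can then be carried out on the integer tensors, with a single rescaling at the end, which is precisely the property that makes quantized models attractive for FPGA datapaths.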
The literature examines Graph Neural Networks (GNNs) in-depth, covering general knowledge, network structures, potential gaps, and perspectives through various surveys [
1,
2,
12,
15,
18,
19,
36,
37,
38,
39,
40,
41,
42,
43,
44,
45,
46,
47]. In addition, the existing literature includes surveys and reviews focusing on specific applications [
48,
49,
50,
51]. For instance, Lamb et al. [
52] conducted a study on the use of GNNs as a neural-symbolic computing tool, whereas Malekzadeh et al. [
53] analyzed their use in text classification. Ahmad et al. [
54] examined the use of Graph Convolutional Neural Networks (GCNs) in human action recognition and proposed a taxonomy of studies in this area. Other studies have provided detailed insights into the use of GNNs in various applications such as the Internet of Things (IoT) [
55], network science [
56], and language processing [
57]. There are also general overviews of accelerators and efficient GNNs [
29,
58,
59]. Liu et al. [
60] approach current and future GNN work from an algorithmic perspective, while Abadal et al. [
61] provide a comprehensive overview of the acceleration algorithms and GNN fundamentals. As shown in
Figure 1, we review hardware-based accelerators and quantization approaches for computationally efficient GNNs, with a particular focus on energy-constrained embedded device applications.
This survey explores quantization methods and FPGA-based hardware accelerators that reduce the computational cost and model complexity of GNNs. The paper focuses on quantization approaches that can be applied directly to embedded devices or may be compatible with such hardware in the future, on FPGA-based hardware accelerators, and on efficient system designs that use these two complementary methods together. Moreover, we provide an overview of work on FPGA-based quantized GNNs and a review of other work on medium- to large-scale FPGA accelerators and quantization methods for researchers investigating GNNs/GCNs, embedded systems with accelerated/reconfigurable hardware (FPGA), and quantization techniques. The motivation behind this survey is twofold: first, to provide a comprehensive study of GNN quantization, a promising complementary method for FPGA accelerators; second, to update the existing literature, which has become outdated given the recent increase in research on FPGA-based GNN accelerators. This research makes the following contributions to the academic community:
GNN Basics and Theories: This paper presents the basic concepts of GNNs and their layers. Furthermore,
Figure 2 proposes a taxonomy of GNNs according to their variants.
Quantization Methods and Lightweight Models: This survey reviews quantization methods aimed at building lightweight GNN models for embedded systems or GPU- and CPU-based applications.
FPGA-Based Hardware Accelerators: Our research describes in detail the work being done on hardware-based accelerators (typically FPGAs) that can be used in current or future embedded device applications.
Discussion and Future Research Directions: Based on its findings, the study discusses outcomes relevant to future research and provides insights into possible research gaps.
The following sections provide a background with general information about GNN models, GNN quantization including quantization methods and studies, and FPGA-based hardware accelerator approaches. The paper finishes with future directions and a conclusion.
Figure 2. Proposed taxonomy based on variants of GNNs.
2. Background
Designed to uncover patterns within graph data by scrutinizing the relationships between nodes and edges, GNN models offer an effective means of extracting insights from intricate network structures. Various GNN models with different characteristics are documented in the literature [
62,
63,
64,
65,
66,
67,
68,
69,
70,
71].
Figure 2 illustrates an updated taxonomy that categorizes GNN variants based on their intended applications, as delineated in the existing literature.
Variations of GNNs, including Graph Convolutional Networks (GCN), Graph Isomorphism Network (GIN), Graph Attention Network (GAT), and GraphSAGE, are widely employed in this domain, serving as fundamental building blocks for graph-based applications. This section provides an overview of GNNs, encompassing their foundational principles, mathematical formulations, and commonly used layers.
2.1. Graph Neural Networks
A graph is defined by its vertices and edges, expressed as $G = (V, E)$, where $V$ represents the vertices and $E$ represents the edges. Graph neural network (GNN) models that utilize these edges and vertices to learn the network structure typically involve two main phases: an aggregation phase and a combination phase. The aggregation phase, as shown in Equation (1), involves the computation of a new feature vector by aggregating the features of neighbour nodes. Several techniques are used to compute these feature vectors, including weighted averages, maximum values, or summations. It is important to emphasize that the aggregation process operates on the set $\mathcal{N}(v)$ of neighbour nodes and serves as an essential component in extracting contextual information for each node.

$a_v^{(l)} = \mathrm{AGGREGATE}^{(l)}\big(\{ h_u^{(l-1)} : u \in \mathcal{N}(v) \}\big)$  (1)
Subsequently, the combination phase, as represented by Equation (2), combines these derived feature vectors to construct a high-level feature matrix. Here, the new feature vector of each node is fused with the original feature matrix, resulting in a comprehensive representation of high-level features that can be leveraged for classification or regression applications.

$h_v^{(l)} = \mathrm{COMBINE}^{(l)}\big( h_v^{(l-1)},\ a_v^{(l)} \big)$  (2)
In Equations (1) and (2), $h_v^{(l)}$ symbolizes the feature vector of node $v$ at layer $l$, $a_v^{(l)}$ is its aggregated neighbourhood vector, and $\mathcal{N}(v)$ denotes the set of neighbouring nodes. The sequencing of the aggregation and combination phases has been the focus of research into system efficiency [
20,
72]. Experimental results show that changing the order of the aggregation and combination phases affects system efficiency. In their study, Tian et al. show that the adoption of
CoAg ordering improves system efficiency [
24]. This research finding highlights the importance of strategic stage sequencing in optimizing the computational efficiency of GNNs.
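As a concrete illustration of the two phases, the following Python sketch implements one generic GNN layer with sum aggregation followed by a linear combination and ReLU, in the spirit of Equations (1) and (2); the toy graph, dimensions, and choice of aggregator are illustrative assumptions.

```python
import numpy as np

def gnn_layer(H, adj_list, W):
    """One generic GNN layer: aggregate neighbour features, then combine.

    H        : (N, F) node feature matrix, rows are h_v^(l-1)
    adj_list : list of neighbour index lists, adj_list[v] = N(v)
    W        : (2F, F') combination weight applied to [h_v || a_v]
    """
    N, F = H.shape
    H_next = np.zeros((N, W.shape[1]))
    for v in range(N):
        # Aggregation phase (Equation 1): sum the feature vectors of N(v).
        a_v = H[adj_list[v]].sum(axis=0) if adj_list[v] else np.zeros(F)
        # Combination phase (Equation 2): fuse h_v with a_v and apply a nonlinearity.
        H_next[v] = np.maximum(np.concatenate([H[v], a_v]) @ W, 0.0)
    return H_next

# Illustrative 4-node path graph: edges 0-1, 1-2, 2-3.
adj = [[1], [0, 2], [1, 3], [2]]
H0 = np.random.randn(4, 8)
W0 = np.random.randn(16, 4)
print(gnn_layer(H0, adj, W0).shape)   # -> (4, 4)
```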
2.2. Graph Convolutional Networks
GCNs [
73] are a specialized type of GNN. These models apply the convolution operations of classical neural networks to graph data. GCNs can be expressed locally for a single edge and vertex or globally for all edges and vertices. Equation (3) shows the global representation of a GCN layer:

$H^{(l+1)} = \sigma\big( A H^{(l)} W^{(l)} \big)$  (3)
In Equation (3), $H^{(l)}$ represents the input feature matrix for layer $l$, $A$ denotes the adjacency matrix, $W^{(l)}$ is the trainable weight matrix, and $\sigma$ denotes the nonlinear activation function. The term $A H^{(l)}$ corresponds to the aggregation phase, while the multiplication by $W^{(l)}$ refers to the combination phase. Note that in this equation, only the features of neighbouring nodes are combined, while the nodes themselves are not considered. Furthermore, multiplication by the adjacency matrix changes the scale of the feature vectors. As a result, higher-degree nodes exert more influence on the aggregation process because they have more neighbours, while the influence of low-degree nodes is reduced. To address these issues, Equation (4) is proposed:

$H^{(l+1)} = \sigma\big( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} \big)$  (4)
To acquire the normalized matrix $\tilde{A} = A + I$, the identity matrix $I$ is added to the adjacency matrix $A$. Here, $\tilde{D}$ denotes the diagonal node degree matrix of $\tilde{A}$.
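The renormalized propagation rule of Equation (4) can be illustrated directly in a few lines of Python; the toy path graph and layer sizes below are assumptions chosen only for demonstration.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer using the renormalization trick of Equation (4)."""
    A_tilde = A + np.eye(A.shape[0])                 # add self-loops: A~ = A + I
    d = A_tilde.sum(axis=1)                          # node degrees of A~
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))           # D~^(-1/2)
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt        # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0.0)            # sigma chosen as ReLU

# Illustrative 3-node path graph.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.random.randn(3, 5)
W = np.random.randn(5, 2)
print(gcn_layer(A, H, W).shape)                      # -> (3, 2)
```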
2.3. Graph Isomorphism Networks
In 2019, Xu et al. [
74] introduced Graph Isomorphism Networks (GIN) as a solution to the analytical challenges posed by Graph Convolutional Networks (GCN) and GraphSAGE, particularly in the context of certain straightforward graph structures. GIN is a recent addition to the Graph Neural Network (GNN) domain and takes a distinctive approach compared to its counterparts.
Equation (5) shows the mathematical formula for GIN, where $h_v^{(k)}$ is the feature vector of node $v$, $\mathrm{MLP}^{(k)}$ is a multi-layer perceptron, and $\epsilon^{(k)}$ represents the learnable parameter.

$h_v^{(k)} = \mathrm{MLP}^{(k)}\Big( \big(1 + \epsilon^{(k)}\big)\, h_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)} \Big)$  (5)
In contrast to conventional models, GIN produces distinct representations for non-isomorphic graphs while assigning identical representations to isomorphic ones. This property gives GIN an advantageous mathematical robustness. In particular, GIN is notable for being provably as expressive as the Weisfeiler-Lehman isomorphism test under certain conditions. This enhances its ability to effectively recognize and distinguish non-isomorphic graphs that are topologically non-identical. GIN's expressive power and its capacity to provide a variety of representations help it address the challenges posed by certain graph structures and make it a remarkable model in the landscape of graph neural networks.
2.4. Graph Attention Networks
Graph Attention Networks (GAT) [
75] is one of the leading models in the GNN family, proposed by Velickovic et al. in 2018. Unlike traditional GNN models, GAT focuses on learning the importance of neighbouring nodes using an attention mechanism and directing information propagation according to this importance. In this way, GAT can improve the learning performance of the model by giving more weight to information from important neighbouring nodes. Similar to GCNs, this model computes a representation vector for each node and a weight vector for each edge. However, in GAT, the hidden representation vectors of neighbouring nodes and the weight vectors of edges are used to compute attention.
For each node, GAT calculates the attention coefficients of its neighbouring nodes, as shown in Equation (6). These coefficients indicate how important a neighbour node is. For the calculation of the attention coefficients, the hidden representation vector of the node and that of the neighbouring node are used. Equation (6) uses $\alpha_{ij}$ for the attention coefficients, $T$ for transposition, and $\|$ for concatenation:

$\alpha_{ij} = \dfrac{\exp\big(\mathrm{LeakyReLU}\big(\mathbf{a}^{T} [\, W h_i \,\|\, W h_j \,]\big)\big)}{\sum_{k \in \mathcal{N}(i)} \exp\big(\mathrm{LeakyReLU}\big(\mathbf{a}^{T} [\, W h_i \,\|\, W h_k \,]\big)\big)}$  (6)

Equation (7) shows the mathematical formula for GAT, where $K$ is the number of attention heads and $W^{k}$ is the linear transformation weight matrix:

$h_i^{\prime} = \big\Vert_{k=1}^{K}\ \sigma\Big( \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{k} W^{k} h_j \Big)$  (7)
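The attention computation of Equation (6) can be sketched for a single head as follows; the LeakyReLU slope of 0.2 and the toy triangle graph are illustrative assumptions.

```python
import numpy as np

def gat_attention(H, adj_list, W, a):
    """Single-head attention coefficients alpha_ij (Equation 6) over each node's neighbours."""
    Wh = H @ W                                        # linear transform of all node features
    alphas = []
    for i, neigh in enumerate(adj_list):
        # e_ij = LeakyReLU( a^T [W h_i || W h_j] ) for every neighbour j of i
        e = np.array([np.concatenate([Wh[i], Wh[j]]) @ a for j in neigh])
        e = np.where(e > 0, e, 0.2 * e)               # LeakyReLU with slope 0.2
        alphas.append(np.exp(e) / np.exp(e).sum())    # softmax over the neighbourhood
    return alphas

adj = [[1, 2], [0, 2], [0, 1]]                        # illustrative triangle graph
H = np.random.randn(3, 4)
W = np.random.randn(4, 8)
a = np.random.randn(16)
print(gat_attention(H, adj, W, a)[0])                 # attention weights of node 0, sum to 1
```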
2.5. GraphSAGE
GraphSAGE [
12] offers a general inductive learning framework that is especially useful for large and complex graphs. The GraphSAGE model, proposed by Hamilton et al., learns aggregation functions from the nodes in the training data rather than a distinct embedding for every node. This enables the model to generalize effectively to previously unseen nodes. GraphSAGE is a scalable framework that enables the inclusion of other models by using various aggregation and combination functions.
GraphSAGE builds, for each node, a neighbourhood feature matrix from its own features and those of its neighbours. An aggregation function is used to combine the information in this neighbour feature matrix. The authors analyzed three different aggregation functions and found no significant difference between max and mean pooling; therefore, max pooling was used in their experiments. Additionally, an LSTM aggregator was also analyzed. Equation (8) shows the max-pooling aggregation function, where $\max$ is the element-wise maximum operator and $\sigma$ is the activation function:

$\mathrm{AGGREGATE}^{\mathrm{pool}}_{k} = \max\big( \big\{ \sigma\big( W_{\mathrm{pool}} h_{u_i}^{k} + b \big),\ \forall u_i \in \mathcal{N}(v) \big\} \big)$  (8)
In the combination stage, the information obtained from the aggregation functions is combined with a learnable model parameter. This combination allows for the weighting of the node’s own features and neighbour information.
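For readers who want to experiment with the variants above, the sketch below instantiates one layer of each in PyTorch Geometric, assuming the library is installed; the feature sizes, head count, and toy edge list are illustrative, and the layers' internals follow the library's implementations rather than the exact formulations of the original papers.

```python
import torch
from torch_geometric.nn import GCNConv, GINConv, GATConv, SAGEConv

# Illustrative graph: 3 nodes with 16-dimensional features, edges as a (2, E) index tensor.
x = torch.randn(3, 16)
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]], dtype=torch.long)

gcn = GCNConv(16, 32)                                        # Equation (4)-style propagation
gin = GINConv(torch.nn.Sequential(torch.nn.Linear(16, 32),
                                  torch.nn.ReLU(),
                                  torch.nn.Linear(32, 32)))  # MLP-based update as in Equation (5)
gat = GATConv(16, 32, heads=4, concat=False)                 # attention heads as in Equation (7)
sage = SAGEConv(16, 32, aggr='max')                          # max-pooling aggregator as in Equation (8)

for layer in (gcn, gin, gat, sage):
    print(type(layer).__name__, layer(x, edge_index).shape)  # each returns (3, 32) embeddings
```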
4. Graph Neural Network Acceleration
The intricate and computationally demanding nature of GNNs, especially when operating on large graphs, often leads to time-consuming processes and efficiency constraints. This is further compounded by performance limitations encountered on conventional CPUs and GPUs. In this regard, hardware accelerators are emerging as pivotal components to facilitate the swift and efficient processing of GNNs.
The major challenges in accelerating GNNs revolve around their complex structure and significant data sizes. Both training and inference tasks require significant computational power and memory resources. Conventional hardware may prove inadequate for such computations, thereby prolonging the overall processes. Conversely, hardware accelerators, leveraging their parallel processing capabilities and tailored computing units, expedite the training and inference processes of GNNs. They offer the requisite low latency crucial for real-time applications.
4.1. Hardware-Based Accelerator Approaches
In the existing literature, various accelerator designs proposed for diverse hardware platforms such as GPU-CPU [
98,
99,
100], ASIC [
28,
101,
102,
103,
104,
105,
106,
107], and FPGA [
20,
30,
33,
108], among others, contribute valuable insights to this domain. This section provides a brief overview of other hardware-based accelerators before an in-depth review of FPGA-based accelerators. Additionally,
Figure 8 presents a comparison of different hardware accelerators using HyGCN as a baseline.
HyGCN [
101] is introduced as a solution to tackle the challenges posed by hybrid execution models of GCNs, which involve a combination of irregular aggregation stages and regular combination stages. To effectively handle the dynamic and irregular nature of the aggregation process, as well as exploit the static and regular properties of the combination process in GCNs, the proposed accelerator adopts a hybrid architecture. In order to ensure parallelism in both the aggregation and combination phases, the authors devise a programming model. Additionally, they propose a hardware design that incorporates dedicated Aggregation and Combination Engines. HyGCN achieves significant speedup and energy reduction on both CPU and GPU platforms, highlighting its effectiveness in enhancing the execution of GCNs.
EnGN [
28] is proposed for the efficient processing of large-scale GNNs in real-world applications as a dimension-aware accelerator architecture that optimizes GNN computations based on input and output feature sizes and eliminates the need for external memory accesses. The authors present a dimension-aware stage reordering (DASR) strategy for partitioning large graphs into intervals and shards. The work achieves on-chip memory efficiency using graph tiling and scheduling techniques and minimizes data dependencies between tiles. The accelerator uses a ring-edge-reduction (RER) dataflow to address irregular memory access patterns in GNN propagation, improving computational efficiency.
GRIP [
107] is proposed with the aim of improving the efficiency of GNNs by addressing the problems of delay and energy consumption. Kiningham et al. show improvements in latency reduction and energy efficiency for GNN inference tasks by exploiting special architectural features. The study evaluates the performance of GRIP on various GNN models and datasets, analyses the impact of architecture and model parameter tuning, and assesses the effectiveness of GNN optimizations integrated into GRIP.
Rubik [
104] addresses the challenge of learning from graphs by optimizing software and hardware accelerations in GCN learning to improve energy efficiency and performance. In this work, the authors present a hierarchical computing paradigm that separates graph-level and node-level computation for specific optimizations. The paper includes a lightweight graph reordering technique to improve graph-level data reuse and a custom GCN accelerator architecture with a hierarchical spatial design for efficient data locality utilization. Furthermore, the authors propose a hierarchical mapping methodology that optimizes both graph-level and node-level computations to improve data reuse and task-level parallelism.
GCNAX [
105] is proposed as an accelerator designed to address irregularity in the aggregation phase and exploit regularity in the combination phase of GCNs. The authors present a systematic design space exploration to optimize performance and minimize off-chip data accesses, and use a cycle-accurate simulator for evaluation and ASIC synthesis for area and power estimation. The baseline hardware used for comparison includes two GCN accelerators and one SpMM accelerator. The benchmark is used to evaluate the efficiency of flexible data streaming.
H-GCN [
109] proposes a hybrid accelerator design that utilizes Xilinx Versal ACAPs to enhance GNN inference performance. The paper explores graph heterogeneity and utilizes a hybrid PL and AIE architecture to fully exploit the computational capabilities of ACAPs for GNN computations. H-GCN involves graph partitioning, the utilization of PL and AIE components, exploring sparsity support in the AIE, and creating an effective approach for mapping sparse matrix-matrix products to the systolic tensor array. Using the Xilinx Versal VCK5000 platform, the paper demonstrates the advantages of H-GCN over CPU and GPU platforms on various graph datasets by comparing speedups, energy efficiency, and inference latency with existing GCN accelerators.
4.2. FPGA-Based Accelerators Approaches
FPGA-based accelerators are advantageous over alternative hardware accelerators due to their inherent flexibility in handling variable states. They are equipped with configurable logic and memory blocks, making them adaptable to specific computing requirements. Additionally, they offer parallel processing capabilities and low power consumption, distinguishing them from fixed-architecture accelerators like GPUs. The flexibility of FPGAs is particularly valuable in scenarios that involve variable graph structures and dynamic data flows. Furthermore, FPGAs often outperform other accelerators in terms of energy efficiency and performance, offering low power consumption at high data processing speeds. However, programming and optimizing FPGAs can be complex and time-consuming, which may limit their widespread adoption. Accessing relevant information about FPGAs can be challenging due to their comparatively lower usage prevalence compared to GPUs.
The literature presents software and hardware architectures that address different scales of FPGA applications, ranging from small to large deployments. Some studies focus exclusively on FPGA research, while others provide insight into heterogeneous system architectures that encompass larger-scale applications.
AWB-GCN, proposed by Geng et al. [
20], endeavors to tackle the challenge of workload imbalance encountered in real-world graph inference scenarios. By dynamically redistributing tasks among processing elements, AWB-GCN aims to optimize hardware resource utilization and enhance performance. The authors employ hardware-based automatic tuning techniques to adaptively rebalance workloads in real time, thus improving the efficiency of GCN inference across diverse graph structures. In their evaluation on the Intel D5005 platform, Geng et al. compare processing element (PE) utilization, performance, energy efficiency, and hardware resource consumption across different datasets. Overall, the findings shed light on the significance of dynamic workload distribution in improving the efficiency and effectiveness of real-world graph inference tasks, particularly within the context of GCN inference optimization.
ACE-GCN [
110] is proposed by Romero et al. for the efficient processing of graph-structured data, exploiting the sparsity and power-law distribution of real-world graph datasets for efficient graph convolutional embedding. The authors effectively utilize first-order subgraph similarity, feature exchangeability, and structural redundancy to improve graph convolutional embedding, addressing performance and resource scalability issues in graph neural network accelerators. The key idea is to trade computational complexity for storage capacity while maintaining high accuracy, resulting in speedup gains compared to baseline methods. ACE-GCN optimizes on-chip memory utilization and computational efficiency by customizing configurations for different datasets and using an auxiliary similarity estimation circuit. The results demonstrate that the proposed method provides speedups over baseline models across datasets and, in some cases, outperforms other FPGA accelerators.
I-GCN [
111] aims to improve data locality and reduce redundant computation in GCNs, particularly for efficient inference on large-scale graphs. According to Geng et al., current hardware accelerators for GCNs face challenges in dealing with the irregular non-zero distribution, high sparsity, and significant size of real-world graphs, leading to inefficiencies in data processing and computation. This paper presents the islanding algorithm, which identifies clusters of nodes with strong internal connections. This enables graph data to be processed at the level of centers and islands rather than individual nodes, resulting in improved data reuse and reduced redundant computation. The islanding process of the research identifies clusters of nodes with strong internal connections, which leads to improved data locality and a reduction in the movement of data off-chip.
Lin et al. [
112] aim to efficiently accelerate GCN inference on FPGA platforms for cloud-based applications, such as e-commerce and recommender systems. They address the challenges of existing solutions, such as high latency inference and energy inefficiency. The paper proposes a new approach to GCN inference acceleration on FPGA. The paper suggests using a partition-centric mapping strategy and HLS-based core design to reduce memory access overhead, leverage data reuse, and achieve significant data parallelism. The authors utilize kernel designs in Vitis-HLS and OpenCL on the Xilinx Alveo U200 platform. They evaluate the performance of the designs using large-scale datasets such as Reddit, Yelp, and Amazon-2M. To do this, they use a two-layer Vanilla-GCN model with detailed specifications about GCN layer sizes and operations.
Figure 9. Comparison of latency (ms) and energy efficiency (graph/kJ) of 3 different methods using the same hardware (Intel Stratix 10 SX, 330 MHz) on the same datasets.
Zhang et al. [
113] propose an accelerator approach to address the problem of limited on-chip memory when dealing with massive node-attributed graphs with static topologies and to improve GCN inference efficiency on FPGA. Their algorithm-architecture co-optimization approach uses data partitioning, decomposition, and node reordering, which distinguishes it from traditional methods. The article describes a hardware architecture pipeline that supports aggregation and transformation kernels, along with a flexible bus and scheduling strategy that is tailored for various GCN models. Additionally, it includes a mathematical analysis of data communication costs to optimize memory traffic. The article also proposes a two-stage preprocessing algorithm to increase data reuse and reduce external memory access.
SPA-GCN [
114] integrates deep pipelining and customized levels of parallelization to efficiently process small graphs. SPA-GCN uses a very deep pipeline with nested parallelization, customized compute units, and efficient utilization of FPGA resources. The paper optimizes data flow through FIFOs and dynamic process scheduling. The study compares SPA-GCN's performance on Xilinx FPGAs with that of Intel Xeon CPUs and NVIDIA GPUs, highlighting the efficiency of the proposed architecture in accelerating GCN computations on small graphs.
BlockGNN addresses the problem of increasing computational complexity in GNNs by proposing a software-hardware co-design strategy that uses block-circulant weight matrices for efficient GNN acceleration, reducing computational complexity with minimal loss of accuracy. The authors suggest compressing GNN models by using block-circulant weight matrices at the algorithm level and implementing a pipelined CirCore architecture at the hardware level. They also propose using the Fast Fourier Transform (FFT) during inference to enhance efficiency. Additionally, they introduce a performance and resource model to automatically optimize hardware parameters for various GNN tasks and improve overall acceleration.
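The benefit of circulant structure can be illustrated independently of BlockGNN's hardware: a circulant block is fully described by its first column, and multiplying it by a vector reduces to element-wise products in the Fourier domain. The Python sketch below demonstrates this equivalence on a small block; it is a generic illustration of the technique, not BlockGNN's CirCore kernel.

```python
import numpy as np

def circulant_matvec_fft(c, x):
    """Multiply the circulant matrix defined by first column c with vector x via the FFT."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

b = 8                                                  # illustrative block size
c = np.random.randn(b)                                 # first column fully defines the block
x = np.random.randn(b)
C = np.array([np.roll(c, k) for k in range(b)]).T      # explicit b x b circulant block
print(np.allclose(C @ x, circulant_matvec_fft(c, x)))  # True: O(b log b) instead of O(b^2)
```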
Gui et al. [
115] aim to enhance the efficiency of sampling algorithms in GNNs by utilizing hardware acceleration to reduce sampling time while maintaining high test accuracy on large datasets. The paper presents the CONCAT Sampler, an algorithm that merges sampled graphs to simplify hardware acceleration and ensure accuracy in GNN models. The authors implement the CONCAT Sampler on FPGA hardware, using parallel sampling modules to sample independently from partitioned datasets and combining the results for faster sampling.
Li et al. [
116] present an FPGA-based hardware architecture designed to address the challenges of implementing Large Scale Distributed Graph Neural Networks (LSD-GNNs) in hyperscale environments, with a focus on memory access acceleration and scalability. This work highlights the obstacles faced by LSD-GNNs in hyperscale environments and discusses custom heterogeneous hardware solutions to improve memory access and sampling performance. The work involves the implementation of a domain-specific hardware architecture with an access engine (AxE) to optimize memory access and a RISC-V control interface that implements a memory-on-fabric (MoF) system for near-data processing, with the goals of scalability and programmability.
Graph-OPU [
117] addresses the need for efficient acceleration methods for GNNs due to their widespread applications and provides a solution to the challenge of time-consuming FPGA reconfiguration when switching between GNN models. The authors present this work as a novel FPGA-based overlay processor adapted to mainstream GNNs, with software programmability that enables fast model switching. Graph-OPU includes instruction set customization, design of a fully pipelined microarchitecture, implementation of unified matrix multiplication, and testing with various datasets on the Xilinx Alveo U50.
4.3. FPGA-Based Heterogeneous Approaches
Heterogeneous computing architectures combine different types of processors to create powerful and flexible systems. CPU-FPGA heterogeneous approaches, in particular, offer solutions for applications requiring high levels of parallelism and customizability. By combining the general-purpose processing capability of the CPU with the specialized hardware advantages of the FPGA, these approaches enable the faster and more energy-efficient performance of complex computational tasks [
27]. Although these structures are not applicable to embedded systems, they are preferred for large-scale applications. This section reviews some proposed heterogeneous FPGA approaches for GNNs.
Zhang et al. [
27] recommend a method for training GNNs on large-scale graphs using a CPU-FPGA heterogeneous platform. The method uses neighbor sampling to address scalability and overfitting challenges. The authors discuss the computational challenges associated with neighbour sampling and feature aggregation, which impact the overall execution time of NS GNN training, thereby impeding scalability and efficiency. The work involves implementing a parallel neighbor sampling algorithm on the main processor and an FPGA accelerator optimized for GNN operations. It also proposes to leverage optimizations such as neighbor sharing and task pipelining to improve memory performance and computational efficiency, ultimately increasing training throughput and accelerating NS GNN training.
GraphACT [
30] proposes a heterogeneous approach to address memory access and load balancing challenges in accelerating the training of GCNs on CPU-FPGA platforms due to significant data communication issues. GraphACT optimizes the FPGA design through a graph-theoretic pre-processing step to balance the load across on-chip compute modules and various FPGA devices. Additionally, it features a systolic array-based design that improves weight update efficiency. The authors evaluated this work using a 40-core Xeon server with a Xilinx Alveo U200 board and demonstrated the strengths of the proposed method with parameters such as convergence time and test set accuracy on datasets such as PPI, Reddit, and Yelp.
Zhang et al. [
118] aim to improve the efficiency and speed of GNNs in real-time applications. They address the challenge of achieving low-latency mini-batch inference on CPU-FPGA platforms. The hardware accelerator design presented in this paper uses the Adaptive Computing Kernel (ACK) architecture to run different GNN computing cores with low latency. It provides a unified solution for different GNN models on FPGA platforms without the need for runtime reconfiguration. The methodology involves identifying GNN compute kernels, designing the flexible ACK architecture, and using a design space exploration algorithm to create a single hardware design point for different GNN models and optimize it for low-latency inference without reconfiguration.
HP-GNN [
119] addresses inefficiencies in full-graph GNN training on large graphs. It aims to reduce the high memory footprint and increase the frequency of model updates per epoch on CPU-FPGA heterogeneous platforms. In contrast to previous work that focused on specific GNN models or algorithms, HP-GNN provides a general framework for GNN training on CPU-FPGA platforms. Hardware templates are optimized for efficient accelerator generation. Lin et al. suggest a design space exploration engine to improve throughput and generate accelerators automatically. They also provide software APIs that offer a high-level abstraction for sampling-based mini-batch GNN training and optimized hardware templates, simplifying development without requiring hardware expertise.
4.4. Frameworks for FPGA-Based Accelerators
In the field of high-performance computing, FPGA accelerator frameworks facilitate the integration between hardware and software, enabling algorithms to run efficiently on hardware. Frameworks allow researchers to quickly and efficiently prototype customized hardware designs. They can be used to develop power-efficient systems that can perform complex data processing tasks in real-time. Customized FPGA frameworks provide performance advantages to designers, particularly in areas such as big data analysis, artificial intelligence applications, and deep learning models. This section reviews frameworks for FPGA-based accelerators designed for GNNs.
BoostGCN [
108] aims to optimize the inference of GCNs on FPGA platforms by proposing a Hardware-aware Partition-Centric Feature Aggregation (PCFA) scheme to overcome inefficiencies and adaptability challenges in memory accesses encountered by GCNs. The authors state that the issues arise from different graph sizes, sparsity levels, and complexities of GCN models. Zhang et al. propose a framework that can adapt to these variations to enhance performance and efficiency. The PCFA scheme in BoostGCN performs three-dimensional graph partitioning considering data reuse on the chip and external memory architecture. Additionally, a central load balancing scheme is employed to effectively address workload imbalance. This study enables significant data parallelism through optimized RTL templates and a task scheduling strategy aimed at minimizing pipeline stalls.
FlowGNN [
120] is proposed to support generic GNN models for real-time inference applications. By introducing explicit message passing and multi-level parallelism, the authors provide a comprehensive solution for GNN acceleration without sacrificing adaptability. FlowGNN includes compiling each GNN model into an FPGA kernel, allowing easy updates for new architectures and the creation of custom accelerators through modular components. Using the Xilinx Alveo U50 FPGA, FlowGNN outperforms existing baselines such as CPU, GPU, and I-GCN, achieving speedups and energy-efficiency gains on a variety of datasets and GNN models.
DeepBurning-GL addresses the growing need for efficient and specialized accelerators for GNNs due to their complexity and computational demands in various applications. Liang et al. [
121] present an automated framework for designing custom GNN accelerators that can meet performance requirements while satisfying resource constraints and user-specific design goals. By focusing on automating the customization of GNN accelerators, this work stands out for streamlining the design process by providing end-to-end solutions tailored to specific applications without manual intervention. DeepBurning-GL includes a systematic approach that starts with a performance analysis of GNN models to identify bottlenecks, selects templates based on model requirements, combines templates into a unified accelerator design, and fine-tunes design parameters using a simulated annealing algorithm combined with model-based design space pruning for optimization.
DGNN-Booster is proposed by Chen et al. [
122] as an FPGA accelerator framework designed for real-time Dynamic Graph Neural Networks (DGNN) inference with HLS, offering high-speed performance and low energy consumption. DGNNs pose challenges in hardware deployment due to low parallelism and a lack of general accelerator frameworks for dynamic graphs. DGNN-Booster employs two FPGA accelerator designs, V1 and V2, each optimized for different levels of parallelism and computational intensity. V1 focuses on parallelizing the GNN and RNN in adjacent time steps, while V2 emphasizes parallelization in a single time step. The designs incorporate graph renumbering, format conversion, multi-level parallelism, and task scheduling to improve hardware efficiency and performance.
GNNBuilder [
123] is a framework that enables the automatic generation of GNN accelerators for different models, providing design flexibility and optimization strategies. The authors propose using serialized trained direct fit models for efficient design space exploration, facilitating fast performance evaluation compared to traditional HLS synthesis. The framework demonstrates predictive capabilities for runtime and BRAM models based on a database of 400 synthesized designs, using a random forest regressor with 10 predictors and 5-fold cross-validation. The evaluation is performed on FPGA-based parallel implementations and shows speedups over PyG CPU, PyG GPU, and C++ CPU runtimes for various GNN models.
FGNAS [
124] is proposed as a hardware/software co-exploration framework that uses FPGA platforms to optimize hardware accelerators for GNNs, aiming at efficient GNN deployment. FGNAS is differentiated by its holistic approach in integrating hardware and software considerations to improve accuracy and speedup in GNN architectures on FPGAs. The methodology uses reinforcement learning to sequentially sample hardware and software parameters, analyze FPGA model performance, and update the controller using policy gradients for optimized design. By partitioning the search space into architectural and hardware parameters such as embedding size, attention type, aggregation type, and group sizes, FGNAS efficiently explores the design space to determine the optimal GNN architecture and FPGA design.
4.5. FPGA-Based Accelerator Approaches with Quantization
By reducing the dimension of data representations and using hardware resources more efficiently, quantization allows FPGA accelerators to reduce energy consumption and increase processing speed. Quantization is especially important for embedded systems that have limited computational resources due to energy constraints. FPGA accelerators designed using quantization increase the usability of GNNs in real-time applications and scenarios requiring energy efficiency. This section explores in detail the integration of quantization techniques with FPGA-based accelerators.
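In contrast to the scale/zero-point scheme sketched in the Introduction, many of the accelerators reviewed below use signed fixed-point formats, where a fixed number of fractional bits plays the role of the scale. The following sketch, with an assumed 16-bit word and 8 fractional bits, shows the basic conversion; the parameters are illustrative and do not correspond to any specific accelerator.

```python
import numpy as np

def to_fixed_point(x, total_bits=16, frac_bits=8):
    """Quantize floats to signed fixed-point integers with `frac_bits` fractional bits."""
    scale = 2 ** frac_bits
    qmin, qmax = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    return np.clip(np.round(x * scale), qmin, qmax).astype(np.int16)

def from_fixed_point(q, frac_bits=8):
    """Recover approximate float values from the fixed-point integers."""
    return q.astype(np.float32) / (2 ** frac_bits)

features = np.random.randn(4, 4).astype(np.float32)   # illustrative feature tile
q = to_fixed_point(features)                           # integers an FPGA datapath handles natively
print(q.dtype, np.abs(features - from_fixed_point(q)).max())  # error <= 2**-(frac_bits+1)
```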
FPGAN [
125] proposes an FPGA-based accelerator to improve the performance and energy efficiency of Graph Attention Networks (GATs) while maintaining accuracy. The study demonstrates the effectiveness of the software-hardware co-optimized approach in accelerating inference and highlights the potential of FPGA accelerators in this area. The authors integrate model optimization and software-hardware co-design to create a dedicated accelerator for GATs. FPGAN involves process fusion, quantization, data reconfiguration, model tuning, and architecture optimization. Process fusion simplifies the self-attention mechanism, while data reconfiguration improves data storage and access efficiency. Model tuning optimizes activation functions, and the architecture design focuses on software and hardware co-design for efficient inference operations. The study demonstrates improvements in performance and energy efficiency without sacrificing accuracy.
SkeletonGCN [
126] addresses the growing demand for efficient training of GCNs on FPGA platforms, due to their computational and memory requirements. The authors present the work as a solution that optimizes data representation, simplifies operations, and uses a unified hardware architecture to achieve significant speedup without trade-offs in accuracy. The methodology involves quantizing data to SINT16 to reduce computation and storage requirements, simplifying non-linear operations, using a compression algorithm for sparse matrices, and designing a unified hardware architecture to improve DSP efficiency for various matrix operations in GCN training.
Table 2. Quantization studies on FPGA-based accelerators.
| Publication | Hardware | Resource Consumption | Quantized Parameters | Baselines |
|---|---|---|---|---|
| FP-GNN [24] | VCU128, Freq: 225 MHz | LUT: 717,578; FF: 517,428; BRAM: 1,792; DSP: 8,192 | Features, Weights: 32-bit fixed point | PyG-CPU-GPU, HyGCN, GCNAX, AWB-GCN, I-GCN |
| FTW-GAT [127] | VCU128, Freq: 225 MHz | LUT: 436,657; FF: 470,222; BRAM: 1,502; DSP: 1,216 | Features: 8-bit integer; Weights: 3-bit integer | PyG-CPU-GPU, FP-GNN |
| Wang et al. [128] | Alveo U200, Freq: 250 MHz | LUT: 1,010,00; FF: 117,00; BRAM: 1,430; DSP: 392 | Features, Weights: 1-bit integer; Adjacency matrix: 32-bit integer | PyG-CPU-GPU, ASAP [113] |
| LL-GNN [32] | Alveo U250, Freq: 200 MHz | LUT: 815,000; FF: 139,000; BRAM: 37; DSP: 8,986 | Model parameters: 12-bit fixed point | PyG-CPU-GPU |
| Ran et al. [129] | Alveo U200, Freq: 250 MHz | LUT: 427,438; FF: NI; BRAM: 1,702; DSP: 33.7 | Features, Weights | PyG-CPU-GPU, HyGCN, ASAP [113], AWB-GCN, LW-GCN |
| SkeletonGCN [126] | Alveo U200, Freq: 250 MHz | LUT: 1,021,386; FF: NI; BRAM: 1,338; DSP: 960 | Feature and adjacency matrices, trainable parameters: 16-bit signed integer | PyG-CPU-GPU, GraphACT |
| QEGCN [130] | VCU128, Freq: 225 MHz | LUT: 21,935; FF: 9,201; BRAM: 22; DSP: 0 | Features, Weights: 8-bit fixed point | PyG-CPU-GPU, DGL-CPU-GPU, HyGCN, EnGN, AWB-GCN, ACE-GCN |
| Yuan et al. [131] | VCU128, Freq: 300 MHz | LUT: 3,244; FF: 345; BRAM: 102.5; DSP: 64 | Features, Weights: 32-bit fixed point | PyG-CPU-GPU |
| FPGAN [125] | Arria10 GX1150, Freq: 216 MHz | LUT: 250,570; FF: 338,490; BRAM: NI; DSP: 148 | Features, Weights: fixed point | PyG-CPU-GPU |
| LW-GCN [132] | Kintex-7 K325T, Freq: 200 MHz | LUT: 161,529; FF: 94,369; BRAM: 291.5; DSP: 512 | Features, Weights: 16-bit signed fixed point | PyG-CPU-GPU, AWB-GCN |
QEGCN [
130] is another example of an efficient hardware accelerator designed to improve GCN performance. The research proposes an FPGA-based accelerator that uses edge-level parallelism. The authors aim to optimize the execution of quantized GCNs and distinguish themselves from existing graph-level parallelism approaches. The paper emphasizes edge-level parallelism, stating that it enables more efficient processing of graph data compared to traditional methods. QEGCN investigates the impact of data quantization on GCN accuracy, evaluates energy efficiency at different quantization levels, and analyses performance on various benchmark platforms.
FTW-GAT [
127] accelerator tackles the difficulties presented by intricate data dependencies and irregular structures of graph attention networks (GATs) by quantizing GAT weights to ternary values. This simplifies processing elements, eliminates the need for digital signal processors (DSPs), and reduces power consumption. This paper presents a methodology that combines ternary-weight quantization with additional techniques such as process fusion, multi-level pipelining, and graph partitioning to increase parallelism in the GAT inference acceleration process. The aim is to improve efficiency and performance. The authors indicate that the proposed model achieves accuracy values similar to full-precision models with quantization, and outperforms baselines in terms of latency and energy efficiency.
LL-GNN [
32] aims to minimize latency in processing GNNs on FPGA for real-time applications in high-energy physics, especially in collider triggering systems where ultra-low latency is crucial for timely event selection. Que et al. propose a design combining quantization and FPGAs that offers low latency when processing small graphs and can be used in scenarios requiring sub-microsecond latency and high throughput, such as particle identification in fundamental physics experiments. LL-GNN involves a co-design approach that optimizes both the algorithm and hardware for GNNs on FPGAs, including defining delay thresholds, rebalancing Multilayer Perceptron (MLP) sizes, exploring parallelism parameters, and implementing sublayer fusion to improve performance and reduce latency. The authors show that they have achieved low latency as a result of this work and that it is possible to embed GNNs in FPGAs with sub-microsecond latency.
HuGraph [
33] aims to address the increasing demand for efficient and scalable GCN training on large and irregular graphs using heterogeneous FPGA clusters. The paper employs a load-balanced mapping strategy and a scheduling method to optimize GCN training. HuGraph focuses on adapting various large graphs to FPGAs while achieving significant speedups with minimal loss of accuracy compared to traditional platforms. The authors state that they use 8-bit integer quantized adjacency coefficients, features, weights, and error values for efficient computation. HuGraph allocates hardware resources, configures samplers, and simulates training performance for each FPGA and dataset configuration to achieve workload balance across heterogeneous FPGAs.
Figure 10. Latency (ms) and energy efficiency (graph/kJ) comparison of 4 different methods on the same datasets. Compared to the 3 approaches with quantization, AWB-GCN represents an approach without quantization.
Ran et al. [
129] developed a software-hardware co-design to achieve low-latency GCN inference on FPGA platforms. The paper presents an integrated approach that combines algorithm-level attention mechanism-based graph parsing with a two-phase hardware architecture and achieves speedups compared to existing accelerators. This work includes the use of attention mechanisms for graph parsing, designing a pipelined two-stage accelerator for efficient aggregation and merging phases, exploiting edge-level and feature-level parallelism, and implementing a graph partitioning strategy to improve data reuse efficiency. The authors emphasize the use of fixed-point representation in the study, as floating-point representation is more resource-intensive.
FP-GNN [
24] is suggested as a unified processing module that can perform both the aggregation and combination phases simultaneously. This allows for flexible execution orders and supports various GNN models. The authors provide a customizable hardware architecture with components such as a Workflow Controller and Processing Modules, enabling high-performance computing and efficient resource utilization. The Adaptive GNN Accelerator (AGA) framework and Adaptive Graph Partitioning (AGP) strategy in FP-GNN optimize parallelism and memory management, enabling improved performance efficiency in GNN inference tasks. In this work, weights and input features are quantized to 32-bit fixed point to take advantage of integer arithmetic while avoiding accuracy loss.
Yuan et al. [
131] use a 32-bit fixed-point representation for efficient computation without loss of accuracy, as in FP-GNN. The paper aims to enhance the performance and energy efficiency of GNNs by developing a dedicated accelerator for the Gathering phase on FPGA platforms. This addresses the inefficiencies of CPUs and GPUs in handling dynamic and irregular data access patterns of GNNs. This research presents an architecture that optimizes the execution order of the Apply and Gather phases, reducing operations and improving performance, while leveraging FPGA technology to design an efficient accelerator for the Gather phase of GNNs.
Wang et al. [
128] introduce a customizable FPGA-based accelerator for BiGCNs. The hardware optimizations include fine-grained pipelining, sparse matrix multiplication, and partial unrolling, which significantly improve performance. The work also includes overlapping data transfer with computation, using COO format storage for the adjacency matrix, and sparse matrix multiplication to improve hardware efficiency. The focus is on deep customization to support various GCNs. The results indicate that the proposed FPGA-based accelerator outperforms a previous FPGA-based GCN accelerator by achieving four times faster throughput while consuming fewer DSP resources.
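COO (coordinate) storage keeps only the row index, column index, and value of each non-zero entry, which is what makes it attractive for sparse adjacency matrices. The short SciPy sketch below illustrates the format on a toy graph; it says nothing about the accelerator's actual on-chip memory layout.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Dense adjacency of an illustrative 4-node graph (edges 0-1, 1-2, 2-3).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])

A_coo = coo_matrix(A)
# Only the non-zero coordinates and values are stored and streamed to the accelerator.
print(A_coo.row)    # [0 1 1 2 2 3]
print(A_coo.col)    # [1 0 2 1 3 2]
print(A_coo.data)   # [1 1 1 1 1 1]
```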
Figure 11. Resource utilization of FPGA-based hardware accelerators combined with quantization. Here, FP denotes FP-GNN, FTW denotes FTW-GAT, LL denotes LL-GNN, and LW denotes LW-GCN.
4.6. FPGA-Based Accelerators for Embedded Applications
Efficient system designs for embedded devices are critical for applications that require efficient data processing and fast response times due to their limited computational resources. In this scope, FPGA-based accelerators have the potential to improve the performance of embedded systems, optimizing energy consumption and handling more complex computational tasks. In particular, the implementation of advanced machine learning models such as GNNs in embedded systems is made possible by the flexibility and reconfigurability of FPGA accelerators. This section provides a detailed review of the work on FPGA-based accelerators designed for embedded systems.
Table 3. FPGA-based GNN accelerator works for embedded devices.
| Study | Target Device | Datasets | Fixed-Point Representation |
|---|---|---|---|
| LW-GCN [132] | Xilinx Kintex-7 | Cora, Citeseer, Pubmed | ✔ |
| Zhou et al. [133] | Xilinx ZCU104, Alveo U200 | Wikipedia, Reddit, GDELT | - |
| gFADES [72] | Zynq UltraScale+ XCZU28DR | Cora, Citeseer, Pubmed | - |
| Hansson et al. [134] | Xilinx Zynq UltraScale+ | Cora, Citeseer, Pubmed | ✔ |
LW-GCN is proposed by Tao et al. [
132] to improve the efficiency of Graph Convolutional Networks (GCNs) on edge devices with limited resources by addressing irregular computation and memory access challenges, achieving improvements in storage utilization and computation time. This is accomplished through a software-hardware co-design approach and a PCOO compression format that efficiently pre-processes sparse data. The authors present a versatile approach that includes post-training quantization with 16-bit signed fixed-point representation for features and weights, and 4-bit signed fixed-point quantization for non-zero elements in sparse matrices. Additionally, they use an outer-product tiling technique to balance the workload and reduce the data volume during computation. LW-GCN achieves significant reductions in latency and improvements in power efficiency, demonstrating its effectiveness in improving GCN inference performance on edge devices.
Zhou et al. [
133] aim to improve the accuracy and speed of inference over dynamic graphs by addressing the challenge of efficiently processing temporal information in graph neural networks. They investigate the optimization of temporal graph neural networks through a model-architecture co-design approach and exploit batch processing, pipelining, and prefetching techniques for improved performance. The paper presents a co-design methodology for temporal graph neural networks that optimizes both the model and the hardware architecture, enabling efficient processing of evolving temporal information on FPGA platforms. It also proposes a hardware mechanism to enable chronological vertex updates without compromising computational parallelism. The methodology reduces computational complexity and memory access while maintaining an accuracy loss of less than 0.33%. The study compares the performance of embedded-scale and large-scale hardware platforms using parameters such as latency, throughput, and batch size.
gFADES [
72] addresses the complex data access and processing requirements of GNNs, which combine both dense and sparse data representations and make hardware acceleration crucial for efficient computation. The focus of the paper is on hardware acceleration for GNNs, with a particular emphasis on performance optimization for GCNs. The paper stands out due to its approach of using a dataflow architecture of data streams to optimize dense and sparse tensor processing, resulting in high-performance execution of GNNs. The methodology of the study involves developing the gFADES accelerator using a dataflow-of-dataflows (DoD) approach, optimizing the High-Level Synthesis (HLS) description for the extreme sparsity in GNNs, scaling performance with multiple hardware threads and compute units, and integrating the accelerator with PyTorch for edge computing devices. In the study, the Zynq device, known for its resource-constrained structure, was used as the base hardware with the PYNQ overlay. The results show performance improvements of up to 140x with multithreaded hardware configurations compared to optimized software implementations in PyTorch, demonstrating the effectiveness of the gFADES accelerator in improving GNN processing on resource-constrained devices.
Hansson et al. [
134] extend the gFADES architecture and focus on optimizing hardware designs for Graph Convolutional Networks (GCNs) by exploring deep quantization strategies to improve performance and energy efficiency. The authors state that the challenge is to meet the computational requirements of GCNs on resource-constrained embedded hardware while maintaining accuracy. The paper provides runtime hardware-aware training for GCNs by embedding a mixed-precision design in the forward pass of the PyTorch backpropagation training loop, enabling significant speedups without trade-offs in accuracy. The methodology involves integrating a hardware accelerator into the PyTorch training loop to explore various hardware configurations with bit precisions of up to four bits and to evaluate their impact on classification accuracy and performance. The work focuses on optimizing the quantization of features, adjacencies, weights, and intermediate data to increase hardware efficiency and throughput while minimizing logic requirements. The results are evaluated using classification accuracy, execution time, and speedup metrics across the different hardware configurations. The authors state that optimized hardware design with deep quantization can provide significant speedup while maintaining classification accuracy.
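Hardware-aware, quantized training of this kind is commonly realized with "fake quantization" in the forward pass: values are rounded to the target integer grid while gradients flow through unchanged via the straight-through estimator. The PyTorch sketch below illustrates that general mechanism under an assumed 4-bit symmetric scheme; it is not the gFADES or Hansson et al. implementation.

```python
import torch

def fake_quantize(x, num_bits=4):
    """Round x to a signed `num_bits` grid in the forward pass, keep gradients unchanged."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.detach().abs().max() / qmax + 1e-8      # symmetric per-tensor scale
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
    return x + (q - x).detach()                       # straight-through estimator

# Illustrative use inside a forward pass: quantize weights before the graph convolution.
weight = torch.randn(16, 8, requires_grad=True)
features = torch.randn(32, 16)
out = features @ fake_quantize(weight, num_bits=4)
out.sum().backward()
print(weight.grad.shape)                              # gradients flow despite the rounding
```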