1. Introduction
Since their inception, deep neural networks (DNNs) have demonstrated remarkable performance and have been recognized as state-of-the-art methods, achieving near-human performance in applications such as image processing [1], pattern recognition [2], bioinformatics [3], and natural language processing [4], among others. The efficacy of DNNs becomes particularly evident in scenarios involving vast amounts of data with subtle features that are not easily discernible by humans, positioning DNNs as invaluable tools for evolving data processing needs. It is widely observed that the number of layers significantly influences the performance of a network [5]: DNNs with more layers often provide enhanced feature extraction capabilities. However, deeper networks typically demand a larger number of parameters and, consequently, more extensive computational resources and memory capacity for effective training and inference.
DNN training and inference demand significant computing power, leading to the consumption of considerable energy resources. A study presented in [6] estimated that the dynamic power consumption of training a 176 billion parameter language model results in tens of metric tonnes of CO2 emissions. This environmental impact underscores the urgency of optimizing DNN implementations, a concern that has gained widespread attention from the research community [7,8]. Addressing these challenges is crucial to balancing the power and environmental costs of deploying DNNs across various applications.
In their fundamental structure, DNNs rely on basic mathematical operations such as addition and multiplication, culminating in the multiply-accumulate (MAC) operation. MAC operations constitute the overwhelming majority of the total computations in convolutional neural networks (CNNs) [9]. The configuration and arrangement of MAC units depend on the size and structure of the DNN. For instance, the ResNet model with 152 layers, a pioneering DNN that surpassed human-level accuracy in the ImageNet challenge, requires GMAC-scale computation (billions of MAC operations) and 60 million weights [10,11]. Typically, processing a single input sample in a DNN demands approximately one billion MAC operations [12]. This highlights the potential for substantial reductions in computational demand by enhancing the efficiency of MAC operations. In this context, researchers have suggested pruning the computations by replacing floating-point numbers with fixed-point representations, with minimal or no loss in accuracy [13,14]. Another plausible approach involves minimizing the bit precision used during MAC operations [15], an avenue extensively explored within the area of approximate computing [16]. Research indicates that implementing approximate computing techniques in DNNs can result in substantial power savings for both training and inference [17]. However, many of these approximate computing methods compromise accuracy, which may not be acceptable for some critical applications. Consequently, devising methodologies that enhance DNN computation efficiency without compromising output accuracy becomes crucial.
A typical CNN, such as the VGG-16 shown in Figure 1, comprises several convolution and fully-connected layers. In such CNN models, the convolution layers typically serve the purpose of complex feature extraction, while the fully-connected layers perform classification using the features extracted by the convolution layers. During inference, these layers execute MAC operations on the input using trained weights to produce a distinct feature representation as the output [4]. Cascaded in a certain configuration, these layers can approximate a target function. While convolution and fully-connected layers are adept at representing linear functions, they cannot directly cater to applications requiring nonlinear representations. To introduce non-linearity into the DNN model, the outputs of these layers are processed by a nonlinear operator known as an activation function [18]. Since every output value must pass through an activation function, selecting the appropriate one has a significant impact on the performance of DNNs [19].
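To make this computation pattern concrete, the following minimal NumPy sketch (purely illustrative; the function names and tensor sizes are assumptions, not the paper's implementation) shows how a single convolution output value is formed by a long chain of MAC operations and then passed through a nonlinear activation such as ReLU, which is discussed next.

```python
import numpy as np

def conv_output_pixel(patch, kernel):
    """One output value of a convolution layer: a pure sequence of
    multiply-accumulate (MAC) operations over the receptive field."""
    acc = 0.0
    for x, w in zip(patch.ravel(), kernel.ravel()):
        acc += x * w                      # one MAC per kernel weight
    return acc

def relu(x):
    """ReLU activation: negative pre-activations are replaced with zero."""
    return np.maximum(x, 0.0)

# Example: a 3x3 kernel over 512 input channels -> 4608 MACs per output value.
rng = np.random.default_rng(0)
patch  = rng.standard_normal((3, 3, 512))
kernel = rng.standard_normal((3, 3, 512))
pre_activation = conv_output_pixel(patch, kernel)
print(relu(pre_activation))               # zero whenever the 4608-MAC result is negative
```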
The Rectified Linear Unit (ReLU) is among the most widely employed activation functions in modern DNNs [20]. The simplicity and piece-wise linearity of ReLU contribute to faster learning and stable values when using gradient-descent methods [18]. The ReLU function, shown in Figure 2, acts as a filter that passes positive inputs unchanged and outputs zero for negative inputs. This implies that precision in the output is crucial only when the input is positive. The input to a ReLU function typically originates from the output of a fully-connected or convolution layer in the DNN and is therefore the result of a substantial number of MAC operations [12]; for instance, a single output of a 3×3 convolution over 512 input channels is the sum of 3 × 3 × 512 = 4608 multiplications. It is indicated in [21] that a significant proportion of ReLU inputs in DNNs are negative. Consequently, a considerable amount of high-precision computation in DNNs is discarded, as these output elements are reduced to zero after the ReLU function. Early detection of these negative values has the potential to curtail the energy expended on high-precision MAC operations, ultimately leading to a more efficient DNN implementation.
To this end, we propose ECHO: Energy-efficient Computing Harnessing Online Arithmetic, an MSDF-based accelerator for DNN inference aimed at computation pruning at the arithmetic level. ECHO leverages an unconventional arithmetic paradigm, namely online, or most-significant-digit-first (MSDF), arithmetic, which operates in a digit-serial manner, consuming inputs and producing outputs from the most-significant to the least-significant side. The MSDF nature of online arithmetic-based computation units, combined with a negative-output detection scheme, enables accurate detection of negative outputs at an early stage, resulting in promising computation and energy savings. The experimental results demonstrate that ECHO offers promising power and throughput improvements compared to contemporary methods.
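The following Python snippet is a behavioural sketch of this early-negative-detection idea only; it mimics digit-serial, most-significant-first refinement using floating-point truncation and is not ECHO's online (signed-digit) hardware arithmetic. All names and parameters are illustrative assumptions.

```python
import numpy as np

def relu_dot_early_exit(x, w, frac_bits=16):
    """MSDF-style emulation of a sum-of-products feeding a ReLU.

    The products are revealed one fractional bit at a time, from the most
    significant side. After j bits, each term is known to within 2**-j, so the
    final sum lies within n * 2**-j of the running partial sum. As soon as that
    interval is entirely non-positive, the ReLU output is certainly zero and the
    remaining (least-significant) work can be skipped.
    """
    p = np.asarray(x, dtype=float) * np.asarray(w, dtype=float)
    n = len(p)
    for j in range(1, frac_bits + 1):
        scale = 2.0 ** j
        partial = np.sum(np.trunc(p * scale) / scale)  # each term truncated to j bits
        # (hardware refines the partial sum incrementally, one digit per cycle)
        if partial + n * 2.0 ** (-j) <= 0.0:           # upper bound already <= 0
            return 0.0, j                               # early exit: ReLU output is zero
    total = float(np.sum(p))
    return max(total, 0.0), frac_bits                   # full-precision fallback

rng = np.random.default_rng(1)
x = rng.uniform(-0.5, 0.5, 64)
w = rng.uniform(-0.5, 0.5, 64)
value, digits_used = relu_dot_early_exit(x, w)
print(value, digits_used)   # negative sums are typically resolved before all digits are processed
```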
The rest of the paper is organized as follows. Section 2 presents a comprehensive review of the relevant literature. Section 3 presents an overview of online arithmetic-based computation units and the details of the proposed design. The evaluation and results of the proposed methodology are presented in Section 4, followed by the conclusion in Section 5.
2. Related Works
Over the past decade, researchers have extensively addressed challenges in DNN acceleration and proposed solutions such as DaDianNao [22], Stripes [23], Bit-Pragmatic [24], Cnvlutin [25], and Tetris [26]. Bit-Pragmatic [24] is designed to leverage the bit-level information content of activation values. The architecture employs bit-serial computation of activations using a sparse encoding that skips computations for zero bits, and by avoiding this unnecessary work it achieves performance and energy improvements. Cnvlutin [25], on the other hand, eliminates unnecessary multiplications when the input is zero, leading to improved performance and energy efficiency. Its architecture incorporates hierarchical data-parallel compute units and dispatchers that dynamically organize input neurons to skip these multiplications, keeping the compute units busy and achieving superior resource utilization.
Problems such as unnecessary computations and the varying precision requirements across different layers of CNNs have been thoroughly discussed in the literature [27,28]. These computations contribute to increased energy consumption and resource demands in accelerator designs. To tackle these challenges, researchers have investigated domain-specific architectures specifically tailored to accelerate the convolution operations in deep neural networks [29,30]. To reduce the number of MAC operations in DNNs, the work in [31] observed that neighboring elements within the output feature map often display similarity; by leveraging this insight, a considerable fraction of MAC operations can be skipped at a small cost in accuracy. In [32], the introduction of processing units equipped with an approximate processing mode resulted in substantial improvements in energy efficiency, although these gains were achieved at the cost of a drop in accuracy. This trade-off highlights the potential benefits of approximate processing modes in achieving energy savings but underscores the need to carefully balance these gains against the desired level of accuracy in DNN computations.
In recent years, there has been a growing trend towards implementing DNN acceleration and evaluation designs using bit-serial arithmetic circuits [23,33,34,35,36]. This shift is motivated by several factors: (1) reducing computational complexity and the required communication bandwidth; (2) accommodating the variable data precision needs of different deep learning networks and of the layers within a network; (3) easily varying compute precision in bit-serial designs by adjusting the number of compute cycles during a DNN model evaluation (a simplified sketch of this is shown below); and (4) enhancing energy and resource utilization through early detection of negative results, thereby terminating ineffective computations.
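As a purely illustrative model of points (2) and (3), the sketch below mimics an LSB-first bit-serial MAC in the style of Stripes/UNPU-like designs, where the activation precision is simply the number of compute cycles spent. It is not the actual hardware of those accelerators; all names are assumptions.

```python
def bit_serial_mac(activations, weights, act_bits=8):
    """Bit-serial multiply-accumulate sketch: unsigned activations are consumed
    one bit per cycle (LSB first) while the weights remain in parallel form.
    Using fewer cycles (smaller act_bits) directly trades precision for latency."""
    acc = 0
    for cycle in range(act_bits):                  # one cycle per activation bit
        bit_slice = [(a >> cycle) & 1 for a in activations]
        partial = sum(b * w for b, w in zip(bit_slice, weights))
        acc += partial << cycle                    # weight the slice by its bit position
    return acc

acts = [23, 7, 200, 14]                            # 8-bit unsigned activations
wgts = [3, -5, 2, 7]                               # parallel (full-precision) weights
print(bit_serial_mac(acts, wgts))                  # 532, equal to sum(a*w) at full precision
print(sum(a * w for a, w in zip(acts, wgts)))      # reference value: 532
```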
One notable contribution in this direction is Stripes [23], recognized as a pioneering work that employs bit-serial multipliers instead of conventional parallel multipliers in its accelerator architecture to address challenges related to power and throughput. In a similar context, UNPU [33] builds upon the Stripes architecture by incorporating look-up tables (LUTs) to store inputs so that they can be reused multiple times during the computation of an input feature map. These advancements mark significant progress towards more efficient and effective CNN acceleration.
These accelerators are designed to enhance the performance of DNN computations through dedicated hardware implementations. However, it is important to note that none of these hardware accelerators have explicitly focused on the potential for computation pruning through early detection of negative outputs for ReLU activation functions. The aspect of efficiently handling and optimizing ReLU computations, especially in terms of early detection and pruning of negative values, remains an area where further exploration and development could potentially lead to improvements in efficiency and resource utilization.
Most modern CNNs employ ReLU as an activation function, which filters out negative results from convolutions and replaces them with zeros. Studies [37,38,39] indicate that modern CNNs produce a large proportion of negative outputs, resulting in significant power wastage on unnecessary computations. Traditional CNN acceleration designs typically perform the ReLU activation separately, after the convolution operations are complete. Existing solutions rely on special digit encoding schemes, such as those in [38,39], or on intricate circuit designs [37,40] to determine whether a result is negative. SnaPEA [37] aims to reduce the computation of a convolutional layer followed by a ReLU activation layer by identifying negative outputs early in the process. However, SnaPEA introduces some complexities: it requires reordering parameters and maintaining indices to correctly match input activations with the reordered parameters, and in predictive mode it requires several profiling and optimization passes to determine speculative parameters, adding an extra layer of complexity to the implementation. MSDF arithmetic operations have also emerged as a valuable approach for early detection of negative activations [40,41]. Shuvo et al. [40] introduced a circuit implementation for convolution in which negative results can be detected early, allowing the corresponding operations to be terminated. Similarly, USPE [42] and PredictiveNet [43] propose statistically splitting values into most-significant and least-significant bits for early negative detection. Other works in this avenue include [44,45,46]. However, existing methods for early detection of negative activations often rely on digit encoding schemes, threshold-based prediction, or complex circuitry, which may introduce errors or increase overhead.
4. Experimental Results
To evaluate and compare the performance of the proposed design, we conduct an experimental evaluation using two baseline designs: (1) Baseline-1, a conventional bit-serial accelerator design based on the UNPU accelerator [33]; and (2) Baseline-2, an online arithmetic-based design without the capability to detect and terminate the ineffective negative computations early. For a fair comparison, both baseline designs use the same accelerator array layout as ECHO. We evaluate ECHO on VGG-16 [52] and ResNet [11] workloads. The layer-wise architecture of the CNN models is presented in Table 1. As shown in the table, each CNN contains several convolution layer blocks, where the convolution layers within a block have the same number of convolution kernels (M) and the same output feature map dimensions. It is also worth noting that while VGG-16 contains a max-pooling layer after every convolution block, ResNet-18 performs down-sampling with a strided (stride of 2) convolution in the first convolution layer of each block, except for the max-pooling layer after C1.
We used pre-trained VGG-16, ResNet-18, and ResNet-50 models obtained from torchvision [53] for our experiments. For the evaluation, we used 1000 images from the validation set of the ImageNet database [10]. The RTL of the proposed and baseline accelerators was designed and functionally verified using Xilinx Vivado 2018.2, and the implementation and evaluation were carried out on a Xilinx Virtex UltraScale+ VU3P FPGA.
Table 2 and Table 3 present the layer-wise comparative results for inference time, power consumption, and speedup for the VGG-16 and ResNet-18 networks, respectively. It is worth noting that, for the VGG-16 workload, the online arithmetic-based design without the early negative detection capability (Baseline-2) outperforms the conventional bit-serial design (Baseline-1). On average, the Baseline-2 design achieves an 18.6% improvement in inference time over the Baseline-1 design, consumes 63.4% less power, and delivers a 1.23× speedup, underscoring the superior capability of online arithmetic-based DNN accelerator designs. ECHO, in turn, achieves 58.16% and 48.6% improved inference time, consumes 81.2% and 48.7% less power, and achieves 2.39× and 1.94× speedups compared to the Baseline-1 and Baseline-2 designs, respectively, for the VGG-16 network.
Similarly, for the ResNet-18 model, as indicated in Table 3, ECHO achieves 61.6% and 50.3% improved inference time, consumes 82.9% and 50.2% less power, and delivers correspondingly higher throughput, with speedups of 2.6× and 2.02× compared to the Baseline-1 and Baseline-2 designs, respectively.
Figure 8 presents the runtimes of ECHO compared to the baseline methods. It can be noted that the accelerator design based on online arithmetic (Baseline-2) outperforms the conventional bit-serial design (Baseline-1) even without the capability of early detection of negative outputs, which emphasizes the superiority of online arithmetic-based designs over conventional bit-serial designs. As shown in Figure 8a, ECHO shows mean runtime improvements of 58.16% and 48.6% over the Baseline-1 and Baseline-2 designs, respectively, for the VGG-16 workloads. Similarly, as shown in Figure 8b, ECHO achieves 61.6% and 50.3% runtime improvements on the ResNet-18 workloads compared to the Baseline-1 and Baseline-2 designs, respectively. These average runtime improvements translate into average speedups of 2.39× and 2.6× for the VGG-16 and ResNet-18 CNN models, respectively, over the conventional bit-serial design. Compared to the online arithmetic-based design without early detection of negative outputs (Baseline-2), ECHO achieves average speedups of 1.94× and 2.02× for the VGG-16 and ResNet-18 workloads, respectively. The faster runtimes of ECHO result not only from the efficient design of the online arithmetic-based processing elements, but also from the large number of negative output activations, which allows ineffective computations to be terminated and yields substantial power and energy savings.
The power consumption is related to the execution time as well as the utilization of resources in the accelerator. The proposed early detection and termination of negative output activations can result in substantial improvements in power consumption.
Figure 9 shows the layer-wise power consumption of ECHO compared to the baseline designs. For the VGG-16 model, as shown in Figure 9a, ECHO achieves 81.2% and 48.7% reductions in power consumption compared to the Baseline-1 and Baseline-2 designs, respectively. Similarly, as depicted in Figure 9b, ECHO shows significant reductions of 82.9% and 50.2% in power consumption for the ResNet-18 workloads compared to the Baseline-1 and Baseline-2 designs, respectively.
A comparison of the FPGA implementation of ECHO with the conventional bit-serial design (Baseline-1) and the online arithmetic design without the early negative detection capability (Baseline-2) is presented in Table 4. All designs in these experiments were evaluated at a frequency of 100 MHz. It can be observed from the comparative results that the logic resource and BRAM utilization of ECHO are marginally higher than those of the Baseline-1 design. However, ECHO achieves 13.8×, 2.6×, and 3.3× the peak throughput of the Baseline-1 design for the VGG-16, ResNet-18, and ResNet-50 CNN workloads, respectively. Similarly, ECHO achieves 58.16%, 61.6%, and 37.0% improvements in latency per image for the VGG-16, ResNet-18, and ResNet-50 workloads, respectively, compared to the Baseline-1 design. Compared to the online arithmetic-based design without the early detection and termination capability (Baseline-2), ECHO achieves throughput improvements of 10.7×, 2.0×, and 2.1×, and latency improvements of 48.6%, 50.2%, and 50.2%, for the VGG-16, ResNet-18, and ResNet-50 models, respectively. ECHO also exhibits improved power consumption due to the proposed early detection of negative results. In particular, for the VGG-16 model, ECHO achieves power reductions of 81.2% and 48.7% compared to the Baseline-1 and Baseline-2 designs, respectively. The proposed design shows similar improvements of 82.9% and 50.2% for the ResNet-18 workload, and consumes 40.6% and 50.3% less power for the ResNet-50 model, compared to the Baseline-1 and Baseline-2 designs, respectively.
Similar to the comparison with the baseline designs, a comparison of the proposed design with previous methods is also conducted for the VGG-16, ResNet-18, and ResNet-50 workloads; the comparative results are presented in Table 5. For the VGG-16 workload, the proposed design achieves 3.88×, 1.75×, 1.65×, and 1.03× the peak throughput of NEURAghe [54], the design in [55], OPU [56], and Caffeine [57], respectively. In terms of energy efficiency, ECHO achieves 2.86× and 2.01× improvements over NEURAghe [54] and OPU [56], respectively. Similarly, the proposed design outperforms NEURAghe [54] and the design in [58] by 2.12× and 1.38×, respectively, in peak throughput for the ResNet-18 workload, and achieves a superior energy efficiency of 21.58 GOPS/W compared to 5.8 GOPS/W for [54] and 19.41 GOPS/W for [58]. For the ResNet-50 network, the proposed design achieves 1.43× and 1.15× better results than [58] in terms of peak throughput and energy efficiency, respectively. However, despite this promising comparative performance, the proposed design utilizes a slightly larger number of logic resources than its contemporary counterparts. An area for future research in this avenue lies in a lightweight design of the online arithmetic-based compute units, where a more compact design may help reduce the area and logic resource overhead of the present design.
5. Conclusion
This research introduces ECHO, a DNN hardware accelerator focused on computation pruning through the use of online arithmetic. ECHO effectively addresses the challenge of early detection of negative activations in ReLU-based DNNs, showcasing substantial improvements in power efficiency and throughput for the VGG-16, ResNet-18, and ResNet-50 workloads. The proposed design achieves peak throughputs of 655, 123, and 128.7 GOPS and energy efficiencies of 48.41, 21.58, and 22.51 GOPS/W for VGG-16, ResNet-18, and ResNet-50, respectively, comparing favorably with existing methods. Moreover, ECHO achieves average reductions in power consumption of up to 81.2%, 82.9%, and 40.6% for the VGG-16, ResNet-18, and ResNet-50 workloads compared to the conventional bit-serial design, respectively. Furthermore, significant average speedups of 2.39×, 2.6×, and 2.42× were observed when comparing the proposed design to the conventional bit-serial design for the VGG-16, ResNet-18, and ResNet-50 models, respectively. Additionally, ECHO outperforms the online arithmetic-based design without early detection, achieving average speedups of 1.94× and 2.02× for the VGG-16 and ResNet-18 workloads. These findings underscore the potential of the proposed hardware accelerator in enhancing the efficiency of convolution computation in deep neural networks during inference.
Figure 1.
A typical convolutional neural network - VGG-16.
Figure 2.
The Rectified Linear Unit (ReLU) activation function
Figure 3.
Timing characteristics of an online arithmetic operation with online delay δ.
Figure 5.
Processing engine architecture of ECHO. Each PE contains multiple multipliers, where each multiplier accepts a bit-serial input (input feature) and a parallel input (kernel pixel).
Figure 6.
Architecture of ECHO for a convolution layer. Each column is equipped with N PEs to facilitate the input channels, while each column of PE is followed by an online arithmetic-based reduction tree for the generation of the final SOP. The central controller block generates the termination signals and also controls the dataflow to and from the weight buffers (WB) and activation buffers (AB).
Figure 7.
Central Controller and Decision Unit in ECHO
Figure 8.
Runtimes of the convolution layers for the proposed method with the baseline designs. The proposed design achieved mean runtime improvements of 58.16% and 61.6% compared to conventional bit-serial design (Baseline-1) for VGG-16 and ResNet-18 workloads respectively.
Figure 9.
Power consumption of the proposed design compared with the baseline designs. The proposed design achieved 81.2% and 82.9% mean reductions in power consumption compared to the conventional bit-serial design (Baseline-1) for the VGG-16 and ResNet-18 workloads, respectively.
Table 1.
Convolution layer architecture of the VGG-16 and ResNet networks. M denotes the number of kernels (i.e., the number of output feature maps), and the Output Size column gives the dimensions of the output feature maps.
| Network | Layer | Kernel Size | M | Output Size |
| --- | --- | --- | --- | --- |
| VGG-16 | C1-C2 | 3×3 | 64 | 224×224 |
|  | C3-C4 | 3×3 | 128 | 112×112 |
|  | C5-C7 | 3×3 | 256 | 56×56 |
|  | C8-C10 | 3×3 | 512 | 28×28 |
|  | C11-C13 | 3×3 | 512 | 14×14 |
| ResNet-18 | C1 | 7×7 | 64 | 112×112 |
|  | C2-C5 | 3×3 | 64 | 56×56 |
|  | C6-C9 | 3×3 | 128 | 28×28 |
|  | C10-C13 | 3×3 | 256 | 14×14 |
|  | C14-C17 | 3×3 | 512 | 7×7 |
| ResNet-50 | C1 | 7×7 | 64 | 112×112 |
|  | C2-x | 1×1, 3×3, 1×1 | 64, 256 | 56×56 |
|  | C3-x | 1×1, 3×3, 1×1 | 128, 512 | 28×28 |
|  | C4-x | 1×1, 3×3, 1×1 | 256, 1024 | 14×14 |
|  | C5-x | 1×1, 3×3, 1×1 | 512, 2048 | 7×7 |
Table 2.
Comparison of per layer inference time, power consumption, and speedup for the convolution layers of the VGG-16 network.

| Layer | Inference Time (ms) |  |  | Power (W) |  |  | Speedup (×) |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | Baseline-1 | Baseline-2 | ECHO | Baseline-1 | Baseline-2 | ECHO | Baseline-1 | Baseline-2 | ECHO |
| C1 | 35.12 | 22.8 | 12.29 | 4.55 | 1.44 | 0.81 | 1 | 1.54 | 2.86 |
| C2 | 78.27 | 60.21 | 33.95 | 108.2 | 36.99 | 20.86 | 1 | 1.3 | 2.31 |
| C3 | 39.13 | 30.10 | 16.99 | 54.10 | 18.5 | 10.44 | 1 | 1.3 | 2.3 |
| C4 | 80.28 | 64.22 | 29.14 | 110.98 | 39.46 | 17.90 | 1 | 1.25 | 2.75 |
| C5 | 40.14 | 32.11 | 19.98 | 55.49 | 19.73 | 12.27 | 1 | 1.25 | 2.01 |
| C6 | 82.28 | 68.24 | 34.42 | 113.75 | 41.92 | 21.14 | 1 | 1.21 | 2.39 |
| C7 | 82.28 | 68.24 | 38.51 | 113.75 | 41.92 | 23.66 | 1 | 1.21 | 2.14 |
| C8 | 41.14 | 34.12 | 15.49 | 56.87 | 20.96 | 9.51 | 1 | 1.21 | 2.67 |
| C9 | 84.29 | 72.25 | 46.86 | 116.53 | 44.39 | 28.79 | 1 | 1.16 | 1.79 |
| C10 | 84.29 | 72.25 | 15.30 | 116.53 | 44.39 | 9.40 | 1 | 1.16 | 5.51 |
| C11 | 21.07 | 18.06 | 15.86 | 29.13 | 11.09 | 9.74 | 1 | 1.16 | 1.33 |
| C12 | 21.07 | 18.06 | 1.41 | 29.13 | 11.09 | 0.86 | 1 | 1.16 | 14.92 |
| C13 | 21.07 | 18.06 | 17.04 | 29.13 | 11.09 | 10.47 | 1 | 1.16 | 1.24 |
| Mean | 54.65 | 44.46 | 22.87 | 72.16 | 26.38 | 13.53 | 1 | 1.23 | 2.39 |
Table 3.
Comparison of per layer inference time, power consumption, and speedup for the convolution layers of the ResNet-18 model.

| Layer | Inference Time (ms) |  |  | Power (W) |  |  | Speedup (×) |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | Baseline-1 | Baseline-2 | ECHO | Baseline-1 | Baseline-2 | ECHO | Baseline-1 | Baseline-2 | ECHO |
| C1 | 12.79 | 6.52 | 3.25 | 9.02 | 1.16 | 0.58 | 1 | 1.96 | 3.93 |
| C2 | 4.89 | 3.76 | 1.9 | 6.76 | 2.31 | 1.16 | 1 | 1.3 | 2.57 |
| C3 | 4.89 | 3.76 | 1.88 | 6.76 | 2.31 | 1.15 | 1 | 1.3 | 2.6 |
| C4 | 4.89 | 3.76 | 1.91 | 6.76 | 2.31 | 1.17 | 1 | 1.3 | 2.56 |
| C5 | 4.89 | 3.76 | 1.87 | 6.76 | 2.31 | 1.15 | 1 | 1.3 | 2.61 |
| C6 | 2.44 | 1.88 | 0.96 | 3.38 | 1.15 | 0.59 | 1 | 1.29 | 2.54 |
| C7 | 5.01 | 4.01 | 1.98 | 6.93 | 2.46 | 1.21 | 1 | 1.24 | 2.53 |
| C8 | 5.01 | 4.01 | 2.05 | 6.93 | 2.46 | 1.26 | 1 | 1.24 | 2.44 |
| C9 | 5.01 | 4.01 | 2.04 | 6.93 | 2.46 | 1.25 | 1 | 1.24 | 2.45 |
| C10 | 2.5 | 2.04 | 1.06 | 3.46 | 1.25 | 0.65 | 1 | 1.22 | 2.35 |
| C11 | 5.14 | 4.26 | 2.1 | 7.1 | 2.62 | 1.29 | 1 | 1.2 | 2.44 |
| C12 | 5.14 | 4.26 | 2.14 | 7.1 | 2.62 | 1.31 | 1 | 1.2 | 2.4 |
| C13 | 5.14 | 4.26 | 2.07 | 7.1 | 2.62 | 1.27 | 1 | 1.2 | 2.48 |
| C14 | 2.57 | 2.26 | 1.08 | 3.55 | 1.39 | 0.66 | 1 | 1.13 | 2.37 |
| C15 | 5.26 | 4.6 | 2.47 | 7.28 | 2.83 | 1.52 | 1 | 1.14 | 2.12 |
| C16 | 5.26 | 4.6 | 1.92 | 7.28 | 2.83 | 1.18 | 1 | 1.14 | 2.73 |
| C17 | 5.26 | 4.6 | 2.35 | 7.28 | 2.83 | 1.44 | 1 | 1.14 | 2.23 |
| Mean | 5.06 | 3.9 | 1.94 | 6.49 | 2.23 | 1.11 | 1 | 1.29 | 2.6 |
Table 4.
Comparison of the FPGA implementation of the proposed design with the conventional bit-serial design (Baseline-1) and the online arithmetic design without the early negative detection capability (Baseline-2). The FPGA device used for this experiment is the Xilinx Virtex UltraScale+ VU3P. Peak throughput and latency per image are reported in GOPS and ms, respectively.

| Model | VGG-16 |  |  | ResNet-18 |  |  | ResNet-50 |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Design | Baseline-1 | Baseline-2 | ECHO | Baseline-1 | Baseline-2 | ECHO | Baseline-1 | Baseline-2 | ECHO |
| Frequency (MHz) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Logic Utilization | 238K (27.6%) | 315K (36.54%) | 315K (36.54%) | 238K (27.6%) | 315K (36.54%) | 315K (36.54%) | 238K (27.6%) | 315K (36.54%) | 315K (36.54%) |
| BRAM Utilization | 83 (11.4%) | 84 (11.54%) | 84 (11.54%) | 83 (11.4%) | 84 (11.54%) | 84 (11.54%) | 83 (11.4%) | 84 (11.54%) | 84 (11.54%) |
| Mean Power (W) | 72.16 | 26.38 | 13.53 | 6.49 | 2.23 | 1.11 | 9.63 | 11.5 | 5.72 |
| Peak Throughput (GOPS) | 47.3 | 61.4 | 655 | 47.3 | 61.4 | 123 | 39.01 | 61.4 | 128.7 |
| Latency per Image (ms) | 710.5 | 578.03 | 297.3 | 86.2 | 66.4 | 33.1 | 366.7 | 464.01 | 231.03 |
| Average Speedup (×) | 1 | 1.23 | 2.39 | 1 | 1.29 | 2.6 | 1 | 1.21 | 2.42 |
Table 5.
Comparison with previous works.

| Model | VGG-16 |  |  |  |  | ResNet-18 |  |  | ResNet-50 |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Design | NEURAghe [54] | [55] | OPU [56] | Caffeine [57] | ECHO | NEURAghe [54] | [58] | ECHO | [58] | ECHO |
| Device | Zynq Z7045 | Zynq ZC706 | Zynq XC7Z100 | VX690t | VU3P | Zynq Z7045 | Arria10 SX660 | VU3P | Arria10 SX660 | VU3P |
| Frequency (MHz) | 140 | 150 | 200 | 150 | 100 | 140 | 170 | 100 | 170 | 100 |
| Logic Utilization | 100K | - | - | - | 315K | 100K | 102.6K | 315K | 102.6K | 315K |
| BRAM Utilization | 320 | 1090 | 1510 | 2940 | 84 | 320 | 465 | 84 | 465 | 84 |
| Peak Throughput (GOPS) | 169 | 374.98 | 397 | 636 | 655 | 58 | 89.29 | 123 | 90.19 | 128.7 |
| Energy Efficiency (GOPS/W) | 16.9 | - | 24.06 | - | 48.41 | 5.8 | 19.41 | 21.58 | 19.61 | 22.51 |