This section presents a broad survey of edge AI processors and PIM processors from industry, including processors that have been released, processors that have been announced, and processors described in research venues (such as the ISSCC and VLSI conferences). The section is divided into four subsections: subsection (i) covers dataflow processors, subsection (ii) covers neuromorphic processors, and subsection (iii) covers PIM processors; these three cover industrial products that have been announced or released. Finally, subsection (iv) covers processors in industrial research.
i. Dataflow Edge Processor
This section describes the latest dataflow processors from industry. Dataflow processors are custom-designed for neural network inference and, in some cases, training. The processors are listed in alphabetical order by manufacturer name. The data provided are taken from each processor's publications or manufacturer website.
Apple released the A16 Bionic SoC, with an NPU, for the iPhone 14 [157]. The A16 exhibits about 20% better performance at the same power consumption as its predecessor, the A15. It integrates a 6-core ARMv8.6-A CPU, a 16-core NPU, and an 8-core GPU [157]. The Apple M2 processor was released in 2022, primarily for MacBooks, and was later adopted in iPads; it includes a 10-core GPU and a 16-core NPU [158]. The M1 performs 11 TOPS with 10 W of power consumption [159], while the M2's CPU and GPU are 18% and 35% more powerful, respectively, for faster computations.
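Throughout this section and in Table 2, energy efficiency (TOPS/W) is simply peak throughput divided by power. A minimal sketch of that bookkeeping, using the Apple figures above (the function name is ours):

```python
# Energy efficiency as reported in Table 2: peak throughput (TOPS) / power (W).

def tops_per_watt(peak_tops: float, power_w: float) -> float:
    """Derive the E. Eff. column of Table 2 from throughput and power."""
    return peak_tops / power_w

print(tops_per_watt(11, 10))    # Apple M1: 11 TOPS at 10 W -> 1.1 TOPS/W
print(tops_per_watt(17, 5.5))   # Apple A16: 17 TOPS at 5.5 W -> ~3.1 TOPS/W
```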
ARM recently announced the Ethos-N78, an 8-core NPU, for automotive applications [160]. The Ethos-N78 is an upgraded version of the Ethos-N77; both NPUs support INT8 and INT16 precision. The Ethos-N78 performs more than 2x better than the earlier version. Its most significant improvement is a new data compression method that reduces bandwidth and improves performance and energy efficiency [161].
Blaize released its Pathfinder P1600 "El Cano" AI inference processor. This processor integrates 16 graph streaming processors (GSPs) that deliver 16 TOPS at peak performance [162]. A dual-core Cortex-A53 running at up to 1 GHz hosts the operating system. Blaize GSPs integrate data pipelining and support operations up to INT64, as well as 8-bit floating point (FP8) [163].
AIMotive [164] introduced the Apache5 inference edge processor, which supports a wide range of DNN models. The system has an aiWare3p NPU with an energy efficiency of 2 TOPS/W. Apache5 supports INT8 MACs with INT32 internal precision [165]. This processor mainly targets autonomous vehicles [166].
CEVA [134] released the NeuPro-S on-device AI processor for computer vision applications. NeuPro-S comprises two separate cores: a DSP-based Vector Processing Unit (VPU) and the NeuPro engine. The VPU is the controller, while the NeuPro engine does most of the computing work with INT8 or INT16 precision. A single processor performs up to 12.5 TOPS, and performance can be scaled to 100 TOPS with multicore clusters [134,135]. Deep learning edge processors are mostly employed for inference tasks; CEVA added a retraining capability to its CDNN (CEVA DNN) framework for online learning on client devices [136].
Cadence introduced the Tensilica DNA 100, a comprehensive SoC platform for domain-specific on-device AI acceleration at the edge [167]. The family spans low-, mid-, and high-end products. The Tensilica DNA 100 currently offers 8 GOPS to 32 TOPS of AI processing performance, with 100 TOPS projected for future releases [168]. The low-end target applications of the DNA 100 are IoT, intelligent sensors, vision, and voice; the mid- and high-end applications include smart surveillance and autonomous vehicles, respectively.
Table 2. Commercial Edge Processors with Operation Technology, Process Technology, and Numerical Precision.
Company | Latest Chip | Power (W) | Process (nm) | Area (mm2) | Precision INT/FP | Performance (TOPS) | E. Eff. (TOPS/W) | Architecture | Reference
Apple | M1 | 10 | 5 | 119 | 64 | 11 | 1.1 | Dataflow | [159]
Apple | A14 | 6 | 5 | 88 | 64 | 11 | 1.83 | Dataflow | [242]
Apple | A15 | 7 | 5 | -- | 64 | 15.8 | 2.26 | Dataflow | [242]
Apple | A16 | 5.5 | 4 | -- | 64 | 17 | 3 | Dataflow | [157]
*AIStorm | AIStorm | 0.225 | -- | -- | 8 | 2.5 | 11 | Dataflow | [243]
*AlphaIC | RAP-E | 3 | -- | -- | 8 | 30 | 10 | Dataflow | [244]
aiCTX | Dynap-CNN | 0.001 | 22 | 12 | 1 | 0.0002 | 0.2 | Neuromorphic | [15,213]
*ARM | Ethos78 | 1 | 5 | -- | 16 | 10 | 10 | Dataflow | [160,161]
*AIMotive | Apache5 IEP | 0.8 | 16 | 121 | 8 | 1.6-32 | 2 | Dataflow | [164,165]
*Blaize | Pathfinder, El Cano | 6 | 14 | -- | 64, FP-8, BF16 | 16 | 2.7 | Dataflow | [162]
*Bitmain | BM1880 | 2.5 | 28 | 93.52 | 8 | 2 | 0.8 | Dataflow | [245,246]
*BrainChip | Akida1000 | 2 | 28 | 225 | 1,2,4 | 1.5 | 0.75 | Neuromorphic | [147,148]
*Canaan | Kendryte K210 | 2 | 28 | -- | 8 | 1.5 | 1.25 | Dataflow | [247,248]
*CEVA | CEVA-Neuro-S | -- | 16 | -- | 2,5,8,12,16 | 12.7 | -- | Dataflow | [134]
*CEVA | CEVA-Neuro-M | 0.83 | 16 | -- | 2,5,8,12,16 | 20 | 24 | Dataflow | [135]
*Cadence | DNA100 | 0.85 | 16 | -- | 16 | 4.6 | 3 | Dataflow | [167,168]
*Deepvision | ARA-1 | 1.7 | 28 | -- | 8,16 | 4 | 2.35 | Dataflow | [169]
*Deepvision | ARA-2 | -- | 16 | -- | -- | -- | -- | Dataflow | [170]
*Eta | ECM3532 | 0.01 | 55 | 25 | 8 | 0.001 | 0.1 | Dataflow | [249]
*Flex Logix | InferX X1 | 13.5 | 7 | 54 | 8 | 7.65 | 0.57 | Dataflow | [250]
*Google | Edge TPU | 2 | 28 | 96 | 8, BF16 | 4 | 2 | Dataflow | [176,177]
*Gyrfalcon | Lightspeeur 2803S | 0.7 | 28 | 81 | 8 | 16.8 | 24 | PIM | [224]
*Gyrfalcon | Lightspeeur 5801 | 0.224 | 28 | 36 | 8 | 2.8 | 12.6 | PIM | [224]
*Gyrfalcon | Janux GS31 | 650/900 | 28 | 10457.5 | 8 | 2150 | 3.30 | PIM | [225]
*GreenWaves | GAP9 | 0.05 | 22 | 12.25 | FP-(8,16,32) | 0.05 | 1 | Dataflow | [180,181]
*Horizon | Journey 3 | 2.5 | 16 | -- | 8 | 5 | 2 | Dataflow | [171]
*Horizon | Journey5/5P | 30 | 16 | -- | 8 | 128 | 4.8 | Dataflow | [172,173]
*Hailo | Hailo 8 M2 | 2.5 | 28 | 225 | 4,8,16 | 26 | 2.8 | Dataflow | [174,175]
Intel | Loihi 2 | 0.1 | 7 | 31 | 8 | 0.3 | 3 | Neuromorphic | [9]
Intel | Loihi | 0.11 | 14 | 60 | 1-9 | 0.03 | 0.3 | Neuromorphic | [9,218]
*Intel | Intel® Movidius | 2 | 16 | 71.928 | 16 | 4 | 2 | Dataflow | [186]
IBM | TrueNorth | 0.065 | 28 | 430 | 8 | 0.0581 | 0.4 | Neuromorphic | [10,218]
IBM | NorthPole | 74 | 12 | 800 | 2,4,8 | 200 (INT8) | 2.7 | Dataflow | [299,304]
*Imagination | PowerVR Series3NX | -- | -- | -- | FP-(8,16) | 0.60 | -- | Dataflow | [182,183]
*Imagination | IMG 4NX MC1 | 0.417 | -- | -- | 4,16 | 12.5 | 30 | Dataflow | [184]
*Imec | DIANA | -- | 22 | 10.244 | 2 | 29.5 (A), 0.14 (D) | 14.4 | PIM+Digital | [222,223]
*Kalray | MPPA3 | 15 | 16 | -- | 8,16 | 255 | 1.67 | Dataflow | [13]
*Kneron | KL720 AI | 1.56 | 28 | 81 | 8,16 | 1.4 | 0.9 | Dataflow | [191]
*Kneron | KL530 | 0.5 | -- | -- | 8 | 1 | 2 | Dataflow | [192]
*Koniku | Konicore | -- | -- | -- | -- | -- | -- | Neuromorphic | [12]
*LeapMind | Efficiera | 0.237 | 12 | 0.422 | 1,2,4,8,16,32 | 6.55 | 27.7 | Dataflow | [21]
Memryx | MX3 | 1 | -- | -- | 4,8,16 (W), BF16 | 5 | 5 | Dataflow | [297]
*Mythic | M1108 | 4 | -- | 361 | 8 | 35 | 8.75 | PIM | [89]
*Mythic | M1076 | 3 | 40 | 294.5 | 8 | 25 | 8.34 | PIM | [18,88,90]
*Mobileye | EyeQ5 | 10 | 7 | 45 | 4,8 | 24 | 2.4 | Dataflow | [193,194,195]
*Mobileye | EyeQ6 | 40 | 7 | -- | 4,8 | 128 | 3.2 | Dataflow | [196]
*Mediatek | i350 | -- | 14 | -- | -- | 0.45 | -- | Dataflow | [251]
*NVIDIA | Jetson Nano B01 | 10 | 20 | 118 | FP16 | 1.88 | 0.188 | Dataflow | [197]
NVIDIA | AGX Orin | 60 | 7 | -- | 8 | 275 | 3.33 | Dataflow | [199]
*NXP | i.MX 8M+ | -- | 14 | 196 | FP16 | 2.3 | -- | Dataflow | [86,87]
*NXP | i.MX9 | 4x10-6 | 12 | -- | -- | -- | -- | Dataflow | [85]
*Perceive | Ergo | 0.073 | 5 | 49 | 8 | 4 | 55 | Dataflow | [252]
TSU & Polar Bear Tech | QM930 | 12 | 12 | 1089 | 4,8,16 | 20 (INT8) | 1.67 | Dataflow | [302]
Qualcomm | QCS8250 | -- | 7 | 157.48 | 8 | 15 | -- | Dataflow | [200,201]
Qualcomm | Snapdragon 888+ | 5 | 5 | -- | FP32 | 32 | 6.4 | Dataflow | [202,203,204]
Qualcomm | Snapdragon 8 Gen2 | -- | 4 | -- | 4,8,16, FP16 | 51 | -- | Dataflow | [303]
*RockChip | rk3399Pro | 3 | 28 | 729 | 8,16 | 3 | 1 | Dataflow | [253]
Rokid | Amlogic A311D | -- | 12 | -- | -- | 5 | -- | Dataflow | [254]
Samsung | Exynos 2100 | -- | 5 | -- | -- | 26 | -- | Dataflow | [205,206]
Samsung | Exynos 2200 | -- | 4 | -- | 8,16, FP16 | -- | -- | Dataflow | [255]
Samsung | HBM-PIM | 0.9 | 20 | 46.88 | -- | 1.2 | 1.34 | PIM | [226,227]
SiMa.ai | MLSoC | 10 | 16 | 175.55 | 8 | 50 | 5 | Dataflow | [300,301]
Synopsys | EV7x | -- | 16 | -- | 8,12,16 | 2.7 | -- | Dataflow | [209,210]
*Syntiant | NDP100 | 0.00014 | 40 | 2.52 | -- | 0.000256 | 20 | PIM | [228,229]
*Syntiant | NDP101 | 0.0002 | 40 | 25 | 1,2,4,8 | 0.004 | 20 | PIM | [228,231]
*Syntiant | NDP102 | 0.0001 | 40 | 4.2921 | 1,2,4,8 | 0.003 | 20 | PIM | [228,235]
*Syntiant | NDP120 | 0.0005 | 40 | 7.75 | 1,2,4,8 | 0.0019 | 3.8 | PIM | [228,234]
*Syntiant | NDP200 | 0.001 | 40 | -- | 1,2,4,8 | 0.0064 | 6.4 | PIM | [228,232]
Think Silicon | NEMA pico XS | 0.0003 | 28 | 0.11 | FP16,32 | 0.0018 | 6 | Dataflow | [256]
Tesla/Samsung | FSD Chip | 36 | 14 | 260 | 8, FP-8 | 73.72 | 2.04 | Dataflow | [211]
Videantis | TEMPO | -- | -- | -- | -- | -- | -- | Neuromorphic | [11]
Verisilicon | VIP9000 | -- | 16 | -- | 16, FP16 | 0.5-100 | -- | Dataflow | [207,208]
Untether | TsunAImi | 400 | 16 | -- | 8 | 2008 | 8 | PIM | [236,237]
UPMEM | UPMEM-PIM | 700 | 20 | -- | 32,64 | 0.149 | -- | PIM | [238,239,240,241]
Deepvision has updated its edge inference coprocessor, ARA-1, for applications in autonomous vehicles and smart industries [46]. It includes eight compute engines delivering 4 TOPS and consumes 1.7-2.3 W of power [169]. The compute engines support INT8 and INT16 precision. Deepvision has recently announced its second-generation inference engine, ARA-2, to be released later in 2022 [170]. The newer version will support LSTM and RNN neural networks in addition to the networks supported by ARA-1.
Horizon announced its next automotive AI inference processor, the Journey 5/5P [171], an updated version of the Journey 3. Mass production of the Journey 5 starts in 2022. The processor exhibits a performance of 128 TOPS at 30 W of power, giving an energy efficiency of about 4.3 TOPS/W [172,173].
Hailo released its Hailo-8 M.2 SoC for various edge applications [174]. The computing engine supports INT8 and INT16 precision. This inference engine is capable of 26 TOPS and requires 2.5 W of power. The processor can be employed standalone or as a coprocessor [175].
Google introduced its Coral Edge TPU, which occupies only 29% of the footprint of the original TPU, for edge applications [176]. The Coral TPU shows high energy efficiency in DNN computations compared to the original TPUs used in cloud inference applications [178]. The Coral Edge TPU supports INT8 precision and can perform 4 TOPS with 2 W of power consumption [176].
Google released its Tensor processor for mobile applications, shipping with its recent Pixel-series phones [179]. Tensor is an 8-core Cortex CPU chipset fabricated with 5 nm process technology. The processor has a 20-core Mali-G78 MP20 GPU with 2170 GFLOPS of computing speed and a built-in NPU that accelerates AI models at 5.7 TOPS. The maximum power consumption of the processor is 10 W.
GreenWaves announced its edge inference chip, GAP9 [180]. It is a very low-cost, low-power device that consumes 50 mW and performs 50 GOPS at peak. GAP9 supports hearable development through a DSP, an AI accelerator, and ultra-low-latency audio streaming on IoT devices. GAP9 supports a wide range of computing precisions: INT8/16/24/32 and FP16/32 [181].
IBM introduced the NorthPole [299], a non-von Neumann deep learning inference engine, at the Hot Chips 2023 conference. The processor exhibits massive parallelism with 256 cores. Each core has 768 KB of near-compute memory to store weights, activations, and programs, for a total on-chip memory capacity of 192 MB. The NorthPole processor does not use off-chip memory to load weights or store intermediate values during deep learning computations; this dramatically improves latency, throughput, and energy consumption, helping it outperform existing commercial deep learning processors. An external host processor issues three commands: write tensor, run network, and read tensor. The NorthPole processor follows a set of pre-scheduled, deterministic operations in the core array. It is implemented in 12 nm technology and has 22 billion transistors in 800 mm2 of chip area. The performance data released for the NorthPole processor are reported in frames/s; performance in operations/s for integer or floating point is currently unavailable in the public domain. However, operations per cycle are available for different data precisions: in vector-matrix multiplication, 8-, 4-, and 2-bit precision can perform 2048, 4096, and 8192 operations/cycle, and FP16 can compute 256 operations/cycle (the number of cycles/s has not been released). NorthPole can compute 800, 400, and 200 TOPS with INT2, INT4, and INT8 precision, respectively. The processor can be applied to a broad range of applications and can execute inference with a wide range of network models for classification, detection, segmentation, speech recognition, and transformer models in NLP.
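A quick consistency check of these published numbers is possible. Assuming (our assumption, not stated in the source) that the operations/cycle figures are per core and scale linearly across all 256 cores, each precision implies the same clock frequency:

```python
# Back-of-envelope check of NorthPole's published figures.
# Assumption: the ops/cycle numbers are per core and scale across all 256 cores.

CORES = 256
OPS_PER_CYCLE = {"INT8": 2048, "INT4": 4096, "INT2": 8192}  # per core, from the text
PEAK_TOPS = {"INT8": 200, "INT4": 400, "INT2": 800}         # from the text

for prec, ops in OPS_PER_CYCLE.items():
    chip_ops_per_cycle = ops * CORES
    # Clock frequency implied by the published peak throughput
    freq_ghz = PEAK_TOPS[prec] * 1e12 / chip_ops_per_cycle / 1e9
    print(f"{prec}: {chip_ops_per_cycle} ops/cycle/chip -> ~{freq_ghz:.2f} GHz implied clock")

# All three precisions imply the same ~0.38 GHz clock, so the figures are mutually consistent.
```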
Imagination introduced a wide range of edge processors with targeted applications from IoT to autonomous vehicles [182]. The edge processor series, the PowerVR Series3NX, can achieve up to 160 TOPS with multicore implementations. For ultra-low-power applications, one can choose the PowerVR AX3125, which has 0.6 TOPS of computing performance [183]. The IMG 4NX MC1 is a single-core Series4 processor for autonomous vehicle applications and performs 12.5 TOPS with less than 0.5 W of power consumption [184].
Intel released multiple edge AI processors, such as the Nervana Spring Crest NNP-I [185] and Movidius [186]. Recently, the company announced a scalable 4th-generation Xeon processor series that can be used from desktops to extreme edge devices [187]. The power consumption of an ultra-mobile processor is around 9 W when computing with INT8 precision. The parts use Intel's SuperFin fabrication on a 10 nm process. Intel compares the core architecture to the Skylake processor and claims an efficient core achieves 40% better performance with 40% less power.
IBM developed the Artificial Intelligence Unit (AIU) based on the AI accelerator used in the 7-nanometer Telum chip that powers its z16 system [188]. The AIU is a scaled version developed on a 5 nm process technology and features a 32-core design with a total of 23 billion transistors. The AIU uses IBM's approximate computing framework, in which computations execute with FP16 and FP32 precision [189].
LeapMind has introduced the Efficiera for edge AI inference, implemented in FPGA or ASIC [21]. Efficiera targets ultra-low-power applications. Computations are typically performed in 8-, 16-, or 32-bit precision; however, the company claims that 1-bit weights and 2-bit activations can be used while still maintaining accuracy, for better power and area efficiency. They report 6.55 TOPS at an 800 MHz clock frequency, with an energy efficiency of 27.7 TOPS/W [190].
Kneron released its edge inference processor, the KL720, for various applications such as autonomous vehicles and smart industry [191]. The KL720 is an upgraded version of the earlier KL520 for similar applications. The revised version achieves 0.9 TOPS/W and delivers up to 1.4 TOPS; its neural computation supports INT8 and INT16 precision [191]. Kneron's most up-to-date heterogeneous AI chip is the KL530 [192]. It is equipped with a brand-new NPU that supports INT4 precision, offering 70% higher performance than INT8. The maximum power consumption of the KL530 is 500 mW, and it can deliver up to 1 TOPS [192].
Table 3. Processors, Supported Neural Network Models, Deep Learning Frameworks, and Application Domains.
Company | Product | Supported Neural Networks | Supported Frameworks | Application/Benefits
Apple | Apple A14 | DNN | TFL | iPhone 12 series
Apple | Apple A15 | DNN | TFL | iPhone 13 series
aiCTX-Synsense | Dynap-CNN | CNN, RNN, reservoir computing | SNN | High-speed aircraft, IoT, security, healthcare, mobile
ARM | Ethos78 | CNN and RNN | TF, TFL, Caffe2, PyTorch, MXNet, ONNX | Automotive
AIMotive | Apache5 IEP | GoogleNet, VGG16/19, Inception-v4/v2, MobileNet v1, ResNet50, YOLO v2 | Caffe2 | Automotive: pedestrian detection, vehicle detection, lane detection, driver status monitoring
Blaize | El Cano | CNN, YOLO v3 | TFL | Industrial, retail, smart-city, and computer-vision systems
BrainChip | Akida1000 | CNN in SNN, MobileNet | MetaTF | Online learning, data analytics, security
BrainChip | AKD500, 1500, 2000 | DNN | MetaTF | Smart home, smart health, smart city, and smart transportation
CEVA | Neuro-S | CNN, RNN | TFL | IoT, smartphones, surveillance, automotive, robotics, medical
Cadence | Tensilica DNA100 | FC, CNN, LSTM | ONNX, Caffe2, TensorFlow | IoT, smartphones, AR/VR, smart surveillance, autonomous vehicles
Deepvision | ARA-1 | DeepLab v3, ResNet-50, ResNet-152, MobileNet-SSD, YOLO v3, UNET | Caffe2, TFL, MXNet, PyTorch | Smart retail, robotics, industrial automation, smart cities, autonomous vehicles, and more
Deepvision | ARA-2 | ARA-1 models plus LSTM, RNN | TFL, PyTorch | Smart retail, robotics, industrial automation, smart cities
Eta | ECM3532 | CNN, GRU, LSTM | -- | Smart home, consumer products, medical, logistics, smart industry
Gyrfalcon | Lightspeeur 2803S | CNN-based: VGG, ResNet, MobileNet | TFL, Caffe2 | High-performance audio and video processing
Gyrfalcon | Lightspeeur 5801 | CNN-based: ResNet, MobileNet, VGG16 | TFL, PyTorch, Caffe2 | Object detection and tracking, NLP, visual analysis
Gyrfalcon Edge Server | Janux GS31 | VGG, ResNet, MobileNet | TFL, Caffe2, PyTorch | Smart cities, surveillance, object detection, recognition
GreenWaves | GAP9 | CNN, MobileNet v1 | -- | DSP applications
Horizon | Journey 3 | CNN, MobileNet v2, EfficientNet | TFL, PyTorch, ONNX, MXNet, Caffe2 | Automotive
Horizon | Journey5/5P | ResNet18/50, MobileNet v1-v2, ShuffleNet v2, EfficientNet, Faster R-CNN, YOLOv3 | TFL, PyTorch, ONNX, MXNet, Caffe2 | Automotive
Hailo | Hailo 8 M2 | YOLOv3, YOLOv4, CenterPose, CenterNet, ResNet-50 | ONNX, TFL | Edge vision applications
Intel | Loihi 2 | SNN-based NN | Lava, TFL, PyTorch | Online learning, sensing, robotics, healthcare
Intel | Loihi | SNN-based NN | Nengo | Online learning, robotics, healthcare, and many more
Imagination | PowerVR Series3NX | MobileNet v3, CNN | Caffe, TFL | Smartphones, smart cameras, drones, automotive, wearables
Imec & GF | DIANA | DNN | TFL, PyTorch | Analog computing for edge inference
Koniku | Konicore | Synthetic biology + silicon | -- | Chemical detection, aviation, security
Kalray | MPPA3 | Deep networks converted to KaNN | Kalray's KaNN | Autonomous vehicles, surveillance, robotics, industry, 5G
Kneron | KL720 AI | CNN, RNN, LSTM | ONNX, TFL, Keras, Caffe2 | Wide applications, from automotive to home appliances
Kneron | KL520 | VGG16, ResNet, GoogleNet, YOLO, LeNet, MobileNet, FC | ONNX, TFL, Keras, Caffe2 | Automotive, home, industry, and so on
LeapMind | Efficiera | CNN, YOLO v3, MobileNet v2, LMNet | Blueoil, Python & C++ API | Home, industrial machinery, surveillance cameras, robots
Memryx | MX3 | CNN | PyTorch, ONNX, TF, Keras | Automation, surveillance, agriculture, financial
Mythic | M1108 | CNN, large complex DNN, ResNet50, YOLO v3, Body25 | PyTorch, TFL, ONNX | Machine vision, electronics, smart home, UAV/drone, edge server
Mythic | M1076 | CNN, complex DNN, ResNet50, YOLO v3 | PyTorch, TFL, ONNX | Surveillance, vision, voice, smart home, UAV, edge server
Mobileye | EyeQ5 | DNN | -- | Autonomous driving
Mobileye | EyeQ6 | DNN | -- | Autonomous driving
Mediatek | i350 | DNN | TFL | Vision and voice, biotech and biometric measurements
NXP | i.MX 8M+ | DNN | TFL, Arm NN, ONNX | Edge vision
NXP | i.MX9 | CNN, MobileNet v1 | TFL, Arm NN, ONNX | Graphics, image, display, audio
NVIDIA | AGX Orin | DNN | TF, TFL, Caffe, PyTorch | Robotics, retail, traffic, manufacturing
Qualcomm | QCS8250 | CNN, GAN, RNN | TFL | Smartphone, tablet, 5G support, video and image processing
Qualcomm | Snapdragon 888+ | DNN | TFL | Smartphone, tablet, 5G, gaming, video upscaling, image and video processing
RockChip | rk3399Pro | VGG16, ResNet50, Inception4 | TFL, Caffe, MXNet, ONNX, Darknet | Smart home, city, and industry; face recognition; driving monitoring
Rokid | Amlogic A311D | Inception v3, YOLOv2, YOLOv3 | TFL, Caffe2, Darknet | High-performance multimedia
Samsung | Exynos 2100 | CNN | TFL | Smartphone, tablet, advanced image signal processing (ISP), 5G
Samsung | HBM-PIM | DNN | PyTorch, TFL | Supercomputers and AI applications
Synopsys | EV7x | CNN, RNN, LSTM | OpenCV, OpenVX, OpenCL C, TFL, Caffe2 | Robotics, autonomous driving, vision, SLAM, and DSP algorithms
Syntiant | NDP100 | DNN | TFL | Mobile phones, hearing equipment, smartwatches, IoT, remote controls
Syntiant | NDP101 | CNN, RNN, GRU, LSTM | TFL | Mobile phones, smart homes, remote controls, smartwatches, IoT
Syntiant | NDP102 | CNN, RNN, GRU, LSTM | TFL | Mobile phones, smart homes, remote controls, smartwatches, IoT
Syntiant | NDP120 | CNN, RNN, GRU, LSTM | TFL | Mobile phones, smart home, wearables, PC, IoT endpoints, media streamers, AR/VR
Syntiant | NDP200 | FC, Conv, DSConv, RNN-GRU, LSTM | TFL | Mobile phones, smart homes, security cameras, video doorbells
Think Silicon | NEMA pico XS | DNN | -- | Wearable and embedded devices
Tesla | FSD | CNN | PyTorch | Automotive
Verisilicon | VIP9000 | All modern DNNs | TF, PyTorch, TFL, DarkNet, ONNX | Can perform as an intelligent eye and ear at the edge
Untether | TsunAImi | DNN, ResNet-50, YOLO, UNet, RNN, BERT, TCNs, LSTMs | TFL, PyTorch | NLP; inference at the edge server or data center
UPMEM | UPMEM-PIM | DNN | -- | Sequence alignment (DNA or protein), genome assembly, metagenomic analysis
Memryx [297] released an inference processor, the MX3. This processor computes deep learning models with 4-, 8-, or 16-bit weights and BF16 activations. The MX3 consumes about 1 W of power and computes 5 TFLOPS. The chip stores 10 million parameters on a die, so larger networks require multiple chips.
Mobileye and STMicroelectronics released the EyeQ5 SoC for autonomous driving [193]. The EyeQ5 is 4 times faster than their earlier version, the EyeQ4. It produces 2.4 TOPS/W and reaches up to 24 TOPS at 10 W of power [194]. Recently, Mobileye announced its next-generation processor, the EyeQ6, which is around 5x faster than the EyeQ5 [195]. At INT8 precision, the EyeQ5 performs 16 TOPS, while the EyeQ6 delivers 34 TOPS [196].
NXP introduced its edge processor, the i.MX 8M+, targeting vision, multimedia, and industrial automation applications [86]. The system includes a powerful Cortex-A53 processor integrated with an NPU. The neural network engine performs 2.3 TOPS with 2 W of power consumption, and the neural computation supports INT16 precision [87]. NXP is scheduled to launch its next AI processor, the i.MX9, in 2023 with more features and higher efficiency [85].
NVIDIA released the Jetson Nano, which is able to run multiple applications in parallel, such as image classification, object detection, segmentation, and speech processing [197]. This developer kit is supported by the NVIDIA JetPack SDK and can run state-of-the-art AI models. The Jetson Nano consumes around 5-10 W of power and computes 472 GFLOPS in FP16 precision. The newer Jetson Nano B01 can perform 1.88 TOPS [198].
NVIDIA also released the Jetson Orin line, which includes the specialized AGX Orin development hardware. It is embedded with 32 GB of memory, has a 12-core CPU, and exhibits a computing performance of 275 TOPS using INT8 precision [199]. The computing is powered by the NVIDIA Ampere architecture with 2048 CUDA cores, 64 tensor cores, and 2 NVDLA v2.0 deep learning accelerators [199].
Qualcomm developed the QCS8250 SoC for camera-intensive edge applications [200]. This processor supports Wi-Fi and 5G for IoT. A quad Hexagon vector extension (V66Q) with a Hexagon DSP is used for machine learning, and an integrated NPU is used for advanced video analysis. The NPU supports INT8 precision and runs at 15 TOPS [201]. Qualcomm has also released the Snapdragon 888+ 5G processor for smartphones, taking the smartphone experience to a new level with AI-enhanced gaming, streaming, and photography [202]. It includes a 6th-generation Qualcomm AI engine with the Qualcomm Hexagon 780 processor [203,204]. The throughput of the AI engine is 32 TOPS with 5 W of power consumption [203]. The Snapdragon 8 Gen2 mobile platform was presented at the Hot Chips 2023 conference and exhibited 60% better energy efficiency than the Snapdragon 8 in INT4 precision.
Samsung announced the Exynos 2100 AI edge processor for smartphones, smartwatches, and automobiles [205]. The Exynos supports 5G networks and performs on-device AI computations with triple NPUs. It is fabricated using 5 nm extreme-UV technology. The Exynos 2100 consumes 20% less power and delivers 10% higher performance than the Exynos 990. It can perform up to 26 TOPS and is 2 times more power-efficient than the earlier Exynos [206]. A more powerful mobile processor, the Exynos 2200, was released recently.
SiMa.ai [300] introduced the MLSoC for computer vision applications. The MLSoC is implemented in TSMC 16 nm technology. The accelerator can compute 50 TOPS while consuming 10 W of power, using INT8 precision. The processor has 4 MB of on-chip memory for deep learning operations and is 1.4x more efficient than Orin, measured in frames/W.
Tsinghua and Polar Bear Tech released their QM930 accelerator consisting of seven chiplets. The chiplets are organized as one hub chiplet and six side chiplets, forming a Hub-Side processor. The processor is implemented in 12nm CMOS technology. The total area for the chiplets is 209 mm2 for seven chiplets. However, the total substrate area of the processor is 1089 mm2. The processor can compute with INT4, INT8, and INT16 precision, showing peak performances of 40, 20, and 10 TOPS, respectively. The system energy efficiency is 1.67 TOPS/W while computed in INT8. The power consumption can be varied from 4.5 to 12 W.
Verisilicon introduced the VIP9000 for face and voice recognition. It adopts Vivante's latest VIP V8 NPU architecture for processing neural networks [207]. The computing engine supports INT8, INT16, FP16, and BF16, and performance can be scaled from 0.5 to 100 TOPS [208].
Synopsys developed the EV7x multicore processor family for vision applications [209]. The processor integrates a vector DSP, a vector FPU, and a neural network accelerator. Each VPU supports a 32-bit scalar unit. The MACs can be configured for INT8, INT16, or INT32 precision. The chip can achieve up to 2.7 TOPS [210].
Tesla designed the FSD processor, manufactured by Samsung, for autonomous vehicle operations [211]. The SoC includes 2 NPUs and one GPU. The NPUs support INT8 precision, and each NPU can compute 36.86 TOPS, giving the FSD chip a peak performance of 73.7 TOPS. The total TDP of each FSD chip is 36 W [211].
Several other companies have also developed edge processors for various applications but have not shared hardware performance details on their websites or in publicly available publications. For instance, Ambarella [307] has developed various edge processors for automotive, security, consumer, industrial IoT, and robotics applications. Ambarella's processors are SoCs, mainly using ARM processors and GPUs for DNN computations.
ii. Neuromorphic Edge AI Processor
In 2022, the global market value of neuromorphic chips was $3.7 billion, and by 2028 it is projected to reach $27.85 billion [212]. The neuromorphic processors described in this section utilize spike-based processing.
SynSense (formerly aiCTX) has introduced a line of ultra-low-power neuromorphic processors: DYNAP-CNN, XYLO, DYNAP-SE2, and DYNAP-SEL [15]. Of these, we were able to find performance information only for the DYNAP-CNN chip. This processor is fabricated on a 22 nm process technology and has a die area of 12 mm2. Each chip can implement up to a million spiking neurons, and a collection of DYNAP-CNN chips can be utilized to implement a larger CNN architecture. The chip utilizes asynchronous processing circuits [213].
BrainChip introduced the Akida line of spiking processors. The AKD1000 has 80 NPUs, uses 3 pJ per synaptic operation, and consumes around 2 W of power [147]. Each NPU consists of eight neural processing engines that run simultaneously and control convolution, pooling, and activation (ReLU) operations [148]. Convolution is normally carried out in INT8 precision, but it can be programmed for INT1, INT2, INT3, or INT4 precision at the cost of 1-3% accuracy. BrainChip has announced future releases of smaller and larger Akida processors under the AKD500, AKD1500, and AKD2000 labels [148]. A trained DNN can be converted to an SNN using the CNN2SNN tool in the MetaTF framework before loading the model onto an Akida processor. The processor also has on-chip training capability, allowing SNNs to be trained from scratch using the MetaTF framework [146].
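To make the conversion flow concrete, a hedged sketch follows: quantize a trained Keras model, then convert it with the CNN2SNN tool before deployment. The function names and keyword arguments below reflect our reading of BrainChip's MetaTF documentation and are assumptions that may differ between MetaTF releases; the model file is hypothetical.

```python
# Hedged sketch of the MetaTF CNN-to-SNN flow described above.
# API names (quantize, convert) are assumptions based on MetaTF docs.

from tensorflow import keras
from cnn2snn import quantize, convert  # MetaTF's CNN-to-SNN conversion package

# Hypothetical trained Keras CNN (any supported topology, e.g., MobileNet)
model = keras.models.load_model("mobilenet_trained.h5")

# Quantize to the low-bit precisions Akida supports (e.g., 4-bit weights/activations)
model_q = quantize(model, weight_quantization=4, activ_quantization=4)

# Convert the quantized CNN into an Akida-compatible spiking model,
# ready to be mapped onto the AKD1000's NPUs
model_akida = convert(model_q)
```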
GrAI Matter Labs (GML) developed and optimized a neuromorphic SoC processor named VIP for computer vision applications. VIP is a low-power, low-latency AI processor, with 5-10 W of power consumption and latency 10x lower than the NVIDIA Jetson Nano [214]. The target applications are audio/video processing on end devices.
IBM developed the TrueNorth neuromorphic spiking system for real-time tracking, identification, and detection [10]. It consists of 4096 neurosynaptic cores and 1 million digital neurons. The typical power consumption is 65 mW, and the processor can execute 46 GSOPS/W, at 26 pJ per synaptic operation [10,215]. The total chip area is 430 mm2, almost 14x larger than that of Intel's Loihi 2.
Innatera announced a neuromorphic chip fabricated using TSMC's 28 nm process [216]. When tested with audio signals [217], each spike event consumed about 200 fJ, and the chip consumed only 100 µW per inference event. The target application areas are mainly audio and voice recognition, healthcare, and radar [217].
Intel released the Loihi [9], a spiking neural network chip, in 2018, and an updated version, the Loihi 2 [9], in 2021. The Loihi 2 is fabricated using Intel's 7 nm technology and has 2.3 billion transistors in a chip area of 31 mm2. This processor has 128 neuron cores and 6 low-power x86 cores; it can evaluate up to 1 million neurons and 120 million synapses. The Loihi chips support online learning and INT8 precision. Loihi 1 can deliver 30 GSOPS at 15 pJ per synaptic operation [218]. Both Loihi 1 and Loihi 2 consume similar amounts of power (110 mW and 100 mW, respectively [219]); however, the Loihi 2 outperforms the Loihi 1 by 10 times. The chips can be programmed through several frameworks, including Nengo, NxSDK, and Lava [150]. The latter, developed by Intel, is being pushed as the primary platform for programming the Loihi 2.
IMEC developed a RISC-V-based digital neuromorphic processor in 22 nm process technology in 2022 [220]. They implemented an optimized BF16 processing pipeline inside the neural processing engine; the computation also supports INT4 and INT8 precision. They used a three-layer memory organization to reduce the chip area.
Koniku combines biological machines with silicon devices to design a microelectrode-array system core [12]. The company is developing hardware and algorithms that mimic the smell sensory receptors found in some animal noses; however, detailed device parameters are not publicly available. The device is mainly used in security, agriculture, and safe flight operations [221].
iii. PIM Processor
PIM processors are becoming an alternative for AI applications due to their low latency, high energy efficiency, and reduced memory-transfer requirements. PIMs are analog, in-place computing architectures, which reduces the burden of additional storage modules, although some digital PIM designs also exist. Figure 2 presents the schematic representation of a common PIM computing architecture. It consists of an NxM crossbar array of storage devices, which acts as both the weight storage and an analog multiplier. The storage devices can be SRAM, RRAM, PCM, STT-MRAM, or flash memory cells. The computing array is equipped with peripheral circuits: data converters (ADCs and DACs), sensing circuits, and a write circuit for the crossbar. Some PIM processors are discussed in this section.
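A minimal functional sketch of the crossbar computation just described may help: weights are stored as device conductances, inputs are applied as row voltages, and each column current is a dot product (Ohm's law plus Kirchhoff's current law), digitized by an ADC. This is an idealized illustration, not a model of any specific product; all sizes and value ranges below are arbitrary assumptions.

```python
import numpy as np

# Idealized analog crossbar matrix-vector multiply (illustrative only).
rng = np.random.default_rng(0)
N, M = 64, 32                       # crossbar rows (inputs) x columns (outputs)
G = rng.uniform(0, 1e-6, (N, M))    # device conductances encoding weights (siemens)
V = rng.uniform(0, 0.2, N)          # input activations applied as row voltages (volts)

I = G.T @ V                         # column currents: one analog MAC per device

def adc(x, bits=8):
    """Idealized ADC: quantize column currents to 2**bits levels."""
    levels = 2**bits - 1
    step = x.max() / levels
    return np.round(x / step) * step

y = adc(I)                          # digitized outputs from the peripheral circuits
```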
Imec and GlobalFoundries have developed DIANA, a processor that includes both digital and analog cores for DNN processing. The digital core is employed for widely parallel computation, whereas the analog in-memory computing (AiMC) core enables much higher energy efficiency and throughput. The AiMC core uses a 6T-SRAM array of size 1152x512. Imec developed the architecture, while the chip is fabricated using GlobalFoundries' 22FDX process [222]. It is targeted at a wide range of edge applications, from smart speakers to self-driving vehicles. The analog (AiMC) core computes at 29.5 TOPS, while the digital core computes at 0.14 TOPS. In isolation, the digital and analog components have efficiencies of 4.1 TOPS/W and 410 TOPS/W, respectively; the overall system energy efficiency of DIANA is 14.4 TOPS/W on CIFAR-10 [223].
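One way to see how the isolated efficiencies could blend into the lower system figure is an energy-weighted harmonic mean over the workload split; this model is our illustration, not something reported for DIANA.

```python
# Illustrative model (our assumption): if a fraction f of operations runs on
# the analog core and the rest on the digital core, the blended efficiency is
# the energy-weighted harmonic mean of the two isolated efficiencies.

E_ANALOG = 410.0   # TOPS/W in isolation, from the text
E_DIGITAL = 4.1    # TOPS/W in isolation, from the text

def system_efficiency(f_analog: float) -> float:
    """Blended TOPS/W when f_analog of the ops run on the AiMC core."""
    return 1.0 / (f_analog / E_ANALOG + (1.0 - f_analog) / E_DIGITAL)

# Roughly 72% of ops on the analog core reproduces the reported ~14.4 TOPS/W:
print(system_efficiency(0.72))  # ~14.3 TOPS/W
```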
Gyrfalcon has developed multiple PIM processors, including the Lightspeeur 5801, 2801, 2802, and 2803 [24]. The architecture uses digital AI processing-in-memory units that compute the series of matrices in a CNN. The Lightspeeur 5801 delivers 2.8 TOPS at 224 mW, an efficiency of 12.6 TOPS/W. The Lightspeeur 2803S is their latest PIM processor for advanced edge, desktop, and data center deployments [19]; each 2803S chip performs 16.8 TOPS while consuming 0.7 W of power, giving an efficiency of 24 TOPS/W. The Lightspeeur 2801 can compute 5.6 TOPS with an energy efficiency of 9.3 TOPS/W. Gyrfalcon's latest processor, the Lightspeeur 2802, uses TSMC's magnetoresistive random-access memory technology and exhibits an energy efficiency of 9.9 TOPS/W. The Janux GS31 is an edge inference server built with 128 Lightspeeur 2803S chips [225]; it can perform 2150 TOPS while consuming 650 W.
Mythic has announced its new analog matrix processor, the M1076 [18]. This latest version of Mythic's PIM processor reduces its size by combining 76 analog computing tiles, whereas the original M1108 uses 108 tiles. The smaller size makes it easier to integrate into edge devices. The processor stores 79.69M on-chip weights in its flash-memory array and uses 19,456 ADCs for parallel processing; no external DRAM is required. DNN models are quantized from FP32 to INT8 and retrained for Mythic's analog compute engine. A single M1076 chip can deliver up to 25 TOPS while consuming 3 W of power [90]. The system can be scaled up to 400 TOPS by combining 16 M1076 chips, which requires 75 W [88,89].
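A minimal sketch of the FP32-to-INT8 quantization step mentioned above, using generic symmetric per-tensor quantization; Mythic's actual compute engine and retraining flow are proprietary and not modeled here.

```python
import numpy as np

def quantize_int8(w_fp32: np.ndarray):
    """Symmetric per-tensor quantization of FP32 weights to INT8."""
    scale = np.abs(w_fp32).max() / 127.0          # map the largest weight to 127
    w_int8 = np.clip(np.round(w_fp32 / scale), -128, 127).astype(np.int8)
    return w_int8, scale

def dequantize(w_int8: np.ndarray, scale: float) -> np.ndarray:
    return w_int8.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)  # hypothetical FP32 layer weights
w_q, s = quantize_int8(w)
err = np.abs(w - dequantize(w_q, s)).max()        # error that retraining must recover
```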
Samsung has announced its HBM-PIM machine-learning-enabled memory system with a PIM architecture [16]. This is the first successful integration of a PIM architecture into high-bandwidth memory. The technology incorporates an AI processing function into Samsung HBM2 Aquabolt to speed up high-speed data processing in supercomputers; the system delivered 2.5x the performance with 60% lower energy consumption than the earlier HBM1 [16]. Samsung's LPDDR5-PIM memory technology is targeted at bringing AI capability to mobile devices without connecting to the data center [226]. The HBM-PIM architecture differs from the traditional analog PIM architecture outlined in Figure 2: it does not require data-conversion and sensing circuits, as the actual computation takes place in a near-memory computing module in the digital domain. Instead, it places a GPU surrounded by HBM stacks to realize parallel processing and minimize data movement [227]. In this respect, it is similar to a dataflow processor.
Syntiant has developed a line of flash-memory-array-based edge inference processors, including the NDP10x, NDP120, and NDP200 [228]. Syntiant's PIM architecture is very energy efficient and is paired with an edge-optimized training pipeline. A Cortex-M0 embedded in the system runs the NDP firmware. The NDP10x processors can hold 560k weights at INT4 precision and perform MAC operations with INT8 activations. The training pipeline can build neural networks for various applications according to specification, optimizing latency, memory size, and power consumption [228]. Syntiant has released five versions of its application processors. The NDP100, its first AI processor, updated in 2020, has a tiny 2.52 mm2 footprint and ultra-low power consumption of less than 140 µW [229]. Syntiant followed with the NDP101, NDP102, NDP120, and NDP200 [230,231,232]. The application domains are mainly smartphones, wearables and hearables, remote controls, and IoT endpoints. The neural computations support INT1, INT2, INT4, and INT8 precision. The energy efficiency of the NDP10x series, which includes the NDP100, NDP101, and NDP102, is 2 TOPS/W [233]. The NDP120 [235] and NDP200 exhibit 1.9 GOPS and 6.4 GOPS [232], respectively.
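A functional sketch of the INT4-weight, INT8-activation MAC scheme described above; this is an illustration only (Syntiant's hardware internals are not public), and the sizes are arbitrary assumptions.

```python
import numpy as np

# Illustrative INT4 x INT8 MAC: low-bit weights multiplied by 8-bit
# activations, accumulated in a wider integer to avoid overflow.
rng = np.random.default_rng(1)
weights = rng.integers(-8, 8, size=(64, 128), dtype=np.int8)     # INT4 range [-8, 7]
activations = rng.integers(-128, 128, size=128, dtype=np.int8)   # INT8 activations

# Accumulate in INT32 so the 128-term dot products cannot overflow
out = weights.astype(np.int32) @ activations.astype(np.int32)
```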
Untether has developed its PIM AI accelerator card, TsunAImi [236], for inference in the data center or at the server. At the heart of the TsunAImi are four runAI200 chips, fabricated by TSMC and built from standard SRAM arrays. Each runAI200 chip features 511 cores and 192 MB of SRAM. The runAI200 computes in INT8 precision and performs 502 TOPS at 8 TOPS/W, which is 3x more than NVIDIA's Ampere A100 GPU. The resulting performance of the TsunAImi system is 2008 TOPS at 400 W [237].
UPMEM innovatively placed thousands of DRAM processing units (DPUs) within DRAM memory chips [238]. The DPUs are controlled by high-level applications running on the main CPU. Each DIMM consists of 16 PIM-enabled chips, and each chip has 8 DPUs, for a total of 128 DPUs per UPMEM DIMM [239]. The system is massively parallel: up to 2560 DPUs can be assembled into a single server with 256 GB of PIM DRAM. The computing power is 15x that of an x86 server using the main CPU alone. The benchmarked throughput for INT32 addition is 58.3 MOPS/DPU [240]. This system is suitable for DNA and protein sequence alignment, genome comparison, phylogenetics, metagenomic analysis, and more [241].
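The per-DPU benchmark scales directly to the aggregate figure listed in Table 2, as the following arithmetic check shows:

```python
# Aggregate throughput implied by the per-DPU benchmark above,
# for a fully populated 2560-DPU server.

MOPS_PER_DPU = 58.3
DPUS = 2560

total_gops = MOPS_PER_DPU * DPUS / 1e3
print(f"{total_gops:.0f} GOPS")  # ~149 GOPS, i.e. the ~0.149 TOPS listed in Table 2
```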