Listing 1: A Sample Measured Record on A64FX
Gaurav Verma is a Ph.D. student in the Department of Computer Science at Stony Brook University, under the guidance of Prof. Barbara Chapman. His research broadly focuses on compiler optimizations, scaling Deep Learning models on heterogeneous hardware, and high-performance computing. Additionally, he works on the development of performance analysis and benchmarking frameworks designed for DNNs. He obtained a Master’s degree in Computer Science from SUNY Stony Brook and holds a Bachelor’s degree in Computer Engineering from the National Institute of Technology, Surathkal, Karnataka, India.
Dr. Siddhisanket Raskar is a postdoctoral researcher at Argonne National Laboratory, where he works on evaluating the efficacy of AI architectures for scientific machine learning and on the design of next-generation AI architectures for science. He obtained his Ph.D. on the topic of “Dataflow Software Pipelining for Codelet Model using hardware-software co-design” from the University of Delaware. He also holds a Master’s degree in Computer Science from the University of Delaware and a Bachelor’s degree in Computer Engineering from the University of Pune, India. His research interests include high-performance computing and dataflow models, computer architecture and systems, compiler technology and runtime systems, and program mapping and optimization under dataflow models.
Dr. Murali Emani is a Computer Scientist in the Data Sciences group within the Argonne Leadership Computing Facility (ALCF) at Argonne National Laboratory. Previously, he was a Postdoctoral Research Staff Member at Lawrence Livermore National Laboratory, US. Murali obtained his PhD and worked as a Research Associate at the Institute for Computing Systems Architecture in the School of Informatics, University of Edinburgh, UK. His primary research focus lies at the intersection of systems and machine learning. His research interests include parallel programming models, hardware accelerators for ML/DL, high-performance computing, scalable machine learning, runtime systems, performance optimization, emerging HPC architectures, and online adaptation. He co-chairs the MLPerf HPC group at MLCommons to benchmark scientific machine learning applications on supercomputers.
Dr. Barbara Chapman is a Professor of Applied Mathematics and Statistics, and of Computer Science, at Stony Brook University, where she is affiliated with the Institute for Advanced Computational Science. Dr. Chapman has performed research on parallel programming interfaces and the related implementation technology for over 20 years, and has also engaged in efforts to develop community standards for parallel programming, including OpenMP, OpenACC, and OpenSHMEM. Dr. Chapman has co-authored over 200 papers and two books. She obtained a B.Sc. with First Class Honours in Mathematics from the University of Canterbury and a Ph.D. in Computer Science from Queen’s University Belfast.
Hardware Platform | Processor | Remarks
---|---|---
Intel Platinum 8272CL @ 2.60GHz | CPU | 16 cores, AVX-512 |
AMD EPYC 7452 @ 2.35GHz | CPU | 4 cores, AVX-2 |
ARM Graviton2 | CPU | 16 cores, Neon |
NVIDIA Tesla T4 | GPU | Turing Architecture |
NVIDIA GeForce RTX 2080 | GPU | Turing Architecture |
NVIDIA A100 | GPU | Ampere Architecture |
NVIDIA A40 | GPU | Ampere Architecture |
NVIDIA H100 | GPU | Hopper Architecture |
Intel Gold 5115 @ 2.40GHz | CPU | 40 cores, Xeon |
ARM A64FX | CPU | 48 cores, aarch64 |
Hardware Parameter | Definition | Hardware Class | Value
---|---|---|---
cache_line_bytes | size in bytes of the memory chunks handled by the cache | CPU; GPU | 64
max_local_memory_per_block | maximum local memory per block, in bytes | GPU | 2147483647
max_shared_memory_per_block | maximum shared memory per block, in bytes | GPU | 49152
max_threads_per_block | maximum number of threads per block | GPU | 1024
max_vthread_extent | maximum extent of virtual threading | GPU | 8
num_cores | number of cores in the compute hardware | CPU | 24
vector_unit_bytes | width of the vector units, in bytes | CPU; GPU | 64; 16
warp_size | number of threads in a warp | GPU | 32
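These parameters mirror the device description consumed by TVM's auto-scheduler. As a rough illustration, and assuming TVM's `auto_scheduler.HardwareParams` API (constructor arguments and which of them are optional vary across TVM releases), the values in the table could be supplied as follows:

```python
# Sketch only: passing the hardware parameters from the table above to TVM's
# auto-scheduler. Assumes tvm.auto_scheduler.HardwareParams; argument names and
# optionality differ between TVM releases.
import tvm
from tvm import auto_scheduler

# CPU target (num_cores = 24 and vector_unit_bytes = 64, as listed above).
cpu_params = auto_scheduler.HardwareParams(
    num_cores=24,
    vector_unit_bytes=64,
    cache_line_bytes=64,
    target=tvm.target.Target("llvm -mcpu=skylake-avx512"),
)

# GPU target (shared/local memory, thread, vthread, and warp limits from the table).
gpu_params = auto_scheduler.HardwareParams(
    vector_unit_bytes=16,
    cache_line_bytes=64,
    max_shared_memory_per_block=49152,
    max_local_memory_per_block=2147483647,
    max_threads_per_block=1024,
    max_vthread_extent=8,
    warp_size=32,
    target=tvm.target.Target("cuda"),
)

# The parameters are then handed to task extraction / tuning, e.g.:
# tasks, weights = auto_scheduler.extract_tasks(mod, params, target,
#                                               hardware_params=gpu_params)
```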
Sampled Kernel | #Kernel_Shapes (CPU) | #Kernel_Shapes (GPU) | Max GFLOPs (CPU) | Max GFLOPs (GPU) | Tensor Shape | Mean Exec. Time on EPYC-7452 (ms) | Mean Exec. Time on Graviton2 (ms) | Mean Exec. Time on Platinum-8272 (ms) | Mean Exec. Time on T4 (ms)
---|---|---|---|---|---|---|---|---|---
T_add | 229 | 388 | 8.59 | 8.59 | [4, 256, 1024] | 180.97 | 81.25 | 92.86 | 4.31
Conv2dOutput | 60 | 27 | 1.20 | 1.07 | [4, 64, 64, 32] | 40.94 | 14.21 | 19.11 | 2.07
T_divide | 24 | 69 | 0.003 | 0.003 | [8, 1, 1, 960] | 0.07 | 0.05 | 0.11 | 0.10
T_fast_tanh | 9 | 9 | 0.008 | 0.008 | [4, 1024] | 0.43 | 0.43 | 0.53 | 0.97
T_multiply | 105 | 150 | 8.92 | 8.92 | [4, 256, 4096] | 320.74 | 48.08 | 95.65 | 0.55
T_relu | 300 | 1257 | 73.46 | 73.46 | [4, 144, 72, 8, 64] | 0.52 | 5.70 | 0.72 | 0.23
T_softmax_norm | 27 | 27 | 0.016 | 0.016 | [4, 16, 256, 256] | 1.01 | 2.78 | 4.08 | 0.19
T_tanh | 9 | 9 | 0.905 | 0.629 | [8, 96, 96, 3] | 5.55 | 33.48 | 50.55 | 0.16
conv2d_winograd | 0 | 33 | NA | 0.868 | NA | NA | NA | NA | 0.93
Hyperparameter | Value
---|---
Batch Size | 16, 32, 64, 256, 512
Epochs | 100, 200, 400
Learning Rate | 1e-4
Attention Heads (fine-tuning) | 6
#Unrolling Steps for Attention Head | 2
Optimizer | Adam
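For illustration only, the sketch below wires these hyperparameters into a generic PyTorch fine-tuning loop. The `CostModel` class, the feature width, and the synthetic data are hypothetical stand-ins rather than the paper's implementation; only the batch sizes, epoch counts, learning rate, head count, and Adam optimizer come from the table above.

```python
# Hypothetical sketch: fine-tuning an attention-based cost model with the
# hyperparameters listed above. Model, feature width, and data are made up.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

BATCH_SIZE = 64        # swept over {16, 32, 64, 256, 512}
EPOCHS = 200           # swept over {100, 200, 400}
LEARNING_RATE = 1e-4   # from the table
NUM_HEADS = 6          # attention heads used for fine-tuning
FEAT_DIM = 192         # arbitrary; chosen to be divisible by NUM_HEADS

class CostModel(nn.Module):
    """Toy attention-based regressor mapping schedule features to a score."""
    def __init__(self, feat_dim=FEAT_DIM, num_heads=NUM_HEADS):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, x):                       # x: (batch, seq_len, feat_dim)
        out, _ = self.attn(x, x, x)
        return self.head(out.mean(dim=1)).squeeze(-1)

# Synthetic schedule features and throughput labels, for illustration only.
features = torch.randn(1024, 25, FEAT_DIM)
labels = torch.randn(1024)
loader = DataLoader(TensorDataset(features, labels), batch_size=BATCH_SIZE, shuffle=True)

model = CostModel()
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
loss_fn = nn.MSELoss()

for epoch in range(EPOCHS):
    for x, y in loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
```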
Primitive | Meaning |
---|---
AN | Annotation Step |
FU | Fuse Step |
PR | Pragma Step |
RE | Reorder Step |
SP | Split Step |
FSP | Follow Split Step |
FFSP | Follow Fused Split Step |
SA | Storage Align Step |
CA | Compute At Step |
CI | Compute Inline Step
CR | Compute Root Step |
CHR | Cache Read Step |
CHW | Cache Write Step |
RF | Rfactor Step |
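These abbreviations denote the transform steps that make up a measured schedule, and the sequence-length statistics in the following table count how many such steps each record contains. Below is a small sketch of that counting, with the abbreviations mapped to the (assumed) TVM auto-scheduler step names; the example sequences are made up for illustration.

```python
# Sketch: given transform-step sequences expressed with the primitive
# abbreviations above, compute the distribution of sequence lengths
# (as reported for H100 and A64FX in the next table).
from collections import Counter

PRIMITIVES = {
    "AN": "AnnotationStep", "FU": "FuseStep", "PR": "PragmaStep",
    "RE": "ReorderStep", "SP": "SplitStep", "FSP": "FollowSplitStep",
    "FFSP": "FollowFusedSplitStep", "SA": "StorageAlignStep",
    "CA": "ComputeAtStep", "CI": "ComputeInlineStep",
    "CR": "ComputeRootStep", "CHR": "CacheReadStep",
    "CHW": "CacheWriteStep", "RF": "RfactorStep",
}

def length_distribution(sequences):
    """Return {sequence_length: percentage of records} for a list of step sequences."""
    counts = Counter(len(seq) for seq in sequences)
    total = sum(counts.values())
    return {length: 100.0 * n / total for length, n in sorted(counts.items())}

# Hypothetical records: each is the ordered list of primitives in one schedule.
records = [
    ["SP", "SP", "RE", "FU", "AN", "CHW", "CA", "AN"],
    ["SP", "RE", "FU", "AN", "AN"],
    ["SP", "SP", "RE", "FU", "AN", "CHW", "CA", "AN"],
]
print(length_distribution(records))   # e.g. {5: 33.3..., 8: 66.6...}
```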
Sequence Length (H100) | Total Occurrence on H100 (%) | Sequence Length (A64FX) | Total Occurrence on A64FX (%)
---|---|---|---
37 | 46.41 | 21 | 20.89
36 | 12.59 | 20 | 20.21
39 | 5.56 | 16 | 11.06
38 | 5.00 | 17 | 8.91
32 | 4.62 | 19 | 7.29
Target Hardware | Dataset | Size | XGBoost within_task (sec) | XGBoost by_task (sec) | XGBoost by_target (sec) | MLP within_task (sec) | MLP by_task (sec) | MLP by_target (sec) | LightGBM within_task (sec) | LightGBM by_task (sec) | LightGBM by_target (sec)
---|---|---|---|---|---|---|---|---|---|---|---
GPU | Baseline | 16 GB | 1504 | 1440 | 454 | 3000 | 2434 | 3150 | 1574 | 780 | 4680
GPU | Sampled | 9 GB | 1406 | 1169 | 339 | 1968 | 1655 | 2464 | 1175 | 595 | 3637
CPU | Baseline | 11 GB | 1490 | 1265 | 428 | 3143 | 2623 | 2043 | 1131 | 636 | 3946
CPU | Sampled | 6.8 GB | 905 | 780 | 354 | 2091 | 1672 | 1270 | 489 | 387 | 2435
Target Hardware | Baseline Dataset, W/o Transfer Tuning | Baseline Dataset, W/ Transfer Tuning | Sampled Dataset, W/o Transfer Tuning | Sampled Dataset, W/ Transfer Tuning
---|---|---|---|---
A64FX (CPU) | 66.81 | 149.5 | 58.7 | 112.43
Xeon (CPU) | 91.34 | 282.2 | 85.22 | 189.25
A40 (GPU) | 627 | 416 | 599 | 175
A100 (GPU) | 578 | 391 | 585 | 400
H100 (GPU) | 128.12 | 67.30 | 93.42 | 54.25
RTX2080 (GPU) | 18.67 | 27.68 | 17.37 | 841.74
Target Hardware | Network | W/o Transfer Tuning: Time-to-Tune | W/o Transfer Tuning: Mean Inference Time | W/ Transfer Tuning: Time-to-Tune | W/ Transfer Tuning: Mean Inference Time
---|---|---|---|---|---
CPU | Inception_v3 | 614 | 75.27 | 61 | 73.80
CPU | MobileNet_v3 | 236 | 5.48 | 71 | 5.57
CPU | ResNet_50 | 128 | 11.93 | 86 | 12.12
GPU | Inception_v3 | 2510 | 28.72 | 191 | 28.73
GPU | MobileNet_v3 | 1092 | 1.72 | 136 | 1.75
GPU | ResNet_50 | 817 | 3.79 | 226 | 3.78
Target Hardware | TenSet XGB top-1 (%) | TenSet XGB top-5 (%) | Our Tuner top-1 (%) | Our Tuner top-5 (%)
---|---|---|---|---
H100 | 83.94 | 95.81 | 85.67 | 96.08
A64FX | 72.6 | 92.49 | 77.04 | 91.79