### Summary: SIMD Everywhere Optimization from ARM NEON to RISC-V Vector Extensions (arxiv.org)

4,695 words - PDF document

### One Line

Migrating ARM NEON Intrinsics code to RISC-V Vector Extensions (RVV) with an enhanced SIMDe yields speedups of 1.51x to 5.13x over the original SIMDe on the Google XNNPACK library.


### Key Points

- The paper discusses the migration of performance-critical legacy code from ARM NEON Intrinsics to RISC-V Vector Extensions (RVV).
- The authors propose the open-source tool "SIMD Everywhere" (SIMDe) to automate the migration process.
- They enhance SIMDe to convert ARM NEON Intrinsics types and functions to their corresponding RVV Intrinsics types and functions.
- The enhanced SIMDe achieves speedups ranging from 1.51x to 5.13x compared to the original SIMDe.
- The authors validate their implementation using the unit tests within SIMDe, run on the Spike simulator.
- They conduct benchmark experiments using the Google XNNPACK library.
- This migration strategy has the potential to drive significant enhancements across a range of applications in the Android ecosystem.

### Summaries

### 22 word summary

ARM NEON Intrinsics code is migrated to RISC-V Vector Extensions using SIMDe, achieving speedups of 1.51x to 5.13x over the original SIMDe on the Google XNNPACK library.

### 77 word summary

This paper explores the migration of ARM NEON Intrinsics code to RISC-V Vector Extensions (RVV) using the open-source tool "SIMD Everywhere" (SIMDe). The authors enhance SIMDe to convert ARM NEON Intrinsics types and functions to their RVV counterparts. In experiments with the Google XNNPACK library, they achieve speedups ranging from 1.51x to 5.13x compared to the original SIMDe. The paper provides background information, migration strategies, and experimental results, highlighting improved performance and potential enhancements for Android applications.

### 152 word summary

This paper discusses the migration of performance-critical legacy code from ARM NEON Intrinsics to RISC-V Vector Extensions (RVV). The authors propose using the open-source tool "SIMD Everywhere" (SIMDe) to automate the migration process. They enhance SIMDe to convert ARM NEON Intrinsics types and functions to their corresponding RVV Intrinsics types and functions. The authors conduct experiments with the Google XNNPACK library to evaluate their enhanced SIMDe, finding that it achieves speedups ranging from 1.51x to 5.13x compared to the original SIMDe. The paper provides background on Arm NEON and the RISC-V Vector Extension, details strategies for migrating ARM NEON Intrinsics to RVV, explains how to use SIMDe for code porting, and presents experimental results comparing the native SIMDe with the RVV-enhanced SIMDe. The authors successfully automate the migration process, achieving improved performance and potential enhancements across a range of applications in the Android ecosystem.

### 390 word summary

This paper discusses the migration of performance-critical legacy code from ARM NEON Intrinsics to RISC-V Vector Extensions (RVV). Many libraries, such as OpenCV, FFmpeg, XNNPACK, and Eigen, use Arm or x86 SIMD Intrinsics to optimize programs for performance. With the emergence of RVV, there is a need to migrate these libraries and legacy codebases for improved performance on RISC-V platforms.

To automate the migration process, the authors propose using the open-source tool "SIMD Everywhere" (SIMDe). They enhance SIMDe to convert ARM NEON Intrinsics types and functions to their corresponding RVV Intrinsics types and functions. They devise strategies for converting NEON Intrinsics types to RVV Intrinsics types that account for vector-length-agnostic (VLA) architectures, and they develop customized conversions for each function based on the results of RVV code generation.

The authors conduct experiments with the Google XNNPACK library to evaluate the performance of their enhanced SIMDe, finding that it achieves speedups ranging from 1.51x to 5.13x compared to the original SIMDe.

The remainder of the paper is organized as follows. In Section 2, the authors provide background information on Arm NEON and RISC-V Vector Extensions. In Section 3.1, they introduce the SIMD Everywhere design pattern for intrinsics function and type conversion. They detail their strategies for leveraging SIMD Everywhere to migrate ARM NEON Intrinsics to RISC-V Vector Extensions in Sections 3.2 and 3.3. They explain how to use SIMDe for code porting in Section 3.4. Finally, in Section 4, they present the experimental results, comparing the native SIMDe with their RVV-enhanced SIMDe.

The authors validate their implementation using the unit tests within SIMDe, run on the Spike simulator. They also conduct benchmark experiments using XNNPACK, choosing 10 commonly used neural network computation functions implemented with NEON Intrinsics and transforming them into RVV Intrinsics using SIMDe. The performance of the converted code is evaluated using dynamic instruction count as the metric.

The experimental results show that the RVV-enhanced SIMDe achieves significant speedup compared to the original SIMDe, ranging from 1.51x to 5.13x across the tested functions.

In conclusion, the authors successfully automate the migration process from ARM NEON to RISC-V Vector Extensions using SIMDe, achieving improved performance. This migration strategy has the potential to drive significant enhancements across a range of applications in the Android ecosystem.

### 454 word summary

SIMD Everywhere Optimization from ARM NEON to RISC-V Vector Extensions

The paper discusses the migration of performance-critical legacy code from ARM NEON Intrinsics to RISC-V Vector Extensions (RVV). Many libraries, such as OpenCV, FFmpeg, XNNPACK, and Eigen, use Arm or x86 SIMD Intrinsics to optimize programs for performance. With the emergence of RVV, there is a need to migrate these libraries and legacy codebases for improved performance on RISC-V platforms. The migration currently requires manual rewriting, which is time-consuming and error-prone.

To address this issue, the authors propose the open-source tool "SIMD Everywhere" (SIMDe) to automate the migration process. They enhance SIMDe to convert ARM NEON Intrinsics types and functions to their corresponding RVV Intrinsics types and functions. For type conversion, they devise strategies to convert NEON Intrinsics types to RVV Intrinsics types while accounting for vector-length-agnostic (VLA) architectures. They also analyze the conversion methods commonly used in SIMDe and develop customized conversions for each function based on the results of RVV code generation.

The authors conduct experiments with the Google XNNPACK library to evaluate the performance of their enhanced SIMDe. They compare it with the original SIMDe, which does not utilize customized RVV implementations for the conversions. The enhanced SIMDe achieves speedup ranging from 1.51x to 5.13x compared to the original SIMDe.

The remainder of the paper is organized as follows. In Section 2, the authors provide background information on Arm NEON and RISC-V Vector Extensions. In Section 3.1, they introduce the SIMD Everywhere design pattern for intrinsics function and type conversion. They detail their strategies for leveraging SIMD Everywhere to migrate ARM NEON Intrinsics to RISC-V Vector Extensions in Sections 3.2 and 3.3. They explain how to use SIMDe for code porting in Section 3.4. Finally, in Section 4, they present the experimental results, comparing the native SIMDe with their RVV-enhanced SIMDe.

The authors validate their implementation using the unit tests within SIMDe, run on the Spike simulator. They also conduct benchmark experiments using XNNPACK, choosing 10 commonly used neural network computation functions implemented with NEON Intrinsics and transforming them into RVV Intrinsics using SIMDe. The performance of the converted code is evaluated using dynamic instruction count as the metric.

The experimental results show that the RVV-enhanced SIMDe achieves significant speedup compared to the original SIMDe. The speedup ranges from 1.51x to 5.13x across the tested functions.

In conclusion, the authors successfully automate the migration process from ARM NEON to RISC-V Vector Extensions using SIMDe. Their enhanced SIMDe achieves improved performance compared to the original SIMDe when converting NEON code to RVV code. This migration strategy has the potential to drive significant enhancements across a range of applications in the Android ecosystem.

### Raw indexed text (29,747 chars / 4,695 words / 764 lines)

```asm
vle32.v v9,(a2)      // vb = vld1q_s32(B);
vadd.vv v8,v8,v9     // va = vaddq_s32(va, vb);
vse32.v v8,(a1)      // vst1q_s32(A, va);
lw      a1,1800(a0)
lui     a0,0x1c
add     a0,a0,2000   # 1c7d0 <__clzdi2+0x42>
jal     1034c <printf>
li      a0,0
ld      ra,8(sp)
add     sp,sp,16
ret
```

Listing 10. RVV vector addition
different scenarios. We reused the unit tests within SIMDe and validated them using the Spike simulator.

### 4.2 Benchmark Experiments

We use XNNPACK as our benchmark. XNNPACK is an open-source software library developed by the Google team that aims to optimize neural network computations across different hardware platforms. Within XNNPACK, various neural network computation functions are implemented using NEON Intrinsics.

Experiments were conducted to compare native SIMDe with our implementation. We chose 10 commonly used neural network computation functions that are implemented with NEON Intrinsics in XNNPACK:

- Gemm: a high-performance function for general matrix multiplication.
- Conv-hwc: a convolution function for input data arranged in the Height-Width-Channel format.
- Dwconv: optimized for depthwise separable convolution with reduced parameters and computation.
- Maxpool: extracts the maximum value from each region during pooling.
- Argmaxpool: performs max pooling while returning the index of the maximum value.
- Vrelu: applies the ReLU activation function element-wise.
- Vsqrt: calculates the square root of each element in the input vector.
- Vtanh: applies the hyperbolic tangent activation function element-wise.
- Vsigmoid: applies the sigmoid activation function element-wise.
- Ibilinear: a high-performance function specialized in bilinear interpolation.

The SIMDe header file was included in these functions, and the code was compiled with the Clang compiler at the O3 optimization level. During the preprocessing stage, the functions accelerated with NEON Intrinsics were transformed into RVV Intrinsics. After compilation, we ran the executable on the Spike simulator to verify correctness and collect instruction counts. Since Spike is a functional model rather than a cycle-accurate simulator, we employed dynamic instruction count as the performance metric. In this experiment, we used Spike 1.1.1 and Clang 17.0.0 (commit hash 5326c9e480d70e16c2504cb5143524aff3ee2605).

Figure 2 illustrates our experimental results: our SIMDe achieved speedups ranging from 1.51x to 5.13x compared to the original SIMDe, which did not utilize customized RVV implementations for the conversions. The original flow uses Clang vector attributes for computations, or auto-vectorization of the scalar implementation, and then relies on the LLVM backend for RVV; this Clang-vector-attribute and auto-vectorization flow serves as the baseline solution.
### 5 Conclusion

In this paper, we use the open-source tool "SIMD Everywhere" (SIMDe) to automate the migration from Neon code to RVV code. Our primary task is to enhance SIMDe to enable the conversion of ARM NEON Intrinsics types and functions to their corresponding RVV Intrinsics types and functions. In our experiments with the Google XNNPACK library, our RVV-enhanced SIMDe achieves speedups ranging from 1.51x to 5.13x compared to the original SIMDe, which does not utilize customized RVV implementations for the conversions.
*Figure 2. RVV-enhanced SIMDe Performance Comparison*