Summary of SIMD Optimization for ARM and RISC-V Vector Extensions

One Line

Enhancing SIMDe with customized RVV conversions automates the migration of ARM NEON Intrinsics code to RISC-V Vector Extensions and yields a 1.51x to 5.13x speedup over the original SIMDe on functions from the Google XNNPACK library.

Slides

Slide Presentation (12 slides)

SIMD Everywhere Optimization from ARM NEON to RISC-V Vector Extensions

Source: arxiv.org - PDF - 4,695 words

Introduction

• Migrating legacy code from ARM NEON Intrinsics to RISC-V Vector Extensions (RVV)

• Need for migration to improve performance on RISC-V platforms

• Manual rewriting is time-consuming and error-prone

Automating the Migration Process

• Use of open-source tool "SIMD Everywhere" (SIMDe)

• Conversion of ARM NEON Intrinsics types and functions to RVV Intrinsics

• Strategies for vector length agnostic (vla) architectures

Customized Conversions for Each Function

• Analyzing commonly used conversion methods in SIMDe

• Developing customized conversions based on RVV code generations

• Enhancing SIMDe for improved performance

Performance Evaluation with Google XNNPACK Library

• Comparison of enhanced SIMDe with original SIMDe

• Speedup ranging from 1.51x to 5.13x

Background on ARM NEON and RISC-V Vector Extensions

• Understanding the technologies involved

• Importance of migrating to RVV for improved performance

SIMD Everywhere Design Pattern for Intrinsics Function and Type Conversion

• Exploring the design pattern used in SIMDe

• Leveraging SIMD Everywhere for migration process

Strategies for Migrating ARM NEON Intrinsics to RVV Vector Extensions

• Detailed strategies for successful migration

• Ensuring compatibility and optimization

Using SIMDe for Code Porting

• Step-by-step guide on utilizing SIMDe for migration

• Simplifying the process and reducing errors

Validation and Benchmark Experiments

• Unit tests within SIMDe and Spike simulator

• Benchmark experiments on XNNPACK

Significant Speedup Achieved with RVV-enhanced SIMDe

• Speedup ranging from 1.51x to 5.13x across tested functions

• Improved performance compared to original SIMDe

Potential Enhancements in the Android Ecosystem

• Migration strategy driving significant enhancements

• Range of applications benefiting from the ARM NEON to RVV migration

• Reminder of the main message: Improved performance through automated migration using SIMDe

Key Points

The paper discusses the migration of performance legacy codes from ARM NEON Intrinsics to RISC-V Vector Extensions (RVV).
The authors propose the use of the open-source tool "SIMD Everywhere" (SIMDe) to automate the migration process.
They enhance SIMDe to enable the conversion of ARM NEON Intrinsics types and functions to their corresponding RVV Intrinsics types and functions.
The enhanced SIMDe achieves a speedup of 1.51x to 5.13x over the original SIMDe, which does not use customized RVV implementations for the conversions.
The authors validate their implementation using the unit tests within SIMDe and the Spike simulator.
They benchmark 10 commonly used neural network computation functions from XNNPACK, using dynamic instruction count as the metric.
This migration strategy has the potential to drive significant enhancements across a range of applications in the Android ecosystem.

Summaries

22 word summary

ARM NEON Intrinsics codes are migrated to RISC-V Vector Extensions using SIMDe, achieving speedup of 1.51x to 5.13x in Google XNNPACK library.

77 word summary

This paper explores the migration of ARM NEON Intrinsics codes to RISC-V Vector Extensions (RVV) using the open-source tool "SIMD Everywhere" (SIMDe). The authors enhance SIMDe to convert ARM NEON Intrinsics types and functions to their RVV counterparts. Through experiments with the Google XNNPACK library, they achieve speedup ranging from 1.51x to 5.13x compared to the original SIMDe. The paper provides background information, migration strategies, and experimental results, highlighting improved performance and potential enhancements for Android applications.

152 word summary

This paper discusses the migration of performance legacy codes from ARM NEON Intrinsics to RISC-V Vector Extensions (RVV). The authors propose using the open-source tool "SIMD Everywhere" (SIMDe) to automate the migration process. They enhance SIMDe to enable the conversion of ARM NEON Intrinsics types and functions to their corresponding RVV Intrinsics types and functions. The authors conduct experiments with the Google XNNPACK library to evaluate the performance of their enhanced SIMDe, finding that it achieves speedup ranging from 1.51x to 5.13x compared to the original SIMDe. The paper provides background information on Arm NEON and RISC-V Vector Extensions, details strategies for migrating ARM NEON Intrinsics to RISC-V Vector Extensions, explains how to use SIMDe for code porting, and presents experimental results comparing the native SIMDe with the RVV-enhanced SIMDe. The authors successfully automate the migration process, achieving improved performance and potential enhancements across a range of applications in the Android ecosystem.

390 word summary

This paper discusses the migration of performance legacy codes from ARM NEON Intrinsics to RISC-V Vector Extensions (RVV). Many libraries, such as OpenCV, FFmpeg, XNNPACK, and Eigen, utilize Arm or x86 SIMD Intrinsics to optimize programs for performance. With the emergence of RVV, there is a need to migrate these libraries and legacy codes for improved performance on RISC-V platforms.

To automate the migration process, the authors propose using the open-source tool "SIMD Everywhere" (SIMDe). They enhance SIMDe to enable the conversion of ARM NEON Intrinsics types and functions to their corresponding RVV Intrinsics types and functions. They devise strategies for converting Neon Intrinsics types to RVV Intrinsics by considering vector length agnostic (vla) architectures. They also develop customized conversions for each function based on the results of RVV code generations.

The authors conduct experiments with the Google XNNPACK library to evaluate the performance of their enhanced SIMDe. They compare it with the original SIMDe and find that the enhanced SIMDe achieves speedup ranging from 1.51x to 5.13x compared to the original SIMDe.

The remainder of the paper is organized as follows. In Section 2, the authors provide background information on Arm NEON and RISC-V Vector Extensions. In Section 3.1, they introduce the SIMD Everywhere design pattern for intrinsics function and type conversion. They detail their strategies for leveraging SIMD Everywhere to migrate ARM NEON Intrinsics to RISC-V Vector Extensions in Sections 3.2 and 3.3. They explain how to use SIMDe for code porting in Section 3.4. Finally, in Section 4, they present the experimental results, comparing the native SIMDe with their RVV-enhanced SIMDe.

The authors validate their implementation using unit tests within SIMDe and Spike simulator. They also conduct benchmark experiments using XNNPACK as a benchmark. They choose 10 commonly used neural network computation functions implemented using NEON Intrinsics in XNNPACK and transform them into RVV Intrinsics using SIMDe. The performance of the converted code is evaluated using dynamic instruction count as the metric.

The experimental results show that the RVV-enhanced SIMDe achieves significant speedup compared to the original SIMDe, ranging from 1.51x to 5.13x across the tested functions.

In conclusion, the authors successfully automate the migration process from ARM NEON to RISC-V Vector Extensions using SIMDe, achieving improved performance. This migration strategy has the potential to drive significant enhancements across a range of applications in the Android ecosystem.

454 word summary

SIMD Everywhere Optimization from ARM NEON to RISC-V Vector Extensions

The paper discusses the migration of performance legacy codes from ARM NEON Intrinsics to RISC-V Vector Extensions (RVV). Many libraries, such as OpenCV, FFmpeg, XNNPACK, and Eigen, utilize Arm or x86 SIMD Intrinsics to optimize programs for performance. With the emergence of RVV, there is a need to migrate these libraries and legacy codes for improved performance on RISC-V platforms. The migration process currently requires manual rewriting, which is time-consuming and error-prone.

To address this issue, the authors propose the use of the open-source tool "SIMD Everywhere" (SIMDe) to automate the migration process. They enhance SIMDe to enable the conversion of ARM NEON Intrinsics types and functions to their corresponding RVV Intrinsics types and functions. For type conversion, they devise strategies to convert Neon Intrinsics types to RVV Intrinsics by considering the vector length agnostic (vla) architectures. They also analyze commonly used conversion methods in SIMDe and develop customized conversions for each function based on the results of RVV code generations.

The authors conduct experiments with the Google XNNPACK library to evaluate the performance of their enhanced SIMDe. They compare it with the original SIMDe, which does not utilize customized RVV implementations for the conversions. The enhanced SIMDe achieves speedup ranging from 1.51x to 5.13x compared to the original SIMDe.

The experimental results show that the RVV-enhanced SIMDe achieves significant speedup compared to the original SIMDe. The speedup ranges from 1.51x to 5.13x across the tested functions.

In conclusion, the authors successfully automate the migration process from ARM NEON to RISC-V Vector Extensions using SIMDe. Their enhanced SIMDe achieves improved performance compared to the original SIMDe when converting NEON code to RVV code. This migration strategy has the potential to drive significant enhancements across a range of applications in the Android ecosystem.

Raw indexed text (29,747 chars / 4,695 words / 764 lines)

SIMD Everywhere Optimization from ARM NEON to RISC-V Vector Extensions

Ju-Hung Li, Jhih-Kuan Lin, Yung-Cheng Su, Chi-Wei Chu, Lai-Tak Kuok, Hung-Ming Lai
National Tsing Hua University, Taiwan

Chao-Lin Lee
National Tsing Hua University and PeakHills Group, Taiwan

Jenq-Kuen Lee
National Tsing Hua University, Taiwan

Abstract

Many libraries, such as OpenCV, FFmpeg, XNNPACK, and Eigen, utilize Arm or x86 SIMD Intrinsics to optimize programs for performance. With the emergence of RISC-V Vector Extensions (RVV), there is a need to migrate these performance legacy codes for RVV. Currently, the migration of NEON code to RVV code requires manual rewriting, which is a time-consuming and error-prone process. In this work, we use the open-source tool "SIMD Everywhere" (SIMDe) to automate the migration. Our primary task is to enhance SIMDe to enable the conversion of ARM NEON Intrinsics types and functions to their corresponding RVV Intrinsics types and functions. For type conversion, we devise strategies to convert Neon Intrinsics types to RVV Intrinsics by considering the vector length agnostic (vla) architectures. For function conversion, we analyze commonly used conversion methods in SIMDe and develop customized conversions for each function based on the results of RVV code generation. In our experiments with the Google XNNPACK library, our enhanced SIMDe achieves speedup ranging from 1.51x to 5.13x compared to the original SIMDe, which does not utilize customized RVV implementations for the conversions.

Introduction

Many libraries, such as ComputeLibrary [1], OpenCV [5], FFmpeg [17], XNNPACK [4], and Eigen [9], utilize Arm or x86 SIMD Intrinsics to optimize specific core algorithms and leverage the parallel processing capabilities of SIMD. With the emergence of RISC-V Vector Extensions (RVV) [11], there is a need to migrate these libraries and legacy codes to take advantage of RVV instructions for improved performance on RISC-V platforms. Figure 1 illustrates key applications that can benefit from the migration flow of Neon intrinsics to RVV intrinsics [12]. Our migration from ARM NEON Intrinsics to RISC-V Vector Extensions is expected to enhance several key applications. Android Runtime (ART) could witness enhanced system efficiency. Libraries like OpenCV and FFmpeg could experience faster processing times for computer vision tasks and multimedia data processing, respectively. TensorFlow Lite could see improved execution speed, crucial for edge device deployment. Other Android applications could also see performance improvements. Machine learning libraries like XNNPACK could benefit in terms of on-device task performance, and the Eigen library could see improved calculation efficiency. In essence, this migration strategy is poised to drive significant enhancements across a range of applications in the Android ecosystem.

Currently, the migration of NEON code to RVV code requires manual rewriting. Manual rewriting requires a good understanding of the architectural differences between the two instruction sets. It involves carefully modifying the instructions and data types in the code, which can be time-consuming, especially for larger codebases or when utilizing multiple Intrinsics. In this work, we attempt to automate the rewriting process with the open-source tool SIMD Everywhere (SIMDe).

In this paper, we devise strategies to convert Neon Intrinsics types to RVV Intrinsics types. We also analyze commonly used conversion methods in SIMDe and develop customized conversions for each function based on the results of RVV code generation. Neon Intrinsics types have lengths of 64 bits and 128 bits, while the type length of RVV Intrinsics is determined by the hardware implementation. This poses challenges in directly substituting Neon Intrinsics types with RVV Intrinsics types. Currently, the SIMDe project does not yet have an implementation for converting instruction sets to a vector length agnostic (vla) architecture. Additionally, there are many Intrinsics in Neon that cannot be directly replaced one-to-one with RVV Intrinsics. That makes it a challenge to effectively utilize RVV Intrinsics to achieve the functionality of Neon Intrinsics. In our work, we adopt LLVM proposal D145088 [2], which introduces a fixed-size attribute for RISC-V Vector Extensions types. This attribute allows declaring fixed-length RVV Intrinsics types given the length of a single register, making it easier to map NEON types to RVV types. Overall, we predominantly use customized RVV Intrinsics implementations for the conversions. Our experiments show that our SIMDe achieves speedup ranging from 1.51x to 5.13x compared to the original SIMDe, which does not utilize customized RVV implementations for the conversions, when using XNNPACK as our benchmark. This work builds upon our prior efforts to enhance RISC-V software environments [6, 7, 13–16, 18, 19]. Our previous work sets the stage for the next step in advancing the RISC-V ecosystem.

Return base type    Intrinsics count
int                 1279
uint                1448
float               834
poly                371
void                331
bfloat

Table 1. Categorization of Neon Intrinsics with types

Figure 1. Key Applications Benefit from the Optimized Intrinsics Migration Flow for RVV

The remainder of the paper is organized as follows. In Section 2, we first introduce the Neon and RVV instruction sets. Next, in Section 3.1, we introduce the SIMD Everywhere design pattern for intrinsics function and type conversion. We then detail our strategies for leveraging SIMD Everywhere to migrate ARM NEON Intrinsics to RISC-V Vector Extensions Intrinsics in Sections 3.2 and 3.3. We explain how to use SIMD Everywhere for code porting in Section 3.4. Finally, in Section 4, we present the experimental results, comparing the native SIMDe with our RVV-enhanced SIMDe.

2 Background

2.1 Neon

Arm Neon is a single instruction multiple data (SIMD) architecture extension for the Arm Cortex-A and Arm Cortex-R series of processors with capabilities that vastly improve use cases on mobile devices, such as multimedia encoding/decoding, user interface, 2D/3D graphics, and gaming. Arm Neon has a total of 4344 Intrinsics. Table 1 categorizes the total number of Intrinsics by their return base type.

2.2 RISC-V Vector Extensions

RISC-V Vector Extension (RVV) is an optional addition to the base RISC-V ISA, providing parallel computing capabilities. Unlike the RISC-V P extension [10], which uses general-purpose registers (GPR) for packed-SIMD execution, RVV introduces a separate vector register file with 32 registers dedicated to SIMD operations. One notable feature of RVV is its flexibility in defining the vector length. Instead of being a fixed architectural constant, the vector length is determined by the implementation, allowing different microarchitectures to have varying vector lengths. This flexibility enables RVV programs to automatically scale across different implementations without the need for recompilation or rewriting.
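To make the scaling claim concrete, here is a minimal scalar sketch in C of the strip-mining pattern that vector-length-agnostic code typically follows. It does not use real RVV intrinsics: model_vsetvl and vla_add_i32 are hypothetical helper names, and the vlmax parameter is an assumed stand-in for the number of 32-bit elements one vector register can hold on a given implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a vsetvl-style request: grant at most vlmax
 * elements per iteration, and never more than remain. */
static size_t model_vsetvl(size_t remaining, size_t vlmax) {
  return remaining < vlmax ? remaining : vlmax;
}

/* Adds two int32 arrays in strips of vl elements and returns the number of
 * strips. The results do not depend on vlmax; only the strip count does. */
static size_t vla_add_i32(int32_t *dst, const int32_t *a, const int32_t *b,
                          size_t n, size_t vlmax) {
  size_t strips = 0;
  if (vlmax == 0) {
    return 0; /* guard: a zero-length register model would never finish */
  }
  for (size_t done = 0; done < n; ) {
    size_t vl = model_vsetvl(n - done, vlmax);
    /* This inner loop models one vector add over vl lanes. The unsigned
     * casts give wrapping addition without signed-overflow issues. */
    for (size_t i = 0; i < vl; i++) {
      dst[done + i] = (int32_t)((uint32_t)a[done + i] + (uint32_t)b[done + i]);
    }
    done += vl;
    strips++;
  }
  return strips;
}
```

Calling vla_add_i32 on ten elements with vlmax set to 4 and then to 8 produces identical output arrays; only the number of strips differs, which is the sense in which the same program scales across implementations.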

3 Migration Strategies

3.1 SIMD Everywhere

SIMD Everywhere (SIMDe) [3] is a header-only library designed to convert SIMD code across different architectures. It enables rapid transformation of SIMD libraries, enhancing the portability of SIMD code and reducing the time required for programmers to port SIMD libraries. SIMDe leverages the reuse of types and function conversions between various architectures, employing a general conversion approach. Regarding type conversions, SIMDe utilizes a union as a generic type. Within this union, besides the architecture-specific SIMD vector type variables, there is also a declaration of a type that is universally applicable across different architectures, typically an array or a variable with vector attributes. For example, Listing 1 is a universal int32x4 union used to convert NEON to other implementations. When the target ISA is SSE2, NEON, or WebAssembly, the union declares the corresponding SIMD vector types. Additionally, in all cases, a variable with vector attributes is also declared. In this work, we enhance SIMDe with the RVV target.

typedef union {
  int32_t values __attribute__((__vector_size__(16)));
#if defined(SIMDE_X86_SSE2_NATIVE)
  __m128i m128i;
#endif
#if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
  int32x4_t neon;
#endif
#if defined(SIMDE_WASM_SIMD128_NATIVE)
  v128_t v128;
#endif
} simde_int32x4_private;

Listing 1. int32x4 union

As for function conversions, SIMDe employs specific ISA intrinsics for conversion and also utilizes compiler-specific vector extensions, built-in functions, and auto-vectorization hints in a general conversion approach to enhance the portability of the SIMDe framework. For example, in Listing 2, the code converts Neon Intrinsics to other implementations. If the target ISA is Neon, AltiVec, SSE2, or WebAssembly, the Neon Intrinsic vaddq_s32 is transformed into the corresponding ISA implementation. If the target ISA is not one of the aforementioned options, the code utilizes variables with vector attributes for computations or auto-vectorizes the scalar implementation. In our case for RVV, we can use the LLVM backend for RVV so that the auto-vectorization flow serves as a baseline solution. We further enhance the flow with RVV intrinsics in the transformation.

SIMDE_FUNCTION_ATTRIBUTES
simde_int32x4_t
simde_vaddq_s32(simde_int32x4_t a, simde_int32x4_t b) {
#if defined(SIMDE_ARM_NEON_A32V7_NATIVE)
  return vaddq_s32(a, b);
#elif defined(SIMDE_POWER_ALTIVEC_P6_NATIVE)
  return vec_add(a, b);
#else
  simde_int32x4_private
    r_,
    a_ = simde_int32x4_to_private(a),
    b_ = simde_int32x4_to_private(b);
  #if defined(SIMDE_X86_SSE2_NATIVE)
    r_.m128i = _mm_add_epi32(a_.m128i, b_.m128i);
  #elif defined(SIMDE_WASM_SIMD128_NATIVE)
    r_.v128 = wasm_i32x4_add(a_.v128, b_.v128);
  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
    r_.values = a_.values + b_.values;
  #else
    #pragma clang loop vectorize(enable)
    for (size_t i = 0; i < (sizeof(r_.values) / sizeof(r_.values[0])); i++) {
      r_.values[i] = a_.values[i] + b_.values[i];
    }
  #endif
  return simde_int32x4_from_private(r_);
#endif
}

Listing 2. Neon intrinsics vaddq_s32 conversion
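The generic path of Listings 1 and 2 can be tried outside SIMDe with a small stand-alone sketch. It assumes GCC or Clang, since it relies on the vector_size attribute. All demo_ names are hypothetical stand-ins, not SIMDe's own types or functions, and the public type is modeled as a plain struct purely for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the public 128-bit type (simde_int32x4_t). */
typedef struct { int32_t lanes[4]; } demo_int32x4_t;

/* Hypothetical stand-in for the private union: a compiler vector type that
 * works on any target, next to a plain array view of the same 16 bytes. */
typedef union {
  int32_t values __attribute__((__vector_size__(16)));
  int32_t array[4];
} demo_int32x4_private;

_Static_assert(sizeof(demo_int32x4_t) == sizeof(demo_int32x4_private),
               "public and private views must have the same size");

static demo_int32x4_private demo_to_private(demo_int32x4_t v) {
  demo_int32x4_private r;
  memcpy(&r, &v, sizeof(r));
  return r;
}

static demo_int32x4_t demo_from_private(demo_int32x4_private v) {
  demo_int32x4_t r;
  memcpy(&r, &v, sizeof(r));
  return r;
}

/* Generic fallback: element-wise add on the vector-attribute member, leaving
 * instruction selection to the compiler. */
static demo_int32x4_t demo_vaddq_s32(demo_int32x4_t a, demo_int32x4_t b) {
  demo_int32x4_private r_, a_ = demo_to_private(a), b_ = demo_to_private(b);
  r_.values = a_.values + b_.values;
  return demo_from_private(r_);
}
```

Because the arithmetic is written on the vector-attribute member, the compiler is free to pick whatever vector instructions the target offers. This is the baseline that the customized RVV conversions are compared against.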

3.2 Migration Strategies with Type Conversion

Neon Intrinsics types have lengths of 64 bits and 128 bits, while the type length (vlen) of RVV Intrinsics is determined by the hardware implementation. This makes it difficult to directly substitute Neon Intrinsics types with RVV Intrinsics types. Additionally, because RVV vlen is known only at runtime, RVV Intrinsics types are sizeless types. Sizeless types have greater limitations, such as not being able to be declared as global variables or as members of structs or unions. This poses a challenge when replacing Neon Intrinsics types in such places. To address the inconvenience of sizeless types, LLVM recently introduced a new attribute for RVV Intrinsics types. With this attribute, RVV Intrinsics types can be treated as fixed-size vectors when the architecture’s vlen is known.

To perform type conversion, we modify simde/arm/neon/types.h so that it includes the corresponding RVV Intrinsics types. Since Neon Intrinsics types have lengths of 64 bits and 128 bits, we consider that for effective substitution, RVV vlen should be at least 64 bits for replacing Neon 64-bit types, and at least 128 bits for replacing Neon 128-bit types. This allows for substitution without relying on loops for operations. Additionally, in RVV, the number of processed elements is determined by setting the vector length register vl. RVV vlen only restricts the maximum number of processed elements and does not solely determine it. Therefore, as long as the RVV vlen is at least the vector length of the Neon type, type substitution can be performed. Listing 3 shows the code that adds RVV types to the Neon generic int32x4 type. The macro "__riscv_v_fixed_vlen" gives the length of the RVV vector, which is currently set by a compiler flag.

typedef vint32m1_t fixed_vint32m1_t
  __attribute__((riscv_rvv_vector_bits(__riscv_v_fixed_vlen)));

typedef union {
  ...
#if defined(SIMDE_RISCV_V_NATIVE) && SIMDE_NATURAL_VECTOR_SIZE >= 128
  fixed_vint32m1_t sv128;
#endif
  ...
} simde_int32x4_private;

Listing 3. int32x4 union with RVV type

Since the size of the union depends on the size of the largest variable, when the size of the Neon type is smaller than the size of the RVV type, the size of the union increases. Currently, in the conversion implementation of store intrinsics in SIMDe, memcpy is used to copy a number of bytes equal to the size of the union from the memory location of the union to the destination memory address. With an enlarged union, this copies more bytes than the Neon type holds, which can lead to errors in SIMDe during partial conversions. Therefore, regardless of the quality of SIMDe’s original implementation, in this scenario, we use a customized RVV Intrinsics implementation to correctly store the desired number of elements in memory, as shown in the code in Listing 4.

void
simde_vst1q_s32(int32_t ptr[HEDLEY_ARRAY_PARAM(4)], simde_int32x4_t val) {
  ...
  simde_int32x4_private val_ = simde_int32x4_to_private(val);
  #if defined(SIMDE_RISCV_V_NATIVE) && (SIMDE_NATURAL_VECTOR_SIZE >= 128)
    // Ensure that we save the correct number of elements into memory.
    __riscv_vse32_v_i32m1(ptr, val_.sv128, 4);
  #else
    simde_memcpy(ptr, &val_, sizeof(val_));
  #endif
}

Listing 4. Neon intrinsics vst1q_s32 conversion
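The size problem behind Listing 4 can be reproduced in portable C without any RVV hardware. In the sketch below the wide member is an assumed stand-in for a fixed-length RVV type at a vlen of 256 bits, and the model_ names are hypothetical. It illustrates the effect only and is not SIMDe's implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Models simde_int32x4_private once a member wider than 128 bits is added.
 * wide[8] is a hypothetical stand-in for an RVV type at vlen = 256 bits. */
typedef union {
  int32_t values[4]; /* the 128-bit Neon payload */
  int32_t wide[8];   /* stand-in for the wider RVV member */
} model_int32x4_private;

/* Number of bytes a memcpy(ptr, &val_, sizeof(val_)) style store would
 * write: the full union, not the 16 bytes the destination expects. */
static size_t bytes_written_by_memcpy_store(void) {
  return sizeof(model_int32x4_private);
}

/* Store that writes exactly four elements, mirroring the explicit element
 * count of 4 passed to the RVV store in Listing 4. */
static void model_vst1q_s32(int32_t ptr[4], const model_int32x4_private *val) {
  memcpy(ptr, val->values, 4 * sizeof(int32_t));
}
```

With this model, a union-sized copy would write 32 bytes into a destination that only expects 16. Bounding the store by element count keeps it inside the four int32_t slots regardless of how large the union has become.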

If there is no corresponding RVV type for a Neon type, the RVV type variable is not declared in the union. Possible scenarios include:

1. If vlen is less than 64 bits, substitution is not straightforward for Neon 64-bit types.
2. If vlen is less than 128 bits, substitution is not straightforward for Neon 128-bit types.
3. Without the Zvfh extension, f16 vectors cannot be replaced straightforwardly in RVV.

In cases where substitution is not possible, the variables with vector attribute in the union can still be used for intrinsics conversion. Table 2 is the mapping table for RVV and Neon types, assuming that the Zvfh extension is enabled. LLVM D145088 proposes a fixed-length attribute for RVV intrinsics types with LMUL=1. Therefore, we currently use RVV intrinsics types with LMUL=1 for the conversion.

3.3 Migration Strategies with Function Conversion

Here are five commonly used conversion methods in the SIMDe framework.

1. Utilizing ISA-specific Intrinsics functions.
2. Utilizing vector built-in functions.
3. Performing vector operations utilizing variables with vector attributes.
4. Vectorizing scalar implementations through the compiler’s auto-vectorization pass.
5. Combining other functions.

Currently, SIMDe has implementations for converting Neon intrinsics to generic architecture code. However, since there is no specific implementation for converting Neon Intrinsics to RVV Intrinsics, it can only utilize vector attributes or rely on the compiler’s auto-vectorization pass. Unfortunately, these methods are unable to generate the optimal RVV code in many cases.

Neon          vlen<64   64<=vlen<128   vlen>=128
int8x8_t      x         vint8m1_t      vint8m1_t
int16x4_t     x         vint16m1_t     vint16m1_t
int32x2_t     x         vint32m1_t     vint32m1_t
int64x1_t     x         vint64m1_t     vint64m1_t
uint8x8_t     x         vuint8m1_t     vuint8m1_t
uint16x4_t    x         vuint16m1_t    vuint16m1_t
uint32x2_t    x         vuint32m1_t    vuint32m1_t
uint64x1_t    x         vuint64m1_t    vuint64m1_t
float16x4_t   x         vfloat16m1_t   vfloat16m1_t
float32x2_t   x         vfloat32m1_t   vfloat32m1_t
float64x1_t   x         vfloat64m1_t   vfloat64m1_t
int8x16_t     x         x              vint8m1_t
int16x8_t     x         x              vint16m1_t
int32x4_t     x         x              vint32m1_t
int64x2_t     x         x              vint64m1_t
uint8x16_t    x         x              vuint8m1_t
uint16x8_t    x         x              vuint16m1_t
uint32x4_t    x         x              vuint32m1_t
uint64x2_t    x         x              vuint64m1_t
float16x8_t   x         x              vfloat16m1_t
float32x4_t   x         x              vfloat32m1_t
float64x2_t   x         x              vfloat64m1_t

Table 2. Mapping table for Neon types and RVV types with fixed-size attribute

Overall, in our design, we present customized RVV Intrinsics implementations for the conversions and have implemented conversions for a total of 1520 Intrinsics. RVV has many Intrinsics that have the same functionality as Neon Intrinsics and can be directly substituted one to one. For some Intrinsics, by combining a few RVV Intrinsics, we can achieve the same functionality as the corresponding Neon Intrinsics. For example, Listing 5 provides the code for converting Neon’s vget_high_s32 using a customized RVV Intrinsics implementation. The Neon "get_high" Intrinsics extract the upper N/2 elements from a vector of width N and generate a new vector of width N/2. We replace them with the RVV "slidedown" Intrinsics, which shift the vector elements down by a specified number of positions.
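The slidedown-based conversion of vget_high_s32 described above (Listing 5) can be checked against a scalar model. The sketch below is plain C, not RVV intrinsics. The function names are hypothetical, and the model assumes the usual slidedown behavior in which source positions at or beyond the register's element capacity read as zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a slidedown: dst[i] = src[i + offset] for i < vl, reading
 * zero once i + offset runs past the vlmax elements of the source. */
static void model_slidedown_i32(int32_t *dst, const int32_t *src,
                                size_t vlmax, size_t offset, size_t vl) {
  for (size_t i = 0; i < vl && i < vlmax; i++) {
    dst[i] = (i + offset < vlmax) ? src[i + offset] : 0;
  }
}

/* vget_high_s32 on a 4-lane input: sliding down by 2 moves lanes 2 and 3
 * into lanes 0 and 1, which form the 2-lane result. */
static void model_vget_high_s32(int32_t out[2], const int32_t in[4]) {
  int32_t slid[4];
  model_slidedown_i32(slid, in, 4, 2, 4);
  out[0] = slid[0];
  out[1] = slid[1];
}
```

Only the low two lanes of the slid vector are meaningful for the 64-bit result. The upper lanes can hold zeros or other data without affecting it, because the narrower Neon type never reads them.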

simde_int32x2_t
simde_vget_high_s32(simde_int32x4_t a) {
  ...
  simde_int32x2_private r_;
  simde_int32x4_private a_ = simde_int32x4_to_private(a);
  #if defined(SIMDE_RISCV_V_NATIVE) && (SIMDE_NATURAL_VECTOR_SIZE >= 128)
    r_.sv64 = __riscv_vslidedown_vx_i32m1(a_.sv128, 2, 4);
    ...
    return simde_int32x2_from_private(r_);
  #endif
}

Listing 5. Neon intrinsics vget_high_s32 conversion

Listing 6 presents another example, converting Neon’s vceqq_s32 using a customized RVV Intrinsics implementation. The Neon "ceq" Intrinsics compare two vectors; where the corresponding elements are equal, all bits of that element in the result vector are set to 1, and otherwise they are set to 0. We achieve the same functionality by combining different RVV instructions. In this process, the vmv instruction is used to generate a vector, vs_0, with all elements set to 0. Then, the vmseq instruction compares the corresponding elements of the two vectors and generates a mask vector. Finally, the vmerge instruction combines vs_0 and -1 based on the mask vector, resulting in the final output vector.

simde_uint32x4_t
simde_vceqq_s32(simde_int32x4_t a, simde_int32x4_t b) {
  ...
  simde_uint32x4_private r_;
  simde_int32x4_private
    a_ = simde_int32x4_to_private(a),
    b_ = simde_int32x4_to_private(b);
  #if defined(SIMDE_RISCV_V_NATIVE) && (SIMDE_NATURAL_VECTOR_SIZE >= 128)
    vuint32m1_t vs_0 = __riscv_vmv_v_x_u32m1(UINT32_C(0), 4);
    vbool32_t mask = __riscv_vmseq_vv_i32m1_b32(a_.sv128, b_.sv128, 4);
    r_.sv128 = __riscv_vmerge_vxm_u32m1(vs_0, -1, mask, 4);
    ...
    return simde_uint32x4_from_private(r_);
  #endif
}

Listing 6. Neon intrinsics vceqq_s32 conversion
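The three-step sequence in Listing 6 can be followed lane by lane with a scalar model. The sketch below is plain C with a hypothetical function name. Each loop stands in for one of the RVV steps named in the text, with the mask modeled as one byte per lane rather than a real mask register.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the vceqq_s32 conversion: splat zero, compare for
 * equality into a mask, then merge all-ones where the mask is set. */
static void model_vceqq_s32(uint32_t r[4], const int32_t a[4],
                            const int32_t b[4]) {
  uint32_t vs_0[4];
  uint8_t mask[4];
  for (size_t i = 0; i < 4; i++) {
    vs_0[i] = UINT32_C(0);                 /* models the vmv step */
  }
  for (size_t i = 0; i < 4; i++) {
    mask[i] = (uint8_t)(a[i] == b[i]);     /* models the vmseq step */
  }
  for (size_t i = 0; i < 4; i++) {
    /* models the vmerge step: the scalar -1, seen as uint32_t, is all ones */
    r[i] = mask[i] ? UINT32_MAX : vs_0[i];
  }
}
```

The merge keeps the zero from vs_0 in lanes where the mask is clear and substitutes the all-ones scalar where it is set. This reproduces the Neon convention of an all-ones or all-zeros result per lane.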

We now give an example of a Neon Intrinsic that requires a more complex conversion. The Neon ’rbit’ Intrinsics reverse the bit order of each element in a vector. To achieve the same functionality using RVV, we refer to Edwin Freed’s article ’Binary Magic Numbers’ from Dr. Dobb’s Journal 1983 [8] for the bit reverse solution. Listing 7 provides code implementing an algorithm that reverses the bit order in an unsigned integer ’v’. It accomplishes the result by swapping odd and even bits, consecutive pairs of bits, nibbles (groups of 4 bits), bytes (groups of 8 bits), and 2-byte halves through a series of bitwise operations and shifts. We implement a SIMD version of this algorithm using RVV bitwise operation intrinsics.

v = ((v >> 1) & 0x55555555) | ((v & 0x55555555) << 1);
v = ((v >> 2) & 0x33333333) | ((v & 0x33333333) << 2);
v = ((v >> 4) & 0x0F0F0F0F) | ((v & 0x0F0F0F0F) << 4);
v = ((v >> 8) & 0x00FF00FF) | ((v & 0x00FF00FF) << 8);
v = (v >> 16) | (v << 16);

Listing 7. Bit reverse solution
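As a check on the swap stages of Listing 7, the following sketch wraps them in two small C functions with hypothetical names. The second function is an assumption on our part rather than something the paper states: Neon's rbit variants operate on 8-bit elements, and for that width only the first three stages are needed, because the masks keep every swap inside its own byte.

```c
#include <stdint.h>

/* Reverses all 32 bits of v using the five swap stages of Listing 7. */
static uint32_t reverse_bits32(uint32_t v) {
  v = ((v >> 1) & 0x55555555u) | ((v & 0x55555555u) << 1);  /* adjacent bits */
  v = ((v >> 2) & 0x33333333u) | ((v & 0x33333333u) << 2);  /* bit pairs */
  v = ((v >> 4) & 0x0F0F0F0Fu) | ((v & 0x0F0F0F0Fu) << 4);  /* nibbles */
  v = ((v >> 8) & 0x00FF00FFu) | ((v & 0x00FF00FFu) << 8);  /* bytes */
  v = (v >> 16) | (v << 16);                                /* 2-byte halves */
  return v;
}

/* Reverses the bits inside each of the four bytes of v independently,
 * treating v as four packed 8-bit lanes. */
static uint32_t reverse_bits_per_byte(uint32_t v) {
  v = ((v >> 1) & 0x55555555u) | ((v & 0x55555555u) << 1);
  v = ((v >> 2) & 0x33333333u) | ((v & 0x33333333u) << 2);
  v = ((v >> 4) & 0x0F0F0F0Fu) | ((v & 0x0F0F0F0Fu) << 4);
  return v;
}
```

Every stage uses only shifts, ANDs, and ORs, so each line maps directly onto element-wise vector operations. This is what makes the algorithm a natural fit for a SIMD implementation.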

While we primarily use customized RVV implementations, we also retain the use of vector attributes in certain cases. These cases include situations where Neon Intrinsics types in the parameters cannot be replaced with RVV types, as well as Intrinsics that are specifically designed for simple vector arithmetic or shift operations. Using vector attributes for such Intrinsics often leads to optimal RVV code generation. This ensures that we can produce optimal RVV code and maximize the reuse of conversions in SIMDe. Listing 8 provides the code for converting Neon’s simde_vaddq_s32 using variables with vector attributes.

SIMDE_FUNCTION_ATTRIBUTES
simde_int32x4_t
simde_vaddq_s32(simde_int32x4_t a, simde_int32x4_t b) {
  ...
  simde_int32x4_private
    r_,
    a_ = simde_int32x4_to_private(a),
    b_ = simde_int32x4_to_private(b);
  ...
  #elif defined(SIMDE_VECTOR_SUBSCRIPT_OPS)
    r_.values = a_.values + b_.values;
    return simde_int32x4_from_private(r_);
  #endif
}

Listing 8. Neon intrinsics vaddq_s32 conversion

3.4 SIMDe Usage

The usage of SIMDe is straightforward: simply include the SIMDe header file in the code that needs to be converted. In Listing 9, the code uses our RVV-enhanced SIMDe header file to convert vector addition code implemented with Neon Intrinsics; the neon.h header file is included in line 3. After compilation, it produces the RVV code shown in Listing 10. In lines 7 and 10, the Neon vld1q_s32 instruction is converted to the RVV vle32 instruction. In line 11, the Neon vaddq_s32 instruction is converted to the RVV vadd instruction. Finally, in line 12, the Neon vst1q_s32 instruction is converted to the RVV vse32 instruction.

4 Experiments

4.1 Validation Workflow

SIMDe includes unit tests for converting Neon code to other ISAs. These unit tests validate each instruction using multiple test cases to ensure the conversion functions correctly under