As an emerging ISA, RISC-V learns a lot from its predecessors’ mistakes and brings some very appealing designs. In my circles, RISC-V is frequently associated with the words “modern” and “elegant”. Its vector extension (RVV) is often given equivalent praise, even though almost no one (including me) has used a real-world RVV machine or even programmed in RVV. After experimenting with RVV for a while, I feel that it is not as good as many people claim.

How Is RVV Designed

In contrast to SIMD architectures, RVV has variable-length vector registers. That means different chips (hardware threads, or harts, to be precise) can have different vector register lengths while sharing the same instruction set. To accomplish this, the software must get and set the length parameter with dedicated instructions at runtime. RVV operations distinguish only between vector-vector and vector-scalar, signed and unsigned, but not between element lengths. As a result, the element length is a dynamic parameter as well. Furthermore, RVV lets us use only a portion of a vector register or combine multiple vector registers, which requires yet another dynamic parameter. The type of a vector register is mainly governed by the factors listed below.

  • VLEN: a constant, represents the length of a vector register on this chip
  • vl: a Control and Status Register, or CSR, controls the number of elements used in operations, making it easier to handle the tail elements of an array
  • vtype: a CSR, includes
    • vill: represents whether the vtype configuration is ill-formed or not
    • vma / vta: controls the operation behavior of those masked-off elements and tail elements
    • vsew: controls the length of a single element, represented by SEW = 8 | 16 | 32 | 64
    • vlmul: controls how many registers are used in an operation, represented by LMUL = 1/8 | 1/4 | 1/2 | 1 | 2 | 4 | 8

We use vset{i}vl{i} instructions to set both vl and vtype.
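To make the role of vl concrete, here is a minimal sketch in portable C that mimics a strip-mined loop. The helper sim_vsetvl is hypothetical and models only the simplest behavior, granting min(AVL, VLMAX) elements at LMUL = 1. VLEN_BITS is an assumed value, and the inner for loop merely stands in for a single vector instruction; real code would use vset{i}vl{i} or the corresponding intrinsics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed hardware parameter for this sketch; a real hart reports its own. */
enum { VLEN_BITS = 128 };

/* Hypothetical model of vsetvl for LMUL = 1: grant min(avl, VLMAX) elements. */
static size_t sim_vsetvl(size_t avl, unsigned sew_bits) {
    assert(sew_bits == 8 || sew_bits == 16 || sew_bits == 32 || sew_bits == 64);
    size_t vlmax = VLEN_BITS / sew_bits;
    return avl < vlmax ? avl : vlmax;
}

/* Strip-mined c[i] = a[i] + b[i]; returns the number of loop iterations. */
static size_t add_i32(const int32_t *a, const int32_t *b, int32_t *c, size_t n) {
    size_t iterations = 0;
    while (n > 0) {
        size_t vl = sim_vsetvl(n, 32);      /* the last pass gets a short vl */
        for (size_t i = 0; i < vl; i++) {   /* stands in for one vector add */
            c[i] = a[i] + b[i];
        }
        a += vl;
        b += vl;
        c += vl;
        n -= vl;
        iterations++;
    }
    return iterations;
}
```

With the assumed VLEN of 128 bits, ten 32-bit elements are processed in passes of 4, 4, and 2, so no separate scalar tail loop is needed.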

I need to elaborate on the parameter LMUL. When LMUL = 1/2, for example, an operation uses only half of each register. When LMUL = 8, 8 contiguous registers are combined into one group, leaving 32 / 8 = 4 usable register groups. Since all RVV instructions have only 5 bits for a register index, we don’t get more registers when LMUL < 1.
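The arithmetic behind LMUL can be sketched in a few lines of C. The function names are mine, LMUL is passed as its base-2 logarithm to avoid fractions, and the formula VLMAX = VLEN / SEW × LMUL is taken from the RVV spec rather than from anything above.

```c
#include <assert.h>

/* LMUL is passed as log2(LMUL): -3 means 1/8, 0 means 1, 3 means 8. */

/* VLMAX = VLEN / SEW * LMUL: elements held by one register group. */
static unsigned vlmax(unsigned vlen_bits, unsigned sew_bits, int lmul_log2) {
    assert(lmul_log2 >= -3 && lmul_log2 <= 3);
    unsigned per_register = vlen_bits / sew_bits;
    return lmul_log2 >= 0 ? per_register << lmul_log2
                          : per_register >> -lmul_log2;
}

/* Usable register operands out of the 32 architectural vector registers. */
static unsigned register_groups(int lmul_log2) {
    assert(lmul_log2 >= -3 && lmul_log2 <= 3);
    return lmul_log2 > 0 ? 32u >> lmul_log2 : 32u;
}
```

For VLEN = 128 and SEW = 32, going from LMUL = 1 to LMUL = 8 raises the elements per operation from 4 to 32 while the usable operands drop from 32 to 4. LMUL = 1/2 halves the elements but still leaves only 32 operands.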

There are a few other parameters. However, they will not be discussed in this blog, so I will not list them.

More details can be found in the RVV Spec. I’ll end my introduction here.

Annoyances of RVV C Intrinsics

In RVV C intrinsics, vl, vma, and vta are specified at function invocations, whereas vsew and vlmul are hard-coded into types. Compilers are responsible for inserting vset{i}vl{i} instructions for you. We now have the following horrible table (source: RVV Intrinsic RFC).

Data Types

Encode SEW and LMUL into data types. We enforce the constraint LMUL ≥ SEW/ELEN in the implementation. There are the following data types for ELEN = 64.

| Types | LMUL = 1 | LMUL = 2 | LMUL = 4 | LMUL = 8 | LMUL = 1/2 | LMUL = 1/4 | LMUL = 1/8 |
|---|---|---|---|---|---|---|---|
| int64_t | vint64m1_t | vint64m2_t | vint64m4_t | vint64m8_t | N/A | N/A | N/A |
| uint64_t | vuint64m1_t | vuint64m2_t | vuint64m4_t | vuint64m8_t | N/A | N/A | N/A |
| int32_t | vint32m1_t | vint32m2_t | vint32m4_t | vint32m8_t | vint32mf2_t | N/A | N/A |
| uint32_t | vuint32m1_t | vuint32m2_t | vuint32m4_t | vuint32m8_t | vuint32mf2_t | N/A | N/A |
| int16_t | vint16m1_t | vint16m2_t | vint16m4_t | vint16m8_t | vint16mf2_t | vint16mf4_t | N/A |
| uint16_t | vuint16m1_t | vuint16m2_t | vuint16m4_t | vuint16m8_t | vuint16mf2_t | vuint16mf4_t | N/A |
| int8_t | vint8m1_t | vint8m2_t | vint8m4_t | vint8m8_t | vint8mf2_t | vint8mf4_t | vint8mf8_t |
| uint8_t | vuint8m1_t | vuint8m2_t | vuint8m4_t | vuint8m8_t | vuint8mf2_t | vuint8mf4_t | vuint8mf8_t |
| vfloat64 | vfloat64m1_t | vfloat64m2_t | vfloat64m4_t | vfloat64m8_t | N/A | N/A | N/A |
| vfloat32 | vfloat32m1_t | vfloat32m2_t | vfloat32m4_t | vfloat32m8_t | vfloat32mf2_t | N/A | N/A |
| vfloat16 | vfloat16m1_t | vfloat16m2_t | vfloat16m4_t | vfloat16m8_t | vfloat16mf2_t | vfloat16mf4_t | N/A |

There are the following data types for ELEN = 32.

| Types | LMUL = 1 | LMUL = 2 | LMUL = 4 | LMUL = 8 | LMUL = 1/2 | LMUL = 1/4 | LMUL = 1/8 |
|---|---|---|---|---|---|---|---|
| int32_t | vint32m1_t | vint32m2_t | vint32m4_t | vint32m8_t | N/A | N/A | N/A |
| uint32_t | vuint32m1_t | vuint32m2_t | vuint32m4_t | vuint32m8_t | N/A | N/A | N/A |
| int16_t | vint16m1_t | vint16m2_t | vint16m4_t | vint16m8_t | vint16mf2_t | N/A | N/A |
| uint16_t | vuint16m1_t | vuint16m2_t | vuint16m4_t | vuint16m8_t | vuint16mf2_t | N/A | N/A |
| int8_t | vint8m1_t | vint8m2_t | vint8m4_t | vint8m8_t | vint8mf2_t | vint8mf4_t | N/A |
| uint8_t | vuint8m1_t | vuint8m2_t | vuint8m4_t | vuint8m8_t | vuint8mf2_t | vuint8mf4_t | N/A |
| vfloat32 | vfloat32m1_t | vfloat32m2_t | vfloat32m4_t | vfloat32m8_t | N/A | N/A | N/A |
| vfloat16 | vfloat16m1_t | vfloat16m2_t | vfloat16m4_t | vfloat16m8_t | vfloat16mf2_t | N/A | N/A |

Mask Types

Encode the ratio of SEW/LMUL into the mask types. There are the following mask types.

n = SEW/LMUL

| Types | n = 1 | n = 2 | n = 4 | n = 8 | n = 16 | n = 32 | n = 64 |
|---|---|---|---|---|---|---|---|
| bool | vbool1_t | vbool2_t | vbool4_t | vbool8_t | vbool16_t | vbool32_t | vbool64_t |

There are a lot of N/A entries here, which makes it a little difficult to generate code with C macros. This is because the RVV specification has a loose VLEN restriction, requiring only that a register can hold at least one element of the largest size (i.e. VLEN >= ELEN). As a result, these N/A types may not be available on some chips (for example, an RV64V chip with VLEN = 64 cannot support SEW = 64, LMUL = 1/8). These types don’t seem to matter much, though, because the LMUL < 1 case seems to be uncommon and usually appears in widening or narrowing instructions, which do not use those N/A types.
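The pattern of N/A entries follows directly from the constraint LMUL ≥ SEW/ELEN quoted above. A minimal sketch of that check in C, with a function name and a log2 encoding of LMUL that are my own:

```c
#include <assert.h>
#include <stdbool.h>

/* lmul_log2 encodes LMUL as a power of two: -3 is 1/8, 0 is 1, 3 is 8. */
static bool type_exists(unsigned sew_bits, int lmul_log2, unsigned elen_bits) {
    assert(lmul_log2 >= -3 && lmul_log2 <= 3);
    unsigned lmul_times_8 = 1u << (lmul_log2 + 3);
    /* An element wider than ELEN is never available. */
    if (sew_bits > elen_bits) {
        return false;
    }
    /* LMUL >= SEW / ELEN, with both sides multiplied by 8 * ELEN. */
    return lmul_times_8 * elen_bits >= 8u * sew_bits;
}
```

For example, type_exists(32, -1, 64) is true (vint32mf2_t with ELEN = 64), while type_exists(32, -1, 32) is false, matching the N/A in the ELEN = 32 table.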

Thanks to LMUL, the number of intrinsic types is huge, which bloats the header file and the docs to megabytes. The good news is that RVV intrinsics provide overloaded functions. But the names of these functions are essentially mapped to assembly instructions, and many functions that could share one overloaded name do not. When you try to wrap them, you still need to do a lot of extra work, as stated in this issue.

These are just some small annoyances from my experience with RVV. I won’t use them to criticize RVV. The major problem is that these intrinsic types are all dynamically sized types (or sizeless types, or unsized types) due to RVV’s variable-length nature. And I’m afraid that DSTs are poorly supported in all languages, not just C.

The Ecosystem Has Not Prepared for DSTs

First of all, the C language standard actually has a DST, i.e., the variable-length array. When a VLA is constructed, the current stack frame will be dynamically extended. It’s like alloca with some extra information such as type and lifetime. The implementation of RVV intrinsic types in Clang is very similar to VLA.
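A small C sketch shows what “dynamically sized” means for a VLA. The function name is made up for illustration, and the block only demonstrates the C language behavior, not how Clang lowers RVV types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the size of a VLA with n elements; sizeof is evaluated at run time. */
static size_t vla_bytes(size_t n) {
    assert(n > 0);                /* a zero-length VLA is undefined behavior */
    int32_t buffer[n];            /* the stack frame grows by n * 4 bytes here */
    buffer[0] = 0;                /* touch the array so it is really used */
    return sizeof buffer;         /* not a compile-time constant */
}
```

Two calls with different arguments return different sizes for the “same” variable, just as the size of one RVV type differs between harts with different VLEN.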

However, while GCC and Clang support it as a compiler extension, VLA is not part of the C++ standard. The C++ standard does not have any DST. As for Rust, although it does have DSTs like dyn Trait and [T], they can only be held indirectly through references or pointers. Dynamically extending stack frames is not possible in Rust at all. To properly support RVV, a number of issues must be taken into account, and many changes may have to be made in these two languages. Consider these: How do you store RVV variables as static variables? How do you put them in a struct? How do you pass them as arguments to a function and return them from a function? None of these can be done with VLA variables. They are fundamental and natural for any other SIMD type, but they pose a significant challenge for RVV.

Currently, the vast majority of existing C++ code is written against statically sized types. We rely on static sizes in so many places without being aware of it. Have you ever considered that a sizeof in some constant evaluation context may break? What’s worse, DST is colored. If a struct contains a DST, then it is a DST, too. The suffering and sorrow spread along the dependency chain.
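To illustrate how casually code depends on static sizes, here is a sketch in C11 with a fixed-size stand-in for a SIMD type. Vec4, Particle, and scratch are hypothetical names, and the block assumes a 4-byte float. Every marked line would have no equivalent if Vec4 were a sizeless type.

```c
#include <assert.h>
#include <stddef.h>

/* A fixed-size "SIMD-like" type: four 32-bit lanes, size known statically. */
typedef struct { float lane[4]; } Vec4;

/* Wrapping it in another struct works because its size is a constant. */
typedef struct { Vec4 position; Vec4 velocity; } Particle;

/* sizeof in a constant-evaluation context: fine here, impossible for a DST. */
static_assert(sizeof(Vec4) == 16, "assumes a 4-byte float");
static_assert(sizeof(Particle) == 2 * sizeof(Vec4), "no padding expected");

/* A static buffer sized by sizeof: another silent dependency on static sizes. */
static unsigned char scratch[4 * sizeof(Vec4)];

static size_t scratch_bytes(void) {
    return sizeof scratch;
}
```

Note that Particle inherits the problem: if Vec4 had no static size, neither would Particle, and neither static_assert nor the array bound could be evaluated.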

So what is the status quo? If I recall correctly, LLVM/Clang only allows RVV types to be used as local variables, arguments, and return values; none of the other uses mentioned above are allowed. GCC’s support for RVV, meanwhile, was stuck in an intermediate state for a long time (9/29/2022 update: GCC has supported RVV v1.0). Since RVV is not the first vector architecture, we can look at the language support for its predecessor, ARM SVE, to see how far we can go. Well, its support is no less constrained than RVV’s.

Is Rust better? Rust apparently possesses more language facilities for DSTs. It has a Sized trait used everywhere, implicitly or explicitly. And it permits structs with DST fields as long as they are the final ones (which is weird since Rust doesn’t guarantee the memory layout). So I believe that the process of Rust embracing RVV will be smoother. However, as was already said, Rust is unable to dynamically extend stack frames, making it impossible to even put a DST variable on the stack.

Actually, I seriously doubt that a language exists that supports DSTs well and can build zero-cost abstractions on top of them. The lack of ecosystem support suggests that RVV will be less consistent and composable at the language level.

Maybe I’ve taken the problem too seriously, because the scenarios that use SIMD/Vector are quite specific. Even though RVV lacks so many abilities, you probably won’t be affected. But SIMD library authors will. Their code design might automatically rule out RVV. Putting aside the distinctions between SIMD and vectors, there is a more pressing issue: RVV types cannot be wrapped in a struct. Besides SIMD libraries, GCC and Clang’s vector extension can hardly support RVV either, as it requires you to specify the sizes at compile-time. That means little existing SIMD-accelerated code can support RVV cheaply. By the way, Rust’s std::simd simply gives up on supporting RVV for now.

Only Google’s highway claims to support RVV, as far as I’m aware. However, some of its modules, such as vqsort, don’t. Other sorting networks commonly employ transposes, but vqsort’s sorting network uses a number of permutations to avoid transposing, which makes it extremely challenging to port to RVV because RVV is length-agnostic. It seems that nsimd tried to support RVV but stopped a long time ago.

Choice of Intrinsic Types Is Not Clear

The SIMD type to employ is typically obvious. Numerous SIMD libraries, such as C++’s experimental <simd>, can choose an underlying SIMD type for you automatically. For instance, on x86 platforms, the fallback order is commonly AVX512 -> AVX2 -> SSE2, despite the fact that the time required to switch between licenses and the degree of downclocking of AVX512 and AVX2 differ amongst microarchitectures. And since you simply need to take into account the element type, it is also evident for ARM SVE. However, things become considerably more confusing in RVV.

Recall that RVV has an LMUL parameter. Larger LMUL values are expected to increase speed at the expense of the number of available registers, which means a higher likelihood of spilling. You might need to tune LMUL to get better efficiency. I initially believed this to be a problem unique to RVV. But after a while, I noticed how similar it is to the loop unrolling problem.

To some extent, compilers can decide how to unroll loops for you. Then what about LMUL? Can compilers choose a proper value for you? I don’t know. But I think this is not as easy, because LMUL is not an opt-in feature. You cannot pretend it always equals one. Widening and narrowing instructions (e.g., converting u32 to u64 or the reverse) will change LMUL. Taking that into consideration, the optimizing process could be more complex than loop unrolling. It reminds me of how RISC-V’s genius design brings problems to linker implementations.
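The way widening forces LMUL to change can be sketched as follows. VType and widen are hypothetical names; the rule they model (the destination uses twice the element width and therefore twice the register group) comes from the RVV spec.

```c
#include <assert.h>
#include <stdbool.h>

/* Element width in bits plus log2(LMUL): -3 is 1/8, 0 is 1, 3 is 8. */
typedef struct { unsigned sew_bits; int lmul_log2; } VType;

/* Widening doubles the element width, so the result group doubles as well.
   Returns false when the result would need SEW > 64 or LMUL > 8. */
static bool widen(VType in, VType *out) {
    assert(out != 0);
    if (in.sew_bits >= 64 || in.lmul_log2 >= 3) {
        return false;
    }
    out->sew_bits = in.sew_bits * 2;
    out->lmul_log2 = in.lmul_log2 + 1;
    return true;
}
```

Starting from u32 at LMUL = 1, the widened u64 result occupies LMUL = 2, so code that wants to stay at LMUL = 1 after widening must start at LMUL = 1/2. This is why LMUL cannot simply be ignored.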

If the compiler is unable to select an appropriate LMUL for you, then you have to tune LMUL manually. Another issue now: can SIMD libraries offer a unified, cross-platform API over it? That is challenging, in my opinion, and I haven’t come across any related design.

Given the resemblance between selecting LMUL values and unrolling loops, I have to wonder if LMUL is really essential. Will the speedup warrant the additional complexity it adds? We don’t know because RVV hardware is currently scarce.

Possible Higher Context Switch Cost

The context size issue plagued older vector processors (according to some articles; I’m not familiar with that period of history). Their vector registers were typically made long in order to achieve a high speedup. This approach inevitably bloats the context size. As a result, operating systems must spend additional time and resources on saving registers during context switches.

The RISC-V Reader, a popular resource for RISC-V newcomers, proudly claims that RVV can avoid this problem, because RVV has a dedicated instruction vsetdcfg that can enable / disable registers as needed, so that we only pay for what we use. Sounds promising, doesn’t it? However, The RISC-V Reader is very outdated. The instruction vsetdcfg has already been deprecated in the current RVV spec. RVV now only has a very coarse-grained mechanism that records whether any vector register has been modified. If the vendor chooses a long-vector implementation, I believe RVV will also suffer from the context size issue. This problem is unlikely to bother you, though. Super long vectors are usually designed for HPC, where the machine is usually dedicated to a single program at a time.
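A back-of-the-envelope calculation shows how the context grows with VLEN. The function name is mine, and the sketch counts only the 32 vector registers, ignoring vl, vtype, and the other vector CSRs.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed to save all 32 vector registers for a given VLEN in bits. */
static size_t vector_context_bytes(size_t vlen_bits) {
    assert(vlen_bits % 8 == 0);
    return 32 * (vlen_bits / 8);
}
```

A 128-bit implementation needs 512 bytes per context, while a 16384-bit one needs 64 KiB, all of which must be saved as soon as any vector register is marked as modified.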

Can RVV Emulate SIMD?

Some blogs and talks say that RVV can, at worst, emulate SIMD. THIS IS NOT TRUE.

Think about it: how can we use something with less information (register sizes only known at run-time) to emulate something with more information (register sizes known at compile-time)? To expose unified APIs to users, the only feasible way is to erase the extra information from SIMD types. And this is what highway does.

“But can’t we set the desired size at run-time to match SIMD?” one might ask. No, you can’t. You must first deal with the DST issue, as I elaborated. And after that, you have to deal with the wide variety of VLEN. Recall that the RVV Spec only stipulates VLEN >= ELEN. So on some processors your vl setting might fail. What if we use LMUL to concatenate registers? Well, then you plunge into the type choosing problem.

A way out is to use the Zvl* extensions, which specify the minimum vector register length. Theoretically, you can use feature flags (e.g., pre-defined macros like __AVX2__ in C++) to pick different implementations for different VLEN at compile-time or simply reject those platforms that don’t have sufficient length. Of course, once you do this, you lose the portability advantage of RVV. And even if you use only a part of the register, the context size won’t be smaller. And as far as I know, such flags haven’t been implemented in LLVM/Clang yet.
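Such a compile-time dispatch could look like the sketch below. The RISC-V C API specification names a __riscv_v_min_vlen macro for the guaranteed minimum VLEN; whether a given compiler version defines it is an assumption here, so the code falls back to a scalar path when the macro is missing. MIN_VLEN_BITS, Impl, select_impl, and supports_fixed_width are names invented for illustration.

```c
#include <assert.h>

/* __riscv_v_min_vlen is assumed to be provided by the toolchain when the
   target guarantees a minimum VLEN (for example through a Zvl* extension). */
#if defined(__riscv_v_min_vlen)
#define MIN_VLEN_BITS __riscv_v_min_vlen
#else
#define MIN_VLEN_BITS 0   /* not RISC-V, or no vector guarantee */
#endif

typedef enum { IMPL_SCALAR, IMPL_FIXED_128, IMPL_FIXED_256 } Impl;

/* Pick an implementation at compile time from the guaranteed minimum VLEN. */
static Impl select_impl(void) {
#if MIN_VLEN_BITS >= 256
    return IMPL_FIXED_256;
#elif MIN_VLEN_BITS >= 128
    return IMPL_FIXED_128;
#else
    return IMPL_SCALAR;
#endif
}

/* Reject platforms whose guaranteed VLEN is below the requested width. */
static int supports_fixed_width(unsigned bits) {
    assert(bits > 0 && bits % 8 == 0);
    return (unsigned)MIN_VLEN_BITS >= bits;
}
```

Each branch is a separate fixed-width implementation, which is exactly the per-target duplication that RVV’s length-agnostic design was supposed to remove.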

That’s not the end. Despite all the challenges, there are some situations where SIMD emulation is still impossible (or too expensive), for instance, permutations. RVV does have some permutation instructions like slideup, slidedown, gather, scatter, and compress. And they are very helpful. You can also see some of them in AVX512. But SIMD’s general permutation instructions can do more than that; they accept a lookup table to rearrange elements within a register, which is not possible for RVV: since the code is length-agnostic, the lookup table size is unknown. Simdjson utilizes permutations to classify characters. Vqsort utilizes permutations to implement its sorting network. Neither of them can be cheaply emulated by RVV.
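To show what a fixed-size lookup-table permutation relies on, here is a scalar C model of a 16-byte shuffle in the style of x86 pshufb, followed by a nibble classifier loosely modeled on the kind of trick simdjson uses. Both function names are hypothetical, and the classifier is a simplification, not simdjson’s actual code.

```c
#include <stdint.h>

/* Scalar model of a 16-byte table-lookup shuffle in the style of x86 pshufb:
   each output byte is table[index & 15], or 0 when the index's top bit is set. */
static void shuffle16(const uint8_t table[16], const uint8_t idx[16],
                      uint8_t out[16]) {
    for (int i = 0; i < 16; i++) {
        out[i] = (idx[i] & 0x80) ? 0 : table[idx[i] & 0x0F];
    }
}

/* Hypothetical classifier: look up each byte's low nibble in a 16-entry table. */
static void classify_low_nibble(const uint8_t table[16], const uint8_t in[16],
                                uint8_t out[16]) {
    uint8_t idx[16];
    for (int i = 0; i < 16; i++) {
        idx[i] = in[i] & 0x0F;
    }
    shuffle16(table, idx, out);
}
```

The constant 16 appears everywhere: the table, the index vector, and the result are all exactly one register wide, and the algorithm is designed around that number. With a register length that is only known at run-time, there is no such constant to design around.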