JEP 438: Vector API (Fifth Incubator)
Summary
Introduce an API to express vector computations that reliably compile at runtime to optimal vector instructions on supported CPU architectures, thus achieving performance superior to equivalent scalar computations.
History
The Vector API was first proposed by JEP 338 and integrated into JDK 16 as an incubating API. A second round of incubation was proposed by JEP 414 and integrated into JDK 17. A third round of incubation was proposed by JEP 417 and integrated into JDK 18. A fourth round of incubation was proposed by JEP 426 and integrated into JDK 19.
This JEP proposes to re-incubate the API in JDK 20, with no changes in the API relative to JDK 19. The implementation includes a small set of bug fixes and performance enhancements. This JEP also clarifies that alignment with Project Valhalla is a critical part of completing the Vector API.
Goals
- Clear and concise API — The API should be capable of clearly and concisely expressing a wide range of vector computations consisting of sequences of vector operations composed within loops and possibly with control flow. It should be possible to express a computation that is generic with respect to vector size, or the number of lanes per vector, thus enabling such computations to be portable across hardware supporting different vector sizes.
- Platform agnostic — The API should be CPU architecture agnostic, enabling implementations on multiple architectures supporting vector instructions. As is usual in Java APIs, where platform optimization and portability conflict, we will bias toward making the API portable, even if that results in some platform-specific idioms not being expressible in portable code.
- Reliable runtime compilation and performance on x64 and AArch64 architectures — On capable x64 architectures the Java runtime, specifically the HotSpot C2 compiler, should compile vector operations to corresponding efficient and performant vector instructions, such as those supported by Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX). Developers should have confidence that the vector operations they express will reliably map closely to relevant vector instructions. On capable ARM AArch64 architectures C2 will, similarly, compile vector operations to the vector instructions supported by NEON and SVE.
- Graceful degradation — Sometimes a vector computation cannot be fully expressed at runtime as a sequence of vector instructions, perhaps because the architecture does not support some of the required instructions. In such cases the Vector API implementation should degrade gracefully and still function. This may involve issuing warnings if a vector computation cannot be efficiently compiled to vector instructions. On platforms without vectors, graceful degradation will yield code competitive with manually-unrolled loops, where the unroll factor is the number of lanes in the selected vector.
- Alignment with Project Valhalla — The long-term goal of the Vector API is to leverage Project Valhalla's enhancements to the Java object model. Primarily this will mean changing the Vector API's current value-based classes to be value classes so that programs can work with value objects, i.e., class instances that lack object identity. Accordingly, the Vector API will incubate over multiple releases until the necessary features of Project Valhalla become available as preview features. Once these Valhalla features are available we will adapt the Vector API and implementation to use them and then promote the Vector API itself to a preview feature. For further details, see the sections on run-time compilation and future work.
Non-Goals
- It is not a goal to enhance the existing auto-vectorization algorithm in HotSpot.
- It is not a goal to support vector instructions on CPU architectures other than x64 and AArch64. However, it is important to state, as expressed in the goals, that the API must not rule out such implementations.
- It is not a goal to support the C1 compiler.
- It is not a goal to guarantee support for strict floating point calculations as is required by the Java platform for scalar operations. The results of floating point operations performed on floating point scalars may differ from equivalent floating point operations performed on vectors of floating point scalars. Any deviations will be clearly documented. This non-goal does not rule out options to express or control the desired precision or reproducibility of floating point vector computations.
Motivation
A vector computation consists of a sequence of operations on vectors. A vector comprises a (usually) fixed sequence of scalar values; the number of scalar values corresponds to the number of hardware-defined vector lanes. A binary operation applied to two vectors with the same number of lanes would, for each lane, apply the equivalent scalar operation on the corresponding two scalar values from each vector. This is commonly referred to as Single Instruction Multiple Data (SIMD).
Vector operations express a degree of parallelism that enables more work to be performed in a single CPU cycle and thus can result in significant performance gains. For example, given two vectors, each containing a sequence of eight integers (i.e., eight lanes), the two vectors can be added together using a single hardware instruction. The vector addition instruction operates on sixteen integers, performing eight integer additions, in the time it would ordinarily take to operate on two integers, performing one integer addition.
HotSpot already supports auto-vectorization, which transforms scalar operations into superword operations which are then mapped to vector instructions. The set of transformable scalar operations is limited, and also fragile with respect to changes in code shape. Furthermore, only a subset of the available vector instructions might be utilized, limiting the performance of generated code.
Today, a developer who wishes to write scalar operations that are reliably transformed into superword operations needs to understand HotSpot's auto-vectorization algorithm and its limitations in order to achieve reliable and sustainable performance. In some cases it may not be possible to write scalar operations that are transformable. For example, HotSpot does not transform the simple scalar operations for calculating the hash code of an array (the Arrays::hashCode methods), nor can it auto-vectorize code to lexicographically compare two arrays (thus we added an intrinsic for lexicographic comparison).
The Vector API aims to improve the situation by providing a way to write complex vector algorithms in Java, using the existing HotSpot auto-vectorizer but with a user model which makes vectorization far more predictable and robust. Hand-coded vector loops can express high-performance algorithms, such as vectorized hashCode or specialized array comparisons, which an auto-vectorizer may never optimize. Numerous domains can benefit from this explicit vector API, including machine learning, linear algebra, cryptography, finance, and code within the JDK itself.
Description
A vector is represented by the abstract class Vector<E>. The type variable E is instantiated as the boxed type of the scalar primitive integral or floating point element types covered by the vector. A vector also has a shape, which defines the size, in bits, of the vector. The shape of a vector governs how an instance of Vector<E> is mapped to a hardware vector register when vector computations are compiled by the HotSpot C2 compiler. The length of a vector, i.e., the number of lanes or elements, is the vector size divided by the element size.
The set of element types (E) supported is Byte, Short, Integer, Long, Float, and Double, corresponding to the scalar primitive types byte, short, int, long, float, and double, respectively.
The set of shapes supported corresponds to vector sizes of 64, 128, 256, and 512 bits, as well as max bits. A 512-bit shape can pack bytes into 64 lanes or pack ints into 16 lanes, and a vector of such a shape can operate on 64 bytes at a time or 16 ints at a time. A max-bits shape supports the maximum vector size of the current architecture. This enables support for the ARM SVE platform, where platform implementations can support any fixed size ranging from 128 to 2048 bits, in increments of 128 bits.
We believe that these simple shapes are generic enough to be useful on all relevant platforms. However, as we experiment with future platforms during the incubation of this API we may further modify the design of the shape parameter. Such work is not in the early scope of this project, but these possibilities partly inform the present role of shapes in the Vector API. (For further discussion see the future work section, below.)
The combination of element type and shape determines a vector's species, represented by VectorSpecies<E>.
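By way of illustration only, here is a minimal sketch of obtaining species and querying the lane counts implied by the size-divided-by-element-size formula above. It assumes the incubating jdk.incubator.vector module is added to the module graph (e.g., with --add-modules jdk.incubator.vector), and the method name is hypothetical:

    import jdk.incubator.vector.*;

    static void speciesExamples() {
        // A species combines an element type with a shape (vector size in bits).
        VectorSpecies<Float> f256 = FloatVector.SPECIES_256;
        System.out.println(f256.length());       // 8 lanes: 256 bits / 32 bits per float
        System.out.println(f256.vectorShape());  // the 256-bit shape

        // The same 512-bit shape yields different lane counts for different element types.
        System.out.println(ByteVector.SPECIES_512.length());  // 64 lanes
        System.out.println(IntVector.SPECIES_512.length());   // 16 lanes

        // Max-bits and preferred species are resolved per platform at run time.
        System.out.println(IntVector.SPECIES_MAX.length());
        System.out.println(FloatVector.SPECIES_PREFERRED.length());
    }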
Operations on vectors are classified as either lane-wise or cross-lane.
- A lane-wise operation applies a scalar operator, such as addition, to each lane of one or more vectors in parallel. A lane-wise operation usually, but not always, produces a vector of the same length and shape. Lane-wise operations are further classified as unary, binary, ternary, test, or conversion operations.
- A cross-lane operation applies an operation across an entire vector. A cross-lane operation produces either a scalar or a vector of possibly a different shape. Cross-lane operations are further classified as permutation or reduction operations. (A short sketch after this list illustrates both kinds of operation.)
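The following sketch, not part of the API specification, contrasts a lane-wise binary operation with a cross-lane reduction; the method and variable names are hypothetical, and the arrays are assumed to hold at least one full vector's worth of elements:

    import jdk.incubator.vector.*;

    static void laneWiseVersusCrossLane(int[] x, int[] y) {
        VectorSpecies<Integer> species = IntVector.SPECIES_256;
        IntVector va = IntVector.fromArray(species, x, 0);
        IntVector vb = IntVector.fromArray(species, y, 0);

        // Lane-wise binary operation: adds corresponding lanes, producing a
        // vector of the same length and shape.
        IntVector sum = va.add(vb);

        // Cross-lane reduction: combines all lanes of one vector into a scalar.
        int total = va.reduceLanes(VectorOperators.ADD);
    }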
To reduce the surface of the API, we define collective methods for each class of operation. These methods take operator constants as input; these constants are instances of the VectorOperators.Operator class and are defined in static final fields in the VectorOperators class. For convenience we define dedicated methods, which can be used in place of the generic methods, for some common full-service operations such as addition and multiplication.
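For example, in the following sketch (hypothetical method name, same module assumption as above) the generic collective method with an operator constant and the dedicated convenience method express the same multiplication:

    import jdk.incubator.vector.*;

    static void operatorConstants(IntVector va, IntVector vb) {
        // Collective method taking an operator constant from VectorOperators.
        IntVector product1 = va.lanewise(VectorOperators.MUL, vb);

        // Equivalent dedicated method for the common full-service operation.
        IntVector product2 = va.mul(vb);
    }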
Certain operations on vectors, such as conversion and reinterpretation, are inherently shape-changing; i.e., they produce vectors whose shapes differ from the shapes of their inputs. Shape-changing operations in a vector computation can negatively impact portability and performance. For this reason the API defines a shape-invariant flavor of each shape-changing operation when applicable. For best performance, developers should write shape-invariant code using shape-invariant operations insofar as possible. Shape-changing operations are identified as such in the API specification.
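As a sketch of the distinction, and with the caveat that the conversion constants and part arguments shown are taken from the incubating VectorOperators class rather than from this JEP's text, the input vector is assumed to have the 256-bit shape with eight int lanes:

    import jdk.incubator.vector.*;

    static void conversions(IntVector vi) {
        // Shape-invariant flavor: int lanes become float lanes, shape stays 256 bits.
        Vector<Float> vf = vi.convert(VectorOperators.I2F, 0);

        // Shape-changing flavor: widening int lanes to double lanes doubles the bits,
        // so the result is given a 512-bit species explicitly.
        Vector<Double> vd = vi.convertShape(VectorOperators.I2D, DoubleVector.SPECIES_512, 0);
    }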
The Vector<E> class declares a set of methods for common vector operations supported by all element types. For operations specific to an element type there are six abstract subclasses of Vector<E>, one for each supported element type: ByteVector, ShortVector, IntVector, LongVector, FloatVector, and DoubleVector. These type-specific subclasses define additional operations that are bound to the element type, since the method signature refers either to the element type or to the related array type. Examples of such operations include reduction (e.g., summing all lanes to a scalar value) and copying a vector's elements into an array. These subclasses also define additional full-service operations specific to the integral subtypes (e.g., bitwise operations such as logical or), as well as operations specific to the floating point types (e.g., transcendental mathematical functions such as exponentiation).
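A brief sketch of such type-bound operations follows; the method name is hypothetical and the arrays are assumed to hold at least one full vector's worth of elements:

    import jdk.incubator.vector.*;

    static void typeSpecificOperations(int[] ints, float[] floats) {
        IntVector vi = IntVector.fromArray(IntVector.SPECIES_256, ints, 0);
        FloatVector vf = FloatVector.fromArray(FloatVector.SPECIES_256, floats, 0);

        // Reduction bound to the element type: sums all lanes into an int.
        int sum = vi.reduceLanes(VectorOperators.ADD);

        // Bitwise operation available only on the integral subtypes.
        IntVector ored = vi.lanewise(VectorOperators.OR, vi);

        // Transcendental operation available only on the floating point subtypes.
        FloatVector exp = vf.lanewise(VectorOperators.EXP);

        // Copying a vector's elements back into an array.
        exp.intoArray(floats, 0);
    }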
As an implementation matter, these type-specific subclasses of Vector<E> are further extended by concrete subclasses for different vector shapes. These concrete subclasses are not public, since there is no need to provide operations specific to types and shapes. This reduces the API surface to a sum of concerns rather than a product. Instances of concrete Vector classes are obtained via factory methods defined in the base Vector<E> class and its type-specific subclasses. These factories take as input the species of the desired vector instance and produce various kinds of instances, for example the vector instance whose elements are default values (i.e., the zero vector), or a vector instance initialized from a given array.
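For instance, a minimal sketch of the factories mentioned above, using the 256-bit int species and a hypothetical method name:

    import jdk.incubator.vector.*;

    static void factories(int[] a) {
        VectorSpecies<Integer> species = IntVector.SPECIES_256;

        // The zero vector: every lane holds the default value 0.
        IntVector zeros = IntVector.zero(species);

        // A vector with every lane set to the same scalar.
        IntVector sevens = IntVector.broadcast(species, 7);

        // A vector initialized from the first species.length() elements of an array.
        IntVector fromA = IntVector.fromArray(species, a, 0);
    }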
To support control flow, some vector operations optionally accept masks, represented by the public abstract class VectorMask<E>. Each element in a mask is a boolean value corresponding to a vector lane. A mask selects the lanes to which an operation is applied: the operation is applied where the mask element for a lane is true, and some alternative action is taken where it is false.
Similar to vectors, instances of VectorMask<E> are instances of non-public concrete subclasses defined for each element type and length combination. The instance of VectorMask<E> used in an operation should have the same type and length as the vector instances involved in the operation. Vector comparison operations produce masks, which can then be used as input to other operations to selectively operate on certain lanes and thereby emulate flow control. Masks can also be created using static factory methods in the VectorMask<E> class.
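A small sketch of a mask produced by a comparison and consumed by a masked operation and a blend; the method name is hypothetical and the arrays are assumed to hold at least one full vector's worth of elements:

    import jdk.incubator.vector.*;

    static void masks(int[] x, int[] y) {
        VectorSpecies<Integer> species = IntVector.SPECIES_256;
        IntVector va = IntVector.fromArray(species, x, 0);
        IntVector vb = IntVector.fromArray(species, y, 0);

        // Comparison produces a mask: true in each lane where va < vb.
        VectorMask<Integer> lt = va.compare(VectorOperators.LT, vb);

        // Masked operation: lanes are added only where the mask is true;
        // elsewhere the lane value of va is carried through unchanged.
        IntVector partialSum = va.add(vb, lt);

        // Blend: takes the lane from vb where the mask is true, from va otherwise.
        IntVector blended = va.blend(vb, lt);
    }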
We anticipate that masks will play an important role in the development of vector computations that are generic with respect to shape. This expectation is based on the central importance of predicate registers, the equivalent of masks, in the ARM Scalable Vector Extension (SVE) and in Intel's AVX-512.
On such platforms an instance of VectorMask<E> is mapped to a predicate register, and a mask-accepting operation is compiled to a predicate-register-accepting vector instruction. On platforms that do not support predicate registers a less efficient approach is applied: an instance of VectorMask<E> is mapped, where possible, to a compatible vector register, and in general a mask-accepting operation is composed of the equivalent unmasked operation and a blend operation.
To support cross-lane permutation operations, some vector operations accept shuffles, represented by the public abstract class VectorShuffle<E>. Each element in a shuffle is an int value corresponding to a lane index. A shuffle is a mapping of lane indexes, describing the movement of lane elements from a given vector to a result vector.
Similar to vectors and masks, instances of VectorShuffle<E> are instances of non-public concrete subclasses defined for each element type and length combination. The instance of VectorShuffle<E> used in an operation should have the same type and length as the vector instances involved in the operation.
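To illustrate, here is a sketch of a shuffle that reverses the four lanes of a 128-bit int vector; the method name is hypothetical and the array is assumed to hold at least four elements:

    import jdk.incubator.vector.*;

    static void shuffles(int[] a) {
        VectorSpecies<Integer> species = IntVector.SPECIES_128;   // 4 int lanes
        IntVector v = IntVector.fromArray(species, a, 0);

        // A shuffle is a mapping of lane indexes: result lane i takes its value
        // from source lane sourceIndexes[i].
        VectorShuffle<Integer> reverse = VectorShuffle.fromValues(species, 3, 2, 1, 0);

        // Cross-lane permutation: [a0, a1, a2, a3] becomes [a3, a2, a1, a0].
        IntVector reversed = v.rearrange(reverse);
    }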