docs/design/features/arm64-intrinsics.md
This document records proposed design decisions for the introduction of Arm64 intrinsics.

- X86, X64, ARM32 and ARM64 development
- ARM32 and ARM64 APIs to be similar

Use of intrinsics in general is a CoreCLR design decision to allow low-level, platform-specific optimizations.
At first glance, such a decision seems to violate the fundamental principles of .NET code running on any platform. However, the intent is not for the vast majority of apps to use such optimizations. The intended usage model is to allow library developers access to low level functions which enable optimization of key functions. As such the use is expected to be limited, but performance critical.
In general, individual intrinsics will be chosen to be fine-grained. Each will generally correspond to a single assembly instruction.
For various reasons, an individual CPU will have a specific set of supported instructions. For ARM64 the
set of supported instructions is identified by various ID_* System registers.
While these feature registers are only available for the OS to access, they provide
a logical grouping of instructions which are enabled/disabled together.

IsSupported

The C# API must provide a mechanism to determine which sets of instructions are supported.
The existing design uses a separate static class to group the methods which correspond to each
logical set of instructions. A single IsSupported property is included in each static class
to allow client code to alter control flow. The IsSupported properties are designed so that the JIT
can remove code on unused paths. ARM64 will use an identical approach.

PlatformNotSupported Exception

If client code calls an intrinsic which is not supported by the platform, a PlatformNotSupportedException
must be thrown.
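
The two requirements above can be sketched together. In this minimal sketch, `Crc32Sketch` and `Checksum` are hypothetical names (not proposed APIs), `IsSupported` is hard-coded to `false` to model a platform without the extension, and the fallback is an ordinary bitwise CRC-32C update, assumed here to be what the hardware path would compute.

```csharp
using System;

// Hypothetical stand-in for an intrinsic static class; not a real or proposed API.
public static class Crc32Sketch
{
    // A real implementation would let the JIT treat this as a constant so the
    // untaken branch in client code can be removed. Hard-coded for illustration.
    public static bool IsSupported => false;

    public static uint Crc32C(uint crc, byte data)
    {
        if (!IsSupported)
        {
            // Calling an unsupported intrinsic must throw.
            throw new PlatformNotSupportedException();
        }

        // The hardware path is intentionally left out of this sketch.
        throw new NotImplementedException();
    }
}

public static class Checksum
{
    // Client code alters control flow on IsSupported.
    public static uint Update(uint crc, byte data)
    {
        if (Crc32Sketch.IsSupported)
        {
            return Crc32Sketch.Crc32C(crc, data);
        }

        return SoftwareCrc32C(crc, data);
    }

    // Bitwise CRC-32C (Castagnoli) update using the reflected polynomial 0x82F63B78.
    public static uint SoftwareCrc32C(uint crc, byte data)
    {
        crc ^= data;
        for (int i = 0; i < 8; i++)
        {
            crc = (crc & 1) != 0 ? (crc >> 1) ^ 0x82F63B78u : crc >> 1;
        }

        return crc;
    }
}
```

Client code that always goes through `Checksum.Update` never reaches the throwing path, which is the usage model the IsSupported property is meant to enable.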
The JIT must use a set of flags corresponding to logical sets of instructions to alter code generation.
The VM must query the OS to populate the set of JIT flags. For the special altJit case, a means must be provided for setting the flags.
PAL must provide an OS abstraction layer.
Each OS must provide a mechanism for determining which sets of instructions are supported.
If the OS fails to provide a means to detect support for an instruction set extension, that extension must be treated as unsupported.
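
The "undetectable means unsupported" rule can be sketched as follows. The enum, the class, and the `osQuery` callback are all hypothetical names introduced for illustration; the callback returns `null` when the OS offers no way to detect an extension.

```csharp
using System;

// Hypothetical flag set; the real JIT flags are not defined by this document.
[Flags]
public enum InstructionSetFlagsSketch
{
    None = 0,
    Aes = 1,
    Crc32 = 2,
    Atomics = 4,
}

public static class JitFlagSketch
{
    // osQuery returns true or false when the OS can answer, and null when it
    // provides no detection mechanism for the extension.
    public static InstructionSetFlagsSketch Populate(Func<InstructionSetFlagsSketch, bool?> osQuery)
    {
        var flags = InstructionSetFlagsSketch.None;
        var candidates = new[]
        {
            InstructionSetFlagsSketch.Aes,
            InstructionSetFlagsSketch.Crc32,
            InstructionSetFlagsSketch.Atomics,
        };

        foreach (var candidate in candidates)
        {
            // Only an explicit "true" enables a flag; null and false both leave it clear.
            if (osQuery(candidate) == true)
            {
                flags |= candidate;
            }
        }

        return flags;
    }
}
```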
NOTE: Exceptions might be where:
For any intrinsic which may not be supported on all variants of a platform, crossgen method compilation should be designed to allow optimal code generation.
Initial implementation will simply trap so that the JIT is forced to generate optimal platform dependent code at runtime. Subsequent implementations may use different approaches.
x86, x64, ARM32 and ARM64 will follow similar naming conventions.

- System.Runtime.Intrinsics is used for type definitions useful across multiple platforms
- System.Runtime.Intrinsics.Arm is used for type definitions shared across ARM32 and ARM64 platforms
- System.Runtime.Intrinsics.Arm.Arm64 is used for type definitions for the ARM64 platform
  - ARM64 intrinsics will occur within this namespace
  - Although x86 and x64 share a common namespace, this document recommends a separate namespace
    for ARM32 and ARM64. This is because AARCH64 is a separate ISA from the AARCH32 Arm & Thumb
    instruction sets. It is not an ISA extension, but rather a new ISA. This is different from x64,
    which could be viewed as a superset of x86.
    - ARM64 and ARM32 instruction sets is different. It is controlled by
      different sets of System Registers.

For the convenience of the end user, it may be useful to add convenience APIs that expose functionality common across platforms and sets of platforms. These could be implemented in terms of the platform-specific functionality. These APIs are currently out of scope for this initial design document.
Within the System.Runtime.Intrinsics.Arm.Arm64 namespace there will be a separate static class for each
logical set of instructions.
The sets will be chosen to match the granularity of the ARM64 ID_* register fields.
The table below documents the set of known extensions, their identification, and their recommended intrinsic class names.
| ID Register | Field | Values | Intrinsic static class name |
|---|---|---|---|
| N/A | N/A | N/A | Base |
| ID_AA64ISAR0_EL1 | AES | (1b, 10b) | Aes |
| ID_AA64ISAR0_EL1 | Atomic | (10b) | Atomics |
| ID_AA64ISAR0_EL1 | CRC32 | (1b) | Crc32 |
| ID_AA64ISAR1_EL1 | DPB | (1b) | Dcpop |
| ID_AA64ISAR0_EL1 | DP | (1b) | Dp |
| ID_AA64ISAR1_EL1 | FCMA | (1b) | Fcma |
| ID_AA64PFR0_EL1 | FP | (0b, 1b) | Fp |
| ID_AA64PFR0_EL1 | FP | (1b) | Fp16 |
| ID_AA64ISAR1_EL1 | JSCVT | (1b) | Jscvt |
| ID_AA64ISAR1_EL1 | LRCPC | (1b) | Lrcpc |
| ID_AA64ISAR0_EL1 | AES | (10b) | Pmull |
| ID_AA64PFR0_EL1 | RAS | (1b) | Ras |
| ID_AA64ISAR0_EL1 | SHA1 | (1b) | Sha1 |
| ID_AA64ISAR0_EL1 | SHA2 | (1b, 10b) | Sha2 |
| ID_AA64ISAR0_EL1 | SHA3 | (1b) | Sha3 |
| ID_AA64ISAR0_EL1 | SHA2 | (10b) | Sha512 |
| ID_AA64PFR0_EL1 | AdvSIMD | (0b, 1b) | Simd |
| ID_AA64PFR0_EL1 | AdvSIMD | (1b) | SimdFp16 |
| ID_AA64ISAR0_EL1 | RDM | (1b) | SimdV81 |
| ID_AA64ISAR0_EL1 | SM3 | (1b) | Sm3 |
| ID_AA64ISAR0_EL1 | SM4 | (1b) | Sm4 |
| ID_AA64PFR0_EL1 | SVE | (1b) | Sve |
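
As an illustration of how the table could drive detection, the sketch below decodes the ID_AA64ISAR0_EL1 rows only. The class and method names are hypothetical. The 4-bit field offsets are taken from the Arm architecture's description of ID_AA64ISAR0_EL1 rather than from this document, so treat them as an assumption to verify against the specification. Field values are matched exactly as the table lists them.

```csharp
using System;
using System.Collections.Generic;

public static class IdRegisterSketch
{
    // (intrinsic class name, low bit of the 4-bit field, field values that enable the class).
    // Bit offsets are an assumption taken from the Arm ID_AA64ISAR0_EL1 layout.
    private static readonly (string Name, int Shift, int[] Values)[] Isar0Rows =
    {
        ("Aes", 4, new[] { 1, 2 }),
        ("Atomics", 20, new[] { 2 }),
        ("Crc32", 16, new[] { 1 }),
        ("Dp", 44, new[] { 1 }),
        ("Pmull", 4, new[] { 2 }),
        ("Sha1", 8, new[] { 1 }),
        ("Sha2", 12, new[] { 1, 2 }),
        ("Sha3", 32, new[] { 1 }),
        ("Sha512", 12, new[] { 2 }),
        ("SimdV81", 28, new[] { 1 }),
        ("Sm3", 36, new[] { 1 }),
        ("Sm4", 40, new[] { 1 }),
    };

    // Returns the intrinsic class names enabled by a raw ID_AA64ISAR0_EL1 value.
    public static List<string> ClassesFromIsar0(ulong isar0)
    {
        var result = new List<string>();
        foreach (var row in Isar0Rows)
        {
            int field = (int)((isar0 >> row.Shift) & 0xF);
            if (Array.IndexOf(row.Values, field) >= 0)
            {
                result.Add(row.Name);
            }
        }

        return result;
    }
}
```

Note that one register field can enable two classes: an AES field value of 10b enables both Aes and Pmull, matching the two table rows that share that field.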
The Base, Simd, and Fp classes will together contain the bulk of the ARM64 intrinsics. Most other extensions
will only add a few instructions, so they should be simpler to review.
The Base static class is used to represent any intrinsic which is guaranteed to be implemented on all
ARM64 platforms. This set will include general purpose instructions. For example, this would include intrinsics
such as LeadingZeroCount and LeadingSignCount.
As further extensions are released, this set of intrinsics will grow.
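
For reference, the intended semantics of the two example intrinsics can be modelled in portable C#. This is only a software sketch: `BaseSketch` is a hypothetical name, and the model assumes the 32-bit forms of the ARM64 CLZ and CLS instructions. `BitOperations.LeadingZeroCount` is the existing .NET helper.

```csharp
using System;
using System.Numerics;

// Hypothetical software model; not the proposed Base class itself.
public static class BaseSketch
{
    // Number of consecutive zero bits starting at the most significant bit (CLZ).
    public static int LeadingZeroCount(uint value) => BitOperations.LeadingZeroCount(value);

    // Number of consecutive bits following the sign bit that equal it (CLS).
    // XOR with the arithmetic right shift marks every position where adjacent
    // bits differ; the top bit of that result is always zero, hence the -1.
    public static int LeadingSignCount(int value) =>
        BitOperations.LeadingZeroCount((uint)(value ^ (value >> 1))) - 1;
}
```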
Intrinsics will be named to describe functionality. Names will not correspond to specific named assembly instructions.
Where precedent exists for common operations within the System.Runtime.Intrinsics.X86 namespace, identical method
names will be chosen: Add, Multiply, Load, Store ...
Where ARM naming convention differs substantially from XARCH, ARM naming conventions will sometimes be preferred.
For instance:

- ARM uses Replicate or Duplicate rather than X86 Broadcast.
- ARM uses Across rather than X86 Horizontal.

These will need to be reviewed on a case-by-case basis.
It is also worth noting System.Runtime.Intrinsics.X86 naming conventions will include the suffix Scalar for
operations which take vector argument(s), but contain an implicit cast(s) to the base type and therefore operate only
on the first item of the argument vector(s).
Intrinsic methods will typically use a standard set of argument and return types:

- byte, sbyte, short, ushort, int, uint, long, ulong
- double, single, System.Half
- Vector128<T>, Vector64<T>
- ValueTuple<> for return types returning multiple values

It is proposed to add the Vector64<T> type. Most ARM64 instructions support 8 byte and 16 byte forms. 8 byte
operations can execute faster with less power on some platforms. So adding Vector64<T> will allow exposing the full
flexibility of the instruction set and allow for optimal usage.
Some intrinsics will need to produce multiple results. The most notable are the structured load operations LD2,
LD3, LD4 ... For these operations it is proposed that the intrinsic API return a ValueTuple<> of Vector64<T> or
Vector128<T>.
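
A possible shape for such an API is sketched below as a software model of a two-register structured load. The method name `LoadVector64x2` and the tuple element names are hypothetical; only `Vector64<T>`, `Vector64.Create`, and `ValueTuple` are existing .NET types and methods. The model de-interleaves 16 bytes the way LD2 does: even-indexed elements go to the first vector, odd-indexed elements to the second.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class StructuredLoadSketch
{
    // Hypothetical API shape: one call, two results returned as a ValueTuple.
    public static (Vector64<byte> Even, Vector64<byte> Odd) LoadVector64x2(ReadOnlySpan<byte> source)
    {
        if (source.Length < 16)
        {
            throw new ArgumentException("At least 16 bytes are required.", nameof(source));
        }

        // De-interleave: elements 0, 2, 4, ... and elements 1, 3, 5, ...
        var even = Vector64.Create(
            source[0], source[2], source[4], source[6],
            source[8], source[10], source[12], source[14]);
        var odd = Vector64.Create(
            source[1], source[3], source[5], source[7],
            source[9], source[11], source[13], source[15]);

        return (even, odd);
    }
}
```

Callers can deconstruct the result directly, for example `var (even, odd) = StructuredLoadSketch.LoadVector64x2(buffer);`.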
Some assembly instructions require an immediate encoded directly in the assembly instruction. These need to be constant at JIT time.
While the discussion is still on-going, consensus seems to be that any intrinsic must function correctly even when its arguments are not constant.
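
One way an intrinsic could remain correct for non-constant arguments is to dispatch over every legal immediate value. The sketch below is only a model of that idea: `ImmediateSketch.Extract` is a hypothetical method, and throwing for out-of-range values is an assumption, since the document does not settle that behavior.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class ImmediateSketch
{
    // Hypothetical intrinsic: "index" models an operand that the assembly
    // instruction encodes as an immediate.
    public static byte Extract(Vector64<byte> vector, byte index)
    {
        // Every case passes a literal, so each branch could map to a single
        // instruction encoding even though "index" is not constant here.
        switch (index)
        {
            case 0: return vector.GetElement(0);
            case 1: return vector.GetElement(1);
            case 2: return vector.GetElement(2);
            case 3: return vector.GetElement(3);
            case 4: return vector.GetElement(4);
            case 5: return vector.GetElement(5);
            case 6: return vector.GetElement(6);
            case 7: return vector.GetElement(7);
            default: throw new ArgumentOutOfRangeException(nameof(index));
        }
    }
}
```

When the argument is a JIT-time constant, only one branch is live and the rest can be discarded, so the constant case keeps its single-instruction cost.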
static class will

- System Register Field and Value from ARM specification.
- ARM64 assembly instruction
- Add, Multiply, Load, Store
- ARM64 intrinsics are mostly absent or undocumented so
  initially this will not be necessary for ARM64

As rough guidelines for order of implementation:
Intrinsics will extend the API of CoreCLR. They will need to follow standard API review practices.
Initial XArch intrinsics are proposed to be added to the netcoreapp2.1 Target Framework. ARM64 intrinsics will
be in similar Target Frameworks as the XArch intrinsics.
Each review will identify the Target Framework API version where the API will be extended and released.

static class

Given the need to add hundreds or thousands of intrinsics, it will be helpful to review incrementally.
A separate GitHub Issue will typically be created for the review of each intrinsic static class.
When the static class exceeds a few dozen methods, it is desirable to break the review into smaller more manageable
pieces.
The extensive set of ARM64 assembly instructions makes reviewing and implementing an exhaustive set a long process.
To facilitate incremental progress, the initial intrinsic API for a given static class need not be exhaustive.

static class

- IsSupported must represent the state of an entire intrinsic static class for a given Target Framework.
- static class is included in its Target Framework release

As intrinsic support is added, test coverage must be extended to provide basic testing.
Some ARM64 instructions will require allocation of contiguous blocks of registers. These are likely limited to load and store multiple instructions.
It is not clear whether this is a new LSRA feature and, if it is, how much complexity it will introduce into the LSRA.
For intrinsic method calls, these vector types will implicitly be treated as pass by vector register.
For other calls, ARM64 ABI conventions must be followed. For purposes of the ABI calling conventions, these vector
types will be treated as a composite struct type containing a contiguous array of T. They will need to follow standard
struct argument and return passing rules.
This document will refer to half precision floating point as Half.

- Half type to simplify storage and improve processing time.
- CIL in general do not have general support for a Half type
- Half intrinsics
- System.Half to support this request
  https://github.com/dotnet/runtime/issues/936
- Half features will be adjusted based on System.Half proposal

Half

Half support is currently outside the scope of the initial design proposal. It is discussed below only for
introductory purposes.
ARM64 supports two half precision floating point formats
The two formats are similar. IEEE-754 has support for Infinity and NaN and therefore has a somewhat smaller range. IEEE-754 should be preferred.
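
The range difference can be made concrete with a small decoder. This sketch assumes the second format is the Arm alternative half-precision format, in which the all-ones exponent encodes ordinary normalized values instead of Infinity and NaN. `HalfFormatSketch` and `Decode` are hypothetical names, and the result is returned as a `double` to avoid depending on any `Half` type.

```csharp
using System;

public static class HalfFormatSketch
{
    // Decodes a 16-bit pattern: 1 sign bit, 5 exponent bits, 10 fraction bits.
    // alternativeFormat = true models the assumed Arm alternative format.
    public static double Decode(ushort bits, bool alternativeFormat)
    {
        int sign = (bits >> 15) & 1;
        int exponent = (bits >> 10) & 0x1F;
        int fraction = bits & 0x3FF;

        double magnitude;
        if (exponent == 0)
        {
            // Zero and subnormal values.
            magnitude = fraction * Math.Pow(2, -24);
        }
        else if (exponent == 31 && !alternativeFormat)
        {
            // IEEE-754 reserves the top exponent for Infinity and NaN.
            magnitude = fraction == 0 ? double.PositiveInfinity : double.NaN;
        }
        else
        {
            // Normalized values; in the alternative format this includes exponent 31.
            magnitude = (1 + fraction / 1024.0) * Math.Pow(2, exponent - 15);
        }

        return sign == 0 ? magnitude : -magnitude;
    }
}
```

Under these assumptions the largest finite IEEE-754 value is the pattern 0x7BFF, while the alternative format continues up to 0x7FFF, which is where its extra range comes from.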
ARM64 baseline support for Half is limited. The following types of operations are supported

- Float
- Vector128<Half> to two Vector128<Float>
- Vector128<Float> to Vector128<Half>

The optional ARMv8.2-FP16 extension adds support for

- Half types
- Half types

These correspond to the proposed static classes Fp16 and SimdFp16

Half and ARM64 ABI

Any complete Half implementation must conform to the ARM64 ABI.
The proposed System.Half type must be treated as a floating point type for purposes of the ARM64 ABI.
As an argument it must be passed in a floating point register.
As a structure member, it must be treated as a floating point type and enter into the HFA determination logic.
Test cases must be written and conformance must be demonstrated.
SVE, the Scalable Vector Extension, introduces its own complexity.
The extension adds:

- Z0-Z31 scalable vector registers. These overlay existing vector registers. Each scalable vector
  register has a platform-specific length
- P0-P15 predicate registers. Each predicate register has a platform-specific length which is
  1/8th of the scalable vector length.

Therefore implementation will not be trivial.
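
The register-length relationship is simple arithmetic, sketched below. The sketch assumes that legal SVE vector lengths are the multiples of 128 bits from 128 up to 2048, which is taken from the SVE architecture rather than stated outright in this document; `SveLengthSketch` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;

public static class SveLengthSketch
{
    public const int MinVectorBits = 128;   // assumption: smallest legal SVE vector length
    public const int MaxVectorBits = 2048;  // assumption: largest legal SVE vector length

    // Enumerates each legal vector length with its predicate register length,
    // which is 1/8th of the vector length (one predicate bit per vector byte).
    public static IEnumerable<(int VectorBits, int PredicateBits)> LegalLengths()
    {
        for (int bits = MinVectorBits; bits <= MaxVectorBits; bits += 128)
        {
            yield return (bits, bits / 8);
        }
    }
}
```

Under that assumption there are 16 legal lengths, so an API with one fixed-size vector type per length would need 16 forms of every method.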

Vector<T>, Vector128<T>, Vector256<T>, ... Vector2048<T>, SVE<T> ... in user interface
design?

- Vector128<T>, Vector256<T>, ... Vector2048<T> is the current default proposal.
  Having 16 forms of every API may create issues for framework and client developers.
  However, generics may provide some/sufficient relief to make this acceptable.
- Vector<T> may be preferred if SVE will also be used for FEATURE_SIMD
- SVE<T> may be preferred if SVE will not be used for FEATURE_SIMD

Given lack of available hardware and a lack of thorough understanding of the specification:

Deprecation of instructions should be relatively rare.

- SetThrowOnDeprecated() interface to allow developers to find these issues

The following sections document APIs which have completed the API review process.
Until each API is approved it shall be marked "TBD Not Approved"

Base

TBD Not approved

Aes

TBD Not approved

Atomics

TBD Not approved

Crc32

TBD Not approved

Dcpop

TBD Not approved

Dp

TBD Not approved

Fcma

TBD Not approved

Fp

TBD Not approved

Fp16

TBD Not approved

Jscvt

TBD Not approved

Lrcpc

TBD Not approved

Pmull

TBD Not approved

Ras

TBD Not approved

Sha1

TBD Not approved

Sha2

TBD Not approved

Sha3

TBD Not approved

Sha512

TBD Not approved

Simd

TBD Not approved

SimdFp16

TBD Not approved

SimdV81

TBD Not approved

Sm3

TBD Not approved

Sm4

TBD Not approved

Sve

TBD Not approved