.. _Porting SIMD code:

.. role:: raw-html(raw)
   :format: html
Emscripten supports the `WebAssembly SIMD <https://github.com/webassembly/simd/>`_ feature. There are five different ways to leverage WebAssembly SIMD in your C/C++ programs:

1. Let LLVM's autovectorization passes vectorize scalar code automatically.
2. Use the GCC/Clang SIMD Vector Extensions (``__attribute__((vector_size(16)))``).
3. Use the WebAssembly SIMD intrinsics (``#include <wasm_simd128.h>``).
4. Compile existing SIMD code that uses the x86 SSE family of intrinsics (``#include <*mmintrin.h>``).
5. Compile existing SIMD code that uses the ARM NEON intrinsics (``#include <arm_neon.h>``).

These techniques can be freely combined in a single program.
To enable any of the five types of SIMD above, pass the WebAssembly-specific ``-msimd128`` flag at compile time. This will also turn on LLVM's autovectorization passes. If that is not desirable, additionally pass the flags ``-fno-vectorize -fno-slp-vectorize`` to disable the autovectorizer. See `Auto-Vectorization in LLVM <https://llvm.org/docs/Vectorizers.html>`_ for more information.
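For example, a plain scalar loop like the following is a candidate for autovectorization; the loop body is ordinary C/C++ and needs no source changes (the function name and data here are illustrative only):

.. code-block:: cpp

   #include <stdio.h>

   // With `emcc -O3 -msimd128`, LLVM's autovectorizer can turn this loop
   // into v128 lane-wise adds; without -msimd128 it compiles to scalar code.
   void add_arrays(const float *a, const float *b, float *out, int n) {
     for (int i = 0; i < n; ++i)
       out[i] = a[i] + b[i];
   }

   int main() {
     float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
     float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
     float out[8];
     add_arrays(a, b, out, 8);
     for (int i = 0; i < 8; ++i)
       printf("%.0f ", out[i]);  // prints "9 " eight times
     printf("\n");
   }

The generated code can be inspected (e.g. with ``wasm-objdump``) to confirm whether the loop was vectorized.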
WebAssembly SIMD is supported by:

- Chrome ≥ 91 (May 2021),
- Firefox ≥ 89 (June 2021),
- Safari ≥ 16.4 (March 2023) and
- Node.js ≥ 16.4 (June 2021).
See the `WebAssembly Roadmap <https://webassembly.org/roadmap/>`_ for details about other VMs.

An upcoming `Relaxed SIMD proposal <https://github.com/WebAssembly/relaxed-simd/tree/main/proposals/relaxed-simd>`_ will add more SIMD instructions to WebAssembly.
At the source level, the `GCC/Clang SIMD Vector Extensions <https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html>`_ can be used; they are lowered to WebAssembly SIMD instructions where possible. This enables developers to create custom wide vector types via typedefs, use the arithmetic operators (``+``, ``-``, ``*``, ``/``) on the vectorized types, and access individual lanes via the ``vector[i]`` notation. However, the `GCC vector built-in functions <https://gcc.gnu.org/onlinedocs/gcc/x86-Built-in-Functions.html>`_ are not available. Instead, use the WebAssembly SIMD intrinsics functions below.
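As a short sketch of these extensions in use (the typedef name ``f32x4`` is arbitrary):

.. code-block:: cpp

   #include <stdio.h>

   // A custom 16-byte (4 x float) wide vector type via the Clang/GCC extension.
   typedef float f32x4 __attribute__((vector_size(16)));

   int main() {
     f32x4 a = {1.0f, 2.0f, 3.0f, 4.0f};
     f32x4 b = {10.0f, 20.0f, 30.0f, 40.0f};
     f32x4 c = a + b;  // lane-wise addition via the + operator
     c[0] *= 2.0f;     // individual lane access via the vector[i] notation
     printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);  // prints "22 22 33 44"
   }

When built with ``-msimd128``, such operations are lowered to WebAssembly SIMD instructions where possible.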
LLVM maintains a WebAssembly SIMD intrinsics header file that is provided with Emscripten and adds type definitions for the different supported vector types.
.. code-block:: cpp

   #include <wasm_simd128.h>
   #include <stdio.h>

   int main() {
   #ifdef __wasm_simd128__
     v128_t v1 = wasm_f32x4_make(1.2f, 3.4f, 5.6f, 7.8f);
     v128_t v2 = wasm_f32x4_make(2.1f, 4.3f, 6.5f, 8.7f);
     v128_t v3 = wasm_f32x4_add(v1, v2);
     // Prints "v3: [3.3, 7.7, 12.1, 16.5]"
     printf("v3: [%.1f, %.1f, %.1f, %.1f]\n",
            wasm_f32x4_extract_lane(v3, 0),
            wasm_f32x4_extract_lane(v3, 1),
            wasm_f32x4_extract_lane(v3, 2),
            wasm_f32x4_extract_lane(v3, 3));
   #endif
   }
The Wasm SIMD header can be browsed online at `wasm_simd128.h <https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/wasm_simd128.h>`_.
Pass the flag ``-msimd128`` at compile time to enable targeting WebAssembly SIMD intrinsics. C/C++ code can use the built-in preprocessor define ``#ifdef __wasm_simd128__`` to detect when building with WebAssembly SIMD enabled.

Pass ``-mrelaxed-simd`` to target WebAssembly Relaxed SIMD intrinsics. C/C++ code can use the built-in preprocessor define ``#ifdef __wasm_relaxed_simd__`` to detect when this target is active.
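A typical pattern uses the define to select between a Wasm SIMD path and a portable scalar fallback; the helper function ``sum4`` below is a hypothetical example, not part of any API:

.. code-block:: cpp

   #include <stdio.h>
   #ifdef __wasm_simd128__
   #include <wasm_simd128.h>
   #endif

   // Horizontal sum of four floats: Wasm SIMD when built with -msimd128,
   // plain scalar code everywhere else.
   float sum4(const float *p) {
   #ifdef __wasm_simd128__
     v128_t v = wasm_v128_load(p);
     // Fold the upper half onto the lower half, twice, then extract lane 0.
     v = wasm_f32x4_add(v, wasm_i32x4_shuffle(v, v, 2, 3, 0, 1));
     v = wasm_f32x4_add(v, wasm_i32x4_shuffle(v, v, 1, 0, 3, 2));
     return wasm_f32x4_extract_lane(v, 0);
   #else
     return p[0] + p[1] + p[2] + p[3];
   #endif
   }

   int main() {
     float data[4] = {1.5f, 2.5f, 3.0f, 3.0f};
     printf("%.1f\n", sum4(data));  // prints "10.0"
   }

The same source compiles with and without ``-msimd128``, producing identical results either way.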
When porting native SIMD code, it should be noted that, because of portability concerns, the WebAssembly SIMD specification does not expose access to all of the native x86/ARM SIMD instructions. In particular the following differences exist:

- Emscripten does not support x86 or any other native inline SIMD assembly or building ``.s`` assembly files, so all code should be written to use SIMD intrinsic functions or compiler vector extensions.
- WebAssembly SIMD does not have control over managing floating point rounding modes or handling denormals.
- Cache line prefetch instructions are not available; calls to these functions will compile, but are treated as no-ops.
- Asymmetric memory fence operations are not available; they are implemented as fully synchronous memory fences when SharedArrayBuffer is enabled (``-pthread``) and as no-ops when multithreading is not enabled (the default).
SIMD-related bug reports are tracked in the Emscripten bug tracker with the label `SIMD <https://github.com/emscripten-core/emscripten/issues?q=is%3Aopen+is%3Aissue+label%3ASIMD>`_.
When developing SIMD code to use WebAssembly SIMD, implementors should be aware of the semantic differences between the host hardware and WebAssembly semantics; as acknowledged in the WebAssembly design documentation, "`this sometimes will lead to poor performance <https://github.com/WebAssembly/design/blob/master/Portability.md#assumptions-for-efficient-execution>`_." The following list outlines some WebAssembly SIMD instruction behaviors to look out for when performance tuning:
Several Wasm SIMD operations are emulated with longer instruction sequences on x86 hardware:

- `5-11 x86 instructions in v8 <https://github.com/v8/v8/blob/c8672adeebb105c7636334b9931831bf1945f4ec/src/codegen/shared-ia32-x64/macro-assembler-shared-ia32-x64.cc#L427-L552>`_ (i.e. using 16x8 shifts).
- `6-12 x86 instructions in v8 <https://github.com/v8/v8/blob/c8672adeebb105c7636334b9931831bf1945f4ec/src/codegen/shared-ia32-x64/macro-assembler-shared-ia32-x64.cc#L996-L1057>`_.
- `v8's emulation <https://github.com/v8/v8/blob/c8672adeebb105c7636334b9931831bf1945f4ec/src/codegen/shared-ia32-x64/macro-assembler-shared-ia32-x64.cc#L202-L339>`_; if possible, use ``[f32x4|f64x2].[pmin|pmax]`` instead (1 x86 instruction).
- Emulated with `8-14 x86 instructions in v8 <https://github.com/v8/v8/blob/b6520eda5eafc3b007a5641b37136dfc9d92f63d/src/compiler/backend/x64/code-generator-x64.cc#L3035-L3062>`_.
- Emulated with `5-6 x86 instructions in v8 <https://github.com/v8/v8/blob/b6520eda5eafc3b007a5641b37136dfc9d92f63d/src/codegen/x64/macro-assembler-x64.cc#L2241-L2311>`_.
- Emulated with `8 x86 instructions in v8 <https://github.com/v8/v8/blob/b6520eda5eafc3b007a5641b37136dfc9d92f63d/src/compiler/backend/x64/code-generator-x64.cc#L2591-L2604>`_.
- Emulated with `10 x86 instructions in v8 <https://github.com/v8/v8/blob/b6520eda5eafc3b007a5641b37136dfc9d92f63d/src/compiler/backend/x64/code-generator-x64.cc#L2834-L2858>`_.

Emscripten supports compiling existing codebases that use x86 SSE instructions by passing the ``-msimd128`` flag, and additionally one of the following:
- ``-msse`` and ``#include <xmmintrin.h>``. Use ``#ifdef __SSE__`` to gate code.
- ``-msse2`` and ``#include <emmintrin.h>``. Use ``#ifdef __SSE2__`` to gate code.
- ``-msse3`` and ``#include <pmmintrin.h>``. Use ``#ifdef __SSE3__`` to gate code.
- ``-mssse3`` and ``#include <tmmintrin.h>``. Use ``#ifdef __SSSE3__`` to gate code.
- ``-msse4.1`` and ``#include <smmintrin.h>``. Use ``#ifdef __SSE4_1__`` to gate code.
- ``-msse4.2`` and ``#include <nmmintrin.h>``. Use ``#ifdef __SSE4_2__`` to gate code.
- ``-mavx`` and ``#include <immintrin.h>``. Use ``#ifdef __AVX__`` to gate code.
- ``-mavx2`` and ``#include <immintrin.h>``. Use ``#ifdef __AVX2__`` to gate code.

Currently only the SSE1, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, and AVX2 instruction sets are supported. Each of these instruction sets adds on top of the previous ones, so e.g. when targeting SSE3, the SSE1 and SSE2 instruction sets are also available.
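For instance, code gated on ``__SSE2__`` builds both under Emscripten (with ``-msimd128 -msse2``) and with native x86 compilers, and can fall back to scalar code elsewhere; the function below is an illustrative sketch, not part of any API:

.. code-block:: cpp

   #include <stdio.h>
   #ifdef __SSE2__
   #include <emmintrin.h>
   #endif

   // Add four 32-bit ints lane-wise, using SSE2 when __SSE2__ is defined.
   void add4i(const int *a, const int *b, int *out) {
   #ifdef __SSE2__
     __m128i va = _mm_loadu_si128((const __m128i *)a);
     __m128i vb = _mm_loadu_si128((const __m128i *)b);
     _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
   #else
     for (int i = 0; i < 4; ++i)
       out[i] = a[i] + b[i];
   #endif
   }

   int main() {
     int a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, out[4];
     add4i(a, b, out);
     printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  // prints "11 22 33 44"
   }

Under Emscripten the SSE2 intrinsics are implemented on top of Wasm SIMD, so both branches compute the same result.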
The following tables highlight the availability and expected performance of different SSE* intrinsics. This can be useful for understanding the performance limitations that the Wasm SIMD specification has when running on x86 hardware.

For detailed information on each SSE intrinsic function, visit the excellent `Intel Intrinsics Guide on SSE1 <https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE>`_.
The following legend is used to highlight the expected performance of various instructions:
Certain intrinsics in the table below are marked "virtual". This means that there does not actually exist a native x86 SSE instruction set opcode to implement them, but native compilers offer the function as a convenience. Different compilers might generate a different instruction sequence for these.
In addition to consulting the tables below, you can turn on diagnostics for slow, emulated functions by adding ``#define WASM_SIMD_COMPAT_SLOW``. This will print out warnings if you attempt to use any of the slow paths (corresponding to ❌ or 💣 in the legend).
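The macro should be defined before the intrinsics header is included, e.g.:

.. code-block:: cpp

   // Request diagnostics for slow, emulated SSE paths
   // (Emscripten-specific; has no effect with native compilers).
   #define WASM_SIMD_COMPAT_SLOW
   #include <xmmintrin.h>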
.. list-table:: x86 SSE intrinsics available via ``#include <xmmintrin.h>`` and ``-msse``
   :widths: 20 30
   :header-rows: 1
   :class: wrap-table-content
Recurring notes in this table: several load and store intrinsics perform an unaligned load/store on x86 CPUs; some loads and stores are emulated with scalar loads + shuffle; cache control hints are unavailable (no cache control in Wasm SIMD); the approximate reciprocal and reciprocal square root intrinsics are emulated with full precision div, div+shuffle, div+sqrt, or div+sqrt+shuffle sequences (see `simd/#3 <https://github.com/WebAssembly/simd/issues/3>`_); and the control/status register reads as ``_MM_ROUND_NEAREST`` with all exceptions masked (``0x1f80``).

⚫ The following extensions that the SSE1 instruction set brought to 64-bit wide MMX registers are not available:
Any code referencing these intrinsics will not compile.
The following table highlights the availability and expected performance of different SSE2 intrinsics. Refer to the `Intel Intrinsics Guide on SSE2 <https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE2>`_.
.. list-table:: x86 SSE2 intrinsics available via ``#include <emmintrin.h>`` and ``-msse2``
   :widths: 20 30
   :header-rows: 1
   :class: wrap-table-content
Recurring notes in this table: several load and store intrinsics perform an unaligned load/store on x86 CPUs; some loads are emulated with scalar loads + shuffle; the shift intrinsics are ✅ only if the shift count is an immediate constant; and cache control hints are unavailable (no cache control in Wasm SIMD).

⚫ The following extensions that the SSE2 instruction set brought to 64-bit wide MMX registers are not available:
Any code referencing these intrinsics will not compile.
The following table highlights the availability and expected performance of different SSE3 intrinsics. Refer to the `Intel Intrinsics Guide on SSE3 <https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE3>`_.
.. list-table:: x86 SSE3 intrinsics available via ``#include <pmmintrin.h>`` and ``-msse3``
   :widths: 20 30
   :header-rows: 1
   :class: wrap-table-content
The following table highlights the availability and expected performance of different SSSE3 intrinsics. Refer to the `Intel Intrinsics Guide on SSSE3 <https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSSE3>`_.
.. list-table:: x86 SSSE3 intrinsics available via ``#include <tmmintrin.h>`` and ``-mssse3``
   :widths: 20 30
   :header-rows: 1
   :class: wrap-table-content
⚫ The SSSE3 functions that deal with 64-bit wide MMX registers are not available:
Any code referencing these intrinsics will not compile.
The following table highlights the availability and expected performance of different SSE4.1 intrinsics. Refer to the `Intel Intrinsics Guide on SSE4.1 <https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE4_1>`_.
.. list-table:: x86 SSE4.1 intrinsics available via ``#include <smmintrin.h>`` and ``-msse4.1``
   :widths: 20 30
   :header-rows: 1
   :class: wrap-table-content
The following table highlights the availability and expected performance of different SSE4.2 intrinsics. Refer to the `Intel Intrinsics Guide on SSE4.2 <https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE4_2>`_.
.. list-table:: x86 SSE4.2 intrinsics available via ``#include <nmmintrin.h>`` and ``-msse4.2``
   :widths: 20 30
   :header-rows: 1
⚫ The SSE4.2 functions that deal with string comparisons and CRC calculations are not available:
Any code referencing these intrinsics will not compile.
The following table highlights the availability and expected performance of different AVX intrinsics. Refer to the `Intel Intrinsics Guide on AVX <https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX>`_.
.. list-table:: x86 AVX intrinsics available via ``#include <immintrin.h>`` and ``-mavx``
   :widths: 20 30
   :header-rows: 1
   :class: wrap-table-content
Only the 128-bit wide instructions from the AVX instruction set are listed. The 256-bit wide AVX instructions are emulated by two 128-bit wide instructions.
The following table highlights the availability and expected performance of different AVX2 intrinsics. Refer to the `Intel Intrinsics Guide on AVX2 <https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#avxnewtechs=AVX2>`_.
.. list-table:: x86 AVX2 intrinsics available via ``#include <immintrin.h>`` and ``-mavx2``
   :widths: 20 30
   :header-rows: 1
All the 128-bit wide instructions from the AVX2 instruction set are listed. Only a small part of the 256-bit wide AVX2 instruction set is listed; most of the 256-bit wide AVX2 instructions are emulated by two 128-bit wide instructions.
Emscripten supports compiling existing codebases that use ARM NEON by passing the ``-mfpu=neon`` directive to the compiler, and including the header ``<arm_neon.h>``.
In terms of performance, it is very important to note that only instructions which operate on 128-bit wide vectors are supported cleanly. This means that nearly any instruction which is not of a "q" variant (i.e. ``vaddq`` as opposed to ``vadd``) will be scalarized.
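A minimal sketch preferring a 128-bit "q" variant, with a scalar fallback; it is guarded on ``__ARM_NEON``, which native ARM compilers define, and the helper name ``add4f`` is illustrative only:

.. code-block:: cpp

   #include <stdio.h>
   #ifdef __ARM_NEON
   #include <arm_neon.h>
   #endif

   // Lane-wise add of four floats; vaddq_f32 is a 128-bit "q" variant and
   // so maps cleanly onto Wasm SIMD, unlike the 64-bit vadd_f32.
   void add4f(const float *a, const float *b, float *out) {
   #ifdef __ARM_NEON
     vst1q_f32(out, vaddq_f32(vld1q_f32(a), vld1q_f32(b)));
   #else
     for (int i = 0; i < 4; ++i)
       out[i] = a[i] + b[i];
   #endif
   }

   int main() {
     float a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1}, out[4];
     add4f(a, b, out);
     printf("%.0f %.0f %.0f %.0f\n", out[0], out[1], out[2], out[3]);  // prints "5 5 5 5"
   }

Under Emscripten, the NEON header is provided via the SIMDe translation layer described below.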
These are pulled from the `SIMDe repository on GitHub <https://github.com/simd-everywhere/simde>`_. To update Emscripten with the latest SIMDe version, run ``tools/simde_update.py``.
The following table highlights the availability of various 128-bit wide intrinsics.
Similarly to above, the following legend is used:
For detailed information on each intrinsic function, refer to the `NEON Intrinsics Reference <https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics>`_.
For the latest NEON intrinsics implementation status, refer to the `SIMDe implementation status <https://github.com/simd-everywhere/implementation-status/blob/main/neon.md>`_.
.. list-table:: NEON Intrinsics
   :widths: 20 30
   :header-rows: 1
   :class: wrap-table-content