Back to Bitnet

BitNet CPU Inference Optimization

src/README.md

latest7.3 KB
Original Source

BitNet CPU Inference Optimization

This update provides significant performance improvements for BitNet inference on CPU through paralleled kernel implementations, native I2_S GEMM/GEMV support, configurable tiling block size and embedding quantization.

Update

  • Parallel Weight & Activation Computation
    Implemented parallel processing of weights and activations in the W2A8 vet_dot kernel, achieving improved throughput on both x86 and ARM architectures.

  • Native I2_S GEMM & GEMV Support
    Integrated I2_S GEMM and GEMV operations into ggml library, making them fully compatible with the llama.cpp architecture. This enables seamless integration with existing inference pipelines.

  • Configurable Tiling & Parallelism
    Introduced configurable GEMM & GEMV block sizes and parallelism levels, allowing performance fine-tuning for different CPU architectures.

  • Embedding Quantization
    Added support for embedding layer quantization with Q6_K format, reducing memory footprint and improving inference speed while maintaining high accuracy.

Usage

Configuration Options

The include/gemm-config.h file controls kernel behavior:

c
#define ROW_BLOCK_SIZE 4
#define COL_BLOCK_SIZE 128
#define PARALLEL_SIZE 4

Modify these values based on your CPU cache size and architecture for optimal performance. Users can fine-tune performance on their machine through include/gemm-config.h.

Enabling Embedding Quantization

To use embedding quantization for additional speedup:

Using setup_env.py:

bash
python setup_env.py --quant-embd

This automatically converts embeddings to Q6_K format.

Manual conversion:

bash
build/bin/llama-quantize --token-embedding-type Q6_K models/BitNet-b1.58-2B-4T/ggml-model-f32.gguf models/BitNet-b1.58-2B-4T/ggml-model-i2_s-embed-q6_k.gguf I2_S 1 1

Optimizations

1. Weight & Activation Parallelism

The kernel implements two parallelization strategies:

  • Weight Parallel: Processes multiple weight rows/columns in a single kernel call, reducing kernel launch overhead.

  • Activation Parallel: Built on top of weight parallel, amortizes the I2_S weight unpacking cost across multiple activation elements.

Recommendation: For I2_S quantization format, activation parallel is recommended due to the unpack operation benefits. The current kernel defaults to activation parallel.

Kernel Performance Comparison:

<div align="center">

Test configuration: AMD EPYC 7V13 (x86), 1 threads, time in milliseconds (mean±std)

Matrix SizeNo ParallelWeight ParallelActivation Parallel
[1, 2048] × [2048, 2048]0.075±0.0120.058±0.0070.076±0.011
[32, 2048] × [2048, 2048]2.400±0.0411.599±0.0201.202±0.018
[128, 2048] × [2048, 2048]10.820±0.0396.458±0.1685.805±0.039
[256, 2048] × [2048, 2048]21.669±0.08012.739±0.18311.882±0.040
[512, 2048] × [2048, 2048]43.257±0.08325.680±0.33523.342±0.082
[2048, 2048] × [2048, 2048]173.175±0.214103.112±0.55293.276±0.612
[128, 2048] × [2048, 8192]43.345±0.09025.541±0.23923.528±0.052
[128, 8192] × [8192, 2048]38.085±0.16223.866±0.09622.569±0.132
</div>

2. GEMM/GEMV Integration with llama.cpp

Integrated I2_S quantization format into llama.cpp's compute graph:

  • GEMV Operations: Optimized matrix-vector multiplication for token generation.
  • GEMM Operations: Efficient matrix-matrix multiplication for prompt processing.
  • Tiling Strategy: Configurable block sizes for optimal cache utilization.

3. Configuration Fine-tuning

Fine-tuning kernel parameters for optimal performance on specific hardware:

Example Configuration (x86, AMD EPYC 7V13):

  • Method: Activation Parallel
  • Threads: 8
  • Workload: 128 prompt tokens (pp128)

Fine-tuning Parameters:

  • Parallelism Degree: [2, 4, 8]
  • Row Block Size: [2, 4, 8, 16, 32]
  • Column Block Size: [32, 64, 128, 256, 512, 1024]

Fine-tuning Results:

<div align="center">

Shows throughput (tokens/s) for various configurations.

</div>

Optimal Configuration: Under this setup (x86, 8 threads, pp128), the best performance is achieved with parallelism degree = 4, row block size = 4, and column block size = 128.

4. Embedding Quantization

Evaluated multiple embedding quantization formats to balance memory usage, model quality, and inference speed:

Perplexity Comparison:

<div align="center">

Test configuration: BitNet-b1.58-2B-4T, TG128

Embedding TypeWikitextPTBLAMBADAIMDBAG NEWS
F3217.1090±0.127833.0858±0.488643.2850±0.636329.3016±0.289036.7686±0.3920
F1617.1090±0.127833.0858±0.488643.2850±0.636329.3016±0.289036.7686±0.3920
Q8_017.1197±0.128033.1181±0.489343.2891±0.636429.3133±0.289236.7740±0.3920
Q6_K17.1487±0.128233.2203±0.491443.3046±0.636229.3491±0.289736.7972±0.3921
Q5_017.2379±0.128833.2439±0.490743.4631±0.637929.5481±0.292036.8539±0.3924
Q4_017.3529±0.130033.7754±0.500144.4552±0.655930.1044±0.297837.3985±0.3997
Q3_K17.6434±0.132034.3914±0.508945.4591±0.673530.8476±0.306939.5692±0.4259
I2_SN/AN/AN/AN/AN/A

*N/A indicates model failure due to extreme quantization.

</div>

Inference Speed Comparison:

<div align="center">

Token generation throughput (tg128) for different embedding quantization types.

</div>

Recommendation: Based on comprehensive evaluation of memory footprint, perplexity preservation, and inference speed, Q6_K is selected as the optimal embedding quantization format.

Performance

Comparison of optimized parallel kernels vs. original implementation:

Test Configuration:

  • Model: BitNet-b1.58-2B-4T
  • Hardware: AMD EPYC 7V13
  • Threads: 1 / 2 / 4 / 8 / 12 / 16
  • Test: 128 prompt tokens (pp128) + 128 generated tokens (tg128)
  • Method: Activation Parallel
<div align="center"> </div>

Test Configuration:

  • Model: BitNet-b1.58-2B-4T
  • Hardware: Intel i7-13800H
  • Threads: 1 / 2 / 4 / 6
  • Test: 128 prompt tokens (pp128) + 128 generated tokens (tg128)
  • Method: Activation Parallel
<div align="center"> </div>

Test Configuration:

  • Model: BitNet-b1.58-2B-4T
  • Hardware: Cobalt 100
  • Threads: 1 / 2 / 4 / 8
  • Test: 128 prompt tokens (pp128) + 128 generated tokens (tg128)
  • Method: Activation Parallel
<div align="center"> </div>

Technical Details

Key Files Modified

  • src/ggml-bitnet-mad.cpp: Parallel kernel implementations
  • 3rdparty/llama.cpp/ggml/src/ggml.c: GEMM/GEMV integration
  • include/gemm-config.h: Configuration file

Supported Architectures

  • ✅ x86-64 with AVX2
  • ✅ ARM with NEON
  • ✅ ARM with DOTPROD extension