doc/en/benchmark.md
For a quick check, we used a simple Python script, available here, to assess the precision of our ktransformers project. Every configuration was evaluated on the same dataset, shuffled identically and limited to the first 1,000 data points, so that we could compare our implementation across a variety of CPU kernels, MLA kernels, and quantization formats.
We selected the DeepSeek-V3 model in its bf16, int8, and q4km versions for this test. The MMLU dataset, which can be found here, was used (we selected all datasets and shuffled them with a fixed random seed).
Note, however, that we skipped the few-shot part and used only the first 1,000 data points for a quick check, so the results may not be consistent with the technical report of DeepSeek-V3. Tests of R1 and further tests are ongoing.
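The sampling step described above (shuffle with a fixed seed, then keep the first 1,000 items) can be sketched as follows. This is a minimal illustration, not the project's script: the seed value `42`, the function name `sample_questions`, and the placeholder question list are all assumptions.

```python
import random


def sample_questions(questions, seed=42, limit=1000):
    """Shuffle a copy of `questions` deterministically and keep the first `limit`.

    The seed value is a placeholder; the actual script may use a different one.
    """
    shuffled = list(questions)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # private RNG, independent of global state
    return shuffled[:limit]


# Placeholder standing in for the merged MMLU subsets.
all_questions = [f"q{i}" for i in range(14000)]

subset_a = sample_questions(all_questions)
subset_b = sample_questions(all_questions)
print(len(subset_a), subset_a == subset_b)
```

Using a dedicated `random.Random(seed)` instance keeps the order reproducible across runs and configurations, which is what makes the scores in different rows comparable.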
To verify our results, we used a cloud service platform (the Siliconflow column in the result table) as the baseline. All tests were conducted using the same script and datasets, allowing us to make a preliminary assessment of our project's precision.
We set the argument `temperature=0.6` and, to simplify the test process, skipped the few-shot part and used the following prompt: `There is a single choice question. Answer the question by replying A, B, C, D. No other answers are accepted. Just the letter. \nQuestion: {question}\nA. {option_a}\nB. {option_b}\nC. {option_c}\nD. {option_d}\nAnswer: `. For more details, please refer to the script.
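As a rough sketch of how such a prompt might be filled in and the reply scored, consider the snippet below. The template text is taken from the prompt above; the helper names `build_prompt` and `extract_choice`, and the rule of reading only the first character of the reply, are assumptions and may differ from what the script does.

```python
PROMPT_TEMPLATE = (
    "There is a single choice question. Answer the question by replying A, B, C, D. "
    "No other answers are accepted. Just the letter. \n"
    "Question: {question}\nA. {option_a}\nB. {option_b}\nC. {option_c}\nD. {option_d}\n"
    "Answer: "
)


def build_prompt(question, options):
    """Fill the zero-shot template with one question and its four options."""
    option_a, option_b, option_c, option_d = options
    return PROMPT_TEMPLATE.format(
        question=question,
        option_a=option_a,
        option_b=option_b,
        option_c=option_c,
        option_d=option_d,
    )


def extract_choice(reply):
    """Return the leading letter A-D of a reply, or None (hypothetical parsing rule)."""
    first = reply.strip()[:1].upper()
    return first if first in {"A", "B", "C", "D"} else None


prompt = build_prompt("What is 2 + 2?", ["3", "4", "5", "6"])
print(prompt)
print(extract_choice(" B."))
```

A reply that does not start with one of the four letters is treated as unanswered here; how the real script handles such replies is not specified above.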
Because we tested only 1,000 cases, the results give only a preliminary judgment, and some fluctuation in the scores is to be expected.
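To get a feel for how much fluctuation 1,000 cases can produce, one can estimate the standard error of an accuracy score. The sketch below assumes each question is an independent trial (a binomial model), which is a simplification; it also ignores the extra run-to-run variance introduced by sampling at `temperature=0.6`. The accuracy of 0.816 is the baseline MMLU score from the result table.

```python
import math


def accuracy_standard_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


se = accuracy_standard_error(0.816, 1000)
print(f"standard error: {100 * se:.2f} points")
print(f"approx. 95% interval: +/- {100 * 1.96 * se:.2f} points")
```

Under these assumptions the standard error is a little over one percentage point, so differences of a point or two between configurations are within the expected noise.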
The bf16 model of DeepSeek-V3 is available here (you can convert it to gguf with llama.cpp). The q4km model can be found here.
The optimization YAML file is located here. For the GEMM Kernel, you can change KLinearMarlin to KLinearTorch.
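The kernel switch amounts to a text substitution in the YAML file. The snippet below is a minimal sketch of that substitution on an illustrative fragment; the fragment's key names (`generate_op`, `prefill_op`) and the helper `switch_gemm_kernel` are assumptions for demonstration, so check them against the actual optimization file before relying on them.

```python
def switch_gemm_kernel(yaml_text, old="KLinearMarlin", new="KLinearTorch"):
    """Replace every occurrence of one kernel name and report how many were changed."""
    count = yaml_text.count(old)
    return yaml_text.replace(old, new), count


# Illustrative fragment only; the real file contains many more rules.
sample_rule = """\
replace:
  kwargs:
    generate_op: "KLinearMarlin"
    prefill_op: "KLinearTorch"
"""

updated, changed = switch_gemm_kernel(sample_rule)
print(changed)
print(updated)
```

Reporting the number of replacements is a cheap sanity check: a count of zero means the file did not contain the expected kernel name.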
To switch the MLA Kernel from Triton to Torch, you can check and modify this file, specifically by using the forward_windows method.
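The idea of forcing the Torch path can be illustrated with a toy class. Everything here except the method name `forward_windows` is hypothetical: the class, the `forward_triton` method, and the `force_torch` flag are stand-ins, and the sketch uses a flag where the actual change edits the dispatch code directly.

```python
import platform


class AttentionSketch:
    """Toy stand-in for an attention module with two MLA code paths (hypothetical)."""

    def __init__(self, force_torch=False):
        self.force_torch = force_torch

    def forward_triton(self, x):
        # Placeholder for the Triton-based path.
        return ("triton", x)

    def forward_windows(self, x):
        # Placeholder for the pure-Torch path.
        return ("torch", x)

    def forward(self, x):
        # Platform-based dispatch, with an override that always picks the Torch path.
        if self.force_torch or platform.system() == "Windows":
            return self.forward_windows(x)
        return self.forward_triton(x)


print(AttentionSketch(force_torch=True).forward(1))
```

With `force_torch=True` the Torch path is taken on every platform, which mirrors the effect of always calling `forward_windows`.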
When running the bf16 test (both CPU Weight and GPU Weight), you may encounter issues caused by older versions of `g++` and `as`, particularly on Ubuntu 20 or earlier. To make it easier to reproduce our results, we provide a development container with a pre-configured environment. Please note that the container does not have the ktrans package installed, so you may still need to install some packages manually.
See `devcontainer/devcontainer.json` and check the `"mounts":` config.

The tests use the DeepSeek-V3 model (some specific cases use R1).
| DataSet | CPU Weight Format | CPU Kernel | GPU Weight Format | GEMM Kernel | MLA Kernel | Siliconflow | Ktrans Point |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU (shuffle 1k) | | | | | | | |
| 1 | bf16 | cpuinfer | bf16 | torch | torch | 81.6 | 81.9 |
| 2 | q8_0 | cpuinfer | bf16 | torch | torch | 81.6 | 83.1 |
| 3 | q4km | cpuinfer | bf16 | torch | triton | 81.6 | 81.4 |
| 4 | q4km | cpuinfer | q4km->marlin 8 | marlin | triton | 81.6 | 81.1 |
| 5 | q4km | cpuinfer | q4km->marlin 4 | marlin | triton | 81.6 | 81 |
| 6 | q4km | cpuinfer | fp8 | fp8gemm | triton | 81.6 | 81.5 |
| 7 (DeepSeek-R1) | iq1 | cpuinfer | fp8 | fp8gemm | triton | 78.6 | 83.6 |
| MMLU-pro (shuffle 1k) | | | | | | | |
| 1 | q4km | cpuinfer | fp8 | fp8gemm | triton | 57.7 | 57.6 |
| 2 | q4km | cpuinfer | q4km->marlin 4 | marlin | triton | 57.7 | 57.5 |
| 3 (DeepSeek-R1) | iq1 | cpuinfer | fp8 | fp8gemm | triton | 71.9 | tbd |
| HumanEval | tbd | tbd | tbd | tbd | tbd | tbd | tbd |
| GSM8K | tbd | tbd | tbd | tbd | tbd | tbd | tbd |
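To read the table at a glance, the gap between each ktransformers score and the baseline can be computed directly. The sketch below copies the six DeepSeek-V3 MMLU rows from the table; the R1 row is left out because it uses a different baseline. The helper name `score_deltas` is illustrative.

```python
# (case, Siliconflow score, ktransformers score) for the DeepSeek-V3 MMLU rows.
mmlu_rows = [
    ("1", 81.6, 81.9),
    ("2", 81.6, 83.1),
    ("3", 81.6, 81.4),
    ("4", 81.6, 81.1),
    ("5", 81.6, 81.0),
    ("6", 81.6, 81.5),
]


def score_deltas(rows):
    """Return ktransformers minus baseline for each case, rounded to one decimal."""
    return {case: round(kt - base, 1) for case, base, kt in rows}


deltas = score_deltas(mmlu_rows)
largest = max(deltas, key=lambda case: abs(deltas[case]))
for case, delta in deltas.items():
    print(f"case {case}: {delta:+.1f}")
print("largest gap: case", largest)
```

Rounding to one decimal matches the precision of the reported scores and avoids floating-point noise in the differences.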
The details for each case are listed below:
- By default, the MLA kernel uses Triton on Linux and Torch on Windows. To test Torch on Linux, we modified the file manually: remove all the `if` branches and force it to use `self.forward_windows`.
- Change `KLinearMarlin` to `KLinearTorch` (find every usage in this file). The source weights come from there (you need to use llama.cpp to convert them to gguf).
- Add `num_bits: 8` (in other words, add this kwarg to every operator that uses `KLinearMarlin`). The weight file for q4km is here.