# MBPP Evaluation
We provide a test script to evaluate the performance of the deepseek-coder model on the MBPP code generation benchmark in a 3-shot setting.
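As context for the 3-shot setting, here is a minimal sketch of how such a prompt can be assembled. The field names (`text`, `code`, `test_list`) follow the MBPP dataset schema, but the exact template used by `eval_pal.py` may differ:

```python
# Sketch: assemble a 3-shot MBPP prompt.
# Field names ("text", "code", "test_list") follow the MBPP dataset schema;
# the exact template used by eval_pal.py may differ.

def format_example(ex: dict, with_solution: bool) -> str:
    tests = "\n".join(ex["test_list"])
    s = (
        "You are an expert Python programmer, and here is your task: "
        f"{ex['text']} Your code should pass these tests:\n\n{tests}\n"
    )
    # Solved few-shot examples include the reference solution; the final
    # task is left open for the model to complete.
    return s + (f"[BEGIN]\n{ex['code']}\n[DONE]\n" if with_solution else "[BEGIN]\n")

def build_3shot_prompt(shots: list[dict], task: dict) -> str:
    parts = [format_example(ex, with_solution=True) for ex in shots[:3]]
    parts.append(format_example(task, with_solution=False))
    return "\n".join(parts)
```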
First, install the required dependencies:

```bash
pip install accelerate
pip install attrdict
pip install transformers
pip install torch
```
We've created a sample script, `eval.sh`, that demonstrates how to test the deepseek-coder-1.3b-base model on the MBPP dataset using 8 GPUs:
```bash
MODEL_NAME_OR_PATH="deepseek-ai/deepseek-coder-1.3b-base"
DATASET_ROOT="data/"
LANGUAGE="python"
python -m accelerate.commands.launch --config_file test_config.yaml \
    eval_pal.py --logdir ${MODEL_NAME_OR_PATH} --language ${LANGUAGE} --dataroot ${DATASET_ROOT}
```
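For reference, pass@1 here amounts to checking whether the single greedy completion for each task passes all of its MBPP assert statements. Below is a minimal, unsandboxed sketch of that check; the actual harness should run untrusted model output in an isolated subprocess with a timeout:

```python
# Sketch: score one greedy completion against MBPP's test asserts.
# WARNING: exec'ing untrusted model output is unsafe; a real harness
# should sandbox execution in a subprocess with a timeout.

def passes_tests(completion: str, test_list: list[str]) -> bool:
    env: dict = {}
    try:
        exec(completion, env)      # define the generated function(s)
        for test in test_list:     # e.g. 'assert min_cost(...) == 8'
            exec(test, env)
        return True
    except Exception:
        return False

def pass_at_1(results: list[bool]) -> float:
    # With greedy decoding, pass@1 is simply the fraction of tasks
    # whose single completion passes all of its tests.
    return sum(results) / len(results)
```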
We report experimental results for several models below. We set the maximum input length to 4096 tokens and the maximum output length to 500 tokens, and use greedy search as the decoding strategy.
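With the `transformers` generation API, these settings correspond roughly to the following sketch (model name and prompt shown for illustration):

```python
# Sketch: greedy decoding with the reported limits
# (input truncated to 4096 tokens, at most 500 new tokens).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "deepseek-ai/deepseek-coder-1.3b-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "..."  # a 3-shot MBPP prompt, e.g. built as sketched above
inputs = tokenizer(
    prompt, return_tensors="pt", truncation=True, max_length=4096
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500, do_sample=False)  # greedy search
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```

The first table reports results for base models: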
| Model | Size | Pass@1 |
|---|---|---|
| CodeShell | 7B | 38.6% |
| CodeGeeX2 | 6B | 36.2% |
| StarCoder | 16B | 42.8% |
| CodeLlama-Base | 7B | 38.6% |
| CodeLlama-Base | 13B | 47.0% |
| CodeLlama-Base | 34B | 55.0% |
| DeepSeek-Coder-Base | 1.3B | 46.8% |
| DeepSeek-Coder-Base | 5.7B | 57.2% |
| DeepSeek-Coder-Base | 6.7B | 60.6% |
| DeepSeek-Coder-Base | 33B | 66.0% |
The following table reports results for instruction-tuned and closed-source models:

| Model | Size | Pass@1 |
|---|---|---|
| GPT-3.5-Turbo | - | 70.8% |
| GPT-4 | - | 80.0% |
| DeepSeek-Coder-Instruct | 1.3B | 49.4% |
| DeepSeek-Coder-Instruct | 6.7B | 65.4% |
| DeepSeek-Coder-Instruct | 33B | 70.0% |