Evaluation/MBPP/README.md
1. Introduction

We provide a test script to evaluate the performance of the DeepSeek-Coder models on the MBPP code generation benchmark in a 3-shot setting.
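In a 3-shot setting, each MBPP problem is preceded by three solved example problems so the base model can infer the expected output format. The sketch below shows one way such a prompt could be assembled; the delimiter markers (`[BEGIN]`/`[DONE]`) and the example problems are illustrative assumptions, not necessarily the exact template used by eval_pal.py.

```python
# Hypothetical 3-shot prompt builder for MBPP-style problems.
# Field names ("text", "code") follow the public MBPP schema.

FEW_SHOT_EXAMPLES = [
    {"text": "Write a function to find the minimum of two numbers.",
     "code": "def minimum(a, b):\n    return a if a < b else b"},
    {"text": "Write a function to add two numbers.",
     "code": "def add(a, b):\n    return a + b"},
    {"text": "Write a function to square a number.",
     "code": "def square(x):\n    return x * x"},
]

def build_3shot_prompt(problem_text: str) -> str:
    """Concatenate three solved examples, then the target problem."""
    parts = []
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"{ex['text']}\n[BEGIN]\n{ex['code']}\n[DONE]")
    # The target problem ends with an open [BEGIN] marker so the model
    # continues by generating the solution code.
    parts.append(f"{problem_text}\n[BEGIN]\n")
    return "\n".join(parts)
```

The model's completion is then cut off at the first closing delimiter (or end-of-sequence) and executed against the problem's unit tests.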

2. Setup

pip install accelerate
pip install attrdict
pip install transformers
pip install torch

3. Evaluation

We've created a sample script, eval.sh, that demonstrates how to evaluate the deepseek-coder-1.3b-base model on the MBPP dataset using 8 GPUs.

```bash
MODEL_NAME_OR_PATH="deepseek-ai/deepseek-coder-1.3b-base"
DATASET_ROOT="data/"
LANGUAGE="python"
python -m accelerate.commands.launch --config_file test_config.yaml eval_pal.py --logdir ${MODEL_NAME_OR_PATH} --language ${LANGUAGE} --dataroot ${DATASET_ROOT}
```

4. Experimental Results

We report experimental results here for several models. We set the maximum input length to 4096 and the maximum output length to 500, and employ the greedy search strategy.
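Under greedy decoding each problem receives exactly one completion, so Pass@1 reduces to the fraction of problems whose single completion passes all of its unit tests. A minimal, hypothetical scorer along those lines (exec-based, since MBPP ships assert-style tests; a production harness would sandbox execution and enforce timeouts):

```python
def passes_tests(candidate_code: str, test_asserts: list) -> bool:
    """Run the candidate solution, then its assert-style tests."""
    env = {}
    try:
        exec(candidate_code, env)          # define the solution function
        for test in test_asserts:
            exec(test, env)                # raises AssertionError on failure
        return True
    except Exception:
        return False

def pass_at_1(completions: list, tests: list) -> float:
    """With one greedy sample per problem, Pass@1 = passed / total."""
    assert len(completions) == len(tests)
    passed = sum(passes_tests(c, t) for c, t in zip(completions, tests))
    return passed / len(completions)
```

For example, if one of two greedy completions passes its tests, Pass@1 is 50%.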

(1) Multilingual Base Models

| Model | Size | Pass@1 |
|-------|------|--------|
| CodeShell | 7B | 38.6% |
| CodeGeeX2 | 6B | 36.2% |
| StarCoder | 16B | 42.8% |
| CodeLlama-Base | 7B | 38.6% |
| CodeLlama-Base | 13B | 47.0% |
| CodeLlama-Base | 34B | 55.0% |
| DeepSeek-Coder-Base | 1.3B | 46.8% |
| DeepSeek-Coder-Base | 5.7B | 57.2% |
| DeepSeek-Coder-Base | 6.7B | 60.6% |
| DeepSeek-Coder-Base | 33B | 66.0% |

(2) Instruction-Tuned Models

| Model | Size | Pass@1 |
|-------|------|--------|
| GPT-3.5-Turbo | - | 70.8% |
| GPT-4 | - | 80.0% |
| DeepSeek-Coder-Instruct | 1.3B | 49.4% |
| DeepSeek-Coder-Instruct | 6.7B | 65.4% |
| DeepSeek-Coder-Instruct | 33B | 70.0% |