# MBPP Evaluation
We provide a test script to evaluate the performance of the deepseek-coder model on the MBPP code generation benchmark in a 3-shot setting.
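As context for the 3-shot setting, here is a minimal sketch of how such a prompt can be assembled. The field names (`text`, `code`, `test_list`) follow the MBPP dataset schema, but the exact template used by `eval_pal.py` may differ:

```python
# Sketch: assemble a 3-shot MBPP prompt.
# Field names ("text", "code", "test_list") follow the MBPP dataset schema;
# the exact template used by eval_pal.py may differ.

def format_example(ex: dict, with_solution: bool) -> str:
    tests = "\n".join(ex["test_list"])
    s = (
        "You are an expert Python programmer, and here is your task: "
        f"{ex['text']} Your code should pass these tests:\n\n{tests}\n"
    )
    # Solved few-shot examples include the reference solution; the final
    # task is left open for the model to complete.
    return s + (f"[BEGIN]\n{ex['code']}\n[DONE]\n" if with_solution else "[BEGIN]\n")

def build_3shot_prompt(shots: list[dict], task: dict) -> str:
    parts = [format_example(ex, with_solution=True) for ex in shots[:3]]
    parts.append(format_example(task, with_solution=False))
    return "\n".join(parts)
```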
First, install the required dependencies:

```bash
pip install accelerate
pip install attrdict
pip install transformers
pip install torch
```
We've created a sample script, `eval.sh`, that demonstrates how to test the deepseek-coder-1.3b-base model on the MBPP dataset using 8 GPUs:
```bash
MODEL_NAME_OR_PATH="deepseek-ai/deepseek-coder-1.3b-base"
DATASET_ROOT="data/"
LANGUAGE="python"
python -m accelerate.commands.launch --config_file test_config.yaml \
    eval_pal.py --logdir ${MODEL_NAME_OR_PATH} --language ${LANGUAGE} --dataroot ${DATASET_ROOT}
```
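For reference, pass@1 here amounts to checking whether the single greedy completion for each task passes all of its MBPP assert statements. Below is a minimal, unsandboxed sketch of that check; the actual harness should run untrusted model output in an isolated subprocess with a timeout:

```python
# Sketch: score one greedy completion against MBPP's test asserts.
# WARNING: exec'ing untrusted model output is unsafe; a real harness
# should sandbox execution in a subprocess with a timeout.

def passes_tests(completion: str, test_list: list[str]) -> bool:
    env: dict = {}
    try:
        exec(completion, env)      # define the generated function(s)
        for test in test_list:     # e.g. 'assert min_cost(...) == 8'
            exec(test, env)
        return True
    except Exception:
        return False

def pass_at_1(results: list[bool]) -> float:
    # With greedy decoding, pass@1 is simply the fraction of tasks
    # whose single completion passes all of its tests.
    return sum(results) / len(results)
```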
We report experimental results for several models below. We set the maximum input length to 4096 tokens and the maximum output length to 500 tokens, and use greedy search as the decoding strategy.
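With the `transformers` generation API, these settings correspond roughly to the following sketch (model name and prompt shown for illustration):

```python
# Sketch: greedy decoding with the reported limits
# (input truncated to 4096 tokens, at most 500 new tokens).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "deepseek-ai/deepseek-coder-1.3b-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "..."  # a 3-shot MBPP prompt, e.g. built as sketched above
inputs = tokenizer(
    prompt, return_tensors="pt", truncation=True, max_length=4096
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500, do_sample=False)  # greedy search
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```

The first table reports results for base models: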
| Model | Size | Pass@1 |
|---|---|---|
| CodeShell | 7B | 38.6% |
| CodeGeeX2 | 6B | 36.2% |
| StarCoder | 16B | 42.8% |
| CodeLlama-Base | 7B | 38.6% |
| CodeLlama-Base | 13B | 47.0% |
| CodeLlama-Base | 34B | 55.0% |
| DeepSeek-Coder-Base | 1.3B | 46.8% |
| DeepSeek-Coder-Base | 5.7B | 57.2% |
| DeepSeek-Coder-Base | 6.7B | 60.6% |
| DeepSeek-Coder-Base | 33B | 66.0% |
The following table reports results for instruction-tuned and closed-source models:

| Model | Size | Pass@1 |
|---|---|---|
| GPT-3.5-Turbo | - | 70.8% |
| GPT-4 | - | 80.0% |
| DeepSeek-Coder-Instruct | 1.3B | 49.4% |
| DeepSeek-Coder-Instruct | 6.7B | 65.4% |
| DeepSeek-Coder-Instruct | 33B | 70.0% |