# McEval: Massively Multilingual Code Evaluation

Repository for the paper "McEval: Massively Multilingual Code Evaluation".

## Datasets
| Dataset | Download |
|---|---|
| McEval Evaluation Dataset | 🤗 HuggingFace |
| McEval-Instruct | 🤗 HuggingFace |
## Environment

Runtime environments for the different programming languages can be found in Environments.

We recommend using Docker for evaluation; we have created a Docker image with all the necessary environments pre-installed. Pull the image directly from Docker Hub or the Aliyun registry:
```bash
# Docker Hub:
docker pull multilingualnlp/mceval

# Aliyun Docker Hub:
docker pull registry.cn-hangzhou.aliyuncs.com/mceval/mceval:v1
```

Start a container from the pulled image (replace `<image-name>` accordingly) and attach to it:

```bash
docker run -it -d --restart=always --name mceval_dev --workdir / <image-name> /bin/bash
docker attach mceval_dev
```
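Once attached, you can sanity-check that the language toolchains are available before running anything. A minimal sketch; the tool list below is an assumption for illustration, not the image manifest:

```python
# Sanity-check sketch: verify a few language toolchains are on PATH inside
# the container. The tool list here is an assumption, not the image manifest.
import shutil

for tool in ["python3", "g++", "javac", "node", "go"]:
    path = shutil.which(tool)
    print(f"{tool:8s} -> {path or 'NOT FOUND'}")
```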
## Inference

We provide model inference code, including torch and vLLM implementations.

Take the generation task as an example (torch implementation):

```bash
cd inference
bash scripts/inference_torch.sh
```
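For reference, a minimal sketch of the kind of generation loop such a script drives, using Hugging Face transformers; the model name and generation settings are illustrative assumptions, not the repository defaults:

```python
# Minimal greedy-decoding sketch with transformers; the model choice and
# max_new_tokens are assumptions, not the values used by inference_torch.sh.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/deepseek-coder-6.7b-instruct"  # hypothetical model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

instruction = "Write a Python function that reverses a string."
inputs = tokenizer(instruction, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(completion)
```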
Take the generation task as an example (vLLM implementation):

```bash
cd inference
bash scripts/run_generation_vllm.sh
```
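The vLLM path batches prompts through a single engine. A minimal sketch; again, the model name and sampling parameters are assumptions rather than what run_generation_vllm.sh configures:

```python
# Minimal vLLM generation sketch; model name and sampling parameters are
# assumptions, not the script's actual configuration.
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/deepseek-coder-6.7b-instruct")  # hypothetical model
params = SamplingParams(temperature=0.0, max_tokens=512)

prompts = ["Write a Python function that reverses a string."]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```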
## Evaluation

🛎️ Please prepare the model's inference results in the following format and use them for the next evaluation step.

**(1) Folder Structure.** Place the data in the following folder structure; each file contains the test results for one language.

```
\evaluate_model_name
    - CPP.jsonl
    - Python.jsonl
    - Java.jsonl
    ...
```
You can use the script split_result.py to split combined inference results into per-language files:

```bash
python split_result.py --split_file <inference_result> --save_dir <save_dir>
```
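If you need to adapt the split to your own pipeline, the logic is roughly as follows. This is a hedged sketch, not the repository script; it assumes the language is the task_id prefix (e.g. "Lang/1" in the format shown below):

```python
# Hedged sketch of splitting a combined JSONL result file into per-language
# files, assuming the language is the task_id prefix (e.g. "Python/3").
import json
import os
from collections import defaultdict

def split_results(split_file: str, save_dir: str) -> None:
    buckets = defaultdict(list)
    with open(split_file, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            lang = record["task_id"].split("/")[0]
            buckets[lang].append(record)
    os.makedirs(save_dir, exist_ok=True)
    for lang, records in buckets.items():
        out_path = os.path.join(save_dir, f"{lang}.jsonl")
        with open(out_path, "w", encoding="utf-8") as f:
            for r in records:
                f.write(json.dumps(r, ensure_ascii=False) + "\n")
```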
**(2) File Format.** Each line in each per-language file has the following format; the `raw_generation` field holds the generated code. More examples can be found in Evaluate Data Format Examples.
```json
{
    "task_id": "Lang/1",
    "prompt": "",
    "canonical_solution": "",
    "test": "",
    "entry_point": "",
    "signature": "",
    "docstring": "",
    "instruction": "",
    "raw_generation": ["<Generated Code>"]
}
```
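Before evaluating, it can help to validate the prepared files. A minimal checker sketch (not a repository utility; the required-field list simply mirrors the format above):

```python
# Hedged validation sketch: check each line carries the fields shown above
# and that raw_generation is a list of strings. Not part of the repository.
import json

REQUIRED = {
    "task_id", "prompt", "canonical_solution", "test", "entry_point",
    "signature", "docstring", "instruction", "raw_generation",
}

def validate(path: str) -> None:
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            record = json.loads(line)
            missing = REQUIRED - record.keys()
            assert not missing, f"line {i}: missing fields {missing}"
            gen = record["raw_generation"]
            assert isinstance(gen, list) and all(isinstance(g, str) for g in gen), \
                f"line {i}: raw_generation must be a list of strings"

validate("evaluate_model_name/Python.jsonl")
```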
Run the evaluation, again taking the generation task as an example:

```bash
cd eval
bash scripts/eval_generation.sh
```
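The script executes the generated code against each task's tests and aggregates per-language scores. For intuition only, here is the standard unbiased pass@k estimator (Chen et al., 2021); the metric actually computed in eval/ may differ:

```python
# Standard unbiased pass@k estimator (Chen et al., 2021). Shown for intuition
# only; the metric actually computed by the eval/ scripts may differ.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n generated samples for a task, c of which pass all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))  # 0.3; with a single greedy sample this reduces to pass@1
```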