The benchmark dataset is in cruxeval.jsonl. At a high level, our benchmark is constructed as follows:
First, we use Code Llama 34B to generate a large set of functions and inputs. To do so, we prompt it with the name of a function from the Python standard library, such as str.zfill, and ask it to generate a Python function that uses the library function, along with 5 test inputs. We vary the two few-shot examples in the prompt (drawn from diverse_fewshot_examples.py) to improve the diversity of generations. The prompts are in the file data_generating_prompt.jsonl, which is generated by generate_function_prompts.py. We use a total of 69 different functions from the standard library: 47 from str, 11 from dict, and 11 from list.
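For illustration, a generated sample might look like the following. This is a hypothetical example in the spirit of the str.zfill prompt, not an actual model generation; the function body and inputs are made up:

```python
# Hypothetical example of a generated sample built around str.zfill.
# Real samples are produced by Code Llama 34B from the prompts in
# data_generating_prompt.jsonl; this is only illustrative.
def f(s, width):
    # Pad the numeric part of the string with zeros, keeping any sign prefix.
    return s.zfill(width)

# Five candidate test inputs, as requested in the generation prompt.
test_inputs = [("42", 5), ("-7", 4), ("", 3), ("123", 2), ("+9", 6)]

for s, width in test_inputs:
    print(f(s, width))
```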
Then, we filter the set so that our benchmark consists only of short problems with low computation and memory requirements, problems that a good human programmer should be able to solve without extra memory in about a minute.
The script in filter/analyze_ops.py is used to filter generations for our benchmark; among other criteria, each candidate must execute successfully and satisfy an assertion of the form `assert f(input) == output`.

After filtering, we randomly select 800 samples that pass the filter, keeping the benchmark small enough to run easily but large enough to reliably reveal performance differences among models. We also highlight that, as models improve, this approach can be used to create future benchmarks that are more difficult and test different aspects of execution.
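A minimal sketch of the execution check implied by this assertion is shown below. This is not the actual filter/analyze_ops.py logic (which also bounds computation and memory); the function name and the representation-string arguments are assumptions for illustration:

```python
def passes_execution_check(code: str, input_repr: str, output_repr: str) -> bool:
    """Check that the candidate code defines f and satisfies assert f(input) == output."""
    namespace = {}
    try:
        exec(code, namespace)                                    # define f
        exec(f"assert f({input_repr}) == {output_repr}", namespace)
        return True
    except Exception:
        # Any syntax error, runtime error, or failed assertion rejects the candidate.
        return False
```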
The final dataset is in cruxeval.jsonl. It is also available on HuggingFace Datasets.
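A minimal loading sketch; the HuggingFace repository id and split name in the commented lines are assumptions, so verify them against the hub page before use:

```python
import json

# Read the local benchmark file; each line is one JSON record.
with open("cruxeval.jsonl") as fh:
    samples = [json.loads(line) for line in fh]

print(len(samples))
print(samples[0].keys())   # inspect the available fields

# Alternatively, via HuggingFace Datasets (repository id and split are assumptions):
# from datasets import load_dataset
# dataset = load_dataset("cruxeval-org/cruxeval", split="test")
```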