classic/benchmark/agbenchmark/challenges/CHALLENGE.md
Input: each challenge is defined by a JSON data file with the fields shown in the example below.
Example:
```json
{
  "category": ["basic"],
  "task": "Print the capital of America to a .txt file",
  "dependencies": ["TestWriteFile"], // the class name of the test
  "ground": {
    "answer": "Washington",
    "should_contain": ["Washington"],
    "should_not_contain": ["New York", "Los Angeles", "San Francisco"],
    "files": [".txt"],
    "eval": {
      "type": "llm" or "file" or "python",
      "scoring": "percentage" or "scale" or "binary", // only if the type is llm
      "template": "rubric" or "reference" or "custom" // only if the type is llm
    }
  },
  "info": {
    "difficulty": "basic",
    "description": "Tests the writing to file",
    "side_effects": ["tests if there is in fact an LLM attached"]
  }
}
```
## Evals

This is the method of evaluation for a challenge, selected by the `type` field of `ground.eval`.
### file

This is the default method of evaluation. It compares the files specified in the `files` field against the `should_contain` and `should_not_contain` ground truths.
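To make this concrete, here is a minimal sketch of such a check, not agbenchmark's actual implementation: the function name and workspace handling are assumptions, and entries in `files` that start with a dot are treated as file extensions.

```python
from pathlib import Path


def check_file_ground_truth(workspace: Path, ground: dict) -> bool:
    """Illustrative only: match workspace files against the ground truths.

    `ground` is the "ground" dict from the challenge JSON; entries in
    "files" may be full file names or bare extensions such as ".txt".
    """
    for pattern in ground["files"]:
        # A leading dot means "any file with this extension".
        glob = f"*{pattern}" if pattern.startswith(".") else pattern
        for path in workspace.glob(glob):
            content = path.read_text()
            if all(s in content for s in ground.get("should_contain", [])) and not any(
                s in content for s in ground.get("should_not_contain", [])
            ):
                return True
    return False
```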
### python

This runs a Python function in the specified `files` and captures its print output, which is then scored against the `should_contain` and `should_not_contain` ground truths.
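A minimal sketch of that idea, assuming the scored file is a plain script run inside the workspace with its stdout captured (the helper name and timeout are made up for illustration):

```python
import subprocess
import sys
from pathlib import Path


def run_and_capture(workspace: Path, script_name: str) -> str:
    """Illustrative only: run one of the challenge's Python `files`
    inside the workspace and return everything it printed."""
    result = subprocess.run(
        [sys.executable, script_name],
        cwd=workspace,
        capture_output=True,
        text=True,
        timeout=60,
    )
    # The captured stdout is then matched against should_contain /
    # should_not_contain, just like the file-based evaluation.
    return result.stdout
```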
### llm

This uses a language model to evaluate the answer; the `scoring` and `template` fields control how the model is asked to grade it.
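The sketch below is only meant to show how `scoring` and `template` might feed into a grading prompt; the prompt text and the `ask_llm` callable are assumptions, not agbenchmark's API.

```python
def llm_eval(answer: str, ground: dict, ask_llm) -> float:
    """Illustrative only: grade an answer with a language model.

    `ask_llm` is a hypothetical callable that sends a prompt to a model
    and returns its text response.
    """
    eval_cfg = ground["eval"]  # e.g. {"type": "llm", "scoring": "scale", "template": "reference"}

    if eval_cfg["template"] == "reference":
        prompt = (
            f"Reference answer: {ground['answer']}\n"
            f"Submitted answer: {answer}\n"
        )
    else:  # "rubric" or "custom" templates would build different prompts
        prompt = f"Evaluate this answer: {answer}\n"

    if eval_cfg["scoring"] == "binary":
        prompt += "Reply with 1 if the answer is correct, otherwise 0."
    elif eval_cfg["scoring"] == "percentage":
        prompt += "Reply with a score from 0 to 100."
    else:  # "scale"
        prompt += "Reply with a score from 1 to 10."

    return float(ask_llm(prompt).strip())
```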
## Adding files to a challenge

### artifacts_in

This folder contains all the files you want the agent to have in its workspace BEFORE the challenge starts.
### artifacts_out

This folder contains all the files you would like the agent to generate. It is also used to mock the agent: it lets you run `agbenchmark --test=TestExample --mock` and make sure the challenge actually works.
### custom_python

This folder contains files that will be copied into the agent's workspace and run after the challenge is completed. For example, it can contain a `test.py` that runs in the workspace and imports the code generated by the agent (see the TestBasicCodeGeneration challenge). A hypothetical example is sketched below.
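For instance, a `test.py` placed in this folder might import a function the agent was asked to write and assert on its output; the module and function names below are made up for illustration.

```python
# Hypothetical custom_python/test.py: copied into the workspace after the
# challenge, so it can import code the agent generated there.
from sample_code import two_sum  # assumed name of the agent-written module


def test_two_sum() -> None:
    nums = [2, 7, 11, 15]
    result = two_sum(nums, 9)
    print(result)
    assert result == [0, 1]


if __name__ == "__main__":
    test_two_sum()
    print("All tests passed.")
```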