Evaluation from 2024-06-19 13:25:26

This report was generated by DevQualityEval benchmark in version 0.5.0.

PLEASE NOTE that we assume the results of the "symbolic execution" are deterministic. Hence they where only benchmarked over a single run and multiplied by 5 to be comparable to the remaining contestants.

Results

Keep in mind that LLMs are nondeterministic. The following results just reflect a current snapshot.

The results of all models have been divided into the following categories:

category unknown: Models in this category could not be categorized.
response error: Models in this category encountered an error.
no code: Models in this category produced no code.
invalid code: Models in this category produced invalid code.
executable code: Models in this category produced executable code.
statement coverage reached: Models in this category produced code that reached full statement coverage.
no excess response: Models in this category did not respond with more content than requested.

The following sections list all models with their categories. The complete log of the evaluation with all outputs can be found here. Detailed scoring can be found here.

Result category "category unknown"

Models in this category could not be categorized.

symflower/symbolic-execution