aider/website/_posts/2024-03-08-claude-3.md
{% if page.date %}
<p class="post-date">{{ page.date | date: "%B %d, %Y" }}</p>
{% endif %}

Anthropic just released their new Claude 3 models with evals showing better performance on coding tasks. With that in mind, I've been benchmarking the new models using Aider's code editing benchmark suite.
Claude 3 Opus outperforms all of OpenAI's models, making it the best available model for pair programming with AI.
To use Claude 3 Opus with aider:
python -m pip install -U aider-chat
export ANTHROPIC_API_KEY=sk-...
aider --opus
Aider is an open source command line chat tool that lets you pair program with AI on code in your local git repo.
Aider relies on a code editing benchmark to quantitatively evaluate how well an LLM can make changes to existing code. The benchmark uses aider to try to complete 133 Exercism Python coding exercises. For each exercise, Exercism provides a starting Python file with stubs for the needed functions, a natural language description of the problem to solve, and a test suite to evaluate whether the coder has correctly solved the problem.
The LLM gets two tries to solve each problem.
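The two-try scoring described above can be sketched roughly as follows. This is a minimal illustration, not aider's actual benchmark harness: `solve_with_retries`, `pass_rates`, `fake_attempt` and `fake_tests` are hypothetical names, and passing the failing test output back to the model on the second try is an assumption made for this sketch.

```python
from typing import Callable, List, Optional, Tuple

MAX_TRIES = 2  # the benchmark gives the LLM two tries per exercise


def solve_with_retries(
    attempt: Callable[[Optional[str]], str],
    run_tests: Callable[[str], Optional[str]],
) -> Optional[int]:
    """Return the try number (1 or 2) on which the tests passed, or None."""
    feedback = None  # assumption: the retry sees the failing test output
    for try_number in range(1, MAX_TRIES + 1):
        code = attempt(feedback)
        feedback = run_tests(code)  # None means every test passed
        if feedback is None:
            return try_number
    return None


def pass_rates(results: List[Optional[int]]) -> Tuple[float, float]:
    """Return (first-try %, overall %) for a list of per-exercise results."""
    if not results:
        raise ValueError("no results to score")
    total = len(results)
    first_try = sum(1 for r in results if r == 1)
    solved = sum(1 for r in results if r is not None)
    return 100 * first_try / total, 100 * solved / total


# Stand-ins for the LLM and the exercise's test suite (both hypothetical).
def fake_attempt(feedback: Optional[str]) -> str:
    return "return a + b" if feedback else "return a - b"


def fake_tests(code: str) -> Optional[str]:
    return None if code == "return a + b" else "AssertionError: add(1, 2) != 3"


print(solve_with_retries(fake_attempt, fake_tests))  # 2
print(pass_rates([1, 2, None, 1]))  # (50.0, 75.0)
```

The first-try score counts only exercises solved on the first attempt, while the overall score also includes those solved on the second attempt, which is why the overall score is never lower than the first-try score.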
- The `claude-3-opus-20240229` model got the highest score ever on this benchmark, completing 68.4% of the tasks with two tries.
- Its single-try performance was comparable to the latest GPT-4 Turbo model, `gpt-4-0125-preview`, at 54.1%.
- The `claude-3-sonnet-20240229` model performed similarly to OpenAI's GPT-3.5 Turbo models, with an overall score of 54.9% and a first-try score of 43.6%.

It's highly desirable to have the LLM send back code edits as some form of diffs, rather than having it send back an updated copy of the entire source code.
Weaker models like GPT-3.5 are unable to use diffs, and are stuck sending back updated copies of entire source files. Aider uses more efficient search/replace blocks with the original GPT-4 and unified diffs with the newer GPT-4 Turbo models.
Claude 3 Opus works best with the search/replace blocks, allowing it to send back code changes efficiently. Unfortunately, the Sonnet model was only able to work reliably with whole files, which limits it to editing smaller source files and uses more tokens, money and time.
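To make the efficiency difference concrete, here is a simplified sketch of applying a search/replace style edit and comparing its size with sending back the whole file. It is an illustration only, not aider's implementation: `apply_search_replace` is a hypothetical helper, the 50-function source file is made up, and character counts are used as a rough stand-in for tokens.

```python
def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Replace exactly one occurrence of `search` in `source` with `replace`."""
    count = source.count(search)
    if count != 1:
        raise ValueError(f"expected exactly one match for the search text, found {count}")
    return source.replace(search, replace, 1)


# A made-up source file with 50 small functions; only one needs a change.
original = "".join(f"def f{i}(x):\n    return x + {i}\n\n" for i in range(50))

search = "def f7(x):\n    return x + 7\n"
replace = "def f7(x):\n    return x * 7\n"

updated = apply_search_replace(original, search, replace)

# Characters the model has to send back under each edit format.
edit_size = len(search) + len(replace)
whole_file_size = len(updated)
print(f"search/replace edit: {edit_size} chars, whole file: {whole_file_size} chars")
```

The edit stays roughly the same size no matter how large the file grows, whereas the whole-file format scales with the file, which is why whole-file editing is practical only for smaller source files.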
There are a few other things worth noting: