Leaderboards
There are many LLM benchmarks, but for the purposes of evaluating Autonomous DevTools we are most interested in testing an agent's ability to address a realistic task on an existing codebase.
SWE-bench lite
The gold standard.
Released in September 2023 by Princeton NLP, SWE-bench is the most widely accepted measure of an agent's ability to solve tasks in a realistic codebase.
It was constructed from GitHub Pull Requests in real Open Source repositories, with unit tests verifying each change.
To pass, the agent must effectively recreate that Pull Request. Since the full set is costly to run, the smaller subset SWE-bench lite is often used.
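For orientation, each benchmark instance bundles the repository, the issue text, the gold patch, and the tests that verify it. Below is a minimal sketch of inspecting SWE-bench Lite with the Hugging Face datasets library; the dataset ID and field names reflect the public Princeton NLP release at the time of writing and should be treated as assumptions that may change.

```python
# Sketch: inspect a SWE-bench Lite instance with the Hugging Face `datasets` library.
# Assumes `pip install datasets`; dataset ID and field names follow the public release.
from datasets import load_dataset

lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

example = lite[0]
print(example["repo"])               # the source repository the task comes from
print(example["problem_statement"])  # the GitHub issue the agent must resolve
print(example["FAIL_TO_PASS"])       # tests the reference PR turns from failing to passing
```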
SWE-bench maintains the official leaderboard where results are reported and verified. Nopilot focuses on listing all reported scores from Open Source and Source Available agents.
| Rank | Agent | Score (lite) | Status | Group | License |
|---|---|---|---|---|---|
| 1 | Moatless Tools + Claude 3.5 Sonnet | 26.67% | Verified | Albert Örwall | MIT |
| 2 | OpenDevin CodeAct 1.3 + GPT-4o | 26.67% | Reported | OpenDevin | MIT |
| 3 | Aider + GPT-4o, Claude 3 Opus | 26.33% | Reported | Paul Gauthier | Apache-2.0 |
| 4 | AppMap Navie + GPT-4o | 21.67% | Reported | AppMap | Source Available |
| 5 | SWE-agent + GPT-4 | 18% | Verified | Princeton NLP | MIT |
Last updated: 2024-06-25
These are unassisted scores. SWE-bench results come in "assisted" and "unassisted" variants: "assisted" means the agent is given the "oracle" list of files that need to be modified. There is usually a large gap between the two scores, highlighting that navigating the codebase is a key part of the problem.
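To illustrate what "assisted" means in practice, the oracle file list can be read straight off the gold patch. The sketch below is illustrative only and is not the official SWE-bench harness.

```python
# Sketch: derive the "oracle" file hint from a gold patch (illustrative, not the official harness).
def oracle_files(gold_patch: str) -> list[str]:
    """Return the paths touched by a unified diff, e.g. a SWE-bench `patch` field."""
    files = []
    for line in gold_patch.splitlines():
        # Unified diffs mark each changed file with a "diff --git a/<path> b/<path>" header.
        if line.startswith("diff --git "):
            files.append(line.split(" b/", 1)[-1])
    return files

# In the unassisted setting the agent gets no such list and must locate
# the relevant files in the repository on its own.
```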
Paul Gauthier points out that some SWE-bench cases appear to be underspecified and effectively impossible to solve because their tests rely on implementation details. It is unclear what the maximum possible score is.
Aider Leaderboards
The coding agent Aider maintains leaderboards of model performance on its key subtasks.
Last checked: 2024-06-25
Code Editing
- claude-3.5-sonnet
- DeepSeek Coder V2 (Open Weight)
- gpt-4o
- claude-3-opus
- gpt-4
- deepseek-chat v2 (Open Weight)
Code Refactoring
- claude-3-opus
- gpt-4o
- gpt-4-turbo
- gemini-1.5-pro
Source for Refactor Benchmark.
LiveCodeBench
LiveCodeBench: "Holistic and Contamination Free Evaluation of Large Language Models for Code"
Tests the strength of models across different coding sub-tasks; a toy illustration of the Test Output Prediction task is sketched after the list below.
- Code Generation
- Self-Repair
- Test Output Prediction
- Code Execution
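To make the sub-task names concrete, here is a toy item in the style of Test Output Prediction / Code Execution. The function is invented for illustration and does not come from the benchmark.

```python
# Toy illustration of a test-output-prediction / code-execution style item
# (the function below is invented for this example, not taken from LiveCodeBench).
def rotate(xs: list[int], k: int) -> list[int]:
    """Rotate the list right by k positions."""
    k %= len(xs)
    return xs[-k:] + xs[:-k]

# The model is shown the code and the call, and must predict the printed value
# without running it. Here the expected answer is [4, 5, 1, 2, 3].
print(rotate([1, 2, 3, 4, 5], 2))
```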
The below listing of standout models across subtasks is subjective.
Last checked: 2024-06-25
- Proprietary Leaders: GPT-4o, GPT-4-Turbo, Claude-3-Opus, Claude-3.5-Sonnet
- Open Weight Leaders:
Other notable benchmarks
- HumanEval by OpenAI