Leaderboards
There are many LLM benchmarks, but for the purpose of evaluating Autonomous DevTools we are most interested in testing an agent's ability to address a realistic task on an existing codebase.
SWE-bench Verified
The gold standard. Released in September 2023 by Princeton NLP, SWE-bench is the most widely accepted measure of an agent's ability to solve tasks in a realistic codebase. It was constructed from GitHub Pull Requests in real Open Source repositories, with unit tests verifying the change. To pass, the agent must effectively recreate that Pull Request.
The full set is costly to run, so we currently prefer the SWE-bench Verified subset, which contains only tasks that have been confirmed solvable by human annotators. Paul Gauthier explains the problem that led to that subset.
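For reference, the Verified subset is distributed as a Hugging Face dataset. The sketch below assumes the `datasets` library and the `princeton-nlp/SWE-bench_Verified` dataset ID; the field names shown should be checked against the official documentation.

```python
# Minimal sketch: inspect one SWE-bench Verified task instance.
# Assumes the `datasets` library and the public dataset ID
# "princeton-nlp/SWE-bench_Verified" (verify against the official docs).
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

task = ds[0]
print(task["instance_id"])              # identifier tying the task to a repo + PR
print(task["repo"])                     # source repository the PR targets
print(task["problem_statement"][:500])  # issue text the agent must resolve
# The gold patch and test patch define what "recreating the PR" means:
print(task["patch"][:300])
print(task["test_patch"][:300])
```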
SWE-bench maintains the official leaderboard where results are reported. Nopilot focuses on listing all reported scores from Open Source and Source Available agents.
| Rank | Agent | Score (Verified) | Group | License |
|---|---|---|---|---|
| 1 | OpenHands + CodeAct v2.1 (claude-3-5-sonnet-20241022) | 53.00% | OpenHands | MIT |
| 2 | Agentless-1.5 + Claude-3.5 Sonnet (20241022) | 50.80% | Agentless | MIT |
| 3 | Composio SWE-Kit (2024-10-25) | 48.60% | Composio | Source Available (ELv2) |
| 4 | AppMap Navie v2 | 47.20% | AppMap | Source Available |
| 5 | AutoCodeRover-v2.0 (Claude-3.5-Sonnet-20241022) | 46.20% | AutoCodeRover | Source Available |
| 6 | Moatless Tools + Claude 3.5 Sonnet (20241022) | ~45% | Albert Örwall | MIT |
| 7 | SWE-agent + Claude 3.5 Sonnet | 33.60% | Princeton NLP | MIT |
Last updated: 2024-11-24
These are unassisted scores. In the "assisted" setting, an "oracle" tells the agent which files need to be modified. There is usually a large gap between assisted and unassisted scores, highlighting that navigating the codebase is a key part of the problem.
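For intuition, the oracle file list can be read straight off the gold patch of the reference Pull Request. The sketch below illustrates that idea only; it is not the official SWE-bench harness, and the patch text and path are made up.

```python
# Hedged sketch (not the official SWE-bench harness): derive an "oracle"
# file list from the file paths touched by the gold patch.
import re

def oracle_files(gold_patch: str) -> list[str]:
    """Return the paths modified by a unified diff."""
    # `diff --git a/<path> b/<path>` headers mark each changed file.
    paths = re.findall(r"^diff --git a/(\S+) b/\S+", gold_patch, flags=re.MULTILINE)
    return sorted(set(paths))

# Hypothetical patch snippet for illustration only.
example_patch = """\
diff --git a/src/example_module.py b/src/example_module.py
--- a/src/example_module.py
+++ b/src/example_module.py
"""
print(oracle_files(example_patch))  # ['src/example_module.py']
```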
Aider Leaderboards
The coding agent Aider maintains leaderboards of model performance on its key subtasks.
Last checked: 2024-06-25
Code Editing
- claude-3.5-sonnet
- DeepSeek Coder V2 (Open Weight)
- gpt-4o
- claude-3-opus
- gpt-4
- deepseek-chat v2 (Open Weight)
Code Refactoring
- claude-3-opus
- gpt-4o
- gpt-4-turbo
- gemini-1.5-pro
Source for Refactor Benchmark.
LiveCodeBench
LiveCodeBench: "Holistic and Contamination Free Evaluation of Large Language Models for Code"
Tests the strength of models across four coding sub-tasks; an illustrative example follows the list.
- Code Generation
- Self-Repair
- Test Output Prediction
- Code Execution
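As a rough illustration (this is not an actual LiveCodeBench item), a "Code Execution" style task gives the model a short program and an input and asks it to predict the output without running the code:

```python
# Hedged illustration of the "Code Execution" sub-task style:
# the model is shown the function and an input, and must predict
# the return value without executing anything.
def f(xs):
    out = []
    for x in xs:
        if x % 2 == 0:
            out.append(x * x)
    return out

# Prompt to the model: "What does f([1, 2, 3, 4]) return?"
# Expected answer: [4, 16]
print(f([1, 2, 3, 4]))  # [4, 16]
```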
The following listing of standout models across sub-tasks is subjective.
Last checked: 2024-06-25
- Proprietary Leaders: GPT-4o, GPT-4-Turbo, Claude-3-Opus, Claude-3.5-Sonnet
- Open Weight Leaders:
Other notable benchmarks
- HumanEval by OpenAI
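HumanEval consists of 164 hand-written Python problems: the model completes a function body from its signature and docstring, and solutions are scored with pass@k against unit tests. Below is a minimal sketch of driving OpenAI's `human-eval` harness, closely following its README; `generate_one_completion` is a placeholder you would implement with your own model call.

```python
# Minimal sketch of the human-eval harness (pip install human-eval).
# `generate_one_completion` is a placeholder for your model call.
from human_eval.data import read_problems, write_jsonl

def generate_one_completion(prompt: str) -> str:
    # Placeholder: send the function signature + docstring to your model
    # and return only the completed function body.
    raise NotImplementedError

problems = read_problems()
samples = [
    dict(task_id=task_id,
         completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)
# Per the project README, score the samples with:
#   evaluate_functional_correctness samples.jsonl
```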