Leaderboards

There are many LLM benchmarks, but for the purpose of evaluating Autonomous DevTools we are most interested in testing an agent's ability to address a realistic task on an existing codebase.

SWE-bench Verified

The gold standard. Released in September 2023 by Princeton NLP, SWE-bench is the most widely accepted measure of an agent's ability to solve tasks in a realistic codebase. It was constructed from GitHub Pull Requests in real Open Source repositories, with unit tests verifying the change. To pass an instance, the agent must effectively recreate that Pull Request.

The full set is costly to run, so we currently prefer the SWE-bench Verified subset, whose tasks have been confirmed by humans to be solvable. Paul Gauthier explains the problem that led to that subset.
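
To make the task format concrete, the sketch below loads the Verified subset from Hugging Face and prints the fields an agent works from. This is a minimal sketch, assuming the dataset is published as princeton-nlp/SWE-bench_Verified with the field names used by the public SWE-bench release.

```python
# Minimal sketch: inspect one SWE-bench Verified instance.
# Assumes the dataset is published as "princeton-nlp/SWE-bench_Verified"
# with the field names used by the public SWE-bench release.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
example = ds[0]

print(example["instance_id"])        # unique id combining repo and Pull Request number
print(example["repo"])               # GitHub repository the Pull Request came from
print(example["base_commit"])        # commit the agent's checkout starts from
print(example["problem_statement"])  # the issue text the agent is given
# Held out from the agent and used only for scoring:
# example["patch"]         - the gold Pull Request diff
# example["test_patch"]    - tests added by the Pull Request
# example["FAIL_TO_PASS"]  - tests that must go from failing to passing
```

A leaderboard submission is then a set of model-generated patches keyed by instance id; the SWE-bench harness applies each patch and runs the associated tests.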

SWE-bench maintains the official leaderboard where results are reported. Nopilot focuses on listing all reported scores from Open Source and Source Available agents.

| Rank | Agent | Score (Verified) | Group | License |
|------|-------|------------------|-------|---------|
| 1 | OpenHands + CodeAct v2.1 (claude-3-5-sonnet-20241022) | 53.00% | OpenHands | MIT |
| 2 | Agentless-1.5 + Claude-3.5 Sonnet (20241022) | 50.80% | Agentless | MIT |
| 3 | Composio SWE-Kit (2024-10-25) | 48.60% | Composio | Source Available (ELv2) |
| 4 | AppMap Navie v2 | 47.20% | AppMap | Source Available |
| 5 | AutoCodeRover-v2.0 (Claude-3.5-Sonnet-20241022) | 46.20% | AutoCodeRover | Source Available |
| 6 | Moatless Tools + Claude 3.5 Sonnet (20241022) | ~45% | Albert Örwall | MIT |
| 7 | SWE-agent + Claude 3.5 Sonnet | 33.60% | Princeton NLP | MIT |

Last updated: 2024-11-24

These are unassisted scores. In the "assisted" setting, an "oracle" tells the agent up front which files need to be modified. There is usually a large gap between assisted and unassisted scores, highlighting that navigating the codebase is a key part of the problem.
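
For intuition, the oracle file list is simply the set of files touched by the reference patch. The sketch below derives it from a unified diff; the sample patch and file path are made up for illustration.

```python
# Minimal sketch: derive the "oracle" file list from a gold patch.
# In the assisted setting the agent is handed these paths up front;
# unassisted, it has to find them by navigating the repository itself.
import re

def oracle_files(gold_patch: str) -> list[str]:
    """Return the file paths touched by a unified-diff patch."""
    # Header lines look like: "diff --git a/path/to/file.py b/path/to/file.py"
    return sorted(set(re.findall(r"^diff --git a/(\S+) b/\S+", gold_patch, re.MULTILINE)))

sample_patch = """\
diff --git a/src/parser.py b/src/parser.py
--- a/src/parser.py
+++ b/src/parser.py
@@ -10,6 +10,7 @@
+    tokens = tokenize(line)
"""

print(oracle_files(sample_patch))  # ['src/parser.py']
```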

Aider Leaderboards

The coding agent Aider maintains leaderboards of model performance on its key subtasks: code editing and code refactoring.

Last checked: 2024-06-25

Code Editing

  • claude-3.5-sonnet
  • DeepSeek Coder V2 (Open Weight)
  • gpt-4o
  • claude-3-opus
  • gpt-4
  • deepseek-chat v2 (Open Weight)

Code Refactoring

  • claude-3-opus
  • gpt-4o
  • gpt-4-turbo
  • gemini-1.5-pro

Source for Refactor Benchmark.

LiveCodeBench

LiveCodeBench: "Holistic and Contamination Free Evaluation of Large Language Models for Code"

Tests the strength of models across different coding sub-tasks.

  • Code Generation
  • Self-Repair
  • Test Output Prediction
  • Code Execution

The below listing of standout models across subtasks is subjective.

Last checked: 2024-06-25

Other notable benchmarks