
Leaderboards

There are many LLM benchmarks, but for the purposes of evaluating Autonomous DevTools we are most interested in testing an agent's ability to address a realistic task on an existing codebase.

SWE-bench Lite

The gold standard. Released in September 2023 by Princeton NLP, SWE-bench is the most widely accepted measure of an agent's ability to solve tasks in a realistic codebase. It was constructed from GitHub Pull Requests on real Open Source repositories, with unit tests verifying the change. To pass, the agent must effectively recreate that Pull Request. Since the full set is costly to run, the SWE-bench Lite subset is often used.
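For orientation, here is a minimal sketch of loading the benchmark with the Hugging Face datasets library. The dataset ID "princeton-nlp/SWE-bench_Lite" and the field names shown are assumptions about the published release and may differ from the current version.

```python
# Sketch: inspect SWE-bench Lite task instances from the Hugging Face hub.
# Dataset ID and field names are assumptions; check the current release.
from datasets import load_dataset

swebench_lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
print(len(swebench_lite))  # number of instances in the Lite subset

task = swebench_lite[0]
print(task["repo"])               # source repository the task was drawn from
print(task["problem_statement"])  # the GitHub issue the agent must resolve
print(task["patch"])              # the gold Pull Request diff (hidden from the agent)
print(task["FAIL_TO_PASS"])       # tests that must flip from failing to passing
```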

SWE-bench maintains the official leaderboard where results are reported and verified. Nopilot focuses on listing all reported scores from Open Source and Source Available agents.

| Rank | Agent | Score (Lite) | Status | Group | License |
|------|-------|--------------|--------|-------|---------|
| 1 | Moatless Tools + Claude 3.5 Sonnet | 26.67% | Verified | Albert Örwall | MIT |
| 2 | OpenDevin CodeAct 1.3 + GPT-4o | 26.67% | Reported | OpenDevin | MIT |
| 3 | Aider + GPT-4o, Claude 3 Opus | 26.33% | Reported | Paul Gauthier | Apache-2 |
| 4 | AppMap Navie + GPT-4o | 21.67% | Reported | AppMap | Source Available |
| 5 | SWE-agent + GPT-4 | 18% | Verified | Princeton NLP | MIT |

Last updated: 2024-06-25

These are unassisted scores. SWE-bench scores come in "assisted" and "unassisted" variants: in the "assisted" setting, an "oracle" tells the agent which files need to be modified. There is usually a large gap between the two scores, highlighting that navigating the codebase is a key part of the problem.
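To make the "oracle" setting concrete, the sketch below shows one way to derive the assisted file list by reading the paths out of the gold patch. The `oracle_files` helper is hypothetical and is not part of the official SWE-bench harness.

```python
# Sketch of what "assisted" (oracle) retrieval provides: the list of files
# touched by the gold patch. Illustrative only, not the SWE-bench harness.
import re

def oracle_files(gold_patch: str) -> list[str]:
    """Extract the paths modified by a unified diff."""
    # Matches lines like: diff --git a/path/to/file.py b/path/to/file.py
    return re.findall(r"^diff --git a/(\S+) b/\S+$", gold_patch, flags=re.MULTILINE)

example_patch = """\
diff --git a/astropy/io/fits/header.py b/astropy/io/fits/header.py
--- a/astropy/io/fits/header.py
+++ b/astropy/io/fits/header.py
"""
print(oracle_files(example_patch))  # ['astropy/io/fits/header.py']
```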

Paul Gauthier points out that some SWE-bench cases appear to be underspecified and effectively impossible to solve because the tests rely on implementation details. It is unclear what the maximum achievable score is.

Aider Leaderboards

The coding agent Aider maintains leaderboards of model performance on its key subtasks.

Last checked: 2024-06-25

Code Editing

  • claude-3.5-sonnet
  • DeepSeek Coder V2 (Open Weight)
  • gpt-4o
  • claude-3-opus
  • gpt-4
  • deepseek-chat v2 (Open Weight)

Code Refactoring

  • claude-3-opus
  • gpt-4o
  • gpt-4-turbo
  • gemini-1.5-pro

Source for Refactor Benchmark.

LiveCodeBench

LiveCodeBench: "Holistic and Contamination Free Evaluation of Large Language Models for Code"

Tests the strength of models across different coding sub-tasks (a conceptual sketch of the Self-Repair loop follows the list):

  • Code Generation
  • Self-Repair
  • Test Output Prediction
  • Code Execution
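As a rough illustration of the Self-Repair subtask, the sketch below runs a generate-execute-retry loop, feeding execution errors back into the next attempt. `generate_code` is a placeholder for a real model call, and the loop is a conceptual example rather than the LiveCodeBench evaluation harness.

```python
# Conceptual sketch of a self-repair loop: propose code, execute it, and feed
# any error output back into the next attempt. Not the LiveCodeBench harness.
import subprocess
import tempfile

def generate_code(prompt: str, feedback: str = "") -> str:
    # Placeholder for a real model call; returns a trivial candidate program.
    return "print('candidate solution')"

def self_repair(prompt: str, run_cmd: list[str], max_attempts: int = 3) -> bool:
    feedback = ""
    for _ in range(max_attempts):
        candidate = generate_code(prompt, feedback)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(candidate)
            path = f.name
        result = subprocess.run(run_cmd + [path], capture_output=True, text=True)
        if result.returncode == 0:
            return True           # candidate ran cleanly: stop here
        feedback = result.stderr  # hand the error back to the model
    return False

print(self_repair("print a greeting", ["python3"]))
```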

The listing below of standout models per subtask is subjective.

Last checked: 2024-06-25

Other notable benchmarks