The Devin Wars Are Over
Devin deployed. Decisive defeat deflected. Dev displacement dubious.
Today we crown the winners of "The Devin Wars", and prepare to reset our goalposts for the next race.
As of Tuesday, Devin is now available. Cognition Labs' viral demo from 9 months ago is no longer just a demo. You can buy it for your team for $500 USD per month.
Cognition still incorrectly refers to this coding agent as "the world's first AI software engineer". Nonetheless, the increased autonomy as compared to today's popular coding assistants remains compelling.
If you'd like more backstory, try this video.
The Moment
A lot has happened since March, when Devin's demo sent out a shockwave of excitement, confusion, and even panic.
- Elements of Devin's announcement were found to be exaggerated, especially the claim of solving an Upwork job
- Devin's SWE-bench score of 14% was tied by numerous Open Source agents and eventually doubled
- Numerous corporate players released similar products, including:
- GitHub Copilot Workspace
- Amazon Q Developer
- Cursor Composer
- Sweep AI
Through all of this, of course, human programmers still exist and this shows no signs of changing.
The inevitable perception is that Cognition has lost the moment. They followed up their huge buzz with near silence, and are now in an awkward position as late entrants to the wave they started.
Even so, we should expect more surprises from them and everyone else; it's still extremely early for coding agents.
The Award Goes To...
Nopilot hereby recognizes these contributions which pave the way for the future of Open Source autonomous coding agents.
Drumroll, please...
🥁🥁🥁
Overall: OpenHands
All Hands AI handily takes home Best In Show. In addition to creating the most polished user interface, their CodeAct agent is the current SWE-bench leader and has been consistently competitive over time. (*Disclosure: I had no relationship with All Hands AI when I wrote this article, or when I suggested the parameters of competition in April, but I do have one now.)
Their Open Source project OpenHands (formerly OpenDevin) has gathered impressive momentum, and serves as a modular platform for experimenting with new agents. They recently secured seed funding to continue this work and are currently hiring.
Research Division: SWE-agent
From the creators of SWE-bench itself came SWE-agent, perhaps our best blueprint for understanding how these agents work and how to build new ones.
Their architecture stands out as extremely simple, yet it performs similarly to much more complicated codebases. Their paper SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering (video reading) is essential reading, with detailed discussion of the design decisions and factors that drove performance.
Product Vision Award: Cognition
Credit where it's due! Even though Nopilot is here to follow Open Source agents, it's only fair to acknowledge Devin's influence. Many independent code assistant projects suddenly began coordinating around it as the new default vision.
As explained in the initial announcement tweet:
Devin is an autonomous agent that solves engineering tasks through the use of its own shell, code editor, and web browser.
This is not the only shape that a coding agent can take, but as a package it resonated with a wide variety of people.
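To make that shape concrete, here is a minimal sketch of such an agent loop. This is purely illustrative and not Cognition's implementation; the tool set and the `choose_action` policy stub are assumptions standing in for a real model call.

```python
import subprocess
import urllib.request

# Illustrative sketch only: not Cognition's code. choose_action() is a
# hypothetical stand-in for the model that picks the next action.

def run_shell(command: str) -> str:
    """The 'shell' tool: run a command and return its output."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

def edit_file(path: str, content: str) -> str:
    """A very crude 'editor' tool: overwrite a file with new content."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return f"wrote {len(content)} characters to {path}"

def browse(url: str) -> str:
    """The 'browser' tool: fetch a page and return (truncated) text."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")[:2000]

TOOLS = {"shell": run_shell, "edit": edit_file, "browse": browse}

def choose_action(task: str, history: list) -> tuple:
    """Hypothetical policy: a model picks the next tool and its arguments,
    or returns ("done", ()) when it believes the task is finished."""
    raise NotImplementedError("swap in a real model call here")

def agent_loop(task: str, max_steps: int = 20) -> list:
    """Observe-act loop: ask the model for an action, run the chosen tool,
    feed the observation back, and repeat until done or out of steps."""
    history: list = []
    for _ in range(max_steps):
        tool, args = choose_action(task, history)
        if tool == "done":
            break
        observation = TOOLS[tool](*args)
        history.append((tool, args, observation))
    return history
```

Real agents such as OpenHands and SWE-agent build on this basic loop with sandboxed execution and much more carefully designed interfaces between the model and its tools.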
Unsung Hero: Aider
Pre-dating Devin by a year, Paul Gauthier's Aider was one of the first agents capable of extending an existing codebase given a prompt.
After agents emerged that could take many steps on their own, it was tempting to think that Aider's more interactive design was outdated, but Aider proved us wrong with a state-of-the-art SWE-bench score in June. Keeping up with more agentic approaches, Aider "worked smarter not harder", saving time and tokens.
Here are some other features that Aider adopted early.
- Repo mapping using ctags and tree-sitter (see the simplified sketch after this list)
- Targeted benchmarks comparing LLM accuracy for code editing
- Voice mode
- Incremental git commits
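To give a feel for the repo-mapping idea mentioned above: Aider's real map is built with ctags and tree-sitter, which cover many languages and rank symbols by importance. The sketch below is a deliberately simplified stand-in using only Python's standard `ast` module, not Aider's actual code.

```python
import ast
from pathlib import Path

def repo_map(root: str) -> dict:
    """Summarize a repo as {file: [top-level classes and functions]} so a
    model can see the codebase's shape without reading every file in full.
    Simplified stand-in: Python files only, no symbol ranking."""
    summary = {}
    for path in Path(root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files we cannot parse
        names = [
            node.name
            for node in tree.body
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        ]
        if names:
            summary[str(path)] = names
    return summary

if __name__ == "__main__":
    for file, names in repo_map(".").items():
        print(f"{file}: {', '.join(names)}")
```

The resulting summary is small enough to fit in a prompt, giving the model a table of contents for the repo before it decides which files to open.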
The Elephant In The Room
Despite all these great efforts, new users will generally find that these coding agents are not ready. This applies to the corporate offerings as well, not just the Open Source projects. It takes practice to learn which tasks an agent can handle well enough to save you time rather than get in your way. In addition to task granularity, you need to consider how local context may confuse the agent, and stability issues remain.
Arguably, the user experience is lagging behind performance improvements on agent benchmarks. This could indicate that we need new measures besides SWE-bench for a more complete picture. Just as SWE-bench was a leap forward from earlier benchmarks like HumanEval, moving from the scope of a file to an entire repo, we should keep an eye out for the next paradigm. If customization for context is a limiting factor, that will be difficult to evaluate objectively.
When in doubt, listen to your users.
What's Next
With so many different products and caveats, this remains a very confusing space. We are midway between research prototypes and true DevTools.
For builders and adopters, here is some broad advice:
- Focus on what can be done safely, reliably, and sustainably
- Limit task scope; specialized agents are usually more effective than general ones
- Prefer algorithms to LLMs when you have the option (see the sketch after this list)
- Continue to invest in humanity
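As one small illustration of the "prefer algorithms to LLMs" point: let deterministic code handle the parts of a task it can decide reliably, and reach for a model only where judgment is genuinely required. The `ask_llm` helper below is a hypothetical placeholder, not a real API.

```python
def sort_imports(lines: list) -> list:
    """Deterministically move top-level import lines to the front, sorted.
    Cheap, repeatable, and testable; no model needed. (Simplistic: a real
    tool would preserve import groups and comments.)"""
    imports = sorted(l for l in lines if l.startswith(("import ", "from ")))
    rest = [l for l in lines if not l.startswith(("import ", "from "))]
    return imports + rest

def ask_llm(prompt: str) -> str:
    """Hypothetical model call, used only when an algorithm can't decide."""
    raise NotImplementedError("swap in a real model call here")

def tidy_file(source: str) -> str:
    lines = source.splitlines()
    lines = sort_imports(lines)           # algorithmic: reliable and free
    if any("TODO" in l for l in lines):   # judgment call: hand it to the model
        return ask_llm("Resolve the TODOs in:\n" + "\n".join(lines))
    return "\n".join(lines)
```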
For further discussion, here are recent talks from three extremely qualified people.
- From Github Copilot to AI Native Development - Cursor, v0, Bolt.new + many more tools! by Patrick Debois (DevOps)
- The Open Source AI Coding Revolution by Graham Neubig (All Hands AI, Carnegie Mellon University), and related slides Agents in December 2024
- Few Shot Code Generation to Autonomous Software Engineering Agents by John Yang (SWE-bench, Stanford)
Please feel free to reach out with any questions you'd like to see covered.
With patience the most tangled cord may be undone.
Want to help out? See where you fit in!