
The Devin Wars Are Over

· 6 min read
Ray Myers
Mender.AI, Empathy in Tech

Devin deployed. Decisive defeat deflected. Dev displacement dubious.

Today we crown the winners of "The Devin Wars", and prepare to reset our goalposts for the next race.

As of Tuesday, Devin is now available. Cognition Labs' viral demo from 9 months ago is no longer just a demo. You can buy it for your team for $500 USD per month.

Cognition still incorrectly refers to this coding agent as "the world's first AI software engineer". Nonetheless, the increased autonomy as compared to today's popular coding assistants remains compelling.

If you'd like more backstory, try this video.

The Moment

A lot has happened since March, when Devin's demo sent out a shockwave of excitement, confusion, and even panic.

  • Elements of Devin's announcement were found to be exaggerated, especially the claim of solving an Upwork job
  • Devin's SWE-bench score of 14% was tied by numerous Open Source agents and eventually doubled
  • Numerous corporate players released similar products, including:
    • GitHub Copilot Workspace
    • Amazon Q Developer
    • Cursor Composer
    • Sweep AI

Through all of this, of course, human programmers still exist and this shows no signs of changing.

The inevitable perception is that Cognition has lost the moment. They followed up their huge buzz with near silence, and are now in an awkward position as late entrants to the wave they started.

Still, we should expect more surprises from them and everyone else; it's still extremely early for coding agents.

The Award Goes To...

Nopilot hereby recognizes these contributions which pave the way for the future of Open Source autonomous coding agents.

Drumroll, please...

🥁🥁🥁

πŸ† Overall: OpenHands​

All Hands AI handily takes home Best In Show. In addition to creating the most polished user interface, their CodeAct agent is the current SWE-bench leader and has been consistently competitive over time. (*Disclosure: I had no relationship with All Hands AI when I wrote this article, or when I suggested the parameters of competition in April, but I do have one now.)

Their Open Source project OpenHands (formerly OpenDevin) has gathered impressive momentum, and serves as a modular platform for experimenting with new agents. They recently secured seed funding to continue this work and are currently hiring.

OpenHands repo star history

πŸ… Research Division: SWE-agent​

From the creators of SWE-bench itself came SWE-agent, perhaps our best blueprint for understanding how these agents work and how to build new ones.

Their architecture stands out as extremely simple, performing similarly to much more complicated codebases. Their paper SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering (video reading) is essential reading, with detailed discussion of design decisions and factors that drove performance.

πŸ… Product Vision Award: Cognition​

Credit where it's due! Even though Nopilot is here to follow Open Source agents, it's only fair to acknowledge Devin's influence. Many independent code assistant projects suddenly began coordinating around it as the new default vision.

As explained in the initial announcement tweet:

Devin is an autonomous agent that solves engineering tasks through the use of its own shell, code editor, and web browser.

This is not the only shape that a coding agent can take, but as a package it resonated with a wide variety of people.

πŸ… Unsung Hero: Aider​

Pre-dating Devin by a year, Paul Gauthier's Aider was one of the first agents capable of extending an existing codebase given a prompt.

After agents emerged that could take many steps on their own, it was tempting to think that Aider's more interactive design was outdated, but Aider proved us wrong with a state-of-the-art SWE-bench score in June. Keeping up with more agentic approaches, Aider "worked smarter, not harder", saving time and tokens.

Here are some other features that Aider adopted early.

🐘 The Elephant In The Room

Despite all these great efforts, new users will generally find these coding agents are not ready. This applies to the corporate offerings as well, not just the Open Source projects. It takes practice to understand what kinds of tasks are ready for agents to save you time rather than get in your way. In addition to task granularity, you need to consider how local context may confuse the agent, and there are stability issues.

Arguably, the user experience is lagging behind performance improvements on agent benchmarks. This could indicate that we need new measures besides SWE-bench for a more complete picture. Just as SWE-bench was a leap forward from earlier benchmarks like HumanEval, moving from the scope of a file to an entire repo, we should keep an eye out for the next paradigm. If customization for context is a limiting factor, that will be difficult to evaluate objectively.

When in doubt, listen to your users.

What's Next

With so many different products and caveats, this remains a very confusing space. We are midway between research prototypes and true DevTools.

For builders and adopters, here is some broad advice:

  • Focus on what can be done safely, reliably, and sustainably
  • Limit task scope; specialized agents are usually more effective than general ones
  • Prefer algorithms to LLMs when you have the option
  • Continue to invest in humanity

For further discussion, here are recent talks from three extremely qualified people.

Please feel free to reach out with any questions you'd like to see covered.

With patience the most tangled cord may be undone.



Tiny Type Agent - Safely Improving Python Code

· 2 min read
Ray Myers
Mender.AI, Empathy in Tech

As a companion to my talk at The Artificially Intelligent Enterprise Summit, I'd like to demonstrate a minimal yet useful and safe Autonomous DevTool.

Nopilot talk

Definitions

Let's adopt these loose terms for now.

  • Assistant: Takes ~1 action per user interaction
  • Agent: Completes a meaningful task by taking many actions with environment feedback
  • Autonomous DevTool: AI agent packaged into a polished product, ready to integrate with your software development lifecycle
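To make the Assistant/Agent distinction concrete, here is a minimal sketch of an agent-style loop. Every name in it is hypothetical: a scripted stand-in replays a fixed plan so the loop runs end to end without a real model or shell.

```python
from dataclasses import dataclass


@dataclass
class Action:
    command: str   # e.g. "edit_file", "run_tests", or "done"
    payload: str = ""


class ScriptedModel:
    """Stand-in for an LLM: replays a fixed plan so the loop is runnable."""

    def __init__(self) -> None:
        self.plan = [
            Action("edit_file", "add a type hint"),
            Action("run_tests"),
            Action("done", "task complete"),
        ]

    def next_action(self, observation: str) -> Action:
        return self.plan.pop(0)


class FakeEnvironment:
    """Stand-in for the shell/editor/test runner: echoes each action."""

    def execute(self, action: Action) -> str:
        return f"{action.command}: ok"


def run_agent(task: str, model: ScriptedModel, env: FakeEnvironment,
              max_steps: int = 10) -> str:
    # An assistant would stop after one model call; an agent keeps
    # acting on environment feedback until it declares the task done.
    observation = task
    for _ in range(max_steps):
        action = model.next_action(observation)
        if action.command == "done":
            return action.payload
        observation = env.execute(action)
    return "step budget exhausted"


print(run_agent("add type hints to utils.py", ScriptedModel(), FakeEnvironment()))
```

The loop, not the model, is what makes it an agent: it keeps going until the task is done or the step budget runs out.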

Use case: Add type-hints to legacy Python

Suppose you like type checking for Python, but the hints are toilsome to add!

This 400-line Python script shows how to address this with LLMs combined with safeguards: tiny-type-agent on GitHub.

  • LLM suggests types
  • Agent uses Tree-Sitter to update the syntax tree in a temporary "shadow file"
  • MyPy validates the shadow file
  • Based on the result, Agent updates the original source or discards the suggestion
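Here is a minimal sketch of that validate-then-apply pattern. It is not the tiny-type-agent code itself: the real script edits via Tree-Sitter, while a plain string replacement stands in here, and suggest_signature is a hypothetical stub where the LLM call would go. It assumes mypy is installed on your PATH.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def suggest_signature(source: str, old_sig: str) -> str:
    """Hypothetical stub for the LLM call: return a type-hinted version
    of old_sig. The real tool prompts a model; here we hard-code the
    suggestion for the demo function below."""
    return "def add(a: int, b: int) -> int:"


def mypy_passes(path: Path) -> bool:
    """Validate the shadow file: accept the edit only if mypy is happy."""
    result = subprocess.run(["mypy", str(path)], capture_output=True, text=True)
    return result.returncode == 0


def apply_hint(source_file: Path, old_sig: str) -> bool:
    source = source_file.read_text()
    new_sig = suggest_signature(source, old_sig)

    # Write the candidate edit to a temporary "shadow file" so the
    # original is never touched until the suggestion validates.
    with tempfile.TemporaryDirectory() as tmp:
        shadow = Path(tmp) / source_file.name
        shadow.write_text(source.replace(old_sig, new_sig))
        if not mypy_passes(shadow):
            return False  # discard the suggestion
        shutil.copy(shadow, source_file)  # promote the validated edit
        return True


if __name__ == "__main__":
    target = Path("example.py")
    target.write_text("def add(a, b):\n    return a + b\n")
    print("applied" if apply_hint(target, "def add(a, b):") else "discarded")
```

The safety comes from the gate, not the model: a wrong suggestion fails type-checking in the shadow file and never reaches the original source.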

It is perhaps not yet a DevTool because it's not polished (for that, help us finish the larger tool Menderbot), but it shows a workflow that is safe enough to improve production code with extremely low risk. In fact, it has already been used to update production code at a major company and found bugs in the process.

Background

With patience the most tangled cord may be undone.



Dissecting Devin

· 7 min read
Ray Myers
Mender.AI, Empathy in Tech

Devin dissected. Disinformation debunked. Direction declared!

It's now been one month since Cognition announced Devin, which we remain excited about. However, it's time to go over that initial announcement tweet, because we need to get some things clear.

We predict that Autonomous DevTools will have a historic impact on software (good or bad!). This period is the beginning of their transition from research prototypes to eventual production use.

If you tuned into this "dissection" hoping to see real architecture and code deep dives, don't worry - it's coming next. This site exists to promote the development of high quality Open Source tools, and we intend to hold every player to the same standards we're holding Cognition to here.

The announcement

Devin announcement tweet

Point-by-point

Today we're excited to introduce Devin,

Good so far.

the first AI software engineer.

False. We'll come back to that.

The benchmark

Devin is the new state-of-the-art on the SWE-Bench coding benchmark,

True-ish. Their longer blog post explains that only a 25% sample of the benchmark was run. It hasn't been independently verified, which is probably why SWE-bench doesn't list it. Admittedly, the full benchmark can get expensive to run, which is why there is now an official subset called SWE-bench-lite.

However, we tend to believe that the performance was roughly as reported, because Cognition was nice enough to provide the results from the runs. Another reason it's plausible is that two Open Source agents using different architectures, SWE-agent (Princeton) and reportedly AutoCodeRover (NUS), have since met or exceeded that score; see our leaderboards page for updates.

Job interviews

and it has successfully passed practical engineering interviews from leading AI companies,

Misleading even if true. Also unimpressive, as it's a known capability of GPT-4.

You can see this "old trick" in the paper Sparks of Artificial General Intelligence: Early experiments with GPT-4 from March 2023 by Microsoft Research. Here's Figure 1.5:

Sparks of Artificial General Intelligence: Early experiments with GPT-4

When we design a test for humans, such as an exam or interview, we build in assumptions about the common abilities and deficiencies intrinsic to humans. For instance, to whatever extent a bar exam can assess that a human can practice law, it has very little bearing on whether a machine can, Language Model or otherwise.

Since we're being careful with terms today, these were not job interviews, they were mock interviews. You can only pass a job interview if you are being interviewed for a job, meaning the company believes you might be otherwise qualified.

The Upwork job

and has even completed real jobs on Upwork.

Completely false, as we can easily see by simply reading the Upwork task description in the demo video, which says, "Your deliverable will be detailed instructions on how to do it in an EC2 instance in AWS." Internet of Bugs provides an excellent analysis showing that not only is the task incomplete, but the impressive debugging capabilities we saw in the video were the result of the agent adding its own bugs while doing unnecessary work because it ignored the instructions in the README. Sometimes they're too much like us!

See the last section for an update on this issue.

The product itself

Devin is an autonomous agent that solves engineering tasks through the use of its own shell, code editor, and web browser.

True! In our view, this is the principal achievement. Although there were prior efforts (GPT-Engineer, Aider, Smol-Developer, etc.), this really brought it into focus: the smooth combination of these elements into a UX that anyone can look at and "get it". Once something like this is released and proven, tons of people will want one. We will buy them and we will build them.

More on SWE-bench

When evaluated on the SWE-Bench benchmark, which asks an AI to resolve GitHub issues found in real-world open-source projects, Devin correctly resolves 13.86% of the issues unassisted, far exceeding the previous state-of-the-art model performance of 1.96% unassisted and 4.80% assisted.

As covered above, this is plausible. Media outlets have widely misreported this as being able to solve "14% of bugs on GitHub" and similar, which is a wildly different claim that Cognition never made. SWE-bench is limited in the types of issues it samples, containing only widely-used Python libraries covered by tests.

It would be straightforward (if time-consuming) to gather alternate test cases for other languages, creating things like SWE-bench-Java. Beware of the "Python bias" still prevalent across many AI training sets and coding benchmarks.

Check out what Devin can do in the thread below.

It was a cool thread! But where were we?

What is a Software Engineer?

There is currently no such thing as an "AI Software Engineer."

Software Engineer was a term introduced by Margaret Hamilton for the Apollo Moon landing project in order to secure professional respect for her colleagues and to emphasize the disciplined practices they were developing. By consensus, the term's use in most of the industry remains somewhat aspirational, but with the world now running on software we have good reason to use it carefully.

For discussion's sake, let's consider some basic necessary, though not sufficient, conditions. We might all agree that a Software Engineer:

  • Is a professional, having an obligation to uphold a standard of care
    • Therefore, has legal and moral agency typically associated with personhood
  • Is capable not only of solving small predefined tasks in a codebase, but also
    • Creating and maintaining a codebase of significant size over a period of time
    • Partnering to define the requirements to be done

Without the supervision of a human programmer, every autonomous coding tool to date would fail these minimal prerequisites. That's before any consideration of the actual engineering discipline as currently understood; see Dave Farley's recent book Modern Software Engineering for a nice formulation.

Update: Cognition clarifies, conceding to critics?

We promised that if Cognition responded to these points, we would include their remarks, and they have!

Their April 17th thread announced that the closed beta was rolling out to more of the waitlist, and they included this:

We recently got questions about one of Devin's Upwork runs, where Devin's output wasn't what the request asked for. It did the task directly instead of giving directions as requested (which to be clear is on us, sorry for the oversight).

Still, Devin is often inefficient and makes mistakes, some that it fixes and others that cause it to get stuck.

Part of the reason we're glad to expand access is so more people can form their own opinions about Devin's strengths and shortcomings.

We believe that skepticism is good and we all need to vet new technologies critically, so keep it coming!

This appears to be a tacit acknowledgement of multiple points in the Internet of Bugs video. It also emphasizes the importance of scrutiny, suggesting an ongoing dialog. While their overall narrative is unchanged, this is a stand-up move!

This encourages the conversation about how Autonomous DevTools will integrate into an overall healthy process.

We can fix this

My advice to everyone building or selling in this space is simple: take a deep breath and figure out how you can give people exactly what you write on the tin.

What's done is done. Even industry leaders have gotten caught up in the frenzy and claimed some pretty outlandish things, leading to a stream of embarrassing headlines. We need to restore trust.

Building software that people can trust is hard. If you want to learn how, there's help. For instance, next month I'm honored to speak at Craft Conf 2024 in Budapest, with the talk "Code Mending in the AI Age". Just a glance through Craft's featured speaker list shows people who have put out treasure troves on improving software. Here's a quick playlist if you'd like to browse!

With patience the most tangled cord may be undone.



Welcome

· One min read
Ray Myers
Mender.AI, Empathy in Tech

Three weeks ago, the stealth startup Cognition announced an astonishing demo called Devin, ushering in the next generation of AI coding tools.

That announcement was a shockwave that mobilized thousands of developers around the world, scrambling to recreate that result as Open Source.

Amazingly, between those efforts and research released in the last week from Princeton and NUS, that has effectively been achieved: the playing field is now level.

What is needed now? Direction. In the tradition of StackOverflow, we will create the tools needed for this community to collaborate and thrive, turning this flash mob into a movement.

That movement will shape the future of software in the open.

See you there :)