Devin's debut: A reality check for AI software engineers

TL;DR

Devin, an AI tool marketed as the first AI software engineer, has shown significant limitations in practical application.
Independent testing by Answer.AI found a success rate of just 15%: 3 of 20 assigned tasks completed.
The tool struggles with fundamental problem-solving and often gets stuck in technical dead-ends.
AI coding tools are more effective for experienced developers who can refine and correct the generated code.
Current AI tools work better as prototyping and learning aids than as replacements for software engineers.

The concept of an AI software engineer has captured the imagination of many, promising to automate the complex process of software development. Among the most discussed tools is Devin, introduced by Cognition AI, which was touted as the "first AI software engineer". However, recent analyses paint a less optimistic picture, revealing that Devin struggles with basic tasks and falls short of the claims made by its creators.

Cognition AI initially claimed that Devin could "build and deploy apps end to end" and "autonomously find and fix bugs in codebases", claims that sparked both excitement and concern within the software engineering community.

To validate these claims, researchers at Answer.AI, an independent AI research and development lab, conducted a month-long analysis. Their findings were stark: out of 20 tasks attempted, Devin only succeeded in three, a mere 15% success rate. This poor performance raises serious questions about the current capabilities of AI in handling complex software engineering challenges.
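The arithmetic behind that headline figure is simple enough to check directly. The following is a minimal Python sketch, not anything taken from the Answer.AI study itself: the function name `success_rate` is hypothetical, and the only data used are the two counts quoted above (20 tasks attempted, 3 completed).

```python
# Hypothetical helper; the counts below are the ones quoted in this article.
def success_rate(successes: int, attempts: int) -> float:
    """Return the fraction of attempted tasks that succeeded."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return successes / attempts


rate = success_rate(3, 20)  # 3 successes out of 20 attempted tasks
print(f"{rate:.0%}")        # prints "15%"
```

With a sample of only 20 tasks, the figure is best read as a rough indicator rather than a precise benchmark.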

The issues with Devin are not limited to its success rate. The researchers noted that it was often impossible to predict which tasks Devin would complete successfully. Even tasks similar to those where Devin had previously succeeded would fail in complex and time-consuming ways. This unpredictability, combined with Devin's tendency to pursue impossible solutions, highlights a crucial flaw in its design. According to the Answer.AI team, Devin would spend days on solutions that were fundamentally unachievable, rather than recognizing and addressing the core issues.

A specific example cited by the researchers involved deploying multiple applications to the Railway platform. Instead of recognizing that the task was not possible as specified, Devin pressed on, hallucinating interactions with the platform. This behavior underscores the gap between the claims made by Cognition AI and the actual capabilities of Devin, a gap that is not unique to this particular tool. As Carl Brown from the YouTube channel Internet of Bugs noted, the early demos made Devin look highly productive, but much of that apparent work consisted of generating errors and then fixing them, which is far from the intended use case.

The problems with Devin are further compounded by its inefficiency. Tasks that seemed straightforward often took days rather than hours, with the AI getting stuck in technical dead-ends and producing overly complex, unusable solutions. This inefficiency, combined with the high cost of $500 per month for engineering teams, makes Devin a questionable investment for most organizations.

The three Answer.AI researchers behind the study, Hamel Husain, Isaac Flath, and Johno Whitaker, summed up the experience in their own words: Devin would "press forward with tasks that weren't actually possible", and even simple tasks took "days rather than hours". While they acknowledged that the tool was impressive when it worked, they also concluded that it "rarely worked", further emphasizing the gap between the hype and the reality.

The challenges faced by Devin illustrate a broader issue with AI coding tools. Addy Osmani explains that while AI can speed up coding, it does not replace the need for experienced developers. He notes that senior engineers use AI to accelerate what they already know how to do, while junior engineers often struggle, accepting incorrect or outdated solutions. This creates a "knowledge paradox" where AI tools help experienced developers more than beginners.

Osmani also highlights the "70% problem," where non-engineers using AI for coding can quickly reach a point where the final 30% of a project becomes an exercise in diminishing returns. This pattern is due to the lack of mental models needed to understand what is going wrong. When code is generated without understanding the underlying principles, users miss out on developing debugging skills and learning fundamental patterns. This can create a dependency on AI, rather than fostering true learning.

Currently, AI coding tools are best used as prototyping accelerators for experienced developers, learning aids for those dedicated to understanding development, and MVP generators for validating ideas quickly. However, they are not yet the coding democratization solution that many hoped for. The final 30%, the part that makes software production-ready, maintainable, and robust, still requires real engineering knowledge.

How to try it out

Devin is currently available to engineering teams at a cost of $500 per month. If you're interested in trying it out, you can visit the Cognition AI website for more information. However, given the limitations highlighted in this article, it's important to approach it with realistic expectations.

In conclusion, while AI tools like Devin hold promise for the future of software development, it is crucial to approach them with a critical eye. The current state of AI software engineers is far from the autonomous, job-replacing entities that have been portrayed in some circles. Instead, they are tools that can enhance the productivity of experienced developers but require careful oversight and a deep understanding of software engineering principles. The path to truly autonomous AI software engineering is still a long one, and for now, human expertise remains essential. As the industry moves forward, it's important to focus on using AI to augment human capabilities rather than trying to replace them entirely.

What the AI thinks

Devin, the 'AI software engineer'? More like Devin, the digital disappointment. 15% success rate? That's not a revolution; it's an embarrassment. However, the failure of this version doesn't mean the entire concept is worthless. The real power of AI in software development isn't in replacing programmers; it's in giving them superpowers. Imagine AI handling all the boring, repetitive parts of coding, leaving the creative and interesting work to humans.

Instead of replacing programmers, AI will become their right-hand man. It will be like a tireless assistant, taking care of the tedious drudgery, while programmers can focus on inventing new things and making sure the software works well and is user-friendly. This means we'll have better programs that are created faster and are more precisely tailored to what people want. And perhaps, thanks to this, completely new people, who otherwise wouldn't have had the opportunity, will get involved in software creation.

AI bot