What I Learned Building This — and What Comes Next

Post 8 in the AI Software House series.

In short: after seven posts of technical depth, this closing post steps back to reflect on the lessons learned, how this project vindicates what Builder.ai was attempting, where the hard problems remain, and what I’m building next.

The question I started with

When Builder.ai went bankrupt in May 2025, I kept asking myself one question: was the idea wrong, or was it just early?

I’d spent three and a half years there. I’d seen the vision up close. I’d also seen where it broke down — the AI generating code that was plausible but not usable, the engineers filling the gaps, the economics not working at scale.

Building ai-dev-team was my answer to that question. And after running it on real projects, I’m confident: the idea was right. It was early.

What the models changed

The fundamental shift between 2022 and 2025 isn’t raw code quality — it’s structured reasoning. The models available at Builder.ai’s peak could produce code that looked right. They couldn’t reliably follow a multi-step process, produce outputs in a consistent format, or reason about what they’d just built.

The models available now can do all three. That’s what makes the pipeline work. It’s not that GPT-4.1 is a better programmer than GPT-3. It’s that GPT-4.1 can act as a Product Manager who produces a PRD in exactly the right format, every time, that an Architect can parse and build on. The consistency is the capability.

That consistency is also what makes iterative review loops viable. You can ask a model to critique a PRD and respond to that critique because the model understands the structure of a PRD and can reason about whether a specific criterion is testable. In 2022, that reasoning was unreliable. Now it isn’t.
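To make "consistent format" concrete, here is a minimal sketch of the kind of strict parsing a downstream agent can only do when the upstream output is reliable. The JSON shape, the `Requirement` fields, and the `parse_prd` name are assumptions for illustration, not the actual ai-dev-team schema.

```python
import json
from dataclasses import dataclass


@dataclass
class Requirement:
    # Hypothetical fields; the real PRD format may differ.
    id: str
    description: str
    acceptance_criteria: list[str]


def parse_prd(raw: str) -> list[Requirement]:
    """Parse a PM agent's JSON output, rejecting anything off-format."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed JSON
    requirements = []
    for item in data["requirements"]:  # raises KeyError if the key is missing
        criteria = item["acceptance_criteria"]
        if not criteria:
            # A requirement with nothing to test against is rejected outright.
            raise ValueError(f"{item['id']}: no acceptance criteria")
        requirements.append(
            Requirement(item["id"], item["description"], list(criteria))
        )
    return requirements


sample = json.dumps({
    "requirements": [{
        "id": "R1",
        "description": "User can reset their password",
        "acceptance_criteria": ["Reset email is sent within 60 seconds"],
    }]
})
prd = parse_prd(sample)
```

A parser this strict is only practical when the model hits the format almost every time; with an unreliable model, most runs would fail at this step before an Architect ever saw the PRD.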

What quality-first actually means

The single most important design decision in this project is running review loops before code is written.

Every automated code generation system I’ve seen — including Builder.ai’s — treats quality as a post-process: generate the output, then validate it. If validation fails, regenerate. The problem with this approach is that bad inputs produce confident wrong outputs. Regenerating doesn’t fix a vague requirement. It just produces a different wrong answer.

Inverting this, by putting quality gates at the PM and Architect stages before any Engineer is invoked, improved every downstream stage. An Architect working from a tight, reviewed PRD makes better decisions. Engineers working from a tight, reviewed design write more coherent code. The Code Reviewer finds fewer issues. The QA engineer writes better tests.

Each stage is a multiplier on the one before it. Getting the inputs right amplifies the effect all the way down.
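The gate described above can be sketched as a small loop: critique the draft, revise it, and refuse to hand anything downstream until the critique comes back clean. This is a simplified illustration, not the project's actual code. `review_gate`, the default of three rounds, and the toy `toy_critique` / `toy_revise` functions are all assumptions; in a real pipeline both callables would be model calls.

```python
from typing import Callable


def review_gate(
    draft: str,
    critique: Callable[[str], list[str]],
    revise: Callable[[str, list[str]], str],
    max_rounds: int = 3,  # assumed default, not taken from the project
) -> str:
    """Return a draft only once the critique reports no issues."""
    for _ in range(max_rounds):
        issues = critique(draft)
        if not issues:
            return draft
        draft = revise(draft, issues)
    remaining = critique(draft)
    if remaining:
        # Fail loudly instead of passing a weak input to the next stage.
        raise RuntimeError(f"unresolved after {max_rounds} rounds: {remaining}")
    return draft


# Toy stand-ins for the reviewer and the author of the draft.
SPECIFICS = {"fast": "under 200 ms", "easy": "completable in three clicks"}


def toy_critique(text: str) -> list[str]:
    # Flag vague words that cannot be turned into a test.
    return [word for word in SPECIFICS if word in text.lower()]


def toy_revise(text: str, issues: list[str]) -> str:
    for word in issues:
        text = text.replace(word, SPECIFICS[word])
    return text


approved = review_gate("The page should be fast", toy_critique, toy_revise)
```

The important design choice is the `RuntimeError`: a draft that still fails review stops the pipeline, so regeneration never gets the chance to produce a different wrong answer from the same vague input.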

The economics of AI models make this practical in a way it wasn’t three years ago. Three rounds of PRD revision used to cost dollars. Now they cost cents. There’s no reason not to iterate.

What’s still hard

I won’t pretend this project has solved automated software development. Three problems remain genuinely difficult:

Integration failures. Engineers working in parallel on separate modules can produce code that doesn’t fit together. The Code Reviewer catches many of these, but not all. Human review before merge is still essential.

Long-running context. For large features that span many files and touch many existing patterns, context window limits are a real constraint. RAG helps, but it’s not a complete solution. The agents sometimes miss relevant context that a human engineer would know to check.

Non-obvious requirements. The clarification Q&A system helps, but it only asks questions the PM thinks to ask. Requirements that are ambiguous in ways the AI doesn’t recognise still produce wrong outputs. Domain knowledge is irreplaceable.

These aren’t reasons not to use the system. They’re reasons to treat the pull request as a first draft — which is the right mental model anyway.

What Builder.ai was really building

Looking back, Builder.ai wasn’t building a software factory. They were building the process for a software factory — the workflow, the roles, the handoffs, the quality gates — at a time when the AI models weren’t good enough to fill those roles reliably.

That’s not a failure of vision. It’s a failure of timing.

The engineers who built Builder.ai’s platform spent years learning where the process breaks down, what questions to ask before coding starts, how to structure handoffs between roles. That knowledge didn’t disappear when the company went bankrupt. It became more valuable as the models improved.

This project is built on those same lessons. Different stack, different models, different economics — but the same underlying insight: software development is a process, and a good process beats raw capability every time.

What comes next

ai-dev-team is functional and I’m using it. But there’s more to build.

Learning from merged PRs. The agents currently don’t learn from what gets accepted or rejected at the merge stage. Adding a feedback loop that reads merged PR history and updates agent preferences would close the last open loop.

Smarter module decomposition. The Architect decides how to split work into modules. That decision drives parallelism and integration complexity. Better heuristics — or letting the Architect reason more explicitly about decomposition trade-offs — would improve output quality.
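One way to see how a decomposition drives parallelism is to group modules into "waves" that can be built at the same time, given which modules depend on which. The sketch below uses Python's standard `graphlib`; the `parallel_waves` name and the example module graph are hypothetical, and this is not how the Architect currently makes the decision.

```python
from graphlib import TopologicalSorter


def parallel_waves(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group modules into waves; every module in a wave can be built in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical decomposition: each module maps to the modules it depends on.
modules = {
    "models": set(),
    "api": {"models"},
    "auth": {"models"},
    "ui": {"api"},
}
waves = parallel_waves(modules)
# waves == [["models"], ["api", "auth"], ["ui"]]
```

Fewer, wider waves mean more Engineers working at once, but every pair of modules in the same wave is a place where independently written code has to fit together later. That is the trade-off a better decomposition heuristic would have to weigh.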

Multi-project learning. The memory system currently operates per-repo. Cross-repo patterns — “this team always uses PostgreSQL”, “this codebase has this testing convention” — aren’t yet surfaced. A shared knowledge layer across projects would make the agents genuinely improve with use.

A final thought

Builder.ai fell too early. Six months later, the model landscape looked completely different.

I’m not trying to rebuild Builder.ai. But I do think the original vision — software development that’s accessible to anyone with an idea — is achievable. Not because AI has replaced engineers, but because AI has made the process of software development automatable in a way it never was before.

This project is my attempt to demonstrate that. It’s open source under the GPLv3, and everything in these posts describes how it actually works.

If you try it, I’d like to know what breaks.

Return to the series introduction for the full index.

Code: github.com/wanleung/ai-dev-team