How We Built Forge with AI - The Commons Blog

April 2026 · 12 min read

The question I get asked most, usually with varying amounts of skepticism, is some version of: "Can you actually build serious software with AI?"

I think this is the wrong question, and people are mostly asking it because the right question is uncomfortable.

The right question is: can an engineer hold a large project together when the AI writes most of the code? Because that is what is actually hard. Generating a function that compiles is easy. Generating ten thousand functions that are internally consistent, follow a single vision, survive refactors, and do not drift away from what the user actually needs - that is a different problem entirely. It is the problem software engineering has always been about.

The skill that matters is not prompting. It is the V-model.

What Forge is

Forge is a desktop IDE for numerical computing - a free alternative to the commercial numerical computing platform that costs around ten thousand dollars a seat. It currently ships 29 toolboxes, 817 functions, and 1,992 validated tests. The core is roughly 50,000 lines of Python. One developer. Three months, give or take (that and 13 years as a scientific programmer).

I want to walk through how this is now possible, because the mechanism matters. The numbers alone are not an argument - you can generate large amounts of code very quickly with AI and produce something unusable. The argument is about the structure around the code.

What the V-model is

The V-model is a development process that pairs every stage of design with a stage of verification. User requirements at the top-left are verified by acceptance tests at the top-right. System requirements are verified by integration tests. Unit designs are verified by unit tests. The shape traces like a V: design flows down the left leg, verification flows up the right. Nothing exists without a requirement above it and a test beside it.

Aerospace and medical devices have been built this way for forty years. Consumer software mostly abandoned it because it is slow, and consumer software generally does not kill anyone when it is wrong.

Why it matters now

AI collapses the time it takes to execute any individual phase. A specification that used to take a senior engineer a week can be drafted in an hour. A unit test suite that used to take days can be generated in minutes. A code module that used to be a sprint is a coffee break.

The cost collapse is not uniform across phases. It is dramatic for the phases AI can own and minimal for the phases a human still has to drive. Whatever phase you keep in human hands becomes the bottleneck.

The tempting response is to cut the slower phases. Skip requirements. Skip tests. Skip design review. This is the shortcut that produces slop at industrial scale. The reason the V-model works is that each phase catches the errors of the phase before it. Removing a stage removes a filter. Removing a filter means the errors ship.

The real solution is to build a structure where AI is autonomously empowered to execute every phase of the V-model - not just the coding phase. Requirements generation, decomposition arguments, test authoring, verification runs, regression scoping, commit hygiene. If the AI cannot produce a defensible artifact at each stage, the stage blocks. If it can, the stage gates forward to the next.
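
To make "the stage blocks" and "the stage gates forward" concrete, here is a minimal sketch in Python. The `Stage` class, `run_pipeline`, and the sample artifacts are hypothetical names invented for illustration; they are not Forge's actual tooling, and a real gate would inspect the artifact rather than just check that one exists.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Stage:
    name: str
    # Returns an artifact (for example a requirements document), or None
    # when the stage could not produce anything defensible.
    produce: Callable[[dict], Optional[str]]


def run_pipeline(stages: list[Stage]) -> tuple[dict, Optional[str]]:
    """Run stages in order and stop at the first one with no artifact.

    Returns (artifacts_so_far, name_of_blocking_stage_or_None).
    """
    artifacts: dict[str, str] = {}
    for stage in stages:
        artifact = stage.produce(artifacts)
        if not artifact:
            return artifacts, stage.name  # the stage blocks
        artifacts[stage.name] = artifact  # the stage gates forward
    return artifacts, None


# Hypothetical run: test authoring fails, so implementation never starts.
stages = [
    Stage("requirements", lambda a: "R147: Bode plot within 2% of Octave"),
    Stage("decomposition", lambda a: "consistency argument for R147"),
    Stage("tests", lambda a: None),
    Stage("implementation", lambda a: "bode.py"),
]

artifacts, blocked_at = run_pipeline(stages)
print(blocked_at)         # tests
print(sorted(artifacts))  # ['decomposition', 'requirements']
```

The point of the shape is that a later stage cannot run on top of a missing artifact from an earlier one.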

The process documents I run Forge under - a project CLAUDE.md, a V-model process specification, a vetted golden user characterization, a per-iteration pre-commit gate - exist so that Claude has the structure to do exactly that. Without them, every loop of work is a one-off negotiation with the model. With them, the loop becomes a property of the repository itself.

What the V-model gives you under agentic AI

Four properties emerge once the structure is in place, and they are the things the V-model was always supposed to provide. The difference is that now they can be enforced semi-autonomously.

Requirements first. Every feature is based on a testable and unambiguous requirement with an R-number. AI cannot write something without knowing what it is for. When a requirement is ambiguous, you will find that it has been implemented in its weakest form.
Formal verification. Every function has a test that checks it against its requirements. In Forge's case the reference is Octave. If eig() returns the wrong eigenvalues, the test catches it. The AI cannot bluff its way past a differential comparison against an authoritative implementation.
Persistent traceability. Any bug can be walked back to a requirement, a design decision, a test, and a commit. When a user reports that butter() returns filter coefficients in a slightly wrong order, I can land on the exact requirement, the test that was supposed to catch it, and the reason it did not. The trace does not need to be reconstructed. It is a property of the structure.
Drift detection. Without V-model structure, AI-generated code drifts. It drifts within a feature, across features, and over time. With the structure, drift surfaces at verification time - usually within minutes of the code being written - and is fixed before it compounds.
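
One way to picture persistent traceability is as a small index that is stored alongside the code, so nothing has to be reconstructed after the fact. The sketch below is an assumption about how such an index could look, not a description of Forge's repository. The `Requirement` class, the `trace` helper, and every identifier in the sample data (including the placeholder R-number and commit hash) are made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Requirement:
    r_number: str
    statement: str
    tests: list[str] = field(default_factory=list)
    commits: list[str] = field(default_factory=list)


def trace(function_name: str, index: dict[str, list[Requirement]]) -> list[str]:
    """Walk a function back to its requirements, tests, and commits."""
    lines: list[str] = []
    for req in index.get(function_name, []):
        lines.append(f"{req.r_number}: {req.statement}")
        lines.extend(f"  test: {t}" for t in req.tests)
        lines.extend(f"  commit: {c}" for c in req.commits)
    return lines


# Placeholder data only: not a real Forge requirement, test, or commit.
index = {
    "butter": [
        Requirement(
            r_number="R-EXAMPLE",
            statement="butter() shall return coefficients in Octave's order",
            tests=["test_butter_coefficient_order"],
            commits=["<commit hash>"],
        )
    ]
}

for line in trace("butter", index):
    print(line)
```

A function with no entry returns an empty trace, which is itself a useful signal: code exists without a requirement above it.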

The key insight: AI is not a replacement for engineering judgment. It is a process-maturity accelerator. It amplifies whatever discipline you already have. If your discipline is weak, AI amplifies that too, and you ship broken software at high speed.

What this looks like in practice

The day-to-day looks less exciting than you might think. There is no magic prompt. There is a loop.

It starts with golden user documentation. Before I write anything, I have to be able to describe the target user in enough detail that Claude can reason about them. For Forge, the golden user is a controls engineer who thinks in transfer functions and state-space models, not object hierarchies. They expect tf() and ss() to behave exactly the way they do in Octave. They do not want to learn Python idioms to do signal processing.

Then comes the loop itself. I call it EXPLORE / REQUIRE / DECOMPOSE / IMPLEMENT / VERIFY / COMMIT. The names are not load-bearing. The shape is.

EXPLORE. I either use Forge the way the golden user would (a little secret here: I am intimately familiar with the golden user - I used to be them), or I let Claude explore on its own. We find gaps - things that are missing, slow, confusing, or wrong - and collect up to ten identifiable ones before moving on to the next stage.
REQUIRE. Each gap becomes a positive, testable requirement with an R-number. R147: Forge shall compute the Bode plot of any proper transfer function within 2% of Octave's values across 0.01 to 1000 rad/s. If I cannot state it positively and testably, the gap is not a requirement yet. It is a feeling.
DECOMPOSE. Break each requirement into sub-requirements. Write test signatures and docstrings before implementing. Write a consistency argument proving the sub-requirements, together, satisfy the parent. This is the step that filters out most of the bad ideas. If I cannot write a consistency argument, I do not understand the requirement.
IMPLEMENT. Claude writes the code. This is the fastest step. It usually takes a few minutes.
VERIFY. Run unit tests headlessly. Run visual integration tests on Windows. Run a targeted regression subset - the subset is scoped to the change. A full regression suite runs before release, not every iteration.
COMMIT. One commit per resolved requirement. The commit message cites the R-number. Then back to EXPLORE from the new baseline.
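
To show what "positive and testable" means for something like R147, here is a minimal sketch of that requirement as an executable check. It is written under loud assumptions: `polyval`, `bode_magnitude`, and `check_r147` are hypothetical helpers rather than Forge functions, only magnitude is checked, and the reference is the closed-form answer for H(s) = 1/(s+1) standing in for values that would come from Octave in the process described above.

```python
import math


def polyval(coeffs, s):
    """Evaluate a polynomial (highest power first) at s using Horner's rule."""
    result = 0j
    for c in coeffs:
        result = result * s + c
    return result


def bode_magnitude(num, den, omega):
    """Magnitude of H(jw) = num(jw) / den(jw) at one frequency in rad/s."""
    s = 1j * omega
    return abs(polyval(num, s) / polyval(den, s))


def check_r147(num, den, reference, rel_tol=0.02, points=200):
    """Return the frequencies where the magnitude is more than rel_tol off."""
    failures = []
    for k in range(points):
        # Log-spaced grid from 0.01 to 1000 rad/s (10**-2 up to 10**3).
        omega = 10 ** (-2 + 5 * k / (points - 1))
        got = bode_magnitude(num, den, omega)
        want = reference(omega)
        if abs(got - want) > rel_tol * abs(want):
            failures.append(omega)
    return failures


def first_order_reference(omega):
    """Closed form |H(jw)| = 1 / sqrt(1 + w^2) for H(s) = 1 / (s + 1)."""
    return 1.0 / math.sqrt(1.0 + omega * omega)


passing = check_r147([1.0], [1.0, 1.0], first_order_reference)
failing = check_r147([1.05], [1.0, 1.0], first_order_reference)  # 5% gain error
print(len(passing), len(failing))  # 0 200
```

The 2% tolerance and the frequency range come straight from the requirement text, which is why an ambiguous requirement cannot be turned into a check like this.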

There is a pre-commit gate of about ten checkboxes: did I actually write tests first, is the consistency argument in the decomposition document, does the commit message reference an R-number, and so on. The gate exists because I will skip steps if there is no gate. Everyone will. This is not a character flaw.
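
Part of a gate like this can be automated. The sketch below is a hypothetical helper, not Forge's actual gate. It assumes R-numbers look like the "R147" example above, and that the checklist arrives as a plain mapping of item to done or not done.

```python
from __future__ import annotations

import re

# Assumed format: the letter R followed by digits, as in "R147".
R_NUMBER = re.compile(r"\bR\d+\b")


def gate(commit_message: str, checklist: dict[str, bool]) -> list[str]:
    """Return the reasons a commit is blocked; an empty list means it passes."""
    problems = [f"unchecked: {item}" for item, done in checklist.items() if not done]
    if not R_NUMBER.search(commit_message):
        problems.append("commit message does not cite an R-number")
    return problems


checklist = {
    "tests written first": True,
    "consistency argument in decomposition document": False,
}

print(gate("Fix Bode phase wrap", checklist))
# ['unchecked: consistency argument in decomposition document',
#  'commit message does not cite an R-number']

print(gate("R147: Bode plot within 2% of Octave", {"tests written first": True}))
# []
```

A script can only confirm that a box was ticked and an R-number was cited. Whether the tests really came first is still a matter of discipline.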

The numbers

I want to be careful here, because numbers without context are the standard way people lie with AI-assisted development. Lines of code is not an argument. Test count is not an argument. What matters is behavioral fidelity.

The 1,992 tests in Forge are almost entirely differential tests against Octave. They run a computation in Forge, run the same computation in Octave, and assert that the results match within numerical tolerance. This is the only test that actually means anything for a numerical engine: did you get the same answer that thirty years of peer-reviewed numerical analysis says you should get.
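
The shape of a differential test is simple enough to sketch. The example below is illustrative only: `differential_check` is a hypothetical harness, `newton_sqrt` stands in for the implementation under test, and Python's `math.sqrt` stands in for the authoritative reference. How Forge actually invokes Octave is not shown here, and the tolerances are assumptions.

```python
import math


def differential_check(candidate, reference, cases, rel_tol=1e-9, abs_tol=1e-12):
    """Run both implementations on every case and return the mismatches."""
    mismatches = []
    for case in cases:
        got, want = candidate(case), reference(case)
        if not math.isclose(got, want, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((case, got, want))
    return mismatches


def newton_sqrt(x, iterations=40):
    """Stand-in for the implementation under test (Newton's method)."""
    if x == 0:
        return 0.0
    guess = x if x >= 1 else 1.0
    for _ in range(iterations):
        guess = 0.5 * (guess + x / guess)
    return guess


cases = [0.0, 0.25, 2.0, 1e6]

# A correct candidate agrees with the reference on every case.
print(differential_check(newton_sqrt, math.sqrt, cases))  # []

# A subtly wrong candidate (0.1% high) is caught on every nonzero case.
slightly_off = lambda x: math.sqrt(x) * 1.001
print(len(differential_check(slightly_off, math.sqrt, cases)))  # 3
```

The test never has to know the right answer itself. It only has to know which implementation is trusted and how close is close enough.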

Once you have that harness, you can start measuring other things honestly. On a recent benchmark, Forge computed eigenvalues about 15x faster than Octave and ran SVD about 12x faster. Those are real numbers on real matrices, not cherry-picked. The speed comes from the underlying libraries (NumPy and SciPy, which call LAPACK on top of a modern BLAS), but the point is that we can quote the numbers because we already proved the answers are right.

The cost structure on the development side is equally stark. Forge was built on a $100/month AI subscription and a VPS. Comparable commercial tools can easily run you $10,000+ per seat and are built and maintained by hundreds of engineers. I am not claiming we are equivalent in feature coverage. We are not. The industry has a forty-year head start and a much larger toolbox catalog. I am claiming that the baseline functionality ninety percent of users actually touch is now deliverable at roughly three orders of magnitude lower cost - and the tools we are using to displace that cost on your balance sheet are only getting better.

See Why Your Software Costs Too Much for the full economic argument. The short version is that the cost of building software has collapsed, and most of the industry has not yet decided what to do about it.

What AI is not good at

If I only talked about the wins, I would be doing the same thing the hype cycle is doing. Let me be precise about the failure modes.

Novel design under genuine uncertainty. Claude is very good at writing the code once the problem is framed. It is not good at framing the problem. Deciding that Forge should use a reactive store for session state, or that the plugin API should be callback-based rather than declarative, or that Octave compatibility takes priority over Python idiom (in this phase) - those are judgment calls. I made them. The AI helped me think through consequences, but it did not originate them. If I had let it, the architecture would have been a statistically-average mush.

Knowing when it is wrong. The AI does not know when it is wrong. It cannot. Its calibration is getting better but is not - and may never be - good enough to be the sole check on its own output. The test harness is what tells you. If your test harness is weak, AI will ship bugs at high speed. If it is strong, most bugs die at the verification step.

Holding the whole system in its head. Context windows keep growing, but a million tokens is still not the same as the kind of whole-system understanding a senior engineer builds over months of working on a codebase, or decades of learning how systems like this should be built. I still have to know where the seams are, which modules are load-bearing, which abstractions are brittle. The AI does not carry that around between sessions. I do.

Refactoring without drift. Large refactors are the single most dangerous thing to do with AI assistance - because your requirements, while they may be anchors, are almost never going to be complete. Code that was correct in one place gets subtly miscopied to another. Semantics shift by a character. Without verification running constantly - ideally on every commit - refactors silently corrupt the codebase. The V-model saves you here too, but you have to actually run the tests. Nobody will give you a medal for the discipline. You just do it anyway.

The arbitrage thesis

Here is the part that matters for anyone building or buying software.

The cost of software development has collapsed by somewhere between 10x and 100x depending on the domain. Prices have not. That gap is not a short-term quirk. It is the result of structural commitments: incumbents have headcount they cannot shed without firing people, investor expectations anchored to pre-collapse cost structures, and pricing that is psychologically anchored to legacy market rates.

They cannot pass the savings through. Not quickly. Not voluntarily. Not while the board expects 30% year-over-year growth.

New infrastructure, built AI-first from day one, can. Forge at $29/year is a deliberately small example of that. It is what happens when you do not carry the overhead and you design for the new cost structure. It is a proof of concept, not a moonshot.

What this means for engineering organizations

This is where I want to be direct, because I think a lot of CTOs and engineering VPs are currently reading their own version of this story and drawing the wrong conclusions.

The pattern, sorted by scale:

Startups have heroes. One person who carries the product, moves fast, ships things. AI makes heroes more productive but also more dangerous - there is less friction, so more gets shipped before anyone notices it is wrong.
Mid-size companies have trusted ICs. Senior engineers who own systems and make judgment calls. AI makes them enormously more productive - genuinely 10x - but only if the company has enough process maturity to verify what they produce. Most mid-size companies do not.
Large regulated companies have humble high-caliber senior ICs who fade into obscurity. The ones who review every change, write the specs, enforce the gates. These are the people AI most resembles in terms of output quality when constrained properly. The large regulated companies already have the processes to absorb AI productively. They will outpace, and mostly already are outpacing, everyone else in the boring parts of software that actually matter.

If you are a senior engineer at a company with strong V-model or equivalent discipline, AI makes you 10x to 100x more productive and the processes to verify your output already exist. You are in the best spot of anyone reading this.

If you are at a company without that discipline, AI is dangerous. It lets you ship unverified code faster. The failure mode is not that the AI writes bugs - it will - it is that nobody catches them because the verification infrastructure was never there.

The question for any engineering org using AI is not "how many engineers can we fire." It is "how do we rapidly mature our processes to match a company ten to a hundred times our size." Whether you want to or not, the leverage AI gives an individual engineer only pays off if you have the institutional muscle of a much larger company to catch the mistakes. Startups are going to find themselves adopting aerospace-style process discipline over the next few years, not because anyone is forcing them, but because without it the AI productivity multiplier turns into a bug-ship multiplier.

AI is not a headcount replacement. It is a process-maturity accelerator. Or a process-maturity exposer, depending on where you start.

Closing

Forge is what happens when you take AI seriously and take engineering rigor seriously at the same time. Either half without the other produces slop. The combination produces something that can reasonably claim to replace a piece of software that took a large company decades to build.

This is not a general-purpose argument about AI. I am arguing that the V-model, which has existed for forty years and is not anybody's idea of exciting, is the piece of engineering discipline that maps almost perfectly onto what AI needs in order to produce professional-grade software. That is a useful thing to know if you are either building with AI or trying to figure out whether to trust software that was.

If you want to see what came out the other end, Forge is free to download. Here is a bit about the person who built it. Here are the numbers.

One senior engineer plus Claude can match the output of a team many times their size. But only if they work inside the processes of a company many times larger than the one they are in. That is the actual lesson of the last two years.
