The Tool Was Never the Point: A 12-Step Workflow for AI-Native Projects

I built an AI co-pilot for PCBA test analytics to fill a portfolio gap. The tool became the least interesting thing I made. The real deliverable was the 12-step workflow I used to govern the agents that built it.

Published: June 22, 2026
Author: Hrushiekesh Kanjula Reddy
Read time: ~5 min
Category: essay

#agentic-workflows #ai-native-development #tdd #software-governance #manufacturing

[Image: A disciplined twelve-node loop governing autonomous AI coding agents]

Let me be honest about why I built an AI co-pilot for PCBA test analytics: I needed a portfolio project. I'm a manufacturing engineer — four years on the line, two master's degrees — trying to prove to hiring managers that I can do AI-native work. A flying-probe test-log analyzer sounded like exactly the kind of thing that would prove it. So I built one: a parser, a DuckDB warehouse, yield and SPC analytics, and a retrieval-augmented co-pilot that answers root-cause questions with citations.

Here's the part I didn't expect. A few weeks later, the manufacturing tool is the least interesting thing I made. The interesting thing was the machine I built to build it.

The honest case for not trusting the robot

The argument for supervising AI agents is now empirical, not philosophical. A controlled 2025 study found that experienced developers using AI assistants took 19% longer to finish their tasks, while reporting they felt about 20% faster. Roughly 45% of AI-generated code ships with a known security weakness. Around one in five package suggestions points to a library that doesn't exist, a failure common enough that the attack exploiting it has earned a name: slopsquatting. The throughput is real. So is the entropy.

None of this is an argument against agents. It's an argument against unsupervised ones. An LLM with write access to your codebase is a confident intern holding the master key — fast, tireless, and wrong in ways that only surface two hours into debugging. The fix isn't a better prompt. It's a process that assumes the agent will be confidently wrong, and is built to catch it before it lands.

A 12-step loop, sized to the task

[Image: A staged pipeline with gated stages and task tiers feeding in]

So the real project was a workflow. Every non-trivial change to the co-pilot moved through the same twelve steps: document the requirement, explore the code, plan, write a test-case plan, red-team that plan, clear a decision gate, execute under test-driven development, verify the execution, triple-check it, document, run manual QA, hand off. Three of those steps — the planning, the decision gate, the final triple-check — are mine alone, never delegated. The red-team step is adversarial by design: two or three skeptic subagents are told to refute the plan before a single line of code exists.

What made this livable was tiering. A typo doesn't deserve twelve steps. So every task gets sized first — Trivial, Small, Medium, Large — and runs only the stages that earn their keep. A docs fix skips the red-team. A new retrieval module gets the full gauntlet. It's my own dialect of what the industry spent 2025 converging on under the banner of spec-driven development: the spec and the plan are the things you argue about. The code is just the part the agent types.
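The tiering idea can be sketched in a few lines of Python. The step names follow the twelve steps listed above and the four tiers are the ones named here, but the mapping from step to minimum tier is an assumption made up for illustration. The only constraints taken from the text are that a small docs fix skips the red-team and that a Large task runs everything.

```python
from enum import Enum


class Tier(Enum):
    TRIVIAL = 1
    SMALL = 2
    MEDIUM = 3
    LARGE = 4


# The twelve steps, in the order the essay lists them.
STEPS = [
    "document_requirement", "explore_code", "plan", "test_case_plan",
    "red_team", "decision_gate", "execute_tdd", "verify_execution",
    "triple_check", "document", "manual_qa", "hand_off",
]

# ASSUMPTION: the smallest tier at which each step runs. These thresholds are
# illustrative only; the real tiering guide is not shown in the essay.
MIN_TIER = {
    "document_requirement": Tier.SMALL,
    "explore_code": Tier.SMALL,
    "plan": Tier.SMALL,
    "test_case_plan": Tier.MEDIUM,
    "red_team": Tier.MEDIUM,
    "decision_gate": Tier.MEDIUM,
    "execute_tdd": Tier.TRIVIAL,
    "verify_execution": Tier.TRIVIAL,
    "triple_check": Tier.SMALL,
    "document": Tier.TRIVIAL,
    "manual_qa": Tier.MEDIUM,
    "hand_off": Tier.TRIVIAL,
}


def steps_for(tier: Tier) -> list[str]:
    """Return the steps a task of the given tier runs, in loop order."""
    return [step for step in STEPS if MIN_TIER[step].value <= tier.value]
```

Under these assumed thresholds, `steps_for(Tier.TRIVIAL)` runs only the execute, verify, document, and hand-off steps, while `steps_for(Tier.LARGE)` returns all twelve.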

What the loop actually caught

[Image: Verification isolating a single flawed fragment from clean streams]

Discipline is abstract until it saves you. A few moments from this build made the case better than any framework diagram could.

The red-team step kept killing bad plans before they ran. On the retrieval slice, the skeptics caught that ChromaDB defaults to L2 distance, not cosine similarity — a silent correctness bug that would have quietly degraded every search the co-pilot ever made. It never reached the codebase.
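To see why the metric matters, here is a minimal pure-Python sketch, not the co-pilot's retrieval code. The two-dimensional vectors and document names are invented for illustration. It shows one query ranking the same two documents differently under L2 distance and cosine distance.

```python
import math


def l2_distance(a, b):
    """Euclidean distance: sensitive to vector length as well as direction."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity: depends only on the angle between the vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical toy embeddings, chosen so the two metrics disagree.
query = [1.0, 0.0]
docs = {
    "same_direction_long": [10.0, 0.0],  # identical direction, much longer
    "off_direction_short": [1.0, 1.0],   # 45 degrees off, similar length
}

nearest_l2 = min(docs, key=lambda name: l2_distance(query, docs[name]))
nearest_cos = min(docs, key=lambda name: cosine_distance(query, docs[name]))
```

Here `nearest_l2` is the off-direction document and `nearest_cos` is the same-direction one. The rankings only diverge when embeddings are not unit length, but the distance values differ either way, which matters if anything thresholds on them. In the Chroma versions I am aware of, the metric is chosen per collection at creation time (commonly through a `hnsw:space` metadata entry). Treat that as something to confirm against the documentation for your installed version.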

Then there was the evaluation that hung for two and a half minutes and died with a 404. Every instinct says network timeout, bump the retry. The real cause was that Google had retired the default Gemini model out from under me, and the gRPC client was patiently burning its full ten-minute deadline against a model that no longer existed. An agent left alone would have "fixed" the timeout. The loop's read-the-actual-exception habit found the real problem, and the fix was a one-line model bump.
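The habit generalizes into a small rule: retry only failures that a retry could plausibly fix. The sketch below is an assumption-laden illustration, not the project's client code. `TransientError`, `NotFoundError`, and `call_with_retry` are hypothetical names, and backoff delays are omitted to keep it short.

```python
class TransientError(Exception):
    """Hypothetical: a failure that may succeed on retry (timeout, overload)."""


class NotFoundError(Exception):
    """Hypothetical: the requested resource (such as a model ID) does not exist."""


def call_with_retry(fn, max_attempts=3):
    """Call fn, retrying transient failures but never a not-found error."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return fn()
        except NotFoundError:
            # Retrying cannot bring back a missing resource; surface it at once
            # so the real cause is visible instead of hidden behind a timeout.
            raise
        except TransientError as exc:
            last_error = exc
    raise last_error
```

A wrapper like this fails on the first attempt when the model is gone, rather than spending the whole deadline on a request that can never succeed.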

My favorite was the test that passed alone and failed in a crowd. One parser test ran green in isolation and flaked in the full suite, because a helper wrote to a single fixed file at the repo root, where Windows file locks fought over it. A stress harness reproduced it cleanly: 484 failures in 800 concurrent runs, then zero once the file moved to a per-test path. Agents don't find bugs like that. Patient verification does. (A close runner-up: a night-shift scheduling bug whose wrap-around correction had been written as a literal pass statement, a no-op exposed the instant a test demanded it actually work.)
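The shape of the per-test-path fix can be sketched as follows. This is a simplified stand-in for the real helper and stress harness, with hypothetical names (`write_scratch`, `run_isolated`). Each concurrent worker gets its own scratch file, so no two workers contend for one path.

```python
import tempfile
import threading
from pathlib import Path


def write_scratch(path: Path, payload: str) -> str:
    """Hypothetical helper: write a payload to a file, then read it back."""
    path.write_text(payload)
    return path.read_text()


def run_isolated(worker_count: int) -> int:
    """Run workers concurrently, one scratch file each; return the failure count."""
    failures = []
    lock = threading.Lock()

    with tempfile.TemporaryDirectory() as root:

        def worker(index: int) -> None:
            # A unique path per worker replaces the single shared fixed file.
            path = Path(root) / f"scratch_{index}.txt"
            payload = f"worker-{index}"
            try:
                ok = write_scratch(path, payload) == payload
            except OSError:
                ok = False  # e.g. a file-lock conflict
            if not ok:
                with lock:
                    failures.append(index)

        threads = [threading.Thread(target=worker, args=(i,)) for i in range(worker_count)]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()

    return len(failures)
```

In a pytest suite, the built-in `tmp_path` fixture gives each test a fresh directory, which achieves the same isolation without hand-rolled temp handling.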

The output of all that suspicion is a set of results I trust: 659 passing tests at 97% coverage, a live co-pilot evaluation that has to clear a hard pass bar before anything ships, and continuous integration on every pull request. I believe those figures because of the process that produced them, not in spite of it.

The real deliverable is a template, not a project

[Image: Balance between fast agent output and ordered control]

Here's the move that turned a portfolio piece into something I'll actually reuse. The governance layer — the twelve-step rules, the tiering guide, the subagent charters, the git hooks that block dangerous commands — is not bespoke to a test-log analyzer. So I pulled all of it into a portable starter repo I stamp onto every new project. The co-pilot was its first real stress test. The next project inherits the entire apparatus on day one, before a line of feature code is written.
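As a rough illustration of the command-blocking piece, here is a minimal guard that a hook script could call before letting a command run. The deny-list is an assumption for this sketch. The actual patterns in the starter repo are not shown in this post, and `BLOCKED_PATTERNS` and `is_blocked` are hypothetical names.

```python
import re

# ASSUMPTION: an illustrative deny-list, not the starter repo's real rules.
BLOCKED_PATTERNS = [
    # Force pushes (this also catches --force-with-lease, deliberately strict).
    re.compile(r"\bgit\s+push\b.*(--force\b|\s-f\b)"),
    # Hard resets that discard uncommitted work.
    re.compile(r"\bgit\s+reset\s+--hard\b"),
    # Recursive forced deletes, with the flags in either order.
    re.compile(r"\brm\s+-[a-zA-Z]*r[a-zA-Z]*f|\brm\s+-[a-zA-Z]*f[a-zA-Z]*r"),
]


def is_blocked(command: str) -> bool:
    """Return True if the command matches any deny-list pattern."""
    return any(pattern.search(command) for pattern in BLOCKED_PATTERNS)
```

A hook would exit non-zero when `is_blocked` returns True. Pattern matching like this is a coarse safety net and will miss obfuscated commands, so it complements review rather than replacing it.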

That's the difference between doing AI-assisted work and running an AI-assisted practice. One produces a demo. The other produces a way of working that compounds.

Deloitte's latest survey found that only one in five companies has a mature governance model for autonomous agents. To me that gap isn't a problem — it's the whole opportunity. The flying-probe co-pilot proved I can build the thing. The workflow proves I can do it again, on purpose, without it dissolving into a pile of confident nonsense.

The manufacturing tool will sit quietly in my portfolio. The loop will build the next ten. If you want to see where this is heading — the larger manufacturing-intelligence system this workflow now feeds — that's what I'm building next at the Assembly Hub.
