Frame, Decompose, Verify: Real Engineering with AI Agents

A practical method for taking AI coding agents past "smart autocomplete" — into real investigation, refactors, sizable features, and multi-week projects.


Most people get an AI coding agent to write a function or fix a typo, then stall the moment a task needs real thinking — a bug whose cause isn’t obvious, a refactor across a dozen files, building a sizable feature from a design, a project with many interdependent parts. The output stops being trustworthy, or the agent just spins.

A smarter model helps. But the reliable way to make the output better — whatever model you’re running — is a method: how you frame the work, break it down, and verify it. This guide shows what each level of task actually takes, and how to get there.

Difficulty is uncertainty, not size

A one-line fix can be the hardest thing you do all week if finding the line took three days. A change touching hundreds of files can be trivial if every edit is obvious. So before you prompt anything, ask: how much do I not yet know, and what kind of not-knowing is it?

That question sorts almost every task onto a ladder — not of “how much code,” but of “how much must be figured out before the code can be written.”


TL;DR — the principles

If you read nothing else:

  1. Match your effort to the uncertainty, not the diff size. Figure out what kind of task you’re holding before you start typing prompts.
  2. Write the problem down before you write the fix. A short note — cause, evidence, plan — is what keeps hard work from going in circles, and lets you resume days later.
  3. Make the agent ground its fix in the real code, not its guesses. Give it real code search and have it find and explain the cause before changing anything — “fix this, and show me the code that causes it” beats “fix this.”
  4. Reason in claims that can be proven wrong, then go prove them. Demand evidence — a test, a measurement, a spec line — not confident-sounding prose.
  5. Trust the check, not the model. Route every conclusion through something that can fail independently of the agent: a compile, a test, a benchmark, the spec, a reviewer.
  6. Break big work into small steps that each stand on their own. Each step should build and test cleanly; review the plan before any code is written.
  7. Before shipping, have the work challenged from several independent angles — not blessed by one. The author (you or the agent) has blind spots; one reviewer with one lens still misses plenty.
  8. A disproven approach, written down, is a result — not wasted effort. Ruling something out narrows what’s left to try, and a dead end you recorded is one neither you nor the agent re-walks.

The rest of this guide explains how to actually do each of these.


The ladder at a glance

Level | Name | What makes it hard | What it feels like
L1 | Trivial / mechanical | Nothing to figure out; the change is the task | “I already know exactly what to type”
L2 | Localized fix | One clear cause, contained to a file or two | “Once I see it, the fix is obvious”
L3 | Multi-file investigation | Must reason across several files, a reference, or a spec before a small change | “The patch is tiny; finding it was the work”
L4 | Deep debug / design | Hypothesis-driven tracing across subsystems, or a real design decision | “I have to form theories and test them”
L5 | Multi-part program | Too big to hold as one problem; must be split into many sequenced, independently-checked pieces | “This needs a plan, not a single sitting”

You don’t need these labels in daily life. You need the reflex: name the uncertainty first.


The loop every level runs on

From a one-liner to a multi-week program, the work is the same loop. Bigger tasks just nest it — each small step is its own pass through the loop.

   ┌─ FRAME ───────────────────────────────────────────────┐
   │  State the goal, the uncertainty, and the check       │
   │  ("done = this passes").                              │
   └───────────────────────────────────────────────────────┘
                               │
   ┌─ ORIENT ──────────────────────────────────────────────┐
   │  Gather ground truth: read the relevant code, the     │
   │  spec, a profile, prior art. Point the agent at it.   │
   └───────────────────────────────────────────────────────┘
                               │
   ┌─ HYPOTHESIZE ─────────────────────────────────────────┐
   │  Form claims you could prove wrong about cause /      │
   │  approach. Write them down. Predict what you'd see.   │
   └───────────────────────────────────────────────────────┘
                               │
   ┌─ ACT ─────────────────────────────────────────────────┐
   │  Make the smallest change (or experiment) that tests  │
   │  or implements one claim.                             │
   └───────────────────────────────────────────────────────┘
                               │
   ┌─ VERIFY ──────────────────────────────────────────────┐
   │  Run the check. Build, test, benchmark, re-read the   │
   │  spec, review. The verdict is independent of the      │
   │  agent's reasoning.                                   │
   └───────────────────────────────────────────────────────┘
                               │
                  pass ────────┴──────── fail → update the note,
                   │                     drop the dead claim,
              COMMIT &                   loop again
              record (incl. dead ends)

The discipline that makes it work: never let a conclusion skip the VERIFY box — though what verifies it varies. A cause backed only by the model’s say-so is a guess; back it with evidence you can point to — a test, a profile, a spec section, a reproduction. Not everything is unit-testable (races, hardware-specific paths, one-off repros) — when it isn’t, fall back to the strongest check you can run, not to trusting the prose. The levels below are just this loop applied at growing scales of uncertainty.


Before you start: prepare your agent

The framework above assumes your agent has a handful of capabilities. These are roles, not products — fill each with whatever your setup offers, and note that several are probably built in already. For each: check whether you have it, add it if not, and confirm it works. The role matters, not the tool.

You don’t need all of these for every task — each level below says which it calls for. Set them up once and they serve every level.

The levels stack. Each level’s “What you need” is cumulative: it lists only what that level adds on top of the levels below it. So Level 3 assumes you already have everything Levels 1–2 need, and so on.


Level 1 — Trivial / mechanical

What it is. Nothing to investigate. The change is fully specified by the request itself: bump a version, update a generated value, consolidate a known duplicate.

What you need. Almost nothing:

How you drive it. Resist the urge to let “tiny” changes skip your process. Have the agent run it through the same pipeline you’d use for anything else — your normal checks, a clean commit message — and confirm it passed before you accept it. The discipline is the point, not the difficulty.

Examples. Think version bumps, dead-code removal, a typo fix across a few files.


Level 2 — Localized fix

What it is. A real bug, but once you see it the fix lives in one or two files and the reasoning is local. This is the most common kind of fixable bug.

What you need. On top of Level 1:

How you drive it.

  1. Describe the problem precisely, and ask for a reproduction or a failing test first — for example: “Before changing anything, write a test that reproduces this and confirm it fails.”
  2. Tell it to investigate the actual code and report the cause — with the lines that prove it — before it proposes a fix. A prompt that works: “Find the cause and show me the exact code that produces it. Don’t fix it yet.” “Show me the code” beats “fix this.”
  3. Sanity-check that the cause is real (see the “Verify” section below for how), then let it make the change and run the test and the build; confirm both are green, then commit.
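A minimal sketch of step 1, assuming a hypothetical `page_count` helper with an off-by-one bug. The function, its bug, and the test names are invented for illustration and come from no real codebase. The same reproduction test runs against the buggy and the fixed version, so it fails first and goes green only after the change:

```python
import unittest


def page_count_buggy(total_items: int, page_size: int) -> int:
    # Hypothetical bug: adds a page even when the items divide evenly.
    return total_items // page_size + 1


def page_count_fixed(total_items: int, page_size: int) -> int:
    # Ceiling division: a partial page counts, an exact fit adds nothing.
    return (total_items + page_size - 1) // page_size


def make_repro_test(page_count):
    """Build the reproduction test against one implementation."""

    class ReproTest(unittest.TestCase):
        def test_exact_multiple_has_no_extra_page(self):
            self.assertEqual(page_count(10, 5), 2)

        def test_partial_page_still_counts(self):
            self.assertEqual(page_count(11, 5), 3)

    return ReproTest


def passes(page_count) -> bool:
    """Run the reproduction test and return the verdict, not an opinion."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(
        make_repro_test(page_count)
    )
    result = unittest.TestResult()
    suite.run(result)
    return result.wasSuccessful()


if __name__ == "__main__":
    print("before fix:", "pass" if passes(page_count_buggy) else "FAIL (bug reproduced)")
    print("after fix: ", "pass" if passes(page_count_fixed) else "FAIL")
```

A reproduction that is red before the fix and green after it gives you a verdict that does not depend on how convincing the agent's explanation was.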

The skill here is yours: stating the problem precisely, and refusing to accept a fix until the cause behind it is one you’ve checked — not just one that reads convincingly.

Examples (landed in Firefox):


Level 3 — Multi-file investigation

What it is. The change may be small, but you can’t write it until you’ve reasoned across several files, an authoritative definition of correct behavior, or another system’s behavior. The investigation is the hard part; the change itself can be tiny.

What you need. On top of Level 2:

How you drive it.

  1. If it’s a regression, have the agent bisect to the introducing commit so you’re both reasoning about one change, not the whole codebase — “here’s a command that exits 0 when good and 1 when broken; bisect to the commit that introduced it.” If you’re building something new instead, have it map the files the change must touch before writing any of them.
  2. Make it read the relevant code and the source of truth for correct behavior, and write up what it finds — don’t let it answer “is this even a bug?” from memory: “Quote the rule that governs this — spec, doc, or ticket — and tell me whether our behavior actually violates it.”
  3. Require it to state the cause as a claim with the evidence behind it, and review that before you let it write the fix.
  4. Insist on the written note as it goes. That note is what makes the work correct and resumable — for you or for the agent — if it’s put down for a day.
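The bisect command from step 1 can be a small script. The sketch below assumes your project has some build command and some reproduction command; both are placeholders you would supply. It relies on the exit codes `git bisect run` understands: 0 for good, 1 for bad, and 125 for "skip this commit".

```python
import subprocess
import sys

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`


def classify(build_cmd: list[str], check_cmd: list[str]) -> int:
    """Return the bisect verdict for the currently checked-out commit."""
    if subprocess.run(build_cmd).returncode != 0:
        # A commit that does not build cannot be judged either way.
        return SKIP
    # The reproduction decides: exit 0 means the behavior is still good.
    return GOOD if subprocess.run(check_cmd).returncode == 0 else BAD


def main(build_cmd: list[str], check_cmd: list[str]) -> None:
    """Entry point for a bisect script: exit with the verdict."""
    sys.exit(classify(build_cmd, check_cmd))
```

Saved as a hypothetical `bisect_check.py` whose last line calls `main(...)` with your real build and reproduction commands, it could be driven with `git bisect start <bad> <good>` followed by `git bisect run python3 bisect_check.py`.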

What the note should capture. The format doesn’t matter — a scratch file, a doc, whatever you’ll actually keep. What matters is that it records the things people usually leave out:

Keep it short and falsifiable: a note where every line points at evidence is worth more than a long one that just reads well.
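One way to keep a note falsifiable is to make every line carry its evidence and to flag the lines that do not. The sketch below is only an illustration: the `Entry` and `Note` classes, their field names, and the sample entries are all assumptions, not a format the method prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    claim: str
    evidence: str = ""    # e.g. a file:line, a test name, a spec section
    status: str = "open"  # "open", "confirmed", or "dead"


@dataclass
class Note:
    goal: str
    entries: list[Entry] = field(default_factory=list)

    def unsupported(self) -> list[Entry]:
        """Entries that may read well but point at nothing checkable."""
        return [e for e in self.entries if not e.evidence.strip()]

    def dead_ends(self) -> list[str]:
        """Claims already ruled out, so nobody re-tests them."""
        return [e.claim for e in self.entries if e.status == "dead"]


# Hypothetical investigation, invented for illustration.
note = Note(goal="Explain why exports time out")
note.entries.append(
    Entry("Retry loop never backs off", "client.py:88, test_retry_backoff", "confirmed")
)
note.entries.append(
    Entry("Server is rate-limiting us", "access log shows no 429 responses", "dead")
)
note.entries.append(Entry("Probably a DNS issue"))  # no evidence yet

for entry in note.unsupported():
    print("needs evidence:", entry.claim)
```

Anything `unsupported()` returns is a guess until someone attaches a test, a measurement, or a quoted rule to it.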

Examples (landed in Firefox):


Level 4 — Deep debug / design

What it is. Either (a) a bug whose cause is hidden — in a different place than where the symptom shows up, so you can’t find it by reading the obvious file — or (b) a modest amount of code but a genuine design decision with trade-offs. (If the work decomposes into a sequence of interdependent steps, that’s no longer one problem — it’s Level 5.) This is the level most people don’t realize an agent can reach, and where how you drive it matters most.

What you need. On top of Level 3:

How you drive it.

  1. Demand evidence before theories. Have it capture evidence — a log, trace, or profile — and compare against a working case (or a reference implementation): “Capture this from both the broken and the working case and tell me what’s different — don’t theorize yet.”
  2. Push it to phrase findings as claims it could prove wrong, then check them: “Give me your top 2-3 hypotheses, each as ‘if this is the cause, we’d see X’ — then go check the cheapest one first.” Reject conclusions that aren’t tied to something observed.
  3. Require it to record the hypotheses that died in the note, so neither of you re-tests them. When one survives the evidence, that’s your cause.
  4. For a design decision, ask for options, not an answer: “Give me two or three approaches with trade-offs and a recommendation” — then make the call yourself.
  5. Have it reviewed adversarially from more than one angle — point fresh agent instances at the change with prompts like “find why this is wrong or incomplete” and “check this against the spec and for missed edge cases” — and run it across environments. Expect to send the fix back when a broader test surfaces something a local run couldn’t.
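Steps 2 and 3 can be sketched as a small loop: check the cheapest hypothesis first, stop at the first one whose prediction is actually observed, and keep the ones that died. The `Hypothesis` class, the `cost` unit, and the `observe` callbacks are assumptions made for illustration; in practice each `observe` would be a real measurement or experiment.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Hypothesis:
    claim: str                   # "if this is the cause..."
    prediction: str              # "...we'd see X"
    cost: int                    # rough effort to check, in any consistent unit
    observe: Callable[[], bool]  # True if the prediction was actually observed


def investigate(
    hypotheses: list[Hypothesis],
) -> tuple[Optional[Hypothesis], list[str]]:
    """Check hypotheses cheapest first; return the survivor and the dead claims."""
    dead: list[str] = []
    for h in sorted(hypotheses, key=lambda h: h.cost):
        if h.observe():
            return h, dead
        dead.append(h.claim)  # goes in the note so it is not re-tested
    return None, dead
```

If `investigate` returns `None`, every listed theory has been ruled out, which is itself a result worth recording before asking for new hypotheses.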

Examples (landed in Firefox):


Level 5 — Multi-part program

What it is. Work you can’t hold as a single problem. Unlike L4 — one cause to find, or one design decision to make — an L5 task is several interdependent sub-problems that have to be broken apart, sequenced, and kept coherent, each verified on its own. The defining trait is that no single pass solves it; you manage a plan. Note what this is not about: duration. An L5 task often takes longer, but a focused one can be done in a day or two — what makes it L5 is that it needs decomposition, not that it fills a calendar.

It comes in two flavors, approached very differently.

What you need. On top of Level 4 — this is where the machinery of iterating toward a goal matters:

How you drive it.

  1. Make the first thing it produces a plan, not code. If your agent has a planning mode, use it; otherwise just ask: “Before any code, give me a dependency-ordered plan where each step builds and is independently testable.” Then review and edit that plan before approving work.
  2. Sign off on the breakdown — cut steps that are too big, fix the ordering (see the next section for what makes a good step).
  3. Drive it one step at a time, each through the full loop — don’t let it run several steps ahead of your checkpoints. For exploratory work, give it one hypothesis to test per cycle instead of a feature to build.
  4. Keep steering the plan as reality shifts under you; treat it as a living document, not a fixed contract.
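When you review a plan, one mechanical check is whether the steps can be put in dependency order at all. The sketch below uses Python's standard `graphlib` module for that; the step names and the `order_plan` helper are hypothetical, and a real plan would also need each step to build and test on its own, which no sort can verify.

```python
from graphlib import CycleError, TopologicalSorter


def order_plan(steps: dict[str, set[str]]) -> list[str]:
    """Return steps in an order where every dependency comes first.

    `steps` maps each step to the set of steps it depends on.
    Raises ValueError for a dependency that is not itself a step,
    and graphlib.CycleError if the steps depend on each other in a loop.
    """
    unknown = {dep for deps in steps.values() for dep in deps} - set(steps)
    if unknown:
        raise ValueError(f"plan depends on undefined steps: {sorted(unknown)}")
    return list(TopologicalSorter(steps).static_order())


# Hypothetical plan, invented for illustration.
plan = {
    "add schema column": set(),
    "write migration": {"add schema column"},
    "update API handler": {"write migration"},
    "update client": {"update API handler"},
    "add metrics": {"add schema column"},
}

for number, step in enumerate(order_plan(plan), start=1):
    print(number, step)
```

A `CycleError` here usually means two steps should be merged or one of them split, which is exactly the kind of edit to make before approving any work.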

Examples (in Firefox):


Decompose — breaking a task into verifiable steps

This is the Decompose part of the method, and at the higher levels it’s mostly your job as the driver. You usually won’t write the breakdown by hand — you ask the agent for one (“before any code, give me a dependency-ordered plan where each step builds and is independently testable”) and then judge and steer it. So the skill below isn’t “how to decompose by yourself” — it’s what a good breakdown looks like, so you can recognize a bad one and push the agent to fix it. Why it matters: handed a big task whole, the agent has too much to hold at once and no checkpoint to catch a wrong turn; small steps fix both.

What a good step looks like — cut or split any step that misses these:

Moves that produce such steps:

Then refine the agent’s draft against the list above — cut steps that are too big, fix the ordering — and execute one at a time. The plan becomes the living document you both work against, and many agents have a dedicated planning mode for exactly this.


Verify — trust the check, not the model

Across every level, one habit separates trustworthy AI engineering from plausible-looking output. A language model can be confidently wrong. A failing test cannot be talked out of failing. So your job is to route as many conclusions as possible through a check that produces a verdict independent of the agent that did the work.

The tools you reach for split into two kinds, and it’s worth seeing the difference:

Not all checks are equally strong. Rank them by how independent the verdict is from the work:

  1. Deterministic (strongest): it runs without error, the test passes, the type-checker or linter is clean. Binary, reproducible, unarguable. (A compile is the strongest form, where your language has one.)
  2. Empirical: a benchmark, a timing or memory number, a CI run, a check against a staging or production-like environment. Real-world ground truth that needs a little interpretation.
  3. Authority: the spec, doc, or ticket that defines correct behavior, or the actual current source — it overrides the model’s memory about how things should behave.
  4. Adversarial review: independent passes (human, or fresh agent instances prompted to refute) — ideally several, each hunting a different failure mode — that catch confident-but-wrong before it ships.
  5. Human approval: a deliberate stop before anything irreversible or sensitive.

When no check exists yet (the common case in investigation). You often can’t auto-test a cause before you understand it, and “the explanation sounds right” is exactly the trap. Build the strongest gate you can instead of trusting prose:

  1. Make it cite the exact lines that produce the behavior, and read them yourself — not a description of the code, the code.
  2. Make it predict something you can observe if it’s right (“then this log line should show N”), and check that.
  3. Run the smallest experiment that would only work if the cause is real — a one-line change, a forced value, a targeted log.

A cause that survives all three is trustworthy; one that’s only explained well is not. As soon as you understand it, turn it into a real check (a failing test) so it can’t regress.
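The second gate, a prediction you can observe, can be made mechanical. The sketch below assumes the agent predicted how many log lines would match a pattern; the log format, the pattern, and the `check_prediction` helper are invented for illustration.

```python
import re


def check_prediction(log_text: str, pattern: str, expected: int) -> tuple[bool, int]:
    """Count log lines matching `pattern` and compare with the predicted count.

    Returns (prediction_held, actual_count) so a miss shows by how much.
    """
    actual = sum(1 for line in log_text.splitlines() if re.search(pattern, line))
    return actual == expected, actual


# Hypothetical log capture, invented for illustration.
log = """\
INFO  worker=1 retries=0
WARN  worker=2 retries=3
WARN  worker=3 retries=3
INFO  worker=4 retries=0
"""

# Claim under test: "exactly two workers exhaust their retries".
held, actual = check_prediction(log, r"retries=3\b", expected=2)
print("prediction held" if held else f"prediction failed: saw {actual}")
```

The prediction should be written down before the log is read; a count chosen after looking at the data proves nothing.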

The deeper the task, the larger the share of your conclusions that should have to clear one of these gates before you believe them — not merely sound convincing.


How deep work fails

Symptoms to catch yourself in — when deep work is going sideways, it usually feels like one of these:


Over to you

It’s the same method at every level: frame the task by naming its uncertainty and the check that will settle it, decompose anything large into steps you can verify one at a time, and verify every conclusion through something that can fail independently of the agent. Pick one real task and run it that way; do it on a handful more and it becomes muscle memory — and the level of work you can hand an agent stops being limited by the model and starts being limited only by how well you frame, decompose, and verify.

Those are learnable skills, and you already have what you need to start. Good luck — go take on the problems that used to feel too big.