Field note
Agentic engineering patterns that survive contact with production
Field notes on the patterns that hold up when you put coding agents on real work. Context budgets, tool design, planner-executor splits, evaluation loops.
The interesting question about coding agents in 2026 is not whether they work. It is which patterns hold up once you point them at code that has consequences. After roughly eighteen months of running Claude, Codex, and a rotating cast of free-tier models against a real equity research stack at Leviathan, a small set of patterns keeps paying for itself. The rest get pruned within a week.
This note is a field log, not a tutorial. The frame is engineering, not capability. The question I keep asking is: what survives contact with production?
Context is the budget, not the prompt
The mental model that broke first was treating context as a free resource. A 1M token window does not mean you can stream 1M tokens of garbage into the model. It means you have a budget. Every token of tool output, every diff, every retrieved chunk is a withdrawal from a fund that determines how much reasoning the model can do downstream.
The pattern that holds up is structured compression at the tool boundary. When a tool returns 50KB of JSON, do not pass 50KB into the model. Pass a deterministic summary built from the JSON: counts, top-K items, the specific fields the downstream step needs. Keep the raw blob on disk under a handle. If the model needs more, it can ask.
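A minimal sketch of that boundary, assuming the tool output is a list of JSON records. The names `compress_tool_output`, `load_blob`, and `BLOB_DIR` are hypothetical and are not part of Leviathan's stack; the summary shape (count, top-K, selected fields) follows the description above.

```python
import hashlib
import json
import tempfile
from pathlib import Path

# Hypothetical location for raw tool output; any durable store would do.
BLOB_DIR = Path(tempfile.gettempdir()) / "tool_blobs"


def compress_tool_output(records, fields, sort_key, top_k=5):
    """Store the raw records on disk and return a small deterministic summary."""
    raw = json.dumps(records, sort_keys=True)
    # Content-derived handle: the same payload always maps to the same handle.
    handle = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]
    BLOB_DIR.mkdir(parents=True, exist_ok=True)
    (BLOB_DIR / f"{handle}.json").write_text(raw, encoding="utf-8")

    # Rank by the key the downstream step cares about, largest first.
    ranked = sorted(records, key=lambda r: r[sort_key], reverse=True)
    return {
        "handle": handle,
        "count": len(records),
        "top": [{f: r.get(f) for f in fields} for r in ranked[:top_k]],
    }


def load_blob(handle):
    """Fetch the full payload when the model explicitly asks for more."""
    return json.loads((BLOB_DIR / f"{handle}.json").read_text(encoding="utf-8"))
```

Only the returned dictionary goes into the model's context. The handle is what the model passes back if it needs the full payload.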
On Leviathan this looks concrete: every read-heavy bash command goes through RTK, a Rust filter that compresses git status, test output, grep results, and file reads by 60-90 percent before they ever touch the model’s context window.
Token discipline matters more than model size. A 200K window with disciplined compression outperforms a 1M window with raw tool dumps. We measured this on real Claude Code sessions. The 1M model with raw output ran out of useful reasoning before completing the same task the 200K model finished cleanly.
The deeper principle: the model is a function of its inputs. Garbage in, garbage out applies with embarrassing literalness to LLMs. Engineering an agent is mostly engineering the inputs.
Tools are an interface design problem
The second pattern that survived is treating tool definitions like an API design exercise, not an afterthought.
Bad tool design dominates failure modes. The classic symptoms:
- Tools that return too much (the 50KB JSON problem above).
- Tools that take ambiguous parameters and require the model to guess.
- Tools that silently truncate, so the model thinks it has the whole picture when it does not.
- Tools that overlap, so the model has to choose between three roughly equivalent ways to do the same thing.
The fix is to design tools the way you would design a CLI for a sleepy junior engineer at 3am. One job each. Clear parameter names. Honest error messages. Pre-validated inputs. Output capped at a sensible size with explicit pagination if more is needed.
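from __future__ import annotations  # lets the FilingRef annotation stay unresolved here

from datetime import date
from typing import Literal

# Bad: overloaded, ambiguous, dumps raw API response
def search(query: str, options: dict = None) -> dict: ...

# Good: one job, explicit shape, capped output
# (FilingRef is the pipeline's filing-reference type, not shown here)
def search_filings(
    ticker: str,
    form_type: Literal["10-K", "10-Q", "8-K"],
    since: date,
    limit: int = 10,
) -> list[FilingRef]: ...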
The second form is roughly three times more reliable in agent loops in my testing. The reason is not subtle. The model has fewer ways to be wrong.
A useful gut check from Anthropic’s own writeup: if you cannot describe what a tool does in one sentence, the tool is too broad. Split it.
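The "capped output with explicit pagination" rule can be sketched as follows. This is an illustration under assumptions, not the tool layer described above: `Page` and `paginate` are hypothetical names, and the default cap of 50 is arbitrary. The point is that the result always says whether something was left out, and bad inputs fail loudly instead of being silently clamped.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Page:
    items: list
    total: int                  # size of the full result set
    next_offset: Optional[int]  # None means nothing was left out


def paginate(results, offset=0, limit=10, max_limit=50):
    """Return one capped page and state explicitly whether more exists."""
    if offset < 0:
        raise ValueError(f"offset must be >= 0, got {offset}")
    if not 1 <= limit <= max_limit:
        raise ValueError(f"limit must be between 1 and {max_limit}, got {limit}")
    items = list(results[offset:offset + limit])
    end = offset + len(items)
    return Page(
        items=items,
        total=len(results),
        next_offset=end if end < len(results) else None,
    )
```

A model that sees `next_offset=10` and `total=25` knows it has a partial view and knows exactly how to ask for the rest.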
Planner and executor: the simplest split that works
For any task that takes more than a handful of tool calls, the pattern that holds up is planner-executor separation. One model (or one model call) plans. A separate call (or pool of calls) executes the plan step by step.
Concretely, on Leviathan’s research pipeline:
- Planner reads the goal (“explain the moat dynamics in $TICKER”) and writes a sequenced list of subtasks, each with the tool it expects to use.
- Executor(s) run each subtask in isolation. Each one only sees the subtask description plus the artifacts produced by upstream steps. They do not see the global goal.
- Synthesizer stitches the artifacts together into the final output.
This buys you three things:
- The planner uses long-context reasoning once, not repeatedly.
- Executors have small, focused context windows and run cheap.
- You can fan out executors in parallel. Independent subtasks finish in wall-clock time proportional to the slowest one, not the sum.
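The shape of the split, reduced to a sketch. It assumes the subtasks are independent, so it leaves out the passing of upstream artifacts between dependent steps. `Subtask`, `run_plan`, and the three callables are hypothetical stand-ins for the model calls; only the standard-library thread pool is real.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Subtask:
    name: str
    tool: str          # the tool the planner expects this step to use
    description: str


def run_plan(goal, plan, execute, synthesize, max_workers=4):
    """Plan once, fan out independent subtasks, then synthesize the artifacts."""
    subtasks = plan(goal)  # the only call that sees the global goal
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each executor receives only its own subtask; map preserves plan order.
        artifacts = list(pool.map(execute, subtasks))
    return synthesize({s.name: a for s, a in zip(subtasks, artifacts)})
```

Because `execute` never receives `goal`, the isolation is enforced by the function signature rather than by prompt wording.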
The cost is one extra layer of indirection. The benefit is roughly a 5 task, and an 8-14 point lift on standard agentic benchmarks like SWE-bench Verified.
The pattern is closest to the orchestrator-worker decomposition described in Anthropic’s Building Effective Agents post. The variations are mostly about how strict the planner is and how much autonomy each executor has.
The failure mode to watch for is planner overfitting. The planner writes a plan that looks plausible but contains a step that cannot actually be executed (the tool does not exist, the data is not available, the assumption is wrong). The fix is to make executors return structured failures with context, and re-run the planner with the failures included in its input.
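One way to wire that feedback loop, as a sketch. `StepFailure` and `plan_with_feedback` are hypothetical names, the reason codes are illustrative, and the retry cap of three rounds is an assumption rather than a measured setting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepFailure:
    step: str
    reason: str   # e.g. "tool_not_found", "data_unavailable"
    detail: str


def plan_with_feedback(goal, plan, execute, max_rounds=3):
    """Re-run the planner with accumulated failures until a plan executes."""
    failures = []
    for _ in range(max_rounds):
        steps = plan(goal, failures)  # the planner sees every earlier failure
        results = [execute(step) for step in steps]
        new_failures = [r for r in results if isinstance(r, StepFailure)]
        if not new_failures:
            return results
        failures.extend(new_failures)
    raise RuntimeError(f"no executable plan after {max_rounds} rounds: {failures}")
```

The bounded loop matters: a planner that keeps proposing impossible steps should surface as an error, not as an endless replanning cycle.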
Evaluation is the part nobody wants to build
Every agent system I have shipped or watched ship has reached a point where it works on the cases the team thought about and fails on the cases they did not. The only way out is evaluation.
The pattern that holds up: a small, fast evaluation harness that runs on every change. Not a Kaggle-style leaderboard. Not a research benchmark. A handful of canonical tasks, each with a deterministic checker, that you can run in under a minute and that tells you whether the agent got worse.
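The core of such a harness fits in a few lines. The sketch below is not the Leviathan harness; `EvalTask`, `run_harness`, and `regressions` are hypothetical names, and the agent is assumed to be a plain callable from prompt to output text.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalTask:
    name: str
    prompt: str
    check: Callable[[str], bool]  # deterministic: same output, same verdict


def run_harness(agent, tasks):
    """Run every task once and record pass/fail per task name."""
    return {task.name: bool(task.check(agent(task.prompt))) for task in tasks}


def regressions(baseline, current):
    """Names of tasks that passed in the baseline run but fail now."""
    return sorted(
        name for name, passed in baseline.items()
        if passed and not current.get(name, False)
    )
```

Comparing against a stored baseline, rather than against an absolute pass rate, is what answers the question "did the agent get worse?"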
For Leviathan’s equity research agent, the harness has four task shapes:
- Forensic accounting: given a ticker with known fraud signals, does the agent surface the right red flags?
- DCF reconstruction: given a public filing, does the agent produce a DCF whose intrinsic value falls within a tolerance band of the analyst consensus?
- Peer selection: given a target ticker, does the agent pick a peer set that overlaps with a curated human-picked set by at least 60 percent?
- Citation discipline: does every quantitative claim in the output trace back to a retrieved source?
That last one matters. Models hallucinate citations. Without an automated check, this rot accumulates silently until someone notices in production.
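Two of these checkers are simple enough to sketch. Both rest on assumptions that are not from the harness itself: overlap is measured as the fraction of the curated set that the agent recovered, citations are assumed to appear inline as `[src:ID]`, and a "quantitative claim" is approximated as any sentence containing a digit.

```python
import re

CITATION = re.compile(r"\[src:([A-Za-z0-9_-]+)\]")  # assumed citation syntax


def peer_overlap_ok(picked, curated, threshold=0.6):
    """True if the agent recovered at least `threshold` of the curated peers."""
    curated = set(curated)
    if not curated:
        raise ValueError("curated peer set must not be empty")
    return len(curated & set(picked)) / len(curated) >= threshold


def uncited_claims(text, retrieved_ids):
    """Return sentences with a number but no citation to a retrieved source."""
    bad = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        # Ignore digits that only appear inside the citation tag itself.
        if not re.search(r"\d", CITATION.sub("", sentence)):
            continue
        cited = CITATION.findall(sentence)
        if not cited or any(c not in retrieved_ids for c in cited):
            bad.append(sentence)
    return bad
```

A citation to an ID that was never retrieved counts as a failure, which is the hallucinated-citation case.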
The harness is roughly 800 lines of Python. It is the most valuable 800 lines in the codebase.
What gets pruned
A few patterns that looked promising in 2025 did not hold up:
- Long autonomous loops without checkpointing. Anything over about ten tool calls without human or evaluator intervention drifts. The fix is checkpointing every few steps and asking “is the next action still on the path to the goal?”
- Self-correction loops that rely only on the model’s judgment. The model is bad at noticing its own mistakes in the same context window where it made them. Self-critique works much better when the critic is a fresh context.
- Memory systems that try to remember everything. Most “agent memory” implementations end up as expensive vector databases that surface stale information. Persistent files with explicit naming and explicit invalidation are usually better.
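The checkpointing fix from the first bullet, as a sketch. `run_with_checkpoints` is a hypothetical helper; `on_track` stands for whatever evaluator sits outside the agent's own context (a human, a deterministic check, or a fresh-context critic), and the interval of three steps is an assumption.

```python
def run_with_checkpoints(step, on_track, max_steps=30, checkpoint_every=3):
    """Run `step` repeatedly, pausing every few steps for an external check.

    `step(history)` returns the next action result, or None when finished.
    `on_track(history)` answers: is the run still on the path to the goal?
    """
    history = []
    for i in range(1, max_steps + 1):
        result = step(history)
        if result is None:
            return "done", history
        history.append(result)
        # Stop at the checkpoint instead of letting the loop drift further.
        if i % checkpoint_every == 0 and not on_track(history):
            return "halted", history
    return "budget_exhausted", history
```

Returning the history alongside the status means a halted run can be inspected or resumed rather than restarted from scratch.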
The throughline across the patterns that survive: they treat the model as a reasoning component embedded in a system, not a magic oracle. The engineering work is on the system. The model is one component. A good one, but one.
Where this is going
The trajectory I am betting on, after watching Claude 4.6 then 4.7 then GPT-5.5 land in quick succession: the models keep getting better at planning and at calling tools, but the gap between “demo agent” and “production agent” stays wide. That gap is mostly engineering. Context discipline, tool interface design, planner-executor decomposition, and evaluation harnesses are the load-bearing pieces.
The teams that win in the next two years are not the ones with the biggest models. They are the ones that have built the infrastructure to use those models with discipline.
That is the bet, anyway. We will see how it ages.