Field note
Token Economics of Long-Running Agent Loops
The Hidden Cost of Multi-Turn Context Windows
When an autonomous agent runs a loop that spans many turns, the model must keep the entire conversation history in its context window. Each call re‑sends the system prompt and all earlier turns, then adds the new request, any tool output, and the model’s own reasoning. The context sent on each call therefore grows roughly linearly with the number of iterations, and because every call is billed for its full input, the cumulative input tokens over the loop grow roughly quadratically. Because most LLM pricing is token‑based, the cumulative cost of a long‑running loop can dwarf the cost of a single inference. In practice, engineers often overlook this hidden expense until a deployment scales to hundreds of steps per goal.
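To make the growth concrete, the sketch below separates the size of the context on a single call from the input tokens billed over the whole loop. The function names and token counts are illustrative assumptions, and the model ignores output tokens, pruning, and caching.

```python
def context_tokens_at_turn(system_tokens: int, tokens_per_turn: int, turn: int) -> int:
    """Input tokens sent on a given turn (1-indexed) when nothing is pruned."""
    return system_tokens + tokens_per_turn * turn


def cumulative_input_tokens(system_tokens: int, tokens_per_turn: int, turns: int) -> int:
    """Total input tokens billed across all turns of the loop."""
    return sum(
        context_tokens_at_turn(system_tokens, tokens_per_turn, t)
        for t in range(1, turns + 1)
    )


# Hypothetical numbers: a 500-token system prompt and 300 new tokens per turn.
for turns in (1, 10, 100):
    last = context_tokens_at_turn(500, 300, turns)
    total = cumulative_input_tokens(500, 300, turns)
    print(f"{turns:>3} turns: last call {last:>6} tokens, loop total {total:>9} tokens")
```

Under these assumptions the hundredth call carries 30,500 input tokens, but the loop as a whole is billed for about 1.57 million, which is why long loops cost far more than their step count suggests.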
The cost escalation is not only monetary. Larger context windows increase latency, because the model must attend to more tokens before producing a response. For standard transformer architectures, the self‑attention compute needed to process a prompt grows roughly with the square of its length, although optimized serving stacks can soften this in practice. This means that a ten‑turn loop can be several times slower than a single‑turn request, even if the underlying hardware remains unchanged. The slowdown compounds when the agent repeatedly calls external tools, because each tool call and its result are appended to the prompt and must be re‑processed.
A common source of waste is the repeated inclusion of the system prompt on every turn. The system prompt defines the agent’s role, constraints, and toolset. When it is re‑sent unchanged, the model spends token budget on text that does not change. Over a hundred turns, the system prompt may occupy tens of thousands of tokens, inflating the bill without adding new information. Developers can mitigate this by separating static context from dynamic turn data, but many frameworks embed the full prompt on each call for simplicity.
Prompt caching offers a direct way to cut the token cost of static context. The client still sends the system prompt and other invariant documents with each request, but the provider stores the processed form of that prefix and reuses it when a later request begins with the same content. The cached tokens still count as input, but they are billed at a reduced rate and do not have to be processed again from scratch, which lowers the price charged per request.
Cached input tokens cost only a tenth of the price of standard input tokens on Claude 3.5 Sonnet. The Anthropic Pricing Documentation reports a 0.10x cost factor for cached tokens, making caching a financially attractive optimization.
Beyond cost, caching also improves latency because the provider can skip re‑processing the cached prefix. The trade‑off is a modest increase in system complexity; developers must keep the cached prefix exactly stable, decide where the cacheable portion ends, and account for cache expiry. In many agent architectures, the benefits outweigh the overhead, especially when the agent loop exceeds a handful of turns.
Prompt caching allows developers to cache frequently used context, such as system prompts or long documents, significantly reducing latency and costs for subsequent requests. This observation comes from the Anthropic Prompt Caching Documentation.
Anatomy of an Agent Step: Input, Tool Echo, and Reasoning Overhead
An agent step begins with a user or system request that is tokenized and placed into the model’s context window. The request may include a natural‑language prompt, a JSON schema, or a set of instructions for a downstream tool. Once the model receives this input, it decides whether to invoke a tool, generate a direct answer, or both. When a tool is called, the model emits a tool echo, a structured payload that describes the tool name, arguments, and any required metadata. The tool runs, returns its output, and the model receives that output as a new token sequence. The model then performs reasoning overhead: it integrates the tool result, updates its internal state, and produces the final response. Each of these phases consumes tokens, and the total token count determines the monetary cost of the step.
The input phase is often the most predictable part of the cost equation. Tokens are counted directly from the prompt length, and pricing is linear with respect to the number of input tokens. However, the tool echo adds a layer of indirection. Because the model must serialize the tool call into a textual format, the echo can be several dozen tokens even for a simple operation. In practice, agents that repeatedly call the same tool generate a cumulative echo cost that rivals the original input size.
Reasoning overhead is the least visible but potentially most expensive component. After receiving the tool output, the model must re‑process the combined context, which now includes the original input, the tool echo, and the tool result. This expanded context can exceed the context window or the step’s token budget, forcing the orchestration layer to truncate older parts of the conversation. Truncation discards useful history, prompting the model to repeat information or re‑ask clarifying questions, which adds further tokens.
When agents are deployed at scale, the cost of writing to a cache for repeated prompts becomes significant. Write‑to‑cache operations are priced higher than standard input tokens, creating a hidden surcharge that can erode savings from caching.
Write‑to‑cache operations cost more than regular input tokens. The surcharge is 1.25 times the standard input token price, according to the Anthropic Prompt Caching Documentation.
Prompt caching can mitigate this overhead by paying a one‑time fee to store a prompt and reusing it across many requests. The cached prompt is then charged at a lower per‑request rate, which can offset the initial surcharge for high‑volume agents.
The Anthropic API Pricing Page explains that caching incurs a one‑time cost but yields a lower price per subsequent request, making it attractive for repetitive workloads.
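The break-even point follows directly from the two factors quoted above: 1.25x for the request that writes the cache and 0.10x for later reads. The sketch below expresses cost in units of one standard input token; the function names are hypothetical, and it assumes every request after the first hits the cache before it expires.

```python
CACHE_WRITE_FACTOR = 1.25  # surcharge on the request that writes the cache
CACHE_READ_FACTOR = 0.10   # discount on later requests that hit the cache


def static_prefix_cost(prefix_tokens: int, requests: int, cached: bool) -> float:
    """Cost of the static prefix, in units of one standard input token."""
    if requests <= 0:
        return 0.0
    if not cached:
        return float(prefix_tokens * requests)
    write = prefix_tokens * CACHE_WRITE_FACTOR
    reads = prefix_tokens * CACHE_READ_FACTOR * (requests - 1)
    return write + reads


def break_even_requests() -> int:
    """Smallest request count at which caching is cheaper than re-sending."""
    n = 1
    while static_prefix_cost(1000, n, cached=True) >= static_prefix_cost(1000, n, cached=False):
        n += 1
    return n


print(break_even_requests())
print(static_prefix_cost(2000, 100, cached=False))
print(static_prefix_cost(2000, 100, cached=True))
```

With these factors the surcharge is recovered on the second request. For a 2,000-token prefix over 100 requests, the prefix costs the equivalent of 22,300 standard input tokens with caching against 200,000 without. Cache misses caused by expiry or prefix changes would each add another write and reduce the saving.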
A concrete example is the Leviathan terminal agent, which repeatedly queries a knowledge base and invokes a summarization tool. By separating the static system prompt from the dynamic query payload, Leviathan stores the static portion in a cache, reuses it for each turn, and only pays for the variable query tokens. This pattern reduces the per‑step token bill and keeps the agent’s latency predictable, even as the number of turns grows.
The Economic Impact of System Prompt Repetition
In long‑running agent loops the system prompt is often re‑sent on every turn. The prompt contains the high‑level instructions that steer the model, and it typically runs several hundred tokens. When an agent executes dozens or hundreds of steps, the cumulative cost of re‑sending the same prompt can dominate the token bill. Each repetition adds a fixed overhead that does not contribute to new reasoning, yet it is charged at the same rate as novel content. In practice this means that a 200‑token system prompt repeated 100 times adds 20 000 tokens to the bill, regardless of how little the agent actually does in each step.
The effect compounds when the agent also returns full tool call results. Those results occupy the context window and are counted again in every subsequent turn.
As the LangChain Blog post “Agent Memory Management” puts it: “Every tool call result, if returned in full, occupies precious context window space and contributes to the total token count of every future turn in the conversation.” The post discusses how tool output inflates token usage across turns.
Because the system prompt and tool outputs are both part of the context, the size of each request grows linearly with the number of steps, and the total billed over a session grows faster still. This growth is especially problematic for agents that run for extended periods, such as autonomous research assistants or iterative design bots. Most LLM providers charge per token, with rates usually quoted per thousand or per million tokens, so a modest increase in token count per turn can translate into a noticeable rise in operational expense at scale. Engineers often overlook this hidden cost, focusing instead on the model’s inference time or the quality of the generated text.
Mitigation strategies focus on reducing the amount of repeated content. One approach is to cache the system prompt after the first turn and reference it implicitly in later turns, effectively treating it as static background knowledge. Another tactic is to truncate or summarize tool call results before appending them to the conversation. When agents implement output truncation, they can achieve substantial token savings.
Agents that truncate tool output can reduce token usage by 60‑90%, according to Engineering Best Practices: Building Production Agents. This reduction directly lowers the cost per turn in long‑running sessions.
By limiting the repetition of the system prompt and controlling the size of tool outputs, engineers can keep the token bill proportional to the actual reasoning work performed. The economic impact of system prompt repetition is therefore a key consideration when designing scalable, cost‑efficient agent loops.
Prompt Caching: Mechanistic Shift in Cost Dynamics
Prompt caching lets the provider store the processed form of the system prompt and any static tool specifications for reuse. When an agent loop repeats the same prefix across many turns, the model no longer needs to re‑process those tokens from scratch on each inference. The saved work translates into lower compute demand and reduced billing, because providers that support caching charge cached input tokens at a fraction of the standard input rate. In practice, caching turns the static prefix into a small, near‑constant cost per turn, as long as the cached segment remains unchanged; the dynamic part of the context is still billed at the full rate.
Providers do not document every internal detail, but a common way to implement this is to store the attention key and value states computed for the cached prefix. On a later request, the inference engine loads those states and runs the forward pass only over the tokens that follow the prefix: the fresh user input and any new tool output. The prefix still has to be sent and tokenized so that it can be matched against the cache, and the new tokens still attend to the cached positions, so what is saved is the work of recomputing the prefix itself. The saving is most pronounced when the prefix is long and the context window approaches its maximum size, because the attention cost of processing a prompt scales roughly quadratically with the number of tokens.
When the context window saturates, typically beyond 128 k tokens, the latency per output token can increase dramatically.
Latency can rise three to five times when the context window exceeds 128 k tokens. The Google AI Optimization Guide reports a 3x‑5x slowdown in token‑level latency under such conditions, highlighting the importance of keeping the active window lean.
Prompt caching mitigates part of this effect. The cached prefix does not need to be recomputed, so the time spent processing the prompt before the first output token drops. The cached tokens still occupy the context window, and each generated token still attends to them, so caching alone does not keep the active window small; pruning and summarization do that. Used together, they keep per‑token latency closer to baseline even as the total number of turns grows.
The cost benefit is not purely theoretical. In long‑running loops, token usage scales linearly with the number of turns, but the cost per turn often grows quadratically if developers fail to prune historical tool outputs.
The OpenAI Cookbook notes that “Token usage in agentic loops scales linearly with the number of turns, but the cost per turn often grows quadratically if developers fail to prune historical tool outputs.” This observation underscores how unchecked context growth can erode the savings from caching.
By pruning or summarizing older tool outputs and relying on a cached prompt, engineers can preserve the linear scaling of token counts while preventing the quadratic cost explosion.
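A minimal sketch of such pruning, assuming a chat history represented as a list of dictionaries with `role` and `content` keys and tool results stored under the role `"tool"`. That message format, the function name, and the placeholder text are assumptions for illustration, not a specific framework's API.

```python
from typing import Dict, List

Message = Dict[str, str]


def prune_tool_outputs(history: List[Message], keep_last: int = 2,
                       placeholder: str = "[tool output pruned]") -> List[Message]:
    """Return a copy of history with all but the newest tool outputs replaced."""
    tool_positions = [i for i, m in enumerate(history) if m["role"] == "tool"]
    # Everything except the last `keep_last` tool messages is considered stale.
    stale = set(tool_positions[:-keep_last]) if keep_last > 0 else set(tool_positions)
    pruned = []
    for i, message in enumerate(history):
        if i in stale:
            pruned.append({**message, "content": placeholder})
        else:
            pruned.append(dict(message))
    return pruned
```

The original history is left untouched, so the full outputs remain available for logging or later retrieval. Rewriting earlier messages changes the prompt prefix, which may reduce provider-side cache hits for the dynamic part of the conversation, so the trade-off is worth measuring.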
Implementing prompt caching requires a modest amount of infrastructure. A cache hit depends on the exact prompt text and any static tool schemas at the start of the request; any change to that prefix invalidates the cached entry. Systems that generate dynamic system prompts, such as those that embed runtime configuration, should separate the immutable core from the mutable portion, place the immutable core first, and cache only that part. When the cached part changes, the next request writes a new cache entry, and the cost benefit resumes.
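One way to apply this split, sketched here under assumptions, uses the `cache_control` marker described in the Anthropic Prompt Caching Documentation. The function name `build_request`, its parameters, and the example values are hypothetical; the sketch only builds the request arguments and does not call any API.

```python
from typing import Any, Dict, List


def build_request(model: str, static_prompt: str, runtime_config: str,
                  messages: List[Dict[str, Any]], max_tokens: int = 1024) -> Dict[str, Any]:
    """Build keyword arguments for a Messages API call with a cacheable prefix."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": [
            # Immutable core first, marked as the end of the cacheable prefix.
            {
                "type": "text",
                "text": static_prompt,
                "cache_control": {"type": "ephemeral"},
            },
            # Mutable configuration after the breakpoint, so changes here
            # do not alter the cached prefix.
            {"type": "text", "text": runtime_config},
        ],
        "messages": messages,
    }
```

With the official Python SDK, the result would typically be passed as `client.messages.create(**build_request(...))`. Providers may require a minimum prefix length before caching takes effect, so very short system prompts may see no benefit.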
In summary, prompt caching replaces the repeated cost of re‑processing a static prefix with a one‑time write and a small per‑request read charge. The saving is most valuable in agents that run many turns, especially when the static prefix is large. Choose caching whenever the system prompt is stable across turns, and pair it with pruning when the agent’s workload approaches the context window ceiling.
Optimizing Tool Output: Truncation, Summarization, and Schema Control
Long‑running agents often call external tools that return large payloads. Each token that the model must ingest adds directly to the per‑step cost, and the cumulative effect can dominate the budget of a multi‑turn loop. Controlling the size and shape of tool output therefore becomes a primary lever for cost containment. The three techniques most engineers apply are explicit truncation, on‑the‑fly summarization, and schema‑driven filtering. All three share a common mechanism: they reduce the number of tokens that flow back into the language model while preserving the information the agent needs to make a decision.
How truncation works
Truncation is the simplest form of control. After a tool returns a JSON array, the agent slices the array to a fixed length or to a token budget. The slice operation is deterministic, cheap, and easy to audit. The trade‑off is that useful items beyond the cut‑off are silently dropped, which can cause the agent to miss a relevant result.
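import json
import tiktoken

def truncate_results(results, max_tokens=256):
    """Return a JSON string of the leading items that fit within max_tokens."""
    # tiktoken ships OpenAI encodings only, so cl100k_base is an approximation
    # for other models; use the provider's own token counter for exact budgets.
    encoder = tiktoken.get_encoding("cl100k_base")
    truncated = []
    total = 2  # rough allowance for the enclosing "[" and "]"
    for item in results:
        item_str = json.dumps(item)
        # +1 approximates the ", " separator between items
        token_len = len(encoder.encode(item_str)) + 1
        if total + token_len > max_tokens:
            break
        truncated.append(item)
        total += token_len
    return json.dumps(truncated)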
The function counts tokens with tiktoken, an OpenAI tokenizer library, which only approximates the tokenizer of other model families, so the budget is an estimate rather than a guarantee. When the budget is reached, the loop stops, keeping the leading items that fit and dropping everything after them.
Summarization as a fallback
When truncation would discard critical data, a summarizer can condense the payload. The agent sends the raw output to a lightweight summarization model, asks for a bullet‑point list, and then feeds that list back to the main reasoning model. Summarization adds an extra inference step, but the cost is usually lower than ingesting the full payload because the summarizer can be a cheaper model or run on a smaller context window.
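The fallback logic can be sketched as a small wrapper. The summarizer is passed in as a callable so that any cheaper model can be plugged in; `condense_tool_output`, `estimate_tokens`, and the four-characters-per-token heuristic are assumptions for illustration, not a real tokenizer or API.

```python
from typing import Callable


def estimate_tokens(text: str) -> int:
    """Rough token estimate; assumes about four characters per token."""
    return max(1, len(text) // 4)


def condense_tool_output(raw: str, budget_tokens: int,
                         summarize: Callable[[str, int], str]) -> str:
    """Pass small payloads through unchanged; summarize those over budget."""
    if estimate_tokens(raw) <= budget_tokens:
        return raw
    summary = summarize(raw, budget_tokens)
    # Guard against a summarizer that ignores the budget: hard-cut the text.
    if estimate_tokens(summary) > budget_tokens:
        summary = summary[: budget_tokens * 4]
    return summary
```

Skipping the summarizer for payloads already within budget avoids paying for an extra inference when it would save nothing.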
Schema control for deterministic pruning
Schema control goes further by defining a strict output schema for the tool. The tool is instructed to emit only the fields the agent will actually use. For example, a search tool might be asked to return {title, url} and omit the full snippet. By limiting the schema, the tool itself produces a smaller payload, eliminating the need for post‑hoc truncation.
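Where the tool itself cannot be changed, the same contract can be approximated in a thin wrapper that drops every field outside an allow-list. This is a sketch under that assumption; `project_fields` and the sample records are hypothetical.

```python
from typing import Any, Dict, Iterable, List


def project_fields(results: Iterable[Dict[str, Any]],
                   fields: Iterable[str] = ("title", "url")) -> List[Dict[str, Any]]:
    """Keep only the allow-listed fields from each result record."""
    wanted = tuple(fields)
    return [{k: item[k] for k in wanted if k in item} for item in results]


raw = [
    {"title": "A", "url": "https://example.com/a", "snippet": "long text"},
    {"title": "B", "snippet": "no url"},
]
print(project_fields(raw))
```

Records that lack a requested field are kept with whatever allowed fields they do have, so a missing `url` shows up as an absent key rather than an error.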
Failure modes
Truncation can cause silent loss of relevant items; the agent may continue without realizing that the top‑ranked result was cut off. Summarization can hallucinate or omit edge‑case details, especially when the summarizer is a different model. Schema control relies on the tool respecting the contract; a mis‑aligned schema leads to malformed JSON that the agent cannot parse, triggering a retry loop.
Comparison to naïve pass‑through
A naïve pass‑through approach lets the tool return everything, then relies on the language model to ignore excess. This wastes tokens on irrelevant text and can increase latency because the model must process a larger context. Truncation, summarization, and schema control each reduce token consumption at the source, shifting work from the expensive reasoning model to cheaper preprocessing steps.
When to use this tactic
Apply these techniques whenever a tool’s raw output exceeds a few hundred tokens or when the agent’s per‑step budget is tight. Use truncation for simple list‑type results, summarization when the payload contains rich text that must be retained in condensed form, and schema control when you can dictate the exact fields needed. Combining them, schema‑driven output followed by a token‑aware truncation, offers the most predictable cost profile for long‑running agent loops.
Reducing Hidden Retries: Guardrails and Reactive Planning
Long-running agent loops are prone to recursive failure patterns. When a model generates a malformed tool call or receives an unexpected error from an API, the default behavior is often to retry the operation immediately. In a multi-turn conversation, a single retry does not just cost the tokens of the error message; it costs the cumulative sum of the system prompt, the entire conversation history, and the new reasoning steps. If an agent retries three times before succeeding, it pays for the full history four times, so the token expenditure for that step is roughly four times the history length. This creates a hidden tax on agentic workflows that grows as the session persists.
Engineers can mitigate this through a combination of hard guardrails and reactive planning. Guardrails are deterministic validators that sit between the model and the tool execution layer. Instead of sending a raw error back to the model, the system can intercept common failures and provide structured feedback or perform automatic corrections. Reactive planning involves the model explicitly updating its internal state or task list based on the success or failure of a step. This prevents the model from getting stuck in a loop where it tries the same failing tactic repeatedly without changing its approach.
For example, Leviathan utilizes a validation layer to ensure that any financial data requested via tool calls conforms to a specific schema before the model ever sees the result. If the schema fails, the system can prune the error message to its most relevant parts rather than dumping a massive stack trace into the context window.
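class ValidationError(Exception):
    def __init__(self, missing_fields):
        super().__init__(f"missing keys: {missing_fields}")
        self.missing_fields = missing_fields

def validate_schema(result, required=("symbol", "price")):
    # Hypothetical stand-in for the real schema check; keys are placeholders.
    missing = [key for key in required if key not in result]
    if missing:
        raise ValidationError(missing)
    return result

def safe_tool_call(tool_func, args):
    try:
        result = tool_func(**args)
        # Validate result before returning to the LLM
        return validate_schema(result)
    except ValidationError as e:
        # Provide narrow, actionable feedback to save tokens
        return f"Format Error: Missing keys {e.missing_fields}."
    except Exception as e:
        # Avoid sending long, noisy stack traces
        return f"System Error: {str(e)[:50]}..."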
A common failure mode is the hallucination loop. This occurs when a model believes it has successfully executed a tool but the environment state reflects a failure. Without reactive planning, the model may proceed to the next step based on false assumptions, eventually failing and requiring a massive rollback of the agent state. Guardrails act as a circuit breaker in these scenarios. Use this approach when tool outputs are complex or when the cost of a single context window turn exceeds the development overhead of maintaining a separate validation script. This tactic is essential for agents using high-reasoning models where every turn carries a significant financial cost.
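The circuit-breaker idea can be sketched as a small guard that counts failures per distinct tool call and refuses further attempts once a threshold is reached. The class name `RetryGuard`, its methods, and the default threshold are assumptions for illustration.

```python
import json
from collections import Counter
from typing import Any, Dict


class RetryGuard:
    """Stops an agent from repeating the same failing tool call."""

    def __init__(self, max_failures: int = 2) -> None:
        self.max_failures = max_failures
        self._failures: Counter = Counter()

    @staticmethod
    def _signature(tool: str, args: Dict[str, Any]) -> str:
        # Sorted keys so that equal argument dicts map to one signature.
        return f"{tool}:{json.dumps(args, sort_keys=True, default=str)}"

    def record_failure(self, tool: str, args: Dict[str, Any]) -> None:
        self._failures[self._signature(tool, args)] += 1

    def allow(self, tool: str, args: Dict[str, Any]) -> bool:
        """False once this exact call has failed max_failures times."""
        return self._failures[self._signature(tool, args)] < self.max_failures
```

When `allow` returns False, the orchestrator can send the model a short instruction to change its approach instead of paying for another full-context retry of the same call.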
Quantifying Cost per Agentic Goal
When an autonomous agent pursues a goal, each loop iteration consumes tokens for input, tool echo, and internal reasoning. The total cost of a goal is the sum of these per‑step expenses multiplied by the number of steps required to reach the objective. Understanding this relationship lets engineers predict budget impact, set realistic usage limits, and design loops that stay within token caps.
The first component of cost is the input payload. Every user request, system prompt, and retained context contributes tokens that the model must process. In a multi‑turn scenario the context window grows, so later steps pay for earlier messages as well as new data. If an agent needs ten steps and each step adds 150 tokens of new context, the tenth step will see roughly 1,500 tokens of accumulated input. This linear growth dominates cost when the agent does not prune or cache prior turns.
The second component is the tool echo together with the result it produces. After the model emits a call to an external tool, both the call and the tool’s output are fed back into the next prompt. The size of that output directly adds to the token count. Engineers can control this by truncating verbose tool responses, summarizing large data sets, or enforcing a schema that limits unnecessary fields. A well‑designed schema can keep this component under 100 tokens even when the underlying data set contains thousands of rows.
The third component is reasoning overhead. The model’s own chain‑of‑thought, chain‑of‑action, or self‑reflection adds tokens that do not advance the external state but are necessary for safe planning. Empirical runs show that a typical reasoning block occupies 80–120 tokens per step. Reducing the depth of the reasoning chain, or moving some logic to deterministic code, can lower this overhead without sacrificing correctness.
Putting these pieces together yields a simple cost model:
cost_per_goal = Σ_{i=1}^{N} (input_i + echo_i + reasoning_i) * price_per_token
where N is the number of steps required to achieve the goal. Engineers can estimate N by profiling similar tasks, then plug in measured token counts for each component. If the estimated cost exceeds budget, they can intervene by:
- Caching static parts of the system prompt so they are not re‑sent each step.
- Pruning older turns from the context window once they are no longer needed for decision making.
- Constraining tool output size through pagination or selective field inclusion.
- Simplifying reasoning prompts, for example by using a “single‑shot” plan instead of a multi‑step deliberation.
These levers give a predictable way to trade off speed, safety, and expense. In practice, a well‑tuned agent that reaches a goal in six steps, each with 200 input tokens, 80 echo tokens, and 100 reasoning tokens, will consume 6 × 380 = 2,280 tokens. At typical per‑token prices, that translates to a few cents at most per goal, which is acceptable for many production workloads. However, if the same goal requires twenty steps because of noisy tool output, the token count more than triples to 7,600, and at high request volumes the difference can push the cost into a range that may be prohibitive. Engineers should therefore aim to keep the step count low, control payload growth, and enforce tight schemas to maintain cost efficiency.
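The cost model can be sketched in a few lines. The `Step` class and function names are illustrative, and the single `price_per_token` follows the formula's simplification; real pricing usually distinguishes input from output tokens. The step sizes reuse the six-step and twenty-step scenario.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Step:
    input_tokens: int
    echo_tokens: int
    reasoning_tokens: int

    @property
    def total(self) -> int:
        return self.input_tokens + self.echo_tokens + self.reasoning_tokens


def tokens_per_goal(steps: Iterable[Step]) -> int:
    """Sum of input, echo, and reasoning tokens over all steps."""
    return sum(step.total for step in steps)


def cost_per_goal(steps: Iterable[Step], price_per_token: float) -> float:
    """Token total multiplied by a single, simplified per-token price."""
    return tokens_per_goal(steps) * price_per_token


six_steps = [Step(200, 80, 100)] * 6
twenty_steps = [Step(200, 80, 100)] * 20
print(tokens_per_goal(six_steps))     # 6 * 380 = 2280
print(tokens_per_goal(twenty_steps))  # 20 * 380 = 7600
```

Feeding measured per-step counts into `Step` instead of the fixed figures would also capture the context accumulation that a flat per-step number hides.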
Strategic Implementation: Architectural Patterns for Cost-Efficient Agents
When an agent loop runs for many turns, the token bill can dominate the compute bill. Architectural decisions that reduce the number of tokens generated or consumed per turn have a direct impact on operating cost. The following patterns are designed to keep the token footprint low while preserving the flexibility that long‑running agents need.
Hierarchical Planning
Separate the high‑level planner from the low‑level executor. The planner produces a concise plan description that is stored in a short‑lived state object. Each executor step receives only the relevant sub‑goal, not the full history. This reduces the context window for the language model at every turn. In practice the planner can be a stateless function that returns a JSON plan; the executor reads the plan entry, performs its tool call, and writes the result back. Because the planner runs once per goal, the system prompt is reused without repetition.
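A minimal sketch of this split follows. The planner is a stand-in that returns a fixed JSON-style plan where a real system would call a model, and passing only the previous step's result to the executor is an assumption about what counts as relevant context. All names are hypothetical.

```python
import json
from typing import Dict, List


def make_plan(goal: str) -> List[Dict[str, str]]:
    """Stand-in planner: a real system would ask a model for this JSON plan."""
    return [
        {"id": "1", "sub_goal": f"Collect sources for: {goal}"},
        {"id": "2", "sub_goal": f"Summarize findings for: {goal}"},
    ]


def executor_context(plan: List[Dict[str, str]], step_id: str,
                     results: Dict[str, str]) -> str:
    """Build the executor prompt from one plan entry and the previous result only."""
    index = next(i for i, entry in enumerate(plan) if entry["id"] == step_id)
    context = {"sub_goal": plan[index]["sub_goal"]}
    if index > 0:
        previous_id = plan[index - 1]["id"]
        context["previous_result"] = results.get(previous_id, "")
    return json.dumps(context)
```

Because each executor call sees one sub-goal and at most one prior result, its input size stays roughly constant instead of growing with the number of completed steps.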
Shared Prompt and Result Cache
Cache both the system prompt and the outputs of deterministic tools. When a tool call is repeated with identical arguments, the cached result is inserted directly into the agent’s context without running the tool again. A simple in‑memory dictionary keyed by a hash of the tool name and arguments is sufficient for many workloads. The cache also stores the rendered system prompt so that the application does not rebuild it on every turn; keeping the rendered text identical across turns also helps provider‑side prompt caching, which is what actually spares the model from re‑processing it. This pattern turns a repeated tool execution into a constant‑time dictionary lookup.
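import hashlib
import json

prompt_cache = {}
tool_cache = {}

def get_prompt(key):
    # render_system_prompt is assumed to be defined by the application.
    if key not in prompt_cache:
        prompt_cache[key] = render_system_prompt(key)
    return prompt_cache[key]

def call_tool(name, args):
    # Serialize args with sorted keys so equal dicts hash identically.
    payload = json.dumps(args, sort_keys=True, default=str)
    h = hashlib.sha256(f"{name}:{payload}".encode()).hexdigest()
    if h in tool_cache:
        return tool_cache[h]
    # external_tool is assumed to dispatch to the real tool implementation.
    result = external_tool(name, args)
    tool_cache[h] = result
    return result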
Event‑Driven Orchestration
Use an event bus to decouple tool execution from language model inference. The agent publishes a “need‑tool‑output” event; a worker service consumes the event, runs the tool, and returns the result as a new event. The language model only processes the event payload, which can be a short identifier rather than the full tool request. This reduces the number of tokens the model sees and allows parallel processing of independent tool calls.
Adaptive Batching
Group multiple independent tool requests into a single batch when the latency budget permits. The batch is represented by a compact list of identifiers, and the model generates a single response that references each identifier. The downstream worker expands the batch, runs each tool, and stores the results. Batching amortizes the fixed cost of the system prompt across many requests, lowering the per‑request token cost.
Stateless Service Mesh
Deploy the agent as a collection of stateless services behind a mesh that handles retries, rate limiting, and circuit breaking. Stateless services can be scaled horizontally, and the mesh can enforce a maximum token budget per request. When the budget is exceeded, the mesh returns a “budget‑exhausted” signal that the agent interprets as a cue to truncate its reasoning or to request a higher‑level plan revision.
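The budget check itself is simple enough to sketch in process, independent of any particular mesh product. The class names and the wording of the signal are assumptions; in a real deployment the counter would live in the proxy or gateway layer rather than in the agent.

```python
class BudgetExhausted(Exception):
    """Signal that a request has used up its token allowance."""


class TokenBudget:
    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.used = 0

    @property
    def remaining(self) -> int:
        return max(0, self.max_tokens - self.used)

    def charge(self, tokens: int) -> None:
        """Record usage, or raise without recording if it would exceed the cap."""
        if self.used + tokens > self.max_tokens:
            raise BudgetExhausted(
                f"budget-exhausted: {self.used + tokens} > {self.max_tokens}"
            )
        self.used += tokens
```

An agent loop would call `charge` with the estimated size of each request before sending it, and treat `BudgetExhausted` as the cue to truncate its reasoning or ask for a revised plan.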
These patterns are not mutually exclusive; they are often combined to achieve the best cost‑performance trade‑off. Choose hierarchical planning when the goal can be decomposed cleanly, shared caching when tool calls are deterministic, and event‑driven orchestration when latency can be hidden behind asynchronous processing. The right mix will keep token consumption predictable and keep long‑running agents economically viable.