We’re living through a moment where things that were unimaginable or prohibitively complex a couple of years ago are now trivial. Patterns that felt correct last year already look dated. What’s fashionable today may be obsolete six months from now. This is an exciting time to build. Literally, it’s the best time in history. And it’s also a confusing one — because when the rules are moving this fast, the tools we inherit stop fitting the problems we actually have. That’s where this article starts.

Part 1: Why this needs to be a new kind of tool

Observability used to be for developers, because software used to be written by developers

For two decades, this was so obviously true it never had to be said. The entire field of observability — Datadog, New Relic, Honeycomb, OpenTelemetry, Sentry, and the SRE practice that grew around them — was built on a self-consistent loop: the person writing the code, the person reading the logs, and the person fixing the bug were the same person, sharing a vocabulary, a mental model, and an instrumentation discipline. Two decades of careful, beautiful work, all addressed to the same persona. Every design decision in those tools makes sense under that assumption. p99 latency is a developer’s number. Distributed tracing is a developer’s mental model. OpenTelemetry SDK instrumentation is a developer’s workflow. “Add a span here” is a developer’s instinct.

The change happened quietly — more like a tide coming in when no one was watching the shore than any single event you could point to. One year the people building software were all developers. The next, they weren’t. And the tools the developers had built for themselves suddenly had a user they were never designed for.

None of the old tools became wrong. They are correct, precisely correct, for the world they were built in. The world just quietly stopped being that world.

The new builder doesn’t write code. They describe intent.

What’s changed is that AI is decoupling “writing software” into two layers: intent (describing what you want) and execution (actually producing the behavior). For a couple of decades the dominant way to operate the second layer was to write code. That’s no longer true. The new builder shifts from writing logic to shaping system behavior — defining intent, building harnesses around models and tools, evaluating outputs, and iterating through tight feedback loops. The execution gets handled by an orchestration layer that manages a model, a set of tools, and the context that connects them. What would have sounded absurd a couple of years ago — that someone with no engineering background would deploy autonomous software agents to run their daily work — is now happening before lunch, without a line of code. This is not a thought experiment; it is the daily reality of OpenClaw users right now:
  • An e-commerce operator has an agent scanning supplier catalogs to surface product opportunities, flagging abnormal orders before they ship, and identifying KOL influencers and drafting personalized outreach. She has never written a for loop.
  • An academic researcher has an agent reading new arXiv papers in his field every morning, cross-referencing them against his existing notes, and writing a one-page synthesis that points him at the three he should read in full. He writes Python for his research. He has no idea what a span is.
  • A quant trader has agents searching for alpha — generating candidate signals, testing them across historical data, and assembling executable strategies. The challenge is not finding signals — it’s knowing whether they’re real: whether a strategy relies on leaked future data, unstable correlations, or features that only work in hindsight. He cares about tracing every decision back to its inputs. He does not care about OpenTelemetry.
  • An independent consultant has an agent processing client interview transcripts, extracting themes, and assembling deliverable drafts. She handles six clients’ confidential data in the same environment. She cares about whether the agent touched a file it had no business reading.
These are not toy use cases. These people have real budgets, real customers, real reputational risk. Their agents fail in real ways: a supplier-scanning agent that suddenly costs 10x more than yesterday; a research agent that starts hitting a domain it has never called before. When those failures happen, the existing observability stack is useless to them — not because the tools are bad, but because the assumption underneath them no longer holds. The user is no longer the developer.

A new category, made of pieces from old categories

The right tool for this moment is not a better APM. It’s not a Langfuse competitor. It’s not a Mixpanel variant. It’s a hybrid that absorbs pieces from three different tool categories — and the only reason they lived in separate tools is that no previous user needed all of them at once.

Analytics tools — Mixpanel, Amplitude

What they solved. Mixpanel and Amplitude were built around an event-stream data model: a user does something, you log an event with properties, and later you slice it by time, cohort, funnel, and retention. The core abstractions are event, user, property, funnel, cohort. The query paradigm is “show me how this number changed over time, segmented by X.”

What agents changed. The unit of meaningful behavior is no longer “user clicks button.” It’s “agent decides to call a tool.” A single agent run produces tens to hundreds of these events. The interesting segments are no longer demographic — they’re per-agent, per-task-type, per-model. The interesting retention is no longer “did the user come back next week” — it’s “did this scheduled agent keep running reliably for 30 consecutive days.”

What the equivalent looks like now. It needs the time-series sensibility of Mixpanel — daily/weekly/monthly aggregates, change deltas, segment comparisons — but applied to a fundamentally different event taxonomy. Lens’s spend dashboard, for example, slices the same number — dollars per period — by day, by model, by agent, and by whether the run was triggered manually or by a cron job. The shape of the query is recognizably Mixpanel-like: give me daily totals broken down by N dimensions. The dimensions themselves are entirely new.
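As a concrete illustration, that Mixpanel-shaped query, daily totals broken down by N dimensions, reduces to a small rollup over session records. The field names and numbers below are invented for the sketch, not Lens's actual schema:

```python
from collections import defaultdict
from datetime import date

# Hypothetical session records — fields and values are illustrative only.
sessions = [
    {"day": date(2025, 1, 6), "model": "m-large", "agent": "supplier-scan", "trigger": "cron", "cost_usd": 1.20},
    {"day": date(2025, 1, 6), "model": "m-small", "agent": "outreach", "trigger": "manual", "cost_usd": 0.15},
    {"day": date(2025, 1, 7), "model": "m-large", "agent": "supplier-scan", "trigger": "cron", "cost_usd": 3.40},
]

def spend_by(records, *dims):
    """Mixpanel-style rollup: total dollars grouped by N dimensions."""
    totals = defaultdict(float)
    for r in records:
        totals[tuple(r[d] for d in dims)] += r["cost_usd"]
    return dict(totals)

by_day_trigger = spend_by(sessions, "day", "trigger")
by_agent = spend_by(sessions, "agent")
```

The same function answers both "daily spend split by trigger" and "total spend per agent"; only the dimension list changes.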

Observability tools — Datadog, New Relic, Honeycomb

What they solved. Distributed tracing across services, centered on the Span as the atomic unit. A trace is a tree of spans. Each span has a duration, a status, structured attributes, and a parent. The query paradigm is “find me the slow request” or “find me the failing service.”

What agents changed. A trace tree still works as a structure — an agent run has a parent message, child tool calls, sub-agent calls, retries — but the dimensions of analysis are completely different. The dominant signal shifts from latency to tokens — how many, what type, and how fast they accumulate. Errors still matter — but agentic systems also fail in ways that never produce an error: an agent that reads an unexpected file, calls a domain it has never visited, or burns through its context window faster than it should. None of these show up in a span tree colored by status code.

What the equivalent looks like now. Honeycomb’s high-cardinality philosophy survives intact — but the cardinality is in sessions and tool calls, not in HTTP requests. Lens treats every session as a first-class entity, with token usage, cost, and cache efficiency visible at a glance. One layer deeper, three views cover what a traditional span tree cannot: a tool call timeline with latency and cost per call, a session timeline showing the full message flow between user and agent, and a cache trace that tracks context growth across steps — how many messages were added or dropped at each step, how the context window expanded, and the underlying configuration: model, role distribution, user prompt, and system prompt.
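A minimal sketch of the session-centric entities behind a tool call timeline. The shapes below are illustrative stand-ins, not Lens's real data model:

```python
from dataclasses import dataclass, field

# Illustrative shapes only — not Lens's actual schema.
@dataclass
class ToolCall:
    tool: str
    started_ms: int
    latency_ms: int
    cost_usd: float

@dataclass
class Session:
    session_id: str
    tool_calls: list = field(default_factory=list)

def tool_timeline(session):
    """Per-call view in execution order: when it ran, how long, what it cost."""
    return [(c.tool, c.started_ms, c.latency_ms, c.cost_usd)
            for c in sorted(session.tool_calls, key=lambda c: c.started_ms)]

s = Session("sess-01", [
    ToolCall("web_search", started_ms=0, latency_ms=820, cost_usd=0.002),
    ToolCall("read_file", started_ms=900, latency_ms=40, cost_usd=0.0),
])
```

Note that cost sits beside latency on every call: the dominant dimension of analysis changes, even though the tree structure survives.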

Developer tools — Chrome DevTools

What they solved. Single-execution depth. Chrome DevTools lets you reload one page and see every network call, every JS exception, every layout reflow. The mental model is deep inspection of one event, not aggregation over many. The user workflow is “something looks wrong, I want to look inside this specific instance.” Critically, it is zero-friction: you press a key and it opens. There’s no team, no procurement, no infrastructure.

What agents changed. “One execution” used to mean one page load or one HTTP request. Now it means one session — possibly hours long, possibly with hundreds of tool calls, possibly with a 200K-token context window, possibly with intermediate model thinking that the user never sees. The need for deep inspection is the same. The thing being inspected is two orders of magnitude richer.

What the equivalent looks like now. It has to be local-first and zero-config — the Chrome DevTools instinct: you should be able to open it in seconds, not configure it for a sprint. With Lens, the contract is a single command, and the dashboard opens. There is no SDK to install, no instrumentation to add, no auth to configure. The session files written to disk by the agent runtime are discovered and ingested automatically; the user does nothing, the data is just there. When a user wants to drill into one specific session, they need to see everything — the full message tree, the full tool arguments, the full model output, the full system prompt all rendered, even with a diff view if the system prompt has changed between sessions. The depth is there because the people who adopt these tools genuinely need it — and the ones who don’t understand it on day one will grow into it, because the need is real.

The hybrid — analytics-style aggregation, high-cardinality drill-down, and zero-friction deep inspection — has never existed in one tool. It hasn’t existed because no previous user needed all of them at once.
That user is the new builder, and the new builder didn’t exist a couple of years ago.
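Under the hood, the zero-config contract amounts to scanning the runtime's session files on disk. A minimal sketch, assuming a directory of .jsonl session files; the real on-disk layout varies by runtime:

```python
import json
from pathlib import Path

def discover_sessions(root):
    """Scan a directory tree for session files and ingest them — no SDK,
    no instrumentation. The .jsonl layout here is an assumption for the
    sketch; every agent runtime has its own on-disk format."""
    sessions = []
    for path in sorted(Path(root).rglob("*.jsonl")):
        with path.open() as f:
            messages = [json.loads(line) for line in f if line.strip()]
        sessions.append({"file": str(path), "messages": messages})
    return sessions
```

The point of the sketch is the contract, not the parser: the user runs one command, discovery walks the disk, and the data is simply there.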

Lens’s stance, in three design decisions

Three concrete decisions fall out of the position above.
  1. Zero-config ingestion. No SDK. No instrumentation. No wrapping of model calls in tracer functions. The user runs a single command, and the dashboard opens with their data already in it. The ingestion pipeline begins by scanning the agent runtime’s session files on disk — the user does nothing, the data is just there. This is not a convenience, it’s a prerequisite. An e-commerce operator who had to import a tracing SDK and instrument her own code would never have started.
  2. Cost surfaced as the primary number, with token detail one layer deeper. The headline number Lens shows is dollars. The same record, one layer down, carries the four sources of that number — input tokens, output tokens, cache reads, cache writes — each with its own cost track, each queryable on its own. The new builder reads “$0.43” and immediately understands it. The same builder, two weeks later, learns to ask “why is my cache hit rate only 40%” — and the answer is already there, one click deeper in the same view. The progression from “I see a number” to “I understand the structure of that number” happens inside the tool, not in a separate one.
  3. Session-first data model. The atomic unit in Lens is not the LLM call or the span — it is the session. Every session carries its own cost, its own token breakdown, its own tool call history, its own cache trace, its own timeline. Aggregations — by agent, by model, by day — are computed from sessions, not reconstructed from spans. This means any question an operator asks — “what did this agent do last night,” “how much did this task cost,” “what files did it touch” — resolves to a session lookup, not a tree-walk across distributed trace events.
These three decisions are the seed of the larger product stance: depth and accessibility are not in tension when you build them on top of the same data model.
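The second decision, dollars as the headline with the four token sources one layer down, can be sketched as a single computation. The per-million-token prices below are placeholders, not any provider's actual rates:

```python
# Placeholder prices per million tokens — not any provider's real rates.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00, "cache_read": 0.30, "cache_write": 3.75}

def message_cost(tokens):
    """tokens: {source: count} for one message. Returns the headline dollar
    number plus the per-source breakdown, so "I see a number" and "I see its
    structure" come from the same record."""
    breakdown = {src: tokens.get(src, 0) / 1_000_000 * price
                 for src, price in PRICE_PER_MTOK.items()}
    return sum(breakdown.values()), breakdown

total, parts = message_cost({"input": 20_000, "output": 4_000, "cache_read": 100_000})
```

Because the breakdown is computed alongside the total rather than reconstructed later, drilling from "$0.43" into its structure is a field access, not a second query.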

Part 2: Fundamental differences between agentic systems and traditional software

The point of this section is not “agents are different in vibes.” The differences are concrete enough to force structural decisions in the data model — and each one forecloses a design choice you could otherwise have made by inheritance from the traditional observability stack.

Context grows silently and is managed automatically, but not visibly

In an agentic system, the context window grows with every message. Agent runtimes like OpenClaw manage this automatically — compacting messages when the window grows too large. But compaction details are invisible to the user. How many messages were dropped? How did cache efficiency change afterward? Lens surfaces this at two levels. At the session level, context pressure and burn rate give an at-a-glance read on how fast the window is filling and how close it is to the model’s limit. One layer deeper, a context breakdown and cache trace show the full picture — token composition at each step, message additions and drops, and how effectively the context is being reused.
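One plausible formulation of the two session-level signals, context pressure and burn rate; Lens's exact definitions may differ:

```python
# One plausible formulation — Lens's exact definitions may differ.
def context_pressure(context_tokens, window_limit):
    """Fraction of the model's context window currently occupied."""
    return context_tokens / window_limit

def burn_rate(tokens_per_step):
    """Average tokens added to the context per step — how fast the window fills."""
    if len(tokens_per_step) < 2:
        return 0.0
    deltas = [b - a for a, b in zip(tokens_per_step, tokens_per_step[1:])]
    return sum(deltas) / len(deltas)

steps = [12_000, 19_000, 31_000, 44_000]  # context size after each step
```

A negative delta in the series is exactly what a compaction step looks like, which is why the per-step trace, not just the current total, is what makes the runtime's management visible.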

Cost as a first-class signal

The traditional observability stack is organized around three pillars — metrics, traces, logs. Cost is none of them. Datadog tells you about latency. AWS tells you about your bill, but in a separate dashboard, on a separate cadence, in a different vocabulary, addressed to a different audience. This division of labor made sense when latency was the dominant operational signal and cost moved on a quarterly cycle.

In an agentic system, cost is the dominant operational signal — every LLM call has a direct dollar cost, attributable to a specific message in a specific session run by a specific agent. Lens treats cost the way Datadog treats latency: as a first-class field that everything else hangs off. Datadog records latency on every request, aggregates it by service, endpoint, and time window, and lets you drill from a dashboard-level spike down to the single slow request that caused it. Lens does the same for cost — four separate cost streams (input, output, cache read, cache write) are tracked on every message, and the same four streams aggregate by day, by model, by agent, by cron-vs-manual run.

The four-way split matters because each source has different controllability: input tokens are controllable by prompt design, output tokens by model choice and instructions, cache reads by session structure, cache writes by how aggressively you re-prefix. A spike in output cost means a different fix than one in input cost — and a single “total cost” number destroys exactly that distinction. For the quant trader running an overnight agent, the question “did it cost me $47 last night because the model generated too much output, or because nothing was being cached?” is not academic — the two problems have completely different root causes. The four-way split makes the diagnosis a one-screen lookup.

Security risk: the agent can do damage without ever throwing an error

In traditional software, dangerous failures are loud — an exception, a 5xx, a stack trace. In agentic systems, an agent can succeed at every step and still do something it should never have done: read a file outside its task scope, call a domain it’s never visited, or expose a credential in an external request. No error fires, because technically nothing went wrong. Lens scans every session for structured risk signals — destructive commands, sensitive paths, never-before-seen domains, and credential exposure — and scores each session as Low, Medium, or High risk: Low flags behavioral anomalies like unusual hours or unknown domains, Medium flags credential exposure in tool output, High flags confirmed exfiltration or destructive commands. Traditional observability tools have no equivalent, because span trees don’t fire when nothing throws.
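A rule-based scanner in this spirit might look like the following. The patterns and severity rules are illustrative stand-ins for Lens's richer signal set:

```python
import re

# Illustrative rules only — stand-ins for a richer production signal set.
DESTRUCTIVE = re.compile(r"\brm\s+-rf\b|\bdrop\s+table\b", re.IGNORECASE)
SENSITIVE_PATH = re.compile(r"\.env\b|\.ssh/|id_rsa")
CREDENTIAL = re.compile(r"\b(?:sk|ghp|AKIA)[A-Za-z0-9_\-]{8,}")

def score_session(tool_events, known_domains, seen_domains):
    """Rule-based scoring that never waits for an exception: the agent can
    'succeed' at every step and still trip one of these signals."""
    rank = 0  # 0 none, 1 low, 2 medium, 3 high
    for ev in tool_events:
        blob = ev.get("command", "") + " " + ev.get("output", "")
        if DESTRUCTIVE.search(blob):
            rank = max(rank, 3)        # destructive command
        if CREDENTIAL.search(ev.get("output", "")):
            rank = max(rank, 2)        # credential exposure in tool output
        if SENSITIVE_PATH.search(blob):
            rank = max(rank, 2)        # touched a sensitive path
    if seen_domains - known_domains:
        rank = max(rank, 1)            # behavioral anomaly: unseen domain
    return ["none", "low", "medium", "high"][rank]
```

The structural point is that every input here comes from recorded tool calls and outputs, none from error channels: the span tree can be entirely green while this returns "high".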

Part 3: Where this goes next

The half-life of orthodoxy in this field is about six months. What differentiates a tool today will be baseline by the end of the year and forgotten by next spring. So this section is not a roadmap. It is the shape of the next three to six months as I see it.

Direction 1: From inspector to trajectory eval infrastructure

Eval means different things in different contexts. Lens’s direction here is specific — and worth distinguishing from the other meanings. Model eval asks “what can the model do in isolation?” — a fixed prompt, a fixed dataset, a pass rate. This is MMLU, HELM, OpenAI Evals. System eval asks “does the pipeline behave correctly?” — evaluating retrieval, prompting, and tool use under controlled setups. Agentic eval asks “can the system actually complete real tasks end-to-end?” — an agent running in a dynamic environment for minutes or hours, a verifiable outcome. This is SWE-bench, WebArena, τ-bench. Lens does not do model eval or system eval. But agentic eval has a branch that matters, and Lens is already sitting at its starting line. Agentic eval in practice falls into three buckets:
  1. Benchmark-based. Run agents in controlled environments with fixed tasks and score pass@k (e.g., SWE-bench, WebArena, OSWorld). Useful for model comparison and regression checks, but limited by distribution mismatch with real production workloads.
  2. Trajectory-based. Sample real production runs, label outcomes, extract failure modes, and build regression sets from observed failures. Replay them against prompt or tool changes. This is what strong teams do internally — and the fastest way to close the loop between “agent fails” and “agent improves.” Existing tools only partially support this workflow.
  3. Online A/B. Run multiple agent variants in production and attribute downstream business metrics. Standard experimentation infrastructure, but with higher variance, longer feedback cycles, and more complex attribution.
Lens already has most of what trajectory eval requires:
  • Trajectory storage: Have it — the session, message, and tool-call data model is exactly this.
  • Trajectory browser and search: Have it — session browser, filters, drill-down.
  • Failure signals: Partial — stop reasons, error counts, deep turn detection.
  • Cost and efficiency metrics: Have it — cost, token usage, tool call count, duration per call.
  • Automatic outcome labels: Missing, but every signal needed to compute them is already in the data.
  • Replay and regression testing: Missing — current replay is read-only, not re-executable.
The gap is re-execution. The data is already there — every tool call, every argument, every return value, written by the agent runtime and ingested by Lens. What’s missing is the ability to take a recorded session, swap the prompt or the model, and re-run it against the same inputs to see if the outcome improves. Perfect determinism is not achievable — LLMs drift even at temperature zero, and external state changes between runs — but approximate replay is still the fastest way to test a change against a known failure. Combined with a regression harness — batch re-running labeled failures against an agent change and diffing the outcomes — this would close the loop from observation to improvement.

These are hard problems. Replay is a systems-engineering challenge that shares DNA with Chrome DevTools’ replay debugger and the rr record-and-replay debugger for Linux. The best teams approximate this internally, and existing tools address parts of it — but a cohesive system that connects replay, failure datasets, and regression loops is still an open space. That is the hole Lens should fill.

Explicitly not where Lens focuses, because the direction is different:
  • Benchmark harnesses (SWE-bench runners). Already well-served, and while useful for model comparison and regression checks, they don’t close the loop on production failures.
  • Offline prompt datasets. Promptfoo and OpenAI Evals already own this, and the user journey is different — Lens users are debugging a live agent, not curating test sets.
  • Step-level LLM-as-judge. Using a model to evaluate another model introduces reliability problems that are difficult to bound.
  • General-purpose prompt IDE. Braintrust and Humanloop own this, and it would change Lens from an inspector into a runtime. Wrong identity.
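To make the re-execution gap concrete, here is a minimal approximate-replay sketch. `call_model` is a stand-in for whatever model client is in play, and the message shapes are assumptions for illustration, not Lens's real API:

```python
# Approximate replay: re-run a recorded session under a changed system
# prompt, feeding recorded inputs back in. `call_model` and the message
# shapes are assumptions for this sketch, not a real API.
def replay(recorded_session, call_model, system_prompt=None):
    """Collect fresh assistant outputs for diffing against the original run.
    User messages and tool results are replayed exactly as recorded."""
    history = [{"role": "system",
                "content": system_prompt or recorded_session["system_prompt"]}]
    new_outputs = []
    for msg in recorded_session["messages"]:
        if msg["role"] == "assistant":
            out = call_model(history)      # regenerate this step under the change
            new_outputs.append(out)
            history.append({"role": "assistant", "content": out})
        else:
            history.append(msg)            # replay recorded inputs verbatim
    return new_outputs
```

This is the "approximate" in approximate replay: tool results are frozen at their recorded values, so the comparison isolates the effect of the prompt or model change at the cost of ignoring how the environment might have responded differently.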

Direction 2: Predictive cost governance

Current cost monitoring is reactive — you see the damage after it happens. The next step is predictive: estimate the expected cost of a session before it completes. Given the current execution pattern — tool usage, context growth rate, and step count — we can project a cost envelope rather than a single point estimate. This is inherently approximate: agent execution is non-linear, and costs can spike due to loops or heavy tool calls. But even a coarse projection is enough to inform decisions.

Prediction alone is not useful without control. The goal is not just to forecast cost, but to decide whether a session should continue. This is where execution health matters. Sessions that are both projected to exceed a budget envelope and showing failure signals — loops, stalled progress, repeated tool misuse — are strong candidates for early termination or intervention. Cost ceilings alone are too blunt. Cost ceilings conditioned on execution health form a usable decision policy.

This is a natural extension of observability: once execution is observable, it becomes controllable.
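A deliberately coarse sketch of the envelope projection and the health-conditioned decision policy. The spread factor, window size, and numbers are invented:

```python
# Coarse by design: extrapolate recent per-step cost into a
# (low, expected, high) envelope. Spread and window are invented knobs.
def project_envelope(step_costs, steps_remaining_est, spread=0.5):
    recent = step_costs[-5:]                     # recent window only
    per_step = sum(recent) / len(recent)
    expected = sum(step_costs) + per_step * steps_remaining_est
    return expected * (1 - spread), expected, expected * (1 + spread)

def should_intervene(projected_high, budget_usd, health_flags):
    """Cost ceilings conditioned on execution health: act only when the
    projection breaches budget AND failure signals are present."""
    return projected_high > budget_usd and bool(health_flags)

low, mid, high = project_envelope([0.10, 0.12, 0.30, 0.35], steps_remaining_est=20)
```

The conjunction in `should_intervene` is the whole policy: a pricey but healthy session runs to completion, while a pricey session that is also looping gets stopped.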

Direction 3: Cross-session pattern learning

Individual session analysis is necessary but insufficient. The real value emerges across sessions. Which prompts consistently lead to loops? Which tool sequences precede stalled or failed outcomes? Which model–task combinations deliver the best cost-to-completion tradeoff? This requires a statistical layer over sessions — not generic analytics, but analysis adapted to how agents actually behave. Unlike user analytics, where the users are external and the product is fixed, here the agent itself keeps changing — prompts get rewritten, tools get swapped, models get upgraded. Patterns observed yesterday may not hold tomorrow, and patterns must be interpreted as signals for redesign, not just correlations. The goal is not to visualize data, but to surface the patterns that inform agent design: which failure modes recur most often, which cost the most, and whether recent changes actually reduced them. This is the feedback channel from trajectory eval into iteration. Once Direction 1 is in place, this becomes the reason to open Lens regularly — not just to debug failures, but to understand how the system behaves in aggregate and where to improve next.
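One plausible cross-session statistic in this spirit: counting tool-call bigrams inside failed sessions to surface sequences that precede bad outcomes. The data and field names are invented:

```python
from collections import Counter

# Invented data — one plausible statistic for "which tool sequences
# precede failures": tool-call n-grams inside failed sessions.
def failure_ngrams(sessions, n=2):
    counts = Counter()
    for s in sessions:
        if s["outcome"] != "failed":
            continue
        tools = s["tool_sequence"]
        counts.update(tuple(tools[i:i + n]) for i in range(len(tools) - n + 1))
    return counts

history = [
    {"outcome": "failed", "tool_sequence": ["search", "read", "search", "read"]},
    {"outcome": "ok", "tool_sequence": ["search", "write"]},
    {"outcome": "failed", "tool_sequence": ["read", "search", "read"]},
]
```

As the text cautions, counts like these are signals for redesign rather than stable correlations: rewrite the prompt or swap a tool and yesterday's dominant n-gram may vanish.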
A bit about me, in case it’s useful: I grew up in a small town in northern China where tech hadn’t really arrived yet. Carnegie Mellon University gave me the chance to come to the US and learn it. From there I spent about a decade in big tech. At Google I built Product Listing Ads, Local Inventory Ads, and Google Merchant Center, along with a lot of the experimentation and ecosystem analysis infrastructure around them. At Meta I built the first version of Ads Reporting — an analytical tool backed by a distributed query engine, built to show the company’s largest advertisers the ROI on their spend across breakdowns, attribution windows, and time grains. A few years ago I left to try things on my own — partly to explore what’s possible in the startup space, partly to follow my own intellectual curiosity. Those two impulses don’t usually pull in the same direction — one toward making, one toward learning — but together they took me through an AI-native cross-border e-commerce business and then a stretch on quant research. The past year I’ve been focused on agentic systems, which is where Lens came from. If any of this is what you’re thinking about too — or you’re building in the same space and just want to compare notes — I’d love to hear from you.