All writing
AIAgentsInfrastructureWorkflow Design

The Agent Harness Is the Product Now

Anthropic's Managed Agents workshop makes a practical point concrete: real agent work depends on evolving harnesses, durable sessions, event logs, tool boundaries, context, and explicit outcomes.

Max KellyMay 30, 202611 min read
The Agent Harness Is the Product Now

A lot of agent demos still focus on the visible part:

The model reasons. The model calls a tool. The model produces an answer.

That is fine for a demo.

It is weak for production work.

The real question is usually not whether the model can do one impressive thing. It is whether the system around the model can survive the normal mess of work: partial context, long-running tasks, credentials, retries, logs, user refreshes, failed tools, stale memory, deletion policy, and humans who need to understand what happened.

That is why Anthropic's Ship your first Managed Agent workshop caught my attention.

The workshop itself is simple. Isabella He builds an incident response agent for a fake SRE workflow. The agent gets logs, metrics, recent deploys, diffs, local tools, a session, a UI, streaming events, and a way to delete session state.

It is not a flashy example.

That is the point.

The useful part of the workshop is that it makes the harness visible.

The First Five Minutes Matter

The most useful part of the video happens before the coding starts.

Isabella walks through the progression from raw model access, to the Agent SDK, to Managed Agents. That history is the argument.

With a basic messages API, the developer owns the loop. They own context management, compaction, tool execution, retries, hosting, scaling, and whatever state survives between requests.

That was manageable when agents could do less.

It gets harder when the agent is expected to inspect files, call tools, continue a task, recover from errors, and produce something useful without a human steering every step.

The shift to Managed Agents is bigger than "Anthropic hosts it for you."

Anthropic is saying the harness itself is now a platform layer.

That matters because the harness is not static.

The best example in the workshop was a model behavior Isabella called context anxiety: the model started wrapping up tasks early even when there was still room in the context window. Anthropic built mitigations into the harness. Later, the behavior went away with a newer model, and that harness work became obsolete.

That is the part builders should sit with.

If you own the agent harness, you also own the maintenance burden created by model behavior changing underneath you.

The Demo Was Really About Plumbing

The agent in the workshop is asked to debug an incident.

It is given the same rough materials a human engineer would want: logs, metrics, recent deployments, and code diffs. Later, Isabella points out that a real version would probably need runbooks and previous postmortems too.

That detail matters.

Humans do not debug incidents from raw telemetry alone. They use the team's accumulated operational memory: which dashboard matters, which service is noisy, which alert is usually misleading, which deploys are risky to roll back, and who owns the system when things get uncomfortable.

An agent needs that same context.

In the demo, the agent eventually traces the latency spike to database pool exhaustion from a recent code change. Nice.

But the more interesting thing is how many pieces had to exist before that answer mattered.

The workshop had to define the operating shape: an agent, an environment, files, local tools, a session, and streaming events. It also had to expose tool calls, retain history, and make deletion possible.

That is not prompt engineering.

That is product infrastructure.

Brain, Hands, Session

The cleanest frame from the workshop is the split between the agent, the environment, and the session.

The agent is the brain. It defines the model, prompt, tools, and capabilities.

The environment is the hands. It is where tool execution happens.

The session binds them together. It gives one agent a place to work, resources to operate over, and an event log that can be shown to a user or inspected later.

I like this frame because it prevents a common mistake.

People talk about "the agent" as if it is one object. In practice, the useful boundary is more specific.

The model should not be confused with the runtime. The runtime should not be confused with the tool surface. The tool surface should not be confused with the session history. The session history should not be confused with durable business memory.

Once those pieces are separated, the architecture gets easier to reason about.

Credentials can sit behind a boundary. Tool execution can move to a controlled environment. Logs can be inspected without pretending they are just chat history. Session state can be retained or deleted deliberately.

This is also why Anthropic's May 19 update is important. With self-hosted sandboxes, the agent loop can stay on Anthropic's infrastructure while tool execution happens in an environment the customer controls. With MCP tunnels, private MCP servers can be reached without exposing them to the public internet.

That is more than a security feature.

It is a sign of where the category is going.

The workshop also tied this split to latency: Anthropic said decoupling the agent loop from tool execution reduced P95 time to first token by more than 90% in their measurements.

I would not assume every system gets that number.

But the mechanism is important.

If orchestration and execution are glued together, every session can inherit container startup, credential handling, and runtime concerns. If they are separated, the product can respond faster while the work still happens in the right environment.

The product boundary is moving from "which model did you call?" to "where does the work actually run?"

Sessions Should Speak in Events

One workshop line stuck with me: Managed Agents sessions speak in events rather than only responses.

That is a small sentence with a lot inside it.

The old API shape is request in, response out. That is still useful for summarization, extraction, classification, drafting, and plenty of other jobs.

But agent work is messier.

An agent may receive a user message, call a tool, get a tool result, inspect a file, call another tool, stream progress, retry after a failure, and then return an answer. If the browser refreshes halfway through, the user should not lose the task. If the tool fails, someone should be able to see why. If the answer is wrong, the team should be able to inspect the path.

That means the application needs an event ledger.

Not just a transcript.

In the workshop, user messages, tool calls, tool outputs, and agent responses become events appended to the session. The UI can stream them. The console can inspect them. The session can resume.

That is the difference between a chat interface and an operational surface.

The chat interface says:

Here is the final answer.

The operational surface says:

Here is what happened while the agent worked.

That distinction matters as soon as the agent is doing anything important.

Managed Does Not Mean Context Is Solved

Managed infrastructure removes some plumbing.

It does not remove the hard part of deciding what the agent should know.

In the workshop, the tools read from local JSON and log files. For a real incident response agent, those same shapes would point at DataDog, deploy history, GitHub, PagerDuty, Slack, runbooks, and postmortems.

Usually the problem is not "the agent needs more data."

It is:

The agent needs the right operational context at the right moment.

For an incident response agent, I would want the first version to see the current symptom, the affected service and owner, recent deploys, relevant error logs, and metric changes over time.

Then I would want the operational memory around that incident: known runbooks, previous similar incidents, rollback constraints, and escalation rules.

That is a lot less glamorous than "an autonomous SRE."

It is also much closer to something a team might actually use.

The agent does not need to be brave. It needs to be well-situated.

The Tradeoff Is the Control Plane

Anthropic's launch post frames Managed Agents as a way to get production agents running faster by letting the platform handle sandboxing, session persistence, orchestration, scoped permissions, tracing, context management, and error recovery.

I think that is a real value proposition.

I also think it should be named plainly:

You are renting part of the agent control plane.

For many products, that is probably the right trade.

If the value is in the user workflow, domain context, tool design, approval model, and product experience, then hand-rolling every piece of agent runtime infrastructure may be a bad use of time.

But it is still a trade.

Before choosing a managed harness, I would want clear answers about where session state lives and what can be deleted.

I would also want to know which tools execute inside our boundary, which credentials the agent can reach, what audit trail we get, what happens when the platform changes behavior, and how portable the agent is if our needs change.

Those are not abstract enterprise questions.

They decide how safe the system feels once real users, real data, and real workflows are involved.

Start With Diagnosis, Not Autonomy

If I were building the incident response agent from the workshop for a real team, I would not start by letting it fix production.

I would start with diagnosis.

The first version would wake up on an incident and produce a structured brief:

  • current symptom
  • affected service
  • timeline
  • suspected cause
  • evidence used
  • causes ruled out
  • confidence level
  • recommended next action
  • escalation owner
  • links to logs, deploys, diffs, and runbooks

That is useful even if a human still makes the call.

It also gives the team something to evaluate.

Was the suspected cause right? Did it cite the correct logs? Did it miss an obvious deployment? Did it recommend something unsafe? Did it reduce time-to-understanding for the on-call engineer?

Once that works, the agent can take on more.

Maybe it drafts a rollback plan. Maybe it opens a PR. Maybe it writes the incident channel update. Maybe it creates the postmortem shell. Maybe it runs safe diagnostics automatically and asks before touching production.

That ordering matters.

Trust should grow from bounded usefulness.

Advanced Features Should Follow the Friction

Anthropic has also been adding memory, dreaming, outcomes, multiagent orchestration, and webhooks. The May 6 update is worth reading because it shows how quickly the harness layer is becoming more capable.

I would still be careful with the order.

I would not start with the most advanced setup.

I would start with one workflow and one trigger.

Then one agent definition, one narrow tool surface, one known context packet, one session event log, one deletion policy, one human review point, and one rubric for what "good" means.

Then add memory when repeated corrections show up.

Then add outcomes when quality can be described as a rubric. That is the advanced feature I would reach for earlier than full autonomy, because it makes "good" explicit before the agent starts doing more on its own.

Then add subagents when one context window is doing too many different jobs.

Then add webhooks when the agent should respond to events without waiting for a person to click a button.

The harness should grow from real friction.

Otherwise you end up with an impressive stack around a workflow nobody trusts.

The Practical Shift

Managed Agents are easy to describe as hosted agents.

I think that undersells the change.

The better framing is that the harness is becoming product infrastructure.

The platform is starting to own more of the machinery teams used to rebuild: loops, sessions, event logs, sandboxes, tool boundaries, tracing, retries, memory, outcome grading, and multiagent coordination.

That does not mean every team should hand the whole thing to a vendor.

It means the build-versus-buy question has moved down a layer.

The old question was:

Which model should we use?

The better question now is:

What agent infrastructure should we own, and what should we rent?

My default answer is to own the domain context, the workflow, the tool design, the approval boundaries, and the user experience.

Be deliberate about the runtime.

Because once agents start doing real work, the harness is not background plumbing.

It is the system the user is trusting.

Sources