Where the Human Should Actually Sit

The most important design decision in most AI systems is not what the agent does. It is exactly where, when, and how human judgment enters the workflow.

Max Kelly · May 28, 2026 · 7 min read

Most conversations about AI autonomy are framed as a simple question: how much can we take the human out of the loop?

That framing is usually a mistake.

The better question is rarely "how autonomous can this be?" It is "where should the human actually sit in this system, and what should they be doing there?"

Get that placement wrong and you either destroy most of the leverage (by having humans re-do work the agent could have handled) or you create expensive, silent quality problems (by letting the agent operate in places where its failures are costly and hard to catch).

The teams that get this right treat the location and nature of human judgment as a first-class design decision, not an afterthought or a failure of the technology.

The common failure patterns are predictable

I've seen the same mistakes repeated across different companies, tools, and workflow types.

Too much human in the wrong place. A founder reviews every single email draft the agent produces, even the low-stakes ones. The time savings evaporate. The agent never gets the chance to improve because every output is manually corrected instead of being used as a training signal. The human is doing low-value verification work instead of high-value judgment work.

Too little human where it matters. An agent is given authority to update a source of truth or trigger external actions with only light or no review. Small errors compound. When something goes wrong, it is often hard to reconstruct what happened and why. Trust collapses, and the team ends up adding back more human oversight than they would have needed if they'd designed the review points more carefully from the start.

Review at the wrong layer. The human is asked to review low-level outputs (individual email drafts, individual research notes) instead of higher-level outcomes (the quality of the overall pipeline, whether the right exceptions are being surfaced, whether the system is still aligned with current priorities). This creates review fatigue without actually protecting the things that matter most.

No feedback loop from the judgment. The human catches errors, but the corrections never make it back into the context, the instructions, or the review criteria. The same class of mistake keeps happening. The human becomes a permanent quality control step instead of a source of system improvement.

These patterns are not usually the result of bad intentions. They are the result of not treating judgment placement as something that needs to be designed on purpose.

A practical way to decide where judgment belongs

I use a simple set of filters when designing these systems. I call them the Five Judgment Filters:

Error Cost: What is the real downside if this goes wrong?
Reversibility: How hard is it to undo the agent's action?
Review Load: How much human time would a review actually take at this volume?
Definition of Done: What does "good enough to ship without review" look like in concrete terms?
Learning Value: What would a human actually learn from reviewing this, and how would that improve the system going forward?

High-volume, low-stakes, easily reversible work (tagging, initial research synthesis, first-pass drafting of internal notes) can often run with minimal or no human review once the context and instructions are solid.

Higher-stakes or less reversible work (anything that touches money, external communications in the company's voice, updates to core systems of record) usually needs at least a lightweight human checkpoint, even if the agent does the heavy lifting.

The interesting design work is usually in the middle: work that is important enough to care about but not so risky that every instance needs a human. This is where you can often get leverage by having the agent flag only the exceptions or the low-confidence cases, and letting the human focus there.
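To make that concrete, here is a minimal sketch of how the filters might turn into routing logic. It is an illustration, not a recommended implementation: the `Task` fields, the 1-to-5 cost scale, and the thresholds are all assumptions I made up for the example, and it only models Error Cost and Reversibility plus an agent-reported confidence score. Review Load, Definition of Done, and Learning Value are judgment calls that would shape how you set those thresholds.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO = "run without review"
    FLAG_FOR_HUMAN = "surface this instance to a human"
    CHECKPOINT = "human checkpoint on every instance"


@dataclass
class Task:
    name: str
    error_cost: int    # hypothetical scale: 1 (trivial) to 5 (severe)
    reversible: bool   # can the agent's action be undone cheaply?
    confidence: float  # agent-reported confidence, 0.0 to 1.0 (assumed to exist)


def route(task: Task, low_cost: int = 2, high_cost: int = 4,
          confidence_floor: float = 0.8) -> Route:
    """Decide where human judgment enters for a single task."""
    # High-stakes or hard-to-undo work always gets a human checkpoint.
    if task.error_cost >= high_cost or not task.reversible:
        return Route.CHECKPOINT
    # Low-stakes, reversible work runs on its own.
    if task.error_cost <= low_cost:
        return Route.AUTO
    # The middle band: only low-confidence cases reach the human.
    if task.confidence < confidence_floor:
        return Route.FLAG_FOR_HUMAN
    return Route.AUTO


print(route(Task("tag inbound email", 1, True, 0.60)))
print(route(Task("send invoice", 5, False, 0.99)))
print(route(Task("draft client summary", 3, True, 0.55)))
```

The thresholds are the part worth arguing about. Moving `confidence_floor` up or down is effectively a decision about how much review load you are willing to pay for in the middle band.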

Good judgment layers improve the system over time

The best-designed review points do more than catch errors. They feed the system.

When a human consistently corrects the same class of mistake, that is a signal that something in the context, instructions, or workflow definition needs to change. The teams that treat review outputs as data (rather than just "fixed it again") are the ones whose systems get sharper instead of just staying stable.
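Treating review outputs as data can be as simple as tallying corrections by category. The sketch below assumes each correction is logged as a dict with a `category` field; that log format, the function name, and the threshold are hypothetical choices for illustration.

```python
from collections import Counter


def recurring_corrections(corrections, threshold=3):
    """Return (category, count) pairs seen at least `threshold` times,
    most frequent first. Each correction is assumed to be a dict with
    a 'category' key (a hypothetical log format)."""
    counts = Counter(c["category"] for c in corrections)
    return [(cat, n) for cat, n in counts.most_common() if n >= threshold]


log = [
    {"category": "tone"},
    {"category": "missing risk"},
    {"category": "tone"},
    {"category": "tone"},
    {"category": "missing risk"},
]
print(recurring_corrections(log))
```

Anything that shows up in that list is a candidate for a change to the context, the instructions, or the workflow definition rather than another manual fix.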

This is also where the real craft shows up. The person who is good at this does more than approve or reject. They notice patterns, update the relevant artifacts, and sometimes redesign the review step itself when it is creating more work than it is preventing.

Over time, this is what lets autonomy increase safely. The human is not being removed from the loop. They are being moved to a higher-leverage position in the loop: designing and maintaining the conditions under which the agent can be trusted more.

What this looks like in practice

In the better founder and operator setups I have worked on, the pattern is usually consistent even if the details differ:

The agent does the heavy, repetitive lifting: pulling context, drafting, enriching, organizing, and handling the obvious cases. It then surfaces only a narrow set of things for human attention: uncertain calls, high-stakes actions, or situations that conflict with known constraints or recent changes.

The human review is deliberately narrow and decision-oriented rather than "read everything and fix it." Examples: approve or edit this specific draft, choose between these three approaches, flag anything that feels off given what we decided last week. The output of that judgment then gets written back into the relevant context file or playbook so the next run is less likely to need the same correction.
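The write-back step is the one most often skipped, so it helps to make it cheap. As a rough sketch, assuming the playbook is a plain text file of bulleted lessons (the file layout and the `write_back` name are my assumptions, not a standard):

```python
from pathlib import Path


def write_back(playbook: Path, lesson: str) -> bool:
    """Append a lesson to the playbook unless it is already recorded.
    Returns True if the lesson was added, False if it was a duplicate."""
    entry = f"- {lesson.strip()}"
    text = playbook.read_text(encoding="utf-8") if playbook.exists() else ""
    if entry in text.splitlines():
        return False
    # Start on a fresh line if the file does not end with a newline.
    separator = "" if text == "" or text.endswith("\n") else "\n"
    with playbook.open("a", encoding="utf-8") as fh:
        fh.write(f"{separator}{entry}\n")
    return True
```

A call such as `write_back(Path("playbook.md"), "Lead with risks in investor updates")` would then record the correction once, so the next run starts with it in context.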

The exact placement changes by domain in ways that are easy to feel once you have lived with the system.

Recruiting research and initial lead enrichment can often run with very light review once the context is strong, because the downside of a slightly off profile is low. Anything involving money, legal language, or the company's public voice usually needs at least a narrow human checkpoint. Client reporting frequently needs a human to own the narrative framing and risk commentary even when the data extraction itself is fully automated.

The principle is consistent: put the limited human attention where it creates the highest value per minute, and design the surrounding system so that time is actually protected for those moments instead of being consumed by low-value verification.

The uncomfortable part

Designing good judgment layers requires being honest about what the agent is and is not good at, in this specific context, right now. That honesty is often in short supply.

It is tempting to either over-trust the agent (because the demos were impressive) or under-trust it (because you have seen it fail in ways that felt embarrassing). Both errors are expensive. The first costs you quality. The second costs you leverage.

The people who get this right have usually spent real time with the actual outputs in the actual context of the actual business. They have seen enough edge cases to develop a calibrated feel for where the agent is reliable versus where it still needs a human.

That calibration is hard to acquire second-hand. It comes from running the workflows, watching the specific ways they fail, and adjusting the judgment layer accordingly.

One founder I worked with initially wanted the agent to draft all investor updates with only a final sign-off. After three cycles they realized the agent was consistently under-emphasizing risks and over-optimizing the narrative. Instead of adding more review, they changed the design: the agent now produces a structured risk-and-context brief first. The founder reviews that single artifact, makes the real calls on emphasis and framing, and only then lets the agent write the full update. The human stayed in the loop, but at a much higher-leverage point. The quality went up and the time spent went down.

The technology will keep improving. The design problem of where the human should actually sit will remain, just at a higher level of capability. The teams that treat judgment placement as an ongoing design problem, rather than something to eventually engineer away, are the ones who will keep pulling real, compounding leverage as the tools get better.
