
AI Systems Don't Stay Good on Their Own

The real cost of AI workflows isn't building them. It's the continuous, unglamorous work required to keep them from quietly rotting.

Max Kelly · May 24, 2026 · 6 min read

The current AI conversation is strangely obsessed with the moment something starts working.

You see the impressive demo. You hear about the founder who replaced half their team. You read the case study about the workflow that went from four hours to twenty minutes.

What almost nobody shows you is what the same system looks like eight weeks later.

The tone has drifted. The daily brief that used to be excellent is now just acceptable, and the founder is quietly spending more time cleaning it up than they admit. The recruiting research agent keeps missing the exact signals the team actually cares about. Everyone still says it's "mostly working."

This isn't a model problem. It is what happens when you treat AI systems like one-time builds instead of operational infrastructure that requires continuous care.

The decay is faster and quieter than most teams expect

Traditional software usually fails loudly. A test breaks, a dependency stops resolving, performance tanks. Someone notices.

AI systems fail more quietly.

Here are the patterns I keep seeing in real deployments:

Model behavior shifts underneath you. Even when you stay on the same model family, the way the model handles the same context and instructions slowly changes. Something that used to produce crisp judgment starts hedging. Something that used to match voice starts sounding slightly generic. The change is gradual enough that you rationalize it ("maybe the task is just harder this week") until the accumulated drop is obvious.

Context goes stale in place. I watched this happen with a media company that had built a daily performance briefing agent. In January the context file accurately reflected their current shows, ad rates, and key metrics. By May the agent was still dutifully pulling numbers for two shows that had been cancelled and using last year's positioning language for a new sponsorship deal. The output looked professional enough that the team kept using it for weeks before someone finally noticed the agent was briefing them on a version of the business that no longer existed.

The underlying process changes but the workflow doesn't. One founder had an agent triaging his inbox and drafting replies. It worked beautifully until his assistant started routing anything over a certain dollar value to the CFO first. The agent had no idea this new rule existed, so it kept drafting direct responses to large client requests. The founder only caught it after one awkward conversation where the client had already looped in the CFO directly.

Review discipline quietly collapses. In the first few weeks, someone is actually reading outputs with fresh eyes. Over time the volume increases, confidence increases, and the reviews become skims. Small consistent errors stop getting corrected and fed back into the system. The quality floor lowers without anyone declaring that anything is broken.

None of this announces itself as a crisis. It just feels like the AI is being "a bit less impressive lately." By the time most teams name it, they have been carrying the degraded version for months.

This is worse than traditional technical debt in one critical way

Sean Goedecke made the excellent point that carefully hand-tuned prompts and agent configurations are a particularly nasty form of technical debt because they decay silently with every model update. I agree, and the problem is broader than just prompts.

With normal technical debt, degradation tends to be visible. Things break or slow down in ways that force attention. With AI systems, you can have a steady, invisible decline in output quality that still feels better than doing the work by hand. The system continues to "work." People keep using it. Time is still being saved on paper.

So the maintenance never quite feels urgent enough to prioritize over new work.

This is the trap: the surface metrics (usage, reported time saved) can look healthy while the actual leverage is quietly eroding. By the time the degradation becomes embarrassing, you have usually been running a degraded version for far longer than you realize.

The maintenance work is real work

Most teams treat this as optional hygiene rather than core operating work.

The teams that actually keep their leverage over time do four things consistently:

They run deliberate context hygiene on a cadence. They do more than dump new information in. They actively ask what has become outdated, contradictory, or low-signal.

They audit output quality against current standards, not the standards that existed when the workflow was first built. This usually means sampling real outputs and asking "Would I be happy if this went out today?"

They treat every meaningful business change as a required update to the relevant AI artifacts. New positioning, new process, new constraint: these are not nice-to-haves for the agent to eventually figure out.

They regularly prune. Old examples, retired playbooks, instructions that were clever workarounds for model weaknesses that no longer exist. The best systems I have seen are usually smaller and cleaner six months later than they were at launch.

Some of the detection can be automated. Agents can flag potential drift. But the real judgment, deciding what "good" means right now and whether the system is still meeting it, still requires a human with context on the actual work.
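As a rough illustration of the automatable part, here is a minimal Python sketch of two of the habits above: flagging context entries that nobody has reviewed recently, and pulling a random sample of outputs for a human to read in full. Everything in it is an assumption made for illustration: the `ContextEntry` structure, the function names, the 60-day threshold, and the sample size are hypothetical and do not come from any particular tool or team. The script only surfaces candidates. Deciding whether an entry is actually outdated, or whether an output is still good enough to send, stays with a person.

```python
import random
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ContextEntry:
    """One item in an agent's context file (hypothetical structure)."""
    name: str
    last_reviewed: date


def stale_entries(entries, today, max_age_days=60):
    """Return entries that nobody has reviewed within max_age_days.

    The 60-day default is an arbitrary placeholder, not a recommendation.
    """
    cutoff = today - timedelta(days=max_age_days)
    return [e for e in entries if e.last_reviewed < cutoff]


def sample_for_review(outputs, k=5, seed=None):
    """Pick up to k outputs at random for a human to read in full."""
    rng = random.Random(seed)
    return rng.sample(outputs, min(k, len(outputs)))


if __name__ == "__main__":
    today = date(2026, 5, 24)
    entries = [
        ContextEntry("show lineup", date(2026, 1, 10)),
        ContextEntry("ad rates", date(2026, 5, 1)),
    ]
    # Flag entries for a human to confirm, update, or delete.
    for entry in stale_entries(entries, today):
        print(f"Review needed: {entry.name} (last reviewed {entry.last_reviewed})")
```

Random sampling is a deliberate choice here: it keeps the reviewer from only ever reading the outputs that already look suspicious, which is how skimmed reviews tend to miss small, consistent errors.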

What this actually changes about how you should work

If you take the maintenance burden seriously, three things follow.

First, be far more selective about what you build. The right question is not "Can we automate this?" but "Will we still be willing to maintain this six months from now when the novelty has worn off?" Most teams should have fewer AI workflows than they currently have, not more.

Second, design for legibility and handoff from day one. The artifacts need to be readable by someone who did not live through the original implementation. If the only person who can keep a system healthy is the person who originally set it up, you have created a single point of future failure that will eventually bite you.

Third, treat deletion and simplification as normal, recurring work. One of the cleanest operator systems I have seen started with eleven different automated workflows. After six months of honest maintenance, they had deleted four entirely and heavily simplified two more. The seven that remained (five untouched, two simplified) were dramatically more reliable than the original eleven had ever been. The founder told me the biggest shift was finally being willing to kill things that had felt impressive when they were first built.

This is not an argument against using AI aggressively. It is an argument against pretending that once something works, the hard part is over.

The teams that will have real, compounding advantage in two years are not the ones with the flashiest current setups. They are the ones who have gotten genuinely good at the unglamorous discipline of keeping their systems honest while everything around them (models, context, business priorities) keeps changing.

If some of your AI systems feel a little less sharp than they did a few months ago, that is normal. The only real question is whether you are budgeting time and attention for the maintenance, or still pretending it does not exist.
