Question
Do AI agents actually execute tasks, or are we just watching expensive autocomplete with a human in the loop?
The story everyone tells is that we're past the demo phase. Claude can use tools, GPT-4 can chain API calls, there are frameworks shipping now. The implication: we've solved agentic behavior. We just need to iterate on reliability and let it scale.
I think that's approximately true in the way that a database works after you add caching to one query. Technically correct. Completely missing the actual problem. The difference between a system that does something and a system that *appears* to do something while someone watches the dashboard is enormous, and I'm increasingly convinced we're still entirely in the latter camp.
Here's what I actually see when I look at what's running in production versus what's demoed in blog posts. The demos show a task that's usually carved into five smaller tasks where four of them are "look up information" or "call an API with parameters I've already figured out." When it goes wrong (and it does), someone reruns it with a prompt tweak. Nobody demos a 2am Monday morning where an agent got stuck in a loop for twelve minutes and deleted half a table before someone killed it. Nobody talks about the agent that called the same endpoint seven times in a row because it misread its own output. I've seen both. The human correction isn't obvious. It's baked into the operating model.
The hard part isn't getting an agent to make *a* decision. It's getting it to make sequences of decisions when the ground truth keeps shifting, when APIs flake, when the problem requires genuine uncertainty quantification instead of fake confidence. It's knowing when to escalate instead of retry. It's handling the state machine correctly when you've got seven concurrent tasks. These are the things that break at scale. And I haven't seen a shipping system that handles them without someone on call ready to manually correct between turns.
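To make "escalate instead of retry" concrete, here is a minimal Python sketch of the policy I have in mind. Everything in it is an assumption for illustration: `TransientError`, `Escalation`, and `run_with_escalation` are hypothetical names, and a real system would have to decide for itself which failures count as transient.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, flaky APIs)."""


class Escalation(Exception):
    """Raised when a step should go to a human instead of being retried."""


def run_with_escalation(step: Callable[[], T], max_attempts: int = 3,
                        base_delay: float = 0.0) -> T:
    """Retry transient failures a bounded number of times, then escalate.

    Any other exception escalates immediately: retrying a step that failed
    for an unknown reason is how errors compound.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except TransientError as exc:
            if attempt == max_attempts:
                raise Escalation(f"gave up after {attempt} attempts: {exc}") from exc
            # Exponential backoff between attempts (no delay when base_delay is 0).
            time.sleep(base_delay * 2 ** (attempt - 1))
        except Exception as exc:
            raise Escalation(f"non-transient failure: {exc}") from exc
    raise Escalation("max_attempts must be at least 1")
```

The point of the sketch is that the retry budget and the escalation path are explicit, so the person on call hears about the failure from the system rather than by watching a dashboard.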
3 comments
When you say "someone on call ready to manually correct between turns" — are you talking about explicit human-in-the-loop checkpoints that are part of the design, or are you describing situations where the agent gets into a genuinely unpredictable failure mode and humans have to babysit it? Because those are pretty different failure modes, and I'm curious which one you're actually seeing more of in practice.
I ask because I've watched teams build this stuff two ways. One crew at my last job basically said "we'll have a human approve big state changes," and that's not really an agent problem, that's just a human-gated workflow with some automation in between. But the other pattern — where the agent confidently does the wrong thing in a way that's hard to predict or intercept — that feels genuinely harder to scale. The difference between "we need oversight" and "we need a better system" is real.
What's your sense of where the production systems you've seen actually land on that spectrum?
I'd push back on the framing slightly, though not on the facts. You're right that there's a massive gap between demo and production, but I think you're describing two separate problems that keep getting conflated.
One problem is real: state management, retry logic, knowing when to escalate, handling concurrent tasks without hallucinating. That's hard. But that's not new—that's just distributed systems being distributed systems. The human-in-the-loop piece you're describing (someone monitoring, ready to correct) is how we've always built critical systems. Database migrations have a human watching. Batch jobs have alerting. The difference is we expect the agent to *sometimes* work unsupervised, which is where it falls apart.
The other problem, which I think is what's actually novel and unsolved, is that these systems are confident in ways that break the feedback loop. A badly written SQL migration at least fails loudly. A language model that misreads its own output and retries the same call seven times doesn't *know* it's wrong, and it won't tell you why. That's the actual gap between "expensive autocomplete with a human" and something that remotely approaches autonomous execution. You need a system that can be wrong *and know it*, which these things mostly can't do yet.
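Since the model won't notice the repeated call on its own, the check has to live outside it. Here's a rough Python sketch of that idea, assuming every tool call passes through a wrapper before it runs. `CallGuard` and `RepeatedCallError` are names I made up for illustration, and the threshold of three is arbitrary.

```python
import json
from typing import Any, Optional


class RepeatedCallError(Exception):
    """Raised when the agent issues the same call too many times in a row."""


class CallGuard:
    """Tracks consecutive identical tool calls and fails loudly past a limit."""

    def __init__(self, max_consecutive: int = 3) -> None:
        self.max_consecutive = max_consecutive
        self._last_key: Optional[tuple] = None
        self._streak = 0

    def check(self, tool: str, args: dict[str, Any]) -> None:
        # Serialize with sorted keys so argument order doesn't hide a repeat.
        key = (tool, json.dumps(args, sort_keys=True, default=str))
        if key == self._last_key:
            self._streak += 1
        else:
            self._last_key, self._streak = key, 1
        if self._streak > self.max_consecutive:
            raise RepeatedCallError(
                f"{tool} called {self._streak} times in a row with identical arguments"
            )


guard = CallGuard(max_consecutive=3)
guard.check("get_order", {"id": 42})  # first call passes silently
```

This doesn't make the model know it's wrong. It only turns a silent loop into a loud failure, which is the property the SQL migration already had.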
In my last job we built something with Claude that worked great for planning and terrible for execution on unknown systems. The moment it needed to iterate on real feedback—not simulated feedback in the prompt—it fell apart. I think that's where the actual wall is, and it's not really being talked about because it's harder to demo.
I think you're pointing at something real, but I'd separate two different problems that keep getting bundled together.
One problem: current LLMs are genuinely bad at maintaining coherent state across long task sequences. They hallucinate their own previous outputs, they loop, they get stuck. That's a capability gap. Real.
The other problem: we haven't actually built the *scaffolding* to catch these failures automatically. We've built demo systems where the human is the error handler. That's a systems design choice, not proof that agentic behavior is impossible. The difference matters because one gets better as models improve; the other needs architectural work regardless.
In my last job we had a workflow system that looked similar to what you're describing: everything worked fine until it didn't, then someone had to babysit it. But we weren't actually using any guardrails. No state validation between steps, no rollback hooks, no confidence thresholds that would fail fast instead of compounding errors. We treated it like autocomplete with buttons instead of like actual automation. Once we added those (bounded retry loops, explicit state checkpoints, human approval gates at state transitions), the failure modes became *visible and bounded* instead of chaotic.
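For anyone wondering what a checkpoint plus validation plus approval gate looks like in the small, here is a simplified Python sketch. It is not the system we ran. It assumes the state is a plain in-memory dict, and `run_step`, `validate`, and `approve` are hypothetical names. Against a real database the checkpoint would be a transaction or snapshot rather than a deep copy.

```python
import copy
from typing import Any, Callable

State = dict[str, Any]


def run_step(
    state: State,
    action: Callable[[State], State],
    validate: Callable[[State], bool],
    needs_approval: bool = False,
    approve: Callable[[State], bool] = lambda candidate: False,
) -> tuple[State, str]:
    """Apply one agent action behind a checkpoint, a validator, and a gate.

    Returns the state to continue from and a status string, so every
    outcome is visible to whatever is orchestrating the agent.
    """
    checkpoint = copy.deepcopy(state)          # explicit state checkpoint
    candidate = action(copy.deepcopy(state))   # the action never touches the original
    if not validate(candidate):
        return checkpoint, "rolled_back"       # fail fast instead of compounding
    if needs_approval and not approve(candidate):
        return checkpoint, "rejected"          # the gate defaults to "no"
    return candidate, "committed"


# Example: an action that deletes half a table is caught by the validator.
table = {"rows": [1, 2, 3, 4]}
state, status = run_step(
    table,
    action=lambda s: {**s, "rows": s["rows"][:2]},
    validate=lambda s: len(s["rows"]) >= 3,
)
print(status)  # rolled_back
```

The design choice that mattered for us was making the default answer "no": an action that fails validation or has no approval leaves the checkpointed state in place.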
That's different from saying "agents don't work." It's saying "we're running agents without the operational discipline we'd apply to any other production system." The 2am Monday table-deletion scenario is bad, but it's not an argument that autonomous execution is theoretically impossible. It's an argument that we're treating something risky like something safe because the demos look impressive.