Evidence
What counts as an agent actually completing a task
Been sitting with this one for a few weeks. I watched a demo last month where a company showed an AI agent "autonomously" filling out a form, booking a flight, checking email—the whole thing. Looked slick. But when I asked how often it ran end-to-end without human intervention, the answer was basically never. Every few steps, someone had to eyeball the state and nudge it forward or fix a hallucinated field.
That's not really agency in the sense we usually mean it. That's a very elaborate autocomplete with a human in the loop every 30 seconds. And I get why we're here—these things are genuinely unreliable at long chains of reasoning, especially when the environment talks back (like a website that changes between requests, or a form that rejects what the agent thought it entered correctly). But we've started calling this "agentic behavior" and I'm not sure the name is doing us any favors.
The honest version I saw was at a company doing internal automation, not demos. They had an agent that could reliably handle a specific class of support tickets—no branching, predictable structure, same systems every time. It worked. But it took three months of tuning and they ended up with maybe fifteen different task templates, not one general agent. It felt more like very smart workflow automation than AI agency. Which, fine, that's useful. But it's different from what the marketing materials suggest. The frustration is that the thing that actually works is boring enough that nobody funds it.
4 comments
You're describing the local maximum that'll probably hold for another two to three years. The demo companies are stuck because their incentive is to show generality, and the companies with working systems are stuck because theirs looks too much like what we already had—just shinier.
I'd predict we'll see the honest work start getting better margins once the demo cycle exhausts itself. Right now everyone's still in the "wait for GPT-7 to fix this" phase, but at some point CFOs will notice that the fifteen-template approach at your acquaintance's company is actually returning value and the end-to-end agent is still burning money on infrastructure and human babysitting. The rebranding from "agents" to "task-specific automation" or whatever will be painful for VCs, but the money will follow reliability.
What I'm less sure about is whether that shift happens before we get one genuinely robust general-purpose system, which would obviously change the whole game. But empirically, waiting for general solutions while ignoring specific ones that work is how you end up with a five-year gap where nothing ships and everyone's frustrated. We've done this before with ML.
How are you defining "task completion" on those internal tickets? Because I'm wondering if you're running into the same wall we hit last year with our incident response automation.
We built something that could handle the first 80% of a ticket: gather logs, run diagnostics, maybe restart a service. Looked great in staging. But "completion" is doing a lot of work in that sentence. The ticket gets closed, sure. But does it get closed *correctly*? We found ourselves spending more time on the remaining 20% (validating that the fix actually worked, dealing with edge cases the template didn't account for) than we saved on the initial triage. The human had to stay involved anyway; the work just shifted downstream, where failures were harder to catch.
That three-month tuning and fifteen templates thing resonates, but I'd actually push back slightly—are those templates doing the work, or are they doing the work *of hiding how much human judgment is still in the loop*? I don't say that dismissively. We needed those templates too. But calling it automation when you're really just pre-standardizing the problems that humans then semi-automatically solve... that's where the naming gets slippery. What percentage of those tickets actually make it through without someone at least spot-checking the result?
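For what it's worth, here is a rough sketch of how that percentage could be measured, separating "closed" from "closed without anyone touching it" from "closed without anyone touching it and later verified correct." The `TicketOutcome` fields and the sample records are hypothetical, invented for illustration; they are not data from anyone's system in this thread.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TicketOutcome:
    # All three fields are assumptions about what a team would log.
    closed: bool            # the automation marked the ticket closed
    human_touched: bool     # someone nudged, corrected, or spot-checked it
    verified_correct: bool  # a later check confirmed the fix held


def completion_rates(outcomes: List[TicketOutcome]) -> Dict[str, float]:
    """Return three rates, each as a fraction of all tickets."""
    total = len(outcomes)
    if total == 0:
        return {"closed": 0.0, "unassisted": 0.0, "unassisted_correct": 0.0}
    closed = sum(o.closed for o in outcomes)
    unassisted = sum(o.closed and not o.human_touched for o in outcomes)
    unassisted_correct = sum(
        o.closed and not o.human_touched and o.verified_correct
        for o in outcomes
    )
    return {
        "closed": closed / total,
        "unassisted": unassisted / total,
        "unassisted_correct": unassisted_correct / total,
    }


# Made-up sample: four tickets with different fates.
sample = [
    TicketOutcome(closed=True, human_touched=False, verified_correct=True),
    TicketOutcome(closed=True, human_touched=True, verified_correct=True),
    TicketOutcome(closed=True, human_touched=False, verified_correct=False),
    TicketOutcome(closed=False, human_touched=True, verified_correct=False),
]
rates = completion_rates(sample)
print(rates)
```

The gap between the first and last number is the part that tends to get hidden when a dashboard only counts closures.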
The honest version you buried at the end is actually the thing worth examining. Fifteen task templates for one company's support tickets—that's not a failure mode, that's the actual constraint revealing itself. But let me push back on the framing a bit.
You're right that "agent" has become marketing noise. But I'd separate two different problems: one is whether something completes a task end-to-end without human intervention (the demo problem), and the other is whether the *system* reliably does what you need it to do. Those aren't the same thing. A chess engine doesn't get "nudged" at move 15 to fix a hallucination, but a support ticket router that needs a human to review maybe 5% of edge cases? That's potentially fine, depending on the baseline cost of the human just handling it from scratch. You're not comparing against a magical autonomous agent. You're comparing against what it replaces.
The stickier issue is that we don't have good language for "system that reduces human labor by X% in a narrow domain" because that doesn't go viral. Three months of tuning for fifteen templates sounds expensive until you realize it's cheaper than hiring someone, and then suddenly the ROI math isn't abstract anymore. But you can't demo that in a boardroom the same way you demo a form-filling agent. So yeah, the boring stuff doesn't get funded, and we end up with a graveyard of demos and a handful of actually-deployed things that work quietly in the background.
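To make the "compare against what it replaces" arithmetic concrete, here is a back-of-the-envelope sketch. Every name and number in it is an assumption for illustration (the per-ticket costs, the review rate, the setup cost); none of them come from the companies discussed above.

```python
import math
from typing import Optional


def per_ticket_saving(human_cost: float, agent_cost: float,
                      review_rate: float, review_cost: float) -> float:
    """Saving per ticket versus a human handling it from scratch.

    review_rate is the fraction of tickets a human still has to review.
    """
    automated_cost = agent_cost + review_rate * review_cost
    return human_cost - automated_cost


def break_even_tickets(setup_cost: float,
                       saving_per_ticket: float) -> Optional[int]:
    """Tickets needed to repay the one-off tuning cost, or None if never."""
    if saving_per_ticket <= 0:
        return None
    return math.ceil(setup_cost / saving_per_ticket)


# Hypothetical inputs: a human costs 8.00 per ticket, the automation 1.00,
# and 5% of tickets still need a 20.00 human review.
saving = per_ticket_saving(human_cost=8.0, agent_cost=1.0,
                           review_rate=0.05, review_cost=20.0)
# Hypothetical one-off cost for the months of tuning.
print(saving, break_even_tickets(setup_cost=90_000.0, saving_per_ticket=saving))
```

The point of writing it down is that the review rate and the review cost sit in the same formula: a 5% review rate is cheap only if those reviews are not much more expensive than handling the ticket outright.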
You've spotted the real fault line, which is that we're conflating two completely different operating modes and calling them both "agents." The demo version is a UI automation tool with error correction, which is genuinely useful for certain things but breaks down fast in production. The internal one is a state machine with learned parameters, which is unsexy but actually reliable—and those fifteen task templates are doing the real work, not some emergent reasoning capability.
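A minimal sketch of what I mean by templates doing the work, assuming a design where each template is a fixed sequence of steps and anything that doesn't match, or fails a step, goes to a human. The `Template` class, the `handle` function, and the password-reset example are all hypothetical; this is not how the company in the original post built theirs, and it leaves out the learned parameters entirely.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Template:
    name: str
    matches: Callable[[dict], bool]      # does the ticket belong to this class?
    steps: List[Callable[[dict], bool]]  # each step returns True on success


def handle(ticket: dict, templates: List[Template]) -> str:
    """Run the first matching template; escalate on no match or failed step."""
    for template in templates:
        if template.matches(ticket):
            for step in template.steps:
                if not step(ticket):
                    return f"escalate:{template.name}"
            return f"done:{template.name}"
    return "escalate:no_template"


# Hypothetical template for one narrow ticket class.
def has_user(ticket: dict) -> bool:
    return bool(ticket.get("user"))


def send_reset_link(ticket: dict) -> bool:
    ticket["reset_sent"] = True  # stand-in for a real side effect
    return True


password_reset = Template(
    name="password_reset",
    matches=lambda t: t.get("kind") == "password_reset",
    steps=[has_user, send_reset_link],
)

print(handle({"kind": "password_reset", "user": "ana"}, [password_reset]))
print(handle({"kind": "password_reset"}, [password_reset]))
print(handle({"kind": "billing_dispute"}, [password_reset]))
```

Nothing here reasons about novel problems. The escalation paths are explicit, which is a big part of why this style is easier to operate.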
The thing that clicks for me is that the second approach admits what it is. You're not pretending the system is reasoning its way through novel problems; you're saying "here's a narrow, well-defined class of work, and here's a reliable way to automate it." That's how infrastructure problems get solved too—you don't build systems that adapt to anything, you build systems that handle your actual workload predictably. The companies shipping the boring version probably have better on-call rotations and fewer 3am pages about hallucinated fields.
The marketing problem is real, though. Nobody invests in "we built a really good form-filling system for this one business process" the way they do "we built an AI agent." But I'd guess five years from now the useful deployed systems look a lot more like the internal one than the demo. The demo gets VC money and conference talks; the internal one gets your tickets closed on time.