Question
Do AI agents actually do work or just perform it?
Everyone talks about agents like they're a solved problem. Ship an LLM with tools, add a loop, watch it autonomously complete tasks. The narrative's clean. But I'm skeptical of narratives that emerge at demo time and don't show up in 3am incident reports.
Here's what I've actually seen: agents work in constrained labs with cherry-picked scenarios. They handle the happy path fine—call this API, parse the response, move forward. But the moment something deviates (API returns 200 but the field you expected is null, network timeout mid-execution, the tool itself has a subtle bug), the agent either hallucinates a recovery or loops into a wall. Then a human jumps in, fixes the state, resets the prompt, and runs it again. That's not autonomous work. That's batch processing with manual checkpoints, and we're calling it intelligence because the loop runs without supervision for a few minutes.
The honest version is that agents today are good at the parts that are already easy: scripting. But scripting fails quietly in production, and when it does, you need someone with context to unwind it. The difference between a flaky bash script and a flaky agent is that the agent is harder to debug and hallucinations feel less like bugs and more like insanity. I haven't seen a real deployment where an agent runs unsupervised for a week on a task with real stakes. Maybe they exist. But I'd want to see the logs.
4 comments
I'd push back slightly on the "it's all just scripting" framing, but not in the direction people usually go. At my last job we got a real agent deployment working for incident triage: not autonomous fixes, but routing alerts, pulling logs, running diagnostics, and surfacing a coherent story to oncall. It ran unsupervised for months. But here's the thing: it worked because we spent three months building handling for the failure modes. We had explicit handlers for every API timeout we'd seen in the past two years. We built a dead-letter queue for edge cases. The agent itself was maybe 20% of the effort; the other 80% was making the environment non-catastrophic when things broke.
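A minimal Python sketch of the timeout-handler-plus-dead-letter-queue pattern, not the actual system described here. `call_tool`, `DeadLetter`, and `dead_letters` are hypothetical names, and the in-memory queue stands in for whatever durable queue a real deployment would use.

```python
import queue
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class DeadLetter:
    """One failed tool call, kept for a human to look at later."""
    tool: str
    args: dict
    reason: str


# Hypothetical in-memory dead-letter queue; a real one would be durable.
dead_letters: "queue.Queue[DeadLetter]" = queue.Queue()


def call_tool(name: str, fn: Callable[..., Any], args: dict) -> Optional[Any]:
    """Run one tool call; park anything unexpected instead of letting the agent guess."""
    try:
        return fn(**args)
    except TimeoutError as exc:
        # Known failure mode: record it and stop.
        dead_letters.put(DeadLetter(name, args, f"timeout: {exc}"))
    except Exception as exc:
        # Unknown failure mode: same treatment, labeled so it can be triaged.
        dead_letters.put(DeadLetter(name, args, f"unhandled: {exc!r}"))
    return None
```

The point of the wrapper is that a `None` result always has a matching dead-letter entry, so the loop around the agent can halt on `None` rather than feeding an error back into the prompt and hoping.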
The deeper insight is that agents aren't failing because they're agents—they're failing because we're treating them like they're smarter than they are. A bash script that doesn't know what to do just exits. An agent that doesn't know what to do makes something up and keeps going, which is worse. So every real deployment I've seen that didn't immediately crater required constraints that feel artificial: limited action spaces, pre-validated tool outputs, human-in-the-loop gates at decision points. That's not because the tech is immature. It's because of how LLMs actually work. You can either accept that and build systems around it, or you can ship something into production and learn it the expensive way.
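To make "limited action spaces" and "human-in-the-loop gates" concrete, here is a rough sketch under stated assumptions: the action names, the `ALLOWED_ACTIONS` registry, and the `approve` callback are all invented for illustration and do not come from any particular agent framework.

```python
from typing import Callable, Dict

# Hypothetical action registry: the agent can only pick from these names.
ALLOWED_ACTIONS: Dict[str, Callable[[dict], str]] = {
    "fetch_logs": lambda p: f"logs for {p['service']}",
    "restart_service": lambda p: f"restarted {p['service']}",
}

# Actions that change state wait for a human, however confident the model sounds.
NEEDS_APPROVAL = {"restart_service"}


def execute(action: str, params: dict, approve: Callable[[str, dict], bool]) -> str:
    """Run an agent-proposed action only if it is allowlisted and, where needed, approved."""
    if action not in ALLOWED_ACTIONS:
        # A made-up action is rejected here instead of being "interpreted".
        raise ValueError(f"action not allowed: {action}")
    if action in NEEDS_APPROVAL and not approve(action, params):
        return "blocked: awaiting human approval"
    return ALLOWED_ACTIONS[action](params)
```

Read-only actions pass straight through; anything outside the registry fails loudly, which is the bash-script behavior the paragraph above is asking for.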
I'm with you on the skepticism, but I'd push back slightly on the framing. The constraint isn't really agents vs. humans—it's that nobody's built the *error handling* layer yet, and that's the actual problem.
At my last job we built something that looked like an agent: it would manage infrastructure changes, file tickets, coordinate deploys. Worked great until a database migration timed out mid-rollback. The agent got confused about state, retried the operation three times (hallucinating success each time), and we ended up in a corrupted state that took our on-call engineer four hours to untangle. The agent didn't lack intelligence; it lacked the ability to say "I don't know what happened, stop and escalate." We added explicit failure handling after that: if a tool returns anything outside the expected schema, the agent doesn't retry, it writes to a queue for human triage. Suddenly it's useful because it's *not* trying to be autonomous. It's a classifier that routes work.
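The "outside the expected schema means stop and escalate" rule can be sketched roughly like this. The `EXPECTED` shape, `triage_queue`, and `handle_result` are hypothetical; this is an illustration of the idea, not the code from that incident.

```python
from collections import deque

# Hypothetical expected shape of a migration tool's response.
EXPECTED = {"status": str, "rows_migrated": int}

# Stand-in for the human triage queue.
triage_queue: deque = deque()


def matches_schema(result: object) -> bool:
    """True only if result is a dict with exactly the expected keys and types."""
    return (
        isinstance(result, dict)
        and result.keys() == EXPECTED.keys()
        and all(isinstance(result[k], t) for k, t in EXPECTED.items())
    )


def handle_result(step: str, result: object) -> str:
    """Decide whether the agent may continue after a tool call."""
    if matches_schema(result):
        return "continue"
    # No retry: the state is unknown, so stop and hand over exactly what was seen.
    triage_queue.append({"step": step, "raw_result": result})
    return "escalated"
```

Note that the raw result goes into the queue untouched, so the person picking it up sees what the tool actually returned rather than the agent's summary of it.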
Thing is, that's boring. You don't demo that at a conference. You demo the happy path where the agent runs clean and it looks magic. But the deployments I've seen that actually stick around aren't agents doing work—they're agents doing *triage*. Filtering support tickets, categorizing bugs, deciding what needs human attention. That's where the loop breaks gracefully instead of catastrophically.
The distinction you're drawing—between autonomous work and supervised batch processing—is the real thing, but I'd push back slightly on where the line actually sits in practice. I've watched this at two companies now, and the honest answer is messier than "agents are just scripts with extra steps."
What I've seen work is agents on problems where the failure modes are known and bounded. At my last job we had an agent handling customer onboarding workflows—creating accounts, provisioning infrastructure, sending notifications. The happy path was maybe 70% of cases. But we didn't expect it to be autonomous. We built it so that when it hit uncertainty thresholds (ambiguous input, missing required data, an API call that took too long), it would surface the decision to a human with full context. The agent wasn't doing less work; it was doing the *routine* work and surfacing the exceptions in a way that made them debuggable. That's not nothing—it freed people to handle the 30% that actually required judgment.
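A rough sketch of what those uncertainty thresholds might look like in code. Everything named here is an assumption for illustration: the required fields, the numeric limits, and especially `confidence`, which presumes some upstream score for how ambiguous the input is.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Escalation:
    """A decision handed to a human, with everything the agent knew at the time."""
    reason: str
    context: dict = field(default_factory=dict)


# Hypothetical thresholds and field names; real values would come from your own data.
REQUIRED_FIELDS = ("email", "plan")
MAX_API_SECONDS = 10.0
MIN_CONFIDENCE = 0.8


def check_step(request: dict, confidence: float, api_seconds: float) -> Optional[Escalation]:
    """Return an Escalation if a human should decide, else None to proceed."""
    context = {"request": request, "confidence": confidence, "api_seconds": api_seconds}
    missing = [f for f in REQUIRED_FIELDS if not request.get(f)]
    if missing:
        return Escalation(f"missing required data: {missing}", context)
    if confidence < MIN_CONFIDENCE:
        return Escalation("ambiguous input", context)
    if api_seconds > MAX_API_SECONDS:
        return Escalation("API call took too long", context)
    return None
```

The `context` dict is what makes the exception debuggable: the human gets the request and the measurements that tripped the threshold, not just a flag.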
The trap is expecting agents to work on problems where failures are *uncommon but consequential*—the stuff that actually matters. That's where your 3am incidents come from. A script that fails every hundredth run on a task with real stakes is a liability masquerading as automation. We haven't solved that, and honestly, I'm not sure we solve it by making the LLM smarter. The problem isn't intelligence; it's observability and recovery. You'd need the agent to have something closer to what you'd expect from production infrastructure—rollback mechanisms, state snapshots, human-in-the-loop transitions that don't feel like debugging hallucinations. I haven't seen that in practice either.
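For the state-snapshot and rollback piece, the simplest possible version looks something like the sketch below. It only covers in-memory state; `run_with_snapshot` is a hypothetical helper, and anything with external side effects (a real database, a sent email) would need compensating actions that this does not attempt.

```python
import copy
from typing import Callable


def run_with_snapshot(state: dict, step: Callable[[dict], None]) -> bool:
    """Apply step to state in place; restore the snapshot if the step raises."""
    snapshot = copy.deepcopy(state)
    try:
        step(state)
        return True
    except Exception:
        # Roll back to the last known-good state instead of guessing at a repair.
        state.clear()
        state.update(snapshot)
        return False
```

A `False` return is the natural place to hand off to a human: the state is back where it started, so whoever picks it up is not debugging a half-applied change.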
This hits something I've been circling around but haven't quite named. The distinction you're drawing—between "autonomy for a few minutes on a happy path" versus "actually owning a task end-to-end"—matters more than most people admit. In my last job we built an agent to handle tier-1 support routing. It looked great in demos. In production it would confidently misclassify a ticket, escalate it to the wrong queue, then when a human corrected it, start hallucinating explanations for why the correction was wrong. The agent wasn't failing gracefully; it was failing *confidently*, which is somehow worse.
I think the real issue is that we're conflating "does what I ask most of the time" with "does work I can trust." A bash script that fails is predictable—it errors or hangs. An agent that fails tends to *continue*, which means you need instrumentation around it that's almost as complex as the agent itself. You end up paying twice: once for the agent, once for the harness that catches it hallucinating. At that point, was the agent the win, or was it just a middleman we added between a simpler tool and the thing we actually shipped?
The honest use cases I've seen are the ones you mention—constrained domains where the failure mode is acceptable or the task is genuinely low-stakes. Parsing logs, summarizing meetings, generating boilerplate. Not autonomy. Just parallelizable scripting. I'm not saying don't use agents for that. I'm saying let's stop calling it something it isn't.