Question
The inference cost divergence might be telling us something about labor economics we're missing
I've noticed people treat training and inference cost curves like they're the same phenomenon with different slopes, but the structural drivers feel pretty different to me. Training benefits from these massive algorithmic improvements (flash attention, better optimizers, architectural changes). Each one is a one-time gain, but they stack on top of each other. You get maybe a 4x efficiency improvement and then you're hunting for the next one. But inference is something else.
Inference is basically a service delivery problem, and it's running into classic unit cost constraints. You've got a fixed model, fixed hardware utilization patterns, fixed overhead per query. The marginal improvements are incremental: better quantization schemes, slightly smarter batching, maybe some architectural tweaks. But you're not getting 4x improvements year over year because you're already pretty close to the hardware ceiling for serving that specific model. The operating cost component matters too: data center operations, cooling, and colocation costs don't fall 4x annually. They're mostly sticky.
What strikes me is that this looks a lot like the productivity slowdown we see in services vs. manufacturing. Training is like manufacturing automation where you can still find step-function improvements. Inference is like running a restaurant—you're already pretty efficient at the core operation, and the cost curve flattens. I wonder if the industry's treating this as a temporary problem when it might be structural. Have people actually modeled out what happens if inference cost improvements asymptote around 15-20% annually? That changes the whole unit economics story.
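To put rough numbers on that question, here is a back-of-the-envelope sketch in Python. It assumes a constant annual reduction rate in each scenario, which is a simplification, and the function names are just illustrative. A 4x yearly improvement is treated as a 75% annual cost reduction.

```python
import math


def relative_cost(years, annual_reduction):
    """Cost relative to today after `years` of a constant annual reduction."""
    return (1.0 - annual_reduction) ** years


def years_to_reach(target_fraction, annual_reduction):
    """Years needed for cost to fall to `target_fraction` of today's cost."""
    return math.log(target_fraction) / math.log(1.0 - annual_reduction)


# Assumed scenarios: 4x/year is a 75% reduction; 15% and 20% are the
# hypothetical floor from the post, not measured figures.
SCENARIOS = {"4x per year": 0.75, "20% per year": 0.20, "15% per year": 0.15}

if __name__ == "__main__":
    for label, rate in SCENARIOS.items():
        after_five = relative_cost(5, rate)
        to_tenth = years_to_reach(0.10, rate)
        print(f"{label}: {after_five:.4f} of today's cost after 5 years, "
              f"{to_tenth:.1f} years to get 10x cheaper")
```

Under these assumptions, a 15% annual reduction takes roughly 14 years to make inference 10x cheaper, while a 4x annual improvement gets there in under two.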
1 comment
I'd push back on the inference-as-restaurant analogy, though I get why it's tempting. In my last job we built inference infrastructure for a pretty high-volume recommendation system, and the cost curve didn't flatten the way you'd expect from a "we're hitting hardware limits" story.
What actually happened: we kept finding wins that weren't about squeezing more ops per watt, but about changing what we were computing at all. We started with full model inference on every request. Then we added a lightweight ranker that killed 80% of the expensive calls outright. Then we cached aggressively. Then we moved to mixture-of-experts where most requests only touched a fraction of parameters. None of those are "better quantization"—they're structural changes to the inference pipeline itself that look a lot like the algorithmic improvements you're describing for training.
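A rough way to see how those stages multiply, as a Python sketch. Only the 80% filter rate comes from our actual system; the ranker cost, cache hit rate, and active-parameter fraction below are made-up placeholders. It also assumes the cache is checked after the ranker and that a cache lookup is free, which won't be exactly true in practice.

```python
def expected_cost_per_request(full_cost, ranker_cost, filter_rate,
                              cache_hit_rate, active_fraction):
    """Expected cost of one request through ranker -> cache -> sparse model.

    full_cost:       cost of one dense, full-model inference call
    ranker_cost:     cost of the lightweight ranker, paid on every request
    filter_rate:     share of requests the ranker stops before the big model
    cache_hit_rate:  share of surviving requests answered from cache
    active_fraction: share of full_cost paid per call under mixture-of-experts
    """
    reaches_model = (1.0 - filter_rate) * (1.0 - cache_hit_rate)
    return ranker_cost + reaches_model * full_cost * active_fraction


if __name__ == "__main__":
    baseline = 1.0  # full model on every request, normalized
    optimized = expected_cost_per_request(
        full_cost=1.0,
        ranker_cost=0.02,     # placeholder
        filter_rate=0.80,     # the one figure taken from the comment
        cache_hit_rate=0.50,  # placeholder
        active_fraction=0.25, # placeholder
    )
    print(f"cost relative to baseline: {optimized / baseline:.3f}")
```

The point of writing it this way is that the terms multiply: none of the stages touches ops per watt, but together they shrink how often and how much of the expensive model runs.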
The constraint isn't really the hardware ceiling; it's that inference workloads are still pretty young and sloppy. Restaurants got optimized over decades. We've been doing neural inference at scale for maybe five years. I'd bet the 15-20% annual floor you're describing assumes the current inference graph stays fixed, which seems like the fragile assumption here. What flattens the curve isn't physics—it's when people stop finding clever ways to avoid computation entirely.
That said, your broader point probably holds at some scale. Eventually you do hit diminishing returns. But I'm skeptical we're seeing that yet. Might just look like it because the low-hanging fruit is getting picked faster than it used to.