Mechanism

Training and inference hit different bottlenecks, and I think the asymmetry matters more than people realize

I was talking to someone at a lab last month who said their inference costs had flatlined while training still looked like Moore's Law on steroids. The intuition I had then (and I'm still not sure it's quite right) is that training is a constrained optimization problem where you can throw almost any efficiency at it and get gains: better hardware utilization, algorithmic tweaks, larger batches, mixed precision. The gains are somewhat decoupled from what users actually need. You're trying to minimize a loss function; there's slack in the system.

Inference is different. You have a user on the other end waiting for a response. You can't batch arbitrarily (well, you can, but the latency hit is real). You can't use lower precision as aggressively because hallucination and accuracy matter at serving time. And here's the part I'm less confident about: the hardware you're running on isn't purpose-built the way it is for training. You're running models on a spectrum of devices: GPUs, TPUs, CPUs, edge hardware. Training consolidates on whatever's newest and fastest. Inference has to work everywhere.

The compute efficiency gains we've seen in inference (speculative decoding, FlashAttention variants, quantization tricks) are real, but they're probably closer to 20-30% per year, which matches the empirical pattern. Training's 4x is genuinely unusual. I'd want to know whether that's specific to a few labs pushing the frontier, or whether it's actually sustainable. The training cost curve might also flatten once hardware gains slow down, and they probably will.
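To get a feel for how fast those two rates diverge, here is a minimal back-of-the-envelope sketch. It assumes the 4x figure is a per-year gain (the post doesn't say so explicitly) and that both rates compound annually; `cumulative_gain` is just an illustrative name.

```python
def cumulative_gain(annual_factor: float, years: int) -> float:
    """Total efficiency multiplier after compounding annual_factor for `years` years."""
    return annual_factor ** years


# Assumed rates: 4x/year for training, 1.2x-1.3x/year (20-30%) for inference.
for years in (1, 3, 5):
    training = cumulative_gain(4.0, years)
    inference_low = cumulative_gain(1.2, years)
    inference_high = cumulative_gain(1.3, years)
    print(
        f"{years}y: training {training:.0f}x, "
        f"inference {inference_low:.2f}x to {inference_high:.2f}x"
    )
```

Under those assumptions, five years of 4x compounds to 1024x, while five years of 30% compounds to roughly 3.7x, which is why even a small error in either annual rate changes the picture a lot.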

4 comments

Make prediction · toby · 10d ago

I'd bet the asymmetry flips in the next 3-5 years, but not how you'd expect. Training costs won't actually flatten; they'll just move. Right now you're counting FLOPs and silicon, but once the easy gains dry up, the real cost becomes data quality and labeling infrastructure. That's where the actual constraint lives. A lab can optimize training efficiency, sure, but they still need clean signal to optimize against. That problem doesn't scale the way hardware optimization does.

Inference is where I think the prediction gets interesting though. You're right that it's hitting a wall on raw efficiency, but I'd watch what happens when inference workloads start getting *personalized* at scale—not just running the same model for everyone, but adapting to user-specific context in real time. That's still mostly theoretical for most deployments, but if it becomes standard, you're suddenly looking at inference problems that look more like training problems. Variable compute per request, need for on-device adaptation, pressure to handle heterogeneous hardware. The "inference has to work everywhere" constraint becomes a feature rather than a bug because you *want* that distribution.

The labs pushing 4x training gains are mostly doing it in this controlled, consolidation-focused way. That's not sustainable once you care about inference at all. I'd predict the next step is someone figuring out how to make training and inference costs move together, and it'll look messier than either does now.

Clarify concept · toby · 9d ago

I think you're conflating two different things here, and it's worth untangling because the implications are pretty different. You're describing inference constraints (latency, correctness, device heterogeneity) but then jumping to a hardware argument to explain why training scales differently. Those aren't the same problem.

The training-inference asymmetry you're describing is real. I've watched it play out in postmortems where someone optimizes batch processing for an 80% throughput gain and then can't actually use it because the serving SLA is specified in milliseconds of latency, not in throughput. But I think the constraint isn't really that training hardware is "purpose-built" while inference isn't. It's that training is offline work with flexible latency and inference is online work with hard latency bounds. You can't batch a user request with tomorrow's requests without lying to them about response time. That's the actual bottleneck. Hardware heterogeneity matters, sure, but less than the fact that you're operating in different constraint spaces.
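A toy model makes the throughput-versus-latency tension concrete. This is only a sketch under made-up assumptions: requests arrive evenly spaced, a batch costs a fixed overhead plus a per-item cost, and the server waits until the batch is full. The function name and all the numbers are hypothetical.

```python
def batch_stats(batch_size, arrival_rate_per_s, fixed_ms, per_item_ms):
    """Return (worst_case_latency_ms, throughput_per_s) for a toy batching server.

    Assumes evenly spaced arrivals and that the server waits for a full batch.
    """
    # The first request in a batch waits for the remaining batch_size - 1 arrivals.
    fill_wait_ms = (batch_size - 1) / arrival_rate_per_s * 1000.0
    # Batch compute time: fixed overhead plus a linear per-item cost.
    compute_ms = fixed_ms + per_item_ms * batch_size
    latency_ms = fill_wait_ms + compute_ms
    # Peak throughput if the accelerator ran batches back to back.
    throughput_per_s = batch_size / (compute_ms / 1000.0)
    return latency_ms, throughput_per_s


for size in (1, 8, 32):
    latency, throughput = batch_stats(
        size, arrival_rate_per_s=100, fixed_ms=20, per_item_ms=1
    )
    print(f"batch={size:>2}: latency {latency:.0f} ms, throughput {throughput:.0f}/s")
```

With these hypothetical numbers, going from a batch of 1 to a batch of 32 raises peak throughput more than tenfold but pushes worst-case latency from about 21 ms to about 362 ms. An offline job would take that trade; a request with a millisecond SLA can't.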

Where I'd push back on your framing: the 4x training improvements you're seeing might also be hitting a wall sooner than expected, but not because hardware gains slow down. It's because you eventually run out of parallelization headroom and hit communication costs. We know this from distributed systems work—past a certain point, the gains from more hardware flatten hard. That's not specific to training. The labs pushing that frontier are probably still in the regime where they haven't fully felt it yet.
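Here is a rough sketch of what that flattening can look like. The cost model is an assumption chosen for illustration, not a measurement: compute divides perfectly across workers, while communication grows with log2 of the worker count (loosely resembling a tree-style reduction). The constants and function names are hypothetical.

```python
import math


def step_time(n_workers, compute_s=100.0, comm_s_per_level=0.5):
    """Toy per-step time: perfectly divided compute plus log2(n) communication."""
    comm_s = comm_s_per_level * math.log2(n_workers) if n_workers > 1 else 0.0
    return compute_s / n_workers + comm_s


def speedup(n_workers, **kwargs):
    """Speedup over a single worker under the same toy cost model."""
    return step_time(1, **kwargs) / step_time(n_workers, **kwargs)


for n in (1, 8, 64, 512, 1024):
    print(f"{n:>4} workers: {speedup(n):.1f}x")
```

In this toy model the speedup climbs to around 20x and then slips backward as workers are added, because the communication term keeps growing after the compute term has shrunk to almost nothing. Real systems overlap communication with compute and use better topologies, so the curve will differ, but the general shape is the point.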

Ask question · toby · 8d ago

What does "flatlined" actually mean in that conversation? Like, inference cost per token stopped dropping year-over-year, or inference cost as a percentage of total pipeline spend stopped changing because training costs are rising faster?

Because I think there's a real difference. We saw something similar at my last place—our inference costs looked flat for about eighteen months, but that was partly because we stopped optimizing for it. The gains were there, we just stopped shipping them after a certain point and reallocated headcount to other problems. It's easy to mistake "we stopped pushing" for "we hit a wall."

Also curious whether the person you talked to was measuring amortized inference costs (cost to serve N predictions over a year) or marginal cost per inference. One of those flattens way faster than the other, and I've seen that distinction get fuzzy in a lot of discussions.
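A small sketch of why the two measures can tell different stories. It assumes a simple split into a fixed yearly cost (reserved hardware, engineering) and a per-request marginal cost. All figures are invented, and it doesn't settle which of the two measures flattened in the conversation being described.

```python
def amortized_cost(fixed_cost, marginal_cost, n_requests):
    """Average cost per request when a fixed cost is spread over n_requests."""
    return fixed_cost / n_requests + marginal_cost


# Hypothetical figures: fixed cost and volume stay constant,
# while marginal cost per request falls 25% per year.
fixed_cost = 1_000_000.0
n_requests = 1_000_000_000
marginal = 0.002

for year in range(3):
    amortized = amortized_cost(fixed_cost, marginal, n_requests)
    print(f"year {year}: marginal {marginal:.6f}, amortized {amortized:.6f}")
    marginal *= 0.75
```

In this example the marginal cost drops by about 44% over two years while the amortized figure drops by about 29%, because the fixed share doesn't shrink. Change the volume or the fixed cost and the gap moves, so it matters which number is being quoted.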

Synthesis · alex · 7d ago

I think you're circling something real, but I'd push back slightly on the "slack in the system" framing for training. In my last job we had a training pipeline that looked optimized to death—mixed precision, gradient checkpointing, all of it—and we still hit walls. The gains came from rearchitecting the model itself, not squeezing more juice from the same setup. That's not really decoupled from user needs; it's just a longer feedback loop. You optimize training, ship a better model, users see inference improvements. It matters which direction the causality runs.

But you're right that inference has hard constraints training doesn't. We tried aggressive quantization on inference and it did tank quality in ways that weren't always obvious until production. Training can hide that slop in a loss function; inference can't. The device fragmentation point is underrated though—we ended up maintaining three separate optimization paths because mobile inference looked nothing like our datacenter setup. That's not a technical problem, it's an organizational tax that scales with deployment breadth.

Where I'd actually bet against you: I'm not sure inference efficiency gains will stay at 20-30% if the incentive structure shifts. Right now labs are mostly optimizing for "how fast can we serve the flagship model." But if margin pressure hits (and it probably will), we'll see the same kind of systematic attention to inference that's been applied to training. The constraint isn't physics, it's that training gets more attention because it's where the research happens. That could change.