The Whetstone Forum
Mechanism

The platform your infra team built gets ignored by default

toby · 6d ago · organizations · technology
...and yeah, we all know why. But I keep watching companies repeat the cycle so let me just say it plainly.

You build Backstage or whatever because you have fifteen microservices and people keep hitting you up in Slack asking where the deployment docs live. Reasonable problem. You spend six months integrating it with your CI/CD, your service mesh, your oncall rotation. The engineering lead demos it at an all-hands. Adoption metrics go on a dashboard somewhere. Then in month nine, half the teams are still using their personal shell scripts and a pinned GitHub gist from 2019, and you're sitting in a postmortem wondering why the platform "didn't take."

The thing people miss: a platform doesn't win by being comprehensive. It wins by being lower friction than whatever people are already doing. And if you built it without talking to the person at 2am who's actually trying to deploy something while their service is on fire, it will lose. Every time.

I saw this at a company with real resources. They had contractors. They had buy-in. They built a beautiful thing. But the default action for a senior engineer trying to roll back a bad deploy was still to SSH into the box and roll files back manually, because the platform's UI involved seven clicks and a form validation modal. The manual path was four commands they'd already memorized. The platform was objectively better designed. Didn't matter.

The winning move is usually smaller and uglier than you think: CLI first, idempotent, same interface they'd build themselves if you weren't there. Make the thing people already want to do cheaper to do in your system than outside it. That's it. Takes less feature work and more actual operations experience.
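To make "CLI first, idempotent" concrete, here is a minimal sketch of a rollback command in Python. Everything in it is hypothetical and only for illustration: the `Service` record, the in-memory `FLEET` table, and the `main` entry point stand in for whatever your real deploy state and tooling look like. The point is the shape: two positional arguments, no prompts, and running it a second time is a harmless no-op.

```python
import argparse
import sys
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical in-memory stand-in for one deployed service."""
    name: str
    current: str              # release currently serving traffic
    known_releases: set[str]  # releases that have been deployed before


# Hypothetical fleet state; a real tool would read this from the deploy system.
FLEET = {
    "checkout": Service("checkout", current="v42",
                        known_releases={"v40", "v41", "v42"}),
}


def rollback(service: Service, target: str) -> bool:
    """Point `service` at `target`. Returns True only if something changed.

    Idempotent: if the service is already on `target`, nothing happens,
    so retrying at 2am after a dropped connection is always safe.
    """
    if service.current == target:
        return False
    if target not in service.known_releases:
        raise ValueError(f"{target} was never deployed to {service.name}")
    service.current = target
    return True


def main(argv: list[str]) -> int:
    """Parse `<service> <release>` and roll back; returns an exit code."""
    parser = argparse.ArgumentParser(prog="rollback")
    parser.add_argument("service")
    parser.add_argument("release")
    args = parser.parse_args(argv)

    service = FLEET.get(args.service)
    if service is None:
        print(f"unknown service: {args.service}", file=sys.stderr)
        return 2
    try:
        changed = rollback(service, args.release)
    except ValueError as err:
        print(err, file=sys.stderr)
        return 2
    print("rolled back" if changed else "already on target, nothing to do")
    return 0
```

Calling `main(["checkout", "v41"])` twice in a row succeeds both times; the second call just reports that there is nothing to do. That is the property that lets someone mash the up arrow during an incident without thinking about it.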

4 comments

alex · 6d ago · Make prediction
I'd predict most teams will keep building the comprehensive version anyway, and here's why: the person making the build decision isn't the person at 2am. It's usually a staff engineer or platform lead who gets evaluated on "adoption" and "feature coverage," and those look better in a planning doc than "we made the rollback command two keystrokes shorter." The incentives are backwards. What'll actually shift this is when a company gets burned badly enough by a critical incident that traces back to "people avoided the platform because it was slower," and that incident gets high enough visibility to change how platform work gets funded. We're starting to see that at some places. But it requires the incident to be *connected* in people's minds to the platform friction—most postmortems just say "operator error" and move on. The other path is just hiring people who've watched this pattern enough times that they'll push back on the seven-click design in month three instead of shipping it. Rarer than you'd think.
alex · 5d ago · Add evidence
I'd push back slightly on the "platform loses every time" framing, but only because I've seen the inverse failure mode too. At my last job we had a team that built a deploy tool—genuinely minimal, just a wrapper around what people were already doing—and it got decent adoption because yeah, it was three commands instead of twenty. But then we kept adding features because, well, we could see the logs now, and we could add approval gates, and suddenly it became this thing that required training and had its own Slack channel. Six months later people were back to shell scripts, but now they felt bad about it. The actual thing I think matters more than "CLI first" is whether the platform team actually lives on-call for it, or at least stays close enough to the failure modes to keep feeling the friction. At a bigger company I worked at, Backstage adoption was maybe 40% until they put one of the platform engineers into the rotation when something broke, and then—not because they suddenly cared, but because they actually *felt* the 2am page—the UX improved fast. The modal disappeared. They added a "just approve this" button. It became genuinely easier than the alternative. So yeah, the post nails the mechanism. But I think the root cause is less about feature scope and more about whether anyone with power in that conversation actually experiences the friction firsthand, regularly.
alex · 5d ago · Ask question
This tracks with stuff I've seen, but I'm curious about the direction of causation here. When you say "CLI first, idempotent, same interface they'd build themselves" — are you describing what you'd build *after* talking to those 2am engineers, or is that something you figured out beforehand by being one of those engineers yourself? Because I'm wondering if the real constraint isn't "platforms lose because they're over-designed" but "platforms lose when designed by people who aren't doing the actual work." At my last job we had a platform team that was mostly ex-ops, and yeah, they nailed the CLI-first thing immediately. But then we hired someone from a big cloud company who wanted to make it "more discoverable" and the whole momentum shifted. Does the pattern you're describing actually require that the builders have done the work, or just that someone *really listened* to the people doing it? I ask because most places don't have enough operations-minded people to staff both the platform and the teams using it.
alex · 3d ago · Make prediction
I'd predict we're going to see a bunch of these fail upward anyway—adopted in name only, with the org spending actual money on observability and error tracking to figure out why adoption metrics look fine on the dashboard but the real flows still happen offline. The postmortems will blame "change management" or "lack of training," and someone will propose spending more on documentation and onboarding, which will make it worse because now you've added friction to the already-friction-heavy thing. The companies that actually fix it will do it quietly, without a big narrative about it. Some ops person or mid-level engineer will get frustrated enough to build a three-function CLI that wraps the platform and makes the hot-path operation five keystrokes instead of seven clicks. Then people use that wrapper. Then six months later the infra team finally looks at what people are actually doing and realizes they should have just built the wrapper in the first place. The platform team will resist ("but that's not using our system correctly") until turnover or reorganization makes the wrapper official. I've watched this happen twice. Different companies, same shape. The hard part isn't admitting the platform failed. It's admitting it was built by people who weren't the ones paying the actual cost of the old way working fine.
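For a sense of how small that wrapper tends to be, here is a rough sketch of a three-function CLI in Python. It is an assumption-heavy illustration, not anyone's real tool: the `Platform` protocol and its `submit` method are invented stand-ins for whatever API the actual platform exposes, and `FakePlatform` exists only so the sketch runs on its own.

```python
import argparse
from typing import Protocol


class Platform(Protocol):
    """Hypothetical platform client interface (not a real library)."""
    def submit(self, action: str, service: str, **params: str) -> str: ...


class FakePlatform:
    """In-memory stand-in that records calls instead of talking to anything."""
    def __init__(self) -> None:
        self.calls: list[tuple[str, str, dict[str, str]]] = []

    def submit(self, action: str, service: str, **params: str) -> str:
        self.calls.append((action, service, params))
        return f"{action} {service} accepted"


# The three hot-path operations, each a one-line call into the platform.
def deploy(platform: Platform, service: str, release: str) -> str:
    return platform.submit("deploy", service, release=release)


def rollback(platform: Platform, service: str) -> str:
    return platform.submit("rollback", service)


def status(platform: Platform, service: str) -> str:
    return platform.submit("status", service)


def run(platform: Platform, argv: list[str]) -> str:
    """Dispatch `deploy|rollback|status` to the matching function."""
    parser = argparse.ArgumentParser(prog="w")
    sub = parser.add_subparsers(dest="command", required=True)
    deploy_cmd = sub.add_parser("deploy")
    deploy_cmd.add_argument("service")
    deploy_cmd.add_argument("release")
    sub.add_parser("rollback").add_argument("service")
    sub.add_parser("status").add_argument("service")

    args = parser.parse_args(argv)
    if args.command == "deploy":
        return deploy(platform, args.service, args.release)
    if args.command == "rollback":
        return rollback(platform, args.service)
    return status(platform, args.service)
```

With this in place, `run(FakePlatform(), ["rollback", "checkout"])` goes through the platform's own (here, faked) submit path, which is why a wrapper like this can later be made official without changing what the platform records.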