AI Didn't Break Your Kubernetes Economics. It Just Made the Damage Visible.

Rob Fletcher
Co-Founder

Part 1 of 2: The AI Spending Problem Is a Kubernetes Problem in Disguise
The numbers from AI infrastructure spending are hard to ignore. The FinOps Foundation's State of FinOps 2026 report found that 98% of organizations now manage AI spend, up from just 31% two years ago. Alongside that explosion in AI usage comes a mandate most infrastructure and platform teams are already hearing from leadership: find the savings to fund it. But before organizations can answer that mandate, they need to understand what is actually driving the cost. And the answer, more often than not, lies in the same broken economic system that has quietly compounded cloud costs for years.
AI didn't create a new cost problem in your Kubernetes clusters. It amplified an existing one and raised the stakes so high that you can no longer afford to ignore it.
The Same Broken System, Higher Unit Costs
In a previous post, we argued that Kubernetes is fundamentally an economic system: it governs the allocation of scarce resources across multiple actors with misaligned incentives, without price signals. Engineers overprovision because the career cost of an outage vastly outweighs the invisible, delayed cost of wasted infrastructure. The tragedy of the commons plays out at the cluster level, where individual decisions produce collective inefficiency.
Nothing about that system changed when AI workloads arrived. What changed is the cost per unit of waste.
A backend engineer who overprovisions a CPU-bound service by 2x might waste $500 a month. An ML engineer who overprovisions GPU resources by the same factor wastes $20,000 a month. The logic is identical. The damage is not. According to DevZero's own research, the average GPU-enabled Kubernetes cluster operates at 15-25% utilization. For a 50-GPU cluster running on NVIDIA H100 instances at $30 to $50 per hour, that underutilization translates to $500,000-$600,000 a year on a single cluster, spent on capacity that is reserved but rarely active.
This is the same overprovisioning logic that produces idle CPU headroom across your application services. The economic mechanism is unchanged. The bill is just much larger.
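The arithmetic behind figures like these is simple enough to sketch. A minimal Python example; every input (GPU count, hourly rate, utilization) is an illustrative assumption, not a figure from any specific cluster:

```python
def annual_idle_gpu_cost(gpu_count: int, hourly_rate: float, utilization: float) -> float:
    """Annual cost of GPU capacity that is reserved but idle.

    hourly_rate is cost per GPU-hour; utilization is the fraction of
    reserved GPU-hours doing useful work (0.0 to 1.0).
    """
    hours_per_year = 24 * 365
    total_cost = gpu_count * hourly_rate * hours_per_year
    return total_cost * (1.0 - utilization)

# Illustrative only: 8 GPUs reserved year-round at $40/GPU-hour, 20% utilized.
waste = annual_idle_gpu_cost(gpu_count=8, hourly_rate=40.0, utilization=0.20)
print(f"${waste:,.0f} per year in idle capacity")
```

The point of writing it down is how few inputs it takes: once a reservation is always-on, the idle fraction multiplies directly into the annual bill.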
Why AI Workloads Are Structurally Worse
CPU overprovisioning is bad. GPU overprovisioning is structurally worse for a specific reason: AI workloads are inherently intermittent, yet they are typically provisioned as if they were persistent.
Consider how these workloads actually behave in practice:
- Model training jobs run for hours or days, consume significant GPU memory and compute during execution, and then complete, leaving GPUs allocated but idle until the next run. DevZero's guide to GPU utilization documents the pattern clearly: training clusters see the highest waste before and after runs, when GPUs sit reserved "just in case" the next job starts soon.
- AI inference endpoints maintain warm pools of replicas to handle traffic spikes, but DevZero's analysis shows teams regularly over-replicate by provisioning for peak demand that rarely arrives, with replicas idling 90% of the time.
- Interactive notebooks and research environments are often left running after work ends. A data scientist might reserve an H100 instance for a week-long research project but only use the GPU for 10-15% of that time.
Each of these patterns reflects the same incentive structure from our earlier economic analysis. An ML team that reserves 8 GPUs for training jobs that run twice a week isn't being reckless. They are making a rational decision inside a system that offers no price signal for holding idle capacity. From their perspective: if they don't reserve the GPUs, someone else will, and then they'll be blocked. The cost of that reservation is paid centrally and arrives weeks later in a finance report. No one is accountable. The reservation persists.
This is the tragedy of the commons, now playing out with the most expensive resources in your cluster.
The Configuration Doesn't Change Either
One of the subtler reasons GPU waste compounds faster than CPU waste is that the static manifest problem is worse for ML workloads. When a backend engineer sets CPU requests for a service and it ships to production, there is at least some chance that a routine reliability review or performance investigation will revisit those values. The workload runs continuously, the team owns it, and there are natural forcing functions to revisit the configuration.
GPU workloads don't follow that pattern. A training job that runs on a schedule has a configuration set at deployment time, often during a period of maximum uncertainty about actual resource needs. Once it runs successfully without OOMKilling, the configuration is considered correct and rarely touched again. The fact that the job might complete in four hours and leave GPUs idle for the remaining twenty hours of the day isn't captured in a post-mortem. There is no incident. There is just a very expensive gap between allocation and usage that appears aggregated in a finance report weeks later.
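That gap between allocation and usage is easy to quantify once you frame it as idle GPU-hours. A hedged sketch, using the hypothetical shape described above (a job that runs a few hours but holds its GPUs all day); the rate is an assumed placeholder:

```python
def allocation_gap_cost(allocated_hours: float, active_hours: float,
                        gpus: int, rate_per_gpu_hour: float) -> float:
    """Cost of GPU-hours a scheduled job reserved but never used."""
    idle_hours = allocated_hours - active_hours
    return idle_hours * gpus * rate_per_gpu_hour

# A job that completes in 4 hours but holds 8 GPUs for the full 24-hour day,
# at an assumed $40 per GPU-hour: 20 idle hours * 8 GPUs * $40.
daily_waste = allocation_gap_cost(allocated_hours=24, active_hours=4,
                                  gpus=8, rate_per_gpu_hour=40.0)
print(f"${daily_waste:,.0f} per day")
```

Because there is no incident and no alert, this number never surfaces anywhere except, in aggregate, on the invoice.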
Inference deployments have their own version of this problem. Engineers set replica counts to handle peak traffic and warm-pool requirements. Those replicas keep running. As the DevZero guide to fixing GPU utilization documents, many inference deployments show replicas idling 90% of the time while being provisioned for the rare spike. The configuration that made sense at launch is never revisited because the service is healthy and the team has moved on.
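The same accounting applies to warm inference pools. The sketch below compares replicas sized for peak against what a traffic trace actually required; the trace, per-replica capacity, and pool size are all made-up illustrative numbers:

```python
import math

def replicas_needed(qps: float, qps_per_replica: float) -> int:
    """Replicas required to serve a given load, rounded up (minimum 1)."""
    return max(1, math.ceil(qps / qps_per_replica))

def idle_replica_fraction(traffic_qps: list[float], provisioned: int,
                          qps_per_replica: float) -> float:
    """Average fraction of provisioned replicas sitting idle across a trace."""
    idle = [provisioned - replicas_needed(q, qps_per_replica) for q in traffic_qps]
    return sum(idle) / (len(traffic_qps) * provisioned)

# Hourly trace for one day: mostly quiet, one short spike.
trace = [5.0] * 22 + [95.0, 40.0]
pool = replicas_needed(max(trace), qps_per_replica=10.0)  # sized for the peak
print(idle_replica_fraction(trace, pool, qps_per_replica=10.0))
```

With a pool sized for one spike an hour long, most replica-hours in this toy trace go unused, which is the over-replication pattern the text describes.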
This is the same "set and forget" provisioning pattern familiar from CPU workloads, now operating at a 10 to 50 times higher cost per resource unit. The economic system has not offered any new mechanism to prevent it. It has simply made the consequences more visible once AI spend reaches a material line item on the cloud bill.
You Can't See It Clearly Either
One reason GPU waste accumulates faster than CPU waste is that the visibility tools most teams rely on weren't built for GPUs. Standard Kubernetes metrics capture CPU and memory utilization reasonably well. GPU utilization is harder. As DevZero's GPU measurement guide explains, tools like nvidia-smi provide point-in-time snapshots that miss temporal patterns entirely. High memory utilization doesn't mean high compute utilization. A model can be loaded into GPU memory at 90% while the GPU itself processes zero requests. This distinction matters enormously for understanding actual waste.
The State of FinOps 2026 report validated this gap at the industry level. "Granular shared cost and container allocation" was explicitly called out as one of the top missing capabilities in FinOps tooling today. Practitioners noted that "dashboards are the table stakes of yesterday — reactive. You have to move to proactive, real-time automation." That observation applies directly to GPU infrastructure, where reactive monitoring consistently underestimates the true extent of idle capacity.
Without workload-level visibility across training jobs, inference endpoints, and research environments, most organizations are flying partially blind on their most expensive infrastructure.
Why the "Big Rocks" Are Gone
The State of FinOps 2026 report made another observation worth sitting with: practitioners report diminishing returns from traditional cloud optimization. As one respondent put it, "We have hit the 'big rocks' of waste and now face a high volume of smaller opportunities that require more effort to capture."
For CPU and memory optimization, this is largely true. Reserved instances have been purchased. Obvious idle resources have been cleaned up. What remains is the harder, more granular layer: workload-level inefficiency within clusters, where static infrastructure-as-code manifests don't reflect actual usage patterns, and where the incentive to fix it is weak because the cost remains invisible to the teams creating it.
For GPU optimization, many organizations haven't yet hit the big rocks. The visibility tools are newer, the optimization playbooks are less mature, and the pace of AI workload adoption has outrun the governance structures around it. The FinOps Foundation specifically noted that AI workloads have "less transparent or more variable pricing" than traditional cloud services and are harder to allocate to business units. This is not just an accounting problem. It is an economic system problem. AI workloads have inherited all of the broken incentives of CPU workload management, applied to infrastructure that costs 10 to 50 times more per unit.
The Practical Implication
Understanding AI cost waste as an economic systems problem rather than a tooling or configuration problem changes how you approach it. A few things follow directly from this framing:
- Visibility is necessary but not sufficient. Dashboards that show GPU waste are useful, but they don't change behavior unless the economic incentives do. Teams measured on reliability will acknowledge the waste and move on.
- Workload-level granularity matters more for GPUs than for CPUs. Node-level autoscalers like Karpenter scale down empty nodes, but a node hosting a single small workload that occupies a full GPU won't scale down. The waste lives at the workload level, not the node level.
- Automation is the only path to capturing these savings at scale. The State of FinOps 2026 found that FinOps teams managing $100M or more in cloud spend average just 8-10 practitioners. Manually tracking GPU allocation patterns across training jobs, inference endpoints, and research environments is not a realistic option for a lean team.
- The intermittent nature of AI workloads means the savings opportunity is large and recurring. Unlike a one-time rightsizing exercise on a CPU manifest, GPU waste regenerates continuously as new jobs run and complete, new models get deployed, and new notebooks get opened and abandoned.
The economic argument for fixing this is clearer than ever, and the industry data from the State of FinOps 2026 makes the context explicit: organizations are being asked to self-fund AI investments through optimization savings. That means the waste in your GPU cluster is not just an infrastructure cost problem. It is the budget your AI roadmap is competing against.
In Part 2 of this series, we look at the other side of that equation: what it actually means, operationally and strategically, to treat Kubernetes efficiency as a source of AI funding, and how organizations are building the systems to make that transfer happen at scale.
