Most autoscaling in Kubernetes focuses on adjusting pod behavior—adding replicas, tuning requests, reacting to metrics. But sometimes, the real bottleneck isn’t the workload. It’s the infrastructure.
The Cluster Autoscaler (CA) solves that by automatically adding or removing nodes in your cluster based on resource demand. When a pod can’t be scheduled, CA adds capacity. When nodes sit idle, it removes them. It’s essential for keeping clusters right-sized—especially when usage patterns are unpredictable.
This guide is part of our autoscaling series and focuses specifically on how CA works, when to use it, what limitations to watch for, and how to go beyond it with smarter, workload-aware optimization.
Kubernetes Autoscaling Series:
- Kubernetes Autoscaling
- Horizontal Pod Autoscaler (HPA)
- Vertical Pod Autoscaler (VPA)
- Cluster Autoscaler (CA)
How Cluster Autoscaler Works
The Cluster Autoscaler monitors unschedulable pods and decides whether to add new nodes to the cluster. It also looks for underutilized nodes and removes them when workloads can be rescheduled elsewhere.
At a high level, CA does two things:
- Scale up: If pods are stuck in Pending because no node has enough resources, CA adds a new node to fit them.
- Scale down: If a node’s workloads can be moved elsewhere, and the node has been underused for a configurable period, CA removes it to save resources.
It works by simulating scheduling decisions based on pod specs, resource requests, taints/tolerations, and affinity rules.
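For example, a pod like the following (name and sizes are illustrative) will sit in Pending if no node has 4 CPUs and 8 GiB free, which is exactly the signal CA reacts to:
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker          # illustrative name
spec:
  containers:
    - name: worker
      image: busybox
      command: ["sleep", "3600"]
      resources:
        requests:
          cpu: "4"            # if no node has 4 CPUs free, the pod stays Pending
          memory: 8Gi         # CA simulates whether a new node from a node group would fit it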
CA Logic Summary
Works with Cloud Providers
CA is deeply integrated with major Kubernetes-managed platforms:
- GKE: Autopilot and standard node pools
- EKS: Via cluster-autoscaler Helm chart
- AKS: Native support through VM scale sets
- Self-managed clusters: Works with auto-scaling groups or external APIs
Each provider uses its own implementation for how nodes are actually provisioned or terminated, but the autoscaler logic is the same.
CA runs as a deployment in the kube-system namespace and communicates directly with the cloud provider APIs to modify node group sizes.
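Assuming the default deployment name, you can confirm it is running and peek at its recent decisions with standard kubectl commands (on some managed platforms, such as GKE, the autoscaler runs on the control plane and won't show up here):
kubectl -n kube-system get deployment cluster-autoscaler
kubectl -n kube-system logs deployment/cluster-autoscaler | tail -n 20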
When to Use Cluster Autoscaler
Cluster Autoscaler is designed for managing infrastructure capacity. It’s a good fit when your workloads scale dynamically and your cluster needs to grow or shrink with demand.
You should use CA when:
- Pods regularly fail to schedule due to insufficient cluster resources
- You want to avoid paying for idle nodes during off-peak hours
- You’re running HPA or VPA and want the cluster to adapt to scaling decisions
- You have mixed workloads with variable compute or memory needs
Unlike HPA or VPA, CA doesn’t care about how pods behave—it reacts to whether they can be scheduled at all.
CA Complements HPA and VPA
HPA and VPA adjust workload behavior. CA adjusts the infrastructure to support it. They’re often used together (a minimal HPA sketch follows this list):
- HPA + CA: HPA adds pods when CPU spikes; CA adds nodes when the new pods won’t fit
- VPA + CA: VPA resizes pod resource requests; CA ensures there’s enough space to schedule them
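As a sketch of the HPA + CA pairing, an HPA like the one below (assuming a Deployment named web and the autoscaling/v2 API) adds replicas under CPU pressure, and CA adds nodes if those new replicas don’t fit anywhere:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web                    # illustrative; targets a Deployment named "web"
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # HPA adds pods past 70% CPU; CA adds nodes if they don't fit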
HPA vs VPA vs CA
Limitations and Trade-offs
Cluster Autoscaler is good at managing infrastructure, but it has its own set of trade-offs—especially when you rely on it for cost efficiency or fast response to demand.
Here’s what to watch for:
- Slow scale-up: CA doesn’t act instantly. A new node can take anywhere from under a minute to several minutes to provision and join the cluster, and that’s time your pods stay unscheduled. For bursty workloads, this delay can impact user experience unless you over-provision ahead of time.
- Scale-down disruption: When CA removes a node, it evicts all the pods on it. If those pods don’t have PodDisruptionBudgets or can’t be rescheduled cleanly, you risk service instability (a minimal PDB sketch follows this list).
- Binpacking problems: CA doesn’t optimize resource distribution across nodes. It may leave clusters fragmented, especially if pod requests aren’t tightly tuned. That leads to wasted space and unnecessary node growth.
- No awareness of workload patterns or cost: CA works at the node level. It doesn’t know if the workload is inefficient, oversized, or low-priority. And it doesn’t track how its actions affect cloud cost or node utilization.
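A minimal PodDisruptionBudget, sketched here against a hypothetical app label, limits how many replicas CA’s node drains can evict at once:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb                 # illustrative name
spec:
  minAvailable: 2               # keep at least 2 replicas running during node drains
  selector:
    matchLabels:
      app: web                  # hypothetical label; match your workload's labels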
Installing and Configuring Cluster Autoscaler
Cluster Autoscaler runs as a deployment in your cluster and interacts directly with your cloud provider’s APIs to scale node groups up or down. The exact setup depends on where you’re running Kubernetes.
On Managed Kubernetes (EKS, GKE, AKS)
Most cloud providers support CA natively or offer official deployment options:
- GKE: Node auto-provisioning and autoscaling are built-in. You configure min/max nodes per pool via the console or CLI.
- EKS: Install CA using the Helm chart or YAML deployment. Tag your node groups for auto-discovery and grant the autoscaler IAM permissions to modify your Auto Scaling groups (see the Helm sketch below).
- AKS: Works with Virtual Machine Scale Sets. Enable autoscaling and define node group boundaries.
Tip: Make sure your node groups use instance types that boot fast and match your workload shapes.
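On EKS, the install typically looks something like this; chart values vary by version, and the cluster name and region are placeholders you’d replace with your own:
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=<your-cluster-name> \
  --set awsRegion=<your-region>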
Self-Managed Clusters
You can deploy Cluster Autoscaler from the official autoscaler repo:
kubectl apply -f cluster-autoscaler-autodiscover.yaml
Key requirements:
- Nodes must be part of an auto-scaling group (e.g. AWS ASG, GCP MIG)
- Node groups must have min/max boundaries defined
- Node groups should be tagged or labeled so the autoscaler’s auto-discovery can find them (example AWS tags below)
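On AWS, for example, auto-discovery is usually driven by tags on the Auto Scaling group itself; the commonly documented pattern looks like this (the cluster name is a placeholder):
# Tags on the Auto Scaling group (not on individual nodes):
k8s.io/cluster-autoscaler/enabled = true
k8s.io/cluster-autoscaler/<your-cluster-name> = owned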
Common Configuration Flags
--balance-similar-node-groups=true
--expander=least-waste
--scale-down-enabled=true
--scale-down-unneeded-time=10m
--scale-down-delay-after-add=5m
These flags control whether and how aggressively CA removes nodes, which node group it expands when scaling up (the expander strategy), and how long it waits after a scale-up before considering scale-down.
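In practice these flags go into the autoscaler container’s command in its Deployment manifest. A trimmed sketch is below; the image tag, cloud provider, and discovery tags are placeholders to adjust for your cluster:
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0   # pick the tag matching your Kubernetes minor version
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --balance-similar-node-groups=true
      - --expander=least-waste
      - --scale-down-enabled=true
      - --scale-down-unneeded-time=10m
      - --scale-down-delay-after-add=5m
      - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<your-cluster-name>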
You can also use taints, labels, and affinity rules to influence where pods are scheduled—improving binpacking and giving CA more room to clean up underused nodes.
Best Practices for Using CA in Production
Cluster Autoscaler can save significant cloud cost, but only if you configure it to match how your workloads behave. Here’s what to keep in mind when running it in production:
Use Mixed Node Pools
Not all workloads need the same CPU, memory, or accelerators. Use multiple node groups (e.g. general-purpose, high-memory, GPU) with clear taints and labels. This gives CA more flexibility when scheduling pods and deciding which nodes to scale.
Set Safe Scale-Down Parameters
Too-aggressive scale-downs can disrupt service. Set timers like scale-down-unneeded-time and scale-down-delay-after-add conservatively, especially for stateful or slow-start workloads.
Combine with HPA or VPA
CA isn’t a replacement for pod-level autoscaling. Use HPA to control replica count and VPA to rightsize requests, so the Cluster Autoscaler can make better decisions about what fits where.
Tune for Binpacking
Poorly tuned resource requests result in low utilization. CA can’t remove nodes if pods won’t fit anywhere else. Review your requests and limits to make sure workloads can be packed efficiently—otherwise you’ll end up with nodes that can’t be emptied.
Tools built for Kubernetes cost optimization can help highlight workloads that are blocking scale-down due to oversized requests or poor distribution.
Monitor Scheduling Gaps
Track how long pods remain unscheduled. If you’re seeing frequent delays, it may mean CA isn’t scaling fast enough—or that requests are too large to fit in available pools.
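A quick way to spot that gap is to list pods stuck in Pending and, if the status ConfigMap is enabled, check the autoscaler’s own status report:
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
kubectl -n kube-system describe configmap cluster-autoscaler-status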
For full visibility, use Kubernetes cost monitoring to correlate scheduling, resource waste, and cost over time.
Advanced Use Cases and Strategies
Once you have the basics of Cluster Autoscaler working, there are more advanced ways to reduce cost, improve scheduling, and better match infrastructure to workload behavior.
Spot Node Groups with Fallback
For non-critical or fault-tolerant workloads, run a dedicated node group using Spot or preemptible instances. Use nodeSelector, taints, or affinity to isolate workloads. Combine with a fallback group (e.g. on-demand) in case Spot capacity dries up.
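One common way to pin fault-tolerant pods to a Spot group is a nodeSelector plus a matching toleration in the pod spec; the label and taint keys below are illustrative and depend on how you’ve labeled and tainted that node group:
spec:
  nodeSelector:
    node-lifecycle: spot            # hypothetical label applied to the Spot node group
  tolerations:
    - key: spot                     # hypothetical taint on the Spot nodes
      operator: Exists
      effect: NoSchedule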
GPU-Aware Scaling
If you’re running ML workloads or anything with node taints (like GPU or DPU nodes), make sure:
- Pods specify tolerations
- CA is allowed to scale the corresponding tainted node pool
This lets CA bring up GPU nodes only when needed, instead of keeping expensive hardware running idle.
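A GPU pod would then request the GPU resource and tolerate the pool’s taint; the taint key shown is a common NVIDIA convention, so match whatever your node pool actually uses:
spec:
  containers:
    - name: trainer                  # illustrative name
      image: <your-training-image>
      resources:
        limits:
          nvidia.com/gpu: 1          # requests the extended GPU resource
  tolerations:
    - key: nvidia.com/gpu            # common taint key on GPU node pools; adjust to yours
      operator: Exists
      effect: NoSchedule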
Cold Start vs Over-Provisioning
If you need fast startup during peak hours, but can’t afford latency from cold nodes, consider:
- Running a small pool of buffer nodes (partially filled but ready)
- Tuning scale-down-unneeded-time to keep key nodes alive longer
- Combining with HPA to predict bursts
This gives you a balance between cost and responsiveness.
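A common way to keep that buffer is a low-priority “placeholder” deployment: CA keeps capacity for it, but the scheduler evicts it the moment real pods need room. Everything below (names, sizes, replica count) is illustrative:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning            # illustrative name
value: -10                          # lower than any real workload, so these pods are evicted first
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-buffer             # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: capacity-buffer
  template:
    metadata:
      labels:
        app: capacity-buffer
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9     # does nothing; just reserves capacity
          resources:
            requests:
              cpu: "1"
              memory: 2Gi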
Reduce Waste with Label-Aware Scaling
Use custom labels and constraints (like topology spread or affinity) to influence placement. If CA knows how workloads should be spread, it can make better binpacking and scale-down decisions.
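For example, a soft topology spread constraint like this one (label and values are illustrative) tells the scheduler, and therefore CA’s scheduling simulation, how replicas should be distributed across nodes:
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname      # spread across individual nodes
      whenUnsatisfiable: ScheduleAnyway        # prefer spreading, but don't block scheduling
      labelSelector:
        matchLabels:
          app: web                             # hypothetical label; match your workload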
Operational Constraints of Cluster Autoscaler
Cluster Autoscaler solves a critical problem in Kubernetes: adding and removing infrastructure as workload demand changes. It works well for scaling node groups based on pending pods or idle capacity. But it has blind spots.
It doesn’t know if your workloads are oversized or inefficient. It can’t improve binpacking unless pods are perfectly sized. And it has no awareness of cost, workload priority, or whether scaling is even necessary—only that something doesn’t fit.
These gaps make CA reactive, not optimized.
DevZero helps fill those gaps.
Where CA reacts to scheduling pressure, DevZero proactively tunes resource requests in real time—on running pods, without restarts. It reshapes workload placement to free up space and improve binpacking. And it tracks how resource decisions impact actual cloud spend, giving you visibility into the cost of scaling decisions.
While CA manages node count, DevZero helps make sure you’re using those nodes efficiently. Learn more →