Most autoscaling in Kubernetes focuses on adjusting pod behavior—adding replicas, tuning requests, reacting to metrics. But sometimes, the real bottleneck isn’t the workload. It’s the infrastructure.
The Cluster Autoscaler (CA) solves that by automatically adding or removing nodes in your cluster based on resource demand. When a pod can’t be scheduled, CA adds capacity. When nodes sit idle, it removes them. It’s essential for keeping clusters right-sized—especially when usage patterns are unpredictable.
This guide is part of our autoscaling series and focuses specifically on how CA works, when to use it, what limitations to watch for, and how to go beyond it with smarter, workload-aware optimization.
Kubernetes Autoscaling Series:
- Kubernetes Autoscaling
- Horizontal Pod Autoscaler (HPA)
- Vertical Pod Autoscaler (VPA)
- Cluster Autoscaler (CA)
How Cluster Autoscaler Works
The Cluster Autoscaler monitors unschedulable pods and decides whether to add new nodes to the cluster. It also looks for underutilized nodes and removes them when workloads can be rescheduled elsewhere.
At a high level, CA does two things:
- Scale up: If pods are stuck in Pending because no node has enough resources, CA adds a new node to fit them.
- Scale down: If a node’s workloads can be moved elsewhere, and the node has been underused for a configurable period, CA removes it to save resources.
It works by simulating scheduling decisions based on pod specs, resource requests, taints/tolerations, and affinity rules.
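For example, a pod like the following (name and sizes are illustrative) will sit in Pending if no node has 4 CPUs and 8 GiB free, which is exactly the signal CA reacts to:
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker          # illustrative name
spec:
  containers:
    - name: worker
      image: busybox
      command: ["sleep", "3600"]
      resources:
        requests:
          cpu: "4"            # if no node has 4 CPUs free, the pod stays Pending
          memory: 8Gi         # CA simulates whether a new node from a node group would fit it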
CA Logic Summary
Works with Cloud Providers
CA is deeply integrated with major Kubernetes-managed platforms:
- GKE: Autopilot and standard node pools
- EKS: Via cluster-autoscaler Helm chart
- AKS: Native support through VM scale sets
- Self-managed clusters: Works with auto-scaling groups or external APIs
Each provider uses its own implementation for how nodes are actually provisioned or terminated, but the autoscaler logic is the same.
CA runs as a deployment in the kube-system namespace and communicates directly with the cloud provider APIs to modify node group sizes.
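Assuming the default deployment name, you can confirm it is running and peek at its recent decisions with standard kubectl commands (on some managed platforms, such as GKE, the autoscaler runs on the control plane and won't show up here):
kubectl -n kube-system get deployment cluster-autoscaler
kubectl -n kube-system logs deployment/cluster-autoscaler | tail -n 20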
When to Use Cluster Autoscaler
Cluster Autoscaler is designed for managing infrastructure capacity. It’s a good fit when your workloads scale dynamically and your cluster needs to grow or shrink with demand.
You should use CA when:
- Pods regularly fail to schedule due to insufficient cluster resources
- You want to avoid paying for idle nodes during off-peak hours
- You’re running HPA or VPA and want the cluster to adapt to scaling decisions
- You have mixed workloads with variable compute or memory needs
Unlike HPA or VPA, CA doesn’t care about how pods behave—it reacts to whether they can be scheduled at all.
CA Complements HPA and VPA
HPA and VPA adjust workload behavior. CA adjusts the infrastructure to support it. They’re often used together (a minimal HPA sketch follows this list):
- HPA + CA: HPA adds pods when CPU spikes; CA adds nodes when the new pods won’t fit
- VPA + CA: VPA resizes pod resource requests; CA ensures there’s enough space to schedule them
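As a sketch of the HPA + CA pairing, an HPA like the one below (assuming a Deployment named web and the autoscaling/v2 API) adds replicas under CPU pressure, and CA adds nodes if those new replicas don’t fit anywhere:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web                    # illustrative; targets a Deployment named "web"
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # HPA adds pods past 70% CPU; CA adds nodes if they don't fit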
HPA vs VPA vs CA
Limitations and Trade-offs
Cluster Autoscaler is good at managing infrastructure, but it has its own set of trade-offs—especially when you rely on it for cost efficiency or fast response to demand.
Here’s what to watch for:
- Slow scale-up: CA doesn’t act instantly. A new node can take anywhere from under a minute to several minutes to provision and join the cluster, and that’s time your pods stay unscheduled. For bursty workloads, this delay can impact user experience unless you over-provision ahead of time.
- Scale-down disruption: When CA removes a node, it evicts all the pods on it. If those pods don’t have PodDisruptionBudgets or can’t be rescheduled cleanly, you risk service instability (a minimal PDB sketch follows this list).
- Binpacking problems: CA doesn’t optimize resource distribution across nodes. It may leave clusters fragmented, especially if pod requests aren’t tightly tuned. That leads to wasted space and unnecessary node growth.
- No awareness of workload patterns or cost: CA works at the node level. It doesn’t know if the workload is inefficient, oversized, or low-priority. And it doesn’t track how its actions affect cloud cost or node utilization.
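A minimal PodDisruptionBudget, sketched here against a hypothetical app label, limits how many replicas CA’s node drains can evict at once:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb                 # illustrative name
spec:
  minAvailable: 2               # keep at least 2 replicas running during node drains
  selector:
    matchLabels:
      app: web                  # hypothetical label; match your workload's labels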
Installing and Configuring Cluster Autoscaler
Cluster Autoscaler runs as a deployment in your cluster and interacts directly with your cloud provider’s APIs to scale node groups up or down. The exact setup depends on where you’re running Kubernetes.
On Managed Kubernetes (EKS, GKE, AKS)
Most cloud providers support CA natively or offer official deployment options:
- GKE: Node auto-provisioning and autoscaling are built-in. You configure min/max nodes per pool via the console or CLI.
- EKS: Install CA using the Helm chart or YAML deployment. Tag your node groups for auto-discovery and grant the autoscaler IAM permissions to modify your Auto Scaling groups (see the Helm sketch below).
- AKS: Works with Virtual Machine Scale Sets. Enable autoscaling and define node group boundaries.
Tip: Make sure your node groups use instance types that boot fast and match your workload shapes.
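On EKS, the install typically looks something like this; chart values vary by version, and the cluster name and region are placeholders you’d replace with your own:
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=<your-cluster-name> \
  --set awsRegion=<your-region>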
Self-Managed Clusters
You can deploy Cluster Autoscaler from the official autoscaler repo:
kubectl apply -f cluster-autoscaler-autodiscover.yaml
Key requirements:
- Nodes must be part of an auto-scaling group (e.g. AWS ASG, GCP MIG)
- Node groups must have min/max boundaries defined
- Node groups should be tagged or labeled so the autoscaler’s auto-discovery can find them (example AWS tags below)
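On AWS, for example, auto-discovery is usually driven by tags on the Auto Scaling group itself; the commonly documented pattern looks like this (the cluster name is a placeholder):
# Tags on the Auto Scaling group (not on individual nodes):
k8s.io/cluster-autoscaler/enabled = true
k8s.io/cluster-autoscaler/<your-cluster-name> = owned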
Common Configuration Flags
--balance-similar-node-groups=true
--expander=least-waste
--scale-down-enabled=true
--scale-down-unneeded-time=10m
--scale-down-delay-after-add=5m
These flags control whether and how aggressively CA removes nodes, which node group it expands when scaling up (the expander strategy), and how long it waits after a scale-up before considering scale-down.
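In practice these flags go into the autoscaler container’s command in its Deployment manifest. A trimmed sketch is below; the image tag, cloud provider, and discovery tags are placeholders to adjust for your cluster:
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0   # pick the tag matching your Kubernetes minor version
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --balance-similar-node-groups=true
      - --expander=least-waste
      - --scale-down-enabled=true
      - --scale-down-unneeded-time=10m
      - --scale-down-delay-after-add=5m
      - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<your-cluster-name>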
You can also use taints, labels, and affinity rules to influence where pods are scheduled—improving binpacking and giving CA more room to clean up underused nodes.
Best Practices for Using CA in Production
Cluster Autoscaler can save significant cloud cost, but only if you configure it to match how your workloads behave. Here’s what to keep in mind when running it in production:
Use Mixed Node Pools
Not all workloads need the same CPU, memory, or accelerators. Use multiple node groups (e.g. general-purpose, high-memory, GPU) with clear taints and labels. This gives CA more flexibility when scheduling pods and deciding which nodes to scale.
Set Safe Scale-Down Parameters
Too-aggressive scale-downs can disrupt service. Set timers like scale-down-unneeded-time and scale-down-delay-after-add conservatively, especially for stateful or slow-start workloads.
Combine with HPA or VPA
CA isn’t a replacement for pod-level autoscaling. Use HPA to control replica count and VPA to rightsize requests, so the Cluster Autoscaler can make better decisions about what fits where.
Tune for Binpacking
Poorly tuned resource requests result in low utilization. CA can’t remove nodes if pods won’t fit anywhere else. Review your requests and limits to make sure workloads can be packed efficiently—otherwise you’ll end up with nodes that can’t be emptied.
Tools built for Kubernetes cost optimization can help highlight workloads that are blocking scale-down due to oversized requests or poor distribution.
Monitor Scheduling Gaps
Track how long pods remain unscheduled. If you’re seeing frequent delays, it may mean CA isn’t scaling fast enough—or that requests are too large to fit in available pools.
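A quick way to spot that gap is to list pods stuck in Pending and, if the status ConfigMap is enabled, check the autoscaler’s own status report:
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
kubectl -n kube-system describe configmap cluster-autoscaler-status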
For full visibility, use Kubernetes cost monitoring to correlate scheduling, resource waste, and cost over time.
Advanced Use Cases and Strategies
Once you have the basics of Cluster Autoscaler working, there are more advanced ways to reduce cost, improve scheduling, and better match infrastructure to workload behavior.
Spot Node Groups with Fallback
For non-critical or fault-tolerant workloads, run a dedicated node group using Spot or preemptible instances. Use nodeSelector, taints, or affinity to isolate workloads. Combine with a fallback group (e.g. on-demand) in case Spot capacity dries up.
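One common way to pin fault-tolerant pods to a Spot group is a nodeSelector plus a matching toleration in the pod spec; the label and taint keys below are illustrative and depend on how you’ve labeled and tainted that node group:
spec:
  nodeSelector:
    node-lifecycle: spot            # hypothetical label applied to the Spot node group
  tolerations:
    - key: spot                     # hypothetical taint on the Spot nodes
      operator: Exists
      effect: NoSchedule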
GPU-Aware Scaling
If you’re running ML workloads or anything with node taints (like GPU or DPU nodes), make sure:
- Pods specify tolerations
- CA is allowed to scale the corresponding tainted node pool
This lets CA bring up GPU nodes only when needed, instead of keeping expensive hardware running idle.
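A GPU pod would then request the GPU resource and tolerate the pool’s taint; the taint key shown is a common NVIDIA convention, so match whatever your node pool actually uses:
spec:
  containers:
    - name: trainer                  # illustrative name
      image: <your-training-image>
      resources:
        limits:
          nvidia.com/gpu: 1          # requests the extended GPU resource
  tolerations:
    - key: nvidia.com/gpu            # common taint key on GPU node pools; adjust to yours
      operator: Exists
      effect: NoSchedule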
Cold Start vs Over-Provisioning
If you need fast startup during peak hours, but can’t afford latency from cold nodes, consider:
- Running a small pool of buffer nodes (partially filled but ready)
- Tuning scale-down-unneeded-time to keep key nodes alive longer
- Combining with HPA to predict bursts
This gives you a balance between cost and responsiveness.
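A common way to keep that buffer is a low-priority “placeholder” deployment: CA keeps capacity for it, but the scheduler evicts it the moment real pods need room. Everything below (names, sizes, replica count) is illustrative:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning            # illustrative name
value: -10                          # lower than any real workload, so these pods are evicted first
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-buffer             # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: capacity-buffer
  template:
    metadata:
      labels:
        app: capacity-buffer
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9     # does nothing; just reserves capacity
          resources:
            requests:
              cpu: "1"
              memory: 2Gi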
Reduce Waste with Label-Aware Scaling
Use custom labels and constraints (like topology spread or affinity) to influence placement. If CA knows how workloads should be spread, it can make better binpacking and scale-down decisions.
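For example, a soft topology spread constraint like this one (label and values are illustrative) tells the scheduler, and therefore CA’s scheduling simulation, how replicas should be distributed across nodes:
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname      # spread across individual nodes
      whenUnsatisfiable: ScheduleAnyway        # prefer spreading, but don't block scheduling
      labelSelector:
        matchLabels:
          app: web                             # hypothetical label; match your workload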
Operational Constraints of Cluster Autoscaler
Cluster Autoscaler solves a critical problem in Kubernetes: adding and removing infrastructure as workload demand changes. It works well for scaling node groups based on pending pods or idle capacity. But it has blind spots.
It doesn’t know if your workloads are oversized or inefficient. It can’t improve binpacking unless pods are perfectly sized. And it has no awareness of cost, workload priority, or whether scaling is even necessary—only that something doesn’t fit.
These gaps make CA reactive, not optimized.
DevZero helps fill those gaps.
Where CA reacts to scheduling pressure, DevZero proactively tunes resource requests in real time—on running pods, without restarts. It reshapes workload placement to free up space and improve binpacking. And it tracks how resource decisions impact actual cloud spend, giving you visibility into the cost of scaling decisions.
While CA manages node count, DevZero helps make sure you’re using those nodes efficiently. Learn more →