
Kubernetes Autoscaling Guide: HPA, VPA, CA, and KEDA

Alberto Grande

Head of Marketing

May 19, 2025


Most Kubernetes clusters run workloads that don’t behave the same way all day. Traffic fluctuates, compute needs shift, and workloads that aren’t scaled properly end up wasting resources—or crashing under pressure.

Kubernetes autoscaling helps solve this. It dynamically adjusts pod replicas, container resource requests, or even the number of cluster nodes based on real usage patterns. When configured well, it keeps applications responsive during spikes and trims costs during idle periods.

This guide breaks down the main autoscaling methods available in Kubernetes—including what changed in v1.33—and explains how to choose and combine them based on your workload.

Core Autoscaling Methods

Kubernetes supports multiple types of autoscaling, each operating at a different level of the stack. Understanding how they work—and when to use each—is key to building efficient, responsive systems.

1. Horizontal Pod Autoscaler (HPA)

HPA adjusts the number of pod replicas in a Deployment or StatefulSet. It typically uses CPU or memory utilization as a signal, but it also supports custom and external metrics.

HPA is ideal for stateless workloads with varying request volumes, such as web servers or APIs. If a pod’s CPU usage exceeds a defined threshold (e.g. 80%), HPA adds replicas. If usage drops, it scales down.

Use it when: you need to scale pods based on resource usage or business metrics (e.g. requests per second).

Key features:

  • Supports CPU, memory, and custom metrics
  • Now includes configurable tolerance in Kubernetes v1.33
  • Commonly paired with Cluster Autoscaler
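
For a stateless web service, a minimal HPA manifest targeting 80% average CPU might look like this (the Deployment name and replica bounds are placeholders):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp                  # placeholder Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80  # add replicas once average CPU exceeds 80%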

2. Vertical Pod Autoscaler (VPA)

VPA adjusts the CPU and memory requests and limits for containers within pods. Instead of adding more replicas, it makes each pod more (or less) powerful.

It works well for batch jobs, cron jobs, or stateful services that don’t scale well horizontally. However, VPA often requires pod restarts, and it can conflict with HPA if both target the same resource.

Use it when: your workload can’t be scaled out easily and needs smarter per-pod tuning.

Key considerations:

  • Can over-request resources if not tuned carefully
  • Shouldn’t be used with HPA on the same metrics
  • Not part of Kubernetes core—runs as a separate controller
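
As a sketch, a VPA object in Auto mode for a hypothetical batch-worker Deployment looks roughly like this (the VPA controller and its CRDs must be installed separately):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: batch-worker-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: batch-worker            # placeholder workload name
  updatePolicy:
    updateMode: "Auto"            # VPA evicts and recreates pods to apply new requests
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: "2"
          memory: 4Gi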

3. Cluster Autoscaler (CA)

CA adjusts the number of nodes in the cluster. When pods are pending due to resource shortages, it adds nodes. When resources are underutilized, it removes them.

It acts on infrastructure, not workloads. It’s most effective when paired with HPA or VPA to handle pod-level scaling while it handles cluster-level capacity.

Use it when: workloads exceed node capacity or you want to minimize idle nodes.

Notes:

  • Works with major cloud providers (EKS, GKE, AKS)
  • Takes longer to act than HPA due to provisioning delays
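
Cluster Autoscaler is deployed per cloud provider rather than configured through a CRD. A rough sketch of commonly used flags in its container args (cluster name, image tag, and thresholds are placeholders; exact flags vary by provider and version):

# Excerpt from a cluster-autoscaler Deployment spec
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0   # pick the tag matching your cluster version
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
      - --scale-down-utilization-threshold=0.5   # consider removing nodes below 50% utilization
      - --scale-down-unneeded-time=10m           # wait before removing an underused node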

Beyond the Basics: Event-Driven and Custom Scaling

Kubernetes autoscaling isn’t limited to CPU and memory metrics. For more dynamic workloads—like jobs triggered by queues, schedules, or APIs—event-driven and custom scaling methods offer better control.

Kubernetes Event-Driven Autoscaler (KEDA)

KEDA is an open-source project that extends HPA with support for event-based triggers, such as Kafka lag, HTTP request counts, Prometheus queries, or cloud provider queues (e.g. AWS SQS, Azure Service Bus).

It works through a ScaledObject resource that defines what to scale and which event source to watch. Behind the scenes, KEDA exposes those event metrics through the Kubernetes external metrics API so that HPA can act on them.

Use it when: you need to scale based on queue length, job backlog, or custom business signals.

Key benefits:

  • Supports 40+ scalers out of the box
  • Can scale from 0 to N pods (unlike standard HPA)
  • Works alongside HPA and Cluster Autoscaler
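
For example, a ScaledObject that scales a Deployment based on a Prometheus query: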
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: webapp-keda
spec:
  scaleTargetRef:
    name: webapp
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: http_requests_total
        query: "sum(rate(http_requests_total[2m]))"   # example query; adjust to the signal you scale on
        threshold: '100'

Custom Metrics with HPA

If you’re not using KEDA but still want to scale based on something other than CPU or memory—like request latency or DB queue depth—you can expose custom metrics to HPA using the Kubernetes Custom Metrics API.

This requires a metrics adapter (e.g. Prometheus Adapter) that makes those metrics available to the API server.

Use it when: you have performance metrics tied to user experience, not system resource use.

Caveats:

  • Setup is non-trivial (you’ll need Prometheus, custom queries, and a metrics adapter)
  • Metrics need to be accurate and up-to-date to avoid mis-scaling
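
As an illustration, once an adapter exposes a per-pod metric such as http_requests_per_second (a hypothetical name defined by your adapter rules), an HPA can target it like this:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                            # placeholder Deployment name
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second # must match a metric exposed by the adapter
        target:
          type: AverageValue
          averageValue: "100"            # aim for ~100 requests/s per pod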

What’s New in Kubernetes v1.33 Autoscaling

Kubernetes v1.33 introduced a small but important improvement to autoscaling behavior: configurable tolerance in the Horizontal Pod Autoscaler.

HPA Tolerance, Before v1.33

By default, HPA doesn’t react to tiny metric fluctuations. It uses a 10% tolerance to avoid unnecessary scaling caused by noisy signals. For example, if your CPU target is 75%, HPA won’t scale up unless usage exceeds 82.5%, and won’t scale down unless it drops below 67.5%.

This fixed threshold worked fine in many cases—but sometimes you want tighter or looser sensitivity depending on how bursty or stable your workloads are.

New in v1.33: Configurable Tolerance (Alpha)

Kubernetes v1.33 lets you customize this tolerance value per HPA resource. You can:

  • Reduce it for faster, more sensitive scaling
  • Increase it to reduce flapping on workloads with frequent spikes
  • Set different values for scale-up and scale-down behavior

Why it matters:

More control means you can fine-tune responsiveness vs stability for each workload. For example:

  • APIs under latency SLOs may scale faster with lower tolerance
  • Cost-sensitive batch jobs may prefer slower, more stable scaling
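
With the alpha feature gate enabled, per-direction tolerance slots into the existing HPA behavior block, for example: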
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60
    policies:
      - type: Percent
        value: 100
        periodSeconds: 15
    selectPolicy: Max
    tolerance: 0.05  # scale up once usage exceeds the target by more than 5%
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
      - type: Percent
        value: 50
        periodSeconds: 30
    selectPolicy: Min
    tolerance: 0.15  # scale down only once usage falls more than 15% below the target
⚠️ Note: As of v1.33, this feature is alpha. You need to enable the HPAConfigurableTolerance feature gate and use the autoscaling/v2 API.

Autoscaling Methods by Kubernetes Version

Kubernetes supports several autoscalers, but they don’t all ship with Kubernetes itself—and some evolve with each release. Here’s a quick comparison.


Autoscaler | Purpose | Since | Status | Built-in
HPA | Scale pods by CPU/memory/custom | v1.0–v1.33 | ✅ Stable | ✅ Yes
VPA | Adjust pod CPU/mem requests | External | 🧪 Beta | ❌ No
CA | Scale nodes up/down | External | ✅ Stable | ❌ No
KEDA | Scale from external events (e.g. SQS, HTTP) | CNCF (2023) | ✅ Stable | ❌ No
CPA | Scale pods by cluster size | External | 🧪 Beta | ❌ No
CPA-VPA | Adjust pod resources by cluster size | External | 🧪 Beta | ❌ No

HPA Timeline by Kubernetes Version

  • v1.0 – HPA introduced (CPU-based scaling only)
  • v1.6 – Custom metrics support added (alpha)
  • v1.12 – External metrics API introduced
  • v1.20 – Container-level resource metrics added (alpha)
  • v1.30 – Container metrics stabilized
  • v1.33 – Configurable tolerance added (alpha)

Best Practices for Autoscaling in Production

Autoscaling works best when you pair the right strategy with the right workload. But misconfigured autoscaling can lead to flapping, underutilized resources, or unexpected downtime. Here are proven best practices.

1. Choose the Right Scaler for the Job

Workload Type | Recommended Setup
Stateless web apps | HPA + Cluster Autoscaler
Event-driven services | HPA + KEDA + Cluster Autoscaler
Batch or cron jobs | VPA (auto) + Cluster Autoscaler
Services with queues | KEDA + HPA
Stateful systems | VPA or manual tuning (no HPA)

2. Avoid Conflicts Between HPA and VPA

Don’t use HPA and VPA on the same metric (e.g. CPU or memory). If you must combine them:

  • Let HPA scale pod replicas
  • Let VPA recommend memory requests only
  • Disable VPA’s automatic updates (updateMode: "Off") and apply selectively
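
A sketch of that combination: VPA in recommendation-only mode, restricted to memory, while HPA owns the replica count (the workload name is a placeholder):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: webapp-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp                        # HPA scales this Deployment's replicas
  updatePolicy:
    updateMode: "Off"                   # recommendations only; nothing is applied automatically
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["memory"] # keep CPU out of VPA's hands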

3. Tune Stabilization Windows and Tolerance

Short stabilization windows lead to faster response, but also increase the risk of flapping. Long windows reduce noise but may delay scaling.

Start with:

  • HPA scale-up: 15–30 seconds
  • HPA scale-down: 5 minutes
  • Tolerance: 10% (default); on v1.33+, lower it for faster response or raise it to damp bursty workloads
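
Translated into HPA behavior fields, those starting points look roughly like this:

behavior:
  scaleUp:
    stabilizationWindowSeconds: 30    # react quickly when load rises
  scaleDown:
    stabilizationWindowSeconds: 300   # wait five minutes before removing replicas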

4. Don’t Rely on CPU Alone

CPU usage is easy to track but not always a good proxy for user load. APIs might be CPU-light but latency-sensitive.

Use custom metrics or KEDA when:

  • You need to scale by request count, queue depth, or job backlog
  • You want to define business-level autoscaling logic
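
For instance, a KEDA trigger that scales a worker Deployment on AWS SQS backlog might look like this (queue URL, region, and target length are placeholders):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-keda
spec:
  scaleTargetRef:
    name: worker                  # placeholder Deployment name
  minReplicaCount: 0              # allow scale-to-zero when the queue is empty
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/jobs   # placeholder queue
        queueLength: "5"          # target messages per replica
        awsRegion: us-east-1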

5. Test With Real Load

Autoscaling policies look fine in YAML, but production behavior depends on how your apps use resources. Use load tests, spike tests, and simulate slow drains to verify that your scaling rules behave as expected.

Tracking how autoscaling decisions impact resource usage and cloud spend can be time-consuming and error-prone. Some teams build their own dashboards, but many rely on a Kubernetes cost monitoring tool to automate visibility and reduce guesswork.

Advanced Strategies

Once you’ve mastered the basics of autoscaling, there are more advanced techniques that help fine-tune performance and cost—especially for large-scale or latency-sensitive systems.

Multi-Dimensional Autoscaling

Traditional HPA scales based on one metric (e.g. CPU). But real workloads are influenced by multiple factors—CPU, memory, request rate, latency.

Multi-dimensional autoscaling combines multiple signals into a single decision model. This can be done with:

  • Multiple metrics in HPA v2 (e.g. CPU + custom)
  • External controllers that apply logic (e.g. override decisions if latency exceeds a threshold)
  • Tools like KEDA with multiple triggers and custom logic

Use this when:

  • One metric isn’t enough to describe workload behavior
  • You need scaling to prioritize reliability, not just throughput
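
A sketch of the HPA v2 approach, mixing CPU utilization with a hypothetical per-pod latency metric exposed through an adapter; HPA evaluates each metric independently and uses the largest resulting replica count:

metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: p95_latency_ms      # hypothetical metric name from your metrics adapter
      target:
        type: AverageValue
        averageValue: "250"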

Predictive or Scheduled Scaling

Reactive autoscaling only adjusts after a change in usage. For workloads with predictable patterns (e.g. 9am traffic spikes), you can schedule changes ahead of time.

Options include:

  • KEDA’s cron scaler
  • Cloud-specific scheduled autoscaling (e.g. GCP, AWS)
  • Custom controllers or scripts

Use this when:

  • You know in advance when load increases
  • You want to avoid cold starts or provisioning lag
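
For example, KEDA's cron scaler can pre-warm replicas ahead of a known weekday peak (timezone, schedule, and replica count are placeholders):

triggers:
  - type: cron
    metadata:
      timezone: Europe/Madrid     # placeholder timezone
      start: "0 8 * * 1-5"        # scale up at 08:00 on weekdays
      end: "0 19 * * 1-5"         # scale back down at 19:00
      desiredReplicas: "10"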

Autoscaling and CI/CD

Deployments can temporarily double resource usage—especially during rolling updates. If your autoscaler isn’t tuned to account for this, you might hit node limits or trigger premature scaling.

Best practices:

  • Use maxSurge and maxUnavailable settings wisely
  • Consider temporarily disabling downscaling during deploys
  • Monitor autoscaler behavior during rollout
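
For example, a conservative rolling update strategy keeps the temporary surge bounded:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%         # at most 25% extra pods during the rollout
    maxUnavailable: 0     # never drop below the desired replica count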

Integrating with Cost and Usage Metrics

For larger teams, autoscaling decisions shouldn’t just consider performance—they should reflect cost efficiency.

Strategies include:

  • Feeding billing or utilization data into custom metrics
  • Setting autoscaling targets that balance latency and cost
  • Using DevOps-focused tools to monitor autoscaler decisions

Balancing performance and efficiency often requires tuning resource requests, identifying underutilized workloads, and adjusting autoscaling policies over time. A Kubernetes cost optimization tool can automate this process—making it easier to reduce waste and align scaling behavior with real usage patterns.

Conclusion

Kubernetes autoscaling isn’t just about saving money—it’s about making your infrastructure match the real behavior of your workloads. Whether you’re scaling pods with HPA, fine-tuning containers with VPA, or adjusting cluster capacity with CA, each method serves a different need.

In Kubernetes v1.33, autoscaling becomes more configurable with per-direction HPA tolerance. For advanced use cases, tools like KEDA and custom metrics unlock scaling based on queues, latency, or business logic.

The best setups often combine multiple autoscalers, tailored to workload patterns. But even the best YAML won’t help if you don’t validate it under real conditions. Run tests, monitor behavior, and always assume your workloads will surprise you.

Autoscaling isn’t something you set once and forget. It’s something you iterate on—as your infrastructure, users, and costs evolve.

From Manual Tuning to Dynamic Optimization

That’s where most teams hit a wall: the theory of autoscaling makes sense, but tuning it in production takes time, expertise, and constant monitoring. CPU thresholds don’t always reflect real demand, pod restarts introduce risk, and the connection between scaling behavior and cloud cost is often invisible.

This is where tools like DevZero come in.

Instead of relying solely on manual tuning or static thresholds, DevZero continuously observes how workloads behave in real time—and adjusts resource allocations accordingly. It complements HPA, VPA, and Cluster Autoscaler by filling in the operational gaps those tools leave behind.

  • Live rightsizing adjusts CPU and memory requests without requiring pod restarts
  • Live migration moves workloads between nodes without downtime
  • Binpacking optimization increases node utilization by consolidating resized workloads
  • Cost-aware scaling connects autoscaling behavior directly to usage and spend metrics

By integrating resource optimization and cost observability into the autoscaling loop, DevZero helps teams turn reactive scaling into a continuous feedback system—more accurate, more efficient, and easier to manage over time. Learn more →
