
Kubernetes Autoscaling Guide: HPA, VPA, CA, and KEDA

Alberto Grande

Head of Marketing

May 19, 2025


Most Kubernetes clusters run workloads that don’t behave the same way all day. Traffic fluctuates, compute needs shift, and workloads that aren’t scaled properly end up wasting resources—or crashing under pressure.

Kubernetes autoscaling helps solve this. It dynamically adjusts pod replicas, container resource requests, or even the number of cluster nodes based on real usage patterns. When configured well, it keeps applications responsive during spikes and trims costs during idle periods.

This guide breaks down the main autoscaling methods available in Kubernetes—including what changed in v1.33—and explains how to choose and combine them based on your workload.

Core Autoscaling Methods

Kubernetes supports multiple types of autoscaling, each operating at a different level of the stack. Understanding how they work—and when to use each—is key to building efficient, responsive systems.

1. Horizontal Pod Autoscaler (HPA)

HPA adjusts the number of pod replicas in a Deployment or StatefulSet. It typically uses CPU or memory utilization as a signal, but it also supports custom and external metrics.

HPA is ideal for stateless workloads with varying request volumes, such as web servers or APIs. If a pod’s CPU usage exceeds a defined threshold (e.g. 80%), HPA adds replicas. If usage drops, it scales down.

Use it when: you need to scale pods based on resource usage or business metrics (e.g. requests per second).

Key features:

  • Supports CPU, memory, and custom metrics
  • Now includes configurable tolerance in Kubernetes v1.33
  • Commonly paired with Cluster Autoscaler
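
For a stateless web service, a minimal HPA manifest targeting 80% average CPU might look like this (the Deployment name and replica bounds are placeholders):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp                  # placeholder Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80  # add replicas once average CPU exceeds 80%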

2. Vertical Pod Autoscaler (VPA)

VPA adjusts the CPU and memory requests and limits for containers within pods. Instead of adding more replicas, it makes each pod more (or less) powerful.

It works well for batch jobs, cron jobs, or stateful services that don’t scale well horizontally. However, VPA often requires pod restarts, and it can conflict with HPA if both target the same resource.

Use it when: your workload can’t be scaled out easily and needs smarter per-pod tuning.

Key considerations:

  • Can over-request resources if not tuned carefully
  • Shouldn’t be used with HPA on the same metrics
  • Not part of Kubernetes core—runs as a separate controller
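
As a sketch, a VPA object in Auto mode for a hypothetical batch-worker Deployment looks roughly like this (the VPA controller and its CRDs must be installed separately):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: batch-worker-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: batch-worker            # placeholder workload name
  updatePolicy:
    updateMode: "Auto"            # VPA evicts and recreates pods to apply new requests
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: "2"
          memory: 4Gi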

3. Cluster Autoscaler (CA)

CA adjusts the number of nodes in the cluster. When pods are pending due to resource shortages, it adds nodes. When resources are underutilized, it removes them.

It acts on infrastructure, not workloads. It’s most effective when paired with HPA or VPA to handle pod-level scaling while it handles cluster-level capacity.

Use it when: workloads exceed node capacity or you want to minimize idle nodes.

Notes:

  • Works with major cloud providers (EKS, GKE, AKS)
  • Takes longer to act than HPA due to provisioning delays
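
Cluster Autoscaler is deployed per cloud provider rather than configured through a CRD. A rough sketch of commonly used flags in its container args (cluster name, image tag, and thresholds are placeholders; exact flags vary by provider and version):

# Excerpt from a cluster-autoscaler Deployment spec
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0   # pick the tag matching your cluster version
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
      - --scale-down-utilization-threshold=0.5   # consider removing nodes below 50% utilization
      - --scale-down-unneeded-time=10m           # wait before removing an underused node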

Beyond the Basics: Event-Driven and Custom Scaling

Kubernetes autoscaling isn’t limited to CPU and memory metrics. For more dynamic workloads—like jobs triggered by queues, schedules, or APIs—event-driven and custom scaling methods offer better control.

Kubernetes Event-Driven Autoscaler (KEDA)

KEDA is an open-source project that extends HPA with support for event-based triggers, such as Kafka lag, HTTP request counts, Prometheus queries, or cloud provider queues (e.g. AWS SQS, Azure Service Bus).

It works through a ScaledObject resource that defines what to scale and which event source to watch. Behind the scenes, KEDA exposes those event metrics through the Kubernetes external metrics API so that HPA can act on them.

Use it when: you need to scale based on queue length, job backlog, or custom business signals.

Key benefits:

  • Supports 40+ scalers out of the box
  • Can scale from 0 to N pods (unlike standard HPA)
  • Works alongside HPA and Cluster Autoscaler
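
For example, a ScaledObject that scales a Deployment based on a Prometheus query: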
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: webapp-keda
spec:
  scaleTargetRef:
    name: webapp
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: http_requests_total
        query: "sum(rate(http_requests_total[2m]))"   # example query; adjust to the signal you scale on
        threshold: '100'

Custom Metrics with HPA

If you’re not using KEDA but still want to scale based on something other than CPU or memory—like request latency or DB queue depth—you can expose custom metrics to HPA using the Kubernetes Custom Metrics API.

This requires a metrics adapter (e.g. Prometheus Adapter) that makes those metrics available to the API server.

Use it when: you have performance metrics tied to user experience, not system resource use.

Caveats:

  • Setup is non-trivial (you’ll need Prometheus, custom queries, and a metrics adapter)
  • Metrics need to be accurate and up-to-date to avoid mis-scaling
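
As an illustration, once an adapter exposes a per-pod metric such as http_requests_per_second (a hypothetical name defined by your adapter rules), an HPA can target it like this:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                            # placeholder Deployment name
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second # must match a metric exposed by the adapter
        target:
          type: AverageValue
          averageValue: "100"            # aim for ~100 requests/s per pod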

What’s New in Kubernetes v1.33 Autoscaling

Kubernetes v1.33 introduced a small but important improvement to autoscaling behavior: configurable tolerance in the Horizontal Pod Autoscaler.

HPA Tolerance, Before v1.33

By default, HPA doesn’t react to tiny metric fluctuations. It uses a 10% tolerance to avoid unnecessary scaling caused by noisy signals. For example, if your CPU target is 75%, HPA won’t scale up unless usage exceeds 82.5%, and won’t scale down unless it drops below 67.5%.

This fixed threshold worked fine in many cases—but sometimes you want tighter or looser sensitivity depending on how bursty or stable your workloads are.

New in v1.33: Configurable Tolerance (Alpha)

Kubernetes v1.33 lets you customize this tolerance value per HPA resource. You can:

  • Reduce it for faster, more sensitive scaling
  • Increase it to reduce flapping on workloads with frequent spikes
  • Set different values for scale-up and scale-down behavior

Why it matters:

More control means you can fine-tune responsiveness vs stability for each workload. For example:

  • APIs under latency SLOs may scale faster with lower tolerance
  • Cost-sensitive batch jobs may prefer slower, more stable scaling
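
With the alpha feature gate enabled, per-direction tolerance slots into the existing HPA behavior block, for example: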
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60
    policies:
      - type: Percent
        value: 100
        periodSeconds: 15
    selectPolicy: Max
    tolerance: 0.05  # scale up once usage exceeds the target by more than 5%
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
      - type: Percent
        value: 50
        periodSeconds: 30
    selectPolicy: Min
    tolerance: 0.15  # scale down only once usage falls more than 15% below the target
⚠️ Note: As of v1.33, this feature is alpha. You need to enable the HPAConfigurableTolerance feature gate and use the autoscaling/v2 API.

Autoscaling Methods by Kubernetes Version

Kubernetes supports several autoscalers, but they don’t all ship with Kubernetes itself—and some evolve with each release. Here’s a quick comparison.


Autoscaler | Purpose | Since | Status | Built-in
HPA | Scale pods by CPU/memory/custom | v1.0–v1.33 | ✅ Stable | ✅ Yes
VPA | Adjust pod CPU/mem requests | External | 🧪 Beta | ❌ No
CA | Scale nodes up/down | External | ✅ Stable | ❌ No
KEDA | Scale from external events (e.g. SQS, HTTP) | CNCF (2023) | ✅ Stable | ❌ No
CPA | Scale pods by cluster size | External | 🧪 Beta | ❌ No
CPA-VPA | Adjust pod resources by cluster size | External | 🧪 Beta | ❌ No

HPA Timeline by Kubernetes Version

  • v1.0 – HPA introduced (CPU-based scaling only)
  • v1.6 – Custom metrics support added (alpha)
  • v1.12 – External metrics API introduced
  • v1.20 – Container-level resource metrics added (alpha)
  • v1.30 – Container metrics stabilized
  • v1.33 – Configurable tolerance added (alpha)

Best Practices for Autoscaling in Production

Autoscaling works best when you pair the right strategy with the right workload. But misconfigured autoscaling can lead to flapping, underutilized resources, or unexpected downtime. Here are proven best practices.

1. Choose the Right Scaler for the Job

Workload Type | Recommended Setup
Stateless web apps | HPA + Cluster Autoscaler
Event-driven services | HPA + KEDA + Cluster Autoscaler
Batch or cron jobs | VPA (auto) + Cluster Autoscaler
Services with queues | KEDA + HPA
Stateful systems | VPA or manual tuning (no HPA)

2. Avoid Conflicts Between HPA and VPA

Don’t use HPA and VPA on the same metric (e.g. CPU or memory). If you must combine them:

  • Let HPA scale pod replicas
  • Let VPA recommend memory requests only
  • Disable VPA’s automatic updates (updateMode: "Off") and apply selectively
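
A sketch of that combination: VPA in recommendation-only mode, restricted to memory, while HPA owns the replica count (the workload name is a placeholder):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: webapp-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp                        # HPA scales this Deployment's replicas
  updatePolicy:
    updateMode: "Off"                   # recommendations only; nothing is applied automatically
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["memory"] # keep CPU out of VPA's hands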

3. Tune Stabilization Windows and Tolerance

Short stabilization windows lead to faster response, but also increase the risk of flapping. Long windows reduce noise but may delay scaling.

Start with:

  • HPA scale-up: 15–30 seconds
  • HPA scale-down: 5 minutes
  • Tolerance: 10% (default); on v1.33+, lower it for faster response or raise it to damp bursty workloads
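
Translated into HPA behavior fields, those starting points look roughly like this:

behavior:
  scaleUp:
    stabilizationWindowSeconds: 30    # react quickly when load rises
  scaleDown:
    stabilizationWindowSeconds: 300   # wait five minutes before removing replicas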

4. Don’t Rely on CPU Alone

CPU usage is easy to track but not always a good proxy for user load. APIs might be CPU-light but latency-sensitive.

Use custom metrics or KEDA when:

  • You need to scale by request count, queue depth, or job backlog
  • You want to define business-level autoscaling logic
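
For instance, a KEDA trigger that scales a worker Deployment on AWS SQS backlog might look like this (queue URL, region, and target length are placeholders):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-keda
spec:
  scaleTargetRef:
    name: worker                  # placeholder Deployment name
  minReplicaCount: 0              # allow scale-to-zero when the queue is empty
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/jobs   # placeholder queue
        queueLength: "5"          # target messages per replica
        awsRegion: us-east-1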

5. Test With Real Load

Autoscaling policies look fine in YAML, but production behavior depends on how your apps use resources. Use load tests, spike tests, and simulate slow drains to verify that your scaling rules behave as expected.

Tracking how autoscaling decisions impact resource usage and cloud spend can be time-consuming and error-prone. Some teams build their own dashboards, but many rely on a Kubernetes cost monitoring tool to automate visibility and reduce guesswork.

Advanced Strategies

Once you’ve mastered the basics of autoscaling, there are more advanced techniques that help fine-tune performance and cost—especially for large-scale or latency-sensitive systems.

Multi-Dimensional Autoscaling

Traditional HPA scales based on one metric (e.g. CPU). But real workloads are influenced by multiple factors—CPU, memory, request rate, latency.

Multi-dimensional autoscaling combines multiple signals into a single decision model. This can be done with:

  • Multiple metrics in HPA v2 (e.g. CPU + custom)
  • External controllers that apply logic (e.g. override decisions if latency exceeds a threshold)
  • Tools like KEDA with multiple triggers and custom logic

Use this when:

  • One metric isn’t enough to describe workload behavior
  • You need scaling to prioritize reliability, not just throughput
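
A sketch of the HPA v2 approach, mixing CPU utilization with a hypothetical per-pod latency metric exposed through an adapter; HPA evaluates each metric independently and uses the largest resulting replica count:

metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: p95_latency_ms      # hypothetical metric name from your metrics adapter
      target:
        type: AverageValue
        averageValue: "250"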

Predictive or Scheduled Scaling

Reactive autoscaling only adjusts after a change in usage. For workloads with predictable patterns (e.g. 9am traffic spikes), you can schedule changes ahead of time.

Options include:

  • KEDA’s cron scaler
  • Cloud-specific scheduled autoscaling (e.g. GCP, AWS)
  • Custom controllers or scripts

Use this when:

  • You know in advance when load increases
  • You want to avoid cold starts or provisioning lag
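
For example, KEDA's cron scaler can pre-warm replicas ahead of a known weekday peak (timezone, schedule, and replica count are placeholders):

triggers:
  - type: cron
    metadata:
      timezone: Europe/Madrid     # placeholder timezone
      start: "0 8 * * 1-5"        # scale up at 08:00 on weekdays
      end: "0 19 * * 1-5"         # scale back down at 19:00
      desiredReplicas: "10"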

Autoscaling and CI/CD

Deployments can temporarily double resource usage—especially during rolling updates. If your autoscaler isn’t tuned to account for this, you might hit node limits or trigger premature scaling.

Best practices:

  • Use maxSurge and maxUnavailable settings wisely
  • Consider temporarily disabling downscaling during deploys
  • Monitor autoscaler behavior during rollout
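
For example, a conservative rolling update strategy keeps the temporary surge bounded:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%         # at most 25% extra pods during the rollout
    maxUnavailable: 0     # never drop below the desired replica count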

Integrating with Cost and Usage Metrics

For larger teams, autoscaling decisions shouldn’t just consider performance—they should reflect cost efficiency.

Strategies include:

  • Feeding billing or utilization data into custom metrics
  • Setting autoscaling targets that balance latency and cost
  • Using DevOps-focused tools to monitor autoscaler decisions

Balancing performance and efficiency often requires tuning resource requests, identifying underutilized workloads, and adjusting autoscaling policies over time. A Kubernetes cost optimization tool can automate this process—making it easier to reduce waste and align scaling behavior with real usage patterns.

Conclusion

Kubernetes autoscaling isn’t just about saving money—it’s about making your infrastructure match the real behavior of your workloads. Whether you’re scaling pods with HPA, fine-tuning containers with VPA, or adjusting cluster capacity with CA, each method serves a different need.

In Kubernetes v1.33, autoscaling becomes more configurable with per-direction HPA tolerance. For advanced use cases, tools like KEDA and custom metrics unlock scaling based on queues, latency, or business logic.

The best setups often combine multiple autoscalers, tailored to workload patterns. But even the best YAML won’t help if you don’t validate it under real conditions. Run tests, monitor behavior, and always assume your workloads will surprise you.

Autoscaling isn’t something you set once and forget. It’s something you iterate on—as your infrastructure, users, and costs evolve.

From Manual Tuning to Dynamic Optimization

That’s where most teams hit a wall: the theory of autoscaling makes sense, but tuning it in production takes time, expertise, and constant monitoring. CPU thresholds don’t always reflect real demand, pod restarts introduce risk, and the connection between scaling behavior and cloud cost is often invisible.

This is where tools like DevZero come in.

Instead of relying solely on manual tuning or static thresholds, DevZero continuously observes how workloads behave in real time—and adjusts resource allocations accordingly. It complements HPA, VPA, and Cluster Autoscaler by filling in the operational gaps those tools leave behind.

  • Live rightsizing adjusts CPU and memory requests without requiring pod restarts
  • Live migration moves workloads between nodes without downtime
  • Binpacking optimization increases node utilization by consolidating resized workloads
  • Cost-aware scaling connects autoscaling behavior directly to usage and spend metrics

By integrating resource optimization and cost observability into the autoscaling loop, DevZero helps teams turn reactive scaling into a continuous feedback system—more accurate, more efficient, and easier to manage over time. Learn more →
