Kubernetes HPA: Scale Pods Based on Resource Usage

Alberto Grande

Head of Marketing

May 22, 2025

Kubernetes lets you scale applications horizontally, but deciding when and how to add pods isn’t always obvious. Traffic can spike unexpectedly. Resource usage varies. And manual scaling quickly becomes unsustainable in dynamic environments.

The Horizontal Pod Autoscaler (HPA) addresses this by automatically increasing or decreasing pod replicas based on real-time metrics. It’s one of the most widely used Kubernetes autoscaling tools—ideal for stateless workloads that need to adapt quickly to changing demand.

This guide is part of our autoscaling series and focuses specifically on how HPA works, when it’s the right choice, what its limitations are, and how to go beyond it with real-time, cost-aware scaling.

How Kubernetes HPA Works

The Horizontal Pod Autoscaler adjusts the number of pod replicas in a Deployment, StatefulSet, or ReplicaSet based on a target metric—most commonly CPU utilization. When the metric exceeds the configured threshold, HPA adds replicas. When it drops, it scales down.

At the core is a simple formula:

desiredReplicas = ceil[currentReplicas × (currentMetric / targetMetric)]


For example, if your app is running with 5 replicas at 80% CPU, and your target is 50%, HPA will scale to 8 replicas:

5 × (80 / 50) = 8

Supported Metric Types

Metric Type      | Description                                          | Requires
Resource metrics | CPU and memory usage                                 | Metrics Server
Custom metrics   | Application-level metrics exposed via API (e.g. QPS) | Prometheus Adapter (or similar)
External metrics | Metrics from outside the cluster (e.g. queue length) | External Metrics API (e.g. via Prometheus Adapter or KEDA)

📌 CPU-based HPA is enabled by default in most clusters. To use custom or external metrics, you’ll need to install an adapter like Prometheus Adapter.

HPA uses these metrics in real time and makes scaling decisions at fixed intervals (default: every 15 seconds). It only affects the number of pods—not the resources per pod, and not the underlying infrastructure.

HPA Timeline by Kubernetes Version

The Horizontal Pod Autoscaler has been part of Kubernetes since v1.0 and has steadily evolved from simple CPU-based scaling to support more advanced metric types and fine-tuned behaviors.

  • v1.0 – HPA introduced with CPU-based scaling only
  • v1.6 – Support for custom metrics added (alpha)
  • v1.12 – External metrics API introduced
  • v1.23 – autoscaling/v2 API promoted to stable
  • v1.27 – Container-level resource metrics promoted to beta
  • v1.30 – Container-level resource metrics promoted to stable
  • v1.33 – Configurable tolerance for scale-up and scale-down behavior (alpha)

By default, HPA uses a fixed 10% tolerance to avoid scaling on minor fluctuations. Starting in v1.33, you can configure different tolerances for scaling up and scaling down—allowing more control over how aggressively HPA reacts to metric changes.

This alpha feature is available via the autoscaling/v2 API and requires the HPAConfigurableTolerance feature gate.
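
As a rough sketch of the alpha API (assuming the HPAConfigurableTolerance feature gate is enabled on a v1.33+ cluster; the field shape may change while the feature is alpha), per-direction tolerance is set under the behavior block:

behavior:
  scaleUp:
    tolerance: 0.1   # ignore scale-up signals within 10% of the target
  scaleDown:
    tolerance: 0.05  # react to smaller deviations when scaling down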

When to Use HPA

The Horizontal Pod Autoscaler is ideal for workloads that can scale horizontally—meaning you can run multiple stateless replicas behind a load balancer. It works best for services with variable load, where increasing or decreasing the number of pods helps maintain performance.

Ideal Use Cases:

  • Web APIs or backend services with traffic that fluctuates
  • Workers processing queue-based jobs where load varies by time of day
  • Applications where CPU or memory closely tracks real user demand
  • Systems where quick response to load spikes is more important than per-pod tuning

HPA is the default autoscaler in most Kubernetes setups because it’s native, easy to enable, and reacts to metrics in near real time. But it’s not ideal for memory-heavy apps that don’t scale well horizontally, or for workloads where latency, I/O, or cost should factor into scaling decisions.

HPA vs VPA vs Cluster Autoscaler

Here’s how HPA compares to the other built-in autoscaling options:

Autoscaler         | What it scales                   | Typical trigger
HPA                | Number of pod replicas           | CPU, memory, custom, or external metrics
VPA                | CPU and memory requests per pod  | Observed resource usage over time
Cluster Autoscaler | Number of nodes                  | Pods that fail to schedule for lack of capacity

Limitations and Trade-offs

The Horizontal Pod Autoscaler works well for many stateless services, but it comes with a set of constraints that teams often hit in production.

It only responds to metrics like CPU and memory, which don’t always reflect real load. And while HPA can scale quickly, it doesn’t always scale accurately—especially without tuning.

Some of the key limitations include:

  • Metric blind spots: CPU and memory are system-level metrics. For many apps, actual demand is tied to request rate, queue depth, or latency—none of which HPA can read unless you configure custom or external metrics.

  • Lag and instability: Scaling is based on averaged metrics across pods, which can introduce delays. Without stabilization policies, HPA can oscillate, scaling up and down too frequently.

  • Conflicts with VPA: If both HPA and VPA manage CPU, they can interfere with each other. HPA adjusts replica count based on usage, while VPA resizes pods—often requiring restarts.

  • No cost or scheduling awareness: HPA doesn’t consider node availability or binpacking. It can scale up pods into fragmented or saturated clusters, increasing resource waste.

For teams trying to control spend, this becomes a visibility problem. HPA decisions may drive cost spikes without clear insight into why. That’s where tools like Kubernetes cost monitoring come in—connecting scaling behavior to actual infrastructure usage and spend.

Installing and Configuring HPA

HPA is available out of the box in Kubernetes, but to function properly, it requires metrics to be available via the Kubernetes Metrics API. Most clusters support this by default through the Metrics Server, which scrapes CPU and memory usage from the kubelet and exposes it to the HPA controller.

Prerequisites

  • Kubernetes 1.23+ for the stable autoscaling/v2 API (HPA itself dates back to v1.0; older clusters expose CPU-only autoscaling/v1)
  • Metrics Server installed and running
  • Optional: Prometheus Adapter for custom or external metrics

To install the Metrics Server:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Example: CPU-Based HPA YAML

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sample-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sample-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70


This configuration sets the CPU target to 70%. HPA will increase or decrease the number of pods to try and maintain that average CPU utilization across the deployment.
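
For quick experiments, kubectl can create an equivalent autoscaler imperatively (note that kubectl autoscale names the HPA after the Deployment):

kubectl autoscale deployment sample-app --cpu-percent=70 --min=2 --max=10
kubectl get hpa sample-app --watch   # observe targets and replica counts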

Using Custom Metrics (e.g. QPS or queue depth)

To scale on custom metrics, you’ll need an adapter such as the Prometheus Adapter, which exposes Prometheus data through the Kubernetes Custom Metrics API.

Once installed, you can use metric types like:

- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "100"

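How that metric reaches the Custom Metrics API depends on your adapter configuration. As a minimal sketch, assuming a Prometheus counter named http_requests_total with namespace and pod labels (adjust the names to your own pipeline), a Prometheus Adapter rule exposing it as http_requests_per_second could look like:

rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"   # rename the raw counter...
      as: "${1}_per_second"     # ...to a rate-style metric name
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'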

External Metrics with KEDA

For event-driven or queue-based workloads, you can use KEDA as a drop-in external scaler. It plugs into HPA via the External Metrics API and supports 40+ backends (like Kafka, Redis, AWS SQS, etc.).
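
As a hedged sketch, a ScaledObject scaling a hypothetical queue-worker Deployment on AWS SQS backlog might look like this (the queue URL and thresholds are placeholders, and SQS access also needs a TriggerAuthentication or pod identity, omitted here):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker-scaler
spec:
  scaleTargetRef:
    name: queue-worker        # Deployment to scale
  minReplicaCount: 0          # KEDA can scale idle workloads to zero
  maxReplicaCount: 20
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/jobs
        queueLength: "5"      # target messages per replica
        awsRegion: us-east-1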

Best Practices for Using HPA in Production

HPA works well in many environments—but only if it’s configured thoughtfully. Without tuning, it can scale too slowly, too aggressively, or in ways that conflict with other autoscalers.

Here are key best practices for running HPA safely in production:

Tune Stabilization Windows and Scaling Behavior

By default, HPA evaluates every 15 seconds, but it helps to define scale-up and scale-down delay windows to prevent flapping (rapid back-and-forth scaling).

behavior:
  scaleUp:
    stabilizationWindowSeconds: 30
  scaleDown:
    stabilizationWindowSeconds: 300


Shorter windows make HPA more responsive, but may cause instability. Longer windows increase stability, but reduce agility.
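
Beyond stabilization windows, autoscaling/v2 also accepts rate-limit policies in the same behavior block; the values below are illustrative:

behavior:
  scaleUp:
    stabilizationWindowSeconds: 30
    policies:
      - type: Percent
        value: 100            # at most double the replica count
        periodSeconds: 60     # per 60-second window
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
      - type: Pods
        value: 2              # remove at most 2 pods
        periodSeconds: 60     # per 60-second window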

Always Define minReplicas and maxReplicas

These limits prevent HPA from scaling down to zero or scaling up to unschedulable levels. They also help the Cluster Autoscaler calculate required capacity.

Avoid Overlap with VPA on the Same Metric

If you’re using HPA for CPU, don’t let VPA adjust CPU requests for the same workload. Safe patterns include:

  • HPA on CPU, VPA on memory only
  • HPA active, VPA in recommendation mode (updateMode: Off; see the sketch below)
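
A minimal sketch of the second pattern, assuming the VPA CRDs are installed (the names here are illustrative):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: sample-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sample-app
  updatePolicy:
    updateMode: "Off"   # recommend only; never evict or resize pods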

Use External Metrics for Business Logic

For request rates, queue length, or SLOs, you can scale using custom or external metrics. Combine HPA with tools like Prometheus Adapter or KEDA for more flexible logic.

Coordinate with Cluster Autoscaler

HPA doesn’t check if the cluster has room to schedule new pods. It just creates them. Cluster Autoscaler will eventually add nodes, but there can be delays. Use right-sized requests, proper node pools, and monitor binpacking efficiency.

Advanced HPA Use Cases

HPA is commonly used with CPU or memory metrics, but it becomes much more powerful when combined with custom metrics and observability tools. Here are advanced patterns that engineering teams apply in production environments.

Scale by Business Metrics, Not Just CPU

Some of the most effective scaling strategies involve metrics tied directly to user experience:

  • Request rate (e.g. QPS from Prometheus)
  • Queue depth (e.g. Redis or SQS backlog)
  • Latency thresholds (e.g. 95th percentile > 250ms)

To scale this way, you’ll need:

  • A Prometheus-compatible metrics pipeline
  • The Prometheus Adapter exposing those metrics to the HPA controller

These metrics can be configured as custom or external metrics in HPA. They allow you to scale based on meaningful load, not just system resource consumption.
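
In HPA terms, an external metric target looks like the following (queue_depth is a placeholder for whatever your metrics pipeline actually exposes):

- type: External
  external:
    metric:
      name: queue_depth        # hypothetical metric from the External Metrics API
    target:
      type: AverageValue
      averageValue: "30"       # target backlog per replica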

Event-Driven Scaling with KEDA

KEDA extends HPA by plugging into the External Metrics API. It supports more than 40 event sources, including:

  • Kafka
  • AWS SQS
  • Azure Service Bus
  • Prometheus queries
  • HTTP queue lengths

KEDA handles metric ingestion, then feeds values to HPA behind the scenes—letting you scale workloads from 0 to N based on real-world usage.

HPA During Rollouts and CI/CD

HPA can behave unpredictably during deployments. For example:

  • A rolling update with high maxSurge may temporarily double your pods
  • HPA might see an artificial CPU spike and scale unnecessarily

To reduce this risk:

  • Adjust rollout settings (maxSurge, maxUnavailable; see the sketch after this list)
  • Use stabilization windows and scale rate limits
  • Monitor HPA and rollout behavior together
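
For the first point, a conservative Deployment strategy (illustrative values) caps how many extra pods a rollout can create:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1           # at most one extra pod during the rollout
    maxUnavailable: 0     # never drop below the desired replica count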

Observability and Cost Awareness

As you scale more intelligently, you’ll also want to track how HPA decisions affect:

  • Node allocation
  • Binpacking
  • Cost efficiency

For this, tools like Kubernetes cost optimization help surface the cost impact of scaling policies—and highlight where fine-tuning or workload reshaping is needed.

Operational Constraints of HPA 

The Horizontal Pod Autoscaler helps solve a core scaling problem in Kubernetes: adjusting the number of pods based on real-time demand. It uses metrics like CPU or memory to scale services up or down as traffic changes—reducing the need for manual intervention.

But in production, HPA has real limitations. It reacts to system-level metrics that don’t always reflect user load. It assumes your pods are already correctly sized. And it has no visibility into how its scaling decisions affect cluster utilization or cost. These constraints make HPA useful for reactive scaling—but hard to rely on for efficiency or long-term optimization.

DevZero builds on the same goal—dynamic scaling—but solves the gaps that limit HPA in practice.

Workload rightsizing: HPA increases pod count, but doesn’t touch resource requests. DevZero adjusts CPU and memory requests on running pods in real time, ensuring they match actual usage and avoid over-provisioning.

Live migration: HPA assumes the cluster can handle new pods. DevZero helps when it can’t—migrating workloads safely between nodes to improve scheduling and avoid capacity issues.

Binpacking optimization: HPA may increase replicas, but doesn’t improve how efficiently they run. DevZero actively redistributes workloads to reduce fragmentation and improve node usage.

Visibility into cost impact: HPA acts blindly when it comes to cost. DevZero connects scaling behavior to real infrastructure spend, giving teams the ability to measure and optimize efficiency, not just performance.

In short, HPA reacts to load. DevZero helps you respond intelligently—with real-time tuning, workload awareness, and cost visibility. Learn more →

Reduce Your Cloud Spend with Live Rightsizing MicroVMs
Run workloads in secure, right-sized microVMs with built-in observability and dynamic scaling. With a single operator, you are on the path to reducing cloud spend.
Get full visibility and pay only for what you use.