
Kubernetes VPA: How It Works and When to Use It

Alberto Grande

Head of Marketing

May 22, 2025


Kubernetes makes it easy to run workloads in containers, but setting the right CPU and memory requests is still a guessing game. Over-provisioned pods waste resources. Under-provisioned pods get throttled—or worse, evicted.

The Vertical Pod Autoscaler (VPA) helps solve this. It automatically adjusts CPU and memory requests and limits for your pods based on observed usage. Instead of scaling out like HPA, VPA scales up or down the resources each pod needs.

This guide is part of our autoscaling series and focuses specifically on how VPA works, when to use it, what limitations to watch for, and how to go beyond it with real-time, cost-aware optimization.

How Kubernetes VPA Works

The Vertical Pod Autoscaler (VPA) continuously monitors the resource usage of your pods and recommends updated CPU and memory requests. Unlike the Horizontal Pod Autoscaler (HPA), VPA doesn’t add or remove pod replicas—it adjusts the size of each pod.

VPA is composed of three components, each handling a different part of the process:

Component | Role | Triggers
Recommender | Analyzes historical CPU and memory usage and generates resource suggestions | Runs continuously
Updater | Decides when to apply new recommendations by evicting pods | Pod lifecycle events or thresholds exceeded
Admission Controller | Injects recommended resources at pod creation time | Every new pod start

Here’s how it works in practice:

  1. Recommender collects metrics from pods and generates target CPU/memory values.
  2. Updater decides whether to evict a pod to apply the recommendation (based on policies).
  3. Admission Controller mutates pod specs on startup to apply recommendations automatically.

⚠️ Note: If the updateMode is set to "Auto", VPA may evict and restart pods to apply new values. This can cause downtime if not planned for.

This architecture allows VPA to gradually adapt pod sizing over time, but it also means updates aren’t instant—and pod restarts can affect service stability.
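
For example, once a VerticalPodAutoscaler object is attached to a workload (shown later in this guide), you can inspect the Recommender’s output directly. Here’s a trimmed sketch of the relevant status section, assuming a VPA object named sample-app-vpa; the values are illustrative:

kubectl describe vpa sample-app-vpa

Status:
  Recommendation:
    Container Recommendations:
      Container Name:  app
      Lower Bound:
        Cpu:     25m
        Memory:  64Mi
      Target:
        Cpu:     100m
        Memory:  256Mi
      Upper Bound:
        Cpu:     500m
        Memory:  512Mi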

When to Use VPA

VPA is best suited for workloads where scaling out (adding replicas) isn’t effective—or where tuning resource requests manually is inefficient. It helps teams rightsize CPU and memory for individual pods, especially in systems where consistent performance depends on how much is allocated per instance.

Ideal Use Cases

  • Memory-bound applications like Java, Spark, or ML workloads
  • Batch jobs that vary in resource usage over time
  • Internal APIs or services where you prefer fewer, well-sized pods
  • Stateful applications that can’t scale horizontally easily
  • Development and test clusters where developers often guess resource requests

VPA vs HPA vs Cluster Autoscaler

Capability | VPA | HPA | Cluster Autoscaler (CA)
What it scales | Pod CPU/memory requests and limits | Pod replica count | Number of cluster nodes
Acts on live pods? | Only with eviction or restart | Yes | No (infra-level only)
Scaling trigger | Historical resource usage | Current CPU/memory or custom metrics | Pending pods / idle nodes
Can it downscale? | Yes (with restarts) | Yes | Yes
Works with stateful apps? | Yes | Limited | Yes
Main benefit | Rightsizes containers | Scales out with load | Optimizes node-level capacity

If your workload is CPU-light but memory-heavy—or if you’re constantly adjusting requests to avoid throttling or OOM kills—VPA may be a better fit than HPA. Just keep in mind that updates often require a pod restart, so it’s best used when downtime is acceptable or easily mitigated.

Limitations and Trade-offs

While the Vertical Pod Autoscaler (VPA) solves important problems—like reducing over-provisioning and automating resource tuning—it also comes with trade-offs that make it unsuitable for certain workloads or setups.

Pod Restarts Are Required for Updates

VPA cannot resize a running pod. To apply new resource requests, it needs to evict and restart the pod. This creates a few challenges:

  • Stateful or long-lived apps may experience downtime
  • Pods using emptyDir or non-persistent storage lose data on restart
  • If the app isn’t restart-friendly, updates can introduce risk

This is why many teams run VPA in updateMode: "Off" to collect recommendations first, then apply them manually.
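
A minimal recommendation-only VPA looks like this; a sketch assuming a Deployment named sample-app (the full setup is covered in the installation section below):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: sample-app-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: sample-app
  updatePolicy:
    updateMode: "Off"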

Conflicts with HPA

VPA and HPA don’t work well together if both are configured to manage the same resource, like CPU or memory. Kubernetes doesn’t arbitrate between them: the two controllers fight over the same pods, producing unpredictable scaling behavior.

Safe patterns include:

  • HPA for replicas, VPA for memory only
  • Or using VPA in recommendation mode alongside HPA for scaling
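
One way to implement the memory-only pattern is the controlledResources field in the VPA resourcePolicy. A sketch of the relevant fragment, added under spec of a VPA object like the one above:

  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      controlledResources: ["memory"]

With this in place, VPA leaves CPU requests alone, so HPA can scale replicas on CPU without the two controllers fighting over the same values.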

Limited Signal Awareness

VPA only uses historical CPU and memory usage to generate recommendations. It does not consider:

  • Request rate
  • Latency
  • I/O
  • Business-level metrics           

This makes it less effective for:

  • Highly dynamic workloads
  • Latency-sensitive systems where usage ≠ demand         

Metrics Need Time to Stabilize

VPA relies on aggregated metrics. Short-lived or bursty pods may not generate enough consistent data for meaningful recommendations.

Summary

VPA is a powerful tool for container rightsizing—but it’s not a drop-in solution. It’s best deployed with awareness of pod lifecycle, scaling strategy, and observability needs.

Installing and Configuring VPA

VPA is not included in Kubernetes by default—you’ll need to deploy it as a set of components maintained by the SIG Autoscaling group. Setup is straightforward, but configuration choices (especially update modes) will affect how safely VPA operates.

Step 1: Install VPA Components

You can deploy the official Vertical Pod Autoscaler with the setup script from the kubernetes/autoscaler GitHub repo:

git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

This installs the three core components:

  • vpa-recommender
  • vpa-updater
  • vpa-admission-controller

Make sure your cluster has:

  • Metrics Server installed and working
  • RBAC enabled
  • Webhooks enabled (for the Admission Controller)
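
A quick sanity check before attaching any VPA objects; the components land in the kube-system namespace by default:

kubectl get pods -n kube-system | grep vpa
kubectl top nodes

If kubectl top nodes returns usage numbers, the Metrics Server is working.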

Step 2: Create a Deployment

Here’s a basic deployment using static resource requests:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      containers:
      - name: app
        image: nginx
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "200m"
            memory: "256Mi"


Step 3: Attach a VPA Resource

Now define the VerticalPodAutoscaler object to manage the resource requests.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: sample-app-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: sample-app
  updatePolicy:
    updateMode: "Auto"


Update Modes Explained

Mode | What it does | When to use it
Off | Only generates recommendations; no action taken | Safest for testing and monitoring
Initial | Applies recommendations only when pods are created; never touches running pods | Batch jobs and workloads that restart on their own
Recreate | Evicts running pods to apply new values | When updates must be enforced and restarts are acceptable
Auto | Currently equivalent to Recreate; future versions may apply updates in place | Non-critical workloads or dev environments

🛑 Be cautious with "Auto" mode in production—it may restart pods at inconvenient times.

Once installed, VPA runs continuously in the background and updates recommendations as it observes usage patterns.

Best Practices for Using VPA in Production

Running VPA in production requires careful planning. It’s not just about enabling autoscaling—it’s about controlling when and how resource updates happen, and avoiding disruptions in critical workloads.

1. Start in Observation Mode (updateMode: "Off")

Begin with VPA in passive mode to collect recommendations without applying changes. This lets you:

  • Validate whether your resource requests are misaligned
  • Understand usage patterns before taking action
  • Avoid surprises in production

Use this data to rightsize manually, or switch to automated modes once you’re confident.
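
To pull the current targets out programmatically, for dashboards or manual rightsizing, one option is a jsonpath query (assuming the sample-app-vpa object from the installation section):

kubectl get vpa sample-app-vpa -o jsonpath='{.status.recommendation.containerRecommendations[0].target}'

This prints the recommended CPU and memory for the first container, which you can compare against the requests currently set in the Deployment.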

2. Choose the Right Update Mode

Here’s a quick breakdown of when to use each mode:

Mode | Applies Changes Automatically? | Causes Pod Restarts? | Recommended For
Off | No | No | Baseline visibility, production clusters
Initial | Only at pod creation | No (waits for natural restarts) | Batch jobs, frequently redeployed services
Recreate | Yes, by evicting pods | Yes | Workloads where enforced restarts are acceptable
Auto | Yes, currently by evicting pods | Yes | Non-critical workloads, dev/test environments

3. Avoid Conflicts with HPA

If you’re using both VPA and HPA:

  • Do not target the same resource (e.g. CPU)
  • Safe pattern: HPA scales replicas based on CPU; VPA adjusts memory requests only
  • Alternatively, use VPA in Off mode for recommendations while HPA handles live scaling
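
Here’s a sketch of the HPA half of that pattern, scaling replicas on CPU utilization while a memory-only VPA (like the resourcePolicy fragment shown earlier) handles requests; the names and thresholds are illustrative:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sample-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sample-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70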

4. Combine VPA with Cluster Autoscaler

VPA can reduce resource requests, which allows the Cluster Autoscaler to pack more pods onto fewer nodes. This improves binpacking and can reduce cloud spend—especially in clusters with mixed workloads.

But remember: if VPA suddenly increases resource requests, pods may go unschedulable unless the Cluster Autoscaler is fast enough to provision space.

5. Monitor Impact with Cost and Usage Metrics

Adjusting resource requests affects:

  • Node binpacking efficiency
  • Pod priority and scheduling
  • Overall cluster cost

It’s important to track how VPA decisions translate to infrastructure behavior. This is where a Kubernetes cost monitoring tool helps—by connecting usage changes to real spend.

Advanced VPA Use Cases

VPA is often treated as a basic resource tuning tool—but it can also support more advanced scenarios, especially when combined with observability and deployment automation.

1. Memory-Bound or ML Workloads

Machine learning jobs and JVM-based services (built on Java, Scala, or frameworks like Spark) often don’t scale well horizontally. They need:

  • High memory per pod
  • Stable performance across execution cycles

VPA helps here by gradually learning resource profiles over time. It allows teams to:

  • Avoid manual tuning per job run
  • Reduce OOM kills and inefficient over-provisioning
  • Adapt to seasonal or dataset-based memory usage shifts

2. Batch Jobs and CronJobs

Short-lived jobs often have unpredictable spikes in resource use. VPA can:

  • Recommend requests based on past executions
  • Allow tighter binpacking across job waves
  • Work well with updateMode: "Initial", which applies fresh recommendations each time a job’s pods are created

If you’re running time-sensitive ETL, data prep, or distributed compute jobs, VPA helps avoid both under- and over-resourcing.
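
VPA can target a CronJob directly, so each run’s pods start with the latest recommendation. A sketch, assuming a hypothetical CronJob named nightly-etl:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: nightly-etl-vpa
spec:
  targetRef:
    apiVersion: "batch/v1"
    kind: CronJob
    name: nightly-etl
  updatePolicy:
    updateMode: "Initial"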

3. Scheduled Resource Resetting

Some teams use VPA to reset resource requests during off-peak hours:

  • Run VPA in Auto mode during maintenance windows
  • Let it update requests and evict pods without user impact
  • Switch back to Off mode during peak hours

This hybrid approach blends automation with operational control—especially useful for clusters with strict uptime or compliance requirements.
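
The toggle itself is a one-line patch, which could be run from a scheduled job; assuming the sample-app-vpa object from earlier:

kubectl patch vpa sample-app-vpa --type merge -p '{"spec":{"updatePolicy":{"updateMode":"Auto"}}}'

kubectl patch vpa sample-app-vpa --type merge -p '{"spec":{"updatePolicy":{"updateMode":"Off"}}}'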

4. VPA + Observability Tools

VPA only acts on CPU and memory metrics. But when paired with observability platforms, you can:

  • Validate VPA behavior against latency and SLOs
  • Flag over-aggressive recommendations
  • Feed insights into custom dashboards or cost analysis tools

If you use something like Prometheus + Grafana or a Kubernetes cost optimization tool, this unlocks much deeper tuning.

Operational Constraints of VPA

The Vertical Pod Autoscaler helps solve a common Kubernetes problem: poorly sized workloads. It analyzes historical CPU and memory usage and recommends better resource requests—reducing over-provisioning and manual tuning.

But while useful, VPA has real limitations when used in production. It isn’t designed for real-time responsiveness, introduces disruption when applying changes, and lacks broader context like cost or scheduling efficiency. These constraints make it helpful for offline recommendations—but difficult to rely on for live, automated optimization.

DevZero extends the same intent behind VPA, but addresses these operational gaps.

  • Workload Rightsizing: VPA suggests better resource values but requires restarts to apply them. DevZero adjusts CPU and memory requests on running pods, in real time, without evictions. This eliminates the downtime and complexity associated with production resizing.

  • Live Migration: One of the biggest risks of VPA is that it triggers restarts. DevZero safely migrates workloads across nodes by pausing and resuming execution—avoiding cold starts and service disruption during optimization cycles.

  • Binpacking: VPA reduces pod size, which helps indirectly with binpacking—but DevZero goes further. It actively redistributes workloads across nodes based on updated resource profiles, improving density and reducing the number of active nodes needed.

  • Visibility into cost impact: VPA has no awareness of infrastructure cost. DevZero ties resource decisions to actual spend, so platform teams can see how changes affect node utilization and cloud cost—closing the loop between tuning and business impact.

In short, VPA shows you what to fix—DevZero makes it actionable, safe, and continuous. It’s the next step for teams that want the benefits of autoscaling without the operational trade-offs. Learn more →

Reduce Your Cloud Spend with Live Rightsizing MicroVMs
Run workloads in secure, right-sized microVMs with built-in observability and dynamic scaling. Deploy a single operator and you’re on the path to reducing cloud spend.
Get full visibility and pay only for what you use.