Workload Optimization

How DevZero generates and applies workload optimization recommendations -- rightsizing, replica scaling, and live migration.

Lowering the sampling rate in your zxporter settings reduces the usage data available to DevZero and can degrade the quality of its recommendations.

Recommendation Modes

Different workloads have different risk tolerances, traffic profiles, and scaling behavior. Recommendation modes allow you to choose how aggressive or conservative the system should be when reducing resources.

In automated mode, when the write operator applies recommendations, it currently updates only requests and leaves limits unchanged. This is a reliability measure.

| Mode | Requests | Limits | Notes |
|------|----------|--------|-------|
| Balanced | Use max observed usage, capped to avoid more than 50% drop. | Adjusted to 75% of current limit, never below new request. | Default. Recommended for most workloads. |
| Aggressive | Use P90 of max (current and historical utilization). | Set to max of 1.5x current max or 75% of current limits. | Backed by reinforcement learning. |
| Conservative | Set to 1.2x max of current utilization. | Left unchanged from current values. | For critical or stateful workloads. |

Example

For a workload with peak observed usage of 4 cores and 12 Gi over the past 12 hours -- currently requesting 9 cores and 32 Gi, with limits set to 14 cores and 48 Gi:

Balanced mode:

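  • CPU Requests: 4.5 cores (max observed usage is 4 cores; the cut is capped at 50% of the current 9-core request)
  • Memory Requests: 16 Gi (max observed usage is 12 Gi; the cut is capped at 50% of the current 32 Gi request)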
  • CPU Limits: 10.5 cores (75% of current 14-core limit)
  • Memory Limits: 36 Gi (75% of current 48 Gi limit)

Requests are set to the maximum observed usage (P100). If this would reduce requests by more than 50%, the cut is capped at 50% of current requests. Limits are set to 75% of current values, but always >= the recommended requests.

Conservative mode:

  • CPU Requests: 4.8 cores (1.2x max usage)
  • Memory Requests: 14.4 Gi (1.2x max usage)
  • CPU Limits: 14 cores (unchanged)
  • Memory Limits: 48 Gi (unchanged)

Requests are set to 1.2x the max observed usage, providing extra headroom. Limits are left unchanged.

Aggressive mode:

  • CPU Requests: 3.2 cores (P90 of usage)
  • Memory Requests: 9 Gi (P90 of usage)
  • CPU Limits: 6 cores (1.5x max usage)
  • Memory Limits: 18 Gi (1.5x max usage)

Requests are based on the 90th percentile (P90) of observed usage. Limits are set to the greater of 1.5x max usage or 75% of current limits. This mode is optimized for cost and assumes the workload can tolerate throttling.
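
The Balanced and Conservative rules described above can be sketched in a few lines of Python. This is only an illustration of the stated arithmetic, not DevZero's implementation: the `Resource` class and the function names are hypothetical. Applying the 50% cap literally, Balanced requests for the example workload come out at 4.5 cores and 16 Gi, because the peak usage of 4 cores and 12 Gi lies below half of the current requests. Aggressive mode is left out because it needs the full usage history to compute P90.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """One resource dimension (CPU in cores or memory in Gi). Hypothetical helper type."""
    max_usage: float        # peak observed usage (P100)
    current_request: float
    current_limit: float


def balanced(r: Resource) -> tuple[float, float]:
    """Return (request, limit) under the Balanced rules as stated in the text."""
    # Request: max observed usage, but never cut more than 50% of the current request.
    request = max(r.max_usage, 0.5 * r.current_request)
    # Limit: 75% of the current limit, never below the new request.
    limit = max(0.75 * r.current_limit, request)
    return round(request, 2), round(limit, 2)


def conservative(r: Resource) -> tuple[float, float]:
    """Return (request, limit) under the Conservative rules as stated in the text."""
    # Request: 1.2x max observed usage. The limit is left unchanged.
    return round(1.2 * r.max_usage, 2), r.current_limit


# The example workload: 4 cores / 12 Gi peak, requesting 9 cores / 32 Gi,
# with limits of 14 cores / 48 Gi.
cpu = Resource(max_usage=4, current_request=9, current_limit=14)
mem = Resource(max_usage=12, current_request=32, current_limit=48)

for name, rule in (("balanced", balanced), ("conservative", conservative)):
    print(name, "cpu:", rule(cpu), "memory:", rule(mem))
```

Keeping each mode as a small pure function makes it easy to check a recommendation by hand before enabling automated mode.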

Replica Count Adjustments (HPA-Aware)

When a workload is significantly over-provisioned, DevZero may also recommend adjusting its replica count.

Applies only when:

  • The workload has more than one replica
  • CPU, GPU, or network bandwidth data is available

Mode multipliers:

  • Aggressive: 1.0 (assumes optimal resource use)
  • Balanced: 1.5 (allows a buffer)
  • Conservative: 2.0 (assumes higher future demand)
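
The text does not say what the mode multipliers are applied to, so the following is only a guess at how such a buffer could work. It assumes (this is not confirmed by DevZero) that the multiplier scales the observed aggregate CPU usage, that the result is divided by the per-replica CPU request, and that the recommendation only ever scales down. The function and parameter names are hypothetical.

```python
import math

# Mode multipliers as listed above.
MODE_MULTIPLIER = {"aggressive": 1.0, "balanced": 1.5, "conservative": 2.0}


def recommended_replicas(total_cpu_usage: float,
                         cpu_request_per_replica: float,
                         current_replicas: int,
                         mode: str) -> int:
    """ASSUMED formula: replicas needed to cover observed usage times the mode buffer."""
    if current_replicas <= 1:
        # Replica recommendations apply only to workloads with more than one replica.
        return current_replicas
    buffered_usage = total_cpu_usage * MODE_MULTIPLIER[mode]
    needed = math.ceil(buffered_usage / cpu_request_per_replica)
    # Assumption: only scale down, and never go below one replica.
    return max(1, min(current_replicas, needed))


# Hypothetical workload: 10 replicas requesting 1 core each, using 2.4 cores in total.
for mode in MODE_MULTIPLIER:
    print(mode, recommended_replicas(2.4, 1.0, 10, mode))
```

Under these assumptions a larger multiplier keeps more replicas running, which matches the intent of the Conservative mode.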

Node Recommendations

Avoid running multiple node autoscalers in the same cluster at the same time.

Node recommendations consider:

  • Instance availability in the region and availability zone
  • Current shape of node groups (CPU/Memory/GPU/network bandwidth)
  • Taints, tolerations, and affinity/anti-affinity rules
  • Cloud provider pricing
  • Number of pods running on each node and their utilization

Live Migration

Live migration allows you to checkpoint running workloads and restore them on different nodes without losing application state.

Currently in beta.

Install the agent and scheduler alongside the DevZero Workload Operator. Navigate to your cluster's Overview page in the dashboard, click Operators, and select Workload Operator to get the pre-configured Helm install command.

Label nodes that support live migration:

kubectl label node <node-name> dakr.devzero.io/checkpoint-node=true

Validate the label and ensure the containerd shim is present:

kubectl get nodes -l "dakr.devzero.io/checkpoint-node=true"
kubectl logs daemonset/dakr-dakr-operator-agent -n dakr-operator -c installer

Run workloads on labeled nodes (set nodeSelector to target them).
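
As an illustration of the nodeSelector step, a Deployment can target the labeled nodes as in the manifest below. The workload name, replica count, and image are placeholders; only the label key and value come from the labeling command above.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                    # placeholder workload name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      nodeSelector:
        # Schedule pods only on nodes labeled for live migration.
        dakr.devzero.io/checkpoint-node: "true"
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.0   # placeholder image
```

The label value is quoted because nodeSelector values must be strings, and an unquoted true would be parsed as a boolean.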

Apply workload recommendations that have live-migration enabled via the dashboard.
