DevZero

Recommendations

Recommendation policies for optimizing resource utilization

Tuning zxporter settings to reduce sampling rates leaves fewer data points to analyze, which will affect the efficiency and effectiveness of recommendations.

Policy

Recommendations are generated based on active (and historical) behavior.

Nodes (and node groups/pools)

Running multiple node autoscalers in the same cluster at the same time is not recommended.

The following parameters influence recommendation scoring:

  • Instance availability in {region, availability zone}.
  • Current shape and demographics of node {group, pool}:
    • Resources available (CPU/Memory/GPU devices/block devices/network bandwidth/...)
    • Taints
    • Toleration (workloads)
    • Affinity/Anti-affinity (workloads)
  • Cloud provider pricing.
  • Number of candidates for removal (based on recommendation mode).
  • Number of (non-DaemonSet) pods running on node.
  • Number of StatefulSet pods running on node.
  • Pod-level underutilization.
  • Node-level underutilization.

Workloads

Workload policies attempt to binpack by default, irrespective of whether node recommendation policies are set up.
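To make "binpack" concrete, here is a generic first-fit-decreasing sketch in Python. It is only an illustration of the general technique, not the algorithm the recommendation engine uses: it considers a single resource (CPU), assumes identical nodes, and ignores the tolerations and affinity rules listed below. The function name and parameters are hypothetical.

```python
def first_fit_decreasing(pod_cpu_requests, node_cpu_capacity):
    """Pack pod CPU requests onto as few identical nodes as this heuristic finds.

    Illustrative only: one resource dimension, no taints, tolerations,
    or affinity constraints.
    """
    nodes = []  # each entry is [remaining_capacity, [pod requests placed here]]
    # Placing the largest pods first tends to leave fewer unusable gaps.
    for request in sorted(pod_cpu_requests, reverse=True):
        if request > node_cpu_capacity:
            raise ValueError(f"pod requesting {request} does not fit on any node")
        for node in nodes:
            if node[0] >= request:
                # First node with enough room wins.
                node[0] -= request
                node[1].append(request)
                break
        else:
            # No existing node had room, so open a new one.
            nodes.append([node_cpu_capacity - request, [request]])
    return [pods for _, pods in nodes]


if __name__ == "__main__":
    # Four pods on 4-core nodes.
    print(first_fit_decreasing([2, 1, 3, 2], node_cpu_capacity=4))
```

In this run the four pods fit on two fully used nodes, which is the kind of consolidation binpacking aims for.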

The following parameters influence recommendation scoring:

  • Instance availability in {region, availability zone}.
  • Current shape and demographics of pod specs.
  • Toleration
  • Affinity/Anti-affinity

Recommendation Modes

Different workloads have different risk tolerances, traffic profiles, and scaling behavior. Recommendation modes allow you to choose how aggressive or conservative you'd like the system to be when reducing resources.

In automated mode, when dakr-operator applies recommendations, it currently resets only requests and leaves limits untouched. This is a reliability measure and may change in the future.
| Mode | Requests | Limits | Notes |
|---|---|---|---|
| Balanced | Use max observed usage, but capped to avoid more than 50% drop. | Adjusted to 75% of current limit, but never below the new request. | [default] Recommended default for most workloads. |
| Aggressive | Use P90 of max (current and historical utilization). | Set to the max of 1.5× current max utilization or 75% of current limits. | Backed by a reinforcement learning algorithm. |
| Conservative | Set to 1.2× max of current utilization. | Left unchanged from current values. | Suggested for critical or stateful workloads. |

Consider a workload with peak observed usage of 4 cores and 12 Gi over the past 12 hours, currently requesting 9 cores and 32 Gi, with limits set to 14 cores and 48 Gi. Under the default Balanced mode, the recommendation might look like this:

  • CPU Requests: 4 cores (max observed usage)
  • Memory Requests: 12 Gi (max observed usage)
  • CPU Limits: 10.5 cores (75% of current 14-core limit)
  • Memory Limits: 36 Gi (75% of current 48 Gi limit)
  • Behavior:
    • Requests are set to the maximum observed usage (P100).
    • If this would reduce requests by more than 50%, the cut is capped at 50% of current requests.
    • Limits are set to 75% of current values, but always ≥ the recommended requests.
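The Balanced and Conservative rules above can be sketched in a few lines of Python. This is a minimal sketch under a literal reading of those rules, not the actual implementation: the function names and the `Recommendation` type are hypothetical, and Aggressive mode is left out because its percentile and learning-based inputs are not specified here. Under this reading the 50% cap is binding for the example workload (a 9-core request can fall no lower than 4.5 cores, and a 32 Gi request no lower than 16 Gi), so the sketch returns the capped values rather than the raw peaks of 4 cores and 12 Gi.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    request: float
    limit: float


def balanced(peak_usage, current_request, current_limit):
    """Balanced mode, read literally (assumption, not the real implementation)."""
    # Requests: max observed usage, but never cut by more than 50%.
    request = max(peak_usage, 0.5 * current_request)
    # Limits: 75% of the current limit, but never below the new request.
    limit = max(0.75 * current_limit, request)
    return Recommendation(request, limit)


def conservative(peak_usage, current_request, current_limit):
    """Conservative mode: 1.2x peak usage for requests, limits left unchanged."""
    # current_request is unused here; it is kept so both modes share a signature.
    return Recommendation(1.2 * peak_usage, current_limit)


if __name__ == "__main__":
    # Example workload: peak 4 cores / 12 Gi, requests 9 cores / 32 Gi,
    # limits 14 cores / 48 Gi.
    print("cpu   ", balanced(4, 9, 14))     # request=4.5, limit=10.5
    print("memory", balanced(12, 32, 48))   # request=16.0, limit=36.0
    print("cpu   ", conservative(4, 9, 14))
```

Whether the production system applies the cap in exactly this way is not stated, so treat the capped figures as one possible interpretation.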

Replica Count Adjustments (HPA-Aware)

In some cases, we will recommend adjusting the replica count of a workload if it's significantly over-provisioned based on CPU/GPU usage trends.

  • Applies only when:
    • The workload has more than one replica
    • One or more of the following is available and can be acted upon:
      • Network bandwidth information
      • GPU usage and GPU VRAM usage data
  • Based on the selected mode:
    • Aggressive: Assumes optimal resource use, multiplier = 1.0
    • Balanced: Allows a buffer, multiplier = 1.5
    • Conservative: Assumes higher future demand, multiplier = 2.0

Broadly, our methodology reduces to the following formula, although our actual implementation is a bit more sophisticated:

$$
\text{recommendedReplicas} = \min\left(\text{currentReplicas},\ \left\lceil \frac{\text{totalUsage}}{\text{modeAdjustment} \times \text{targetPerReplica}} \right\rceil\right)
$$

This ensures workloads aren't over-replicated relative to CPU/GPU demand.
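The formula translates almost directly into code. The sketch below is a simplified illustration, not the production logic: the parameter names mirror the symbols in the formula, `mode_adjustment` is passed in directly because the text does not spell out how the per-mode multipliers map onto it, and the floor of one replica is an added assumption.

```python
import math


def recommended_replicas(current_replicas, total_usage,
                         mode_adjustment, target_per_replica):
    """min(currentReplicas, ceil(totalUsage / (modeAdjustment * targetPerReplica)))."""
    # Replica recommendations apply only to workloads with more than one replica.
    if current_replicas <= 1:
        return current_replicas
    if mode_adjustment <= 0 or target_per_replica <= 0:
        raise ValueError("mode_adjustment and target_per_replica must be positive")
    needed = math.ceil(total_usage / (mode_adjustment * target_per_replica))
    # Assumption: never recommend fewer than one replica.
    needed = max(needed, 1)
    # The min() means this formula only ever scales down, never up.
    return min(current_replicas, needed)


if __name__ == "__main__":
    # 10 replicas using 6 cores in total, with a target of 2 cores per replica.
    print(recommended_replicas(10, 6.0, mode_adjustment=1.0, target_per_replica=2.0))  # 3
    # Usage that would need more replicas than exist leaves the count unchanged.
    print(recommended_replicas(4, 20.0, mode_adjustment=1.0, target_per_replica=2.0))  # 4
```

Because the result is bounded by the current replica count, scaling up is left to whatever autoscaler (for example an HPA) already manages the workload.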


Here's how to select the right mode for your use case:
