Inference Platform

Run inference in production.
Cost, latency, and reliability — proven before you ship.

DevZero's inference platform: a gateway, a shadow cache, an eval lab, and full attribution — so cost, latency, and reliability are all proven against your own traffic before you ship.

Cost

40–70% lower spend.

Latency

p95 you can ship.

Reliability

Survive provider outages.

Attribution

Every token traceable.

Illustration: inference calls from OpenAI, Anthropic, and Gemini providers flow through a single DevZero gateway, which routes each call to one of three lanes — cache hit, direct forward, or cheaper-model swap — while a running dollar total ticks in the footer.
Inference Gateway · live
Providers: OpenAI, Anthropic, Gemini · Lanes: CACHE, DIRECT, SWAP
Calls 56,732 · Saved $12,481 · Hit Rate 38.6%

The Problem

Your AI stack is fragile, slow, AND expensive.

Five inference APIs. Three providers. p95 spikes when one rate-limits, silent failure when one goes dark, and a CFO who wants an answer by Friday. You don't have a visibility problem — you have a coordination problem across cost, latency, and reliability. And observability alone won't fix it.

Observe

See every call, session, and prompt cluster — across every provider, in one metering surface.

Simulate

Dry-run caching and model swaps against your own traffic. Measure hit rates and quality deltas before you commit.

Automate

Roll out the winning change. Keep quality honest as it runs, with live divergence tracking.

Unified Gateway

One LLM gateway. Every provider. Automatic failover.

DevZero's self-hosted AI gateway sits in front of every inference call — OpenAI, Anthropic, Gemini, Bedrock, Azure, Mistral, anything OpenAI-SDK compatible — and meters them all in one place. When a provider rate-limits or goes dark, the gateway fails over without your app noticing, at sub-millisecond added latency per call. In practice, pointing an OpenAI-compatible SDK at the gateway is typically a base-URL change; see the sketch after the list below.

  • Self-hosted gateway — your API keys never leave your infrastructure.
  • Captures cost, latency, retries, tool-call %, and finish reason per call.
  • Works with streaming, function calling, and vision endpoints.
  • Tags every call with team, product, and workflow automatically.
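
A minimal sketch of that wiring, assuming the openai Python SDK; the gateway URL and the x-devzero-* tag headers are illustrative placeholders, not a documented DevZero interface:

```python
# Sketch: route an OpenAI-SDK call through a self-hosted gateway.
# base_url and the x-devzero-* headers are illustrative placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://llm-gateway.internal.example.com/v1",  # your self-hosted gateway
    api_key="sk-...",                                        # key stays in your infra
)

resp = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "user", "content": "Summarize this support ticket."}],
    extra_headers={                      # hypothetical attribution tags
        "x-devzero-team": "support",
        "x-devzero-workflow": "ticket-summarize",
    },
)
print(resp.choices[0].message.content)
```
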
Illustration: six inference providers — OpenAI, Anthropic, Gemini, AWS Bedrock, Azure OpenAI, and Mistral — fan into a single DevZero metering surface.
Any SDK. Any provider. OpenAI-SDK compatible.

Shadow Cache

Prove the savings AND the latency win before you flip the switch.

Shadow cache mirrors your live traffic in dry-run mode — every prompt is hashed and checked against the cache, but responses still come from the upstream provider. You see the cost reduction, the p95 improvement, and the drift, all measured against your own traffic before any caching is enabled. Flip it on with real numbers behind the decision.
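
Mechanically, the dry run looks roughly like this sketch. Exact prompt hashing stands in for the real similarity-band matching, and call_upstream() is a hypothetical stand-in for the provider call:

```python
# Sketch: shadow-cache dry run. Exact hashing stands in for the real
# similarity-band matching; call_upstream() is a hypothetical provider call.
import hashlib

shadow: dict[str, str] = {}                  # prompt-hash -> cached response
stats = {"requests": 0, "would_be_hits": 0}

def call_upstream(prompt: str) -> str:
    return f"response to: {prompt}"          # stand-in for OpenAI/Anthropic/...

def shadow_call(prompt: str) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    stats["requests"] += 1
    if key in shadow:
        stats["would_be_hits"] += 1          # counted, never served
    response = call_upstream(prompt)         # production path unchanged
    shadow[key] = response                   # warm the shadow cache
    return response

for p in ["reset my password", "reset my password", "refund status"]:
    shadow_call(p)
print(stats)                                 # {'requests': 3, 'would_be_hits': 1}
```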

Illustration: shadow cache runs alongside production traffic in dry-run mode. Out of 56,732 requests, 38.6% would hit the cache at the 90–100% similarity band, translating to $12,481 in simulated savings — measured before any caching is enabled.
Shadow Cache · dry-run · not yet serving traffic
Production · live: 56,732 req, every request forwarded to the upstream provider.
Shadow · dry-run: 21,899 would-be hits, $12,481 saved.
Requests by similarity band:
  • 90–100% · 21,899 req
  • 80–90% · 7,035 req
  • 70–80% · 3,461 req
  • 60–70% · 1,588 req
  • < 60% · 22,749 req
Flip the switch when the numbers hold up. No user-facing change, no code deploy — just a threshold.

Zero risk

Shadow traffic never serves a response. Production sees no change until you flip the threshold.

Similarity bands

Tune aggressiveness post-hoc. See exactly how many requests would hit at 90%+ vs 80%+ vs 70%+, and how each band moves p95 latency.

Cost AND latency proof

Every band shows would-be savings in dollars alongside the latency reduction from cache hits — the CFO and the SRE both get their answer.

Eval Lab

Swap models with evidence, not vibes.

Pick a workload. Pick the models you want to compare. Click run. DevZero replays your real traffic through every candidate and plots quality-vs-cost, so you can say “Haiku is good enough here” with a number behind it — not a gut call. The sketch after the list below shows the shape of that replay loop.

  • 20+ candidate models across Anthropic, OpenAI, Gemini.
  • Runs against your traffic, not a benchmark.
  • Quality score, cost per 1M tokens, latency — one view.
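
A sketch of the replay loop; complete() and judge() are hypothetical stand-ins for the provider call and the quality scorer, and the gpt-4.1-mini price is illustrative (the Haiku figure comes from the chart below):

```python
# Sketch: replay sampled production traffic through candidate models
# and collect quality vs. cost. complete() and judge() are hypothetical
# stand-ins; the gpt-4.1-mini price is illustrative.
COST_PER_1M = {"claude-haiku-4-5": 2.80, "gpt-4.1-mini": 3.10}

def complete(model: str, prompt: str) -> tuple[str, int]:
    return f"[{model}] {prompt}", len(prompt.split())   # toy response + token count

def judge(candidate: str, baseline: str) -> float:
    return 1.0 if candidate and baseline else 0.0       # toy quality score

def replay(samples: list[dict], candidates: list[str]) -> list[dict]:
    results = []
    for model in candidates:
        scores = []
        for s in samples:                               # real traffic, not a benchmark
            out, _ = complete(model, s["prompt"])
            scores.append(judge(out, s["baseline"]))    # scored vs. production output
        results.append({"model": model,
                        "quality": sum(scores) / len(scores),
                        "cost_per_1m": COST_PER_1M[model]})
    return results

print(replay([{"prompt": "refund status?", "baseline": "..."}],
             ["claude-haiku-4-5", "gpt-4.1-mini"]))
```
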
Illustration: quality-vs-cost scatter for eight candidate models on a customer-support workload. Each bubble is a model; size reflects latency; position reflects cost (x) and quality (y). Claude Haiku 4.5 is highlighted as the best efficient swap — hover any bubble to see details.
Eval Lab · customer-support · 50 samples · swap saves 93%
Axes: cheaper → (x) · higher quality → (y)
Recommended swap: claude-haiku-4-5 (Anthropic) · Quality 72% · Cost $2.80/1M · p50 410ms
Illustration: a single agent session composed of seven spans — three model calls and four tool calls — totaling 2 minutes 14 seconds and $0.0421 in spend, visualized as a horizontal Gantt timeline where bar widths show stage latency and the dollar values on the right show per-stage cost.
session · c7aa2a94-8c6 · 2m 14s
  • gpt-5.4 · plan · $0.0042
  • claude-opus-4-6 · retrieve · $0.0118
  • tool: search · $0.0009
  • tool: fetch · $0.0006
  • claude-sonnet-4-6 · reason · $0.0089
  • gpt-4.1-mini · finalize · $0.0153
  • tool: write · $0.0004
Calls 7 · Tokens 12.4k · Retries 1 · Spend $0.0421

Session-Level Cost

Trace every agent run.

See cost and latency for every agent run — and where the bill came from. Every span: model, tool call, retry. Every dollar mapped to the customer, the workflow, the prompt cluster.
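
To make "every span" concrete, here is a minimal sketch of the trace record such a session might carry; the field names are illustrative, not DevZero's schema:

```python
# Sketch: an agent session as a list of spans, rolled up like the
# timeline above. Field names are illustrative, not DevZero's schema.
from dataclasses import dataclass

@dataclass
class Span:
    kind: str    # "model" or "tool"
    name: str    # e.g. "gpt-5.4 · plan" or "tool: search"
    ms: int      # stage latency in milliseconds
    usd: float   # stage cost in dollars
    retries: int = 0

session = [
    Span("model", "gpt-5.4 · plan", 9_200, 0.0042),
    Span("tool", "tool: search", 1_100, 0.0009),
    Span("model", "claude-sonnet-4-6 · reason", 14_800, 0.0089, retries=1),
]

print(f"calls={len(session)}  "
      f"spend=${sum(s.usd for s in session):.4f}  "
      f"retries={sum(s.retries for s in session)}")
```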

Prompt Clusters

89 workloads, not 30,000 calls.

DevZero groups your traffic by what it actually does — customer support, SQL generation, classifier, summarizer — so cost conversations happen at the workflow level, not the request level. Finally a vocabulary your product team recognizes.

Each cluster is a semantic neighborhood. Each dot is one of your real prompts. We use this shape everywhere else on this page — recommendations, routing, evals.
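
As a sketch of the general shape, assuming embeddings plus k-means (not necessarily DevZero's method): embed each prompt, then cluster the embeddings.

```python
# Sketch: group prompts into semantic clusters via embeddings + k-means.
# The embedding source is assumed; toy random vectors stand in here.
import numpy as np
from sklearn.cluster import KMeans

def cluster_prompts(embeddings: np.ndarray, k: int = 6) -> np.ndarray:
    """embeddings: (n_prompts, dim) array, one row per prompt."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)

labels = cluster_prompts(np.random.rand(64, 384))               # 64 prompts, toy vectors
print({int(c): int((labels == c).sum()) for c in set(labels)})  # cluster sizes
```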

Illustration: sixty-four scattered prompt points converge into six semantic clusters labeled Customer support, SQL generation, Classification, Summarization, Code explain, and Translation.
Prompt Similarity Clusters · 89 clusters detected

Cache Quality

Cache hits without quality drift.

Every cached response is scored for semantic divergence against a live baseline. If a model's cached outputs start to diverge, the dashboard tells you which model, which band, and how bad — before users notice, before the CFO asks.
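
One plausible way to score that divergence is cosine distance between embeddings of the cached and live responses; a sketch, with embed() as a toy stand-in for a real embedding model:

```python
# Sketch: score semantic divergence as cosine distance between the
# cached response and a fresh live baseline. embed() is a toy stand-in.
import hashlib
import numpy as np

def embed(text: str) -> np.ndarray:
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    return np.random.default_rng(seed).random(384)   # deterministic toy vector

def divergence(cached: str, live: str) -> float:
    a, b = embed(cached), embed(live)
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 1.0 - cos            # 0.0 = same meaning, higher = drifting

print(f"{divergence('Your refund is on the way.', 'Refund issued today.'):.3f}")
```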

Illustration: cache quality scored by semantic divergence per model. Claude Sonnet 4.6 is most stable at 8.2% divergence; Gemini 2.5 Flash is the weakest at 61.9%.
Cache Quality · divergence by model · Watch: flash
  • claude-sonnet-4-6 · 8.2%
  • gpt-5.4-2026-03-05 · 14.5%
  • gemini-2.5-pro · 22.1%
  • claude-haiku-4-5 · 38.6%
  • gemini-2.5-flash · 61.9%
Baseline: live responses from the same model, same prompt · last 24h

Attribution

Trace every token to the team that spent it.

Every call carries automatic tags for team, product, workflow, and prompt cluster. They roll up into department-level views that match your org chart — so the CFO gets a bill that makes sense, and platform gets a usage policy lever they can actually pull. Chargeback, showback, and per-team rate limits, on the same surface.
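
The rollup itself is plain aggregation over tagged call records; a sketch with an illustrative record shape:

```python
# Sketch: roll tagged call records up into a department-level bill.
# The record shape is illustrative, not DevZero's schema.
from collections import defaultdict

calls = [
    {"team": "engineering", "workflow": "code-explain", "usd": 0.0042},
    {"team": "data-science", "workflow": "sql-generation", "usd": 0.0118},
    {"team": "engineering", "workflow": "summarizer", "usd": 0.0009},
]

bill: dict[str, float] = defaultdict(float)
for c in calls:
    bill[c["team"]] += c["usd"]

for team, spend in sorted(bill.items(), key=lambda kv: -kv[1]):
    print(f"{team:>12}  ${spend:.4f}")
```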

Illustration: monthly AI spend attributed to four departments. Engineering tops the list at $14,218, followed by Data Science at $8,903, Product at $4,211, and Support at $1,914.
Departments · this month · Total $29,246
  • Engineering · $14,218
  • Data Science · $8,903
  • Product · $4,211
  • Support · $1,914

Recommendations

Savings that come with a price tag.

Every recommendation — TTL tweaks, model swaps, prompt consolidations — arrives with a dollar amount and a quality forecast. Accept the ones that make sense. Skip the ones that don't. No hand-waving.

Illustration: three recommendations — a model swap saving $4,820 monthly, a TTL increase saving $1,205 monthly, and enabling the shadow cache in production saving $12,481 monthly.
  • Model swap · customer-support: claude-opus-4-6 → claude-haiku-4-5 · 96% retained · $4,820 / mo
  • Increase TTL · SQL-generation: /v1/messages · ttl 5m → 1h · divergence < 4% · $1,205 / mo
  • Enable cache · classifier: shadow mode → production · 38.6% hit rate sustained · $12,481 / mo
Accept with one click. Each action is reversible.

Numbers the CFO Cares About

One dashboard. One conversation.

Based on a typical mid-stage deployment after 30 days.

Dashboard metrics: gross 30-day spend · net after cache · cache hit rate · cache ROI · p50 added latency.

Why DevZero

We've been rightsizing infra at runtime for years.

Kubernetes clusters run idle. GPUs run cold. Provider APIs spike. DevZero's runtime-rightsizing engine has been keeping infra honest — cost, latency, and availability — without restarts, without surprises, since day one. This is that engine, pointed at every token you ship.


Run a free assessment to identify overprovisioned workloads, idle capacity, and your potential savings, all in minutes.

Most clusters are overprovisioned.
Let's prove yours is.