DevZero is a Resilience Tool in an Optimizer's Clothing

Debo Ray
Co-Founder, CEO

How our autoscaler keeps customers running during cloud outages.
Recently, a datacenter outage in Northern Virginia showed that most cloud applications are not ready for the unexpected. Even multibillion-dollar companies with skilled infrastructure teams can be caught off guard.
Thankfully, customers running DevZero—our cloud resilience platform disguised as a Kubernetes cost optimizer—kept running without disruption. I want to explain how.
To do that, we'll start with a primer on how datacenters are organized, using AWS as an example. AWS regions are designed with redundancy. Each one is composed of multiple Availability Zones, also called AZs. An AZ is one or more isolated datacenters with their own power and cooling. If one AZ has an outage, the others pick up the slack. The region carries on.
Infrastructure teams don't take uptime for granted. Normally, they secure more memory, CPUs, and GPUs than they could possibly need. They spread these resources across AZs and regions. That way, they stay running if one AZ or region has an outage. This follows the principles of High Availability (HA), but HA comes at a cost. You have multiple replicas of software running, all of them with overhead capacity, and when they communicate across AZs, that incurs additional fees (with some cloud providers).
The most cost-efficient Kubernetes cluster would have all of its nodes in one AZ but would be vulnerable to an outage in that AZ. The most resilient cluster would have nodes in every AZ but would be cost-prohibitive to run. The latter is more common. Operating in the cloud is incredibly reliable, but there will be periods of downtime. To protect against them, teams overprovision and run software across multiple AZs (or their equivalent with other cloud providers).
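To make that tradeoff concrete, here is a minimal Python sketch of how much capacity survives when a single AZ goes down. It is an illustration only, not part of DevZero. The function names and zone labels are hypothetical, and it assumes every replica contributes equal capacity.

```python
from __future__ import annotations


def surviving_fraction(replicas_per_zone: dict[str, int], failed_zone: str) -> float:
    """Fraction of replicas still running after one zone fails.

    Assumption: every replica carries the same share of the load.
    """
    total = sum(replicas_per_zone.values())
    if total == 0:
        raise ValueError("no replicas to evaluate")
    lost = replicas_per_zone.get(failed_zone, 0)
    return (total - lost) / total


def worst_case_surviving_fraction(replicas_per_zone: dict[str, int]) -> float:
    """Surviving fraction if the most heavily used zone is the one that fails."""
    return min(surviving_fraction(replicas_per_zone, zone) for zone in replicas_per_zone)


# Hypothetical layouts: everything in one AZ versus an even spread over three.
single_az = {"az-a": 6}
three_az = {"az-a": 2, "az-b": 2, "az-c": 2}

print(worst_case_surviving_fraction(single_az))  # 0.0
print(worst_case_surviving_fraction(three_az))
```

The single-AZ layout loses everything in the worst case, while the even three-AZ spread keeps two thirds of its replicas. Staying fully available on two thirds of the replicas is the reason for the extra headroom, and that headroom is what drives the cost.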
DevZero lets customers eat their cake and have it too.
Our autoscaler optimizes Kubernetes deployments to save customers money on their cloud bills. At the same time, our platform watches for problems such as zonal or regional outages. If it perceives a threat to a customer's workloads, it updates its policies to seamlessly start moving those workloads to different AZs, and it notifies the customer.
That's how a recent outage went for our customers. About 30 seconds after our monitoring detected it, our autoscalers began to move their workloads, safely and gradually, to healthy zones.
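For readers who want a feel for what "safely and gradually" can mean, here is a simplified Python sketch of a batched evacuation plan. It is not DevZero's actual algorithm. The function name, the batch size, and the rule of always filling the least-loaded healthy zone are assumptions made for illustration.

```python
from __future__ import annotations


def plan_evacuation(
    replicas_per_zone: dict[str, int],
    unhealthy_zone: str,
    batch_size: int = 1,
) -> tuple[list[tuple[str, int]], dict[str, int]]:
    """Plan moving replicas out of one zone in small batches.

    Returns the ordered steps as (target_zone, replicas_moved) pairs and the
    final replica layout. Each batch goes to the healthy zone that currently
    holds the fewest replicas (ties broken by zone name).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    state = dict(replicas_per_zone)  # work on a copy, leave the input untouched
    healthy = [zone for zone in state if zone != unhealthy_zone]
    if not healthy:
        raise ValueError("no healthy zone available to receive workloads")

    steps: list[tuple[str, int]] = []
    remaining = state.get(unhealthy_zone, 0)
    while remaining > 0:
        move = min(batch_size, remaining)
        target = min(healthy, key=lambda zone: (state[zone], zone))
        state[target] += move
        remaining -= move
        state[unhealthy_zone] = remaining
        steps.append((target, move))
    return steps, state


# Hypothetical cluster: az-a is degraded, so its 4 replicas move 2 at a time.
steps, final_layout = plan_evacuation({"az-a": 4, "az-b": 2, "az-c": 2}, "az-a", batch_size=2)
print(steps)         # [('az-b', 2), ('az-c', 2)]
print(final_layout)  # {'az-a': 0, 'az-b': 4, 'az-c': 4}
```

Moving in small batches keeps most replicas serving traffic at every step. A real system would also wait for each batch to become healthy before starting the next one.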
DevZero customers didn't experience downtime. Meanwhile, the worst-hit companies had outages lasting seven or more hours.
It's important to note: DevZero temporarily increased the cost of its customers' Kubernetes deployments to make them adaptable to this situation. Once the datacenter was back to normal, our autoscaler returned their deployments to the original state.
DevZero customers don't have to choose between resilience and cost-efficiency. We optimize for one or both based on the situation. Our platform is made for the messy, imperfect realities of running a business in the cloud.
If you've dealt with the chaos of an outage before and would have preferred to sleep peacefully through it, get in touch. We'll show you what we can do.


