Health Guardian: Automated Removal of Partially Failed Hosts
A simple technique for greatly reducing the impact of single host gray/partial failures through safe and fast automated removal.
I want to share a simple technique I’ve seen used in the past for greatly reducing the impact of single host gray/partial failures through safe and fast automated removal.
We called this approach Health Guardian but I’m sure many people have applied the same technique under different names.
The basic idea is as follows:
- Health Guardian monitors the error rates of endpoints in your fleet via a monitoring system that can handle per-host cardinality and alert on elevated error rates in as close to real time as possible.
- When an individual host has a high fault rate for customer traffic (e.g. more than 10% of requests result in system errors over a 30-second window), Health Guardian signals to routing systems (e.g. the load balancer, the client Envoy, whatever) to permanently cordon it away as quickly as possible if the following conditions are met:
- You haven’t removed too many hosts recently. This should be set quite low: e.g. 1-2 instances a day for a fleet of 100 EC2 instances is probably sufficient.
- The fleet overall has enough capacity (e.g. at least 95% of hosts are passing health checks)
- If you can’t cordon a seemingly unhealthy instance because the safety conditions aren’t met, page an operator (if you’ve tuned your thresholds even a little bit, this should only happen when there’s a more serious issue like widespread host failure)
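A minimal sketch of that decision loop in Python might look like the following. The `monitoring`, `routing`, and `pager` clients (and their methods like `get_host_error_rates`, `count_recent_cordons`, `healthy_fraction`, `cordon_host`, and `page_operator`) are hypothetical placeholders for your metrics system, routing layer, and paging system, not any particular product's API.

```python
import time

# Thresholds from the description above; tune for your own fleet.
ERROR_RATE_THRESHOLD = 0.10      # cordon when >10% of requests fail
EVALUATION_WINDOW_SECONDS = 30   # measured over a 30-second window
MAX_CORDONS_PER_DAY = 2          # very low rate limit on removals
MIN_HEALTHY_FRACTION = 0.95      # keep at least 95% of hosts in service


def evaluate_fleet(monitoring, routing, pager):
    """One pass of the Health Guardian decision loop."""
    # Per-host error rates over the last evaluation window.
    error_rates = monitoring.get_host_error_rates(
        window_seconds=EVALUATION_WINDOW_SECONDS)

    for host, error_rate in error_rates.items():
        if error_rate <= ERROR_RATE_THRESHOLD:
            continue  # host looks healthy enough; leave it alone

        # Safety condition 1: don't remove too many hosts per day.
        rate_limit_ok = routing.count_recent_cordons(hours=24) < MAX_CORDONS_PER_DAY
        # Safety condition 2: the fleet must have enough spare capacity.
        capacity_ok = routing.healthy_fraction() >= MIN_HEALTHY_FRACTION

        if rate_limit_ok and capacity_ok:
            # Permanently cordon the host away from customer traffic.
            routing.cordon_host(host)
        else:
            # Can't safely remove the host; something bigger may be wrong.
            pager.page_operator(
                f"Health Guardian blocked on {host}: error rate {error_rate:.0%}")


def run_forever(monitoring, routing, pager, interval_seconds=10):
    # A short evaluation interval keeps removal within seconds of detection.
    while True:
        evaluate_fleet(monitoring, routing, pager)
        time.sleep(interval_seconds)
```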
What makes this sort of approach so great? The key insight here is that the safety conditions make it so safe to remove hosts that you can do it aggressively and without human involvement. In more detail:
- It’s very safe: The very low rate limit and capacity check guardrails ensure that even if the health system is wrong and you terminate a few totally healthy hosts, the risk of causing broader impact is extremely low.
- It is way faster than a human: Because the approach is so safe, it’s fine to do the cordoning without a human oncall in the loop. This means that the removal of an unhealthy host can happen in potentially seconds (we successfully used removal windows as short as 10 seconds) instead of 5-30 minutes.
- It detects gray/partial failures: Because we’re terminating based on the success rates of real customer traffic, we avoid the classic failure mode of normal health checks where the health check succeeds but real traffic is erroring out.
- It’s very widely applicable: Any service that can measure success/failure and can handle single host hard failure has all the necessary building blocks (success/failure metrics, spare capacity, fast routing away from a failed host) to adopt Health Guardian.
- It’s simple to build: The components of this system are relatively simple to build. Even if you don’t have a high cardinality monitoring system, you can hack together a good-enough system by having each host periodically write its batched request and failure counts to a row in a database table, and having Health Guardian scan that table on its own interval (see the sketch after this list).
- It’s simple to configure: The criteria for when to cordon off a host don’t need to be perfect because over-terminating (within reason) is not a big deal. Also, host health tends to be bimodal: once a host fails, its success rate tends to drop precipitously. Most teams I’ve seen onboard to a Health Guardian system have been successful by just picking an arbitrary error rate threshold.
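To make the “simple to build” fallback concrete, here is one way the database-table approach could be sketched: each host upserts its counters for the current time bucket, and Health Guardian aggregates recent rows to find hosts over the error-rate threshold. The schema, table name, and helper functions are illustrative assumptions, not a prescribed design.

```python
import sqlite3
import time

# Illustrative schema: one row per (host, time bucket) with request/failure counts.
SCHEMA = """
CREATE TABLE IF NOT EXISTS host_stats (
    host          TEXT NOT NULL,
    bucket_start  INTEGER NOT NULL,  -- epoch seconds, start of the interval
    requests      INTEGER NOT NULL,
    failures      INTEGER NOT NULL,
    PRIMARY KEY (host, bucket_start)
)
"""


def report_stats(db, host, requests, failures, bucket_start):
    """Runs on each host: write this interval's batched counters to the table."""
    db.execute(
        "INSERT OR REPLACE INTO host_stats (host, bucket_start, requests, failures) "
        "VALUES (?, ?, ?, ?)",
        (host, bucket_start, requests, failures),
    )
    db.commit()


def scan_for_unhealthy_hosts(db, window_seconds=30, error_rate_threshold=0.10):
    """Runs in Health Guardian: find hosts whose recent error rate is too high."""
    cutoff = int(time.time()) - window_seconds
    rows = db.execute(
        "SELECT host, SUM(requests), SUM(failures) FROM host_stats "
        "WHERE bucket_start >= ? GROUP BY host",
        (cutoff,),
    ).fetchall()
    return [host for host, requests, failures in rows
            if requests and failures / requests > error_rate_threshold]


# Example usage with an in-memory database standing in for a shared table.
db = sqlite3.connect(":memory:")
db.execute(SCHEMA)
now = int(time.time())
report_stats(db, "host-1", requests=1000, failures=300, bucket_start=now)
report_stats(db, "host-2", requests=1000, failures=2, bucket_start=now)
print(scan_for_unhealthy_hosts(db))  # -> ['host-1']
```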
In fact, this sort of system (or at least something automated for quickly removing bad hosts) is borderline essential for operating services with very high availability requirements. Say you have a 99.9995% monthly availability SLO. If there are 90 hosts in your fleet and it takes 20 minutes for an operator to respond to a page and remove a bad host, then a single host failure has already blown the SLO: losing one of 90 hosts for 20 minutes costs 20 / (90 × 24 × 30 × 60) ≈ 5.1 × 10⁻⁶ of the month’s availability, leaving 0.99999486, just under the 0.999995 target. Health Guardian reduces the response time here to as low as 20 seconds, saving that error budget for use on less tractable failure modes.
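A quick sanity check of that arithmetic, assuming the failed host serves errors for its full share of traffic until it is removed:

```python
# Back-of-the-envelope check of the single-host-failure math above.
fleet_size = 90
minutes_per_month = 30 * 24 * 60    # 43,200


def availability_after_outage(outage_minutes):
    # One host out of `fleet_size` serving errors for `outage_minutes`.
    lost_fraction = outage_minutes / (fleet_size * minutes_per_month)
    return 1 - lost_fraction


print(availability_after_outage(20))       # ~0.9999949 -> misses a 99.9995% SLO
print(availability_after_outage(20 / 60))  # ~0.99999991 -> well within the SLO
```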