Customer-centric Metrics

It’s never been easier to collect metrics. Using them to answer important questions is still hard.

Mark McBride
Turbine Labs

--

There’s been an explosion of metrics collection and visualization systems over the last few years. It’s never been easier to spray thousands of counters into a robust storage system, and there are many tools to draw lines that depict variations over time. But metrics are a means, rather than an end: having a lot of them, or having the wrong ones, provides no intrinsic value. The real goals are to find out about problems before your customers do, and to solve those problems faster.

At Turbine Labs, we use a blend of success rates, latency histograms, and request/response rates to help you with these goals. In this post I’ll talk about how we arrived at these specific metrics, and why we think they’re the most important top-line indicators of how your service is behaving.

Success Rate

To meet these goals, metrics and monitoring systems need to be configured to answer the high level question “how’s my service?” and if the answer is “not good,” they need to answer the follow-up, “where should I start looking?” Even figuring out how to answer something as simple as “how’s my service?” takes the right metrics and some deep thinking. You can measure all kinds of stuff. CPU stuff, memory stuff, network stuff, even (gasp!) disk stuff. You can measure these items over many dimensions, e.g. across different time scales or broken down by user agent or geography.

Since the question we’re answering is “how’s my service?” we can simplify things dramatically by considering this question from the customer’s point of view. The app they’re using makes some network requests. They want the data they asked for, and they want it fast. This gives us a sense of what events we want to observe (requests) and what metrics we want to capture (response codes, latency). The customer doesn’t care if it takes 99.5% CPU to do it, or if there was a minor GC event, a cache miss, or a page fault.

The percentage of non-5xx responses, which we’ll call “success rate”, when combined with latency, would seem to capture the customer’s view of your service, but building a high level view of this number can be tricky.
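To pin the definition down, here’s a minimal sketch of that success-rate calculation. The helper name and sample codes are hypothetical; the point is that success is defined as “not a 5xx,” so 3xx and 4xx responses still count as successes:

```python
def success_rate(status_codes):
    """Fraction of responses that are not 5xx (hypothetical helper)."""
    if not status_codes:
        return None  # no traffic: undefined, not 100%
    failures = sum(1 for code in status_codes if 500 <= code < 600)
    return 1 - failures / len(status_codes)

codes = [200, 200, 301, 404, 503, 200]
print(f"{success_rate(codes):.1%}")  # 5 of 6 non-5xx -> 83.3%
```

Note the empty-window case: reporting 100% when there were zero requests would hide an outage where no traffic arrives at all.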

You can’t look at every event, but if you aggregate too coarsely, you’ll miss out on important details. Too fine and you won’t be able to grasp the overall state of your service.

Let’s start with too coarse and work our way back to useful and actionable. We’ll look at the average latency and the success rate across all requests.

A cool story

This tells a story. But it’s not a useful one. Are most of your requests ok? Maybe. But not all requests are created equal. Take a typical web storefront. If viewing items succeeds 99.99% of the time, but checkout fails 50% of the time, what will your overall success rate look like? Purchases are likely a much smaller number of requests. They’re also very important requests. Grouping events by the type of call gives us the detail we need to make sure low volume/high value requests aren’t failing.

People can’t buy stuff. This is probably bad.
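The storefront numbers above are worth working through. With a hypothetical traffic mix (the volumes here are invented for illustration), a checkout endpoint failing half the time barely dents the aggregate:

```python
# Hypothetical traffic mix: many cheap "view" requests, few checkouts.
endpoints = {
    "view_item": {"requests": 100_000, "success_rate": 0.9999},
    "checkout":  {"requests":     500, "success_rate": 0.50},
}

total = sum(e["requests"] for e in endpoints.values())
successes = sum(e["requests"] * e["success_rate"] for e in endpoints.values())
overall = successes / total

print(f"overall: {overall:.2%}")  # ~99.74% -- looks healthy
for name, e in endpoints.items():
    print(f"{name}: {e['success_rate']:.2%}")
```

The blended success rate still rounds to three nines-ish territory while half of all purchases are failing, which is exactly why the per-endpoint breakdown matters.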

It also gives you a much narrower area to focus your problem solving efforts. Reducing the scope of an investigation from “the entire site” to “just this endpoint” gives you a big head start in getting the site back to normal.

Latency

We’ve described a more useful way to aggregate success rates. Now, let’s turn our attention to latency. Averages are famously misleading. A large percentage of your customers could be having a really bad day and your average would look just fine.

Averages mask anger.

You get a better view by looking at percentiles. The median (50th percentile) and 99th percentile latencies, viewed as a pair, give you a much clearer picture. The median shows the latency you’re delivering to the happier half of your customers. The 99th shows what you’re delivering to your unhappiest one percent: the people who are going to call support, or worse, abandon your service. Managing these numbers is key to customer retention: if the angry 1% is calling support, that’s bad; if the happiest half are calling, you’re probably on your way out of business. Let’s have a look at our revised “how’s my service?” chart with latency histograms:

Anger unmasked.
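A quick synthetic example shows why the median/p99 pair beats the average. The data and the nearest-rank percentile helper below are illustrative, not a production estimator:

```python
import statistics

def percentile(samples, p):
    """Nearest-rank percentile (a simple sketch, not an interpolating estimator)."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Synthetic latencies: most requests are fast, but 2% hit a slow path.
latencies_ms = [50] * 980 + [3000] * 20

mean = statistics.fmean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"mean={mean:.0f}ms  p50={p50}ms  p99={p99}ms")
# -> mean=109ms  p50=50ms  p99=3000ms
```

A 109ms average looks unremarkable, while the p99 reveals that some customers are waiting three full seconds.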

Request/Response Rate

Latency and success rate give us a high-level view of customer happiness. But what if customers can’t even get to your service? Request rate gives us a high-level view of customer behavior. If this number drops significantly it may indicate issues with external services (is your CDN down? Did you lose an availability zone?). If it rises dramatically, the same may be true (did your CDN just expire a bunch of content? Is Castle in the Sky on?).

Tweets per second during Castle in the Sky event
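A simple way to turn “drops significantly” and “rises dramatically” into something actionable is to compare the current rate against a baseline. The threshold below is a hypothetical placeholder; real traffic needs tuning (and often time-of-day-aware baselines):

```python
def rate_anomaly(current_rps, baseline_rps, tolerance=0.3):
    """Flag request rates deviating from baseline by more than `tolerance`
    in either direction (hypothetical threshold; tune for your traffic)."""
    if baseline_rps == 0:
        return current_rps > 0  # any traffic against a zero baseline is news
    delta = (current_rps - baseline_rps) / baseline_rps
    return abs(delta) > tolerance

print(rate_anomaly(400, 1000))   # True  -- big drop: CDN down? Lost an AZ?
print(rate_anomaly(1050, 1000))  # False -- normal fluctuation
print(rate_anomaly(5000, 1000))  # True  -- spike: cache expiry? event traffic?
```

The symmetry matters: both directions are worth alerting on, because both the drop and the spike scenarios above degrade customer experience.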

Another dimension to consider is the response codes themselves. An elevated number of 5xx HTTP responses will show up in success rate, but what about 3xx and 4xx codes? They aren’t necessarily your failures, but a deviation from the steady state can still indicate a degraded customer experience: for instance, a new client looking for an API call that hasn’t been deployed yet. Measuring response code rates in addition to request rates fleshes out the picture:

A cooler story.
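Bucketing responses by status class is a one-liner worth having. The sample codes here are invented; the point is that a 4xx surge is invisible to a non-5xx success rate:

```python
from collections import Counter

def code_class_rates(status_codes):
    """Share of responses per status class (2xx/3xx/4xx/5xx) -- a sketch."""
    classes = Counter(f"{code // 100}xx" for code in status_codes)
    total = len(status_codes)
    return {cls: n / total for cls, n in classes.items()}

codes = [200, 200, 200, 301, 404, 404, 404, 404, 503, 200]
rates = code_class_rates(codes)
print(rates)  # 4xx at 40% -- while success rate (non-5xx) still reads 90%
```

Tracked per endpoint alongside request rate, these class rates catch the “new client hitting a missing API” case described above long before it shows up anywhere else.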

Putting it All Together

Combining latency, success rate, and request/response rate, grouped by endpoint, into a single dashboard gives you a tractable, at-a-glance view of customer experience and behavior. This is the general form of the dashboard we ended up with across most of Twitter’s services, after a long journey through several different observability strategies. These aren’t the metrics you’ll use to solve a problem, but they will tell you when there is one, and to what degree it affects your customers.

At Turbine Labs, we use exactly this set of metrics in Houston, our application routing and release system. Houston combines a customer-centric approach to monitoring and observation with insight into changes to your infrastructure, so you can spend less time finding out there’s a problem, and more time fixing it. Sign up now to give it a try!
