Coworkers Will Become Customers—It’s a Good Thing

Managing Service Health Through Scale

Mark McBride
Turbine Labs

--

I recently presented a talk at Velocity NY on Customer-Centric Metrics. When I envisioned the talk, I imagined an in-depth look at the countless metrics you can evaluate for a given release, leading to a distillation of the ones that matter. As I prepared, the talk naturally split itself into three distinct but related sections:

  1. Tracking the metrics that your customers care about gives you a quick and robust way to evaluate system health.
  2. As teams grow, their interactions start to feel more like the relationship between a customer and service provider.
  3. Adopting the network as an abstraction boundary allows you to easily add high-level observability and change control to all services.

The Good Metrics

Caring about the health of your service is a critical part of being a grown-up engineer, but there’s a fine line between concern and anxiety. One cause of anxiety is too little information. Another is receiving too much irrelevant information on the details of each service. Stepping back to consider your service from a customer viewpoint simplifies things. Customers don’t care about your services’ internal details; they care about the quality of their experience.

For web applications, this can be condensed into three main questions:

  1. Does it work? (Are requests returning successfully?)
  2. Is it fast enough? (Are requests returning quickly enough?)
  3. When all my friends start using it, does it still work? (Does it scale?)

This means you should measure success rate, latency, and request rate. If these three metrics are within bounds, you can be confident that your service is healthy. It’s also important to segment these results by the type of actions customers perform. For example, if all of the item display pages on your online store are working well, but checkout is broken 20% of the time, your site isn’t healthy. Site health is in the eye of the customer.
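As a rough illustration of segmenting these three metrics by customer action, here is a minimal Python sketch. The `Request` record, the action labels, and the choice to count any status below 500 as a success are assumptions made for the example, not something prescribed by the talk; a real system would read these values from its request logs or proxy statistics.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    action: str        # hypothetical customer action label, e.g. "checkout"
    status: int        # HTTP status code returned to the customer
    latency_ms: float  # time taken to serve the request


def summarize(requests, window_seconds):
    """Return success rate, p95 latency, and request rate per action."""
    by_action = defaultdict(list)
    for r in requests:
        by_action[r.action].append(r)

    summary = {}
    for action, reqs in by_action.items():
        # Assumption: anything below 500 counts as a success.
        ok = sum(1 for r in reqs if r.status < 500)
        latencies = sorted(r.latency_ms for r in reqs)
        # Nearest-rank p95: smallest sample with at least 95% of samples at or below it.
        rank = max(1, math.ceil(0.95 * len(latencies)))
        summary[action] = {
            "success_rate": ok / len(reqs),
            "p95_latency_ms": latencies[rank - 1],
            "requests_per_second": len(reqs) / window_seconds,
        }
    return summary
```

Looking at the numbers per action rather than as one site-wide figure is what surfaces a case like the broken checkout above: thirteen requests with a single failure look about 92% healthy in aggregate, even though one checkout in five failed.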

Customers care about different things for different workload types. For instance, data freshness is a key metric for a data processing app. For a mobile app, it may be frame rate. The same approach can be used to identify a small set of metrics that reflect customer experience. Well-considered metrics, segmented by customer action types, provide a quick but comprehensive way to measure service health.

Teammates as Customers

As we talk to more people about a customer-centric approach to monitoring, we encounter a tendency to split services into customer-facing and internal systems. Maybe customer-centric metrics make sense for the things customers actually touch, but internal systems require a broader set of health metrics. Customers differ from team members in many ways, but for the purposes of observability, the critical difference is that they can’t see behind your service boundary.

When engineering teams are small, it’s not uncommon for individuals to be fluent in large portions of your application. As issues arise, it’s easy to dive in and directly fix the problem. The application is small enough that looking at detailed metrics may be a tractable approach to observing system health. As teams grow, however, maintaining deep knowledge of the entire system becomes harder, with more and more of the system moved behind abstraction boundaries to enable specialization.

As engineers increasingly work with service abstractions instead of service internals, they start to look much more like customers from an observability standpoint. Adopting the customer-centric approach to metrics means managing the health of the abstraction, not system internals. This doesn’t replace fine-grained instrumentation, because detailed information is critical when you’re trying to troubleshoot and fix problems. The customer-centric approach gives you a quick way to determine whether you need to start troubleshooting, and helps narrow the scope of investigation. A simple set of metrics, applied as a standard across the entire system, gives you a common frame of reference for observing system health and discussing incidents when they arise.

Observable Data Plane

Applying these metrics across the entire system can be a challenge. The rise of microservices means that abstraction boundaries are increasingly moving to the network. Injecting a proxy into the data path gives you an easy-to-deploy mechanism for system-wide observation and control.

A proxy doesn’t provide detailed metrics on the internal operation of your service, but it can provide a clear view of the behavior your customers are observing. This tells you when you need to conduct a detailed investigation. Further, it can route traffic away from faulty systems, letting you restore customer service without hampering diagnostic efforts.
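To make the routing idea concrete, here is a minimal Python sketch of the kind of decision a proxy could make from observed success rates. The function name, the backend names, and the 0.99 threshold are hypothetical choices for illustration; this is not a description of how any particular proxy or product implements traffic shifting.

```python
def healthy_backends(success_rates, threshold=0.99):
    """Return the backends whose observed success rate meets the threshold.

    If no backend qualifies, fall back to all of them rather than
    dropping every request on the floor.
    """
    healthy = [name for name, rate in success_rates.items() if rate >= threshold]
    return healthy or list(success_rates)


# Hypothetical observations: the new release is failing one request in five.
observed = {"release-41": 0.999, "release-42": 0.80}
print(healthy_backends(observed))
```

Because the faulty backend is only taken out of rotation, not shut down, it remains available for diagnosis while customers are served by the healthy one.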

It may seem daunting to rework your network communication architecture, but it can be done incrementally. Once rolled out, the observability and controllability gains apply to every software release, every new service you deploy, and every new language and runtime you adopt. These gains should, in turn, improve customer satisfaction.

In Closing

By establishing a network boundary, internally as well as externally, and paying attention to high-value customer-centric metrics, you allow your services to flourish. No longer will some customers suffer while your health checks say things are going swimmingly. The benefit of quickly shifting traffic away from a badly behaving service when you see a drop in success rate is clear: a few 500 errors can ruin a relationship with any customer. Observable, controllable services are within reach, and we’d love to chat more about how we can help you get there. Houston, our continuous release product, is an easy-to-install application of these practices. Sign up here to start a 30-day trial!

--