The Irreproducibility of Bugs in Large-Scale Production Systems

Before working at Uber, I'd only had experience with very small and simple production systems. If something went wrong with one of those systems and I was paged because something was broken in production, I could easily track down the problem, reproduce the bug(s) with minimal effort, and then test and deploy a fix within a few minutes. When I joined Uber one year ago and began working on various parts of its then-1000-microservice ecosystem (Uber probably runs around 1500 microservices now), one of the biggest surprises was realizing that the majority of bugs were irreproducible. In fact, I think one of the most important lessons I've learned this past year is that the "track down the problem -> reproduce the bug -> test and deploy the fix" cycle simply does not work in most large and complex production systems.

The first stage of the bug/incident/outage response process (tracking down the problem) is definitely difficult in large production systems, but it's not impossible. It takes good monitoring, great observability, and some really dedicated and experienced engineers, but it can be done, and most companies I know do it pretty well. Yes, things get difficult when you have large production ecosystems, and you need to deal with crazy capacity, latency, availability, and scalability issues that you wouldn't find in smaller systems or at smaller companies, but there's nothing about running large-scale production systems that makes this impossible (in theory or in practice). The same is true for testing and deploying a fix once one has been identified (the last part of the cycle), which is probably the easiest part of the whole process: there's nothing inherently impossible or difficult about accomplishing this in a large system. The problem lies in the second step of the process: in large-scale production systems, the cycle breaks down in the "reproduce the bug" phase. The cycle breaks down because it is practically impossible to reproduce bugs in large-scale production systems.

To see exactly how and why the cycle breaks down, let's do a little thought experiment. Let's suppose that we are developers working at a company with four hundred microservices (a rather small number of services). These microservices are developed by various different teams, and they all live on top of an application platform, which lives atop the infrastructure at the company. To make things easy, let's say that the application platform and the infrastructure are relatively simple. In a perfect, ideal, Platonic-form-heaven of a world, all of these microservices would be completely independent of one another, and wouldn't affect each other at all (neither negatively nor positively). The cold reality of microservice architecture is that microservices do not live in isolation: they have to depend on each other and interact, so what you end up with at the end of the day is a large set of intricately connected services and dependency chains that (when drawn on a whiteboard) eerily resembles a spiderweb. If one microservice in the chain goes down, all upstream microservices tend to go down with it. If one microservice changes something, the odds are really high that the change will affect its clients.
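
To make that spiderweb a bit more concrete, here's a minimal sketch (in Python, with a handful of made-up services; a real dependency graph would come from your own tracing or service-discovery data) of how a single outage ripples upstream through the dependency chain:

```python
from collections import defaultdict

# Hypothetical dependency edges: each service maps to the services it calls.
# In a real system there would be hundreds of these, pulled from tracing data.
DEPENDENCIES = {
    "mobile-api": ["trip-service", "rider-profile"],
    "trip-service": ["pricing", "geo-index"],
    "rider-profile": ["user-store"],
    "pricing": ["geo-index"],
    "geo-index": [],
    "user-store": [],
}

def affected_by_outage(down_service, dependencies):
    """Return every service that transitively depends on `down_service`."""
    # Invert the graph: for each service, which services call it?
    dependents = defaultdict(set)
    for caller, callees in dependencies.items():
        for callee in callees:
            dependents[callee].add(caller)

    # Walk upstream (toward the callers) from the failed service.
    affected, stack = set(), [down_service]
    while stack:
        for caller in dependents[stack.pop()]:
            if caller not in affected:
                affected.add(caller)
                stack.append(caller)
    return affected

print(affected_by_outage("geo-index", DEPENDENCIES))
# -> {'pricing', 'trip-service', 'mobile-api'} (order may vary)
```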

Now let's think about how these services are developed and deployed. Microservice architecture allows developers to deploy all of the time and to create new features really, really quickly. In the real world, this means that developers will be deploying more than once per week (sometimes once per day, and often multiple times per day). For our thought experiment, let's imagine that the microservices at our pretend company are each deployed once per day, and that each change is of nonzero significance. So now we have four hundred microservices, each being significantly changed every single day with every new deployment. In addition, the platform and infrastructure beneath the microservices are changing every day: new configurations are being deployed, monitoring is being changed, new hardware is being provisioned, resources are being allocated, libraries are changing - the list goes on forever. We can safely assume that in a large-scale production system like this, the state of every single component of the system changes every single day.

 

"The crux of reproducibility when it comes to bugs is this: being able to reproduce a bug requires that the state of the system be nearly identical at the time of reproduction as it was at the time the bug originally occurred - something that is impossible to guarantee in large production systems."

 

In this system (just like in most large-scale production systems), the state of the system changes constantly throughout each day. There are no guarantees that the state of the system will stay the same from one moment to the next, and, given how interconnected and codependent all of the components of the system are, this fact matters very, very much when it comes to reproducing problems with the system. The crux of reproducibility when it comes to bugs is this: being able to reproduce a bug requires that the state of the system be nearly identical at the time of reproduction as it was at the time the bug originally occurred - something that is impossible to guarantee in large production systems. In most cases, a microservice or application in a large system will not be the same service or application that it was twelve hours ago, let alone several days ago, and reproducing problems the same way we are used to reproducing them in smaller, self-contained systems is impossible. 

The solution to the irreproducibility-of-bugs problem (which I should probably call the "irreproducibility-of-state problem", now that I think about its name) comes from the way that we find ourselves working around the problem when we're trying to debug and mitigate production issues. In these situations, where we are faced with a problem and can't reproduce it on-the-fly, often the only way to determine its root cause is to comb through the service-level or application-level logs, discover the state of the application or service at the time of the outage, and figure out why the service or application failed in that state.

If our logs don't accurately capture the state of our service at the time the problem occurred, then we're shit out of luck and our only way forward is to add some brand-new logging where we suspect there may be a problem, cross our fingers, and hope that our new logging catches it. This is obviously problematic, because it requires our system to fail at least one more time in order for us to root-cause the underlying issue. It's better to take a comprehensive and smart approach to logging from the get-go: we can design our logging such that we can determine from our logs exactly what went wrong and where things fell apart, and we can do this by making sure that our logging accurately captures the state of the system at the time. 
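
As a rough illustration of what that kind of logging might look like, here's a minimal sketch in Python of a structured log event. The specific fields (build SHA, config hash, dependency versions, and so on) are hypothetical examples of the kind of state worth capturing, not a definitive list:

```python
import json
import logging
import time

logger = logging.getLogger("trip-service")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_event(event, level=logging.INFO, **details):
    """Emit one structured log line that snapshots the service's state.

    Every field below is illustrative; the point is that each event carries
    enough context to reconstruct what the service looked like at that moment.
    """
    record = {
        "timestamp": time.time(),
        "event": event,
        "service": "trip-service",        # which service emitted this event
        "build_sha": "f3a91c2",           # exact code that was running (hypothetical)
        "config_hash": "cfg-41d",         # configuration that was live (hypothetical)
        "deploy_id": "deploy-4412",       # which deployment this instance came from
        "dependencies": {                 # versions of the services we were calling
            "pricing": "2.14.0",
            "geo-index": "0.9.3",
        },
        **details,                        # request-specific context (ids, latencies, errors)
    }
    logger.log(level, json.dumps(record))

# Example: log a failed downstream call with the context needed to debug it later.
log_event(
    "pricing_call_failed",
    level=logging.ERROR,
    request_id="req-8c21",
    upstream_status=503,
    latency_ms=2450,
)
```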

I've spent a lot of time this year while writing my book Production-Ready Microservices trying to figure out a comprehensive list of things that systems should log in order to fit this requirement, and honestly I haven't been able to come up with one - the list of necessary and sufficient things that should be logged in order to adequately describe the state of a system is far too dependent on the specifics of any given system. All I have been able to do is come up with a general rule of thumb: design your logging such that you can determine the state of the system at the time an event is logged.  

 

"a general rule of thumb: design your logging such that you can determine the state of the system at the time an event is logged." 

 

Before I end this post, I want to briefly mention that I've seen several microservice-architecture-based companies attack this problem by versioning microservices and/or their endpoints, and I want to caution against it. Avoid this approach if you can, because it tends to lead to really messy and unfortunate situations in which client services pin themselves to specific (old, outdated) versions of microservices. Instead, make your logging good enough that it records when things happened and what the state of the system was at the time, which will allow you to correlate bugs and other problems with the state of the system at that point in the past.
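
If your log events already carry that kind of state, correlating a bug with the state of the system at the time it occurred becomes a filtering problem rather than a reproduction problem. Here's a rough sketch (again in Python, reusing the hypothetical field names from the logging example above):

```python
import json
from datetime import datetime, timedelta, timezone

def state_around_incident(log_lines, incident_time, window_minutes=10):
    """Collect the logged state within a window around a reported incident.

    `log_lines` is any iterable of JSON log lines like the ones emitted above;
    the field names (timestamp, service, build_sha, ...) are assumptions.
    """
    window = timedelta(minutes=window_minutes)
    snapshots = []
    for line in log_lines:
        event = json.loads(line)
        event_time = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
        if abs(event_time - incident_time) <= window:
            snapshots.append({
                "event": event.get("event"),
                "service": event.get("service"),
                "build_sha": event.get("build_sha"),
                "config_hash": event.get("config_hash"),
            })
    return snapshots

# Usage sketch with two fake log lines: only the event near the incident survives.
sample_logs = [
    json.dumps({"timestamp": 1478242300.0, "service": "trip-service",
                "build_sha": "f3a91c2", "config_hash": "cfg-41d",
                "event": "pricing_call_failed"}),
    json.dumps({"timestamp": 1478300000.0, "service": "trip-service",
                "build_sha": "9be77d0", "config_hash": "cfg-42a",
                "event": "deploy_completed"}),
]
incident = datetime.fromtimestamp(1478242450.0, tz=timezone.utc)
print(state_around_incident(sample_logs, incident))
```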

Good luck out there, and happy logging!

P.S. I wrote a whole book full of stuff like this post about building good, production-ready microservices called Production-Ready Microservices. If you'd like a discount code for the book, send me an email - I have a few discount codes to give away.