Who's On Call?

One of the projects I've been tasked with recently is determining when and how a specific SRE team should take over on-call responsibilities from development teams. Of course, the details of the task are far more concrete and less based on principles, but implementation of a specific on-call policy requires answering precisely these questions: who is responsible for on-call duties for an application/service, and how is that responsibility defined? More succinctly, who owns on-call? 

The answer is pretty complicated, and mostly for historical reasons. I'm not sure when in the history of software engineering separate operations organizations were built and run to take on the so-called "operational" duties associated with running software applications and systems, but they've been around for quite some time now (by my research, at least the past twenty years - and that's a long time in the software world). Somehow, the tasks associated with building and running applications and systems were split between two separate but related organizations: those responsible for the "building" (designing, developing, and testing - usually done by software engineers and/or software developers) and those responsible for the "running" (being on-call 24/7 to fix any problems with the system, performing any necessary maintenance, etc. - work typically handed to sysadmins, systems engineers, DevOps engineers, and, now, site reliability engineers). This is a crude generalization, and specific delegation of tasks tends to differ slightly from company to company, but I've found that most engineering organizations do divide responsibilities into these two broad categories. 

For the most part, this separation of responsibilities makes sense from a high-level engineering organizational perspective: the specialization allows engineering organizations to be more efficient, optimizing so that each role is clearly defined, and, most importantly, that engineers can be seen and treated as modular (and reusable) components within a large engineering organization. In this framework, delegating the on-call responsibilities to the operations organization makes sense: the builders build, and then the maintainers maintain what the builders built. 

I'm sure there are plenty of companies today where this structure still works, and works well. Enterprise software, any company that has stable releases, and any company that maintains a good deal of legacy and/or established and stable software - it makes sense for companies that run these kinds of systems to split the responsibilities into development vs. operations. At startups and the majority of large tech companies today, however, this model just doesn't make sense. The reason? Things are messy on the ground floor of a fast-paced engineering organization, when new features are shipped every day and you're the engineer on-call. 

Here's how it tends to work. 

You're an engineer working (either on the development side or the operations side) on an application or system, and new code is being deployed every day (sometimes multiple times each day). The number one cause of outages in production systems at almost every company is bad deployments (usually bugs not caught before the new features are deployed), and chances are, quite a few of these daily deployments to your application or system are going to cause incidents or severe outages. You're on-call, so when anything breaks, you get paged, have to triage the problem, mitigate it until you can find a solution, and then resolve it when a solution is deployed. 

If you're on the development side, you know the application or system pretty well, and so finding the root cause, mitigating it, and then resolving it isn't going to take much detective work. If you're on the operations side, you're not going to know the application or system as well as the developers who wrote the code that is breaking will know it, and so the triage-mitigate-resolve process is going to take you a lot longer - the most likely thing that's going to happen is that you'll end up finding the developers that made the latest commits, then asking them to jump in and fix the outage. 

Here's the problem. Outages cost money. If you're on-call, and you can't resolve the problems quickly, you're costing the company real money. Delegating on-call responsibilities to operational engineers in these unstable development conditions costs the company money.

But hold on, you may ask, why not just have the developers themselves always be on-call in these unstable development cases? If they can fix the problems quickly, why not have them take on the on-call responsibility? I'd agree with you, and I think most people would, but this line of thinking goes against the building vs. running dichotomy that's so prevalent within software engineering today. Development teams want to hand off the on-call responsibilities to operations-focused teams, along with the majority of maintenance tasks associated with running an application or system, and who can blame them? Operations responsibility is thankless, hard work, and when you're a software developer being stack-ranked against your peers, the last thing you want is to be saddled with operational tasks. So, for many teams, the solution is simple: pawn off operational tasks to a separate operations team, and focus on designing and building new features, new applications, and new systems. 

Google famously took a somewhat new approach to this by introducing the role of the site reliability engineer (SRE). Instead of following the typical path of hiring sysadmins and systems engineers to carry the operational load, they took a bunch of software engineers trained in both development and systems engineering, and then had them form separate teams that handled operational responsibilities. Because these SREs were trained in development, they took a new approach to operational problems: they automated the majority of the work away, and built new applications and systems to handle the operational grunt work. SREs became responsible for turning the unstable development and deployment processes of applications into stable, reliable systems. 

The Google SRE approach has many advantages, and it's worked extraordinarily well for them. One of the main reasons it's worked so well for them is that the majority of their projects are largely stable and don't add new features multiple times a day. It makes sense to put SREs on stable, long-running systems like these, and Google even has criteria (and here's the catch!) that systems need to meet before SREs will take over operational responsibilities from developers. With this in mind, the SRE approach doesn't seem quite as revolutionary as one might suspect: the division between building and running systems still exists just as strongly, and SREs have become the new sysadmins, systems engineers, and DevOps engineers. 

So, if we're looking at the data, the builders vs. maintainers dichotomy works well for established, stable systems. Even with SREs, we are back at square one for the majority of companies and engineering organizations, few of which will ever reach the stability of some place like Google. We can rephrase our original question: who should go on-call at companies that aren't Google? More generally, who is responsible for maintaining and running applications at the majority of companies today? 

Before I go into what I believe to be the answer, I'm going to give some personal anecdotes, because I've been fortunate to have worked on all sides of the dichotomy. 

I've had a development-type role, where I was on-call for a suite of applications that I contributed to and knew like the back of my hand. When something broke, I knew exactly what caused it, and was able to resolve any incident or outage in mere minutes. 

I've also had a DevOps role, where I was on-call for all of the applications being developed at the company. When something broke, I would get paged, figure out where the problem happened, and then - because I didn't have the authority or knowledge to fix the problem - would have to call developers in the middle of the night and ask them to fix their goddamn code. Most of the time, they wouldn't answer their phones, or I'd get told it was not their problem, so I'd spend hours on what should have been a two-minute fix. 

Now I'm in a role that bridges the two. I'm an SRE embedded within business-critical services, responsible for cutting down the development and deployment instability that comes when multiple deployments happen each day. Because I've been a one-person team supporting almost forty services for the past five months, I haven't been on-call for these services, but I'm still called and pinged and paged at all hours to help with operational work. I also teach new Uber engineers how to be on-call for their services during their first week on the job. 

In a nutshell, I have a bit of experience with on-call and other operational responsibilities. 

Taking the history, the principles, the reality, the data, and my experience into account, here's my answer. Who should be on-call? Whoever owns the application or service, whoever knows the most about the application or service, whoever can resolve the problem in the shortest amount of time. In the majority of cases, this will be the development team. If the application or system in question is established and stable, then it's appropriate for an operational team to take over the maintenance work. 
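
If it helps to see that rule written down, here is a rough sketch in Python. Treat it as an illustration only: the Service fields, the stability thresholds, and the "sre" team name are all assumptions I've made up for the example, and every organization would need to pick its own criteria for what counts as "established and stable."

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    owning_team: str              # the development team that builds the service
    deploys_per_week: int         # how often new code ships to production
    incidents_last_90_days: int   # recent incidents or outages


# Hypothetical thresholds: the rule above gives no numbers, so these are
# placeholders for whatever "established and stable" means at your company.
MAX_DEPLOYS_PER_WEEK = 1
MAX_INCIDENTS_LAST_90_DAYS = 2


def is_stable(service: Service) -> bool:
    """A service counts as stable if it rarely changes and rarely breaks."""
    return (
        service.deploys_per_week <= MAX_DEPLOYS_PER_WEEK
        and service.incidents_last_90_days <= MAX_INCIDENTS_LAST_90_DAYS
    )


def on_call_owner(service: Service, operations_team: str = "sre") -> str:
    """Return the team that should carry the pager for this service."""
    if is_stable(service):
        # Established, stable systems can be handed to an operations team.
        return operations_team
    # Otherwise the people who know the code best stay on-call.
    return service.owning_team
```

With these made-up thresholds, a service that deploys ten times a week stays with its development team, while a service that deploys once a week and hasn't had an incident in months could be handed over.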

Good software engineering is about responsibility. It's about making mistakes, taking responsibility for those mistakes, and learning from them. It's about ownership. 

From what I've seen, development teams that are responsible for on-call duties (and operational tasks) write the best code. They build the most stable, performant, reliable applications and systems. The reason why is obvious: if you are going to be woken up in the middle of the night because a bug you introduced into code caused an outage, you're going to try your hardest to write the best code you possibly can, and catch every possible bug before it causes an outage. Take away that ownership, and the system can (and usually will) suffer. 

There's a follow-up question to this, namely what the role of operations-focused engineers is when developers are responsible for running their services. I have a lot of thoughts about this, but here's the basic idea: there's a very important place for SREs within organizations where developers take on operational responsibilities, and it comes from the fact that SREs and engineers in similar roles are experts at running production systems. They know better than anyone how to build and run a stable, reliable, performant, fault-tolerant system, because they have two additional areas of expertise that most developers don't have experience in: distributed system architecture and systems engineering. The job of engineering organizations is to use those strengths and have SREs and developers work side-by-side to build and run the best possible systems. Here I'll have to leave you hanging, because what that looks like in practice deserves a post of its own. 

'Til next time!