The Ops Identity Crisis

A big theme in the keynotes and conversations during Velocity Conf in NYC a few weeks ago was the role of ops in an "ops-less" and "server-less" world. It's also been a big feature in discussions on Twitter and in conversations I've had with coworkers and friends in the industry. There are several things that stand out to me in these conversations: first, that some ops engineers (sysadmins, techops, devops, and SREs) are worried that they will be phased out if developers and software engineers are responsible for the operational tasks in their systems; second, that developers and software engineers do not have the skills needed to take over responsibility for operational tasks; and third, that building reliable systems is impossible without an operations organization.

I'm considered by many to be an ops engineer by trade, because I'm a site reliability engineer (SRE) at Uber. At most companies, SREs are just like any old ops engineers: they're viewed by developers as the engineers who are responsible for running the software after the developers build it. At other companies (like Uber), SREs aren't supposed to be your typical ops engineers - the majority of them aren't on-call for the microservices (the development teams are), they don't deploy the code (the developers do), but they do build tools to enable and empower developers to run their own services, and they are there to make sure that systems are built and run with reliability in mind. Even so, there's not that much that differentiates an SRE from a SWE (software engineer), and when people talk about an ops-less world, SREs (myself included) wonder where we fit into the picture.

Let's start with a simple question: do we really need operations engineers? My answer is a very loud and firm "no". We don't need operations engineers. We don't need a separate organization whose responsibility it is to deploy all the code, to be on-call for all the services, to monitor everything, to do all of the debugging and troubleshooting and mitigating and resolving every time something breaks. As I've argued before, these tasks are best accomplished when they are the responsibility of whoever knows the application, service, or system best, and, in the majority of cases, the developers are the best engineers for the job. If developers are ready, willing, and trained to take over the operational tasks for their systems, then there's no reason to have a separate operations organization - but in the industry today, that's not the case: many developers aren't ready (they haven't been on-call before, they don't know anything about systems debugging and troubleshooting), they aren't willing (after all, who wants to get paged at 2:00 AM when their code causes the system to fail?), and they aren't trained (they don't know best practices for deployment, for logging, for monitoring, etc.). In short, if developers are responsible for running their own applications, services, or systems, then no, you don't need a separate operations organization - but if they aren't, then you most certainly do. 

To explain my reasoning behind this, I'm going to dig into each of the concerns I listed above. Let's start with the first concern: the fear that ops careers will be phased out. If we get to the point within a company where operations engineers are no longer needed, because developers are responsible for running and maintaining their own software, then we've done our jobs. One of the things that senior ops engineers have always told me is that ops should strive to automate themselves out of their jobs - automate monitoring, automate deployment, automate rollbacks when deployments go badly - and if you find yourself at a point where everything is automated, then it's time to either figure out how to make it even better (if you can), or move on to another company and set them up for success. Automating everything away isn't enough to justify running an ops-less organization, however: developers will need to become responsible for running the deployment systems, for understanding monitoring, for learning how to debug and troubleshoot complex systems problems. To run a successful ops-less organization, developers need to become experts in operations.

Let's imagine that we find ourselves in a company where the developers are experts in operations, every operational task has been turned into a self-service tool that is maintained by developers, and there are no traditional operational tasks left for the ops engineer to work on. What happens to the operations engineers? The first possibility is that the operations engineer moves on to another company where developers aren't responsible for operations - if you love ops work, then this would be the right move. The second possibility is that operations engineers become the software engineers responsible for internal self-service tools and the application platform, and they build centralized systems that all development teams can use for logging, for monitoring, for debugging, for deployment, and the like. This second possibility requires that ops engineers be strong and capable programmers (if you've automated yourself out of operations, then you're likely a pretty competent programmer), and it comes with work that will never end or be phased out: there will always be new and improved infrastructure tools that you can adopt and migrate your systems to, and there will always be new challenges to meet. In this second picture, there is practically no difference between a software engineer and an operations engineer - they both build and run their own systems - other than the domain: the software engineer builds and runs the top-level applications, while the (ex-)operations engineer builds and runs the infrastructure and tooling underneath the applications.

Now for the second concern: that developers and software engineers (some places use the terms interchangeably, some do not, so I'm including both terms here) simply don't have the skills, knowledge, and/or expertise needed to take over operational tasks. There is a little bit of truth to this, because many developers work in organizations where they've never had to do any operational tasks. If they were told they were responsible for their own operations, many of them would have no idea where to start or what to do. However, the majority of developers are very capable and competent engineers who know the fundamentals of software engineering and computer science - if they can write an application, they can definitely debug, monitor, deploy, and be on-call for that application. Like anything else, learning how to handle operational tasks just takes time, training, and help. 

There are three domains of expertise that experienced operations engineers tend to have. They have a lot of low-level systems knowledge: they know how to debug low-level systems problems, they understand the fundamentals of networking, they know the intricacies of the Linux kernel and they know every Linux command on this earth. They also have a lot of high-level distributed systems knowledge: they know how entire production systems fit and work together, they know how to design for scalability and reliability, they know how things should be deployed and monitored and logged. They also have solid programming skills: they know multiple languages quite well, and can easily write applications and tools for developers to use. The tricky thing about replacing operations engineers with software engineers is that software engineers tend to be experts only in the last of these domains. In an ops-less organization, developers need to have the skills and experience in the first two domains as well. Again, this isn't impossible, and I've met many, many software engineers who are extraordinarily skilled in all three areas.

Finally, the third concern: that building a reliable system is impossible without an operations organization. This is, once again, something that is true in organizations where developers aren't responsible for operational tasks and lack the systems and best practices knowledge that operations engineers bring to the table. If you build a system without engineers who have (i) a deep understanding of computer systems at a low level, (ii) knowledge of operational best practices, and (iii) experience with building and running large-scale production systems, then yes, you are going to have a really horrible experience trying to build a reliable system. But if your developers are trained and have experience in these three areas, then you will be just fine.

I think the identity crisis in ops is justified. Ops roles have been around for many, many years, and recently the conversation in the industry has shifted so much that industry leaders are beginning to ask if having a separate operations organization even makes sense. In organizations that train and expect their developers to run what they build, the operations roles will change dramatically - in my opinion, they will change for the better, and we'll be building large, complex systems and self-service tools and application platforms and all kinds of wonderful, exciting, cool things. However, I think that the majority of operations engineers will find that their roles will stay the same for a very long time: there will (probably) always be organizations that prefer to split their engineering resources between those who build (developers) and those who run (operations), and there will always be new organizations that are moving toward ops-less but need operations engineers to pave the way, to train the developers, to build the systems and automate things to the point that developers can begin taking on the operational responsibilities.

In summary: the identity crisis is real, and it's here, and it's unavoidable, but I don't think that it's a sign that things are getting worse for operations and for the tech industry. It's a sign that things are getting better. It's an exciting sign that the field is changing, evolving, growing. It's a sign that responsibility is (finally!) valued in the world of software engineering.