The Irreproducibility of Bugs in Large-Scale Production Systems

The crux of bug reproducibility is this: reproducing a bug requires the system to be in nearly the same state at the time of reproduction as it was when the bug originally occurred, something that is impossible to guarantee in large production systems...

The Ops Identity Crisis

A big theme in the keynotes and conversations at Velocity Conf in NYC a few weeks ago was the role of ops in an "ops-less" and "server-less" world. It has also come up often in discussions on Twitter and in conversations I've had with coworkers and friends in the industry. Several things stand out to me in these conversations: first, that some ops engineers (sysadmins, techops, devops, and SREs) are worried that they will be phased out if developers and software engineers become responsible for the operational tasks in their systems; second, that developers and software engineers do not have the skills needed to take over responsibility for operational tasks; and third, that building reliable systems is impossible without an operations organization.
