
You’re probably thinking about automation the wrong way

In infrastructure there’s a term that gets thrown around a lot among SREs and Platform Engineers: “Sysadmin who codes”. As DevOps gained popularity and new roles emerged, such as Site Reliability Engineer, Platform Engineer, and DevOps Engineer¹, people who had previously worked as Systems Administrators, rarely if ever writing any code, were suddenly required to become programmers. The term is often used in a somewhat derogatory way, and its presence is a strong indicator of how we in the DevOps community view ourselves and our roles - where we are and where we’ve come from.

The difference between an SRE and a “Sysadmin who codes” might not immediately be obvious, particularly to anyone not in our industry. After all, isn’t that what Site Reliability Engineering is all about? Isn’t that just what Google did? Hired Sysadmins who could code? Why would that be a bad thing? It’s easy to see how the narrative was formed, but the reality is that Google never did that. Instead, Google did the opposite. They hired Software Engineers and set them to work on their infrastructure. This difference (subtle, to some) is responsible for arguably the biggest schism in our industry.

So what’s the difference? In short, it’s the approach and mindset, and to a smaller extent the skills and capabilities. “Sysadmins who code” often have a more process-oriented mindset and tend to resolve problems in less abstract ways. Typically, while they have the ability to write code, they lack any training and may find themselves unable to understand relatively basic algorithms and why they’re used. In my experience, the code I’ve seen from these kinds of engineers is functional, but not particularly maintainable.

The biggest way I’ve noticed this manifesting is in how people approach resolving problems via “automation”. To me it’s one of the biggest indicators of someone’s mindset, and a potential red flag to look out for when interviewing or talking to people. A “Sysadmin who codes” views a script as an automated replacement for themselves - a digital record of the actions they would take, with little to no internal logic or problem solving. Most engineers I know understand that the actions a person needs to take to perform a task aren’t always the actions that a computer should take.

Often, this also relates to how the program is run and where responsibility for any impact lies. Most sysadmins I’ve seen take the view that programs shouldn’t make “serious” decisions, and should instead require user input or confirmation for those steps. On some level, this can be unavoidable - sometimes things are just too dangerous to run without at least a confirmation. In my experience though, this is usually a sign that a process is broken somewhere, and that’s what should be focused on instead. In my opinion, good engineers understand that responsibility for a program’s actions lies with the program itself, not with the user or process running it. If I give you a program to reset the data in a test database and it accidentally overwrites the production database, we shouldn’t blame the user but rather the system that allowed the mistake in the first place.
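To make the test database example concrete, here is a minimal sketch of a reset routine that takes responsibility for its own safety rather than relying on the user to point it at the right place. Everything here is an assumption made for illustration: the `Database` type, its `environment` label, the `UnsafeTargetError` exception, and the `wipe` callable that would perform the real reset are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


class UnsafeTargetError(Exception):
    """Raised when the reset is pointed at a database it must never touch."""


@dataclass(frozen=True)
class Database:
    name: str
    environment: str  # hypothetical label, e.g. "test" or "production"


def reset_test_database(db: Database, wipe: Callable[[Database], None]) -> str:
    """Reset `db`, but only if it is labelled as a test database.

    `wipe` is a hypothetical callable that performs the actual reset.
    The guard lives in the program, so a user mistake cannot reach production.
    """
    if db.environment != "test":
        raise UnsafeTargetError(
            f"refusing to reset {db.name!r}: environment is {db.environment!r}"
        )
    wipe(db)
    return f"reset {db.name}"
```

Calling `reset_test_database(Database("orders", "production"), wipe)` raises `UnsafeTargetError` before `wipe` is ever invoked. No confirmation prompt is needed, because the dangerous path simply doesn’t exist.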

This will usually also be seen in a reverence for runbooks and documentation. Not that I have any issues with either - they’re incredibly important components of any system - but not everything needs to be documented, and realistically, every time a runbook is written you should be asking yourself if that’s the best solution. This is where the sysadmins who code often have the biggest trouble in my experience. Historically that was the best way to solve problems as a sysadmin: write down exactly what you did to resolve an issue. After all, maybe it’ll happen again next year, long after you’re gone, when no one else knows exactly what you did or even why. With computers though? They never forget, and they never make mistakes or change how they do things, even when you want them to. In my view, the order of preference is: self-healing script triggered by an alert > runbook explaining which script or pipeline to run > runbook explaining what commands to run.²
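As a rough sketch of the top of that ordering, the snippet below dispatches an alert to an automated remediation and only pages a human when the automation is missing or fails. The alert names, the `remediations` mapping, and the `page_oncall` callable are all hypothetical. A real setup would wire this to whatever alerting system you use.

```python
from typing import Callable, Dict

# A remediation is a plain function that returns True if it fixed the problem.
# Keeping it a plain function means an engineer can also run it by hand.
Remediation = Callable[[], bool]


def handle_alert(
    alert_name: str,
    remediations: Dict[str, Remediation],
    page_oncall: Callable[[str], None],
) -> str:
    """Try the automated fix first; fall back to paging a human."""
    remediation = remediations.get(alert_name)
    if remediation is None:
        page_oncall(f"no automated remediation for {alert_name}")
        return "paged"
    try:
        fixed = remediation()
    except Exception as exc:  # the fix itself failed, so escalate
        page_oncall(f"remediation for {alert_name} crashed: {exc}")
        return "paged"
    if fixed:
        return "healed"
    page_oncall(f"remediation for {alert_name} did not resolve the issue")
    return "paged"
```

The runbook doesn’t disappear in this model. It shrinks to “run this function”, and is only needed when the automation has already tried and failed.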

None of this is to say that there’s no space left for Systems Administrators in our industry, or even that a team made up solely of “Sysadmins who code” is a red flag in and of itself. With these kinds of things I think it’s always important to remember the context that people are in - how big is the company and how complex is their infrastructure? How important is software to the business’s profit margins? You don’t need to be writing complex programs for a local accountancy. It is, however, important to recognise when you do need to be tackling these complex problems.

If you’re working in an environment that requires self-healing infrastructure and three or more nines of uptime, it’s more or less impossible to get there with people who have this kind of mindset. Complex, highly available environments require automated processes that are capable of thinking and reasoning for themselves³. After all, if you have an SLA allowing no more than 30 minutes of downtime per month, you’re probably losing half of that time just waiting for your oncall engineer to respond to the alert. Add to that the fact that you’re almost certainly not alerting on issues immediately, but on some kind of time-series-based metric. By the time the problem has occurred, the alert has fired (and woken up your oncall engineer), and they’ve drearily booted up their laptop, looked at the issue and worked out which script to run to fix it, it’s not unimaginable that you’ve already blown past not just this month’s SLA, but possibly next month’s as well. You need something that is a) never asleep and b) capable of immediately taking action.
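The arithmetic behind this is worth sketching out. The helper below converts an availability target into a monthly downtime allowance and subtracts the time a single human-handled incident eats up. The function names and the incident timings in the usage note are illustrative assumptions, not measurements.

```python
def monthly_downtime_budget_minutes(
    availability_percent: float, days_in_month: int = 30
) -> float:
    """Minutes of downtime allowed per month at the given availability."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


def remaining_budget_minutes(
    budget: float, detection_delay: float, response_time: float, fix_time: float
) -> float:
    """Budget left after one incident; a negative result means the SLA is blown."""
    return budget - (detection_delay + response_time + fix_time)
```

Three nines (99.9%) over a 30-day month works out to roughly 43 minutes of allowed downtime. With a 30 minute allowance, an assumed 5 minutes for the alert to fire, 15 minutes for the oncall engineer to respond, and 15 minutes to diagnose and fix, `remaining_budget_minutes(30, 5, 15, 15)` comes out at -5. A single incident has already overspent the month.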

¹ Generally speaking I hate the term “DevOps Engineer” and consider it to be the same as “Scrum Master” - examples of roles that should never have existed and go against their very origins, but that’s a topic for another day.
² It’s always worth remembering that while it’s unlikely your cloud CI server will be down, it can happen, and so the best self-healing system is either self-healing itself or simply calls a script that your oncall engineer could run themselves if the CI server did die.
³ Within reason.