social.kejadlen.dev

Conversation

Alpha Chen

Seeing like an SRE: https://www.usenix.org/publications/loginonline/seeing-sre-site-reliability-engineering-high-modernism

Knowledge of a specific software system is metis, rather than techne. This is why there is a learning curve when we start working on a new system, and why we don’t put our new teammates on call right away. More standardisation in infrastructure, better runbooks and so on are the software equivalent of dredging the shipping channel and putting markers on obstructions — they can somewhat reduce the amount of metis we need to have, but not eliminate the need for a local pilot entirely.
To return to where I started this article, this is why generic checklists always fall a bit flat, and why it’s very difficult to run a thorough production readiness review for a system that you aren’t deeply familiar with. Both of these are an attempt to substitute techne for metis, which just doesn’t work.
This, perhaps, is the source of some of the antipathy that some old-school sysadmins have for SRE (as exemplified by the reaction to Todd Underwood’s LISA 2013 talk, PostOps: A Non-Surgical Tale of Software, Fragility, and Reliability). SRE is seen as a high modernist project, intent on scientifically managing their systems, all techne and no metis; all SLOs and Kubernetes and no systems knowledge and craft. That view is not entirely wrong. Some in the SRE movement do see it that way — things like consulting SRE teams, SRE software platforms, SLOs, and Kubernetes are popular for a reason, and they do have their uses. But call it what you will – craft, metis – specific systems knowledge isn’t going away anytime soon. SRE or sysadmin, metis is the one aspect of our jobs that we are unlikely to ever automate away.

Alpha Chen

Terms of Service