Blameless Post-Mortems

This cluster collects discussions about conducting blameless post-mortems, root cause analysis, and SRE practices for handling production outages and incidents, with the goal of improving reliability without blaming individuals.

➡️ Stable 0.6x · DevOps & Infrastructure
Comments: 4,010
Years Active: 19
Top Authors: 5
Topic ID: #6158

Activity Over Time

Comments per year:
2008: 11    2009: 17    2010: 57    2011: 52    2012: 86
2013: 102   2014: 100   2015: 110   2016: 183   2017: 220
2018: 225   2019: 333   2020: 339   2021: 453   2022: 420
2023: 404   2024: 437   2025: 418   2026: 43

Keywords

e.g. US, CI, AWS, codeascraft.com, IR, CD, UDP, GP, HN, incident, team, outage, incidents, production, procedures, failure, sre, service, process

Sample Comments

gorodetsky • Feb 1, 2017 • View on HN

Wholeheartedly agree. Incidents are inevitable and it's important to have a proper RCA/Service-Disruption process in place to handle those. As mentioned on the other thread, maybe Gitlab doesn't have enough operational/SRE expertise in-house yet, but that was the case in every fast-growing company I've worked for over the last decade.

didibus • Dec 6, 2021 • View on HN

I don't think that's an issue per se. You can't fix everything, but you can at least try to find the most likely path and do something about it. Even if it wasn't the biggest cause, you made progress towards improving some of the variables at play for the issue. There's clearly something to learn and improve in your 5 whys. There was a hard-to-test feature, typos were allowed to pass through validation in the config, there was only one person making the config c
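
The config typo described above is the kind of failure a schema check can catch before a change ever reaches production. A minimal sketch, assuming a hypothetical service config with made-up field names (none of this is taken from the incident under discussion):

```python
# Hedged sketch: validate a config against a known set of keys so that typos
# (e.g. a misspelled field name) fail fast in CI instead of in production.
# ALLOWED_KEYS and the example values are hypothetical.
import difflib
import sys

ALLOWED_KEYS = {"service_name", "replicas", "timeout_seconds", "log_level"}

def validate_config(config: dict) -> list[str]:
    """Return human-readable errors; an empty list means the config passes."""
    errors = []
    for key in config:
        if key not in ALLOWED_KEYS:
            suggestion = difflib.get_close_matches(key, ALLOWED_KEYS, n=1)
            hint = f" (did you mean '{suggestion[0]}'?)" if suggestion else ""
            errors.append(f"unknown key '{key}'{hint}")
    for key in sorted(ALLOWED_KEYS - config.keys()):
        errors.append(f"missing required key '{key}'")
    return errors

if __name__ == "__main__":
    # 'replcas' is exactly the kind of typo that slips past human review
    # but not a mechanical check.
    candidate = {"service_name": "billing", "replcas": 3,
                 "timeout_seconds": 30, "log_level": "info"}
    problems = validate_config(candidate)
    for p in problems:
        print("config error:", p)
    sys.exit(1 if problems else 0)
```

Run against the typo'd example, this prints a suggestion for the misspelled key and exits non-zero, which is enough to fail a CI step.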

hackandtrip • Mar 17, 2022 • View on HN

It is probably caused by postmortem culture not being shared in the community. "Having problems" in this world (any kind, not only due to the GitHub scale!) is something that happens - we are not perfect and we work on an incredible number of layers of complexity. It is sufficient to actually touch production code on a daily basis to see that it can happen to the best, with the best observability systems or processes. The key is avoiding blaming, and understanding iteratively how

dullcrisp • Jan 2, 2026 • View on HN

Since you haven’t mentioned it, have you heard of blameless post-mortems? They’re a systematic approach to this type of issue.

lifeisstillgood • Dec 6, 2019 • View on HN

There must be quite a story behind this - will you be putting up a post-mortem? (Post-mortems of business "outages" are usually more instructional.)

totally • Feb 3, 2016 • View on HN

relevant: https://codeascraft.com/2012/05/22/blameless-postmortems/

lanstin • Jul 19, 2024 • View on HN

This outage seems to be the natural result of removing mandatory QA by a team other than the (always optimistic) dev team for extremely important changes, and of neglecting canary-type validations. The big question is whether businesses will migrate away from such a visibly incompetent organization. (Note: I blame the overall org; I am sure talented individuals tried their best inside a set of procedures that asked for trouble.)
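
The "canary type validations" mentioned above amount to promoting a change to a small slice of the fleet first and comparing its health against the stable baseline before a full rollout. A minimal sketch, where deploy_to, error_rate, and rollback are hypothetical hooks standing in for whatever tooling an org actually has:

```python
# Hedged sketch of a canary check: ship to a small fraction of hosts, let it
# bake, compare error rates against the stable fleet, and abort on regression.
import time

CANARY_FRACTION = 0.05      # 5% of hosts receive the new version first
BAKE_TIME_SECONDS = 600     # how long to observe the canary before judging it
MAX_ERROR_RATIO = 1.5       # canary may not exceed 1.5x the baseline error rate

def run_canary(version: str, deploy_to, error_rate, rollback) -> bool:
    """Deploy `version` to a canary slice and decide whether to promote it."""
    deploy_to(version, fraction=CANARY_FRACTION)
    time.sleep(BAKE_TIME_SECONDS)

    canary_errors = error_rate(group="canary")
    baseline_errors = error_rate(group="stable")

    # Guard against division issues when the stable fleet is error-free.
    if canary_errors > max(baseline_errors, 1e-9) * MAX_ERROR_RATIO:
        rollback(version)
        return False

    deploy_to(version, fraction=1.0)  # promote to the full fleet
    return True
```

The thresholds are illustrative; the point is that the decision to continue the rollout is made by a check, not by optimism.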

roughly • Jul 16, 2025 • View on HN

I’m generally more of a “blame the tools” than a “blame the people” person - depending on how the system is set up and how the configs are generated, it’s easy for a change like this to slip by - especially if a bunch of the diff is autogenerated. It’s still humans doing code review, and this kind of failure indicates process problems, regardless of whether or not laziness or stupidity were also present. But, yes, a second mitigation here would be defense in depth - in an ideal world, all your systems use
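
As one hedged illustration of the defense-in-depth point, a pre-merge gate can refuse to auto-approve diffs that are dominated by autogenerated config or that touch high-risk paths, so a human reviewer is not the only safeguard. The paths and thresholds below are hypothetical:

```python
# Hedged sketch: flag risky config diffs before merge instead of relying
# solely on human review of autogenerated changes.
HIGH_RISK_PREFIXES = ("prod/", "routing/", "auth/")
MAX_AUTOGEN_LINES = 200

def review_gate(changed_files: dict[str, int], autogenerated: set[str]) -> list[str]:
    """changed_files maps path -> changed line count; returns reasons to block auto-merge."""
    reasons = []
    for path, lines in changed_files.items():
        if path.startswith(HIGH_RISK_PREFIXES):
            reasons.append(f"{path}: touches a high-risk area, needs explicit sign-off")
        if path in autogenerated and lines > MAX_AUTOGEN_LINES:
            reasons.append(f"{path}: {lines} autogenerated lines exceed the review budget")
    return reasons

if __name__ == "__main__":
    blockers = review_gate(
        {"prod/routing.yaml": 12, "generated/endpoints.json": 450},
        autogenerated={"generated/endpoints.json"},
    )
    for reason in blockers:
        print("blocked:", reason)
```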

uponcoffee • Jul 12, 2019 • View on HN

I think the GP means that, as far as incidents occurring goes, so long as care is (or was) taken to prevent them and learn from them, that's all one can really reasonably ask for. The first incident falls under that heading and 'is fine' in a 'life happens' sense. The following incident comes across as reckless and avoidable, as there should have been procedures to safely test the rollback (and perhaps there were, but a perfect storm allowed it to fail in prod). Lacking deta

cweld510 • Jul 19, 2024 • View on HN

It’s not a matter of excusing or not excusing it. Incidents like this one happen for a reason, though, and the real solution is almost never “just do better.” Presumably Crowdstrike employs some smart engineers. I think it’s reasonable to assume that those engineers know what CI/CD is, they understand its utility, and they’ve used it in the past, hopefully even at Crowdstrike. Assuming that this is the case, then how does a bug like this make it into production? Why aren’t they doing the