Blameless Post-Mortems

This cluster collects discussions about conducting blameless post-mortems, root cause analysis, and SRE practices for handling production outages and incidents, with the goal of improving reliability without blaming individuals.

➡️ Stable 0.6x · DevOps & Infrastructure
Comments: 4,010
Years Active: 19
Top Authors: 5
Topic ID: #6158

Activity Over Time

Comments per year:
2008: 11    2009: 17    2010: 57    2011: 52    2012: 86
2013: 102   2014: 100   2015: 110   2016: 183   2017: 220
2018: 225   2019: 333   2020: 339   2021: 453   2022: 420
2023: 404   2024: 437   2025: 418   2026: 43

Keywords

e.g. US, CI, AWS, codeascraft.com, IR, CD, UDP, GP, HN, incident, team, outage, incidents, production, procedures, failure, sre, service, process

Sample Comments

gorodetsky • Feb 1, 2017 • View on HN

Wholeheartedly agree. Incidents are inevitable and it's important to have a proper RCA/Service-Disruption process in place to handle those. As mentioned on the other thread, maybe Gitlab doesn't have enough operational/SRE expertise in-house yet, but that was the case in every fast-growing company I've worked for over the last decade.

didibus • Dec 6, 2021 • View on HN

I don't think that's an issue per se. You can't fix everything, but you can at least try to find the most likely path and do something about it. Even if it wasn't the biggest cause, you made progress towards improving some of the variables at play for the issue. There's clearly something to learn and improve in your 5 whys. There was a hard-to-test feature, typos were allowed to pass through validation in the config, there was only one person making the config c
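
The config typo described above is the kind of failure a schema check can catch before a change ever reaches production. A minimal sketch, assuming a hypothetical service config with made-up field names (none of this is taken from the incident under discussion):

```python
# Hedged sketch: validate a config against a known set of keys so that typos
# (e.g. a misspelled field name) fail fast in CI instead of in production.
# ALLOWED_KEYS and the example values are hypothetical.
import difflib
import sys

ALLOWED_KEYS = {"service_name", "replicas", "timeout_seconds", "log_level"}

def validate_config(config: dict) -> list[str]:
    """Return human-readable errors; an empty list means the config passes."""
    errors = []
    for key in config:
        if key not in ALLOWED_KEYS:
            suggestion = difflib.get_close_matches(key, ALLOWED_KEYS, n=1)
            hint = f" (did you mean '{suggestion[0]}'?)" if suggestion else ""
            errors.append(f"unknown key '{key}'{hint}")
    for key in sorted(ALLOWED_KEYS - config.keys()):
        errors.append(f"missing required key '{key}'")
    return errors

if __name__ == "__main__":
    # 'replcas' is exactly the kind of typo that slips past human review
    # but not a mechanical check.
    candidate = {"service_name": "billing", "replcas": 3,
                 "timeout_seconds": 30, "log_level": "info"}
    problems = validate_config(candidate)
    for p in problems:
        print("config error:", p)
    sys.exit(1 if problems else 0)
```

Run against the typo'd example, this prints a suggestion for the misspelled key and exits non-zero, which is enough to fail a CI step.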

hackandtrip • Mar 17, 2022 • View on HN

It is probably caused by postmortem culture not being shared in the community. "Having problems" in this world (any kind, not only due to the GitHub scale!) is something that happens - we are not perfect and we work on an incredible number of layers of complexity. It is sufficient to actually touch production code on a daily basis to see that it can happen to the best, with the best observability systems or processes. The key is avoiding blaming, and understanding iteratively how

dullcrisp • Jan 2, 2026 • View on HN

Since you haven’t mentioned it, have you heard of blameless post-mortems? They’re a systematic approach to this type of issue.

lifeisstillgood • Dec 6, 2019 • View on HN

There must be quite a story behind this - will you be putting up a post-mortem? (Post-mortems of business "outages" are usually more instructional.)

totally • Feb 3, 2016 • View on HN

relevant: https://codeascraft.com/2012/05/22/blameless-postmortems/

lanstin • Jul 19, 2024 • View on HN

This outage seems to be the natural result of removing mandatory QA by a team other than the (always optimistic) dev team for extremely important changes, and of neglecting canary-type validations. The big question is whether businesses will migrate away from such a visibly incompetent organization. (Note: I blame the overall org; I am sure talented individuals tried their best inside a set of procedures that asked for trouble.)
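
The "canary type validations" mentioned above amount to promoting a change to a small slice of the fleet first and comparing its health against the stable baseline before a full rollout. A minimal sketch, where deploy_to, error_rate, and rollback are hypothetical hooks standing in for whatever tooling an org actually has:

```python
# Hedged sketch of a canary check: ship to a small fraction of hosts, let it
# bake, compare error rates against the stable fleet, and abort on regression.
import time

CANARY_FRACTION = 0.05      # 5% of hosts receive the new version first
BAKE_TIME_SECONDS = 600     # how long to observe the canary before judging it
MAX_ERROR_RATIO = 1.5       # canary may not exceed 1.5x the baseline error rate

def run_canary(version: str, deploy_to, error_rate, rollback) -> bool:
    """Deploy `version` to a canary slice and decide whether to promote it."""
    deploy_to(version, fraction=CANARY_FRACTION)
    time.sleep(BAKE_TIME_SECONDS)

    canary_errors = error_rate(group="canary")
    baseline_errors = error_rate(group="stable")

    # Guard against division issues when the stable fleet is error-free.
    if canary_errors > max(baseline_errors, 1e-9) * MAX_ERROR_RATIO:
        rollback(version)
        return False

    deploy_to(version, fraction=1.0)  # promote to the full fleet
    return True
```

The thresholds are illustrative; the point is that the decision to continue the rollout is made by a check, not by optimism.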

roughly • Jul 16, 2025 • View on HN

I’m generally more of a “blame the tools” than a “blame the people” person - depending on how the system is set up and how the configs are generated, it’s easy for a change like this to slip by - especially if a bunch of the diff is autogenerated. It’s still humans doing code review, and this kind of failure indicates process problems, regardless of whether or not laziness or stupidity were also present. But, yes, a second mitigation here would be defense in depth - in an ideal world, all your systems use
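
As one hedged illustration of the defense-in-depth point, a pre-merge gate can refuse to auto-approve diffs that are dominated by autogenerated config or that touch high-risk paths, so a human reviewer is not the only safeguard. The paths and thresholds below are hypothetical:

```python
# Hedged sketch: flag risky config diffs before merge instead of relying
# solely on human review of autogenerated changes.
HIGH_RISK_PREFIXES = ("prod/", "routing/", "auth/")
MAX_AUTOGEN_LINES = 200

def review_gate(changed_files: dict[str, int], autogenerated: set[str]) -> list[str]:
    """changed_files maps path -> changed line count; returns reasons to block auto-merge."""
    reasons = []
    for path, lines in changed_files.items():
        if path.startswith(HIGH_RISK_PREFIXES):
            reasons.append(f"{path}: touches a high-risk area, needs explicit sign-off")
        if path in autogenerated and lines > MAX_AUTOGEN_LINES:
            reasons.append(f"{path}: {lines} autogenerated lines exceed the review budget")
    return reasons

if __name__ == "__main__":
    blockers = review_gate(
        {"prod/routing.yaml": 12, "generated/endpoints.json": 450},
        autogenerated={"generated/endpoints.json"},
    )
    for reason in blockers:
        print("blocked:", reason)
```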

uponcoffee • Jul 12, 2019 • View on HN

I think the GP means that, as far as incidents occurring goes, so long as care is (or was) taken to prevent them and learn from them, that's all one can really reasonably ask for. The first incident falls under that heading and 'is fine' in a 'life happens' sense. The following incident comes across as reckless and avoidable, as there should have been procedures to safely test the rollback (and perhaps there were, but a perfect storm allowed it to fail in prod). Lacking deta

cweld510 • Jul 19, 2024 • View on HN

It’s not a matter of excusing or not excusing it. Incidents like this one happen for a reason, though, and the real solution is almost never “just do better.” Presumably Crowdstrike employs some smart engineers. I think it’s reasonable to assume that those engineers know what CI/CD is, they understand its utility, and they’ve used it in the past, hopefully even at Crowdstrike. Assuming that this is the case, then how does a bug like this make it into production? Why aren’t they doing the