Service Outage Postmortems
Comments discuss recent major outages at providers such as Fastly, Cloudflare, and Backblaze, analyzing root causes including configuration errors, network failures, overloads, and cascading effects, along with praise for detailed postmortem blog posts.
Sample Comments
Wasn't their previous major outage because of a bad migration?
Was the outage a configuration error by the admins or was it foul play?
"Impact: fixed processes that led to 8 hr outage" seems like an easy case to make.
Bit drastic considering this was their first major global outage...
Our sincere apologies for tonight's downtime. We're back up now after 30 incredibly frustrating minutes, but we're making changes to ensure this incident can't be repeated. The root cause was a network failure at our CDN, Fastly. The incident was limited to a single Point of Presence (POP) in San Jose, so if you were in Europe or Asia you didn't see anything wrong, but obviously at this time of day most traffic is from the west coast. While our uptime over the last f
We're sorry https://www.youtube.com/watch?v=9u0EL_u4nvw Edit: an outage of this length smells of bad systems architecture...
That literally happened, they blogged about it recently. https://www.backblaze.com/blog/recent-outages-why-we-acceler...
Title seems misleading - a poorly chosen default behavior caused the outage
Companies that explain in great detail why an outage happened - chef's kiss
From the blog post it sounds like no. They say a service got overloaded due to an increase in the number of datacenters and triggered a bug.