Muut service failure April 23, 2014: post mortem

Wed, 30 Apr 2014 14:13:41 GMT

Thu, 01 May 2014 01:50:31 GMT

Good luck guys! This is a lesson for all developers. It could happen to any new service. Thanks for the detailed post.

Thu, 01 May 2014 21:56:18 GMT

Thanks for the detailed description. Not many describe the who, what, where and why. This is a very transparent description and very much appreciated. Though to be honest, it cannot really be classed/sold as a "perfect storm": it was just an unexpected(?) failure combined with unproven/untested infra. Testing.. lol :-P

Chris Ueland (MaxCDN)
Fri, 02 May 2014 18:58:47 GMT

Thanks for the detailed description guys. Let us know if we can help at all.

Sat, 03 May 2014 03:29:10 GMT

Thanks for the description, and for your responsiveness and help when the failure happened...

Alberto Vasquez
Sat, 03 May 2014 05:50:21 GMT

Agreed that perfect storm is really not the right term when you clearly identified the issues at each step that could have been prevented. If it could have been prevented, it isn't anywhere near a perfect storm. Thanks for the write-up though. We can all learn from this.

Courtney Couch
Sun, 04 May 2014 13:27:47 GMT

So this is totally not important, so take this with a grain of salt. I think the semantics of phrase choices are hardly important here, but what the hell, I just felt like replying about the _perfect storm_ phrase. :)

A few random definitions I found:

> Perfect Storm: a critical or disastrous situation created by a powerful concurrence of factors

> Perfect Storm: a particularly bad or critical state of affairs, arising from a number of negative and unpredictable factors.

> Perfect Storm: a combination of events which are not individually dangerous, but occurring together produce a disastrous outcome

I think it's grey territory whether it fits. I have always used the phrase to mean any combination of items that multiply a result. Individually the issues here were not serious; in combination they were, however. At least how I've always used the phrase, that fits. Maybe I just always use the phrase wrong as well, but from the definitions it seems like it's hardly black and white. That first definition fits how I use it, and the last one does as well. The second fits a bit more with how you two like to use the phrase.

Anyway, I'm hardly an English expert, and Tero and Janne are Finnish, so I can imagine the way we're using it here may not be the most common way it's used.. but then I've been using it wrong for years as well, hah.

I digress, back to the real subject here! Glad you guys found our post interesting though. We wanted to try to be as detailed and transparent with what happened as we could!

Courtney Couch
Sun, 04 May 2014 13:28:14 GMT

@chris thanks! Definitely will!

Alberto Vasquez
Sun, 04 May 2014 15:35:15 GMT

@courtney good points! :)

Mon, 05 May 2014 21:40:30 GMT

@courtney the term fits.. the use of the term fits. My point was it never "should have" fit ;-) Blaming "it all" on a storm of errors is not correct. Sometimes things are untested.. sometimes the honest approach is saying that something was missed. Yep, semantics ;-) x

Wed, 08 Oct 2014 07:15:25 GMT

Did your team consider ensuring that your own servers could not DoS each other? You mentioned that because your internal network was on a firewall white-list, you had no DoS protection. A finicky process caught in an infinite start-up loop seems like one of those common bugs that pops up every now and again. Was increasing your number of available connections deemed a fully acceptable mitigation, or was this a stopgap measure?
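
To make the start-up loop concern concrete, here is a minimal sketch of one common client-side mitigation: capped exponential backoff with jitter between start-up attempts, so a crashing process slows down instead of hammering an internal server with new connections. This is only an illustration and not how Muut's services are built; the names `backoff_delays` and `start_with_backoff`, the defaults, and the use of `ConnectionError` as the failure signal are all assumptions made for the example.

```python
import random


def backoff_delays(base=0.5, cap=60.0, max_attempts=8, rng=random.random):
    """Yield one delay per attempt: exponential growth, capped, with full jitter.

    All defaults are illustrative assumptions, not values from the post.
    """
    for attempt in range(max_attempts):
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick a random delay in [0, ceiling) so that many
        # restarting processes do not retry in lockstep.
        yield rng() * ceiling


def start_with_backoff(start, sleep, **kwargs):
    """Call start() until it succeeds, sleeping between failed attempts.

    Returns the number of attempts used. Raises RuntimeError once the
    attempt budget is spent, instead of looping forever.
    """
    last_error = None
    attempts = 0
    for delay in backoff_delays(**kwargs):
        attempts += 1
        try:
            start()
            return attempts
        except ConnectionError as exc:  # hypothetical failure signal
            last_error = exc
            sleep(delay)
    raise RuntimeError("giving up after repeated start-up failures") from last_error
```

In a real process `sleep` would be `time.sleep`; it is passed in here so the behaviour can be checked without waiting. Backoff only limits how hard a well-behaved client retries, so it complements a server-side connection limit but does not replace one.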