Good luck guys! This is a lesson for all developers. It could happen to any new service. Thanks for the detailed post.
Thanks for the detailed description. Not many describe the who, what, where and why. This is a very transparent description and very much appreciated. Though to be honest, it cannot really be classed/sold as a "perfect storm". It was just a combination of an unexpected(?) failure and an unproven/untested infra. Testing.. lol :-P
Thanks for the detailed description guys. Let us know if we can help at all.
Thanks for the description, and for your responsiveness and help when the failure happened.
Agreed that "perfect storm" is really not the right term when you clearly identified issues at each step that could have been prevented. If it could have been prevented, it isn't anywhere near a perfect storm. Thanks for the write-up though. We can all learn from this.
So this is totally not important, so take it with a grain of salt. I think the semantics of phrase choices hardly matter here, but what the hell, I just felt like replying about the _perfect storm_ phrase. :)

A few random definitions I found:

> Perfect Storm: a critical or disastrous situation created by a powerful concurrence of factors

> Perfect Storm: a particularly bad or critical state of affairs, arising from a number of negative and unpredictable factors.

> Perfect Storm: a combination of events which are not individually dangerous, but occurring together produce a disastrous outcome

I think it's grey territory whether it fits. I have always used the phrase to mean any combination of items that multiply a result. Individually the issues here were not serious; in combination, however, they were. At least by how I've always used the phrase, that fits. Maybe I've just always used the phrase wrong, but from the definitions it seems like it's hardly black and white. The first definition fits how I use it, and the last one does as well. The second fits a bit more with how you two like to use the phrase.

Anyway, I'm hardly an English expert, and Tero and Janne are Finnish, so I can imagine the way we're using it here may not be the most common way it's used.. but then I've been using it wrong for years as well, hah.

I digress, back to the real subject here! Glad you guys found our post interesting though. We wanted to be as detailed and transparent about what happened as we could!
@chris thanks! Definitely will!
@courtney good points! :)
@courtney the term fits.. the use of the term fits. My point was it never "should have" fit ;-) Blaming "it all" on a storm of errors is not correct. Sometimes things are untested.. sometimes the honest approach is saying that something was missed. Yep, semantics ;-) x
Did your team consider ensuring that your own servers could not DoS each other? You mentioned that because your internal network was on a firewall white-list, you had no DoS protection. A finicky process caught in an infinite start-up loop seems like one of those common bugs that pop up every now and again. Was increasing your number of available connections deemed a fully acceptable mitigation, or was this a stopgap measure?
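
For what it's worth, one common way to keep a process stuck in a start-up loop from hammering its peers is to put a capped exponential backoff between restart attempts. Below is a minimal sketch of that idea, not what the team actually runs: `backoff_delays`, `supervise`, and all of their parameters are hypothetical names made up for illustration.

```python
import random
import time


def backoff_delays(base=1.0, cap=60.0, attempts=6, jitter=0.0, rng=random.random):
    """Yield the wait (in seconds) to apply after each failed start attempt.

    The delay doubles on every attempt and is clamped to `cap`.
    All names and defaults here are illustrative assumptions.
    """
    for attempt in range(attempts):
        delay = min(cap, base * (2 ** attempt))
        # Optional jitter adds up to `jitter * delay` extra seconds so that
        # many servers restarting together do not reconnect in lockstep.
        yield delay + delay * jitter * rng()


def supervise(start, sleep=time.sleep, **backoff_args):
    """Call `start()` until it returns True, backing off between failures.

    Returns True on success, False once the attempts are used up.
    """
    for delay in backoff_delays(**backoff_args):
        if start():
            return True
        sleep(delay)
    return False
```

With `jitter=0.0` the waits are deterministic, e.g. `base=1.0, cap=10.0, attempts=5` gives 1, 2, 4, 8, 10 seconds. Giving up after a bounded number of attempts (rather than looping forever) means a broken process ends up stopped and visible instead of quietly consuming connections.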