Muut service failure April 23, 2014: post mortem
Muut Inc • 2014-04-25
On April 23, 2014, 10:24pm (Finnish time), the Muut service encountered a series of very unfortunate failures — dictionary definition of a perfect storm — that caused all data created after March 27 all the way to April 23 to disappear.
Wednesday, April 23 2014, 22:30 pm (GMT+2).
It’s Wednesday evening. Two of us are sitting in a Helsinki café, brainstorming for future Muut plans. It’s daytime in US, and our guys over there are active on the team chat, discussing the specifics of the next Muut version; our phones are constantly buzzing in our pockets with chat messages.
Phones buzz again. This time it’s a message to us. Expletives onscreen, in assertive all caps.
“Serious issue at the moment … looks like we may have lost data potentially.“
Barging to our own forums on our phone browsers, we see a couple of new posts on top; then the rest from 26 days ago, and older. There is no need to check our users' forums; at this point we know the failure is system wide. Every post, every user account, every forum created after March 27… Gone.
After a moment of tapping our phone screens in ominous silence, finding out more details and trying to figure out the extent of the damage, we assess our options briefly. Forum posts and emails from users reporting posts, entire forums, and user accounts missing start to arrive.
Pretty soon we discover that the vast majority of the data is recoverable. At this point this should have been certain, but when something so unexpected happens with all the preventive measures in place, you have no choice but to check that something equally unexpected didn’t happen to the alternative backups that you thought you’d never even need.
We decide we need to start feeding our users with facts as soon as possible. Preparations for the recovery process have already been started, and they should continue undisturbed, while the rest of the team works on answering questions and helping users to recreate lost forums and user accounts.
We hurry home, one by bike, the other running, while our US team members are active on the forums, letting people know about the hiccup, and starting work on the recovery script. 20 minutes later we fire up our laptops and simultaneously start typing on a text document, composing a note to our users to be posted on the forums, before starting to reply to mails and other posts.
It’s going to be a long night in Helsinki, and a long day in the US.
This is a tale of woe and we invite you to come along with us.
As a big picture perspective, we have a mongo server that deals with some of our data pipeline. Data runs through it but it’s non-critical (as in, if it goes down, or stops, there isn’t really any impact to users)—if there are problems there, the alarms don’t scream red alert. We also have built-in redundancy there, so a single instance failing is also not really something that we panic about.
In any case, a mongo instance simply stopped. Why, you might ask? We did too. There was nothing in the mongo logs and the other mongo instance on the same box was just fine, so this part remains a mystery. Normally this wouldn’t be an issue for us—we shut down these and replace them fairly often so we often think of them as throw away—except for one thing: one of the processes connected to that mongo instance had a bug in its error handling so it started looping its startup process. Infinitely.
Even that infinite process looping shouldn’t have been a problem, except that it started flooding our redis server with connections. Our internal network is on our firewall whitelists so any sort of DoS protection wouldn’t have kicked in. So many connections were being made to the redis masters that they ran out of file handles. This caused redis to… take a crap—and die. We test for catastrophic failure of our master DB in our development environment, so even this shouldn’t have been a major issue; it would have meant a couple minutes of downtime, then the slaves get promoted to masters and all is well with the world.
This, unfortunately, did not happen due to a single word added to the production startup scripts by our chef recipes: respawn. This innocuous-looking word caused upstart to simply boot the broken redis masters right back up instead of staying down as they should have done. This might seem like a good thing until you realize that our slaves are responsible for persistence, and the masters are treated as transient.
Failures on masters should be permanent, then slaves are upgraded to master, and masters are brought back up as slaves. So the masters came back up, loaded the most recent data dump the master had access to (this was 26 days old since the master doesn’t typically do any persistence), then the slaves reconnected and replaced their more current data with the masters'. They, subsequently, wrote their data to disk.
We truncate our AOF file on the slaves and backup the RDB files every hour; the script that was handling this was just keeping 1 backup (the backup script was running with the -dev flag inadvertently). This would again have been fine except for the fact that this failure happened 2 minutes before the hour was up, so before we could get to it, the rdb file was saved. Then the backup script took the old data and copied it replacing the backups.
Ultimately there were several mistakes that together combined in such a way to create the perfect storm:
- We considered mongo non-critical
- One of the processes using mongo didn’t handle failed mongo instances particularly well
- File handles should be much higher on redis masters
- Backup script needs the correct run flag -prod rather than -dev
What are we doing to prevent it from happening again?
Clearly we had some fixes to make:
- All processes across our network are being considered critical even if they don’t impact the service
- We’re testing for non-critical service failure on dev now
- All servers max file handle increased
- Respawn removed
- Backup script corrected
We’re also taking these further proactive changes:
- We are adding a secondary persistence server not managed by chef which constantly copies it’s data files to EBS and S3. This means that if any bugs in our network automation tooling creep in, our secondary persistence servers will be completely insulated.
- Latest slave backups copied to masters with every backup. This means that should some crazy disaster occur, all servers, including masters which do no persistence themselves, have the latest backup of data.
- rsync slave data files to EBS
- Testing for redis respawning on dev environment even though it shouldn’t ever occur.
With these changes we feel we’re well insulated from any data loss like this in the future, and we think these proactive changes will further insulate us from other unpredictable failure scenarios. Rather than simply 2 points of failure for data backups we now have aproximately 7 points that must fail for data loss.
Where are we now? (April 30, 2014)
By now, most of the forum content has been recovered, though many forums are still showing a gap between April 18 and 23. Some of this data is making a somewhat slower comeback. Some users are reporting missing threads despite completed script runs. This data is, sadly, gone. Sorry. It truly sucks.
In nearly as unfortunate news, user accounts and forums created after the cut-off point of March 27th need to be recreated; because of the secure way we store user credentials, they haven’t been duplicated in a recoverable manner outside of our primary data backup mechanisms (which failed).
The upside here is that missing forums didn’t have all their content lost — we can, in many cases restore much of the content that lived on these forums once they have been recreated.
So, will this happen again?
This is a prime example of a perfect storm. It’s like a perfectly arranged set of dominos. Any piece off, and no damage at all would have been done.
Of course, none of this should have happened even with those specific domino pieces in place. Worst case scenario, minutes of posts temporarily missing, milliseconds worth of lost data, and no elaborate data recovery scripts to run for days.
We’re not going to be making excuses — data loss is unacceptable, period. Especially for a service of this scale, it is a disaster — one that we’re doing everything we can in order to prevent from happening in the future.
We totally understand any concerns you might now have. If something like this happened this week, then who’s to say it won’t happen again in the next couple of months?
Luckily, we have figured out exactly what the issue was, and, in addition to doing all we can to recover the missing content as fast as we can, we have already focused our efforts on preventing anything even remotely like this from happening ever again in the future.
We repaired the problems that led to this loss and dramatically improved our backup redundancy. Rather than simply 2 places where failure must occur simultaneously for data to be lost there’s a whopping 7 places where failure has to occur for any of your valuable community data to vanish.
In the long term, the fact that this happened now will make the Muut services that much stronger and more reliable.
We also hope the users of the forums hit hardest by this will continue to contribute, in time making the gap left by the lost posts look tiny in comparison.
For what it’s worth, our resolve in making Muut as amazingly powerful, reliable, and effective a system has never been stronger — and it is only going to get better from here.