Notifixious is back!
… Phew… I need to breathe a little bit… As you have probably noticed, Notifixious is now back up and running. There are still a few hiccups to be expected, since you can’t really restart a massive feed parsing application in a few clicks, but we’re getting there.
These last 2 days have probably been the toughest since I started Notifixious, and I never thought I would come so close to just stopping everything. So, what happened?
That’s actually pretty simple: the “host” on which one of our EC2 instances was running broke. It was a hardware failure, the kind that probably happens very often to most web startups. The difference is that I expected (but shouldn’t have, obviously) that Amazon’s infrastructure was fault-tolerant.
And the instance that broke was actually hosting our database.
This shouldn’t have been a huge problem, since we had daily/hourly backups of everything. Unfortunately, we did what almost everybody does: we used an external script and never really checked thoroughly that the backups were actually usable. This was our biggest mistake: the backup we had was not usable.
The failure happened on a Sunday morning at 7 AM: probably the worst time of the week for a failure to happen.
After a few hours of panic, we were “advised” by the AWS teams to subscribe to their Premium Support and yes, they could help us get our data back. (I was kind of surprised to see the messages go from “no chance to recover anything” before I subscribed to the premium support to “all right, your data is safe” just 2 minutes after I paid.)
We then had a lot of trouble restoring everything while making sure we were doing the right things in the right order.
What we have learned:
- Do your backups yourself (write the scripts) and test them. Write a recovery plan (and test it as well).
- Replicate the database. We assumed that our traffic was too small and that we could/should do it later. That is not true: it’s never too early to do it.
- Simplify the architecture by removing everything that is not necessary from its “core”. It will then be easier to stop/start your application if you don’t have to handle half a dozen related services. This will require some additional development work, but we’re on it.
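To make the first lesson a bit more concrete, here is a minimal sketch of an automated backup sanity check in Python. It assumes the backup is a single dump file, optionally gzip-compressed. The function name `check_backup`, the thresholds and the example path are hypothetical, and this is not the script Notifixious actually runs.

```python
import gzip
import os
import time
import zlib


def check_backup(path, min_bytes=1024, max_age_hours=25.0, now=None):
    """Return a list of problems found with a backup file (empty list = passed).

    The thresholds are illustrative assumptions; tune them to your own dumps.
    """
    if not os.path.isfile(path):
        return ["missing: " + path]

    problems = []
    stat = os.stat(path)

    # A dump that is suspiciously small usually means the export failed.
    if stat.st_size < min_bytes:
        problems.append("too small: %d bytes" % stat.st_size)

    # A stale file means the backup job silently stopped running.
    now = time.time() if now is None else now
    age_hours = (now - stat.st_mtime) / 3600.0
    if age_hours > max_age_hours:
        problems.append("too old: %.1f hours" % age_hours)

    # Read the whole archive so truncated or corrupted files are detected.
    if path.endswith(".gz"):
        try:
            with gzip.open(path, "rb") as f:
                while f.read(1024 * 1024):
                    pass
        except (OSError, EOFError, zlib.error) as exc:
            problems.append("corrupt gzip: %s" % exc)

    return problems


if __name__ == "__main__":
    # Hypothetical path; point this at your own latest dump.
    for problem in check_backup("/var/backups/db/latest.sql.gz"):
        print("BACKUP PROBLEM:", problem)
```

A check like this only catches missing, stale, tiny or corrupt files. The real test is still the one from the list above: restore the dump into a scratch database on a regular basis and make sure the application can actually run against it.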
To conclude, the last lesson I learned is that being alone is not sustainable!
