And we're back! (from a very prolonged outage)
Author | Message |
---|---|
Jake Toodledo Founder |
I've been working so hard that I haven't had time to pre-type what I was going to announce, so I am doing that now, but wanted to let everyone know quickly that everything has been restored. I'll update this topic shortly with a longer explanation.
UPDATE: So here is the long story of what has been happening over the last 16 hours. I've built Toodledo on the principle of being completely open and honest about everything, so I'm going to lay everything out there, skeletons and all.
Our servers are hosted by Rackspace, which is a great company with excellent support and top-notch datacenters. At 7:15pm CDT yesterday, a severe storm was coming through and Rackspace decided to switch power to generators. During the switch there was a mechanical failure that caused some servers to lose power unexpectedly. When the servers came back online, we found that our database had become corrupted. Apparently, this is because the database was configured to write data to the filesystem, but the filesystem was configured to flush this to disk every 1 second. During that 1 second, that data was only stored in memory. So when the power went off, that data was lost. When the power came back on, the database freaked out because of that missing second. During this freakout, unknown bad stuff happened and the main database got corrupted beyond repair.
Luckily, we have a live backup database (called a slave) where all the data is replicated in real time. The purpose of a slave is to act as a backup in the event that the master dies. Unfortunately, the slave is an exact copy of the master, so when the power went out, the slave had the exact same problem. So now our backup was toast too.
I should say here that this 1-second buffer was a mistake and I take full responsibility for this. It was this oversight that is likely the cause of the problems. The way that it was set up, it would have been easy to recover if the master or the slave failed independently. A simultaneous failure was unrecoverable. I admit that I did not anticipate a scenario where both the master and slave would fail simultaneously, and I did not understand the ramifications of the 1-second buffer.
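To illustrate the 1-second buffer described above, here is a minimal sketch of the difference between a write that is merely handed to the operating system and one that is forced onto the disk. The post does not name the database or the exact setting involved, so this is a generic file-level illustration; the function names (`append_buffered`, `append_durable`, `demo`) are hypothetical.

```python
import os
import tempfile


def append_buffered(path: str, record: str) -> None:
    """Append a record and return once the OS has accepted the bytes.

    The data may sit in the OS page cache until the kernel flushes it,
    so a power cut shortly afterwards can lose it.
    """
    with open(path, "a", encoding="utf-8") as f:
        f.write(record + "\n")


def append_durable(path: str, record: str) -> None:
    """Append a record and wait until the OS reports it written to the device."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(record + "\n")
        f.flush()             # move Python's userspace buffer into the OS
        os.fsync(f.fileno())  # ask the OS to write its cached pages to disk


def demo() -> list:
    """Write one record each way and read both back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "tasks.log")
        append_buffered(path, "task 1")
        append_durable(path, "task 2")
        with open(path, encoding="utf-8") as f:
            return f.read().splitlines()
```

While the machine stays up, both functions produce the same file contents. The difference only shows when power is lost: the buffered write can vanish, while the durable one costs extra latency on every write in exchange for surviving the cut.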
The database is now configured to flush to disk immediately, which should greatly help in the short term. We are also exploring other options for long-term changes.
So, now we were in the sorry state of having to rely on our nightly offline backup, which is done at 4am every day. First, we had to transfer this huge file in from offsite, which took forever. Then we had to import all this data back into the database, which also took forever. This got us restored to 4am yesterday morning. Now, what we needed to do was replay the logs from 4am onward. The logs are like a big tape recorder. Every modification to the database gets logged in a linear fashion to the log. So, if we rewind the tape recorder and then play it back into the database, it won't know the difference between real user interaction and recorded interaction. This replay took forever. When it was done, we ran some tests and came back online.
Fortunately, all of the data has been restored. When I say "all" I should qualify that by saying that we did lose that 1-second buffer. So, if you were using the website at 7:15pm CDT last night, there is a slight chance that you may have lost the last thing that you did. There is also a slight but unverifiable chance that people who were using the website at 4:00am CDT yesterday morning might have a few edits missing. This is due to the nature of switching from the offsite backup to the tape recorder playback. The data loss should be extremely minimal, and only for a handful of people using the website yesterday at exactly 4:00am or 7:15pm.
I would just like to say that there is nobody (nobody) more horrified by this than myself. I was sick to my stomach all night; still am a little. Even though no data was lost, 16 hours of downtime is completely inexcusable and unacceptable. I know how important it is to have your to-do list available at all times. I fully expect to be issuing a lot of refunds and losing customers over this issue.
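The "tape recorder" recovery described above (restore the nightly backup, then replay every logged change made after it) can be sketched roughly as follows. This is an illustration under assumptions, not the actual procedure used: the `LogEntry` shape, the integer timestamps, and the `restore` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LogEntry:
    timestamp: int        # hypothetical: seconds since midnight
    task_id: int
    title: Optional[str]  # None means the task was deleted


def restore(snapshot: Dict[int, str], snapshot_time: int,
            log: List[LogEntry]) -> Dict[int, str]:
    """Rebuild the current state from a backup plus the change log."""
    db = dict(snapshot)  # start from a copy of the nightly backup
    for entry in sorted(log, key=lambda e: e.timestamp):
        if entry.timestamp <= snapshot_time:
            continue  # assumed to be already reflected in the backup
        if entry.title is None:
            db.pop(entry.task_id, None)      # replay a delete
        else:
            db[entry.task_id] = entry.title  # replay an add or edit
    return db
```

Entries stamped exactly at the snapshot time are ambiguous, since they may or may not have made it into the backup. This sketch skips them, which is one way edits made right at the boundary could go missing.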
The only thing that I can say is that I am deeply, deeply sorry and I am doing everything in my power to prevent this from happening again. Coincidentally, just last night Amazon had a similar weather-related outage that affected a huge number of customers, so it can affect even the largest companies. I know that that is no excuse; I just wanted to put things in perspective.
As a small token of appreciation for people who are willing to stick with Toodledo, I will be giving all existing Pro and Pro Plus subscribers a free month on the end of their subscriptions. Also, for the next thirty days, new Pro and Pro Plus subscribers will be getting 13 months instead of the usual 12 for their subscription payment.
I really appreciate all the positive remarks that I have received so far from users. I am happy to answer questions below.
Thanks,
Jake
This message was edited Jun 11, 2009. |
DomiA94 |
Well done !
Bravo !!! |
dpoort |
Don't sweat it, this stuff happens!! Hope you get some rest soon, I am sure you'll need it!
|
poly915 |
Thank you for fixing this unforeseeable issue. I think you did a great job with the auto redirect to your ongoing explanation. It was extra nice to see that you didn't spend any time making that page fancy or formatted. I truly believe that you have spent every waking moment fixing the problems.
|
Ida |
Job well done. And thank you for keeping us informed.
|
jpropper |
Hear, hear
|
david.gareth |
Thanks for your effort in resolving it...
|
Kim |
Thanks!
|
One Red Sock |
Wow it made me realize how much I rely on this program. Thanks for getting it back. YOU ROCK!
|
Ernie A. Stephens |
Congratulations on the recovery; I feel your pain. Just one quick suggestion, it would have been nice to see the Toodledo banner instead of the Soviet-style white background. Even though I never lost faith, it would have been a bit more reassuring to see some familiar colors and images.
I'm in for the long haul...keep up the good work! |
TheGriff_2 |
Glad to hear you are back. I'm thinking Rackspace owes you some money though. ;-)
Also, one suggestion: it would have been nice to have time stamps on your updates. That lets us know where things are at and when. |
christina |
Appreciated the updates that you put online each time I typed in toodledo.com. Glad you got it all straightened out, good job!
|
dmcguire |
Toodledo rocks! I appreciate your professional handling of this unforeseeable problem. Don't even worry about it.
|
castiron |
Welcome back! I missed Toodledo, but the updates were reassuring.
|
Anders |
The universe is in harmony once again :)
|
Ken.Griggs |
Thanks for the updates and for getting everything up - what a pain for you. I had actually been planning to recommend RackSpace to a client immediately before this. Not any more. I have to second the comment of a previous poster when I say that it was stunning to discover how much I rely on ToodleDo to function.
|
groyal |
I just signed up 2 days ago and spent quite a bit of time attempting to learn the features. I transferred all my GTD data over from several other sources.
Thankfully, I found out how to print the booklet. I can tell you now, I'm glad I printed that little booklet. Great feature and good job in handling a difficult situation. Thanks. |
mjbishop |
We missed you but certainly understand.
When the dust settles and you think more about how to be more proactive about such outages, you might suggest to folks that they do what I do... Every day or so I simply "print" my Toodledo page to a .PDF file and keep it on my desktop. At least that way I have an easy-to-access backup of my list that's completely under my control. |
drylemming |
Thanks for scrambling. The upside for you from the outage is that people like me may have to buy the iPhone app now, if it will let us have some offline access even if you go down for a bit.
Thanks |
Kevin Carpenter |
Thanks for providing us an explanation and updates instead of leaving us wondering what was going on with a generic error message. Thanks for your honesty and your empathy. I've been showing the people in my office how you've handled this and they're all impressed! Hope you get some good rest soon!
Cheers This message was edited Jun 11, 2009. |