And we're back! (from a very prolonged outage)

Author	Message
Jake (Toodledo Founder)	Posted: Jun 11, 2009 Score: 25
I've been working so hard that I haven't had time to pre-type what I was going to announce, so I am doing that now, but wanted to let everyone know quickly that everything has been restored. I'll update this topic shortly with a longer explanation.

UPDATE: So here is the long story of what has been happening over the last 16 hours. I've built Toodledo on the principle of being completely open and honest about everything, so I'm going to lay everything out there, skeletons and all.

Our servers are hosted by Rackspace, which is a great company with excellent support and top-notch datacenters. At 7:15pm CDT yesterday, a severe storm was coming through and Rackspace decided to switch power to generators. During the switch there was a mechanical failure that caused some servers to lose power unexpectedly.

When the servers came back online, we found that our database had become corrupted. Apparently, this is because the database was configured to write data to the filesystem, but the filesystem was configured to flush this to disk only once every second. During that 1 second, the data was stored only in memory. So when the power went off, that data was lost. When the power came back on, the database freaked out because of that missing second. During this freakout, unknown bad stuff happened and the main database got corrupted beyond repair.

Luckily, we have a live backup database (called a slave) where all the data is replicated in real time. The purpose of a slave is to act as a backup in the event that the master dies. Unfortunately, the slave is an exact copy of the master, so when the power went out, the slave had the exact same problem. So now our backup was toast too.

I should say here that this 1-second buffer was a mistake, and I take full responsibility for it. This oversight is likely the cause of the problems.
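
To illustrate the 1-second buffer described above, here is a minimal Python sketch. It is not Toodledo's actual configuration (the post does not name the database or its filesystem settings), and the function names `append_buffered`, `append_durable`, and `read_records` are made up for this example. It only shows the general difference between handing a write to the operating system and forcing it onto the disk before moving on.

```python
import os


def append_buffered(path, record):
    """Append a record and let the OS decide when it reaches the disk."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(record + "\n")
        # No fsync here: after close() the data may still sit in the OS
        # cache in memory, so a sudden power loss can drop it.


def append_durable(path, record):
    """Append a record and ask for it to be on disk before returning."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(record + "\n")
        f.flush()             # move Python's own buffer to the OS
        os.fsync(f.fileno())  # ask the OS to write its cache to the device


def read_records(path):
    """Return every record currently in the file, one per line."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]
```

Under normal operation both functions look identical to a reader of the file; the difference only shows up when power is lost between the write and the delayed flush. The durable version is slower because every write waits for the disk, which is the usual trade-off behind this kind of buffer. Even `fsync` relies on the drive honoring the request, so this is a sketch of the idea rather than a guarantee.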

The way that it was set up, it would have been easy to recover if the master or the slave had failed independently. A simultaneous failure was unrecoverable. I admit that I did not anticipate a scenario where both the master and slave would fail simultaneously, and I did not understand the ramifications of the 1-second buffer. The database is now configured to flush to disk immediately, which should greatly help in the short term. We are also exploring other options for long-term changes.

So, now we were in the sorry state of having to rely on our nightly offline backup, which is done at 4am every day. First, we had to transfer this huge file in from offsite, which took forever. Then we had to import all this data back into the database, which also took forever. This got us restored to 4am yesterday morning. Now, what we needed to do was replay the logs from 4am onward. The logs are like a big tape recorder. Every modification to the database gets logged in a linear fashion to the log. So, if we rewind the tape recorder and then play it back into the database, the database can't tell the difference between real user interaction and recorded interaction. This replay took forever. When it was done, we ran some tests and came back online.

Fortunately, all of the data has been restored. When I say "all" I should qualify that by saying that we did lose that 1-second buffer. So, if you were using the website at 7:15pm CDT last night, there is a slight chance that you may have lost the last thing that you did. There is also a slight but unverifiable chance that people who were using the website at 4:00am CDT yesterday morning might have a few edits missing. This is due to the nature of switching from the offsite backup to the tape recorder playback. The data loss should be extremely minimal, and only for a handful of people using the website yesterday at exactly 4:00am or 7:15pm.

I would just like to say that there is nobody (nobody) more horrified by this than I am.
I was sick to my stomach all night; still am a little. Even though no data was lost, 16 hours of downtime is completely inexcusable and unacceptable. I know how important it is to have your to-do list available at all times. I fully expect to be issuing a lot of refunds and losing customers over this issue. The only thing that I can say is that I am deeply, deeply sorry and I am doing everything in my power to prevent this from happening again.

Coincidentally, just last night Amazon had a similar weather-related outage that affected a huge number of customers, so it can affect even the largest companies. I know that that is no excuse; I just wanted to put things in perspective.

As a small token of appreciation for people who are willing to stick with Toodledo, I will be giving all existing Pro and Pro Plus subscribers a free month on the end of their subscriptions. Also, for the next thirty days, new Pro and Pro Plus subscribers will be getting 13 months instead of the usual 12 for their subscription payment.

I really appreciate all the positive remarks that I have received so far from users. I am happy to answer questions below.

Thanks,
Jake

This message was edited Jun 11, 2009.
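
For readers curious about the "tape recorder" recovery described in the post above, here is a rough Python sketch of restoring a backup and then replaying the log on top of it. It is an illustration only, not Toodledo's tooling: `LogEntry`, `replay`, the sequence numbers, and the task keys are all hypothetical, and a real database log records far more than a key and a value.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class LogEntry:
    seq: int              # position in the log; higher means later
    key: str              # hypothetical task id
    value: Optional[str]  # new value, or None when the task was deleted


def replay(snapshot: Dict[str, str], snapshot_seq: int,
           log: Iterable[LogEntry]) -> Dict[str, str]:
    """Rebuild state from a backup plus every log entry recorded after it."""
    state = dict(snapshot)  # work on a copy so the backup stays untouched
    for entry in sorted(log, key=lambda e: e.seq):
        if entry.seq <= snapshot_seq:
            continue        # already reflected in the backup
        if entry.value is None:
            state.pop(entry.key, None)
        else:
            state[entry.key] = entry.value
    return state


# Usage: the backup was taken after entry 1; entries 2 and 3 came later.
backup = {"t1": "buy milk"}
log = [
    LogEntry(1, "t1", "buy milk"),
    LogEntry(2, "t2", "call Bob"),
    LogEntry(3, "t1", None),
]
restored = replay(backup, snapshot_seq=1, log=log)
```

The delicate part is knowing exactly which log position the backup corresponds to. If `snapshot_seq` is off by even a little, edits right at the boundary can be skipped, which may be the kind of edge case behind the caveat about edits made at exactly 4:00am.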

DomiA94	Posted: Jun 11, 2009 Score: 2
Well done! Bravo!!!

dpoort	Posted: Jun 11, 2009 Score: 4
Don't sweat it, this stuff happens!! Hope you get some rest soon, I am sure you'll need it!

poly915	Posted: Jun 11, 2009 Score: 2
Thank you for fixing this unforeseeable issue. I think you did a great job with the auto redirect to your ongoing explanation. It was extra nice to see that you didn't spend any time making that page fancy or formatted. I truly believe that you have spent every waking moment fixing the problems.

Ida	Posted: Jun 11, 2009 Score: 2
Job well done. And thank you for keeping us informed.

jpropper	Posted: Jun 11, 2009 Score: 1
Hear, hear.

david.gareth	Posted: Jun 11, 2009 Score: 2
Thanks for your effort in resolving it...

Kim	Posted: Jun 11, 2009 Score: 1
Thanks!

One Red Sock	Posted: Jun 11, 2009 Score: 1
Wow, it made me realize how much I rely on this program. Thanks for getting it back. YOU ROCK!

Ernie A. Stephens	Posted: Jun 11, 2009 Score: 1
Congratulations on the recovery; I feel your pain. Just one quick suggestion: it would have been nice to see the Toodledo banner instead of the Soviet-style white background. Even though I never lost faith, it would have been a bit more reassuring to see some familiar colors and images. I'm in for the long haul...keep up the good work!

TheGriff_2	Posted: Jun 11, 2009 Score: 1
Glad to hear you are back. I'm thinking Rackspace owes you some money though. ;-) Also, one suggestion: it would have been nice to have time stamps on your updates. That lets us know where things are at and when.

christina	Posted: Jun 11, 2009 Score: 1
Appreciated the updates that you put online each time I typed in toodledo.com. Glad you got it all straightened out, good job!

dmcguire	Posted: Jun 11, 2009 Score: 2
Toodledo rocks! I appreciate your professional handling of this unforeseeable problem. Don't even worry about it.

castiron	Posted: Jun 11, 2009 Score: 1
Welcome back! I missed Toodledo, but the updates were reassuring.

Anders	Posted: Jun 11, 2009 Score: 1
The universe is in harmony once again :)

Ken.Griggs	Posted: Jun 11, 2009 Score: 1
Thanks for the updates and for getting everything up - what a pain for you. I had actually been planning to recommend Rackspace to a client immediately before this. Not any more. I have to second the comment of a previous poster when I say that it was stunning to discover how much I rely on ToodleDo to function.

groyal	Posted: Jun 11, 2009 Score: 0
I just signed up 2 days ago and spent quite a bit of time attempting to learn the features. I transferred all my GTD data over from several other sources. Thankfully, I found out how to print the booklet. I can tell you now, I'm glad I printed that little booklet. Great feature and good job in handling a difficult situation. Thanks.

mjbishop	Posted: Jun 11, 2009 Score: 0
We missed you but certainly understand. When the dust settles and you think more about how to be more proactive about such outages, you might suggest to folks that they do what I do... Every day or so I simply "print" my Toodledo page to a .PDF file and keep it on my desktop. At least that way I have an easy-to-access backup of my list that's completely under my control.

drylemming	Posted: Jun 11, 2009 Score: 0
Thanks for scrambling. Upside for you from the outage is that people like me may have to buy the iPhone app now if it will let us have some offline access even if you go down for a bit. Thanks

Kevin Carpenter	Posted: Jun 11, 2009 Score: 0
Thanks for providing us an explanation and updates instead of leaving us wondering what was going on with a generic error message. Thanks for your honesty and your empathy. I've been showing the people in my office how you've handled this and they're all impressed! Hope you get some good rest soon! Cheers
This message was edited Jun 11, 2009.
