Ongoing server issues
Sunday at about 2pm PST, the power to an entire rack (one of our two) went out when a circuit breaker tripped on one of our PDUs (a rack power strip, basically). This rack held all of our load balancers, so losing it killed access to our service, and it also held about 1/3 of our database servers. Power was restored within 10 minutes, but a power outage is really bad for a database server, as it typically causes index corruption in table files that were not closed cleanly.
The database servers that crashed: db6, 9, 15, 16, 17, 18, 19, 22, 27, 28, 29, 31, 32, 33, 34, 35, and 39.
Standard methods of repair were used on the databases and they appeared fine initially. However as is apparent today, there are some serious issues going on with these servers. All of them are having problems except 17, 27, and 34.
db6 and 29 are the worst because the main table that stores summary data (e.g. visitors tally, top pages, etc) is corrupted, so the dashboards for those sites appear mostly blank. On the other servers, it's mainly the visitor and action log tables that are having issues, so it's not as apparent at first when viewing your reports.
All of the historical data is there. MySQL just refuses to find it when queried. The tools we use to analyze the tables say the tables are fine.
This particular scenario has only ever happened *once* before, and back then I tried repairing the tables multiple times with no results. Finally I found a solution that worked that time, and that's what's going on right now.
The solution is to run a script that copies all of the data in those tables into new tables, then replace the old tables with the new ones. This is the same process we use when we do our major database maintenance/purges once or twice a year, as copying data to a new table, while leaving some old data behind, is the fastest way to delete hundreds of millions of rows of data from a database. I know, it sounds silly, but trust me.
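As a rough illustration of the copy-and-swap idea described above, here is a minimal Python sketch. It uses the standard-library sqlite3 module rather than MySQL so that it runs anywhere, and the table name `summary`, the `SCHEMA` definition, and the `rebuild_table` helper are all hypothetical. It is not the actual script used on these servers.

```python
import sqlite3

# Hypothetical schema; {name} lets the same definition create the fresh copy.
SCHEMA = "CREATE TABLE {name} (id INTEGER PRIMARY KEY, day TEXT, hits INTEGER)"


def rebuild_table(conn, table, schema, keep_where="1=1"):
    """Copy the rows worth keeping into a new table, then swap it in.

    keep_where is a SQL condition selecting rows to keep; the default keeps
    everything (a pure rebuild), while a date condition doubles as a purge.
    """
    new, old = f"{table}_new", f"{table}_old"
    conn.execute(schema.format(name=new))
    # Rows that fail the condition are simply never copied, so nothing
    # has to be deleted row by row from the old table.
    conn.execute(f"INSERT INTO {new} SELECT * FROM {table} WHERE {keep_where}")
    # Swap the tables, then discard the old one.
    conn.execute(f"ALTER TABLE {table} RENAME TO {old}")
    conn.execute(f"ALTER TABLE {new} RENAME TO {table}")
    conn.execute(f"DROP TABLE {old}")
    conn.commit()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA.format(name="summary"))
    conn.executemany(
        "INSERT INTO summary (day, hits) VALUES (?, ?)",
        [("2012-06-01", 10), ("2013-10-20", 25), ("2013-10-21", 40)],
    )
    # Keep only 2013 data; the 2012 row is left behind in the old table.
    rebuild_table(conn, "summary", SCHEMA, keep_where="day >= '2013-01-01'")
    return conn.execute("SELECT day, hits FROM summary ORDER BY day").fetchall()
```

Calling `demo()` returns only the two 2013 rows, now stored in a freshly built `summary` table. On a real MySQL server the same pattern would need the engine's own DDL and locking behavior taken into account.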
So basically, these tables are going to get this script run on them to move the data to new tables. This process takes a while. It's unfortunate that most of the servers affected by this were older servers because they have way more data which means more time. For most of these servers it's going to be at least five hours, but a few are going to be closer to 10 hours.
The servers will remain online so whatever data there is available, you will be able to access. However, traffic processing will be halted for each server while this script is running. And then it's going to take quite a while for each of them to catch back up with real time. Realistically, for some of the larger servers, it's going to be close to 24 hours before the fix is done AND traffic is caught back up with real time again. Normally we do this on a Friday or Saturday night so that processing can catch back up much faster, and there's the least impact on you guys. Doing this Monday morning is the absolute worst possible time because it's the highest traffic day for a lot of sites (so it will take the longest to catch back up), not to mention after having a fun weekend you want to go digging into your stats.
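To get a feel for the catch-up time, here is a back-of-the-envelope sketch in Python. It assumes a server that was halted for some number of hours and then works through its queue at a constant multiple of the incoming traffic rate. The `speedup` values are made-up illustrations, not measurements from these servers.

```python
def catch_up_hours(halted_hours, speedup):
    """Hours needed to drain the backlog once processing resumes.

    speedup is how fast the server processes traffic relative to how fast
    new traffic arrives (hypothetical; it must exceed 1 to ever catch up).
    """
    if speedup <= 1:
        raise ValueError("server never catches up unless speedup > 1")
    # The backlog is halted_hours' worth of traffic, and every hour of
    # processing shrinks it by (speedup - 1) hours' worth.
    return halted_hours / (speedup - 1)


quiet_night = catch_up_hours(10, 3.0)  # low incoming traffic
busy_monday = catch_up_hours(10, 1.5)  # high incoming traffic
```

With the same 10-hour halt, the lower speedup on a busy day stretches the catch-up from 5 hours to 20, which is why a Monday is so much worse than a weekend night.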
We'll be offering refunds or credits to anyone who would like one. We'll be updating this thread as well as our twitter feed as things progress.
http://twitter.com/clicky
If you are having any issues with a site on a database server NOT listed above, or on 17, 27, or 34 (which we think are fine at the moment), let us know!
Posted Mon Oct 21 2013 12:51p by Your Friendly Clicky Admin
As someone who just migrated over to a new server and had database problems... I can relate. No refund required.
"This is the same process we use when we do our major database maintenance/purges once or twice a year". Why do you do this once or twice per year and not daily to gradually remove old data? I ask because I am considering the same thing to purge old data. :)
Posted Mon Oct 21 2013 1:24p by xpos***
How do we get credits? This came at a horrible time for me, as I'm trying to test conversions.
Posted Mon Oct 21 2013 1:24p by stephengu***
I don't want to leave/get a refund, but some credit toward the next billing cycle would be appreciated.
Posted Mon Oct 21 2013 1:25p by stephengu***
Good luck! Thanks for the update!
Posted Mon Oct 21 2013 1:49p by ricardo***
How do we know what DB we're on?
Should we be expecting to see 0 visitors in our logs for today, but see referers/# of people online/visitors in the graph?
Has any tracking data been lost?
Posted Mon Oct 21 2013 1:51p by hireahelpe***
Oh well, crap happens. I hope you chewed the Data Centre a new one, and they will give you some hosting credits. Good luck with the fixes, Clicky remains the best analytics tool, and personally, I can live with a little hiccup like this once in a while. :)
Posted Mon Oct 21 2013 1:56p by Dave***
The visitor and action logs should work fine now for all db servers. The main summary table is still under repair on all servers, though.
Posted Mon Oct 21 2013 2:02p by Your Friendly Clicky Admin
@stephenguise we'll post more about this later
@DaveL Not the DC's fault, unfortunately! It's our own PDU. We overloaded it. :(
Posted Mon Oct 21 2013 2:03p by Your Friendly Clicky Admin
I would appreciate some credit towards the next billing cycle. How should I go about getting this?
Posted Mon Oct 21 2013 2:03p by cardi***
Sean, my site on db29 does not show any visitors since a few hours ago. It's like everything just froze at 11:42 PST. Is that just an artifact of it catching up to real time?
Posted Mon Oct 21 2013 2:35p by merono***
Really appreciate your candid report. As someone has said already, crap happens. No, I don't want a refund, but thanks for asking. Don't bankrupt yourself; you do a great job!
alan
www.alanbrainart.com
Posted Mon Oct 21 2013 3:46p by alanbra***
May the Force be with you Clicky team !
@alanbrain : +1
Adrien
Posted Mon Oct 21 2013 4:14p by winbo***
ETAs for summary table repair to be completed, at which point traffic will resume processing:
db6 - 3 hours
db9 - 3 hours
db15 - 4 hours
db16 - 6 hours
db18 - 6 hours
db19 - 3 hours
db22 - 3 hours
db28 - 7 hours
db29 - 2 hours
db31 - 7 hours
db32 - 1 hour
db33 - 2 hours
db35 - 3 hours
db39 - 2 hours
Thanks for your patience! :)
Posted Mon Oct 21 2013 4:57p by Your Friendly Clicky Admin
Note: Above ETAs posted at 5pm PST (-7 GMT)
Posted Mon Oct 21 2013 4:58p by Your Friendly Clicky Admin
@xpose, deleting data from a database is insanely slow because of the way it updates the indexes as data is deleted. And it completely kills I/O in the process. We found it's much better to only have one or two major database planned "outages" a year and get it all done with at once. And we always do it Friday or Saturday nights so it doesn't really impact too many people.
Posted Mon Oct 21 2013 5:09p by Your Friendly Clicky Admin
Still love clicky. I won't go for a refund, but some credit toward the next billing cycle would be appreciated.
Posted Mon Oct 21 2013 5:51p by oleafri***
32, 33, and 39 are all done and have resumed processing traffic. On initial diagnosis, all of them appear to be working perfectly. They're all about 6 hours behind real time, though.
Posted Mon Oct 21 2013 6:23p by Your Friendly Clicky Admin
Alright db29 is done now, and as I said before, the historical data that was there but not accessible - now accessible! db6 is almost done and it should be the same!
Posted Mon Oct 21 2013 7:29p by Your Friendly Clicky Admin
db6, 19, and 35 also just finished. db6 was the other server (with db29) having the worst issues. And now everything there looks good as well!
Posted Mon Oct 21 2013 7:35p by Your Friendly Clicky Admin
I still can't track one of my sites...
Posted Mon Oct 21 2013 7:38p by manboo***
I am requesting a credit for mobigs.com due to this service problem.
My email is
[email protected]
Posted Mon Oct 21 2013 7:39p by mobig***
So on the databases you're saying are fixed, will it be a few more hours before all the data is restored? I'm still seeing some weirdness on 29.
Posted Mon Oct 21 2013 7:55p by merono***
Yes they will take a while to catch up with real time, they have a huge queue to process!
Posted Mon Oct 21 2013 8:06p by Your Friendly Clicky Admin
db28 will be done in about 10 minutes, and that's the last of 'em folks.
Posted Mon Oct 21 2013 10:56p by Your Friendly Clicky Admin
Great job, I can imagine the stress you guys experienced. Glad the stats are available again!
Posted Mon Oct 21 2013 11:27p by dekruy***
Rather you than me! Been there, done that! S..t happens; it is what you do to sort it that counts, and you guys are always on top of that - never lost any data of mine in the years I have used you. Great service, and the whingers on here should just step back and look at the alternatives!
Posted Mon Oct 21 2013 11:40p by jerry99***
Sure shi_ happens, but since this is a premium service some credits or a free "site credit" would be nice.
I hope you guys find a way to keep this kind of problem from happening again. Lost power locally right on the rack?
Posted Mon Oct 21 2013 11:50p by Leder***
It's amazing how much I rely on Clicky to analyse sales - glad you have got it all sorted now.
Posted Tue Oct 22 2013 12:33a by aaclaph***
I wish you all good luck, but I still cannot track 3 of my sites.
oranzina
Posted Tue Oct 22 2013 12:42a by oranzi***
One out of three sites is not getting stats. And of course, that is the main one. Was hoping to set a new stats record because of Apple's announcements tonight... (an iPad website)
Will keep fingers crossed -> good luck!
Posted Tue Oct 22 2013 1:03a by Sjelt***
Are you guys going to consider imaging snapshot software?
Posted Tue Oct 22 2013 2:08a by stook***
Hey guys,
Looks like there is still an issue with my site on db18 as of 6am EST. The traffic for yesterday appears to have caught up but the traffic for today shows all zeros which is definitely not the case.
Thanks,
Tom
Posted Tue Oct 22 2013 3a by brainya***
Shi* happens, so don't worry about refunds, go about doing your job, you are doing great.
Posted Tue Oct 22 2013 3:31a by cob***
Sometimes it happens at the worst possible time. If you had handled it poorly, maybe I would ask for a month's refund.
But you are already doing a great job handling the issue.
Posted Tue Oct 22 2013 4:13a by echothe***
Great to know that everything is supposed to work fine again. Thanks for your hard work!
I know that you couldn't anticipate this accident; the timing was just horrible for me, as I was about to send out the first batch of invites to a beta test. Unfortunately, my site was also on one of the most affected servers (db29) and visitor numbers still seem to fluctuate a bit (+/-). I appreciate you guys' great work and I really like using Clicky. So, I won't request a refund, but would appreciate some credits for the next billing cycle as well.
Posted Tue Oct 22 2013 4:33a by resum***
Everything looks like it's pretty much getting caught up - but I just noticed that db50 might not be working. Site 100619848 is showing 0's for the past 60 days, which is impossible. Not trying to pile on - just not sure if this is your issue or ours.
Posted Tue Oct 22 2013 5:04a by mbsports***
On 29 it looks like today's data is up to date, but yesterday's is still showing a huge traffic loss. I assume it's still processing the data from yesterday.
Posted Tue Oct 22 2013 7:06a by merono***
@mbsportsweb, I don't see our tracking code on that domain :)
Posted Tue Oct 22 2013 10:56a by Your Friendly Clicky Admin
@meronoid, some traffic from yesterday is missing on db6 and db29 unfortunately.
Posted Tue Oct 22 2013 10:56a by Your Friendly Clicky Admin
How much, Sean? I'm seeing a huge dip for both Sunday and Monday. Is none of that going to come back?
Posted Tue Oct 22 2013 11:30a by merono***
Also, my site on db22 is still being weird... says I've had X visitors, 9% more than last week at this time, but the visual graph for today is flat.
Posted Tue Oct 22 2013 11:35a by merono***
Ah, it does look like db22 hourly table is messed up. I'll need to take it offline for repairs. Should be quick though, hourly table is small.
Posted Tue Oct 22 2013 1p by Your Friendly Clicky Admin
I'm seeing a big drop in traffic on both Sunday and Monday for my site on db29. Is that data lost forever?
Posted Tue Oct 22 2013 1:06p by merono***
Dang, looks like the hourly table was corrupted to the point where no data was getting logged. It's fixed now so new data should get logged, and older data is still there. But basically yesterday and first half of today (my time anyways) no hourly data is there.
db6 and db29 will appear to have missing data on the dashboard for those two days, yes. However, the actual visitors themselves should still have been logged if you look in there.
Posted Tue Oct 22 2013 1:14p by Your Friendly Clicky Admin
db22 is still messed up the same way for me. I tried emptying my cache, but it didn't help.
Posted Tue Oct 22 2013 1:33p by merono***