


db46/49/52 backups being restored

Today (Feb 8) we had a complete RAID failure on the hardware that hosts database servers 46, 49, and 52. The RAID was in the process of rebuilding after replacing a bad drive, and another drive failed during the rebuild, causing a complete failure.

We are currently restoring backups of these 3 database servers onto new hardware. Unfortunately the process isn't fast: it will take anywhere from 12 to 24 hours, potentially longer if there are any unexpected issues.

We will post updates in this thread as we have relevant information to share.

Sorry for the inconvenience. Monday is our busiest day so we know it's the worst possible timing for it to happen. But we're in the process of fixing it now. Thanks for your patience.

Posted Mon Feb 8 2016 3:10p by Your Friendly Clicky Admin


Our uptime monitoring service and database were also on this server, so we'll be restoring those later as well. For now though, the main stats databases are the top priority.

Posted Mon Feb 8 2016 4:04p by Your Friendly Clicky Admin


Now that the restore has been going for a while, and we know how much of the total data has been restored, we can provide some ETAs. Unfortunately they're a bit longer than anyone would prefer.

db46 - 7am PST Wednesday
db49 - 8pm PST Tuesday
db52 - 5pm PST Tuesday
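
One simple way to produce ETAs like these is linear extrapolation from the rows restored so far. The thread does not say how the estimates above were actually computed, so the following is only a sketch under that assumption; the function name and the sample figures are hypothetical.

```python
from datetime import datetime, timedelta

def estimate_completion(started, now, rows_done, rows_total):
    """Extrapolate a finish time, assuming a constant restore rate."""
    if rows_done <= 0:
        raise ValueError("need some progress before estimating")
    elapsed = now - started
    # Time still needed = elapsed time scaled by remaining/done rows.
    remaining = elapsed * ((rows_total - rows_done) / rows_done)
    return now + remaining

# Hypothetical figures: 1,000 of 4,000 rows restored after 2 hours.
started = datetime(2016, 2, 8, 15, 0)
now = started + timedelta(hours=2)
print(estimate_completion(started, now, 1_000, 4_000))
```

A constant rate is a strong assumption: large tables and index builds can restore at very different speeds, so an estimate like this needs to be recomputed as the restore progresses.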

Our databases are gigantic, which is why they will take a while. For example, db46 has almost 5.5 billion (with a B) rows of data.

46 and 49 were two of our very largest databases, so that means they will take almost the longest of any database server in our network to restore. This was just an unlucky server to crash.

These are only estimates of when the restores will be completed, assuming there are no issues. Once the restore is completed for each one, we will still need to do testing to make sure all is well, so it will likely be at least another hour.

When they're back online, they will be going through their backlog which will be back through 2pm PST today (Monday). So that will take a while to catch back up with real time.

Posted Mon Feb 8 2016 8:50p by Your Friendly Clicky Admin


Thanks Clicky!

Posted Tue Feb 9 2016 2:21a by MikeLosAng***


Things are chugging along without any issue. Our ETAs still seem ballpark accurate.

Posted Tue Feb 9 2016 9:48a by Your Friendly Clicky Admin


db52 is getting close, should be done in about an hour.

db49 is looking like closer to midnight at this point.

db46 is going the slowest in terms of rows/minute, not sure why but unfortunately it's also the biggest, so at this point I'm estimating noon tomorrow.

Posted Tue Feb 9 2016 5:27p by Your Friendly Clicky Admin


Oh man.

The "auto_increment" flag for our primary indexes wasn't included in the schema dumps. Not sure why as that's just *slightly* important. Of course you can't just "turn it on" and you're done - mysql rebuilds the entire index when you change anything to do with an index. That's going to be many more hours... way faster than the overall restore process has been, but still, what a painful realization after this much downtime!

We've never had to actually restore a database from backup before. We've tested it many times, we have a cron job that does a restore of a random backup once a week, always been flawless - but since we never tried to process traffic data after the data was restored, this issue went unnoticed. Damn.
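
A check along these lines could be added to a weekly restore test: compare the live schema with the dumped schema and flag any table whose column-level AUTO_INCREMENT attribute went missing. This is a minimal sketch, not the actual Clicky tooling. The function names and the `visitors` table are hypothetical, and the parsing is deliberately naive (it splits on semicolons, so it assumes no semicolons inside comments or default values).

```python
import re

TABLE_NAME = re.compile(
    r"CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?`?(\w+)`?", re.I)
# Column attribute only: skip the table option "AUTO_INCREMENT=12345".
COLUMN_AUTO_INC = re.compile(r"\bAUTO_INCREMENT\b(?!\s*=)", re.I)

def auto_increment_tables(schema_sql):
    """Return names of tables that declare an AUTO_INCREMENT column."""
    found = set()
    for statement in schema_sql.split(";"):
        match = TABLE_NAME.search(statement)
        if match and COLUMN_AUTO_INC.search(statement):
            found.add(match.group(1))
    return found

def missing_in_dump(live_sql, dump_sql):
    """Tables that have AUTO_INCREMENT live but not in the dump."""
    return sorted(auto_increment_tables(live_sql)
                  - auto_increment_tables(dump_sql))

# Hypothetical schemas for illustration.
live = ("CREATE TABLE `visitors` (\n"
        "  `id` bigint(20) NOT NULL AUTO_INCREMENT,\n"
        "  PRIMARY KEY (`id`)\n) AUTO_INCREMENT=1001;")
dump = ("CREATE TABLE `visitors` (\n"
        "  `id` bigint(20) NOT NULL,\n"
        "  PRIMARY KEY (`id`)\n) AUTO_INCREMENT=1001;")
print(missing_in_dump(live, dump))
```

A stronger test would also insert a row into the restored copy, since that is the step that exposes this kind of problem.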

Well, we're starting this process now on db52 since its restore is done. Once it's been going for a while, I'll provide an ETA.

Posted Tue Feb 9 2016 7:13p by Your Friendly Clicky Admin


WTF

Posted Tue Feb 9 2016 7:14p by redgs***


Looks like this is the bug that killed us:

http://bugs.mysql.com/bug.php?id=22941

Filed 10 years ago, and closed as "not a bug". Um, yeah, it's a bug... oh dear lord.

Currently still investigating options, we want to get this up as fast as possible.

Posted Tue Feb 9 2016 8:46p by Your Friendly Clicky Admin


The process is still going. At this point I can't give an accurate ETA but it's definitely going to be another 8 hours minimum.

Posted Tue Feb 9 2016 11:35p by Your Friendly Clicky Admin


Get well soon, Clicky :(

Posted Wed Feb 10 2016 12:32a by presentationt***


how do i find out what database my sites are tracked on? my stats are off today but i don't see a message telling me that my database is affected.

Posted Wed Feb 10 2016 4:15a by sonicar***


This is the type of outage engineers always assure management won't happen... Impossible! etc... bla bla...

all the best and may your service manager provide you guys with enough pizza and beverages!

Posted Wed Feb 10 2016 5:09a by Ghali***


I wish every partner / vendor we worked with were as transparent with issues as Clicky is. Thanks guys.

Posted Wed Feb 10 2016 5:26a by webes***


Any update at all? This is bad guys!

Posted Wed Feb 10 2016 8:05a by lesmo***


are we going to be able to look back at our traffic the last couple days to see accurate visitor numbers and engagements when you get this fixed?

Posted Wed Feb 10 2016 8:05a by blong***


Thanks for your transparency. Best of luck.

Posted Wed Feb 10 2016 8:09a by BigMe***


^^^ What @blong72 said - I appreciate that it's not a quick process, I'd just like to know if we'll have access to stats from the last few days once the issues are resolved?

Posted Wed Feb 10 2016 8:54a by scothot***


^^^ My guess: I hope so but highly doubt it which is why I'm frustrated...

Posted Wed Feb 10 2016 10:25a by redgs***


db49 and db52 are on their last large database table. 52 is almost done, 49 still has a few hours. But once these tables are done, I'm giving an ETA of 3 hours after that point.

db46 is still going to be quite a while. I don't have an ETA right now.

Posted Wed Feb 10 2016 10:38a by Your Friendly Clicky Admin


Here's the scoop on your data.

The server crashed Monday @ 1pm.

We back up database servers once a week on a rotating schedule.

db46's backup was 8pm Friday.

db49's backup was 8pm Saturday.

db52's backup was 8pm Sunday - but, this backup hadn't been moved to the main backup storage server yet (that happens the next afternoon before backups get rsynced to a remote location) so it was lost when the server's RAID died. So, we're using the backup from the week before, which means db52 will be missing 7.5 days of data.

db49 will be missing 1.5 days of data between its backup and the crash.
db46 will be missing 2.5 days of data between its backup and the crash.


All data logged since the crash however is still sitting in our queue, so it should be processed once they come back online. The downside to this is it will take a while to catch up with real time so you won't have "real time" data for probably at least another 12 hours after they've come back online (other than Spy, which is always real time).
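
The gaps quoted above can be checked from the backup and crash times. In the sketch below the calendar dates are inferred from the post timestamps (Monday was Feb 8 2016); the thread itself gives only weekdays and times, so treat the dates as an assumption.

```python
from datetime import datetime

# Dates inferred from the thread's post timestamps (an assumption).
CRASH = datetime(2016, 2, 8, 13, 0)           # Monday 1pm
BACKUPS = {
    "db46": datetime(2016, 2, 5, 20, 0),      # Friday 8pm
    "db49": datetime(2016, 2, 6, 20, 0),      # Saturday 8pm
    "db52": datetime(2016, 1, 31, 20, 0),     # Sunday 8pm, a week earlier
}

def gap_hours(backup, crash=CRASH):
    """Hours of data lost between a backup and the crash."""
    return (crash - backup).total_seconds() / 3600

for name, taken in BACKUPS.items():
    hours = gap_hours(taken)
    print(f"{name}: {hours:.0f} hours (~{hours / 24:.1f} days)")
```

Under these assumed dates the gaps come out to 65, 41, and 185 hours, slightly more than the rounded day counts given above.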

Posted Wed Feb 10 2016 10:49a by Your Friendly Clicky Admin


Thanks for the update, is there any way to find out which db our data is on?

Posted Wed Feb 10 2016 11:15a by scothot***


Sorry ignore that last comment - seen that it's provided on the dashboard - and unfortunately we're on 52 :(

Posted Wed Feb 10 2016 11:16a by scothot***


so DB 49 was Saturday back up. that means we will have all data since sat at 8pm correct? Sunday, Monday, and today will have data or am i reading this wrong?

Posted Wed Feb 10 2016 11:31a by blong***


It means there will be data missing between the time of the backup (8pm Saturday) and the time of the RAID crash (1pm Monday). About 1.5 days.

Posted Wed Feb 10 2016 12:04p by Your Friendly Clicky Admin


BTW, this "fix the auto increment" crap won't happen again because we've already fixed the backups so "auto increment" will be included in the table definitions. Sigh.

I'm also looking into ways to make the backups/restores faster for the future. For example, there are a couple of fairly large tables that could be excluded entirely, as they're more of "caches" than anything else, and can always be regenerated later.
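
If the backups are taken with mysqldump (the thread mentions schema dumps but does not name the tool, so this is an assumption), cache-like tables can be skipped with its `--ignore-table=db_name.tbl_name` option, given once per table. The sketch below only builds the command line and does not run it; the database and table names are hypothetical.

```python
def build_dump_command(database, skip_tables):
    """Build a mysqldump argument list that excludes the given tables."""
    cmd = ["mysqldump"]
    for table in skip_tables:
        # mysqldump expects the table qualified with its database name.
        cmd.append(f"--ignore-table={database}.{table}")
    cmd.append(database)
    return cmd

# Hypothetical names for illustration only.
print(build_dump_command("stats", ["search_cache", "report_cache"]))
```

Anything excluded this way has to be regenerated after a restore, so it only suits tables that really can be rebuilt from the remaining data.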

Posted Wed Feb 10 2016 12:06p by Your Friendly Clicky Admin


I'm going thru traffic withdrawals not being able to see it now for several days. Just sayin...

Posted Wed Feb 10 2016 2:08p by blong***


By the way, I wanted to make sure everyone affected knew that Spy still works during this time.

Posted Wed Feb 10 2016 3:27p by Your Friendly Clicky Admin


Right now I'm estimating 1-2 hours for db52, 2-3 hours for db49, and 8-10 hours for db46.

Posted Wed Feb 10 2016 5:49p by Your Friendly Clicky Admin


Raid 5?

Posted Wed Feb 10 2016 6:21p by cneum***


We use RAID 10.

Posted Wed Feb 10 2016 6:44p by Your Friendly Clicky Admin


I wanted to let everyone know how awesome you all are being. This has been going on for over 48 hours now, and starting on a Monday no less, which is just an awful situation. On top of that, there was another issue overnight affecting about half our database servers where about a third of traffic wasn't logged for 5-6 hours. Yet we haven't had a single person upset about any of this. I love you guys and I'm working my ass off here to get you your stats back as soon as possible. I'm sorry it's taken so long but my life has been nothing but this since it happened. And it's been a big learning experience, since we've never had to fully restore from backup before into a production setting. Next time, things will be much faster.

Posted Wed Feb 10 2016 6:48p by Your Friendly Clicky Admin


As I understand it, the traffic will be displayed after another 12-24 hours. But can you show the current traffic data on our account? It is difficult for us, as we haven't had any data since Monday.

Posted Wed Feb 10 2016 7:15p by space***


db52 is online and processing its backlog! Its backlog goes back to ~1pm Monday. The server we're on is very fast so it should catch up relatively quickly... considering it has almost 2.5 days of data to process, I'm guessing it can do it in less than 12 hours.

Posted Wed Feb 10 2016 7:40p by Your Friendly Clicky Admin


Actually based on the rate it's going, it will be caught up in 4 hours or better. All hail SSDs.
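
A rough way to reason about catch-up time: while the backlog is being processed, new traffic keeps arriving, so a server that processes data at `speedup` times the real-time rate only gains `speedup - 1` hours on the backlog per hour. The thread gives no throughput figures, so the constant-rate model and the numbers below are assumptions for illustration.

```python
def catch_up_hours(backlog_hours, speedup):
    """Hours until a queue is drained, given a constant processing rate.

    backlog_hours: hours of queued data when processing starts.
    speedup: processing rate as a multiple of the real-time arrival rate.
    """
    if speedup <= 1:
        raise ValueError("the queue never drains unless speedup > 1")
    return backlog_hours / (speedup - 1)

# Hypothetical: a 54 hour backlog processed at 28x real time.
print(catch_up_hours(54, 28))
```

In this model the catch-up time scales linearly with the backlog, so a backlog twice as large takes twice as long at the same processing rate.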

Posted Wed Feb 10 2016 7:45p by Your Friendly Clicky Admin


db52 is already almost caught up with real time, so that will be less than 2 hours total to process that huge backlog.

db49 is back online now, and just started processing its backlog - which is about twice as big as db52's, so it will probably be twice as long (4 hours) before it's caught up with real time.

db46 still has a ways... it will be sometime tomorrow morning.

Posted Wed Feb 10 2016 9:25p by Your Friendly Clicky Admin


*should be

Posted Wed Feb 10 2016 9:26p by Your Friendly Clicky Admin


db46 ETA is 10am PST.

Posted Thu Feb 11 2016 12:53a by Your Friendly Clicky Admin


db46 is online! 3 hours earlier than I expected. It has a 66 hour backlog to process. Then we're all done!

Posted Thu Feb 11 2016 7:29a by Your Friendly Clicky Admin


The uptime server is also back online as of last night. Bad news is that we didn't have a backup of that database, because the uptime service/server is its own separate thing. It was just a silly oversight. So you'll need to recreate your uptime checks. Sorry!

Obviously we're fixing this: starting today, the uptime database will have daily backups.

Posted Thu Feb 11 2016 8:26a by Your Friendly Clicky Admin


Well done Clicky!

Now get some sleep.

Posted Thu Feb 11 2016 12:45p by cneum***


So db52 - can someone confirm if we're going to get our data back?

Posted Thu Feb 11 2016 2:37p by jjfleck***


@jjfleckie, see this from a comment above:

----

The server crashed Monday @ 1pm.

We back up database servers once a week on a rotating schedule.

db46's backup was 8pm Friday.

db49's backup was 8pm Saturday.

db52's backup was 8pm Sunday - but, this backup hadn't been moved to the main backup storage server yet (that happens the next afternoon before backups get rsynced to a remote location) so it was lost when the server's RAID died. So, we're using the backup from the week before, which means db52 will be missing 7.5 days of data.

db49 will be missing 1.5 days of data between its backup and the crash.
db46 will be missing 2.5 days of data between its backup and the crash.

Posted Thu Feb 11 2016 3:11p by Your Friendly Clicky Admin


Thanks for the info - that's not good news, but I understand how it happened.

Posted Fri Feb 12 2016 8:26a by jjfleck***

