Tracking server issues this morning

You probably noticed a bit of missing data from this morning. Let me explain what happened.

This was not a database issue, which seems to have been the story of our lives recently, but an issue with our tracking servers. This means every single site was affected. The issue is very technical and related to Linux itself, but I'll try to explain it as simply as I can.

As part of the infrastructure improvements we have been making, we upgraded our tracking servers with twice the RAM and much faster hard drives. These two things combined should help eliminate most of the lag you may sometimes notice on your site during peak times, which are roughly 8am to 2pm PST.

However, I made a serious human error when I formatted these new drives. I haven't had to manually format anything other than a drive meant for a database in quite a while. For our database drives, we use what's called the "largefile" inode structure, which is optimized for disks that hold very large files; some of our database servers have individual files that are over 40GB. Inodes store metadata about every individual file on a partition, including exactly where that file lives on the physical disk.
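To make that concrete, here is a tiny Python sketch, purely illustrative and using a made-up path, that pulls a few pieces of inode metadata for a single file:

    import os

    # Every file on a partition consumes one inode, which holds its metadata.
    # The path below is just an example; any file on a Linux box will do.
    info = os.stat("/var/log/messages")
    print("inode number: ", info.st_ino)     # which inode this file occupies
    print("size in bytes:", info.st_size)    # file size, stored in the inode
    print("last modified:", info.st_mtime)   # timestamps also live in the inode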

Unfortunately, without thinking about it, I optimized these new drives on our tracking servers the same way. It's habit at this point. The problem is that our tracking servers have hundreds of thousands of tiny text files on them that store all of the traffic coming in for all the sites we monitor. Each site has its own dedicated Spy file, and each database server has its own dedicated file as well, which is basically a Spy file times 8000. We also cache the javascript for each site separately, for complex reasons. Including pMetrics and the version of Clicky for Webs.com, we're tracking over 500,000 sites, so this translates into a ridiculous number of files stored on these drives.

I'm not an inode "expert", but I know what works well for different situations. The largefile structure creates an inode for every 1 megabyte of disk space, which translates into about 143,000 inodes on the 150GB Raptor disks we put in here. With so few inodes for so many files, the percentage of inodes in use reached 100% within about 48 hours. This is a very bad thing for a heavy read/write disk with hundreds of thousands of files. Load skyrocketed to over 400 on each server, which is absolutely ridiculous. The tracking servers slowed down considerably and were timing out.
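If you want to check my math, here's the back-of-the-envelope version in Python (approximate numbers, not exact mkfs output):

    # A 150GB Raptor drive (decimal gigabytes, as drives are marketed)
    disk_bytes = 150 * 1000**3

    # "largefile" allocates roughly one inode per 1MB of disk space
    largefile_inodes = disk_bytes // (1024**2)
    print(largefile_inodes)  # about 143,000 inodes

    # Each tiny tracking/cache file needs its own inode, and we have far more
    # files than that, so inode usage ("df -i") hits 100% even though the
    # disk is nowhere near full in terms of bytes.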

Normally I get paged within minutes of such an event. However, my stupid iPhone, which I'm about to throw out the window, was somehow stuck in "headphone" mode, so the external speaker was disabled and it made no sound as these pages were continuously coming in. (Note - this is different from "silent" mode - it actually thought headphones were plugged in, although they most certainly were not.) It wasn't until I woke up at my normal time that I noticed I had hundreds of new text messages telling me the servers were severely timing out.

Anyways. It took me a while to track down what specifically was causing the problem, but as soon as I found it, I knew exactly what I had done wrong. I took each tracking server offline individually and reformatted the drives that store these tiny files with the "news" inode type. This creates an inode every 4KB, which translates into over 36 million inodes for these disks, which is exactly what we want for this type of usage. (This is how our old drives were formatted, and it worked well except for the fact that the drives were quite slow. These servers were built when we were MUCH smaller.) When I brought each server back online, things returned to normal immediately.
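And the same quick math for the "news" layout, again approximate:

    disk_bytes = 150 * 1000**3

    # "news" allocates roughly one inode per 4KB, suited to huge numbers of tiny files
    news_inodes = disk_bytes // (4 * 1024)
    print(news_inodes)  # about 36.6 million inodes, versus ~143,000 with "largefile"

    # The actual reformat happens offline with something along the lines of
    # "mke2fs -T news /dev/sdX" (the usage types are defined in /etc/mke2fs.conf).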

We have been planning to change the javascript tracking code so it's global for all sites, but it's not as easy as flipping a switch. If we had been using a global tracking file instead, this problem would not have occurred so soon, but as we continue to grow fairly quickly, it would have eventually reared its ugly head. Now it's fixed, so it should never be a problem again.

Please accept our sincere apologies. We have been having an abnormal number of problems recently, but the quality of our service is our absolute top priority. You are upset, but know that we are 100x as upset about something like this. As we upgrade the rest of our servers over the next few weeks, we are hopeful the service will return to the stability and quality you have been accustomed to for nearly 3 years now.
22 comments |   Oct 05 2009 1:38pm

Nothing exciting for a while

We're working on massively improving our infrastructure for the next month or so, which we hope will greatly improve the speed and reliability of our service. During this time, there will likely be few, if any, new and exciting features.

There are so many awesome ideas we have for Clicky but we've reached the point where our existing setup isn't quite cutting the mustard anymore. Nothing is more important to us than the quality of our service, so we're going to be focusing on that for a bit to ensure we can continue growing well into the future with as few problems as possible.

We'll be upgrading our tracking servers with a bunch more RAM and super fast hard drives, which will help to eliminate the lag that occurs sometimes during peak times (8am-2pm USA PST) when these servers are getting blasted with over 1000 hits per second. We'll also be adding more redundancy to our main database and web servers, and splitting off Spy onto its own dedicated server to speed up the web servers even more. You wouldn't believe how much load Spy adds to our entire system - if you knew how much, you would probably cry.

We'll also be doing some more work on our database servers, as I mentioned in our last post. I didn't quite finish everything I wanted to last time I was in our data center, so some db servers may go offline here and there. The downtime should never be more than an hour or two, however. We always tweet live updates during server maintenance, so be sure to follow us on Twitter for up-to-the-minute updates.

So that's what we'll be doing the next month or so. It's a lot more work than it sounds like, but when all is said and done, I think everyone will be really happy.
9 comments |   Sep 23 2009 4:44pm

Server maintenance Thursday and Friday

I'm in our data center today and tomorrow, doing maintenance and replacing some hardware on a bunch of our database servers to help prevent issues like the ones we had last week from happening again. We only take down servers during the week if absolutely necessary, and this is one of those cases. This is because we will be in San Francisco for most of next week, and then I will be taking another small trip unrelated to work.

The last thing we want is to have a problem while we're on the road, because it may take much longer than normal to resolve depending on what we're doing at the time of such an incident (not to mention it's much more of a PITA to do that type of thing while traveling).

Any given server may be down for as long as 3 hours. Data will not be lost during this time; however, when a server comes back online, it will take a while to catch back up with real time.

Our #1 priority is to make Clicky as reliable as possible for you. Thanks for your patience and understanding.
10 comments |   Sep 10 2009 12:28pm

iPhone updates (and we'll see you in San Fran)

September 8th marked 1 year since we launched our iPhone web application. We've been meaning to add some new features for a while, and what better time to do that than on its first birthday? So we started working on them yesterday and just launched them tonight. It's still not meant as a full replacement for the desktop version, but these new features really add to the experience and make it that much more enjoyable to check your analytics on the go:

Visitor segmentation

You can now analyze any segment of visitors like you can in the desktop version. Just click on any item when viewing popular data, and you will see data about that specific segment of visitors:

[screenshot: visitor segmentation]

Historical data / graphs

You can now view the historical data for any individual item. Just click the red/green percentage next to any item, and you will see its daily history. We're using the pretty amazing Google Charts API to generate these graphs, since Flash doesn't work on the iPhone.
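If you're curious how that works without Flash: the Chart API takes the whole graph as URL parameters and returns a plain image. Here's a rough sketch of the idea in Python - not our actual code, and the numbers are made up:

    # Build a Google Chart API URL for a simple line graph of daily visitors.
    daily_visitors = [120, 135, 128, 160, 180, 175, 210]

    url = (
        "http://chart.apis.google.com/chart"
        "?cht=lc"       # chart type: line chart
        "&chs=300x150"  # image size in pixels (width x height)
        "&chds=0,250"   # min,max scaling for the data values
        "&chd=t:" + ",".join(str(v) for v in daily_visitors)  # the data series
    )
    print(url)  # drop this into an <img> tag and Google returns the rendered graph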

[screenshot: daily history graph for an item]

You can view history in landscape mode as well, which widens the graph so you can see more detail:

[screenshot: history graph in landscape mode]

Organized dashboard

We've organized all of the menu options on the dashboard into groups, which makes it much quicker to find what you're looking for. As you can see here, we also added the new Short URLs data (from clicky.me) to the iPhone app:

[screenshot: dashboard menu organized into groups]

More dates

You can now select dates going back as far as 6 months, including individual months:

[screenshot: date selection going back 6 months]

Lastly, we wanted to let you know that the entire crew at Clicky (yeah, all 2 of us) will be at the TechCrunch 50 conference in San Francisco this coming Monday and Tuesday, so if you're going to be there, we'd love to meet up. Also, because of our traveling schedule (we'll actually be in San Fran for 5 days), please be patient if you send us an email. We'll of course still be checking email on a daily basis, but unless it's an emergency, it may take a few days before we respond. Thanks for understanding. We look forward to seeing you there :)
12 comments |   Sep 09 2009 11:30pm

Problems this morning

At approximately 4am PST, two separate database servers (db1 and db16) had RAID failures that caused file system corruption. They kept trying to process traffic but Linux had switched part of the file system to "read only", so no traffic data was actually being written to the hard drives. This problem lasted from approximately 4am to 7am PST. Unfortunately, this traffic data is gone and unrecoverable.
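For the sysadmins reading, this is the failure mode where writes start silently failing because the kernel has remounted the filesystem read-only. A minimal sketch of the kind of check that catches it (not our actual monitoring code):

    # Warn about any filesystem that is mounted read-only.
    # A real monitor would whitelist mounts that are supposed to be read-only
    # and would send a page or SMS instead of printing.
    def readonly_mounts(mounts_file="/proc/mounts"):
        hits = []
        with open(mounts_file) as f:
            for line in f:
                device, mountpoint, fstype, options = line.split()[:4]
                if "ro" in options.split(","):
                    hits.append((device, mountpoint, fstype))
        return hits

    for device, mountpoint, fstype in readonly_mounts():
        print("WARNING: %s (%s) is mounted read-only at %s" % (device, fstype, mountpoint))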

We have alert systems set up so that when a significant event occurs, such as a server going offline or a RAID failure, we are alerted immediately. Unfortunately, the RAID notifications on a few servers were recently disabled while we were performing some maintenance, and wouldn't you know it, db1 and db16 were among those servers. Because of this, we weren't notified of the problem, and didn't discover it until we woke up to a flood of emails in our inbox this morning.

There were no problems on other servers that we could find, but if you have a site on a server other than db1 or db16 and it's experiencing issues, please leave a comment here explaining what's happening. Be sure to include the site ID.

We apologize for this issue, which we take very seriously. The RAID notifications are all back online, and we will be sure to always re-enable them immediately after this kind of maintenance in the future. Leaving them disabled was just an honest mistake.

One final note: these RAID failures occurred at the exact same time on two different servers. This happened once before as well, although it was three servers instead of two, and it didn't cause any corruption last time. This seems like very strange behavior to us, and we're not sure what could possibly cause such a thing to happen to separate servers (that don't talk to each other) at the exact same time. If any sysadmins out there have any ideas, please share.
19 comments |   Sep 02 2009 8:44am
