Tracking server issues this morning

You probably noticed a bit of missing data from this morning. Let me explain what happened.

This was not a database issue, which seems to be the story of our life recently, but an issue with our tracking servers. This means every single site was affected. The issue is very technical and related to Linux itself, but I'll try to explain it as simply as I can.

As part of the infrastructure improvements we have been making, we upgraded our tracking servers with twice the RAM and much faster hard drives. Together, these two things should help eliminate most of the lag you may sometimes notice on your site during peak times, which are about 8am to 2pm PST.

However, a serious human error was made on my part when I formatted these new drives. I haven't had to manually format anything other than a drive meant for a database for quite a while. For our database drives, we use what's called the "largefile" inode structure, which is optimized for disks that hold very large files. Some of our database servers have individual files that are over 40GB. Inodes store metadata about every individual file on a partition, including exactly where the file's data lives on the physical disk.

Unfortunately, without thinking about it, I optimized these new drives on our tracking servers the same way. It's habit at this point. The problem is that our tracking servers have hundreds of thousands of tiny text files on them that store all of the traffic coming in for all the sites we monitor. Each site has its own dedicated Spy file, and each database server has its own dedicated file as well, which is basically a Spy file times 8000. We also cache the JavaScript for each site separately, for complex reasons. Including pMetrics and the version of Clicky for Webs.com, we're tracking over 500,000 sites, so this translates into a ridiculous number of files stored on these drives.

I'm not an inode "expert" but I know what works well for different situations. With largefile, the filesystem gets one inode for every 1 megabyte of disk space, which translates into about 143,000 inodes on the 150GB Raptor disks we put in here. Every file needs its own inode, no matter how small the file is, so with so few inodes for so many files, the percentage of inodes being used reached 100% within about 48 hours. Once that happens, no new files can be created, even though the disk itself still has plenty of free space. This is a very bad thing for a heavy read/write disk with hundreds of thousands of files. Load skyrocketed to over 400 on each server, which is absolutely ridiculous. The tracking servers slowed down considerably and were timing out.
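For anyone who wants to keep an eye on this on their own servers, here is a minimal sketch of how inode usage can be checked from Python using the standard os.statvfs call. This is not the monitoring we actually run; the function name inode_usage and the path "/" are just placeholders for illustration.

```python
import os


def inode_usage(path):
    """Return (used, total, percent_used) inodes for the filesystem holding path.

    Hypothetical helper for illustration. Relies on os.statvfs, which is
    available on Unix-like systems only.
    """
    st = os.statvfs(path)
    total = st.f_files   # total number of inodes on the filesystem
    free = st.f_ffree    # number of free inodes
    used = total - free
    # Some filesystems report 0 total inodes; avoid dividing by zero.
    percent = 100.0 * used / total if total else 0.0
    return used, total, percent


if __name__ == "__main__":
    # "/" is a placeholder; point this at the mount that holds the small files.
    used, total, percent = inode_usage("/")
    print(f"{used} of {total} inodes used ({percent:.1f}%)")
```

A check like this, run on a schedule with an alert somewhere well below 100%, would flag the problem long before the disk runs out of inodes.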

Normally I get pages within minutes of such an event. However, my stupid iPhone, which I'm about to throw out the window, was somehow stuck in "headphone" mode, which means the external speaker was disabled, which means it made no sound as these pages were continuously coming in. (Note - this is different from "silent" mode - it actually thought there was a headphone inserted, although there most certainly was not.) It wasn't until I woke up at my normal time that I noticed I had hundreds of new text messages saying that the servers were severely timing out.

Anyways. It took me a while to track down what specifically was causing the problem. But as soon as I found out, I knew exactly what I had done wrong. I took each tracking server offline individually and reformatted the drives that store these tiny files with the "news" inode type. This creates an inode for every 4KB of disk space, which translates into over 36 million inodes for these disks, which is exactly what we want for this type of usage. (This is how our old drives were formatted, and it worked well except for the fact that the drives were quite slow. These servers were built when we were MUCH smaller.) When I brought each server back online, things returned to normal immediately.
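If you want to see where those two inode counts come from, the arithmetic is just disk size divided by the bytes-per-inode ratio. Here is a rough sketch in Python. It assumes "150GB" means 150 billion bytes, and the helper name estimated_inodes is made up for this example. The real count a filesystem ends up with will differ a bit because of how it lays things out internally.

```python
# Assumption: a "150GB" drive is 150 * 10^9 bytes (decimal gigabytes).
DISK_BYTES = 150 * 10**9


def estimated_inodes(disk_bytes, bytes_per_inode):
    """Rough inode count: one inode per bytes_per_inode bytes of disk space."""
    return disk_bytes // bytes_per_inode


# "largefile": one inode per 1 megabyte
largefile_inodes = estimated_inodes(DISK_BYTES, 1024 * 1024)

# "news": one inode per 4KB
news_inodes = estimated_inodes(DISK_BYTES, 4 * 1024)

print(f"largefile: {largefile_inodes:,} inodes")  # largefile: 143,051 inodes
print(f"news:      {news_inodes:,} inodes")       # news:      36,621,093 inodes
print(f"ratio:     {news_inodes // largefile_inodes}x")  # ratio:     256x
```

With hundreds of thousands of files to store, about 143,000 inodes was never going to be enough, while 36 million leaves plenty of room to grow.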

We have been planning to change the JavaScript tracking code so it's global for all sites, but it's not as easy as flipping a switch. If we had been using a global tracking file instead, this problem would not have occurred so soon. But as we continue to grow fairly quickly, it would eventually have reared its ugly head. Now it's fixed, so it should never be a problem again.

Please accept our sincere apologies. We have been having an abnormal number of problems recently, but the quality of our service is our absolute top priority. You are upset, but know that we are 100x as upset about something like this. As we upgrade the rest of our servers over the next few weeks, we are hopeful the service will return to the stability and quality you have been accustomed to for nearly 3 years now.
Oct 05 2009 1:38pm

Copyright © 2017, Roxr Software Ltd