Infrastructure upgrades nearly complete

Like we said, nothing exciting for a while. We've been working behind the scenes massively improving our infrastructure and updating some problem servers for greater reliability, and some old servers so they're much, much faster. We're not quite done, but here's the story so far:

Tracking servers

In each of our tracking servers, we doubled the RAM and added much faster drives to store the incoming traffic data. Initially there were a few problems but they were resolved.

As an update to that story, the problems we mentioned were related to the file system we were using, Ext3. The upgrades we initially made did help with performance, but load on the servers was still much higher than we expected. After many hours of research, we discovered that this file system, which is the default for almost every Linux installation, isn't well suited to storing, updating, and deleting thousands of tiny files 24/7. It turns out the file system of our dreams is called ReiserFS. Article after article said the same thing: this file system is amazing for dealing with thousands of tiny files - use it if that's what you're doing. So we did.

We reformatted the drives that store our incoming traffic data to ReiserFS and the results were stunning. Load plummeted to levels we haven't seen for well over a year. So this was actually the biggest bottleneck of our existing setup, but that isn't to say our RAM and hard drive upgrades were fruitless. Before we discovered ReiserFS, the hardware upgrades still made a significant difference - just not as big as we thought they would, which is why we kept researching. Once we added ReiserFS into the equation, the results were what we were hoping for.
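For the curious, the reformat on each tracking server looked roughly like this. This is a sketch, not a runbook - the device and mount point names are hypothetical, it assumes reiserfsprogs is installed, and mkfs destroys everything on the partition, so the data has to be moved off first:

```shell
# Hypothetical names - adjust for your own layout
DEV=/dev/sdb1          # partition holding incoming traffic data
MNT=/data/tracking     # where it's mounted

# Stop anything writing to the partition, then unmount it
umount "$MNT"

# Reformat as ReiserFS (destroys existing data - back it up first!)
mkfs.reiserfs -q "$DEV"

# Remount it; /etc/fstab also needs its fs type changed to reiserfs
mount -t reiserfs "$DEV" "$MNT"
```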

We also made a couple of very major efficiency improvements to the code that logs incoming traffic. The tracking servers are currently in a state of bliss and thanking us kindly for helping them work more efficiently.

Software to Hardware RAID migration

In the last 6 or so servers we built, we were using Linux's built in software RAID to mirror a pair of drives. Software RAID has served me well in the past, but it doesn't seem to be quite as reliable for drives under extremely heavy read/write load. About once a month we had a RAID failure, which would almost always corrupt one or more of our biggest database tables on that server. So we'd have to take the server offline and repair the corrupted tables, which is a slow process to say the least.

A Redundant Array of Independent Disks is supposed to prevent this type of thing. A drive popping offline should be no problem - you either replace it or re-add it to the array and it rebuilds and nothing noticeable happens from the end user's perspective. But this wasn't the case with our Linux software RAID servers.
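For reference, the replace-and-rebuild cycle described above looks something like this with Linux's md tools. The array and device names are hypothetical and these commands are destructive, so treat this as a sketch of the idea rather than something to paste in:

```shell
# Check array health; a degraded mirror shows [U_] instead of [UU]
cat /proc/mdstat

# Mark the bad drive as failed and pull it from the array...
mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md0 --remove /dev/sdb1

# ...then add the replacement; the mirror rebuilds in the background,
# and in theory nothing noticeable happens from the end user's side
mdadm --manage /dev/md0 --add /dev/sdc1

# Watch the rebuild progress
watch cat /proc/mdstat
```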

The main reason we went with software RAID was for cost savings. Not that hardware RAID is that expensive, but it adds about 15% to the cost of each server we build. So, no more software RAID. All servers that had this setup have been migrated to hardware RAID. All of our older servers use hardware RAID and they've never had a single problem.

Upgrades to old servers

As I just mentioned, none of our older servers have ever had any problems. On the other hand, they're all a bit slow, as they're not using drives meant for high performance. The database servers affected most by this were 2, 3, 5, 6, and 7.

We've migrated 2, 3, and 5 to much faster drives. If any of your sites are on these servers, you should notice very significant speed improvements when viewing your stats. We haven't yet migrated 6 or 7. We currently only have 1 spare server ready to host the data from another one. db7 seems to be slightly slower than db6 so that is the one that will be getting the upgrade first, this coming weekend most likely.

Next week, I will be at our data center again building some new servers, hopefully for the last time for a while! At this point, db6 will be moved to new hardware. db12 will also be moving, as it's also on slower drives. db12 is much newer than these other ones so it has less data, which means the speed is still acceptable - but that's only for now. Over time its performance will slowly degrade as well, so we're just going to move it now.

Once that is completed... we'll be done!!!

Well that was fun!

Actually, not really. This is the type of work that is the opposite of fun. I've built so many new servers and installed Debian Linux so many times over the last month, it's probably some kind of world record. But that's ok - all of this needed to be done, Clicky is much better because of it, and we hope you have noticed the improvements.

Now, we can get back to working on the software, which is what we really live for. Look for some great new features soon!
8 comments |   Oct 21 2009 11:39am

Clicky crushes it!

Gary Vaynerchuk is one of our first customers. I don't know how he ever found out about Clicky so early in its life - he registered way back in Feb 2007, when we were absolute nobodies - but we've always been psyched to have him as a customer, because we're big fans of everything he does.

He's on tour right now for his new book Crush It!, and tonight the tour hit Portland, Oregon, where we are based. We stopped by to watch him speak and take questions from the audience for about 90 minutes. And of course, we grabbed a couple copies of the book. It was awesome to meet him and his signature on my book made my day. To have the absolute king of social media be so passionate about our product means a lot.

Thanks Gary! Good luck with your book, although we know you won't need it.

8 comments |   Oct 19 2009 11:17pm

Tracking server issues this morning

You probably noticed a bit of missing data from this morning. Let me explain what happened.

This was not a database issue, which seems to be the story of our life recently, but an issue with our tracking servers. This means every single site was affected. The issue is very technical and related to Linux itself, but I'll try to explain it as simply as I can.

As part of our infrastructure improvements we have been making, we upgraded our tracking servers with twice the RAM and much faster hard drives. These two things combined should help eliminate most of the lag you may sometimes notice on your site during peak times, which is about 8am to 2pm PST.

However, a serious human error was made on my part when I formatted these new drives. I haven't had to manually format anything other than a drive meant for a database for quite a while. For our database drives, we use what's called "largefile" inode structure, which is optimized for disks that have very large files. Some of our database servers have individual files that are over 40GB. Inodes store metadata about every individual file on a partition, including where exactly a file lives on the physical disk.

Unfortunately, without thinking about it, I optimized these new drives on our tracking servers the same way. It's habit at this point. The problem is that our tracking servers have hundreds of thousands of tiny text files on them that store all of the traffic coming in for all the sites we monitor. Each site has its own dedicated Spy file, and each database server has its own dedicated file as well, which is basically a Spy file times 8000. We also cache the javascript for each site separately, for complex reasons. Including pMetrics, we're tracking over 500,000 sites, so this translates into a ridiculous number of files stored on these drives.

I'm not an inode "expert" but I know what works well for different situations. Largefile creates one inode for every 1 megabyte of disk space, which translates into about 143,000 inodes on the 150GB Raptor disks we put in these servers. With so few inodes for so many files, the percentage of inodes being used reached 100% within about 48 hours. This is a very bad thing for a heavy read/write disk with hundreds of thousands of files. Load skyrocketed to over 400 on each server, which is absolutely ridiculous. The tracking servers slowed down considerably and were timing out.
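The arithmetic behind those inode counts is easy to check yourself (treating 150GB as decimal gigabytes, the way drive makers count):

```shell
DISK_BYTES=$((150 * 1000 * 1000 * 1000))   # 150GB Raptor, decimal bytes

# "largefile" allocates one inode per 1MB (1048576 bytes) of disk
echo "largefile: $((DISK_BYTES / 1048576)) inodes"   # about 143,000

# "news" allocates one inode per 4KB (4096 bytes)
echo "news:      $((DISK_BYTES / 4096)) inodes"      # over 36 million
```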

Normally I get pages within minutes of such an event. However, my stupid iPhone, which I'm about to throw out the window, was somehow stuck in "headphone" mode, which means the external speaker was disabled, which means it made no sound as these pages were continuously coming in. (Note - this is different from "silent" mode - it actually thought there was a headphone inserted, although there most certainly was not.) It wasn't until I woke up at my normal time that I noticed I had hundreds of new text messages telling me the servers were severely timing out.

Anyways. It took me a while to track down what specifically was causing the problem. But as soon as I found out, I knew exactly what I had done wrong. I took each tracking server offline individually and reformatted the drives that store these tiny files with the "news" inode type. This creates an inode every 4KB, which translates into over 36 million inodes for these disks, which is exactly what we want for this type of usage. (This is how our old drives were formatted, and they worked well except for the fact that the drives were quite slow. These servers were built when we were MUCH smaller.) When I brought each server back online, things returned to normal immediately.
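You can see the difference between the two inode layouts without touching a real disk, by formatting small scratch image files. This sketch assumes e2fsprogs is installed and that "largefile" and "news" are defined as usage types in /etc/mke2fs.conf, which is the case on a typical Debian system; no root is needed since we're formatting regular files, not block devices:

```shell
# Two identical 64MB scratch images
truncate -s 64M /tmp/largefile.img /tmp/news.img

# Format one with each usage type (-F forces formatting a plain file)
mke2fs -F -q -T largefile /tmp/largefile.img   # 1 inode per 1MB
mke2fs -F -q -T news      /tmp/news.img        # 1 inode per 4KB

# Compare total inode counts - news ends up with vastly more
dumpe2fs -h /tmp/largefile.img 2>/dev/null | grep 'Inode count'
dumpe2fs -h /tmp/news.img      2>/dev/null | grep 'Inode count'
```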

We have been planning to change the javascript tracking code so it's global for all sites but it's not as easy as flipping a switch. If we had been using a global tracking file instead, this problem would not have occurred so soon. But as we continue to grow fairly quickly, it would have eventually reared its ugly head. Now it's fixed, so it should never be a problem again.

Please accept our sincere apologies. We have been having an abnormal amount of problems recently, but the quality of our service is our absolute top priority. You are upset, but know that we are 100x as upset about something like this. As we upgrade the rest of our servers over the next few weeks, we are hopeful the service will return to the stability and quality you have been accustomed to for nearly 3 years now.
22 comments |   Oct 05 2009 1:38pm

Nothing exciting for a while

We're working on massively improving our infrastructure for the next month or so, which we hope will greatly improve the speed and reliability of our service. During this time, there will likely be few, if any, new and exciting features.

There are so many awesome ideas we have for Clicky but we've reached the point where our existing setup isn't quite cutting the mustard anymore. Nothing is more important to us than the quality of our service, so we're going to be focusing on that for a bit to ensure we can continue growing well into the future with as few problems as possible.

We'll be upgrading our tracking servers with a bunch more RAM and super fast hard drives, which will help to eliminate the lag that occurs sometimes during peak times (8am-2pm USA PST) when these servers are getting blasted with over 1000 hits per second. We'll also be adding more redundancy to our main database and web servers, and splitting off Spy onto its own dedicated server to speed up the web servers even more. You wouldn't believe how much load Spy adds to our entire system - if you knew how much, you would probably cry.

We'll also be doing some more work on our database servers, as I mentioned in our last post. I didn't quite finish everything I wanted to last time I was in our data center, so some db servers may go offline here and there. The downtime should never be more than an hour or two, however. We always tweet live updates during server maintenance, so be sure to follow us on Twitter for up to the minute updates.

So that's what we'll be doing the next month or so. It's a lot more work than it sounds like, but when all is said and done, I think everyone will be really happy.
9 comments |   Sep 23 2009 4:44pm

Server maintenance Thursday and Friday

I'm in our data center today and tomorrow, doing maintenance and replacing some hardware on a bunch of our database servers to help prevent issues like we had last week from happening again. We only take down servers during the week if absolutely necessary, and this is one of those cases. This is because we will be in San Francisco for most of next week, and then I will be taking another small trip unrelated to work.

The last thing we want is to have a problem while we're on the road, because it may take much longer than normal to resolve depending on what we're doing at the time of such an incident (not to mention it's much more of a PITA to do that type of thing while traveling).

Any given server may be down for as long as 3 hours. Data will not be lost during this time, however when a server does come back online, it will take it a while to catch back up with real time.

Our #1 priority is to make Clicky as reliable as possible for you. Thanks for your patience and understanding.
10 comments |   Sep 10 2009 12:28pm


Copyright © 2018, Roxr Software Ltd