What happened

At approximately 6:30AM PST, our load balancers went offline. We have server monitoring tools that check our public IPs from a remote location every minute, so we were aware of the issue immediately. But it took a bit to figure out what was wrong and get them back online. Total downtime was about 90 minutes.

At first we thought it was a network problem, as happens from time to time, but quickly noticed that a few public IPs we have that are not behind load balancers were fine. Ok, so it's the load balancers then? We have a double load balancer setup with automatic failover, so if one dies, the other takes over. We've tested this and it works great. But the chances of both dying at the same time? Impossible. But we couldn't access their public IPs. That was mildly depressing.

But then we remembered they also have internal "management" IPs for admins to log in to. We tunneled into the network and were able to log in to both load balancers, but after looking around a bit, there was nothing obviously wrong. They were online, they could see the internet, they just weren't passing traffic through to the servers behind them. After a bit of banging head upon keyboard, we checked the load balancer logs and saw some messages we'd never seen before, repeating over and over every minute:

Apr 19 07:09:11 : Machine rebooting too often - Passifying

Passifying? Ok, well at least we see that the load balancers were rebooting themselves for some reason. So we did a manual reboot on one to see what would happen, because they weren't rebooting anymore - this message was simply being output every minute so that the admin looking at the log files could not miss it.

Upon reboot, they would pass traffic through for about 20-30 seconds, then go boom. We managed to check the load stats before they went offline, and saw they were at 100%. Well that's not good.

So the problem was that our load balancers were extremely overloaded. We had just added pinging to our tracking code, which means visitors to your web sites talk to our tracking servers a lot more often than they used to. We knew this would happen, so we activated our new tracking code last Tuesday for all sites, well before the actual release (Friday night), so we could see the effect it had on our tracking servers. The load went up significantly, on both the load balancers and the tracking servers, but not enough to be concerned about. It was still a very manageable level.

But at that point, our tracking servers were not actually logging the pings. They were receiving them, but they were discarded. That makes a difference. We made our full release on Friday night, so it had been running over the weekend, and we were watching it very closely. Things seemed great. Of course, the weekend's traffic is quite a bit lower than during the week, particularly Monday. Monday is the biggest traffic day of the week for most sites, so it's the biggest day of the week for our service too.

So Monday morning rolls around, and once most of the US is awake, the spike went up quite a bit higher than it did over the weekend. What happened was the load balancers were getting timeouts when talking to the tracking servers, which means they were keeping connections open for a lot longer than they normally would. This type of thing quickly spirals out of control, hence, the problem this morning.
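Why timeouts spiral so quickly can be seen with Little's law: the average number of open connections is the arrival rate multiplied by how long each connection is held. The sketch below is only an illustration of that relationship. The request rate, response time, and timeout are made-up assumptions, not figures from our logs.

```python
def open_connections(requests_per_second: float, seconds_held: float) -> float:
    """Little's law: average concurrent connections = arrival rate * hold time."""
    return requests_per_second * seconds_held


# Hypothetical numbers, chosen only for illustration.
RATE = 2000           # requests per second arriving at the load balancer
FAST_RESPONSE = 0.05  # seconds a connection is held when the backend keeps up
TIMEOUT = 5.0         # seconds a connection is held when the backend times out

healthy = open_connections(RATE, FAST_RESPONSE)
timing_out = open_connections(RATE, TIMEOUT)

print(f"healthy: {healthy:.0f} open connections")
print(f"timing out: {timing_out:.0f} open connections")
```

With the same incoming traffic, holding each connection 100 times longer means roughly 100 times as many connections open at once, which is how a slow backend can exhaust a load balancer.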

When our database servers process traffic every minute, they keep a log of what they just did - how many actions were processed, how many pings, how long the process took, etc. This is extremely useful data. So we took a peek at this and saw that pings were accounting for over 80% of traffic being processed. This means that the pinging functionality increased the hits to our tracking servers by 400%, overnight. That's a little more than we were expecting. We were expecting maybe a 200% increase at the most.
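The step from "80% of traffic" to "400% increase" is simple arithmetic, sketched here under the assumption that non-ping traffic stayed the same as before. The function names are just for this example.

```python
from fractions import Fraction


def increase_from_share(ping_share: Fraction) -> Fraction:
    """Extra hits relative to the pre-ping baseline, given pings' share of all hits."""
    return ping_share / (1 - ping_share)


def share_from_increase(increase: Fraction) -> Fraction:
    """Inverse: pings' share of all hits, given the increase over the baseline."""
    return increase / (1 + increase)


observed = increase_from_share(Fraction(80, 100))  # pings were 80% of traffic
expected = share_from_increase(Fraction(2))        # a 200% increase

print(f"80% pings -> {float(observed) * 100:.0f}% increase")
print(f"200% increase -> pings are {float(expected) * 100:.0f}% of traffic")
```

If pings are 80% of hits, the other 20% is the old traffic, so pings are four times the old traffic: a 400% increase. A 200% increase would have meant pings were about two thirds of all hits.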

In an effort to get things back online as quickly as possible, we changed the tracking code so that the total pings it will send are half as many, and the initial 15 second quick-check ping doesn't happen. After uploading that to the CDN, we rebooted the load balancers again and they came online and have stayed online since, with a load of about 65%. When viewing the database logs now, the pings are more in line with our initial expectations - about a 200% increase over normal traffic.

We want to ping more often though, so we'll be tweaking all sorts of things over the course of the day and probably throughout the week, to get the most pings possible with the least amount of load on our servers. We'll post an update later to let you know how it's going.

This is another one of those unfortunate cases that aren't really testable until they're live. We knew there would be a big spike on our end, and we thought we had it under control, but we just didn't know quite how big it was going to be until the system was fully live and all of our sites were sending us Monday traffic. But for now, things are stable, and that's what's important.
21 comments |   Apr 19 2010 9:27am

Unexpected side effects of our recent changes to tracking

There have been two side effects of our recent changes to tracking that we didn't really anticipate. This doesn't mean that something is broken, things just work differently now. We wanted to post about them here so you can understand what's going on.

More visitors / fewer actions per visit

You may have seen your daily visitor count increase substantially, but the number of actions remains the same. The main types of sites that are seeing this are those that have user accounts (e.g. people log in to your site to do whatever it is they do on it).

What's happening here? First, people tend to share their accounts with other people. We are seeing this big time in our stats for getclicky.com, for example. We see multiple sessions from the same user account and the same IP address at around the same time, which previously would have been counted as the same session. Because we have cookies now, we are able to determine that these are in fact two unique users. We can tell because when we view these sessions, in almost all cases the computer details are different. One might have Windows XP and Firefox 3.0, the other Windows Vista and Internet Explorer. Since the old method was just using the IP address, they would have been clumped together into one visit. The new system separates them into two unique sessions, as they should be, since they are occurring on two different computers.

What about the other cases? Well, the same thing can also happen for one visitor using two different browsers on one computer to access our site. Since cookies are stored at the browser level, their cookie ID will be different in each browser used. This isn't really desirable here, but it's also quite rare. Not many people use multiple browsers at the same time. All other analytics services that use cookies will have the same problem. If you think about it though, it really is two different sessions, because the person is doing two separate things in the two browsers.
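The difference between the two counting methods can be sketched in a few lines. This is a simplified illustration, not our actual processing code: the `Hit` structure, the cookie IDs, and the IP addresses are all hypothetical. The fallback to the IP address when no cookie is available matches the behavior described in the release notes.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    ip: str
    cookie_id: Optional[str]  # None when the visitor has cookies disabled
    browser: str


def visitor_key_old(hit: Hit) -> str:
    """Old behavior: everyone behind one IP address counts as one visitor."""
    return hit.ip


def visitor_key_new(hit: Hit) -> str:
    """New behavior: prefer the cookie ID, fall back to the IP address."""
    return hit.cookie_id if hit.cookie_id is not None else hit.ip


# Hypothetical hits: two people sharing one IP, plus one visitor without cookies.
hits = [
    Hit("203.0.113.7", "cookie-a", "Firefox 3.0 / Windows XP"),
    Hit("203.0.113.7", "cookie-b", "Internet Explorer / Windows Vista"),
    Hit("203.0.113.7", "cookie-a", "Firefox 3.0 / Windows XP"),
    Hit("198.51.100.4", None, "cookies disabled"),
]

old_count = len({visitor_key_old(h) for h in hits})
new_count = len({visitor_key_new(h) for h in hits})
print(old_count, new_count)
```

The same four hits give two visitors when keyed by IP and three when keyed by cookie, while the number of actions is unchanged. That is the "more visitors, same actions" pattern described above.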

"Visitors online" in Spy is lower

If you have a moderate traffic site (at least ~5,000 daily visitors), you may have noticed that when you load up Spy, the "visitors online" value is quite a bit lower than the one you see reported on the dashboard.

This is happening because we only store the last X number of actions in the Spy "cache" for each site. Although we don't count pings as "actions" in your stats, the way that they are processed in the backend is the same. Since pings are now stored in that cache, when you first load Spy, the cache may not actually have all of the data for your visitors that are online at that moment. So you will see a lower number initially, but once Spy has been running for a few minutes, it should have a much more accurate number. We want to fix this and will probably increase the size of the cache, but we need to be careful about doing that.
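Here is a toy model of the effect, assuming the cache behaves like a fixed-size buffer of the most recent events. The cache size of 6 and the visitor names are invented for the example, and the real cache is not necessarily implemented this way.

```python
from collections import deque

CACHE_SIZE = 6  # hypothetical; stands in for the "last X actions" limit


def visitors_in_cache(events, cache_size=CACHE_SIZE):
    """Distinct visitors among the most recent `cache_size` events."""
    cache = deque(events, maxlen=cache_size)  # older events fall off the front
    return {visitor for visitor, _kind in cache}


# Four visitors are online. Without pings, each contributes one action.
actions_only = [("v1", "action"), ("v2", "action"), ("v3", "action"), ("v4", "action")]

# With pings in the same cache, pinging visitors push older entries out.
with_pings = actions_only + [("v3", "ping"), ("v4", "ping")] * 2

print(len(visitors_in_cache(actions_only)))
print(len(visitors_in_cache(with_pings)))
```

In this model all four visitors are still online, but after a few pings only two of them remain in the cache, so the initial "visitors online" count comes out low.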

You also may be seeing lower numbers because you don't actually have as many visitors online at one time. As we explained in our last post, Spy now uses the pings to determine when visitors are truly online or not, rather than a generic 5 minute timeout that we were doing before. Now Spy will know within 1 minute of a visitor leaving your site that they have actually left (because the pings have stopped), instead of waiting 5 minutes to remove them. The end result here is that we remove visitors a lot quicker from Spy than we used to, so the value we report here might be lower than it used to be. But it's more accurate, and that's a good thing.
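The effect of the shorter timeout can be sketched as follows. The timestamps and visitor names are made up, and the function is only an illustration of the rule, not Spy's actual code.

```python
def visitors_online(last_seen, now, timeout_seconds):
    """Visitors whose most recent hit or ping is within the timeout window."""
    return {v for v, seen in last_seen.items() if now - seen <= timeout_seconds}


# Seconds since an arbitrary start; all values are made up for illustration.
now = 1000
last_seen = {
    "still-pinging": 970,   # pinged 30 seconds ago
    "left-2-min-ago": 880,  # last heard from 120 seconds ago
    "left-4-min-ago": 760,  # last heard from 240 seconds ago
}

old_rule = visitors_online(last_seen, now, timeout_seconds=5 * 60)
new_rule = visitors_online(last_seen, now, timeout_seconds=60)
print(len(old_rule), len(new_rule))
```

Under the old 5 minute rule all three visitors would still be shown as online. Under the 1 minute rule only the one who is still pinging remains.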

Hope that helps.
17 comments |   Apr 18 2010 4:07pm

Here's what's new!

The new Clicky is live! In the form of a novel, here's what's new:
  • Cookies

    We now use cookies to more accurately track unique visitors. As mentioned previously, you can disable cookies if you don't want them on your site, with clicky_custom.no_cookies. If a visitor does not have cookies enabled or you have disabled them on your site, then we fall back on the visitor's IP address, which is how we were doing this previously.

  • Pinging

    When a visitor remains on a single page of your site, our tracking code will ping our tracking servers in the background so we know they are still online. This will also let us give you much more accurate "time online" values, both per visitor and "average".

    By default, a visitor will ping us for 20 minutes while on one page. Depending on the type of site you are running, you may wish for a longer time than that. Don't worry, you can customize it! Using clicky_custom.timeout, you can extend the pinging time up to 4 hours. This is perfect for sites that are focused on things like videos or games.

  • Spy + Ping

    Spy uses the new pinging functionality to more accurately display who is actually on your site right now. Previously, we were just using a 5 minute timeout period of no activity to determine that a visitor was gone. Pinging allows us to actually know who is on your site, so Spy will be much more accurate. Pinging starts at every 15 seconds but quickly decays to once per minute, so there's still a timeout of 1 minute on Spy before a visitor will disappear. Pinging will also let us display visitors who are just sitting there on one page. Previously they would disappear after 5 minutes but now they will stay there as long as they keep pinging us.

    The result of this is a lot more spikes in the visitors online "graph", rather than a fairly smooth one as you may be used to. Here's an example of what we mean:

  • New vs returning visitors

    We now track new vs returning visitors. You will see in "the basics" dashboard module, there's a new "expand" link. Clicking this will show your unique visitors and your new visitors for the date or date range you are viewing. Because we are only tracking new visitors from this point forward, this metric won't be terribly accurate for the first week or two until most of your regulars have visited your site again.

    You can also filter by new or returning visitors. The option is in the filter drop-down box on the visitors page.

  • Bounce rates

    Our bounce rate calculations also take pinging into account now. As far as we know, all other analytics services define a bounce as a visitor who only views one page. But if they actually stay on your site for a little bit, they are probably more engaged with your site. The first ping occurs after 15 seconds, so that's our new threshold for determining what a bounce is. Anyone who is on your site less than 15 seconds is a bounce, anyone on longer is NOT a bounce - regardless of how many actions they had. We think this is a much better way to calculate this metric.

  • Self hosted tracking code

    If you are self-hosting the tracking code, you will need to grab a fresh copy to get cookie and ping support. We'd also like to make clear that we no longer support this option. Our tracking code is now on a CDN and we also offer asynchronous tracking code, so there's no need for it anymore. Making this update backwards compatible for people doing the self-hosted thing was a bit of a pain. From this point on, we cannot guarantee backwards compatibility for those of you self-hosting, so you are doing so at your own risk!

  • More efficient

    The way we process and store visitors has dramatically changed, and the result is about an 80% increase in efficiency - which is a very big deal. This means we'll have to do database maintenance a lot less often (e.g. what we do every few months to a couple of servers, such as last week with db7/8/9). Smaller storage also means much faster filtering. You should notice a dramatic increase in the speed of filtering (segmentation) of your visitors for most data types.

    This also results in much faster processing of traffic. We think it's going to allow us to track higher traffic sites. The highest level of traffic we allow for any site is 500,000 page views per day. We think our new system will be able to double that capacity - but we're not sure yet. We're going to be monitoring things and will certainly let people know if we increase this limit.

    Please note, converting your existing visitors to this new format will take up to 48 hours on some servers. So rather than keep these servers offline during the conversion, the conversion will be occurring over the weekend while the servers are live. It should be finished by Sunday though. The point is, when viewing your visitors list, some visitors may have incomplete data (e.g. just an IP address). Don't panic - their data *will* appear! We're doing this from newest to oldest visitors, so you probably won't even notice it unless you go back in your history a bit.
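To make the new bounce definition from the list above concrete, here is a minimal sketch comparing it with the usual one-page-view rule. The function names and the sample visits are hypothetical. Treating a visit of exactly 15 seconds as not a bounce is an assumption, since the description only covers "less than" and "longer".

```python
BOUNCE_THRESHOLD_SECONDS = 15  # the first ping occurs after 15 seconds


def is_bounce_old(actions: int) -> bool:
    """Common definition: a visit with only one page view is a bounce."""
    return actions <= 1


def is_bounce_new(seconds_on_site: float) -> bool:
    """Ping-based definition: under 15 seconds is a bounce, regardless of actions."""
    # Assumption: exactly 15 seconds counts as NOT a bounce.
    return seconds_on_site < BOUNCE_THRESHOLD_SECONDS


# (actions, seconds on site) for three hypothetical visits
visits = [(1, 4), (1, 180), (3, 10)]
for actions, seconds in visits:
    print(actions, seconds, is_bounce_old(actions), is_bounce_new(seconds))
```

The second visit (one page, three minutes) is a bounce under the old rule but not the new one. The third (three quick actions in ten seconds) is the reverse.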

There may be a few bugs lying around. We'll be monitoring and fixing anything unusual over the weekend. If you see anything strange, don't hesitate to leave a comment here or send us an email. Hope you enjoy!
20 comments |   Apr 16 2010 10:08pm

Reminder: new release tonight

The newest version of Clicky is being released tonight. There are some pretty big database changes so we'll be taking all database servers offline for approximately two hours while the changes are made. During this time, stats will be unavailable, but incoming traffic data will of course still be logged, and will be processed once the servers come back online.

This process will start tonight (Friday) around 8 or 9 PM PST (GMT -8). We'll be posting a full run down of all the new features once it's been released.

Follow us on Twitter for real time updates during the process.
4 comments |   Apr 16 2010 2:01pm

Database maintenance this weekend, new release next weekend

We need to perform maintenance on a few database servers this weekend. This will begin on Friday night around 8-10 PM (USA PST), and will last 8-10 hours. Affected servers are db7, 8, and 9. The database server that any site is hosted on can be found on that site's main preference page.

These are all fairly old servers, although they were opened up for some new sites for a couple of weeks in the beginning of 2010. Usually we only have to do this kind of maintenance on older servers so it doesn't affect too many people (since typically speaking, the longer you've been a member, the less often you log on). Unfortunately, this is not the case for everyone this time, since these servers have some sites registered as recently as 3 months ago.

When we do this type of maintenance we usually take the affected servers fully offline. However, due to complaints we are going to try a new method this time where they stay online so you can still view your stats, but no new data will be processed while the maintenance is happening. This will slow down the maintenance to some degree, but we think it will still be acceptably fast, especially since Friday night / Saturday morning has the lowest traffic levels of the week. We will be monitoring the performance and if we decide it's slowing it down too much, we will take them offline so that they can finish as quickly as possible, but we hope to not have to do that.

Our new release, which will include pinging and cookie support in the tracking code, much more accurate unique visitor and time online values, as well as tracking new vs returning visitors, is coming along really well. We were hoping to get it out the door this weekend, but combined with the database maintenance, it is just too much. So we are delaying it until next weekend. There are a couple of major database changes in this new release, so every database server will have to be taken fully offline for approximately 2 hours while the changes are made. We plan to do this on a Friday night as well, so the least number of people are affected.

During maintenance periods, your traffic data is still logged, but it is not processed until the database comes back online. So don't worry, you won't lose any stats! But they will be lagging behind real time when they come back online for at least a couple of hours.

We are also always tweeting live updates during maintenance, so be sure to follow us on Twitter for up-to-the-minute updates.
4 comments |   Apr 08 2010 8:28pm

Copyright © 2019, Roxr Software Ltd