Thursday, July 25, 2013

The most common mistake when load balancing with HAProxy

So, a while back, I was put in charge of a medium-sized site running Drupal (probably best known for its performance -- or lack thereof). At some point we had to scale the website up, because it was getting more and more attention, so we went with HAProxy, because we'd heard good things about it. (We tried Amazon ELB first, since we were already in the Amazon network, but that didn't work out for us.)

So we installed HAProxy, pulled together some configurations we found on the internet, and got something working. This was about 1.5 years ago, and at that time the website was getting about 900k unique visitors.

At first, the website handled the traffic fairly OK with just 2 servers under the LB. Then we had to add another server, because we'd developed some new features with a major impact on the site's performance (we weren't allowed to switch away from Drupal, so we couldn't optimize the code too much).

Then the real fun began: the website's owners decided to send out weekly newsletters. Peak traffic during those bursts is around 1.3k simultaneous users, so naturally we had to add more servers during newsletter hours, coming to a total of 6, sometimes 7 web servers under the LB.

To fast-forward: right now we're getting 2M unique visitors per month, and we're supporting all that traffic with 3 servers, plus 1 extra server during newsletter hour. So how did we do it? (And here begins the tech part.)

Well, we made 3 changes:
1. We separated static content from heavy PHP requests. It was fairly easy -- just a simple ACL matching filenames ending in css, js, png and whatnot. All that static traffic always goes to one dedicated server (so it's not load balanced per se). We did this because we noticed the two don't get along very well: slow PHP requests were tying up connections that could have served static files instantly.
2. The second, and I think most important, thing was to limit the number of simultaneous requests each server could handle. We set it to 30 in our case (I know that's low, but hey, it's Drupal).
The way this works is that if there are more than 30 requests on a certain server, the extra requests go to the other servers if available (I know, that's a given in load balancing), but the "kinky" part is that if all servers are full (i.e. they all have 30 requests in flight), HAProxy keeps a queue, and the extra requests wait there until a server is free enough to handle them. If no server frees up within X seconds (configurable), the user gets a "503 Service Unavailable" error -- that is indeed the code HAProxy returns when the queue timeout expires. So that's the only downside of this.
3. Before, we were using the roundrobin LB algo with stickiness (i.e. if you hit server A on your first visit, you keep hitting that server until your cookie expires). The problem with that is that some servers were under really heavy load while others sat nearly idle. So we got rid of the stickiness and switched to the leastconn algo. I'm not sure how much that helped on its own, but at least now all servers handle an equal number of requests (this only works well combined with #1, since around 90% of the requests are static and would otherwise skew the connection counts).
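To make the three changes above concrete, here's a minimal sketch of what such an HAProxy config might look like. This is not our actual config -- the backend names, IPs, and the 30s queue timeout are made up for illustration -- but the directives (`path_end` ACLs, `use_backend`, `balance leastconn`, per-server `maxconn`, `timeout queue`) are the standard HAProxy ones for each technique:

```haproxy
frontend www
    bind *:80
    # Change #1: route static assets to a dedicated server via a simple ACL
    acl is_static path_end .css .js .png .jpg .gif .ico
    use_backend static_srv if is_static
    default_backend drupal

backend static_srv
    # One dedicated box for static files, not load balanced
    server static1 10.0.0.10:80

backend drupal
    # Change #3: leastconn instead of roundrobin, and no cookie stickiness
    balance leastconn
    # Change #2: requests beyond maxconn wait in HAProxy's queue;
    # after this long with no free server, the client gets a 503
    timeout queue 30s
    server web1 10.0.0.11:80 maxconn 30
    server web2 10.0.0.12:80 maxconn 30
    server web3 10.0.0.13:80 maxconn 30
```

The per-server `maxconn 30` is what creates the queueing behavior described in #2: HAProxy never sends a server more than 30 concurrent requests, and holds any overflow itself.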

I can't provide the name of the site, nor share our full config, but I can tell you that it's all very well documented, and you'll find it on Google. Also, I don't take credit for all of this; most of it was our sysadmin's idea.

As a side-note, I'm not sure why I'm writing this article, because nobody will read it anyway, but if you're that 1-in-1M person who does, I hope it helped in some way.
