WARNING: This is going to get technical, so hold on to your pocket protector. I’m not going to re-hash, so if you haven’t read the previous BI infrastructure post, you should check that out first.
Last time I discussed our technology, BI was pushing around 80 million monthly page views, using two Varnish cache servers, four Apache web servers, and three MongoDB database servers. Traffic has more than doubled since then, and BI regularly delivers 175+ million page views a month ... using the same two Varnish servers, four Apache servers, and three MongoDB servers.
Even though traffic has doubled, not much has changed in the core infrastructure.
Though our core hasn’t changed, the software running on it is undergoing plenty of changes. We’re refactoring our PHP code and moving from our legacy custom MVC framework to the Symfony 2 open source framework. We’ve also been steadily streamlining our editorial CMS and improving it with new features to make the editors' lives easier.
We migrated away from the Google Search Appliance to an open source Solr server, which has made our search results noticeably better, with filtering and sorting options we couldn't offer while we were still using the GSA.
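To give a flavor of what that buys us, here's a minimal sketch of the kind of query Solr makes easy. The host, core name, and field names are hypothetical stand-ins, not our actual schema:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical Solr host, core, and field names -- stand-ins for illustration.
SOLR_SELECT = "http://solr.internal:8983/solr/posts/select"

params = {
    "q": "title:startups",        # full-text match on a title field
    "fq": "vertical:tech",        # filter query: narrow to one vertical without affecting relevance scoring
    "sort": "published_at desc",  # newest first -- the kind of sorting we couldn't get from the GSA
    "rows": "10",
    "wt": "json",                 # ask Solr for a JSON response
}

with urllib.request.urlopen(SOLR_SELECT + "?" + urllib.parse.urlencode(params)) as resp:
    for doc in json.load(resp)["response"]["docs"]:
        print(doc.get("title"))
```

The q, fq, sort, rows, and wt parameters are standard Solr query syntax; everything else here is illustrative.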
We’ve begun setting up a Jenkins server to do continuous integration, and unit test coverage is improving as we continue integrating Symfony 2. We’ve also set up Nagios to do our own internal monitoring of the network and services, so we catch any hiccups right away, and we’re still using Catchpoint to keep an eye on site speed and availability, which has been invaluable for spotting problems quickly.
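If you haven't worked with Nagios, its service checks are just small programs whose exit code Nagios maps to a status (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). Here's a minimal sketch of a custom check in that style; the URL and threshold are hypothetical, not our actual configuration:

```python
#!/usr/bin/env python3
# Minimal sketch of a Nagios-style check: print one status line and exit
# with 0 (OK), 1 (WARNING), or 2 (CRITICAL). URL and threshold are made up.
import sys
import time
import urllib.request

URL = "http://www.businessinsider.com/"
WARN_SECONDS = 2.0

try:
    start = time.monotonic()
    with urllib.request.urlopen(URL, timeout=10):
        elapsed = time.monotonic() - start
except Exception as exc:
    print(f"CRITICAL - {URL} unreachable: {exc}")
    sys.exit(2)

if elapsed > WARN_SECONDS:
    print(f"WARNING - {URL} responded in {elapsed:.2f}s")
    sys.exit(1)

print(f"OK - {URL} responded in {elapsed:.2f}s")
sys.exit(0)
```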
Unfortunately, as well as this architecture has served us, it won’t withstand another doubling. A single Varnish server could probably handle our current traffic, but at peak it would struggle on its own; since we need to be able to withstand one server crashing, two servers leaves us uncomfortably close to a single point of failure.
However, there are a few catches involved in simply adding a third Varnish server. Right now each of our two front-end caching servers holds a full cache of every URL on the site, and requests are balanced between them randomly. Each time a post or vertical page is purged, the Apache and MongoDB backend needs to generate two fresh copies, one for each server. If we simply add a third server, our backend will need to generate a third copy, increasing the load on the back end by 50 percent. Needless to say, that’s the opposite of what we want.
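To make the purge fan-out concrete, here's a rough sketch of what a cache purge looks like when every node keeps a full cache. The hostnames are hypothetical, and it assumes the VCL is configured to accept HTTP PURGE requests (a common Varnish pattern, not something enabled by default):

```python
import urllib.request

# Hypothetical cache hosts; adding a "varnish3.internal" entry here would
# mean a third backend regeneration for every purged URL.
VARNISH_NODES = ["varnish1.internal", "varnish2.internal"]

def purge(path):
    """Send an HTTP PURGE for one path to every cache node.

    Each node that drops the URL re-fetches it from Apache/MongoDB on its
    next request, so backend render work scales with the node count.
    """
    for node in VARNISH_NODES:
        req = urllib.request.Request(
            "http://%s%s" % (node, path),
            method="PURGE",  # assumes a vcl_recv rule that handles PURGE
        )
        req.add_header("Host", "www.businessinsider.com")
        urllib.request.urlopen(req)

purge("/some-post-slug")
```

With two nodes, every purge costs two backend renders; with N full-cache nodes it costs N, which is exactly the scaling problem described above.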
So how will we get around this issue? I have a few possible ideas cooking, but you’ll have to wait for my follow-up article to find out how we ultimately decide to solve it.