We’ve been migrating a number of clients to our AWS architecture because they’ve outgrown their current hosts and require a more stable, reliable service that can grow with them. We’re accustomed to architecting different solutions depending on the requirements of the applications we’ve built: some, like our trading application, require extremely high uptime and very low latency under relatively low load whilst others require reliability under extremely spiky traffic conditions.
The most recent application we’ve migrated was somewhere in between: the website for global lighting manufacturer Havells Sylvania has reasonably steady and predictable traffic (100K unique visitors per month, with 80% of traffic between 9am and 5pm), but some specific overheads because a lot of product data is generated on the fly (PDFs, images, zip files and so on).
Havells Sylvania initially managed their own hosting on a single, dedicated, general-purpose box used for everything: the database, Elasticsearch, generating dynamic content, serving the content-managed site, and fetching data from a third-party API. Over the years, traffic and usage grew steadily and the application consumed more and more resources as feature development demanded. Recently, whenever the site came under heavy load, the whole thing would come crashing down, requiring manual intervention to get it back up again.
Havells Sylvania were no longer happy managing the hosting themselves, and after consultation we agreed to move them to Amazon Web Services (AWS), allowing us to load-balance the website, react better to traffic spikes and heavy server load, and separate the concerns of the different parts of the application.
Our tests were reasonably simple in nature: we only needed to prove the site could handle the existing load with a bit of added contingency. We had access to the site analytics and already knew a lot about user behaviour, so we were aiming to support 50 VUs (virtual users) for 15–20 minutes as a proof of concept, which is at the upper limit of current site activity. We also had to ensure that traffic would hit all the servers behind the load balancer, so that a single server wasn’t handling all the traffic. As the load balancer uses sticky sessions, we needed to configure two load zones or use two IPs so the ELB would send traffic to both servers.
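As a sanity check on that 50 VU target, the analytics figures above can be turned into a rough concurrency estimate. This is a back-of-envelope sketch: the 5-minute average session length is our assumption for illustration, not a figure from the analytics.

```python
# Back-of-envelope concurrency estimate from the figures quoted above:
# 100K unique visitors/month, 80% of traffic between 9am and 5pm.
MONTHLY_UNIQUES = 100_000
DAILY_VISITS = MONTHLY_UNIQUES / 30            # ~3,333 visits per day
BUSINESS_HOURS_SHARE = 0.8                     # 80% of traffic in an 8-hour window
VISITS_PER_BUSINESS_HOUR = DAILY_VISITS * BUSINESS_HOURS_SHARE / 8
AVG_SESSION_MINUTES = 5                        # assumed average session length

concurrent_users = VISITS_PER_BUSINESS_HOUR * AVG_SESSION_MINUTES / 60
print(round(concurrent_users))  # roughly 28 concurrent users in business hours
```

On those assumptions, 50 VUs comfortably covers a typical business-hours peak with contingency on top.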
The tested environment consisted of an Elastic Load Balancer (ELB), up to four small EC2 instances, a MySQL RDS database and an Elasticsearch server.
The site was built a few years ago with the view that it would be hosted entirely on one server. We had to re-architect several parts of the application so that it could be distributed over several servers and we could separate concerns while remaining performant. With load tests maxing out at 50 VUs, we were able to quickly run tests, analyse the results in the Load Impact control panel, and monitor site performance using New Relic. This allowed us to tweak settings, change code and retest our changes very quickly.
The first few tests were a complete failure. The servers couldn’t handle anything above 20 VUs: they failed to respond and were taken out of the load balancer. This created a ‘death spiral’, with the remaining servers facing ever more traffic, resulting in catastrophic failure. After a few more tests with similar results, we added New Relic to identify bottlenecks. Fixing some of the problems it surfaced sped up a few aspects of page load times, but the servers were still struggling. It became evident that we needed to set up some caching. Once we enabled APC on the app, the results were immediate: we went from crashing the entire site at around 30 VUs to handling the required 50 VUs for over 15 minutes with capacity to burn. A few tweaks were still needed to help it along, such as upgrading PHP from 5.3 to 5.6 and a little PHP-FPM tuning to make sure we were making full use of our small instances.
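The PHP-FPM tuning mentioned above amounts to sizing the worker pool to the instance. The values below are an illustrative sketch, not our production settings; the right numbers depend on the per-worker memory footprint observed on a small instance.

```ini
; Illustrative PHP-FPM pool sizing for a small EC2 instance (~1.7GB RAM).
; These values are a sketch, not the settings we shipped; derive
; pm.max_children from observed per-worker memory consumption.
pm = dynamic
pm.max_children = 20
pm.start_servers = 5
pm.min_spare_servers = 3
pm.max_spare_servers = 8
; Recycle workers periodically to keep memory use in check.
pm.max_requests = 500
```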
The difference in page load time with and without APC was dramatic. Without it, page loads started fairly steady at around 1–3 seconds under reasonably low load; after a couple of minutes, response times looked horrible, climbing to around 20–30 seconds per page, and continued to fluctuate throughout the test.
With APC enabled, load times started similarly low, at around 1–2 seconds, and maintained these levels even as server load hit its peak.
Load Impact was particularly useful in allowing us to test, monitor and retest very quickly and was the perfect partner for New Relic when identifying and troubleshooting performance issues. The fast feedback loop allowed us to optimise the site in a couple of days, with very little setup and no scripting required from us.
After going live with the hosting changeover, we were quickly confronted with a struggling website. Normal usage put the servers under significantly heavier load than our Load Impact tests had, and we needed to test the site with much more varied page loads. Eddy discovered a tool called Locust.io, which let us write small Python scripts that iterate over a set of product SKUs. Instead of loading the same page over and over again, the test requested different product pages, causing a fresh set of images to be generated and cached each time. This was a much more realistic representation of usual traffic, and allowed us to make the infrastructure much more stable.
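We can’t reproduce the Locust script here, but the core idea is simple: draw from a pool of product SKUs so that each request hits a different page. A minimal stdlib sketch of that idea follows; the SKU format and URL pattern are made up for illustration.

```python
import random

# Hypothetical SKU pool standing in for the site's real product catalogue.
SKUS = [f"SKU{n:05d}" for n in range(1, 501)]

def random_product_urls(count, seed=None):
    """Return a randomised list of product-page URLs, so a virtual user
    spreads its requests across many pages instead of hammering one."""
    rng = random.Random(seed)
    return [f"/products/{rng.choice(SKUS)}" for _ in range(count)]

urls = random_product_urls(10, seed=42)
```

In the real test, each Locust task picked a SKU like this and issued an HTTP GET for that product page, exercising the on-the-fly image generation and caching paths across many different products.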
This post was written as a case study for Load Impact, which you can read here.