
Introduction

We’ve been migrating a number of clients to our AWS architecture because they’ve outgrown their current hosts and require a more stable, reliable service that can grow with them. We’re accustomed to architecting different solutions depending on the requirements of the applications we’ve built: some, like our trading application, require extremely high uptime and very low latency under relatively low load whilst others require reliability under extremely spiky traffic conditions.

The most recent application we’ve migrated was somewhere in between: the website for global lighting manufacturer Havells Sylvania has reasonably steady and predictable traffic (100K unique visitors per month, with 80% of traffic between 9am and 5pm), but carries some specific overheads because a lot of product data is generated on the fly (PDFs, images, zip files and so on).

Case

Havells Sylvania initially managed their own hosting on a single, dedicated, general-utility box used for all tasks, including the database, ElasticSearch, generating dynamic content, serving the content-managed site, and fetching data from a third-party API. Over the years, traffic and usage have grown steadily and the application has consumed more and more resources as feature development has demanded. Recently, whenever the site came under heavy load the whole thing would come crashing down, requiring manual intervention to get it back up again.

Havells Sylvania were no longer happy managing the hosting themselves and after consultation we agreed to move them to Amazon Web Services (AWS), allowing us to load-balance the website and better react to spikes in traffic and heavy server loads as well as separate the concerns of the different aspects of the application.

Test setup

Our venture into load testing didn’t start with the Load Impact service. We tried a number of other methods, including trebuchet and blitz.io, but these proved hard to manage: tests were trickier to set up and the results harder to interpret. It seems difficult to find a service that will play nicely with a typical load-balanced setup on AWS when using sticky sessions. The great thing about Load Impact was the low barrier to entry: it was simple to set up tests, and the real-time results were very easy to interpret. Load Impact loads the entire page, including JavaScript files, CSS and images, and gives detailed information on how long each asset takes to load, allowing you to clearly identify where the bottlenecks might be. Other services only load the HTML, which does not give a clear indication of actual usage. That matters to us because a lot of the web applications we build make heavy use of JavaScript frameworks such as Angular.js. The pay-as-you-go model is also very useful when you’re trying the service for the first time, because it gives you the full set of features of a subscribed account without committing to a monthly subscription, reducing the business risk at the point you know the least about the service.

Our tests were reasonably simple in nature: we only needed to prove the site could handle the existing load with a bit of added contingency. We had access to the site analytics and already knew a lot about our users’ behaviour, so as a proof of concept we were aiming to support 50 virtual users (VUs) for 15 – 20 minutes, which is at the upper limit of current site activity. We also had to ensure that traffic would hit all the servers behind the load balancer, rather than one server handling everything. As the load balancer uses sticky sessions, we needed to configure two load zones or use two IPs so the ELB could send traffic to both servers.

Service environment

  • Nginx
  • PHP
  • MySQL
  • ElasticSearch
  • Ubuntu

The tested environment consisted of an Elastic Load Balancer (ELB), up to four small EC2 instances, a MySQL RDS database and an ElasticSearch server.

Challenges

The site was built a few years ago, with the view that it was going to be hosted entirely on one server. We had to re-architect several parts of the application so that it could be distributed over several servers, with concerns separated, and still be performant. After a few load tests, maxing out at 50VUs, we settled into a routine: run a test, analyse the results in the Load Impact control panel, and check site performance using New Relic. This allowed us to tweak settings, change code and retest our changes really quickly.

Solution

The first few tests were a complete failure. The servers couldn’t handle anything above 20VUs, failed to respond and were taken out of the load balancer. This created a ‘death spiral’ with the remaining servers now facing more traffic and resulting in catastrophic failure. After a few more tests with similar results, we added New Relic to identify bottlenecks. Fixing some of the problems here sped up a few aspects of the page load times, but the servers were still struggling. It became evident that we needed to set up some caching. Once we enabled APC on the app, the results were immediate. We went from crashing the entire site at around 30VUs, to handling the required 50VUs for over 15 minutes with capacity to burn. A few tweaks were still needed to help it along, such as upgrading our version of PHP from 5.3 to 5.6, and a little bit of PHP-FPM tuning to make sure we were making full use of our small instances.
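The ‘death spiral’ is easy to reproduce on paper. The sketch below is a toy model, not the ELB's actual health-check logic: it assumes the load is split evenly across healthy servers and that any server whose share exceeds its capacity drops out of the pool. The function name, the capacities and the load figures are all hypothetical and chosen only to show the cascade.

```python
def simulate_pool(total_load, capacities):
    """Return the number of healthy servers after each health-check round.

    Toy model (an assumption, not real ELB behaviour): load is shared evenly,
    and a server is removed as soon as its share exceeds its capacity.
    """
    healthy = list(capacities)
    rounds = [len(healthy)]
    while healthy:
        share = total_load / len(healthy)
        survivors = [cap for cap in healthy if cap >= share]
        if len(survivors) == len(healthy):
            break  # every remaining server copes, so the pool is stable
        healthy = survivors  # the failed servers' traffic moves to the rest
        rounds.append(len(healthy))
    return rounds


# Four servers with slightly different (made-up) capacities, measured in VUs.
capacities = [12, 10, 9, 8]
print(simulate_pool(30, capacities))  # [4]
print(simulate_pool(36, capacities))  # [4, 3, 1, 0]
```

At 30VUs every server stays under its limit and the pool is stable. At 36VUs the weakest server drops out first, its share is redistributed, and each round pushes more servers over the edge until none are left.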

Results

The difference in page load time with and without APC enabled was dramatic: without, page loads started fairly steadily around 1 – 3 seconds, under reasonably low usage. After a couple of minutes, the response times looked horrible, increasing to around 20 – 30 seconds a page. This continued to fluctuate throughout the test.

blog image 1

With APC enabled, load times started similarly low, around 1 – 2 seconds, but maintained these levels, even as server load hit its peak.

blog image 2

Load Impact was particularly useful in allowing us to test, monitor and retest very quickly and was the perfect partner for New Relic when identifying and troubleshooting performance issues. The fast feedback loop allowed us to optimise the site in a couple of days, with very little setup and no scripting required from us.

Footnote

After going live with the hosting changeover, we were quickly confronted with a struggling website. Normal usage put the servers under significantly heavier load than the Load Impact tests had. We needed to be able to test the site with much more varied page loads. Eddy discovered a tool called Locust.io, which allowed us to write small Python scripts that iterate over a set of product SKUs. This meant that instead of loading the same page over and over again, different product pages were loaded, causing images to be cached each time. This was a much more realistic representation of usual traffic, and allowed us to make the infrastructure much more stable.
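A minimal sketch of that kind of script, written against the current Locust API (`HttpUser`, `task`, `between`), might look like the following. It is not the script we ran: the SKUs, the `/products/` URL pattern and the class and function names are hypothetical, since the real ones aren't given here. The Locust import is guarded so the path-picking helpers can be tried without Locust installed.

```python
import random

# Hypothetical SKUs; a real run would load the full list from the catalogue.
PRODUCT_SKUS = ["0001234", "0005678", "0009012"]


def product_path(sku):
    # Hypothetical URL pattern for a product page.
    return f"/products/{sku}"


def pick_product_path(rng=random):
    # Choose a different product each time so requests don't all hit one page.
    return product_path(rng.choice(PRODUCT_SKUS))


try:
    from locust import HttpUser, task, between
except ImportError:
    HttpUser = None  # Locust not installed; the helpers above still work.

if HttpUser is not None:

    class ProductBrowser(HttpUser):
        # Each simulated user pauses 1 to 5 seconds between page views.
        wait_time = between(1, 5)

        @task
        def view_random_product(self):
            # name= groups all product pages under one row in Locust's stats.
            self.client.get(pick_product_path(), name="/products/[sku]")
```

Saved as `locustfile.py`, this would be started with something like `locust -f locustfile.py --host https://example.com`, where the host is a placeholder for the site under test.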


This post was written as a case study for Load Impact, which you can read here.
