
Introduction

Vote for Policies is a service to help voters make an informed decision by comparing what the political parties are promising to do. Having quickly gained popularity during the 2010 general election, the service was unable to cope with the unexpectedly heavy load. For the 2015 general election, Vote for Policies raised funds through Crowdfunder and approached us to rebuild the service from the ground up. Below I've outlined how we architected the application for stability, reliability and performance.

Stack Architecture

ELB

We host the application behind an AWS Elastic Load Balancer (ELB), which distributes traffic to a number of t2.small EC2 instances.

The instances behind the ELB sit in an Auto Scaling group, which scales up if either:

  • CPUUtilization > 60 for 1 minute
  • NetworkIn >= 50Mbps for 1 minute

We noticed that once any of the instances behind the ELB went above 70Mbps of NetworkIn we would begin to see 503 responses, so we chose the 50Mbps threshold to give us time to launch and register a new instance with the ELB before any existing instance reached 70Mbps. We've yet to see CPU usage come anywhere near the 60% threshold, as we've offloaded any intensive tasks to our queue system, so the CPUUtilization metric is just a safeguard.

We scale the group down if:

  • NetworkOut <= 30,000,000 bytes for 15 minutes
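For illustration, the scale-up NetworkIn trigger could be expressed as a CloudWatch alarm along the following lines. This is only a sketch using the AWS SDK for PHP: the alarm name, Auto Scaling group name and scaling policy ARN are placeholders, and the byte threshold assumes the 50Mbps figure is averaged over a 60-second period.

<?php
// Sketch of the scale-up NetworkIn alarm via the AWS SDK for PHP (v2-style factory).
// The names, policy ARN and threshold maths below are illustrative, not our exact setup.
use Aws\CloudWatch\CloudWatchClient;

$cloudWatch = CloudWatchClient::factory(array('region' => 'eu-west-1'));

// Placeholder ARN of the "add one instance" scaling policy returned by PutScalingPolicy.
$scaleUpPolicyArn = 'arn:aws:autoscaling:eu-west-1:123456789012:scalingPolicy:example';

$cloudWatch->putMetricAlarm(array(
    'AlarmName'          => 'web-scale-up-network-in',
    'Namespace'          => 'AWS/EC2',
    'MetricName'         => 'NetworkIn',
    'Statistic'          => 'Average',
    'Period'             => 60,          // 1 minute, as per the scaling rule above
    'EvaluationPeriods'  => 1,
    'Threshold'          => 375000000,   // ~50Mbps sustained for 60s, expressed in bytes
    'ComparisonOperator' => 'GreaterThanOrEqualToThreshold',
    'Dimensions'         => array(
        array('Name' => 'AutoScalingGroupName', 'Value' => 'vfp-web-asg'),
    ),
    'AlarmActions'       => array($scaleUpPolicyArn),
));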

Web Server

Our web-server instances are running PHP 5.6 with PHP-FPM and Opcache enabled.

Our web server configuration is:

  • EC2 t2.small instance (2GB Memory, EBS Storage)
  • PHP 5.6.X running as PHP-FPM with Opcache enabled
  • Nginx
  • FastCGI Cache

We utilise the nginx fastcgi_cache on each instance. Keeping a single reverse proxy in front of all of the instances would result in more cache hits, but it would also introduce a single point of failure; caching per instance eliminates that risk.

All of our static pages are cached with the fastcgi_cache, in addition to public survey results pages. This cache is bypassed if the user has completed a survey on our website, which allows us to provide a link to their results the next time they visit the site.
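The bypass itself happens in nginx (for example via fastcgi_cache_bypass keyed on a request cookie), but the application has to set that cookie once a survey is completed. A minimal sketch of the controller side, assuming a hypothetical completed_survey cookie and a getSlug() accessor on the survey entity:

<?php
// Sketch: once a survey has been persisted, set a long-lived cookie so that nginx
// can skip the fastcgi_cache for this visitor and the application can link them
// back to their results. The cookie name and entity accessor are illustrative.
use Symfony\Component\HttpFoundation\Cookie;
use Symfony\Component\HttpFoundation\RedirectResponse;

$response = new RedirectResponse('/results/' . $survey->getSlug());
$response->headers->setCookie(
    new Cookie('completed_survey', $survey->getSlug(), new \DateTime('+1 year'))
);

return $response;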

Queue Server

We are using the RabbitMQ messaging system; our queue server is hosted by CloudAMQP (Big Bunny plan, a dedicated server).

Queue server configuration:

  • Master/standby failover
  • Max ~10k msgs/s
  • Max ~1k connections
  • Max ~5M queued messages

We made the decision not to manage our own queue server due to time and resource constraints, but there is no reason why it couldn't become part of our own AWS stack if the need arises.

Worker Server

We use worker servers to execute anything that could delay a request from completing, even persisting surveys, so that the user experience stays as fast as possible.
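The pattern, roughly, is that a controller publishes a message and returns straight away, while a consumer on a worker instance does the slow work. A sketch of the producer side, assuming the producer is configured under the same name as the consumer in the supervisord extract further down (the payload shape and service names here are illustrative):

<?php
// Sketch of offloading survey persistence from the web request to the queue.
// OldSoundRabbitMqBundle exposes each configured producer as a service named
// old_sound_rabbit_mq.<producer>_producer; the payload shape is illustrative.
use Symfony\Component\HttpFoundation\JsonResponse;

$payload = json_encode(array(
    'survey_id' => $survey->getId(),
    'answers'   => $answers,
));

$this->get('old_sound_rabbit_mq.survey_results_complete_producer')->publish($payload);

// Respond immediately; a worker instance will persist the survey shortly.
return new JsonResponse(array('status' => 'queued'));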

Our worker servers also live behind an ELB but don't have auto-scaling enabled; we manually manage the number of instances based on the size of our queues, which we can check using the RabbitMQ management console. In the future we could automate this, but for now we're happy to occasionally add or remove a worker from the group if a queue looks to be getting too long.
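If we did want to automate that decision, the RabbitMQ management HTTP API that CloudAMQP exposes makes queue depths easy to read programmatically. A rough sketch, with the host, vhost and credentials as placeholders:

<?php
// Sketch: read queue depths from the RabbitMQ management HTTP API.
// The host, vhost and credentials below are placeholders for the CloudAMQP details.
$url  = 'https://user:password@example.cloudamqp.com/api/queues/my_vhost';
$json = file_get_contents($url);

foreach (json_decode($json, true) as $queue) {
    printf("%s: %d messages\n", $queue['name'], $queue['messages']);
}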

Our worker server configuration is:

  • EC2 m3.medium instance (3.75 GB Memory, EBS Storage)
  • PHP 5.6.X with Opcache enabled
  • SupervisorD

Here is an extract from our supervisor configuration:

[program:persist_survey_result_consumer]
autostart=true
autorestart=true
startsecs=10
stopwaitsecs=30
startretries=100
logfile=/var/www/sites/voteforpolicies/current/app/logs/supervisord.log
directory=/var/www/sites/voteforpolicies/current
command=/var/www/sites/voteforpolicies/current/app/console rabbitmq:consumer survey_results_complete --env prod --messages 100
numprocs=4
process_name=%(program_name)s%(process_num)s

[program:national_refresh_consumer]
autostart=true
autorestart=true
startsecs=10
stopwaitsecs=30
startretries=100
directory=/var/www/sites/voteforpolicies/current
command=/var/www/sites/voteforpolicies/current/app/console rabbitmq:consumer national_refresh --env prod --messages 100

The consumer processes are Symfony2 console commands. We tell each process to consume only 100 messages before exiting and letting supervisord restart it, which ensures that any memory leaks from long-running PHP processes are avoided. Some consumers, such as the survey persist consumer, run multiple times in parallel, which is again handled by supervisord.
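Each consumer is a small class implementing the bundle's ConsumerInterface; the command in the supervisord extract above simply wires a queue to one of these classes. A condensed sketch (the entity and its fields are illustrative, not the production code):

<?php
// Sketch of a consumer: RabbitMqBundle calls execute() once per message, and the
// --messages 100 flag above makes the command exit after 100 of these calls so
// supervisord can restart it. Entity and field names are illustrative.
use OldSound\RabbitMqBundle\RabbitMq\ConsumerInterface;
use PhpAmqpLib\Message\AMQPMessage;

class PersistSurveyResultConsumer implements ConsumerInterface
{
    private $entityManager;

    public function __construct($entityManager)
    {
        $this->entityManager = $entityManager;
    }

    public function execute(AMQPMessage $msg)
    {
        $data = json_decode($msg->body, true);

        $survey = new SurveyResult();
        $survey->setAnswers($data['answers']);

        $this->entityManager->persist($survey);
        $this->entityManager->flush();

        return true; // acknowledge the message
    }
}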

MySQL

MySQL is primarily used as a *third-tier* cache layer. Most requests to the application (homepage, static pages and shared results pages) should be caught by the nginx fastcgi_cache before they even reach PHP. If they do reach PHP, 99% of the time the data required should be in Redis. If the request is the first for a particular resource, though, or a particular survey has expired from Redis (we set a one-week expiration on surveys that have been persisted; after that it's unlikely a survey will be viewed again), we retrieve the survey from the database and cache it in Redis again for another week. The three tiers: Nginx > Redis > MySQL.
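In code, the second and third tiers look roughly like this (the key naming, serialisation and repository details are illustrative; the one-week TTL matches the description above):

<?php
// Sketch of the Redis-then-MySQL lookup: check Redis first, fall back to Doctrine,
// then re-cache the survey for another week. Names are illustrative.
function findSurvey($slug, $redis, $entityManager)
{
    $key    = 'survey:' . $slug;
    $cached = $redis->get($key);

    if ($cached) {
        return unserialize($cached); // second tier: Redis hit
    }

    // Third tier: first request for this survey, or it has expired from Redis.
    $survey = $entityManager->getRepository('AppBundle:Survey')->findOneBySlug($slug);

    if ($survey) {
        $redis->setex($key, 604800, serialize($survey)); // cache for another week
    }

    return $survey;
}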

All of our MySQL queries are handled by the Doctrine ORM and written using the Doctrine QueryBuilder. The SQL that Doctrine generates for these queries is also cached in Redis.
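A typical query looks something along these lines (a sketch, not the production repository; the result cache call is optional and shown for illustration):

<?php
// Sketch of a QueryBuilder query. With Doctrine's query cache backed by Redis the
// DQL-to-SQL translation is done once and the generated SQL is reused; the result
// cache line additionally caches the hydrated result.
$query = $entityManager->createQueryBuilder()
    ->select('s')
    ->from('AppBundle:Survey', 's')
    ->where('s.slug = :slug')
    ->setParameter('slug', $slug)
    ->getQuery();

$query->useResultCache(true, 3600);

$survey = $query->getOneOrNullResult();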

Redis

For Redis we're using the AWS ElastiCache service. As Redis is such an integral part of our application we set up a replication group with two m3.medium instances in a master/slave formation across different AZs. This means that in the unlikely event of an outage ElastiCache will perform an automatic failover by swapping the roles of the two nodes, and our application will continue to work as expected. An additional benefit of this setup is that automated snapshots can be taken from the node currently in the read replica role, avoiding any performance impact on the master.

We use Redis and MySQL for storing data. Redis is used as a result cache throughout the application in addition to holding user sessions, site-wide totals and incomplete surveys. MySQL stores all additional data and completed surveys. Our aim is that we never hit the MySQL box from our main application servers.
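The site-wide totals, for instance, are just counters that Redis can increment atomically, so displaying them never touches MySQL. Something along these lines, with an illustrative key name:

<?php
// Sketch: a site-wide total kept as an atomic Redis counter. Key name is illustrative.
$redis->incr('totals:surveys_completed');

// Reading it back for the homepage is a single Redis round trip.
$completed = (int) $redis->get('totals:surveys_completed');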

By storing user sessions in Redis we can easily scale the application horizontally without needing to turn on sticky sessions.

Application

Our application is based on the Symfony 2.6.* Standard Edition.

For Redis we use the SncRedisBundle. For RabbitMQ interactions we are using the RabbitMqBundle.

We’re using the DoctrineMigrationsBundle for database migrations and the data-fixtures and AliceBundle for database fixtures.

In terms of testing we're using PHPSpec and Behat 3 with Selenium2. Our CI tool, Jenkins, runs all of our tests and triggers a new Capistrano deployment if they pass.

We store over 2.5 million postcodes in our database, extracted from a CSV file from the Office for National Statistics using a Symfony2 console command. We've created a custom Rake task to fully bootstrap the application, but for development we just use a select few postcodes.
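A condensed sketch of what such an import command can look like (the command name, entity and CSV column are illustrative; batching the flushes keeps memory usage flat over ~2.5 million rows):

<?php
// Sketch of a Symfony2 console command that imports postcodes from the ONS CSV.
// Entity, command name and column index are illustrative, not the production code.
use Symfony\Bundle\FrameworkBundle\Command\ContainerAwareCommand;
use Symfony\Component\Console\Input\InputArgument;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;

class ImportPostcodesCommand extends ContainerAwareCommand
{
    protected function configure()
    {
        $this->setName('vfp:postcodes:import')
            ->addArgument('csv', InputArgument::REQUIRED, 'Path to the ONS CSV file');
    }

    protected function execute(InputInterface $input, OutputInterface $output)
    {
        $em     = $this->getContainer()->get('doctrine.orm.entity_manager');
        $handle = fopen($input->getArgument('csv'), 'r');
        $row    = 0;

        while (($columns = fgetcsv($handle)) !== false) {
            $postcode = new Postcode();
            $postcode->setCode($columns[0]);
            $em->persist($postcode);

            // Flush and detach in batches so the unit of work doesn't grow unbounded.
            if (++$row % 1000 === 0) {
                $em->flush();
                $em->clear();
            }
        }

        $em->flush();
        fclose($handle);

        $output->writeln(sprintf('Imported %d postcodes', $row));
    }
}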

Results

So far the application has handled over 165,000 completed surveys in a period of a month, and we’ve seen the architecture handle celebrity tweets and a mention on BBC Question Time without breaking a sweat, which will hopefully stand us in good stead for the inevitable user influx prior to the 2015 general election.

 
