Vote for Policies is a service that helps voters make an informed decision by comparing what the political parties are promising to do. After quickly gaining popularity during the 2010 general election, the service was unable to cope with the unexpectedly heavy load. For the 2015 general election, Vote for Policies raised funds through Crowdfunder and approached us to build the service from the ground up. Below I’ve outlined how we architected the application for stability, reliability and performance.
We’re hosting the application behind an AWS Elastic Load Balancer (ELB), which distributes traffic across a number of t2.small EC2 instances.
The instances belong to an Auto Scaling group, which scales up if either:
- CPUUtilization > 60 for 1 minute
- NetworkIn >= 50Mbps for 1 minute
We noticed that once any of the instances behind the ELB went above 70Mbps of NetworkIn we would begin to see 503 responses, so we chose the 50Mbps threshold to allow time to add a new instance to the ELB before any existing instance reached 70Mbps. We’ve yet to see CPU usage anywhere near the 60% mark, as we’ve offloaded any intensive tasks to our queue system, so the CPUUtilization metric is just a safeguard.
We scale the group down if:
- NetworkOut <= 30,000,000 bytes for 15 minutes
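As a sketch, the scale-up trigger can be expressed as a CloudWatch alarm wired to a simple scaling policy. The group, policy and alarm names below are placeholders, and note that the NetworkIn metric is reported in bytes per period, so 50Mbps over a 60-second period is roughly 375,000,000 bytes:

```shell
# Hypothetical names; assumes the Auto Scaling group already exists.
POLICY_ARN=$(aws autoscaling put-scaling-policy \
  --auto-scaling-group-name vfp-web-asg \
  --policy-name vfp-scale-up \
  --scaling-adjustment 1 \
  --adjustment-type ChangeInCapacity \
  --query 'PolicyARN' --output text)

# NetworkIn is bytes per period: 50Mbit/s * 60s / 8 bits = ~375MB.
aws cloudwatch put-metric-alarm \
  --alarm-name vfp-networkin-high \
  --namespace AWS/EC2 \
  --metric-name NetworkIn \
  --statistic Average \
  --period 60 \
  --evaluation-periods 1 \
  --threshold 375000000 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --dimensions Name=AutoScalingGroupName,Value=vfp-web-asg \
  --alarm-actions "$POLICY_ARN"
```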
Our web server configuration is:
- EC2 t2.small instance (2GB memory, EBS storage)
- PHP 5.6.x running as PHP-FPM with Opcache enabled
- nginx with FastCGI cache
We utilise the nginx fastcgi_cache on each instance. A single caching reverse proxy in front of all of the instances would result in more cache hits, but it would also be a single point of failure, so we trade some hit rate for resilience.
All of our static pages are cached with the fastcgi_cache, in addition to public survey results pages. This cache is bypassed if the user has completed a survey on our website, which allows us to provide a link to their results the next time they visit the site.
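A minimal sketch of this setup, with a placeholder cache zone and a hypothetical cookie name marking users who have completed a survey (the real config also needs fastcgi_pass and the usual FastCGI params):

```nginx
# Cache zone backed by local disk; names and sizes are illustrative.
fastcgi_cache_path /var/run/nginx-cache levels=1:2 keys_zone=vfp:100m inactive=60m;

server {
    location ~ \.php$ {
        fastcgi_cache vfp;
        fastcgi_cache_key "$scheme$request_method$host$request_uri";
        fastcgi_cache_valid 200 10m;

        # Skip the cache for users who have completed a survey,
        # identified here by a hypothetical cookie.
        fastcgi_cache_bypass $cookie_completed_survey;
        fastcgi_no_cache $cookie_completed_survey;

        # ... fastcgi_pass and standard FastCGI params go here ...
    }
}
```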
Our hosted RabbitMQ queue server configuration:
- Master/standby failover
- Max ~10k msgs/s
- Max ~1k connections
- Max ~5M queued messages
We made the decision not to manage our own queue server due to time constraints and resource management, but there is no reason why this couldn’t be a part of our own AWS stack if the need arises.
We use worker servers to execute anything that could delay a request from completing, including persisting surveys, so that the user experience stays as fast as possible.
Our worker servers also live behind an ELB but don’t have auto-scaling enabled; we manually manage the number of instances based on the size of our queues, which we check using the RabbitMQ management console. In the future we could automate this, but for now we’re happy to occasionally add or remove a worker from the group if a queue looks to be getting too long.
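Checking queue depth doesn’t require the web console; the same numbers are available from the RabbitMQ management HTTP API (host, credentials and vhost below are placeholders):

```shell
# List queue names and message counts for the default vhost (%2f).
curl -s -u guest:guest \
  "http://rabbitmq.example.com:15672/api/queues/%2f" \
  | python -c 'import json,sys; [print(q["name"], q["messages"]) for q in json.load(sys.stdin)]'
```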
Our worker server configuration is:
- EC2 m3.medium instance (3.75 GB Memory, EBS Storage)
- PHP 5.6.x with Opcache enabled
Here is an extract from our supervisor configuration:
```ini
[program:persist_survey_result_consumer]
autostart=true
autorestart=true
startsecs=10
stopwaitsecs=30
startretries=100
logfile=/var/www/sites/voteforpolicies/current/app/logs/supervisord.log
directory=/var/www/sites/voteforpolicies/current
command=/var/www/sites/voteforpolicies/current/app/console rabbitmq:consumer survey_results_complete --env prod --messages 100
numprocs=4
process_name=%(program_name)s%(process_num)s

[program:national_refresh_consumer]
autostart=true
autorestart=true
startsecs=10
stopwaitsecs=30
startretries=100
directory=/var/www/sites/voteforpolicies/current
command=/var/www/sites/voteforpolicies/current/app/console rabbitmq:consumer national_refresh --env prod --messages 100
```
The consumer processes are Symfony2 console commands. We tell each process to consume only 100 messages before exiting and letting supervisord restart it, ensuring that memory leaks from long-running PHP processes are avoided. Some consumers, such as the survey persist consumer, run as multiple parallel processes, which supervisord also handles via numprocs.
MySQL is primarily used as a *third-tier* cache layer. Most requests to the application (homepage, static pages and shared results pages) should be caught by the nginx fastcgi_cache before they even reach PHP. If they do reach PHP, 99% of the time the data required should be in Redis. If the request is the first for a particular resource, or a survey has expired from Redis (persisted surveys expire after one week, after which it’s unlikely a survey will be viewed again), we retrieve the survey from the database and cache it in Redis for another week. The three tiers: nginx > Redis > MySQL.
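The read path described above is a cache-aside lookup with a one-week TTL. Here is a minimal sketch in Python (our application is PHP/Symfony2; the names are illustrative, with in-memory stand-ins for Redis and MySQL so the sketch is runnable):

```python
WEEK_SECONDS = 7 * 24 * 60 * 60


class SurveyStore:
    """Cache-aside lookup: try Redis first, fall back to MySQL and re-cache."""

    def __init__(self, redis_client, mysql_client):
        self.redis = redis_client
        self.mysql = mysql_client

    def get_survey(self, survey_id):
        key = f"survey:{survey_id}"
        cached = self.redis.get(key)
        if cached is not None:
            return cached  # hot path: served from Redis
        survey = self.mysql.fetch(survey_id)  # cold path: first read or expired
        if survey is not None:
            # Re-cache for another week; unlikely to be read after that.
            self.redis.setex(key, WEEK_SECONDS, survey)
        return survey


# In-memory stand-ins for the real clients.
class FakeRedis:
    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def setex(self, key, ttl, value):
        self.data[key] = value


class FakeMySQL:
    def __init__(self, rows):
        self.rows = rows
        self.hits = 0

    def fetch(self, survey_id):
        self.hits += 1
        return self.rows.get(survey_id)
```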
For Redis we’re using the AWS ElastiCache service. As Redis is such an integral part of our application, we set up a replication group with two m3.medium instances in a master/slave formation in different AZs. This means that in the unlikely event of an outage, ElastiCache will perform an automatic failover by swapping the roles of the two nodes, and our application will continue to work as expected. An additional benefit of this setup is that automated snapshots can be taken from the node currently in the read-replica role, avoiding any performance impact on the master.
We use Redis and MySQL for storing data. Redis is used as a result cache throughout the application in addition to holding user sessions, site-wide totals and incomplete surveys. MySQL stores all additional data and completed surveys. Our aim is that we never hit the MySQL box from our main application servers.
By storing user sessions in Redis we can easily scale the application horizontally without needing to turn on sticky sessions.
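One common way to do this with the phpredis extension is a two-line php.ini change pointing PHP’s native session handler at Redis (the endpoint below is a placeholder for the ElastiCache primary):

```ini
; Store PHP sessions in Redis via the phpredis extension.
session.save_handler = redis
session.save_path = "tcp://my-elasticache-endpoint:6379"
```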
Our application is based on Symfony 2.6.* standard edition.
We store over 2.5 million postcodes in our local database, extracted with a Symfony2 console command from a CSV file published by the Office for National Statistics. We’ve created a custom Rake task to fully bootstrap the application, but for development we just import a select few postcodes.
So far the application has handled over 165,000 completed surveys in a period of a month, and we’ve seen the architecture handle celebrity tweets and a mention on BBC Question Time without breaking a sweat, which will hopefully stand us in good stead for the inevitable user influx prior to the 2015 general election.