Scale Summit 2015: Performance and scalability unconference

This year’s Scale Summit was excellent, as it was last time. I would strongly encourage anyone involved in scaling large systems to attend. Thanks to @bruntonspall, @jtopper, @scalefactory and @jonty for organising.

Below are my rough notes taken in the sessions I went to. Notes on other sessions can be found in:

### First session: Multi-cloud PaaS

  • VPC per app
  • Engineers in each division; product teams include engineers and own their app in production
  • OH: “I don’t believe in punctuation, makes it look like Perl”
  • Tightly coupled to AWS due to services used
  • Deployments managed by platform team (based on templates), product teams deploy
  • Blue/green rolling deployments
  • AWS rate limits; some are hard limits (network NAT, etc); AWS account per development team
  • Separate AWS accounts for Production and testing environments
  • Core team managed SSH bastion hosts and DNS
  • Core team only called out if AWS is down or rate limits hit
  • Hybrid - AWS and bare metal
  • London PaaS user group
  • Developers tend to want to use AWS-specific products
  • Application developers woken up at 2am on cloud platform - alarm trips, call developers
  • Or make application developers the first-line support; help from core team if required (e.g. second opinion)
  • Core team have little domain knowledge of applications; worst people to call out
  • “Full stack engineers” - no dev/ops split - everyone should know about everything?
  • Docker and Mesos on Hybrid cloud, Marathon, generate a new repository as a template; git push to environment
  • GoCD as alternative to Jenkins
  • PCI-compliant servers outside of Mesos
  • Secrets mounted via volume mounts, no access to hosts (see the first sketch after this list)
  • SmartStack for service discovery, running locally on each host
  • Logging in SmartStack
  • HACheck; checks that constant healthchecks are being received
  • Container introspection for outdated libraries
  • Product teams must provide a mechanism for checking for outdated libraries on arbitrary OSes, and ensure Docker images are rebuilt
  • Deadline of 48 hours for security update otherwise service switched off
  • Docker container security isolation? Docker proxy daemon (per-user socket); intercepts volume mounts; turns off raw sockets and setuid/setgid, restricts what can be mounted (see the second sketch after this list)
  • CloudFoundry have at least one pair of developers in each team who review external contributions
  • Docker and CloudFoundry possible (CloudFoundry version 3), using Garden
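
Not from the session itself, but a minimal sketch of the secrets-via-volume-mounts point above: the application reads its credentials from a file mounted into the container, so nothing sensitive is baked into the image or left on the host. The mount path and variable names are hypothetical.

```python
import os

# Hypothetical mount point: wherever the platform team mounts secrets
# into the container; nothing sensitive is baked into the image.
SECRET_PATH = "/secrets/db_password"

def read_secret(path=SECRET_PATH):
    """Read a secret from a volume-mounted file, falling back to an
    environment variable for local development."""
    try:
        with open(path) as f:
            return f.read().strip()
    except IOError:
        return os.environ.get("DB_PASSWORD")

if __name__ == "__main__":
    print("secret loaded" if read_secret() else "no secret found")
```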

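The per-user Docker proxy daemon described above isn’t public, but the restrictions it applies map roughly onto stock `docker run` flags. A rough illustration, assuming a reasonably recent Docker; the image name and mount path are made up:

```python
import subprocess

# Hardened `docker run` in the spirit of the proxy daemon above:
# no raw sockets, no privilege escalation via setuid/setgid binaries,
# a read-only root filesystem, and a single whitelisted volume mount.
cmd = [
    "docker", "run", "--rm",
    "--cap-drop", "NET_RAW",                 # disable raw sockets
    "--security-opt", "no-new-privileges",   # neuter setuid/setgid
    "--read-only",                           # immutable root filesystem
    "-v", "/secrets/app:/secrets:ro",        # the one permitted mount
    "example/app:latest",
]
subprocess.check_call(cmd)
```
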
### Second session: Queues; RabbitMQ and others

  • Used to use RabbitMQ; then Redis; then RabbitMQ
  • RabbitMQ overly complicated, difficult to maintain
  • Celery client, then PyRes, then a wrapper around Pika (bad experiences with Celery; don’t use it)
  • ActiveMQ worse than RabbitMQ
  • Samza
  • Beanstalkd for job queue
  • Pub/sub versus work queue
  • Redis not well suited to queues
  • Beanstalkd fell over, ate its own queue once, some good experiences
  • SQS; be aware of its semantics (at-least-once delivery; duplicates possible)
  • PostgreSQL for queueing? Notification system; in-memory row-level locks (see the second sketch after this list)
  • Publisher confirms if you really care about not losing messages, e.g. financial data (see the pika sketch after this list)
  • Federated clusters
  • Cluster mode risks losing data
  • More exchanges make it worse, especially if flushing to disk with poor IOPS
  • Delayed retrying queues; dead-lettering queues (also shown in the pika sketch below)
  • OH: “Queuing is fine as long as there is no queue”
  • Several thousand bindings are fine
  • Disque? Very new
  • ZeroMQ is risky if your code is not robust
  • Beware of the Redis master persisting to disk and invoking the OOM killer, and the slave then deleting all data
  • Need to apply back pressure if too many messages; better to fail early than back up for long periods of time
  • The less queueing the better; do you really need a queue?
  • Beanstalkd really nice
  • Have a look at Darner on-disk queue: https://github.com/wavii/darner
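
Two of the RabbitMQ points above, publisher confirms and dead-lettering, in one minimal sketch. This assumes pika 1.x’s `BlockingConnection`; the queue and exchange names are made up. With confirms enabled, `basic_publish` raises if the broker cannot take responsibility for the message.

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Dead-letter exchange: rejected or expired messages are re-routed
# here instead of being silently dropped.
channel.exchange_declare(exchange="jobs.dlx", exchange_type="fanout")
channel.queue_declare(queue="jobs.dead", durable=True)
channel.queue_bind(queue="jobs.dead", exchange="jobs.dlx")

# The working queue dead-letters into the exchange above.
channel.queue_declare(
    queue="jobs",
    durable=True,
    arguments={"x-dead-letter-exchange": "jobs.dlx"},
)

# Publisher confirms: the broker acknowledges every publish, so lost
# messages surface as exceptions rather than disappearing silently.
channel.confirm_delivery()

channel.basic_publish(
    exchange="",
    routing_key="jobs",
    body=b"process-invoice-123",
    properties=pika.BasicProperties(delivery_mode=2),  # persistent
)
connection.close()
```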

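And a rough sketch of the PostgreSQL-as-a-queue idea: a plain jobs table where workers claim rows under a row-level lock. Table and column names are invented; on PostgreSQL 9.5+ you would add `SKIP LOCKED` so workers don’t queue up behind each other, and `LISTEN`/`NOTIFY` can wake workers instead of polling.

```python
import psycopg2

conn = psycopg2.connect("dbname=queue_demo")

def claim_job():
    """Atomically claim one pending job using a row-level lock."""
    with conn.cursor() as cur:
        cur.execute(
            """
            UPDATE jobs
               SET status = 'running'
             WHERE id = (
                     SELECT id FROM jobs
                      WHERE status = 'pending'
                      ORDER BY id
                      LIMIT 1
                        FOR UPDATE
                   )
            RETURNING id, payload
            """
        )
        row = cur.fetchone()  # None when the queue is empty
    conn.commit()
    return row

job = claim_job()
if job:
    print("claimed job", job[0])
```
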
### Third session (after lunch): Request routing; back-pressure, circuit breakers and failure scenarios

  • Timeout of 10ms, retry request
  • Duplicate all requests; use first response (see the first sketch after this list)
  • Dirty session cookies for read/write consistency
  • Serve-from-stale using Varnish (caching always helps)
  • Long tail: using search as an example
  • Use application healthchecks to avoid common upstream problems (e.g. database is failing); but is this useful given the infrequency of healthchecks?
  • Change UI as last resort to relieve traffic
  • Mongrel 2 has support for ZeroMQ as transport; HTTP requests from Mongrel (a work queue for HTTP), needs compatible web server
  • Load testing framework
  • “HTTP multicasting”
  • Use ESIs (Varnish), caching much easier using fragments of pages
  • Don’t serve ads if they’re slower than the page’s business functionality
  • Content rendering slowed down by adverts
  • Circuit breakers are traditionally manual (a sketch of an automatic one follows after this list)
  • Too much load to have applications check all dependencies during healthcheck (must be asynchronous to healthcheck request)
  • Cache healthcheck endpoints (e.g. 10 second); also helps prevent DDoS (e.g. thundering herd when load balancer starts up)
  • HACheck from Uber
  • SmartStack: local HA proxy
  • Choose healthchecks very carefully
  • How to define failure dependency/service coupling (should this be external to the application? probably not)
    • Use application logic to determine service health
  • EDML
  • Healthchecks might check replication lag? Is database up?
  • Apps should be able to determine their own health based on their dependent services
  • Vulcan, SmartStack + HAProxy, Synapse
  • Avoid ZooKeeper unless of course you’re already using it
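
The “duplicate all requests; use first response” tactic from the start of this session, sketched with a thread pool: issue the same read-only request against two replicas and take whichever answers first. The hostnames are invented, and the 10ms timeout mentioned above only makes sense inside a datacentre, so a looser one is used here.

```python
import concurrent.futures
import requests

REPLICAS = [
    "https://replica-1.example.com/search?q=shoes",
    "https://replica-2.example.com/search?q=shoes",
]

def fetch(url):
    # Per-request timeout; tune to your latency budget.
    return requests.get(url, timeout=2.0)

with concurrent.futures.ThreadPoolExecutor(max_workers=len(REPLICAS)) as pool:
    futures = [pool.submit(fetch, url) for url in REPLICAS]
    done, _ = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED
    )
    # Take the first response; the slower request is simply abandoned
    # (its thread drains when the executor shuts down).
    response = next(iter(done)).result()
    print(response.status_code)
```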

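Circuit breakers came up as traditionally manual; a generic automatic one is small enough to sketch. This is not any particular library, just the usual pattern: after a run of consecutive failures the circuit opens and calls fail fast for a cool-off period, instead of piling more load onto a struggling dependency.

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures  # consecutive failures to trip
        self.reset_after = reset_after    # seconds before a retry
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open; failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()
            raise
        self.failures = 0  # success resets the count
        return result
```
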
### Fourth session: Application logging

  • User auditing
  • Use controller actions to infer action taken by user (e.g. Rails)
  • Unique ID for tracing
  • Add a unique tracing ID as a comment to MySQL queries (see the sketch after this list); this breaks the query cache, but the query cache suffers high contention anyway, so don’t use it
    • Shows up in slow query log
  • Percona pt-fingerprint tool for genericising queries; hash the generic query to find recurrences of the same query
  • DjangoSampler
  • github.com/fac/logging-agent (Ruby)
    • MySQL slow query parser
  • NewRelic
  • Heroku Logplex (including back pressure; tackles large-scale problems)
  • Log all that you can with third party services
  • Elasticsearch scroll cursor
  • Doc values in Elasticsearch (foundation.no consultancy)
  • Percolate API for Elasticsearch; tailing Lucene queries
  • Graylog as Elasticsearch frontend (as alternative to Kibana)
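
A small sketch tying together the tracing-ID comment and the pt-fingerprint idea above: tag each query with a request’s tracing ID so it is attributable in the slow query log, and reduce queries to a generic shape whose hash identifies recurrences. The fingerprinting here is deliberately crude; pt-fingerprint itself is far more thorough.

```python
import hashlib
import re
import uuid

def tag_query(sql, trace_id):
    """Prefix a tracing ID comment so the originating request shows
    up next to the query in the MySQL slow query log."""
    return "/* trace_id=%s */ %s" % (trace_id, sql)

def fingerprint(sql):
    """Crude pt-fingerprint-style normalisation: strip literals, then
    hash, so recurrences of the same query shape can be counted."""
    generic = re.sub(r"'[^']*'", "?", sql)      # string literals
    generic = re.sub(r"\b\d+\b", "?", generic)  # numeric literals
    generic = re.sub(r"\s+", " ", generic).strip().lower()
    return hashlib.md5(generic.encode()).hexdigest()

trace_id = uuid.uuid4().hex
sql = "SELECT * FROM orders WHERE user_id = 42 AND status = 'open'"
print(tag_query(sql, trace_id))
print(fingerprint(sql))
```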

### Fifth session: War stories

Too numerous to note.