Scaling: from 0 to 20 million users

Our first server crashed so often that we survived by manually uploading an HTML file via FTP. Now we serve 20 million users worldwide.

#1 about 2 minutes

An overview of scaling a sports app to millions of users

The initial single-server architecture for a sports results app struggled with exponential user growth, leading to frequent server crashes under load.

#2 about 6 minutes

Using proactive and manual caching to survive traffic spikes

Early scaling relied on Memcached with proactive caching to pre-load live data, and culminated in a hack for one massive event: serving a manually uploaded static HTML file.
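The proactive caching idea can be sketched roughly as follows. This is a minimal illustration, not the code from the talk: `ProactiveCache` is a hypothetical in-memory stand-in for Memcached, and `warm_live_matches`, `handle_request`, the key format, and the 30-second TTL are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class ProactiveCache:
    """Hypothetical in-memory stand-in for a Memcached-style cache."""

    def __init__(self) -> None:
        self._store: Dict[str, Tuple[str, float]] = {}

    def set(self, key: str, value: str, ttl: float) -> None:
        # Store the value together with its expiry time.
        self._store[key] = (value, time.monotonic() + ttl)

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None or entry[1] < time.monotonic():
            return None
        return entry[0]


def warm_live_matches(
    cache: ProactiveCache,
    match_ids: List[int],
    render: Callable[[int], str],
    ttl: float = 30.0,
) -> int:
    """Render every live match ahead of time, before any user asks for it."""
    for match_id in match_ids:
        cache.set(f"match:{match_id}", render(match_id), ttl)
    return len(match_ids)


def handle_request(cache: ProactiveCache, match_id: int) -> Optional[str]:
    """Serve straight from the cache; the request path never renders."""
    return cache.get(f"match:{match_id}")
```

In this sketch a background job would call `warm_live_matches` on a timer, so a traffic spike only ever reads pre-rendered entries instead of triggering expensive renders.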

#3 about 3 minutes

Moving to the cloud and implementing Varnish cache

The first cloud migration to AWS introduced Varnish for superior HTTP caching and request coalescing, alongside stateless AMIs for effective auto-scaling.
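Request coalescing means that when many clients ask for the same uncached URL at once, only one request goes to the backend and everyone else waits for that result. Varnish does this internally; the sketch below only illustrates the idea in Python. The `Coalescer` class and its `fetch` callback are hypothetical names, not part of Varnish.

```python
import threading
from concurrent.futures import Future
from typing import Callable, Dict


class Coalescer:
    """Collapse concurrent requests for the same key into one backend fetch."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch
        self._lock = threading.Lock()
        self._in_flight: Dict[str, Future] = {}

    def get(self, key: str) -> str:
        with self._lock:
            future = self._in_flight.get(key)
            leader = future is None
            if leader:
                # First caller for this key becomes the leader and fetches.
                future = Future()
                self._in_flight[key] = future
        if leader:
            try:
                future.set_result(self._fetch(key))
            except Exception as exc:
                # Followers see the same failure as the leader.
                future.set_exception(exc)
            finally:
                with self._lock:
                    del self._in_flight[key]
        # Followers block here until the leader has published a result.
        return future.result()
```

Note that this sketch does not cache anything by itself: once the in-flight fetch finishes, the next request triggers a new one. A real HTTP cache combines coalescing with stored responses.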

#4 about 2 minutes

Migrating from MongoDB to Postgres for data reliability

After encountering data type errors and a lack of locking in MongoDB, a live migration to Postgres was performed to gain stability and analytical power.

#5 about 2 minutes

Optimizing cache efficiency with a dedicated sharded layer

To solve cache inefficiency from auto-scaling, the architecture was changed to a dedicated, sharded Varnish layer in front of application servers.
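One way to shard a cache layer is to map every URL to exactly one cache node, so each object is stored once and the hit ratio does not dilute as nodes are added. The summary does not say which scheme was used; the sketch below assumes rendezvous (highest-random-weight) hashing, and the shard names are made up.

```python
import hashlib
from typing import List


def pick_shard(url: str, shards: List[str]) -> str:
    """Rendezvous hashing: the shard with the highest score for this URL wins."""

    def score(shard: str) -> int:
        digest = hashlib.sha256(f"{shard}|{url}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(shards, key=score)


# Hypothetical shard names, for illustration only.
SHARDS = ["cache-a", "cache-b", "cache-c"]
print(pick_shard("/match/42", SHARDS))
```

A useful property of this scheme is that removing a shard only remaps the URLs that lived on that shard; every other URL keeps its node and its warm cache entry.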

#6 about 2 minutes

Migrating from cloud to on-premise to reduce costs

High AWS traffic costs prompted a move back to an over-provisioned on-premise data center, drastically reducing infrastructure expenses relative to user growth.

#7 about 4 minutes

Solving global latency with a distributed cache network

To improve performance for international users, a globally distributed cache was implemented with geo-routing, reducing average latency from 500ms to 80ms.
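Geo-routing boils down to sending each user to the closest cache. The summary does not describe how the routing decision was made, so the sketch below simply assumes a great-circle distance comparison; the site names and coordinates in `CACHE_SITES` are hypothetical.

```python
import math
from typing import Dict, Tuple

Location = Tuple[float, float]  # (latitude, longitude) in degrees

# Hypothetical cache locations; not taken from the talk.
CACHE_SITES: Dict[str, Location] = {
    "frankfurt": (50.11, 8.68),
    "singapore": (1.35, 103.82),
    "virginia": (39.04, -77.49),
}


def great_circle_km(a: Location, b: Location) -> float:
    """Haversine distance between two points on Earth, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_cache(user: Location, sites: Dict[str, Location] = CACHE_SITES) -> str:
    """Pick the cache site with the smallest distance to the user."""
    return min(sites, key=lambda name: great_circle_km(user, sites[name]))
```

Geographic distance is only a proxy for network latency; a production setup would more likely route on measured latency or rely on DNS or anycast.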

#8 about 2 minutes

Adopting Kubernetes for multi-datacenter redundancy

After a provider's data center fire, a second data center was added and managed with Kubernetes to ensure high availability and simplify deployments.

#9 about 1 minute

Implementing real-time updates with NATS messaging

To eliminate polling delays and deliver instant updates, a pub/sub architecture using NATS messaging was implemented for millions of concurrent client connections.
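The difference between polling and pub/sub is that clients register interest once and the server pushes each update as it happens. The sketch below is a tiny in-memory broker that mimics subject-based fan-out; it does not use the NATS client API, and the `Broker` class and the `match.42` subject name are assumptions for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Callback = Callable[[str], None]


class Broker:
    """Hypothetical in-memory pub/sub broker with exact-match subjects."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, subject: str, callback: Callback) -> None:
        self._subscribers[subject].append(callback)

    def publish(self, subject: str, message: str) -> int:
        """Deliver the message to every subscriber; return how many got it."""
        callbacks = self._subscribers.get(subject, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


broker = Broker()
broker.subscribe("match.42", lambda msg: print("client A:", msg))
broker.subscribe("match.42", lambda msg: print("client B:", msg))
broker.publish("match.42", "goal")  # both clients are notified immediately
```

A real messaging system adds what this sketch leaves out: network transport, wildcard subjects, and handling of slow or disconnected consumers.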

#10 about 2 minutes

Managing petabyte-scale analytics data with ClickHouse

To power AI/ML models and analyze nearly a petabyte of data on-premise, ClickHouse was chosen for its high-performance analytical capabilities.

#11 about 2 minutes

Key principles for building scalable and efficient infrastructure

The core lessons learned include prioritizing statelessness, aggressive caching, using queues for slow tasks, and choosing the right tool for each specific job.
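The "queues for slow tasks" principle can be sketched as follows: the request handler only enqueues the work and returns immediately, while a background worker processes it. This is a minimal single-process illustration using Python's standard library; `start_worker` and `enqueue_slow_task` are hypothetical names, and a production system would normally use a separate queue service.

```python
import queue
import threading
from typing import Callable, List


def start_worker(tasks: "queue.Queue", results: List[str]) -> threading.Thread:
    """Start a background thread that runs queued jobs one at a time."""

    def run() -> None:
        while True:
            job = tasks.get()
            try:
                if job is None:  # sentinel: stop the worker
                    break
                results.append(job())
            finally:
                tasks.task_done()

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker


def enqueue_slow_task(tasks: "queue.Queue", job: Callable[[], str]) -> str:
    """Called from the request path: hand off the work and return at once."""
    tasks.put(job)
    return "accepted"
```

The request latency then depends only on the cost of `put`, not on how long the slow job (for example, sending notifications) takes to run.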

Josip Stuhli