Mario Valderrama

Operating etcd for Managed Kubernetes

To achieve zero-downtime migrations, we modified etcd snapshots to artificially inflate the revision number. Discover the surprising challenges of operating etcd at massive scale.

#1 about 3 minutes

The journey to managed Kubernetes at IONOS

From its first release in 2019 to managing over 20,000 clusters, IONOS scaled its Kubernetes service by building on a massive etcd foundation.

#2 about 4 minutes

Evolving etcd deployment strategies over time

The team progressed from the CoreOS operator and Bitnami Helm charts to a simplified custom Helm chart for better control and stability.

#3 about 3 minutes

Understanding multi-tenancy and its performance impact

Sharing one etcd cluster between tenants through client-side key prefixes reduces cost but creates noisy-neighbor problems, so compaction and defragmentation need careful tuning.
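The prefix idea can be illustrated with a small in-memory sketch. It is not the team's implementation: the `PrefixedStore` class and the tenant prefixes are hypothetical, and a plain dictionary stands in for the shared etcd keyspace. In a real control plane the prefix would typically be set on the client, for example through the kube-apiserver `--etcd-prefix` flag.

```python
class PrefixedStore:
    """View onto a shared key-value store that confines one tenant to its own prefix."""

    def __init__(self, backend, prefix):
        self._backend = backend
        # The trailing slash keeps "/tenant-a" from also matching "/tenant-ab".
        self._prefix = prefix.rstrip("/") + "/"

    def _full(self, key):
        return self._prefix + key.lstrip("/")

    def put(self, key, value):
        self._backend[self._full(key)] = value

    def get(self, key):
        return self._backend.get(self._full(key))

    def keys(self):
        # Only keys under this tenant's prefix are visible, with the prefix removed.
        n = len(self._prefix)
        return sorted(k[n:] for k in self._backend if k.startswith(self._prefix))


shared = {}  # stands in for the shared etcd keyspace (assumption)
tenant_a = PrefixedStore(shared, "/tenant-a/registry")
tenant_b = PrefixedStore(shared, "/tenant-b/registry")

# Both tenants write the "same" key without colliding.
tenant_a.put("pods/default/web", "a-spec")
tenant_b.put("pods/default/web", "b-spec")
```

Both tenants still share one backend, which is why a tenant that writes heavily can degrade the others: the prefix isolates names, not load.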

#4 about 3 minutes

Iterating on etcd cluster layouts for reliability

Initial cross-location clusters suffered from latency and revision drift, which led to a more stable layout that keeps each cluster in a single data center, spread across availability zones.

#5 about 3 minutes

A zero-downtime control plane migration strategy

A live migration process using `etcdctl make-mirror` allows moving a Kubernetes control plane to a new etcd cluster without global downtime or data loss.
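As a rough mental model of what mirroring does for one tenant, the toy below replays change events from a shared keyspace into a new store and rewrites the key prefix on the way. The `mirror` function, the event tuples, and the prefixes are hypothetical and only loosely mimic the `--prefix` and `--dest-prefix` options of `etcdctl make-mirror`; the actual tool works against live clusters.

```python
def mirror(events, dest, prefix, dest_prefix):
    """Replay (op, key, value) events under `prefix` into `dest`, rewriting the prefix."""
    applied = 0
    for op, key, value in events:
        if not key.startswith(prefix):
            continue  # belongs to another tenant, leave it alone
        new_key = dest_prefix + key[len(prefix):]
        if op == "put":
            dest[new_key] = value
        elif op == "delete":
            dest.pop(new_key, None)
        applied += 1
    return applied


# Hypothetical change stream from the shared source cluster.
source_events = [
    ("put", "/tenant-a/registry/pods/web", "v1"),
    ("put", "/tenant-b/registry/pods/db", "v1"),
    ("put", "/tenant-a/registry/pods/web", "v2"),
    ("put", "/tenant-a/registry/pods/tmp", "v1"),
    ("delete", "/tenant-a/registry/pods/tmp", None),
]

new_cluster = {}  # stands in for the destination etcd (assumption)
applied = mirror(source_events, new_cluster, "/tenant-a/registry/", "/registry/")
```

After the replay, the destination holds only tenant A's latest state. Note that the destination assigns its own revisions to these writes, so the revision counters of source and destination do not match.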

#6 about 3 minutes

Manipulating etcd revisions for seamless migration

By modifying an etcd snapshot to insert a high revision number, clients like kubelet continue watching for changes without needing a restart after migration.
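Why the revision matters can be shown with a toy model. This is a simplified sketch, not etcd's real watch implementation: `ToyStore`, the revision `5000`, and the headroom of `1000` are all hypothetical. The point is only that a client resuming from a remembered revision receives nothing from a store whose counter is still below that revision.

```python
class ToyStore:
    """Minimal key-value store with a monotonically increasing revision counter."""

    def __init__(self, start_revision=0):
        self.revision = start_revision
        self.events = []  # list of (revision, key, value)

    def put(self, key, value):
        self.revision += 1
        self.events.append((self.revision, key, value))
        return self.revision

    def events_since(self, last_seen):
        """Events a watcher would receive when resuming after `last_seen`."""
        return [e for e in self.events if e[0] > last_seen]


# Revision the client last saw on the old cluster (hypothetical value).
old_revision = 5000

# A freshly restored cluster starts counting from zero again.
fresh = ToyStore()
# A cluster restored from a snapshot whose revision was raised above the old one;
# the headroom of 1000 is an arbitrary choice for illustration.
inflated = ToyStore(start_revision=old_revision + 1000)

for store in (fresh, inflated):
    for name in ("a", "b", "c"):
        store.put("/registry/pods/" + name, "spec")

missed = fresh.events_since(old_revision)   # client resumes, sees no changes
seen = inflated.events_since(old_revision)  # client resumes, sees every change
```

In the toy, the client pointed at the fresh store silently misses all three writes, while the client pointed at the inflated store receives them without reconnecting from scratch.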

#7 about 2 minutes

Future plans for etcd management and automation

The team is working on automating the migration process, offering dedicated etcd clusters, and contributing their migration learnings to the Kamaji project.
