Senior Site Reliability Engineer
Job description
In this role you'll join our operations team for our MeetingSuite product in Munich - a flat and diverse SRE team of four engineers. It's a team where influence comes from example rather than authority. Your day-to-day is keeping our Kubernetes platforms observable, resilient and boring-to-upgrade: GitOps with Flux, multi-AZ design, zero-downtime releases, and a centralised observability story every service owner can use without calling SRE. Alongside that, you'll partner closely with our Application Security Engineer on Kubernetes and container security - with room to grow into our security champion over time - to keep the bar high for the DAX 30 and other DACH customers we serve.
If multi-cluster Kubernetes, GitOps, logging, monitoring and NoSQL database management on Kubernetes are in your vocabulary, read on.
Here's a breakdown of what you'll do (not all of it, just the important stuff):
- Operate and continuously improve our Kubernetes production platforms, contributing to zero-downtime upgrades and multi-AZ resilience as team-wide goals.
- Grow into the team's expert on our ELK-based log platform - centralised cross-cluster monitoring and anomaly detection - so every service owner can see, alert on and debug their workload without SRE hand-holding. Maintain and evolve our Prometheus alerting rules and Grafana dashboards alongside the team.
- Partner with our Application Security Engineer on Kubernetes and container security - admission control, workload identity, secrets management, network segmentation and runtime threat detection - with an interest in growing into our security champion over time.
- Love automation. Chip away at operational toil - deployments, monitoring setup, internal reporting - building on the baseline the team already has, and ship reliably through our GitOps workflow (Flux, GitLab CI).
- Participate in our Standby and Daily Business rotation, lead incident response, run blameless post-mortems and drive the resulting action items to completion.
Requirements
You're a seasoned Site Reliability Engineer with years spent running production Kubernetes at scale, and you're the kind of engineer who takes the initiative when something can be better - observability, resilience, a tricky upgrade, or the way the team thinks about security. You're looking for a role where that initiative has room to turn into real improvements on a platform that customers trust with their most confidential data.
- Several years of hands-on SRE, DevOps or Platform Engineering experience, including meaningful time running production Kubernetes at scale.
- Strong Kubernetes expertise with deep hands-on experience in at least one area - cluster lifecycle and upgrades, workload identity and RBAC, admission control, network policies, or custom resources and operators - and working familiarity with the rest.
- Solid grasp of Kubernetes and container security - secrets management, network segmentation and runtime protection - and an interest in growing into our security champion alongside our Application Security Engineer.
- Proven depth in the ELK stack (or a very similar log platform) - pipelines, indexing, dashboards, alerting - with an interest in growing into the team's observability expert. Working knowledge of Prometheus and Grafana.
- Comfortable with GitOps and CI/CD as a daily way of working (we run Flux and GitLab CI; equivalents like Argo CD, GitHub Actions or Jenkins are fine), and hands-on experience with Helm and Kustomize for managing manifests. Solid coding in Go, Python or Bash, with a love for automating away repetitive work.
- Comfortable being on-call and leading incidents calmly under pressure.
- Professional fluency in German and excellent English; at home working in a diverse team.
It would be great if you had these too, but we'll support you if you don't:
- Experience in regulated industries (financial services, legal, healthcare, defence) or under compliance frameworks such as ISO 27001 or C5.
- Track record of designing or contributing to custom Kubernetes Operators.
- Service-mesh experience (Istio, Linkerd, Cilium).
- A demonstrated interest in working shoulder-to-shoulder with AppSec engineers to raise platform security posture.
- Experience operating Couchbase (Couchbase Operator, server groups, XDCR) or another stateful data platform on Kubernetes.
- Experience migrating ingress controllers or other cluster-wide components with zero customer downtime.
- Experience with anomaly detection on platform telemetry.