Cloud Reliability Engineer (SRE)

AOK Systems GmbH

Bonn, Germany

yesterday

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English, German

Job location

Remote

Tech stack

Java

Artificial Intelligence

Amazon Web Services (AWS)

Cloud Computing

Databases

Continuous Integration

Data Stores

Software Debugging

DevOps

Distributed Systems

Amazon DynamoDB

JSON

Python

MongoDB

Reliability Engineering

Prometheus

Software Engineering

Web Services

Load Balancing

Large Language Models

Grafana

Adobe

Kubernetes

Information Technology

Cassandra

Terraform

Splunk

New Relic (SaaS)

Docker

Microservices

Job description

Improve the reliability, scalability, performance, security, and cost-efficiency of the platform's microservices running on Kubernetes and AWS.
Build and maintain strong observability using metrics, logs, traces, dashboards, and meaningful alerting. Use monitoring solutions like Prometheus, New Relic, Grafana, and Splunk. This helps us detect and understand issues before customers do.
Own infrastructure-as-code and automated delivery with Terraform, Kubernetes, Helm, ArgoCD, and CI/CD pipelines - keeping infrastructure across AWS repeatable, consistent, reviewable, and auditable.
Drive down toil with AI-assisted and agentic automation - auto-remediation, self-healing workflows, and LLM-generated runbooks and IaC - rather than hand-crafting one-off scripts, so the team's effort compounds.
Help grow a shared automation platform that tackles auto-remediation, self-healing workflows, and infrastructure-as-code - where AI accelerates the build, and every contribution compounds the team's capability.
Partner with engineering teams, for example to forecast capacity from usage trends or to adopt new technologies, so the platform scales to meet growing demand.
Contribute to the security and compliance posture of the platform, partnering with collaborators on controls, evidence, and audit readiness throughout daily reliability work.
Help set the bar for how the team uses AI in operations - choosing where agentic and LLM-assisted tooling adds real leverage, and where human judgment must stay in the loop.
Participate in a healthy, sustainable on-call rotation, and help continuously improve our runbooks and operational practices.
Collaborate across Adobe's global Reliability organization to advance the shared mission of "delivering better software faster."

Requirements

We don't expect any single person to check every box. If you bring most of the core skills below and are excited about the rest, we'd love to hear from you.

Several years of professional experience operating, scaling, or building distributed systems in production (SRE, DevOps, platform, or backend engineering backgrounds all welcome).
Hands-on production experience with AWS and with container orchestration on Kubernetes (plus tooling like Docker, Helm, and ArgoCD).
Practical experience with infrastructure-as-code, ideally Terraform, and with modern GitOps-based CI/CD workflows.
Experience with monitoring and observability solutions - for example Prometheus, New Relic, Grafana, or Splunk.
A modern, AI-forward mindset: you reach for agentic and LLM-assisted tooling to do the work, and you have the judgment to know where it accelerates you and where humans must stay in the loop.
Enough software development experience to read, debug, and contribute to services, automation, and tooling. The services we support are largely Java/Spring, so comfort reading and debugging Java is valuable; our own toolset uses Python and Golang, and Python is a strong advantage for automation and tooling.
Working knowledge of web services and supporting technologies including HTTP, JSON, REST, and service-to-service networking (e.g. proxies, load balancers, service meshes).
Exposure to the data stores behind these services, such as MongoDB, Cassandra, or DynamoDB, is helpful, as Reliability Engineering manages them together with our Database Reliability team.
Strong communication and collaboration skills, and a genuine commitment to teamwork, shared ownership, and continuous improvement.
Professional working proficiency in English. German is a plus, given our Hamburg base, but not required.
A Bachelor's degree or higher in Computer Science, a related field, or equivalent experience. We value demonstrated ability over specific credentials.

About the company

AOK Systems is one of the leading IT partners for social insurance in Germany. We develop, implement, and maintain oscare®, the SAP-based, standardized, and fully integrated industry solution for statutory health insurance (GKV) and our core IT product for the GKV.

As specialists in integrated, full-service IT with a focus on statutory health and long-term care insurance, we orchestrate strong solutions for customer communities and enable individual digitalization strategies, all in support of a strong GKV that puts people and their health at the center.
