Cloud Reliability Engineer (SRE)

AOK Systems GmbH
Bonn, Germany
yesterday

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English, German

Job location

Remote

Tech stack

Java
Artificial Intelligence
Amazon Web Services (AWS)
Cloud Computing
Databases
Continuous Integration
Data Stores
Software Debugging
DevOps
Distributed Systems
Amazon DynamoDB
JSON
Python
MongoDB
Reliability Engineering
Prometheus
Software Engineering
Web Services
Load Balancing
Large Language Models
Grafana
Adobe
Kubernetes
Information Technology
Cassandra
Terraform
Splunk
New Relic (SaaS)
Docker
Go
Microservices

Job description

  • Improve the reliability, scalability, performance, security, and cost-efficiency of the platform's microservices running on Kubernetes and AWS.
  • Build and maintain strong observability using metrics, logs, traces, dashboards, and meaningful alerting. Use monitoring solutions like Prometheus, New Relic, Grafana, and Splunk. This helps us detect and understand issues before customers do.
  • Own infrastructure-as-code and automated delivery with Terraform, Kubernetes, Helm, ArgoCD, and CI/CD pipelines - keeping infrastructure across AWS repeatable, consistent, reviewable, and auditable.
  • Drive down toil with AI-assisted and agentic automation - auto-remediation, self-healing workflows, and LLM-generated runbooks and IaC - rather than hand-crafting one-off scripts, so the team's effort compounds.
  • Help grow a shared automation platform that tackles auto-remediation, self-healing workflows, and infrastructure-as-code - where AI accelerates the build, and every contribution compounds the team's capability.
  • Partner with engineering teams, for example to forecast capacity from usage trends or to adopt new technologies, so that the platform scales to meet growing demand.
  • Contribute to the security and compliance posture of the platform, partnering with collaborators on controls, evidence, and audit readiness throughout daily reliability work.
  • Help set the bar for how the team uses AI in operations - choosing where agentic and LLM-assisted tooling adds real leverage, and where human judgment must stay in the loop.
  • Participate in a healthy, sustainable on-call rotation, and help continuously improve our runbooks and operational practices.
  • Collaborate across Adobe's global Reliability organization to advance the shared mission of "delivering better software faster."

Requirements

We don't expect any single person to check every box. If you bring most of the core skills below and are excited about the rest, we'd love to hear from you.

  • Several years of professional experience operating, scaling, or building distributed systems in production (SRE, DevOps, platform, or backend engineering backgrounds all welcome).
  • Hands-on production experience with AWS and with container orchestration on Kubernetes (plus tooling like Docker, Helm, and ArgoCD).
  • Practical experience with infrastructure-as-code, ideally Terraform, and with modern GitOps-based CI/CD workflows.
  • Experience with monitoring and observability solutions - for example Prometheus, New Relic, Grafana, or Splunk.
  • A modern, AI-forward mindset: you reach for agentic and LLM-assisted tooling to do the work, and you have the judgment to know where it accelerates you and where humans must stay in the loop.
  • Enough software development experience to read, debug, and contribute to services, automation, and tooling. The services we support are largely Java/Spring, so comfort reading and debugging Java is valuable. Python and Go, which we use for our own toolset, are a strong advantage for automation and tooling.
  • Working knowledge of web services and supporting technologies including HTTP, JSON, REST, and service-to-service networking (e.g. proxies, load balancers, service meshes).
  • Exposure to the data stores that enable these services such as MongoDB, Cassandra, or DynamoDB is helpful, as Reliability Engineering manages these together with our Database Reliability team.
  • Strong communication and collaboration skills, and a genuine commitment to teamwork, shared ownership, and continuous improvement.
  • Professional working proficiency in English. German is a plus, given our Hamburg base, but not required.
  • A Bachelor's degree or higher in Computer Science, a related field, or equivalent experience. We value demonstrated ability over specific credentials.

About the company

AOK Systems is one of the leading IT partners for social insurance in Germany. We develop, implement, and maintain oscare®, the SAP-based, standardized, and fully integrated industry solution for statutory health insurance (GKV), which is the core of our IT offering for the GKV.

As specialists in integrated, end-to-end IT services with a focus on statutory health and long-term care insurance, we orchestrate strong solutions for customer communities and enable individual digitalization strategies, working toward a strong GKV that puts people and their health at the center.
