System Reliability Engineer

arculus GmbH

München, Germany

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Shift work

Languages

English

Experience level

Intermediate

Job location

München, Germany

Tech stack

Amazon Web Services (AWS)

Azure

Bash

DevOps

Distributed Systems

Python

Reliability Engineering

Prometheus

Robotic Automation Software

Software Engineering

Google Cloud Platform

System Availability

Grafana

Reliability of Systems

Kubernetes

Deployment Automation

ELK

Job description

As a System Reliability Engineer, you will be responsible for ensuring the stability, performance, and scalability of our Automation Software platform. Your mission begins with a strong focus on the "Now": building robust monitoring, automation, and operational practices that keep our systems reliable under real-world conditions.

Operating at the intersection of software development and operations, you will proactively prevent incidents, optimize system behavior, and enable fast, reliable service delivery. By aligning reliability engineering with product and architectural goals, you will ensure our systems meet critical KPIs such as uptime, latency, and deployment velocity across the entire lifecycle.

Your Tasks & Responsibilities

Design and operate monitoring, alerting, and incident response systems to ensure high availability
Define and manage SLIs, SLOs, and SLAs; proactively mitigate reliability, performance, and capacity risks
Automate deployments, scaling, and operational workflows; implement infrastructure as code and self-healing patterns
Optimize CI/CD pipelines for faster, safer, and more reliable releases
Lead or support incident response, root cause analysis, and post-mortems; translate findings into preventive measures
Collaborate with architects, developers, and product teams to ensure scalable, reliable system design
Review system changes for operational, performance, and reliability impact
Support capacity planning, performance benchmarking, and scaling strategies
Contribute to security monitoring and ensure secure system operations
Drive continuous improvement in observability, reliability, and operational efficiency

Requirements

3+ years in Site Reliability Engineering, DevOps, or similar roles in production environments
Proven experience improving system reliability, reducing downtime, and enhancing deployment processes
Strong expertise in cloud platforms (AWS, GCP, Azure) and Kubernetes
Hands-on experience with observability tools (Prometheus, Grafana, ELK stack)
Solid scripting and automation skills (e.g., Python, Bash)
Experience operating and scaling distributed systems in large production environments
Familiarity with CI/CD pipelines, infrastructure as code, and modern DevOps practices

Who You Are

Passionate about building reliable, scalable, and observable systems
Strong communicator, able to collaborate effectively across engineering, product, and operations teams
Proactive and solution-oriented, with a strong sense of ownership and accountability
Analytical and structured thinker with a focus on continuous improvement
Comfortable working in fast-paced, complex environments with evolving system landscapes
Motivated to ensure technical excellence translates into stable and high-performing real-world systems

About the company

At arculus, we design, build, and maintain cutting-edge autonomous mobile robots and the software ecosystem around them. Our Development department brings together software, infrastructure, and product experts in a collaborative, international environment, focused on delivering reliable and high-quality products that make a real difference in intralogistics.

Why arculus

We are a diverse, global team of 100+ creative thinkers, algorithmic brains, makers, movers, and shakers.
Our approach comes from a continuous cycle: assemble, weld, code, test, deploy or delete, and repeat. That is how we deliver innovative solutions to tackle the biggest intralogistics challenges.
You will find our tech space nestled within the eastern region of Munich. It serves as a hub for our team's creativity and collaboration, featuring state-of-the-art meeting rooms, a fully-equipped electronics lab, and a spacious robotics testing area. Our team also enjoys a variety of social spaces, all within the modern infrastructure of the renowned Neue Balan campus.
We are more than just a workplace: we are a community. We encourage connection and affiliation through a range of activities: hiking trips, running events, ping pong tournaments, and quiz nights - there is something for everyone.
We also believe that work should be rewarding in more ways than just one. That is why we offer competitive salaries and benefits like EGYM Wellpass, language courses, Jobrad, and flexible working hours.
If you are moving to join our team, we provide relocation and visa support to help make the transition as smooth as possible.

arculus is a part of Jungheinrich and independently develops high-end mobile robots and software products for intralogistics automation. From mechanics to electronics and code - our engineering powerhouse has it all. We combine the speed and creativity of an agile tech company with the strength of a leading global intralogistics player. Collaboration, innovation, and continuous learning: that is how we achieve an open-minded and fast-paced working culture.