Site Reliability Engineer

EVENTIM

Berlin, Germany

4 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Intermediate

Job location

Berlin, Germany

Tech stack

Amazon Web Services (AWS)

Bash

Cloud Computing

Configuration Management

Continuous Integration

Linux

DNS

Monitoring of Systems

Python

Performance Tuning

Reliability Engineering

Ansible

Zabbix

Load Balancing

Grafana

GitLab

Kubernetes

Deployment Automation

Terraform

Docker

Jenkins

Job description

As a Site Reliability Engineer (m/f/d), you operate and evolve Linux-based production platforms that power critical business services at scale. You focus on automation, reliability, and reducing operational overhead while enabling teams to work more independently.

Working at the intersection of infrastructure, automation, and reliability, you contribute to building resilient systems and support the evolution toward a "you build it, you run it" culture.

What to expect

Take full ownership of Linux-based production systems and keep them reliable, secure, and high-performing
Automate operational tasks (e.g., patching, provisioning, deployments) to eliminate manual effort and improve efficiency
Standardize and optimize deployment and configuration processes for scalability and consistency
Lead incident response and drive root cause analysis and long-term fixes
Manage and automate access and identity processes with a strong focus on security and auditability
Maintain and improve core Linux infrastructure services essential for platform operations
Collaborate with engineering teams to enhance observability and shared operational practices
Analyze complex systems end-to-end and simplify them to improve reliability and performance
Drive the modernization of operations towards automation, scalability, and self-service models
Adapt quickly to changing environments and deliver pragmatic, effective solutions

Requirements

5+ years of experience in Linux-based production environments
Strong expertise in Linux systems engineering, performance tuning, and lifecycle management
Strong understanding of reliability concepts (SLOs, SLAs, performance, capacity)
Solid scripting and automation skills (e.g., Bash, Python) with a continuous improvement mindset
Hands-on experience with configuration management (e.g., Salt, Ansible) and Infrastructure as Code (e.g., Terraform)
Experience with CI/CD tools (e.g., GitLab, Jenkins) and automated deployments
Good knowledge of monitoring and observability tools (e.g., Zabbix, Grafana, ELK)
Proven experience in incident management, root cause analysis, and postmortems
Experience with security practices, including patching and access control
Knowledge of core traffic services (DNS, load balancing, CDN)
Basic experience with container and cloud technologies (Docker, Kubernetes, AWS)