Site Reliability Engineer

IONOS SE

Berlin, Germany

2 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Shift work

Languages

English, German

Job location

Berlin, Germany

Tech stack

Proxmox

Amazon Web Services (AWS)

Bash

Cloud Storage

Code Review

Protocol Stack

Computer Programming

Continuous Integration

Software Debugging

Linux

DevOps

RAID

Distributed Systems

Monitoring of Systems

InfiniBand

Python

Logical Volume Manager

Octopus Deploy

OpenStack

Performance Tuning

Quick EMUlator (QEMU)

Remote Direct Memory Access

Reliability Engineering

Ansible

Prometheus

Software Engineering

Tcpdump

Virtualization Technology

Ceph

Data Logging

Scripting (Bash/Python/Go/Ruby)

SaltStack

Grafana

Kubernetes Helm Charts

Perf (Linux)

GitLab CI

Kubernetes

Build Process

Terraform

Software Version Control

Docker

ELK

Job description

We are seeking a highly skilled and experienced Site Reliability Engineer to join our team on a 24/7 shift basis. The Site Reliability Engineering L2 department operates all IONOS Cloud IaaS and PaaS services. As a Site Reliability Engineer, you will be responsible for ensuring the stability, security, and performance of our complex distributed systems. You will work closely with our development teams to design, implement, and maintain scalable and reliable infrastructure, and to automate and optimize our systems and processes.

Maintain monitoring, logging, and alerting solutions using tools such as Prometheus, Grafana, and Loki to proactively detect problems during your shift rotation and help resolve complex issues in distributed systems.
Troubleshoot network (LAN/WAN/VPN, DNS, DHCP) and storage systems (file/object/block), including provisioning and operating highly available services on Linux and on Kubernetes with Helm charts.
Maintain Infrastructure as Code, automation, and playbooks using tools such as Ansible, Terraform, GitLab CI/CD, and ArgoCD, and scripting languages such as Bash, Python, and Go.
Collaborate with development teams to improve processes and deployments, and to ensure smooth integration of new services and applications into our cloud and Kubernetes environments.
Ensure the stable and secure operation of our platforms, including end-to-end incident management, from initial analysis through resolution to follow-up in Problem Management.

Requirements

Willingness to work in a 24/7 shift model that includes nights, weekends, and public holidays, combined with a strong problem-solving and troubleshooting approach to complex technical issues.
Several years of experience as a Site Reliability Engineer or in a related role (Linux System Administrator, Platform Engineer, DevOps/Infrastructure Engineer, Full Stack Developer).
Strong experience with automation tools (e.g., Ansible, SaltStack), monitoring and observability tools (e.g., Prometheus, Grafana, Loki), and logging and alerting solutions (e.g., ELK Stack).
Strong experience with virtualized environments, including QEMU/KVM, OpenStack, and Proxmox, and with cloud storage technologies (file, object, block), as well as solid knowledge of Docker and Kubernetes (K8s).
Proficiency in at least one programming or scripting language (e.g., Go, Python, Bash) for automation and monitoring tasks.
Experience with version control is required; knowledge of merge conflicts, feature branches, merge requests, and continuous integration (CI/CD) is a plus.

Nice to have:

Experience with RDMA, InfiniBand, and RoCE protocols.
Strong experience with Linux MD RAID (mdadm, sedadm) and LVM.
Proficiency in Linux performance tuning and network stack debugging (e.g., ethtool, perf, tcpdump, ibstat, ibtop).
Experience with S3, Ceph and software-defined networks.
Experience with established software development practices, including code reviews, build processes, packaging, and testing.

Language skills: fluent German and English, at least CEFR level B2.

Benefits & conditions

Hybrid working model.
Shift working hours.
A subsidized canteen and free drinks at some locations.
Modern office space with very good transport connections.
Various employee discounts for activities and products.
Employee events such as summer and winter parties, as well as workshops.
Numerous training and development opportunities.
Various health offers, such as sports and health courses.

About the company

At IONOS, the leading European provider of cloud infrastructure, cloud services, and hosting services, you will work together with a wide range of teams. We are characterized by open structures, a friendly working culture, and flat hierarchies with a strong team spirit. We firmly believe that work and fun are compatible, and we offer you the right environment for this. Our constant growth means that we are always looking for new colleagues. Become part of IONOS and grow with us.

IONOS is the leading European digitalization partner for small and medium-sized businesses (SMBs). The company serves around six million customers and operates across 18 markets in Europe and North America, with its services accessible worldwide. With its Web Presence & Productivity portfolio, IONOS acts as a "one-stop shop" for all digitalization needs: from domains and web hosting to classic website builders and do-it-yourself solutions, and from e-commerce to online marketing tools. In addition, the company offers Cloud Solutions to enterprises that are looking to move to the cloud as their businesses evolve.