Site Reliability Engineer
Job description
We are seeking a highly skilled and experienced Site Reliability Engineer to join our team, working on a 24/7 shift basis. The Site Reliability Engineering L2 department operates all IONOS Cloud IaaS and PaaS services. As a Site Reliability Engineer, you will be responsible for ensuring the stability, security, and performance of our complex, distributed systems. You will work closely with our development teams to design, implement, and maintain scalable and reliable infrastructure, and to automate and optimize our systems and processes.
- Maintain monitoring, logging, and alerting solutions using tools such as Prometheus, Grafana, and Loki to proactively detect blockers during shift rotation, and contribute to resolving complex issues in distributed systems.
- Troubleshoot network (LAN/WAN/VPN, DNS, DHCP) and storage systems (file/object/block), including provisioning and operating highly available services on Linux and Kubernetes with Helm charts.
- Maintain Infrastructure as Code, automation, and playbooks using tools such as Ansible, Terraform, GitLab CI/CD, and ArgoCD, and scripting languages like Bash, Python, and Go.
- Collaborate with development teams to enhance processes and deployments, and to ensure smooth integration of new services and applications into our cloud and Kubernetes environment.
- Ensure the stable and secure operation of our platforms, including management of incidents end-to-end, from initial analysis to resolution and follow-up through Problem Management.
Requirements
- Willingness to work in a 24x7 shift model that includes nights, weekends, and holidays, combined with a strong problem-solving and troubleshooting approach to complex technical problems.
- Several years of experience as a Site Reliability Engineer or in a related role (Linux System Administrator, Platform Engineer, DevOps/Infrastructure Engineer, Full Stack Developer).
- Strong experience with automation tools (e.g., Ansible, SaltStack), monitoring and observability tools (e.g., Prometheus, Grafana, Loki), and logging and alerting solutions (e.g., ELK Stack).
- Strong experience with virtualized environments (including QEMU/KVM, OpenStack, and Proxmox) and cloud storage technologies (file, object, block), as well as proficiency with Docker and Kubernetes (K8s).
- Proficiency in at least one programming or scripting language (e.g., Go, Python, Bash) for automation and monitoring tasks.
- Experience with source code management is required; knowledge of merge conflicts, feature branches, merge requests, and continuous integration (CI/CD) is a plus.
Nice to have:
- Experience with RDMA, InfiniBand, and RoCE protocols.
- Strong experience with Linux MD RAID (mdadm, sedadm) and LVM.
- Proficiency in Linux performance tuning and network stack debugging (e.g., ethtool, perf, tcpdump, ibstat, ibtop).
- Experience with S3, Ceph, and software-defined networking.
- Experience with established software development practices, including code reviews, build processes, packaging, and testing.
Language skills: Fluency in German and English is required, at CEFR level B2 or higher.
Benefits & conditions
- Hybrid working model.
- Shift working hours.
- Subsidized canteen and various free drinks at some locations.
- Modern office space with very good transport connections.
- Various employee discounts for activities and products.
- Employee events such as summer and winter parties, as well as workshops.
- Numerous training and development opportunities.
- Various health offers, such as sports and health courses.