Interim Site Reliability Engineer

Michael Page International (Deutschland) GmbH

Düsseldorf, Germany

2 days ago

Role details

Contract type

Contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Compensation

€ 139K

Job location

Düsseldorf, Germany

Tech stack

API

Artificial Intelligence

Amazon Web Services (AWS)

Application Lifecycle Management

Azure

Cloud Computing

Continuous Delivery

Continuous Integration

GitHub

Identity and Access Management

Node.js

Public Key Infrastructure

Reliability Engineering

Cloud Services

Ansible

Prometheus

Software Engineering

Data Logging

Google Cloud Platform

Performance Testing

Autoscaling

System Availability

Grafana

Spring Boot

Reliability of Systems

Infrastructure as Code (IaC)

GitLab CI

Kubernetes

Puppet

Terraform

Docker

ELK

Jenkins

Programming Languages

Job description

Nice to have:

Aim42 or any other architecture improvement method
Testing Automation (Integration, Unit, Functional)
FinOps
Data & AI
Hypermedia / API / REST
Team Topologies / Macro Architecture
Mentoring / Coaching
Community Management & Developer Relations
Performance Testing
Chaos Engineering
Knowledge in Public Key Infrastructure
Identity and Access Management
ISO 25010
SLIs, SLOs, Error Budgets & SLAs
Service Management (ITIL)

Requirements

Project Overview

We are seeking an experienced Site Reliability Engineer for our client. The ideal candidate has a strong foundation in software development and has transitioned into infrastructure and operations, with a passion for scaling, automation, and reliability of cloud-native systems. Previous experience as a Site Reliability Engineer is a must-have.

Ideal Candidate Background

Software Engineering Foundation: Preferably the candidate started their career in software development, establishing a solid foundation in coding, system design, and software lifecycle management. This background provides a deep understanding of the development process and the importance of operational efficiency and system reliability.
Transition to Infrastructure and Operations: After gaining valuable experience in software engineering, the candidate transitioned into infrastructure and operations. This move was driven by an interest in scaling, automating, and improving the reliability of cloud-native applications and systems.

Cloud-Native Applications: Proficient in deploying, managing, and scaling applications in a cloud-native environment. This includes using containerization technologies like Docker and orchestrators such as Kubernetes to manage containerized applications across various environments.
Kubernetes Experience: Extensive experience with Kubernetes, including setting up clusters, deploying applications, managing stateful and stateless workloads, implementing autoscaling, and ensuring high availability. Familiarity with Kubernetes ecosystem tools (e.g., Helm, Kustomize) and practices is essential.
Hyperscaler Expertise: Strong experience with at least one major cloud services provider, preferably AWS, but also open to experience with Azure or Google Cloud Platform. This includes managing cloud resources, implementing security best practices, and leveraging cloud-native services for operational efficiency.
Infrastructure as Code (IaC): Skilled in using IaC tools such as Terraform, Ansible, Chef, or Puppet to automate the provisioning and management of infrastructure, ensuring consistency and compliance.
Continuous Integration/Continuous Deployment (CI/CD): Experienced in setting up and managing CI/CD pipelines using tools like Jenkins, GitLab CI, or GitHub Actions to automate testing and deployment processes.
Monitoring and Logging: Proficient in implementing monitoring and logging solutions (e.g., Prometheus, Grafana, ELK stack) to ensure proactive issue identification and resolution.
Programming Languages/Frameworks: Familiarity with at least one of the following: Node.js, Golang, or Java Spring Boot, for effective automation, tooling, and incident response.

Operational Skills

On-Call Duties: Willingness to participate in an on-call rotation, defined as 18/7 and, in rare cases, 24/7, with an understanding of the critical role of maintaining system reliability and performance.
Incident Management: Capable of quickly diagnosing and resolving issues, minimizing downtime, and learning from incidents to prevent future occurrences.
Cost Optimization: Ability to monitor, analyze, and optimize cloud resources for cost efficiency without compromising performance or security.

Soft Skills

Good English Communication Skills: Excellent verbal and written communication skills, capable of effectively collaborating with team members, stakeholders, and clients.
Teamwork and Collaboration: Ability to work well within a team, share knowledge, and contribute to a positive working environment.
Continuous Improvement: A strong desire for continuous learning and improvement, staying up-to-date with the latest technologies and best practices.
Problem-Solving: Strong analytical and problem-solving skills, with a proactive approach to identifying and addressing challenges.
High Adaptability: Exceptional adaptability is required for collaborating with multiple teams, quickly learning new technologies, and adjusting to changing project demands.

About the company

If you would also like to know whether the job location is LGBTQ-friendly, feel free to ask for our Pride@Page committee contact for a confidential conversation and/or take a look at this resource: https://www.iglta.org/destinations/travel-guides/lgbtq-safety-guide/. PageGroup encourages members of the LGBTQ community to apply for internal positions; while we cannot change local laws and customs, we will do everything we can to prepare you for your next role and, where appropriate, to find a location that is suitable for you and your loved ones.