Interim Site Reliability Engineer
Michael Page International (Deutschland) GmbH
Düsseldorf, Germany
2 days ago
Role details
Contract type: Contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Compensation: € 139K
Job location: Düsseldorf, Germany
Tech stack
API
Artificial Intelligence
Amazon Web Services (AWS)
Application Lifecycle Management
Azure
Cloud Computing
Continuous Delivery
Continuous Integration
GitHub
Identity and Access Management
Node.js
Public Key Infrastructure
Reliability Engineering
Cloud Services
Ansible
Prometheus
Software Engineering
Data Logging
Google Cloud Platform
Performance Testing
Autoscaling
System Availability
Grafana
Spring Boot
Reliability of Systems
Infrastructure as Code (IaC)
GitLab CI
Kubernetes
Puppet
Terraform
Docker
ELK
Jenkins
Go
Programming Languages
Job description
Nice to have:
- Aim42 or any other architecture improvement method
- Testing Automation (Integration, Unit, Functional)
- FinOps
- Data & AI
- Hypermedia / API / REST
- Team Topologies / Macro Architecture
- Mentoring / Coaching
- Community Management & Developer Relations
- Performance Testing
- Chaos Engineering
- Knowledge in Public Key Infrastructure
- Identity and Access Management
- ISO 25010
- SLIs, SLOs, Error Budgets & SLAs
- Service Management (ITIL)
Requirements
Project Overview
We are seeking an experienced Site Reliability Engineer for our client. The ideal candidate has a strong foundation in software development and has transitioned into infrastructure and operations, with a passion for scaling, automation, and reliability of cloud-native systems. Previous experience as a Site Reliability Engineer is a must.
Ideal Candidate Background
- Software Engineering Foundation: Preferably the candidate started their career in software development, establishing a solid foundation in coding, system design, and software lifecycle management. This background provides a deep understanding of the development process and the importance of operational efficiency and system reliability.
- Transition to Infrastructure and Operations: After gaining valuable experience in software engineering, the candidate transitioned into infrastructure and operations. This move was driven by an interest in scaling, automating, and improving the reliability of cloud-native applications and systems.
- Cloud-Native Applications: Proficient in deploying, managing, and scaling applications in a cloud-native environment. This includes using containerization technologies like Docker and orchestrators such as Kubernetes to manage containerized applications across various environments.
- Kubernetes Experience: Extensive experience with Kubernetes, including setting up clusters, deploying applications, managing stateful and stateless workloads, implementing autoscaling, and ensuring high availability. Familiarity with Kubernetes ecosystem tools (e.g., Helm, Kustomize) and practices is essential.
- Hyperscaler Expertise: Strong experience with at least one major cloud services provider, preferably AWS, but also open to experience with Azure or Google Cloud Platform. This includes managing cloud resources, implementing security best practices, and leveraging cloud-native services for operational efficiency.
- Infrastructure as Code (IaC): Skilled in using IaC tools such as Terraform, Ansible, Chef, or Puppet to automate the provisioning and management of infrastructure, ensuring consistency and compliance.
- Continuous Integration/Continuous Deployment (CI/CD): Experienced in setting up and managing CI/CD pipelines using tools like Jenkins, GitLab CI, or GitHub Actions to automate testing and deployment processes.
- Monitoring and Logging: Proficient in implementing monitoring and logging solutions (e.g., Prometheus, Grafana, ELK stack) to ensure proactive issue identification and resolution.
- Programming Languages/Frameworks: Familiarity with at least one of the following: Node.js, Golang, or Java Spring Boot, for effective automation, tooling, and incident response.
Operational Skills
- On-Call Duties: Willingness to participate in an on-call rotation, defined as 18/7 and, in rare cases, 24/7, with an understanding of the critical role this plays in maintaining system reliability and performance.
- Incident Management: Capable of quickly diagnosing and resolving issues, minimizing downtime, and learning from incidents to prevent future occurrences.
- Cost Optimization: Ability to monitor, analyze, and optimize cloud resources for cost efficiency without compromising performance or security.
Soft Skills
- Good English Communication Skills: Excellent verbal and written communication skills, capable of effectively collaborating with team members, stakeholders, and clients.
- Teamwork and Collaboration: Ability to work well within a team, share knowledge, and contribute to a positive working environment.
- Continuous Improvement: A strong desire for continuous learning and improvement, staying up-to-date with the latest technologies and best practices.
- Problem-Solving: Strong analytical and problem-solving skills, with a proactive approach to identifying and addressing challenges.
- High Adaptability: Exceptional adaptability is required for collaborating with multiple teams, quickly learning new technologies, and adjusting to changing project demands.
About the company
If you would also like to know whether the job location is LGBTQ-friendly, feel free to ask for our Pride@Page committee contact for a confidential conversation and/or take a look at this resource: https://www.iglta.org/destinations/travel-guides/lgbtq-safety-guide/. PageGroup encourages members of the LGBTQ community to apply for internal positions; while we cannot change local laws and customs, we will do everything we can to prepare you for your next role and, where applicable, to find a location that is suitable for you and your loved ones.