Site Reliability Engineer

Infinity Quest
2 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Compensation
£56K

Job location

Tech stack

Artificial Intelligence
Amazon Web Services (AWS)
Azure
Cloud Computing
Distributed Systems
Information Technology Operations
Python
Reliability Engineering
Ansible
Datadog
Scripting (Bash/Python/Go/Ruby)
MTTR
Containerization
Kubernetes
Information Technology
Dynatrace
Docker
Microservices

Job description

The SRE will play a pivotal role in driving the modernization of IT operations by implementing observability practices and automating toil. This position requires a deep understanding of Site Reliability Engineering (SRE) principles, modern observability tools, and automation techniques to ensure scalability, reliability, and efficiency in IT systems. It calls for a strategic thinker with hands-on expertise who can lead modernization efforts while fostering a culture of reliability and innovation.

  • Work closely with the Product Engineering team to implement strategies for modernizing IT operations, enhancing observability, and reducing toil.
  • Architect and deploy observability platforms to monitor system health, performance, and reliability effectively.
  • Propose & drive strategies for AI-driven alerting and proactive anomaly detection to reduce MTTD & MTTR.
  • Develop and enforce SRE best practices, including Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets.
  • Establish an AIOps roadmap for improving operational efficiency.
  • Lead efforts to automate repetitive tasks (toil) using scripting, orchestration tools, and AI/ML-based solutions.
  • Drive toil-automation initiatives, including automated incident response and self-healing, to move toward autonomous operations.
  • Collaborate with cross-functional teams to ensure systems are scalable, resilient, and maintainable.
  • Drive incident management and root cause analysis processes through automation, ensuring continuous improvement to enable autonomous operations.
  • Partner with engineering, architecture, and product teams to enable shift-left engineering practices that build in reliability.
  • Mentor and guide teams on adopting SRE principles and tools.
  • Advocate for a culture of reliability, automation, and continuous improvement across the organization.

Requirements

  • Strong expertise in implementing Site Reliability Engineering (SRE) principles.
  • Advanced knowledge of establishing observability using Dynatrace and Datadog (primary skills).
  • Proficiency in automation and scripting using Python and Ansible (primary skills).
  • Strong experience with the AWS and Azure cloud platforms (primary skills).
  • Solid understanding of containerization and orchestration tools such as Docker and Kubernetes.
  • Proficiency in cloud-native distributed systems and microservices architecture.
  • Exposure to AI/ML techniques for predictive analytics and automated problem resolution.
  • Familiarity with CI/CD pipelines.
