Site Reliability Engineer

Infinity Quest

2 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Compensation

£56K

Job location

Tech stack

Artificial Intelligence

Amazon Web Services (AWS)

Azure

Cloud Computing

Distributed Systems

Information Technology Operations

Python

Reliability Engineering

Ansible

Datadog

Scripting (Bash/Python/Go/Ruby)

MTTR

Containerization

Kubernetes

Information Technology

Dynatrace

Docker

Microservices

Job description

The SRE will play a pivotal role in driving the modernization of IT operations by implementing observability practices and automating toil. This position requires a deep understanding of Site Reliability Engineering (SRE) principles, modern observability tools, and automation techniques to ensure scalability, reliability, and efficiency in IT systems. It calls for a strategic thinker with hands-on expertise who can lead modernization efforts while fostering a culture of reliability and innovation.

Work closely with the Product Engineering team to implement strategies for modernizing IT operations, enhancing observability, and reducing toil.
Architect and deploy observability platforms to monitor system health, performance, and reliability effectively.
Propose and drive strategies for AI-driven alerting and proactive anomaly detection to reduce MTTD and MTTR.
Develop and enforce SRE best practices, including Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets.
Establish an AIOps roadmap for improving operational efficiency.
Lead efforts to automate repetitive tasks (toil) using scripting, orchestration tools, and AI/ML-based solutions.
Drive toil-automation initiatives, including automated incident response and self-healing, to move toward autonomous operations.
Collaborate with cross-functional teams to ensure systems are scalable, resilient, and maintainable.
Drive incident management and root cause analysis processes through automation, ensuring continuous improvement to enable autonomous operations.
Partner with engineering, architecture, and product teams to enable shift-left engineering practices that build in reliability.
Mentor and guide teams on adopting SRE principles and tools.
Advocate for a culture of reliability, automation, and continuous improvement across the organization.

Requirements

Strong expertise in implementing Site Reliability Engineering (SRE) principles.
Advanced knowledge of establishing observability with Dynatrace and Datadog (primary skills).
Proficiency in automation and scripting using Python and Ansible (primary skills).
Strong experience with the AWS and Azure cloud platforms (primary skills).
Solid understanding of containerization and orchestration tools like Docker and Kubernetes.
Proficiency in cloud-native distributed systems and microservices architecture.
Exposure to AI/ML techniques for predictive analytics and automated problem resolution.