Site Reliability Engineer
Job description
The SRE will play a pivotal role in modernizing IT operations by implementing observability practices and automating toil. The position requires a deep understanding of Site Reliability Engineering (SRE) principles, modern observability tools, and automation techniques to ensure scalability, reliability, and efficiency in IT systems. It calls for a strategic thinker with hands-on expertise who can lead modernization efforts while fostering a culture of reliability and innovation.
- Work closely with the Product Engineering team to implement strategies for modernizing IT operations, enhancing observability, and reducing toil.
- Architect and deploy observability platforms to monitor system health, performance, and reliability effectively.
- Propose & drive strategies for AI-driven alerting and proactive anomaly detection to reduce MTTD & MTTR.
- Develop and enforce SRE best practices, including Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets.
- Establish an AIOps roadmap for improving operational efficiency.
- Lead efforts to automate repetitive tasks (toil) using scripting, orchestration tools, and AI/ML-based solutions.
- Drive toil automation initiatives, including automated incident response and self-healing, to move toward autonomous operations.
- Collaborate with cross-functional teams to ensure systems are scalable, resilient, and maintainable.
- Drive incident management and root cause analysis processes through automation, ensuring continuous improvement to enable autonomous operations.
- Partner with engineering, architecture, and product teams to enable shift-left engineering practices that ensure reliability.
- Mentor and guide teams on adopting SRE principles and tools.
- Advocate for a culture of reliability, automation, and continuous improvement across the organization.
Requirements
- Strong expertise in implementing Site Reliability Engineering (SRE) principles.
- Advanced knowledge of establishing observability using Dynatrace and Datadog (primary skills).
- Proficiency in automation and scripting using Python and Ansible (primary skills).
- Strong experience with the cloud platforms AWS and Azure (primary skills).
- Solid understanding of containerization and orchestration tools like Docker and Kubernetes.
- Proficiency in cloud-native distributed systems and microservices architecture.
- Exposure to AI/ML techniques for predictive analytics and automated problem resolution.
- Familiarity with CI/CD pipelines.