System Reliability Engineer
Job description
As a System Reliability Engineer, you will be responsible for ensuring the stability, performance, and scalability of our Automation Software platform. Your first priority is the present: building robust monitoring, automation, and operational practices that keep our systems reliable under real-world conditions.
Operating at the intersection of software development and operations, you will proactively prevent incidents, optimize system behavior, and enable fast, reliable service delivery. By aligning reliability engineering with product and architectural goals, you will ensure our systems meet critical KPIs such as uptime, latency, and deployment velocity across the entire lifecycle.
Your Tasks & Responsibilities
- Design and operate monitoring, alerting, and incident response systems to ensure high availability
- Define and manage SLIs, SLOs, and SLAs; proactively mitigate reliability, performance, and capacity risks
- Automate deployments, scaling, and operational workflows; implement infrastructure as code and self-healing patterns
- Optimize CI/CD pipelines for faster, safer, and more reliable releases
- Lead or support incident response, root cause analysis, and post-mortems; translate findings into preventive measures
- Collaborate with architects, developers, and product teams to ensure scalable, reliable system design
- Review system changes for operational, performance, and reliability impact
- Support capacity planning, performance benchmarking, and scaling strategies
- Contribute to security monitoring and ensure secure system operations
- Drive continuous improvement in observability, reliability, and operational efficiency
Requirements
- 3+ years of experience in Site Reliability Engineering, DevOps, or similar roles in production environments
- Proven experience improving system reliability, reducing downtime, and enhancing deployment processes
- Strong expertise in cloud platforms (AWS, GCP, Azure) and Kubernetes
- Hands-on experience with observability tools (Prometheus, Grafana, ELK stack)
- Solid scripting and automation skills (e.g., Python, Bash)
- Experience operating and scaling distributed systems in large production environments
- Familiarity with CI/CD pipelines, infrastructure as code, and modern DevOps practices
Who You Are
- Passionate about building reliable, scalable, and observable systems
- Strong communicator, able to collaborate effectively across engineering, product, and operations teams
- Proactive and solution-oriented, with a strong sense of ownership and accountability
- Analytical and structured thinker with a focus on continuous improvement
- Comfortable working in fast-paced, complex environments with evolving system landscapes
- Motivated to ensure that technical excellence translates into stable, high-performing real-world systems