Senior Site Reliability Engineer
Job description
As a Site Reliability Engineer (SRE), you will be responsible for the reliability, performance, and availability of critical systems, applications, and services. You will work closely with engineering teams to implement best practices for monitoring, automation, incident response, and capacity planning. Your role involves building highly available, scalable systems in hybrid environments spanning data centres, on-premise hardware, and cloud platforms. This is a multi-function role that takes a team-centric approach to ensuring service uptime and performance.
What you'll do
Systems administration: Manage core services, including observability platforms and incident management systems, reduce manual toil through automation, and ensure the seamless operation of critical infrastructure platforms.
A key aspect of this role involves building and maintaining observability tooling, with a focus on the monitoring stack. You will help design and operate a reliable observability infrastructure in a Linux environment using open-source tools such as Prometheus, Grafana, Alertmanager, Loki, and related services. Your work will ensure systems are instrumented for detailed visibility, enabling high availability and actionable insights across distributed environments, and providing predictive monitoring and alerting for internal engineering and operations teams.
Throughout the development lifecycle, you will encourage a proactive SRE culture where errors are identified early and systems are continuously improved. You will champion accountability and shared responsibility across production infrastructure, at all key touch points and at scale.
- Build and support a multi-site, infrastructure-based monitoring stack, including components such as Prometheus, Grafana, Alertmanager, Loki, and Cortex/Mimir, with seamless scalability across physical and virtual systems and software stacks.
- Develop automation scripts and infrastructure-as-code templates for on-prem and hybrid environments to improve operational efficiency and day-to-day infrastructure management.
- Collaborate closely with distributed teams to establish and maintain SLIs/SLOs for critical services, and ensure systems are observable, performant, and meet defined SLAs and availability targets.
- Perform incident response and maintain the alerting pipeline for infrastructure, applications, and services, including integration with remote storage backends and custom metrics exporters.
- Contribute to internal resources, analysis, and continuous improvement by conducting postmortems and fostering a blameless culture.
- Develop documentation, guides, runbooks, and best practices for SRE and operational engineering.
Requirements
- Strong experience with Linux systems administration and infrastructure automation (e.g., Ansible, Terraform).
- Proven background in building and maintaining SRE systems in production-grade environments.
- Hands-on experience operating and scaling Prometheus-based monitoring solutions in distributed, multi-tenant environments (including Thanos, Grafana and components like Cortex/Mimir).
- Solid understanding of networking fundamentals, hardware infrastructure, and managing multiple data centre environments.
- Demonstrated scripting and/or development skills in at least one language (e.g., Python, Bash), with a bias towards automating and improving operational workflows.
- Strong knowledge of SNMP, IPMI, and other data centre/hardware protocols.
- Competence in metrics- and log-based observability platforms and tooling aligned with cloud-native and distributed architectures, including Prometheus, Loki, and cloud tooling, with an observability-first mindset.
- Familiarity with incident response, root cause analysis, and driving technical postmortems.
- Strong grasp of observability principles, including metrics, logging, and tracing, with a focus on SLA/SLO delivery and improvement.
- Exposure to remote write solutions and remote storage backends such as Cortex or Mimir, and comfort with CNCF pipelines and modern observability strategies.
- Familiarity with hardware lifecycle management and tools for managing bare-metal environments.
Benefits & conditions
- Highly competitive package (base + equity) with reviews every 12 months.
- Join the fastest-growing tech startup: this is your chance to push boundaries, collaborate with brilliant minds, and make your mark on cutting-edge AI.
- Expect a dynamic progression plan tailored to your ambitions. Grow by trying new things, leading, challenging the status quo, and owning your impact, always with our full support.
- Human-First Flexibility: We treat you as humans first. Our flexible workplace trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.
- Join our thriving remote-first team. Geography is no barrier to impact or connection. We build seamless virtual collaboration, empowering you wherever you work.