Senior Site Reliability Engineer

Nscale Ltd.

Charing Cross, United Kingdom

2 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Charing Cross, United Kingdom

Tech stack

ARM

Automation of Tests

Intelligent Platform Management Interface

Bash

Data Centers

Linux

Distributed Systems

Python

Open Source Technology

Reliability Engineering

Ansible

Prometheus

Simple Network Management Protocol

Data Logging

System Availability

Grafana

Hardware Infrastructure

Terraform

Job description

As a Site Reliability Engineer (SRE), you will be responsible for the reliability, performance, and availability of critical systems, applications, and services. You will work closely with engineering teams to implement best practices for monitoring, automation, incident response, and capacity planning. Your role involves building highly available, scalable systems across hybrid environments spanning data centres, on-premise hardware, and cloud platforms. In this multi-function role, you will take a team-centric approach to ensuring service uptime and performance.

What you'll do

Systems administration: Manage core services, including observability platforms and incident management systems, reduce manual toil through automation, and ensure the seamless operation of critical infrastructure platforms.

A key aspect of this role involves building and maintaining observability tooling, with a focus on the monitoring stack. You will help design and operate a reliable observability infrastructure in a Linux environment using open-source tools such as Prometheus, Grafana, Alertmanager, Loki, and related services. Your work will ensure systems are instrumented for detailed visibility, enabling high availability and actionable insights across distributed environments, with predictive monitoring and alerting for internal engineering and operational layers.

Throughout the development lifecycle, you will encourage a proactive SRE culture where errors are identified early and systems are continuously improved. You will champion accountability and a shared sense of responsibility for production infrastructure at all key touch points, at scale.

Build and support a monitoring stack for multi-site infrastructure, including components such as Prometheus, Grafana, Alertmanager, Loki, and Cortex/Mimir, with seamless scalability across physical and virtual systems and software stacks.
Develop automation scripts and infrastructure-as-code templates for on-prem and hybrid environments, improving operational efficiency and day-to-day infrastructure management.
Collaborate closely with distributed teams to establish and maintain SLIs/SLOs for critical services and ensure systems are observable, performant, and meet availability targets.
Perform incident response and maintain the alerting pipeline for infrastructure, applications, and services, including integration with remote storage backends and custom metrics exporters.
Contribute to internal resources, internal analysis, and continuous improvement by conducting postmortems and fostering a blameless culture.
Develop documentation, guides, runbooks, and best practices for SRE and operational engineering.

Requirements

Strong experience with Linux systems administration and infrastructure automation (e.g., Ansible, Terraform).

Proven background in building and maintaining SRE systems in production-grade environments.
Hands-on experience operating and scaling Prometheus-based monitoring solutions in distributed, multi-tenant environments (including Thanos, Grafana and components like Cortex/Mimir).
Solid understanding of networking fundamentals, hardware infrastructure, and managing multiple data centre environments.
Demonstrated scripting and/or development skills in at least one language (e.g., Python, Bash), with a bias towards automating and improving operational workflows.
Strong knowledge of SNMP, IPMI, and other datacenter/hardware protocols.
Competence in metrics- and log-based observability platforms and tooling aligned with cloud-native and distributed architectures, including Prometheus, Loki, and cloud tooling, with an observability-first mindset.
Familiarity with incident response, root cause analysis, and driving technical postmortems.
Strong grasp of observability principles, including metrics, logging, and tracing, with a focus on SLA/SLO delivery and improvement.
Exposure to remote write solutions and remote storage backends such as Cortex or Mimir, and comfort with CNCF pipelines and modern observability strategies.
Familiarity with hardware lifecycle management and tools for managing bare-metal environments.

Benefits & conditions

Highly competitive package (base + equity) with reviews every 12 months.
Join the fastest-growing tech startup. This is your chance to push boundaries, collaborate with brilliant minds, and make your mark on cutting-edge AI.
Expect a dynamic progression plan tailored to your ambitions. Grow by trying new things, leading, challenging the status quo, and owning your impact, always with our full support.
Human-First Flexibility: We treat you as humans first. Our flexible workplace trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.

Join our thriving remote-first team. Geography is no barrier to impact or connection. We build seamless virtual collaboration, empowering you, wherever you work.

About the company

Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility. At Nscale, our software engineers form the backbone of our product offering. We build state-of-the-art AI products, allowing our clients to move quickly in an increasingly competitive digital landscape. We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you'll build trust through openness and transparency, where everyone is inspired to do their best work. If you join our team, you'll be contributing to building the technology that powers the future.

At Nscale, we are committed to fostering an inclusive, diverse, and equitable workplace. We believe that a variety of perspectives enriches our work environment, and we encourage applications from candidates of all backgrounds, experiences, and abilities. We strongly encourage applications from people of colour, the LGBTQ+ community, people with disabilities, neurodivergent people, parents, carers, and people from lower socio-economic backgrounds.