Senior Site Reliability Engineer

Realm

Charing Cross, United Kingdom

9 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Charing Cross, United Kingdom

Tech stack

Artificial Intelligence

Computing Platforms

Bash

Big Data

Computer Programming

Data Centers

DevOps

Distributed Systems

Python

Machine Learning

Open Source Technology

Reliability Engineering

Software Systems

Scripting (Bash/Python/Go/Ruby)

Computer Networking Systems

High Performance Computing

Computer Network Technologies

Deep Learning

Reliability of Systems

Kubernetes

Infrastructure Automation Frameworks

Bare Metal

Job description

High-growth infrastructure company focused on delivering large-scale compute, data centre capacity, and power solutions for advanced machine learning workloads. Platforms support leading research and industry teams requiring high-performance computing at significant scale. Fast-paced environment with emphasis on ownership, execution speed, and quality. Culture centred on pragmatic problem-solving, cross-functional collaboration, and full lifecycle responsibility.

Position operating across software, infrastructure, and operations to ensure reliability, scalability, and performance of a globally distributed compute platform.

Close collaboration with networking, platform engineering, and physical infrastructure teams to design and operate systems supporting high-demand computational workloads.
Hands-on engineering role requiring strong systems expertise, with responsibility for resolving complex production issues, improving system resilience, and enhancing platform observability.

Responsibilities

Deployment and management of large-scale compute clusters using automation tooling, with adaptation to customer requirements
Validation and optimisation of compute, storage, and networking systems in coordination with internal teams and vendors
Execution of large-scale data migrations between cloud and on-premise environments with focus on efficiency and cost
Troubleshooting across the full stack, including hardware, networking, and distributed systems
Development of internal tooling and automation to improve deployment speed, reliability, and operational efficiency

Participation in an on-call rotation required (approximately one week per month).

Requirements

Strong ownership mindset with focus on delivery and accountability
Experience building maintainable, well-documented systems in complex environments
Ability to operate effectively in ambiguous and rapidly evolving contexts
Clear and effective communication skills with collaborative, low-ego approach
5+ years of experience in site reliability engineering, DevOps, systems administration, or high-performance computing
Strong written and verbal communication skills in English
Experience deploying and operating container orchestration or workload scheduling systems (e.g. Kubernetes or similar)
Programming or scripting experience in Go, Python, or Bash
Familiarity with infrastructure automation and infrastructure-as-code tools
Strong technical foundation in computing or related discipline

Preferred Experience

Experience operating large-scale machine learning or AI-compute workloads
Background in multi-tenant distributed systems at scale
Hands-on experience with data centre or bare-metal infrastructure
Knowledge of high-performance networking technologies
Experience managing large-scale storage systems (commercial or open-source)

Benefits & conditions

Competitive salary and equity package
Retirement or pension contributions aligned with local standards
Health coverage including medical, dental, and vision
Generous paid time off policy
