Senior Site Reliability Engineer

Realm
Charing Cross, United Kingdom
9 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Charing Cross, United Kingdom

Tech stack

Artificial Intelligence
Computing Platforms
Bash
Big Data
Computer Programming
Data Centers
DevOps
Distributed Systems
Python
Machine Learning
Open Source Technology
Reliability Engineering
Software Systems
Scripting (Bash/Python/Go/Ruby)
Computer Networking Systems
High Performance Computing
Computer Network Technologies
Deep Learning
Reliability of Systems
Kubernetes
Infrastructure Automation Frameworks
Bare Metal

Job description

High-growth infrastructure company focused on delivering large-scale compute, data centre capacity, and power solutions for advanced machine learning workloads. Platforms support leading research and industry teams requiring high-performance computing at significant scale. Fast-paced environment with emphasis on ownership, execution speed, and quality. Culture centred on pragmatic problem-solving, cross-functional collaboration, and full lifecycle responsibility.

  • Position operating across software, infrastructure, and operations to ensure reliability, scalability, and performance of a globally distributed compute platform.
  • Close collaboration with networking, platform engineering, and physical infrastructure teams to design and operate systems supporting high-demand computational workloads.
  • Hands-on engineering role requiring strong systems expertise, with responsibility for resolving complex production issues, improving system resilience, and enhancing platform observability.

Responsibilities

  • Deployment and management of large-scale compute clusters using automation tooling, with adaptation to customer requirements
  • Validation and optimisation of compute, storage, and networking systems in coordination with internal teams and vendors
  • Execution of large-scale data migrations between cloud and on-premise environments with focus on efficiency and cost
  • Troubleshooting across the full stack, including hardware, networking, and distributed systems
  • Development of internal tooling and automation to improve deployment speed, reliability, and operational efficiency

Participation in an on-call rotation required (approximately one week per month).

Requirements

  • Strong ownership mindset with focus on delivery and accountability
  • Experience building maintainable, well-documented systems in complex environments
  • Ability to operate effectively in ambiguous and rapidly evolving contexts
  • Clear and effective communication skills with collaborative, low-ego approach
  • 5+ years of experience in site reliability engineering, DevOps, systems administration, or high-performance computing
  • Strong written and verbal communication skills in English
  • Experience deploying and operating container orchestration or workload scheduling systems (e.g. Kubernetes or similar)
  • Programming or scripting experience in Go, Python, or Bash
  • Familiarity with infrastructure automation and infrastructure-as-code tools
  • Strong technical foundation in computing or related discipline

Preferred Experience

  • Experience operating large-scale machine learning or AI-compute workloads
  • Background in multi-tenant distributed systems at scale
  • Hands-on experience with data centre or bare-metal infrastructure
  • Knowledge of high-performance networking technologies
  • Experience managing large-scale storage systems (commercial or open-source)

Benefits & conditions

  • Competitive salary and equity package
  • Retirement or pension contributions aligned with local standards
  • Health coverage including medical, dental, and vision
  • Generous paid time off policy
