Network-focused Site Reliability Engineer (SRE)

La Fosse
2 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Intermediate
Compensation
£180K

Tech stack

API
Automation of Tests
Border Gateway Protocol
Complex Networks
Continuous Integration
Elasticsearch
Python
Microsoft Office
Network Architecture
Routing
Network administration
Remote Direct Memory Access
Reliability Engineering
Software Tools
Prometheus
Software Engineering
Data Streaming
TCP/IP
Computer Network Operations
Istio
Grafana
Reliability of Systems
Pytest
Kubernetes
Low Latency
Kafka
Terraform
Cisco Switches
Go

Job description

Network Site Reliability Engineer - Python/Go, Observability, Monitoring, HPC

Within the Network Engineering Team, this role is critical in ensuring our client's High-Performance Computing (HPC) environments are supported by a resilient, data-driven, and software-defined network foundation. We are seeking a network-focused Site Reliability Engineer (SRE) specializing in Observability, Telemetry, and Monitoring. In this role, you will apply a software engineering mindset to network operations, bridging the gap between traditional networking and modern Site Reliability Engineering. You will be responsible for ensuring our high-performance network infrastructure is not just functional, but deeply visible. You will build the tooling and automation that allow the team to move from reactive troubleshooting to proactive, automated remediation and "self-healing" infrastructure.

  • Reliability Engineering: Apply SRE principles to the network; define and maintain SLIs, SLOs, and Error Budgets for network latency, packet loss, and availability.
  • HPC Connectivity & Performance: Support low-latency, high-throughput network architectures (e.g., RDMA, RoCE) designed for intensive HPC and financial data workloads.
  • Advanced Telemetry: Design and manage high-cardinality telemetry pipelines to collect and analyze flow logs, metrics, and traces at scale.
  • Network Automation (Python/Go): Build and maintain internal software tools, APIs, and "self-healing" scripts to automate routine operations and complex failure recoveries.
  • Infrastructure-as-Code (IaC): Use Terraform to manage complex network configurations and observability stacks (Prometheus, Grafana, OpenSearch) as code.
  • Observability & Monitoring: Implement automated alerting and dashboarding that provide real-time insights into network health and traffic patterns.
  • Incident Management & Post-Mortems: Lead technical troubleshooting for complex outages and conduct "blameless post-mortems" to drive systemic improvements.

Requirements

  • 3+ years of experience in a Network Reliability Engineering (NRE), SRE, or Network Operations role within a high-performance environment.
  • Software Engineering Mindset: Strong proficiency in Python and Go for building automation, custom exporters, or network management tools.
  • Observability Stack Expertise: Hands-on experience with Prometheus, Grafana, OpenSearch/Elasticsearch, and distributed tracing.
  • Networking Fundamentals: Deep knowledge of TCP/IP, BGP, EVPN, and routing/switching concepts in a high-bandwidth environment.
  • Infrastructure as Code: Proven experience using Terraform to ensure scalable, repeatable, and version-controlled network deployments.
  • HPC Awareness: Familiarity with the networking requirements of high-performance computing, such as non-blocking fabrics and low-latency interconnects.

Desirable Experience

  • Streaming Telemetry: Experience with gNMI, gRPC, or Kafka for real-time network data streaming.
  • CI/CD for Networking: Familiarity with "NetDevOps" workflows, including automated testing (Pytest/Go test) and pipeline validation for network changes.
  • Container Networking: Knowledge of Kubernetes networking, CNI plugins, and Service Mesh (e.g., Istio or Cilium).
  • Traffic Engineering: Experience with segment routing or advanced load-balancing strategies for high-performance workloads.
