Network-focused Site Reliability Engineer (SRE)

La Fosse

2 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Intermediate

Compensation

£180K

Job location

Tech stack

API

Automation of Tests

Border Gateway Protocol

Complex Networks

Continuous Integration

Elasticsearch

Python

Microsoft Office

Network Architecture

Routing

Network administration

Remote Direct Memory Access

Reliability Engineering

Software Tools

Prometheus

Software Engineering

Data Streaming

TCP/IP

Computer Network Operations

Istio

Grafana

Reliability of Systems

Pytest

Kubernetes

Low Latency

Kafka

Terraform

Cisco Switches

Job description

Network Site Reliability Engineer - Python/Go, Observability, Monitoring, HPC

Within the Network Engineering Team, this role is critical in ensuring our client's High-Performance Computing (HPC) environments are supported by a resilient, data-driven, and software-defined network foundation.

We are seeking a network-focused Site Reliability Engineer (SRE) specializing in Observability, Telemetry, and Monitoring. In this role, you will apply a software engineering mindset to network operations, bridging the gap between traditional networking and modern SRE practice. You will be responsible for ensuring our high-performance network infrastructure is not just functional, but deeply visible. You will build the tooling and automation that allow the team to move from reactive troubleshooting to proactive, automated remediation and "self-healing" infrastructure.

Reliability Engineering: Apply SRE principles to the network; define and maintain SLIs, SLOs, and Error Budgets for network latency, packet loss, and availability.
HPC Connectivity & Performance: Support low-latency, high-throughput network architectures (e.g., RDMA, RoCE) designed for intensive HPC and financial data workloads.
Advanced Telemetry: Design and manage high-cardinality telemetry pipelines to collect and analyze flow logs, metrics, and traces at scale.
Network Automation (Python/Go): Build and maintain internal software tools, APIs, and "self-healing" scripts to automate routine operations and complex failure recoveries.
Infrastructure-as-Code (IaC): Use Terraform to manage complex network configurations and observability stacks (Prometheus, Grafana, OpenSearch) as code.
Observability & Monitoring: Implement automated alerting and dashboarding that provide real-time insights into network health and traffic patterns.
Incident Management & Post-Mortems: Lead technical troubleshooting for complex outages and conduct "blameless post-mortems" to drive systemic improvements.

Requirements

3+ years of experience in a Network Reliability Engineering (NRE), SRE, or Network Operations role within a high-performance environment.
Software Engineering Mindset: Strong proficiency in Python and Go for building automation, custom exporters, or network management tools.
Observability Stack Expertise: Hands-on experience with Prometheus, Grafana, OpenSearch/Elasticsearch, and distributed tracing.
Networking Fundamentals: Deep knowledge of TCP/IP, BGP, EVPN, and routing/switching concepts in a high-bandwidth environment.
Infrastructure as Code: Proven experience using Terraform to ensure scalable, repeatable, and version-controlled network deployments.
HPC Awareness: Familiarity with the networking requirements of high-performance computing, such as non-blocking fabrics and low-latency interconnects.

Desirable Experience

Streaming Telemetry: Experience with gNMI, gRPC, or Kafka for real-time network data streaming.
CI/CD for Networking: Familiarity with "NetDevOps" workflows, including automated testing (Pytest/Go test) and pipeline validation for network changes.
Container Networking: Knowledge of Kubernetes networking, CNI plugins, and Service Mesh (e.g., Istio or Cilium).
Traffic Engineering: Experience with segment routing or advanced load-balancing strategies for high-performance workloads.