Site Reliability Engineer - Core Data Services

Balyasny Asset Management LP

Charing Cross, United Kingdom

2 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Charing Cross, United Kingdom

Tech stack

Agile Methodologies

Airflow

Algorithmic Trading

Amazon Web Services (AWS)

Application Configuration Access Protocols

Application Performance Management

Bash

Cloud Computing

Information Systems

Databases

Continuous Integration

DevOps

Distributed Systems

Fault Tolerance

Python

PostgreSQL

Redis

Reliability Engineering

Prometheus

Snowflake

Grafana

Containerization

Core Data

Kubernetes

Information Technology

Kafka

Docker

Job description

We are looking for a Site Reliability Engineer who can cultivate our SRE philosophy, processes, and technologies from the ground up. This role entails driving standards and fostering adoption within our Core Data Services team, whilst closely partnering with our DevOps and Cloud teams. With a hands-on approach, you'll work across both cloud and on-premises hosting platforms, ensuring the reliability and scalability of our trading systems and production environments. This is a chance to play a pivotal role in transforming our operational capabilities and enhancing performance across a wide array of environments and platforms.

Key Responsibilities:

Develop and promote our SRE philosophy, establishing best practices and processes that will be instrumental in scaling our infrastructure.

Implement and scale end-to-end observability and monitoring solutions using Prometheus, Grafana, Loki, and Tempo, ensuring high visibility into application performance and infrastructure health.

Participate in the on-call rotation, with approximately 1 week per month of on-call time, shared equally across members of the team.

Review and define standards for application reliability requirements within our Kubernetes environment, ensuring application configuration is optimized for performance, cost, and reliability.

Develop automation and tooling to improve the efficiency and reliability of deployment pipelines, system health checks, and recovery procedures.

Collaborate with development teams to enhance service stability, scalability, and fault tolerance through SRE best practices such as blameless post-mortems and service level objectives (SLOs).

Requirements

To be considered a good fit, you must have:

5+ years of experience in SRE or similar roles within complex, distributed systems environments.

A Bachelor's degree in engineering, computer science, information systems, or equivalent experience.

Subject-matter expertise in key SRE technologies such as Prometheus, Grafana, Loki, Tempo (OTEL).

Extensive knowledge of container orchestration using Kubernetes and containerization with Docker.

Hands-on experience with both cloud (AWS preferred) and on-premises hosting platforms.

Proven ability to script in languages like Python, Bash, or Go to automate routine tasks and deployment pipelines.

Strong understanding of CI/CD principles, agile methodologies, and DevOps culture.

High level of initiative, passion for reliability engineering, detail orientation, and follow-through capabilities.

Exceptional interpersonal and communication skills, with the ability to explain complex technical concepts to a diverse audience.

Nice to Have:

Experience with Databases (PostgreSQL, Redis, Snowflake), Messaging (Kafka, Solace), and/or Orchestration (Airflow).

Don't have all the skills listed above? Have extra skills you think are important that we haven't thought of? Please let us know by applying and telling us a bit more about yourself and why you think you're qualified!