Site Reliability Engineer - Core Data Services

Balyasny Asset Management LP
Charing Cross, United Kingdom
2 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Charing Cross, United Kingdom

Tech stack

Agile Methodologies
Airflow
Algorithmic Trading
Amazon Web Services (AWS)
Application Configuration Access Protocols
Application Performance Management
Bash
Cloud Computing
Information Systems
Databases
Continuous Integration
DevOps
Distributed Systems
Fault Tolerance
Python
PostgreSQL
Redis
Reliability Engineering
Prometheus
Snowflake
Grafana
Containerization
Core Data
Kubernetes
Information Technology
Kafka
Docker

Job description

We are looking for a Site Reliability Engineer who can cultivate our SRE philosophy, processes, and technologies from the ground up. This role entails driving standards and fostering adoption within our Core Data Services team, whilst closely partnering with our DevOps and Cloud teams. With a hands-on approach, you'll work across both cloud and on-premises hosting platforms, ensuring the reliability and scalability of our trading systems and production environments. This is a chance to play a pivotal role in transforming our operational capabilities and enhancing performance across a wide array of environments and platforms.

Key Responsibilities:
Develop and promote our SRE philosophy, establishing best practices and processes that will be instrumental in scaling our infrastructure.
Implement and scale end-to-end observability and monitoring solutions using Prometheus, Grafana, Loki, and Tempo, ensuring high visibility into application performance and infrastructure health.
Participate in the on-call rotation, with approximately 1 week per month of on-call time, shared equally across members of the team.
Review and define standards for application reliability requirements within our Kubernetes environment, ensuring application configuration is optimized for performance, cost, and reliability.
Develop automation and tooling to improve the efficiency and reliability of deployment pipelines, system health checks, and recovery procedures.
Collaborate with development teams to enhance service stability, scalability, and fault tolerance through SRE best practices like blameless post-mortems and service level objectives (SLOs).

Requirements

To be considered a good fit, you must have:
5+ years of experience in SRE or similar roles within complex, distributed systems environments.
A Bachelor's degree in engineering, computer science, or information systems, or equivalent experience.
Subject-matter expertise (SME) in key SRE technologies such as Prometheus, Grafana, Loki, Tempo (OTEL).
Extensive knowledge of container orchestration using Kubernetes and containerization with Docker.
Hands-on experience with both cloud (AWS preferred) and on-premises hosting platforms.
Proven ability to script in languages like Python, Bash, or Go to automate routine tasks and deployment pipelines.
Strong understanding of CI/CD principles, agile methodologies, and DevOps culture.
High level of initiative, passion for reliability engineering, detail orientation, and follow-through capabilities.
Exceptional interpersonal and communication skills, with the ability to explain complex technical concepts to a diverse audience.

Nice to Have:
Experience with databases (PostgreSQL, Redis, Snowflake), messaging (Kafka, Solace), and/or orchestration (Airflow).

Don't have all the skills listed above? Have extra skills you think are important that we haven't thought of? Please let us know by applying and telling us a bit more about yourself and why you think you're qualified!
