Site Reliability Engineer - Core Data Services
Job description
We are looking for a Site Reliability Engineer who can cultivate our SRE philosophy, processes, and technologies from the ground up. This role entails driving standards and fostering adoption within our Core Data Services team, whilst closely partnering with our DevOps and Cloud teams. With a hands-on approach, you'll work across both cloud and on-premises hosting platforms, ensuring the reliability and scalability of our trading systems and production environments. This is a chance to play a pivotal role in transforming our operational capabilities and enhancing performance across a wide array of environments and platforms.

Key Responsibilities:
Develop and promote our SRE philosophy, establishing best practices and processes that will be instrumental in scaling our infrastructure.
Implement and scale end-to-end observability and monitoring solutions using Prometheus, Grafana, Loki, and Tempo, ensuring high visibility into application performance and infrastructure health.
Participate in the on-call rotation, with approximately 1 week per month of on-call time, shared equally across members of the team.
Review and define standards for application reliability requirements within our Kubernetes environment, ensuring application configuration is optimized for performance, cost, and reliability.
Develop automation and tooling to improve the efficiency and reliability of deployment pipelines, system health checks, and recovery procedures.
Collaborate with development teams to enhance service stability, scalability, and fault tolerance through SRE best practices like blameless post-mortems and service level objectives (SLOs).

Requirements
To be considered a good fit, you must have:
5+ years of experience in SRE or similar roles within complex, distributed systems environments.
A Bachelor's degree in engineering, computer science, information systems, or equivalent experience.
Subject-matter expertise in key SRE technologies such as Prometheus, Grafana, Loki, and Tempo (OTEL).
Extensive knowledge of container orchestration using Kubernetes and containerization with Docker.
Hands-on experience with both cloud (AWS preferred) and on-premises hosting platforms.
Proven ability to script in languages like Python, Bash, or Go to automate routine tasks and deployment pipelines.
Strong understanding of CI/CD principles, agile methodologies, and DevOps culture.
High level of initiative, passion for reliability engineering, attention to detail, and strong follow-through.
Exceptional interpersonal and communication skills, with the ability to explain complex technical concepts to a diverse audience.

Nice to Have:
Experience with databases (PostgreSQL, Redis, Snowflake), messaging (Kafka, Solace), and/or orchestration (Airflow).

Don't have all the skills listed above? Have extra skills you think are important that we haven't thought of? Please let us know by applying and telling us a bit more about yourself and why you think you're qualified!