Site Reliability Engineer - Data Platform
Job description
IMC uses cutting-edge technology to gain a competitive edge. We are also growing quickly and have plenty of complex technical challenges. We're looking for an experienced SRE with a strong Linux, automation, and distributed systems background who can help us standardize deployments, elevate observability, and scale our data platform and other critical data services. You will join our Data Platform team, part of our local data team that supports engineering teams, traders and other users with all their data needs. The team is responsible for the foundational platform that our data frameworks and tooling are built on. That includes monitoring and alerting, scalability, and support for standardized deployments.
Your Core Responsibilities:
As an SRE at IMC, you will join a small sub-team that plays a central role in all of our data needs. You will:
- Design, implement and manage our data platforms.
- Improve observability with Prometheus, Grafana and other tools.
- Develop automation processes that allow for scalability and improved reliability of internal tools and systems, supporting an automation-first culture across our data infrastructure.
- Contribute to long-term architectural improvements - not just fixing issues, but preventing them.
- Support critical services like HDFS, Kafka and Dremio.
Requirements
- Strong experience with distributed data platforms (e.g. Kafka, Hadoop, Spark), including full installation of those platforms as well as debugging and performance tuning.
- Strong experience deploying, configuring, orchestrating and operating software on Linux and Kubernetes.
- Strong experience with automation, including reading and writing Python.
- Comfort reading Java source code and tuning and debugging running JVMs.
- Comfort with infrastructure as code (Ansible preferred).
- Comfort with reading, writing and tuning SQL queries run on various query engines.
- A proactive mindset - you're not just fixing issues but preventing them.
- Comfort working across teams with minimal oversight.
Our Tech Stack:
- Infrastructure & Observability: Linux, Prometheus, Grafana, Alertmanager, Opsgenie
- Data Tools: Hadoop (HDFS), Kafka, Spark, Airflow, SQL
- Automation: Ansible, Puppet, Argo CD, Kustomize
- Containerization & Orchestration: Kubernetes
- Scripting/Automation: Python, Bash
- Others: Dremio, PCAP infrastructure