Manager - Site Reliability Engineering
Job description
You will be responsible for ensuring stability, resilience, and performance of our production systems while driving continuous improvement and SRE best practices across the platform.
What you'll be doing:
- Service Ownership
Assume end-to-end accountability for the Clearing production environment, ensuring high availability, optimal performance, and robust resilience of business-critical systems.
- Incident Management & Crisis Leadership
Act as Incident Commander during major incidents, leading resolution efforts, managing stakeholder communications, and driving root cause analysis and remediation.
- Team Leadership & Talent Development
Build and mentor a high-performing SRE team. Promote a culture of accountability, continuous improvement, and blameless postmortems to enhance operational excellence.
- Operational Excellence & SLA Compliance
Ensure adherence to response and resolution SLAs. Oversee efficient ticket management and escalation processes through ServiceNow, removing blockers promptly.
- Stakeholder Engagement & Relationship Management
Develop strong partnerships across LCH and LSEG teams. Ensure timely delivery of business-critical activities and transparent communication of risks and challenges.
- Process Optimisation & Continuous Improvement
Monitor and analyse technical processes to identify improvement opportunities. Implement enhancements to minimise business disruption and improve operational efficiency.
- Risk Management & Compliance
Ensure compliance with regulatory standards and internal governance. Proactively identify and mitigate operational risks.
- Metrics & Observability
Establish and maintain robust observability practices, employing metrics, logging, and tracing to drive data-driven decisions and improve system health.
- Out-of-Hours & On-Call Support
Be available for overnight support of production services to ensure successful completion of processing. Respond to overnight calls and resolve issues. Participate in Disaster Recovery exercises.
Requirements
- Degree educated or equivalent work experience.
- A number of years in Production Support / SRE roles, with at least 3 years in a leadership capacity.
- Deep technical expertise in Oracle database - troubleshooting, scalability, performance tuning and optimization.
- Demonstrated experience implementing SRE frameworks - including SLOs, SLIs, incident management, and chaos engineering.
- Experience leading teams supporting systems deployed across mixed infrastructure (cloud and on-premises, AWS preferred).
- Solid understanding of change management, risk posture, and production readiness.
- Strong track record of delivering automation at scale, reducing toil, and eliminating manual operational tasks.
- Excellent communication and stakeholder management skills, particularly under pressure.
- Expertise in automation (Python, Shell, PowerShell etc.)
- Familiarity with observability tools and practices (metrics, logging, tracing).
- Ability to lead capacity planning and scalability strategies to support growth.
- Knowledge of clearing and settlement processes in financial markets.
- Familiarity with regulatory requirements and governance frameworks in financial services.
- Demonstrated ability to build, mentor, and retain high-performing SRE teams.
- Demonstrable experience managing SRE or Production Support teams in a critically important financial services environment.
- Experience managing teams located across multiple locations and time zones.
- Excellent analytical skills, attention to detail, and problem-solving abilities.
- Solid technical background in the core technologies with several years of experience.
- Ability to communicate clearly and concisely with IT and business teams and with senior management.
- Ability to break down complex technical issues into an easy-to-digest format.
- Familiarity with financial products and terminology.