Senior Site Reliability Engineer
Job description
Operated within the SS&C WIT business, Genesis is an all-new investment operations platform that provides extensive asset class and functional support across the front, middle, and back office. Built natively for the cloud with advanced technology, Genesis features an innovative user experience, actionable monitors, notifications, and alerts infused with AI.
- Maintain shared ownership for providing production level resilience and reliability for business-critical systems.
- Leverage industry-standard observability technologies to provide a centralized view of system and service health.
- Implement and continually improve monitoring and alerting based on harvested logs, metrics and traces.
- Lead incident response, post-incident reviews and post-remediation improvements.
- Define and establish KPIs, SLIs and SLOs in support of agreed service levels.
- Develop and maintain automation, and leverage generative AI technologies, to reduce operational toil and improve MTTD and MTTR.
- Take on support for additional technical service components as the service evolves.
- Support, mentor and train SRE Engineers.
- Work with other teams to maintain a sound knowledge of all aspects of the application technical architecture.
- Contribute to building up and maintaining a knowledge base in support of the technical role.
- Maintain an awareness of, comply with and champion the stated service controls required to achieve audit compliance.
Requirements
The role requires an in-depth knowledge of observability principles and strong experience in implementing the observability stack across infrastructure, data and application layers for real-time, compute-intensive, distributed environments. The Senior SRE will have a solid understanding of cloud platforms and container orchestration. They will have a comprehensive grasp of incident management and operational risk mitigation, and experience in implementing automation frameworks to minimize toil and reduce MTTD/MTTR. They will have proven experience in using infrastructure as code and familiarity with AI-driven operational tooling. They will be a logical thinker with strong problem-solving and communication skills and a desire to effect continuous improvement.
- Bachelor's degree in Computer Science, Software Engineering, or a related field.
- ITIL foundation level or experience working in an ITIL framework preferred.
- 4+ years of Linux OS and Windows OS systems management experience.
- 4+ years of experience with observability technologies for system monitoring and alerting (e.g. Prometheus, Grafana, Loki).
- 2+ years working in a team environment with operational responsibilities for client facing applications.
- 2+ years of experience with containerization technologies and Kubernetes.
- Proven scripting skills in at least one of: shell scripting (csh, ksh, Bash or Windows PowerShell), Ansible, Terraform or Python.
- Working experience with workload automation / enterprise scheduling tools such as Airflow.
- Working experience with, and a technical understanding of, NoSQL DBs such as MongoDB/Cassandra and traditional relational DBs such as SQL Server/Oracle/PostgreSQL.
- Working experience of a cloud self-service environment.
- Working experience of LLM or AI usage in monitoring and observability stacks.