Site Reliability Engineering (SRE) Manager
Job description
As a Software Developer: Generalist, you will design, develop, test, and deliver offerings using leading-edge and/or proven technologies. You will work in an Agile, collaborative environment to understand stakeholder requirements and contribute to the development of innovative software solutions.
Your primary responsibilities will include:
- Develop Component-Level Solutions: Design, code, and test innovative component-level software solutions, ensuring that the implemented solutions are unit tested and ready to be integrated into the product.
- Contribute to CI/CD Pipeline: Contribute to the automated CI/CD pipeline that takes code through various quality stages, ensuring seamless integration and delivery.
- Debug Customer-Reported Problems: Design, develop, and unit test code fixes for customer-reported problems, collaborating with stakeholders to resolve issues efficiently.
- Deliver Offerings: Deliver high-quality offerings using leading-edge and/or proven technologies, meeting stakeholder requirements and expectations.
- Collaborate in Agile Environment: Work collaboratively in an Agile environment to understand stakeholder requirements, aligning solutions with business needs and goals.
Requirements
- Proven experience managing or leading engineering, SRE, DevOps, or operations teams.
- Oversee implementation and automation of operational processes, infrastructure, monitoring, incident response and runbooks.
- Own end-to-end service reliability, including SLI/SLOs, capacity planning, performance optimization and operational health.
- Ensure platforms meet IBM CISO and enterprise security standards, regulatory requirements and risk policies.
- Communicate strategy, risks, operational status and metrics to leadership and stakeholders.
- Influence technology roadmaps and operational readiness for new internal solutions.
- Strong background in delivering reliable, highly available services.
- Deep understanding of security, compliance, and risk management frameworks.
- Demonstrated success driving automation of infrastructure, monitoring, and operational tasks.
- Lead, develop, and mentor a team of Site Reliability Engineers; provide coaching, career development, and performance management.
- Foster a high-performing engineering culture centered around accountability, innovation, and continuous improvement.
- Align team objectives with the strategic direction of the IBM CISO organization and broader Enterprise & Technology Services.
- Plan staffing, manage workload distribution, and ensure on-call readiness and 24/7 service support coverage.
- Excellent written and verbal communication skills with ability to influence and drive alignment across teams.
- Ability to balance support of current systems while leading modernization and future-state design.
- Experience with Release/Change Management processes.
- Ability to handle critical issues outside of business hours.
Preferred technical and professional experience
- Experience with Kubernetes, OpenShift, or similar container orchestration platforms.
- Experience building or operating Cloud-native environments (AWS, Azure, GCP, IBM Cloud), Hybrid Cloud and on-prem infrastructure environments.
- Familiarity with observability tools.
- Understanding of networking fundamentals and modern networking architectures.
- Knowledge of Infrastructure as Code (Terraform, Ansible, etc.).
- Exposure to Agile methodologies (Jira, Kanban, Scrum, etc.).
- Working knowledge of scripting/programming languages (e.g., Python).
- Professional Cloud and/or Security certifications (AWS, CISSP, etc.).