Site Reliability Engineering (SRE) Manager

IBM
yesterday

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English

Tech stack

Agile Methodologies
Amazon Web Services (AWS)
JIRA
Unit Testing
Azure
Cloud Computing
Continuous Integration
Software Debugging
DevOps
IBM Cloud Computing
Python
Network Architecture
Networking Basics
OpenShift
Performance Tuning
Scrum
Reliability Engineering
Ansible
Software Systems
Scripting (Bash/Python/Go/Ruby)
Google Cloud Platform
Grafana
Hybrid Cloud
Kubernetes
Hardware Infrastructure
Terraform
Programming Languages

Job description

As a Software Developer: Generalist, you will design, develop, test, and deliver offerings using leading-edge and/or proven technologies. You will work in an Agile, collaborative environment to understand stakeholder requirements and contribute to the development of innovative software solutions.

Your primary responsibilities will include:

  • Develop Component-Level Solutions: Design, code, and test innovative component-level software solutions, ensuring that they are unit tested and ready to be integrated into the product.

  • Contribute to CI/CD Pipeline: Contribute to the automated CI/CD pipeline that takes code through various quality stages, ensuring seamless integration and delivery.

  • Debug Customer-Reported Problems: Design, develop, and unit test code fixes for customer-reported problems, collaborating with stakeholders to resolve issues efficiently.

  • Deliver Offerings: Deliver high-quality offerings using leading-edge and/or proven technologies, meeting stakeholder requirements and expectations.

  • Collaborate in Agile Environment: Work collaboratively in an Agile environment to understand stakeholder requirements, aligning solutions with business needs and goals.

Requirements

  • Proven experience managing or leading engineering, SRE, DevOps, or operations teams.

  • Oversee implementation and automation of operational processes, infrastructure, monitoring, incident response and runbooks.

  • Own end-to-end service reliability, including SLIs/SLOs, capacity planning, performance optimization and operational health.

  • Ensure platforms meet IBM CISO and enterprise security standards, regulatory requirements and risk policies.

  • Communicate strategy, risks, operational status and metrics to leadership and stakeholders.

  • Influence technology roadmaps and operational readiness for new internal solutions.

  • Strong background in delivering reliable, highly available services.

  • Deep understanding of security, compliance, and risk management frameworks.

  • Demonstrated success driving automation of infrastructure, monitoring, and operational tasks.

  • Lead, develop, and mentor a team of Site Reliability Engineers; provide coaching, career development, and performance management.

  • Foster a high-performing engineering culture centered around accountability, innovation, and continuous improvement.

  • Align team objectives with the strategic direction of the IBM CISO organization and broader Enterprise & Technology Services.

  • Plan staffing, manage workload distribution, and ensure on-call readiness and 24/7 service support coverage.

  • Excellent written and verbal communication skills, with the ability to influence and drive alignment across teams.

  • Ability to balance support of current systems while leading modernization and future-state design.

  • Experience with Release/Change Management processes.

  • Ability to handle critical issues outside of business hours.

Preferred technical and professional experience

  • Experience with Kubernetes, OpenShift, or similar container orchestration platforms.

  • Experience building or operating Cloud-native environments (AWS, Azure, GCP, IBM Cloud), Hybrid Cloud and on-prem infrastructure environments.

  • Familiarity with observability tools.

  • Understanding of networking fundamentals and modern networking architectures.

  • Knowledge of Infrastructure as Code (Terraform, Ansible, etc.).

  • Exposure to Agile methodologies and tooling (Scrum, Kanban, Jira, etc.).

  • Working knowledge of scripting/programming languages (e.g., Python).

  • Professional Cloud and/or Security certifications (AWS, CISSP, etc.).
