Platform Operations Engineer
Job description
IBM's AI Lab is building next-generation AI platforms and services. To support this mission, we're growing our Platform Operations Team, responsible for the cloud infrastructure that powers our AI services. As a Platform Operations Engineer, you'll work across AWS, Kubernetes, and internal automation tools to ensure the platform runs smoothly, securely, and efficiently. This role suits someone who enjoys working at the intersection of software development and operations - writing code, automating infrastructure, and supporting high-performance machine learning environments.
- Deploy, manage, and monitor applications on AWS EKS (Kubernetes)
- Build and maintain Helm charts, manifests, and ArgoCD configurations
- Contribute Python code for internal tooling, automation, and services
- Manage CI/CD pipelines (e.g. Concourse, GitLab CI)
- Troubleshoot issues in networking, permissions, and application performance
- Work with development teams to streamline deployment and scaling of AI systems
- Maintain secure cloud environments through thoughtful IAM and Terraform configurations
Requirements
- Strong hands-on experience with Kubernetes (deployment, debugging, Helm)
- Intermediate to advanced Python development skills
- Familiarity with CI/CD pipelines, especially writing and debugging them
- Solid understanding of AWS services (EKS, IAM, S3)
- Confident with Linux-based environments and containerization (Docker)
Ideal (Bonus) Skills
- Experience with Helm, ArgoCD, and GitOps workflows
- Practical knowledge of Terraform for infrastructure-as-code
- Understanding of Kubernetes networking, ingress management, and certificate handling
- Exposure to OAuth/OpenID, certificates, and authentication proxies