Sr. AI Infrastructure Engineer
Job description
Lenovo is seeking a senior technical leader to guide the strategy, architecture, and delivery of our next-generation Hybrid AI Platform. In this role, you will provide leadership across AI infrastructure, MLOps, cloud-native platform engineering, and operational excellence, setting direction for teams that build and run production-grade AI/ML platforms on Kubernetes. You will drive the vision for scalable, secure, and reliable AI systems while partnering closely with engineering, product, and executive stakeholders. If you are passionate about leading high-impact AI platform initiatives, mentoring engineering talent, and shaping enterprise-wide Hybrid AI capabilities, we invite you to join us.
AI Platform Engineering & Operations
- Provide technical leadership and architectural direction for Kubernetes/OpenShift-based AI/ML platform design, scalability strategy, security posture, and operational standards.
- Oversee platform roadmap, ensuring alignment with Lenovo's broader Hybrid AI strategy and enterprise architecture principles.
- Lead engineering teams in implementing GitOps-driven, cloud-native platform automation using ArgoCD and Helm.
- Set standards for Linux systems management, platform hardening, and operational reliability across all AI infrastructure.
MLOps & Model Lifecycle Management
- Define and evolve the enterprise MLOps architecture, enabling reproducible, automated, and governed AI model workflows.
- Lead teams in building and optimizing ML pipelines using Kubeflow Pipelines, Tekton, and Python SDKs.
- Architect scalable, production-ready model serving solutions using KServe, Knative, and Triton (where applicable).
- Champion consistency in model registry usage, metadata management, workflow orchestration, and ML lifecycle governance.
Automation, Observability & Reliability
- Develop the long-term automation and platform SRE strategy, including Python/Ansible-based automation and Terraform-driven IaC patterns.
- Establish observability standards for AI/ML systems using Prometheus, Grafana, AlertManager, and related tooling.
- Oversee capacity planning, performance engineering, incident response processes, and continuous reliability improvements.
- Drive adoption of automation-first principles to reduce operational overhead and improve engineering velocity.
Cloud & Infrastructure Integration
- Own the multi-cloud and hybrid-cloud integration strategy across AWS, GCP, Azure, and on-premises environments.
- Direct the design of enterprise-grade identity and security integrations (Azure AD, LDAP, RBAC, secrets management).
- Partner with cloud, security, and networking leadership to ensure the AI platform meets enterprise compliance and governance requirements.
Collaboration & Customer Success
- Act as a senior point of technical escalation for internal teams and critical customer deployments.
- Influence cross-functional strategy across AI engineering, DevOps, data science, and product teams.
- Mentor staff engineers and up-level the team's capabilities through architectural reviews, technical coaching, and leadership by example.
- Represent the platform's strategy and progress to leadership stakeholders, ensuring alignment with business goals and customer needs.
Requirements
- Bachelor's degree in Computer Science, Engineering, or related field (Master's preferred).
- 10+ years of experience in DevOps, cloud-native platform engineering, or AI/ML platform operations, including leadership or architectural responsibility.
- Proven expertise in Kubernetes/OpenShift platform leadership, including cluster lifecycle management, operator design, advanced networking, and platform-level security.
- In-depth experience with GitOps at scale using ArgoCD, Helm, and automated cluster configuration patterns.
- Advanced knowledge of MLOps tooling (e.g., KServe, Kubeflow, Tekton, Knative) and ML workflow automation.
- Strong proficiency in Python, Bash, and automation frameworks like Ansible and Terraform.
- Deep experience with AWS, GCP, Azure, and hybrid cloud architectural patterns.
- Strong observability leadership experience with Prometheus, Grafana, and distributed system monitoring.
- Exceptional communication, stakeholder management, and cross-functional leadership skills.
- Proven track record shaping technical strategy, influencing engineering culture, and delivering complex, large-scale platforms.
Bonus Points
- Experience leading initiatives within the Red Hat OpenShift AI ecosystem.
- Knowledge of enterprise-scale LLM and model serving architectures (e.g., Triton ensemble models, OCI artifact-based LLM deployments).
- Advanced industry certifications such as CKA, CKS, GCP ACE, AWS SAA/SA Pro, or Red Hat OpenShift specializations.
- Experience guiding data engineering or AI/ML workflow orchestration teams.
- Demonstrated leadership in monorepo-based CI/CD modernization initiatives.
- Experience implementing and governing Internal Developer Portals (e.g., Backstage) across large engineering organizations.
Benefits & conditions
What we offer:
- Opportunities for career advancement and personal development
- Access to a diverse range of training programs
- Performance-based rewards that celebrate your achievements
- Flexibility with a hybrid work model (3:2) that blends home and office life
- Electric car salary sacrifice scheme
- Life insurance