AI Infrastructure and Kubernetes Platform Architect

NVIDIA Ltd.
31 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English

Job location

Remote

Tech stack

Artificial Intelligence
Bash
DevOps
InfiniBand
Python
Ansible
YAML
Scripting (Bash/Python/Go/Ruby)
Graphics Processing Unit (GPU)
Enterprise Software Applications
Kubernetes
Performance Monitor
Machine Learning Operations
Terraform
Data Pipelines
Vulnerability Analysis

Job description

We are seeking a highly skilled AI Infrastructure and Kubernetes Platform Architect with deep expertise in managing GPU-accelerated workloads on NVIDIA DGX systems. The ideal candidate will have hands-on experience with Kubernetes at the administrator, application developer, and security levels (CKA, CKAD, CKS), and will be responsible for designing, deploying, securing, and maintaining large-scale AI infrastructure powered by DGX BasePODs and SuperPODs. This role involves optimizing AI workloads, managing high-performance networking (InfiniBand), and ensuring operational excellence across NVIDIA AI systems and BlueField DPU environments.

Kubernetes Platform Architecture

  • Architect and maintain containerized AI/ML platforms using Kubernetes on DGX systems.

  • Integrate NVIDIA Base Command Manager with Kubernetes for workload scheduling and GPU resource optimization.
  • Design multi-tenant GPU resource partitioning strategies using MIG (Multi-Instance GPU) to maximize hardware utilization across concurrent AI workloads.
  • Implement and manage Helm charts, custom controllers, and GPU operators for scalable ML infrastructure.
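As an illustration of the MIG partitioning work described above, a pod can request an individual MIG slice as a named extended resource once the NVIDIA GPU Operator exposes MIG devices. This is a minimal sketch only; the exact profile name (here `nvidia.com/mig-1g.10gb`) depends on the GPU model and the MIG strategy configured in the cluster.

```yaml
# Sketch: pod requesting a single MIG slice via the NVIDIA GPU Operator.
# The resource name assumes a "mixed" MIG strategy on an A100/H100-class GPU;
# adjust the profile to match the cluster's MIG configuration.
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference-example
spec:
  restartPolicy: Never
  containers:
    - name: inference
      image: nvcr.io/nvidia/pytorch:24.01-py3
      command: ["nvidia-smi", "-L"]   # list the MIG device visible to the pod
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1   # one 1g.10gb MIG slice
```

Requesting MIG slices this way lets several isolated workloads share one physical GPU, which is the utilization gain the role is asked to deliver.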

DGX Infrastructure Administration

  • Administer and optimize NVIDIA DGX BasePODs and SuperPODs.
  • Ensure optimal GPU, CPU, and storage performance across AI clusters.
  • Leverage DGX System Administration best practices for lifecycle management and updates.
  • Coordinate capacity planning for DGX cluster expansion including rack power, cooling, and storage integration with NVIDIA AI Enterprise software stack.

High-Performance Networking & DPU

  • Deploy, monitor, and manage InfiniBand networks using Unified Fabric Manager (UFM).
  • Integrate BlueField DPUs for offloaded security, networking, and storage tasks.
  • Optimize end-to-end data pipelines from storage to GPUs.

Security and Compliance

  • Apply best practices from the CKS certification to harden Kubernetes clusters and AI workloads.
  • Implement secure service mesh and microsegmentation with BlueField DPU integration.
  • Conduct regular audits, vulnerability scanning, and security policy enforcement.

Automation & Monitoring

  • Automate deployment pipelines and infrastructure provisioning with IaC tools (Terraform, Ansible).
  • Monitor performance metrics using GPU telemetry, Prometheus/Grafana, and NVIDIA DCGM.
  • Troubleshoot and resolve complex system issues across hardware and software layers.
  • Implement MLOps workflows integrating KubeFlow Pipelines, NVIDIA Triton Inference Server, and model registry tooling to support end-to-end model training and production deployment.
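For the telemetry work above, GPU metrics scraped from the DCGM exporter can drive Prometheus alerting. The sketch below assumes the standard `DCGM_FI_DEV_GPU_UTIL` metric and its `gpu`/`Hostname` labels; the 10% threshold and 30-minute window are illustrative values, not prescribed ones.

```yaml
# Sketch: Prometheus alerting rule on DCGM GPU telemetry.
# Flags GPUs idling below 10% utilization for 30 minutes, a common
# signal for reclaiming underused capacity in a shared DGX cluster.
groups:
  - name: gpu-utilization
    rules:
      - alert: GpuUnderutilized
        expr: avg by (gpu, Hostname) (DCGM_FI_DEV_GPU_UTIL) < 10
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} below 10% utilization for 30m"
```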

Requirements

  • CKA, CKAD, CKS certifications - demonstrating full-stack Kubernetes expertise.
  • Proven experience with NVIDIA DGX systems and AI workload orchestration.
  • Hands-on expertise in InfiniBand networking, UFM, and BlueField DPU administration.
  • Strong scripting and automation skills in Python, Bash, YAML.
  • Familiarity with Base Command Manager, NVIDIA GPU Operator, and KubeFlow is a plus.
  • Ability to work across teams to support ML researchers, DevOps engineers, and infrastructure teams.

Apply for this position