AI Infrastructure Engineer
Job description
- NVIDIA GPU & Hardware Infrastructure Deployment
  - Hardware Provisioning: Rack, stack, configure, and maintain high-performance bare-metal GPU servers (e.g., NVIDIA H200, B300, or equivalent Supermicro/Dell/HGX architectures).
  - AI Software Stack: Install, update, and optimize NVIDIA Drivers, CUDA Toolkit, cuDNN, and NVIDIA Container Toolkit on physical host machines.
  - Containerization & Orchestration: Manage GPU-accelerated environments using Docker, including configuring GPU partitioning (MIG/vGPU) for optimal resource allocation.
- Network & Systems Engineering
  - High-Performance Networks: Configure and optimize InfiniBand (IB) switches and RoCE (RDMA over Converged Ethernet) to ensure ultra-low latency and maximum throughput for multi-GPU training workloads.
  - Core Infrastructure: Manage enterprise firewalls, core switches, VLANs, and local network routing to keep the data center network secure and stable.
  - Linux Administration: Oversee Linux server administration (Ubuntu, RHEL, or Rocky Linux), including automated OS provisioning and local storage clusters.
- Metering & Billing System Integration
  - Resource Metering: Implement and configure telemetry tools to accurately monitor and log GPU time, CPU utilization, storage usage, and network traffic.
  - Billing System Management: Maintain and integrate usage-based billing/metering engines to track infrastructure costs or client usage.
  - Automation: Write robust scripts (Python, Go, or Bash) to link data center resource telemetry with the billing platform for precise invoicing and automated usage reporting.
Requirements
- Do you have experience in Infrastructure as Code (IaC)?
- Experience: 3-5+ years in Network Engineering, Linux Systems Administration, or DevOps, with hands-on experience in GPU infrastructure deployment.
- Linux & Automation: Expert-level knowledge of Linux environments and infrastructure-as-code/automation tools (Ansible, Terraform, or SaltStack).
- NVIDIA Ecosystem: Deep technical understanding of the NVIDIA AI Enterprise stack (CUDA, NCCL, NVLink).
- Billing/Metering Awareness: Practical experience working with usage-based tracking, billing APIs, or internal chargeback tools.
Benefits & conditions
- Health insurance
- Paid time off
- Vision insurance
- Dental insurance