AI Infrastructure Engineer

Richtech Robotics Inc.
Las Vegas, United States of America
6 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Intermediate
Compensation
$120K

Job location

Las Vegas, United States of America

Tech stack

API
Artificial Intelligence
Systems Engineering
Bash
Ubuntu (Operating System)
Nvidia CUDA
Computer Networks
Data Centers
Linux
DevOps
Ethernet
InfiniBand
Python
Local Area Networks
Linux System Administration
Remote Direct Memory Access
Red Hat Enterprise Linux - RHEL
Ansible
Virtual Local Area Networks
AI Infrastructure
Saltstack
Offline Storage
Infrastructure as Code (IaC)
Data Center Networking
Containerization
Low Latency
Bare Metal
Hardware Infrastructure
Terraform
Docker

Job description

  1. NVIDIA GPU & Hardware Infrastructure Deployment
  • Hardware Provisioning: Rack, stack, configure, and maintain high-performance bare-metal GPU servers (e.g., NVIDIA H200, B300 or equivalent Supermicro/Dell/HGX architectures).

  • AI Software Stack: Install, update, and optimize NVIDIA Drivers, CUDA Toolkit, cuDNN, and NVIDIA Container Toolkit on physical host machines.

  • Containerization & Orchestration: Manage GPU-accelerated environments using Docker, including configuring GPU partitioning (MIG/vGPU) for optimal resource allocation.

  2. Network & Systems Engineering
  • High-Performance Networks: Configure and optimize InfiniBand (IB) switches and RoCE (RDMA over Converged Ethernet) to ensure ultra-low latency and maximum throughput for multi-GPU training workloads.

  • Core Infrastructure: Manage enterprise firewalls, core switches, VLANs, and local network routing to ensure high security and stability of the data center network.

  • Linux Administration: Oversee Linux server administration (Ubuntu, RHEL, or Rocky Linux), including automated OS provisioning and local storage clusters.

  3. Metering & Billing System Integration
  • Resource Metering: Implement and configure telemetry tools to accurately monitor and log GPU time, CPU utilization, storage usage, and network traffic.

  • Billing System Management: Maintain and integrate usage-based billing/metering engines to track infrastructure costs or client usage.

  • Automation: Write robust scripts (Python, Go, or Bash) to link data center resource telemetry with the billing platform for precise invoicing and automated usage reporting.

Requirements

  • Experience: 3-5+ years of experience in Network Engineering, Linux Systems Administration, or DevOps, with hands-on experience in GPU infrastructure deployment.

  • Linux & Automation: Expert-level knowledge of Linux environments and infrastructure-as-code/automation tools (Ansible, Terraform, or SaltStack).

  • NVIDIA Ecosystem: Deep technical understanding of the NVIDIA AI Enterprise stack (CUDA, NCCL, NVLink).

  • Billing/Metering Awareness: Practical experience working with usage-based tracking, billing APIs, or internal chargeback tools.

Benefits & conditions

  • Health insurance
  • Paid time off
  • Vision insurance
  • Dental insurance
