AI Infrastructure Engineer

Richtech Robotics Inc.

Las Vegas, United States of America

1 month ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Intermediate

Compensation

$120K

Job location

Las Vegas, United States of America

Tech stack

API

Artificial Intelligence

Systems Engineering

Bash

Ubuntu (Operating System)

Nvidia CUDA

Computer Networks

Data Centers

Linux

DevOps

Ethernet

InfiniBand

Python

Local Area Networks

Linux System Administration

Remote Direct Memory Access

Red Hat Enterprise Linux - RHEL

Ansible

Virtual Local Area Networks

AI Infrastructure

Saltstack

Offline Storage

Infrastructure as Code (IaC)

Data Center Networking

Containerization

Low Latency

Bare Metal

Hardware Infrastructure

Terraform

Docker

Job description

NVIDIA GPU & Hardware Infrastructure Deployment

Hardware Provisioning: Rack, stack, configure, and maintain high-performance bare-metal GPU servers (e.g., NVIDIA H200, B300 or equivalent Supermicro/Dell/HGX architectures).
AI Software Stack: Install, update, and optimize NVIDIA Drivers, CUDA Toolkit, cuDNN, and NVIDIA Container Toolkit on physical host machines.
Containerization & Orchestration: Manage GPU-accelerated environments using Docker, including configuring GPU partitioning (MIG/vGPU) for optimal resource allocation.

Network & Systems Engineering

High-Performance Networks: Configure and optimize InfiniBand (IB) switches and RoCE (RDMA over Converged Ethernet) to ensure ultra-low latency and maximum throughput for multi-GPU training workloads.
Core Infrastructure: Manage enterprise firewalls, core switches, VLANs, and local network routing to ensure high security and stability of the data center network.
Linux Administration: Oversee Linux server administration (Ubuntu, RHEL, or Rocky Linux), including automated OS provisioning and local storage clusters.

Metering & Billing System Integration

Resource Metering: Implement and configure telemetry tools to accurately monitor and log GPU time, CPU utilization, storage usage, and network traffic.
Billing System Management: Maintain and integrate usage-based billing/metering engines to track infrastructure costs or client usage.
Automation: Write robust scripts (Python, Go, or Bash) to link data center resource telemetry with the billing platform for precise invoicing and automated usage reporting.
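As a rough illustration of the kind of telemetry-to-billing script described above, the following minimal sketch rolls raw GPU usage samples up into per-tenant invoice lines. Everything in it is an assumption made for illustration: the `UsageSample` record, the `tenant` field, the function names, and the flat per-GPU-hour rate are hypothetical and do not reflect the company's actual systems or pricing.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageSample:
    """One hypothetical telemetry record: a tenant held N GPUs for some seconds."""
    tenant: str
    gpu_count: int
    seconds: float


def gpu_hours_by_tenant(samples):
    """Sum GPU-hours (gpu_count * seconds / 3600) per tenant."""
    totals = defaultdict(float)
    for s in samples:
        totals[s.tenant] += s.gpu_count * s.seconds / 3600.0
    return dict(totals)


def invoice_lines(samples, rate_per_gpu_hour):
    """Turn aggregated GPU-hours into invoice lines, sorted by tenant name."""
    return [
        {
            "tenant": tenant,
            "gpu_hours": round(hours, 4),
            "amount_usd": round(hours * rate_per_gpu_hour, 2),
        }
        for tenant, hours in sorted(gpu_hours_by_tenant(samples).items())
    ]


if __name__ == "__main__":
    # Illustrative data only; the rate is an assumed flat price per GPU-hour.
    demo = [
        UsageSample("team-a", 8, 1800),
        UsageSample("team-a", 2, 3600),
        UsageSample("team-b", 1, 900),
    ]
    for line in invoice_lines(demo, rate_per_gpu_hour=2.0):
        print(line)
```

In practice the samples would come from whatever telemetry exporter the data center uses, and the resulting lines would be posted to the billing platform's API rather than printed.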

Requirements

Experience: 3-5+ years of experience in Network Engineering, Linux Systems Administration, or DevOps, with hands-on experience in GPU infrastructure deployment.
Linux & Automation: Expert-level knowledge of Linux environments and infrastructure-as-code/automation tools (Ansible, Terraform, or SaltStack).
NVIDIA Ecosystem: Deep technical understanding of the NVIDIA AI Enterprise stack (CUDA, NCCL, NVLink).
Billing/Metering Awareness: Practical experience working with usage-based tracking, billing APIs, or internal chargeback tools.

Benefits & conditions

Health insurance
Paid time off
Vision insurance
Dental insurance
