Principal IT Infrastructure Engineer

Acceler8 Talent

Mountain View, United States of America

8 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Mountain View, United States of America

Tech stack

Artificial Intelligence

Amazon Web Services (AWS)

Systems Engineering

Azure

Bash

Computer Clusters

Nvidia CUDA

Data Centers

Linux

DevOps

Distributed Systems

InfiniBand

Python

Reliability Engineering

Ansible

Prometheus

AI Infrastructure

Datadog

Graphics Processing Unit (GPU)

Google Cloud Platform

High Performance Computing

Grafana

Hybrid Cloud

Kubernetes

Infrastructure Automation Frameworks

Information Technology

Bare Metal

Slurm

Hardware Infrastructure

Terraform

Splunk

Job description

This role focuses on building scalable AI/HPC infrastructure from the ground up, owning large hardware clusters, and working closely with senior technical leadership across hardware, systems, and software.

The ideal candidate will have strong experience across Linux, GPU clusters, HPC schedulers, Kubernetes, Slurm/LSF, networking, automation, observability and hybrid cloud environments. They should be comfortable operating in a fast-moving startup environment and have practical experience scaling infrastructure rather than only maintaining mature systems.

Key Responsibilities

Build, scale and operate large Linux-based infrastructure for AI/HPC workloads.
Manage GPU and compute clusters across bare metal, on-prem and cloud.
Work with Slurm, LSF, Kubernetes, or similar scheduling/orchestration tools.
Support hybrid cloud environments across AWS, Azure, GCP, or GPU cloud providers.
Automate infrastructure using Terraform, Ansible, Python, Bash, or similar.
Troubleshoot issues across Linux, GPUs, networking, storage, schedulers, containers and distributed systems.
Build monitoring and observability using Prometheus, Grafana, ELK, Datadog, Splunk, or similar.
Partner with hardware and software teams on cluster expansion, reliability, and performance.

Requirements

7+ years in Infrastructure, DevOps, SRE, HPC, Systems Engineering, or Technical Operations.
Experience building or administering large-scale GPU, AI/ML, HPC, or hardware infrastructure clusters.
Strong Linux, networking, automation, and observability experience.
Exposure to NVIDIA/AMD GPUs, InfiniBand, NVLink, CUDA, NCCL, high-memory bandwidth systems, or similar is highly desirable.
Experience in a startup, AI infrastructure, hyperscale, neocloud, semiconductor, research compute, or advanced data center environment would be a strong fit.
