Principal IT Infrastructure Engineer
Acceler8 Talent
Mountain View, United States of America
8 days ago
Role details
Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Senior
Job location: Mountain View, United States of America
Tech stack
Artificial Intelligence
Amazon Web Services (AWS)
Systems Engineering
Azure
Bash
Computer Clusters
Nvidia CUDA
Data Centers
Linux
DevOps
Distributed Systems
InfiniBand
Python
Reliability Engineering
Ansible
Prometheus
AI Infrastructure
Datadog
Graphics Processing Unit (GPU)
Google Cloud Platform
High Performance Computing
Grafana
Hybrid Cloud
Kubernetes
Infrastructure Automation Frameworks
Information Technology
Bare Metal
Slurm
Hardware Infrastructure
Terraform
Splunk
Job description
This role focuses on building scalable AI/HPC infrastructure from the ground up, owning large hardware clusters and working closely with senior technical leadership across hardware, systems and software.
The ideal candidate will have strong experience across Linux, GPU clusters, HPC schedulers, Kubernetes, Slurm/LSF, networking, automation, observability and hybrid cloud environments. They should be comfortable operating in a fast-moving startup environment and have practical experience scaling infrastructure rather than only maintaining mature systems.
Key Responsibilities
- Build, scale and operate large Linux-based infrastructure for AI/HPC workloads.
- Manage GPU and compute clusters across bare metal, on-prem and cloud.
- Work with Slurm, LSF, Kubernetes, or similar scheduling/orchestration tools.
- Support hybrid cloud environments across AWS, Azure, GCP, or GPU cloud providers.
- Automate infrastructure using Terraform, Ansible, Python, Bash, or similar.
- Troubleshoot issues across Linux, GPUs, networking, storage, schedulers, containers and distributed systems.
- Build monitoring and observability using Prometheus, Grafana, ELK, Datadog, Splunk, or similar.
- Partner with hardware and software teams on cluster expansion, reliability, and performance.
Requirements
- 7+ years in Infrastructure, DevOps, SRE, HPC, Systems Engineering, or Technical Operations.
- Experience building or administering large-scale GPU, AI/ML, HPC, or hardware infrastructure clusters.
- Strong Linux, networking, automation, and observability experience.
- Exposure to NVIDIA/AMD GPUs, InfiniBand, NVLink, CUDA, NCCL, high-memory-bandwidth systems, or similar is highly desirable.
- Experience in a startup, AI infrastructure, hyperscale, neocloud, semiconductor, research compute, or advanced data center environment would be a strong fit.