Infrastructure Engineer (GPU & Compute)
Job description
- Focus: We complete one goal at a time with care, collaborating as a team to deliver features with precision.
- Balance: Sustained performance comes from rest and recovery. We ensure a healthy work-life balance to keep you at your best.
- Craftsmanship: Innovation through excellence. Every detail matters, and we take pride in mastering our craft.
- Minimal: Simplicity drives our innovation. We eliminate complexity through discipline and focus on what truly matters.

In this role, you will own image management, system diagnostics, and validation across large-scale bare-metal compute infrastructure, with a particular focus on GPU-enabled systems. You will work at the intersection of hardware, systems, and software: developing automation, improving reliability, and enabling efficient cluster bring-up for AI/ML and HPC workloads.
You will play a key role in owning and evolving our image pipeline, running validation environments and test clusters, and supporting both system-level and GPU hardware qualification. This role is critical to ensuring that our infrastructure is consistent, performant, and ready to support demanding AI workloads from day one.

- Own and evolve systems for image management, deployment, and validation across bare-metal infrastructure
- Run and maintain test clusters used for system validation, diagnostics, and bring-up
- Validate firmware, drivers, and OS images across compute and GPU-enabled systems
- Support hardware qualification efforts for next-generation platforms

- Diagnose and resolve complex issues across GPUs, drivers, OS, and hardware layers
- Analyze system and GPU performance using tools such as NVIDIA DCGM
- Identify failure patterns and drive improvements in system stability and validation coverage

- Build and maintain automation for provisioning, validation, and system bring-up
- Develop Python-based tools and workflows to improve efficiency and reduce manual operational overhead
- Improve the reliability, repeatability, and scalability of image pipelines and validation systems

- Partner with Infrastructure, Hardware, and Data Center teams on system bring-up and validation
- Collaborate with platform and ML teams to ensure systems meet workload requirements
- Contribute to best practices for provisioning, diagnostics, and lifecycle management of infrastructure
Requirements
- 5+ years of experience in infrastructure engineering, systems engineering, or related roles
- Strong Linux systems experience in production environments
- Hands-on experience with GPU-enabled systems and tools such as NVIDIA DCGM
- Familiarity with bare-metal provisioning and system bring-up workflows
- Proficiency in Python or similar scripting/programming languages for automation
- Ability to debug complex issues across hardware, OS, GPUs, and system software

- Experience with high-performance interconnects (e.g., InfiniBand, NVLink)
- Experience with PXE boot environments, LiveCD systems, or image-based provisioning workflows
- Experience with hardware management interfaces such as iDRAC, IPMI, or Redfish
- Data center operations experience, including working with physical hardware
- Experience supporting AI/ML or HPC workloads at scale
- Experience with GPU validation frameworks or large-scale hardware qualification processes
Benefits & conditions
We offer a comprehensive and competitive benefits package designed to support our employees' health, well-being, and long-term success. Benefits may vary by location, team, and role.

- Comprehensive medical, dental, and vision coverage (U.S.); private medical and dental insurance (U.K.)
- Retirement and financial wellness support (U.S.); Pension contribution (U.K.)
- Generous paid time off, plus holidays
- Paid parental leave
- Professional development support
- Wellness and work-from-home stipends
- Flexible work environment