Director of AI Infrastructure
Job description
We are seeking a Director of AI Infrastructure to oversee the systems that power our research. This leader will be responsible for the full lifecycle of our high-performance computing (HPC) environment, which includes on-prem GPU clusters and the software orchestration layer that schedules workloads across a hybrid cloud environment.
- Cluster Management: Oversee the availability and performance of dense on-prem GPU clusters. You will partner with hardware vendors and internal teams to ensure our physical infrastructure meets the demands of frontier model training.
- Orchestration & Scheduling: Direct the strategy for Beaker, our internal orchestration platform. Your goal is to optimize job scheduling, ensuring high utilization of both on-prem assets and elastic cloud resources (AWS/GCP).
- Storage Architecture: Develop and execute a long-term roadmap for storage that balances high-throughput performance for active training with cost-effective durability for petascale research data.
- Resource Economics: Act as the primary steward of our GPU compute budget. You will make data-driven decisions on when to burst to the cloud versus when to invest in on-prem capacity.
- User Support & Velocity: Serve as the technical bridge to our research teams. You will ensure that infrastructure is an accelerator, not a bottleneck, for a diverse set of research objectives.
Note: This job description in no way states or implies that these are the only duties to be performed by the team member(s) in this position. Team members will be required to follow any other job-related instructions and to perform any other job-related duties requested by any person authorized to give instructions or assignments. All duties and responsibilities are essential functions and requirements and are subject to possible modification to reasonably accommodate individuals with disabilities. To perform this job successfully, the team member(s) will possess the skills, aptitudes, and abilities to perform each duty proficiently. Some requirements may exclude individuals who pose a direct threat or significant risk to the health or safety of themselves or others. The requirements listed in this document are the minimum levels of knowledge, skills, or abilities. This document does not create an employment contract, implied or otherwise, other than an at-will relationship.
Requirements
- Systems Expert: You have a deep understanding of the Linux kernel, container runtimes, and distributed systems. You understand the performance implications of InfiniBand topologies and NCCL optimizations.
- Strategic Thinker: You look beyond the immediate "fire" to design systems that will scale for the next 3-5 years of AI research.
- Pragmatic Leader: You are comfortable making trade-offs between technical elegance and operational necessity. You prioritize reliability and researcher velocity above all else.
- Experience: 12+ years in infrastructure, systems engineering, or HPC, with at least 5 years in a leadership role managing multi-disciplinary engineering teams.
- Bachelor's degree in a related field; a relevant advanced degree may substitute for equivalent years of technical work experience.
- GPU/HPC Stack: Direct experience managing large-scale NVIDIA GPU clusters and high-performance networking (InfiniBand/RoCE).
- Cloud Native: Strong background in Kubernetes, Slurm, or similar orchestration frameworks, particularly in hybrid-cloud configurations.
- Storage Mastery: Experience with distributed filesystems (e.g., WEKA, Ceph, Lustre) and cloud storage integration at scale.
- Software Development: Proficient in Go or Python, with the ability to review architecture and code for our internal tooling.
Physical Demands and Work Environment:
The physical demands described here are representative of those that must be met by a team member to successfully perform the essential functions of this position. Reasonable accommodations may be made to enable individuals with disabilities to perform the functions.
- Must be able to remain in a stationary position for long periods of time.
- Must be able to communicate information and ideas so that others will understand, and to exchange accurate information.
- The ability to observe details at close range.
- Must be able to work under deadlines.
Benefits & conditions
- Team members and their families are covered by medical, dental, vision, and an employee assistance program.
- Team members are able to enroll in our health savings account plan, our healthcare reimbursement arrangement plan, and our health care and dependent care flexible spending account plans.
- Team members are able to enroll in our company's 401(k) plan.
- Team members will receive $125 per month to assist with commuting or internet expenses and will also receive $200 per month for fitness and wellbeing expenses.
- Team members will also receive up to 10 sick days, up to 7 personal days, and up to 20 vacation days per year, plus 12 paid holidays throughout the calendar year.
- Team members will be able to receive annual bonuses and can participate in the long-term incentive plan.