Senior Software Engineer, AI Infrastructure
Job description
You will be a senior technical contributor responsible for ensuring that when a researcher submits a job, the software schedules it intelligently and the hardware executes it flawlessly. This involves:
- Designing for Scale: Designing and scaling our orchestration layer to ensure that the highest value workloads receive GPU time.
- Operational Excellence: Moving our HPC operations from manual intervention to high-level automation.
- Performance Engineering: Working directly with researchers to squeeze every bit of performance out of our GPU-accelerated computing environment.
Your Responsibilities:
- Full-Stack Ownership: Independently design and deliver critical systems that span the entire stack, from the Beaker job scheduler to the execution runtime.
- System Automation: Build innovative tooling and software-defined infrastructure to accelerate researcher velocity and automate cluster health management.
- Performance Optimization: Conduct root-cause analysis on complex distributed system failures and implement optimizations for distributed workloads.
- Technical Input & Ownership: Provide valuable input into the roadmap for managing large-scale HPC systems, including the deployment of compute, networking, and storage in partnership with leadership.
- Mentorship & Culture: Foster a high-performance culture by reviewing code/design docs, mentoring team members, and driving process improvements within the team.
- Collaboration: Effectively communicate and collaborate with internal research staff to share system designs, gather feedback, and support engineers on implementation tasks.
Note: This job description in no way states or implies that these are the only duties to be performed by the team member(s) of this position. Team members will be required to follow any other job-related instructions and to perform any other job-related duties requested by any person authorized to give instructions or assignments. All duties and responsibilities are essential functions and requirements and are subject to possible modification to reasonably accommodate individuals with disabilities. To perform this job successfully, the team member(s) will possess the skills, aptitudes, and abilities to perform each duty proficiently. Some requirements may exclude individuals who pose a direct threat or significant risk to the health or safety of themselves or others. The requirements listed in this document are the minimum levels of knowledge, skills, or abilities. This document does not create an employment contract, implied or otherwise, other than an at
Requirements
You are an expert systems engineer who occupies the space between high-level software orchestration and low-level system performance. You are motivated by the idea that world-class infrastructure should be a catalyst for public good, not a proprietary secret. You are as comfortable designing a resource allocation algorithm in Go as you are debugging a NCCL timeout.
You lead by example, blending the rigor of a Senior Software Engineer with the pragmatic, hands-on urgency of an HPC operator. Not only do you build systems, but you also ensure they thrive under the pressure of training world-class AI models.
- 8+ years of professional experience developing business-critical software and operating large-scale compute infrastructure. Proficiency in Go and/or Python preferred.
- Bachelor's degree in related field; relevant advanced degree may substitute for equivalent years of technical work experience.
- Linux Expertise: Expert-level knowledge of Linux internals and container runtimes like Docker.
- Distributed Systems Expertise: A proven track record of designing, debugging, and optimizing high-scale distributed systems and databases.
- Communication: Exceptional writing skills and the ability to drive consensus across diverse groups of researchers and engineers.
- A principled approach to engineering: You care about how systems are built and are excited by the unique constraints and freedoms of a non-profit research environment.
Bonus Qualifications:
- Applied experience with workload schedulers (like Kubernetes or Slurm) and high-performance networking (NCCL and InfiniBand).
- Prior experience training or fine-tuning frontier AI models.
- Deep systems administration expertise or "Site Reliability Engineering" (SRE) background in an HPC context.
- Experience contributing to open-source infrastructure or orchestration projects.
- Familiarity with on-prem storage systems like WEKA and Ceph.
Physical Demands and Work Environment:
The physical demands described here are representative of those that must be met by a team member to successfully perform the essential functions of this position. Reasonable accommodations may be made to enable individuals with disabilities to perform the functions.
- Must be able to remain in a stationary position for long periods of time.
- The ability to communicate information and ideas so others will understand. Must be able to exchange accurate information in these situations.
- The ability to observe details at close range.
- Can work under deadlines.
Benefits & conditions
The Allen Institute for Artificial Intelligence: flexible benefit account, paid holidays, sick time, 401(k)
United States, Washington, Seattle, 2157 North Northlake Way
Jun 09, 2026
Persons in these roles are expected to work from our offices in Seattle. On-site requirements vary based on position and team. If you have questions about on-site work arrangements for this role, please ask your recruiter. Our base salary range is $126,000 - $189,000, and in addition we have generous bonus plans to provide a competitive compensation package.
- Team members and their families are covered by medical, dental, vision, and an employee assistance program.
- Team members are able to enroll in our health savings account plan, our healthcare reimbursement arrangement plan, and our health care and dependent care flexible spending account plans.
- Team members are able to enroll in our company's 401k plan.
- Team members will receive $125 per month to assist with commuting or internet expenses and will also receive $200 per month for fitness and wellbeing expenses.
- Team members will also receive up to ten sick days per year, up to seven personal days per year, up to 20 vacation days per year and twelve paid holidays throughout the calendar year.
- Team members will be able to receive annual bonuses and can participate in the long-term incentive plan.