HPC / AI Software Infrastructure Lead (E)
Job description
- Lead the architecture and development of large-scale HPC and AI infrastructure supporting cutting-edge image processing and machine learning workloads
- Design scalable, high-performance distributed systems that unify traditional image processing with modern AI/Deep Learning pipelines
- Drive GPU-accelerated computing strategies, optimizing performance across compute, storage, and networking layers
- Partner cross-functionally with hardware, algorithms, and product teams to deliver robust, production-ready platforms
- Establish engineering best practices (code quality, CI/CD, observability, performance tuning) for mission-critical systems
- Mentor and develop engineers, providing technical guidance, coaching, and growth opportunities for junior team members
- Serve as a technical leader and decision-maker, influencing architecture and long-term platform strategy
- Work on real-world AI systems at scale, not just experiments
- Collaborate across hardware, software, and algorithm teams in a deeply technical environment
- Join a growing engineering presence in Ann Arbor, with access to top talent and a strong technical community
- Opportunity to shape the direction of AI infrastructure in a core product domain
Requirements
- 10+ years in software engineering, including leading and scaling technical teams
- Proven success building distributed systems in HPC, AI/ML, or cloud-native environments
- Track record of delivering performance-critical infrastructure at scale
- Experience mentoring and growing early- and mid-career engineers
Technical Expertise
- Deep understanding of distributed systems, parallel computing, and Linux systems programming
- Strong programming skills in C++, Python, or similar systems-level languages
- Experience with GPU computing (CUDA, ROCm) and modern AI frameworks (PyTorch, TensorFlow, etc.)
- Familiarity with high-performance storage systems, networking, and data pipelines
- Strong foundation in CI/CD, DevOps, and production system reliability
Bonus Experience
- Background in image processing, computer vision, or scientific computing
- Experience supporting hybrid HPC + AI workloads in production environments
Leadership & Impact
- Passion for developing talent and building inclusive, high-performing teams
- Ability to operate as both a hands-on engineer and strategic technical leader
- Strong communication skills with the ability to influence across engineering and product stakeholders
Doctorate (Academic) Degree and related work experience of 5 years; Master's Level Degree and related work experience of 8 years; Bachelor's Level Degree and related work experience of 12 years
Benefits & conditions
Base Pay Range: $151,100.00 - $256,900.00
Primary Location: USA-MI-Ann Arbor-KLA
KLA's total rewards package for employees may also include participation in performance incentive programs and eligibility for additional benefits including but not limited to: medical, dental, vision, life, and other voluntary benefits; 401(k) including company matching; employee stock purchase program (ESPP); student debt assistance; tuition reimbursement program; development and career growth opportunities and programs; financial planning benefits; wellness benefits including an employee assistance program (EAP); paid time off and paid company holidays; and family care and bonding leave.
Interns are eligible for some of the benefits listed. Our pay ranges are determined by role, level, and location. The range displayed reflects the pay for this position in the primary location identified in this posting. Actual pay depends on several factors, including state minimum wage rates, location, job-related skills, experience, and relevant education level or training. We are committed to complying with all applicable federal and state minimum wage requirements. If applicable, your recruiter can share more about the specific pay range for your preferred location during the hiring process.