HPC / AI Software Infrastructure Lead (E)

KLA-Tencor
Ann Arbor, United States of America

Role details

Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Senior
Compensation: $257K

Job location

Ann Arbor, United States of America

Tech stack

Artificial Intelligence
Computer Vision
C++
Cloud Computing
Software Quality
Nvidia CUDA
Computer Programming
Continuous Integration
Linux
DevOps
Distributed Systems
General-Purpose Computing on Graphics Processing Units
Python
Machine Learning
Performance Tuning
TensorFlow
Scientific Computing
Software Systems
AI Infrastructure
PyTorch
Deep Learning
Parallel Computation
Data Pipelines

Job description

  • Lead the architecture and development of large-scale HPC and AI infrastructure supporting cutting-edge image processing and machine learning workloads
  • Design scalable, high-performance distributed systems that unify traditional image processing with modern AI/Deep Learning pipelines
  • Drive GPU-accelerated computing strategies, optimizing performance across compute, storage, and networking layers
  • Partner cross-functionally with hardware, algorithms, and product teams to deliver robust, production-ready platforms
  • Establish engineering best practices (code quality, CI/CD, observability, performance tuning) for mission-critical systems
  • Mentor and develop engineers, providing technical guidance, coaching, and growth opportunities for junior team members
  • Serve as a technical leader and decision-maker, influencing architecture and long-term platform strategy
  • Work on real-world AI systems at scale, not just experiments
  • Collaborate across hardware, software, and algorithm teams in a deeply technical environment
  • Join a growing engineering presence in Ann Arbor, with access to top talent and a strong technical community
  • Opportunity to shape the direction of AI infrastructure in a core product domain

Requirements

  • 10+ years in software engineering, including leading and scaling technical teams
  • Proven success building distributed systems in HPC, AI/ML, or cloud-native environments
  • Track record of delivering performance-critical infrastructure at scale
  • Experience mentoring and growing early- and mid-career engineers

Technical Expertise

  • Deep understanding of distributed systems, parallel computing, and Linux systems programming
  • Strong programming skills in C++, Python, or similar languages
  • Experience with GPU computing (CUDA, ROCm) and modern AI frameworks (PyTorch, TensorFlow, etc.)
  • Familiarity with high-performance storage systems, networking, and data pipelines
  • Strong foundation in CI/CD, DevOps, and production system reliability

Bonus Experience

  • Background in image processing, computer vision, or scientific computing
  • Experience supporting hybrid HPC + AI workloads in production environments

Leadership & Impact

  • Passion for developing talent and building inclusive, high-performing teams
  • Ability to operate as both a hands-on engineer and strategic technical leader
  • Strong communication skills with the ability to influence engineering and product stakeholders
  • Doctorate (academic) degree and 5 years of related work experience; Master's-level degree and 8 years of related work experience; or Bachelor's-level degree and 12 years of related work experience

Benefits & conditions

Base Pay Range: $151,100.00 - $256,900.00

Primary Location: USA-MI-Ann Arbor-KLA

KLA's total rewards package for employees may also include participation in performance incentive programs and eligibility for additional benefits, including but not limited to: medical, dental, vision, life, and other voluntary benefits; 401(k) with company matching; employee stock purchase program (ESPP); student debt assistance; tuition reimbursement program; development and career growth opportunities and programs; financial planning benefits; wellness benefits, including an employee assistance program (EAP); paid time off and paid company holidays; and family care and bonding leave.

Interns are eligible for some of the benefits listed. Our pay ranges are determined by role, level, and location. The range displayed reflects the pay for this position in the primary location identified in this posting. Actual pay depends on several factors, including state minimum wage rates, location, job-related skills, experience, and relevant education level or training. We are committed to complying with all applicable federal and state minimum wage requirements. If applicable, your recruiter can share more about the specific pay range for your preferred location during the hiring process.

About the company

KLA is a global leader in diversified electronics for the semiconductor manufacturing ecosystem. Virtually every electronic device in the world is produced using our technologies. No laptop, smartphone, wearable device, voice-controlled gadget, flexible screen, VR device or smart car would have made it into your hands without us. KLA invents systems and solutions for the manufacturing of wafers and reticles, integrated circuits, packaging, printed circuit boards and flat panel displays. The innovative ideas and devices that are advancing humanity all begin with inspiration, research and development. KLA invests more heavily in innovation than average, putting 15% of sales back into R&D. Our expert teams of physicists, engineers, data scientists and problem-solvers work together with the world's leading technology providers to accelerate the delivery of tomorrow's electronic devices. Life here is exciting and our teams thrive on tackling really hard problems. There is never a dull moment with us.

At KLA, we're pushing the boundaries of semiconductor inspection through advanced AI and high-performance computing. We are looking for a hands-on technical leader to architect and scale the next generation of AI/HPC infrastructure powering our most critical imaging and data platforms. This role is ideal for someone who thrives at the intersection of distributed systems, GPU computing, and real-world AI workloads, and who enjoys building and mentoring high-performing engineering teams while driving technical excellence.
