Sr. Technical Program Manager - Training at Scale
Job description
AMD is seeking a Senior Technical Program Manager (TPM) to lead Training at Scale programs for AMD Instinct products. You will drive end-to-end execution of large-scale AI training initiatives while owning multi-quarter planning, roadmap development, and the operating cadence that turns strategy into predictable delivery across the Training at Scale engineering portfolio.
In this role, you will be a core partner to engineering leadership, ensuring that near-term execution strength is matched by clear long-term plans, OKR rigor, and early risk/decision management across an evolving open-source AI ecosystem.
THE PERSON:
The ideal candidate is a highly structured program leader with technical depth in AI training frameworks and distributed training at scale, comfortable operating in ambiguity and turning strategy into executable roadmaps. You communicate crisply at all levels, build alignment across cross-functional teams, and proactively surface risks, tradeoffs, and decision points before they become delivery blockers.
You thrive in a fast-moving environment, bring a strong operating cadence (OKRs, reviews, dashboards), and can build durable planning mechanisms that reduce engineering overhead while improving delivery predictability.
- Own Training at Scale portfolio planning: translate strategy into a multi-quarter roadmap, quarterly plans, and measurable outcomes.
- Establish and run an execution operating model (OKRs, program reviews, decision logs, dashboards) to drive rigor, transparency, and predictable delivery.
- Drive end-to-end delivery of large-scale AI training capabilities across cross-functional engineering teams; manage scope, milestones, dependencies, and critical path.
- Apply technical judgment to identify and manage architecture-level tradeoffs, technical dependencies, and technical risk; proactively surface decision points and escalation paths.
- Build alignment with engineering leadership and key stakeholders on priorities, sequencing, and resourcing.
AMD may use Artificial Intelligence to help screen, assess or select applicants for this position. AMD's "Responsible AI Policy" is available here.
Requirements
- Technical fluency in AI/ML systems, including distributed training and scalability/performance considerations.
- Hands-on familiarity with AI training frameworks/ecosystems (e.g., PyTorch, JAX) and related tooling.
- Understanding of GPU compute software stacks and performance considerations; familiarity with AMD ROCm and/or NVIDIA CUDA.
- Experience working in open-source ecosystems (contributing, managing upstream dependencies, release planning, community/ecosystem coordination).
- Track record of proactive risk management and executive-level stakeholder communication in ambiguous environments.
- Master's or Bachelor's degree in Computer Engineering, Computer Science or Electrical Engineering is desired.