System Level Debug Engineer - Data Center GPU
Job description
AMD is looking for a lead systems engineer to provide thought leadership and subject matter expertise to our growing team. As a key contributor, you will have a strong technical background to contribute to all aspects of the software development process. We have competitive benefit packages and an award-winning culture. Join us!
The Datacenter Graphics and Accelerated Computing (DCGPU) organization is looking for an experienced system-level debug engineer. The individual will be part of a team that has to bring up the platform being used and ensure it is fully validated, including electrical, power, networking and SOC. The individual will be required to lead and document the plan for validating the system itself, as well as document any unique steps needed to enable it. The individual will need to be able to drive any issues encountered to root closure and communicate with the different functional and IP layers for resolution.
- Debug / triage engineer with an understanding of industry tools for root-causing complex issues
- Understanding of GPU/system-level HW and SW flow
- Ability to probe parts of a board, check electrical and power currents, and validate a system
- Provide leadership for driving issues to root cause
- Communicate / document flows and methods of bring-up, boot-up, system initialization and debug
- Lead technical presentations demonstrating a good understanding of application, data, infrastructure and architecture expertise and application systems design
- Collaborate with application and infrastructure architects and be responsible for defining, designing and delivering the technical architectures, patterns, technical quality, risks, fitness for purpose and operability of technical architecture solutions
- Be a leader and mentor to the operation team; be hands-on and lead by example
- Be able to troubleshoot hands-on and solve technical issues; own the problem and drive for resolution
- Able to proactively support a team culture that fosters knowledge sharing, excellence, and collaboration
AMD may use Artificial Intelligence to help screen, assess or select applicants for this position. AMD's "Responsible AI Policy" is available here.
Requirements
You are a highly motivated, hands-on leader with a strong development background, a problem-solving mentality, excellent communication skills, and the ability to prioritize tasks, along with a willingness to learn and adapt. You have excellent teamwork skills and are capable of leading a highly technical team.
Experience in debugging complex HW/FW issues is a must. You understand the flow of a GPU through the different layers of a system and are able to validate the items connecting to the GPU SOC (PCIe, VRs, RMs, retimers, HBM, internal networking). Communication is essential in working with the different owners of the functional code stack, as is the ability to drive issues via phone calls, chat messages and e-mails. Hands-on experience with hardware in a data center environment will be required.
- Significant experience in SoC and/or system debug of complex issues
- Develop / document debug capabilities on a given SOC and system
- Go-to person for debugging issues in production-level platform validation
- Collaborate with internal teams on root-causing issues and finding optimum resolutions
- Hands-on experience in using industry debug tools and scopes, as well as examining board-level power
- Proven experience with C/C++
- Demonstrable experience in facilitating Agile, Scrum or Kanban
- Skilled in scripting languages such as Perl, Ruby, and Shell script
- Proficient with revision control (Git, SVN and CVS)
- Experience crafting and supporting cloud environments, including IaaS and PaaS
- Database development: PostgreSQL, Oracle, MS SQL Server
- Good balance of hardware, architecture, and software expertise
- Proven ability to drive resolution of critical problems within a lab or data center
- Relationships with external customers/partners and the ability to help resolve problems in their data center
- Relationships with external customers/partners and the ability to work through manufacturing issues/failures
- Relationships with external customers/partners and the ability to define requirements for manufacturing validation
- Bachelor's/Master's degree in Computer Science or a related field strongly preferred, plus a minimum of 8 years of experience in system- or SOC-level debug and triage