Principal Firmware Engineer, Annapurna Labs ML Acceleration Systems Software

Amazon.com, Inc.
Austin, United States of America
4 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior
Compensation
$ 195K

Job location

Austin, United States of America

Tech stack

Artificial Intelligence
Amazon Web Services (AWS)
Systems Engineering
C++
Computer Engineering
Data Centers
Embedded Software
Firmware
Python
Machine Learning
Systems Development Life Cycle
Software Engineering
Systems Integration
Real Time Systems
Software Troubleshooting
Reliability of Systems
Information Technology
Programming Languages

Job description

In this role, you will lead a team of software and firmware developers to design and develop server software at AWS scale. You'll collaborate with hardware developers and software engineers to design validation strategies that ensure reliability across our entire product line. Your days will include mentoring your team through complex technical challenges, establishing operational procedures that scale across products, and working cross-functionally to integrate design-for-excellence principles into our development process. You'll also participate in technical discussions that shape how we approach system design & validation, ensuring we're catching issues before they reach customers.

This is a fast-paced, intellectually challenging position, and you'll work with thought leaders in multiple technology areas. You'll have high standards for yourself and everyone you work with, and you'll be constantly looking for ways to improve your product's performance, quality and cost. Using data and key metrics, you will also drive and measure process improvements that enhance our operational effectiveness.

A day in the life

Your day-to-day responsibilities will include interfacing with our internal and external customers to understand project requirements and facilitate system development on top of your server design. You will be responsible for learning the operational challenges of our existing fleet, with the goal of improving the current customer experience as well as developing improved systems for future designs. You will work directly with vendors and ODM/JDM design teams to develop and manufacture your product at scale.

About the team

Our team is dedicated to supporting new members. We have a broad mix of experience levels and tenures, and we're building an environment that celebrates knowledge-sharing and mentorship. Our senior members enjoy one-on-one mentoring and thorough, but kind, design reviews. We care about your career growth and strive to assign projects that help you develop your engineering expertise so you feel empowered to take on more complex tasks in the future.

We're a collaborative group of software engineers and hardware developers united by a shared mission: making Amazon Trainium products more reliable and easier to troubleshoot. Our team values partnership across disciplines, and your success depends on building strong relationships with hardware specialists, validation engineers, and other technical leaders. We're focused on establishing best-in-class operational procedures and diagnostic capabilities that set the standard for the industry. By joining us, you'll help shape the future of how we approach system reliability and contribute to products that power some of the most demanding machine learning applications in the world.

Requirements

  • 7+ years of experience working directly with engineering teams
  • Experience managing programs across cross-functional teams, building processes and coordinating release schedules
  • Experience building and evaluating system-level technical design
  • Bachelor's degree in Computer Science, Computer Engineering, or related fields
  • Experience managing teams, or experience as a mentor, tech lead or leading an engineering team
  • Experience in software development, or in troubleshooting and debugging technical systems, with strong analytical skills, attention to detail, and effective communication abilities
  • Experience with hardware/software integration and real-time systems
  • 10+ years of systems software or firmware engineering
  • Proficiency with programming languages commonly used in systems software (such as C, C++, Rust, or Python)

Preferred Qualifications

  • 5+ years of experience in project management disciplines, including scope, schedule, budget, quality, risk, and critical path management
  • Experience managing projects across cross-functional teams, building sustainable processes and coordinating release schedules
  • Experience defining KPIs/SLAs used to drive multimillion-dollar businesses and reporting to senior leadership
  • Master's degree in Computer Science, Computer Engineering, or related fields
  • Experience troubleshooting and debugging technical systems
  • 5+ years of embedded firmware development experience
  • Knowledge of data center infrastructure design, operations, or delivery
  • Experience navigating a knowledge base and following Standard Operating Procedures (SOPs)
  • Experience with AI or machine learning applications in systems engineering

Benefits & conditions

The base salary range for this position is listed below. Your Amazon package will include sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon also offers comprehensive benefits including health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage), 401(k) matching, paid time off, and parental leave. Learn more about our benefits at https://amazon.jobs/en/benefits.

USA, TX, Austin - 144,100.00 - 194,900.00 USD annually

About the company

At Annapurna Labs, we are at the forefront of hardware/software accelerator solutions, not only for Amazon Web Services (AWS) but across the industry. The Machine Learning Acceleration Systems Firmware team is looking for candidates interested in diving deep into our designs of Machine Learning servers and developing world-class firmware to support current and future generations of accelerator silicon. Our team designs and builds Annapurna's fleet of Accelerated Servers using internally designed silicon. We solve systemic hardware issues, and we build hardware and software systems to detect and mitigate future failure recurrences so that our customers can experience the highest quality of service possible. In this role, you will lead an organization of software and firmware developers to build reliable server firmware deployed across millions of accelerators across EC2. You will build AI-driven software tooling that identifies the root causes of system failures. This work directly impacts how our customers leverage AWS Trainium for their machine learning workloads.
