At Annapurna Labs, we are at the forefront of hardware/software accelerator solutions, not only for Amazon Web Services (AWS) but across the industry. The Machine Learning Acceleration Systems Firmware team is looking for candidates interested in diving deep into the designs of our machine learning servers and developing world-class firmware to support current and future generations of accelerator silicon.
Our team designs and builds Annapurna's fleet of accelerated servers using internally designed silicon. We solve systemic hardware issues, and we build hardware and software systems to detect and mitigate future recurrences of failures so that our customers can experience the highest quality of service possible!
In this role, you will lead an organization of software and firmware developers to build reliable server firmware deployed on millions of accelerators across EC2. You will build AI-driven software tooling that identifies the root causes of system failures. This work directly impacts how our customers leverage AWS Trainium for their machine learning workloads.
Key job responsibilities
In this role, you will lead a team of software and firmware developers to design and develop server software at AWS scale. You'll collaborate with hardware developers and software engineers to design validation strategies that ensure reliability across our entire product line. Your days will include mentoring your team through complex technical challenges, establishing operational procedures that scale across products, and working cross-functionally to integrate design-for-excellence principles into our development process. You'll also participate in technical discussions that shape how we approach system design and validation, ensuring we're catching issues before they reach customers.
This is a fast-paced, intellectually challenging position, and you’ll work with thought leaders in multiple technology areas. You’ll have high standards for yourself and everyone you work with, and you’ll be constantly looking for ways to improve your product’s performance, quality and cost. Using data and key metrics, you will also drive and measure process improvements that enhance our operational effectiveness.
A day in the life
Your day-to-day responsibilities will include interfacing with our internal and external customers to understand project requirements and facilitate system development on top of your server design. You will be responsible for understanding the operational challenges facing our existing fleet, with the goal of improving the current customer experience as well as developing improved systems for future designs. You will work directly with vendors and ODM/JDM design teams to develop and manufacture your product at scale.
About the team
Our team is dedicated to supporting new members. We have a broad mix of experience levels and tenures, and we’re building an environment that celebrates knowledge-sharing and mentorship. Our senior members enjoy one-on-one mentoring and thorough, but kind, design reviews. We care about your career growth and strive to assign projects that help you develop your engineering expertise so you feel empowered to take on more complex tasks in the future.
We're a collaborative group of software engineers and hardware developers united by a shared mission: making AWS Trainium products more reliable and easier to troubleshoot. Our team values partnership across disciplines, and your success depends on building strong relationships with hardware specialists, validation engineers, and other technical leaders. We're focused on establishing best-in-class operational procedures and diagnostic capabilities that set the standard for the industry. By joining us, you'll help shape the future of how we approach system reliability and contribute to products that power some of the most demanding machine learning applications in the world.
- 7+ years of experience working directly with engineering teams
- Experience managing programs across cross-functional teams, building processes, and coordinating release schedules
- Experience building and evaluating system-level technical design
- Bachelor's degree in Computer Science, Computer Engineering, or related fields
- Experience managing teams, or experience as a mentor, tech lead, or leader of an engineering team
- Experience in software development or in troubleshooting and debugging technical systems, with strong analytical skills, attention to detail, and effective communication abilities
- Experience with hardware/software integration and real-time systems
- 10+ years of systems software or firmware engineering experience
- Proficiency with programming languages commonly used in systems software (such as C, C++, Rust, or Python)
- 5+ years of experience in project management disciplines, including scope, schedule, budget, quality, risk, and critical path management
- Experience managing projects across cross-functional teams, building sustainable processes, and coordinating release schedules
- Experience defining KPIs/SLAs used to drive multi-million-dollar businesses and reporting to senior leadership
- Master's degree in Computer Science, Computer Engineering, or related fields
- Experience troubleshooting and debugging technical systems
- 5+ years of embedded firmware development experience
- Knowledge of data center infrastructure design, operations, or delivery
- Experience navigating a knowledge base and following Standard Operating Procedures (SOPs)
- Experience with AI or machine learning applications in systems engineering
Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.
Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit https://amazon.jobs/content/en/how-we-hire/accommodations for more information. If the country/region you’re applying in isn’t listed, please contact your Recruiting Partner.
The base salary range for this position is listed below. Your Amazon package will include sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon also offers comprehensive benefits including health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage), 401(k) matching, paid time off, and parental leave. Learn more about our benefits at https://amazon.jobs/en/benefits.
USA, TX, Austin - 144,100.00 - 194,900.00 USD annually