Meta is seeking a Production Systems Engineer, Tooling to join our Production Systems Engineering organization, where you will help drive the reliability, efficiency, and scalability of Meta's large-scale hardware infrastructure through improvements in test automation. You will design and build the systems tooling, test automation, and frameworks that keep Meta's global production fleet (spanning compute, storage, networking, and custom silicon) operating at peak performance. Working at the intersection of hardware and software, you will partner with data center operations, hardware engineering, platform teams, and ODM/vendor partners to drive systemic improvements across the full infrastructure stack.
Responsibilities
- Design, build, and scale test orchestration and validation tooling, CI/CD pipelines, and automation frameworks that qualify large-scale AI hardware platforms at cluster scale — spanning provisioning, monitoring, and lifecycle management of compute, storage, and networking infrastructure
- Develop tooling for hardware lifecycle management, fleet health observability, and automated remediation of production system failures across Meta's data center fleets
- Identify and resolve systemic reliability and performance issues by analyzing hardware telemetry, failure patterns, and system-level diagnostics at scale
- Collaborate with hardware engineering teams to define software interfaces, firmware integration requirements, and bring-up workflows for new server and accelerator platforms
- Lead cross-functional efforts to evaluate, qualify, and integrate new hardware technologies into the production environment, including validation and qualification workflows
- Develop scalable infrastructure automation that reduces operational toil and accelerates hardware deployment and remediation across the global fleet
- Mentor other engineers on systems software design, debugging methodologies, and production infrastructure best practices
- Communicate technical designs and architectural decisions through written documentation and cross-functional stakeholder alignment
Minimum Qualifications
- Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
- 3+ years of experience in production systems engineering or infrastructure software engineering, including development in C, C++, or Python for Linux-based environments
- 3+ years of experience with large-scale hardware infrastructure systems, including fleet automation, hardware lifecycle management, or data center operations software
- 3+ years of experience in designing and operating distributed systems software at scale, including monitoring, alerting, and automated remediation pipelines
- 3+ years of experience in communicating system designs and technical decisions through written documentation and cross-functional stakeholder engagement
- Demonstrated troubleshooting skills across hardware products and automation software
Preferred Qualifications
- Master's degree in Computer Science, Computer Engineering, or similar field
- 6+ years of experience across a variety of infrastructure components, such as network and compute, in a data center or large-scale production environment
- 3+ years of experience in building or operating CI/CD pipelines and test automation frameworks for infrastructure software
- Familiarity with custom silicon or accelerator platform integration, including firmware and platform management interfaces
- Expertise guiding cross-functional teams or ODM/vendor partners through the setup, integration, and execution of automation and validation frameworks at scale