Meta is seeking a forward-thinking, experienced individual to join the Data Center Fleet Operations team. The Fleet Operations Manager is accountable for managing and leading a geographically dispersed team and for delivering on SLAs/KPIs related to production server hardware, resolution of systemic technical issues, and repairs throughout the assigned geographic region of data centers.
We are looking for someone who can effectively prioritize and adapt to shifting priorities in a dynamic operational environment. The ideal candidate is an IT professional with strong leadership skills and experience in Server Hardware, Project Management, Quality Management, Data Analytics, Networks, OS repair, Linux and Automation, ideally in a data center environment, along with an extensive understanding of managing servers in a large-scale distributed environment.
Responsibilities
- Build and lead a geographically dispersed, high-performing data center operations team, developing both the technical capabilities and leadership qualities of engineers
- Establish and manage a Data Center Operations Team accountable for the maintenance and operation of server hardware and supporting infrastructure at scale
- Become a technical expert in Meta's infrastructure, including platforms, tools, systems, architecture, workflows, and performance
- Provide strategic direction, guidance, and support for site and fleet-level operations
- Analyze and drive continuous improvement in the engineering and operational performance of our data centers
- Employ data analytics to identify inefficiencies, opportunities, exceptions, and correlations in a complex, highly interconnected, technical environment. Enable rapid and effective problem solving, along with proactive identification and mitigation of risks and issues
- Collaborate with cross-functional partner teams to ensure fleet health and maintain targeted capacity levels, resulting in optimized operations, minimized downtime, and seamless scalability
- Evolve and optimize processes in a globally consistent way to allow Meta to scale and grow effectively
- Support and mentor engineers in their day-to-day work, as well as in finding opportunities to develop and grow based on their areas of strength and interest
- Create and drive a culture of ownership, innovation, collaboration, accountability, continuous improvement, and safety
- Conduct performance management for a technical engineering team, providing clear expectations and goals
- Assume the role of incident manager during large-scale, site-wide, and region-wide production-impacting events, as the primary point of contact for your site. This requires working cross-functionally to scope problems, mitigate risks, effect fixes, and communicate the nature, status, and resolution plan for incidents
- Support and contribute thought leadership to the development and implementation of business practices, processes and automated tooling
- Develop deep knowledge and ownership of a hyper-scale computing fleet by using data analysis to identify trends, systemic issues, and opportunities, reporting out globally and sharing with peers as appropriate
Minimum Qualifications
- BS, BA, or BEng in a technical field or commensurate experience
- Ability to travel up to 30% is required
- Experience participating in or leading technical projects related to areas such as process improvement, technology, and/or automation, including bringing in additional expertise as needed
- 5+ years of experience managing teams of technical resources, including people and performance management responsibilities
- Understanding of data center infrastructure and/or operations, including power, cooling, and/or network systems; structured cabling; and management of projects, incidents, and vendors
- Experience using data and metrics to drive decision-making
- Ability to influence effectively, working on cross-functional teams to advance the needs of the company and adapting teams to meet these needs
- 10+ years of engineering or operations experience, preferably in a mature engineering or operations environment, working with cross-functional teams
- Ability to communicate effectively, in a clear and concise manner, appropriately tailoring messages to the audience
- Demonstrated ability to integrate AI tools to optimize/redesign workflows and drive measurable impact (e.g., efficiency gains, quality improvements)
- Experience adhering to and implementing responsible, ethical AI practices (e.g., risk assessment, bias mitigation, quality and accuracy reviews)
- Demonstrated ongoing AI skill development (e.g., prompt/context engineering, agent orchestration) and staying current with emerging AI technologies
- Six Sigma knowledge/certification
- Experience leading technical resources using Linux or an equivalent OS to support hardware systems in a complex IT environment
- Experience with large-scale AI implementations and the use of AI to drive automation
- Experience in large-scale data center hardware deployments and building scalable infrastructure
- Knowledge of the interdependencies of data center functions and technologies, including electrical, cooling, structured cabling, security, and network