AI/HPC System Performance Engineer

Facebook Inc.

Menlo Park, United States of America

yesterday

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Menlo Park, United States of America

Tech stack

Artificial Intelligence

Application Layers

C++

Profiling

Network Congestion

Software Debugging

Distributed Systems

Network Topologies

Job Scheduling

Network Architecture

Remote Direct Memory Access

TensorFlow

Software Deployment

Software Requirements Analysis

Systems Architecture

High Performance Computing

PyTorch

Low Latency

Performance Monitor

Job description

Meta is building some of the world's largest AI and high-performance computing infrastructure to power next-generation AI research and products. As an AI/HPC System Performance Engineer on the Network Infrastructure Engineering team, you will drive end-to-end performance characterization, bottleneck analysis, and optimization of large-scale AI training and inference clusters. In this role, you will work at the intersection of network fabric design, distributed computing, and AI workload behavior to ensure Meta's HPC systems deliver maximum throughput and efficiency for frontier model development.

Profile and benchmark AI training and inference workloads across large-scale HPC clusters to identify network, compute, and memory bottlenecks
Develop and maintain performance analysis frameworks and dashboards to track system-level metrics including GPU utilization, network bandwidth, latency, and collective communication efficiency
Investigate and resolve performance regressions in distributed AI training environments, including issues related to RDMA fabrics, collective communication libraries, and job scheduling
Collaborate with network infrastructure, hardware, and AI research teams to define performance requirements and validate new HPC cluster configurations
Design and execute capacity and scalability experiments to inform network topology decisions for AI supercomputing infrastructure
Build tooling and automation to continuously monitor HPC system health, detect anomalies, and reduce mean time to mitigation during performance incidents
Establish service level objectives for AI cluster network performance and drive cross-functional alignment on reliability and efficiency targets
Lead technical design reviews for network and system architecture changes affecting AI workload performance, communicating trade-offs clearly to engineering and product stakeholders
Mentor other engineers on HPC performance methodologies, debugging techniques, and instrumentation best practices
Leverage AI-assisted workflows to accelerate root cause analysis, automate routine performance reporting, and expand coverage across the HPC stack

Requirements

Experience profiling and optimizing distributed AI or HPC workloads, including familiarity with GPU interconnects, RDMA networking, and collective communication frameworks such as NCCL or MPI
Experience debugging complex, non-reproducible performance issues across multi-layer systems including network fabric, operating system, and application layers
Experience designing and implementing performance monitoring systems, including instrumentation, telemetry pipelines, and alerting for large-scale infrastructure
Experience driving cross-functional technical projects from requirements definition through production deployment, including communicating performance findings and trade-offs to diverse stakeholders
6+ years of experience in system performance engineering, network infrastructure engineering, or a related field within large-scale distributed computing or HPC environments

Preferred Qualifications:

Experience in developing systems software in languages like C++
Experience with machine learning frameworks such as PyTorch and TensorFlow
Understanding of RDMA congestion control mechanisms on InfiniBand (IB) and RoCE networks
Understanding of the latest artificial intelligence (AI) technologies
Understanding of AI training workloads and the demands they place on networks
Demonstrated ability to integrate AI tools to optimize/redesign workflows and drive measurable impact (e.g., efficiency gains, quality improvements)
Experience adhering to and implementing responsible, ethical AI practices (e.g., risk assessment, bias mitigation, quality and accuracy reviews)
Demonstrated ongoing AI skill development (e.g., prompt/context engineering, agent orchestration) and staying current with emerging AI technologies

About the company

Meta builds technologies that help people connect, find communities, and grow businesses. When Facebook launched in 2004, it changed the way people connect. Apps like Messenger, Instagram and WhatsApp further empowered billions around the world. Now, Meta is moving beyond 2D screens toward immersive experiences like augmented and virtual reality to help build the next evolution in social technology. People who choose to build their careers by building with us at Meta help shape a future that will take us beyond what digital connection makes possible today: beyond the constraints of screens, the limits of distance, and even the rules of physics.
