Software Engineer, SystemML - AI Networking

The Meta Game, Inc.

Menlo Park, United States of America

yesterday

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Compensation

$257K

Job location

Menlo Park, United States of America

Tech stack

Artificial Intelligence

C++

Nvidia CUDA

Computer Programming

Computer Engineering

Distributed Data Store

InfiniBand

Python

Machine Learning

Performance Tuning

TensorFlow

AI Infrastructure

Graphics Processing Unit (GPU)

High Performance Computing

PyTorch

Large Language Models

Deep Learning

Parallel Computation

Information Technology

Machine Learning Operations

Job description

The team's work spans the full distributed ML stack (e.g. large-scale GenAI/LLM training), from the trainer down to the inter-GPU and network communication layer. We are seeking engineers to work on GenAI/LLM scaling reliability and performance.

Requirements

Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience

Proven C/C++ and Python programming skills
Proven track record of leading successful projects
Effective leadership and communication skills
Specialized experience in one or more of the following machine learning/deep learning domains: Distributed ML Training, GPU architecture, ML systems, AI infrastructure, high performance computing, performance optimizations, or Machine Learning frameworks (e.g. PyTorch)
Experience with NCCL and distributed GPU performance analysis on RoCE/InfiniBand
PhD in Computer Science, Computer Engineering, or relevant technical field
Knowledge of GPU architectures and CUDA programming
Knowledge of ML, deep learning and LLMs
Experience with both data parallel and model parallel training, such as Distributed Data Parallel, Fully Sharded Data Parallel (FSDP), Tensor Parallel, and Pipeline Parallel
Experience in HPC and parallel computing
Experience working with DL frameworks like PyTorch, Caffe2 or TensorFlow
Experience in AI framework and trainer development for accelerating large-scale distributed deep learning models

Benefits & conditions

4.0 out of 5 stars

1 Hacker Way, Menlo Park, CA 94025

$183,997 - $257,000 a year

About the company

In this role, you will be a member of the AI Networking Software team, part of the larger DC networking organization. The team develops and owns the software stack around NCCL (NVIDIA Collective Communications Library), which enables multi-GPU and multi-node data communication through HPC-style collectives. NCCL has been integrated into PyTorch and is on the critical path of multi-GPU distributed training. In other words, nearly every distributed GPU-based ML workload in Meta Production goes through the software stack the team owns.

At a high level, the team aims to enable Meta-wide ML products and innovations to leverage our large-scale GPU training and inference fleet through an observable, reliable and high-performance distributed AI/GPU communication stack. Currently, one of the team's focuses is building customized features, software benchmarks, performance tuners and software stacks around NCCL and PyTorch to improve full-stack distributed ML reliability and performance (e.g. large-scale GenAI/LLM training).

Meta builds technologies that help people connect, find communities, and grow businesses. When Facebook launched in 2004, it changed the way people connect. Apps like Messenger, Instagram and WhatsApp further empowered billions around the world. Now, Meta is moving beyond 2D screens toward immersive experiences like augmented and virtual reality to help build the next evolution in social technology. People who choose to build their careers by building with us at Meta help shape a future that will take us beyond what digital connection makes possible today, beyond the constraints of screens, the limits of distance, and even the rules of physics.
