Software Engineer, DGX Cloud AI Infrastructure
Job description
In this role you will help bring up, benchmark, and debug distributed LLM workloads on multi-GPU and multi-node deployments, and own the design and implementation of the benchmarking tooling, automation, and debugging workflows that support them. This is a hands-on role for an engineer who enjoys deep technical problems across deep learning systems, GPU performance, distributed computing, and large-scale operations.
What you'll be doing:
- Bring up, validate, and debug large-scale AI clusters, infrastructure, and end-to-end workloads.
- Bring up, tune, and benchmark AI pre-training, post-training, and inference workloads using PyTorch, NeMo / Megatron, TensorRT-LLM, and adjacent NVIDIA AI software stacks.
- Perform root-cause analysis of failures in large distributed environments.
- Contribute to the resilience and failure-attribution tooling that detects, triages, and attributes node, fabric, and workload failures across the cluster.
- Build and maintain repeatable benchmark suites, automation, acceptance criteria, and qualification workflows on new platforms.
- Tune runtime settings, communication parameters, and deployment configurations in close partnership with framework, systems, and platform teams.
- Deliver actionable, data-driven recommendations based on profiling, benchmark results, and cluster characterization.
Requirements
- Bachelor's or Master's in Computer Science or a related technical field (or equivalent experience).
- 3+ years of experience developing software for AI, HPC, or systems-level applications.
- Hands-on experience with multi-GPU or multi-node workloads and CUDA-aware distributed execution.
- Background in debugging and scaling distributed systems.
- Experience debugging and triaging AI applications across the full stack, from the application level toward the hardware.
- Experience operating workloads in scheduled, containerized cluster environments.
- Excellent analytical, debugging, and communication skills, and a collaborative approach across teams.
- Strong Python and C/C++ programming skills.
Ways to stand out from the crowd:
- Hands-on experience with NCCL and CUDA-aware distributed execution.
- Deep familiarity with the RDMA software stack (NCCL, IB verbs, UCX, libfabric) and with InfiniBand / RoCE congestion debugging.
- Experience building acceptance tests, benchmark harnesses, regression gates, or cluster qualification tooling for AI platforms, including MLPerf.
- Experience diagnosing performance jitter.
- Experience building resilience, fault-detection, or failure-attribution systems for datacenter-scale infrastructure.
Benefits & conditions
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 116,000 USD - 189,750 USD for Level 2, and 140,000 USD - 224,250 USD for Level 3.