GPU Systems Engineer (CUDA)
Job description
As we continue to grow, we're looking for a skilled GPU Systems Engineer (CUDA) to join our dynamic team and contribute to our mission of transforming business processes through technology.
This role is part of Bright Vision Technologies' in-house Statement of Work (SOW) engagement. The client, end customer, and employer for this position are all Bright Vision Technologies; there is no third-party client, vendor, or implementation partner involved. We do not engage in C2C, 1099, or third-party arrangements for this role: all our roles are W2, and we do not accept third-party brokering. Candidates must be willing to work directly as a full-time W2 employee of Bright Vision Technologies and contribute to our in-house SOW deliverables.
No new H1B sponsorship is available for this role. However, candidates who currently hold a valid H1B visa and require a transfer are welcome to apply; we will support H1B transfers for qualified candidates.
A technical coding assessment is mandatory for every role. Please apply only if you are confident in your technical abilities and hands-on experience.
We are seeking a GPU Systems Engineer with deep expertise in CUDA programming, GPU architecture, and high-performance computing to design and optimize compute-intensive workloads on modern accelerator hardware. This role focuses on extracting maximum performance from GPU platforms for AI training, inference, scientific computing, and high-throughput data processing workloads. The ideal candidate combines low-level systems mastery with strong software engineering practices, and has a track record of delivering measurable performance improvements on production GPU systems.
In this role, you will work closely with cross-functional partners (product, design, engineering, operations, and business stakeholders) to translate ambiguous requirements into well-engineered solutions, and will be expected to raise the bar through code review, design review, and mentorship of more junior engineers. The successful candidate brings strong engineering discipline, a clear communication style, and a track record of shipping meaningful work that holds up well in production.
- Design and implement high-performance CUDA kernels for compute-intensive workloads across AI and HPC use cases.
- Profile and optimize GPU code using tools such as Nsight Systems, Nsight Compute, and CUDA profilers.
- Tune memory access patterns, occupancy, register usage, and shared memory utilization for peak performance.
- Develop highly optimized libraries for linear algebra, attention, and other ML primitives.
- Optimize multi-GPU and multi-node training using NCCL, RDMA, and high-performance networking.
- Implement custom operators and fused kernels in PyTorch, JAX, or Triton.
- Collaborate with ML engineers to identify performance bottlenecks in training and inference pipelines.
- Develop benchmarks and regression tests to safeguard performance over time.
- Evaluate new GPU architectures and feature sets, and advise on adoption strategy.
- Contribute to compiler-level optimizations for tensor programs where appropriate, working at the boundary between ML frameworks and underlying accelerator codegen to unlock performance not reachable through framework-level tuning alone.
- Optimize memory hierarchy usage across HBM, L2, shared memory, and registers.
- Implement mixed-precision and quantized compute paths that maximize accelerator throughput while preserving numerical fidelity within bounds acceptable for the target workloads.
- Document performance characteristics, design decisions, and tuning playbooks for internal teams.
- Stay current with GPU architecture, CUDA evolution, and emerging accelerator technologies.
Requirements
Do you have experience in performance tuning?
Do you have a Master's degree?
- Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related field.
- Six or more years of experience in GPU programming and performance engineering.
- Deep expertise in CUDA C/C++ and GPU programming models.
- Strong understanding of modern GPU architectures, memory hierarchies, and execution models.
- Hands-on experience profiling and optimizing GPU workloads in production.
- Familiarity with NCCL, MPI, and high-performance interconnect technologies.
- Experience integrating custom kernels into ML frameworks.
- Strong C++ skills and familiarity with modern systems programming practices.
- Solid grounding in linear algebra and numerical methods.
- Strong communication and collaboration skills with research and engineering teams.
Preferred Qualifications
- Experience with Triton, CUTLASS, or other GPU kernel authoring frameworks.
- Familiarity with TensorRT, FasterTransformer, or vLLM internals.
- Exposure to compiler infrastructure such as LLVM or MLIR.
- Open-source contributions to GPU or ML performance libraries.
- Experience with large-scale distributed training infrastructure.