GPU Systems Engineer (CUDA)

Bright Vision Technologies
8 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Remote

Tech stack

Artificial Intelligence
Systems Engineering
C++
Profiling
Code Review
Nvidia CUDA
Computer Programming
Computer Engineering
Performance Tuning
Remote Direct Memory Access
Regression Testing
TensorFlow
Scientific Computing
Software Engineering
System Programming
Data Processing
Graphics Processing Unit (GPU)
High Performance Computing
GPU Programming
Information Technology
Free and Open-Source Software
TensorRT

Job description

As we continue to grow, we're looking for a skilled GPU Systems Engineer (CUDA) to join our dynamic team and contribute to our mission of transforming business processes through technology.

This role is part of Bright Vision Technologies' in-house Statement of Work (SOW) engagement. The client, end customer, and employer for this position is Bright Vision Technologies - there is no third-party client, vendor, or implementation partner involved. We do not engage in C2C, 1099, or third-party arrangements for this role: all our roles are W2, with no third-party companies or brokering. Candidates must be willing to work directly as a full-time W2 employee of Bright Vision Technologies and contribute to our in-house SOW deliverables.

No new H1B sponsorship is available for this role. However, candidates who are currently on a valid H1B visa and require a transfer are welcome to apply. We will support H1B transfers for qualified candidates.

For every role, a technical coding assessment is mandatory. Please apply only if you are confident in your technical abilities and hands-on experience.

We are seeking a GPU Systems Engineer with deep expertise in CUDA programming, GPU architecture, and high-performance computing to design and optimize compute-intensive workloads on modern accelerator hardware. This role focuses on extracting maximum performance from GPU platforms for AI training, inference, scientific computing, and high-throughput data processing workloads. The ideal candidate combines low-level systems mastery with strong software engineering practices, and has a track record of delivering measurable performance improvements on production GPU systems.

In this role you will work closely with cross-functional partners - product, design, engineering, operations, and business stakeholders - to translate ambiguous requirements into well-engineered solutions, and will be expected to raise the bar through code review, design review, and mentorship of more junior engineers. The successful candidate brings strong engineering discipline, a clear communication style, and a track record of shipping meaningful work that holds up well in production.

  • Design and implement high-performance CUDA kernels for compute-intensive workloads across AI and HPC use cases.
  • Profile and optimize GPU code using tools such as Nsight Systems, Nsight Compute, and CUDA profilers.
  • Tune memory access patterns, occupancy, register usage, and shared memory utilization for peak performance.
  • Develop highly optimized libraries for linear algebra, attention, and other ML primitives.
  • Optimize multi-GPU and multi-node training using NCCL, RDMA, and high-performance networking.
  • Implement custom operators and fused kernels in PyTorch, JAX, or Triton.
  • Collaborate with ML engineers to identify performance bottlenecks in training and inference pipelines.
  • Develop benchmarks and regression tests to safeguard performance over time.
  • Evaluate new GPU architectures and feature sets, and advise on adoption strategy.
  • Contribute to compiler-level optimizations for tensor programs where appropriate, working at the boundary between ML frameworks and underlying accelerator codegen to unlock performance not reachable through framework-level tuning alone.
  • Optimize memory hierarchy usage across HBM, L2, shared memory, and registers.
  • Implement mixed-precision and quantized compute paths that maximize accelerator throughput while preserving numerical fidelity within bounds acceptable for the target workloads.
  • Document performance characteristics, design decisions, and tuning playbooks for internal teams.
  • Stay current with GPU architecture, CUDA evolution, and emerging accelerator technologies.

Requirements

Do you have experience in performance tuning?
Do you have a Master's degree?

  • Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related field.
  • Six or more years of experience in GPU programming and performance engineering.
  • Deep expertise in CUDA C/C++ and GPU programming models.
  • Strong understanding of modern GPU architectures, memory hierarchies, and execution models.
  • Hands-on experience profiling and optimizing GPU workloads in production.
  • Familiarity with NCCL, MPI, and high-performance interconnect technologies.
  • Experience integrating custom kernels into ML frameworks.
  • Strong C++ skills and familiarity with modern systems programming practices.
  • Solid grounding in linear algebra and numerical methods.
  • Strong communication and collaboration skills with research and engineering teams.

Preferred Qualifications

  • Experience with Triton, CUTLASS, or other GPU kernel authoring frameworks.
  • Familiarity with TensorRT, FasterTransformer, or vLLM internals.
  • Exposure to compiler infrastructure such as LLVM or MLIR.
  • Open-source contributions to GPU or ML performance libraries.
  • Experience with large-scale distributed training infrastructure.

About the company

Bright Vision Technologies is a forward-thinking software development company dedicated to building innovative solutions that help businesses automate and optimize their operations. We leverage cutting-edge technologies to create scalable, secure, and user-friendly applications.
