Emerging Network Architect - AI/Compute

CTECH NY INC
Dallas, United States of America
1 month ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English

Job location

Dallas, United States of America

Tech stack

Artificial Intelligence
Systems Engineering
Fluid
Packet Switching
File Systems
Ethernet
InfiniBand
Linux kernel
Machine Learning
Network Architecture
Routing
Parallel Computing
PCI Express
Scientific Computing
Software Systems
Information Technology

Job description

The Network Architect will design and evolve next-generation high-performance infrastructure for advanced AI and compute workloads. This role evaluates emerging network, hardware, and software technologies and translates the findings into scalable, production-ready architectures. You will run hands-on tests in lab environments, analyze system performance, and identify bottlenecks. Working across networking, compute, and storage teams, you will develop integrated HPC solutions aligned with customer requirements. The role also partners closely with vendors to incorporate roadmap insights and provides technical guidance for large-scale deployments.

  • Explore industry opportunities to adopt future network architectures and technologies that improve key HPC metrics such as Model FLOPs Utilization (MFU) and performance per watt
  • Develop and maintain strong technical partnerships with leading network vendors to incorporate their future roadmaps into our HPC platform and architectures
  • Recommend and justify hardware and software solutions aligned with performance, efficiency, and scalability objectives
  • Work with customers to understand their current and future HPC workloads and requirements, and their impact on our models and performance benchmarks
  • Evaluate incoming hardware and software thoroughly enough to verify the systems in our own environment and lab setups
  • Assist the team with bottleneck identification and performance evaluation of new hardware, especially networking (for example, latency/bandwidth modelling)
  • Collaborate with storage and compute architects to integrate individual vendors' components into a complete HPC solution
  • Contribute technical guidance and support to other internal teams responsible for standing up the chosen architectures at scale
  • Continuously evaluate and stay current on the existing and future HPC landscape, from established vendors to start-ups, to identify the best ideas and products in this space
  • Influence vendor roadmaps through feedback, joint initiatives, and technology evaluations

Requirements

  • A Bachelor's or Master's degree in Computer Science, Engineering, Physics, or a related technical field, with demonstrable experience in HPC solution architecture, systems engineering or a similar domain
  • Deep expertise in network architectures and interconnect topologies with demonstrable experience working on these products for HPC
  • Hands-on experience with high-speed fabric solutions, particularly InfiniBand and NVLink, as well as Ethernet (RoCE), Omni-Path, and similar interconnects
  • Expertise in packet switching, routing algorithms, flow control, and congestion management, with the ability to apply the right solutions for maximum network performance
  • Understanding of how high-speed networking affects compute performance, and experience with performance modelling and industry-standard benchmarks such as OSU, MPI, and STREAM
  • Previous hands-on lab experience running benchmarks and test jobs on unproven hardware
  • Understanding of distributed and parallel file systems such as VAST and Lustre, particularly their requirements and their impact on a system's network performance
  • Proven experience designing HPC clusters and parallel computing environments, with strong proficiency in Linux kernel tuning, system-level optimizations and performance profiling
  • Experience with emerging network technologies such as CXL, PCIe Gen 6, and DPUs is not required but strongly valued
  • Demonstrated success working directly with clients to capture technical requirements and deliver tailored, scalable system designs across AI/ML, scientific computing, and CFD workloads