Senior Network Engineer - Data Center / HPC/AI

GTN Technical Staffing
Dallas, United States of America
4 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Tech stack

Artificial Intelligence
Border Gateway Protocol
Big Data
Clos Network Architecture
Computer Clusters
Computer Networks
Data Centers
Deep Packet Inspection
Distributed Computing Environment
Distributed Data Store
Distributed Systems
Monitoring of Systems
InfiniBand
Interoperability
Multi-protocol Systems
Jinja (Template Engine)
Python
Network Architecture
Packet Analyzer
Overlay Transport Virtualization
Remote Direct Memory Access
Ansible
AI Infrastructure
High Performance Computing
NX-OS
System Availability
Parallel Computation
Git
Data Center Networking
Infrastructure Automation Frameworks
Low Latency
Open Network Automation Platform
Cisco Networks
Network Optimization

Job description

We are seeking a Senior Network Engineer to design, build, and operate high-performance data center networks supporting HPC, AI/ML workloads, and next-generation CaaS / GPUaaS platforms.

This role focuses on delivering ultra-low-latency, high-throughput network infrastructure optimized for GPU- and CPU-intensive compute environments. You will play a critical role in enabling scalable, multi-tenant AI infrastructure by ensuring network performance, reliability, and efficiency across distributed data center environments.

The ideal candidate brings deep expertise in modern data center networking, hands-on experience with high-performance fabrics, and a strong understanding of networking requirements for GPU-accelerated and containerized platforms at scale.

Data Center & HPC Network Engineering

  • Design, implement, and operate high-performance data center networks supporting HPC, AI/ML, and GPUaaS / CaaS environments
  • Optimize architectures for east-west traffic, low latency, and high throughput across large-scale compute clusters
  • Support distributed GPU and CPU workloads, ensuring consistent performance under heavy parallel processing demands

Network Architecture & Multi-Tenant Design

  • Design and manage leaf-spine / Clos architectures using EVPN-VXLAN overlays
  • Build scalable, multi-tenant network architectures supporting workload isolation and segmentation for CaaS / GPUaaS platforms
  • Support DCI, backbone connectivity, and hybrid/cloud on-ramp strategies

Performance, Reliability & Optimization

  • Monitor and tune network performance for latency, throughput, and congestion across HPC environments
  • Perform deep packet inspection, traffic flow analysis, and root cause troubleshooting
  • Drive capacity planning and scaling strategies aligned with compute and GPU cluster growth
  • Ensure high availability through redundancy, failover validation, and operational rigor

Automation & Infrastructure Engineering

  • Develop network automation frameworks using Python, Ansible, Git, and Jinja2
  • Implement Infrastructure-as-Code (IaC) and CI/CD pipelines for network provisioning and changes
  • Standardize and scale network deployments across environments

Observability & Telemetry

  • Implement telemetry and monitoring solutions to provide real-time visibility into network performance
  • Analyze metrics to proactively identify risks and optimize system behavior
  • Integrate network observability into broader platform monitoring ecosystems

Cross-Functional Collaboration

  • Partner with HPC platform, compute, storage, and infrastructure teams to align network architecture with workload demands
  • Collaborate with architecture and engineering teams on new environment design and deployment
  • Work closely with vendors to validate performance, interoperability, and scalability

Technical Leadership

  • Serve as a senior escalation point for network incidents and complex troubleshooting
  • Mentor junior engineers and contribute to documentation, standards, and best practices
  • Drive continuous improvement across network architecture, operations, and tooling

Requirements

  • 5-8+ years of experience designing and supporting large-scale data center networks
  • Experience supporting HPC, AI/ML, or GPU-accelerated infrastructure environments
  • Experience working with or supporting CaaS, GPUaaS, or multi-tenant platform architectures
  • Strong expertise with:
      • Leaf-spine / Clos architectures
      • EVPN, VXLAN, BGP, MPLS
      • Cisco and/or Arista platforms (NX-OS, EOS, IOS-XR)
  • Strong understanding of low-latency, high-throughput network optimization
  • Proven troubleshooting experience in complex distributed environments

Technical Skills

  • Network automation: Python, Ansible, Jinja2, Git
  • Infrastructure-as-Code (IaC) and CI/CD pipelines
  • Network observability, telemetry, and monitoring tools
  • Packet analysis and traffic flow diagnostics

Preferred Experience

  • Experience with HPC networking concepts (GPU clusters, distributed training environments)
  • Familiarity with InfiniBand, RDMA, or RoCE networking
  • Experience in hyperscale or AI-focused data center environments
  • CCNP or equivalent certification preferred; CCIE or advanced certifications a plus

Additional Requirements

  • This position requires applicants to be currently authorized to work in the U.S. without employer sponsorship.
  • We are unable to sponsor or take over sponsorship of employment visas at this time.
