Senior Network Engineer - Data Center / HPC/AI

GTN Technical Staffing

Dallas, United States of America

4 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Dallas, United States of America

Tech stack

Artificial Intelligence

Border Gateway Protocol

Big Data

Clos Network Architecture

Computer Clusters

Computer Networks

Data Centers

Deep Packet Inspection

Distributed Computing Environment

Distributed Data Store

Distributed Systems

Monitoring of Systems

InfiniBand

Interoperability

Multi-protocol Systems

Jinja (Template Engine)

Python

Network Architecture

Packet Analyzer

Overlay Transport Virtualization

Remote Direct Memory Access

Ansible

AI Infrastructure

High Performance Computing

NX-OS

System Availability

Parallel Computation

Git

Data Center Networking

Infrastructure Automation Frameworks

Low Latency

Open Network Automation Platform

Cisco networks

Network Optimization

Job description

We are seeking a Senior Network Engineer to design, build, and operate high-performance data center networks supporting HPC, AI/ML workloads, and next-generation CaaS / GPUaaS platforms.

This role focuses on delivering ultra-low-latency, high-throughput network infrastructure optimized for GPU- and CPU-intensive compute environments. You will play a critical role in enabling scalable, multi-tenant AI infrastructure by ensuring network performance, reliability, and efficiency across distributed data center environments.

The ideal candidate brings deep expertise in modern data center networking, hands-on experience with high-performance fabrics, and a strong understanding of networking requirements for GPU-accelerated and containerized platforms at scale.

Data Center & HPC Network Engineering

Design, implement, and operate high-performance data center networks supporting HPC, AI/ML, and GPUaaS / CaaS environments
Optimize architectures for east-west traffic, low latency, and high throughput across large-scale compute clusters
Support distributed GPU and CPU workloads, ensuring consistent performance under heavy parallel processing demands

Network Architecture & Multi-Tenant Design

Design and manage leaf-spine / Clos architectures using EVPN-VXLAN overlays
Build scalable, multi-tenant network architectures supporting workload isolation and segmentation for CaaS / GPUaaS platforms
Support DCI, backbone connectivity, and hybrid/cloud on-ramp strategies

Performance, Reliability & Optimization

Monitor and tune network performance for latency, throughput, and congestion across HPC environments
Perform deep packet inspection, traffic flow analysis, and root cause troubleshooting
Drive capacity planning and scaling strategies aligned with compute and GPU cluster growth
Ensure high availability through redundancy, failover validation, and operational rigor

Automation & Infrastructure Engineering

Develop network automation frameworks using Python, Ansible, Git, and Jinja2
Implement Infrastructure-as-Code (IaC) and CI/CD pipelines for network provisioning and changes
Standardize and scale network deployments across environments

Observability & Telemetry

Implement telemetry and monitoring solutions to provide real-time visibility into network performance
Analyze metrics to proactively identify risks and optimize system behavior
Integrate network observability into broader platform monitoring ecosystems

Cross-Functional Collaboration

Partner with HPC platform, compute, storage, and infrastructure teams to align network architecture with workload demands
Collaborate with architecture and engineering teams on new environment design and deployment
Work closely with vendors to validate performance, interoperability, and scalability

Technical Leadership

Serve as a senior escalation point for network incidents and complex troubleshooting
Mentor junior engineers and contribute to documentation, standards, and best practices
Drive continuous improvement across network architecture, operations, and tooling

Requirements

5-8+ years of experience designing and supporting large-scale data center networks
Experience supporting HPC, AI/ML, or GPU-accelerated infrastructure environments
Experience working with or supporting CaaS, GPUaaS, or multi-tenant platform architectures
Strong expertise with leaf-spine / Clos architectures; EVPN, VXLAN, BGP, and MPLS; and Cisco and/or Arista platforms (NX-OS, EOS, IOS-XR)
Strong understanding of low-latency, high-throughput network optimization
Proven troubleshooting experience in complex distributed environments

Technical Skills

Network automation: Python, Ansible, Jinja2, Git
Infrastructure-as-Code (IaC) and CI/CD pipelines
Network observability, telemetry, and monitoring tools
Packet analysis and traffic flow diagnostics

Preferred Experience

Experience with HPC networking concepts (GPU clusters, distributed training environments)
Familiarity with InfiniBand, RDMA, or RoCE networking
Experience in hyperscale or AI-focused data center environments
CCNP or equivalent certification preferred; CCIE or advanced certifications a plus

Additional Requirements

This position requires applicants to be currently authorized to work in the U.S. without employer sponsorship.
We are unable to sponsor or take over sponsorship of employment visas at this time.