Senior Network Engineer
Job description
The Senior Network Engineer is responsible for the design, deployment, maintenance, optimization, and troubleshooting of high-performance AI infrastructure networks supporting GPU compute clusters, storage fabrics, and large-scale AI workloads. This role focuses on High-Speed Ethernet, InfiniBand, EVPN/VXLAN fabrics, BGP routing, GPU cluster networking, storage integration, and secure multi-tenant AI environments.
AI & HPC Network Architecture
- Design and maintain High-Speed Ethernet and InfiniBand fabrics for GPU clusters.
- Deploy and manage EVPN/VXLAN spine-leaf architectures.
- Configure and maintain BGP routing, VRFs, VLANs, MLAG, and high-availability networking.
- Design resilient AI infrastructure supporting multi-tenant environments and large-scale GPU workloads.
- Optimize east-west traffic flows for AI training and inference workloads.
GPU Cluster Infrastructure
- Support NVIDIA GPU cluster networking, including:
  - NCCL optimization
  - GPUDirect RDMA
  - RoCEv2
  - InfiniBand subnet management
- Troubleshoot cluster communication issues, including:
  - Link flapping
  - Congestion
  - Latency
  - Throughput bottlenecks
- Execute and validate NCCL performance testing.
Storage & Data Fabric
- Deploy and support high-performance storage platforms including WEKA and distributed AI storage systems.
- Configure storage VLANs, passthrough networking, and bonded interfaces.
- Optimize storage throughput and low-latency communication between compute and storage environments.
Firewall & Security
- Configure and maintain enterprise firewalls, including:
  - NAT
  - VIPs
  - VPN/IPSec
  - Traffic shaping
  - Security segmentation
- Implement secure multi-tenant access controls.
- Assist with AI governance and controlled AI integration environments.
Automation & Monitoring
- Develop infrastructure automation for:
  - Network provisioning
  - Firewall policy deployment
  - VLAN assignments
  - Server imaging
- Implement monitoring and alerting through Grafana and telemetry systems.
- Support API-driven infrastructure management and orchestration.
Data Center Operations
- Assist with deployment and operational planning for AI data center infrastructure.
- Support Tier III resiliency planning and redundancy validation.
- Coordinate with facilities, power, and mechanical teams, as well as external ISPs.
Requirements
The ideal candidate will possess deep experience with hyperscale or HPC networking, AI cluster deployments, data center operations, and advanced troubleshooting across compute, storage, and network fabrics.
- 5+ years of enterprise or data center networking experience.
- 2+ years supporting AI, HPC, or GPU cluster environments.
- Strong experience with:
  - BGP
  - EVPN/VXLAN
  - VLANs
  - MLAG
  - High-Speed Ethernet (100G/200G/400G/800G)
  - InfiniBand
  - Cumulus
- Experience with NVIDIA UFM and AI fabric management.
- Strong Linux administration skills.
- Experience troubleshooting GPU cluster communication issues.
- Experience with enterprise firewalls and network segmentation.
- Understanding of AI workload traffic patterns and storage networking.
Preferred Qualifications
- Experience with:
  - WEKA
  - RoCEv2
  - NCCL
  - Kubernetes
  - Slurm
  - CUDA environments
- Experience deploying Supermicro Datacenter Building Block Solutions (SMDCBBS).
- Familiarity with liquid-cooled GPU infrastructure.
- Experience with AI inference and training environments.
- Scripting or automation experience:
  - Python
  - Bash
  - Ansible
  - API integrations
Soft Skills
- Strong troubleshooting and analytical skills.
- Ability to operate in fast-paced production environments.
- Excellent communication and documentation abilities.
- Ability to coordinate across infrastructure, facilities, and engineering teams.
- Strong sense of ownership and operational accountability.
Preferred Certifications
- CCNP / CCIE
- NVIDIA Networking Certifications
- Fortinet NSE
- Linux Certifications
- Kubernetes Certifications
Benefits & conditions
- Health insurance
- Paid time off
- Vision insurance
- Dental insurance
- Disability insurance
- Wellness program

Hut 8 offers a benefits and wellness program that includes medical, dental, vision, life, and short-term and long-term disability insurance, as well as paid time off. We are proud to invest in building the best team in the industry. At all levels of the organization, we are driven by an entrepreneurial spirit, radical transparency, and a relentless growth mentality.
At Hut 8, you will have the opportunity to:
- Work with bright, driven peers from a range of educational and professional backgrounds including software development, energy, engineering, entrepreneurship, investment banking, private equity, and management consulting