Senior Front-End Network Engineer, AI Infrastructure Operations
Job description
Within Nscale, the Network Operations team is responsible for the performance and reliability of the high-speed networks that underpin our AI platforms. These front-end networks are critical to inference workloads, cluster management, data movement, and storage connectivity.
In this role, you will be responsible for the day-to-day health, stability, and performance of Nscale's large-scale Ethernet front-end networks. You'll bring deep operational expertise from hyperscale or high-performance environments and play a key role in incident response, performance tuning, automation, and continuous improvement of production AI networking systems.
- Owning the operational health, configuration consistency, and performance tuning of large-scale Ethernet front-end fabrics (leaf-spine / Clos) supporting AI inference, management, and storage workloads
- Leading the diagnosis and resolution of complex network incidents (P0/P1), spanning optics, routing, switching hardware, long-haul circuits, and storage connectivity layers
- Driving blameless postmortems and implementing preventative fixes to improve long-term fabric stability and availability
- Partnering with SREs to define requirements for automation and tooling, and contributing to network provisioning, validation, and monitoring systems
- Collaborating with Network Architecture and Engineering teams to validate designs and enforce standards for routing, congestion management, firmware baselines, and change safety
- Monitoring fabric utilisation and performance, identifying bottlenecks, and tuning for predictable latency and throughput on front-end networks
- Acting as a subject matter expert for cross-functional teams on high-speed Ethernet networking, long-haul/DCI circuits, and storage network integration
- Participating in an on-call rotation supporting mission-critical, customer-facing infrastructure
Requirements
Do you have experience in optics?
- 5+ years of experience in network engineering, with at least 3 years operating large-scale Ethernet data centre or cloud networks
- Deep, hands-on operational experience with high-speed Ethernet fabrics in hyperscale or production environments
- Strong expertise with Arista (EOS) and/or Nokia (7220 IXR / 7250 IXR / 7750 SR series) platforms
- Solid understanding of modern data centre networking, including BGP, OSPF, ECMP, EVPN-VXLAN, and leaf-spine architectures
- Proven experience with long-haul circuits and DCI (dark fibre, carrier Ethernet, coherent optics)
- Experience with storage networking over Ethernet and shared storage connectivity
- Proven ability to troubleshoot complex network issues using Linux-based tooling and fabric diagnostics
- Proficiency in Python, Go, or shell scripting for automation, data analysis, or configuration management
- Experience working in a 24/7 operational environment with a strong focus on reliability and toil reduction

- Extensive hands-on experience with Arista or Nokia platforms at scale
- Deep familiarity with front-end network patterns for large AI clusters (inference traffic, management networks, and storage integration)
- Experience operating large-scale DCI / long-haul optical or carrier networks
- Strong background in network observability and telemetry systems (streaming telemetry, sFlow, Prometheus, Grafana, etc.)
- Prior experience in automation-first network operations or building internal tooling
Benefits & conditions
- Highly competitive package (base + equity) with reviews every 12 months.
- Join the fastest-growing tech startup: your chance to push boundaries, collaborate with brilliant minds, and make your mark on cutting-edge AI.
- Expect a dynamic progression plan tailored to your ambitions. Grow by trying new things, leading, challenging the status quo, and owning your impact, always with our full support.