Senior Front-End Network Engineer, AI Infrastructure Operations

NSCALE, LLC

San Francisco, United States of America

2 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

San Francisco, United States of America

Tech stack

Artificial Intelligence

Data analysis

Border Gateway Protocol

Common Lisp Object Systems

Configuration Management

Complex Networks

Data Centers

ETL

Linux

Ethernet

Firmware

Monitoring of Systems

Storage Area Network (SAN)

Python

Network Architecture

Routing

Open Shortest Path First

Performance Tuning

Prometheus

Shell Script

Data Streaming

AI Infrastructure

Cloud-native Network Functions (CNF)

Computer Network Operations

Grafana

AI Platforms

Front End Software Development

Job description

Within Nscale, the Network Operations team is responsible for the performance and reliability of the high-speed networks that underpin our AI platforms. These front-end networks are critical to inference workloads, cluster management, data movement, and storage connectivity.

In this role, you will be responsible for the day-to-day health, stability, and performance of Nscale's large-scale Ethernet front-end networks. You'll bring deep operational expertise from hyperscale or high-performance environments and play a key role in incident response, performance tuning, automation, and continuous improvement of production AI networking systems.

Owning the operational health, configuration consistency, and performance tuning of large-scale Ethernet front-end fabrics (leaf-spine / Clos) supporting AI inference, management, and storage workloads
Leading the diagnosis and resolution of complex network incidents (P0/P1), spanning optics, routing, switching hardware, long-haul circuits, and storage connectivity layers
Driving blameless postmortems and implementing preventative fixes to improve long-term fabric stability and availability
Partnering with SREs to define requirements for automation and tooling, and contributing to network provisioning, validation, and monitoring systems
Collaborating with Network Architecture and Engineering teams to validate designs and enforce standards for routing, congestion management, firmware baselines, and change safety
Monitoring fabric utilisation and performance, identifying bottlenecks, and tuning for predictable latency and throughput on front-end networks
Acting as a subject matter expert for cross-functional teams on high-speed Ethernet networking, long-haul/DCI circuits, and storage network integration
Participating in an on-call rotation supporting mission-critical, customer-facing infrastructure

Requirements

Do you have experience in Optics?

5+ years of experience in network engineering, with at least 3 years operating large-scale Ethernet data centre or cloud networks
Deep, hands-on operational experience with high-speed Ethernet fabrics in hyperscale or production environments
Strong expertise with Arista (EOS) and/or Nokia (7220 IXR / 7250 IXR / 7750 SR series) platforms
Solid understanding of modern data centre networking, including BGP, OSPF, ECMP, EVPN-VXLAN, and leaf-spine architectures
Proven experience with long-haul circuits and DCI (dark fiber, carrier Ethernet, coherent optics)
Experience with storage networking over Ethernet and shared storage connectivity
Proven ability to troubleshoot complex network issues using Linux-based tooling and fabric diagnostics
Proficiency in Python, Go, or shell scripting for automation, data analysis, or configuration management
Experience working in a 24/7 operational environment with a strong focus on reliability and toil reduction

Extensive hands-on experience with Arista or Nokia platforms at scale
Deep familiarity with front-end network patterns for large AI clusters (inference traffic, management networks, and storage integration)
Experience operating large-scale DCI / long-haul optical or carrier networks
Strong background in network observability and telemetry systems (streaming telemetry, sFlow, Prometheus, Grafana, etc.)
Prior experience in automation-first network operations or building internal tooling

Benefits & conditions

Highly competitive package (base + equity) with reviews every 12 months.
Join the fastest-growing tech startup: your chance to push boundaries, collaborate with brilliant minds, and make your mark on cutting-edge AI.
Expect a dynamic progression plan tailored to your ambitions. Grow by trying new things, leading, challenging the status quo, and owning your impact, always with our full support.

About the company

Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility. We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you'll build trust through openness and transparency, where everyone is inspired to do their best work. If you join our team, you'll be contributing to building the technology that powers the future.
