Principal Software Engineer - Fleet Management
Job description
We're hiring a Principal Software Engineer to lead the technical development of our Fleet Manager platform - the workflow automation system that provisions, tests, and remediates GPU nodes and network switches at scale.
As technical lead, you'll own the architecture and delivery of foundational Python-based automation systems that manage the entire lifecycle of our compute infrastructure: device enrolment, burn-in testing, network configuration, GPU health monitoring, and self-healing capabilities. You'll mentor a team of senior engineers, set technical direction, and drive engineering excellence while remaining hands-on with critical systems.
What you'll do
- Lead technical architecture and roadmap for Fleet Manager's workflow automation systems
- Own end-to-end delivery of device provisioning, validation, testing, and remediation workflows at scale
- Design and build workflow orchestration systems for GPU node and network switch lifecycle management
- Establish engineering standards for reliability, observability, and operational excellence across all Fleet Manager services
- Mentor and raise the bar for a team of senior engineers through design reviews, technical leadership, and hands-on collaboration
- Drive architecture decisions balancing automation complexity, reliability, and maintainability
- Integrate with infrastructure tooling: DCIMs, NetBox, OpenStack, bare metal APIs (MAAS, Ironic, IPMI)
- Partner with Infrastructure, Platform, and SRE teams to translate operational needs into robust, scalable automation
- Build production-grade Python systems for hardware lifecycle automation, leveraging AI tools to accelerate delivery
Requirements
- You have 10+ years of software engineering experience building and operating production systems, with proven technical leadership in infrastructure automation or workflow tooling
- Strong Python engineering fundamentals with experience leading complex, multi-service distributed systems
- You are driven by building distributed systems at scale, infrastructure reliability, scalability, security, and continuous improvement
- Technical expertise: quickly understanding systems design tradeoffs, keeping track of rapidly evolving software systems
- Track record of owning technical roadmaps and delivering large-scale automation systems from ambiguous requirements to production
- You use AI tools like Claude or Cursor as a core part of your development workflow - as a fundamental multiplier of what you can build
- Deep understanding of operational excellence: SLOs, monitoring, alerting, incident response, and production reliability
- Strong mentorship skills with ability to develop high-performing engineering teams
- Excellent communication skills to build consensus with stakeholders, both internally and externally
Nice to have
- Experience with workflow orchestration tools like Temporal, Airflow, Prefect, or similar
- Hands-on experience with infrastructure tooling: DCIMs, NetBox, OpenStack, or ERP systems
- Bare metal provisioning and automation: MAAS, Ironic, IPMI, PXE boot, or network automation
- Experience building hardware lifecycle automation: provisioning, validation, testing, or remediation workflows
- GPU infrastructure experience: health monitoring, burn-in testing, or cluster management
- HPC and networking: datacenter topology, high-performance interconnects (InfiniBand, RoCE)
- Deep knowledge of Kubernetes, Infrastructure as Code (Terraform, Pulumi), AWS, and GCP
- Track record of 1+ years leading large-scale, complex projects or technical teams
- Open-source contributions in infrastructure automation or cloud-native tooling
Benefits & conditions
- Highly competitive package (base + equity) with reviews every 12 months.
- Join the fastest-growing tech startup: this is your chance to push boundaries, collaborate with brilliant minds, and make your mark on cutting-edge AI.
- Expect a dynamic progression plan tailored to your ambitions. Grow by trying new things, leading, challenging the status quo, and owning your impact, always with our full support.
- Human-First Flexibility: We treat you as humans first. Our flexible workplace trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.
- Join our thriving remote-first team. Geography is no barrier to impact or connection. We build seamless virtual collaboration, empowering you, wherever you work.