Senior Software Engineer - Fleet Management
Job description
We're hiring a Senior Software Engineer to build our Fleet Manager platform - the workflow automation system that provisions, tests, and remediates GPU nodes and network switches at scale.
You'll build foundational Python-based automation systems that manage the entire lifecycle of our compute infrastructure: device enrolment, burn-in testing, network configuration, GPU health monitoring, and self-healing capabilities. This role is for someone obsessed with distributed systems at scale, infrastructure reliability, scalability, security, and continuous improvement.
What you'll do
- Build workflow automation systems for GPU node and network switch lifecycle management at scale
- Design foundational platform components with established software patterns that others build on
- Implement device provisioning, burn-in testing, network configuration, and hardware health validation workflows
- Integrate with datacenter infrastructure management systems, cloud orchestration platforms, and bare metal provisioning tools
- Build distributed workflow orchestration systems to coordinate complex automation tasks across the fleet
- Drive technical strategy for reliability, observability, incident response, and operational excellence
- Partner with Infrastructure, Platform, and SRE teams to automate hardware lifecycle operations
- Use AI tools to accelerate delivery while maintaining architectural coherence
Requirements
- You have 5+ years of software engineering experience building and operating production systems, with a focus on infrastructure automation or workflow tooling
- Strong proficiency in Python (Fleet Manager is built entirely in Python)
- You are driven by building distributed systems at scale, infrastructure reliability, scalability, security, and continuous improvement
- Technical expertise: quickly understanding systems design tradeoffs, keeping track of rapidly evolving software systems
- You use AI tools like Claude or Cursor as a core part of your development workflow, treating them as a fundamental multiplier of what you can build
- You have delivered automation systems from ambiguous requirements to operational systems in production, with hands-on day 2 operations experience (monitoring, incident response, performance optimisation)
- Strong problem-solving skills and ability to work independently in a fast-paced, high-agency environment
- Excellent communication skills to build consensus with stakeholders, both internally and externally
Nice to have
- Experience with workflow orchestration tools like Temporal, Airflow, Prefect, or similar
- Hands-on experience with infrastructure tooling: DCIMs, NetBox, OpenStack, or ERP systems
- Bare metal provisioning and automation: MAAS, Ironic, IPMI, PXE boot, or network automation
- Experience building hardware lifecycle automation: provisioning, validation, testing, or remediation workflows
- GPU infrastructure experience: health monitoring, burn-in testing, or cluster management
- HPC and networking: datacenter topology, high-performance interconnects (InfiniBand, RoCE)
- Deep knowledge of Kubernetes, Infrastructure as Code (Terraform, Pulumi), AWS, and GCP
- Open-source contributions in infrastructure automation or cloud-native tooling
Benefits & conditions
- Highly competitive package (base + equity) with reviews every 12 months.
- Join the fastest-growing tech startup. This is your chance to push boundaries, collaborate with brilliant minds, and make your mark on cutting-edge AI.
- Expect a dynamic progression plan tailored to your ambitions. Grow by trying new things, leading, challenging the status quo, and owning your impact, always with our full support.
- Human-First Flexibility: We treat you as humans first. Our flexible workplace trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.
- Join our thriving remote-first team. Geography is no barrier to impact or connection. We build seamless virtual collaboration, empowering you, wherever you work.