HPC Solution Architect
Job description
The Software Engineering team delivers next-generation software application enhancements and new products for a changing world. Working at the cutting edge, we design and develop software for platforms, peripherals, applications and diagnostics, using the most advanced technologies, tools, and software engineering methodologies, in collaboration with internal and external partners.
As a Senior Software Principal Engineer, you will be responsible for developing sophisticated systems and software solutions based on the customer's business goals, needs, and general business environment.
We are hiring a Senior HPC Solution Architect to design, deploy, and support large-scale HPC and AI clusters for enterprise, research, and hyperscale customers. This is a hands-on, customer-facing Individual Contributor role that blends Linux systems engineering, cluster lifecycle automation, provisioning frameworks (Omnia/OpenCHAMI), Slurm/Kubernetes, and deep troubleshooting of production environments. The role is ideal for strong technical engineers who enjoy solving complex customer problems, contributing to open source, and shaping modern HPC deployment practices.
You will:
- Lead customer architecture & design, translating HPC/AI workload requirements into scalable cluster architectures (compute, schedulers, storage, interconnects)
- Deploy and operationalize clusters using Omnia or similar automation, including provisioning, scheduler bring-up, telemetry, authentication, and repo management
- Build and maintain provisioning workflows (OpenCHAMI-based or equivalent) covering PXE/iPXE boot, cloud-init, security, and identity/cert operations
- Serve as Tier-3 engineering escalation, troubleshooting complex provisioning, scheduling, GPU, networking, and performance issues; perform RCAs and drive permanent fixes
- Contribute to open source and customer enablement through code contributions, documentation, workshops, runbooks, templates, and field readiness materials
Requirements
- HPC & Distributed Systems: 8+ years of experience engineering large-scale HPC and distributed infrastructure, with strong knowledge of cluster architecture, schedulers, and provisioning workflows
- Linux & Automation: Deep experience with RHEL/Rocky/Ubuntu; hands-on cluster deployments using open-source toolchains, Omnia, and OpenCHAMI (composable provisioning, cloud-init, microservices)
- Schedulers, Containers & Observability: Production experience with Slurm and/or Kubernetes; proficient with Docker/Podman, OpenTelemetry pipelines, and telemetry instrumentation
- Networking, Fabrics & Streaming: Solid L2/L3 fundamentals, PXE/iPXE, DHCP/TFTP; experience with InfiniBand/RoCE/Omni-Path fabrics and event streaming with Kafka
- Scripting, Monitoring & Customer Engagement: Strong skills in Ansible, Python, and Bash; expertise with Prometheus and Grafana dashboards; proven communication skills for handling escalations and explaining complex HPC concepts in simple terms
Benefits & conditions
Dell is committed to fair and equitable compensation practices. The salary range for this position is $210,000 - $265,000.
Benefits and Perks of working at Dell Technologies
Your life. Your health. Supported by your benefits. You can explore the overall benefits experience that awaits you as a Dell Technologies team member right now at MyWellatDell.com.