HPC/AI Infrastructure Architect
Role details
Job description
We are seeking a highly skilled and visionary HPC/AI Infrastructure Architect - Pre-Sales Specialist to lead the technical design and architecture of large-scale AI infrastructure solutions. This role is pivotal in shaping next-generation AI Factories, supporting customer engagements, and driving technical excellence across the compute, interconnect, and software stack domains.
1. AI Factory Architecture & Design (35%)
- Design GPU cluster architectures tailored for AI and HPC workloads.
- Define node configurations for diverse workload types, including dense GPU nodes, cost-optimized nodes, and high-memory CPU nodes.
- Specify and validate performance metrics including compute throughput, memory bandwidth, and power consumption.
- Architect multi-tier interconnect networks using NVLink, InfiniBand, and high-speed Ethernet.
- Develop topology designs and calculate bandwidth and latency targets.
- Model performance for customer workloads and validate against industry benchmarks.
2. Pre-Sales Technical Leadership (30%)
- Lead technical discussions with customer architects and stakeholders.
- Conduct workload sizing and architectural presentations.
- Develop technical content for proposals, including bills of materials (BoMs), compliance matrices, and scoring alignment.
- Analyze competitor solutions and articulate technical differentiators.
3. Demonstrator Lab Development (20%)
- Design and expand lab infrastructure for AI workload testing and validation.
- Build reference architectures across industries such as finance, manufacturing, healthcare, and research.
- Support lab operations including cluster configuration, workload orchestration, and software stack maintenance.
4. Customer Demonstrations & PoCs (10%)
- Deploy and showcase customer-specific AI workloads including LLM training, computer vision, and scientific simulations.
- Manage proof-of-concept projects, define success criteria, and present outcomes to stakeholders.
5. Technical Expertise & Innovation (5%)
- Maintain relationships with key technology vendors and participate in early access programs.
- Evaluate emerging technologies and contribute to innovation roadmaps and adoption strategies.
Requirements
Technical Competencies
- GPU Architectures: NVIDIA (H100, H200, B100, B200), AMD (MI300X), Intel (Gaudi 2/3)
- Interconnects: InfiniBand (HDR/NDR/XDR), NVLink, RoCE, Infinity Fabric
- Storage Systems: Lustre, GPFS, BeeGFS, NVMe-oF, S3-compatible object storage
- Container Platforms: Kubernetes, Docker, Singularity/Apptainer
- Performance Tools: NVIDIA Nsight, ROCm, Intel VTune
Certifications (Preferred)
- NVIDIA Deep Learning Institute (DLI)
- Red Hat Certified Specialist in OpenShift
- InfiniBand Certified Professional
Experience
- 8+ years in HPC/AI infrastructure design
- 5+ years working with GPU-accelerated systems
- Proven experience with large-scale GPU deployments (1000+ GPUs)
- Successful track record in technical bid support and customer engagement