Engineer HPC Operations

Halian International
Paris, France
23 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English

Tech stack

Artificial Intelligence
Bash
Linux
Distributed Data Store
Monitoring of Systems
Python
Machine Learning
Performance Tuning
Azure
Scripting (Bash/Python/Go/Ruby)
High Performance Computing
Kubernetes
Infrastructure Automation Frameworks
Slurm

Job description

We are looking for a Principal Engineer - HPC Operations to lead the operational excellence of large-scale high-performance computing platforms supporting advanced AI and machine learning workloads. This role combines deep technical expertise, strong operational ownership, and leadership, with a focus on reliability, performance optimisation, and automation across distributed environments.

Responsibilities:

· Ensure the day-to-day operational stability of HPC platforms, covering compute, storage, networking, and scheduling layers.
· Drive performance optimisation and capacity efficiency, maximising resource utilisation while reducing incidents and downtime.
· Act as the technical owner for HPC environments, including new platform deployments and major evolutions.
· Serve as the senior escalation point for complex operational incidents, leading resolution and root cause analysis.
· Define and enforce scheduling, prioritisation, and workload governance policies to balance fairness, efficiency, and business needs.
· Mentor and guide operations engineers, promoting best practices, automation, and continuous improvement.

With over 28 years of experience, we have come to understand that innovation is the only way to provide agile, practical solutions that transform businesses and careers. Our resourcing and smart services help you to realize tomorrow's potential. Discover the amazing things possible when you bring the right people and the right technologies together.

At Halian, we recognize that diversity, equity, and inclusion (DEI) are essential to building high-performing teams for our clients. We are committed to connecting organizations with top talent from all backgrounds, ensuring that every individual feels valued, respected, and empowered to contribute their unique perspectives. We encourage applications from all qualified candidates, regardless of race, gender, disability, or any other characteristic that makes them unique. By fostering diverse and inclusive workplaces, we help our clients drive innovation, enhance collaboration, and better reflect the communities they serve.

Requirements

· Strong experience operating large-scale HPC or AI/ML platforms in production environments.
· Hands-on expertise with workload schedulers and orchestration platforms (e.g. Slurm, Kubernetes).
· Solid knowledge of GPU-based workloads, performance tuning, and resource management.
· Proven experience with monitoring and observability tools to ensure system health and performance.
· Advanced automation and scripting skills (e.g. Python, Bash, Infrastructure as Code).
· Deep understanding of Linux systems, high-speed networking, and distributed storage architectures.
