Engineer HPC Operations

Halian International

Paris, France

23 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Job location

Paris, France

Tech stack

Artificial Intelligence

Bash

Linux

Distributed Data Store

Monitoring of Systems

Python

Machine Learning

Performance Tuning

Azure

Scripting (Bash/Python/Go/Ruby)

High Performance Computing

Kubernetes

Infrastructure Automation Frameworks

Slurm

Job description

We are looking for a Principal Engineer - HPC Operations to lead the operational excellence of large-scale high-performance computing platforms supporting advanced AI and machine learning workloads. This role combines deep technical expertise, strong operational ownership, and leadership, with a focus on reliability, performance optimisation, and automation across distributed environments.

Responsibilities:

· Ensure the day-to-day operational stability of HPC platforms, covering compute, storage, networking, and scheduling layers.
· Drive performance optimisation and capacity efficiency, maximising resource utilisation while reducing incidents and downtime.
· Act as the technical owner for HPC environments, including new platform deployments and major evolutions.
· Serve as the senior escalation point for complex operational incidents, leading resolution and root cause analysis.
· Define and enforce scheduling, prioritisation, and workload governance policies to balance fairness, efficiency, and business needs.
· Mentor and guide operations engineers, promoting best practices, automation, and continuous improvement.

With over 28 years of experience, we have come to understand that innovation is the only way to provide agile, practical solutions that transform businesses and careers. Our resourcing and smart services help you to realize tomorrow's potential. Discover the amazing things possible when you bring the right people and the right technologies together.

At Halian, we recognize that diversity, equity, and inclusion (DEI) are essential to building high-performing teams for our clients. We are committed to connecting organizations with top talent from all backgrounds, ensuring that every individual feels valued, respected, and empowered to contribute their unique perspectives. We encourage applications from all qualified candidates, regardless of race, gender, disability, or any other characteristic that makes them unique.
By fostering diverse and inclusive workplaces, we help our clients drive innovation, enhance collaboration, and better reflect the communities they serve.

Requirements

· Strong experience operating large-scale HPC or AI/ML platforms in production environments.
· Hands-on expertise with workload schedulers and orchestration platforms (e.g. Slurm, Kubernetes).
· Solid knowledge of GPU-based workloads, performance tuning, and resource management.
· Proven experience with monitoring and observability tools to ensure system health and performance.
· Advanced automation and scripting skills (e.g. Python, Bash, Infrastructure as Code).
· Deep understanding of Linux systems, high-speed networking, and distributed storage architectures.

