HPC Systems Administrator

Accenture
Charing Cross, United Kingdom
2 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English

Job location

Charing Cross, United Kingdom

Tech stack

Microsoft Windows
Artificial Intelligence
Bash
Nvidia CUDA
Data Security
Linux
Python
Parallel Computing
Performance Tuning
Powershell
TensorFlow
Scripting (Bash/Python/Go/Ruby)
PyTorch
Slurm

Job description

Locations: UK, London (must be willing to travel to client sites throughout the UK on an ad hoc basis)

*Design, deploy, and manage HPC infrastructures including GPU clusters and parallel computing environments.

*Support AI model training platforms by maintaining compute resources, optimizing scheduling, and ensuring compatibility with AI frameworks and libraries.

*Monitor and analyse performance metrics, and fine-tune systems to address bottlenecks or inefficiencies.

*Develop and maintain automation scripts and tools (e.g., PowerShell, Python, Bash) to streamline operational tasks, monitoring, and reporting.

*Document architecture, configurations, processes, and resolutions for compliance, knowledge transfer, and continuous improvement.

*Participate in root cause analysis (RCA) and post-incident reviews for compute or HPC-related incidents, implementing preventive measures as needed.

Requirements

*Expertise in an HPC environment, including GPU cluster administration (e.g., NVIDIA, AMD) and workload schedulers such as SLURM or PBS.

*Proficiency with AI model training workflows and experience supporting popular AI/ML frameworks and toolkits (e.g., TensorFlow, PyTorch, CUDA).

*Solid understanding of networking, storage, and server platforms in both Windows and Linux environments.

*Advanced analytical, troubleshooting, and performance tuning skills, with the ability to diagnose and resolve complex compute and HPC issues.

*Experience with automation, monitoring platforms, and scripting languages (e.g., Python, PowerShell, Bash) to enhance operational efficiency.

*Strong communication and collaboration skills, with a track record of working effectively across technical and non-technical teams.

*Familiarity with compliance, data security, and best practices for compute and HPC environments.

Benefits & conditions

Salary: Competitive salary and package (depending on level of experience)

Accenture are partnering with scaled UK AI compute pioneers to lead the charge on next-generation infrastructure for sovereign AI. To support this endeavour, we're building a high-performance compute operations team in London.

Our work will be sensitive and secure, and will run on the most up-to-date high-density compute stacks available.
