HPC Systems Administrator
Job description
Locations: UK, London (must be willing to travel to client sites throughout the UK on an ad hoc basis)
*Design, deploy, and manage HPC infrastructures including GPU clusters and parallel computing environments.
*Support AI model training platforms by maintaining compute resources, optimizing scheduling, and ensuring compatibility with AI frameworks and libraries.
*Monitor and analyse performance metrics, and fine-tune systems to address bottlenecks or inefficiencies.
*Develop and maintain automation scripts and tools (e.g., PowerShell, Python, Bash) to streamline operational tasks, monitoring, and reporting.
*Document architecture, configurations, processes, and resolutions for compliance, knowledge transfer, and continuous improvement.
*Participate in root cause analysis (RCA) and post-incident reviews for compute or HPC-related incidents, implementing preventive measures as needed.
Requirements
*Expertise in an HPC environment, including GPU cluster administration (e.g., NVIDIA, AMD) and workload schedulers such as SLURM or PBS.
*Proficiency with AI model training workflows and experience supporting popular AI/ML frameworks (e.g., TensorFlow, PyTorch, CUDA).
*Solid understanding of networking, storage, and server platforms in both Windows and Linux environments.
*Advanced analytical, troubleshooting, and performance tuning skills, with the ability to diagnose and resolve complex compute and HPC issues.
*Experience with automation, monitoring platforms, and scripting languages (e.g., Python, PowerShell, Bash) to enhance operational efficiency.
*Strong communication and collaboration skills, with a track record of working effectively across technical and non-technical teams.
*Familiarity with compliance, data security, and best practices for compute and HPC environments.
Benefits & conditions
Salary: Competitive salary and package (depending on level of experience)
Accenture are partnering with scaled UK AI compute pioneers to lead the charge on next-generation infrastructure for sovereign AI. To support this endeavour, we're building a high-performance compute operations team in London.
Our work will be sensitive and secure, and will run on the most up-to-date high-density compute stacks available.