HPC and AI Network Specialist
Job description
We are looking for an enthusiastic Network Specialist to join our team, operating and enhancing large-scale HPC and AI research infrastructure, including Dawn, one of the UK's fastest AI supercomputers.
You'll join the University of Cambridge's Research Computing Services, a national supercomputing centre providing services to world-renowned scientists, clinicians and engineers across the UK and Europe, and underpinning critical national research in fields including clean energy, personalised medicine, and climate science.
Your role
You will be responsible for the design and delivery of the networks that underpin the research computing services, ensuring they meet the requirements of peta-scale computational and data storage systems.
You will design and maintain the operational tools required to configure, monitor and secure the networks. You will provide guidance and technical leadership to other members of the HPC Data Centre Infrastructure team in the use of these tools.
What you will do
- Lead on the design and rollout of new network features for large-scale scientific workflows.
- Collaborate with other technical specialists to ensure effective solution delivery.
- Automate network configuration and deployment using Ansible, Python, and GitLab.
- Support and evolve the network backbone for HPC, storage, and AI infrastructure.
- Troubleshoot network performance issues.
- Implement changes across firewalls, routers, and switches, ensuring resilience and security.
- Participate in incident response, root cause analysis, and proactive improvements to service reliability.
What you will bring
Essential skills and experience
- Proven experience in data centre network design, operation and security controls.
- Hands-on proficiency in network automation using tools such as Ansible and Python.
- Strong understanding of networking protocols (e.g. TCP/IP, IPv4/6, IPsec, DNS).
- Familiarity with open source monitoring tools.
- Ability to work independently, communicate clearly, and collaborate with multidisciplinary teams.
Desirable experience
- Experience of HPC or AI infrastructure, especially high-speed interconnects (e.g. InfiniBand, RoCE).
- Experience working in security-certified environments (e.g. ISO 27001).
- Operational knowledge of Linux and server virtualisation.
- Experience supporting services in a scientific or research environment.
Why join us
The University of Cambridge is one of the world's most respected academic research institutions, where your expertise will help enable cutting-edge research and innovation. In the Research Computing Service, you'll be part of a collaborative, mission-driven team with access to:
- A modern hybrid working environment with excellent on-campus facilities.
- Opportunities for continuous learning and development with access to the latest technologies.
- Annual leave allowance of 41 days (including public holidays).
- University pension scheme.
- Access to University services, including libraries, fitness centres and cultural facilities.
- Access to a range of shopping and travel discounts through the Cambens scheme.
Once an offer of employment has been accepted, the successful candidate will be required to undergo a basic disclosure (criminal records) check and a security check.
To apply online for this vacancy and to view further information about the role, please click on the 'Apply' button above.
Queries should be directed to recruitment@uis.cam.ac.uk in the first instance, quoting reference VC47972.
Please submit your CV and a cover letter to apply.
The University actively supports equality, diversity and inclusion and encourages applications from all sections of society.
The University has a responsibility to ensure that all employees are eligible to live and work in the UK.