SW Engineer
Job description
Site Reliability Engineer (SRE) - GPU Infrastructure Data Centres
Fully Remote Role - Work from home

The Site Reliability Engineer (SRE) is responsible for the end-to-end validation, testing, and readiness of GPU compute clusters prior to production release. The role ensures that all hardware, networking, and system components meet operational and reliability standards before customer workloads are deployed.

Working closely with global infrastructure and engineering teams, the SRE plays a critical role in maintaining the quality, stability, and integrity of high-performance compute environments.

Key Responsibilities

Cluster Validation & Testing
- Validate GPU clusters of varying sizes to ensure hardware and system integrity prior to production release
- Perform functional and reliability testing of GPUs, servers, and associated components
- Verify network connectivity and performance, including high-speed interconnects where applicable

Orchestration & Benchmarking
- Provision and configure GPU clusters using automated workflows
- Execute and analyse performance and stability benchmarks orchestrated via a workload scheduler
- Validate results against expected performance and reliability thresholds

Test Framework & Automation
- Maintain and extend the automated validation framework built using Python and Ansible
- Integrate new test cases to support additional hardware platforms and GPU generations
- Improve test reliability, coverage, and execution efficiency

Remediation & System Integrity
- Diagnose and remediate unhealthy nodes through configuration changes or software fixes
- Coordinate with on-site support teams for hardware replacements when required
- Ensure all issues are resolved and documented prior to handover to production operations

Documentation & Handover
- Produce clear, accurate documentation of test results, hardware states, and remediation actions
- Ensure smooth handovers to operations and engineering teams
- Maintain up-to-date runbooks and validation procedures

Team Collaboration & Training
- Work as part of a distributed, international infrastructure and engineering team
- Participate in knowledge sharing, process improvement, and technical reviews
- The working language is English; additional language skills are beneficial

Requirements

Shift & Availability Requirements
- Ability to work independently within a remote environment
- Reliable internet connection and suitable home working setup
- Role is fully remote; company hardware will be provided

Skills & Experience

Essential
- Strong hands-on experience administering and troubleshooting Linux systems
- Confident use of CLI tools for diagnostics, including analysis of kernel logs, drivers, and system services
- Proven experience writing and maintaining Ansible playbooks
- Proficiency in Python for automation, test execution, and parsing results
- Strong analytical and problem-solving skills with attention to detail
- Excellent written and verbal English communication skills
- High standards for system reliability, consistency, and documentation

Preferred / Desirable
- Experience working with GPU-based or high-performance compute environments
- Familiarity with workload schedulers (e.g. Slurm or similar tools)
- Understanding of data centre hardware lifecycle and server validation processes
- Exposure to high-speed networking technologies
- Experience working with distributed or remote infrastructure teams

Performance & Success Metrics
- Accuracy and completeness of cluster validation prior to production release
- Reduction in post-deployment hardware or configuration issues
- Quality and clarity of validation documentation and handover materials
- Effectiveness of remediation and coordination with on-site teams
- Reliability and maintainability of automated test frameworks
- Collaboration and communication quality with engineering and operations teams

Similar jobs

High performance computing expert
Job Description: HPC Expert (Must have SC Security clearance) Pay: £450-500 per day PAYE 600-700 Umbrella Location: Nottingham/Derby Start date: ASAP Role...

SW Engineer (Distributed Computing, AWS, Python, C#/C++)
Oxford - 3-4 days per week in office. £45000 - £68000 + Package.
- Must have a Computing/STEM Degree (2:1 or higher).
- Can work in their Oxford head office 3-4 days a week.
- Must have experience with AWS / Distributed...
Benefits & conditions
© 2026, Jobsora.com