Server Repair Engineering Supervisor
Job description
The Server Repair Engineering Supervisor is responsible for leading both the technical validation of AI server hardware/software systems and the optimization of operational workflows within a high-performance AI Server Service Center. This dual-role position oversees test engineering operations, drives process improvement initiatives, and ensures quality, reliability, and efficiency across service center activities. The role requires strong technical expertise, process-driven leadership, and hands-on experience with AI server technology.
- Lead a team of test engineers performing diagnostics, validation, and troubleshooting on AI server hardware and software.
- Establish, implement, and monitor test procedures to ensure compliance with Dell quality standards and internal requirements.
- Evaluate AI servers for performance, reliability, and functionality using advanced diagnostic tools and methodologies.
- Develop, refine, and automate testing scripts and procedures for server systems and components.
- Collaborate closely with Product Development, Quality Assurance, and Engineering teams to identify issues and drive resolution during testing and validation stages.
- Analyze, design, and improve operational workflows for testing, repair, refurbishment, and upgrades of AI servers.
- Lead initiatives that enhance throughput, quality, and cost efficiency across service center operations.
- Conduct root cause analysis for process-related failures and establish robust corrective and preventive action plans.
- Continuously research industry best practices to ensure alignment with modern process optimization and manufacturing engineering standards.
- Perform capacity planning to support scalable testing and service operations.
Leadership & Operational Management
- Supervise and mentor the test engineering (TE) and process engineering (PE) teams, providing guidance, coaching, and technical expertise.
- Allocate resources, establish project priorities, and ensure timely completion of testing and process-related deliverables.
- Maintain compliance with quality, environmental, and safety standards (ISO, internal AI standards, regulatory guidelines).
- Communicate operational updates, challenges, risks, and improvement plans to leadership and cross-functional partners.
- Serve as the point of escalation for complex technical or operational issues within the service center.
Requirements
- Bachelor's degree in Electrical Engineering, Industrial Engineering, Computer Science, or related field required; Master's degree preferred.
- 8 to 10 years of relevant experience in test engineering, process engineering, or hardware/software system operations.
- Minimum 3 years of supervisory or technical leadership experience.
- Strong preference for experience with AI servers, high-performance computing systems, or advanced enterprise server environments.
Preferred Certifications
- EMC Proven Professional or comparable server/hardware certifications.
- Six Sigma Green Belt or Black Belt certification (process optimization).
- Certifications related to AI/ML hardware or data workflows (e.g., Deep Learning Institute credentials).
Essential Skills
- Expertise in server diagnostics and troubleshooting for CPUs, GPUs, memory, storage, power supplies, and other critical components.
- Strong working knowledge of AI server platforms (e.g., Dell PowerEdge, NVIDIA DGX) and related AI/ML frameworks such as TensorFlow or PyTorch.
- Proficiency with process optimization methodologies (Six Sigma, Lean, Kaizen).
- Experience with test automation tools and scripting languages (Python, MATLAB, LabVIEW, etc.).
- Familiarity with server management platforms such as iDRAC and IPMI, and operating systems including Linux and Windows Server.
- Ability to support high-performance computing environments and advanced AI server technologies.
- Strong analytical, problem-solving, communication, and continuous improvement skills.