Sr. Lead Test Engineer, Compute Server & Storage - AI Data Center

Celestica, Inc.
Richardson, United States of America
9 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Richardson, United States of America

Tech stack

Testing (Software)
Artificial Intelligence
Amazon Web Services (AWS)
Data analysis
Automation of Tests
Azure
Bash
BIOS
Ubuntu (Operating System)
CentOS
Command-Line Interface
Profiling
Computer Networks
Continuous Integration
Data Centers
Software Debugging
Linux
Distributed File Systems
RAID
Distributed Data Store
Ethernet
Serial ATA
Firmware
General Parallel File Systems
Hardware Design
InfiniBand
Python
Machine Learning
NetApp Applications
Open Source Technology
PCI Express
Red Hat Enterprise Linux - RHEL
Software Reliability Testing
TensorFlow
SAS (Software)
Software Engineering
Subsystems
TCP/IP
Strategies of Testing
Virtualization Technology
Ceph
Scripting (Bash/Python/Go/Ruby)
Google Cloud Platform
Cloud Platform System
Performance Testing
Data Ingestion
PyTorch
Perf (Linux)
Containerization
Kubernetes
Storage Technologies
Information Technology
Data Management
Hardware Infrastructure
Network Server
Docker
Server Operating Systems & Platforms
NVMe

Job description

The Senior Lead Server and Storage Test Engineer will play a pivotal role in the design, development, and execution of comprehensive test strategies for our AI data center's server infrastructure. This position requires deep expertise in enterprise storage systems, server architectures, and networking, and a strong understanding of the unique performance and reliability demands of AI/ML workloads. The ideal candidate will be a hands-on technical individual contributor capable of driving test automation and collaborating across engineering teams to deliver robust, high-performing solutions.

Knowledge/Skills/Competencies

  • Define, develop, and implement comprehensive test plans and strategies for all server and storage hardware, firmware, and software components within the AI data center environment.
  • Mentor and provide technical guidance to junior test engineers, fostering a culture of technical excellence and continuous improvement.
  • Design and implement automated test frameworks and scripts using languages like Python, Go, or similar, to improve efficiency and coverage of testing.
  • Conduct in-depth performance analysis and bottleneck identification for storage systems (e.g., NVMe, SSD, HDD arrays, JBODs, distributed storage), server platforms (e.g., CPU, GPU, BMC, BIOS, DIMM memory, PCIe, networking), and OpenBMC interfaces and features, including debugging issues related to BMC functionality and its interaction with server hardware.
  • Develop and maintain robust testbeds and infrastructure for continuous integration and validation.
  • Utilize open-source and commercial test tools relevant to storage, server, and OpenBMC validation.
  • Collaborate closely with hardware design, software development, infrastructure, and AI/ML engineering teams to understand requirements and integrate testing throughout the product lifecycle.
  • Communicate test progress, results, and critical issues effectively to stakeholders, including executive leadership.
  • Develop specialized test methodologies to validate performance and reliability under heavy AI/ML workloads (e.g., large model training, inference at scale, data ingestion).
  • Understand and test the interactions between GPU-accelerated computing, high-speed networking, PCIe switches, and storage systems.
  • Duties of this position are performed in a normal office environment.
  • Duties may require extended periods of sitting and sustained visual concentration on a computer monitor or on numbers and other detailed data.
  • Repetitive manual movements (e.g., data entry, using a computer mouse, using a calculator, etc.) are frequently required.
  • Occasional travel may be required.

Requirements

  • Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related technical field.
  • 7+ years of experience in hardware and/or software testing, with at least 5 years focused on enterprise-level server and storage systems.
  • 3+ years of proven experience in a lead or senior technical role, mentoring and guiding junior engineers or leading test initiatives.
  • Deep expertise in various storage technologies including NVMe, SAS/SATA SSDs/HDDs, RAID, distributed file systems (e.g., Ceph, Lustre, GPFS), SAN, and NAS.
  • Strong understanding of server architectures (x86, ARM, AMD/NVIDIA GPU-based servers), CPU/memory subsystems, PCIe, power management, and Baseboard Management Controller (BMC) functionality.
  • Proficiency in scripting languages (e.g., Python, Bash) for test automation and data analysis.
  • Experience with Linux operating systems (e.g., Ubuntu, CentOS, RHEL) and command-line tools.
  • Familiarity with networking concepts (Ethernet, TCP/IP, InfiniBand) and network testing methodologies.
  • Experience with test methodologies such as performance testing, reliability testing, stress testing, and fault injection.
  • Excellent problem-solving, analytical, and debugging skills.
  • Strong communication and interpersonal skills, with the ability to collaborate effectively across diverse teams.

Preferred Qualifications

  • Familiarity with OCP (Open Compute Project).
  • Experience with cloud environments (AWS, Azure, GCP) and virtualization technologies.
  • Knowledge of containerization technologies (Docker, Kubernetes).
  • Familiarity with AI/ML frameworks (e.g., TensorFlow, PyTorch) and their infrastructure requirements.
  • Experience with performance profiling tools (e.g., fio, Iometer, Perf, MLTT, VTune).
  • Contributions to open-source projects related to storage, servers, or testing.
  • Certifications in relevant technologies (e.g., NetApp, Dell EMC, HPE, NVIDIA).
  • Bachelor's degree or consideration of an equivalent combination of education and experience.

Educational requirements may vary by geography.

About the company

Celestica (NYSE, TSX: CLS) enables the world's best brands. Through our recognized customer-centric approach, we partner with leading companies in Aerospace and Defense, Communications, Enterprise, HealthTech, Industrial, Capital Equipment and Energy to deliver solutions for their most complex challenges. As a leader in design, manufacturing, hardware platform and supply chain solutions, Celestica brings global expertise and insight at every stage of product development - from drawing board to full-scale production and after-market services for products from advanced medical devices, to highly engineered aviation systems, to next-generation hardware platform solutions for the Cloud. Headquartered in Toronto, with talented teams spanning 40+ locations in 13 countries across the Americas, Europe and Asia, we imagine, develop and deliver a better future with our customers.
