Data Center Operations Engineer
Job description
The Data Center Operations Engineer is responsible for supporting, maintaining, and deploying critical data center infrastructure with a strong focus on Linux-based systems, GPU server deployments, and InfiniBand networking. This role requires hands-on expertise in data center operations, cluster bring-up, hardware installation, and troubleshooting across compute, network, and GPU environments. The engineer will collaborate closely with global infrastructure, development, and operations teams to ensure reliable, secure, and scalable service delivery.
- Provide hands-on operational support for all data center projects, deployments, and repair activities.
- Participate in an on-call rotation and provide on-site or remote support during maintenance windows and incidents.
- Troubleshoot and resolve operational issues related to Linux servers, GPU platforms, networking, and storage infrastructure.
- Support customer and internal deployments, ensuring timely and successful bring-up of GPU servers and clusters.
- Perform InfiniBand fabric bring-up, switch configuration, subnet management, and troubleshooting.
- Conduct daily health checks of Linux systems and infrastructure components, proactively identifying and mitigating risks.
- Install, configure, test, and maintain server hardware (rack and stack, labeling, HDDs, memory, CPUs, RAID batteries, NICs, etc.).
- Install, configure, and troubleshoot networking equipment including routers, switches, and terminal servers for out-of-band management.
- Review and validate equipment deployments against approved design documentation and standards.
- Support data center builds, refreshes, migrations, and expansions while adhering to quality and safety standards.
- Coordinate with vendors and onsite staff for hardware delivery, diagnostics, replacement, and warranty services.
- Utilize monitoring and alerting frameworks to identify issues, escalate appropriately, and ensure timely service restoration.
- Maintain accurate documentation of operational procedures, system configurations, and runbooks.
- Follow established incident management, escalation procedures, and service-level agreements (SLAs).
- Collaborate with global teams across time zones to support operational initiatives and continuous improvement efforts.
- Contribute to process improvement initiatives and ensure adherence to documented policies, processes, and procedures.
- Interaction with hardware vendors, service providers, and internal engineering teams.
- Fast-paced operational setting requiring attention to detail, adherence to safety standards, and rapid problem resolution.
We're doing work that matters. Help us solve what others can't.
Equal Employment Opportunity Policy:
Cadence is committed to equal employment opportunity throughout all levels of the organization.
- Read the policy (https://www.cadence.com/content/dam/cadence-www/global/en_US/documents/company/careers/equal-employment-opportunity-policy.pdf)
Requirements
- Bachelor's degree in Computer Science, Engineering, Information Technology, or equivalent practical experience.
- Strong hands-on experience in Linux environments, including system administration, troubleshooting, and performance validation.
- Proficiency with Linux command-line tools and shell scripting (Bash or equivalent).
- Experience with cluster bring-up, driver installation, and system-level configuration.
- Hands-on experience setting up and validating GPU servers in clustered environments.
- Experience with end-to-end GPU testing in InfiniBand-based clusters.
- Working knowledge of InfiniBand networking, including switch configuration and subnet management.
- Solid understanding of networking fundamentals, including the OSI model and TCP/IP protocol suite (IP, ARP, ICMP, TCP, UDP, SMTP, FTP, TFTP).
- Experience installing, configuring, and troubleshooting routers, switches, and terminal servers.
- Familiarity with fiber and copper cabling, including IP and SAN deployments.
- Experience managing incident tickets, maintaining acceptable ticket loads, and meeting SLAs.
- Strong organizational skills with meticulous attention to detail in data center environments.
- Ability to follow and enforce documented escalation procedures and operational policies.
- Strong verbal and written communication skills, with the ability to collaborate effectively with cross-functional and global teams.
- Experience supporting HPC, AI, or large-scale GPU environments.
- Exposure to data center monitoring.
- Experience documenting operational processes and maintaining technical runbooks.
- Familiarity with large-scale data center buildouts or refresh programs.
Physical Requirements
- Ability to perform the essential functions of the role, including lifting, moving, and installing equipment weighing 50 pounds or more, with or without reasonable accommodation.
- Ability to work in data center environments, including raised floors, equipment racks, and confined spaces.
- Willingness to work flexible hours, including nights, weekends, and on-call rotations as required.