Data Center Operations Engineering Lead
Job description
NVIDIA is seeking a highly experienced Data Center Operations Engineering Lead to serve as the on-site operational owner for a critical data center location. This role is the operational backbone of the site, responsible for ensuring infrastructure reliability, uptime, compliance, and readiness to support production workloads across NVIDIA's rapidly growing global data center footprint.

This is a hands-on, high-impact role for a senior engineer who thrives in mission-critical environments, owns issues end-to-end, and drives operational excellence through strong technical judgment, disciplined processes, and cross-functional leadership. You will act as the primary on-site authority and escalation point while partnering with centrally managed engineering, facilities, network, security, and capacity planning teams. The ability to track and report on areas for continuous improvement is key to the data center's continued progress.

Key Responsibilities

Data Center Operations & Incident Management
* Own day-to-day operational health of the assigned data center site.
* Serve as the primary on-site escalation point for operational, infrastructure, and facilities issues.
* Lead incident response, triage, escalation, and resolution to maintain high availability and uptime.
* Coordinate with internal teams, vendors, colocation providers, and Facilities Operations Centers (FOC) during incidents and maintenance events.

Infrastructure Readiness & Reliability
* Ensure infrastructure readiness for new site turn-ups, expansions, and post go-live stabilization.
* Inherit newly built lab or data center environments after buildout and transition them to steady-state operations.
* Govern infrastructure changes, including installs, upgrades, retrofits, and decommissions, with appropriate change management and rollback planning.
* Maintain deep operational knowledge of critical systems: power distribution, cooling (air and liquid), networking, space, and rack density.

Preventative Maintenance, Capacity & Asset Management
* Manage and track preventative maintenance schedules for power, cooling, network, and compute infrastructure.
* Monitor and manage site capacity (power, cooling, space, racks) and identify constraints and risks.
* Maintain accurate asset inventories and track lifecycle from deployment through decommissioning using DCIM tools.

Process Excellence, Metrics & Continuous Improvement
* Develop, document, and continuously improve SOPs, runbooks, escalation workflows, and site readiness checklists.
* Lead ITIL-aligned change management and operational governance processes.
* Track and report site-level operational metrics; analyze trends to drive reliability and service improvements.
* Identify opportunities to automate operational tasks and improve tooling and visibility.

Cross-Functional Leadership, Security & Compliance
* Act as the local liaison between facilities, engineering, networking, security, capacity planning, and compliance teams.
* Ensure physical and logical access controls are enforced and compliant.
* Maintain audit readiness and support compliance efforts (e.g., SOC 2, ISO 27001, safety and regulatory certifications).
* Manage relationships with vendors, service providers, and colocation partners, including SLAs and contracts.

Skills
Data center, data center operations, data center maintenance, hardware troubleshooting, troubleshooting, infrastructure, cooling systems, power, PDU, data center mgr, data center facilities, rack and stack

Top Skills Details
Data center, data center operations, data center maintenance, hardware troubleshooting, troubleshooting, infrastructure, cooling systems, power, PDU
Requirements
* Strong operational judgment, prioritization, and organizational skills.
* Excellent written and verbal communication skills, including executive-level incident communication.
* Ability to operate independently on-site while collaborating with distributed teams and off-site managers.
* Experience with ITIL frameworks, change management, vendor SLAs, and compliance standards.

Ways to Stand Out
* Experience supporting high-density or liquid-cooled GPU, AI, or HPC environments.
* Prior ownership or leadership of data center compliance audits.
* Scripting or automation experience (Python, Bash, etc.) to improve operational efficiency.

Experience Level
Entry Level
Benefits & conditions
This is a Contract position based out of Hillsboro, OR.

Pay and Benefits
The pay range for this position is $50.00 - $60.00/hr. Eligibility requirements apply to some benefits and may depend on your job classification and length of employment. Benefits are subject to change and may be subject to specific elections, plan, or program terms. If eligible, the benefits available for this temporary role may include the following:
* Medical, dental & vision
* Critical Illness, Accident, and Hospital
* 401(k) Retirement Plan - Pre-tax and Roth post-tax contributions available
* Life Insurance (Voluntary Life & AD&D for the employee and dependents)
* Short and long-term disability
* Health Spending Account (HSA)
* Transportation benefits
* Employee Assistance Program
* Time Off/Leave (PTO, Vacation or Sick Leave)