Platform Operations Engineer (Site Reliability Engineer)

Vertiv Corporation
Westerville, United States of America
yesterday

Role details

Contract type
Temporary contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Intermediate

Job location

Westerville, United States of America

Tech stack

Java
JavaScript
Artificial Intelligence
Amazon Web Services (AWS)
Azure
C Sharp (Programming Language)
Cloud Computing
Configuration Management
Computer Security
Information Systems
Continuous Integration
Cursor (Graphical User Interface Elements)
DevOps
GitHub
Monitoring of Systems
Python
Key Management
Automation of Marketing
PowerShell
Reliability Engineering
Ansible
Prometheus
Ruby
Runbook
Secure Coding
Web Platforms
Datadog
Enterprise Software Applications
Microsoft Power Automate
Cloud Monitoring
Grafana
MTTR
Zapier
GitLab
Containerization
AI Platforms
UiPath
Kubernetes
Infrastructure Automation Frameworks
Information Technology
Performance Monitor
Terraform
Splunk
Multiplatform
DevSecOps
Docker
Jenkins
Static Application Security Testing
Programming Languages
Dynamic Application Security Testing

Job description

Vertiv is seeking a skilled Platform Operations Engineer (Site Reliability Engineer) to serve as the owner of cross-platform observability, incident management, and operational reliability within Vertiv's Digital organization. This individual contributor role is responsible for designing, implementing, and continuously improving monitoring and alerting solutions across Vertiv's digital platform ecosystem - including Compass AI, Writer AI, Site Scope, UiPath, Workato, Cursor, and other approved enterprise tools - while owning incident response processes, SLA management, and operational governance. The Platform Operations / SRE will operate within the Digital organization and play a central role in advancing Vertiv's Operational Excellence strategic priority by ensuring the availability, performance, and resilience of platforms that power critical digital workflows and business functions.

As an individual contributor in a lead capacity, this role includes proactive reliability engineering - applying SRE principles such as SLOs, error budgets, and blameless post-mortems - and embedding secure coding and operational governance practices across the Digital organization. The Platform Operations / SRE Engineer will define and enforce observability standards, lead incident response and root cause analysis, manage platform-level SLAs, and partner with engineering, security, and business stakeholders to ensure that all digital platforms meet agreed availability and performance targets.

This position partners closely with IT Security, NPDI, Digital delivery teams, and business operations, and is based on site at Vertiv's Westerville, OH headquarters.

  • Own Cross-Platform Monitoring & Observability: Design, implement, and maintain end-to-end monitoring, alerting, and observability solutions across Vertiv's digital platform ecosystem - including AI platforms, automation tools, and internal applications - ensuring real-time visibility into system health, performance, and availability.
  • Lead Incident Response & Management: Serve as the primary escalation point and incident commander for P1/P2 incidents across Digital platforms; lead root cause analysis (RCA), blameless post-mortems, and corrective action tracking to prevent recurrence and reduce mean time to resolution (MTTR).
  • Manage Platform SLAs & Reliability Targets: Define, instrument, and enforce service level objectives (SLOs), service level indicators (SLIs), and error budgets across Digital platforms; produce regular SLA performance reports for leadership and drive platform improvements to meet or exceed agreed availability and performance targets.
  • Drive Secure Coding & Operational Governance: Champion secure coding practices and DevSecOps standards within Digital delivery teams; conduct operational readiness reviews for new platform deployments, enforce configuration management and change control processes, and partner with IT Security and NPDI to ensure all platforms meet Vertiv's security and compliance requirements.
  • Automate Operations & Reduce Toil: Identify and eliminate manual operational toil through automation. This includes automated remediation runbooks and anomaly detection through the use of scripting, IaC tools, and approved automation platforms.
  • Capacity Planning & Performance Engineering: Analyze platform utilization trends and conduct capacity planning across Digital environments; proactively identify performance bottlenecks and recommend architectural improvements to ensure platforms scale reliably with business demand.
  • CI/CD Pipeline Reliability & Deployment Support: Partner with Digital delivery teams to ensure CI/CD pipelines are instrumented for reliability, deployment risk is managed through progressive rollout strategies, and production deployments are supported with appropriate rollback and health-check capabilities.
  • Evaluate & Advance Observability Tooling: Stay current on advancements in observability, AIOps, and SRE tooling; evaluate and recommend new tools and practices that enhance Vertiv's platform operations maturity, and drive adoption of modern reliability engineering standards across the Digital organization.

Requirements

Do you have experience in system performance monitoring?

  • Bachelor's degree in Computer Science, Information Systems, Engineering, or a related field; equivalent practical experience considered.
  • 5+ years of professional experience in platform operations, site reliability engineering, DevOps, or a related software/infrastructure engineering discipline.
  • 3+ years of hands-on experience with enterprise monitoring and observability platforms (e.g., Datadog, Grafana, Prometheus, Azure Monitor, Splunk, or equivalent) in a multi-platform environment.
  • Demonstrated experience owning and managing incident response processes, post-mortem facilitation, and SLA/SLO governance.
  • Experience implementing secure coding practices, DevSecOps standards, or operational governance frameworks in an enterprise software delivery environment.

Technical Skills

  • Proficiency with monitoring and observability tools (Datadog, Grafana, Prometheus, Azure Monitor, Splunk, or equivalent) for cross-platform health and performance tracking.
  • Strong knowledge of SRE principles, including SLOs, SLIs, blameless post-mortems, and toil reduction practices.
  • Hands-on experience with cloud platforms (AWS preferred) and familiarity with containerized environments (Docker, Kubernetes) and infrastructure-as-code tooling (Terraform, Ansible, or equivalent).
  • Proficiency in multiple programming languages (Python, Ruby, PowerShell, Java, JavaScript, C#, etc.) for automation and runbook development.
  • Experience with CI/CD platforms (GitLab, Jenkins, GitHub Actions, Azure DevOps, or equivalent) and deployment reliability practices including progressive rollout, feature flags, and automated health checks.

  • Google SRE certification, AWS DevOps Professional, Azure certifications, or equivalent SRE/cloud operations certification.
  • Experience with AIOps tooling or AI-assisted anomaly detection and automated remediation capabilities.
  • Familiarity with the Vertiv digital platform ecosystem: Workato, UiPath, Power Automate, Compass AI, Writer AI, or Cursor.
  • Experience applying DevSecOps practices, including SAST/DAST scanning, secrets management, and compliance-as-code in enterprise environments.
  • Experience working in Agile/Scrum delivery environments; familiarity with ITIL incident and change management frameworks.

The successful candidate will embrace Vertiv's Core Principles & Behaviors to help execute our Strategic Priorities.

OUR CORE PRINCIPLES: Safety. Integrity. Respect. Teamwork. Diversity & Inclusion.

No calls or agencies please. Vertiv will only employ those who are legally authorized to work in the United States. This is not a position for which sponsorship will be provided. Individuals with temporary visas such as E, F-1, H-1, H-2, L, B, J, or TN, or who need sponsorship for work authorization now or in the future, are not eligible for hire.

About the company

Vertiv is a $10.2 billion global critical infrastructure and data center technology company. We ensure customers' vital applications run continuously by bringing together hardware, software, analytics and ongoing services. Our portfolio includes power, cooling and IT infrastructure solutions and services that extend from the cloud to the edge of the network. Headquartered in Columbus, Ohio, USA, Vertiv employs around 20,000 people and does business in more than 130 countries. Visit Vertiv.com to learn more.
