Systems Operations Manager - Data Platforms - Teradata & Hadoop

Wells Fargo
Irving, United States of America
23 days ago

Role details

Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Intermediate

Job location

Irving, United States of America

Tech stack

Systems Engineering
Cloud Engineering
Computer Security
Continuous Integration
Data Infrastructure
Disaster Recovery
Distributed Systems
Hadoop
Performance Tuning
Reliability Engineering
Site Reliability Engineering Practices
Teradata
Software Vulnerability Management
Cloud Platform System
MTTR
Containerization
Kubernetes
Performance Monitor
Data Management
DevSecOps

Job description

This role is accountable for platform stability, reliability, and operational excellence across a complex, multi-tenant ecosystem supporting 100+ tenants. The manager will lead a 24x7 operations team, apply Site Reliability Engineering (SRE) principles, and drive automation-led transformation to ensure predictable, resilient service delivery at scale.

This is a hands-on leadership role requiring strong execution discipline, ownership, and the ability to operate in a high-risk, regulated environment, ensuring SLA adherence, compliance, and business continuity outcomes.

In this role, you will:

Operational Leadership & Platform Ownership

  • Lead end-to-end platform operations for Teradata and Hadoop environments, ensuring availability, performance, and resilience
  • Provide clear ownership and accountability for production services, operational outcomes, and service stability
  • Drive incident, problem, and change management, including major incident command and recovery leadership
  • Lead 24x7 global support operations, including on-call governance and escalation management

Operational Excellence & Service Performance

  • Own and drive SLA/OLA adherence, uptime, and service health metrics
  • Lead capacity management, performance tuning, and proactive issue prevention initiatives
  • Establish and enforce operational standards, runbooks, and service management practices
  • Drive root cause analysis (RCA) and long-term remediation of systemic issues
  • Drive adoption of automation, observability, and AIOps practices to reduce manual toil and improve MTTR

Governance, Risk & Compliance

  • Ensure alignment with enterprise risk, compliance, and change management frameworks
  • Drive patching, vulnerability remediation, and platform security posture
  • Maintain audit readiness, documentation quality, and control adherence
  • Identify, escalate, and mitigate operational and platform risks

Multi-Tenant Platform Operations

  • Manage operations across shared, multi-tenant platforms, ensuring workload isolation and stability
  • Oversee resource allocation, scheduler configuration, and workload prioritization
  • Execute in high-risk production environments where changes impact multiple tenants simultaneously

Site Reliability Engineering (SRE) & Automation

  • Apply SRE principles to improve reliability, availability, and scalability of data platforms
  • Drive automation-first operations to eliminate manual toil and standardize service delivery
  • Implement and enhance observability, monitoring, and self-service capabilities
  • Partner with engineering teams to improve platform reliability, operability, and service maturity

Stakeholder Engagement & Execution Alignment

  • Partner with Engineering, CIO-aligned teams, Cybersecurity, and LOB stakeholders
  • Provide clear, executive-ready communication on platform health, risks, and priorities
  • Drive cross-functional accountability and execution discipline across teams

People Leadership & Talent Development

  • Lead, coach, and develop a team of Systems Operations engineers and analysts
  • Build a culture of ownership, accountability, and operational excellence
  • Manage resource allocation, workforce planning, and vendor/partner support
  • Develop team capabilities in SRE practices, automation, and platform operations maturity

Resiliency & Business Continuity

  • Ensure resiliency posture across Teradata and Hadoop platforms, including:
      • Disaster recovery (DR) readiness and execution
      • RTO/RPO alignment and validation
      • Continuous improvement of recovery capabilities
  • Lead BCP execution and failover coordination for critical platforms

Employees support our focus on building strong customer relationships balanced with a strong risk mitigating and compliance-driven culture which firmly establishes those disciplines as critical to the success of our customers and company. They are accountable for execution of all applicable risk programs (Credit, Market, Financial Crimes, Operational, Regulatory Compliance), which includes effectively following and adhering to applicable Wells Fargo policies and procedures, appropriately fulfilling risk and compliance obligations, timely and effective escalation and remediation of issues, and making sound risk decisions. There is emphasis on proactive monitoring, governance, risk identification and escalation, as well as making sound risk decisions commensurate with the business unit's risk appetite and all risk and compliance program requirements.

Requirements

Do you have experience in Teradata?

  • 5+ years of Systems Engineering and Technology Architecture experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • 2+ years of Leadership experience
  • Hands-on experience with:
      • Teradata and Hadoop platforms
      • Distributed systems and data platform operations
      • Incident, problem, and change management processes

Desired Qualifications:

  • Experience supporting enterprise-scale Teradata and Hadoop platforms
  • Demonstrated leadership in 24x7 production support and SRE environments
  • Strong experience in:
      • Automation, AIOps, and operational transformation
      • DevSecOps and CI/CD practices
      • Observability, monitoring, and platform telemetry
  • Familiarity with Kubernetes, containerization, and cloud-native architectures
  • Strong understanding of:
      • Multi-tenant data platforms and workload management
      • Regulatory, audit, and risk-controlled environments
