Systems Operations Manager - Data Platforms - Teradata & Hadoop

Wells Fargo

Irving, United States of America

23 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Intermediate

Job location

Irving, United States of America

Tech stack

Systems Engineering

Cloud Engineering

Computer Security

Continuous Integration

Data Infrastructure

Disaster Recovery

Distributed Systems

Hadoop

Performance Tuning

Reliability Engineering

Site Reliability Engineering Practices

Teradata

Software Vulnerability Management

Cloud Platform System

MTTR

Containerization

Kubernetes

Performance Monitor

Data Management

DevSecOps

Job description

This role is accountable for platform stability, reliability, and operational excellence across a complex, multi-tenant ecosystem supporting 100+ tenants. The manager will lead a 24x7 operations team, apply Site Reliability Engineering (SRE) principles, and drive automation-led transformation to ensure predictable, resilient service delivery at scale.

This is a hands-on leadership role requiring strong execution discipline, ownership, and the ability to operate in a high-risk, regulated environment, ensuring SLA adherence, compliance, and business continuity outcomes.

In this role, you will:

Operational Leadership & Platform Ownership

Lead end-to-end platform operations for Teradata and Hadoop environments, ensuring availability, performance, and resilience
Provide clear ownership and accountability for production services, operational outcomes, and service stability
Drive incident, problem, and change management, including major incident command and recovery leadership
Lead 24x7 global support operations, including on-call governance and escalation management

Operational Excellence & Service Performance

Own and drive SLA/OLA adherence, uptime, and service health metrics
Lead capacity management, performance tuning, and proactive issue prevention initiatives
Establish and enforce operational standards, runbooks, and service management practices
Drive root cause analysis (RCA) and long-term remediation of systemic issues
Drive adoption of automation, observability, and AIOps practices to reduce manual toil and improve MTTR

Governance, Risk & Compliance

Ensure alignment with enterprise risk, compliance, and change management frameworks
Drive patching, vulnerability remediation, and platform security posture
Maintain audit readiness, documentation quality, and control adherence
Identify, escalate, and mitigate operational and platform risks

Multi-Tenant Platform Operations

Manage operations across shared, multi-tenant platforms, ensuring workload isolation and stability
Oversee resource allocation, scheduler configuration, and workload prioritization
Execute in high-risk production environments where changes impact multiple tenants simultaneously

Site Reliability Engineering (SRE) & Automation

Apply SRE principles to improve reliability, availability, and scalability of data platforms
Drive automation-first operations to eliminate manual toil and standardize service delivery
Implement and enhance observability, monitoring, and self-service capabilities
Partner with engineering teams to improve platform reliability, operability, and service maturity

Stakeholder Engagement & Execution Alignment

Partner with Engineering, CIO-aligned teams, Cybersecurity, and LOB stakeholders
Provide clear, executive-ready communication on platform health, risks, and priorities
Drive cross-functional accountability and execution discipline across teams

People Leadership & Talent Development

Lead, coach, and develop a team of Systems Operations engineers and analysts
Build a culture of ownership, accountability, and operational excellence
Manage resource allocation, workforce planning, and vendor/partner support
Develop team capabilities in SRE practices, automation, and platform operations maturity

Resiliency & Business Continuity

Ensure resiliency posture across Teradata and Hadoop platforms, including:

Disaster recovery (DR) readiness and execution
RTO/RPO alignment and validation
Continuous improvement of recovery capabilities

Lead BCP execution and failover coordination for critical platforms

Employees support our focus on building strong customer relationships balanced with a strong risk mitigating and compliance-driven culture which firmly establishes those disciplines as critical to the success of our customers and company. They are accountable for execution of all applicable risk programs (Credit, Market, Financial Crimes, Operational, Regulatory Compliance), which includes effectively following and adhering to applicable Wells Fargo policies and procedures, appropriately fulfilling risk and compliance obligations, timely and effective escalation and remediation of issues, and making sound risk decisions. There is emphasis on proactive monitoring, governance, risk identification and escalation, as well as making sound risk decisions commensurate with the business unit's risk appetite and all risk and compliance program requirements.

Requirements

5+ years of Systems Engineering and Technology Architecture experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education

2+ years of Leadership experience
Hands-on experience with:

Teradata and Hadoop platforms
Distributed systems and data platform operations
Incident, problem, and change management processes

Desired Qualifications:

Experience supporting enterprise-scale Teradata and Hadoop platforms
Demonstrated leadership in 24x7 production support and SRE environments
Strong experience in:

Automation, AIOps, and operational transformation
DevSecOps and CI/CD practices
Observability, monitoring, and platform telemetry

Familiarity with Kubernetes, containerization, and cloud-native architectures
Strong understanding of:

Multi-tenant data platforms and workload management
Regulatory, audit, and risk-controlled environments
