Systems Operations Manager - Data Platforms - Teradata & Hadoop
Job description
This role is accountable for platform stability, reliability, and operational excellence across a complex, multi-tenant ecosystem supporting 100+ tenants. The manager will lead a 24x7 operations team, apply Site Reliability Engineering (SRE) principles, and drive automation-led transformation to ensure predictable, resilient service delivery at scale.
This is a hands-on leadership role requiring strong execution discipline, ownership, and the ability to operate in a high-risk, regulated environment, ensuring SLA adherence, compliance, and business continuity outcomes.
In this role, you will:
Operational Leadership & Platform Ownership
- Lead end-to-end platform operations for Teradata and Hadoop environments, ensuring availability, performance, and resilience
- Provide clear ownership and accountability for production services, operational outcomes, and service stability
- Drive incident, problem, and change management, including major incident command and recovery leadership
- Lead 24x7 global support operations, including on-call governance and escalation management
Operational Excellence & Service Performance
- Own and drive SLA/OLA adherence, uptime, and service health metrics
- Lead capacity management, performance tuning, and proactive issue prevention initiatives
- Establish and enforce operational standards, runbooks, and service management practices
- Drive root cause analysis (RCA) and long-term remediation of systemic issues
- Drive adoption of automation, observability, and AIOps practices to reduce manual toil and improve MTTR
Governance, Risk & Compliance
- Ensure alignment with enterprise risk, compliance, and change management frameworks
- Drive patching, vulnerability remediation, and platform security posture
- Maintain audit readiness, documentation quality, and control adherence
- Identify, escalate, and mitigate operational and platform risks
Multi-Tenant Platform Operations
- Manage operations across shared, multi-tenant platforms, ensuring workload isolation and stability
- Oversee resource allocation, scheduler configuration, and workload prioritization
- Execute in high-risk production environments where changes impact multiple tenants simultaneously
Site Reliability Engineering (SRE) & Automation
- Apply SRE principles to improve reliability, availability, and scalability of data platforms
- Drive automation-first operations to eliminate manual toil and standardize service delivery
- Implement and enhance observability, monitoring, and self-service capabilities
- Partner with engineering teams to improve platform reliability, operability, and service maturity
Stakeholder Engagement & Execution Alignment
- Partner with Engineering, CIO-aligned teams, Cybersecurity, and LOB stakeholders
- Provide clear, executive-ready communication on platform health, risks, and priorities
- Drive cross-functional accountability and execution discipline across teams
People Leadership & Talent Development
- Lead, coach, and develop a team of Systems Operations engineers and analysts
- Build a culture of ownership, accountability, and operational excellence
- Manage resource allocation, workforce planning, and vendor/partner support
- Develop team capabilities in SRE practices, automation, and platform operations maturity
Resiliency & Business Continuity
- Ensure resiliency posture across Teradata and Hadoop platforms, including:
  - Disaster recovery (DR) readiness and execution
  - RTO/RPO alignment and validation
  - Continuous improvement of recovery capabilities
- Lead BCP execution and failover coordination for critical platforms
Employees support our focus on building strong customer relationships balanced with a strong risk-mitigating and compliance-driven culture, which firmly establishes those disciplines as critical to the success of our customers and company. They are accountable for execution of all applicable risk programs (Credit, Market, Financial Crimes, Operational, Regulatory Compliance), which includes effectively following and adhering to applicable Wells Fargo policies and procedures, appropriately fulfilling risk and compliance obligations, timely and effective escalation and remediation of issues, and making sound risk decisions. There is an emphasis on proactive monitoring, governance, risk identification and escalation, as well as making sound risk decisions commensurate with the business unit's risk appetite and all risk and compliance program requirements.
Requirements
- 5+ years of Systems Engineering and Technology Architecture experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
- 2+ years of Leadership experience
- Hands-on experience with:
  - Teradata and Hadoop platforms
  - Distributed systems and data platform operations
  - Incident, problem, and change management processes
Desired Qualifications:
- Experience supporting enterprise-scale Teradata and Hadoop platforms
- Demonstrated leadership in 24x7 production support and SRE environments
- Strong experience in:
  - Automation, AIOps, and operational transformation
  - DevSecOps and CI/CD practices
  - Observability, monitoring, and platform telemetry
- Familiarity with Kubernetes, containerization, and cloud-native architectures
- Strong understanding of:
  - Multi-tenant data platforms and workload management
  - Regulatory, audit, and risk-controlled environments