Senior Infrastructure Engineer - SRE/Platform...
Job description
This role is focused on driving infrastructure stability, automation, and reliability across critical API and platform systems that support high-impact financial transactions. This is a highly visible role responsible for owning production reliability, improving operational efficiency, and enabling scalable platform capabilities across API Management, CI/CD, and supporting platform environments.
In this role, you will:
- Lead daily support operations for Apigee OPDK and Apigee Hybrid to ensure platform uptime, stability, and performance
- Troubleshoot runtime, policy, routing, and security issues on DataPower appliances
- Develop specifications for complex infrastructure systems, and design and test solutions
- Contribute to the testing of business, application, and technical infrastructure requirements
- Implement reliability improvements through Infrastructure-as-Code (IaC) using Terraform, Ansible, and GitOps
- Develop automated recovery scripts and tools to reduce manual operational overhead
- Review and analyze solutions for cloud security, secrets management, and key rotation
- Design, code, test, debug, and document programs using Agile development practices
- Plan and execute version upgrades, patching cycles, infrastructure migrations, and configuration refactoring
- Improve proactive alerting to reduce mean time to detect (MTTD) and mean time to recover (MTTR)
- Own and resolve P1/P2 high-severity incidents with quick response and deep technical troubleshooting
- Direct the daily risk and control flow of operations, focusing on policies, procedures, and work standards to ensure success
- Participate in design discussions, architectural reviews, API governance activities, and platform modernization initiatives
- Work with the CAB (Change Advisory Board) on change planning, approvals, and execution tracking
- Contribute to runbooks, SOPs, architectural diagrams, and platform knowledge base assets
Employees support our focus on building strong customer relationships balanced with a strong risk mitigating and compliance-driven culture which firmly establishes those disciplines as critical to the success of our customers and company. They are accountable for execution of all applicable risk programs (Credit, Market, Financial Crimes, Operational, Regulatory Compliance), which includes effectively following and adhering to applicable Wells Fargo policies and procedures, appropriately fulfilling risk and compliance obligations, timely and effective escalation and remediation of issues, and making sound risk decisions. There is emphasis on proactive monitoring, governance, risk identification and escalation, as well as making sound risk decisions commensurate with the business unit's risk appetite and all risk and compliance program requirements.
Requirements
- 4+ years of Technology Infrastructure Engineering and Solutions experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
- 4+ years of experience using observability platforms such as BigPanda, ThousandEyes, Grafana, Prometheus, ELK, Splunk Observability, and AppDynamics to enhance service reliability and performance monitoring
- 3+ years of experience working with Red Hat Enterprise Linux and Kubernetes, with a strong focus on Red Hat OpenShift Container Platform (OCP)
- 3+ years of experience with Site Reliability Engineering and supporting production grade
- 3+ years of experience with automation and scripting
Desired Qualifications:
- 4+ years of experience in IT Service Management (ITSM), with a strong background in incident, problem, and change management processes
- Experience with API management platforms such as Apigee or with API gateways
- Exposure to IBM DataPower or similar enterprise integration tools
- Expertise in Ansible Tower, including developing and maintaining playbooks
- Experience with cloud-native architectures, high-availability systems, and cloud and container technologies such as GCP or Azure, plus familiarity with Kubernetes
- Strong experience working in Agile and Scrum environments
- Experience improving system reliability, scalability, and operational efficiency
- Experience in project management and stakeholder engagement
- Proven experience in leading cross-functional teams
- Strong problem-solving and decision-making abilities
- Excellent communication and collaboration skills