AI Integration Engineer (Java + AI)
Job description
We are seeking a highly skilled Senior Site Reliability Engineer (SRE) to support key Shared Services Operations Technology platforms, including Payment Evaluations, Regulatory Operations, Financial Crimes, and Business & Real Estate Evaluation. You will be part of a team responsible for maintaining availability, performance, and reliability across ~85 applications that support KYC, AML, and other critical financial-crimes-related workloads. This role blends software engineering, systems operations, and cloud-native reliability practices to drive automation, enhance resilience, and support modernization across a large enterprise ecosystem. You will also help evolve AIOps capabilities, including predictive alerting, self-healing workflows, and AI/ML-driven incident analysis. Occasional weekend work or overtime may be required for critical system support.
What You'll Do
Site Reliability & Operations
- Lead SRE practices that enhance system availability, performance, and scalability across multi-cloud environments.
- Support and improve critical applications and customer journeys; lead incident response and blameless postmortems.
- Conduct root-cause analysis and drive long-term remediation of recurrent issues.
- Define and enforce operational readiness and Non-Functional Requirements (NFRs) during platform modernization.
Automation & Tooling
- Design and implement automation to eliminate operational toil and improve service reliability.
- Build frameworks for automated SLO/SLI tracking, availability metrics, error budgeting, and customer impact analysis.
- Implement self-healing and autonomic systems using AI/ML, RPA, and intelligent monitoring.
Monitoring, Observability & AIOps
- Develop and enhance monitoring, alerting, and observability capabilities.
- Drive adoption of AIOps platforms to support anomaly detection, predictive alerting, and automated incident resolution.
Collaboration & Leadership
- Collaborate with platform teams, product owners, and technology partners across the COO Technology organization.
- Mentor peers and champion SRE best practices across engineering teams.
- Identify process gaps across domains and recommend scalable, long-term improvements.
Requirements
- 5+ years in Systems Engineering, Site Reliability Engineering, Technology Architecture, or related fields (or equivalent military/training/education experience).
- 2+ years working as part of an SRE team.
- Strong written and verbal communication skills.
Technical Skills
Software Development
- Proficiency in Python and/or Java/J2EE.
- Experience with REST APIs, microservices, Kafka/MQ, and modern integration patterns.
- Familiarity with JavaScript frameworks (React, Bootstrap).
- Strong SQL skills and database schema design experience.
Infrastructure & Cloud
- Expertise with Linux and container orchestration (Kubernetes, OpenShift/OCP strongly preferred).
- Experience with PCF, AWS, Google Cloud Platform, or Azure environments.
CI/CD & Automation
- Tools: Jenkins, GitLab, SonarQube, Artifactory, Ansible.
Observability & AIOps
- Tools: Grafana, Prometheus, Splunk/ELK, AppDynamics, Elastic, ThousandEyes, Aternity, Google Cloud Logging.
- AIOps Platforms: Moogsoft, AI/ML-based analytics frameworks.
Operations & Data
- ITSM Tools: ServiceNow, Remedy, IBM Netcool.
- Databases: Oracle, DB2, SQL Server, MongoDB, Hadoop/Cloudera, Spark, Teradata.
Foundational AI Knowledge
- Understanding of common AI/ML concepts (classification, regression, clustering, anomaly detection).
- Ability to work with structured/unstructured data for model evaluation.
- Awareness of ethical/operational considerations in AI systems.
- Experience integrating AI into automation workflows is a plus.
- Experience with AutoSys.
- Prior experience in corporate banking or financial services.
- Strong interest in AI-driven operations and AIOps.