Digital - Principal SRE

Huntington National Bank

Columbus, United States of America

2 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Remote

Columbus, United States of America

Tech stack

Java

Abstraction Layers

Artificial Intelligence

Amazon Web Services (AWS)

Azure

Cloud Computing

Cloud Engineering

Computer Programming

Information Engineering

DevOps

Monitoring of Systems

Python

Machine Learning

Reliability Engineering

Ansible

Prometheus

Runbook

Software Deployment

Web Platforms

Google Cloud Platform

Large Language Models

Grafana

Generative AI

Containerization

AI Platforms

Kubernetes

Information Technology

Data Analytics

Machine Learning Operations

Terraform

Docker

ELK

ServiceNow

Job description

The Digital - Principal SRE (AI Engineer) role blends expertise in artificial intelligence, machine learning, and reliability engineering. This professional is responsible for designing, deploying, and maintaining AI-driven solutions while ensuring the reliability, scalability, and performance of digital platforms and services. The ideal candidate will work closely with Digital SRE engineers, data scientists, DevOps, and operations teams to deliver robust, efficient, and automated systems that support business goals.

The IS Technical Specialist provides technical and consultative support on the most complex technical matters. This role typically reports to the Head of Digital SRE and may involve on-call responsibilities. The position provides opportunities to work on cutting-edge AI solutions, collaborate with cross-segment teams, and drive reliability for mission-critical digital services.

Design, develop, and implement AI-driven systems and automation tools to enhance the reliability and efficiency of digital platforms.
Monitor the health, availability, and performance of AI-enabled applications and infrastructure using SRE best practices.
Collaborate with cross-functional teams to integrate machine learning models into production environments, ensuring seamless deployment and operation.
Establish and enforce service-level objectives (SLOs), error budgets, and incident response procedures for AI-driven services.
Identify, troubleshoot, and resolve complex incidents related to AI systems, leveraging observability and monitoring tools.
Drive continuous improvement by analyzing post-incident reviews, automating manual tasks, and optimizing system performance.
Stay up to date with advancements in AI, SRE, and cloud technologies, recommending innovative solutions to enhance digital reliability.
Document processes and runbooks for operational transparency and knowledge sharing.
AI Platform Integration: Develop abstraction layers across AI providers (Google, OpenAI, etc.) to enable seamless integration.
Conduct design workshops, POCs, and code-with sessions to shape data-driven agent workflows with stakeholders, fostering trust and adoption.
Measure & Improve: Define and use key metrics, test harnesses, and evaluation plans to measure agent accuracy, latency, safety, and cost effectiveness.
Knowledge Sharing: Craft reusable patterns, documentation, and best practices to influence internal assets and client roadmaps.

Certain positions outside our branch network may be eligible for a flexible work arrangement. We're combining the best of both worlds: in-office and work from home. Our approach enables our teams to deepen connections, maintain a strong community, and do their best work. Remote roles will also have the opportunity to come together in our offices for moments that matter. Specific work arrangements will be provided by the hiring team.

Requirements

Bachelor's or Master's degree in Computer Science, Engineering, Data Science, or a related field.
Minimum of 5 years of proven experience in AI/ML engineering, SRE, DevOps, or related roles.
Strong programming skills in Python, Java, or similar languages, with experience in developing and deploying machine learning models.
Hands-on experience with cloud platforms (e.g., AWS, GCP, Azure) and containerization technologies (Docker, Kubernetes).
Familiarity with observability tools (Prometheus, Grafana, ELK stack) and the ServiceNow incident management platform.
Solid understanding of SRE principles: monitoring, alerting, SLOs, error budgets, and automation.
Experience with infrastructure-as-code (Terraform, Ansible) and CI/CD pipelines.
Excellent problem-solving skills, attention to detail, and ability to work in a fast-paced, collaborative environment.

Experience operationalizing large language models (LLMs) or generative AI systems in production settings.
Background in MLOps, data engineering, and/or cloud-native AI deployment.
Strong communication and documentation abilities.
Knowledge of security best practices for AI and cloud infrastructure.
Contributions to open-source AI/SRE projects or relevant technical communities.

Exempt Status: (Yes = not eligible for overtime pay) (No = eligible for overtime pay)