Digital - Principal SRE

Huntington National Bank
Columbus, United States of America
2 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Remote
Columbus, United States of America

Tech stack

Java
Abstraction Layers
Artificial Intelligence
Amazon Web Services (AWS)
Azure
Cloud Computing
Cloud Engineering
Computer Programming
Information Engineering
DevOps
Monitoring of Systems
Python
Machine Learning
Reliability Engineering
Ansible
Prometheus
Runbook
Software Deployment
Web Platforms
Google Cloud Platform
Large Language Models
Grafana
Generative AI
Containerization
AI Platforms
Kubernetes
Information Technology
Data Analytics
Machine Learning Operations
Terraform
Docker
ELK
ServiceNow

Job description

The Digital - Principal SRE (AI Engineer) role blends expertise in artificial intelligence, machine learning, and reliability engineering. This professional is responsible for designing, deploying, and maintaining AI-driven solutions while ensuring the reliability, scalability, and performance of digital platforms and services. The ideal candidate will work closely with Digital SRE engineers, data scientists, DevOps, and operations teams to deliver robust, efficient, and automated systems that support business goals.

The IS Technical Specialist provides technical and consultative support on the most complex technical matters. This role typically reports to the Head of Digital SRE and may involve on-call responsibilities. The position provides opportunities to work on cutting-edge AI solutions, collaborate with cross-segment teams, and drive reliability for mission-critical digital services.

  • Design, develop, and implement AI-driven systems and automation tools to enhance the reliability and efficiency of digital platforms.
  • Monitor the health, availability, and performance of AI-enabled applications and infrastructure using SRE best practices.
  • Collaborate with cross-functional teams to integrate machine learning models into production environments, ensuring seamless deployment and operation.
  • Establish and enforce service-level objectives (SLOs), error budgets, and incident response procedures for AI-driven services.
  • Identify, troubleshoot, and resolve complex incidents related to AI systems, leveraging observability and monitoring tools.
  • Drive continuous improvement by analyzing post-incident reviews, automating manual tasks, and optimizing system performance.
  • Stay up to date with advancements in AI, SRE, and cloud technologies, recommending innovative solutions to enhance digital reliability.
  • Document processes and runbooks for operational transparency and knowledge sharing.
  • AI Platform Integration: Develop abstraction layers across AI providers (Google, OpenAI, etc.) to enable seamless integration and enablement.
  • Conduct design workshops, POCs, and code-with sessions to shape data-driven agent workflows with stakeholders, fostering trust and adoption.
  • Measure & Improve: Define and use key metrics, test harnesses, and evaluation plans to measure agent accuracy, latency, safety, and cost effectiveness.
  • Knowledge Sharing: Craft reusable patterns, documentation, and best practices to influence internal assets and client roadmaps.

Certain positions outside our branch network may be eligible for a flexible work arrangement. We're combining the best of both worlds: in-office and work from home. Our approach enables our teams to deepen connections, maintain a strong community, and do their best work. Remote roles will also have the opportunity to come together in our offices for moments that matter. Specific work arrangements will be provided by the hiring team.

Requirements

  • Bachelor's or Master's degree in Computer Science, Engineering, Data Science, or a related field.
  • Minimum of 5 years of proven experience in AI/ML engineering, SRE, DevOps, or related roles.
  • Strong programming skills in Python, Java, or similar languages, with experience in developing and deploying machine learning models.
  • Hands-on experience with cloud platforms (e.g., AWS, GCP, Azure) and containerization technologies (Docker, Kubernetes).
  • Familiarity with observability tools (Prometheus, Grafana, ELK stack) and the ServiceNow incident management platform.
  • Solid understanding of SRE principles: monitoring, alerting, SLOs, error budgets, and automation.
  • Experience with infrastructure-as-code (Terraform, Ansible) and CI/CD pipelines.
  • Excellent problem-solving skills, attention to detail, and ability to work in a fast-paced, collaborative environment.

  • Experience operationalizing large language models (LLMs) or generative AI systems in production settings.
  • Background in MLOps, data engineering, and/or cloud-native AI deployment.
  • Strong communication and documentation abilities.
  • Knowledge of security best practices for AI and cloud infrastructure.
  • Contributions to open-source AI/SRE projects or relevant technical communities.

Exempt Status: (Yes = not eligible for overtime pay) (No = eligible for overtime pay)
