Site Reliability Engineer

Kelly Services Inc.
Orlando, United States of America
yesterday

Role details

Contract type
Temporary contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior
Compensation
$187K

Job location

Orlando, United States of America

Tech stack

Flutter
Artificial Intelligence
Amazon Web Services (AWS)
JIRA
Azure
Bash
Mobile Application Development
Cloud Computing
Configuration Management
Software Quality
Information Systems
Continuous Integration
DevOps
Digital Architecture
Distributed Systems
GitHub
Identity and Access Management
Mobile Application Software
Python
PostgreSQL
MongoDB
Node.js
Scrum
Redis
Reliability Engineering
Cloud Services
Prometheus
Webui
Service Pack
Vault (Revision Control System)
Systems Architecture
TypeScript
YAML
Scripting (Bash/Python/Go/Ruby)
Google Cloud Platform
Kubernetes Helm Charts
Multi-Cloud
Generative AI
Hybrid Cloud
Backend
GitLab
AI Platforms
Kubernetes
Information Technology
Deployment Automation
Kafka
Nintex
Data Management
Software Coding
Terraform
Splunk
AppDynamics
Jenkins

Job description

The Lead Site Reliability Engineer will report to the Sr. Manager, Generative AI Engineering and play a key role in guiding the JedAI team's cloud infrastructure and generative AI platform reliability strategy.

You will lead infrastructure strategy across multi-cloud environments (GCP, AWS, and Azure) supporting our Generative AI and Conversational Experience platforms.

You'll modernize and manage applications including LiteLLM, Open Web UI, Archestra, and Arize AX, and support back-end systems such as Kafka, PostgreSQL, Redis, Vault, MongoDB, and n8n, ensuring they meet our internal UI and security standards.

What You'll Do

  • Plan, design, and build Helm charts and Terraform infrastructure code to maintain a 99.99% annual availability SLA.
  • Lead and mentor a team of Site Reliability Engineers and DevOps specialists within our AI platform.
  • Architect, design, and maintain infrastructure environments supporting AI and data service workloads across GCP (primary), AWS (secondary), and Azure (tertiary).
  • Identify, plan, and assign work to team members in Jira.
  • Review and provide feedback on platform sizing and volume estimations.
  • Assist the capacity planning team to ensure scalability boundaries are aligned with expected workloads.
  • Implement our observability, monitoring, alerting, and tracing best practices across platform components (Splunk, OpenTelemetry, Prometheus, AppDynamics).
  • Plan, design, and implement automated deployment processes via Harness.
  • Plan, design, and implement modern enterprise rollout patterns such as blue/green deployments, canary deployments, and feature flags.
  • Provide guidance to the platform architecture team with respect to solution infrastructure and scalability.
  • Establish and support operational maintenance processes including backups, version updates, capacity planning, and security patching.
  • Evaluate and pitch recommendations on emerging DevOps and SRE technologies, influencing our reliability strategy across AI & platform teams.
  • Ensure team compliance with our governance, security, and business continuity frameworks.
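
For context on the 99.99% annual availability target listed above, the following is a minimal sketch of the downtime budget that figure implies. The function name and the non-leap-year assumption are illustrative only and do not describe the team's actual tooling or SLA accounting.

```python
# Illustrative only: downtime permitted by an availability target over one year.
# Assumes a non-leap year and that all downtime counts against the target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_pct: float,
                             period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Return the minutes of downtime the availability target allows."""
    return period_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    budget = allowed_downtime_minutes(99.99)
    print(f"{budget:.2f} minutes of downtime per year")
```

Under these assumptions, a 99.99% target leaves roughly 52.6 minutes of total downtime per year, which is why automated rollouts and fast rollback matter for this role.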

Why This Role is Needed

Rapid Growth and Innovation

The Digital Architecture & Engineering team is experiencing rapid growth and expansion, requiring an architect to guide the development of our mobile platform to meet evolving user needs and business objectives.

Complex Architecture

  • Our mobile application uses a complex architecture involving Flutter, Server Driven UI, Node.js, TypeScript, Runtime, and Cloud services (AWS/GCP).
  • This requires a deep understanding of these technologies and the ability to design a cohesive and efficient system.

Technical Leadership

  • We need a strong technical leader who can mentor and guide our development team, ensuring best practices, code quality, and efficient development processes.

Future Proofing

  • The Lead Software Architect will be responsible for designing scalable and adaptable architecture that can accommodate future growth, new features, and evolving technologies.

What You Will Do

  • Define and implement the overall mobile architecture, including backend integration and data management.
  • Lead the development of new features and functionalities, ensuring alignment with business requirements and user needs.
  • Collaborate with cross-functional teams (design, product, backend) to ensure seamless integration and optimal user experience.
  • Develop and maintain technical documentation, including architecture diagrams, design specifications, and coding standards.
  • Mentor and guide junior developers, fostering a culture of continuous learning and improvement.
  • Stay abreast of emerging technologies and trends in mobile development, identifying opportunities for innovation and improvement.

Requirements

  • 7+ years of SRE, DevOps, or platform engineering experience.
  • Expert in Kubernetes operations, cluster scaling, and Helm-based configuration management.
  • Advanced knowledge of Terraform and Harness for automated deployment and configuration.
  • Proven experience managing multi-cloud services on GCP, AWS, and Azure.
  • Strong scripting in Python, Bash, and YAML for automation and reliability workflows.
  • Experience with PostgreSQL, Redis, Kafka, MongoDB, and Vault in production environments.
  • Proficiency in CI/CD orchestration technologies (Harness, GitHub Actions, GitLab, Jenkins, and Azure DevOps) with deployment automation, feature flags, and observability.
  • Self-motivated with strong leadership ability in Agile/Scrum environments; ability to set technical direction and mentor peers.
  • Strong written communication skills, particularly in clearly explaining technical topics to less-technical audiences.
  • Outstanding troubleshooting and diagnostic skills across distributed systems.
  • Deep understanding of system security, identity management, and data protection compliance models.

Preferred Qualifications

  • Prior leadership in hybrid cloud environments.
  • Experience leading large infrastructure-focused initiatives.
  • Bachelor's degree in Computer Science, Information Systems, or equivalent relevant experience.
  • Master's degree preferred.
