Site Reliability Engineer
Job description
The Lead Site Reliability Engineer will report to the Sr. Manager, Generative AI Engineering, and play a key role in guiding the JedAI team's cloud infrastructure and generative AI platform reliability strategy.
You will lead infrastructure strategy across multi-cloud environments (GCP, AWS, and Azure) supporting our Generative AI and Conversational Experience platforms.
You'll modernize and manage applications including LiteLLM, Open WebUI, Archestra, and Arize AX, and support back-end systems such as Kafka, PostgreSQL, Redis, Vault, MongoDB, and n8n, ensuring they meet our internal UI and security standards.
What You'll Do
- Plan, design, and build Helm charts and Terraform infrastructure code to maintain an annual 99.99% availability SLA.
- Lead and mentor a team of Site Reliability Engineers and DevOps specialists within our AI platform.
- Architect, design, and maintain infrastructure environments supporting AI and data service workloads across GCP (primary), AWS (secondary), and Azure (tertiary).
- Identify, plan, and assign work for peer team members in Jira.
- Review and provide feedback on platform sizing and volume estimations.
- Assist the capacity planning team in ensuring that scalability boundaries align with expected workloads.
- Implement our observability, monitoring, alerting, and tracing best practices across platform components (Splunk, OpenTelemetry, Prometheus, AppDynamics).
- Plan, design, and implement automated deployment processes via Harness.
- Plan, design, and implement modern enterprise rollout patterns such as blue/green deployments, canary deployments, and feature flags.
- Provide guidance to the platform architecture team with respect to solution infrastructure and scalability.
- Establish and support operational maintenance processes including backups, version updates, capacity planning, and security patching.
- Evaluate and recommend emerging DevOps and SRE technologies, influencing our reliability strategy across AI and platform teams.
- Ensure team compliance with our governance, security, and business continuity frameworks.
Why This Role is Needed
Rapid Growth and Innovation
The Digital Architecture & Engineering team is experiencing rapid growth and expansion, requiring an architect to guide the development of our mobile platform to meet evolving user needs and business objectives.
Complex Architecture
- Our mobile application utilizes a complex architecture involving Flutter, Server Driven UI, Node.js, Typescript, Runtime, and Cloud services (AWS/GCP).
- This requires a deep understanding of these technologies and the ability to design a cohesive and efficient system.
Technical Leadership
- We need a strong technical leader who can mentor and guide our development team, ensuring best practices, code quality, and efficient development processes.
Future Proofing
- The Lead Software Architect will be responsible for designing scalable and adaptable architecture that can accommodate future growth, new features, and evolving technologies.
What You Will Do
- Define and implement the overall mobile architecture, including backend integration and data management.
- Lead the development of new features and functionalities, ensuring alignment with business requirements and user needs.
- Collaborate with cross-functional teams (design, product, backend) to ensure seamless integration and optimal user experience.
- Develop and maintain technical documentation, including architecture diagrams, design specifications, and coding standards.
- Mentor and guide junior developers, fostering a culture of continuous learning and improvement.
- Stay abreast of emerging technologies and trends in mobile development, identifying opportunities for innovation and improvement.
Requirements
- 7+ years of SRE, DevOps, or platform engineering experience.
- Expert in Kubernetes operations, cluster scaling, and Helm-based configuration management.
- Advanced knowledge of Terraform and Harness for automated deployment and configuration.
- Proven experience managing multi-cloud services on GCP, AWS, and Azure.
- Strong scripting skills in Python and Bash, plus fluency in YAML, for automation and reliability workflows.
- Experience with PostgreSQL, Redis, Kafka, MongoDB, and Vault in production environments.
- Proficiency with CI/CD orchestration tools (Harness, GitHub Actions, GitLab, Jenkins, and Azure DevOps), including deployment automation, feature flags, and observability.
- Self-motivated with strong leadership ability in Agile/Scrum environments; ability to set technical direction and mentor peers.
- Strong written communication skills, particularly in clearly explaining technical topics to less-technical audiences.
- Outstanding troubleshooting and diagnostic skills across distributed systems.
- Deep understanding of system security, identity management, and data protection compliance models.
Preferred Qualifications
- Prior leadership in hybrid cloud environments.
- Experience leading large infrastructure-focused initiatives.
- Bachelor's degree in Computer Science, Information Systems, or equivalent relevant experience.
- Master's degree preferred.