Sr. Site Reliability Engineer

Tiger Analytics

Washington, United States of America

21 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Washington, United States of America

Tech stack

Artificial Intelligence

Bash

Google BigQuery

Continuous Integration

Data Systems

GitHub

Python

Machine Learning

Reliability Engineering

Prometheus

Azure

Software Engineering

Data Streaming

Systems Architecture

AI Infrastructure

Pulumi

Scripting (Bash/Python/Go/Ruby)

Google Cloud Platform

System Availability

Delivery Pipeline

Large Language Models

Grafana

Infrastructure as Code (IaC)

Build Server

AI Platforms

Kubernetes

Data Analytics

Machine Learning Operations

Terraform

Virtual Private Clouds

Docker

Job description

We are seeking a high-caliber Site Reliability Engineer (SRE) to join our Forward Engineering team. You will be the guardian of our production ecosystems, ensuring that our complex, data-driven AI platforms remain resilient, scalable, and highly performant. This role is a hybrid of software engineering and systems architecture, with a specialized focus on MLOps, bridging the gap between model development and production-grade reliability.

SLA/SLO Management: Define, monitor, and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for critical AI/ML services.

Error Budgeting: Manage error budgets to balance the velocity of feature releases from the ML team with the stability of the production environment.
Scalability: Architect and manage auto-scaling strategies for Kubernetes (GKE) to handle fluctuating workloads during model training and high-volume inference.

MLOps & AI Infrastructure

Model Serving Reliability: Ensure the high availability of Vertex AI endpoints and custom inference services.
GPU/TPU Optimization: Monitor and optimize compute resource utilization (accelerators) to ensure cost-efficient performance for Large Language Models (LLMs).
Pipeline Resilience: Support and stabilize ML pipelines (Vertex AI Pipelines/Kubeflow) to ensure seamless data flow from ingestion to model retraining.

Automation & Orchestration (Eliminating "Toil")

Infrastructure as Code (IaC): Use Terraform or Pulumi to provision and manage consistent, version-controlled cloud environments.
CI/CD & GitOps: Design and optimize robust deployment pipelines for both application code and ML models using GitHub Actions, Cloud Build, or ArgoCD.
Task Automation: Develop custom Python or Go scripts to automate repetitive operational tasks, implement self-healing mechanisms, and handle resource cleanup.

Monitoring, Alerting & Incident Response

Observability: Build and manage comprehensive dashboards using Prometheus, Grafana, or Google Cloud Operations Suite (formerly Stackdriver).
Incident Management: Act as a primary responder in on-call rotations, leading the technical resolution of production outages.
Blameless Post-Mortems: Conduct deep-dive root cause analysis (RCA) to ensure systemic issues are identified and permanently remediated through code.

Requirements

Do you have experience in Virtual Private Clouds?

Orchestration: Expert-level knowledge of Kubernetes (K8s) and Docker.

MLOps Stack: Familiarity with tools such as Kubeflow, Vertex AI, MLflow, or DVC.

Scripting: Strong proficiency in Python (for automation) and Bash; knowledge of Go is a plus.

Data Systems: Experience managing the reliability of data-heavy services (BigQuery, Pub/Sub, or vector databases such as Pinecone or Milvus).

Benefits & conditions

Significant career development opportunities exist as the company grows. The position offers a unique opportunity to be part of a small, fast-growing, challenging and entrepreneurial environment, with a high degree of individual responsibility.
