Senior Site Reliability Engineer

TechChain Talent

San Francisco, United States of America

yesterday

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Compensation

$250K

Job location

San Francisco, United States of America

Tech stack

Kubernetes Security

Artificial Intelligence

Configuration Management

Software Debugging

Distributed Systems

Key Management

Network Segmentation

Reliability Engineering

Prometheus

Azure

AI Infrastructure

Data Logging

Grafana

Hardware Infrastructure

ELK

Job description

We're seeking a Site Reliability Engineer to ensure Hyperbolic's GPU marketplace and AI infrastructure operate with exceptional reliability, performance, and security. You'll define and maintain SLOs, build incident response systems, manage capacity across our distributed GPU network, and implement secure rollout/rollback mechanisms.

Requirements

Experience in site reliability engineering, including working with SLOs and SLAs for production systems
Experience with capacity planning and resource management for distributed systems
Experience with incident response, on-call rotations, and post-mortem processes
Experience with deployment systems (e.g., canary deployments, feature flags, automated rollbacks)
Experience with observability tools (e.g., Prometheus, Grafana, ELK stack, logging, tracing, alerting)
Experience with infrastructure security (e.g., network segmentation, workload isolation, security hardening)
Experience with secrets management and key management systems (KMS)
Experience with compliance frameworks (e.g., SOC 2, ISO 27001)
Experience debugging distributed systems
Experience with infrastructure-as-code, configuration management, and CI/CD pipelines

Bonus Skills

Experience operating GPU infrastructure, AI/ML platforms, or compute marketplaces at scale
Knowledge of multi-tenancy security patterns, container security, and runtime security tools
Experience with chaos engineering, fault injection, and resilience testing
Experience building and operating systems with uptime SLAs of 99.9% or higher