Senior Site Reliability Engineer

TechChain Talent
San Francisco, United States of America
yesterday

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior
Compensation
$250K

Job location

San Francisco, United States of America

Tech stack

Kubernetes Security
Artificial Intelligence
Configuration Management
Software Debugging
Distributed Systems
Key Management
Network Segmentation
Reliability Engineering
Prometheus
Azure
AI Infrastructure
Data Logging
Grafana
Hardware Infrastructure
ELK

Job description

We're seeking a Site Reliability Engineer to ensure Hyperbolic's GPU marketplace and AI infrastructure operate with exceptional reliability, performance, and security. You'll define and maintain SLOs, build incident response systems, manage capacity across our distributed GPU network, and implement secure rollout/rollback mechanisms.

Requirements

  • Experience in site reliability engineering, including working with SLOs and SLAs for production systems

  • Experience with capacity planning and resource management for distributed systems

  • Experience with incident response, on-call rotations, and post-mortem processes

  • Experience with deployment systems (e.g., canary deployments, feature flags, automated rollbacks)

  • Experience with observability tools (e.g., Prometheus, Grafana, ELK stack) for logging, tracing, and alerting

  • Experience with infrastructure security (e.g., network segmentation, workload isolation, security hardening)

  • Experience with secrets management and key management systems (KMS)

  • Experience with compliance frameworks (e.g., SOC 2, ISO 27001)

  • Experience debugging distributed systems

  • Experience with infrastructure-as-code, configuration management, and CI/CD pipelines

Bonus Skills

  • Experience operating GPU infrastructure, AI/ML platforms, or compute marketplaces at scale

  • Knowledge of multi-tenancy security patterns, container security, and runtime security tools

  • Experience with chaos engineering, fault injection, and resilience testing

  • Experience building and operating systems with uptime SLAs of 99.9% or higher
