Senior Site Reliability Engineer
Job description
The Akamai Inference Cloud team is part of Akamai's Cloud Technology Group. We design, implement, deploy, and operate AI platforms that let customers run inference models and developers build AI applications with unmatched performance, compliance, and economics.
Partner with the best
In this role, you'll own reliability workstreams for Akamai's serverless inference platform, build automation and tooling, and contribute to architecture and operational decisions. You'll have opportunities to take end-to-end ownership of critical reliability problems, partner with product engineering teams, and develop expertise in GPU infrastructure, Kubernetes at scale, and AI inference workloads.
As a Senior Site Reliability Engineer, you will be responsible for:
- Building and maintaining observability for AI workloads, including telemetry, dashboards, alerts, and SLO/SLI tracking, and driving improvements when targets are missed
- Writing automation and tooling to reduce operational toil, improve deployment safety, and accelerate incident response
- Integrating AI workloads into Akamai's existing incident management processes, building runbooks, participating in on-call rotations, and conducting blameless post-mortems
- Building and maintaining CI/CD integrations, deployment safety checks, and rollback automation
- Collaborating with product engineering teams to improve reliability, contribute to architecture decisions, and ensure operational readiness for product releases
- Contributing to capacity planning, autoscaling configuration, and workload scheduling for AI compute infrastructure
Requirements
- 5+ years of experience in SRE, infrastructure engineering, or platform engineering, working with large-scale distributed systems
- Extensive experience with Kubernetes and containerization at scale
- Experience defining SLOs and working with observability tools such as Prometheus, Grafana, and distributed tracing
- Coding ability in Python or Go for automation and tooling, with experience in CI/CD pipelines, deployment safety, and infrastructure-as-code
- Interest in or experience with AI/ML infrastructure, model serving, or GPU workloads
- Ability to take ownership of problems and drive them to resolution independently
Work in a way that works for you
Benefits & conditions
Employee stock purchase plan, parental leave, 401(k), health insurance, paid time off, employee assistance program