Site Reliability Engineer II
Job description
In this role, you will work on automation, monitoring, and incident response alongside skilled teammates. Candidates should have expertise in Linux systems, automation, and SRE practices. Day-to-day work involves writing code, improving dashboards, refining alerts, and reducing repetitive tasks. You will also have opportunities to work on GPU infrastructure, Kubernetes, and reliability for AI workloads on Akamai's serverless inference platform.
As a Site Reliability Engineer II, you will be responsible for:
- Building and maintaining dashboards, alerts, and monitoring for inference workloads using Akamai's existing observability platform
- Writing automation and tooling in Python or Go to reduce operational toil and improve system reliability
- Building and improving runbooks for inference-specific operational procedures, and integrating them into Akamai's existing incident management processes
- Contributing to SLO tracking and reporting, identifying trends and areas for improvement
- Supporting CI/CD pipeline maintenance, deployment safety checks, and rollback procedures
- Collaborating with product engineering teams to troubleshoot complex problems across the stack
- Participating in on-call rotations, responding to production incidents, and conducting blameless post-mortems
Requirements
- Have 2+ years of experience in Site Reliability Engineering and a Bachelor's degree or equivalent experience
- Demonstrate coding ability in at least one programming language (Python or Go) with experience writing automation
- Have experience with Linux systems administration and the ability to troubleshoot complex infrastructure issues
- Show familiarity with Kubernetes and containerization concepts
- Have experience with monitoring and observability tools such as Prometheus, Grafana, or similar
- Have exposure to CI/CD pipelines and infrastructure-as-code tools (Terraform, SaltStack, or equivalent)
- Show a willingness to learn and grow, with genuine curiosity about AI infrastructure and distributed systems
Benefits & conditions
Employee stock purchase plan, Parental leave, 401(k), Health insurance, Paid time off, Employee assistance program